Robert Brian O’Hara, Department of Mathematical Sciences, NTNU

I am, in part, an ecologist. This means that I, like my ecologist colleagues, am interested in the natural world around us. We are all worried about the effects that humans are having. Humans have been depleting nature for thousands of years by doing things like hunting large animals, clearing forests and other land for agriculture, and generally messing the place up.
Of course, ecologists are not the only people interested in the natural world. A lot of other people are too, and have been for a long time. They have been collecting animals and plants, recording what they have seen, and (fortunately not so much now) shooting bits of nature and stuffing it to put on display. This has meant there is a lot of information about the world around us that has been gathered by non-professionals. We have now given them a name: citizen scientists.
Whilst people have been collecting data, they have also noticed that we have been destroying the natural world. This has become more pressing in the last few decades, as we have come to understand the combined effects of us using land for agriculture and housing, generally destroying it by mining underneath it, and changing the climate (largely by burning all the stuff we have mined from underneath the land we have been messing up).
We know at the general level what human impacts are, and we can study specific cases in detail (for example, there was a cottage industry a few years ago looking at how many birds were killed by wind farms). But if we want to really understand the scope of the problem, and if we want to find effective remedies, we need to know what is really out there. We need to map nature.
Now, there are lots of ways to view this, and to decide how much detail you want in your map. One approach is to map where each species can be found: its distribution. This can be done by getting data on where the species has been seen, and modelling that as a function of covariates, like land use and climate (usually something like average annual temperature and rainfall). Where do we get the data? Well, this is where citizen scientists come in.
There are professional schemes to map species, but these tend to be small, because professionals are not cheap. But there are also a lot of other people collecting data. Some of this comes from organised schemes where citizen scientists collect data, like breeding bird surveys, where birders are told to go to specific places on specific dates and record what they see and hear in a set time. But there are also incidental observations: people will write down a list of what birds they saw when they went for a walk last week, or they will take photographs of plants they have seen. Thanks to the internet, it is easy to collect this data, and there are several services (e.g. eBird and iNaturalist) which help people organise their observations and upload their photographs. This helps citizen scientists (e.g. other people can help with identifying species from the photographs), and the data can also help with mapping the distributions, because there are a lot of people collecting this data over wide areas.
So, we have a lot of data, but how do we use it? The most common approach is to treat all the data as the same, and throw it all into a regression model. But the data are not all the same. Some data are counts of how many individuals were seen, some are records of whether a species was seen or not, and some are simply records of where a species was seen, with no information on where it (or other species) was not seen. On top of this, there are differences in where the data were collected: professional surveys try to get good coverage of the areas of interest, whilst citizen scientists will tend to go to areas that are interesting and easily accessible. Ignoring this can lead to strange results, like concluding that major hotspots of biodiversity are near cities.
Here in Trondheim we are one of several groups that have been taking a different approach. We have tried to respect the data, which means developing a model for each data set and how it was collected. This has led to the idea of data integration. For each data set we have a model for how it was collected, and we combine these all together. This builds on a long history of statistical modelling, and is illustrated below in a figure that builds on a long history of Western Art.

For each data set we have a model, e.g. we might assume that the number of birds seen follows a Poisson distribution with a mean proportional to the time spent looking for them. Each data set probably covers several sites, and has its own properties, so a specific model can be developed for each data set.
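To make the bird example concrete, here is a minimal sketch of that Poisson model in Python. The counts and hours are made up for illustration, and the function names are my own, not part of any package. It assumes the simplest possible case: a single rate of birds per hour, shared by every visit.

```python
import math

def poisson_log_likelihood(counts, hours, rate_per_hour):
    """Log-likelihood when each count is Poisson with mean rate_per_hour * hours."""
    total = 0.0
    for y, t in zip(counts, hours):
        mean = rate_per_hour * t  # mean is proportional to time spent looking
        total += y * math.log(mean) - mean - math.lgamma(y + 1)
    return total

def estimate_rate(counts, hours):
    """Maximum-likelihood estimate of the rate: total birds over total hours."""
    return sum(counts) / sum(hours)

# Made-up data: birds counted on four visits, and hours spent on each visit.
counts = [3, 7, 2, 10]
hours = [1.0, 2.0, 0.5, 3.0]

rate_hat = estimate_rate(counts, hours)
loglik_at_estimate = poisson_log_likelihood(counts, hours, rate_hat)
print(f"estimated birds per hour: {rate_hat:.2f}")
```

A real data set would need more than this (the rate would vary between sites, for a start), but the structure is the same: a distribution for the observations, with a mean that accounts for how the data were collected.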
The models for each data set give us the likelihood for that data (i.e. the probability of the data given the models and parameters). This has to include information about the actual distribution, i.e. an estimate of where the species actually is. For each species this has to be the same across every data set: after all, each species can only have one distribution. If we want to understand this distribution, we need to model it too, to find out if the species seems to be affected by things like temperature, rainfall or habitat. This is the process model: it is where the ecology happens. If we have enough data over time, we can model the distribution as dynamic.
There are many ways to go from this scheme to specific models. There has been a convergence on using point processes. We treat each individual as a point in continuous space, so the distribution of the points is the distribution of the species. We then model this as an intensity, which can be linear on the log scale (linearity makes everything easier). So our process model for the log intensity is linear, and this then appears in every observation model.
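As a rough sketch of how one log-linear intensity can appear in two different observation models, consider the toy Python example below. Everything in it is an assumption made for illustration: the data are invented, temperature is the only covariate, each site is treated as a unit of area, and the function names are hypothetical. It is not the code from our packages, which work with continuous space and rather more machinery.

```python
import math

def log_intensity(alpha, beta, temperature):
    """Process model: the log intensity is linear in the covariate."""
    return alpha + beta * temperature

def count_loglik(alpha, beta, sites):
    """Counts: Poisson with mean = hours of effort * intensity."""
    total = 0.0
    for y, hours, temp in sites:
        mean = hours * math.exp(log_intensity(alpha, beta, temp))
        total += y * math.log(mean) - mean - math.lgamma(y + 1)
    return total

def presence_absence_loglik(alpha, beta, sites):
    """Seen or not: P(seen) = 1 - exp(-intensity), i.e. at least one point."""
    total = 0.0
    for seen, temp in sites:
        lam = math.exp(log_intensity(alpha, beta, temp))
        if seen:
            total += math.log(-math.expm1(-lam))  # log(1 - exp(-lam))
        else:
            total += -lam  # log P(no points) under a Poisson
    return total

def joint_loglik(alpha, beta, count_sites, pa_sites):
    """Data integration: the same alpha and beta feed both likelihoods."""
    return (count_loglik(alpha, beta, count_sites)
            + presence_absence_loglik(alpha, beta, pa_sites))

# Made-up data: (count, hours, temperature) and (seen, temperature).
count_sites = [(4, 1.0, 5.0), (9, 2.0, 8.0), (1, 0.5, 3.0)]
pa_sites = [(1, 7.0), (0, 2.0), (1, 9.0)]

value = joint_loglik(0.0, 0.2, count_sites, pa_sites)
print(f"joint log-likelihood: {value:.3f}")
```

The point is in joint_loglik: the two data sets are different kinds of observation, so they get different likelihoods, but both are driven by the same intensity. Adding the log-likelihoods assumes the data sets are independent given that intensity.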
The upshot of all of this is that we can have one big model, but it is made up of several small models, each of which is easier to work with. We can build each model, and put them together, by treating them as graphical models: the full model, with all of its connections, can be written as a graph.
It is one thing to write a blog post about these amazing ideas; it is another to put them into practice. This requires a lot of coding and messing with details. We have been trying to make this easier, by developing packages to import and format the data and fit the models. We are getting close to being able to automate a lot of this work: we currently have a project trying to model hundreds of Norwegian species in this way.
