The package ‘SSDM’ (Stacked Species Distribution Models) is a computer platform implemented in R that provides a range of methodological approaches and parameterization options at each step of building an SSDM. This vignette presents a typical workflow in R. An additional vignette presents the same workflow using the graphical user interface started with the gui function (see GUI vignette).
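The workflow of the package ‘SSDM’ is based on three modelling levels: individual species distribution models (SDMs) fitted for one species with a single algorithm, ensemble SDMs (ESDMs) that combine several algorithms for one species, and stacked SDMs (SSDMs) that aggregate the models of several species.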
In order to build species distribution models you will need environmental variables. Currently ‘SSDM’ uses all raster formats supported by the R package ‘raster’. The package ‘SSDM’ supports both continuous (e.g., climate maps, digital elevation models, bathymetric maps) and categorical environmental variables (e.g., land cover maps, soil type maps) as inputs. The package also allows normalizing environmental variables, which may be useful to improve the fit of certain algorithms (like artificial neural networks or support vector machines).
Rasters of environmental data need to have the same coordinate reference system while spatial extent and resolution of the environmental layers can differ. During processing, the package will deal with between-variables discrepancies in spatial extent and resolution by rescaling all environmental rasters to the smallest common spatial extent then upscaling them to the coarsest resolution.
‘SSDM’ includes the load_var function to read raster files of your environmental variables. We will work with three 30 arcsec-resolution rasters covering the northern part of ‘Grande Terre’, the main island of New Caledonia. Climatic variables (RAINFALL and TEMPERATURE) are from the WorldClim database, and the SUBSTRATE map is from the IRD Atlas of New Caledonia (2012) (see ?Env).
library(SSDM)
## Welcome to the SSDM package, you can launch the graphical user interface by typing gui() in the console.
## Loading required package: sp
Env <- load_var(system.file('extdata', package = 'SSDM'), categorical = 'SUBSTRATE', verbose = FALSE)
Env
## class : RasterStack
## dimensions : 120, 120, 14400, 3 (nrow, ncol, ncell, nlayers)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 164, 165, -21, -20 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## names : RAINFALL, SUBSTRATE, TEMPERATURE
## min values : 0.4593978, 0.0000000, 0.6610169
## max values : 1, 2, 1
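Note that: categorical variables must be declared with the categorical parameter, and normalization of the variables is controlled with the Norm option.
Species distribution models are built on natural history records indicating where the species occurred in the past.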
‘SSDM’ includes the load_occ function to read raw .csv or .txt files containing your natural history records. We will work with natural history records from five Cryptocarya species native to New Caledonia (see ?Occurrences).
Occ <- load_occ(path = system.file('extdata', package = 'SSDM'), Env,
Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
file = 'Occurrences.csv', sep = ',', verbose = FALSE)
head(Occ)
## SPECIES LONGITUDE LATITUDE
## 1 elliptica 164.1833 -20.28333
## 3 elliptica 164.1999 -20.44999
## 4 elliptica 164.2166 -20.46666
## 5 elliptica 164.5166 -20.39999
## 6 elliptica 164.7333 -20.59999
## 7 elliptica 164.7499 -20.73333
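Records like these are often spatially clustered. The following is a minimal base R sketch of grid-based thinning that keeps one record per species and grid cell. The coordinates are made up, the cell size is assumed to equal the raster resolution, and this is only one simple way to thin records, not necessarily the procedure SSDM applies.

```r
# Toy occurrence table (hypothetical coordinates, not the package data)
occ <- data.frame(
  SPECIES   = c("elliptica", "elliptica", "elliptica", "elliptica"),
  LONGITUDE = c(164.1840, 164.1845, 164.1999, 164.5166),
  LATITUDE  = c(-20.2840, -20.2845, -20.44999, -20.39999)
)

# Assumed cell size: the 30 arcsec resolution of the environmental rasters
cell_size <- 0.008333333

# Index of the grid cell each record falls into
occ$cell_x <- floor(occ$LONGITUDE / cell_size)
occ$cell_y <- floor(occ$LATITUDE / cell_size)

# Keep the first record per species and cell, drop the others
keep <- !duplicated(occ[, c("SPECIES", "cell_x", "cell_y")])
thinned <- occ[keep, c("SPECIES", "LONGITUDE", "LATITUDE")]
print(thinned)
```

The first two records fall into the same cell, so only one of them is retained.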
Note that: load_occ offers the GeoRes option to thin occurrences. Thinning removes unnecessary (clustered) records, reducing the effect of sampling bias while retaining the greatest amount of information. Additional arguments are passed to the read.csv function used to open your raw data.
In the example below we build a distribution model of Cryptocarya elliptica with a subset of the occurrences of the species and a single algorithm, here generalized linear models. The package ‘SSDM’ currently includes nine commonly used algorithms for modelling species distributions: generalized additive models (GAM), generalized linear models (GLM), multivariate adaptive regression splines (MARS), classification tree analysis (CTA), generalized boosted models (GBM), maximum entropy (MAXENT), artificial neural networks (ANN), random forests (RF), and support vector machines (SVM). The default parameters of the R package underlying each algorithm were kept, but most of them can be changed (see section ‘Advanced settings’).
SDM <- modelling('GLM', subset(Occurrences, Occurrences$SPECIES == 'elliptica'),
Env, Xcol = 'LONGITUDE', Ycol = 'LATITUDE', verbose = FALSE)
plot(SDM@projection, main = 'SDM\nfor Cryptocarya elliptica\nwith GLM algorithm')
Note that: the package ‘SSDM’ covers a wide range of methods proposed in the literature. Have a look at all parameters of the modelling function with ?modelling.
In this next example we build an ensemble distribution model for Cryptocarya elliptica that combines CTA- and MARS-based SDMs of this species. Because uncertainty in distribution projections can skew policy making and planning, a common recommendation is to fit a number of alternative model algorithms, explore the range of projections across the different SDMs, and then find a consensus among the SDM projections. Two consensus methods are implemented in the package ‘SSDM’: a simple average of the SDM outputs, and a weighted average based on a user-specified metric or group of metrics.
Additionally, data sampling uncertainties (presences, pseudo-absences) may be addressed by running several replicates for each algorithm (rep argument).
The package also provides an uncertainty map representing the between-algorithms variance. The degree of agreement between each pair of algorithms can be assessed through a correlation matrix giving the Pearson’s coefficient.
ESDM <- ensemble_modelling(c('CTA', 'MARS'), subset(Occurrences, Occurrences$SPECIES == 'elliptica'),
Env, rep = 1, Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
ensemble.thresh = 0, verbose = FALSE)
plot(ESDM@projection, main = 'ESDM\nfor Cryptocarya elliptica\nwith CTA and MARS algorithms')
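To make the consensus and uncertainty ideas concrete, here is a minimal base R sketch on made-up suitability values. It shows a simple average and a score-weighted average across two algorithms, the between-algorithm variance per cell, and the pairwise Pearson correlation. All numbers, including the weights, are hypothetical, and the code illustrates the arithmetic only, not the internals of ensemble_modelling.

```r
# Hypothetical habitat suitability (0-1) predicted by two algorithms for five cells
suitability <- cbind(
  CTA  = c(0.10, 0.40, 0.80, 0.90, 0.30),
  MARS = c(0.20, 0.30, 0.70, 0.95, 0.50)
)

# Hypothetical evaluation scores used as weights (for example AUC)
auc <- c(CTA = 0.75, MARS = 0.85)

# Simple average across algorithms
simple_mean <- rowMeans(suitability)

# Average weighted by the evaluation score
weighted_mean <- as.vector(suitability %*% (auc / sum(auc)))

# Between-algorithm variance per cell, the quantity an uncertainty map displays
uncertainty <- apply(suitability, 1, var)

# Pairwise agreement between algorithms (Pearson's coefficient)
agreement <- cor(suitability, method = "pearson")
print(agreement)
```

Cells where the algorithms disagree most (here the last one) get the highest variance.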
Note that: the package ‘SSDM’ covers a wide range of methods proposed in the literature. Have a look at all parameters of the ensemble_modelling function with ?ensemble_modelling.
Finally, we build a stacked species distribution model using CTA and SVM algorithms and multiple species. The outputs of the different species are aggregated in SSDM maps of local species richness and endemism using the summing continuous habitat suitability maps stacking method (pSSDM).
SSDM <- stack_modelling(c('CTA', 'SVM'), Occurrences, Env, rep = 1, ensemble.thresh = 0,
Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
Spcol = 'SPECIES', method = "pSSDM", verbose = FALSE)
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
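As a rough illustration of what summing continuous habitat suitability maps means, the base R sketch below stacks made-up suitability values for three hypothetical species. For contrast it also sums thresholded (binary) maps, using an assumed cut-off of 0.5. This shows the arithmetic only and is not the code used by stack_modelling.

```r
# Hypothetical habitat suitability for three species (columns) over four cells (rows)
suitability <- cbind(
  sp_a = c(0.9, 0.2, 0.6, 0.1),
  sp_b = c(0.8, 0.7, 0.4, 0.1),
  sp_c = c(0.3, 0.6, 0.5, 0.2)
)

# Richness from continuous maps: sum the suitabilities per cell
richness_continuous <- rowSums(suitability)

# For contrast: threshold first (assumed cut-off of 0.5), then sum the binary maps
threshold <- 0.5
richness_binary <- rowSums(suitability >= threshold)

print(cbind(richness_continuous, richness_binary))
```

The two approaches can differ noticeably where many species have intermediate suitability.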
Five other stacking methods are available (see references in ?stack_modelling for details): bSSDM, Bernoulli, MaximumLikelihood, PRR.MEM and PRR.pSSDM.
A range of model evaluation metrics included in the package ‘SDMTools’ have been integrated in the package ‘SSDM’. They include the area under the receiver operating characteristic (ROC) curve (AUC), Cohen’s Kappa coefficient, the omission rate, the sensitivity (true positive rate) and the specificity (true negative rate). These metrics are all based on the confusion matrix (also called ‘error matrix’), which tabulates the instances in each predicted class against the instances in each actual class. The confusion matrix is computed by converting the habitat suitability maps into binary presence/absence maps. Most of these criteria only test the discrimination capacity of a model (how well it distinguishes presences from absences) and do not tell how well the model performs on unseen data. For this purpose, we also included a calibration metric (Naimi & Araújo), which is an important measure of model transferability over space and time (e.g. for climate change projections).
| | threshold | AUC | omission.rate | sensitivity | specificity | prop.correct | Kappa | calibration |
|---|---|---|---|---|---|---|---|---|
| fp | 0.5024129 | 0.7905556 | 0.1740924 | 0.8333333 | 0.7433333 | 0.8259076 | 0.3583034 | 0.7210005 |
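The confusion-matrix metrics can be reproduced by hand. The base R sketch below uses made-up observations and an assumed threshold of 0.5, and applies the usual textbook definitions; the definitions used inside the package may differ in detail.

```r
# Hypothetical observations (1 = presence, 0 = absence) and predicted suitability
observed    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
suitability <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.3, 0.2, 0.2, 0.1)

# Convert suitability to presence/absence with an assumed threshold
threshold <- 0.5
predicted <- as.integer(suitability >= threshold)

# Cells of the confusion matrix
tp <- sum(predicted == 1 & observed == 1)  # true positives
fn <- sum(predicted == 0 & observed == 1)  # false negatives
fp <- sum(predicted == 1 & observed == 0)  # false positives
tn <- sum(predicted == 0 & observed == 0)  # true negatives
n  <- length(observed)

sensitivity   <- tp / (tp + fn)
specificity   <- tn / (tn + fp)
omission_rate <- fn / (tp + fn)
prop_correct  <- (tp + tn) / n

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_chance    <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
cohen_kappa <- (prop_correct - p_chance) / (1 - p_chance)

print(c(sensitivity = sensitivity, specificity = specificity,
        omission_rate = omission_rate, prop_correct = prop_correct,
        kappa = cohen_kappa))
```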
To assess the accuracy of an SSDM, the package provides the opportunity to compare modeled species assemblages with species pools from independent inventories observed in the field. Six evaluation metrics are computed: (1) the species richness error, i.e. the difference between the predicted and observed species richness; (2) the assemblage prediction success, i.e. the proportion of correct predictions; (3) the assemblage Cohen’s kappa, i.e. the proportion of specific agreement; (4) the assemblage specificity, i.e. the proportion of true negatives (species that are both predicted and observed as being absent); (5) the assemblage sensitivity, i.e. the proportion of true positives (species that are both predicted and observed as present); and (6) the Jaccard index, a widely used metric of community similarity.
| | species.richness.error | prediction.success | kappa | specificity | sensitivity | Jaccard |
|---|---|---|---|---|---|---|
| mean | -0.7857143 | 0.6785714 | 1.0281348 | 0.7253086 | 0.1988095 | 0.1761905 |
| SD | 1.2279807 | 0.2532853 | 0.0237939 | 0.2771717 | 0.2876372 | 0.2766804 |
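For a single site, several of the assemblage metrics reduce to simple counts. The base R sketch below computes them for a made-up pool of five hypothetical species; it follows the verbal definitions given above and is not taken from the package code.

```r
# Hypothetical species pool at one site: observed vs predicted presence
species   <- c("sp_a", "sp_b", "sp_c", "sp_d", "sp_e")
observed  <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
predicted <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

tp <- sum(predicted & observed)    # predicted and observed present
tn <- sum(!predicted & !observed)  # predicted and observed absent
fp <- sum(predicted & !observed)   # predicted present, observed absent
fn <- sum(!predicted & observed)   # predicted absent, observed present

richness_error     <- sum(predicted) - sum(observed)
prediction_success <- (tp + tn) / length(species)
sensitivity        <- tp / (tp + fn)
specificity        <- tn / (tn + fp)
jaccard            <- tp / (tp + fp + fn)

print(c(richness_error = richness_error, prediction_success = prediction_success,
        sensitivity = sensitivity, specificity = specificity, jaccard = jaccard))
```

Note that the richness error is zero here even though two species are wrongly predicted, which is why the composition-based metrics are reported alongside it.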
The package ‘SSDM’ provides two methods to measure the relative contribution of environmental variables, which quantifies how relevant each environmental variable is in determining the species distribution. The first method is based on a jackknife approach that evaluates the change in accuracy between a full model and models in which each environmental variable is omitted in turn. All metrics available in the package can serve to assess the change in accuracy. The second method is based on Pearson’s correlation coefficient between the predictions of a full model and those of models in which each environmental variable is omitted in turn.
| | RAINFALL | SUBSTRATE | TEMPERATURE |
|---|---|---|---|
| Mean | 47.10145 | 6.172527 | 46.72603 |
| SD | 31.63197 | 6.368049 | 28.42449 |
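The correlation-based approach can be sketched with a plain glm on simulated data: refit the model without each variable in turn and measure how far its predictions drift from those of the full model. The data, the variable names and the use of 1 minus the correlation as the importance score are assumptions made for illustration; the package may compute the score differently.

```r
set.seed(42)

# Simulated data: presence depends strongly on rainfall, weakly on temperature
n <- 500
env <- data.frame(rainfall = runif(n), temperature = runif(n))
presence <- rbinom(n, 1, plogis(-3 + 6 * env$rainfall + 0.5 * env$temperature))
dat <- cbind(env, presence = presence)

# Full model and its predictions
full <- glm(presence ~ rainfall + temperature, family = binomial, data = dat)
pred_full <- predict(full, type = "response")

# Refit without each variable in turn and compare predictions with the full model
importance <- sapply(names(env), function(v) {
  reduced <- update(full, as.formula(paste(". ~ . -", v)))
  1 - cor(pred_full, predict(reduced, type = "response"))
})

# Express the scores as percentages that sum to 100
print(round(100 * importance / sum(importance), 1))
```

Dropping the variable that drives the response changes the predictions most, so it receives the largest share.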
In addition to species richness, endemism is an important feature for conservation as it refers to species being unique to the defined geographic location. Species endemism maps can be computed using two metrics: the weighted endemism index (WEI) and the corrected weighted endemism index (CWEI).
To investigate the impacts of changing environmental conditions on species distributions and assemblages, it is often of interest to project an existing model onto environmental rasters other than those used for model building. In ‘SSDM’ this can be done with the project function, which takes any SDM (Algorithm.SDM), ESDM (Ensemble.SDM) or SSDM (Stacked.SDM) object produced with modelling, ensemble_modelling or stack_modelling. The following example shows the projection of our previously built SDM object to an environment with reduced rainfall and increased temperature.
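The idea of projecting a fitted model onto a changed environment can be sketched without the package. The base R analog below fits a glm on simulated data and predicts under shifted predictors. It does not call project; the data, the coefficients and the size of the shifts (0.3 less rainfall, 0.2 more temperature on a normalized scale) are all assumptions.

```r
set.seed(1)

# Simulated training data with normalized predictors (hypothetical values)
n <- 300
train <- data.frame(rainfall = runif(n), temperature = runif(n))
train$presence <- rbinom(n, 1, plogis(-2 + 5 * train$rainfall - 2 * train$temperature))

# Fit once under current conditions
fit <- glm(presence ~ rainfall + temperature, family = binomial, data = train)

# New environment: reduced rainfall and increased temperature (assumed shifts)
future <- transform(train[, c("rainfall", "temperature")],
                    rainfall = rainfall - 0.3,
                    temperature = temperature + 0.2)

# Project the same fitted model to both environments
suitability_now    <- predict(fit, newdata = train, type = "response")
suitability_future <- predict(fit, newdata = future, type = "response")

# Change in mean suitability between the two environments
print(mean(suitability_future) - mean(suitability_now))
```

The model is fitted only once; only the predictor values change between the two predictions.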
By default, the function takes the supplied SDM object and returns it with an updated @projection slot and, in the case of ESDMs and SSDMs, also updates the @uncertainty, @diversity.map and @endemism.map slots. If the original model should be kept, the resulting maps can instead be returned as a list by setting update.projections=FALSE. Further, for ESDMs or SSDMs the projections of the individual models they are composed of can be saved by setting SDM.projections=TRUE.
SSDM includes a collection of different modelling methods, each of which may be seen as a science in itself. For easier use, all algorithms are set up with default settings, which may or may not be sufficient for your specific application. For finer tuning, custom settings can be passed to the original functions through algo.args lists (‘algo’ being the lowercase name of the respective algorithm). For example, if you want to set the singular.ok argument in the glm function, use glm.args=list(singular.ok=FALSE). Disclaimer: when using custom argument settings to build models, please remember to also supply these argument lists in any subsequent steps (e.g. when calling project afterwards to project the model into a different environment). This is because - for reasons of efficiency and harmonization - SSDM objects only store the information necessary to rebuild the models when needed instead of storing the original models.
If you need to access the original models for any reason, you may do so by using the internal function SSDM:::get_model(model) (mind the triple colon).
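One way to picture how such an argument list reaches the underlying function, and why it has to be supplied again whenever the model is rebuilt, is the base R pattern below. The wrapper fit_glm and the simulated data are hypothetical and only illustrate forwarding a list with do.call; this is not the package's actual mechanism.

```r
set.seed(7)

# Simulated presence/absence data with one predictor (hypothetical)
dat <- data.frame(x = runif(100))
dat$presence <- rbinom(100, 1, plogis(-1 + 2 * dat$x))

# Custom settings, mirroring the glm.args example from the text
glm.args <- list(singular.ok = FALSE)

# Hypothetical wrapper: fixed arguments plus whatever the user supplied
fit_glm <- function(data, args = list()) {
  base_args <- list(formula = presence ~ x, family = binomial, data = data)
  do.call(glm, c(base_args, args))
}

fit <- fit_glm(dat, glm.args)

# A later rebuild reproduces the model only if the same list is passed again
refit <- fit_glm(dat, glm.args)
print(all.equal(coef(fit), coef(refit)))
```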
By their very nature, ensemble and stacked models can become large and take a long time to compute. Depending on your hardware, this may exhaust your available working memory, since R by default stores objects in RAM.
To deal with these issues, we included several options to reduce memory load and optimize speed at the same time. Firstly, the number of cores that should be used for computing in parallel (using the parallel and foreach packages) can be set through the ncores arguments in ensemble_modelling, stack_modelling and project.
By default, R objects are exported in full to each worker, which increases memory requirements and reduces the speed gain. To limit this, we use the itertools package to send only the necessary chunks to each worker. With the minimal.memory argument, only one model is sent to each worker at a time, which reduces memory load further.
Finally, for the lowest memory use you may set the tmp argument to TRUE to save temporary files (raster projections) to the raster temporary folder. Please be aware, however, that these files may take up a lot of disk space. Alternatively, you may pass a path to tmp to specify where temporary files should be stored. Please keep in mind that the temporary folders might be cleared when closing R, so you may want to write the projections to disk before closing your session.