Chapter 6 Model validation
In this section we validate the selected model in order to assess its predictive performance. Validation is conducted via k-fold cross-validation: the data set is divided into k equally sized, randomly assigned groups (Hijmans 2012) and, iteratively for each fold, the model is fitted to the observations outside that fold and evaluated on the observations held within it.
First, we define the list of required packages.
requiredPackages <- c("here", "rstudioapi",
"stringr", "RColorBrewer", "ggplot2",
"dplyr", "tidyverse", "R.utils", "ggpubr",
"hrbrthemes", "fields", "maps", "raster",
"scam", "plotmo", "pkgbuild", "dismo",
"SDMTools")
We then run a function that installs any required packages missing from our system and loads all of them.
install_load_function <- function(pkg) {
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg))
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}
install_load_function(requiredPackages)
## here rstudioapi stringr RColorBrewer ggplot2 dplyr
## TRUE TRUE TRUE TRUE TRUE TRUE
## tidyverse R.utils ggpubr hrbrthemes fields maps
## TRUE TRUE TRUE TRUE TRUE TRUE
## raster scam plotmo pkgbuild dismo SDMTools
## TRUE TRUE TRUE TRUE TRUE TRUE
Note: SDMTools has been archived on CRAN, so if it fails to load you will first have to install Rtools manually and then install the package from the CRAN archive with the code below:
# find_rtools()
# install.packages('remotes')
# remotes::install_version('SDMTools',
# version = '1.1-221.2')
library(SDMTools)
We define some overall settings.
# General settings for ggplot
# (black-white background, larger
# base_size)
theme_set(theme_bw(base_size = 16))
We load output from the selected model saved in the previous step.
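The loading chunk itself is not echoed here; a minimal sketch, assuming the selected model was stored in the previous step as an .rds file under a hypothetical path built with here():
# Hypothetical file name and location; adjust to
# where the selected model was actually saved
selected_model <- readRDS(here("outputs", "selected_model.rds"))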
6.1 Optimum threshold
We generate a data frame with the data used in the selected model and add the predicted values to it.
# PAdata_enviroment used in the
# selected model
data <- selected_model$model
# Predict
scgam.pred <- predict(selected_model, newdata = data,
type = "response")
# Add the prediction to the data object
data$scgam.pred <- as.vector(scgam.pred)
head(data)
## occurrenceStatus BO_sstmean BO2_chlomean_ss BO2_salinitymean_ss scgam.pred
## 1 1 16.99439 1.20410309 35.42666 0.9814526
## 5 1 16.99439 1.20410309 35.42666 0.9814526
## 8 1 25.52963 0.08455528 36.43070 0.4357622
## 10 1 16.99439 1.20410309 35.42666 0.9814526
## 11 1 15.21486 0.72283228 34.20426 0.8506453
## 12 1 16.99439 1.20410309 35.42666 0.9814526
The threshold for presence-absence classification for each species is obtained as the value that maximizes the sum of sensitivity and specificity (Jiménez-Valverde and Lobo 2007). If the result were a range (instead of a single value), we would select the mean value of that range.
# Optimizing the threshold probability
obs <- data$occurrenceStatus
predSCGAM_P <- data$scgam.pred
# Threshold optimizing
myoptim <- optim.thresh(obs, predSCGAM_P)
myoptim
## $min.occurence.prediction
## [1] 0.01055724
##
## $mean.occurence.prediction
## [1] 0.671467
##
## $`10.percent.omission`
## [1] 0.37
##
## $`sensitivity=specificity`
## [1] 0.45
##
## $`max.sensitivity+specificity`
## [1] 0.43
##
## $maxKappa
## [1] 0.43
##
## $max.prop.correct
## [1] 0.43
##
## $min.ROC.plot.distance
## [1] 0.43
# Select the threshold that maximizes
# the sum of sensitivity and
# specificity
myThreshold <- as.numeric(myoptim[["max.sensitivity+specificity"]])
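If optim.thresh had returned a range of candidate thresholds rather than a single value, we would take the mean of that range, as mentioned above; a minimal sketch that covers both cases:
# Works for a single value as well as for a range of values
myThreshold <- mean(as.numeric(myoptim[["max.sensitivity+specificity"]]))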
Accuracy indicators, such as the AUC (area under the receiver operating characteristic, ROC, curve), sensitivity (proportion of observed presences correctly predicted) and specificity (proportion of observed absences correctly predicted), are first computed for all observations.
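The chunk producing the table below is not echoed in the source; presumably it is a call to accuracy() from SDMTools on all observations, along the lines of:
# Accuracy indicators for all observations
accuracy(obs, predSCGAM_P, threshold = myThreshold)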
## threshold AUC omission.rate sensitivity specificity prop.correct
## 1 0.43 0.7409775 0.1978796 0.8021204 0.6798347 0.7412764
## Kappa
## 1 0.4822374
# Create confusion matrix with all
# observations
confusion.matrix(obs, predSCGAM_P, threshold = myThreshold)
## obs
## pred 0 1
## 0 10033 2949
## 1 4725 11954
## attr(,"class")
## [1] "confusion.matrix"
6.2 k-fold validation
In this case we use a 5-fold cross-validation. For each of the 5 folds, the model is refitted with the observations in the remaining four folds (80% of the data) and the accuracy indicators are then computed on the held-out fold (the remaining 20%). The indicators are finally averaged across folds. The fold assignments used in the loop below must be defined beforehand, as sketched next.
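The objects k (number of folds) and groups (the fold each observation belongs to) are not defined in the chunk below; a minimal sketch, assuming the random fold assignment is done with kfold() from dismo (Hijmans 2012) and an arbitrary seed:
# Assumed fold setup: 5 folds assigned at random
set.seed(123)
k <- 5
groups <- kfold(data, k = k)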
# Initialise the confusion matrix and
# the accuracy table:
myCM <- NULL
myACC <- NULL
# Get the formula of the selected model
formula <- summary(selected_model)[["formula"]]
# Get the smoothing parameters of the
# selected model
sp <- selected_model$sp
# Loop for each group k
for (j in 1:k) {
# Preparation of Training Sites
p_Training <- data[groups != j, ]
# Model fit
selected_model.sp.j <- scam(formula,
family = binomial(link = "logit"),
data = p_Training, sp = c(sp))
# Predict Model
p_validacion <- data[groups == j, ]
selected_model.sp.j.pred <- predict(selected_model.sp.j,
newdata = p_validacion, type = "response")
p_validacion$Pred <- selected_model.sp.j.pred
# Confusion matrix and accuracy
# table for fold j
obs <- p_validacion$occurrenceStatus
predSCGAM <- p_validacion$Pred
myCM <- rbind(myCM, as.numeric(confusion.matrix(obs,
predSCGAM, threshold = myThreshold)))
myACC <- rbind(myACC, accuracy(obs, predSCGAM,
threshold = myThreshold))
}
# Mean values across k-folds
validation_summary <- cbind(Threshold = myThreshold,
mean_AUC = mean(myACC$AUC), mean_Omision = mean(myACC$omission.rate),
mean_sensitivity = mean(myACC$sensitivity),
mean_specificity = mean(myACC$specificity),
mean_Prop.Corr = mean(myACC$prop.correct))
validation_summary
## Threshold mean_AUC mean_Omision mean_sensitivity mean_specificity
## [1,] 0.43 0.7396357 0.1912325 0.8087675 0.6705039
## mean_Prop.Corr
## [1,] 0.7398605
We save the validation summary object.
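A minimal sketch of the save step, assuming a hypothetical output path built with here():
# Hypothetical file name and location; adjust to
# the project structure used in earlier chapters
saveRDS(validation_summary, file = here("outputs", "validation_summary.rds"))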