 Research
 Open access
 Published:
Comparison of new computational methods for spatial modelling of malaria
Malaria Journal volume 22, Article number: 356 (2023)
Abstract
Background
Geostatistical analysis of health data is increasingly used to model spatial variation in malaria prevalence, burden, and other metrics. Traditional inference methods for geostatistical modelling are notoriously computationally intensive, motivating the development of newer, approximate methods for geostatistical analysis or, more broadly, computational modelling of spatial processes. The appeal of faster methods is particularly great as the size of the region and number of spatial locations being modelled increases.
Methods
This work presents an applied comparison of four proposed ‘fast’ computational methods for spatial modelling and the software provided to implement them—Integrated Nested Laplace Approximation (INLA), tree boosting with Gaussian processes and mixed effect models (GPBoost), Fixed Rank Kriging (FRK) and Spatial Random Forests (SpRF). The four methods are illustrated by estimating malaria prevalence on two different spatial scales—country and continent. The performance of the four methods is compared on these data in terms of accuracy, computation time, and ease of implementation.
Results
Two of these methods—SpRF and GPBoost—do not scale well as the data size increases, and so are likely to be infeasible for largerscale analysis problems. The two remaining methods—INLA and FRK—do scale well computationally, however the resulting model fits are very sensitive to the user’s modelling assumptions and parameter choices. The binomial observation distribution commonly used for disease prevalence mapping with INLA fails to account for smallscale overdispersion present in the malaria prevalence data, which can lead to poor predictions. Selection of an appropriate alternative such as the Betabinomial distribution is required to produce a reliable model fit. The smallscale random effect term in FRK overcomes this pitfall, but FRK model estimates are very reliant on providing a sufficient number and appropriate configuration of basis functions. Unfortunately the computation time for FRK increases rapidly with increasing basis resolution.
Conclusions
INLA and FRK both enable scalable geostatistical modelling of malaria prevalence data. However care must be taken when using both methods to assess the fit of the model to data and plausibility of predictions, in order to select appropriate model assumptions and parameters.
Background
Spatial proximity often plays an important role in governing the spread of geographic processes. Geostatistical techniques directly model these effects of proximity and are used to create continuous predictions from a finite set of observations. Since their original development for use in the mining sector [1], these techniques are now applied to wide ranging problems where spatial correlation must be accounted for, including species distribution modelling in ecology [2], interpolating weather and climate data [3, 4], mapping soil properties for agriculture [5], and spatial modelling of real estate prices [6]. Mapping disease risk is an important application in epidemiology, where geostatistical techniques are used to predict risks for a wide range of diseases with differing biology [7,8,9,10]. These methods are particularly prominent in malaria work, where risk maps inform policy and elimination strategies [11, 12]. Predictive maps created using geostatistical models have been published for malaria prevalence [13,14,15,16], mortality [17], use of malaria interventions [18], antimalarial drug resistance [19, 20], risks of adverse effects [21], and the relationship between sickle cell anaemia and P. falciparum [22]. The mapped metrics depend on various spatial processes including environmental factors (e.g., rainfall and temperature), variable access to health care, and human movement. In the absence of a full understanding of all these processes, spatial statistical modelling aims to describe the spatial variation in the metric of interest that is caused by the underlying spatial processes.
In their book, Diggle and Ribeiro Jr. introduce a fundamental paradigm for modelling geospatial data that unites previous spatial modelling approaches with modelbased statistical analysis [1]. The quantity of interest or the response, \({y}_{i}\), is defined throughout a contiguous study region and each measurement at the sample location \({x}_{i}\) is a realization of the random variable \({Y}_{i}\) whose distribution is dependent on the location \({x}_{i}\) as well as the random variables associated with the other data locations. That is, the random variables in space are dependent on each other based on their proximity. Hence, the observed responses at \(n\) locations are modelled as a joint \(n\)dimensional vector of random variables where the dependency can be modelled using spatial random effects as part of a generalized linear geostatistical model. The spatiallycorrelated random variables are modelled as a Gaussian process (GP) and spatial covariates, such as bioclimatic and environmental layers, are often included as additional regressors to capture general trends.
Gaussian processes are widely used in spatiotemporal modelling including in malaria prevalence mapping research [23,24,25]. With the explosion of machine learning research, the popularity of GPs has remarkably increased in both theoretical and applied domains [26]. Rasmussen and Nickisch made available a toolbox called GPML for machine learning regression and classification tasks [27]. GPs for large scale regression [28] and GPs for sparse approximations [29] are examples of the use of GPs in machine learning. These new advances have made GPs a viable tool for modelling of very large datasets beyond the field of malaria mapping [30].
These modelling approaches vary in their inference procedures (e.g. Bayesian or frequentist) and computational techniques (e.g. simulation versus optimization). One thing these newer methods have in common is that they have the potential to avoid calculations using a ‘full’ (approximationfree) GP, due to the fact that full GP models scale cubicly with the number of unique locations in the training data. That is, a tenfold increase in the number of unique spatial locations in the dataset results in a 1000fold increase in computation time. Consequently, the full GP can become computationally infeasible for large datasets, such as those used in national and continentalscale malaria mapping. For such large datasets, it may also be prohibitive to fit the model using asymptotically exact Bayesian methods such as Markov chain Monte Carlo (MCMC) methods, so deterministic approximations to such simulation approaches have also been explored. Newer approximation methods to both the spatial random effect and the inference method are often used to combat this limitation for example when using globalscale datasets [31, 32].
There are a multitude of approximation techniques available as alternatives to full Bayesian/frequentist inference, and the full Gaussian process, in addition to modern machine learning methods which lie outside of the standard geostatistical framework while still enabling the fitting of spatiallyexplicit models as used in disease mapping. Geostatistical techniques using Gaussian processes to model spatial autocorrelation are currently the most popular method for malaria risk mapping [33],with full Gaussian processes with both maximum likelihood frameworks for inference [14, 21] and Bayesian inference using MCMC [34, 35] appearing frequently in the literature. The Integrated Nested Laplace Approximation has become a well established method in the field [15, 36, 37], while recent modelling techniques outside of the popular geostatistical framework such as boosted regression trees and random forests have found some limited use [38, 39].
A review of all such alternative techniques is beyond the scope of this paper. Due to the importance of spatial modelling in the malaria mapping field and consequent need for computationally efficient methods however, this work presents a comparison of four such methods on a malaria prevalence mapping problem: Integrated Nested Laplace Approximation inference, with a Gaussian Markov Random Field approximation to the GP (INLA), Gaussian processes fitted via a boosting algorithm (GPBoost), Spatial Random Forests (SpRF), and Fixed Rank Kriging (FRK). These four methods are selected due to their different underlying techniques for modelling the spatial correlation structure and varying approaches to inference. Of these four methods, only INLA has been applied to malaria risk mapping, while the other three methods have found use in spatial modelling applications in various other fields (see, for example [40,41,42]).
The intended audience of this comparison are twofold. This work may be of interest to researchers interested in moving into the spatial (or spatiotemporal) mapping field who are looking for an introduction to currently available methods, and it may additionally inform malaria mapping researchers on the relative strengths and weaknesses of the considered methods when applied to a malaria mapping problem, assisting in decision making for future work. It should be noted that the inclusion of a method in the analysis does not constitute an endorsement of its use for malaria risk mapping. Rather, the purpose of this work is to compare each of these method's suitability, strengths, and weaknesses, when applied to malaria risk mapping problems, and in particular to examine the viability of the newer and previously unused methods.
An analysis at a national scale is first presented, with Kenya selected as the country of interest. This is then extended to a continentalscale analysis over Africa. The four methods are briefly introduced in “Methods” section, where the models implemented using each method are additionally specified. Due to their underlying mathematical differences in model specification and inference procedure, it is difficult to directly compare results. This problem is mitigated by comparing point and interval predictions against observed data in a crossvalidation scheme. In addition, predictive spatial maps produced by each of the implemented models are explored. National and continent scale results are discussed in “Case study: Kenya” and “Continent scale results” section, and the computation time taken by each of the methods at each scale is compared in “Computational results” section. As concluding remarks, nuances of the methods uncovered by the analysis are briefly discussed in “Discussion” section. The programming scripts for this work are available at [43].
Methods
Diggle and Ribeiro Jr. first introduce a basic geostatistical model that does not have any covariates [1]. They consider data given by \(\left(\mathbf{x}_{i},{y}_{i}\right)\) for \(i\in \{1,\dots ,n\}\), where \(\mathbf{x}_{i}\) denotes the spatial location (i.e. coordinates) and \({y}_{i}\) is the measured value for the quantity of interest at that location (e.g. the incidence of malaria at \(\mathbf{x}_{i}\)). They describe a model for normallydistributed response data with a stationary Gaussian process (one that tends back to the same average value, over the whole analysis region) as:
where \(S\left(\mathbf{x}\right)\) is a Gaussian process with mean \(\boldsymbol{\upmu}\) (the average value over the study region), variance or amplitude of the process at each location \({\boldsymbol{\upsigma} }^{2}={\text{var}}\{S\left(\mathbf{x}\right)\}\) and correlation function \(\boldsymbol{\uprho} \left(u\right)={\text{cor}}\left\{S\left(\mathbf{x}\right),S\left(\mathbf{x}^{\prime}\right)\right\},\) where \(u= \mathbf{x}\mathbf{x}^{\prime} \) and \(\cdot \) denotes Euclidean distance (which controls the similarity of responses based on their distances apart); and \({y}_{i}\) are realizations of mutually independent Gaussian random variables \({Y}_{i}\) conditional on \(\left\{S\left(\mathbf{x}\right):\mathbf{x}\in {R}^{2}\right\}\) (i.e. after accounting for the spatial correlation, each \({y}_{i}\) is independent and normallydistributed).
The model can be described by the equation:
This basic model represents only the effects of proximity on observations at different locations, and can be extended through the addition of a mean function to model the effects of covariates on the response. Common choices of correlation functions (termed covariance functions when they incorporate the variance term \({\boldsymbol{\upsigma} }^{2}\)) include Matérn, exponential and squared exponential functions.
INLA, GPBoost, and FRK provide approximation and inference tools for geostatistical models which extend the basic model in Eq. (1) by [1]. In contrast, SpRF avoids the use of a Gaussian process entirely, instead using a machine learning approach to model the impacts of spatial proximity on the response. Despite their varying approaches, all four methods allow for prediction and mapping of spatial processes, which is the focus in this paper.
The following implementations of the four methods are used:

1.
INLA: Integrated Nested Laplace Approximations, implemented in the R package INLA [44, 45].

2.
GPBoost: Tree boosting with Gaussian processes and mixed effect models, implemented in the R package gpboost [46].

3.
SpRF: Spatial Random Forests, implemented in the R package ranger [47].

4.
FRK: Fixed Rank Kriging, implemented in the R package FRK [48].
As the nationalscale dataset, P. falciparum prevalence data in Kenya from 2009 is used, retrieved from the openaccess portion of the Malaria Atlas Project malaria prevalence dataset. Kenya was selected as it had the highest number of surveys overall, with the most surveys occurring in 2009. In expanding to a continental scale, available surveys across Africa in 2009 are used, keeping the same year between analyses. The R package malariaAtlas [32] was used in order to download the malaria prevalence survey data. As the aim of this work is to evaluate the performance of statistical models for malaria mapping rather than to produce reliable maps per say, additional validation, correction, or selection on these datapoints were not applied. For each record, only the spatial coordinates, the numbers of individuals screened, and the number of those individuals that were positive for P. falciparum were extracted.
The models constructed in this paper represent the simplest possible implementations of each of the four methods, where factors such as environmental covariates are not included. Indeed, due to the strong role environmental factors play in the malaria parasite's and vectors' life cycles [49], an indepth selection and analysis of covariates is a key step in creating useful maps of malaria [33]. In skipping this step, the goal of this work is to highlight the differences between the methods at their most basic level, and specifically their differing treatments of spatial autocorrelation. Hence, the risk maps presented in this work should be considered only for the purpose of comparison with one another, and should not be interpreted as realistic maps of malaria prevalence.
INLA
INLA (Integrated Nested Laplace Approximations) is a method for approximate Bayesian inference which offers an improvement in speed over asymptotically exact methods such as MCMC. Instead of estimating a highdimensional joint posterior distribution by simulation, INLA obtains approximations to univariate posterior marginal distributions of the model parameters. INLA is restricted to the class of models that can be expressed as latent Gaussian Markov random fields. However, a multitude of commonly used models can be expressed in this form, including generalized linear geostatistical models. This approach to inference pairs well with an approximation to the spatial Gaussian process as a Gaussian Markov random field (GMRF) over a discrete ‘mesh’ describing the study area, with piecewise linear interpolation to any locations that fall between nodes of this ‘mesh’. When the Gaussian process has a covariance function of the Matérn type, the stochastic partial differential equation (SPDE) representation of the GMRF can be used, which makes evaluation of the spatial process very fast for large spatial datasets, compared with the full GP approach. Over the years there have been many updates to INLA [50] to broaden its scope and facilitate diverse problem solving tasks. For more details, refer to their website [51].
Inference with INLA combines a series of assumptions and Laplace approximations to compute the marginal posteriors of model parameters and latent effects. INLA assumes that the response vector \(\mathbf{y}\) depends on a vector of latent variables \({\boldsymbol{\upeta}} ,\) and hyperparameters \({\boldsymbol{\uptheta} }_{1}\), with density \(\uppi \left(\mathbf{y}{\boldsymbol{\upeta}} ,{\boldsymbol{\uptheta} }_{1}\right).\) The latent variables for example may include the values of a linear predictor, an intercept, regression coefficients, and the values of any random effects. Importantly, \({\boldsymbol{\upeta}}\) is assumed to be a mean \(0\) Gaussian Markov random field with precision matrix \(\mathbf{Q}\left({\boldsymbol{\uptheta} }_{2}\right)\) (the construction of \(\mathbf{Q}\) for continuous spatial models is outlined in [52]) where \({\boldsymbol{\uptheta} }_{2}\) is a vector of hyperparameters. The hyperparameters are often combined into a single vector \({\boldsymbol{\uptheta}} =\left({\boldsymbol{\uptheta} }_{1},{\boldsymbol{\uptheta} }_{2}\right)\) with prior distribution \(\uppi \left({\boldsymbol{\uptheta}} \right).\) INLA then approximates the marginal posteriors \(\uppi \left({\boldsymbol{\upeta} }_{i}\mathbf{y}\right)\) and \(\uppi \left({\boldsymbol{\uptheta} }_{k}\mathbf{y}\right)\) as follows.
The first step is to write the joint posterior of the hyperparameters as
A Laplace approximation is applied to the denominator, replacing it with a Gaussian and giving the approximation:
where \({\upeta }^{*}\left({\boldsymbol{\uptheta}} \right)\) is the mode of \(\uppi \left({\boldsymbol{\upeta}} {\boldsymbol{\uptheta}} ,\mathbf{y}\right),\) and \(\widetilde{{\uppi }_{G}}\left({\boldsymbol{\upeta}} {\boldsymbol{\uptheta}} ,\mathbf{y}\right)\) is its Gaussian approximation. Approximate posterior marginals for the hyperparameters can then be obtained as:
The exact marginals for the latent effects:
are approximated using numerical integration as
where \(\widetilde{\uppi }\left({\boldsymbol{\uptheta}} \mathbf{y}\right)\) is as in Eq. (2) and \(\widetilde{\uppi }\left({\boldsymbol{\upeta} }_{i}{\boldsymbol{\uptheta}} \right)\) is an approximation of \(\uppi \left({\boldsymbol{\upeta} }_{i}{\boldsymbol{\uptheta}} \right).\) INLA provides three primary methods for computing \(\widetilde{\uppi }\left({\boldsymbol{\upeta} }_{i}{\boldsymbol{\uptheta}} \right)\), termed the Gaussian, Laplace, and Simplified Laplace strategies, in addition to adaptive and automatic strategies. Each strategy applies Laplace approximations or series expansions to different conditional distributions, and has different trade offs for efficiency and accuracy. For full details on these methods, see for example [44, 53, 54].
Predictions in INLA are carried out concurrently with model fitting, where the posterior predictive distribution of the response at each prediction location is computed [53]. The INLA software provides summary statistics including the mean, median, standard deviation and quantiles of the predictive distribution.
INLAbased model
A model using the INLA software is formulated similar to [55] and [56] to predict malaria prevalence. Let \({H}_{i}\) denote the number of positive results (e.g., in this case, malaria infections) and \({N}_{i}\) the number of people screened at location \(\mathbf{x}_{i}\) for \(i = 1, \dots , n\). Let \({p}_{i}\) denote the modelled prevalence at location \(\mathbf{x}_{i}\), and \(\mathbf{p}\) be the vector of modelled prevalences over all locations. Then \({H}_{i}\) is modelled using a binomial distribution as
The standard link for the binomial distribution is the logit function, which opensup the probabilities in \(\left[\mathrm{0,1}\right]\) to real values in \(\left(\infty ,\infty \right)\). Thus obtaining,
where \({\upbeta }_{0}\) denotes the intercept and \(S\) is a spatial random effect that follows a zeromean Gaussian process with Matérn covariance function:
Here \(\boldsymbol{\uplambda}\) is the smoothness parameter, \({\boldsymbol{\upsigma} }^{2}\) denotes the variance and \({K}_{\boldsymbol{\uplambda} }\) is the modified Bessel function of the second kind. The parameter \(\upkappa\) controls how fast the correlation decays with distance.
The implementation and parameter settings for the model are based on the examples available in [56]. The first step in setting up a model is to construct a triangular mesh on which the SPDE will be solved. The software constructs this mesh based on restrictions provided by the user, and it usually contains a region of smaller triangles near the data surrounded by an extension of coarser triangles to avoid boundary effects [57]. When using the Kenya data, the maximum triangle edge length was set to \(0.5\) for the inner region, and \(4\) for extension. The cutoff parameter sets a distance, under which, points are grouped together when constructing the mesh vertices. This has been set to \(0.01\), and additionally the min.angle and offset parameters, which determine the minimum allowed angles in the triangles and the size of the extension, have been left at their default values of \(21\) degrees and \(0.1\), respectively. When using the Africa data, a mesh on the unit sphere is used, with the above parameter values converted to radians.
The smoothness parameter \(\boldsymbol{\uplambda}\) in the Matérn covariance function Eq. (5) must be chosen via the alpha parameter
where \(d\) is the dimension of the space (i.e. 2 for a spatial model). The alpha parameter has been set to its default value of \(2\).
User settings additionally control the approximations during inference. The default auto strategy was used for approximating \(\widetilde{\uppi }\left({\boldsymbol{\upeta} }_{l}{\boldsymbol{\uptheta}} ,\mathbf{y}\right).\) The int.strategy parameter then determines how the points \({{\boldsymbol{\uptheta} }}^{\left(k\right)}\) are selected for the numerical integration in Eq. (3), and the faster empirical Bayes strategy was chosen. This selects a single point, namely the mode of \(\widetilde{\uppi }\left({\boldsymbol{\uptheta}} \mathbf{y}\right)\) and therefore does not average predictions over uncertainty in the hyperparameters, as would typically happen in an MCMC inference procedure. The median of the predictive distribution was used for point predictions, though other quantities such as the mean are available.
GPBoost
GPBoost combines treeboosting with Gaussian processes and mixed effects models. Inspired by the performance of gradient boosting algorithms, as implemented in popular software such as XGBoost and LightGBM, GPBoost aims to leverage the advantages of treeboosting algorithms including accounting for complex nonlinearities, discontinuities and higher order interactions with the versatility of Gaussian processes [46]. It has the functionality to use mixed effects models, in particular models with grouped random effects.
The general structure for a model which can be implemented using GPBoost is
where \({Y}_{i}\) is the response variable at location \(\mathbf{x}_{i}\). The matrix \(X\in {R}^{n\times p}\) is the fixed effect predictor matrix, with the \(i\)th row containing covariates for location \(\mathbf{x}_{i}\). The fixed effects function of the covariates \(F\), is nonlinear and is learned with boosting. \(\mathbf{S}\in {R}^{m}\) contains the random effects with covariance matrix \(\Sigma \in {R}^{m\times m},\) while \(Z\in {R}^{n\times m}\) is the random effect predictor variable matrix, which is typically used to define grouped random effects.In a Gaussian process model the random effects \(\mathbf{S}=\left(S\left(\mathbf{x}_{1}\right),S\left(\mathbf{x}_{2}\right),\dots ,S\left(\mathbf{x}_{m}\right)\right)\) are a finitedimensional version of a Gaussian process \(S\left(\mathbf{x}\right)\) with a covariance function:
Here \(c\) is a covariance function often parameterized as:
where \(r\) is an isotropic autocorrelation function with \({\boldsymbol{\upsigma} }_{1}^{2}={\text{var}}\left(S\left(\mathbf{x}\right)\right)\) and \(\boldsymbol{\uprho}\) is the range parameter which determines how quickly \(r\) decays with distance. GPBoost currently supports the exponential, Gaussian, Matérn, powered exponential, Wendland, and tapered exponential covariance functions. In a Gaussian process model, \(Z\) is usually encoded as a diagonal matrix, so that each element of \(\mathbf{S}\) contains the spatial random effect for that location.
With its default settings, GPBoost does not apply approximations to the Gaussian process. For increased efficiency, Vecchia approximations are available in the software. These approximations assume conditional independence between responses based on their locations, resulting in sparse Cholesky factorizations of the precision matrix and in turn improved computational efficiency [46, 58].
Inference with GPBoost is carried out by jointly optimizing the nonlinear fixed effects function \(F\), and the variance and covariance parameters \({\boldsymbol{\uptheta}}\) (i.e \(.{\boldsymbol{\uptau} }^{2},{\boldsymbol{\upsigma} }_{1}^{2}\), and \(\boldsymbol{\uprho}\)). In the Gaussian process case, the goal of the optimization is to minimize the risk functional:
where \(\mathbf{y}=\left({y}_{1},\dots ,{y}_{n}\right)\) are the observed responses at locations \(\mathbf{x}_{1},\dots ,\mathbf{x}_{n}\).Here, \(L\left(\mathbf{y},F\left(X\right),{\boldsymbol{\uptheta}} \right)\) is the negative log marginal likelihood for obtaining the observed responses \(\mathbf{y}\), given the observed covariate matrix \(X\), and model parameters \({\boldsymbol{\uptheta}}\),
where \(\Psi =Z\Sigma {Z}^{T}+{\boldsymbol{\uptau} }^{2}I\). The risk functional is minimized by iteratively updating \(F\) and \({\boldsymbol{\uptheta}}\). At step \(k\), \({F}_{k1}\) is held fixed and \({\boldsymbol{\uptheta}} _{k}={\text{argmin}}_{{\boldsymbol{\uptheta}} }\left(L\left(\mathbf{y},{F}_{k1}\left(X\right),{\boldsymbol{\uptheta}} \right)\right)\) is computed using a gradient or quasiNewton method. With this value of \({\boldsymbol{\uptheta} }_{k}\), \(F\) is updated via a single step of a boosting algorithm.
After optimization, GPBoost produces predictions in a similar manner to Gaussian process regression. The joint distribution of the observed and predicted responses is formed, and conditioned on the observed responses. The mean of the resulting conditional distribution is used for the predicted value of the response.
As covariates are not used in this implementation, tree boosting is used only to find the intercept. While this does neglect GPBoost's functionality for learning nonlinear functions of covariates, GPBoost has been included in the analysis for users who may wish to apply it in more complicated scenarios that may benefit from tree boosting.
GPBoostbased model
The basic model in Eq. (6) can be extended to use nonGaussian likelihoods, where the implementation in the software uses a Laplace approximation during inference [59]. Currently Bernoulliprobit, Bernoullilogit, Poisson, and Gamma distributions are supported for the response variable. However unlike INLA and FRK, a binomial response is not currently supported which does present a limitation for applying this method for mapping malaria prevalence. This work therefore models malaria prevalence by customizing Eq. (6) as follows:
where \({\upbeta }_{0}\) is the intercept, and \({H}_{i}\) and \({N}_{i}\) denote the number of positive results and the number of people tested at location \(\mathbf{x}_{i}\). Note that for simplicity, the direct proportion of positive tests was used rather than the empirical logit, and predictions were clipped to lie within \(\left[\mathrm{0,1}\right]\) for the prevalence maps. The exponential covariance function \(r\left(\mathbf{x}\mathbf{x}^{\prime}/\boldsymbol{\uprho} \right)=\mathrm{exp}\left(\mathbf{x}\mathbf{x}^{\prime}/\boldsymbol{\uprho} \right)\) was selected, which is the default choice in the software. Notably, this model does not use GPBoost's full capability for learning nonlinear functions of the covariates, however it has been constructed in order to be consistent with the choice to not use covariates for any of the models.
The parameter settings for the implemented model follow examples by the package author [60]. For the spatial random effect, a full Gaussian process without approximation was used, with the gp_approx parameter set to its default value of none. Other parameters in the software control the trees and boosting algorithm used to learn the fixed effects function \(F\). The number of boosting rounds was set to 247 and the learning rate to 0.01, using the parameters nrounds and learning_rate. Other settings for the model include num_leaves=1024, max_depth = 6, and min_data_in_leaf = 5, each of which control the size of the trees.
SpRF
Spatial Random Forests (SpRF) [47] extend classical random forests to a spatial domain by using distances to observation points as explanatory variables, i.e. when fitting a model with SpRF, for each point \(\mathbf{x}_{i}\), where \({y}_{i}\) is given, covariates are used that give the distance from each other observation point. That is, the design matrix for this part of the model is simply the distance matrix between all pairs of observation locations. In order to obtain uncertainty estimates, the SpRF authors use quantile regression forests which estimate specified quantiles of the conditional distribution \({Y}_{i}{X}_{i}\) [61] where \({X}_{i}\) are the covariates for the \(i\)th response, in contrast to classical random forests which do not provide uncertainties.
The generic equation of an SpRFbased model is given by
where \({Y}_{i}\) is the response at location \(\mathbf{x}_{i}\), \({X}_{{G}_{i}}\) denotes a vector of the distances to each of the observation locations from the querying point \(\mathbf{x}_{i}\) (including a distance of 0 to itself, in the \(i\)th position of the vector) and \({X}_{{R}_{i}}\) and \({X}_{{P}_{i}}\) denote two types of covariates—surface reflectance and processbased. The function \(f\) is learned by the random forest. Unlike the other methods examined, SpRF does not use a covariance function.
SpRF is based on the ranger package for random forests, which provides an implementation of quantile regression forests with training procedure outlined in [61]. Point predictions are given by the estimated medians from the quantile regression forests.
SpRFbased model
As in [47], an additional normal assumption for the response was included in order to construct the simple SpRF model
where \({H}_{i}\) and \({N}_{i}\) are as defined above and \({X}_{{G}_{i}}\) contains the distances from each observation point to \(\mathbf{x}_{i}\).
The user parameters for SpRF determine the structure of the random forest and the rules for growing each tree, including the number of trees and the number of variables to split on at each node via the num.trees and mtry parameters. Each parameter was left at its default value, resulting in a forest with 500 trees where each node splits at \(\sqrt{{n}_{v}}\) variables (\({n}_{v}\) is the total number of variables input into the random forest). Other parameters which further tune the structure of the trees and forest have been left at their default values, and the code for the SpRF model is based on a tutorial from the method's authors [62].
FRK
Fixed Rank Kriging (FRK) [48] is a spatiotemporal modelling framework built for large datasets. It uses a spatial random effects (SRE) model, which decomposes a spatially correlated meanzero random process using a linear combination of spatial basis functions. This dimensionality reduction using a relatively small number of basis functions ensures FRK's computational efficiency. The spatial domain \(D\) is partitioned into \(M\) subsets, \({A}_{1},\dots ,{A}_{M}\), called basic areal units (BAUs) with centroids \(\mathbf{x}_{1},\dots ,\mathbf{x}_{M}\). The SRE model is constructed on these BAUs which determine the granularity of the model, and the process is assumed to be piecewise constant over the BAUs.
The general equation for a model implemented in FRK with a Gaussian response can be written as
Here, \({Y}_{i}\), \(i=1,\dots , n\) are the responses at the observation locations, \({\upzeta} ={\left({\upzeta }_{1},\dots ,{\upzeta }_{M}\right)}^{T}\) is the value of a latent spatial process evaluated at each of the BAUs with centroids \(\mathbf{x}_{1},\dots ,\mathbf{x}_{M}\), and \({C}_{Z}\) is an \(n\) by \(M\) matrix connecting the observation locations to the BAU locations. The vector \(\mathbf{t}\left(\mathbf{x}_{j}\right)\) is a collection of covariates at BAU \(j\) and \({\boldsymbol{\upbeta}}\) is a vector of regression coefficients, while \(v\left(\mathbf{x}_{j}\right)\) is the value of a smallscale, spatially correlated random effect. Lastly, \(\upxi \left(\mathbf{x}_{j}\right)\) is a finescale random effect, which is treated as uncorrelated across the BAUs [48].
FRK introduces nonGaussian data to the model by replacing the observation distribution with a member of the exponential family and using a link function to transform the latent process into a mean process [63]. The general structure of such a model is
where \(\uppsi\) is a dispersion parameter for the context dependent member of the exponential family \({\text{EF}}\), \({\boldsymbol{\upmu}}\) is called the mean process, and \(g\left(\cdot \right)\) is the link function. The mean process at the observation locations is represented by \({\boldsymbol{\upmu}}\), while \({\boldsymbol{\upmu} }^{\prime}\) represents the mean process at the BAUs.
The spatially correlated random effect \(\upnu \left(\mathbf{x}\right)\) is decomposed as
where \({\upphi }_{1},\dots ,{\upphi }_{r}\) are a fixed collection of basis functions on the spatial domain, and \({\boldsymbol{\upeta}} ={\left({\boldsymbol{\upeta} }_{1},\dots ,{\boldsymbol{\upeta} }_{r}\right)}^{T}\) is an rvariate Gaussian random variable with covariance matrix \(K\). To estimate model parameters including the coefficients \({\boldsymbol{\upbeta} }\), variance parameters for the fine scale random effect \(\upxi\), and covariance parameters for the covariance matrix \(K\), FRK carries out maximum likelihood estimation. When working with nonGaussian data, a Laplace approximation is used to approximate the marginal likelihood, which is then maximized via a quasiNewton method.
By default, FRK produces a prediction for the mean process \(\boldsymbol{\upmu} \left(\cdot \right)\) at each of the BAUs. Predictions and uncertainties are generated via a Monte Carlo sampling approach, and the predicted value of \({\boldsymbol{\upmu}}\) in each BAU is taken to be the average of the samples.
FRKbased model
As with INLA, the number of positive tests \({H}_{i}\) is modelled using a binomial distribution
where \({p}_{i}\) is the prevalence at the \(i\)th observation location. The vector \(\mathbf{p}^{\prime}\) gives the prevalence at the BAUs, and is transformed into the prevalence at the observation locations via the \({C}_{Z}\) matrix, which has construction detailed in [63]. The logit function is then used as the link function \(g\) in Eq. (7), i.e.
As covariates are not being used, the latent process over the BAUs \({\upzeta }_{j}\) can be written as:
where \({\upbeta }_{0}\) denotes the intercept.
The implemented model decomposes the spatial random effect, \(\upnu \left(\mathbf{x}\right)\), using Gaussian basis functions of two different scales placed regularly across the spatial domain, as controlled by the type, nres, and regular parameters respectively. The spatial scale of these basis functions is determined jointly by the regular parameter and the scale_aperture, which have been left at their default values of \(1\) and \(1.25\) respectively. The assumed correlation structure of the random coefficients \({\boldsymbol{\upeta}}\) is controlled by the K_type parameter. When using a nonGaussian model, this takes a default value of precision, which models the coefficient dependence using a precision matrix \(Q\) based on the Leroux model [63]. During prediction, the user can specify the number of Monte Carlo samples to be drawn, which has been left at the default value of \(400\). Code and parameter choices for the implemented FRK model are based on examples from the package authors in [48, 63].
Methods for the country scale analysis
At the country scale, the models implemented using the four methods were compared qualitatively using their predictive maps, while crossvalidation was used to compare their predictive performance. To produce maps of predicted prevalence, each model was fit on all available P. falciparum prevalence surveys from Kenya in 2009 from the malariaAtlas R package. This consisted of 382 surveys at points across the country which are shown in Fig. 1a. Point estimates of prevalence and uncertainties were produced by the fitted models on a grid over Kenya, with each cell covering a nominal 0.1 degrees (approximately 11 × 11 km at the equator) in longitude and latitude.
Model fitting and prediction were run on a 2014 MacBook Pro with a two core, 2.8 GHz Intel Core i5 processor running macOS 10.13.6. Each model was run using a single thread to obtain a baseline performance comparison to accompany the model predictions, although parallelization options are available for each model which may provide performance improvements. Recorded times were measured as the total time to run a model's R script, including both fitting and prediction.
To evaluate the models, spatial block crossvalidation (CV) [64] was used with both 10 and 50 folds. In a spatial setting, randomly allocating points to crossvalidation folds is not effective because close by points can act as proxies. The folds were selected using \(k\)means clustering [65] on the spatial coordinates of the prevalence surveys—resulting in a series of ‘blocks’ of spatiallyadjacent points. Figure 2 shows the location of points for the two sets of CV folds, where each colour represents a fold. The 10 and 50 CV folds measure different abilities of the methods. The 50fold CV quantifies shortscale interpolation ability, while the 10fold quantifies the ability to interpolate over longer distances.
Using 10 and 50fold crossvalidation, the following are investigated:

1.
analysis of the point predictions including a comparison between the predictions and outofsample prevalence values using multiple measures,

2.
analysis of the uncertainty bounds for each model, and

3.
analysis of the predictions with respect to density of the sampled locations (“Detailed crossvalidation results”).
Analysis of uncertainties is complicated by the differing measures of uncertainty provided by the different methods. INLA contains information on the summaries of the posterior marginal densities of the fitted model, and can compute the standard deviation and different quantiles of the predictions. GPBoost provides the variance of each prediction. FRK predicts the standard deviation of each prediction in the linear, Gaussian setting. For the nonGaussian case, it provides the predictions using a Monte Carlo approach [63]. SpRF uses quantile regression and the quantiles can be specified in the ranger package. To compare SpRF with the other methods, a normally distributed response is assumed as in [47], and the standard deviation for SpRF’s predictions is estimated as
Hengl et al. note that this assumption may not always be valid [47], and hence only a rough comparison of the SpRF model's uncertainty with the other three models is possible.
For each model, the number of observed prevalence values which lie within the predicted uncertainty intervals is measured. Let \(\widehat{{y}_{i}}\) denote the mean of the predicted response for observation \({y}_{i}\), and define
where SD denotes the standard deviation. As prevalence values are between 0 and 1, the bounds are trimmed if they exceed these limits. Note that \(\widehat{{y}_{i}}\) corresponds to the predicted prevalence for the implemented GPBoost and FRK models, but not for the INLA and SpRF models which use the median for predictions.
Methods for the continent scale analysis
At the continent scale, analysis focused on the prediction maps and each model was fit three sets of prevalence data over Africa. The first set consists of 868 P. falciparum prevalence surveys from 2009, available via the malariaAtlas R package. This data is shown in Fig. 3a, with survey points concentrated in Kenya and Somalia. Each model was additionally fit using two types of simulated data to allow comparison of the predictions with a known truth and to compare model performances on both interpolation and extrapolation tasks, and lastly to assess how properties of the data such as spatial sparsity and noise impact model predictions.
Simulated data was generated using the 2009 P. falciparum prevalence raster created by the Malaria Atlas Project (MAP), shown in Fig. 3d [13]. Prevalence was sampled from the raster at the locations of the 2009 surveys and combined with the number of tests at each location to generate a binomial sample for the number of people testing positive. Of the 868 observation locations, 28 points lie on gaps in the prevalence raster and were excluded, leaving 840 points in this second dataset, which is shown in Fig. 3b. This spatially clustered simulated data allows for the evaluation of each model's ability to extrapolate over regions with little or no data. While this dataset shares locations with the observation data, its prevalence notably contains less noise.
The second set of simulated data was generated by selecting 1000 points at random on the MAP raster, allowing for comparison of the models' interpolation performance when trained on data with good spatial coverage. A binomial sample for the number of positive tests was generated at each location, where the number of people tested was set to 85, approximately the average number in the surveys from 2009. The prevalences from this simulated dataset are shown in Fig. 3c.
The same parameter settings were used as in the country scale analysis, though whenever possible, settings that compute an appropriate spherical distance between points were chosen due to the larger spatial extent of the data. This was possible SpRF which uses great circle distances, and for INLA which allows for meshes to be constructed on the unit sphere. GPBoost does not have this functionality at the time of writing this article, however correspondence with the package authors reveals that they hope to add this functionality in future. While FRK does support using great circle distances for some models, this feature is not currently well supported for models with a binomial response and did not work in implemented tests. Hence, this work uses Euclidean distances between coordinates for both GPBoost and FRK.
Using INLA with a spherical geometry requires a mesh to be built on a subset of the sphere. Although several methods for constructing this mesh are used in the literature [52, 57, 66], each produced similar results and hence the method outlined by Lindgren and Rue [57] was selected.
Model fitting and prediction were carried out on a single 3.00 GHz Intel(R) Xeon(R) Gold 6154 CPU in the Physical partition of the University of Melbourne’s high performance computing cluster, Spartan, and each model run was allocated 32 GB of RAM. As with the country scale data, each model was run using a single thread. Predictions were produced on a grid with cell side length 0.15 degrees (approximately 16.7 km at the equator).
Results
This section presents the analysis on Kenya, which includes the predictive maps and crossvalidation results, and the continent scale analysis including models trained on three different input datasets discussed above.
Case study: Kenya
At a national scale, two means of verifying the implemented models have been used: 1. predictive maps, and 2. 10fold and 50fold crossvalidated predictions. The whole dataset was used to produce the predictive maps, while for crossvalidation, some data was left out in each fold.
Predictive maps
The predictions and uncertainties produced by the four models when trained on the 2009 Kenya prevalence data are shown in Fig. 4 At the broadest scale, each model is similar in predicting a region of high prevalence in Western Kenya, with clusters of higher prevalence in the East, but low prevalence over much of the rest of the country. For each model, the predicted prevalence drops to zero quite quickly away from the data, indicative of a smaller spatial range than might be expected. This is especially prominent with the INLAbased model, and may be indicative of overdispersion in the data.
A notable feature is the arc like band of higher prevalence in the north west of Kenya in SpRF's predictions in Fig. 3c, which is further discussed in “Continent scale results” section. Higher prevalence in this region is also somewhat apparent in the GPBoostbased model's predictions and, to a lesser extent, FRK. This area of predicted higher prevalence falls in a broad region with no prevalence data and so represents different approaches to extrapolation in the four models.
Crossvalidation results
Table 1 gives the crossvalidation results. In terms of crossvalidation RMSE and correlation, FRK performs the best for 10fold CV, and GPBoost performs the best for 50fold CV. SpRF predictions had the highest correlations to the data used to train the model, but poorer correlation to outofsample data, indicating that this model may be overfitting to the training data. INLA performs poorly with respect to the 10fold RMSE and correlation.
Table 1 shows that SpRF has only 37.105% of the points within 1SD for 10fold crossvalidation, which is much lower than for the other models and is discussed further in “Detailed crossvalidation results” section. GPBoost performs the best in terms of the percentage of points within 1SD. However these results need to be taken in context, because a higher standard deviation can increase this percentage.
Figure 5 shows the interval and point predictions for the four methods for 10fold crossvalidation. Points within 1SD are shown in green, points within 2SD are shown in blue, and the rest are shown in red. Many of INLA's predictions are close to zero and this issue is investigated in “Effects of input noise on INLA” section. Additional crossvalidation metrics that consider the prediction error divided by its standard deviation are detailed in [67]. However, this work does not consider these metrics as some methods produce very small standard deviations and thus will result in very large values. More details on crossvalidation results in terms of the clusters and density of locations are given in “Detailed crossvalidation results” section.
Continent scale results
Geostatistical mapping is often carried out at a continent or global scale and frequently uses large datasets of observations. As computation time for inference and prediction can scale poorly with the amount of data and size of the domain, it is important to assess the performance, both in terms of predictive power and time, of recent methods. To examine how each of the four methods perform at larger scales, and to understand how predictions are affected by potential violations of model assumptions and by clustering and sparsity of observation points, the study area was expanded to the whole continent of Africa in 2009, and each model was fit to the three prevalence datasets shown in Fig. 3. Figure 6 shows the prevalence predicted by the models when trained on each input dataset. The corresponding uncertainties appear in “Prediction uncertainty”.
Scaling up to the continent level reveals differences between the methods that are not apparent at smaller scales. While the national scale prevalence maps in Fig. 4 are largely similar apart from the slight banding effect seen in SpRF's predictions, the prevalence maps in Fig. 6 differ significantly, with artifacts appearing in several of the maps.
Overall, the four models are better at local interpolation than extrapolation over large regions without data. The predictions in Fig. 6(iii) generated using the randomly distributed data recover the prevalence structure of the MAP raster in Fig. 3d much more faithfully than the predictions in Fig. 6(ii) from the sparser nonuniform data. This behaviour is expected as malaria prevalence is known to be highly heterogeneous and the implemented models do not use covariate data.
SpRF's predictions display a prominent banding effect, visible in both the country and continent scale maps where contiguous arclike bands of high prevalence appear in both point and uncertainty estimates. This may be explained by the fact that SpRF models the quantity of interest—malaria prevalence in this case—based on distances to points with known values. Thus bands of high or low prevalence appear at different radii from clusters of observations, and the piecewise constant nature of random forests would contribute to the sharp steps between each band. The banding effect is particularly prominent in Fig. 6(ci) and (cii), where the points were clustered into smaller regions, while it is less obvious when the datapoints have good spatial coverage, as in Fig. 6(ciii), which does not show bands spanning the continent. Further increasing the number of simulated points was found to further reduce the prominence of these bands.
Even though SpRF produces maps with this unwelcome feature, the crossvalidated point estimates are quite accurate. Table 3 in “Detailed crossvalidation results” shows that SpRF has the highest proportion of points with absolute errors less than 0.05 and 0.1 for 10fold crossvalidation, which is a harder task for the algorithms than 50fold crossvalidation. Thus in this example, SpRF gives reliable predictions at points even though it may produce a predictive map that can be misleading in regions where there are no sample points.
Figure 6(ai), produced by INLA with the observation data, displays a sudden drop in prevalence away from observations, resulting in flat nearzero predictions covering most of the continent. This appears to result from a combination of both the sparsity and noise present in the data, rather than clustered nature of the data alone. Figure 6(aii) uses nearly the exact same locations, yet shows higher values of prevalence spreading much further from the observations. “Effects of input noise on INLA” outlines evidence that this effect arises from unaccountedfor overdispersion in the observation data. In particular, increased noise in the data appears to reduce the estimated range for the spatial random effect, resulting in the model reverting to constant predictions away from observation locations. This behaviour is consistent with the INLAbased model's poor performance in the 10fold crossvalidation analysis in “Crossvalidation results” section, where the model predicted nearzero malaria prevalence for each of the held out folds.
The FRKbased model's predictions depend strongly on the arrangement of the basis functions, which are generally placed by the software based on the data locations and the user parameters introduced in “FRKbased model” section. For example, FRK's predictions in Fig. 6(di) and (dii) display spurious oscillations in regions with little or no data, however these oscillations correspond to periodic placement of the basis functions. In regions with little data, FRK reverts to the prior mean, which varies with the basis functions and leads to oscillating predictions (A. Zammit Mangion, personal communication, September 1 2023). The spurious oscillations should generally coincide with locations where the predicted uncertainty is high, however notably Fig. 6 (di) and (dii) include a large patch of low prevalence over South Sudan where no data is located, yet where the predicted standard deviation is low in Fig. 20. Other tested arrangements led to flat predictions over the whole continent (results not shown). These types of artifacts are not present in Fig. 6(diii), where the input data has good spatial coverage. However, this map appears as a smoothed version of the input data, and does not resolve the finer structure in the MAP surface. The impact of the arrangement and number of basis functions on the prediction maps is detailed further in “FRK sensitivity” section.
For both Kenya and Africa, the GPBoostbased model produces prevalence maps without the artifacts appearing in the other models' outputs. However, the uncertainty maps in Fig. 20(ci)(cii) exhibit a high level of overall uncertainty regardless of whether the regions have more survey points or not. This is further confirmed by the nearconstant interval widths that rarely fluctuate with the density of the survey points in Fig. 16 (“Detailed crossvalidation results” section). Even though GPBoost currently computes only Euclidean distances between coordinates, both the prevalence maps for Africa and Kenya appear to be reasonable. However, it is suboptimal to use Euclidean distances between longitude and latitude coordinates for a global model.
Computational results
Times taken to train each model on each of the datasets and produce predictions are shown in Table 2. While FRK is consistently the fastest, INLA shows great variation among the African datasets, ranging from less than 10 min with the uniform simulated data to 69.11 min with the observation data. Further analysis of this variation for INLA is given in “Effects of input noise on INLA”.
Each model was additionally trained on simulated prevalence datasets with 1000–10,000 points selected at random. Figure 7 shows times taken to fit each model and produce predictions as the dataset size varies. Both INLA and FRK remained very fast on larger datasets, showing little variation in their times. In contrast SpRF's time appears to increase linearly with dataset size, and GPBoost rapidly slows down on larger datasets, reflecting the computational requirements of using an unapproximated Gaussian process. As noted in GPBoost, a Vecchia approximation is available for this method to improve the computational efficiency. In “GPBoost with the Vecchia approximation”, the effects on computation and model predictions of applying this approximation are examined.
Sensitivity of FRK and INLA to parameter choices
Of the four methods, INLA and FRK show promise in their computational efficiency, displaying favourable scaling compared to SpRF and GPBoost in Fig. 7. Additionally, while the artifacts in SpRF's output appear to stem from the way it uses distances as an input, it is less clear whether the artifacts in INLA and FRK's prevalence maps are due to specific model parameters, or if they are fundamentally caused by the approximations used by each method. For this reason, these two methods are examined more closely and test the sensitivity of their predictions to the model parameters.
INLA sensitivity
The primary artifact visible in the INLAbased model's prediction maps is the flat, near zero, predictions when the model is fit to the observation data, as shown in Fig. 6(ai). “Effects of input noise on INLA” outlines evidence that this feature is due to overdispersion, suggesting that the binomial response is unsuited for modelling the variability in the observed malaria data, despite commonly being used in tutorials on the application of INLA to disease mapping problems.
Several adjustments can be made to the model to address this overdispersion, such as the use of a Betabinomial or Gaussian response (either directly on the proportion positive, or its empirical logit transform), or the inclusion of an independent error term in the linear predictor. All of these options include an additional parameter in the model to capture error variance at the level of the observation. Figure 8 shows predictions from an INLAbased model with a Betabinomial response which has been fit to the observation data. The flat predictions of Fig. 6(ai) are notably absent, suggesting that the Betabinomial is effective in resolving the overdispersion. A Gaussian response was additionally tested and was found to also handle the variability in the observation data, with results shown in “INLA with a Gaussian response”. These results highlight a need for caution when applying INLA with a binomial response to disease mapping problems, and the importance of checking for overdispersion.
FRK sensitivity
While the fastest of the four methods, FRK's continent scale predictions display a spurious ‘spotty’ pattern when fit to either of the spatially sparse datasets and a much less detailed map when fit to the simulated data at randomly selected locations (Fig. 6). These features appear to stem from FRK's use of a small number of basis functions in approximating the Gaussian process. This section examines whether increasing the number of these functions can resolve the artifacts in FRK's outputs.
The number of basis functions used in FRK's approximation is primarily controlled by the nres and regular parameters. Increasing the nres parameter adds an additional ‘resolution’ or layer of basis functions with a finer spatial scale, while increasing the value of regular reduces the scale of each basis function and adds additional rows and columns to their arrangement. Details on the effects of these parameters are available in the software documentation [68]. The model used throughout “Case study: Kenya” to “Computational results” section had these parameters set to nres \(=2\) and regular \(=1\).
Figure 9a shows FRK's predictions when fit to the observation data with nres increased to 3, and regular left at 1, which resulted in a model with 1338 basis functions. Whilst the broadscale ‘spottiness’ is less prominent in this model, finerscale oscillation is quite visible in regions of Central Africa. This modest improvement came at a significant computational cost, as the model took over 55 min and required 106 GB of RAM, compared to the 4.77 GB of RAM and 3.28 min required when nres was set to \(2\), and regular was set to \(1\). Figure 9b shows the predictions when nres is kept at \(2\) and regular is increased to \(2\). These settings resulted in 600 basis functions, and required 22.74 GB of memory and 9 min to run, significantly less than when increasing the nres parameter. However the finescale oscillation is noticeably more pronounced in areas with little or no data.
While the key to FRK's computational efficiency is its decomposition of the spatial random effect into a small number of basis functions, these results suggest that it is challenging in practice to balance this efficiency with the risk of artifacts appearing in the model output, especially on large scale mapping problems, with sparse data. The observed issues with computational efficiency may result from the current software implementation rather than from FRK's underlying mathematical approach (A. Zammit Mangion, personal communication, September 1 2023), and hence such a balance may be easier to achieve in future software versions.
Discussion
This applied comparison of four computational spatial modelling methods found that two of them (SpRF and GPBoost) are not sufficiently scalable or accurate to be applicable to largescale malaria prevalence modelling problems. SpRF's spatial predictions displayed a prominent ‘banding’ artifact, and at first glance the SpRFbased models appeared to be overfitted (matching closely to training data, but making poor predictions to holdout data). However, closer inspection (“Detailed crossvalidation results”) shows that SpRF's tighter uncertainty intervals in low density regions may result in this perception. Unlike the other methods, SpRF does not incorporate a covariance function, but instead treats the columns of the distance matrix between coordinates as covariates for inclusion in the Random Forest. A covariance function could in fact be applied to the distance matrix before inclusion in the model. This would not have the same interpretation as in the Gaussianprocess based models considered in this work, but would enable SpRF to consider a distancebased decay in the unobserved spatial effects being modelled. However, the Random Forest inference machinery in SpRF would have no means to estimate the parameters of such a function (such as the rate of decay with distance), and it seems unlikely they could be reasonably specified in advance. Due to these issues of fit, and the fact that the computation time of SpRF scaled approximately linearly with the size of the data, this approach is unlikely to be useful for applied spatial modelling of malaria data.
The GPBoostbased model made reasonably good predictions to holdout data, being the bestperforming model at 50fold spatiallyblocked crossvalidation in the nationalscale comparison (implying a good ability to extrapolate over short distances) and the secondbest, behind FRK, at 10fold crossvalidation (ability to extrapolate over longer distances). However the computation time using the default GPBoost specification scaled very poorly with increasing data size. This is because by default GPBoost performs inference on the full (unapproximated) Gaussian process, with each step of the inference procedure requiring an \(\mathcal{O}\left({n}^{3}\right)\) inversion of the covariance matrix. Neither the maximumlikelihood inference of GP hyperparameters, and boosting inference on the intercept (and covariate effects if used) reduce this computational burden. Employing the Vecchia approximation available for the method did not resolve all issues, as shown in “GPBoost with the Vecchia approximation”. The Vecchia approximation resulted in faster computation times, but also resulted in artifacts in the model predictions. Increasing the complexity (number of neighbouring points to consider) in the approximation reduced these artifacts, but at the cost of a substantial increase in the required computation time and RAM usage. It is worth noting that GPBoost is a relatively new technique, and future versions may include faster approximations.
Both INLA and FRK offered substantially better scalability to increasing data size than SpRF and GPBoost, taking only minutes to fit to 10,000 datapoints. Whilst it was computationally scalable, and is a widely established method and software for geostatistical modelling of malaria data, implementing INLA using the commonly suggested binomial distribution for prevalence data (e.g. as suggested in [31, 56]) resulted in spurious predictions and poor ability to extrapolate in both the 10fold and 50fold crossvalidation tests. This work has demonstrated that this is due to the fact that the malaria prevalence data being modelled are overdispersed relative to the binomial sampling assumption and spatialonly model. That is, the assumption is violated that the infection status of each individual in a given sample is independent of the others, given the estimated prevalence estimate at that location. This should not be surprising from an epidemiological perspective, given that the infections in a given place do not arise independently—each infection is caused by another. This gives rise to local noise, either at the level of a pixel or group of pixels (that particular location may have some risk factor not accounted for by the smooth spatial model), or at the level of the observation (on the day of sampling, that population may have had a higher or lower than usual prevalence). The INLAbased model's behaviour in this case is an attempt to capture these smallscale variations with a very ‘wiggly’ spatial random effect, i.e. one with rapid decay with increasing distance. It favours this parameter configuration on overdispersed data because the observation variance is fixed when using a binomial likelihood, and the variance is not sufficiently large to explain the data. This issue of poor identifiability between the observationlevel variance and the lengthscale of a Gaussian processes has previously been described ([69], see Fig. 5.4), and can be resolved in classical (and modelbased) geostatistics with the use of an independent ‘nugget’ effect either on each observation or each observed location [1]. Despite also using a binomial observation distribution, FRK does not suffer the same pitfall because it includes a type of spatial nugget effect in its ‘smallscale’ effect parameter.
For malaria prevalence modelling with INLA, this analysis suggests that a more reliable ‘default’ model than the standard binomial observation model would be one which includes additional observationlevel random noise. This can be achieved by using a Betabinomial or Gaussian (on the observed prevalences or on the empiricallogit scale). Both of these options have an additional observationlevel variance parameter that can be used to explain the overdispersion relative to the binomial. Of these, the Betabinomial is most likely to be generally applicable to malaria prevalence data, since it is able to accurately account for observation errors in the common situation where only very few of the individuals tested are infected. It is worth noting that fitting with a Gaussian response is substantially more computationally efficient in INLA, and so may be preferable if computation time is a major constraint. An alternative approach would be to include an independent observationlevel random effect in the model specification.
Whilst FRK scaled well to large datasets (generally taking slightly less time than INLA) and performed well in both the 10fold and 50fold extrapolation comparisons, for continentalscale modelling, specifying the model in such a way that it was both computationally scalable and avoided the spurious oscillating effect of the basis functions was not achieved using with modest model modifications. Whilst less noticeable, similar patterns are visible in the nationalscale analysis in parts of NorthWestern and far NorthEastern Kenya where no data are available to inform such a prediction. Given these issues, significant care must be taken, when applying FRK to mapping of sparse malariometric data, to avoid these spurious predictions that are driven by computationally convenient approximations rather than data.
Comparing four methodologically different techniques has its limitations. One such limitation is that the inherent differences of the methods make a comparison somewhat difficult. For example, the likelihoods are different as well as the underlying model structure and/or covariance functions. Thus, each method has its own measures and an INLA goodness of fit measure cannot directly be compared with that of SpRF and vice versa. For this analysis, this has been mitigated by focusing on the outputs—predictive maps and crossvalidation results.
Another aspect of interest is the parameter settings. There are many different parameter settings for each method. This analysis selected the commonly used (default) parameter settings and even though several different parameter settings were explored, a comprehensive exploration of the parameter space of these algorithms was not conducted. While the default parameter settings were acceptable for Kenya, it is expected that algorithms can benefit from customized parameters when running the model on the scale of Africa. The limitations of the choice of parameters is brought to light by the extent of the geographical region. Exploring optimal parameter selection is another avenue of research. Furthermore, there might be other parameter settings that can make the inference approximations of the different models more comparable.
From a practitioner's point of view, it is challenging to adopt a new method for spatial modelling mostly because it takes a long time to learn the methodology and write code to produce meaningful output. This is a significant barrier to entry. If the methods discussed provide tuning functions that explore the parameter space and select a set of parameters that enables the practitioner to build a good model, it would increase the usability of these methods.
An indepth investigation of strengths and weaknesses of the models would be another avenue of interest. One option is to construct a metamodel that can predict the best model based on features of different locations [70]. Such a metamodel could combine the strengths of the diverse models to make a stronger prediction. The findings of this paper should be of use for those creating, interpreting or working with spatial data, as a baseline comparison of new computational geostatistical models.
Availability of data and materials
All model output and programming scripts for this work are available at https://github.com/sevvandi/ supplementary_material/tree/master/stcompare.
References
Diggle P, Ribeiro Jr PJ. Modelbased geostatistics. Springer; 2007.
MartínezMinaya J, Cameletti M, Conesa D, Pennino MG. Species distribution modelling: a statistical review with focus in spatiotemporal issues. Stoch Environ Res Risk Assess. 2018;32:3227–44.
Holdaway MR. Spatial modelling and interpolation of monthly temperature using kriging. Clim Res. 1996;6(3):215–25.
Samalot A, Astitha M, Yang J, Galanis G. Combined Kalman filter and universal kriging to improve storm wind speed predictions for the northeastern United States. Weather Forecast. 2019;34(3):587–601.
Mulla D. Mapping and managing spatial patterns in soil fertility and crop yield. In: Proceedings of soil specific crop management: a workshop on research and development issues. Wiley Online Library; 1993. pp. 15–26.
Kuntz M, Helbich M. Geostatistical mapping of real estate prices: an empirical comparison of kriging and cokriging. Int J Geogr Inf Sci. 2014;28(9):1904–21.
Lai YS, Zhou XN, Utzinger J, Vounatsou P. Bayesian geostatistical modelling of soiltransmitted helminth survey data in the People’s Republic of China. Parasit Vectors. 2013;6(1):359.
Oliver M, Muir K, Webster R, Parkes S, Cameron A, Stevens M, et al. A geostatistical approach to the analysis of pattern in rare disease. J Public Health. 1992;14(3):280–9.
Scholte RG, Gosoniu L, Malone JB, Chammartin F, Utzinger J, Vounatsou P. Predictive risk mapping of schistosomiasis in Brazil using Bayesian geostatistical models. Acta Trop. 2014;132:57–63.
Nicholson MC, Mather TN. Methods for evaluating Lyme disease risks using geographic information systems and geospatial analysis. J Med Entomol. 1996;33(5):711–20.
Alimi TO, Fuller DO, Quinones ML, Xue RD, Herrera SV, ArevaloHerrera M, et al. Prospects and recommendations for risk mapping to improve strategies for effective malaria vector control interventions in Latin America. Malar J. 2015;14(1):519.
Omumbo JA, Noor AM, Fall IS, Snow RW. How well are malaria maps used to design and finance malaria control in Africa? PLoS ONE. 2013;8(1): e53198.
Weiss DJ, Lucas TC, Nguyen M, Nandi AK, Bisanzio D, Battle KE, et al. Mapping the global prevalence, incidence, and mortality of Plasmodium falciparum, 2000–17: a spatial and temporal modelling study. Lancet. 2019;394(10195):322–31.
Chipeta MG, Giorgi E, Mategula D, Macharia PM, Ligomba C, Munyenyembe A, et al. Geostatistical analysis of Malawi’s changing malaria transmission from 2010 to 2017. Wellcome Open Res. 2019;4:57.
Noor AM, Kinyoki DK, Mundia CW, Kabaria CW, Mutua JW, Alegana VA, et al. The changing risk of Plasmodium falciparum malaria infection in Africa: 2000–10: a spatial and temporal analysis of transmission intensity. Lancet. 2014;383(9930):1739–47.
Ashton RA, Kefyalew T, Rand A, Sime H, Assefa A, Mekasha A, et al. Geostatistical modeling of malaria endemicity using serological indicators of exposure collected through school surveys. Am J Trop Med Hyg. 2015;93(1):168–77.
Gething PW, Casey DC, Weiss DJ, Bisanzio D, Bhatt S, Cameron E, et al. Mapping Plasmodium falciparum mortality in Africa between 1990 and 2015. N Engl J Med. 2016;375(25):2435–45.
BertozziVilla A, Bever CA, Koenker H, Weiss DJ, VargasRuiz C, Nandi AK, et al. Maps and metrics of insecticidetreated net access, use, and netspercapita in Africa from 2000–2020. Nat Commun. 2021;12(1):3589.
Flegg JA, Patil AP, Venkatesan M, Roper C, Naidoo I, Hay SI, et al. Spatiotemporal mathematical modelling of mutations of the dhps gene in African Plasmodium falciparum. Malar J. 2013;12(1):249.
Flegg JA, Humphreys GS, Montanez B, Strickland T, JacomeMeza ZJ, Barnes KI, et al. Spatiotemporal spread of Plasmodium falciparum mutations for resistance to sulfadoxinepyrimethamine across Africa, 1990–2020. PLoS Comput Biol. 2022;18(8): e1010317.
Amoah B, Giorgi E, Heyes DJ, van Burren S, Diggle PJ. Geostatistical modelling of the association between malaria and child growth in Africa. Int J Health Geogr. 2018;17(1):7.
Piel FB, Patil AP, Howes RE, Nyangiri OA, Gething PW, Williams TN, et al. Global distribution of the sickle cell gene and geographical confirmation of the malaria hypothesis. Nat Commun. 2010;1(1):104.
Hay SI, Snow RW. The Malaria atlas project: developing global maps of malaria risk. PLoS Med. 2006;3(12): e473.
Gething PW, Patil AP, Smith DL, Guerra CA, Elyazar IRF, Johnston GL, et al. A new world malaria map: Plasmodium falciparum endemicity in 2010. Malar J. 2011;10:378.
Bhatt S, Cameron E, Flaxman SR, Weiss DJ, Smith DL, Gething PW. Improved prediction accuracy for disease risk mapping using Gaussian process stacked generalization. J R Soc Interface. 2017;14(134):20170520.
Hensman J, Fusi N, Lawrence ND. Gaussian processes for big data. arXiv preprint arXiv:13096835. 2013.
Rasmussen CE, Nickisch H. Gaussian processes for machine learning (GPML) toolbox. J Mach Learn Res. 2010;11:3011–5.
Park C, Apley D. Patchwork kriging for largescale Gaussian process regression. J Mach Learn Res. 2018;19:1–43.
QuiñoneroCandela J, Rasmussen CE. A unifying view of sparse approximate Gaussian process regression. J Mach Learn Res. 2005;6:1939–59.
Datta A, Banerjee S, Finley AO, Gelfand AE. Hierarchical nearestneighbor Gaussian process models for large geostatistical datasets. J Am Stat Assoc. 2016;111(514):800–12.
Moraga P, Dean C, Inoue J, Morawiecki P, Noureen SR, Wang F. Bayesian spatial modelling of geostatistical data using INLA and SPDE methods: a case study predicting malaria risk in Mozambique. Spat Spatiotemporal Epidemiol. 2021;39: 100440.
Pfeffer DA, Lucas TCD, May D, Harris J, Rozier J, Twohig KA, et al. MalariaAtlas: an R interface to global malariometric data hosted by the Malaria Atlas Project. Malar J. 2018;17(1):352.
Odhiambo JN, Kalinda C, Macharia PM, Snow RW, Sartorius B. Spatial and spatiotemporal methods for mapping malaria risk: a systematic review. BMJ Glob Health. 2020;5(10): e002919.
Adigun AB, Gajere EN, Oresanya O, Vounatsou P. Malaria risk in Nigeria: Bayesian geostatistical modelling of 2010 malaria indicator survey data. Malar J. 2015;14:156.
Kazembe LN, Kleinschmidt I, Holtz TH, Sharp BL. Spatial analysis and mapping of malaria risk in Malawi using pointreferenced prevalence of infection data. Int J Health Geogr. 2006;5:41.
Nzabakiriraho JD, Gayawan E. Geostatistical modeling of malaria prevalence among underfive children in Rwanda. BMC Public Health. 2021;21:369.
Bhatt S, Weiss D, Cameron E, Bisanzio D, Mappin B, Dalrymple U, et al. The effect of malaria control on Plasmodium falciparum in Africa between 2000 and 2015. Nature. 2015;526(7572):207–11.
Kabaria CW, Molteni F, Mandike R, Chacky F, Noor AM, Snow RW, et al. Mapping intraurban malaria risk using high resolution satellite imagery: a case study of Dar es Salaam. Int J Health Geogr. 2016;15(1):26.
Kapwata T, Gebreslasie MT. Random forest variable selection in spatial malaria transmission modelling in Mpumalanga Province, South Africa. Geospat Health. 2016;11(3):434.
ZammitMangion A, Cressie N, Shumack C. On statistical approaches to generate level 3 products from satellite remote sensing retrievals. Remote Sens (Basel). 2018;10(1):155.
Sakizadeh M, Zhang C. Health risk assessment of nitrate using a probabilistic approach in groundwater resources of western part of Iran. Environ Earth Sci. 2020;79(1):43.
Wu J, Jia P, Feng T, Li H, Kuang H. Spatiotemporal analysis of built environment restrained traffic carbon emissions and policy implications. Transp Res D Transp Environ. 2023;121:103839.
STcompare code and datasets. https://github.com/sevvandi/supplementary_material/tree/master/stcompare. Accessed May 2023.
Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc Series B Stat Methodol. 2009;71(2):319–92.
Lindgren F, Rue H, Lindström J. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J R Stat Soc Series B Stat Methodol. 2011;73(4):423–98.
Gaussian SF, Boosting P. J Mach Learn Res. 2022;23(232):1–46.
Hengl T, Nussbaum M, Wright MN, Heuvelink GB, Gräler B. Random forest as a generic framework for predictive modeling of spatial and spatiotemporal variables. PeerJ. 2018;6: e5518.
ZammitMangion A, Cressie N. FRK: an R package for spatial and spatiotemporal prediction with large datasets. J Stat Softw. 2021;98(4):1–48.
Sadoine ML, Smargiassi A, Ridde V, Tusting LS, Zinszer K. The associations between malaria, interventions, and the environment: a systematic review and metaanalysis. Malar J. 2018;17(1):73.
Rue H, Riebler A, Sørbye SH, Illian JB, Simpson DP, Lindgren FK. Bayesian computing with INLA: a review. Annu Rev Stat Appl. 2017;4(1):395–421.
RINLA Project. https://www.rinla.org/home. Accessed Oct 2021.
Bakka H, Rue H, Fuglstad GA, Riebler A, Bolin D, Illian J, et al. Spatial modeling with RINLA: a review. Wiley Interdiscip Rev Comput Stat. 2018;10(6): e1443.
GómezRubio V. Bayesian inference with INLA. Boca Raton: Chapman & Hall/CRC Press; 2020.
Wang X, Ryan YY, Faraway J. Bayesian regression modeling with INLA. Boca Raton: Chapman & Hall/CRC Press; 2018.
Kang SY, Battle KE, Gibson HS, Ratsimbasoa A, Randrianarivelojosia M, Ramboarina S, et al. Spatiotemporal mapping of Madagascar’s Malaria Indicator Survey results to assess Plasmodium falciparum endemicity trends between 2011 and 2016. BMC Med. 2018;16(1):71.
Moraga P. Geospatial health data: Modeling and visualization with RINLA and shiny. Chapman & Hall/CRC Biostatistics Series; 2019.
Lindgren F, Rue H. Bayesian spatial modelling with RINLA. J Stat Softw. 2015;63:1–25.
Kang M, Katzfuss M. Correlationbased sparse inverse Cholesky factorization for fast Gaussianprocess inference. Stat Comput. 2023;33(3):56.
Sigrist F. Latent Gaussian model boosting. IEEE Trans Pattern Anal Mach Intell. 2022;45(2):1894–905.
Sigrist F. GPBoost. GitHub; 2020. Github repository, https://github.com/fabsig/GPBoost. Accessed Oct 2021.
Quantile MN, Forests R. Quantile Regression Forests. J Mach Learn Res. 2006;7:983–99.
Hengl T, Nussbaum M, Wright MN. GeoMLA. GitHub; 2021. Github repository, https://github.com/thengl/GeoMLA. Accessed Sep 2021.
SainsburyDale M, ZammitMangion A, Cressie N. Modelling Big, Heterogeneous, NonGaussian Spatial and SpatioTemporal Data using FRK. arXiv preprint arXiv:211002507. 2021.
Roberts DR, Bahn V, Ciuti S, Boyce MS, Elith J, GuilleraArroita G, et al. Crossvalidation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 2017;40(8):913–29.
Likas A, Vlassis N, Verbeek J. The global kmeans clustering algorithm. Pattern Recognit. 2003;36(2):451–61.
Humphreys JM, Elsner JB, Jagger TH, Pau S. A Bayesian geostatistical approach to modeling global distributions of Lygodium microphyllum under projected climate warming. Ecol Modell. 2017;363:192–206.
Cressie N. Statistics for spatial data. John Wiley & Sons; 2015
ZammitMangion A, SainsburyDale M. Package ‘FRK’; 2023. Package documentation, https://cran.rproject.org/web/packages/FRK/FRK.pdf. Accessed Apr 2023.
Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. vol. 14. The MIT Press; 2006
Wang X, SmithMiles K, Hyndman R. Rule induction for forecasting method selection: metalearning the characteristics of univariate time series. Neurocomputing. 2009;72(10–12):2581–94.
Acknowledgements
The authors would like to thank Håvard Rue, Andrew ZammitMangion, Matthew SainsburyDale, Fabio Sigrist and Noel Cressie for their correspondence and help with setting up and troubleshooting models. This research was supported by The University of Melbourne’s Research Computing Services and the Petascale Campus Initiative. J.A. Flegg’s research is supported by the Australian Research Council (DP200100747, FT210100034). J.A. Flegg and N. Golding's research is supported by the National Health and Medical Research Council (APP2019093).
Funding
J.A. Flegg’s research is supported by the Australian Research Council (DP200100747, FT210100034). J.A. Flegg and N. Golding’s research is supported by the National Health and Medical Research Council (APP2019093).
Author information
Authors and Affiliations
Contributions
SK, NG, and JF conceptualized the study. NG and JF curated data. Formal analysis, software development, and visualization were carried out by SW and SK. Methodology was determined by all authors. The manuscript was written and edited by all authors.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
See Tables 3, 4, 5, 6 and Figs. 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21 and 22
Detailed crossvalidation results
Point predictions
Table 3 gives the results for 10fold and 50fold crossvalidation. For each set of folds it gives the Root Mean Square Error (RMSE), correlation coefficient between the predicted and actual values and the percentage of observations that have an absolute error \(\left(\left{\text{predicted}}{\text{actual}}\right\right)\) less than a specified threshold (thresholds of 0.05, 0.1 and 0.2 considered). As noted earlier, FRK and GPBoost have the best RMSE and correlation values for 10fold and 50fold crossvalidation respectively. SpRF gives the best performance in terms of the percentage of observations with absolute error less than 0.05 and 0.1. Compared to the other models, INLA performs poorly for the 10fold crossvalidation, with a higher RMSE and significantly lower correlation coefficient. However, it gets a high percentage of observations with absolute error less than the three thresholds. This is because a large number of observations have low prevalence values. This is further illustrated in Fig. 10, which shows the actual and predicted values using 10fold crossvalidation for each model.
Figure 10 shows the points by crossvalidation fold as determined in Fig. 2a As the folds are determined by \(k\)means clustering, observations in each fold lie close together. Data points in most folds have similar prevalence values, apart from the points in Folds 3, 8 and 9 which have a broad range of values. The points assigned to Fold 8 are difficult to predict for all four models. These points are along the coast near the city of Mombasa and are somewhat isolated from other clusters, which might be a contributing reason.
Figure 11 shows 50fold crossvalidation results for the four models while Fig. 12 shows their interval predictions. As shown in Table 3, GPBoost achieves better results in terms of RMSE and correlation. It has the same performance as INLA for the highest percentage of observations with absolute error less than 0.1. INLA has the highest percentage of observations with absolute error less than 0.05 and SpRF has the highest percentage of observations with absolute error less than 0.2. Figure 11 shows that certain folds perform poorly. These folds match with the locations of the poorly performing folds in the 10fold CV scenario. Another interesting observation is that while GPBoost achieves good results for both sets of folds, it performs poorly on high prevalence observations, whereas FRK and SpRF do not appear to have this limitation.
Point predictions by location density
These results are further analyzed using the density of sampled locations, i.e. do some models find it difficult to predict observations in low density regions? Fig. 13 shows the malaria prevalence and kernel density estimates of the sampled locations on two separate maps. Figure 14 shows scatter plots of prevalence and density with points coloured by the fold. Fold 8 in the 10fold CV, which is located around the city of Mombasa, has a broad range of prevalence values while having relatively low density. This explains the reason behind the high errors for Fold 8 (Fig. 10). When the sampled points are away from each other (low density) and the prevalence values have high variation, it is challenging for the models to predict accurately.
Figure 15 shows the absolute errors of the four models with respect to density for both 10 and 50fold crossvalidation. For 10fold CV, Fold 8 exhibits high error rates for all four models. Considering the same set of points for 50fold CV shows that while FRK and SpRF have similar error rates in both CV experiments (maximum \(\approx 0.75\)), INLA and GPBoost have comparatively lower error rates (maximum \(\approx 0.6\)). Thus, a higher number of folds benefits GPBoost and INLA in this instance more than it benefits FRK or SpRF.
Moving on to the points with very low density (\(< 0.2\)), Fig. 14 shows that the prevalence values of these points are relatively low. Figure 15a shows that INLA and SpRF have lower errors for these low density points compared to FRK and GPBoost. A similar outcome can be observed for the 50fold CV case in Fig. 15b when the density of points are less than 0.2.
As seen in Fig. 14, points in high density regions (\(> 0.4\)) have a higher variation in prevalence ranging from 0 to 0.75. The 10fold CV results in Fig. 15a show that FRK performs best for these high density points with an error \(< 0.4\), followed by GPBoost and SpRF. INLA performs poorly on these points for 10fold CV. For 50fold CV (Fig. 15b) GPBoost performs best on the high density points followed by SpRF, while both FRK and INLA perform similarly.
Table 4 gives metrics for different density groups for both 10 and 50fold CV. With low density defined as \({\text{density}}\le 0.2\), medium density as \(0.2<{\text{density}}\le 0.4\) and high density as \({\text{density}}>0.4\), the absolute errors of all four models are small for low density points, while the percentage of observations with absolute error less than 0.05, 0.1 and 0.2 are quite high for both 10 and 50fold crossvalidation sets. In both CV sets, INLA has the lowest RMSE with SpRF following closely. For both 10 and 50fold CV the correlation coefficients between the actual and the predicted values are negative. This indicates that most prevalence values are close to zero in low density locations. This would also explain why less points were sampled from those regions, as more points are generally sampled from high prevalence regions.
GPBoost, SpRF and FRK have higher RMSE for the medium density point set, compared to the high density point set for 10fold CV. A similar behaviour is observed for FRK for 50fold CV. This is due to the high absolute errors in Fold 8 as discussed previously. For the medium density points, GPBoost is preferred in terms of RMSE for both 10fold CV and 50fold CV. For high density points FRK is preferred for 10fold CV while GPBoost is preferred for 50fold. In terms of the percentage of points with absolute error less than 0.05 and 0.1, SpRF leads the other models for 10fold CV, while INLA leads for 50fold. For both 10 and 50fold CV, INLA and SpRF perform better on low density points compared to the other two methods, while GPBoost and FRK perform better on high density points.
Interval predictions
As described in "Results" section, each method has a different uncertainty quantification mechanism, however this analysis has estimated the standard deviation of predictions from each model to allow comparison.
Table 5 gives the interval prediction results for all four models. Figures 5 and 12 show the interval predictions for 10 and 50fold CV. For both 10 and 50fold CV, SpRF has on average the smallest uncertainty intervals and the smallest number of points within one or two standard deviations of the mean. Surprisingly SpRF’s interval widths are zero for 148 and 164 of the predictions for 10 and 50fold CV respectively. Each of these points correspond to a prediction (median) of zero prevalence, and the majority correspond to an observed prevalence of zero. However, the mean value of the response at nearly all of these locations is small but nonzero, and so the prevalence at these points does not lie within any number of standard deviations of the mean, contributing to the low percentages for SpRF in Table 5.
For both 10 and 50fold CV, GPBoost has the largest mean interval widths but smallest standard deviation, suggesting that it predicts consistently high width intervals for most observations. For 10fold CV, this results in the highest percentage of points lying within one or two standard deviations. For 50fold CV INLA has a higher percentage of points lying within each type of interval, which may be accounted for by the higher variation in INLA’s interval widths.
Figure 16 shows the kernel density estimates of the locations and the respective interval widths of the four models for both 10 and 50fold CV. For both 10 and 50fold CV, GPBoost has similar widths for all observations. There is a slight increase in width for low density points. However, there is not much variation in width with respect to the density. In contrast, FRK, INLA and SpRF have varying interval widths for different folds. There is also high variation for locations with high density, mostly likely because of the variation in prevalence. For both 10 and 50fold CV, FRK has relatively high width values for low density points. Conversely, INLA and SpRF have low width values for low density points. Similar to the point predictions, a high variation of width occurs for medium density points (\({\text{density}}\approx 0.3\)) for FRK, INLA and SpRF.
Effects of input noise on INLA
Figure 6(ai) shows INLA predicting a flat nearzero prevalence over most of Africa when trained on the observation data, a behaviour that is not replicated by fitting the model to either set of simulated data. This behaviour may be due to the noise in the observation data, which is visible in Fig. 3a particularly in Uganda. In contrast, the simulated data at the same locations, shown in Fig. 3b, appears much smoother.
This working hypothesis is examined by adding Gaussian noise to the simulated data. Prevalence values sampled from the MAP raster in Fig. 3d with locations based on the observation points were transformed using the logit function. Gaussian white noise with chosen standard deviations was added to the transformed values, before being brought back to values between 0 and 1 via the inverse logit. Binomial samples for the number of positive tests were then drawn using these prevalences and INLA was fit to this data.
Predictions from INLA fitted to data with three different levels of noise are shown in Fig. 17. Figure 18 shows posterior means and interquartile ranges for the intercept, range, and variance of the fitted models as the standard deviation of the added noise increases, as well as the time taken to fit each model and generate predictions.
As the standard deviation of added noise increases, both the intercept and spatial range fall, and predictions become less correlated between locations. The time taken jumps for standard deviations above 0.6 and presumably the model has difficulty converging. With greater noise, the model predicts a flat prevalence away from the simulated data dependent on the value of the intercept, and its output in Fig. 17c resembles the predictions in Fig. 6(ai) of the model trained on the observation data. These results suggest the presence of overdispersion, and that the model implemented with INLA may be misspecified. Indeed the model in Eq. (4) does not contain an independent error term. Methods to address this include adding an observational random effect to the model, or using a Betabinomial response.
Table 6 shows the crossvalidation results using the Kenya dataset for an INLAbased model using a Betabinomial response compared to the earlier model which used a binomial response. For both 10 and 50fold crossvalidation, the Betabinomial response leads to improved performance in terms of the root mean square error and the test correlation. For 10fold crossvalidation, the Betabinomial model still performs worse than the other three models in Table 3, however for 50fold crossvalidation it outperforms all but GPBoost.
INLA with a Gaussian response
Due to the overdispersion when fitting the INLAbased model to the observation data, an additional model using “INLA with a Gaussian response” was tested. Predictions from this model are shown in Fig. 19 and the absence of flat predictions suggests that this response is able to resolve the overdispersion.
Prediction uncertainty
Figure 20 shows the prediction uncertainties corresponding to each of the prevalence maps over Africa in Fig. 6.
GPBoost with the Vecchia approximation
As it uses a full Gaussian process, it is unsurprising that the GPBoost model shows the least favourable computational time for larger datasets. To improve efficiency, a Vecchia approximation is available in the software, which approximates the distribution of the response as
as per [46]. Here \({y}_{N\left(i\right)}\) is the subset of \(\{{y}_{1},\dots ,{y}_{i1}\}\) containing the \({m}_{v}\) nearest neighbours to \({y}_{i}\), where ‘nearest neighbours’ are determined by the distances between the responses' corresponding locations. The parameter \({m}_{v}\) determines the number of neighbours to use during fitting, while a separate parameter, \({m}_{v,p}\), controls the number of neighbours used for prediction. The above approximation of the density results in a sparse Cholesky factor of the precision matrix, with the number of selected neighbours impacting this sparsity and the accuracy of the approximation [58]. The approximation additionally requires a choice of ordering of the observed responses \(\{{y}_{1},\dots ,{y}_{n}\}\), which by default is taken to be a random ordering of the data.
Figure 21 shows GPBoost's predictions when using a Vecchia approximation with several values of \({m}_{v}\) and \({m}_{v,p}\). Uncertainty predictions were not produced as they are not currently well supported in the software when using the Vecchia approximation.
Applying the Vecchia approximation introduces artifacts to GPBoost's predictions. Figure 21a shows sharp discontinuities, which were prominent whenever low values of \({m}_{v}\) and \({m}_{v,p}\) were used. Experiments using the Kenya data suggested that the discontinuities could be prevented by increasing \({m}_{v}\) and \({m}_{v,p}\), however for the dataset on the continent scale there was a computational cost for doing so. Increasing \({m}_{v,p}\) from \(30\) to \(150\) while keeping \({m}_{v}\) fixed at \(30\) had a relatively small impact on the computation time, which surprisingly decreased from 11.09 min to 9.56 min, but the required memory jumped from 425 MB to 20,593 MB. Meanwhile, increasing both \({m}_{v}\) and \({m}_{v,p}\) to 150 greatly increased the computation time, requiring over 3.6 h to run, much longer than when the Vecchia approximation was not applied. Additionally, this model configuration required 20,722 MB of RAM. These examples suggest that increasing \({m}_{v}\) primarily increases the computation time required without affecting the RAM usage, while increasing \({m}_{v,p}\) increases the required RAM, with a smaller impact on computation time. Despite the increased computational requirements, neither adjustment to the parameters completely removed the discontinuities.
One benefit of using the Vecchia approximation is an improvement in scaling behaviour, even if the computational requirements on an individual dataset depend strongly on the choice of \({m}_{v}\) and \({m}_{v,p}\). Figure 22 shows the computational results from Fig. 7, with an additional plot for GPBoost using the Vecchia approximation. Parameters \({m}_{v}\) and \({m}_{v,p}\) were held fixed at \(30\) and \(150\) respectively, and the model shows favourable computation times compared to the full GP version of the model and SpRF, though is slower than INLA and FRK. These results highlight an obstacle to applying GPBoost to large scale data. Using the full Gaussian process can result in large computation times, while applying the Vecchia approximation introduces artifacts which require sacrifices in computational efficiency to remove.
GPBoost's computation times are amplified by the high value of the nrounds parameter, which has been set to \(247\) following available tutorials. As described in “GPBoostbased model”, this parameter controls the number of optimization steps during fitting. When fit to the malaria datasets used throughout this paper, the loglikelihood generally stopped increasing after 5 to 10 steps, suggesting that 247 training steps is unnecessarily high for the considered data and model. Figure 22 additionally includes the times taken for GPBoost with nrounds reduced to 5, and the scaling behaviour is greatly improved by this change, with the fastest computation time among all of the implemented models. Additional experimentation however found that reducing the number of rounds had little effect on the high RAM requirements for large values of \({m}_{v,p}\); something which may be necessary to minimize the discontinuities and noise in the predictions. Reducing nrounds would also improve the efficiency of the GPBoost model when no Vecchia approximation is used, however would not change the overall scaling behaviour.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wong, S., Flegg, J.A., Golding, N. et al. Comparison of new computational methods for spatial modelling of malaria. Malar J 22, 356 (2023). https://doi.org/10.1186/s12936023047607
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12936023047607