Comparison of new computational methods for spatial modelling of malaria

Background: Geostatistical analysis of health data is increasingly used to model spatial variation in malaria prevalence, burden, and other metrics. Traditional inference methods for geostatistical modelling are notoriously computationally intensive, motivating the development of newer, approximate methods for geostatistical analysis or, more broadly, computational modelling of spatial processes. The appeal of faster methods is particularly great as the size of the region and the number of spatial locations being modelled increase.

Methods: This work presents an applied comparison of four proposed 'fast' computational methods for spatial modelling and the software provided to implement them: Integrated Nested Laplace Approximation (INLA), tree boosting with Gaussian processes and mixed effect models (GPBoost), Fixed Rank Kriging (FRK) and Spatial Random Forests (SpRF). The four methods are illustrated by estimating malaria prevalence on two different spatial scales: country and continent. The performance of the four methods is compared on these data in terms of accuracy, computation time, and ease of implementation.

Results: Two of these methods, SpRF and GPBoost, do not scale well as the data size increases, and so are likely to be infeasible for larger-scale analysis problems. The two remaining methods, INLA and FRK, do scale well computationally; however, the resulting model fits are very sensitive to the user's modelling assumptions and parameter choices. The binomial observation distribution commonly used for disease prevalence mapping with INLA fails to account for small-scale overdispersion present in the malaria prevalence data, which can lead to poor predictions. Selection of an appropriate alternative such as the Beta-binomial distribution is required to produce a reliable model fit. The small-scale random effect term in FRK overcomes this pitfall, but FRK model estimates are very reliant on providing a sufficient number and appropriate configuration of basis functions. Unfortunately, the computation time for FRK increases rapidly with increasing basis resolution.

Conclusions: INLA and FRK both enable scalable geostatistical modelling of malaria prevalence data. However, care must be taken when using both methods to assess the fit of the model to data and the plausibility of predictions, in order to select appropriate model assumptions and parameters.


Introduction
Spatial proximity often plays an important role in governing the spread of geographic processes, including fluctuations in the prevalence or incidence of infectious diseases. Thus, geostatistical modelling is a widely used method in the epidemiological mapping of infectious diseases and their impacts. This is particularly evident in the field of malaria mapping, where statistical models that explicitly account for space have been used to map important epidemiological metrics over broad spatial extents, to compensate for the spatial sparsity of data. For example, predictive maps created using geostatistical models have been published for malaria prevalence (Weiss et al. 2019), mortality (Gething et al. 2016), and use of malaria interventions (Bertozzi-Villa et al. 2021). The code and supplementary material for this work are available at https://github.com/sevvandi/supplementary_material/tree/master/stcompare.
Methods

Diggle & Ribeiro Jr. (2007) start with the following basic geostatistical model, which does not have any covariates. They consider data given by (x_i, y_i) for i ∈ {1, ..., n}, where x_i denotes the spatial location (i.e. coordinates) and y_i is the measured value of the quantity of interest at that location (e.g. the incidence of malaria at x_i). They describe a model for normally-distributed response data with a stationary Gaussian process (one that tends back to the same average value over the whole analysis region) as follows (1):

1. {S(x) : x ∈ R²} is a Gaussian process with mean μ (the average value over the study region), variance or amplitude of the process at each location σ² = var{S(x)}, and correlation function ρ(u) = corr{S(x), S(x′)}, where u = ‖x − x′‖ and ‖·‖ denotes Euclidean distance (which controls the similarity of responses based on their distances apart);

2. conditional on {S(x) : x ∈ R²}, the y_i are realisations of mutually independent Gaussian random variables Y_i with means S(x_i) (i.e. after accounting for the spatial correlation, each y_i is independent and normally distributed).
Common choices of correlation function (termed covariance functions when they incorporate the variance term σ²) include the Matérn, exponential and squared exponential functions.
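To make these choices concrete, the sketch below (an illustrative Python sketch of our own, not part of any of the R packages discussed; σ² and ρ are hypothetical variance and range parameters) evaluates the exponential and squared exponential covariance functions, together with a Matérn of smoothness ν = 3/2, which has a simple closed form:

```python
import math

def exponential_cov(d, sigma2=1.0, rho=1.0):
    # sigma2 * exp(-d / rho): covariance decays exponentially with distance d
    return sigma2 * math.exp(-d / rho)

def squared_exponential_cov(d, sigma2=1.0, rho=1.0):
    # very smooth decay; sometimes called the Gaussian covariance
    return sigma2 * math.exp(-(d / rho) ** 2)

def matern32_cov(d, sigma2=1.0, rho=1.0):
    # Matérn with smoothness nu = 3/2 has a closed form, avoiding the
    # modified Bessel function needed for general nu
    a = math.sqrt(3.0) * d / rho
    return sigma2 * (1.0 + a) * math.exp(-a)
```

All three return σ² at distance zero and decay towards zero as the distance grows, with the range parameter ρ controlling how quickly.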
INLA, GPBoost and FRK can be considered as different adaptations of the basic model described in equation (1) by Diggle & Ribeiro Jr. (2007), while SpRF is a spatial version of the random forest algorithm, which has its roots in machine learning and takes a fundamentally different approach.
We use the following implementations of the four methods: 1. INLA: Integrated Nested Laplace Approximations, implemented in the R package INLA (Rue et al. 2009, Lindgren et al. 2011).
2. GPBoost: Tree boosting with Gaussian processes and mixed effect models, implemented in the R package gpboost (Sigrist 2022).
3. FRK: Fixed Rank Kriging, implemented in the R package FRK (Zammit-Mangion & Cressie 2021).
4. SpRF: Spatial Random Forests, implemented via the R package ranger (Hengl et al. 2018).
As the national-scale dataset we have selected Plasmodium falciparum prevalence data in Kenya from 2009, retrieved from the open-access portion of the Malaria Atlas Project malaria prevalence dataset. Kenya was selected as it had the highest number of surveys overall, with the most surveys occurring in 2009. Expanding to a continental scale, we have used the available surveys across Africa in 2009, keeping the same year between analyses. We used the R package malariaAtlas (Pfeffer et al. 2018) to download the malaria prevalence survey data. Since our aim here was to evaluate the performance of statistical models for malaria mapping, rather than to produce reliable maps per se, we did not apply any additional validation, correction, or selection on these datapoints. We extracted for each record only the spatial coordinates, the number of individuals screened, and the number of those individuals that were positive for P. falciparum.

INLA
INLA (Integrated Nested Laplace Approximations) is a method for approximate Bayesian inference which offers an improvement in speed over asymptotically exact methods such as MCMC. Instead of estimating a high-dimensional joint posterior distribution by simulation, INLA obtains approximations to univariate posterior marginal distributions of the model parameters. INLA is restricted to the class of models that can be expressed as latent Gaussian Markov random fields. However, a multitude of commonly used models can be expressed in this form, including generalised linear geostatistical models. This approach to inference pairs well with an approximation of the spatial Gaussian process as a Gaussian Markov random field (GMRF) over a discrete 'mesh' describing the study area, with piecewise linear interpolation to any locations that fall between nodes of this mesh. When the Gaussian process has a covariance function of the Matérn type, the stochastic partial differential equation (SPDE) representation of the GMRF can be used, which makes evaluation of the spatial process very fast for large spatial datasets compared with the full GP approach.
Over the years there have been many updates to INLA (Rue et al. 2017) to broaden its scope and facilitate diverse problem solving tasks. For more details we refer to their website https://www.r-inla.org/home.
Inference with INLA combines a series of assumptions and Laplace approximations to compute the marginal posteriors of model parameters and latent effects. INLA assumes that the response vector y depends on a vector of latent variables x and hyperparameters θ_1, with density π(y | x, θ_1). The latent variables may include, for example, the values of a linear predictor, an intercept, regression coefficients, and the values of any random effects. Importantly, x is assumed to be a mean-zero Gaussian Markov random field with precision matrix Q(θ_2) (the construction of Q for continuous spatial models is outlined in Bakka et al. (2018)), where θ_2 is a vector of hyperparameters. The hyperparameters are often combined into a single vector θ = (θ_1, θ_2) with prior distribution π(θ). INLA then approximates the marginal posteriors π(θ_j | y) and π(x_i | y) as follows.
The first step is to write the joint posterior of the hyperparameters as

π(θ | y) ∝ π(x, θ, y) / π(x | θ, y).

A Laplace approximation is applied to the denominator, replacing it with a Gaussian and giving the approximation

π̃(θ | y) ∝ π(x, θ, y) / π̃_G(x | θ, y), evaluated at x = x*(θ),     (2)

where x*(θ) is the mode of π(x | θ, y), and π̃_G(x | θ, y) is its Gaussian approximation. Approximate posterior marginals for the hyperparameters, π̃(θ_j | y), can then be obtained by numerically integrating π̃(θ | y) over the remaining hyperparameters. The exact marginals for the latent effects are approximated using numerical integration as

π̃(x_i | y) = Σ_k π̃(x_i | θ^(k), y) π̃(θ^(k) | y) Δ_k,     (3)

where π̃(θ | y) is as in (2) and π̃(x_i | θ, y) is an approximation of π(x_i | θ, y). INLA provides three primary methods for computing π̃(x_i | θ, y), termed the Gaussian, Laplace, and Simplified Laplace strategies, in addition to adaptive and automatic strategies. Each strategy applies Laplace approximations or series expansions to different conditional distributions, and has different trade-offs between efficiency and accuracy. For full details on these methods, see for example Rue et al. (2009), Gómez-Rubio (2020), and Wang et al. (2018).
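As a toy illustration of the Laplace step (our own one-dimensional Python sketch, not the INLA software's multivariate machinery), the following code approximates a log-density by a Gaussian centred at its mode, locating the mode by Newton's method with finite-difference derivatives:

```python
import math

def laplace_approx(log_density, x0, steps=100, h=1e-5):
    # Newton's method on the first derivative to find the mode, then use
    # the negative inverse curvature at the mode as the variance of the
    # approximating Gaussian.
    x = x0
    for _ in range(steps):
        g = (log_density(x + h) - log_density(x - h)) / (2 * h)
        H = (log_density(x + h) - 2 * log_density(x) + log_density(x - h)) / h**2
        step = g / H
        x = x - step
        if abs(step) < 1e-10:
            break
    H = (log_density(x + h) - 2 * log_density(x) + log_density(x - h)) / h**2
    return x, -1.0 / H  # mode and variance of the Gaussian approximation

# Example: unnormalised Gamma(shape=3, rate=2) log-density; the true mode
# is 1 and the curvature there gives variance 0.5.
mode, var = laplace_approx(lambda x: 2.0 * math.log(x) - 2.0 * x, x0=1.5)
```

INLA applies this idea to high-dimensional conditionals of the latent field rather than to a univariate density, but the mode-plus-curvature construction is the same.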
Predictions in INLA are carried out concurrently with model fitting, where the posterior predictive distribution of the response at each prediction location is computed (Gómez-Rubio 2020). The INLA software provides summary statistics including the mean, median, standard deviation and quantiles of the predictive distribution.

INLA model
We formulate a model similar to Kang et al. (2018) and Moraga (2019) to predict malaria prevalence using INLA. Let Y_i denote the number of positive results (in our case, malaria infections) and N_i the number of people screened at location x_i for i = 1, ..., n. Let p_i denote the modelled prevalence at location x_i, and p be the vector of modelled prevalences over all locations. Then we model Y_i using a binomial distribution as

Y_i ~ Binomial(N_i, p_i).     (4)

The standard link for the binomial distribution is the logit function, which maps the probabilities in [0, 1] to real values in (−∞, ∞). Thus we obtain

logit(p_i) = β_0 + S(x_i),

where β_0 denotes the intercept and S is a spatial random effect that follows a zero-mean Gaussian process with Matérn covariance function

cov(S(x_i), S(x_j)) = σ² / (2^(ν−1) Γ(ν)) × (κ‖x_i − x_j‖)^ν K_ν(κ‖x_i − x_j‖).     (5)

Here ν is the smoothness parameter, σ² denotes the variance and K_ν is the modified Bessel function of the second kind. The parameter κ controls how fast the correlation decays with distance.
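The role of the logit link can be sketched in a few lines (illustrative Python with hypothetical parameter values, not fitted quantities from our model): the link maps a prevalence to the real line, and its inverse recovers a probability from the linear predictor β_0 + S(x_i).

```python
import math

def logit(p):
    # maps a prevalence in (0, 1) to the real line
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    # maps a linear predictor back to a prevalence in (0, 1)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical intercept and spatial effect at one location:
beta0, s_i = -1.5, 0.7
p_i = inv_logit(beta0 + s_i)  # modelled prevalence at x_i
```

Given p_i and the number screened N_i, the observed number of positives is then treated as a Binomial(N_i, p_i) draw.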
We have based the implementation and parameter settings for our model on the examples available in Moraga (2019). The first step in setting up a model is to construct a triangular mesh on which the SPDE will be solved. The software constructs this mesh based on restrictions provided by the user, and it usually contains a region of smaller triangles near the data, surrounded by an extension of coarser triangles to avoid boundary effects (Lindgren & Rue 2015). When using the Kenya data, we have set the maximum triangle edge length to be 0.5 for the inner region and 4 for the extension. The cutoff parameter sets a distance under which points are grouped together when constructing the mesh vertices. We have set this to 0.01, and additionally have left the min.angle and offset parameters, which determine the minimum allowed angles in the triangles and the size of the extension, at their default values of 21 degrees and −0.1 respectively. When using the Africa data we have used a mesh on the unit sphere, and have converted the above parameter values to radians.
The smoothness parameter ν in the Matérn covariance function (5) must be chosen via the alpha parameter, where alpha = ν + d/2 and d is the dimension of the space (i.e. 2 for a spatial model). We have set alpha to its default value of 2. User settings additionally control the approximations during inference. We have used the default auto strategy for approximating π̃(x_i | θ, y). The int.strategy parameter then determines how the points θ^(k) are selected for the numerical integration in (3); we have used the faster empirical Bayes strategy, which selects a single point, namely the mode of π̃(θ | y), and therefore does not average predictions over uncertainty in the hyperparameters, as would typically happen in an MCMC inference procedure. We have opted to use the median of the predictive distribution for point predictions, though other quantities such as the mean are available.

GPBoost
GPBoost combines tree-boosting with Gaussian processes and mixed effects models. It aims to combine the advantages of tree-boosting algorithms, such as accounting for complex nonlinearities, discontinuities and higher-order interactions, with the versatility of Gaussian processes (Sigrist 2022). It also provides functionality for mixed effects models, in particular models with grouped random effects.
The general equation of a GPBoost model is given by

y = F(X) + ZS + ε,     (6)

where y = (y_1, ..., y_n)ᵀ and y_i is the response variable at location x_i. The matrix X ∈ R^(n×p) is the fixed effect predictor matrix, with the ith row containing covariates for location x_i. The fixed effects function of the covariates, F, is nonlinear and is learned with boosting. S ∈ R^n contains the random effects with covariance matrix Σ ∈ R^(n×n), while Z ∈ R^(n×n) is the random effect predictor variable matrix, which is typically used to define grouped random effects, and ε is an independent error term with variance σ².
In a Gaussian process model the random effects S = (S(x_1), S(x_2), ..., S(x_n))ᵀ are a finite-dimensional version of a Gaussian process S(x) with a covariance function

cov(S(x), S(x′)) = c(x, x′).

Here c is a covariance function often parameterised as

c(x, x′) = σ₁² r(‖x − x′‖ / ρ),

where r is an isotropic autocorrelation function with σ₁² = var(S(x)) and ρ is the range parameter which determines how quickly r decays with distance. GPBoost currently supports the exponential, Gaussian, Matérn, powered exponential, Wendland, and tapered exponential covariance functions. In a Gaussian process model, Z is usually encoded as a diagonal matrix, so that each element of S contains the spatial random effect for that location.
With its default settings, GPBoost does not apply approximations to the Gaussian process. For increased efficiency, Vecchia approximations are available in the software, which assume conditional independence between responses based on distances, resulting in sparse matrices during computations (Sigrist 2022).
Inference with GPBoost is carried out by jointly optimising the nonlinear fixed effects function F and the variance and covariance parameters θ (i.e. σ², σ₁², and ρ). In the Gaussian process case, the goal of the optimisation is to minimise the risk functional

R(F, θ) = L(y, F(X), θ),

where y = (y_1, ..., y_n)ᵀ are the observed responses at locations x_1, ..., x_n. Here, L(y, F(X), θ) is the negative log marginal likelihood of the observed responses y, given the observed covariate matrix X and model parameters θ, where the marginal covariance matrix is Ψ = Σ + σ²I_n. The risk functional is minimised by iteratively updating F and θ. At step m, F_(m−1) is held fixed and θ_m = argmin_θ L(y, F_(m−1)(X), θ) is computed using a gradient or quasi-Newton method. With this value of θ_m, F is updated via a single step of a boosting algorithm. After optimisation, GPBoost produces predictions in a similar manner to Gaussian process regression: the joint distribution of the observed and predicted responses is formed and conditioned on the observed responses, and the mean of the resulting conditional distribution is used as the predicted value of the response.
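For the Gaussian case, the negative log marginal likelihood above can be written out directly. The sketch below (illustrative Python with numpy, not GPBoost's internals; the covariance parameters are hypothetical) builds Ψ from an exponential covariance plus a noise variance and evaluates L(y, F(X), θ):

```python
import numpy as np

def exp_cov_matrix(coords, sigma2_1, rho):
    # pairwise exponential covariance: sigma2_1 * exp(-||x - x'|| / rho)
    coords = np.asarray(coords, dtype=float)
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    return sigma2_1 * np.exp(-d / rho)

def neg_log_marginal_likelihood(y, f, Psi):
    # L(y, F(X), theta) for y ~ N(F(X), Psi)
    r = np.asarray(y, dtype=float) - np.asarray(f, dtype=float)
    _, logdet = np.linalg.slogdet(Psi)
    return 0.5 * (r @ np.linalg.solve(Psi, r) + logdet + len(r) * np.log(2 * np.pi))
```

Holding F fixed, the parameters θ = (σ², σ₁², ρ) would then be updated by minimising this quantity, for example with a gradient or quasi-Newton routine, before the next boosting step updates F.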
As we are not using covariates in our model, tree boosting is used only to find the intercept. While this does neglect GPBoost's functionality for learning nonlinear functions of covariates, we have included GPBoost in the analysis for users who may wish to apply it in more complicated scenarios that may benefit from tree boosting.

GPBoost model
GPBoost supports Gaussian, Bernoulli-probit, Bernoulli-logit, Poisson, and Gamma distributions for the response variable; however, unlike INLA, it does not currently support a binomial response. We therefore model malaria prevalence by customising equation (6) as follows:

Y_i / N_i = β_0 + S(x_i) + ε_i,

where β_0 is the intercept, and Y_i and N_i denote the number of positive results and the number of people tested at location x_i. Note that for simplicity we elected to use the direct proportion of positive tests rather than the empirical logit, and we clip predictions to lie within [0, 1] for the prevalence maps. We use the exponential covariance function r(‖x − x′‖ / ρ) = exp(−‖x − x′‖ / ρ), which is the default choice in the software. We note that this model does not use GPBoost's full capability for learning nonlinear functions of the covariates; however, it has been constructed to be consistent with our choice to not use covariates for any of the models.
The parameter settings for our model follow examples by the package author (Sigrist 2020). For the spatial random effect in our model we use a full Gaussian process without approximation, setting the vecchia_approx parameter to FALSE. Other parameters in the software control the trees and boosting algorithm used to learn the fixed effects function F. We have set the number of boosting rounds to 247 and the learning rate to 0.01, using the parameters nrounds and learning_rate. Other settings for our model include num_leaves = 1024, max_depth = 6, and min_data_in_leaf = 5, each of which controls the size of the trees.

SpRF
Spatial Random Forests (SpRF) (Hengl et al. 2018) extend classical random forests to a spatial domain by using distances to observation points as explanatory variables, i.e. when fitting a model with SpRF, for each point x_i where y_i is given, covariates are used that give the distance to each other observation point. That is, the design matrix for this part of the model is simply the distance matrix between all pairs of observation locations. In order to obtain uncertainty estimates, the SpRF authors use quantile regression forests, which estimate specified quantiles of the conditional distribution Y_i | X_i (Meinshausen 2006), where X_i are the covariates for the ith response, in contrast to classical random forests which do not provide uncertainties.
The generic equation of an SpRF model is given by

y_i = f(d_i, R_i, P_i),

where y_i is the response at location x_i, d_i denotes a vector of the distances to each of the observation locations from the querying point x_i (including a distance of 0 to itself, in the ith position of the vector), and R_i and P_i denote two types of covariates: surface reflectance and process-based. The function f is learned by the random forest. Unlike the other methods we discuss, SpRF does not use a covariance function.
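The distance-based design matrix can be built in a few lines (an illustrative Python/numpy sketch; the implementation used in this paper is the R ranger package):

```python
import numpy as np

def distance_design_matrix(coords):
    # Row i holds the Euclidean distances from observation i to every
    # observation location (including 0 to itself, in position i).
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Each row of this matrix serves as the covariate vector d_i; the random forest is then trained with these distances as features.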
SpRF is based on the ranger package for random forests, which provides an implementation of quantile regression forests with the training procedure outlined in Meinshausen (2006). Point predictions are given by the estimated medians from the quantile regression forests.

SpRF model
As in Hengl et al. (2018), we include an additional normal assumption for the response to construct the simple SpRF model

y_i = f(d_i) + ε_i,  ε_i ~ N(0, σ²),

where y_i and x_i are as defined above and d_i contains the distances from each observation point to x_i.
The user parameters for SpRF determine the structure of the random forest and the rules for growing each tree, including the number of trees and the number of variables to split on at each node, via the num.trees and mtry parameters. We have left each parameter at its default value, resulting in a forest with 500 trees where each node splits on √q variables (q is the total number of variables input into the random forest). Other parameters which further tune the structure of the trees and forest have been left at their default values, and our code for SpRF is based on a tutorial from the method's authors (Hengl et al. 2021).

FRK
Fixed Rank Kriging (FRK) (Zammit-Mangion & Cressie 2021) is a spatio-temporal modelling framework built for large datasets. It uses a spatial random effects (SRE) model, which decomposes a spatially correlated mean-zero random process using a linear combination of spatial basis functions. This dimensionality reduction to a relatively small number of basis functions ensures FRK's computational efficiency. The spatial domain D is partitioned into N subsets, A_1, ..., A_N, called basic areal units (BAUs), with centroids x_1, ..., x_N. The SRE model is constructed on these BAUs, which determine the granularity of the model, and the process is assumed to be piecewise constant over the BAUs.
The general equation for FRK with a Gaussian response can be written as

Z_i = (C_Z Y)_i + e_i,  i = 1, ..., m,
Y_j = t(x_j)ᵀ α + w(x_j) + ξ(x_j),  j = 1, ..., N.

Here, Z_i, i = 1, ..., m are the responses at the observation locations, Y = (Y_1, ..., Y_N)ᵀ is the value of a latent spatial process evaluated at each of the BAUs with centroids x_1, ..., x_N, and C_Z is an m by N matrix connecting the observation locations to the BAU locations, with e_i an independent measurement-error term. The vector t(x_j) is a collection of covariates at BAU j and α is a vector of regression coefficients, while w(x_j) is the value of a small-scale, spatially correlated random effect. Lastly, ξ(x_j) is a fine-scale random effect, which is treated as uncorrelated across the BAUs (Zammit-Mangion & Cressie 2021). FRK introduces non-Gaussian data to the model by replacing the observation distribution with a member of the exponential family and using a link function to transform the latent process into a mean process. The general structure of such a model is

Z | μ_Z ~ EF(μ_Z, ψ),  g(μ(x)) = Y(x),     (7)

where ψ is a dispersion parameter for the context-dependent member of the exponential family EF, μ is called the mean process, and g(·) is the link function. We represent the mean process at the observation locations by μ_Z, while μ represents the mean process at the BAUs.
The spatially correlated random effect w(x) is decomposed as

w(x) = Σ_{k=1}^{r} φ_k(x) η_k,

where φ_1, ..., φ_r are a fixed collection of basis functions on the spatial domain, and η = (η_1, ..., η_r)ᵀ is an r-variate Gaussian random variable with covariance matrix K. To estimate model parameters, including the coefficients α, variance parameters for the fine-scale random effect ξ, and covariance parameters for the covariance matrix K, FRK carries out maximum likelihood estimation. When working with non-Gaussian data, a Laplace approximation is used to approximate the marginal likelihood, which is then maximised via a quasi-Newton method. By default, FRK produces a prediction for the mean process μ(·) at each of the BAUs. Predictions and uncertainties are generated via a Monte Carlo sampling approach, and the predicted value of μ in each BAU is taken to be the average of the samples.
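The basis-function decomposition of w(x) can be sketched as follows (illustrative Python; the Gaussian bumps, their centres, aperture, and coefficient values are hypothetical stand-ins for FRK's default Gaussian basis functions, and in FRK the coefficients η are random with covariance K rather than fixed):

```python
import math

def gaussian_basis(x, centre, aperture):
    # A Gaussian bump phi_k(x) centred at 'centre' with spatial scale 'aperture'
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-d2 / (2.0 * aperture ** 2))

def small_scale_effect(x, centres, etas, aperture):
    # w(x) = sum_k phi_k(x) * eta_k for one realisation of the coefficients
    return sum(gaussian_basis(x, c, aperture) * e for c, e in zip(centres, etas))
```

Because only r coefficients are estimated rather than a full n-by-n covariance, the cost of inference is governed by the number of basis functions, not the number of observations.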

FRK Model
As with INLA, we model the number of positive tests Z_i using a binomial distribution

Z_i ~ Binomial(N_i, p_i),

where p_i is the prevalence at the ith observation location. The vector p gives the prevalence at the BAUs, and is transformed into the prevalence at the observation locations via the C_Z matrix, whose construction is detailed in Sainsbury-Dale et al. (2021). The logit function is then used as the link function g in equation (7), i.e. logit(p_j) = Y_j at each BAU.
As we are not using any covariates, the latent process over the BAUs can be written as

Y_j = β_0 + w(x_j) + ξ(x_j),

where β_0 denotes the intercept.
Our model decomposes the spatial random effect w(x) using Gaussian basis functions at two different scales, placed regularly across the spatial domain, as controlled by the type, nres, and regular parameters respectively. The spatial scale of these basis functions is determined jointly by the regular parameter and the scale_aperture parameter, which we left at their default values of 1 and 1.25 respectively. The assumed correlation structure of the random coefficients η is controlled by the K_type parameter. When using a non-Gaussian model, this takes a default value of precision, which models the coefficient dependence using a precision matrix Q based on the Leroux model (Sainsbury-Dale et al. 2021).

Methods for the country scale analysis
At the country scale, the four models were compared qualitatively using their predictive maps, while cross validation was used to compare their predictive performance. To produce maps of predicted prevalence, we fit each model on all available P. falciparum prevalence surveys from Kenya in 2009 from the malariaAtlas R package. This consisted of 382 surveys at points across the country, which are shown in Figure 1(a). Point estimates of prevalence and uncertainties were produced by the fitted models on a grid over Kenya, with each cell covering a nominal 0.1 degrees (approximately 11 × 11 km at the equator) in longitude and latitude.
Model fitting and prediction were run on a 2014 MacBook Pro with a two-core, 2.8 GHz Intel Core i5 processor running macOS 10.13.6. We ran each model using a single thread to obtain a baseline performance comparison to accompany our main focus on the model predictions; we note that parallelisation options are available for each model which may provide performance improvements. Recorded times were measured as the total time to run a model's R script, including both fitting and prediction.
To evaluate the models we use spatial block cross validation (CV) (Roberts et al. 2017) with 10 and 50 folds. In a spatial setting, randomly allocating points to cross validation folds is not effective, because nearby points can act as proxies for one another. The folds were selected using k-means clustering (Likas et al. 2003) on the spatial coordinates of the prevalence surveys, resulting in a series of 'blocks' of spatially adjacent points. Figure 2 shows the location of points for the two sets of CV folds, where each colour represents a fold. The 10- and 50-fold CVs measure different abilities of the methods: the 50-fold CV quantifies short-scale interpolation ability, while the 10-fold CV quantifies the ability to interpolate over longer distances.
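Fold assignment by k-means on the survey coordinates can be sketched as below (an illustrative Python/numpy implementation of Lloyd's algorithm, not the R code used in the paper; k would be 10 or 50 as above):

```python
import numpy as np

def kmeans_folds(coords, k, iters=100, seed=0):
    # Cluster survey coordinates into k spatial blocks; the returned
    # labels are used directly as cross validation fold assignments.
    rng = np.random.default_rng(seed)
    pts = np.asarray(coords, dtype=float)
    centres = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((pts[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        new = np.array([pts[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels
```

Each fold (block) is then held out in turn, so that test points are spatially separated from the training points rather than interleaved with them.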
Using 10 and 50-fold cross validation, we investigate the following: 1. analysis of the point predictions including a comparison between the predictions and out-of-sample prevalence values using multiple measures, 2. analysis of the uncertainty bounds for each model, and 3. analysis of the predictions with respect to density of the sampled locations (Appendix A).
Analysis of uncertainties is complicated by the differing measures of uncertainty output by the models. INLA contains information on the summaries of the posterior marginal densities of the fitted model, and can compute the standard deviation and different quantiles of the predictions. GPBoost provides the variance of each prediction. FRK predicts the standard deviation of each prediction in the linear, Gaussian setting; for the non-Gaussian case, it provides the predictions using a Monte Carlo approach (Sainsbury-Dale et al. 2021). SpRF uses quantile regression, and the quantiles can be specified in the ranger package. To compare SpRF with the other methods, we assume a normally distributed response as in Hengl et al. (2018), and estimate the standard deviation for SpRF's predictions as SD ≈ IQR/1.34898. Hengl et al. (2018) note that this assumption may not always be valid, and hence we are only able to roughly compare the SpRF model's uncertainty with that of the other three models.
For each model we measure how many observed prevalence values lie within the predicted uncertainty intervals. Let ŷ_i denote the mean of the predicted response for observation y_i. We define

Within 1SD(ŷ_i) = 1 if y_i ∈ [ŷ_i − SD(ŷ_i), ŷ_i + SD(ŷ_i)], and 0 otherwise,

where SD denotes the standard deviation. As prevalence values are between 0 and 1, we trim the bounds if they exceed these limits. Note that ŷ_i corresponds to the predicted prevalence for our GPBoost and FRK models, but not for our INLA and SpRF models, which use the median for predictions.
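This coverage measure can be sketched as follows (illustrative Python; the IQR-to-SD conversion used for SpRF above is included as a helper):

```python
def iqr_to_sd(iqr):
    # rough normal-theory conversion used for SpRF: SD ~ IQR / 1.34898
    return iqr / 1.34898

def pct_within_k_sd(y, y_hat, sd, k=1):
    # Percentage of observed prevalences falling inside the predicted
    # interval [y_hat - k*SD, y_hat + k*SD], trimmed to [0, 1].
    hits = 0
    for obs, mean, s in zip(y, y_hat, sd):
        lo = max(0.0, mean - k * s)
        hi = min(1.0, mean + k * s)
        if lo <= obs <= hi:
            hits += 1
    return 100.0 * hits / len(y)
```

As noted above, a model can inflate this percentage simply by reporting wide intervals, so the measure is only meaningful alongside the accuracy metrics.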

Methods for the continent scale analysis
At the continent scale we focused on prediction maps, fitting each model to three sets of prevalence data over Africa. The first set consists of 868 P. falciparum prevalence surveys from 2009, available via the malariaAtlas R package. This data is shown in Figure 3(a), with survey points concentrated in Kenya and Somalia. Each model was additionally fit using two types of simulated data, to allow comparison of the predictions with a known truth, to compare model performance on both interpolation and extrapolation tasks, and to assess how properties of the data such as spatial sparsity and noise impact model predictions.
Simulated data was generated using the 2009 P. falciparum prevalence raster created by the Malaria Atlas Project (MAP), shown in Figure 3(d) (Weiss et al. 2019). Prevalence was sampled from the raster at the locations of the 2009 surveys and combined with the number of tests at each location to generate a binomial sample for the number of people testing positive. Of the 868 observation locations, 28 points lie on gaps in the prevalence raster and were excluded, leaving 840 points in this second dataset, which is shown in Figure 3(b). This spatially clustered simulated data allows us to evaluate each model's ability to extrapolate over regions with little or no data. While this dataset shares locations with the observation data, its prevalence values notably contain less noise.
The second set of simulated data was generated by selecting 1000 points at random on the MAP raster, allowing for comparison of the models' interpolation performance when trained on data with good spatial coverage. A binomial sample for the number of positive tests was generated at each location, where the number of people tested was set to 85, approximately the average number in the surveys from 2009. The prevalences from this simulated dataset are shown in Figure 3(c).
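The simulation step can be sketched as follows (illustrative Python; the prevalence values here are hypothetical stand-ins for values sampled from the MAP raster):

```python
import random

def simulate_positives(prevalences, n_tested=85, seed=42):
    # For each location, draw the number of positive tests as a
    # Binomial(n_tested, p) sample, with p taken from the raster.
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n_tested))
            for p in prevalences]
```

Because the true underlying prevalence surface is known for these simulated datasets, the model predictions can be compared against it directly.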
We have used the same parameter settings as in the country scale analysis, though wherever possible we have used settings that compute an appropriate spherical distance between points, due to the larger spatial extent of the data. This was possible for SpRF, which uses great circle distances, and for INLA, which allows meshes to be constructed on the unit sphere. GPBoost does not have this functionality at the time of writing; however, correspondence with the package authors indicates that they hope to add it in future. While FRK does support using great circle distances for some models, this feature is not currently well supported for models with a binomial response and did not work in our tests. Hence we have used Euclidean distances between coordinates for both GPBoost and FRK.
Using INLA with a spherical geometry requires a mesh to be built on a subset of the sphere. Although several methods for constructing this mesh are used in the literature (Lindgren & Rue 2015, Bakka et al. 2018, Humphreys et al. 2017), each produced similar results, and we have followed the method outlined by Lindgren and Rue (2015).
Model fitting and prediction were carried out on a single 3.00 GHz Intel(R) Xeon(R) Gold 6154 CPU in the Physical partition of the University of Melbourne's high performance computing cluster, Spartan, and each model run was allocated 32 GB of RAM. As with the country scale data, each model was run using a single thread. Predictions were produced on a grid with cell side length 0.15 degrees (approximately 16.7 km at the equator).

Results
This section presents the analysis for Kenya, which includes the predictive maps and cross validation results, and the continent scale analysis, in which models were trained on the three different input datasets discussed above.

Case study: Kenya
At a national scale we have used two means of verifying our models: 1. predictive maps, and 2. 10-fold and 50-fold cross-validated predictions. To produce the maps we have used all the data, while for cross validation we have left out some data in each fold.

Predictive maps
The predictions and uncertainties produced by the four models when trained on the 2009 Kenya prevalence data are shown in Figure 4. At the broadest scale, each model is similar in predicting a region of high prevalence in Western Kenya, with clusters of higher prevalence in the East, but low prevalence over much of the rest of the country. For each model, the predicted prevalence drops to zero quite quickly away from the data, indicative of a smaller spatial range than might be expected. This is especially prominent with INLA, and may be indicative of overdispersion in the data.
A notable feature is the arc-like band of higher prevalence in the north west of Kenya in SpRF's predictions in Figure 4(c), which is further discussed in Section 3.2. Higher prevalence in this region is also somewhat apparent in GPBoost predictions and, to a lesser extent, FRK. This area of predicted higher prevalence falls in a broad region with no prevalence data and so represents different approaches to extrapolation in the four models. Table 1 gives the cross validation results. In terms of cross validation RMSE and correlation, FRK performs the best for 10-fold CV, and GPBoost performs the best for 50-fold CV. SpRF predictions had the highest correlations to the data used to train the model, but poorer correlation to out-of-sample data, indicating that this model may be overfitting to the training data. INLA performs poorly with respect to the 10-fold RMSE and correlation.

Cross-validation results
From Table 1 we see SpRF has only 37.105% of the points within 1SD for 10-fold cross validation, which is much lower than for the other models; we discuss this further in Appendix A. GPBoost performs the best in terms of the percentage of points within 1SD. However, these results need to be taken in context, because a higher standard deviation can increase this percentage.
Figure 5 shows the interval and point predictions for the four methods for 10-fold cross validation. Points within 1SD are shown in green, points within 2SD are shown in blue, and the rest are shown in red. We see that many of INLA's predictions are close to zero, and we investigate this issue in Appendix B. Additional cross validation metrics that consider the prediction error divided by its standard deviation are detailed in Cressie (2015). However, we do not consider these metrics, as some methods produce very small standard deviations and would thus result in very large values. More details on cross validation results in terms of the clusters and density of locations are given in Appendix A.
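The within-1SD and within-2SD percentages reported here amount to a simple coverage calculation over the held-out points. A minimal sketch (Python; the function name and toy inputs are our own, not from the packages compared):

```python
def sd_coverage(y_true, y_pred, y_sd, k=1):
    """Fraction of held-out observations whose true value lies within k
    predictive standard deviations of the predicted value."""
    hits = sum(abs(t - p) <= k * s for t, p, s in zip(y_true, y_pred, y_sd))
    return hits / len(y_true)

# Toy example: three held-out prevalences with predictions and standard deviations.
y_true = [0.10, 0.20, 0.90]
y_pred = [0.12, 0.18, 0.50]
y_sd = [0.05, 0.015, 0.30]
```

Note that uniformly inflating the standard deviations can only raise this coverage, which is why the percentages must be read alongside the interval widths.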

Continent scale results
Geostatistical mapping is often carried out at a continent or global scale and frequently uses large datasets of observations. As computation time for inference and prediction can scale poorly with the amount of data and the size of the domain, it is important to assess the performance of recent methods, both in terms of predictive power and time. To examine how each of the four methods performs at larger scales, and to understand how predictions are affected by potential violations of model assumptions and by clustering and sparsity of observation points, we expanded the study area to the whole continent of Africa in 2009, and fit each model to the three prevalence datasets shown in Figure 3. Figure 6 shows the prevalence predicted by the models when trained on each input dataset. The corresponding uncertainties appear in Appendix C.
Scaling up to the continent level reveals differences between the models that are not apparent at smaller scales. While the national scale prevalence maps in Figure 4 are largely similar apart from the slight banding effect seen in SpRF's predictions, the prevalence maps in Figure 6 differ significantly, with artifacts appearing in several of the maps. Overall, the four models are better at local interpolation than extrapolation over large regions without data. The predictions in Figure 6(iii) generated using the randomly distributed data recover the prevalence structure of the MAP raster in Figure 3(d) much more faithfully than the predictions in Figure 6(ii) from the sparser non-uniform data. This behaviour is expected, as malaria prevalence is known to be highly heterogeneous and our models do not use covariate data.
SpRF's predictions display a prominent banding effect, visible in both the country and continent scale maps, where contiguous arc-like bands of high prevalence appear in both point and uncertainty estimates. This may be explained by the fact that SpRF models the quantity of interest (malaria prevalence in our case) based on distances to points with known values. Thus we observe bands of high or low prevalence at different radii from clusters of observations, and the piecewise constant nature of random forests would contribute to the sharp steps between each band. The banding effect is particularly prominent in Figures 6(ci) and (cii), where the points are clustered into smaller regions, while it is less obvious when the datapoints have good spatial coverage, as in Figure 6(ciii), which does not show bands spanning the continent. Further increasing the number of simulated points was found to further reduce the prominence of these bands (results not shown).
Even though SpRF produces maps with this unwelcome feature, the cross validated point estimates are quite accurate. From Table 3 in Appendix A we see that SpRF has the highest proportion of points with absolute errors less than 0.05 and 0.1 for 10-fold cross validation, which is a harder task for the algorithms than 50-fold cross validation. Thus, we can argue that SpRF gives reliable predictions at points even though it may produce a predictive map that can be misleading in regions where there are no sample points. Figure 6(ai), produced by INLA with the observation data, displays a sudden drop in prevalence away from observations, resulting in flat near-zero predictions covering most of the continent. This appears to result from a combination of the sparsity and noise present in the data, rather than the clustered nature of the data alone. Figure 6(aii) uses nearly the same locations, yet shows higher values of prevalence spreading much further from the observations. In Appendix B we outline evidence that this effect arises from unaccounted-for overdispersion in the observation data. In particular, increased noise in the data appears to reduce the estimated range for the spatial random effect, resulting in the model reverting to constant predictions away from observation locations. This behaviour is consistent with INLA's poor performance in the 10-fold cross validation analysis in Section 3.1.2, where the model predicted near-zero malaria prevalence for each of the held-out folds.
We observe that FRK's predictions depend strongly on the arrangement of the basis functions, which are generally placed by the software based on the data locations and the user parameters introduced in Section 2.4.1. For example, FRK's predictions in Figures 6(di) and (dii) display spurious oscillations in regions with little or no data; these oscillations correspond to the periodic placement of the basis functions. Other arrangements we tested led to flat predictions over the whole continent (results not shown). These types of artifacts are not present in Figure 6(diii), where the input data has good spatial coverage. However, this map appears as a smoothed version of the input data, and does not resolve the finer structure in the MAP surface. The impact of the arrangement and number of basis functions on the prediction maps is detailed further in Section 3.4.2.
For both Kenya and Africa, GPBoost produces prevalence maps without the artifacts appearing in the other models' outputs. However, the uncertainty maps in Figures 20(ci)-(cii) in Appendix C exhibit a high level of overall uncertainty, regardless of whether the regions have more survey points or not. This is further confirmed by the near-constant interval widths that rarely fluctuate with the density of the survey points in Figure 16 (Appendix A). Even though GPBoost currently computes only Euclidean distances between coordinates, the prevalence maps for both Africa and Kenya appear to be reasonable. However, it is sub-optimal to use Euclidean distances between longitude and latitude coordinates for a global model.

Computational results
Times taken to run each model on each of the datasets are shown in Table 2. While FRK is consistently the fastest, INLA shows great variation among the African datasets, ranging from less than 10 minutes with the uniform simulated data to 69.11 minutes with the observation data. Further analysis of this variation for INLA is given in Appendix B.
Table 2: Times taken in minutes to train the models on each dataset and generate the prediction maps. The Kenya: Observations column corresponds to the maps in Figure 4. The Africa: Observations, Africa: Simulated observations, and Africa: Simulated uniform columns correspond to columns (i), (ii), and (iii) of Figure 6 respectively. Note that different machines were used to run the models for the Kenya and Africa datasets. Each model was additionally trained on simulated prevalence datasets with 1000–10000 points selected at random. Figure 7 shows the times taken to fit each model and produce predictions as the dataset size varies. Both INLA and FRK remained very fast on larger datasets, showing little variation in their times. In contrast, SpRF's time appears to increase linearly with dataset size, and GPBoost rapidly slows down on larger datasets, reflecting the computational requirements of using an unapproximated Gaussian process. As noted in Section 2.2, a Vecchia approximation is available with GPBoost to improve the computational efficiency. In Appendix D, we examine the effects of applying this approximation on computation and model predictions.

Sensitivity of FRK and INLA to parameter choices
Of the four methods, INLA and FRK show promise in their computational efficiency, displaying favourable scaling compared to SpRF and GPBoost in Figure 7. Additionally, while the artifacts in SpRF's output appear to stem from the way it uses distances as an input, it is less clear whether the artifacts in INLA and FRK's prevalence maps are due to specific model parameters, or are fundamentally caused by the approximations used by each method. For this reason we examine these two methods more closely and test the sensitivity of their predictions to the model parameters.

INLA
The primary artifact visible in INLA's prediction maps is the flat, near-zero predictions when the model is fit to the observation data, as shown in Figure 6(ai). Appendix B outlines evidence that this feature is due to overdispersion, suggesting that the binomial response is unsuited for modelling the variability in the observed malaria data, despite commonly being used in tutorials on the application of INLA to disease mapping problems.
Several adjustments can be made to the model to address this overdispersion, such as the use of a Beta-binomial or Gaussian response (either directly on the proportion positive, or on its empirical logit transform), or the inclusion of an independent error term in the linear predictor. All of these options include an additional parameter in the model to capture error variance at the level of the observation. Figure 8 shows predictions from an INLA model with a Beta-binomial response fit to the observation data. The flat predictions of Figure 6(ai) are notably absent, suggesting that the Beta-binomial is effective in resolving the overdispersion. A Gaussian response was additionally tested and was also found to handle the variability in the observation data, with results shown in Appendix B.1. These results highlight the need for caution when applying INLA with a binomial response to disease mapping problems, and the importance of checking for overdispersion.
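The extra flexibility of the Beta-binomial can be seen in a small simulation: drawing the success probability from a Beta distribution before each binomial draw inflates the count variance beyond the binomial's fixed n·p·(1−p). A hedged Python sketch (our own parameterisation via an intra-class correlation rho; this is not the parameterisation used by any of the packages compared):

```python
import random

def beta_binomial_sample(n, p, rho, rng):
    """Beta-binomial draw: a binomial count whose success probability is itself
    Beta-distributed. rho in (0, 1) controls overdispersion; the count variance
    is n*p*(1-p)*(1 + (n-1)*rho), versus n*p*(1-p) for the plain binomial."""
    alpha = p * (1 - rho) / rho
    beta = (1 - p) * (1 - rho) / rho
    q = rng.betavariate(alpha, beta)
    return sum(rng.random() < q for _ in range(n))

rng = random.Random(42)
draws = [beta_binomial_sample(50, 0.2, 0.3, rng) for _ in range(2000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
# the mean stays near n*p = 10, but the variance far exceeds
# the plain binomial's 50 * 0.2 * 0.8 = 8
```

The extra variance parameter is exactly what the binomial response lacks when the prevalence surveys are noisier than binomial sampling alone would imply.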

FRK
While the fastest of the four methods, FRK's continent-scale predictions display a spurious "spotty" pattern when fit to either of the spatially sparse datasets, and a much less detailed map when fit to the simulated data at randomly selected locations (Figure 6). These features appear to stem from FRK's use of a small number of basis functions in approximating the Gaussian process. In this section, we examine whether increasing the number of these functions can resolve the artifacts in FRK's outputs.
The number of basis functions used in FRK's approximation is primarily controlled by the nres and regular parameters. Increasing the nres parameter adds an additional "resolution", or layer of basis functions with a finer spatial scale, while increasing the value of regular reduces the scale of each basis function and adds additional rows and columns to their arrangement. Details on the effects of these parameters are available in the software documentation (Zammit-Mangion & Sainsbury-Dale 2023). The model used throughout Sections 3.1–3.3 had these parameters set to nres = 2 and regular = 1.
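Conceptually, FRK represents the spatial random effect as a weighted sum of local basis functions, Y(s) ≈ Σ_k φ_k(s) η_k, so the number and scale of the φ_k limit the detail the model can express. A simplified one-dimensional sketch using bisquare basis functions (our own toy code; FRK's actual basis construction and placement are handled internally by the package):

```python
def bisquare(d, r):
    """Bisquare basis function of distance d with aperture r (zero beyond r)."""
    return (1 - (d / r) ** 2) ** 2 if d < r else 0.0

def basis_matrix(locations, centres, r):
    """Evaluate every basis function at every location: the matrix S in Y ~ S @ eta.
    More centres (a finer resolution) means finer detail but larger matrices."""
    return [[bisquare(abs(s - c), r) for c in centres] for s in locations]

# Three coarse basis functions evaluated at five locations on [0, 1].
S = basis_matrix([0.0, 0.25, 0.5, 0.75, 1.0], centres=[0.0, 0.5, 1.0], r=0.6)
```

In this picture, increasing nres corresponds to appending a denser layer of narrower basis functions (more columns of S), which is why both memory use and fitting time grow quickly with the basis resolution.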
Figure 9(a) shows FRK's predictions when fit to the observation data with nres increased to 3 and regular left at 1, which resulted in a model with 1338 basis functions. Whilst the broad-scale spottiness is less prominent in this model, finer-scale oscillation is quite visible in regions of Central Africa. This modest improvement came at a significant computational cost, as the model took over 55 minutes and required 106GB of RAM, compared to the 4.77GB of RAM and 3.28 minutes required when nres was set to 2 and regular was set to 1. Figure 9(b) shows the predictions when nres is kept at 2 and regular is increased to 2. These settings resulted in 600 basis functions, and required 22.74GB of memory and 9 minutes to run, significantly less than when increasing the nres parameter. However, the fine-scale oscillation is noticeably more pronounced in areas with little or no data.
While the key to FRK's computational efficiency is its decomposition of the spatial random effect into a small number of basis functions, these results suggest that it is challenging in practice to balance this efficiency against the risk of artifacts appearing in the model output, especially on large-scale mapping problems with sparse data.

Discussion
Our applied comparison of four geostatistical modelling methods found that two of them (SpRF and GPBoost) are not sufficiently scalable or accurate to be applicable to large-scale malaria prevalence modelling problems. SpRF's spatial predictions displayed a prominent 'banding' artifact, and at first glance SpRF models appeared to be overfitted (matching closely to training data, but making poor predictions on held-out data). However, on closer inspection (Appendix A) we see that SpRF's tighter uncertainty intervals in low density regions may give rise to this perception. Unlike the other methods, SpRF does not incorporate a covariance function, but instead treats the columns of the distance matrix between coordinates as covariates for inclusion in the random forest. We note that a covariance function could in fact be applied to the distance matrix before inclusion in the model. This would not have the same interpretation as in the Gaussian-process based models we consider, but would enable SpRF to model a distance-based decay in the unobserved spatial effects. However, the random forest inference machinery in SpRF would have no means to estimate the parameters of such a function (such as the rate of decay with distance), and it seems unlikely they could be reasonably specified in advance. Due to these issues of fit, and the fact that the computation time of SpRF scaled approximately linearly with the size of the data, this approach is unlikely to be useful for applied geostatistical modelling of malaria data.
GPBoost made reasonably good predictions on held-out data, being the best-performing model at 50-fold spatially-blocked cross-validation in the national-scale comparison (implying a good ability to extrapolate over short distances) and the second-best, behind FRK, at 10-fold cross validation (ability to extrapolate over longer distances). However, the computation time using the default GPBoost specification scaled very poorly with increasing data size. This is because, by default, GPBoost performs inference on the full (unapproximated) Gaussian process, with each step of the inference procedure requiring an O(n³) inversion of the covariance matrix. Neither the maximum-likelihood inference of GP hyperparameters nor the boosting inference on the intercept (and covariate effects, if used) reduces this computational burden. Even deploying the Vecchia approximation to the Gaussian process provided by GPBoost did not resolve these issues, as shown in Appendix D. The Vecchia approximation resulted in faster, but still linearly increasing, computation times, and also produced severe artifacts in the model predictions. Increasing the complexity (number of neighbouring points to consider) in the approximation resolved these artifacts, but at the cost of a substantial increase in the required computation time and RAM usage. We were unable to determine a combination of these and the boosting parameters that would reduce computation time to a level comparable to INLA and FRK. We also note that GPBoost is a relatively new technique, and future versions may include faster approximations.
Both INLA and FRK offered substantially better scalability to increasing data size than SpRF and GPBoost, taking only minutes to fit to 10,000 datapoints. Whilst it was computationally scalable, and is a widely established method and software for geostatistical modelling of malaria data, implementing INLA using the commonly suggested binomial distribution for prevalence data (e.g. as suggested in Moraga (2019) and Moraga et al. (2021)) resulted in spurious predictions and poor ability to extrapolate in both the 10-fold and 50-fold cross-validation tests. We have demonstrated that this is because the malaria prevalence data being modelled are overdispersed relative to the binomial sampling assumption and the spatial-only model. That is, the assumption is violated that the infection status of each individual in a given sample is independent of the others, given the estimated prevalence at that location. This should not be surprising from an epidemiological perspective, given that infections in a given place do not arise independently: each infection is caused by another. This gives rise to local noise, either at the level of a pixel or group of pixels (that particular location may have some risk factor not accounted for by the smooth spatial model), or at the level of the observation (on the day of sampling, that population may have had a higher or lower than usual prevalence). INLA's behaviour in this case is an attempt to capture these small-scale variations with a very 'wiggly' spatial random effect, i.e. one with rapid decay with increasing distance. It favours this parameter configuration on overdispersed data because the observation variance is fixed when using a binomial likelihood, and that variance is not sufficiently large to explain the data. This issue of poor identifiability between the observation-level variance and the lengthscale of a Gaussian process has previously been described (Rasmussen & Williams (2006), see Figure 5.4), and can be resolved in classical (and model-based) geostatistics with the use of an independent 'nugget' effect, either on each observation or each observed location (Diggle & Ribeiro Jr. (2007)). Despite also using a binomial observation distribution, FRK does not suffer the same pitfall, because it includes a type of spatial nugget effect in its 'small-scale' effect parameter.
For malaria prevalence modelling with INLA, we suggest that a more reliable 'default' model than the standard binomial observation model would be one which includes additional observation-level random noise. This can be achieved by using a Beta-binomial or Gaussian response (on the observed prevalences or on the empirical-logit scale). Both of these options have an additional observation-level variance parameter that can be used to explain the overdispersion relative to the binomial. Of these, the Beta-binomial is most likely to be generally applicable to malaria prevalence data, since it is able to accurately account for observation errors in the common situation where only very few of the individuals tested are infected. We note, though, that fitting with a Gaussian response is substantially more computationally efficient in INLA, and so may be preferable if computation time is a major constraint. An alternative approach would be to include an independent observation-level random effect in the model specification.
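For the Gaussian-on-empirical-logit option, a standard form of the transform adds a 0.5 offset to the numerator and denominator so that surveys with zero or all positives remain finite. A small sketch (the 0.5 offset is one common convention, not necessarily the one a given package uses):

```python
import math

def empirical_logit(n_positive, n_tested):
    """Empirical logit of an observed prevalence; the 0.5 offsets keep the value
    finite when n_positive is 0 or equal to n_tested."""
    return math.log((n_positive + 0.5) / (n_tested - n_positive + 0.5))
```

Without the offset, a survey with 0 positives out of 10 tested would map to negative infinity; with it, every survey yields a finite value suitable for a Gaussian response.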
Whilst FRK scaled well to large datasets (generally taking slightly less time than INLA) and performed well in both the 10-fold and 50-fold extrapolation comparisons, for continental-scale modelling we were unable, even with modest model modification, to specify the model in such a way that it was both computationally scalable and avoided the spurious oscillating effect of the basis functions. Whilst less noticeable, similar patterns are visible in the national-scale analysis in parts of North-Western and far North-Eastern Kenya, where no data are available to inform such a prediction. Given these issues, we believe significant care must be taken when applying FRK to mapping of sparse malariometric data, to avoid spurious predictions that are driven by computationally convenient approximations rather than by data.
Comparing four methodologically different techniques has its limitations. One such limitation is that the inherent differences between the methods make a comparison somewhat difficult. For example, the likelihoods differ, as do the underlying model structures and/or covariance functions. Thus, each method has its own measures, and we cannot compare an INLA goodness-of-fit measure with that of SpRF, and vice versa. We have mitigated this problem by focusing on the outputs: predictive maps and cross validation results.
Another aspect of interest is the parameter settings. There are many different parameter settings for each method. We have selected the commonly used (default) parameter settings in this work, and even though we have explored several different parameter settings, we have not conducted a comprehensive exploration of the parameter space of these algorithms. While the default parameter settings were acceptable for Kenya, we expect that the algorithms can benefit from customised parameters when running the models at the scale of Africa. The limitations of the choice of parameters are brought to light by the extent of the geographical region. Exploring optimal parameter selection is another avenue of research. Furthermore, there might be other parameter settings that would make the inference approximations of the different models more comparable.
From a practitioner's point of view, it is challenging to adopt a new method for spatial modelling, mostly because it takes a long time to learn the methodology and write code to produce meaningful output. This is a significant barrier to entry. If the methods discussed provided tuning functions that explore the parameter space and select a set of parameters that enables the practitioner to build a good model, it would increase the usability of these methods.
An in-depth investigation of strengths and weaknesses of the models would be another avenue of interest.
One option is to construct a meta-model that can predict the best model based on features of different locations (Wang et al. 2009). Such a meta-model could combine the strengths of the diverse models to make a stronger prediction. The findings of this paper should be of use to those creating, interpreting or working with spatial data, as a baseline comparison of new computational geostatistical models.
Table 3 gives cross validation metrics for the four models (absolute error thresholds of 0.05, 0.1 and 0.2 considered). As noted earlier, FRK and GPBoost have the best RMSE and correlation values for 10-fold and 50-fold cross validation respectively. SpRF gives the best performance in terms of the percentage of observations with absolute error less than 0.05 and 0.1. Compared to the other models, INLA performs poorly for the 10-fold cross validation, with a higher RMSE and significantly lower correlation coefficient. However, it achieves a high percentage of observations with absolute error less than the three thresholds. This is because a large number of observations have low prevalence values. This is further illustrated in Figure 10, which shows the actual and predicted values using 10-fold cross validation for each model.
Figure 10 shows the points by cross validation fold, as determined in Figure 2(a). As the folds are determined by k-means clustering, observations in each fold lie close together. We see that data points in most folds have similar prevalence values. However, data points in Folds 3, 8 and 9 have a broad range of values. The points assigned to Fold 8 are difficult to predict for all four models. These points are along the coast near the city of Mombasa and are somewhat isolated from other clusters, which might be a contributing reason. Figure 11 shows the 50-fold cross validation results for the four models, while Figure 12 shows their interval predictions. From Table 3 we see that GPBoost achieves better results in terms of RMSE and correlation. It has the same performance as INLA for the highest percentage of observations with absolute error less than 0.1. INLA has the highest percentage of observations with absolute error less than 0.05, and SpRF has the highest percentage with absolute error less than 0.2. From Figure 11 we see that certain folds perform poorly. These folds match the locations of the poorly performing folds in the 10-fold CV scenario. Another interesting observation is that while GPBoost achieves good results for both sets of folds, it performs poorly on high prevalence observations, whereas FRK and SpRF do not appear to have this limitation.

A.2 Point predictions by location density
We further analyse these results using the density of sampled locations, i.e. do some models find it difficult to predict observations in low density regions? Figure 13 shows the malaria prevalence and kernel density estimates of the sampled locations on two separate maps. Figure 14 shows scatter plots of prevalence and density with points coloured by fold. For 10-fold CV, we see that Fold 8, which is around the city of Mombasa, has a broad range of prevalence values while having relatively low density. This explains the high errors for Fold 8 (Figure 10). When the sampled points are far from each other (low density) and the prevalence values have high variation, it is challenging for the models to predict accurately. Figure 15 shows the absolute errors of the four models with respect to density for both 10 and 50-fold cross validation. For 10-fold CV we see that Fold 8 exhibits high error rates for all four models. If we consider the same set of points for 50-fold CV, we see that while FRK and SpRF have similar error rates (maximum ≈ 0.75), INLA and GPBoost have comparatively lower error rates (maximum ≈ 0.6). Thus, a higher number of folds benefits GPBoost and INLA in this instance more than it benefits FRK or SpRF.
Moving on to the points with very low density (< 0.2), we see from Figure 14 that the prevalence values of these points are relatively low. In Figure 15(a) we see that INLA and SpRF have lower errors for these low density points compared to FRK and GPBoost. A similar outcome can be observed for the 50-fold CV case in Figure 15(b) when the density of points is less than 0.2.
As seen in Figure 14, points in high density regions (> 0.4) have a higher variation in prevalence, ranging from 0 to 0.75. The 10-fold CV results in Figure 15(a) show that FRK performs best for these high density points, with an error < 0.4, followed by GPBoost and SpRF. INLA performs poorly on these points for 10-fold CV. For 50-fold CV (Figure 15(b)), GPBoost performs best on the high density points, followed by SpRF, while FRK and INLA perform similarly. Table 4 gives metrics for different density groups for both 10 and 50-fold CV. We define low density as density ≤ 0.2, medium density as 0.2 < density ≤ 0.4, and high density as density > 0.4. We see that the absolute errors of all four models are small for low density points. The percentages of observations with absolute error less than 0.05, 0.1 and 0.2 are quite high for both the 10 and 50-fold cross validation sets. In both CV sets, INLA has the lowest RMSE, with SpRF following closely. For both 10 and 50-fold CV, the correlation coefficients between the actual and the predicted values are negative. This reflects the fact that most prevalence values are close to zero in low density locations. This would also explain why fewer points were sampled from those regions, as more points are generally sampled from high prevalence regions. GPBoost, SpRF and FRK have higher RMSE for the medium density point set than for the high density point set for 10-fold CV. Similar behaviour is observed for FRK for 50-fold CV. This is due to the high absolute errors in Fold 8, as discussed previously. For the medium density points, GPBoost is preferred in terms of RMSE for both 10-fold and 50-fold CV. For high density points, FRK is preferred for 10-fold CV, while GPBoost is preferred for 50-fold. In terms of the percentage of points with absolute error less than 0.05 and 0.1, SpRF leads the other models for 10-fold CV, while INLA leads for 50-fold. For both 10 and 50-fold CV, INLA and SpRF perform better on low density points compared to the other two methods, while GPBoost and FRK perform better on high density points.
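The density groupings used for Table 4 amount to banding the held-out points by the kernel density estimate at their location and scoring each band separately. A minimal sketch of that bookkeeping (thresholds 0.2 and 0.4 as defined above; the function name is our own):

```python
import math

def rmse_by_density_group(errors, densities):
    """Band absolute prediction errors into low/medium/high density groups
    (thresholds 0.2 and 0.4) and return the RMSE within each band."""
    groups = {"low": [], "medium": [], "high": []}
    for e, d in zip(errors, densities):
        if d <= 0.2:
            groups["low"].append(e)
        elif d <= 0.4:
            groups["medium"].append(e)
        else:
            groups["high"].append(e)
    return {k: math.sqrt(sum(e * e for e in v) / len(v)) if v else None
            for k, v in groups.items()}

# Toy example: three held-out points, one in each density band.
per_band = rmse_by_density_group([0.1, 0.3, 0.2], [0.1, 0.3, 0.5])
```

Bands with no points return None rather than a spurious zero, which matters when a fold falls entirely inside one density band.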

A.3 Interval predictions
As described in Section 3, each method has a different uncertainty quantification mechanism; however, we have estimated the standard deviation of predictions from each model to allow comparison.
Table 5 gives the interval prediction results for all four models. Figures 5 and 12 show the interval predictions for 10 and 50-fold CV. For both 10 and 50-fold CV, SpRF has on average the smallest uncertainty intervals and the smallest number of points within one or two standard deviations of the mean. Surprisingly, SpRF's interval widths are zero for 148 and 164 of the predictions for 10 and 50-fold CV respectively. Each of these points corresponds to a prediction (median) of zero prevalence, and the majority correspond to an observed prevalence of zero. However, the mean value of the response at nearly all of these locations is small but non-zero, and so the prevalence at these points does not lie within any number of standard deviations of the mean, contributing to the low percentages for SpRF in Table 5. For both 10 and 50-fold CV, GPBoost has the largest mean interval widths but the smallest standard deviation of widths, suggesting that it predicts consistently wide intervals for most observations. For 10-fold CV, this results in the highest percentage of points lying within one or two standard deviations. For 50-fold CV, INLA has a higher percentage of points lying within each type of interval, which may be accounted for by the higher variation in INLA's interval widths.
Figure 16 shows the kernel density estimates of the locations and the respective interval widths of the four models for both 10 and 50-fold CV. For both 10 and 50-fold CV we see that GPBoost has similar widths for all observations. We observe a slight increase in width for low density points; however, there is not much variation in width with respect to density. In contrast, FRK, INLA and SpRF have varying interval widths across the folds. There is high variation for locations with high density, most likely because of the variation in prevalence. For both 10 and 50-fold CV, FRK has relatively high width values for low density points. Conversely, INLA and SpRF have low width values for low density points. Similar to the point predictions, we see a high variation of width for medium density points (density ≈ 0.3) for FRK, INLA and SpRF.

B Effects of input noise on INLA
Figure 6(ai) shows INLA predicting a flat near-zero prevalence over most of Africa when trained on the observation data, a behaviour that is not replicated when fitting the model to either set of simulated data. This behaviour may be due to the noise in the observation data, which is visible in Figure 3(a), particularly in Uganda. In contrast, the simulated data at the same locations, shown in Figure 3(b), appears much smoother.
We examine this working hypothesis by adding Gaussian noise to the simulated data. Prevalence values sampled from the MAP raster in Figure 3(d), at locations based on the observation points, were transformed using the logit function. Gaussian white noise with chosen standard deviations was added to the transformed values, before being brought back to values between 0 and 1 via the inverse logit. Binomial samples for the number of positive tests were then drawn using these prevalences, and INLA was fit to these data.
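The simulation procedure above can be sketched in a few lines of numpy/scipy. This is a hedged illustration of the logit-noise-binomial recipe, not the paper's code; the prevalences, sample sizes and noise level below are placeholders.

```python
import numpy as np
from scipy.special import logit, expit

def noisy_binomial_samples(prevalence, n_tested, noise_sd, rng):
    """Add Gaussian white noise to prevalences on the logit scale,
    then draw binomial counts of positive tests.

    prevalence : array of raster prevalences in (0, 1) at survey locations
    n_tested   : array of numbers examined at each location
    noise_sd   : standard deviation of the noise on the logit scale
    """
    eta = logit(prevalence)                            # map (0, 1) to the real line
    eta_noisy = eta + rng.normal(0.0, noise_sd, size=eta.shape)
    p_noisy = expit(eta_noisy)                         # back to (0, 1)
    positives = rng.binomial(n_tested, p_noisy)        # simulated positive counts
    return positives, p_noisy

# Placeholder inputs, for illustration only
rng = np.random.default_rng(0)
p = np.array([0.05, 0.20, 0.40])
n = np.array([100, 80, 120])
pos, p_used = noisy_binomial_samples(p, n, noise_sd=1.0, rng=rng)
```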
Predictions from INLA fitted to data with three different levels of noise are shown in Figure 17. As the noise level increases, the predictions increasingly resemble the flat predictions in Figure 6(ai) of the model trained on the observation data. These results suggest the presence of overdispersion, and that the INLA model used may be misspecified. Indeed, the model in equation (4) does not contain an independent error term. Methods to address this include adding an observational random effect to the model, or using a Beta-binomial response.
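To see why a Beta-binomial response can absorb this extra variability, one can compare its variance with that of a binomial distribution with the same mean. The scipy sketch below is only an illustration; the dispersion value rho and the mean-matching parameterisation are our choices, not values from the paper.

```python
from scipy import stats

n, p = 100, 0.2
rho = 0.1  # overdispersion (intra-cluster correlation); arbitrary illustrative value

# Parameterise Beta-binomial(n, a, b) so its mean n*a/(a+b) matches Binomial(n, p)
a = p * (1 - rho) / rho
b = (1 - p) * (1 - rho) / rho

binom_var = stats.binom(n, p).var()             # n p (1 - p)
betabinom_var = stats.betabinom(n, a, b).var()  # inflated by factor 1 + (n - 1) rho

# Same mean, but the Beta-binomial variance is roughly 10x larger here,
# room for the small-scale noise a pure binomial likelihood cannot express.
assert betabinom_var > binom_var
```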

B.1 INLA with Gaussian response
Due to the overdispersion observed when fitting the INLA model to the observation data, we tested an additional INLA model which uses a Gaussian response. Predictions from this model are shown in Figure 19, and the absence of flat predictions suggests that this response is able to resolve the overdispersion.

C Prediction Uncertainty
Figure 20 shows the prediction uncertainties corresponding to each of the prevalence maps over Africa in Figure 6.

D GPBoost with the Vecchia approximation
As it uses a full Gaussian process, it is unsurprising that the GPBoost model shows the least favourable computation time for larger datasets. To improve efficiency, a Vecchia approximation is available in the software which, following Sigrist (2022), approximates the distribution of the response as
$$p(\mathbf{y} \mid f(\mathbf{s}), \theta) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}, \ldots, y_1, f(\mathbf{s}), \theta) \approx \prod_{i=1}^{n} p(y_i \mid \mathbf{y}_{N(i)}, f(\mathbf{s}), \theta).$$
Here $N(i)$ is the subset of $\{1, \ldots, i-1\}$ indexing the $m$ nearest neighbours to $y_i$, where "nearest neighbours" are determined by the distances between the responses' corresponding locations. The parameter $m$ determines the number of neighbours to use during fitting, while a separate parameter, $m_{pred}$, controls the number of neighbours used for prediction. The approximation additionally requires a choice of ordering of the observed responses $\{y_1, \ldots, y_n\}$, which by default is taken to be the original ordering of the input data. Figure 21 shows GPBoost's predictions when using a Vecchia approximation with several values of $m$ and $m_{pred}$. Uncertainty predictions were not produced as they are not currently well supported in the software when using the Vecchia approximation.
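The conditioning sets $N(i)$ can be constructed directly from the ordered coordinates. The sketch below uses scipy's KD-tree and is purely illustrative of the idea (it is not GPBoost's internal implementation, and rebuilding the tree at every step is done for clarity rather than efficiency).

```python
import numpy as np
from scipy.spatial import cKDTree

def vecchia_neighbour_sets(coords, m):
    """For each point i (in the given ordering), return the indices of up to
    m nearest points that come earlier in the ordering, i.e. the set N(i)."""
    n = len(coords)
    sets = [np.array([], dtype=int)]      # y_1 conditions on nothing
    for i in range(1, n):
        tree = cKDTree(coords[:i])        # only points ordered before i are eligible
        k = min(m, i)
        _, idx = tree.query(coords[i], k=k)
        sets.append(np.atleast_1d(idx))   # query returns a scalar when k == 1
    return sets

# Illustrative coordinates; the last point sits close to the first
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.1, 0.1]])
sets = vecchia_neighbour_sets(coords, m=2)
```

Note that because only earlier-ordered points are eligible, the result depends on the ordering of the data, which is why the software's default (the original input order) matters.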
Applying the Vecchia approximation introduces several artifacts to GPBoost's predictions. Figures 21(a) and 21(d) show sharp discontinuities and noisy predictions, both of which were prominent whenever low values of $m$ and $m_{pred}$ were used. Experiments using the Kenya data suggested that the discontinuities and noise could be prevented by increasing $m$ and $m_{pred}$ (results not shown), however for the dataset on the continental scale there was a significant computational cost for doing so. Increasing $m_{pred}$ from 30 to 150 while keeping $m$ fixed at 30 had a relatively small impact on the computation time, which increased from 7.4 to 9 minutes, but the required memory jumped from 1289MB to 20546MB. Meanwhile, increasing both $m$ and $m_{pred}$ to 150 greatly increased the computation time, requiring over 3.4 hours to run, much longer than when the Vecchia approximation was not applied. Additionally, this model configuration required 20679MB of RAM. These examples suggest that increasing $m$ primarily increases the computation time required without affecting the RAM usage, while increasing $m_{pred}$ increases the required RAM, with a smaller impact on computation time. Despite the increased computational requirements, neither adjustment to the parameters completely removed the noise and discontinuities.
One benefit of using the Vecchia approximation is an improvement in scaling behaviour, even if the computational requirements on an individual dataset depend strongly on the choice of $m$ and $m_{pred}$. Figure 22 shows the computational results from Figure 7, with an additional plot for GPBoost using the Vecchia approximation. Parameters $m$ and $m_{pred}$ were held fixed at 30 and 150 respectively, and the model shows a linear increase in computation time as the number of observations increases. However, the scaling is still more severe than for INLA and FRK. These results highlight a significant obstacle to applying GPBoost to large-scale data: using the full Gaussian process can result in long computation times, while applying the Vecchia approximation introduces additional artifacts which require sacrifices in computational efficiency to remove.
GPBoost's computation times are amplified by the high value of the nrounds parameter, which has been set to 247 following available tutorials. As described in Section 2.2.1, this parameter controls the number of optimisation steps during fitting. When fit to the malaria datasets used throughout this paper, the log-likelihood generally stopped increasing after 5 to 10 steps, suggesting that 247 training steps is unnecessarily high for our data. Reducing nrounds to a value around 10 would greatly improve the scaling gradient for the GPBoost model with a Vecchia approximation in Figure 22. Additional experimentation, however, found that reducing the number of rounds had little effect on the high RAM requirements for large values of $m_{pred}$; something which may be necessary to minimise the discontinuities and noise in the predictions. Reducing nrounds would also improve the efficiency of the GPBoost model when no Vecchia approximation is used, but would not change the overall scaling behaviour.
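One simple way to choose a smaller nrounds is to monitor the per-round log-likelihood and stop once successive improvements fall below a tolerance. The sketch below is a generic illustration of this stopping rule with made-up log-likelihood values; it is not GPBoost's API, whose own validation-based early-stopping options should be preferred where applicable.

```python
def choose_nrounds(loglik_per_round, tol=0.05, patience=3):
    """Return the round after which the log-likelihood has stopped
    improving by more than `tol` for `patience` consecutive rounds."""
    stalled = 0
    for i in range(1, len(loglik_per_round)):
        if loglik_per_round[i] - loglik_per_round[i - 1] < tol:
            stalled += 1
            if stalled >= patience:
                return i - patience + 1   # first round of the plateau
        else:
            stalled = 0
    return len(loglik_per_round)          # never plateaued: keep all rounds

# Made-up trace: large early gains, then a plateau after round 4
loglik = [-100.0, -50.0, -30.0, -25.0, -24.99, -24.98, -24.97]
print(choose_nrounds(loglik))  # 4
```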
-Dale et al. 2021). During prediction, the user can specify the number of Monte Carlo samples to be drawn, which we have left at the default value of 400. Code and parameter choices for our FRK model are based on examples from the package authors (Zammit-Mangion & Cressie 2021, Sainsbury-Dale et al. 2021).
Figure 1: 2009 P. falciparum prevalence data in Kenya. (a) shows prevalence survey results, while (b) shows the Malaria Atlas Project predicted prevalence.

Figure 3: P. falciparum prevalence data used to fit the four models at the continental scale. 3(a) shows the 2009 observed data at 868 locations. 3(b) shows the prevalence generated from binomial samples at the observation locations. 3(c) shows the prevalence generated by binomial samples at 1000 uniformly random locations. 3(d) is the Malaria Atlas Project predicted prevalence raster from 2009, used to generate the samples in 3(b) and 3(c).

Figure 4: Predicted prevalences and uncertainties for the four models when trained on P. falciparum prevalence data from Kenya in 2009. Note that these maps are intended only to illustrate differences in model predictions when fitted to a small data sample, and are not likely to accurately represent malaria prevalence across the country in this year.

Figure 5: Interval predictions for 10-fold cross validation for national-level Kenya data. Points show the predicted mean from each model, and intervals show one standard deviation above and below the mean.

Figure 6: P. falciparum prevalence predictions when fit using three different datasets. In column (i), models are fit using the survey data from Africa in 2009, shown in Figure 3(a). In column (ii), the models are fit to binomial samples drawn from the Malaria Atlas prevalence raster at the same survey locations, shown in Figure 3(b). In column (iii), they are fit to binomial samples drawn from the raster at 1000 uniformly selected locations across the continent, shown in Figure 3(c). Outputs have been masked by the Malaria Atlas Project raster in Figure 3(d).

Figure 7: Times taken by each model on uniformly distributed simulated datasets. GPBoost was not run with 5000 or 10000 points due to the likely long computation time.

Figure 8: Prevalence predictions from an INLA model with a Beta-binomial response, fit to the observation data in Figure 3(a).

Figure 10: Model predictions of the four models vs actual prevalences using 10-fold CV.

Figure 11: Model predictions of the four models vs actual prevalences using 50-fold CV, with folds in different colours.
Figure 13: P. falciparum prevalence in Kenya for 2009 and the kernel density estimates of the sampled locations.
Figure 15: Kernel density estimates of the locations and the absolute errors of the four models for 10 and 50-fold cross validation.
Figure 18 shows posterior means and interquartile ranges for the intercept, range, and variance of the fitted models.

Figure 18: Posterior means of the intercept, range and variance for the INLA model fit using simulated data at the observation locations, with added Gaussian noise of varying standard deviation. The bottom-right plot shows the time taken to fit the INLA model to each of the datasets. Error bars show posterior interquartile ranges.

Figure 19: Predictions from an INLA model with a Gaussian response fit to the observation data. Values have been clipped to lie within [0, 1].

Figure 20: Prediction uncertainties corresponding to each of the prevalence maps over Africa in Figure 6.

Figure 21: P. falciparum prevalence predictions for GPBoost when using the Vecchia approximation for various values of the nearest-neighbour parameters $m$ and $m_{pred}$. (a) uses $m = m_{pred} = 30$, (b) uses $m = 30$ and $m_{pred} = 150$, while (c) uses $m = m_{pred} = 150$. (d)-(f) show the southern regions of the above plots, where noise in the predictions is more prominent.

Figure 22: Time taken for GPBoost with the Vecchia approximation applied, for simulated datasets of various sizes, compared to the times in Figure 7.

Table 1: Cross validation results of the four models, with best results in each category in boldface.

Table 4: Cross validation results grouped by density of sampled locations, with best results in boldface.

Table 5: Interval prediction results of the four models. Mean Width refers to the average of the predicted standard deviations, while Std. Dev. Width refers to their standard deviation.