Hot spot or not: a comparison of spatial statistical methods to predict prospective malaria infections

Background Within affected communities, Plasmodium falciparum infections may be skewed in distribution such that single or small clusters of households consistently harbour a disproportionate number of infected individuals throughout the year. Identifying these hotspots of malaria transmission would permit targeting of interventions and a more rapid reduction in malaria burden across the whole community. This study set out to compare different statistical methods of hotspot detection (SaTScan, kernel smoothing, weighted local prevalence) using different indicators (PCR positivity, AMA-1 and MSP-1 antibodies) for prediction of infection the following year. Methods Two full surveys of four villages in Mwanza, Tanzania were completed over consecutive years, 2010-2011. In both surveys, infection was assessed using nested polymerase chain reaction (nPCR). In addition in 2010, serologic markers (AMA-1 and MSP-119 antibodies) of exposure were assessed. Baseline clustering of infection and serological markers were assessed using three geospatial methods: spatial scan statistics, kernel analysis and weighted local prevalence analysis. Methods were compared in their ability to predict infection in the second year of the study using random effects logistic regression models, and comparisons of the area under the receiver operating curve (AUC) for each model. Sensitivity analysis was conducted to explore the effect of varying radius size for the kernel and weighted local prevalence methods and maximum population size for the spatial scan statistic. Results Guided by AUC values, the kernel method and spatial scan statistics appeared to be more predictive of infection in the following year. Hotspots of PCR-detected infection and seropositivity to AMA-1 were predictive of subsequent infection. For the kernel method, a 1 km window was optimal. Similarly, allowing hotspots to contain up to 50% of the population was a better predictor of infection in the second year using spatial scan statistics than smaller maximum population sizes. Conclusions Clusters of AMA-1 seroprevalence or parasite prevalence that are predictive of infection a year later can be identified using geospatial models. Kernel smoothing using a 1 km window and spatial scan statistics both provided accurate prediction of future infection.


Background
Malaria transmission in endemic countries is heterogeneous over multiple spatial scales [1,2]. At the micro scale, P. falciparum infections are frequently clustered in relatively few households that consistently have significantly more infections than others [3,4]. Many factors can contribute to this increased risk of malaria exposure, including design of housing, the proximity to mosquito breeding sites, host genetic factors, poor access to treatment, maternal education, wealth, and other as yet undefined characteristics [3,[5][6][7][8]. At sites with very low levels of transmission, such as those found in Swaziland, cases of symptomatic malaria detected at health facilities can help in identification of a hotspot, as additional asymptomatic cases can be found living in close proximity to the index case [9]. In areas of moderate transmission intensity, malaria hotspots may provide a reservoir of infected human hosts that can maintain some transmission year round. The individuals in such hotspots are thus likely to have acquired anti-parasite immunity and to carry parasites without clinical symptoms. In the wet season, when the mosquito population increases, these clusters of asymptomatic carriers may be responsible for seeding transmission to the rest of the community, including less immune people who are more likely to suffer symptomatic infections [7]. Thus in these settings, hotspots are difficult to identify using the distribution of clinical (symptomatic) malaria cases alone.
The most used geospatial method to detect clusters of infection is the spatial scan statistic [10][11][12]. Measures of exposure which have been explored using spatial scan statistics include prevalence of infection, incidence of clinical malaria and serological markers of malaria exposure [13][14][15][16][17][18]. While this approach allows identification of clusters using statistical hypothesis testing, it may ignore more subtle small-scale spatial heterogeneity and clusters that do not fit within circular or elliptical windows [19]. An alternative method that has been used to detect clustering of infection is distance-weighted prevalence of infection, whereby infection prevalence in neighbours is used as a proxy measure for household level exposure [20,21]. This method allows for a smoother estimation of risk in space than spatial scan statistics.
This study seeks to determine which geospatial method best describes a malaria transmission hotspot by comparing methodologies using cross-sectional data collected during the first year of the study to predict the distribution of infections found in the second year.

Study site
Misungwi district (lat 2.85000 S, long 33.08333 E) is located 60 km from Mwanza town in the north-west of Tanzania at an altitude of 1,178 m above sea level (see Figure 1). The district is rural with moderately intense malaria transmission; the overall prevalence of infection in the region is estimated to be 31.4% by microscopy in children 6 -59 months (Tanzania HIV and Malaria Indicator Survey 2008). The district has two annual rainy seasons, the long rains between February and May, and the short rains between November and December. The dry and relatively hot season falls between June and September. Malaria incidence peaks one to two months after the rains start. The National Malaria Control Programme (NMCP) carried out indoor residual spraying (IRS) in the study area during the period from late November 2010 to late January 2011.

Data collection
A census of four villages in a single ward was carried out in the dry season, between August and early November 2010. All data were collected using personalized digital assistants and every household was visited and mapped using a global positioning system (GPS). All individuals in the ward were invited to participate in the study. The head of household gave information on the age, sex and insecticide-treated net (ITN) use of those who were not present. Individuals who consented to join the study were asked to provide a finger-prick sample of blood which was spotted onto Whatman® standard 3 mm filter paper for parasite detection and serological analysis. Subjects who reported having had fever within the previous 24 hours were tested for malaria using a histidine-rich protein 2 (HRP2) rapid malaria diagnostic test (RDT, Paracheck-Pf®, Orchid Biomedical Systems, Goa, India) and referred to a study clinician for management of their febrile illness.
A follow-up survey was carried out in the same study villages during August to November 2011, one year after the initial study. The same procedures were carried out during the second survey as during the baseline survey.
Molecular estimation of P. falciparum infection DNA was extracted from filter papers using the Chelex® (Sigma, USA) extraction method described previously [22] in 96 deep-well plates. Parasite DNA was detected using nested PCR (nPCR) targeting the 18S rRNA gene as previously described [23].

Serology
Antibodies were eluted from filter paper spots and assayed for specific IgG responses to P.falciparum AMA-1 and MSP-1 19 by ELISA as described by Corran et al. [24]. Samples were tested in duplicate. Duplicate optical density (OD) values OD values that differed by more than 1.5-fold were rejected and, if possible, rerun. For each plate a standard curve was generated from a known positive control and blank wells were included and OD values normalised to these. To define seroprevalence a mixture model was applied to the OD data which assumed two inherent Gaussian distributions; a narrow distribution or sero-negatives and a broader distribution of seropositives. A cut-off was calculated as the mean plus 3 standard deviations of the narrow distribution and was calculated separately for each antigen [25].

Cluster analysis
While there are a range of different methodological approaches to identifying clusters of infection [12,26], here we focus on three geospatial cluster detection methods to explore baseline clustering of infection and serological markers and their ability to predict infection in the second year of the study. The unit of analysis was the individual, meaning that clustering of infected individuals was assessed rather than clustering of households with infection. Infection in the second year was defined as a positive nPCR result recorded as a binary variable.

Satscan analysis
Spatial analysis was performed to assess possible clustering of nPCR-positive individuals. A spatial scan statistic was obtained using the Bernoulli model [11] and SaTScan software (SaTScan, version 8.2.1). This software applies multiple circular windows, which are plastic in both position and size, across the study area. Each distinct circle represents a possible cluster. For each circle, the number of observed and expected infected individuals are counted, with expected numbers calculated assuming an even distribution of infections across the population. As multiple infected and non-infected individuals can be specified at each household, the spatial distribution of households is accounted for. A likelihood ratio test is used to compare the prevalence of infection within the circle to that outside it to identify significant clusters of higher than expected (hotspot) or lower than expected (coldspot) prevalence. The statistical significance of this hotspot is evaluated taking into account the multiple tests for the many potential cluster locations and sizes evaluated as well as the distribution of the population [10]. The maximum proportion of the population that a cluster could contain was set at 50%. This method has been extensively explored in studies of the micro-epidemiology of malaria [12,13,[27][28][29].
Households were grouped into three categories: 1) hotspots (clusters of significantly higher than expected malaria prevalence); 2) coldspots (clusters of significantly lower than expected malaria prevalence); and, 3) all other households. Clusters were defined using three measures: 1) nPCR positivity; 2) antibody seropositivity to AMA-1; 3) antibody sero-positivity to MSP-1 19 ; and, 4) antibody seropositivity to AMA-1and/or MSP-1 19 . So as to make results from analyses using different clustering methods comparable, hotspots were assigned a score of 1, coldspots 0 and all remaining households a score of 0.5. Households for which data were only available in the second year were assigned a hotspot score according to whether the household lay within the radius of the hot or coldspot.

Kernel analysis
Kernel density estimation is a statistical procedure used to produce a smoothed estimate of density of events, such as individuals, across space [26]. For any given point, the density of events within a predefined window is estimated, with the influence of events weighted according to the distance from the centre of the window. The weight assigned to each event is derived from the kernel function applied. In this analysis a quadratic kernel function was used with an initial window radius of 1 km. A quadratic function allows importance of data from neighbouring households to be relative to the distance to the index household. To obtain a smoothed estimate of infection prevalence over the study region, a kernel density surface of numbers nPCR positive was divided by a kernel density surface of numbers examined. This resulted in each household having a value between 0 (least exposed households) and 1 (most exposed households). Households for which data were only available in the second year were assigned a prevalence value based on infection in neighbouring households only.

Weighted local prevalence analysis
This method calculates parasite prevalence amongst all neighbours within 1 km of the index house, weighting the prevalence estimate according to the inverse of the distance of the neighbouring house to the index house [20]. While a form of spatial smoothing, an important distinction between weighted local prevalence and kernel smoothing is that individuals in the index household are not included in the weighted prevalence estimate. As for kernel prevalence estimates, the weighted local prevalence for each household ranged from 0 (least exposed households) to 1 (most exposed households). As this method does not include infection status of individuals in the index household in the calculation of prevalence, no further action was required for those households with data from only the second year.

Statistical analysis
To compare the ability of different cluster detection methods to predict infection in the second year, mixed effect logistic regression models was used. The outcome of interest was infection status by nPCR (0/1) in the second year. The risk factors explored were nPCR, AMA-1, MSP-1 19 and AMA-1 and/or MSP-1 19 (hereon termed combined seroprevalence) cluster score in the first year (generated via each of the three cluster detection methods). Simple summary contingency tables, graphs and scatter plots with Lowess curves were used to explore the relationship with potential risk factors and their associations with age. To explore the possibility of a non-linear relationship, risk factors were categorized into quartiles and a likelihood ratio test was used to assess which model (linear or categorical) was better. A household level random effect was included in the models to take account of correlation between individuals within the same household. All models were controlled for potential confounding by age, which due to an obvious non-linear relationship with infection was categorized before analysis into -zero to four years, five to nine years, ten to 15 years, [16][17][18][19][20][21][22][23][24][25] year, 26-35 years and over 36 years (Table 1).
To establish the effect of radius size on results obtained with the kernel and weighted local prevalence methods, models using different radii were built. In addition to the initial 1 km radius, radii of 500 m, 100 m and 0 m (i e, household) were explored. Models assuming individual level infection and serological status were also compared. Similarly, for the SaTScan analysis, maximum population sizes of 20 and 10% were explored. To compare the predictive performance of using different methods and radii, the area under the receiver operating curve (AUC) was calculated for each model. AUC values were compared using DeLong's test for paired ROC curves [30]. Statistical analysis was performed using STATA (version 12, College Station, TX, USA) and R (version 3.0.1) [31].

Study subjects
In 2010, 668 households from randomly selected subvillages participated in the first year survey, comprising a total of 3,801 individuals, 3,057 (80.4%) of whom were seen, consented to participate and provided a blood specimen. Approximately half of the participants (n = 1,612, 52.7%) were male. The median age of the study population was 13 years (IQR = 5-30 years; range 1-99 years). The overall prevalence of P. falciparum by nPCR was 34.3%. In the second year survey, 697 households participated in the survey with 3,246 (85.4%) of eligible individuals providing a blood specimen, 51.6% of whom were male. Distribution of age was similar to that of the first year survey. P. falciparum prevalence by nPCR was significantly higher at 51.9% than during the baseline survey (OR 1.95; 95% CI, 1.76-2.17; p <0.001).

Association of age and other individual factors with PCR positivity and seropositivity
Individuals aged 10 to 15 years had the highest nPCR prevalence of P. falciparum at baseline and at follow-up (Table 1). Seropositivity to AMA-1 similarly peaked in the age group ten to 15 years. This age group had more than eight times the odds of being seropositive to AMA-1 compared to individuals aged zero to four years (OR 8.87, 95% CI 6.29-12.5; P < 0.001). Seropositivity to MSP-1 19 showed a different relationship with age, displaying a steady increase with age, with those aged >36 years having roughly five times the odds of being seropositive compared to those aged zero to four years (OR 5.10 95%, CI 3.66-7.10) ( Table 1).

Prediction of infection in the second year survey nPCR prevalence in the baseline survey
Fifty-seven per cent of individuals who were nPCR positive in the first year were also nPCR positive in the second year whilst 47% who were negative in the first year were also negative in the second year (χ 2 = 27.2; P <0.001). Guided by AUC values, clustering estimated using kernel analysis appeared to predict infection by nPCR in the second year more accurately than the weighted local prevalence method (p = 0.016) ( Table 2). While clustering estimated by SaTScan gave a higher AUC value than clustering by the weighted local prevalence method, there was no evidence for a difference in AUC (p = 0.12).
Using SaTScan analysis to detect nPCR hotspots, one large cluster was identified with a radius of 2.88 km, covering 141 households and one small cluster was identified with a radius of 0.1 km covering five households ( Figure 1A). SaTScan analysis showed that individuals who were residing in a nPCR hotspot cluster in the first year had four times the odds of testing positive for malaria by nPCR in the second year than those residing in nPCR coldspots (OR 4.54 95% CI 2.68-7.72). The kernel and weighted local prevalence analyses showed a more complex distribution of hotspots ( Figure 1B and C). Both clearly show the central hotspot detected by SaTScan, but also show numerous other high transmission areas, more consistent with the micro-epidemiology of malaria. The kernel analysis also showed that individuals who were residing in the top quartile (areas with a high prevalence of infection by nPCR) had three times the odds of testing positive for malaria by nPCR in the second year compared to those living in the lowest quartile (OR 3.45, 95% CI 2.06-5.75).

MSP prevalence
Seropositivity to AMA-1 and MSP-1 19 antibodies Defining clusters of seroprevalence using AMA-1 and MSP-1 19 antibodies separately improved prediction of nPCR positivity in the second year compared to using combined seroprevalence. SaTScan analysis revealed that individuals living in areas of high AMA-1 seroprevalence (hotspots) in the first year had five times the odds of being nPCR positive in the second year compared to those who lived in AMA-1 coldspots (OR 5.84 95% CI 3.75-9.10), adjusting for age (Table 2). SaTScan could not identify any significant clusters using combined seroprevalence. When clusters were identified by kernel analysis, those individuals living in households with the highest quartile of AMA-1 seroprevalence (hotspots) had a more than five times the odds of being nPCR positive in the second year than those in the lowest quintile (OR 5.16 95% CI 3.06-8.69), adjusting for age (Table 2). Using weighted local prevalence scores to distinguish clusters showed a similar pattern, those residing in the households in the Table 2 Odds of testing positive for P. falciparum infection during the follow-up survey: results from three geospatial models defined by baseline infection, anti-AMA-1 antibody prevalence, and anti MSP-1 19  top quartile of AMA-1 seroprevalence (hotspots) had more than three times the odds of being nPCR positive than those residing in lowest quartile (OR 3.33 95% CI 1.97-5.62) ( Table 2). Likewise the kernel analyses showed a more complex distribution of AMA-1 hotspots than SaTScan analysis (Figure 2). A comparison of the predictive ability of different clustering methods showed that both SaTScan and kernel analysis yielded higher AUC values than the weighted prevalence method, however, only the SaTScan method produced a significantly different result (p = 0.002 and p = 0.27 respectively). Antibody responses to MSP-1 19 showed a less clear association with infection in the second year, with individual age-adjusted seroprevalence at baseline showing no relationship with infection status in the second year. SaTScan analysis suggested that individuals living in MSP-1 19 hotspots were at lower risk of infection in the second year. Both kernel and distance weighted prevalence analysis also suggested individuals living in areas of highest MSP-1 seroprevalence were at lower risk of infection, however those living in areas of intermediate seroprevalence (third quartile) were at higher risk of subsequent infection.
Individual seropositivity at baseline to the combined seroprevalence of AMA-1 and/ or MSP-1 19 antibodies showed no relationship with infection in the second year. Similar to results using just AMA-1, kernel analysis of combined seroprevalence showed that those individuals living in the highest quartile had more than two times the odds of being nPCR positive in the second year than those residing in the lowest quintile (OR 2.44 95% CI 1.44-4.14). While a similar relationship was seen if hotspots were determined by weighted local prevalence, overall predictive ability using this method was worse than when using kernels with an AUC value of 0.530 (Table 2). SaTScan was not able to find any hotspots or coldspots using combined seroprevalence.
Sensitivity analysis of kernel and SaTScan methods for determining the best radius to predict malaria in the second year of follow-up Based on AUC values, the weighted local prevalence method to identify clusters was generally less predictive of infection in the second year than the SaTScan and kernel methods. Sensitivity analyses of these two methods were therefore conducted to determine the radius size that best predicted infection in the second year. For the kernel method, using larger radii to identify clusters of nPCR tended to produced similar AUC values than smaller radii (Table 3). Using larger radii of 500 m and 1 km to identify clusters of AMA-1 seroprevalence, MSP-1 19 or the antigens combined, generally produced higher AUC values. Similar sensitivity analyses were done for SaTScan, whereby the maximum population size allowable was set to 20 and 10%. As for the kernel analysis, there was a general trend to suggest that a larger maximum population size of 50%, which allows for larger geographic clusters, was more predictive of subsequent infection than smaller maximum population sizes (Table 3).

Discussion
It has been suggested that if malaria transmission hotspots can be identified, targeting interventions can have a improved impact on transmission [7]. A number of previous studies have explored the use of geospatial techniques to identify clusters of transmission markers such as infection or seropositivity to selected antigens [13,14,18,28,32,33]. These studies show that households with active and historic exposure tend to cluster together geographically. It is less clear however, whether these clusters predict future infection and if so, which geospatial techniques and transmission indicators should be used for their detection. Using two consecutive years' data, this study shows that clusters of infection and seropositivity to AMA-1 are predictive of future infection and that kernel analysis and SaTScan are superior to the weighted local prevalence method of cluster detection.
Several authors have identified the existence of hotspots at single time points, using a variety of different measures of transmission [13,18,28]. Fewer studies have shown that hotspots are stable over time. Using data from multiple years in Kenya, Bejon et al. applied spatial scan statistics to identify infection hotspots that were predictive of future hotspots up to seven years later [14]. Another study done in a highland of Kenya by Ernst et al. identified stable spatial clusters of malaria cases by SaTScan statistics over a period of four years [33]. Again using spatial scan statistics, Bousema et al. showed that over the period of two years, clinical episodes of malaria cluster into hotspots [13]. This study is consistent with these findings, showing that hotspots of infection are predictive of future infection. The study also shows that being seropositive to AMA-1 or being in a hotspot of AMA-1 seroprevalence is predictive of future infection. As seropositivity to AMA-1 is indicative of recent exposure to P. falciparum, this finding adds further evidence that hotspots of transmission are stable over several years. The relatively low AUC values do, however, suggest the importance of other factors related to risk of infection that were not accounted for. In addition, the higher prevalence of infection seen in the second year, likely due to higher rainfall observed that year, led to some infections in non-hotspot households, which negatively impacts the AUC.
The relationship between hotspots of seropositivity to MSP-1 19 and future infection was less clear. Clusters with high MSP-1 seroprevalence were found to be at lower risk of infection suggesting some protection at the neighbourhood level. However, whilst some studies have demonstrated a protective effect of antibodies to MSP-1 19, [34][35][36][37] at the individual level, this was not observed in this study. The reasons for these observations and the differences in the patterns seen with AMA-1 require further investigation but they may relate to the differing immunogenicity and half-life of the antibody response to these two antigens [38]. In terms of methods to detect clusters, this study suggests that using spatial scan statistics or kernel analysis allows better characterization of hotspots than the weighted local prevalence method. This may be due to the fact that estimates of weighted local prevalence for each household are made using infection status of neighbours only. This likely leads to an inferior indication of hotspot location as individual or household level factors play an important role in risk of subsequent infection in that household. Sensitivity analyses, varying both the window size and maximum population size for kernel and SaTScan analysis respectively, suggests that generally hotspots form over larger (1-3 km) scales. While this likely varies by setting, similarly sized hotspots have been detected by previous studies in similar transmission settings [13,14,20]. In lower transmission settings, transmission appears to cluster over increasingly small scales. A recent study by Searle et al. in Zambia, where infection prevalence was estimated to be 23% by rapid diagnostic test (RDT), showed that active case detection within a 500-m radius could identify 76% of all RDTpositive individuals [39]. A study in Swaziland, where transmission is extremely low (PCR-derived parasite prevalence <1%), suggested that infections tend to cluster within households of passively detected cases [9].
This study has several potential operational implications for malaria control. Firstly, given the apparent stability of hotspots, targeting clusters of infection and seropositivity to AMA-1 (and/or antigens with similar properties) with complete cure treatment and vector control could have a dramatic impact on transmission [7]. Secondly, kernel analysis and SaTScan appear to be optimal methods to detect hotspots. Currently, establishment of seropositivity to AMA-1 can only be done using assays that require samples to be processed in the laboratory. Equally, while RDTs exist for determining infection status, these miss a large fraction of infections, most of which are likely to be subpatent [40][41][42]. Previous work has shown that these subpatent infections tend to cluster in hotspots, making RDTs inappropriate methods to detect hotspots [43]. In order to target interventions at hotspots, therefore, the development of sensitive rapid diagnostics for infection and seropositivity to AMA-1 (or similar) is required. Alternatively, it may be possible to identify hotspots in the field by clustering of particular risk factors or passively detected cases. This is the focus of further research. In the meantime, in the setting of moderate malaria transmission around Lake Victoria, mass drug administration of entire villages may be required to interrupt transmission [43].

Limitations
This study used indirect measures to define household malaria exposure. Using more direct measures, such as entomological inoculation rate (EIR) and other vector measures, may have led to different results. However, EIR can be challenging to measure in low-endemic settings. Thus, individual parasite prevalence was chosen as the measure of subsequent transmission for this study. In addition, indoor residual spraying (IRS) was applied between survey periods throughout the study area. While there is no supporting data, it is likely that households that did not receive IRS were randomly distributed and therefore unlikely to introduce bias into the results. Lastly, the study continued for only two years, thus stability of malaria hotspots could only be predicted for that time period. However, as stated, the fact that hotspots of AMA-1 seroprevalence were predictive of future infection suggests transmission hotspots are stable over a longer time frame.

Conclusions
This study supports previous work showing that hotspots can be defined using geospatial methods and are stable over a period of at least one year. Hotspots can be detected either by using parasite prevalence or seroprevalence of AMA-1 antibodies. It was also found that spatial scan statistics and kernel analysis were better at characterizing hotspots of transmission than the weighted local prevalence method. Given the lack of highly sensitive rapid diagnostic tests for infection and AMA-1 seropositivity, routine detection of hotspots is challenging. Further work exploring simple methods to identify hotspots with existing tools is therefore required. Furthermore, while theorized, it has yet to be shown in the field that targeting interventions does indeed lead to greater reductions in transmission over an untargeted approach. Studies linking methods of hotspot detection with assessments of the subsequent impact of targeted interventions would be extremely valuable.