A dynamic model of some malaria-transmitting anopheline mosquitoes of the Afrotropical region. II. Validation of species distribution and seasonal variations

Background The first part of this study aimed to develop a model for Anopheles gambiae s.l. with separate parametrization schemes for Anopheles gambiae s.s. and Anopheles arabiensis. The characterizations were constructed based on literature from the past decades. This part of the study is focusing on the model’s ability to separate the mean state of the two species of the An. gambiae complex in Africa. The model is also evaluated with respect to capturing the temporal variability of An. arabiensis in Ethiopia. Before conclusions and guidance based on models can be made, models need to be validated. Methods The model used in this paper is described in part one (Malaria Journal 2013, 12:28). For the validation of the model, a data base of 5,935 points on the presence of An. gambiae s.s. and An. arabiensis was constructed. An additional 992 points were collected on the presence An. gambiae s.l.. These data were used to assess if the model could recreate the spatial distribution of the two species. The dataset is made available in the public domain. This is followed by a case study from Madagascar where the model’s ability to recreate the relative fraction of each species is investigated. In the last section the model’s ability to reproduce the temporal variability of An. arabiensis in Ethiopia is tested. The model was compared with data from four papers, and one field survey covering two years. Results Overall, the model has a realistic representation of seasonal and year to year variability in mosquito densities in Ethiopia. The model is also able to describe the distribution of An. gambiae s.s. and An. arabiensis in sub-Saharan Africa. This implies this model can be used for seasonal and long term predictions of changes in the burden of malaria. Before models can be used to improving human health, or guide which interventions are to be applied where, there is a need to understand the system of interest. Validation is an important part of this process. It is also found that one of the main mechanisms separating An. gambiae s.s. and An. arabiensis is the availability of hosts; humans and cattle. Climate play a secondary, but still important, role.

http://www.malariajournal.com/content/12/1/78 Background Several attempts have been made to map the distribution of Anopheles gambiae s.s. and Anopheles arabiensis [1][2][3][4][5], two of the most important vectors of human malaria in sub-Saharan Africa. MacDonald [6] showed that limiting the human-vector contact reduces malaria transmission, and that the most efficient control measure is to increase the mortality rate of the involved mosquitoes. His thinking has been adopted in current malaria control efforts. Two of the most common interventions today are indoor residual spraying (IRS) [7] and insecticide-treated bed nets (ITNs) [8]. Often, there is no detailed understanding of the life history, behaviour and species composition where the interventions are applied [3].
Anopheles arabiensis inhabits areas from South Africa in the south to Mauritania and Sudan in the north. In Central-West Africa there is a pocket with very few observations of An. arabiensis. The border of this pocket is formed by Angola, Zambia, Burundi, Rwanda, Uganda, South-Sudan, Central African Republic, Congo, Gabon, and Equatorial Guinea. Anopheles gambiae s.s. is currently separated into five chromosomal forms: Forest, Bamako, Savanna, Mopti and Bissau [9], and two molecular forms: M and S [10,11]. It is distributed from South Africa to Mauritania and northern Mali, but is absent in Ethiopia and Northern Sudan. The species is considered the most efficient malaria vector in Africa [12].
Recent studies have shown that interventions aimed to prevent malaria has an impact on balance between An. gambiae s.s. and An. arabiensis [13]. The relative fraction of each species can vary from month to month, and year to year [14]. In Tanzania it has been shown that multi-decadal changes in the species composition can influence malaria transmission [15]. Given the observed changes in species composition, and their different capacity as vectors of malaria, it is highly relevant to have models which include several species when assessing the impact of climate variability and climate change.
This paper is the second of two describing and validating a new model of the dynamics of An. gambiae s.s. and An. arabiensis The model, which is described in part one [16], is a biophysical model driven by output from a climate model. Biophysical models seek to understand what drives a certain biological process, and to describe this with mathematical equations. Unlike statistical models, which often rely on observations to predict species presence and absence, biophysical models can be run with no information with respect to observed distribution and densities, and base the model equations on laboratory studies aiming to isolate different aspects of the life history of the mosquitoes. The role of field observations on the presence or absence of a species in the case of biophysical models, is to validate the model after an experiment has been completed. In some studies observations are used to reduce the uncertainty of unknown parameters [17].
In addition to predicting the current distribution, these type of models can be used to project changes in the historical and future density and distribution of these species. They can describe changes from day-to-day, month-to-month, year-to-year, and decade-to-decade. The model, named Open Malaria Warning (OMaWa) [16], includes several components, describing the mosquito's life from the aquatic stages to adult. In the aquatic stages, life history varies for eggs, larvae and pupae. As adults the life history changes with age. OMaWa is driven with air temperature, relative humidity of the air, wind speed and direction, soil temperature, relative soil moisture, and runoff from a climate model. These variables are used to parametrize mortality, rate at which eggs are laid, biting rate, development rate in the aquatic stages, and dispersion (spread) of mosquitoes. In part one, it was shown how the model responded to different forcings, and focused on its sensitivity to temperature, humidity, mosquito size, the probability of finding blood, and dispersion. Thus the results presented here should be seen in light of the sensitivity analysis. A full description of the model used here can be found in part one [16]. This is the first time a biophysical model has been used to model the relative density of An. gambiae s.s. and An. arabiensis, with simulations covering an entire continent. It is also the first time age dependent life history and mosquito dispersion (spread of mosquitoes) has been included in a continental analysis. The model is validated against 6,927 presence/absence points of the two species, and a more detailed analysis is carried out for Madagascar. The data is freely available to the public [18]. This study has also evaluated the ability to model the temporal variability, using case studies for Ethiopia.

Continental validation
To date there are three data sets describing the occurrence of An. arabiensis and An. gambiae s.s. [3,19,20]. Additional online resources have been described by Hay et al [21]. To compliment and extend these databases, a systematic search was conducted. A total of 1,940 occurrence points were collected for An. arabiensis, 1,813 for An. gambiae s.s., and 992 for An. gambiae. Merging these data with the three databases [3,19,20] result in 2,926 occurrence points for An. arabiensis, 3,009 for An. gambiae s.s., and 992 for An. gambiae [18]. Three methods were used to georeference the points. In papers where coordinates were given, these coordinates were used. If possible they were cross checked against given place names. In cases where only place name, and a description of the place were given, the locations were searched up using Google Maps/Earth. http://www.malariajournal.com/content/12/1/78 Where only a map was provided, the map was imported to qgis and geo-referenced [22], and occurrence points were manually extracted.
The database containing An. gambiae was mainly used to estimate the occurrence of An. gambiae s.l. in Namibia, DRC, South Sudan, Angola, Congo, and northern South Africa. To classify the points the expert opinion polygons from Sinka et al [3] was used. A point falling within the An. arabiensis polygon only was classified as An. arabiensis, points falling within the An. gambiae s.s. polygon only as An. gambiae s.s., and points falling within both polygons were assigned both species. To classify true presence/absence points the data described previously was used. Observations of An. gambiae s.s. were classified as presence for this species. Absence points for An. gambiae s.s. were those where An. arabiensis had been recorded, and no An. gambiae s.s. had been observed within a radius of 100 km. The same approach was used for An. arabiensis.
This model (OMaWa) was compared with species predictions from four other models, as well as the expert opinion from Sinka et al [3]. The first was the paper by Rogers et al [1] where they used satellite data to predict the presence of An. arabiensis and An. gambiae s.s.. To reproduce the images in the paper the figures were georeferenced, and polygons were drawn based on the 0.65-1 probability. The selection was based on the colouring they used in the figure. Next a 50 by 50 km grid was overlaid with the polygons, and points falling within the polygons were classified as presence points. Points falling outside were classified as absence. The second paper is by Levine et al [2]. They used a genetic algorithm to predict the presence of the two species. As before, the images were geo-referenced, and polygons were constructed based on dark grey to black shading. Next, absence and presence was constructed as for Rogers et al [1]. The third paper is a recent paper by Sinka et al [3]. Since this is a three band RGB (Red-Green-Blue) raster, the pixel values were first converted to a one band raster: 1 − (0.299 · R + 0.587 · G + 0.114 · B)/255. This new raster image was then gridded to a 50 by 50 km grid. Presence was defined as probability greater than approximately 0.4. As for Rogers et al [1], this threshold was selected based on the colouring in the figure (and it must be assumed the authors chose the colours based on what they thought to be realistic classifications). Where applicable, the weighted absolute mean error was also calculated based where weights were equal to the probability given in the maps. The fourth paper is by Moffet et al [5]. The same methodology as for Sinka et al [3] was used to construct a comparable map. For the expert opinion, presence/absence points were constructed with the same methodology used for Levine [2] and Rogers [1]. The extracted data and scripts are available upon request. The mosquito density from OMaWa was classified as present if the 19 year mean was greater than 0.004 mosquitoes per square kilometre, and absent if less. Quality of the models were estimated as mean absolute error (MAE), which is recommended over the root mean square when comparing model performance [23].

Relative fraction of each species, Madagascar
To investigate if the model is able to estimate the relative fraction of An. gambiae s.s. and An. arabiensis data from Pock Tsy's et al [24] and Chauvet's [25] article describing the fraction of each species in Madagascar was used. In total these two data sets consist of 275 observations, and should thus be suitable to give a rough idea about the relative fraction of the two members. Different measures were given to evaluate the model skill: a) For each observation there are information about the month of collection as well as longitude and latitude. From the model data, covering the period 1990-2008, the closest point to each observation in the month of collection is selected, and the yearly monthly mean is calculated. These data were used to make box plots, weighting for the number of observations in each point, comparing the observations with the model. b) From the data produced in a, maps were created using a distance weighted kernel with cut off at 100 km. Hence observations further away than 100 km were not included, and closer points will be given more weight. c) The distance to the closest wrong (difference in fraction greater than 0.2) and correct (difference smaller than 0.2) prediction will be indexes for the spatial accuracy. A non-parametric test like the Wilcoxon rank sum test with continuity correction (Mann-Whitney) test can then be used to test if the two indexes differ by a location shift of zero, and the alternative is that they differ by some other location shift.

Model setup
In addition to looking at the spatial patterns, it is of interest how the model reproduce temporal variability in mosquito numbers. Originally, this model was developed to increase the understanding of malaria epidemiology in Ethiopia. is called spectral nudging. In the African run (TC50) the intention was not to reproduce the exact year to year variability, but the interest was to reproduce reasonable weather in a reasonable climate, and thus no nudging was used. To validate the ability to reproduce seasonal variations data from Eth30 and Eth18 to drive OMaWa was used.
For simulations driven by Eth30 the model was run without dispersion, BLL aquatic mortality, development rate with no species correction, default gonotrophic cycle, and AL adult mortality. TC50 and Eth18 were run with the following parametrization: with dispersion, KBLL aquatic mortality, development rate with species correction, default gonotrophic cycle, and BLLad adult mortality. All results are based on single realizations of the model, and error bars are therefore not reported.

Validation data
There are few papers describing the year to year, and seasonal variations in mosquito numbers in Ethiopia. In the validation process three papers were used, one master thesis, and field data from Chano Mille, Arba Minch describing mosquito seasonality.
The first, a paper by Kenea et al [26], is describing An. arabiensis larva density in the vicinity of six villages in central Ethiopia, December 2007 to June 2008. The second paper is by Taye et al [27] and is reporting bi-monthly (October 2001 to August 2002) adult An. arabiensis numbers in Sille (Southern Ethiopia). The third paper is by Yemane Ye-Ebiyo et al [28], where they report larva density in seven naturally formed puddles, in Ziway. Since this paper does not report density in the area as a whole, the data might not be directly comparable to the modelled ones. To overcome this problem all time series were scaled, both observations and model results, as standardized anomalies: To compare the absolute density, it would be required that the papers reported the larva/mosquito density per square kilometre over a larger area. Since this is not the case, scaling is necessary. The last study is by Balkew, where the seasonality of An. arabiensis in Awash Valley, Ethiopia, was described [29]. The study locations are plotted in Figure 1.
In addition to the published data, Fekadu Massebo collected one year (May 2009 to April 2010) of mosquito densities in Chano Mille, Ethiopia. The study site is described in [30,31]. To see if the model was able to reproduce the mosquito densities, Eth18 was used to drive OMaWa.

Validation statistics
All correlations (Pearson) are calculated from the values reported in the papers [26][27][28][29], and a similar time series (sampled the same month as the observations) is constructed from the model averaging the four closest model points:

Climate model realizations
The simulations in this paper was driven by three different realizations of a limited area climate model. The first realization (Eth30), carried out in 2009, comes from WRF model version 3.1.1 [32]. It was run at 90 km resolution using a tropical channel set up. In this type of setup, the domain consists of the boundaries above and below certain latitude and no side boundaries. This process allows the interaction from the extra-tropics through the northand-south boundaries. In addition, it allows the generated waves to propagate around the globe more naturallyas in the real world and in global models. The meridional boundary conditions were specified using six-hourly National Centers for Environmental Prediction (NCEP) Reanalysis 2 (T42) data. year to year anomalies, the model was nudged, using spectral nudging, against waves (wind, pressure, and temperature) longer than 1,000 km in both domains. The Kain Frisch cumulus parametrization scheme was used [33,34]. The second realization (TC50), carried out in 2011, had again a tropical channel set up. The model was run at 50 km resolution from January 1 1989 to January 1 2009. At the north and southern boundaries the model was driven by Era Interim. The Kain Frisch cumulus parametrization scheme was used [33,34]. No nudging was used, and therefore it is less probable the model would reproduce year to year variability in the weather. This run was used to assess the mean state of mosquito density and distribution.
In the third experiment (Eth18), done in 2012, WRF 3.3.1 was used with the Tiedtke cumulus parametrization scheme [35,36]. The model was run at 18 km resolution from January 1 2008 to August 1 2011, with data from Era Interim at the boundaries. Outside the planetary boundary layer the same type of spectral nudging as

Results and discussion
Distribution of Anopheles gambiae s.l. Figure 2 is showing the presence data collected as part of this work. Data collection on An. gambiae was focused on areas where little information about the occurrence of An. arabiensis and An. gambiae s.s. was available. Figure 3 shows the modelled mean density of An. arabiensis and An. gambiae s.s.. The white contours are indicating the presence of each species. The pattern is consistent with the general perception of the species range [3]. This is the first time a model [1][2][3][4]   An. gambiae s.s. reveals some probable inconsistencies with respect to the species distribution in southern Chad where An. arabiensis should be dominating [37].  [44] also found low concentrations of An. gambiae s.s. in Haut-Ogooué, which was also predicted by the model. In the north-eastern part of Gabon it has not been possible to find any recent mosquito surveys, and it is therefore hard to conclude if the predicted absence of An. gambiae in this region is correct.

Occurence of Anopheles gambiae s.l. in Africa (TC50)
To evaluate the quality of the model with respect to classifying the presence and absence of the species the methodology described previously was used. Table 1 shows the mean absolute error for the four papers [1][2][3]5], expert opinion and this model. For reference, a MAE of 1 would be equivalent to completely wrong predictions, and 0 would be perfect. While the genetic algorithm of Levine [2] and the predictions based on satellite imagery by Rogers [1] show poor skill, the recent papers by Moffet et al [5] Sinka et al [3] are great improvements compared to those. Still, they have less skill than the expert opinion if comparing to the unweighed MAE. This model (OMaWa) has lower MEA than all the models included in this analysis, and including weights in the MEA makes it superior even to the expert opinion. The occurrence data suggest the expert opinion for An. arabiensis is wrong over West Africa and Southern Cameroon. A mosquito survey in Namibia, and north-eastern Gabon, would also clarify the present-day species composition in these countries.

Relative fraction of each species, Madagascar (TC50)
Since Madagascar has a sharp separation between An. arabiensis and An. gambiae s.s., the island is well suited to address whether the model is able to reproduce the relative fraction of each species.Three measures to evaluate  Figure 4 show the fraction of An. arabiensis from the model, grouped by the fraction in the observations. It is clear, while capturing the main tendencies well, the model has problems with the exact separation between the two species. In the mixed group, the model tends to let one species dominate over the other, possibly letting An. arabiensis dominate too easily. Figure 5, created using method b), shows the fraction of An. arabiensis as modelled, and observed. An eyeball comparison shows the separation is shifted westward in the model, and a bias in the South-Eastern tip of Madagascar. Whether this is a result of (climate) model resolution, failing to accurately separating the west/east gradient in topography, or the biological parametrization being inaccurate is hard to quantify. It is hoped this can be tested in a future analysis with higher model resolution. Table 2 shows the distance to the closest model point, distance to the closest model point with correct prediction, and distance to the closest point with wrong prediction as described in c). At all quantiles the distance to the closest correct prediction is 1.5 to 7 smaller than the closest wrong prediction. A Mann-Whitney test with confidence level of 0.99 shows the difference in location between wrong and correct predictions is 9.84 (5.07 25.68) km (p < .0001). Thus, although with biases, it is concluded that distance to closest correct prediction and closest wrong prediction are non-identical populations.

Temporal variability
It is important that mosquito models reproduce the seasonal cycle correctly, since this will be an indication of the sensitivity to climate. Here results from the model are compared to a number of observational studies. The comparison with each individual study might not have much information, but it is recommended that readers look at the results as a whole, having in mind the continental analysis showing the model is able to separate the distribution of An. arabiensis and An. gambiae s.s.. These results are meant to complement the continental analysis. Eth30 and Eth18 refers to the weather data used to drive OMaWa.

KEN11: 2007-2008 larva density in central Ethiopia (Eth30)
In this study [26] Kenea et al reported the An. arabiensis larva density in six locations in central Ethiopia, December 2007 to June 2008. Five of the sites followed the same seasonality, while one had the highest density before the rainy season started. The model is not designed to capture such local variations, but is rather aiming to describe the median, or sometimes mean, state within a certain area. In their study all anopheline positive habitats present within a 500 m radius of each irrigated village/town and 700m along the major drainages (lake or river) were sampled. This means that the data should be comparable to what is modelled. The seasonality of larva density, l sum =

YE2003: 2001 3 month larva variability in Zwai (Eth30)
If it is assumed larva per dip has units LPD = C larva m 2 , where C is a constant, and that the samples are representative for a larger area, the relative number of larva in that area can be estimated as LPD · W a , where W a is the mean water area in m 2 . This way it is assumed the number of puddles is constant from July to September, and that the puddles only change their surface area. These values are roughly comparable to the modelled number of larva. Since only the latitude (and not the longitude) is reported in the paper, and Zwai is not located at latitude Confidence interval is estimated using 1,000 random samples of the points within the bounding box, and the 2.5% and 97.5% quantiles of the correlations is reported. Since the sample size is small and the data might not be directly comparable, the correlation should be interpreted with care. The data from the observations and the model can be seen in Figure 8.

FEK2012: 2009-2010 mosquito catch Chano Mille, Ethiopia (Eth18)
As seen in Figure 10, and correlations in Table 3, the model corresponds well with the observations in Chano, 2009-2010. While the weather station in Arba Minch recorded some heavy rainfall events in October/November 2009 the regional climate model did not capture these events, or did not dump the precipitation in the right location [45]. representation of precipitation in the climate models. The differences between the trapping methods also highlight the uncertainty of related to data collection, especially when the number of mosquitoes is low. From December 2009 to March 2010 the observed number of An. arabiensis was very low ( Figure 10). It is interesting that despite of this, malaria started to rise in these months [30].

Summary of temporal variability analysis
Each of the five case studies consist of short time series, with different observational methodologies. It was attempted to show how the model results can be compared to the different type of observations, and in general the model is in good agreement with the observations. Since none of the studies cover several years, it was only

Conclusions
In this paper, the model has been validated using independent data. The results suggest sufficiently high bovine density influences the large-scale distribution of An. arabiensis. Similarly, the presence of An. gambiae s.s. is linked to the presence of humans, modulated by the density of An. arabiensis. Water and air temperature, and availability of breeding sites play secondary roles for the continental distribution of these species, but might be locally important in margin zones. The recent distribution shifts in species composition observed in Kenya [13,46] might be partially explained by increased mortality of An. gambiae s.s. due to interventions like IRS and LLINs. An alternative explanation might be the competitive advantage of An. arabiensis efficiently feeding on cattle, and thus suppressing the number of An. gambiae s.s. through easier access to blood, and thus reproducing at a higher rate. Over time, these interventions mainly reduce the human biting rate, and not necessarily the longevity of mosquitoes; the most efficient measure in MacDonald's formula of the basic reproductive number. Next, it can be challenged if a reduction of the number of breeding sites, lowering the number of adult mosquitoes per human, would be as efficient, and cost-effective, as IRS and LLINs over time. Studies on the long-term effect of interventions on the mortality rate of mosquitoes is needed to evaluate how these interventions work in practice. The large scale distribution of An. arabiensis, and its relation to cattle distribution, also rises the question of this species is using the odour of bovine to navigate, and if this causes of the observed coexistence of An. arabiensis and cattle. If this applies on large scales, there are reasons to believe the same mechanisms manifest themselves on small scales. In that case, keeping cattle separate from humans should further reduce the human biting rate in areas where An. arabiensis is the dominant species.
Several studies have found out the gene flow of An. gambiae s.s. and An. arabiensis, and how the species have spread, and is evolving, across Africa. In the model, which gives a good representation of the distribution of the two species, An. gambiae s.s. is spreading most efficient on surfaces with continuous human populations, while An. arabiensis disperse more easily on surfaces with continuous cattle populations. It is hypothesized the lack of such a human surface between Kenya and Ethiopia can explain the absence of An. gambiae s.s. in Ethiopia; -to spread to Ethiopia, there is a need of a more or less continuous human population cover from Lake Victoria to southern Ethiopia, sufficient breeding sites, and temperatures which are not too extreme. Thus, not only climate control the presence and absence of these species, but also the availability of hosts. This has implications for the ability to project the future distribution of the two species.
Before models can be related to improving human health, or guide which interventions are to be applied where, there is a need to understand the system of interest.
Validation is an important part of this process. Concluding based on too little data, and basing projections of for example the effects of climate change on models which have not been validated, is dangerous [47], might mislead the public, and lead to less confidence in science. The way forward would be to include effects on interventions. This would allow us to understand how residual spraying and bed nets influence the mosquito populations, and in turn malaria. Incorporating interventions in a continental model requires a) spatial data describing which interventions were applied when, and b) the long term effect of these interventions. Currently such data might exist, but have not been systematized for use by the research community. In these two papers, the focus has deliberately been on the mosquito population. By looking at each component involved in malaria transmission separately, the understanding of the dynamics of malaria can be improved. This process is crucial to robustly estimate how a changing environment and society, has changed, and will change, the premises for malaria transmission.