Field sampling strategy
The sampling strategy used for the collection of immature An. arabiensis aquatic habitat data was developed for earlier research projects and has been described in detail elsewhere [[1, 20, 21], and [22]]. Base maps were prepared for the study site in ArcGIS (Figure 1). We expected the larval/pupal count in An. arabiensis aquatic habitats in the study site to follow a Poisson distribution, as was the case in previous research in other Kenyan areas [[1, 2], and [3]]. Therefore, the mean count and standard deviations was used, on the log-number of mosquito larval/pupal counts collected in the study site, to determine sample size requirements. A sampling intensity formula was applied for determining the number of An. arabiensis aquatic habitats to collect when randomly sampling from an infinite population n = (ts/E)^2, where t = t value (t = 2), s = the standard deviation of log-larval/pupal count values observed, (s = 0.889), and E was the desired half-width of the confidence interval around the mean expressed in same units as standard deviation (E = ln(1.25) [1, 2]. Applying this formula, it was determined that 152 samples were required. The vector image of the sampling scheme (grid cell) was overlaid with the land cover raster images to identify areas of interest within each polygon (grid cell) of the sampling scheme. All potential aquatic habitat sites were identified, and data relative to species composition and abundance, predators, water quality and other environmental variables were assessed.
Poisson Regression
A Poisson regression, with statistical significance, was determined by a 95% confidence level which was used to ascertain whether the proportions of riceland aquatic habitats, positive for An. arabiensis larvae/pupae, differed by sampled grid cell. Poisson regression can be used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships among riceland An. arabiensis aquatic habitat covariates [1, 2, 20–22]. The regression analyses assumed independent counts (i.e., n
i
), taken at habitat locations i = 1 2... n, where each of the sampled An. arabiensis aquatic larval/pupal count values, was from a Poisson distribution. These larval/pupal counts were described by a set of explanatory variables denoted by matrix X
i
, a 1×p vector of covariate estimates for a sampled habitat location i. The expected value of these data was given by:
where β was the vector of non-redundant parameters and the Poisson rates parameter was given by:
The rates parameter λ
i
(X
i
) was both the mean and the variance of the Poisson distribution for an An. arabiensis aquatic habitat i. The dependent variable was the total larval/pupal count in an An. arabiensis aquatic habitat. The regression analyses were performed in SAS PROCREG. The sampled habitat data were log-transformed before analyses to normalize the distribution and minimize standard error. All of the covariate estimates for the models were tested for multicollinearity, using partial F test in SAS, and no problematic correlations were found.
Bayesian estimation procedures
In this research Bayesian estimation and MCMC methods were used to model the sampled covariates of An. arabiensis aquatic habitats in the study site. In the Bayesian paradigm, hierarchical models can be used to model heterogeneity of variances on the log- scale [8]. In this research, the natural logarithms of variances were modeled using a linear model to account for heterogeneity of the variances (on a logarithmic scale), in terms of the predictor variables sampled. In an anopheline aquatic habitat model, an environment-specific variance parameter is considered to be an independent draw from a random sampling distribution [3].
The MCMC sampling began with conditional (marginal) probability distributions, and the parameter estimates that were obtained using pseudo-likelihood estimation (i.e., an autoregressive term estimated with a conventional regression procedure). This involved estimating covariate coefficients (β) and ρ as though the field and remote-sampled observations were independent. MCMC outputs can sample values for an anopheline aquatic habitat parameter drawn from the joint posterior probability distribution [3]. In the first stage of the Bayesian analyses, a likelihood model was specified for the vector of the An. arabiensis aquatic habitats larval/pupal counts. At the second stage, predictor variables of the sampled An. arabiensis aquatic habitats were analyzed for specifying a prior model.
WinBUGS® was used to recognize conjugate specifications (e.g. Poisson-gamma), from the field and remote-sampled mosquito data. Our model assumed that the number of larval/pupal counts in the study site, i, Y
i
, had a conditional independent Poisson distribution with mean E
i
exp (μ
i
). The variable E
i
was used as the expected number of sampling events, which was proportional to the corresponding known An. arabiensis aquatic habitat larval/pupal population, n
i
. The expression exp (μ
i
) was the relative risk based on the sampled larval/pupal count values: regions with exp (μ
i
) > 1 having greater numbers of observed An. arabiensis aquatic habitat larval/pupal count values than expected, and vice versa for regions with exp (μ
i
) < 1, at the study site. The log-relative term was μ
i
which modeled all the predictor variables of An. arabiensis aquatic habitat data, linearly as:
In this research, x'
i
was the sampled An. arabiensis aquatic habitat covariates, and β was a vector of fixed effects in the models. Additionally, the terms θ
i
and φ
i
were used for capturing site-specific random effects and spatial dependence, respectively, in the ecologically sampled datasets. In previous research, Jacob et al [3] employed an MCMC algorithm and an autocovariate matrix to spatially quantify stochastic error propagation in Bayesian parametric variables estimated from Anopheles gambiae s.l. aquatic habitat covariates sampled in Malinda and Kisumu, Kenya. Their models revealed that a 10 cm increase in habitat depth was associated with a 0.391 cm increase in larval/pupal count on average, but after adjusting for habitat depth in both urban sites, using the spatial regression models, no significant autocorrelation or clustering of An. gambiae s.l. aquatic habitats appeared present in the residual error estimates. In this research all site specific An. arabiensis aquatic habitat characteristics were imposed using the equations:
Three chains were estimated for the variables in each potential model. Samples were discarded to allow the model to stabilize and the next 10,000 samples, after burn in, were used to derive parameter estimates. Discarding the first set of "burn-in" iterations can ensure that the chain has reached steady state, when estimating Monte Carlo parameters, such as posterior means from sampled anopheline habitat covariates [3]. After the model had converged, samples from the conditional distributions were used to summarize the posterior distribution of the model.
The Monte Carlo method of error propagation assumed that the distribution of error variables for each of the input data layers, generated in WinBUGS® from the ecological sampled An. arabiensis aquatic habitats parameters, were known. For each of the data layers an error surface was simulated by drawing, at random, from an error pool defined by the geographic distribution of the sampled habitat data. Error surfaces were added to the input data layers and the model was run using the resulting data error layers as input. The process was repeated so that, for each run, a new realization of an error surface was generated for each input data layer. The results of each run were accumulated and a running mean and standard deviation surface for the output was calculated. This process continued until the running mean stabilized. Since the random error visualizations were both positive and negative, the stable running mean were taken as the true model output surface, and the standard deviation surface was used as a measure of relative error. A simple summary was generated, showing posterior mean, median and standard deviation, with a 95% posterior credible interval.
Models were compared using the Deviance Information Criterion (DIC) in WinBUGS®, where
, was the sum of the posterior mean of the deviance, (D), a measure of goodness-of-fit, and the effective number of parameters (p
D
), a measure of model complexity. A measure of goodness-of-fit based on the DIC values was applied and an R2DIC, calculated in line with the standard R2 measure for the regression models. This was defined as:
where DIC
k
was the DIC value for model k under evaluation, DICmax was the DIC value for one-fixed parameter model and
was the posterior deviance from the model [3].
Checking the statistical efficiency of the MCMC Sequence
Model checking of all data input and compilation was also conducted in WinBUGS®. The number of chains had to be specified before compilation. In this research, three parallel chains were run. Syntax checking was used, which involved highlighting the entire model code and then choosing the sequence model specification. The uncertainty in estimates of quantities derived from an MCMC sequence of random samples was represented by N
k
and habitat samples v
k
represented a pdf of a scalar quantity v. The estimated value of v was given by the sample mean,
In this research, the expected variance in
was the expectation for the ensemble of the sequences generated from the ecological sampled An. arabiensis aquatic habitats covariates which was expressed as:
where
. The autocovariance of the sequence was defined as:
. The normalized autocovariance was
, where σ2 was the variance of v and ρ (l) did not depend on k. The length of the nonzero normalized autocovariance values were:
The normalized autocovariance was a symmetric function, i.e. ρ (-l) = ρ (l). The sequence sufficiently converged to the target pdf. The variance of the distribution of the sampled habitat parameters was generated using:
and the normalized autocovariance was estimated from the sequence using:
for lag l ≥ 0.
The MCMC sequence was defined as the reciprocal of the ratio of the number of MCMC trials needed to achieve the same variance in an estimated quantity as are required for independent draws from the target probability distribution [3]. The estimation of the mean and the variance for independent sampled An. arabiensis aquatic habitat parameters were generated by:
After compilation, the files contained some initial values for the parameters selected in the model. After careful inspection of the data, no aberrant values, leading to numerical overflow were found.
Spatial autocorrelation error matrix
All residual estimates from the Bayesian model were then evaluated in a spatial error (SE) model. An autoregressive model was employed that used a sampled habitat variable, Y, as a function of nearby sampled habitat Y values [i.e., an autoregressive response (AR) or spatial linear (SL) specification] and/or the residuals of Y as a function of nearby Y residuals [i.e., an AR or SE specification]. Distance between sampled habitats was defined in terms of an n-by-n geographic weights matrix, C, whose c
ij
values were 1 if the sampled An. arabiensis aquatic habitat locations i and j were deemed nearby, and 0 otherwise. Adjusting this matrix by dividing each row entry by its row sum, with the row sums given by C1, converted this matrix to matrix W [19].
The n-by-1 vector x = [x1 ⋯ x
n
]Tcontained measurements of a quantitative variable for n spatial units and n-by-n spatial weighting matrix W. The formulation for the Moran's index of spatial autocorrelation used in this research was:
where
with i ≠ j
The values w
ij
were spatial weights stored in the symmetrical matrix W [i.e., (w
ij
= w
ji
)] that had a null diagonal (w
ii
= 0). In this research the matrix was initially generalized to an asymmetrical matrix W. Matrix W can be generalized by a non-symmetric matrix W* by using W = (W* + W*T)/2 [19]. Moran's I was rewritten using matrix notation:
where H = (I - 11T/n) was an orthogonal projector verifying that H = H2, (i.e., H was independent). Features of matrix W for analyzing sampled covariates of An. arabiensis aquatic habitats include that it: is a stochastic matrix, expresses each observed value yi as a function of the average of habitat location i's nearby habitat larval/pupal counts, and allows a single spatial autoregressive parameter, ρ, to have a maximum value of 1 [1].
Simultaneous autoregressive model (SAR) specifications
A SAR model specification was used to describe the autoregressive variance uncertainty estimates. A spatial filter (SF) model specification was also used to describe both Gaussian and Poisson random variables. The resulting SAR model specification took on the following form:
where μ was the scalar conditional mean of Y, and ε was an n-by-1 error vector whose elements were statistically independent and identically distributed (iid) normally random variates. The spatial covariance matrix for equation (2.1), using the sampled anopheline aquatic habitat covariates was E [(Y - μ l)' (Y - μ l)] = Σ = [(I - ρ W')(I - ρ W)]-1σ2, where E (●) denoted the calculus of expectations, I was the n-by-n identity matrix denoting the matrix transpose operation, and σ2 was the error variance.
However, when a mixture of positive and negative spatial autocorrelation is present in an An. arabiensis aquatic habitat model, a more explicit representation of both effects leads to a more accurate interpretation of empirical results [1]. Alternately, the excluded values may be set to zero, although if this is done then the mean and variance must be adjusted [19]. In this research, two different spatial autoregressive parameters appeared in the spatial covariance matrix An. arabiensis aquatic habitat model specification, which for an SAR model specification became:
where the diagonal matrix of autoregressive parameters, <ρ >diag, contained two sampled parameters: ρ+ for those An. arabiensis. aquatic habitat pairs displaying positive spatial dependency, and ρ. for those habitat pairs displaying negative spatial dependency. For example, by letting σ2 = 1 and employing a 2-by-2 regular square tessellation,
for the vector
,
enabled positing a positive relationship between the sampled An. arabiensis aquatic habitats by covariates, y1 and y2, a negative relationship between covariates, y3 and y4, and, no relationship between covariates y1 and y3 and between y2 and y4. This covariance specification yielded:
where I+ was a binary 0-1 indicator variable which denoted those An. arabiensis aquatic habitat covariates displaying positive spatial dependency, and I- was a binary 0-1 indicator variable denoting those sampled habitats displaying negative spatial dependency, using I+ + I- = 1. Expressing the preceding 2-by-2 example in terms of equation (2.3) yielded:
If either ρ+ = 0 (and hence I+ = 0 and I- = I) or ρ- = 0 (and hence I- = 0 and I+ = I), then equation (2.3) reduces to equation (2.1) [19]. This indicator variables classification was made in accordance with the quadrants of the corresponding Moran scatterplot generated using the sampled An. arabiensis. aquatic habitat covariates sampled in the study site.
The SF model specification
If positive and negative spatial autocorrelation processes counterbalance each other in a mixture, the sum of the two spatial autocorrelation parameters--(ρ+ + ρ.) will be close to 0 [19]. In this research, Jacobian estimation was implemented by utilizing the differenced indicator An. arabiensis aquatic habitat variables (I+ - γ I-), estimating ρ+ and γ with maximum likelihood techniques, and setting
. The Jacobian generalizes the gradient of a scalar valued function of multiple variables which itself generalizes the derivative of a scalar-valued function of a scalar [17]. A more complex An. arabiensis. aquatic habitat specification was then posited by generalizing these binary indicator variables. We used F: Rn→ Rmas a function from Euclidean n-space to Euclidean m-space which was generated using the distance between sampled An. arabiensis aquatic habitat covariates. Such a function was given by m habitat covariate (i.e., component functions), y1(x1, xn), y
m
(x1, xn). The partial derivatives of all these functions were organized in an m-by-n matrix, the Jacobian matrix J of F, which was as follows:
This matrix was denoted by J
F
(x1,..., x
n
) and
. The i th row (i = 1,..., m) of this matrix was the gradient of the ithcomponent function y
i
:(∇ y
i
). In this analyses p was a sampled An. arabiensis aquatic habitat covariate in Rnand F (i.e., sampled larval/pupal count) was differentiable at p; its derivative was given by J
F
(p). The model described by J
F
(p)) was the best linear approximation of F near the point p, in the sense that:
The spatial structuring was achieved by constructing a linear combination of a subset of the eigenvectors of a modified geographic weights matrix, using (I - 11'/n) C (I - 11'/n) that appeared in the numerator of the Moran's Coefficient (MC) Spatial autocorrelation can be indexed with a MC, a product moment correlation coefficient [19]. A subset of eigenvectors was then selected with a stepwise regression procedure. Because (I - 11'/n) C (I - 11'/n) = E Λ E', where E is an n-by-n matrix of eigenvectors and Λ is an n-by-n diagonal matrix of the corresponding eigenvalues [17], the resulting An. arabiensis aquatic habitat model specification was given by:
where μ the scalar mean of Y, Ek was an n-by-k matrix containing the subset of k <<n eigenvectors selected with a stepwise regression technique, and β was a k-by-1 vector of regression coefficients [18]
A number of the eigenvectors were extracted from (I - 11'/n) C (I - 11'/n), which were affiliated with geographic patterns of the sampled An. arabiensis aquatic habitat covariates, in the study site, portraying a negligible degree of spatial autocorrelation. Consequently, only k of the n eigenvectors was of interest for generating a candidate set for a stepwise regression procedure. Candidate eigenvector represents a level of spatial autocorrelation which can account for the redundant information in orthogonal anopheline aquatic habitat map patterns [1]
The preceding eigenvector properties resulted in
and
for equation (2.3). Expressing equation (2.3) in terms of the preceding 2-by-2 example yielded
Of note is that because the 2-by-2 square tessellation rendered a repeated eigenvalue.
Surface partitioning
To identify spatial clusters of An. arabiensis aquatic habitats, Thiessen polygon surface partitioning were generated to construct geographic neighbor matrices, which also were used in the spatial autocorrelation analysis. Entries in matrix were 1, if two sampled An. arabiensis aquatic habitats shared a common Thiessen polygon boundary and 0, otherwise. Next, the linkage structure for each surface was edited to remove unlikely geographic neighbors to identify pairs of sampled An. arabiensis aquatic habitats sharing a common Thiessen polygon boundary. Attention was restricted to those map patterns associated with at least a minimum level of spatial autocorrelation, which, for implementation purposes, was defined by |MCj/MCmax| > 0.25, where MCj denoted the j th value and MCmax, the maximum value of MC. This threshold value allowed two candidate sets of eigenvectors to be considered for substantial positive and substantial negative spatial autocorrelation respectively. These statistics indicated that the detected negative spatial autocorrelation may be considered to be statistically significant, based upon a randomization perspective. Of note, is that the ratio of the PRESS (i.e., predicted error sum of squares) statistic to the sum of squared errors from the MC scatterplot trend line was 1.27 which was well within two standard deviations of the average standard prediction error value (roughly 1.13) for a sampled An. arabiensis aquatic habitat in the study site. Because larval/pupal counts were being analyzed, a Poisson spatial filter model specification was employed in this research [1, 2]. Detected overdispersion (i.e., extra-Poisson variation) results in its mean being specified as gamma distributed [19].
The model specification was written as follows:
where μ
i
was the expected mean larval/pupal count for habitat location i, μ was an n-by-1 vector of expected larval/pupal counts, LN denoted the natural logarithm (i.e., the generalized linear model link function), α was an intercept term, and η was the negative binomial dispersion parameter. This log-linear equation had no error term; rather, estimation was executed assuming a negative binomial random variable.
Eigenfunctions of a spatial weighting matrix
The upper and lower bounds for a spatial matrix generated using Morans indices (I) can be given by λmax(n/1TW 1) and λmin(n/1TW 1) where λmax and λmin which are the extreme eigenvalues of Ω = HWH [23]. Hence, in this research, the eigenvectors of Ω were vectors with unit norm maximizing Moran's I. The eigenvalues of this matrix were equal to Moran's I coefficients of spatial autocorrelation post-multiplied by a constant. Eigenvectors associated with high positive (or negative) eigenvalues have high positive (or negative) autocorrelation [19]. The eigenvectors associated with eigenvalues with extremely small absolute values correspond to low spatial autocorrelation and are not suitable for defining spatial structures [17]
The diagonalization of the spatial weighting matrix generated from the field and remote-sampled An. arabiensis aquatic habitat covariate coefficients consisted of finding the normalized vectors u
i
, stored as columns in the matrix U = [u1 ⋯ u
n
], satisfying:
where Λ = diag (λ1 ⋯ λ
n
),
and
for i ≠ j. Note that double centering of Ω implied that the eigenvectors u
i
generated from the ecological sampled An. arabiensis aquatic habitat covariates were centered and at least one eigenvalue was equal to zero. Introducing these eigenvectors in the original formulation of Moran's index lead to:
Considering the centered vector z = Hx and using the properties of idempotence of H, equation (2.6) was equivalent to:
From autocorrelation to correlation coefficient
As the eigenvectors u
i
and the vector z were centered, equation (2.7) was rewritten:
In this research, r was the number of null eigenvalues of Ω (r ≥ 1). These eigenvalues and corresponding eigenvectors were removed from Λ and U respectively. Equation 2.8 was then strictly equivalent to:
Moreover, it was demonstrated that Moran's index for a given eigenvector u
i
was equal to I(u
i
) = (n/1TW 1)λ
i
so the equation was rewritten:
The term cor2 (u
i
, z) represented the part of the variance of z that was explained by u
i
in the An. arabiensis aquatic habitat model z = β
i
u
i
+ e
i
. This quantity was equal to
. By definition, the eigenvectors u
i
were orthogonal, and therefore, regression coefficients of the linear models z = β
i
u
i
+ e
i
were those of the multiple regression model z = Uβ + ε = β
i
u
i
+ ⋯ + β n-run-r+ ε.
The distribution of the error residuals in the autocovariance matrix
The maximum value of I was obtained by all of the variation of z, as explained by the eigenvector u1, which corresponded to the highest eigenvalue λ1 in the spatial autocorrelation error matrix. In this research, cor2 (u
i
, z) = 1 (and cor2 (u
i
, z) = 0 for i ≠ 1) and the maximum value of I, was deduced for Equation (2.9), which was equal to Imax = λ1(n/1TW 1). The minimum value of I in the error matrix was obtained as all the variation of z was explained by the eigenvector un-rcorresponding to the lowest eigenvalue λ
n-r
generated in the An. arabiensis aquatic habitat model. This minimum value was equal to Imin = λn-r(n/1TW 1). If the ecological sampled predictor variable was not spatialized, the part of the variance explained by each eigenvector was equal, on average, to cor2 (u
i
, z) = 1/n-1. Because the field and remote-sampled An. arabiensis aquatic habitat variables in z were randomly permuted, it was assumed that we would obtain this result. In this research the set of n! random permutations, revealed that
. It was easily demonstrated that
and it followed that
.