The data comprises time series from 5 different MSF health centres across the DRC for varying amounts of time between 2010 and 2016. These MSF missions vary in size and represent a mixture of hospitals, health centres and community clinics in the Great Lakes region; from North and South Kivu, close to the eastern border with Rwanda and Burundi (Baraka, Kimbi-Lulimba, Mweso and Walikale) and from the South-East province of Katanga, bordering Tanzania and Zambia (Shamwana, closed by the end of 2016). All sites are considered ‘open’ humanitarian settings, i.e. areas of chronic conflict mainly from the ongoing Congolese civil war, including internally displaced peoples (IDPs) and with frequent population movement due to fighting.
The ANC prevalence time series is the number of pregnant women tested for malaria using RDTs and the proportion of these that tested positive. Data is collated each month and all women that attend ANC appointments are tested for malaria regardless of whether they are symptomatic. The second time series is the monthly clinical incidence in children under 5 confirmed by RDT (i.e. symptomatic cases arriving as outpatients that tested positive by RDT). The size of the under 5 population at Mweso, Walikale and Shamwana is estimated by MSF each month using population surveys. The size of the under 5 population at Baraka and Kimbi-Lulimba, which cover larger areas, is taken from national census data conducted during the period of investigation by the DRC Department of Health.
An illustration of how the change in one metric may continue to influence another metric in the future (a lagged effect) is shown in Fig. 2. If one metric can affect another second metric for a long period of time, then the value of the second metric will depend on the current and historical values of the first metric.
A causal framework was utilized to characterize the relationship between ANC prevalence and clinical incidence, as well as to determine the direction of the association between the two metrics. A variable X “Granger causes” Y if including past values of X in a predictive model of Y produces better predictions of Y than just using past values of Y alone [14]. The analysis follows a two-step process. Firstly, a Granger causality test is used to determine the direction of the association (whether changes in ANC prevalence can predict future changes in clinical incidence, or vice versa) as well as the duration of any lagged effect. Secondly, this relationship is then fully characterized using more complex statistical models to determine the magnitude of the lagged effects and how the association might change with disease endemicity.
A vector auto-regression (VAR) model is used to test for Granger causality between the two metrics, determining the direction and length of potential lagged effects between two or more time series [15]. Granger causality was tested for using a Wald test suitable for stationary time series [16]. The number of past observations that should be used in the VAR model (known as the lag order) is determined by finding the lag order that optimizes some information criterion, usually the Akaike information criterion [17]. The VAR model with the optimum lag order was assessed for goodness of fit by examining the model residuals, performing a multivariate Portmanteau test to confirm that they are not correlated with each other and an autoregressive conditional heteroscedasticity test that looks for changing variance over time. The VAR models were fit using the package ‘vars’ in the R statistical software [16].
Distributed lag non-linear models (DLNMs) are used to fully characterize the relationship between the two metrics, these flexible models allow a “lagged effect” as well as an “endemicity effect” of one metric upon the other. The “lagged effect” means that the effect of the explanatory metric upon the response metric happens over time (with the effect size changing with respect to time), whereas the “endemicity effect” enables the relationship between the two metrics to change according to the level of disease (the effect size varies with the value of the explanatory metric) [18]. DLNMs are specified by choosing two “basis” functions, the first basis function describes the shape of the association between the two metrics at each point in time (the transmission effect basis), the second basis function controls the shape of the lagged effects in the model (the temporal lag basis, an example being Fig. 2b). These two functions are combined into a “crossbasis” function that describes the relationship between the value of an observation, how long ago it was observed and what its current effect will be on the response variable [19]. The crossbasis function can vary in shape depending on the two individual functions used to construct it. A crossbasis function can be written as s(xt−l, t − l; η), where xt−l is the observation of the explanatory variable l months ago, t−l is the number of months since the observation, and η are the so-called “basis parameters” which are the parameters that describe the shape of the two functions combined in the crossbasis. The crossbasis function can be included as a predictor in a generalized additive model with the following form:
$$logit\left( {E\left( {Y_{t} } \right)} \right) = \alpha + h_{i} + \sum\nolimits_{l = 0}^{L} {s\left( {x_{t - l} ,\,\,t - l;\eta } \right)} ,$$
(1)
where E(Yt) is the expected value of the response variable at time t (as determined by the Granger causality test outlined above), xt−l is the value of the explanatory variable at time t − l, α is a parameter determining mean difference between the two metrics, hi is the location-specific modifier of the mean difference between the metrics for location i, and L is the optimal lag order found when fitting the VAR model (and takes a value of 0 in models with no lagged effects). Different crossbasis functions (s(xt−l, t − l; η)) made up of the two different basis functions are fit to the observed data and compared to determine the most parsimonious model. Two different functions are used to investigate how the relationship between metrics changes with endemicity, i.e. the transmission effect basis:
-
Linear basis: The simplest model assumes that the endemicity effect varies linearly with the explanatory metric.
-
Hill function: A function flexible enough to fit the relationship between the incidence and prevalence typically observed in non-temporal data [20].
A choice of three different basis functions are used as the temporal lag basis:
-
No lagged effect.
-
Linear basis: The effect of a change in the explanatory metric increases or decreases linearly with respect to time.
-
Non-linear basis: A non-linear spline function that is penalized to produce a smooth curve, using penalized splines has been shown in simulations to be an effective method of reconstructing a variety of lag-exposure relationships when fitting DLNMs [21].
All combinations of endemicity effect and lagged effect basis functions are tested, giving a total of six different models. For clarity, each model is named with an acronym that represents its structure. The first two letters of the acronym represent the function used for the transmission effect basis, this can be either LE for a linear function or NE for a Hill function. The second two letters indicate the function used for the temporal lag basis, this can be LL for linear lagged effects or NL for non-linear basis spline lagged effects. If there is only one pair of letters then the model does not have lagged effects. The names of all six models are listed in Table 2.
Models were fit using the ‘dlnm’ package [22] for the R statistical software and the most parsimonious model was identified using AIC value. The predictive power of each model (its ability to correctly predict into the future) was compared use a rolling origin cross-validation method. This predicted a year of unseen data at a time, with the model being fit using all previous years of data at the given location and all the data from every other location. The models can then be compared using the root mean squared error of their predictions.