Systematic review of analytical methods applied to longitudinal studies of malaria

Background Modelling risk of malaria in longitudinal studies is common, because individuals are at risk for repeated infections over time. Malaria infections result in acquired immunity to clinical malaria disease. Prospective cohorts are an ideal design to relate the historical exposure to infection and development of clinical malaria over time, and analysis methods should consider the longitudinal nature of the data. Models must take into account the acquisition of immunity to disease that increases with each infection and the heterogeneous exposure to bites from infected Anopheles mosquitoes. Methods that fail to capture these important factors in malaria risk will not accurately model risk of malaria infection or disease. Methods Statistical methods applied to prospective cohort studies of clinical malaria or Plasmodium falciparum infection and disease were reviewed to assess trends in usage of the appropriate statistical methods. The study was designed to test the hypothesis that studies often fail to use appropriate statistical methods but that this would improve with the recent increase in accessibility to and expertise in longitudinal data analysis. Results Of 197 articles reviewed, the most commonly reported methods included contingency tables which comprised Pearson Chi-square, Fisher exact and McNemar’s tests (n = 102, 51.8%), Student’s t-tests (n = 82, 41.6%), followed by Cox models (n = 62, 31.5%) and Kaplan–Meier estimators (n = 59, 30.0%). The longitudinal analysis methods generalized estimating equations and mixed-effects models were reported in 41 (20.8%) and 24 (12.2%) articles, respectively, and increased in use over time. A positive trend in choice of more appropriate analytical methods was identified over time. Conclusions Despite similar study designs across the reports, the statistical methods varied substantially and often represented overly simplistic models of risk. The results underscore the need for more effort to be channelled towards adopting standardized longitudinal methods to analyse prospective cohort studies of malaria infection and disease. Electronic supplementary material The online version of this article (10.1186/s12936-019-2885-9) contains supplementary material, which is available to authorized users.


Background
Plasmodium falciparum infection is one of the most common parasitic infection in humans. Individuals in high-transmission settings are exposed to bites of infected Anopheles mosquitoes and develop frequent infections. In early childhood, infections are almost always associated with symptomatic disease [1]. Over time, individuals acquire immunity to clinical disease and, to some degree, infection [2,3]. Thus, each infection alters the host immune response that protects against the next episode. In addition, risk of exposure to infected bites is not uniformly distributed [4,5]. Thus, risk of infection is not equally distributed across populations. While one infection may decrease the risk of disease or infection, an infection in an individual is also an indicator of more frequent exposure to infected mosquitoes. This provides an interesting challenge to modelling the risk of malaria disease and infection over time.
Longitudinal cohorts, unlike other study designs such as

Open Access
Malaria Journal *Correspondence: mlaufer@som.umaryland.edu 7 Center for Vaccine Development and Global Health, University of Maryland School of Medicine, 685 W. Baltimore St. HSF-1 Room 480, Baltimore, MD 21201, USA Full list of author information is available at the end of the article cross-sectional surveys, can differentiate between infections that appear asymptomatic at time of detection but may become symptomatic over time. With efforts intensified to prevent Plasmodium infection and reduce clinical malaria disease, prospective longitudinal cohorts are increasingly being conducted to assess risk of infection and disease over time.
Well-designed prospective longitudinal cohort studies can be key to understanding risk of malaria disease for participants over time, but appropriate analysis of the collected data is also critical for accurate results. Like any longitudinal designs, malaria cohorts are characterized by repeated measurements, resulting in possible correlated infection and disease episode data from same participant. The underlying assumption of no correlation among observations made by standard statistical methods such as t-tests, analyses of variance (ANOVA), and linear regression are violated [6][7][8]. The use of inappropriate methods, particularly those that do not account for possible correlation of events when dealing with repeated episodes of malaria or infection, can lead to biased results. Ignoring correlation of events could result in the confidence intervals for the estimated rates being artificially narrow and the null hypothesis may be rejected more often than warranted [9,10]. Appropriate longitudinal data analysis techniques should account for such possible within-participant correlation and different covariance structures of episodes of malaria or infection measurements over time. These include generalized estimating equations (GEE) and mixed-effects models. The superiority in consistency and efficiency of these methods over the traditional methods, such as ANOVA, t-tests and simple linear regression were demonstrated previously [11][12][13][14]. One of the attractive features of GEE is that the method is robust to mis-specification of the correlation structure, such that the estimator is consistent even if the working correlation structure is misspecified [15], although it can be inefficient under the mis-specified working correlation structure [16]. When the research interest is on repeated time-to-episodes of disease or infection, it is common to analyse time-tofirst episode ignoring the subsequent events, but this approach fails to utilize all information available in the data [13,[17][18][19]. Analyses of time-to-first event only may be preferred in some situations. For example, when the intention is to evaluate interventions or prognostic factors (e.g., some vaccines) as 'all-or-nothing' determinants of malaria risk, so occurrence of any events is considered a failure. Another scenario would be when the first event is considered to have 'reset' the host such that subsequent events may then be consequently considered in a different category. Recurrent time-to-event models are the appropriate techniques for analysing recurrent malaria episode data because they take into account the within-participant non-independence of episodes. These techniques include frailty models [20,21] or recurrent Cox-extended models, such as the Andersen-Gill and Prentice-Williams-Peterson models [17].
The statistical models that incorporate all the repeated observations over follow-up are particularly important in malaria since over time, malaria infections result in individuals acquiring immunity to malaria disease. Each infection may introduces protective effect to future disease episodes. Therefore, appropriate analytical methods for repeated episodes of disease and infection should capture the impact of the developed protective immunity.
Recent advancements in statistical methods and computer software have improved the ease of access and application of statistical methods specifically designed for longitudinal studies, allowing one to handle complexity with cohort data [9,[22][23][24][25]. Limited information is available about whether the recent developments in computer software and statistical methods for longitudinal data has translated into more wide adoption and application of appropriate methods when analysing prospective longitudinal malaria cohort data. This study tested the hypotheses that choice of statistical methods would change over time and that these changes would reflect more appropriate analytical methods. To address this hypothesis, a systematic review was conducted to investigate the trends of use of appropriate statistical methods among articles analysing prospective longitudinal cohort data for clinical malaria episodes or Plasmodium infection published over a 22-year period.

Search strategy
Titles and abstracts containing the terms "malaria prospective study" or "malaria cohort study" or "malaria longitudinal study" or "malaria years follow-up" or "malaria repeated measurements" indexed in PubMed, Medline or ScienceDirect were included in the search. Original articles in English, which was a familiar language to authors, published from 1996 until December 2017 were reviewed in chronological order.

Inclusion and exclusion criteria
Original research articles were included in the analysis if they were prospective cohort studies where participants were actively or passively followed up for repeated malaria episodes over time and described the following outcomes: clinical malaria based on clinical symptoms and confirmed by rapid diagnostic test (RDT) or microscopy, infection defined as polymerase chain reaction (PCR), microscopy, or RDT results positive for Plasmodium parasites. Several publications reporting on the same study or data set were all included. Letters to the editor, editorials, reviews or systematic reviews, meta-analyses, repeated cross-sectional surveys, case reports and articles written in languages other than English were excluded. Detailed selection process of the studies under review is shown in Fig. 1.

Data extraction
For each article included in the analysis, the following information was extracted: year of publication, duration of follow-up, outcomes, sample size, population, location and type of statistical methods used. Statistical methods were grouped according to categorization by Colditz and Emerson [26,27] as presented in Appendix. In the cases where researchers studied multiple outcomes, only methods applied to the outcomes of interest-clinical malaria or infection-were considered. Particular interest was on standard longitudinal analysis techniques including GEE, mixed-effects models (random effects models) and repeated measures ANOVA which are considered appropriate for analyses of repeatedly measured data such as repeated episodes malaria disease or infection. For repeated time-to-episodes, recurrent Cox-based models and frailty models are considered appropriate, while other survival methods such as the Kaplan-Meier estimator, log-rank test and Cox model are only appropriate for analyses of time-to-single-episode.

Statistical analysis
Logistic GEE was used to estimate the odds of using the types of statistical methods in the 2007-2017 period compared to 1996-2006 period. Each model adjusted for study sample size and year of publication was included as a panel variable. The outcome variable was binary, indicating whether in each article a particular method was used or not. Models were constructed for each of the following methods: contingency tables, descriptive statistics only, Kaplan-Meier estimator, Cox model, Poisson

Results
Out of the 1975 abstracts reviewed, 1778 were excluded because they were of other subject areas (n = 1007, 51.0%) and study designs (n = 771, 39.0%). One-hundred and ninety-seven (10.0%) articles met the inclusion criteria and were included in the analyses (Fig. 1). Overall, the median follow-up time was 12 months [interquartile range (IQR): 9-24]. The median sample size per article was 351 (IQR: 206-700). The number of articles increased over the 22-year study period, with the highest number found between 2012 and 2016 (Fig. 2). Overall, there were 128 (65.0%) articles that analysed both clinical malaria episodes and infection as outcomes; 53 (26.9%) assessed clinical malaria episodes only, and 16 (8.1%) infections only. A table listing articles reviewed in this study with details of year of publication, duration of follow-up, outcomes, sample size, population, location, and analysis methods are provided as Additional file 1: Table S1.
Among 197 articles reviewed, the most commonly reported methods included contingency tables comprising Pearson Chi-square, Fisher exact and McNemar's tests (n = 102, 51.8%), followed by Student's t-tests (n = 82, 41.6%), multiple linear regression (n = 71, 36.0%), Cox models (n = 62, 31.5%), and Kaplan-Meier estimators (n = 59, 30.0%) ( Table 1) There were 74 of 197 articles that analysed time-to-first malaria episode or infection ignoring subsequent events and 61 (82.4%) of these employed survival analysis techniques that included Cox models, Kaplan-Meier estimators, and log-rank tests. There were 24 (12.2%) of 197 articles that analysed repeated time-to-episodes and only 6 (25.0%) of these used appropriate recurrent event models with extensions of Cox model, namely Andersen-Gill (n = 4) and frailty models (n = 2) to evaluate risk factors for recurrent clinical malaria. Eighteen (9.1%) articles analysed repeated time-to-malaria or infection episodes using Cox models and Kaplan-Meier estimators which are only valid for single-event analyses because they assume independence of events.  Table 2). The proportion of articles using Cox models, Kaplan-Meier estimators and mixed-effects models increased over time (Fig. 3). However, only the Cox models, and mixed-effects models had increased odds of being used in the second period compared to the first period, (OR = 2.57, 95% CI 1.21-5.42) and (OR = 3.85, 95% CI 1.13-6.39), respectively ( Table 2).

Discussion
In this review of statistical methods among articles analysing prospective longitudinal cohort data for clinical malaria episodes or Plasmodium infection published over a 22-year period, the statistical methods substantially varied across articles despite that they reported analysis of the same general study design and outcome measurements. The most commonly used models for analysing count number of malaria episodes included the Poisson and negative binomial models, but these fall short with repeated measurements data as they cannot account for within-participant correlation of the episodes [6,28,29]. Ignoring this possible correlation of events tends to bias the model results as demonstrated previously [15,[30][31][32], yielding regression parameters with underestimated standards errors for time-dependent covariates or overestimated errors in time-dependent covariates [33]. As hypothesized, trends in choice of appropriate methods improved over time, which included the GEE and mixedeffects models as well as recurrent event models, although their usage during the entire study period remained low. Furthermore, the majority of the studies that modelled the correlation structure did not specify the kind of assumed correlation structure, with only exchangeable correlation structure stated in few studies. Among studies that assumed Poisson distribution, the majority of them did not state whether or not overdispersion was checked. Some articles analysed the repeated time-to-disease episodes or infections but applied the Cox model [34] or the Kaplan-Meier estimator, which assume independent events and are only valid for modelling the time-to-single disease episode or infection. Such analyses may yield biased results. This is particularly true in malaria because over time, individuals acquire immunity to clinical disease and, to some degree, infection. Therefore, each infection may be dependent on the past, and in turn alter the host immune response that protects against the next episode. Additionally, there may be temporal variations in exposure which can alter the risk of clinical malaria disease over time. In some instances, data from the later episodes were discarded leading to loss of information which may limit the resulting conclusions. Modelling that focuses on time-to-first event only may not be generalizable to analyses examining all repeated episodes of disease or infection over entire study period because the risk to subsequent episodes diminishes as a result of the gained immunity over time.  When analysing repeated times-to-episodes malaria or infection, analytical methods should account for possible within-participant correlation of events as well as censoring over follow-up. Such appropriate methods include recurrent event survival models [35,36]. In the current review, only six articles analysed repeated time-to-episodes using appropriate recurrent models. Well-developed descriptions of the recurrent event models have been presented by several authors [10,17,35,36] while this section briefly discusses these models in relation to malaria data. The recurrent event models include extensions of the Cox model such as Andersen-Gill (AG) [37], Prentice-Williams-Peterson (PWP) [38] and the frailty model [39]. The AG model uses a counting process time-scale for all episodes assuming that the correlation between episodes-times for each participant is explained by previous episodes where timescale does not reset to zero after an episode. This model is typically applicable when the interest is on the overall effect of covariates on the intensity of the recurrent episodes. The PWP model analyses ordered events by stratification and can be expressed as gap-time model where the time-scale is reset to zero after each episode occurs. Gap times between malaria parasitaemia detection to clearance is a good example where the PWP model can be valid. Moreover, the PWP model can give both overall and episode-specific effects and so can be more applicable in malaria where covariate effects may be different in subsequent episodes due to the naturally acquired immunity. The shared frailty model extends the Cox model by introducing a random covariate that induces dependence among the recurrent episode times. By conditioning on covariates and the random effect, the recurrent episodes are assumed to become independent. Another challenge when analysing recurrent events in malaria is the issue of exclusion of periods-at-risk after an event, and this may constrain the choice of statistical methods. Depending on kind of treatment if any, recurrent events within a few days of each other are likely to be the result of the same infection or the second event may be considered to be due to treatment failure. Ignoring periods-at-risk after an event is also a potential source of bias. Thus, authors ought to consider exclusion of periods-at-risk after an event during the datasets structuring before analyses are conducted.
This review has demonstrated that publications on prospective longitudinal malaria cohorts are increasing over time, a trend reflecting the growing understanding of the importance of longitudinal cohort data coupled with advancements in computer software developments for analysing such data [9,22]. Optimal methods for longitudinal data analysis were in their infancy 20 years ago and both access to appropriate software and general acceptable of these methods by the research community have increased dramatically over this period. The increased trend in malaria cohort studies and the improved but slow adoption of appropriate methods underscores the need for standardized analytical methods for such data to avoid drawing erroneous conclusions based on inaccurate results. This review may have included more than one report using the same data, which may have introduced bias due to duplication of datasets. For example, the same data may have been analysed using inappropriate methods in first instance and then using appropriate techniques. Furthermore, some authors published more than one article and such articles may share similarities in the choice of analytical methods, but accounting for this was beyond the scope of this review as author background information was not collected. Future studies should consider testing whether more appropriate statistical methods improve or change understanding of malaria epidemiology by using among other factors, authors' background information. Lastly, further studies may explore and compare how results from this study would compare with trends specifically from field trials.

Conclusions
Cohort studies assessing risk of malaria infection and disease over time did not employ consistent analytic methods. The statistical methods varied substantially and often represented overly simplistic models despite similar designs, which may have led to biased outcomes and results that cannot be compared across studies. This review suggests that recent studies are adopting more appropriate statistical methods, though the statistical analyses are not uniform. These results underscore the need for more effort to be channelled towards adopting standardized longitudinal methods to model recurrent events for malaria infection and disease.

Additional file
Additional file 1: Table S1. List of reviewed articles in the study with details of year of publication, duration of follow-up, outcomes, sample size, population, location and analysis methods.