Standardizing Plasmodium falciparum infection prevalence measured via microscopy versus rapid diagnostic test

Background: Large-scale mapping of Plasmodium falciparum infection prevalence relies on opportunistic assemblies of infection prevalence data arising from thousands of P. falciparum parasite rate (PfPR) surveys conducted worldwide. Variance in these data is driven by both signal, the true underlying pattern of infection prevalence, and a range of factors contributing to ‘noise’, including sampling error, differing age ranges of subjects and differing parasite detection methods. Whilst the former two noise components have been addressed in previous studies, the effect of different diagnostic methods used to determine PfPR in different studies has not. In particular, the majority of PfPR data are based on positivity rates determined by either microscopy or rapid diagnostic test (RDT), yet these approaches are not equivalent; therefore a method is needed for standardizing RDT and microscopy-based prevalence estimates prior to use in mapping.

Methods: Twenty-five recent Demographic and Health Surveys (DHS) datasets from sub-Saharan Africa provide child diagnostic test results derived using both RDT and microscopy for each individual. These prevalence estimates were aggregated across level one administrative zones and a Bayesian probit regression model was fitted to the microscopy- versus RDT-derived prevalence relationship. An errors-in-variables approach was employed to account for sampling error in both the dependent and independent variables. In addition to the diagnostic outcome, RDT type, fever status and recent anti-malarial treatment were extracted from the datasets in order to analyse their effect on observed malaria prevalence.

Results: A strong non-linear relationship between the microscopy- and RDT-derived prevalence was found. The results of regressions stratified by the additional diagnostic variables (RDT type, fever status and recent anti-malarial treatment) indicate that there is a distinct and consistent difference in the relationship when the data are stratified by febrile status and RDT brand.

Conclusions: The relationships defined in this research can be applied to RDT-derived PfPR data to effectively convert them to an estimate of the parasite prevalence expected using microscopy (or vice versa), thereby standardizing the dataset and improving the signal-to-noise ratio. Additionally, the results provide insight on the importance of RDT brands, febrile status and recent anti-malarial treatment for explaining inconsistencies between observed prevalence derived from different diagnostics.


Background
Large-scale maps of Plasmodium falciparum infection prevalence [1][2][3] are increasingly used to inform disease control planning, implementation and evaluation at national to global scales [4], and as a basis for disease burden estimation and the monitoring of progress towards international targets [4][5][6][7]. Mapping at the continental or global scale relies on opportunistic assemblies of data on infection prevalence arising from thousands of P. falciparum parasite rate (PfPR) surveys conducted in different countries. The between-site variance observed in these PfPR estimates arises from two components: an underlying signal, i.e., the variation of the true infection prevalence, and residual noise. The noise is attributable not only to inherent random error but also to a range of confounding factors that reduce the comparability of PfPR measurements from different surveys, most notably immunological differences between the age ranges of subjects surveyed [8] and differences in parasite detection methods. Whilst random error and age-standardization have been addressed in previous mapping efforts, the effect of different diagnostic methods has not. Previous research has defined the functional relationship between microscopy and polymerase chain reaction (PCR) detection [9], but little research has explored the functional differences between microscopy and rapid diagnostic test (RDT). Given that the majority of PfPR data are based on positivity rates measured by either microscopy or RDT, and that the sensitivity and specificity of these approaches are not identical [10], it is crucial to understand the functional differences between these two approaches.
Detection of parasite infection via microscopy has formed the mainstay of modern PfPR surveys for many decades [11]. Giemsa-stained thick smear microscopy, under ideal conditions in which stains are prepared correctly and the slide is analysed by an expert, is highly accurate for infection diagnosis at densities above 100 parasites per µL [12]. Since the early 2000s, however, economically preferable RDTs have become widely used in field locations as they allow quick diagnosis through antigen detection without the need for laboratory equipment. Figure 1 illustrates the increasing role of RDT measurements in the global assembly of PfPR surveys maintained by the Malaria Atlas Project (MAP). This increase is mirrored in clinical settings, where the introduction of RDTs has enabled improvements in the proportion of suspected malaria cases receiving parasitological diagnosis (e.g., rising from 20 to 62 % between 2006 and 2013 in the African public sector and with over half of reported cases in 2013 identified by RDTs [4]).
World Health Organization (WHO) guidelines state that all RDTs used for malaria diagnosis should pass the following criteria in the WHO Malaria RDT product testing programme: a panel detection score of at least 75 % at 200 parasites per µL, a false positive rate <10 %, and <5 % of tests deemed invalid [13]. RDTs used in PfPR surveys undergo extensive testing as part of WHO quality assurance procedures [14]. However, despite these procedures, the relative performance of RDTs versus microscopy in the field is influenced by a wide range of factors that cannot be fully captured in standardized quality assurance tests. RDTs have been shown to become ineffective through poor storage [15], and can display false positivity due to the persistence of the detected antigens in the blood after both anti-malarial treatment and parasite clearance [16,17]. Despite these limitations, no simple adjustment factors currently exist to functionally map between RDT-derived PfPR measurements and those likely to be observed using microscopy.
To address the need for cross-comparability of microscopy- and RDT-derived PfPR measurements, a large database of individual-level parasitological outcomes, tested using both methods, was assembled from national household surveys conducted in sub-Saharan Africa. A hierarchical Bayesian model was then developed to capture the functional divergence in PfPR measured using the two techniques. The extent of this divergence was further explored in sub-analyses stratified by the presence or absence of symptomatic infection (i.e., fever), recent treatment with effective anti-malarial drugs, and the type and brand of RDT used.

Data collection
The Demographic and Health Surveys (DHS) Programme collects nationally representative health and socio-economic information from over 90 countries worldwide through cross-sectional household surveys [18]. Typically, malaria testing within the DHS Programme is limited to children under 5 years old. For this study, children from sub-Saharan African DHS datasets whose malaria diagnostic outcome was recorded using both microscopy and RDT were selected. Table 1 lists the 25 DHS Programme surveys that met these inclusion criteria as of 15 July 2015, which together comprised 118,078 individuals tested for falciparum malaria. The administrative zone (ADMIN1) of residence and the RDT and microscopy diagnostic outcomes were recorded for each individual. Where available, the following additional factors were extracted: (1) RDT brand; (2) RDT type, i.e., whether the RDT detected histidine-rich protein 2 (HRP2) alone, HRP2 and pan-Plasmodium lactate dehydrogenase (pLDH), or HRP2 and Plasmodium vivax-specific pLDH; (3) whether the individual had been febrile within the last 2 weeks; and (4) whether febrile individuals had received treatment with an artemisinin-based combination therapy (ACT) anti-malarial within the previous 2 weeks.
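The zone-level aggregation that the modelling step applies to these individual records (paired counts per ADMIN1 zone, with zones of fewer than 10 tested individuals excluded) can be sketched as follows. This is a minimal illustration only: the record layout, the function name `aggregate_by_zone` and the toy data are assumptions made for the example and are not taken from the DHS files.

```python
from collections import defaultdict


def aggregate_by_zone(records, min_n=10):
    """Aggregate paired individual results into zone-level counts and prevalence.

    records: iterable of (zone, mic_positive, rdt_positive) tuples, where the
    two booleans are the microscopy and RDT outcomes for the same individual.
    The tuple layout is a hypothetical one chosen for this sketch.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # zone -> [n_tot, n_mic, n_rdt]
    for zone, mic_positive, rdt_positive in records:
        c = counts[zone]
        c[0] += 1
        c[1] += int(mic_positive)
        c[2] += int(rdt_positive)

    summary = {}
    for zone, (n_tot, n_mic, n_rdt) in counts.items():
        if n_tot < min_n:
            continue  # exclude estimates based on fewer than min_n individuals
        summary[zone] = {
            "n_tot": n_tot,
            "n_mic": n_mic,
            "n_rdt": n_rdt,
            "pfpr_mic": n_mic / n_tot,
            "pfpr_rdt": n_rdt / n_tot,
        }
    return summary


# Hypothetical toy data: 12 children in "Zone A", 4 in "Zone B".
toy_records = [("Zone A", i < 3, i < 5) for i in range(12)]
toy_records += [("Zone B", True, True)] * 4
zone_summary = aggregate_by_zone(toy_records)
```

Keeping the raw counts alongside the prevalence values matters here, because the errors-in-variables model works with the positive counts and totals rather than with the point estimates alone.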

Modelling the relationship between microscopy- and RDT-derived PfPR
Individual-level data on infection status were aggregated within ADMIN1 zones across sub-Saharan Africa, and microscopy-derived (PfPR MIC) and RDT-derived (PfPR RDT) infection prevalence were calculated for each zone. Any estimates from <10 individuals were excluded. A Bayesian probit regression was fitted to the resulting 458 pairwise PfPR observations with an errors-in-variables approach [19], which was employed to account for sampling error in both the RDT-derived and the microscopy-derived measurements. In the hierarchical Bayesian formulation of this model, $n_i^{\mathrm{MIC}}$ and $n_i^{\mathrm{RDT}}$ are the numbers of microscopy-positive and RDT-positive individuals, respectively, and $n_i^{\mathrm{tot}}$ is the total number of individuals tested in each of the $i = 1, \ldots, n_{\mathrm{obs}}$ site-specific aggregations; likewise, $p_i^{\mathrm{MIC}}$ and $p_i^{\mathrm{RDT}}$ are the long-run PfPR MIC and PfPR RDT; $\phi^{-1}(\cdot)$ denotes the inverse probit function that maps the linear predictor within the range of valid probabilities (zero to one); $\alpha$ and $\beta$ are the unknown parameters of the regression model; $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the Normal mixture component from which the $i$-th $\phi^{-1}(p_i^{\mathrm{RDT}})$ is drawn; each $(\mu_i, \sigma_i)$ pair is drawn from a single realisation, $F$, of the Dirichlet process, $\mathrm{DP}(G_\Theta, m)$, with reference density $G_\Theta$ (here a single Normal controlled by the hyper-parameters $\Theta$) and concentration index $m$; while $\pi$ represents a family of priors and hyperpriors chosen for conjugacy with the upper layers.
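The displayed model equations do not appear in this text, so the following is only a sketch of the hierarchy that the definitions above imply, not a reproduction of the original display. Two points are assumptions: that the regression is linear on the probit scale for both prevalences, and that $\phi^{-1}$ is read here as the transformation taking a probability to that scale, as in its use for "the $i$-th $\phi^{-1}(p_i^{\mathrm{RDT}})$". The first assumption is at least consistent with the later remark that $\alpha$ near 0 and $\beta$ near 1 correspond to a near one-to-one relationship.

```latex
\begin{align*}
n_i^{\mathrm{MIC}} \mid p_i^{\mathrm{MIC}} &\sim \mathrm{Binomial}\!\left(n_i^{\mathrm{tot}},\, p_i^{\mathrm{MIC}}\right), \\
n_i^{\mathrm{RDT}} \mid p_i^{\mathrm{RDT}} &\sim \mathrm{Binomial}\!\left(n_i^{\mathrm{tot}},\, p_i^{\mathrm{RDT}}\right), \\
\phi^{-1}\!\left(p_i^{\mathrm{MIC}}\right) &= \alpha + \beta\, \phi^{-1}\!\left(p_i^{\mathrm{RDT}}\right), \\
\phi^{-1}\!\left(p_i^{\mathrm{RDT}}\right) \mid \mu_i, \sigma_i &\sim \mathrm{Normal}\!\left(\mu_i,\, \sigma_i^{2}\right), \\
(\mu_i, \sigma_i) \mid F &\sim F, \\
F &\sim \mathrm{DP}\!\left(G_{\Theta},\, m\right), \\
(\alpha, \beta, \Theta, m) &\sim \pi .
\end{align*}
```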
The use of Bayesian inference with an errors-in-variables structure requires the specification of a prior for the baseline distribution of long-run RDT-derived prevalence, which in the above is handled by the highly flexible semi-parametric form known as the Dirichlet process mixture model [20]. To facilitate posterior simulation via Gibbs sampling for ordinary probit regressions, Albert and Chib [21] propose an augmented variable method based on the introduction of an additional unit-variance, Normal latent variable per binary response, i.e., a non-observed or inferred variable. Here this approach is extended to the errors-in-variables context with a further latent variable for each binary observation in the RDT-derived measurement, and the procedure is coupled to a Polya urn sampler for the Dirichlet process (from the DPpackage library in R [22]). Leave-one-out cross-validation was employed to evaluate the statistical performance of the fitted model in predicting the microscopy-derived prevalence from the RDT-derived prevalence.
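The leave-one-out procedure can be illustrated with a deliberately simplified stand-in for the fitted model. The sketch below is not the Bayesian errors-in-variables model: it assumes an ordinary least-squares line on the probit scale, and all function names and the toy prevalence values are hypothetical. It shows only the cross-validation loop and the three summary statistics (correlation, mean square error, mean absolute error).

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Normal: mean 0, standard deviation 1


def probit(p, eps=1e-6):
    """Map a prevalence in (0, 1) to the probit scale, clamping the extremes."""
    return _STD_NORMAL.inv_cdf(min(max(p, eps), 1.0 - eps))


def fit_line(x, y):
    """Ordinary least-squares intercept and slope (a stand-in for the real fit)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - beta * mx, beta


def loo_predictions(pfpr_rdt, pfpr_mic):
    """Predict each microscopy prevalence from a fit that excludes that point."""
    x = [probit(p) for p in pfpr_rdt]
    y = [probit(p) for p in pfpr_mic]
    predictions = []
    for i in range(len(x)):
        alpha, beta = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(_STD_NORMAL.cdf(alpha + beta * x[i]))
    return predictions


def summarise(observed, predicted):
    """Pearson correlation, mean square error and mean absolute error."""
    mo, mp = sum(observed) / len(observed), sum(predicted) / len(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    errors = [p - o for o, p in zip(observed, predicted)]
    return {
        "correlation": cov / sqrt(var_o * var_p),
        "mse": sum(e * e for e in errors) / len(errors),
        "mae": sum(abs(e) for e in errors) / len(errors),
    }


# Hypothetical zone-level prevalences, for illustration only.
toy_rdt = [0.05, 0.12, 0.30, 0.45, 0.60, 0.75]
toy_mic = [0.03, 0.08, 0.22, 0.35, 0.50, 0.66]
toy_pred = loo_predictions(toy_rdt, toy_mic)
toy_metrics = summarise(toy_mic, toy_pred)
```

With the real model, each left-out prediction would come from refitting the full hierarchy on the remaining aggregations, which is considerably more expensive than this closed-form stand-in.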

Analysis of factors influencing the relationship between PfPR MIC and PfPR RDT
To explore possible factors driving residual noise in the relationship between microscopy- and RDT-derived prevalence, the Bayesian regression analysis described above was repeated on various stratified sub-sets of the data: by fever status, treatment status, RDT type, and RDT brand (see Fig. 2 for more details). For each factor, pre- and post-stratification Akaike information criterion (AIC) scores [23] were compared to determine whether the stratification substantially improved the net information content of the fitted regression model.
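The logic of the pre- versus post-stratification comparison can be sketched with the standard definition AIC = 2k − 2 ln L, where k is the number of fitted parameters and L the maximised likelihood. The text does not describe how the AIC was evaluated for the Bayesian fits, so the example below is an assumption-laden toy: it compares a single pooled binomial prevalence against one prevalence per stratum, using hypothetical counts, rather than the regression model itself.

```python
from math import lgamma, log


def binomial_log_lik(n_pos, n_tot, p):
    """Log-likelihood of n_pos positives out of n_tot tests at probability p."""
    log_choose = lgamma(n_tot + 1) - lgamma(n_pos + 1) - lgamma(n_tot - n_pos + 1)
    return log_choose + n_pos * log(p) + (n_tot - n_pos) * log(1.0 - p)


def aic(log_lik, n_params):
    """Akaike information criterion: lower values indicate the preferred model."""
    return 2 * n_params - 2 * log_lik


# Hypothetical strata (e.g., febrile and non-febrile): (positives, total tested).
strata = [(10, 100), (40, 100)]

# Pooled model: one shared prevalence, i.e., one parameter.
pooled_p = sum(pos for pos, _ in strata) / sum(tot for _, tot in strata)
pooled_log_lik = sum(binomial_log_lik(pos, tot, pooled_p) for pos, tot in strata)
pooled_aic = aic(pooled_log_lik, n_params=1)

# Stratified model: one prevalence per stratum, i.e., one parameter each.
# Log-likelihoods add across independent strata, and so do parameter counts.
stratified_log_lik = sum(binomial_log_lik(pos, tot, pos / tot) for pos, tot in strata)
stratified_aic = aic(stratified_log_lik, n_params=len(strata))

stratification_favoured = stratified_aic < pooled_aic
```

The penalty term is what makes the comparison meaningful: a stratified model always fits at least as well as the pooled one, so it is preferred only when the gain in likelihood outweighs the extra parameters.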

Results
Modelled PfPR MIC versus PfPR RDT relationship
Figure 3a shows the aggregated prevalence data points and their associated uncertainties, the (point-wise) median and 95 % credible interval curves from the posterior of the probit regression model, as well as the line of equality for reference. A systematic tendency for PfPR RDT to exceed PfPR MIC when measured in the same population was observed: most (71.6 %) points in Fig. 3a lie below the line of equality, indicating that RDTs tend to give more false positives than false negatives. The best-fit regression function relating PfPR RDT to PfPR MIC was:
The results from the leave-one-out cross-validation procedure applied to this model are shown in Fig. 3b. The correlation coefficient between the observed and predicted PfPR MIC was 0.921, the mean square error was 0.84 %, and the mean absolute error was 5.77 %, indicating a well-performing model with strong predictive capacity.

Analysis of factors influencing the relationship between PfPR MIC and PfPR RDT
Figure 4a shows the data stratified by individual fever status and the resulting fitted curves. The prevalence rate estimated by RDT amongst non-febrile children is more closely aligned with the prevalence estimated by microscopy than that amongst febrile children. The difference in AIC scores between the original (non-stratified) model and that stratified by fever status confirms that the stratified model performs substantially better (see Table 2). Recent treatment with ACT has a marked effect on the observed relationship between microscopy- and RDT-derived prevalence, as seen in Fig. 4b, greatly increasing PfPR RDT relative to PfPR MIC. The 'taken an ACT' subset relationship is less precisely determined (note the larger spread and larger credible intervals in Fig. 4b). In addition to the smaller sample size, this likely reflects the irregular occurrence of false positivity due to the persistence of antigens after anti-malarial treatment and parasite clearance [16,17]. Although the stratified model is preferred for its lower AIC score (see Table 2), the unpredictability of the false positivity among the children who had taken an ACT indicates that RDT diagnosis is not ideal and that microscopy would provide a more conclusive diagnosis.
In contrast, there is far less divergence between the prevalence rates estimated by RDTs detecting HRP2 alone compared with HRP2 and pan-pLDH, or HRP2 and P. vivax-pLDH, as seen in Fig. 4c, and indeed the stratified model in this case is not favoured by the AIC score. Therefore, there is no benefit in stratifying by RDT type. Figure 4d shows the relationship between PfPR MIC and PfPR RDT by RDT brand. CareStart (HRP2/pan-pLDH) has a near one-to-one relationship (with parameter values estimated near α = 0 and β = 1), whereas First Response (HRP2) and SD Bioline (HRP2) diverge towards PfPR RDT. The AIC score of the RDT brand stratified model is significantly lower than that of the non-stratified model, favouring stratification. The combination of febrile status and RDT brand was additionally explored. The AIC score of the model stratified by febrile status with RDT brand was lower than that of the non-stratified model, indicating that the stratified model is preferred (see Table 2). The combination of ACT usage and RDT brand could not be explored with this dataset because of the low sample sizes, by RDT brand, for children who had taken an ACT.

Fig. 2 The stratified sub-sets of the data based on fever status, ACT usage, RDT type, and RDT brand
In summary, the AIC scores are lower for the models stratified by febrile status, ACT usage among febrile individuals, RDT brand, and febrile status with RDT brand than for the non-stratified model, and these are thus considered the preferred models. When applying the conversion formula to RDT-derived prevalence from a cross-sectional population survey dataset, if the febrile status of the population, ACT usage among febrile individuals, or the RDT brand used is known, the preferred method for estimating the equivalent microscopy prevalence is therefore to apply the model specific to that population. The resulting regression function parameters with 95 % credible intervals for each of the data sub-sets are listed in Table 3.

Discussion
The models developed in this paper provide a means of converting RDT-derived PfPR measurements into estimates compatible with microscopy-derived PfPR or vice versa. As such, the approach provides an indispensable tool for data standardization, designed to decrease the overall uncertainty associated with models that utilize PfPR data derived using differing diagnostic techniques. This research also illustrates the utility of ancillary factors when converting PfPR metrics to reduce the residual noise. Those wishing to apply the conversions developed here can do so using the coefficients in Table 3.
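As a minimal sketch of how such a conversion might be applied, the function below assumes that the fitted relationship is linear on the probit scale, i.e., probit(PfPR MIC) = α + β · probit(PfPR RDT). That functional form is an assumption inferred from the model description, the function names are hypothetical, and the coefficient values are placeholders for illustration only; they are not values from Table 3.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Normal: mean 0, standard deviation 1


def _check_prevalence(p):
    """Reject values at or outside 0 and 1, where the probit is undefined."""
    if not 0.0 < p < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")


def rdt_to_microscopy(pfpr_rdt, alpha, beta):
    """Convert an RDT-derived prevalence to its microscopy equivalent.

    Assumes probit(PfPR_MIC) = alpha + beta * probit(PfPR_RDT).
    """
    _check_prevalence(pfpr_rdt)
    return _STD_NORMAL.cdf(alpha + beta * _STD_NORMAL.inv_cdf(pfpr_rdt))


def microscopy_to_rdt(pfpr_mic, alpha, beta):
    """Invert the same assumed relationship (the 'vice versa' direction)."""
    _check_prevalence(pfpr_mic)
    return _STD_NORMAL.cdf((_STD_NORMAL.inv_cdf(pfpr_mic) - alpha) / beta)


# Placeholder coefficients, NOT taken from Table 3.
ALPHA_EXAMPLE = -0.2
BETA_EXAMPLE = 1.0

converted = rdt_to_microscopy(0.40, ALPHA_EXAMPLE, BETA_EXAMPLE)
round_trip = microscopy_to_rdt(converted, ALPHA_EXAMPLE, BETA_EXAMPLE)
```

A negative intercept with a slope near 1, as in the placeholder values, lowers the converted prevalence relative to the RDT input, which matches the reported tendency of PfPR RDT to exceed PfPR MIC. Where the febrile status, ACT usage or RDT brand of the surveyed population is known, the coefficients of the corresponding stratified model would be passed in instead.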
The models were derived using a hierarchical Bayesian framework implementing an errors-in-variables probit regression with a highly flexible, semi-parametric prior (the Dirichlet process mixture model) for marginalizing over uncertainty in the distribution of RDT-derived prevalence. The statistical reliability and predictive performance of this model have been demonstrated through leave-one-out cross-validation.
The field setting of the joint microscopy and RDT prevalence measurements in the DHS dataset distinguishes the present analysis from previous formal verification studies for RDT accuracy [14]; thus the results presented offer a complementary picture of RDT performance outside of controlled conditions, but require care in their interpretation. For instance, while one may be confident in supposing that the observed strength of recent treatment with ACT as a factor for overdiagnosis by RDT surveillance is likely due to the known lag between asexual parasite clearance and the antigenic response targeted by RDTs [24], the apparent role of febrile status in RDT overdiagnosis is less easily explained. Previous studies have also observed higher RDT false positives among febrile patients [25,26]. Uncovering the causal relationship will be important for the accuracy of analyses forecasting the likely costs and benefits of different designs for the diagnosis and treatment strategies of proposed mass screen and treat campaigns [27]. In the absence of further data, some insight into the epidemiological processes behind the role of febrile status in RDT positivity rates may be gained from in silico experiments with mechanistic transmission codes if RDT measurement can be modelled immunologically, rather than by proxy through asexual density. Similar caveats apply to the interpretation of the results for the role of RDT brand in shaping the observed relationship. The model incorporates random sampling noise for both the RDT- and microscopy-derived measurements, and the systematic noise of the RDT-derived measurements is addressed by exploring the factors that may influence the relationship. However, the systematic noise of the microscopy-derived measurements has not been explored here. There may be factors that introduce systematic noise into the relationship, such as the condition of the microscope and the microscopist's training and judgement [11]. The data used here are from a consistent resource that uses a strict microscopy protocol in national facilities [18] and, therefore, the influence of these factors is likely to be minimal. Caution should, however, be exercised when applying the presented regression function to microscopy-derived prevalence data from inconsistent sources with large variation in their diagnostic protocols.

Fig. 4 The fitted regression functions with the 95 % credible intervals overlaid on the paired microscopy- and RDT-derived prevalence data points with their individual error-bounds, for the following stratifications: a febrile (dark blue) and non-febrile children (green); b febrile children who have taken an ACT in the last 2 weeks (purple) and febrile children who have not taken an ACT in the last 2 weeks (red); c RDT detected HRP2 (red), RDT detected HRP2 and pan-pLDH (dark blue) and RDT detected HRP2 and P. vivax-pLDH (yellow); d RDT brand CareStart detected HRP2 and pLDH (yellow), RDT brand First Response detected HRP2 and pan-pLDH (dark blue), RDT brand First Response detected HRP2 (green), RDT brand Paracheck detected HRP2 (red), RDT brand SD Bioline detected HRP2 and pan-pLDH (purple), RDT brand SD Bioline detected HRP2 and P. vivax-pLDH (dark grey) and RDT brand SD Bioline detected HRP2 (dark green). The line of equality is included in each plot for reference.
There have been reports in some regions, mainly in South America, of P. falciparum with absent expression of HRP2 affecting RDT performance [28]. Such reports have been rare in sub-Saharan Africa [29,30] and are, therefore, unlikely to affect the analysis presented here; however, caution should be applied when extending this analysis to other geographic regions.
Previous evaluations of RDT diagnoses in field settings at specific sites (covering a limited range of transmission conditions) have identified a strong dependence of RDT specificity on age, such that overdiagnosis is most common amongst young children and much less common amongst adults [31]. The results presented here are based on the DHS Programme data for children only; hence, the fitted relationships cannot be certified for conversion of all-age prevalence survey data.