Different methodological approaches to the assessment of in vivo efficacy of three artemisinin-based combination antimalarial treatments for the treatment of uncomplicated falciparum malaria in African children

Background Use of different methods for assessing the efficacy of artemisinin-based combination antimalarial treatments (ACTs) will result in different estimates being reported, with implications for changes in treatment policy. Methods Data from different in vivo studies of ACT treatment of uncomplicated falciparum malaria were combined in a single database. Efficacy at day 28 corrected by PCR genotyping was estimated using four methods. In the first two methods, failure rates were calculated as proportions with either (1a) reinfections excluded from the analysis (standard WHO per-protocol analysis) or (1b) reinfections considered as treatment successes. In the second two methods, failure rates were estimated using the Kaplan-Meier product limit formula using either (2a) WHO (2001) definitions of failure, or (2b) failure defined using parasitological criteria only. Results Data analysed represented 2926 patients from 17 studies in nine African countries. Three ACTs were studied: artesunate-amodiaquine (AS+AQ, N = 1702), artesunate-sulphadoxine-pyrimethamine (AS+SP, N = 706) and artemether-lumefantrine (AL, N = 518). Using method (1a), the day 28 failure rates ranged from 0% to 39.3% for AS+AQ treatment, from 1.0% to 33.3% for AS+SP treatment and from 0% to 3.3% for AL treatment. The median [range] difference in point estimates between method 1a (reference) and the others were: (i) method 1b = 1.3% [0 to24.8], (ii) method 2a = 1.1% [0 to21.5], and (iii) method 2b = 0% [-38 to19.3]. The standard per-protocol method (1a) tended to overestimate the risk of failure when compared to alternative methods using the same endpoint definitions (methods 1b and 2a). It either overestimated or underestimated the risk when endpoints based on parasitological rather than clinical criteria were applied. The standard method was also associated with a 34% reduction in the number of patients evaluated compared to the number of patients enrolled. Only 2% of the sample size was lost when failures were classified on the first day of parasite recurrence and survival analytical methods were used. Conclusion The primary purpose of an in vivo study should be to provide a precise estimate of the risk of antimalarial treatment failure due to drug resistance. Use of survival analysis is the most appropriate way to estimate failure rates with parasitological recurrence classified as treatment failure on the day it occurs.

Using method (1a), the day 28 failure rates ranged from 0% to 39.3% for AS+AQ treatment, from 1.0% to 33.3% for AS+SP treatment and from 0% to 3.3% for AL treatment. The median [range] difference in point estimates between method 1a (reference) and the others were: (i) method 1b = 1.3% [0 to24. 8 The standard per-protocol method (1a) tended to overestimate the risk of failure when compared to alternative methods using the same endpoint definitions (methods 1b and 2a). It either overestimated or underestimated the risk when endpoints based on parasitological rather than clinical criteria were applied. The standard method was also associated with a 34% reduction in the number of patients evaluated compared to the number of patients enrolled. Only 2% of the sample size was lost when failures were classified on the first day of parasite recurrence and survival analytical methods were used.

Conclusion:
The primary purpose of an in vivo study should be to provide a precise estimate of the risk of antimalarial treatment failure due to drug resistance. Use of survival analysis is the most appropriate way to estimate failure rates with parasitological recurrence classified as treatment failure on the day it occurs.

Background
Studies of antimalarial drug efficacy remain the primary source of evidence for treatment policy decisions. How these studies should be conducted and interpreted is still a subject for debate. Contentious methodological issues include length of follow-up after treatment, whether to use clinical or parasitological outcomes and which analytical methods are the most appropriate [1][2][3][4]. Study designs and data analysis vary; this makes combining and comparing data difficult. The World Health Organization (WHO) has made considerable efforts to standardize methods for the assessment of antimalarial drug efficacy over the last 40 years but these have changed several times, in line with the prevailing opinion at the time. The first in vivo test was developed in 1965 and was designed for the assessment of chloroquine efficacy against falciparum malaria using strictly defined parasitological end points. The protocol was revised in 1967 and 1972, with patients followed up for 28 days and kept in a mosquitofree environment to prevent re-infection. Subsequent modifications to the protocol in 1996 permitted a shorter length of follow-up of 14 days in areas of high transmission, where the main endpoint changed from parasitological clearance without subsequent recrudescence to adequate clinical response, i.e. a patient in whom parasites reappeared without symptoms was still regarded as 'cured'. In 2001, the length of follow-up was increased to 28 days in areas of intense transmission, if PCR genotyping was available, and the concept of late treatment failure incorporating clinical or parasitological failures was introduced. Asymptomatic patients with parasitological failure were followed from the day of reappearance to the last day of follow-up when they were treated if parasites were still present [5][6][7]. In a more recent document, 'Susceptibility of Plasmodium falciparum to antimalarial drugs', an update of the therapeutic efficacy test and modification of the protocol have been proposed with late parasitological failure defined as "presence of parasitaemia between day 7 and day 28 with temperature ‹37.5°C, without the patient previously meeting any of the criteria of early treatment failure or late clinical failure" [8]. Definitions of endpoints are summarized in Figure 1.

Analysis of efficacy data
The recommended method for analysis of efficacy data changed in the 2001 WHO protocol to survival analysis. In the 2001 report, the life-table method is mentioned and, in the more detailed 2003 guidelines, there is an annex enabling users to calculate Kaplan-Meier survival curves by hand. In order for data to be comparable with earlier results, it was advised that "data [also] be analysed using the traditional 'per-protocol' approach", i.e. failure rates expressed as proportions with exclusion of patients lost from the study for any reason other than genotypingconfirmed recrudescence before the last day of follow-up. An Excel™ programme was developed to enable users to get their results analysed with per-protocol and Kaplan-Meier methods. In practice, survival analysis has not been used widely, but consensus appears to be growing that it is the method of choice [1,4,9].
Polymerase chain reaction genotyping PCR genotyping allows the reappearance of parasites during follow-up to be classified as a new or recrudescent infection. However the technique is not without its limitations. Methods are not standardized with different labo-ratories using different numbers of genotypic markers and different methods of interpretation [10]. Distinguishing a new from a recrudescent infection is particularly challenging in areas of high transmission where multiple genotype infections are common [11]. Identification of an identical genotype during follow-up is not absolute proof of recrudescence just as identification of a different genotype could indicate recrudescence of a previously undetected minority genotype. The WHO has recommended since 2003 that for data analysis "re-infections (as well as PCR unclassifiable results) be classified as involuntary withdrawals" and, therefore, excluded from the analysis.

Implications of using different analytical methods
The differences which may result from use of different methods of analysis are important because results of in vivo studies using the WHO protocol are used to guide changes in treatment policy. Current recommendations (WHO 2006) are that a new antimalarial drug combination should be 95% efficacious when deployed and change should be considered when efficacy falls to 90% [12]. All studies underwent ethical review and were approved by the relevant national authorities [13][14][15][16][17][18][19][20][21][22][23][24][25][26].

Study objectives
These data were combined for this analysis with the following objectives-(a) to examine the effect of using different definitions of treatment failure and methods of analysis on estimates of PCR-corrected failure rates.
(b) to describe the utility of PCR genotyping to classify parasite recurrences in in vivo studies.

Criteria for inclusion or exclusion of data
Studies were included if the treatment given was a supervised therapeutic regimen of three days of an artemisinin derivative combined with another antimalarial, and complied with the WHO 2001 guidelines, with follow-up for at least 28 days and PCR genotyping to distinguish reinfections from recrudescences.

Overview of design of studies and methods
In brief, children aged 6-59 months were eligible for enrolment if they had P. falciparum mono-infection (density threshold at entry varying with study site from a minimum of 500 to 2,000/μL up to a maximum of 100,000 or 200,000/μL), met none of the criteria for severe malaria, no reported hypersensitivity to the study drug, no serious  concomitant febrile illness and consent given by accompanying caregivers. Subjects were re-assessed clinically and parasitologically on days 1, 2, 3, 7, 14, 21 and 28, or any day in-between in case of illness. In the studies conducted by Epicentre/MSF, when asymptomatic parasitaemia occurred after day 3, children were monitored closely and administered a rescue antimalarial (quinine), if fever was measured in the clinic, or if they were still parasitaemic without symptoms on day 28. In the UCSF collaborative studies cases of recurrent parasitaemia with a history of fever were treated.

PCR genotyping methods
PCR genotyping was used to classify parasite recurrences occurring after day 14 of follow up. All recurrences occurring on or before day 14 were considered to be recrudescent infections. Genotyping was done at different laboratories that used different extraction methods, primers and reaction conditions (Table 1). For the Epicentre/ MSF studies, cases in which the post-treatment sample featured all or some of the pre-treatment alleles were considered recrudescent infections; cases in which pre-and post-treatment genotypes were entirely different were considered as new infections. This approach was modified to classify recurrences of parasitaemia in the UCSF collaborative studies as follows: (i) PCR outcomes defined as "recrudescence + re-infection" were reclassified as recrudescence if more the half the alleles on the day of failure were present on day 0 or as new infections for the remainder (Kampala, Uganda).
(iii) PCR outcomes defined as "recrudescence + re-infection" were classified as new infections. MSP1 was also done for patients defined as "recrudescence" by MSP2. The final classification of recrudescence required that all of the alleles on the day of failure had to be present on day 0 for both MSP2 and MSP1 (Tororo 2005, Uganda).
(iv) MSP2, MSP1 and four microsatellites were used for all LCF and LPFs. To be classified as a recrudescence there had to be at least one allele present on the day of failure which was present on day 0 for all six loci (Burkina Faso 2005).

Data analysis
Databases were imported into Stata 9.0 (Stata Corp., Texas, USA) and the variables of interest selected which were recoded according to a pre-determined list. Individual databases were verified against existing study reports if available. Discrepancies were resolved by consultation with the study principal investigators. Individual databases were merged into a single database for analysis.

Evaluation of PCR genotyping methods
PCR genotyping results for children with a parasite recurrence after day 14 were classified as either resolved by PCR (recrudescence or re-infection), or unresolved by PCR, and the reason recorded (missing sample, failure to extract DNA, indeterminate result).
The use-effectiveness of the PCR genotyping method was assessed, defined as the number of children with parasite recurrence resolved by PCR divided by the total number of children with recurrences after day 14 [27].

Efficacy analyses
Four different analyses were done. Figure 2 presents the population selected and the methods used for each analysis. The populations differ depending on both the definition of failure and the analytical method used. Efficacy results are presented as point estimates for each drug in each study without making comparisons between treatment arms in the same study. Patients wrongly randomized, who did not meet study inclusion criteria were excluded from all analyses. Withdrawals for other reasons were dealt with in different ways depending on the analytical method used as explained below. Failure rates were calculated as proportions (methods 1a and 1b) or estimated using survival analytical methods (methods 2a and 2b) as follows-(i) Day 28 failure rates calculated as proportions using a per-protocol dataset-Patients with no outcome on day 28 (losses to follow up and withdrawals) and patients with recurrence unresolved by PCR were excluded from these analyses. New infections during the follow-up were treated in 2 ways: • Method 1a: individuals with new infections excluded from the analysis. This is the standard per-protocol analysis reported by most investigators following the WHO protocol and for convenience will be referred to as the "WHO per protocol method" to indicate failure rates calculated as proportions and endpoints defined as in the WHO (2001) protocol.
(ii) Day 28 failure rates estimated by the Kaplan-Meier product limit formula (survival analysis) using the dataset of correctly randomized patients -  (2001)). Patients classified as LCF with a reinfection were censored. Patients with unresolved PCR results were excluded (n = 74).
• Method 2b: The definition of failure included all early treatment failures plus late treatment failures (LTF), which were defined on day 7 in patients whose initial parasitaemia had not cleared, or on the first day of any parasite recurrence during follow-up (including day 7 or earlier), irrespective of symptoms. For this analysis, patients with reinfections were censored on the day of parasite reappearance and those with indeterminate or missing genotyping results were censored on the last visit with a negative malaria smear result documented [4]. The date the PCR was taken was not the date of parasite recurrence in all cases and genotyping results were missing for a small number of participants (N = 16) because their recurrent parasites had resolved without further treatment by day 28, and thus using the WHO (2001) classification they had been classified as ACPRs at the time the study was done. These patients were censored at the time of the previous visit when the malaria smear was negative.

Effect of method of assessment of sample size evaluated
The numbers of patients included in the four efficacy analyses were compared and the percentage of patients excluded from the enrolled patient population calculated for each method ( Figure 2).

Results
Out of 21 studies in which 4,684 patients were in an ACT treatment arm, 17 were included in the analysis representing a total of 2926 children treated with an ACT. Reasons for excluding studies or patients were major deviations from the WHO protocol (1257 patients), patients greater than five years of age enrolled (335), or unacceptable laboratory quality control results (166 patients in one study). Most studies (12/17) were in areas described as hyperendemic for malaria.

Results of efficacy analyses
Efficacy results are summarized in Tables 2, 3 and 4. Using method 1a (per protocol) day 28 failure rates ranged from 0% to 39.3% for AS+AQ treatment ( Classifying patients as failures on day 7 if initial parasitaemia had not cleared, or on the first day of parasite recurrence, regardless of symptoms (method 2b) increased the number of treatment failures (uncorrected by genotyping) by 75, from 1015 to 1090 but it should be noted that the majority of these patients (50) were from one site (Boende). Figure 2 illustrates how the choice of definitions and methods for assessment of efficacy (corrected by PCR) affected the number of patients assessed in the final efficacy evaluation. The biggest reduction in sample size of 34% occurred using the per-protocol analysis (method 1a). The smallest reduction in sample size occurred when method 2b was applied.

PCR findings
Recurrence of parasitaemia between days 14 and 28 occurred in 942 (32%) children. Of these, 77 (8.3%) patients could not be classified because: (i) 18 samples were missing, (ii) in 31 samples it was not possible to amplify any parasite DNA, (iii) DNA was amplified but no classification could be given (indeterminate) in 26 samples, and (iv) the reason was unknown for two samples.

Discussion
This analysis of pooled data has shown that failure rates can vary markedly depending on the analytical method used. The point estimates of failure calculated using the per-protocol analysis compared to methods 1b and 2a were the highest. This systematic overestimation is attributable to the effect of excluding patients with new infec-  tions from the analysis. Predictably, the magnitude of the difference from the estimate obtained using method 1b varied according to the number of patients re-infected during follow-up, hence intensity of transmission and the post-treatment prophylactic effect of the drug. In these studies, the effect of data loss due to patient loss to follow up (2%) and withdrawals (4%) was small over 28 days.
If the results of the survival analysis using the same definitions of failure (WHO 2001) are compared to the results of the per protocol analysis the median difference in the point estimates for failure at day 28 was small at 1.1% but ranged from 0 to 21.5%. Again, the difference was greater at the sites with the highest numbers of parasite recurrences, whether related to transmission intensity and more reinfections or a higher number of recrudescent infections. The precision of those estimates generated by the per protocol analysis was lower. Comparisons of precision cannot be made against methods 1b or 2b because the endpoint classifications were different. The methods for assessing antimalarial drug efficacy continue to evolve. The quality of the data generated by an in vivo study depends on the study design and execution, the quality of the laboratory and the interpretation of the genotyping results. For the efficacy analysis, there is a growing consensus that the Kaplan-Meier survival analysis is the optimal method because it retains the maximal amount of data for analysis and avoids the inaccurate overestimation of failure rates that will occur when large numbers of patients are excluded from the analysis. This overestimation is likely to increase as a drug's efficacy declines with a consequent increase in the number of recrudescences and new infections. These results support expert opinion that survival analysis should become the standard way of analysing data from in vivo tests and should replace the tradi-tional per-protocol method. The per-protocol method was associated with a 34% reduction in sample size, compared to a < 10% reduction for the other methods. However, it is true that while survival analysis maximizes the sample size, some of the patients will only have completed a trivial period of follow up. Regarding re-infections as treatment successes (method 1b) preserves the sample size but leads to an underestimation of the failure rate and is not recommended.
Efficacy of AS+AQ was below 90% at 7/15 sites and of AS+SP at 4/8 sites and the lower 95% CI was below this level at a further four and two sites, respectively. Efficacy of AL was high in the small number of patients studied.
The reduced efficacy of AS+AQ and AS+SP could be explained by the use of the more slowly eliminated partner drug as monotherapy before and during the deployment of the combinations. An alternative explanation for these results is different methods of classification by PCR genotyping. The results reported here suggest that PCR genotyping performed well as a discriminatory test inasmuch as the majority of patients were classified, but the results are only as valid as the interpretation algorithms used to classify them. Differences in performance were observed between different laboratories, which did not appear to be related to the number of markers used. PCR outcomes defined as recrudescence + re-infection were classified as new infections by laboratories in certain studies but not in others as described in the methods. Current genotyping techniques, even using up to six markers, are not sufficiently discriminatory in areas of high transmission where multiple genotype infections are common [11]. Standardization of genotyping techniques and interpretation rules should reduce this inter-laboratory variation. There is evidence from genotypic studies that reinfections can occur before day 14 [28]. More work needs to be done in this area and the current guidelines, stating that only parasite recurrences occurring after this day should be genotyped, may need to be revised.
These results support the recommendation for a minimum follow-up period of 28 days in the assessment of efficacy of ACTs, with PCR-genotyping of all recurrences and the use of the Kaplan-Meier product-limit formula to estimate efficacy of treatment in all correctly randomized patients.
Accurate reporting of methods by investigators is essential, in particular definitions of endpoints and criteria for censoring all categories of patients who do not complete follow-up. It is recommended that the classification of treatment failure is based on parasitological outcomes only and that all parasitological recurrences be treated straightaway in view of the high likelihood that patients will go on to develop symptoms [29]. Further standardi-zation of definitions and analytical methods will facilitate combining data in this critical period post ACT-deployment, allowing trends in efficacy to be monitored at a global level [9,30].
The WHO (2001) protocol was not designed to be a comparative study protocol, but rather a single-arm in vivo assessment to guide treatment policy. Such studies should be powered to estimate the effect size with adequate precision. However, if an efficacy assessment is being done because of a suspicion a drug is failing it makes sense to evaluate a potential replacement at the same time in an adequately powered, randomized, controlled trial. This will provide good evidence to guide national programmes in their drug selections.
Whether in vivo efficacy monitoring data should be used as the only determinant for policy change is debatable, if a choice of highly efficacious therapies becomes available. When comparing treatments of similar efficacy other factors determining programmatic effectiveness become more important to policy makers such as simplicity of the regimen, formulation, tolerability, adherence, cost and post-treatment prophylactic effect.