Estimation of malaria haplotype and genotype frequencies: a statistical approach to overcome the challenge associated with multiclonal infections

Background Reliable measures of anti-malarial resistance are crucial for malaria control. Resistance is typically a complex trait: multiple mutations in a single parasite (a haplotype or genotype) are necessary for elaboration of the resistant phenotype. The frequency of a genetic motif (proportion of parasite clones in the parasite population that carry a given allele, haplotype or genotype) is a useful measure of resistance. In areas of high endemicity, malaria patients generally harbour multiple parasite clones; they have multiplicities of infection (MOIs) greater than one. However, most standard experimental procedures only allow measurement of marker prevalence (proportion of patient blood samples that test positive for a given mutation or combination of mutations), not frequency. It is misleading to compare marker prevalence between sites that have different mean MOIs; frequencies are required instead. Methods A Bayesian statistical model was developed to estimate Plasmodium falciparum genetic motif frequencies from prevalence data collected in the field. To assess model performance and computational speed, a detailed simulation study was implemented. Application of the model was tested using datasets from five sites in Uganda. The datasets included prevalence data on markers of resistance to sulphadoxine-pyrimethamine and an average MOI estimate for each study site. Results The simulation study revealed that the genetic motif frequencies that were estimated using the model were more accurate and precise than conventional estimates based on direct counting. Importantly, the model did not require measurements of the MOI in each patient; it used the average MOI in the patient population. Furthermore, if a dataset included partially genotyped patient blood samples, the model imputed the data that were missing. Using the model and the Ugandan data, genotype frequencies were estimated and four biologically relevant genotypes were identified. Conclusions The model allows fast, accurate, reliable estimation of the frequency of genetic motifs associated with resistance to anti-malarials using prevalence data collected from malaria patients. The model does not require per-patient MOI measurements and can easily analyse data from five markers. The model will be a valuable tool for monitoring markers of anti-malarial drug resistance, including markers of resistance to artemisinin derivatives and partner drugs.

.1: A directed acyclic graph representing the model quantities and their conditional dependencies. Solid arrows represent probabilistic dependencies, whereas the dashed arrow represents a deterministic dependency. Circles denote unknown quantities. The squares represents the data per patient blood sample, y i , where i = 1, · · · , q include no missing data. The hexagons denote the data per blood samples, y i , where i = q + 1, · · · , n are incomplete (unsuccessful genotyping outcome at one or more SNPs). The diamonds represent the hyperparameters of the priors. The repetitive structure of the n blood samples within the dataset is represented by the stacked rectangles and hexagons. The joint distribution of the data and random quantities is P r (Y , A, π, m) = P r (Y |A) P r (A|m, π) P r (m) P r (π) .
Suppose r ≤ 2 s genotypes are compatible with the observed data, Y . Let the n by r matrix A be the matrix in which the unobserved patient genotype counts are stored 5 . The genotype counts represent the numbers of clones characterised by each of the 1 to r genotypes in each of the n blood samples. The total genotype count per blood sample is equal to the total number of clones present in the blood sample, which is the multiplicity of infection (MOI). Let m = (m 1 , · · · , m n ) denote the MOIs for the n blood samples, so the sum of the ith row of A is m i (in other words n i=1 a i = m i ). Akin to the genotype counts, they are unobserved. Let π = (π 1 , · · · , π r ) denote the vector of genotype frequencies. π 1 , · · · , π r are estimates of the clonal genotype frequencies (the proportions of parasite clones in the Plasmodium falciparum population that carry genotypes 1 to r). The model quantities and their conditional dependencies are depicted in Figure A1.1. A list of notation is included in Section 1.2. A hypothetical example, complete with unobserved variables and mathematical notation, is included in Section 2.
1.2 List of notation n number of blood samples in the dataset.
s number of SNPs genotyped.
r number of genotypes compatible with the observed data (r ≤ 2 s ).
H k by r matrix of theoretically possible genotype counts given a specified MOI, m.
The matrix H is similar to A, in that it is a matrix of genotype counts; however, H contains all possible genotype counts given m, whereas A contains the genotype counts for the n blood samples in the dataset. For example, suppose two SNPs are genotyped, such that r = 4, since there are four possible genotypes (00, 10, 01, 11), and m = 2, then Distinct MOI values yield different H matrices because the number of theoretically possible combinations of genotype counts, k, is a function of the specified MOI, k = f (m) (see above).
λ hyperparameter on the prior of m (see Section 3).

Hypothetical example
Observed data Unobserved Blood sample SNP 1 SNP 2 SNP 3 MOI Genotype counts † Patient 1 (1, 0, 1, 1, 0, 0, 0, 0) Table A1.1: A hypothetical dataset based on five malaria patients. The observed data are shown on the left of the vertical division. For a given SNP, '0' denotes the detection of sensitive markers only, '1' denotes the detection of resistant markers only, '0.5' denotes the simultaneous detection of both sensitive and resistant markers and '99' denotes a missing datum (due to a failed genotyping outcome, for example).The genotype counts and MOIs give rise to the observed data, but they themselves are unobserved, and are hence shown on the right of the vertical division. † Genotype counts are represented by (n (000) , n (100) , n (010) , n (001) , n (110) , n (101) , n (011) , n (111) ), where n (000) is the number of clones with genotype 000. For a given genotype '0' denotes a single sensitive marker and '1' denotes a single resistant marker. For example, the genotype count (1, 0, 0, 0, 0, 0, 0, 1) indicates that the fourth patient blood sample contains two clones, one with genotype 000 and one other with genotype 111.
Suppose five hypothetical blood samples are genotyped at three SNPs (Table A1.1).

Model specification
In the above hypothetical example, the genotype frequencies were trivial to calculate because the genotype counts and MOIs were known. In reality, however, the genotype counts and MOIs are unobserved and unknown. To infer the genotype frequencies, a model, which sums over and all possible MOIs, is fit to the observed data. Let H be the matrix in which all possible patient level vectors of genotype counts given a specified MOI is stored (see Section 1.2). The posterior probability of the genotype frequencies, π = (π 1 · · · π r ), conditional on the observed data, Y , is thus given by: P r (y i |h g ) is modelled using an identity function: For example, if y i = (0.5, 0, 0) and h g = (7, 1, 0, 0, 0, 0, 0, 0), where h g = (n (000) , n (100) , n (010) , n (001) , n (110) , n (101) , n (011) , n (111) ) and n (100) is the number of clones with genotype 100 etc., then I (y i |h g ) = 1.
Each patient level vector of genotype counts, h g , is assumed to be a realisation of size m i from a multinomial distribution with probability π. The prior on π is Dirichlet, where α 1 = α 2 = · · · = α r = 1. The prior on m i is one of four possible distributions: Uniform, where m i ∈ {1, m max }; a truncated Poisson distribution, where m i ∈ {1, m max }, and the parameter of the distribution is set equal to the reported mean MOI; a truncated negative Binomial distribution, where m i ∈ {1, m max }, and the mean parameter of the distribution is set equal to the reported mean MOI and the dispersion factor is 1 2 ; or a truncated Geometric distribution, where m i ∈ {1, m max }, and the parameter of the truncated Geometric distribution is set equal to the reciprocal of the reported mean MOI. The prior that provides the best model fit (see Section 4) is selected. In practice, the normalising constants that result from truncation cancel in the Metropolis-Hastings step (Section 3.1). The truncated term is thus dropped from descriptions in Additional file 2.

Sampling algorithm
The double summation over the possible patient level vectors of genotype counts and MOI values is computationally expensive. The model approximates P r (π|Y ) by averaging over successive samples of patient genotype count vectors, a i , and MOIs, m i . The sampling procedure consists of a Metropolis-Hastings step embedded within a Gibbs sampler. Starting with an initial estimate of π 0 , m 0 and A 0 , at t = 0, it proceeds as follows 6 : 1. Update π t given m t , A t , and Y : π t+1 is drawn directly from its full conjugate conditional distribution: a t i r .
2. Propose an update, m * , given m t and Y : for i = (1, · · · , n), if m min < m t i < m max , where m min = 2, for observations with a least one mixed SNP, and m min = 1 otherwise, a proposed MOI, m * i = m t i ± 1, is selected with probability 1 2 . If m t i = m min , m * i = m t i + 1; if m t i = m max , m * i = m t i − 1. 3. Propose an update, A * , given m * , A t , m t , π t+1 , and Y : for each individual patient, a new genotype count vector is proposed by either adding or subtracting a clone (depending on m * i ) from the current genotype count vector. If a clone is added, m * i = m t i + 1, the genotype of the new clones is drawn from a multinomial distribution with renormalised probability vector, π t+1 add , such that incompatible genotypes have zero probability: where Mn denotes the multinomial distribution. If a clone is subtracted, m * i = m t i − 1, the genotype of the subtracted clone is drawn from a multinomial distribution whose probability vector, π t+1 sub = f (a t i ), is renormalised such that the 6 Imputation of incomplete data: because observed data are conditional on the genotype counts (see Figure A1.1), by specifying an initial estimate A 0 , missing data are assigned initial estimates. Moreover, as the sampling algorithm proceeds (t > 0), each time new set of genotypes (A t ) is sampled, new imputed values for the missing data are assigned.
probability of subtracting an existing clone that will render the ensuing genotype count incompatible with the observed data is zero: 4. Accept or reject proposals, A * and m * : the proposed MOI estimates and genotype counts are rejected or accepted based on a Metropolis-Hastings step, which is repeated for i = 1, · · · , n: