A high-throughput method for quantifying alleles and haplotypes of the malaria vaccine candidate Plasmodium falciparum merozoite surface protein-1 19 kDa

Background Malaria vaccine efficacy may be compromised if the frequency of non-target alleles increases following vaccination with a genetically polymorphic target. Methods are needed to monitor genetic diversity in polymorphic vaccine antigens, but determining which genetic variants of such antigens are present in infected individuals is complicated by the frequent occurrence of mixed infections. Methods Pyrosequencing was used to determine allele frequencies at each of six single nucleotide polymorphisms in the Plasmodium falciparum blood-stage vaccine antigen merozoite surface protein 1 19 kDa (MSP-119) in field samples from a vaccine-testing site in Mali. Mixtures of MSP-119 clones were created to validate a haplotype-estimating algorithm that uses maximum likelihood methods to determine the most probable combination of haplotypes given the allele frequencies for an infection and the haplotypes known to be circulating in the population. Results Fourteen unique MSP-119 haplotypes were identified among 351 genotyped infections. After adjustment to a standard curve, Pyrosequencing provided accurate and precise estimates of allele frequencies in mixed infections. The haplotype-estimating algorithm provided accurate estimates of haplotypes in mixed infections containing up to three haplotypes. Based on the MSP-119 locus, approximately 90% of the 351 infections contained two or fewer haplotypes. Conclusion Pyrosequencing in conjunction with a haplotype-estimating algorithm provides accurate estimates of haplotypes present in infections with up to 3 haplotypes, and can be used to monitor genetic diversity in parasite populations prior to and following introduction of MSP-1-based malaria vaccines.


Background
Malaria remains a major cause of disease and death in tropical regions. A malaria vaccine could contribute to malaria control, but as with other pathogens (e.g. HIV, Streptococcus pneumoniae, and influenza virus), malaria vaccine development is complicated by genetic diversity in vaccine antigens. Most malaria vaccine antigens have a high rate of non-synonymous amino acid substitutions and continue to evolve under selection from the immune system [1][2][3]. If immunity conferred by a subunit vaccine is allele-specific, then vaccination could lead to an increased frequency of alleles not targeted by the vaccine. Such changes in the parasite population could compromise vaccine efficacy. It is therefore important to understand the genetic diversity in polymorphic antigens in endemic populations before and after the introduction of a malaria vaccine, including the prevalence of different genetic variants and their natural dynamics, and to measure allele-specific protective efficacy in clinical trials of malaria vaccines.
Merozoite Surface Protein 1 is a leading malaria vaccine candidate antigen. It is the most abundant protein on the surface of the merozoite and is synthesized as a 195 kDa precursor. After undergoing proteolytic cleavage, only the c-terminal 19 kDa remains on the surface of the merozoite as it enters the erythrocyte [4]. The 19 kDa fragment contains two epidermal growth factor (EGF)-like domains, which are thought to have an important function in erythrocyte invasion [5]. Antibodies to this region can block erythrocyte invasion in vitro [4] and are associated with protection from clinical malaria in field studies [6][7][8]. The sequence of MSP-1 19 is highly conserved [9], which, along with its putative critical function, make it an attractive vaccine target. However, this region has six non-synonymous single nucleotide polymorphisms (SNPs) at amino acid positions 1644, 1691, 1699, 1700, 1701, and 1716 [9][10][11][12], which result in expression of different amino acids at those sites (e.g. EKSNGL, QKSNGF, ETSSRL, etc.). It is not known whether or how this polymorphism affects immunity.
Determining which genetic variants of polymorphic malaria antigens are present in infected individuals is complicated by the frequent occurrence of mixed infections. When region of interest is amplified using polymerase chain reaction (PCR), the product and subsequent sequence generated by direct DNA sequencing represents the pool of all parasite types present in that infection. Consequently, it is difficult to distinguish which nucleotides reside together on one parasite (i.e. haplotypes). Haplotypes can be identified using PCR cloning [11,13], since each clone contains a single copy of the amplified region of interest; however, cloning is time consuming, expensive, and not all sequences clone with equal efficiency.
Pyrosequencing™ (Biotage, Charlottesville, VA) is a realtime sequencing method that detects release of pyrophosphate during nucleotide incorporation by an enzyme cascade that generates light proportional to the amount of nucleotide incorporated. This technique allows sequencing of short stretches of nucleotides (10-20 bp) surrounding known polymorphisms without sequencing the rest of the conserved sequence. Pyrosequencing software can quantify the proportion of each alternative nucleotide at each SNP site based on relative peak heights. This method has been shown to provide accurate and precise measurements of allele frequencies in pools of human DNA [14,15] and of the degree of DNA methylation [16].
If allele frequencies at each polymorphic site in a candidate antigen can be determined, and it is known which unique haplotypes are circulating in the population, then a mathematical model can be developed to estimate which haplotypes are present in mixed infections. In this study, Pyrosequencing was used to determine allele frequencies at each of the six SNPs in MSP-1 19 , and an algorithm was developed to reconstruct the frequency of MSP-1 19 haplotypes in mixed malaria infections. This method will provide a time-and cost-effective alternative to PCR cloning for monitoring parasite populations before and after vaccine introduction.

Samples and DNA extraction
All samples used in this study were collected at a malaria vaccine-testing site in Bandiagara, Mali. DNA was extracted from 3 MM Whatman (Whatman Inc., Clifton, NJ) filter paper blood samples using a QIAamp DNA Mini Kit (Qiagen, Valencia, CA). PCR followed by direct sequencing was used to screen 55 samples collected from children participating in a case-control study of severe malaria [17]. PCR followed by Pyrosequencing was used to screen 296 samples collected from children and young adults participating in a malaria incidence study [18]. Both the case-control study and the malaria incidence study were conducted during the years 1999-2001, and were approved by Institutional Review Boards of the University of Bamako Faculty of Medicine and the University of Maryland Baltimore. Samples with sequences consistent with the presence of unique MSP-1 19 haplotypes were identified and one representative of each haplotype underwent PCR cloning. CA), and 5 µl template DNA. Cycling conditions were as follows: 95°C for 15 minutes; 94° for 30 seconds, 65° for 30 seconds, and 72° for 30 seconds for 10 touchdown cycles -0.5°/cycle; 94° for 30 seconds, 60° for 30 seconds, and 72° for 30 seconds for 35 cycles; and a final extension at 72° for 10 minutes. PCR products were visualized on 1.5% agarose gels.

Pyrosequencing
For each Pyrosequencing reaction, 5-10 µl of each biotinylated PCR product (depending on product yield) was aliquotted into the wells of a 96-well plate. To bind the products to sepharose beads, 70 µl of binding reaction mix was added to each well. The binding reaction mix consists of 40 µl Binding Buffer (Biotage, Charlottesville, VA), 28 µl high purity water, and 2 µl Streptavidin-Sepha-rose™ beads (Amersham Biosciences, Piscataway NJ). The binding reaction mix and PCR products were mixed at 1400 rpm at room temperature for at least five minutes.
Four Pyrosequencing reactions are required to genotype the six SNPs in MSP-1 19 . Table 1 shows the sequence of the primers for each Pyrosequencing reaction. Primers were designed using Pyrosequencing Assay Design Software version 1.0.6 (Biotage, Charlottesville, VA). Each Pyrosequencing primer was diluted to 0.417 µM in Annealing Buffer (Biotage, Charlottesville, VA), and 12 µl of the annealing mix (including Pyrosequencing primer) was added to each well of a PSQ™ HS 96-well plate, resulting in 5 pmol of Pyrosequencing primer per well. Negative controls (i.e. Pyrosequencing primer only and biotinylated primer and Pyrosequencing primer without tem-plate) were included on each plate to confirm that background signal was negligible.
Sepharose-bound PCR products were captured on the probes of the Pyrosequencing Vacuum Prep Tool (Biotage, Charlottesville, VA). The beads were washed in 70% ethanol, followed by denaturation solution (0.2 M NaOH), and then washing buffer (Biotage, Charlottesville, VA) for 15 seconds each. The vacuum was released, and the probes were immersed in the PSQ HS 96-well plate containing the annealing solution, and the beads were released by gentle shaking. The plate was then incubated on a heat block at 80°C for 2 minutes and allowed to cool to room temperature prior to reading. Plates were read on a PSQ HS Pyrosequencer using PSQ HS 96A SNP reagents and analysis software version 1.2 in AQ mode. Only samples with single peak signals of at least 30 RLU (relative luminescence units) were considered suitable for allele quantification. Samples that gave "wide peak" warnings upon analysis were also rejected and the Pyrosequencing repeated.

PCR cloning
For samples chosen for cloning, PCR products were generated using nonbiotinylated versions of the MSP-1 19 PCR primers. These products were cloned using a PCR Cloning Plus Kit (Qiagen, Valencia, CA). Transformed cells were plated on LB plates containing 100 mg ampicillin, 80 mg X-gal, and IPTG. Clones with successful ligations were chosen by blue-white screening, followed by PCR screening with MSP-1 19 PCR primers. Twelve clones were picked for each ligation. The nucleotide sequence of each clone was determined using Pyrosequencing and confirmed by direct sequencing.

Standard curve generation
To account for variation in the accuracy of allele frequency determination and to standardize across the different SNPs, standard curves were generated for each of the six polymorphic positions in MSP-1 19 (Figure 1). Experimental mixtures of MSP-1 19 clones were created to generate

SNP Primer Location
Nucleotides Amino Acids a SNPs at these three positions are genotyped in the same Pyrosequencing reaction. b Primer sits down prior to position 1711 where there is a rare polymorphism not considered in this analysis.
standard curves and estimate the magnitude of experimental errors. Plasmids were extracted from clones using a QiaPrep Spin Miniprep kit (Qiagen, Valencia, CA). Plasmid concentrations for dilutions and mixtures were determined using a NanodropTM ND-1000 Spectrophotometer (NanoDrop Technologies, Wilming-ton, DE). Because no two clones differed at all six SNPs, two curves were generated: one using clones that differed at all sites except 1699 and the other using clones that differed at all sites except 1716. Plasmids were combined in ratios of 10:0, 9:1, ...1:9, 0:10. TE was added to each mixture to dilute to a final concentration of 1 ng/µl. 2 µl of Standard curves for each of the six single nucleotide polymorphisms in MSP-1 19 Figure 1 Standard curves for each of the six single nucleotide polymorphisms in MSP-1 19 . Graphs depict the percent deviation between expected and observed frequencies (y-axis) over a range of expected frequencies (x-axis). Circles indicate the observed frequencies, the red line indicates the smoothed data, and the blue line represents the fitted standard curve.
each dilution was used in PCR as described above, and products underwent Pyrosequencing. To generate standard curves for each of the SNPs, the deviations between the expected and observed allele frequencies (i.e. the errors) were plotted, and a function, S i , was chosen to correct these errors. Standard curves were chosen from the family of curves given by the five-parameter function of the allele frequency, p: a + bp + cp 2 -dsin(2gπp). Backwards fitting was used to find the most parsimonious function S i for each of the SNPs (Table 2). Thus, given a set of measured frequencies p i , the best estimate of the actual frequencies is S i (p i ).

Haplotype estimation
The haplotype-estimating algorithm uses maximum likelihood methods to determine the most probable combination of haplotypes given the allele frequencies for an infection, the haplotypes known to be circulating in the population (Table 3), and a probability distribution of the measurement errors. To estimate the distribution of measurement errors associated with each SNP (i.e. the residual errors after adjustment to the standard curve), the absolute values of the errors were assumed to be exponentially distributed. The mean residual error for each SNP, ε i , was calculated using the same clone mixtures that were used to generate the standard curve data. Given a putative set of haplotype frequencies, f i , and a set of allele frequencies, p i , where A (f i ) indicates the allele frequencies for a putative combination of haplotypes, S(p i ) represents observed allele frequencies adjusted to the standard curve, and ε i is the mean residual error for each SNP. To estimate the multiplicity of infection (MOI) for each infection, M i , the number of haplotypes per infection was assumed to be distributed as a conditional Poisson [19] (i.e. each infection has at least one haplotype) with a mean of 1.38 haplotypes (estimated from 296 infections from the Bandiagara malaria incidence study). Thus, the full equa-   Finding the combination of haplotype frequencies, f i , that maximizes the likelihood is hampered by the possibility of finding multiple local maxima. To address this concern, a simple optimization procedure was used with a large set of starting conditions. Starting conditions were chosen by assuming the maximum number of haplotypes per infection was seven and finding all combinations of haplotypes that could explain the allele frequencies in the sample by solving a reduced linear system of equations ( Figure 2). The software was written in R (R Foundation for Statistical Computing, Vienna, Austria) and is available upon request.

Validation of haplotype-estimating algorithm
Experimental mixtures of plasmids from MSP-1 19 clones were made to determine the algorithm's ability to correctly estimate the haplotypes present in mixed malaria infections. Plasmids were mixed in the proportions listed in Table 4. Like the mixtures used to generate the standard curves, each mixture had a final concentration of 1 ng/µl and 2 µl were used as the template for PCR. Allele frequencies at each SNP were determined using Pyrosequencing as described above. Input for the algorithm included the standard curve adjusted allele frequencies for each sample and the list of 14 haplotypes observed in the study population.

Human subjects approval
Samples were collected under protocols reviewed and approved by Institutional Review Boards of the University of Maryland School of Medicine and the University of Mali Faculty of Medicine. Informed consent was obtained from all study participants or their guardians.

MSP-1 19 haplotypes in Bandiagara, Mali
A total of 20 samples underwent PCR cloning. Sequencing of the clones from these samples identified 14 unique MSP-1 19 haplotypes circulating in the study population. The observed haplotypes are listed in Table 3. Three of these haplotypes have not been previously reported (i.e. QKSSGL, QKSSRL, and QTSSGL).

Accuracy and precision of allele frequency determination
Several factors can affect relative peak heights generated during Pyrosequencing, including the bases flanking the polymorphic site (homopolymer formation occurs when adjacent nucleotides are identical to one of the alleles at the SNP site), increased signal from "A" alleles due to the use of dATPαS instead of dATP in the Pyrosequencing reaction, and background signal [14]. To account for these sources of variation and to standardize across the different SNPs, standard curves were generated for each of the six polymorphic positions in MSP-1 19 . Four replicates of each dilution were genotyped on two different days. There was no statistically significant difference between replicates run on different days (data not shown). The deviation between the expected and observed allele frequencies for all replicates was plotted over the range of expected frequencies ( Figure 1). For each SNP, a standard curve was fitted to the data. As observed in Figure 1, the allele frequencies at site 1701 required the most adjustment, with a correction of ~10% required as the frequency of the G allele approached 100%. The other five SNPs required allele frequency corrections of <10%. Raw allele frequencies for each SNP were adjusted to the standard curve prior to haplotype estimation.

Validation of haplotype-estimating algorithm
To test the algorithm's ability to correctly estimate the haplotypes present in mixed malaria infections, plasmids from MSP-1 19 clones were used to make mixtures with known frequencies of various MSP-1 19 haplotypes. These mixtures then underwent PCR, Pyrosequencing, and hap-Linear system of equations to choose starting conditions for haplotype estimation Figure 2 Linear system of equations to choose starting conditions for haplotype estimation. To address the concern of finding multiple local maxima during haplotype estimation, starting conditions were chosen by assuming the maximum number of haplotypes per infection was seven and finding all combinations of haplotypes that could explain the allele frequencies in the sample by solving a reduced linear system of equations.  1 1 1 1 1 1 0 1 0 1 1 1 1  1 1 0 0 1 1 0 1 0 1 0 0 0 1  1 1 0 1 1 1 0 1 0 1 0 1 1 1  1 1 1 1 1 0 1 1 1 1 1 1 1 0  1 1 1 1 1 1 1 1 1 1 1 1 1 1 = i lotype-estimation. The actual and estimated haplotypes and their frequencies are shown in Table 4. Based solely on maximum likelihood, the algorithm does very well estimating up to three haplotypes. Haplotype estimation is less accurate for four or more haplotypes. Examining the algorithm output for the higher multiplicity of infection (MOI) mixtures shows that the model yields multiple "good" answers with similar likelihoods, and consequently it is difficult to choose which "good" answer is correct (i.e. identifiability becomes a problem with high MOI infections). However, lower MOI infections make up a majority of the infections observed in Mali (Figure 3). Based on data from 296 infections from a malaria incidence study in Bandiagara, Mali, nearly 90% of infections have one or two MSP-l 19 haplotypes (Figure 3).

Discussion
A high-throughput method that combines allele frequency determination by Pyrosequencing with a mathematical model was developed to estimate the MSP-1 19 haplotypes present in mixed malaria infections. After adjustment to a standard curve, Pyrosequencing yields accurate and precise estimates of the relative frequency of alleles in mixed infections. The haplotype-estimating algorithm uses maximum likelihood methods to determine the most probable combination of haplotypes given the allele frequencies for an infection and the haplotypes known to be circulating in the population, and provides accurate estimates of haplotypes present in lower multiplicity of infection (MOI) infections (≤3 types). For higher MOI infections (≥4 types), the algorithm gives statistically reasonable, but less accurate, estimates. The reduced accu- racy at high MOI is primarily due to the inability of the algorithm to choose between several haplotype combinations with similar likelihoods.
Because MSP-1 19 is highly conserved, measures of MOI based on this locus are likely to be lower than those based on more polymorphic loci (e.g. MSP-1 block 2 or MSP-2). Therefore it may be acceptable to have an algorithm with greater accuracy at lower MOI. In Mali, the vast majority of samples have low MOI (≤3 types) based on this locus. In 24 infections from six infants living in a high transmission area of western Kenya, the largest number of MSP-1 19 haplotypes observed in an infection was two; however, the largest number of clones picked per sample was four, and it is possible that higher MOIs would have been observed had more clones been picked [11]. At a population level, with large sample sizes, the inaccurate estimation of some haplotypes in a small number of high MOI infections is not likely to be statistically relevant. When individual histories are of interest, it may be possible to fine-tune the algorithm to allow more accurate estimation of high MOI infections by using information about the haplotypes present in low MOI infections that come before and after the high MOI infection to choose the "best" answer out of several statistically "good" answers.
Identifiability problems will also increase as the number of circulating haplotypes in a population increases. Therefore, in areas of high transmission where there may be more circulating haplotypes, it may be necessary to restrict the algorithm to include the most common haplotypes. By doing so, the algorithm should be able to resolve most of the infections, and will be unable to resolve infections that contain rare haplotypes. These rare haplotypes can then be identified using other methods such as PCR cloning.
Similar expectation-maximization methods have been used to estimate haplotype frequencies in diploid human populations [20,21] and in pooled human DNA [22,23]. The expectation-maximization (EM) algorithm developed by Excoffier and Slatkin uses maximum likelihood methods to determine the most probable haplotype assignment given the observed sample genotypes and the estimated population haplotype frequencies (under the assumption of Hardy-Weinberg equilibrium). This method works best for large sample sizes, and uses several sets of starting conditions to avoid convergence on local maxima [20]. Stephens and colleagues use a Bayesian method to reconstruct haplotypes based on both the likelihood and an a priori assumption that unresolved haplotypes tend to be similar to known haplotypes [21]. The EM algorithm has recently been applied toward resolving haplotypes in pooled human DNA samples [22,23]. Similar to the algorithm described in this study, haplotype estimation in pooled human DNA samples is most accurate when the pool consists of fewer individuals. Ito et al. achieved the most accurate estimates with pools containing fewer than four individuals [22], while Quade et al. achieved accurate estimates for up to ten pooled samples (using only two alleles at two loci) [23]. These studies indicate that lack of identifiablity in samples with larger numbers of haplotypes is a common limitation of these types of algorithms.
The accuracy of Pyrosequencing allele quantification can be affected by several factors including having an "A" allele in the SNP and having flanking bases identical to one or the other alternative alleles in the SNP (i.e. homopolymer formation). Given the A/T rich genome of Plasmodium, four out of six SNPs in MSP-1 19 contain an "A" allele. In addition, five of the six SNPs in 19 kDa form homopolymers with flanking alleles. Therefore, it is important to adjust the allele frequencies to a standard curve to improve accuracy. However, since allele frequencies of replicate runs of the same sample on different days did not differ significantly, one standard curve can be used to adjust all the data (as opposed to generating a curve every day the assay is run).
Several methods have been used to determine allele frequencies in mixed malaria infections including PCR cloning [11,24], real-time quantitative PCR (RTQ-PCR) [25], and proportional sequencing [26]. All of these methods, including Pyrosequencing, have advantages and disadvantages. PCR cloning gives definitive haplotypes; however, it is the most time-consuming and expensive of the methods, which significantly limits the number of samples that can be feasibly analyzed using this method. In addition, because Plasmodium often uses codons different than those used by the competent bacteria used in cloning, not all sequences can be cloned efficiently. RTQ-PCR Multiplicity of infection based on MSP-1 19 Figure 3 Multiplicity of infection based on MSP-1 19 .

Multiplicity of Infection
Proportion is a more sensitive method than Pyrosequencing at detecting very low frequency alleles (<5%); however, it has a lower throughput and requires more optimization than Pyrosequencing. Like RTQ-PCR, Pyrosequencing assays are designed to detect known polymorphisms. Methods that rely on sequencing an entire region or gene of interest (e.g. PCR cloning) are better for detecting new SNPs. Proportional sequencing is a method that estimates allele frequencies in mixed infections by measuring the peak heights in direct sequencing electropherograms [26]. While this method has similar applications and accuracy as Pyrosequencing, it is more expensive and has a lower throughput [26]. Because Pyrosequencing sequences short stretches of nucleotides (10-20 bp), for certain very polymorphic loci (e.g. domain I of P. falciparum apical membrane antigen-1, another vaccine candidate antigen), it is not possible to set down a sequencing primer every 20 bp. In this instance, proportional sequencing may be more appropriate. If MSP-1 19 haplotypes are of interest, allele frequencies from any of these methods can be used with the haplotype-estimating algorithm described here.
The cost of equipment for Pyrosequencing is similar to that for standard DNA sequencing, which is now done in several sub-Saharan African countries, including Mali.
Pyrosequencing may be suitable for other applications such as typing known single nucleotide polymorphisms in parasite genes that serve as molecular markers for drug resistant malaria.

Conclusion
In conclusion, Pyrosequencing is a technique that allows reliable quantification of alleles in mixed malaria infections. It is fast, relatively inexpensive, and can be used to genotype polymorphisms of interest in many important Plasmodium genes such as those responsible for drug resistance, immunity, and virulence. In this study, Pyrosequencing was adapted to measure the frequency of alleles in an erythrocytic vaccine candidate antigen MSP-1 19 and combined with a haplotype-estimating algorithm to estimate the frequency of MSP-1 19 haplotypes in infected individuals. This method is being used to understand the natural dynamics of MSP-1 19 , at both population and individual levels, at a malaria vaccine-testing site in Bandiagara, Mali, and can be used to monitor populations during large-scale vaccine trials to determine allele-specific vaccine efficacy.
Publish with Bio Med Central and every scientist can read your work free of charge