Deep sequencing of targeted genomic regions is becoming a common tool for understanding the dynamics and complexity of Plasmodium infections, but its lower limit of detection is currently unknown. Here, a new amplicon analysis tool, the Parallel Amplicon Sequencing Error Correction (PASEC) pipeline, is used to evaluate the performance of amplicon sequencing on low-density Plasmodium DNA samples. Illumina-based sequencing of two Plasmodium falciparum genomic regions (CSP and SERA2) was performed on two types of samples: in vitro DNA mixtures mimicking low-density infections (1–200 genomes/μl) and extracted blood spots from a combination of symptomatic and asymptomatic individuals (44–653,080 parasites/μl). Three additional analysis tools—DADA2, HaplotypR, and SeekDeep—were applied to both datasets and the precision and sensitivity of each tool were evaluated.
Amplicon sequencing can contend with low-density samples, showing reasonable detection accuracy down to a concentration of 5 Plasmodium genomes/μl. Due to increased stochasticity and background noise, however, all four tools showed reduced sensitivity and precision on samples with very low parasitaemia (< 5 copies/μl) or low read count (< 100 reads per amplicon). PASEC could distinguish major from minor haplotypes with an accuracy of 90% in samples with at least 30 Plasmodium genomes/μl, but only 61% at low Plasmodium concentrations (< 5 genomes/μl) and 46% at very low read counts (< 25 reads per amplicon). The four tools were additionally used on a panel of extracted parasite-positive blood spots from natural malaria infections. While all four identified concordant patterns of complexity of infection (COI) across four sub-Saharan African countries, COI values obtained for individual samples differed in some cases.
Amplicon deep sequencing can be used to determine the complexity and diversity of low-density Plasmodium infections. Despite differences in their approach, four state-of-the-art tools resolved known haplotype mixtures with similar sensitivity and precision. Researchers can therefore choose from multiple robust approaches for analysing amplicon data, however, error filtration approaches should not be uniformly applied across samples of varying parasitaemia. Samples with very low parasitaemia and very low read count have higher false positive rates and call for read count thresholds that are higher than current default recommendations.
Amplicon deep sequencing is an increasingly utilized genotyping approach that provides a cost-effective strategy to profile the genetic diversity of pathogen infections. Like single nucleotide polymorphism (SNP)-based genotyping methods, both the data-generation and data-analysis steps of amplicon sequencing are highly scalable, allowing for studies of hundreds to thousands of samples. Additionally, amplicons can be designed to cover long genetic segments composed of multiple variants, allowing for the identification of complete DNA sequences (or haplotypes) in a targeted genomic region. When targeting a highly polymorphic genomic region, a single amplicon can distinguish among hundreds of unique haplotypes , providing higher resolution than either SNP-based or length-based genotyping approaches. This improves estimates of the number of lineages within polyclonal infections (or complexity of infection; COI) [2,3,4], permits the discovery of unknown alleles [5,6,7], and provides increased information for haplotype-based analyses of epistasis and linkage disequilibrium .
Amplicon analysis in Plasmodium has been adapted to multiple sequencing platforms depending on the desired cost, sample size, and sequence length [3, 9,10,11]. Because of this high resolution and flexibility, amplicon-based methods have been utilized in a range of applications, including studies of allele-specific vaccine efficacy , disease severity , clearance rate , within-host competition , relapse rate , drug resistance [5,6,7], host selection , and population structure [8, 14]. Amplicon sequencing has high sensitivity for the detection of minority parasite lineages within an infection, and is of particular interest in longitudinal studies that track intra-host dynamics [3, 4].
When used to detect known single variant markers, amplicon sequences can be analysed with relatively straightforward approaches. Longer, complex haplotypes, however, require more sophisticated analysis methods. Amplicon sequencing data are known to be subject to PCR and sequencing artifacts, particularly for genomic regions with high A/T-content and high rates of homopolymerism [15, 16]. In addition, library preparation method and primer choice can influence the types and extent of errors . Correctly identifying sequence errors is therefore a challenge when applying amplicon sequencing to Plasmodium falciparum. Fortunately, several new analytical tools have been developed in recent years to address these challenges [18,19,20,21].
Unlike approaches that use reference datasets or cluster sequences with hard percent-identity thresholds, these new methods are more flexible and can distinguish among sequences that differ by only a single nucleotide change . When Plasmodium concentrations are reasonably high, these approaches have been demonstrated to be robust. To date, however, these methods have not been thoroughly evaluated on low-density Plasmodium samples. It is therefore unclear whether additional considerations are required when interpreting amplicon sequencing data from infections with low parasitaemia.
This study assesses amplicon sequencing’s lower limit of detection using four analysis tools, and further evaluates each tool’s accuracy and capacity to recover quantitative information on the relative abundance of different haplotypes within infections. Three of these tools—DADA2 , HaplotypR , and SeekDeep —were previously published and developed to contend with any Plasmodium amplicon. The fourth—the Parallel Amplicon Sequencing Error Correction (PASEC) pipeline—is a distance- and abundance-based error-correction tool that was specifically tailored for use with CSP and SERA2 amplicons , and it is formally presented here for the first time. Amplicon sequencing of densely polymorphic regions in the P. falciparum CSP and SERA2 genes was applied to two sample collections. The first—a set of in vitro human/parasite DNA mixtures that mimic low-density parasite infections—was designed to test the limit of detection for amplicon sequencing. The second sample set consisted of DNA extracted from dried blood spots from malaria-infected individuals collected on filter paper in sub-Saharan Africa. This allowed a comparison of analysis approaches using conditions under which samples are typically collected and processed. All four tools detected P. falciparum haplotypes with high sensitivity, and additionally were able to discriminate between major and minor haplotypes with reasonable accuracy. Additionally, PASEC was able to identify a SERA2 indel in patient samples due to its incorporation of prior knowledge on sequence composition.
Overall, the results show that low parasitaemia does not preclude amplicon analysis of P. falciparum samples, although researchers should expect reduced sensitivity and reduced precision with low read-count samples (< 100 reads/amplicon) and at parasite DNA concentrations under 5 genomes/μl.
Sample assembly and composition
Mock Plasmodium/human DNA mixtures
Mixtures of DNA from cultured P. falciparum parasites were combined with human genomic DNA to construct samples that mimic human infections. DNA from up to five culture-adapted parasite lines were combined in various proportions and number (Fig. 1; for exact sample composition and nucleotide differences between clones see Additional file 1: Tables S1, S2). Stock mixtures of 200 genomic copies/μl of DNA template were prepared by real-time PCR quantification of copies/μl in triplicate relative to a plasmid containing a single copy of the quantification target gene . These stock solutions were then diluted to the indicated concentrations in sequencing-grade water and 10 ng commercial human DNA (Promega Corp cat#G3041) was added to all samples. After mixing and dilution, a subset of samples were re-quantified using the same qPCR protocol and reported sample concentrations were adjusted as needed. Plasmodium-free negative control samples were also constructed. These contained either 10 ng of human DNA or only water.
Previously extracted DNA from 95 blood spots, obtained from individuals infected with P. falciparum, was re-amplified and re-sequenced as part of this study. These samples were acquired from both symptomatic and asymptomatic individuals from four countries in sub-Saharan Africa as part of the RTS,S malaria vaccine phase 3 trial and had parasite densities that ranged from 44 to 653,080 parasites/μl as determined by blood smear (Fig. 1; ). Full details on sampling and extraction, including human subjects approval for use of these samples, are provided in Neafsey et al. . In brief, samples were collected as blood spots on Whatman FTA cards, shipped to the Broad Institute, and stored in desiccators until processing. DNA was extracted in batches of 95 samples plus one blank control card using seven 3-mm punches and the automated Chemagen Chemagic bead-based extraction platform. Total DNA was stored at − 80 °C until re-amplification and sequencing.
Positive control plasmid
A plasmid containing synthetic target amplicon sequences for both CSP and SERA2 was obtained from a commercial vendor (Invitrogen/Thermo Fisher Scientific) and served as a positive control during the PCR amplification step. Outside the primer regions, the plasmid sequence contains nucleotide variants not observed in natural P. falciparum isolates so that any instances of contamination can be readily identified. The plasmid map can be found in Additional file 1: Fig. S1.
PCR and sequencing
Two regions from the CSP (PF3D7_0304600) and SERA2 (PF3D7_0207900) genes were PCR amplified as previously described . In brief, 5 μl of ~ 0.5 ng/μl DNA served as template for the initial PCR which amplified the targeted regions. A second PCR was carried out to index samples and create the full sequencing constructs. The final CSP and SERA2 amplicons cover 288 and 258 nucleotides, respectively (Pf3D7_03_v3:221,352–221,639; Pf3D7_02_v3:320,763–321,020). Both amplicons cover sequence regions of high nucleotide diversity in sub-Saharan Africa to maximize the number of distinct haplotypes that can be detected across samples from this geographic area.
All DNA samples and negative controls were amplified and sequenced in duplicate. Paired-end 250-bp reads were generated in one MiSeq run conducted on a pool of 384 PCR products. Unless otherwise noted, each PCR/sequencing technical replicate was analysed as a distinct sample. Before downstream analysis, raw sequencing data were demultiplexed and aligned to amplicon reference sequences to remove all non-Plasmodium sequences.
Sample analysis with PASEC
For each sample, paired-end reads were merged using FLASH  and aligned with BWA-MEM v0.7.12-r1039  to the amplicon regions of the P. falciparum reference genome assembly (PlasmoDB v.9.0 3D7). Two short homopolymeric tracts in CSP were masked from analysis, as such regions are highly error-prone in Illumina sequencing and these specific tracts were not known to harbour natural polymorphisms. Masked coordinates are given in Additional file 3.
Within each sample, haplotypes were filtered according to a set of pre-specified thresholds developed by Neafsey et al. . Haplotypes were required to (1) cover the entire amplicon region, (2) have no uncalled bases, (3) be supported by at least two sets of merged read pairs (henceforth referred to simply as “reads”), and (4) have an intra-sample frequency ≥ 0.01. To account for single nucleotide errors introduced during PCR and sequencing, the filtered haplotypes were clustered based on nucleotide distance and read depth. To inform the stringency of this clustering, the relative read depth of putative PCR errors in the dataset was analysed. Across all the mock samples, there were 208 erroneous haplotypes that (1) differed from a true haplotype within the same sample by a single nucleotide change, and (2) appeared only once in the sequencing run, making it unlikely that contamination was the source of the error. 92% of these erroneous haplotypes were present at a within-sample frequency that was as at least eight times less than the frequency of the correct “parent” haplotype (Additional file 1: Fig. S2). This 8:1 ratio was, therefore, used throughout the whole data set: if two haplotypes within the same sample differed by only one nucleotide and had a read coverage ratio ≥ 8:1, they were merged, maintaining the identity of the more common haplotype. This same 8:1 threshold was used in the initial application of the PASEC pipeline with data from sub-Saharan Africa where both nucleotide and haplotypic diversity was high at these amplicons . In studies where it is necessary to discriminate between closely related haplotypes with a high frequency skew, however, a different cutoff could be applied.
Previous implementations of PASEC removed all potential chimeric reads and applied sample read depth filters (≥ 200 reads for one of the two amplicons) [1, 8]. Here, these metrics were analysed, but hard filters were not applied to the samples before the downstream analyses presented in the results. The results therefore represent a minimally—not an optimally—filtered data set. For most applications, further filtration is recommended, as discussed below.
Full details on the PASEC pipeline, its customizable parameters, and its implementation in this study are found in Additional files 2 and 3 and at https://github.com/tmfarrell/pasec.
Sample analysis with DADA2, HaplotypR, and SeekDeep
All samples were independently analysed using three additional amplicon analysis tools: DADA2 , HaplotypR , and SeekDeep v.2.6.0 . Beyond the changes detailed below, input parameters deviated only modestly from the default settings. Parameters and scripts used for executing each pipeline can be found in Additional file 3. While previous implementations of PASEC applied a 200 reads/sample threshold, no read count filters were applied at the sample level in the analysis comparisons.
SeekDeep gives the option of grouping data from technical PCR/sequencing replicates of the same sample and applying clustering and filtering to this grouped data to increase confidence in final calls. The pipeline was therefore run under two conditions: grouping technical replicates (the recommended, default SeekDeep approach; “SeekDeep2x”) and treating each PCR/sequencing replicate independently (“SeekDeep1x”). This permitted more equivalent comparisons among pipelines that do not incorporate replicate information and allowed for a determination of whether a single replicate is sufficient for making accurate haplotype calls.
For HaplotypR, the command-line interface was extended in two ways. First, it was altered to return full haplotype sequences as opposed to only bases at variant positions. Second, the trimming input command was expanded to allow each amplicon to have different lengths. The version of HaplotypR used in this analysis can be found at https://github.com/tmfarrell/HaplotypR. After running the pipeline, the authors’ recommended sample-level filtering was applied to the data. Specifically, each sample was required to have a minimum of 25 reads, and individual haplotypes needed to have a minimum of 3 reads and a within-host frequency of at least 0.1%.
Comparison of analysis tools
All four tools were assessed for their ability to resolve haplotypes at within-sample frequencies down to 1% using the mock low-parasitaemia samples. Two performance metrics were computed by comparing expected vs. observed haplotypes in each sample: sensitivity (proportion of all expected haplotypes that were observed) and precision (proportion of all observed haplotypes that were expected). For sensitivity calculations, only haplotypes present at a concentration of at least 1 copy/μl of DNA template (5 copies/PCR reaction) were considered. For each tool, samples were only included in the performance metric calculation if at least one haplotype was identified. Except for the SeekDeep2x implementation, each PCR/sequencing replicate was analysed as a distinct sample.
Sequencing coverage for low-density mock infections and natural infections from sub-Saharan Africa
In total, 148 DNA mixtures of known haplotypic composition, 190 blood samples from sub-Saharan Africa, 12 positive-control plasmid samples, and 4 negative-control samples without Plasmodium DNA were PCR amplified for CSP and SERA2 and sequenced on a single Illumina MiSeq run.
The 148 mock infections were constructed to mimic infections with low parasite density and contained between 1 and 200 P. falciparum genomes/μl (Fig. 1a). We assume that these values roughly correspond to parasite densities of 1 and 200 parasites/μl as sampled peripheral blood is heavily enriched for ring-stage infected cells containing only a single parasite genome. Actual extracted DNA concentrations will vary, however, based on the volume of blood extracted, the extraction efficiency, and the DNA suspension volume. In the initial amplification step, 5 μl of DNA template were used, so samples at the lowest end of this distribution (1 genome/μl) should have had, on average, five genomic copies per PCR reaction. After sequencing, 145 samples had full-length read coverage for at least one of the two amplicons. For each amplicon, initial raw coverage across these samples ranged from 0 to 280,876 reads. After implementing the PASEC pipeline, coverage ranged from 0 to 31,787 reads. Coverage was sufficient for both amplicons, although median coverage was higher for CSP than for SERA2 (1872 vs. 909; Fig. 2a). All samples with low coverage (< 100 reads) had Plasmodium DNA concentrations below 21 genomes/μl. Overall, however, coverage and genome copy number were only weakly correlated (Spearman’s ρ = 0.55, P = 9.3 × 10−14; Fig. 2b), suggesting that stochastic factors influence read counts for low parasitaemia samples in general.
Read coverage was higher for the samples from natural infections (Fig. 2c). These samples were extracted from dried blood spots and had parasite densities that ranged from 44 to 653,080 parasites/μl as determined by microscopy of blood smears. As with the mock infections, coverage was generally higher for samples with higher parasite loads, but this correlation was low (Spearman’s ρ = 0.31, P = 1.1 × 10−9; Fig. 2d). While read coverage was higher, overall sequencing success was lower for the natural than for the mock infections (Fig. 2c), a likely result of difficulties with extracting high quality DNA from the stored filter paper blood spots. As would be expected under this scenario, failure rate was not evenly distributed across the natural infection samples, suggesting some experienced a higher degree of degradation. Each of the 95 blood samples was PCR amplified and sequenced in duplicate, yielding two CSP and two SERA2 technical replicates per initial blood sample extraction, or 340 total amplicon samples. Of these 340 amplicon samples, 94 (25%) had low read counts (< 100 reads). These failures clustered in a small number of blood samples, suggesting that amplification and sequencing success is dependent on sample quality: only 33 (35%) of the blood samples experienced any amplicon failure and 18 samples (19%) received low read counts for all 4 amplicon attempts.
Absolute haplotype concentration affects the probability of sequencing success
One challenge of amplicon sequencing analysis is to correctly resolve individual haplotypes present within an infection at varying concentrations. Each mock sample contained between one and four unique haplotypes at the CSP and SERA2 amplicons present at concentrations of 1–200 copies/μl (Fig. 1b). Overall, there was a high recovery of these expected haplotypes from each of the samples. PASEC correctly identified all haplotypes present at a concentration of 30 copies/μl or higher and 96% of haplotypes with concentrations over 20 copies/μl. Conversely, only 41% of haplotypes with 1–5 copies/μl were recovered (Fig. 3a). As further discussed in the tool comparison below, this haplotype sensitivity is only slightly influenced by the post-sequencing analysis method and instead is driven by a failure to initially amplify and/or sequence these low frequency haplotypes.
Amplicon sequencing retains some information on within-sample haplotype frequencies, even at low concentrations
When performing direct short-read sequencing, relative read depth can be used to infer sample features like genotype ratios or genome copy number variations. During construction of amplicon libraries, however, PCR amplification prior to sequencing introduces stochastic variation in the final read counts. Nevertheless, analysis of the final read ratios in the mock samples shows that some information about the original haplotype ratios can be recovered. For samples with at least 100 reads, the correlation between the haplotypic ratio in the template DNA and final read ratio was strong across all haplotypes (Pearson’s r = 0.82, P < 0.001), but weaker for haplotypes with intermediate frequencies between 0.1 and 0.9 (Pearson’s r = 0.60, P < 0.001; Additional file 1: Fig. S3). In 73% of samples with at least a 4% margin between the two most prevalent haplotypes, read ratio correctly identified the most prevalent haplotype in the starting DNA mixture. Again, low read count reduced the probability of identifying the correct major haplotype (Fig. 4a). Similarly, major haplotype identification was less accurate in samples with very low total Plasmodium DNA concentration (< 5 genomes/μl; Fig. 4b).
Erroneous haplotypes have lower read support than correct haplotypes
Read support is a useful indicator of the likelihood that a called haplotype is correct. Haplotypes with single-read support were largely sequencing artifacts, with only 0.030% matching a haplotype sequence known to be present in the sample mixtures. The default PASEC pipeline therefore requires haplotypes to have read support ≥ 2, a filter that eliminated 89.0% of CSP and 85.8% of SERA2 initially called haplotypes from the dataset.
After minimal filtration, 0.75% of the total reads were erroneous, a percentage close to that previously reported by Hathaway et al. on a different dataset analysed with their tool SeekDeep (0.8%) . Overall, this resulted in 31% of identified haplotypes being erroneous. Both erroneous reads and erroneous haplotypes were unevenly distributed across samples, however, making it possible to reduce the false positive rate with further filtration. First, erroneous haplotypes showed lower read support than true haplotypes (Fig. 3b). Raising the minimum haplotype read depth from two to five reads increased precision from 0.81 to 0.91 while having a smaller impact on sensitivity, which was lowered from 0.71 to 0.68 (Additional file 1: Fig. S4). Second, erroneous reads were more prevalent in samples with low read depth and/or low parasite concentration (Additional file 1: Fig. S5), which results in low precision within these sample groups specifically (Fig. 5). Finally, the number of reads supporting erroneous haplotypes differs between samples with low and high read depth. In samples with fewer than 100 reads, 68% of identified haplotypes were erroneous and 86% of these erroneous haplotypes had fewer than five supporting reads. In samples with at least 100 reads, 15% of identified haplotypes were erroneous but only 32% of these had fewer than five supporting reads. Therefore, in instances where samples with low read count must be included, researchers may decide to apply filters that are dependent on sample read depth, similar to the parasitaemia-dependent frequency filters created by Mideo et al. .
Frequency and source of haplotype errors in the mock samples
The PASEC pipeline contains customized filtration and error-correction steps to remove erroneous CSP and SERA2 haplotypes. The filtration and error-correction steps in PASEC were designed to address three main sources of erroneous haplotypes: sequencing errors, chimeric reads, and sample contamination. The frequency of these error types and the efficacy of the various PASEC filters are discussed in more detail below. To provide a more complete profile of the error types found in amplicon data, the presented results are minimally—not optimally—filtered. As most filters will result in a tradeoff between sensitivity and specificity, researchers can tailor the exact level of filtering to their specific data set and scientific question.
Nucleotide sequence errors
The majority of erroneous haplotypes are expected to result from sequence errors (nucleotide substitutions or indels) that occur during Illumina sequencing or the initial rounds of PCR. The PASEC pipeline accounted for these errors with two approaches: (1) hard masking error-prone sequence regions and (2) clustering haplotypes that differed by a single nucleotide and had a read coverage ratio ≥ 8:1. Hard masking was applied to two homopolymeric regions in CSP composed of 9 and 6 poly-Ts. In the raw data, erroneous indels within these two regions were detected in 5.7% and 1.2% of full-length reads. While true indels might occur in these sequences in natural populations, this high artifactual indel rate suggests that inference of variants in these regions would be too unreliable using Illumina sequencing. Compared to masking, the clustering of haplotypes had an even greater impact on reducing nucleotide errors: 57.0% of CSP haplotypes and 47.9% of SERA2 haplotypes were eliminated at this step.
In the final minimally filtered dataset, approximately half of the erroneous haplotypes (51%) differed from a true haplotype by one or two nucleotide changes and were likely the result of Illumina sequencing or PCR errors. As discussed above, these haplotypes were supported by fewer reads than true haplotypes (Fig. 3b) and were more prevalent in samples with low read count. Additional filtration could therefore be applied on these factors to further reduce the false positive rate after assessing the potential need to detect closely related haplotypes with a high frequency skew.
Chimeric reads are false recombinant haplotypes generated during PCR amplification. While a necessary consideration when performing amplicon sequencing, their overall impact on the mock sample analysis was minimal. Potential chimeras were identified with the isBimera function in DADA2 , which identifies all haplotypes that could be constructed from a simple combination of two other haplotypes within the same sample. This analysis flagged 7 CSP and 16 SERA2 samples as containing a total of 36 chimeric haplotypes. Eleven (31%) of the flagged haplotypes were in fact true haplotypes known to be within the given sample. Further analysis showed that 20 of the 25 flagged erroneous haplotypes were only one nucleotide change away from another haplotype in the sample, and the remaining five were related by two nucleotide changes. This suggests that these haplotypes may have resulted from PCR or sequencing error instead of chimeric read formation. Eighteen (78%) of the flagged samples had total read counts under 200, the read threshold previously used with the PASEC pipeline . The increased stochasticity associated with low-read samples may explain why these haplotypes were not merged as part of the PASEC sequencing error filter.
Correctly identifying chimeric reads in natural infections presents an additional challenge, especially in regions of high malaria prevalence where recombination among haplotypes will be higher. Of the 50 most common CSP sequences detected in sub-Saharan Africa , 38 (76%) were flagged as chimeric combinations by DADA2. Researchers must therefore consider additional factors like population-level haplotype frequency when identifying chimeric reads in natural infections [19, 20].
Cross-sample or environmental contamination
A large percentage (49%) of erroneous haplotypes had no evidence of chimerism and were unlikely to have resulted from sequencing errors as they were ≥ 3 nucleotide changes away from any true haplotype within a given sample. 68% of these haplotypes were present in other samples from the same MiSeq run, suggesting cross-sample or environmental contamination. The remaining haplotypes occurred only once in the whole dataset and may have resulted from environmental contamination. A small amount of cross-sample or environmental contamination was also observed in the negative control samples that contained either water (N = 2) or human DNA (N = 2). These four Plasmodium-free samples contained 5, 7, 16, and 20 reads, respectively. All of these read counts fell well below per-sample threshold of 200 reads that was used previously with the PASEC pipeline .
Comparison of PASEC with three state-of-the-art amplicon analysis tools
The performance of PASEC—a pipeline that has been carefully tuned for use with the CSP and SERA2 amplicons in P. falciparum—was compared to that of three analysis tools that were developed to be applied to amplicons from any genomic region: DADA2 , HaplotypR , and SeekDeep . All four of these tools were designed to detect low-frequency haplotypes and differentiate unique haplotypes with single-nucleotide resolution. There are, however, differences in the analytical approaches. For instance, during error filtration PASEC and HaplotypR rely mainly on variant frequency and read depth, while SeekDeep incorporates k-mer frequencies and base quality scores and DADA2 further models sequencer-specific error likelihoods. SeekDeep additionally allows users to incorporate replicate PCR and sequencing runs into the analysis. This approach provides higher confidence for differentiating between sequencing errors and true haplotypes that differ at only a single nucleotide. However, as the mock samples did not provide the opportunity to discriminate between such closely related haplotypes, this SeekDeep feature was not evaluated in the trial.
While all these tools have undergone rigorous testing, no previous study has focused on their performance under extremely low parasite densities (but see ). Here, each tool was applied to the mock samples and it was evaluated on (1) the proportion of all expected haplotypes that were observed (sensitivity) and (2) the proportion of observed haplotypes that were expected (precision).
Sensitivity and precision
Overall, the four tools performed comparably on the mock sample panel, although they showed more variability in precision than in sensitivity (Fig. 6). What differs most between pipelines is their ability to filter out erroneous haplotypes, not identify correct haplotypes. For instance, while the sensitivity of SeekDeep1x—the SeekDeep implementation using only one technical replicate—was comparable to the other four pipelines, its precision was substantially lower, driven by the identification of a high number of erroneous haplotypes. The use of replicate samples in SeekDeep2x greatly decreased the tool’s false positive rate, increasing precision with a small cost in sensitivity.
Each tool’s performance varied to some extent across amplicons. This variation was not consistent across pipelines, and as a result, the pipelines’ rank order for precision and sensitivity was different for CSP and SERA2 (Table 1; Additional file 1: Fig. S6).
Effect of sample read depth and genome copy number
All five pipelines showed reduced performance at low parasite concentrations (< 5 genomes/μl of template or < 25 genomes/PCR reaction; Additional file 1: Fig. S7) and at very low read depths (< 25 reads/sample; the exception being HaplotypR, which filtered out samples with < 25 reads). In particular, SeekDeep2x performed best on samples with at least 100 reads (Fig. 6b). Parasite genome copy number also affected the tools’ success at returning any data for a sample (i.e., resolving at least one haplotype within that sample). Overall, the pipelines reported haplotypes within 78% (HaplotypR), 81% (DADA2), 84% (SeekDeep2x), 89% (PASEC), and 96% (SeekDeep1x) of the samples (Additional file 1: Fig. S8A). The majority of the samples returning no data contained Plasmodium DNA concentrations under 5 genomes/μl (Additional file 1: Fig. S8B).
Determination of major haplotype frequency
As reported above, PASEC correctly identified the expected major haplotype in 73% of the mock samples. Misidentification of the expected haplotype could result from errors in the pipeline or stochasticity during sample construction, PCR amplification and sequencing. Strongly suggesting that stochasticity in sample processing and sequencing plays a role, the frequency estimate for each sample’s major haplotype was highly correlated between tools (Pearson’s r for all pairs > 0.85, P < 0.001; Additional file 1: Fig. S9A). The correlation between tools was even higher when limiting the analysis to samples with at least 100 reads (Pearson’s r for all pairs > 0.97, P < 0.001; Additional file 1: Fig. S9B). All tools, therefore, arrive at comparable frequency estimates based on the number of reads produced per haplotype.
Analysis of natural infection samples from sub-Saharan Africa with the four tools
All five pipelines were then applied to newly generated amplicon data from 95 previously extracted parasite positive blood spots from four countries in sub-Saharan Africa (Fig. 1c) . These biological samples were PCR amplified and sequenced in duplicate, yielding 190 independently sequenced samples for each of the two amplicons. With the exception of SeekDeep2x, the technical replicates were again treated as separate samples in the analysis step. All tools were run with the same parameters used for the mock samples.
The tools differed in the total number of unique haplotypes identified across the samples, with estimates ranging from 48 to 336 for CSP and 38 to 412 for SERA2 (Additional file 1: Fig. S10). For both amplicons, SeekDeep1x and DADA2 identified substantially more haplotypes than the other approaches, although a large percentage of these haplotypes were found at within-sample frequencies under 1%, raising the possibility that they were artifacts. Only PASEC identified a three nucleotide indel in SERA2 that was found on seven different haplotypic backgrounds. This was because the PASEC hard filters permitted this indel to remain based on its prior observation in African parasites .
Consistent with expectations for sub-Saharan Africa, the majority of the natural infection samples contained multiple P. falciparum parasite haplotypes. COI was estimated for each sample as the maximum number of unique haplotypes identified at either of the two amplicons. With the exception of SeekDeep1x, all four tools produced similar trends of mean COI per country (Fig. 7; Additional file 1: Fig. S11). The overall higher number of haplotypes identified with SeekDeep1x is also in keeping with the observation that SeekDeep showed lower precision on the mock samples than the other tools when run with single replicates (Fig. 6).
Amplicon sequencing of complex haplotypic regions is a powerful tool being applied to an increasing range of questions in malaria research. This highly scalable approach can accurately estimate COI, identify distinct haplotypes within polyclonal infections, and permit temporal tracking of distinct clones, however, reliable analysis requires a thorough understanding of potential error sources. Previous applications and evaluations of amplicon sequencing have focused on moderate to high density infections. Here, the performance of amplicon sequencing was assessed for the first time under a scenario of extremely low parasite densities (1–200 genomes/μl of DNA template), which mimicked samples that could be obtained from asymptomatic carriers. The results show that amplicon sequencing remains a viable approach under such challenging scenarios, as it was able to detect 77% of individual haplotypes present at concentrations of 5–10 genomic copies/μl when using 5 μl of template per PCR reaction. The ability of Illumina-based amplicon sequencing to reliably detect Plasmodium DNA at these extremely low concentrations shows that it has a limit of detection on par with standard nested PCR  and qPCR  methods.
While amplicon sequencing is successful at low parasite densities, analysis of such samples presents unique challenges, particularly when parasite DNA concentration drops below 5 genomes/μl. At these low concentrations, overall sample-level error rates are higher and quantification of haplotype ratios is less accurate, regardless of the applied analysis tool. Researchers should, therefore, take steps to lower false positive rates in this challenging class of samples. Since erroneous haplotypes are generally supported by fewer reads (Fig. 3b) and samples with lower read counts have a higher proportion of false haplotypes (Additional file 1: Fig. S5), it should be standard practice to raise read thresholds when analysing low parasitaemia or low coverage samples.
PASEC’s high performance was the result of hand-tuning for use with the amplicons CSP and SERA2. This included the hard masking of difficult-to-sequence homopolymer runs in the CSP amplicon and the a priori identification of indels in SERA2. As a result of this customization, it was the only tool to identify a naturally occurring three nucleotide deletion in SERA2 that is present in Africa. Importantly, however, this study shows that three other tools—DADA2, HaplotypR, and SeekDeep—also provide robust results when prior knowledge of the error profile of an individual amplicon is unavailable and rapid, parallelized analysis is not needed.
Amplicon sequencing will become more useful as further methodological development is undertaken. For instance, ongoing updates to SeekDeep (made after v. 2.6.0, which is used here) have focused on improving both sensitivity and specificity, especially with low read-depth and single-replicate samples (github.com/bailey-lab/SeekDeep). In this analysis, precision varied most among tools, resulting from their different approaches towards error correction. As the rank order of the tools’ precision differed between the two amplicons, however, the relative success of these different approaches seems dependent on genetic context. Evaluation of these tools on a larger set of diverse amplicons will be required to formulate an understanding of how specific genetic characteristics drive these differences in precision. In the meantime, with PASEC and SeekDeep in particular, users can increase precision by implementing a simple 100 read threshold at the sample level (Table 1) or by calibrating filters with variable read thresholds when parasite concentrations are known . Additional increases in precision will require further development in areas like contaminant identification, and this work is ongoing [29, 30]. These advances will also improve sensitivity with low-frequency haplotypes as more refined error identification could lessen the need for stringent cutoffs like the 1% within-sample read count filter recommended with PASEC. Further improvements in sensitivity, however, will largely rely on changes upstream of the analysis stage as the inability to detect a haplotype generally resulted from a failure to capture it at the amplification or sequencing stage. This is reflected by the roughly equivalent sensitivities for the four evaluated tools.
The exact error profile described here is not directly portable to studies that use alternative amplicons and PCR protocols or that employ different sequencing methods. Still, it likely provides reasonable guidelines for the use of amplicon sequencing with low-density samples. Mideo et al.  previously implemented sample-level filtering with a different CSP amplicon that was sequenced using Ion Torrent technology. Using a dilution series of mock samples, they evaluated the relationship between parasite density and haplotype error rate, allowing haplotype frequency cutoffs to shift as a function of sample parasite DNA concentration. As in the study here, they found that the proportion of erroneous reads within a sample increased dramatically below 6 genomic copies/μl. However, while this similarity is suggestive, it should not preclude future evaluations with different protocols, and researchers should continue to inform filtration parameters with study-specific error estimates.
Similarly, studies that use other amplicons or sample from different geographic regions must consider the expected haplotype diversity within the targeted parasite population. This knowledge can refine filtering at both the nucleotide and haplotype level. At the nucleotide level, segments prone to sequencing errors can be hard masked, and alternatively, known variants—like difficult-to-sequence indels—can be permitted to pass through otherwise stringent filters. Such filtration is directly incorporated into PASEC but could also be performed post hoc with other analysis tools. At the haplotype level, comparing the frequencies of haplotypes within samples, within plates, and across the entire population can help flag sequencing errors, chimeric reads, and instances of contamination. All the amplicon analysis pipelines used here rely on population-level information either gathered previously or drawn simultaneously from the dataset to inform filtering. As filtering cutoffs directly affect both sensitivity and specificity, however, researchers should make informed decisions regarding the expected sensitivity/specificity tradeoffs, especially in the instances where filtering levels are not manually set by the user. In addition for large studies, filtering can be implemented in an iterative way as more data are acquired for a given population.
As demonstrated here with the new tool PASEC, amplicon sequencing can be applied to samples with both low and high parasite densities, although the consistent detection of parasite clones with very low prevalence (< 5 genomes/μl of extracted DNA) is challenging. When used under their recommended conditions, three other versatile analysis tools (DADA2, HaplotypR, and SeekDeep) showed similar performance compared to PASEC. Overall, all tools performed well, and so final choice of analysis method will depend largely on study design (e.g., the inclusion of technical PCR/sequencing replicates), the read coverage of the samples, and expectations regarding the targeted Plasmodium genotypes (e.g., the potential presence of indels or the need to differentiate between low frequency haplotypes with a single SNP difference). Regardless of the tool used, however, it should be standard practice to raise read thresholds when analysing amplicon data from samples with low parasitaemia or low coverage (< 100 reads) and to tailor final filters based on haplotype frequencies within the study population.
Availability of data and materials
The datasets generated and analysed during the current study are available in the NCBI Sequence Read Archive under BioProject PRJNA542392.
complexity of infection
single nucleotide polymorphism
Neafsey DE, Juraska M, Bedford T, Benkeser D, Valim C, Griggs A, et al. Genetic diversity and protective efficacy of the RTS, S/AS01 malaria vaccine. N Engl J Med. 2015;373:2025–37.
Juliano JJ, Porter K, Mwapasa V, Sem R, Rogers WO, Ariey F, et al. Exposing malaria in-host diversity and estimating population diversity by capture-recapture using massively parallel pyrosequencing. Proc Natl Acad Sci USA. 2010;107:20138–43.
Lerch A, Koepfli C, Hofmann NE, Kattenberg JH, Rosanas-Urgell A, Betuela I, et al. Longitudinal tracking of Plasmodium falciparum clones in complex infections by amplicon deep sequencing. bioRxiv. 2018.https://doi.org/10.1101/306860
Ngondi JM, Ishengoma DS, Doctor SM, Thwai KL, Keeler C, Mkude S, et al. Surveillance for sulfadoxine-pyrimethamine resistant malaria parasites in the Lake and Southern Zones, Tanzania, using pooling and next-generation sequencing. Malar J. 2017;16:236.
Rao PN, Uplekar S, Kayal S, Mallick PK, Bandyopadhyay N, Kale S, et al. A method for amplicon deep sequencing of drug resistance genes in Plasmodium falciparum clinical isolates from India. J Clin Microbiol. 2016;54:1500–11.
Talundzic E, Ndiaye YD, Deme AB, Olsen C, Patel DS, Biliya S, et al. Molecular epidemiology of Plasmodium falciparum kelch13 mutations in senegal determined by using targeted amplicon deep sequencing. Antimicrob Agents Chemother. 2017;61:AAC.02116.
Lin JT, Hathaway NJ, Saunders DL, Lon C, Balasubramanian S, Kharabora O, et al. Using amplicon deep sequencing to detect genetic signatures of Plasmodium vivax relapse. J Infect Dis. 2015;212:999–1008.
Patel JC, Hathaway NJ, Parobek CM, Thwai KL, Madanitsa M, Khairallah C, et al. Increased risk of low birth weight in women with placental malaria associated with P. falciparum VAR2CSA clade. Sci Rep. 2017;7:7768.
Dara A, Travassos MA, Adams M, Schaffer DeRoo S, Drábek EF, Agrawal S, et al. A new method for sequencing the hypervariable Plasmodium falciparum gene var2csa from clinical samples. Malar J. 2017;16:343.
Mideo N, Bailey JA, Hathaway NJ, Ngasala B, Saunders DL, Lon C, et al. A deep sequencing tool for partitioning clearance rates following antimalarial treatment in polyclonal infections. Evol Med Public Health. 2016;2016:21–36.
Nair S, Li X, Arya GA, McDew-White M, Ferrari M, Nosten F, et al. Do fitness costs explain the rapid spread of kelch13-C580Y substitutions conferring artemisinin resistance? Antimicrob Agents Chemother. 2018;62:AAC.00605.
Miller RH, Hathaway NJ, Kharabora O, Mwandagalirwa K, Tshefu A, Meshnick SR, et al. A deep sequencing approach to estimate Plasmodium falciparum complexity of infection (COI) and explore apical membrane antigen 1 diversity. Malar J. 2017;16:490.
Lerch A, Koepfli C, Hofmann NE, Messerli C, Wilcox S, Kattenberg JH, et al. Development of amplicon deep sequencing markers and data analysis pipeline for genotyping multi-clonal malaria infections. BMC Genomics. 2017;18:864.
Snounou G, Viriyakosol S, Zhu XP, Jarra W, Pinheiro L, do Rosario VE, et al. High sensitivity of detection of human malaria parasites by the use of nested polymerase chain reaction. Mol Biochem Parasitol. 1993;61:315–20.
Davis NM, Proctor D, Holmes SP, Relman DA, Callahan BJ. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome. 2018; 6: 226. https://doi.org/10.1186/s40168-018-0605-2
We thank Anita Lerch for thoughtful comments on the manuscript, Katelyn Durfee for help preparing the mock samples, and members of the Neafsey and Wirth labs for helpful discussions throughout the course of the project.
This project has been funded in whole or in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Grant Number U19AI110818 to the Broad Institute. This project was also funded by the PATH Malaria Vaccine Initiative, which received a Grant from the Bill and Melinda Gates Foundation.
Authors and Affiliations
Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
Angela M. Early, Rachel F. Daniels, Timothy M. Farrell, Sarah K. Volkman, Dyann F. Wirth, Bronwyn L. MacInnis & Daniel E. Neafsey
Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
Angela M. Early, Rachel F. Daniels, Timothy M. Farrell, Sarah K. Volkman, Dyann F. Wirth, Bronwyn L. MacInnis & Daniel E. Neafsey
Genomics Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
College of Natural, Behavioral, and Health Sciences, Simmons University, Boston, MA, 02115, USA
RFD and JG generated data. AME and TMF analysed the data. AME, RFD, TMF, SKV, DFW, BLM, DEN participated in critical evaluation and interpretation of the data. AME wrote the manuscript with contributions from RFD and TMF. All authors read and approved the final manuscript.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.