Optimization of whole-genome sequencing of Plasmodium falciparum from low-density dried blood spot samples

Background Whole-genome sequencing (WGS) is becoming increasingly useful to study the biology, epidemiology, and ecology of malaria parasites. Despite ease of sampling, DNA extracted from dried blood spots (DBS) has a high ratio of human DNA compared to parasite DNA, which poses a challenge for downstream genetic analyses. The effects of multiple methods for DNA extraction, digestion of methylated DNA, and amplification were evaluated on the quality and fidelity of WGS data recovered from DBS. Methods Low parasite density mock DBS samples were created, extracted either with Tween-Chelex or QIAamp, treated with or without McrBC, and amplified with one of three different amplification techniques (two sWGA primer sets and one rWGA). Extraction conditions were evaluated on performance of sequencing depth, percentiles of coverage, and expected SNP concordance. Results At 100 parasites/μL, Chelex-Tween-McrBC samples had higher coverage (5 × depth = 93% genome) than QIAamp extracted samples (5 × depth = 76% genome). The two evaluated sWGA primer sets showed minor differences in overall genome coverage and SNP concordance, with a newly proposed combination of 20 primers showing a modest improvement in coverage over those previously published. Conclusions Overall, Tween-Chelex extracted samples that were treated with McrBC digestion and are amplified using 6A10AD sWGA conditions had minimal dropout rate, higher percentages of coverage at higher depth, and more accurate SNP concordance than QiaAMP extracted samples. These findings extend the results of previously reported methods, making whole genome sequencing accessible to a larger number of low density samples that are commonly encountered in cross-sectional surveys.


Background
Whole genome sequence (WGS) data provide a complete picture of a pathogen genome and are transforming molecular epidemiology [1]. In the past few years, the availability of WGS data from field collected malaria samples has grown along with the development of more sensitive, faster and lower-cost sequencing technologies. Until recently, generating these sequence data required collection of large volumes of venous blood, leukocyte depletion at the time of sample collection, and storage in the field-tasks often challenging to perform in resourcelimited settings [2,3]. In contrast, collection of dried blood spots (DBS) requires small sample volumes and such samples are easily stored and transported. However, the low volumes of blood present in DBS often result in low quality and quantity of parasite DNA, particularly relative to the overwhelming proportion of human DNA in the sample. In order to address these challenges, several studies have utilized different enrichment methods such as non-selective whole genome amplification (WGA) [4,5]; hybrid selection [6][7][8][9]; enzymatic digestion of host DNA using the MspJI family of restriction enzymes [10,11] and selective whole genome amplification (sWGA) [11][12][13][14]. Recently, collection of leukocyte-depleted dried erythrocyte spots has also showed promise to provide parasite enriched samples for WGS [15].
Unlike other enrichment methods, sWGA requires relatively small amounts of starting material and provides relatively high quality data for a low-cost-enabling WGS of field samples that would otherwise contain insufficient genetic material for Plasmodium falciparum [13], Plasmodium vivax [11] and Plasmodium knowlesi [14]. However, it is possible that current sWGA protocols can be improved to increase coverage, sensitivity, and reproducibility. For example, two recent studies that performed sWGA on clinical samples using two different sets of sWGA primers, reported an average 55% of reads mapped to the P. falciparum genome with 48% of the genome covered at ≥ 5 × in 24 samples with an average parasite density of 73,601 parasites per µL [13,16]. However, Oyola et al. established a parasitaemia threshold of 0.03% (> 1200 parasites/µL) to obtain at least 50% of the genome at 5 × in clinical samples. Similar level of coverage was reported in a P. vivax-specific sWGA [11], highlighting the need to improve these protocols for DBS samples.
The efficiency of sWGA could be potentially improved by standardized comparison and optimization of different DNA extraction methods from DBS; identifying optimal sWGA primer sets that provide higher yield with reduced bias; and performing enzymatic cleavage of human DNA prior to sWGA [10]. In this study, four DNA extraction methods followed by either enzymatic cleavage of human DNA or no cleavage, and amplification with WGA or sWGA using different primer sets were compared. The combination of conditions was compared in standardized samples across a range of low parasite densities extracted from DBS, and the resulting WGS data were compared in terms of coverage of the parasite genome, dropout rate, and the concordance of called variants.

Mock DBS samples
Mock DBS samples were made by mixing synchronized, ring stage cultured P. falciparum parasites (V1/S) with uninfected human whole blood to obtain a range of parasite densities (10, 100, 1000 and 10,000 parasites per μL of blood). DBS samples were stored at − 20 °C until processing.

DNA extraction and quantitative PCR
DNA was extracted from a single 6 mm hole-punch using four different extraction methods. (i) Saponin-Chelex as described previously [17]; (ii) a modified Tween-Chelex described below; (iii) QIAamp DNA Mini Kit (Qiagen, CA, USA) following the manufacturer's recommendations and (iv) QIAamp DNA Investigator Kit (Qiagen, CA, USA) as described elsewhere [13].
Tween-Chelex extraction was conducted by modifying the Saponin-Chelex extraction method. DBS were punched using a 6 mm hole-puncher into 1.5 ml microcentrifuge tubes. 1 mL of 0.5% Tween 20 (catalogue # P1379, Sigma Aldrich) in 1 × PBS was added into the tube containing DBS punches and incubated overnight at 4 °C. The samples were briefly centrifuged, Tween-PBS was removed and the punches were washed with 1 mL of 1 × PBS and incubated for 30 min at 4 °C. The samples were briefly centrifuged, PBS removed and 150 μL of 10% Chelex 100 resin (catalogue # 1422822, Bio-Rad Laboratories) in water were added to each sample, ensuring the DBS punches were covered with the Chelex solution and incubated for 10 min at 95 °C. The tubes were centrifuged at 15,000 rpm for 10 min and the supernatant was transferred to 0.6 mL microcentrifuge tubes and centrifuged at 15,000 rpm for 5 min. The extracted DNA was then transferred to a 96-well plate and stored at − 20 °C until processing. The parasite densities were confirmed using var-ATS ultra-sensitive qPCR as described previously [18].

McrBC digestion of human DNA
For a subset of samples, extracted DNA was digested with McrBC (catalogue # M0272S, New England Biolabs) in a 30 μL reaction containing 20 μL of extracted DNA, 10 units of McrBC, 1 × NEBuffer 2, 0.5 μL 100X BSA and 0.5 μL 100X GTP. The samples were incubated at 37 °C for 2 h followed by inactivation at 65 °C for 20 min. The digested products were used as templates for wholegenome amplification.

Selective Whole genome amplification (sWGA)
The sWGA reactions were performed using two sets of primers: 10A [13] and 6A10AD, a combination of 6A [12] and 10A [13] with an adjusted ratio of dNTPs. The sWGA reaction using the 6A10AD primer set was performed as described previously [13] except for the adjusted proportion of nucleotides in the dNTP mix (i.e., 70% AT and 30% GC), similar to the composition of the malaria genome. The master mix and reaction condition for the modified sWGA protocol is shown in Additional file 1: Table S1. A template DNA volume of 20 μL with an estimated 16-16,000 parasite genomes (Additional file 1: Table S2) per sWGA reaction was used. For the nonselective whole genome amplification, random hexamer primers were used following the manufacturer's instructions for the GenomiPhi V2 DNA Amplification Kit (catalogue # 45-001-221, GE Healthcare Life Sciences).

Whole-genome sequencing
The whole genome amplification product was purified with SPRI magnetic beads (catalogue #65152105050250) to remove unbound primers, primer dimers, and other impurities. Sequencing libraries were prepared using the NEBNext ® Ultra ™ II DNA Library Prep Kit (catalogue #E7103) following manufacturer's instructions. Samples were barcoded, purified, pooled and sequenced on the Illumina NextSeq550 or NovaSeq 6000 System using 150 bp paired-end sequencing chemistry. Only preliminary analyses were sequenced using the Illumina NextSeq550.

Sequence analyses
Reads were demultiplexed, filtered by base quality and poly-g tails were clipped. The reads were then aligned to the P. falciparum 3D7 reference genome (version 3) with BWA-MEM with secondary alignments marked [19]. Sample base quality was recalibrated using Genome Analysis Toolkit (GATK) BaseRecalibrator and GATK ApplyBQSR with SNP locations from the Pf3k project (Data Release 5) used as a prior. The number of reads per sample were then downsampled to minimum total reads for equivalent comparisons. Percentiles of genome coverage were calculated using GATK CollectWgsMetrics across the core genome. Dropouts were evaluated by taking mean coverage over 200 bp sliding windows across the core genome (Additional file 2). Percentage of reads mapping to the core P. falciparum genome was calculated with samtools flagstat [20].
Variant calling and variant-filtering was conducted on the genome sequences following the Genome Analysis Toolkit (GATK) Best Practices [21] with minor modifications. Variants were called by sample across the core genome using gatk4 HaplotypeCaller (-ERC GVCF) then genotyped across all samples using gatk4 CombineG-VCFs and gatk4 GenotypeGVCFs. The variants were recalibrated using gatk4 VariantRecalibrator and Apply-VQSR using data from the Pf3k project (Data Release 5) as a training set (annotations: QD, MQ, MQRankSum, ReadPosRankSum, FS, SOR). Variants were then filtered for passing VQSLOD and for biallelic SNPs. SNP concordance was measured with bcftools gtcheck, and gatk GenotypeConcordance. Concordance was measured using all known SNPs as well as with only high-quality non-homozygous SNPs (Additional file 2).

Experimental design
Dried blood spots (DBS) with densities from 10 to 10,000 parasites/μL blood were extracted using four different DNA extraction methods and then amplified by two different sets of sWGA primers or random hexamer primers with or without a prior treatment of input DNA with McrBC, an endonuclease which cleaves human DNA containing methyl cytosine [22] (Fig. 1). The recovery and quality of WGS were compared between the different combinations of methods. All analyses were performed considering the core genome of P. falciparum-excluding telomeres, centromeres and sub-telomeres.

Tween-Chelex extraction yielded the highest coverage in low density samples
The recovery and coverage of WGS for DNA extraction methods across parasite densities were investigated. Four methods were evaluated, two of which were spin column based and two Chelex based. DBS extracted with the Saponin-Chelex method had the lowest performance, with only 13% of reads mapping to the P. falciparum core genome and poor genome coverage (1X coverage ranging from 2 to 50%) across all parasite densities. This extraction method was not investigated further. QiaAMP Mini Fig. 1 Overview of the experimental workflow. Various combinations of DNA extraction, methylation dependent cleavage of human DNA, and whole genome amplification were compared. rWGA whole genome amplification using random hexamers, sWGA selective whole genome amplification using one of two sets of P. falciparum specific primers (10A and 6A10AD) and QiaAMP Investigator spin column kits performed similarly in terms of the percentage of reads that mapped to P. falciparum regardless of parasitaemia. After assessing the proportion of the genome left uncovered after sequencing the QiaAMP Investigator kit had a 3% higher dropout rate than QiaAMP Mini at 1000 parasites/μL (34% vs 37%) and 8% higher at 100 parasites/μL (36% vs 44%), and was also not investigated further (Additional file 1: Fig. S1). The performance of Tween-Chelex extraction and QiaAMP Mini kits were evaluated in further detail.
Extraction with QiaAMP Mini resulted in a consistently higher proportion of reads mapping to P. falciparum (Fig. 2a). However, the higher mapping rate did not Fig. 2 Comparisons of read mapping and genome coverage. a. Proportion of reads that mapped to either the reference P. falciparum 3D7 or hg19 human genome. Unmapped reads are indicated as other. Error bars indicate standard error between 2 experimental replicates. b The percentage of the core P. falciparum genome covered by a minimum depth is shown translate into better sequence coverage across the P. falciparum core genome (Fig. 2b). Samples extracted using the Tween-Chelex method consistently resulted in a greater proportion of the genome covered at 5 × depth than samples extracted via spin columns (e.g., 89% vs 72% at 100 p/μL). Similarly, DBS extracted via spin columns had higher dropout rates than those extracted with Tween-Chelex (e.g., 12% vs. 2%, respectively at 100 parasites/μL) (Fig. 3). In order to evaluate the accuracy of the mapped reads, the overall SNP concordance was compared for known SNP positions and de novo SNP calls to the reference 3D7 genome. Both extraction methods achieved high concordance, with Tween-Chelex outperforming QiaAMP for lower density samples (Fig. 4). Overall, the Tween-Chelex extraction method had a lower dropout rate, higher genome-wide coverage and higher rate of SNP concordance than the QiaAMP mini kit on low density DBS samples for a given number of total reads, despite having a lower fraction of these reads mapping to the P. falciparum core genome.

McrBC improves sensitivity and genome coverage in low density samples
Whole genome amplification has been shown to be a successful strategy for increasing the amount of material available for WGS of P. falciparum from DBS samples. This study compared performance for two sets of P. falciparum specific sWGA primer sets, 10A and 6A10AD, and a random hexamer-based WGA. Overall, the two sWGA primer sets performed similarly with respect to the percentage of reads that mapped to P. falciparum (Fig. 2a) and genotype concordance (Fig. 4). However, the 6A10AD primer set resulted in moderately higher breadth and depth of coverage across extraction methods and parasite densities (Fig. 2b). To test an alternative enrichment strategy to sWGA, WGA with random hexamers preceded by McrBC treatment was performed, resulting in a significantly lower genome coverage than sWGA (Additional file 1: Fig. S2). However, performing sWGA on McrBC treated samples yielded a substantial increase in the depth of genome coverage that is most marked at the lowest parasite densities extracted with Tween-Chelex. After McrBC treatment, the proportion of reads that mapped to the human genome dropped considerably at 100 parasites/ μL (Tween-Chelex: 43-5%), and 10 parasites/μL (Tween-Chelex: 87-47%), dramatically improving usability and cost efficiency of the assay (Fig. 2a). Conversely, McrBC improved P. falciparum genome coverage, with a more Fig. 4 Comparison of variant call concordance between experimental conditions and shotgun sequenced reference strains. a High quality biallelic SNP positions from pf3k_5.1 (https ://www.malar iagen .net/proje cts/Pf3k) were compared, as in Oyola et al. [13]. b High quality homozygous non-reference variant sites called de novo were compared. SNPs were not called for 10 parasites/μL samples without the addition of McrBC due to the lack of sufficient read depth pronounced increase in Tween-Chelex than QiaAMP mini extracted samples. Samples at 10 parasites/μL had a substantial increase in the percentage of the genome covered at 5 × from 23 to 60% (Fig. 2b).

Discussion
The ability to recover high quality sequencing data from low-density dried blood spots has useful and immediate implications for public health-allowing sample collection to be performed in a cost effective and scalable manner. In this study, a comprehensive evaluation was performed on the effects of DNA extraction method, McrBC based digestion of human DNA, and sWGA conditions across a range of parasite densities. Tween-Chelex extraction, digestion with McrBC, and sWGA using a new combination of 20 previously published primers [12,13] and an AT-biased dNTP mixture provided the highest yield and most reliable coverage, allowing whole genome sequencing data to be obtained from dried blood spots containing as few as 10 parasites per µL of blood. These findings extend the results of previously reported methods, making whole genome sequencing accessible to a larger number of low density samples that are common in cross-sectional surveys.
The data show that recovery of WGS from DBS samples with 10 parasites/µL of blood is possible, but the quality and accuracy was higher for samples containing at least 100 parasites/µL (e.g. 93% of the genome at 5 × coverage). McrBC treatment prior to sWGA allows for even greater recovery across all methods evaluated, and improved performance in all metrics tested. Other studies [10,23] have found varying results with methylation digestion methods which may be due to a difference in both enzyme and experimental process. Interestingly, QIAamp extraction resulted in a higher proportion of total reads mapping to the P. falciparum genome than Tween-Chelex extraction, but nonetheless resulted in lower coverage, higher drop-out, and lower SNP concordance. This apparent discrepancy could stem from preferential recovery of specific regions of the genome by spin columns but more even recovery by Tween-Chelex and can be seen as a compensating increase in a small part of the genome being covered at higher depth. This would result in the increased variability in genomic depth seen in QIAamp which in combination with the already large variability in sWGA results in a more random dropout pattern as opposed to Tween-Chelex showing only the depth variability from sWGA.
The factor evaluated with the least effect on sequence quality and recovery was the sWGA primer set, but there were still modest benefits to using a more expanded primer set with AT-biased dNTPs (6A10AD). These effects were present independent of extraction and digestion method, and showed consistent improvements over the previously published 10A primer set.

Conclusions
This study showed that Tween-Chelex extracted samples treated with McrBC digestion and amplified using 6A10AD sWGA conditions had higher performance with respect to minimal genome dropout, higher percentages of coverage at higher depth, and more accurate SNP concordance with respect to reference strains. Tween-Chelex extraction has the added benefit of being cheaper than spin columns, allowing researchers and elimination programs to scale analyses with respect to dollar per base coverage. McrBC digestion prior to sWGA provided a significant improvement over nondigested samples extracted with Tween-Chelex, most dramatic in the lowest parasite density samples. There are numerous logistical advantages of DBS collection over venous blood, including a lower barrier of training for sample collection, less blood to collect per sample, and no specialized transportation requirements, allowing researchers and control programs to incorporate whole genome sequencing, including samples as low as 10 parasites per µL, into research and surveillance activities where such data will be useful.
Additional file 1: Table S1. PCR conditions for sWGA amplification using modified sWGA primers. Table S2. Estimates of parasite densities from extracted DBS and WGA reactions. Figure S1: Comparison of genomic dropout of Saponin-Chelex, QIAamp-Investigator kit, and QIAamp-Mini for each sWGA and parasite densities from 10 parasites/uL to 10000 parasites/ uL. The read depth for each sample was 20M reads. Figure S2: Percentiles of coverage for QIAamp-Mini for 1000 parasites/uL at a read depth of 40M reads. Comparison of effects of sWGA and rWGA with McrBC.
Author's contributions NT, AC, EMD, BG and ST designed experiments. NT analysed and interpreted the sequencing results. AC and EMD performed the experiment. RS provided guidance on library preparation and sequencing. BG and ST provided guidance in experimental design and analyses. All authors read and approved the final manuscript.

Funding
This work was funded by the Bill and Melinda Gates Foundation (OPP1132226) and by the National Institute of Allergy and Infectious Diseases as part of the International Centers of Excellence in Malaria Research program (U19AI089674). BG is a Chan Zuckerberg Biohub investigator.