SNP barcodes provide higher resolution than microsatellite markers to measure Plasmodium vivax population genetics

Background Genomic surveillance of malaria parasite populations has the potential to inform control strategies and to monitor the impact of interventions. Barcodes comprising large numbers of single nucleotide polymorphism (SNP) markers are accurate and efficient genotyping tools, however may need to be tailored to specific malaria transmission settings, since ‘universal’ barcodes can lack resolution at the local scale. A SNP barcode was developed that captures the diversity and structure of Plasmodium vivax populations of Papua New Guinea (PNG) for research and surveillance. Methods Using 20 high-quality P. vivax genome sequences from PNG, a total of 178 evenly spaced neutral SNPs were selected for development of an amplicon sequencing assay combining a series of multiplex PCRs and sequencing on the Illumina MiSeq platform. For initial testing, 20 SNPs were amplified in a small number of mono- and polyclonal P. vivax infections. The full barcode was then validated by genotyping and population genetic analyses of 94 P. vivax isolates collected between 2012 and 2014 from four distinct catchment areas on the highly endemic north coast of PNG. Diversity and population structure determined from the SNP barcode data was then benchmarked against that of ten microsatellite markers used in previous population genetics studies. Results From a total of 28,934,460 reads generated from the MiSeq Illumina run, 87% mapped to the PvSalI reference genome with deep coverage (median = 563, range 56–7586) per locus across genotyped samples. Of 178 SNPs assayed, 146 produced high-quality genotypes (minimum coverage = 56X) in more than 85% of P. vivax isolates. No amplification bias was introduced due to either polyclonal infection or whole genome amplification (WGA) of samples before genotyping. Compared to the microsatellite panels, the SNP barcode revealed greater variability in genetic diversity between populations and geographical population structure. The SNP barcode also enabled assignment of genotypes according to their geographic origins with a significant association between genetic distance and geographic distance at the sub-provincial level. Conclusions High-throughput SNP barcoding can be used to map variation of malaria transmission dynamics at sub-national resolution. The low cost per sample and genotyping strategy makes the transfer of this technology to field settings highly feasible.


Background
Plasmodium vivax is the most widely distributed human malaria parasite outside sub-Saharan Africa, accounting for approximately 7.4 million clinical cases per year [1]. Despite previously being categorized as a benign infection, studies have revealed that P. vivax can cause severe and life-threatening malaria and some cases may be drug resistant [2][3][4][5]. Features of P. vivax biology such as relapse, low-density infections and the appearance of transmission forms (gametocytes) prior to detectable clinical symptoms [6,7], are a challenge for controlling and eliminating this disease. These characteristics, in addition to the high proportion of asymptomatic P. vivax infections in combination with increasing human movement also pose a significant challenge to malaria elimination [8,9]. Establishing a strong malaria surveillance system is essential to monitor the changing malaria landscape and to achieve elimination goals.
Currently endemic countries rely on traditional malaria surveillance approaches such as Light Microscopy (LM) and rapid diagnostic test (RDTs) [10,11], or more recently, through molecular diagnosis (PCR) of infection [12] to measure infection prevalence and provide an estimate of transmission intensity. Population genetics however can measure parasite genetic diversity, population structure, gene flow and relatedness of genotypes to help define transmission "zones", local transmission dynamics [13,14], to identify the source(s) of outbreaks [15], distinguish between local or imported cases [16,17], and track imported infections [18][19][20]. In combination with epidemiological data this can help to guide control strategies and to monitor the effect of interventions facilitate malaria elimination [21,22].
Capturing accurate population genetic signatures and tailoring molecular tools to the local malaria transmission scenario depends on the appropriate selection and use of informative molecular markers [23]. For over a decade, malaria population genetists have used panels of 6-12 microsatellites for assessment of malaria parasite transmission dynamics, its origins and dispersal [24][25][26]. Microsatellites have several advantages including being neutrally evolving [27,28], multiplexing can be easily done for ten or more markers in a single PCR cocktail and they are abundant in the genome [29]. Despite these advantages, microsatellites have high PCR amplification biases that may cause incorrect classification of dominant and minor haplotypes [30]. Furthermore the small number of markers, high mutation rate and difficulty of scoring alleles accurately decreases the resolution of microsatellite markers to identify related parasites [26]. Due to their high diversity, parasite population substructure may be missed in areas of high transmission since existing panels of 10 to 14 markers may provide inaccurate estimates of relatedness. Furthermore, microsatellite genotyping is difficult to standardize across laboratories, reproducibility is lacking, and amplicons require fragment analysis at core sequencing facilities [28,31]. Whole genome sequencing (WGS) is not widely used in malaria-endemic countries since it is costly and data analysis needs advanced bioinformatic tools and expertise [18].
As countries intensify their control programmes and approach malaria elimination, a robust, cost effective, rapid and easy-to-use set of molecular markers is urgently required as an alternative genotyping tool to rapidly track disease spread and imported cases. Socalled 'barcodes' composed of a panel of single nucleotide polymorphisms (SNPs) can be used to profile each parasite isolate and to generate population genetic insights [32][33][34]. SNP barcoding is more easily standardized across studies and offers more rapid and highly automated genotyping options compared to microsatellites [23,35]. Putatively universal (global) barcodes of 42 P. vivax SNPs have been developed and tested for their utility to determine parasite population structure [33]. However, there has been limited validation of these markers in different endemic settings to determine whether they are informative for all P. vivax populations. Informative barcodes need to distinguish between populations circulating in distinct geographic areas at sub-national resolution, as this is essential to inform malaria control programmes.
Universal barcodes may not provide an accurate estimate of local population structure due to ascertainment bias occuring if the SNP panel used was developed for populations other than those to be studied. That is, SNPs may be polymorphic (informative) in some populations, but not in others [31,36,37]. Ascertainment bias is a limitation to measuring the true allele frequency [38] resulting in minor allele frequency (MAF) biases in populations not included in ascertainment group [38,39]. Thus, global P. vivax SNP marker selection [33] may limit the resolution of this barcode to genotype parasite populations not closely related to the ascertainment group. Moreover, recent intensive malaria control activities may lead to changes in MAF with a loss of rare variants [19, Keywords: Malaria, Plasmodium vivax, Microsatellites, Single Nucleotide Polymorphisms (SNPs), Diversity, Population structure, Papua New Guinea 20] that could reduce the power of the existing SNP barcode to distinguish between different genotypes. The use of population genomic data to aid malaria control relies on having this insight at regional or sub-national resolution, and will vary for different endemic settings and stages in the elimination pipeline [23]. Therefore, validating available barcodes or developing a new barcode that accurately captures the diversity of a country's parasite population will facilitate characterization of the local malaria transmission scenario.
Here, we describe the development of a SNP barcode designed to capture the diversity of P. vivax populations of Papua New Guinea (PNG), which has the highest transmission of P. vivax in the world [40,41]. SNP barcodes have been employed in several studies in recent years [15,16,23,33], however their performance has not been compared to microsatellites, a commonly used genotyping tool [42][43][44] and used in our previous studies [9,[45][46][47]. We thus compared parasite population genetics using the newly developed SNP barcode with that for microsatellite markers to address the following research questions: (1) do these two marker panels capture parasite diversity at the sub-provincial scale at the same resolution, and (2) which marker panel has higher resolution to capture parasite geographic connectedness and population structure? Comparisons of SNP and microsatellite data demonstrates the superiority of this locally validated SNP barcode for monitoring parasite populations and tracking the source of infections.

Methods
This study aimed to develop a SNP barcode for high-resolution genomic surveillance of P. vivax in PNG (potentially applicable to other P. vivax endemic areas) and to benchmark it against an existing microsatellite marker panel that has been used in previous population genetic surveys. The study was performed using P. vivax isolates from four locations within a contiguous highly endemic region of the north coast of PNG, specifically in East Sepik (n = 1) and Madang Provinces (n = 3) (Fig. 1), where cross-sectional surveys and population genetic analyses were previously conducted using microsatellite markers [47]. Previous microsatellite analysis of these populations in 2005 and 2006 failed to identify any geographic population structure [46,47].

Plasmodium vivax isolates
For the genotyping, we selected a total of 94 P. vivax positive isolates collected during two cross-sectional surveys on the north coast of PNG including one population  Fig. 1) [48]. These P. vivax isolates were selected due to the availability of published microsatellite data [49], which was used to compare with SNP barcode data.

Selecting candidate SNPs
The software package, Genome Analysis Toolkit (GATK) [50] was used for selection of informative SNPs from published WGS data for 40 P. vivax isolates from three regions of PNG (Madang, East Sepik/Maprik and Milne Bay/Alotau) [14,51] (Additional file 1: Table S1). WGS data quality was checked using Fastqc and coverage checked using Bam coverage to identify high quality genomes. To identify informative SNP variants, pairedend raw reads were aligned to P. vivax Salvador I strain (PvSal1) [52] using the bwa-mem mapping algorithm [53]. SNPs were called and filtered using GATK Haplo-typeCaller [50]. To ensure uniform coverage of the whole parasite genome, variants present on 14 nuclear chromosomes were included after excluding all indel calls and 'blacklisted' highly polymorphic regions including the telomeres. Additional 'hard filtering' included retaining only biallelic single nucleotide variants (SNVs), high coverage (at least 90% of their bases covered up to 5x) SNPs, with a minor allele frequency (MAF) > 10% (0.10), with pairwise LD < 0.2 throughout a window of (0.5 kb), low positive or negative Tajima's D values (|Tajima's D| < 0.5) throughout a window of 0.5 kb, high heterozygosity (> 0.4) within PNG, and if they were relatively uniformly spaced across the P. vivax genome. From a total of 24,283 SNPs with MAF > 10%, 4006 remained after filtering and 220 relatively evenly spaced SNPs were selected for assay development (Additional file 2: Figure S1).

SNP barcoding assay development
The assay consists of a series of 20 × 8-10-plex PCRs, with multiplexed amplification of the target regions (PCR#1) using Locus-Specific Primers (LSP) containing universal Illumina overhang adaptors (OH), attaching to all amplicons from each sample (PCR#2) a short sequence tag (multiplex identifier, MID) unique to each sample. This was followed by pooling of all amplicons after indexing each sample and sequencing on an Illumina MiSeq (Additional file 2: Figure S2).
The major problem with multiplex PCR is primer dimer formation and melting temperature (T m ) variation between primers. To minimize these challenges, Prim-erPlex software [54] was used to design multiplex PCRs using LSP pools for target 400bp genomic regions which contain SNPs of interest. A total of 22 multiplex PCRs were designed, with each containing 8-12 LSP pairs (220 SNPs total). Primary multiplex PCRs were performed and optimized for each pool using published guidelines [55]. From a total of 220 SNPs, 42 were negative and were not amplified in the primary multiplex PCR. The remaining 178 SNPs were used for the assay development (Table S2)

SNP barcoding assay optimization
To optimize the assay, a total of six (three monoclonal and three polyclonal infections) P. vivax positive field samples were genotyped using 20 randomly selected SNP markers (2 multiplex sets) from a total of 178 described in the above section. The amplicon sequencing approach was used to amplify all 20 P. vivax genomic loci with target SNPs in multiple samples at a time. Following the optimization, the assay was applied to a total of 94 P. vivax positive field samples.
Due to the small size of the Plasmodium genome in comparison to the human genome, the presence of just a few nucleated human cells in field specimens can impede genotyping or sequencing sensitivity and specificity by contributing a large proportion of unwanted human DNA to the DNA sample. Thus, to obtain enough parasite DNA from field samples and to minimize contaminating human DNA, digestion with the McrBC enzyme, which is a DNA methylation-dependent restriction enzyme (MDRE), followed by random whole genome amplification (rWGA) [56,57] was used. In brief, digestion of human gDNA was done using the McrBC (methylation dependent) enzyme (New England Biolabs, United States) followed by whole genome amplification (WGA) of Plasmodium DNA by very high fidelity Phi29 DNA polymerase proofreading enzyme using the V2 DNA Amplification Kit (GE Lifesciences, Australia). The main aim of this protocol is to deplete contaminating human DNA in malaria field isolates by selectively digesting highly methylated DNA (human) followed by WGA of the remaining high molecular weight DNA (predominantly parasite). This enriches Plasmodium DNA for further use (e.g. SNP genotyping or whole genome sequencing). The protocol uses a minimum starting volume of 6ul of DNA extracted from human blood samples and works well even with low-density samples.
Primary PCR was performed to amplify target loci using optimized PCR#1 conditions as described in above. Then, primary PCR amplicons of each sample from two multiplex reactions were combined and purified using a QIAquick PCR Purification Kit (Qiagen) as per the manufacturer's protocol. The amount of DNA in the primary PCR was measured using the Qubit DsDNA High Sensitivity (HS) Assay Kit (Thermo Fisher Scientific, Scoresby, Victoria, Australia) and normalized by diluting overrepresented amplicons in PCR grade water. The secondary PCR reaction (PCR#2) was performed using cleaned primary PCR products as a template. Illumina adapters and a six-nucleotide sequence specific to each individual sample (MID index) was added to the template. Equimolar amounts of each amplicon pool from all samples were combined into a single tube and purified using AMPure XP magnetic beads (Beckman Coulter) for library preparation. Standard sequencing libraries were prepared following the manufacturer's recommended protocol and sequenced using in an Illumina MiSeq platform to generate (2X300) paired end reads. The TruSeq Custom Amplicon Sequencing Kit (Illumina, Inc) was used to allow 96 or more samples with integrated barcodes to be pooled prior to sequencing on an Illumina MiSeq.

Bioinformatic analysis
The raw FASTQ files were demultiplexed by binning based on the MID index, the read quality was checked using FastQC (Version 0.8.0) (https ://www.bioin forma tics.babra ham.ac.uk/proje cts/fastq c/) and combined FastQC output for all samples were visualized using Mul-tiQC [58]. Low-quality reads (< Q30), adaptors, primers, and reads shorter or longer than expected size of amplicon were trimmed using Trimmomatic [59]. Only reads that passed stringent quality filters progressed for alignment and variant calling.
Unmapped BAM files were generated from quality filtered and trimmed FASTQ files using the FastqtoSam function (http://broad insti tute.githu b.io/picar d/). The combined pipeline was then used to generate indexed, mapped BAM files. This pipeline consists of SamToFastq, bwa-mem and MergeBamAlignment to map reads, and generated a clean and indexed mapped BAM file. In brief, the sequenced reads were mapped to the P. vivax Salvador I strain reference genome bwa mem [50] (Additional file 2: Figure S3). Overall quality and genome coverage of mapped bam files were checked using QualiMap v.2.2.1 [60]. We set a minimum cutoff of 50-fold coverage and successful genotyping of loci in at least 60% of sequenced samples to avoid inclusion of PCR and sequencing errors. After removing unsuccessfully amplified loci, the coverage and the frequencies of the reference and alternative alleles were determined using the samtools mpileup function [61] for each sample and SNP. A VCF file from the samtools mpileup analysis output was further filtered using vcftools to remove sites containing insertions and deletions [62]. Finally, all selected SNPs were further confirmed by visually inspecting the individual mapped reads using IGV software [63].

Population genetic analyses
Alternative allele frequency (AAF) was computed as the proportion of genotyped samples whose genotype was not the reference allele for target loci. MAF was computed as the proportion of genotyped samples carrying the genotype that was least common (i.e. MAF = AAF if AAF < 0.5; MAF = (1-AAF) if not). To estimate the actual number of clones per sample the VCF file containing SNP data was converted to The Real McCOIL categorical method format: heterozygous call (0.5), homozygous minor allele (0), homozygous major allele (1) and no call (−1) and used as an input file for analysis of multiplicity of infection (MOI) using The Real McCOIL R package [63]. Input files for genetic analysis were created using the and PGDSpider (version 2.0.0.3) [65]. Genetic diversity was calculated as SNPπ using DnaSP Version 5.0 [64] for SNP data, and expected heterozygosity (H e ) and allelic richness, using the FSTAT software, version 2.9.4 for microsatellite data [65]), were calculated. Genetic differentiation (F ST ) was determined using DnaSP Version 5.0 [66] for SNP data and FSTAT 2.9.4 [67] for microsatellites. The Mantel Test was performed to measure associations between genetic distance and spatial geographical distance between catchments, using the R "Vegan" package [68]. Phylogenetic analysis was done based on the distance metric 1-P S using the "Ape" R package and the 'dist.gene' function for SNP and microsatellite data and visualized using the FigTree software, version 1.4.3.
The Bayesian clustering software, STRU CTU RE version 2.3.4 [69] was used to determine the number of discrete genetic clusters (K) and whether haplotypes cluster according to geographical origin. STRU CTU RE runs were performed with a burn-in period of 100,000 followed by 100,000 Monte Carlo steps. The simulations were replicated 20 times with different seeds for K values ranging from 1 to 20. The optimal K value was calculated based on Evanno's method of ΔK statistics. The CLUMPAK web-based server was used for summation and graphical representation of the STRU CTU RE results. The assumptions underlying the population genetics model in STRU CTU RE software may limit its use to detect malaria parasite population structure with declining transmission. Unlike natural populations, malaria parasites undergo inbreeding, clonal propagation, and there will be an absence of panmictic conditions when transmission declines. Therefore, to further explore parasite clustering the discriminant analysis of principal components (DAPC) was performed using the R package "Adegenet" [70]. DAPC is robust to Hardy-Weinberg disequilibrium or linkage disequilibrium [71].

Statistical analysis
The Mann-Whitney U test or a one-way analysis of variance were used to measure differences among two groups or more than two groups, respectively. To assess the concordance between genotype allele sharing by SNPs and microsatellite markers we performed correlation analysis using Kendall's Tau. Statistical analyses were performed using GraphPad Prism Software version 7.0 and a p value of ≤ 0.05 was considered statistically significant.

Identification of SNP candidates
From a total of 40 P. vivax isolates from PNG, 23 were sequenced at the Broad Institute (BI) in Boston, MA, USA [14] and the remaining 17 were sequenced at the Wellcome Trust Sanger Institute (WTSI), Cambridge UK as part of the MalariaGEN Plasmodium vivax Community Project [51]. To include the highest quality samples for SNP selection, 16 isolates were excluded due to low quality and poor coverage (less than 90% of their bases covered up to 5x) or to remove the lowest quality genome of any duplicated samples (4 isolates). Additionally, four samples derived from sequencing pooled isolates were excluded from analysis since pooling could affect variant calling. The remaining 20 P. vivax genomes, originating from three hyper-endemic provinces of PNG (Madang = 17, East Sepik = 2, Milne Bay = 1, Fig. 1, Additional file 1: Table S1), were used to select informative SNPs. From a total of 405,825 variants present on 14 nuclear chromosomes, 144,517 were included after excluding all indel calls and 'blacklisted' highly polymorphic regions including the telomeres. Finally, after additional 'hard filtering' (see "Methods"), 220 SNPs with MAF greater than 10% and relatively uniformly spaced across P. vivax genome were selected for assay development (Additional file 2: Figure S1).

Assay development
Six P. vivax field isolates were used to develop and optimize the new SNP genotyping assays. To evaluate the amplification of each target locus in the multiplex PCR, single-plex PCR was performed using primary multiplexed PCR products as the template. Of the six samples, three contained single clones (multiplicity of infection, MOI = 1) and the remaining three samples had two clones (MOI = 2) based on Pvmsp1F3 and Pvms16 genotyping [48]. Polyclonal infections were included to assess amplification bias of SNP alleles in complex infections due to multiple amplification steps. Of the 220 primer pairs tested, 178 produced a single clear band of the expected size. These 178 SNP loci were then used to develop a multiplex PCR, for further genotyping of P. vivax isolates from PNG (Additional file 3: Table S2). Only two SNPs from the previously developed barcode [33] met the inclusion criteria (Additional file 3: Table S2). To assess amplification bias that may occur due to preliminary amplification steps such as whole genome amplification (WGA), amplicon deep sequencing was performed for six samples with, and without, WGA for a test set of 20 SNP markers (2 × 10-plex PCRs). A total of 17 of 20 (85%) SNPs were successfully genotyped in these samples. There were no significant amplification differences before and after WGA (p = 0.13 for MOI = 1 samples and p = 0.92 for MOI = 2 (Mann-Whitney U test)) (Additional file 2: Figure S4a). There were also no observed discrepancies in the genotype calls between the WGA and non-WGA samples. There was an average read depth of 133 (range = 50-567) for test run samples (Additional file 2: Figure S4b). This read depth of approximately 133X is suitable to call variants however deeper sequencing is required to do downstream population genetic analysis with high confidence.

Data summary and validation of the barcode
A total of 94 low complexity (MOI ≤ 2) P. vivax samples from a cross-sectional survey conducted in four catchment areas of PNG (Madang Province: Mugil, Malala and Utu; East Sepik Province: Ilaita area) (Fig. 1) were then genotyped for all 178 SNPs using the parallel targeted amplicon sequencing assay. These samples had already been genotyped with ten microsatellite markers in a previous study [49]. A total of 28,934,460 reads were generated from the MiSeq Illumina run with a variable sequencing coverage across samples per locus (median = 563, range 56 -7586). Of the 178 SNPs, five were not amplified at all (no reads detected) and 34 had high missingness (no reads for > 40% of samples). Of the 94 genotyped samples, 83 were successfully genotyped for the remaining 146 SNP markers (Additional file 2: Figure S4c) indicating the genotyping success rate amongst samples was 88.2% (83/94) with an 82.1% (146/178) marker positivity rate. There were no identical genotypes, suggesting that the barcode is a unique identifier for P. vivax isolates from PNG. The SNPs generally had moderate minor allele frequencies (MAF) with 98% of SNP loci showing greater than 10% MAF (Additional file 2: Figure S5a). There were no private SNPs (unique to any one population). Published genotyping data for MOI [72] and microsatellite data for these P. vivax isolates [49] was then used to compare the population genetic metrics with the new SNP barcode data.
The SNP barcode detects more multiple clone infections than classical genotyping for MOI Despite previous 'classical' genotyping for MOI using msp1F3 and ms16 indicating the majority of samples were single clone infections (i.e. one allele at both markers [48]), all 83 samples showed at least one heterozygous call (two alleles found amongst the reads at a particular SNP locus), which is evidence of polyclonality (genotyping error is filtered out by the variant calling algorithm). The Real McCOIL analysis indicated that out of the 83 successfully genotyped P. vivax samples, 69 samples (83.2%) have at least two clones and 24 samples (16.8%) were confirmed monoclonal infections. For the population genetic analyses, the dominant allele was used to reconstruct dominant haplotypes (Additional file 4: Table S3).

The SNP barcode captures variable genetic diversity amongst parasite populations
Variable levels of within-population genetic diversity were observed in the four parasite populations with an average nucleotide diversity (π) of 0.33 per SNP site (range = 0.24-0.45) (Fig. 2a). Nucleotide diversity was lowest in the inland Utu population and highest in the coastal Malala parasite population. The genetic diversity (Heterozygosity, H e ) of the same parasite populations using the microsatellite panel showed uniformly high genetic diversity among populations (mean H e = 0.82, range = 0.78-0.85) (Fig. 2b). Note that the different diversity measures necessary for these different markers may also impact this result. Therefore, we used the alternative metric 1-P S (1-pairwise allele sharing) (Fig. 2c, d) to measure genetic diversity within each parasite population for both markers. Genetic diversity by both marker a SNP barcode ] where n is the number of isolates sampled and p i is the allele frequency at the ith loci) using as FSTAT software version 2.9.4 [67]; c SNP barcode diversity and d microsatellite haplotype diversity. For c and d, box plots show the results from another genetic diversity metric, 1-mean pairwise allele sharing. The variation in the box and median distribution indicates variability in genotype relatedness amongst pairs of genotypes. The analysis was done using genetic distance matrix for 1-P S generated by the 'dist. gene' command in "Ape" R package panels was significantly different among the four parasite populations (p value < 0.001, Kruskal-Wallis test). In general, microsatellite genotypes had higher genetic diversity compared to SNP genotypes. However, microsatellites show a wider range of values and more closely related genotype pairs (outliers 1-Ps < 0.4) (Fig. 2d).

The SNP barcode detects parasite population divergence that is associated with geographic distance
Bayesian cluster analysis of SNP genotypes using STRU CTU RE software [66] identified that three genetic clusters (K = 3) provided the best fit for the SNP data and two genetic clusters (K = 2) for microsatellites (Additional file 2: Figure S6).
Less population structure and genotype clustering according to their geographic origin was observed by microsatellite markers compared to SNPs (Fig. 3). Discriminant Analysis of Principal Components (DAPC) detected higher levels of genotype assignment to different geographic origins and higher differentiation between distant compared to neighbouring populations for SNPs (Fig. 4a, top) than microsatellite markers (Fig. 4b, top). Microsatellites revealed limited differentiation of distant parasite populations such as East Sepik and Utu (Fig. 4b,  top).
To further explore the patterns of gene flow in different geographic areas we measured genetic differentiation (F ST ) and observed very low to moderate genetic differentiation (F ST = 0.02-0.12) between parasite populations using either marker panel (Table 1). However, values were higher for SNPs and there was greater differentiation of distant populations e.g. East Sepik vs Utu for the SNP marker data.
For SNP barcodes, the DAPC individual density plot also supports the F ST result, where distant parasite populations were more distinctly clustered (Fig. 4a, bottom) than the nearby populations. However, the microsatellite marker data showed an unusual clustering of distant parasite populations together (East Sepik and Utu) (Fig. 4b,  bottom). To assess whether the geographic distance between geographic clusters affects gene flow, a Mantel correlation test was conducted. The analysis showed a significant association between genetic distance and geographic distance in km for SNP markers (Fig. 5a), but not for microsatellite markers (Fig. 5b).

No association between microsatellite and SNP haplotype relatedness
To further explore clustering patterns and investigate the relatedness of individual SNP haplotypes, phylogenetic analysis was conducted using Neighbour Joining trees. This identified clusters of closely related isolates from the same province and village with moderate population structure and geographic clustering of genotypes (Fig. 6). More clustering of genotypes was found in the East Sepik population compared to the three parasite populations from Madang (Malala, Mugil and Utu for the SNP markers) (Fig. 6a). Overall, phylogenetic analysis supported the STRU CTU RE and DAPC results, with higher parasite clustering between East Sepik versus Madang by SNP barcode compared to microsatellites.
Unless there is overall high relatedness among genotypes, it is difficult to identify population structure using phylogenetic analysis due to high recombination between distinct clones. To further infer relatedness between parasites within and between populations, a simple pairwise allelesharing (P S ) measure was used. Relatedness analysis using the SNP markers (Additional file 2: Figure S6) showed that the majority of genotypes share alleles at 50-70% of markers suggesting parasites are unrelated [73]. Only a few genotypes showed high relatedness, with 70-90% of alleles shared (Additional file 2: Figure S7) within the population, but no identical genotypes were detected. The allele sharing analysis of the same P. vivax isolates using ten microsatellite markers was consistent with SNP data where the majority of microsatellite genotypes are unique and only a few genotypes shared a high proportion of alleles.
Concordance analysis of allele sharing between genotypes using the SNP barcode and microsatellite haplotypes showed no statistically significant association (r = 0.032, p-value = 0.56) (Additional file 2: Figure S8). Thus, high outcrossing in the PNG populations due to high transmission removes any association between these markers.

Discussion
Genomic surveillance of malaria parasite populations is a useful tool to assess changing transmission patterns, identify imported cases and track the spread of infections [23,74,75]. Reliable, cheap and high-resolution genotyping assays are therefore needed to support malaria control programmes. SNP barcodes have been developed to study the complexity of infection [76,77], parasite population structure and the origins of outbreaks [15,33,34,78]. However, ascertainment bias can reduce the sensitivity of detecting distinct clones and population genetic analyses to detect and track discrete parasite populations. SNP barcodes need to be validated and/or tailored to specific geographic areas to reflect the SNP diversity in local parasite populations [23]. Here, the development of a SNP barcode comprising 178 locallyvalidated biallelic SNP loci is described. This barcode was tailored specifically to PNG, one of the world's hotspots for P. vivax malaria infection. Genetic diversity and population structure amongst four distinct catchment areas on the north coast of PNG was compared for the SNP barcode and panel of ten polyallelic microsatellite markers that many groups have previously employed for population genetic analyses [24][25][26]. The results demonstrate the greater sensitivity of these large biallelic SNP barcodes for malaria genomic epidemiology and potential to provide useful data to guide malaria control strategies. The SNP barcode detected a higher number of clones compared to two highly polymorphic microsatellite markers ms16 and msp1F3, which have been used previously to measure multiplicity of infection [79,80]. This indicates that the SNP barcode has higher resolution to identify multiple infections, most likely due to the much larger number of loci genotyped. However, there is an upper limit to clone detection due to the biallelic nature of the SNP loci. It also suggests that the complexity of  P. vivax infections (based on a small number of loci) is currently underappreciated, and that the barcode will be more useful than small numbers of microsatellites in very low endemic settings to distinguish between very closely related parasites. Samples were pre-genotyped and selected for monoclonal infection, which limits the direct comparison of the ability of these two microsatellites and SNP barcode to identify polyclonal infections. Further evaluation of the barcode by genotyping a large set of randomly selected field samples is needed to fully assess its utility for estimating complexity of infection.
Population genetic analyses using the SNP barcode elucidated genetic diversity, relatedness, population structure and connectivity of circulating parasite populations at higher resolution relative to the larger panel of ten microsatellite markers. More variable genetic diversity among populations is captured by the SNP barcode than microsatellite markers. Previous work with Association between geographic and genetic distance between P. vivax populations of Papua New Guinea. Genetic differentiation between population pairs was measuring using F ST for a SNP barcodes and b microsatellites and measured in association with geographic distance based on geographic co-ordinates of villages. A Mantel test was used to measure the association using R "Vegan" package [67]   Labels indicate a unique sample ID and village of origin. Distance was measured using the genetic distance matrix for 1-P S calculated by "Ape" R package, 'dist. gene' command microsatellite markers in eight locations of PNG revealed geographic population structure between the mainland, islands and highland areas [9]. However, microsatellites were unable to differentiate populations at a finer spatial scale between the mainland north coast provinces of East Sepik and Madang [46,47,49]. Indeed, microsatellite performance has not previously been compared to SNPs in terms of their ability to differentiate between P. vivax populations. The results suggest that large numbers of SNPs have higher resolution to detect differences in transmission dynamics between populations. Moreover, SNP barcodes detected substantial geographic population structure between the four catchment areas with clustering of haplotypes according to their geographic origin, whereas microsatellites did not achieve this. There was also a significant association between genetic distance and geographic distance for SNPs but not for microsatellite markers suggesting that SNPs can accurately pinpoint geographic origins of infections, whereas microsatellites cannot. This also implies that large numbers of SNP markers can capture population connectivity at fine spatial scales in high transmission areas. The ten microsatellite markers identify some population structure, but it does not fit the expected isolation by distance pattern expected of this contiguous endemic area. The findings are consistent with another study on the malaria vector Anopheles darlingi where SNP markers showed higher discrimination among genetic clusters with more than 4-35 fold higher F ST estimates than microsatellite markers [81]. Other studies in different fish species have also shown that biallelic SNP markers have greater accuracy and finer population structure than microsatellite markers [82,83]. The SNP barcode is more sensitive because markers are more densely covering the chromosomes than the microsatellite panel with less than one marker per chromosome, and thus will more accurately detect relatedness among parasites through inherited segments of the genome, using Identity by Descent (IBD) approaches [84,85]. Also, in high transmission areas, genetic differentiation (i.e. the difference between the diversity of the subpopulations compared to the metapopulation [86]) is typically estimated as being low when using polyallelic microsatellite markers because their diversity is at a maximum, making it difficult to identify low levels of population structure. Other measures such as Jost's D have been used to overcome these limitations of microsatellites [47]. While our results suggest that the currently used ten microsatellite panel may have lower resolution to identify distinct genotypes and correctly identify related parasites, larger microsatellite panels will undoubtedly be more sensitive and may also deconvolute complex mixtures of clones within an infection. Selection of informative markers is important to track gene flow and quantify parasite connectivity using IBD measures [87]. SNP barcoding using an adequate number and density of SNPs will be important for the characterization of these population genetic signals, and to identify patterns of parasite migration [23,32,87]. The generation of additional WGS data for the surveyed populations would help to verify the performance of different marker panels. Related isolates have a higher probability of identical alleles at a given locus than unrelated genotypes [73,74]. In this study, a significant difference in pairwise allele sharing was not detected either within and between populations using SNP or microsatellite markers. The SNP barcode identified a narrow range of allele sharing with > 90% genotypes with alleles shared amongst 50-90% of the markers. This finding is consistent with the previous study by Nkhoma et al. [73] where a SNP barcode comprising 96 SNPs detected high allele sharing in diverse and unrelated parasites (P S of up to 0.74 observed in two unrelated parasites (0% IBD)). Thus, only genotypes with P S values greater than 0.8 (distributed at tail end of the histogram) for either SNP or microsatellite markers are truly related. This finding indicates that simple pairwise allele sharing (IBS) values do not accurately represent the actual percentage of the parasite genome that is IBD or that this measure is not as sensitive to measure the actual genotype relatedness. IBD values were not compared between the different markers in this study since the ten microsatellite markers will have limited sensitivity to estimate the proportion of the genome that is IBD. Studies using large SNP barcodes (> 100 loci) such as that described here, are recommended in order to apply IBD measures to calculate parasite relatedness and between population connectivity [87].

SNP barcode Microsatellites
A minor allele frequency criteria was applied to select informative SNPs (> 0.1 MAF) to capture diverse parasites in hyperendemic regions of PNG. It is recommended to use these validated (n = 146) bi-allelic SNPs for future genotyping of P. vivax parasites if informative for the studied parasite population. A similar approach has been used to select a 'universal' 42 SNP barcode from hundreds of thousands of SNPs from genomic sequences of globally diverse parasite isolates [33]. A recent study focused on Plasmodium falciparum revealed that barcodes of 93 or even just 24 SNP markers are adequate in a low transmission area to capture parasite connectivity, allow stratification of closely related parasite populations and identify source and sink populations-with a high sample size required for a small number of markers and vice versa [85]. Before applying the developed markers to genotype parasite isolates from a given population, it is recommended to validate markers by evaluating allele frequencies within a subset of samples. Moreover, SNP barcodes need to be continually evaluated and validated against local WGS data to ensure they provide similar insights into parasite population genetics.
In conclusion, the locally-validated SNP barcoding assay showed higher resolution to measure variations in P. vivax diversity and population structure at local (subprovincial) scale compared to the currently used panel of ten microsatellite markers. As countries approach malaria elimination, SNP barcoding will help to identify transmission zones and their dynamics and routes of parasite migration, and hence how to contain infections and to monitor whether control efforts are having an impact. The findings from this approach in combination with epidemiological data are essential to policy makers. The developed amplicon sequencing assay requires only a small amount of starting DNA (2 mL) and can be done relatively easily using available Next Generation Sequencing technology platforms at low cost (less than $18USD per isolate). This technology allows the "plug and play" incorporation of other markers such as SNPs informative for a given country/region or those associated with drug resistance that could help to concurrently genotype circulating parasites and resistance genes to give timely information for malaria control strategies. Overall, the findings suggest that SNPs may be better suited than the currently used microsatellite markers due to their higher resolution. SNP barcodes would also be more suitable in a control program setting, given the availability of costeffective and robust high-throughput sequencing and the relative lack of technical issues.
Additional file 1: Table S1. Details of 40 P. vivax isolates from PNG sequenced and used for SNP candidate selection.
Additional file 2: Figure S1. Overview of steps used to select informative SNPs from P. vivax whole genome sequence data. Figure S2. Overview of amplicon sequencing approach for multiple samples (96) to amplify all target SNP loci (178) in single sequencing run. Genomic loci were amplified using standard PCR (PCR#1) using locus specific primer with universal overhang adaptors (a). Using purified primary PCR product as a template an additional PCR, index PCR (PCR#2) (b) was performed to attach a multiplex identifier (MID) tag which is unique to each sample and attaches to universal overhang sequence. Unique sequence of MID for each sample enables pooling of secondary PCR products (96 samples in this protocol) for library preparation (c). Then high throughput "multiplexed" sequencing of the combined amplicons from all samples (96) in a single MiSeq run to produce require millions of paired-end reads (2x300bp) (d). e Data was analysed using standard bioinformatics tools and mapped reads were visualized using Integrative Genomics Viewer. The graph shows a large number of reads covering the target SNP locus. This result shows that all reads possess the alternative allele (green) at the SNP site. This high frequency of SNP calls amongst reads gives high confidence to differentiate true SNPs (indicated as SNPS at target locus) from sequencing artefacts (rare SNPs shown left and right side of target locus). Figure S3. Overview of bioinformatic data analysis pipeline. Data processing sequence read consisting of quality checking of raw sequence reads, primers and adaptor trimming, mapping reads to reference sequence, SNP calling and filtering, and population genetic and statistical analyses. Figure S4. Quality control of the Plasmodium vivax barcoding assay. a Assessment of amplification bias and SNP polymorphism among genotyped samples. Comparison of number of successfully amplified loci before and after rWGA of samples. There was no statistically significant difference in number of loci successfully genotyped before and after WGA, and between monoclonal and polyclonal samples. b Read coverage for all 20 SNP markers per sequenced sample. The graph shows that read count is different for different samples with high read count in samples with MOI = 2. Read count variation between amplified WGA) and unamplified samples (S) were not significant. c PCR and SNP genotyping success rate. From a total of 220 SNPs, 42 were negative and were not amplified in the primary multiplex PCR, and 32 failed during genotyping (either no reads detected or did not meet the quality filtering threshold). The remaining 146 were used as "SNP barcode" for downstream population genetic analysis (Additional file 3: Table S2). Figure S5. Distribution of the Minor allele frequency (MAF) of SNP barcodes in P. vivax parasite population in north coast PNG. Figure  S6. Optimal number of clusters for each marker based Evanno's method [42]. For SNP markers the method identified three genetic clusters (K = 3) and two genetic clusters (K = 2) for Microsatellites. However, for Microsatellites K = 2 is a common artifact of the hierarchical clustering algorithm when two very distinct populations are present, so higher K must be observed to identify possible sub-population structure. Figure S7. Pairwise allele sharing of P. vivax genotypes within and between population. a SNP genotype frequency distribution of pairwise allele-sharing (P S ). b Microsatellite genotype frequency distribution of pairwise allele-sharing (P S ). Black bars indicate within population and grey bars indicate between populations. There was no significant difference in allele sharing within and between parasite populations either SNP barcode or microsatellite markers in north coast of PNG. Figure S8. Kendall's Tau concordance analysis of genotype allele sharing by SNP and Microsatellite. Each boxplot indicates the proportion of shared alleles between genotypes by SNP. The analysis revealed no significant correlation between the SNP and microsatellite pairwise allele sharing values.
Additional file 3: Table S2. Details of 178 SNPs, SNPs included in final panel (n = 146), primers sequences and their multiplex setting. Table S3. The 83 P. vivax haplotypes used for the final data analysis.