Falciparum malaria from coastal Tanzania and Zanzibar remains highly connected despite effective control efforts on the archipelago

Background Tanzania’s Zanzibar archipelago has made significant gains in malaria control over the last decade and is a target for malaria elimination. Despite consistent implementation of effective tools since 2002, elimination has not been achieved. Importation of parasites from outside of the archipelago is thought to be an important cause of malaria’s persistence, but this paradigm has not been studied using modern genetic tools. Methods Whole-genome sequencing (WGS) was used to investigate the impact of importation, employing population genetic analyses of Plasmodium falciparum isolates from both the archipelago and mainland Tanzania. Ancestry, levels of genetic diversity and differentiation, patterns of relatedness, and patterns of selection between these two populations were assessed by leveraging recent advances in deconvolution of genomes from polyclonal malaria infections. Results Significant decreases in the effective population sizes were inferred in both populations that coincide with a period of decreasing malaria transmission in Tanzania. Identity by descent analysis showed that parasites in the two populations shared long segments of their genomes, on the order of 5 cM, suggesting shared ancestry within the last 10 generations. Even with limited sampling, two of isolates between the mainland and Zanzibar were identified that are related at the expected level of half-siblings, consistent with recent importation. Conclusions These findings suggest that importation plays an important role for malaria incidence on Zanzibar and demonstrate the value of genomic approaches for identifying corridors of parasite movement to the island.

through a combination of robust vector control and access to efficacious anti-malarial treatment, the archipelago of Zanzibar has been deemed a pre-elimination setting, having only low and mainly seasonal transmission [2]. Despite significant efforts, however, elimination has been difficult to achieve in Zanzibar. The reasons for Zanzibar's failure to achieve elimination are complex and likely driven by several key factors: (1) as transmission decreases, the distribution of cases changes and residual transmission is more focal and mainly outdoors [3]; (2) a significant number of malaria infections are asymptomatic and thus untreated and remain a source for local transmission [4][5][6][7]; and (3) the archipelago has a high level of connectivity with the mainland, thus imported malaria through human travel may play an increasing relative role in transmission.
Genomic epidemiology can supplement traditional epidemiological measures in studies of malaria transmission and biology, thereby helping to direct malaria elimination strategies [8]. Whole-genome sequencing (WGS) can be particularly useful for understanding the history of parasite populations and movement of closely related parasites over geographical distances [9,10]. Identity by descent (IBD), the sharing of discrete genomic segments inherited from a common genealogical ancestor, has been found to be a particularly good metric for studying the interconnectivity of parasite populations [11][12][13]. A major obstacle to studying IBD in microorganisms, and in particular malaria, is the presence of multiple clones in a single infection. In order to address this obstacle, recent algorithms have been developed to deconvolve multiple infections into their respective strains from Illumina sequence data [14,15]. These advances now make it tractable to conduct population genetic analysis of malaria in regions of higher transmission, where infections are often polyclonal.
Decreases in malaria prevalence are hypothesized to be associated with increasing inbreeding in the parasite population, decreased overall parasite genetic diversity and a reduced complexity of infection (COI), defined as a decreased number of infecting clones [8]. This has been shown in pre-elimination settings in Asia as well as in lower transmission regions of Africa [16][17][18]. It has not been determined if a similar reduction in diversity has occurred in Zanzibar with the significant reduction of malaria in the archipelago. WGS data was used to: (1) characterize the ancestry of parasites in the two regions, (2) determine the levels of genetic diversity and differentiation between archipelago and mainland, (3) determine patterns of relatedness and inbreeding and (4) search for signatures of adaptation and natural selection. Inferred genetic relationships were then examined for evidence of importation of parasites from the higher transmission regions of mainland Tanzania to the lower transmission regions of the Zanzibar archipelago. These findings improve understanding of how importation may affect malaria elimination efforts in Zanzibar.

Clinical samples
WGS was attempted on 106 P. falciparum isolates collected from subjects with uncomplicated malaria or asymptomatic infection from 2015 to 2017. Forty-three of these were leukodepleted blood collected as part of an in vivo efficacy study of artemether-lumefantrine (AL) in paediatric uncomplicated malaria patients collected from 2015-2017 in Yombo, Bagamoyo District. A remaining 63 isolates were from dried blood spots (DBS) collected in Zanzibar in 2017. These came from crosssectional surveys of asymptomatic individuals (n = 34) and an in vivo efficacy study of artesunate-amodiaquine (ASAQ) with single low dose primaquine (SLDP) in paediatric uncomplicated malaria patients (n = 29). These isolates essentially represent a convenience sample. Isolates were not selected for sequencing on the basis of specific clinical or epidemiologic characteristics; however, sequencing was more likely to be successful on isolates from subjects with high parasitaemia. Study participants from Zanzibar were asked to report any overnight travel away from home in the past 4 months. Responses were coded as yes (overnight travel to mainland Tanzania or Kenya) or no (no overnight travel off islands of Zanzibar). Clinical characteristics of the attempted and sequenced samples from each cohort from Zanzibar is provided in Additional file 1: Table S1.

Generation and sequencing of libraries
Leukodepleted blood samples and DBS were extracted using QIAmp 96 DNA blood kits per the manufacturer protocol (Qiagen, Hilden, Germany). DNA from leukodepleted blood was acoustically sheared using a Covaris E220 instrument, prepared for sequencing without enrichment using Kappa Hyper library preps, and individually barcoded per manufacturer's protocol (Kappa Biosystems, Columbus, OH). DNA extracted from DBS was enriched for P. falciparum DNA before library prep using two separate selective whole genome amplification (sWGA) reactions. The sWGA approach was adapted from previously published methods and employed two distinct sets of primers designed for P. falciparum, including the Probe_10 primer set described previously by Oyola et al. and another set of custom primers (JP9) designed using 'swga' [19][20][21]. Phosphorothioate bonds were included between the two most 3' nucleotides for all primers in both sets to prevent primer degradation. Design and evaluation of these custom primers and the sWGA approach are described in Additional file 1: Table S2. The two sWGA reactions were carried out under the same conditions. The products of the two sWGA reactions were pooled in equal volumes and acoustically sheared using a Covaris E220 instrument before library preparation using Kappa Hyper library preps. The indexed libraries were pooled and sequenced on a HiSeq 4000 using 2 × 150 chemistry at the University of North Carolina High Throughput Sequencing Facility. Sequencing reads were deposited into the NCBI SRA (Accession numbers: pending).

Public sequencing data
Illumina short read WGS data for P. falciparum isolates was downloaded from public databases. This included 68 isolates from other regions of Tanzania, collected between 2010 and 2013, as well as 179 isolates from other regions, including Southeast Asia, South Asia, East and West Africa (Additional file 1: Table S3).
Isolates may contain multiple strains that are haploid resulting in mixed infections with arbitrary effective ploidy. To account for this complexity of infection (COI), prior literature was followed [23] and the following quantities were calculated at each variant site: for each isolate, the within-sample allele frequency (WSAF), the proportion of mapped reads carrying the non-reference allele; the population-level allele frequency (PLAF), the mean of within-sample allele frequencies; and the populationlevel minor allele frequency (PLMAF), the minimum of PLAF or 1-PLAF. These calculations were performed with 'vcfdo wsaf ' (https ://githu b.com/IDEEL Resea rch/ vcfdo ).

Analyses of mutational spectrum
Ancestral versus derived alleles at sites polymorphic in P. falciparum were assigned by comparison to the outgroup species Plasmodium reichenowi. Briefly, an approximation to the genome of the P. reichenowi-P. falciparum common ancestor (hereafter, "ancestral genome") was created by aligning the P. falciparum 3D7 assembly to the P. reichenowi CDC strain assembly (version 3, Plas-moDB version 38: https ://plasm odb.org/commo n/downl oads/relea se-38/Preic henow iCDC/fasta /data/Plasm oDB-38_Preic henow iCDC_Genom e.fasta ) with 'nucmer' v3.1 using parameters "-g 500 -c 500 -l 10" as in [24]. Only segments with one-to-one alignments were retained; ancestral state at sites outside these segments was deemed ambiguous. The one-to-one segments were projected back into the 3D7 coordinate system. Under the assumption of no recurrent mutation, any site polymorphic in P. falciparum is not expected to also be mutated on the branch of the phylogeny leading to P. reichenowi. Thus, the allele observed in P. reichenowi is the ancestral state conditional on the site being polymorphic. Transitions-transversion (Ti:Tv) ratios and mutational spectra were tallied with 'bcftools stats' v1.19.

Analyses of ancestry and population structure
VQSR-passing sites were filtered more stringently for PCA to reduce artifacts due to rare alleles and missing data. Genotype calls with GQ < 20 or DP < 5 were masked; sites with < 10% missing data and PLMAF > 5% after sample-level filters were retained for PCA, which was performed with 'akt pca' v3905c48 [25]. For calculation of f 3 statistics, genotype calls with GQ < 10 or DP < 5 were masked; sites with < 10% missing data and PLMAF > 1% after sample-level filters were retained. Then f 3 statistics were calculated from WSAFs rather than nominal diploid genotype calls, using 'vcfdo f3stat' .

Estimation of sequence diversity
Estimates of sequence diversity and differentiation were obtained from the site-frequency spectrum (SFS), which in turn was estimated directly from genotype likelihoods with ' ANGSD' 0.921-11-g20b0655 [26] using parameters "-doCounts 1 -doSaf 1 -GL 2 -minD-epthInd 3 -maxDepthInd 2000 -minMapQ 20 -baq 1 -c 50. " Unfolded SFS were obtained with the ' ANGSD' tool 'realSFS' using the previously-described ancestral sequence from P. reichenowi. All isolates were treated as nominally diploid for purposes of estimating the SFS because systematic bias against mixed isolates was noted when using ' ANGSD' in haploid mode. Four-fold degenerate and zero-fold degenerate sites were defined for protein-coding genes in the usual fashion using transcript models from PlasmoDB v38. SFS for all sites, fourfold and zerofold degenerate sites were estimated separately in mainland Tanzania and Zanzibar isolates in nonoverlapping 100 kb bins across the core genome. Values of sequence diversity (theta_pi) and Tajima's D were estimated for these bin-wise SFS using 'sfspy summarize' (https ://githu b.com/IDEEL Resea rch/sfspy ), and confidence intervals obtained by nonparametric bootstrap. F st was calculated from the joint SFS between mainland Tanzania and Zanzibar. The distribution of local F st values was calculated in 5 kb bins for purposes of visualization only.

Strain deconvolution and inheritance-by-descent analyses
Complexity of infection (COI) and strain deconvolution (phasing) were performed jointly using 'dEploid' v0.6beta [14]. These analyses were limited to 125 isolates from mainland Tanzania and Zanzibar (57 new in this paper and 68 previously published). On the basis of the analyses shown in Figs. 1 and 2, these isolates appeared to constitute a reasonably homogeneous population, so the set of 125 was used for determination of PLAFs to be used as priors for the phasing algorithm. Phasing was performed using population allele frequencies as priors in the absence of an external reference panel known to be well-matched for ancestry. The analysis was further limited to very high-confidence sites: VQSLOD > 8, 75% of isolates having GQ ≥ 10 and DP ≥ 5, ≥ 10 bp from the nearest indel (in the raw callset), ≥ 10 total reads supporting the non-reference allele, and PLMAF ≥ 1%. The 'dEploid' algorithm was run in "-noPanel" mode with isolate-specific dispersion parameters ("-c") set to the median coverage in the core genome, and default parameters otherwise. Within-isolate IBD segments were extracted from the 'dEploid' HMM decodings by identifying runs of sites with probability ≥ 0.90 assigned to hidden states where at least two of the deconvoluted haplotypes were IBD. The total proportion of strain genomes shared IBD (within-isolate F IBD ) for isolates with COI > 1 was obtained directly from 'dEploid' log files, and agreed closely with the sum of within-isolate IBD segment lengths.
Between-isolate IBD segments were identified by applying 'refinedIBD' v12Jul18 [27] to the phased haplotypes produced by 'dEploid' . For a genetic map, a constant recombination rate of 6.44 × 10 −5 cM/bp (equal to the total genetic length of the P. falciparum map divided by the physical size of the autosomes in the 3D7 assembly) was assumed. Segments > 2 cM were retained for analysis. The proportion of the genome shared IBD between phased haplotypes (between-isolate F IBD ) was estimated by maximum likelihood described in [28] using 'vcfdo ibd' .

Demographic inference
Curves of recent historical effective population size were estimated from between-isolate IBD segments with 'IBDNe' v07May18-6a4 [29] using length threshold > 3 cM, 20 bootstrap replicates and default parameters otherwise. Local age-adjusted parasite prevalence point estimates (PfPR 2-10 ) and credible intervals were obtained from the Malaria Atlas Project [30] via the R package 'malariaAtlas' [31].
More remote population-size histories were estimated with 'smc++' v1.15.2 [32]. Phased haplotypes from 'dEploid' were randomly combined into diploids and parameters estimated separately for mainland Tanzania and Zanzibar populations using fivefold cross-validation via command 'smc++ cv' , with mutation rate set to 10 −9 bp −1 gen −1 . Marginal histories from each population were then used to estimate split times using 'smc++ split' .

Analyses of natural selection
The distribution of fitness effects (DFE) was estimated within mainland Tanzania and Zanzibar populations with 'polyDFE' v2.0 using fourfold degenerate sites as putatively-neutral and zerofold degenerate sites as putatively-selected [33]. "Model C" in 'polyDFE' parlance-a mixture of a gamma distribution on selection coefficients of deleterious mutations and an exponential distribution for beneficial mutations-was chosen because it does not require a priori definition of discrete bins for selection coefficients, and the gamma distribution can accommodate a broad range of shapes for the DFE of deleterious mutations (expected to represent the bulk of polymorphic sites). Confidence intervals for model parameters were obtained by nonparametric bootstrap via 20 rounds of resampling over the 100 kb blocks of the input SFS. Because 'poly-DFE' fits nuisance parameters for each bin in the SFS, computation time increased and numerical stability decreased for SFS with larger sample sizes. Input SFS were, therefore, smoothed and rescaled to pre-specified sample size of 10 chromosomes each using an empirical-Bayes-like method (https ://githu b.com/Cartw right Lab/SoFoS /) re-implemented in 'sfspy smooth' . Smoothing of input SFS had very modest qualitative effect on the resulting DFE.
The cross-population extended haplotype homozygosity (XP-EHH) statistic was used to identify candidate loci for local adaptation in mainland Tanzania or Zanzibar. Because the statistic requires phased haplotypes and is potentially sensitive to phase-switch errors, only isolates with COI = 1 were used (n = 18 mainland Tanzania, n = 12 Zanzibar). XP-EHH was calculated from haploid genotypes at a subset of 103,982 biallelic SNVs polymorphic among monoclonal isolates with the 'xpehhbin' utility of 'hapbin' v1.3.0-12-gdb383ad [34]. Raw values were standardized to have zero mean and unit variance; the resulting z-scores are known to have an approximately normal distribution [35] so nominal p-values were assigned from the standard normal distribution. The Benjamini-Hochberg method was used to adjust nominal p-values for multiple testing.
Pipelines used for WGS read alignment, variant calling, variant filtering, haplotype deconvolution and SFS

WGS and variant discovery
Genomic data for P. falciparum was generated using leukodepleted blood collected from 43 subjects from Yombo, Tanzania ("mainland") and from DBS collected from 63 subjects from the Zanzibar archipelago ("Zanzibar"; Fig. 1a) using selective whole-genome amplification (sWGA) followed by Illumina sequencing. Thirty-six isolates (84%) from the mainland and 21 isolates (33%) from Zanzibar yielded sufficient data for analysis. These 57 genomes were combined with an additional 68 published genomes from other sites in Tanzania in the Malaria-GEN P. falciparum Community Project (PfCP) and 179 genomes from other sites in Africa and Asia, representing a broad geographic sampling of Africa and Asia [36]. Single-nucleotide variants (SNVs) were ascertained jointly in the global cohort. After stringent quality control on 1.3 million putative variant sites, a total of 387,646 biallelic SNVs in the "core genome"-the 20.7 Mb of the 3D7 reference assembly lying outside hypervariable regions and accessible by short-read sequencing [22]were retained for further analysis. The frequency spectrum was dominated by rare alleles: 151,664 alleles (39.1%) were singletons and 310,951 (80.2%) were present in < 1% of isolates in the dataset. Ancestral and derived states at 361,049 sites (93.1%) were assigned by comparison to the P. reichenowi (CDC strain) genome, treating the reichenowi allele as ancestral. Similar biases were observed in the mutational spectrum as have been estimated directly from mutation-accumulation experiments [37]: transitions are more common transversions (Ti:Tv = 1.12; previous estimate 1. 13), with a large excess of G:C>A:T changes even after normalizing for sequence composition (Additional file 1: Fig. S1). Consistency in the mutational spectrum between independent studies, using different methods for sample preparation and different bioinformatics pipelines, supports the accuracy of genotype calls.

Ancestry of mainland Tanzania and Zanzibar isolates
In order to place new isolates in the context of global genetic variation in P. falciparum, principal components analysis (PCA) was performed with existing isolates from around the globe (Fig. 1b). A subset of 7122 stringently-filtered sites with PLMAF > 5% (see "Methods") were retained for PCA to minimize distortion of axes of genetic variation by rare alleles or missing data. Consistent with existing literature, isolates separated into three broad clusters corresponding to southeast Asia, east Africa and west Africa. Mainland Tanzania and Zanzibar isolates fell in the east Africa cluster. This observation was formalized using f 3 statistics [38,39], which measure shared genetic variation in a pair of focal populations A and B relative to an outgroup population O. By calculating f 3 across different combinations of comparator populations and holding the outgroup fixed, one can build up an idea of the ancestry of the populations of interest: pairs with relatively larger positive values of f 3 are more genetically similar than pairs with relatively smaller f 3 . The new isolates from Yombo and Zanzibar and published Tanzanian isolates shared mutually greater genetic affinity for each other than for other populations in the panel (Fig. 1c-e); isolates from neighbouring countries Malawi and Kenya were next-closest. Together these analyses support an east African origin for parasites in mainland Tanzania and in Zanzibar.

Genetic diversity and differentiation
In order to better understand the population demography and effects of natural selection in the parasite populations, indices of genetic diversity within populations, and the degree to which that diversity is shared across populations, were examined. The genome was partitioned into four sequence classes-all sites in the core genome; fourfold degenerate ("synonymous") sites; zerofold degenerate ("nonsynonymous") sites; and coding sites in genes associated with resistance to antimalarial drugs-and several estimators of sequence diversity were calculated in each class (see "Methods"). Levels of sequence diversity at synonymous (putatively neutral) sites were very similar within mainland Tanzania and Zanzibar isolates (theta_pi = 9.0 × 10 −4 [95% CI 8.6 × 10 −4 -9.4 × 10 −4 ] vs. 8.4 [95% CI 8.0 × 10 −4 -8.7 × 10 −4 per site) and 1.3fold lower than among previously-published Tanzanian isolates (Fig. 2a). As expected, diversity was lower at non-synonymous sites, which are more likely to be under purifying selection. Tajima's D took negative values in all three populations and across all sites classes (Fig. 2b); demographic explanations for this pattern are investigated later in the manuscript. Minimal evidence was found for differentiation between parasites in mainland Tanzania and Zanzibar. Genome-wide F st was just 0.0289 (95% bootstrap CI 0.0280-0.0297); the distribution of F st in 5 kb windows is shown in Fig. 2c. For comparison, genome-wide F st between southeast Asian and African isolates is on the order of 0.20 [23]. Thus there exists minimal evidence for genetic differentiation between parasites in mainland Tanzania and Zanzibar.

Patterns of relatedness and inbreeding
Long segments of the genome shared identical by descent (IBD)-that is, inherited intact from the same recent common ancestor-provide a powerful and fine-grained view of relationships in the recent past. Recent methodological innovations [14] allow estimation of complexity of infection (COI)-the number of distinct parasite strains in a single infection-and simultaneous deconvolution the component haplotypes. The F ws statistic, an index of within-host diversity that is conceptually similar to traditional inbreeding coefficients, was also calculated for comparison [23]. Approximately half of isolates had COI = 1 ("clonal") and half had COI > 1 ("polyclonal" or "mixed") in both populations, and the distribution of COI was similar between the mainland and Zanzibar (Chi squared = 0.27 on 2 df, p = 0.87; Additional file 1: Table S4). Ordinal trends in F ws were qualitatively consistent with COI but show marked variation for COI > 1 (Fig. 3a). Phased haplotypes were used to identify segments shared IBD between isolates and, in the case of mixed infections, within isolates. This revealed substantial relatedness between infecting lineages within mixed isolates (Fig. 3b): the median fraction of the genome shared IBD (F IBD ) within isolates was 0.22 among mainland and 0.24 among Zanzibar isolates, with no significant difference between populations (Wilcoxon rank-sum test, p = 0.19). The expected sharing is 0.50 for full siblings and 0.25 for half-siblings with unrelated parents [40]. F IBD was then estimated between all pairs of phased haplotypes. F IBD between pairs of isolates was then defined as the maximum over the values for all combinations of haplotypes inferred from the isolates (Fig. 3c). As expected, most pairs were effectively unrelated (median F IBD ≤ 0.001, on the boundary of the parameter space), but a substantial fraction were related at the level of halfsiblings or closer (F IBD > 0. 25,4.0% of all pairs), including 1.3% of mainland-Zanzibar pairs. Long segments of the genome are shared IBD both within and between isolates. Mean within-isolate segment length was 5.7 cM (95% CI 4.1-7.3 cM, n = 117) on the mainland and 3.7 cM (95% CI 2.8-4.6 cM, n = 80) on Zanzibar in a linear mixed model with individuallevel random effects; the full distributions are shown in Fig. 3d. Segments shared between isolates within the mainland population (6.2 cM, 95% CI 5.9-6.6 cM, n = 3279) were longer than segments shared within Zanzibar (4.5 cM, 95% 4.1-4.8 cM, n = 592) or between mainland and Zanzibar populations (4.1 cM, 95% CI 3.9-4.3 cM, n = 6506). After accounting for differences in segment length by population, difference in lengths of IBD segments detected between versus within individuals are not significant (mean difference − 0.038 cM, 95% CI − 0.10 to 0.023 cM). In a random-mating population the length of a segment shared IBD between a pair of individuals with last common ancestor G generations in the past is exponentially-distributed with mean 100/(2*G) cM. The shared haplotypes that observed, with length on the order of 5 cM, are thus consistent with shared ancestry in the past 10 generations-although as many as half of such segments probably date back at least 20 generations [41]. In the presence of inbreeding, IBD sharing persists even longer in time.
Close relationships between isolates from the archipelago and the mainland suggest recent genetic exchange. A threshold of F IBD > 0.25 (half-siblings) was chosen because it implies that two isolates shared at least one common parent in the last outcrossing generation and, therefore, are related as recently as the last 1-2 transmission cycles, depending on background population dynamics. In principle this could result from importation of either insect vectors or human hosts. To investigate the latter possibility, a travel-history questionnaire completed by subjects from Zanzibar was used. Nine subjects reported travel to the mainland in the month before study enrollment; their destinations are shown in Fig. 4a. Ten pairs with F IBD > 0.25 (marked by orange triangles in histogram in Fig. 4b) were identified; all involved a single Zanzibar isolate from a patient who travelled to the coastal town of Mtwara (orange arc in Fig. 4a). It is very likely that this individual represents an imported case. Overall, isolates from travellers had slightly higher mean pairwise relatedness to isolates from the mainland (mean F IBD = 0.0020, 95% CI 0.0018-0.0021) than did isolates from non-travellers (mean F IBD = 0.0015, 95% CI 0.0014-0.0016; Wilcoxon rank-sum test p = 1.8 × 10 −12 for difference). But these relationships-spanning 10 or more outcrossing generations-are far too remote to be attributed to the period covered by the travel questionnaire. The pattern likely represents instead the presence of subtle population structure within Zanzibar.

Demographic history of parasite populations
The distribution of IBD segment lengths carries information about the trajectory of effective population size in the recent past, up to a few hundred generations before the time of sampling. The site frequency spectrum and patterns of fine-scale linkage disequilibrium carry information about the more remote past. Complementary methods were used to infer recent and remote population demography from phased haplotypes. First, a non-parametric method was applied [29] to infer recent effective population size (N e ) from IBD segment lengths separately in mainland Tanzania and Zanzibar populations (Fig. 5a). The method infers a gradual decline of several orders of magnitude in N e over the past 100 generations to a nadir at N e ~ = 5000 around 15-20 outcrossing generations before the time of sampling. Although the confidence intervals are wide, similar trajectories are inferred in all  Fig. 3 Complexity of infection and patterns of within-and between-host relatedness. a The F ws index of within-host diversity, binned by complexity of infection (COI) estimated from genome-wide SNVs. Points coloured by population. b Distribution of within-host relatedness, measured as the proportion of the genome shared IBD (F IBD ) between strains, for isolates with COI > 1. Note that y-axis is on square-root scale. c Distribution of between-host relatedness, calculated from haplotype-level IBD. d Distribution of the length of segments shared IBD between (top) or within hosts (bottom). Segment lengths given in centimorgans (cM). Vertical lines mark 25th, 50th and 75th percentiles three populations (Zanzibar, new mainland Tanzania isolates and published Tanzanian isolates). Second, more remote population size histories were inferred jointly for mainland Tanzania and Zanzibar and used to estimate the split time between these populations using a sequentially Markovian coalescent method [32]. This family of models has good resolution for relatively remote events, but less precision in the recent past than models based on IBD segments. The result (Fig. 5b) supports a common ancestral population with N e ~ = 10 5 individuals that underwent a sharp bottleneck followed by rapid growth around 50,000 generations before the present. The time at which the mainland and Zanzibar populations diverged could not be estimated precisely and may have been as recent as 50 or as ancient as 50,000 generations before the present. Trends in N e were compared to local trends in parasite prevalence from the Malaria Atlas Project [30] (Fig. 5c). Assuming an interval of approximately 12 months per outcrossing generation [42], the contraction in N e may correspond in time to the decrease in prevalence brought about by infection-control measures over the past two decades.

Natural selection and adaptation
Finally, several approaches were taken to characterize the effects of natural selection on sequence variation in mainland and Zanzibar populations. The fate of a new mutation-whether it spreads and ultimately becomes fixed, or is lost-is determined by its selection coefficient (s), scaled by the effective population size (N e ). The distribution of fitness effects (DFE) describes the distribution of s and can be estimated from the frequency spectrum at putatively-neutral (synonymous) and putatively-selected (non-synonymous) sites (Fig. 6a). Building on previous work in other organisms, the DFE was modelled in each population as a mixture of a gamma distribution (for deleterious mutations, N e s < 0) and an exponential distribution (for beneficial mutations, N e s > 0) [33]. The inference was performed using both the raw SFS and a smoothed representation of the SFS that is more numerically stable and found that results to be similar with both methods. Fitted parameter values are provided in Additional file 1: Table S5, but the discretized representation of the DFE is more amenable to qualitative comparisons (Fig. 6b).
Differences in the DFE between mainland Tanzania and Zanzibar populations are not statistically significant. The great majority of new mutations (mainland: 74%; Zanzibar: 76%) are expected to be very weakly deleterious (− 0.01 < 4N e s < 0), and only a small minority are expected to be beneficial (4N e s > 0) (mainland: 4.5% [95% CI 2.7-29%]; Zanzibar: 2.4% [95% CI 0.56-50%]). The DFE also allows us to estimate that 8.8% (mainland) and 5.2% (Zanzibar) of substitutions since the common ancestor with P. reichenowi have been fixed by positive selection; this quantity is known in some contexts as the "rate of adaptive evolution. " Although the DFE tells us the proportion of polymorphic sites under positive selection, it does not pinpoint which sites those are. To identify signals of recent, population-specific positive selection, the XP-EHH statistic between mainland and Zanzibarian isolates were used [35]. Outliers in the XP-EHH scan, defined as standardized XP-EHH scores above the 99.9th percentile, represent candidates for local adaptation (Additional file 1: Fig. S2). One-hundred four biallelic SNPs in 20 distinct genes passed this threshold (Additional file 1: Table S6). None of these have been associated with resistance to anti-malarial drugs-an important form of local adaptation in this species-but one (PF3D7_0412300) has been identified in a previous selection scan [43]. Prevalences of 54 known drug-resistance alleles are shown in Additional file 1: Table S7 and are similar to previous reports in East Africa [44][45][46]. None of these loci had F st > 0.05 between mainland Tanzania and Zanzibar.

Discussion
Zanzibar has been the target of intensive malaria control interventions for nearly two decades following the early implementation of ACT therapies in 2003 [2]. Despite sustained vector control practices and broad access to rapid testing and effective treatment, malaria has not been eliminated from the archipelago [2]. Here WGS of P. falciparum isolates from Zanzibar and nearby sites on the mainland was used to investigate ancestry, population structure and transmission in local parasite populations. These data place Tanzanian parasites in a group of east African populations with broadly similar ancestry and level of sequence diversity. There was minimal genomewide signal of differentiation between mainland and Zanzibar isolates.
The most parsimonious explanation for these findings is a source-sink scenario, similar to a previous report in Namibia [47], in which importation of malaria from a region of high but heterogeneous transmission (the mainland) is inhibiting malaria elimination in a pre-elimination area (Zanzibar). Using WGS it is shown that the parasite population on the islands remains genetically almost indistinguishable from regions on the mainland of Tanzania. Numerous long haplotypes could be identified that are shared between the populations, on the order of 5 cM, suggesting that genetic exchange between the populations has occurred within the last 10-20 sexual generations. In addition, a Zanzibar isolate is identified that is related at the half-sibling level to a group of mutually-related mainland isolates. This likely represents an imported case and provides direct evidence for recent, and likely ongoing, genetic exchange between the archipelago and the mainland. These observations suggest that parasite movement from the mainland to the archipelago is appreciable and may be a significant hurdle to reaching elimination.
Human migration is critical in the spread of malaria [48], thus the most likely source for importation of parasites into Zanzibar is through human travel to high-risk malaria regions. Multiple studies have been conducted on travel patterns of Zanzibarian residents as it relates to importation of malaria [49][50][51], one of which estimated that there are 1.6 incoming infections per 1000 inhabitants per year. This is also in accordance with the estimate of about 1.5 imported new infections out of a total of 8 per 1000 inhabitants in a recent epidemiological study [2]. None of these studies have leveraged parasite population genetics to understand importation patterns. Though this study is small, the findings are proof of principle for using genetics to identify specific importation events. These data provides a platform for future genetic surveillance efforts by, for example, design of targeted assays for sequence variants that discriminate mainland from Zanzibari parasites. Such surveillance, including of asymptomatic individuals, would clarify the role of importation versus endemic transmission and potentially identify specific travel corridors to target for interventions. Larger sample sizes would also likely begin to reveal subtle population structure that is not obvious when examining a few dozen isolates.
Malarial infections in Africa are highly polyclonal. This within-host diversity poses technical challenges but also provides information on transmission dynamics. Approximately half of isolates from both the mainland and Zanzibar represent mixed infections (COI > 1), similar to estimates in Malawian parasites with similar ancestry [15]. It is clear that a widely-used heuristic index (F ws ) is qualitatively consistent with COI estimated by haplotype deconvolution [52], but has limited discriminatory power in the presence of related lineages in the same host. Furthermore, median within-host relatedness (F IBD ) is ~ 0.25, the expected level for half-siblings, in both mainland and Zanzibar populations. This strongly suggests frequent co-transmission of related parasites in both populations [40]. Estimates of F IBD are within the range of estimates from other African populations and add to growing evidence that mixed infections may be predominantly due to co-transmission rather than superinfection even in high-transmission settings [53,54]. An important caveat of this work is its dependence on statistical haplotype deconvolution. Direct comparison of statistical deconvolution to direct sequencing of single clones has shown that methods like 'dEploid' have limited accuracy for phasing the minority haplotype(s) in a mixed infection. Phasing errors tend to limit power to detect IBD between infections, and may cause underestimation of between-host relatedness.
Intensive malaria surveillance over the past several decades provides an opportunity to compare observed epidemiological trends to parasite demographic histories estimated from contemporary genetic data. Estimates of historical effective population size (N e ) support an ancestral population of approximately 10 5 individuals that grew rapidly around 10 4 generations ago, then underwent sharp contraction within the past 100 generations to a nadir around 10-20 generations before the present. Stable estimates of the split time between the mainland and Zanzibar populations could not be obtained, either with a coalescent-based method (Fig. 5b) or with method based on the diffusion approximation to the Wright-Fisher process [55]. This is not surprising given that the shape of joint site frequency spectrum (Additional file 1: Fig. S3), summarized in low F st genome-wide, is consistent with near-panmixia. The timing and strength of the recent bottleneck appears similar in mainland Tanzania and Zanzibar isolates and coincides with a decline in the prevalence of parasitemia. However, it should be remembered that the relationship between genetic and census population size-for which prevalence is a proxy-is complex, and other explanations may exist for the observed trends.
Finally, this paper makes the first estimates of the distribution of fitness effects (DFE) in P. falciparum. Although the impact of selection on genetic diversity in this species has long been of interest in the field, previous work has tended to focus on positive selection associated with resistance to disease-control interventions.
The DFE is a more fundamental construct that has wide-ranging consequences for the evolutionary trajectory of a population and the genetic architecture of phenotypic variation [56]. Purifying selection is pervasive, but most new alleles (~ 75%) are expected to have sufficiently small selection coefficients that their fate will be governed by drift. The proportion of new mutations expected to be beneficial-the "target size" for adaption-is small, on the order 1-2%. Together these observations imply that even in the presence of ongoing human interventions, patterns of genetic variation in the Tanzanian parasite population are largely the result of drift and purifying selection rather than positive selection. It should be noted that these conclusions are based on the core genome and may not hold for hypervariable loci thought to be under strong selection such as erythrocyte surface antigens. Furthermore, the complex lifecycle of Plasmodium species also departs in important ways from the assumptions of classical population-genetic models [57]. The qualitative impact of these departures conclusions is hard to determine.

Conclusion
The elimination of malaria from Zanzibar has been a goal for many years. This paper pesents genomic evidence of continued recent importation of P. falciparum from mainland Tanzania to the archipelago. Reducing this importation is likely to be an important component of reaching elimination. Investigation of approaches to limit importation, such as screening of travellers or mass drug treatment, is needed. However, the high degree of connectivity between the mainland and the Zanzibar archipelago will make this challenging. It is encouraging that parasite populations in the region appear to be contracting (Fig. 5). These declines are likely due to decreasing transmission but nonetheless need to be interpreted with caution, as they may also be due to other factors that impact effective population size estimates, including violation of model assumptions. The data suggests that larger studies of the relationship between Zanzibarian and mainland parasites will enable further more precise estimates of corridors of importation based on parasite genetics. Genomic epidemiology has the potential to supplement traditional epidemiologic studies in Zanzibar and to aid efforts to achieve malaria elimination on the archipelago.
Additional file 1. Additional figures and tables.