Exceptionally long-range haplotypes in Plasmodium falciparum chromosome 6 maintained in an endemic African population

Background Previous genome-wide analyses of single nucleotide variation in Plasmodium falciparum identified evidence of an extended haplotype region on chromosome 6 in West Africa, suggesting recent positive selection. Such a pattern is not seen in samples from East Africa or South East Asia, so it could be marking a selective process specific to West Africa. Analyses of the haplotype structure in samples taken at different times could give clues to possible causes of selection. Methods This study investigates chromosome 6 extended haplotypes in The Gambia by analysing alleles at multiple microsatellite loci using genome sequence data previously obtained from clinical isolates collected in 2008, followed by genotyping of 13 loci in 439 isolates from 1984, 1991, 2008 and 2014. Temporal changes in haplotype structure and frequencies were determined. Results A region of high linkage disequilibrium spanning over 170 kilobases (kb) was identified with both NGS and laboratory determined microsatellite alleles. Multiple long haplotypes were found in all temporal populations from The Gambia. Two of the haplotypes were detected in samples from 1984 and 1991. The frequency of long-range haplotypes increased in 2008 and 2014 populations. There was higher Fst between older and more recent populations at loci in proximity to genes involved in drug metabolism pathways. Conclusions The occurrence of several long haplotypes at intermediate frequencies suggests an unusual mode of selection in chromosome 6, possibly combined with recombination suppression on specific haplotypes. Such selection apparently occurred before the emergence of known anti-malarial drug resistance alleles, and could be due to effects of other drugs or unknown processes that have long been operating in this endemic region. Electronic supplementary material The online version of this article (doi:10.1186/s12936-016-1560-7) contains supplementary material, which is available to authorized users.


Background
Adaptation of malaria parasites to various challenges, including human immune responses, drugs and other interventions has left genetic signatures in the Plasmodium falciparum genome [1]. These can be detected with polymorphic genetic markers showing reduced variability, increased linkage disequilibrium, extended haplotypes, excess of rare alleles or high frequencies of derived alleles at loci close to the target of selection [2,3]. Anti-malarial drugs in particular exert a strong positive selection on the genome and several of the strongest signatures of directional selection are linked to drug resistance genes [4,5]. Apart from these, a number of other P. falciparum genomic loci show evidence of positive selection in endemic populations [6,7]. Some signatures of selection are apparent only in specific endemic populations, including an extensive selective sweep on chromosome 6 that was first described by analysing single nucleotide polymorphisms (SNPs) in parasites from Senegal and The Gambia in West Africa [6]. The evidence of extended haplotype homozygosity in chromosome 6 appears strong in West African parasite populations but weak or absent elsewhere [6,[8][9][10][11]. The size of the chromosomal region affected may be up to ~300 kb, and none of the ~60 genes in this region is a known direct target of selection from anti-malarial drugs. This result was obtained with SNP markers, but analysis of microsatellites may offer additional information due to their higher levels of polymorphism [12][13][14]. Microsatellite approaches are also attractive here because they require technology that is more accessible to low and medium income country labs. This allows for local targeted analysis of sweeps following initial scans with high density SNPs, which requires computational capacity that is limited in most of sub-Saharan Africa.
This study focuses on characterising the structure of P. falciparum chromosome 6 extended haplotypes around the selective signature in The Gambian population. This population benefits from prior knowledge of selective signatures from SNP studies, availability of whole genome short read sequence data for mining other polymorphisms such as microsatellites, and archived samples from the period prior to widespread chloroquine resistance. Microsatellite polymorphisms within the chromosome 6 region were targeted for this temporal analysis thanks to their abundance and sensitivity in determining haplotype structure around selective sweeps [14,15]. The analysis of sequence data enabled the design of new laboratory genotyping assays, which focused on a set of 13 microsatellite loci across the region of extended haplotype structure.

Plasmodium falciparum isolates
This study analysed four sets of P. falciparum isolates from the Gambia. The first two comprises 56 and 67 isolates collected in 1984 and 1991 respectively around Farafenni, located in the middle of the country [4,26]. During this period, chloroquine was still efficacious against most infections and resistance alleles were rare. The third population includes 166 isolates from the Greater Banjul area, in the west of the country, collected in 2008 when malaria transmission had decreased substantially and the first line treatment for malaria had been changed from sulfadoxine-pyrimethamine (SP) to artemether-lumefantrine (AL) [27]. The fourth was a recent population sample of 150 isolates collected in 2014 transmission season from the Greater Banjul area in the West, as well as Basse in the east of the country. Of the isolates collected in 2008, 76 had whole genome short sequence reads for mining microsatellite polymorphisms [16]. Single nucleotide polymorphisms and signatures of selection in this population has been described in a number of previous studies [4,7,16].
Microsatellite scoring across P. falciparum chromosome 6 using Illumina short-read sequence data Microsatellite markers across chromosome 6 of P. falciparum were determined from Illumina generated whole genome short sequence read data of the 68 isolates collected in 2008 as described previously [28]. For microsatellite identification and genotyping, short sequence reads were mapped against the P. falciparum 3D7 reference chromosome 6 sequences (version3, October 2012 release) to generate sequence alignment files (Bam) as previously described. Bam files were then screened for microsatellites against a reference library indexing all possible microsatellite repeats in the Pf3D7 reference chromosome 6 sequence created with the tandem repeat finder software (trf ) and SciRoKo for perfect repeats [29,30]. Algorithms for calling microsatellite alleles are as described for the programs RepeatSeq and Genotan [31,32]. Quality settings for calling microsatellite genotypes incorporated a short read quality score of Q30 and read coverage across the microsatellite of 5X spanning a unique region bordering both sides of the locus. Allele calls were corrected to lengths reflective of the repeat unit assuming a stepwise mutation model in Microsatellite Analyzer package [33]. Perfect repeat tracts with number of repeats from 14 to 63 were chosen for further analysis. The final dataset employed included loci that were polymorphic and with less than 30 % missing calls. The objective was to identify a set of polymorphic markers from Illumina short sequence reads and select a subset for further analysis by PCR fragment sizing following amplification of DNA from the target populations.

Selected chromosome 6 microsatellite loci for laboratory genotyping
Microsatellite loci employed in retrospective analysis of chromosome 6 selective sweep were selected from polymorphic loci determined from short read sequence data analysed for chromosome 6 as described above. These loci were perfect microsatellites, having at least two alleles in the Gambian population and included di-to hexa-nucleotide repeats. These were physically spaced at an average distance of ~15 kb across a 179.5 kb region, between positions 1069 and 1249 kb of the chromosome. Primer pairs were designed automatically for flanking sequences of each unique microsatellite locus with Primer3 software implemented in BatchPrimer using the Pf3D7 reference sequence as a template (Additional file 1: Targeted_13_ssr_stats). Primers for amplification and fragment labelling with 6FAM, TET, HEX and TAMRA dyes were commercially synthesized (Metabion).

PCR amplification and capillary electrophoresis
For each sample, DNA was obtained from whole blood aliquot using the QiAmp DNA extraction kit. Amplification of targeted microsatellite loci followed a two round PCR reaction in which the first round was a multiplex for up to 3 loci in a 10 µL reaction mix with 0.2 µM of each outer primer pair, 1 µL of DNA and 1X PCR multiplex mix (Qiagen). Each amplification batch included the reference P. falciparum 3D7 DNA as a positive control and nuclease free water as a negative control. Thermocycling was achieved with a touchdown PCR protocol on a Q-cycler (Quantarus); initial denaturation at 95 °C for 5 min, touchdown from 65 to 55 °C at 1 °C/cycle followed by 25 cycles of 95 °C for 30 s, 54 °C for 1 min, 72 °C for 30 s. Final elongation was at 68 °C for 30 min. First round PCR reactions were used immediately for second rounded amplification with labelled primers or stored at 4 °C until needed. Second round PCR reactions were run separately for each locus using 1 µL of first round PCR product, 150 nM of each primer and 1X of MyTaq amplification mix (Bioline) in a 10 µL reaction. Amplification conditions were initial denaturation at 95 °C for 30 s, 25 cycles of 95 °C for 30 s, 54 °C for 30 s, 68 °C for 30 s. Final elongation was at 68 °C for 20 min. For capillary electrophoresis, round 2 PCR products were constituted at 1:50 dilution in 10 µL assay mix of HIDI formamide with GeneScan ™ 600LIZ ® size standard. This was denatured by heating at 95 °C for 3 min and immediately cooled on ice. Labelled microsatellite fragments were separated on 3130xL DNA analyser. DNA fragment analysis traces for loci were processed in GeneMapper 4.1 and allele sizes were called using GeneMarker 1.1. Allele sizes were checked for error and binned using Tandem2 software, which corrects for deviations in expected fragment size based on repeat unit length of the target microsatellite. Binned genotypes were employed in determining pairwise haplotype frequencies and Linkage disequilibrium between contiguous loci using PowerMarker Version 3.5 and Midas. Fragment sizes for SSR1, SSR9 and SSR15 were correlated with illumina determined repeat length genotypes (Additional file 2).

Population genetic analysis of chromosome 6 microsatellite loci
The allele frequencies, allelic richness, variability, expected heterozygosity, and Weir and Cockerham's Fst for binned microsatellite were calculated with the heirfstat R package. Virtual heterozygosity (HE) was used to measure the overall genetic diversity at each locus. It is defined as, n/(n − 1)(1 − Σpi 2 ), where n is the number of isolates analysed and pi is the frequency of the i-th allele in the population. To test which microsatellite loci showed population variation that deviated from neutral expectations, simulations of the distribution of pairwise Fst coefficient were carried out in Bayescan 2.1. Bayescan uses posterior probabilities to control for False Discovery rates (FDR), which is the proportion of false positives among candidate outlier loci. Simulations were performed at default settings; 100,000 iterations, 5000 pilot runs and 50,000 burn-in length. P values (one-sided) were estimated from 10,000 permutations of population assuming a panmictic null hypothesis. The FDR expressed as q value (FDR analogue of p value defined under multiple testing) was plotted against Fst per selected loci using Bayescan plot code in R to detect and list outlier loci at a 5 % FDR. Unique multilocus genotypes were determined using R allele Match unique package with a maximum mismatch setting at 3 out of 13 loci following mismatch and multiple match minimization. Partially matched haplotypes were further analysed with Haplotype Analysis 4.05. Visualization of the evolution of frequency of the common long range haplotype was done with the REHH package in R. For this, the loci with the lowest and highest Fst values within the window of high linkage disequilibrium were chosen as focal sites for determining the extent of haplotypes around them. To allow for REHH analysis, the fragment sizes were converted into a binary format in which '2' represented the dominant allele while '1' was for any other fragment length at this locus. Analysis and plotting were as described in the REHH manual.

Diversity of chromosome 6 microsatellite loci derived from short sequence reads of P. falciparum
The number of repeats for 775 polymorphic microsatellite loci in chromosome 6 of P. falciparum were successfully scored from Illumina short read sequences of 54 out of 68 clinical isolates collected in 2008. The mean virtual heterozygosity for the 13 microsatellite loci was 0.47. Pairwise linkage disequilibrium, expressed as mean r 2 values between all pairs of loci within a window of 10 kb containing at least 5 loci, ranged from 0 to 0.3. The values were significantly higher towards the 3′ end of the chromosome, mapping between positions 1100-1300 kb with a peak at ~1200 kb (Fig. 1a). Mean pairwise haplotype frequencies between loci in 10 kb windows across the chromosome showed a narrower peak of ~100 kb (1150-1250 kb) at this region on the chromosome (Additional file 3).
Long-range haplotypes spanning 184 kb were identified with microsatellite genotypes for 28 polymorphic loci across the region of peak LD for 54 isolates (Fig. 1b). These long-range haplotypes shared by at least 2-8 isolates were represented in 28 isolates (50 %). Following allele matching and haplotype extension from 12 loci  Tables-sheet: Fst-13Loci). Following Bayesian simulation of pairwise population Fst distribution at each locus, seven (SSR1, SSR3, SSR5, SSR7, SSR11, SSR13 and SSR19) out of the selected 13 loci had a pattern of variation between populations that deviated from neutral expectations (Additional file 4a). Of these, SSR13 (Pf3D7_06_v3: 1135849) had the strongest signature of deviation from neutrality. This locus is located within the Pf3D7_0628100 sequence, which codes for 6-pyruvoyltetrahydropterin synthase (PTPS). The other six loci had lower Fst and high q-values. Three loci: SSR19 (Pf3D7_06_v3:1188628), SSR21 (Pf3D7_06_v3:1207757) and SSR23 (Pf3D7_06_v3:1228301), also showed significant deviation of Fst and heterozygosity than expected under neutrality (Additional file 4b). These loci span the region of peak LD detected (Fig. 2a). There are 11 genes encoded in the region with highest LD and Fsts, which includes 3 transferases, a transportin, a transporter, a synthetase, and a synthase ( Fig. 2c; Additional file 1: Tables-sheet: Genes_in_Sweep). The entire 179.5 kb region analysed with 13 microsatellite loci across temporal populations, codes for 36 genes, which are involve in a variety of metabolic processes (Additional file 1: Tablessheet: Genes_in_Sweep).

Shared long-range haplotypes
With complete matching at all 13 loci, 243 (55.4 %) isolates had unique haplotypes while 196 were shared by at least 2 isolates from the four temporal populations. Of the latter, 18 isolates shared the longest uninterrupted haplotype only from the 1991, 2008 and 2014 populations (Fig. 3). Recombinant and partial forms of this major haplotype were found in 39 other isolates from the same populations. The structure of this long-range haplotype was most conserved at five contiguous loci (SSR13-SSR21) spanning 71.9 kb (1,135,849-1,207,757 bp), (Fig. 3). With complete matching at all five loci within this conserved window, 4 (5.9 %), 28 (16.9 %) and 25 (16.6 %) isolates from 1991, 2008 and 2014 respectively had this haplotype group. This represents 13 % of all isolates from these populations (Fig. 3). There were 40 other haplotypes shared by 2-13 isolates from the four populations. Three of these haplotypes were present in the 1984 population of which 2 were seen in single isolates from 1991 and 2008 (Table 1). Nine haplotypes from 1991 were present at higher frequencies in 2008 and 2014.
The frequency of shared haplotypes was higher in later years; 8.9, 19.4, 60.8 and 51. 3 % for 1984, 1991, 2008 and 2014 respectively. Using loci with peak LD as focus (SSR13 and SSR21), two long-range LD tracks of ~100 kb each could be distinguished (Additional file 5). The block determined by SSR13 was present in the earliest population sample from 1984 (Additional file 6). The types and the frequencies of these haplotypes increased in the more recent populations from 2008 and 2014. The haplotypes defined by SSR21 were dominant from 2008 and maintained a similar structure in 2014.

Discussion
This is the first study describing extended haplotypes of microsatellite loci within a chromosome 6 region of P. falciparum previously determined with SNPs to be under selection. The microsatellite alleles derived from Illumina short read sequence data of clinical isolates collected in 2008 detected a 200 kb window of significantly elevated linkage disequilibrium that overlaps with the previously determined selective signature on the chromosome [6,7,16]. It shows microsatellites and linkage disequilibrium to be sensitive in detecting regions under selection as reported previously [14,17]. The use of microsatellites as markers has the potential to focus on very recent events. In contrast to SNPs, Higher mutation rates will also provide sufficient variation in closely related populations, and would reveal only recent selective sweeps. As our focus here was over a  (2008). These loci were highly diverse in all temporal P. falciparum populations genotyped. Similar to genomic loci from Illumina short reads, we detected strong linkage disequilibrium and several extended haplotypes. Sweeps lead to increase in LD when they are still in progress [18]. Thus, selection could have started as early as 1984 and continued into recent populations. This is also supported by increase in the frequencies of long-range haplotypes in the region from 1984 to recent populations. However, for the most common combination of multilocus alleles (from 1,135,849-1,207,757 bp), only two haplotypes found in the earliest population (1984) were present in single isolates from 1991 and 2008. On the other hand, other haplotypes upstream of this region were already at high frequencies in 1984. This would suggest selection on more than one locus or more than one event occurring prior to the 80s and between 80s and 90s. Antifolate anti-malarials were not officially used earlier than the year 2002 in the Gambia and so selection around these earlier years cannot be due to SP. The results here are against suggestions that anti-malarial SP might be the driver of selection in this region [5].
Finding a variety of long-range haplotypes at low to moderate frequencies in all temporal populations is contrary to the classical outcome of hard selective sweeps, where one haplotype carrying the favourable mutation rises in frequency to fixation [19]. With 36 genes coded for by the region with the selective signature, selection could be polygenic or both diversifying and directional selection may be acting on different loci in the region. While identifying the exact loci being selected remain beyond this current analysis, three genes within the region might be under non-neutral variation from previous SNP analysis and the analysis herein. The first gene is the acetyl-CoA synthetase (PF3D7_0627800) which was predicted as being under balancing selection [16]. Balancing selection increases diversity around a selected Long-term maintenance of such polymorphisms and reduction in recombination would increase LD in neutral neighbouring loci around the focus of selection. The second gene in the region is 6-pyruvoyltetrahydropterin synthase, PTPS (PF3D7_0628000) which has a direct role in pterin metabolism and folate salvage pathways in P. falciparum [20]. Folate salvage is a major mechanism of SP resistance. The increase in complete long-range haplotype frequencies in the 2008 population provides support to selection from sulfonamides. However, haplotypes around the PTPS gene that is involved in sulfonamide metabolic processes seemed to have emerged earlier than 2002. Since SP was not in use when the earlier populations here were collected, selection of such pathways might be as a result of non-antimalarial sulfonamides, such as sulfamethoxazole/trimethoprim (cotrimoxazole), that have been used across Africa for a long time. Thirdly there is a phospholipase A2, PF3D7_0629300 (phosphatidylcholine-sterol acyltransferase, putative gene), which by Bayesian analysis of Fst distribution of ssr loci, may be under selection. The repeat locus within this gene showed the highest Fst and LD. Phospholipase activity increases membrane permeability and stimulation of many enzymes associated with processes that disrupt cytoskeleton and membrane structure [21]. They are also strongly inhibited by quinoline anti-malarials [22,23]. Hence, the lipase may also be under directional selection from quinolines used in the 1990s. Chloroquine in particular was the most widely used quinolines against malaria before the 90 s and could have been driving selection. Antifolate SP replaced chlorquine from 2002 to 2008.
There was increase in frequencies of partial forms of haplotypes in the 2014 population, 6 and 12 years after the withdrawal of SP and chloroquine respectively as first line treatment. However, SP is still in use for intermittent preventive treatment and is now a component of seasonal malaria chemoprophylaxis [24]. The non-antimalarials antibiotics cotrimoxazole and derived quinolines were introduced in clinical practice in the 1960s while most quinoline antibiotics came into use in the 1980s. Cotrimoxazole in particular has been shown to be efficacious against malaria and thus might have been a source of selection on folate metabolism pathways for over half a century. If selection is acting on the phospholipase, due to quinolones, this could also explain the early origin of long-range haplotypes. Moreover, anti-malarial quinolines (e.g., quinine) have been in use against malaria for over is 400 years [25]. With SP being a component of SMC, increased frequencies of the sweep haplotypes in future analysis of The first column shows the haplotypes defined by a contiguous string of PCR fragment sizes in base pairs for five neighbouring loci within this region. The following columns show the number of isolates from each year with the haplotype. The last column presents the frequencies for the entire population SMC treated populations could confirm its role in selection. Meanwhile, analysis of populations with the earliest records of quinolone resistance such as Latin America could help resolve the involvement of quinolones as a selective force on this genomic region. With increasing accumulation of genomic data, this will be possible in the near future.

Conclusions
This study has observed multiple long-range haplotypes of a region on P. falciparum chromosome 6 maintained over more than two decades in The Gambia. The identity of the target of selection remains unknown as the narrowest region implicated contains 24 genes involved in various metabolic processes. It is unlikely to be a result of anti-malarial drug selection only, as extended haplotypes existed before the widespread use of antifolates for malaria treatment, and before chloroquine resistance alleles had become locally common. It is possible that off-target drug action from antibiotics may have been responsible, or some completely different process may be maintaining extended haplotypes in this region of the parasite genome.