Plasmodium copy number variation scan: gene copy numbers evaluation in haploid genomes
© Beghain et al. 2016
Received: 8 December 2015
Accepted: 31 March 2016
Published: 12 April 2016
In eukaryotic genomes, deletion or amplification rates have been estimated to be a thousand more frequent than single nucleotide variation. In Plasmodium falciparum, relatively few transcription factors have been identified, and the regulation of transcription is seemingly largely influenced by gene amplification events. Thus copy number variation (CNV) is a major mechanism enabling parasite genomes to adapt to new environmental changes.
Currently, the detection of CNVs is based on quantitative PCR (qPCR), which is significantly limited by the relatively small number of genes that can be analysed at any one time. Technological advances that facilitate whole-genome sequencing, such as next generation sequencing (NGS) enable deeper analyses of the genomic variation to be performed. Because the characteristics of Plasmodium CNVs need special consideration in algorithms and strategies for which classical CNV detection programs are not suited a dedicated algorithm to detect CNVs across the entire exome of P. falciparum was developed. This algorithm is based on a custom read depth strategy through NGS data and called PlasmoCNVScan.
The analysis of CNV identification on three genes known to have different levels of amplification and which are located either in the nuclear, apicoplast or mitochondrial genomes is presented. The results are correlated with the qPCR experiments, usually used for identification of locus specific amplification/deletion.
This tool will facilitate the study of P. falciparum genomic adaptation in response to ecological changes: drug pressure, decreased transmission, reduction of the parasite population size (transition to pre-elimination endemic area).
KeywordsMalaria Anti-malarial drug resistance Copy number variation Bioinformatics
The burden of malaria has decreased by half over the last decade. This is a direct consequence of effectives strategies mainly focused on vector control (long-lasting impregnated bed nets) and the management of suspect malaria cases (early diagnosis by rapid diagnostic tests and effective and prompt treatment with artemisinin-based combination therapy). As a consequence, a drastic decrease in Plasmodium falciparum population biomass in many countries has been observed . This new epidemiological situation has led to a change in the environment within which the parasite finds itself and will thus alter the selective pressures on parasite populations.
Natural evolution of malaria parasites generates an enormous amount of genetic diversity either linked with copy number variations (CNVs), or acquisition of new single nucleotide variations (SNVs) and their accumulation over time . This allows parasites to acquire a high capacity of adaptation to the environmental shifts and develop anti-malarial drug resistance. Indeed, SNVs are known to be at the origin of resistance to anti-malarial drugs, such as chloroquine, sulfadoxine, pyrimethamine, atovaquone, artemisinin, and mdr1 gene amplification is known to be at the origin of mefloquine resistance [3–5].
In eukaryotic genomes, SNP mutation rates occur at a rate of ~10−8 per generation and deletion or amplification rates have been estimated to be in the order of ~10−4 per generation [6, 7]. The number of P. falciparum parasites infecting an adult can be estimated from 5 to 50 billion parasites (103–104 parasites per μL of blood with a total of 5 l of blood). Because asexual replication occurs every 48 h, the erythrocytic stage of P. falciparum, therefore, appears to be a breeding ground for any selection pressure to act on parasite population. Although the regulation of gene expression in P. falciparum is still incompletely understood, relatively few transcription factors have been identified [8, 9] and the regulation of transcription is seemingly largely influenced by gene amplification events. Thus CNV is a major mechanism enabling parasite genomes to adapt to new environmental changes.
Currently, the detection of CNVs is based on quantitative PCR (qPCR), which is significantly limited by the relatively small number of genes that can be analyzed at any one time, and the fact that endogenous controls (e.g., housekeeping genes) can introduce bias into the results if not properly chosen . Technological advances that facilitate whole-genome sequencing such as Next Generation Sequencing (NGS) enable deeper analyses of the genomic variation to be performed. Because the characteristics of Plasmodium CNVs need special consideration in algorithms and strategies for which classical CNV detection programs are not suited, a dedicated algorithm to detect CNVs across the entire exome of P. falciparum based on a custom read depth strategy through NGS data was developed. This algorithm was named PlasmoCNVScan.
This study analysed CNV on three genes known to have different level of amplification and which are located either in the nuclear, apicoplast or mitochondrial genomes. The results showed a correlation between PlasmoCNVscan and the qPCR experiments, usually used for identification of locus specific amplification/deletion. The use of such a tool for the exploration of adaptive phenomena based on whole genome data is then discussed.
Real time PCR and whole genome analysis were carried out on the same DNA extracted from samples of P. falciparum collected in Cambodia between 2010 and 2014 and adapted to culture. DNA extraction was performed using QIAamp DNA Blood Kit (Qiagen ©).
The protocol for qPCR copy number evaluation used in this study was based on the WWARN (MOL-05) procedure: “Copy number estimation of P. falciparum pfmdr1 v1.1''. Relative quantification was performed by using “PCR Applied Biosystem ViiA 7®” and the Taqman® technologies (Thermo Fisher©).
The primers and probe used for copy number quantification
5′ TGCATCTATAAAACGATCAGACAAA 3′
5′ TCGTGTGTTCCATGTGACTGT 3′
5′ (FAM)-TTTAATAACCCTGATCGAAATGGAACCTTTG-(TAMRA) 3′
5′ (TET)-TAGCACATGCCGTTAAATATCTTCCATGTCT-(TAMRA) 3′
All samples were analysed in triplicate. The confidence intervals on measures must be superior to 95 % for one triplicate and the Z-score, designating the deviation from a normal distribution, must be inferior to 1.75 (Life Technologies Corporation, 2011). All the samples results that did not meet these criteria were removed from the final results.
Whole-genome sequencing was performed on parasite DNA from Cambodian parasite isolates, using an Illumina paired-reads sequencing technology, as previously described .
Read depth-based methods have recently become a major approach for estimating copy number . The underlying concept of RD-based methods is that the depth of coverage in a genomic region is correlated with the copy number of the region; e.g., a lower than expected depth of coverage intensity indicates deletion and a higher than expected depth of coverage intensity indicates amplification . The algorithm in classical RD-based methods relies heavily on the assumption that the sequencing process is uniform, i.e., the number of reads mapped to a region is assumed to follow a Poisson distribution and is proportional to the number of copies . However, due to the GC content and “mapability”, this assumption is for the most part unrealistic. Moreover, the uneven representation of genomic regions in library preparation due to variability in DNA fragmentation may induce a bias .
Firstly, the average frequency for each motif found across the whole exome was computed: this is the theoretical coverage for a motif. The observed coverage is the local coverage for a motif for each position (extracted from pileup file). Then, for each gene, using a sliding window, the ratio between observed coverage and theoretical coverage for each gene/position was computed. This ratio gives the estimated copy number variation for this region.
The algorithm was implemented in homemade software in C language called PlasmoCNVScan. PlasmoCNVScan use the external libraries gbfp  under GNU GPL v2 licence and utash.under the revised BSD licence.
Optimising the size of the sequence length used for the motif
PlasmoCNVScan versus benchmark softwares
The dataset was tested for pfmdr1 gene with two programs for detecting copy number variation using next generation sequencing data. CNV-seq , which is widely used software in case–control studies, and CNVnator , which uses a similar approach to calculate RD signal and correct the GC-bias.
The qPCR results were considered as reference and the Pearson test was used to calculate the measure of the linear correlation (dependence) between the two variables qPCR and PlasmoCNVScan or CNVnator software, giving a value between +1 and −1 inclusive, where 1 is total positive correlation.
PlasmoCNVScan vs CNV-seq and CNVnator
As CNV-seq method is conceptually derived from array comparative genomic hybridization (aCGH), two sets of reads mapped onto the same reference genome from the same flow cell is needed. CNV-seq fails to detect CNV on all isolates, because 3D7, used as a reference, has been sequenced on a different flow cell to the other isolates. To avoid this problem, there is a need to include a reference isolate in each of the flow cells used, which becomes prohibitively expensive.
CNVnator is able to discover CNVs in a vast range of sizes, from a few hundred bases to megabases in length for a single genome. The correction of GC-bias is based under the observation that the RD signal and GC content are correlated. Strikingly, CNVnator had a lower correlation with qPCR than PlasmoCNVScan (R2 = 0.65, N = 42 Fig. 3).
The overall (A + T) composition is 80.6 % in the P. falciparum genome and increases to ~90 % in introns and intergenic regions , resulting in very high similarity among non-coding regions. This introduces an important bias for CNV identification using NGS data. In coding regions, the GC content is higher and the coverage is likely to be higher and more specific. This heterogeneity in the GC content between coding and non-coding sequences led us to compute the average coverage for exons only.
For computing the CNV on a single sample using PlasmoCNVScan, only a BAM file (which is converted in pileup file) is necessary, along with the reference genome (fasta file) and a gff file. Given mapped reads, the efficient implementation of PlasmoCNVScan allowed a non IT specialist to perform whole-exome analysis of P. falciparum within a few minutes on a single 3.3-GHz Intel Core 3 Duo CPU. The RD signal is normalized with the genome itself. PlasmoCNVScan is thus able to compare different CNV exomes from different experiments. The results show very good correlation with the qPCR results, with R2 value above 0.8 for all the three genes explored irrespective of the CNV range (from 1 to 30 in the case of cytochrome b mitochondrial gene).
However when working with polyclonal infections, which is a very common situation in Africa, the same problem may arise in the case of mixed infections with different parasites that do not possess the same CNV profile. In this case the qPCR will be of no help.
The aim of this study was to test the ability of the algorithm to calculate the CNVs based on a whole genome sequencing with small reads (FASTQ). Thus the authors chose to work on clonal isolates directly isolated from the field (not reference strains). The Cambodian isolates were previously culture adapted (only for several cycles) before DNA extraction, likely leading to the removal of minor clones. The exome approach generates even more accurate data because of the higher GC content of the coding regions than in the intergenic regions, and, of course, expressed genes have much less similarity among them.
The strong correlation found between classical qPCR and PlasmoCNVScan opens the way for a systematic screening of CNVs changes on whole exomes. The global analysis of changes in the P. falciparum exome CNVs is beyond the scope of this article, but it is hoped that PlasmoCNVScan can be a useful tool to explore P. falciparum genomic adaptation in the face of ecological changes: drug pressure, decreased transmission, reduction of the parasite population size (transition to pre-elimination endemic area).
JB carried out the PlasmoCNVScan algorithm and software, AL, NK, VD and EL carried out the qPCR experiments. LG and RP carried out the statistical analysis of the algorithm. DM and FA supervised, carried out and coordinated the field collections and cultures of parasites. LM, BW and CB carried out the whole genome sequencing by Illumina method. All the authors read and approved the final manuscript.
This study benefited from Whorld Health Organisation within the Karma project. This work was supported in part by BioMerieux.
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- WHO. World malaria report 2014. Geneva: World Health Organization; 2014.Google Scholar
- Bopp SE, Manary MJ, Bright AT, Johnston GL, Dharia NV, Luna FL, et al. Mitotic evolution of Plasmodium falciparum shows a stable core genome but recombination in antigen families. PLoS Genet. 2013;9:e1003293.View ArticlePubMedPubMed CentralGoogle Scholar
- Mita T, Tanabe K, Kita K. Spread and evolution of Plasmodium falciparum drug resistance. Parasitol Int. 2009;58:201–9.View ArticlePubMedGoogle Scholar
- Roper C, Alifrangis M, Ariey F, Talisuna A, Ménard D, Mercereau-Puijalon O, et al. Molecular surveillance for artemisinin resistance in Africa. Lancet Infect Dis. 2014;14:668–70.View ArticlePubMedGoogle Scholar
- Duraisingh MT, Cowman AF. Contribution of the pfmdr1 gene to antimalarial drug-resistance. Acta Trop. 2005;94:181–90.View ArticlePubMedGoogle Scholar
- Conrad DF, Hurles ME. The population genetics of structural variation. Nat Genet. 2007;39:S30–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Shaffer LG, Lupski JR. Molecular mechanisms for constitutional chromosomal rearrangements in humans. Annu Rev Genet. 2000;34:297–329.View ArticlePubMedGoogle Scholar
- Balaji S, Babu MM, Iyer LM, Aravind L. Discovery of the principal specific transcription factors of Apicomplexa and their implication for the evolution of the AP2-integrase DNA binding domains. Nucleic Acids Res. 2005;33:3994–4006.View ArticlePubMedPubMed CentralGoogle Scholar
- Coulson RM, Hall N, Ouzounis CA. Comparative genomics of transcriptional control in the human malaria parasite Plasmodium falciparum. Genome Res. 2004;14:1548–54.View ArticlePubMedPubMed CentralGoogle Scholar
- Fassbinder-Orth CA. Methods for quantifying gene expression in ecoimmunology: from qPCR to RNA-Seq. Integr Comp Biol. 2014;54:396–406.View ArticlePubMedGoogle Scholar
- Ariey F, Witkowski B, Amaratunga C, Beghain J, Langlois AC, Khim N, et al. A molecular marker of artemisinin-resistant Plasmodium falciparum malaria. Nature. 2014;505:50–5.View ArticlePubMedGoogle Scholar
- Teo SM, Pawitan Y, Ku CS, Chia KS, Salim A. Statistical challenges associated with detecting copy number variations with next-generation sequencing. Bioinformatics. 2012;28:2711–8.View ArticlePubMedGoogle Scholar
- Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinform. 2013;14:S1.View ArticleGoogle Scholar
- Xi R, Hadjipanayis AG, Luquette LJ, Kim TM, Lee E, Zhang J, et al. Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proc Natl Acad Sci USA. 2011;108:E1128–36.View ArticlePubMedPubMed CentralGoogle Scholar
- Lee TH, Kim YK, Nahm BH. GBParsy: a GenBank flatfile parser library with high speed. BMC Bioinform. 2008;9:321.View ArticleGoogle Scholar
- Xie C, Tammi MT. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinform. 2009;10:80.View ArticleGoogle Scholar
- Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011;21:974–84.View ArticlePubMedPubMed CentralGoogle Scholar
- Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, et al. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002;419:498–511.View ArticlePubMedGoogle Scholar