Complete mtDNA genomes of Anopheles darlingi and an approach to anopheline divergence time

Background The complete sequences of the mitochondrial genomes (mtDNA) of members of the northern and southern genotypes of Anopheles (Nyssorhynchus) darlingi were used for comparative studies to estimate the time to the most recent common ancestor for modern anophelines, to evaluate differentiation within this taxon, and to seek evidence of incipient speciation. Methods The mtDNAs were sequenced from mosquitoes from Belize and Brazil and comparative analyses of structure and base composition, among others, were performed. A maximum likelihood approach linked with phylogenetic information was employed to detect evidence of selection and a Bayesian approach was used to date the split between the subgenus Nyssorhynchus and other Anopheles subgenera. Results The comparison of mtDNA sequences within the Anopheles darlingi taxon does not provide sufficient resolution to establish different units of speciation within the species. In addition, no evidence of positive selection in any protein-coding gene of the mtDNA was detected, and purifying selection likely is the basis for this lack of diversity. Bayesian analysis supports the conclusion that the most recent ancestor of Nyssorhynchus and Anopheles+Cellia was extant ~94 million years ago. Conclusion Analyses of mtDNA genomes of Anopheles darlingi do not provide support for speciation in the taxon. The dates estimated for divergence among the anopheline groups tested is in agreement with the geological split of western Gondwana (95 mya), and provides additional support for explaining the absence of Cellia in the New World, and Nyssorhynchus in the Afro-Eurasian continents.

Analysis of the An. darlingi population structure based on the cytochrome-oxidase subunit 1 mitochondrial gene (COI) provided evidence in support of two significant subdivisions among An. darlingi populations from Central (including northwestern Colombia) and South Amer-ica, with F ST values similar to those found among different species [10]. This finding was substantiated by the analysis of microsatellite markers, which revealed limited gene flow between Central America and Brazil and Peru [12]. Also, patterns of variation within the single-copy nuclear white gene, with five fixed non-synonymous mutations detected between the northern and southern An. darlingi populations (hereafter termed northern and southern genotypes) [14], were consistent with the COI studies. Three localities were identified in the Amazon basin where the two genotypes co-occur (Iquitos, Peru; Guayaramerín, Bolivia; Fortuna near Puerto Ayacucho, Venezuela), but no hybrid specimens were detected there.
Recent studies revealed that the distribution of the COI gene variants within South American populations, especially those recovered from Brazil, reflects subdivisions consistent with geographical elements, such as the Amazon River Delta and mountains in southeastern Brazil, as barriers to gene flow [13]. Significant population structure of An. darlingi, identified using microsatellites markers, also was reported in a malaria-endemic region of eastern Amazonian Brazil [11,12].
Overall, these findings support the hypothesis that An. darlingi is experiencing incipient speciation, with the northern and southern genotypes as the units of speciation [12], although other results appear to counter this hypothesis [9]. Therefore, it is critical to investigate the status of this taxon and elucidate the processes driving divergence between alternative genotypes.
The mtDNA markers are used often for phylogenetic and population genetics analyses because of ease of their amplification and high information content at different evolutionary levels, including variation within and between populations [15,16]. To date, mtDNA genomes are available for only three anopheline species, belonging to the subgenera Anopheles (Anopheles quadrimaculatus species A) [17] and Cellia (Anopheles gambiae [18] and Anopheles funestus [19]). Because the evolutionary lineage encompassing the subgenus Nyssorhynchus presumably branched prior to the divergence of the subgenera Cellia and Anopheles lineages, mtDNA of Nyssorhynchus species may possess different characteristics, such as altered gene order or different selection pressures, which may affect the utility of mtDNA sequences as markers for evolutionary studies. The purifying selection mechanism under which the mtDNA is assumed to evolve has been questioned recently with some reports describing positive selection acting at the mitochondrial level [20,21]. Therefore, testing the pattern of evolution of the mtDNA within the postulated incipient species, i.e. to examine whether the mtDNA is evolving under neutrality versus purifying or positive selection, can contribute to the understanding of An. darlingi differentiation.
Different approaches have been used to estimate divergence times within An. darlingi and to determine the most likely scenario for its diversification. Based on a single mitochondrial COI gene, the date of divergence was estimated to the Pleistocene, and the Pleistocene refugium was the main hypothesis to explain current distribution [10,22]. Recently, a Bayesian approach was employed to estimate divergence within other species complexes, including Anopheles annulipes in Australia and Papua New Guinea [23], and the Neocellia Series in the Oriental region [24]. In addition, Krzywinski et al. [18] calculated divergence dates of mtDNAs using a molecular clock approach with the subgenera Anopheles and Cellia, as well as with Aedes and between An. gambiae and An. funestus. No information exists on the time of divergence of Nyssorhynchus, which is hypothesized to be one of the earliest branching lineages of Anopheles [25]. This study tests the differentiation of the An. darlingi genotypes using the whole mtDNA sequence, analyses patterns of selection within this taxon, and infers a date of Nyssorhynchus divergence from other Anopheles groups.

Sample origin and DNA extraction
Two resting female mosquitoes were collected near local homes from Manaus, Amazonas State, Brazil (3° 08'S/60°0 1' W) in August 2006 and in Central Cayo District of Belize (17° 09'N/88° 36'W) in 2008. The Belize specimens were identified morphologically following the Wilkerson and Strickman [26] illustrated morphological key and the Brazilian mosquito was identified based on both Faran and Linthicum [27] and Forattini [28] entomological keys. Mosquitoes were stored on silica gel in a sealed container or preserved in isopropanol, both at room temperature. DNA was extracted from whole mosquitoes by two different kit extractions, a DNeasy Blood and Tissue kit (Qiagen, Hilden, Germany) and Wizard Genomic DNA Purification kit (Promega, Madison, WI), following the procedures of the manufacturers.

PCR amplification and sequencing of Anopheles darlingi mtDNA
The white gene genotype of each specimen was checked using the PCR-RFLP protocol described in Mirabello [14]. Gene amplification primers for the complete mtDNA were designed using Primer3 0.4.0 [29] based on published An. gambiae, An. quadrimaculatus and An. funestus mtDNA sequences (GenBank Accession NC_002084, NC_000875 and DQ146364). Each fragment overlapped upstream (5') by ~50 base pairs (bp) (see Additional file 1) and was amplified using a Ready-To-Go-PCR bead (Amersham Pharmacia/Biotech, NJ, USA) and run on a PTC-200 thermal cycler (BioRad, Inc.). Amplification conditions included an initial denaturation at 95° for 2 min, 35 cycles of denaturation at 95°C for 1 min, annealing at 58°C for 2 min and extension at 72°C for 1 min, with a final extension step at 72°C for 10 min. Amplification products were purified with the Exo-SAP-IT kit (Amersham Biosciences) and sequenced directly. Sequencing of both strands was performed at Applied Genomic Technologies Core (Wadsworth Center) on an ABI 3700 DNA automated sequencer. To ensure fidelity during amplification, each fragment was sequenced at least three times and a consensus sequence established. The regions with unclear electropherograms, from the 14,000-nucleotide position covering the whole AT-rich region, were cloned with the vector TOPO TA (Invitrogen) according to the manufacturer's recommendations.

Sequence assembly and annotation
Sequencher 3.0 (Gene Codes Corporation) was used for sequence assembly into contigs and for proofreading sequence files. Complete consensus sequences were aligned using the ClustalW application in Bioedit v 7.0 and alignments were annotated in SEQUIN [30]. Identification of transfer RNA genes was conducted using tRNAscan-SE [31]. Protein-coding and ribosomal RNA genes were by comparison with the An. gambiae mtDNA sequences (GenBank accession number NC_002084).

Genomic and evolutionary analyses
Nucleotide composition and the Relative Synonymous Codon Usage (RSU) values were calculated using the MEGA 4.1 program [32]. The GC-and AT-skews were used to determine the base compositional difference and strand asymmetry among the samples analysed [33]. Analyses of sequences also were performed with DnaSP v 4.10.9 [34] and DAMBE v 4.2.13 [35]. AT-rich region analyses and plot profiles were completed with the MacVector software (MacVector, Inc) and mreps v 2.5 [36].
To test if positive selection is acting as a force for genotype divergence, the identification of sites with an excess of amino acid replacements with selective pressure over long evolutionary time was determined. We calculated the dN/dS ratio (nonsynonymous and synonymous substitutions) using the entire sequence of each gene in pairwise comparisons to investigate the substitution rates in each protein-encoding gene [37]. In addition, the software PAML 3.15 was used to analyse DNA sequences with maximum likelihood methods in a phylogenetic framework [38]. CODEML is one of the programmes that can detect sites that have been under recurrent positive selection over long periods of time, e.g., to detect any departure from the neutrality theory in the protein sequences of An. darlingi. The selective pressure at the protein level was measured by ω, the ratio of nonsynonymous to synonymous rates dN/dS, with ω < 1, = 1, or >1 indicating conserved, neutral or adaptive evolution, respectively [39]. The measure of this ratio provides a sensitive measure of selective pressure at the amino acid level. Codon-based likelihood analysis was conducted under the random-sites model: Model M0 (one ratio) assumes constant selective pressure across codon sites and over time. Models M1a (neutral), M2a (selection), M7 (β) and M8 (β & ω) of variable selective pressure across codon sites were used to estimate selective pressure and to test for positive selection. Model M0 is the most simple and can be used to check that parameter estimates in more complex models are consistent. M0 (model = 0 NSsites = 0) was run to provide initial branch lengths, and the initial values of this tree were then used to run the other models. Because selection can act differ-ently in some species, another approach, the branch-specific model, was also tested [40,41] to detect any evidence of positive selection in any branch of the designated tree. For this analysis the whole protein-encoding genes were used in a combined file, setting the An. darlingi branch as the background one (ω = 1 fixed), and allowing ω to vary in the other anopheline branches of the tree topology. In both approaches, the tree topology for each protein obtained in PAUP [42] was used as the topology to be tested under maximum likelihood.
Three different alignments were defined to test the phylogeny of the samples; one with the whole mtDNA sequence, the second with all the protein-encoding gene sequences concatenated and a third new data set with only 1 st and 2 nd positions of these protein-encoding genes to minimize homoplasy in the 3 rd codon position [44]. The software Modeltest v. 3.7 [45] was used to determine the nucleotide substitution model and the alignments followed a GTR+I+γ model (although with different rates) and were used for the subsequent phylogenetic analysis. Three different approaches were followed to reconstruct the phylogeny for posterior analyses, maximum parsimony (MP), maximum likelihood (ML) and Bayesian inference (BI). MP and ML were conducted with PAUP v 4.0b10 [42] and bootstrap values calculated with 1,000 replicates. The BI was performed employing MrBayes v 3.1.1. [46], with two simultaneous runs of 5,000,000 generations, discarding the first 25% as burn-in.

Estimation of divergence date
Culicidae fossil records are not common, and while most of them are compressed, some specimens have been detected in amber inclusions [47]. Anopheles (Nyssorhynchus) dominicanus is the oldest fossil record of the Anophelinae subfamily in the New World, with controversy in the proposed age ranging from 15-20 million years ago (mya) based on foraminifera [48] and 30-45 mya based on coccoliths [49]. The deficient Diptera fossil record offers only an age estimate for the divergence between Drosophila and Anopheles at ~259.9 mya [50], and this deep external calibration might not provide reliable estimation dates, especially for the youngest nodes and for intraspecific divergences within the An. darlingi taxa [51]. Alternatively, Brower [52] estimated that the standard arthropod mtDNA sequence divergence is 2.3% per million years based on calculations across five arthropod genera. The limitation of applying this information to our data is that it may underestimate divergences older than 3.25 million years, because of mutation saturation [53].
A Bayesian Markov Chain Monte Carlo approach, (MCMC) available in the software BEAST v.1.4.6 [54] was performed to infer the topology and the node ages by estimating the Bayesian posterior distribution of the divergence. Following the recommendations of the authors, the uncorrelated relaxed lognormal clock was applied to estimate how clock-like our data were. The ucld.stdev parameter was checked, and in all partitions, the value was close to 0, indicating that the data were clock-like and that the rate of heterogeneity among species was low. Therefore, an uncorrelated lognormal relaxed clock was followed [55] and defined the rate prior to have a normal distribution with a mean of 0.1. The SRD06 model for the partition in the protein-encoding genes was selected, as it seems to provide a better fit for protein-coding nucleotide data [54]. This partition was run using the GTR with gamma and invariant sites model, the best fit with our nucleotide substitution detected by Modeltest. The Yule tree prior is the most suitable for trees describing the relationships among individuals from different species and it was selected to calculate the divergence time between taxa. Final analyses consisted of three separate MCMC runs of 20 million generations sampled every 1000 generations. Tracer v1.4 software was used to confirm the adequate mixing of the MCMC chains of the three separate runs and to check that the effective sample sizes of all the parameters were greater than 500. LogCombiner v1.4.7 was employed for combining the separate runs into one file. Finally, the maximum clade credibility tree was calculated with the mean node branch using TreeAnnotator v1.4.7. FigTree v1.2.1 was employed for visualization of the node branches.

Genome organization and structure
The complete mtDNAs of An. darlingi are 15,385 bp and 15,386 bp in length for the southern and northern genotypes, respectively. The typical 37 genes in animal mtDNA, comprising 13 protein-encoding genes, two rRNA genes (12S rRNA and 16S rRNA), 22 tRNA genes and a control region are found in both genomes ( Figure 1 and Table 1). DNA sequences were deposited in GenBank under accession number GQ918272 (northern genotype) and GQ918273 (southern genotype). Genes are encoded on both strands (Table 1) with some open-reading frames overlapping adjacent genes (ATP8/ATP6 and NAD4/ NAD4L as reported in other insects [56]), resulting in a compact genome. The An. darlingi gene arrangement structure is highly conserved when compared with the mitochondrion of the other three sequenced anopheline species.
The nucleotide composition of the An. darlingi southern genotype is as follows: the L-strand has a GC content of 20.73% and 79.27% AT (A = 35.06%, T = 44.21%, C = 9.23% and G = 11.5%) and the H-strand has A = 44.5%, C = 16.12%, G = 7.98%, T = 31.4%. The composition of both strands of the northern genotype was not significantly different from the southern genotype. The mtDNAs of both An. darlingi genotypes are biased towards a high A+T content ( Table 2). The overall AT content in the whole mitochondrion sequence was 78.2% and 78.1% in northern and southern, reaching 93.68% and 93.67% in the AT rich region, respectively ( Figure 1). This result is similar to that obtained with other anophelines, although it is slightly lower than in the other species included in the study ( Table 2). The base composition, measured by the G-skew and T-skew, i.e. asymmetry in nucleotide composition, showed both skews (-0.144 and -0.141, for the G skew) and (-0.028 for the T-skew), indicating the preference for these two nucleotides, a widespread characteristic of animal mtDNAs [57].

Transfer RNA and ribosomal RNA genes
The mtDNA of An. darlingi includes 22 tRNA genes with anticodons representing 20 different amino acids, identical to other anophelines ( Table 1).
The lengths of the 12S and 16S rRNA genes are 793 bp and 1329 bp, respectively, identical in both genotypes, and are encoded on the L-strand. The ends of those genes were assumed to be extended to the boundaries of the flanking genes [16]. As in other anophelines, the lrRNA gene is flanked by the tRNA Leu and tRNA Val , while the srRNA gene is between tRNA Val and the control region. Their A+T contents were 82.54-82.39% for lrRNA and 79.95-79.70% for srRNA, which are within the range of other insects.

Protein-coding genes
Thirteen protein-coding genes could be identified in the mtDNA of An. darlingi, similar to those of other Anophelinae [17][18][19]. Start codons in all the protein-coding genes are conserved and follow the ATN rule described previously in Anopheles [18], except for the COI gene, which has a TCG start codon characteristic of Anopheles [17,18,58]. Eight genes (NAD2, ATP8, ATP6, NAD3, NAD6, CYTB, NAD4L, NAD1) use the complete stop codon TAA (Table 1), whereas five genes have incomplete stop codons, TA_(NAD4, NAD5) and T_(COI, COII, COIII), which could be completed as a result of postranscriptional polyadenylation [59]. The codon usage of both genotypes of An. darlingi is nearly identical (Table 3), and the total number of nonstop codons is 3733 in both. This latter value is similar in the family Anophelinae, ranging from 3715 in An. quadrimaculatus to 3733 in An. gambiae (Figure 2).
The sequence identity within the An. darlingi taxon showed a high similarity across the whole genome (Table  4), with the lowest value (98.5) for the COII gene. Moreover, the sequence of the protein-encoding genes is conserved in Anophelinae, while more differences are found in the Control Region (CR) [60].

Non-coding regions
The mtDNA of An. darlingi contains five intergenic spacers. There are two spacers with only two nucleotides, while two others are formed by nine and 17 nucleotides. The A+T region, flanked by 12sRNA and tRNA Ile , has 574 and 573 nucleotides (northern and southern, respectively), and is presumed to be the origin of DNA replication of the organelle, as reported in other insects [61]. High levels of sequence divergence were found in the CRs among anopheline species, even when low diversity was detected in other genes (Table 4). This is consistent with the CR variation observed among dipteran mtDNAs [60]. Four motifs have been described in arthropod CRs: long sequence of thymines, tandem repeat sequences, a subregion of even higher A+T content and stem-loop structures [62]. The CR of the An. darlingi genotypes showed only 15 variable sites among them, including nine indels  (five insertions in the northern genotype). This is surprisingly small variation, even lower than within the An. gambiae complex, where 52 variable positions in CR were described [63]. The anopheline CR does not have the distinct domains seen in other insects [64]. In addition, the T-stretch structure, hypothesized as the regulation of transcription of the mtDNA, is located in position 142 (17 bp of extension) and this structure is a commonly observed in the other anophelines analysed as well in other Diptera, and in different insect orders [60,64,65].

Phylogenetic analyses
The relationships among Anopheles inferred using the MP, ML and Bayesian approaches are consistent with earlier reports based on mitochondrial and nuclear genes [25]. Nyssorhynchus was inferred as a sister taxon to Anopheles + Cellia with high bootstrap support for each  node. Inconsistent topology was obtained only in the case of rDNA sequences, which grouped Anopheles with Nyssorhynchus, likely due to non-independence of substitutions within loop regions in rDNA genes. Thus, the tree topology was consistent and employed in posterior estimation of parameters in tests of selection.

No evidence of positive selection in the mtDNA
The Likelihood Ratio Test did not detect any sites under positive selection in any mtDNA gene. Parameter estimations of the variable ω were derived under different models for the 13 protein-encoding genes ( Table 5). Analyses conducted with all the partitions were consistent with the conclusion that there is no positive selection acting on the protein-encoding genes of An. darlingi mtDNA, and that those are evolving by purifying selection. Even with models that allow variable selective pressure among sites (M2), ω values are distant from 1 and no indication of different selection intensities was detected. Under M8, the ω was >1 in three genes, NAD4, NAD6 and COIII, with less than one site with this pattern, although the comparison between models was insignificant. This could be an indication of a lack of power of the analysis, perhaps as a result of the few sequences that were available, the low divergence between them, or the lack of power of the Likelihood Ratio Test [39]. Concerns about small sample size and sequence divergence have been discussed [66].
Recently, it has been established that reasonable statistical power can be achieved with up to five genomes with divergence similar to those observed among some bacterial strains [67]. In addition, a recent study based on molecular evolution in mosquito immunity-related genes used this approach with similar sample size and with reliable results [68].

Estimates of time of most recent common ancestor
The analyses performed with all mtDNA genes pushed back the split between Culicinae (subgenus Stegomyia) and Anophelinae (genus Anopheles) to 190 mya, in the Early Jurassic. The Anopheles radiation took place during the Early Cretaceous (130 mya), and the most recent ancestor of the Anopheles and Cellia subgenera is estimated at 107 mya. Finally, the split within the Cellia subgenus (An. gambiae and An. funestus) was estimated at 82 mya. Alternatively, the analysis of the protein-coding genes provides earlier divergence dates, and although they correspond to the same geological periods, the nodes are dated more recently in agreement with an older origin of the mosquitoes in the Jurassic [69,70]. A recent study based on six nuclear genes and morphological   characters provides further support for a Jurassic origin of mosquitoes although the estimated divergence between Anopheles and Cellia was more recent (~43 mya) than the dates derived from our study [71]. The different evolutionary histories of the molecular markers (mtDNA vs nDNA), besides incomplete species sampling, may explain the apparent incongruence between the dates. The inclusion of an additional calibration point of the divergence of Anophelinae and Culicinae estimated to 120 mya in the Early Cretaceous [72] was incorporated in the analysis, providing a node age divergence of the Nyssorhynchus-Anopheles of 79 mya. The process of separation between South America and Africa began in the Permian and the last connection by land was estimated to exist ~95 mya. In addition, the split between Anopheles-Cellia took place 58 mya (Figure 3). Altogether, these data provide estimates for the radiation of the subgenus Cellia concordant with the loss of land bridges between Africa-Europe and Europe-North America (through Greenland), that may explain the current distribution of the Cellia subgenus, absent in the Americas. Anopheles distribution across the American continent, in agreement with the radiation dates derived from this study, is sustained from a latitudinal migration and subsequent radiation from the north to the south. The large confidence intervals of both analyses include congruent dates that fit with the above explanation, although these estimates should be read carefully, since the use of few calibration points could provide weak estimates of divergence [73].
The study of incipient species has been of paramount importance in the understanding of speciation in different organisms [74,75]. Even though there is strong evidence for diverging genotypes in An. darlingi, its mtDNA does not sustain this level of variance. The persistence of permeable barriers that permit exchange of genetic information within genotypes could preclude the detection of the differentiation. Questions focused on reproductive isolation among the An. darlingi genotypes need to be addressed, as well as assortative mating features between co-existing populations. Taxa with a recent divergence time are expected to share ancestral variation in high proportions, which may confound the reconstruction of their historical relationships. Furthermore, it is likely that young species, with a recent divergence process, have fixed differences only at genes directly responsible for speciation. The major African malaria vector, An. gambiae, has incomplete barriers to gene flow, where diver- gent selection is acting on local regions of the genome, and gradually increasing the divergence of the populations until gene flow disappears between them [76,77]. Further support for this form of divergence was provided with the "speciation island model", which suggests small regions with high differentiation in the genome of this taxon [77,78].
The current studies focused on deciphering the complete genome of An. darlingi will provide new elements and tools to further evaluate the speciation of this taxon. Subsequently, the potential identification of genes affecting speciation should improve the surveillance and management of malaria programs in the Americas and help to elucidate the different patterns of malaria transmission involving An. darlingi.
In addition, the radiation of anophelinae species underscores the significance of these studies together with the origin of malaria parasites. The link between the anophelines in the New World and malaria parasites is still controversial [79][80][81]. If human Plasmodium parasites were transferred to the Americas by European colonialists in post-Colombian times, then the contact between neotropical malaria vectors and protozoan is very recent. This supports the hypothesis that the susceptibility of mosquitoes to malaria infection is a basal characteristic and further investigations should be conducted to elucidate the refractoriness to this parasite in non-malaria vector species.

Conclusions
The lack of conclusive evidence of speciation at the mtDNA level demands further investigation of other markers, such as nuclear genes that can reflect differences in ecological preferences within An. darlingi. The divergence dates estimated for the subgenus Nyssorhynchus provide valuable information, which, although there is a need to be cautious, because the constraints of the method gave an approximation of the radiation of this group. Finally, it seems that purifying selection is predominant at the mitochondrial level, and the differences observed within the taxon may not be a consequence of that force.