FlexiChip package: a universal microarray with dedicated analysis software for high-throughput detection of SNPs linked to anti-malarial drug resistance

Background: A number of molecular tools have been developed to monitor the emergence and spread of anti-malarial drug resistance in Plasmodium falciparum. One of the major obstacles to the wider implementation of these tools is the absence of practical methods enabling high-throughput analysis. Here a new Zip-code array called FlexiChip is described, linked to a dedicated software program, which largely overcomes this problem.
Methods: Previously published microarray probes detecting single-nucleotide polymorphisms (SNPs) associated with parasite resistance to anti-malarial drugs (ResMalChip) were adapted for a universal microarray format, FlexiChip. To evaluate the overall sensitivity of the FlexiChip package (microarray + software), FlexiChip results were compared to those of the ResMalChip microarray, using the same extension probes and the same PCR products. In both cases, sequencing results were used as the gold standard to calculate sensitivity and specificity. FlexiChip results obtained with a set of field isolates were then compared to those obtained in an independent reference laboratory.
Results: The FlexiChip package gave results identical to the ResMalChip results in 92.7% of samples (kappa coefficient 0.8491, with a standard error of 0.021) and had a sensitivity of 95.88% and a specificity of 97.68% compared to sequencing as the reference method. Moreover, the method performed well compared to the results obtained in the reference laboratory, with 99.7% identical results (kappa coefficient 0.9923, S.E. 0.0523).
Conclusion: Microarrays could be employed to monitor P. falciparum drug resistance markers with greater cost effectiveness and the possibility of high-throughput analysis. The FlexiChip package is a promising tool for use in resource-poor settings of malaria-endemic countries.


Background
Anti-malarial drugs play a pivotal role in malaria control, but only a limited number of new drugs are under development. Resistance of malaria parasites to commonly used anti-malarial drugs is also a global challenge. Thus, there is a need to optimize the use of existing treatments and to monitor the emergence and spread of drug-resistant malaria parasites, in particular Plasmodium falciparum, which is responsible for the vast majority of malaria deaths [1][2][3][4]. Typing the known genetic drug resistance markers is among the strategies currently used for monitoring the resistance of P. falciparum. Single nucleotide polymorphisms (SNPs) related to anti-malarial drug resistance are found in five major genes: Pfdhfr and Pfdhps for pyrimethamine and sulphadoxine resistance, Pfcrt and Pfmdr1 for chloroquine resistance and, more recently but not yet confirmed by field studies, serca/atpase6 for artemisinin resistance. Different molecular tools have been developed, including the PCR-RFLP method [5][6][7][8], real-time PCR for assessing gene copy number [9], sequence analysis [10], the heteroduplex tracking assay [11], and PCR amplification of the SNP-containing fragments followed by single base extension (SBE) of an elongation primer with fluorescent ddNTP's [12].
DNA microarray-based SNP genotyping has developed considerably in recent years. Surveying SNPs is an important tool in epidemiological studies on parasite resistance, but the currently available methods to identify resistance all have important drawbacks, including a limited focus on only the five mentioned genes and the absence of a high-throughput format. Several systems have been proposed [13,14], mainly based on PCR amplification of the SNP-containing fragments followed by SBE of an elongation primer with fluorescent ddNTP's [15]. Recently, a genotyping array called ResMalChip has been developed to monitor 34 SNPs in five genes of P. falciparum that either confer or increase resistance to anti-malarial drugs [16]. The ResMalChip method also has two major drawbacks. First, the content of the microarray is designed only for a specific objective (typing SNPs related to resistance) in a specific organism. Therefore, using this microarray to survey other SNPs in any other gene or organism requires a new array design and production. This implies a large number of tests to adapt the system to new markers: the capture oligonucleotides must be specific to the elongation primers, and the experimental hybridization conditions must be compatible with all capture oligonucleotide/elongation primer pairs. Second, no standard software has been developed for data analysis. A new Zip-code array, called FlexiChip, associated with dedicated software, has been designed to address these problems (see Figure 1) [17]. This array contains oligonucleotides (Zip-codes) that are not complementary to any sequence in any known organism and have been designed to have the same thermodynamic properties. Therefore, various assays can be performed using a single protocol from SNP discovery to hybridization. The target probes contain the sequences complementary to the Zip-codes, linked to the elongation primers. FlexiChip can in principle be used to test any SNP.
Furthermore, an analysis algorithm based on a mixture model and allowing accurate SNP identification has been developed. This algorithm does not require any prior threshold determination and provides results in a simple Excel file format. To evaluate the overall sensitivity of the FlexiChip package (microarray + software), ResMalChip and FlexiChip data sets were analysed with this software and tested against sequence analysis. FlexiChip results were then compared to results obtained with ResMalChip using the same PCR products. For a set of 50 field isolates, FlexiChip results were also compared to those obtained in another molecular laboratory (MORU) acting as external quality control.

Clinical P. falciparum samples and DNA extraction
As part of the activities related to the assessment of drug efficacy in uncomplicated malaria in Cambodia, 263 P. falciparum isolates were collected from consenting patients from 2001 to 2004. Blood samples were kept at -20°C at the Institut Pasteur du Cambodge until use. DNA was extracted from blood samples using the QIAamp® DNA Mini Kit (Cat. No. 51306, QIAGEN®, Germany), according to the manufacturer's procedure. All studies were conducted following good clinical practice, and ethical clearance was obtained from the National Ethics Committee of Cambodia.

Microarray design
Zip code design and microarray production
A Zip-code as used in the present study is defined as an artificial sequence composed of 24 bases. All the Zip-codes have similar melting temperature (Tm) values. A set of 96 Zip-codes of 24-mer oligonucleotides has been designed (see Table S1 in Additional file 1). They were tested to avoid self-pairing and hairpin formation (FastPCR, Institute of Biotechnology, University of Helsinki [18]). Reverse complement oligonucleotides (cZip) were synthesized with an amino C-7 linker at the 3' end, used for attachment to the slide. The cZip were then spotted onto aldehydesilane-coated slides with a 12-well format (AL MPX slides, Schott) using a VersArray ChipWriter Pro system (Bio-Rad Laboratories, Hercules, CA). For spotting, the cZip were resuspended at 50 micromolar in phosphate buffer. The FlexiChip spotting pattern of the 96 cZip, with Cy3 and Cy5 prelabelled anchor oligonucleotides and six negative controls, is presented in additional figure 2A. Each oligonucleotide was spotted in triplicate. A total of 12 independent hybridizations can be performed in parallel on a single slide. ResMalChip arrays were produced as described in [16].
Figure 1. Schematic representation of the FlexiChip analysis method. A) SNPs are detected by Single Base Extension (SBE) using Sequenase and ddNTP labelled with Cyanine 3 or 5, using a specific probe that hybridizes one nucleotide upstream of the SNP site. B) Products of the SBE reaction are hybridized on FlexiChip by their Zip-code oligonucleotides. After washing and drying, the slides are scanned at two wavelengths. C) The analysis algorithm is based on a mixture model and allows accurate SNP identification. The results are stored in an Excel file.

Microarray validation
To evaluate possible cross-hybridization between Zip-codes, each Zip-code associated with its primer was labelled using the SBE protocol (see below) and hybridized one by one on the microarray. Cross-hybridization was considered significant when the average fluorescent signal intensity of a non-tested Zip-code spot was above 10% of the average positive signal of the tested Zip-code spot. This was observed for only two spots out of 96, and the corresponding two Zip-codes were discarded from the analysis (see Table S1 in Additional file 1).
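The 10% criterion above can be sketched in a few lines of Python. This is only an illustration of the rule as described: the function name `cross_hybridizing_spots`, the dictionary layout and the intensity values are hypothetical and are not taken from the FlexiChip software.

```python
from statistics import mean


def cross_hybridizing_spots(signals, tested_zip, threshold=0.10):
    """Return the non-tested Zip-codes whose mean signal exceeds
    `threshold` times the mean signal of the tested Zip-code.

    signals: dict mapping a Zip-code name to its replicate intensities.
    """
    positive = mean(signals[tested_zip])
    return sorted(
        name
        for name, replicates in signals.items()
        if name != tested_zip and mean(replicates) > threshold * positive
    )


# Hypothetical intensities for one single-Zip-code hybridization.
example = {
    "Zip01": [9800, 10100, 10100],  # tested Zip-code, mean 10000
    "Zip02": [300, 250, 350],       # mean 300, i.e. 3% of the positive signal
    "Zip03": [1500, 1400, 1600],    # mean 1500, i.e. 15% of the positive signal
}
flagged = cross_hybridizing_spots(example, "Zip01")
```

With these made-up values only `Zip03` exceeds the 10% limit and would be a candidate for exclusion.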

Single base extension, SBE
Remaining free dNTP's were removed using shrimp alkaline phosphatase (SAP). Briefly, 5 μl of those ten nested PCR products were mixed with 2 U of SAP (Amersham Biosciences, Freiburg, Germany) and incubated for 1 h at 37°C. From each sample, two reactions were performed using two combinations of Cy3 and Cy5 labelled ddNTP's (Perkin Elmer, Schwerzenbach, Switzerland). The Sequenase (Termipol®, Solis, Tartu, Estonia) extension reaction, reaction mixture and final denaturation were done for ResMalChip and FlexiChip as described by Crameri et al. [16].
Extended primers with cyanine labelling were hybridized onto the microarray. With this experimental design on FlexiChip, two samples can be processed per spotting area. As 40 positions are needed per sample, one set of extension primers can be associated with Zip-codes 1 to 40 while the second set can be associated with Zip-codes 49 to 88 (remaining positions 41 to 48 and 89 to 96 were not used).
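The layout described above, two samples per spotting area with Zip-codes 1 to 40 and 49 to 88, amounts to a fixed offset. The following Python sketch illustrates that mapping; the function name `zip_code_for` and the numbering of sample slots as 1 and 2 are assumptions made for the example.

```python
def zip_code_for(sample_slot, snp_position):
    """Map a sample slot (1 or 2) and a SNP position (1..40) to the
    Zip-code number used in one spotting area.

    Slot 1 uses Zip-codes 1-40 and slot 2 uses Zip-codes 49-88;
    Zip-codes 41-48 and 89-96 stay unused, as stated in the text.
    """
    if sample_slot not in (1, 2):
        raise ValueError("sample_slot must be 1 or 2")
    if not 1 <= snp_position <= 40:
        raise ValueError("snp_position must be between 1 and 40")
    offset = 0 if sample_slot == 1 else 48
    return snp_position + offset


first_sample = [zip_code_for(1, p) for p in range(1, 41)]
second_sample = [zip_code_for(2, p) for p in range(1, 41)]
```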

Chip hybridization
Briefly, extended primers associated with a Zip-code were resuspended in 6 μl of 20 × SSC (1 × SSC = 0.15 M NaCl, 0.015 M sodium citrate, pH 7.2) and hybridized on the array. Microarrays were then incubated for 60 min at 50°C in a humid chamber and subsequently washed in 2 × SSC and 0.2% SDS for 20 min and in 2 × SSC for 20 min. Microarrays were spun for 5 min at 3000 g to dry. During hybridization, extended primers linked with their specific Zip-code were hybridized on the FlexiChip cZip pattern (see Table S2B in Additional file 2).

Data acquisition
Hybridized microarrays were scanned at 635 nm and 532 nm using an Axon 4100A fluorescence scanner (Axon, Bucher Biotec AG, Basel, Switzerland) and Axon GenePix ® Pro (version 6.0) software. The PMT (photomultiplier tube) was 550 at 532 nm and 500 at 635 nm.

Data analysis and allele identification
All the data analyses were performed using the R software [19] and packages. The allele identification algorithm was written in R. It was applied independently on each array. The aim of this algorithm is to classify each spot of the array into one of the "green", "red", or "indeterminate" classes, and then to convert the spot colour into the corresponding SNP sequence. ResMalChip and FlexiChip raw data were first corrected for background using the limma package [20] (version 2.12.0) according to a two-step procedure. A modified version of the "movingmin" option of the background correction function (called "bgCorrect") was first applied to the data. This option smoothes the background on the basis of a 3 × 3 moving window. Unlike the original version, however, the modified version does not subtract the smoothed background. Then the normexp procedure was applied. According to this procedure, the observed signal is modelled as the convolution of a true signal and a background signal, where the true signal follows an exponential distribution and the background follows a Gaussian distribution.
This two-step process was derived because of a high background level observed with respect to the signal, especially on ResMalChip data. Spots that still had a signal to noise ratio lower than one after background correction were flagged "bg" (where "bg" stands for "background").
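As a rough illustration of the first step, the Python sketch below replaces each background value by the minimum over its 3 × 3 neighbourhood and does not subtract it from the foreground. It assumes a rectangular grid of per-spot background values and simply truncates the window at the edges; the name `moving_min_3x3` and the sample values are hypothetical, and this is not the limma implementation.

```python
def moving_min_3x3(background):
    """Smooth a 2D grid of background values with a 3 x 3 moving minimum.

    Each value is replaced by the minimum of itself and its available
    neighbours (edge windows are truncated, an assumption of this sketch).
    The smoothed background is returned and never subtracted here.
    """
    rows, cols = len(background), len(background[0])
    smoothed = []
    for r in range(rows):
        smoothed_row = []
        for c in range(cols):
            window = [
                background[i][j]
                for i in range(max(0, r - 1), min(rows, r + 2))
                for j in range(max(0, c - 1), min(cols, c + 2))
            ]
            smoothed_row.append(min(window))
        smoothed.append(smoothed_row)
    return smoothed


# Hypothetical background intensities for a 4 x 4 block of spots.
grid = [
    [5, 6, 7, 8],
    [8, 1, 9, 9],
    [7, 6, 5, 9],
    [9, 9, 9, 4],
]
smoothed = moving_min_3x3(grid)
```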
Data from negative and positive control spots were then excluded from the data set. An intensity threshold I_T was computed on the remaining spots for each slide as the median of pooled "red" and "green" intensities. A log2 ratio of the "red" intensity over the "green" intensity was computed for each of the 1440 (three replicates of 40 SNP spots for 12 samples) remaining spots.
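The two quantities defined above can be computed as follows. This is a minimal Python sketch assuming that each spot carries strictly positive, background-corrected "red" and "green" intensities; the dictionary layout and the values are hypothetical.

```python
import math
from statistics import median


def intensity_threshold(spots):
    """Median of the pooled red and green intensities of all spots."""
    pooled = [s["red"] for s in spots] + [s["green"] for s in spots]
    return median(pooled)


def log2_ratio(spot):
    """log2 of the red intensity over the green intensity.

    Both intensities must be strictly positive.
    """
    return math.log2(spot["red"] / spot["green"])


# Hypothetical background-corrected intensities for three spots.
spots = [
    {"red": 4000.0, "green": 500.0},   # mostly red
    {"red": 250.0, "green": 2000.0},   # mostly green
    {"red": 1000.0, "green": 1000.0},  # balanced
]
I_T = intensity_threshold(spots)
ratios = [log2_ratio(s) for s in spots]
```

A positive log ratio indicates a predominantly "red" spot and a negative one a predominantly "green" spot.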
A two-component Gaussian mixture model was fitted differently to the ResMalChip and FlexiChip datasets. For the ResMalChip dataset, a two-component Gaussian mixture model was computed using the Mclust function from the mclust package [21] with the modelNames parameter set to "E" (Gaussian functions with the same variance). These two estimated Gaussian functions are estimates of the conditional prior probability functions f(x/ω1) and f(x/ω2) that describe the distribution of log ratios within the classes ω1 and ω2 (these two classes are respectively associated with "green" and "red" spots). For the FlexiChip dataset, the model was built in two steps. A first optimal mixture model was computed using the Mclust function with default parameters (modelNames = c("E","V")). In most cases a mixture model with three or more components was obtained. To get a two-component mixture model, these components were grouped according to the sign (positive or negative) of their mean, and a mixture model was derived from each of these two groups. These two "sub"models were then used as estimates of the conditional prior probability functions (see Figure 2).
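The grouping step used for FlexiChip can be sketched as below. The Python code takes already fitted components as (weight, mean, standard deviation) triples, which stand in for the output of a mixture-fitting routine; it does not call mclust. The component values are hypothetical, and assigning a component with a mean of exactly zero to the "red" group is an arbitrary choice of this sketch.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def group_by_sign(components):
    """Split (weight, mean, sd) components by the sign of their mean.

    Negative means (green stronger than red) go to the "green" group,
    the others to the "red" group.
    """
    green = [c for c in components if c[1] < 0]
    red = [c for c in components if c[1] >= 0]
    return green, red


def class_density(group):
    """Return (prior, density) for one non-empty group of components.

    The prior is the summed weight of the group; the density is the
    sub-mixture renormalized so that it integrates to one.
    """
    prior = sum(weight for weight, _, _ in group)

    def density(x):
        return sum(w * gaussian_pdf(x, m, s) for w, m, s in group) / prior

    return prior, density


# Hypothetical three-component fit of the log ratios of one array.
components = [(0.3, -3.0, 0.5), (0.2, -1.5, 0.8), (0.5, 2.5, 0.6)]
green, red = group_by_sign(components)
p_green, f_green = class_density(green)
p_red, f_red = class_density(red)
```

Here `f_green` and `f_red` play the role of f(x/ω1) and f(x/ω2), and `p_green` and `p_red` the role of P(ω1) and P(ω2).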
The remainder of the base-calling algorithm was then identical for the datasets of both chips. Conditional posterior probabilities P(ω1/x) and P(ω2/x) were computed according to Bayes' theorem:
$$P(\omega_i / x) = \frac{f(x/\omega_i)\,P(\omega_i)}{f(x/\omega_1)\,P(\omega_1) + f(x/\omega_2)\,P(\omega_2)}, \quad i = 1, 2$$
A third class called ω0 was created between ω1 and ω2. Its boundaries were defined using a tunable parameter called the ambiguity rejection threshold and denoted C_r. This class contained data from spots that had a probability lower than C_r of belonging to one of the "red" and "green" classes and was used to exclude data having a low probability of good classification, i.e. lower than C_r.
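Bayes' theorem as used here can be illustrated with two equal-variance Gaussian classes, the setting described for ResMalChip. The Python sketch below is only an example: the priors, means and standard deviation are hypothetical, and the function name `posteriors` is not taken from the original software.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posteriors(x, prior1, mean1, prior2, mean2, sd):
    """Return (P(w1/x), P(w2/x)) for two Gaussian classes sharing one sd.

    For extreme x both joint densities can underflow to zero; this
    sketch does not guard against that case.
    """
    joint1 = prior1 * gaussian_pdf(x, mean1, sd)
    joint2 = prior2 * gaussian_pdf(x, mean2, sd)
    total = joint1 + joint2
    return joint1 / total, joint2 / total


# Hypothetical model: green class centred on -2, red class on +2.
p1_mid, p2_mid = posteriors(0.0, 0.5, -2.0, 0.5, 2.0, 1.0)
p1_green, p2_green = posteriors(-2.0, 0.5, -2.0, 0.5, 2.0, 1.0)
```

Halfway between the two class means the posteriors are equal, whereas a log ratio of -2 is assigned almost entirely to the "green" class.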
Each spot on the array was first classified within one of the "green" (ω1), "red" (ω2), rejection (ω0), weak signal or background (bg) classes according to the following decision rules:
• P(ω1/x) > P(ω2/x) and P(ω1/x) > C_r and I_S > I_T and bg = FALSE → d(x) = ω1
• P(ω2/x) > P(ω1/x) and P(ω2/x) > C_r and I_S > I_T and bg = FALSE → d(x) = ω2
• max(P(ωi/x)) ≤ C_r, i = 1,2 and I_S > I_T and bg = FALSE → d(x) = ω0
• I_S > I_T and bg = TRUE → d(x) = bg
where x is the log ratio associated with the spot, d(x) is the decision associated with x, I_SR and I_SG are respectively the "red" and "green" intensities measured on the spot, and I_S = max(I_SR, I_SG) is the maximum of both intensities for this spot.
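The decision rules above translate into a short function. The Python sketch below follows them literally and adds two assumptions that the text does not spell out: a spot whose maximum intensity does not exceed I_T is labelled "weak signal", and exactly tied posteriors are sent to the rejection class. Note that two-class posteriors sum to one, so the larger one is never below 0.5; under this literal reading a threshold of 0.5 or less would never reject, which suggests the software may parameterize C_r differently. That cannot be confirmed from the text, so the example uses an arbitrary value of 0.9.

```python
def classify_spot(p1, p2, i_red, i_green, i_threshold, c_r, bg_flag):
    """Classify one spot as "omega1" (green), "omega2" (red),
    "omega0" (rejection), "bg" or "weak signal".

    p1, p2: posterior probabilities of the green and red classes.
    """
    i_s = max(i_red, i_green)
    if i_s <= i_threshold:
        return "weak signal"  # assumption: this rule is not given explicitly
    if bg_flag:
        return "bg"
    if max(p1, p2) <= c_r or p1 == p2:
        return "omega0"       # ties are rejected (assumption of this sketch)
    return "omega1" if p1 > p2 else "omega2"


# Hypothetical spots, with I_T = 1000 and C_r = 0.9.
green_call = classify_spot(0.98, 0.02, 500, 3000, 1000, 0.9, False)
red_call = classify_spot(0.03, 0.97, 4000, 600, 1000, 0.9, False)
ambiguous = classify_spot(0.60, 0.40, 2500, 2000, 1000, 0.9, False)
background = classify_spot(0.98, 0.02, 500, 3000, 1000, 0.9, True)
weak = classify_spot(0.98, 0.02, 300, 800, 1000, 0.9, False)
```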
A final decision was taken for each SNP on the basis of its three replicate spots as follows: if at least two of the three replicates belonged to the same class, the SNP was assigned to this class; otherwise it was declared "indeterminate" and no further interpretation was performed.
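The replicate rule is a simple majority vote over three labels, sketched here in Python. The class labels are the hypothetical strings used for illustration only; the function name `call_snp` is likewise an assumption.

```python
from collections import Counter


def call_snp(replicate_classes):
    """Return the class shared by at least two of the three replicate
    spots, or "indeterminate" when no class reaches two votes."""
    label, count = Counter(replicate_classes).most_common(1)[0]
    return label if count >= 2 else "indeterminate"


agreed = call_snp(["omega1", "omega1", "bg"])
split = call_snp(["omega1", "omega2", "weak signal"])
```

With three replicates at most one class can collect two votes, so the result is unambiguous.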
Figure 2. The mixture model (FlexiChip). "Green": Gaussian components of the "green" class; "red": Gaussian components of the "red" class; thick "green": prior conditional probability density function f(x/ω1); thick "red": prior conditional probability density function f(x/ω2); dashed black: mixture density function f(x) = f(x/ω1)P(ω1) + f(x/ω2)P(ω2); "green" vertical line: lower limit of ω0; "red" vertical line: upper limit of ω0.
Allele identification was done using a pre-defined table that describes the expected signal for each allele of each SNP (see Table 1). This table was fully derived from the design of the experiment. As an example, according to this table a "red" signal (Cy5) is expected for spots associated with the RES16 SNP if the allele in the studied sample is a mutant, and a "green" signal (Cy3) otherwise. Three possible scenarios are encountered depending on the number of different probes that were associated with the SNP. In the first case, SNPs were represented by only one probe, meaning that only two different alleles were known for them. This was the most common case. Then, for SNPs that had been classified in ω1 or ω2, allele identification came straight from Table 2. If a field sample was studied using FlexiChip or ResMalChip and the hybridization signal for the RES16 SNP was found to belong to the "red" class, the Pfdhfr gene from this sample was identified as mutant at position 16. The second scenario refers to SNPs that had only two known alleles but were represented by two different probes on the slide, in order to strengthen the identification process. Then, if one of the probes was classified as "weak signal" or "bg", the result of the other probe was taken into account. If both probe signals were valid (d(x) = ω1 or d(x) = ω2), the coherence between the probes was checked, and in case of conflicting results the SNP was declared "indeterminate".
The last scenario refers to the situation where more than two different alleles exist for a given SNP. Thus, three or four different alleles must be discriminated with two colours only. For these particular SNPs, two different probes were designed and the corresponding targets were labelled with two different combinations of Cy3 and Cy5 labelled ddNTP's, as explained in the experimental protocol section. For example, position 108 on gene Pfdhfr is represented by two probes on the array, RES108 and RES108B. The first probe distinguishes between the wild type allele and either mutantA or mutantB. The second probe distinguishes between mutantA and either wild type or mutantB. In such a case, allele identification was resolved according to the combination of both probe results. If one or both probe signals were classified as "weak signal", "bg" or "indeterminate", the SNP was declared "indeterminate"; otherwise it was determined according to Table 1. Mutually exclusive results for two such complementary probes led the associated SNP to be declared "indeterminate". As an example, this would be the case for the SNP RES108 if both probes gave "red" signals.
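The two-probe resolution for position 108 can be sketched as a lookup on the pair of probe colours. Table 1 is not reproduced here, so the colour assignment in the Python example below is an assumption: it was chosen only so that a "red" signal on both probes is contradictory, as the text states. The names `RES108_TABLE` and `resolve_two_probe_snp` are hypothetical.

```python
VALID = {"green", "red"}

# Hypothetical colour table for position 108 of Pfdhfr, keyed by the
# (RES108, RES108B) probe colours. Assumption: RES108 is red for wild
# type only, and RES108B is red for mutantA only, so that (red, red)
# has no entry and is treated as mutually exclusive.
RES108_TABLE = {
    ("red", "green"): "wild type",
    ("green", "red"): "mutantA",
    ("green", "green"): "mutantB",
}


def resolve_two_probe_snp(probe_a, probe_b, table):
    """Combine two probe calls into one allele call.

    Any non-colour call ("weak signal", "bg", "indeterminate") on either
    probe, or a colour pair absent from the table, gives "indeterminate".
    """
    if probe_a not in VALID or probe_b not in VALID:
        return "indeterminate"
    return table.get((probe_a, probe_b), "indeterminate")


wild = resolve_two_probe_snp("red", "green", RES108_TABLE)
conflict = resolve_two_probe_snp("red", "red", RES108_TABLE)
missing = resolve_two_probe_snp("green", "bg", RES108_TABLE)
```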

Direct sequencing of PCR products
A set of samples was sequenced for the Pfdhfr, Pfcrt, Pfmdr1 and PfATPase6 genes. PCR products were purified using a P-100 Gel Fine solution (Biorad) and the Multiscreen MAVN45 kit system (Millipore). Sequencing reactions were performed on both strands using internal primers and ABI Prism BigDye Terminator chemistry. Sequencing reactions were run on an ABI Prism 3100 Genetic Analyzer (Applied Biosystems) at the Plate-Forme Génomique of Institut Pasteur in Paris, and analysed with Seqscape software v.2.0 (Applied Biosystems).

External quality control
Fifty P. falciparum isolates from Cambodia were tested blindly at the Mahidol Oxford Research Unit (MORU) according to its own protocols. Briefly, five SNPs of Pfmdr1 (positions 86, 184, 1034, 1042 and 1246) and one SNP of the Pfcrt gene (position 76) were screened with restriction fragment length polymorphism methods [6,7]. The Pfserca/Pfatpase6 gene was sequenced (4068 bp). Results were compared to the four SNPs tested with FlexiChip.

Comparison between ResMalChip, FlexiChip and sequence results
ResMalChip
Twenty-five gpr files generated by the Axon GenePix® Pro software were analysed. They included data from 10520 SNPs corresponding to 263 samples tested for 40 positions on five genes. The best compromise between the number of ambiguity rejections and the number of misclassifications was obtained with an optimal rejection threshold of C_r = 0.2. Of the 10520 SNP data handled by this algorithm, 1396 (13.3%) were classified as "weak signal" and 905 (8.6%) were rejected for ambiguity (they belong to ω0, the intermediate class between the "red" and "green" classes). Among the 1642 SNP data which could be compared to the sequence, 218 (13.3%) were classified as "weak signal" and 109 (6.6%) as "indeterminate" (inconsistency between replicates or ambiguity). Compared to sequencing, considered the "gold standard", good agreement was found, with a sensitivity of 96.63% and a specificity of 95.74%.

FlexiChip
Six gpr files were analysed, corresponding to 5000 SNP data from 125 samples (part of the previous 263 samples analysed with ResMalChip) tested for 40 positions on five genes. The optimal rejection threshold value C_r was also 0.2. Of the 5000 SNP data analysed by the algorithm, 332 (6.6%) were classified as "weak signal" and 222 (4.4%) as "indeterminate", i.e. two to three times less than with the ResMalChip array. Among the 1215 SNP data for which the sequences were available, 28 (2.3%) and 38 (3.1%) were respectively considered as "weak signal" and "indeterminate". Sensitivity and specificity were 95.88% and 97.68% respectively.
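For reference, sensitivity and specificity against sequencing follow the usual definitions, sketched below in Python. The confusion counts are hypothetical, since the underlying table is not given in the text, and treating the mutant allele as the "positive" class is an assumption. The percentage helper simply re-derives the "weak signal" and "indeterminate" rates quoted above from the reported counts.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    "Positive" is assumed to mean a mutant allele by sequencing.
    """
    return tp / (tp + fn), tn / (tn + fp)


def percent(part, total):
    """Percentage of `part` in `total`, rounded to one decimal."""
    return round(100.0 * part / total, 1)


# Hypothetical confusion counts, for illustration only.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=190, fp=10)

# Rates recomputed from the counts reported for FlexiChip.
weak_rate = percent(332, 5000)
indeterminate_rate = percent(222, 5000)
```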

ResMalChip versus FlexiChip
A total of 3,078 SNP data corresponding to 81 samples and 38 positions on five genes were available in both the ResMalChip and FlexiChip datasets. Among them, 2,195 SNP data were interpretable ("red" or "green" signal) with both techniques. An identical diagnosis was found for 2,034 (92.7%) of the SNP data (kappa coefficient 0.8491 with a standard error of 0.0213). When the results were considered on a gene-by-gene basis (Table 2), very good agreement was found for the dhfr, crt, atpase and mdr1 genes. The main discrepancies were observed for the dhps gene.
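The kappa coefficient corrects the observed agreement for the agreement expected by chance. The Python sketch below computes Cohen's kappa from a square agreement table; the 2 × 2 counts are hypothetical because the paper's cross-tabulation is not given in the text, while the last line recomputes the reported 92.7% raw agreement from the stated counts.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table.

    table[i][j] counts items called class i by one method and class j
    by the other.
    """
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical counts: rows = method A (green, red), columns = method B.
kappa = cohen_kappa([[40, 5], [10, 45]])

# Raw agreement recomputed from the counts reported in the text.
agreement_percent = round(100.0 * 2034 / 2195, 1)
```

In the hypothetical table the observed agreement is 0.85 and the chance agreement 0.5, giving a kappa of 0.7.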

Comparison between FlexiChip and Mahidol-Oxford Research Unit (MORU) results
Fifty isolates were tested for eight SNPs in parallel at MORU with standard methods and with the FlexiChip. Among the 400 SNP data, 34 (8.5%) were classified as "weak signal" or "indeterminate". Among the 366 remaining SNP data, results were identical with both techniques in 365 cases (99.7% specificity, 91.5% sensitivity, kappa coefficient 0.9923 with a standard error of 0.0523).

Discussion
Molecular tools are essential for monitoring the emergence and spread of anti-malarial drug resistance and are part of the strategies described by the World Wide Anti-malarial Resistance Network (WWARN) consortium [22,23]. Correlation of molecular markers with in vivo and in vitro drug resistance has been clearly established for dhfr/dhps (sulphadoxine-pyrimethamine) and pfcrt (chloroquine) mutations, mdr1 (chloroquine, mefloquine) and cytochrome b (atovaquone). The microarray method described in this paper makes it possible to implement molecular monitoring on a large scale because the results can be analysed and interpreted automatically. The aim of this project was to evaluate the flexible microarray under practical conditions using field isolates, in which multiple infections are frequently observed. Without any dye bias on the array, spots associated with mixed alleles should exhibit a "yellow" signal corresponding to a mix of red and green signals. In the framework of the proposed mathematical model, these spots should then fall in the intermediate class ω0. Thus, this class would be used to detect mixed infections instead of indeterminate ones. However, this mathematical property of the model could not be fully validated for several reasons.
First, in the current study some SNPs showed no polymorphism in the processed samples. Indeed, field samples were sequenced for 20 SNPs out of the 40 that were genotyped on the array. Among these 20 sequenced SNPs, only five showed polymorphism, with only one having both alleles in (almost) equal amounts. Therefore, a dye bias on the signals measured on FlexiChip cannot be excluded; such a bias would prevent mixed signals from behaving as expected by the mathematical model. Second, the gold standard with which FlexiChip results were compared is sequencing. This method may not be the best one in the case of mixed alleles because chromatograms may be difficult to interpret, leading to erroneous sequences. The parameters of the mathematical model are derived on an array-by-array basis in order to adapt to possible technical variability between arrays, so they also depend on the proportion of single and mixed infections that are hybridized on the array. As most of the field samples analysed in this study were not polymorphic, the model behaviour in the case of a majority of mixed alleles cannot be predicted, but the model will most likely have to be adapted to match the data distribution in that particular case. Finally, the current design of FlexiChip makes it non-exhaustive, as the use of only two colours for most of the monitored SNPs makes it unable to detect all the mutations. The use of two mix combinations for the SNP located at position 108 on the pfdhfr gene led to a good classification rate of 100%. It is clear that extending the concept to the whole set of SNPs would increase the reliability of the base-calling process, even in the case of mixed infections. Nevertheless, the ResMalChip microarray has already been used in a setting of complex malaria infections [24].
Combined with the FlexiChip microarray, the software provided a sensitivity and specificity of 95.88% and 97.68% respectively when compared to sequencing as the reference method. Moreover, the method performed well when compared to results obtained in a reference laboratory, with 99.7% concordance (kappa coefficient 0.9923 with a standard error of 0.0523).
The proposed package can be useful for epidemiological surveys and can give information on the dynamics of emergence and spread of genetic markers in time and/or in space. However, the method cannot be used as an immediate diagnostic tool for individual samples, because the format requires a high number of samples tested at one time to be cost effective.
In contrast to previous methods, FlexiChip is no longer dedicated to a single set of genes and/or organisms. Thanks to its flexibility, integrating new SNPs linked to anti-malarial drug resistance is simpler, and species identification can now be added. It is easy to adapt to other loci, and in particular to SNP detection in other organisms such as HIV or multidrug-resistant tuberculosis strains. Moreover, the FlexiChip package is ready for use and adaptable to large-scale studies to validate new molecular marker candidates.

Concluding remarks
One of the major obstacles for implementation of molecular monitoring of resistance lies in the absence of practical tools for high throughput analysis. Universal microarrays such as FlexiChip could help to change this, as they are adapted to processing of numerous samples and easily adaptable to new markers. Furthermore, they are well suited for molecular biology laboratories from endemic countries, which need a robust and simple tool that could be easily adapted to a specific epidemiological situation.