MinorityReport, software for generalized analysis of causal genetic variants

Background The widespread availability of next generation genome sequencing technologies has enabled a wide range of variant detection applications, especially in cancer and inborn genetic disorders. For model systems and microorganisms, the same technology may be used to discover the causative mutations for any phenotype, including those generated in response to chemical perturbation. In the case of pathogenic organisms, these approaches have allowed the determination of drug targets by means of resistance selection followed by genome sequencing. Results MinorityReport is open source software written in python that facilitates the comparison of any two sets of genome alignments for the purpose of rapidly identifying the spectrum of nonsynonymous changes, insertions or deletions, and copy number variations in a presumed mutant relative to its parent. Specifically, MinorityReport relates mapped sequence reads in SAM format output from any alignment tool for both the mutant and parent genome, relative to a reference genome, and produces the set of variants that distinguishes the mutant from the parent, all presented in an intuitive, straightforward report format. MinorityReport features tunable parameters for evaluating evidence and a scoring system that prioritizes reported variants based on relative proportions of read counts supporting the variant in the mutant versus parent data sets. The utility of MinorityReport is demonstrated using previously published publicly available data sets to find the determinants of resistance for novel anti-malarial drugs. Conclusions MinorityReport is readily available (github: JeremyHorst/MinorityReport) to identify the genetic mechanisms of drug resistance in Plasmodium, genotype-phenotype relationships in human diads, or genomic variations between any two related organisms. Electronic supplementary material The online version of this article (doi:10.1186/s12936-017-1730-2) contains supplementary material, which is available to authorized users.


Sequencing to characterize genotype-phenotype relationships
The advent of robust high throughput genetic sequencing methods has made genome sequencing tractable for any sample from any organism. Comparison of genomes for related organisms (within the same lineage of a species) with different physiologic, behavioural, or anatomic phenotypes has been increasingly used to characterize the genetic determinants of the feature, i.e. genotype-phenotype relationships. Differences in the genome sequences

How to
To perform this analysis, genomic sequencing reads from each organism (e.g. in FASTQ format) are first aligned to the reference genome FASTA file, resulting in Sequence Alignment/Map (SAM) format files [7]. The observed variations from the reference genome are then evaluated for sufficient supporting evidence by quantity and proportion of reads in the mutant, and support for non-variation in the parent. The variants that pass these threshold filters are then placed in the context of coding sequences, defined in gene file format files (GFF3), to determine whether the variant changes the gene product, i.e. is nonsynonymous (Fig. 2). Variants observed consistently in multiple independent pairs of organisms that differ in the same manner are likely to be the causal genetic variant that underlies the phenotype. In the case of drug resistance selections, the underlying assumption for a mutation or copy-number variant which leads to resistance is that it will be under purifying selection unlike other variants present in the population at the time of the initial selection.
This process requires the reference genome and gene model, and deep sequencing data for unaffected parents and the affected child for humans, or similarly the drugsensitive parent strain and the derivative drug-resistant mutants for pathogens. The reference genome is required as a contextual template to align the sequencing reads and thereby identify genetic variants. The gene model is needed to identify whether a variant is in a protein-coding region, and whether the variant changes the resulting protein sequence.
While this process is straightforward in concept, many bioinformatic analysis computer programs work only on one or an arbitrary set of organisms, e.g. human and mouse, or do not integrate both mutation and copynumber variant analysis into a unified pipeline. A few robust generalized tools are available to call variants between one set of sequencing reads against any reference genome, and a subset of these call nonsynonymous variants. However, these tools require extensive data preparation, have software library dependencies that complicate installation, are sensitive to the many imperfections in gene models for the many poorly characterized organisms (i.e. they fail to complete running or give any output), and they do not perform comparative analysis between multiple sequencing sets. In practice, many researchers cobble together pipelines based on several different tools to accomplish this task.

Freedom of alignment
Alignment algorithms continue to evolve at rapid pace, and the most recent iterations capture alignments to much more dissimilar sequences with inexact matches and significant insertions or deletions (indels). e.g. Bow-Tie2, handle indels of limited size, and further sequence differences [8]. Algorithms such as STAR [9] and GSNAP [10] handle extensive indels, and highly distinct but still recognizably similar read-genome sequence matches. These advances are useful to address the genotype-phenotype problem. Fortunately, these and other genome alignment algorithms have converged on a standard SAM format [7] which enable downstream applications, such as the one described here, to be independent of the base aligner. Fig. 1 Experimental scheme. MinorityReport deciphers nonsynonymous mutations, including those that enable drug resistance in pathogens. A susceptible pathogen strain (green) is cultivated in sub-LD100 concentrations of a drug until a resistant strain (blue) emerges. The genomes are sequenced, aligned to a reference genome, and MinorityReport outputs the causative mutation, which is often also the drug target

Copy number variants
Another major type of genetic anomaly that can underlie genotype-phenotype relationships is a gene copy number variant (CNV). When the function of a gene becomes particularly adaptive, useless, or harmful, environmental pressures can select for the replication or loss of any number of copies of a gene or set of genes [11]. Because CNVs affect gene dosage, they have the potential for physiologic impact. Various approaches are used to detect CNVs on a large scale, with the primary clinical tool being array comparative genomic hybridization [12]. Algorithms have been constructed to use indirect evidence to evaluate CNVs, such as inference through analysis of single nucleotide polymorphism (SNP) chips [13]. However, as with all microarray technologies, next generation genome sequencing technologies are replacing the use of array CGH and SNP chips due to their portability to any biological system, and unbiased sequence data generation. The normalized abundance of sequencing reads that map to each gene is a direct measure of the number of copies of the gene. CNVs can be estimated using sparse read coverage data sets as well [11]. However, the enormous number of reads accessible through contemporary sequencing instruments (e.g. the Illumina HiSeq series) makes direct assessment through deep coverage straightforward.

Results and discussion
MinorityReport facilitates the comparison of any two sets of genome alignments for the purpose of rapidly identifying the spectrum of nonsynonymous or indel changes in a presumed mutant relative to its parent. In addition, the data model for detecting nonsynonymous variants integrates rapid assessment of CNVs. Tunable parameters for both analyses enable reporting of high or low purity nonsynonymous mutations, and large or small range CNV detection. This software does not address noncoding variants such as promoter or enhancer element mutations, intergenic variants that may occur in encoded functional RNAs, nor chromosomal rearrangements, which may be assessed with the same sequencing data. Algorithms deployed in this software are described, and utility is demonstrated on freely available data sets that carry the genetic determinants of resistance for anti-malarial drugs.

Nonsynonymous variant caller algorithm
A simplified flow chart of the nonsynonymous variant caller algorithm in MinorityReport is shown in Fig. 2. FASTA and GFF3 files for the reference genome are taken as input to build the gene model. Indices for genes, exons, and splice variants are taken from coding sequence lines (those with CDS entered in the phase column). Gene descriptions are taken from gene lines. Reverse compliment indices are translated for genes present on the negative strand (python list indices are inclusive at the beginning but exclusive at the end, so must be shifted 1 nucleotide towards the 3′ direction). Sequences are found from these indices in the reference genome FASTA file, and checked for start and stop codons. Next, a hash table (python dictionary) is made from the reference genome FASTA file to store the sequence evidence: for each sequence position in each chromosome a key is created for each of the possible 4 nucleotides, and a zero value is entered (indels are added later). Sequence Alignment/Map files of sequencing reads aligned to the reference genome for the parent and mutant are each taken as input and applied into separate sequence evidence hash tables. For each mapped read pair, the mate pair read is checked for matched mapping location. The CIGAR entry is read for presence of insertions or deletions (indels) in the read. The mapped strand is extracted from the binary flag entry. Finally, each nucleotide in each paired read is added into the sequence evidence hash table. Insertions and deletions are added to the position entry as they occur, and counted the same way as nucleotides. Insertions are denoted in the direction of the positive strand; the first nucleotide in the insertion entry is technically not part of the insertion, but is counted in the insertion to enable a consistent notation system.
Next, the sequence evidence hash tables are scanned for nonsynonymous variants that appear in the mutant but not the parent. Chromosomes are read in turn. Then evidence thresholds are applied to each entry of the position in a progression to maximize efficiency: variant of the reference genome, supra-threshold number of mutant reads and fraction of reads at the position support the variant, supra-threshold number of parent read coverage, subthreshold number of parent reads and fraction of reads at the position support the variant, the variant is in a coding sequence region, and the variant changes an amino acid in the translated protein sequence (nonsynonymous). Variants passing these criteria are reported (Table 1). Each reported variant is assigned a priority score as follows:  Transportin mutation mutant refers to the quantity of sequencing reads that contain the variant in the mutant sample, and similarly mutation parent refers to the same in the parent sample. nonmutation mutant refers to the quantity of reads that contain anything other than the variant in the mutant sample. position variables refer to the total number of reads overlapping the position of the variant in each sample, and µ refers to the mean number of reads overlapping any position in the genome for each sample.
The rationale for the terms in the priority score calculation is as follows. The relative abundance of sequencing reads that identify a nucleotide or indel (variant) comprises the evidence for the variant. Reads that identify any other allele at the position, and reads identifying the same variant in the parent strain comprise the evidence against the variant being novel and causal. Low coverage in either the parent or mutant samples would decrease confidence in the variant, so high relative coverage of the samples increases prioritization.

CNV caller algorithm
As each position in each chromosome is scanned for nonsynonymous variants unique to the mutant, the total counts in each of parent and mutant for each position are applied to a CNV evidence model. After the nonsynonymous variant scan is complete for each chromosome, the genome is evaluated for regions with significant differences in median read coverage between mutant and parent. The probability of the parent or mutant being significantly high or low in copy number (gene abundance) is estimated by relating the number of reads that map to the position to the mean and standard deviation for the chromosome, i.e. the z-score. Positions are considered significant when a number of reads are mapped in the mutant that correspond to being above or below a probability threshold or number of copies, with respect to the parent.
Copy number variants are calculated as the ratio of reads that map to each sliding window in the mutant and parent data sets, normalized by the total number of reads for the parent and mutant. The probability of observing a CNV at random for each variant is estimated by placing the log2 transform in the context of log2 normalized read ratios for all sliding windows assessed across the genome. The resulting z-score (z = (x-mean)/standard deviation) is transformed to a probability estimate by the Gaussian transformation.
The CNV-seq algorithm is widely used to calculate CNVs from sequencing data [11]. The CNV-seq algorithm calculates the probability for each observed CNV occurring at random using the covariation of the read numbers in the parent and mutant data sets, with the Geary-Hinkley transformation. As seen in Figs. 3, 4 and 5, read coverage is not homogenous. Areas with deviations in coverage due to external processing factors, such as the non-uniform distribution of so-called "random" hexamer-nucleotide primers, or the inaccuracy of genome alignments, may become inflated or diminished using this approach. The total amounts of reads in each set are also the result of processing factors. In attempt to simplify, the statistical probability of observing read ratios normalized by the total read count is calculated by applying a Gaussian distribution to the entire read set, i.e. including all chromosomes. Recent work analysing genetic crosses of P. falciparum laboratory strains employed a hidden Markov permutation of the Gaussian distribution to improve CNV significance estimation [14]. While read coverage does not follow a normal distribution, at present, hidden Markov modelling requires library dependencies that are obtrusive to use, and so the software presently uses a less ideal distribution model.

Compound SJ733
The discovery of a novel anti-malarial agent was recently reported [2], and is now being tested for safety in humans.
A key milestone for understanding the mechanism of action for SJ733 and its derivatives was to identify molecular determinants of resistance. Resistant sub-clones of a susceptible parent strain were selected by long incubations in concentrations of the compound that allowed very few organisms to live, and identified the resistance loci by comparing genome sequence data between parent and resistant sub-clones (mutants). Comparing genomic sequencing reads between parent and mutants revealed a few variants in most mutants, while only one gene contained variants in all mutants: the gene encoding PfATP4, a sodium-dependent ATPase transporter [2].
The process of simultaneously assessing copy number and nonsynonymous variation in the search for the mutation that enables resistance to SJ733 is illustrated in Fig. 3. For a haploid genome under drug selection, the underlying genetic assumption is that the causal mutant allele will be driven to purity in the population of surviving cells, and thus it is reasonable to set an absolute threshold of at least 90% for the proportion of reads that are variant at any given position. This threshold is a tunable parameter and would clearly need to be below 50% for a diploid genome, for example.
The sequencing data for this experiment are available through the NCBI Short Read Archive (SRA): SRX644366 is the mutant S83, SRX643297 is the parent strain, and other resistant strains derived in parallel independent experiments can be found through SRP043648.
The list of nonsynonymous variants found by Minori-tyReport (Table 1) demonstrates the coincident emergence of mutations in the same gene from 5 independent selection experiments. In fact, in 3 of 5 selections, the only mutation that passes these basic thresholds are those in PfATP4 [2]. Note that relaxing the mutant proportion threshold would reveal several high scoring nonsynonymous variants, but essentially all of these belong to highly repeated gene families, such as the PfEMP1 family of antigenic variation genes, which are inherently prone to spurious alignment. The appearance of these non-causal variants, such as transportin, in the data set exemplifies the rare occurrence of real or artifact passenger mutations that may be reported, and also emphasizes the utility of analysing several independent selections. Also, the CNV assessments differ between the granular genome-wide scan performed with 3000 nucleotide-long tiles (top) versus the chromosome-specific scan performed with 300 nucleotide-long tiles (middle); while this was done to construct an illustrative figure with a reasonable amount of data, it shows that tile size affects the appearance of copy number and likelihood calculations: larger tiles are recommended for more relevant probability values. In the case of PfATP4, our analysis produced no evidence of copy number variation associated with resistance to the compound.

Apicoplast inhibitor MMV-08138
The mechanism of action of a Plasmodium apicoplast inhibitor MMV-08138 was recently reported [3]. As with SJ733, the genome sequence of sub-clones raised for resistance to the drug were compared against the susceptible parent strain. A single nonsynonymous variation (IspD L244I) not observed in the parent strain emerged, and reached nearly complete (98.9%) purity in the resistant mutant shown in Fig. 4. With the default threshold settings, 23 putative variants are output for this sample by MinorityReport (Additional file 1: Table S1). As well, the causal mutant receives over twice the priority score of any other variant.

Dihydroorotate dehydrogenase inhibitor DSM1
We also previously demonstrated that P. falciparum can achieve drug resistance to the dihydroorotate dehydrogenase inhibitor, DSM1, and perhaps generally, by randomly amplifying a large AT-repeat flanked region of the genome, then specifically further amplifying the region in progeny that successfully evade the drug effects [1]. Amplified regions likely contain the drug target, which can also contain nonsynonymous mutations that further aid resistance. MinorityReport successfully identified the amplified region (Fig. 5), and 4 nonsynonymous mutations (Additional file 2: Table S2) that were previously reported.

Conclusions
MinorityReport is a robust tool to rapidly identify the genetic differences between any two closely related organisms. It relates mapped sequence reads in SAM format from any alignment tool for both the mutant and parent genome, relative to a reference genome, and produces the set of variants that distinguishes the mutant from the parent. Output includes the spectrum of nonsynonymous changes, insertions or deletions, and copy number variations in a presumed mutant relative to its parent. Tunable evidence filters prioritize reported variants based on relative proportions of read counts supporting the genetic variant in the mutant versus parent data sets. Since MinorityReport is organism agnostic, it will be useful for both haploid and diploid genomes, as well as microorganisms and viruses. The software is freely available, open source, and only requires Python with libraries included in standard distributions (re, math, sys, and os). The R code for producing the data graphics shown in the Figures is also included in the github distribution (this has software library dependencies), and a "master" script that breaks the calculation into smaller parts by chromosome and launches them in parallel for faster calculation of nonsynonymous variants on machines with more processor cores than there are chromosomes in the genome.  Fig. 3. These data show a significant copy number variation from nucleotide 79,100 through 152,500 of chromosome 6, which determines resistance to DSM1