hmmIBD: software to infer pairwise identity by descent between haploid genotypes

Background A number of recent malaria studies have used identity by descent (IBD) to study epidemiological processes relevant to malaria control. In this paper, a software package, hmmIBD, is introduced for estimating pairwise IBD between haploid genomes, such as those of the malaria parasite, sampled from one or two populations. Source code is freely available. Methods The performance of hmmIBD was verified using simulated data and benchmarked against an existing method for detecting IBD within populations. Code for all tests is freely available. The utility of hmmIBD for detecting IBD across populations was demonstrated using Plasmodium falciparum data from Cambodia and Ghana. Results Alongside an existing method, hmmIBD was highly accurate, sensitive and specific. It is fast, requiring only 70 s on average to analyse 50 whole genome sequences on a laptop computer, and scales linearly in the number of pairwise comparisons. Treatment of different populations under hmmIBD improves detection of IBD across populations. Conclusion Fast and accurate software for detecting IBD in malaria parasite genetic data sampled from one or two populations is presented. The latter will likely be a useful feature for malaria elimination efforts, since it could facilitate identification of imported malaria cases. Software is robust to possible misspecification of the genotyping error and the recombination rate. However, exclusion of data in regions whose rates vary greatly from their genome-wide average is recommended. Electronic supplementary material The online version of this article (10.1186/s12936-018-2349-7) contains supplementary material, which is available to authorized users.


Introduction
In this document we outline a first-order hidden Markov model (HMM) to estimate identity by descent (IBD) between pairs of haploid genotypes from the same or different populations. Notation follows hmmIBD code and that of [2]. Consider a pair of haploid genotypes. At each position along the genome, the pair is considered to be in one of two hidden states, IBD or not IBD. We use genetic data to infer the hidden states. More specifically, for t = 1, . . . , T genotyped positions, we observe whether the allelic types of the genotype calls of the first and second sample (G It and G IIt , respectively) are the same (homo if G It = G IIt ) or different (het if G It = G IIt ), as illustrated in Table 1, and use the resulting sequence of observations, O = (O 1 , . . . , O T ), to infer the hidden states at the corresponding positions, Q = (q 1 , . . . , q T ). Since the genotype calls may differ from the true underlying genotypes (g It and g IIt , respectively), we include an error term, . The model makes the following assumptions.
1. Conditional on the t − 1th state, the tth state is independent of the t − 2th, . . . , 1st states.
2. Conditional on underlying genotypes, genotype calls are independent.
3. Underlying genotypes are dependent given IBD and independent otherwise. 4. The probability of an incorrect genotype call (calling one allelic type as another), , is constant, such that the probability of a correct genotype call, (1 − γ t ), decreases with the number of alternative genotypes at the tth position, γ t .
5. The recombination rate ρ is uniform across and between genomes.
6. Perfect knowledge of genotype frequencies, f gI = (f gI 1 . . . f gI T ) and f gII = (f gII 1 . . . f gII T ), given frequencies of the observed genotype calls, where f GI and f GII are either based on the observed data or supplied by the user.
7. The number of generations separating a pair of haploid genotypes, k, is constant across the genome for a given pair of haploid genotypes, such that all IBD segments in a pair of genomes have experienced the same opportunity for recombination.
8. No mutations have occurred in IBD segments.
9. If two samples come from different populations, the prior over the ancestral population given IBD is uniform.
The model supports pairwise comparisons of samples from the same or different populations. If the two samples come from the same population, let f GI t denote the frequency of the observed genotype at the tth position of the first sample across a given pairwise comparison, and f GII t that of the second sample. If two samples come from the different populations, let sample one be from the first population with genotype frequency f p1 GI t , and sample two be from the second population with genotype frequency f p2 GII t .

Hypothetical dataset
Model inputs for sid1:sid2

Model specification
The HMM bmλ is first-order, discrete, and heterogeneous over t = 1, . . . , T . It is fully specified by the following.
2. M = 2 observation symbols: . This formulation accommodates observations from within population multiple-genotype samples in which two unphased genotypes reside up to a multiplicative factor of 2.
3. N initial state probabilities, π = (π 1 , π 2 ), where π i = P(q 1 = S i ) for i = 1, 2 and These are initially set to 0.5, then updated to the marginal posterior probabilities of the hidden states, as P(S 1 | O, λ) is recalculated upon successive fitting iterations of the model, described below (Section 3).
4. An N by N matrix of transition probabilities, Transition probabilities are dependent on the genome position because they vary with the distance in base pairs (bp) of the tth observation from the t − 1th observation, d t . They also depend upon π 1 and π 2 , the generation number, k, and the recombination rate, ρ (bp −1 ), which are fixed over t = 1, . . . , T . This formulation allows for the transition probabilities to converge to initial probabilities when d t , ρ, or k are large, 5. An N by M matrix of observation probabilities, That is to say, we assume a uniform prior over the ancestral population given IBD. Note that when both samples come from the same population (p1 = p2), (2) to (5) can be reduced to (6) to (9), sincef Gt = f GI t and f p2 The observation probabilities are dependent on the genome position because the alternative genotype count and frequencies vary over the genome. Note that M k=1 b ik (t) = 1 since {b jk } encompass all possibilities by allowing f GI t and f GII t to represent the observed genotype calls (see Section 4 for an example derivation of observation probabilities at a triallelic position for two samples from the same population). This formulation exploits the fact that, regardless of γ t , at any given position t, only two genotypes can be called for a pair of haploid genotypes, and so given that frequencies must sum to one, only two frequencies, f GI t and f GII t , need to be defined.
The model framework is summarized in Figure 1. Default values in the code of the parameters and bounds are as follows. The recombination rate, ρ = 7.4 × 10 −7 M bp −1 , is that of Plasmodium falciparum [1]. The number of generations, k, is inferred under the model (see below), but can be capped by the user. The distances d = d 1 , . . . , d T are calculated from the data as illustrated in Table 1. We skip positions < 5bp apart to avoid mutations spanning >1 bp, while positions on different chromosomes are considered infinitely separated. In the code, the latter is achieved by fitting data for different chromosomes separately under a given iteration of the model (in other words, data from all chromosomes are fit using common values of π and k). The genotyping error is specified as = 0.001. Frequencies f GI and f GII and alternative alelle counts γ are either based on input data or on external information provided by the user. To accommodate indels, the maximum value of γ t is assumed to be 8 but, as with all other specified values, can be changed within the code. In is important to note that all genotype calls, including indels, are treated as point mutations under the model. 2. Over a given number of fitting iterations, τ = 1, . . . , τ max , which is capped at a user-settable maximum τ max that defaults to 5, we update λ via parameters k and π using the Baum-Welch method [2]. Note that the fitting iterations will stop before the cap if the default convergence criteria are met. Specifically, while τ ≤ τ max or default convergence criteria are not met, 3. Having fit π and k, we retrieve the most probable sequence of hidden states, argmax P(Q | O, λ), using the Viterbi algorithm [2].
Observations with one or more missing genotype calls are considered missing (NA in Table  1). Missing observations are easily accounted for within the HMM framework by simply omitting the respective observation probability terms in likelihood calculations. In the code, this is achieved by setting P(O t = NA|q t = S i ) = 1 ∀ i = 1, 2. Practical details regarding implementation, for example how to compile the code, can be found in the corresponding ReadMe file at https://github.com/glipsnort/hmmIBD/releases.

Example derivation of observation probabilities for within population samples
Consider a triallelic position where λ t = 2. Here we show how to derive the observation probabilities equations (6) to (9). For brevity, we henceforth drop the subscript t. Let the set of three possible genotype calls at the triallelic position be denoted by {A, B, C}, and the equivalent set of genotypes be denoted by {a, b, c}. The probability of a genotype call, G ∈ {A, B, C}, given an underlying genotype, g ∈ {a, b, c}, is given by, Now let us consider a pair of genotype calls, (G I G II ) where G I ∈ {A, B, C} and G II ∈ {A, B, C}, and genotypes, (g I g II ) where g I ∈ {a, b, c} and g II ∈ {a, b, c}. The probabilities of the genotype pairs given the hidden states, q ∈ {S 1 , S 2 }, are where f g is the frequency of a given genotype, due to assumed dependence between genotypes given IBD and independence given not IBD. And, since we are assuming conditional independence between genotype calls given underlying genotypes such that P G I G II | q = gIgII P(G I | g I )P(G II | g II )P g I , g II | q , where gI,gII denotes the summation over all possible genotype pairs, the probabilities of all possible homogeneous and heterogeneous comparisons of genotype calls given the hidden states are, f GI and f GII are the frequencies of the genotype calls in the first and second samples, respectively, in a heterogeneous comparison, equations (12) to (14) can be reduced to equation (6), equations (15) to (17) to equation (7), equations (18) to (23) to equation (8) and equations (24) to (29) to equation (9).