Li 2011
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
in genotypes may lead to a bulk of errors. From another angle, however, when discovering rare mutations, we only care about the difference between samples. Genotypes are just a way of measuring the difference. Is it really necessary to go through the genotype calling step?

This article explores the answer to these questions. We will show in the following how to compute various statistics directly from sequencing data without knowing genotypes. We will also evaluate our methods on real data.

2 METHODS

This section presents the precise equations on how to infer various statistics

2.1.1 Site independency We assume data at different sites are independent. This may not be true in real data, because sequencing and mapping are context dependent; when there is an insertion or deletion (INDEL) error or INDEL polymorphism, sites nearby are also correlated in alignment. Nonetheless, most of the existing methods make this assumption for simplicity. The effect of site dependency may also be reduced by post-filtering and properly modeling the mapping and alignment errors (Li et al., 2008; Li, 2011).

2.1.2 Error independency and sample independency We assume that at a site the sequencing and mapping errors of different reads are independent. As a result, the likelihood functions of different individuals are independent:

$$L(\theta) = \prod_{i=1}^{n} L_i(\theta) \tag{1}$$

In real data, errors may depend on sequence context (Nakamura et al., 2011), so the independency assumption may not hold. It is possible to model error dependency within an individual (Li et al., 2008), but the sample independency assumption is essential to all the derivations below.

Table 1. Common notations

Symbol    Description
$n$       Number of samples
$m_i$     Ploidy of the $i$-th sample ($1 \le i \le n$)
$M$       Total number of chromosomes in samples: $M = \sum_i m_i$
$d_i$     Sequencing data (bases and qualities) for the $i$-th sample
$g_i$     Genotype (the number of reference alleles) of the $i$-th sample ($0 \le g_i \le m_i$)^a
$\phi_k$  Probability of observing $k$ reference alleles ($\sum_{k=0}^{M} \phi_k = 1$)

2.1.3 Biallelic variants We assume all variants are biallelic. In the human population, the fraction of triallelic SNPs is ∼0.2% (Hodgkinson and Eyre-Walker, 2010). The biallelic assumption does not have a big impact on the modeling of SNPs, though it may have a bigger impact on the modeling of INDELs at microsatellites.

2.2 Computing genotype likelihoods
For one sample at a site, the sequencing data $d$ is composed of an array of bases on sequencing reads plus their base qualities. As we only consider biallelic variants, we may focus on the two most evident types of nucleotides and drop the less evident types if present. Thus, at any site we see at most two types of nucleotides. This treatment is not optimal, but sufficient in practice.

Suppose at a site there are $k$ reads. Without loss of generality, let the first $l$ bases ($l \le k$) be identical to the reference and the rest be different. The

$$L(\psi) = \sum_{g_1}\cdots\sum_{g_n}\prod_{i}\Pr\{d_i,g_i\mid\psi\} = \prod_{i=1}^{n}\sum_{g=0}^{m_i} L_i(g)\,f(g;m_i,\psi) \tag{3}$$

where $L_i(g_i)$ is computed by Equation (2) and

$$f(g;m,\psi) = \binom{m}{g}\psi^{g}(1-\psi)^{m-g} \tag{4}$$

is the probability mass function of the binomial distribution $\mathrm{Binom}(m,\psi)$.

Knowing the likelihood of $\psi$, we may numerically find the max-likelihood estimate with, for example, Brent's method (Brent, 1973). An alternative approach is to infer it using an expectation–maximization (EM) algorithm, regarding the sample genotypes as missing data. Given the estimate $\psi^{(t)}$ at the $t$-th iteration, the estimate at the $(t+1)$-th iteration is

$$\psi^{(t+1)} = \frac{1}{M}\sum_{i=1}^{n}\frac{\sum_{g} g\,L_i(g)\,f(g;m_i,\psi^{(t)})}{\sum_{g} L_i(g)\,f(g;m_i,\psi^{(t)})} \tag{5}$$

where $M = \sum_i m_i$ is the total number of chromosomes in samples.

When the signal from the data is strong, or equivalently when for each $i$ one of the $L_i(g)$ is much larger than the others, the EM algorithm converges faster than the direct numerical solution using Brent's method. However, when the signal from the data is weak, the numerical method may converge faster than EM (Kim et al., 2011). In implementation, we apply 10 rounds of EM iterations. If the estimate does not converge after 10 rounds, we switch to Brent's method.

2.3.2 Estimating the genotype frequencies In this section, we assume all samples have the same ploidy: $m = m_1 = \cdots = m_n$ and aim to estimate $\xi_g$, the frequency of genotype $g$. The likelihood of $\{\xi_0,\ldots,\xi_m\}$ is:

$$L(\xi_0,\ldots,\xi_m) = \prod_{i=1}^{n}\sum_{g=0}^{m} L_i(g)\,\xi_g \tag{6}$$

with the constraint $\sum_g \xi_g = 1$. The EM iteration equation is

$$\xi_g^{(t+1)} = \frac{1}{n}\sum_{i=1}^{n}\frac{L_i(g)\,\xi_g^{(t)}}{\sum_{g'=0}^{m} L_i(g')\,\xi_{g'}^{(t)}} \tag{7}$$
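As an illustration of the two EM schemes above, the following Python sketch iterates the update of Equation (5) for the site allele frequency and the corresponding update for the genotype frequencies implied by the likelihood in Equation (6). It is a minimal sketch, not the author's implementation: the function names, the layout `gl[i][g]` (holding $L_i(g)$, with $g$ the number of reference alleles), the starting values and the tolerance `tol` are all assumptions made for the example, and the fallback to Brent's method is omitted.

```python
from math import comb

def binom_pmf(g, m, psi):
    """f(g; m, psi) of Equation (4)."""
    return comb(m, g) * psi**g * (1.0 - psi)**(m - g)

def em_site_allele_freq(gl, ploidy, psi=0.5, rounds=10, tol=1e-8):
    """Iterate Equation (5); gl[i][g] is L_i(g), ploidy[i] is m_i (hypothetical layout)."""
    M = sum(ploidy)  # total number of chromosomes
    for _ in range(rounds):
        total = 0.0
        for L, m in zip(gl, ploidy):
            w = [L[g] * binom_pmf(g, m, psi) for g in range(m + 1)]
            # posterior expectation of the genotype of this sample
            total += sum(g * w[g] for g in range(m + 1)) / sum(w)
        new_psi = total / M
        if abs(new_psi - psi) < tol:
            return new_psi
        psi = new_psi
    return psi

def em_genotype_freq(gl, m, rounds=100, tol=1e-8):
    """EM for genotype frequencies xi_0..xi_m, all samples sharing ploidy m."""
    n = len(gl)
    xi = [1.0 / (m + 1)] * (m + 1)  # uniform start (an assumption)
    for _ in range(rounds):
        new = [0.0] * (m + 1)
        for L in gl:
            w = [L[g] * xi[g] for g in range(m + 1)]
            s = sum(w)
            for g in range(m + 1):
                new[g] += w[g] / s  # posterior probability of genotype g
        new = [x / n for x in new]
        if max(abs(a - b) for a, b in zip(new, xi)) < tol:
            return new
        xi = new
    return xi
```

When every genotype is known with certainty (one-hot likelihoods), both functions reduce to simple counting, which is a convenient sanity check.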
one-degree $\chi^2$ test. This approach would not work for sequencing data as it does not account for the uncertainty in genotypes, especially when the average read depth of each individual is low. A proper solution is to perform a likelihood-ratio test (LRT). The test statistic is

$$D_e = -2\log\frac{L(\hat\psi)}{L(\hat\xi_0,\hat\xi_1,\hat\xi_2)} = -2\log\frac{L\big((1-\hat\psi)^2,\,2\hat\psi(1-\hat\psi),\,\hat\psi^2\big)}{L(\hat\xi_0,\hat\xi_1,\hat\xi_2)} \tag{8}$$

where

2.3.3 Estimating haplotype frequencies between loci In this section, we assume all samples are diploid. Given $k$ loci, let $h = (h_1,\ldots,h_k)$ be a haplotype where $h_j$ equals 1 if the allele at the $j$-th locus is identical to the reference, and equals 0 otherwise. Let $\eta_h$ be the frequency of haplotype $h$ satisfying $\sum_h \eta_h = 1$, where

$$\sum_h \eta_h = \sum_{h_1=0}^{1}\sum_{h_2=0}^{1}\cdots\sum_{h_k=0}^{1}\eta_{(h_1,\ldots,h_k)}$$

Knowing the genotype likelihood at the $j$-th locus for the $i$-th individual, $L_i^{(j)}(g)$, we can compute the haplotype frequencies iteratively with:

$$\eta_h^{(t+1)} = \frac{\eta_h^{(t)}}{n}\sum_{i=1}^{n}\frac{\sum_{h'}\eta_{h'}^{(t)}\prod_{j=1}^{k}L_i^{(j)}(h_j+h'_j)}{\sum_{h',h''}\eta_{h'}^{(t)}\eta_{h''}^{(t)}\prod_{j}L_i^{(j)}(h'_j+h''_j)} \tag{10}$$

When sample genotypes are all certain, this EM iteration is reduced to the standard EM for estimating haplotype frequencies using genotype data (Excoffier and Slatkin, 1995).

The time complexity of computing Equation (10) is $O(n\cdot 4^k)$ and thus it is impractical to estimate the haplotype frequency for many loci jointly. A typical use of Equation (10) is to measure LD between two loci.

2.3.4 Testing associations Suppose we divide samples into two groups of size $n_1$ and $n-n_1$, respectively, and want to test if Group 1 significantly differs from Group 2. One possible test statistic could be (Kim et al., 2010, 2011)

$$D_{a1} = -2\log\frac{L(\hat\psi)}{L^{[1]}(\hat\psi^{[1]})\,L^{[2]}(\hat\psi^{[2]})} \tag{11}$$

where $\hat\psi$ is the max-likelihood estimate of the site allele frequency of all samples [Equation (9)], and $\hat\psi^{[1]}$ and $\hat\psi^{[2]}$ are the estimates of allele frequency in Groups 1 and 2, respectively. Under the null hypothesis, $D_{a1}$ approximately follows the one-degree $\chi^2$ distribution.

A potential concern with the $D_{a1}$ statistic is that the computation of $L(\psi)$ assumes HWE. When HWE is violated, false positives may arise (Nielsen et al., 2011). For diploid samples, a safer statistic is

$$D_{a2} = -2\log\frac{L(\hat\xi_0,\hat\xi_1,\hat\xi_2)}{L^{[1]}(\hat\xi_0^{[1]},\hat\xi_1^{[1]},\hat\xi_2^{[1]})\,L^{[2]}(\hat\xi_0^{[2]},\hat\xi_1^{[2]},\hat\xi_2^{[2]})} \tag{12}$$

which in principle follows the two-degree $\chi^2$ distribution under the null hypothesis. However, when both cases and controls are in HWE, the degree of freedom is reduced and this statistic is underpowered.

We have not found a powerful test statistic robust to HWE violation. For practical applications, we propose to take the P-value computed with $D_{a1}$, while filtering candidates having a low $D_{a2}$ to reduce false positives caused by HWE violation (see Section 3).

2.3.5 Estimating the number of non-reference alleles In this section, we use the term site reference allele count to refer to the number of reference alleles at one single site. Allele count is a discrete number while allele frequency is continuous.

For convenience, define random vector $G = (G_1,\ldots,G_n)$ to be a genotype configuration, and $X = \sum_i G_i$ to be the site reference allele count in all the samples. Assuming HWE, we have

where $d = (d_1,\ldots,d_n)$ represents all sequencing data. To compute this probability efficiently, we define

$$z_{jl} = \sum_{g_1=0}^{m_1}\cdots\sum_{g_j=0}^{m_j}\delta_{l,s_j(g)}\prod_{i=1}^{j}\binom{m_i}{g_i}L_i(g_i) \tag{14}$$

for $0 \le l \le \sum_{i=1}^{j} m_i$ and $z_{jl} = 0$ otherwise. $z_{jl}$ can be calculated iteratively with

$$z_{jl} = \sum_{g_j=0}^{m_j} z_{j-1,\,l-g_j}\cdot\binom{m_j}{g_j}L_j(g_j) \tag{15}$$

starting from $z_{00} = 1$. Comparing the definition of $z_{nk}$ and Equation (15), we know that

$$L(k) = \frac{z_{nk}}{\binom{M}{k}} \tag{16}$$

which computes the likelihood of the allele count.

Although the computation of the likelihood function $L(k)$ is more complex than that of $L(\psi)$, $L(k)$ is discrete, which is more convenient to maximize or sum over. This likelihood function establishes the foundation of the Bayesian inference.

2.3.6 Numerical stability of the allele count estimation When computing $z_{jl}$ with Equation (15), floating point underflow may occur given large $j$. A numerically stable approach is to compute $y_{jl} = z_{jl}/\binom{M_j}{l}$ instead, where $M_j = \sum_{i=1}^{j} m_i$. Thus

$$L(k) = y_{nk} \tag{17}$$

and by replacing $z_{jk}$ with $y_{jk}\binom{M_j}{k}$ in Equation (15), we can derive:

$$y_{jk} = \left(\prod_{l=0}^{m_j-1}\frac{k-l}{M_j-l}\right)\sum_{g_j=0}^{m_j}\binom{m_j}{g_j}\,y_{j-1,\,k-g_j}\cdot L_j(g_j)\cdot\left(\prod_{l=g_j}^{m_j-1}\frac{M_{j-1}-k+l+1}{k-l}\right) \tag{18}$$

However, we note that $y_{jl}$ may decrease exponentially with increasing $j$. Floating point underflow may still occur. An even better solution is to rescale $y_{jl}$ for each $j$, similar to the treatment of the forward algorithm for Hidden Markov Models (Durbin et al., 1998). In practical implementation, we compute

$$\tilde{y}_{jl} = \frac{y_{jl}}{\prod_{j'=1}^{j} t_{j'}} \tag{19}$$

where $t_j$ is chosen such that $\sum_l \tilde{y}_{jl} = 1$.

As another implementation note, most $y_{jl}$ are close to zero and thus $y_{nk}$ can be computed in a band rather than in a triangle. This may dramatically speed up the computation of the likelihood.
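The rescaled recursion for the allele count likelihood can be sketched in Python as follows. This is a hedged illustration rather than the author's code: the function name and the input layout `gl[i][g]` (holding $L_i(g)$) are assumptions, the banded optimization is omitted, and the per-step coefficient is written as $\binom{m_j}{g_j}\binom{M_{j-1}}{k-g_j}/\binom{M_j}{k}$, which is what Equation (15) gives after substituting $z_{jk} = y_{jk}\binom{M_j}{k}$ and which avoids the 0/0 terms that the product form of Equation (18) produces when $k < m_j$.

```python
from math import comb, log

def allele_count_likelihood(gl, ploidy):
    """Return (y, log_scale) such that L(k) = y[k] * exp(log_scale), k = 0..M.

    gl[i][g] is L_i(g) for g reference alleles; ploidy[i] is m_i.
    Each row is rescaled to sum to one, as in Equation (19).
    """
    y = [1.0]          # y_{0,0} = z_{0,0} = 1
    M_prev = 0         # M_{j-1}
    log_scale = 0.0    # running sum of log t_j
    for L, m in zip(gl, ploidy):
        M_cur = M_prev + m
        new = [0.0] * (M_cur + 1)
        for k in range(M_cur + 1):
            acc = 0.0
            # g must satisfy 0 <= g <= m and 0 <= k - g <= M_prev
            for g in range(max(0, k - M_prev), min(m, k) + 1):
                w = comb(m, g) * comb(M_prev, k - g) / comb(M_cur, k)
                acc += w * y[k - g] * L[g]
            new[k] = acc
        t = sum(new)                  # rescaling factor t_j
        y = [v / t for v in new]
        log_scale += log(t)
        M_prev = M_cur
    return y, log_scale
```

For two diploid samples with certain genotypes 2 and 1, the only allele count with non-zero likelihood is $k = 3$, and $L(3) = z_{2,3}/\binom{4}{3} = 2/4 = 0.5$, which the sketch reproduces.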
2.3.7 Calling variants In variant calling, we have strong prior knowledge that at most of the sites all samples are homozygous to the reference. To utilize the prior knowledge, we may adopt a Bayesian inference for variant calling. Let $\phi_k$, $k = 0,\ldots,M$, be the probability of seeing $k$ reference alleles among $M$ chromosomes/haplotypes. For convenience, define $\Phi = \{\phi_k\}$, which is in fact the sample AFS for $M$ chromosomes. Recall that $X$ is the number of reference alleles in the samples. The posterior of $X$ is

$$\Pr\{X = k\mid d,\Phi\} = \frac{\phi_k\Pr\{d\mid X = k\}}{\sum_l \phi_l\Pr\{d\mid X = l\}} = \frac{\phi_k L(k)}{\sum_l \phi_l L(l)} \tag{20}$$

where $L(k)$ is defined by Equation (13) and computed by Equation (17). In variant calling, we define variant quality as

$$Q_{\mathrm{var}} = -10\log_{10}\Pr\{X = M\mid d,\Phi\}$$

take the ratio between the two resulting likelihoods. The larger the ratio, the more confident the mutation. More exactly, the likelihood ratio is:

$$D_t = -2\log\frac{\max_{(g_c,g_f,g_m)\in\mathcal{G}}\{L_c(g_c)L_f(g_f)L_m(g_m)\}}{\max_{g_c}L_c(g_c)\cdot\max_{g_f}L_f(g_f)\cdot\max_{g_m}L_m(g_m)} \tag{23}$$

where $L_c(g_c)$, $L_f(g_f)$ and $L_m(g_m)$ are the child, father and mother genotype likelihoods, respectively, and $\mathcal{G}$ is the set of genotype configurations satisfying the Mendelian inheritance.

Although most of the derivation in this article assumes that variants are biallelic, we drop this assumption in the implementation for methods described in this subsection. We have observed false somatic/germline mutations caused by the mismodeling of triallelic variants (M.Depristo, personal communication). The biallelic assumption may lead to false positives.
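The trio likelihood ratio of Equation (23) can be illustrated with a short Python sketch. It deliberately keeps the biallelic, diploid simplification (genotype = number of reference alleles, 0 to 2), whereas the text notes that the actual implementation drops the biallelic assumption; the names `transmissible`, `MENDELIAN` and `trio_lr` are hypothetical and introduced only for this example.

```python
from math import log
from itertools import product

def transmissible(g):
    """Reference-allele counts (0 or 1) a diploid parent with genotype g can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[g]

# All (child, father, mother) genotype triples consistent with Mendelian inheritance.
MENDELIAN = sorted({
    (a + b, gf, gm)
    for gf, gm in product(range(3), repeat=2)
    for a in transmissible(gf)
    for b in transmissible(gm)
})

def trio_lr(Lc, Lf, Lm):
    """D_t of Equation (23); Lc, Lf, Lm are length-3 genotype likelihood lists."""
    constrained = max(Lc[gc] * Lf[gf] * Lm[gm] for gc, gf, gm in MENDELIAN)
    free = max(Lc) * max(Lf) * max(Lm)
    if constrained == 0.0:
        return float("inf")  # no Mendelian-consistent configuration is supported
    return -2.0 * log(constrained / free)
```

A trio whose most likely genotypes already obey Mendelian inheritance gives $D_t = 0$; a heterozygous child with two reference-homozygous parents gives a positive score that grows with the confidence of the three individual calls.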
AF1     3, 5          Non-reference site allele frequency
G3      7             Diploid genotype frequency
HWE     8             P-value of Hardy–Weinberg equilibrium
NEIR    10            Neighboring r² linkage disequilibrium statistic
LRT     11            One-degree association test P-value
LRT2    12            Two-degree association test P-value
AC1     17, 18, 19    Non-reference site allele count
FQ      20            Probability of the site being polymorphic among samples
CLR     22, 23        Log likelihood ratio score for de novo mutations

[Figure: x-axis "# derived alleles conditional on NA18507 hets"; y-axis "Frequency"]
Fig. 2. The derived AFS conditional on heterozygotes discovered in the

[Figure residue: axis label "ML estimate"]

sequenced by Complete Genomics (Drmanac et al., 2010) and analyzed using CGA Tools version 1.10.0; (ii) from nine random Pilot-1 Yoruba individuals released by the 1000 Genomes Project using the EM-AFS method and (iii) from the same 9 Pilot-1 individuals using site-AFS.
the cell lines used in two studies have greatly diverged, we might see up to a dozen somatic mutations on chromosome 20.

This time with a threshold $D_p \ge 30$ and a maximum depth filter 150, we identified 667 single-base differences between the two datasets, far more than our expectation. Again we sought to reduce mapping errors by remapping reads with BWA-SW to the 1000 Genomes Project phase-2 reference genome. The number of differences between the HiSeq and the old Illumina data quickly drops to 33. If we further filter out clustered SNPs using a 100 bp window, 13 potential differences are left, 2% of the initial candidates. This exercise again proves that mismapping is the leading source of

[Figure: two panels; x-axis "Expected -log10(P-value)"]
Fig. 3. QQ-plot comparing the association test statistics to the one-degree
editing. It may be necessary to use two distinct mapping algorithms to call variants and then take the intersection.

Frequently, we need to know the exact DNA sequences or genotypes only to estimate parameters or compute statistics. In these cases, the sequences and genotypes are just intermediate results. When the sequence itself is uncertain, mostly due to the uncertainty in sequencing and mapping, it may sometimes be preferable to work directly with the uncertain sequence, which may carry more information than an arbitrarily ascertained sequence. We have shown that many population genetical parameters and statistical tests can be adapted to work on uncertain sequences, and believe more existing methods can be adapted in a similar manner. Knowing the exact sequence is convenient, but not always

Durbin,R. et al. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge, UK.
Excoffier,L. and Slatkin,M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol., 12, 921–927.
Hodgkinson,A. and Eyre-Walker,A. (2010) Human triallelic sites: evidence for a new mutational mechanism? Genetics, 184, 233–241.
Howie,B.N. et al. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet., 5, e1000529.
Kim,S.Y. et al. (2010) Design of association studies with pooled or un-pooled next-generation sequencing data. Genet. Epidemiol., 34, 479–491.
Kim,S.Y. et al. (2011) Estimation of allele frequency and association mapping using next-generation sequencing data. BMC Bioinformatics, 12, 231.
Le,S.Q. and Durbin,R. (2010) SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Res., 21, 952–960.
Ley,T.J. et al. (2008) DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature, 456, 66–72.