
Li 2011

This document proposes a statistical framework for analyzing sequencing data without explicitly calling genotypes. It aims to directly analyze sequencing data to call SNPs, discover mutations, perform association mapping and estimate population parameters. The framework jointly considers data from multiple samples to improve power over individual-based methods. It also aims to avoid imputation, which can be biased and slow, and directly analyze sequencing data.

Vol. 27 no. 21 2011, pages 2987–2993


BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btr509

Sequence analysis Advance Access publication September 8, 2011

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data
Heng Li
Medical Population Genetics Program, Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA
Associate Editor: Jeffrey Barrett

Downloaded from http://bioinformatics.oxfordjournals.org/ at University of Wyoming Libraries on March 30, 2015


ABSTRACT
Motivation: Most existing methods for DNA sequence analysis rely on accurate sequences or genotypes. However, in applications of the next-generation sequencing (NGS), accurate genotypes may not be easily obtained (e.g. multi-sample low-coverage sequencing or somatic mutation discovery). These applications press for the development of new methods for analyzing sequence data with uncertainty.
Results: We present a statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data without explicit genotyping or linkage-based imputation. On real data, we demonstrate that our method achieves comparable accuracy to alternative methods for estimating site allele count, for inferring allele frequency spectrum and for association mapping. We also highlight the necessity of using symmetric datasets for finding somatic mutations and confirm that for discovering rare events, mismapping is frequently the leading source of errors.
Availability: http://samtools.sourceforge.net
Contact: [email protected]

Received on July 20, 2011; revised on August 30, 2011; accepted on September 1, 2011

1 INTRODUCTION
The 1000 Genomes Project (1000 Genomes Project Consortium, 2010) sets an excellent example on how to design a sequencing project to get the maximum output pertinent to human populations. An important lesson from this project is to sequence many human samples at relatively low coverage instead of a few samples at high coverage. We adopt this strategy because with higher coverage, we will mostly reconfirm information from other reads, but with more samples, we will be able to reduce the sampling fluctuations, gain power on variants present in multiple samples and get access to many more rare variants. On the other hand, sequencing errors counteract the power in variant calling, which necessitates a minimum coverage. The optimal balancing point is broadly regarded to be in the 2–6 fold range per sample (Le and Durbin, 2010; Li et al., 2011), depending on the sequencing error rate, level of linkage disequilibrium (LD) and the purpose of the project.
A major concern with this design is that at 2–6 fold coverage per sample, non-reference alleles may not always be covered by sequence reads, especially at heterozygous loci. Calling variants from each individual and then combining the calls usually yields poor results. The preferred strategy is to enhance the power of variant discovery by jointly considering all samples (Depristo et al., 2011; Le and Durbin, 2010; Li et al., 2011; Nielsen et al., 2011). This strategy largely solves the variant discovery problem, but acquiring accurate genotypes for each individual remains unsolved. Without accurate genotypes, most of the previous methods [e.g. testing Hardy–Weinberg equilibrium (HWE) and association mapping] would not work.
To reuse the rich methods developed for genotyping data, the 1000 Genomes Project proposes to impute genotypes utilizing LD across loci (Browning and Yu, 2009; Howie et al., 2009; Li et al., 2009b, 2010a). Suppose at a site A, one sample has a low coverage. If some samples at A have high coverage and there exists a site B that is linked with A and has sufficient sequence support, we can transfer information across sites and between individuals, and thus make a reliable inference for the low-coverage sample at A. The overall genotype accuracy can be greatly improved.
However, imputation is not without potential concerns. First, imputation cannot be used to infer the regional allele frequency spectrum (AFS) because imputation as of now can only be applied to candidate variant sites, while we need to consider non-variants to infer AFS. Second, the effectiveness of imputation depends on the pattern of LD, which may lead to potential bias in population genetical inferences. Third, the current imputation algorithms are slow. For a thousand samples, the fastest algorithm may be slower than read mapping algorithms, which is frequently the bottleneck of analyzing NGS data (H.M.Kang, personal communication). Considering more samples and using more accurate algorithms will make imputation even slower.
These potential concerns make us reconsider if imputation is always preferred. We notice that we perform imputation mainly to reuse the methods developed for genotyping data, but would it be possible to derive new methods to solve classical medical and population genetical problems without precise genotypes?
Another application of NGS that requires genotype data is to discover somatic mutations or germline mutations between a few related samples (Conrad et al., 2011; Ley et al., 2008; Mardis et al., 2009; Pleasance et al., 2010a, b; Roach et al., 2010; Shah et al., 2009). For such an application, samples are often sequenced to high coverage. Although it is not hard to achieve an error rate of one per 100 000 bases (Bentley et al., 2008), mutations occur at a much lower rate, typically of the order of $10^{-6}$ or even $10^{-7}$. Naively calling genotypes and then comparing samples frequently would not work well (Ajay et al., 2011), because subtle uncertainty

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]


in genotypes may lead to a bulk of errors. From another angle, however, when discovering rare mutations, we only care about the difference between samples. Genotypes are just a way of measuring the difference. Is it really necessary to go through the genotype calling step?
This article explores the answer to these questions. We will show in the following how to compute various statistics directly from sequencing data without knowing genotypes. We will also evaluate our methods on real data.

2 METHODS
This section presents the precise equations on how to infer various statistics such as the genotype frequency and AFS, and to perform various statistical tests such as testing HWE and associations. Some of these equations have already been described in the existing literature, but for theoretical completeness, we give the equations using our notations. The last subsection reviews the existing methods and summarizes the differences between them, as well as between ours and the existing formulation.
In this section, we suppose there are $n$ individuals with the $i$-th individual having $m_i$ ploidy. At a site, the sequence data for the $i$-th individual is represented as $d_i$ and the genotype is $g_i$, which is an integer in $[0,m_i]$, equal to the number of reference alleles in the individual. Table 1 gives notations common across this section. The detailed derivation of the equations in this article is presented in an online document (http://bit.ly/stmath).

Table 1. Common notations

Symbol          Description
$n$             Number of samples
$m_i$           Ploidy of the $i$-th sample ($1 \le i \le n$)
$M$             Total number of chromosomes in samples: $M = \sum_i m_i$
$d_i$           Sequencing data (bases and qualities) for the $i$-th sample
$g_i$           Genotype (the number of reference alleles) of the $i$-th sample ($0 \le g_i \le m_i$)^a
$\phi_k$        Probability of observing $k$ reference alleles ($\sum_{k=0}^{M} \phi_k = 1$)
$\Pr\{A\}$      Probability of an event $A$
$L_i(\theta)$   Likelihood function for the $i$-th sample: $L_i(\theta) = \Pr\{d_i|\theta\}$

^a In this article, we only consider biallelic variants.

2.1 Assumptions
2.1.1 Site independency We assume data at different sites are independent. This may not be true in real data, because sequencing and mapping are context dependent; when there is an insertion or deletion (INDEL) error or INDEL polymorphism, sites nearby are also correlated in alignment. Nonetheless, most of the existing methods make this assumption for simplicity. The effect of site dependency may also be reduced by post-filtering and properly modeling the mapping and alignment errors (Li et al., 2008; Li, 2011).

2.1.2 Error independency and sample independency We assume that at a site the sequencing and mapping errors of different reads are independent. As a result, the likelihood functions of different individuals are independent:

$$L(\theta) = \prod_{i=1}^{n} L_i(\theta) \tag{1}$$

In real data, errors may be dependent on sequence context (Nakamura et al., 2011). The independency assumption may not hold. It is possible to model error dependency within an individual (Li et al., 2008), but the sample independency assumption is essential to all the derivations below.

2.1.3 Biallelic variants We assume all variants are biallelic. In the human population, the fraction of triallelic SNPs is ∼0.2% (Hodgkinson and Eyre-Walker, 2010). The biallelic assumption does not have a big impact on the modeling of SNPs, though it may have a bigger impact on the modeling of INDELs at microsatellites.

2.2 Computing genotype likelihoods
For one sample at a site, the sequencing data $d$ is composed of an array of bases on sequencing reads plus their base qualities. As we only consider biallelic variants, we may focus on the two most evident types of nucleotides and drop the less evident types if present. Thus, at any site we see at most two types of nucleotides. This treatment is not optimal, but sufficient in practice.
Suppose at a site there are $k$ reads. Without loss of generality, let the first $l$ bases ($l \le k$) be identical to the reference and the rest be different. The error probability of the $j$-th read base is $\epsilon_j$. Assuming error independency, we can derive that

$$L(g) = \frac{1}{m^k} \prod_{j=1}^{l} \left[(m-g)\epsilon_j + g(1-\epsilon_j)\right] \prod_{j=l+1}^{k} \left[(m-g)(1-\epsilon_j) + g\epsilon_j\right] \tag{2}$$

where $m$ is the ploidy.

2.3 Inferences from multiple samples
2.3.1 Estimating the site allele frequency In this section, we estimate the per-site reference allele frequency $\psi$. For the $i$-th sample, let $m_i$ be the ploidy, $g_i$ the genotype and $d_i$ the sequencing data. Assuming HWE, we can compute the likelihood of $\psi$:

$$L(\psi) = \sum_{g_1} \cdots \sum_{g_n} \prod_{i} \Pr\{d_i, g_i|\psi\} = \prod_{i=1}^{n} \sum_{g=0}^{m_i} L_i(g) f(g; m_i, \psi) \tag{3}$$

where $L_i(g_i)$ is computed by Equation (2) and

$$f(g; m, \psi) = \binom{m}{g} \psi^{g} (1-\psi)^{m-g} \tag{4}$$

is the probability mass function of the binomial distribution $\mathrm{Binom}(m,\psi)$.
Knowing the likelihood of $\psi$, we may numerically find the max-likelihood estimate with, for example, Brent's method (Brent, 1973). An alternative approach is to infer $\psi$ using an expectation–maximization algorithm (EM), regarding the sample genotypes as missing data. Given we know the estimate $\psi^{(t)}$ at the $t$-th iteration, the estimate at the $(t+1)$-th iteration is

$$\psi^{(t+1)} = \frac{1}{M} \sum_{i=1}^{n} \frac{\sum_{g} g\, L_i(g) f(g; m_i, \psi^{(t)})}{\sum_{g} L_i(g) f(g; m_i, \psi^{(t)})} \tag{5}$$

where $M = \sum_i m_i$ is the total number of chromosomes in samples.
When the signal from the data is strong, or equivalently for each $i$, one of $L_i(g)$ is much larger than others, the EM algorithm converges faster than the direct numerical solution using Brent's method. However, when the signal from the data is weak, the numerical method may converge faster than EM (Kim et al., 2011). In implementation, we apply 10 rounds of EM iterations. If the estimate does not converge after 10 rounds, we switch to Brent's method.

2.3.2 Estimating the genotype frequencies In this section, we assume all samples have the same ploidy: $m = m_1 = \cdots = m_n$ and aim to estimate $\xi_g$, the frequency of genotype $g$. The likelihood of $\{\xi_0,\ldots,\xi_m\}$ is:

$$L(\xi_0,\ldots,\xi_m) = \prod_{i=1}^{n} \sum_{g=0}^{m} L_i(g)\xi_g \tag{6}$$

with the constraint $\sum_g \xi_g = 1$. The EM iteration equation is

$$\xi_g^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \frac{L_i(g)\xi_g^{(t)}}{\sum_{g'} L_i(g')\xi_{g'}^{(t)}} \tag{7}$$

An important application of genotype frequencies is to test HWE for diploid samples ($m = 2$). When genotypes are known, we can perform a


one-degree $\chi^2$ test. This approach would not work for sequencing data as it does not account for the uncertainty in genotypes, especially when the average read depth of each individual is low. A proper solution is to perform a likelihood-ratio test (LRT). The test statistic is

$$D_e = -2\log\frac{L(\hat\psi)}{L(\hat\xi_0,\hat\xi_1,\hat\xi_2)} = -2\log\frac{L\big((1-\hat\psi)^2,\, 2\hat\psi(1-\hat\psi),\, \hat\psi^2\big)}{L(\hat\xi_0,\hat\xi_1,\hat\xi_2)} \tag{8}$$

where

$$\hat\psi = \operatorname*{argmax}_{\psi} L(\psi) \tag{9}$$

is the max-likelihood estimate of the site allele frequency and similarly $\hat\xi_0$, $\hat\xi_1$ and $\hat\xi_2$ are the max-likelihood estimates of the genotype frequencies. Because $L(\hat\psi)$ has one degree of freedom and $L(\hat\xi_0,\hat\xi_1,\hat\xi_2)$ has two degrees of freedom, the $D_e$ statistic approximately follows the one-degree $\chi^2$ distribution. For genotype data, $D_e$ approaches the standard HWE test statistic computed from a 3×2 contingency table.

2.3.3 Estimating haplotype frequencies between loci In this section, we assume all samples are diploid. Given $k$ loci, let $\vec h = (h_1,\ldots,h_k)$ be a haplotype where $h_j$ equals 1 if the allele at the $j$-th locus is identical to the reference, and equals 0 otherwise. Let $\eta_{\vec h}$ be the frequency of haplotype $\vec h$ satisfying $\sum_{\vec h} \eta_{\vec h} = 1$, where

$$\sum_{\vec h} \eta_{\vec h} = \sum_{h_1=0}^{1} \sum_{h_2=0}^{1} \cdots \sum_{h_k=0}^{1} \eta_{(h_1,\ldots,h_k)}$$

Knowing the genotype likelihood at the $j$-th locus for the $i$-th individual $L_i^{(j)}(g)$, we can compute the haplotype frequencies iteratively with:

$$\eta_{\vec h}^{(t+1)} = \frac{\eta_{\vec h}^{(t)}}{n} \sum_{i=1}^{n} \frac{\sum_{\vec h'} \eta_{\vec h'}^{(t)} \prod_{j=1}^{k} L_i^{(j)}(h_j + h'_j)}{\sum_{\vec h',\vec h''} \eta_{\vec h'}^{(t)} \eta_{\vec h''}^{(t)} \prod_{j} L_i^{(j)}(h'_j + h''_j)} \tag{10}$$

When sample genotypes are all certain, this EM iteration is reduced to the standard EM for estimating haplotype frequencies using genotype data (Excoffier and Slatkin, 1995).
The time complexity of computing Equation (10) is $O(n \cdot 4^k)$ and thus it is impractical to estimate the haplotype frequency for many loci jointly. A typical use of Equation (10) is to measure LD between two loci.

2.3.4 Testing associations Suppose we divide samples into two groups of size $n_1$ and $n - n_1$, respectively, and want to test if Group 1 significantly differs from Group 2. One possible test statistic could be (Kim et al., 2010, 2011)

$$D_{a1} = -2\log\frac{L(\hat\psi)}{L^{[1]}(\hat\psi^{[1]})\, L^{[2]}(\hat\psi^{[2]})} \tag{11}$$

where $\hat\psi$ is the max-likelihood estimate of the site allele frequency of all samples [Equation (9)], and $\hat\psi^{[1]}$ and $\hat\psi^{[2]}$ are the estimates of allele frequency in Groups 1 and 2, respectively. Under the null hypothesis, $D_{a1}$ approximately follows the one-degree $\chi^2$ distribution.
A potential concern with the $D_{a1}$ statistic is that the computation of $L(\psi)$ assumes HWE. When HWE is violated, false positives may arise (Nielsen et al., 2011). For diploid samples, a safer statistic is

$$D_{a2} = -2\log\frac{L(\hat\xi_0,\hat\xi_1,\hat\xi_2)}{L^{[1]}(\hat\xi_0^{[1]},\hat\xi_1^{[1]},\hat\xi_2^{[1]})\, L^{[2]}(\hat\xi_0^{[2]},\hat\xi_1^{[2]},\hat\xi_2^{[2]})} \tag{12}$$

which in principle follows the two-degree $\chi^2$ distribution under the null hypothesis. However, when both cases and controls are in HWE, the degree of freedom is reduced and this statistic is underpowered.
We have not found a powerful test statistic robust to HWE violation. For practical applications, we propose to take the P-value computed with $D_{a1}$, while filtering candidates having a low $D_{a2}$ to reduce false positives caused by HWE violation (see Section 3).

2.3.5 Estimating the number of non-reference alleles In this section, we use the term site reference allele count to refer to the number of reference alleles at one single site. Allele count is a discrete number while allele frequency is continuous.
For convenience, define random vector $\vec G = (G_1,\ldots,G_n)$ to be a genotype configuration, and $X = \sum_i G_i$ to be the site reference allele count in all the samples. Assuming HWE, we have

$$\Pr\{\vec G = \vec g \,|\, X = k\} = \delta_{k,s_n(\vec g)} \frac{\prod_{i=1}^{n} \binom{m_i}{g_i}}{\binom{M}{k}}$$

where $s_n(\vec g) = \sum_i g_i$ is the total number of reference alleles in a genotype configuration $\vec g$, and $\delta_{kl}$ is the Kronecker delta function which equals 1 if $k = l$ and equals 0 otherwise. The likelihood of allele count is

$$L(k) = \Pr\{\vec d \,|\, X = k\} = \frac{1}{\binom{M}{k}} \sum_{g_1} \cdots \sum_{g_n} \delta_{k,s_n(\vec g)} \prod_{i} \binom{m_i}{g_i} L_i(g_i) \tag{13}$$

where $\vec d = (d_1,\ldots,d_n)$ represents all sequencing data. To compute this probability efficiently, we define

$$z_{jl} = \sum_{g_1=0}^{m_1} \cdots \sum_{g_j=0}^{m_j} \delta_{l,s_j(\vec g)} \prod_{i=1}^{j} \binom{m_i}{g_i} L_i(g_i) \tag{14}$$

for $0 \le l \le \sum_{i=1}^{j} m_i$ and $z_{jl} = 0$ otherwise. $z_{jl}$ can be calculated iteratively with

$$z_{jl} = \sum_{g_j=0}^{m_j} z_{j-1,l-g_j} \cdot \binom{m_j}{g_j} L_j(g_j) \tag{15}$$

starting from $z_{00} = 1$. Comparing the definition of $z_{nk}$ and Equation (13), we know that

$$L(k) = \frac{z_{nk}}{\binom{M}{k}} \tag{16}$$

which computes the likelihood of the allele count.
Although the computation of the likelihood function $L(k)$ is more complex than of $L(\psi)$, $L(k)$ is discrete, which is more convenient to maximize or sum over. This likelihood function establishes the foundation of the Bayesian inference.

2.3.6 Numerical stability of the allele count estimation When computing $z_{jl}$ with Equation (15), floating point underflow may occur given large $j$. A numerically stable approach is to compute $y_{jl} = z_{jl} / \binom{M_j}{l}$ instead, where $M_j = \sum_{i=1}^{j} m_i$. Thus

$$L(k) = y_{nk} \tag{17}$$

and by replacing $z_{jk}$ with $y_{jk}\binom{M_j}{k}$ in Equation (15), we can derive:

$$y_{jk} = \left(\prod_{l=0}^{m_j-1} \frac{k-l}{M_j-l}\right) \sum_{g_j=0}^{m_j} y_{j-1,k-g_j} \cdot \binom{m_j}{g_j} L_j(g_j) \cdot \left(\prod_{l=g_j}^{m_j-1} \frac{M_{j-1}-k+l+1}{k-l}\right) \tag{18}$$

However, we note that $y_{jl}$ may decrease exponentially with increasing $j$. Floating point underflow may still occur. An even better solution is to rescale $y_{jl}$ for each $j$, similar to the treatment of the forward algorithm for Hidden Markov Models (Durbin et al., 1998). In practical implementation, we compute

$$\tilde y_{jl} = \frac{y_{jl}}{\prod_{j'=1}^{j} t_{j'}} \tag{19}$$

where $t_j$ is chosen such that $\sum_l \tilde y_{jl} = 1$.
As another implementation note, most $y_{jl}$ are close to zero and thus $y_{nk}$ can be computed in a band rather than in a triangle. This may dramatically speed up the computation of the likelihood.
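The allele count recursion of Sections 2.3.5 and 2.3.6 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SAMtools or GATK implementation: the function names (`allele_count_loglik`, `brute_force_loglik`) are hypothetical, each sample is passed as a plain list of genotype likelihoods indexed by the number of reference alleles, and all likelihoods are assumed positive. The sketch applies the recursion of Equation (15) and divides by the binomial coefficient as in Equation (16); to limit underflow it renormalizes the row after each sample, in the spirit of Equation (19), rather than implementing Equation (18) term by term. A brute-force evaluation of Equation (13) is included for checking small inputs.

```python
import math
from itertools import product


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def allele_count_loglik(sample_liks):
    """Return [log L(k) for k = 0..M] following Equations (15) and (16).

    sample_liks[i][g] is L_i(g); the ploidy of sample i is len(sample_liks[i]) - 1,
    so samples may have unequal ploidy.
    """
    z = [1.0]        # z_{0,0} = 1
    log_scale = 0.0  # log of the product of per-sample scaling factors
    for liks in sample_liks:
        m = len(liks) - 1
        new = [0.0] * (len(z) + m)
        for l, z_prev in enumerate(z):
            for g in range(m + 1):
                # Equation (15): add g reference alleles from this sample
                new[l + g] += z_prev * math.comb(m, g) * liks[g]
        t = sum(new)  # rescale so the row sums to one
        log_scale += math.log(t)
        z = [v / t for v in new]
    M = len(z) - 1
    return [
        math.log(v) + log_scale - log_comb(M, k) if v > 0 else float("-inf")
        for k, v in enumerate(z)
    ]


def brute_force_loglik(sample_liks):
    """Direct evaluation of Equation (13); exponential cost, small inputs only."""
    M = sum(len(liks) - 1 for liks in sample_liks)
    total = [0.0] * (M + 1)
    for config in product(*(range(len(liks)) for liks in sample_liks)):
        term = 1.0
        for liks, g in zip(sample_liks, config):
            term *= math.comb(len(liks) - 1, g) * liks[g]
        total[sum(config)] += term
    return [math.log(v / math.comb(M, k)) for k, v in enumerate(total)]
```

Working in log space at the final step avoids converting a very large binomial coefficient to a floating point number when M is large. For a single sample the result reduces to that sample's own genotype likelihoods, which gives a quick sanity check.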

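The genotype likelihood of Equation (2) and the EM update of Equation (5) can likewise be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the input layout (one boolean per read indicating a reference base, plus one error probability per read) are hypothetical, error probabilities are assumed to lie strictly between 0 and 1, and the loop simply iterates EM to convergence instead of switching to Brent's method after 10 rounds as described in Section 2.3.1.

```python
import math


def genotype_likelihoods(is_ref, errs, m=2):
    """Equation (2): return [L(g) for g = 0..m], g = number of reference alleles.

    is_ref[j] is True if the j-th read base matches the reference;
    errs[j] is the error probability of that base.
    """
    k = len(is_ref)
    out = []
    for g in range(m + 1):
        lik = 1.0
        for ref_base, e in zip(is_ref, errs):
            if ref_base:
                lik *= (m - g) * e + g * (1.0 - e)
            else:
                lik *= (m - g) * (1.0 - e) + g * e
        out.append(lik / m ** k)
    return out


def em_ref_allele_freq(sample_liks, psi=0.5, max_iter=100, tol=1e-9):
    """Equation (5): EM estimate of the reference allele frequency psi.

    sample_liks[i][g] is L_i(g); ploidy of sample i is len(sample_liks[i]) - 1.
    """
    M = sum(len(liks) - 1 for liks in sample_liks)
    for _ in range(max_iter):
        expected_ref = 0.0
        for liks in sample_liks:
            m = len(liks) - 1
            # L_i(g) * f(g; m_i, psi), with f the binomial pmf of Equation (4)
            w = [
                liks[g] * math.comb(m, g) * psi ** g * (1.0 - psi) ** (m - g)
                for g in range(m + 1)
            ]
            expected_ref += sum(g * wg for g, wg in enumerate(w)) / sum(w)
        new_psi = expected_ref / M
        if abs(new_psi - psi) < tol:
            return new_psi
        psi = new_psi
    return psi
```

With two diploid samples, one covered by ten reference reads and one by five reference and five non-reference reads, the estimate settles near 0.75, as expected for one homozygous-reference and one heterozygous individual.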



2.3.7 Calling variants In variant calling, we have a strong prior knowledge that at most of the sites all samples are homozygous to the reference. To utilize the prior knowledge, we may adopt a Bayesian inference for variant calling. Let $\phi_k$, $k = 0,\ldots,M$, be the probability of seeing $k$ reference alleles among $M$ chromosomes/haplotypes. For convenience, define $\Phi = \{\phi_k\}$, which is in fact the sample AFS for $M$ chromosomes. Recall that $X$ is the number of reference alleles in the samples. The posterior of $X$ is

$$\Pr\{X = k \,|\, \vec d, \Phi\} = \frac{\phi_k \Pr\{\vec d \,|\, X = k\}}{\sum_l \phi_l \Pr\{\vec d \,|\, X = l\}} = \frac{\phi_k L(k)}{\sum_l \phi_l L(l)} \tag{20}$$

where $L(k)$ is defined by Equation (13) and computed by Equation (17). In variant calling, we define variant quality as

$$Q_{\mathrm{var}} = -10\log_{10} \Pr\{X = M \,|\, \vec d, \Phi\}$$

and call the site as a variant if $Q_{\mathrm{var}}$ is large enough. Because in deriving $L(k)$ we do not require the ploidy of each sample to be the same, the variant calling method described here is in theory applicable to pooled resequencing with unequal pool sizes.

2.3.8 Estimating the sample AFS For variant calling [Equation (20)], we typically take the Wright–Fisher AFS as the prior. We can also estimate the sample AFS with the maximum-likelihood inference when the Wright–Fisher prior deviates from the data.
Suppose we have $L$ sites of interest and we want to estimate the frequency spectrum across these sites. Let $X_a$, $a = 1,\ldots,L$, be a random variable representing the number of reference alleles at site $a$. We can use an EM algorithm to find $\Phi$ that maximizes $\Pr\{\vec d \,|\, \Phi\}$, the probability of data across all samples and all sites conditional on AFS. The iteration equation is

$$\phi_k^{(t+1)} = \frac{1}{L} \sum_{a} \Pr\{X_a = k \,|\, \vec d, \Phi^{(t)}\} \tag{21}$$

We call this method of estimating AFS EM-AFS. Alternatively, we may also acquire the max-likelihood estimate of the allele count at each site using Equation (16). The normalized histogram of these counts gives the AFS. We call this method site-AFS. We will compare the two methods in Section 3.

2.4 Discovering somatic and germline mutations
One of the key goals in cancer resequencing is to identify the somatic mutations between a normal-tumor sample pair (Robison, 2010), which can be achieved by computing a likelihood ratio. Given a pair of samples, the following likelihood ratio is an informative score:

$$D_p = -2\log\frac{L^{[1]}(\hat g)\, L^{[2]}(\hat g)}{L^{[1]}(\hat g^{[1]})\, L^{[2]}(\hat g^{[2]})} \tag{22}$$

where $L^{[\cdot]}(g)$ is computed by Equation (2), $\hat g$ maximizes $L^{[1]}(g)L^{[2]}(g)$, and similarly $\hat g^{[1]}$ and $\hat g^{[2]}$ maximize $L^{[1]}(g)$ and $L^{[2]}(g)$, respectively.
Note that in most practical cases, $\hat g$ equals either $\hat g^{[1]}$ or $\hat g^{[2]}$. When this stands, we have:

$$L^{[1]}(\hat g)\, L^{[2]}(\hat g) = \max\left\{L^{[1]}(\hat g^{[1]})\, L^{[2]}(\hat g^{[1]}),\; L^{[1]}(\hat g^{[2]})\, L^{[2]}(\hat g^{[2]})\right\}$$

and then we can prove:

$$D_p = 2\log\min\left\{\frac{L^{[1]}(\hat g^{[1]})}{L^{[1]}(\hat g^{[2]})},\; \frac{L^{[2]}(\hat g^{[2]})}{L^{[2]}(\hat g^{[1]})}\right\}$$

This equation has an intuitive interpretation: we are certain about a candidate somatic mutation only if the best genotypes in both samples are clearly better than other possible genotypes.
A natural extension to discovering somatic mutations is to discover de novo and somatic mutations in a family trio (Conrad et al., 2011). To identify such mutations, we may compute the maximum likelihoods of genotype configurations without the family constraint and with the constraint, and then take the ratio between the two resulting likelihoods. The larger the ratio, the more confident the mutation. More exactly, the likelihood ratio is:

$$D_t = -2\log\frac{\max_{(g_c,g_f,g_m)\in\mathcal G}\{L_c(g_c)\, L_f(g_f)\, L_m(g_m)\}}{\max L_c(g_c) \cdot \max L_f(g_f) \cdot \max L_m(g_m)} \tag{23}$$

where $L_c(g_c)$, $L_f(g_f)$ and $L_m(g_m)$ are the child, father and mother genotype likelihoods, respectively, and $\mathcal G$ is the set of genotype configurations satisfying the Mendelian inheritance.
Although most of the derivation in this article assumes that variants are biallelic, we drop this assumption in the implementation for methods described in this subsection. We have observed false somatic/germline mutations caused by the mismodeling of triallelic variants (M.Depristo, personal communication). The biallelic assumption may lead to false positives.

2.5 Related works
During SNP calling, Thunder (Li et al., 2011) and glfMultiples (http://bit.ly/glfmulti) compute the site allele frequency by numerically maximizing the likelihood [Equation (3)]. Genome Analysis Toolkit (GATK; Depristo et al., 2011) infers the frequency with EM [Equation (5)]. Kim et al. (2011) infer the frequency with both the numerical and the EM algorithms. Li et al. (2010b) derived an alternative method to estimate the site allele frequency, which is not covered in this article. SeqEM (Martin et al., 2010) estimates the genotype frequency using EM [Equation (7)] with a different parameterization. Le and Durbin (2010) derived Equation (16). The conclusion is correct, but the derivation is not rigorous: the binomial coefficient in Equation (13) was left out. Yi et al. (2010) came to a similar set of equations to Equations (15) and (20), but the prior is taken from the estimated site allele frequency. To the best of our knowledge, Kim et al. (2010) is the first to use genotype likelihood-based LRT to compute P-value of associations [Equation (11)] with more thorough evaluation in a recent paper (Kim et al., 2011). Nielsen et al. (2011) further proposed to test associations with a score test (Schaid et al., 2002). Except Kim et al. (2010), all the previous works focus on diploid samples, while many equations in this article can be in theory applied to multiploidy samples and pooled samples.
In this article, our contribution includes testing HWE, estimating haplotype frequency, the proposal of two-degree association test, a simple but effective model for discovering somatic mutations, the rigorous derivation and numerically stable implementation of a discrete allele count estimator and an EM algorithm for inferring AFS.

3 RESULTS
3.1 Implementation
Most of the equations for diploid samples ($m = 2$) have been implemented in the SAMtools software package (Li et al., 2009a), which is distributed under the MIT open source license, free to both academic and commercial uses. The exact Equations (17)–(19) have also been implemented in GATK as the default SNP calling model.
The SAMtools package consists of two key components, samtools and bcftools. The former computes the genotype likelihood $L(g)$ using an improved version of Equation (2) that considers error dependencies; the latter component calls variants and infers various statistics described in this article. To clearly separate the two steps, we designed a new Binary variant call format (BCF), which is the binary representation of the variant call format (VCF; Danecek et al., 2011) and is more compact and much faster to process than VCF. On real data, computing genotype likelihoods, especially for INDELs, is typically 10 times slower than variant calling. The separation of genotype likelihood computation and subsequent inferences enhances the flexibility and improves the


efficiency for inferring AFS. Bcftools also directly works with VCF files, but is less efficient than with BCF files.
Table 2 shows how VCF information tags generated by SAMtools are related to the equations in this article. We refer to the SAMtools manual page for detailed description.

Table 2. SAMtools specific VCF information

INFO^a   Equation^b    Description
AF1      3, 5          Non-reference site allele frequency
G3       7             Diploid genotype frequency
HWE      8             P-value of Hardy–Weinberg equilibrium
NEIR     10            Neighboring $r^2$ linkage disequilibrium statistic
LRT      11            One-degree association test P-value
LRT2     12            Two-degree association test P-value
AC1      17, 18, 19    Non-reference site allele count
FQ       20            Probability of the site being polymorphic among samples
CLR      22, 23        Log likelihood ratio score for de novo mutations

^a Tag at the VCF additional information field (INFO).
^b Related, though not exact, equations for computing the values.

3.2 Inferring the allele count
We downloaded the chromosome 20 alignments of 49 Pilot-1 CEU samples sequenced by the 1000 Genomes Project using the Illumina technology only. We called the SNPs with SAMtools and imputed the genotypes with Beagle under the default settings. At 32 522 sites genotyped using the Omni genotyping chip and polymorphic in the 49 samples, the root-mean-square deviation (RMSD) between the allele count acquired from Omni genotypes and the estimate using Equation (16) equals 3.7, the same as the RMSD between the Omni and the Beagle-imputed genotypes. Not surprisingly, imputed genotypes are more accurate when there is a tightly linked SNP nearby, while the imputation-free estimate is less affected (Fig. 1).

[Figure 1: line plot. x-axis: maximum imputed r-square to nearby SNPs; y-axis: root-mean-square deviation; series: Beagle imputation, ML estimate.]
Fig. 1. Correlation of the site allele count accuracy with LD. The site allele count is estimated with Beagle imputation (solid line) and with Equation (16) (dashed line) at sites typed by the Omni genotyping chip. For each Omni SNP, the maximum $r^2$ LD statistic between the SNP and 20 nearby SNPs called by SAMtools (10 upstream and 10 downstream) is computed from imputed genotypes. Omni SNPs are then ordered by the maximum $r^2$ and approximately evenly divided into 15 bins. For each bin, the RMSD between the Omni allele count and the estimated allele count is computed as a measurement of the allele count accuracy.

[Figure 2: x-axis: # derived alleles conditional on NA18507 hets; y-axis: frequency; series: Complete Genomics, 1000G pilot (EM-AFS), 1000G pilot (site-AFS).]
Fig. 2. The derived AFS conditional on heterozygotes discovered in the NA18507 genome (Bentley et al., 2008; AC:SRA000271). Heterozygotes were called with SAMtools on BWA (Li and Durbin, 2009) alignment. The ancestral sequences were determined from the Ensembl EPO alignment (Paten et al., 2008), with the requirement of the chimpanzee and orangutan sequences being identical. The AFS at these heterozygotes were computed in three ways: (i) from the nine independent Yoruba individuals sequenced by Complete Genomics (Drmanac et al., 2010) and analyzed using CGA Tools version 1.10.0; (ii) from nine random Pilot-1 Yoruba individuals released by the 1000 Genomes Project using the EM-AFS method and (iii) from the same nine Pilot-1 individuals using site-AFS.

However, on the unreleased European data from the 1000 Genomes Project consisting of 670 samples, Beagle imputation is better than our imputation-free method [RMSD(imput) = 12.7; RMSD(imput-free) = 15.0]. We conjecture that this is because with more samples, it is more frequent for two samples to share a long haplotype. The LD plays a more important role in counteracting the lack of coverage. Nonetheless, we should beware that sites selected on the Omni genotyping chip may not be a good representative of all SNPs. For example, for the sites on the Omni chip, only 8% of SNPs do not have a nearby SNP with $r^2 > 0.05$ in a 20-SNP window (the 'nearby SNPs' include all SNPs discovered in the 670 samples), but this percentage is increased to 30% for all SNPs. The large fraction of unlinked SNPs might hurt the accuracy of imputation-based methods.
We have also evaluated our method on an unpublished target resequencing dataset consisting of ∼2000 samples (Haiman,C. and Henderson,B., personal communication). The imputation-based method does not perform well [RMSD(imput) = 54.8; RMSD(imput-free) = 42.5], probably due to the lack of linked SNPs around fragmented target regions.

3.3 Inferring the AFS
To evaluate the accuracy of the estimated AFS, we compared the AFS obtained from the low-coverage data produced by the 1000 Genomes Project and from the high-coverage data released by Complete Genomics (http://bit.ly/m7LzvF). Figure 2 reveals that we can infer a fairly accurate AFS using the EM-AFS method with 3-fold coverage per sample. On the other hand, the site-AFS estimate is less stable, though the overall trend looks right. To estimate properties across multiple sites, summing over the posterior distribution using EM-AFS is more appropriate.

3.4 Performing association test
To evaluate the performance of the association test statistic $D_{a1}$ [Equation (11)], we constructed a perfect negative control using the 1000 Genomes data and derived the empirical distribution of $D_{a1}$. We expect to see no associations. Figure 3 shows that

2991

[10:51 3/10/2011 Bioinformatics-btr509.tex] Page: 2991 2987–2993


H.Li

A 7 B individual, somatic mutations in cell lines, which are of the order of


Imputation 2-degree LRT
1-degree LRT 1000 per diploid genome (Conrad et al., 2011), may be present. If
Observed -log10(P-value)

6
the cell lines used in two studies have greatly diverged, we might
5
see up to a dozen somatic mutations on chromosome 20.
4
This time with a threshold Dp ≥ 30 and a maximum depth
3 filter 150, we identified 667 single-base differences between the
2 two datasets, far more than our expectation. Again we sought to
1 reduce mapping errors by remapping reads with BWA-SW to the
0
1000 Genomes Project phase-2 reference genome. The number of
0 1 2 3 4 5 6 0 1 2 3 4 5 6 differences between the HiSeq and the old Illumina data quickly
Expected -log10(P-value) Expected -log10(P-value) drops to 33. If we further filter out clustered SNPs using a 100 bp
window, 13 potential differences are left, 2% of the initial candidates.
Fig. 3. QQ-plot comparing the association test statistics to the one-degree This exercise again proves that mismapping is the leading source of

Downloaded from http://bioinformatics.oxfordjournals.org/ at University of Wyoming Libraries on March 30, 2015


and the two-degree χ2 distribution. The 49 CEU samples sequenced by errors.
the 1000 Genomes Project using the Illumina technology were randomly To see if the simple likelihood ratio [Equation (22)] is
assigned to two groups of size 24 and 25, respectively. (A) two association comparable to more sophisticated methods, we briefly tried
test statistics were computed on chromosome 20 between the two groups:
SomaticSniper (Larson et al., in press) on our data. With a somatic
one by the one-degree likelihood ratio test [Equation (11)] and the other by
the canonical one-degree χ2 test based on Beagle imputed genotypes; (B) the
score cutoff 65, which is about 30 in the ‘2log’ scale as in Dp ,
two-degree likelihood rate test statistic [Equation (12)]. SomaticSniper identified 1826 differences. SAMtools called fewer,
because it limits the mapping quality of reads with excessive
mismatches and applies base alignment quality (Li, 2011) to fix
Da1 largely follows the one-degree χ2 distribution. However, this
alignment errors around INDELs. With the two features switched
method also produces one false positive SNP (P < 10−6 ). Closer
off, SAMtools called 1696 differences, half of which overlap the
investigation reveals that the SNP significantly violates HWE [P <
differences found by SomaticSniper. Calls unique to one method
10−6 , computed with Equation (8)], and thus violates the assumption
tend to have a mutation score close to the threshold.
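The remark in Section 3.5.2 that a somatic score cutoff of 65 is about 30 on the '2log' scale can be checked with a line of arithmetic. The sketch assumes that the score is Phred-scaled, i.e. −10 log10 of a probability ratio, and that Dp is twice a natural-log likelihood ratio; both readings are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def phred_to_2log(q):
    """Convert a Phred-scaled score (-10*log10 r) to twice the natural log of 1/r."""
    return 2.0 * (q / 10.0) * math.log(10.0)

def twolog_to_phred(d):
    """Inverse conversion: from the 2*ln scale back to the Phred scale."""
    return 10.0 * d / (2.0 * math.log(10.0))

print(round(phred_to_2log(65), 2))    # 29.93
print(round(twolog_to_phred(30), 2))  # 65.14
```

Under these assumptions the two cutoffs differ by well under 1%, which is consistent with treating 65 and 30 as equivalent thresholds.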
behind the derivation of Da1 . In fact, if we test the association
with Da2 which does not assume HWE, the false positive will be
suppressed (P > 0.001). To test association using the one-degree 4 DISCUSSIONS
likelihood-ratio test statistic, it is important to control HWE.
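To make the QQ-plot comparison of Section 3.4 concrete, the sketch below converts a likelihood-ratio statistic into a P-value under the one-degree and two-degree χ² distributions and pairs sorted observed P-values with their expected quantiles. It relies on closed-form tail probabilities that hold only for 1 and 2 degrees of freedom; the function names and the example statistics are hypothetical and are not taken from the 1000 Genomes analysis.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def chi2_sf_2df(x):
    """Upper-tail probability of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs for a QQ-plot; P-values must be > 0."""
    n = len(p_values)
    observed = sorted(p_values)
    return [(-math.log10((i + 0.5) / n), -math.log10(p)) for i, p in enumerate(observed)]

# Made-up one-degree test statistics for five sites.
stats = [0.1, 1.3, 3.84, 0.02, 7.9]
p_values = [chi2_sf_1df(s) for s in stats]
for expected, observed in qq_points(p_values):
    print(round(expected, 3), round(observed, 3))
```

Under a well-behaved negative control the points should lie near the diagonal; a site far above it, such as the HWE-violating SNP discussed above, stands out as a potential false positive.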
We have proposed a statistical framework for SNP calling as well
as analyzing sequencing data but without explicitly calling SNPs
3.5 Comparing sequencing data from the same or their genotypes. With this framework, we can discover somatic
individual and germline mutations with appropriate input data, efficiently
3.5.1 Comparing datasets of similar characteristics We acquired estimate site allele frequency, allele frequency spectrum and
the NA12878 data used by Depristo et al. (2011). This sample linkage disequilibrium, and test Hardy–Weinberg equilibrium and
was sequenced with HiSeq2000 using two libraries with each put association. On real data, we have demonstrated that our method is
on eight lanes and each sequenced to about 30-fold coverage. We able to achieve comparable accuracy to the best alternative methods.
split the data in two by library and computed Dp [Equation (22)] We have also extensively evaluated the performance of our method
at each base on chromosome 20 to identify sites that are called on several unpublished datasets and got sensible results. Thus, we
differently between the two libraries. With a stringent threshold conclude that useful information can be obtained directly from
Dp ≥ 30 and without any filtering, 32 differences are called between sequencing data without SNP calling or imputation.
the libraries and most from the centromere. Since the libraries were Here we also want to emphasize a few findings in our evaluation of
made from the same DNA at almost the same time, we expect the methods. First, we confirmed that imputation is a viable method
to see no difference between the libraries. Seeing 32 differences for transferring our knowledges on genotyping data to low-coverage
is very unlikely. To explore if this is due to mismapping, we sequencing data. It is likely to have higher accuracy than our
extracted reads around the 32 sites and remapped them with method given homogeneous whole-genome data consisting of many
BWA-SW (Li and Durbin, 2010). Four differences remain around samples. Nonetheless, we showed that the accuracy of imputation
the centromere, implying that most of the differences between depends on the LD nearby, which has long been speculated but
libraries are caused by the variation in read mapping. We further without direct evidence from real data until our work. Second, our
mapped the reads around the four sites to a version of the human proposed EM-AFS method is able to accurately estimate AFS from
reference genome used by the 1000 Genomes Project for phase- low-coverage sequencing data. It is more appropriate than estimating
2 mapping (http://bit.ly/GRCh37d). No differences are left. This the site frequency separately and then doing a histogram. Third,
exercise reveals that when we come to very rare events, mapping we observed that violation of HWE may cause false positives in
errors, instead of sequencing errors, lead to most of the artifacts. association mapping with the one-degree likelihood ratio test (Kim
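The filtering steps used in Section 3.5 (a score threshold on Dp, a maximum depth, and removal of clustered candidates within a 100 bp window) can be sketched as follows. The record layout, the function names and the rule of dropping every candidate that has a neighbour within the window are assumptions for illustration; the text does not specify the exact filter implementation, and the sketch handles a single chromosome only.

```python
def filter_candidates(sites, min_score=30.0, max_depth=150):
    """Keep sites whose score passes the threshold and whose depth is not excessive."""
    return [s for s in sites if s["score"] >= min_score and s["depth"] <= max_depth]

def drop_clustered(positions, window=100):
    """Drop every position that has another candidate within `window` bp."""
    pos = sorted(positions)
    kept = []
    for i, p in enumerate(pos):
        near_prev = i > 0 and p - pos[i - 1] <= window
        near_next = i + 1 < len(pos) and pos[i + 1] - p <= window
        if not (near_prev or near_next):
            kept.append(p)
    return kept

# Made-up candidate differences on one chromosome.
candidates = [
    {"pos": 100, "score": 45.0, "depth": 60},
    {"pos": 150, "score": 31.0, "depth": 58},
    {"pos": 400, "score": 80.0, "depth": 300},  # fails the depth filter
    {"pos": 900, "score": 12.0, "depth": 55},   # fails the score threshold
    {"pos": 5000, "score": 33.0, "depth": 62},
]
passed = filter_candidates(candidates)
print(drop_clustered([s["pos"] for s in passed]))  # [5000]
```

Both filters target mapping artifacts rather than sequencing errors: excessive depth and tightly clustered differences are typical of reads placed in the wrong copy of a repeated region.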
et al., 2011). A two-degree likelihood ratio test is a conservative
3.5.2 Comparing datasets of different characteristics We also way to avoid such an artifact. At last, we highlighted the importance
did a harder version of the exercise above: comparing this 60- of using data of similar characteristics in the discovery of somatic
fold HiSeq data to the old Illumina data for the same individual mutations. We also want to put a particular emphasis on the
obtained >2 years ago by the 1000 Genomes Project. We note that necessity of controlling mapping errors when looking for very rare
although DNA used in the two datasets was originated from the same events such as somatic mutations, germline mutations and RNA

2992

[10:51 3/10/2011 Bioinformatics-btr509.tex] Page: 2992 2987–2993


Inference using sequencing data

editing. It may be necessary to use two distinct mapping algorithms Durbin,R. et al. (1998) Biological Sequence Analysis. Cambridge University Press,
to call variants and then take the intersection. Cambridge, UK.
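The suggestion of calling variants with two distinct mapping algorithms and keeping the intersection amounts to a set operation on the two call sets. A minimal sketch, assuming each call is keyed by a hypothetical (chromosome, position, alternate allele) tuple and using made-up calls:

```python
def consensus_calls(calls_a, calls_b):
    """Return calls present in both call sets, sorted by chromosome and position."""
    return sorted(set(calls_a) & set(calls_b))

# Made-up calls from two mappers; the site at 9100 disagrees on the alternate allele.
calls_mapper_a = [("20", 1000, "A"), ("20", 2500, "T"), ("20", 9100, "G")]
calls_mapper_b = [("20", 1000, "A"), ("20", 9100, "C"), ("20", 2500, "T")]
print(consensus_calls(calls_mapper_a, calls_mapper_b))
# [('20', 1000, 'A'), ('20', 2500, 'T')]
```

Keying on the allele as well as the position is a design choice: it discards sites where the two pipelines report different alleles, which trades some sensitivity for fewer mapper-specific artifacts.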
Excoffier,L. and Slatkin,M. (1995) Maximum-likelihood estimation of molecular
Frequently, we require to know the exact DNA sequences or
haplotype frequencies in a diploid population. Mol. Biol. Evol., 12, 921–927.
genotypes only to estimate parameters or compute statistics. In Hodgkinson,A. and Eyre-Walker,A. (2010) Human triallelic sites: evidence for a new
these cases, the sequences and genotypes are just intermediate mutational mechanism? Genetics, 184, 233–241.
results. When the sequence itself is uncertain, mostly due to the Howie,B.N. et al. (2009) A flexible and accurate genotype imputation method for the
uncertainty in sequencing and mapping, it may sometimes be next generation of genome-wide association studies. PLoS Genet., 5, e1000529.
Kim,S.Y. et al. (2010) Design of association studies with pooled or un-pooled next-
preferred to directly work with the uncertain sequence, which may generation sequencing data. Genet. Epidemiol., 34, 479–491.
carry more information than an arbitrarily ascertained sequence. Kim,S.Y. et al. (2011) Estimation of allele frequency and association mapping using
We have showed that many population genetical parameters and next-generation sequencing data. BMC Bioinformatics, 12, 231.
statistical tests can be adapted to work on uncertain sequences, Le,S.Q. and Durbin,R. (2010) SNP detection and genotyping from low-coverage
sequencing data on multiple diploid samples. Genome Res., 21, 952–960.
and believe more existing methods can be adapted in a similar
Ley,T. J. et al. (2008) DNA sequencing of a cytogenetically normal acute myeloid
manner. Knowing the exact sequence is convenient, but not always leukaemia genome. Nature, 456, 66–72.

Downloaded from http://bioinformatics.oxfordjournals.org/ at University of Wyoming Libraries on March 30, 2015


indispensable. Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with burrows-
wheeler transform. Bioinformatics, 25, 1754–1760.
Li,H. and Durbin,R. (2010) Fast and accurate long-read alignment with burrows-wheeler
ACKNOWLEDGEMENTS transform. Bioinformatics, 26, 589–595.
Li,H. et al. (2008) Mapping short DNA sequencing reads and calling variants using
We are grateful to Christopher Haiman for providing the unpublished mapping quality scores. Genome Res., 18, 1851–1858.
dataset for assessing the performance, to Petr Danecek for evaluating Li,H. et al. (2009a) The sequence alignment/map format and samtools. Bioinformatics,
the methods in this article on large-scale datasets, to Rasmus Nielsen 25, 2078–2079.
Li,H. (2011) Improving SNP discovery by base alignment quality. Bioinformatics, 27,
for the observation on the occasional slow convergence of the EM
1157–1158.
algorithm and to Si Quang Le and Richard Durbin for the help on Li,Y. et al. (2009b) Genotype imputation. Annu. Rev. Genomics Hum. Genet., 10,
understanding the QCall model. We also thank the 1000 Genomes 387–406.
Project analysis subgroup and the GSA team at Broad Institute for Li,Y. et al. (2010a) MaCH: using sequence and genotype data to estimate haplotypes
various helpful discussions, and thank all the SAMtools users for and unobserved genotypes. Genet. Epidemiol., 34, 816–834.
Li,Y. et al. (2010b) Resequencing of 200 human exomes identifies an excess of low-
evaluating the software package. frequency non-synonymous coding variants. Nat. Genet., 42, 969–972.
Li,Y. et al. (2011) Low-coverage sequencing: Implications for design of complex trait
Funding: National Institutes of Health (1U01HG005208-01).
association studies. Genome Res., 21, 940–951.
Conflict of Interest: none declared. Mardis,E.R. et al. (2009) Recurring mutations found by sequencing an acute myeloid
leukemia genome. N. Engl. J. Med., 361, 1058–1066.
Martin,E.R. et al. (2010) SeqEM: an adaptive genotype-calling approach for next-
generation sequencing studies. Bioinformatics, 26, 2803–2810.
REFERENCES Nakamura,K. et al. (2011) Sequence-specific error profile of illumina sequencers.
1000 Genomes Project Consortium (2010) A map of human genome variation from Nucleic Acids Res., 39, e90.
population-scale sequencing. Nature, 467, 1061–1073. Nielsen,R. et al. (2011) Genotype and SNP calling from next-generation sequencing
Ajay,S.S. et al. (2011) Accurate and comprehensive sequencing of personal genomes. data. Nat. Rev. Genet., 12, 443–451.
Genome Res., 21, 1498–1505. Paten,B. et al. (2008) Enredo and pecan: genome-wide mammalian consistency-based
Bentley,D.R. et al. (2008) Accurate whole human genome sequencing using reversible multiple alignment with paralogs. Genome Res., 18, 1814–1828.
terminator chemistry. Nature, 456, 53–59. Pleasance,E.D. et al. (2010a) A comprehensive catalogue of somatic mutations from a
Brent,R.P. (1973) Algorithms for Minimization without Derivatives. Prentice-Hall, human cancer genome. Nature, 463, 191–196.
Englewood Cliffs, New Jersey. Pleasance,E.D. et al. (2010b) A small-cell lung cancer genome with complex signatures
Browning,B.L. and Yu,Z. (2009) Simultaneous genotype calling and haplotype phasing of tobacco exposure. Nature, 463, 184–190.
improves genotype accuracy and reduces false-positive associations for genome- Roach,J.C. et al. (2010) Analysis of genetic inheritance in a family quartet by whole-
wide association studies. Am. J. Hum. Genet., 85, 847–861. genome sequencing. Science, 328, 636–639.
Conrad,D. et al. (2011) Variation in genome-wide mutation rates within and between Robison,K. (2010) Application of second-generation sequencing to cancer genomics.
human families. Nat. Genet., 43, 712–714. Brief. Bioinformatics, 11, 524–534.
Danecek,P. et al. (2011) The variant call format and vcftools. Bioinformatics, 27, Schaid,D.J. et al. (2002) Score tests for association between traits and haplotypes when
2156–2158. linkage phase is ambiguous. Am. J. Hum. Genet., 70, 425–434.
Depristo,M.A. et al. (2011) A framework for variation discovery and genotyping using Shah,S.P. et al. (2009) Mutational evolution in a lobular breast tumour profiled at single
next-generation DNA sequencing data. Nat. Genet., 43, 491–498. nucleotide resolution. Nature, 461, 809–813.
Drmanac,R. et al. (2010) Human genome sequencing using unchained base reads on Yi,X. et al. (2010) Sequencing of 50 human exomes reveals adaptation to high altitude.
self-assembling DNA nanoarrays. Science, 327, 78–81. Science, 329, 75–78.
