MAQ - Heng Li
Mapping short DNA sequencing reads and calling variants using mapping quality scores
Heng Li, Jue Ruan and Richard Durbin Genome Res. 2008 18: 1851-1858 originally published online August 19, 2008 Access the most recent version at doi:10.1101/gr.078212.108
Supplemental material: http://genome.cshlp.org/content/suppl/2008/09/26/gr.078212.108.DC1.html This article cites 29 articles, 16 of which can be accessed free at: http://genome.cshlp.org/content/18/11/1851.full.html#ref-list-1 Article cited in: http://genome.cshlp.org/content/18/11/1851.full.html#related-urls
Freely available online through the Genome Research Open Access option. This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 3.0 Unported License), as described at http://creativecommons.org/licenses/by-nc/3.0/.
Downloaded from genome.cshlp.org on June 21, 2012 - Published by Cold Spring Harbor Laboratory Press
Resource
Mapping short DNA sequencing reads and calling variants using mapping quality scores
Heng Li,1 Jue Ruan,2 and Richard Durbin1,3
1The Wellcome Trust Sanger Institute, Hinxton CB10 1SA, United Kingdom; 2Beijing Genomics Institute, Chinese Academy of Science, Beijing 100029, China
New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.
[Supplemental material is available online at www.genome.org. Short-read sequences have been deposited in the European Read Archive (ERA) under accession no. ERA000012 (ftp://ftp.era.ebi.ac.uk/ERA000012/).]
3 Corresponding author. E-mail [email protected]; fax 44-1223496802. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.078212.108. Freely available online through the Genome Research Open Access option.
18:1851–1858 ©2008 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/08; www.genome.org
The advent of novel sequencing technologies such as 454 Life Sciences (Roche) (Margulies et al. 2005), Illumina (formerly known as Solexa sequencing), and Applied Biosystems SOLiD opens up opportunities for a variety of biological applications, including resequencing (Bentley 2006; Hillier et al. 2008), ChIP-seq (Barski et al. 2007; Johnson et al. 2007; Robertson et al. 2007), gene expression, miRNA discovery, DNA methylation studies, cancer genome research, and whole-transcriptome sequencing. Most of these applications rely on fast and accurate read mapping, and some of them, in particular resequencing, require reliable SNP calling. Meeting these requirements is essential to realize the strength of the new sequencing technologies.
Several of these technologies produce tens of millions of short reads, currently typically 30–40 bp long, in a single run. Mapping the enormous numbers of short reads to the reference genome poses serious challenges to alignment programs. These challenges come not only from the requirement for highly efficient algorithms but also from the need for accuracy. Whereas existing alignment algorithms (Altschul et al. 1997; Buhler 2001; Ning et al. 2001; Kent 2002; Schwartz et al. 2003; Wu and Watanabe 2005) can be effectively adapted to achieve efficiency, the requirement of accuracy is subtle. Most genomes contain at least some sequence that is repetitive or close to repetitive on the length scale of the reads. As a consequence, some reads will map equally well to multiple positions. Furthermore, one or two mutations or sequencing errors in a short read may lead to its mapping to the wrong location. It is possible to act conservatively by discarding reads that map ambiguously at some level, but this leaves no information in the repetitive regions and it also discards data, reducing coverage in an uneven fashion, which may complicate the calculation of coverage.
An alternative solution to handling these ambiguities is to keep all the reads that can be mapped and to evaluate for each read the likelihood that it has been wrongly positioned. Poor alignments can still be discarded later. This strategy essentially resembles phred's (Ewing et al. 1998; Ewing and Green 1998) strategy for base-calling from capillary reads. In a capillary read, there are frequently low-quality regions. Phred does not discard these regions in the first instance. Instead, it calls each base as best as it can, and assigns a quality score that encodes the probability that the base is wrongly called. This per-base quality score is more informative and helpful than simply discarding poor data (Durbin and Dear 1998). Similarly, if the posterior error probability of each read alignment can be calculated, more information will be retained than if all poor data are discarded. Here, we show how to calculate the error probability of a read mapping.
We also introduce a new statistical model for consensus genotype calling and subsequent SNP calling. For capillary reads, two different approaches have previously been taken to calling SNPs. The first type of approach works on PCR resequencing data from diploid samples. These algorithms directly examine chromatogram trace files and detect variants by extracting or comparing signals in the peaks of traces. The most widely used software includes PolyPhred (Stephens et al. 2006), SNPdetector (Zhang et al. 2005), and novoSNP (Weckx et al. 2005), each of which can call the genotype of the sample as well as detect variants. The second type of approach works for clone-based data. These methods are usually built upon phred base calls and detect variants by detecting base-pair differences between a read from a single haplotype and the reference sequence. Two representative programs of this type are ssahaSNP (Ning et al. 2001) and PolyBayes (Marth et al. 1999). While ssahaSNP uses a heuristic rule known as the
neighborhood quality standard (NQS) (Altshuler et al. 2000), PolyBayes develops an explicit statistical framework to model variants. All new sequencing technologies are shotgun methods that give sequences derived from a single molecule sampled from a larger population. (Current methods amplify the starting template by some form of PCR, but true single molecule methods are expected in the future.) This means the methods for calling variants from new technology data are most closely related to the second group described above, including ssahaSNP and PolyBayes. However, because of sampling and error rate, we need to combine data from multiple reads. In practice, errors at a particular site are correlated, and we must take this correlation into account. This is analogous to calling a consensus from a sequence assembly, and we propose a Bayesian approach to this issue that is related to that used in assembly software CAP3 (Huang and Madan 1999).
In summary, this article presents methods and software for mapping short sequence reads to a reference genome, calculating the probability of a read alignment being correct, and consensus genotype calling with a model that incorporates correlated errors and diploid sampling. The applicability and accuracy of the methods are evaluated based on both real data from the bacterium Salmonella paratyphi and simulated data from the diploid human X chromosome.
Results
Overview of MAQ algorithms
MAQ is a program that rapidly aligns short reads to the reference genome and accurately infers variants, including SNPs and short indels, from the alignment. At the alignment stage, MAQ first searches for the ungapped match with lowest mismatch score, defined as the sum of qualities at mismatching bases. To speed up the alignment, MAQ only considers positions that have two or fewer mismatches in the first 28 bp (default parameters). Sequences that fail to reach a mismatch score threshold but whose mate pair is mapped are searched with a gapped alignment algorithm in the regions defined by the mate pair. To evaluate the reliability of alignments, MAQ assigns each individual alignment a phred-scaled quality score (capped at 99), which measures the probability that the true alignment is not the one found by MAQ. MAQ always reports a single alignment, and if a read can be aligned equally well to multiple positions, MAQ will randomly pick one position and give it a mapping quality zero. Because their mapping score is set to zero, reads that are mapped equally well to multiple positions will not contribute to variant calling. However, they do give information on copy number of repetitive sequences and on the fraction of reads that can be aligned to the genome, and can easily be filtered out for downstream analysis if desired. Mapping quality scores and mapping all reads that match the genome even if repetitive are where MAQ differs from most other alignment programs.
MAQ fully utilizes the mate-pair information of paired reads. It is able to use this information to correct wrong alignments, to add confidence to correct alignments, and to accurately map a read to repetitive sequences if its mate is confidently aligned. With paired-end reads, MAQ also finds short insertions/deletions (indels) from the gapped alignment described above.
At the SNP calling stage, MAQ produces a consensus genotype sequence from the alignment. The consensus sequence is inferred from a Bayesian statistical model, and each consensus genotype is associated with a phred quality that measures the probability that the consensus genotype is incorrect. Potential SNPs are detected by comparing the consensus sequence to the reference and can be further filtered by a set of predefined rules. These rules are designed to achieve the best performance on deep human resequencing data and aim to compensate for simplifications and assumptions used in the statistical model (e.g., treating neighbor positions independently).
Implementation
We implemented the software MAQ to align short reads and call genotypes based on the algorithm described in the Methods section. MAQ consists of a set of related programs that are compiled into a single binary executable. It is able to map reads, call consensus sequences including SNP and indel variants, simulate diploid genomes and read sequences, and post-process the results in various ways. MAQ also has an option to process Applied Biosystems SOLiD data that uses two-base color-space encoding. Further details are available from the documentation at the MAQ website.
MAQ is easy to use. For bacterial genomes, alignments and variant calling can be done with a single command line, taking a few minutes on a laptop. In addition, MAQ comes with a compact and fast OpenGL-based read alignment viewer, MAQview, which shows the read alignments, base qualities, and mapping qualities in a graphical interface.
Both MAQ and MAQview are designed with genome-wide human resequencing in mind. First, the read alignment, which is the slowest step in the whole pipeline, can be divided into small tasks and parallelized on a modern computer cluster using less than 1 GB of memory for each processor core. The separate subparts of the alignment can then be merged together to give the final alignment. Second, the read alignments are stored in a binary compressed file. Text-based information is only extracted when necessary. This strategy saves disk space by a factor of three to five. Third, a novel technique is implemented to index the compressed alignment file, which enables swift retrieval of reads in any region of the reference sequence. Viewing the alignments of a human-sized genome is as fast as viewing those of a single BAC sequence. As a whole, MAQ and MAQview provide an efficient suite for managing data from Illumina sequencing.
MAQ and MAQview are implemented in C/C++ with auxiliary tools in Perl. They have been extensively evaluated on large-scale simulated data and real data and have been tested by users from various research groups. MAQ software is freely distributed under the GNU General Public License (GPL). The project home page is at http://maq.sourceforge.net.
Figure 1. Distribution of mapping qualities, consensus qualities, true alignment error rate, and true consensus error rate. The red line shows the fraction of reads whose mapping qualities fall in each interval. (Pink line) The fraction of consensus genotypes whose consensus qualities fall in each interval; (blue line) the true alignment error rate of reads in each interval; (green line) the true consensus error rate of reads in each interval. (A) Reads are aligned without using mate-pair information. Single-end alignments do not contain enough information for MAQ to assign mapping quality larger than 90; therefore, the data in the top bin are missing. (B) Reads are aligned using mate-pair information.
Figure 2. Accuracy of variant calling. In the figure, filtered regions are regions covered by three or fewer reads or by no reads with mapping quality higher than 60. For substitutions, FP equals the number of positions called as different from homozygous reference that in fact should be identical to the reference according to the simulation, divided by the total number of MAQ substitution calls; FN equals the number of positions that are different from the reference according to the simulation but are missed by MAQ, divided by the total number of mutations added in the simulation. For indels, FP equals the number of indel calls within 5-bp flanking regions of a true indel, divided by the total number of MAQ indel calls; FN equals the number of true indel calls missed by MAQ, divided by the total number of indels in simulation. (A) Variants are called based on single-end alignment. (B) Variants are called based on paired-end alignment. (C) Theoretical accuracy of k-allele method, where we call an allele as long as at least k reads are supporting the allele, assuming all reads are correctly aligned (see also Supplemental material).
comparison to the 136,012 true substitutions in the simulation. However, in real data the FP is higher and in some applications, such as in the study of somatic mutations in cancer, the number of true SNPs will be much lower and the rate of false SNPs more critical. This simulation only gives a rough evaluation of MAQ's performance. On one hand, in the simulation process, reads are evenly distributed along the genome, no contamination exists, base qualities are accurate, and sequencing errors are entirely independent. All these factors make SNP calling simpler. The true accuracy on real data will almost always be lower than in the simulation. On the other hand, although errors are independent, we use a dependent model to infer the consensus. Using an independent model would achieve higher accuracy for simulated data. Moreover, we were using the same set of filters across all depths. Adjusting the thresholds in the filters might help to reach a better balance point between FN and FP at different depths.
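The FP and FN used for substitutions in this evaluation are defined in the legend of Figure 2. As a minimal sketch of those two definitions, assuming calls and simulated truth are available as sets of reference positions (the function name and the toy positions are illustrative and not part of MAQ, and genotypes are ignored for simplicity):

```python
def substitution_error_rates(called, truth):
    """Return (FP, FN) following the Figure 2 definitions for substitutions.

    called: positions called as different from homozygous reference.
    truth:  positions that differ from the reference in the simulation.
    FP is a fraction of all calls; FN is a fraction of all true differences.
    """
    called, truth = set(called), set(truth)
    false_calls = called - truth   # called, but identical to the reference
    missed = truth - called        # true differences that were not called
    fp = len(false_calls) / len(called) if called else 0.0
    fn = len(missed) / len(truth) if truth else 0.0
    return fp, fn


# Toy example with made-up positions.
truth = {10, 25, 40, 77}
called = {10, 25, 40, 90, 91}
fp, fn = substitution_error_rates(called, truth)
print(f"FP = {fp:.2f}, FN = {fn:.2f}")
```

Because the two rates have different denominators, tightening a filter can lower FP while raising FN, which is why both are reported together.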
After these filters, MAQ predicted two homozygous differences. Checking the capillary reads used in reference assembly confirms that the current reference is wrong at one of the homozygous sites. The other homozygous site is covered by 19 reads, with all of them identical to each other but different from the reference. This site is possibly a true mutation between the reference sequence and the Illumina-sequenced sample. As well as these two homozygous differences, MAQ also predicted four heterozygotes. All four cases look confident from read alignment and show excessively high read depth in comparison to the average depth. Three are clustered together, and it appears likely that there is an additional copy of this region that was not identified in the reference. The fourth position may also be in a duplicated region (see below).
Alignment against the same reference strain only evaluates the FP of MAQ SNP calling. To assess the FN, we aligned the reads to a previously published sequence from another reference strain, ATCC9150 (McClelland et al. 2004). We downloaded the sequence of strain ATCC9150 (AC: NC_006511) from NCBI and aligned it, using cross_match (P. Green, unpubl.; http://www.phrap.org/phredphrapconsed.html), against the AKU12601 strain with the two homozygous SNPs discovered previously masked as N. Cross_match gave seven alignments, spanning the complete ATCC9150 genome and 99.97% of the AKU12601 genome. The alignment contains 211 substitutions and 39 indels (five of them longer than 20 bp). MAQ did not give any false positives and predicted 173 true substitutions. Of the missing 38 substitutions, 35 were covered by no uniquely aligned reads, and one site was covered by only one uniquely aligned read. Discovering SNPs at these 36 sites is almost impossible with single-end short reads. Of the remaining two (38 − 36) sites, one site was called as a SNP initially but was filtered out due to low read coverage (two reads), and the other was dropped because it was covered by no read with mapping quality higher than 24 and so did not pass the filter, either. In regions passing the SNP calling filters (96.9% of the ATCC9150 genome), no SNPs were missed.
Interestingly, the four heterozygotes in the AKU12601 read mapping were no longer called as SNPs. One site became a repeat in ATCC9150, and the other three were called confident monomorphic sites with about the average read depth. This observation possibly revealed that around these three sites, there are no copy number changes be-
Discussion
MAQ is capable of human whole-genome alignments and supports SNP calling on a diploid sample. It has been used to map short sequencing reads for structural variant calling in cancer samples (Campbell et al. 2008) and for whole-genome methylation analyses (Down et al. 2008). It is able to accurately estimate the error probability of each alignment and of each consensus genotype as well. MAQ can also simulate reads from a diploid genome based on a haploid reference. Simulation suggests that 20- to 30-fold coverage is needed for achieving FNs below 1% in the nonrepeat regions of a diploid sample.
Time complexity
If we map N reads to a reference of length L and use k bits in indexing, the time complexity of the MAQ alignment algorithm is $O(c_1 N \log N + c_2 L + c_3 2^{-k} N L)$. The first term, $N \log N$, corresponds to the time spent on sorting the indexes; the second, on scanning the whole reference sequence; and the third term, on processing the alignment when there is a seed hit. In MAQ alignment, k is 24 and N is typically 2 million and therefore $2^{-k} N \approx 0.1$, but as constant $c_3$ is usually much larger than $c_2$ and the human genome has many repeats, the time spent on the last two terms is approximately equal. By default, MAQ scans the reference three times against six hash tables. It would be possible to save time by stopping the
Short reads tend to be wrongly aligned because one or two mutations or sequencing errors may make the best position wrong. When evaluating the accuracy of alignments, we have to look at the fraction of discarded reads (FD) and the fraction of wrongly aligned reads (FW) at the same time. Counting only one type of error might be misleading. While on simulated data it is possible to estimate both FD and FW of alignments, on real data we cannot calculate FW as we do not know what the correct alignment is. As a consequence, we cannot directly measure the accuracy of the alignment using real data. To see what alignment strategy works best, we must evaluate a measurable outcome from the alignment, such as the accuracy of SNP calls, structural variations, or the agreement between expression profiling and microarray results. The criteria may vary with different applications.
In resequencing, accuracy can be measured by the SNP accuracy, which, again, should be measured by the fraction of missing polymorphic sites (FN) and the fraction of wrong calls (FP) at the same time. We can always trade one type of error for the other, and therefore once again counting only one type of error is misleading. Unlike in an alignment, both FP and FN of SNPs on real data can be estimated from other sources of data. FP can be evaluated by capillary resequencing or genotyping a small subset of SNP calls. FN can be estimated by comparing SNP calls to whole-genome chip-genotyping results. The fraction of chip-genotyping polymorphic sites that are not found is the FN. It should be noted that such a fraction is only the FN on the sites where probes can be designed for the genotyping microarray. These sites tend to be unique in the reference genome and are usually easier to find by short-read resequencing. The overall FN across the whole genome is higher.
In resequencing, it is also a good idea to explicitly define the resequenceable regions (or the regions where SNPs can be confidently called).
We want to distinguish low SNP-density regions from hard-to-resequence regions. Using MAQ, the fraction of the human genome that is resequenceable with 35-bp reads is 85%, and with read pairs separated by 170 bp it is 93%. Achieving higher coverage would require a mixture of varying insert sizes and longer reads.
Methods
Single end read mapping
To map reads efficiently, MAQ first indexes read sequences and scans the reference genome sequence to identify hits that are extended and scored. With the Eland-like (A.J. Cox, unpubl.) hashing technique, MAQ, by default, guarantees to find alignments with up to two mismatches in the first 28 bp of the reads. MAQ maps a read to a position that minimizes the sum of quality values of mismatched bases. If there are multiple equally best positions, then one of them is chosen at random.
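The placement rule in the preceding paragraph (minimize the sum of quality values at mismatched bases, and pick at random among equally good positions) can be illustrated with a brute-force sketch. This is not MAQ's implementation: it scans every forward-strand position instead of using hashed seeds, and all names and sequences are illustrative assumptions.

```python
import random


def mismatch_score(read, quals, ref, pos):
    """Sum of base qualities at mismatching bases for an ungapped placement."""
    window = ref[pos:pos + len(read)]
    return sum(q for r, q, x in zip(read, quals, window) if r != x)


def best_hit(read, quals, ref, rng=None):
    """Exhaustively find the placement with the lowest mismatch score.

    Returns (position, score, number_of_equally_good_positions).
    Ties are broken at random; positions are 0-based, forward strand only.
    """
    rng = rng or random.Random(0)
    scored = [(mismatch_score(read, quals, ref, p), p)
              for p in range(len(ref) - len(read) + 1)]
    best = min(score for score, _ in scored)
    ties = [p for score, p in scored if score == best]
    return rng.choice(ties), best, len(ties)


ref = "ACGTACGTTTGACCA"
quals = [30, 30, 20, 10, 30]
print(best_hit("TTGAC", quals, ref))  # exact match
print(best_hit("TTGTC", quals, ref))  # one mismatch at a low-quality base
```

Weighting mismatches by base quality means a placement that disagrees with the reference only at unreliable bases is preferred over one that disagrees at a confidently called base.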
In this article, we will call a potential read alignment position a hit. The algorithm MAQ uses to find the best hit is quite similar to the one used in Eland. It builds multiple hash tables to index the reads and scans the reference sequence against the hash tables to find the hits. By default, six hash tables are used, ensuring that a sequence with two mismatches or fewer will be hit. The six hash tables correspond to six noncontiguous seed templates (Buhler 2001; Ma et al. 2002). Given 8-bp reads, for example, the six templates are 11110000, 00001111, 11000011, 00111100, 11001100, and 00110011, where nucleotides at 1 will be indexed while those at 0 are not. By default, MAQ indexes the first 28 bp of the reads, which are typically the most accurate part of the read. In alignment, MAQ loads all reads into memory and then applies the first template as follows. For each read, MAQ takes the nucleotides at the 1 positions of the template, hashes them into a 24-bit integer, and puts the integer together with the read identifier into a list. When all the reads are processed, MAQ orders the list based on the 24-bit integers, such that reads with the same hashing integer are grouped together in memory. Each integer and its corresponding region are then recorded in a hash table with the integer as the key. We call this process indexing. At the same time that MAQ indexes the reads with the first template, it also indexes the reads with the second template that is complementary to the first one. Taking two templates at a time helps the mate-pair mapping, which will be explained in the section below. After the read indexing with the two templates, the reference will be scanned base by base on both forward and reverse strands. Each 28-bp subsequence of the reference will be hashed through the two templates used in indexing and will be looked up in the two hash tables, respectively. 
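The guarantee stated for the six 8-bp example templates can be checked by enumeration. The sketch below is an illustration only (the helper names are hypothetical); it tests, for every placement of a given number of mismatches, whether at least one template has all of its indexed positions free of mismatches.

```python
from itertools import combinations

# The six example templates given in the text for 8-bp reads.
TEMPLATES = ["11110000", "00001111", "11000011",
             "00111100", "11001100", "00110011"]


def seed_hit(template, mismatch_positions):
    """A template produces a seed hit when none of its indexed ('1')
    positions carries a mismatch."""
    return all(template[i] == "0" for i in mismatch_positions)


def fraction_found(n_mismatches, templates=TEMPLATES):
    """Fraction of mismatch placements caught by at least one template."""
    length = len(templates[0])
    placements = list(combinations(range(length), n_mismatches))
    found = sum(any(seed_hit(t, m) for t in templates) for m in placements)
    return found / len(placements)


for n in range(4):
    print(n, "mismatches:", fraction_found(n))
```

The templates correspond to all six ways of choosing two of the four 2-bp blocks, so any two mismatches leave at least two blocks untouched and the matching template still hits. The three-mismatch fraction of this 8-bp toy need not equal the figure quoted for the default 28-bp seed.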
If a hit is found to a read, MAQ will calculate the sum of qualities of mismatched bases q over the whole length of the read, extending out from the 28-bp seed without gaps (the current implementation has a read length limit of 63 bp). MAQ then hashes the coordinate of the hit and the read identifier into another 24-bit integer h and scores the hit as $q \cdot 2^{24} + h$. In this score, h can be considered as a pseudorandom number, which differentiates hits with identical q: If there are multiple hits with the same q, the hit with the smallest h will be identified as the best, effectively selecting randomly from the candidates. For each read, MAQ only holds in memory the position and score of its two best scored hits and the number of 0-, 1-, and 2-mismatch hits in the seed region. When the scan of the reference is complete, the next two templates are applied and the reference will be scanned once again until no more templates are left. Using six templates guarantees to find seed hits with no more than two mismatches, and it also finds 57% of hits with three mismatches. In addition, MAQ can use 20 templates to guarantee finding all seed hits with three mismatches at the cost of speed. In this configuration, 64% of seed hits with four mismatches are also found, though our experience is that these hits are not useful in practice.
reference and an ungapped exhaustive alignment is performed. A practical model for alignment with heuristic algorithms will be presented in the Supplemental material. Suppose we have a reference sequence x and a read sequence z. On the assumption that sequencing errors are independent at different sites of the read, the probability p(z|x,u) of z coming from the position u equals the product of the error probabilities of the mismatched bases at the aligned position. For example, if read z mapped to position u has two mismatches, one with phred base quality 20 and the other with 10, then $p(z|x,u) = 10^{-(20+10)/10} = 0.001$.
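The combined hit score $q \cdot 2^{24} + h$ described earlier in this section, together with the rule of keeping only the two best hits per read, can be sketched as follows. The hash used by MAQ is not specified here, so `zlib.crc32` is a stand-in chosen purely for illustration, and the function names are hypothetical.

```python
import zlib

SHIFT = 24
MASK = (1 << SHIFT) - 1


def pseudo_random_24bit(position, read_id):
    """Hypothetical stand-in for MAQ's hash of (hit coordinate, read id)."""
    return zlib.crc32(f"{read_id}:{position}".encode()) & MASK


def hit_score(q, position, read_id):
    """Combined score q * 2^24 + h; a lower score is a better hit.

    Because h < 2^24, a hit with a smaller q always wins, and h only
    decides between hits that share the same q.
    """
    return (q << SHIFT) + pseudo_random_24bit(position, read_id)


def two_best(hits, read_id):
    """hits: iterable of (position, q). Keep only the two lowest scores,
    mirroring the constant per-read memory described in the text."""
    best = []
    for position, q in hits:
        best.append((hit_score(q, position, read_id), position))
        best.sort()
        del best[2:]
    return best


hits = [(100, 30), (500, 10), (900, 10), (1200, 40)]
for score, position in two_best(hits, read_id=7):
    print(position, "q =", score >> SHIFT, "h =", score & MASK)
```

Packing q and h into one integer lets a single comparison both rank hits by quality sum and break ties in a way that is reproducible for a given read and position.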
To calculate the posterior probability $p_s(u|x,z)$, we assume a uniform prior distribution $p(u|x)$, and applying the Bayesian formula gives

$$p_s(u|x,z) = \frac{p(z|x,u)}{\sum_{v=1}^{L-l+1} p(z|x,v)} \tag{1}$$

where $L = |x|$ is the length of x and $l = |z|$ is the length of the read. Scaling $p_s$ in the phred way, we get the mapping quality of the alignment:

$$Q_s(u|x,z) = -10 \log_{10}\left[1 - p_s(u|x,z)\right].$$

The calculation of Equation 1 requires summing over all positions on the reference. It is impractical to calculate the sum given a human-sized genome. In practice, we approximate $Q_s$ as

$$Q_s = \min\left\{ q_2 - q_1 - 4.343 \log n_2,\; 4 + (3 - k')(\bar{q} - 14) - 4.343 \log p_1(3 - k', 28) \right\},$$

where $q_1$ is the sum of quality values of mismatches of the best hit, $q_2$ is the corresponding sum for the second best hit, $n_2$ is the number of hits having the same number of mismatches as the second best hit, $k'$ is the minimum number of mismatches in the 28-bp seed, $\bar{q}$ is the average base quality in the 28-bp seed, 4.343 is 10/log 10 (with log the natural logarithm), and $p_1(k, 28)$ is the probability that a perfect hit and a k-mismatch hit coexist given a 28-bp sequence, which can be estimated during alignment. Detailed deduction of this equation is given in the Supplemental material.
It is also worth noting that in minimizing the sum of quality values of mismatched bases, MAQ is effectively maximizing the posterior probability $p_s(u|x,z)$. This is the statistical interpretation of MAQ alignments.
On sequencing real samples, reads may also be different from the reference sequence due to the existence of sequence variants in different samples or strains. These variants behave in a similar manner to sequencing errors for mapping purposes, and therefore at the alignment stage, we should set the minimum base error probability as the rate of differences between the reference and the reads. However, this strategy is an approximation. When there are differences between the reference and reads, the best position might consistently give wrong alignments even if there are no sequencing errors, which can invalidate the calculation of mapping qualities.
It would be possible in an iterative scheme to update the reference with an estimate of the new sample sequence from the first mapping and then remap to the updated reference.
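For a reference short enough to enumerate, the exact posterior of Equation 1 and its phred scaling can be computed directly. The sketch below assumes ungapped forward-strand placements, 0-based positions, and a cap of 99; it is the exhaustive calculation that the text calls impractical for a human-sized genome, not the approximation MAQ uses, and all names are illustrative.

```python
import math


def likelihood(read, quals, ref, pos):
    """p(z|x,u): product of error probabilities 10^(-Q/10) at mismatches."""
    window = ref[pos:pos + len(read)]
    q_sum = sum(q for r, q, x in zip(read, quals, window) if r != x)
    return 10 ** (-q_sum / 10)


def mapping_quality(read, quals, ref, u, cap=99):
    """Phred-scaled Q_s = -10 log10(1 - p_s(u|x,z)) under a uniform prior."""
    probs = [likelihood(read, quals, ref, v)
             for v in range(len(ref) - len(read) + 1)]
    # 1 - p_s is computed as (sum over other positions) / (sum over all),
    # which avoids subtracting two nearly equal numbers.
    wrong = sum(p for v, p in enumerate(probs) if v != u) / sum(probs)
    if wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(wrong)))


# Two mismatches with qualities 20 and 10, as in the worked example.
print(likelihood("ACAGA", [30, 20, 30, 10, 30], "AAAAA", 0))
# A read that matches only one place versus one that matches two.
print(mapping_quality("TTGAC", [30] * 5, "ACGTACGTTTGACCA", 8))
print(mapping_quality("ACGT", [30] * 4, "ACGTACGT", 0))
```

Note that under the exact formula a read with two equally good placements has an error probability near 0.5, a phred value of about 3, whereas MAQ as described above assigns such reads a mapping quality of zero.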
$$\alpha_{nk} = c'_{nk} \prod_{i=0}^{k-1} \varepsilon_{i+1}^{\theta^{i}} \tag{2}$$

where $\varepsilon_i$ is the ith smallest base error probability and $c'_{nk}$ is a function of $\varepsilon_i$ but varies little with $\varepsilon_i$. The only unknown parameter is $\theta$, which controls the dependency of errors. The deduction of this equation and the calculation of $c'_{nk}$ will be presented in the Supplemental material. Taking a form like Equation 2 is inspired by CAP3 (Huang and Madan 1999), where $\theta$ is arbitrarily set to 0.5. In principle, $\theta$ can be estimated from real data. In practice, however, the estimate is complicated by the requirement of a large data set where SNPs are known, by the inaccuracy of sequencing qualities, by the dependencies of mapping qualities, and also by the approximation made to derive the equation. To estimate $\theta$, we just tried different values and selected the one that gave the best final genotype calls. We found $\theta = 0.85$ is a reasonable value for Illumina Genetic Analyzer data.
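To see how the dependency parameter changes the weight given to repeated errors, the product term of Equation 2 can be evaluated on its own. The sketch below assumes the product runs over the k smallest error probabilities with exponents $\theta^0, \theta^1, \dots, \theta^{k-1}$, and it leaves out the factor $c'_{nk}$, whose form is given only in the Supplemental material; the function name is hypothetical.

```python
def correlated_error_product(error_probs, k, theta=0.85):
    """Product over i = 0..k-1 of eps_(i+1) ** (theta ** i), where
    eps_1 <= eps_2 <= ... are the sorted base error probabilities.

    The c'_nk factor of Equation 2 is deliberately omitted (assumption:
    only the product term is illustrated here).
    """
    eps = sorted(error_probs)
    if k > len(eps):
        raise ValueError("k cannot exceed the number of bases")
    result = 1.0
    for i in range(k):
        result *= eps[i] ** (theta ** i)
    return result


probs = [0.01, 0.001, 0.1]  # phred 20, 30, and 10
for theta in (1.0, 0.85, 0.5, 0.0):
    print(theta, correlated_error_product(probs, k=2, theta=theta))
```

With $\theta = 1$ the term reduces to the plain product expected for independent errors, and with $\theta = 0$ only the most reliable base counts, so intermediate values such as 0.85 discount each additional error rather than multiplying probabilities at full strength.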
From a known sequence, paired-end reads can be simulated with insert sizes drawn from a normal distribution and with base qualities drawn from the empirical distribution estimated from real sequence data. Sequencing errors are introduced based on the base quality. With sufficiently large data, we are able to estimate the position-specific distributions of base qualities and the correlation between adjacent qualities as well. An order-one Markov chain is constructed, based on these statistics, to capture the fact that low-quality bases tend to appear at the 3′-end of a read and to appear successively along a read.
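The simulation scheme in this paragraph can be sketched in a few lines. This is a simplified illustration rather than MAQ's simulator: the quality chain here is not position-specific, the transition table is invented for the example instead of being estimated from real data, and all names and parameter values are assumptions.

```python
import random

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Made-up quality model: qualities can only stay or drop along the read.
FIRST_Q = {30: 1.0}
NEXT_Q = {30: {30: 0.9, 20: 0.1}, 20: {20: 0.8, 5: 0.2}, 5: {5: 1.0}}


def simulate_read(template, first_q, next_q, rng):
    """Draw qualities from an order-one Markov chain, then replace each
    base with a different one with probability 10^(-Q/10)."""
    quals, bases = [], []
    for i, base in enumerate(template):
        dist = first_q if i == 0 else next_q[quals[-1]]
        q = rng.choices(list(dist), weights=list(dist.values()))[0]
        quals.append(q)
        if rng.random() < 10 ** (-q / 10):
            base = rng.choice([b for b in BASES if b != base])
        bases.append(base)
    return "".join(bases), quals


def simulate_pair(genome, read_len, mean_insert, sd_insert, rng):
    """Sample a fragment with a normally distributed insert size and read
    both ends (the second mate from the reverse strand)."""
    insert = round(rng.gauss(mean_insert, sd_insert))
    insert = min(len(genome), max(read_len, insert))
    start = rng.randrange(len(genome) - insert + 1)
    fragment = genome[start:start + insert]
    mate1 = simulate_read(fragment[:read_len], FIRST_Q, NEXT_Q, rng)
    reverse_end = fragment[-read_len:].translate(COMPLEMENT)[::-1]
    mate2 = simulate_read(reverse_end, FIRST_Q, NEXT_Q, rng)
    return start, insert, mate1, mate2


rng = random.Random(1)
genome = "".join(rng.choice(BASES) for _ in range(500))
start, insert, (read1, qual1), (read2, qual2) = simulate_pair(
    genome, read_len=35, mean_insert=170, sd_insert=20, rng=rng)
print(start, insert, read1, qual1)
```

Because each quality depends on the previous one, low values cluster together and accumulate toward the end of the read, which is the behavior the order-one chain is meant to reproduce.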
Received March 7, 2008; accepted in revised form August 13, 2008.
Acknowledgments
We thank Tony Cox, Keira Cheetham, Richard Carter, and David Bentley from Illumina for beneficial discussions on consensus genotype calling. We thank Julian Parkhill, Kathryn Holt, and the Sanger Institute pathogen sequencing unit for providing the S. paratyphi sequence, and the sequencing team and the sequencing informatics group for generating and processing the data. We also thank Ken Chen, David Spencer, LaDeana Hillier, and all the MAQ users for their valuable feedback as MAQ has matured; Klaudia Walter and the members of the Durbin research group for their helpful comments; and the anonymous referees for their helpful suggestions. This work was funded by the Wellcome Trust.