Lecture2-High Throughput Sequencing-2019

The document summarizes high throughput sequencing platforms. It provides a table comparing the key specifications of different Illumina and non-Illumina platforms including reads per run, read length, run time, yield, sequencing rate, cost per gigabase, and machine cost. It then provides more details on Illumina sequencing technology including its use of reversible terminators, cluster-based detection, and image processing and base calling steps. Recent Illumina innovations discussed include patterned flow cells, two-channel sequencing, and the NovaSeq platform.

Genetics 211 - 2019

Lecture 2
High Throughput Sequencing
Gavin Sherlock
January 22nd 2019
Sequencing Platforms

| Platform | Reads per run (M) | Read length (bp) | Run time (d) | Yield (Gb) | Rate (Gb/d) | Cost per Gb ($) | hg-30x ($) | Machine ($) |
|---|---|---|---|---|---|---|---|---|
| iSeq 100 1fcell | 4 | 250* | 0.77-1.28 | 1.2-2 | 1.56 | 521 | $62,500 | 19.9K |
| MiniSeq 1fcell | 25 | 150* | 1 | 7.5 | 7.5 | 233 | $28,000 | 49.5K |
| MiSeq 1fcell | 25 | 300* | 2 | 15 | 7.5 | 66 | $8,000 | 99K |
| NextSeq 550 1fcell | 400 | 150* | 1.2 | 120 | 100 | 50 | $5,000 | 250K |
| HiSeq 2500 RR 2fcells | 600 | 100* | 1.125 | 120 | 106.6 | 51.2 | $6,144 | 740K |
| HiSeq 2500 V3 2fcells | 3000 | 100* | 11 | 600 | 55 | 39.1 | $4,692 | 690K |
| HiSeq 2500 V4 2fcells | 4000 | 125* | 6 | 1000 | 166 | 31.7 | $3,804 | 690K |
| HiSeq 4000 2fcells | 5000 | 150* | 3.5 | 1500 | 400 | 20.5 | $2,460 | 900K |
| HiSeq X 2fcells | 6000 | 150* | 3 | 1800 | 600 | 7.08 | $850 | 1M |
| NovaSeq S1 2fcells | 3300 | 150* | 1.66 | 1000 | 600 | 18.75 | $1,800 | 999K |
| NovaSeq S2 2fcells | 6600 | 150* | 1.66 | 2000 | 1200 | 17.5 | $1,564 | 999K |
| NovaSeq S4 2fcells | 20000 | 150* | 1.83 | 6000 | 3600 | 10.67 | $700 | 999K |
| PacBio RSII | 0.88 | 20K** | 4.3 | 12 | 2.8 | 200 | $24,000 | 695K |
| PacBio Sequel 16cells v6.0 2018 | 6.4 | 45K** | 6.6 | 160-320 | 24-48 | 80 | $9,600 | 350K |
| PacBio Q1 2019 | -- | 45K** | -- | 192 | -- | 6.6 | $1,000 | 350K |
| SmidgION 1fcell | -- | 500-2,000,000 | TBC | TBC | TBC | TBC | -- | -- |
| Flongle 1fcell | -- | 500-2,000,000 | 1 | 0.1/1.8-3.3 | -- | 90-30 | $2,700 - $8,100 | -- |
| MinION Mk 1B 1fcell | -- | 500-2,000,000 | 3 | 17/30-50 | -- | 50-12.5 | $1,125 - $2,700 | -- |
| GridION X5 5fcells | -- | 500-2,000,000 | 3 | 85/150-250 | -- | 47.5/15.70-7 | $675 - $1,575 | -- |
| PromethION 48fcells | -- | 500-2,000,000 | 2.6 | 3000/7000-15000 | -- | 14/7-3.5 | $315 - $1,400 | -- |

https://docs.google.com/spreadsheets/d/1GMMfhyLK0-q8XkIo3YxlWaZA5vVMuhU1kg41g4xLkXc/edit?hl=en_GB&hl=en_GB#gid=515231169
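
The derived columns of the table can be roughly reproduced from the primary ones. The sketch below assumes that Rate is Yield divided by Run time, and that the hg-30x price is the per-Gb price multiplied by about 120 Gb. Both are inferences from the numbers (for example 51.2 $/Gb and $6,144), not something the source states, and several rows are rounded or deviate from them.

```python
# Assumption: 120 Gb per 30x human genome is inferred from the table,
# not stated in the lecture.
GB_PER_30X_HUMAN_GENOME = 120


def rate_gb_per_day(yield_gb: float, run_time_days: float) -> float:
    """Sequencing rate, taken here as yield divided by run time."""
    return yield_gb / run_time_days


def cost_30x_genome(cost_per_gb: float) -> float:
    """Reagent cost of one 30x human genome under the 120 Gb assumption."""
    return cost_per_gb * GB_PER_30X_HUMAN_GENOME


# (yield in Gb, run time in days, $ per Gb) copied from two table rows
rows = {"HiSeq 2500 RR": (120, 1.125, 51.2), "HiSeq X": (1800, 3, 7.08)}
for name, (yield_gb, days, per_gb) in rows.items():
    rate = rate_gb_per_day(yield_gb, days)
    cost = cost_30x_genome(per_gb)
    print(f"{name}: {rate:.1f} Gb/d, ${cost:,.0f} per 30x genome")
```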
Illumina:
Flow Cells with “Molecular Colonies”
• flow cell with randomly spaced molecular clusters
• spacing depends on initial seeding of the single molecules onto the flow cell
[Figure: image of clusters on a flow cell, labeled “1µM”]
Detection, Chemistry

• Massively Parallel Detection on immobilized “molecular colonies”
• Means you have to measure (image) every cycle, instead of the Sanger model (letting reaction go to completion and then separating products by size)
• Requires specially designed chemistry, using reversible dye-terminators and a polymerase
Illumina Sequencing Technology
Robust Reversible Terminator Chemistry Foundation
[Figure: workflow from DNA (0.1-1.0 µg) through sample preparation, single molecule array and cluster growth to sequencing (cycles 1-9 read T G C T A C G A T …), with image acquisition followed by base calling.]

Illumina Sequence Visualization

250+ Million Clusters Per Flow Cell
[Figure: images of clusters with 20 micron and 100 micron scale bars.]
Illumina Sequencing: Reversible Terminators
[Figure: a nucleotide (PPP) carrying a fluorophore on a cleavable linker, with its 3’ OH blocked, is incorporated onto the free 3’ end of the DNA and detected; the 3’ block is then removed and the dye cleaved off, leaving a free 3’ end ready for the next cycle.]
Image Processing, Base Calling
• Image processing algorithms find signals in
each panel, align signals from different
panels, etc.
– Machines ship with server or small cluster that
does image analysis while run is happening
• Sequence data after base calling much
reduced in size (tens of gigabytes) => more
manageable but still large amounts that add
up over time
• Unsustainable to keep image data; people
discard the images, and just keep the
sequences (fastq format).
Recent Illumina Innovations
• Patterned flow cells (HiSeq 3000/HiSeq 4000 systems)
– Allows denser cluster spacing
– Avoids cluster overlap
– Image analysis easier
• Two-Channel SBS (NextSeq)
– Two, rather than 4 colors
– Leads to faster sequencing times
• Synthetic Long Reads
– We may discuss later, but not widely used
• MiniSeq
– 2nd generation MiSeq – smaller, cheaper, more reads
• NovaSeq – announced 2017
– Up to 10 billion reads in 2 days, per flowcell (2018)
– Towards $100 genome
• iSeq
– released last year. Benchtop ($20K) platform for 6M reads
How do we make an Illumina Genomic
DNA library?
Double-stranded genomic DNA

Fragment (Covaris)

Polish, add dA overhang

Add adaptors, size select

Sequence
Making fragments asymmetric
Fragmented, end polished, phosphorylated, dA overhang DNA sample

5'-pNNNN.........NNNNA-3'
3'-ANNNN.........NNNNp-5'

Genomic Y-adapter

5'-ACACTCTTTCCCTACACGAC
GCTCTTCCGATCT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGp-5'

Ligate

5'-ACACTCTTTCCCTACACGAC
GCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3’
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCG
CAGCACATCCCTTTCTCACA-5’

[Ligation product is gel purified, selecting only those products in a certain size range]
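
Because the two strands are antiparallel, the adapter sequence that follows the 3' end of the insert in the ligation product is the reverse complement of the 13-base stem at the 3' end of the top adapter strand. A short sketch to check this; the function name is illustrative:

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


stem = "GCTCTTCCGATCT"  # 3' end of the top adapter strand, copied from the slide
# Prints the sequence found immediately 3' of the insert in the ligation product
print(reverse_complement(stem))
```

When an insert is shorter than the read length, the read runs into this sequence, which is what the adapter trimming tools in the Recommended Reading look for.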
Making our genomic DNA library
asymmetric
Round 1 of PCR

5'-ACACTCTTTCCCTACACGAC
TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
GCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3’
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCG
5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
CAGCACATCCCTTTCTCACA-5’

Products of first round:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3’
3’-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5’
Finishing and Sequencing the Library
Rounds 2-18
TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

Product of PCR amplification

5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3’-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

[Anneal to flow cell. Perform cluster generation]

Genomic DNA Sequencing Primer

Nextera Library Preparation
[Figure: genomic DNA is mixed with transposomes; tagmentation fragments the DNA and attaches Adaptor 1 and Adaptor 2; (suppression) PCR with Primer 1 and Primer 2 then follows.]
How Much Sequence?
• HiSeq 2500 can give ~250 million read pairs/lane with
paired end 100bp reads
• This is 50Gb of sequence (250 million x 2 x 100bp)
• This is ~4,000x coverage of yeast (12Mb).
• This is an obvious waste of resources (it’s also ~500x
C. elegans, and ~500x D. melanogaster)
• How can we sequence on a HiSeq and not waste all
these resources when sequencing smaller genomes?
• HiSeq 3000/HiSeq 4000
– Patterned flow cells (not random clusters)
– Almost twice as much data, half the time
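
The coverage arithmetic above can be written out as a small sketch. The genome size used for C. elegans (about 100 Mb) and the 100x target coverage are assumptions added for illustration; the yeast size and lane output come from the slide.

```python
def expected_coverage(n_clusters: float, read_len: int, genome_size: float,
                      paired: bool = True) -> float:
    """Average depth = total sequenced bases / genome size."""
    bases = n_clusters * read_len * (2 if paired else 1)
    return bases / genome_size


def samples_per_lane(lane_bases: float, genome_size: float,
                     target_coverage: float) -> int:
    """How many genomes of this size fit in one lane at the target depth."""
    return int(lane_bases // (genome_size * target_coverage))


lane_bases = 250e6 * 2 * 100  # one HiSeq 2500 lane, paired end 100 bp

# Genome sizes in bases; the C. elegans value is an approximate assumption
for name, size in {"S. cerevisiae": 12e6, "C. elegans": 100e6}.items():
    depth = expected_coverage(250e6, 100, size)
    print(f"{name}: ~{depth:.0f}x per lane")

# Hypothetical target of 100x per yeast sample
print(samples_per_lane(lane_bases, 12e6, 100), "yeast samples per lane at 100x")
```

Dividing the lane among many samples in this way is what the barcoding schemes on the following slides make possible.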
Multiplexed Sequencing,
using Barcodes
• Two ways to perform barcode sequencing
– In-line barcodes
• Barcode is read as part of the normal sequencing read
– Index barcodes
• Barcode is read as a third, short sequencing run (also known
as index reads)
• Can be used to run multiple samples from any
particular origin on the same lane of a HiSeq, with the
barcodes allowing the samples to be de-convoluted
afterwards.
• Barcodes should be designed so that they are
balanced in GC content, and as dissimilar as
possible. (Hamming distance > 2).
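
A barcode set can be checked against these two design rules with a few lines of code. This is a minimal sketch; the barcodes are made-up 6-mers, and the threshold of 3 is simply "Hamming distance > 2" restated.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def too_similar(barcodes, min_distance=3):
    """Return every pair of barcodes closer than min_distance."""
    return [(a, b, hamming(a, b))
            for a, b in combinations(barcodes, 2)
            if hamming(a, b) < min_distance]


barcodes = ["ACGTAC", "TGCATG", "ACGTTG", "GATCCA"]  # hypothetical examples
for bc in barcodes:
    print(bc, f"GC = {gc_fraction(bc):.2f}")
print("pairs that are too close:", too_similar(barcodes))
```

With a minimum distance of 3, a single sequencing error in a barcode still leaves it closer to its true barcode than to any other.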
In-line Barcode Sequencing
Index barcoding
Unique Molecular Identifiers (UMIs)

• During the PCR step, each template gets amplified many times
• If your library is of insufficient complexity, or you
overamplify, you may have PCR duplicates
• You want to make independent observations, not
redundant observations
• When sequencing to high coverage, you may have
identical, but non-redundant observations.
• Want to be able to distinguish these.
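
One way to make this distinction is to group reads by mapping position and UMI: reads sharing both are treated as PCR duplicates, while reads at the same position with different UMIs count as independent molecules. The sketch below assumes reads are already mapped and that UMIs match exactly; the tuple layout and names are illustrative, and real tools also tolerate sequencing errors in the UMI.

```python
from collections import defaultdict


def group_by_molecule(reads):
    """Group read IDs by (chromosome, start, strand, UMI)."""
    groups = defaultdict(list)
    for read_id, chrom, start, strand, umi in reads:
        groups[(chrom, start, strand, umi)].append(read_id)
    return dict(groups)


reads = [
    ("r1", "chrI", 1000, "+", "ACGTAC"),
    ("r2", "chrI", 1000, "+", "ACGTAC"),  # same position, same UMI: PCR duplicate
    ("r3", "chrI", 1000, "+", "TTGCAA"),  # same position, different UMI: independent
    ("r4", "chrI", 2500, "-", "ACGTAC"),  # different position
]
molecules = group_by_molecule(reads)
print(len(reads), "reads ->", len(molecules), "independent observations")
```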
Unique Molecular Identifiers (UMIs) using
Random Barcodes
Longer, and/or more Accurate Reads

• Insert sizes are a distribution

• Some inserts not necessarily longer than twice the read length
• What does this mean for paired end reads?

[Figure: distribution of insert sizes, x-axis “Insert Size” from 0 to 500, followed by a diagram of a read pair on an insert.]
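
When an insert is shorter than twice the read length, the two reads of a pair overlap in the middle of the fragment and can be merged into one longer read, which is the idea behind the read merging tools in the Recommended Reading. The sketch below is a simplification: it assumes error-free, exact overlaps, ignores quality scores, and does not handle inserts shorter than a single read. Real mergers score mismatches instead.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10):
    """Merge a read pair if the end of read1 overlaps the start of
    reverse-complemented read2; return None when no overlap is found."""
    r2 = reverse_complement(read2)
    # Try the longest possible overlap first
    for olen in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-olen:] == r2[:olen]:
            return read1 + r2[olen:]
    return None


# Hypothetical 30 bp insert sequenced with 2 x 20 bp reads (10 bp overlap)
fragment = "ACGTTGCAAGGCTTAACCGGATCGTACGAT"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
print(merge_pair(read1, read2))
```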
Read Error Correction
• Many approaches, and lots of available tools
• Most rely on the idea of looking for rare k-mers:

• Build up a table of all k-mers, and their frequencies.

• Consecutive k-mers that cover an error in a read should
be at lower frequency, given sufficient coverage
• Can use this to recognize errors in reads and correct them
• If done without regard to quality scores, assumes a
homogeneous sample
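
The rare k-mer idea can be sketched as follows: count every k-mer across all reads, then flag a base as suspect when every k-mer covering it is rare. The function names, the toy reads and the threshold of 2 are assumptions for illustration; published tools choose the threshold from the k-mer frequency distribution and also propose corrections.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Table of all k-mers and their frequencies across the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspect_positions(read, counts, k, min_count=2):
    """Positions in the read covered only by k-mers seen fewer than min_count times."""
    n = len(read) - k + 1
    rare = [counts[read[i:i + k]] < min_count for i in range(n)]
    suspects = []
    for pos in range(len(read)):
        covering = rare[max(0, pos - k + 1):min(n, pos + 1)]
        if covering and all(covering):
            suspects.append(pos)
    return suspects


good = "ACGTTGCAAGGCTTAA"
bad = "ACGTTGCATGGCTTAA"  # same read with a substitution at position 8
counts = kmer_counts([good] * 5 + [bad], k=4)
print(suspect_positions(bad, counts, k=4))
```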
What are the data?

• Illumina produces data in fastq format.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

@ followed by a sequence Identifier

The sequence
+ , optionally followed by a sequence Identifier
The quality scores
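
A minimal reader for this four-line layout is sketched below. It assumes each record occupies exactly four lines and that qualities are encoded as ASCII code minus 33 (Phred+33); some older Illumina pipelines used an offset of 64 instead, so the offset is a parameter. The class and function names are illustrative.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


example = io.StringIO("@read1\nACGT\n+\n!I5+\n")  # made-up record
for record in read_fastq(example):
    print(record.seq_id, record.sequence, record.qualities)
```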
Example of Illumina SeqID

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R The unique instrument name

6 Flowcell lane
73 Tile number within the flowcell
941 'x'-coordinate of the cluster within the tile
1973 'y'-coordinate of the cluster within the tile
#0 index number for a multiplexed sample (0 for no
indexing)
/1 the member of a pair, /1 or /2 (paired-end or mate-pair
reads only)
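
The fields of this identifier can be pulled apart with a regular expression. The sketch below covers only the layout shown on this slide; later Illumina software writes a different header layout, which this pattern does not handle. The field names are chosen for illustration.

```python
import re

# instrument:lane:tile:x:y#index/pair, with the /pair part optional
SEQ_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_seq_id(line: str) -> dict:
    """Split an identifier line of the form shown above into named fields."""
    match = SEQ_ID.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized identifier: {line!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "pair"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


print(parse_seq_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```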
Assessing Quality
FastQC
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
HTQC
https://sourceforge.net/projects/htqc
De novo Assembly of Short Reads

• Several methods available

• Short reads require long overlaps
• e.g., 33 bp reads must overlap by 20 bp
• end-trimming helps, to remove low quality bases.
• Most de novo short read assemblers use a k-mer
hashing based approach and de Bruijn graphs.
• The central challenge of genome assembly is
resolving repeat regions.
De novo Assembly Strategies
• Many, many different algorithms and open source (as well as
closed source) software for short read sequence assembly.
• Choice of tool depends on exactly what you are trying to
assemble:
– Genome size
– Genome complexity
– Level of polymorphism
– Genome vs. transcriptome
– Sequence coverage you have (more is generally better)
– Paired-end vs. single end (you should really have paired-end data)
• E.g.
– Velvet (Zerbino and Birney, 2008)
• Uses de Bruijn graph algorithm plus error correction
– SGA (Simpson and Durbin, 2010)
• Uses String Graph – lower memory requirements, but takes longer
– SOAPdenovo2 (Luo et al, 2012)
• Also uses de Bruijn graphs with error correction
Example of Velvet de novo Assembly
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

Sequence (7bp reads)

AGTCGAG CTTTAGA CGATGAG CTTTAGA
GTCGGG TTAGATC ATGAGGC GAGACAG
GAGGCTC ATCCGAT AGGCTTT GAGACAG
AGTCGAG TAGATCC ATGAGGC TAGAGAA
TAGTCGA CTTTAGA CCGATGA TTAGAGA
CGAGGCT AGATCCG TGAGGCT AGAGACA
TAGTCGA GCTTTAG TCCGATG GCTCTAG
TCGACGC GATCCGA GAGGCTT AGAGACA
TAGTCGA TTAGATC GATGAGG TTTAGAG
GTCGAGG TCTAGAT ATGAGGC TAGAGAC
AGGCTTT ATCCGAT AGGCTTT GAGACAG
AGTCGAG TTAGATT ATGAGGC AGAGACA
GGCTTTA TCCGATG TTTAGAG
CGAGGCT TAGATCC TGAGGCT GAGACAG
AGTCGAG TTTAGATC ATGAGGC TTAGAGA
GAGGCTT GATCCGA GAGGCTT GAGACAG

Hashing (k = 4)
Graph Building
[Figure: graph of the 4-mers found in the reads, each labeled with the number of times it was observed, drawn above the reference sequence TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG. K-mers and counts recovered from the figure:]
TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (16x), GCTT (11x), AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x)
CTTT (8x), TTTA (8x), TTAG (12x), TAGA (16x)
TGAG (9x), ATGA (8x), GATG (5x), CGAT (6x), CCGA (7x), TCCG (7x), ATCC (7x), GATC (8x), AGAT (8x)
Low-count k-mers: GATT (1x); AGAA (1x); CGAC (1x), GACG (1x), ACGC (1x); CTTC (1x), TTCA (2x), TCAG (2x), CAGA (1x)

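Simplification of Linear Stretches
[Figure: chains of k-mers with a single path through them are merged into longer nodes: TAGTCGA, CGAG, GAGGCT, GCTTTAG, TAGA, AGAGA, AGACAG, AGAT and GATCCGATGAG, plus the side branches GATT, AGAA, CGACGC and GCTCTAG. A second copy of the figure labels the tips and the bubble; GCTCTAG and GCTTTAG begin and end with the same 3-mers, so they form the two arms of the bubble.]

Error (tip and bubble) removal
[Figure: after tips and the bubble are removed, four nodes remain: TAGTCGAG, GAGGCTTTAGA, AGATCCGATGAG and AGAGACAG.]

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG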
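
The hashing, graph building and simplification steps of this example can be sketched in a few lines. This is not Velvet's implementation. It assumes nodes are k-mers, edges join k-mers that are adjacent within a read, and a simple count threshold stands in for tip and bubble removal. It also ignores reverse-complement strands. Function names are illustrative.

```python
from collections import Counter


def build_graph(reads, k, min_count=1):
    """Count k-mers, drop rare ones, and link k-mers adjacent within a read."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    kept = {kmer for kmer, c in counts.items() if c >= min_count}
    succ = {kmer: set() for kmer in kept}
    pred = {kmer: set() for kmer in kept}
    for read in reads:
        for i in range(len(read) - k):
            a, b = read[i:i + k], read[i + 1:i + k + 1]
            if a in kept and b in kept:
                succ[a].add(b)
                pred[b].add(a)
    return counts, succ, pred


def simplify(succ, pred):
    """Merge unbranched chains of k-mers into longer sequences."""
    def starts_chain(node):
        p = pred[node]
        return not (len(p) == 1 and len(succ[next(iter(p))]) == 1)

    contigs = []
    for node in sorted(succ):
        if not starts_chain(node):
            continue
        seq, cur = node, node
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if len(pred[nxt]) != 1:
                break  # nxt is a junction; it starts its own chain
            seq += nxt[-1]
            cur = nxt
        contigs.append(seq)
    return contigs


genome = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
# Error-free 7 bp reads at every offset, three copies each (made-up coverage)
reads = [genome[i:i + 7] for i in range(len(genome) - 6)] * 3
reads.append("TCGACGC")  # one read containing an error, taken from the slide
counts, succ, pred = build_graph(reads, k=4, min_count=2)
print(simplify(succ, pred))
```

The repeated stretch GAGGCTTTAGA stays as a single node with two ways in and two ways out, which is why repeats are the central difficulty for this kind of assembler.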
De novo Short Read Assembler
Performance
| Assembler | Num (contigs) | N50 (kb) | Errors |
|---|---|---|---|
| ABySS | 302 | 29.2 | 19 |
| ALLPATHS-LG | 60 | 96.7 | 20 |
| Bambus2 | 109 | 50.2 | 190 |
| MSR-CA | 94 | 59.2 | 34 |
| SGA | 252 | 4.0 | 10 |
| SOAPdenovo | 107 | 288.2 | 65 |
| Velvet | 162 | 48.4 | 42 |

Assemblies of S. aureus (genome size 2,872,915 bp)

Taken from GAGE paper (Salzberg et al, 2012).
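
N50 is the contig length such that contigs of that length or longer contain at least half of the total bases. A minimal sketch of the calculation follows; the optional `reference_size` argument (a name chosen here) computes the statistic against a known genome size instead of the assembly total. No claim is made about which variant the table above uses.

```python
def n50(lengths, reference_size=None):
    """Length L such that contigs >= L hold at least half of the total bases."""
    if not lengths:
        raise ValueError("no contigs")
    total = reference_size if reference_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # the assembly covers less than half of reference_size


contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # made-up contig lengths
print(n50(contig_lengths))
```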
Short Read Assembly Limitations

• Common repeat regions are typically missing/collapsed

– Han Chinese genome missing ~420Mbp of repeats
• Same is true for segmental duplications
– Han Chinese genome only contains ~10Mbp of ~150Mbp of
segmental duplications.
• Even for microbial genomes, you typically get very large
numbers of contigs, which range in size from very small,
to sometimes quite large.
– (need reads of ~7kb to completely assemble bacterial genomes)
Recent Short Read Assembler
Comparisons
• Earl et al (2011). Assemblathon 1: A competitive assessment of de
novo short read assembly methods. Genome Research 21: 2224-2241.
– Used a simulated dataset for all competitors to assemble
• Salzberg et al (2012). GAGE: A critical evaluation of genome
assemblies and assembly algorithms. Genome Research 22(3):557-67.
– Applied several assembly algorithms to their own datasets, for several different
sized genomes
• Bradnam et al. (2013). Assemblathon 2: evaluating de novo methods of
genome assembly in three vertebrate species. Gigascience 2(1):10.
– See http://assemblathon.org/
• If you have an assembly problem, you should read these papers to gain
some insights into strengths and weaknesses of different assemblers
• Also see: Vezzi, F., Narzisi, G., and Mishra B. (2012). Reevaluating
assembly evaluations with feature response curves: GAGE and
assemblathons. PLoS One 7(12):e52210.
Improving de novo Assemblies
• Need to generate additional long range continuity
to be able to orient and order contigs
• Mate pair libraries
• Hybrid approach using either PacBio or Oxford
Nanopore data, plus Illumina data
– Though many papers assembling microbial genomes solely
from Oxford Nanopore Data – they still contain sequence errors
• Synthetic long reads (Illumina Tru-Seq Synthetic
Reads (aka Moleculo))
• CPT-SEQ
– Similar to 10X Genomics
• Hi-C contact maps
– Similar to Dovetail Genomics
Mate-pair libraries

• Goal is to have the equivalent of 2-5kb insert
libraries, or even up to 10-12kb.
• However, Illumina flow cell technology is limited to
~700 bp fragments that can be successfully clustered
– Means you have to use some molecular biology to
accomplish the equivalent.
Genomic DNA

Fragment

Size Select (up to 12kb)

Biotinylate

Circularize

Fragment (400-600bp)

Capture Biotinylated fragments

Standard Paired End Illumina Sequencing

Incorporate Mate-pair information into assembly


Pacific Biosciences
• Single Molecule Real Time (SMRT) DNA
Sequencing
• Light is detected when fluorescent nucleotides
are incorporated into a growing DNA strand
• Half of all data in reads >45kb; can get >200kb
• Accuracy now ~99% (Q20); with 40x coverage,
consensus approaches 99.999% accuracy (Q50)
• Observation of DNA modifications
• Throughput per run is low (~6 million reads), but
run time is short (~6 hours)
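
The Q values quoted above follow from the Phred definition, Q = -10 log10(error probability). A short sketch of the conversion in both directions; the function names are illustrative:

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Phred quality for a given per-base accuracy (0 < accuracy < 1)."""
    return -10 * math.log10(1 - accuracy)


def phred_to_error(q: float) -> float:
    """Per-base error probability for a given Phred quality."""
    return 10 ** (-q / 10)


for accuracy in (0.99, 0.99999):
    print(f"{accuracy:.5f} accuracy ~ Q{accuracy_to_phred(accuracy):.0f}")
```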
Oxford Nanopore

• MinION, GridION, PromethION products

• DNA “sequenced” as it is dragged through a
nanopore, based on change in conductance
• Can also detect modified bases
• High error rate (5-15%), but improving
• But very long reads – >2 million bp read reported
• Tons of papers/data/tools released
• Nanopore is a game changer for genome assembly
• Has also been reported recently to directly sequence
RNA – great for seeing isoforms
And something to keep an eye on
Recommended Reading
Nextera
• Adey, A., Morrison, H.G., Asan, Xun, X., Kitzman, J.O., Turner, E.H., Stackhouse, B., MacKenzie, A.P., Caruccio,
N.C., Zhang, X., Shendure, J. (2010). Rapid, low-input, low-bias construction of shotgun fragment libraries by
high-density in vitro transposition. Genome Biol. 11(12):R119.
• Syed, F., Grunenwald, H., Caruccio, N. (2009). Next-generation sequencing library preparation: simultaneous
fragmentation and tagging using in vitro transposition. Nature Methods 6, Applications Note.
• Caruccio, N. (2011). Preparation of next-generation sequencing libraries using Nextera™ technology: simultaneous
DNA fragmentation and adaptor tagging by in vitro transposition. Methods Mol Biol. 733:241-55.
• Marine, R., Polson, S.W., Ravel, J., Hatfull, G., Russell, D., Sullivan, M., Syed, F., Dumas, M., Wommack, K.E.
(2011). Evaluation of a transposase protocol for rapid generation of shotgun high-throughput sequencing libraries
from nanogram quantities of DNA. Appl Environ Microbiol. 77(22):8071-9.
UMIs
• Kivioja, T., Vähärautio, A., Karlsson, K., Bonke, M., Enge, M., Linnarsson, S. and Taipale, J. (2011). Counting
absolute numbers of molecules using unique molecular identifiers. Nat Methods. 9(1):72-4.
Adapter Trimming
• Kong, Y. (2011). Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing
technologies. Genomics 98(2):152-3.
• Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal
17:10-12.
• Lindgreen, S. (2012). AdapterRemoval: easy cleaning of next-generation sequencing reads. BMC Res Notes 5:337.
• Jiang, H., Lei, R., Ding, S.W. and Zhu, S. (2014). Skewer: a fast and accurate adapter trimmer for next-generation
sequencing paired-end reads. BMC Bioinformatics 15:182.
Read Merging
• Rodrigue, S., Materna, A.C., Timberlake, S.C., Blackburn, M.C., Malmstrom, R.R., Alm, E.J., Chisholm, S.W. (2010).
Unlocking short read sequencing for metagenomics. PLoS One 5(7):e11840.
• Magoč, T. and Salzberg, S.L. (2011). FLASH: fast length adjustment of short reads to improve genome assemblies.
Bioinformatics 27(21):2957-63.
• Masella, A.P., Bartram, A.K., Truszkowski, J.M., Brown, D.G. and Neufeld, J.D. (2012). PANDAseq: paired-end
assembler for illumina sequences. BMC Bioinformatics 13:31.
• Liu, B., Yuan, J., Yiu, S.M., Li, Z., Xie, Y., Chen, Y., Shi, Y., Zhang, H., Li, Y., Lam, T.W. and Luo, R. (2012). COPE:
an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics 28(22):2870-4.
• Zhang, J., Kobert, K., Flouri, T. and Stamatakis, A. (2014). PEAR: a fast and accurate Illumina Paired-End reAd
mergeR. Bioinformatics 30(5):614-20.
• Kwon, S., Lee, B. and Yoon, S. (2014). CASPER: context-aware scheme for paired-end reads from high-throughput
amplicon sequencing. BMC Bioinformatics 15 Suppl 9:S10.
Error Correction
• Heo, Y., Wu, X.L., Chen, D., Ma, J. and Hwu, W.M. (2014). BLESS: bloom filter-based error correction solution for
high-throughput sequencing reads. Bioinformatics 30(10):1354-62.
• Lim, E.C., Müller, J., Hagmann, J., Henz, S.R., Kim, S.T. and Weigel, D. (2014). Trowel: a fast and accurate error
correction module for Illumina sequencing reads. Bioinformatics 30(22):3264-5.
• Greenfield, P., Duesing, K., Papanicolaou, A. and Bauer, D.C. (2014). Blue: correcting sequencing errors using
consensus and context. Bioinformatics 30(19):2723-32.
Quality
• Yang, X., Liu, D., Liu, F., Wu, J., Zou, J., Xiao, X., Zhao, F. and Zhu, B. (2013). HTQC: a fast quality control toolkit for
Illumina sequencing data. BMC Bioinformatics 14:33.
Assemblers
• Zerbino, D.R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs.
Genome Res. 18(5):821-9.
• Zerbino, D.R., McEwen, G.K., Margulies, E.H. and Birney, E. (2009). Pebble and rock band: heuristic resolution of
repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4(12):e8407.
• Simpson, J.T. and Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index.
Bioinformatics 26(12):i367-73.
• Simpson, J.T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data
structures. Genome Research 22(3):549-56. SGA
Assembly of Long Reads
• Loman, N.J., Quick, J., Simpson, J.T. (2015). A complete bacterial genome assembled de novo using only nanopore
sequencing data. Nat Methods 12(8):733-5.
• Stadermann, K.B., Weisshaar, B., Holtgräwe, D. (2015). SMRT sequencing only de novo assembly of the sugar beet
(Beta vulgaris) chloroplast genome. BMC Bioinformatics 16:295.
• Chin, C.S., Alexander, D.H., Marks, P., Klammer, A.A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J.,
Eichler, E.E., Turner, S.W., Korlach, J. (2013). Nonhybrid, finished microbial genome assemblies from long-read
SMRT sequencing data. Nat Methods 10(6):563-9.
PacBio and Oxford Nanopore
• Weirather JL, de Cesare M, Wang Y, Piazza P, Sebastiano V, Wang XJ, Buck D, Au KF. (2017). Comprehensive
comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome
analysis. Version 2. F1000Res. 2017 Feb 3 [revised 2017 Jan 1];6:100.
• Rhoads A, Au KF. (2015). PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13(5):278-89.
• Ardui S, Ameur A, Vermeesch JR, Hestand MS. (2018). Single molecule real-time (SMRT) sequencing comes of age:
applications and utilities for medical diagnostics. Nucleic Acids Res. 46(5):2159-2168.
