Lecture2-High Throughput Sequencing-2019

The document summarizes high throughput sequencing platforms. It provides a table comparing the key specifications of different Illumina and non-Illumina platforms including reads per run, read length, run time, yield, sequencing rate, cost per gigabase, and machine cost. It then provides more details on Illumina sequencing technology including its use of reversible terminators, cluster-based detection, and image processing and base calling steps. Recent Illumina innovations discussed include patterned flow cells, two-channel sequencing, and the NovaSeq platform.

Uploaded by

Charlie Hou

Genetics 211 - 2019

Lecture 2
High Throughput Sequencing
Gavin Sherlock
January 22nd 2019
Sequencing Platforms

Platform | Reads per run (M) | Read length (bp) | Run time (d) | Yield (Gb) | Rate (Gb/d) | Cost per Gb ($) | hg-30x ($) | Machine ($)
iSeq 100 1fcell | 4 | 250* | 0.77-1.28 | 1.2-2 | 1.56 | 521 | $62,500 | 19.9K
MiniSeq 1fcell | 25 | 150* | 1 | 7.5 | 7.5 | 233 | $28,000 | 49.5K
MiSeq 1fcell | 25 | 300* | 2 | 15 | 7.5 | 66 | $8,000 | 99K
NextSeq 550 1fcell | 400 | 150* | 1.2 | 120 | 100 | 50 | $5,000 | 250K
HiSeq 2500 RR 2fcells | 600 | 100* | 1.125 | 120 | 106.6 | 51.2 | $6,144 | 740K
HiSeq 2500 V3 2fcells | 3000 | 100* | 11 | 600 | 55 | 39.1 | $4,692 | 690K
HiSeq 2500 V4 2fcells | 4000 | 125* | 6 | 1000 | 166 | 31.7 | $3,804 | 690K
HiSeq 4000 2fcells | 5000 | 150* | 3.5 | 1500 | 400 | 20.5 | $2,460 | 900K
HiSeq X 2fcells | 6000 | 150* | 3 | 1800 | 600 | 7.08 | $850 | 1M
NovaSeq S1 2fcells | 3300 | 150* | 1.66 | 1000 | 600 | 18.75 | $1,800 | 999K
NovaSeq S2 2fcells | 6600 | 150* | 1.66 | 2000 | 1200 | 17.5 | $1,564 | 999K
NovaSeq S4 2fcells | 20000 | 150* | 1.83 | 6000 | 3600 | 10.67 | $700 | 999K
PacBio RSII | 0.88 | 20K** | 4.3 | 12 | 2.8 | 200 | $24,000 | 695K
PacBio Sequel 16cells v6.0 2018 | 6.4 | 45K** | 6.6 | 160-320 | 24-48 | 80 | $9,600 | 350K
PacBio Q1 2019 | -- | 45K** | -- | 192 | -- | 6.6 | $1,000 | 350K
SmidgION 1fcell | -- | 500-2,000,000 | TBC | TBC | TBC | TBC | -- | --
Flongle 1fcell | -- | 500-2,000,000 | 1 | 0.1/1.8-3.3 | -- | 90-30 | $2,700 - $8,100 | --
MinION Mk 1B 1fcell | -- | 500-2,000,000 | 3 | 17/30-50 | -- | 50-12.5 | $1,125 - $2,700 | --
GridION X5 5fcells | -- | 500-2,000,000 | 3 | 85/150-250 | -- | 47.5/15.70-7 | $675 - $1,575 | --
PromethION 48fcells | -- | 500-2,000,000 | 2.6 | 3000/7000-15000 | -- | 14/7-3.5 | $315 - $1,400 | --

https://docs.google.com/spreadsheets/d/1GMMfhyLK0-q8XkIo3YxlWaZA5vVMuhU1kg41g4xLkXc/edit?hl=en_GB&hl=en_GB#gid=515231169
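Two of the table's columns look like they are derived from the others, and a short sketch makes the arithmetic explicit. The rate column matches yield divided by run time, and the hg-30x column matches cost per Gb multiplied by roughly 120 Gb. That 120 Gb figure is an inference from the numbers (for example 62,500 / 521), not something stated on the slide, and the function names are illustrative.

```python
# Assumption: the "hg-30x" column is cost per Gb times about 120 Gb.
# This constant is inferred from the table rows, not given in the lecture.
HG_30X_GB = 120.0


def rate_gb_per_day(yield_gb: float, run_time_days: float) -> float:
    """Sequencing rate in Gb per day."""
    return yield_gb / run_time_days


def cost_hg_30x(cost_per_gb: float, gb_needed: float = HG_30X_GB) -> float:
    """Approximate cost of one 30x human genome at a given cost per Gb."""
    return cost_per_gb * gb_needed


# HiSeq X row: yield 1800 Gb, run time 3 d, 7.08 $/Gb
hiseq_x_rate = rate_gb_per_day(1800, 3)
hiseq_x_genome_cost = round(cost_hg_30x(7.08))
print(hiseq_x_rate, hiseq_x_genome_cost)
```

The same two functions reproduce the NextSeq 550 row (120 Gb in 1.2 days is 100 Gb/d), which is a quick sanity check on any row you want to compare.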
Illumina:
Flow Cells with “Molecular Colonies”
• flow cell with randomly spaced molecular clusters
• spacing depends on initial seeding of the single molecules onto the flow cell
[Figure: image of clusters on a flow cell; scale bar labeled 1µM]
Detection, Chemistry
• Massively Parallel Detection on immobilized “molecular colonies”
• Means you have to measure (image) every cycle, instead of the Sanger model (letting reaction go to completion and then separating products by size)
• Requires specially designed chemistry, using reversible dye-terminators and a polymerase
Illumina Sequencing Technology
Robust Reversible Terminator Chemistry Foundation
[Figure: workflow. DNA (0.1-1.0 ug) → Sample preparation → Cluster growth (single molecule array) → Sequencing (cycles 1 2 3 4 5 6 7 8 9 read T G C T A C G A T …) → Image acquisition → Base calling]

Illumina Sequence Visualization
[Figure: cluster image, 250+ Million Clusters Per Flow Cell; scale bars 20 Microns and 100 Microns]
Illumina Sequencing: Reversible Terminators
[Figure: nucleotide carrying a fluorophore attached through a cleavage site; after incorporation the 3’ OH is blocked. Cycle: Incorporate → Detection → Deblock and Cleave off Dye → free 3’ end, Ready for Next Cycle]
Image Processing, Base Calling
• Image processing algorithms find signals in
each panel, align signals from different
panels, etc.
– Machines ship with server or small cluster that
does image analysis while run is happening
• Sequence data after base calling much
reduced in size (tens of gigabytes) => more
manageable but still large amounts that add
up over time
• Unsustainable to keep image data; people
discard the images, and just keep the
sequences (fastq format).
Recent Illumina Innovations
• Patterned flow cells (HiSeq 3000/HiSeq 4000 systems)
– Allows denser cluster spacing
– Avoids cluster overlap
– Image analysis easier
• Two-Channel SBS (NextSeq)
– Two, rather than 4 colors
– Leads to faster sequencing times
• Synthetic Long Reads
– We may discuss later, but not widely used
• MiniSeq
– 2nd generation MiSeq – smaller, cheaper, more reads
• NovaSeq – announced 2017
– Up to 10 billion reads in 2 days, per flowcell (2018)
– Towards $100 genome
• iSeq
– released last year. Benchtop ($20K) platform for 6M reads
How do we make an Illumina Genomic
DNA library?
Double-stranded genomic DNA

Fragment (Covaris)

Polish, add dA overhang


Add adaptors, size select

Sequence
Making fragments asymmetric
Fragmented, end polished, phosphorylated, dA overhang DNA sample

5'-pNNNN.........NNNNA-3'
3'-ANNNN.........NNNNp-5'

Genomic Y-adapter

5'-ACACTCTTTCCCTACACGAC
GCTCTTCCGATCT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGp-5'

Ligate

5'-ACACTCTTTCCCTACACGAC
GCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3’
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCG
CAGCACATCCCTTTCTCACA-5’

[Ligation product is gel purified, selecting only those products in a certain size range]
Making our genomic DNA library
asymmetric
Round 1 of PCR

5'-ACACTCTTTCCCTACACGAC
TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
GCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3’
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCG
5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
CAGCACATCCCTTTCTCACA-5’

Products of first round:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3’
3’-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5’
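The adapter logic above is easier to follow if you check the complementarity yourself. The sketch below takes two sequences from the slides and confirms that the round 1 PCR primer is the reverse complement of the 3' tail of the ligated top strand, and that the 3' tail of the second product is the reverse complement of the other adapter arm. The helper name `reverse_complement` is illustrative, not from the lecture.

```python
# Translation table for complementing DNA; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (5'->3')."""
    return seq.translate(COMPLEMENT)[::-1]


# 3' tail of the top strand of the ligation product (from the slide).
top_strand_tail = "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
# Round 1 PCR primer, written 5'->3' (from the slide).
pcr_primer = "CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT"
# Full-length adapter arm at the 5' end of the top strand (from the slide).
adapter_arm = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
# 3' tail of the second first-round product (from the slide).
second_product_tail = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

primer_anneals = reverse_complement(top_strand_tail) == pcr_primer
arm_is_copied = reverse_complement(adapter_arm) == second_product_tail
print(primer_anneals, arm_is_copied)
```

Both checks hold, which is the point of the Y-adapter: after one round of extension each strand carries a different sequence at each end.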
Finishing and Sequencing the Library
Rounds 2-18
TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

Product of PCR amplification

5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3’-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

[Anneal to flow cell. Perform cluster generation]

Genomic DNA Sequencing Primer

5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3’-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT
Nextera Library Preparation
[Figure: Genomic DNA + Transposomes → tagmentation → fragments carrying Adaptor 1 and Adaptor 2 → (suppression) PCR with Primer 1 and Primer 2]
How Much Sequence?
• HiSeq 2500 can give ~250 million reads/lane of
paired end 100bp reads
• This is 50Gb of sequence
• This is ~4,000x coverage yeast (12Mb).
• This is an obvious waste of resources (it’s also ~500x
C. elegans, and ~500x D. melanogaster)
• How can we sequence on a HiSeq and not waste all
these resources when sequencing smaller genomes?
• HiSeq 3000/HiSeq 4000
– Patterned flow cells (not random clusters)
– Almost twice as much data, half the time
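The coverage figures above follow from simple arithmetic, sketched here. The yeast genome size (12 Mb) is from the slide; the 100 Mb figure for C. elegans is an approximate value assumed for illustration, and the function name is hypothetical.

```python
def coverage(n_reads: float, read_length: int, genome_size: float,
             paired: bool = True) -> float:
    """Fold coverage = total sequenced bases / genome size."""
    bases_per_read = read_length * (2 if paired else 1)
    return n_reads * bases_per_read / genome_size


# One HiSeq 2500 lane: ~250 million paired-end 100 bp reads (from the slide).
lane_bases = 250e6 * 100 * 2             # 5e10 bases, i.e. 50 Gb
yeast_cov = coverage(250e6, 100, 12e6)   # 12 Mb genome (from the slide)
worm_cov = coverage(250e6, 100, 100e6)   # ~100 Mb genome (assumed size)
print(lane_bases / 1e9, round(yeast_cov), round(worm_cov))
```

Dividing the yeast result by a target depth gives a rough idea of how many yeast samples could share one lane, which is the motivation for multiplexing.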
Multiplexed Sequencing,
using Barcodes
• Two ways to perform barcode sequencing
– In-line barcodes
• Barcode is read as part of the normal sequencing read
– Index barcodes
• Barcode is read as a third, short sequencing run (also known
as index reads)
• Can be used to run multiple samples from any
particular origin on the same lane of a HiSeq, with the
barcodes allowing the samples to be de-convoluted
afterwards.
• Barcodes should be designed so that they are
balanced in GC content, and as dissimilar as
possible. (Hamming distance > 2).
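A minimal sketch of how the two design criteria above could be checked for a candidate barcode set. The four barcodes are hypothetical examples, not a published index set, and the one-mismatch assignment rule is one possible policy rather than what any particular demultiplexer does.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def min_pairwise_distance(barcodes: list[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign(observed: str, barcodes: list[str], max_mismatches: int = 1):
    """Assign an observed barcode to the unique barcode within max_mismatches."""
    hits = [b for b in barcodes if hamming(observed, b) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


barcodes = ["ACGTAC", "TGCATG", "GATCGA", "CTAGCT"]  # hypothetical set
print(min_pairwise_distance(barcodes), [gc_fraction(b) for b in barcodes])
print(assign("ACGTAA", barcodes))  # one sequencing error in the barcode
```

With a minimum distance greater than 2, a barcode read with a single error is still closer to its true barcode than to any other, so it can be assigned unambiguously.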
In-line Barcode Sequencing
Index barcoding
Unique Molecular Identifiers (UMIs)
• During the PCR step, each template gets amplified many times
• If your library is of insufficient complexity, or you overamplify, you may have PCR duplicates
• You want to make independent observations, not redundant observations
• When sequencing to high coverage, you may have identical, but non-redundant observations.
• Want to be able to distinguish these.
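One way to make that distinction concrete is to count distinct UMIs per mapping position. This is a simplified sketch: the read tuples are made-up examples, and it ignores sequencing errors within the UMI itself, which real tools have to handle.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Count distinct UMIs per locus.

    reads: iterable of (chrom, position, umi) tuples for aligned reads.
    Reads sharing a locus and a UMI are treated as PCR duplicates.
    """
    umis_at_locus = defaultdict(set)
    for chrom, pos, umi in reads:
        umis_at_locus[(chrom, pos)].add(umi)
    return {locus: len(umis) for locus, umis in umis_at_locus.items()}


reads = [
    ("chrI", 1000, "ACGTAC"),
    ("chrI", 1000, "ACGTAC"),  # same position, same UMI: PCR duplicate
    ("chrI", 1000, "TTGCAA"),  # same position, new UMI: independent molecule
    ("chrI", 2500, "ACGTAC"),
]
molecules = count_unique_molecules(reads)
print(molecules)
```

Without the UMI, the three reads at position 1000 would either all be collapsed into one observation or all be counted, and neither answer is right.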
Unique Molecular Identifiers (UMIs) using
Random Barcodes
Longer, and/or more Accurate Reads
• Insert sizes are a distribution
• Some inserts not necessarily longer than twice the read length
• What does this mean for paired end reads?
[Figure: histogram of Insert Size, x-axis 0 to 500]
[Figure: diagram labeled “insert”]
Read Error Correction
• Many approaches, and lots of available tools
• Most rely on the idea of looking for rare k-mers:
• Build up a table of all k-mers, and their frequencies.
• Consecutive k-mers that cover an error in a read should be at lower frequency, given sufficient coverage
• Can use this to recognize errors in reads and correct them
• If done without regard to quality scores, assumes a homogeneous sample
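The rare k-mer idea can be sketched in a few lines. This is a toy illustration of the general approach, not the algorithm of any tool in the reading list; the reads, k = 4, and the `min_count` threshold are all made up for the example.

```python
from collections import Counter


def kmer_table(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(read, counts, k, min_count):
    """Positions where every k-mer covering the base is rare."""
    n_kmers = len(read) - k + 1
    rare = [counts[read[i:i + k]] < min_count for i in range(n_kmers)]
    flagged = []
    for pos in range(len(read)):
        covering = range(max(0, pos - k + 1), min(pos, n_kmers - 1) + 1)
        if all(rare[i] for i in covering):
            flagged.append(pos)
    return flagged


def correct(read, counts, k, min_count):
    """Try single-base substitutions that make all covering k-mers common."""
    for pos in suspicious_positions(read, counts, k, min_count):
        for base in "ACGT":
            candidate = read[:pos] + base + read[pos + 1:]
            if not suspicious_positions(candidate, counts, k, min_count):
                return candidate
    return read  # no single substitution fixes it; leave the read alone


good, bad = "ACGTACGGTCA", "ACGTATGGTCA"  # bad has C->T at position 5
counts = kmer_table([good] * 10 + [bad], k=4)
print(suspicious_positions(bad, counts, 4, 2), correct(bad, counts, 4, 2))
```

The homogeneity caveat on the slide shows up directly here: a true low-frequency variant in a mixed sample would look exactly like the error in `bad` and would be "corrected" away.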
What are the data?

• Illumina produces data in fastq format.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Line 1: @ followed by a sequence identifier
Line 2: The sequence
Line 3: +, optionally followed by a sequence identifier
Line 4: The quality scores
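A minimal sketch of reading the four-line record above and decoding its quality string. It assumes each record is exactly four lines and that qualities use the Phred+33 ASCII offset; some older Illumina pipelines used an offset of 64, so the `offset` parameter is something to confirm for your data.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (assumed offset)."""
    return [ord(ch) - offset for ch in qual]


# The example record from the slide.
record = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
name, seq, qual = next(read_fastq(io.StringIO(record)))
scores = phred_scores(qual)
print(name, len(seq), min(scores), max(scores))
```

The one invariant worth checking on every record is that the sequence and quality strings have the same length.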
Example of Illumina SeqID

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R: The unique instrument name
6: Flowcell lane
73: Tile number within the flowcell
941: 'x'-coordinate of the cluster within the tile
1973: 'y'-coordinate of the cluster within the tile
#0: index number for a multiplexed sample (0 for no indexing)
/1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
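The fields above can be pulled apart with a regular expression. This sketch handles only the identifier layout shown on the slide; other Illumina identifier formats exist and would need a different pattern. The field names in the pattern are illustrative.

```python
import re

# Pattern for identifiers shaped like @INSTRUMENT:lane:tile:x:y#index/member
SEQ_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<member>[12])$"
)


def parse_seq_id(line: str) -> dict:
    """Split an identifier of the form shown on the slide into named fields."""
    match = SEQ_ID.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sequence identifier: {line!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "member"):
        fields[key] = int(fields[key])
    return fields


parsed = parse_seq_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(parsed)
```

The lane, tile, and x/y fields are occasionally useful for diagnosing quality problems that are localized to one region of a flow cell.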
Assessing Quality
FastQC
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
HTQC
https://sourceforge.net/projects/htqc
De novo Assembly of Short Reads
• Several methods available
• Short reads require long overlaps
• e.g., 33 bp reads must overlap by 20 bp
• end-trimming helps, to remove low quality bases.
• Most de novo short read assemblers use a k-mer hashing based approach and de Bruijn graphs.
• The central challenge of genome assembly is resolving repeat regions.
De novo Assembly Strategies
• Many, many different algorithms and open source (as well as
closed source) software for short read sequence assembly.
• Choice of tool depends on exactly what you are trying to
assemble:
– Genome size
– Genome complexity
– Level of polymorphism
– Genome vs. transcriptome
– Sequence coverage you have (more is generally better)
– Paired-end vs. single end (you should really have paired-end data)
• E.g.
– Velvet (Zerbino and Birney, 2008)
• Uses DeBruijn graph algorithm plus error correction
– SGA (Simpson and Durbin, 2010)
• Use String Graph – lower memory requirements, but takes longer
– SOAPdenovo2 (Li et al, 2012)
• Also uses DeBruijn graphs with error correction
Example of Velvet de novo Assembly
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

Sequence (7bp reads)


AGTCGAG CTTTAGA CGATGAG CTTTAGA
GTCGGG TTAGATC ATGAGGC GAGACAG
GAGGCTC ATCCGAT AGGCTTT GAGACAG
AGTCGAG TAGATCC ATGAGGC TAGAGAA
TAGTCGA CTTTAGA CCGATGA TTAGAGA
CGAGGCT AGATCCG TGAGGCT AGAGACA
TAGTCGA GCTTTAG TCCGATG GCTCTAG
TCGACGC GATCCGA GAGGCTT AGAGACA
TAGTCGA TTAGATC GATGAGG TTTAGAG
GTCGAGG TCTAGAT ATGAGGC TAGAGAC
AGGCTTT ATCCGAT AGGCTTT GAGACAG
AGTCGAG TTAGATT ATGAGGC AGAGACA
GGCTTTA TCCGATG TTTAGAG
CGAGGCT TAGATCC TGAGGCT GAGACAG
AGTCGAG TTTAGATC ATGAGGC TTAGAGA
GAGGCTT GATCCGA GAGGCTT GAGACAG

Hashing (k = 4)
Graph Building
[Figure: graph of 4-mers (nodes) with observed counts, laid out along the sequence TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG]
TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (16x), GCTT (11x)
CTTT (8x), TTTA (8x), TTAG (12x), TAGA (16x)
AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x)
TGAG (9x), ATGA (8x), GATG (5x), CGAT (6x), CCGA (7x), TCCG (7x), ATCC (7x), GATC (8x), AGAT (8x)
Low-count 4-mers: GATT (1x); AGAA (1x); CTTC (1x), TTCA (2x), TCAG (2x), CAGA (1x); CGAC (1x), GACG (1x), ACGC (1x)

Simplification of Linear Stretches
[Figure: graph after merging linear stretches; nodes: TAGTCGA, CGAG, GAGGCT, GCTTTAG, GCTCTAG, CGACGC, TAGA, AGAT, GATCCGATGAG, GATT, AGAA, AGAGA, AGACAG. Tips and a Bubble are labeled.]

Error (tip and bubble) removal
[Figure: final graph; nodes: TAGTCGAG, GAGGCTTTAGA, AGATCCGATGAG, AGAGACAG]

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
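The hashing, graph building, and simplification steps can be sketched on the slide's own sequence. This is a toy illustration and not Velvet's implementation: it uses error-free 7 bp reads tiled across the sequence (so there are no tips or bubbles to remove), treats each 4-mer as a node, and ignores reverse complements. All function names are illustrative.

```python
from collections import Counter, defaultdict

GENOME = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"  # sequence from the slide


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Hash reads into k-mer counts and link k-mers adjacent within a read."""
    counts, edges = Counter(), defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        counts.update(ks)
        for a, b in zip(ks, ks[1:]):
            edges[a].add(b)
    return counts, edges


def compact(edges):
    """Merge unbranched chains of nodes into single sequences."""
    preds = defaultdict(set)
    for a, succs in edges.items():
        for b in succs:
            preds[b].add(a)
    nodes = set(edges) | set(preds)

    def starts_stretch(n):
        # A node starts a stretch unless it simply continues its only parent.
        if len(preds[n]) != 1:
            return True
        (parent,) = preds[n]
        return len(edges[parent]) != 1

    stretches = []
    for n in sorted(nodes):
        if not starts_stretch(n):
            continue
        path, cur = n, n
        while len(edges[cur]) == 1:
            (nxt,) = edges[cur]
            if len(preds[nxt]) != 1:
                break
            path += nxt[-1]
            cur = nxt
        stretches.append(path)
    return stretches


reads = kmers(GENOME, 7)                      # idealized, error-free reads
counts, edges = build_graph(reads, k=4)
branching = {n: sorted(s) for n, s in edges.items() if len(s) > 1}
stretches = compact(edges)
print(len(counts), branching, stretches)
```

The four merged stretches are the same four nodes shown in the final figure, and the single branching node (TAGA) marks where the repeated segment ends. That repeat is why the graph cannot be collapsed into one sequence without extra information.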
De novo Short Read Assembler Performance
Assembler | Num (contigs) | N50 (kb) | Errors
ABySS | 302 | 29.2 | 19
ALLPATHS-LG | 60 | 96.7 | 20
Bambus2 | 109 | 50.2 | 190
MSR-CA | 94 | 59.2 | 34
SGA | 252 | 4.0 | 10
SOAPdenovo | 107 | 288.2 | 65
Velvet | 162 | 48.4 | 42

Assemblies of S. aureus (genome size 2,872,915)
Taken from GAGE paper (Salzberg et al, 2012).
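For readers unfamiliar with the N50 column, here is a minimal sketch of the usual definition: the contig length at which contigs of that size or larger contain at least half of the assembled bases. The contig lengths below are made up, and this version normalizes by total assembly length; evaluations may instead normalize by the known genome size, which gives different values.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 is undefined for an empty assembly")


toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # hypothetical contig lengths
print(n50(toy_contigs))
```

N50 rewards long contigs regardless of whether they are correct, which is why the table reports errors alongside it.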
Short Read Assembly Limitations

• Common repeat regions are typically missing/collapsed
– Han Chinese genome missing ~420Mbp of repeats
• Same is true for segmental duplications
– Han Chinese genome only contains ~10Mbp of ~150Mbp of
segmental duplications.
• Even for microbial genomes, you typically get very large
numbers of contigs, which range in size from very small,
to sometimes quite large.
– (need reads of ~7kb to completely assemble bacterial genomes)
Recent Short Read Assembler
Comparisons
• Earl et al (2011). Assemblathon 1: A competitive assessment of de
novo short read assembly methods. Genome Research 21: 2224-2241.
– Used a simulated dataset for all competitors to assemble
• Salzberg et al (2012). GAGE: A critical evaluation of genome
assemblies and assembly algorithms. Genome Research 22(3):557-67.
– Applied several assembly algorithms to their own datasets, for several different
sized genomes
• Bradnam et al. (2013). Assemblathon 2: evaluating de novo methods of
genome assembly in three vertebrate species. Gigascience 2(1):10.
– See http://assemblathon.org/
• If you have an assembly problem, you should read these papers to gain
some insights into strengths and weaknesses of different assemblers
• Also see: Vezzi, F., Narzisi, G., and Mishra B. (2012). Reevaluating
assembly evaluations with feature response curves: GAGE and
assemblathons. PLoS One 7(12):e52210.
Improving de novo Assemblies
• Need to generate additional long range continuity
to be able to orient and order contigs
• Mate pair libraries
• Hybrid approach using either PacBio or Oxford
Nanopore data, plus Illumina data
– Though many papers assembling microbial genomes solely
from Oxford Nanopore Data – they still contain sequence errors
• Synthetic long reads (Illumina Tru-Seq Synthetic
Reads (aka Moleculo))
• CPT-SEQ
– Similar to 10X Genomics
• Hi-C contact maps
– Similar to Dovetail Genomics
Mate-pair libraries
• Goal is to have the equivalent of 2-5kb insert libraries, or even up to 10-12kb.
• However, Illumina flow cell technology is limited to ~700 bp fragments that can be successfully clustered
– Means you have to use some molecular biology to accomplish the equivalent.
[Figure: mate-pair workflow; * marks the biotin (Bio) labels]
Genomic DNA

Fragment

Size Select (up to 12kb)

Biotinylate

Circularize

Fragment (400-600bp)

Capture Biotinylated fragments

Standard Paired End Illumina Sequencing

Incorporate Mate-pair information into assembly

Pacific Biosciences
• Single Molecule Real Time (SMRT) DNA
Sequencing
• Light is detected when fluorescent nucleotides
are incorporated into a growing DNA strand
• Half of all data in reads >45kb; can get >200kb
• Accuracy now ~99% (Q20); with 40x coverage,
consensus approached 99.999% accuracy (Q50)
• Observation of DNA modifications
• Throughput per run is low (~6 million reads), but
run time is short (~6 hours)
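The Q values quoted above come from the Phred scale, Q = -10 log10(p), where p is the probability that a base call is wrong. A short sketch of the conversion in both directions (function names are illustrative):

```python
import math


def error_probability(q: float) -> float:
    """Probability that a base is wrong, given its Phred quality Q."""
    return 10 ** (-q / 10)


def phred(p_error: float) -> float:
    """Phred quality for a given error probability."""
    return -10 * math.log10(p_error)


def accuracy(q: float) -> float:
    """Per-base accuracy implied by a Phred quality."""
    return 1 - error_probability(q)


print(accuracy(20), accuracy(50), phred(0.01))
```

Q20 corresponds to 99% accuracy and Q50 to 99.999%, matching the figures on the slide.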
Oxford Nanopore
• MinION, GridION, PromethION products
• DNA “sequenced” as it is dragged through a
nanopore, based on change in conductance
• Can also detect modified bases
• High error rate (5-15%), but improving
• But very long reads – >2 million bp read reported
• Tons of papers/data/tools released
• Nanopore is a game changer for genome assembly
• Has also been reported recently to directly sequence
RNA – great for seeing isoforms
And something to keep an eye on
Recommended Reading
Nextera
• Adey, A., Morrison, H.G., Asan, Xun, X., Kitzman, J.O., Turner, E.H., Stackhouse, B., MacKenzie, A.P., Caruccio,
N.C., Zhang, X., Shendure, J. (2010). Rapid, low-input, low-bias construction of shotgun fragment libraries by high-
density in vitro transposition. Genome Biol. 11(12):R119.
• Syed, F., Grunenwald, H., Caruccio, N. (2009). Next-generation sequencing library preparation: simultaneous
fragmentation and tagging using in vitro transposition. Nature Methods 6, Applications Note.
• Caruccio, N. (2011). Preparation of next-generation sequencing libraries using Nextera™ technology: simultaneous
DNA fragmentation and adaptor tagging by in vitro transposition. Methods Mol Biol. 733:241-55.
• Marine, R., Polson, S.W., Ravel, J., Hatfull, G., Russell, D., Sullivan, M., Syed, F., Dumas, M., Wommack, K.E.
(2011). Evaluation of a transposase protocol for rapid generation of shotgun high-throughput sequencing libraries
from nanogram quantities of DNA. Appl Environ Microbiol. 77(22):8071-9.
UMIs
• Kivioja, T., Vähärautio, A., Karlsson, K., Bonke, M., Enge, M., Linnarsson, S. and Taipale, J. (2011). Counting
absolute numbers of molecules using unique molecular identifiers. Nat Methods. 9(1):72-4.
Adapter Trimming
• Kong, Y. (2011). Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing
technologies. Genomics 98(2):152-3.
• Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal
17:10-12.
• Lindgreen, S. (2012). AdapterRemoval: easy cleaning of next-generation sequencing reads. BMC Res Notes 5:337.
• Jiang, H., Lei, R., Ding, S.W. and Zhu, S. (2014). Skewer: a fast and accurate adapter trimmer for next-generation
sequencing paired-end reads. BMC Bioinformatics 15:182.
Recommended Reading
Read Merging
• Rodrigue, S., Materna, A.C., Timberlake, S.C., Blackburn, M.C., Malmstrom, R.R., Alm, E.J., Chisholm, S.W. (2010).
Unlocking short read sequencing for metagenomics. PLoS One 5(7):e11840.
• Magoč, T. and Salzberg, S.L. (2011). FLASH: fast length adjustment of short reads to improve genome assemblies.
Bioinformatics 27(21):2957-63.
• Masella, A.P., Bartram, A.K., Truszkowski, J.M., Brown, D.G. and Neufeld, J.D. (2012). PANDAseq: paired-end
assembler for illumina sequences. BMC Bioinformatics 13:31.
• Liu, B., Yuan, J., Yiu, S.M., Li, Z., Xie, Y., Chen, Y., Shi, Y., Zhang, H., Li, Y., Lam, T.W. and Luo, R. (2012). COPE:
an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics 28(22):2870-4.
• Zhang, J., Kobert, K., Flouri, T. and Stamatakis, A. (2014). PEAR: a fast and accurate Illumina Paired-End reAd
mergeR. Bioinformatics 30(5):614-20.
• Kwon, S., Lee, B. and Yoon, S. (2014). CASPER: context-aware scheme for paired-end reads from high-throughput
amplicon sequencing. BMC Bioinformatics 15 Suppl 9:S10.
Error Correction
• Heo, Y., Wu, X.L., Chen, D., Ma, J. and Hwu, W.M. (2014). BLESS: bloom filter-based error correction solution for
high-throughput sequencing reads. Bioinformatics 30(10):1354-62.
• Lim, E.C., Müller, J., Hagmann, J., Henz, S.R., Kim, S.T. and Weigel, D. (2014). Trowel: a fast and accurate error
correction module for Illumina sequencing reads. Bioinformatics 30(22):3264-5.
• Greenfield, P., Duesing, K., Papanicolaou, A. and Bauer, D.C. (2014). Blue: correcting sequencing errors using
consensus and context. Bioinformatics 30(19):2723-32.
Quality
• Yang, X., Liu, D., Liu, F., Wu, J., Zou, J., Xiao, X., Zhao, F. and Zhu, B. (2013). HTQC: a fast quality control toolkit for
Illumina sequencing data. BMC Bioinformatics 14:33.
Recommended Reading
Assemblers
• Zerbino, D.R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs.
Genome Res. 18(5):821-9.
• Zerbino, D.R., McEwen, G.K., Margulies, E.H. and Birney, E. (2009). Pebble and rock band: heuristic resolution of
repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4(12):e8407.
• Simpson, J.T. and Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index.
Bioinformatics 26(12):i367-73.
• Simpson, J.T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data
structures. Genome Research 22(3):549-56. SGA
Assembly of Long Reads
• Loman, N.J., Quick, J., Simpson, J.T. (2015). A complete bacterial genome assembled de novo using only nanopore
sequencing data. Nat Methods 12(8):733-5.
• Stadermann, K.B., Weisshaar, B., Holtgräwe, D. (2015). SMRT sequencing only de novo assembly of the sugar beet
(Beta vulgaris) chloroplast genome. BMC Bioinformatics 16:295.
• Chin, C.S., Alexander, D.H., Marks, P., Klammer, A.A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J.,
Eichler, E.E., Turner, S.W., Korlach, J. (2013). Nonhybrid, finished microbial genome assemblies from long-read
SMRT sequencing data. Nat Methods 10(6):563-9.
PacBio and Oxford Nanopore
• Weirather JL, de Cesare M, Wang Y, Piazza P, Sebastiano V, Wang XJ, Buck D, Au KF. (2017). Comprehensive
comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome
analysis. Version 2. F1000Res. 2017 Feb 3 [revised 2017 Jan 1];6:100.
• Rhoads A, Au KF. (2015). PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13(5):278-
89.
• Ardui S, Ameur A, Vermeesch JR, Hestand MS. (2018). Single molecule real-time (SMRT) sequencing comes of age:
applications and utilities for medical diagnostics. Nucleic Acids Res. 46(5):2159-2168.
