Lecture2-High Throughput Sequencing-2019
Lecture 2
High Throughput Sequencing
Gavin Sherlock
January 22nd 2019
Sequencing Platforms
Platform  Reads per run (M)  Read length  Run time (d)  Yield (Gb)  Rate (Gb/d)  Cost per Gb ($)  hg 30x ($)  Machine ($)
iSeq 100 1fcell 4 250* 0.77-1.28 1.2-2 1.56 521 $62,500 19.9K
MiniSeq 1fcell 25 150* 1 7.5 7.5 233 $28,000 49.5K
MiSeq 1fcell 25 300* 2 15 7.5 66 $8,000 99K
NextSeq 550 1fcell 400 150* 1.2 120 100 50 $5,000 250K
HiSeq 2500 RR 2fcells 600 100* 1.125 120 106.6 51.2 $6,144 740K
HiSeq 2500 V3 2fcells 3000 100* 11 600 55 39.1 $4,692 690K
HiSeq 2500 V4 2fcells 4000 125* 6 1000 166 31.7 $3,804 690K
HiSeq 4000 2fcells 5000 150* 3.5 1500 400 20.5 $2,460 900K
HiSeq X 2fcells 6000 150* 3 1800 600 7.08 $850 1M
NovaSeq S1 2fcells 3300 150* 1.66 1000 600 18.75 $1,800 999K
NovaSeq S2 2fcells 6600 150* 1.66 2000 1200 17.5 $1,564 999K
NovaSeq S4 2fcells 20000 150* 1.83 6000 3600 10.67 $700 999K
Illumina PacBio RSII 0.88 20K** 4.3 12 2.8 200 $24,000 695K
Illumina PacBio Sequel 16cells v6.0 2018 6.4 45K** 6.6 160-320 24-48 80 $9,600 350K
Illumina PacBio Q1 2019 -- 45K** -- 192 -- 6.6 $1,000 350K
SmidgION 1fcell -- 500-2,000,000 TBC TBC TBC TBC -- --
Flongle 1fcell -- 500-2,000,000 1 0.1/1.8-3.3 -- 90-30 $2,700 - $8,100 --
MinION Mk 1B 1fcell -- 500-2,000,000 3 17/30-50 -- 50-12.5 $1,125 - $2,700 --
GridION X5 5fcells -- 500-2,000,000 3 85/150-250 -- 47.5/15.70-7 $675 - $1,575 --
PromethION 48fcells -- 500-2,000,000 2.6 3000/7000-15000 -- 14/7-3.5 $315 - $1,400 --
https://docs.google.com/spreadsheets/d/1GMMfhyLK0-q8XkIo3YxlWaZA5vVMuhU1kg41g4xLkXc/edit?hl=en_GB&hl=en_GB#gid=515231169
Illumina:
Flow Cells with “Molecular Colonies”
• flow cell with randomly spaced molecular clusters
• spacing depends on initial seeding of the single molecules onto the flow cell
[Scale bar: 1 µm]
Detection, Chemistry
[Figure: workflow from DNA (0.1-1.0 µg) through sample preparation, single molecule array and cluster growth to sequencing. Bases are read one per cycle from the 5' end (cycles 1 2 3 4 5 6 7 8 9: T G C T A C G A T …). Scale bars: 20 microns and 100 microns.]
Illumina Sequencing: Reversible Terminators
[Figure: a nucleotide triphosphate (PPP) carrying a fluorophore attached through a cleavage site, with a blocked 3' end, is added to the DNA strand. Steps: incorporate, then deblock and cleave off dye, leaving a free 3' end (3' OH).]
Fragment (Covaris)
Sequence
Making fragments asymmetric
Fragmented, end polished, phosphorylated, dA overhang DNA sample
5'-pNNNN.........NNNNA-3'
3'-ANNNN.........NNNNp-5'
Genomic Y-adapter
5'-ACACTCTTTCCCTACACGAC
GCTCTTCCGATCT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGp-5'
Ligate
5'-ACACTCTTTCCCTACACGAC
GCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3’
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCG
CAGCACATCCCTTTCTCACA-5’
[Ligation product is gel purified, selecting only those products in a certain size range]
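The pairing in the Y-adapter can be checked with a short sketch. It takes the two adapter strands exactly as written above and measures how many bases at the 3' end of the top strand pair with the 5' end of the bottom strand, assuming the final T of the top strand is the overhang that pairs with the dA tail on the insert. The function and variable names are illustrative only.

```python
# Complement lookup for unambiguous DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Top strand of the Y-adapter, 5'->3', as two pieces joined (the slide splits it at the fork).
TOP = "ACACTCTTTCCCTACACGAC" + "GCTCTTCCGATCT"

# The slide writes the bottom strand 3'->5'; reverse it so it reads 5'->3'.
BOTTOM_3_TO_5 = "GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAG"
BOTTOM = BOTTOM_3_TO_5[::-1]


def stem_length(top: str, bottom: str, overhang: int = 1) -> int:
    """Count base pairs between the 3' end of `top` (minus the overhang)
    and the 5' end of `bottom`, stopping at the first mismatch."""
    paired_top = top[: len(top) - overhang]
    k = 0
    while (
        k < min(len(paired_top), len(bottom))
        and COMPLEMENT[bottom[k]] == paired_top[-1 - k]
    ):
        k += 1
    return k


stem = stem_length(TOP, BOTTOM)
print("overhang base:", TOP[-1])
print("paired stem length:", stem)
print("unpaired top arm:", TOP[: len(TOP) - 1 - stem])
print("unpaired bottom arm:", BOTTOM[stem:])
```

The short paired stem and the two non-complementary arms are what make the ligated fragment asymmetric: each end of the insert carries a different single-stranded sequence.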
Making our genomic DNA library asymmetric
Round 1 of PCR
5'-ACACTCTTTCCCTACACGAC
TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
GCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3’
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCG
5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
CAGCACATCCCTTTCTCACA-5’
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3’
3’-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5’
Finishing and Sequencing the Library
Rounds 2-18
TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3’-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3’-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT
Nextera Library Preparation
[Figure: Genomic DNA + Transposomes, followed by tagmentation, which attaches Adaptor 1 and Adaptor 2 to the fragments, followed by (suppression) PCR with Primer 1 and Primer 2.]
How Much Sequence?
• HiSeq 2500 can give ~250 million clusters/lane, each read as paired-end 100 bp reads (2 x 100 bp)
• This is 50 Gb of sequence
• This is ~4,000x coverage of yeast (12 Mb)
• This is an obvious waste of resources (it’s also ~500x C. elegans, and ~500x D. melanogaster)
• How can we sequence on a HiSeq and not waste all these resources when sequencing smaller genomes?
• HiSeq 3000/HiSeq 4000
– Patterned flow cells (not random clusters)
– Almost twice as much data, half the time
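The coverage arithmetic for a single HiSeq 2500 lane can be reproduced with a minimal sketch. The lane output and the yeast genome size come from the bullets above; the 100x target coverage and the function names are assumptions chosen for illustration.

```python
def fold_coverage(clusters: int, read_length: int, genome_size: int,
                  paired: bool = True) -> float:
    """Average coverage = total sequenced bases / genome size."""
    reads_per_cluster = 2 if paired else 1
    return clusters * reads_per_cluster * read_length / genome_size


def samples_per_lane(clusters: int, read_length: int, genome_size: int,
                     target_coverage: int) -> int:
    """How many genomes of this size fit in one paired-end lane
    at the target coverage (assumes perfectly even pooling)."""
    lane_bases = clusters * 2 * read_length
    return lane_bases // (genome_size * target_coverage)


LANE_CLUSTERS = 250_000_000   # ~250 million clusters per lane (from the slide)
READ_LENGTH = 100             # paired-end 100 bp
YEAST_GENOME = 12_000_000     # 12 Mb (from the slide)

print(round(fold_coverage(LANE_CLUSTERS, READ_LENGTH, YEAST_GENOME)))
# Assumed target of 100x per sample, for illustration only.
print(samples_per_lane(LANE_CLUSTERS, READ_LENGTH, YEAST_GENOME, 100))
```

At an assumed 100x target, dozens of yeast genomes would fit in one lane, which is the motivation for pooling samples rather than sequencing one small genome per lane.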
Multiplexed Sequencing, using Barcodes
• Two ways to perform barcode sequencing
– In-line barcodes
• Barcode is read as part of the normal sequencing read
– Index barcodes
• Barcode is read in a third, short sequencing read (also known as an index read)
• Can be used to run multiple samples from any particular origin on the same lane of a HiSeq, with the barcodes allowing the samples to be deconvoluted afterwards
• Barcodes should be designed so that they are balanced in GC content, and as dissimilar as possible (Hamming distance > 2)
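The barcode design rule and the deconvolution step can be sketched as follows. The four 6-base barcodes and the sample names are made up for illustration, and the one-mismatch tolerance is an assumption: with a minimum pairwise Hamming distance greater than 2, a barcode with a single sequencing error is still closer to its true barcode than to any other.

```python
from itertools import combinations
from typing import Dict, Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: Iterable[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(list(barcodes), 2))


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def assign_sample(read_barcode: str, barcode_to_sample: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the sample whose barcode is within max_mismatches of the
    observed barcode, or None if zero or several barcodes qualify."""
    hits = [sample for barcode, sample in barcode_to_sample.items()
            if hamming(read_barcode, barcode) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode set, not taken from any kit.
BARCODES = {"ACGTAC": "sample_1", "TGCATG": "sample_2",
            "GATCCA": "sample_3", "CTAGGT": "sample_4"}

print(min_pairwise_distance(BARCODES))            # design check
print([gc_fraction(b) for b in BARCODES])         # GC balance check
print(assign_sample("ACGTAA", BARCODES))          # one error in the barcode
print(assign_sample("AAAAAA", BARCODES))          # too far from every barcode
```

For in-line barcodes the observed barcode would be the first bases of the read itself; for index barcodes it would come from the separate index read.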
In-line Barcode Sequencing
Index barcoding
Unique Molecular Identifiers (UMIs)
insert
Read Error Correction
• Many approaches, and lots of available tools
• Most rely on the idea of looking for rare k-mers:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@HWUSI-EAS100R:6:73:941:1973#0/1
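The rare k-mer idea can be illustrated with a minimal sketch: count every k-mer across all reads, then flag the positions in a read whose k-mers fall below a count threshold. The toy reads, the value of k and the threshold are assumptions for illustration; real tools differ in how they choose thresholds and how they correct the flagged bases.

```python
from collections import Counter
from typing import Iterable, List


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurring in the reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmer_positions(read: str, counts: Counter, k: int,
                        min_count: int) -> List[int]:
    """Start positions of k-mers in `read` seen fewer than min_count times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]


# Toy data: five error-free copies and one read with a single substitution
# (A -> T at index 4). These reads are invented for the example.
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
counts = count_kmers(reads, k=4)

print(rare_kmer_positions("ACGTACGT", counts, k=4, min_count=2))
print(rare_kmer_positions("ACGTTCGT", counts, k=4, min_count=2))
```

A single sequencing error creates up to k rare k-mers, all overlapping the erroneous base, so a run of consecutive flagged positions points at the base that needs correcting.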
Hashing (k = 4)
Graph Building
[Figure: graph built from the 4-mers of reads covering the sequence TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG. Each node is a k-mer labelled with the number of times it was observed.]
Nodes on the main paths: TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (16x), GCTT (11x), CTTT (8x), TTTA (8x), TTAG (12x), TAGA (16x), AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x)
Low-count nodes: GATT (1x), AGAA (1x), CTTC (1x), TTCA (2x), TCAG (2x), CAGA (1x), CGAC (1x), GACG (1x), ACGC (1x)
[Figure: the same graph after merging linear chains of nodes: GATT, GATCCGATGAG, AGAT, AGAA, GCTCTAG, TAGTCGA, CGAG, GAGGCT, TAGA, AGAGA, AGACAG, CGACGC, GCTTTAG. Low-count branches are marked as Tips and as a Bubble.]
[Figure: final graph after the tips and the bubble are removed: TAGTCGAG, GAGGCTTTAGA, AGATCCGATGAG, AGAGACAG]
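The hashing and graph-building steps can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation of any particular assembler: it uses a single error-free read (the example sequence above), ignores reverse complements, and does no tip or bubble removal. Nodes are k-mers, an edge joins two k-mers that are adjacent in a read, and unambiguous chains are then merged.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_graph(reads: Iterable[str], k: int) -> Tuple[Counter, Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Hash reads into k-mers and link k-mers that are adjacent in a read."""
    counts: Counter = Counter()
    succ: Dict[str, Set[str]] = defaultdict(set)
    pred: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        counts.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return counts, succ, pred


def collapse_chains(counts: Counter, succ: Dict[str, Set[str]],
                    pred: Dict[str, Set[str]]) -> List[str]:
    """Merge runs of nodes joined by unambiguous links into longer sequences.
    Isolated cycles with no branch point are not reported by this sketch."""
    def starts_chain(node: str) -> bool:
        parents = pred.get(node, set())
        if len(parents) != 1:
            return True
        (parent,) = parents
        return len(succ.get(parent, set())) != 1

    contigs = []
    for node in sorted(counts):
        if not starts_chain(node):
            continue
        seq, cur = node, node
        while len(succ.get(cur, set())) == 1:
            (nxt,) = succ[cur]
            if len(pred.get(nxt, set())) != 1:
                break  # nxt is a merge point, so it starts its own chain
            seq += nxt[-1]
            cur = nxt
        contigs.append(seq)
    return contigs


read = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
counts, succ, pred = build_graph([read], k=4)
print(counts["GAGG"])          # k-mers in the repeat are seen twice
print(collapse_chains(counts, succ, pred))
```

The repeated stretch GAGGCTTTAGA collapses into a single node with two ways in and two ways out, which is why the graph branches there instead of reconstructing the sequence as one linear contig.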
De novo Short Read Assembler Performance
Assembler Num N50 (kb) Errors
ALLPATHS-LG 60 96.7 20
MSR-CA 94 59.2 34
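For readers unfamiliar with the N50 column, here is a minimal sketch of the usual definition: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the total assembly size. The example lengths are invented and are not taken from the table.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L contain at least
    half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths given")


# Invented lengths (kb), for illustration only.
print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 with fewer sequences indicates a more contiguous assembly, but it says nothing about correctness, which is why the table reports errors alongside it.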
Fragment
Biotinylate
Circularize
Fragment (400-600bp)
Capture Biotinylated fragments
[Figure: workflow diagram of the steps above; the fragment ends are labelled with biotin (Bio).]