7a Genomics 2-24 PDF

The document provides a brief overview of the evolution and history of DNA sequencing technologies. It discusses early methods from the 1970s like chromatography and Sanger dideoxy sequencing. It then covers major developments like the first genome sequenced in 1977 (φX174), large scale automated sequencing in the 1990s, the first human genome draft in 2001, and the introduction of next generation sequencing technologies starting in the 2000s including Illumina, Ion Torrent, and PacBio.


All science is either

physics or stamp
collecting...

- Ernest Rutherford
As quoted in Rutherford at
Manchester

Data! Data! Data! ...

I can't make bricks
without clay

- Sherlock Holmes
The Adventure of the Copper Beeches
http://www.perkydesigns.com

ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT

Evolution of
sequencing

Archaic sequencing methods


Early 70s: chromatography

First nucleotide sequencing

First DNA sequencing


Proc. Nat. Acad. Sci. USA
Vol. 70, No. 12, Part I, pp. 3581-3584, December 1973

The Nucleotide Sequence of the lac Operator


(regulation/protein-nucleic acid interaction/DNA-RNA sequencing/oligonucleotide priming)

WALTER GILBERT AND ALLAN MAXAM


Department of Biochemistry and Molecular Biology, Harvard University, Cambridge, Massachusetts 02138

Communicated by J. D. Watson, August 9, 1973


ABSTRACT
The lac repressor protects the lac operator against digestion with deoxyribonuclease. The protected fragment is double-stranded and about 27 base-pairs long. We determined the sequence of RNA transcription copies of this fragment and present a sequence for 24 base pairs. It is:
5'--T GG AATT GT GA GC GG AT AAC AATT 3'
3'--A CC TTAA CA CT CG CC TA TTG TTAA 5'
The sequence has 2-fold symmetry regions; the two longest are separated by one turn of the DNA double helix.

The lactose repressor selects one out of six million nucleotide sequences in the Escherichia coli genome and binds to it to prevent the expression of the genes for lactose metabolism. How does this protein, a 150,000-dalton tetramer of identical subunits, recognize its target? To answer this question we have determined the sequence of the repressor-binding site: the operator.
Genetically the operator is the locus of operator constitutive

bind again to the repressor, and is about 27 base-pairs long.


Here we shall describe its sequence.
METHODS
Sonicated DNA Fragments. Sonicated [32P]DNA fragments were made by growing a temperature-inducible lysogen of λcI857plac5S7 at 34° in a glucose-50 mM Tris·HCl or TES (pH 7.4) medium in 3 mM phosphate, heating at 42° for 15 min at a cell density of 4 × 10^8/ml, then washing and resuspending the cells at a density of 8 × 10^8/ml in the same medium with 0.1 mM phosphate. 100 mCi of neutralized H₃³²PO₄ was added to 10 ml of cells, and the incorporation was continued for 2 hr at 34°. The cells were washed, suspended in 2 ml of TE buffer [10 mM Tris·HCl (pH 7.5)-1 mM EDTA], sonicated with six 15-sec bursts, and extracted with phenol.
The aqueous phase was extracted with ether, and the residual
ether was removed with a stream of N2. The mixture of radio-

First Genome Sequence

Sanger dideoxy sequencing

First DNA genome sequenced in 1977: φX174.

1990s: Large scale automated sequencing

Generation 1: Gel based or capillary

First automated sequencing

Capillary Sequencing

1995: Haemophilus influenzae

2001 Human Genome

Human Genome
Not a single individual
Was a hack job
Refined over the next 5 yrs

Reference
assembly

Next generation sequencing

Massively Parallel Signature Sequencing (MPSS)
Early 1990s: created by Lynx Therapeutics,
purchased by Solexa/Illumina

Illumina Video
https://www.youtube.com/watch?v=womKfikWlxM

Early next-gen sequencers


Table 1. Comparison of different sequencing technologies, taken from [34].

Sequencer              ABI 3730   Roche 454   Solexa (a)       SOLiD (mp, frag) (b)   HeliScope (c)
Read length (bp)       600–900    400–500     75–100           50                     25–35
Run time               6–10 h     10 h        2–10 d           (4–7 d, 8–14 d)        -
Yield (Mbp)            0.01       -           2,300–3,500/d    (500, 1,000)           105–140/h
Cloning bias           Yes        No          No               No                     No
Mate pair information  Yes        No          Yes              Yes                    No

(a) Based on the GA IIx. See full specifications at: http://www.illumina.com/systems/genome_analyzer.ilmn.
(b) mp, mate pair; frag, fragment. See https://products.appliedbiosystems.com/ SOLiD 3 Plus System.
(c) See: http://www.helicosbio.com/Products/HelicosregGeneticAnalysisSystem/HeliScopetradeSequencer/tabid/87/Default.aspx.
doi:10.1371/journal.pcbi.1000667.t001

processing the data. See Table 1 for a comparison between the yield, fragment length, and run times of the different sequencers.
In pyrosequencing methods (Figure 2) [28,29], such as Roche 454 [30], sequencing is performed by polymerase extension of a primed template. A single nucleotide species is added at each cycle. If the nucleotide species added to the polymerase reaction pairs with the one on the template, the incorporation causes a luciferase-based light reaction. The reaction chamber is then washed, and the cycle repeated. Several hundreds of thousands of wells containing material for sequencing are typically used in a single reaction. Second is the inability to read long mononucleotide repeats correctly.
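
The flow cycle described above can be sketched in a few lines of Python. This is an idealized toy, not the vendor's base-calling software: the function name `pyrosequence`, the flow order `TACG`, and the assumption that the light signal equals the exact number of bases incorporated are all illustrative choices.

```python
def pyrosequence(read, flow_order="TACG", max_cycles=100):
    """Simulate an idealized pyrosequencing run for one well.

    `read` is the strand being synthesized, so flowing nucleotide X
    extends the strand whenever the next base to be added is X.
    Returns (flowgram, called_sequence), where the flowgram is a list of
    (nucleotide, signal) pairs and signal is the number of bases added.
    """
    pos = 0
    flowgram = []
    for _ in range(max_cycles):
        for nucleotide in flow_order:
            run = 0
            # A homopolymer run is incorporated in a single flow,
            # giving one light pulse whose size reflects the run length.
            while pos < len(read) and read[pos] == nucleotide:
                run += 1
                pos += 1
            flowgram.append((nucleotide, run))
        if pos == len(read):
            break
    called = "".join(nucleotide * signal for nucleotide, signal in flowgram)
    return flowgram, called


flowgram, called = pyrosequence("TTAGC")
print(flowgram)
print(called)
```

In this sketch the run `TT` produces a single flow with signal 2. On a real instrument that signal is a noisy light intensity, so telling a run of 7 from a run of 8 is hard, which is one way to see why long mononucleotide repeats are read incorrectly.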

Metagenomic Sample Coverage


Coverage. Coverage of a genome is defined as the mean number of times a nucleotide is being sequenced. Thus, 5× coverage means that each nucleotide in the genome is sequenced a mean number of five times. If we could sequence a genome in a single read, then 1× coverage would suffice for sequencing. Shorter read lengths (25–700, depending on sequencing technologies, see Table 1), necessitate more coverage, to ensure all reads overlap, and that those overlaps are unique enough to
Next-gen sequencers
Current fashion:
Illumina
Ion Torrent
Around the corner
Real Time (PacBio)
Nanopore (Oxford)

Sequencing Overview
genomic segment
cut many times at random (Shotgun)

Get one or two reads from each segment

~900 bp

~900 bp

Reconstructing the Sequence (Fragment Assembly)

reads

Cover region with high redundancy


Overlap & extend reads to reconstruct the original genomic region

Steps to Assemble a Genome

1. Find overlapping reads
2. Merge some "good" pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence

Some Terminology

read: a 500-900 long word that comes out of sequencer
mate pair: a pair of reads from two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
consensus sequence: sequence derived from the multiple alignment of reads in a contig

..ACGATTACAATAGGTT..
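
Steps 1 and 2 (find overlapping reads, merge good pairs into contigs) can be illustrated with a greedy overlap-merge toy. This is a minimal sketch under strong assumptions: reads are error-free, all come from the same strand, and the genome has no repeats. The function names and the `min_len` threshold are hypothetical, and the assemblers listed later in these slides use far more elaborate methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ACGATTAC", "TTACAATA", "CAATAGGTT"]
print(greedy_assemble(reads))
```

With these three reads the sketch rebuilds the example region `ACGATTACAATAGGTT` as a single contig. Reads that share no overlap of at least `min_len` bases stay as separate contigs, which is where mate pairs (step 3) would be needed to order and orient them.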

Definition of Coverage

Length of genomic segment: G
Number of reads: N
Length of each read: L

Definition: Coverage C = NL/G

How much coverage is enough?


Lander-Waterman model: Prob[ not covered bp ] = e^(-C)
Assuming uniform distribution of reads, C=10 results in 1 gapped region per 1,000,000 nucleotides
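
A quick numeric check of the coverage definition and the Lander-Waterman estimate, as a sketch. The genome size, read length, and read count below are made-up illustrative values, and the expected-gap formula N·e^(-C) is the simplified form that ignores the minimum overlap needed to join two reads.

```python
import math


def coverage(n_reads, read_len, genome_len):
    """C = N * L / G."""
    return n_reads * read_len / genome_len


def prob_base_uncovered(c):
    """Lander-Waterman: probability that a given base is hit by no read."""
    return math.exp(-c)


def expected_gaps(n_reads, c):
    """Expected number of gaps, simplified to N * e^(-C)."""
    return n_reads * math.exp(-c)


# Illustrative values (assumptions): a 1 Mbp segment, 500 bp reads.
G, L, N = 1_000_000, 500, 20_000
C = coverage(N, L, G)
print(f"coverage C = {C:.1f}x")
print(f"P(base not covered) = {prob_base_uncovered(C):.2e}")
print(f"expected gaps in {G:,} bp = {expected_gaps(N, C):.2f}")
```

Under these assumptions C = 10, each base has roughly a 4.5 in 100,000 chance of being missed, and the expected number of gaps is just under one per million nucleotides, in line with the figure quoted above.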

Draft sequencing of full genome
6 to 8X coverage

SNP finding
>= 20x coverage

Assembly
Join reads to larger sequence: "contigs".

Reference based assembly


De Novo assembly

Publicly available de novo assemblers


Phrap (www.phrap.org)
Celera (wgs-assembler.sf.net)
Paracel (www.paracel.com)
Arachne (ftp://ftp.broadinstitute.org/pub/crd/ARACHNE/)
CAP3 (http://seq.cs.iastate.edu/)

Gene prediction

Evidence based gene calling: BLAST


Ab initio gene calling; no homolog required:
GeneMark, Glimmer, MetaGene.

ORFans
Open Reading Frames (ORFs) with no similarity to any
sequence in the database.
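
To make "open reading frame" concrete, here is a naive six-frame ORF scan. It is only a sketch: it assumes the standard start codon ATG and stop codons TAA/TAG/TGA, the names `find_orfs` and `min_codons` are hypothetical, and ab initio gene callers such as GeneMark or Glimmer rely on statistical models of coding sequence rather than this simple scan.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) for each ATG...stop ORF in six frames.

    Coordinates are 0-based, end-exclusive, and for the "-" strand they
    refer to positions on the reverse-complemented sequence.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

An ORF reported by a scan like this becomes an ORFan only if a subsequent similarity search (for example with BLAST) finds no match in the database.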

Annotation
Finding function of a gene

Next-gen sequencing
Whole Genome Sequencing
RNA-Seq
Exome
ChIP-Seq
Methylation (Bisulfite sequencing)

Lior Pachter's list

https://liorpachter.wordpress.com/seq/

Personal Genomes

Gonzaga-Jauregui 2012

Personal Genomes

~14.6 million non-redundant SNPs

Each genome vs. reference assembly: ~3.5 million SNPs and ~1,000 CNVs
