Bioinformatics Chapter 3

Genomic assembly involves reconstructing entire genomes from millions of short DNA sequences (reads) generated by sequencing machines. This is challenging because the reads' positions in the genome are unknown. Assembly algorithms model the problem as a graph problem by creating nodes for reads and edges between overlapping reads. Different algorithms use Hamiltonian paths, Eulerian paths, or De Bruijn graphs to represent the genome as a path through the graph to reconstruct the full sequence. Modern techniques now assemble genomes using read pairs and handle real datasets.
How Do Biologists Assemble Genomic Puzzles from Millions of Pieces?
Graph Algorithms

Phillip Compeau and Pavel Pevzner


Bioinformatics Algorithms: an Active Learning Approach
©2013 by Compeau and Pevzner. All rights reserved
Outline

• What Is Genome Sequencing?


• Exploding Newspapers
• The String Reconstruction Problem
• String Reconstruction as a Hamiltonian Path Problem
• String Reconstruction as an Eulerian Path Problem
• Similar Problems with Different Fates
• De Bruijn Graphs
• Euler’s Theorem
• Assembling Read-Pairs
• De Bruijn Graphs Face Harsh Realities of Assembly
Who Are These People?
Euler (1707-1783), Hamilton (1805-1865), De Bruijn (1918-2012)

The human genome is a three-billion-nucleotide-long “book” written in the A, C, G, T alphabet.

Some genomes are 100 times larger than the human genome: Amoeba dubia, Paris japonica
Why Do We Sequence 1000s of Species?

– Applications in medicine (genomes of fungi-producing bacteria), agriculture (oil palm genome), biotechnology (genomes of energy-producing cyanobacteria), etc.
Brief History of Genome Sequencing

• 1977: Walter Gilbert and Frederick Sanger develop independent DNA sequencing methods.
• 1980: They share the Nobel Prize.
• Still, their sequencing methods were too expensive ($3 billion to sequence the human genome).
The Race to Sequence the Human Genome
• 1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome by 2005.
• 1997: Craig Venter founds Celera Genomics, a private firm, with the same goal.
• 2000:
From Human to Mouse to Rat to …

Early 2000s: Many more mammalian genomes are sequenced using the same Sanger sequencing method, but it is clear that new technology is needed for further progress.
Next Generation Sequencing Technologies

• Late 2000s: The market for new sequencing machines takes off.
– Illumina reduces the cost of sequencing a human genome from $3 billion to $10,000.
– Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month.
– Beijing Genome Institute orders hundreds of sequencing machines, becoming the world’s largest sequencing center.
Personal Genome Sequencing
Few Mutations Can Make a Big Difference…
• Different people have slightly different genomes: on average, roughly 1 mutation in 1000 nucleotides.

• The 1 in 1000 nucleotides difference accounts for height, high cholesterol susceptibility, and 1000s of genetic diseases.

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACG
ATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCT
ATCGATCGATCGATCGATTATCTACGATCGATCGATCGA
TCACTATACGAGCTACTACGTACGTACGATCGCGGGACT
ATTATCGACTACAGATAAAACATGCTAGTACAACAGTAT
ACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATA
TCCGAT
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACG
ATCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCT
ATCGATCGATCGATCGATTATCTACGATCGATCGATCGA
TCACTATACGAGCTACTACGTACGTACGATCGCGTGACT
ATTATCGACTACAGATGAAACATGCTAGTACAACAGTAT
ACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATA
TCCGAT
Why Do We Sequence Personal Genomes?

• 2010: Nicholas Volker became the first human being to be saved by genome sequencing.
– Doctors could not diagnose his condition; he went through dozens of surgeries.
– Sequencing revealed a rare mutation in the XIAP gene linked to a defect in his immune system.
– This led doctors to use immunotherapy, which saved the child.
10,000 Genomes and Beyond
• 2010: Scientists launch a project to sequence 10,000 vertebrate genomes.

• Now: Human genome sequencing costs just a few thousand dollars, and human genomes for under $1,000 may arrive any day now.
The Newspaper Problem
The Newspaper Problem as an Overlapping Puzzle
Multiple Copies of a Genome (Millions of them)

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC
Breaking the Genomes at Random Positions

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC
Generating “Reads”

CTGATGA TGGACTACGCTAC TACTGCTAG CTGTATTACG ATCAGCTACCACA TCGTAGCTACG ATGCATTAGCAA GCTATCGGA TCAGCTACCA CATCGTAGC
CTGATGATG GACTACGCT ACTACTGCTA GCTGTATTACG ATCAGCTACC ACATCGTAGCT ACGATGCATTA GCAAGCTATC GGATCAGCTAC CACATCGTAGC
CTGATGATGG ACTACGCTAC TACTGCTAGCT GTATTACGATC AGCTACCAC ATCGTAGCTACG ATGCATTAGCA AGCTATCGG A TCAGCTACCA CATCGTAGC
CTGATGATGGACT ACGCTACTACT GCTAGCTGTAT TACGATCAGC TACCACATCGT AGCTACGATGCA TTAGCAAGCT ATCGGATCA GCTACCACATC GTAGC
“Burning” Some Reads

CTGATGA TGGACTACGCTAC TACTGCTAG CTGTATTACG ATCAGCTACCACA TCGTAGCTACG ATGCATTAGCAA GCTATCGGA TCAGCTACCA CATCGTAGC
CTGATGATG GACTACGCT ACTACTGCTA GCTGTATTACG ATCAGCTACC ACATCGTAGCT ACGATGCATTA GCAAGCTATC GGATCAGCTAC CACATCGTAGC
CTGATGATGG ACTACGCTAC TACTGCTAGCT GTATTACGATC AGCTACCAC ATCGTAGCTACG ATGCATTAGCA AGCTATCGG A TCAGCTACCA CATCGTAGC
CTGATGATGGACT ACGCTACTACT GCTAGCTGTAT TACGATCAGC TACCACATCGT AGCTACGATGCA TTAGCAAGCT ATCGGATCA GCTACCACATC GTAGC
No Idea What Position Every Read Comes From

CTGATGATGGACT

GCTGTATTACG

GCTATCGGA
ATGCATTAGCAA

ACTACTGCTA
TACCACATCGT
CTGATGATGG
No Idea What Position Every Read Comes From

CTGATGATGGACT

GCTGTATTACG

GCTATCGGA

GCAAGCTATC ATGCATTAGCAA

ACTACTGCTA
TACCACATCGT
CTGATGATGG
From Experimental to Computational Challenges

Multiple (unsequenced) genome copies

Read generation

Reads

Genome assembly

Assembled genome
…GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC…
What Makes Genome Sequencing Difficult?

• Modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end (like we read a book).
• They can only shred the genome and generate short reads.
• Genome assembly is not the same as a jigsaw puzzle: we must use overlapping reads to reconstruct the genome, a giant overlap puzzle!
The Genome Sequencing Problem

Genome Sequencing Problem. Reconstruct a genome from reads.
• Input. A collection of strings Reads.
• Output. A string Genome reconstructed from Reads.

This is not a well-defined computational problem!
What Is k-mer Composition?

Composition3(TAATGCCATGGGATGTT)=
TAA
AAT
ATG
TGC
GCC
CCA
CAT
ATG
TGG
GGG
GGA
GAT
ATG
TGT
GTT
k-mer Composition

Composition3(TAATGCCATGGGATGTT)=
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
=
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

The k-mers can be listed in any order, e.g., lexicographic order (as in a dictionary).
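The composition above can be computed with a few lines of Python. This is a minimal sketch; the function name `composition` is an illustrative choice, not something defined in the slides.

```python
def composition(text, k):
    """Return all k-mers of text in lexicographic order (repeats are kept)."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    return sorted(kmers)

kmers = composition("TAATGCCATGGGATGTT", 3)
print(" ".join(kmers))
# AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
```

Sorting discards the order in which the k-mers occur in the genome, which is exactly the information an assembler has to recover.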


Reconstructing a String from its Composition

String Reconstruction Problem. Reconstruct a string from its k-mer composition.
• Input. A collection of k-mers.
• Output. A Genome such that Compositionk(Genome) is equal to the collection of k-mers.
A Naive String Reconstruction Approach

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
A Naive String Reconstruction Approach

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TGC TGG TGT

TAA
A Naive String Reconstruction Approach

ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TGC TGG TGT

TAA
AAT
A Naive String Reconstruction Approach

ATG ATG CAT CCA GAT GCC GGA GGG GTT TGC TGG TGT

TAA
AAT
ATG
A Naive String Reconstruction Approach

ATG ATG CAT CCA GAT GCC GGA GGG GTT TGC TGG

TAA
AAT
ATG
TGT
What’s Next?

ATG ATG CAT CCA GAT GCC GGA GGG TGC TGG

TAA
AAT
ATG
TGT
GTT
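The dead end reached above can be reproduced with a small greedy sketch. The rule for choosing among several matching 3-mers (take the lexicographically last one) is an assumption made here only so that the walk matches the slides; `greedy_extend` is a hypothetical helper name.

```python
def greedy_extend(kmers, start):
    """Extend a path from `start`, always taking the lexicographically last
    unused k-mer whose prefix matches the current suffix."""
    remaining = sorted(kmers)
    remaining.remove(start)
    path = [start]
    while True:
        suffix = path[-1][1:]
        candidates = [kmer for kmer in remaining if kmer[:-1] == suffix]
        if not candidates:
            # Stuck: nothing left that overlaps the end of the path.
            return path, remaining
        path.append(candidates[-1])
        remaining.remove(candidates[-1])

kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
path, unused = greedy_extend(kmers, "TAA")
print(path)         # ['TAA', 'AAT', 'ATG', 'TGT', 'GTT']
print(len(unused))  # 10
```

The walk stops after five 3-mers with ten still unused, because no 3-mer starts with TT. A naive extension strategy therefore needs some way to look ahead or backtrack.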
Representing a Genome as a Path
Composition3(TAATGCCATGGGATGTT) =

TAA → AAT → ATG → TGC → GCC → CCA → CAT → ATG → TGG → GGG → GGA → GAT → ATG → TGT → GTT

Can we construct this genome path without knowing the genome TAATGCCATGGGATGTT, only from its composition?

Yes. We simply need to connect k-mer1 with k-mer2 if suffix(k-mer1) = prefix(k-mer2).
E.g., TAA → AAT
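The connection rule suffix(k-mer1) = prefix(k-mer2) can be sketched as follows. Because the composition contains repeated 3-mers, this sketch uses list positions as node identifiers; that representation and the name `overlap_graph` are assumptions for illustration.

```python
def overlap_graph(kmers):
    """Map each k-mer index to the indices of the k-mers that may follow it."""
    edges = {i: [] for i in range(len(kmers))}
    for i, first in enumerate(kmers):
        for j, second in enumerate(kmers):
            # Connect i -> j when suffix(first) equals prefix(second).
            if i != j and first[1:] == second[:-1]:
                edges[i].append(j)
    return edges

kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
edges = overlap_graph(kmers)
for j in edges[kmers.index("TAA")]:
    print("TAA ->", kmers[j])   # TAA -> AAT
```

Each of the three ATG nodes gets edges to TGC, TGG, and TGT, which is why the original path becomes hard to see once all overlaps are drawn.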
A Path Turns into a Graph

TAATGCCATGGGATGTT

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

[Figure: the genome path with an edge added between every pair of 3-mers where suffix(k-mer1) = prefix(k-mer2).]

Can we still find the genome path in this graph?
Where Is the Genomic Path?

A Hamiltonian path: a path that visits each node in a graph exactly once.

TAATGCCATGGGATGTT

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

Nodes are arranged from left to right in lexicographic order.
What are we trying to find in this graph?
Hamiltonian Path Problem
Hamiltonian Path Problem. Find a Hamiltonian path in a graph.
• Input. A graph.
• Output. A path visiting every node in the graph exactly once.
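For a graph as small as the 3-mer overlap graph, a Hamiltonian path can be found by plain backtracking. The sketch below is not a fast algorithm: its running time can grow exponentially with the number of nodes. The function name and the index-based graph representation are assumptions for illustration.

```python
def hamiltonian_path(edges):
    """Return a list of nodes visiting every node exactly once, or None.

    `edges` maps each node to a list of successor nodes.
    """
    nodes = list(edges)

    def extend(path, visited):
        if len(path) == len(nodes):
            return path
        for nxt in edges[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in nodes:
        found = extend([start], {start})
        if found:
            return found
    return None

kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
# Overlap graph on indices: i -> j when suffix(kmers[i]) == prefix(kmers[j]).
edges = {i: [j for j in range(len(kmers))
             if i != j and kmers[i][1:] == kmers[j][:-1]]
         for i in range(len(kmers))}
path = hamiltonian_path(edges)
# Spell the string: the first k-mer, then the last letter of each later one.
genome = kmers[path[0]] + "".join(kmers[i][-1] for i in path[1:])
print(genome)
```

The printed string has the same 3-mer composition as TAATGCCATGGGATGTT, but it need not be that exact string, since more than one Hamiltonian path exists in this graph.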
Does This Graph Have a Hamiltonian Path?
Hamiltonian Path Problem. Find a Hamiltonian path in a graph.
• Input. A graph.
• Output. A path visiting every node in the graph exactly once.

[Figure: William Hamilton and his Icosian game (1857), drawn as an undirected graph with numbered nodes.]
TAATGGGATGCCATGTT
TAATGCCATGGGATGTT

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

[Figure: two different Hamiltonian paths through the same 3-mer graph, spelling the two strings above.]
A Slightly Different Path
TAATGCCATGGGATGTT

3-mers as nodes:
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

3-mers as edges:
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

How do we label the starting and ending nodes of an edge?

Edge TAA: starting node TA (prefix of TAA), ending node AA (suffix of TAA).
Labeling Nodes in the New Path
TAATGCCATGGGATGTT

3-mers as edges and 2-mers as nodes:
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

Gluing Identically Labeled Nodes
TAATGCCATGGGATGTT
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

[Figure sequence: the path is folded step by step so that nodes with identical labels (the three AT nodes, the three TG nodes, and the two GG nodes) are glued into single nodes.]
De Bruijn Graph of TAATGCCATGGGATGTT

[Figure: the de Bruijn graph with nodes TA, AA, AT, TG, GC, CC, CA, GG, GA, GT, TT and edges TAA, AAT, ATG (three times), TGC, GCC, CCA, CAT, TGG, GGG, GGA, GAT, TGT, GTT.]

Where is the Genome hiding in this graph?
It Was Always There!
TAATGCCATGGGATGTT

[Figure: the de Bruijn graph of TAATGCCATGGGATGTT with the genome traced as a path through its edges.]

What are we trying to find in this graph?
An Eulerian path in a graph is a path that visits each edge exactly once.
Eulerian Path Problem
Eulerian Path Problem. Find an Eulerian path in a graph.

• Input. A graph.

• Output. A path visiting every edge in the graph exactly once.


Eulerian Versus Hamiltonian Paths
Eulerian Path Problem. Find an Eulerian path in a graph.

• Input. A graph.

• Output. A path visiting every edge in the graph exactly once.


Hamiltonian Path Problem. Find a Hamiltonian path in a graph.

• Input. A graph.

• Output. A path visiting every node in the graph exactly once.

Find a difference!
What Problem Would You Prefer to Solve?

Hamiltonian Path Problem Eulerian Path Problem

While Euler solved the Eulerian Path Problem (even for a city with a million bridges), nobody has developed a fast algorithm for the Hamiltonian Path Problem yet.
NP-Complete Problems
• The Hamiltonian Path Problem belongs to a collection containing thousands of computational problems for which no fast algorithms are known.

“I can't find an efficient algorithm, I guess I'm just too dumb.”

From Garey and Johnson. Computers and Intractability. 1979

Change of Attitude
That would be an excellent argument, but the question of whether or not NP-Complete problems can be solved efficiently is one of seven Millennium Problems in mathematics.

“I can't find an efficient algorithm, because no such algorithm is possible.”

From Garey and Johnson. Computers and Intractability. 1979

The Modern State of Affairs
NP-Complete problems are all equivalent: find an efficient solution to one, and you have an efficient solution to them all.

“I can't find an efficient algorithm, but neither can all these famous people.”
Eulerian Path Problem
Eulerian Path Problem. Find an Eulerian path in a graph.

• Input. A graph.

• Output. A path visiting every edge in the graph exactly once.

We constructed the de Bruijn graph from Genome, but in reality, Genome is unknown!
What We Have Done: From Genome to de Bruijn Graph
TAATGCCATGGGATGTT

[Figure: the de Bruijn graph of TAATGCCATGGGATGTT.]
What We Want: From Reads (k-mers) to Genome
TAATGCCATGGGATGTT

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
What We Will Show: From Reads to de Bruijn Graph to Genome
TAATGCCATGGGATGTT

[Figure: the de Bruijn graph of TAATGCCATGGGATGTT.]

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
Constructing de Bruijn Graph when Genome Is Known
TAATGCCATGGGATGTT
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT
Constructing de Bruijn when Genome Is Unknown

TAA ATG GCC CAT TGG GGA ATG GTT

AAT TGC CCA ATG GGG GAT TGT

Composition3(TAATGCCATGGGATGTT)
Representing Composition as a Graph Consisting of Isolated Edges

TAA ATG GCC CAT TGG GGA ATG GTT

AAT TGC CCA ATG GGG GAT TGT

Composition3(TAATGCCATGGGATGTT)
Constructing de Bruijn Graph from k-mer Composition

TAA ATG GCC CAT TGG GGA ATG GTT


TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

AAT TGC CCA ATG GGG GAT TGT


AA AT TG GC CC CA AT TG GG GG GA AT TG GT

Composition3(TAATGCCATGGGATGTT)
Gluing Identically Labeled Nodes

[Figure sequence: the isolated edges are joined one at a time; the ending node of each edge is glued to the identically labeled starting node of the next edge, until the edges form a single path.]
We Are Not Done with Gluing Yet

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT
Gluing Identically Labeled Nodes
TAATGCCATGGGATGTT

[Figure sequence: nodes with identical labels (the three AT nodes, the three TG nodes, and the two GG nodes) are glued into single nodes.]
The Same de Bruijn Graph:
DeBruijn(Genome) = DeBruijn(Genome Composition)

[Figure: the de Bruijn graph built from the 3-mer composition, identical to the graph built from the genome TAATGCCATGGGATGTT.]
Constructing de Bruijn Graph
De Bruijn graph of a collection of k-mers:
– Represent every k-mer as an edge between its prefix and suffix.
– Glue ALL nodes with identical labels.

DeBruijn(k-mers)
    form a node for each (k-1)-mer from k-mers
    for each k-mer in k-mers
        connect its prefix node with its suffix node by an edge
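The DeBruijn(k-mers) pseudocode translates almost directly into Python. In this sketch the graph is stored as an adjacency list keyed by node label, so gluing identically labeled nodes happens automatically; the name `de_bruijn` and the dictionary representation are assumptions for illustration.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Return {prefix: [suffix, ...]} with one list entry per k-mer (edge)."""
    graph = defaultdict(list)
    for kmer in kmers:
        # Each k-mer is an edge from its (k-1)-mer prefix to its suffix.
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
graph = de_bruijn(kmers)
for node in sorted(graph):
    print(node, "->", ", ".join(graph[node]))
```

Repeated k-mers such as ATG yield parallel edges (AT appears with three entries for TG). A node with no outgoing edges, such as TT, shows up only as a target and has no key of its own in this representation.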
From Hamilton to Euler to de Bruijn

Universal String Problem (Nicolaas de Bruijn, 1946). Find a circular string containing each binary k-mer exactly once.

000 001 010 011 100 101 110 111

[Figure: eight binary digits arranged in a circle.]

Each 3-mer as an edge from its prefix to its suffix:
000: 00 → 00
001: 00 → 01
010: 01 → 10
011: 01 → 11
100: 10 → 00
101: 10 → 01
110: 11 → 10
111: 11 → 11

Gluing identically labeled nodes leaves four nodes: 00, 01, 10, 11.
De Bruijn Graph for 4-Universal String

Does it have an Eulerian cycle? If yes, how can we find it?


Eulerian CYCLE Problem
Eulerian CYCLE Problem. Find an Eulerian cycle in a graph.

• Input. A graph.

• Output. A cycle visiting every edge in the graph exactly once.


A Graph is Eulerian if It Contains an
Eulerian Cycle.
Is this graph Eulerian?
1 in, 2 out

A graph is balanced if indegree = outdegree for each node


Is the Graph for 4-Universal String Balanced?
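One way to answer this question is to build the graph and count degrees. The sketch below assumes the graph for a 4-universal string has binary 3-mers as nodes and one edge per binary 4-mer; the helper names are illustrative.

```python
from collections import Counter
from itertools import product

def binary_de_bruijn_edges(k):
    """One edge (prefix, suffix) for every binary k-mer."""
    kmers = ["".join(bits) for bits in product("01", repeat=k)]
    return [(kmer[:-1], kmer[1:]) for kmer in kmers]

def is_balanced(edges):
    """True when every node has as many incoming as outgoing edges."""
    outdegree = Counter(source for source, _ in edges)
    indegree = Counter(target for _, target in edges)
    return all(indegree[node] == outdegree[node]
               for node in set(indegree) | set(outdegree))

edges = binary_de_bruijn_edges(4)
print(len(edges), is_balanced(edges))   # 16 True
```

Every node has two outgoing edges (append 0 or 1) and two incoming edges (prepend 0 or 1), so the graph is balanced for any k.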
Euler’s Theorem
• Every Eulerian graph is balanced
• Every balanced* graph is Eulerian

(*) and strongly connected, of course!


Recruiting an Ant to Prove Euler’s Theorem

Let an ant randomly walk through the graph.


The ant cannot use the same edge twice!
If Ant Was a Genius…

“Yay!
Now can I
go home
please?”
A Less Intelligent Ant Would Randomly
Choose a Node and Start Walking…
Walking…
Walking… and Walking…
Walking… and Walking… and Walking…

Can it get stuck? In what node?


The Ant Can Only Get Stuck at the Starting Node
The Ant Has Completed a Cycle
BUT has not Proven Euler’s theorem yet…
The constructed cycle is not Eulerian. Can we enlarge it?
Let’s Start at a Different Node in the Green Cycle
Let’s start at a node with still unexplored edges.

“Why should I start at a different node? Backtracking? I’m not evolved to walk backwards! And what difference does it make???”
New Instructions for the Ant:
Starting at a node that has an unused edge, traverse the already constructed (green) cycle and return to the starting node.

An Ant Traversing Previously Constructed Cycle

“Why do I have to walk along the same cycle again??? Can I see something new?”

I Returned Back BUT… I Can Continue Walking!
After completing the cycle, start random exploration of still untraversed edges in the graph.
Enlarging the Previously Constructed Cycle
Stuck Again!
No Eulerian cycle yet… can we enlarge the green-blue cycle?

The ant should walk along the constructed cycle starting at yet another node. Which one?
Starting at a New Node, Again…
Traversing the Previously Constructed Green-Blue Cycle
“I hate to traverse the same cycle! What difference does it make where I start my walk???”
“These instructions are stupid…”
I Returned Back BUT… I Can Continue Walking!

“Hmm, maybe these instructions were not that stupid…”
Enlarging the Green-Blue Cycle
I Proved Euler’s Theorem! Can I Go Home Please?
EulerianCycle(BalancedGraph)
    form a Cycle by randomly walking in BalancedGraph (avoiding already visited edges)
    while Cycle is not Eulerian
        select a node newStart in Cycle with still unexplored outgoing edges
        form a Cycle’ by traversing Cycle from newStart and randomly walking afterwards
        Cycle ← Cycle’
    return Cycle
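The ant's instructions can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the adjacency-list input format, the `seed` parameter, and the example graph are assumptions, and the graph is assumed to be balanced and strongly connected.

```python
import random

def eulerian_cycle(graph, seed=0):
    """graph: dict mapping each node to a list of successor nodes.
    Assumed balanced (in-degree == out-degree everywhere) and strongly connected."""
    rng = random.Random(seed)
    unused = {u: list(vs) for u, vs in graph.items()}  # edges not walked yet
    total_edges = sum(len(vs) for vs in graph.values())

    def random_walk(start):
        # Follow unused edges until stuck; in a balanced graph the walk
        # can only get stuck back at its starting node.
        walk, node = [start], start
        while unused[node]:
            node = unused[node].pop(rng.randrange(len(unused[node])))
            walk.append(node)
        return walk

    cycle = random_walk(next(iter(graph)))
    while len(cycle) - 1 < total_edges:  # Cycle is not Eulerian yet
        # newStart: a node on the cycle that still has unexplored outgoing edges
        i = next(i for i, u in enumerate(cycle) if unused[u])
        # traverse the old cycle starting and ending at newStart ...
        rotated = cycle[i:-1] + cycle[:i] + [cycle[i]]
        # ... and keep walking randomly afterwards
        cycle = rotated + random_walk(cycle[i])[1:]
    return cycle

# Hypothetical balanced graph with 12 edges, used only for illustration.
graph = {0: [3], 1: [0], 2: [1, 6], 3: [2], 4: [2],
         5: [4], 6: [5, 8], 7: [9], 8: [7], 9: [6]}
cycle = eulerian_cycle(graph)
print(" -> ".join(map(str, cycle)))
```

The order of the nodes depends on the random choices, but every run returns a closed walk that uses each edge exactly once.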
Outline

• What Is Genome Sequencing?
• Exploding Newspapers
• The String Reconstruction Problem
• String Reconstruction as a Hamiltonian Path Problem
• String Reconstruction as an Eulerian Path Problem
• Similar Problems with Different Fates
• De Bruijn Graphs
• Euler’s Theorem
• Assembling Read-Pairs
• De Bruijn Graphs Face Harsh Realities of Assembly
From Reads to de Bruijn Graph to Genome
Genome: TAATGCCATGGGATGTT

[Figure: de Bruijn graph of the 3-mers below, with 2-mers as nodes; an Eulerian path through it spells the genome.]

Reads (3-mers): AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
Multiple Eulerian Paths
TAATGCCATGGGATGTT
TAATGGGATGCCATGTT

[Figure: two Eulerian paths in the same de Bruijn graph, each spelling one of the strings above.]
Breaking Genome into Contigs
[Figure: the de Bruijn graph for TAATGCCATGGGATGTT broken into contigs; contig labels recoverable from the figure include TAAT, TGCCAT, ATG, TGG, TGTT and GGG.]
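One common way to compute contigs, sketched below, is to report every maximal non-branching path in the de Bruijn graph. The function names are hypothetical, and the sketch assumes the graph has no isolated cycles (those would need a separate pass).

```python
from collections import defaultdict

def de_bruijn(kmers):
    # One edge per k-mer, from its prefix to its suffix; identical nodes are
    # glued automatically because node labels are dictionary keys.
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def contigs(kmers):
    graph = de_bruijn(kmers)
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1

    def one_in_one_out(v):
        return in_deg[v] == 1 and len(graph.get(v, [])) == 1

    result = []
    for u in list(graph):
        if one_in_one_out(u):
            continue  # paths start only at branching (or source) nodes
        for v in graph[u]:
            contig = u + v[-1]
            while one_in_one_out(v):  # extend through non-branching nodes
                v = graph[v][0]
                contig += v[-1]
            result.append(contig)
    return sorted(result)

text = "TAATGCCATGGGATGTT"
kmers = [text[i:i + 3] for i in range(len(text) - 2)]
print(contigs(kmers))
# ['ATG', 'ATG', 'ATG', 'GGAT', 'GGG', 'TAAT', 'TGCCAT', 'TGG', 'TGTT']
```

Each contig is a stretch of the genome that every Eulerian path must spell, so contigs can be reported even when the full path is ambiguous.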
DNA Sequencing with Read-pairs
Multiple identical copies of genome

Randomly cut genomes into large equally sized fragments of size InsertLength

Generate read-pairs: two reads from the ends of each fragment (separated by a fixed distance)

[Figure: a fragment of length InsertLength with a 200 bp read at each end.]
From k-mers to Paired k-mers

Genome: ...A T C A G A T T A C G T T C C G A G …

[Figure: Read 1 and Read 2 marked on Genome, at distance d=11.]

A paired k-mer is a pair of k-mers at a fixed distance d apart in Genome. E.g. TCA and TCC are at distance d=11 apart.

Disclaimers:
1. In reality, Read1 and Read2 are typically sampled from different strands:
(→ ……. ← rather than → ……. →)
2. In reality, the distance d between reads is measured with errors.
What is PairedComposition(TAATGCCATGGGATGTT)?

Representing a paired 3-mer TAA GCC as a 2-line expression:
TAA
GCC

All paired 3-mers, in the order they occur in the genome:
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
PairedComposition(TAATGCCATGGGATGTT)

Representing PairedComposition in lexicographic order:
AAT ATG ATG CAT CCA GCC GGA GGG TAA TGC TGG
CCA CAT GAT GGA GGG TGG GTT TGT GCC ATG ATG
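A short sketch of how the paired composition can be computed. The slides do not state d for this example, so the code assumes a hypothetical `gap` parameter counting the nucleotides between the two k-mers; the pair TAA GCC then corresponds to k=3, gap=1.

```python
def paired_composition(text, k, gap):
    """All pairs of k-mers in text separated by `gap` nucleotides (assumed convention)."""
    span = 2 * k + gap  # total length covered by one paired k-mer
    return [(text[i:i + k], text[i + k + gap:i + span])
            for i in range(len(text) - span + 1)]

pairs = paired_composition("TAATGCCATGGGATGTT", k=3, gap=1)
for first, second in sorted(pairs):  # lexicographic order
    print(first, second)
```

Sorting the pairs discards their order in the genome, which is exactly the information an assembler has to recover.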


String Reconstruction from Read-Pairs Problem

String Reconstruction from Read-Pairs Problem. Reconstruct a string from its paired k-mers.
• Input. A collection of paired k-mers.
• Output. A string Text such that PairedComposition(Text) is
equal to the collection of paired k-mers.
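The output condition can be checked directly. The sketch below assumes the collection is a multiset (repeated pairs count) and uses a hypothetical `gap` parameter for the number of nucleotides between the two k-mers; neither choice is spelled out on the slide.

```python
from collections import Counter

def paired_composition(text, k, gap):
    span = 2 * k + gap
    return [(text[i:i + k], text[i + k + gap:i + span])
            for i in range(len(text) - span + 1)]

def is_valid_reconstruction(text, pairs, k, gap):
    # Text is a valid output if its paired composition matches the input
    # collection, ignoring order but respecting multiplicities.
    return Counter(paired_composition(text, k, gap)) == Counter(pairs)

pairs = paired_composition("TAATGCCATGGGATGTT", k=3, gap=1)
print(is_valid_reconstruction("TAATGCCATGGGATGTT", pairs, 3, 1))  # True
print(is_valid_reconstruction("TAATGGGATGCCATGTT", pairs, 3, 1))  # False
```

The second string has the same 3-mer composition as the first, but its paired composition differs (it contains the pair TAA GGG, for instance), so the pairing information rules it out.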
How Would de Bruijn Assemble Paired k-mers?
Paired de Bruijn Graphs
Representing Genome TAATGCCATGGGATGTT as a Path
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

For the paired 3-mer with lines CCA and GGG, the paired prefix has lines CC and GG, and the paired suffix has lines CA and GG.
Labeling Nodes by Paired Prefixes and Suffixes

Edges (paired 3-mers):
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

Nodes (paired prefixes and suffixes):
TA AA AT TG GC CC CA AT TG GG GG GA
GC CC CA AT TG GG GG GA AT TG GT TT
Glue nodes with identical labels

[Figure: Paired de Bruijn Graph from the Genome, obtained by gluing the nodes of the path that carry identical labels.]


Constructing Paired de Bruijn Graph from paired k-mers

[Figure sequence: each paired 3-mer is drawn as an isolated edge from its paired prefix to its paired suffix; nodes with identical labels are then glued step by step until the edges form the path below.]

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA
GC CC CA AT TG GG GG GA AT TG GT TT

We Are Not Done with Gluing Yet

[Figure: Paired de Bruijn Graph from read-pairs, after the remaining nodes with identical labels are glued.]


Paired de Bruijn Graphs

• Paired de Bruijn graph for a collection of paired k-mers:
– Represent every paired k-mer as an edge between its paired prefix and paired suffix.
– Glue ALL nodes with identical labels.
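The two construction steps, followed by an Eulerian path, can be sketched as below. All function names and the `gap` parameter (nucleotides between the two k-mers of a pair) are assumptions made for illustration. In general not every Eulerian path in a paired de Bruijn graph spells a consistent string, so the sketch checks the overlap and returns None when it fails.

```python
from collections import defaultdict

def paired_de_bruijn(pairs):
    # Edge per paired k-mer: paired prefix -> paired suffix. Node labels are
    # dictionary keys, so nodes with identical labels are glued automatically.
    graph = defaultdict(list)
    for first, second in pairs:
        graph[(first[:-1], second[:-1])].append((first[1:], second[1:]))
    return graph

def eulerian_path(graph):
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    unused = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:  # Hierholzer-style traversal with an explicit stack
        u = stack[-1]
        if unused.get(u):
            stack.append(unused[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def reconstruct(pairs, k, gap):
    path = eulerian_path(paired_de_bruijn(pairs))
    first = path[0][0] + "".join(node[0][-1] for node in path[1:])
    second = path[0][1] + "".join(node[1][-1] for node in path[1:])
    offset = k + gap
    if first[offset:] != second[:-offset]:
        return None  # the two spelled strings do not overlap consistently
    return first + second[-offset:]

pairs = [("AAT", "CCA"), ("ATG", "CAT"), ("ATG", "GAT"), ("CAT", "GGA"),
         ("CCA", "GGG"), ("GCC", "TGG"), ("GGA", "GTT"), ("GGG", "TGT"),
         ("TAA", "GCC"), ("TGC", "ATG"), ("TGG", "ATG")]
print(reconstruct(pairs, k=3, gap=1))  # TAATGCCATGGGATGTT
```

In this example the only glued node is the one with lines TG and AT, which ends up with two outgoing edges.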
Which Graph Represents a Better Assembly?
Paired de Bruijn Graph: unique genome reconstruction
TAATGCCATGGGATGTT

De Bruijn Graph: multiple genome reconstructions
TAATGCCATGGGATGTT
TAATGGGATGCCATGTT


Some Ridiculously Unrealistic Assumptions
• Perfect coverage of genome by reads (every k-mer from the genome is represented by a read).
• Reads are error-free.
• Multiplicities of k-mers are known.
• Distances between reads within read-pairs are exact.


Some Ridiculously Unrealistic Assumptions
In reality:
• Imperfect coverage of genome by reads (not every k-mer from the genome is represented by a read).
• Reads are error-prone.
• Multiplicities of k-mers are unknown.
• Distances between reads within read-pairs are inexact.
• Etc., etc., etc.


1st Unrealistic Assumption: Perfect Coverage

Genome: atgccgtatggacaacgact
Reads:
atgccgtatg
gccgtatgga
gtatggacaa
gacaacgact

250-nucleotide reads generated by Illumina technology capture only a small fraction of 250-mers from the genome, thus violating the key assumption of the de Bruijn graphs.
Breaking Reads into Shorter k-mers
Genome: atgccgtatggacaacgact
Reads (10-mers): atgccgtatg, gccgtatgga, gtatggacaa, gacaacgact
5-mers obtained by breaking the reads: atgcc, tgccg, gccgt, ccgta, cgtat, gtatg, tatgg, atgga, tggac, ggaca, gacaa, acaac, caacg, aacga, acgac, cgact
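A small sketch of why breaking reads helps, using the toy genome and reads from this slide. The helper name `kmers` is an assumption. The four reads cover only 4 of the 11 10-mers in the genome, but their 5-mers cover all 16 5-mers.

```python
def kmers(text, k):
    return [text[i:i + k] for i in range(len(text) - k + 1)]

genome = "atgccgtatggacaacgact"
reads = ["atgccgtatg", "gccgtatgga", "gtatggacaa", "gacaacgact"]

genome_10mers = set(kmers(genome, 10))
covered_10mers = set(reads) & genome_10mers

genome_5mers = set(kmers(genome, 5))
covered_5mers = {kmer for read in reads for kmer in kmers(read, 5)}

print(len(covered_10mers), "of", len(genome_10mers), "10-mers covered")
print(len(covered_5mers & genome_5mers), "of", len(genome_5mers), "5-mers covered")
```

The price of a smaller k is a more tangled de Bruijn graph, since shorter k-mers repeat more often in a real genome.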
2nd Unrealistic Assumption: Error-free Reads

Genome: atgccgtatggacaacgact
Reads: atgccgtatg, gccgtatgga, gtatggacaa, gacaacgact
Erroneous read (change of t into C): cgtaCggaca
5-mers from the error-free reads: atgcc, tgccg, gccgt, ccgta, cgtat, gtatg, tatgg, atgga, tggac, ggaca, gacaa, acaac, caacg, aacga, acgac, cgact
Additional 5-mers from the erroneous read: cgtaC, gtaCg, taCgg, aCgga, Cggac
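The effect of the single error can be reproduced with a few lines. As on the slide, the error is written in upper case, so a case-sensitive comparison is enough to tell erroneous 5-mers from genuine ones; the helper name `kmers` is an assumption.

```python
def kmers(text, k):
    return [text[i:i + k] for i in range(len(text) - k + 1)]

genome = "atgccgtatggacaacgact"
bad_read = "cgtaCggaca"  # true read is cgtatggaca; the t was misread as C

genome_5mers = set(kmers(genome, 5))
spurious = [kmer for kmer in kmers(bad_read, 5) if kmer not in genome_5mers]
print(spurious)
# ['cgtaC', 'gtaCg', 'taCgg', 'aCgga', 'Cggac']
```

One wrong nucleotide corrupts every k-mer that overlaps it, so a single error can add up to k spurious edges to the de Bruijn graph.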
