Bioinformatics Chapter 3
Millions of Pieces?
Graph Algorithms
Amoeba dubia
Paris japonica
Why Do We Sequence 1000s of Species?
Frederick Sanger
The Race to Sequence the Human Genome
• 1990: The public Human Genome Project,
headed by Francis Collins, aims to sequence
the human genome by 2005.
Francis Collins
• 2000:
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACG
ATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCT
ATCGATCGATCGATCGATTATCTACGATCGATCGATCGA
TCACTATACGAGCTACTACGTACGTACGATCGCGGGACT
ATTATCGACTACAGATAAAACATGCTAGTACAACAGTAT
ACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATA
TCCGAT
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACG
ATCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCT
ATCGATCGATCGATCGATTATCTACGATCGATCGATCGA
TCACTATACGAGCTACTACGTACGTACGATCGCGTGACT
ATTATCGACTACAGATGAAACATGCTAGTACAACAGTAT
ACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATA
TCCGAT
Why Do We Sequence Personal Genomes?
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGT
AGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGT
AGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGT
AGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGT
AGC
Breaking the Genomes at Random Positions
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGT
AGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGT
AGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGT
AGC
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGT
AGC
Generating “Reads”
CTGATGA TGGACTACGCTAC TACTGCTAG CTGTATTACG ATCAGCTACCACA TCGTAGCTACG ATGCATTAGCAA GCTATCGGA TCAGCTACCA CATCGTAGC
CTGATGATG GACTACGCT ACTACTGCTA GCTGTATTACG ATCAGCTACC ACATCGTAGCT ACGATGCATTA GCAAGCTATC GGATCAGCTAC CACATCGTAGC
CTGATGATGG ACTACGCTAC TACTGCTAGCT GTATTACGATC AGCTACCAC ATCGTAGCTACG ATGCATTAGCA AGCTATCGG A TCAGCTACCA CATCGTAGC
CTGATGATGGACT ACGCTACTACT GCTAGCTGTAT TACGATCAGC TACCACATCGT AGCTACGATGCA TTAGCAAGCT ATCGGATCA GCTACCACATC GTAGC
“Burning” Some Reads
CTGATGA TGGACTACGCTAC TACTGCTAG CTGTATTACG ATCAGCTACCACA TCGTAGCTACG ATGCATTAGCAA GCTATCGGA TCAGCTACCA CATCGTAGC
CTGATGATG GACTACGCT ACTACTGCTA GCTGTATTACG ATCAGCTACC ACATCGTAGCT ACGATGCATTA GCAAGCTATC GGATCAGCTAC CACATCGTAGC
CTGATGATGG ACTACGCTAC TACTGCTAGCT GTATTACGATC AGCTACCAC ATCGTAGCTACG ATGCATTAGCA AGCTATCGG A TCAGCTACCA CATCGTAGC
CTGATGATGGACT ACGCTACTACT GCTAGCTGTAT TACGATCAGC TACCACATCGT AGCTACGATGCA TTAGCAAGCT ATCGGATCA GCTACCACATC GTAGC
No Idea What Position Every Read Comes From
CTGATGATGGACT
GCTGTATTACG
GCTATCGGA
ATGCATTAGCAA
ACTACTGCTA
TACCACATCGT
CTGATGATGG
No Idea What Position Every Read Comes From
CTGATGATGGACT
GCTGTATTACG
GCTATCGGA
GCAAGCTATC ATGCATTAGCAA
ACTACTGCTA
TACCACATCGT
CTGATGATGG
From Experimental to Computational Challenges
Read generation
Reads
Genome assembly
Assembled genome
…GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC…
What Makes Genome Sequencing Difficult?
Composition3(TAATGCCATGGGATGTT)=
TAA
AAT
ATG
TGC
GCC
CCA
CAT
ATG
TGG
GGG
GGA
GAT
ATG
TGT
GTT
k-mer Composition
Composition3(TAATGCCATGGGATGTT)=
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
=
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
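The composition above can be computed in a few lines. This is a minimal sketch; the function name `composition` is chosen here for illustration and is not from the slides.

```python
def composition(text: str, k: int) -> list[str]:
    """Return all k-mers of text (repeats included) in lexicographic order."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    return sorted(kmers)


# Composition_3(TAATGCCATGGGATGTT)
print(" ".join(composition("TAATGCCATGGGATGTT", 3)))
# AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
```

Sorting discards the order in which the k-mers occur in the genome, which is exactly the information an assembler has to recover.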
A Naive String Reconstruction Approach
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TGC TGG TGT
TAA
A Naive String Reconstruction Approach
ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TGC TGG TGT
TAA
AAT
A Naive String Reconstruction Approach
ATG ATG CAT CCA GAT GCC GGA GGG GTT TGC TGG TGT
TAA
AAT
ATG
A Naive String Reconstruction Approach
ATG ATG CAT CCA GAT GCC GGA GGG GTT TGC TGG
TAA
AAT
ATG
TGT
What’s Next?
ATG ATG CAT CCA GAT GCC GGA GGG TGC TGG
TAA
AAT
ATG
TGT
GTT
Outline
TAATGCCATGGGATGTT
Nodes are arranged from left to right in lexicographic order.
What are we trying to find in this graph?
Hamiltonian Path Problem
Hamiltonian Path Problem. Find a Hamiltonian path in a graph.
• Input. A graph.
• Output. A path visiting every node in the graph exactly once.
Does This Graph Have a Hamiltonian Path?
Hamiltonian Path Problem. Find a Hamiltonian path in a graph.
Input. A graph.
Output. A path visiting every node in the graph exactly once.
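To make the problem statement concrete, here is a brute-force sketch that treats the 3-mers of TAATGCCATGGGATGTT as nodes, connects two nodes when the suffix of one equals the prefix of the other, and searches for a Hamiltonian path by backtracking. The helper names `overlap_graph` and `hamiltonian_path` are illustrative assumptions. The search is exponential in the worst case, so it is only practical for toy inputs like this one.

```python
def overlap_graph(kmers):
    """Nodes are indices into kmers (so repeated k-mers stay distinct).
    There is an edge i -> j when the suffix of kmers[i] equals the prefix of kmers[j]."""
    return {i: [j for j, b in enumerate(kmers) if i != j and a[1:] == b[:-1]]
            for i, a in enumerate(kmers)}


def hamiltonian_path(adj):
    """Backtracking search for a path that visits every node exactly once."""
    nodes = list(adj)

    def extend(path, visited):
        if len(path) == len(nodes):
            return path
        for nxt in adj[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                found = extend(path + [nxt], visited)
                if found:
                    return found
                visited.remove(nxt)
        return None

    for start in nodes:
        found = extend([start], {start})
        if found:
            return found
    return None


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
path = hamiltonian_path(overlap_graph(kmers))
# Spell the string: the first k-mer, then the last letter of each following k-mer.
genome = kmers[path[0]] + "".join(kmers[i][-1] for i in path[1:])
print(genome)
```

The printed string has the same 3-mer composition as the input, but it need not be the original genome, because more than one Hamiltonian path may exist.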
[Figure: William Hamilton and an example graph with numbered nodes]
TAATGCCATGGGATGTT
3-mers as nodes
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
3-mers as edges
3-mers as nodes
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT
Gluing Identically Labeled Nodes
[Figure sequence for TAATGCCATGGGATGTT: the three nodes labeled AT, the three nodes labeled TG, and the two nodes labeled GG are glued step by step, folding the path into a graph.]
De Bruijn Graph of TAATGCCATGGGATGTT
[Figure: the de Bruijn graph, with nodes TA, AA, AT, TG, GC, CC, CA, GG, GA, GT, TT and one edge per 3-mer]
Where is the genome hiding in this graph?
It Was Always There!
TAATGCCATGGGATGTT
[Figure: the genome traced as a path through the de Bruijn graph]
What are we trying to find in this graph?
An Eulerian path in a graph is a path that visits each edge exactly once.
Eulerian Path Problem
Eulerian Path Problem. Find an Eulerian path in a graph.
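• Input. A graph.
• Output. A path visiting every edge in the graph exactly once.
Find a difference! The Hamiltonian Path Problem asks for a path visiting every node exactly once; the Eulerian Path Problem asks for a path visiting every edge exactly once.
Outline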
[Figure: de Bruijn graph of TAATGCCATGGGATGTT]
What We Want: From Reads (k-mers) to Genome
TAATGCCATGGGATGTT
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
What We Will Show: From Reads to de Bruijn Graph to Genome
TAATGCCATGGGATGTT
[Figure: de Bruijn graph of TAATGCCATGGGATGTT]
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
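The whole pipeline from reads to graph to genome can be sketched end to end. The function name `reconstruct` is an assumption for illustration, and the code assumes the k-mers really do admit an Eulerian path (error-free reads, full coverage). The path is found with a stack-based variant of Hierholzer's algorithm.

```python
from collections import defaultdict


def reconstruct(kmers):
    """Build the de Bruijn graph of kmers, walk an Eulerian path, spell a string."""
    graph = defaultdict(list)   # (k-1)-mer -> list of successor (k-1)-mers
    indeg = defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        indeg[kmer[1:]] += 1

    # An Eulerian path starts at the node with one more outgoing than incoming edge.
    nodes = set(graph) | set(indeg)
    start = next((v for v in nodes if len(graph[v]) - indeg[v] == 1),
                 next(iter(graph)))

    # Walk until stuck, then back up and splice in detours.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
genome = reconstruct(kmers)
print(genome)
```

For this input the graph has more than one Eulerian path, so the result is one of the strings sharing this 3-mer composition, not necessarily the original genome.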
Constructing de Bruijn Graph when Genome Is Known
TAATGCCATGGGATGTT
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT
Constructing de Bruijn Graph when Genome Is Unknown
Composition3(TAATGCCATGGGATGTT)
Representing Composition as a Graph Consisting of Isolated Edges
Composition3(TAATGCCATGGGATGTT)
Constructing de Bruijn Graph from k-mer Composition
Composition3(TAATGCCATGGGATGTT)
Gluing Identically Labeled Nodes
[Figure sequence: each 3-mer is drawn as an isolated edge from its prefix to its suffix, and identically labeled nodes of neighboring edges are glued step by step until the edges form a single path.]
We Are Not Done with Gluing Yet
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT
Gluing Identically Labeled Nodes
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT
[Figure sequence: the identically labeled nodes (AT, TG, GG) are glued step by step until the path becomes the de Bruijn graph of TAATGCCATGGGATGTT.]
The Same de Bruijn Graph:
DeBruijn(Genome) = DeBruijn(Genome Composition)
[Figure: de Bruijn graph of TAATGCCATGGGATGTT]
Constructing de Bruijn Graph
De Bruijn graph of a collection of k-mers:
– Represent every k-mer as an edge between its prefix and suffix.
– Glue ALL nodes with identical labels.
DeBruijn(k-mers)
    form a node for each (k-1)-mer from k-mers
    for each k-mer in k-mers
        connect its prefix node with its suffix node by an edge
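The pseudocode above might look as follows in Python. This is a sketch with an adjacency-list representation chosen for illustration. Using (k-1)-mer strings as dictionary keys performs the gluing implicitly, since identically labeled nodes map to the same key.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Adjacency list: each k-mer becomes an edge prefix -> suffix.
    Repeated k-mers produce repeated (parallel) edges."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
graph = de_bruijn(kmers)
for node in sorted(graph):
    print(node, "->", ", ".join(graph[node]))
```

Nodes without outgoing edges (here TT) appear only as successors, not as keys.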
From Hamilton to Euler to de Bruijn
[Figure sequence: graphs on binary strings, ending with the de Bruijn graph on the nodes 00, 01, 10, 11]
De Bruijn Graph for 4-Universal String
• Input. A graph.
“Yay! Now can I go home please?”
A Less Intelligent Ant Would Randomly Choose a Node and Start Walking…
Walking… and Walking… and Walking…
An Ant Traversing Previously Constructed Cycle
Starting at a node that has an unused edge, traverse the already constructed (green) cycle and return to the starting node.
Enlarging the Previously Constructed Cycle
Stuck Again!
No Eulerian cycle yet… can we enlarge the green-blue cycle?
Starting at a New Node, Again…
Traversing the Previously Constructed Green-Blue Cycle
I Returned Back BUT… I Can Continue Walking!
“Hmm, maybe these instructions were not that stupid…”
Enlarging the Green-Blue Cycle
I Proved Euler’s Theorem! Can I Go Home Please?
EulerianCycle(BalancedGraph)
    form a Cycle by randomly walking in BalancedGraph (avoiding already visited edges)
    while Cycle is not Eulerian
        select a node newStart in Cycle with still unexplored outgoing edges
        form a Cycle’ by traversing Cycle from newStart and randomly walking afterwards
        Cycle ← Cycle’
    return Cycle
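The ant's procedure can be sketched in Python. This version is not a literal translation of EulerianCycle: it uses the common stack-based formulation of the same idea (walk until stuck, back up to a node with unused edges, splice in a detour). It assumes the input graph is balanced and strongly connected. The example graph is the de Bruijn graph on binary 2-mers, chosen here for illustration.

```python
def eulerian_cycle(graph):
    """graph: dict mapping node -> list of successors.
    Assumes every node has equal in- and out-degree and the graph is strongly connected."""
    unused = {u: list(vs) for u, vs in graph.items()}  # copy, so the input is not modified
    start = next(iter(unused))
    stack, cycle = [start], []
    while stack:
        u = stack[-1]
        if unused.get(u):
            # Keep walking along an unused edge.
            stack.append(unused[u].pop())
        else:
            # Stuck: this node is finished, add it to the cycle and back up.
            cycle.append(stack.pop())
    return cycle[::-1]


# De Bruijn graph on binary 2-mers: every binary 3-mer is one edge.
graph = {
    "00": ["00", "01"],
    "01": ["10", "11"],
    "10": ["00", "01"],
    "11": ["10", "11"],
}
cycle = eulerian_cycle(graph)
print(" -> ".join(cycle))
```

Each edge is pushed and popped once, so the running time is linear in the number of edges.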
Outline
[Figure: de Bruijn graph of TAATGCCATGGGATGTT]
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
Multiple Eulerian Paths
TAATGCCATGGGATGTT          TAATGGGATGCCATGTT
[Figure: two different Eulerian paths in the same de Bruijn graph, spelling two different genomes]
Breaking Genome into Contigs
TAATGCCATGGGATGTT
[Figure: the de Bruijn graph broken into contigs; the contig labels include TAAT, TGCCAT, TGTT, ATG, TGG, and GGG]
DNA Sequencing with Read-pairs
Multiple identical copies of genome
Generate read-pairs: two reads from the ends of each fragment (separated by a fixed distance)
[Figure labels: 200 bp, 200 bp, InsertLength]
From k-mers to Paired k-mers
Read 1 Read 2
Genome ...A T C A G A T T A C G T T C C G A G …
Distance d=11
Disclaimers:
1. In reality, Read1 and Read2 are typically sampled from different strands:
(→ ……. ← rather than → ……. →)
2. In reality, the distance d between reads is measured with errors.
What is PairedComposition(TAATGCCATGGGATGTT)?
TAA GCC
paired 3-mer
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
AAT ATG ATG CAT CCA GCC GGA GGG TAA TGC TGG
CCA CAT GAT GGA GGG TGG GTT TGT GCC ATG ATG
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
paired prefix of (CCA|GGG) → (CC|GG)
(CA|GG) ← paired suffix of (CCA|GGG)
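The paired composition and the paired prefix and suffix can be sketched as below. The function names are illustrative. The gap d = 1 is an assumption inferred from the example (TAA and GCC are one nucleotide apart in TAATGCCATGGGATGTT); the slides do not state d for this example.

```python
def paired_composition(text, k, d):
    """All (k, d)-mers of text: pairs of k-mers separated by a gap of d nucleotides."""
    span = 2 * k + d
    return [(text[i:i + k], text[i + k + d:i + span])
            for i in range(len(text) - span + 1)]


def paired_prefix(pair):
    """Drop the last letter of both reads."""
    return (pair[0][:-1], pair[1][:-1])


def paired_suffix(pair):
    """Drop the first letter of both reads."""
    return (pair[0][1:], pair[1][1:])


pairs = paired_composition("TAATGCCATGGGATGTT", k=3, d=1)
print(pairs[0])                        # ('TAA', 'GCC')
print(paired_prefix(("CCA", "GGG")))   # ('CC', 'GG')
print(paired_suffix(("CCA", "GGG")))   # ('CA', 'GG')
```

In the paired de Bruijn graph each pair becomes an edge from its paired prefix to its paired suffix, just as each k-mer becomes an edge from prefix to suffix in the ordinary graph.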
Labeling Nodes by Paired Prefixes and Suffixes
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA
GC CC CA AT TG GG GG GA AT TG GT TT
Glue nodes with identical labels
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA
GC CC CA AT TG GG GG GA AT TG GT TT
Constructing Paired de Bruijn Graph
[Figure sequence: every paired 3-mer is drawn as an isolated edge from its paired prefix to its paired suffix, and identically labeled nodes of neighboring edges are glued step by step.]
Constructing Paired de Bruijn Graph
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA
GC CC CA AT TG GG GG GA AT TG GT TT
We Are Not Done with Gluing Yet
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
TA AA AT TG GC CC CA AT TG GG GG GA
GC CC CA AT TG GG GG GA AT TG GT TT
Constructing Paired de Bruijn Graph
TG GG GG GA
AT TG GT TT
TGG GGG GGA
ATG TGT GTT
Constructing Paired de Bruijn Graph
TAATGCCATGGGATGTT TAATGCCATGGGATGTT
TAATGGGATGCCATGTT
GGA
atgccgtatggacaacgact
atgccgtatg
gccgtatgga
gtatggacaa
gacaacgact
Erroneous read (change of t into C): cgtaCggaca
5-mers of the correct reads:
atgcc tgccg gccgt ccgta cgtat gtatg tatgg atgga tggac ggaca gacaa acaac caacg aacga acgac cgact
Additional 5-mers introduced by the erroneous read:
cgtaC gtaCg taCgg aCgga Cggac
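The effect of a single sequencing error on the k-mer set can be checked directly. This is a small sketch using the reads from the slide; the helper name `kmer_set` is an assumption.

```python
def kmer_set(reads, k):
    """Set of all k-mers occurring in any of the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


reads = ["atgccgtatg", "gccgtatgga", "gtatggacaa", "gacaacgact"]
bad_read = "cgtaCggaca"  # the t at position 5 of this read was misread as C

# k-mers that appear only because of the erroneous read
extra = kmer_set([bad_read], 5) - kmer_set(reads, 5)
print(len(extra))  # 5
```

One wrong nucleotide in the middle of a read corrupts every k-mer that covers it, up to k of them, and each becomes a spurious edge in the de Bruijn graph.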