100% found this document useful (1 vote)
15 views

2024 Bioinformatics Algorithms Day 2

Uploaded by

ast56erx
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
15 views

2024 Bioinformatics Algorithms Day 2

Uploaded by

ast56erx
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 107

Bioinformatics

Algorithms
Stephan Peischl
Interfaculty Unit for Bioinformatics
Baltzerstrasse 6
CH-3012 Bern
Switzerland

Email: [email protected]
Sequence Alignment
GATTACA

insertion
GAATTACA substitution
GATTACC

substitution
TAATTACA

GAATTACA TAATTACA GATTACC

22.09.24 Bioinformatics Algorithms 2


DNA and Protein sequences
• DNA (Deoxyribonucleic acid) uses a 4-letter alphabet (A,T,C,G)
• RNA (Ribonucleic acid) also uses a 4-letter alphabet (A,U,C,G)
• Proteins use 20 amino acids (A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y)
• We will represent sequences of DNA, RNA or Proteins as strings
e.g.: ATCTGTTA, TGCCATA, ...

22.09.24 Bioinformatics Algorithms 3


Sequence evolution
GATTACA

GAATTACA
GATTACC

TAATTACA

GAATTACA TAATTACA GATTACC

22.09.24 Bioinformatics Algorithms 4


Sequence evolution
GATTACA
We know real alignment:
insertion
GAATTACA substitution GAATTACA
GATTACC TAATTACA
GA-TTACC

substitution
TAATTACA How can we align sequences
if we don’t know the exact
evolutionary history?
GAATTACA TAATTACA GATTACC

22.09.24 Bioinformatics Algorithms 5


Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA,
RNA, or protein to identify regions of similarity that may be a consequence of functional,
structural, or evolutionary relationships between the sequences.

If two sequences in an alignment share a common ancestor, mismatches can be


interpreted as point mutations and gaps as indels (that is, insertion or deletion
mutations) introduced in one or both lineages in the time since they diverged from one
another.
AAAGCGGAAGTCACAG AAGGCTGAAGTATAG

There are two different ways to align sequences:

AAAGCGGAAGTCACAG
1. Global alignment assumes that the two proteins are basically similar over the ||.||.||||| |.||
entire length of one another. The alignment attempts to match them to each other
from end to end. AAGGCTGAAGT-ATAG
2. Local alignment searches for segments of the two sequences that match well. AAAGCGGAAGTCACAG
There is no attempt to force entire sequences into an alignment, just those parts ......||||| ....
that appear to have good similarity, according to some criterion.
AAGGCTGAAGT-ATAG

22.09.24 Bioinformatics Algorithms 6


Manhattan tourist algorithm
The problem: Given a weighted grid G, travel from source (top left)
to the sink (bottom right) along the highest scoring path only travelling
south and east.

There are only finitely many possible paths and


we can solve problem by trying all of them.

But what does that have to do with sequence


alignments? (we will see later….)
22.09.24 Bioinformatics Algorithms 7
Manhattan Toursit

Trying all possible routes will


give us best solution, but as we
have ssen before, this is
probably not the best way to
tackle this problem!

22.09.24 Bioinformatics Algorithms 8


Manhattan Toursit

There are
𝑛+𝑚 𝑛+𝑚 !
=
𝑛 𝑛! 𝑚!
possible paths!

Therfore, an exhaustive search


!"# !
algorithm is O (! ! !&# !).

22.09.24 Bioinformatics Algorithms 9


Manhattan Toursit

There are
𝑛+𝑚 𝑛+𝑚 !
=
𝑛 𝑛! 𝑚!
possible paths!

Therfore, an exhaustive search


!"# !
algorithm is O (! ! !&# !).

22.09.24 Bioinformatics Algorithms 10


Manhattan Toursit

Like in the Fibonacci example,


the same sub-calculations are
made multiple times if we
exhaustively calculate all
possible paths!

22.09.24 Bioinformatics Algorithms 11


Manhattan Toursit

Like in the Fibonacci example,


the same sub-calculations are
made multiple times if we
exhaustively calculate all
possible paths!

22.09.24 Bioinformatics Algorithms 12


Manhattan Toursit

Like in the Fibonacci example,


the same sub-calculations are
made multiple times if we
exhaustively calculate all
possible paths!

The red piece is calculated


multiplie times!

22.09.24 Bioinformatics Algorithms 13


Manhattan Toursit

Like in the Fibonacci example,


the same sub-calculations are
made multiple times if we
exhaustively calculate all
possible paths!

The red piece is calculated


multiple times!

22.09.24 Bioinformatics Algorithms 14


Manhattan Toursit

In fact, the scores for many paths


are calculated multiple times.

This suggests that we can find the


highest scoring path by splitting
the problem into calculating the
highest scores for sub-paths.

22.09.24 Bioinformatics Algorithms 15


Manhattan Tourist
Let’s have a look at a more general problem:

What is the best way (i.e., highest scoring


path) to get from the source (upper left
corner) to any location on the grid?

Even though this may seem a stupid thing to do at first glance.... Instead
of one question we now have to answer n x m questions!

22.09.24 Bioinformatics Algorithms 16


Dynamic programming
• Dynamic programming describes breaking a large problem into
several smaller ones that are easier to solve
• Solving the more general problem of how we can get from the source
to any location is as easy as solving the more specific problem of
going from source to sink and...
• ... we get the solution of the specific problem en route!
• Finding the best path to a cell is easy because there are only two
possible ways to get there
• These observations outline the key ideas of dynamic programming

22.09.24 Bioinformatics Algorithms 17


Manhattan Toursit
Many paths are calculated multiple
times.

We can split the problem into


small peices by first calculating a
set of sub-paths, and then use this
information to find the optimal
global path.

How shall we do this?

22.09.24 Bioinformatics Algorithms 18


Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source

22.09.24 Bioinformatics Algorithms 19


Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
2. Next, we fill in the top row and leftmost column
with the according scores (note that there is
only onw way to get to each cell!)
3. The next cell to fill is (2,2) – there are only two
ways to get there: from west or from north.
4. We take the higher scoring path and enter the
score

22.09.24 Bioinformatics Algorithms 20


Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
2. Next, we fill in the top row and leftmost column
with the according scores (note that there is
only onw way to get to each cell!)
3. The next cell to fill is (2,2) – there are only two
ways to get there: from west or from north.
4. We take the higher scoring path and enter the
score
5. We continue like this until every cell is filled
6. At every step we record the path taken

22.09.24 Bioinformatics Algorithms 21


Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
2. Next, we fill in the top row and leftmost column
with the according scores (note that there is
only onw way to get to each cell!)
3. The next cell to fill is (2,2) – there are only two
ways to get there: from west or from north.
4. We take the higher scoring path and enter the
score
5. We continue like this until every cell is filled
6. At every step we record the path taken

22.09.24 Bioinformatics Algorithms 22


Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
2. Next, we fill in the top row and leftmost column
with the according scores (note that there is
only onw way to get to each cell!)
3. The next cell to fill is (2,2) – there are only two
ways to get there: from west or from north.
4. We take the higher scoring path and enter the
score
5. We continue like this until every cell is filled
6. At every step we record the path taken

22.09.24 Bioinformatics Algorithms 23


Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
2. Next, we fill in the top row and leftmost column
with the according scores (note that there is
only onw way to get to each cell!)
3. The next cell to fill is (2,2) – there are only two
ways to get there: from west or from north.
4. We take the higher scoring path and enter the
score
5. We continue like this until every cell is filled
6. At every step we record the path taken

22.09.24 Bioinformatics Algorithms 24


Manhattan Tourist Algorithm
Finally, there is a single (or multiple) optimal path(s) left:

22.09.24 Bioinformatics Algorithms 25


The Algorithm
SE(i,j), SS(i,j) .... Score to come to (i,j) from west or north
S .... nxm Matrix with score of highest scoring path
D ... nxm Matrix with direction of highest scoring path

S(0,0) = 0
for (i in 1:n) S(i,0) = S(i-1,0) + SS(i,0)
for(j in 1:m) S(0,j) = S(0,j-1) + SE(0,j)
For(i in 1:n)
for(j in 1:m)
S(i,j) = max(S(i-1,j) + SS(i,j), S(i,j-1) + SE(i,j))
D(i,j) = path.to(i,j)

22.09.24 Bioinformatics Algorithms 26


The Algorithm
SE(i,j), SS(i,j) .... Score to come to (i,j) from east or south
S .... nxm Matrix with score of highest scoring path
D ... nxm Matrix with direction of highest scoring path
Complexity:
S(0,0) = 0 T(n) = C1 + C2 (n + m) + C3 n m
for (i in 1:n) S(i,0) = S(i-1,0) + SS(i,0) = O(n m)
for(j in 1:m) S(0,j) = S(0,j-1) + SE(0,j)
For(i in 1:n)
for(j in 1:m)
S(i,j) = max(S(i-1,j) + SS(i,j), S(i,j-1) + SE(i,j))
D(i,j) = path.to(i,j)

22.09.24 Bioinformatics Algorithms 27


Global alignment
What does sequence alignment have to do
with the Manhattan Tourist algorithm?

AAAGCGGAAGTCACAG We can represent sequences as strings and


||.||.||||| |.|| give each alignment between two sites a score:
AAGGCTGAAGT-ATAG
w(i,j) = 1 if site i and j are equal (match)
w(i,j) = -1 if site i and j are not equal (substitution
or insertion/deletion )

22.09.24 Bioinformatics Algorithms 28


Global alignment
What does sequence alignment have to do
with the Manhattan Tourist algorithm?

AAAGCGGAAGTCACAG We can represent sequences as strings and


||.||.||||| |.|| give each alignment between two sites a score:
AAGGCTGAAGT-ATAG
w(i,j) = 1 if site i and j are equal (match)
w(i,j) = -1 if site i and j are not equal (substitution
1+1-1+1+1-1+1+1+1+1+1-1+1-1+1+1 or insertion/deletion )

22.09.24 Bioinformatics Algorithms 29


Global alignment
What does sequence alignment have to do
with the Manhattan Tourist algorithm?

AAAGCGGAAGTCACAG We can represent sequences as strings and


||.||.||||| |.|| give each alignment between two sites a score:
AAGGCTGAAGT-ATAG
w(i,j) = 1 if site i and j are equal (match)
w(i,j) = -1 if site i and j are not equal (substitution
1+1-1+1+1-1+1+1+1+1+1-1+1-1+1+1 or insertion/deletion )

22.09.24 Bioinformatics Algorithms 30


Global alignment
We now have three possible ways to from
a site to the next:

match, subsitution, or insertion/deletion

If we label a grid the letters of the two


strings, we can represent alignments as
paths through the grid.

Matches and substitutions are represented


by going diagonally to the next cell and
indels by going down.
22.09.24 Bioinformatics Algorithms 31
Global alignment
We now have three possible ways to from
a site to the next:

match, subsitution, or insertion/deletion


ATCGT-AC-
If we label a grid the letters of the two
||.||.|..
strings, we can represent alignments as
paths throughAT-GTTA-T
the grid.

Matches and substitutions are represented


by going diagonally to the next cell and
indels by going down/right.
22.09.24 Bioinformatics Algorithms 32
Global alignment
We now have:
• how to find the highest scoring path on
a grid
• how to asign scores to sequence
alignments
• How to represent sequence-alignments
on a grid
AAAGCGGAAGTCACAG
||.||.||||| |.||
AAGGCTGAAGT-ATAG

1+1-1+1+1-1+1+1+1+1+1-1+1-1+1+1
22.09.24 Bioinformatics Algorithms 33
Needleman-wunsch algorithm
w(i,j,x) .... Score to come to (i,j) via x (= match, mismatch, indel)
S .... nxm Matrix with score of highest scoring path
D ... nxm Matrix with direction of highest scoring path

S(0,0) = 0
for (i in 1:n) S(i,0) = S(i-1,0) + SS(i,0)
for(j in 1:m) S(0,j) = S(0,j-1) + SE(0,j)
For(i in 1:n)
for(j in 1:m)
S(i,j) = max(w(i,j,x))
D(i,j) = path.to(i,j)

22.09.24 Bioinformatics Algorithms 34


References
• Needlman and Wunsch (1970) - A general method applicable to the
search for similarities in the amino acid sequence of two proteins
• Online Tools, e.g.: BLAST:

22.09.24 Bioinformatics Algorithms 35


Comparing strings
To compare strings, there are two common measures:
1. Edit distance: the minimum number of operations (insertions, deletions
and substitutions) required to transform 1 string into another
AAAGCGGAAGTCACAG
||.||.|||||-|.|| Edit distance: 4
AAGGCTGAAGT-ATAG

2. Hamming distance: the number of differences when comparing the ith


value of a string against the ith value of another
AAAGCGGAAGTCACAG
||.||.|||||....- Hamming distance: 7
AAGGCTGAAGTATAG

22.09.24 Bioinformatics Algorithms 36


Which distance?
DNA sequences are subject to insertions, deletions, and substitutions.

Subsitutions can be measured by the edit distance, but because insertions and deletions occur as
well, we usually do not know which is the «ith» position.

Thus, the edit distance seems to be more appropriate for sequence alignment.

e.g.: ATATATAT and TATATATA

Edit distance = 2 Hamming distance = 7

-ATATATAT ATATATAT
-|||||||- ........
TATATATA TATATATA

22.09.24 Bioinformatics Algorithms 37


Scoring function
The simplest form of the edit distance gives equal weight to substitutions, insertions or
deletions.

This is not reflecting biological reality because not all substitutions occur with the same
probability, substitutions are more/less likely than insertions or deleteions, etc.

We therfore introduce a scoring function that assigns a score (or penality) to each «edit».

For instance, if we penalize mismatches by μ, insertions/deletions by σ, and score matches


by 1, we can calculate the edit distance for two strings by:

# matches – μ # mismatches – σ # insertions/deletions

22.09.24 Bioinformatics Algorithms 38


Example
AAAGCGGAAGTCACAG AAAG-C-GGAAGTCACAG
||.||.|||||-|.|| ||-|-|-|-||||-|.||
AAGGCTGAAGT-ATAG AA-GGCTG-AAGT-ATAG

#matches = 12 #matches = 12
#mismatches = 3 #mismatches = 1
# indels = 1 # indels = 5

distance = 12 – 3 μ – σ distance = 12 – μ – 5 σ

22.09.24 Bioinformatics Algorithms 39


Example
AAAGCGGAAGTCACAG AAAG-C-GGAAGTCACAG
Which is better||-|-|-|-||||-|.||
alignment?
||.||.|||||-|.||
AAGGCTGAAGT-ATAG d1 > d2 or AA-GGCTG-AAGT-ATAG
d1 < d2 ?
d1 = 12 – 3 μ – σ
μ = 1, σ = 1: d1 > d2 (8 > 6)
#matches = 12 #matches = 12
#mismatches = 3 μ = 2, σ = 1: d1 #mismatches
= d2 (5 = 5) =1
d2 = 12 – μ – 5 σ
# indels = 1 # indels = 5
μ = 3, σ = 1: d1 < d2 (2 < 4)
distance = 12 – 3 μ – σ distance = 12 – μ – 5 σ

22.09.24 Bioinformatics Algorithms 40


Example
AAAGCGGAAGTCACAG AAAG-C-GGAAGTCACAG
||.||.|||||-|.|| ||-|-|-|-||||-|.||
AAGGCTGAAGT-ATAG AA-GGCTG-AAGT-ATAG
How we score matches, mismatches
and indels impacts the outcome
#matches = 12
of the
#matches = 12
#mismatches = 3 sequence alignment!#mismatches = 1
# indels = 1 # indels = 5

distance = 12 – 3 μ – σ distance = 12 – μ – 5 σ

22.09.24 Bioinformatics Algorithms 41


Needleman-wunsch algorithm
w(i,j,x) .... Score to come to (i,j) via x (= match, mismatch, indel)
S .... nxm Matrix with score of highest scoring path
D ... nxm Matrix with direction of highest scoring path

S(0,0) = 0
for (i in 1:n) S(i,0) = S(i-1,0) + SS(i,0)
for(j in 1:m) S(0,j) = S(0,j-1) + SE(0,j)
For(i in 1:n)
for(j in 1:m)
S(i,j) = max(w(i,j,x))
D(i,j) = path.to(i,j)

22.09.24 Bioinformatics Algorithms 42


Example Needlman-Wunsch

TGATACCA
-|.||.||
-GCTAACA

μ=1
σ=1

22.09.24 Bioinformatics Algorithms 43


Example Needlman-Wunsch

TGATACCA
-|.||.||
-GCTAACA

μ=1
σ=1

--TGATACCA
--|-|-||-|
GCT-A-AC-A

μ=2
σ=1

22.09.24 Bioinformatics Algorithms 44


Scoring matrix
We can define a function that incorporates empirical
evidence about the frequency at which a given amino
acid/nucelotide/ etc. is subsituted by another. A C T G
A dAA dAC dAT dAG

One problem: C dCA dCC dCT dAG


To get the the empirical data to construct scoring
functions, one needs to align sequences!
T dTA dTC dTT dAG

Solution: G dGA dGC dGT dGG


If we have «good» data, we can use a naive scoring
function and incrementally improve it.

22.09.24 Bioinformatics Algorithms 45


Modifications to
Needleman-Wunsch algorithm
• So far we have assumed that indels are always of length 1
• A single indel of length n is, however, more likely than n small indels
• An indel of length n should therfore have a smaller penalty than n
indels of length 1
• We can implement this in our algorithm by changing the score
function for indels:

σ1 if new indel
w(i,j,indel) =
σ2 if extension of «open» indel

22.09.24 Bioinformatics Algorithms 46


Modifications to
Needleman-Wunsch algorithm
An alignment with two long indels might be biologically more plausible
than an alignment with fewer indels:

ACCTAACCGGTAGATAC--- ACCT-A-AC-CGGTAGATAC
||||------||||.||--- ||||-|-||-|--||||.||
ACCT------TAGAGACGAC ACCTTAGACAC--TAGAGAC

2 long indels (9 positions in total) 4 short indels (5 positions in total)


1 substitution 1 substitution

22.09.24 Bioinformatics Algorithms 47


Modifications to
Needleman-Wunsch algorithm
• It can also make sense to allow long indels
at beginning and end of alignment:

---AAGCATAGCCAT
---||||||||----
CTGAAGCATAG----

• This can be achieved by simply set the first


row and first column to zero in the
initalization step of the NW algorithm

22.09.24 Bioinformatics Algorithms 48


Modifications to
Needleman-Wunsch algorithm
• It can also make sense to allow long indels
at beginning and end of alignment:

---AAGCATAGCCAT
---||||||||----
CTGAAGCATAG----

• This can be achieved by simply set the first


row and first column to zero in the
initalization step of the NW algorithm

22.09.24 Bioinformatics Algorithms 49


Modifications to
Needleman-Wunsch algorithm
• A simple heuristic approximation for
the NW algorithm can be obtained by
assuming that the best alignmnet
should be close to the diagonal of the
scoring matrix
• We can therfore improve space and
time complexity considerably by
considering only a «band» along the
main diagonal

22.09.24 Bioinformatics Algorithms 50


Complexity of NW algorithm
• It is easy to see from the pseudo-code that NW algorithm has
• space complexity O(n m) (store nxm matrix) and
• time complexity O(n m) (calculate entries of nxm matrix)
• We can reduce the space complexity to O(n) at the cost of doubling
the computation time
• We need to store only 2 n entries to calculate the next column of
entries
• But we need to backtrack the highest scoring path after the calculation:
this requires to store a nxm matrix = O(nm)
• How can we calculate alignment in O(n) space?

22.09.24 Bioinformatics Algorithms 51


From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time

Step 1)

It is easy to calculate just the scores of paths


in linear space because we only need to store
at most 2n elements to calculate the next row
of entries.

22.09.24 Bioinformatics Algorithms 52


From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time
If (U,V) = NW(X,Y) is the alignment of sequences X and Y then we can
see the following:

Let Xl and Xr be a partition of X than there exists a partitioning Yl and


Yr such that NW(X,Y) = NW(Xl,Yl) + NW(Xr,Yr)
ACCT-A-AC-CGGTAGATAC ACCT-A-AC-C GGTAGATAC
||||-|-||-|--||||.|| ||||-|-||-| --||||.||
ACCTTAGACAC--TAGAGAC ACCTTAGACAC --TAGAGAC

22.09.24 Bioinformatics Algorithms 53


From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time
If (U,V) = NW(X,Y) is the alignment of sequences X and Y then we can see
the following:

Let Xl and Xr be a partition of X than there exists a partitioning Yl and Yr


such that NW(X,Y) = NW(Xl,Yl) + NW(Xr,Yr)

ACCT-A-AC-CGGTAGATAC ACCT-A-AC-C GGTAGATAC


||||-|-||-|--||||.|| ||||-|-||-| --||||.||
ACCTTAGACAC--TAGAGAC ACCTTAGACAC --TAGAGAC

X = Xl + Xr = ACCTAACC + GGTAGATAC, Y = Yl + Yr = ACCTTAGACAC+ TAGAGAC

22.09.24 Bioinformatics Algorithms 54


From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time

We start by finding the middle, i.e.


the point on the middle column
of the matrix that divides the highest scoring
path into two sub-paths.

This can be done in O(n) space and O(n m/2)


time.

Note that we only need to find the cell with


the highest score and note the whole path!

That‘s why we can do it in O(n) space.

22.09.24 Bioinformatics Algorithms 55


From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time

When we now the location of the «middle»,


we can simply repeat this process for the
remaining upper and lower sub-matrices to get
the location of next «middle» of the alignment
path.

This can be done in time proportional to the


area of the submatrices (gray rectangles).

22.09.24 Bioinformatics Algorithms 56


From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time

We recursively repeat this until we have


reconstructed the whole sequence alignment, i.e.,
until we have found the «middle» for all columns.

At each step, this can be done in time


proportional to the area of the submatrices.

22.09.24 Bioinformatics Algorithms 57


From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time

We recursively repeat this until we have


reconstructed the whole sequence alignment, i.e.,
until we have found the «middle» for all columns.

At each step, this can be done in time proportional


to the area of the submatrices.

Total time =
nm + nm/2 + nm/4 + nm/8 + ... ≤ 2 nm = O(n m)
......
Total space = O(n)

22.09.24 Bioinformatics Algorithms 58


From O(n2) -> O(n) (in terms of memory)
by doubling the computation time
• This is an example of a «divide and conquer» algorithm:

A divide and conquer algorithm works by recursively breaking down a


problem into two or more sub-problems of the same (or related)
type, until these become simple enough to be solved directly. The
solutions to the sub-problems are then combined to give a solution
to the original problem.

22.09.24 Bioinformatics Algorithms 59


Implementation in R
• Open the neddleman wunsch R script from ilias
• Run the code
• Try some examples: Score for match

Penalty for
plotNeedlemanWunsch("GCTAACA","TGATACCA",1,1,1) insertion/deletion

Sequence 1 Sequence 2
Penalty for mismatch

22.09.24 Bioinformatics Algorithms 60


Needleman−Wunsch

Implementation in R match = 1

T
mismatch = −1

G A T A
gap = −1

C C A

0 −1 −2 −3 −4 −5 −6 −7 −8

A −1 −1 −2 −1 −2 −3 −4 −5 −6

plotNeedlemanWunsch T −2 0 −1 −2 0 −1 −2 −3 −4

("ATTCCAATA","TGATACCA",1,1,1) T −3 −1 −1 −2 −1 −1 −2 −3 −4

C −4 −2 −2 −2 −2 −2 0 −1 −2
--ATTCCAATA
--||.|||--- C −5 −3 −3 −3 −3 −3 −1 1 0
TGATACCA
A −6 −4 −4 −2 −3 −2 −2 0 2

A −7 −5 −5 −3 −3 −2 −3 −1 1
--ATTCCAATA
--||.||---| T −8 −6 −6 −4 −2 −3 −3 −2 0
TGATACC---A
A −9 −7 −7 −5 −3 −1 −2 −3 −1

22.09.24 Bioinformatics Algorithms 61


Needleman−Wunsch

Implementation in R match = 1

T
mismatch = −1

G A T A
gap = −5

C C A

0 −5 −10 −15 −20 −25 −30 −35 −40

A −5 −1 −6 −9 −14 −19 −24 −29 −34

plotNeedlemanWunsch T −10 −4 −2 −7 −8 −13 −18 −23 −28

("ATTCCAATA","TGATACCA",1,1,5) T −15 −9 −5 −3 −6 −9 −14 −19 −24

C −20 −14 −10 −6 −4 −7 −8 −13 −18

ATTCCAATA
C −25 −19 −15 −11 −7 −5 −6 −7 −12
-|...|..|
-TGATACCA A −30 −24 −20 −14 −12 −6 −6 −7 −6

A −35 −29 −25 −19 −15 −11 −7 −7 −6

T −40 −34 −30 −24 −18 −16 −12 −8 −8

A −45 −39 −35 −29 −23 −17 −17 −13 −7

22.09.24 Bioinformatics Algorithms 62


Scores for amino acids
• The scoring matrices for DNA sequence comparison are usually
defined only by a few parameters
• Random mutations of the nucleotide sequence within a gene may
change the amino acid sequence of the corresponding protein.
• Some of these mutations do not drastically alter the protein’s
structure, but others do and impair the protein’s ability to function.
• The common matrices for protein sequence comparison are point
accepted mutations (PAM) and block substitution (BLOSUM)
• They reflect the frequency with which amino acid x replaces amino
acid y in evolutionarily related sequences

22.09.24 Bioinformatics Algorithms 63


PAM scoring matrices
PAM = Point (or Percentage) Accepted Mutations.
(Dayhoff, Schwartz, and Orcutt, 1978).

Proteins that have, on average, only one mutation per 100 amino acids are
defined as being one PAM unit diverged.

Roughly speaking: 1 PAM is the amount of time in which an “average” protein


mutates 1% of its amino acids.

Accepted mutations are the ones that are able to spread in the population
(typically mutations that do not disrupt the protein function or even increase
fitness).

22.09.24 Bioinformatics Algorithms 64


PAM scoring matrices
1) Given a set of base alignments, define f(i, j) as the total number of times amino acids i and j are aligned
against each other, divided by the total number of aligned positions.
2) We also define g(i, j) as f(i,j)/f(i) , where f(i) is the frequency of amino acid i in all proteins (or seqeunces)
from the data set. g(i, j) defines the probability that an amino acid i mutates into amino acid j within 1
PAM unit.
3) The (i, j) entry of the PAM1 matrix is defined as
log (f(i,j)/ (f(i)·f(j))) = log (g(i,j)/f(j))
Note that f(i) f(j) is the frequency of alignment i against j by chance.
4) The PAM n matrix can be defined as the result of applying the PAM 1 matrix n times.
5) Multiplying g by itself n times gives the probability that amino acid i mutates into amino acid j during n
PAM units.
6) The (i, j) entry of the PAM n matrix is then defined as log(gn(i,j)/f(j)).

22.09.24 Bioinformatics Algorithms 65


PAM matrix
• Usually one starts with sequences that are easy to align because they
are not strongly diverged (e.g., human-chimp)
• Align with naive scoring function
• Pick subset that satisfy the condition that 1% of the sequence is
mutated
• Calculate PAM 1 matrix
• Calculate PAM n matrix

22.09.24 Bioinformatics Algorithms 66


Multiple alignments
• So far we have only seen how to align 2 sequences
• What if we want to align 3 or more sequences?
• Mulitple pariwise alignment:

22.09.24 Bioinformatics Algorithms 67


Multiple pairwise alignment
• In this case, the pairwise alignments are compatible:

22.09.24 Bioinformatics Algorithms 68


Incompatible Alignments
• This is not always the case:

22.09.24 Bioinformatics Algorithms 69


Multiple alignment: three sequences
We can represent all alignments
of three sequences as paths on a
3-dimensional grid.

We now have more directions to move:


- match in 2 sequences, mismatch in 1 (2x)
- match in 2 sequences, indel in 1 (2x)
- match in 3 sequences
- Mismatch in all sequences
- indel in 2 sequences (2x)

22.09.24 Bioinformatics Algorithms 70


Multiple alignment: three sequences
We can represent all alignments
of three sequences as paths on a
3-dimensional grid.

We now have more directions to move:


- match in 2 sequences, mismatch in 1 (2x)
- match in 2 sequences, indel in 1 (2x)
- match in 3 sequences
- Mismatch in all sequences
- indel in 2 sequences (2x)

22.09.24 Bioinformatics Algorithms 71


Multiple alignment algorithm
• Needleman-Wunsch algorithm can be extend to arbitrarily many
sequences
• Two modifications:
1. Calculate highest score over all possible “directions”
2. Store path-pointers in a k-dimensional array instead of a matrix
• Idea of algorithm remains the same
• Time and space complexity increases with number of sequences

22.09.24 Bioinformatics Algorithms 72


Complexity
Unforutnately, time-complexity for k sequences is O(nk)
Space-complexity is also O(nk)
Multiple heuristics have been proposed
• Compute all possible pairwise alignments and find best multiple alignment
Sometimes difficult because of incompatible aligmnents
still demanding: (k 2) possible pairwise alginments
• Align sequences iterativels: start with strong pairwise alignment and join one
sequence after another
• Iterative procedures work well if sequences are not too different from each
other

22.09.24 Bioinformatics Algorithms 73


Sequence alignment
1. Global alignment assumes that the two proteins are basically
similar over the entire length of one another. The alignment
attempts to match them to each other from end to end.
AAAGCGGAAGTCACAG
||.||.||||| |.||
AAGGCTGAAGT-ATAG

1. Local alignment searches for segments of the two sequences


that match well. There is no attempt to force entire sequences
into an alignment, just those parts that appear to have good
similarity, according to some criterion.
AAAGCGGAAGTCACAG
AAAGCGGAAGTCACAG AAGGCTGAAGTATAG
......||||| ....
AAGGCTGAAGT-ATAG
22.09.24 Bioinformatics Algorithms 74
Local alignments
• In many biological applications, the score of an alignment between
two substrings of v and w might actually be larger than the score of
an alignment between the entireties of v and w.
• For example, homeobox genes, which regulate embryonic
development, are present in a large variety of species. Although
homeobox genes are very different in different species, one region in
each gene—called the homeodomain— is highly conserved.
• How to find this conserved area and ignore the areas that show little
similarity?

22.09.24 Bioinformatics Algorithms 75


Local alignments
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
--|--||-|--||--|-|-|-|||----||-|--|-|--|-||||---|
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C

versus

tccCAGTTATGTCAGgggacacgagcatgcagagac
||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

How can we represent such a local alignment on a grid?

22.09.24 Bioinformatics Algorithms 76


--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-G
| || | || | | | ||| || | | | | ||||
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT-

Local alignments tccCAGTTATGTCAGgggacacgagcatgcaga


||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

A A T T G C CG C C G T CG T T T T C A G C A G T T A T G T C A G A T C
T
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC C
C
C
--|--||-|--||--|-|-|-|||----||-|--|-|--|-||||---| A
G
T
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C T
A Global Local
T
G
T
C
A
versus G
G
G
G
A
C
tccCAGTTATGTCAGgggacacgagcatgcagagac A
C
G
|||||||||||| A
G
C
A
aattgccgccgtcgttttcagCAGTTATGTCAGatc T
G
C
A
G
A
G
A
C

22.09.24 Bioinformatics Algorithms 77


Figure 6.16 (a) Global and (b) local alignments of two hypothetical genes th
have a conserved domain. The local alignment has a much worse score accor
Local alignment
• Smith-Waterman algorithm (1981)
6.8 Local Sequence Alignment
• Introduces “free jumps” to arbitrary cells into Needleman-Wunsch
algorithm
• This can be represented by edges of weight 0
(here shown with dashed lines) from the
source (0, 0) to every other cell
on the grid.

Figure 6.17 The Smith-Waterman local alignment algorithm introdu


weight 0 (here shown with dashed lines) from the source vertex (0, 0) t
22.09.24 Bioinformatics Algorithms 78
vertex in the edit graph.
Smith-Waterman algorithm
w(i,j,x) .... Score to come to (i,j) via x (= match, mismatch, indel)
S .... nxm Matrix with score of highest scoring path
D ... nxm Matrix with direction of highest scoring path

S(0,0) = 0
for (i in 1:n) S(i,0) = 0
for(j in 1:m) S(0,j) = 0
for(i in 1:n)
for(j in 1:m)
S(i,j) = max(w(i,j,x),0)
D(i,j) = path.to(i,j)

22.09.24 Bioinformatics Algorithms 79


Smith-Waterman
How to find the best local alignment?

1. Find largest entry in matrix

22.09.24 Bioinformatics Algorithms 80


Smith-Waterman
How to find the best local alignment?

1. Find largest entry in matrix


2. Backtrack until cell with entry 0

22.09.24 Bioinformatics Algorithms 81


Smith-Waterman
How to find the best local alignment?

1. Find largest entry in matrix


2. Backtrack until cell with entry 0

It is also possible to find multiple


local alignments

22.09.24 Bioinformatics Algorithms 82


Mutiple
non-overlapping
local alignments:

22.09.24 Bioinformatics Algorithms 83


Mapping and re-sequencing
Alignemnt is also called mapping if you align (map) sequences against an
existing genome

This is an essential step in re-sequencing

If we have a reference genome of an organism, we can align new


sequences against this reference and find sites with genetic differences
between indivudals

Because genomes are quite big, mapping can be quite tedious


22.09.24 Bioinformatics Algorithms 84
NGS data
(next generation sequencing)
NGS generally produces shorter reads than Sanger sequencing (200 vs
1000 bp), which poses some challenges:

• The reads don’t come with position information


• Reference sequence can be quite long (3 billion bp for humans)
• Alignment is not perfect: we need to allow for small variations
• Reads have sequencing-errors
• Short reads are more likely to fit into several positions
• We need to map millions of reads
22.09.24 Bioinformatics Algorithms 85
Some tools
• BLAST (Basic Local Alignment Search Tool)
one of the most commonly used tools in BioInformatics
aligns sequences against existing databases
BLAST can be used for identifying species, locating domains, establishing
phylogeny, DNA mapping, and comparison
• Bowtie
Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short
DNA sequences (reads) to the human genome at a rate of over 25 million
35-bp reads per hour. Bowtie indexes the genome with a Burrows-
Wheeler index to keep its memory footprint small: typically about 2.2 GB
for the human genome (2.9 GB for paired-end).
• TopHat
Similar to BowTie but for Transcriptome data

22.09.24 Bioinformatics Algorithms 86


Excercises
• Take the NW-algorithm implementation and try to see how the
outcome changes when you change parameters
• Change the scoring function:
1. smaller penalty for longer indels
2. Include a scoring matrix (different penalities depending on the type of
substitution, e.g., C -> T is more costly than G -> T, etc.)
• Adapt NW algorithm for local alignments

22.09.24 Bioinformatics Algorithms 87


References
• Needleman and Wunsch - A general method applicable to the search
for similarities in the amino acid sequence of two proteins (1970)
• Smith and Waterman - Identification of common molecular
subsequences (1981)
• Chapter 6 in “An Introduction to Bioinformatics Algorithms” – Jones
and Pevzner

22.09.24 Bioinformatics Algorithms 88


Exhaustive search and search trees

22.09.24 Bioinformatics Algorithms 89


Exhaustive search algorithms
• NW and SW are algorithms where we can find solution not by
exhaustive search, but by smart algorithms (dynamic programming)
• Exhaustive search/brute force usually not good but first step towards
better algorithm design
• Sometimes exhaustive search is only option
• Can we find smart ways to ignore subsets of (wrong) solutions?

22.09.24 Bioinformatics Algorithms 90


Restriction sites
Hamilton Smith discovered in 1970 that the restriction enzyme HindII
cleaves DNA molecules at every occurrence, or site, of the sequences
GTGCAC or GTTAAC, breaking a long molecule into a set of restriction
fragments.
Shortly thereafter, maps of restriction sites in DNA molecules, or
restriction maps, became powerful research tools in molecular biology by
helping to narrow the location of certain genetic markers.

• Restiriciton sites are locations on a chromosome that are recognized


by restriction enzymes
• a particular restriction enzyme may cut the sequence between two
nucleotides within its recognition site, or somewhere nearby.
• E.g., the common restriction enzyme EcoRI recognizes the palindromic
sequence GAATTC and cuts between the G and the A on both the top
and bottom strands
• This can then be used to ligate in complementary pieces of DNA

22.09.24 Bioinformatics Algorithms 91


Restriction map

• A restriction map is a map of known


restriction sites within a sequence of
DNA.
• Restriction mapping requires the use of
restriction enzymes.
• In molecular biology, restriction maps are
used as a reference to engineer plasmids
or other relatively short pieces of DNA,
and sometimes for longer genomic DNA.

22.09.24 Bioinformatics Algorithms http://dx.doi.org/10.1590/S0074-02762007005000121


92
Restriction mapping
• If the genomic DNA sequence of an organism is known, then
construction of a restriction map for, say, HindII amounts to finding all
occurrences of GTGCAC and GTTAAC in the genome.
• However, not all restriction sites are known and
• sequencing is not always the best option if one wants to find a
restriction map
• Also: 25 years ago sequencing was not an option!
• How can we build restriction maps for genomes without prior
knowledge of the genomes’ DNA sequence?

22.09.24 Bioinformatics Algorithms 93


Electrophoresis
1. Aliquots of purified plasmid DNA for each digest
2. Digestion with enzyme
3. Samples are run on an electrophoresis gel
4. We can determine the lengths of DNA fragments from
band-pattern

Electrophoresis gives us number and


length of DNA fragments. How can we
infer restriciton map from this
information?
http://bio1151.nicerweb.com/Locked/media/ch20/electrophoresis.html
22.09.24 Bioinformatics Algorithms 94
Some notation
A multiset is a set that allows duplicate elements: e.g., {1,2,2,3,4,4,6}

Let X = {x1 = 0,x2, .... , xn} a set of n elements in increasing order

We denote by ΔX the multiset of all pairwise distances between


elements of X:
ΔX = {xj − xi : 1 ≤ i < j ≤ n} .

22.09.24 Bioinformatics Algorithms 95


Example
For example, if X={0, 2, 4, 7, 10}, then ΔX={2, 2, 3, 3, 4, 5, 6, 7, 8, 10}

22.09.24 Bioinformatics Algorithms 96


The problem

Given all pairwise differences between points on a


line, reconstruct the position of all those points.

!
Input: The multiset of pairwise distances L, containing "
integers.

Output: A set X of n integers such that ΔX = L.

22.09.24 Bioinformatics Algorithms 97


Restriction mapping: brute force
!
Input: The multiset of pairwise distances L, containing "
integers.
Output: A set X of n integers such that ΔX = L.

M = max(L)
for (every set of integers 0 < x2 < .. < xn-1 < M)
X = {0, x2, ...., xn-1,M}
ΔX = get.pairwise.distances(X)
if (ΔX == L)
return(X)
return(«no solution»)

22.09.24 Bioinformatics Algorithms 98


Complexity
• The algorithm is slow since it examines #$% !
different sets of
positions
• This requires about O(M(n−2)) time
• This makes the algorithm unpractical for most applications
• One could improve the algorithm by choosing the elements of the
sets more wisely (e.g., choosing n-2 distinct elements from L rather
than n-2 arbitrary integers)
• This reduces time-complexity to O(n2n-4)
(consider L = {2,998,1000}, n = 3, M = 1000)

22.09.24 Bioinformatics Algorithms 99


A practial algorithm (Skiena, 1990)
Idea:
1. Find largest distance in L (determines the two outermost points of X)
2. Find second-largest distance δ
3. two options to place δ:
left or right 0 δ(?) δ(?) max(X)
4. Pick left and add δ to preliminiary set
5. calculated all pairwise distances 0 δ(?) δ(?) xn-1 max(X)
6. check if they are contained in L
7. If yes: remove δ from L and go to 2.
8. If not: go back and pick other direction 0 x2 δ(?) δ(?) xn-1 max(X)
9. If L is empty, we have found a valid solution

22.09.24 Bioinformatics Algorithms 100


Example
L = {2,2,3,3,4,5,6,7,8,10} x3 = 3 is not possible because x3 – x2 = 1 is not in L

! Therfore x4 must be 7
L has 10 elements and therefore n = 5 ( "
= 10)

10 is the largest distance in L, so x1 = 0 and x5 = 10 X = {0,2,7,10} and L = {2,3,4,6}

X = {0,10} and L ={2,2,3,3,4,5,6,7,8} Largest remaining distance is 6

The largest remaining distance is 8. Two choices: x3 = 4 or x3 = 6

Set x2 = 2 and remove distances 8 and 2. (x2 - x1, x5 – x2) x3 = 6 does not work, so x3 must be 4

We get X = {0,2,10} and L = {2,3,3,4,5,6,7} We get X = {0,2,4,7,10} which is a valid solution

The largest remaining distance is 7 -> either x4 = 7 or x3 = 3

22.09.24 Bioinformatics Algorithms 101


Algorithm
get.rest.map = find.next = function(L,X,m)
If L is empty: X is solution, return
function(L,X)
y = max(L)
m = max(L) If Δ(y,X) is subset of L
L = L without m add y to X and remove Δ(y,X) from L
find.next(L,X,m)
X = {0,m} remove y from X and add Δ(y,X) to L
find.next(L,X,m) if Δ(m-y,X) is subset of L
add m-y to X and remove Δ(y,X) from L
find.next(L,X,m)
remove m – y from X and add Δ(m-y,X) to L

22.09.24 Bioinformatics Algorithms 102


Recursive tree
At each step we can go left or right,
which creates a recursive-tree like this:

If we find that going, say,


left doesn’t yield a viable
solution, we can ignore
whole subtree!
……

……

22.09.24 Bioinformatics Algorithms 103


Some notes
After each recursion we undo the modifications to sets X and L for the next recursive call

This algorithm will list all sets X with Δ(X) = L

This algorithm is usually very fast because we do not go deep into the «recursive tree»

However, there exists pathological examples where we explore 2k possible paths (left and right are
always viable paths)

So strictly speaking, this algorithm has exponential time-complexity, but is much faster in most cases

A polynomial-time algorithm was found in 2002 by Nivat et al.

22.09.24 Bioinformatics Algorithms 104


Search trees
As we have seen, trees can be useful structures to “organize”
information.

We can exploit the tree structure to efficiently chose whether a subset


of all possible combinations is “uninteresting”

22.09.24 Bioinformatics Algorithms 105


Example: Binary search tree
• One node is designated the root of the tree.
• Each node contains a key and has (at most) two subtrees.
• Each subtree is itself a binary search tree.
• The left subtree of a node contains only nodes with keys strictly less
than the node's key.
• The right subtree of a node contains only nodes with keys strictly
greater than the node's key.

22.09.24 Bioinformatics Algorithms 106


Example: Binary tree

See R-script: “tree.r” on ilias

22.09.24 Bioinformatics Algorithms 107

You might also like