2024 Bioinformatics Algorithms Day 2
Algorithms
Stephan Peischl
Interfaculty Unit for Bioinformatics
Baltzerstrasse 6
CH-3012 Bern
Switzerland
Email: [email protected]
Sequence Alignment
[Figure: evolutionary tree relating GATTACA, GAATTACA, GATTACC and TAATTACA through one insertion and two substitutions.]
How can we align sequences if we don’t know the exact evolutionary history?
GAATTACA TAATTACA GATTACC
1. Global alignment assumes that the two sequences are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end.
AAAGCGGAAGTCACAG
||.||.||||| |.||
AAGGCTGAAGT-ATAG
2. Local alignment searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion.
AAAGCGGAAGTCACAG
......||||| ....
AAGGCTGAAGT-ATAG
There are
$$\binom{n+m}{n} = \frac{(n+m)!}{n!\,m!}$$
possible paths!
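To get a feeling for how quickly this number grows, here is a minimal Python sketch. The function name `count_grid_paths` is made up for illustration and is not from the slides. It assumes that a path consists of exactly n steps in one direction and m steps in the other.

```python
from math import comb, factorial

def count_grid_paths(n: int, m: int) -> int:
    """Number of paths made of n steps in one direction and m in the other."""
    return comb(n + m, n)

# The closed form (n+m)! / (n! m!) gives the same value.
assert count_grid_paths(3, 2) == factorial(5) // (factorial(3) * factorial(2))

# Print the count for a few square grids to see the growth.
for size in (5, 10, 20):
    print(size, count_grid_paths(size, size))
```

Enumerating all paths is therefore hopeless for sequences of realistic length.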
At first glance this may seem a stupid thing to do: instead of one question we now have to answer n x m questions!
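S(0,0) = 0
for (i in 1:n) S(i,0) = S(i-1,0) + SS(i,0)   # first column: reachable only from above
for (j in 1:m) S(0,j) = S(0,j-1) + SE(0,j)   # first row: reachable only from the left
for (i in 1:n)
  for (j in 1:m)
    S(i,j) = max(S(i-1,j) + SS(i,j), S(i,j-1) + SE(i,j))
    D(i,j) = path.to(i,j)                    # remember which move gave the maximum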
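The recursion above can be sketched in Python as follows. This is an illustration under assumptions, not the course implementation: `down[i][j]` plays the role of SS(i,j) (weight of the step from (i-1,j) to (i,j)), `right[i][j]` plays the role of SE(i,j) (weight of the step from (i,j-1) to (i,j)), and both are stored as (n+1) x (m+1) lists whose unused entries are ignored. The example weights are invented.

```python
def best_grid_path(down, right):
    """Highest-scoring path from (0,0) to (n,m) using only 'down' and 'right' steps."""
    n, m = len(down) - 1, len(down[0]) - 1
    S = [[0] * (m + 1) for _ in range(n + 1)]     # best score to reach each node
    D = [[None] * (m + 1) for _ in range(n + 1)]  # move used to reach each node

    for i in range(1, n + 1):                     # first column
        S[i][0], D[i][0] = S[i - 1][0] + down[i][0], "down"
    for j in range(1, m + 1):                     # first row
        S[0][j], D[0][j] = S[0][j - 1] + right[0][j], "right"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            from_up = S[i - 1][j] + down[i][j]
            from_left = S[i][j - 1] + right[i][j]
            if from_up >= from_left:
                S[i][j], D[i][j] = from_up, "down"
            else:
                S[i][j], D[i][j] = from_left, "right"

    # Follow the stored directions back from (n,m) to (0,0).
    path, i, j = [], n, m
    while D[i][j] is not None:
        path.append(D[i][j])
        if D[i][j] == "down":
            i -= 1
        else:
            j -= 1
    return S[n][m], path[::-1]


# Invented weights for a 2 x 2 grid (row 0 of `down` and column 0 of `right` are unused).
down = [[0, 0, 0], [1, 4, 2], [3, 1, 5]]
right = [[0, 2, 3], [0, 1, 6], [0, 2, 1]]
print(best_grid_path(down, right))
```

Each of the n x m entries is computed once from two already known neighbours, so the work is proportional to n x m rather than to the number of paths.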
1+1-1+1+1-1+1+1+1+1+1-1+1-1+1+1 = 8
22.09.24 Bioinformatics Algorithms 33
Needleman-Wunsch algorithm
w(i,j,x) ... score to come to (i,j) via x (x = match, mismatch, or indel)
S ... (n+1) x (m+1) matrix with the score of the highest-scoring path
D ... (n+1) x (m+1) matrix with the direction of the highest-scoring path
S(0,0) = 0
for (i in 1:n) S(i,0) = S(i-1,0) + SS(i,0)
for (j in 1:m) S(0,j) = S(0,j-1) + SE(0,j)
for (i in 1:n)
  for (j in 1:m)
    S(i,j) = max(w(i,j,x))    # maximum over x = match, mismatch, indel
    D(i,j) = path.to(i,j)     # remember which x achieved the maximum
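A minimal Python sketch of this scheme is given below. It is not the course's R implementation. It assumes a score of +1 per match and constant penalties `mu` (mismatch) and `sigma` (indel), also for the first row and column, and it breaks ties in favour of the diagonal move.

```python
def needleman_wunsch(a, b, match=1, mu=1, sigma=1):
    """Global alignment of a and b; returns (score, aligned a, aligned b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]     # scores
    D = [[None] * (m + 1) for _ in range(n + 1)]  # directions

    for i in range(1, n + 1):                     # first column: indels only
        S[i][0], D[i][0] = S[i - 1][0] - sigma, "up"
    for j in range(1, m + 1):                     # first row: indels only
        S[0][j], D[0][j] = S[0][j - 1] - sigma, "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else -mu)
            candidates = [(diag, "diag"),
                          (S[i - 1][j] - sigma, "up"),
                          (S[i][j - 1] - sigma, "left")]
            S[i][j], D[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from the bottom-right corner to (0,0).
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        move = D[i][j]
        if move == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ATTCCAATA", "TGATACCA")
print(score)
print(top)
print(bottom)
```

Several alignments can share the optimal score, so the alignment printed here depends on the tie-breaking rule and need not be the only optimal one.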
Substitutions can be measured by the Hamming distance, which compares the i-th position of one sequence with the i-th position of the other. But because insertions and deletions occur as well, we usually do not know which is the «i-th» position.
Thus, the edit distance seems to be more appropriate for sequence alignment.
-ATATATAT     ATATATAT
-|||||||-     ........
TATATATA-     TATATATA
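The difference between comparing position by position (Hamming distance) and allowing insertions and deletions (edit distance) can be checked on the example above. The following is a minimal Python sketch. The function names are chosen for illustration, and the edit distance counts every substitution, insertion and deletion as one edit.

```python
def hamming_distance(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))            # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # match or substitution
        prev = curr
    return prev[-1]


print(hamming_distance("ATATATAT", "TATATATA"))  # 8: every position differs
print(edit_distance("ATATATAT", "TATATATA"))     # 2: one deletion plus one insertion
```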
The plain edit distance does not reflect biological reality, because not all substitutions occur with the same probability, substitutions are more or less likely than insertions or deletions, etc.
We therefore introduce a scoring function that assigns a score (or penalty) to each «edit».
               First alignment    Second alignment
#matches       12                 12
#mismatches    3                  1
#indels        1                  5
distance       12 – 3μ – σ        12 – μ – 5σ
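The counting above can be reproduced with a few lines of Python. This is a sketch under the assumption that each match scores +1, each mismatch costs μ (`mu`) and each indel column costs σ (`sigma`). The function name is made up, and the example is the global alignment of AAAGCGGAAGTCACAG and AAGGCTGAAGT-ATAG shown earlier.

```python
def alignment_score(top, bottom, mu, sigma, match=1):
    """Return (score, (#matches, #mismatches, #indels)) of a given alignment."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    matches = mismatches = indels = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            indels += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    score = match * matches - mu * mismatches - sigma * indels
    return score, (matches, mismatches, indels)


# 12 matches, 3 mismatches, 1 indel: 12 - 3*mu - sigma
print(alignment_score("AAAGCGGAAGTCACAG", "AAGGCTGAAGT-ATAG", mu=1, sigma=1))
```

Which of two alignments scores higher depends on the choice of μ and σ: a large σ favours the alignment with fewer indels, a large μ the one with fewer mismatches.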
TGATACCA
-|.||.||
-GCTAACA
μ=1
σ=1
--TGATACCA
--|-|-||-|
GCT-A-AC-A
μ=2
σ=1
$$w(i,j,\text{indel}) = \begin{cases} \sigma_1 & \text{if new indel} \\ \sigma_2 & \text{if extension of «open» indel} \end{cases}$$
ACCTAACCGGTAGATAC--- ACCT-A-AC-CGGTAGATAC
||||------||||.||--- ||||-|-||-|--||||.||
ACCT------TAGAGACGAC ACCTTAGACAC--TAGAGAC
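To see how the two penalties change the ranking of the two alignments above, the following Python sketch scores a given alignment with separate penalties for opening and extending an indel. The function name and the numeric values (σ1 = 10, σ2 = 1, μ = 1, match = +1) are assumptions chosen for illustration, not values from the slides.

```python
def affine_alignment_score(top, bottom, mu, sigma_open, sigma_extend, match=1):
    """Score an alignment where a new indel costs sigma_open and each
    further column of the same indel costs sigma_extend."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    prev_gap = None                    # row that had a gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_row = "top" if x == "-" else "bottom"
            score -= sigma_extend if gap_row == prev_gap else sigma_open
            prev_gap = gap_row
        else:
            score += match if x == y else -mu
            prev_gap = None
    return score


few_long_gaps = ("ACCTAACCGGTAGATAC---", "ACCT------TAGAGACGAC")
many_short_gaps = ("ACCT-A-AC-CGGTAGATAC", "ACCTTAGACAC--TAGAGAC")

# Same cost for every indel column (sigma1 = sigma2 = 1).
print(affine_alignment_score(*few_long_gaps, mu=1, sigma_open=1, sigma_extend=1))
print(affine_alignment_score(*many_short_gaps, mu=1, sigma_open=1, sigma_extend=1))

# Opening an indel is expensive, extending it is cheap.
print(affine_alignment_score(*few_long_gaps, mu=1, sigma_open=10, sigma_extend=1))
print(affine_alignment_score(*many_short_gaps, mu=1, sigma_open=10, sigma_extend=1))
```

With equal penalties the alignment with many short indels scores higher. With an expensive opening penalty the alignment with two long indels wins.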
---AAGCATAGCCAT
---||||||||----
CTGAAGCATAG----
Step 1)
Total time = nm + nm/2 + nm/4 + nm/8 + ... ≤ 2nm = O(nm)
Total space = O(n)
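The linear space bound rests on the fact that one row of the score matrix depends only on the row before it. The Python sketch below shows just this building block: it computes the optimal global score while keeping two rows in memory. It is not the full divide-and-conquer procedure that also recovers the alignment, and the scoring (match = +1, penalties `mu` and `sigma`) is an assumption.

```python
def nw_score_linear_space(a, b, match=1, mu=1, sigma=1):
    """Optimal global alignment score using O(len(b)) memory."""
    prev = [-sigma * j for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [-sigma * i]                          # first column: indels only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else -mu)
            curr.append(max(diag, prev[j] - sigma, curr[j - 1] - sigma))
        prev = curr                                  # the older row is no longer needed
    return prev[-1]


print(nw_score_linear_space("ATTCCAATA", "TGATACCA"))
```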
Implementation in R
plotNeedlemanWunsch("GCTAACA","TGATACCA",1,1,1)
[Figure: annotated function call. The first two arguments are Sequence 1 and Sequence 2; the numeric arguments include the penalty for mismatch and the penalty for insertion/deletion.]
plotNeedlemanWunsch("ATTCCAATA","TGATACCA",1,1,1)
match = 1, mismatch = −1, gap = −1
          T    G    A    T    A    C    C    A
     0   −1   −2   −3   −4   −5   −6   −7   −8
A   −1   −1   −2   −1   −2   −3   −4   −5   −6
T   −2    0   −1   −2    0   −1   −2   −3   −4
T   −3   −1   −1   −2   −1   −1   −2   −3   −4
C   −4   −2   −2   −2   −2   −2    0   −1   −2
C   −5   −3   −3   −3   −3   −3   −1    1    0
A   −6   −4   −4   −2   −3   −2   −2    0    2
A   −7   −5   −5   −3   −3   −2   −3   −1    1
T   −8   −6   −6   −4   −2   −3   −3   −2    0
A   −9   −7   −7   −5   −3   −1   −2   −3   −1
--ATTCCAATA
--||.|||---
TGATACCA---
--ATTCCAATA
--||.||---|
TGATACC---A
Implementation in R
match = 1, mismatch = −1, gap = −5
[Figure: score matrix for gap = −5; only the following two rows survive.]
          T    G    A    T    A    C    C    A
C  −25  −19  −15  −11   −7   −5   −6   −7  −12
A  −30  −24  −20  −14  −12   −6   −6   −7   −6
ATTCCAATA
-|...|..|
-TGATACCA
Proteins that have, on average, only one mutation per 100 amino acids are
defined as being one PAM unit diverged.
Accepted mutations are the ones that are able to spread in the population
(typically mutations that do not disrupt the protein function or even increase
fitness).
Global
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
--|--||-|--||--|-|-|-|||----||-|--|-|--|-||||---|
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
versus
Local
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
[Figure: alignment grid with AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC along the top and TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC down the side, showing the global and the local alignment path.]
S(0,0) = 0
for (i in 1:n) S(i,0) = 0
for (j in 1:m) S(0,j) = 0
for (i in 1:n)
  for (j in 1:m)
    S(i,j) = max(w(i,j,x),0)    # a score never drops below 0: the alignment may start anywhere
    D(i,j) = path.to(i,j)
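A minimal Python sketch of this local variant follows. It assumes match = +1 and constant penalties `mu` and `sigma`. It starts the traceback at the highest entry of S and stops at the first entry equal to 0. The function name is chosen for illustration.

```python
def smith_waterman(a, b, match=1, mu=1, sigma=1):
    """Local alignment of a and b; returns (score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]     # first row and column stay 0
    D = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else -mu)
            candidates = [(diag, "diag"),
                          (S[i - 1][j] - sigma, "up"),
                          (S[i][j - 1] - sigma, "left")]
            score, move = max(candidates, key=lambda c: c[0])
            if score > 0:                         # otherwise the entry stays 0
                S[i][j], D[i][j] = score, move
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Traceback from the best entry until a 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while S[i][j] > 0:
        move = D[i][j]
        if move == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


seq1 = "tccCAGTTATGTCAGgggacacgagcatgcagagac".upper()
seq2 = "aattgccgccgtcgttttcagCAGTTATGTCAGatc".upper()
print(smith_waterman(seq1, seq2))
```

The only differences from the global version are the zero initialisation, the lower bound of 0 in each entry, and the start and end points of the traceback.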
Input: The multiset of pairwise distances L, containing $\binom{n}{2}$ integers.
M = max(L)
for (every set of integers 0 < x2 < ... < xn-1 < M)
  X = {0, x2, ..., xn-1, M}
  ΔX = get.pairwise.distances(X)
  if (ΔX == L)
    return(X)
return(«no solution»)
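The exhaustive search above can be sketched in Python as follows. The function names are made up for illustration, and the example multiset L is an assumption chosen for this sketch, not taken from the slides.

```python
from collections import Counter
from itertools import combinations

def pairwise_distances(X):
    """Multiset of all pairwise distances between the points in X."""
    return Counter(b - a for a, b in combinations(sorted(X), 2))


def brute_force_pdp(L):
    """Try every set 0 < x2 < ... < x(n-1) < M; return a matching X or None."""
    target = Counter(L)
    size = sum(target.values())
    if size == 0:
        return None
    n = 2
    while n * (n - 1) // 2 < size:    # find n with n(n-1)/2 = |L|
        n += 1
    if n * (n - 1) // 2 != size:
        return None                   # |L| is not of the form n(n-1)/2
    M = max(target)
    for inner in combinations(range(1, M), n - 2):
        X = (0, *inner, M)
        if pairwise_distances(X) == target:
            return list(X)
    return None


L = [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]   # assumed example input
print(brute_force_pdp(L))
```

The number of candidate sets grows with M, the largest distance, so the running time depends on the size of the numbers in L and not only on n.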
L has 10 elements and therefore n = 5 ($\binom{n}{2} = 10$)
Set x2 = 2 and remove distances 8 and 2. (x2 – x1, x5 – x2)
x3 = 6 does not work, so x3 must be 4
Therefore x4 must be 7
……
This algorithm is usually very fast because we do not go deep into the «recursive tree».
However, there exist pathological examples where we explore 2^k possible paths (left and right are always viable paths).
So strictly speaking, this algorithm has exponential time complexity, but it is much faster in most cases.
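A recursive search of this kind can be sketched in Python as below. This is one common formulation and not necessarily the exact procedure of the slides: it repeatedly takes the largest remaining distance y and tries to place a point either at M − y or at y, backtracking when the required distances are not available. The input multiset is an assumed example, and the function names are made up.

```python
from collections import Counter

def partial_digest(L):
    """Return a sorted list X whose pairwise distances equal L, or None."""
    remaining = Counter(L)
    if not remaining:
        return None
    width = max(remaining)                        # M, the largest distance
    remaining = remaining - Counter([width])
    return _place(remaining, [0, width], width)


def _place(remaining, X, width):
    if not remaining:                             # all distances explained
        return sorted(X)
    y = max(remaining)                            # largest unexplained distance
    for candidate in (width - y, y):              # the two viable branches
        needed = Counter(abs(candidate - x) for x in X)
        if all(remaining[d] >= c for d, c in needed.items()):
            result = _place(remaining - needed, X + [candidate], width)
            if result is not None:
                return result
    return None                                   # dead end: backtrack


print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))  # [0, 2, 4, 7, 10]
```

In this example only one of the two branches survives at each level, so the search finishes after a handful of calls. The exponential case arises when both branches stay viable for many levels.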