Sequence Comparison Homology and Similarity
BINF3010/9010

Similarity
Looking alike
Not an evolutionary concept

Types of homology
Orthologs: Genes/proteins descended from a common ancestor by speciation
Paralogs: Genes/proteins related to each other due to a gene duplication event
SPATEGGANDSPAM
insertions deletions
1 SPLATEGGANDSPAM
2 SPAGANDSPAM

Dotmatrix plot
[Figure: dot matrix plot of sequence 1 (SPLATEGGANDSPAM) against sequence 2 (SPAGANDSPAM)]

Dotmatrix plots
[Figure: dot matrix plots of AAGTTCAGTAGGCATTTAAGCG against AGTACCGTTCC in three panels: word size = 2; word size = 3; word size = 3 with threshold = 2]
[Figure: dot matrix plots in four panels: window = 30, stringency = 9; window = 20, stringency = 9; window = 30, stringency = 14; window = 20, stringency = 13]
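
The effect of word size and threshold on a dot matrix plot can be tried out with a few lines of Python. This is a minimal sketch, not the program used for the plots above: the function names and the parameters `window` and `threshold` are chosen here for illustration, and the sketch assumes that a dot is drawn when at least `threshold` of the `window` positions are identical (stringency in real tools may instead be a sum of scoring matrix values).

```python
def dot_matrix(seq_x, seq_y, window=1, threshold=None):
    """Return the set of (i, j) positions at which a dot is drawn.

    A dot is placed at (i, j) when the words of length `window` starting at
    seq_x[i] and seq_y[j] have at least `threshold` identical positions.
    """
    if threshold is None:
        threshold = window  # default: the two words must match exactly
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: seq_x runs across, seq_y runs down."""
    lines = ["  " + seq_x]
    for j, residue in enumerate(seq_y):
        row = "".join("*" if (i, j) in dots else " " for i in range(len(seq_x)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


horizontal = "AAGTTCAGTAGGCATTTAAGCG"
vertical = "AGTACCGTTCC"
exact = dot_matrix(horizontal, vertical, window=3)                 # word size = 3
relaxed = dot_matrix(horizontal, vertical, window=3, threshold=2)  # 2 of 3 match
print(render(horizontal, vertical, relaxed))
```

Lowering the threshold can only add dots, so the relaxed plot always contains the exact one; larger windows with a high threshold suppress background noise at the cost of missing short matches.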

Repeat detection
[Figure: dot matrix plot of TFIIIA vs TFIIIA]

Sequence alignment
1 SPLATEGGANDSPAM
2 SPAGANDSPAM

Algorithms
Global alignment: Needleman-Wunsch, Sellers
Local alignment: Smith-Waterman
Note that the optimal alignment is not necessarily the correct biological alignment. However, it is usually impossible to know the correct evolutionary alignment.
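
Both families of algorithm fill the same dynamic programming table and differ only in the boundary conditions and in whether a cell may fall back to zero. The sketch below computes the optimal score only (no traceback). It assumes a simple match/mismatch score and a linear gap cost; the function name and the default values are illustrative, not taken from the lecture.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming with a linear gap cost.

    local=False gives a Needleman-Wunsch style global score,
    local=True gives a Smith-Waterman style local score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


print(align_score("SPLATEGGANDSPAM", "SPAGANDSPAM"))
print(align_score("SPLATEGGANDSPAM", "SPAGANDSPAM", local=True))
```

The score returned is optimal only with respect to the chosen scoring system, which is why the choice of substitution scores and gap penalties matters as much as the algorithm.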

Structure alignment
                   10        20        30        40        50        60
           ....*....|....*....|....*....|....*....|....*....|....*....|
4HHB_A   1 ~VLSPADKTNVKAAWGKVgaHAGEYGAEALERMFLSFPTTKTYFPHFDls~~~~~~hGSA 53
2HHB_B   1 vHLTPEEKSAVTALWGKV~~NVDEVGGEALGRLLVVYPWTQRFFESFGdlstpdavmGNP 58

                   70        80        90       100       110       120
           ....*....|....*....|....*....|....*....|....*....|....*....|
4HHB_A  54 QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL 113
2HHB_B  59 KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 118

                  130       140
           ....*....|....*....|....*...
4HHB_A 114 PAEFTPAVHASLDKFLASVSTVLTSKYR 141
2HHB_B 119 GKEFTPPVQAAYQKVVAGVANALAHKYH 146

Scoring systems
Matches and mismatches: substitution mutations
Gaps: insertions and deletions
[Figure: scoring matrix for the nucleotides A, T, G, C]
[Figure: BLOSUM62 matrix]

Data-based matrices
Calculated from amino acid frequencies in known homologous sequences
PAM family of matrices
BLOSUM family of matrices
Perform better than first-principle matrices (which are still useful for some specialised applications)
[Figure: scoring matrix fragment with entries of +1 and -1]

BLOSUM matrices
[Figure: BLOSUM 62 matrix]
Henikoff and Henikoff, 1992
Blocks Substitution Matrix
Based on the BLOCKS database
Currently, most widely used matrix family
Most commonly used matrices: BLOSUM62 and BLOSUM55

BLOCKS database
BLOCKS are ungapped multiple sequence alignments based on the SWISS-PROT database and the PROSITE protein family database
All the sequences from SWISS-PROT belonging to a PROSITE family are aligned together to create local ungapped alignments characteristic of the protein family

BLOCK example
ID   Mn_catalase; BLOCK
AC   IPB007760A; distance from previous block=(3,160)
DE   Manganese containing catalase
BL   HIL; width=14; seqs=49; 99.5%=727; strength=1034
CTJC_BACSU|Q45538 (  67) HLEMIATMVYKLTK  12
GS80_BACSU|P80878 (  69) HVEMIATMIARLLE  14
YDHU_BACSU|O05513 (   4) HGNLITDLLDNLLL  25
O69145            (  70) HMEIVAETINLLNG  64
Q9KDZ2            ( 136) SGNLIFDLLHNYFL  34
Q9KAU6            (  69) HVEMLATMIARLLD  16
Q9I1T0            (  68) HLEIIGSIVGMLNK  20
Q97JE8            (  68) HLEIVGSIVRQLSR  50
MCAT_CLOAB|Q97FE0 ( 124) TGDIVADLLSNIAS  73
Q8Z7E1            (  68) HLEIIGSLVGMLNK  17
Q8YY54            (  69) HIEMLATMIAHLLD  27
Q8YSJ5            (  68) HLEMVGKLIEAHTK  36
Q9KWV1            (  68) HLEIIGSLVGMLNK  17
Q8XDQ1            (  68) HLEIIGSLVGMLNK  17
YJQC_BACSU|O34423 (  69) HVEMLATMISRLLD  19
Q8R929            (  68) HLEIIATLVFKLLK  22
Q8PG91            (  68) HLEIIGSIIAMLNK  19
Q8P4M4            (  68) HLEIIGSIIAMLNK  19
Q8EQM8            (  18) SGNLLADFRANLTA  35
[Figure: counting pairs n_uv of amino acids u and v (e.g. D and E) in a block]

BLOSUM family
Pair    e_xy       2log2(o_xy/e_xy)
AA      196/576     0.70
AD      112/576    -1.09
AE      168/576    -1.61
DD       16/576     1.70
DE       48/576     0.53
EE       36/576     1.80

Final matrix
     A    D    E
A    1   -1   -2
D   -1    2    1
E   -2    1    2
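
The scores 2log2(o_xy/e_xy) above can be reproduced in a few lines. The observed pair counts are not given in the text, so the counts below are hypothetical values (60 pairs in total) chosen so that the expected frequencies e_xy and the scores come out as shown; the variable names are illustrative. The sketch assumes the usual BLOSUM convention that a mixed pair contributes half its frequency to each of its two amino acids.

```python
from math import log2

# Hypothetical observed pair counts for the alphabet {A, D, E}; these are not
# given in the lecture and were chosen to be consistent with the values above.
pair_counts = {"AA": 26, "AD": 8, "AE": 10, "DD": 3, "DE": 6, "EE": 7}

total = sum(pair_counts.values())
observed = {pair: n / total for pair, n in pair_counts.items()}  # o_xy

# Frequency of each amino acid among the pairs: every pair passes half of its
# frequency to each member (so an identical pair passes all of it to one).
freq = {}
for (x, y), o in observed.items():
    freq[x] = freq.get(x, 0.0) + o / 2
    freq[y] = freq.get(y, 0.0) + o / 2


def expected(x, y):
    """Expected pair frequency e_xy if amino acids paired at random."""
    return freq[x] ** 2 if x == y else 2 * freq[x] * freq[y]


scores = {pair: 2 * log2(observed[pair] / expected(*pair)) for pair in pair_counts}
matrix = {pair: round(score) for pair, score in scores.items()}

for pair in pair_counts:
    print(pair, round(scores[pair], 2), matrix[pair])
```

A positive score means the pair is seen more often than chance would predict, a negative score less often; rounding to integers gives the entries of the final matrix.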
Problem: counting every amino acid in the block can lead to an over-representation of amino acid changes found in closely related sequences
Solution: cluster sequences closer than a set % identity, and average their contribution so that the whole cluster counts as one sequence
This gives rise to a family of matrices, depending on the % identity threshold

                                                    n_EE   n_VE
No clustering (BLOSUM100)                             6      4
Clustering sequences with 80% identity (BLOSUM80)     3      3
Clustering sequences with 60% identity (BLOSUM60)     2      2
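
The clustering step can be sketched as follows. This is an illustration under stated assumptions rather than the BLOCKS implementation: it assumes single-linkage clustering on percent identity and gives every segment a weight of 1/(cluster size), so that each cluster counts as one sequence. The function names are made up for this example; the segments are taken from the BLOCK example.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length segments."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_weights(segments, threshold):
    """Single-linkage clustering at `threshold` % identity.

    Returns one weight per segment, 1 / (size of its cluster), so that the
    weights within a cluster add up to one sequence.
    """
    cluster_of = list(range(len(segments)))  # every segment starts alone
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if percent_identity(segments[i], segments[j]) >= threshold:
                old, new = cluster_of[j], cluster_of[i]
                # Merge the two clusters by relabelling one of them.
                cluster_of = [new if c == old else c for c in cluster_of]
    sizes = {c: cluster_of.count(c) for c in set(cluster_of)}
    return [1.0 / sizes[c] for c in cluster_of]


segments = [
    "HLEIIGSLVGMLNK",  # Q8Z7E1
    "HLEIIGSLVGMLNK",  # Q9KWV1
    "HLEIIGSIVGMLNK",  # Q9I1T0
    "SGNLLADFRANLTA",  # Q8EQM8
]
print(cluster_weights(segments, threshold=80))
print(cluster_weights(segments, threshold=95))
```

With a lower threshold more segments fall into the same cluster, so near-identical sequences contribute less to the pair counts and the resulting matrix reflects more distant relationships.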

PAM matrices
PAM: Point (Percent) Accepted Mutation
Schwartz and Dayhoff, 1978
Also known as MDM78 (mutation data matrix) or Dayhoff matrix
Empirical matrix based on evolutionary model
Based on small number of families of closely related proteins (>85% identity) so that sequences can be aligned unambiguously by hand
Since the changes observed between these sequences did not affect the function of the protein, these are accepted mutations
3. Count the number of times each amino acid changes to each other one, e.g. F changing to L
hbb_ornan  LSELHCDKLH VDPENFNRLG NVLIVVLARH FSKDFSPEVQ AAWQKLVSGV
hbb_tacac  LSELHCDKLH VDPENFNRLG NVLVVVLARH FSKEFTPEAQ AAWQKLVSGV
hbe_ponpy  LSELHCDKLH VDPENFKLLG NVMVIILATH FGKEFTPEVQ AAWQKLVSAV
hbb_speci  LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
hbb_speto  LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
hbb_equhe  LSELHCDKLH VDPENFRLLG NVLVVVLARH FGKDFTPELQ ASYQKVVAGV
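4. Calculate probability for each amino acid mutating to each other amino acid
For each pair of amino acids i and j, the frequency of change $f_{ij}$ is:
$$f_{ij} = \frac{N_{ij}}{\sum_k N_{ik}}$$
where $N_{ij}$ is the number of times amino acid i was counted changing to amino acid j in step 3. The probabilities are then
$$p_{ij} = c f_{ij} \quad (i \ne j) \qquad \text{and} \qquad p_{ii} = 1 - \sum_{j \ne i} c f_{ij}$$
where c is a positive scaling constant chosen so that each $p_{ii} > 0$.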

Probability matrix
The resulting probability matrix allows modelling the evolution of protein sequences as a Markov process: that is, the probability of any amino acid mutating to another one is dependent only on that amino acid
      A      C      D      E
A    pAA    pAC    pAD    pAE
C           pCC    pCD    pCE
D                  pDD    pDE
E                         pEE

PAM 1
The constant c is chosen so that the expected number of amino acid changes after one round of applying the probabilities is 1 in 100 amino acids:
$$\sum_i \sum_{j \ne i} p_i p_{ij} = c \sum_i \sum_{j \ne i} p_i f_{ij} = 0.01$$
where $p_i$ is the frequency of amino acid i.
The resulting probability matrix is the PAM1 probability matrix, giving the probability that an amino acid will mutate to another over an amount of evolutionary time such that 1% of amino acids mutate
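
Steps 3 and 4 and the choice of c can be put together in a short sketch. The counts and frequencies below are hypothetical numbers for a three-letter alphabet, not Dayhoff's data, and the function name is made up for this example. The sketch takes the sum over k in the definition of f_ij over all k, including k = i, which is an assumption about the notation.

```python
def pam1_matrix(counts, freqs, target=0.01):
    """Build a PAM1-style probability matrix from substitution counts.

    counts[i][j] is N_ij and freqs[i] is the frequency p_i of amino acid i.
    Returns the matrix and the scaling constant c.
    """
    n = len(counts)
    # f_ij = N_ij / sum_k N_ik
    f = [[counts[i][j] / sum(counts[i]) for j in range(n)] for i in range(n)]
    # Choose c so that sum_i sum_{j != i} p_i * c * f_ij equals the target.
    change = sum(freqs[i] * f[i][j] for i in range(n) for j in range(n) if i != j)
    c = target / change
    p = [[c * f[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        p[i][i] = 1.0 - sum(p[i])  # p_ii = 1 - sum_{j != i} c * f_ij
    return p, c


# Hypothetical counts for the alphabet (A, D, E); illustrative values only.
counts = [[90, 4, 6],
          [4, 80, 16],
          [6, 16, 78]]
freqs = [0.5, 0.2, 0.3]

pam1, c = pam1_matrix(counts, freqs)
for row in pam1:
    print([round(value, 5) for value in row])
```

Each row sums to 1, and the frequency-weighted chance of any change is exactly 1 in 100, which is what makes the result a one-PAM matrix.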
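5. PAM N
Because the probability matrix is Markov, it is possible to calculate probability matrices for longer evolutionary times by multiplying the matrix by itself: the PAM N probability matrix is the PAM1 probability matrix raised to the Nth power
e.g. PAM2 probability matrix:
$$\begin{pmatrix} p_{AA} & p_{AC} & p_{AD} & \cdots \\ p_{CA} & p_{CC} & p_{CD} & \cdots \\ p_{DA} & p_{DC} & p_{DD} & \cdots \\ \cdots & \cdots & \cdots & \cdots \end{pmatrix} \times \begin{pmatrix} p_{AA} & p_{AC} & p_{AD} & \cdots \\ p_{CA} & p_{CC} & p_{CD} & \cdots \\ p_{DA} & p_{DC} & p_{DD} & \cdots \\ \cdots & \cdots & \cdots & \cdots \end{pmatrix}$$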

PAM N
e.g. a PAM250 matrix represents a 250% level of evolutionary change (250 accepted point mutations per 100 residues; a site may change more than once)
e.g. PAM120, PAM80, PAM60 matrices could be used for aligning sequences which are approximately 40%, 50% and 60% similar, respectively
PAM250 has been shown preferable for distantly related proteins of 14-27% similarity
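
Extrapolating to longer evolutionary times is plain matrix multiplication. The sketch below uses a hypothetical PAM1-like matrix for a three-letter alphabet (illustrative values, not real data) and helper names invented for this example.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """PAM1 raised to the n-th power (n >= 1) by repeated multiplication."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical PAM1-like matrix; every row sums to 1.
pam1 = [[0.990, 0.004, 0.006],
        [0.004, 0.980, 0.016],
        [0.006, 0.016, 0.978]]

pam2 = pam_n(pam1, 2)
pam250 = pam_n(pam1, 250)
print([round(value, 6) for value in pam2[0]])
print([round(value, 3) for value in pam250[0]])
```

The diagonal of the higher-order matrix stays well above zero: 250 accepted mutations per 100 residues does not mean that every residue differs, because repeated and reverse changes at the same site are counted.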

Rather than use probabilities, it is more convenient to use log odds matrices
If $p_{ij}$ is an entry in the PAM N probability matrix, the corresponding entry in the PAM N log odds matrix is:
[Equation: log odds entry corresponding to $p_{ij}$]
where C is a positive constant and $q_i$ and $q_j$ are the respective observed frequencies of amino acids i and j in the sequences
Interpreted as the ratio of the probability that the substitution represents an authentic evolutionary change to the probability that it occurred due to random events of no biological significance.
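
The log odds formula itself is not reproduced in the text above. The sketch below assumes one commonly used form, s_ij = C log10(q_i p_ij / (q_i q_j)) = C log10(p_ij / q_j) with C = 10, rounded to the nearest integer; both the form and the value of C are assumptions here, and the probability matrix and frequencies are hypothetical values for a three-letter alphabet.

```python
from math import log10


def log_odds(prob, freqs, scale=10.0):
    """Convert a probability matrix into an integer log odds matrix.

    Assumed form: s_ij = scale * log10(p_ij / q_j), where p_ij is the
    probability of i being replaced by j and q_j is the frequency of j.
    """
    n = len(prob)
    return [[round(scale * log10(prob[i][j] / freqs[j])) for j in range(n)]
            for i in range(n)]


# Hypothetical frequencies and PAM N-like probabilities for (A, D, E).
freqs = [0.5, 0.2, 0.3]
prob = [[0.80, 0.08, 0.12],
        [0.20, 0.60, 0.20],
        [0.20, 2 / 15, 2 / 3]]

for row in log_odds(prob, freqs):
    print(row)
```

Under this form a score above zero means the substitution is seen more often than expected from the background frequencies alone, and a score below zero means it is seen less often. Adding log scores along an alignment corresponds to multiplying the underlying odds.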
[Figure: PAM100 and PAM200 matrices]

BLOSUM vs PAM
PAM
Built from an evolutionary model based on closely related proteins
Extrapolation from closely related sequences
Built from a small number of complete sequences
BLOSUM
Built directly from blocks of aligned protein segments covering a wide range of evolutionary time
No extrapolation
Built from a large number of sequence segments

Gap penalties I
Rationale: Gaps arise through insertion/deletion events, which do not happen one residue at a time.
Penalty for creating a new gap
Typically relatively high, to prevent too many gaps in the alignment
Penalty for extending an existing gap
Typically relatively small, so that a small difference in gap length will not greatly affect the penalty for this gap, but not so small that it results in very long gaps.
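
The two penalties combine into what is usually called an affine gap penalty. A minimal sketch follows, assuming the convention that a gap of length L costs the opening penalty plus (L - 1) extension penalties (some programs instead charge L extensions). The penalty values, the function names and the example alignment of the two SPAM sequences are illustrative only.

```python
import re


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for a single gap: one opening plus (length - 1) extensions."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def total_gap_penalty(aligned, gap_open=10.0, gap_extend=0.5):
    """Sum the penalties of every run of '-' in one aligned sequence."""
    runs = re.findall(r"-+", aligned)
    return sum(affine_gap_penalty(len(run), gap_open, gap_extend) for run in runs)


# One possible alignment of the two example sequences (not necessarily optimal).
seq1 = "SPLATEGGANDSPAM"
seq2 = "SP-A---GANDSPAM"

print(total_gap_penalty(seq2))                         # two gaps: lengths 1 and 3
print(affine_gap_penalty(4), 4 * affine_gap_penalty(1))  # one long gap vs four short
```

With these values a single gap of four residues costs far less than four separate one-residue gaps, which reflects the rationale that one insertion or deletion event can span several residues.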

Gap penalties II
Alignment of human α and β hemoglobin chains
False negatives