Sequence Comparison Homology and Similarity

The document discusses sequence comparison and alignment. It defines homology as sequences being evolutionarily related through common ancestry, while similarity refers to sequences that look alike without implying ancestry. High similarity can provide evidence for inferring homology. Optimal alignments maximize matches and minimize gaps/mismatches based on assigned scores. Global alignments align full sequences while local alignments find best-matching regions.

Uploaded by

Anthony Liang
Copyright
© Attribution Non-Commercial (BY-NC)

Sequence Comparison

BINF3010/9010

Homology and similarity

Homology
Sequences are homologous if they are evolutionarily related, i.e. they share a common ancestor through evolution

Similarity
Looking alike
Not an evolutionary concept

Homology and similarity


Homology is not a quantity
Two sequences are either homologous or not homologous
e.g., it is incorrect to refer to two sequences as being 50% homologous

Homology and similarity


Computational methods recognise and measure similarity
High similarity is supporting evidence to infer homology

Similarity can be quantified
e.g., two sequences can be 50% similar, 80% similar etc.

Types of homology
Orthologs: Genes/proteins in different species descended from a common ancestral gene through a speciation event
Paralogs: Genes/proteins related to each other due to a gene duplication event

Evolution through mutations


SPAMEGGANDSPAM
substitutions

SPATEGGANDSPAM
insertions deletions

1 SPLATEGGANDSPAM

2 SPAGANDSPAM

Visualising the process


Dotmatrix plots (dotplots)
Alignments

Dotmatrix plot
[Figure: dotmatrix plot of sequence 1 (SPLATEGGANDSPAM) against sequence 2 (SPAGANDSPAM)]

Dotmatrix plots

Dotmatrix plot: Principle

Horizontal sequence: AAGTTCAGTAGGCATTTAAGCG
Vertical sequence: AGTACCGTTCC

[Figure: dotplot, word size = 1]
[Figure: dotplot, word size = 2]
[Figure: dotplot, word size = 3]
[Figure: dotplot, word size = 3, threshold = 2]
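
The dotplot principle can be sketched in a few lines of Python. This is a minimal illustration, not the program used to draw the slides: the function names `dotplot` and `render` are hypothetical, and the reading of "threshold" as the minimum number of matching positions within a word is an assumption.

```python
def dotplot(seq_x, seq_y, word=1, threshold=None):
    """Return the set of (i, j) positions that receive a dot.

    A dot is placed at (i, j) when the word of length `word` starting at
    seq_x[i] and the word starting at seq_y[j] agree in at least
    `threshold` positions (assumed meaning; defaults to an exact match).
    """
    if threshold is None:
        threshold = word
    dots = set()
    for i in range(len(seq_x) - word + 1):
        for j in range(len(seq_y) - word + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(word))
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the dots as text: seq_x across the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for j, residue in enumerate(seq_y):
        row = "".join("*" if (i, j) in dots else " " for i in range(len(seq_x)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


horizontal = "AAGTTCAGTAGGCATTTAAGCG"
vertical = "AGTACCGTTCC"
for word, threshold in [(1, None), (2, None), (3, None), (3, 2)]:
    dots = dotplot(horizontal, vertical, word, threshold)
    print(f"word size = {word}, threshold = {threshold or word}: {len(dots)} dots")
print(render(horizontal, vertical, dotplot(horizontal, vertical, word=3)))
```

Larger word sizes remove isolated chance matches, while a threshold below the word size lets imperfect diagonals show through again.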

Window = 30 Stringency = 9

Window = 20 Stringency = 9

Window = 30 Stringency = 14

Window = 20 Stringency = 13

Dotmatrix plot: repeats

[Figure: dotmatrix plot of sequence 1 (SPLATEGGANDSPAM) against sequence 2 (SPAGANDSPAM)]

Repeat detection

[Figure: dotplot of TFIIIA vs TFIIIA]

Sequence alignment

1 SPLATEGGANDSPAM
2 SPAGANDSPAM

1 SPLATEGGANDSPAM
  || |   ||||||||
2 SP-A---GANDSPAM

Global vs Local Alignment

Global: align the whole of the two sequences together

 1 ....AUAUCUUUAAUUUAAUGGUAAAAUAUUAGAAUACGAAUCUAAUUAU 46
       ||||  || |   ||    ||    || || |    |  | || ||
 1 UGGUAUAUAGUUUAAACAAAACGAAUGAUUUCGACUCAUUAAAUUAUGAU 50

47 AUAGGUUCAAAUCCUAUAAGAUAUUCCA 74
   |    |    | |    |
51 AAUCAUAUUUACCAACCA.......... 68

Local: align only the region of best similarity

44 UAUAUAGGUUCAA 56
   ||||||| || ||
 4 UAUAUAGUUUAAA 16

Which alignment is correct?

1 SPLATEGGANDSPAM
  || |   ||||||||
2 SP-A---GANDSPAM
2 insertion/deletions

1 SPLATEGGANDSPAM
  ||     ||||||||
2 SPA----GANDSPAM
1 indel, 1 substitution

1 SPLATEGGANDSPAM
  ||     ||||||||
2 SP----AGANDSPAM
1 indel, 1 substitution

1 SPLATEGGANDSPAM
     |   ||||||||
2 -SPA---GANDSPAM
2 indels, 2 substitutions

Which alignment is optimal?

Select a scoring system for alignments
Assign values to matches, mismatches and gaps

For example: Match: +2, Mismatch: -1, Gap: 5 (a penalty, subtracted once per gap whatever its length)

Sum up the values over the whole alignment

Alignment score = (sum of match and mismatch scores) - (sum of gap penalties)

1 SPLATEGGANDSPAM
  || |   ||||||||
2 SP-A---GANDSPAM
S = (11*2) + (0*-1) - (2*5) = 12

1 SPLATEGGANDSPAM
  ||x    ||||||||
2 SPA----GANDSPAM
S = (10*2) + (1*-1) - (1*5) = 14

1 SPLATEGGANDSPAM
  ||    x||||||||
2 SP----AGANDSPAM
S = (10*2) + (1*-1) - (1*5) = 14

1 SPLATEGGANDSPAM
   xx|   ||||||||
2 -SPA---GANDSPAM
S = (9*2) + (2*-1) - (2*5) = 6

The optimal alignment is the one with the highest score
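
The scoring arithmetic above can be checked with a short Python sketch. The function name `score_alignment` is hypothetical. It assumes, as the worked scores imply, that the gap value is charged once per run of gap characters rather than once per gapped position.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap=5):
    """Score a gapped pairwise alignment given as two equal-length strings.

    Gaps are written as '-'. Each run of gaps in one sequence is charged
    the `gap` penalty once (assumption taken from the worked examples).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for a, b in zip(top, bottom):
        side = "top" if a == "-" else "bottom" if b == "-" else None
        if side is None:
            if a == b:
                matches += 1
            else:
                mismatches += 1
        elif side != gap_side:
            gaps += 1  # a new gap run starts here
        gap_side = side
    return matches * match + mismatches * mismatch - gaps * gap


reference = "SPLATEGGANDSPAM"
for aligned in ["SP-A---GANDSPAM", "SPA----GANDSPAM",
                "SP----AGANDSPAM", "-SPA---GANDSPAM"]:
    print(aligned, score_alignment(reference, aligned))
```

Under this scheme the second and third alignments tie at 14, so the scoring system alone does not single out one of them.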

Algorithms

Global alignment
Needleman-Wunsch
Sellers

Local alignment
Smith-Waterman

Note that the optimal alignment is not necessarily the correct biological alignment. However, it is usually impossible to know the correct evolutionary alignment.
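
As an illustration of how a global alignment algorithm finds the highest score, here is a minimal Needleman-Wunsch style sketch that returns only the optimal score. It is an assumption-laden simplification: the function name is hypothetical, and it uses a linear penalty of `gap` per gapped position, which is not the per-gap counting used in the worked SPAM example, so its scores are not directly comparable to those values.

```python
def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=5):
    """Best global alignment score under a linear gap penalty.

    Dynamic programming over prefixes: each cell holds the best score for
    aligning a[:i] with b[:j]. Only two rows are kept in memory.
    """
    prev = [-gap * j for j in range(len(b) + 1)]  # a[:0] against b[:j]
    for i in range(1, len(a) + 1):
        curr = [-gap * i]  # a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] - gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] - gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("SPAM", "SPAM"))
print(needleman_wunsch_score("SPAM", "SPM"))
```

A local (Smith-Waterman style) variant differs mainly in flooring every cell at zero and taking the maximum over the whole table, which is what restricts the result to the best-matching region.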

Structure alignment

                   10        20        30        40        50        60
           ....*....|....*....|....*....|....*....|....*....|....*....|
4HHB_A   1 ~VLSPADKTNVKAAWGKVgaHAGEYGAEALERMFLSFPTTKTYFPHFDls~~~~~~hGSA 53
2HHB_B   1 vHLTPEEKSAVTALWGKV~~NVDEVGGEALGRLLVVYPWTQRFFESFGDlstpdavmGNP 58

                   70        80        90       100       110       120
           ....*....|....*....|....*....|....*....|....*....|....*....|
4HHB_A  54 QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL 113
2HHB_B  59 KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 118

                  130       140
           ....*....|....*....|....*...
4HHB_A 114 PAEFTPAVHASLDKFLASVSTVLTSKYR 141
2HHB_B 119 GKEFTPPVQAAYQKVVAGVANALAHKYH 146

Scoring systems

Matches and mismatches
Substitution mutations

Gaps
Insertions and deletions

DNA sequence alignment

768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172

864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216

DNA scoring matrix used in EMBOSS

     A   T   G   C
A    5  -4  -4  -4
T   -4   5  -4  -4
G   -4  -4   5  -4
C   -4  -4  -4   5

Section of EMBOSS data file EDNAFULL

Protein Sequence Alignment

TPKRREAEDLQVGQVLGGPLQLLE...SLQKRGIVEQCCT
 ||:|: |:      |:|||::|:      |||||||||
YPKKRDMEQ......LSGPLDMLQQEYQKMKRGIVEQCCH

Protein Sequence Alignment

TPKRREAEDLQVGQVLGGPLQLLE...SLQKRGIVEQCCT
 ||:|: |:      |:|||::|:      |||||||||
YPKKRDMEQ......LSGPLDMLQQEYQKMKRGIVEQCCH

Identical (|)
Similar (:)
Different (blank)

Protein Comparison: Scoring Matrix

          A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
Ala A   0.8
Cys C   0.0   1.8
Asp D  -0.4  -0.6   1.2
Glu E  -0.2  -0.8   0.4   1.0
Phe F  -0.4  -0.4  -0.6  -0.6   1.2
Gly G   0.0  -0.6  -0.2  -0.4  -0.6   1.2
His H  -0.4  -0.6  -0.2   0.0  -0.2  -0.4   1.6
Ile I  -0.2  -0.2  -0.6  -0.6   0.0  -0.8  -0.6   0.8
Lys K  -0.2  -0.6  -0.2   0.2  -0.6  -0.4  -0.2  -0.6   1.0
Leu L  -0.2  -0.2  -0.8  -0.6   0.0  -0.8  -0.6   0.4  -0.4   0.8
Met M  -0.2  -0.2  -0.6  -0.4   0.0  -0.6  -0.4   0.2  -0.2   0.4   1.0
Asn N  -0.4  -0.6   0.2   0.0  -0.6   0.0   0.2  -0.6   0.0  -0.6  -0.4   1.2
Pro P  -0.2  -0.6  -0.2  -0.2  -0.8  -0.4  -0.4  -0.6  -0.2  -0.6  -0.4  -0.4   1.4
Gln Q  -0.2  -0.6   0.0   0.4  -0.6  -0.4   0.0  -0.6   0.2  -0.4   0.0   0.0  -0.2   1.0
Arg R  -0.2  -0.6  -0.4   0.0  -0.6  -0.4   0.0  -0.6   0.4  -0.4  -0.2   0.0  -0.4   0.2   1.0
Ser S   0.2  -0.2   0.0   0.0  -0.4   0.0  -0.2  -0.4   0.0  -0.4  -0.2   0.2  -0.2   0.0  -0.2   0.8
Thr T   0.0  -0.2  -0.2  -0.2  -0.4  -0.4  -0.4  -0.2  -0.2  -0.2  -0.2   0.0  -0.2  -0.2  -0.2   0.2   1.0
Val V   0.0  -0.2  -0.6  -0.4  -0.2  -0.6  -0.6   0.6  -0.4   0.2   0.2  -0.6  -0.4  -0.4  -0.6  -0.4   0.0   0.8
Trp W  -0.6  -0.4  -0.8  -0.6   0.2  -0.4  -0.4  -0.6  -0.6  -0.4  -0.2  -0.8  -0.8  -0.4  -0.6  -0.6  -0.4  -0.6   2.2
Tyr Y  -0.4  -0.4  -0.6  -0.4   0.6  -0.6   0.4  -0.2  -0.4  -0.2  -0.2  -0.4  -0.6  -0.2  -0.4  -0.4  -0.4  -0.2   0.4   1.4

First principles amino acid substitution matrices

Identity matrix
Perfect match: positive score
Any mismatch: negative score

Genetic score matrix
Based on the average number of nucleotide changes needed to mutate one amino acid into another
e.g. K (AAA, AAG) to N (AAC, AAU) has a higher score than K (AAA, AAG) to D (GAU, GAC)

Chemical properties matrices
e.g. K (basic) to R (basic) has a higher score than K (basic) to F (aromatic) or K to E (acidic)

BLOSUM62 Matrix

Identity matrix example

     D   E   Q   H   V   F   W
D   +1  -1  -1  -1  -1  -1  -1
E       +1  -1  -1  -1  -1  -1
Q           +1  -1  -1  -1  -1
H               +1  -1  -1  -1
V                   +1  -1  -1
F                       +1  -1
W                           +1

Data-based matrices
Calculated from amino acid frequencies in known homologous sequences
PAM family of matrices
BLOSUM family of matrices
Perform better than first principle matrices (which are still useful for some specialised applications)

BLOSUM matrices
BLOSUM 62

BLOSUM matrices
Henikoff and Henikoff, 1992
Blocks Substitution Matrix
Based on the BLOCKS database
Currently, most widely used matrix family
Most commonly used matrices: BLOSUM62 and BLOSUM55

BLOCKS database
BLOCKS are ungapped multiple sequence alignments based on the SWISS-PROT database and the PROSITE protein family database
All the sequences from SWISS-PROT belonging to a PROSITE family are aligned together, to create local ungapped alignments characteristic of the protein family

BLOCK example

ID   Mn_catalase; BLOCK
AC   IPB007760A; distance from previous block=(3,160)
DE   Manganese containing catalase
BL   HIL; width=14; seqs=49; 99.5%=727; strength=1034
CTJC_BACSU|Q45538  (  67) HLEMIATMVYKLTK  12
GS80_BACSU|P80878  (  69) HVEMIATMIARLLE  14
YDHU_BACSU|O05513  (   4) HGNLITDLLDNLLL  25
O69145             (  70) HMEIVAETINLLNG  64
Q9KDZ2             ( 136) SGNLIFDLLHNYFL  34
Q9KAU6             (  69) HVEMLATMIARLLD  16
Q9I1T0             (  68) HLEIIGSIVGMLNK  20
Q97JE8             (  68) HLEIVGSIVRQLSR  50
MCAT_CLOAB|Q97FE0  ( 124) TGDIVADLLSNIAS  73
Q8Z7E1             (  68) HLEIIGSLVGMLNK  17
Q8YY54             (  69) HIEMLATMIAHLLD  27
Q8YSJ5             (  68) HLEMVGKLIEAHTK  36
Q9KWV1             (  68) HLEIIGSLVGMLNK  17
Q8XDQ1             (  68) HLEIIGSLVGMLNK  17
YJQC_BACSU|O34423  (  69) HVEMLATMISRLLD  19
Q8R929             (  68) HLEIIATLVFKLLK  22
Q8PG91             (  68) HLEIIGSIIAMLNK  19
Q8P4M4             (  68) HLEIIGSIIAMLNK  19
Q8EQM8             (  18) SGNLLADFRANLTA  35

From BLOCKS to BLOSUM

1. Count the number of amino acid pairs observed in each column of each block and calculate the observed frequency of each pair
2. Calculate the expected frequency of each pair (based on the frequency of individual amino acids)
3. Calculate the log ratio (typically log2)

1. Count number of observed pairs and calculate frequencies

Example block (6 sequences, 4 columns):

DADA
AAAE
AAEE
AADA
AAEE
AADE

There are $4 \times \binom{6}{2} = 60$ aligned pairs of amino acids in the block

Aligned pair (xy)    Proportion of times observed (o_xy)
A to A               26/60
A to D               8/60
A to E               10/60
D to D               3/60
D to E               6/60
E to E               7/60

General case for step 1

For each pair of amino acids x and y,
n_xy = number of times x and y are in the same column of a block
o_xy = observed proportion of aligned pair xy

$$o_{xy} = \frac{n_{xy}}{\sum_{u \le v} n_{uv}}$$

2. Calculate the expected frequency of each pair

Amino acid (x)    Proportion in block (p_x)
A                 14/24
D                 4/24
E                 6/24

Amino acid pair (xy)    Expected proportion (e_xy)
A to A                  (14/24)^2 = 196/576
A to D                  2 (14/24) (4/24) = 112/576
A to E                  2 (14/24) (6/24) = 168/576
D to D                  (4/24)^2 = 16/576
D to E                  2 (4/24) (6/24) = 48/576
E to E                  (6/24)^2 = 36/576

General case for step 2

Expected proportion of amino acid pair xy in a random block of the same amino acid composition:

$$e_{xy} = \begin{cases} 2 p_x p_y & \text{if } x \neq y \\ p_x p_y & \text{if } x = y \end{cases}$$

3. Calculate the log ratio

$$\text{Matrix entry} = 2 \log_2 \left( \frac{o_{xy}}{e_{xy}} \right) \quad \text{(rounded to nearest integer)}$$

Aligned pair (xy)    o_xy     e_xy       2log2(o_xy/e_xy)
A to A               26/60    196/576     0.70
A to D               8/60     112/576    -1.09
A to E               10/60    168/576    -1.61
D to D               3/60     16/576      1.70
D to E               6/60     48/576      0.53
E to E               7/60     36/576      1.80
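
The three steps can be reproduced for the example block with a short Python sketch. The variable names are illustrative only, and the code covers just this toy block with no sequence clustering.

```python
from collections import Counter
from itertools import combinations
from math import log2

block = ["DADA", "AAAE", "AAEE", "AADA", "AAEE", "AADE"]

# Step 1: count unordered residue pairs within each column of the block.
pair_counts = Counter()
for column in zip(*block):
    for x, y in combinations(column, 2):
        pair_counts[tuple(sorted((x, y)))] += 1
total_pairs = sum(pair_counts.values())
observed = {pair: n / total_pairs for pair, n in pair_counts.items()}

# Step 2: expected pair proportions from the residue proportions p_x.
residue_counts = Counter("".join(block))
total_residues = sum(residue_counts.values())
p = {aa: n / total_residues for aa, n in residue_counts.items()}
expected = {
    (x, y): p[x] * p[y] if x == y else 2 * p[x] * p[y]
    for (x, y) in observed
}

# Step 3: log ratio in half-bits, then round to the nearest integer.
scores = {pair: 2 * log2(observed[pair] / expected[pair]) for pair in observed}
matrix = {pair: round(score) for pair, score in scores.items()}

for pair in sorted(scores):
    x, y = pair
    print(f"{x} to {y}: n = {pair_counts[pair]:2d}, "
          f"2log2(o/e) = {scores[pair]:5.2f}, entry = {matrix[pair]:2d}")
```

Only pairs that actually occur in the block get an entry here. A pair with zero observations would need separate handling, since its log ratio is undefined.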

Final matrix

     A   D   E
A    1  -1  -2
D   -1   2   1
E   -2   1   2

The 2log2 transformation means that the matrix is in half-bits

BLOSUM family

Problem: counting every amino acid in the block can lead to an over-representation of amino acid changes found in closely related sequences
Solution: cluster sequences closer than a set % identity, and average their contribution so that the whole cluster counts as one sequence
This gives rise to a family of matrices, depending on the % identity threshold


VSLHL
ELTRS
EWTRS
EISRS
ELCRT

[Figure: brackets marking which of these sequences are 80% identical and which are 60% identical]

                                                     n_EE   n_VE
No clustering (BLOSUM100)                              6      4
Clustering sequences with 80% identity (BLOSUM80)      3      3
Clustering sequences with 60% identity (BLOSUM60)      2      2

PAM matrices
PAM120

PAM matrices

PAM - Point (Percent) Accepted Mutation
Schwartz and Dayhoff, 1978
Also known as MDM78 (mutation data matrix) or Dayhoff matrix
Empirical matrix based on evolutionary model
Based on small number of families of closely related proteins (>85% identity) so that sequences can be aligned unambiguously by hand
Since the changes observed between these sequences did not affect the function of the protein, these are accepted mutations

1. Align the sequences by hand
2. Order the sequences using parsimony

hbb_ornan  LSELHCDKLH VDPENFNRLG NVLIVVLARH FSKDFSPEVQ AAWQKLVSGV
hbb_tacac  LSELHCDKLH VDPENFNRLG NVLVVVLARH FSKEFTPEAQ AAWQKLVSGV
hbe_ponpy  LSELHCDKLH VDPENFKLLG NVMVIILATH FGKEFTPEVQ AAWQKLVSAV
hbb_speci  LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
hbb_speto  LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
hbb_equhe  LSELHCDKLH VDPENFRLLG NVLVVVLARH FGKDFTPELQ ASYQKVVAGV

3. Count the number of times each amino acid changes to each other one

e.g. F changing to L
[Figure: tree of the six sequences with F or L at each node]
1 F<->L change ($N_{FL} = 1$)

4. Calculate probability for each amino acid mutating to each other amino acid

For each pair of amino acids i and j, the frequency of change $f_{ij}$ is:

$$f_{ij} = \frac{N_{ij}}{\sum_k N_{ik}}$$

For $i \neq j$, the probability of change $p_{ij}$ is:

$$p_{ij} = c f_{ij} \quad \text{and} \quad p_{ii} = 1 - \sum_{j \neq i} c f_{ij}$$

where c is a positive scaling constant chosen so that each $p_{ii} > 0$.

Probability matrix

The resulting probability matrix allows modelling the evolution of protein sequences as a Markov process - that is, the probability of any amino acid mutating to another one is dependent only on that amino acid

     A     C     D     E
A   pAA   pAC   pAD   pAE
C         pCC   pCD   pCE
D               pDD   pDE
E                     pEE

PAM 1

The constant c is chosen so that the expected number of amino acid changes after one round of applying the probabilities is 1 in 100 amino acids

Expected proportion of mutated amino acids:

$$\sum_i \sum_{j \neq i} p_i \, p_{ij} = c \sum_i \sum_{j \neq i} p_i \, f_{ij} = 0.01$$

where $p_i$ is the frequency of amino acid i.

The resulting probability matrix is the PAM1 probability matrix, giving the probability that an amino acid will mutate to another over an amount of evolutionary time such that 1% of amino acids mutate
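
A minimal Python sketch of step 4 and the PAM1 scaling follows. Everything numeric here is hypothetical: the three-letter alphabet, the count matrix `N` and the residue frequencies `freq` are made-up toy values, and putting the number of unchanged residues on the diagonal of `N` is an assumption about how the denominator of the frequency of change is formed.

```python
# Hypothetical toy data for a 3-letter alphabet (not taken from the slides).
alphabet = ["A", "D", "E"]
N = [            # N[i][j]: times residue i was seen aligned with residue j;
    [90, 6, 4],  # the diagonal holds unchanged residues (assumption)
    [6, 40, 14],
    [4, 14, 50],
]
freq = [0.5, 0.2, 0.3]  # frequency of each residue (p_i in the scaling formula)


def pam1(counts, frequencies, target=0.01):
    """Return (P, c): the scaled probability matrix and the constant c."""
    size = len(counts)
    # Frequency of change: each count divided by its row total.
    f = [[counts[i][j] / sum(counts[i]) for j in range(size)] for i in range(size)]
    # Expected proportion of mutated residues if c were 1.
    unscaled = sum(frequencies[i] * f[i][j]
                   for i in range(size) for j in range(size) if i != j)
    c = target / unscaled
    P = [[c * f[i][j] if i != j else 0.0 for j in range(size)] for i in range(size)]
    for i in range(size):
        P[i][i] = 1.0 - sum(P[i][j] for j in range(size) if j != i)
    return P, c


P, c = pam1(N, freq)
mutated = sum(freq[i] * P[i][j] for i in range(3) for j in range(3) if i != j)
print(f"c = {c:.4f}, expected proportion mutated = {mutated:.4f}")
for letter, row in zip(alphabet, P):
    print(letter, " ".join(f"{value:.5f}" for value in row))
```

Each row of `P` sums to 1, and the off-diagonal mass weighted by residue frequency comes to 1%, which is the defining property of the PAM1 probability matrix.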

5. PAM N

Because the probability matrix is Markov, it is possible to calculate probability matrices for longer evolutionary times by multiplying the matrix by itself: the PAMn probability matrix is the PAM1 probability matrix raised to the power n

e.g. PAM2 probability matrix:

$$\begin{pmatrix} p_{AA} & p_{AC} & p_{AD} & \cdots \\ p_{CA} & p_{CC} & p_{CD} & \cdots \\ p_{DA} & p_{DC} & p_{DD} & \cdots \\ \cdots & \cdots & \cdots & \cdots \end{pmatrix} \begin{pmatrix} p_{AA} & p_{AC} & p_{AD} & \cdots \\ p_{CA} & p_{CC} & p_{CD} & \cdots \\ p_{DA} & p_{DC} & p_{DD} & \cdots \\ \cdots & \cdots & \cdots & \cdots \end{pmatrix}$$

PAM N

e.g. a PAM250 matrix represents a 250% level of evolutionary change
e.g. PAM120, PAM80, PAM60 matrices could be used for aligning sequences which are approximately 40%, 50% and 60% similar, respectively
PAM250 has been shown preferable for distantly related proteins of 14-27% similarity
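
The matrix-power step can be sketched in plain Python. The two-state matrix `P1` below is a hypothetical stand-in for a real 20 x 20 PAM1 probability matrix, and the helper names are illustrative.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_pow(p, n):
    """Raise a square matrix to the power n by repeated multiplication."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


# Hypothetical two-state "PAM1-like" matrix; each row sums to 1.
P1 = [[0.99, 0.01],
      [0.02, 0.98]]

P2 = mat_pow(P1, 2)
P250 = mat_pow(P1, 250)
print("PAM2-like:  ", P2)
print("PAM250-like:", P250)
```

Every power of a row-stochastic matrix is itself row-stochastic, so each row of the longer-time matrices still sums to 1.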


Detecting evolutionary relationships

[Figure: phylogenetic tree on a time axis (300 million years, 200 million years, 100 million years, today) with branches labelled PAM100, PAM100 and PAM200]

6. PAM log odds matrices

Rather than use probabilities, it is more convenient to use log odds matrices
If $p_{ij}$ is an entry in the PAMN probability matrix, the corresponding entry in the PAMN log odds matrix is:

$$C \log \left( \frac{p_{ij}}{q_i q_j} \right)$$

where C is a positive constant and $q_i$ and $q_j$ are the respective observed frequencies of amino acids i and j in the sequences
Interpreted as the ratio of the probability that the substitution represents an authentic evolutionary change to the probability that it occurred due to random events of no biological significance.

PAM matrices - summary

Family of substitution matrices corresponding to different levels of evolutionary time
Based on sound evolutionary principles
Distances for long periods of evolutionary history extrapolated from shorter times (assumption!)
Based on a relatively small dataset (mainly globular proteins)

BLOSUM vs PAM
PAM
Built from an evolutionary model based on closely related proteins
Extrapolation from closely related sequences
Built from a small number of complete sequences
BLOSUM
Built directly from blocks of aligned protein segments covering a wide range of evolutionary time
No extrapolation
Built from a large number of sequence segments

BLOSUM vs PAM (cont.)


PAM
PAMn matrices with low n are better suited to closely related sequences
Uses phylogenetic tree to avoid over-representing closely related sequences

Commonly used as log odds matrix
BLOSUM
BLOSUMn matrices with low n are better suited to highly divergent sequences
Uses clustering of related sequences and direct counting of amino acid changes
Commonly used as log odds matrix

BLOSUM vs PAM: Counting Changes

BLOSUM
[Figure: sequences AA, AB and BB]
direct counts
A-B count = 4

PAM
[Figure: sequences AA, BB, AB and AB placed on a tree]
counts from an evolutionary model
A-B count = 2


Gap penalties I

Rationale:
Gaps arise through insertion/deletion events, which do not happen one residue at a time.

Gap creation penalty:
Penalty for creating a new gap
Typically, relatively high to prevent too many gaps in the alignment

Gap extension (length) penalty:
Penalty for extending an existing gap
Typically, relatively small so that a small difference in gap length will not affect the penalty for this gap, but not so small as to result in very long gaps.
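
The two-part penalty can be written as a small Python function. This is a sketch under one common convention, where every gapped residue, including the first, pays the extension penalty. Programs differ on that detail, and the function name and default values are illustrative.

```python
def gap_penalty(length, creation=5.0, extension=0.1):
    """Penalty for a single gap spanning `length` residues.

    Assumed convention: creation penalty charged once per gap, plus the
    extension penalty for each residue in the gap.
    """
    if length <= 0:
        return 0.0
    return creation + extension * length


# One long gap costs far less than the same number of residues split
# across two separate gaps, which is what keeps the number of gaps low.
print(gap_penalty(6))
print(2 * gap_penalty(3))
```

With a high creation penalty and a low extension penalty, lengthening an existing gap by a few residues barely changes the score, whereas opening another gap is expensive.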

Gap Penalties II

Alignment of human α and β hemoglobin chains

Gap creation penalty = 1, gap extension (length) penalty = 0.1

  1 V.LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH.....GSA
    | |.|.:|..|.| ||||  :.:| |:|||:|::: :| |. :|. | |||      |.:
  1 VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP

 54 QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
    .||:||||| :|:.:::||:|::...:..||:||..||:||| ||:||::.|:..|| |:
 59 KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF

114 PAEFTPAVHASLDKFLASVSTVLTSKYR 141
    . ||||:|:|..:|.:|:|...|. ||:
119 GKEFTPPVQAAYQKVVAGVANALAHKYH 146

Gap Penalties III

Alignment of human α and β hemoglobin chains

Gap creation penalty = 5, gap extension (length) penalty = 0.1

  2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF......DLSHGSAQV
    |.|.:|..|.| ||||  :.:| |:|||:|::: :| |. :|. |      |   |.:.|
  3 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV

 56 KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
    |:||||| :|:.:::||:|::...:..||:||..||:||| ||:||::.|:..|| |:.
 61 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK

116 EFTPAVHASLDKFLASVSTVLTSKYR 141
    ||||:|:|..:|.:|:|...|. ||:
121 EFTPPVQAAYQKVVAGVANALAHKYH 146

The twilight zone

[Figure: the twilight zone, showing true positives and false negatives]

Rost, B. Protein Eng. 1999 12:85-94; doi:10.1093/protein/12.2.85

Measuring alignment quality

Alignment score
Relative to random alignment?

Percentage identity
Percentage similarity
Evolutionary distance
In its simplest form, 1-%identity
Several methods available to correct for multiple substitutions

Something to think about

Why do we add the scores together?
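
Percentage identity and the simplest evolutionary distance (1 - %identity) listed under "Measuring alignment quality" can be sketched in Python as follows. The function names are hypothetical. The choice to divide by the number of columns where both sequences have a residue is an assumption, since other denominators are also in use. The Jukes-Cantor formula is included as one well-known correction for multiple substitutions in nucleotide sequences; it is not named in the slides.

```python
from math import log


def percent_identity(top, bottom, gap="-"):
    """Identical columns as a percentage of columns without a gap."""
    pairs = [(a, b) for a, b in zip(top, bottom) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def p_distance(top, bottom):
    """Simplest distance: the proportion of compared sites that differ."""
    return 1.0 - percent_identity(top, bottom) / 100.0


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for nucleotide sequences (p < 0.75)."""
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


identity = percent_identity("SPLATEGGANDSPAM", "SPA----GANDSPAM")
print(f"identity = {identity:.1f}%")
print(f"p-distance = {p_distance('SPLATEGGANDSPAM', 'SPA----GANDSPAM'):.3f}")
print(f"Jukes-Cantor distance for p = 0.10: {jukes_cantor(0.10):.4f}")
```

The corrected distance is always at least as large as the raw proportion of differences, because some sites will have changed more than once.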
