Bio Medical Tics - Sequence Analysis - Alignment - 2011
Sequence Alignment
Study Goal
What is a sequence alignment?
The difference between a global and a local alignment, and what each is used for.
How to use dot matrix methods to analyze genes and chromosomes.
The steps performed by the Needleman-Wunsch and Smith-Waterman algorithms to produce a sequence alignment.
How to use scoring matrix values and gap penalties to produce a sequence alignment.
Sequence Alignment
Searching for a series of individual characters or character patterns that are in the same order in the sequences.
Sequence alignment score:
The sum of the individual log-odds scores for each pair of aligned sequence characters in an alignment, less a penalty for each gap of one or more positions.
Given two sequences
x = x1x2...xM,
y = y1y2...yN,
an alignment is an assignment of gaps to positions 0, ..., M in x and 0, ..., N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
Example: AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-
AGCGAAGTTT
6 matches, 3 mismatches, 1 gap

AGGCTA-GTT-
AG-CGAAGTTT
7 matches, 1 mismatch, 3 gaps

AGGC-TA-GTT-
AG-CG-AAGTTT
7 matches, 0 mismatches, 5 gaps
Evolutionary
[Figure: sequence changes passed on to the next generation, three marked OK and two marked X. Still OK?]
Scoring Function
Match: +m
Mismatch: -s
Gap: -d
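A minimal sketch of such a scoring function, reading +m, -s and -d as the match reward, mismatch penalty and gap penalty (an interpretation of the slide, not stated in it). The function names and the default values m = s = d = 1 are illustrative assumptions. The example rows are AGGCTAGTT and AGCGAAGTTT with a trailing gap assumed in the shorter sequence.

```python
def count_columns(row_x, row_y):
    """Count matches, mismatches and gaps in two equal-length alignment rows."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1          # a letter lined up with a gap
        elif a == b:
            matches += 1       # identical letters
        else:
            mismatches += 1    # different letters
    return matches, mismatches, gaps


def alignment_score(row_x, row_y, m=1, s=1, d=1):
    """Linear score: m per match, minus s per mismatch, minus d per gap (assumed values)."""
    matches, mismatches, gaps = count_columns(row_x, row_y)
    return m * matches - s * mismatches - d * gaps


print(count_columns("AGGCTAGTT-", "AGCGAAGTTT"))    # (6, 3, 1)
print(alignment_score("AGGCTAGTT-", "AGCGAAGTTT"))  # 6 - 3 - 1 = 2
```

Changing m, s or d changes which of several candidate alignments scores best, which is why the scoring function has to be fixed before alignments are compared.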
Pair-wise Alignment
Two sequences can always be aligned.
Sequence alignments have to be scored.
Often there is more than one solution with the same score.
The number of possible alignments of two sequences of length N is >> 2^N.
Global alignment: used to align entire sequences; best for sequences of similar length.
Local alignment: used to align regions of similarity within the sequences; best for sequences sharing a conserved region or domain.
Orthologous
Due to purifying selection driven by functional constraints, observable against a background described by the theory of neutral evolution
Fast enough that pseudogenes rapidly deteriorate over evolutionary timescales
In any prokaryotic genome, homologs from more than one distantly related species are detectable for 70-80% of proteins
Methods of Alignment
By hand
Dot plot, with windows
Rigorous mathematical approach: dynamic programming (slow, optimal)
Heuristic methods (fast, approximate): BLAST and FASTA
Align by Hand
GATCGCCTA_TTACGTCCTGGAC
<--->
AGGCATACGTA_GCCCTTTCGC
You still need some kind of scoring system to find the best alignment.
Dotplot
[Figure: dot plots of Sequence 1 against Sequence 2; axis residues: T A C A T T A C G T A C A T A C A C T T A]
Example: TACTGTCAT against TACTGTTCAT
TACTG-TCAT
||||| ||||
TACTGTTCAT
Dotplot
[Figure: dot plot of two hemoglobin chains, Window = 130 / Stringency = 9]
[Figure: word-based dot plot of TACGGTATGACAGTATC, Word Size = 3; fragments shown: CTATGACA and TACGGTATG]
Window / Stringency
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
[Figure: the pair above shown three times, labelled Score = 11, Score = 11 and Score = 7]
A typical window size for DNA sequences is 15 bases, with a stringency (match requirement) of 10: if at least 10 of the 15 positions in the window match, a dot is printed at the first base of the window.
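The window and stringency rule can be sketched as follows. This is an illustrative implementation, not the program used for the figures; the function names `dotplot` and `render` are hypothetical, and the example uses a window of 3 with a stringency of 3 on the short TACTGTCAT / TACTGTTCAT pair so that the output stays readable.

```python
def dotplot(seq1, seq2, window=1, stringency=1):
    """Return the set of (i, j) positions at which a dot is printed.

    A dot is placed at (i, j), the first position of the window, when
    seq1[i:i+window] and seq2[j:j+window] agree in at least `stringency`
    positions.
    """
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            hits = sum(a == b for a, b in zip(seq1[i:i + window], seq2[j:j + window]))
            if hits >= stringency:
                dots.add((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the plot as text: seq1 across the top, seq2 down the side."""
    lines = ["  " + seq1]
    for j, b in enumerate(seq2):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq1)))
        lines.append(b + " " + row)
    return "\n".join(lines)


dots = dotplot("TACTGTCAT", "TACTGTTCAT", window=3, stringency=3)
print(render("TACTGTCAT", "TACTGTTCAT", dots))
```

The dots fall on one diagonal up to the extra T in the longer sequence and on a neighbouring diagonal after it, which is how an insertion or deletion shows up in a dot plot.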
Dotplot
[Figure: dot plot of two hemoglobin chains, Window = 18 / Stringency = 10]
Considerations
Small windows produce many random (unspecific) matches.
With large windows the sensitivity for short sequences is reduced.
Insertions/deletions are not treated explicitly.
Dot Matrix
The dot matrix method should be considered a first choice for pair-wise sequence alignment.
It readily reveals the presence of insertions/deletions and of direct and inverted repeats.
DNA sequence dot matrix comparison: long windows and high stringencies (7/11, 11/15).
Protein sequences: short windows and stringencies (1/1), except when looking for a short domain of partial similarity in otherwise dissimilar sequences (15/5).
Programs:
Alignment is additive
Observation: the score of aligning x1...xM to y1...yN is additive.
Say that x1...xi aligns to y1...yj, and xi+1...xM aligns to yj+1...yN.
The two scores add up: F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
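A small check of additivity for one fixed alignment, assuming a linear score of +1 per match, -1 per mismatch and -1 per gap (illustrative values, not taken from the slides). Because the score is a sum over columns, cutting the alignment at any column gives two parts whose scores add up to the whole. With an affine gap penalty this would not hold at a cut inside a gap.

```python
def score(row_x, row_y, m=1, s=1, d=1):
    """Column-by-column score: +m match, -s mismatch, -d gap (assumed values)."""
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total -= d
        elif a == b:
            total += m
        else:
            total -= s
    return total


row_x, row_y = "AGGCTAGTT-", "AGCGAAGTTT"
whole = score(row_x, row_y)
for cut in range(len(row_x) + 1):
    left = score(row_x[:cut], row_y[:cut])    # columns before the cut
    right = score(row_x[cut:], row_y[cut:])   # columns after the cut
    assert left + right == whole
print(whole)  # 2
```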
- Creation of an alignment path matrix
- Stepwise calculation of score values
- Backtracking (evaluation of the optimal path)
Dynamic programming is applicable when a large search space can be structured into a succession of stages, such that
- the initial stage contains trivial solutions to sub-problems,
- each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage,
- the final stage contains the overall solution.
(Mount)
Idea: build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences.
Construct a matrix F indexed by i and j (one index for each sequence).
F(i,j) is the score of the best alignment between the initial segment x1...xi of x and the initial segment y1...yj of y.
Build F(i,j) recursively, beginning with F(0,0) = 0.
If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known, we can calculate F(i,j).
Three possibilities:
- xi and yj are aligned: F(i,j) = F(i-1,j-1) + s(xi, yj)
- xi is aligned to a gap: F(i,j) = F(i-1,j) - d
- yj is aligned to a gap: F(i,j) = F(i,j-1) - d
The best score up to (i,j) will be the largest of the three options
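The recurrence can be sketched as a matrix fill. This is a minimal illustration: the substitution score s is assumed to be +1 for identical letters and -1 otherwise, the gap penalty d is assumed to be 1, and the first row and column are set to -j d and -i d as in the global alignment initialization. The function name `fill_matrix` is hypothetical.

```python
def fill_matrix(x, y, match=1, mismatch=-1, d=1):
    """Fill F, where F[i][j] is the best score for x[:i] aligned to y[:j]."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d                      # x[:i] aligned entirely to gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d                      # y[:j] aligned entirely to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,          # x_i aligned to y_j
                F[i - 1][j] - d,              # x_i aligned to a gap
                F[i][j - 1] - d,              # y_j aligned to a gap
            )
    return F


F = fill_matrix("TACTGTCAT", "TACTGTTCAT")
print(F[-1][-1])  # 8: nine matches and one gap
```

Each cell is computed once from three neighbours, so the fill takes time proportional to M times N.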
Dynamic Programming
There are only a polynomial number of subproblems: align x1...xi to y1...yj.
The original problem is one of the subproblems: align x1...xM to y1...yN.
Each subproblem is easily solved from smaller subproblems.
Then we can apply dynamic programming.
Let F(i,j) = optimal score of aligning x1...xi to y1...yj.
Sequences a1, a2, a3, a4 and b1, b2, b3, b4 Align sequences in a global alignment
Types of Scores
Notice that although R and K are different amino acids, they have a positive score. Why? Both are positively charged, so exchanging one for the other will not greatly change the function of the protein.
Conservation
Amino acid changes that tend to preserve the
Polar to polar: aspartate to glutamate
Nonpolar to nonpolar: alanine to valine
Similarly behaving residues: leucine to isoleucine
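As an illustration, the snippet below looks up scores for these conservative exchanges, plus the R/K pair and the C/W pair for contrast. The five values are copied from the PAM 250 table in these slides; the dictionary `PAM250_SUBSET` and the helper `pair_score` are hypothetical names introduced for the example.

```python
# A few PAM 250 entries copied from the table in these slides.
PAM250_SUBSET = {
    ("R", "K"): 3,    # both positively charged
    ("D", "E"): 3,    # polar to polar
    ("A", "V"): 0,    # nonpolar to nonpolar
    ("L", "I"): 2,    # similarly behaving residues
    ("C", "W"): -8,   # a non-conservative exchange, for contrast
}


def pair_score(a, b, table=PAM250_SUBSET):
    """Look up a symmetric substitution score; raises KeyError if the pair is absent."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]


for a, b in [("R", "K"), ("D", "E"), ("A", "V"), ("L", "I"), ("C", "W")]:
    print(a, b, pair_score(a, b))
```

In this subset the conservative exchanges score zero or above, while the C to W exchange is strongly negative.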
Scoring Examples
Dynamic Programming
Scoring systems
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact

    A  G  C  T
A   1  0  0  0
G   0  1  0  0
C   0  0  1  0
T   0  0  0  1
Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM
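The simplest scoring system is an identity matrix: 1 for identical symbols, 0 otherwise. The sketch below builds one for any alphabet and uses it to score two equal-length sequences position by position, without gaps. The function names are hypothetical, and applying identity scoring to the protein pair is an assumption made for illustration.

```python
def identity_matrix(alphabet):
    """Score 1 for identical symbols and 0 otherwise."""
    return {(a, b): int(a == b) for a in alphabet for b in alphabet}


def ungapped_score(seq1, seq2, matrix):
    """Sum matrix scores over two equal-length sequences, position by position."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


dna = identity_matrix("AGCT")
print(dna[("A", "A")], dna[("A", "G")])  # 1 0

protein = identity_matrix("ACDEFGHIKLMNPQRSTVWY")
print(ungapped_score("PTHPLASKTQILPEDLASEDLTI",
                     "PTHPLAGERAIGLARLAEEDFGM", protein))  # 11 identical positions
```

Replacing the identity matrix with a substitution matrix such as PAM or BLOSUM only changes the dictionary passed in; the summation stays the same.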
Scoring matrix
     C   S   T   P   A   G   N   D
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
N   -3   1   0  -2  -2   0   5
D   -3   0  -1  -1  -2  -1   1   6
.    .
Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.
[Figure: diagram grouping the amino acids by property: tiny, small, aliphatic, aromatic, hydrophobic, polar, charged, positive]
Scoring matrices reflect:
- the number of mutations needed to convert one amino acid into another
- chemical similarity
- observed mutation frequencies
- the probability of occurrence of each amino acid
Widely used scoring matrices:
- PAM
- BLOSUM
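Observed mutation frequencies and amino acid occurrence probabilities are usually combined into a log-odds score. The form below is the standard textbook expression, not a formula taken from these slides; the symbols q_ab (frequency with which a and b are observed aligned in related sequences) and p_a, p_b (background frequencies) are notation introduced here. Published PAM and BLOSUM values are scaled and rounded versions of such scores.

```latex
s(a,b) = \log \frac{q_{ab}}{p_a \, p_b}
\qquad
S = \sum_{\text{aligned pairs}} s(x_i, y_j) \;-\; \sum_{\text{gaps}} \text{gap penalty}
```

A positive s(a,b) means the pair is seen aligned more often than expected by chance, and a negative value means less often.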
PAM 250
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6
Lowest score: -8 (C/W). Highest score: 17 (W/W).
PAM
Dayhoff Matrix
BLOSUM Matrix
Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993). When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices, for distantly related proteins higher PAM or lower BLOSUM matrices. For database searching the commonly used matrix is BLOSUM62.
Significance of alignment
Database Searching
Global Alignment
[Figure: alignment matrix with x1 ... xM along one axis and y1 ... yN along the other]
Every non-decreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences.
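To get a feel for how many such paths exist, the sketch below counts them, assuming each step moves right, down or diagonally (a gap in one sequence, a gap in the other, or an aligned pair). The function name `count_paths` is hypothetical. Paths that differ only in the order of adjacent gaps are counted separately.

```python
def count_paths(M, N):
    """Number of monotone paths from (0, 0) to (M, N) using right, down and diagonal steps."""
    paths = [[1] * (N + 1) for _ in range(M + 1)]   # one path along each border
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1] + paths[i - 1][j - 1]
    return paths[M][N]


for n in (1, 2, 3, 10):
    print(n, count_paths(n, n), 2 ** n)
```

Already for two sequences of length 10 the count is in the millions, far above 2^10 = 1024, so enumerating every path is not practical and dynamic programming is used instead.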
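1. Initialization.
a. F(0, 0) = 0
b. F(0, j) = - j d
c. F(i, 0) = - i d
2. Main iteration.
a. For each i = 1, ..., M and each j = 1, ..., N:
F(i, j) = max of F(i-1, j-1) + s(xi, yj), F(i-1, j) - d, F(i, j-1) - d
Ptr(i, j) records which of the three cases gave the maximum
3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment.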
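A compact sketch of global alignment with traceback, following the initialization, main iteration and termination steps. It assumes a linear gap penalty d = 1 and a substitution score of +1 for identical letters and -1 otherwise; these values, the function name and the pointer labels DIAG, UP and LEFT are illustrative choices. When several options tie, this sketch prefers the diagonal, so it returns one optimal alignment among possibly several.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment with linear gap penalty d; returns (score, row_x, row_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # 1. Initialization: leading gaps along the first column and row.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    # 2. Main iteration: best of the three cases for every cell.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            options = [
                (F[i - 1][j - 1] + s, "DIAG"),  # x_i aligned to y_j
                (F[i - 1][j] - d, "UP"),        # x_i aligned to a gap
                (F[i][j - 1] - d, "LEFT"),      # y_j aligned to a gap
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda t: t[0])
    # 3. Termination: trace back from (M, N) to (0, 0).
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "UP":
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))


score, row_x, row_y = needleman_wunsch("TACTGTCAT", "TACTGTTCAT")
print(score)
print(row_x)
print(row_y)
```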
Global alignment over the length of both sequences (needle).
Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may share only limited regions of conserved sequence.
Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
x = x1...xM, y = y1...yN
x = aaaacccccggggtta
y = ttcccgggaaccaacc
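For this pair a local alignment is the natural choice, since the two strings share only a short region. The sketch below is a minimal Smith-Waterman style implementation under assumed parameters (match +1, mismatch -1, linear gap penalty 1); the function name is hypothetical. It differs from the global recurrence in two ways: cell scores are never allowed to fall below 0, and the traceback starts at the highest-scoring cell and stops at the first cell with score 0.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Best local alignment with linear gap penalty d; returns (score, row_x, row_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_i, best_j = F[i][j], i, j
    # Trace back from the best cell until the score drops to 0.
    row_x, row_y = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


print(smith_waterman("aaaacccccggggtta", "ttcccgggaaccaacc"))
# (6, 'cccggg', 'cccggg')
```

With these parameters the best local alignment is the shared block cccggg, which a global alignment of the same two strings would have to embed among mismatches and gaps.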
98% of genes are conserved between any two mammals.
>70% average similarity in protein sequence.
hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001
mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001
rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938
fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174

hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001
mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174
Termination:
1.
2.
Smith-Waterman algorithm
BLAST Algorithm