
Bio Medical Tics - Sequence Analysis - Alignment - 2011

The document discusses sequence alignment methods. It defines sequence alignment as comparing two or more sequences to find patterns that are in the same order. It describes global and local alignments, and how dot plots and dynamic programming algorithms such as Needleman-Wunsch and Smith-Waterman are used to compute optimal alignments through scoring matrices and gap penalties.

Uploaded by

黃柏翰
Copyright
© Attribution Non-Commercial (BY-NC)

Sequence Analysis

1 DNA SEQUENCING & SEQUENCE ALIGNMENT

Sequence Alignment

Study Goal

- What a sequence alignment is.
- The difference between a global and a local alignment, and what each is used for.
- How to use dot matrix methods to analyze genes and chromosomes.
- The steps performed by the Needleman-Wunsch and Smith-Waterman algorithms to produce a sequence alignment.
- How to use scoring matrix values and gap penalties to produce a sequence alignment.

Sequence Alignment

Definition: the comparison of two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

Sequence alignment score: the sum of the individual log odds scores for each pair of aligned sequence characters in an alignment, less a penalty for each gap of one or more positions.

Pair-wise Sequence Alignment



AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings

x = x1 x2 ... xM,
y = y1 y2 ... yN,

an alignment is an assignment of gaps to positions 0, ..., M in x, and 0, ..., N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

What is a good alignment?



AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-
AGCGAAGTTT
6 matches, 3 mismatches, 1 gap

AGGCTA-GTT-
AG-CGAAGTTT
7 matches, 1 mismatch, 3 gaps

AGGC-TA-GTT-
AG-CG-AAGTTT
7 matches, 0 mismatches, 5 gaps
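
The counts on this slide can be checked mechanically. The sketch below is a minimal Python illustration, not part of the original slides: the function name `count_columns` is hypothetical, and each alignment is assumed to be written as two rows of equal length with "-" marking a gap (so the shorter row carries a trailing gap).

```python
def count_columns(row_x, row_y):
    """Count matches, mismatches and gap columns of an alignment.

    The alignment is given as two rows of equal length; '-' marks a gap.
    """
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1          # one of the two letters faces a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


# The three candidate alignments of AGGCTAGTT and AGCGAAGTTT from this slide.
for rows in [("AGGCTAGTT-", "AGCGAAGTTT"),
             ("AGGCTA-GTT-", "AG-CGAAGTTT"),
             ("AGGC-TA-GTT-", "AG-CG-AAGTTT")]:
    print(rows, count_columns(*rows))
```

Which of the three is "good" depends on how matches, mismatches and gaps are weighted, which is what the scoring function is for.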

Evolution at the DNA level



Sequence edits: deletion, mutation

ACGGTGCAGTTACCA
AC----CAGTCCACCA

Rearrangements: inversion, translocation, duplication

Evolutionary

[Figure: sequence changes passed on to the next generation, labeled OK, OK, OK, X, X and "Still OK?"]

Scoring Function

Sequence edits (starting from AGGCCTC):
- Mutations: AGGACTC
- Insertions: AGGGCCTC
- Deletions: AGG.CTC

Scoring function:
- Match: +m
- Mismatch: -s
- Gap: -d

Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d

Alternative definition: minimal edit distance. Given two strings x, y, find the minimum number of edits (insertions, deletions, mutations) needed to transform one string into the other.
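
As a rough illustration of the scoring function F, the Python sketch below scores an alignment written as two equal-length rows. The function name and the parameter values m = 1, s = 1, d = 2 are assumptions chosen for the example, not values given on the slides.

```python
def alignment_score(row_x, row_y, m=1, s=1, d=2):
    """F = (#matches)*m - (#mismatches)*s - (#gaps)*d for an aligned pair of rows.

    m, s and d are illustrative defaults, not values from the slides.
    """
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


# Candidate alignments of AGGCTAGTT and AGCGAAGTTT:
print(alignment_score("AGGCTAGTT-", "AGCGAAGTTT"))      # 6 matches, 3 mismatches, 1 gap
print(alignment_score("AGGCTA-GTT-", "AG-CGAAGTTT"))    # 7 matches, 1 mismatch, 3 gaps
print(alignment_score("AGGC-TA-GTT-", "AG-CG-AAGTTT"))  # 7 matches, 0 mismatches, 5 gaps
```

With these particular weights the alignment with the fewest gaps scores highest; a smaller gap penalty d would change the ranking.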

Pair-wise Alignment

- The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
- There are lots of possible alignments.
- Two sequences can always be aligned.
- Sequence alignments have to be scored.
- Often there is more than one solution with the same score.

How do we compute the best alignment?



AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Too many possible alignments: >> 2^N
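
To get a feel for the size of the search space, one can count alignments directly. The sketch below assumes that an alignment is any sequence of columns of three kinds (letter/letter, letter/gap, gap/letter); under that assumption the count obeys a simple three-term recurrence. The function name is hypothetical.

```python
def count_alignments(m, n):
    """Number of distinct column sequences aligning a length-m and a length-n string.

    table[i][j] counts alignments of the first i letters of one sequence with the
    first j letters of the other; the last column is letter/letter, letter/gap
    or gap/letter, which gives the three-term recurrence below.
    """
    table = [[1] * (n + 1) for _ in range(m + 1)]  # only one way when a prefix is empty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = table[i - 1][j - 1] + table[i - 1][j] + table[i][j - 1]
    return table[m][n]


for n in (2, 3, 10, 40):
    print(n, count_alignments(n, n), 2 ** n)
```

Already for two sequences of length 10 the count is far above 2^10, so enumerating every alignment is not practical for sequences like the ones on this slide.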

Local alignment vs. Global Alignment



Global alignment: used to align sequences over their entire length. Best for sequences of similar length.

Local alignment vs. Global Alignment



Local alignment: used to align the regions of two sequences that are similar over part of their length. Best for sequences that share a conserved region or domain.

What do we want alignment for?



[Figure labels: orthologous genes; xenologous (transferred) genes]

Matches to similar sequences



Sequence conservation, structure conservation, function conservation

What is conserved in a gene [protein] family is functionally important:
- Conservation is due to purifying selection driven by functional constraints, observable against a background described by the theory of neutral evolution.
- Sequence change is fast enough that pseudogenes rapidly deteriorate over evolutionary timescales.
- In any prokaryotic genome, homologs from more than one distantly related species are detectable for 70-80% of proteins.

Application: comparison of sequences/structures can identify homologous relationships, allowing inference of function based on that relationship.

Methods of Alignment

- By hand: slide sequences on two lines of a word processor
- Dot plot, with windows
- Rigorous mathematical approach: dynamic programming (slow, optimal)
- Heuristic methods (fast, approximate): BLAST and FASTA; word matching and hash tables

Align by Hand

GATCGCCTA_TTACGTCCTGGAC
<--->
AGGCATACGTA_GCCCTTTCGC

You still need some kind of scoring system to find the best alignment.

Dotplot

A dotplot gives an overview of all possible alignments.

[Figure: dotplot with Sequence 1 and Sequence 2 on the axes; axis labels A T T C A C A T A and T A C A T T A C G T A C]

Dotplot

In a dotplot each diagonal corresponds to a possible (ungapped) alignment.

[Figure: dotplot with Sequence 1 and Sequence 2 on the axes; axis labels A T T C A C A T A and T A C A T T A C G T A C]

One possible alignment:

T A C A T T A C G T A C
A T A C A C T T A
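
A dotplot of this kind is easy to compute. The following Python sketch is only an illustration (the function names `dotplot` and `render` are hypothetical): it marks every cell where the two letters are identical, so that runs of marks along a diagonal show up as ungapped alignments.

```python
def dotplot(seq1, seq2):
    """Return a grid with True where seq2[i] == seq1[j] (rows: seq2, columns: seq1)."""
    return [[a == b for b in seq1] for a in seq2]


def render(seq1, seq2):
    """Draw the dotplot as text, with '*' for a dot and '.' for an empty cell."""
    lines = ["  " + " ".join(seq1)]
    for letter, row in zip(seq2, dotplot(seq1, seq2)):
        lines.append(letter + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# The two sequences used in the alignment on this slide.
print(render("TACATTACGTAC", "ATACACTTA"))
```

Every letter-by-letter identity produces a dot, which is why short sequences over a four-letter alphabet give a noisy plot; the word size and window/stringency filters address this.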

Insertions / Deletions in a Dotplot



[Figure: dotplot of T A C T G T C A T against T A C T G T T C A T, axes labeled Sequence 1 and Sequence 2]

T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T

Dotplot
(Window = 130 / Stringency = 9)

Hemoglobin -chain

Hemoglobin -chain

Word Size Algorithm



Word Size = 3

[Figure: the sequence T A C G G T A T G A C A G T A T C shown four times, next to a dotplot with axes labeled C T A T G A C A and T A C G G T A T G]
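
One way to picture the word size method is to place a dot only where a whole word of the chosen size is identical in both sequences. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, and the dictionary lookup stands in for the "word matching and hash tables" idea mentioned among the alignment methods.

```python
def word_matches(seq1, seq2, word_size=3):
    """Return (i, j) pairs where the word starting at seq1[i] equals the word at seq2[j]."""
    # Index every word of seq2 by its start positions (a simple hash table).
    index = {}
    for j in range(len(seq2) - word_size + 1):
        index.setdefault(seq2[j:j + word_size], []).append(j)
    # Look up each word of seq1 in that table.
    hits = []
    for i in range(len(seq1) - word_size + 1):
        for j in index.get(seq1[i:i + word_size], []):
            hits.append((i, j))
    return hits


# Axis labels from the figure on this slide, word size 3.
print(word_matches("TACGGTATG", "CTATGACA", word_size=3))
```

Because a word must match exactly, a single mismatch inside the word suppresses the dot, which is the main difference from the window/stringency method.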

Window / Stringency

[Figure: the two sequences below shown three times, with scores 11, 11 and 7]

PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM

A typical window size for DNA sequences is 15 bases, and the stringency (match requirement) in this window is 10. This means that if there are 10 to 15 matches within the window, a dot is printed at the position of the first base in the window.
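
The window/stringency rule can be sketched in a few lines of Python. This is a minimal illustration that counts identical letters only (an assumption; a scoring matrix such as PAM250 can be used instead of identities), and the function name is hypothetical.

```python
def window_dotplot(seq1, seq2, window=15, stringency=10):
    """Return (i, j) positions where the windows starting at seq1[i] and seq2[j]
    contain at least `stringency` identical letters."""
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(1 for k in range(window) if seq1[i + k] == seq2[j + k])
            if matches >= stringency:
                hits.append((i, j))  # the dot goes at the first position of the window
    return hits


p1 = "PTHPLASKTQILPEDLASEDLTI"
p2 = "PTHPLAGERAIGLARLAEEDFGM"
# One window covering both sequences completely: 11 identical positions.
print(window_dotplot(p1, p2, window=len(p1), stringency=11))
```

Unlike the word size method, a window may contain mismatches as long as the match requirement is met, so diverged but related regions still produce dots.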

Scoring Matrix Filtering
Matrix: PAM250, Window = 12, Stringency = 9

Dotplot
(Window = 18 / Stringency = 10)
Hemoglobin -chain

Hemoglobin -chain

Considerations

- The window/stringency method is more sensitive than the word size method (ambiguities are permitted).
- The smaller the window, the larger the weight of statistical (unspecific) matches.
- With large windows the sensitivity for short sequences is reduced.
- Insertions/deletions are not treated explicitly.

Dot Matrix

- Unless two sequences are known to be very much alike, the dot matrix method should be considered as a first choice for pair-wise sequence alignment.
- It readily reveals the presence of insertions/deletions and of direct and inverted repeats.
- DNA sequence dot matrix comparison: long windows and high stringencies (7/11, 11/15).
- Protein sequences: short windows and stringencies (1/1), except when looking for a short domain of partial similarity in otherwise dissimilar sequences (15/5).

Programs:

- DOTTER: http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html
- DOTPLOT: http://helix.nih.gov/docs/gcg/dotplot.html
- PLALIGN in FASTA
- EMBOSS: http://emboss.sourceforge.net/download/


Alignment is additive
Observation: The score of aligning

x1 ... xM
y1 ... yN

is additive. Say that

x1 ... xi   xi+1 ... xM
aligns to
y1 ... yj   yj+1 ... yN

The two scores add up: F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
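
Additivity can be checked numerically on a small case. The sketch below assumes a score with one fixed penalty per gap position and illustrative values m = 1, s = 1, d = 2 (none of these values come from the slides); it cuts an alignment at every column and confirms that the two partial scores sum to the total.

```python
def score(row_x, row_y, m=1, s=1, d=2):
    """Column-by-column score of an alignment given as two equal-length rows."""
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total -= d   # gap position
        elif a == b:
            total += m   # match
        else:
            total -= s   # mismatch
    return total


row_x, row_y = "AGGCTA-GTT-", "AG-CGAAGTTT"
whole = score(row_x, row_y)
for k in range(len(row_x) + 1):
    left = score(row_x[:k], row_y[:k])
    right = score(row_x[k:], row_y[k:])
    print(k, left, right, left + right == whole)
```

The property holds because every column contributes independently to the sum. A penalty that treats a run of gap positions as one unit would not split this cleanly when the cut falls inside a gap.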

31

Basic principles of dynamic programming



- Creation of an alignment path matrix
- Stepwise calculation of score values
- Backtracking (evaluation of the optimal path)

It is applicable when a large search space can be structured into a succession of stages, such that
- the initial stage contains trivial solutions to sub-problems,
- each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage,
- the final stage contains the overall solution (Mount).

Creation of an alignment path matrix



Idea: build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences.

- Construct a matrix F indexed by i and j (one index for each sequence).
- F(i,j) is the score of the best alignment between the initial segment x1...xi of x and the initial segment y1...yj of y.
- Build F(i,j) recursively, beginning with F(0,0) = 0.

Creation of an alignment path matrix



If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known, we can calculate F(i,j).

Three possibilities:
- xi and yj are aligned: F(i,j) = F(i-1,j-1) + s(xi, yj)
- xi is aligned to a gap: F(i,j) = F(i-1,j) - d
- yj is aligned to a gap: F(i,j) = F(i,j-1) - d

The best score up to (i,j) will be the largest of the three options.
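
The three-way recurrence translates almost line by line into code. The Python sketch below fills the score matrix only (no traceback). The scoring values (match +1, mismatch -1, gap penalty d = 2) and the function name are assumptions for illustration, and the first row and column are set to multiples of -d, as is usual for a global alignment.

```python
def fill_matrix(x, y, match=1, mismatch=-1, d=2):
    """Fill F so that F[i][j] is the best score for aligning x[:i] with y[:j]."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d          # x[:i] aligned entirely to gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d          # y[:j] aligned entirely to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # xi and yj are aligned
                          F[i - 1][j] - d,       # xi is aligned to a gap
                          F[i][j - 1] - d)       # yj is aligned to a gap
    return F


F = fill_matrix("AGGCTAGTT", "AGCGAAGTTT")
for row in F:
    print(" ".join(f"{value:4d}" for value in row))
print("optimal score:", F[-1][-1])
```

Each cell is computed once from three neighbours, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.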

Dynamic Programming

- There are only a polynomial number of subproblems: align x1...xi to y1...yj.
- The original problem is one of the subproblems: align x1...xM to y1...yN.
- Each subproblem is easily solved from smaller subproblems (how?).

Then we can apply dynamic programming.

Let F(i,j) = optimal score of aligning x1...xi to y1...yj.

Example of Dynamic programming algorithm



Sequences a1, a2, a3, a4 and b1, b2, b3, b4
Align the sequences in a global alignment


Types of Scores

Scoring Matrix / Substitution Matrix

Scoring Matrix: Example


     A   R   N   K
A    5
R   -2   7
N   -1  -1   7
K   -1   3   0   6

Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so exchanging one for the other will not greatly change the function of the protein.
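
A scoring matrix is used as a lookup table: each aligned pair of residues contributes the value stored for that pair. The Python sketch below holds only the four-residue excerpt shown on this slide; the helper names are hypothetical, and the matrix is treated as symmetric because only its lower triangle is given.

```python
# Lower triangle of the example matrix (residues A, R, N, K only).
SCORES = {
    ("A", "A"): 5,
    ("R", "A"): -2, ("R", "R"): 7,
    ("N", "A"): -1, ("N", "R"): -1, ("N", "N"): 7,
    ("K", "A"): -1, ("K", "R"): 3, ("K", "N"): 0, ("K", "K"): 6,
}


def s(a, b):
    """Score for aligning residue a with residue b (order does not matter)."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def ungapped_score(p, q):
    """Sum of pair scores for two equal-length, ungapped sequences."""
    if len(p) != len(q):
        raise ValueError("sequences must have the same length")
    return sum(s(a, b) for a, b in zip(p, q))


print(s("R", "K"))                     # conservative substitution, positive score
print(ungapped_score("ARNK", "AKNR"))  # A/A + R/K + N/N + K/R
```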

Conservation
Amino acid changes that tend to preserve the physicochemical properties of the original residue:

- Polar to polar: aspartate to glutamate
- Nonpolar to nonpolar: alanine to valine
- Similarly behaving residues: leucine to isoleucine

Scoring Examples

Dynamic Programming

Scoring systems

DNA Scoring Systems - very simple

Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact

     A   G   C   T
A    1   0   0   0
G    0   1   0   0
C    0   0   1   0
T    0   0   0   1

Match: 1
Mismatch: 0
Score = 5

Protein Scoring Systems



Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM

Scoring matrix

     C   S   T   P   A   G   N   D
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
N   -3   1   0  -2  -2   0   5
D   -3   0  -1  -1  -2  -1   1   6
.
.

T:G = -2
T:T = 5
Score = 48

Protein Scoring Systems



Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.

[Figure: Venn diagram grouping the amino acids by property: tiny, small, aliphatic, aromatic, hydrophobic, polar, charged, positive]

Protein Scoring Systems



Scoring matrices reflect:
- the number of mutations needed to convert one amino acid to another
- chemical similarity
- observed mutation frequencies
- the probability of occurrence of each amino acid

Widely used scoring matrices:
- PAM
- BLOSUM

PAM 250

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6

Lowest score: -8 (C-W). Highest score: 17 (W-W).

PAM

Dayhoff Matrix

BLOSUM Matrix

AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

Conserved blocks in alignments


TIPS on choosing a scoring matrix



- Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).
- When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices.
- For database searching the commonly used matrix is BLOSUM62.

Significance of alignment

Database Searching

Local vs. Global Alignment


Global Alignment

--T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT--C

Local Alignment: better alignment to find conserved segment

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Global vs. Local Alignment



Global Alignment: Needleman and Wunsch (1970)

Local Alignment: Smith and Waterman (1981a)

Global Alignment

Two closely related sequences:

needle (Needleman & Wunsch) creates an end-to-end alignment.

Global Alignment

Two sequences sharing several regions of local similarity:

1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67
  |||||||||||||| | | | |||| || | | | ||
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70

The Needleman-Wunsch Matrix



[Figure: the dynamic programming matrix, with x1 ... xM along one axis and y1 ... yN along the other]

Every non-decreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences.

An optimal alignment is composed of optimal subalignments.

The Needleman-Wunsch Algorithm


1. Initialization.

   F(0, 0) = 0
   F(0, j) = - j × d
   F(i, 0) = - i × d

2. Main Iteration. Filling in partial alignments

   a. For each i = 1...M
        For each j = 1...N

          F(i, j) = max of
              F(i-1, j-1) + s(xi, yj)   [case 1]
              F(i-1, j) - d             [case 2]
              F(i, j-1) - d             [case 3]

          Ptr(i, j) =
              DIAG, if [case 1]
              LEFT, if [case 2]
              UP,   if [case 3]

3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment.
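
The three steps can be put together in a short program. The Python sketch below follows the initialization, iteration and termination described on this slide, using the same pointer names (DIAG, LEFT, UP). The scoring values (match +1, mismatch -1, d = 2), the function name and the tie-breaking rule (DIAG first) are assumptions made for the example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment; returns (optimal score, aligned x row, aligned y row)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: a prefix aligned entirely against gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"

    # 2. Main iteration: best of the three cases, remembering which one won.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: xi aligned to yj
                (F[i - 1][j] - d, "LEFT"),      # case 2: xi aligned to a gap
                (F[i][j - 1] - d, "UP"),        # case 3: yj aligned to a gap
            ]
            # max() keeps the first of several equal scores, so ties prefer DIAG.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # 3. Termination: trace back from (M, N) to (0, 0).
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))


score, top, bottom = needleman_wunsch("AGGCTAGTT", "AGCGAAGTTT")
print(score)
print(top)
print(bottom)
```

When several alignments share the optimal score, the traceback returns only one of them; which one depends on the tie-breaking order.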

Global Alignment (Needleman -Wunsch)


The Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (needle).

Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may share only limited regions of conserved sequence.

Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.

Global methods are useful when you want to force two sequences to align over their entire length.

Local Alignment (Smith-Waterman)


Local alignment: identify the most similar sub-region shared between two sequences (Smith-Waterman).

The local alignment problem



Given two strings

x = x1 ... xM,
y = y1 ... yN

find substrings x', y' whose similarity (optimal global alignment value) is maximum.

x = aaaacccccggggtta
y = ttcccgggaaccaacc

Why local alignment: examples

- Genes are shuffled between genomes
- Portions of proteins (domains) are often conserved

Cross-species genome similarity



- 98% of genes are conserved between any two mammals
- >70% average similarity in protein sequence

hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001
mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001
rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938
fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174

hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001
mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174

hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001
mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001
rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938
fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174

hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001
mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001
rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938
fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174

atoh enhancer in human, mouse, rat, fugu fish

The Smith-Waterman algorithm



Termination:

1. If we want the best local alignment:

   F_OPT = max over i, j of F(i, j)
   Find F_OPT and trace back.

2. If we want all local alignments scoring > t:

   For all i, j, find F(i, j) > t, and trace back.
   This is complicated by overlapping local alignments (Waterman-Eggert '87: find all non-overlapping local alignments with minimal recalculation of the DP matrix).
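
The first termination rule (best local alignment) can be sketched as follows. Note the assumptions: the matrix is filled with the standard Smith-Waterman recurrence, in which a cell is never allowed to drop below 0; that recurrence does not appear in this extract of the slides. The scoring values (match +1, mismatch -1, d = 1), the function name and the test sequences are illustrative.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Best local alignment; returns (F_OPT, aligned part of x, aligned part of y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Same three cases as the global recurrence, plus 0 to start afresh.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Termination: start at the cell holding F_OPT and trace back until a 0 is reached.
    row_x, row_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


# A shared segment embedded in unrelated flanking sequence (made-up example).
print(smith_waterman("GGGGACGTACGTTTTT", "CCCCACGTACGTAAAA"))
```

For the second rule (all local alignments scoring above t), starting a traceback from every cell above the threshold reports many overlapping variants of the same region, which is the complication the Waterman-Eggert method addresses.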

Smith-Waterman algorithm

Basic Local Alignment Search Tool

BLAST Algorithm

Basic BLAST Algorithms