Sequence Alignment Presentation

Uploaded by

Munna Kumar
Copyright
© © All Rights Reserved

Sequence alignment

Sequence alignment is a way of arranging DNA, RNA or protein sequences
to identify regions of similarity that may be a consequence of
functional, structural or evolutionary relationships between the
sequences.

The sequences are padded with gaps (dashes) so that, wherever possible,
columns contain identical characters from the sequences involved.

tcctctgcctctgccatcat---caaccccaaagt
|||| ||| ||||| |||||   ||||||||||||
tcctgtgcatctgcaatcatgggcaaccccaaagt
Sequence alignment

Sequence alignment is important for:

* prediction of function
* database searching
* gene finding
* sequence divergence
Causes for sequence (dis)similarity

mutation (substitution): a nucleotide at a certain location is replaced
by another nucleotide (e.g.: ATA → AGA)

insertion: at a certain location one new nucleotide is inserted
between two existing nucleotides (e.g.: AA → AGA)

deletion: at a certain location one existing nucleotide is deleted
(e.g.: ACTG → AC-G)

indel: an insertion or a deletion


An example of aligning text strings

Raw data:
T C A T G
C A T T G

2 matches, 0 gaps:
T C A T G
      | |
C A T T G

3 matches (2 end gaps):
T C A T G .
  | | |
. C A T T G

4 matches, 1 insertion:
T C A - T G
  | |   | |
. C A T T G

4 matches, 1 insertion:
T C A T - G
  | | |   |
. C A T T G
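
The match counts in these four alignments can be checked mechanically. The following is a minimal sketch in Python; the function name count_matches and the convention that both "-" and "." mark a gap are assumptions made for this illustration.

```python
def count_matches(top, bottom, gap_chars="-."):
    """Count columns where two aligned strings carry the same non-gap character."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    return sum(1 for a, b in zip(top, bottom) if a == b and a not in gap_chars)


# The four alignments of TCATG and CATTG from the slide
print(count_matches("TCATG", "CATTG"))    # 2 matches, 0 gaps
print(count_matches("TCATG.", ".CATTG"))  # 3 matches, 2 end gaps
print(count_matches("TCA-TG", ".CATTG"))  # 4 matches, 1 insertion
print(count_matches("TCAT-G", ".CATTG"))  # 4 matches, 1 insertion
```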
Terminology of sequence comparison

* Sequence identity -- exactly the same amino acid or nucleotide in the
  same position.

* Sequence similarity -- positions count as similar when the residues
  are identical or are substitutions with similar chemical properties.

* Sequence homology -- general term that indicates evolutionary
  relatedness among sequences; it is usually inferred from the
  percentage of sequence identity.

* Pairwise alignment -- used to find the best-matching piecewise (local)
  or global alignment of two query sequences. Pairwise alignment can
  only be applied to two sequences at a time.

* Multiple sequence alignment -- tries to align all the sequences in a
  given query set.
The procedure for comparing two sequences (pairwise alignment) or more
than two (multiple alignment) is to search for a series of individual
characters or patterns that are in the same order in the
sequences. Typically, the purpose is to find homologues
(relatives) of a gene or gene product in a database.
This information is useful for answering a variety of biological questions:

1. The identification of sequences of unknown structure or function.

2. The study of molecular evolution.


There are two types of alignment: local and global.
* Global alignment attempts to match as much of the two sequences as
  possible. Tools for global alignment are based on the
  Needleman-Wunsch algorithm.

* Local alignment tries to find the regions with the highest density of
  matches. Tools for local alignment are based on the Smith-Waterman
  algorithm.


A global alignment between two sequences is an alignment in which all the
characters in both sequences participate in the alignment.
Global alignments are useful mostly for finding closely related sequences.
The global best fit between two sequences
Example: the sequences s = VIVALASVEGAS and t = VIVADAVIS align like:

A(s,t) = V I V A L A S V E G A S
         | | | |   |   |       |
         V I V A D A - V - - I S

The dashes in the lower sequence are indels.
Local alignment methods find related regions within sequences; the aligned
regions can consist of a subset of the characters within each sequence.

Global alignment:
LGPSSKQTGKGS-SRIWDN
LN-ITKSAGKGAIMRLGDA

Local alignment:
-------TGKG--------
-------AGKG--------
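
The local alignment idea can be sketched with the Smith-Waterman recurrence, the algorithm named above as the basis for local alignment tools. This is a minimal illustration only: the function name local_alignment_score and the scores (+1 match, -1 mismatch, -1 gap) are assumptions, and only the best score is returned, not the aligned region itself.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of s and t (Smith-Waterman recurrence)."""
    prev = [0] * (len(t) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("TGKG", "AGKG"))
print(local_alignment_score("LGPSSKQTGKGSSRIWDN", "LNITKSAGKGAIMRLGDA"))
```

Resetting negative cells to 0 is what separates this from a global alignment: poorly matching flanks are simply left out of the score.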
Methods of pairwise alignment
* Dot matrix analysis
* The dynamic programming (DP) algorithm
* Word methods
Dot matrix analysis
* A dot matrix analysis is a method for comparing two
  sequences to look for possible alignments (Gibbs and McIntyre
  1970).
* Dot plots are two-dimensional graphs showing a comparison
  of two sequences. The two axes X and Y of the graph represent
  the two sequences being compared. Wherever a base or
  residue on one axis coincides with a base or residue on the
  other axis, the cell is marked with a dot. Any region of similarity
  is revealed by a diagonal row of dots. Isolated dots not on a
  diagonal represent random matches.
* Assume that we have to compare the following sequences:
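Sequence 1: AGCTAGGA
Sequence 2: CACTAGGC
Insert a dot in each matching cell and then scan the resulting
graph for a series of dots that form a diagonal.

  A G C T A G G A
C . . * . . . . .
A * . . . * . . *
C . . * . . . . .
T . . . * . . . .
A * . . . * . . *
G . * . . . * * .
G . * . . . * * .
C . . * . . . . .

(* = dot in a matching cell, . = empty cell)

To maximize the number of matches, the resulting alignment could then be

-AGCTAGGA-
CA-CTAGG-C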
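
A dot matrix for these two sequences can be built in a few lines. The sketch below is only an illustration: the function names dot_matrix, render and longest_diagonal are hypothetical, and matching cells are printed as "*".

```python
def dot_matrix(seq_x, seq_y):
    """Rows follow seq_y, columns follow seq_x; True marks a matching cell."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x, seq_y):
    """Text picture of the dot matrix, with '*' for a dot and '.' otherwise."""
    rows = ["  " + " ".join(seq_x)]
    for b, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        rows.append(b + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(rows)


def longest_diagonal(seq_x, seq_y):
    """Return (length, start_x, start_y) of the longest diagonal run of dots."""
    best = (0, 0, 0)
    prev = [0] * (len(seq_x) + 1)  # run lengths ending in the previous row
    for i, b in enumerate(seq_y, start=1):
        curr = [0] * (len(seq_x) + 1)
        for j, a in enumerate(seq_x, start=1):
            if a == b:
                curr[j] = prev[j - 1] + 1  # extend the run from the upper-left cell
                if curr[j] > best[0]:
                    best = (curr[j], j - curr[j], i - curr[j])
        prev = curr
    return best


print(render("AGCTAGGA", "CACTAGGC"))
length, x, y = longest_diagonal("AGCTAGGA", "CACTAGGC")
print(length, "AGCTAGGA"[x:x + length])  # 5 CTAGG
```

The longest diagonal here is the shared word CTAGG, which is the core of the alignment given above.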
Dynamic programming algorithm

* The approach compares every pair of characters in the two sequences
  and generates an alignment that is optimal under the chosen scoring
  scheme. This procedure assigns scores for matches, mismatches and
  gaps. It generates a matrix of numbers that represents all possible
  alignments between the sequences. The highest-scoring path through
  the matrix defines an optimal alignment.
* The method can be used to align nucleotide or protein sequences.
  It is computationally demanding: the time and memory required grow
  with the product of the two sequence lengths.
Algorithmic improvements as well as increasing computer capacity
make it possible to align a query sequence against a large database in a
few minutes.
The dynamic programming approach to sequence alignment builds each score
from the best of the previously computed results.

Try to align two sequences by inserting gaps at different locations, so as to
maximize the score of the alignment.

The score is determined by a "match award", a "mismatch penalty" and a
"gap penalty". The higher the score, the better the alignment.

If both penalties are set to 0, the method simply finds an alignment with the
maximum number of matches.
It is used to compare two DNA or protein sequences, for example to
predict whether they have similar functions.

Global alignment programs are based on the Needleman-Wunsch algorithm
(1970) and local alignment programs are based on the
Smith-Waterman algorithm (1981).
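
The global case can be sketched as follows. This is a minimal illustration of the Needleman-Wunsch recurrence with a simple traceback, not the code of any particular tool; the function name needleman_wunsch and the default scores (+1 match, -1 mismatch, -1 gap) are assumptions.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                              score[i - 1][j] + gap,      # gap in t
                              score[i][j - 1] + gap)      # gap in s
    # Traceback from the bottom-right corner to recover one optimal alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, a, b = needleman_wunsch("VIVALASVEGAS", "VIVADAVIS")
print(best)
print(a)
print(b)
```

Several alignments can share the optimal score, so the alignment printed depends on the order in which the traceback tries the three moves.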
Scoring function

The cost for aligning the two sequences s = VIVALASVEGAS and
t = VIVADAVIS as

A(s,t) = V I V A L A S V E G A S
         | | | |   |   |       |
         V I V A D A - V - - I S

(the dashes are indels) is, with +1 per match and -1 per mismatch or gap:

M(A) = 7 matches + 2 mismatches + 3 gaps
     = 7 - 2 - 3 = 2
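
The same arithmetic can be written as a small function. This is a sketch under the scoring used above (+1 match, -1 mismatch, -1 gap); the function name alignment_score is an assumption for illustration.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum the column scores of an alignment given as two equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # indel column
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("VIVALASVEGAS", "VIVADA-V--IS"))  # 7 - 2 - 3 = 2
# With both penalties set to 0 the score is just the number of matches
print(alignment_score("VIVALASVEGAS", "VIVADA-V--IS", mismatch=0, gap=0))  # 7
```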
* Word methods, also known as k-tuple methods, are heuristic
  methods that are not guaranteed to find an optimal alignment
  solution, but are significantly more efficient than dynamic
  programming.
* The typical tools used for this method are BLAST and FASTA.
• BLAST
  * Heuristic method to find the highest-scoring locally optimal alignments
  * Allows multiple hits to the same sequence
  * Based on the statistics of ungapped sequence alignments
  * The statistics allow estimating the probability of obtaining an
    ungapped alignment with a given score by chance
  * Uses dynamic programming in a narrow region
• FASTA
  * Fast sequence search
  * Based on the dot plot idea
  * Identifies identical words (k-tuples)
  * Searches for significant diagonals
  * Uses dynamic programming in a narrow region
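
The word-matching step shared by these tools can be sketched as below. This only illustrates the k-tuple idea (index the words of one sequence, look up the words of the other, and see which diagonal collects the most hits); it is not the actual BLAST or FASTA implementation, and the names kmer_index, word_hits and best_diagonal are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_index(seq, k):
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def word_hits(query, subject, k=3):
    """All (query_pos, subject_pos) pairs where the two share an identical k-tuple."""
    index = kmer_index(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def best_diagonal(query, subject, k=3):
    """Return (diagonal, hit_count) for the diagonal (subject_pos - query_pos) with most hits."""
    diagonals = Counter(j - i for i, j in word_hits(query, subject, k))
    return diagonals.most_common(1)[0] if diagonals else None


print(word_hits("AGCTAGGA", "CACTAGGC"))      # [(2, 2), (3, 3), (4, 4)]
print(best_diagonal("AGCTAGGA", "CACTAGGC"))  # (0, 3)
```

A full tool would then run dynamic programming only in a narrow band around the best diagonals, which is where the speed-up over a full matrix comes from.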
The substitution score matrix
Substitution: exchange of one amino acid for another.

Conservative substitution: exchange of an amino acid for one with very
similar physicochemical properties, so that the protein's properties or
function are not affected.

A substitution score matrix gives a score for each possible amino acid substitution.
When calculating alignment scores, identical amino acids are given greater value
than substitutions, and among substitutions conservative substitutions are given
greater value than non-conservative substitutions.
Two widely used substitution matrices are PAM and BLOSUM.
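
How such a matrix enters an alignment score can be sketched with a toy example. The scores below are invented for illustration and are NOT taken from PAM or BLOSUM; the names TOY_SCORES, substitution_score and score_ungapped are likewise assumptions.

```python
# Toy scores for three amino acids; invented values, not PAM or BLOSUM entries
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 4,  # identical residues
    ("I", "L"): 2,    # conservative: both hydrophobic
    ("D", "L"): -3,   # non-conservative
    ("D", "I"): -3,   # non-conservative
}


def substitution_score(a, b, scores=TOY_SCORES):
    """Look up a symmetric substitution score for residues a and b."""
    key = (a, b) if (a, b) in scores else (b, a)
    return scores[key]


def score_ungapped(s, t, scores=TOY_SCORES):
    """Score two equal-length sequences column by column, without gaps."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(substitution_score(a, b, scores) for a, b in zip(s, t))


print(score_ungapped("LID", "LLD"))  # 4 + 2 + 4 = 10 (conservative change)
print(score_ungapped("LID", "LDD"))  # 4 - 3 + 4 = 5 (non-conservative change)
```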

PAM - Point Accepted Mutation (Margaret Dayhoff)
Based on closely related proteins
BLOSUM - Blocks Substitution Matrix (Henikoff and Henikoff)
Based on conserved blocks, clustered at a chosen similarity threshold

PAM                                  | BLOSUM
-------------------------------------+--------------------------------------------
Based on global alignments of        | Based on local alignments.
closely related proteins.            |
                                     |
PAM1 is calculated from comparisons  | BLOSUM62 is calculated from blocks in which
of sequences with no more than 1%    | sequences sharing 62% identity or more are
divergence.                          | clustered, so the counted substitutions
                                     | come from less similar pairs.
                                     |
Other PAM matrices are extrapolated  | All BLOSUM matrices are based on observed
from PAM1.                           | alignments. They are not extrapolated from
                                     | comparisons of closely related proteins.
PAM

A PAM is the substitution of one amino acid of a protein by another that is 'accepted'
by evolution. This implies that within some given species, the mutation has not
only arisen but has, over time, spread to essentially the entire species. One PAM
(PAM1) is a unit of evolutionary divergence in which 1% of the amino acids have
been changed (i.e. one point mutation per 100 residues).
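
Higher PAM matrices are obtained from PAM1 by extrapolation: the PAM1 mutation probability matrix is multiplied by itself n times to model n PAM units of divergence. The sketch below shows only this step on a toy two-letter alphabet with invented probabilities; real PAM matrices cover 20 amino acids and are then converted to log-odds scores, which is omitted here. The names mat_mul, mat_power and TOY_PAM1 are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each of two residues is replaced with probability 1% per PAM unit
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)
print([round(p, 3) for p in toy_pam250[0]])
```

After many PAM units the probability of finding the original residue falls well below 99%, because repeated and reverted changes at the same site accumulate.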

PAM is based on the mutation rates estimated from closely related proteins
and is dominated by the amino acid mutations caused by single base changes.

PAM is used to select groups of amino acids that represent conservative
substitutions in proteins, because it summarizes the observed replacements that
have taken place while conserving the structural and functional properties of
proteins.

Thus the PAM matrix provides an empirical determination of conservative
replacements.
BLOSUM

BLOSUM is based on the observed amino acid substitutions in a large set of
more than 2000 conserved amino acid patterns called blocks. These blocks are
found in a database of protein sequences representing more than 500 families of
related proteins and act as signatures of these protein families.
Multiple sequence alignment:
Determine the best alignment between multiple
(more than two) DNA, RNA or protein sequences.

Multiple alignment is an extension of pairwise alignment to
incorporate more than two sequences into an alignment.

Multiple alignment methods try to align all of the sequences in a
specified set.

A widely used multiple alignment tool is CLUSTAL W.
'W' stands for 'weighted' (sequences are weighted
differently).
• MSA is central to many bioinformatics
applications
• Phylogenetic tree
• Motifs
• Patterns
• Structure prediction (RNA, protein)
Three-step process
1.) Construct pairwise alignments
2.) Build guide tree
3.) Progressive alignment guided by the tree
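
The first two steps can be sketched as follows. This is a simplified illustration, not CLUSTAL W itself: edit distance stands in for the pairwise alignment scores, and the guide tree is built by simple average-linkage clustering, whereas CLUSTAL W builds its tree with neighbor joining. The names edit_distance and guide_tree and the example sequences are assumptions. Step 3 (the progressive alignment itself) is not shown.

```python
from itertools import combinations


def edit_distance(s, t):
    """Number of substitutions, insertions and deletions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # match or substitution
        prev = curr
    return prev[-1]


def guide_tree(seqs):
    """seqs maps name -> sequence; returns the join order as nested tuples of names."""
    # Step 1: pairwise distances between all sequences
    dist = {frozenset((a, b)): edit_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}
    # Each cluster is (subtree, member names)
    clusters = [(name, (name,)) for name in seqs]

    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    # Step 2: repeatedly join the two closest clusters
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


example = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG", "d": "TTTTGGGC"}
print(guide_tree(example))  # (('a', 'b'), ('c', 'd'))
```

The nested tuples give the order in which a progressive aligner would merge sequences: the closest pairs first, then the resulting groups.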