Module-II
If you know your own DNA sequence, then you know everything about yourself
sequences
January 22, 2025 Bioinformatics 3
Introduction: Significance of sequence alignment
Sequence alignment is useful for
Discovering functional, structural, and evolutionary
homology
Sequence similarity is the percentage of aligned residues that are
similar in
physicochemical properties such as size, charge, and
hydrophobicity
Sequence similarity can be quantified using percentages
Homology is a qualitative statement
Two sequences share 40% similarity
They are either homologous or non-homologous
Sequence similarity vs Identity
Sequence similarity and Sequence identity are synonymous for
nucleotide sequences
For protein sequences, however, the two concepts are very different
In a protein sequence alignment, Sequence identity refers to
The percentage of matches of the same amino acid residues
sequences
The other normalizes by the size of the shorter sequence
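A minimal sketch of how percent identity could be computed from an aligned pair, normalizing by the length of the shorter (ungapped) sequence as mentioned above. The function name `percent_identity`, the `-` gap symbol, and the example strings are assumptions made for illustration only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two aligned sequences.

    Assumption: the denominator is the ungapped length of the shorter sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Count columns where both residues are identical and neither is a gap.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )

    # Ungapped lengths of the two original sequences.
    len_a = len(aligned_a.replace(gap, ""))
    len_b = len(aligned_b.replace(gap, ""))
    shorter = min(len_a, len_b)
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")

    return 100.0 * matches / shorter


# Hypothetical aligned pair: 4 identical columns, shorter sequence has 5 residues.
print(percent_identity("ACGT-A", "ACCTGA"))
```

Normalizing by the shorter sequence tends to give a higher value than normalizing by the full alignment length, so the chosen denominator should always be reported with the percentage.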
Alignment is carried out from beginning to end of both sequences to
find the best possible alignment across the entire length between the
two sequences
Pairwise Sequence similarity Methods
Local alignment
It only finds local regions with the highest level of similarity
between the two sequences
Disadvantages:
It is often up to the user to construct a full alignment with insertions and
This process is iterated until values for all the cells are filled
The best score is put into the bottom right corner of an intermediate
matrix
Alignment Algorithms-Dynamic Programming Method
Thus, the scores are accumulated along the diagonal going from the
upper left corner to the lower right corner
Once the scores have been accumulated in matrix,
the next step is to find the path that represents the optimal alignment
This is done by tracing back through the matrix in reverse order from the
lower right-hand corner of the matrix toward the origin
The best matching path is the one that has the maximum total score
If two or more paths reach the same highest score, one is chosen
arbitrarily to represent the best alignment
The path can also move horizontally or vertically at a certain point,
which corresponds to introduction of a gap or an insertion or deletion
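The matrix filling and traceback described above can be sketched as follows. This is an illustrative global alignment in the style of Needleman-Wunsch, not the exact procedure from the slides: the scoring values (match +1, mismatch −1, gap −2), the single per-position gap penalty, and the function name `global_align` are all assumptions. When several paths reach the same score, this sketch prefers the diagonal move, which is one arbitrary way of choosing among ties.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment by dynamic programming (assumed linear gap penalty)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Accumulate scores from the upper left toward the lower right corner.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # diagonal: align two residues
                score[i - 1][j] + gap,    # vertical: gap in b
                score[i][j - 1] + gap,    # horizontal: gap in a
            )

    # Trace back from the lower right corner toward the origin.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences, used only to exercise the sketch.
print(global_align("ACGT", "AGT"))
```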
gaps that
represent insertions and deletions
In natural evolutionary processes, insertions and deletions are relatively rare
in comparison to substitutions
Introducing gaps should be made more difficult computationally
similarity scores
Gap Penalties:
If the penalty values are set too high, gaps may become too difficult
also unrealistic
Through empirical studies for globular proteins, a set of penalty values
programs
Another factor to consider is the cost difference between opening a gap
Thus, gap opening should have a much higher penalty than gap extension
The normal strategy is to use preset gap penalty values for introducing and
extending gaps
For example, one may use a −12/−1 scheme in which
the gap opening penalty is −12 and the gap extension penalty is −1
used, which assigns the same score for each gap position regardless of
whether it is opening or extending
However, this penalty scheme has been found to be less realistic than the
affine penalty
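To make the difference between the two schemes concrete, the sketch below computes the total penalty for a gap of a given length. It assumes the common convention that the first gap position is charged the opening penalty and every further position the extension penalty; the function names and the flat value of −4 are illustrative choices, not taken from the slides.

```python
def affine_gap_penalty(length: int, gap_open: int = -12, gap_extend: int = -1) -> int:
    """Total penalty for one gap under an affine scheme.

    Assumed convention: open + (length - 1) * extend.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def constant_gap_penalty(length: int, per_position: int = -4) -> int:
    """Total penalty when every gap position gets the same score."""
    if length < 0:
        raise ValueError("gap length must be non-negative")
    return length * per_position


# Compare both schemes for gaps of increasing length.
for k in (1, 2, 5, 10):
    print(k, affine_gap_penalty(k), constant_gap_penalty(k))
```

Under the −12/−1 scheme a single long gap is cheaper than several short ones of the same total length, which reflects the idea that one insertion or deletion event can span many residues.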
Gaps at the terminal regions are often treated with no penalty
lengths
Consequently, end gaps can be allowed to be free to avoid getting
unrealistic alignments
Smith–Waterman algorithm
Dynamic Programming for Global Alignment
The classical global pairwise alignment algorithm using dynamic
obtained
The goal of local alignment is
biological meaning
Typical values are –12 for gap opening, and –4 for gap extension
position)
If the residues are not the same, the mismatch score is taken as -3
Variables used:
i, j denote the row and column indices
M is the matrix value of the required cell (written as M_{i,j})
S is the score of the required cell (S_{i,j})
W is the gap score (gap penalty)
follows
The first residue (nucleotides or amino acids) in both sequences is ‘C’
When finding the maximum value for position M_{i,j}, one can notice that
there is no chance of seeing any negative values in the matrix, since we
are taking 0 as the lowest value
After filling each cell, keep a pointer back to the cell from which
the maximum score was obtained
In a similar fashion, fill in the values of all the cells of the matrix
Working of Smith-Waterman Algorithm
Each cell points back, through one or more pointers, to the cell(s) from which its
maximum score was obtained
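The matrix filling step with back pointers could look roughly like the sketch below. It assumes a match score of +5, a mismatch score of −3, and a single gap score of −4 per position; the function name `sw_fill`, the pointer labels, and the two short sequences (both starting with 'C') are hypothetical and are not the sequences from the worked example.

```python
def sw_fill(a: str, b: str, match: int = 5, mismatch: int = -3, gap: int = -4):
    """Fill a Smith-Waterman matrix and record back pointers for each cell."""
    rows, cols = len(a) + 1, len(b) + 1
    M = [[0] * cols for _ in range(rows)]                    # first row/column stay 0
    ptr = [[[] for _ in range(cols)] for _ in range(rows)]   # list of directions per cell

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            candidates = {
                "diag": M[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                "up": M[i - 1][j] + gap,      # gap in b
                "left": M[i][j - 1] + gap,    # gap in a
            }
            # 0 is the lowest allowed value, so no cell can become negative.
            best = max(0, *candidates.values())
            M[i][j] = best
            if best > 0:
                # A cell may have more than one pointer when scores tie.
                ptr[i][j] = [d for d, v in candidates.items() if v == best]

    return M, ptr


M, ptr = sw_fill("CGT", "CGA")
for row in M:
    print(row)
```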
Step 3: Tracing back the sequences for an optimal alignment
The final step in obtaining the alignment is the traceback. Prior to
that, one needs to find the maximum score obtained in the entire
matrix for the local alignment of the sequences
It is possible for the maximum score to be present in more than
one cell; in that case there may be two or more possible
alignments, and the best alignment is chosen by scoring each of them
In this example we can see the maximum score in the matrix as 18,
which is found in two positions that lead to multiple alignments, so
the best alignment has to be found
So the traceback begins from the position that has the highest
value and follows the pointers back: find the possible
predecessor, then move to the next predecessor, and continue until
a score of 0 is reached
Thus a local alignment is obtained and one can see the possible
alignments
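A compact sketch of the whole local alignment procedure, including the traceback from the highest-scoring cell until a score of 0 is reached. The scoring values (+5, −3, −4) follow the ones quoted in this section; the function name `smith_waterman`, the tie handling (first maximum found, diagonal move preferred), and the example sequences are assumptions. This sketch returns only one alignment even when the maximum occurs in several cells.

```python
def smith_waterman(a: str, b: str, match: int = 5, mismatch: int = -3, gap: int = -4):
    """Local alignment: fill the matrix, then trace back from the best cell."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    # Fill step: every cell is the maximum of 0 and its three predecessors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(0, M[i - 1][j - 1] + s, M[i - 1][j] + gap, M[i][j - 1] + gap)
            if M[i][j] > best:  # keeps the first cell holding the maximum
                best, best_pos = M[i][j], (i, j)

    # Traceback: start at the maximum and stop when a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences sharing the local region "CGT".
print(smith_waterman("TTCGTAA", "GGCGTCC"))
```

To list every optimal local alignment rather than one, the traceback would have to be started from each cell holding the maximum and would have to follow all tied pointers.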
The two alignments can each be given a score, with a match as +5,
a mismatch as -3, and a gap penalty as -4
Summing up the scores, both alignments give the same total
of 18, so both can be regarded as equally good alignments
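Scoring a finished alignment column by column can be sketched as below, using the values given above (+5 for a match, −3 for a mismatch, −4 for a gap). The function name `alignment_score` and the two aligned pairs are hypothetical; they are not the alignments from the worked example, but were chosen so that two different alignments of the same sequences reach the same total.

```python
def alignment_score(aligned_a: str, aligned_b: str,
                    match: int = 5, mismatch: int = -3, gap: int = -4) -> int:
    """Sum the column scores of an alignment ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap        # gap column
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# Two alternative alignments of the same hypothetical pair of sequences.
first = alignment_score("ACGTTAC", "ACCT-AC")
second = alignment_score("ACGTTAC", "ACC-TAC")
print(first, second)
```

Both hypothetical alignments contain five matches, one mismatch, and one gap, so each sums to 5 × 5 − 3 − 4 = 18 and neither can be preferred on score alone.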
If the gap score is assumed, the gap score can be added to the previous
alignment
In the above-mentioned example, one can see that the score in the bottom right-hand
corner is -1
The important point to be noted here is that there may be two or
Uses of MSA:
In order to characterize protein families, identify shared regions of
Most closely related sequences are aligned first, and then additional