Module-II

The document provides an overview of sequence alignment techniques in bioinformatics, emphasizing their significance in analyzing biological sequences for functional and evolutionary insights. It covers various alignment methods, including global and local alignments, pairwise sequence alignment algorithms, and the dot matrix and dynamic programming methods, highlighting their advantages and limitations. Additionally, it discusses concepts such as sequence homology, similarity, and identity, along with the importance of scoring matrices and gap penalties in alignment processes.

Uploaded by

kpavankumar887

Bioinformatics

— Unit II-Sequence Alignment Techniques—

Dr. Chandra Mohan D


Assistant Professor
Computer Science and Engineering Group
Indian Institute of Information Technology, Sri City

If you know your own DNA sequence, then you know everything about yourself

January 22, 2025 Bioinformatics 1


Outline
 Sequence Alignment
   Introduction (23-01-2024)
   Global Alignment
   Local Alignment
 Pairwise Sequence Alignment Algorithms
   Dot Matrix
   Dynamic programming methods
     Needleman-Wunsch and Smith-Waterman algorithms
   Scoring Matrices
 Probabilistic Foundations of Sequence Alignment
 Multiple sequence alignment: Uses of MSA
Introduction
 Sequence comparison lies at the heart of bioinformatics analysis
 It is an important step toward structural and functional analysis of newly determined sequences
 As new biological sequences are generated at exponential rates, sequence comparison is becoming increasingly important
   It is used to draw functional and evolutionary inferences by comparing a new protein with proteins already in the database
 The most fundamental process in this type of comparison is sequence alignment
 Sequences are compared by searching for common character patterns and establishing residue–residue correspondence among related sequences
Introduction: Significance of sequence alignment
 Sequence alignment is useful for discovering functional, structural, and evolutionary information in biological sequences
 Identifying the evolutionary relationships between sequences helps to characterize the function of unknown sequences
 When a sequence alignment reveals significant similarity among a group of sequences, they can be considered as belonging to the same family
 If one member of the family has a known structure and function, sequence alignment can be used to predict the structure and function of the uncharacterized sequences


Sequence homology vs similarity
 An important concept in sequence analysis is sequence homology
 When two sequences are descended from a common evolutionary origin, they are said to have a homologous relationship, or to share homology
 Sequence similarity is the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity
 Sequence similarity can be quantified as a percentage, whereas homology is a qualitative statement
   Two sequences can share 40% similarity
   But they are either homologous or non-homologous
Sequence similarity vs Identity
 Sequence similarity and sequence identity are synonymous for nucleotide sequences
 For protein sequences, however, the two concepts are very different
 In a protein sequence alignment, sequence identity refers to the percentage of matches of the same amino acid residues between two aligned sequences
 Identity is based on the number of characters that match exactly between the two sequences, so gaps do not count when assessing identity
 There are two ways to calculate sequence similarity/identity
   One uses the overall lengths of both sequences
   The other normalizes by the length of the shorter sequence


Sequence similarity vs Identity
 Normalizing by the shorter sequence has the effect that sequence identity is not transitive
 If X = Y and Y = Z (100% identity in each case), X is not necessarily identical to Z
 This follows from the way the identity measure is defined
 Example:
   X has the sequence AAGGCTT, Y has the sequence AAGGC, and Z has the sequence AAGGCAT
   Identity between X and Y is 100% {5 identical nucleotides / min[length(X), length(Y)]}
   Identity between Y and Z is also 100%
   But identity between X and Z is only about 86% {6 identical nucleotides / 7}


Sequence similarity vs Identity
 The first method uses the following formula for percentage sequence similarity (S):
S = [(Ls × 2)/(La + Lb)] × 100
   where Ls is the number of aligned residues with similar characteristics
   La and Lb are the total lengths of the two individual sequences
 The percentage sequence identity (I) is calculated in a similar fashion:
I = [(Li × 2)/(La + Lb)] × 100
   where Li is the number of aligned identical residues
 The second method derives the percentage of identical/similar residues over the full length of the shorter sequence using the formula
I(S) = [Li(s)/La] × 100
   where La is the length of the shorter of the two sequences
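
The two normalizations can be compared on the X, Y, Z example from the previous slide. The following is a minimal sketch, not part of the original slides: the function names and the `method` argument are illustrative assumptions, and the sequences are assumed to be already aligned, with `-` marking gaps.

```python
def count_identical(aligned_a: str, aligned_b: str) -> int:
    """Count columns where both sequences carry the same residue (gaps never count)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")


def percent_identity(aligned_a: str, aligned_b: str, method: str = "both") -> float:
    """Percent identity using one of the two normalizations described above.

    method="both":    I = [(Li x 2) / (La + Lb)] x 100
    method="shorter": I = [Li / length of the shorter sequence] x 100
    """
    li = count_identical(aligned_a, aligned_b)
    la = len(aligned_a.replace("-", ""))  # ungapped length of sequence a
    lb = len(aligned_b.replace("-", ""))  # ungapped length of sequence b
    if method == "both":
        return 100.0 * 2 * li / (la + lb)
    if method == "shorter":
        return 100.0 * li / min(la, lb)
    raise ValueError("method must be 'both' or 'shorter'")


if __name__ == "__main__":
    x, y, z = "AAGGCTT", "AAGGC--", "AAGGCAT"
    print(percent_identity(x, y, "shorter"))  # X vs Y
    print(percent_identity(x, z, "shorter"))  # X vs Z
    print(percent_identity(x, y, "both"))     # X vs Y, normalized by both lengths
```

Normalizing by both lengths penalizes the length difference between X and Y, which is why the two methods can disagree on the same alignment.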



Pairwise Sequence Similarity Methods
 Pairwise sequence alignment is the process of aligning two sequences and is the basis of
   Database similarity searching and
   Multiple sequence alignment
 The overall goal of pairwise sequence alignment is to find the best pairing of two sequences, such that there is maximum correspondence among residues
 To achieve this, one sequence is shifted relative to the other to find the position where the most matches are found
 There are two different alignment strategies
   Global alignment: alignment is carried out from beginning to end of both sequences to find the best possible alignment across their entire length
Pairwise Sequence similarity Methods
   Local alignment: only the local regions with the highest level of similarity between the two sequences are aligned


Alignment Algorithms-Dot Matrix Method
 Global and local alignment algorithms are fundamentally similar and differ only in the optimization strategy
 Both types of algorithms can be based on one of three methods:
   The dot matrix method
   The dynamic programming method
   The word method
 The dot matrix method:
   The most basic sequence alignment method is the dot matrix method, also known as the dot plot method
   It is a graphical way of comparing two sequences in a 2D matrix


Alignment Algorithms-Dot Matrix Method
 In a dot matrix, the two sequences to be compared are written along the horizontal and vertical axes of the matrix
 The comparison is done by scanning each residue of one sequence for similarity with all residues in the other sequence
 If a residue match is found, a dot is placed within the graph; otherwise, the matrix position is left blank
 When the two sequences have substantial regions of similarity, many dots line up to form contiguous diagonal lines, which reveal the sequence alignment
 Interruptions in the middle of a diagonal line indicate insertions or deletions
 Parallel diagonal lines within the matrix represent repetitive regions of the sequences
Alignment Algorithms-Dot Matrix Method
 A problem with comparing large sequences using the dot matrix method is the high noise level
 In most dot plots, dots are plotted all over the graph, obscuring identification of the true alignment
 For DNA sequences, the problem is particularly acute because
   There are only four possible characters in DNA and
   Each residue therefore has a one-in-four chance of matching a residue in another sequence
 To reduce noise, instead of using a single residue to scan for similarity, a filtering technique is applied that uses a “window” of fixed length covering a stretch of residue pairs
 When filtering is applied, windows slide across the two sequences to compare all possible stretches
Alignment Algorithms-Dot Matrix Method
 Dots are placed only when a stretch of residues equal to the window size from one sequence matches completely with a stretch of the other sequence
 This method has been shown to be effective in reducing the noise level
 The window is also called a tuple; its size can be adjusted so that a clear pattern of sequence match can be plotted
 However, if the selected window size is too long, sensitivity of the alignment is lost
 There are many variations of the dot plot method
 A sequence can be aligned with itself to identify internal repeat elements
 In the self-comparison, there is a main diagonal for the perfect matching of each residue with itself
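
The windowed dot matrix described above can be sketched in a few lines of Python. This is an illustrative sketch rather than a tool from the slides: the function names `dot_plot` and `render` and the text rendering with `*` and `.` are assumptions made for the example.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 1) -> set:
    """Return the set of (row, col) positions where a dot is placed.

    Rows index seq_b (vertical axis) and columns index seq_a (horizontal axis).
    A dot is placed only when `window` consecutive residues match exactly.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    dots = set()
    for i in range(len(seq_b) - window + 1):
        for j in range(len(seq_a) - window + 1):
            if seq_b[i:i + window] == seq_a[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_a: str, seq_b: str, window: int = 1) -> str:
    """Draw the dot matrix as text, one row per residue of seq_b."""
    dots = dot_plot(seq_a, seq_b, window)
    lines = ["  " + seq_a]
    for i, residue in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_a)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    repeat = "ATTCGATTCGATTCG"
    # Self-comparison: the main diagonal plus parallel diagonals for the repeats.
    print(render(repeat, repeat, window=3))
```

Raising `window` removes isolated single-residue dots, while a self-comparison of a tandem repeat shows parallel lines on both sides of the main diagonal.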



Alignment Algorithms-Dot Matrix Method
 If repeats are present, short parallel lines are observed above and below the main diagonal
 Self-complementarity of DNA sequences (inverted repeats) can also be identified using a dot plot
Advantages:
 It gives a direct visual relationship between two sequences and makes it easy to identify the regions of greatest similarity
 Repeat regions can be identified from the presence of parallel diagonals of the same size, offset vertically or horizontally in the matrix
 It is useful in identifying chromosomal repeats and in comparing gene order conservation between two closely related genomes
 It can also be used to identify nucleic acid secondary structures by detecting self-complementarity of a sequence


Alignment Algorithms-Dot Matrix Method
Types of Repeats in DNA:
 Tandem Repeat: 5'-ATTCG ATTCG ATTCG-3'
 Mirror Repeat: 5'-GCATGGTACG-3'
 Pairing Repeat: 5'-TCTAGTTAGATCAA-3'
 Inverted Repeat: 5'-TTACGnnnnnnCGTAA-3'
Disadvantages:
 It is often up to the user to construct a full alignment with insertions and deletions by linking nearby diagonals
 Another limitation of this visual analysis method is that it lacks statistical rigor in assessing the quality of the alignment
 The method is also restricted to pairwise alignment
 It is difficult for the method to scale up to multiple alignment


Alignment Algorithms-Dynamic Programming Method
 Dynamic programming is a method that determines the optimal alignment by matching two sequences for all possible pairs of characters
 It is fundamentally similar to the dot matrix method in that it also creates a 2D alignment grid
 However, it finds the alignment in a more quantitative way by converting the dot matrix into a scoring matrix that accounts for matches and mismatches between the sequences
 By searching for the set of highest scores in this matrix, the best alignment can be accurately obtained
 Residue matches are scored according to a particular scoring matrix
 The scores are calculated one row at a time
 The scanning of the second row takes into account the scores already obtained for the first row


Alignment Algorithms-Dynamic Programming Method

 This process is iterated until all the cells are filled
 The best score is put into the bottom right corner of an intermediate matrix
Alignment Algorithms-Dynamic Programming Method
 Thus, the scores are accumulated along the diagonal going from the upper left corner to the lower right corner
 Once the scores have been accumulated in the matrix, the next step is to find the path that represents the optimal alignment
 This is done by tracing back through the matrix in reverse order, from the lower right-hand corner toward the origin
 The best matching path is the one that has the maximum total score
 If two or more paths reach the same highest score, one is chosen arbitrarily to represent the best alignment
 The path can also move horizontally or vertically at a certain point, which corresponds to introducing a gap (an insertion or deletion) in one of the two sequences


Alignment Algorithms-Dynamic Programming Method
Gap Penalties:
 Performing an optimal alignment between sequences often involves introducing gaps that represent insertions and deletions
 In natural evolutionary processes, insertions and deletions are relatively rare in comparison to substitutions
 Introducing gaps should therefore be made costly by means of a gap penalty
 However, assigning penalty values can be more or less arbitrary because there is no evolutionary theory to determine a precise cost for introducing insertions and deletions
 If the penalty values are set too low, gaps can become so numerous that even nonrelated sequences can be matched up with high similarity scores
Alignment Algorithms-Dynamic Programming Method
Gap Penalties:
 If the penalty values are set too high, gaps become too difficult to introduce and a reasonable alignment cannot be achieved, which is also unrealistic
 Through empirical studies of globular proteins, a set of penalty values has been developed that appears to suit most alignment purposes
 These are normally implemented as default values in most alignment programs
 Another factor to consider is the cost difference between opening a gap and extending an existing gap
 It is known that it is easier to extend a gap that has already been started
 Thus, gap opening should have a much higher penalty than gap extension


Alignment Algorithms-Dynamic Programming Method
Gap Penalties:
 These differential gap penalties are also referred to as affine gap penalties
 The normal strategy is to use preset gap penalty values for introducing and extending gaps
 For example, one may use a −12/−1 scheme in which the gap opening penalty is −12 and the gap extension penalty is −1
 The total gap penalty (W) is a linear function of the gap length k, calculated using the formula:
W = γ + δ × (k − 1)
where γ is the gap opening penalty, δ is the gap extension penalty, and k is the length of the gap
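
As a quick check of the formula, the affine penalty can be evaluated for a few gap lengths. This is a small sketch; the function name `affine_gap_penalty` and its parameter names are illustrative, and the defaults follow the −12/−1 scheme mentioned above.

```python
def affine_gap_penalty(k: int, gap_open: int = -12, gap_extend: int = -1) -> int:
    """Total penalty W = gap_open + gap_extend * (k - 1) for a gap of length k."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return gap_open + gap_extend * (k - 1)


if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, affine_gap_penalty(k))
```

Under this scheme one gap of length 5 costs −16, whereas five separate gaps of length 1 would cost −60, which is how the affine penalty favors extending an existing gap over opening new ones.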



Alignment Algorithms-Dynamic Programming Method
Gap Penalties:
 Besides the affine gap penalty, a constant gap penalty is sometimes used, which assigns the same score to each gap position regardless of whether it opens or extends a gap
 However, this penalty scheme has been found to be less realistic than the affine penalty
 Gaps at the terminal regions are often given no penalty because, in reality, many truly homologous sequences are of different lengths
 Consequently, end gaps can be left free to avoid unrealistic alignments


Alignment Algorithms-Dynamic Programming Method
Dynamic Programming for Local Alignment
 In regular sequence alignment, the divergence level between the two sequences to be aligned is not easily known
 The lengths of the two sequences may also be unequal
 In such cases, identifying regional sequence similarity may be of greater significance than finding a match that includes all residues
 The first application of dynamic programming to local alignment is the Smith–Waterman algorithm
Dynamic Programming for Global Alignment
 The classical global pairwise alignment algorithm using dynamic programming is the Needleman–Wunsch algorithm
 In this algorithm, an optimal alignment is obtained over the entire lengths of the two sequences


Alignment Algorithms-Dynamic Programming Method
 The global alignment must extend from the beginning to the end of both sequences to achieve the highest total score
 In other words, the traceback path has to go from the bottom right corner of the matrix to the top left corner
 The drawback of focusing on a maximum score for the full-length alignment is the risk of missing the best local similarity
 This strategy is suitable only for aligning two closely related sequences of about the same length
 For divergent sequences or sequences with different domain structures, the approach does not produce a biologically meaningful optimal alignment
 One of the few web servers dedicated to global pairwise alignment is GAP (Global Alignment Program)


Alignment Algorithms-Dynamic Programming Method
Dynamic Programming for Local Alignment
 As in global alignment, the final result is influenced by the choice of scoring system
 Occasionally, several optimally aligned segments with the best scores are obtained
 The goal of local alignment is to get the highest alignment score locally
 This approach may be suitable for aligning divergent sequences or sequences with multiple domains that may be of different origins
 Most commonly used pairwise alignment web servers apply the local alignment strategy, including SIM, SSEARCH, and LALIGN


Dynamic Programming Algorithm: Smith Waterman
 The Smith–Waterman algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981
 The algorithm performs local sequence alignment
 It identifies conserved regions between the two sequences
 It can align two partially overlapping sequences
 It can also align a subsequence of a sequence to the sequence itself
 These are the main advantages of local sequence alignment
Gap score or gap penalty:
 Dynamic programming algorithms use gap penalties to keep the alignment biologically meaningful
 Typical values are −12 for gap opening and −4 for gap extension


Dynamic Programming Algorithm: Smith Waterman
Assumed scoring schema:
 If the residues (nucleotides or amino acids) are the same in both sequences, the match score S(i,j) is assumed to be +5
   It is added to the value of the cell diagonally adjacent to the current cell (position i, j)
 If the residues are not the same, the mismatch score is assumed to be -3
   This score is likewise added to the value of the diagonally adjacent cell
 The gap penalty is assumed to be -4
   It is added to the values of the cells to the left of and above the current cell


Working of Smith-Waterman Algorithm
 The basic steps of the algorithm are:
1. Initialization of a matrix
2. Matrix filling with the appropriate scores
3. Trace back through the matrix for a suitable alignment
 To study local sequence alignment, consider the following sequences:
CGTGAATTCAT (sequence #1 or A)
GACTTAC (sequence #2 or B)
Step1: Initialization of Matrix
 The two sequences are arranged in a matrix with A+1 columns and B+1 rows, where A and B are the lengths of the two sequences
 The values in the first row and first column are set to zero


Working of Smith-Waterman Algorithm

Variables used:
i, j are the row and column indices
M(i,j) is the matrix value of the cell at row i, column j
S(i,j) is the match or mismatch score for the residues at row i, column j
W is the gap penalty


Working of Smith-Waterman Algorithm
Step2: Matrix Filling
 The second and crucial step of the algorithm is filling the entire matrix; to fill each cell, the values of its neighbors (diagonal, upper, and left) must be known
M(i,j) = Maximum{M(i-1,j-1) + S(i,j), M(i,j-1) + W, M(i-1,j) + W, 0}
 Fill the entire matrix using the assumed scoring schema and the initial values
 Filling starts with the cell next to the first row and first column, as follows
 The first residues of the two sequences are ‘C’ and ‘G’; the match or mismatch score is added to the diagonally located neighboring value, which is 0


Working of Smith-Waterman Algorithm
 The gap penalty is added to the upper and left neighboring values from the matrix
 So the scoring equation for the first cell is
M(1,1) = Maximum{M(0,0) + S(1,1), M(1,0) + W, M(0,1) + W, 0}
= Maximum{0 + (-3), 0 + (-4), 0 + (-4), 0}
= Maximum{-3, -4, -4, 0}
= 0
 From the above calculation, the maximum value obtained is 0
 Because 0 is always one of the candidates for M(i,j), no cell in the matrix can hold a negative value
 After filling a cell, place a pointer back to the cell from which the maximum score was obtained
 Fill all the remaining cells of the matrix in the same fashion
Working of Smith-Waterman Algorithm
 Each cell is back-pointed by one or more pointers to the cells from which its maximum score was obtained
Step3: Trace back for an optimal alignment
 The final step is the trace back; before that, one needs to find the maximum score in the entire matrix, which marks the end of the best local alignment
Working of Smith-Waterman Algorithm
 The maximum score may be present in more than one cell; in that case there may be two or more alignments, and the best one is identified by scoring them
 In this example, the maximum score in the matrix is 18, which is found in two positions and leads to multiple alignments, so the best alignment has to be found
 The trace back begins from the position with the highest value and follows the pointers back to each predecessor in turn, continuing until a cell with score 0 is reached


Working of Smith-Waterman Algorithm
 Two pointers may point out from one cell; both paths (alignments) can then be considered, and the best one is found by scoring each and taking the maximum
 Thus a local alignment is obtained, and one can see the possible alignments
 The two alignments can each be scored using +5 for a match, -3 for a mismatch, and -4 for a gap


Working of Smith-Waterman Algorithm

 Summing the scores gives 18 for both alignments, so both can be regarded as equally good
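
The three steps (initialization, matrix filling, trace back) can be sketched in Python with the scoring schema assumed above (+5, -3, -4). This is a minimal illustration, not the lecturer's implementation: the function name, the linear (non-affine) gap penalty, and the tie-breaking order in the trace back (diagonal, then left, then up) are assumptions, and only one of the equally scoring alignments is returned.

```python
def smith_waterman(seq_a, seq_b, match=5, mismatch=-3, gap=-4):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1

    # Step 1: initialization. The first row and first column stay at zero.
    m = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Step 2: matrix filling, one row at a time.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + s,  # diagonal: match or mismatch
                          m[i][j - 1] + gap,    # left: gap in seq_b
                          m[i - 1][j] + gap,    # up: gap in seq_a
                          0)                    # never go below zero
            if m[i][j] > best:                  # remember the first highest cell
                best, best_pos = m[i][j], (i, j)

    # Step 3: trace back from the highest cell until a score of 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and m[i][j] > 0:
        s = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
        if m[i][j] == m[i - 1][j - 1] + s:
            out_a.append(seq_a[j - 1])
            out_b.append(seq_b[i - 1])
            i, j = i - 1, j - 1
        elif m[i][j] == m[i][j - 1] + gap:
            out_a.append(seq_a[j - 1])
            out_b.append("-")
            j -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[i - 1])
            i -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, a, b = smith_waterman("CGTGAATTCAT", "GACTTAC")
    print(score)
    print(a)
    print(b)
```

For the example sequences this sketch reports a score of 18 with GAATTCA aligned to GACTT-A; the other alignment with the same score ends at the second cell holding 18.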



Working of Needleman-Wunsch Algorithm
 To study the algorithm, consider the two sequences
CGTGAATTCAT (sequence #1), GACTTAC (sequence #2)
 The lengths (number of nucleotides or amino acids) of sequence 1 and sequence 2 are 11 and 7, respectively
 The initial matrix is created with A+1 columns and B+1 rows (where A and B are the lengths of the sequences)
 The extra row and column at the start of the matrix allow the alignment to begin with a gap
 The basic steps of the algorithm are:
   Initialization of a matrix
   Matrix filling with the appropriate scores
   Trace back through the matrix for a suitable alignment


Working of Needleman-Wunsch Algorithm
 After creating the initial matrix, a scoring schema has to be introduced; it can be user-defined with specific scores
 A simple scoring schema can be assumed as follows:
   If the two residues (nucleotides or amino acids) at the ith and jth positions are the same, the match score is 1 (S(i,j) = 1)
   If the two residues at the ith and jth positions are not the same, the mismatch score is -1 (S(i,j) = -1)
   The gap score (w), or gap penalty, is assumed to be -1
Step1: Initialization of Matrix
 The origin cell M(0,0) is set to 0
 When a gap score is used, each remaining cell of the first row and first column is the previous cell of that row or column plus the gap score (0, -1, -2, ...)
Working of Needleman-Wunsch Algorithm


Step2: Matrix Fill Step
 The second and crucial step of the algorithm is matrix filling, starting from the upper left-hand corner of the matrix
Working of Needleman-Wunsch Algorithm
 To find the maximum score of each cell, the neighboring scores (diagonal, upper, and left) of the current position must be known
 From the assumed values, add the match or mismatch score to the diagonal value
 Similarly, add the gap score to the upper and left neighboring values
 This gives three candidate values; take the maximum of them and fill position (i, j) with that score
 In terms of matrix positions, the three candidates are
[M(i-1,j-1)+S(i,j), M(i,j-1)+w, M(i-1,j)+w]
 Overall, the equation can be written as
M(i,j) = Maximum{M(i-1,j-1)+S(i,j), M(i,j-1)+w, M(i-1,j)+w}


Working of Needleman-Wunsch Algorithm
 To score the first position, M(1,1), the formula stated above is used
 The first residues of the two sequences are ‘G’ and ‘C’
 Since they are mismatching residues, the score is S(1,1) = -1
M(1,1) = Maximum{M(0,0)+S(1,1), M(1,0)+w, M(0,1)+w}
= Maximum{0+(-1), -1+(-1), -1+(-1)}
= Maximum{-1, -2, -2}
= -1
 The obtained score -1 is placed in position (1,1) of the scoring matrix
 Using the same equation and method, fill all the remaining rows and columns
Working of Needleman-Wunsch Algorithm
 Place back pointers from each cell to the cell or cells from which its maximum score was obtained; these are the predecessors of the current cell


Working of Needleman-Wunsch Algorithm
3. Trace back Step
 The final step in the algorithm is the trace back for the best alignment
 In the example above, the score in the bottom right-hand corner is -1
 Note that two or more alignments may be possible between the two example sequences
 The current cell, with value -1, has an immediate predecessor located diagonally, from which its maximum score was obtained; that cell's value is 0
 If two or more pointers point back from a cell, there can be two or more possible alignments


Working of Needleman-Wunsch Algorithm
 Continuing the trace back in this way, one reaches the 0th row, 0th column
 Following the steps described above, the alignment of the two sample sequences can be found
 When several alignments are possible, the best one is identified by the maximum alignment score under a user-defined scoring scheme (for example, match = 5, mismatch = -1, gap = -2)
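
The global alignment procedure can be sketched in the same way as the local one, using the simple schema of the walkthrough (match = 1, mismatch = -1, gap = -1). This is an illustrative sketch under stated assumptions: the function name, the linear gap penalty, and the tie-breaking order in the trace back (diagonal, then left, then up) are choices made here, so only one of the possible optimal alignments is returned.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1

    # Step 1: initialization. The origin is 0; the first row and column
    # accumulate the gap score (0, gap, 2*gap, ...).
    m = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        m[0][j] = m[0][j - 1] + gap
    for i in range(1, rows):
        m[i][0] = m[i - 1][0] + gap

    # Step 2: matrix filling from the upper left-hand corner.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + s,  # diagonal
                          m[i][j - 1] + gap,    # left
                          m[i - 1][j] + gap)    # up

    # Step 3: trace back from the bottom right corner to the origin.
    i, j = rows - 1, cols - 1
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
        if i > 0 and j > 0 and m[i][j] == m[i - 1][j - 1] + s:
            out_a.append(seq_a[j - 1])
            out_b.append(seq_b[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and m[i][j] == m[i][j - 1] + gap:
            out_a.append(seq_a[j - 1])
            out_b.append("-")
            j -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[i - 1])
            i -= 1
    return m[rows - 1][cols - 1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, a, b = needleman_wunsch("CGTGAATTCAT", "GACTTAC")
    print(score)
    print(a)
    print(b)
```

For the two example sequences the score in the bottom right corner is -1, in agreement with the walkthrough; which of the equally scoring alignments is printed depends on the tie-breaking order.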



Multiple Sequence Alignment (MSA)
 An MSA is an alignment of more than two sequences
 An MSA reveals the similarity among multiple sequences
Uses of MSA:
 Characterizing protein families by identifying shared regions of homology in a multiple sequence alignment
 Determining the consensus sequence of several aligned sequences
 Consensus sequences can help to develop a “sequence fingerprint” that allows the identification of members of a distantly related protein family (motifs)
 An MSA can help reveal biological facts about proteins, for example through analysis of secondary/tertiary structure
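
One of the uses listed above, deriving a consensus sequence, can be illustrated with a simple majority vote per alignment column. This is a hedged sketch: the majority rule, the function name `consensus`, and the handling of gap-only columns are assumptions for illustration, and real tools apply more elaborate rules such as thresholds and ambiguity codes.

```python
from collections import Counter


def consensus(aligned_seqs, gap="-"):
    """Return the most common non-gap residue in each column of an MSA."""
    if not aligned_seqs:
        raise ValueError("at least one sequence is required")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*aligned_seqs):
        residues = [c for c in column if c != gap]
        # A column made only of gaps stays a gap in the consensus.
        result.append(Counter(residues).most_common(1)[0][0] if residues else gap)
    return "".join(result)


if __name__ == "__main__":
    msa = ["ACGT", "ACGA", "ACTT"]
    print(consensus(msa))
```

When two residues are equally frequent in a column, this sketch keeps whichever one was seen first, which is an arbitrary choice.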


Multiple Sequence Alignment (MSA)
Types of MSA
 Dynamic Programming Approach
   Computes an optimal alignment for a given score function
   Because of its high running time, it is not typically used in practice
 Progressive Alignment method
   This approach repeatedly aligns two sequences, two alignments, or a sequence with an alignment
   This method is also called the hierarchical or tree method
 Iterative refinement method
   Works similarly to progressive methods, but repeatedly realigns the initial sequences as well as adding new sequences to the growing MSA


MSA-Progressive Alignment method
 The most widely used approach
 Builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related
 Progressive alignment methods require two stages:
   A first stage in which the relationships between the sequences are represented as a tree, called a guide tree
   A second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree


Multiple Sequence Alignment (MSA)
MSA Using CLUSTALW
 Works by progressive alignment
 The most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments
 Uses alignment scores to produce a phylogenetic tree


MSA-Iterative refinement method
 It works similarly to the progressive alignment method, except that a refinement step is performed
 Refinement step: once a new sequence is added to the alignment, the initially aligned sequences are realigned in order to obtain the best alignment


Probabilistic Foundations of Sequence Alignment
