
Week 4

This document discusses pairwise sequence alignment and dynamic programming algorithms used for sequence alignment. It provides the following key points: 1. Dynamic programming algorithms allow efficient exploration of all possible sequence alignments to find the optimal alignment without exploring non-optimal alignments. 2. The Needleman-Wunsch algorithm is an example of a dynamic programming approach that can find the globally optimal alignment of two sequences by filling a score matrix. 3. Filling the score matrix proceeds by considering all possible ways to align each additional residue pair, accounting for match, mismatch and gap penalties to optimize the total alignment score.

Uploaded by

Nurullah Mertel
Bioinformatics

Pairwise Sequence Alignment


Assoc. Prof. Dr. Gazi Erkan BOSTANCI

Slides are mainly based on ‘Understanding Bioinformatics’ by Marketa
Zvelebil and Jeremy O. Baum
• Given two sequences, and allowing gaps to be inserted, it is possible to
construct a very large number of alignments.

• Of these, there will be an optimal alignment, which in the ideal case
perfectly identifies the true equivalences between the sequences.
• However, there will be many alternative alignments with varying
degrees of error that could potentially be seriously misleading.

• Furthermore, the fact that an alignment can be constructed for any
two sequences, even ones with no meaningful equivalences, has the
potential to be even more misleading.

• Therefore, all useful methods of sequence alignment must not only
generate alignments but also be able to compare them in a
meaningful way and to provide an assessment of their significance.
• The number of alternative alignments is so great, however, that efficient
methods are required to determine those with optimal scores. Fortunately,
algorithms have been derived that can be guaranteed to identify the optimal
alignment between two sequences for a given scoring scheme.
• As long as only single proteins or genes, or small segments of
genomes, are aligned, these methods can be applied with ease on
today’s computers.

• When searching for alignments of a query sequence with a whole
database of sequences it is usual practice to use more approximate
methods that speed up the search.
• Finding the best-scoring alignment between two sequences does not
guarantee the alignment has any scientific validity. Ways must be
found to discriminate between fortuitously good alignments and
those due to a real evolutionary relationship.

• The large number of complete genome sequences has led to
increased interest in aligning very long sequences such as whole
genomes and chromosomes.

• These applications require a number of approximations and
techniques to increase the speed and reduce the storage
requirements.
Dynamic Programming Algorithms
• For any given pair of sequences, if gaps are allowed there is a large
number of possibilities to consider in determining the best-scoring
alignment.
• For example, two sequences of length 1000 have approximately $10^{600}$
different alignments, vastly more than there are particles in the
universe! Given the number and length of known sequences it would
seem impossible to explore all these possibilities.
• Nevertheless, a class of algorithms has been introduced that is able to
efficiently explore the full range of alignments under a variety of
different constraints. They are known as dynamic programming
algorithms, and efficiently avoid needless exploration of the majority
of alignments that can be shown to be nonoptimal.
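
As a rough check on the scale quoted above, the sketch below evaluates the central binomial coefficient C(2n, n) for n = 1000. Using C(2n, n) as the count is an assumption: it is one common counting convention, and the slides do not say how their figure was derived. The function name `alignment_count` is hypothetical.

```python
from math import comb


def alignment_count(n: int) -> int:
    """Central binomial coefficient C(2n, n).

    Assumed here as an order-of-magnitude estimate of the number of
    global alignments of two sequences of length n.
    """
    return comb(2 * n, n)


count = alignment_count(1000)
# The number of decimal digits gives the order of magnitude.
print(len(str(count)))  # 601 digits, i.e. of order 10^600
```

Under this assumption the estimate agrees with the order of magnitude quoted in the slide, which is why exhaustive enumeration is out of the question.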
• The key property of dynamic programming is that the problem can be
divided into many smaller parts. Consider the following alignment:

$$
\begin{array}{ccc|ccc|ccc}
X_1 & \cdots & X_u & X_{u+1} & \cdots & X_v & X_{v+1} & \cdots & X_L \\
Y_1 & \cdots & Y_u & Y_{u+1} & \cdots & Y_v & Y_{v+1} & \cdots & Y_L
\end{array}
$$

in which the subscripts u, v, etc. refer to alignment positions rather
than residue types, so that Xu, Yv, and so on each correspond to a
residue or to a gap.
The alignment has been divided into three parts, with positions labeled
1 -> u, u + 1 -> v, and v + 1 -> L.
Because scores for the individual positions are added together, the
score of the whole alignment is the sum of the scores of the three
parts; that is, their contributions to the score are independent. Thus,
the optimal global alignment can be reduced to the problem of
determining the optimal alignments of smaller sections.
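
The additivity argument can be checked on a toy case. The sketch below scores an alignment column by column and confirms that the three parts sum to the whole. The scoring values, the example alignment and the cut points `u` and `v` are all hypothetical, chosen only for illustration, and the check applies to a linear gap penalty.

```python
def column_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one alignment column; '-' marks a gap (toy values, assumed)."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def alignment_score(top, bottom):
    """Sum of independent column scores (linear gap penalty)."""
    return sum(column_score(a, b) for a, b in zip(top, bottom))


top, bottom = "ACGT-TAC", "ACG-CTGC"   # hypothetical alignment
u, v = 3, 6                            # hypothetical cut points
parts = [
    alignment_score(top[:u], bottom[:u]),    # positions 1 -> u
    alignment_score(top[u:v], bottom[u:v]),  # positions u + 1 -> v
    alignment_score(top[v:], bottom[v:]),    # positions v + 1 -> L
]
total = alignment_score(top, bottom)
print(parts, total)  # [3, -3, 0] 0
```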
• A corollary to this is that the global optimal alignment will not contain parts that
are not themselves optimal. While affine gap penalties require a slightly more
sophisticated argument, essentially the same property holds true for them as
well.

• Starting with sufficiently short sub-sequences, for example the first residue of
each sequence, the optimal alignment can easily be determined, allowing for all
possible gaps.
• Subsequently, further residues can be added to this, at most one from each
sequence at any step. At each stage, the previously determined optimal
subsequence alignment can be assumed to persist, so only the score for adding
the next residue needs to be investigated.

• A worked example later in the section will make this clear. In this way, the optimal
global alignment can be grown from one end of the sequence.

• As an alignment of two sequences will consist of pairs of aligned residues, a
rectangular matrix can conveniently represent these, with rows corresponding to
the residues of one sequence, and columns to those of the other.
• Until the global optimal alignment has been obtained, it is not known which
actual residues are aligned. All possibilities must be considered or the
optimal alignment could be missed. This is not as impossible as it might
seem.
• Saul Needleman and Christian Wunsch published the original dynamic
programming application in this field in 1970. Since then many variations
and improvements have been made, some of which will be described here.
There have been three different motivations for developing these
modifications:
• Firstly, global and local alignments require slightly different algorithms.
• Secondly, but less commonly, certain gap-penalty functions and the desire to
optimize scoring parameters have resulted in further new schemes.
• Lastly, especially in the past, the computational requirements of the algorithms
prevented some general applications. For example, the basic technique in a standard
implementation requires computer memory proportional to the product m*n for two
sequences of length m and n. Some algorithms have been proposed that reduce this
demand considerably.
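
To give a feel for the m*n memory demand, the sketch below estimates the size of a full score matrix. The figure of 4 bytes per cell is an assumption (real implementations vary and may store traceback data as well), and `full_matrix_bytes` is a hypothetical helper name.

```python
def full_matrix_bytes(m, n, bytes_per_cell=4):
    """Memory for an (m + 1) x (n + 1) score matrix.

    bytes_per_cell=4 is an assumed value, e.g. one 32-bit integer score.
    """
    return (m + 1) * (n + 1) * bytes_per_cell


for m, n in [(1_000, 1_000), (1_000_000, 1_000_000)]:
    gigabytes = full_matrix_bytes(m, n) / 1e9
    print(f"{m} x {n}: {gigabytes:.3f} GB")
```

Under these assumptions two protein-sized sequences need a few megabytes, whereas two sequences of a million residues each would need roughly 4000 GB.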
Optimal global alignments are produced using efficient variations of the Needleman–Wunsch algorithm
• We will introduce dynamic programming methods by describing their
use to find optimal global alignments. Needleman and Wunsch were
the first to propose this method, but the algorithm given here is not
their original one, because significantly faster methods of achieving
the same goal have since been developed.

• The problem is to align sequence x (x1x2x3…xm) and sequence y
(y1y2y3…yn), finding the best scoring alignment in which all residues of
both sequences are included. The score will be assumed to be a
measure of similarity, so that the highest score is desired.
• The key concept in all these algorithms is the matrix S of optimal
scores of subsequence alignments. The matrix has (m + 1) rows
labeled 0 -> m and (n + 1) columns labeled 0 -> n.

• The rows correspond to the residues of sequence x, and the columns
to those of sequence y. We shall use as a working example the
alignment of the sequences x = THISLINE and y = ISALIGNED, with the
BLOSUM-62 substitution matrix as the scoring matrix.
• The BLOSUM-62 substitution
matrix scores in half bits.
• Scores that would be given to
identical matched residues are in
blue; positive scores for
nonidentical matched residues are
in red.

• The latter represent pairs of
residues for which substitutions
were observed relatively often in
the aligned reference sequences.
• Because the sequences are small they can be aligned manually, and
so we can see that the optimal alignment is:

THIS-LI-NE-
--ISALIGNED

• This alignment might not produce the optimal score if the gap penalty
were set very high relative to the substitution matrix values, but in
this case it could be argued that the scoring parameters would then
not be appropriate for the problem.

• In the matrix in figure below, the element Si,j is used to store the score
for the optimal alignment of all residues up to xi of sequence x with
all residues up to yj of sequence y.
• The initial stage of filling in
the dynamic programming
matrix to find the optimal
global alignment of the two
sequences THISLINE and
ISALIGNED.
• The initial stage of filling in
the matrix depends only on
the linear gap penalty with E
set to –8.
• The arrows indicate the
cell(s) to which each cell
value contributes.
• The sequences (x1x2x3…xi) and (y1y2y3…yj) with i < m and j < n are called
subsequences. Column Si,0 and row S0,j correspond to the alignment of the
first i or j residues with the same number of gaps. Thus, element S0,3 is the
score for aligning sub-sequence y1y2y3 with a gap of length 3.

• To fill in this matrix, one starts by aligning the beginning of each sequence;
that is, in the extreme upper left-hand corner.

• The elements Si,0 and S0,j are easy to fill in, because there is only one possible
alignment available. Si,0 represents the alignment of the first i residues of
sequence x with i gaps, and S0,j the alignment of the first j residues of
sequence y with j gaps.
• We will start by considering a linear gap penalty g(ngap) = –8ngap for a gap
of ngap residues, giving the scores of Si,0 and S0,j as –8i and –8j,
respectively. This starting point with numerical values inserted into
the matrix is illustrated in the figure above.
• The other matrix elements are filled in according to simple rules that
can be understood by considering a process of adding one position at
a time to the alignment.
• There are only three options for any given position:
• a pairing of residues from both sequences,
• and the two possibilities of a residue from one sequence aligning with a gap
in the other.
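• These three options can be written as:

$$
\begin{pmatrix} x_i \\ y_j \end{pmatrix}
\qquad
\begin{pmatrix} - \\ y_j \end{pmatrix}
\qquad
\begin{pmatrix} x_i \\ - \end{pmatrix}
$$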

• The scores associated with these are s(xi,yj), g, and g, respectively.

• The value of s(xi,yj) is given by the element sa,b of the substitution
score matrix, where a is the residue type of xi and b is the residue
type of yj.
• The change in notation is solely to improve the clarity of the following
equations.
• Consider the evaluation of element S1,1, so that the only residues that appear
in the alignment are x1 and y1. The left-hand possibility of the three
possibilities could only occur starting from S0,0, as all other alignments will
already contain at least one of these two residues. The middle possibility can
only occur from S1,0 because it requires an alignment that contains x1 but not
y1. Similar reasoning shows that the right-hand possibility can only occur from
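S0,1. The three possible alignments have the following scores:

$$
\begin{aligned}
S_{0,0} + s(x_1, y_1) &= 0 + s(\mathrm{T}, \mathrm{I}) = -1 \\
S_{1,0} + g &= -8 - 8 = -16 \\
S_{0,1} + g &= -8 - 8 = -16
\end{aligned}
$$

where s(T,I) has been obtained from BLOSUM62. Of these alternatives, the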
optimal one is clearly the first. Hence in figure below, S1,1 = –1. Because S1,1 has
been derived from S0,0 an arrow has been drawn linking them in the figure.
• An identical argument can be made to
construct any element of the matrix
from three others, using the formula:

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(x_i, y_j) \\
S_{i,j-1} + g \\
S_{i-1,j} + g
\end{cases}
$$

• The maximum (“max”) implies that we
are using a similarity score. The figure
illustrates this formula in the layout of
the matrix.
• Note that it is possible for more than
one of the three alternatives to give the
same optimal score, in which case
arrows are drawn for all optimal
alternatives.
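
The fill step can be sketched in Python as follows. This is a minimal illustration, not the implementation used to produce the figures. `ROWS` is a hand-transcribed subset of BLOSUM-62 covering only the residues of the working example, so it should be checked against an authoritative copy of the matrix. The names `RESIDUES`, `ROWS` and `fill_global` are hypothetical.

```python
# Subset of BLOSUM-62 for the residues in THISLINE / ISALIGNED
# (hand-transcribed; verify against an authoritative copy before reuse).
RESIDUES = "ADEGHILNST"
ROWS = [
    [ 4, -2, -1,  0, -2, -1, -1, -2,  1,  0],  # A
    [-2,  6,  2, -1, -1, -3, -4,  1,  0, -1],  # D
    [-1,  2,  5, -2,  0, -3, -3,  0,  0, -1],  # E
    [ 0, -1, -2,  6, -2, -4, -4,  0,  0, -2],  # G
    [-2, -1,  0, -2,  8, -3, -3,  1, -1, -2],  # H
    [-1, -3, -3, -4, -3,  4,  2, -3, -2, -1],  # I
    [-1, -4, -3, -4, -3,  2,  4, -3, -2, -1],  # L
    [-2,  1,  0,  0,  1, -3, -3,  6,  1,  0],  # N
    [ 1,  0,  0,  0, -1, -2, -2,  1,  4,  1],  # S
    [ 0, -1, -1, -2, -2, -1, -1,  0,  1,  5],  # T
]
BLOSUM62 = {(a, b): ROWS[i][j]
            for i, a in enumerate(RESIDUES)
            for j, b in enumerate(RESIDUES)}


def fill_global(x, y, gap):
    """Fill the global-alignment score matrix S with a linear gap penalty.

    Rows are indexed by residues of x, columns by residues of y.
    """
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: x residues against gaps
        S[i][0] = i * gap
    for j in range(1, n + 1):          # first row: y residues against gaps
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + BLOSUM62[x[i - 1], y[j - 1]],  # x_i paired with y_j
                S[i][j - 1] + gap,                               # y_j against a gap
                S[i - 1][j] + gap,                               # x_i against a gap
            )
    return S


S = fill_global("THISLINE", "ISALIGNED", gap=-8)
print(S[1][1], S[8][9])  # -1 -4
```

With these assumed matrix values the sketch gives S1,1 = –1 and a final score of –4 for a gap penalty of –8, in line with the values quoted in the slides.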
• The dynamic programming matrix used to find the optimal global alignment of
the two sequences THISLINE and ISALIGNED.
• (A) The completed matrix using the BLOSUM-62 scoring matrix and a linear gap
penalty, with E set to –8. The red arrows indicate steps used in the traceback of
the optimal alignment.
• (B) The optimal alignment returned by these calculations, which has a score of –4.
• We now have a matrix of scores for optimal alignments of many subsequences,
together with the global sequence alignment score. This is
given by the value of Sm,n, which in this case is S8,9 = –4.

• Note that this is not necessarily the highest score in the matrix, which
in this case is S8,8 = 4, but only Sm,n includes the information from both
complete sequences.

• For each matrix element we know the element(s) from which it was
directly derived. Arrows in the figures are used to indicate this
information.
• We can use the information on the derivation of each element to obtain
the actual global alignment that produced this optimal score by a process
called traceback.
• Beginning at Sm,n we follow the arrows back through the matrix to the
start (S0,0). Thus, having filled the matrix elements from the beginning of
the sequences, we determine the alignment from the end of the
sequences.
• At each step, we can determine which of the three alternatives given in
the equation for Si,j has been applied, and add it to our alignment. If Si,j
has a diagonal arrow from Si–1,j–1, that implies the alignment will contain
xi aligned with yj. With rows corresponding to sequence x, a vertical arrow
from Si–1,j implies that xi is aligned with a gap in sequence y, and a
horizontal arrow from Si,j–1 implies that yj is aligned with a gap in sequence x.
• The traceback arrows involved in the optimal global alignment are shown
in red. When tracing back by hand, care must be taken, as it is easy to
make mistakes, especially by applying the results to residues xi–1 and yj–1
instead of xi and yj.
• The traceback information is often stored efficiently in computer
programs, for example using three bits to represent the possible
origins of each matrix element. If a bit is set to zero, that path was not
used, with a value of one indicating the direction. Such schemes allow
all this information to be easily stored and analyzed to obtain the
alignment paths.

• Note that there may be more than one optimal alignment if at some
point along the path during traceback an element is encountered that
was derived from more than one of the three possible alternatives.

• The algorithm does not distinguish between these possible
alignments, although there may be reasons for preferring one to the
others.
• Such preference would normally be justified by knowledge of the molecular
structure or function. Most programs will arbitrarily report just one single
alignment.
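
Traceback with bit-encoded origins can be sketched as below. This is a minimal illustration under stated assumptions: the flag names `DIAG`, `UP` and `LEFT`, the function `align_global` and the toy scoring function `simple_score` (+1 match, –1 mismatch) are hypothetical and stand in for a real substitution matrix. When a cell has several origins, the sketch arbitrarily prefers diagonal, then vertical, then horizontal, so it reports just one of the possibly many optimal alignments.

```python
DIAG, UP, LEFT = 1, 2, 4  # one bit per possible origin of a cell


def simple_score(a, b):
    """Toy substitution score (assumed): +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


def align_global(x, y, score, gap):
    """Global alignment with a linear gap penalty and bit-flag traceback."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    T = [[0] * (n + 1) for _ in range(m + 1)]   # traceback flags
    for i in range(1, m + 1):
        S[i][0], T[i][0] = i * gap, UP
    for j in range(1, n + 1):
        S[0][j], T[0][j] = j * gap, LEFT
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = (
                (S[i - 1][j - 1] + score(x[i - 1], y[j - 1]), DIAG),
                (S[i - 1][j] + gap, UP),     # x_i against a gap in y
                (S[i][j - 1] + gap, LEFT),   # y_j against a gap in x
            )
            S[i][j] = max(value for value, _ in options)
            for value, bit in options:
                if value == S[i][j]:
                    T[i][j] |= bit           # keep every optimal origin
    # Follow the flags back from S[m][n] to S[0][0].
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if T[i][j] & DIAG:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif T[i][j] & UP:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = align_global("THISLINE", "ISALIGNED", simple_score, gap=-1)
print(score)
print(top)
print(bottom)
```

Because the alignment is built from the end of the sequences, the collected columns are reversed before being returned.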
• The alignment given by the traceback is not the one we expected, in that it
contains no internal gaps.

• The carboxy-terminal aspartic acid residue (D) in sequence y is aligned with a
gap only because the two sequences are not the same length. We can readily
understand this outcome if we consider our chosen gap penalty of 8 in the
light of the BLOSUM-62 substitution matrix.
• The worst substitution score in this matrix is –4, a penalty only half as large
as the gap penalty. Also, many of the scores for aligning identical residues are only 4 or 5.
This means that if we set such a high gap penalty, a gap is unlikely to be
present in an optimal alignment using this scoring matrix. In these
circumstances, gaps will occur if the sequences are of different length and also
possibly in the presence of particular residues such as tryptophan or cysteine
which have higher scores.
• If instead we use a linear gap penalty g(ngap) = –4ngap, the situation changes, as
shown in Figure below, which gives the optimal alignment we expected.
Because the gap penalty is less severe, gaps are more likely to be introduced,
resulting in a different alignment and a different score.

• In this particular case, four additional gaps occur, two of which occur within the
sequences. The overall alignment score is 7, but this alignment would have
scored –13 with the original gap penalty of 8.

• This example illustrates the need to match the gap penalty to the substitution
matrix used. However, care must be taken in matching these parameters, as the
performance also depends on the properties of the sequences being aligned.
Different parameters may be optimal when looking at long or short sequences,
and depending on the expected sequence similarity.
• Optimal global alignment of the same two sequences, with only the gap
scoring changed: the linear gap penalty uses a value of –4 for the parameter E.
• (A) The completed matrix using the BLOSUM-62 scoring matrix.
• (B) The optimal alignment, which has a score of 7.
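
The two scores quoted for the gapped alignment can be checked by hand or with the short sketch below. It assumes the expected alignment is THIS-LI-NE- over --ISALIGNED, which matches the description of five gaps and six identical pairs. The dictionary `IDENTITY` holds the BLOSUM-62 diagonal entries for the residues involved, and the helper name `score_alignment` is hypothetical.

```python
# BLOSUM-62 scores for identical pairs of the residues that are matched.
IDENTITY = {"I": 4, "S": 4, "L": 4, "N": 6, "E": 5}


def score_alignment(top, bottom, gap):
    """Score an alignment whose residue pairs are all identities."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap                 # linear gap penalty per gapped column
        elif a == b:
            total += IDENTITY[a]
        else:
            raise ValueError("this sketch only handles identical pairs")
    return total


top, bottom = "THIS-LI-NE-", "--ISALIGNED"
print(score_alignment(top, bottom, gap=-4))  # 27 - 20 = 7
print(score_alignment(top, bottom, gap=-8))  # 27 - 40 = -13
```

The six identical pairs contribute 27 in both cases, so the difference between the two totals comes entirely from the five gapped columns.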
Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
• Often we do not expect the whole of one sequence to align well with
the other. For example, the proteins may have just one domain in
common, in which case we want to find this high-scoring zone,
referred to as a local alignment.

• In a global alignment, those regions of the sequences that differ
substantially will often obscure the good agreement over a limited
stretch. The local alignment will identify these stretches while
ignoring the weaker alignment scores elsewhere.
• It turns out that a very similar dynamic programming algorithm to that
described above for global alignments can obtain a local alignment.

• Smith and Waterman first proposed this method. However, it should be noted
that the method presented here requires a similarity-scoring scheme that has
an expected negative value for random alignments and positive value for highly
similar sequences.

• Most of the commonly used substitution matrices fulfill this condition. Note
that the global alignment schemes have no such restriction, and can have all
substitution matrix scores positive.

• Under such a scheme, scores will grow steadily larger as the alignment gets
larger, regardless of the degree of similarity, so that long random alignments
will ultimately be indistinguishable by score alone from short significant ones.
• The key difference in the local alignment algorithm from the global alignment
algorithm set out above is that whenever the score of the optimal
subsequence alignment is less than zero it is rejected, and that matrix element is
set to zero.

• The scoring scheme must give a positive score for aligning (at least some)
identical residues. We would expect to be able to find at least one such match
in any alignment worth considering, so that we can be sure that there should
be some positive alignment scores.

• Another algorithmic difference is that we now start traceback from the
highest-scoring matrix element wherever it occurs.
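• The extra condition on the matrix elements means that the values of
Si,0 and S0,j are set to zero, as was the case for global alignments
without end gap penalties. The formula for the general matrix
element Si,j with a general gap penalty function g(ngap) is

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(x_i, y_j) \\
\max_{1 \le n_{gap} \le j} \left[ S_{i,j-n_{gap}} + g(n_{gap}) \right] \\
\max_{1 \le n_{gap} \le i} \left[ S_{i-n_{gap},j} + g(n_{gap}) \right] \\
0
\end{cases}
$$

• For a linear gap penalty the two inner maxima reduce to Si,j–1 + g and
Si–1,j + g, so the equation differs from the global alignment equation only
by the inclusion of the zero.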
• Figures below show the optimal local alignments for our usual example
in the two cases of linear gap penalties g(ngap) = –8ngap and –4ngap,
respectively. Both result in removal of the differing ends of the
sequences.

• In the first case, the higher gap penalty forces an alignment of serine
(S) and alanine (A) in preference to adding a gap to reach the identical
IS sub-sequence. Lowering the gap penalty in this instance improves
the result to give the local alignment we would expect.
• The dynamic programming calculation for determining the optimal local
alignment of the two sequences THISLINE and ISALIGNED.
• (A) The completed matrix using the BLOSUM-62 scoring matrix with a
linear gap penalty, with E set to –8.
• (B) The optimal alignment, determined by the highest-scoring element,
which has a score of 12.
• Optimal local alignment calculation with a linear gap penalty with E
set to –4.
• (A) The completed matrix for determining the optimal local alignment
of THISLINE and ISALIGNED using the BLOSUM-62 scoring matrix.
• (B) The optimal alignment, identified by the highest scoring element
in the entire matrix, which has a score of 19.
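
The local variant can be sketched by adding the zero floor and starting traceback at the highest-scoring cell. As before, this is an illustration rather than the code behind the figures. `ROWS` is a hand-transcribed subset of BLOSUM-62 that should be verified against an authoritative copy, only the linear gap penalty is handled, and the name `local_align` is hypothetical.

```python
# Subset of BLOSUM-62 for the residues in THISLINE / ISALIGNED
# (hand-transcribed; verify against an authoritative copy before reuse).
RESIDUES = "ADEGHILNST"
ROWS = [
    [ 4, -2, -1,  0, -2, -1, -1, -2,  1,  0],  # A
    [-2,  6,  2, -1, -1, -3, -4,  1,  0, -1],  # D
    [-1,  2,  5, -2,  0, -3, -3,  0,  0, -1],  # E
    [ 0, -1, -2,  6, -2, -4, -4,  0,  0, -2],  # G
    [-2, -1,  0, -2,  8, -3, -3,  1, -1, -2],  # H
    [-1, -3, -3, -4, -3,  4,  2, -3, -2, -1],  # I
    [-1, -4, -3, -4, -3,  2,  4, -3, -2, -1],  # L
    [-2,  1,  0,  0,  1, -3, -3,  6,  1,  0],  # N
    [ 1,  0,  0,  0, -1, -2, -2,  1,  4,  1],  # S
    [ 0, -1, -1, -2, -2, -1, -1,  0,  1,  5],  # T
]
BLOSUM62 = {(a, b): ROWS[i][j]
            for i, a in enumerate(RESIDUES)
            for j, b in enumerate(RESIDUES)}


def local_align(x, y, gap):
    """Local alignment with a linear gap penalty; returns (score, top, bottom)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                0,                                               # reject negative scores
                S[i - 1][j - 1] + BLOSUM62[x[i - 1], y[j - 1]],
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
            )
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + BLOSUM62[x[i - 1], y[j - 1]]:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


for gap in (-8, -4):
    print(gap, local_align("THISLINE", "ISALIGNED", gap))
```

With the assumed matrix values the best scores are 12 for a gap penalty of –8 and 19 for –4, matching the figures quoted in the captions.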
• The problem with dynamic programming methods is that despite
their efficiency they can place heavy demands on computer memory
and take a long time to run.

• The speed of calculation is no longer as serious a barrier as it has
been in the past, but the problem of insufficient computer memory
persists, particularly as there are now many very long sequences,
including those of whole genomes, available for comparison and
analysis.
• Some modifications of the basic dynamic programming algorithm
have been made that reduce the memory and time demands:

• One way of reducing memory requirements is by storing not the
complete matrix but only the two rows required for calculations.

• However, to recover the alignment from such a calculation takes
longer than if all the traceback information has been saved.

• By only calculating a limited region of the matrix, commonly a
diagonal band, both time and space savings can be made, although at
the risk of not identifying the correct optimal alignment.
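
Both ideas can be sketched in one score-only routine: it keeps just the previous and current rows, and an optional `band` argument restricts the calculation to cells with |i – j| <= band. The function name `nw_score`, the `band` parameter and the toy scoring values (+1 match, –1 mismatch, –1 gap) are assumptions made for illustration. No traceback is stored, so only the score is returned.

```python
NEG_INF = float("-inf")  # marks cells outside the band


def nw_score(x, y, match=1, mismatch=-1, gap=-1, band=None):
    """Global alignment score using two rows of memory.

    band=None fills every cell; band=w only fills cells with |i - j| <= w,
    which is faster but may miss the true optimum.
    """
    m, n = len(x), len(y)
    prev = [j * gap if band is None or j <= band else NEG_INF
            for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [NEG_INF] * (n + 1)
        if band is None or i <= band:
            curr[0] = i * gap
        lo = 1 if band is None else max(1, i - band)
        hi = n if band is None else min(n, i + band)
        for j in range(lo, hi + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is no longer needed
    return prev[n]


print(nw_score("ACGT", "CGTA"))          # full matrix: 1
print(nw_score("ACGT", "CGTA", band=1))  # band wide enough: 1
print(nw_score("ACGT", "CGTA", band=0))  # band too narrow: -4
```

The last call shows the risk mentioned above: with a band of zero only the ungapped alignment is considered, so the better gapped alignment is missed.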
• Often the first step in a sequence analysis is to search databases to
retrieve all related sequences. Such searches depend on making
pairwise alignments of the query sequence against all the sequences
in the databases, but because of the scale of this task, fast
approximate methods are usually used to make such searches more
practicable.

• The algorithms for two commonly used search programs, BLAST and
FASTA, make use of indexing techniques such as suffix trees and
hashing to locate short stretches of database sequences highly similar
or identical to parts of the query sequence.
• Attempts are then made to extend these to longer, ungapped local
alignments which are scored, the scores being used to identify
database sequences that are likely to be significantly similar. This
process is considerably faster than applying full-matrix dynamic
programming to each database sequence.

• At this point, both techniques revert to the more accurate methods to
examine the highest-scoring sequences, in order to determine the
optimal local alignment and score, but this is only done for a tiny
fraction of the database entries.
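
The seed-and-extend idea can be illustrated with a toy sketch. This is not the actual BLAST or FASTA implementation: it hashes exact k-letter words with a Python dictionary, extends each hit without gaps, and stops when the running score falls a fixed amount below the best seen. All function names, the scoring values and the `drop` threshold are hypothetical.

```python
from collections import defaultdict


def build_index(db_seq, k):
    """Hash every k-letter word of the database sequence to its start positions."""
    index = defaultdict(list)
    for pos in range(len(db_seq) - k + 1):
        index[db_seq[pos:pos + k]].append(pos)
    return index


def extend(query, db_seq, i, j, step, match, mismatch, drop):
    """Ungapped extension in one direction.

    Stops once the running score is `drop` below the best score seen.
    Returns (best_score, length_of_best_extension).
    """
    score = best = best_len = length = 0
    while 0 <= i < len(query) and 0 <= j < len(db_seq) and best - score < drop:
        score += match if query[i] == db_seq[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        i += step
        j += step
    return best, best_len


def seed_and_extend(query, db_seq, k=4, match=1, mismatch=-1, drop=2):
    """Return sorted (score, query_start, query_end, db_start) ungapped hits."""
    index = build_index(db_seq, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], ()):
            right, rlen = extend(query, db_seq, q + k, d + k, +1, match, mismatch, drop)
            left, llen = extend(query, db_seq, q - 1, d - 1, -1, match, mismatch, drop)
            hits.add((k * match + left + right, q - llen, q + k + rlen, d - llen))
    return sorted(hits, reverse=True)


hits = seed_and_extend("CCACGTACCC", "TTTACGTACGGG")
print(hits)  # [(6, 2, 8, 3)]: query[2:8] == "ACGTAC" matches db[3:9]
```

Only the top-scoring hits would then be passed to full dynamic programming, which is where the saving over aligning every database entry comes from.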
