Week 4
• Starting with sufficiently short sub-sequences, for example the first residue of
each sequence, the optimal alignment can easily be determined, allowing for all
possible gaps.
• Subsequently, further residues can be added to this, at most one from each
sequence at any step. At each stage, the previously determined optimal
subsequence alignment can be assumed to persist, so only the score for adding
the next residue needs to be investigated.
• A worked example later in the section will make this clear. In this way, the optimal
global alignment can be grown from one end of the sequence.
• This alignment might not produce the optimal score if the gap penalty
were set very high relative to the substitution matrix values, but in
this case it could be argued that the scoring parameters would then
not be appropriate for the problem.
• In the matrix in the figure below, the element $S_{i,j}$ is used to store the score
for the optimal alignment of all residues up to $x_i$ of sequence x with
all residues up to $y_j$ of sequence y.
• Figure: The initial stage of filling in the dynamic programming matrix to find the
optimal global alignment of the two sequences THISLINE and ISALIGNED.
• This initial stage depends only on the linear gap penalty, with E set to –8.
• The arrows indicate the cell(s) to which each cell value contributes.
• The sequences $(x_1 x_2 x_3 \ldots x_i)$ and $(y_1 y_2 y_3 \ldots y_j)$ with $i < m$ and $j < n$ are called
subsequences. Column $S_{i,0}$ and row $S_{0,j}$ correspond to the alignment of the
first i or j residues with a gap of the same length. Thus, element $S_{0,3}$ is the
score for aligning sub-sequence $y_1 y_2 y_3$ with a gap of length 3.
• To fill in this matrix, one starts by aligning the beginning of each sequence;
that is, in the extreme upper left-hand corner.
• The elements $S_{i,0}$ and $S_{0,j}$ are easy to fill in, because there is only one possible
alignment available. $S_{i,0}$ represents the alignment of $x_1 x_2 \ldots x_i$ with a gap of length i.
• We will start by considering a linear gap penalty $g(n_{gap}) = -8\,n_{gap}$ for a gap
of $n_{gap}$ residues, giving the scores of $S_{i,0}$ and $S_{0,j}$ as $-8i$ and $-8j$,
respectively. This starting point with numerical values inserted into
the matrix is illustrated in the figure above.
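The boundary initialization described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `init_global_matrix`, the choice of THISLINE as sequence x, and the use of `None` for cells not yet computed are assumptions made here, not part of the original material.

```python
def init_global_matrix(x, y, gap_per_residue=-8):
    """Build the (len(x)+1) x (len(y)+1) score matrix with only its borders filled.

    Interior cells are left as None because they are computed in a later step.
    """
    m, n = len(x), len(y)
    S = [[None] * (n + 1) for _ in range(m + 1)]
    S[0][0] = 0  # nothing aligned yet
    for i in range(1, m + 1):
        # first i residues of x aligned against a gap of length i
        S[i][0] = gap_per_residue * i
    for j in range(1, n + 1):
        # first j residues of y aligned against a gap of length j
        S[0][j] = gap_per_residue * j
    return S


S = init_global_matrix("THISLINE", "ISALIGNED")
print(S[0][:4])                   # [0, -8, -16, -24]
print([row[0] for row in S][:4])  # [0, -8, -16, -24]
```

The value `S[0][3] == -24` corresponds to aligning the first three residues of y with a gap of length 3.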
• The other matrix elements are filled in according to simple rules that
can be understood by considering a process of adding one position at
a time to the alignment.
• There are only three options for any given position:
• a pairing of residues from both sequences,
• and the two possibilities of a residue from one sequence aligning with a gap
in the other.
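• For the first interior element, $S_{1,1}$, these three options can be written as:
$$S_{0,0} + s(\mathrm{I},\mathrm{T}) = 0 - 1 = -1$$
$$S_{0,1} + g(1) = -8 - 8 = -16$$
$$S_{1,0} + g(1) = -8 - 8 = -16$$
where s(I,T) has been obtained from BLOSUM62. Of these alternatives, the
optimal one is clearly the first. Hence in the figure below, $S_{1,1} = -1$. Because $S_{1,1}$ has
been derived from $S_{0,0}$, an arrow has been drawn linking them in the figure.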
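• An identical argument can be made to construct any element of the matrix
from three others, using the formula:
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(x_i, y_j) \\ S_{i-1,j} + g(1) \\ S_{i,j-1} + g(1) \end{cases}$$
where $g(1) = E$ is the linear gap penalty for a single gap position.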
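The rule of taking the best of the three options for every element can be sketched as follows. This is a minimal sketch under stated assumptions: `toy_score` is a hypothetical stand-in for a substitution matrix (+5 for identical residues, -1 otherwise) and is NOT BLOSUM62, so the values it produces will not match the figures; THISLINE is arbitrarily taken as sequence x.

```python
def toy_score(a, b):
    """Hypothetical substitution score (assumption, not BLOSUM62)."""
    return 5 if a == b else -1


def fill_global(x, y, score=toy_score, gap=-8):
    """Fill the global alignment matrix with a linear gap penalty."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = gap * i
    for j in range(1, n + 1):
        S[0][j] = gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # pair x_i with y_j
                S[i - 1][j] + gap,                            # x_i against a gap
                S[i][j - 1] + gap,                            # y_j against a gap
            )
    return S


S = fill_global("THISLINE", "ISALIGNED")
print(S[1][1])  # -1, the maximum of 0 - 1, -8 - 8 and -8 - 8
```

Because the toy mismatch score happens to equal the BLOSUM62 value s(I,T) = -1, the first interior element agrees with the worked value in the text; later elements generally will not.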
• Note that $S_{m,n}$, the score of the optimal global alignment, is not necessarily
the highest score in the matrix, which in this case is $S_{8,8} = 4$, but only $S_{m,n}$
includes the information from both complete sequences.
• For each matrix element we know the element(s) from which it was
directly derived. Arrows in the figures are used to indicate this
information.
• We can use the information on the derivation of each element to obtain
the actual global alignment that produced this optimal score by a process
called traceback.
• Beginning at $S_{m,n}$ we follow the arrows back through the matrix to the
start ($S_{0,0}$). Thus, having filled the matrix elements from the beginning of
the sequences, we determine the alignment from the end of the
sequences.
• At each step, we can determine which of the three alternatives given in
the equation for $S_{i,j}$ has been applied, and add it to our alignment. If $S_{i,j}$
has a diagonal arrow from $S_{i-1,j-1}$, that implies the alignment will contain
$x_i$ aligned with $y_j$. Vertical arrows imply a gap in sequence x aligning with
a residue in sequence y, and vice versa for horizontal arrows.
• The traceback arrows involved in the optimal global alignment are shown
in red. When tracing back by hand, care must be taken, as it is easy to
make mistakes, especially by applying the results to residues $x_{i-1}$ and $y_{j-1}$
instead of $x_i$ and $y_j$.
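The traceback procedure can be sketched in Python as below. Instead of storing arrows, this sketch re-derives the origin of each element by checking which of the three alternatives reproduces its value. The scoring function `toy_score` (+5 match, -1 mismatch) and the test sequences are hypothetical choices made for illustration, not BLOSUM62 and not the example from the figures. When several origins tie, this sketch simply prefers the diagonal, so it returns only one of the optimal alignments.

```python
def toy_score(a, b):
    """Hypothetical substitution score (assumption, not BLOSUM62)."""
    return 5 if a == b else -1


def global_align(x, y, score=toy_score, gap=-8):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = gap * i
    for j in range(1, n + 1):
        S[0][j] = gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Traceback from S[m][n] to S[0][0]; the alignment is built end-first.
    # Python strings are 0-based, so residue x_i is x[i - 1].
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])   # x_i paired with y_j
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")        # x_i against a gap in y
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])        # y_j against a gap in x
            j -= 1
    return S[m][n], "".join(reversed(ax)), "".join(reversed(ay))


print(global_align("ACGT", "AGT"))  # (7, 'ACGT', 'A-GT')
```

The explicit `x[i - 1]` indexing is where the off-by-one error mentioned above tends to appear in code as well as by hand.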
• The traceback information is often stored efficiently in computer
programs, for example using three bits to represent the possible
origins of each matrix element. If a bit is set to zero, that path was not
used, with a value of one indicating the direction. Such schemes allow
all this information to be easily stored and analyzed to obtain the
alignment paths.
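A three-bit encoding of this kind might look like the following sketch. The particular bit assignments and the names `DIAG`, `VERT`, `HORIZ`, `best_with_origins` and `describe` are hypothetical; real programs may lay out the bits differently.

```python
# Hypothetical bit assignments: one bit per possible origin of a matrix element.
DIAG, VERT, HORIZ = 0b001, 0b010, 0b100


def best_with_origins(diag_score, vert_score, horiz_score):
    """Return the best candidate score and a 3-bit mask of every origin achieving it."""
    best = max(diag_score, vert_score, horiz_score)
    bits = 0
    if diag_score == best:
        bits |= DIAG
    if vert_score == best:
        bits |= VERT
    if horiz_score == best:
        bits |= HORIZ
    return best, bits


def describe(bits):
    """List the directions whose bit is set to one."""
    names = (("diagonal", DIAG), ("vertical", VERT), ("horizontal", HORIZ))
    return [name for name, flag in names if bits & flag]


print(best_with_origins(-1, -16, -16))             # (-1, 1)
score, bits = best_with_origins(3, 3, -5)
print(score, format(bits, "03b"), describe(bits))  # 3 011 ['diagonal', 'vertical']
```

An element with more than one bit set was derived from more than one alternative, which is the situation in which several equally optimal alignments exist.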
• Note that there may be more than one optimal alignment if at some
point along the path during traceback an element is encountered that
was derived from more than one of the three possible alternatives.
• In this particular case, with the gap penalty parameter E reduced to –4, four additional
gaps occur, two of which occur within the sequences. The overall alignment score is 7, but
this alignment would have scored –13 with the original gap penalty of –8.
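The two scores quoted above can be related by a short calculation. Under a linear gap penalty, changing E from -4 to -8 lowers the score of a fixed alignment by 4 for every gap position it contains. The count of five gap positions used below is inferred here from the two quoted scores and is not stated in the text:

$$7 + 5 \times \bigl(-8 - (-4)\bigr) = 7 - 20 = -13$$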
• This example illustrates the need to match the gap penalty to the substitution
matrix used. However, care must be taken in matching these parameters, as the
performance also depends on the properties of the sequences being aligned.
Different parameters may be optimal when looking at long or short sequences,
and depending on the expected sequence similarity.
• Figure: Optimal global alignment of the two sequences, identical to the previous
calculation except for a change in gap scoring: the linear gap penalty uses a value
of –4 for the parameter E.
• (A) The completed matrix using the BLOSUM-62 scoring matrix.
• (B) The optimal alignment, which has a score of 7.
Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
• Often we do not expect the whole of one sequence to align well with
the other. For example, the proteins may have just one domain in
common, in which case we want to find this high-scoring zone,
referred to as a local alignment.
• Smith and Waterman first proposed this method. However, it should be noted
that the method presented here requires a similarity-scoring scheme that has
an expected negative value for random alignments and positive value for highly
similar sequences.
• Most of the commonly used substitution matrices fulfill this condition. Note
that the global alignment schemes have no such restriction, and can have all
substitution matrix scores positive.
• Under such a scheme, scores will grow steadily larger as the alignment gets
larger, regardless of the degree of similarity, so that long random alignments
will ultimately be indistinguishable by score alone from short significant ones.
• The key difference in the local alignment algorithm from the global alignment
algorithm set out above is that whenever the score of the optimal sub-sequence
alignment is less than zero it is rejected, and that matrix element is
set to zero.
• The scoring scheme must give a positive score for aligning (at least some)
identical residues. We would expect to be able to find at least one such match
in any alignment worth considering, so that we can be sure that there should
be some positive alignment scores.
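• The matrix elements for local alignment are therefore given by:
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(x_i, y_j) \\ S_{i-1,j} + g(1) \\ S_{i,j-1} + g(1) \\ 0 \end{cases}$$
• Note that this equation differs from the earlier global alignment equation only by
the inclusion of the zero.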
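The effect of the zero can be sketched in Python as follows. This is an illustrative sketch only: `toy_score` (+5 match, -1 mismatch) is a hypothetical scoring function rather than BLOSUM62, so the scores it gives for THISLINE and ISALIGNED differ from those in the figures. The traceback starts at the highest-scoring element and stops at the first zero.

```python
def toy_score(a, b):
    """Hypothetical substitution score (assumption, not BLOSUM62)."""
    return 5 if a == b else -1


def local_align(x, y, score=toy_score, gap=-8):
    """Return (score, aligned_x, aligned_y) for one optimal local alignment."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # borders stay at zero
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(0,  # a negative sub-alignment is rejected
                          S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j

    # Trace back from the highest-scoring element until a zero is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_align("GGTTGG", "CCTTCC"))  # (10, 'TT', 'TT')
```

As in the text, the differing ends of the sequences are dropped because extending into them would only lower the score.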
• The figures below show the optimal local alignments for our usual example
in the two cases of linear gap penalties $g(n_{gap}) = -8\,n_{gap}$ and $-4\,n_{gap}$,
respectively. Both result in removal of the differing ends of the
sequences.
• In the first case, the higher gap penalty forces an alignment of serine
(S) and alanine (A) in preference to adding a gap to reach the identical
IS sub-sequence. Lowering the gap penalty in this instance improves
the result to give the local alignment we would expect.
• The dynamic programming calculation for determining the optimal local
alignment of the two sequences THISLINE and ISALIGNED.
• (A) The completed matrix using the BLOSUM-62 scoring matrix with a
linear gap penalty, with E set to –8.
• (B) The optimal alignment, determined by the highest-scoring element,
which has a score of 12.
• Optimal local alignment calculation with a linear gap penalty with E
set to –4.
• (A) The completed matrix for determining the optimal local alignment
of THISLINE and ISALIGNED using the BLOSUM-62 scoring matrix.
• (B) The optimal alignment, identified by the highest scoring element
in the entire matrix, which has a score of 19.
• The problem with dynamic programming methods is that despite
their efficiency they can place heavy demands on computer memory
and take a long time to run.