Luthey-Schulten Group
Theoretical and Computational Biophysics Group
Biophysics 590C: Fall 2004
Sequence Alignment
Algorithms
Rommie Amaro
Zaida Luthey-Schulten
Felix Autenrieth
Anurag Sethi
Brijeet Dhaliwal
Taras Pogorelov
Barry Isralewitz
September 2004
Contents
1 Introduction
1 Introduction
Recent projects such as the sequencing of the genomes of several organisms and
high-throughput X-ray structure analysis have brought the scientific community
a large amount of data about the sequences and structures of several thousand
proteins. This information can be used effectively for medical and biological
research only if one can extract functional insight from the sequence and
structural data. To achieve this, we need to understand how the proteins
perform their functions. Two main computational techniques exist to reach this
goal: a bioinformatics approach and atomistic molecular dynamics simulations.
Bioinformatics uses the statistical analysis of protein sequences and
structures to understand their function and to predict structures when only
sequence information is available. Molecular modeling and molecular dynamics
simulations use principles from physics and physical chemistry to study the
function and folding of proteins.
Bioinformatics methods are among the most powerful technologies available
in life sciences today. They are used in fundamental research on theories of
evolution and in more practical considerations of protein design. Algorithms and
approaches used in these studies range from sequence and structure alignments,
secondary structure prediction, functional classification of proteins, threading
and modeling of distantly-related homologous proteins to modeling the progress
of protein expression through a cell’s life cycle.
In this tutorial you will use a classic global sequence alignment method, the
Needleman-Wunsch algorithm, to align two small proteins. First you will align
them by hand and perform your own dynamic programming; afterwards you
will check your work against a computer program that we provide you. The
Needleman-Wunsch alignment programs have been kindly provided by Anurag
Sethi.
The entire tutorial takes about an hour to complete.
Protein sequences vs. nucleotide sequences. A protein is a sequence
of amino acids linked by peptide bonds to form a polypeptide chain.
In this tutorial, the word sequence (unless otherwise specified)
refers to the amino acid residue sequence of a protein; by convention
these sequences are listed from the N-terminal to the C-terminal of
the chain. Sequences can be written with full names, as in “Lysine,
Arginine, Cysteine, ...”, with 3-letter codes, “Lys, Arg, Cys, ...”,
or with 1-letter codes, “K, R, C, ...”. Proteins range in size from a
few dozen to several thousand residues. The nucleotide sequence of
DNA encodes the protein sequence. Sections of genes in chromosomal
DNA are copied to mRNA, which provides the guide for the ribosome to
assemble a protein. A nucleotide sequence may be written as
“Cytosine, Adenine, Adenine, Guanine, ...”, or “C, A, A, G, ...”.
This tutorial assumes that the alignment programs we provide have been
correctly installed on your computer. Please ask a lab attendant for help
if you have any trouble locating software or data files during the tutorial.
Getting started
The files for this tutorial are located in:
>> mkdir ~/Workshop/bioinformatics-tutorial/
Within this directory is the PDF for the tutorial, as well as the files needed
for running the tutorial. Before you start the tutorial, be sure you are in the
directory with all the files:
>> cd ~/Workshop/bioinformatics-tutorial/bioinformatics
2 Sequence Alignment Algorithms
where $S_{i,j}$ is the similarity score of comparing amino acid $a_i$ to amino
acid $b_j$ (obtained here from the BLOSUM40 similarity table) and $d$ is the
penalty for a single gap. The matrix is initialized with $H_{0,0} = 0$. When
obtaining the local Smith-Waterman alignment, $H_{i,j}$ is modified:

\[
H_{i,j} = \max
\begin{cases}
0 \\
H_{i-1,j-1} + S_{i,j} \\
H_{i-1,j} - d \\
H_{i,j-1} - d
\end{cases}
\]
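As a rough illustration of this update rule, the following Python sketch computes one cell from its three neighbors. The function name `cell_score` and its `local` flag are illustrative choices, not part of the programs provided with the tutorial. The sketch assumes that the global (Needleman-Wunsch) update is the same maximum without the zero term.

```python
def cell_score(diag, up, left, s, d, local=False):
    """Score of cell (i, j) computed from its three neighbours.

    diag = H[i-1][j-1], up = H[i-1][j], left = H[i][j-1],
    s = S[i][j] (similarity score), d = penalty for a single gap.
    """
    candidates = [diag + s, up - d, left - d]
    if local:
        # Smith-Waterman: a cell is never allowed to drop below zero
        candidates.append(0)
    return max(candidates)


# First interior cell of the tutorial matrix: K compared with H,
# S = -1 (BLOSUM40), d = 8, neighbours 0 (diag), -8 (up), -8 (left).
print(cell_score(0, -8, -8, -1, 8))              # global update: -1
print(cell_score(0, -8, -8, -1, 8, local=True))  # local update: 0
```

The only difference between the two variants is the extra zero candidate, which lets a local alignment restart wherever the running score would become negative.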
The gap penalty can be modified; for instance, d can be replaced by (d × k),
where d is the penalty for a single gap and k is the number of consecutive gaps.
Once the optimal alignment score is found, the “traceback” through H along
the optimal path gives the optimal sequence alignment for that score. In the
next set of exercises you will manually implement the Needleman-Wunsch
alignment for a pair of short sequences, then perform global sequence
alignments with a computer program developed by Anurag Sethi, which is based
on the Needleman-Wunsch algorithm with an affine gap penalty, d + e(k − 1),
where e is the extension gap penalty. The output file will be in the GCG
format, one of the two standard formats in bioinformatics for storing sequence
information (the other standard format is FASTA).
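To make the two gap models concrete, here is a small Python sketch comparing the linear penalty d × k with the affine penalty d + e(k − 1). The function names and the value e = 1 are assumptions chosen for illustration; the tutorial text does not state the value of e used by the provided program.

```python
def linear_gap(k, d):
    """Penalty for k consecutive gaps when every gap costs d."""
    return d * k


def affine_gap(k, d, e):
    """Penalty d for opening a gap plus e for each of the k - 1 extensions."""
    if k <= 0:
        return 0
    return d + e * (k - 1)


# d = 8 as in the hand exercise; e = 1 is an assumed extension penalty.
for k in (1, 2, 5):
    print(k, linear_gap(k, d=8), affine_gap(k, d=8, e=1))
```

With these values a run of five gaps costs 40 under the linear model but only 12 under the affine model, so the affine model favors a few long gaps over many short ones.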
Here you will align the sequence HGSAQVKGHG to the sequence KTEAEMKASEDLKKHGT.
The two sequences are arranged in a matrix in Table 1. The sequences start at
the upper left corner, and the initial gap penalties are listed at each offset
starting position. With each move from the start position, the initial penalty
increases by our single gap penalty of 8.
          H    G    S    A    Q    V    K    G    H    G
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
K   -8
T  -16
E  -24
A  -32
E  -40
M  -48
K  -56
A  -64
S  -72
E  -80
D  -88
L  -96
K -104
K -112
H -120
G -128
T -136
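The boundary values in Table 1 can be reproduced with a few lines of Python. This is only a sketch; the variable names are illustrative, and interior boxes are left empty (`None`) because they are filled in during the following steps.

```python
seq_top = "HGSAQVKGHG"          # written across the top of Table 1
seq_side = "KTEAEMKASEDLKKHGT"  # written down the left side of Table 1
d = 8                           # single gap penalty

# H[i][j]: i indexes seq_side (rows), j indexes seq_top (columns)
H = [[None] * (len(seq_top) + 1) for _ in range(len(seq_side) + 1)]
for j in range(len(seq_top) + 1):
    H[0][j] = -d * j   # top row: 0, -8, -16, ...
for i in range(len(seq_side) + 1):
    H[i][0] = -d * i   # left column: 0, -8, -16, ...

print(H[0])                   # top row of Table 1
print([row[0] for row in H])  # left column of Table 1
```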
1 The first step is to fill in the similarity scores Si,j by looking up the
matches in the BLOSUM40 table, shown here labeled with 1-letter amino
acid codes:
A 5
R -2 9
N -1 0 8
D -1 -1 2 9
C -2 -3 -2 -2 16
Q 0 2 1 -1 -4 8
E -1 -1 -1 2 -2 2 7
G 1 -3 0 -2 -3 -2 -3 8
H -2 0 1 0 -4 0 0 -2 13
I -1 -3 -2 -4 -4 -3 -4 -4 -3 6
L -2 -2 -3 -3 -2 -2 -2 -4 -2 2 6
K -1 3 0 0 -3 1 1 -2 -1 -3 -2 6
M -1 -1 -2 -3 -3 -1 -2 -2 1 1 3 -1 7
F -3 -2 -3 -4 -2 -4 -3 -3 -2 1 2 -3 0 9
P -2 -3 -2 -2 -5 -2 0 -1 -2 -2 -4 -1 -2 -4 11
S 1 -1 1 0 -1 1 0 0 -1 -2 -3 0 -2 -2 -1 5
T 0 -2 0 -1 -1 -1 -1 -2 -2 -1 -1 0 -1 -1 0 2 6
W -3 -2 -4 -5 -6 -1 -2 -2 -5 -3 -1 -2 -2 1 -4 -5 -4 19
Y -2 -1 -2 -3 -4 -1 -2 -3 2 0 0 -1 1 4 -3 -2 -1 3 9
V 0 -2 -3 -3 -2 -3 -3 -4 -4 4 2 -2 1 0 -3 -1 1 -3 -1 5
B -1 -1 4 6 -2 0 1 -1 0 -3 -3 0 -3 -3 -2 0 0 -4 -3 -3 5
Z -1 0 0 1 -3 4 5 -2 0 -4 -2 1 -2 -4 -1 0 -1 -2 -2 -3 2 5
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 0 -1 -2 0 0 -2 -1 -1 -1 -1 -1
A R N D C Q E G H I L K M F P S T W Y V B Z X
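Because the table is symmetric, only its lower triangle is printed, so a lookup has to try both orders of a residue pair. The Python sketch below copies the rows for the 20 standard amino acids from the table above (the B, Z and X rows are omitted for brevity) and looks up a score in either order. The names `ORDER`, `LOWER`, `SCORES` and `similarity` are hypothetical helpers, not part of the provided programs.

```python
ORDER = "ARNDCQEGHILKMFPSTWYV"  # row and column order of the table

# Lower triangle of BLOSUM40, one row per residue in ORDER
LOWER = """
5
-2 9
-1 0 8
-1 -1 2 9
-2 -3 -2 -2 16
0 2 1 -1 -4 8
-1 -1 -1 2 -2 2 7
1 -3 0 -2 -3 -2 -3 8
-2 0 1 0 -4 0 0 -2 13
-1 -3 -2 -4 -4 -3 -4 -4 -3 6
-2 -2 -3 -3 -2 -2 -2 -4 -2 2 6
-1 3 0 0 -3 1 1 -2 -1 -3 -2 6
-1 -1 -2 -3 -3 -1 -2 -2 1 1 3 -1 7
-3 -2 -3 -4 -2 -4 -3 -3 -2 1 2 -3 0 9
-2 -3 -2 -2 -5 -2 0 -1 -2 -2 -4 -1 -2 -4 11
1 -1 1 0 -1 1 0 0 -1 -2 -3 0 -2 -2 -1 5
0 -2 0 -1 -1 -1 -1 -2 -2 -1 -1 0 -1 -1 0 2 6
-3 -2 -4 -5 -6 -1 -2 -2 -5 -3 -1 -2 -2 1 -4 -5 -4 19
-2 -1 -2 -3 -4 -1 -2 -3 2 0 0 -1 1 4 -3 -2 -1 3 9
0 -2 -3 -3 -2 -3 -3 -4 -4 4 2 -2 1 0 -3 -1 1 -3 -1 5
"""

# Map (row residue, column residue) -> score for the printed triangle
SCORES = {}
for i, line in enumerate(LOWER.strip().splitlines()):
    for j, value in enumerate(line.split()):
        SCORES[ORDER[i], ORDER[j]] = int(value)


def similarity(a, b):
    """Look up S for residues a and b, in whichever order is stored."""
    if (a, b) in SCORES:
        return SCORES[a, b]
    return SCORES[b, a]


# S values for the first matrix row: K against H G S A Q V K G H G
print([similarity("K", b) for b in "HGSAQVKGHG"])
```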
using the convention that H values appear in the top part of a square in
large print, and S values appear in the bottom part of a square in small
print. Our gap penalty d is 8.
4 Again, just fill in 4 or 5 boxes in Table 2 until you get a feel for gap
penalties and similarity scores S vs. alignment scores H. In the next step,
we provide the matrix with all values filled in as Table 2.1. Check that
your 4 or 5 calculations match.
5 Now we move to Table 2.1, in which all 170 Hi,j values are shown, to do the
“alignment traceback”. Finding the alignment requires tracing a path
from the end of the sequence (the lower right box) to the start of the
sequence (the upper left box). This job looks complicated, but should
only take about 5–7 minutes.
6 We are tracing a path in Table 2.1, from the lower right box to the upper
left box. You can only move to a square if it could have been a
“predecessor” of your current square; that is, when the matrix was being
filled with Hi,j values, the move from the predecessor square to your
current square would have followed the mathematical rules we used to find
Hi,j above. Circle each square you move to along your path.
7 Continue moving squares, drawing arrows, and circling each new square
you land on, until you have reached the upper left corner of the matrix.
If the path branches, follow both branches.
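The hand procedure of filling the matrix and tracing back can be checked with a short Python sketch. It assumes that each box takes the maximum of the diagonal move (plus S), the move from above (minus d), and the move from the left (minus d), with d = 8. The S values are copied row by row from the tutorial's tables. The names `S_ROWS`, `score`, `fill_matrix` and `tracebacks` are illustrative, and this is not the program provided with the tutorial.

```python
TOP = "HGSAQVKGHG"           # columns of the matrix
SIDE = "KTEAEMKASEDLKKHGT"   # rows of the matrix
D = 8                        # single gap penalty

# BLOSUM40 scores of each row residue against H G S A Q V K G H G
S_ROWS = {
    "K": [-1, -2, 0, -1, 1, -2, 6, -2, -1, -2],
    "T": [-2, -2, 2, 0, -1, 1, 0, -2, -2, -2],
    "E": [0, -3, 0, -1, 2, -3, 1, -3, 0, -3],
    "A": [-2, 1, 1, 5, 0, 0, -1, 1, -2, 1],
    "M": [1, -2, -2, -1, -1, 1, -1, -2, 1, -2],
    "S": [-1, 0, 5, 1, 1, -1, 0, 0, -1, 0],
    "D": [0, -2, 0, -1, -1, -3, 0, -2, 0, -2],
    "L": [-2, -4, -3, -2, -2, 2, -2, -4, -2, -4],
    "H": [13, -2, -1, -2, 0, -4, -1, -2, 13, -2],
    "G": [-2, 8, 0, 1, -2, -4, -2, 8, -2, 8],
}


def score(i, j):
    """S value for row i and column j (both 1-based)."""
    return S_ROWS[SIDE[i - 1]][j - 1]


def fill_matrix():
    """Fill every H value, starting from the gap-penalty borders."""
    n, m = len(SIDE), len(TOP)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        H[0][j] = -D * j
    for i in range(1, n + 1):
        H[i][0] = -D * i
        for j in range(1, m + 1):
            H[i][j] = max(H[i - 1][j - 1] + score(i, j),
                          H[i - 1][j] - D,
                          H[i][j - 1] - D)
    return H


def tracebacks(H, i, j):
    """Yield every optimal path ending at (i, j) as two gapped strings."""
    if i == 0 and j == 0:
        yield "", ""
        return
    # A square is a predecessor only if it reproduces H[i][j] exactly.
    if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + score(i, j):
        for side, top in tracebacks(H, i - 1, j - 1):
            yield side + SIDE[i - 1], top + TOP[j - 1]
    if i > 0 and H[i][j] == H[i - 1][j] - D:
        for side, top in tracebacks(H, i - 1, j):
            yield side + SIDE[i - 1], top + "-"
    if j > 0 and H[i][j] == H[i][j - 1] - D:
        for side, top in tracebacks(H, i, j - 1):
            yield side + "-", top + TOP[j - 1]


H = fill_matrix()
alignments = list(tracebacks(H, len(SIDE), len(TOP)))
print("optimal score:", H[-1][-1])   # -21, the lower right box
print("number of optimal paths:", len(alignments))
print(alignments[0][0])
print(alignments[0][1])
```

The recursion follows every branch, as the step above asks you to do by hand. That is fine for sequences this short, but a program for long sequences would normally store traceback pointers while filling the matrix.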
11 After executing the program, you will have three output files, namely
align, scorematrix and stats. View the alignment in GCG format by
We have prepared a computer program, multiple, which will align multiple
pairs of proteins.
2 In the align and stats files you will find all combinatorially possible
pairs of the provided sequences. On a piece of paper, write the names of
the proteins, grouped by their domain of life, as listed in Table 4.
Compare sequence identities of aligned proteins from the same domain of
life, and of aligned proteins from different domains of life, to help
answer the questions below.
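Sequence identity can be computed directly from a pair of aligned strings. The sketch below assumes one common definition (identical residues divided by the number of aligned columns, gaps included) and assumes gaps are written as "-". The provided programs may define or report identity differently, and `percent_identity` is a hypothetical helper, not one of their functions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_a)
    if columns == 0:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != gap)
    return 100.0 * matches / columns


# Made-up aligned pair for illustration: 6 identical columns out of 8
print(percent_identity("HGSAQ-VK", "HG-AQMVK"))  # 75.0
```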
(For each residue, the upper line holds H values and the lower line holds S values.)
          H    G    S    A    Q    V    K    G    H    G
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
K   -8   -1   -9
         -1   -2    0   -1    1   -2    6   -2   -1   -2
T  -16   -9   -3
         -2   -2    2    0   -1    1    0   -2   -2   -2
E  -24
          0   -3    0   -1    2   -3    1   -3    0   -3
A  -32
         -2    1    1    5    0    0   -1    1   -2    1
E  -40
          0   -3    0   -1    2   -3    1   -3    0   -3
M  -48
          1   -2   -2   -1   -1    1   -1   -2    1   -2
K  -56
         -1   -2    0   -1    1   -2    6   -2   -1   -2
A  -64
         -2    1    1    5    0    0   -1    1   -2    1
S  -72
         -1    0    5    1    1   -1    0    0   -1    0
E  -80
          0   -3    0   -1    2   -3    1   -3    0   -3
D  -88
          0   -2    0   -1   -1   -3    0   -2    0   -2
L  -96
         -2   -4   -3   -2   -2    2   -2   -4   -2   -4
K -104
         -1   -2    0   -1    1   -2    6   -2   -1   -2
K -112
         -1   -2    0   -1    1   -2    6   -2   -1   -2
H -120
         13   -2   -1   -2    0   -4   -1   -2   13   -2
G -128
         -2    8    0    1   -2   -4   -2    8   -2    8
T -136
         -2   -2    2    0   -1    1    0   -2   -2   -2
(For each residue, the upper line holds H values and the lower line holds S values.)
          H    G    S    A    Q    V    K    G    H    G
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
K   -8   -1   -9  -16  -24  -31  -39  -42  -50  -58  -66
         -1   -2    0   -1    1   -2    6   -2   -1   -2
...
K -104  -95  -86  -73  -66  -56  -46  -32  -28  -20  -14
         -1   -2    0   -1    1   -2    6   -2   -1   -2
K -112 -103  -94  -81  -74  -64  -54  -40  -34  -28  -22
         -1   -2    0   -1    1   -2    6   -2   -1   -2
H -120  -99 -102  -89  -82  -72  -62  -48  -42  -21  -29
         13   -2   -1   -2    0   -4   -1   -2   13   -2
G -128 -107  -91  -97  -88  -80  -70  -56  -40  -29  -13
         -2    8    0    1   -2   -4   -2    8   -2    8
T -136 -115  -99  -89  -96  -88  -78  -64  -48  -37  -21
         -2   -2    2    0   -1    1    0   -2   -2   -2