Biochemistry and Bioinformatics
Vol. 1, No. 1: 2011 pp. 5-12 (Printed in Nigeria) © 2011 Nigerian Society of Biochemistry and Molecular Biology
Clement O. Bewaji
Department of Biochemistry, University of Ilorin, P. M. B. 1515, Ilorin, Kwara State, Nigeria
Transactions of the Nigerian Society of Biochemistry and Molecular Biology
EMBL: http://www.ebi.ac.uk/ebi
GenBank: http://www.ncbi.nlm.nih.gov
DDBJ: http://www.ddbj.nig.ac.jp

The three banks hold a tremendous amount of information. EMBL alone has over a billion nucleotide bases, with entries from the Human Genome Project (HGP) constituting 54% of the total. More than 15,500 different species are represented in the database, the most popular being human (Homo sapiens), baker's yeast (Saccharomyces cerevisiae) and mouse (Mus musculus).

Information from various genomic projects is also deposited in different databases. Organisms whose genomes have been completely sequenced include S. cerevisiae, Plasmodium falciparum, Haemophilus influenzae, Mycoplasma genitalium and Escherichia coli (Table 1).
Multiple Sequence Alignment

Bioinformatics is essentially a theoretical discipline which attempts to make predictions about biological functions from sequence data. It is also a powerful tool in experimental design. The eventual goal is to deduce protein structure and function strictly from amino acid sequences. However, all we can do at present is to see whether the new sequences being discovered are similar to sequences whose structure and/or function are already known. This is achieved by the use of sequence alignment tools designed with various algorithms.

The sequence similarity of two polypeptides or DNA molecules can be estimated quantitatively by counting the aligned residues that are identical. For example, human and dog cytochrome c differ in 11 out of 104 residues. They are therefore [(104 - 11)/104] x 100 = 89% identical. In the same way, it can be deduced that human and baker's yeast cytochrome c are [(104 - 45)/104] x 100 = 57% identical. When determining the percentage identity of two protein or DNA molecules, the length of the shorter protein or DNA is, by convention, used as the denominator.

Multiple sequence alignment is the process of aligning two or more sequences with each other so as to bring as many similar sequence characters (nucleotides or amino acids) into register as possible. The resulting alignments can be used for two purposes: first, to find regions of similar sequence in all of the sequences that define a conserved consensus pattern or domain; and second, if the alignment is particularly strong, to use the aligned positions to try to derive the possible evolutionary relationships among the sequences. When dealing with a sequence of unknown function, the presence of similar domains in several similar sequences implies a similar biochemical function or structural fold that may become the basis of further experimental investigation. A group of similar sequences may define a protein family which may share a common biochemical function or evolutionary origin.

Similar proteins have been organized by several laboratories into protein families. The sequence families originally identified by Margaret Dayhoff and colleagues became the basis of the PAM matrices used for sequence comparisons (Dayhoff et al., 1978). PAM stands for 'point accepted mutation' (also read as 'percent accepted mutation'). Subsequently, Amos Bairoch identified a large number of protein families and prepared a database of amino acid patterns, called the PROSITE catalog, which define the active sites of these proteins. These patterns are called MOTIFS. Subsequently, Henikoff and Henikoff (1991) aligned all these families, added more members from the databases, and then defined conserved patterns of amino acids called blocks. Blocks are present in all members of the family and are approximately 4-60 amino acids long, with no gaps. The BLOSUM (Blocks Substitution Matrix) amino acid comparison tables were derived from these aligned blocks. Common patterns of amino acids arising from multiple sequence alignment have also been called MOTIFS. These motifs may have more than one amino acid at each position and may include gaps. Once motifs have been defined, the sequence databases may be searched for additional sequence entries with the same motif.

Simultaneous alignment of three or more sequences poses a difficult algorithmic problem, but programs to perform such alignments are now available (for review, see Chan et al. 1992) and the methods used have been compared (McClure et al. 1994). The method often used, especially for ten or more sequences, is to first determine sequence similarity between all pairs of sequences in the set. Based on these similarities, various methods are then used to cluster the sequences into the most related groups or into a phylogenetic tree.

In the group approach, a consensus is produced for each group and then used to make further alignments between groups. Examples of programs using the grouping approach are PIMA (Smith and Smith 1990), which utilizes several novel alignment techniques, a program described by Taylor (Taylor 1990), and MAXHOM (Sander and Schneider 1991).

The tree method uses the distance method of phylogenetic analysis to arrange the sequences. The two closest sequences are aligned first, and the resulting consensus alignment is then aligned with the next best sequence or cluster of sequences, and so on, until an alignment is obtained which includes all the sequences. The programs GCG PILEUP developed by the Genetics Computer Group (GCG), CLUSTAL W (Higgins and Sharp 1988), the ALIGN set of programs (Feng and Doolittle 1987) and the MS/DOS program by Corpet (Corpet 1988) utilize this method. A disadvantage of all these methods is that the algorithm used is greedy: the alignment is driven by the most alike sequences, and errors made in the early alignments are carried through to the final one. An example of using CLUSTAL to align a similar portion of four hypothetical protein sequences is shown in Fig. 1. Note that the CLUSTAL alignment produces a consensus line at the bottom which marks absolutely conserved positions with a "*" and almost completely conserved positions with a ":", but the rest of the alignment is also remarkably similar.
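The percentage identity calculation and the first step of the tree method (scoring every pair of sequences and picking the closest pair) can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used by any of the programs named above: the function names and the toy sequences are hypothetical, the sequences are assumed to be already aligned with "-" marking gaps, and percentage identity stands in for whatever similarity measure a real program would use.

```python
from itertools import combinations


def identity_from_differences(length: int, differences: int) -> float:
    """Percentage identity given the sequence length and the number of differing residues."""
    return 100.0 * (length - differences) / length


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of identical aligned residues between two aligned sequences.

    Both strings must come from the same alignment (equal length, '-' for gaps).
    By convention the ungapped length of the shorter sequence is the denominator.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    shorter = min(len(seq_a.replace("-", "")), len(seq_b.replace("-", "")))
    return 100.0 * matches / shorter


def closest_pair(aligned: dict[str, str]) -> tuple[str, str]:
    """Return the names of the two most similar sequences (first step of the tree method)."""
    return max(
        combinations(aligned, 2),
        key=lambda pair: percent_identity(aligned[pair[0]], aligned[pair[1]]),
    )


# Cytochrome c figures quoted in the text: 11 and 45 differences out of 104 residues.
human_dog = round(identity_from_differences(104, 11))    # 89
human_yeast = round(identity_from_differences(104, 45))  # 57

# Hypothetical toy sequences, invented purely for illustration.
toy = {
    "seq1": "MKTAYIAKQR",
    "seq2": "MKTAYIAKQK",
    "seq3": "MRTAWIGKQK",
    "seq4": "M-TAWIGHQK",
}
first_to_align = closest_pair(toy)  # ("seq1", "seq2"), which are 90% identical
```

A full progressive aligner would go on to align the chosen pair, treat the result as a single cluster, and repeat until every sequence has been added.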
together into one sequence before scoring the amino acid substitutions in the aligned blocks. The amino acid changes within these clustered sequences were then averaged. Patterns which were 60% identical were grouped together to make one substitution matrix called BLOSUM60, and those 80% alike to make another matrix called BLOSUM80, and so on.

As the clustering percentage was increased, the ability of the resulting matrix to distinguish actual from chance alignments, defined as the relative entropy or average information content per residue pair, also increased. However, at the same time, the dominance effect of the more alike proteins also increased and biased the matches. BLOSUM62 (Table 2) represents a balance between information content and match bias and is the favored matrix that is best able to predict alignments among all protein families.

The amino acids in the table are grouped according to the chemistry of the side chain: C (sulfhydryl), STPAG (small hydrophilic), NDEQ (acid, acid amide and hydrophilic), HRK (basic), MILV (small hydrophobic) and FYW (aromatic). Each entry is the actual frequency of occurrence of the amino acid pair in the blocks database, clustered at the 62% level, divided by the expected probability of occurrence. The expected value is calculated from the frequency of occurrence of each of the two individual amino acids in the blocks database, and provides a measure of a chance alignment of the two amino acids. The actual/expected ratio is expressed as a log odds score in so-called half-bit units, obtained by converting the ratio to a logarithm to the base 2 and then multiplying by 2. A zero score means that the frequency of the amino acid pair in the database was as expected by chance, a positive score that the pair was found more often than by chance, and a negative score that the pair was found less often than by chance. The accumulated score of an alignment of several amino acids in two sequences may be obtained by adding up the respective scores of each individual pair of amino acids.

The highest scoring matches are between amino acids which are in the same chemical group, and the very highest scoring matches are for cysteine-cysteine matches and for matches among the aromatic amino acids. A similar variation is seen in the PAM matrices. Compared to the PAM160 matrix, however, the BLOSUM62 matrix gives a more positive score to mismatches with the rare amino acids, e.g. cysteine, a more positive score to mismatches with hydrophobic amino acids, but a more negative score to mismatches with hydrophilic amino acids (Henikoff and Henikoff 1992).

Sequence alignments based on Bayesian Statistics

There is a need for a method of finding sequence alignments that examines all possible alignments and does not depend on the choice of scoring matrix and gap penalties. To solve this problem, the following algorithms/properties have been considered:

- dynamic programming provides many different possible alignments and may give many alignments with about the same score
- each combination of scoring matrix and gap penalty gives a different alignment and alignment score
- for protein alignments, gaps should occur only in certain regions corresponding to loops on the outside of the 3D structure; gaps should not occur in regions corresponding to the core of the 3D structure, which is a tightly packed hydrophobic environment
- instead of looking for alignments with gaps, there is another algorithm for finding ungapped regions with matches and mismatches; these are called blocks, and they might correspond to the α helices and β strands in the protein core.

The Need For A Bioinformatics Curriculum

Bioinformatics will continue to influence the way we think and the way we do science. Students of Biochemistry, Genetics or Molecular Biology will need to know more about the new discipline if they intend to get employment in the pharmaceutical or biotechnology industries. We are gradually approaching the era of personalised medicine (Wang et al., 2008; Wheeler et al., 2008; Bodmer and Bonilla, 2008).

This raises the question of developing a curriculum in Bioinformatics. Undergraduate and postgraduate students as well as researchers need a basic understanding of the Internet and how to search for appropriate information on it. This means that universities must invest in making a large number of networked PCs available for this purpose. This is a sine qua non for acquiring the skill of browsing the web, which is essential for gaining access to biological sequence databases.

Teaching bioinformatics will be difficult to achieve with our current educational philosophy. This is because our current practice in curriculum design treats the physical and biological sciences as alternatives. There is a need for all science students to take courses in mathematics, computer science and biology up to 200 level in the universities.
Table 2. The BLOSUM62 amino acid substitution matrix.

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S  -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T  -1   1   4   1  -1   1   0   1   0   0   0  -1   0  -1  -2  -2  -2  -2  -2  -3
P  -3  -1   1   7  -1  -2  -1  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A   0   1  -1  -1   4   0  -1  -2  -1  -1  -2  -1  -1  -1  -1  -1  -2  -2  -2  -3
G  -3   0   1  -2   0   6  -2  -1  -2  -2  -2  -2  -2  -3  -4  -4   0  -3  -3  -2
N  -3   1   0  -2  -2   0   6   1   0   0  -1   0   0  -2  -3  -3  -3  -3  -2  -4
D  -3   0   1  -1  -2  -1   1   6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E  -4   0   0  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -3  -3  -2  -3
Q  -3   0   0  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H  -3  -1   0  -2  -2  -2   1   1   0   0   8   0  -1  -2  -3  -3  -2  -1   2  -2
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K  -3   0   0  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -3  -3  -2  -3
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2  -2   0  -1  -1
I  -1  -2  -2  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   1   0  -1  -3
L  -1  -2  -2  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   3   0  -1  -2
V  -1  -2  -2  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4  -1  -1  -3
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W  -2  -3  -3  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
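As a rough illustration of how scores such as those in the table above are derived and used, the sketch below converts an observed/expected frequency ratio into a half-bit log odds score and sums per-pair scores over an ungapped alignment. It is a minimal sketch under stated assumptions: the function names are hypothetical, the frequencies passed to `half_bit_score` are invented round numbers rather than values from the blocks database, and only a handful of entries are copied from the table.

```python
import math


def half_bit_score(observed: float, expected: float) -> int:
    """Log odds score in half-bit units: 2 * log2(observed / expected), rounded."""
    return round(2 * math.log2(observed / expected))


def alignment_score(seq_a: str, seq_b: str, scores: dict) -> int:
    """Sum the substitution scores of each aligned residue pair (ungapped alignment)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        # The matrix is treated as symmetric, so look the pair up in either order.
        key = (a, b) if (a, b) in scores else (b, a)
        total += scores[key]
    return total


# Invented frequencies, chosen only to show the sign of the score.
more_than_chance = half_bit_score(0.02, 0.005)    # pair seen 4x more often than expected
as_expected = half_bit_score(0.01, 0.01)          # pair seen exactly as often as expected
less_than_chance = half_bit_score(0.0025, 0.01)   # pair seen 4x less often than expected

# A few entries copied from the table (not the full matrix).
TABLE_SCORES = {
    ("C", "C"): 9,
    ("W", "W"): 11,
    ("H", "H"): 8,
    ("F", "Y"): 3,
    ("C", "S"): -1,
}

total = alignment_score("CWHFC", "CWHYS", TABLE_SCORES)  # 9 + 11 + 8 + 3 - 1
```

Because the scores are logarithms of odds ratios, adding them across positions corresponds to multiplying the underlying ratios, which is why the accumulated alignment score is a simple sum.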
Students will need to be exposed to the various sequence databases available and the genomic projects that have been completed, and to learn how to perform simple analysis on a sequence, i.e. how to compare a new sequence against a database to determine whether similar sequences exist, and how to make predictions about the function of the new sequence. Anyone who acquires these skills will be in high demand in a global marketplace that extends across the Atlantic.
Maxam, A. M. and Gilbert, W. (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci. (USA) 74, 560-564.

McClure M.A. (1995) Parameterization studies of hidden Markov chain models representing highly divergent protein sequences, in Proceedings of the 28th Annual Hawaii International Conference on System Sciences, ed. by Lawrence Hunter and Bruce Shriver, vol. 5: Biotechnology Computing. IEEE Computer Society Press.

McClure M.A., Vasi, T.K., and Fitch W.M. (1994) Comparative analysis of multiple protein-sequence alignment methods. Mol. Biol. Evol. 11:571-592.

Park H, Kim JI, Ju YS, Gokcumen O, Mills RE, Kim S, Lee S, Suh D, Hong D, Kang HP, Yoo YJ, Shin JY, Kim HJ, Yavartanoo M, Chang YW, Ha JS, Chong W, Hwang GR, Darvishi K, Kim H, Yang SJ, Yang KS, Hurles ME, Scherer SW, Carter NP, Tyler-Smith C, Lee C, Seo JS (2010) Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nat Genet, 42, 400-405.

Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, et al. (2006) Global variation in copy number in the human genome. Nature, 444, 444-454.

Sander C. and Schneider R. (1991) Database of homology derived protein structures and the structural meaning of sequence alignment. Proteins 9:56-68.

Sanger, F. (1981) Determination of the nucleotide sequences in DNA. Science 214, 1205-1210.

Sanger, F. and Coulson, A. R. (1975) A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94, 444-448.

Sanger, F. and Thompson, E. O. P. (1963) The amino acid sequence in the glycyl chain of insulin. Biochem. J. 53, 353-374.

Sanger, F. and Tuppy, H. (1961) The amino acid sequence in the phenylalanyl chain of insulin. Biochem. J. 49, 463-490.

Sanger, F.; Air, G. M.; Barrel, B. G.; Brown, N. L.; Coulson, A. R.; Fiddes, J. C.; Hutchison, C. A. (III); Slocombe, P. M. and Smith, M. (1977) Nucleotide sequence of bacteriophage φX174. Nature 265, 687-695.

Sanger, F.; Nicklen, S. and Coulson, A. R. (1977) DNA sequencing with chain terminating inhibitors. Proc. Natl. Acad. Sci. (USA) 74, 5463-5467.

Schuler G.D., Altschul S.F. and Lipman D.J. (1991) A workbench for multiple alignment construction and analysis. Proteins 9:180-90.

Smith R.F., and Smith T.F. (1990) Pattern-induced multiple sequence alignments. Proc. Natl. Acad. Sci. (USA) 87:118-122.

Sutcliffe, G. (1979) Nucleotide sequence of the E. coli plasmid pBR322. Cold Spring Harbor Symp. Quant. Biol. 43, 77-90.

Taylor, W.R. (1990) Hierarchical method to align large numbers of biological sequences. Methods in Enzymology 183:456-474.

Vingron M. and Argos P. (1991) Motif recognition and alignment for many sequences by comparison of dot matrices. J. Mol. Biol. 218:33-43.

Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, et al. (2008) The diploid genome sequence of an Asian individual. Nature, 456, 60-65.

Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM (2008) The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872-876.