Introduction To Bioinformatics: High-Throughput Biological Data and Evolution
Introduction to bioinformatics
Lecture 3
High-throughput Biological Data
-data deluge, bioinformatics algorithms-
and evolution
Centre for Integrative Bioinformatics VU
Last lecture:
• Many different genomics datasets:
– Genome sequencing: more than 300 species completely
sequenced, with the data in the public domain (i.e. the information
is freely available); a virus genome can be sequenced in a day
– Gene expression (microarray) data: many microarrays
measured per day
– Proteomics: Protein Data Bank (PDB): as of Tuesday,
7 February 2006, it held 35026 structures.
http://www.rcsb.org/pdb/
– Protein-protein interaction data: many databases worldwide
– Metabolic pathway, regulation and signaling data: many
databases worldwide
Growth in number of protein
tertiary structures
The data deluge
Although a lot of tertiary structural data is being
produced (preceding slide), sequences accumulate far faster than
structures are solved or functions are assigned. This is the
SEQUENCE-STRUCTURE-FUNCTION GAP
VHLTPEEKSAVTALWGKVNVD
EVGGEALGRLLVVYPWTQRFF
ESFGDLSTPDAVMGNPKVKAH
GKKVLGAFSDGLAHLDNLKGTF
ATLSELHCDKLHVDPENFRLLG
NVLVCVLAHHFGKEFTPPVQAA
YQKVVAGVANALAHKYH
[Figure labels: 1-step process; 2-step process]
There are many different regulation signals such as start, stop and skip
messages hidden in the genome for each gene, but what and where are they?
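One very simple way to look for "start" and "stop" messages is to scan a DNA string for open reading frames: a start codon followed, in the same reading frame, by a stop codon. The sketch below assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, looks only at the forward strand, and ignores "skip" signals such as splice sites altogether. The function name `find_orfs` and the `min_codons` parameter are illustrative choices, not part of the lecture.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs of forward-strand ORFs.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons from the start codon up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                       # first start codon opens an ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ORF
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGAAATAGGG"                       # made-up sequence for illustration
    for begin, end in find_orfs(toy):
        print(begin, end, toy[begin:end])
```

Real gene finding has to deal with both strands, alternative start codons, introns and regulatory signals, which is why the question on the slide remains hard.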
Expression data
Functional genomics
• Monte Carlo
Protein translation
What is life?
• NASA astrobiology program:
“Life is a self-sustained chemical system
capable of undergoing Darwinian
evolution”
Evolution
Four requirements:
• Template structure providing stability (DNA)
• Copying mechanism (meiosis)
• Mechanism providing variation (mutations;
insertions and deletions; crossing-over; etc.)
• Selection: some traits give one individual greater fitness
than another. Darwin described this as “survival of the fittest”,
a phrase coined by Herbert Spencer
ABCD → ACCD (B → C): mutation
ABCD → ABD (C → ø): deletion
We need to be able to
do automatic pathway
comparison (pathway
alignment)
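As a rough illustration of what automatic pathway comparison could look like, a pathway can be treated as an ordered string of steps and two pathways compared by edit distance, where one substitution corresponds to a mutation of a step and one gap to a deletion or insertion. The sketch below assumes an ancestral toy pathway ABCD and unit costs for every change; both are simplifications chosen for the example, and `pathway_distance` is a hypothetical name.

```python
def pathway_distance(a, b):
    """Edit (Levenshtein) distance between two pathways.

    Each pathway is a sequence of step labels, e.g. "ABCD" or a list of
    enzyme names. Substitutions, insertions and deletions all cost 1.
    """
    prev = list(range(len(b) + 1))              # distance from empty prefix of a
    for i, step_a in enumerate(a, start=1):
        curr = [i]                              # deleting the first i steps of a
        for j, step_b in enumerate(b, start=1):
            cost = 0 if step_a == step_b else 1
            curr.append(min(prev[j] + 1,        # delete step_a
                            curr[j - 1] + 1,    # insert step_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    ancestor = "ABCD"                           # assumed toy ancestral pathway
    for variant in ("ACCD", "ABD"):
        print(ancestor, variant, pathway_distance(ancestor, variant))
```

Real pathways are graphs rather than linear strings, and enzymes can be partly similar rather than simply equal or different, so a practical method would need a richer scoring scheme than this one.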
http://en.wikipedia.org/wiki/Krebs_cycle
The citric-acid cycle
Fig. 1. (a) A graphical representation of the reactions of the
citric-acid cycle (CAC), including the connections with
pyruvate and phosphoenolpyruvate, and the glyoxylate
shunt. When there are two enzymes that are not homologous
to each other but that catalyse the same reaction
(non-homologous gene displacement), one is marked with a solid
line and the other with a dashed line. The oxidative direction
is clockwise. The enzymes with their EC numbers are as
follows: 1, citrate synthase (4.1.3.7); 2, aconitase (4.2.1.3);
3, isocitrate dehydrogenase (1.1.1.42); 4, 2-ketoglutarate
dehydrogenase (solid line; 1.2.4.2 and 2.3.1.61) and
2-ketoglutarate ferredoxin oxidoreductase (dashed line;
1.2.7.3); 5, succinyl-CoA synthetase (solid line; 6.2.1.5) or
succinyl-CoA–acetoacetate-CoA transferase (dashed line;
2.8.3.5); 6, succinate dehydrogenase or fumarate reductase
(1.3.99.1); 7, fumarase (4.2.1.2) class I (dashed line) and
class II (solid line); 8, bacterial-type malate dehydrogenase
(solid line) or archaeal-type malate dehydrogenase (dashed
line) (1.1.1.37); 9, isocitrate lyase (4.1.3.1); 10, malate
synthase (4.1.3.2); 11, phosphoenolpyruvate carboxykinase
(4.1.1.49) or phosphoenolpyruvate carboxylase (4.1.1.32);
12, malic enzyme (1.1.1.40 or 1.1.1.38); 13, pyruvate
carboxylase or oxaloacetate decarboxylase (6.4.1.1); 14,
pyruvate dehydrogenase (solid line; 1.2.4.1 and 2.3.1.12)
and pyruvate ferredoxin oxidoreductase (dashed line;
1.2.7.1).
M. A. Huynen, T. Dandekar and P. Bork ``Variation and evolution of the citric acid cycle: a genomic approach'' Trends Microbiol, 7, 281-29
(1999)
The citric-acid cycle
(b) Individual species might not have a
complete CAC. This diagram shows
the genes for the CAC for each
unicellular species for which a
genome sequence has been published,
together with the phylogeny of the
species. The distance-based
phylogeny was constructed using the
fraction of genes shared between
genomes as a similarity criterion (ref. 29 of the cited paper).
The major kingdoms of life are
indicated in red (Archaea), blue
(Bacteria) and yellow (Eukarya).
Question marks represent reactions for
which there is biochemical evidence
in the species itself or in a related
species but for which no genes could
be found. Genes that lie in a single
operon are shown in the same color.
Genes were assumed to be located in a
single operon when they were
transcribed in the same direction and
the stretches of non-coding DNA
separating them were less than 50
nucleotides in length.
M. A. Huynen, T. Dandekar and P. Bork ``Variation and evolution of the citric acid cycle: a genomic approach'' Trends Microbiol, 7, 281-29
(1999)
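The operon rule stated in the caption (genes transcribed in the same direction and separated by fewer than 50 non-coding nucleotides) is simple enough to sketch in code. The version below assumes that each gene is described by a name, 1-based inclusive start and end coordinates, and a strand symbol; the `Gene` record, the `group_operons` function and the example genes are all hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int      # 1-based, inclusive (assumed convention)
    end: int        # 1-based, inclusive
    strand: str     # "+" or "-"


def group_operons(genes, max_gap=50):
    """Group genes into putative operons.

    Neighbouring genes join the same operon when they lie on the same
    strand and fewer than `max_gap` nucleotides separate them.
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            prev = operons[-1][-1]
            gap = gene.start - prev.end - 1     # non-coding nucleotides between
            if gene.strand == prev.strand and gap < max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])                  # start a new operon
    return operons


if __name__ == "__main__":
    genes = [                                   # invented coordinates
        Gene("geneA", 100, 400, "+"),
        Gene("geneB", 420, 700, "+"),
        Gene("geneC", 900, 1200, "+"),
        Gene("geneD", 1210, 1500, "-"),
    ]
    for operon in group_operons(genes):
        print([g.name for g in operon])
```

In this invented example geneA and geneB are grouped, geneC is too far downstream, and geneD is close to geneC but on the opposite strand.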
Thinking about evolution
• Is the evolutionary model applicable to other
systems?
– Story telling in old cultures
– Richard Dawkins’ book The Selfish Gene introduces
memes
• The Genetic Algorithm (GA) is arguably one of the most
versatile computational optimisation strategies, and is
modelled directly on Darwinian evolution
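A minimal genetic algorithm can be written so that each of the four requirements for evolution has a visible counterpart: a bit string as the template, copying of parents into children, variation through crossover and mutation, and selection by fitness. The sketch below is only one of many possible designs. It assumes a toy fitness function (the number of 1 bits), truncation selection of the fitter half, single-point crossover and one elite individual copied unchanged; all names and parameter values are illustrative.

```python
import random


def fitness(genome):
    """Toy fitness: the number of 1 bits (assumed for illustration)."""
    return sum(genome)


def mutate(genome, rate, rng):
    """Variation: flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in genome]


def crossover(mum, dad, rng):
    """Variation: single-point crossing-over between two parents."""
    point = rng.randrange(1, len(mum))
    return mum[:point] + dad[point:]


def evolve(length=20, pop_size=30, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []                                # best fitness per generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:pop_size // 2]    # selection: keep the fitter half
        children = [population[0]]              # elitism: copy the best unchanged
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)   # copying with variation
            children.append(mutate(crossover(mum, dad, rng), rate, rng))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = evolve()
    print("best fitness:", fitness(best), "of", len(best))
```

Because the best individual is always carried over unchanged, the best fitness can never decrease from one generation to the next; without elitism it could fluctuate.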