
Introduction To Bioinformatics: High-Throughput Biological Data and Evolution

This document provides an introduction to high-throughput biological data and bioinformatics algorithms. It discusses how the amount of biological data from sources like genome sequencing, gene expression data, and protein structure data is growing exponentially. This data deluge creates both opportunities and challenges for data analysis and knowledge discovery using bioinformatics algorithms. Common algorithms discussed include clustering, dynamic programming, and machine learning approaches. The document also covers topics like protein structure, protein folding, evolution, and algorithms for tasks like sequence analysis and gene finding.

Introduction to bioinformatics
Lecture 3

High-throughput Biological Data
-data deluge, bioinformatics algorithms-
and evolution

Centre for Integrative Bioinformatics VU
Last lecture:
• Many different genomics datasets:
– Genome sequencing: more than 300 species completely
sequenced and data in public domain (i.e. information is
freely available), virus genome can be sequenced in a day
– Gene expression (microarray) data: many microarrays
measured per day
– Proteomics: Protein Data Bank (PDB) - as of Tuesday
February 07, 2006 there are 35026 structures.
http://www.rcsb.org/pdb/
– Protein-protein interaction data: many databases worldwide
– Metabolic pathway, regulation and signaling data, many
databases worldwide
Growth in number of protein
tertiary structures
The data deluge
Although a lot of tertiary structural data is being
produced (preceding slide), there is the

SEQUENCE-STRUCTURE-FUNCTION GAP

The gap between sequence data on the one hand, and
structure or function data on the other, is widening
rapidly: sequence data grows much faster.
High-throughput Biological Data
The data deluge
• Hidden in all these data classes is
information that reflects
– existence, organization, activity,
functionality …… of biological machineries
at different levels in living organisms

Most effectively utilising and analysing this
information computationally is essential for
Bioinformatics
Data issues: from data to
distributed knowledge
• Data collection: getting the data
• Data representation: data standards, data normalisation …..
• Data organisation and storage: database issues …..
• Data analysis and data mining: discovering “knowledge”,
patterns/signals, from data, establishing associations among
data patterns
• Data utilisation and application: from data patterns/signals to
models for bio-machineries
• Data visualization: viewing complex data ……
• Data transmission: data collection, retrieval, …..
• ……
Bio-Data Analysis and Data Mining
• Analysis and mining tools exist and are developed for:
– DNA sequence assembly
– Genetic map construction
– Sequence comparison and database searching
– Gene finding
– Gene expression data analysis
– Phylogenetic tree analysis, e.g. to infer horizontally-
transferred genes
– Mass spectrometry data analysis for protein complex
characterization
– ……
Bio-Data Analysis and Data Mining
• As the amount and types of data and their cross
connections increase rapidly
• the number of analysis tools needed will go up
“exponentially” if we do not reuse techniques
– blast, blastp, blastx, blastn, … from BLAST family
of tools (we will cover BLAST later)
– gene finding tools for human, mouse, fly, rice,
cyanobacteria, …..
– tools for finding various signals in genomic
sequences, protein-binding sites, splice junction
sites, translation start sites, …..
Bio-Data Analysis and Data Mining
Many of these data analysis problems are
fundamentally the same problem(s) and can
be solved using the same set of tools
e.g.
•clustering or
•optimal segmentation by Dynamic
Programming

We will cover both of these techniques in later lectures
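As a preview of the dynamic programming idea, here is a minimal sketch of optimal segmentation. It assumes a numeric signal, a fixed number of segments k, and a squared-error cost within each segment; the function name `segment` and this cost function are illustrative choices, not the formulation used in the later lectures.

```python
def segment(values, k):
    """Split values into k contiguous segments minimising the total
    within-segment squared error. Returns (segments, total_cost),
    where segments is a list of half-open (start, end) index pairs."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums give the cost of any segment in constant time.
    ps, ps2 = [0.0], [0.0]
    for v in values:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)

    def cost(i, j):
        # Squared error of values[i:j] around its own mean.
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[m][j]: lowest cost of splitting values[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i

    # Trace the stored split points back from the end.
    segments, j = [], n
    for m in range(k, 0, -1):
        i = back[m][j]
        segments.append((i, j))
        j = i
    segments.reverse()
    return segments, best[k][n]
```

The same table-filling pattern reappears in sequence alignment: only the cost function and the allowed moves change.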


Bio-data Analysis, Data
Mining and Integrative
Bioinformatics
To have analysis capabilities covering a wide
range of problems, we need to discover the
common fundamental structures of these
problems;
HOWEVER in biology one size does NOT fit all…

An important goal of bioinformatics is
the development of a data analysis
infrastructure in support of Genomics and
beyond
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence)

VHLTPEEKSAVTALWGKVNVD
EVGGEALGRLLVVYPWTQRFF
ESFGDLSTPDAVMGNPKVKAH
GKKVLGAFSDGLAHLDNLKGTF
ATLSELHCDKLHVDPENFRLLG
NVLVCVLAHHFGKEFTPPVQAA
YQKVVAGVANALAHKYH

SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE (oligomers)


Protein complexes for photosynthesis in plants
Protein folding problem
Each protein sequence “knows” how to fold into its tertiary
structure. We still do not understand exactly how and why

PRIMARY STRUCTURE (amino acid sequence)
VHLTPEEKSAVTALWGKVNVD
EVGGEALGRLLVVYPWTQRFF
ESFGDLSTPDAVMGNPKVKAH
GKKVLGAFSDGLAHLDNLKGTF
ATLSELHCDKLHVDPENFRLLG
NVLVCVLAHHFGKEFTPPVQAA
YQKVVAGVANALAHKYH

SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)

Diagram labels: 1-step process; 2-step process

The 1-step process is based on a hydrophobic collapse; the 2-step
process, more common in forming larger proteins, is called the
framework model of folding
Protein folding: step on the way
is secondary structure prediction
• Long history -- first widely used algorithm
was by Chou and Fasman (1974)
• Different algorithms have been developed over
the years to crack the problem:
– Statistical approaches
– Neural networks (first from speech recognition)
– K-nearest neighbour algorithms
– Support Vector machines
Algorithms in bioinformatics
(recap)
• Sometimes the same basic algorithm can be
re-used for different problems (1-method-
multiple-problem)
• Normally, biological problems are
approached by different researchers using a
variety of methods (1-problem-multiple-
method)
Algorithms in bioinformatics
• string algorithms
• dynamic programming
• machine learning (Neural Networks, k-Nearest Neighbour, Support
Vector Machines, Genetic Algorithm, ..)
• Markov chain models, hidden Markov models, Markov Chain Monte
Carlo (MCMC) algorithms
• molecular mechanics, e.g. molecular dynamics, Monte Carlo,
simplified force fields
• stochastic context free grammars
• EM algorithms
• Gibbs sampling
• clustering
• tree algorithms
• text analysis
• hybrid/combinatorial techniques and more…
Sequence analysis and homology searching
Finding genes and regulatory elements

There are many different regulation signals such as start, stop and skip
messages hidden in the genome for each gene, but what and where are they?
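To make the idea of start and stop signals concrete, here is a minimal sketch of an open reading frame (ORF) scan. It assumes the standard start codon ATG and stop codons TAA, TAG and TGA, looks at the forward strand only, and ignores splicing (the "skip" signals); the function name `find_orfs` is illustrative, and real gene finders use far richer statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna):
    """Return (start, end, sequence) for every ATG that is followed
    in-frame by a stop codon. Indices are 0-based and end is exclusive.
    Forward strand only; nested ATGs give overlapping ORFs."""
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3, dna[start:pos + 3]))
                break
    return orfs

print(find_orfs("CCATGAAATAGGG"))  # [(2, 11, 'ATGAAATAG')]
```

A start codon without an in-frame stop is silently dropped here, which is one of several simplifications a real tool would not make.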
Expression data
Functional genomics

• Monte Carlo
Protein translation
What is life?
• NASA astrobiology program:
“Life is a self-sustained chemical system
capable of undergoing Darwinian
evolution”
Evolution
Four requirements:
• Template structure providing stability (DNA)
• Copying mechanism (DNA replication; passed to offspring via meiosis)
• Mechanism providing variation (mutations;
insertions and deletions; crossing-over; etc.)
• Selection: some traits lead to greater fitness of one
individual relative to another. Darwin adopted Herbert
Spencer's phrase “survival of the fittest” for this

Evolution is a conservative process: the vast majority of mutations
will not be selected (i.e. will not make it, as they lead to worse
performance or are even lethal). This is called negative (or
purifying) selection
Orthology/paralogy

Orthologous genes are homologous (corresponding) genes in different
species; they diverged through speciation.
Paralogous genes are homologous genes within the same species
(genome); they arose through gene duplication.
Changing molecular sequences
• Mutations: changing nucleotides (‘letters’)
within DNA, also called ‘point mutations’
• A & G: purines, C & T/U: pyrimidines:
– Transition: purine -> purine or pyrimidine ->
pyrimidine
– Transversion: purine -> pyrimidine or
pyrimidine -> purine
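The purine/pyrimidine rule can be written down directly. This is a minimal sketch; the function name `classify_substitution` is illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}

def classify_substitution(old, new):
    """Return 'transition' or 'transversion' for a single-base change."""
    old, new = old.upper(), new.upper()
    if old == new:
        raise ValueError("no substitution: bases are identical")
    for base in (old, new):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unknown base: {base!r}")
    # Same chemical class on both sides means a transition.
    same_class = (old in PURINES) == (new in PURINES)
    return "transition" if same_class else "transversion"

print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```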
Types of point mutation
• Synonymous mutation: mutation that does
not lead to an amino acid change (where in
the codon are these expected?)
• Non-synonymous mutation: does lead to
an amino acid change
– Missense mutation: one amino acid (a.a.) is replaced by
another a.a.
– Nonsense mutation: an a.a. codon is replaced by a stop
codon (what happens with protein?)
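These categories can be checked mechanically against the genetic code. The sketch below assumes the standard genetic code written with DNA letters (T rather than U) and a 0-based position within the codon; the function name `classify_point_mutation` is illustrative.

```python
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO_ACIDS,
))

def classify_point_mutation(codon, position, new_base):
    """Classify a single-base change in a sense codon.
    position is 0, 1 or 2 within the codon."""
    codon, new_base = codon.upper(), new_base.upper()
    if CODON_TABLE[codon] == "*":
        raise ValueError("expected a sense codon, got a stop codon")
    if codon[position] == new_base:
        raise ValueError("no mutation: base is unchanged")
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "synonymous"
    if after == "*":
        return "nonsense"
    return "missense"

print(classify_point_mutation("GAA", 2, "G"))  # synonymous (Glu -> Glu)
print(classify_point_mutation("GAG", 1, "T"))  # missense (Glu -> Val)
print(classify_point_mutation("TAC", 2, "A"))  # nonsense (Tyr -> stop)
```

Trying all three positions of a few codons is a quick way to explore the question on the slide about where synonymous changes are expected.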
Ka/Ks Ratios
• Ks is defined as the number of synonymous
nucleotide substitutions per synonymous site
• Ka is defined as the number of nonsynonymous
nucleotide substitutions per nonsynonymous site
• The Ka/Ks ratio is used to estimate the type of
selection exerted on a given gene or DNA
fragment
• Aligned orthologous sequences are needed to
calculate Ka/Ks ratios (we will talk about
alignment later).
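The two definitions translate into a small calculation once the substitutions and sites have been counted. This is a minimal sketch that assumes those four counts are already available from an alignment (the parameter names are illustrative) and uses uncorrected proportions; published estimators such as Nei-Gojobori additionally correct for multiple substitutions at the same site.

```python
def ka_ks(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return (Ka, Ks, Ka/Ks) from raw counts, with no
    multiple-hit correction."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    ka = nonsyn_subs / nonsyn_sites   # nonsynonymous substitutions per nonsynonymous site
    ks = syn_subs / syn_sites         # synonymous substitutions per synonymous site
    if ks == 0:
        raise ValueError("Ks is zero, so the ratio is undefined")
    return ka, ks, ka / ks

# Hypothetical counts: 6 changes over 300 nonsynonymous sites,
# 10 changes over 100 synonymous sites, giving a ratio of about 0.2.
ka, ks, ratio = ka_ks(6, 300, 10, 100)
print(ka, ks, ratio)
```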
Ka/Ks ratios

The frequency of different values of Ka/Ks for 835 mouse–rat
orthologous genes. Figures on the x axis represent the middle figure of
each bin; that is, the 0.05 bin collects data from 0 to 0.1
Ka/Ks ratios

Three types of selection:

1. Negative (purifying) selection -> Ka/Ks < 1
2. Neutral evolution (Kimura) -> Ka/Ks ~= 1
3. Positive selection -> Ka/Ks > 1
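The three cases can be expressed as a simple rule. The sketch below is illustrative only: the width of the "approximately 1" band (`tolerance`) is an arbitrary assumption, and in practice a statistical test decides whether a ratio differs significantly from 1.

```python
def selection_type(ratio, tolerance=0.1):
    """Map a Ka/Ks ratio to the selection regime it suggests.
    tolerance is an assumed half-width for 'approximately 1'."""
    if ratio < 0:
        raise ValueError("Ka/Ks cannot be negative")
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    return "negative (purifying)" if ratio < 1.0 else "positive"

print(selection_type(0.2))   # negative (purifying)
print(selection_type(1.05))  # neutral
print(selection_type(2.0))   # positive
```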
Human Evolution
Divergent Evolution
Ancestral sequence: ABCD

ACCD (B -> C, mutation)      ABD (C -> ø, deletion)

Pairwise alignment:

ACCD    or    ACCD
AB─D          A─BD

The first is the true alignment: it pairs the mutated C with B and
places the gap at the deleted C.
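A simple scoring scheme cannot tell the two candidate alignments apart. The sketch below scores a given gapped alignment column by column; the score values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and gaps are written as "-" in the code.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum column scores for two equal-length gapped sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

print(score_alignment("ACCD", "AB-D"))  # 0
print(score_alignment("ACCD", "A-BD"))  # 0
```

Both candidates score the same, so only knowledge of the evolutionary history identifies which one is true.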
Consequence of evolution
• Notion of comparative analysis (Darwin)
• What you know about one species might be
transferable to another, for example from
mouse to human
• Provides a framework to do the multi-level
large-scale analysis of the genomics data
plethora
Flavodoxin-cheY Multiple Sequence Alignment
Human Yeast

We need to be able to
do automatic pathway
comparison (pathway
alignment)

This pathway diagram shows a comparison of pathways in (left) Homo sapiens
(human) and (right) Saccharomyces cerevisiae (baker’s yeast). Changes in
controlling enzymes (square boxes in red) and the pathway itself have occurred
(yeast has one altered (‘overtaking’) path in the graph)
The citric-acid cycle

http://en.wikipedia.org/wiki/Krebs_cycle
The citric-acid cycle
Fig. 1. (a) A graphical representation of the reactions of the
citric-acid cycle (CAC), including the connections with
pyruvate and phosphoenolpyruvate, and the glyoxylate
shunt. When there are two enzymes that are not homologous
to each other but that catalyse the same reaction (non-
homologous gene displacement), one is marked with a solid
line and the other with a dashed line. The oxidative direction
is clockwise. The enzymes with their EC numbers are as
follows: 1, citrate synthase (4.1.3.7); 2, aconitase (4.2.1.3);
3, isocitrate dehydrogenase (1.1.1.42); 4, 2-ketoglutarate
dehydrogenase (solid line; 1.2.4.2 and 2.3.1.61) and 2-
ketoglutarate ferredoxin oxidoreductase (dashed line;
1.2.7.3); 5, succinyl-CoA synthetase (solid line; 6.2.1.5) or
succinyl-CoA–acetoacetate-CoA transferase (dashed line;
2.8.3.5); 6, succinate dehydrogenase or fumarate reductase
(1.3.99.1); 7, fumarase (4.2.1.2) class I (dashed line) and
class II (solid line); 8, bacterial-type malate dehydrogenase
(solid line) or archaeal-type malate dehydrogenase (dashed
line) (1.1.1.37); 9, isocitrate lyase (4.1.3.1); 10, malate
synthase (4.1.3.2); 11, phosphoenolpyruvate carboxykinase
(4.1.1.49) or phosphoenolpyruvate carboxylase (4.1.1.32);
12, malic enzyme (1.1.1.40 or 1.1.1.38); 13, pyruvate
carboxylase or oxaloacetate decarboxylase (6.4.1.1); 14,
pyruvate dehydrogenase (solid line; 1.2.4.1 and 2.3.1.12)
and pyruvate ferredoxin oxidoreductase (dashed line;
1.2.7.1).

M. A. Huynen, T. Dandekar and P. Bork ``Variation and evolution of the citric acid cycle: a genomic approach'' Trends Microbiol, 7, 281-29
(1999)
The citric-acid cycle
(b) Individual species might not have a
complete CAC. This diagram shows
the genes for the CAC for each
unicellular species for which a
genome sequence has been published,
together with the phylogeny of the
species. The distance-based
phylogeny was constructed using the
fraction of genes shared between
genomes as a similarity criterion.
The major kingdoms of life are
indicated in red (Archaea), blue
(Bacteria) and yellow (Eukarya).
Question marks represent reactions for
which there is biochemical evidence
in the species itself or in a related
species but for which no genes could
be found. Genes that lie in a single
operon are shown in the same color.
Genes were assumed to be located in a
single operon when they were
transcribed in the same direction and
the stretches of non-coding DNA
separating them were less than 50
nucleotides in length.

M. A. Huynen, T. Dandekar and P. Bork ``Variation and evolution of the citric acid cycle: a genomic approach'' Trends Microbiol, 7, 281-29
(1999)
Thinking about evolution
• Is the evolutionary model applicable to other
systems?
– Story telling in old cultures
– Richard Dawkins’ book The Selfish Gene introduces
memes
• The Genetic Algorithm (GA) is arguably the best
computational optimisation strategy around, and is
based entirely on Darwinian evolution
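The four requirements for evolution map directly onto the parts of a genetic algorithm. This is a minimal sketch on a toy problem (maximise the number of 1s in a bit string); the problem, the function names and every parameter value are assumptions chosen for illustration.

```python
import random

def fitness(bits):
    """Toy fitness: the number of 1s in the bit string."""
    return sum(bits)

def genetic_algorithm(n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    # Template: each individual is a list of bits.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half become parents.
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        # Elitism: copy the best individual unchanged.
        children = [parents[0][:]]
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            # Variation 1: single-point crossover.
            cut = rng.randrange(1, n_bits)
            child = mum[:cut] + dad[cut:]
            # Variation 2: point mutations flip individual bits.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            children.append(child)
        population = children
    return max(population, key=fitness)

best = genetic_algorithm()
print(fitness(best), best)
```

Because the best individual is always copied unchanged, the best fitness can never decrease from one generation to the next.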
