Introduction to Bioinformatics and Biocomputing I: Dr Tan Tin Wee, Director, Bioinformatics Centre
Dr Tan Tin Wee
Director
Bioinformatics Centre
http://www.bic.nus.edu.sg/
http://www.apbionet.org/
Twin pillars of Economy
Two Major Late 20th Century Technologies
[Chart: Internet (Nodes) and Bases plotted by Year; vertical axis from 200,000,000 to 1,400,000,000]
Two Serious Problems
• Overwhelming rate of unorganised
proliferation of insufficiently structured
scientific data in some disciplines
- example in life sciences: Genome Project etc.
• Low and Uncertain Performance of
Network Data Communications and
bandwidth limitation
– example of APBionet-APAN collaboration
Bottleneck
• From Sequence
• To Structure
• To Function
• Predicting Function: From Genes to
Genomes and Back
» J. Mol. Biol. (1998) 283, 707-725
Goal of the Human Genome Program
(and the Genomes Projects!)
• Sequence the 3 billion base pairs of
human DNA and identify the 100,000
genes contained in the human genome
• Do it for other genomes
Genomes Project!
• Human
• Vertebrates - mouse, dog, sheep, cattle, fish etc.
• Invertebrates - C. elegans, Drosophila
• Plants - Arabidopsis etc.
• Microbes
– E. coli, H. influenzae, H. pylori, Mycoplasma
genitalium, B. subtilis, Borrelia, Chlamydia, Aquifex,
Methanococcus, Methanobacterium….
DOE Program NIH PROGRAM
GENETIC SUSCEPTIBILITIES
• PREVENTIVE MEDICINE
• RISK ASSESSMENT
High-Throughput Sequencing Now
Happening Will Deluge Us with More
Data/Information
New DOE/NIH Five-Year Plan
(continued)
Sequencing - Related Goals
• Model Organisms
- C. elegans - 1998
- Drosophila - 2002
- Mouse - 2008
• Full length cDNAs - 2003
• Continued technology development
• Sustained sequencing capacity
Ambitious JGI Sequencing Goals in
FY 2000
Microbial Genome Research
• Capitalises on advances in human
genome program
• Map/sequence microbes with
- environmental/energy relevance
- phylogenetic significance
- commercial value
• Predict gene function, regulation,
and interactions
Microbial Genome Program
Sequencing Will Advance Biotechnology for
a Sustainable Future
• biosensors and biomonitoring
• bioremediation and biorestoration
• manufacturing and bioprocessing
• biofuels - biohydrogen, ethanol, biodiesel
• photosynthesis and biomass production
• disease and drought resistance
• 180° paradigm shift in how biology is done
Microbial Genome Program
Sequencing Completed --
• Mycoplasma genitalium -- free-living, smallest genome
• Methanococcus jannaschii -- methane producer
• Archaeoglobus fulgidus -- oil well souring
• Thermotoga maritima -- energy from plant biomass
• Deinococcus radiodurans -- radiation resistant, bioremediation
• Methanobacterium thermoautotrophicum -- methane producer
• Pyrobaculum aerophilum -- thermophile (100°C)
• Aquifex aeolicus VF5 -- deep branching lineage
Microbial Genome Program
Sequencing in progress --
• Pyrococcus furiosus -- model hyperthermophile
• Clostridium acetobutylicum -- biotech & waste remediation
• Shewanella putrefaciens -- bioremediation
• Pseudomonas putida -- bioremediation
• Thiobacillus ferrooxidans -- CO2 fixation
• Desulfovibrio vulgaris -- bioremediation
• Caulobacter crescentus -- bioremediation
• Chlorobium tepidum -- carbon management
• Dehalococcoides ethenogenes -- bioremediation
• Carboxydothermus hydrogenoformans -- H2 production
Challenges and Opportunities
Private and public sector
sequencing efforts are
about to drive the
genome project into……
Information Overload!!!
What is Bioinformatics?
Bioinformatics is:
• the use of computers (and persistent data
structures) in pursuit of biological research.
• an emerging new discipline, with its own
goals, research program, and practitioners.
• the sine qua non for 21st-century biology.
• the most significantly underfunded
component of 21st-century biology.
• all of the above.
Visualising Genome Information
What is Genome Annotation?
The Process of Adding Biological Information and
Predictions to a Sequenced Genome Framework
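As a rough illustration of what "adding predictions to a sequenced genome framework" can look like in code, the sketch below attaches predicted open reading frames (ORFs) to a DNA string as simple feature records. It is a toy under stated assumptions, not a method taken from these slides: the names `Feature` and `predict_orfs` are hypothetical, only the forward strand is scanned, and an ORF is taken to be an ATG followed in frame by the first stop codon.

```python
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass
class Feature:
    """One annotation record attached to a sequence (hypothetical layout)."""
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive (includes the stop codon)
    kind: str
    note: str


def predict_orfs(seq: str, min_codons: int = 3) -> list:
    """Return forward-strand ORF predictions as Feature records.

    Assumption: an ORF starts at ATG and runs in frame to the first
    stop codon; real gene finders use far richer models.
    """
    seq = seq.upper()
    features = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                j = i
                # Walk codon by codon until a stop codon or the end.
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 <= len(seq):  # a stop codon was found at j
                    codons = (j + 3 - i) // 3
                    if codons >= min_codons:
                        note = f"frame {frame}, {codons} codons incl. stop"
                        features.append(Feature(i, j + 3, "predicted_ORF", note))
                    i = j + 3  # resume scanning after this ORF
                    continue
            i += 3
    return features


if __name__ == "__main__":
    for feature in predict_orfs("CCATGAAATTTTAGGG"):
        print(feature)
```

The point of the record type is that the sequence itself stays untouched while predictions accumulate alongside it as coordinates plus notes, which is the sense in which annotation is layered onto a genome framework.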
Increasing Volume of Data:
A Biological Data Deluge
• Between now and 2003, over 50,000 new human genes and
proteins.
• 100,000 new genes and proteins from genome sequencing
of microbes and model organisms.
• Variants will be found or manufactured at high throughput
(mutants, normal population variants, man-made constructs, homologues
determined in many species from gene-specific environmental screens).
• If 40,000 per year, > 3,000 per month, > 100 per day
• In 1997 alone, about 12,000 genes and proteins were
discovered
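The per-month and per-day figures in the list above follow from simple division of the yearly count. A minimal sketch of that arithmetic, assuming a 12-month, 365-day year (the function name `discovery_rates` is made up for illustration):

```python
# Back-of-the-envelope rates for the "40,000 per year" scenario.
NEW_ENTRIES_PER_YEAR = 40_000  # figure taken from the slide


def discovery_rates(per_year: int) -> dict:
    """Convert a yearly count into approximate monthly and daily rates."""
    return {
        "per_month": per_year / 12,   # assumes 12 equal months
        "per_day": per_year / 365,    # assumes a 365-day year
    }


rates = discovery_rates(NEW_ENTRIES_PER_YEAR)
print(f"per month: {rates['per_month']:.0f}")
print(f"per day:   {rates['per_day']:.0f}")
```

Both results sit above the slide's stated thresholds of 3,000 per month and 100 per day.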
Paradigm Shift in Biology
The new paradigm, now emerging, is that all
the ‘genes’ will be known (in the sense of
being resident in databases available
electronically), and that the starting point of a
biological investigation will be theoretical. An
individual scientist will begin with a
theoretical conjecture, only then turning to
experiment to follow or test that hypothesis.
• Scientific Challenges
• Algorithmic Challenges
• Computational Challenges
IT-Biology Synergism
• Physics needs calculus, the method for
manipulating information about statistically
large numbers of vanishingly small,
independent, equivalent things.
• Biology needs information technology, the
method for manipulating information about
large numbers of dependent, historically
contingent, individual things.
Moore’s Law: The Statement