Bioinformatics For High School

This document provides an introduction to bioinformatics for high school students. It discusses how bioinformatics uses computational methods to analyze biological data from genomes and DNA sequences to understand the information encoded in genomes and link that information to cellular behaviors and biological systems. Key topics covered include gene prediction, protein function prediction, regulatory signal identification, and biomarker identification using gene expression data. The document emphasizes how high-throughput biological data combined with bioinformatics is transforming biology into a more quantitative science.

Uploaded by

Hilman Taufiq
Copyright
© Attribution Non-Commercial (BY-NC)

An Introduction to Bioinformatics

(high-school version)

Ying Xu
Institute of Bioinformatics, and Biochemistry and Molecular Biology Department
University of Georgia
[email protected]
The Basics
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgat
cgtgtgggtagtagctgatatgatgcgaggtaggggataggatag
caacagatgagcggatgctgagtgcagtggcatgcgatgtcgatg
atagcggtaggtagacttcgcgcataaagctgcgcgagatgattg
caaagragttagatgagctgatgctagaggtcagtgactgatgatc
gatgcatgcatggatgatgcagctgatcgatgtagatgcaataagt
cgatgatcgatgatgatgctagatgatagctagatgtgatcgatggt
aggtaggatggtaggtaaattgatagatgctagatcgtaggta…
………………………………

cell chromosome genome and sequencing

genes protein metabolic pathway/network


Bioinformatics
(or computational biology)
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgat
cgtgtgggtagtagctgatatgatgcgaggtaggggataggatag
caacagatgagcggatgctgagtgcagtggcatgcgatgtcgatg
atagcggtaggtagacttcgcgcataaagctgcgcgagatgattg
caaagragttagatgagctgatgctagaggtcagtgactgatgatc
gatgcatgcatggatgatgcagctgatcgatgtagatgcaataagt
cgatgatcgatgatgatgctagatgatagctagatgtgatcgatggt
aggtaggatggtaggtaaattgatagatgctagatcgtaggta…
………………………………

• This interdisciplinary science … is about providing computational support to studies on linking
the behavior of cells, organisms and populations to the information encoded in the genomes

– Temple Smith
Information Encoded in Genomes

• What information? And how to find and interpret it?


ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcga
ggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggta
ggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtga
ctgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgct
agatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggta……
……………………………

• Working molecules (proteins, RNAs) in our cells

bacterial cell
Information Encoded in Genomes

• How to find where protein-encoding genes are in a genome?


ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgc
gatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgat
cgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggta………………………

• A genome is like a book written in “words” consisting of 4 letters (A, C, G, T), and each
protein-encoding gene is like an instruction about how the protein is made

• People have found that six-letter words (e.g., AAGTGC) occur with different frequencies
in genes than in non-gene regions
Information Encoded in Genomes

Frequency in genes (AAA ATT) = 1.4%; Frequency in non-genes (AAA ATT) = 5.2%
Frequency in genes (AAA GAC) = 1.9%; Frequency in non-genes (AAA GAC) = 4.8%
Frequency in genes (AAA TAG) = 0.0%; Frequency in non-genes (AAA TAG) = 6.3%
….
AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT …..

Is this a gene or non-gene region if you have to make a bet?


Information Encoded in Genomes
• Preference model:
– for each 6-letter word X (e.g., AAA AAA), calculate its frequencies in gene and
non-gene regions, FC(X), FN(X)
– calculate X’s preference value P(X) = log (FC(X)/FN(X))

• Properties:
– P(X) is 0 if X has the same frequency in gene and non-gene regions
– P(X) is positive if X is more frequent in gene than in non-gene regions;
the larger the difference, the more positive the score
– P(X) is negative if X is more frequent in non-gene than in gene regions;
the larger the difference, the more negative the score

• Gene prediction: given a DNA region, calculate the sum of P(X) values
for all 6-letter words X in the region;
– if the sum is larger than zero, predict “gene”
– otherwise predict non-gene
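
The preference model above can be sketched in a few lines of Python. This is a minimal sketch, not the author's implementation: the function names, the reading of a region as consecutive non-overlapping six-letter words, the small floor used in place of a 0.0% frequency, and the choice to give a score of zero to words whose frequencies the slides do not list are all assumptions made for illustration. The three frequency pairs are the ones quoted on the earlier slide.

```python
import math

# Frequencies (in percent) quoted on the earlier slide: word -> (in genes, in non-genes).
FREQS = {
    "AAAATT": (1.4, 5.2),
    "AAAGAC": (1.9, 4.8),
    "AAATAG": (0.0, 6.3),
}

# Assumption: a 0.0% frequency is replaced by a small floor so that log() is defined.
FLOOR = 0.01


def preference(word, freqs=FREQS):
    """P(X) = log(FC(X) / FN(X)); returns 0.0 for words with no listed frequencies."""
    if word not in freqs:
        return 0.0  # assumption: an unlisted word is treated as uninformative
    fc, fn = freqs[word]
    return math.log(max(fc, FLOOR) / max(fn, FLOOR))


def words(region, size=6):
    """Split a region into consecutive, non-overlapping words (an assumed reading)."""
    return [region[i:i + size] for i in range(0, len(region) - size + 1, size)]


def region_score(region):
    """Sum of P(X) over all six-letter words in the region."""
    return sum(preference(w) for w in words(region.upper()))


def predict(region):
    """Predict 'gene' if the summed preference is above zero, else 'non-gene'."""
    return "gene" if region_score(region) > 0 else "non-gene"


example = "AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT"
print(predict(example))  # prints "non-gene"
```

With these numbers every listed word is more common in non-gene regions, so the sum is negative and the safer bet for the example sequence is "non-gene".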
Information Encoded in Genomes

• You just learned your first bioinformatics method for gene prediction – congratulations!
Information Encoded in Genomes

• OK, we have now learned how to find genes encoded in a genome

• How do we find out what they do (their biological functions,
e.g. sensors, transporters, regulators, enzymes)?
Information Encoded in Genomes

• People have observed that similar protein sequences tend to have similar functions

• Over the years, many genes have been thoroughly studied in different organisms,
e.g., human, mouse, fly, …., rice, …
– their biological functions have been identified and documented

• For a new protein, scientists can often predict its function by identifying
well-studied proteins in other organisms that have high sequence similarity to it
– This works for ~60% of genes in a newly sequenced genome
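
The idea of transferring a function from a well-studied protein can be sketched as follows. This is only an illustration: real pipelines rely on sequence alignment tools, whereas this sketch substitutes the ratio from Python's `difflib.SequenceMatcher` as a crude similarity measure. The toy sequences, the function labels and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical "well-studied" proteins: name -> (amino-acid sequence, documented function).
KNOWN_PROTEINS = {
    "toy_transporter": ("MKTLLVAGLLAVSSAAHAGEWQ", "transporter"),
    "toy_enzyme": ("MSDNRPYFCHIQWDNRPYFCHI", "enzyme"),
}


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for a proper alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def predict_function(query, known=KNOWN_PROTEINS, threshold=0.8):
    """Return the function of the most similar known protein, or None if no hit is similar enough."""
    best_name = max(known, key=lambda name: similarity(query, known[name][0]))
    best_seq, best_function = known[best_name]
    if similarity(query, best_seq) >= threshold:
        return best_function
    return None  # no confident hit: the function stays unknown


new_protein = "MKTLLVAGLLAVSSAGHAGEWQ"  # differs from toy_transporter at one position
print(predict_function(new_protein))   # closest hit is toy_transporter
print(predict_function("PPPPPPPPPP"))  # nothing similar enough, so None
```

The `None` case corresponds to the genes for which this approach does not work: no well-studied protein is similar enough to borrow a function from.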
Information Encoded in Genomes

• Scientists have developed computational techniques for
– identifying regulatory signals that control gene transcription
– predicting protein-protein interactions
– elucidating biological networks for a particular function
– …... and elucidating many other kinds of information
Information Encoded in Genomes

E. coli O157 and O111 are pathogenic to humans while E. coli K-12 is not.
Can we tell why? Which genes or pathways in E. coli O157 and O111
are responsible for the pathogenicity?
Information Encoded in Genomes

Random seq
human chromosome #1
P. furiosus
B. pseudomallei
E. coli O157
E. coli K-12
Information Encoded in Genomes

Red: prokaryotes
Blue: eukaryotes
Green: plastids
Orange: plasmids
Black: mitochondria

x-axis: average of variations of the K-mer frequencies
y-axis: average barcode similarity among fragments of a genome
Information Encoded in Genomes

• Yes, biologists can derive a lot of information from genomes now

• … but we are far from fully understanding any genome yet, even for the
simplest living organisms, bacteria

• We can clearly use new ideas from bright young minds – interested in doing bioinformatics?
Linking Genome Information to
Biological Systems Behaviors
• To fully understand cellular behaviors, we need to
– elucidate information encoded in the genome, and
– understand how the working molecules encoded by the genome behave
according to the physical laws on Earth!

ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagca
acagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaag……………………… gene

protein
Key Drivers of Bioinformatics
• The Human Genome Project has fundamentally changed biological science

• A key consequence of the genome project is that scientists learned they can
produce biological data on a massive scale
– genome sequences
– microarray data for gene expression levels
– yeast two-hybrid systems for protein-protein interactions
– …… and other “high-throughput” biological data

These data reflect the cellular states, molecular structures and functions, in complex ways
Key Drivers of Bioinformatics

• … and let bioinformaticians (help to) decipher the meaning of these data,
as with genome sequences

• Together, high-throughput probing technologies and bioinformatics are
transforming biological science into a new science more like physics
Key Drivers of Bioinformatics

• Like physics, where general rules and laws are taught at the start, biology will
surely be presented to future generations of students as a set of basic systems .......
duplicated and adapted to a very wide range of cellular and organismic functions,
following basic evolutionary principles constrained by Earth’s geological history.
– Temple Smith, Current Topics in Computational Molecular Biology
Biomarker Identification

• Our goal is to identify markers in blood that can tell if a person has a
particular form of cancer

…… in a similar fashion to doing a pregnancy test with a test kit, possibly at home
Biomarker Identification
• Microarray gene expression data allow comparative analyses of
gene expression patterns in cancer versus normal tissues

Finding genes showing maximum difference in their expression levels
between cancer and normal tissues

on cancer tissues on normal tissues
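
One simple way to look for such genes is to rank them by the difference in mean expression between the two groups of samples. The sketch below does this on made-up numbers; the gene names, the expression values and the use of a plain difference of means (rather than a statistical test) are all assumptions for illustration, not the method used in the study.

```python
from statistics import mean

# Hypothetical expression levels: gene -> (values in cancer samples, values in normal samples).
EXPRESSION = {
    "geneA": ([9.0, 8.5, 9.5], [2.0, 2.5, 1.5]),
    "geneB": ([5.0, 5.5, 4.5], [5.0, 4.5, 5.5]),
    "geneC": ([3.0, 2.0, 4.0], [6.0, 7.0, 5.0]),
}


def mean_difference(cancer, normal):
    """Mean expression in cancer samples minus mean expression in normal samples."""
    return mean(cancer) - mean(normal)


def rank_genes(expression):
    """Sort genes by the absolute difference in mean expression, largest first."""
    diffs = {gene: mean_difference(c, n) for gene, (c, n) in expression.items()}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)


for gene, diff in rank_genes(EXPRESSION):
    print(f"{gene}: {diff:+.1f}")
```

Genes at the top of the ranking with a positive difference are the candidates that are highly expressed in cancer; a gene such as the toy `geneB`, with no difference, carries no signal.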


Biomarker Identification

proteins A, …, Z highly
expressed in cancer
Biomarker Identification
• Question: Can we predict which of these tissue marker proteins can
get secreted into blood circulation so we can get markers in blood?

• Through a literature search, we collected proteins reported to be secreted into
blood circulation under various physiological conditions

• We then trained a “classifier” to identify “features” that distinguish between
proteins that can be secreted into blood and proteins that cannot
Biomarker Identification

• We have developed a classifier to distinguish blood-secretory proteins from other proteins

• On a test set with 52 positive and 3,629 negative examples, our
classifier achieves
– 89.6% sensitivity, 98.5% specificity and 94% AUC
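
For readers new to these measures, sensitivity is the fraction of actual positives that are detected, and specificity is the fraction of actual negatives that are correctly rejected. The sketch below computes both from confusion-matrix counts. The function name and the counts in the first example are hypothetical and are not the classifier's actual results; only the test-set sizes (52 and 3,629) come from the slide.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of actual positives predicted positive
    specificity = tn / (tn + fp)  # share of actual negatives predicted negative
    return sensitivity, specificity


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
print(sensitivity_specificity(tp=9, fn=1, tn=95, fp=5))  # (0.9, 0.95)

# A classifier that calls everything "negative" on a 52 / 3,629 test set:
sens, spec = sensitivity_specificity(tp=0, fn=52, tn=3629, fp=0)
accuracy = (0 + 3629) / (52 + 3629)
print(sens, spec, round(accuracy, 3))  # 0.0 1.0 0.986
```

On such an unbalanced test set, accuracy alone would look good even for a classifier that never finds a single blood-secretory protein, which is why sensitivity and specificity are reported separately.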
Biomarker Identification
• The predicted marker proteins can be validated using
mass spectrometry experiments
Biomarker Identification

• If successful, it will be possible to test for cancer using a test kit,
much like a pregnancy test kit
Take-Home Message

• Biological science is under rapid transformation because of high-throughput
measurement technologies and bioinformatics

• As an emerging field, bioinformatics is about using computational techniques
to solve biological problems, and represents the future of biology
THANK YOU!
