Bioinformatics For High School
(high-school version)
Ying Xu
Institute of Bioinformatics, and Biochemistry and Molecular Biology Department
University of Georgia
The Basics
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgat
cgtgtgggtagtagctgatatgatgcgaggtaggggataggatag
caacagatgagcggatgctgagtgcagtggcatgcgatgtcgatg
atagcggtaggtagacttcgcgcataaagctgcgcgagatgattg
caaagragttagatgagctgatgctagaggtcagtgactgatgatc
gatgcatgcatggatgatgcagctgatcgatgtagatgcaataagt
cgatgatcgatgatgatgctagatgatagctagatgtgatcgatggt
aggtaggatggtaggtaaattgatagatgctagatcgtaggta…
………………………………
– Temple Smith
Information Encoded in Genomes
bacterial cell
Information Encoded in Genomes
Frequency in genes (AAA ATT) = 1.4%; Frequency in non-genes (AAA ATT) = 5.2%
Frequency in genes (AAA GAC) = 1.9%; Frequency in non-genes (AAA GAC) = 4.8%
Frequency in genes (AAA TAG) = 0.0%; Frequency in non-genes (AAA TAG) = 6.3%
….
AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT …..
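The frequencies above come from counting how often each 6-letter word occurs in a set of sequences. A minimal sketch of that counting step follows, assuming the sequence is read as consecutive, non-overlapping 6-letter words (the way the sample line is laid out); the function name `word_frequencies` is illustrative and not from the slides.

```python
from collections import Counter


def word_frequencies(sequence, word_length=6):
    """Return the relative frequency of each non-overlapping word.

    Assumption: words are read one after another starting at position 0,
    and any leftover letters at the end are ignored.
    """
    sequence = sequence.upper()
    words = [
        sequence[i:i + word_length]
        for i in range(0, len(sequence) - word_length + 1, word_length)
    ]
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


# The sample sequence shown on the slide (without the trailing dots).
sample = "AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT"
freqs = word_frequencies(sample)
for word, freq in sorted(freqs.items()):
    print(f"{word}: {freq:.0%}")
```

Running the same count separately on known gene regions and known non-gene regions would give two frequency tables of the kind listed above.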
• Properties:
– P(X) is 0 if X has the same frequency in gene and non-gene regions
– P(X) is positive if X is more frequent in gene than in non-gene
regions; the larger the difference, the more positive the score
– P(X) is negative if X is more frequent in non-gene than in gene
regions; the larger the difference, the more negative the score
• Gene prediction: given a DNA region, calculate the sum of P(X) values
for all 6-letter words X in the region;
– if the sum is larger than zero, predict “gene”
– otherwise, predict “non-gene”
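The scoring rule can be sketched in a few lines of Python. The exact formula for P(X) is not given in the text, so this sketch assumes a log ratio, P(X) = log(frequency in genes / frequency in non-genes), which has the three listed properties. The small constant `EPSILON` (added so that a 0.0% frequency does not produce log of zero), the choice to read non-overlapping words, and all function names are also assumptions made for illustration.

```python
import math

# Frequencies (as fractions) of the three example words given in the text.
GENE_FREQ = {"AAAATT": 0.014, "AAAGAC": 0.019, "AAATAG": 0.000}
NONGENE_FREQ = {"AAAATT": 0.052, "AAAGAC": 0.048, "AAATAG": 0.063}

EPSILON = 1e-4  # assumed small constant; avoids log(0) for 0.0% entries


def preference(word):
    """Assumed form of P(X): log ratio of gene to non-gene frequency.

    Words missing from both tables score exactly 0 (no evidence either way).
    """
    gene = GENE_FREQ.get(word, 0.0) + EPSILON
    nongene = NONGENE_FREQ.get(word, 0.0) + EPSILON
    return math.log(gene / nongene)


def six_letter_words(region):
    """Split a DNA region into consecutive, non-overlapping 6-letter words."""
    region = region.upper()
    return [region[i:i + 6] for i in range(0, len(region) - 5, 6)]


def predict(region):
    """Sum P(X) over the region; a sum above zero means 'gene'."""
    total = sum(preference(word) for word in six_letter_words(region))
    return "gene" if total > 0 else "non-gene"


print(predict("AAAATTAAAGACAAATAG"))
```

All three example words are more frequent in non-gene regions, so a region built only from them gets a negative sum. A real predictor would use a table covering all 4,096 possible 6-letter words.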
Information Encoded in Genomes
E. coli O157 and O111 are pathogenic to humans, while E. coli K-12 is not.
Can we tell why? Which genes or pathways in E. coli O157 and O111
are responsible for the pathogenicity?
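One simple way to start on this question is to compare gene lists: genes found in both pathogenic strains but missing from the harmless one are candidates worth a closer look. The sketch below shows only that set comparison. The gene names are made-up placeholders, not real E. coli genes, and the function name is illustrative; this is not presented as the method used in the original work.

```python
def candidate_genes(pathogenic_strains, harmless_strains):
    """Genes shared by every pathogenic strain and absent from all harmless ones.

    Assumption: each strain is given as a collection of gene names, and at
    least one pathogenic strain is supplied.
    """
    shared = set.intersection(*(set(genes) for genes in pathogenic_strains))
    seen_in_harmless = set().union(*(set(genes) for genes in harmless_strains))
    return shared - seen_in_harmless


# Placeholder gene lists for illustration only.
o157 = {"gene_a", "gene_b", "gene_c", "gene_d"}
o111 = {"gene_a", "gene_c", "gene_d", "gene_e"}
k12 = {"gene_a", "gene_b"}

print(sorted(candidate_genes([o157, o111], [k12])))
```

Presence in the pathogenic strains alone does not prove a gene causes disease; the result is a short list to investigate further.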
Information Encoded in Genomes
Figure panels: random sequence; human chromosome 1; P. furiosus; B. pseudomallei; E. coli O157; E. coli K-12
Information Encoded in Genomes
Figure legend: red = prokaryotes; blue = eukaryotes; green = plastids; orange = plasmids; black = mitochondria
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagca
acagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaag……………………… gene
…
protein
Key Drivers of Bioinformatics
• The Human Genome Project has fundamentally changed
biological science
proteins A, …, Z highly
expressed in cancer
Biomarker Identification
• Question: Can we predict which of these tissue marker proteins are
secreted into the bloodstream, so that they can be detected as markers in blood?
• On a test set with 52 positive and 3,629 negative examples, our
classifier achieves
– 89.6% sensitivity, 98.5% specificity, and 94% AUC
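For readers new to these terms, sensitivity is the fraction of true positives the classifier catches, and specificity is the fraction of true negatives it correctly rejects. The sketch below shows the two formulas. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the counts behind the numbers reported above, and AUC is not covered here.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that were predicted positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that were predicted negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for illustration only (not from the slides).
tp, fn = 45, 5      # 50 proteins that really are secreted into blood
tn, fp = 980, 20    # 1,000 proteins that really are not

print(f"sensitivity = {sensitivity(tp, fn):.1%}")
print(f"specificity = {specificity(tn, fp):.1%}")
```

When negatives greatly outnumber positives, as in the test set described above, even a high specificity can still leave more false positives than true positives, which is one reason experimental validation matters.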
Biomarker Identification
• The predicted marker proteins can be validated using
mass spectrometry experiments
Biomarker Identification