Lecture1-1 525 W16 Large
BIOINFORMATICS
Please take the initial BIOINF525 questionnaire: <http://tinyurl.com/bioinf525-questions>
Barry Grant
University of Michigan
www.thegrantlab.org
Hongyang Li (GSI)
[email protected]
COURSE LOGISTICS
Lectures: Tuesdays 2:30-4:00 PM
Rm. 2062 Palmer Commons
Labs: Session I: Thursdays 2:30 - 4:00 PM
Session II: Fridays 10:30 - 12:00 PM
Rm. 2036 Palmer Commons
Website: http://tinyurl.com/bioinf525-w16
Lecture, lab and background reading material
plus homework and course announcements
MODULE OVERVIEW
Objective: Provide an introduction to the practice of
bioinformatics as well as a practical guide to using common
bioinformatics databases and algorithms
Major types of Bioinformatics Data
Literature and ontologies
Gene expression
Genomes
Protein sequence
Protein structure
DNA & RNA structure
Chemical entities
Protein families, motifs and domains
Protein interactions
Pathways
Systems
Goal: Bioinformatics aims to bridge the gap between data and knowledge.
BIOINFORMATICS RESEARCH AREAS
Include but are not limited to:
• Organization, classification, dissemination and analysis of
biological and biomedical data (particularly ‘-omics’ data).
• Biological sequence analysis and phylogenetics.
• Genome organization and evolution.
• Regulation of gene expression and epigenetics.
• Biological pathways and networks in healthy & disease states.
• Protein structure prediction from sequence.
• Modeling and prediction of the biophysical properties of
biomolecules for binding prediction and drug design.
• Design of biomolecular structure and function.
With applications to Biology, Medicine, Agriculture and Industry
Where did bioinformatics come from?
Bioinformatics arose as molecular biology began to be transformed
by the emergence of molecular sequence and structural data
• Bioinformatics provides
methods for the efficient:
‣ storage
‣ annotation
‣ search and retrieval
‣ data integration
‣ data mining and analysis
Tool development
‣ Mostly in a UNIX environment
‣ Knowledge of programming languages frequently required
(Python, Perl, R, C, Java, Fortran)
‣ May require specialized or high performance computing
resources…
SIDE-NOTE: SUPERCOMPUTERS AND GPUS
Skepticism & Bioinformatics
We have to approach computational results the
same way we do wet-lab results:
• Do they make sense?
• Is it what we expected?
• Do we have adequate controls, and how did they
come out?
• Modeling is modeling, but biology is different...
What does this model actually contribute?
• Avoid the misuse of ‘black boxes’
Common problems with Bioinformatics
Confusing multitude of tools available
‣ Each with many options and settable parameters
Key Online Bioinformatics
Resources: NCBI & EBI
The NCBI and EBI are invaluable, publicly
available resources for biomedical research
http://www.ncbi.nlm.nih.gov https://www.ebi.ac.uk
National Center for Biotechnology
Information (NCBI)
• Created in 1988 as a part of the National Library of
Medicine (NLM) at the National Institutes of Health
• NCBI’s mission includes:
‣ Establish public databases
‣ Develop software tools
‣ Education on and dissemination of biomedical information
(Located in Bethesda, MD)
European Bioinformatics Institute (EBI)
• Created in 1994 as a part of the European Molecular
Biology Laboratory (EMBL)
• EBI’s mission includes:
‣ providing freely available data and bioinformatics services
‣ and providing advanced bioinformatics training
(Located in Hinxton, UK)
• Sequence alignments
‣ Pairwise: Needle, LAlign.
‣ Multiple: ClustalW, T-Coffee, MUSCLE, Jalview
• and others...
‣ Genome browsers
‣ Gene expression analysis
‣ Protein function analysis
The EBI also provides a growing selection of online tutorials
on EBI databases and tools
http://www.oxfordjournals.org/nar/database/c/
Major Molecular Databases
The most popular bioinformatics databases focus on:
GenBank flat file format has defined fields including unique identifiers
such as the ACCESSION number.
This same general format is used for other sequence database records too.
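To make the idea of defined fields concrete, here is a minimal Python sketch that pulls the LOCUS, ACCESSION and VERSION fields out of a flat-file header. The record text is an abbreviated, made-up example laid out like a GenBank entry (it is not a real database record), and `parse_header_fields` is a hypothetical helper name. Real analyses would normally rely on an established parser rather than hand-written code like this.

```python
# Made-up, abbreviated record laid out like a GenBank flat file (not a real entry).
EXAMPLE_RECORD = """\
LOCUS       EXAMPLE01               60 bp    DNA     linear   SYN 01-JAN-2016
DEFINITION  Illustrative record for teaching purposes only.
ACCESSION   ZZ999999
VERSION     ZZ999999.1
ORIGIN
        1 atggcgtacg ttagcatcga tcgatcgtac gatcgatcga tcgtagctag ctagctagca
//
"""


def parse_header_fields(record_text, wanted=("LOCUS", "ACCESSION", "VERSION")):
    """Return {field_name: value} for top-level fields that start in column 1."""
    fields = {}
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            break  # the sequence block follows; no more header fields
        parts = line.split(None, 1)
        # Top-level field names begin in the first column; continuation
        # lines are indented and are skipped by this simple sketch.
        if parts and not line[0].isspace() and parts[0] in wanted:
            fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return fields


header = parse_header_fields(EXAMPLE_RECORD)
print(header["ACCESSION"])
print(header["VERSION"])
```

Because every record places the identifier after the same field name, the same few lines work on any record that follows this layout.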
Side note: Database accession numbers
UniProtKB/TrEMBL UniProtKB/Swiss-Prot
UniProt provides cross-references
to a large number of other resources
and can serve as a useful “portal”
when you first begin to investigate
a particular protein
UniProt/Swiss-Prot vs UniProt/TrEMBL
Same domain composition
= same function = annotation transfer
DATABASE VIGNETTE
You have just come out of a seminar about gastric
cancer and one of your co-workers asks:
“What do you know about that ‘Kras’ gene the
speaker kept talking about?”
You have some recollection about hearing of ‘Ras’
before. How would you find out more?
• Google?
• Library?
• Bioinformatics databases at NCBI and EBI!
http://www.ncbi.nlm.nih.gov/
Hands-on demo (or see following slides)
1 AND 2: ras AND disease (1185 results)
1 OR 2: ras OR disease (134,872 results)
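The difference in result counts follows from how the two operators combine hits: AND keeps only records matching both terms, while OR keeps records matching either. A small Python sketch with sets shows this. The record IDs below are hypothetical and chosen only for illustration.

```python
# Hypothetical record IDs returned by two single-term searches.
hits_ras = {101, 102, 103, 104}
hits_disease = {103, 104, 105, 106, 107}

# "ras AND disease": records that match both terms (set intersection).
both = hits_ras & hits_disease

# "ras OR disease": records that match at least one term (set union).
either = hits_ras | hits_disease

print(sorted(both))
print(sorted(either))

# AND can never return more records than OR for the same pair of terms.
print(len(both) <= len(either))
```

This is why AND is the usual way to narrow a search and OR the usual way to broaden it.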
Example Questions:
What ‘molecular functions’, ‘biological
processes’, and ‘cellular component’
information is available?
GO: Gene Ontology
GO provides a controlled vocabulary of terms for describing gene
product characteristics and gene product annotation data
Why do we need Ontologies?
• Annotation is essential for capturing the understanding and
knowledge associated with a sequence or other molecular entity
• Annotation is traditionally recorded as “free text”, which is easy
to read by humans, but has a number of disadvantages,
including:
‣ Difficult for computers to parse
‣ Quality varies from database to database
‣ Terminology used varies from annotator to annotator
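A short Python sketch can illustrate why free text is hard for computers to work with. The gene names and free-text phrasings below are hypothetical. GO:0005525 is the GO identifier for "GTP binding". The sketch assumes a simple exact-match search, which is only one of many possible search strategies.

```python
# Free-text annotations: three annotators describe the same function differently.
free_text = {
    "geneA": "binds GTP",
    "geneB": "GTP-binding",
    "geneC": "guanosine triphosphate binding",
}

# An exact-match search for one phrasing misses all three genes.
query = "GTP binding"
free_hits = sorted(g for g, text in free_text.items() if text == query)

# The same annotations recorded with a controlled-vocabulary identifier.
controlled = {
    "geneA": {"GO:0005525"},
    "geneB": {"GO:0005525"},
    "geneC": {"GO:0005525"},
}

# Searching by term identifier finds every gene, whoever annotated it.
go_hits = sorted(g for g, terms in controlled.items() if "GO:0005525" in terms)

print(free_hits)
print(go_hits)
```

With a shared identifier the computer compares exact tokens instead of trying to interpret wording.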
GO Ontologies
Scroll down to
UniProt link
UniProt will detail much more
information for protein coding genes
such as this one
Example Questions:
What positions in the protein are
responsible for GTP binding?
Example Questions:
What variants of this enzyme are
involved in gastric cancer and other
human diseases?
Example Questions:
Are high resolution protein structures
available to examine the details of
these mutations?
Example Questions:
What is known about the protein family,
its species distribution, number in humans
and residue-wise conservation, etc… ?
[Figure: Entrez links between nucleotide sequence and protein sequence records: BLAST neighbors (related sequences), hard links, BLink, and domains]
Global Entrez Query: All NCBI Databases
http://www.ncbi.nlm.nih.gov/gquery/
Discovery Column
(sort, filter, link)
Limits
Search Results
Advanced: Search Builder
• HomoloGene
‣ automated collection of homologous genes from selected
eukaryotes
• Taxonomy
‣ access to NCBI data through source organism taxonomic
classification
• PubChem
‣ small organic molecules and their biological activities
• BioSystems
‣ biochemical pathways and processes linked to NCBI genes,
gene products, small molecules, and structures
PubMed
PubMed results
Limits and Advanced search can be used
to refine searches
Small molecule databases have been added at NCBI
http://pubchem.ncbi.nlm.nih.gov/
HomoloGene - Homologous genes from different
organisms http://www.ncbi.nlm.nih.gov/homologene
Online Mendelian Inheritance in Man (OMIM)
http://www.ncbi.nlm.nih.gov/omim
• Series - (GSExxx) is an original submitter-supplied
record that summarizes a study. May contain
multiple individual Samples (GSMxxx).
An individual GSE
series entry describes
the experiment and
gives access to the
experimental data
• DataSets - (GDSxxx) are curated collections of
selected Samples that are biologically and
statistically comparable
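The accession prefixes described above (GSE for Series, GSM for Samples, GDS for DataSets) make it easy to tell record types apart programmatically. The following Python sketch shows one way to do so. `classify_geo_accession` is a hypothetical helper, the accession numbers are arbitrary placeholders rather than references to particular studies, and other GEO record types are not covered.

```python
import re

# Prefix-to-record-type mapping taken from the definitions above.
GEO_KINDS = {"GSE": "Series", "GSM": "Sample", "GDS": "DataSet"}

# A three-letter prefix followed by one or more digits.
_GEO_PATTERN = re.compile(r"^(GSE|GSM|GDS)(\d+)$")


def classify_geo_accession(accession):
    """Return the record type for an accession, or None if it is not recognized."""
    match = _GEO_PATTERN.match(accession.strip().upper())
    return GEO_KINDS[match.group(1)] if match else None


# Arbitrary placeholder accessions, used only to exercise the function.
for acc in ["GSE12345", "GSM67890", "GDS111", "XYZ1"]:
    print(acc, classify_geo_accession(acc))
```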
Expression profile
GO Ontologies
GO annotation in UniProt
An example UniProt entry for hemoglobin beta (HBB_HUMAN,
P68871) with GO annotation displayed.
DAVID: an online tool for assessing
GO term enrichment in gene lists
http://david.abcc.ncifcrf.gov
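To give a feel for what "enrichment" means, the sketch below computes a plain hypergeometric tail probability: the chance of seeing at least the observed number of genes with a given GO term in a gene list, if the list were drawn at random from the background. This is a generic textbook calculation and is not claimed to be the exact statistic DAVID reports. The function name and all counts in the usage lines are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_list, n_overlap):
    """P(X >= n_overlap) when n_list genes are drawn without replacement
    from n_background genes, n_annotated of which carry the GO term."""
    total = comb(n_background, n_list)
    upper = min(n_annotated, n_list)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20000 background genes, 200 annotated with the term,
# a list of 100 genes of which 8 carry the term.
p_value = hypergeom_enrichment_p(20000, 200, 100, 8)
print(p_value)
```

A small probability suggests the term appears in the list more often than random sampling would explain. When many terms are tested at once, a multiple-testing correction is also needed.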
Example output: enriched functions
from GO