Lecture1-1 525 W16 Large

This document provides an introduction to the course Introduction to Bioinformatics. It outlines the course logistics, including lecture and lab times. It also provides an overview of the course modules, which will cover topics like sequence alignment, structural bioinformatics, and genome informatics. The homework assigned is to complete an initial questionnaire and review background reading materials provided on the course website.
INTRODUCTION TO

BIOINFORMATICS
Please take the initial BIOINF525 questionnaire:
<http://tinyurl.com/bioinf525-questions>
Barry Grant
University of Michigan
www.thegrantlab.org

BIOINF 525 http://bioboot.github.io/bioinf525_w16/ 12-Jan-2016


Barry Grant, Ph.D.
[email protected]

Ryan Mills, Ph.D.


[email protected]

Hongyang Li (GSI)
[email protected]
COURSE LOGISTICS
Lectures: Tuesdays 2:30-4:00 PM
Rm. 2062 Palmer Commons
Labs: Session I: Thursdays 2:30 - 4:00 PM
Session II: Fridays 10:30 - 12:00 PM
Rm. 2036 Palmer Commons

Website: http://tinyurl.com/bioinf525-w16
Lecture, lab and background reading material
plus homework and course announcements
MODULE OVERVIEW
Objective: Provide an introduction to the practice of
bioinformatics as well as a practical guide to using common
bioinformatics databases and algorithms

1.1 ‣ Introduction to Bioinformatics

1.2 ‣ Sequence Alignment and Database Searching

1.3 ‣ Structural Bioinformatics

1.4 ‣ Genome Informatics: High Throughput Sequencing Applications and Analytical Methods
TODAY'S MENU
Overview of bioinformatics
• The what, why and how of bioinformatics?
• Major bioinformatics research areas.
• Skepticism and common problems with bioinformatics.

Bioinformatics databases and associated tools


• Primary, secondary and composite databases.
• Nucleotide sequence databases (GenBank & RefSeq).
• Protein sequence database (UniProt).
• Composite databases (PFAM & OMIM).

Database usage vignette


• Searching with ENTREZ and BLAST.
• Reference slides and handout on major databases.
HOMEWORK

Complete the initial course questionnaire:
http://tinyurl.com/bioinf525-questions

Check out the “Background Reading” material on Ctools:
http://tinyurl.com/bioinf525-w16

Complete the lecture 1.1 homework questions:
http://tinyurl.com/bioinf525-quiz1
Q. What is Bioinformatics?
“Bioinformatics is the application of computers to the collection,
archiving, organization, and analysis of biological data.”
[After Orengo, 2003]

… Bioinformatics is a hybrid of biology and computer science

… Bioinformatics is computer aided biology!

Computer based management and analysis of biological and
biomedical data with useful applications in many disciplines,
particularly genomics, proteomics, metabolomics, etc...
MORE DEFINITIONS
‣ “Bioinformatics is conceptualizing biology in terms of
macromolecules and then applying "informatics" techniques
(derived from disciplines such as applied maths, computer
science, and statistics) to understand and organize the
information associated with these molecules, on a large-scale.”
Luscombe NM, et al. Methods Inf Med. 2001;40:346.

‣ “Bioinformatics is research, development, or application of
computational approaches for expanding the use of
biological, medical, behavioral or health data, including those
to acquire, store, organize and analyze such data.”
National Institutes of Health (NIH) ( http://tinyurl.com/l3gxr6b )
Key Point: Bioinformatics is Computer Aided Biology
Major types of Bioinformatics Data
Literature and ontologies
Gene expression
Genomes
Protein sequence

DNA & RNA sequence

Protein structure

DNA & RNA structure

Chemical entities
Protein families,
motifs and domains

Protein interactions

Pathways

Systems
Goal: Integrate sequence, 3D structure, expression patterns, interaction and function of biomolecules to gain a deeper understanding of biological mechanisms, processes and systems.
Goal: Bioinformatics aims to bridge the gap between data and knowledge.
BIOINFORMATICS RESEARCH AREAS
Include but are not limited to:
• Organization, classification, dissemination and analysis of
biological and biomedical data (particularly ‘-omics' data).
• Biological sequence analysis and phylogenetics.
• Genome organization and evolution.
• Regulation of gene expression and epigenetics.
• Biological pathways and networks in healthy & disease states.
• Protein structure prediction from sequence.
• Modeling and prediction of the biophysical properties of
biomolecules for binding prediction and drug design.
• Design of biomolecular structure and function.
With applications to Biology, Medicine, Agriculture and Industry
Where did bioinformatics come from?
Bioinformatics arose as molecular biology began to be transformed
by the emergence of molecular sequence and structural data

Recap: The key dogmas of molecular biology


• DNA sequence determines protein sequence.
• Protein sequence determines protein structure.
• Protein structure determines protein function.
• Regulatory mechanisms (e.g. gene expression) determine the
amount of a particular function in space and time.

Bioinformatics is now essential for the archiving, organization


and analysis of data related to these processes.
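The first of the relationships recapped above (DNA sequence determines protein sequence) can be illustrated with a minimal sketch. It assumes the standard genetic code and a made-up coding sequence; the function names `transcribe` and `translate` are illustrative and not part of the course material.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA equivalent of a coding-strand DNA sequence."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate a coding-strand DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical coding sequence, chosen only for illustration.
coding_dna = "ATGGCCATTGTAATGGGCCGCTGA"
print(transcribe(coding_dna))
print(translate(coding_dna))
```

The later steps (sequence to structure, structure to function) have no comparably simple lookup table, which is part of why they are active research areas.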
Why do we need Bioinformatics?
Bioinformatics is necessitated by the rapidly expanding
quantities and complexity of biomolecular data

• Bioinformatics provides
methods for the efficient:
‣ storage
‣ annotation
‣ search and retrieval
‣ data integration
‣ data mining and analysis

Bioinformatics is essential for the archiving, organization and


analysis of data from sequencing, structural genomics,
microarrays, proteomics and new high throughput assays.
[Figure: Growth in solved 3D structures]
How do we do Bioinformatics?
• A “bioinformatics approach” involves the
application of computer algorithms, computer
models and computer databases with the broad
goal of understanding the action of both individual
genes, transcripts, proteins and large collections
of these entities.

DNA       RNA             Protein
Genome    Transcriptome   Proteome
How do we actually do Bioinformatics?
Pre-packaged tools and databases
‣ Many online
‣ New tools and time consuming methods frequently
require downloading
‣ Most are free to use

Tool development
‣ Mostly on a UNIX environment
‣ Knowledge of programming languages frequently required
(Python, Perl, R, C, Java, Fortran)
‣ May require specialized or high performance computing
resources…
SIDE-NOTE: SUPERCOMPUTERS AND GPUS
To do: insert Levit’s slide on computer power increases.
Skepticism & Bioinformatics
We have to approach computational results the
same way we do wet-lab results:
• Do they make sense?
• Is it what we expected?
• Do we have adequate controls, and how did they
come out?
• Modeling is modeling, but biology is different...
What does this model actually contribute?
• Avoid the misuse of ‘black boxes’
Common problems with Bioinformatics
Confusing multitude of tools available
‣ Each with many options and settable parameters

Most tools and databases are written by and for nerds


‣ Same is true of documentation - if any exists!

Most are developed independently


Notable exceptions are found at the:
• EBI (European Bioinformatics Institute) and
• NCBI (National Center for Biotechnology Information)
Even BLAST has many settable parameters

Related tools with different terminology
Key Online Bioinformatics
Resources: NCBI & EBI
The NCBI and EBI are invaluable, publicly
available resources for biomedical research

Please take the initial BIOINF525 questionnaire:
<tinyurl.com/bioinf525-questions>

http://www.ncbi.nlm.nih.gov https://www.ebi.ac.uk
National Center for Biotechnology
Information (NCBI)
• Created in 1988 as a part of the National Library of
Medicine (NLM) at the National Institutes of Health
• NCBI’s mission includes:
‣ Establish public databases
‣ Develop software tools
‣ Education on and dissemination of biomedical information
• Location: Bethesda, MD
• We will cover a number of core NCBI databases and
software tools in the lecture
http://www.ncbi.nlm.nih.gov

Notable NCBI databases include:


GenBank, RefSeq, PubMed, dbSNP
and the search tools ENTREZ and BLAST
European Bioinformatics Institute (EBI)
• Created in 1997 as a part of the European Molecular
Biology Laboratory (EMBL)
• EBI’s mission includes:
‣ providing freely available data and bioinformatics services
‣ and providing advanced bioinformatics training
• Location: Hinxton, UK
• We will briefly cover several EBI databases and tools
that have advantages over those offered at NCBI
The EBI maintains a number of high quality
curated secondary databases and associated tools
https://www.ebi.ac.uk
The EBI makes available a wider variety of online tools than NCBI
These include multiple tools in the following areas:
• Database sequence searches
‣ FASTA, BLAST, InterProScan

• Sequence alignments
‣ Pairwise: Needle, LAlign.
‣ Multiple: ClustalW, T-Coffee, MUSCLE, Jalview

• Protein structure search and alignment


‣ PDBeFold, DALI

• and others...
‣ Genome browsers
‣ Gene expression analysis
‣ Protein function analysis
The EBI also provides a growing selection of online tutorials
on EBI databases and tools

Notable EBI databases include:


ENA, UniProt, Ensembl
and the tools FASTA, BLAST, InterProScan,
ClustalW, T-Coffee, MUSCLE, DALI, HMMER
BIOINFORMATICS DATABASES
AND ASSOCIATED TOOLS
What is a database?
Computerized store of data that is organized to provide efficient retrieval.
• Uses standardized data (record) formats to enable computer handling

Key database features allow for:


• Adding, changing, removing and merging of records
• User-defined queries and extraction of specified records

Desirable features include:


• Contains the data you are interested in
• Allows fast data access
• Provides annotation and curation of entries
• Provides links to additional information (possibly in other databases)
• Allows you to make discoveries
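The key database features listed above (adding, changing and removing records, plus user-defined queries) can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table name `record`, its columns and the descriptions are hypothetical, and the accession numbers are borrowed from the RBP4 examples elsewhere in this lecture. It does not reflect how NCBI or EBI store their data.

```python
import sqlite3

# In-memory database with one standardized record format (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record ("
    "accession TEXT PRIMARY KEY, source TEXT, description TEXT)"
)

# Adding records
conn.executemany(
    "INSERT INTO record VALUES (?, ?, ?)",
    [
        ("X02775", "GenBank", "genomic DNA sequence"),
        ("NM_006744", "RefSeq", "transcript sequence"),
        ("NP_007635", "RefSeq", "protein sequence"),
    ],
)

# Changing a record
conn.execute(
    "UPDATE record SET description = ? WHERE accession = ?",
    ("RBP4 transcript sequence", "NM_006744"),
)

# Removing a record
conn.execute("DELETE FROM record WHERE accession = ?", ("X02775",))

# User-defined query: extract only the records from one source
refseq = [
    row[0]
    for row in conn.execute(
        "SELECT accession FROM record WHERE source = ? ORDER BY accession",
        ("RefSeq",),
    )
]
print(refseq)
```

The primary key on `accession` is what makes retrieval by identifier efficient, which mirrors the role accession numbers play in the public databases.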
Bioinformatics Databases
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref,
BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase,
CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST,
dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER,
FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genlilesne, GenLink,
GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB,
HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase,
HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA,
KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI,
MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase,
MycDB, NDB, NRSub, O-lycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase,
PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS,
ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP,
SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK,
StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL
Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene,
URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, etc .................. !!!!
For an annotated list of major bioinformatics databases and tools, please see the handout <Handout_Major_Databases.pdf>
Finding Bioinformatics Databases

http://www.oxfordjournals.org/nar/database/c/
Major Molecular Databases
The most popular bioinformatics databases focus on:
• Biomolecular sequence (e.g. GenBank, UniProt)
• Biomolecular structure (e.g. PDB)
• Vertebrate genomes (e.g. Ensembl)
• Small molecules (e.g. PubChem)
• Biomedical literature (e.g. PubMed)

There are also many popular “boutique” databases for:
• Classifying protein families, domains and motifs (e.g. PFAM, PROSITE)
• Specific organisms (e.g. WormBase, FlyBase)
• Specific proteins of biomedical importance (e.g. KinaseDB, GPCRDB)
• Specific diseases, mutations (e.g. OMIM, HGMD)
• Specific fields or methods of study (e.g. GOA, IEDB)
See online: ‘Handout_Major_Databases.pdf’
Primary, secondary & composite databases
Bioinformatics databases can be usefully classified into primary,
secondary and composite according to their data source.
• Primary databases (or archival databases) consist of data derived
experimentally.
‣ GenBank: NCBI’s primary nucleotide sequence database.
‣ PDB: Protein X-ray crystal and NMR structures.

• Secondary databases (or derived databases) contain information derived
from a primary database.
‣ RefSeq: non-redundant set of curated reference sequences primarily
from GenBank
‣ PFAM: protein sequence families primarily from UniProt and PDB
• Composite databases (or metadatabases) join a variety of different primary
and secondary database sources.
‣ OMIM: catalog of human genes, genetic disorders and related literature
‣ GENE: molecular data and literature related to genes with extensive links
to other databases.
GENBANK & REFSEQ:
NCBI’S NUCLEOTIDE SEQUENCE
DATABASES
What is GenBank?

• GenBank is NCBI’s primary nucleotide-only sequence database
‣ Archival in nature - reflects the state of knowledge at time
of submission
‣ Subjective - reflects the submitter’s point of view
‣ Redundant - can have many copies of the same
nucleotide sequence

• GenBank is actually one of three collaborating international
databases from the US, Japan and Europe
‣ GenBank (US)
‣ DNA Database of Japan (DDBJ)
‣ European Nucleotide Archive (ENA)
GenBank, ENA and DDBJ
Share and synchronize data

ENA: housed at EBI (European Bioinformatics Institute)
GenBank: housed at NCBI (National Center for Biotechnology Information)
DDBJ: housed in Japan

• The underlying raw DNA sequences are identical
‣ The different sites provide different views and ways to navigate
through the data
• Access to GenBank (and other NCBI databases including
RefSeq) is typically through Entrez (the Google of NCBI);
more on this later
GenBank sequence record

GenBank flat file format has defined fields including unique identifiers
such as the ACCESSION number.
This same general format is used for other sequence database records too.
Side note: Database accession numbers

Database accession numbers are strings of letters and
numbers used as identifying labels for sequences and other
data within databases
‣ Examples (all for retinol-binding protein, RBP4):

DNA
  X02775           GenBank genomic DNA sequence
  NT_030059        Genomic contig
RNA
  N91759.1         An expressed sequence tag (1 of 170)
  NM_006744        RefSeq DNA sequence (from a transcript)
Protein
  NP_007635        RefSeq protein
  AAC02945         GenBank protein
  Q28369           UniProtKB/SwissProt protein
  1KT7             Protein Data Bank structure record
Literature
  PMID: 12205585   PubMed IDs identify articles at NCBI/NIH
GenBank sequence record

Can set different display formats here
FASTA sequence record

FASTA sequence files consist of records where each record begins
with a “>” and header information on that same line. Each
subsequent line of the record is sequence information.
This format is commonly used by sequence analysis programs.
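A minimal sketch of reading the FASTA format described above. The two records are made up for illustration and `parse_fasta` is a hypothetical helper name; established libraries handle more edge cases than this does.

```python
def parse_fasta(text):
    """Return a dict mapping each header line (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; everything after '>' is header information.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Subsequent lines are sequence and may be wrapped over many lines.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical example records.
example = """>seq1 hypothetical example record
ATGGCC
ATTGTA
>seq2 another made-up record
GGGCCC
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```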
GenBank ‘graphics’ sequence record
GenBank sequence record, cont.

The FEATURES section contains annotations including
a conceptual translation of the nucleotide sequence.

The actual sequence entry starts after the word ORIGIN
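The field layout described above (an ACCESSION line, a FEATURES section, and the sequence following ORIGIN) can be sketched as follows. The record below is a made-up, heavily abbreviated imitation of the flat file layout, not a real GenBank entry, and `parse_genbank` is a hypothetical helper. In practice an established parser such as Biopython's `SeqIO` would normally be used instead.

```python
# Made-up, abbreviated record imitating the GenBank flat file layout.
EXAMPLE_RECORD = """LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2016
DEFINITION  Made-up example record.
ACCESSION   EX000001
FEATURES             Location/Qualifiers
     CDS             1..24
                     /translation="MAIVMGR"
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_genbank(text):
    """Return (accession, sequence) from a single flat file record."""
    accession = None
    sequence_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True       # sequence lines follow this keyword
        elif line.startswith("//"):
            in_origin = False      # end-of-record marker
        elif in_origin:
            # Drop the leading position number, keep the blocks of bases.
            sequence_parts.extend(line.split()[1:])
    return accession, "".join(sequence_parts).upper()


accession, sequence = parse_genbank(EXAMPLE_RECORD)
print(accession, len(sequence))
```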
RefSeq: NCBI’s Derivative
Sequence Database
• RefSeq entries are hand-curated best representations
of a transcript or protein (in the curators’ judgement)
• Non-redundant for a given species, although alternate
transcript forms will be included if there is good
evidence
- Experimentally verified transcripts and proteins:
accession numbers begin with “NM_” or “NP_”
- Model transcripts and proteins based on bioinformatics
predictions with little experimental support:
accession numbers begin with “XM_” or “XP_”
- RefSeq also contains contigs and chromosome records
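The accession prefixes above can be turned into a small lookup. This sketch covers only the four prefixes named on this slide; the function name `classify_refseq` is illustrative, and `XM_000000` is a made-up accession used only to exercise the lookup.

```python
# Only the four prefixes named on this slide; other RefSeq prefixes exist.
REFSEQ_PREFIXES = {
    "NM_": "curated transcript",
    "NP_": "curated protein",
    "XM_": "model (predicted) transcript",
    "XP_": "model (predicted) protein",
}


def classify_refseq(accession):
    """Classify an accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "not covered by this sketch")


# NM_006744, NP_007635 and X02775 are the RBP4 examples from this lecture;
# XM_000000 is hypothetical.
for acc in ["NM_006744", "NP_007635", "XM_000000", "X02775"]:
    print(acc, "->", classify_refseq(acc))
```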
UNIPROT:
THE PREMIER PROTEIN SEQUENCE
DATABASE
UniProt: Protein sequence database

UniProt is a comprehensive, high-quality resource of protein
sequence and functional information
• UniProt comprises four databases:
1. UniProtKB (Knowledgebase)
Containing Swiss-Prot and TrEMBL components
(these correspond to hand curated and automatically annotated
entries respectively)
2. UniRef (Reference Clusters)
Filtered version of UniProtKB at various levels of sequence
identity
e.g. UniRef90 contains sequences with a maximum of 90%
sequence identity to each other
3. UniParc (Archive) with database cross-references to source.
4. UniMES (Metagenomic and Environmental Sequences)
The two sides of UniProtKB

UniProtKB/TrEMBL: redundant, automatically annotated - unreviewed
UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation - reviewed

Indicators of which part of UniProt an entry belongs
to include the color of the stars and the ID

The main information added to a
UniProt/Swiss-Prot entry

UniProt provides cross-references
to a large number of other resources
and can serve as a useful “portal”
when you first begin to investigate
a particular protein
UniProt/Swiss-Prot vs UniProt/TrEMBL

• UniProtKB/Swiss-Prot is a non-redundant database with one entry
per protein

• UniProtKB/TrEMBL is a redundant database with one entry per
translated ENA entry (ENA is the EBI’s equivalent of GenBank)
‣ Therefore TrEMBL can contain multiple entries for the same protein
‣ Multiple UniProtKB/TrEMBL entries for the same protein can arise due
to:
- Erroneous gene model predictions
- Sequence errors (frame shifts)
- Polymorphisms
- Alternative start sites
- Isoforms
- OR because the same sequence was submitted by different
people
Side note: Automatic Annotation
(sharing the wealth)
Swiss-Prot manually annotated

Same domain composition
= same function = annotation transfer

InterPro is an EBI database that
collates protein domain signatures

We will talk more about sequence
similarity and annotation transfer
next week.
DATABASE VIGNETTE
You have just come out of a seminar about gastric
cancer and one of your co-workers asks:
“What do you know about that ‘Kras’ gene the
speaker kept talking about?”
You have some recollection about hearing of ‘Ras’
before. How would you find out more?
• Google?
• Library?
• Bioinformatics databases at NCBI and EBI!
http://www.ncbi.nlm.nih.gov/

Hands on demo (or see following slides)
1 AND 2: ras AND disease (1185 results)
1 OR 2: ras OR disease (134,872 results)
1 NOT 2: ras NOT disease (84,448 results)
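The Boolean operators above behave like set operations on the records matched by each term. A minimal sketch, using tiny made-up result sets rather than real search results; `boolean_search` is a hypothetical helper name.

```python
def boolean_search(hits_1, operator, hits_2):
    """Combine two sets of matching record IDs with AND, OR or NOT."""
    if operator == "AND":
        return hits_1 & hits_2   # records matching both terms
    if operator == "OR":
        return hits_1 | hits_2   # records matching either term
    if operator == "NOT":
        return hits_1 - hits_2   # records matching term 1 but not term 2
    raise ValueError(f"unknown operator: {operator}")


# Made-up record IDs standing in for the hits of each search term.
ras_hits = {"rec1", "rec2", "rec3"}
disease_hits = {"rec3", "rec4"}

for op in ("AND", "OR", "NOT"):
    print(f"ras {op} disease:", sorted(boolean_search(ras_hits, op, disease_hits)))
```

This also explains the ordering of the result counts on the slide: AND can never return more records than either term alone, while OR can never return fewer.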
Example Questions:
What chromosome location and
what genes are in the vicinity?

Example Questions:
What ‘molecular functions’, ‘biological
processes’, and ‘cellular component’
information is available?
GO: Gene Ontology
GO provides a controlled vocabulary of terms for describing gene
product characteristics and gene product annotation data

Why do we need Ontologies?
• Annotation is essential for capturing the understanding and
knowledge associated with a sequence or other molecular entity
• Annotation is traditionally recorded as “free text”, which is easy
to read by humans, but has a number of disadvantages,
including:
‣ Difficult for computers to parse
‣ Quality varies from database to database
‣ Terminology used varies from annotator to annotator

• Ontologies are annotations using standard vocabularies that try
to address these issues
• GO is integrated with UniProt and many other databases
including a number at NCBI
GO Ontologies

• There are three ontologies in GO:
‣ Biological Process
A commonly recognized series of events
e.g. cell division, mitosis
‣ Molecular Function
An elemental activity, task or job
e.g. kinase activity, insulin binding
‣ Cellular Component
Where a gene product is located
e.g. mitochondrion, mitochondrial membrane
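A minimal sketch of why a controlled vocabulary is easier for computers to handle than free text: each annotation pairs a gene product with a term identifier, so queries become exact matches. The gene names `geneA` and `geneB` are made up, and the GO identifiers shown are believed to correspond to the named terms but should be checked against the Gene Ontology itself before being relied on.

```python
from collections import defaultdict

# (gene product, GO identifier, term name, ontology); gene names are hypothetical.
ANNOTATIONS = [
    ("geneA", "GO:0016301", "kinase activity", "molecular_function"),
    ("geneA", "GO:0005739", "mitochondrion", "cellular_component"),
    ("geneB", "GO:0051301", "cell division", "biological_process"),
    ("geneB", "GO:0016301", "kinase activity", "molecular_function"),
]


def genes_with_term(go_id):
    """All gene products annotated with exactly this term identifier."""
    return sorted({gene for gene, term_id, _, _ in ANNOTATIONS if term_id == go_id})


def terms_by_ontology(gene):
    """Group one gene product's term names by the three GO ontologies."""
    grouped = defaultdict(list)
    for annotated_gene, _, name, ontology in ANNOTATIONS:
        if annotated_gene == gene:
            grouped[ontology].append(name)
    return dict(grouped)


print(genes_with_term("GO:0016301"))
print(terms_by_ontology("geneA"))
```

With free-text annotation the same query would have to cope with spelling variants and synonyms for each term.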
The ‘Gene Ontology’ or GO is
actually maintained by the EBI so let’s
switch or link over to UniProt, also
from the EBI.

Scroll down to
UniProt link
UniProt will detail much more
information for protein coding genes
such as this one
Example Questions:
What positions in the protein are
responsible for GTP binding?
Example Questions:
What variants of this enzyme are
involved in gastric cancer and other
human diseases?
Example Questions:
Are high resolution protein structures
available to examine the details of
these mutations?
Example Questions:
What is known about the protein family,
its species distribution, number in humans
and residue-wise conservation, etc… ?

PFAM is one of the best
protein family databases
ENTREZ & BLAST:
TOOLS FOR SEARCHING AND ACCESSING
MOLECULAR DATA AT NCBI
Entrez: Integrated search of NCBI databases

Entrez is available from
the main NCBI homepage
or from the homepage of
individual databases
Entrez: navigating across databases

Entrez was set up to allow you to navigate to
related data in different databases without
having to run additional searches.
Relies on pre-computed and pre-compiled
data links:
• Neighbor knowledge based on calculations
• Hard links based on things we know about

[Diagram: Entrez links among PubMed abstracts (word weight), Taxonomy (phylogeny), 3-D Structure (VAST, related structures), Gene, Nucleotide sequences and Protein sequences (BLAST, related sequences), BLink and Domains]
Global Entrez Query: All NCBI Databases

http://www.ncbi.nlm.nih.gov/gquery/

The Entrez system: 38 (and counting)
integrated databases
Search Results

Discovery Column
(sort, filter, link)
Limits
Advanced: Search Builder

Helps build complex fielded queries

Items from search history can be included / combined / modified

Complex Query Results

("Danio rerio"[Organism] AND "creatine kinase"[Title]) AND "refseq"[Filter] AND mrna[Filter]
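The fielded query above can also be assembled programmatically. The sketch below only builds the query string and a search URL; it makes no network request. It assumes NCBI's E-utilities `esearch` endpoint and that the query is aimed at the Nucleotide database (`nuccore`), neither of which is stated on the slide. The helper names `fielded` and `and_join` are illustrative.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint: NCBI E-utilities esearch (not mentioned on the slide).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, field):
    """Quote a term and restrict it to one Entrez search field."""
    return f'"{term}"[{field}]'


def and_join(*clauses):
    """Combine clauses with the Boolean AND operator."""
    return " AND ".join(clauses)


query = and_join(
    "(" + and_join(fielded("Danio rerio", "Organism"),
                   fielded("creatine kinase", "Title")) + ")",
    fielded("refseq", "Filter"),
    "mrna[Filter]",
)

# db="nuccore" is an assumption about which database the slide searched.
url = EUTILS_ESEARCH + "?" + urlencode({"db": "nuccore", "term": query})

print(query)
print(url)
```

Building the query from small pieces mirrors what the Search Builder does interactively: each clause is a term plus a field, and clauses are combined with Boolean operators.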
Controlled Vocabularies
‣ Taxonomy: primary controlled vocabulary / classification
system for molecular databases at NCBI

‣ Medical Subject Headings (MeSH): primary controlled
vocabulary / classification system (ontology) for
the biomedical literature (PubMed) at NCBI
BLAST is a very important tool available from the NCBI
Homepage
http://www.ncbi.nlm.nih.gov/guide/
BLAST – Basic Local Alignment
Search Tool
http://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST performs sequence
similarity searches of query
sequences vs sequence databases.
We will cover this in detail in the
next lecture.
SUMMARY
• Bioinformatics is computer aided biology.
• Bioinformatics deals with the collection, archiving,
organization, and interpretation of a wide range of biological
data.
• There are a large number of primary, secondary and composite
bioinformatics databases.
• The NCBI and EBI are major online bioinformatics service
providers.
• Introduced GenBank, RefSeq, UniProt, PDB databases as well
as a number of ‘boutique’ databases including PFAM and
OMIM.
• Introduced the notion of controlled vocabularies and
ontologies.
• Described the use of ENTREZ and BLAST for searching
databases.
ADDITIONAL DATABASES OF NOTE
(SLIDES FOR YOUR REFERENCE)
NCBI Metadatabases
• Gene
‣ molecular data and literature related to genes

• HomoloGene
‣ automated collection of homologous genes from selected
eukaryotes
• Taxonomy
‣ access to NCBI data through source organism taxonomic
classification
• PubChem
‣ small organic molecules and their biological activities

• BioSystems
‣ biochemical pathways and processes linked to NCBI genes,
gene products, small molecules, and structures
PubMed

• Curated database of biomedical journal articles


• Data records are annotated with MeSH terms
(Medical Subject Headings)
• Contract workers actually read all of the articles
and classify them with the MeSH terms
• PubMed entries contain article abstracts
• PubMed Central contains full journal articles, but
the majority are not freely re-distributable

PubMed results
Limits and Advanced search can be used
to refine searches
Small molecule databases have been added at NCBI
http://pubchem.ncbi.nlm.nih.gov/
HomoloGene - Homologous genes from different
organisms http://www.ncbi.nlm.nih.gov/homologene
Online Mendelian Inheritance in Man –
OMIM
http://www.ncbi.nlm.nih.gov/omim

OMIM is essentially a set of reviews of human genes, gene function and
phenotypes. Includes causative mutations where known.
The NCBI Bookshelf includes many well known
molecular biology texts.
http://www.ncbi.nlm.nih.gov/books/
GEO: Gene Expression Omnibus

• Gene expression data (mostly from microarrays but
also RNA-seq data, two methods for measuring RNA
levels)

Query, browse and
download data sets

• Series - (GSExxx) is an original submitter-supplied
record that summarizes a study. May contain
multiple individual Samples (GSMxxx).

An individual GSE
series entry describes
the experiment and
gives access to the
experimental data

• DataSets - (GDSxxx) are curated collections of
selected Samples that are biologically and
statistically comparable

Curated expression data

[Figure: expression profile across tissues, e.g. Heart and Skeletal Muscle]
QuickGO is a fast web-based browser of the Gene Ontology and
Gene Ontology annotation data

GO annotation in UniProt
An example UniProt entry for hemoglobin beta (HBB_HUMAN,
P68871) with GO annotation displayed.
DAVID: an online tool for assessing
GO term enrichment in gene lists

DAVID allows you to upload lists of genes
and search for enriched GO terms and for
functionally related genes not in your list
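The idea of GO term enrichment can be sketched with a one-sided hypergeometric test: given how many genes in the background carry a term, how surprising is the number seen in your list? This is a generic illustration, not DAVID's own scoring method (DAVID reports a modified Fisher exact statistic), and all of the counts below are made up. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Probability of seeing k or more annotated genes in the list by chance.

    N: genes in the background     K: background genes annotated with the term
    n: genes in the uploaded list  k: list genes annotated with the term
    """
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Made-up counts: 200 of 20,000 background genes carry the term,
# and 10 of the 100 genes in the uploaded list carry it.
p = enrichment_p_value(k=10, n=100, K=200, N=20000)
print(f"p = {p:.3g}")
```

A real analysis would also correct for testing many GO terms at once, which this sketch leaves out.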

http://david.abcc.ncifcrf.gov

Example output: enriched functions
from GO