Lecture1-1 525 W16 Large

This document provides an introduction to the course Introduction to Bioinformatics. It outlines the course logistics, including lecture and lab times. It also provides an overview of the course modules, which will cover topics like sequence alignment, structural bioinformatics, and genome informatics. The homework assigned is to complete an initial questionnaire and review background reading materials provided on the course website.

Uploaded by

Gana Ibrahim Hassan

INTRODUCTION TO
BIOINFORMATICS
Please take the initial BIOINF525 questionnaire:
< http://tinyurl.com/bioinf525-questions >
Barry Grant
University of Michigan
www.thegrantlab.org

BIOINF 525 http://bioboot.github.io/bioinf525_w16/ 12-Jan-2016

Barry Grant, Ph.D.
[email protected]

Ryan Mills, Ph.D.

[email protected]

Hongyang Li (GSI)
[email protected]
COURSE LOGISTICS
Lectures: Tuesdays 2:30-4:00 PM
Rm. 2062 Palmer Commons
Labs: Session I: Thursdays 2:30 - 4:00 PM
Session II: Fridays 10:30 - 12:00 PM
Rm. 2036 Palmer Commons

Website: http://tinyurl.com/bioinf525-w16
Lecture, lab and background reading material
plus homework and course announcements
MODULE OVERVIEW
Objective: Provide an introduction to the practice of
bioinformatics as well as a practical guide to using common
bioinformatics databases and algorithms

1.1 ‣ Introduction to Bioinformatics

1.2 ‣ Sequence Alignment and Database Searching

1.3 ‣ Structural Bioinformatics

1.4 ‣ Genome Informatics: High Throughput Sequencing Applications and Analytical Methods
TODAY'S MENU
Overview of bioinformatics
• The what, why and how of bioinformatics?
• Major bioinformatics research areas.
• Skepticism and common problems with bioinformatics.

Bioinformatics databases and associated tools

• Primary, secondary and composite databases.
• Nucleotide sequence databases (GenBank & RefSeq).
• Protein sequence database (UniProt).
• Composite databases (PFAM & OMIM).

Database usage vignette

• Searching with ENTREZ and BLAST.
• Reference slides and handout on major databases.
HOMEWORK

Complete the initial course questionnaire:

http://tinyurl.com/bioinf525-questions

Check out the “Background Reading” material on Ctools:

http://tinyurl.com/bioinf525-w16

Complete the lecture 1.1 homework questions:

http://tinyurl.com/bioinf525-quiz1
Q. What is Bioinformatics?
“Bioinformatics is the application of computers to the collection,
archiving, organization, and analysis of biological data.”
[After Orengo, 2003]

… Bioinformatics is a hybrid of biology and computer science

… Bioinformatics is computer aided biology!

Computer based management and analysis of biological and
biomedical data with useful applications in many disciplines,
particularly genomics, proteomics, metabolomics, etc.
MORE DEFINITIONS
‣ “Bioinformatics is conceptualizing biology in terms of
macromolecules and then applying "informatics" techniques
(derived from disciplines such as applied maths, computer
science, and statistics) to understand and organize the
information associated with these molecules, on a large-scale.”
Luscombe NM, et al. Methods Inf Med. 2001;40:346.

‣ “Bioinformatics is research, development, or application of
computational approaches for expanding the use of
biological, medical, behavioral or health data, including those
to acquire, store, organize and analyze such data.”
National Institutes of Health (NIH) ( http://tinyurl.com/l3gxr6b )
Key Point: Bioinformatics is Computer Aided Biology
Major types of Bioinformatics Data
Literature and ontologies
Gene expression
Genomes
Protein sequence

DNA & RNA sequence

Protein structure

DNA & RNA structure

Chemical entities
Protein families,
motifs and domains

Protein interactions

Pathways

Systems
Goal: Integrate sequence, 3D structure, expression patterns, interaction and function of biomolecules to gain a deeper understanding of biological mechanisms, processes and systems.
Goal: Bioinformatics aims to bridge the gap between data and knowledge.
BIOINFORMATICS RESEARCH AREAS
Include but are not limited to:
• Organization, classification, dissemination and analysis of
biological and biomedical data (particularly ‘-omics’ data).
• Biological sequence analysis and phylogenetics.
• Genome organization and evolution.
• Regulation of gene expression and epigenetics.
• Biological pathways and networks in healthy & disease states.
• Protein structure prediction from sequence.
• Modeling and prediction of the biophysical properties of
biomolecules for binding prediction and drug design.
• Design of biomolecular structure and function.
With applications to Biology, Medicine, Agriculture and Industry
Where did bioinformatics come from?
Bioinformatics arose as molecular biology began to be transformed
by the emergence of molecular sequence and structural data

Recap: The key dogmas of molecular biology

• DNA sequence determines protein sequence.
• Protein sequence determines protein structure.
• Protein structure determines protein function.
• Regulatory mechanisms (e.g. gene expression) determine the
amount of a particular function in space and time.

Bioinformatics is now essential for the archiving, organization

and analysis of data related to these processes.
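The first relationship in the recap, DNA sequence determining protein sequence, can be illustrated in a few lines of Python. This is a minimal sketch rather than course material: the function name `translate` and the deliberately partial codon table are assumptions made for illustration, and a real table would list all 64 codons.

```python
# Partial codon table (standard genetic code). Only the codons used in the
# example are listed; this is an illustrative assumption, not a full table.
CODON_TABLE = {
    "ATG": "M",  # methionine (start)
    "GCC": "A",  # alanine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop
}

def translate(dna):
    """Translate a coding DNA sequence codon by codon, stopping at a stop codon."""
    protein = []
    usable_length = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        codon = dna[i:i + 3].upper()
        amino_acid = CODON_TABLE.get(codon, "X")  # X marks codons missing from the partial table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCCAAATGGTAA"))
```

Established libraries such as Biopython provide complete, tested translation routines; the hand-written table here only serves to make the mapping visible.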
Why do we need Bioinformatics?
Bioinformatics is necessitated by the rapidly expanding
quantities and complexity of biomolecular data

• Bioinformatics provides
methods for the efficient:
‣ storage
‣ annotation
‣ search and retrieval
‣ data integration
‣ data mining and analysis

Bioinformatics is essential for the archiving, organization and

analysis of data from sequencing, structural genomics,
microarrays, proteomics and new high throughput assays.
Figure: Growth in solved 3D structures
How do we do Bioinformatics?
• A “bioinformatics approach” involves the
application of computer algorithms, computer
models and computer databases with the broad
goal of understanding the action of both individual
genes, transcripts, proteins and large collections
of these entities.

DNA RNA Protein

Genome Transcriptome Proteome
How do we actually do Bioinformatics?
Pre-packaged tools and databases
‣ Many online
‣ New tools and time consuming methods frequently
require downloading
‣ Most are free to use

Tool development
‣ Mostly on a UNIX environment
‣ Knowledge of programming languages frequently required
(Python, Perl, R, C, Java, Fortran)
‣ May require specialized or high performance computing
resources…
SIDE-NOTE: SUPERCOMPUTERS AND GPUS
Skepticism & Bioinformatics
We have to approach computational results the
same way we do wet-lab results:
• Do they make sense?
• Is it what we expected?
• Do we have adequate controls, and how did they
come out?
• Modeling is modeling, but biology is different...
What does this model actually contribute?
• Avoid the misuse of ‘black boxes’
Common problems with Bioinformatics
Confusing multitude of tools available
‣ Each with many options and settable parameters

Most tools and databases are written by and for nerds

‣ Same is true of documentation - if any exists!

Most are developed independently

Notable exceptions are found at the:
• EBI (European Bioinformatics Institute) and
• NCBI (National Center for Biotechnology Information)
Even BLAST has many settable parameters

Related tools with different terminology

Key Online Bioinformatics
Resources: NCBI & EBI
The NCBI and EBI are invaluable, publicly
available resources for biomedical research

Please take the initial BIOINF525 questionnaire:
< tinyurl.com/bioinf525-questions >

http://www.ncbi.nlm.nih.gov https://www.ebi.ac.uk
National Center for Biotechnology
Information (NCBI)
• Created in 1988 as a part of the National Library of
Medicine (NLM) at the National Institutes of Health
• NCBI’s mission includes:
‣ Establish public databases
‣ Develop software tools
‣ Education on and
dissemination of biomedical
information
(Located in Bethesda, MD)

• We will cover a number of core NCBI databases and

software tools in the lecture
http://www.ncbi.nlm.nih.gov

Notable NCBI databases include:

GenBank, RefSeq, PubMed, dbSNP
and the search tools ENTREZ and BLAST
European Bioinformatics Institute (EBI)
• Created in 1994 as a part of the European Molecular
Biology Laboratory (EMBL)
• EBI’s mission includes:
‣ providing freely available
data and bioinformatics
services
‣ and providing advanced
bioinformatics training
(Located in Hinxton, UK)

• We will briefly cover several EBI databases and tools

that have advantages over those offered at NCBI
The EBI maintains a number of high quality
curated secondary databases and associated tools
https://www.ebi.ac.uk
The EBI makes available a wider variety of online tools than NCBI
These include multiple tools in the following areas:
• Database sequence searches
‣ FASTA, BLAST, InterProScan

• Sequence alignments
‣ Pairwise: Needle, LAlign.
‣ Multiple: ClustalW, T-Coffee, MUSCLE, Jalview

• Protein structure search and alignment

‣ PDBeFold, DALI

• and others...
‣ Genome browsers
‣ Gene expression analysis
‣ Protein function analysis
The EBI also provides a growing selection of online tutorials
on EBI databases and tools

Notable EBI databases include:

ENA, UniProt, Ensembl
and the tools FASTA, BLAST, InterProScan,
ClustalW, T-Coffee, MUSCLE, DALI, HMMER
BIOINFORMATICS DATABASES
AND ASSOCIATED TOOLS
What is a database?
Computerized store of data that is organized to provide efficient retrieval.
• Uses standardized data (record) formats to enable computer handling

Key database features allow for:

• Adding, changing, removing and merging of records
• User-defined queries and extraction of specified records

Desirable features include:

• Contains the data you are interested in
• Allows fast data access
• Provides annotation and curation of entries
• Provides links to additional information (possibly in other databases)
• Allows you to make discoveries
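The key database features listed above (adding, changing and removing records, plus user-defined queries) can be tried out with Python's built-in `sqlite3` module. The sketch below is only an illustration under stated assumptions: the table name `records`, its columns, and the accessions `EX0001` to `EX0003` are all made up and do not correspond to any real bioinformatics database.

```python
import sqlite3

# In-memory database; the table layout and the accessions are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (accession TEXT PRIMARY KEY, organism TEXT, description TEXT)"
)

# Adding records in a standardized (record) format
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("EX0001", "Homo sapiens", "retinol-binding protein"),
        ("EX0002", "Danio rerio", "creatine kinase"),
        ("EX0003", "Homo sapiens", "hemoglobin beta"),
    ],
)

# Changing and removing records
conn.execute(
    "UPDATE records SET description = ? WHERE accession = ?",
    ("retinol-binding protein 4", "EX0001"),
)
conn.execute("DELETE FROM records WHERE accession = ?", ("EX0003",))

def find_by_organism(organism):
    """A user-defined query that extracts only the specified records."""
    rows = conn.execute(
        "SELECT accession, description FROM records WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return rows.fetchall()

print(find_by_organism("Homo sapiens"))
```

The primary key on `accession` is one simple way to get the fast, unambiguous retrieval that the slide lists as a desirable feature.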
Bioinformatics Databases
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref,
BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase,
CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST,
dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER,
FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genlilesne, GenLink,
GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB,
HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase,
HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA,
KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI,
MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase,
MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase,
PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS,
ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP,
SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK,
StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL
Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene,
URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, etc .................. !!!!
There are lots of bioinformatics databases. For an annotated listing of the major ones, please see the handout < Handout_Major_Databases.pdf >
Finding Bioinformatics Databases

http://www.oxfordjournals.org/nar/database/c/
Major Molecular Databases
The most popular bioinformatics databases focus on:

• Biomolecular sequence (e.g. GenBank, UniProt)

• Biomolecular structure (e.g. PDB)

• Vertebrate genomes (e.g. Ensembl)

• Small molecules (e.g. PubChem)

• Biomedical literature (e.g. PubMed)

There are also many popular “boutique” databases for:

• Classifying protein families, domains and motifs (e.g. PFAM, PROSITE)
• Specific organisms (e.g. WormBase, FlyBase)
• Specific proteins of biomedical importance (e.g. KinaseDB, GPCRDB)

• Specific diseases, mutations (e.g. OMIM, HGMD)

• Specific fields or methods of study (e.g. GOA, IEDB)

See online: ‘Handout_Major_Databases.pdf’
Primary, secondary & composite databases
Bioinformatics databases can be usefully classified into primary,
secondary and composite according to their data source.
• Primary databases (or archival databases) consist of data derived
experimentally.
‣ GenBank: NCBI’s primary nucleotide sequence database.
‣ PDB: Protein X-ray crystal and NMR structures.

• Secondary databases (or derived databases) contain information derived
from a primary database.
‣ RefSeq: non-redundant set of curated reference sequences primarily
from GenBank
‣ PFAM: protein sequence families primarily from UniProt and PDB

• Composite databases (or metadatabases) join a variety of different primary
and secondary database sources.
‣ OMIM: catalog of human genes, genetic disorders and related literature
‣ GENE: molecular data and literature related to genes with extensive links
to other databases.
GENBANK & REFSEQ:
NCBI’S NUCLEOTIDE SEQUENCE
DATABASES
What is GenBank?

• GenBank is NCBI’s primary nucleotide-only sequence database
‣ Archival in nature - reflects the state of knowledge at time
of submission
‣ Subjective - reflects the submitter point of view
‣ Redundant - can have many copies of the same
nucleotide sequence

• GenBank is actually three collaborating international
databases from the US, Japan and Europe
‣ GenBank (US)
‣ DNA Database of Japan (DDBJ)
‣ European Nucleotide Archive (ENA)
GenBank, ENA and DDBJ
Share and synchronize data

ENA: housed at EBI (European Bioinformatics Institute)
GenBank: housed at NCBI (National Center for Biotechnology Information)
DDBJ: housed in Japan

• The underlying raw DNA sequences are identical

‣ The different sites provide different views and ways to navigate
through the data
• Access to GenBank (and other NCBI databases including
RefSeq) is typically through Entrez, (the Google of NCBI) -
more on this later
GenBank sequence record

GenBank flat file format has defined fields including unique identifiers
such as the ACCESSION number.
This same general format is used for other sequence database records too.
Side note: Database accession numbers

Database accession numbers are strings of letters and numbers used as
identifying labels for sequences and other data within databases
‣ Examples (all for retinol-binding protein, RBP4):

Type        Accession        Description
DNA         X02775           GenBank genomic DNA sequence
DNA         NT_030059        Genomic contig
RNA         N91759.1         An expressed sequence tag (1 of 170)
RNA         NM_006744        RefSeq DNA sequence (from a transcript)
Protein     NP_007635        RefSeq protein
Protein     AAC02945         GenBank protein
Protein     Q28369           UniProtKB/SwissProt protein
Protein     1KT7             Protein Data Bank structure record
Literature  PMID: 12205585   PubMed IDs identify articles at NCBI/NIH
GenBank sequence record

Can set different display formats here

FASTA sequence record

FASTA sequence files consist of records where each record begins
with a “>” and header information on that same line. Each
subsequent line of the record is sequence information.
This format is commonly used by sequence analysis programs.
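The FASTA layout described above is simple enough to parse by hand. The following is a minimal sketch, assuming plain-text input with no unusual line endings; the function name `parse_fasta` and the two example records are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up records for illustration only
example = """>seq1 made-up example record
ATGGCC
AAATGG
>seq2 another made-up record
TTGACA
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

For real work an established parser (for example the one in Biopython) is usually a safer choice, since it handles edge cases this sketch ignores.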
GenBank ‘graphics’ sequence record
GenBank sequence record, cont.

The FEATURES section contains annotations including
a conceptual translation of the nucleotide sequence.
GenBank sequence record, cont.

The actual sequence entry starts after the word ORIGIN
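To make the flat-file layout concrete, here is a minimal sketch that pulls the ACCESSION value and the sequence that follows ORIGIN out of one record. It assumes a single record terminated by a `//` line; the function name `read_genbank_record` and the abbreviated example record (including the accession `EX000001`) are made up for illustration and omit most real fields.

```python
def read_genbank_record(text):
    """Return (accession, sequence) from one GenBank-style flat-file record."""
    accession = None
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else None
        elif line.startswith("ORIGIN"):
            in_sequence = True  # sequence lines follow this keyword
        elif line.startswith("//"):
            break  # end-of-record marker
        elif in_sequence:
            # Each sequence line is a position number followed by blocks of bases
            sequence_parts.extend(part for part in line.split() if part.isalpha())
    return accession, "".join(sequence_parts).upper()

# A heavily abbreviated, made-up record for illustration only
example = """LOCUS       EXAMPLE1     24 bp    DNA
ACCESSION   EX000001
FEATURES             Location/Qualifiers
ORIGIN
        1 atggccaaat ggtaaatggc
       21 caaa
//
"""

print(read_genbank_record(example))
```

The FEATURES section is skipped entirely here; parsing its nested qualifiers is considerably more involved and is better left to a dedicated library.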
RefSeq: NCBI’s Derivative
Sequence Database
• RefSeq entries are a hand-curated best representation
of a transcript or protein (in the curators' judgement)
• Non-redundant for a given species, although alternate
transcript forms are included if there is good
evidence
- Experimentally verified transcripts and proteins:
accession numbers begin with “NM_” or “NP_”
- Model transcripts and proteins based on bioinformatics
predictions with little experimental support:
accession numbers begin with “XM_” or “XP_”
- RefSeq also contains contigs and chromosome records
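The accession prefixes above lend themselves to a simple lookup. The sketch below is an illustration only: the dictionary `REFSEQ_PREFIXES` and the function `describe_refseq` are hypothetical names, and only the four prefixes named on this slide are covered (RefSeq uses others as well, for example for contigs).

```python
# Only the four prefixes mentioned on the slide; other RefSeq prefixes exist.
REFSEQ_PREFIXES = {
    "NM_": "experimentally verified transcript",
    "NP_": "experimentally verified protein",
    "XM_": "model transcript (predicted)",
    "XP_": "model protein (predicted)",
}

def describe_refseq(accession):
    """Classify an accession by its three-character RefSeq prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "not one of the four prefixes covered here")

# NM_006744 and NP_007635 are the RBP4 examples from the accession number slide
for acc in ["NM_006744", "NP_007635", "X02775"]:
    print(acc, "->", describe_refseq(acc))
```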
UNIPROT:
THE PREMIER PROTEIN SEQUENCE
DATABASE
UniProt: Protein sequence database

UniProt is a comprehensive, high-quality resource of protein

sequence and functional information
• UniProt comprises four databases:
1. UniProtKB (Knowledgebase)
Containing Swiss-Prot and TrEMBL components
(these correspond to hand curated and automatically annotated
entries respectively)
2. UniRef (Reference Clusters)
Filtered version of UniProtKB at various levels of sequence
identity
e.g. UniRef90 contains sequences with a maximum of 90%
sequence identity to each other
3. UniParc (Archive) with database cross-references to source.
4. UniMES (Metagenomic and Environmental Sequences)
The two sides of UniProtKB

UniProtKB/TrEMBL: redundant, automatically annotated - unreviewed
UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation - reviewed

Indicators of which part of UniProt an entry belongs to include the color of the stars and the ID
The main information added to a
UniProt/Swiss-Prot entry

UniProt provides cross-references
to a large number of other resources
and can serve as a useful “portal”
when you first begin to investigate
a particular protein

UniProt/Swiss-Prot vs UniProt/TrEMBL

• UniProtKB/Swiss-Prot is a non-redundant database with one entry

per protein

• UniProtKB/TrEMBL is a redundant database with one entry per

translated ENA entry (ENA is the EBI’s equivalent of GenBank)
‣ Therefore TrEMBL can contain multiple entries for the same protein
‣ Multiple UniProtKB/TrEMBL entries for the same protein can arise due
to:
- Erroneous gene model predictions
- Sequence errors (Frame shifts)
- Polymorphisms
- Alternative start sites
- Isoforms
- OR because the same sequence was submitted by different
people
Side note: Automatic Annotation
(sharing the wealth)
Swiss-Prot manually annotated

Same domain composition
= same function = annotation transfer

InterPro is an EBI database that

collates protein domain signatures

We will talk more about sequence

similarity and annotation transfer
next week.

DATABASE VIGNETTE
You have just come out of a seminar about gastric
cancer and one of your co-workers asks:
“What do you know about that ‘Kras’ gene the
speaker kept talking about?”
You have some recollection about hearing of ‘Ras’
before. How would you find out more?
• Google?
• Library?
• Bioinformatics databases at NCBI and EBI!
http://www.ncbi.nlm.nih.gov/

Hands-on demo (or see following slides)
Boolean search operators (1 = ras, 2 = disease):
1 AND 2: ras AND disease (1185 results)
1 OR 2: ras OR disease (134,872 results)
1 NOT 2: ras NOT disease (84,448 results)
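The three Boolean operators behave like set operations on the lists of matching records. The sketch below uses Python sets to show this; the record identifiers are made up and stand in for real search hits.

```python
# Made-up record IDs standing in for search hits; real searches return far more.
ras_hits = {"rec1", "rec2", "rec3", "rec4"}
disease_hits = {"rec3", "rec4", "rec5"}

both = ras_hits & disease_hits       # ras AND disease: records matching both terms
either = ras_hits | disease_hits     # ras OR disease: records matching at least one term
ras_only = ras_hits - disease_hits   # ras NOT disease: ras records that lack 'disease'

print(sorted(both), sorted(either), sorted(ras_only))

# The AND and NOT results together account for every 'ras' hit.
assert len(both) + len(ras_only) == len(ras_hits)
```

If the three counts quoted above come from the same snapshot of the database, the same identity implies that 'ras' alone matched 1,185 + 84,448 = 85,633 records.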
Example Questions:
What chromosome location and
what genes are in the vicinity?

Example Questions:
What ‘molecular functions’, ‘biological
processes’, and ‘cellular component’
information is available?

GO: Gene Ontology
GO provides a controlled vocabulary of terms for describing gene
product characteristics and gene product annotation data

Why do we need Ontologies?
• Annotation is essential for capturing the understanding and
knowledge associated with a sequence or other molecular entity
• Annotation is traditionally recorded as “free text”, which is easy
to read by humans, but has a number of disadvantages,
including:
‣ Difficult for computers to parse
‣ Quality varies from database to database
‣ Terminology used varies from annotator to annotator

• Ontologies are annotations using standard vocabularies that try
to address these issues
• GO is integrated with UniProt and many other databases
including a number at NCBI

GO Ontologies

• There are three ontologies in GO:

‣ Biological Process
A commonly recognized series of events
e.g. cell division, mitosis
‣ Molecular Function
An elemental activity, task or job
e.g. kinase activity, insulin binding
‣ Cellular Component
Where a gene product is located
e.g. mitochondrion, mitochondrial
membrane
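One way to picture the three ontologies is as a namespace label attached to every GO term. The sketch below groups a tiny hand-picked sample of terms by namespace; the list `GO_TERMS` and the function `group_by_ontology` are hypothetical names used only for illustration, and a real analysis would load the full ontology from the GO project's files.

```python
from collections import defaultdict

# (GO identifier, term name, ontology namespace): a tiny hand-picked sample
GO_TERMS = [
    ("GO:0051301", "cell division", "biological_process"),
    ("GO:0016301", "kinase activity", "molecular_function"),
    ("GO:0005739", "mitochondrion", "cellular_component"),
]

def group_by_ontology(terms):
    """Group (id, name, namespace) tuples into a dict keyed by namespace."""
    grouped = defaultdict(list)
    for go_id, name, ontology in terms:
        grouped[ontology].append((go_id, name))
    return dict(grouped)

for ontology, members in group_by_ontology(GO_TERMS).items():
    print(ontology, members)
```

Because every annotation points at a fixed identifier rather than free text, two databases that both use `GO:0016301` are unambiguously talking about the same activity.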
The ‘Gene Ontology’ or GO is
actually maintained by the EBI, so let's
switch or link over to UniProt, also
from the EBI.

Scroll down to
UniProt link
UniProt will detail much more
information for protein coding genes
such as this one

Example Questions:
What positions in the protein are
responsible for GTP binding?
Example Questions:
What variants of this enzyme are
involved in gastric cancer and other
human diseases?
Example Questions:
Are high resolution protein structures
available to examine the details of
these mutations?
Example Questions:
What is known about the protein family,
its species distribution, number in humans
and residue-wise conservation, etc… ?

PFAM is one of the best

protein family databases
ENTREZ & BLAST:
TOOLS FOR SEARCHING AND ACCESSING
MOLECULAR DATA AT NCBI
Entrez: Integrated search of NCBI databases

Entrez is available from

the main NCBI homepage
or from the homepage of
individual databases
Entrez: navigating across databases

Entrez was set up to allow you to navigate to
related data in different databases without
having to run additional searches.
Relies on pre-computed and pre-compiled
data links:
• Neighbor knowledge based on calculations
• Hard links based on things we know about

Figure: Entrez link diagram. PubMed abstracts (word weight neighbors), Taxonomy (phylogeny), 3-D Structure (VAST neighbors, related structures), Gene, Nucleotide sequences and Protein sequences (BLAST neighbors, related sequences), BLink and Domains, connected to one another by hard links.
Global Entrez Query: All NCBI Databases

http://www.ncbi.nlm.nih.gov/gquery/

The Entrez system: 38 (and counting)

integrated databases
Search Results

Discovery Column
(sort, filter, link)
Limits
Advanced: Search Builder

Helps build complex fielded queries

Items from search history can be included / combined / modified

Complex Query Results

("Danio rerio"[Organism] AND "creatine kinase"[Title]) AND "refseq"[Filter] AND mrna[Filter]
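A fielded query like the one above can also be assembled programmatically and sent to NCBI's E-utilities search service instead of being typed into the web form. The sketch below only builds the query string and the request URL and makes no network call. The helper names `fielded` and `build_esearch_url` are hypothetical, and the choice of the `nuccore` database and a `retmax` of 20 are assumptions, since the slide does not say which database was searched.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def fielded(term, field):
    """Quote a term and attach an Entrez field tag, e.g. "Danio rerio"[Organism]."""
    return f'"{term}"[{field}]'

def build_esearch_url(query, db="nuccore", retmax=20):
    """Build an ESearch request URL; db and retmax defaults are assumptions."""
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)

query = (
    "(" + fielded("Danio rerio", "Organism")
    + " AND " + fielded("creatine kinase", "Title") + ")"
    + " AND " + fielded("refseq", "Filter")
    + " AND mrna[Filter]"
)

print(query)
print(build_esearch_url(query))
```

Anyone sending such requests in bulk should first read NCBI's usage guidelines for E-utilities, which set limits on request frequency.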
Controlled Vocabularies
‣ Taxonomy: primary controlled vocabulary / classification
system for molecular databases at NCBI

‣ Medical Subject Headings (MeSH): primary controlled
vocabulary / classification system (ontology) for the
biomedical literature databases at NCBI (e.g. PubMed)
BLAST is a very important tool available from the NCBI
Homepage
http://www.ncbi.nlm.nih.gov/guide/
BLAST – Basic Local Alignment
Search Tool
http://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST performs sequence similarity searches of query
sequences vs sequence databases.
We will cover this in detail in the next lecture.
SUMMARY
• Bioinformatics is computer aided biology.
• Bioinformatics deals with the collection, archiving,
organization, and interpretation of a wide range of biological
data.
• There are a large number of primary, secondary and composite
bioinformatics databases.
• The NCBI and EBI are major online bioinformatics service
providers.
• Introduced GenBank, RefSeq, UniProt, PDB databases as well
as a number of ‘boutique’ databases including PFAM and
OMIM.
• Introduced the notion of controlled vocabularies and
ontologies.
• Described the use of ENTREZ and BLAST for searching
databases.
HOMEWORK

Complete the initial course questionnaire:

http://tinyurl.com/bioinf525-questions

Check out the “Background Reading” material on Ctools:

http://tinyurl.com/bioinf525-w16

Complete the lecture 1.1 homework questions:

http://tinyurl.com/bioinf525-quiz1
ADDITIONAL DATABASES OF NOTE
(SLIDES FOR YOUR REFERENCE)
NCBI Metadatabases
• Gene
‣ molecular data and literature related to genes

• HomoloGene
‣ automated collection of homologous genes from selected
eukaryotes
• Taxonomy
‣ access to NCBI data through source organism taxonomic
classification
• PubChem
‣ small organic molecules and their biological activities

• BioSystems
‣ biochemical pathways and processes linked to NCBI genes,
gene products, small molecules, and structures
PubMed

• Curated database of biomedical journal articles

• Data records are annotated with MeSH terms
(Medical Subject Headings)
• Contract workers actually read all of the articles
and classify them with the MeSH terms
• PubMed entries contain article abstracts
• PubMed Central contains full journal articles, but
the majority are not freely re-distributable

PubMed results
Limits and Advanced search can be used
to refine searches
Small molecule databases have been added at NCBI
http://pubchem.ncbi.nlm.nih.gov/
HomoloGene - Homologous genes from different
organisms http://www.ncbi.nlm.nih.gov/homologene
Online Mendelian Inheritance in Man –
OMIM
http://www.ncbi.nlm.nih.gov/omim

OMIM is essentially a set of reviews of human genes, gene function and

phenotypes. Includes causative mutations where known.
The NCBI Bookshelf includes many well known
molecular biology texts.
http://www.ncbi.nlm.nih.gov/books/
GEO: Gene Expression Omnibus

• Gene expression data (mostly from microarrays but
also RNA-seq data, 2 methods for measuring RNA
levels)

Query, browse and download data sets

• Series - (GSExxx) is an original submitter-supplied
record that summarizes a study. May contain
multiple individual Samples (GSMxxx).

An individual GSE
series entry describes
the experiment and
gives access to the
experimental data

• DataSets - (GDSxxx) are curated collections of
selected Samples that are biologically and
statistically comparable
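
The three GEO record types named on these slides can be told apart by their accession prefix. A minimal sketch, assuming accessions are a prefix followed by digits. The function name `classify_geo_accession` and the example accession numbers are hypothetical.

```python
import re

# Prefix-to-record-type mapping taken from the slides.
GEO_RECORD_TYPES = {"GSE": "Series", "GSM": "Sample", "GDS": "DataSet"}

# A prefix followed by one or more digits, e.g. "GSE12345" (made-up number).
_ACCESSION = re.compile(r"^(GSE|GSM|GDS)(\d+)$")


def classify_geo_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_RECORD_TYPES[match.group(1)]


# Accession numbers below are illustrative, not real records from the lecture.
for acc in ["GSE12345", "gsm678", "GDS9", "XYZ1"]:
    print(acc, classify_geo_accession(acc))
```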

Curated expression data: expression profile across heart and skeletal muscle samples
GO Ontologies

• There are three ontologies in GO:

‣ Biological Process
A commonly recognized series of events
e.g. cell division, mitosis
‣ Molecular Function
An elemental activity, task or job
e.g. kinase activity, insulin binding
‣ Cellular Component
Where a gene product is located
e.g. mitochondrion, mitochondrial
membrane
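
As a small illustration of how the three ontologies partition annotations, the sketch below groups the example terms from this slide by ontology. The single-letter codes P, F and C are the aspect codes used in GO annotation files. The function name `group_by_ontology` is an assumption made for this example.

```python
from collections import defaultdict

# Single-letter aspect codes as used in GO annotation files.
ASPECTS = {
    "P": "Biological Process",
    "F": "Molecular Function",
    "C": "Cellular Component",
}

# (term, aspect) pairs using the examples given on the slide.
annotations = [
    ("cell division", "P"),
    ("mitosis", "P"),
    ("kinase activity", "F"),
    ("insulin binding", "F"),
    ("mitochondrion", "C"),
    ("mitochondrial membrane", "C"),
]


def group_by_ontology(pairs):
    """Group GO term names under the full name of their ontology."""
    grouped = defaultdict(list)
    for term, aspect in pairs:
        grouped[ASPECTS[aspect]].append(term)
    return dict(grouped)


for ontology, terms in group_by_ontology(annotations).items():
    print(f"{ontology}: {', '.join(terms)}")
```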
QuickGO is a fast web-based browser of the Gene Ontology and
Gene Ontology annotation data

GO annotation in UniProt
An example UniProt entry for hemoglobin beta (HBB_HUMAN,
P68871) with GO annotation displayed.
DAVID: an online tool for assessing
GO term enrichment in gene lists

DAVID allows you to upload lists of genes, search for
enriched GO terms, and find functionally related genes
not in your list

http://david.abcc.ncifcrf.gov
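
To make the idea of GO term enrichment concrete, here is a minimal sketch of a one-sided hypergeometric test: the probability of seeing at least the observed number of genes with a given GO term in a list drawn at random from the background. This is a generic textbook calculation, not DAVID's own statistic, which may differ. The function name and the numbers in the example are assumptions.

```python
from math import comb


def enrichment_p_value(population, annotated, list_size, hits):
    """P(X >= hits) for a hypergeometric draw without replacement.

    population: number of background genes
    annotated:  background genes carrying the GO term
    list_size:  genes in the uploaded list
    hits:       genes in the list carrying the GO term
    """
    total = comb(population, list_size)
    upper = min(annotated, list_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers for illustration only: 20 background genes, 5 with the term,
# a list of 4 genes of which 2 carry the term.
print(enrichment_p_value(population=20, annotated=5, list_size=4, hits=2))
```

A small p-value suggests the term appears in the list more often than expected by chance; with many GO terms tested at once, a multiple-testing correction would normally be applied as well.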

Example output: enriched functions
from GO
