Databases - Final

- Biological databases store biological information extracted from experiments and other databases through processes like filtering, annotation, and correction.
- Databases are classified as primary, secondary, or composite and can be public or private. Primary databases contain original experimental data while secondary databases contain derived information.
- Major public primary databases include GenBank, EMBL, and DDBJ for nucleotide sequences, Swiss-Prot and PIR for protein sequences, and the Protein Data Bank for 3D structures. Secondary databases provide classified, organized views of primary database contents.

Uploaded by

Abhi Sachdev
Copyright
© All Rights Reserved

Databases

• Genome sequencing and many other large-scale research projects have
generated explosive growth in biological data
• Most databases do not hold source data: their contents have been
extracted from other databases by a process of filtering,
transforming, and manual correction and annotation
• Biological databases are stores of biological information
• A database is a collection of structured, searchable and up-to-date
data
• Data deposited in a database are assigned a unique identifier
(accession number) that can be cited in publications
Classification

• Sequence and structure databases
• Primary, secondary and composite databases
• Public and private databases
• Primary database:
o Consists of data derived experimentally
o Has grown tremendously over the years
o Contains only the sequence or structure itself and its associated
annotation
o Includes:
• Sequence databases
• Structure databases
o Examples:
• Nucleotide / DNA databases: EMBL, Ensembl, GenBank, DDBJ
• Protein databases: Swiss-Prot, TrEMBL, PIR-PSD, PDB
• Secondary database:
o A secondary sequence database contains information derived from a
primary database, such as the conserved sequences and active-site
residues of protein families, obtained by multiple sequence
alignment of a set of related proteins
o A secondary structure database contains PDB entries in an
organized form (e.g. all PDB entries classified according to
structures such as α-helices or β-sheets), as well as information on
the conserved secondary structure motifs of a particular protein
o Examples:
• Sequence related information: ProSite, Pfam, REBase, Enzyme
Nomenclature Database
• Genome related information: Online Mendelian Inheritance in Man
(OMIM), TransFac
• Structure related information: Databases of Secondary Structure
Assignments (DSSA), Homology-Derived Secondary Structures of
Proteins (HSSP), Dali
• Pathway information: KEGG
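
As a rough illustration of how a secondary sequence database derives its content, the sketch below finds the fully conserved columns of a small multiple sequence alignment. The aligned fragments and the function name conserved_columns are made up for this example; real resources such as ProSite or Pfam use much richer models (patterns, profiles, HMMs).

```python
def conserved_columns(alignment):
    """Return (index, residue) for every column where all sequences agree.

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap.
    Columns that contain a gap are never reported as conserved.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")

    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}  # distinct residues in column i
        if len(column) == 1 and "-" not in column:
            conserved.append((i, column.pop()))
    return conserved


# Hypothetical aligned fragments of three related proteins.
toy_alignment = [
    "GDSGGPLV",
    "GDSGGALV",
    "GDSGG-LI",
]
print(conserved_columns(toy_alignment))
```

Positions that survive this filter are the kind of derived information (conserved residues of a protein family) that a secondary database stores alongside links back to the primary entries.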
• Composite database:
o Joins a variety of different primary database sources, which
obviates the need to search multiple resources
• Public database:
• Most of the databases are public
• These are freely accessible for everybody everywhere
in the world

• Private database:
• Private companies sequence genomes of
commercially or scientifically interesting organisms
• Data is not available to the public free of charge
• Academics normally are not able to pay the money
required for accessing these databases and they are
mainly used by the pharmaceutical and biotech
industries
• Sometimes, databases that have been public turn private
– For example, on 1 June 2002 the formerly public genome
database of Saccharomyces cerevisiae (yeast) and
Caenorhabditis elegans, two of the most widely studied
eukaryotic model organisms, changed from a free to a
chargeable service
– The database had been taken over by Incyte Genomics in 2000,
which charged US$2000 per lab per year
Nucleotide sequence databases
• The International Nucleotide Sequence Database (INSD) consists of
the following databases:
– GenBank (National Center for Biotechnology Information; NCBI;
USA)
– EMBL (European Molecular Biology Laboratory; Europe)
– DDBJ (DNA Data Bank of Japan; Japan)
• These are repositories for nucleotide sequence data from all organisms
• All three databases accept nucleotide sequence submissions, and then
exchange new and updated data on a daily basis to achieve
optimal synchronization, so that all three databases contain the
same information
• These three databases are primary databases, as they house
original sequence data
• Because of the daily exchange, the raw data are identical, although
the format in which they are stored and the nature of the annotation
vary slightly among them
• GenBank
– Developed and maintained by the National Center for Biotechnology
Information (NCBI) at the National Institutes of Health (NIH)
– Primary nucleic acid public database
– Contains publicly available nucleotide sequences and their protein
translations, with supporting bibliographic and biological annotation
• EMBL
– The Laboratory operates from five sites: the main laboratory in
Heidelberg, and outstations in Hinxton (the European Bioinformatics
Institute; EBI; England), Grenoble (France), Hamburg (Germany)
and Monterotondo (near Rome)
– Primary nucleic acid public database
• DDBJ
– Located at the National Institute of Genetics (NIG) in the Shizuoka
prefecture of Japan
– Primary nucleic acid public database
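
The point that the three INSD databases hold identical raw data in slightly different formats can be sketched as follows. Both records below are invented and heavily simplified; they only imitate the line types of GenBank-style flat files (LOCUS, DEFINITION, ACCESSION, ORIGIN) and EMBL-style flat files (ID, AC, DE, SQ). The parser names are hypothetical, and real entries carry many more line types than are handled here.

```python
import re

# Invented, heavily simplified records for illustration only.
GENBANK_STYLE = """\
LOCUS       TOY00001     12 bp    DNA
DEFINITION  Hypothetical toy sequence.
ACCESSION   TOY00001
ORIGIN
        1 atggccattg ta
//
"""

EMBL_STYLE = """\
ID   TOY00001; SV 1; linear; DNA; 12 BP.
AC   TOY00001;
DE   Hypothetical toy sequence.
SQ   Sequence 12 BP;
     atggccattg ta                                                    12
//
"""


def _clean_sequence(lines):
    # Sequence lines mix bases with position numbers and spaces; keep letters only.
    return re.sub(r"[^a-zA-Z]", "", "".join(lines)).upper()


def parse_genbank_style(text):
    record, seq_lines, in_seq = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_seq:
            seq_lines.append(line)
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["description"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_seq = True
    record["sequence"] = _clean_sequence(seq_lines)
    return record


def parse_embl_style(text):
    record, seq_lines, in_seq = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_seq:
            seq_lines.append(line)
        elif line.startswith("AC"):
            record["accession"] = line[2:].strip().rstrip(";")
        elif line.startswith("DE"):
            record["description"] = line[2:].strip()
        elif line.startswith("SQ"):
            in_seq = True
    record["sequence"] = _clean_sequence(seq_lines)
    return record


print(parse_genbank_style(GENBANK_STYLE) == parse_embl_style(EMBL_STYLE))
```

Once both layouts are reduced to the same fields, the two records compare equal, which is the sense in which the databases contain the same information despite their different formats.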
Protein Sequence Databases
• Protein sequence databanks collect additional
information about proteins, such as ligands, subunit
association, disulfide bridges, catalytic activity and family
• Most of this information is collected from the literature
• Many of the sequences themselves arise by translation of nucleic acid
sequences
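
Translation of a nucleic acid sequence into a protein sequence can be sketched with the standard genetic code. This is a minimal example that assumes a single forward reading frame starting at the first base; the input sequence is invented, and real pipelines also handle alternative genetic codes, all six reading frames and ambiguous bases.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCATTGTAATGGGCCGCTGA"))  # prints MAIVMGR
```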
PIR International:

•It was the very first sequence database, set up at the National
Biomedical Research Foundation (Georgetown University,
Washington DC, USA)
•In 1988 the PIR joined with two other groups: the Munich
Information Center for Protein Sequences (MIPS) in Germany
and the Japan International Protein Information Database
(Tsukuba)
•The PIR maintains several databases about proteins:
• PIR-PSD: protein sequences
• iProClass: classification of proteins according to structure
and function
• ASDB: annotation and similarity database
• PIR-NREF: a database of sequences and annotations of
proteins of known structure deposited in the PDB
• RESID: a database of covalent structure modifications
(e.g. S-S bridges)
SwissProt:

•It is an annotated protein sequence database established
in 1986 and maintained collaboratively since 1987 by the
Department of Medical Biochemistry of the University of
Geneva and the EMBL Data Library (now the EBI)
•Each entry describes the function of the protein, post-translational
modifications, domains and sites, secondary structure, quaternary
structure, similarities to other proteins, diseases associated with
deficiencies in the protein, variants and more
• PROSITE - Database of Protein Families and Domains
• Pfam - Protein families database of alignments and
HMMs - (Sanger Institute)
• Database of Interacting Proteins - Univ. of California
• InterPro - Classifies proteins into families and predicts
the presence of domains and sites
• UniProt Universal Protein Resource - EBI, Swiss Institute
of Bioinformatics, PIR
• Swiss-Prot Protein Knowledgebase - Swiss Institute of
Bioinformatics
Structure Databases
• Structure databases archive, annotate and distribute
sets of atomic coordinates to visualize three dimensional
structures
• Contain specific information about stereochemical
analysis, like bond lengths and angles, X-ray crystal
structures and NMR spectroscopic data
• The best established database for biological
macromolecular structures is the Protein Data Bank
(PDB)
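
Because structure databases distribute atomic coordinates, quantities such as bond lengths can be computed directly from an entry. The sketch below reads the x, y, z fields of two ATOM records using the fixed columns of the PDB file format and computes the distance between the atoms. The two records are invented, and their coordinates were chosen for easy arithmetic rather than realistic geometry.

```python
import math

# Two invented ATOM records laid out in the fixed-column PDB format.
PDB_LINES = [
    "ATOM      1  N   ALA A   1       0.000   0.000   0.000  1.00  0.00           N",
    "ATOM      2  CA  ALA A   1       1.000   2.000   2.000  1.00  0.00           C",
]


def parse_atom(line):
    """Extract atom name, residue name and coordinates from one ATOM record."""
    return {
        "name": line[12:16].strip(),     # columns 13-16
        "residue": line[17:20].strip(),  # columns 18-20
        "xyz": (
            float(line[30:38]),          # x, columns 31-38
            float(line[38:46]),          # y, columns 39-46
            float(line[46:54]),          # z, columns 47-54
        ),
    }


def distance(atom_a, atom_b):
    """Euclidean distance between two atoms, in the units of the file (angstroms)."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


atoms = [parse_atom(line) for line in PDB_LINES]
print(atoms[0]["name"], atoms[1]["name"], distance(atoms[0], atoms[1]))
```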
Protein Data Bank (PDB):

•Primary database
•It is an American database started in 1971 by the late Walter
Hamilton at Brookhaven National Laboratory on Long Island, New
York
•It is now managed by the Research Collaboratory for Structural
Bioinformatics (RCSB), whose member sites are Rutgers University in
New Jersey, the San Diego Supercomputer Center in California, and
the National Institute of Standards and Technology in Maryland
•It contains 3-D structures of proteins, nucleic acids and some
carbohydrates
•Most of the data in the PDB are generated by X-ray crystallography
and NMR
•It comprises:
– Protein Data Bank in Europe (PDBe)
– Protein Data Bank in Japan (PDBj)
– Research Collaboratory for Structural Bioinformatics (RCSB)
• Secondary databases:
• SCOP - Structural Classification of Proteins
• CATH - Classification of protein structures downloaded
from the PDB
• PDBsum - A pictorial database that provides an at-a-
glance overview of the contents of each 3D structure
deposited in the Protein Data Bank
Pathway Databases
• These are databases that describe biochemical
pathways, reactions, and enzymes
• For modeling and simulating a biopathway, it is useful to
select suitable information from public biopathway
databases such as the Kyoto Encyclopedia of Genes and
Genomes (KEGG) and BioCyc
KEGG:

•It is a database with associated software, suitable for
simulating the behavior of a cell or organism
from its genome information
•It includes a specialized pathway database and a ligand
database
•It can be used to find and visualize data
and knowledge on the protein interactions and chemical
reactions responsible for various cellular processes
•It attempts to reconstruct protein interaction networks for
all organisms whose genomes are completely sequenced
•It provides enzyme data with links to pathways, genes,
diseases (OMIM database), motifs and PDB structures
BioCyc:

•It is a collection of Pathway/Genome Databases
•Each database in the BioCyc collection describes the genome and metabolic
pathways of a single organism, with the exception of the MetaCyc database,
which is a reference source on metabolic pathways from many organisms
•BioCyc databases can be used to visualize the layout of genes within a
chromosome, or of an individual biochemical reaction, or of a complete
biochemical pathway
•The structures of chemical compounds can be displayed in pathways and
reactions
•The navigation capabilities of the software allow a user to move from a display
of an enzyme to a display of a reaction that the enzyme catalyzes, or to the
gene that encodes the enzyme
•The interface supports a variety of queries, such as generating a display of the
map positions of all genes that code for enzymes within a given biochemical
pathway
•It is used as a reference source to look up individual facts
•BioCyc databases support computational studies of the metabolism, such as
design of novel biochemical pathways for biotechnology, studies of the
evolution of metabolic pathways, and simulation of metabolic pathways
•BioCyc is linked to other biological databases containing protein and nucleic-
acid sequence data, bibliographic data, protein structures, and descriptions of
different strains
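
The kind of navigation and query described above (pathway to reaction to enzyme to gene, and the map positions of all genes coding for enzymes in a pathway) can be sketched with a small in-memory data model. Every name, reaction and map position below is invented; this is not the BioCyc schema or its query interface, only an illustration of the relationships involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    map_position: float  # invented position along the chromosome


@dataclass(frozen=True)
class Reaction:
    equation: str
    enzyme: str
    gene: Gene  # gene that encodes the enzyme


# A hypothetical three-step pathway.
PATHWAYS = {
    "toy-pathway": [
        Reaction("A -> B", "enzyme1", Gene("geneX", 42.5)),
        Reaction("B -> C", "enzyme2", Gene("geneY", 17.0)),
        Reaction("C -> D", "enzyme3", Gene("geneZ", 63.2)),
    ],
}


def gene_positions(pathway_name):
    """Map positions of all genes coding for enzymes in a pathway, in map order."""
    genes = {reaction.gene for reaction in PATHWAYS[pathway_name]}
    return sorted(((g.name, g.map_position) for g in genes), key=lambda item: item[1])


def reactions_catalyzed_by(enzyme):
    """Navigate from an enzyme to the reactions it catalyzes."""
    return [
        reaction.equation
        for reactions in PATHWAYS.values()
        for reaction in reactions
        if reaction.enzyme == enzyme
    ]


print(gene_positions("toy-pathway"))
print(reactions_catalyzed_by("enzyme2"))
```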
MetaCyc:
•One of the databases of BioCyc is called MetaCyc
•This metabolic pathway database contains literature-
derived metabolic pathway data for several species
•It describes metabolic pathways, reactions, enzymes, and
substrate compounds
•It is a collaborative project between SRI International, the
Carnegie Institution, and Stanford University
• Small Molecule Pathway Database (SMPDB)
• BioCyc Database Collection - Includes EcoCyc and MetaCyc
• KEGG PATHWAY Database - Univ. of Kyoto
• MANET database - University of Illinois
• MetaboLights - Metabolomics experiments and derived information:
metabolite structures, reference spectra, biological roles, locations and
concentrations (European Bioinformatics Institute)
• Reactome - Navigable map of human biological pathways, ranging
from metabolic processes to hormonal signalling (Cold Spring Harbor
Laboratory, European Bioinformatics Institute, Gene Ontology
Consortium)
• Microarrays allow snapshots to be made of expression
levels (and hence abundance) for thousands of genes in a
single experiment
• The amount of information generated by a microarray-based
experiment is so large that no single study
can be expected to mine every piece of scientific information
• The number of completed microarray experiments is growing
rapidly, so massive amounts of
valuable functional genomics data have already been generated
• The Microarray Informatics Team at the EBI was
established in May 2000 to address the problem of
managing and analyzing these data
• They found that systems were needed for the management
and storage of microarray data
• ArrayExpress - A public database for microarray-based gene
expression data, set up by the EBI; exchanges information
with the NCBI and DDBJ microarray databases every week
• Gene Expression Omnibus - National Center for
Biotechnology Information
• GPX - Scottish Centre for Genomic Technology and
Informatics
• Stanford Microarray Database (SMD) - Stanford University
• Genevestigator - Expression Search Engine (Nebion AG)
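
To make the idea of an expression snapshot concrete, the sketch below compares two sets of intensities by log2 fold change and lists the genes that changed at least two-fold. The gene names and intensity values are invented, and the two-fold cutoff is only an assumption for this example; real analyses add normalization and statistical testing.

```python
import math

# Invented signal intensities for three hypothetical genes.
control = {"geneA": 100.0, "geneB": 200.0, "geneC": 800.0}
treated = {"geneA": 400.0, "geneB": 200.0, "geneC": 200.0}


def log2_fold_changes(control, treated):
    """Return log2(treated / control) for every gene measured in both samples."""
    return {
        gene: math.log2(treated[gene] / control[gene])
        for gene in control
        if gene in treated and control[gene] > 0 and treated[gene] > 0
    }


def differentially_expressed(fold_changes, threshold=1.0):
    """Genes whose level changed at least 2**threshold-fold in either direction."""
    return sorted(g for g, fc in fold_changes.items() if abs(fc) >= threshold)


fold_changes = log2_fold_changes(control, treated)
print(fold_changes)
print(differentially_expressed(fold_changes))
```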
• Mutant databases contain all kinds of information about
natural and laboratory-made mutants
• Among plants, most mutant databases cover
Arabidopsis, but databases for crop plants already
exist
Arabidopsis thaliana Insertion Database (ATIDB):

•It is a collaboration between the American Cold Spring
Harbor Laboratory and the John Innes Centre in the UK
•The ATIDB was designed as a public tool for genome
researchers and other biologists to find breeding lines of
Arabidopsis created with insertional mutagenesis and to
facilitate the study of their distribution on the World Wide
Web
• Literature databases contain scientific articles or their abstracts
• Searches usually return citation information: author names, title,
publication and date
• There are several high-quality databases, but the most
widely used is PubMed
PubMed:

•It is the most used literature database
•It is a project developed by the U.S. NCBI at the National
Library of Medicine, located at the NIH
•It provides access to over 12 million MEDLINE citations back to
1966 and also to additional life science journals
•It integrates articles with other information retrieval tools, such
as links to web sites providing full text articles
•It also has an option to retrieve related articles
•It gives a clear option to get several articles about one topic
•It is a very useful and complete database because almost all
scientific journals publish their tables of contents or the issues
themselves on websites
•Coverage is worldwide, but most of the articles or their
abstracts are in English
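
A PubMed search can also be issued programmatically. The sketch below only builds a query URL for the ESearch endpoint of the NCBI E-utilities, which is not mentioned in the slides and is brought in here as an assumption about how one might script such a search. The helper name pubmed_search_url and the search term are made up, and no network request is sent.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, max_results=20):
    """Build (but do not send) an ESearch query against the PubMed database."""
    params = {
        "db": "pubmed",         # database to search
        "term": term,           # query string, as typed into the PubMed search box
        "retmax": max_results,  # maximum number of record identifiers to return
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(pubmed_search_url("p53 AND apoptosis", max_results=5))
```

Fetching the resulting URL would return a list of matching record identifiers, which can then be used to retrieve the citation information.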
• These databases of databases collect data from different sources and
make them available in new and more convenient form, or with an
emphasis on a particular disease or organism
• Some examples,
1. ConsensusPathDB - A molecular functional interaction database,
integrating information from 12 other databases
2. Entrez - (National Center for Biotechnology Information)
3. Enzyme Portal - Integrates enzyme information such as small-molecule
chemistry, biochemical pathways and drug compounds (European
Bioinformatics Institute)
4. MetaBase (KOBIC) - A user contributed database of biological databases
5. mGen - Contains four of the world's biggest databases (GenBank, RefSeq,
EMBL and DDBJ) for easy, program-friendly gene extraction
6. PathogenPortal - A repository linking to the Bioinformatics Resource
Centers (BRCs) sponsored by the National Institute of Allergy and
Infectious Diseases (NIAID)
7. SOURCE - (Stanford University) encapsulates the genetics and molecular
biology of genes from the genomes of Homo sapiens, Mus musculus, and
Rattus norvegicus into easy to navigate GeneReports
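
Collecting data from different sources into one resource can be sketched as a merge that removes duplicate entries while recording where each one came from. The source names, accession numbers and sequences below are invented, and merging on identical sequences is just one possible redundancy criterion, assumed here for illustration.

```python
def merge_sources(sources):
    """Merge records from several databases, collapsing identical sequences.

    `sources` maps a source name to a list of records, each a dict with
    'accession' and 'sequence' keys. The result has one entry per distinct
    sequence, listing every (source, accession) pair where it was found.
    """
    merged = {}
    for source_name, records in sources.items():
        for record in records:
            entry = merged.setdefault(
                record["sequence"],
                {"sequence": record["sequence"], "found_in": []},
            )
            entry["found_in"].append((source_name, record["accession"]))
    return list(merged.values())


# Two hypothetical source databases that share one sequence.
toy_sources = {
    "source_a": [
        {"accession": "A001", "sequence": "MKV"},
        {"accession": "A002", "sequence": "MTT"},
    ],
    "source_b": [
        {"accession": "B900", "sequence": "MKV"},
    ],
}

for entry in merge_sources(toy_sources):
    print(entry)
```

A single search over the merged entries then replaces separate searches of each source, while the recorded accessions preserve the link back to the original databases.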
• Genome databases collect the genome sequences of organisms,
annotate and analyze them, and provide public access
• Some add curation of the experimental literature to improve
computed annotations
• These databases may hold the genomes of many species, or the
genome of a single model organism
• CAMERA - Resource for microbial genomics and
metagenomics
• Corn - the Maize Genetics and Genomics Database
• EcoCyc - A database that describes the genome and the
biochemical machinery of the model organism E. coli K-12
• Ensembl Genomes - Provides genome-scale data for
bacteria, protists, fungi, plants and invertebrate metazoa,
through a unified set of interactive and programmatic
interfaces (using the Ensembl software platform)
• FlyBase - Genome of Drosophila melanogaster
• National Microbial Pathogen Data Resource - A manually
curated database of annotated genome data for the
pathogens Campylobacter, Chlamydia, Chlamydophila,
Haemophilus, Listeria, Mycoplasma, Neisseria,
Staphylococcus, Streptococcus, Treponema, Ureaplasma
and Vibrio
• Saccharomyces Genome Database - Genome of yeast
• The SEED platform - For microbial genome analysis; includes
all complete microbial genomes and most partial genomes.
The platform is used to annotate microbial genomes using
subsystems
• WormBase - Genome of Caenorhabditis elegans
• TAIR - The Arabidopsis Information Resource
• RGD - Rat Genome Database; Genomic and phenotype data
for Rattus norvegicus
Proteomics databases

• Proteomics Identifications Database (PRIDE) - A public repository
for proteomics data, containing protein and peptide identifications
and their associated supporting evidence as well as details of
post-translational modifications (EBI)
• MitoMiner - A mitochondrial proteomics database integrating
large-scale experimental datasets from mass spectrometry and GFP
studies for various species (Medical Research Council Mitochondrial
Biology Unit)
RNA databases

• Rfam - A database of RNA families
• miRBase - The microRNA database
• snoRNAdb - A database of snoRNAs
• lncRNAdb - A database of lncRNAs
• GtRNAdb - A database of genomic tRNAs
• SILVA - A database of rRNAs
• RDP - The Ribosomal Database Project
Carbohydrate structure databases

• EuroCarbDB - A repository for both carbohydrate
sequences / structures and experimental data
Protein-protein interactions
• BIND - Biomolecular Interaction Network Database
• DIP - Database of Interacting Proteins
• STRING - A database of known and predicted protein-
protein interactions (EMBL)
• The Cell Collective - A web-based platform that enables
laboratory scientists from across the globe to
collaboratively build large-scale models of various
biological processes, and simulate/analyze them in real
time
• MINT - Molecular INTeraction database
Signal transduction pathway databases

• Cancer Cell Map
• Netpath - A curated resource of signal transduction
pathways in humans
• NCI-Nature Pathway Interaction Database
• Reactome - Navigable map of human biological
pathways, ranging from metabolic processes to
hormonal signalling
• SignaLink Database
• WikiPathways
• The Cell Collective
PCR and quantitative PCR primer databases

• PathoOligoDB - A free qPCR oligo database for
pathogens
• RTPrimerDB - A public primers and probes database for
real-time PCR reactions
Taxonomic databases

• Catalogue of Life source databases
• Encyclopedia of Life
• Integrated Taxonomic Information System
• EzTaxon-e - Database for the identification of
prokaryotes based on 16S ribosomal RNA gene
sequences
• Antibody Central - Antibody information database and
search resource
• Barcode of Life Data Systems - A database of DNA
barcodes
• Connectivity map - Transcriptional expression data and
correlation tools for drugs
• CTD - The Comparative Toxicogenomics Database
describes chemical-gene-disease interactions
• Drug2Gene - Provides integrated information for identified
and reported relations between genes/proteins and
drugs/compounds
• GreenPhylDB - A phylogenomic database for plant
comparative genomics
• HGMD - Human Gene Mutation Database of
disease-causing mutations
• HvrBase++ - Human and primate mitochondrial DNA
• Oncogenomic databases - A compilation of databases that
serve cancer research
• OMIM - Online Mendelian Inheritance in Man (inherited
diseases)
• p53 - The p53 knowledgebase
• PHI-base - Pathogen-host interaction database
• TRANSFAC - A database about eukaryotic transcription
factors, their genomic binding sites and DNA-binding
profiles
• TreeBASE - An open-access database of phylogenetic trees
and the data behind them
• TreeFam - Tree families database; a database
of phylogenetic trees of animal genes
