Class04- Biological databases - 2022

The document provides an overview of biological databases, defining them as structured collections of data used for storing, organizing, and retrieving biological information. It discusses the importance of database management systems, various types of databases, and highlights key biological databases such as GenBank and UniProt. Additionally, it addresses the challenges and limitations in database searching and the interconnection between different biological databases.

Uploaded by

m-9274491

31/10/2022

SIV 2001

Class 5
Introduction to biological databases

What is a database?

• A database is a computerized archive used to store and organize data so that information can be retrieved by a variety of search criteria.
• Biological databases are often used for knowledge discovery.
• A good database needs to be:
• structured with minimum redundancy
• searchable (data retrieval)
• updated periodically
• cross-referenced with other databases
• supported by tools for analysis and visualization
• Data that is processed, organized, structured and presented in a useful context -> information


What is a Database?

A structured collection of data held in computer storage; esp. one that incorporates software to make it accessible in a variety of ways; transf., any large collection of information.

database management: the organization and manipulation of data in a database.

database management system (DBMS): a software package that provides all the functions required for database management.

database system: a database together with a database management system.

Oxford Dictionary

Databases
• Books, articles 1968 -> 1985
• Computer tapes 1982 -> 1992
• Floppy disks 1984 -> 1990
• CD-ROM 1989 -> ?
• FTP 1989 -> ?
• On-line services 1982 -> ?
• WWW 1993 -> ?
• DVD 2001 -> ?


Biological Database
• Libraries related to biological data and information
• Data from scientific experiments
• Published literature
• High-throughput experiment technologies
• Computational analysis

Some important biological databases in bioinformatics
• GenBank ncbi.nlm.nih.gov Nucleotide sequences
• Ensembl ensembl.org Human/mouse genome
• PubMed ncbi.nlm.nih.gov Literature references
• NR ncbi.nlm.nih.gov Protein sequences
• UniProt uniprot.org/ Protein sequences
• InterPro ebi.ac.uk Protein domains
• OMIM ncbi.nlm.nih.gov Genetic diseases
• Expasy expasy.org Enzymes
• PDB rcsb.org/pdb/ Protein structures
• KEGG genome.ad.jp Metabolic pathways
• MINT mint.bio.uniroma2.it/ Protein-protein interactions
• GeneNet genenetwork.org/ Gene networks
• Transpath genexplain.com/transpath Signal transduction pathways


Genome size (haploid)

Organism                      Genome size (bp)
Phi-X 174                     5,386
Human Mitochondrion           16,569
Mycoplasma genitalium         490,885
Rickettsia prowazekii         1,111,523
Haemophilus influenzae        1,830,138
Streptococcus pneumoniae      2,160,837
Vibrio cholerae               4,033,460
Mycobacterium tuberculosis    4,411,532
Bacillus subtilis             4,214,814
Escherichia coli              4,639,211
Escherichia coli O157:H7      5,440,000
Saccharomyces cerevisiae      12,495,682
Caenorhabditis elegans        100,258,171
Arabidopsis thaliana          115,409,949
Drosophila melanogaster       122,653,977
Anopheles gambiae             178,244,063
Homo sapiens                  3,200,000,000
Oryza sativa                  3,900,000,000

The haploid human genome (23 chromosomes) is estimated to be about 3.2 billion bases long.

Working Draft Sequence

[Figure: working draft sequence, showing genes and gaps]

Bioinformatics Information Space and challenges

                              1999            2004
Nucleotide sequences:         4,456,822       36,653,899
Protein sequences:            706,862         4,436,362
3D structures:                9,780           19,640
Human Unigene Clusters:       75,832          118,517
Maps and Complete Genomes:    10,870          6,948
Different species node:       52,889          283,121
dbSNP                         6,377           13,179,601
RefGenes                      515             22,079
human contigs > 250 kb        341 (4.9MB)     2,487,920,000
PubMed records:               10,372,886      12,570,540
OMIM records:                 10,695          15,138
Interactions & complexes      -               52,385

The amount of sequenced DNA is vastly increasing
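
As a rough illustration of that growth, the short Python sketch below computes the fold increase between the 1999 and 2004 counts for three rows of the table above. The dictionary layout and variable names are assumptions made for this example only.

```python
# Counts copied from the 1999 and 2004 columns of the table above.
counts = {
    "Nucleotide sequences": (4_456_822, 36_653_899),
    "Protein sequences": (706_862, 4_436_362),
    "3D structures": (9_780, 19_640),
}

# Fold increase = later count divided by earlier count.
fold = {name: later / earlier for name, (earlier, later) in counts.items()}

for name, value in fold.items():
    print(f"{name}: {value:.1f}-fold increase")
```

Over these five years, nucleotide sequences grew roughly eightfold while 3D structures only about doubled, which shows that sequence data accumulates much faster than structural data.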


What is a database?
• Organized array of information
• A place where you put things in.
• And (if all goes well!) you should be able to get them out again.
• Allows you to make discoveries.

Database searching
• The requirements for a biological database search are:
• Sensitivity, which is the ability to find as many correct hits as possible.
• Selectivity (or specificity), which is the ability to exclude incorrect hits (false positives).
• Speed, which is the time it takes to get results from a database search.
• The ideal database search has the greatest sensitivity, selectivity and speed; however, this is difficult to achieve in reality. Among the limitations of a biological database search are:
• Increased sensitivity is associated with decreased selectivity.
• A very inclusive search with high sensitivity leads to many false positives (low selectivity).
• Increasing the speed tends to lower sensitivity and selectivity.
• Molecular sequence databases also contain errors introduced during sequencing.
• Molecular sequence databases contain redundant sequence information.
• These limitations can lead to incorrect search output and should be considered when analyzing the results of a biological database search.
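
The two quality measures above can be put into numbers. The Python sketch below assumes the usual confusion-matrix definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)), which the slides do not state explicitly; the hit counts are invented purely for illustration.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of all correct hits that the search actually found."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of all incorrect entries that the search correctly excluded."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search outcome (invented numbers, not real data):
# 90 correct hits found, 10 correct hits missed,
# 880 incorrect entries excluded, 20 incorrect entries reported.
tp, fn, tn, fp = 90, 10, 880, 20

print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
```

Loosening the search threshold would typically move entries from the "missed" group to the "found" group (higher sensitivity) while also letting more incorrect entries through (lower specificity), which is the trade-off described in the list above.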


Database format
• Flat file:
• Each database contains only one file, not linked to any other file.
• The computer has to read the entire file to find all entries or relationships.
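
A minimal Python sketch of why a flat file forces a full read: with no index, every line has to be inspected to find the matching entries. The FASTA-like layout, the record names and the function name `scan_flat_file` are assumptions for this example.

```python
import io

# A tiny, made-up flat file in a FASTA-like layout (header line, then sequence).
FLAT_FILE = """\
>seq1 Escherichia coli
TATAGCCG
>seq2 Bacillus subtilis
AGCTCCGATA
>seq3 Escherichia coli
CCGATGACAA
"""


def scan_flat_file(handle, keyword):
    """Read every line of the file and collect the headers containing keyword."""
    matches = []
    for line in handle:  # no index available: the whole file is scanned
        if line.startswith(">") and keyword in line:
            matches.append(line[1:].strip())
    return matches


hits = scan_flat_file(io.StringIO(FLAT_FILE), "Escherichia coli")
print(hits)
```

The search time grows with the size of the file, which is one reason large collections are usually moved into an indexed database system.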

Database format
• Relational database:
• A database structured to recognize relations between stored items of information.
• Connects data tables through values shared between their rows, so that information can be combined across tables.
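
The idea of linking tables can be sketched with Python's built-in `sqlite3` module. The two tables, their column names and the example rows are invented for this illustration; real biological databases use far richer schemas.

```python
import sqlite3

# In-memory database with two related tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sequence (
    id INTEGER PRIMARY KEY,
    organism_id INTEGER REFERENCES organism(id),
    residues TEXT
);
INSERT INTO organism VALUES (1, 'Escherichia coli'), (2, 'Bacillus subtilis');
INSERT INTO sequence VALUES
    (1, 1, 'TATAGCCG'), (2, 2, 'AGCTCCGATA'), (3, 1, 'CCGATGACAA');
""")

# The JOIN follows the shared key (organism.id = sequence.organism_id)
# to combine information stored in two different tables.
rows = conn.execute(
    "SELECT o.name, s.residues FROM sequence AS s "
    "JOIN organism AS o ON o.id = s.organism_id "
    "WHERE o.name = ? ORDER BY s.id",
    ("Escherichia coli",),
).fetchall()
print(rows)
```

The organism name is stored once and referenced by key from each sequence row, which keeps redundancy low compared with repeating the name in every record.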


Database format
• Object-oriented databases:
• Information is represented in the form of objects, as used in object-oriented programming.
• Stores complex data and relationships between data directly.
• No mapping to relational rows and columns -> suitable for applications dealing with very complex data.
• Great care must be taken when designing an object-oriented database to ensure efficient querying.
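
To give a feel for the object model, the sketch below uses Python dataclasses to hold a protein together with its domains as nested objects. It only illustrates the data model, not an actual object-oriented DBMS; the class names, the accession "EXAMPLE1" and the sequence are all made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Domain:
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass
class Protein:
    accession: str
    sequence: str
    # The relationship "protein has domains" is stored directly as objects.
    domains: List[Domain] = field(default_factory=list)

    def domain_sequence(self, name: str) -> str:
        """Return the residues covered by the named domain."""
        for domain in self.domains:
            if domain.name == name:
                return self.sequence[domain.start - 1 : domain.end]
        raise KeyError(name)


# Hypothetical record, invented for illustration.
protein = Protein("EXAMPLE1", "MKTAYIAKQR", [Domain("example_domain", 3, 6)])
print(protein.domain_sequence("example_domain"))
```

The domain list lives inside the protein object itself, so no join between separate tables is needed to reach it.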

Types of data generated by molecular biology research
• Molecular sequence:
• Nucleotide sequence (DNA or RNA)
• Protein (amino acids) sequence
• Genome sequence (complete or draft).
• Gene expression
• Gene polymorphism (variation)
• Macromolecular 3D structure
• Protein sequence patterns or motifs
• Metabolic pathways


Primary Databases
• Experimental results are submitted by researchers -> original submissions go directly into the database
• Content controlled by the submitter
• Three major databases for nucleotide sequences:
• GenBank
http://www.ncbi.nlm.nih.gov/Genbank/
• European Molecular Biology Laboratory (EMBL)
www.embl.org
• DNA Data Bank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp/
• Sequences are exchanged on a daily basis

• US:
• NIH (National Institutes of Health) -> NLM (National Library of Medicine) -> NCBI (National Center for Biotechnology Information) -> GenBank (database)
• European:
• EMBL (European Molecular Biology Laboratory) -> EBI (European Bioinformatics Institute) -> ENA (European Nucleotide Archive)


Primary sequence databases

[Figure: labs and sequencing centers submit sequences (e.g. TATAGCCG) to GenBank, which is updated ONLY by submitters]

Secondary databases
• Significant processing of original raw data.
• Annotation.
• Functional links.
• Carefully curated database (manually or automatically).
• High quality.
• SWISS-PROT, TrEMBL and PIR are combined in UniProt.
• http://www.uniprot.org/
• Incorporates:
• Function of the protein
• Subcellular localization of protein
• Post-translational modification
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Diseases associated with deficiencies in the protein
• Sequence conflicts, variants, etc.


Primary vs. Secondary sequence databases

[Figure: labs and sequencing centers submit sequences to GenBank, which is updated ONLY by submitters. Curators, genome assembly and algorithms derive RefSeq and UniGene, which are updated continually by NCBI]

Specialized Databases
• Often focused on a specific aspect of an organism.
• Curated by experts.
• Highly annotated and processed data.


Organism-specific genomic databases

Organism                    Database/resource                                         URL
Escherichia coli            EcoGene                                                   http://bmb.med.miami.edu/EcoGene/EcoWeb
                            EcoCyc (Encyclopedia of E. coli genes and metabolism)     http://ecocyc.pangeasystems.com/ecocyc/ecocyc.html
                            Colibri                                                   http://genolist.pasteur.fr/Colibri
Bacillus subtilis           SubtiList                                                 http://genolist.pasteur.fr/SubtiList
Saccharomyces cerevisiae    Saccharomyces Genome Database (SGD)                       http://genome-www.stanford.edu/Saccharmyces
Plasmodium falciparum       PlasmoDB                                                  http://PlasmoDB.org
Arabidopsis thaliana        MIPS Arabidopsis thaliana Database (MAtDB)                http://mips.gsf.de/proj/thal/db
                            The Arabidopsis Information Resource (TAIR)               http://www.arabidopsis.org
Drosophila melanogaster     FlyBase                                                   http://flybase.bio.indiana.edu
Caenorhabditis elegans      A C. elegans DataBase (ACeDB)                             http://www.acedb.org
Mouse                       Mouse Genome Database (MGD)                               http://www.informatics.jax.org
Human                       Human Genome Organization (HUGO)                          http://www.hugo-international.org/


SGD is a derivative database serving the yeast research community

Grew out of decades of research

Genome project provided a systematic organization for genes

Bio-databases: A short word on problems


• Even today we face some key limitations
• There is no standard format
• Every database or program has its own format
• There is no standard nomenclature
• Every database has its own names
• Data is not fully optimized
• Some datasets have missing information without indications of it
• Data errors
• Data is sometimes of poor quality, erroneous, misspelled
• Error propagation resulting from computer annotation


Interconnection between biological databases

• Need to access both primary and secondary databases.
• Provide links between databases.
• Difficult to connect databases with different structures: ASCII, relational and object-oriented.
• Common Object Request Broker Architecture (CORBA).
• eXtensible Markup Language (XML).
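
As a small illustration of XML as an exchange format, the sketch below parses a made-up XML record with Python's standard `xml.etree.ElementTree` module. The element names, the attribute and the accession "EXAMPLE1" are assumptions for this example and do not follow any real database schema.

```python
import xml.etree.ElementTree as ET

# A made-up record; tag names are hypothetical, not a real database format.
RECORD_XML = """
<entry accession="EXAMPLE1">
  <organism>Escherichia coli</organism>
  <sequence>TATAGCCG</sequence>
</entry>
"""

entry = ET.fromstring(RECORD_XML)

# Because every field is explicitly tagged, any program can extract it
# without knowing how the source database stores the data internally.
record = {
    "accession": entry.get("accession"),
    "organism": entry.findtext("organism"),
    "sequence": entry.findtext("sequence"),
}
print(record)
```

The receiving side only needs to agree on the tag names, not on whether the sender keeps its data in flat files, relational tables or objects.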
