Class04- Biological databases - 2022

The document provides an overview of biological databases, defining them as structured collections of data used for storing, organizing, and retrieving biological information. It discusses the importance of database management systems, various types of databases, and highlights key biological databases such as GenBank and UniProt. Additionally, it addresses the challenges and limitations in database searching and the interconnection between different biological databases.

Uploaded by

m-9274491


31/10/2022

SIV 2001

Class 5
Introduction to biological databases

What is a database?

• A database is a computerized archive used to store and organize data so that information can be retrieved by a variety of search criteria.
• Biological databases are often used for knowledge discovery.
• A good database needs to be:
• structured with minimum redundancy
• searchable (data retrieval)
• updated periodically
• cross-referenced with other databases
• equipped with tools for analysis and visualization
• Data that is processed, organized, structured and presented in a useful context -> information


What is a Database?

A structured collection of data held in computer storage; esp. one that incorporates software to make it accessible in a variety of ways; transf., any large collection of information.

database management: the organization and manipulation of data in a database.

database management system (DBMS): a software package that provides all the functions required for database management.

database system: a database together with a database management system.

Oxford Dictionary

Databases
• Books, articles 1968 -> 1985
• Computer tapes 1982 -> 1992
• Floppy disks 1984 -> 1990
• CD-ROM 1989 -> ?
• FTP 1989 -> ?
• On-line services 1982 -> ?
• WWW 1993 -> ?
• DVD 2001 -> ?


Biological Database
• Libraries related to biological data and information
• Data from scientific experiments
• Published literature
• High-throughput experiment technologies
• Computational analysis

Some important biological databases in bioinformatics

Database   URL                        Content
GenBank    ncbi.nlm.nih.gov           Nucleotide sequences
Ensembl    ensembl.org                Human/mouse genome
PubMed     ncbi.nlm.nih.gov           Literature references
NR         ncbi.nlm.nih.gov           Protein sequences
UniProt    uniprot.org/               Protein sequences
InterPro   ebi.ac.uk                  Protein domains
OMIM       ncbi.nlm.nih.gov           Genetic diseases
Expasy     expasy.org                 Enzymes
PDB        rcsb.org/pdb/              Protein structures
KEGG       genome.ad.jp               Metabolic pathways
MINT       mint.bio.uniroma2.it/      Protein-protein interactions
GeneNet    genenetwork.org/           Gene networks
Transpath  genexplain.com/transpath   Signal transduction pathways


Genome size (haploid)

Organism                     Genome size (bp)
Phi-X 174                    5,386
Human Mitochondrion          16,569
Mycoplasma genitalium        490,885
Rickettsia prowazekii        1,111,523
Haemophilus influenzae       1,830,138
Streptococcus pneumoniae     2,160,837
Vibrio cholerae              4,033,460
Mycobacterium tuberculosis   4,411,532
Bacillus subtilis            4,214,814
Escherichia coli             4,639,211
Escherichia coli O157:H7     5,440,000
Saccharomyces cerevisiae     12,495,682
Caenorhabditis elegans       100,258,171
Arabidopsis thaliana         115,409,949
Drosophila melanogaster      122,653,977
Anopheles gambiae            178,244,063
Homo sapiens                 3,200,000,000
Oryza sativa                 3,900,000,000

The haploid human genome (23 chromosomes) is estimated to be about 3.2 billion bases long.

[Figure: Working Draft Sequence, with gaps and genes labeled]


Bioinformatics Information Space and challenges

                             1999           2004
Nucleotide sequences         4,456,822      36,653,899
Protein sequences            706,862        4,436,362
3D structures                9,780          19,640
Human Unigene Clusters       75,832         118,517
Maps and Complete Genomes    10,870         6,948
Different species node       52,889         283,121
dbSNP                        6,377          13,179,601
RefGenes                     515            22,079
human contigs > 250 kb       341 (4.9MB)    2,487,920,000
PubMed records               10,372,886     12,570,540
OMIM records                 10,695         15,138
Interactions & complexes     -              52,385

The amount of sequenced DNA is vastly increasing
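To make the growth concrete, the following minimal sketch computes the fold change between the 1999 and 2004 counts for a few rows of the table above. The function name `fold_change` and the choice of rows are illustrative assumptions, not part of the original slides.

```python
# Record counts copied from the 1999 and 2004 columns of the table above.
counts = {
    "Nucleotide sequences": (4_456_822, 36_653_899),
    "Protein sequences": (706_862, 4_436_362),
    "3D structures": (9_780, 19_640),
    "dbSNP": (6_377, 13_179_601),
    "PubMed records": (10_372_886, 12_570_540),
}


def fold_change(start: int, end: int) -> float:
    """Return how many times larger `end` is than `start` (hypothetical helper)."""
    return end / start


for name, (y1999, y2004) in counts.items():
    print(f"{name}: {fold_change(y1999, y2004):.1f}x growth from 1999 to 2004")
```

Sequence-derived collections (nucleotide sequences, dbSNP) grow far faster than the literature count, which is the point the slide makes.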


What is a database?
• Organized array of information
• A place where you put things.
• And (if all goes well!) you should be able to get them out again.
• Allows you to make discoveries.

Database searching
• The requirements for a biological database search are:
• Sensitivity, which is the ability to find as many correct hits as possible.
• Selectivity or specificity, which is the ability to exclude incorrect hits (false positives).
• Speed, which is the time it takes to get results from a database search.
• The ideal database search has the greatest sensitivity, selectivity and speed; however, this is difficult to achieve in reality. Among the limitations of a biological database search are:
• Increased sensitivity is associated with decreased selectivity.
• A very inclusive search with high sensitivity leads to many false positives (low selectivity).
• Increasing the speed lowers sensitivity and selectivity.
• Molecular sequence databases also contain errors introduced during sequencing.
• Molecular sequence databases contain redundant sequence information.
• These limitations can lead to incorrect output from a biological database search and should be considered when analyzing its results.
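The trade-off between sensitivity and selectivity can be illustrated with a small calculation. This is a sketch under stated assumptions: it uses the common definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), which the slides do not spell out, and the hit counts for the two searches are invented for illustration.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of the true matches that the search actually found."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of the non-matches that the search correctly excluded."""
    return tn / (tn + fp)


# Hypothetical outcomes of two searches against the same 1,000-entry database
# that contains 100 true matches and 900 non-matches.
strict_search = {"tp": 60, "fn": 40, "fp": 5, "tn": 895}
inclusive_search = {"tp": 95, "fn": 5, "fp": 200, "tn": 700}

for label, r in (("strict", strict_search), ("inclusive", inclusive_search)):
    print(
        f"{label}: sensitivity={sensitivity(r['tp'], r['fn']):.2f}, "
        f"specificity={specificity(r['tn'], r['fp']):.2f}"
    )
```

In this made-up example the inclusive search finds more of the true matches but also admits many more false positives, so its specificity drops.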


Database format
• Flat file:
• Each database contains only one file, not linked to any other file.
• The computer has to read the entire file to find all entries or relationships.
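A minimal sketch of what a flat-file lookup involves, assuming a hypothetical pipe-delimited layout (record ID, organism, sequence) that is not taken from any real database format: every line must be read to answer even a simple query.

```python
import io

# Hypothetical flat file: one record per line, fields separated by "|".
FLAT_FILE = """\
ID001|Escherichia coli|ATGGCGTTAG
ID002|Bacillus subtilis|ATGAAACGTT
ID003|Escherichia coli|ATGCCGTAAC
"""


def find_by_organism(handle, organism):
    """Scan every line of the flat file and collect the matching record IDs."""
    matches = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record_id, record_organism, _sequence = line.split("|")
        if record_organism == organism:
            matches.append(record_id)
    return matches


print(find_by_organism(io.StringIO(FLAT_FILE), "Escherichia coli"))
```

There is no index, so the cost of each query grows with the size of the file.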

Database format
• Relational database:
• A database structured to recognize relations between stored items of information.
• Links data tables through shared key columns, so that rows in one table can be related to rows in another.
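The relational idea can be sketched with Python's built-in `sqlite3` module. The table and column names below (`organism`, `sequence`, `organism_id`) are assumptions chosen for illustration; the two tables are related through the shared `organism_id` key.

```python
import sqlite3

# In-memory database with two related tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE sequence (
        sequence_id TEXT PRIMARY KEY,
        organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
        residues    TEXT NOT NULL
    );
    INSERT INTO organism VALUES (1, 'Escherichia coli'), (2, 'Bacillus subtilis');
    INSERT INTO sequence VALUES
        ('ID001', 1, 'ATGGCGTTAG'),
        ('ID002', 2, 'ATGAAACGTT'),
        ('ID003', 1, 'ATGCCGTAAC');
    """
)

# Join the two tables on the shared key to find all sequences for one organism.
rows = conn.execute(
    """
    SELECT s.sequence_id
    FROM sequence AS s
    JOIN organism AS o ON o.organism_id = s.organism_id
    WHERE o.name = ?
    ORDER BY s.sequence_id
    """,
    ("Escherichia coli",),
).fetchall()
print(rows)
```

The organism name is stored once and referenced by key, which is one way a relational design reduces redundancy.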


Database format
• Object-oriented databases:
• Information is represented in the form of objects, as used in object-oriented programming.
• Stores complex data and the relationships between data directly.
• No mapping to relational rows and columns -> suitable for applications dealing with very complex data.
• Great care must be taken when designing an object-oriented database to ensure efficient querying.
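As a rough illustration of the object model only (not of any particular object-oriented database product), the sketch below stores relationships as direct references between objects. The classes `Gene`, `Protein` and `Domain`, the accession, and the sequence are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass
class Protein:
    accession: str
    sequence: str
    domains: list[Domain] = field(default_factory=list)

    def domain_sequence(self, name: str) -> str:
        """Return the residues covered by the named domain."""
        for domain in self.domains:
            if domain.name == name:
                return self.sequence[domain.start - 1 : domain.end]
        raise KeyError(name)


@dataclass
class Gene:
    symbol: str
    products: list[Protein] = field(default_factory=list)


# Objects hold their related objects directly; no join over rows is needed.
protein = Protein("P00000", "MKTAYIAKQRQISFVK", [Domain("example_domain", 3, 8)])
gene = Gene("exampleA", [protein])
print(gene.products[0].domain_sequence("example_domain"))
```

Navigating from a gene to its protein to a domain follows object references, which suits deeply nested data but makes ad hoc queries across many objects harder to optimize.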

Types of data generated by molecular biology research
• Molecular sequence:
• Nucleotide sequence (DNA or RNA)
• Protein (amino acid) sequence
• Genome sequence (complete or draft).
• Gene expression
• Gene polymorphism (variation)
• Macromolecular 3D structure
• Protein sequence patterns or motifs
• Metabolic pathways


Primary Databases
• Experimental results submitted by researchers -> original submissions directly into the database
• Content controlled by the submitter
• Three major databases for nucleotide sequences:
• GenBank
http://www.ncbi.nlm.nih.gov/Genbank/
• European Molecular Biology Laboratory (EMBL)
www.embl.org
• DNA Data Bank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp/
• Sequences are exchanged on a daily basis

• US:
• NIH (National Institutes of Health) -> NLM (National Library of Medicine) -> NCBI (National Center for Biotechnology Information) -> GenBank (database)
• European:
• EMBL (European Molecular Biology Laboratory) -> EBI (European Bioinformatics Institute) -> ENA (European Nucleotide Archive)


Primary sequence databases

[Figure: labs and sequencing centers submit sequences (e.g. TATAGCCG) to GenBank, which is updated ONLY by submitters]

Secondary databases
• Significant processing of original raw data.
• Annotation.
• Functional links.
• Carefully curated database (manually or automatically).
• High quality.
• SWISS-PROT, TrEMBL and PIR combined in UniProt.
• http://www.uniprot.org/
• Incorporates:
• Function of the protein
• Subcellular localization of protein
• Post-translational modification
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Diseases associated with deficiencies in the protein
• Sequence conflicts, variants, etc.


Primary vs. Secondary sequence databases

[Figure: labs and sequencing centers submit sequences to GenBank, which is updated ONLY by submitters; RefSeq and UniGene, labeled with curators, genome assembly and algorithms, are updated continually by NCBI]

Specialized Databases
• Often focused on a specific aspect of an organism.
• Curated by experts.
• Highly annotated and processed data.


Organism-specific genomic databases

Organism                  Database/resource                                       URL
Escherichia coli          EcoGene                                                 http://bmb.med.miami.edu/EcoGene/EcoWeb
                          EcoCyc (Encyclopedia of E. coli genes and metabolism)   http://ecocyc.pangeasystems.com/ecocyc/ecocyc.html
                          Colibri                                                 http://genolist.pasteur.fr/Colibri
Bacillus subtilis         SubtiList                                               http://genolist.pasteur.fr/SubtiList
Saccharomyces cerevisiae  Saccharomyces Genome Database (SGD)                     http://genome-www.stanford.edu/Saccharmyces
Plasmodium falciparum     PlasmoDB                                                http://PlasmoDB.org
Arabidopsis thaliana      MIPS Arabidopsis thaliana Database (MAtDB)              http://mips.gsf.de/proj/thal/db
                          The Arabidopsis information resource (TAIR)             http://www.arabidopsis.org
Drosophila melanogaster   FlyBase                                                 http://flybase.bio.indiana.edu
Caenorhabditis elegans    A C. elegans DataBase (ACeDB)                           http://www.acedb.org
Mouse                     Mouse Genome Database (MGD)                             http://www.informatics.jax.org
Human                     Human Genome Organization (HUGO)                        http://www.hugo-international.org/


SGD is a derivative database serving the yeast research community

Grew out of decades of research

Genome project provided a systematic organization for genes

Bio-databases: A short word on problems

• Even today we face some key limitations
• There is no standard format
• Every database or program has its own format
• There is no standard nomenclature
• Every database has its own names
• Data is not fully optimized
• Some datasets have missing information without indications of it
• Data errors
• Data is sometimes of poor quality, erroneous, misspelled
• Error propagation resulting from computer annotation


Interconnection between biological databases

• Need to access both primary and secondary databases.

• Provide links between databases.
• Difficult to connect databases with different structures: ASCII, relational and object-oriented.
• Common Object Request Broker Architecture (CORBA).
• eXtensible Markup Language (XML).
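
As a small sketch of how XML can act as a neutral exchange format between databases with different internal structures, the example below serializes a record to XML and parses it back using Python's standard `xml.etree.ElementTree` module. The element names (`entry`, `organism`, `sequence`) and the record itself are assumptions for illustration, not a real database schema.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record: dict) -> str:
    """Serialize a record to an XML string (hypothetical element names)."""
    root = ET.Element("entry", attrib={"id": record["id"]})
    ET.SubElement(root, "organism").text = record["organism"]
    ET.SubElement(root, "sequence").text = record["sequence"]
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text: str) -> dict:
    """Parse the XML string back into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "organism": root.findtext("organism"),
        "sequence": root.findtext("sequence"),
    }


original = {"id": "ID001", "organism": "Escherichia coli", "sequence": "ATGGCGTTAG"}
xml_text = record_to_xml(original)
print(xml_text)
print(xml_to_record(xml_text) == original)
```

Because the tags describe the data, the receiving database does not need to know whether the sender stored the record in a flat file, relational tables or objects.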
