Class04- Biological databases - 2022
SIV 2001
Class 5
Introduction to biological databases
What is a database?
31/10/2022
What is a Database?
Oxford Dictionary
Databases
• Books, articles 1968 -> 1985
• Computer tapes 1982 -> 1992
• Floppy disks 1984 -> 1990
• CD-ROM 1989 -> ?
• FTP 1989 -> ?
• On-line services 1982 -> ?
• WWW 1993 -> ?
• DVD 2001 -> ?
Biological Database
• Libraries related to biological data and information
• Data from scientific experiments
• Published literature
• High-throughput experiment technologies
• Computational analysis
What is a database?
• Organized array of information
• A place where you put things.
• And (if all goes well!) you should be able to get them out again.
• Allows you to make discoveries.
Database searching
• The requirements for biological database searching are:
• Sensitivity: the ability to find as many correct hits as possible.
• Selectivity (or specificity): the ability to exclude incorrect hits (false
positives).
• Speed: the time it takes to get results from a database search.
• The ideal database search has the greatest sensitivity, selectivity and speed;
in practice this is difficult to achieve. Among the limitations of a
biological database search are:
• Increased sensitivity is associated with decreased selectivity.
• A very inclusive search with high sensitivity leads to many false positives
(low selectivity).
• Increasing the speed tends to lower sensitivity and selectivity.
• Molecular sequence databases contain errors introduced during the
sequencing phase.
• Molecular sequence databases contain redundant sequence
information.
• These limitations can lead to incorrect search output and should be
considered when analyzing the results of a biological database search.
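The trade-off between sensitivity and selectivity can be illustrated with a small worked example. This is a minimal sketch, assuming the standard statistical definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the entry names and the "strict" and "loose" result sets are invented for illustration and do not come from any real database.

```python
def search_metrics(hits, true_members, database):
    """Return (sensitivity, specificity) for one search result.

    hits:         entries the search returned
    true_members: entries that are genuinely correct hits
    database:     every entry that was searched
    """
    hits, true_members, database = set(hits), set(true_members), set(database)
    tp = len(hits & true_members)              # correct hits found
    fn = len(true_members - hits)              # correct hits missed
    fp = len(hits - true_members)              # incorrect hits returned
    tn = len(database - hits - true_members)   # incorrect entries excluded
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Toy database of 10 entries, 4 of which are true matches (hypothetical data).
database = {f"seq{i}" for i in range(1, 11)}
true_members = {"seq1", "seq2", "seq3", "seq4"}

# A strict search returns few hits; a loose (inclusive) search returns many.
strict = {"seq1", "seq2"}
loose = {"seq1", "seq2", "seq3", "seq4", "seq5", "seq6", "seq7"}

print("strict:", search_metrics(strict, true_members, database))
print("loose: ", search_metrics(loose, true_members, database))
```

In this toy case the strict search misses half of the true matches but returns no false positives, while the loose search finds every true match at the cost of several false positives.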
Database format
• Flat file:
• Each database contains only one file, not linked
to any other file.
• The computer has to read the entire file to find
all entries or relationships.
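The cost of a flat file can be sketched in a few lines. This is only an illustration: the "|"-separated record layout, the identifiers and the function name `find_by_organism` are assumptions made for the example, not a real database format.

```python
import io

# Hypothetical flat file: one record per line, fields separated by "|".
FLAT_FILE = io.StringIO(
    "ID001|Homo sapiens|ATGGCC\n"
    "ID002|Mus musculus|TATAGCCG\n"
    "ID003|Homo sapiens|GGCTATA\n"
)


def find_by_organism(handle, organism):
    """Scan every line; a flat file has no index to jump to matching records."""
    handle.seek(0)
    matches = []
    lines_read = 0
    for line in handle:
        lines_read += 1
        entry_id, entry_organism, sequence = line.rstrip("\n").split("|")
        if entry_organism == organism:
            matches.append((entry_id, sequence))
    return matches, lines_read


matches, lines_read = find_by_organism(FLAT_FILE, "Homo sapiens")
print(matches)      # the matching records
print(lines_read)   # every line in the file had to be read
```

Even though only two records match, all three lines are read, and the same would hold for a file with millions of entries.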
Database format
• Relational database:
• A database structured to recognize relations
between stored items of information.
• Data are held in tables of rows and columns; tables are
connected through shared key values, so rows in one table
can be related to rows in another.
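A relational layout can be sketched with Python's built-in `sqlite3` module. The table names (`organism`, `sequence`), column names and rows below are hypothetical and chosen only to show how a shared key links two tables.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE sequence (
    sequence_id INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    residues    TEXT NOT NULL
);
INSERT INTO organism VALUES (1, 'Homo sapiens'), (2, 'Mus musculus');
INSERT INTO sequence VALUES
    (10, 1, 'ATGGCC'),
    (11, 2, 'TATAGCCG'),
    (12, 1, 'GGCTATA');
""")

# The shared organism_id column relates rows in the two tables.
rows = conn.execute(
    """
    SELECT s.sequence_id, o.name, s.residues
    FROM sequence AS s
    JOIN organism AS o ON o.organism_id = s.organism_id
    WHERE o.name = ?
    ORDER BY s.sequence_id
    """,
    ("Homo sapiens",),
).fetchall()
print(rows)
```

The organism name is stored once and referenced by key, which is the main contrast with the flat-file approach where each record repeats it.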
Database format
• Object-oriented databases:
• Information is represented in the form of objects, as used in
object-oriented programming.
• Stores complex data and the relationships between data directly.
• No mapping to relational rows and columns is needed, which makes this format
suitable for applications dealing with very complex data.
• Great care must be taken when designing an object-oriented database to
ensure efficient querying.
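The idea of storing data as objects can be sketched with plain Python classes. This is not a real object database, only in-memory objects; the class names (`Protein`, `Domain`), the accession and the sequence are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Domain:
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass
class Protein:
    accession: str
    sequence: str
    # The relationship "protein has domains" is stored directly on the object,
    # with no separate table and no join.
    domains: List[Domain] = field(default_factory=list)

    def domain_sequence(self, name):
        """Return the residues covered by the named domain."""
        for domain in self.domains:
            if domain.name == name:
                return self.sequence[domain.start - 1:domain.end]
        raise KeyError(name)


# Hypothetical record.
protein = Protein("P_EXAMPLE", "MKTAYIAKQR", [Domain("toy_domain", 3, 6)])
print(protein.domain_sequence("toy_domain"))
```

Following the object reference is direct, but a question such as "which proteins contain this domain?" would need every object to be visited unless the design plans for it, which is why query efficiency needs care.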
Primary Databases
• Original experimental results, submitted directly to the
database by researchers
• Content controlled by the submitter
• Three major databases for nucleotide sequences:
• GenBank
http://www.ncbi.nlm.nih.gov/Genbank/
• European Molecular Biology Laboratory (EMBL)
www.embl.org
• DNA Databank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp/
• Sequences are exchanged on a daily basis
• US:
• NIH (National Institutes of Health) -> NLM (National Library of
Medicine) -> NCBI (National Center for Biotechnology Information)
-> GenBank (database)
• European:
• EMBL (European Molecular Biology Laboratory) -> EBI (European
Bioinformatics Institute) -> ENA (European Nucleotide Archive)
[Figure: labs and sequencing centers submit sequences (e.g. TATAGCCG) to GenBank; GenBank is updated ONLY by submitters]
Secondary databases
• Significant processing of original raw data.
• Annotation.
• Functional links.
• Carefully curated (manually or automatically).
• High quality.
• Swiss-Prot, TrEMBL and PIR are combined in UniProt.
• http://www.uniprot.org/
• Incorporates:
• Function of the protein
• Subcellular localization of protein
• Post-translational modification
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Diseases associated with deficiencies in the protein
• Sequence conflicts, variants, etc.
[Figure: primary and secondary databases. Labels: sequencing centers; GenBank (updated ONLY by submitters); curators, genome assembly, algorithms; UniGene; updated continually by NCBI]
Specialized Databases
• Often focused on a specific aspect of an organism.
• Curated by experts.
• Highly annotated and processed data.