1. Databases
Introduction and biological databases
19/2/2025
Definition of bioinformatics
Central dogma in molecular biology
History of DNA sequencing
History of genome sequencing
• Bacteriophage PhiX174
• First sequenced genome, by Sanger sequencing
• DNA genome consists of 5386 nucleotides and 11 genes
• Published in 1977
• Haemophilus influenzae
• First sequenced free-living organism
• DNA genome consists of 1.8 million nucleotides and 1800 genes
• Published in 1995
• Saccharomyces cerevisiae
• First sequenced eukaryote
• Genome consists of 12 million nucleotides and 6000 genes
• Published in 1996, took 7 years to finish
• Homo sapiens
• The Human Genome Project
• Genome consists of approximately 3.25 billion nucleotides and 21000 genes
• Initiated in 1990, finished 13 years later
• Joint effort by 200 research groups, cost estimated at $3 billion
• First gap-less human genome published in March 2022
Definition of bioinformatics
Essential bioinformatics
Bioinformatics software: two cultures
Bioinformatics vs. computational biology
Reproducible research in bioinformatics
Biological databases
Types of databases
• Flat file format: a long text file that contains many entries separated by a delimiter (|). Within each entry, a number of fields are separated by tabs or commas (,). In effect, the entire database is a single table.
• To search a flat file for information, a computer has to read through the entire file. Searching becomes more efficient once the data are organized in a data structure (a data management system).
• Two types of data management systems
• Relational databases
• Object-oriented databases
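As a rough illustration of the flat-file bullets above, the sketch below parses a small made-up flat file (entries separated by `|`, fields by commas) and contrasts a linear scan with a lookup through a dictionary index. The sample data, field names, and function names are hypothetical, and the dictionary only stands in for the kind of indexing a real data management system provides.

```python
# Hypothetical flat file: entries separated by "|", fields by ",".
FLAT_FILE = "P001,insulin,Homo sapiens|P002,lysozyme,Gallus gallus|P003,actin,Mus musculus"

FIELDS = ("accession", "name", "organism")  # assumed field order


def parse_flat_file(text):
    """Split the text into entries, then each entry into named fields."""
    return [dict(zip(FIELDS, entry.split(","))) for entry in text.split("|")]


def linear_search(records, accession):
    """Read through the records one by one until the accession is found."""
    for record in records:
        if record["accession"] == accession:
            return record
    return None


def build_index(records):
    """Map each accession to its record so lookups avoid a full scan."""
    return {record["accession"]: record for record in records}


records = parse_flat_file(FLAT_FILE)
index = build_index(records)

print(linear_search(records, "P003")["name"])  # walks through all three entries
print(index["P003"]["name"])                   # direct lookup via the index
```

Both calls return the same record. The difference is that the scan grows slower as the file grows, while the indexed lookup does not need to visit every entry.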
Relational databases
• Relational databases use a set of tables to organize data
(Figure labels: Relation, Entity, Field, Attribute, Value)
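To make the idea of organizing data in a set of tables concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, columns, and rows are invented for illustration and are not taken from any real biological database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Two related tables: each gene row points to an organism row by its id.
cur.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE gene ("
    "id INTEGER PRIMARY KEY, symbol TEXT, "
    "organism_id INTEGER REFERENCES organism(id))"
)

cur.executemany("INSERT INTO organism VALUES (?, ?)",
                [(1, "Homo sapiens"), (2, "Saccharomyces cerevisiae")])
cur.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                [(1, "TP53", 1), (2, "BRCA1", 1), (3, "GAL4", 2)])
conn.commit()

# A join combines the tables to answer: which genes belong to this organism?
rows = cur.execute(
    "SELECT gene.symbol, organism.name FROM gene "
    "JOIN organism ON gene.organism_id = organism.id "
    "WHERE organism.name = ? ORDER BY gene.symbol",
    ("Homo sapiens",),
).fetchall()

print(rows)
```

Storing the organism name once and referring to it by id is the design choice that separates this layout from a single flat table, where the name would be repeated in every gene entry.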
Information retrieval from biological databases
GenBank sequence format
• Search output is a flat file that contains 3 sections: header, features and sequence entry
• Each field has a unique identifier for easy indexing by computer software
(Figure labels: Header, Feature, Sequence)
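The three-section layout can be sketched in plain Python as follows. The record below is a shortened, made-up example laid out like a GenBank entry, and the function names are hypothetical. The sketch relies on the `FEATURES` and `ORIGIN` keywords and the `//` terminator to find section boundaries, and it assumes header lines are not wrapped onto continuation lines.

```python
# Abbreviated, made-up record laid out like a GenBank entry (not real data).
RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   EXAMPLE01
FEATURES             Location/Qualifiers
     source          1..24
     CDS             1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def split_genbank(text):
    """Split one record into header, features and sequence sections."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
            continue  # the ORIGIN line itself carries no bases
        elif line.startswith("//"):
            break  # end-of-record marker
        sections[current].append(line)
    return sections


def extract_sequence(sequence_lines):
    """Drop position numbers and spaces, keeping only the bases."""
    return "".join(ch for line in sequence_lines for ch in line if ch.isalpha())


sections = split_genbank(RECORD)
# Assumes every header line starts with its field identifier (no wrapped lines).
header_ids = [line.split()[0] for line in sections["header"]]
sequence = extract_sequence(sections["sequence"])

print(header_ids)
print(sequence.upper())
```

For real records, an established parser such as the one in Biopython is a safer choice than hand-written splitting, since actual entries contain wrapped lines and nested qualifiers that this sketch ignores.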
GenBank sequence format - Header
GenBank sequence format - Features
GenBank sequence format - Sequence
FASTA format
UniProt database
33
Protein Data Bank