
1. Databases

The document provides an overview of bioinformatics, defining it as an interdisciplinary field that utilizes computer technology for managing biological data, particularly related to macromolecules like DNA and proteins. It outlines the history of DNA sequencing and genome projects, including significant milestones such as the Human Genome Project. Additionally, it discusses various types of biological databases, their structures, and the importance of reproducible research in bioinformatics.

Uploaded by

ntmh370
Copyright
© All Rights Reserved

Chapter 1. Introduction and biological databases
19/2/2025

1
Definition of bioinformatics

• Bioinformatics is an interdisciplinary research area at the interface between computer science and biological science

• Bioinformatics involves the technology that uses computers for storage, retrieval, manipulation and distribution of information related to biological macromolecules, including DNA, RNA and proteins

2
Central dogma in molecular biology

3 Bioinformatics and functional genomics


History of DNA sequencing

• The structure of DNA was discovered in 1953 by Watson and Crick
• The first nucleic acid sequence, a yeast alanine tRNA, was read in 1965

4
History of DNA sequencing

• Rapid DNA sequencing was developed by Fred Sanger in 1977

5
History of genome sequencing

• Bacteriophage PhiX174
  • First sequenced DNA genome, by Fred Sanger and colleagues
  • DNA genome consists of 5386 nucleotides and 11 genes
  • Published in 1977

• Haemophilus influenzae
  • First sequenced free-living organism
  • DNA genome consists of 1.8 million nucleotides and 1800 genes
  • Published in 1995

6 https://doi.org/10.1016/j.jgar.2024.08.001; PDB-101: Molecule of the Month: Bacteriophage phiX174


History of genome sequencing

• Saccharomyces cerevisiae
  • First sequenced eukaryote
  • Genome consists of 12 million nucleotides and 6000 genes
  • Published in 1996, took 7 years to finish
• Homo sapiens
  • The Human Genome Project
  • Genome consists of approximately 3.25 billion nucleotides and 21000 genes
  • Initiated in 1990, finished 13 years later
  • Joint effort by 200 research groups, cost estimated at $3 billion
  • First gapless human genome published in March 2022

7
Definition of bioinformatics

• Bioinformatics deals with massive amounts of nucleotide and amino acid sequence data.

8 Bioinformatics and functional genomics


Subfields of bioinformatics

9 Essential bioinformatics
Bioinformatics software: two cultures

10 Bioinformatics and functional genomics


Command line

11
Bioinformatics vs. computational biology

• Bioinformatics (computational molecular biology) is limited to sequence, structural, and functional analysis of genes, genomes (DNA) and their corresponding products (RNA, proteins)

• Computational biology encompasses all biological areas that involve computation, e.g., mathematical modeling of ecosystems or population dynamics, but do not necessarily involve biological macromolecules

12
Reproducible research in bioinformatics

• A workflow should be well documented in a lab notebook or an electronic lab notebook
• Information stored on a computer should be well organized
• Data should be made available to others, with some exceptions for sensitive data
• Metadata is important (e.g., the location from which a bacterium was isolated)
• Databases used in bioinformatics analysis should be documented; the version number and date of access should also be recorded
• Software should be documented
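
One possible way to record the database version and access date is a small machine-readable provenance file kept next to the analysis. The sketch below is only an illustration: the function names, the field names and the "ExampleDB" entry are hypothetical and are not part of any standard.

```python
import json
from datetime import date

def database_record(name, version, accessed, url=""):
    """Describe one database used in an analysis (field names are hypothetical)."""
    return {
        "name": name,
        "version": version,
        "accessed": accessed.isoformat(),  # date of access, e.g. "2025-02-19"
        "url": url,
    }

def provenance_json(records):
    """Serialize the list of database records as indented JSON text."""
    return json.dumps({"databases": records}, indent=2)

# Placeholder values, not a real database or release number.
record = database_record("ExampleDB", "release-0", date(2025, 2, 19))
print(provenance_json([record]))
```

The resulting text can be saved alongside the results so that the analysis can be repeated against the same database release.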

13
Biological databases

• Database: a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria

• Database: computer hardware and software for data management

• Entry: a record in the database, containing a number of fields that hold the actual data items (values)

14
Types of databases

• Flat file format: a long text file that contains many entries separated by a delimiter (|). Within each entry are a number of fields separated by tabs or commas (,); in effect, a single table for the entire database.
• To search a flat file for information, a computer has to read through the entire file → searching efficiency is improved by establishing a data structure (data management system)
• Two types of data management systems
  • Relational databases
  • Object-oriented databases
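
The difference between scanning a flat file and using a data structure can be sketched as follows. This is a minimal illustration, assuming entries separated by "|" and fields separated by commas as described above; the three entries and all function names are made up for the example.

```python
# Toy flat file: entries separated by "|", fields separated by ",".
# Fields are: identifier, organism, sequence (all values are made up).
FLAT_FILE = "seq1,Escherichia coli,ATGC|seq2,Bacillus subtilis,GGCA|seq3,Homo sapiens,TTAG"

def parse_entries(text, entry_sep="|", field_sep=","):
    """Split the flat file into entries, and each entry into its fields."""
    return [entry.split(field_sep) for entry in text.split(entry_sep) if entry]

def linear_search(entries, identifier):
    """Read through the entries one by one until the identifier is found."""
    for fields in entries:
        if fields[0] == identifier:
            return fields
    return None

def build_index(entries):
    """Build a lookup table from identifier to entry, so no full scan is needed."""
    return {fields[0]: fields for fields in entries}

entries = parse_entries(FLAT_FILE)
index = build_index(entries)
print(linear_search(entries, "seq2"))
print(index["seq3"])
```

The linear search may have to visit every entry, while the index finds an entry directly from its identifier; database management systems apply the same idea at a much larger scale.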

15
Relational databases
• Relational databases use a set of tables to organize data

[Figure: a relational table with the relation, entity, field, attribute and value labeled]

16
Relational databases
• Relational databases use a set of tables to organize data

• Relational databases can be created using Structured Query Language (SQL)
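
As a rough illustration of a set of linked tables defined with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table and column names are assumptions chosen for the example, not the schema of any real biological database.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables linked through organism_id (schema is hypothetical).
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    symbol      TEXT NOT NULL,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
);
""")

conn.executemany("INSERT INTO organism VALUES (?, ?)",
                 [(1, "Escherichia coli"), (2, "Saccharomyces cerevisiae")])
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                 [(1, "lacZ", 1), (2, "recA", 1), (3, "GAL4", 2)])

def genes_for(connection, organism_name):
    """Return the gene symbols stored for one organism, joining the two tables."""
    rows = connection.execute(
        "SELECT gene.symbol FROM gene "
        "JOIN organism ON gene.organism_id = organism.organism_id "
        "WHERE organism.name = ? ORDER BY gene.symbol",
        (organism_name,),
    ).fetchall()
    return [row[0] for row in rows]

print(genes_for(conn, "Escherichia coli"))
```

Each organism name is stored once and referenced by its key, which is what distinguishes this layout from a single flat table.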


17
Object-oriented database
• Object-oriented databases store data as objects that are linked by a set of pointers defining predetermined relationships between objects

• Object-oriented databases can be built with programming languages such as C++
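
The idea of objects linked by pointers can be sketched in a few lines. The example below uses Python rather than C++ to stay short, and the class names and attributes are hypothetical; it shows only the linking of objects, not a full database system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Organism:
    name: str

@dataclass
class Protein:
    name: str

@dataclass
class Gene:
    symbol: str
    organism: Organism                 # reference (pointer) to the source organism
    product: Optional[Protein] = None  # reference (pointer) to the encoded protein

# Objects are linked by references instead of by shared key columns.
ecoli = Organism("Escherichia coli")
beta_gal = Protein("beta-galactosidase")
lacz = Gene("lacZ", ecoli, beta_gal)

# Following the pointers from a gene to its organism and its product.
print(lacz.organism.name, lacz.product.name)
```

Retrieving related data means following references from one object to the next, rather than joining tables.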


18
Centralized databases

19 Bioinformatics and functional genomics


Genbank

20 Bioinformatics and functional genomics


Genbank

21 Bioinformatics and functional genomics


Biological databases

• Microorganisms and cell lines:
  • BacDive (DSMZ): https://bacdive.dsmz.de/
• 16S rRNA genes:
  • Ribosomal Database Project: https://rdp.cme.msu.edu/
  • Silva ribosomal RNA Database Project: https://www.arb-silva.de/
  • Greengenes Database: https://greengenes.secondgenome.com/
  • Earth Microbiome Project: http://www.earthmicrobiome.org/
• Protein:
  • UniProt: https://www.uniprot.org/
  • Protein Data Bank: https://www.rcsb.org

Types of data stored in databases

23 Bioinformatics and functional genomics


Information retrieval from biological databases

• NCBI developed and maintains Entrez, a biological database retrieval system that allows text-based searches for data (https://www.ncbi.nlm.nih.gov/search/):
  • Sequences
  • Structures
  • Taxonomy
  • Abstracts
  • Full papers
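
Text-based Entrez searches can also be sent from a script. The sketch below assumes the reader wants programmatic access through NCBI's E-utilities ESearch endpoint, which is not mentioned on the slide; the helper names are hypothetical, and only the URL is built here, without sending a request.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(db, term, retmax=5):
    """Build a text-search URL for one Entrez database (e.g. 'nucleotide')."""
    query = urlencode({"db": db, "term": term, "retmax": retmax, "retmode": "json"})
    return f"{EUTILS_ESEARCH}?{query}"

def run_esearch(db, term, retmax=5):
    """Send the search; needs network access and is not called in this sketch."""
    with urlopen(build_esearch_url(db, term, retmax)) as response:
        return response.read().decode("utf-8")

url = build_esearch_url("nucleotide", "Haemophilus influenzae[Organism]")
print(url)
```

The same search term typed into the Entrez web page should correspond to the `term` parameter here, with the `db` parameter selecting which database is searched.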

24
Information retrieval from biological databases

25 Bioinformatics and functional genomics


Refseq database

• Freely available, non-redundant, curated database of nucleotides, genomes and proteins that provides a single entry for each biological molecule for major organisms

26
Genbank sequence format

• Search output is a flat file that contains 3 sections: header, features and sequence entry
• Each field has a unique identifier for easy indexing by computer software

[Figure: a GenBank record with the Header, Feature and Sequence sections labeled]
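
The three sections can be separated with a few lines of code, because the features section starts at the FEATURES keyword, the sequence starts at ORIGIN and the record ends with "//". The sketch below is a simplified illustration: the record is made up and heavily shortened, and the function names are hypothetical.

```python
# Made-up, heavily shortened record in GenBank flat file layout.
RECORD = """LOCUS       TOY0001                 12 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Made-up example record.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     source          1..12
     CDS             1..12
ORIGIN
        1 atggcctaat ga
//
"""

def split_genbank(record):
    """Assign each line to the header, features or sequence section."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        elif line.startswith("//"):
            break  # end-of-record marker
        sections[current].append(line)
    return sections

def sequence_string(sequence_lines):
    """Join the sequence lines, dropping the ORIGIN line, position numbers and spaces."""
    letters = []
    for line in sequence_lines[1:]:
        letters.extend(part for part in line.split() if not part.isdigit())
    return "".join(letters).upper()

sections = split_genbank(RECORD)
print(sequence_string(sections["sequence"]))
```

Real records carry many more header fields and feature qualifiers, so an established parser is usually a safer choice for actual analyses.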
27
Genbank sequence format - Header
Genbank sequence format - Features

29
Genbank sequence format - Sequence

30
FASTA format
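
A FASTA file stores each sequence as a header line beginning with ">" followed by one or more lines of sequence. A minimal reading sketch is shown below; the two records are made up, and the function name is hypothetical.

```python
# Made-up FASTA text with two records.
EXAMPLE = """>toy_seq1 made-up example
ATGGCC
TAATGA
>toy_seq2 another made-up example
GGGTTT
"""

def parse_fasta(text):
    """Return a dict mapping each record identifier to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after ">".
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}

sequences = parse_fasta(EXAMPLE)
print(sequences)
```

Sequence lines belonging to the same record are joined, so line wrapping in the file does not affect the result.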
Uniprot database

32
Uniprot database

33
Protein data bank

34
