Nucleotide Databases

The document provides an overview of primary nucleotide sequence databases, including GenBank, EMBL, and DDBJ, with a focus on GenBank's history, structure, and usage for searching and retrieving DNA sequences. It details the format of GenBank entries, including accession numbers, organism information, and bibliographic references. Additionally, it outlines the various divisions within the GenBank database and the types of sequences it contains.

Nucleic acid primary sequence databases
Nucleotide Sequence Databases
Primary nucleotide sequence databases
• GenBank (National Center for Biotechnology Information (NCBI), USA)
• EMBL (European Molecular Biology Laboratory; maintained by EMBL-EBI, UK)
• DDBJ (DNA Data Bank of Japan, Japan)
GenBank
• GenBank is the NIH (National Institutes of Health) genetic sequence
database, located at the National Center for Biotechnology Information (NCBI),
National Library of Medicine, USA.
• Set up in 1979 at the Los Alamos National Laboratory, New Mexico,
United States.
• Maintained since 1992 by NCBI (Bethesda).
• GenBank consists of an annotated collection of all publicly
available DNA sequences.
• NCBI creates public databases, conducts research in computational
biology, develops software tools for analyzing genome data, and
disseminates biomedical information.
Searching and retrieving sequences from GenBank
• Access GenBank from www.ncbi.nlm.nih.gov/Genbank/
• Search using a GenBank ID or type a keyword
• Download the sequence in FASTA or GenBank format
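The same search-and-download steps can also be scripted. The sketch below only builds a download URL and makes no network request. It assumes NCBI's E-utilities `efetch` endpoint and the `nuccore` database name, neither of which appears on the slides; the function name `build_fetch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities efetch endpoint (not mentioned on the slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fetch_url(accession: str, fmt: str = "fasta") -> str:
    """Return an efetch URL for one nucleotide record.

    fmt is "fasta" for FASTA or "gb" for the GenBank flat-file format.
    """
    if fmt not in ("fasta", "gb"):
        raise ValueError("fmt must be 'fasta' or 'gb'")
    params = {
        "db": "nuccore",      # assumed name of the nucleotide database
        "id": accession,      # accession number used as the search key
        "rettype": fmt,       # requested record format
        "retmode": "text",    # plain-text response
    }
    return EFETCH_URL + "?" + urlencode(params)


# Example: URL for the apple AFS1 mRNA record used later in this deck.
example_url = build_fetch_url("AY182241", "gb")
```

Opening the resulting URL in a browser, or passing it to `urllib.request.urlopen`, should return the record as text, subject to NCBI's usage policies.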
Sources of GenBank

• The NCBI Nucleotide database, which includes GenBank, is a collection of
sequences from several sources, including:
1. GenBank
2. RefSeq
3. TPA
4. PDB
RefSeq Nomenclature

NC_####   Complete genome
NG_####   Incomplete genomic
NM_####   mRNA
NR_####   Noncoding transcripts
NP_####   Proteins
NT_####   Intermediate genomic contigs
What’s in an accession number?

DNA records:
NM_017442   toll-like receptor 9   RefSeq
BC032713    toll-like receptor 9   cDNA clone
NG_001066   toll-like receptor 7   chromosome X
AF172169    toll-like receptor 7   genomic gene
Protein records:
Q15399      toll-like receptor 1   Swiss-Prot
NP_067681   toll-like receptor 2   RefSeq
AAH33651    toll-like receptor 7   GenBank protein
1FYW        TIR domain of Tlr2     3D structure (PDB)
Nucleotide accession format: 2 letters + 6 numerals OR 1 letter + 5 numerals
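The naming rules above can be turned into a small format check. This is a minimal sketch that uses only the RefSeq prefixes and the nucleotide pattern stated on the slides; `describe_accession` is a hypothetical helper. A format check alone cannot tell databases apart: the Swiss-Prot entry Q15399 also has the shape "1 letter + 5 numerals".

```python
import re

# RefSeq prefixes and descriptions as listed in the RefSeq nomenclature slide.
REFSEQ_PREFIXES = {
    "NC_": "Complete genome",
    "NG_": "Incomplete genomic",
    "NM_": "mRNA",
    "NR_": "Noncoding transcripts",
    "NP_": "Proteins",
    "NT_": "Intermediate genomic contigs",
}

# 2 letters + 6 numerals OR 1 letter + 5 numerals.
NUCLEOTIDE_RE = re.compile(r"[A-Z]{2}[0-9]{6}|[A-Z][0-9]{5}")


def describe_accession(accession: str) -> str:
    """Classify an accession string by its format only."""
    # Normalize case and drop a version suffix such as ".2".
    acc = accession.strip().upper().split(".")[0]
    if acc[:3] in REFSEQ_PREFIXES:
        return "RefSeq: " + REFSEQ_PREFIXES[acc[:3]]
    if NUCLEOTIDE_RE.fullmatch(acc):
        return "nucleotide-style accession"
    return "other"


for example in ("NM_017442", "BC032713", "AF172169", "AAH33651", "1FYW"):
    print(example, "->", describe_accession(example))
```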
GenBank Database Divisions
Database Division
BCT Bacterial sequences
PRI Primate sequences
ROD Rodent sequences
MAM Other mammalian sequences
VRT Vertebrate sequences
INV Invertebrate sequences
PLN Plant and Fungal sequences
VRL Viral sequences
PHG Phage sequences
RNA Structural RNA sequences
SYN Synthetic and chimeric sequences
UNA Unannotated sequences
Nucleotide Homepage
Using limits and advanced search to restrict searches
GenBank Format
• Each GenBank entry includes a concise description of the
sequence, the scientific name and taxonomy of the source
organism, and a table of features that identifies coding
regions and other sites of biological significance, such as
transcription units, sites of mutations or modifications,
and repeats.
• Protein translations for coding regions are included in
the feature table.
• Bibliographic references are included along with a link to
the Medline unique identifier for all published sequences.
• Each sequence entry is composed of lines. Different
types of lines, each with their own format, are used to
record the various data that make up the entry.
A Traditional GenBank Record

Header:
LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004
DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
complete cds.
ACCESSION AY182241
VERSION AY182241.2 GI:32265057
KEYWORDS .
SOURCE Malus x domestica (cultivated apple)
ORGANISM Malus x domestica
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE 1 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Cloning and functional expression of an (E,E)-alpha-farnesene
synthase cDNA from peel tissue of apple fruit
JOURNAL Planta 219, 84-94 (2004)
REFERENCE 2 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Direct Submission
JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
20705, USA
REFERENCE 3 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Direct Submission
JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
20705, USA
REMARK Sequence update by submitter
COMMENT On Jun 26, 2003 this sequence version replaced gi:27804758.

Feature Table:
FEATURES Location/Qualifiers
source 1..1931
/organism="Malus x domestica"
/mol_type="mRNA"
/cultivar="'Law Rome'"
/db_xref="taxon:3750"
/tissue_type="peel"
gene 1..1931
/gene="AFS1"
CDS 54..1784
/gene="AFS1"
/note="terpene synthase"
/codon_start=1
/product="(E,E)-alpha-farnesene synthase"
/protein_id="AAO22848.2"
/db_xref="GI:32265058"
/translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
[...]
DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
LSLLFQPLVN"

Sequence:
ORIGIN
1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
[...]
//
The Header
LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004
DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
complete cds.
ACCESSION AY182241
VERSION AY182241.2 GI:32265057
KEYWORDS .
SOURCE Malus x domestica (cultivated apple)
ORGANISM Malus x domestica
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE 1 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Cloning and functional expression of an (E,E)-alpha-farnesene
synthase cDNA from peel tissue of apple fruit
JOURNAL Planta 219, 84-94 (2004)
REFERENCE 2 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Direct Submission
JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
20705, USA
REFERENCE 3 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Direct Submission
JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
20705, USA
REMARK Sequence update by submitter
COMMENT On Jun 26, 2003 this sequence version replaced gi:27804758.
Header: Locus Line
LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004

Locus name: AY182241
Length: 1931 bp
Molecule type: mRNA
Division: PLN
Modification date: 04-MAY-2004
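The fields of the LOCUS line can be pulled apart programmatically. This is a minimal sketch, assuming a whitespace-separated line with exactly the layout shown on the slide; real GenBank files use fixed column positions and some fields may be absent, so this is not a general parser. The names `Locus` and `parse_locus_line` are hypothetical, and the field the slide leaves unlabeled ("linear") is treated here as the molecule topology.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str               # locus name
    length_bp: int          # sequence length in base pairs
    molecule_type: str      # e.g. mRNA
    topology: str           # e.g. linear (unlabeled on the slide)
    division: str           # GenBank division code, e.g. PLN
    modification_date: str  # date of last modification


def parse_locus_line(line: str) -> Locus:
    """Split a LOCUS line like the slide's example into named fields."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    _, name, length, _, mol_type, topology, division, date = tokens
    return Locus(name, int(length), mol_type, topology, division, date)


locus = parse_locus_line("LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004")
print(locus)
```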
Header: Database Identifiers
ACCESSION AY182241
VERSION AY182241.2 GI:32265057

Accession: AY182241
Version: AY182241.2 (tracks changes in sequence)
GI number: GI:32265057 (NCBI internal use)
Header: Organism
SOURCE Malus x domestica (cultivated apple)
ORGANISM Malus x domestica
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.

The organism name and lineage follow the NCBI-controlled taxonomy.
The Feature Table
FEATURES Location/Qualifiers
source 1..1931
/organism="Malus x domestica"
/mol_type="mRNA"
/cultivar="'Law Rome'"
/db_xref="taxon:3750"
/tissue_type="peel"
gene 1..1931
/gene="AFS1"
CDS 54..1784
/gene="AFS1"
/note="terpene synthase"
/codon_start=1
/product="(E,E)-alpha-farnesene synthase"
/protein_id="AAO22848.2"
/db_xref="GI:32265058"
/translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
LSLLFQPLVN"

Coding sequence: CDS 54..1784, from the start codon (atg) at position 54 to the stop codon (tag) ending at position 1784
GenPept identifiers: /protein_id="AAO22848.2" and /db_xref="GI:32265058"
Implied protein: the /translation qualifier
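The CDS location determines how long the implied protein should be. The sketch below works through that arithmetic for the location `54..1784`. It assumes a simple `start..end` location on the forward strand (no joins, complements, or partial markers), and `cds_summary` is a hypothetical helper name.

```python
def cds_summary(location: str) -> dict:
    """Derive nucleotide and residue counts from a simple 'start..end' CDS location."""
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    length_nt = end - start + 1  # positions are 1-based and inclusive
    if length_nt % 3 != 0:
        raise ValueError("CDS length is not a whole number of codons")
    codons = length_nt // 3
    return {
        "start": start,
        "end": end,
        "length_nt": length_nt,
        "codons": codons,
        "residues": codons - 1,  # the final codon is the stop codon
    }


# 1731 nt = 577 codons = 576 amino-acid residues + 1 stop codon
summary = cds_summary("54..1784")
print(summary)
```

Comparing `summary["residues"]` with the length of the `/translation` string is a quick consistency check on a record.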
The Sequence

ORIGIN
1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
[...]
1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga
1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
1921 aaaaaaaaaa a
//
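The sequence block stores bases in groups of ten with a running position at the start of each line. A minimal sketch of turning those lines back into one string, and of checking that the CDS start at position 54 is an `atg` codon, might look like this. Only the first two sequence lines from the slide are used, and `parse_origin` is a hypothetical helper name.

```python
def parse_origin(lines):
    """Join ORIGIN sequence lines into one string, dropping position numbers."""
    chunks = []
    for line in lines:
        tokens = line.split()
        # Sequence lines start with a position number; skip ORIGIN, "//", blanks.
        if not tokens or not tokens[0].isdigit():
            continue
        chunks.extend(tokens[1:])
    return "".join(chunks)


first_rows = [
    "ORIGIN",
    "1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat",
    "61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg",
    "//",
]

seq = parse_origin(first_rows)

# GenBank positions are 1-based, Python indices are 0-based.
cds_start = 54
start_codon = seq[cds_start - 1 : cds_start + 2]
print(len(seq), start_codon)
```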
GenBank and GenPept Formats

[Figure: side-by-side comparison of a GenBank (DNA) record and the corresponding GenPept (protein) record; rest of nucleotide and protein sequences deleted for brevity]

Statistics of GenBank and WGS

Release  Date      GenBank bases  GenBank sequences  WGS bases      WGS sequences
208      Jun 2015  193921042946   185019352          1038937210221  258702138
209      Aug 2015  199823644287   187066846          1163275601001  302955543
210      Oct 2015  202237081559   188372017          1222635267498  309198943
212      Feb 2016  207018196067   190250235          1399865495608  333012760
213      Apr 2016  211423912047   193739511          1452207704949  338922537
214      Jun 2016  213200907819   194463572          1556175944648  350278081
215      Aug 2016  217971437647   196120831          1637224970324  359796497
221      Aug 2017  240343378258   203180606          2242294609510  499965722

From 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months.
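A doubling time can be estimated from any two rows of the statistics with N(t) = N0 * 2^(t / T). The sketch below applies that formula to the base counts for Release 208 (Jun 2015) and Release 221 (Aug 2017), taken to be 26 months apart. That window is short, so the value it implies need not match the long-run 18-month figure quoted for 1982 onward. The function names are hypothetical.

```python
import math


def doubling_time_months(bases_start: float, bases_end: float, months_elapsed: float) -> float:
    """Solve bases_end = bases_start * 2 ** (months_elapsed / T) for T."""
    return months_elapsed * math.log(2) / math.log(bases_end / bases_start)


def projected_bases(bases_start: float, months_elapsed: float, doubling_months: float = 18.0) -> float:
    """Project a base count forward under a fixed doubling time."""
    return bases_start * 2 ** (months_elapsed / doubling_months)


# Base counts from the Release 208 and Release 221 rows of the statistics table.
genbank_jun_2015 = 193921042946
genbank_aug_2017 = 240343378258
wgs_jun_2015 = 1038937210221
wgs_aug_2017 = 2242294609510
months = 26  # Jun 2015 to Aug 2017

genbank_only = doubling_time_months(genbank_jun_2015, genbank_aug_2017, months)
genbank_plus_wgs = doubling_time_months(
    genbank_jun_2015 + wgs_jun_2015,
    genbank_aug_2017 + wgs_aug_2017,
    months,
)
print(f"GenBank only: {genbank_only:.1f} months")
print(f"GenBank + WGS: {genbank_plus_wgs:.1f} months")
```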
EMBL Nucleotide Sequence Database
• The EMBL Nucleotide Sequence Database (also known as
EMBL-Bank) constitutes Europe's primary nucleotide
sequence resource.
• The main sources of DNA and RNA sequences are direct
submissions from individual researchers, genome
sequencing projects, and patent applications.
• Created in 1980 at the European Molecular Biology
Laboratory in Heidelberg, Germany.
• Maintained since 1994 by EMBL-EBI (European
Bioinformatics Institute), Hinxton, UK.
• Web server: http://www.ebi.ac.uk/embl
Searching and retrieving sequences from the EMBL Database
• European Nucleotide Archive (ENA) at EMBL-EBI: an open, supported
platform for the management, sharing, integration, archiving and
dissemination of public-domain sequence data.
• Access the EMBL database at http://www.ebi.ac.uk/ena
• Search using an EMBL ID or a sequence, or type a keyword
• Download the sequence in FASTA format
EMBL Format
• EMBL entries are structured so as to be usable
by human readers as well as by computer
programs.
• Each entry in the database is composed of lines.
• Different types of lines, each with its own format,
are used to record the various types of data that
make up the entry.
• Each entry begins with an identification line
(ID) and ends with a terminator line (//).
Identification Number (ID)
Accession Number (AC)
Organism Species (OS)
Organism Classification (OC)
Reference Number (RN)
Reference Location (RL)
Reference Authors (RA)
Reference Title (RT)
Feature table Header (FH)
Feature table data (FT)
Coding region
Sequence (SQ)
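Because every EMBL line begins with a two-letter code, an entry can be grouped by line type with very little code. The sketch below is illustrative only: the sample entry is hand-written in EMBL style from the AY182241 example used earlier, is heavily abbreviated, and was not copied from the database. The helper name `group_embl_lines` is hypothetical, and `XX` is assumed to be the spacer line code.

```python
from collections import defaultdict

# Hand-written, abbreviated illustration in EMBL style (not a real database dump).
SAMPLE_ENTRY = """\
ID   AY182241; SV 2; linear; mRNA; STD; PLN; 1931 BP.
XX
AC   AY182241;
XX
OS   Malus x domestica
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
XX
FH   Key             Location/Qualifiers
FT   CDS             54..1784
//
"""


def group_embl_lines(entry_text: str) -> dict:
    """Group the data of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):  # terminator line ends the entry
            break
        code, data = line[:2], line[2:].strip()
        if code == "XX" or not code.strip():  # skip spacer and code-less lines
            continue
        fields[code].append(data)
    return dict(fields)


fields = group_embl_lines(SAMPLE_ENTRY)
print(fields["ID"][0])
print(fields["OC"])
```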
DNA Data Bank of Japan
• DDBJ (DNA Data Bank of Japan) began DNA data bank activities
in earnest in 1986 at the National Institute of Genetics (NIG),
Mishima, Japan.
• DDBJ is the sole DNA data bank in Japan that is officially
certified to collect DNA sequences from researchers and to issue
internationally recognized accession numbers to data submitters.
• DDBJ collects data mainly from Japanese researchers, but it also
accepts data from, and issues accession numbers to, researchers
in any other country.
• DDBJ functions as one of the international nucleotide
sequence databases, in collaboration with EMBL and NCBI/GenBank.
• DDBJ can be reached at http://www.ddbj.nig.ac.jp
Searching and retrieving sequences from DDBJ
• Access DDBJ from http://www.ddbj.nig.ac.jp
• Search using a DDBJ ID or type a keyword
• Download the sequence in FASTA format
DDBJ Format
• A DDBJ entry contains the scientific name and
taxonomy of the source organism, along with the
coding regions, transcription units, and mutation
sites of the sequence.
• The DDBJ format is the same as the GenBank format.
[Table: DDBJ release statistics, with columns for DNA database, DDBJ release, date, entries, bases, and rate of increase]
The International Nucleotide Sequence Database Collaboration (INSDC)
• These three databases have collaborated since
1982.
• Each database collects and processes new
sequence data and relevant biological
information from scientists in its region, e.g.
EMBL collects from Europe and GenBank from the
USA.
• The databases automatically update each
other every 24 hours with the new sequences
collected from each region. As a result,
they contain exactly the same information,
except for any sequences that have been added
in the last 24 hours.
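[Figure: INSDC data exchange. GenBank (NCBI, NIH; accessed through Entrez), EMBL (EBI; accessed through SRS), and DDBJ (CIB, NIG; accessed through getentry) each receive submissions and updates and share them with the other two.]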