PlantGDB: A Resource for Comparative Plant Genomics
1Department of Genetics, Development and Cell Biology, Iowa State University, 2USDA-ARS Corn Insects
and Crop Genetics Research Unit, Ames, IA 50011, 3Department of Computer Science, University of
South Dakota, Vermillion, SD 57069, 4Department of Genetics, Development and Cell Biology and 5Department of
Statistics, Iowa State University, Ames, IA 50011, USA
Received September 12, 2007; Revised October 30, 2007; Accepted October 31, 2007
ABSTRACT
PlantGDB (http://www.plantgdb.org/) is a genomics
database encompassing sequence data for green
plants (Viridiplantae). PlantGDB provides annotated
transcript assemblies for >100 plant species, with
transcripts mapped to their cognate genomic context where available, integrated with a variety of
sequence analysis tools and web services. For 14
plant species with emerging or complete genome
sequence, PlantGDB's genome browsers (xGDB)
serve as a graphical interface for viewing, evaluating
and annotating transcript and protein alignments
to chromosome or bacterial artificial chromosome
(BAC)-based genome assemblies. Annotation is
facilitated by the integrated yrGATE module for
community curation of gene models. Novel web
services at PlantGDB include Tracembler, an iterative alignment tool that generates contigs from
GenBank trace file data, and BioExtract Server, a
web-based server for executing custom sequence
analysis workflows. PlantGDB also hosts a plant
genomics research outreach portal (PGROP) that
facilitates access to a large number of resources for
research and training.
INTRODUCTION
PlantGDB serves the plant research community by
providing access to plant sequence data as well as a
variety of sequence and genome analysis tools in a single
online resource [(1,2); Table 1]. This update outlines recent
developments at PlantGDB that have expanded its
usefulness as a tool for comparative genomics. Key
features include: expanded EST assemblies; new genome
browsers for a larger number of species; overnight
annotation of emerging genome sequences; and novel
*To whom correspondence should be addressed. Tel: +1 515-294-9884; Fax: +1 515-294-6755; Email: [email protected]
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/)
which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
In this table, sequence resources are divided into four categories: Uploaded Sequence, Assembled Sequence, Genome Browsers, and Other Tools. For
each resource in column 1, the species available, source/version, current sequence count, update frequency, tool/services, download options, and
alignment to genome are shown in adjacent cells. Web links for both external and internal data/tool sources are indicated with superscript numbers
and are listed at the end of the table under Web Resources. Sequence counts displayed here are as of 30 October 2007.
Figure 1. Database schema for PlantGDB, showing data sources, update frequency, computation and web services. PlantGDB is accessible at
http://www.plantgdb.org, and genome browsers are accessible at http://www.plantgdb.org/XxGDB, where Xx is the first letter of the genus and species (e.g.
AtGDB = Arabidopsis thaliana genome database).
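The browser naming convention in the Figure 1 legend can be sketched in a few lines of Python. This is only an illustration: the function names `xgdb_name` and `xgdb_url` are hypothetical, the trailing slash follows the ZmGDB address quoted elsewhere in this article, and the code only builds the address string without checking that a browser exists for the species.

```python
def xgdb_name(species: str) -> str:
    """Build an xGDB browser name (e.g. 'AtGDB') from a binomial species name.

    Follows the convention in the Figure 1 legend: first letter of the genus,
    first letter of the species epithet, then 'GDB'.
    """
    parts = species.split()
    if len(parts) < 2:
        raise ValueError("expected a binomial name such as 'Zea mays'")
    genus, epithet = parts[0], parts[1]
    return genus[0].upper() + epithet[0].lower() + "GDB"


def xgdb_url(species: str, base: str = "http://www.plantgdb.org") -> str:
    """Return the browser address implied by the naming convention.

    The address is constructed only; nothing is fetched or validated.
    """
    return f"{base}/{xgdb_name(species)}/"
```

For example, `xgdb_name("Arabidopsis thaliana")` yields `AtGDB`, matching the example given in the legend.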
Figure 2. Transcript assemblies (PUTs) at PlantGDB, grouped by taxonomic affiliation. Sequence totals displayed here are as of 30 October 2007.
Parentheses indicate the number of species/subspecies per genus. Genera highlighted in yellow are associated with a genome browser at PlantGDB;
an underscore indicates chromosome-based genome browsers. An asterisk designates genera for which PlantGDB provides preprocessed GeneSeqer
indices for quick access to spliced alignments.
Community annotation
Although excellent tools are available for defining genic
regions and variant transcript forms from evidence-based
data as well as ab initio prediction, models can often
be improved further by human-curated annotation.
PlantGDB's yrGATE (http://www.plantgdb.org/prj/yrGATE/)
is a recently developed tool for community
annotation of gene models that is integrated with
PlantGDB's xGDB genome browsers (17). From a single
browser window the user can rapidly evaluate a selected
region for intron/exon structures based on any combination of EST/cDNA evidence and ab initio prediction,
compare the model with known proteins via GenBank
BLAST, and submit the annotation for review and
publication on the genome browser. To assist in the
identification of gene models in need of annotation, the
Genome Annotation Evaluation (GAEVAL) module
generates quality scores for gene structure predictions
and classifies cases of incongruence of the annotation
with experimental evidence (http://www.plantgdb.org/AtGDB-chtml/gaeval/).
The yrGATE tool is available
for both BAC and chromosome-based xGDB browsers
and is being used to communicate evidence-based gene
models to the A. thaliana genome database, TAIR (18).
Figure 3 shows an example of how yrGATE can be used,
together with xGDB's annotation tables and genome
browser, to identify and annotate potential splicing
variants for a gene of interest in maize.
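As a rough illustration of the kind of incongruence check attributed to GAEVAL above, the sketch below labels each intron of an annotated gene model as confirmed or unsupported by spliced transcript alignments. It is not GAEVAL's actual scoring scheme, which is not described here. The function names, the exact-match criterion and the simple fraction used as a score are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


def introns_from_exons(exons: Iterable[Interval]) -> List[Interval]:
    """Derive intron intervals from the gaps between consecutive exons."""
    ordered = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def classify_introns(annotated_exons: Iterable[Interval],
                     evidence_alignments: Iterable[Iterable[Interval]]
                     ) -> Dict[Interval, str]:
    """Label each annotated intron by whether any alignment has the same intron.

    Assumption: an intron counts as confirmed only if both splice sites match
    an evidence intron exactly.
    """
    evidence: Set[Interval] = set()
    for exons in evidence_alignments:
        evidence.update(introns_from_exons(exons))
    return {intron: ("confirmed" if intron in evidence else "unsupported")
            for intron in introns_from_exons(annotated_exons)}


def support_score(labels: Dict[Interval, str]) -> float:
    """Fraction of annotated introns that are confirmed (0.0 if there are none)."""
    if not labels:
        return 0.0
    confirmed = sum(1 for label in labels.values() if label == "confirmed")
    return confirmed / len(labels)
```

A model whose introns are all labeled unsupported would be a natural candidate for manual review in an annotation tool such as yrGATE.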
Pipelines for genome annotation
Genome browsers at PlantGDB are refreshed on a
timetable that depends on the pace of accumulation of
new genomic or transcript sequence data or assemblies
for the respective species. New spliced alignments are
calculated for ESTs, cDNAs and PUTs as well as for
other sequence types (where available) and data are
uploaded. To match the rapid pace with which some
genomes are being sequenced, PlantGDB staff have
developed and implemented an automated genome data
pipeline for species with rapidly expanding sequence
data, using Zea mays as an initial example. In 2007, new
maize BAC sequences began to be deposited in GenBank
at the rate of over 60 BACs or 10 Mb of sequence
per day (http://www.maizesequence.org). In addition,
there is a growing catalog of transposable element-tagged maize genomic sequence in GenBank, facilitating
reverse genetics in maize (http://www.plantgdb.org/prj/AcDsTagging/)
(19) as well as a large repository of EST
sequence-derived PUTs. PlantGDB's daily Z. mays pipeline downloads and processes all new maize BACs with
transcript, protein, microarray probe, transposon insertion tag and other genomic alignments, and displays the
cumulative output for all BACs in ZmGDB (the xGDB
browser for Z. mays; http://www.plantgdb.org/ZmGDB/)
within 12 h. The pipeline also updates BLAST and
sequence download resources daily. Significantly,
the pipeline also generates a browsable, searchable,
tabular output of rice gene models and putatively
transposon-tagged genes for the entire BAC data set
(http://www.plantgdb.org/ZmGDB/DisplayGeneAnn.php),
Figure 3. Screenshots from ZmGDB and yrGATE illustrate the use of online tools for gene discovery and community gene annotation. (A) A web-accessible table of Z. mays BACs (alternately shaded) displaying (left to right) the BAC GI, BAC clone name, followed by the ID, start/end
coordinates and functional annotation of splice-aligned TIGR-predicted proteins from O. sativa and finally the ZmGDB entry date. All fields are
searchable and each row is linked via column 1 to a genome browser view of the BAC region. This table is currently updated daily at ZmGDB
(http://www.plantgdb.org/ZmGDB/DisplayGeneAnn.php). (Similar tables are available for eight other BAC-based xGDB browsers.) Note that a
region of BAC GI 156523432 is aligned to three paralogous rice predicted polypeptides, annotated as autophagy-related protein 8 precursor.
Clicking on the BAC GI 156523432 in table column 1 (circled) brings up a BAC/Clone Context View of the specified region (B), showing spliced
alignments to the rice predicted polypeptides (black), along with other alignment data, in this case maize cDNAs (blue) and maize ESTs (red). Note
the evidence for alternative splicing among the maize ESTs (circles) suggesting at least two alternate transcripts (labeled 1 and 2). The user has the
option to explore and annotate this variation using yrGATE. (C) Launching the yrGATE annotation tool displays a scrolling list of evidence scores
and supporting exons for all exon coordinates at a locus (alternative splice coordinates for 1 and 2 are circled). The user can build a complete gene
model on screen by selecting each desired exon and then compare the resulting open reading frame to known proteins using BLAST (data not
shown). (D) The chosen gene model is displayed graphically and will be published on the ZmGDB browser following curation by PlantGDB staff.
Shown here are yrGATE models for the two putative splice variants, with translation start/stop positions indicated by triangles. (E) Predicted protein
sequence for the two yrGATE gene models. This example illustrates how xGDB and yrGATE can be used to identify and publish gene model
predictions quickly and easily, enhancing the community genome knowledge base for maize as well as facilitating hypothesis-driven research.
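The model-building step described in panels (C) to (E) of the Figure 3 legend, selecting exons and then examining the resulting open reading frame, can be sketched as follows. This is a minimal illustration, not yrGATE code. The function names are hypothetical, only the forward strand is considered, and the standard genetic code is assumed.

```python
from itertools import product
from typing import Iterable, Optional, Tuple

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def splice(genomic: str, exons: Iterable[Tuple[int, int]]) -> str:
    """Join the selected exons (1-based inclusive coordinates) into a transcript."""
    return "".join(genomic[start - 1:end] for start, end in sorted(exons)).upper()


def translate_from(mrna: str, start: int) -> Optional[str]:
    """Translate from `start` to the first in-frame stop; None if no stop is found."""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "X")  # 'X' for ambiguous codons
        if aa == "*":
            return "".join(protein)
        protein.append(aa)
    return None


def longest_orf(mrna: str) -> str:
    """Return the longest ATG-to-stop translation on the forward strand."""
    best = ""
    start = mrna.find("ATG")
    while start != -1:
        protein = translate_from(mrna, start)
        if protein is not None and len(protein) > len(best):
            best = protein
        start = mrna.find("ATG", start + 1)
    return best
```

Calling `longest_orf(splice(genomic, exons))` with two different exon selections at the same locus gives the predicted proteins for two candidate splice variants, which can then be compared with known proteins by BLAST as the legend describes.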
Outreach
Future directions
available, xGDB browsers will be expanded, with additional annotations contemplated for certain species
[e.g. tracks for transcription factor binding sites and
conserved non-coding sequences (21)]. Also planned are
additional comparative genomics tools such as SynBrowse
(22), expanded DAS import and export, and the development of qualitative (e.g. quality score) and quantitative
(e.g. library) filters for spliced alignments. Expanded help
and tutorial sections are also under development.
CONCLUSIONS
PlantGDB has expanded greatly in scope since 2004,
providing today a wide range of data sets, query methods
and analysis tools for researchers interested in comparative plant genomics or gene discovery research. The site
aims to complement other, more specialized plant genome
sites by providing comprehensive plant sequence data as
well as a suite of tools and genome browsers that
emphasize spliced alignment of cognate and non-cognate
transcripts and similar protein sequences. PlantGDB also
addresses the need for timely access to, and processing of,
high-volume informatics data through use of automated
daily data pipelines (e.g. maize BAC pipeline) and online
workflow tools (e.g. BioExtract Server and Tracembler).
With the yrGATE community annotation tool, PlantGDB
facilitates the sharing of user-generated gene annotation
information across the entire plant research community.
ACKNOWLEDGEMENTS
The authors would like to thank Dr Q. Dong, Dr S.D.
Schlueter and Dr B.-B. Wang for their many contributions
to PlantGDB over the years, M. Brekke for system
support, and the many PlantGDB users who have helped
us to improve this resource by providing feedback and
suggestions. This work is supported in part by a grant from
the National Science Foundation Plant Genome Research
Program to V.B., C.J.L. and C.L. (DBI-0606909). Funding
to pay the Open Access publication charges for this article
was provided by the cited NSF grant.
Conflict of interest statement. None declared.
REFERENCES
1. Dong,Q., Schlueter,S.D. and Brendel,V. (2004) PlantGDB, plant
genome database and analysis tools. Nucleic Acids Res., 32,
D354–D359.
2. Dong,Q., Lawrence,C.J., Schlueter,S.D., Wilkerson,M.D., Kurtz,S.,
Lushbough,C. and Brendel,V. (2005) Comparative plant genomics
resources at PlantGDB. Plant Physiol., 139, 610–618.
3. Benson,D.A., Karsch-Mizrachi,I., Lipman,J. and Wheeler,D.L.
(2006) GenBank. Nucleic Acids Res., 35, D21–D25.