EST
© The Author 2006. Published by Oxford University Press. For Permissions, please email:
[email protected]
Errors Associated with EST generation
• A typical EST sequence is only a very short
copy of the mRNA itself and is highly error
prone, especially at the ends. Vector and repeat
sequences, found at the ends or, rarely, in the
middle, are excised during EST pre-processing.
Because ESTs are sequenced only once, they are
susceptible to errors. Generally, the quality of
base reads in an individual EST sequence is
initially poor (up to 20%, or ∼50–100 bp, of the
read), gradually improves and then diminishes
once again towards the end. The overall sequence
quality is usually significantly better in the
middle (the ‘highly informative length,’
Figure 1B).
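The quality profile described above (poor ends, informative middle) can be illustrated with a small trimming sketch. It assumes per-base Phred-style quality scores are available for each read; the function name, the threshold of 20 and the window of 10 bases are illustrative choices, not values taken from the source.

```python
def trim_low_quality_ends(quals, min_q=20, window=10):
    """Return (start, end) of the high-quality middle of a read.

    quals  : list of per-base quality scores (assumed Phred-like)
    min_q  : hypothetical quality threshold
    window : hypothetical sliding-window size
    The slice read[start:end] is the retained region; (0, 0) means
    no window reached the threshold.
    """
    n = len(quals)
    if n < window:
        return (0, 0)
    # Start positions of all windows whose mean quality is acceptable.
    good = [i for i in range(n - window + 1)
            if sum(quals[i:i + window]) / window >= min_q]
    if not good:
        return (0, 0)
    start, end = good[0], good[-1] + window
    # Tighten both boundaries to the first/last base meeting the threshold.
    while quals[start] < min_q:
        start += 1
    while quals[end - 1] < min_q:
        end -= 1
    return (start, end)


# Toy read: 20 poor bases, 60 good bases, 20 poor bases.
toy_quals = [5] * 20 + [40] * 60 + [5] * 20
start, end = trim_low_quality_ends(toy_quals)
```

Production pipelines typically combine this kind of trimming with vector and repeat masking; the sketch covers only the quality step.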
EST and Untranslated Regions (UTRs)
• The 5′ and 3′ UTRs of eukaryotic mRNA have been
experimentally shown to contain sequence elements
essential for gene regulation, expression and translation.
In this context, EST data has proven to be important for
mining UTRs as both 5′ and 3′ ESTs contain significant
sections of the UTRs along with protein coding regions.
The CORG (COmparative Regulatory Genomics)
resource supports promoter analysis using assembled
ESTs, while more than half of the Eukaryotic Promoter
Database entries are based on 5′ EST sequences. Mach
has developed the PRESTA (PRomoter EST
Association) algorithm for promoter verification and
identification of the first exon, by mapping EST 5′ ends.
• The COmparative Regulatory Genomics
(CORG) database and annotation project aims
at providing insights into gene regulation at the
level of transcription. With several genomes of
higher eukaryotes now at hand, we are able to
study sequence elements on a comparative
basis. Comparative sequence analysis has
become a powerful tool for a variety of
problems, ranging from gene finding to the
identification of regulatory elements.
• The CORG project systematically applies
comparative sequence analysis methods to
non-coding, genomic DNA. The working
hypothesis underlying the CORG project is
that local sequence conservation points to
functional importance. The CORG project is a
resource for the genome-wide annotation of
conserved sequence elements in non-coding
genomic DNA. We will subsequently call
these elements ‘conserved non-coding blocks’
(CNBs).
PRESTA
• Large sets of well-characterized promoter
sequences are required to facilitate the
understanding of promoter architecture. The
major sequence databases are a prospective
source of upstream regulatory regions, but
suffer from inaccurate annotation. The
software tool PRESTA (PRomoter EST
Association) presented in this study is
designed for efficient recovery of
characterized and partially verified promoters
from GenBank and EMBL libraries.
• The PRESTA algorithm examines the
putative GenBank/EMBL promoters and
automatically removes most of the poorly
annotated entries. The remaining records
are connected to expressed sequence
tags (ESTs) through a high-stringency
BLAST search.
• The frequency and source of recovered
ESTs provide an estimate of the activity
and expression pattern of the promoter,
and the ESTs' 5' ends assist in
transcription start-site verification. The
PRESTA database provides easy access
to non-redundant upstream regulatory
regions recently extracted by the PRESTA
algorithm.
• The current size of this resource is 552
human and 241 mouse promoters.
Surprisingly, no overlap between the
PRESTA database and the Eukaryotic
Promoter Database (EPD) was detected
by sequence comparison.
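The idea that EST 5′ ends assist in transcription start-site verification can be sketched as a simple vote over mapped 5′ positions. This is an illustrative sketch only and not the PRESTA algorithm: the function name and the input format (a list of genomic coordinates of EST 5′ ends on the promoter's strand) are assumptions.

```python
from collections import Counter


def most_supported_5prime_end(est_5prime_positions):
    """Return (position, support) for the most frequent EST 5' end.

    est_5prime_positions : hypothetical list of genomic coordinates at
                           which the 5' ends of matching ESTs map
    support              : fraction of ESTs that start at that position
    Ties are resolved in favour of the smallest coordinate.
    """
    if not est_5prime_positions:
        return (None, 0.0)
    counts = Counter(est_5prime_positions)
    # Highest count first; among equal counts, the smallest coordinate.
    position, hits = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return (position, hits / len(est_5prime_positions))


# Toy example: three of five ESTs begin at coordinate 1000.
tss, support = most_supported_5prime_end([1000, 1000, 1002, 1000, 1050])
```

The number of contributing ESTs and their source libraries could be tallied in the same way to approximate the activity and expression-pattern estimate mentioned above.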
EST contigs
• Because of the way ESTs are sequenced,
many distinct expressed sequence tags
are often partial sequences that
correspond to the same mRNA of an
organism. In an effort to reduce the
number of expressed sequence tags for
downstream gene discovery analyses,
several groups assembled expressed
sequence tags into EST contigs.
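A minimal sketch of how two partial ESTs from the same mRNA could be joined into a contig is given below. It assumes error-free reads and exact suffix/prefix overlap, which real assemblers do not require; the function name and the minimum overlap of 5 bases are hypothetical.

```python
def merge_if_overlapping(left, right, min_overlap=5):
    """Merge two reads if a suffix of `left` equals a prefix of `right`.

    Returns the merged contig, or None if no exact overlap of at least
    `min_overlap` bases exists. The longest overlap is preferred.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


# Toy example: the reads share the 6-base overlap TACGTT.
contig = merge_if_overlapping("ATGGCGTACGTT", "TACGTTGACCA")
```

Because EST reads are error prone, practical tools replace the exact comparison with an alignment that tolerates mismatches and gaps.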
Databases for ESTs
• Diatom EST database http://avesthagen.sznbowler.com
• ESTree http://www.itb.cnr.it/estree/
• Fungal genomics project
https://fungalgenomics.concordia.ca/home/index.php
• Honey bee brain EST project
http://titan.biotec.uiuc.edu/bee/honeybee_project.htm
• Nematode ESTs at the Sanger Institute
ftp://ftp.sanger.ac.uk/pub/pathogens/nem_ests/
• NEMBASE, parasitic nematode ESTs http://www.nematodes.org
• Parasitic and free-living nematode EST resource
http://www.nematode.net/
EST sequence analysis
• An individual raw EST carries negligible biological
information. Analysis using different
combinations of computational tools augments
this weak signal, and when a multitude of ESTs
are analysed, the results enable the
reconstruction of the transcriptome of that organism.
While diverse research groups have used
different combinations of tools for extraction of
data from specific databases followed by
analyses [32–37], a generic protocol of the
different steps in the analysis of EST data sets is
shown in Figure 2.
Figure 2. Generic steps involved in EST analysis.
EST clustering and assembly
• The purpose of EST clustering is to collect overlapping
ESTs from the same transcript of a single gene into a
unique cluster to reduce redundancy. EST data are
fragmented; they can be consolidated and indexed using
gene sequence information, such that all the expressed
data arising from a single gene are grouped into a single
index class, and each index class contains information
for only that particular gene. A simple way to cluster
ESTs is to measure the pair-wise sequence similarity
between them. These similarity scores are then
converted into binary values, depending on whether
there is a significant match or not, so that a sequence
pair can be accepted into or rejected from the cluster
being assembled.
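The pair-wise approach described above can be sketched as single-linkage clustering: score every pair, reduce each score to a yes/no decision, and join accepted pairs. The shared k-mer score used here is a stand-in for a real alignment score, and the function names, k and threshold values are all assumptions made for illustration.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k):
    """Fraction of the smaller k-mer set shared by both sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def cluster_ests(ests, threshold=0.3, k=8):
    """Group ESTs by single linkage; returns lists of EST indices."""
    parent = list(range(len(ests)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(ests)), 2):
        # Binary decision: is the match significant or not?
        if similarity(ests[i], ests[j], k) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy example: the first two reads overlap, the third is unrelated.
toy_ests = ["ATGGCGTACGTTGACC", "GTACGTTGACCATTGA", "CCCCGGGGAAAATTTT"]
toy_clusters = cluster_ests(toy_ests, threshold=0.3, k=4)
```

Comparing all pairs scales quadratically with the number of ESTs, so this is suitable only for small illustrative data sets.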
Programs for EST sequence assembly
Name and website:
• CAP3 http://genome.cs.mtu.edu/cap/cap3.html
• CLOBB http://zeldia.cap.ed.ac.uk/CLOBB/
• CLU http://compbio.pbrc.edu/pti
• ESTate http://www.ebi.ac.uk/~guy/estate/
• ESTs aSSEmbly using Malig
http://alggen.lsi.upc.es/recerca/essem/frame-essem.html
• megaBLAST ftp://ftp.ncbi.nih.gov/blast/
• miraEST http://www.chevreux.org/projects_mira.html
• Paracel Transcript Assembler http://www.paracel.com/
• Phrap http://www.phrap.org/
• stackPACK http://www.sanbi.ac.za/Dbases.html#stackpack
• Xsact and Xtract http://www.ii.uib.no/~ketil/bioinformatics/
Database similarity searches
• Once consensus sequences (putative genes) are
obtained from assembled ESTs, possible functions can
be assigned through downstream annotation, achieved
via database similarity searches, employing familiar
freely available tools and databases.
• Different flavours of the BLAST programs from NCBI serve
as universal tools for database similarity searches.
BLASTN can be used to search ESTs against nucleotide
sequence databases and BLASTX to search against
protein databases. BLASTX translates a consensus EST
sequence (query) in all six reading frames and compares
the resulting protein products with protein databases.
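The six-frame translation that BLASTX applies to a nucleotide query can be sketched as follows. This shows only the translation step under the standard genetic code and is not BLASTX itself; the function names and the frame labels are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return the six conceptual translations keyed by frame label."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames


# Toy consensus sequence: ATG GCC TAA on the forward strand.
toy_frames = six_frame_translation("ATGGCCTAA")
```

Each of the six protein strings would then be compared against a protein database; that search step is outside the scope of this sketch.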
Programs for EST alignment to genomic DNA
• BLAT http://genome.ucsc.edu/cgi-bin/hgBlat
• Est2genome http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html
• GMAP http://www.gene.com/share/gmap/
• MGAlign http://origin.bic.nus.edu.sg/mgalign
• SSAHA http://www.sanger.ac.uk/Software/analysis/SSAHA/
• Sim4 http://globin.cse.psu.edu/html/docs/sim4.html
• Splign http://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi
Applications of ESTs
• ESTs were first used to construct maps of
the human genome
• Assessment of gene coverage from
EST sequencing
• Mapping of gene-based site markers
• Gene structure prediction using EST
databases
• Investigation of alternative splicing
• Discrimination between genes exhibiting
tissue- or disease-specific expression
• Discovery and characterization of
candidate SNPs
• Identification and analysis of coexpressed
genes on a large scale, using EST-based
gene expression protocols