Expressed Sequence Tag

Dr. Sujoy Ghosh


7/07/2011
Definition
• ESTs are short (200–500 nucleotides) DNA
sequences that can be used to identify a gene
that is being expressed in a cell at a particular
time.
• They may be used to identify gene transcripts,
and are instrumental in gene discovery and
gene sequence determination. The identification
of ESTs has proceeded rapidly, with
approximately 65.9 million ESTs now available
in public databases.
• Whole genome sequencing is currently
impractical and expensive for organisms
with large genome sizes. Such an
approach is unlikely to be applied
extensively, irrespective of the significance
of such genome data in human and animal
health, agriculture, ecology and evolution.
In addition, genome expansion, as a result
of retrotransposon repeats, makes whole
genome sequencing less attractive for
plants such as maize [6].
• In this scenario, EST data sets have been
utilized to complement genome sequencing or
as an alternative to the genome sequencing of
many organisms, earning the label ‘the poor
man's genome’. It must be noted that ESTs are
subject to sampling bias, resulting in
under-representation of rare transcripts, and often
account for only 60% of an organism's genes.
However, ESTs in combination with reduced
representation sequencing strategies, such as
methylation filtration and high Cot selection,
have enabled the successful examination of the
gene pool in plants like maize.
• Expressed Sequence Tags are generated from cDNA cloned from
mRNA of a particular species (refer to Figure 1). As the cDNA
used is complementary to mRNA, the ESTs represent portions of
expressed genes.
• ESTs can be generated by the following steps:
• Transcription of genomic DNA: Genomic DNA is first transcribed
to generate nascent mRNA, which is then spliced to produce
mature mRNA.
• Reverse transcription of mRNA: mRNA can be isolated directly
from the species using different kits (e.g. RNAgent
Promega). The isolated mRNA is reverse transcribed to
form a cDNA library.
• Generation of ESTs: From the cDNA library, 5' or 3' ESTs are
generated by cDNA end sequencing. A 5' EST comes from the region
of the transcript that codes for protein, whereas the end portion of
the cDNA gives the 3' EST.
• Assembly and organization of ESTs: The ESTs can
then be assembled separately into multimember sequence assemblies,
bridged sequence assemblies and small clusters on the basis of the size
of the ESTs.
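The end-sequencing step can be illustrated with a minimal Python sketch. It assumes the cDNA insert is available as a plain sense-strand string; the function names, the toy sequence and the read length are hypothetical and chosen only for illustration. The 3' read is taken from the opposite strand, so it appears as the reverse complement of the cDNA tail.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def end_sequence(cdna: str, read_length: int = 300) -> tuple[str, str]:
    """Simulate single-pass sequencing of both ends of a cDNA clone.

    The 5' EST is read from the start of the sense strand. The 3' EST is
    read from the other end, so it is the reverse complement of the tail.
    """
    five_prime = cdna[:read_length].upper()
    three_prime = reverse_complement(cdna[-read_length:])
    return five_prime, three_prime


# Toy cDNA (assumed): a short coding region followed by a poly(A) tail.
cdna = "ATGGCCAAGTTGACCTGA" + "A" * 6
est5, est3 = end_sequence(cdna, read_length=8)
print(est5)  # ATGGCCAA
print(est3)  # TTTTTTTC
```

In this toy case the 3' EST begins with a poly(T) stretch, which mirrors the poly(A) tail of the transcript.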
EST GENERATION
Alternate strategy
• Simpson and co-workers [2005] have developed a novel
cost-effective method for generating high-throughput
ESTs called ORESTES (open reading frame expressed
sequence tags). This method differs from conventional
EST generation by providing sequence data from the
central protein coding region, and thus the most
informative and desired portion, of transcripts.
ORESTES representing highly, moderately and rarely
expressed transcripts have been derived from several
species, with more than a million human sequences and
thousands from other species, such as cow and honey
bee, deposited in the Expressed Sequence Tags
database, dbEST.
Characteristics of EST sequences.

Nagaraj S H et al. Brief Bioinform 2007;8:6-21

© The Author 2006. Published by Oxford University Press.
Errors Associated with EST generation
• A typical EST sequence is only a very short
copy of the mRNA itself and is highly error
prone, especially at the ends. As ESTs are
sequenced only once, they are susceptible to
errors. Generally, the quality of base reads in
individual EST sequences is initially poor (up to
20% or ∼50–100 bp), gradually improves and
then diminishes once again towards the end.
The overall sequence quality is usually
significantly better in the middle (‘highly
informative length’, Figure 1B). Vector and
repeat sequences, either at the ends or rarely
in the middle, are excised during EST
pre-processing.
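Trimming the poor-quality ends during pre-processing can be sketched as follows. This is only an illustration under stated assumptions: per-base quality scores (for example Phred-style values) are assumed to be available, and the threshold `min_q`, the window size and the function name are hypothetical, not taken from any particular pre-processing tool. The sketch moves inward from each end until a window of bases has an acceptable mean quality.

```python
def trim_low_quality_ends(seq, quals, min_q=20, window=5):
    """Trim both ends until a window of `window` bases has mean quality >= min_q.

    Returns the trimmed sequence and its quality values, or ("", []) when
    no window in the read passes the threshold.
    """
    n = len(seq)
    if n < window:
        return "", []

    def window_ok(i):
        return sum(quals[i:i + window]) / window >= min_q

    # Move the start forward past the noisy beginning of the read.
    start = 0
    while start <= n - window and not window_ok(start):
        start += 1
    if start > n - window:
        return "", []

    # Move the end backward past the noisy tail of the read.
    end = n
    while end - window >= start and not window_ok(end - window):
        end -= 1
    return seq[start:end], quals[start:end]


# Toy read (assumed values): low quality at both ends, good in the middle.
seq = "GATTACAGATTACA"
quals = [5, 8, 10, 30, 32, 35, 38, 40, 36, 33, 30, 12, 9, 6]
trimmed, trimmed_quals = trim_low_quality_ends(seq, quals, min_q=20, window=3)
print(trimmed)  # TTACAGATTA
```

Because the decision is made on a window average, a single weak base next to good bases can survive at the boundary; a stricter per-base cut-off would be an alternative design.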
EST and Untranslated Regions (UTRs)
• The 5′ and 3′ UTRs of eukaryotic mRNA have been
experimentally shown to contain sequence elements
essential for gene regulation, expression and translation.
In this context, EST data has proven to be important for
mining UTRs as both 5′ and 3′ ESTs contain significant
sections of the UTRs along with protein coding regions.
The CORG (COmparative Regulatory Genomics)
resource supports promoter analysis using assembled
ESTs, while more than half of the Eukaryotic Promoter
Database entries are based on 5′ EST sequences. Mach
has developed the PRESTA (PRomoter EST
Association) algorithm for promoter verification and
identification of the first exon, by mapping EST 5′ ends.
• The COmparative Regulatory Genomics
(CORG) database and annotation project aims
at providing insights into gene regulation at the
level of transcription. Now that several
genomes of higher eukaryotes are available,
sequence elements can be studied on a
comparative basis. Comparative sequence
analysis has become a powerful tool for
a variety of problems, ranging from gene
finding to the identification of regulatory
elements.
• The CORG project systematically applies
comparative sequence analysis methods to
non-coding genomic DNA. The working
hypothesis underlying the CORG project is
that local sequence conservation points to
functional importance. The CORG project is a
resource for the genome-wide annotation of
conserved sequence elements in non-coding
genomic DNA. These elements are called
‘conserved non-coding blocks’
(CNBs).
PRESTA
• Large sets of well-characterized promoter
sequences are required to facilitate the
understanding of promoter architecture. The
major sequence databases are a prospective
source of upstream regulatory regions, but
suffer from inaccurate annotation. The
software tool PRESTA (PRomoter EST
Association) is
designed for efficient recovery of
characterized and partially verified promoters
from GenBank and EMBL libraries.
• The PRESTA algorithm examines the
putative GenBank/EMBL promoters and
automatically removes most of the poorly
annotated entries. The remaining records
are connected to expressed sequence
tags (ESTs) through a high-stringency
BLAST search.
• The frequency and source of recovered
ESTs provide an estimate of the activity
and expression pattern of the promoter,
and the ESTs' 5' ends assist in
transcription start-site verification. The
PRESTA database provides easy access
to non-redundant upstream regulatory
regions recently extracted by the PRESTA
algorithm.
• The current size of this resource is 552
human and 241 mouse promoters.
Surprisingly, no overlap between the
PRESTA database and the Eukaryotic
Promoter Database (EPD) was detected
by sequence comparison.
EST contigs
• Because of the way ESTs are sequenced,
many distinct expressed sequence tags
are often partial sequences that
correspond to the same mRNA of an
organism. In an effort to reduce the
number of expressed sequence tags for
downstream gene discovery analyses,
several groups assembled expressed
sequence tags into EST contigs.
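The idea of joining partial ESTs from the same mRNA into a contig can be sketched with an exact suffix-prefix overlap merge. This is a deliberately simplified illustration: real assemblers tolerate sequencing errors and handle both strands, whereas this sketch assumes error-free reads on the same strand. The function names, the toy sequences and the `min_overlap` value are hypothetical.

```python
def longest_overlap(left: str, right: str, min_overlap: int = 5) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_pair(left: str, right: str, min_overlap: int = 5):
    """Merge two overlapping ESTs into one contig, or return None if they do not overlap."""
    size = longest_overlap(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Two toy ESTs (assumed) covering overlapping parts of the same transcript.
est_a = "ATGGCCAAGTTGAC"
est_b = "AGTTGACCTGAAGG"
contig = merge_pair(est_a, est_b)
print(contig)  # ATGGCCAAGTTGACCTGAAGG
```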
Databases for ESTs
• Diatom EST database http://avesthagen.sznbowler.com
• ESTree http://www.itb.cnr.it/estree/
• Fungal genomics project
https://fungalgenomics.concordia.ca/home/index.php
• Honey bee brain EST project
http://titan.biotec.uiuc.edu/bee/honeybee_project.htm
• Nematode ESTs at the Sanger Institute
ftp://ftp.sanger.ac.uk/pub/pathogens/nem_ests/
• NEMBASE - parasitic nematode ESTs
http://www.nematodes.org
• Parasitic and free-living nematode EST resource
http://www.nematode.net/
EST sequence analysis
• An individual raw EST has negligible biological
information. Analysis using different
combinations of computational tools augments
this weak signal, and when a multitude of ESTs
are analysed, the results enable the
reconstruction of the transcriptome of that organism.
While diverse research groups have used
different combinations of tools for extraction of
data from specific databases followed by
analyses [32–37], a generic protocol of the
different steps in the analysis of EST data sets is
shown in Figure 2.
Generic steps involved in EST analysis.

Nagaraj S H et al. Brief Bioinform 2007;8:6-21

EST clustering and assembly
• The purpose of EST clustering is to collect overlapping
ESTs from the same transcript of a single gene into a
unique cluster to reduce redundancy. EST data are
fragmented, but can be consolidated and
indexed using gene sequence information, such that all
the expressed data arising from a single gene is grouped
into a single index class, and each index class contains
information for only that particular gene. A simple way to
cluster ESTs is by measuring the pair-wise sequence
similarity between them. These similarity scores are then
converted into binary values, depending on whether
there is a significant match or not, such that the
sequence pair can be accepted into or rejected from the
cluster being assembled.
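The pair-wise approach described above can be sketched in Python. The similarity measure here (the fraction of shared k-mers) and the threshold are assumptions made for illustration; real clustering tools typically use alignment-based scores. Each pair-wise score is reduced to a binary accept or reject decision, and accepted pairs are joined into the same cluster with a small union-find structure. All names are hypothetical.

```python
from itertools import combinations


def kmers(seq: str, k: int) -> set:
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 8) -> float:
    """Fraction of k-mers shared, relative to the smaller k-mer set."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def cluster_ests(ests, threshold: float = 0.3, k: int = 8):
    """Group ESTs by single linkage; returns sorted lists of EST indices."""
    parent = list(range(len(ests)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(ests)), 2):
        # Binary decision: the pair is either a significant match or not.
        if similarity(ests[i], ests[j], k) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy ESTs (assumed): the first two overlap, the third is unrelated.
ests = [
    "ATGGCCAAGTTGACCTGAAG",
    "CCAAGTTGACCTGAAGGTTC",
    "TTTTCGCGATATCGGGCATA",
]
print(cluster_ests(ests, threshold=0.3, k=6))  # [[0, 1], [2]]
```

Single linkage is used because it matches the accept-or-reject description, but it can chain unrelated ESTs together through a shared repeat, which is one reason repeats are masked before clustering.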
Programs for EST sequence assembly (name, website)

• CAP3 http://genome.cs.mtu.edu/cap/cap3.html
• CLOBB http://zeldia.cap.ed.ac.uk/CLOBB/
• CLU http://compbio.pbrc.edu/pti
• ESTate http://www.ebi.ac.uk/~guy/estate/
• ESTs aSSEmbly using Malig
http://alggen.lsi.upc.es/recerca/essem/frame-essem.html
• megaBLAST ftp://ftp.ncbi.nih.gov/blast/
• miraEST http://www.chevreux.org/projects_mira.html
• Paracel Transcript Assembler http://www.paracel.com/
• Phrap http://www.phrap.org/
• stackPACK http://www.sanbi.ac.za/Dbases.html#stackpack
• Xsact and Xtract http://www.ii.uib.no/~ketil/bioinformatics/
Database similarity searches
• Once consensus sequences (putative genes) are
obtained from assembled ESTs, possible functions can
be assigned through downstream annotation, achieved
via database similarity searches, employing familiar
freely available tools and databases.
• Different flavours of BLAST programs from NCBI serve
as universal tools for database similarity searches.
BLASTN can be used to search ESTs against nucleotide
sequence databases and BLASTX to search against
protein databases. BLASTX translates a consensus EST
sequence (query) into protein products in six reading
frames, followed by comparisons with protein databases.
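The six-frame translation that precedes a BLASTX comparison can be illustrated with a short Python sketch. It is not BLASTX itself: it only shows how three forward and three reverse-complement reading frames are derived from one query, using the standard genetic code. The function names and the toy query are hypothetical; unknown codons are written as `X` by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def translate(seq: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> dict:
    """Return protein translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


# Toy consensus sequence (assumed).
frames = six_frame_translation("ATGGCCAAGTTGACCTGA")
print(frames["+1"])  # MAKLT*
print(frames["-1"])  # SGQLGH
```

Each of the six translations would then be compared with a protein database; frames interrupted by many stop codons are unlikely to represent the true coding frame.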
Programs for EST alignment to genomic DNA
• BLAT http://genome.ucsc.edu/cgi-bin/hgBlat
• Est2genome
http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html
• GMAP http://www.gene.com/share/gmap/
• MGAlign http://origin.bic.nus.edu.sg/mgalign
• SSAHA http://www.sanger.ac.uk/Software/analysis/SSAHA/
• Sim4 http://globin.cse.psu.edu/html/docs/sim4.html
• Splign http://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi
Application of EST
• ESTs were first used to construct maps of
the human genome
• assessment of the gene coverage from
EST sequencing
• mapping of gene-based site markers
• EST databases are used for gene
structure prediction
• to investigate alternative splicing
• to discriminate between genes exhibiting
tissue or disease-specific expression
• for the discovery and characterization of
candidate SNPs
• EST-based gene expression protocols
have been used in the identification and
analysis of coexpressed genes on a large
scale