
COMPUTATIONAL GENE PREDICTION

CSE/BIMM/BENG 181 MAY 24, 2011 SERGEI L KOSAKOVSKY POND [[email protected]]


DEFINITIONS
A gene: a nucleotide sequence that codes for a protein

Gene prediction: given a genome, locate the beginning and ending position of every gene.
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcg
gctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgg
gatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttgga
atatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagc
tgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgct
aagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcgg
ctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggct
atgcaagctgggatccgatgactatgcttaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgct
aagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaag
ctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtct
tgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttacctt
ggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgc
taagctcatgcgg

CENTRAL DOGMA OF MOLECULAR BIOLOGY
DNA: CCTGAGCCAACTATTGATGAA

RNA: CCUGAGCCAACUAUUGAUGAA

Protein: PEPTIDE

HTTP://UPLOAD.WIKIMEDIA.ORG/WIKIPEDIA/EN/6/68/CENTRAL_DOGMA_OF_MOLECULAR_BIOCHEMISTRY_WITH_ENZYMES.JPG

BRIEF HISTORY
“The central dogma of molecular biology deals with the detailed residue-by-residue transfer of sequential information. It states that such information cannot be transferred from protein to either protein or nucleic acid.” Francis Crick, Nature 1970

Originally stated in 1958, but questioned in the 1960s due to evidence of viral RNA-to-DNA transfer (shown by H. Temin and others)

CODONS

In 1961 Sydney Brenner and Francis Crick discovered frameshifting mutations

Systematically deleted nucleotides from DNA

Single and double deletions dramatically altered protein product

Effects of triple deletions were minor

Conclusion: every triplet of nucleotides (a codon) maps to exactly one amino acid in a protein

GENETIC CODE
64 codons are mapped to 20 (+stop) amino-acid characters via a genetic code

Genetic codes may differ slightly between organisms and genomes (e.g. nuclear vs mitochondrial)

Multiple and differing redundancies in the genetic code

Synonymous and non-synonymous substitutions are fundamentally different

Amino acid      Codons           Redundancy
Alanine         GC*              4
Cysteine        TGC,TGT          2
Aspartic Acid   GAC,GAT          2
Glutamic Acid   GAA,GAG          2
Phenylalanine   TTC,TTT          2
Glycine         GG*              4
Histidine       CAC,CAT          2
Isoleucine      ATA,ATC,ATT      3
Lysine          AAA,AAG          2
Leucine         CT*,TTA,TTG      6
Methionine      ATG              1
Asparagine      AAC,AAT          2
Proline         CC*              4
Glutamine       CAA,CAG          2
Arginine        AGA,AGG,CG*      6
Serine          AGC,AGT,TC*      6
Threonine       AC*              4
Valine          GT*              4
Tryptophan      TGG              1
Tyrosine        TAC,TAT          2
Stop            TAA,TAG,TGA      3
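The redundancy column can be checked mechanically. The sketch below builds the standard (nuclear) genetic code from its usual compact string encoding and counts codons per amino acid; it assumes the standard code only, and the names `CODON_TABLE`, `translate` and `REDUNDANCY` are illustrative, not part of the lecture.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard (nuclear) genetic code, codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a DNA or RNA string codon by codon, starting at position 0."""
    dna = seq.upper().replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Number of codons mapping to each amino acid (the "Redundancy" column).
REDUNDANCY = Counter(CODON_TABLE.values())
```

For instance, `REDUNDANCY["L"]` gives the six leucine codons, and `translate` applied to the 21-nucleotide sequence on the central dogma slide spells out the residues P, E, P, T, I, D, E.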

SIX READING FRAMES
HIV-1 protease

DNA: CCAATAAGTC CTATTGAAAC TGTACCAGTA ACAAAGCCAG GAATGGATGG
     CCCAAAGGTT AAACAATGGC CATTAACAGA AGAGAAAAAA GC

Protein translation:

In frame: PISPIETVPVTKPGMDGPKVKQWPLTEEKK
+1:       QXVLLKLYQXQSQEWMAQRLNNGHXQKRKK
+2:       NKSYXNCTSNKARNGWPKGXTMAINRREKS
X marks a stop codon, which signals the ribosome to stop protein synthesis.

Reverse complements are complementary DNA strands (opposite direction and complementary bases)

They define 3 other reading frames
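The three forward translations of the HIV-1 protease fragment can be reproduced with a short sketch that also produces the three reverse-complement frames. It assumes the standard genetic code and writes stop codons as X, as on the slide; the function names and the frame labels ("+0" ... "-2") are illustrative choices.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG codon order; 'X' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# The HIV-1 protease fragment from the slide, with the spacing removed.
HIV1_PR = (
    "CCAATAAGTC CTATTGAAAC TGTACCAGTA ACAAAGCCAG GAATGGATGG "
    "CCCAAAGGTT AAACAATGGC CATTAACAGA AGAGAAAAAA GC"
).replace(" ", "")


def reverse_complement(dna):
    """Complement every base, then reverse the strand direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at the given offset (0, 1 or 2)."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations of the 3 forward and 3 reverse-complement frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for k in range(3):
        frames[f"+{k}"] = translate(dna, k)
        frames[f"-{k}"] = translate(rc, k)
    return frames
```

Calling `six_frames(HIV1_PR)` returns a dictionary whose "+0", "+1" and "+2" entries correspond to the "In frame", "+1" and "+2" lines above.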


CONTIGUOUS VS SPLICED GENES
Based on bacterial experimentation, the sequences of DNA, RNA and protein were collinear; evidence suggested that eukaryotes followed the same pattern.

In 1977, Phillip Sharp and Richard Roberts experimented with mRNA of hexon, a viral protein.

Map adenovirus hexon mRNA in viral genome by hybridization to adenovirus DNA and electron microscopy

mRNA-DNA hybrids formed three curious loop structures instead of a contiguous duplex segment

HTTP://NOBELPRIZE.ORG/NOBEL_PRIZES/MEDICINE/LAUREATES/1993/SHARP-LECTURE.PDF

EXONS AND INTRONS

In eukaryotes, a gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)

This makes computational gene prediction in eukaryotes even more difficult

Prokaryotes (e.g. bacteria) don’t have introns - their genes are contiguous.

EUKARYOTIC GENES
To appear in: Proceedings of the Workshop on Knowledge Discovery and Emergent Complexity in Bioinformatics (KDECB @ Benelearn'06), Lecture Notes in Bioinformatics, Springer-Verlag, 2007

Fig. 1. The gene-parsing problem. A complete mRNA consists of one or more exons (rectangles). Portions of these exons may be coding (gray) or noncoding (hatched), with only the former giving rise to amino acids during translation. The coding segment extends from a start codon (ATG) to a stop codon (TGA, TAG, or TAA), with one or more introns (GT to AG) in between. Introns are spliced out prior to translation into a protein. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.
FROM “ADVANCING THE STATE OF THE ART IN COMPUTATIONAL GENE PREDICTION”, BY WILLIAM H. MAJOROS, UWE OHLER

A gene parse thus consists of a syntactically valid series of signals from the set Σ = {ATG, GT, AG, TGA, TAA, TAG} which have been identified in the input sequence. The necessary syntactic constraints on the
[Figure: genomic DNA (promoter, segments 1-5, ATG, stop, poly(A) site) → transcription (TSS to TTS) → pre-mRNA → RNA processing (capping, splicing, polyadenylation) → mRNA (cap, 5′ UTR, CDS, 3′ UTR, poly(A)) → RNA transport and translation → protein. Transcription and processing take place in the nucleus, translation in the cytoplasm. Legend: coding sequence (CDS), untranslated (UTR) sequence, polypeptide, ribosome.]

Figure 1 | The central dogma of gene expression. In the typical process of eukaryotic gene expression, a gene is transcribed from DNA to pre-mRNA. mRNA is then produced from pre-mRNA by RNA processing, which includes the capping, splicing and polyadenylation of the transcript. It is then transported from the nucleus to the cytoplasm for translation. TSS, transcription start site; TTS, transcription termination site.

many good reviews on this topic, and useful benchmarks in the research (for example, REFS 1–8), a truly fair comparison of the prediction programs is impossible as their performance depends crucially on the specific TRAINING DATA that are used to develop them.

all gene-prediction papers refer to four types of ‘exon’, as shown in FIG. 2b; however, these are just the coding regions of the exons. To avoid the misuse of these terms, I refer to subclasses of exons in this article as 5′ CDS, itexon, 3′ CDS and intronless CDS.
FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES ”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709

GENE FINDING APPROACHES

Direct

Close matches to ESTs, cDNA or protein sequences from the same or closely related organism

Computational

Something that matches an already known gene (homology)

Something that matches statistical patterns common to all genes (ab initio)

Hybrid

STATISTICAL APPROACH: METAPHOR IN UNKNOWN LANGUAGE

Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and
numerical symbols could you distinguish between a story and a
stock report in a foreign newspaper?

WHAT CAN WE MEASURE ABOUT GENES?

ORF (Open Reading Frame): a sequence started by ATG and terminated by a stop codon (e.g. TAA, TAG, TGA)

Codon Usage: the preference for using specific synonymous codons, most frequently measured by CAI (Codon Adaptation Index)

Features and motifs

Promoters, splice sites, enhancers, untranslated regions (UTRs)

OPEN READING FRAMES

Detect potential coding regions by looking at ORFs

A genome of length n comprises about n/3 codons in each reading frame

Stop codons break the genome into segments

The subsegments of these that start from the Start codon (ATG) are ORFs

Some ORFs can overlap and code for different genes!

[Figure: an open reading frame within a genomic sequence, running from ATG to TGA]
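The scan described on this slide can be sketched as follows: walk each forward frame codon by codon, open an ORF at the first ATG after the previous stop, and close it at the next in-frame stop. This is a minimal illustration, not a gene finder; `find_orfs` and its `min_codons` parameter are hypothetical names, and the reverse strand would be handled by running the same function on the reverse complement.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=0):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the index of the A in ATG, end is exclusive and includes the
    stop codon, frame is start % 3. min_codons counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG since the last stop opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # the stop closes the ORF
    return orfs
```

With `min_codons=100` this mirrors the "long enough" filter discussed for intron-less genes; ORFs that run off the end of the sequence without a stop are not reported in this sketch.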

PROKARYOTES OR INTRON-LESS GENES

S. cerevisiae annotated (in 1997) vs all ORFs

The basic concept is to look for ORFs that ‘look like’ genes:

Initially, long enough (~100 codons or longer)

But short ORFs are actually quite frequent in eukaryotic genes.

Have a believable codon composition, as measured by, e.g. the codon adaptation index (CAI)
SMALL OPEN READING FRAMES: BEAUTIFUL NEEDLES IN THE HAYSTACK
MUNIRA A. BASRAI, PHILIP HIETER, AND JEF D. BOEKE
GENOME RES. 1997. 7: 768-771

Measures the relative abundance or paucity of a particular codon for a given organism/gene.

E.g. in a representative dataset of HIV-1 polymerase sequences the four codons that map to Alanine have a rather skewed distribution:

Codon Count
GCA 41576
GCC 9461
GCG 1017
GCT 11031
COMPUTING CAI
Define relative synonymous codon usage (RSCU) for a pair (i,j), where i is an amino-acid and j is one of the n_i codons mapping to it, by the formula below. X_ij is the count of the j-th codon for amino-acid i.

An RSCU > 1 indicates a preferred codon, and an RSCU < 1 an avoided codon

$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{k=1}^{n_i} X_{ik}}$$

Further define relative adaptiveness w_ij as:

$$w_{ij} = \frac{\mathrm{RSCU}_{ij}}{\max_k \mathrm{RSCU}_{ik}} = \frac{X_{ij}}{\max_k X_{ik}}$$

Codon  Count  RSCU   w
GCA    41576  2.64   1
GCC    9461   0.60   0.23
GCG    1017   0.064  0.02
GCT    11031  0.70   0.27
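The RSCU and w columns follow directly from the counts. A minimal sketch, using the Alanine counts from the HIV-1 polymerase example; the function name `rscu_and_w` is illustrative:

```python
def rscu_and_w(counts):
    """Compute RSCU and relative adaptiveness w for one amino acid.

    counts maps each synonymous codon of the amino acid to its observed count.
    """
    mean = sum(counts.values()) / len(counts)  # (1/n_i) * sum_k X_ik
    top = max(counts.values())                 # max_k X_ik
    rscu = {codon: x / mean for codon, x in counts.items()}
    w = {codon: x / top for codon, x in counts.items()}
    return rscu, w


ALA_COUNTS = {"GCA": 41576, "GCC": 9461, "GCG": 1017, "GCT": 11031}
ALA_RSCU, ALA_W = rscu_and_w(ALA_COUNTS)
```

Rounding `ALA_RSCU` and `ALA_W` to the precision shown in the table reproduces its last two columns; note that w never needs the RSCU values themselves, because the per-amino-acid mean cancels in the ratio.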
ORGANISM WIDE CODON USAGE

SHARP AND LI, 1987

THE CAI OF A GENE
The observed CAI of a sequence with L codons is the geometric mean of the RSCU values of its codons:

$$\mathrm{CAI}_{obs} = \left(\prod_{k=1}^{L} \mathrm{RSCU}_k\right)^{1/L}$$

This is compared with the maximum possible CAI of all codon sequences with the same length that code for the same protein sequence to derive CAI. Here RSCU_k^max is the largest RSCU among the codons synonymous with the k-th codon.

$$\mathrm{CAI}_{max} = \left(\prod_{k=1}^{L} \mathrm{RSCU}_k^{\max}\right)^{1/L}$$

$$\mathrm{CAI} = \mathrm{CAI}_{obs} / \mathrm{CAI}_{max}$$
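A small numerical sketch of this ratio, computed in log space to avoid underflow on long genes. It is restricted to Alanine codons so that it stays self-contained; the `SYNONYMS` lookup and the function name `cai` are assumptions made for the example, and codons with zero counts would need a pseudocount that is not shown here.

```python
import math

# Alanine codon counts from the HIV-1 polymerase example.
ALA_COUNTS = {"GCA": 41576, "GCC": 9461, "GCG": 1017, "GCT": 11031}
MEAN = sum(ALA_COUNTS.values()) / len(ALA_COUNTS)
RSCU = {codon: x / MEAN for codon, x in ALA_COUNTS.items()}
# Every alanine codon is synonymous with every other alanine codon.
SYNONYMS = {codon: list(ALA_COUNTS) for codon in ALA_COUNTS}


def cai(codons, rscu, synonyms):
    """CAI = CAI_obs / CAI_max, both geometric means over the L codons."""
    n = len(codons)
    log_obs = sum(math.log(rscu[c]) for c in codons) / n
    log_max = sum(
        math.log(max(rscu[s] for s in synonyms[c])) for c in codons
    ) / n
    return math.exp(log_obs - log_max)
```

Because each factor of the ratio is RSCU_k / RSCU_k^max, the result equals the geometric mean of the relative adaptiveness values w of the codons; a sequence made only of the preferred codon GCA therefore has CAI 1.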


CAI DISTRIBUTION IN GENES

SHARP AND LI, 1987

Caveats

Some genes have unusual (for the organism) codon usage patterns

Predictive power of CAI depends on the length of the sequence, and many genes are quite short
A SIMPLE HMM FOR FINDING
PROKARYOTIC / INTRON-LESS GENES

In this generalized HMM, some hidden states are allowed to emit a variable length substring, instead of a single letter.

The idea is that the ‘gene’ state emits the whole sequence, instead of N individual letters.

The length of the substring is drawn from a pre-defined probability function.

The Viterbi algorithm can be extended to deal with this generalization.

Atypical genes must also be dealt with, most prominently those arising from horizontal transfers.

1108 Nucleic Acids Research, 1998, Vol. 26, No. 4

able to correctly identify ORFs where 98% of all genes predicted by GeneMark.hmm resided. Also there were genes missed by GeneMark.hmm, mainly due to overlaps, that were recovered by GeneMark. However, the GeneMark.hmm program made several new predictions and some of them were confirmed by similarity search. It seems that the GeneMark.hmm development brought us closer to the goal of accurate prediction of bacterial genes and further arguments in favor of this statement are presented below.

MATERIALS AND METHODS

Materials
We have used DNA sequences of the complete genomes of H.influenzae (GenBank accession no. L42023), M.genitalium (L43967), M.jannaschii (L77117), M.pneumoniae (U00089), Synechocystis PCC6803 (synecho), E.coli (U00096), H.pylori (AE000511), M.thermoauthotrophicum (AE000666), B.subtilis (AL009126), Archeoglobus fulgidus (AE000782). The data on the annotated E.coli RBS were provided by W. Hayes (22). The data on experimentally verified N-terminal protein sequences were kindly provided by A. Link (23). The Markov models parameters were obtained from the GeneMark library (http://exon.biology.gatech.edu/~genmark/matrices/).

The HMM framework of GeneMark.hmm, the logic of transitions between hidden Markov states, followed the logic of the genetic structure of the bacterial genome (Fig. 1). The Markov models of coding and non-coding regions were incorporated into the HMM framework to generate stretches of DNA sequence

Figure 1. Hidden Markov model of a prokaryotic nucleotide sequence used in the GeneMark.hmm algorithm. The hidden states of the model are represented as ovals in the figure, and arrows correspond to allowed transitions between the states.

Model of prokaryotic sequence structure
The architecture of the hidden Markov model used in the GeneMark.hmm algorithm is shown in Figure 1. To deal
EMISSION LENGTH DISTRIBUTIONS CAN BE DETERMINED EMPIRICALLY

[Figure panels: E. COLI CODING, E. COLI NONCODING]

Figure 2. Length distribution probability densities of protein-coding and non-coding regions derived from the annotated E.coli genomic DNA (histograms). (a) Coding regions; the solid curve is the approximation by γ distribution g(d) = Nc (d/Dc)^2 exp(-d/Dc), where d is the length in nt, Dc = 300 nt, Nc is the coefficient chosen to normalize the distribution function on the interval from 30 nt (the minimal length of coding region) to 7155 nt (the maximal length). (b) Non-coding regions; the solid curve is the approximation by exponential distribution f(d) = Nn exp(-d/Dn), where Dn = 150 nt. The coefficient Nn normalizes the distribution function on the interval from 1 to 1000 nt.

1110 Nucleic Acids Research, 1998, Vol. 26, No. 4

Table 1. Nucleotide frequencies for the RBS model

Nucleotide   Position
             1      2      3      4      5
T            0.161  0.050  0.012  0.071  0.115
C            0.077  0.037  0.012  0.025  0.046
A            0.681  0.105  0.015  0.861  0.164
G            0.077  0.808  0.960  0.043  0.659

The model was derived using the multiple sequence alignment of 325 annotated ribosomal binding sites (see text). Given the set of aligned sequences, the frequency of a given nucleotide was calculated as the number of occurrences of this nucleotide in a given position divided by the total number of sequences.

The finally obtained alignment of the 325 sequences has revealed the RBS sequence pattern in the form of a matrix of positional nucleotide frequencies (Table 1). It is seen that the matrix defines the strong consensus sequence: AGGAG, which is complementary to a pentamer located in the E.coli 16S rRNA near its 3′-end. This observation is in a good agreement with the generally accepted mechanism of ribosome-mRNA binding. Note that a similar result was obtained previously (27). To evaluate a putative RBS we calculated its probabilistic score as the product of corresponding elements of the matrix given in Table 1. The threshold value for RBS score was chosen as 0.00025. It can be shown that the log of this score is proportional to ribosome binding energy (with appropriate sign) under the assumption of independent formation of ribonucleotide pairs.
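The RBS score described in the excerpt (the product of the Table 1 entries for a candidate pentamer, compared with the 0.00025 threshold) can be sketched as below. The function names are illustrative, and treating a score equal to the threshold as a hit is an assumption, since the excerpt does not say whether the comparison is strict.

```python
# Table 1: positional nucleotide frequencies for the 5-position RBS model.
RBS_FREQ = {
    "T": [0.161, 0.050, 0.012, 0.071, 0.115],
    "C": [0.077, 0.037, 0.012, 0.025, 0.046],
    "A": [0.681, 0.105, 0.015, 0.861, 0.164],
    "G": [0.077, 0.808, 0.960, 0.043, 0.659],
}
THRESHOLD = 0.00025


def rbs_score(pentamer):
    """Product of the matrix entries matching each base of the pentamer."""
    score = 1.0
    for pos, base in enumerate(pentamer.upper()):
        score *= RBS_FREQ[base][pos]
    return score


def consensus():
    """Most frequent base at each of the five positions."""
    return "".join(
        max(RBS_FREQ, key=lambda b: RBS_FREQ[b][pos]) for pos in range(5)
    )


def scan(dna):
    """Return (index, score) for every 5-mer scoring at or above threshold."""
    hits = []
    for i in range(len(dna) - 4):
        score = rbs_score(dna[i:i + 5])
        if score >= THRESHOLD:  # assumption: non-strict comparison
            hits.append((i, score))
    return hits
```

Working with the sum of log frequencies instead of the raw product gives the same ranking and matches the remark that the log of the score is proportional to binding energy.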
SPLICE SITE DETECTION

The beginning and end of exons are signaled by donor and acceptor sites that usually have GT and AG di-nucleotides

Detecting these sites is difficult, because GT and AG appear very often

[Figure: exon 1, donor site (GT), intron, acceptor site (AG), exon 2]
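To see why the dinucleotides alone are weak signals, the sketch below lists every GT...AG span in a sequence as a candidate intron. It assumes the canonical GT donor and AG acceptor only; `candidate_introns` and its `min_len` parameter are hypothetical names, and a real predictor would score the surrounding positions rather than accept every pair.

```python
import re


def candidate_introns(dna, min_len=4):
    """Return (start, end) spans that begin with GT and end with AG.

    start is the index of the G in GT; end is exclusive, just past the G in AG.
    """
    dna = dna.upper()
    # Zero-width lookaheads so that overlapping occurrences are all found.
    donors = [m.start() for m in re.finditer("(?=GT)", dna)]
    acceptor_ends = [m.start() + 2 for m in re.finditer("(?=AG)", dna)]
    return [
        (d, a) for d in donors for a in acceptor_ends if a - d >= min_len
    ]
```

In random sequence with equal base frequencies a given dinucleotide is expected about once every 16 positions, so the number of candidate pairs grows roughly quadratically with sequence length.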

Features of Spliceosome Evolution and Function Inferred from an Analysis of the Information at Human Splice Sites
R. Michael Stephens and Thomas Dana Schneider
National Cancer Institute, Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, P.O. Box B, Frederick, MD 21702-1201, U.S.A.
Linganore High School, 12013 Old Annapolis Rd., Frederick, MD 21701, U.S.A.
(Received 8 November 1991; accepted 19 August 1992)

An information analysis of the 5’ (donor) and 3’ (acceptor) sequences spanning the ends of nearly 1800 human introns has provided evidence for structural features of splice sites that bear upon spliceosome evolution and function: (1) 82% of the sequence information (i.e. sequence conservation) at donor junctions and 97% of the sequence information at acceptor junctions is confined to the introns, allowing codon choices throughout exons to be largely unrestricted. The distribution of information at intron-exon junctions is also described in detail and compared with footprints. (2) Acceptor sites are found to possess enough information to be located in the transcribed portion of the human genome, whereas donor sites possess about one bit less than the information needed to locate them independently. This difference suggests that acceptor sites are located first in humans and, having been located, reduce by a factor of two the number of alternative sites available as donors. Direct experimental evidence exists to support this conclusion. (3) The sequences of donor and acceptor splice sites exhibit a striking similarity. This suggests that the two junctions derive from a common ancestor and that during evolution the information of both sites shifted onto the intron. If so, the protein and RNA components that are found in contemporary spliceosomes, and which are responsible for recognizing donor and acceptor sequences, should also be related. This conclusion is supported by the common structures found in different parts of the spliceosome.

Keywords: splice; spliceosome; information theory; evolution; human

9.35 BITS (POSITIONS -25 TO +2)
7.92 BITS (POSITIONS -3 TO +6)
Figure 1. Information curves and sequence logos for human spliceosome binding sites. The left half of the Figure shows the donor splice sites from position -8 to position +17, while the right half shows the -30 to +10 region around the acceptor sites. Position zero on both curves is the point on the intron adjacent to the splice point, i.e. on the 5’ side, the intron is cut immediately before position zero while on the 3’ side it is cut immediately after position zero. (These are the co-ordinates provided by GenBank.) In the matrix corresponding to each graph, the bottom row, labeled 1, contains
“At the core of most gene recognition algorithms is one or more coding measures – functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of ‘typical’ exonic DNA ... attention can probably be limited to six of the twenty or so measures proposed to date”

Evaluated how well different measures performed in recovering known coding sequences (human and E. coli) based on organism-specific training.

Applied linear discriminant analysis to train each method



LDA SPECIFICITY AND SENSITIVITY

HTTP://SCIEN.STANFORD.EDU/CLASS/EE368/PROJECTS2000/PROJECT15/ALGORITHMS.HTML

HTTP://HARLEQUIN.JAX.ORG/GENOMEANALYSIS/GENEFINDING04.PPT



FROM FICKETT AND TUNG 1992
(SP+SN)/2 MEASURE REDUNDANCY

Hexamer based measures come out on top. They are based on the frequencies of 6-mers in one of the frames (0,+1,+2). They are highly predictive because they capture the codon structure, codon usage bias, initiation sites and higher order co-dependencies.

Pseudogenes can confuse even the best protein-trained approaches.

EXAMPLES OF OTHER FEATURES
E.g. promoters in prokaryotes (E. coli)

Transcription starts at offset 0.

Pribnow Box (-10)

Gilbert Box (-30)

Ribosomal Binding Site (+10)

TRANSCRIPT ASSEMBLY
Once individual ORFs and splice sites have been identified, they must be assembled into a full transcript.

Could be done with dynamic programming, or HMMs, for example.

Models need to incorporate relevant biological knowledge.

Burge (REF. 12), in a systematic analysis of short introns, have suggested that these standard splice sites might not be sufficient for defining introns in the genomes of plants and humans.

In vertebrates, the internal exons are small (~140 nucleotides on average), whereas introns are typically much larger (with some being more than 100 kb in length). In 1990, the ‘exon-definition’ model (REF. 13) was proposed to explain how the splicing machinery recognizes exons in a ‘sea’ of intronic DNA, where many cryptic splice sites exist. This model has since been validated by many experiments, and it proposes that an internal exon is initially recognized by the presence of a chain of interacting splicing factors that span it (FIG. 3). The binding of these trans-acting factors to the pre-mRNA is responsible for the non-random nucleotide patterns that form the molecular basis for all exon-recognition algorithms.

These sequence features are often divided into two types: ‘signals’, which correspond to short cis-elements or boundary sites (such as splice sites and branch sites); and ‘content’, which corresponds to the extended functional regions (such as exons and introns). To evaluate each feature, one needs to define a scoring function of the feature (also called a feature variable). The best scoring function is the conditional probability P(a|s) that the given sequence s contains the feature a. According to the Bayes equation, P(a|s) = P(s|a)P(a)/P(s), where P(s|a) is the likelihood of s containing a. So, a training sample (sequence set) with the known feature a is built, and then the occurrence of a particular sequence s is counted. Different features can then be integrated into a single score for the whole object (an exon in this case). Genes are predicted by finding the gene structure that has the highest score, given the sequence. Approaches differ in their choice of features, scoring functions and integration methods. Once the problem is phrased as a statistical-pattern recognition problem, many statistical or machine learning tools are available for recognizing these patterns. Indeed, almost all of them have been applied to the exon (or gene)-recognition problem. Here, I review just a few generic or popular approaches.

Most early programs used the simple positional weight matrix method (WMM, see BOX 1) to identify splice-site signals. In recent programs, the correlation among positions in a signal is also explored. The weight array method (WAM) or Markov models (BOX 1) are used to explore adjacent correlations; decision-tree or maximal-dependence decomposition (MDD) methods are used to explore non-adjacent correlations; and artificial neural network (ANN) methods are used to explore arbitrary, nonlinear dependencies. These more complex models typically yield significant, but not marked, improvements over the simple WMM. However, major improvements

LDA is implemented in SPL, a splice-site recognition module of the HEXON program (REF. 15). A new splice-site detection program, GeneSplicer, has also been developed recently (REF. 16) and is reported to perform favourably when compared with many other programs (such as NetPlantGene, NetGene2, SHSPL,

identify these boundaries, which results in predicted genes being either truncated or fused together. Determining the 3′ end of a gene is easier than determining its 5′ end. This is because most of the mRNA and EST sequences in GenBank are truncated at their 5′ ends. The exon-definition model can also be applied

[Figure: a, Exon classification: 5′ uexon, 5′ utexon, 5′ utuexon, iuexon, iutexon, ituexon, itexon, iutuexon, 3′ uexon, 3′ tuexon, 3′ utuexon and intronless gene, delimited by TSS, GT, AG and Poly(A) signals; b, CDS misclassification]

Figure 3 | Exon-definition model. Typically, in vertebrates, exons are much shorter than introns. According to the exon-definition model, before introns are recognized and spliced out, each exon is initially recognized by the protein factors that form a bridge across it. In this way, each exon, together with its flanking sequences, forms a molecular, as well as a computational, recognition module (arrows indicate molecular interactions). Modified with permission from REF. 26 © (2002) Macmillan Magazines Ltd. CBC, cap-binding complex; CFI/II, cleavage factor I/II; CPSF, cleavage and polyadenylation specificity factor; CstF, the cleavage stimulation factor; PAP, poly(A) polymerase; snRNP, small nuclear RNP; SR, SR protein; U2AF, U2 small nuclear ribonucleoprotein particle (snRNP) auxiliary factor.
FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES ”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709
SOME SIMPLE ASSEMBLY RULES
A gene parse thus consists of a syntactically valid series of signals from the set Σ = {ATG, GT, AG, TGA, TAA, TAG} which have been identified in the input sequence. The necessary syntactic constraints on the parse of a genomic sequence are:

ATG → TAG
ATG → GT
GT → AG
AG → GT
AG → TAG
TAG → ATG

where the rule X → Y indicates that signal X may be followed by signal Y in a syntactically valid parse (rules for genes on the opposite DNA strand are easily obtained from these). The set of all valid parses for a given input sequence may be represented using a parse graph (Fig. 2) in which vertices represent putative signals and edges represent possible exons, introns, and intergenic regions.

Fig. 2. An example parse graph. Vertices are shown as dinucleotide or trinucleotide motifs at the bottom. Edges denote exons, introns, or intergenic regions. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.
FROM “ADVANCING THE STATE OF THE ART IN COMPUTATIONAL GENE PREDICTION”, BY WILLIAM H. MAJOROS, UWE OHLER
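The signal-ordering rules on this slide (ATG → TAG, ATG → GT, GT → AG, AG → GT, AG → TAG, TAG → ATG) amount to a small transition table, so a proposed series of signals can be checked pair by pair. The sketch below assumes that TAG in the rules stands for any of the three stop codons in Σ, which the excerpt does not state explicitly; `RULES` and `is_valid_parse` are illustrative names, and forward-strand genes only are covered.

```python
# Allowed successors for each signal, taken from the six rules.
RULES = {
    "ATG": {"TAG", "GT"},  # start codon: single-exon gene, or first donor
    "GT": {"AG"},          # donor must be followed by an acceptor
    "AG": {"GT", "TAG"},   # acceptor: another intron, or the stop codon
    "TAG": {"ATG"},        # stop codon: intergenic region, then next gene
}
STOP_CODONS = {"TAG", "TGA", "TAA"}


def normalize(signal):
    """Assumption: any stop codon plays the role of TAG in the rules."""
    return "TAG" if signal in STOP_CODONS else signal


def is_valid_parse(signals):
    """Check that each consecutive pair of signals obeys the rules."""
    series = [normalize(s) for s in signals]
    if any(s not in RULES for s in series):
        return False
    return all(nxt in RULES[cur] for cur, nxt in zip(series, series[1:]))
```

In a parse graph these allowed pairs are exactly the edges, so enumerating or scoring valid parses becomes a path problem over the putative signals found in the sequence.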
USING KNOWN GENES TO PREDICT NEW GENES

Some genomes may be very well-studied, with experimentally verified genes.

Closely-related organisms may have similar genes

Unknown genes in one species may be compared to genes in a sufficiently closely-related species

The idea is that gene structure is on average quite stable.



SIMILARITY-BASED APPROACH TO GENE PREDICTION

Genes in different organisms are similar

The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome

Problem: Given a known gene and an un-annotated genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene



COMPARING GENES IN TWO GENOMES

SMALL ISLANDS OF SIMILARITY CORRESPONDING TO SIMILARITIES BETWEEN EXONS



USING SIMILARITIES TO FIND THE EXON STRUCTURE

The known frog gene is aligned to different locations in the human genome

Find the “best” path to reveal the exon structure of the human gene

Start with a local alignment to find putative exons

Frog Genes (known)

Human Genome



CHAINING LOCAL ALIGNMENTS

Find substrings that match a given gene sequence (candidate exons); use a cutoff to define significance.

Define a candidate exon as a triple (l, r, w): left end, right end, and weight, where the weight is the score of the local alignment

Look for a maximum chain of substrings, i.e. a maximum-weight set of non-overlapping, non-adjacent intervals.
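The (l, r, w) representation and the chain condition can be written down directly. This is a minimal sketch under stated assumptions: the names `CandidateExon`, `significant`, `is_chain` and `chain_weight` are hypothetical, the intervals are made-up examples, and "non-overlapping, non-adjacent" is interpreted here as requiring each interval to end strictly before the next one begins.

```python
from typing import NamedTuple


class CandidateExon(NamedTuple):
    left: int      # l: left end of the matching substring
    right: int     # r: right end of the matching substring
    weight: float  # w: score of the local alignment


def significant(candidates, cutoff):
    """Keep only candidates whose alignment score reaches the cutoff."""
    return [c for c in candidates if c.weight >= cutoff]


def is_chain(intervals):
    """True if, in left-to-right order, each interval ends before the next starts."""
    ordered = sorted(intervals)  # tuples sort by left end first
    return all(a.right < b.left for a, b in zip(ordered, ordered[1:]))


def chain_weight(intervals):
    """Total weight of a set of intervals."""
    return sum(c.weight for c in intervals)


# Hypothetical candidates, for illustration only.
cands = [
    CandidateExon(0, 4, 6.0),
    CandidateExon(1, 10, 10.0),
    CandidateExon(5, 12, 6.0),
    CandidateExon(13, 15, 1.0),
]
kept = significant(cands, cutoff=5)
print(kept)
print(is_chain([cands[0], cands[2]]), chain_weight([cands[0], cands[2]]))
```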



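EXON CHAINING PROBLEM

Locate the beginning and end of each of the n intervals (2n points)

Find the “best”, i.e. maximum-weight, path

[Figure: candidate intervals with weights 5, 5, 15, 9, 11, 4 and 3 drawn over positions 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32; chains annotated SCORE=18 and SCORE=19.]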



EXON CHAINING PROBLEM: FORMULATION

Exon Chaining Problem: Given a set of putative exons, find a maximum-weight set of non-overlapping putative exons

Input: A set of weighted intervals (putative exons)

Output: A maximum-weight chain of intervals from this set

Would a greedy algorithm solve this problem?
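One way to probe the greedy question is a small counterexample. The sketch below assumes that "greedy" means repeatedly taking the heaviest interval compatible with those already chosen (the slides do not define the greedy rule), and the three intervals are hypothetical. The exhaustive search is only there to provide a reference answer on tiny inputs.

```python
from itertools import combinations


def compatible(a, b):
    """Two (left, right, weight) intervals are compatible if they do not overlap or touch."""
    return a[1] < b[0] or b[1] < a[0]


def greedy_chain_weight(intervals):
    """Assumed greedy rule: take intervals by decreasing weight when compatible."""
    chosen = []
    for iv in sorted(intervals, key=lambda iv: iv[2], reverse=True):
        if all(compatible(iv, c) for c in chosen):
            chosen.append(iv)
    return sum(iv[2] for iv in chosen)


def best_chain_weight_bruteforce(intervals):
    """Try every subset; exponential, for small illustrative inputs only."""
    best = 0
    for k in range(1, len(intervals) + 1):
        for subset in combinations(intervals, k):
            if all(compatible(a, b) for a, b in combinations(subset, 2)):
                best = max(best, sum(iv[2] for iv in subset))
    return best


# Hypothetical intervals: one heavy interval overlapping two lighter ones.
example = [(1, 10, 10), (0, 4, 6), (5, 12, 6)]
print(greedy_chain_weight(example))           # 10: the heavy interval blocks both others
print(best_chain_weight_bruteforce(example))  # 12: the two lighter intervals together
```

On this input the heaviest-first rule is suboptimal, which suggests that a global method such as dynamic programming is needed.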



ExonChaining (G, n) // Graph, number of intervals
  for i ← 0 to 2n
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of interval I
      w ← weight of interval I
      s_i ← max {s_j + w, s_{i-1}}
    else
      s_i ← s_{i-1}
  return s_{2n}

[Figure: on the example intervals, the greedy solution scores 17 (GREEDY: 17) while the best chain scores 21 (BEST: 21).]

Use a graph representation of the exon chaining problem

Can be solved in O(n) time using dynamic programming, once the 2n interval endpoints are sorted
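The ExonChaining pseudocode translates almost line for line into Python. This is a minimal sketch under stated assumptions: the function name `exon_chaining` and the (left, right, weight) tuple layout are hypothetical, every interval is assumed to have left < right, and intervals that share an endpoint are treated as incompatible. The sort makes this version O(n log n) overall; the scan itself is linear.

```python
def exon_chaining(intervals):
    """Return the maximum total weight of a chain of (left, right, weight) intervals."""
    # One event per endpoint: (position, kind, interval index).
    # kind 0 = left end, kind 1 = right end, so at equal positions left ends
    # sort first and touching intervals are never chained together.
    events = []
    for idx, (left, right, _weight) in enumerate(intervals):
        events.append((left, 0, idx))
        events.append((right, 1, idx))
    events.sort()

    s = [0] * (len(events) + 1)   # s[i] = best chain weight using the first i endpoints
    left_index = {}               # interval index -> position i of its left end
    for i, (_pos, kind, idx) in enumerate(events, start=1):
        if kind == 0:
            left_index[idx] = i
            s[i] = s[i - 1]
        else:
            j = left_index[idx]
            w = intervals[idx][2]
            # Either end a chain with this interval, or skip it.
            s[i] = max(s[j] + w, s[i - 1])
    return s[-1]


# Hypothetical intervals, for illustration only.
print(exon_chaining([(1, 10, 10), (0, 4, 6), (5, 12, 6)]))  # 12
```

The function returns only the optimal score, as the pseudocode does; recovering the chain itself would require storing back-pointers alongside `s`.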



EXON CHAINING: DEFICIENCIES

Frog Genes (known)

Human Genome

Poor definition of the putative exon endpoints

Optimal chain of intervals may not correspond to any valid alignment

First interval may correspond to a suffix, whereas second interval may correspond to a prefix

Combination of such intervals is not a valid alignment


SPLICED ALIGNMENT

Mikhail Gelfand and colleagues proposed a spliced alignment approach that uses a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome.

Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem).

This set is further filtered in a way that attempts to retain all true exons, along with some false ones.



SPLICED ALIGNMENT PROBLEM: FORMULATION

Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence

Input: Genomic sequence G, target sequence T, and a set of candidate exons B.

Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximum among all chains of blocks from B.

Γ*: concatenation of all exons from chain Γ
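The formulation can be made concrete with an exhaustive search over chains. This is a brute-force sketch for tiny inputs, not the dynamic programming algorithm of Gelfand and colleagues. Several things are assumptions of this example: the function names, the scoring scheme (match +1, mismatch -1, gap -1), blocks given as half-open (start, end) index pairs into G, and the toy sequences.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def spliced_alignment_bruteforce(genome, target, blocks):
    """Return (best score, best chain) over all chains of blocks; exponential time."""
    best_score, best_chain = float("-inf"), ()
    ordered = sorted(blocks)
    for k in range(1, len(ordered) + 1):
        for chain in combinations(ordered, k):
            # A chain needs a gap (an intron) between consecutive blocks.
            if all(a[1] < b[0] for a, b in zip(chain, chain[1:])):
                spliced = "".join(genome[s:e] for s, e in chain)  # Γ*
                score = global_alignment_score(spliced, target)
                if score > best_score:
                    best_score, best_chain = score, chain
    return best_score, best_chain


# Toy data, for illustration only.
G = "AAACGTTTTTGGATT"
T = "CGTGGA"
B = [(0, 3), (3, 6), (6, 9), (10, 13)]
print(spliced_alignment_bruteforce(G, T, B))
```

Here the chain made of blocks (3, 6) and (10, 13) spells CGTGGA, which matches T exactly. Unlike exon chaining, no block carries a weight of its own: only the concatenation is scored.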



EXON CHAINING VS SPLICED ALIGNMENT

In Spliced Alignment, every path spells out the string obtained by concatenation of labels of its edges. The weight of the path is defined as the optimal alignment score between the concatenated labels (blocks) and the target sequence

Defines the weight of an entire path in the graph, but not the weights of individual edges.

Exon Chaining assumes the positions and weights of exons are pre-defined



Box 2 | Useful internet resources

Gene-prediction programs: comparative genomics
Doublescan: http://www.sanger.ac.uk/Software/analysis/doublescan
SLAM: http://bio.math.berkeley.edu/slam
Twinscan: http://genes.cs.wustl.edu

Gene-prediction programs (many with homology searching capabilities)
GeneMachine: http://genome.nhgri.nih.gov/genemachine
Genscan: http://genes.mit.edu/GENSCAN.html
GenomeScan: http://genes.mit.edu/genomescan
Fgenesh, Fgenes-M, TSSW, TSSG, Polyah, SPL and RNASPL: http://genomic.sanger.ac.uk/gf/gf.shtml
Fgenesh, Fgenes-M, SPL and RNASPL: http://www.softberry.com/berry.phtml
HMMgene: http://www.cbs.dtu.dk/services/HMMgene
Genie: http://www.fruitfly.org/seq_tools/genie.html
GRAIL: http://compbio.ornl.gov/tools/index.shtml
GeneMark: http://www.ebi.ac.uk/genemark
GeneID: http://www1.imim.es/software/geneid/geneid.html#top
GeneParser: http://beagle.colorado.edu/~eesnyder/GeneParser.html
MZEF and POMBE: http://argon.cshl.org/genefinder/
AAT, MZEF with homology: http://genome.cs.mtu.edu/aat.html
MZEF with SpliceProximalCheck: http://industry.ebi.ac.uk/~thanaraj/MZEF-SPC.html
Genesplicer, Glimmer and GlimmerM: http://www.tigr.org/~salzberg
WebGene: http://www.itba.mi.cnr.it/webgene
GenLang: http://www.cbil.upenn.edu/genlang/genlang_home.html
Xpound: ftp://igs-server.cnrs-mrs.fr/pub/Banbury/xpound

Gene-prediction programs: alignment based
Procrustes: http://www-hto.usc.edu/software/procrustes/index.html
GeneWise2: http://www.sanger.ac.uk/Software/Wise2
SplicePredictor: http://bioinformatics.iastate.edu/cgi-bin/sp.cgi
PredictGenes: http://cbrg.inf.ethz.ch/subsection3_1_8.html

Finding ORFs and splice sites
DioGenes: http://www.cbc.umn.edu/diogenes/index.html
OrfFinder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html
YeastGene: http://tubic.tju.edu.cn/cgi-bin/Yeastgene.cgi
CDS: search coding regions: http://bioweb.pasteur.fr/seqanal/interfaces/cds-simple.html
Neural network splice site prediction: http://www.fruitfly.org/seq_tools/splice.html
NetGene2: http://www.cbs.dtu.dk/services/NetGene2

Last exon, promoter or TSS prediction
FirstEF, Core_Promoter, CpG_Promoter, Polyadq and JTEF: http://www.cshl.edu/mzhanglab
Eponine: http://www.sanger.ac.uk/Users/td2/eponine
Neural network promoter prediction: http://www.fruitfly.org/seq_tools/promoter.html
Transcription element search system: http://www.cbil.upenn.edu/tess
Signal Scan: http://bimas.dcrt.nih.gov/molbio/signal

AAT, analysis and annotation tool; ORF, open reading frame; TSS, transcription start site.

FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709
[…] boundaries, we refer to a region as a state and to a boundary as a transition between states). If the condi[…]

The advantage of HMMs is that more states (such as intergenic regions, promoters, UTRs, poly(A) and […]
GENERAL THINGS TO REMEMBER ABOUT (PROTEIN-CODING) GENE PREDICTION SOFTWARE

It is, in general, organism-specific

It works best on genes that are reasonably similar to something seen previously

It finds protein coding regions far better than non-coding regions

In the absence of external (direct) information, alternative forms will not be identified

It is imperfect! (It’s biology, after all…)

HTTP://HARLEQUIN.JAX.ORG/GENOMEANALYSIS/GENEFINDING04.PPT
