
COMPUTATIONAL GENE PREDICTION

CSE/BIMM/BENG 181 MAY 24, 2011 SERGEI L KOSAKOVSKY POND [SPOND@UCSD.EDU]

DEFINITIONS
A gene: a nucleotide sequence that codes for a protein.

Gene prediction: given a genome, locate the beginning and ending position of every gene.
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcg
gctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgg
gatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttgga
atatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagc
tgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgct
aagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcgg
ctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggct
atgcaagctgggatccgatgactatgcttaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgct
aagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaag
ctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtct
tgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttacctt
ggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgc
taagctcatgcgg

CENTRAL DOGMA OF MOLECULAR BIOLOGY
CCTGAGCCAACTATTGATGAA

CCUGAGCCAACUAUUGAUGAA

PEPTIDE

HTTP://UPLOAD.WIKIMEDIA.ORG/WIKIPEDIA/EN/6/68/CENTRAL_DOGMA_OF_MOLECULAR_BIOCHEMISTRY_WITH_ENZYMES.JPG

BRIEF HISTORY
“The central dogma of molecular biology deals with the detailed residue-by-residue transfer of sequential information. It states that such information cannot be transferred from protein to either protein or nucleic acid.” Francis Crick, Nature, 1970

Originally stated in 1958, but questioned in the 1960s due to evidence of viral RNA-to-DNA transfer (shown by H. Temin and others)

CODONS

In 1961 Sydney Brenner and Francis Crick discovered frameshifting mutations

Systematically deleted nucleotides from DNA

Single and double deletions dramatically altered protein product

Effects of triple deletions were minor

Conclusion: every triplet of nucleotides – a codon – maps to exactly one amino acid in a protein

GENETIC CODE

64 codons are mapped to 20 (+stop) amino-acid characters via a genetic code

Genetic codes may differ slightly between organisms and genomes (e.g. nuclear vs mitochondrial)

Multiple and differing redundancies in the genetic code

Synonymous and non-synonymous substitutions are fundamentally different

Amino acid      Codons           Redundancy
Alanine         GC*              4
Cysteine        TGC,TGT          2
Aspartic Acid   GAC,GAT          2
Glutamic Acid   GAA,GAG          2
Phenylalanine   TTC,TTT          2
Glycine         GG*              4
Histidine       CAC,CAT          2
Isoleucine      ATA,ATC,ATT      3
Lysine          AAA,AAG          2
Leucine         CT*,TTA,TTG      6
Methionine      ATG              1
Asparagine      AAC,AAT          2
Proline         CC*              4
Glutamine       CAA,CAG          2
Arginine        AGA,AGG,CG*      6
Serine          AGC,AGT,TC*      6
Threonine       AC*              4
Valine          GT*              4
Tryptophan      TGG              1
Tyrosine        TAC,TAT          2
Stop            TAA,TAG,TGA      3
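
The redundancy column can be checked mechanically. A minimal sketch, assuming the standard nuclear genetic code written as the usual 64-letter string in TCAG codon order; the names `BASES`, `AMINO_ACIDS`, `CODON_TABLE` and `redundancy` are illustrative and not part of the lecture.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
# '*' marks a stop codon. Other codes (e.g. mitochondrial) differ slightly.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Redundancy = number of codons that map to each amino acid (or to stop).
redundancy = Counter(CODON_TABLE.values())

# Alanine, Leucine, Methionine, Stop
print(redundancy["A"], redundancy["L"], redundancy["M"], redundancy["*"])
```

The 21 redundancies sum to 64, one entry per codon.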

SIX READING FRAMES
HIV-1 protease

DNA: CCAATAAGTC CTATTGAAAC TGTACCAGTA ACAAAGCCAG GAATGGATGG
     CCCAAAGGTT AAACAATGGC CATTAACAGA AGAGAAAAAA GC

Protein translation:

In frame: PISPIETVPVTKPGMDGPKVKQWPLTEEKK
+1:       QXVLLKLYQXQSQEWMAQRLNNGHXQKRKK
+2:       NKSYXNCTSNKARNGWPKGXTMAINRREKS

X marks a stop codon, which signals the ribosome to stop protein synthesis.

Reverse complements are complementary DNA strands (opposite direction and complementary bases)

They define 3 other reading frames
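
The six translations can be reproduced in a few lines. A minimal sketch, assuming the standard genetic code and writing stop codons as X as on the slide; the function names (`reverse_complement`, `translate`, `six_frames`) are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; stops are written 'X' to match the slide.
AMINO_ACIDS = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at `offset`; trailing bases are ignored."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(dna)
    return [translate(strand, k) for strand in (dna, rc) for k in range(3)]


# The HIV-1 protease fragment from the slide, in blocks of ten bases.
protease = (
    "CCAATAAGTC" "CTATTGAAAC" "TGTACCAGTA" "ACAAAGCCAG" "GAATGGATGG"
    "CCCAAAGGTT" "AAACAATGGC" "CATTAACAGA" "AGAGAAAAAA" "GC"
)
for frame in six_frames(protease):
    print(frame)
```

The first three printed lines match the in-frame, +1 and +2 translations above; only the in-frame one is free of stop codons.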

CONTIGUOUS VS SPLICED GENES
In bacterial experiments, the sequences of DNA, RNA and protein were collinear; early evidence suggested that eukaryotes followed the same pattern.

In 1977, Phillip Sharp and Richard Roberts experimented with the mRNA of hexon, a viral protein.

They mapped adenovirus hexon mRNA in the viral genome by hybridization to adenovirus DNA and electron microscopy

mRNA-DNA hybrids formed three curious loop structures instead of a contiguous duplex segment

HTTP://NOBELPRIZE.ORG/NOBEL_PRIZES/MEDICINE/LAUREATES/1993/SHARP-LECTURE.PDF

EXONS AND INTRONS

In eukaryotes, a gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)

This makes computational gene prediction in eukaryotes even more difficult

Prokaryotes (e.g. bacteria) don’t have introns - their genes are contiguous.

EUKARYOTIC GENES
To appear in: Proceedings of the Workshop on Knowledge Discovery and Emergent Complexity in Bioinformatics (KDECB & Benelearn'06), Lecture Notes in Bioinformatics, Springer-Verlag, 2007

Fig. 1. The gene-parsing problem. A complete mRNA consists of one or more exons (rectangles). Portions of these exons may be coding (gray) or noncoding (hatched), with only the former giving rise to amino acids during translation. The coding segment extends from a start codon (ATG) to a stop codon (TGA, TAG, or TAA), with one or more introns (GT to AG) in between. Introns are spliced out prior to translation into a protein. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.
FROM “ADVANCING THE STATE OF THE ART IN COMPUTATIONAL GENE PREDICTION”, BY WILLIAM H. MAJOROS, UWE OHLER

A gene parse thus consists of a syntactically valid series of signals from the set {ATG, GT, AG, TGA, TAA, TAG} which have been identified in the input sequence. The necessary syntactic constraints on the …

[Figure: eukaryotic gene expression. In the nucleus: genomic DNA (promoter, exons 1 to 5, ATG, stop, poly(A) site; TSS, TTS) → transcription → pre-mRNA → RNA processing (capping, splicing, polyadenylation) → mRNA (cap, 5′ UTR, CDS, 3′ UTR, poly(A)). In the cytoplasm: RNA transport and translation → protein. Legend: coding sequence (CDS), untranslated (UTR) sequence, polypeptide, ribosome.]

Figure 1 | The central dogma of gene expression. In the typical process of eukaryotic gene expression, a gene is transcribed from DNA to pre-mRNA. mRNA is then produced from pre-mRNA by RNA processing, which includes the capping, splicing and polyadenylation of the transcript. It is then transported from the nucleus to the cytoplasm for translation. TSS, transcription start site; TTS, transcription termination site.

… many good reviews on this topic, and useful benchmarks in the research (for example, REFS 1–8), a truly fair comparison of the prediction programs is impossible as their performance depends crucially on the specific TRAINING DATA that are used to develop them.

… all gene-prediction papers refer to four types of ‘exon’, as shown in FIG. 2b; however, these are just the coding regions of the exons. To avoid the misuse of these terms, I refer to subclasses of exons in this article as 5′ CDS, itexon, 3′ CDS and intronless CDS.

FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709

GENE FINDING APPROACHES

Direct

Close matches to ESTs, cDNA or protein sequences from the same or closely related organism

Computational

Something that matches an already known gene (homology)

Something that matches statistical patterns common to all genes (ab initio)

Hybrid

STATISTICAL APPROACH: METAPHOR IN UNKNOWN LANGUAGE

Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and numerical symbols, could you distinguish between a story and a stock report in a foreign newspaper?

WHAT CAN WE MEASURE ABOUT GENES?

ORF (Open Reading Frame): a sequence that starts with ATG and is terminated by an in-frame stop codon (TAA, TAG or TGA)

Codon Usage: the preference for using specific synonymous codons, most frequently measured by CAI (Codon Adaptation Index)

Features and motifs

Promoters, splice sites, enhancers, untranslated regions (UTRs)

OPEN READING FRAMES

Detect potential coding regions by looking at ORFs

In a given reading frame, a genome of length n comprises about n/3 codons

Stop codons break the genome into segments

The subsegments of these that start from the start codon (ATG) are ORFs

Some ORFs can overlap and code for different genes!

[Figure: a genomic sequence with an open reading frame running from ATG to TGA]
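
The ORF definition translates directly into a scan over the three forward frames. A minimal sketch, assuming a single strand (run it again on the reverse complement for the other three frames); `find_orfs` and its `min_codons` parameter are illustrative names, with the default taken from the ~100 codon rule of thumb.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=100):
    """Return (start, end) pairs, end exclusive and including the stop codon.

    An ORF runs from the first ATG after the previous in-frame stop
    to the next in-frame stop. Length is counted in codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Toy example: one short ORF (ATG AAA TAG) in frame +2.
print(find_orfs("CCATGAAATAGGG", min_codons=2))
```

An ORF that is still open at the end of the sequence is not reported, since it has no terminating stop codon.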

PROKARYOTES OR INTRON-LESS GENES

[Figure: S. cerevisiae ORFs annotated (in 1997) vs all ORFs]

The basic concept is to look for ORFs that ‘look like’ genes:

Initially, long enough (~100 codons or longer)

But short ORFs are actually quite frequent in eukaryotic genes.

Have a believable codon composition, as measured by, e.g., the codon adaptation index (CAI)
SMALL OPEN READING FRAMES: BEAUTIFUL NEEDLES IN THE HAYSTACK
MUNIRA A. BASRAI, PHILIP HIETER, AND JEF D. BOEKE
GENOME RES. 1997. 7: 768-771

Measures the relative abundance or paucity of a particular codon for a given organism/gene.

E.g. in a representative dataset of HIV-1 polymerase sequences, the four codons that map to Alanine have a rather skewed distribution:

Codon   Count
GCA     41576
GCC     9461
GCG     1017
GCT     11031

COMPUTING CAI
Define relative synonymous codon usage (RSCU) for a pair (i,j), where i is an amino acid and j is one of the n_i codons mapping to it, as follows. X_ij is the count of the j-th codon for amino acid i.

$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{k=1}^{n_i} X_{ik}}$$

An RSCU > 1 indicates a preferred codon and an RSCU < 1 an avoided codon.

Further define relative adaptiveness w_ij as:

$$w_{ij} = \frac{\mathrm{RSCU}_{ij}}{\mathrm{RSCU}_{i,\max}} = \frac{X_{ij}}{\max_{k} X_{ik}}$$

Codon   Count   RSCU    w
GCA     41576   2.64    1
GCC     9461    0.60    0.23
GCG     1017    0.064   0.02
GCT     11031   0.70    0.27
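
The Alanine numbers can be recomputed from the counts alone. A minimal sketch; the function name `rscu_and_adaptiveness` is illustrative, and the input is assumed to hold the synonymous codons of a single amino acid.

```python
def rscu_and_adaptiveness(counts):
    """counts: codon -> count, for the synonymous codons of ONE amino acid.

    RSCU divides each count by the mean count of the synonymous set;
    w divides each count by the largest count in the set.
    """
    mean = sum(counts.values()) / len(counts)
    top = max(counts.values())
    rscu = {codon: x / mean for codon, x in counts.items()}
    w = {codon: x / top for codon, x in counts.items()}
    return rscu, w


# Alanine codon counts from the HIV-1 polymerase example.
alanine = {"GCA": 41576, "GCC": 9461, "GCG": 1017, "GCT": 11031}
rscu, w = rscu_and_adaptiveness(alanine)
for codon in alanine:
    print(codon, round(rscu[codon], 2), round(w[codon], 2))
```

Because RSCU only rescales the counts by a constant per amino acid, w can be computed from the raw counts without going through RSCU.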

ORGANISM-WIDE CODON USAGE

SHARP AND LI, 1987

THE CAI OF A GENE
The observed CAI of a sequence with L codons is the geometric mean of the RSCU values of its codons:

$$\mathrm{CAI}_{obs} = \left(\prod_{k=1}^{L} \mathrm{RSCU}_{k}\right)^{1/L}$$

This is compared with the maximum possible CAI over all codon sequences of the same length that code for the same protein sequence, to derive the CAI:

$$\mathrm{CAI}_{max} = \left(\prod_{k=1}^{L} \mathrm{RSCU}_{k,\max}\right)^{1/L}$$

$$\mathrm{CAI} = \mathrm{CAI}_{obs} / \mathrm{CAI}_{max}$$

Here RSCU_k is the RSCU of the codon at position k, and RSCU_{k,max} is the largest RSCU among the codons synonymous with it.
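
Since the ratio of the two products is taken position by position, CAI equals the geometric mean of the relative adaptiveness values w of the gene's codons. A minimal sketch under that reading; the `w` table below is hypothetical and for illustration only, and a codon with w = 0 would need special handling because of the logarithm.

```python
import math


def cai(codons, w):
    """Geometric mean of relative adaptiveness over the codons of a gene.

    Computed in log space to avoid underflow for long genes.
    """
    logs = [math.log(w[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))


# Hypothetical adaptiveness values, not taken from any real organism.
w = {"ATG": 1.0, "GCA": 1.0, "GCC": 0.25}
gene = ["ATG", "GCA", "GCC", "GCC"]
print(cai(gene, w))  # (1 * 1 * 0.25 * 0.25) ** (1/4)
```

A gene that uses only the preferred codon for every amino acid gets CAI = 1.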


CAI DISTRIBUTION IN GENES

SHARP AND LI, 1987

Caveats

Some genes have unusual (for the organism) codon usage patterns

The predictive power of CAI depends on the length of the sequence, and many genes are quite short

A SIMPLE HMM FOR FINDING PROKARYOTIC / INTRON-LESS GENES

In this generalized HMM, some hidden states are allowed to emit a variable-length substring, instead of a single letter.

The idea is that the ‘gene’ state emits the whole sequence, instead of N individual letters.

The length of the substring is drawn from a pre-defined probability function.

The Viterbi algorithm can be extended to deal with this generalization

An ‘atypical gene’ state is needed to deal with, most prominently, horizontally transferred genes.

Figure 1. Hidden Markov model of a prokaryotic nucleotide sequence used in the GeneMark.hmm algorithm. The hidden states of the model are represented as ovals in the figure, and arrows correspond to allowed transitions between the states.

[Background: page 1108 of the GeneMark.hmm paper (Materials and Methods; Model of prokaryotic sequence structure)]

NUCLEIC ACIDS RESEARCH, 1998, VOL. 26, NO. 4

EMISSION LENGTH DISTRIBUTIONS CAN BE DETERMINED EMPIRICALLY

Table 1. Nucleotide frequencies for the RBS model

Nucleotide   Position
             1       2       3       4       5
T            0.161   0.050   0.012   0.071   0.115
C            0.077   0.037   0.012   0.025   0.046
A            0.681   0.105   0.015   0.861   0.164
G            0.077   0.808   0.960   0.043   0.659

The model was derived using the multiple sequence alignment of 325 annotated ribosomal binding sites (see text). Given the set of aligned sequences, the frequency of a given nucleotide was calculated as the number of occurrences of this nucleotide in a given position divided by the total number of sequences.

The finally obtained alignment of the 325 sequences has revealed the RBS sequence pattern in the form of a matrix of positional nucleotide frequencies (Table 1). It is seen that the matrix defines the strong consensus sequence: AGGAG, which is complementary to a pentamer located in the E.coli 16S rRNA near its 3′-end. This observation is in a good agreement with the generally accepted mechanism of ribosome-mRNA binding. Note that a similar result was obtained previously (27). To evaluate a putative RBS we calculated its probabilistic score as the product of corresponding elements of the matrix given in Table 1. The threshold value for RBS score was chosen as 0.00025. It can be shown that the log of this score is proportional to ribosome binding energy (with appropriate sign) under the assumption of independent formation of ribonucleotide pairs.

Figure 2. Length distribution probability densities of protein-coding and non-coding regions derived from the annotated E.coli genomic DNA (histograms). (a) Coding regions; the solid curve is the approximation by the gamma distribution $g(d) = N_c (d/D_c)^2 \exp(-d/D_c)$, where d is the length in nt, $D_c = 300$ nt, and $N_c$ is the coefficient chosen to normalize the distribution function on the interval from 30 nt (the minimal length of coding region) to 7155 nt (the maximal length). (b) Non-coding regions; the solid curve is the approximation by the exponential distribution $f(d) = N_n \exp(-d/D_n)$, where $D_n = 150$ nt. The coefficient $N_n$ normalizes the distribution function on the interval from 1 to 1000 nt.

[Panels: E. COLI CODING; E. COLI NONCODING]

NUCLEIC ACIDS RESEARCH, 1998, VOL. 26, NO. 4

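
The RBS score described in the excerpt is a product of per-position frequencies, which is easy to sketch. The matrix and the 0.00025 threshold are copied from Table 1 and the accompanying text; the function names `rbs_score` and `rbs_candidates` are illustrative and not taken from GeneMark.hmm.

```python
import math

# Positional nucleotide frequencies from Table 1 (positions 1..5).
RBS_MATRIX = {
    "T": [0.161, 0.050, 0.012, 0.071, 0.115],
    "C": [0.077, 0.037, 0.012, 0.025, 0.046],
    "A": [0.681, 0.105, 0.015, 0.861, 0.164],
    "G": [0.077, 0.808, 0.960, 0.043, 0.659],
}
RBS_THRESHOLD = 0.00025  # threshold quoted in the text


def rbs_score(pentamer):
    """Product of the matrix entries selected by the five bases."""
    return math.prod(RBS_MATRIX[b][i] for i, b in enumerate(pentamer.upper()))


def rbs_candidates(dna):
    """Slide a 5-nt window and keep positions scoring at or above the threshold."""
    hits = []
    for i in range(len(dna) - 4):
        score = rbs_score(dna[i:i + 5])
        if score >= RBS_THRESHOLD:
            hits.append((i, score))
    return hits


print(round(rbs_score("AGGAG"), 4))       # score of the consensus pentamer
print(rbs_candidates("CCCCCAGGAGCCCCC"))  # only the window at position 5 passes
```

Summing logarithms instead of multiplying gives the log-score that the text relates to binding energy.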
SPLICE SITE DETECTION

Exon boundaries are signaled by donor and acceptor sites: an intron usually begins with a GT dinucleotide (donor) and ends with an AG dinucleotide (acceptor)

Detecting these sites is difficult, because GT and AG appear very often

[Figure: exon 1, donor site (GT), intron, acceptor site (AG), exon 2]
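
To see why the dinucleotides alone are weak signals, one can simply list every candidate site. A minimal sketch, assuming the canonical GT (donor) and AG (acceptor) dinucleotides; `candidate_splice_sites` is an illustrative name. In a uniformly random sequence each dinucleotide is expected at roughly 1 in 16 positions.

```python
def candidate_splice_sites(dna):
    """Return 0-based positions of every GT (donor) and AG (acceptor) dinucleotide."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


donors, acceptors = candidate_splice_sites("CAGGTAAGTCTTTCAGGT")
print("donor candidates:", donors)
print("acceptor candidates:", acceptors)
```

Even this 18-nt toy string yields three candidates of each kind, so the flanking sequence context has to be scored as well.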

Features of Spliceosome Evolution and Function Inferred from an Analysis of the Information at Human Splice Sites
R. Michael Stephens and Thomas Dana Schneider

National Cancer Institute, Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, P.O. Box B, Frederick, MD 21702-1201, U.S.A.
Linganore High School, 12013 Old Annapolis Rd., Frederick, MD 21701, U.S.A.

(Received 8 November 1991; accepted 19 August 1992)

An information analysis of the 5′ (donor) and 3′ (acceptor) sequences spanning the ends of nearly 1800 human introns has provided evidence for structural features of splice sites that bear upon spliceosome evolution and function: (1) 82% of the sequence information (i.e. sequence conservation) at donor junctions and 97% of the sequence information at acceptor junctions is confined to the introns, allowing codon choices throughout exons to be largely unrestricted. The distribution of information at intron-exon junctions is also described in detail and compared with footprints. (2) Acceptor sites are found to possess enough information to be located in the transcribed portion of the human genome, whereas donor sites possess about one bit less than the information needed to locate them independently. This difference suggests that acceptor sites are located first in humans and, having been located, reduce by a factor of two the number of alternative sites available as donors. Direct experimental evidence exists to support this conclusion. (3) The sequences of donor and acceptor splice sites exhibit a striking similarity. This suggests that the two junctions derive from a common ancestor and that during evolution the information of both sites shifted onto the intron. If so, the protein and RNA components that are found in contemporary spliceosomes, and which are responsible for recognizing donor and acceptor sequences, should also be related. This conclusion is supported by the common structures found in different parts of the spliceosome.

Keywords: splice; spliceosome; information theory; evolution; human

9.35 BITS (POSITIONS -25 TO +2)
7.92 BITS (POSITIONS -3 TO +6)

Figure 1. Information curves and sequence logos for human spliceosome binding sites. The left half of the Figure shows the donor splice sites from position -8 to position +17, while the right half shows the -30 to +10 region around the acceptor sites. Position zero on both curves is the point on the intron adjacent to the splice point, i.e. on the 5′ side, the intron is cut immediately before position zero while on the 3′ side it is cut immediately after position zero. (These are the co-ordinates provided by GenBank.) In the matrix corresponding to each graph, the bottom row, labeled 1, contains …

“At the core of most gene recognition algorithms is one or more coding measures – functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of ‘typical’ exonic DNA ... attention can probably be limited to six of the twenty or so measures proposed to date”

Evaluated how well different measures performed in recovering known coding sequences (human and E. coli) based on organism-specific training.

Applied linear discriminant analysis to train each method


LDA SPECIFICITY AND SENSITIVITY

HTTP://SCIEN.STANFORD.EDU/CLASS/EE368/PROJECTS2000/PROJECT15/ALGORITHMS.HTML

HTTP://HARLEQUIN.JAX.ORG/GENOMEANALYSIS/GENEFINDING04.PPT

FROM FICKETT AND TUNG 1992
[Figure panels: (SP+SN)/2; MEASURE REDUNDANCY]

Hexamer-based measures come out on top. They are based on the frequencies of 6-mers in one of the frames (0, +1, +2). They are highly predictive because they capture the codon structure, codon usage bias, initiation sites and higher-order co-dependencies.

Pseudogenes can confuse even the best protein-trained approaches.

EXAMPLES OF OTHER FEATURES
E.g. promoters in prokaryotes (E. coli)

Transcription starts at oﬀset 0.

Pribnow Box (-10)

Gilbert Box (-30)

Ribosomal Binding Site (+10)

TRANSCRIPT ASSEMBLY
Once individual ORFs and splice sites have been identified, they must be assembled into a full transcript.

Could be done with dynamic programming, or HMMs, for example.

Models need to incorporate relevant biological knowledge.

[Background: excerpt from the Zhang review]

In vertebrates, the internal exons are small (~140 nucleotides on average), whereas introns are typically much larger (with some being more than 100 kb in length). In 1990, the ‘exon-definition’ model was proposed to explain how the splicing machinery recognizes exons in a ‘sea’ of intronic DNA, where many cryptic splice sites exist. This model has since been validated by many experiments, and it proposes that an internal exon is initially recognized by the presence of a chain of interacting splicing factors that span it (FIG. 3). The binding of these trans-acting factors to the pre-mRNA is responsible for the non-random nucleotide patterns that form the molecular basis for all exon-recognition algorithms.

These sequence features are often divided into two types: ‘signals’, which correspond to short cis-elements or boundary sites (such as splice sites and branch sites); and ‘content’, which corresponds to the extended functional regions (such as exons and introns). To evaluate each feature, one needs to define a scoring function of the feature (also called a feature variable). The best scoring function is the conditional probability P(a|s) that the given sequence s contains the feature a. According to the Bayes equation, P(a|s) = P(s|a)P(a)/P(s), where P(s|a) is the likelihood of s containing a. So, a training sample (sequence set) with the known feature a is built, and then the occurrence of a particular sequence s is counted. Different features can then be integrated into a single score for the whole object (an itexon in this case).

Genes are predicted by finding the gene structure that has the highest score, given the sequence. Approaches differ in their choice of features, scoring functions and integration methods. Once the problem is phrased as a statistical-pattern recognition problem, many statistical or machine learning tools are available for recognizing these patterns. Indeed, almost all of them have been applied to the exon (or gene)-recognition problem. Here, I review just a few generic or popular approaches.

Most early programs used the simple positional weight matrix method (WMM, see BOX 1) to identify splice-site signals. In recent programs, the correlation among positions in a signal is also explored. The weight array method (WAM) or Markov models (BOX 1) are used to explore adjacent correlations; decision-tree or maximal-dependence decomposition (MDD) methods are used to explore non-adjacent correlations; and artificial neural network (ANN) methods are used to explore arbitrary, nonlinear dependencies. These more complex models typically yield significant, but not marked, improvements over the simple WMM.

[Figure 2: (a) exon classification: 5′ exons (5′ uexon, 5′ utexon, 5′ utuexon), internal exons (iuexon, iutexon, itexon, ituexon, iutuexon), 3′ exons (3′ uexon, 3′ tuexon, 3′ utuexon) and intronless genes, delimited by TSS, GT, AG and poly(A) signals; (b) CDS misclassification]

Figure 3 | Exon-definition model. Typically, in vertebrates, exons are much shorter than introns. According to the exon-definition model, before introns are recognized and spliced out, each exon is initially recognized by the protein factors that form a bridge across it. In this way, each exon, together with its flanking sequences, forms a molecular, as well as a computational, recognition module (arrows indicate molecular interactions). Modified with permission from REF. 26 © (2002) Macmillan Magazines Ltd. CBC, cap-binding complex; CFI/II, cleavage factor I/II; CPSF, cleavage and polyadenylation specificity factor; CstF, the cleavage stimulation factor; PAP, poly(A) polymerase; snRNP, small nuclear RNP; SR, SR protein; U2AF, U2 small nuclear ribonucleoprotein particle (snRNP) auxiliary factor.

FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709

SOME SIMPLE ASSEMBLY RULES

A gene parse thus consists of a syntactically valid series of signals from the set {ATG, GT, AG, TGA, TAA, TAG} which have been identified in the input sequence. The necessary syntactic constraints on the parse of a genomic sequence are:

ATG → TAG
ATG → GT
GT → AG
AG → GT
AG → TAG
TAG → ATG

where the rule X → Y indicates that signal X may be followed by signal Y in a syntactically valid parse (rules for genes on the opposite DNA strand are easily obtained from these). The set of all valid parses for a given input sequence may be represented using a parse graph (Fig. 2) in which vertices represent putative signals and edges represent possible exons, introns, and intergenic regions.

Fig. 2. An example parse graph. Vertices are shown as dinucleotide or trinucleotide motifs at the bottom. Edges denote exons, introns, or intergenic regions. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.

FROM “ADVANCING THE STATE OF THE ART IN COMPUTATIONAL GENE PREDICTION”, BY WILLIAM H. MAJOROS, UWE OHLER
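
The six rules amount to a small transition table, so checking a proposed series of signals is a one-pass scan. A minimal sketch, assuming that TAG in the rules stands for any of the three stop codons (an assumption made here, not stated on the slide); `ALLOWED` and `is_valid_parse` are illustrative names.

```python
# Allowed successor signals, one entry per rule X -> Y.
ALLOWED = {
    "ATG": {"TAG", "GT"},   # single-exon gene, or start of first intron
    "GT": {"AG"},           # an intron runs from donor to acceptor
    "AG": {"GT", "TAG"},    # internal exon, or final exon
    "TAG": {"ATG"},         # intergenic region before the next gene
}
STOPS = {"TAA", "TAG", "TGA"}


def is_valid_parse(signals):
    """Check every adjacent pair of signals against the rule table."""
    # Assumption: any stop codon is treated as the generic stop signal 'TAG'.
    norm = ["TAG" if s in STOPS else s for s in signals]
    return all(b in ALLOWED.get(a, ()) for a, b in zip(norm, norm[1:]))


print(is_valid_parse(["ATG", "GT", "AG", "TGA"]))  # one intron: valid
print(is_valid_parse(["ATG", "AG"]))               # acceptor without donor: invalid
```

This only checks signal order; a real gene finder would also enforce reading-frame consistency across exons and score each edge of the parse graph.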
USING KNOWN GENES TO PREDICT NEW GENES

Some genomes may be very well-studied, with experimentally verified genes.

Closely-related organisms may have similar genes

Unknown genes in one species may be compared to genes in a sufficiently closely-related species

The idea is that gene structure is on average quite stable.


SIMILARITY-BASED APPROACH TO GENE PREDICTION

Genes in different organisms are similar.

The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome.

Problem: Given a known gene and an un-annotated genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene.


COMPARING GENES IN TWO GENOMES

Small islands of similarity correspond to similarities between exons.


USING SIMILARITIES TO FIND THE EXON STRUCTURE

The known frog gene is aligned to different locations in the human genome.

Find the “best” path to reveal the exon structure of the human gene.

Start with a local alignment to find putative exons.

[Figure: frog genes (known) aligned against the human genome.]
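The local-alignment step can be sketched with a small Smith-Waterman implementation that returns the best-matching span of the genome for a given gene. This is an illustration under stated assumptions, not the procedure used in the lecture: the function name `best_local_alignment` and the scoring values (match +1, mismatch -1, gap -2) are hypothetical, and the sequences are compared at the nucleotide level.

```python
def best_local_alignment(gene, genome, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (start, end, score), where genome[start:end] is the genomic
    span of the highest-scoring local alignment against `gene`.
    """
    rows, cols = len(gene) + 1, len(genome) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)

    def diagonal(i, j):
        # Score of aligning gene[i-1] with genome[j-1] on top of cell (i-1, j-1).
        step = match if gene[i - 1] == genome[j - 1] else mismatch
        return score[i - 1][j - 1] + step

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0, diagonal(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_cell
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == diagonal(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return j, end, best


start, end, weight = best_local_alignment("ACGTACGT", "TTTTACGTACGTTTTT")
print(start, end, weight)
```

Each span that scores above a chosen cutoff becomes one putative exon; finding several of them would require masking the reported span and repeating the search, which this sketch does not do.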


CHAINING LOCAL ALIGNMENTS

Find substrings that match a given gene sequence (candidate exons); use a cutoff to define significance.

Define a candidate exon as a triple (l, r, w): left end, right end, and weight, where the weight is the score of the local alignment.

Look for a maximum chain of substrings; a chain is a set of non-overlapping, non-adjacent intervals.
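A candidate exon (l, r, w) and the chain condition can be written down directly. In the sketch below the names `CandidateExon`, `is_chain` and `chain_weight` are hypothetical, and one assumption is made explicit: intervals are treated as closed ranges of positions, so two intervals that share an endpoint count as overlapping.

```python
from typing import NamedTuple


class CandidateExon(NamedTuple):
    l: int      # left end (position in the genome)
    r: int      # right end (inclusive)
    w: float    # weight: score of the local alignment


def is_chain(exons):
    """True if no two intervals share a position (assumed chain condition)."""
    ordered = sorted(exons)
    return all(prev.r < nxt.l for prev, nxt in zip(ordered, ordered[1:]))


def chain_weight(exons):
    """Total weight of a set of candidate exons."""
    return sum(e.w for e in exons)


a = CandidateExon(2, 5, 5)
b = CandidateExon(7, 12, 9)
c = CandidateExon(4, 9, 15)
print(is_chain([a, b]), chain_weight([a, b]))
print(is_chain([a, c]))
```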


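EXON CHAINING PROBLEM

Locate the beginning and end of each interval (2n points for n intervals).

Find the “best”, i.e. maximum-weight, path.

[Figure: seven weighted intervals (weights 5, 5, 15, 9, 11, 4, 3) on an axis with endpoints at 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32; two example chains are labelled SCORE=18 and SCORE=19.]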


EXON CHAINING PROBLEM: FORMULATION

Exon Chaining Problem: Given a set of putative exons, find a maximum-weight set of non-overlapping putative exons.

Input: A set of weighted intervals (putative exons).

Output: A maximum-weight chain of intervals from this set.

Would a greedy algorithm solve this problem?
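A small counterexample suggests the answer is no. The slide does not say which greedy rule is meant; the sketch below assumes a "heaviest interval first" rule, and the three intervals are made up for illustration. The exhaustive search is only there to confirm the optimum on this tiny input.

```python
from itertools import combinations


def overlaps(a, b):
    """Closed intervals (l, r, w) overlap if they share any position."""
    return not (a[1] < b[0] or b[1] < a[0])


def greedy_chain(intervals):
    """Assumed greedy rule: repeatedly take the heaviest compatible interval."""
    chosen = []
    for cand in sorted(intervals, key=lambda e: e[2], reverse=True):
        if all(not overlaps(cand, c) for c in chosen):
            chosen.append(cand)
    return sorted(chosen)


def best_chain_bruteforce(intervals):
    """Try every subset; only feasible for a handful of intervals."""
    best = []
    for k in range(1, len(intervals) + 1):
        for subset in combinations(intervals, k):
            compatible = all(not overlaps(a, b)
                             for a, b in combinations(subset, 2))
            if compatible and sum(e[2] for e in subset) > sum(e[2] for e in best):
                best = list(subset)
    return sorted(best)


# Hypothetical input: one heavy interval straddling two lighter ones.
exons = [(3, 8, 10), (1, 4, 6), (6, 10, 6)]
print(sum(e[2] for e in greedy_chain(exons)))           # greedy total
print(sum(e[2] for e in best_chain_bruteforce(exons)))  # optimal total
```

Here the greedy rule commits to the weight-10 interval and can add nothing else, while the two weight-6 intervals together give 12.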


Use a graph representation of the exon chaining problem.

Can be solved in O(n) time using dynamic programming, once the 2n interval endpoints are in sorted order.

ExonChaining (G, n)   // graph, number of intervals
  for i ← 0 to 2n
    s[i] ← 0
  for i ← 1 to 2n
    if vertex v[i] in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of interval I
      w ← weight of interval I
      s[i] ← max {s[j] + w, s[i-1]}
    else
      s[i] ← s[i-1]
  return s[2n]

[Figure: on the example intervals, the greedy choice gives a chain of weight 17 (GREEDY: 17), while the best chain has weight 21 (BEST: 21).]
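The ExonChaining pseudocode translates almost line by line into Python. In this sketch the function name `exon_chaining` and the input format (a list of `(l, r, w)` tuples) are assumptions. Two details go beyond the slide: the endpoints are sorted first, which costs O(n log n) before the linear pass, and when a left end and a right end fall on the same position the left end is placed first, so intervals sharing a position are never chained.

```python
def exon_chaining(intervals):
    """Maximum total weight of a chain of non-overlapping intervals.

    `intervals` is a list of (l, r, w) tuples with l <= r.
    """
    # Build the 2n endpoints; kind 0 = left end, kind 1 = right end.
    endpoints = []
    for idx, (l, r, _w) in enumerate(intervals):
        endpoints.append((l, 0, idx))
        endpoints.append((r, 1, idx))
    endpoints.sort()  # ties: left ends sort before right ends

    s = [0] * (len(endpoints) + 1)   # s[0] = 0 is the empty prefix
    left_index = {}                  # interval index -> position of its left end
    for i, (_pos, kind, idx) in enumerate(endpoints, start=1):
        if kind == 0:
            left_index[idx] = i
            s[i] = s[i - 1]
        else:
            j = left_index[idx]
            w = intervals[idx][2]
            # Either end a chain with this interval or skip it.
            s[i] = max(s[j] + w, s[i - 1])
    return s[-1]


print(exon_chaining([(3, 8, 10), (1, 4, 6), (6, 10, 6)]))
```

The function returns only the weight of the best chain; recovering the chain itself would need a back-pointer stored whenever `s[j] + w` wins the maximum.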


EXON CHAINING: DEFICIENCIES

[Figure: frog genes (known) aligned against the human genome.]

Poor definition of the putative exon endpoints.

The optimal chain of intervals may not correspond to any valid alignment.

The first interval may correspond to a suffix of the known gene, whereas the second interval may correspond to a prefix.

A combination of such intervals is not a valid alignment.

SPLICED ALIGNMENT

Mikhail Gelfand and colleagues proposed a spliced alignment approach that uses a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome.

It begins by selecting either all putative exons between potential acceptor and donor sites, or all substrings similar to the target protein (as in the Exon Chaining Problem).

This set is then filtered in a way that attempts to retain all true exons, along with some false ones.


SPLICED ALIGNMENT PROBLEM: FORMULATION

Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence.

Input: A genomic sequence G, a target sequence T, and a set of candidate exons B.

Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximum among all chains of blocks from B.

Γ*: the concatenation of all exons from chain Γ.
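The problem statement can be made concrete with a brute-force sketch: enumerate every chain of non-overlapping blocks, concatenate it into Γ*, and score it against T by global alignment. This is not the dynamic-programming method of Gelfand and colleagues; it is exponential in the number of blocks and serves only to illustrate the formulation. The function names, the half-open `(start, end)` block format, the scoring values and the toy sequences are all assumptions.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score with a linear gap penalty (two-row version)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + step, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def spliced_alignment_bruteforce(genome, target, blocks):
    """Return (best_score, best_chain) over all chains of blocks.

    `blocks` are half-open (start, end) spans of `genome`.
    """
    ordered = sorted(blocks)
    best_score, best_chain = None, []
    for k in range(1, len(ordered) + 1):
        for chain in combinations(ordered, k):
            # Consecutive blocks must not overlap (chain is sorted by start).
            if any(a[1] > b[0] for a, b in zip(chain, chain[1:])):
                continue
            spliced = "".join(genome[s:e] for s, e in chain)   # this is Γ*
            score = global_alignment_score(spliced, target)
            if best_score is None or score > best_score:
                best_score, best_chain = score, list(chain)
    return best_score, best_chain


# Toy input: two true exons, one decoy block, one truncated block.
G = "CCATGGTAAGTTTCAGGCTTAACC"
T = "ATGGCTTAA"
B = [(2, 5), (7, 10), (16, 22), (18, 22)]
print(spliced_alignment_bruteforce(G, T, B))
```

Note that the score is assigned to the chain as a whole: no individual block has a weight until it is placed in a chain and aligned against T.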


EXON CHAINING VS SPLICED ALIGNMENT

In Spliced Alignment, every path spells out the string obtained by concatenating the labels of its edges. The weight of a path is defined as the optimal alignment score between the concatenated labels (blocks) and the target sequence.

This defines the weight of an entire path in the graph, but not the weights of individual edges.

Exon Chaining assumes that the positions and weights of exons are pre-defined.


Box 2 | Useful internet resources

Gene-prediction programs: comparative genomics
Doublescan: http://www.sanger.ac.uk/Software/analysis/doublescan
SLAM: http://bio.math.berkeley.edu/slam
Twinscan: http://genes.cs.wustl.edu

Gene-prediction programs (many with homology searching capabilities)
GeneMachine: http://genome.nhgri.nih.gov/genemachine
Genscan: http://genes.mit.edu/GENSCAN.html
GenomeScan: http://genes.mit.edu/genomescan
Fgenesh, Fgenes-M, TSSW, TSSG, Polyah, SPL and RNASPL: http://genomic.sanger.ac.uk/gf/gf.shtml
Fgenesh, Fgenes-M, SPL and RNASPL: http://www.softberry.com/berry.phtml
HMMgene: http://www.cbs.dtu.dk/services/HMMgene
Genie: http://www.fruitfly.org/seq_tools/genie.html
GRAIL: http://compbio.ornl.gov/tools/index.shtml
GeneMark: http://www.ebi.ac.uk/genemark
GeneID: http://www1.imim.es/software/geneid/geneid.html#top
GeneParser: http://beagle.colorado.edu/~eesnyder/GeneParser.html
MZEF and POMBE: http://argon.cshl.org/genefinder/
AAT, MZEF with homology: http://genome.cs.mtu.edu/aat.html
MZEF with SpliceProximalCheck: http://industry.ebi.ac.uk/~thanaraj/MZEF-SPC.html
Genesplicer, Glimmer and GlimmerM: http://www.tigr.org/~salzberg
WebGene: http://www.itba.mi.cnr.it/webgene
GenLang: http://www.cbil.upenn.edu/genlang/genlang_home.html
Xpound: ftp://igs-server.cnrs-mrs.fr/pub/Banbury/xpound

Gene-prediction programs: alignment based
Procrustes: http://www-hto.usc.edu/software/procrustes/index.html
GeneWise2: http://www.sanger.ac.uk/Software/Wise2
SplicePredictor: http://bioinformatics.iastate.edu/cgi-bin/sp.cgi
PredictGenes: http://cbrg.inf.ethz.ch/subsection3_1_8.html

Finding ORFs and splice sites
DioGenes: http://www.cbc.umn.edu/diogenes/index.html
OrfFinder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html
YeastGene: http://tubic.tju.edu.cn/cgi-bin/Yeastgene.cgi
CDS: search coding regions: http://bioweb.pasteur.fr/seqanal/interfaces/cds-simple.html
Neural network splice site prediction: http://www.fruitfly.org/seq_tools/splice.html
NetGene2: http://www.cbs.dtu.dk/services/NetGene2

Last exon, promoter or TSS prediction
FirstEF, Core_Promoter, CpG_Promoter, Polyadq and JTEF: http://www.cshl.edu/mzhanglab
Eponine: http://www.sanger.ac.uk/Users/td2/eponine
Neural network promoter prediction: http://www.fruitfly.org/seq_tools/promoter.html
Transcription element search system: http://www.cbil.upenn.edu/tess
Signal Scan: http://bimas.dcrt.nih.gov/molbio/signal

AAT, analysis and annotation tool; ORF, open reading frame; TSS, transcription start site.

FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709
GENERAL THINGS TO REMEMBER ABOUT (PROTEIN-CODING) GENE PREDICTION SOFTWARE

It is, in general, organism-specific.

It works best on genes that are reasonably similar to something seen previously.

It finds protein-coding regions far better than non-coding regions.

In the absence of external (direct) information, alternative forms will not be identified.

It is imperfect! (It’s biology, after all…)

HTTP://HARLEQUIN.JAX.ORG/GENOMEANALYSIS/GENEFINDING04.PPT
