Computational Gene Prediction: CSE/BIMM/BENG 181, May 24, 2011
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
HTTP://UPLOAD.WIKIMEDIA.ORG/WIKIPEDIA/EN/6/68/CENTRAL_DOGMA_OF_MOLECULAR_BIOCHEMISTRY_WITH_ENZYMES.JPG
Protein translation:
In frame: PISPIETVPVTKPGMDGPKVKQWPLTEEKK
+1: QXVLLKLYQXQSQEWMAQRLNNGHXQKRKK
+2: NKSYXNCTSNKARNGWPKGXTMAINRREKS
X marks a stop codon which signals the ribosome to stop protein
synthesis.
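The three-frame translation shown above can be reproduced with a short sketch. It assumes the standard genetic code and, as on the slide, writes X for a stop codon; the function names are illustrative only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; 'X' marks a stop codon.
AMINO = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(seq, frame=0):
    """Translate a DNA or RNA string starting at offset `frame` (0, 1 or 2)."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons containing symbols other than A, C, G, T are reported as '?'.
    return "".join(CODON_TABLE.get(c, "?") for c in codons)

def three_frames(seq):
    """Return the translations for the in-frame, +1 and +2 reading frames."""
    return {frame: translate(seq, frame) for frame in (0, 1, 2)}

if __name__ == "__main__":
    for frame, peptide in three_frames("CCUGAGCCAACUAUUGAUGAA").items():
        print(frame, peptide)
```

Applied to the RNA string from the slide, frame 0 spells out PEPTIDE, while the +2 frame runs into stop codons almost immediately.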
In 1977, Phillip Sharp and Richard Roberts experimented with mRNA of hexon, a
viral protein.
HTTP://NOBELPRIZE.ORG/NOBEL_PRIZES/MEDICINE/LAUREATES/1993/SHARP-LECTURE.PDF
Figure: The gene-parsing problem. A complete mRNA consists of one or more exons (rectangles). Portions of these exons may be coding (gray) or noncoding (hatched), with only the former giving rise to amino acids during translation. The coding segment extends from a start codon (ATG) to a stop codon (TGA, TAG, or TAA), with one or more introns (GT to AG) in between. Introns are spliced out prior to translation into a protein. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.
FROM “ADVANCING THE STATE OF THE ART IN COMPUTATIONAL GENE PREDICTION”, BY WILLIAM H. MAJOROS, UWE OHLER
A gene parse thus consists of a syntactically valid series of signals from the set {ATG, GT, AG, TGA, TAA, TAG} which have been identified in the input sequence. The necessary syntactic constraints on the ...
CSE/BIMM/BENG 181 MAY 24, 2011 SERGEI L KOSAKOVSKY POND [SPOND@UCSD.EDU]
[Figure 1 labels: genomic DNA in the nucleus (promoter, ATG, exons 1 to 5, stop, poly(A) site) → pre-mRNA → RNA processing (capping, splicing, polyadenylation) → mRNA (cap, AUG, stop, poly(A)) → RNA transport and translation in the cytoplasm → protein.]
Figure 1 | The central dogma of gene expression. In the typical process of eukaryotic gene expression, a gene is transcribed
from DNA to pre-mRNA. mRNA is then produced from pre-mRNA by RNA processing, which includes the capping, splicing and
polyadenylation of the transcript. It is then transported from the nucleus to the cytoplasm for translation. TSS, transcription start site;
TTS, transcription termination site.
... many good reviews on this topic, and useful benchmarks in the research (for example, REFS 1–8), a truly fair comparison of the prediction programs is impossible as their performance depends crucially on the specific TRAINING DATA that are used to develop them.
... all gene-prediction papers refer to four types of ‘exon’, as shown in FIG. 2b; however, these are just the coding regions of the exons. To avoid the misuse of these terms, I refer to subclasses of exons in this article as 5′ CDS, itexon, 3′ CDS and intronless CDS.
FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709
GENE FINDING APPROACHES
Direct
Computational
Something that matches statistical patterns common to all genes (ab initio)
Hybrid
Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and
numerical symbols, could you distinguish between a story and a
stock report in a foreign newspaper?
The subsegments of these that start from the Start codon (ATG) are ORFs
ATG TGA
Genomic Sequence
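A minimal sketch of an ORF scan along these lines: for each of the three forward reading frames, it reports every stretch that starts at an ATG and runs to the next in-frame stop codon. The function name, the `min_codons` parameter and the choice to scan only the forward strand are assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for forward-strand ORFs.

    `start` is the index of the A in ATG, `end` is one past the last base
    of the stop codon, and `min_codons` counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG after the previous stop
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG in this frame
    return orfs

if __name__ == "__main__":
    print(find_orfs("CCATGAAATGATT"))
```

A real gene finder would also scan the reverse complement and impose a much larger minimum length, since short ORFs occur by chance.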
Codon Count
GCA 41576
GCC 9461
GCG 1017
GCT 11031
COMPUTING CAI
Define relative synonymous codon usage (RSCU) for a pair (i,j),
where i is an amino acid and j is one of the n_i codons mapping to it, as
$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{k=1}^{n_i} X_{ik}},$$
where X_{ij} is the count of the j-th codon for amino acid i.
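As a worked sketch, the alanine codon counts from the table above can be turned into RSCU values and CAI weights. This assumes the usual definitions: RSCU is a codon's count divided by the mean count over its synonymous codons, the relative adaptiveness w is the count divided by the largest synonymous count, and CAI is the geometric mean of w over the codons of a gene. The function names are illustrative.

```python
import math

# Alanine codon counts copied from the codon-count table above.
ALA_COUNTS = {"GCA": 41576, "GCC": 9461, "GCG": 1017, "GCT": 11031}

def rscu(counts):
    """RSCU for one amino acid: count divided by the mean synonymous count."""
    mean = sum(counts.values()) / len(counts)
    return {codon: x / mean for codon, x in counts.items()}

def relative_adaptiveness(counts):
    """w = count divided by the count of the most frequent synonymous codon."""
    best = max(counts.values())
    return {codon: x / best for codon, x in counts.items()}

def cai(codons, weights):
    """Geometric mean of w over the codons that have a weight."""
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))

if __name__ == "__main__":
    print(rscu(ALA_COUNTS))
    w = relative_adaptiveness(ALA_COUNTS)
    print(cai(["GCA", "GCT", "GCA"], w))
```

The RSCU values of one amino acid always sum to its number of synonymous codons (4 for alanine), which is a quick sanity check on the counts.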
Caveats
Some genes have unusual (for the organism) codon usage patterns
MATERIALS AND METHODS
Materials
We have used DNA sequences of the complete genomes of H.influenzae (GenBank accession no. L42023), M.genitalium (L43967), M.jannaschii (L77117), M.pneumoniae (U00089), Synechocystis PCC6803 (synecho), E.coli (U00096), H.pylori (AE000511), M.thermoauthotrophicum (AE000666), B.subtilis (AL009126), Archeoglobus fulgidus (AE000782). The data on annotated E.coli RBS were provided by W. Hayes (22). The data on experimentally verified N-terminal protein sequences were kindly provided by A. Link (23). The Markov models parameters were obtained from the GeneMark library (http://exon.biology.gatech.edu/~genmark/matrices/).
The HMM framework of GeneMark.hmm, the logic of transitions between hidden Markov states, followed the logic of the genetic structure of the bacterial genome (Fig. 1). The Markov models of coding and non-coding regions were incorporated into the HMM framework to generate stretches of DNA sequence ...
Model of prokaryotic sequence structure
The architecture of the hidden Markov model used in the GeneMark.hmm algorithm is shown in Figure 1. To deal ...
Figure 1. Hidden Markov model of a prokaryotic nucleotide sequence used in the GeneMark.hmm algorithm. The hidden states of the model are represented as ovals in the figure, and arrows correspond to allowed transitions between the states.
Nucleic Acids Research, 1998, Vol. 26, No. 4
The idea is that the ‘gene’ state emits the whole sequence, instead of N individual letters.
The length of the substring is drawn from a pre-defined probability function.
The Viterbi algorithm can be extended to deal with this generalization.
Atypical genes must also be dealt with, most prominently those acquired by horizontal transfer.
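The idea that a state emits a whole substring whose length is drawn from a length distribution can be sketched as a segment-based Viterbi recursion. This is a toy illustration, not the GeneMark.hmm implementation: the two states, the per-base emission probabilities and the 10 nt length scales are assumptions, and only the shapes of the length densities (exponential for non-coding, (d/D)^2 exp(-d/D) for coding) follow the GeneMark.hmm Figure 2 caption quoted in these notes.

```python
import math

def log_exp_len(d, scale):
    """Log of an exponential length density with mean `scale`."""
    return -math.log(scale) - d / scale

def log_gamma_len(d, scale):
    """Log of a gamma density proportional to (d/scale)^2 * exp(-d/scale)."""
    return 2 * math.log(d / scale) - d / scale - math.log(2 * scale)

def segment_viterbi(seq, emit, length_logp, max_len=None):
    """Best segmentation of `seq` into alternating states.

    emit[s][base] is a per-base emission probability and length_logp[s](d)
    the log-probability of a segment of length d in state s.
    Returns a list of (state, start, end) with `end` exclusive.
    """
    n = len(seq)
    max_len = max_len or n
    states = list(emit)
    # prefix[s][t] = log-probability of emitting seq[:t] in state s.
    prefix = {s: [0.0] for s in states}
    for s in states:
        for ch in seq:
            prefix[s].append(prefix[s][-1] + math.log(emit[s][ch]))
    best = [{s: float("-inf") for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_len, t) + 1):
                start = t - d
                if start == 0:
                    prev_state, prev_score = None, 0.0
                else:
                    # Adjacent segments must be in different states.
                    prev_state = max((p for p in states if p != s),
                                     key=lambda p: best[start][p])
                    prev_score = best[start][prev_state]
                score = (prev_score + length_logp[s](d)
                         + prefix[s][t] - prefix[s][start])
                if score > best[t][s]:
                    best[t][s] = score
                    back[t][s] = (start, prev_state)
    segments, t = [], n
    state = max(states, key=lambda s: best[n][s])
    while t > 0:
        start, prev_state = back[t][state]
        segments.append((state, start, t))
        t, state = start, prev_state
    return segments[::-1]

if __name__ == "__main__":
    emit = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
            "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
    lengths = {"N": lambda d: log_exp_len(d, 10),
               "C": lambda d: log_gamma_len(d, 10)}
    print(segment_viterbi("AT" * 10 + "GC" * 15 + "AT" * 10, emit, lengths))
```

The recursion costs time proportional to the sequence length times the maximum segment length, which is why practical implementations cap `max_len`.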
Figure 2. Length distribution probability densities of protein-coding and non-coding regions derived from the annotated E.coli genomic DNA (histograms). Panels: E. COLI CODING (a) and E. COLI NONCODING (b). (a) Coding regions; the solid curve is the approximation by γ distribution g(d) = Nc (d/Dc)^2 exp(-d/Dc), where d is the length in nt, Dc = 300 nt, Nc is the coefficient chosen to normalize the distribution function on the interval from 30 nt (the minimal length of coding region) to 7155 nt (the maximal length). (b) Non-coding regions; the solid curve is the approximation by exponential distribution f(d) = Nn exp(-d/Dn), where Dn = 150 nt. The coefficient Nn normalizes the distribution function on the interval from 1 to 1000 nt.

Table 1. Nucleotide frequencies for the RBS model

Nucleotide   Position
             1      2      3      4      5
T            0.161  0.050  0.012  0.071  0.115
C            0.077  0.037  0.012  0.025  0.046
A            0.681  0.105  0.015  0.861  0.164
G            0.077  0.808  0.960  0.043  0.659

The model was derived using the multiple sequence alignment of 325 annotated ribosomal binding sites (see text). Given the set of aligned sequences, the frequency of a given nucleotide was calculated as the number of occurrences of this nucleotide in a given position divided by the total number of sequences.

The finally obtained alignment of the 325 sequences has revealed the RBS sequence pattern in the form of a matrix of positional nucleotide frequencies (Table 1). It is seen that the matrix defines the strong consensus sequence: AGGAG, which is complementary to a pentamer located in the E.coli 16S rRNA near its 3′-end. This observation is in a good agreement with the generally accepted mechanism of ribosome-mRNA binding. Note that a similar result was obtained previously (27). To evaluate a putative RBS we calculated its probabilistic score as the product of corresponding elements of the matrix given in Table 1. The threshold value for RBS score was chosen as 0.00025. It can be shown that the log of this score is proportional to ribosome binding energy (with appropriate sign) under the assumption of independent formation of ribonucleotide pairs.
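The RBS scoring rule (the product of the Table 1 frequencies at each position of a pentamer, compared against the 0.00025 threshold) can be sketched as follows. The matrix values are copied from Table 1; the function names and the choice of "greater than or equal" for the threshold comparison are assumptions.

```python
# Positional nucleotide frequencies from Table 1 (positions 1 to 5).
RBS_MATRIX = {
    "T": (0.161, 0.050, 0.012, 0.071, 0.115),
    "C": (0.077, 0.037, 0.012, 0.025, 0.046),
    "A": (0.681, 0.105, 0.015, 0.861, 0.164),
    "G": (0.077, 0.808, 0.960, 0.043, 0.659),
}
THRESHOLD = 0.00025

def rbs_score(pentamer):
    """Product of the matrix entries matching each base of a 5-mer."""
    if len(pentamer) != 5:
        raise ValueError("expected a 5-nucleotide window")
    score = 1.0
    for pos, base in enumerate(pentamer.upper()):
        score *= RBS_MATRIX[base][pos]
    return score

def consensus():
    """Most frequent nucleotide at each of the five positions."""
    return "".join(max(RBS_MATRIX, key=lambda b: RBS_MATRIX[b][p])
                   for p in range(5))

def candidate_rbs(seq):
    """All windows (index, pentamer, score) scoring at or above the threshold."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 4):
        window = seq[i:i + 5]
        if set(window) <= set(RBS_MATRIX):     # skip windows with N, etc.
            score = rbs_score(window)
            if score >= THRESHOLD:
                hits.append((i, window, score))
    return hits

if __name__ == "__main__":
    print(consensus(), rbs_score(consensus()))
    print(candidate_rbs("TTTTTAGGAGTTTTT"))
```

Taking the logarithm of the score turns the product into a sum over positions, which is the form in which it relates to binding energy under the independence assumption.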
$$R = \sum_{k=1} n_b^2(k) \qquad (7)$$
Here n_b(k) is the number of symbols b (b = T, C, A, G) in the ...
SPLICE SITE DETECTION
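The boundaries between exons and introns are signaled by donor and acceptor
sites: an intron usually begins with a GT dinucleotide at the donor site and
ends with an AG dinucleotide at the acceptor site.
exon 1 | GT (donor site) ... intron ... AG (acceptor site) | exon 2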
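A minimal sketch of how candidate splice sites might be enumerated, assuming the canonical rule that introns begin with GT and end with AG. Every GT and AG in the sequence is only a candidate; real detectors score the surrounding sequence as well. The function names and `min_len` are illustrative.

```python
def candidate_splice_sites(seq):
    """Return (donor_positions, acceptor_positions) of GT and AG dinucleotides."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors

def candidate_introns(seq, min_len=4):
    """Pair each donor with every downstream acceptor.

    Returns (start, end) with `end` exclusive, so seq[start:end] begins
    with GT and ends with AG.
    """
    donors, acceptors = candidate_splice_sites(seq)
    return [(d, a + 2) for d in donors for a in acceptors
            if a + 2 - d >= min_len]

if __name__ == "__main__":
    seq = "ATGGTAAACAGCC"
    for start, end in candidate_introns(seq):
        print(start, end, seq[start:end])
```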
An information analysis of the 5’ (donor) and 3’ (acceptor) sequences spanning the ends of
nearly 1800 human introns has provided evidence for structural features of splice sites that
bear upon spliceosome evolution and function: (1) 82% of the sequence information (i.e.
sequence conservation) at donor junctions and 97% of the sequence information at acceptor
junctions is confined to the introns, allowing codon choices throughout exons to be largely
unrestricted. The distribution of information at intron-exon junctions is also described in
detail and compared with footprints. (2) Acceptor sites are found to possess enough
information to be located in the transcribed portion of the human genome, whereas donor
sites possess about one bit less than the information needed to locate them independently.
This difference suggests that acceptor sites are located first in humans and, having been
located, reduce by a factor of two the number of alternative sites available as donors. Direct
experimental evidence exists to support this conclusion. (3) The sequences of donor and
acceptor splice sites exhibit a striking similarity. This suggests that the two junctions derive
from a common ancestor and that during evolution the information of both sites shifted
onto the intron. If so, the protein and RNA components that are found in contemporary
spliceosomes, and which are responsible for recognizing donor and acceptor sequences,
should also be related. This conclusion is supported by the common structures found in
different parts of the spliceosome.
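The kind of information analysis described in this abstract can be sketched by measuring, for each column of a set of aligned sites, how far its nucleotide distribution falls below the 2-bit maximum entropy. This is a simplified sketch: it assumes equally likely background nucleotides and omits the small-sample correction that a careful analysis would apply; the function name is illustrative.

```python
import math
from collections import Counter

def information_per_position(sites):
    """Information (bits) at each column of equal-length aligned sites."""
    n = len(sites)
    width = len(sites[0])
    info = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(2.0 - entropy)   # 2 bits = log2(4) for four nucleotides
    return info

if __name__ == "__main__":
    # Hypothetical aligned donor-site windows, for illustration only.
    donors = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA", "TGGTGAGC"]
    print([round(bits, 2) for bits in information_per_position(donors)])
```

Summing the per-position values gives the total information of a site, which is the quantity compared against what is needed to locate sites in the transcript.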
HTTP://SCIEN.STANFORD.EDU/CLASS/EE368/PROJECTS2000/PROJECT15/ALGORITHMS.HTML
HTTP://HARLEQUIN.JAX.ORG/GENOMEANALYSIS/GENEFINDING04.PPT
Hexamer-based measures come out on top. They are based on the frequencies of 6-mers in one
of the frames (0,+1,+2). They are highly predictive because they capture the codon structure, codon usage
bias, initiation sites and higher-order co-dependencies.
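One way a hexamer-based coding measure might be computed is as a log-odds score: in-frame hexamer frequencies from known coding sequences are compared with hexamer frequencies from background sequence, and a query is scored in each of the three frames. This is a sketch under stated assumptions (add-one pseudocounts, a toy training set), not the specific measure evaluated in the cited comparison.

```python
import math
from collections import Counter

def hexamer_counts(seqs, step):
    """Count 6-mers in each sequence, advancing `step` bases at a time."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts

def make_scorer(coding_seqs, background_seqs, pseudo=1.0):
    """Return a function giving the log-odds of one hexamer (coding vs background)."""
    coding = hexamer_counts(coding_seqs, 3)          # in-frame hexamers only
    background = hexamer_counts(background_seqs, 1)  # every position
    n_coding = sum(coding.values()) + pseudo * 4 ** 6
    n_background = sum(background.values()) + pseudo * 4 ** 6

    def score(hexamer):
        p_coding = (coding[hexamer] + pseudo) / n_coding
        p_background = (background[hexamer] + pseudo) / n_background
        return math.log(p_coding / p_background)

    return score

def frame_scores(seq, score):
    """Sum of hexamer log-odds in frames 0, +1 and +2 of the query."""
    seq = seq.upper()
    return {f: sum(score(seq[i:i + 6]) for i in range(f, len(seq) - 5, 3))
            for f in (0, 1, 2)}

if __name__ == "__main__":
    scorer = make_scorer(["GCAGCAGCAGCAGCAGCAGCA"], ["ATATATATATATATATATATAT"])
    print(frame_scores("GCAGCAGCAGCA", scorer))
```

The frame with the highest total is the most coding-like reading of the query, which is how the measure exposes codon structure.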
Find the “best” path to reveal the exon structure of a human gene
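[Figure: candidate exons drawn as weighted intervals (weights 5, 5, 15, 9, 11, 4, 3) along the human genome, with interval endpoints at positions 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32. Two example chains have SCORE=18 and SCORE=19; the best chain has score 21.]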
Use a graph representation of the exon chaining problem.
This set is further filtered in such a way as to retain all
true exons, along with some false ones.
Defines the weight of an entire path in the graph, but not the weights of individual
edges.
Exon chaining assumes the positions and weights of exons are pre-defined.
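Given pre-defined candidate exons with weights, exon chaining amounts to choosing a maximum-weight set of non-overlapping intervals. A minimal dynamic-programming sketch follows; the candidate exons in the usage example are hypothetical and are not the intervals from the slide's figure.

```python
from bisect import bisect_left

def chain_exons(exons):
    """Return (best_score, chain) for exons given as (left, right, weight).

    Intervals are closed, so two exons sharing an endpoint count as overlapping.
    """
    ordered = sorted(exons, key=lambda e: e[1])      # by right endpoint
    rights = [e[1] for e in ordered]
    # best[i] = best (score, chain) using only the first i exons in this order.
    best = [(0, [])]
    for i, (left, right, weight) in enumerate(ordered):
        # j = number of earlier exons that end strictly before this one starts.
        j = bisect_left(rights, left, 0, i)
        take = (best[j][0] + weight, best[j][1] + [(left, right, weight)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

if __name__ == "__main__":
    candidates = [(0, 3, 5), (2, 6, 4), (4, 9, 6), (7, 12, 3), (10, 15, 8)]
    print(chain_exons(candidates))
```

Sorting by right endpoint lets each exon look up the best chain ending before it with a binary search, so the whole procedure runs in O(n log n) time apart from the cost of copying chains.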
FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES ”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709
... boundaries, we refer to a region as a state and to a boundary as a transition between states). If the condi...
The advantage of HMMs is that more states (such as intergenic regions, promoters, UTRs, poly(A) and ...
GENERAL THINGS TO REMEMBER ABOUT (PROTEIN-CODING) GENE PREDICTION SOFTWARE
HTTP://HARLEQUIN.JAX.ORG/GENOMEANALYSIS/GENEFINDING04.PPT