BioAlg10 9
Gene Prediction
Bioinformatics Algorithms
Introduction
• Gene: A sequence of nucleotides coding for protein
• Gene Prediction Problem: Determine the beginning and end
positions of genes in a genome
• atgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgct
aatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggc
tatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggcta
tgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccga
tgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcg
gctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgc
ggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctg
ggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgca
tgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctat
gctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcgg
ctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgaca
atgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctat
gctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaa
gctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaa
tgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcgg
ctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcat
gcggctatgctaagct
Introduction
• In the 1960s it was discovered that the sequence of codons in a gene
determines the sequence of amino acids in a protein
– an incorrect assumption: the triplets encoding amino acid sequences
form contiguous strips of information.
• A paradox: the genome size of many eukaryotes does not correspond to
“genetic complexity”; for example, the salamander genome is 10
times the size of the human genome.
• 1977: discovery of “split” genes in experiments with the mRNA of hexon,
a viral protein:
• mRNA-DNA hybrids formed three curious loop structures instead of
contiguous duplex segments (seen in an electron microscope)
[Figure: transcription, splicing, translation; exon = coding, intron = non-coding]
• E.g. for the bases around the translation start site (the ATG at
positions +1 to +3) we may have the following observed frequencies,
given by this position-specific weight matrix (PSWM):
Pos.  -8   -7   -6   -5   -4   -3   -2   -1   +1   +2   +3   +4   +5   +6   +7
A    .16  .29  .20  .25  .22  .66  .27  .15    1    0    0  .28  .24  .11  .26
C    .48  .31  .21  .33  .56  .05  .50  .58    0    0    0  .16  .29  .24  .40
G    .18  .16  .46  .21  .17  .27  .12  .22    0    0    1  .48  .20  .45  .21
T    .19  .24  .14  .21  .06  .02  .11  .05    0    1    0  .09  .26  .21  .21
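A matrix like the one above can be used to score candidate sites. The following is a minimal sketch, not a method given in the slides: it assumes a uniform background frequency of 0.25 per base and a log-odds score, and the function names `log_odds` and `best_site` are hypothetical.

```python
import math

# Frequencies copied from the matrix above; columns are positions -8..-1, +1..+7.
PSWM = {
    "A": [.16, .29, .20, .25, .22, .66, .27, .15, 1, 0, 0, .28, .24, .11, .26],
    "C": [.48, .31, .21, .33, .56, .05, .50, .58, 0, 0, 0, .16, .29, .24, .40],
    "G": [.18, .16, .46, .21, .17, .27, .12, .22, 0, 0, 1, .48, .20, .45, .21],
    "T": [.19, .24, .14, .21, .06, .02, .11, .05, 0, 1, 0, .09, .26, .21, .21],
}
WIDTH = 15  # number of columns in the matrix


def log_odds(site, background=0.25):
    """Sum of log2(frequency / background) over the 15 positions.

    The uniform background is an assumption. A zero frequency makes the
    site impossible under the matrix, so the score is -infinity.
    """
    if len(site) != WIDTH:
        raise ValueError("site must have exactly 15 bases")
    score = 0.0
    for i, base in enumerate(site.upper()):
        f = PSWM[base][i]
        if f == 0:
            return float("-inf")
        score += math.log2(f / background)
    return score


def best_site(seq):
    """Return (score, offset) of the highest-scoring 15-base window in seq."""
    return max((log_odds(seq[i:i + WIDTH]), i)
               for i in range(len(seq) - WIDTH + 1))
```

Any window without ATG at positions +1 to +3 scores minus infinity, so in practice the matrix only ranks the ATG occurrences by their sequence context.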
ORF prediction
Genomic sequence (an open reading frame runs from a start codon such as ATG to a stop codon such as TGA):
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
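As an illustration of ORF prediction, here is a minimal sketch that scans the three forward reading frames for an ATG followed by an in-frame stop codon. The function name `find_orfs` and the `min_codons` threshold are assumptions made for this example; the reverse strand is ignored for brevity.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs, 0-based and half-open.

    Each pair spans from an ATG to the first in-frame stop codon
    (stop included). min_codons counts the codons before the stop,
    including the ATG itself. Only the forward strand is scanned.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a new ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)
```

In eukaryotic genomes this simple scan is not sufficient on its own, because introns interrupt the reading frame.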
• Amino acids typically have more than one codon, but in nature
certain codons are used more often than others
[Figure: codon usage table giving, for each codon, its percent usage among all codons encoding the same amino acid or stop (Stp)]
• Example:
Leucine, Alanine and Tryptophan are coded by 6, 4 and 1 different
codons respectively. Hence in a uniformly random DNA they should
occur in the ratio 6:4:1. But in a protein they occur in the ratio 6.9:6.5:1.
• For a codon c:
$f_c$: the frequency of occurrence of c in the window.
$F_c$: the total number of occurrences of c’s synonymous family in the
window.
$r_c$: the calculated number of occurrences of c in a random sequence of
length $l_w$ with the same base composition as the sequence being
analyzed.
$R_c$: the calculated number of occurrences of c’s synonymous family in a
random sequence of length $l_w$ with the same base composition as
the sequence being analyzed.
• The probability of the sequence in the window w is then
$$P(w) = \prod_{i=1}^{l_w} p_{c_i}$$
where $c_i$ is the i-th codon in the window.
• A log-based score is used instead, and the codon preference statistic for
each window is
$$P_w = \exp\left(\frac{1}{l_w} \sum_{i=1}^{l_w} \log p_{c_i}\right)$$
• A correction: a 0 in the codon frequency table is replaced by $1/F_c$
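The window statistic can be sketched in a few lines. Two things here are assumptions, since the slides do not spell them out: the per-codon value is taken to be the ratio (f_c / F_c) / (r_c / R_c) suggested by the four quantities defined earlier, and l_w is taken to be the number of codons in the window. The function names are hypothetical.

```python
import math


def preference_ratio(f_c, F_c, r_c, R_c):
    """Assumed per-codon value p_c = (f_c / F_c) / (r_c / R_c).

    Applies the correction from the slide: a zero count f_c is
    replaced by 1 / F_c so that log(p_c) stays finite.
    """
    if f_c == 0:
        f_c = 1.0 / F_c
    return (f_c / F_c) / (r_c / R_c)


def codon_preference(window, p):
    """exp of the mean of log p_c over the codons of the window.

    `p` maps each codon to its p_c value; the window is read in
    frame 0 and trailing bases that do not fill a codon are dropped.
    """
    usable = len(window) - len(window) % 3
    codons = [window[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        raise ValueError("window is shorter than one codon")
    return math.exp(sum(math.log(p[c]) for c in codons) / len(codons))
```

Working with the mean of the logs gives the geometric mean of the p values, which keeps scores comparable across windows of different lengths.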
Gene structure (5’→3’):
promoter (TATA) | 5’ UTR | start site (ATG) | initial exon | intron (GT … AG) | internal exon(s) | intron (GT … AG) | terminal exon | stop site (TAA, TAG or TGA) | 3’ UTR | Poly-A (AAATAAAAAA)
donor site = GT at the start of an intron; acceptor site = AG at the end of an intron
• an intron usually starts with GT (donor site) and ends with AG (acceptor site)
• Types of exons
1. Initial exons
2. Internal exons
3. Terminal exons
4. Single-exon genes, i.e. genes without introns.
Position
%    -8   …   -2   -1    0    1    2   …   17
A    26   …   60    9    0    1   54   …   21
C    26   …   15    5    0    1    2   …   27
G    25   …   12   78   99    0   41   …   27
T    23   …   13    8    1   98    3   …   25
TestCode
• Statistical test described by James Fickett in 1982: tendency for
nucleotides in coding regions to be repeated with periodicity of 3
– Judges randomness instead of codon frequency.
– Finds “putative” coding regions, not introns, exons, or splice sites.
• TestCode finds ORFs based on compositional bias with a periodicity
of three.
TestCode Statistics
• Define a window size of no less than 200 bp and slide the window
along the sequence 3 bases at a time. In each window:
– Calculate for each base {A, T, G, C}
• max(n_{3k+1}, n_{3k+2}, n_{3k}) / min(n_{3k+1}, n_{3k+2}, n_{3k}),
where n_{3k+r} counts the occurrences of the base at positions of the form 3k+r
– Use these values to obtain a probability from a lookup table (which was
previously determined experimentally from known coding
and noncoding sequences)
• Probabilities can be classified as indicative of “coding” or
“noncoding” regions, or “no opinion” when it is unclear what
level of randomization tolerance a sequence carries
• The resulting sequence of probabilities can be plotted
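The per-window ratio can be computed as in the sketch below. It covers only the max/min step described above; the lookup table that turns the ratios into probabilities is not reproduced in the slides, so it is left out. Returning infinity when a base never occurs at one of the three positions is a choice made for this example, and the function names are hypothetical.

```python
def position_asymmetry(window):
    """For each base, max/min of its counts at the three codon positions."""
    window = window.upper()
    result = {}
    for base in "ATGC":
        counts = [0, 0, 0]
        for i, b in enumerate(window):
            if b == base:
                counts[i % 3] += 1   # positions 3k, 3k+1, 3k+2
        low = min(counts)
        # Assumption: report infinity instead of dividing by zero.
        result[base] = max(counts) / low if low > 0 else float("inf")
    return result


def sliding_asymmetry(seq, window_size=200, step=3):
    """Slide a window along seq, `step` bases at a time.

    Returns a list of (window start, asymmetry dict) pairs.
    """
    return [(start, position_asymmetry(seq[start:start + window_size]))
            for start in range(0, len(seq) - window_size + 1, step)]
```

A ratio close to 1 means the base is spread evenly over the three positions, as expected in random sequence; larger ratios indicate the period-3 bias typical of coding regions.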
[Figure: TestCode plot with regions classified as coding, no opinion, and non-coding]
• TWINSCAN
– Uses both HMM and similarity (e.g., between human and
mouse genomes)
Reverse Translation
• Given a known protein, find a gene in the genome which codes for it.
• One might infer the coding DNA of the given protein by reversing the
translation process
– Inexact: amino acids map to > 1 codon.
– This problem is essentially reduced to an alignment problem.
• This reverse translation problem can be modeled as traveling in a
Manhattan grid with free horizontal jumps
– Complexity of Manhattan is O(n^3)
• Every horizontal jump models the insertion of an intron
• Problem with this approach: it would match nucleotides pointwise and
use horizontal jumps at every opportunity
Infeasible Chains
• Red local similarities form two non-overlapping intervals but do not
form a valid global alignment
[Figure: known frog genes aligned against the human genome]
Spliced Alignment
• Mikhail Gelfand and colleagues proposed a spliced alignment
approach of using a protein within one genome to reconstruct the
exon-intron structure of a (related) gene in another genome.
– Begins by selecting either all putative exons between potential acceptor
and donor sites or by finding all substrings similar to the target protein
(as in the Exon Chaining Problem).
– This set is then filtered in a way that attempts to retain all true
exons, possibly along with some false ones.
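The first selection step, taking all putative exons between potential acceptor and donor sites, can be sketched as follows. This is an illustration under simplifying assumptions, not the published procedure: acceptor sites are taken to be every AG, donor sites every GT, and initial and terminal exons (which are bounded by start and stop codons instead) are ignored. The name `candidate_blocks` is hypothetical.

```python
def candidate_blocks(genome, min_len=1):
    """Enumerate putative internal exons as (start, end), 0-based half-open.

    A block starts right after an AG (potential acceptor site) and
    ends right before a GT (potential donor site).
    """
    g = genome.upper()
    starts = [i + 2 for i in range(len(g) - 1) if g[i:i + 2] == "AG"]
    ends = [i for i in range(len(g) - 1) if g[i:i + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if e - s >= min_len]
```

The number of candidates grows with the product of the AG and GT counts, which is why a filtering step is needed before alignment.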
1. For j > 0 and when there does not exist a block preceding B:
$$S(i, j, B) = s(g_{\mathrm{first}(B)} \ldots g_i,\; t_1 \ldots t_j)$$
• After computing the three-dimensional table S(i, j, B), the score of the
optimal spliced alignment is:
$$\max_{\text{all blocks } B} S(\mathrm{end}(B), \mathrm{length}(T), B)$$
• A mosaic effect: short exons are easily combined to fit any target
protein
• GENSCAN/GenomeScan
• TwinScan
• Glimmer
• GeneMark
GenomeScan
TwinScan
http://www.standford.edu/class/cs262/Spring2003/Notes/ln10.pdf
Glimmer
GeneMark