An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Gene Prediction
Bioinformatics Algorithms

Introduction
• Gene: A sequence of nucleotides coding for protein
• Gene Prediction Problem: Determine the beginning and end
positions of genes in a genome

• atgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgct
aatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggc
tatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggcta
tgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccga
tgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcg
gctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgc
ggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctg
ggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgca
tgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctat
gctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcgg
ctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgaca
atgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctat
gctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaa
gctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaa
tgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcgg
ctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcat
gcggctatgctaagct

Introduction
• In the 1960s it was discovered that the sequence of codons in a gene
determines the sequence of amino acids in a protein
– an incorrect assumption: the triplets encoding amino acid sequences
form contiguous strips of information.
• A paradox: the genome size of many eukaryotes does not correspond to
“genetic complexity”; for example, the salamander genome is 10
times the size of the human genome.
• 1977 – discovery of “split” genes: experiments with mRNA of hexon,
a viral protein:
• mRNA-DNA hybrids formed three curious loop structures instead of
contiguous duplex segments (seen in an electron microscope)
[Figure: mRNA-DNA hybrid with loops; labels: DNA, mRNA]

Central Dogma and Splicing

[Figure: a gene with exon1, intron1, exon2, intron2, exon3 passes through transcription, splicing, and translation]
exon = coding
intron = non-coding

Gene prediction is hard

• The genomes of many eukaryotes contain relatively few genes
(about 3% of the human genome).
• Many false splice sites and other signals.
• Very short exons (3 bp), especially initial exons.
• Many very long introns.
• Alternative splicing.

Approaches to gene finding

A. Statistical or ab initio methods: These methods attempt to predict
genes based on statistical properties of the given DNA sequence.
Programs are e.g. Genscan, GeneID, GENIE and FGENEH.
B. Comparative methods: The given DNA string is compared with a
similar DNA string from a different species at the appropriate
evolutionary distance and genes are predicted in both sequences
based on the assumption that exons will be well conserved,
whereas introns will not. Programs are e.g. CEM (conserved exon
method) and Twinscan.
C. Homology methods: The given DNA sequence is compared with
known protein sequences. Programs are e.g. TBLASTN or
TBLASTX, Procrustes and GeneWise.

A. Statistical ab initio methods

• Coding segments (exons) have typical sequences on either end and
use different subwords than non-coding segments (introns).
• E.g. for the bases around the translation start site (ATG at positions
+1 to +3) we may have the following observed frequencies (given by this
position specific weight matrix (PSWM)):

Pos.  -8   -7   -6   -5   -4   -3   -2   -1   +1   +2   +3   +4   +5   +6   +7
A    .16  .29  .20  .25  .22  .66  .27  .15    1    0    0  .28  .24  .11  .26
C    .48  .31  .21  .33  .56  .05  .50  .58    0    0    0  .16  .29  .24  .40
G    .18  .16  .46  .21  .17  .27  .12  .22    0    0    1  .48  .20  .45  .21
T    .19  .24  .14  .21  .06  .02  .11  .05    0    1    0  .09  .26  .21  .21

• This can then be used in a log-likelihood scoring model in
order to distinguish certain recognition sites (such as transcription
start sites, or promoter regions) from non-recognition sites.
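
The log-likelihood scoring mentioned above can be sketched as follows. This is a minimal illustration, not the scoring used by any particular program: the uniform background of 0.25 per base and the floor of 0.01 that replaces zero entries are assumptions, and the names `log_odds` and `best_site` are hypothetical.

```python
import math

# Columns correspond to positions -8 ... -1, +1 ... +7 of the matrix above.
PSWM = {
    "A": [.16, .29, .20, .25, .22, .66, .27, .15, 1, 0, 0, .28, .24, .11, .26],
    "C": [.48, .31, .21, .33, .56, .05, .50, .58, 0, 0, 0, .16, .29, .24, .40],
    "G": [.18, .16, .46, .21, .17, .27, .12, .22, 0, 0, 1, .48, .20, .45, .21],
    "T": [.19, .24, .14, .21, .06, .02, .11, .05, 0, 1, 0, .09, .26, .21, .21],
}
WIDTH = 15
BACKGROUND = 0.25   # assumption: uniform base frequencies in non-sites
FLOOR = 0.01        # assumption: zero frequencies are replaced by a small value

def log_odds(site):
    """Sum of log2(site frequency / background frequency) over the 15 positions."""
    if len(site) != WIDTH:
        raise ValueError("site must have length 15")
    return sum(math.log2(max(PSWM[b][i], FLOOR) / BACKGROUND)
               for i, b in enumerate(site))

def best_site(seq):
    """Return (score, offset) of the highest-scoring 15-base window in seq."""
    return max((log_odds(seq[i:i + WIDTH]), i)
               for i in range(len(seq) - WIDTH + 1))

consensus = "CCGCCACCATGGCGC"   # most frequent base in every column
print(best_site("TTTT" + consensus + "TTTT"))
```

Windows that score well above 0 look more like the recognition site than like background; where to put the threshold is a separate design decision.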

A. Gene prediction in prokaryotes

gene structure
• Most DNA is coding
• No introns
• Promoters are DNA segments upstream of transcripts that initiate
transcription
[Figure: a promoter upstream of the transcript, drawn 5’ → 3’]
• The promoter attracts RNA polymerase to the transcription start site

A. Gene prediction in prokaryotes

gene structure
[Figure: prokaryotic gene, 5’ → 3’: promoter with elements at -35 bp and -10 bp, transcription start site, 5’ untranslated region, open reading frame from start codon to stop codon, 3’ untranslated region]

• Upstream of the transcription start site (TSS; position 0) there are
promoters

Promoter Structure in Prokaryotes (E. coli)

Transcription starts at offset 0.

• Pribnow Box (-10)
• Gilbert Box (-35)
• Ribosomal Binding Site (+10)

Open Reading Frames (ORFs)

• Detect potential coding regions by looking at ORFs
– A genome of length n comprises n/3 codons
– Stop codons (TAA, TAG or TGA) break the genome into segments
between consecutive stop codons
– The subsegments of these that start from the start codon (ATG) are
ORFs
• ORFs in different frames may overlap

[Figure: genomic sequence with an open reading frame running from ATG to TGA]

Six Frames in a DNA Sequence

5’ CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC 3’
3’ GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG 5’

[Figure: the two complementary strands, each shown in its three reading frames]

• stop codons – TAA, TAG, TGA

• start codons – ATG
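
A minimal sketch of scanning all six frames for ORFs, using the start and stop codons listed above. The function names are hypothetical, and the sketch makes simplifying assumptions: an ORF begins at the first ATG after a stop codon, ends with the next in-frame stop codon, and ORFs that run off the end of the sequence are ignored.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def orfs_in_frame(dna, frame):
    """Yield (start, end) of ATG...stop ORFs in one frame; end is exclusive."""
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == "ATG" and start is None:
            start = i                      # first start codon since the last stop
        elif codon in STOPS and start is not None:
            yield start, i + 3             # include the stop codon
            start = None

def six_frame_orfs(dna):
    """Return (strand, frame, orf_sequence) for both strands and all three frames."""
    dna = dna.upper()
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for s, e in orfs_in_frame(seq, frame):
                found.append((strand, frame, seq[s:e]))
    return found

print(six_frame_orfs("CCATGAAATAGCC"))
```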

ORF prediction

1. Evaluation of ORF length
– In “random” DNA, the average distance between stop codons is
64/3 ≈ 21 codons, which is much less than the average length of a protein
(≈300 codons)
– Simple algorithm, poor performance
2. Evaluation of codon usage
– Codon usage in coding regions differs from the codon usage in
non-coding regions
3. Evaluation of codon preference
– One amino acid is coded by several different codons, some of which are
used more often than the others (see below)
4. Markov models and HMMs

2. ORF prediction – codon usage

I. Codon usage (see below)
• Create a 64-element hash table and count the frequencies of
codons in an ORF
• Uneven use of the codons may characterize a real gene
• This compensates for pitfalls of the ORF length test

II. Hexamer counts
• Frequency of occurrences of oligonucleotides of length 6 in a
reading frame
• Usually modeled as fifth-order Markov models:
$P(x_n = s \mid \bigcap_{j<n} x_j) = P(x_n = s \mid x_{n-1} x_{n-2} x_{n-3} x_{n-4} x_{n-5})$
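
The 64-element table from part I can be sketched like this. It is an illustrative helper under stated assumptions: `codon_usage` is a hypothetical name, the ORF is assumed to be given in frame, and frequencies are reported per 1000 codons.

```python
from collections import Counter
from itertools import product

def codon_usage(orf):
    """Return a dict with all 64 codons and their frequency per 1000 codons."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    counts = Counter(codons)
    table = {"".join(c): 0.0 for c in product("ACGT", repeat=3)}
    for codon, n in counts.items():
        if codon in table:                 # skip codons with ambiguous bases
            table[codon] = 1000 * n / len(codons)
    return table

usage = codon_usage("ATGGGTGGTTAA")
print(usage["GGT"], usage["ATG"])
```

A strongly uneven table is the signal the slide refers to; comparing it with a reference table for the organism is left out of this sketch.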

2. ORF prediction – codon usage

• Vector with 64 components – frequencies of usage for each codon
AA Codon /1000
Gly GGG 1.89
Gly GGA 0.44
Gly GGU 52.99
Gly GGC 34.55
… … …
Glu GAG 15.68
Glu GAA 57.20
… … …
Asp GAU 21.63
Asp GAC 43.26

Codon Usage in Human Genome

• Amino acids typically have more than one codon, but in nature
certain codons are used more often than others

[Figure: human codon usage table; each entry shows the percentage of usage among all codons encoding the same amino acid or stop signal (Stp)]

Codon Usage in Mouse Genome

AA   codon  /1000  frac
Ser  TCG     4.31  0.05
Ser  TCA    11.44  0.14
Ser  TCT    15.70  0.19
Ser  TCC    17.92  0.22
Ser  AGT    12.25  0.15
Ser  AGC    19.54  0.24
Pro  CCG     6.33  0.11
Pro  CCA    17.10  0.28
Pro  CCT    18.31  0.30
Pro  CCC    18.42  0.31
Leu  CTG    39.95  0.49
Leu  CTA     7.89  0.10
Leu  CTT    12.97  0.16
Leu  CTC    20.04  0.25
Ala  GCG     6.72  0.10
Ala  GCA    15.80  0.23
Ala  GCT    20.12  0.29
Ala  GCC    26.51  0.38
Gln  CAG    34.18  0.75
Gln  CAA    11.51  0.25

3. ORF prediction – codon preference

• For each reading frame, a codon preference statistic is computed at each
position. The statistic is calculated over a window of
length l_w (l_w is usually between 25 and 50), where the window is
moved along the sequence in increments of three bases,
maintaining the reading frame. The magnitude of the codon
preference statistic is a measure of the likeness of a particular
window of codons to a predetermined preferred usage.
• The statistic is based on the concept of synonymous codons.
Synonymous codons are those codons specifying the same amino
acid.

3. ORF prediction – codon preference

• Example:
Leucine, Alanine and Tryptophan are coded by 6, 4 and 1 different
codons respectively. Hence in uniformly random DNA they should
occur in the ratio 6:4:1. But in a protein they occur in the ratio 6.9:6.5:1.
• For a codon c:
f_c – the codon’s frequency of occurrence in the window.
F_c – the total number of occurrences of c’s synonymous family in the
window.
r_c – the calculated number of occurrences of c in a random sequence of
length l_w with the same base composition as the sequence being
analyzed.
R_c – the calculated number of occurrences of c’s synonymous family in a
random sequence of length l_w with the same base composition as
the sequence being analyzed.

3. ORF prediction – codon preference

• The codon c preference statistic:
$$p_c = \frac{f_c / F_c}{r_c / R_c}$$
– if $p_c = 1$, c is used equally in a random sequence and in the codon
frequency table

3. ORF prediction – codon preference

• When an amino acid is coded by codons c_1, c_2, …, c_k, then obviously
$$\prod_{j=1}^{k} p_{c_j} \le 1$$
• The probability of the sequence in the window w is then
$$P(w) = \prod_{i=1}^{l_w} p_{c_i}$$
• A log-based score is used instead, and the codon preference statistic for
each window is
$$P_w = e^{\frac{1}{l_w} \sum_{i=1}^{l_w} \log p_{c_i}}$$
• A correction: 0 in the codon frequency table is replaced by 1/F_c
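
The statistic can be sketched in code as follows. This is an illustration under assumptions, not a reference implementation: `preference_table` and `window_score` are hypothetical names, the random-sequence expectation r_c is taken as the product of the base frequencies of the codon, and a synonymous family with no observations at all is given the neutral value 1.

```python
import math

def preference_table(codon_counts, base_freq, families):
    """Return p_c = (f_c / F_c) / (r_c / R_c) for every codon in `families`."""
    p = {}
    for family in families:                       # one list of synonymous codons
        F = sum(codon_counts.get(c, 0) for c in family)
        r = {c: math.prod(base_freq[b] for b in c) for c in family}
        R = sum(r.values())
        for c in family:
            if F == 0:                            # assumption: unseen family is neutral
                p[c] = 1.0
                continue
            f = codon_counts.get(c, 0) or 1.0 / F  # correction for zero counts
            p[c] = (f / F) / (r[c] / R)
    return p

def window_score(codons, p):
    """P_w = exp(sum(log p_c) / l_w), the geometric mean of p_c in the window."""
    return math.exp(sum(math.log(p[c]) for c in codons) / len(codons))

uniform = {b: 0.25 for b in "ACGT"}
p = preference_table({"GCA": 3, "GCC": 1}, uniform, [["GCA", "GCC"]])
print(p, window_score(["GCA", "GCC"], p))
```

Sliding the window three bases at a time and recomputing `window_score` gives one curve per reading frame.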

4. ORF prediction – using Markov models and HMM

• There are many more ORFs than real genes. E.g., the E. coli genome
contains about 6600 ORFs but only about 4400 real genes. A Markov
model or an HMM can be used to distinguish between non-coding
ORFs and real genes.
• DNA can be modeled as a 64-state Markov chain of codons:
– The probabilities that a certain codon is followed by another one in a coding
ORF are computed.
– The probability of the chain is then computed in the form of a log-odds score.
– A non-coding ORF has a log-odds distribution centered around 0.

[Figure: distributions of log-odds scores for non-coding and coding ORFs]
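
A minimal sketch of the codon Markov chain and its log-odds score. The choices here are assumptions made for illustration: transition probabilities are estimated from a list of known coding ORFs with add-one pseudocounts, and the null model gives every codon probability 1/64. The names `train` and `log_odds` are hypothetical.

```python
import math

def codons_of(orf):
    """Split an in-frame sequence into codons."""
    return [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]

def train(coding_orfs, pseudo=1.0):
    """Estimate P(next codon | current codon) from known coding ORFs."""
    counts = {}
    for orf in coding_orfs:
        cs = codons_of(orf)
        for a, b in zip(cs, cs[1:]):
            row = counts.setdefault(a, {})
            row[b] = row.get(b, 0) + 1

    def prob(a, b):
        row = counts.get(a, {})
        return (row.get(b, 0) + pseudo) / (sum(row.values()) + 64 * pseudo)

    return prob

def log_odds(orf, prob, background=1 / 64):
    """Sum of log2(P(b | a) / background) over consecutive codon pairs."""
    cs = codons_of(orf)
    return sum(math.log2(prob(a, b) / background) for a, b in zip(cs, cs[1:]))

prob = train(["ATGGCTGCTGCTTAA"] * 5)
print(log_odds("ATGGCTGCTGCTTAA", prob), log_odds("ATGCCCCCCCCCTAA", prob))
```

ORFs that resemble the training set score above 0, while ORFs with unseen codon transitions score at or slightly below 0.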

A. Gene prediction in eukaryotes

gene structure
• Alternating exons and introns

[Figure: eukaryotic gene, 5’ → 3’: promoter (TATA), 5’ UTR, start site (ATG), initial exon, intron (donor site GT … acceptor site AG), internal exon(s), intron (donor site GT … acceptor site AG), terminal exon, stop site (TAA, TAG or TGA), 3’ UTR, poly-A signal (AAATAAAAAA)]

• An intron usually starts with GT (donor site) and ends with AG (acceptor site)
• Types of exons
1. Initial exons
2. Internal exons
3. Terminal exons
4. Single-exon genes, i.e. genes without introns.

Splice site detection

Donor site (5’ → 3’)

Position (%)
     -8   …   -2   -1    0    1    2   …   17
A    26   …   60    9    0    1   54   …   21
C    26   …   15    5    0    1    2   …   27
G    25   …   12   78   99    0   41   …   27
T    23   …   13    8    1   98    3   …   25

• In the exon-intron junctions there is a large similarity to the
consensus sequence → algorithms based on position specific
weight matrices.
• However, this is far too simple, since it does not use all the
information encoded in a gene. Thus more integrated approaches
are sought. This naturally leads us to Hidden Markov Models.

A simple HMM M for gene detection

• States are ‘in exon’ and ‘in intron’
• p: probability that the process stays ‘in exon’; 1–p: probability that the
process switches into ‘in intron’
• q: probability that the process stays ‘in intron’; 1–q: probability that the
process switches into ‘in exon’
• The probability that an exon has length k is
P(exon of length k | M) = p^k (1–p)

[Figure: two-state HMM with states ‘exon’ and ‘intron’; transition labels 0.6, 0.4, 0.6, 0.4.
Emissions in ‘exon’: P(A)=0.2, P(C)=0.3, P(G)=0.3, P(T)=0.2.
Emissions in ‘intron’: P(A)=0.25, P(C)=0.25, P(G)=0.25, P(T)=0.25]
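
Decoding this two-state model with the Viterbi algorithm can be sketched as below. The emission probabilities are the ones shown on the slide; reading the figure as "stay with 0.6, switch with 0.4" for both states and starting in either state with probability 0.5 are assumptions. The sequence is assumed to be non-empty and to contain only A, C, G, T.

```python
import math

STATES = ("E", "I")                     # E = in exon, I = in intron
START = {"E": 0.5, "I": 0.5}            # assumption: no preferred start state
TRANS = {"E": {"E": 0.6, "I": 0.4},     # assumption: self-loops carry 0.6
         "I": {"I": 0.6, "E": 0.4}}
EMIT = {"E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def viterbi(seq):
    """Return the most probable state path for seq as a string of E/I labels."""
    v = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for x in seq[1:]:
        nv, ptr = {}, {}
        for s in STATES:
            # best predecessor of state s at this position
            best = max(STATES, key=lambda r: v[r] + math.log(TRANS[r][s]))
            nv[s] = v[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][x])
            ptr[s] = best
        back.append(ptr)
        v = nv
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for ptr in reversed(back):          # trace the pointers back to the start
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("GCGCGCGC"), viterbi("ATATATAT"))
```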

A simple HMM M for gene detection

• p^k (1–p) implies a geometric distribution, which does not correspond to
the real distribution of lengths of introns and exons

[Figure: length distributions for introns, initial exons, internal exons, and terminal exons]

A simple HMM M for gene detection

• If an exon is too short (under 50 bp), the spliceosome (the enzyme complex that
performs the splicing) does not have enough room.
• Exons that are longer than 300 bp are difficult to locate.
• Typical numbers for vertebrates:
• mean gene length ≈ 30 kb,
• mean coding region length ≈ 1–2 kb.
• We need other models that can model biological exon lengths

Ribosomal Binding Site


TestCode
• Statistical test described by James Fickett in 1982: tendency for
nucleotides in coding regions to be repeated with periodicity of 3
– Judges randomness instead of codon frequency.
– Finds “putative” coding regions, not introns, exons, or splice sites.
• TestCode finds ORFs based on compositional bias with a periodicity
of three.

TestCode Statistics

• Define a window size of no less than 200 bp and slide the window
along the sequence 3 bases at a time. In each window:
– Calculate for each base {A, T, G, C}
• max(n_{3k+1}, n_{3k+2}, n_{3k}) / min(n_{3k+1}, n_{3k+2}, n_{3k}),
where n_{3k+i} is the number of occurrences of the base at positions of the form 3k+i
• Use these values to obtain a probability from a lookup table (which was
previously defined and determined experimentally with known coding
and noncoding sequences)
• Probabilities can be classified as indicative of “coding” or
“noncoding” regions, or “no opinion” when it is unclear what
level of randomization tolerance a sequence carries
• The resulting sequence of probabilities can be plotted
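
The per-window ratio can be sketched as follows. Only the max/min position ratio described above is computed; the lookup table that turns ratios into probabilities is not given in these slides and is therefore not reproduced. The function names are hypothetical, and returning infinity when a base is missing from one of the three positions (or 1.0 when it is absent altogether) is an assumption.

```python
def position_asymmetry(window):
    """For each base, return max/min of its counts at positions 3k, 3k+1, 3k+2."""
    ratios = {}
    for base in "ATGC":
        counts = [0, 0, 0]
        for i, b in enumerate(window):
            if b == base:
                counts[i % 3] += 1
        lo, hi = min(counts), max(counts)
        if lo > 0:
            ratios[base] = hi / lo
        else:                               # assumption: handle empty positions
            ratios[base] = float("inf") if hi > 0 else 1.0
    return ratios

def sliding_asymmetry(seq, window=200, step=3):
    """Compute the ratios for every window, moving 3 bases at a time."""
    return [position_asymmetry(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

print(position_asymmetry("ACGACGACG"))
```

A ratio close to 1 means the base is spread evenly over the three codon positions, which is what a sequence without period-3 structure would show.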

TestCode Sample Output

[Figure: TestCode plot along the sequence with three zones: coding, no opinion, non-coding]

Popular Gene Prediction Algorithms

• GENSCAN: uses modified Hidden Markov Models (HMMs) –
semi-Markov model – based on statistical methods and on data
from an annotated training set
• TWINSCAN
– Uses both HMM and similarity (e.g., between human and
mouse genomes)

B. Comparative gene finding

• Idea: the level of sequence conservation between two species
depends on the function of the DNA, e.g. coding sequence is more
conserved than intergenic sequence.
• Program Rosetta:
– first computes a global alignment of two homologous sequences
– and then attempts to predict genes in both sequences simultaneously.
• The conserved exon method (CEM) uses local conservation.
• Orthologous genes: homologous genes in two species that have a
common ancestor.

Using Known Genes to Predict New Genes
• Some genomes may be very well-studied, with many genes having
been experimentally verified.
• Closely-related organisms may have similar genes.
• Unknown genes in one species may be compared to genes in some
closely-related species.
• Most human genes have mouse orthologs:
– 95% of coding exons are in a one-to-one correspondence between the two
genomes.
– 75% of orthologous coding exons have equal length, and
– 95% have equal length modulo 3.
– Intron lengths differ by an average of 50%.
– The coding sequence similarity between the two organisms is around 85%,
– the intron sequence similarity is around 35%,
– 5’ UTRs and 3’ UTRs around 68%.

Similarity-Based Approach to Gene Prediction
• Genes in different organisms are similar
• The similarity-based approach uses known genes in one
genome to predict (unknown) genes in another genome
• Problem: Given a known gene and an unannotated genome
sequence, find a set of substrings of the genomic sequence
whose concatenation best fits the gene
• We try to identify small islands of similarity corresponding to
similarities between exons

Reverse Translation
• Given a known protein, find a gene in the genome which codes for it.
• One might infer the coding DNA of the given protein by reversing the
translation process
– Inexact: most amino acids map to more than one codon.
– This problem is essentially reduced to an alignment problem.
• This reverse translation problem can be modeled as traveling in a
Manhattan grid with free horizontal jumps
– Complexity of Manhattan is n^3
• Every horizontal jump models an insertion of an intron
• Problem with this approach: it would match nucleotides pointwise and
use horizontal jumps at every opportunity

Comparing Genomic DNA Against mRNA

[Figure: mRNA (codon sequence) aligned against a portion of the genome containing exon1, intron1, exon2, intron2, exon3]

Using Similarities to Find the Exon Structure
• The known frog gene is aligned to different locations in the human
genome
• Find the “best” path to reveal the exon structure of the human gene

[Figure: alignment grid with the known frog gene on one axis and the human genome on the other]

Finding Local Alignments

• Use local alignments to find all islands of similarity

[Figure: alignment grid with known frog genes on one axis and the human genome on the other]

Chaining Local Alignments

• Find substrings that match a given gene sequence (candidate
exons)
• Define a candidate exon as a triple
(l, r, w)
(left, right, weight defined as score of local alignment)
• Look for a maximum chain of substrings
– Chain: a set of non-overlapping nonadjacent intervals.

Exon Chaining Problem

[Figure: seven weighted intervals (weights 5, 5, 15, 9, 11, 4, 3) drawn over an axis with endpoints
0 2 3 5 6 11 13 16 20 25 27 28 30 32]

• Locate the beginning and end of each interval (2n points)
• Find the “best” path

Exon Chaining Problem: Formulation

• Exon Chaining Problem: Given a set of putative exons, find a
maximum set of non-overlapping putative exons.

• Input: a set of weighted intervals (putative exons).

• Output: A maximum chain of intervals from this set.


Exon Chaining Problem: Graph Representation

• This problem can be solved with dynamic programming in O(n) time.

Exon Chaining Algorithm

ExonChaining (G, n) // Graph, number of intervals
  for i ← 0 to 2n
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of the interval I
      w ← weight of the interval I
      s_i ← max {s_j + w, s_{i-1}}
    else
      s_i ← s_{i-1}
  return s_{2n}
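
A runnable sketch of the same dynamic program. It takes intervals as (left, right, weight) triples and works on the sorted endpoints. One deviation from the pseudocode is an assumption of this sketch: it looks up the score strictly before the left endpoint, so intervals that share an endpoint are never chained, which matches the requirement that chained intervals be non-overlapping and nonadjacent. Sorting the endpoints adds a log factor to the running time.

```python
def exon_chaining(intervals):
    """Return the maximum total weight of a chain of non-overlapping intervals.

    intervals: iterable of (left, right, weight) with left < right.
    """
    points = sorted({p for l, r, _ in intervals for p in (l, r)})
    index = {p: i + 1 for i, p in enumerate(points)}   # 1-based vertex numbers
    ending_at = {}
    for l, r, w in intervals:
        ending_at.setdefault(index[r], []).append((index[l], w))

    s = [0] * (len(points) + 1)                        # s[0] = 0 is the empty chain
    for i in range(1, len(points) + 1):
        s[i] = s[i - 1]                                # skip this endpoint
        for j, w in ending_at.get(i, []):
            s[i] = max(s[i], s[j - 1] + w)             # end an interval here
    return s[-1]

print(exon_chaining([(0, 3, 5), (2, 6, 4), (4, 8, 6), (9, 12, 3)]))
```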

Exon Chaining: Deficiencies

– Poor definition of the putative exon endpoints.

– Optimal chain of intervals may not correspond to any valid
alignment:
– First interval may correspond to a suffix, whereas second interval may
correspond to a prefix.
– Combination of such intervals is not a valid alignment.

Infeasible Chains
• Red local similarities form two non-overlapping intervals but do not
form a valid global alignment

[Figure: alignment grid with known frog genes on one axis and the human genome on the other]

Spliced Alignment
• Mikhail Gelfand and colleagues proposed a spliced alignment
approach that uses a protein from one genome to reconstruct the
exon-intron structure of a (related) gene in another genome.
– Begins by selecting either all putative exons between potential acceptor
and donor sites or by finding all substrings similar to the target protein
(as in the Exon Chaining Problem).
– This set is further filtered in a way that attempts to retain all true
exons, along with some false ones.

Spliced Alignment Problem: Formulation
• Goal: Find a chain of blocks in a genomic sequence that best
fits a target sequence
• Input: Genomic sequence G, target sequence T, and a set of
candidate exons B.
• Output: A chain of exons Γ such that the global alignment score
between Γ* and T is maximum among all chains of blocks from
B.
Γ* – the concatenation of all exons from chain Γ

Lewis Carroll Example


Spliced Alignment: Idea

• Compute the best alignment between the i-prefix of genomic sequence
G and the j-prefix of target T:
S(i,j)
• But what is the “i-prefix” of G?
• There may be a few i-prefixes of G depending on which block B we
are in.
• Compute the best alignment between the i-prefix of genomic sequence
G and the j-prefix of target T under the assumption that the alignment
uses the block B at position i:
S(i,j,B)

The score of a prefix alignment

• Given a position i, let Γ = (B_1, …, B_k, …, B_t) be a chain such that
some block B_k contains i.
• We define
$$\Gamma^*(i) = B_1 * B_2 * \cdots * B_k(i)$$
as the concatenation of B_1 … B_{k-1} and the i-prefix of B_k.
• Then
$$S(i, j, k) = \max_{\text{all chains } \Gamma \text{ containing block } B_k} s(\Gamma^*(i), T(j))$$
is the optimal score for aligning a chain of blocks up to position i in
G to the j-prefix of T.
• The values of this matrix can be computed using dynamic
programming.

Spliced Alignment Initialization

1. For j = 0, S(i, 0, B) corresponds to an alignment of the blocks in front of B
to gaps and of g_{first(B)} … g_i to gaps:
$$S(i, 0, B) = -\sum_{l=\mathrm{first}(B)}^{i} \text{indel penalty}$$
2. For j > 0 and when there does not exist a block preceding B:
$$S(i, j, B) = s(g_{\mathrm{first}(B)} \ldots g_i,\; t_1 \ldots t_j)$$
• first(B) = index of the first base of B

Spliced Alignment Recurrence

• If i is not the starting vertex of block B:
$$S(i, j, B) = \max \begin{cases} S(i-1, j, B) - \text{indel penalty} \\ S(i, j-1, B) - \text{indel penalty} \\ S(i-1, j-1, B) + \delta(g_i, t_j) \end{cases}$$
• If i is the starting vertex of block B:
$$S(i, j, B) = \max \begin{cases} S(i, j-1, B) - \text{indel penalty} \\ \max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j, B') - \text{indel penalty} \\ \max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j-1, B') + \delta(g_i, t_j) \end{cases}$$
• After computing the three-dimensional table S(i, j, B), the score of the
optimal spliced alignment is:
$$\max_{\text{all blocks } B} S(\mathrm{end}(B), \mathrm{length}(T), B)$$
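
The recurrence can be sketched in code as follows. Everything beyond the recurrence itself is an assumption made for illustration: blocks are given as inclusive 0-based (first, last) index pairs into G, a block B' precedes B when end(B') < first(B), scoring is +1 for a match, -1 for a mismatch and an indel penalty of 1, and a chain may start with any block (modelled by an empty predecessor that aligns the j-prefix of T to gaps). Only the last row of each block is kept, since later blocks need nothing else.

```python
def spliced_alignment(G, T, blocks, match=1, mismatch=-1, indel=1):
    """Return the best global alignment score between T and a chain of blocks of G."""
    m = len(T)

    def delta(a, b):
        return match if a == b else mismatch

    finished = []                       # (end index, last DP row) per processed block
    best = float("-inf")
    for first, last in sorted(blocks, key=lambda b: b[1]):
        # P[j]: best score of whatever may come before this block
        P = [-j * indel for j in range(m + 1)]            # empty chain
        for end, row in finished:
            if end < first:                               # block precedes this one
                P = [max(a, b) for a, b in zip(P, row)]
        prev = P
        for i in range(first, last + 1):
            cur = [prev[0] - indel]
            for j in range(1, m + 1):
                cur.append(max(cur[j - 1] - indel,        # gap in G
                               prev[j] - indel,           # gap in T
                               prev[j - 1] + delta(G[i], T[j - 1])))
            prev = cur
        finished.append((last, prev))
        best = max(best, prev[m])
    return best

G = "ACGTTTTTGCA"
print(spliced_alignment(G, "ACGGCA", [(0, 2), (4, 6), (8, 10)]))
```

In the usage line the chain of blocks (0, 2) and (8, 10) spells ACGGCA, so it aligns to the target without gaps or mismatches.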

Spliced Alignment: Complications

• Considering multiple i-prefixes leads to a slowdown. Running time:
$O(mn^2 |B|)$
where m is the target length, n is the genomic sequence length and
|B| is the number of blocks.

• A mosaic effect: short exons are easily combined to fit any target
protein

Spliced Alignment: Speedup

• $P(i, j) = \max_{\text{all blocks } B \text{ preceding position } i} S(\mathrm{end}(B), j, B)$

Exon Chaining vs Spliced Alignment

• In Spliced Alignment, every path spells out the string obtained by
concatenating the labels of its edges. The weight of the path is
defined as the optimal alignment score between the concatenated labels
(blocks) and the target sequence
– Defines the weight of an entire path in the graph, but not the weights of
individual edges.
• Exon Chaining assumes the positions and weights of exons are
pre-defined.

Gene Prediction: Aligning Genome vs. Genome
• Align entire human and mouse genomes.
• Predict genes in both sequences simultaneously as chains of
aligned blocks (exons).
• This approach does not assume any annotation of either human
or mouse genes.

Gene Prediction Tools

• GENSCAN/GenomeScan
• TwinScan
• Glimmer
• GeneMark

The GENSCAN Algorithm

• The algorithm is based on a probabilistic model of gene structure similar to
Hidden Markov Models (HMMs).
• GENSCAN uses a training set to estimate the HMM
parameters; the algorithm then returns the exon structure using the
maximum likelihood approach standard to many HMM algorithms
(the Viterbi algorithm).
– Biological input: codon bias in coding regions, gene structure (start and
stop codons, typical exon and intron length, presence of promoters,
presence of genes on both strands, etc.)
– Covers cases where the input sequence contains no gene, a partial gene,
a complete gene, or multiple genes.
• GENSCAN limitations:
– Does not use similarity search to predict genes.
– Does not address alternative splicing.
– Could combine two exons from consecutive genes together.

GenomeScan

• Incorporates similarity information into GENSCAN: predicts the gene
structure which corresponds to the maximum probability conditional
on similarity information
• The algorithm is a combination of two sources of information
– Probabilistic models of exons-introns
– Sequence similarity information

TwinScan

• Aligns two sequences and marks each base as gap (-),
mismatch (:), or match (|), resulting in a new alphabet of 12 letters:
Σ = {A-, A:, A|, C-, C:, C|, G-, G:, G|, T-, T:, T|}.
• Run the Viterbi algorithm using emissions e_k(b), where b ∈ Σ and k is a state.
• The emission probabilities are estimated from human/mouse
gene pairs.
– Ex. e_I(x|) < e_E(x|) since matches are favored in exons, and
e_I(x-) > e_E(x-)
since gaps (as well as mismatches) are favored in introns.
– Compensates for dominant occurrence of poly-A region in introns

http://www.standford.edu/class/cs262/Spring2003/Notes/ln10.pdf
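
The 12-letter alphabet can be illustrated with a small helper that annotates the target sequence from a pairwise alignment. This is only a sketch of the encoding step: the function name is hypothetical, both strings are assumed to be rows of the same alignment (equal length, '-' for gaps), and columns where the target itself has a gap are skipped.

```python
def conservation_sequence(target, informant):
    """Label every target base with '|' (match), ':' (mismatch) or '-' (gap)."""
    if len(target) != len(informant):
        raise ValueError("aligned rows must have the same length")
    symbols = []
    for t, q in zip(target.upper(), informant.upper()):
        if t == "-":
            continue                    # no target base in this column
        if q == "-":
            mark = "-"
        elif q == t:
            mark = "|"
        else:
            mark = ":"
        symbols.append(t + mark)
    return symbols

print(conservation_sequence("ACGT", "AG-T"))
```

Each returned symbol is one letter of Σ, which is what the emissions e_k(b) would be defined over.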

Glimmer

• Gene Locator and Interpolated Markov ModelER

• Finds genes in bacterial DNA
• Uses interpolated Markov Models
• Made of 2 programs
– BuildIMM
• Takes sequences as input and outputs the Interpolated Markov Models
(IMMs)
– Glimmer
• Takes IMMs and outputs all candidate genes
• Automatically resolves overlapping genes by choosing one, hence limited
• Marks “suspected to truly overlap” genes for closer inspection by user

GeneMark

• Based on non-stationary Markov chain models
• Results displayed graphically with coding vs. noncoding probability
dependent on position in nucleotide sequence

Bioinformatics
11 pages
MATH3353 Notes
No ratings yet
MATH3353 Notes
100 pages
Assignment 3
No ratings yet
Assignment 3
8 pages
Computational Approaches
No ratings yet
Computational Approaches
12 pages
The Science of Stem Cells
From Everand
The Science of Stem Cells
Jonathan M. W. Slack
No ratings yet
Adipocyte differentiation from the inside out
No ratings yet
Adipocyte differentiation from the inside out
12 pages
Peptides Info
No ratings yet
Peptides Info
37 pages
The Biopharmaceutical Classification System (BCS) : DR Mohammad Issa
No ratings yet
The Biopharmaceutical Classification System (BCS) : DR Mohammad Issa
24 pages
Optimizing-SDS-PAGE-for-Accurate-Protein-Characterization-in-Nutritional-Research-and-Food-Quality-Assessment
No ratings yet
Optimizing-SDS-PAGE-for-Accurate-Protein-Characterization-in-Nutritional-Research-and-Food-Quality-Assessment
39 pages
Appendix I: IA IUPAC Nucleotide Ambiguity Codes
No ratings yet
Appendix I: IA IUPAC Nucleotide Ambiguity Codes
2 pages
Biology Test Chapter 13 Test
No ratings yet
Biology Test Chapter 13 Test
4 pages
Retrovirus
No ratings yet
Retrovirus
23 pages
In Drug Discovery Dario Doller
No ratings yet
In Drug Discovery Dario Doller
53 pages
Human Leukocyte Antigens: Dr. B.Vijayasree 1 Year Post-Graduate Department of Microbiology
No ratings yet
Human Leukocyte Antigens: Dr. B.Vijayasree 1 Year Post-Graduate Department of Microbiology
30 pages
Respiration - Bio Project
100% (1)
Respiration - Bio Project
27 pages
Alcohol Liver Disease
No ratings yet
Alcohol Liver Disease
7 pages
Mpharm 2 Sem Advanced Biopharmaceutics and Pharmacokinetics mph202 2019
100% (1)
Mpharm 2 Sem Advanced Biopharmaceutics and Pharmacokinetics mph202 2019
1 page
DNA Libraries
No ratings yet
DNA Libraries
37 pages
f2012 Problem Set 5 ch7 KEY
No ratings yet
f2012 Problem Set 5 ch7 KEY
7 pages
الكيمياء الحياتية الجزء الأول
No ratings yet
الكيمياء الحياتية الجزء الأول
396 pages
Life Sustaining Processes & Phenomena - Cell Membrane
100% (1)
Life Sustaining Processes & Phenomena - Cell Membrane
4 pages
Railway Tickets
No ratings yet
Railway Tickets
1 page
Classwork Transport Across The Cell Membrane Upl
No ratings yet
Classwork Transport Across The Cell Membrane Upl
4 pages
Riot & Dance Textbook Sample
No ratings yet
Riot & Dance Textbook Sample
56 pages
1.1 - Prokarioti I Eukarioti
100% (1)
1.1 - Prokarioti I Eukarioti
20 pages
Exam 1 Key and Explanations
No ratings yet
Exam 1 Key and Explanations
8 pages
G10 SSLM Q4 W3 Berjes Evaluated Edited
No ratings yet
G10 SSLM Q4 W3 Berjes Evaluated Edited
5 pages
Cancers: Oncomine™ Comprehensive Assay v3 vs. Oncomine™ Comprehensive Assay Plus
No ratings yet
Cancers: Oncomine™ Comprehensive Assay v3 vs. Oncomine™ Comprehensive Assay Plus
19 pages
Microbial Sulfur Metabolism
No ratings yet
Microbial Sulfur Metabolism
329 pages
Cell Rap Lyrics
No ratings yet
Cell Rap Lyrics
5 pages
Biosensors Powerpoint
No ratings yet
Biosensors Powerpoint
26 pages


Central Dogma and Splicing

Gene prediction is hard

Approaches to gene finding

A. Statistical or ab initio methods: These methods attempt to predict

A. Statistical ab initio methods

• Coding segments (exons) have typical sequences on either end and

• This can then be used together in a log-likelihood scoring model in

A. Gene prediction in prokaryotes

• Promoter attracts RNA Polymerase to the transcription start site

A. Gene prediction in prokaryotes

• Upstream of the transcription start site (TSS; position 0) there are

Promoter Structure in Prokaryotes (E. coli)

Transcription starts at offset 0.

Open Reading Frames (ORFs)

• Detect potential coding regions by looking at ORFs

Open reading frame

Six Frames in a DNA Sequence

• stop codons – TAA, TAG, TGA
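The six-frame scan can be sketched as follows. This is a minimal illustration, not the slides' own procedure: it assumes an ORF runs from an ATG to the first in-frame stop codon, and the names `find_orfs`, `orfs_in_frame`, and the `min_codons` threshold are hypothetical. Coordinates reported for the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: only ATG is treated as a start codon
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) of ATG...stop stretches in one reading frame.

    `end` is exclusive and includes the stop codon.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            # Count codons before the stop codon (the ATG is included).
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def find_orfs(dna, min_codons=2):
    """Scan the three forward and three reverse-complement frames."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(seq, offset, min_codons):
                results.append((strand, offset, start, end, seq[start:end]))
    return results


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

In practice `min_codons` would be set much higher, since short ORFs occur frequently by chance.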

1. Evaluation of ORF length

2. ORF prediction – codon usage

I. Codon usage (see below)

II. Hexamer counts

2. ORF prediction – codon usage

Codon Usage in Human Genome

Codon Usage in Mouse Genome

AA codon /1000 frac AA codon /1000 frac
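The two columns in such a table, occurrences per 1000 codons and the fraction among synonymous codons, can be computed from any set of coding sequences. The sketch below assumes the standard genetic code and a single in-frame coding sequence as input; `codon_usage` is a hypothetical helper name, and the tiny input sequence is made up for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(coding_seq):
    """Return {codon: (amino_acid, per_thousand, fraction_within_amino_acid)}."""
    seq = coding_seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    codons = [c for c in codons if c in CODON_TABLE]  # skip ambiguous codons
    counts = Counter(codons)
    total = sum(counts.values())

    # Total count per amino acid, needed for the "frac" column.
    per_aa = Counter()
    for codon, n in counts.items():
        per_aa[CODON_TABLE[codon]] += n

    usage = {}
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        usage[codon] = (aa, 1000.0 * n / total, n / per_aa[aa])
    return usage


if __name__ == "__main__":
    for codon, row in sorted(codon_usage("ATGGCTGCTGCCTAA").items()):
        print(codon, row)
```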

3. ORF prediction – codon preference

• For each reading frame, a codon preference statistic at each

3. ORF prediction – codon preference

3. ORF prediction – codon preference

– if p_c = 1, c is used equally in a random sequence and in the codon

3. ORF prediction – codon preference

• When an amino acid is coded by codons c1, c2, …, ck, then obviously

4. ORF prediction – using Markov

A. Gene prediction in eukaryotes

Consensus splice sites – splicing

Splice site detection

• In the exon-intron junctions there is a large similarity to the

A simple HMM M for gene detection

• States are ‘in exon’ and ‘in intron’
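A two-state model of this kind can be decoded with the Viterbi algorithm. The sketch below is only an illustration: every probability in it is invented (a GC-rich "exon" state and an AT-rich "intron" state), and the function name `viterbi` and its parameter layout are assumptions rather than anything taken from the slides.

```python
import math

STATES = ("exon", "intron")


def viterbi(seq, start_p, trans_p, emit_p):
    """Return the most probable state path for `seq` (log-space Viterbi)."""
    if not seq:
        return []
    log = math.log
    scores = {s: log(start_p[s]) + log(emit_p[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: scores[r] + log(trans_p[r][s]))
            new_scores[s] = (scores[prev] + log(trans_p[prev][s])
                             + log(emit_p[s][symbol]))
            pointers[s] = prev
        back.append(pointers)
        scores = new_scores

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to make the example readable.
    start_p = {"exon": 0.5, "intron": 0.5}
    trans_p = {"exon": {"exon": 0.9, "intron": 0.1},
               "intron": {"exon": 0.1, "intron": 0.9}}
    emit_p = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
              "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
    print(viterbi("GCGCGCATATAT", start_p, trans_p, emit_p))
```

A real gene finder would use many more states (for example, separate states per codon position and for splice signals) and parameters estimated from annotated genes.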

A simple HMM M for gene detection

introns initial exons

A simple HMM M for gene detection

• p^k (1 – p) implies a geometric distribution, which does not correspond to

internal exons terminal exons

A simple HMM M for gene detection

Ribosomal Binding Site

TestCode Sample Output

Popular Gene Prediction Algorithms

• GENSCAN: uses modified Hidden Markov Models (HMMs) –

B. Comparative gene finding

Using Known Genes to Predict New

Similarity-Based Approach to Gene

• We try to identify small islands of similarity corresponding to

Comparing Genomic DNA Against

exon1 intron1 exon2 intron2 exon3

Using Similarities to Find the Exon

Finding Local Alignments

Frog Genes (known)

Chaining Local Alignments

Exon Chaining Problem

• Locate the beginning and end of each interval (2n points)

Exon Chaining Problem: Formulation

• Input: a set of weighted intervals (putative exons).

• Output: A maximum chain of intervals from this set.

Exon Chaining Problem: Graph

• This problem can be solved with dynamic programming in O(n) time, once the 2n interval endpoints are sorted.

Exon Chaining Algorithm
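The dynamic program over the 2n sorted endpoints can be sketched as below. This is one possible rendering under stated assumptions: intervals are `(left, right, weight)` tuples, two intervals are compatible only if one ends strictly before the other begins, and the function name `exon_chaining` and the example intervals are hypothetical. The sort makes the whole routine O(n log n); the sweep itself is linear.

```python
def exon_chaining(intervals):
    """Return (best_score, chain) for a list of (left, right, weight) tuples.

    The chain is a maximum-weight set of non-overlapping intervals,
    listed from left to right.
    """
    n = len(intervals)
    events = []
    for k, (left, right, _weight) in enumerate(intervals):
        events.append((left, 0, k))   # kind 0: left end
        events.append((right, 1, k))  # kind 1: right end
    # At equal positions left ends sort first, so an interval starting at x
    # cannot be chained after one that ends at x.
    events.sort()

    best_score = 0            # best chain weight among intervals closed so far
    best_last = None          # index of the last interval in that chain
    score_before = [0] * n    # best_score seen at each interval's left end
    prev = [None] * n         # best_last seen at each interval's left end

    for _pos, kind, k in events:
        if kind == 0:
            score_before[k] = best_score
            prev[k] = best_last
        else:
            candidate = score_before[k] + intervals[k][2]
            if candidate > best_score:
                best_score = candidate
                best_last = k

    # Follow the predecessor links back to recover the chain.
    chain = []
    k = best_last
    while k is not None:
        chain.append(intervals[k])
        k = prev[k]
    chain.reverse()
    return best_score, chain


if __name__ == "__main__":
    putative_exons = [(1, 5, 5), (2, 3, 3), (4, 8, 6), (6, 12, 10),
                      (9, 10, 1), (11, 15, 7), (13, 14, 0), (16, 18, 4)]
    print(exon_chaining(putative_exons))
```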

Exon Chaining: Deficiencies

– Poor definition of the putative exon endpoints.

Spliced Alignment Problem:

Lewis Carroll Example

Spliced Alignment: Idea

• Compute the best alignment between the i-prefix of the genomic sequence

• But what is “i-prefix” of G?

Γ(i) = B1 * B2 * . . . * Bk(i)