
Exons and Introns of Eukaryotic Genes

The document presents a new statistical distance measure for analyzing exons and introns in eukaryotic genes, which improves the segregation of these regions compared to existing methods. The proposed measure utilizes logarithm transformation of the dot product of probability vectors to capture the dependencies between nucleotide bases. Experimental results demonstrate its effectiveness in distinguishing coding and non-coding regions of genes, showing clear distinctions between introns and exons.

International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB16)

A New Statistical Distance Useful for Analyzing Exons and Introns of Eukaryotic Genes
Uddalak Mitra1 and Balaram Bhattacharyya2
1 2
Professor, Research Scholar, Department of Computer and System Sciences, Visva-Bharati University,
Santiniketan - 731205, India.
[email protected], [email protected]

Abstract: Segregation of exons and introns from a gene sequence is an important issue in simulating transcription. Although various computational efforts based on probabilistic approaches have been made to discriminate the regions, lack of accuracy remains the principal obstacle to achieving perfection. We propose a statistical distance measure using logarithm transformation of the dot product of two probability vectors. Experimental results show that the distance measure segregates genes with better certainty than other existing methods, even for genes having very few exons and introns.

Index Terms: Computational Biology, Gene prediction, Information Theory, Mathematical transformation, Statistical distance measure.

I. INTRODUCTION

GENE prediction, or gene finding, is the process of identifying specific regions of genomic DNA [1]. Gene prediction includes the recognition of protein-coding genes and non-coding RNA and the segregation of exons and introns [2] in a gene, but may also include finding other functional elements such as regulatory regions and untranslated regions (UTR) [3]. In a eukaryotic cell the genomic DNA is the principal DNA. After a genome has been sequenced, gene finding is the first and most important step in understanding the structure of the genome. Identifying the correct genes and determining their functions still demands in vivo experimentation, although bioinformatics research is making it increasingly possible to isolate gene sequences and predict the functions of genes from the sequence alone.

Because gene sequences, being a distribution of nucleotides, can be seen as a repository of the biological information necessary for activity in an organism, statistical and information theoretic approaches are likely to be relevant for their analysis [4].

It has been shown in [5], on the basis of Shannon entropy, that gene sequences have more randomness than human language and computer programming languages. Another important feature, long range correlation between nucleotide sequences, has been investigated in [6]. A recognized statistical feature, the non-uniform codon usage of coding regions, is examined in [7]. A segmentation method for exons and introns based on entropy is described in [8]. The authors of [9] have used an alternative definition of information that is found to be useful for analyzing coding and noncoding regions of genes. Another information theoretic approach, average mutual information [10], has been used to recognize the protein-coding regions of genes for which training sets are not available.

Features of genetic segments are diverse and are too difficult to reconcile perfectly with the existing approaches. An appropriate statistical measure is thus required to capture features of gene segments for their distinction. Complex patterns of exons and introns in eukaryotic genes make it more difficult to predict regions of interest than in prokaryotic genes. Exons and introns are highly diverse in size and organization and have no typical structure. However, they contain several conserved features. Such conserved features can be used as discriminatory statistics between exons and introns of eukaryotic genes. This is the key concept for segregation of the regions.

We formulate a new statistical distance measure for determining dependency between two statistical objects. A probability mass function (PMF) is used to represent patterns of the objects. We apply the measure to extract dependencies of a nucleotide base with other bases at k locations downstream and find that the measured dependencies are distinctly different for exon regions than for introns, thereby capturing the discriminatory feature. Although downstream is the natural direction in a nucleotide sequence, the measure is equally applicable upstream.

II. DERIVATION OF THE DISTANCE MEASURE

Measurement of dependency between two statistical objects is important, with wide applicability in anthropology, biology, physics, chemistry, computer science, ecology, physiology, etc. The objects may be two random variables, two sample spaces or two population spaces [11-13]. A distance or dependency measure is a quantitative indication of how far apart two statistical objects are. Statistical distances satisfying the properties of non-negativity, symmetry and the triangle inequality are called metrics; otherwise they are statistical divergences.
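
For reference, the three metric properties named above are commonly written as follows. This is the standard textbook formulation; the symbols $d$ (a candidate distance) and $R$ (a third statistical object) are notation introduced here for illustration and do not appear in the paper.

$$
d(P,Q) \ge 0, \qquad d(P,Q) = d(Q,P), \qquad d(P,R) \le d(P,Q) + d(Q,R)
$$

A measure that violates any one of these conditions, typically the triangle inequality, falls into the divergence category.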

978-1-4673-9745-2 ©2016 IEEE



A PMF can be regarded as a vector whose elements are points in Euclidean space and sum to 1; such a vector is termed a probability vector. A probability vector P can be written as $P = [p_1, p_2, p_3, \dots, p_n]$ in standard notation with the assumption $\sum_{i=1}^{n} p_i = 1$. Individual components lie between 0 and 1, $0 \le p_i \le 1$. The algebraic form of the dot product of $P = [p_1, p_2, p_3, \dots, p_n]$ and $Q = [q_1, q_2, q_3, \dots, q_n]$ can be defined as

$$P \cdot Q = \sum_{i=1}^{n} p_i q_i = p_1 q_1 + p_2 q_2 + \dots + p_n q_n \tag{1}$$

The corresponding geometric definition of the dot product is given by

$$P \cdot Q = \|P\| \, \|Q\| \cos\Theta \tag{2}$$

where $\Theta$ is the angle between the probability vectors and $\|P\|$ and $\|Q\|$ are the magnitudes of the vectors P and Q respectively. When the angle is 90 degrees the probability vectors are orthogonal, and when it is 0 degrees they are collinear. With P and Q being probability vectors, $P \cdot Q$ can be interpreted as a statistical distance between the probability distributions. But the simple $P \cdot Q$ is not powerful enough to capture discriminatory features of sequences containing complex inherent statistical patterns.

Statistical data transformation is a procedure that mathematically modifies the values of a variable using transformation operators. Statistical operators are based on the assumption that the variables are normally distributed. Minor violations of this assumption increase the chances of committing Type I or Type II errors. True normality is exceedingly rare, and data transformations are therefore needed to approach normality and so reduce errors. The task is thus to design an appropriate operator for data transformation.

Among the various transformation operations, square root and logarithm are widely used. We observe that the square root makes values above 1.00 smaller and values between 0.00 and 0.99 larger. Logarithm transformation, on the contrary, maintains the functional pattern of the variables on which the transformation is applied [14]. A logarithm transformation on the dot product of two probability vectors produces the distance measure

$$D(P\|Q) = \sum_{i=1}^{n} \log\bigl(1 - \log(p_i q_i)\bigr) = \log(1 - \log(p_1 q_1)) + \log(1 - \log(p_2 q_2)) + \dots + \log(1 - \log(p_n q_n)) \tag{3}$$

The base of the logarithm can be any standard base such as e, 2 or 10. As data transformation techniques do not change the intuitive meaning of the operands and the product values, the quantity $D(P\|Q)$ can be interpreted as the logarithm transformed dot product of the probability vectors P and Q.

A gene can be viewed as a sequence of nucleotide bases, and patterns of the bases may be considered as a manifestation of the joint PMF $f_k(x,y) = P_k(X=x, Y=y)$ and the product of the marginal PMFs, $f(x) f(y) = P(X=x) P(Y=y)$, of two random variables X and Y, where the subscript k on the joint PMF indicates that the nucleotides x and y occur k locations apart. Let $N_k(x,y)$ be the number of times two nucleotides k bases apart take the values x and y, where x and y can be A, C, G or T. The joint probability $P_k(X=x, Y=y)$ can be estimated by

$$P_k(X=x, Y=y) = \frac{N_k(x,y)}{\sum_{x \in \{A,C,G,T\}} \sum_{y \in \{A,C,G,T\}} N_k(x,y)} \tag{4}$$

The marginal probability $P(X=x)$ can be estimated by dividing the total number of times nucleotide x occurs by the total number of bases in the sequence. The difference between the joint PMF and the product of the marginal PMFs determines the mutual dependency of two random variables. The task thus boils down to measuring the difference between the joint PMF and the product of the two marginal PMFs. Application of Eq. 3 yields the measurement

$$D\bigl(P_k(X=x, Y=y) \,\|\, P(X=x) P(Y=y)\bigr) = \sum_{i=1}^{16} \log\Bigl(1 - \log\bigl(P_k(X=x, Y=y) \, P(X=x) \, P(Y=y)\bigr)\Bigr) \tag{5}$$

where the index i runs over the 16 ordered pairs (x, y) of nucleotides.

III. PROPERTIES OF THE MEASURE

The measurement satisfies the following properties of a statistical distance measure.

A. Non-negativity: Since $0 \le p_i \le 1$ and $0 \le q_i \le 1$, their product satisfies $0 \le p_i q_i \le 1$. The quantity $p_i q_i$ is thus a fraction, and the logarithm of a fraction is never positive, so the term $1 - \log(p_i q_i)$ is at least 1. Each term $\log(1 - \log(p_i q_i))$ is therefore non-negative, and a sum of non-negative terms is non-negative. Thus our statistical measurement is always non-negative.

B. Semi-definiteness of the measurement: When all the $p_i$ go to zero and the $q_i$ are non-negative real numbers, all the terms $\log(1 - (p_i q_i))$ become zero, and vice versa. Hence we can take the minimum value of the measurement to be zero. As the distance measure is a statistical distance and actually measures the dependence between two statistical objects, we can state that the maximum distance occurs in the case of statistical independence. Hence the measurement is semi-definite.

C. Symmetry: As multiplication of two real numbers is commutative, its logarithm is symmetric, so

$$\sum_{i=1}^{n} \log\bigl(1 - \log(p_i q_i)\bigr) = \sum_{i=1}^{n} \log\bigl(1 - \log(q_i p_i)\bigr) \tag{6}$$
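
As an illustration of Eq. 3 and of properties A and C, the measure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `log_dot_distance` is hypothetical, base 2 is chosen arbitrarily (the text allows any standard base), and terms with a zero product are skipped because the logarithm of zero is undefined, a case the paper does not address.

```python
import math


def log_dot_distance(p, q, base=2.0):
    """Log-transformed dot product of Eq. 3: sum_i log(1 - log(p_i * q_i)).

    Assumptions (not from the paper): base-2 logarithms by default, and
    terms with p_i * q_i == 0 are skipped because log(0) is undefined.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        prod = pi * qi
        if prod <= 0.0:
            continue  # assumption: ignore undefined terms
        # log(prod) <= 0 for prod in (0, 1], so the inner value is >= 1
        # and the outer logarithm is >= 0; it is 0 only when prod == 1.
        total += math.log(1.0 - math.log(prod, base), base)
    return total


# Two illustrative probability vectors (made-up numbers).
p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]

d_pq = log_dot_distance(p, q)
d_qp = log_dot_distance(q, p)
print("D(P||Q) =", d_pq)
print("D(Q||P) =", d_qp)
```

Swapping the arguments leaves the value unchanged because each product p_i q_i is commutative, which is the symmetry stated in Eq. 6, and every term is non-negative as argued in property A.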


IV. RESULT AND DISCUSSION

The proposed distance measure is applied as a statistical tool to analyze the coding and non-coding regions of genes. Human chromosome data are taken from the site http://www.ensembl.org/Homo_sapiens for study. The gene HGNC:462 in human chromosome Y codes for the protein amelogenin, which is involved in amelogenesis, the development of enamel. Its first transcript, ENST00000215479, has 6 exons (coding regions) and 5 introns (non-coding regions). We have concatenated the 6 exon regions to form the contiguous protein coding sequence. The same is done for the 5 introns. The dependency measure is then calculated for each of the combined coding and non-coding sections. The results for values of k between 1 and 20 are presented in Figure 1, which shows a clear distinction between introns and exons.

Fig. 1. Patterns of statistical distance (y-axis) against nucleotide distance (x-axis) of the gene HGNC:462

To study the efficiency of the proposed distance measure in segregating introns and exons, we further consider all genes in chromosome Y. Thus for each gene we have 20 dependency values for its concatenated coding regions and the same for the non-coding regions. The average dependency value is computed for the concatenated exon region of each gene to construct a probability density function. The corresponding computation is done for the concatenated introns. The result is interesting: there is a clear distinction between the two distributions, without any overlap (Figure 2).

Fig. 2. Probability density function of exons and introns

A comparative study has been carried out to assess the efficiency of the proposed distance measure over Mutual Information, Bhattacharyya Distance (Figure 3 and Figure 4) and Superinformation (Table 1). Experiments are conducted on 20 randomly selected genes from each of human chromosomes 1, 10 and 19. It may be noted in the table that the minimum distance for exons differs considerably from the maximum distance for introns, resulting in distinct segregation of introns and exons. This is an achievement over existing measures. It is worth mentioning that for all values of k ≤ L/2, where L is the length of a sequence, there is clear segregation of the dependency values, calculated using the proposed distance measure, for exon and intron regions.

TABLE I
COMPARISON OF RESULTS BETWEEN SUPERINFORMATION AND PROPOSED MEASURE

Gene ID      Region    Range of Superinformation measure   Range of proposed distance measure
HGNC1809     exons     3.1536-3.8518                       50.8851-51.0297
             introns   1.5000-2.3219                       44.2350-47.8641
HGNC1810     exons     2.9140-3.6899                       50.8746-50.0425
             introns   0.8113-2.3219                       44.1843-47.8080
HGNC11311    exons     2.1556-2.9477                       50.7741-50.8235
HGNC11882    exons     3.1751-3.7464                       50.7756-50.8662
             introns   2.1556-3.0000                       47.3689-47.4801
HGNC12668    exons     1.9219-2.5850                       50.9062-51.0074
             introns   1.0000-1.0000                       44.5161-48.7337
HGNC14024    exons     1.5059-3.0062                       50.7396-50.9839
             introns   4.4108-4.6452                       47.4614-47.6848
HGNC38732    exons     2.4194-3.1699                       50.8256-50.9436
             introns   3.0531-3.6566                       47.2825-47.4517
HGNC18568    exons     2.6924-3.3750                       51.0122-51.2085
             introns   1.5000-2.3219                       48.5335-48.7282
HGNC462      exons     2.2500-3.1699                       50.9292-51.2904
             introns   4.3090-4.6640                       47.6975-47.8937
HGNC37464    exons     2.0588-2.8074                       50.9581-51.1611
             introns   2.7899-3.6250                       47.4404-47.6399
HGNC37473    exons     2.5503-3.4183                       50.8754-50.9689
             introns   2.0875-3.0761                       47.3018-47.4325
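
The procedure described in this section (concatenate the segments, evaluate Eq. 5 for each k from 1 to 20, then average) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names `dependency` and `dependency_profile` are hypothetical, the sequences are short made-up strings rather than Ensembl data, and pairs whose probability product is zero are skipped, which is an assumption the paper does not state. Base 2 is also an assumption; it is used because, with uniform joint and marginal probabilities, each of the 16 terms then equals log2(9) ≈ 3.17 and the sum is about 50.7, which is on the same scale as the values in Table I.

```python
import math
from collections import Counter

BASES = "ACGT"


def dependency(seq, k, base=2.0):
    """Eq. 5 for lag k: sum over the 16 pairs (x, y) of
    log(1 - log(P_k(x, y) * P(x) * P(y))).

    Assumptions (not from the paper): base-2 logarithms, non-ACGT symbols
    are ignored, and pairs with a zero probability product are skipped.
    """
    seq = seq.upper()
    # N_k(x, y): how often base x is followed by base y exactly k positions on.
    pair_counts = Counter(
        (seq[i], seq[i + k])
        for i in range(len(seq) - k)
        if seq[i] in BASES and seq[i + k] in BASES
    )
    base_counts = Counter(ch for ch in seq if ch in BASES)
    total_pairs = sum(pair_counts.values())
    total_bases = sum(base_counts.values())
    if total_pairs == 0 or total_bases == 0:
        raise ValueError("sequence too short for the requested lag k")

    d = 0.0
    for x in BASES:
        for y in BASES:
            joint = pair_counts[(x, y)] / total_pairs          # Eq. 4
            marginal_x = base_counts[x] / total_bases
            marginal_y = base_counts[y] / total_bases
            prod = joint * marginal_x * marginal_y
            if prod <= 0.0:
                continue  # assumption: skip undefined log(0) terms
            d += math.log(1.0 - math.log(prod, base), base)
    return d


def dependency_profile(segments, max_k=20):
    """Concatenate the segments and evaluate Eq. 5 for k = 1 .. max_k."""
    combined = "".join(segments)
    return [dependency(combined, k) for k in range(1, max_k + 1)]


# Toy exon fragments (made up for illustration, not real gene data).
exons = ["ATGGCCATTGTAATGGGCCGCTG", "AAAGGGTGCCCGATAG"]
profile = dependency_profile(exons, max_k=5)
mean_dependency = sum(profile) / len(profile)
print("dependency per k:", profile)
print("average dependency:", mean_dependency)
```

Running the same two calls on the concatenated introns of a gene would give the second profile to compare against, and the per-gene averages are the quantities from which the density functions of Figure 2 are built.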


V. CONCLUSION
In this paper we have proposed a new statistical distance measure based on the logarithm transformation of the dot product of two probability vectors. An important application of the measure is the computation of probability density functions of parts of DNA sequences with a view to distinguishing the selected parts. Experiments are conducted on gene sequences of several human chromosomes. The distance measure is found to be efficient, outperforming some other existing distance measures in segregating intron and exon regions of human gene sequences.

Fig. 3. Patterns of statistical distance (y-axis) against nucleotide distance (x-axis) of combined coding and non-coding regions of the genes from Chromosomes 1 and 10

Fig. 4. Patterns of statistical distance (y-axis) against nucleotide distance (x-axis) of combined coding and non-coding regions of the genes from Chromosomes 19 and Y

REFERENCES

[1] R. D. Sleator, An overview of the current status of eukaryote gene prediction strategies. Gene. 2010 Aug 1; 461(1-2): 1-4. doi: 10.1016/j.gene.2010.04.008. Epub 2010 Apr 27.
[2] S. Ohno, Brookhaven Symp. Biol. 23, 366 (1972).
[3] J. Wang, J. Kudoh, A. Shintani, S. Minoshima, and N. Shimizu, Biochem. Biophys. Res. Commun. 250, 704 (1998).
[4] Ganna Leonenko, Sietse O. Los and Peter R. J. North, Statistical Distances and Their Applications to Biophysical Parameter Estimation: Information Measures, M-Estimates, and Minimum Contrast Methods, Remote Sens. 2013, 5, 1355-1388; doi:10.3390/rs5031355
[5] A. O. Schmitt and H. Herzel, J. Theor. Biol. 1888, 369 (1987).
[6] C. K. Peng, S. Buldyrev, A. Goldberger, S. Havlin, F. Sciortino, M. Simons, and H. E. Stanley, Nature (London) 356, 168 (1992).
[7] R. Grantham, C. Gautier, M. Gouy, M. Jacobzone, and R. Mercier, Nucleic Acids Res. 9, R43 (1981).
[8] P. Bernaola-Galvan, I. Grosse, P. Carpena, J. L. Oliver, R. Roman-Roldan, and H. E. Stanley, Phys. Rev. Lett. 85, 1342 (2000).
[9] Ranjan Bose and Sonali Chouhan, Physical Review E 83, 051918 (2011).
[10] I. Grosse, V.B. Sergey, S.H. Eugene (2000), Average Mutual Information of Coding and Noncoding DNA, Pacific Symposium on Biocomputing 5: 611-620.
[11] A. Bhattacharyya (1943), "On a measure of divergence between two statistical populations defined by their probability distributions". Bulletin of the Calcutta Mathematical Society 35: 99-109. MR 0010358
[12] P. C. Mahalanobis (1936), On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India 2 (1): 49-55. Retrieved 2012-05-03.
[13] Sung-Hyuk Cha (2007), Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences, Issue 4, Volume 1.
[14] Jason Osborne (2002), Notes on the use of data transformations. Practical Assessment, Research & Evaluation, 8(6).
