Entropy-Based Algorithm For Discretization: April 2011
Abstract
Discretization is a common process in data mining applications that transforms quantitative data into qualitative data. Different methods have been proposed to achieve this transformation. The cornerstone of any discretization algorithm is finding potential cut-points that split a continuous range of values into nominal values, so discretization methods vary mainly in how these cut-points are found. The entropy-based method is one such approach, relying on the information entropy measure.
In this paper, the aim is to apply the entropy-based method to discretization with a proposed algorithm. The algorithm attempts to find suitable cut-points through new concepts: it combines the information entropy measure with a statistical tool over several steps. The practical work was executed on experimental data downloaded from the UCI data repository.
1- Introduction
Discretization algorithms can be grouped into several families, among them statistics-based algorithms, class-attribute interdependency algorithms, and clustering-based algorithms[4].
In recent years, extended supervised discretization algorithms have emerged, known as "semi-supervised" discretization algorithms. These algorithms are based on reducing the information needed to execute supervised algorithms; an example is the non-parametric semi-supervised discretization method based on the MODL framework ("Minimal Optimized Description Length")[1].
2- Discretization algorithm
A discretization algorithm is the set of steps required to transform continuous values into discrete values. It aims to find the cut-points. The term "cut-point" refers to a real value within the range of continuous values that divides the range into two intervals: one interval is less than or equal to the cut-point, and the other interval is greater than the cut-point[4].
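As an illustration of this definition, the following minimal Python sketch splits a continuous attribute at a single cut-point (the values and the cut-point 33.5 are hypothetical, chosen only for the example):

    # Applying a single cut-point to a continuous attribute:
    # values <= cut-point fall in the first interval,
    # values >  cut-point fall in the second.
    ages = [26, 31, 34, 45, 52, 61]   # example continuous values
    cut_point = 33.5                  # hypothetical cut-point

    left  = [v for v in ages if v <= cut_point]   # interval (-inf, cut_point]
    right = [v for v in ages if v >  cut_point]   # interval (cut_point, +inf)

    print(left)   # [26, 31]
    print(right)  # [34, 45, 52, 61]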
Information is high for less probable events and low otherwise; hence, the entropy H is highest when all events are equi-probable[5]. Discretization methods use entropy measures to evaluate candidate cut-points. This means that an entropy-based method uses the class information entropy of candidate partitions to select boundaries for discretization. Class information entropy is a measure of purity: it measures the amount of information that would be needed to specify to which class an instance belongs[3].
Definition 1: Let T partition the set S of examples into the subsets S1, ..., Sn. Let there be k classes C1, ..., Ck, and let p(Ci, Sj) be the proportion of examples in Sj that have class Ci. The class entropy Ent() of a subset Sj is defined as[7]:
Ent(Sj) = - Σi p(Ci, Sj) log(p(Ci, Sj)) ......... (3)
It has been shown that optimal cut-points for entropy minimization must lie between examples of different classes.
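A small Python sketch of Definition 1 (a direct reading of equation (3); base-2 logarithm is an assumption, as the paper does not fix the base, and the helper name and example labels are illustrative only):

    import math
    from collections import Counter

    def class_entropy(labels):
        """Ent(Sj) = -sum_i p(Ci, Sj) * log2 p(Ci, Sj), written as
        sum_i p * log2(1/p) over the classes present in Sj."""
        n = len(labels)
        return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

    # A pure subset has entropy 0; an equi-probable one is maximal (1 bit here).
    print(class_entropy(["yes", "yes", "yes", "yes"]))  # 0.0
    print(class_entropy(["yes", "yes", "no", "no"]))    # 1.0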
- Merge the subsets that produce the minimum total Ent and standard deviation
- Compute Ent(I) for the new subsets

Figure(1) The proposed algorithm
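Since only the final steps of Figure(1) survive, the following Python sketch is a hedged reconstruction of the split-merge loop as the surrounding text describes it: partitions whose entropy exceeds a threshold are split, then adjacent pairs are merged whenever that lowers the total entropy plus the standard deviation of the per-partition entropies. The function names, the midpoint split rule, and the threshold value are assumptions, not the authors' code; class_entropy is the helper sketched above.

    import statistics

    def split_merge(partitions, labels_of, threshold=0.06):
        """Hedged sketch of the proposed split and merge steps.

        partitions: list of (lo, hi) intervals from the primary equal-width step.
        labels_of:  function mapping an interval to the class labels inside it.
        """
        # Splitting: any partition whose class entropy exceeds the threshold
        # is split at its midpoint (the paper's exact split rule is not shown).
        split = []
        for lo, hi in partitions:
            if class_entropy(labels_of((lo, hi))) > threshold:
                mid = (lo + hi) / 2
                split += [(lo, mid), (mid, hi)]
            else:
                split.append((lo, hi))

        # Merging: greedily merge the adjacent pair that yields the lowest
        # total entropy plus standard deviation of per-partition entropies.
        def score(parts):
            ents = [class_entropy(labels_of(p)) for p in parts]
            return sum(ents) + statistics.pstdev(ents)

        improved = True
        while improved and len(split) > 2:
            candidates = []
            for i in range(len(split) - 1):
                merged = split[:i] + [(split[i][0], split[i + 1][1])] + split[i + 2:]
                candidates.append((score(merged), merged))
            best_score, best = min(candidates, key=lambda t: t[0])
            if best_score < score(split):
                split = best
            else:
                improved = False
        return split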
Attribute Information: the experiments below use the continuous attribute AGE and the nominal attributes SEX, ASCITES, and VARICES.
All the nominal attributes in the data set were candidates to be the class attribute and were tried in the primary discretization step of the proposed algorithm.
Partition   AGE interval   Entropy
1           -26            0.05898433
2           27-33          0.04847932
3           34-40          0.05439528
4           41-47          0.04785283
5           48-54          0.04785283
6           55-            0.05578864

Table(2) Partitioning attribute AGE into intervals according to class attribute SEX
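The primary step uses equal-width partitioning. A minimal sketch of that step (plain Python, not the authors' tooling; the AGE range 7-70 is hypothetical, chosen so that width 7 reproduces the 9 partitions reported below):

    def equal_width_partitions(values, width=7):
        """Primary discretization: fixed-width intervals covering the value range."""
        lo, hi = min(values), max(values)
        edges = []
        edge = lo
        while edge < hi:
            edges.append(edge)
            edge += width
        edges.append(hi)
        return list(zip(edges[:-1], edges[1:]))  # [(lo, lo+7), (lo+7, lo+14), ...]

    # With a hypothetical AGE range of 7-70 and width 7 we get 9 partitions,
    # the count reported for the ASCITES experiment.
    print(len(equal_width_partitions(range(7, 71), width=7)))  # 9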
The data resulting from the equal-width partition (width 7) for the "ASCITES" class attribute form 9 partitions. These data were then passed to the next discretization step, "splitting". Only partition (5) was a candidate for splitting, because its entropy exceeds the threshold(s). After splitting, the merging step was executed. The partitions that were merged are 1 and 2, 9 and 10, then 4 and 5, and 6 and 7. The result is 6 partitions, as shown in table(3).
Partition   AGE interval   Entropy
3           34-41          0.055553853
4           42-47          0.079202133
5           48-54          0.047852827
6           55-            0.055788643

Table(3) Partitioning attribute AGE into intervals according to class attribute ASCITES
Finally, the attribute "VARICES" was the candidate class attribute, and its data were obtained from the equal-width partition (width 15). The resulting data were then passed to splitting. Only partition (5) was a candidate for splitting, because its entropy exceeds the threshold(s). After splitting, the merging step was executed; the partitions that were merged are 1 and 2, 9 and 10, then 4 and 5, and 6 and ... The results are shown in table(4).
Partition   AGE interval   Entropy
6           48-51          0.052859411
7           52-            0.029567632

Table(4) Partitioning attribute AGE into intervals according to class attribute VARICES
4-3 Discussion
1- The results obtained from the primary discretization step reveal the differences between attributes with respect to the computed entropy values, so this step is useful for choosing the attribute that is a candidate to be the class attribute; the three attributes with the minimum entropy values were chosen.
2- The practical work in this paper experimented with a randomly varied width for the intervals to be produced.
3- The primary step can be terminated when the entropy value becomes lower, or when the standard deviation is not improved by using a different width (this case can be noticed with the VARICES attribute), as shown in table(1).
5- Conclusion
The execution of the proposed algorithm revealed some facts:
1- The training examples must have a good distribution of the continuous attribute to be discretized in order to produce good results.
2- It is convenient to consider the standard deviation of the entropy values over the total partitions in order to prevent the ...
3- Using primary discretization with the "equal-width" method is suitable for decreasing the number of splitting iterations.
References
[1] Bondu, A., Boulle, M., Lemaire, V., Loiseau, S. and Duval, B., 2008, "A Non-parametric Semi-supervised Discretization Method", ICDM '08: Proceedings of the Eighth IEEE International Conference on Data Mining, IEEE Computer Society, Washington, DC, USA.
[3] Kotsiantis, S. and Kanellopoulos, D., 2006, "Discretization Techniques: A recent survey", GESTS International Transactions on Computer Science and Engineering, vol. 32(1), pp. 47-58.
[4] Kurgan, L. A. and Cios, K. J., 2004, "CAIM Discretization Algorithm", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 2.
[5] Liu, H., Hussain, F., Tan, C. L. and Dash, M., 2002, "Discretization: An Enabling Technique", Data Mining and Knowledge Discovery, no. 6, pp. 393-423, Kluwer Academic Publishers, Netherlands.
[6] Muhlenbach, F. and Rakotomalala, R., 2005, "Discretization of Continuous Attributes", Encyclopedia of Data Warehousing and Mining.