Discretization Techniques: A Recent Survey
1 Introduction
Many Machine Learning (ML) algorithms are known to produce better models by discretizing continuous attributes [14]. The Naive Bayes (NB) classifier requires the estimation of probabilities, and continuous explanatory attributes are not easy to handle, as they often take too many distinct values for a direct estimation of frequencies. To circumvent this, a normal distribution of the continuous values can be assumed, but this hypothesis is not always realistic. The same phenomenon leads rule extraction techniques to build poorer sets of rules. Decision tree (DT) algorithms carry out a selection process over nominal attributes and cannot handle continuous ones directly. As a result, a large number of ML and statistical techniques can only be applied to data sets composed entirely of nominal variables. However, a very large proportion of real data sets include continuous variables, that is, variables measured at the interval or ratio level. One solution to this problem is to partition numeric variables into a number of sub-ranges and treat each such sub-range as a category. This process of partitioning continuous variables into categories is usually termed discretization. Unfortunately, the number of ways to discretize a continuous attribute is infinite. Discretization is a potential time-consuming bottleneck, since the number of possible discretizations is exponential in the number of interval threshold candidates within the domain [14].
The goal of discretization is to find a set of cut-points that partition the range into a small number of intervals with good class coherence, which is usually measured by an evaluation function. In addition to maximizing the interdependence between class labels and attribute values, an ideal discretization method should have the secondary goal of minimizing the number of intervals without significant loss of class-attribute mutual dependence. Discretization is usually performed prior to the learning process, and it can be broken into two tasks. The first task is to find the number of discrete intervals. Only a few discretization algorithms perform this; often, the user must specify the number of intervals or provide a heuristic rule. The second task is to find the width, or the boundaries, of the intervals given the range of values of a continuous attribute.
Usually, in the discretization process, after sorting the data in ascending or descending order with respect to the variable to be discretized, landmarks must be chosen over the whole dataset. In general, the algorithm for choosing landmarks can be either top-down, which starts with an empty list of landmarks and splits intervals, or bottom-up, which starts with the complete list of all the values as landmarks and merges intervals. In both cases there is a stopping criterion, which specifies when to stop the discretization process. Researchers in the ML community have introduced many discretization algorithms. Most of these algorithms perform an iterative greedy heuristic search in the space of candidate discretizations, using different types of scoring functions for evaluating a discretization. A recent overview of various types of discretization algorithms can be found in [28]. Thus, in this work, apart from a brief description of the discretization process, we refer to some works more recent than those in Liu's article, as well as a few articles that were not referred to by Liu.
The next section covers a wide range of issues in the discretization process. The categories of discretization methods are analysed in section 3, whereas the combination of discretization methods with decision trees (DTs), rule learners (RLs) and Bayesian networks (BNs) is presented in section 4. Finally, the last section concludes this work.
2 Discretization process
The term “cut-point” refers to a real value within the range of continuous values that divides the range into two intervals: one interval is less than or equal to the cut-point and the other interval is greater than the cut-point. For example, a continuous interval [a, b] is partitioned into [a, c] and (c, b], where the value c is a cut-point. A cut-point is also known as a split-point. The term “arity” in the discretization context means the number of intervals or partitions. Before discretization of a continuous feature, the arity can be set to k, the number of partitions of the continuous feature. The maximum number of cut-points is k − 1. The discretization process reduces the arity, but there is a trade-off between arity and its effect on the accuracy.
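To make the relation between cut-points and intervals concrete, the short Python sketch below (with hypothetical cut-points, not drawn from any referenced data set) maps values to the k intervals induced by k − 1 cut-points, each cut-point belonging to the interval on its left, as in the [a, c], (c, b] example above.

from bisect import bisect_left

# Three hypothetical cut-points induce an arity of k = 4 intervals.
cut_points = [2.5, 5.0, 7.5]

def interval_index(value, cuts):
    # Interval 0 covers values <= cuts[0]; interval i covers (cuts[i-1], cuts[i]];
    # the last interval covers everything greater than the last cut-point.
    return bisect_left(cuts, value)

for v in [1.0, 2.5, 6.3, 9.9]:
    print(v, "-> interval", interval_index(v, cut_points))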
A typical discretization process broadly consists of four steps: (1) sorting the continuous values of the feature to be discretized, (2) evaluating a cut-point for splitting or adjacent intervals for merging, (3) splitting or merging intervals of continuous values according to some criterion, and (4) finally stopping at some point.
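These four steps can be read as a generic algorithmic skeleton. The top-down sketch below is illustrative only and does not correspond to any particular published algorithm; score and should_stop are placeholders for the evaluation functions and stopping criteria discussed in the rest of this section.

from typing import Callable, List, Sequence

def top_down_discretize(
    values: Sequence[float],
    labels: Sequence[int],
    score: Callable[[Sequence[int], Sequence[int]], float],
    should_stop: Callable[[Sequence[int], Sequence[int], float], bool],
) -> List[float]:
    # Step 1: sort the values together with their class labels.
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    cut_points: List[float] = []

    def split(lo: int, hi: int) -> None:
        # Step 2: evaluate every candidate cut-point inside [lo, hi).
        best_score, best_i = None, None
        for i in range(lo + 1, hi):
            if xs[i] == xs[i - 1]:
                continue  # identical values cannot be separated
            s = score(ys[lo:i], ys[i:hi])
            if best_score is None or s > best_score:
                best_score, best_i = s, i
        if best_i is None:
            return
        if should_stop(ys[lo:best_i], ys[best_i:hi], best_score):
            return  # step 4: the stopping criterion rejects further splitting
        # Step 3: accept the split (cut-point halfway between the neighbouring values).
        cut_points.append((xs[best_i - 1] + xs[best_i]) / 2.0)
        split(lo, best_i)
        split(best_i, hi)

    split(0, len(xs))
    return sorted(cut_points)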
After sorting, the next step in the discretization process is to find the best “cut-point” to split a range of continuous values, or the best pair of adjacent intervals to merge. One typical evaluation function determines the correlation of a split or a merge with the class label. Numerous evaluation functions are found in the literature, such as entropy measures and statistical measures (more details in the following sections). A stopping criterion specifies when to stop the discretization process. It is usually governed by a trade-off between a lower arity with better understandability but lower accuracy, and a higher arity with poorer understandability but higher accuracy. Another consideration is the number of inconsistencies caused by discretization: it should not be much higher than the number of inconsistencies of the original data before discretization. Two instances are considered inconsistent if they have the same attribute values but different class labels.
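One common way to operationalise this count (in the spirit of the inconsistency criterion used by Chi2 [29]) is to group instances by their attribute-value vectors and, in each group, count every instance beyond the majority class. The helper below is an illustrative sketch; the function name and data layout are our own.

from collections import Counter, defaultdict
from typing import Iterable, Sequence, Tuple

def inconsistency_count(instances: Iterable[Tuple[Sequence, object]]) -> int:
    # Group instances by their (discretized) attribute vector, then count,
    # per group, everything that does not belong to the majority class.
    groups = defaultdict(Counter)
    for attributes, label in instances:
        groups[tuple(attributes)][label] += 1
    return sum(sum(counts.values()) - max(counts.values())
               for counts in groups.values())

# Hypothetical example: two identical attribute vectors with different
# class labels contribute one inconsistency.
data = [((1, 0), "a"), ((1, 0), "b"), ((2, 1), "a")]
print(inconsistency_count(data))  # -> 1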
Generally, discretization methods can be categorised as: (1) supervised or unsupervised, (2) direct or incremental, (3) global or local, (4) static or dynamic, (5) top-down or bottom-up. We distinguish these categories in what follows.
A first distinction can be made depending on whether or not the method takes class information into account to find proper intervals. Several discretization methods, such as equal-width interval binning or equal-frequency binning, do not make use of class membership information during the discretization process. These methods are referred to as unsupervised methods. In contrast, discretization methods that use class labels for carrying out discretization are referred to as supervised methods. Previous research indicated that supervised methods are better than unsupervised ones [12].
Two representative unsupervised (or class-blind) algorithms are equal-width and equal-frequency discretization. The equal-width discretization algorithm determines the minimum and maximum values of the attribute to be discretized and then divides the range into a user-defined number of equal-width discrete intervals. The equal-frequency algorithm determines the minimum and maximum values of the attribute to be discretized, sorts all values in ascending order, and divides the range into a user-defined number of intervals so that every interval contains the same number of sorted values. The obvious weakness of the equal-width method is that when the observations are not distributed evenly, a large amount of important information can be lost after the discretization process. For equal-frequency, many occurrences of the same continuous value could be assigned to different bins. One improvement is, after the continuous values have been assigned to bins, to adjust the boundaries of every pair of neighboring bins so that duplicate values belong to one bin only.
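Both unsupervised schemes are easy to state in code. The sketch below (Python with NumPy; k and the sample are hypothetical) returns the k − 1 cut-points for each scheme; on a skewed sample the equal-width cuts leave most observations in the first bin, which is the information-loss problem noted above.

import numpy as np

def equal_width_cuts(x: np.ndarray, k: int) -> np.ndarray:
    # k - 1 cut-points dividing [min(x), max(x)] into k equal-width bins.
    lo, hi = x.min(), x.max()
    return lo + (hi - lo) * np.arange(1, k) / k

def equal_frequency_cuts(x: np.ndarray, k: int) -> np.ndarray:
    # k - 1 cut-points so that each bin holds roughly the same number of
    # values; ties at the boundaries may still be split across bins.
    return np.quantile(x, np.arange(1, k) / k)

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)  # hypothetical skewed sample
print(equal_width_cuts(x, 4))
print(equal_frequency_cuts(x, 4))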
Another dimension along which discretization methods differ is direct vs. incremental [14]. Direct methods divide the range into k intervals simultaneously (e.g., equal-width binning), and need an additional input from the user to determine the number of intervals. Incremental methods begin with a simple discretization and pass through an improvement process, and need an additional criterion to know when to stop discretizing [7].
The distinction between global and local discretization methods depends on when discretization is performed [28]. Global methods handle the discretization of each numeric attribute as a pre-processing step, i.e. before induction of a classifier, whereas local methods, like C4.5, carry out discretization on the fly (during induction). Empirical results have indicated that global discretization methods often produce superior results compared to local methods, since the former use the entire value domain of a numeric attribute for discretization, whereas local methods produce intervals that are applied to subpartitions of the instance space.
The distinction between static and dynamic methods depends on whether the method takes feature interactions into account [32]. Static methods, such as binning, entropy-based partitioning and the 1R algorithm [21], determine the number of partitions for each attribute independently of the other features. In contrast, dynamic methods conduct a search through the space of possible k partitions for all features simultaneously, thereby capturing interdependencies in feature discretization.
Finally, a distinction can be made between top-down and bottom-up discretization methods. Top-down methods start from one big interval containing all known values of a feature and then partition this interval into smaller and smaller subintervals until a certain stopping criterion is met, for example a Minimum Description Length (MDL) condition or an optimal number of intervals. In contrast, bottom-up methods initially consider a number of intervals, determined by the set of boundary points, and combine these intervals during execution until a certain stopping criterion is met, such as a χ2 threshold or an optimal number of intervals [24].
In the discretization problem, a compromise must be found between information quality (intervals that are homogeneous with regard to the attribute to predict) and statistical quality (a sufficient sample size in every interval to ensure generalization). The chi-square-based criteria focus on the statistical point of view, whereas the entropy-based criteria focus on the information point of view. Other criteria (such as the Gini or Fusinter criterion) try to find a trade-off between information and statistical properties. There are still other approaches, similar to the wrapper approach used in the feature selection field, as well as evolutionary-based approaches. We analyze all these approaches in the following sub-sections.
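As an illustration of the information point of view, the sketch below computes the class information entropy of a candidate binary split, i.e. the size-weighted average entropy of the two resulting sub-intervals, which is the kind of quantity minimized by entropy-based methods such as [15]. The helper names are our own.

import math
from collections import Counter
from typing import Sequence

def entropy(labels: Sequence) -> float:
    # Shannon entropy of the class distribution in a (sub-)interval.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_info(left: Sequence, right: Sequence) -> float:
    # Class information entropy of a binary split: size-weighted average
    # entropy of the two candidate sub-intervals (lower is better).
    n = len(left) + len(right)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

# Hypothetical class labels on either side of a candidate cut-point.
print(split_info(["a", "a", "a", "b"], ["b", "b", "b", "a"]))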
Chi-square (χ2) is a statistical measure that conducts a significance test on the relationship between the values of a feature and the class. Kerber [24] argues that in an accurate discretization, the relative class frequencies should be fairly consistent within an interval (otherwise the interval should be split to express this difference), but two adjacent intervals should not have similar relative class frequencies (in that case the adjacent intervals should be merged into one). The χ2 statistic determines the similarity of adjacent intervals based on some significance level. It tests the hypothesis that two adjacent intervals of a feature are independent of the class. If they are independent, they should be merged; otherwise they should remain separate.
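A minimal sketch of this statistic for two adjacent intervals: it builds the 2 × c contingency table of class counts and sums (observed − expected)² / expected over its cells. Refinements such as the special handling of zero expected counts are simply skipped here.

from collections import Counter
from typing import Sequence

def chi_square(interval_a: Sequence, interval_b: Sequence) -> float:
    # Chi-square statistic of the class-count contingency table of two
    # adjacent intervals; small values mean similar class distributions
    # (candidates for merging), large values mean they differ.
    classes = sorted(set(interval_a) | set(interval_b))
    counts = [Counter(interval_a), Counter(interval_b)]
    n = len(interval_a) + len(interval_b)
    col_totals = {c: counts[0][c] + counts[1][c] for c in classes}
    chi2 = 0.0
    for i, interval in enumerate((interval_a, interval_b)):
        for c in classes:
            expected = len(interval) * col_totals[c] / n
            if expected > 0:
                chi2 += (counts[i][c] - expected) ** 2 / expected
    return chi2

# Two hypothetical adjacent intervals with very different class make-ups.
print(chi_square(["a"] * 8 + ["b"] * 2, ["a"] * 1 + ["b"] * 9))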
The top-down method based on chi-square is ChiSplit. It searches for the best split of an interval by maximizing the chi-square criterion applied to the two sub-intervals adjacent to the splitting point: the interval is split if the two sub-intervals differ substantially in statistical terms. The ChiSplit stopping rule is based on a user-defined chi-square threshold, rejecting the split if the two sub-intervals are too similar.
The bottom-up method based on chi-square is ChiMerge [24]. It searches for the best merge of adjacent intervals by minimizing the chi-square criterion applied locally to two adjacent intervals: they are merged if they are statistically similar. The stopping rule is based on a user-defined chi-square threshold, rejecting the merge if the two adjacent intervals are insufficiently similar. No definite rule is given for choosing this threshold. Chi2 [30] is an automated version of ChiMerge. In Chi2, the statistical significance level keeps changing so as to merge more and more adjacent intervals as long as the inconsistency criterion [29] is satisfied. A method very similar to Chi2 is ConMerge [33], which also uses the χ2 statistic and the inconsistency measure. Instead of considering one attribute at a time, ConMerge chooses the lowest χ2 value among the intervals of all continuous features.
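A schematic of the bottom-up ChiMerge loop, reusing the chi_square helper above: start with one interval per distinct value, repeatedly merge the most similar adjacent pair, and stop once every adjacent pair exceeds a user-defined threshold. It is a sketch of the idea as described in the text, not a faithful reimplementation of [24].

def chi_merge(values, labels, threshold):
    # Assumes the chi_square(interval_a, interval_b) helper defined above.
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    # Initial intervals: [upper boundary value, list of class labels],
    # one interval per distinct attribute value.
    intervals = []
    for v, c in pairs:
        if intervals and intervals[-1][0] == v:
            intervals[-1][1].append(c)
        else:
            intervals.append([v, [c]])

    while len(intervals) > 1:
        scores = [chi_square(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        best = min(range(len(scores)), key=scores.__getitem__)
        if scores[best] > threshold:
            break  # every adjacent pair differs significantly: stop merging
        # Merge the most similar adjacent pair into a single interval.
        intervals[best] = [intervals[best + 1][0],
                           intervals[best][1] + intervals[best + 1][1]]
        del intervals[best + 1]

    # Cut-points are the upper boundaries of all but the last interval.
    return [upper for upper, _ in intervals[:-1]]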
Boullé [5] proposes the discretization method Khiops, based on the chi-square statistic. In contrast with the related methods ChiMerge and ChiSplit, this method optimizes the chi-square criterion in a global manner on the whole discretization domain and does not require any user-supplied stopping criterion. The Khiops method starts the discretization from the elementary single-value intervals. It evaluates all merges between adjacent intervals and selects the best one according to the chi-square criterion applied to the whole set of intervals. The stopping rule is based on the confidence level computed with the chi-square statistic. The method automatically stops merging intervals as soon as the confidence level, related to the chi-square test of independence between the discretized attribute and the class attribute, does not decrease anymore.
The quadratic time complexity of the popular C4.5 DT learning algorithm for non-numeric data increases by a logarithmic factor when numerical attributes are processed. Many practical DT learners, including C4.5, use partitioning techniques that only look for the best halving of the domain. Recently, however, considering multisplits of the domain into more than two intervals has gained popularity. In general, the models obtained (DTs, RLs) using discrete features are usually more compact, shorter and more accurate than those using continuous features; hence the results can be more closely examined, compared, used and reused. Supervised discretization techniques might reasonably be expected to lead to more accurate classification trees, since the partitions they produce are directly related to the class to be predicted. On the other hand, one might expect most of the unsupervised techniques to be considerably faster, since they involve little more than sorting the data, an operation that is common to all discretization methods.
Dougherty et al. [12] carried out a comparative study of five discretization procedures using 16 data sets from the UC Irvine ML Database Repository. The methods compared were two unsupervised global methods (equal-width interval procedures), two supervised global methods (1RD [21] and entropy minimisation [15]), and C4.5, which is a supervised local method. In all cases the tree was actually constructed by C4.5, but for the four global methods the data was pre-processed using the corresponding procedure to discretize all continuous variables. Somewhat surprisingly, Dougherty et al. [12] found only small differences, most of which were not statistically significant, between the classification accuracies achieved by the resulting DTs. None produced the highest accuracy for all data sets. In particular, the local supervised method, C4.5, showed no advantage over the other supervised methods: Fayyad and Irani's method [15] achieved the best overall results.
Although there exist a number of efficient inference algorithms and implementations for probabilistic reasoning in BNs with discrete variables, few algorithms support efficient inference in hybrid BNs, i.e. BNs where continuous and discrete variables are intermixed. Exact probabilistic inference in hybrid networks can be reduced to taking multidimensional integrals, in the same way that exact inference in discrete networks can be reduced to computing sums. However, computing integrals exactly is possible only for a restricted class of continuous functions. Kozlov and Koller [25] constructed an iterative anytime algorithm for BNs that gradually improves the quality of the discretization and the accuracy of the answer to a query.
For NB, one common discretization approach is fixed k-interval discretization (FKID). Another popular one is Fayyad and Irani's entropy minimization heuristic discretization (FID) [15]. Two more recent alternatives are lazy discretization (LD) [22] and proportional k-interval discretization (PKID) [35]. LD [22] delays probability estimation until classification time. It waits until a test instance is presented to determine the cut-points for each numeric attribute. For a numeric attribute value from the test instance, it selects a pair of cut-points such that the value lies in the middle of the corresponding interval, whose size is the same as that created by FKID with k = 10. LD tends to have high memory and computational requirements because of its lazy methodology. PKID [35] adjusts discretization bias and variance by tuning the interval size and number, and thereby adjusts the probability estimation bias and variance of NB classifiers to achieve lower classification error. The larger the interval (ai, bi], the more instances fall into it and the lower the discretization variance, hence the lower the variance of NB's probability estimation. However, the larger the interval, the less distinguishing information is obtained about the particular value xi of attribute Xi and the higher the discretization bias, hence the higher the bias of the probability estimation. So, increasing the interval size (decreasing the interval number) will decrease variance but increase bias. Conversely, decreasing the interval size (increasing the interval number) will decrease bias but increase variance. PKID aims to resolve this conflict by adjusting the interval size and number in proportion to the number of training instances.
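Our reading of PKID [35] is that, with n training instances, it sets both the desired interval frequency and the number of intervals to roughly the square root of n, so that both grow as training data accumulates. The sketch below follows that reading; it is not the authors' implementation, and the handling of duplicate values is simplified.

import math
from typing import List, Sequence

def pkid_cut_points(values: Sequence[float]) -> List[float]:
    # Aim for about sqrt(n) intervals containing about sqrt(n) values each,
    # so interval size and interval number both grow with n.
    xs = sorted(values)
    n = len(xs)
    target = max(1, int(math.sqrt(n)))  # desired instances per interval
    cuts = []
    i = target
    while i < n:
        # Advance past duplicates so equal values share one interval.
        while i < n and xs[i] == xs[i - 1]:
            i += 1
        if i < n:
            cuts.append((xs[i - 1] + xs[i]) / 2.0)
        i += target
    return cuts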
5 Conclusion
References
[1] Aguilar, J., Bacardit, J., Divina, F. Experimental Evaluation of Discretization Schemes for
Rule Induction. Genetic and Evolutionary Computation Conference (GECCO04). Lecture
Notes in Computer Science 3102, 828-839
[2] Bacardit, J. Garrell J.M., Analysis and improvements of the Adaptive Discretization Inter-
vals knowledge representation. Genetic and Evolutionary Computation Conference
(GECCO04). Lecture Notes in Computer Science 3103, 726-738
[3] Bacardit, J. Garrell, J.M.. Evolving multiple discretizations with adaptive intervals for a
Pittsburgh Rule-Based Learning Classifier System. Genetic and Evolutionary Computa-
tion Conference (GECCO03). Lecture Notes in Computer Science 2724, 1818-1831
[4] Bertelsen, R., & Martinez, T. R. (1994). Extending ID3 through discretization of continuous inputs. In Proceedings of the 7th Florida Artificial Intelligence Research Symposium (pp. 122–125). Florida AI Research Society.
[5] Boullé, M., Khiops: A Statistical Discretization Method of Continuous Attributes. Machine Learning 55:1 (2004) 53-69
[6] Burdsall, B., & Giraud-Carrier, C. (1997). Evolving fuzzy prototypes for efficient data
clustering. In Proceedings of the Second International ICSC Symposium on Fuzzy Logic
and Applications (ISFL’97) (pp. 217–223).
[7] Cerquides, J. Lopez, R., Proposal and Empirical Comparison of a Parallelizable Distance-
Based Discretization Method. III International Conference on Knowledge Discovery and
Data Mining (KDDM97). Newport Beach (California USA, 1997) 139-142
[8] Chan, C.-C., Batur, C., and Srinivasan, A. 1991. Determination of quantization intervals
in rule based model for dynamic. In Proceedings of the IEEE Conference on Systems,
Man, and Cybernetics. Charlottesvile, Virginia, pp. 1719–1723.
[9] Ching, J.Y., Wong, A.K.C., Chan, K.C.C., Class-Dependent Discretization for Inductive
Learning from Continuous and Mixed-Mode Data. IEEE Transactions on Pattern Analysis
and Machine Intelligence 17:7 (1995) 641-651
[10] Divina, F., Keijzer, M., Marchiori, E.: A method for handling numerical attributes in GA-
based inductive concept learners. In: GECCO 2003: Proceedings of the Genetic and Evo-
lutionary Computation Conference, Springer (2003) 898–908
[11] Divina, F., Marchiori, E., Handling continuous attributes in an evolutionary inductive
learner. IEEE Transactions on Evolutionary Computation 9:1 (2005) 31-43
[12] Dougherty J, Kohavi R, Sahami M. Supervised and unsupervised discretization of con-
tinuous features. In: Proceedings of the 12th international conference on machine learning.
San Francisco: Morgan Kaufmann; 1995. p. 194–202.
[13] Elomaa, T. Rousu, J., Fast Minimum Training Error Discretization. XIX International
Conference on Machine Learning (ICML02). Sydney (Australia, 2002) 131-138
[14] Elomaa, T. Rousu, J., Efficient multisplitting revisited: Optima-preserving elimination of
partition candidates. Data Mining and Knowledge Discovery 8:2 (2004) 97-126
[15] Fayyad, U.M., Irani, K.B., Multi-Interval Discretization of Continuous-Valued Attributes
for Classification Learning. XIII International Joint Conference on Artificial Intelligence
(IJCAI93). Chambéry (France, 1993) 1022-1029
[16] Ferrandiz, S., Boullé, M., Multivariate discretization by recursive supervised bipartition of graph. IV International Conference in Machine Learning and Data Mining in Pattern Recognition (MLDM05). Lecture Notes in Computer Science 3587, 253-264
[17] Giráldez, R., Aguilar, J.S., Riquelme J.C., Natural Coding: A More Efficient Representa-
tion for Evolutionary Learning. Genetic and Evolutionary Computation Conference
(GECCO03). Lecture Notes in Computer Science 2723, 979-990
[18] Giráldez, R., Aguilar, J.S., Riquelme J.C., Ferrer, J.F., Rodríguez, D.S., Discretization
Oriented to Decision Rule Generation. VIII Intern. Conference on Knowledge-Based In-
telligent Information and Engineering Systems (KES02). Creme (Italy, 2002) 275-279
[19] Grzymala-Busse, J.W., Three strategies to rule induction from data with numerical attrib-
utes. Lecture Notes in Computer Science 3135 (2004) 54-62
[20] Ho, K.M., Scott, P.D., Zeta: A Global Method for Discretization of Continuous Variables.
III Inter. Conference on Knowledge Discovery and Data Mining. USA (1997) 191-194
[21] Holte, R.C., Very simple classification rules perform well on most commonly used data-
sets. Machine Learning 11 (1993) 63-91
[22] Hsu, C.-N., Huang, H.-J., & Wong, T.-T. (2000). Why discretization works for naive Bayesian classifiers. 17th International Conference on Machine Learning (ICML’2000) pp. 309–406.
[23] Janssens, D., Brijs, T., Vanhoof, K., Wets, G., Evaluating the performance of cost-based
discretization versus entropy- and error-based discretization. Computers and Operations
Research 33:11 (2006) 3107-3123
[24] Kerber, R., ChiMerge: Discretization of Numeric Attributes. X National Conf. on Artifi-
cial Intelligence American Association (AAAI92). USA, 1992, 123-128
[25] Kozlov, A.V., Koller, D., Nonuniform Dynamic Discretization in Hybrid Networks. Thir-
teenth Annual Conference on Uncertainty in AI (UAI97). USA, 1997, 314-325
[26] Kurgan, L.A., Cios, K.J., CAIM Discretization Algorithm. IEEE Transactions on Knowl-
edge and Data Engineering 16:2 (2004) 145-153
[27] Kwedlo, W., Kretowski, M.: An evolutionary algorithm using multivariate discretization
for decision rule induction. In Proc. of the European Conference on Principles of Data
Mining and Knowledge Discovery (1999) 392–397
[28] Liu, H., Hussain, F., Lim, C., Dash, M., Discretization: An Enabling Technique. Data
Mining and Knowledge Discovery 6:4 (2002) 393-423
[29] Liu, H. and Setiono, R., Chi2: Feature selection and discretization of numeric attributes. VII IEEE Intern. Conf. on Tools with Artificial Intelligence (ICTAI95). USA, 1995, 388-391
[30] Liu, H. and Setiono, R. 1997. Feature selection and discretization. IEEE Transactions on Knowledge and Data Engineering, 9:1–4.
[31] Maass, W., Efficient agnostic PAC-learning with simple hypotheses. Seventh annual ACM conference on computational learning theory. New York: ACM Press; 1994. p. 67–75.
[32] Steck, H., Jaakkola, T., Predictive discretization during model selection. XXVI Symposium In Pattern Recognition (DAGM04). Lecture Notes in Computer Science 3175, Springer-Verlag 2004, Tübingen (Germany, 2004) 1-8
[33] Wang, K., Liu, B., Concurrent Discretization of Multiple Attributes. 5th Pacific Rim
International Conference on Topics in Artificial Intelligence (PRICAI98). Lecture Notes
in Computer Science 1531, Springer-Verlag 1998, Singapore (Singapore, 1998) 250-259
[34] Yang, Y., & Webb, G. I., Non-Disjoint Discretization for Naive-Bayes Classifiers. XIX International Conference on Machine Learning (ICML02). Sydney (Australia, 2002) 666-673
[35] Yang, Y., & Webb, G. I. (2001). Proportional k interval discretization for naive-Bayes
classifiers. 12th European Conf. on Machine Learning (ECML’01) pp. 564–575.