
Chi2: Feature Selection and Discretization of Numeric Attributes

Huan Liu and Rudy Setiono


Department of Information Systems and Computer Science
National University of Singapore
Kent Ridge, Singapore 0511
{liuh,rudys}@iscs.nus.sg

Abstract

Discretization can turn numeric attributes into discrete ones. Feature selection can eliminate some irrelevant attributes. This paper describes Chi2, a simple and general algorithm that uses the χ² statistic to discretize numeric attributes repeatedly until some inconsistencies are found in the data, and achieves feature selection via discretization. The empirical results demonstrate that Chi2 is effective in feature selection and discretization of numeric and ordinal attributes.

1 Introduction

Feature selection is the task of selecting the minimum number of attributes needed to represent the data accurately. By using relevant features, classification algorithms can in general improve their predictive accuracy, shorten the learning period, and result in simpler concepts. There are abundant feature selection algorithms [5]. Our work adopts an approach that selects a subset of the original attributes, since it not only has the above virtues but also serves as an indicator of what kind of data (along those selected features) should be collected. Feature selection algorithms can be further divided based on the data types they operate on. The two basic types of data are nominal (e.g., attribute color may have values of red, green, yellow) and ordinal (e.g., attribute winning-position can have values of 1, 2, and 3, or attribute salary can have 22345.00, 46543.89, etc. as its values). Many feature selection algorithms [1, 3, 5] are shown to work effectively on discrete data or, even more strictly, on binary data (and/or a binary class value). In order to deal with numeric attributes, a common practice for those algorithms is to discretize the data before conducting feature selection. This paper provides a way to select features directly from numeric attributes while discretizing them. Numeric data are very common in real-world problems. However, many classification algorithms require that the training data contain only discrete attributes, and some would work better on discretized or binarized data [2, 4]. If those numeric data can be automatically transformed into discrete ones, these classification algorithms would be readily at our disposal. Chi2 is our effort towards this goal: discretize the numeric attributes as well as select features among them.

The problem this work tackles is as follows: there are data sets with numeric attributes, some of which are irrelevant, and the range of each numeric attribute can be very wide; find an algorithm that can automatically discretize the numeric attributes as well as remove the irrelevant ones.

This work stems from Kerber's ChiMerge [4], which is designed to discretize numeric attributes based on the χ² statistic. ChiMerge consists of an initialization step and a bottom-up merging process, where intervals are continuously merged until a termination condition, determined by a significance level α (set manually), is met. It is an improvement over the most obvious simple methods such as equal-width or equal-frequency intervals. Instead of defining a width or frequency threshold (which is not easy without scrutinizing each attribute and knowing what it is), ChiMerge requires α to be specified. Nevertheless, too big or too small an α will over- or under-discretize an attribute. An extreme example of under-discretization is the continuous attribute itself. Over-discretization introduces many inconsistencies¹ that did not exist before, thus changing the characteristics of the data. In short, it is not easy to find a proper α for ChiMerge. It is therefore ideal to let the data determine what value α should take. This leads to Phase 1 of Chi2. Naturally, if the discretization continues without generating more inconsistencies than in the original data, it is possible that some attributes will be discretized into one interval only. Hence, they can be removed.

2 Chi2 Algorithm

The Chi2 algorithm (summarized below) is based on the χ² statistic and consists of two phases. In the first phase, it begins with a high significance level (sigLevel), e.g., 0.5, for all numeric attributes. Each attribute is sorted according to its values. Then the following is performed: 1. calculate the χ² value as in equation (1) for every pair of adjacent intervals (at the beginning, each pattern is put into its own interval that contains only one value of an attribute); 2. merge the pair of adjacent intervals with the lowest χ² value. Merging continues until all pairs of intervals have χ² values exceeding the threshold determined by sigLevel (initially 0.5, whose corresponding χ² value is 0.455 if the degree of freedom is 1; more on this below).

¹ By inconsistency we mean that two patterns are the same but are classified into different categories.

The above process is repeated with a decreased sigLevel until an inconsistency rate δ is exceeded in the discretized data. Phase 1 is, as a matter of fact, a generalized version of Kerber's ChiMerge [4]. Instead of specifying a χ² threshold, Chi2 wraps ChiMerge in a loop that automatically increments the χ² threshold (decrementing sigLevel). A consistency check is also introduced as a stopping criterion in order to guarantee that the discretized data set accurately represents the original one. With these two new features, Chi2 automatically determines a proper χ² threshold that keeps the fidelity of the original data.

The formula for computing the χ² value is

    \chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}    (1)

where:
k = number of classes,
A_ij = number of patterns in the i-th interval, j-th class,
R_i = number of patterns in the i-th interval = Σ_{j=1}^{k} A_ij,
C_j = number of patterns in the j-th class = Σ_{i=1}^{2} A_ij,
N = total number of patterns = Σ_{i=1}^{2} R_i,
E_ij = expected frequency of A_ij = R_i · C_j / N.

If either R_i or C_j is 0, E_ij is set to 0.1. The degree of freedom of the χ² statistic is one less than the number of classes.
Phase 2 is a finer process of Phase 1. Starting with sigLevel0 determined in Phase 1, each attribute i is associated with a sigLevel[i] and takes turns for merging. Consistency checking is conducted after each attribute's merging. If the inconsistency rate is not exceeded, sigLevel[i] is decremented for attribute i's next round of merging; otherwise attribute i will not be involved in further merging. This process continues until no attribute's values can be merged. At the end of Phase 2, if an attribute is merged to only one value, it simply means that this attribute is not relevant in representing the original data set. As a result, when discretization ends, feature selection is accomplished.
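Both phases stop on the consistency check. The paper defines only what an inconsistency is (footnote 1); the sketch below uses one common counting convention (everything beyond the majority class within each group of identical patterns) and is our assumption, not the paper's code.

    from collections import Counter, defaultdict

    def inconsistency_rate(patterns, labels):
        """Fraction of patterns that are inconsistent: identical (discretized)
        patterns whose class labels disagree. Each group of duplicates
        contributes its size minus its majority-class count (assumed convention)."""
        groups = defaultdict(list)
        for row, y in zip(patterns, labels):
            groups[tuple(row)].append(y)
        bad = sum(len(ys) - max(Counter(ys).values()) for ys in groups.values())
        return bad / len(labels)

    # two identical patterns with different classes -> 1 inconsistency out of 3
    print(inconsistency_rate([(1, 0), (1, 0), (2, 1)], ["a", "b", "a"]))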
Chi2 Algorithm:

Phase 1:
    set sigLevel = 0.5;
    do while (InConsistency(data) < δ) {
        for each numeric attribute {
            Sort(attribute, data);
            chi-sq-initialization(attribute, data);
            do {
                chi-sq-calculation(attribute, data)
            } while (Merge(data))
        }
        sigLevel0 = sigLevel;
        sigLevel = decreSigLevel(sigLevel);
    }

Phase 2:
    set all sigLvl[i] = sigLevel0 for attribute i;
    do until no-attribute-can-be-merged {
        for each attribute i that can be merged {
            Sort(attribute, data);
            chi-sq-initialization(attribute, data);
            do {
                chi-sq-calculation(attribute, data)
            } while (Merge(data))
            if (InConsistency(data) < δ)
                sigLvl[i] = decreSigLevel(sigLvl[i]);
            else
                attribute i cannot be merged;
        }
    }
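A minimal Python sketch of the inner merging loop for one attribute, under our own assumptions: class counts are kept per interval as NumPy arrays, and scipy.stats.chi2.ppf converts sigLevel into the χ² threshold (e.g. sigLevel = 0.5 at one degree of freedom gives the 0.455 quoted above). None of these names come from the paper.

    import numpy as np
    from scipy.stats import chi2 as chi2_dist

    def chi_square_pair(a, b):
        """Equation (1) for the class-count vectors of two adjacent intervals."""
        A = np.array([a, b], dtype=float)
        E = np.outer(A.sum(axis=1), A.sum(axis=0)) / A.sum()
        E[E == 0] = 0.1
        return float(((A - E) ** 2 / E).sum())

    def merge_attribute(intervals, sig_level, n_classes):
        """Bottom-up merging for one attribute. `intervals` is a list of
        (lower_bound, class_count_vector); repeatedly merge the adjacent pair
        with the lowest chi-square until every pair exceeds the threshold."""
        threshold = chi2_dist.ppf(1.0 - sig_level, df=n_classes - 1)
        while len(intervals) > 1:
            chis = [chi_square_pair(intervals[i][1], intervals[i + 1][1])
                    for i in range(len(intervals) - 1)]
            i = int(np.argmin(chis))
            if chis[i] >= threshold:      # all remaining pairs exceed the threshold
                break
            low, counts = intervals[i]
            intervals[i] = (low, counts + intervals[i + 1][1])
            del intervals[i + 1]
        return intervals

In Phase 1 such a routine would be called for every numeric attribute, followed by the consistency check, before sigLevel is lowered, as in the pseudocode above.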
3 Experiments

Two sets of experiments are conducted. In the first set, we want to establish that 1. Chi2 helps improve predictive accuracy; and 2. Chi2 properly and effectively discretizes data as well as eliminates some irrelevant attributes. C4.5 [8] (an extension of ID3 [7]) is used to verify the effectiveness of Chi2. The reasons for our choice are 1. C4.5 (or ID3) works well for many problems and is well known, thus requiring no further description; and 2. C4.5 selects relevant features by itself in tree branching, so it can be used as a benchmark, as in [5, 9, 1], to verify the effects of Chi2. In the second set of experiments, we take a closer look at Chi2's ability of discretization and feature selection by introducing a synthetic data set and adding noise attributes to an existing data set. Through these more controlled data sets, we can better understand how effective Chi2 is.

3.1 Real data

Three data sets used in the experiments are Iris, Wisconsin Breast Cancer and Heart Disease². They have different types of attributes. The Iris data are of continuous attributes, the breast cancer data are of ordinal discrete ones, and the heart disease data have mixed attributes (numeric and discrete).

² They are all obtained from the University of California-Irvine machine learning repository via anonymous ftp to ics.uci.edu.

3.2 Controlled data

Two extra data sets are designed to test whether noise attributes can be removed. One is synthetic; the other is the Iris data with noise attributes added.

The synthetic data consist of 600 items and are described by four attributes, among which only one attribute determines each item's class label. The values v1 of attribute A1 are generated from a uniform distribution between the lower bound (L = 0) and the upper bound (U = 75); each item's class label is determined as follows: v1 < 25 → class 1, v1 < 50 → class 2, v1 < 75 → class 3. Then we add noise attributes A2, A3 and A4. The values of A2 are generated from a normal distribution with μ = U/2 (i.e. 37.5) and σ = μ/3. The values of A3 are generated from two normal distributions with μ = U/3 (i.e. 25) and μ = 2U/3 (i.e. 50), and σ = μ/3 respectively, 300 items from each distribution. The values of A4 are generated from a uniform distribution.
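A small NumPy sketch (ours, not the paper's) of how such a synthetic set could be generated; the random seed and the range used for the uniform noise attribute A4 are assumptions, since the paper does not state them.

    import numpy as np

    rng = np.random.default_rng(0)                 # arbitrary seed
    n, L, U = 600, 0.0, 75.0

    a1 = rng.uniform(L, U, n)                      # relevant attribute A1
    labels = np.where(a1 < 25, 1, np.where(a1 < 50, 2, 3))   # class depends on A1 only

    a2 = rng.normal(U / 2, (U / 2) / 3, n)         # noise: mu = 37.5, sigma = mu/3
    a3 = np.concatenate([rng.normal(U / 3, (U / 3) / 3, n // 2),           # 300 items
                         rng.normal(2 * U / 3, (2 * U / 3) / 3, n // 2)])  # 300 items
    rng.shuffle(a3)                                # pairing with items is irrelevant noise
    a4 = rng.uniform(L, U, n)                      # noise: uniform (range assumed)

    data = np.column_stack([a1, a2, a3, a4, labels])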
The second data set is a modified version of the Iris data. Four noise attributes A5, A6, A7 and A8 are added to the Iris training data, corresponding to the four original attributes. The values of each noise attribute are determined by a normal distribution with μ = ave and σ = (max − min)/6, where ave is the average value, and max and min are the maximum and minimum values, of the corresponding original attribute. The choice of σ approximates μ/3 if the corresponding original attribute has a uniform distribution. There are now eight attributes in total. The number of patterns used is 75.

Figure 1: Number of attributes: original vs. those after Chi2 processing (data sets: Iris, Heart, Breast).

Table 1: The initial intervals, class frequencies, and χ² values for sepal-length. (Values not reproduced.)

Table 2: The intervals, class frequencies, and χ² values for attribute sepal-length after Phase 1 and Phase 2. The χ² thresholds are (a) 3.22 and (b) 50.6. (Values not reproduced.)
3.3 Example

In this section, some steps of Chi2 processing for the Iris data are shown to demonstrate the behavior of Chi2. Table 1 shows the intervals, class frequencies, and χ² values of sepal-length after the initialization in Phase 1. The results for sepal-length after Phase 1 and Phase 2 are shown in Table 2. An inconsistency rate δ = 5% is allowed in the experiment, which means up to 3 (75 × 0.05) inconsistencies are acceptable. Phase 1 stops at sigLevel = 0.2, χ² = 3.22; that means the next sigLevel (0.1) would introduce more inconsistencies. When Phase 2 terminates, the values of both sepal-length and sepal-width are merged into one value, so these attributes can be removed, and attributes petal-length and petal-width are discretized into four discrete values each. With the χ² threshold 3.22, for example, six discrete values are needed for attribute sepal-length: < 4.4 → 0, < 4.9 → 1, ..., < 6.1 → 4, and ≥ 6.1 → 5. The last one reads: if a numeric value is greater than or equal to 6.1, it is quantized to 5.
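Such a mapping amounts to a sorted list of cut points. A small lookup sketch (ours, not the paper's) follows; only the cut points 4.4, 4.9 and 6.1 are quoted in the text above, so the two middle ones below are hypothetical placeholders.

    import bisect

    cuts = [4.4, 4.9, 5.0, 5.5, 6.1]   # 5.0 and 5.5 are hypothetical middle cut points

    def discretize(value, cuts):
        """Map a numeric value to its interval index: values below cuts[0]
        map to 0, values >= cuts[-1] map to len(cuts)."""
        return bisect.bisect_right(cuts, value)

    print(discretize(4.3, cuts))   # -> 0
    print(discretize(6.1, cuts))   # -> 5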
3.4 Empirical results on real data

First we show that after discretization, the number of attributes decreases for the three data sets (Figure 1). For the Iris data, the number of attributes is reduced from 4 to 2 (petal-length and petal-width), each with four values. For the breast cancer data, 3 attributes are removed from the original 9; the remaining 6 attributes have 3, 4, 4, 5, 3, and 3 discrete values respectively. For the heart disease data, the discrete attributes are left out of discretization and feature selection, although they are used for consistency checking. Among the 5 continuous attributes (1, 4, 5, 8 and 10), only 2 attributes (5 and 8) should remain as suggested by Chi2, having 8 and 4 discrete values respectively. For the cancer and disease data sets, the default inconsistency rate is used, i.e., 0.

Second, we run C4.5 on both the original data sets and the dimensionally reduced ones. C4.5 is run using its default settings. Chi2 discretizes the training data and generates a mapping table, based on which the testing data are discretized.

Shown in Figure 2 are the predictive accuracies and tree sizes of C4.5 for the three data sets. Predictive accuracy improves and tree size drops (by half) for the breast cancer and heart disease data. As for the Iris data, accuracy and tree size remain the same using only two attributes (with 4 values each); in a way, this shows that C4.5 works quite well without Chi2 for this data set.

Figure 2: (a) Predictive accuracy and (b) size of decision trees of C4.5 for the three data sets after and before the Chi2 processing.

3.5 Empirical results on controlled data

The purpose of experimenting on the controlled data is to verify how effective Chi2 is in removing irrelevant attributes through discretizing numeric attributes. Therefore, it is only necessary to see whether Chi2 can 1. discretize the relevant attribute(s) properly and 2. remove the irrelevant attributes.

Chi2 merged A1 into three discrete values (1, 2 and 3) corresponding to the three classes (1, 2 and 3), and merged the other three attributes A2, A3 and A4 into one value each. That means that only A1 should remain, and the noise (irrelevant) attributes should be removed.

For the modified Iris data, Chi2 merged six attributes out of eight: attributes 0, 1, 4, 5, 6 and 7. The first two are sepal-length and sepal-width; the last four are the added noise (irrelevant) attributes. The remaining two attributes have been merged into 4 discrete values each, as in the real-data experiment.

Through this set of controlled experiments, it is shown that Chi2 effectively discretizes numeric attributes and removes irrelevant attributes.

4 Discussions

ChiMerge requires a user to specify a proper significance level (α), which is used for merging values of all the attributes. No definite rule is given to choose this α. In other words, it is still a matter of trial and error, and clearly it is not easy to find a proper significance level for each problem. Phase 1 of Chi2 extends ChiMerge to an automated one: α is automatically varied until further merging is stopped by the stopping criterion (the inconsistency rate). What makes Chi2 special is its capability of feature selection, a big step forward from discretization. In Phase 2 of Chi2, each attribute has its own significance level for merging in a round-robin fashion. Merging stops when the inconsistency rate exceeds a specified rate δ. This phase of Chi2 accomplishes feature selection. Another feature of Chi2 is that it can be applied to data with mixed attributes (e.g., the Heart Disease data). In addition, Chi2 can work with multi-class data; this is an advantage over some statistic-based feature selection algorithms such as Relief [5], which is applicable only to two-class data.

Other issues such as selecting δ, the limitations of Chi2, and its computational complexity can be found in [6].

5 Conclusion

Chi2 is a simple and general algorithm that can automatically select a proper χ² value, determine the intervals of a numeric attribute, and select features according to the characteristics of the data. It guarantees that the fidelity of the training data remains after Chi2 is applied. The empirical results on both the real data and the controlled data have shown that Chi2 is a useful and reliable tool for discretization and feature selection of numeric attributes.

References

[1] H. Almuallim and T.G. Dietterich. Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69(1-2):279-305, November 1994.

[2] J. Catlett. On changing continuous attributes into ordered discrete attributes. In European Working Session on Learning, 1991.

[3] U.M. Fayyad and K.B. Irani. The attribute selection problem in decision tree generation. In AAAI-92, Proceedings Ninth National Conference on Artificial Intelligence, pages 104-110. AAAI Press/The MIT Press, 1992.

[4] R. Kerber. ChiMerge: Discretization of numeric attributes. In AAAI-92, Proceedings Ninth National Conference on Artificial Intelligence, pages 123-128. AAAI Press/The MIT Press, 1992.

[5] K. Kira and L.A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In AAAI-92, Proceedings Ninth National Conference on Artificial Intelligence, pages 129-134. AAAI Press/The MIT Press, 1992.

[6] H. Liu and R. Setiono. Discretization of ordinal attributes and feature selection. Technical Report TRB4/95, Department of Information Systems and Computer Science, National University of Singapore, April 1995. http://www.iscs.nus.sg/~liuh/chi2.ps.

[7] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

[8] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[9] H. Ragavan and L. Rendell. Lookahead feature construction for learning hard concepts. In Machine Learning: Proceedings of the Seventh International Conference, pages 252-259. Morgan Kaufmann, San Mateo, California, 1993.