Classification and kernel density estimation
Pergamon
pp. 411-417, 1997
© 1997 Elsevier Science Ltd. All rights reserved. Printed in Great Britain.
0083-6656/97 $15.00 + 0.00
PII: S0083-6656(97)00046-9
Abstract - The method of kernel density estimation can be readily used for
the purposes of classification, and an easy-to-use package (ALLOC80) is now
in wide circulation. It is known that this method performs well (at least in
relative terms) in the case of bimodal, or heavily skewed distributions.
In this article we first review the method, and describe the problem of
choosing h, an appropriate smoothing parameter. We point out that the usual
approach of choosing h to minimize the asymptotic integrated mean squared
error is not entirely appropriate, and we propose an alternative estimate of
the classification error rate, which is the target of interest. Unfortunately, it
seems that analytic results are hard to come by, but simulations indicate that
the proposed estimator has smaller mean squared error than the usual
cross-validation estimate of error rate.
A second topic which we briefly explore is that of classification of drifting
populations. In this case, we outline two general approaches to updating
a classifier based on new observations. One of these approaches is limited
to parametric classifiers; the other relies on weighting of observations, and
is more generally applicable. We use an example from the credit industry
as well as some simulated data to illustrate the methods.
1. INTRODUCTION
The kernel estimate of the density for class C_j, based on the n_j training observations from that class, is

f̂_j(x) = (1/(n_j h)) Σ_i K((x − x_i)/h),   (1)

where K(·) is a kernel function such that ∫ K(x) dx = 1 and h is the smoothing parameter
(which can take the form of a vector, or even a matrix, in the case of multi-dimensional data). The
classification rule is then to allocate x* to class C_m if m = argmax_j f̂_j(x*).
Note that if we use a normal kernel function, then the limiting case when h → 0 gives a nearest
neighbour classifier. This suggests that, with careful choice of the smoothing parameters, we
should always do better than the 1-NN classifier. However, there can be numerical difficulties in
using very small values of h naively. If density estimation per se is the goal, then it is widely
recognized that the choice of smoothing parameter is much more important than the choice of
kernel function. However, as will be seen in Section 2, there are many issues which are distinctive
when classification is the end target. For example, there is no reason why the kernel function
should itself be a density. If we relax the condition that K(x) ≥ 0, then better properties may
ensue.
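To make the allocation rule concrete, the following is a minimal sketch, not taken from the paper, of the classifier based on Eq. (1) with a normal kernel and a scalar smoothing parameter per class; the function names and toy data are invented for illustration:

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate at a point x (Eq. (1)), normal kernel."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

def classify(x, class_data, smoothing):
    """Allocate x to the class with the largest kernel density estimate."""
    scores = [kde(x, d, h) for d, h in zip(class_data, smoothing)]
    return int(np.argmax(scores))

# Toy example with two well-separated classes
rng = np.random.default_rng(0)
c0 = rng.normal(0.0, 1.0, 100)
c1 = rng.normal(5.0, 1.0, 100)
print(classify(0.1, [c0, c1], [0.5, 0.5]))  # -> 0
print(classify(4.9, [c0, c1], [0.5, 0.5]))  # -> 1
```

Equal priors are implicit in this sketch; unequal priors would simply multiply each class score by its prior probability.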
1.1. Shifting populations
In classical supervised learning, the available examples (the training data) are usually used to
learn a classifier. In many practical situations in which the environment changes, this procedure
ceases to work. In the StatLog project [2, Ch. 9] this situation was encountered in the case of
a credit-scoring application. Generally speaking, the application of a discrimination algorithm
to classify new, unseen examples will be problematic if either the number of attributes changes
or the number of attributes remains the same but the interpretation of the records of the datasets
changes over time. In this situation the distribution of at least one class is gradually changing.
To solve this problem one could simply relearn the rule provided that there are enough new
examples with known class, but this is wasteful. An alternative is incremental learning - see
Refs. [1,6,7,5] for example - in which one of the design goals is that the decision tree that is
produced should depend only on the set of instances, without regard to the sequence in which
those instances were presented. However, if there is population drift some of the old data should
be downweighted as no longer representative. To deal with dynamic aspects there are essentially
two problems (in addition to those normally associated with classification). The first problem is
to detect any change in the situation, the second problem is how to react to any detected change.
In Section 3 this paper discusses ideas for adaptive learning which can capture dynamic as-
pects of real-world datasets. Although some of these ideas have a general character and could be
applied to any supervised algorithm, here we focus attention on kernel density methods which
use a weighted average of kernel functions, with the weights determined by the age of the observations. A final
section applies some of the methods and ideas to simulated data and an example from the credit
industry.
2. CHOOSING THE SMOOTHING PARAMETER

Numerous researchers have tackled the problem of choosing an appropriate bandwidth which
is based on the data - see, for example, Ref. [8] for references. However, most of the results are
related to minimizing the (asymptotic) integrated mean squared error (IMSE), which is given by
E ∫ (f̂ − f)². It is worth noting that such a policy for choosing h may not work very well in
a classification setting when we want to minimize the expected misclassification rate, which for
two classes is given by

e = π₁ ∫_{π₂f̂₂ > π₁f̂₁} f₁(x) dx + π₂ ∫_{π₁f̂₁ > π₂f̂₂} f₂(x) dx,   (2)
where π_i is the prior probability that the data belong to class C_i. The usual approach of taking
a Taylor series expansion does not work here, and the fact that the limits of the integral are
random variables makes this look intractable. We illustrate by simulation that the optimal h₁, h₂
which minimize the IMSE can be very different from those which minimize the expected error rate. In
this example, 100 observations were simulated from each of N(0, 1) and N(2, 0.5²). For each of
100 samples we calculated f̂₁ and f̂₂ using a range of different smoothing parameters. For these
distributions, (h₁, h₂) = (0.422, 0.211) minimize the asymptotic IMSE, whereas the actual error
rate is minimized for (h₁, h₂) = (0.720, 0.215).
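To illustrate how such a comparison can be carried out, the sketch below (our own; it assumes equal priors and a normal kernel) computes the actual error rate of a fitted kernel classifier by numerical integration against the true densities N(0, 1) and N(2, 0.5²), for the two smoothing-parameter pairs quoted above on one simulated sample:

```python
import numpy as np

def kde(grid, data, h):
    """Kernel density estimate evaluated on a grid, normal kernel."""
    u = (grid[:, None] - data[None, :]) / h
    return np.mean(np.exp(-0.5 * u**2), axis=1) / (h * np.sqrt(2 * np.pi))

def phi(x, m, s):
    """Normal density with mean m and standard deviation s."""
    return np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))

def actual_error(f1_hat, f2_hat, grid, dx):
    """Expected error rate (equal priors) of the rule 'choose class 1
    iff f1_hat > f2_hat', integrated against the true densities."""
    f1, f2 = phi(grid, 0.0, 1.0), phi(grid, 2.0, 0.5)
    pick1 = f1_hat > f2_hat
    return 0.5 * np.sum(f1[~pick1]) * dx + 0.5 * np.sum(f2[pick1]) * dx

rng = np.random.default_rng(1)
x1, x2 = rng.normal(0, 1, 100), rng.normal(2, 0.5, 100)
grid = np.linspace(-5.0, 6.0, 2201)
dx = grid[1] - grid[0]
errs = [actual_error(kde(grid, x1, h1), kde(grid, x2, h2), grid, dx)
        for h1, h2 in [(0.422, 0.211), (0.720, 0.215)]]
print([round(e, 4) for e in errs])
```

Averaging over many samples, as in the paper, simply repeats this for fresh draws of x1 and x2.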
Suppose that the integrals in Eq. (2) are over a connected region. Then we need to estimate
the point t such that f₁(t) = f₂(t). Let t̂(h₁, h₂) estimate t, defined by f̂_{h₁}(t̂) = f̂_{h₂}(t̂),
and write σ_K² = ∫ K(u)u² du. It can be shown that, approximately (as h ↓ 0 and n → ∞), the
error rate may be estimated (omitting a multiplicative constant) by

ê = Σ_{x_i ∈ C₁} I{f̂₂(x_i) − f̂₁^(i)(x_i)} + Σ_{x_i ∈ C₂} I{f̂₁(x_i) − f̂₂^(i)(x_i)},   (3)
where f̂^(i)(x) denotes the kernel estimate of f(x) (using Eq. (1)) based on all of the data except the
ith observation, and I(x) = 1 if x > 0 and 0 otherwise. An alternative is to use a smoothed version
of (3), which is more obviously an estimate of (2) (again omitting a multiplicative constant),
given by

ê_h = Σ_{x_i ∈ C₁} ∫ I{f̂₂(x) − f̂₁^(i)(x)} K_h(x − x_i) dx + Σ_{x_i ∈ C₂} ∫ I{f̂₁(x) − f̂₂^(i)(x)} K_h(x − x_i) dx,   (4)

where K_h(u) = K(u/h)/h.
In this case h = 0 gives the usual leave-one-out, or cross-validation, estimate of the error rate,
since, for example, the first integral in (4) will be 1 if f̂₂(x_i) > f̂₁^(i)(x_i) and 0 otherwise. So
although h = 0 will give an unbiased estimate of the error, a value of h > 0 can give a better
estimate (in terms of mean squared error). Some progress can be made in computing the
approximate bias and variance of the estimator derived from Eq. (4), and simulations confirm that, in
general, it does lead to a better estimator than Eq. (3). Figs. 1 and 2 show the estimated mean
squared error over 100 samples of size 10 from each of N(0, 1) and N(1, 1). Note that (4)
requires three or four choices of smoothing parameter, whereas (3) requires only two. However,
although (4) can lead to a smaller mean squared error - the minimum of the curve is 0.004
compared with 0.015 for (3) - the extra computation will rarely be worth the effort. Moreover, if
Fig. 1. Mean squared error for estimate of error rate using Eq. (4) as function of smoothing parameter h.
Fig. 2. Mean squared error of estimate given by Eq. (3) as a function of h₁, h₂.
classification of future observations is to be carried out, then only good choices of h₁, h₂ need to
be found. Estimation of the error rate may then be of secondary importance.
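For concreteness, the leave-one-out counting estimate in the spirit of Eq. (3) can be sketched as follows (our own illustration; the multiplicative constant is restored so that the result is a rate, and equal priors are assumed):

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate at a point x, normal kernel."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

def loo_error(x1, x2, h1, h2):
    """Leave-one-out estimate of the error rate: count training points
    that the rule allocates to the other class when they are deleted
    from their own class's density estimate."""
    wrong = 0
    for i in range(len(x1)):
        wrong += kde(x1[i], x2, h2) > kde(x1[i], np.delete(x1, i), h1)
    for i in range(len(x2)):
        wrong += kde(x2[i], x1, h1) > kde(x2[i], np.delete(x2, i), h2)
    return wrong / (len(x1) + len(x2))

# Same setup as the simulation in the text: samples of size 10
rng = np.random.default_rng(2)
x1, x2 = rng.normal(0, 1, 10), rng.normal(1, 1, 10)
print(loo_error(x1, x2, 0.5, 0.5))
```

The smoothed version of Eq. (4) would replace each indicator by an average of the indicator against a kernel centred at the deleted point.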
3. ADAPTIVE LEARNING

Nakhaeizadeh et al. [4] discuss ways of updating the classifier either by modifying the rule
which has been learned, or by modifying the training data. Suppose that we examine the data
in batches of size m and that, at time t + mk we detect a change which requires adaptation in
the learned rule. Any algorithm can be totally relearned from recent observations after a change
in one of the classes has been detected. More interestingly, we can consider how best to re-use
previously learned information.
We can use a “similarity” rule to throw away observations in the current training set which
conflict with those recently observed - i.e. which are close in feature space but have a different
class label. Alternatively, the older observations can simply be eliminated, or a kind of moving
window containing a predefined number of observations (possibly a mix of representative old
ones and new ones) could be used.
We try a possible implementation whereby old data are discarded and new data are included in
the “template” used for establishing a rule according to their perceived usefulness. As presented,
this system will have some limitations. For example, if there are drifting populations then new
data will be incorrectly classified, but should nevertheless be included in the template set. This
point is taken up in a nearest neighbour implementation in Ref. [3]. Note that this approach can
be used for any (not just similarity-based) algorithms.
We now focus attention on a dynamic version of the kernel estimator, which in the simple case
is given by

f̂_T(x) = (1/(N_T h)) Σ_{t=1}^{T} w_t K((x − x_t)/h)   (5)

as an estimate of the density f_T(x) at time T, when we have previously observed x_t in the class
of interest. Here the w_t are weights and N_T = Σ_t w_t is a normalizing constant. For example, we could choose
w_t = e^{−λ(T−t)}, in which case N_T = (1 − e^{−λT})/(1 − e^{−λ}), or

w_t = 1 for T ≥ t > T − W, and 0 otherwise,

in which case N_T = W. Using either of these parameterizations for w_t requires the choice of either
λ or W, in addition to the smoothing parameter h. Of course both parameters must be chosen for
each class. Again, analytic calculations appear to be intractable, even for IMSE and very simple
dynamic models. However, numerical calculations are simplified by noting that simple updating
formulae can be derived expressing f̂_T(x) in terms of f̂_{T−1}(x) and a kernel function of x_T.
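With the exponential weights, such an update can be maintained on a fixed evaluation grid: the unnormalized sum satisfies S_T(x) = e^{−λ} S_{T−1}(x) + K_h(x − x_T), with N_T = e^{−λ} N_{T−1} + 1. A sketch of this recursion (our own; the class name and parameter values are invented):

```python
import numpy as np

class DynamicKDE:
    """Exponentially weighted kernel density estimate, Eq. (5) with
    w_t = exp(-lam * (T - t)), maintained recursively on a grid."""
    def __init__(self, grid, h, lam):
        self.grid, self.h = grid, h
        self.decay = np.exp(-lam)
        self.S = np.zeros_like(grid)  # unnormalized weighted sum S_T
        self.N = 0.0                  # normalizing constant N_T

    def update(self, x_t):
        # Normal kernel centred at the new observation x_t
        k = np.exp(-0.5 * ((self.grid - x_t) / self.h)**2) \
            / (self.h * np.sqrt(2 * np.pi))
        self.S = self.decay * self.S + k
        self.N = self.decay * self.N + 1.0

    def density(self):
        return self.S / self.N

grid = np.linspace(-4.0, 4.0, 801)
est = DynamicKDE(grid, h=0.4, lam=0.05)
rng = np.random.default_rng(3)
for x in rng.normal(0.0, 1.0, 500):
    est.update(x)
mass = np.sum(est.density()) * (grid[1] - grid[0])
print(round(mass, 2))  # total probability mass on the grid, close to 1
```

The moving-window weights admit a similar update if the last W kernel evaluations are retained in a queue.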
In this paper we consider experiments on real and simulated dynamic data, in which we train
on an initial set of data (ordered by time), choosing any parameters by cross-validation, and
then test on observations in the second part of the data.
4. RESULTS
In this section we describe some results of our adaptive updating ideas and compare them
to conventional statistical classification methods in an example. The simplest and non-adaptive
approach is to use the classification rule that was learned from the training data and apply that to
all batches, with no updating. A small modification is to update the priors according to the new
data, and a further modification is to use the priors which are estimated using only the last batch.
An alternative method is to completely re-learn the rule at each time point.
We tried out some of the above ideas on two simulated datasets. At each of 1000 time points
we generate an example with three variables (X₁, X₂, X₃) from each of 2 classes. We use the first 500
observations as the training data and the remaining 1500 as test data. The distribution of each
class has two independent normal variables (with unit standard deviation) and a uniformly dis-
tributed (on [0, 1)) “noise” variable. The mean of the noise variable, μ₃ = 0.5, was independent
of time; the means of the normal variables vary with time as follows:
• dat1: Class 1 has μ₁, μ₂ = 0 for t ≤ 750 and μ₁, μ₂ = t/1000 for 751 ≤ t ≤ 1000, whereas
Class 2 differs in that μ₂ = 2 for t ≤ 750 and μ₂ = 2 + t/1000 for 751 ≤ t ≤ 1000, so
that there is no change to the distributions until two thirds of the way through the testing
phase, when there is a sudden jump followed by a slow drift.

Table 1. Error rates for the kernel classifier on simulated data. See text for (i)-(iv).
• dat2: Class 1 has μ₁, μ₂ = 0 and Class 2 has μ₁ = 2t/1000, μ₂ = 2 − 2t/1000 for 1 ≤ t ≤
1000. In this case there is a gradual shift, throughout the training and test phases, in the
mean of (X₁, X₂) for the second group from (0, 2) to (2, 0).
Since we split the testing data into batches of 50 observations (which always corresponds to 25
observations from each class), the change should happen in batch 21 for dat1, and should run
through the whole training and testing phase for dat2. Note that for both datasets the
observations were ordered so that the priors (however estimated) were always equal.
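For reference, data of the dat2 form could be generated as follows (a sketch under our reading of the description above; the function name is invented):

```python
import numpy as np

def make_dat2(rng):
    """One observation per class at each of 1000 time points: two
    independent N(mu, 1) variables plus a U[0, 1) noise variable.
    Class 2's normal means drift linearly from (0, 2) to (2, 0)."""
    rows = []
    for t in range(1, 1001):
        for cls, (m1, m2) in [(1, (0.0, 0.0)),
                              (2, (2 * t / 1000, 2 - 2 * t / 1000))]:
            x = [rng.normal(m1, 1), rng.normal(m2, 1), rng.uniform()]
            rows.append((t, cls, x))
    return rows

rng = np.random.default_rng(4)
data = make_dat2(rng)
train, test = data[:500], data[500:]
print(len(data), len(train), len(test))  # -> 2000 500 1500
```

The test portion would then be scored in 30 batches of 50 observations each.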
For the kernel classifier we tried four approaches:
(i) h₁, h₂ were trained on the training data, but the classifier was not updated (no-learn);
(ii) the classifier was updated using all observations thus far observed (with no change in the
smoothing parameters);
and the dynamic kernel estimator given by Eq. (5) was used with
(iii) a weighted exponential decay (λ), and
(iv) a window of width W learned from the training data.
In the latter two cases, two parameters for each of the two classes were chosen by
cross-validation. The results are given in Table 1.
The credit data covers a two-year period and consists of 156 273 observations, the first 5000
of which were used as the initial training data. Initially, there were 15 attributes (all categorical)
in one year, and 14 attributes in the second year. Since we are not dealing with the problem of a
change in the number of attributes, the extra variable was discarded. We coded all the variables
into a list of 0/1 attributes, and stepwise selection in linear discriminant analysis was used to
select 15 of these binary variables.
We display the error rates in each batch in Fig. 3. The kernel classifier which was kept fixed
gave an overall error rate of over 20%, whereas updating the prior to reflect the proportions in
the last batch gave a small improvement to 18.2%. The dynamic version (moving window) gave
a large improvement to 10.4%. Note that we neither scaled the binary data nor considered a
different smoothing parameter in each dimension. So in effect we made the assumption that the
variables were independent with common variance, which was certainly not the case.
Stationarity is a key issue which will affect the performance of any dynamic classification
algorithm. For example, if the changes which occur in the training phase are very different from
the nature of the changes which take place during testing then the way the parameters are updated
is likely to be deficient. For this reason it seems that any method should include a monitoring
process even if this monitoring is not normally used to update the rules.
Fig. 3. Error rates for two kernel classifiers. The ‘*’ points are for a non-dynamic classifier in which the priors were
updated according to the proportions observed in the previous batch. The ‘1’ points are for a moving window
classifier, Eq. (5) with (W₁, W₂) = (400, 3000) and (h₁, h₂) = (0.25, 0.25).
References
[1] S.L. Crawford, Extensions to the CART algorithm, International Journal of Man-Machine
Studies 31 (1989) 197-217.
[2] D. Michie, D.J. Spiegelhalter, C.C. Taylor, Eds., Machine Learning, Neural and Statistical
Classification (Ellis Horwood, Chichester, 1994).
[3] G. Nakhaeizadeh, C.C. Taylor, G. Kunisch, Dynamic Aspects of Statistical Classification.
In: Intelligent Adaptive Agents, AAAI Technical Report No. WS-96-04 (AAAI Press, Menlo
Park, CA, 1996) pp. 55-64.
[4] G. Nakhaeizadeh, C.C. Taylor, G. Kunisch, Dynamic Supervised Learning: Some Basic
Issues and Application Aspects. In: Classification, Data Analysis, and Knowledge
Organisation, B. Klar, O. Opitz, Eds. (Springer, Berlin, 1997).
[5] J.C. Schlimmer, R. Granger, Incremental learning from noisy data, Machine Learning 1
(1986) 317-354.
[6] P.E. Utgoff, Incremental learning of decision trees, Machine Learning 4 (1989) 161-186.
[7] P.E. Utgoff, An improved algorithm for incremental induction of decision trees, In:
Proceedings of Eleventh Machine Learning Conference, Rutgers University (Morgan
Kaufmann, 1994).
[8] M.P. Wand, M.C. Jones, Kernel Smoothing (Chapman and Hall, London, 1995).