Classification and kernel density estimation
Pergamon
pp. 411-417, 1997
© 1997 Elsevier Science Ltd. All rights reserved. Printed in Great Britain.
0083-6656/97 $15.00 + 0.00
PII: S0083-6656(97)00046-9
Abstract - The method of kernel density estimation can be readily used for
the purposes of classification, and an easy-to-use package (ALLOC80) is now
in wide circulation. It is known that this method performs well (at least in
relative terms) in the case of bimodal, or heavily skewed distributions.
In this article we first review the method, and describe the problem of
choosing h, an appropriate smoothing parameter. We point out that the usual
approach of choosing h to minimize the asymptotic integrated mean squared
error is not entirely appropriate, and we propose an alternative estimate of
the classification error rate, which is the target of interest. Unfortunately, it
seems that analytic results are hard to come by, but simulations indicate that
the proposed estimator has smaller mean squared error than the usual
cross-validation estimate of error rate.
A second topic which we briefly explore is that of classification of drifting
populations. In this case, we outline two general approaches to updating
a classifier based on new observations. One of these approaches is limited
to parametric classifiers; the other relies on weighting of observations, and
is more generally applicable. We use an example from the credit industry
as well as some simulated data to illustrate the methods.
1. INTRODUCTION
The kernel estimate of the density for class C_j, based on the n_j training observations from that class, is

f̂_j(x) = (1/(n_j h)) Σ_i K((x − x_i)/h),   (1)

where K(·) is a kernel function such that ∫ K(x) dx = 1 and h is the smoothing parameter
(which can take the form of a vector, or even a matrix, in the case of multi-dimensional data). The
classification rule is then to allocate x* to class C_m if m = argmax_j f̂_j(x*).
Note that if we use a normal kernel function, then the limiting case when h → 0 gives a nearest
neighbour classifier. This suggests that, with careful choice of the smoothing parameters, we
should always do better than the 1-NN classifier. However, there can be numerical difficulties in
using very small values of h naively. If density estimation per se is the goal, then it is widely
recognized that the choice of smoothing parameter is much more important than the choice of
kernel function. However, as will be seen in Section 2, there are many issues which are distinctive
when classification is the end target. For example, there is no reason why the kernel function
should itself be a density. If we relax the condition that K(x) ≥ 0, then better properties may
ensue.
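To make the allocation rule concrete, the following is a minimal sketch, not taken from the paper, of the classifier based on Eq. (1) with a normal kernel and a scalar smoothing parameter per class; the function names and toy data are invented for illustration:

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate at a point x (Eq. (1)), normal kernel."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

def classify(x, class_data, smoothing):
    """Allocate x to the class with the largest kernel density estimate."""
    scores = [kde(x, d, h) for d, h in zip(class_data, smoothing)]
    return int(np.argmax(scores))

# Toy example with two well-separated classes
rng = np.random.default_rng(0)
c0 = rng.normal(0.0, 1.0, 100)
c1 = rng.normal(5.0, 1.0, 100)
print(classify(0.1, [c0, c1], [0.5, 0.5]))  # -> 0
print(classify(4.9, [c0, c1], [0.5, 0.5]))  # -> 1
```

Equal priors are implicit in this sketch; unequal priors would simply multiply each class score by its prior probability.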
1.1. Shifting populations
In classical supervised learning, the available examples (the training data) are usually used to
learn a classifier. In many practical situations in which the environment changes, this procedure
ceases to work. In the StatLog project [2, Ch. 9] this situation was encountered in the case of
a credit-scoring application. Generally speaking, the application of a discrimination algorithm
to classify new, unseen examples will be problematic if either the number of attributes changes
or the number of attributes remains the same but the interpretation of the records of the datasets
changes over time. In this situation the distribution of at least one class is gradually changing.
To solve this problem one could simply relearn the rule provided that there are enough new
examples with known class, but this is wasteful. An alternative is incremental learning - see
Refs. [1,6,7,5] for example - in which one of the design goals is that the decision tree that is
produced should depend only on the set of instances, without regard to the sequence in which
those instances were presented. However, if there is population drift some of the old data should
be downweighted as no longer representative. To deal with dynamic aspects there are essentially
two problems (in addition to those normally associated with classification). The first problem is
to detect any change in the situation, the second problem is how to react to any detected change.
In Section 3 this paper discusses ideas for adaptive learning which can capture dynamic as-
pects of real-world datasets. Although some of these ideas have a general character and could be
applied to any supervised algorithm, here we focus attention on kernel density methods which
use a weighted average of kernel functions, with the weights determined by the age of the observations. A final
section applies some of the methods and ideas to simulated data and an example from the credit
industry.
2. CHOOSING THE SMOOTHING PARAMETER

Numerous researchers have tackled the problem of choosing an appropriate bandwidth which
is based on the data - see, for example, Ref. [8] for references. However, most of the results are
related to minimizing the (asymptotic) integrated mean squared error (IMSE), which is given by
E ∫ (f̂ − f)². It is worth noting that such a policy for choosing h may not work very well in
a classification setting when we want to minimize the expected misclassification rate, which for
two classes is given by

e = π₁ ∫_{π₂f̂₂ > π₁f̂₁} f₁(x) dx + π₂ ∫_{π₁f̂₁ > π₂f̂₂} f₂(x) dx,   (2)
where π_i is the prior probability that the data belong to class C_i. The usual approach of taking
a Taylor series expansion does not work here, and the fact that the limits of the integral are
random variables makes this look intractable. We illustrate by simulation that the optimal h₁, h₂
which minimize the IMSE can be very different from those which minimize the expected error rate. In
this example, 100 observations were simulated from each of N(0, 1) and N(2, 0.5²). For each of
100 samples we calculated f̂₁ and f̂₂ using a range of different smoothing parameters. For these
distributions, (h₁, h₂) = (0.422, 0.211) minimize the asymptotic IMSE, whereas the actual error
rate is minimized for (h₁, h₂) = (0.720, 0.215).
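To illustrate how such a comparison can be carried out, the sketch below (our own; it assumes equal priors and a normal kernel) computes the actual error rate of a fitted kernel classifier by numerical integration against the true densities N(0, 1) and N(2, 0.5²), for the two smoothing-parameter pairs quoted above on one simulated sample:

```python
import numpy as np

def kde(grid, data, h):
    """Kernel density estimate evaluated on a grid, normal kernel."""
    u = (grid[:, None] - data[None, :]) / h
    return np.mean(np.exp(-0.5 * u**2), axis=1) / (h * np.sqrt(2 * np.pi))

def phi(x, m, s):
    """Normal density with mean m and standard deviation s."""
    return np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))

def actual_error(f1_hat, f2_hat, grid, dx):
    """Expected error rate (equal priors) of the rule 'choose class 1
    iff f1_hat > f2_hat', integrated against the true densities."""
    f1, f2 = phi(grid, 0.0, 1.0), phi(grid, 2.0, 0.5)
    pick1 = f1_hat > f2_hat
    return 0.5 * np.sum(f1[~pick1]) * dx + 0.5 * np.sum(f2[pick1]) * dx

rng = np.random.default_rng(1)
x1, x2 = rng.normal(0, 1, 100), rng.normal(2, 0.5, 100)
grid = np.linspace(-5.0, 6.0, 2201)
dx = grid[1] - grid[0]
errs = [actual_error(kde(grid, x1, h1), kde(grid, x2, h2), grid, dx)
        for h1, h2 in [(0.422, 0.211), (0.720, 0.215)]]
print([round(e, 4) for e in errs])
```

Averaging over many samples, as in the paper, simply repeats this for fresh draws of x1 and x2.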
Suppose that the integrals in Eq. (2) are over a connected region. Then we need to estimate
the point t such that f₁(t) = f₂(t). Let t̂(h₁, h₂) estimate t, defined by f̂_{h₁}(t̂) = f̂_{h₂}(t̂),
and write σ_K² = ∫ K(u)u² du. It can be shown that, approximately (as h ↓ 0 and n → ∞), the
error rate may be estimated (omitting a multiplicative constant) by

ê = Σ_{x_i ∈ C₁} I{f̂₂(x_i) − f̂₁^(i)(x_i)} + Σ_{x_i ∈ C₂} I{f̂₁(x_i) − f̂₂^(i)(x_i)},   (3)
where f̂^(i)(x) denotes the kernel estimate of f(x) (using Eq. (1)) based on all of the data except the
ith observation, and I(x) = 1 if x > 0 and 0 otherwise. An alternative is to use a smoothed version
of (3), which is more obviously an estimate of (2) (again omitting a multiplicative constant),
given by

ê_h = Σ_{x_i ∈ C₁} ∫ I{f̂₂(x) − f̂₁^(i)(x)} K_h(x − x_i) dx + Σ_{x_i ∈ C₂} ∫ I{f̂₁(x) − f̂₂^(i)(x)} K_h(x − x_i) dx,   (4)

where K_h(u) = K(u/h)/h.
In this case h = 0 gives the usual leave-one-out, or cross-validation, estimate of the error rate,
since, for example, the first integral in (4) will be 1 if f̂₂(x_i) > f̂₁^(i)(x_i) and 0 otherwise. So
although h = 0 will give an unbiased estimate of the error, a value of h > 0 can give a better
estimate (in terms of mean squared error). Some progress can be made in computing the
approximate bias and variance of the estimator derived from Eq. (4), and simulations confirm that, in
general, it does lead to a better estimator than Eq. (3). Figs. 1 and 2 show the estimated mean
squared error over 100 samples of size 10 from each of N(0, 1) and N(1, 1). Note that (4)
requires three or four choices of smoothing parameter, whereas (3) requires only two. However,
although (4) can lead to a smaller mean squared error - the minimum of the curve is 0.004
compared with 0.015 for (3) - the extra computation will rarely be worth the effort. Moreover, if
Fig. 1. Mean squared error for estimate of error rate using Eq. (4) as function of smoothing parameter h.
Fig. 2. Mean squared error of estimate given by Eq. (3) as a function of h₁, h₂.
classification of future observations is to be carried out, then only good choices of h₁, h₂ need to
be found. Estimation of the error rate may then be of secondary importance.
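For concreteness, the leave-one-out counting estimate in the spirit of Eq. (3) can be sketched as follows (our own illustration; the multiplicative constant is restored so that the result is a rate, and equal priors are assumed):

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate at a point x, normal kernel."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

def loo_error(x1, x2, h1, h2):
    """Leave-one-out estimate of the error rate: count training points
    that the rule allocates to the other class when they are deleted
    from their own class's density estimate."""
    wrong = 0
    for i in range(len(x1)):
        wrong += kde(x1[i], x2, h2) > kde(x1[i], np.delete(x1, i), h1)
    for i in range(len(x2)):
        wrong += kde(x2[i], x1, h1) > kde(x2[i], np.delete(x2, i), h2)
    return wrong / (len(x1) + len(x2))

# Same setup as the simulation in the text: samples of size 10
rng = np.random.default_rng(2)
x1, x2 = rng.normal(0, 1, 10), rng.normal(1, 1, 10)
print(loo_error(x1, x2, 0.5, 0.5))
```

The smoothed version of Eq. (4) would replace each indicator by an average of the indicator against a kernel centred at the deleted point.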
3. ADAPTIVE LEARNING

Nakhaeizadeh et al. [4] discuss ways of updating the classifier either by modifying the rule
which has been learned, or by modifying the training data. Suppose that we examine the data
in batches of size m and that, at time t + mk we detect a change which requires adaptation in
the learned rule. Any algorithm can be totally relearned from recent observations after a change
in one of the classes has been detected. More interestingly, we can consider how best to re-use
previously learned information.
We can use a “similarity” rule to throw away observations in the current training set which
conflict with those recently observed - i.e. which are close in feature space but have a different
class label. Alternatively, the older observations can simply be eliminated, or a kind of moving
window containing a predefined number of observations (possibly a mix of representative old
ones and new ones) could be used.
We try a possible implementation whereby old data are discarded and new data are included in
the “template” used for establishing a rule according to their perceived usefulness. As presented,
this system will have some limitations. For example, if there are drifting populations then new
data will be incorrectly classified, but should nevertheless be included in the template set. This
point is taken up in a nearest neighbour implementation in Ref. [3]. Note that this approach can
be used for any (not just similarity-based) algorithms.
We now focus attention on a dynamic version of the kernel estimator, which in the simple case
is given by

f̂_T(x) = (1/(N_T h)) Σ_{t=1}^{T} w_t K((x − x_t)/h)   (5)

as an estimate of the density f_T(x) at time T, when we have previously observed x_t in the class
of interest. Here the w_t are weights and N_T = Σ_t w_t is a normalizing constant. For example, we could choose
w_t = e^{−λ(T−t)}, in which case N_T = (1 − e^{−λT})/(1 − e^{−λ}), or

w_t = 1 for T ≥ t > T − W, and 0 otherwise,

in which case N_T = W. Using either of these parameterizations for w_t requires the choice of either
λ or W, in addition to the smoothing parameter h. Of course both parameters must be chosen for
each class. Again, analytic calculations appear to be intractable, even for IMSE and very simple
dynamic models. However, numerical calculations are simplified by noting that simple updating
formulae can be derived expressing f̂_T(x) in terms of f̂_{T−1}(x) and a kernel function of x_T.
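With the exponential weights, such an update can be maintained on a fixed evaluation grid: the unnormalized sum satisfies S_T(x) = e^{−λ} S_{T−1}(x) + K_h(x − x_T), with N_T = e^{−λ} N_{T−1} + 1. A sketch of this recursion (our own; the class name and parameter values are invented):

```python
import numpy as np

class DynamicKDE:
    """Exponentially weighted kernel density estimate, Eq. (5) with
    w_t = exp(-lam * (T - t)), maintained recursively on a grid."""
    def __init__(self, grid, h, lam):
        self.grid, self.h = grid, h
        self.decay = np.exp(-lam)
        self.S = np.zeros_like(grid)  # unnormalized weighted sum S_T
        self.N = 0.0                  # normalizing constant N_T

    def update(self, x_t):
        # Normal kernel centred at the new observation x_t
        k = np.exp(-0.5 * ((self.grid - x_t) / self.h)**2) \
            / (self.h * np.sqrt(2 * np.pi))
        self.S = self.decay * self.S + k
        self.N = self.decay * self.N + 1.0

    def density(self):
        return self.S / self.N

grid = np.linspace(-4.0, 4.0, 801)
est = DynamicKDE(grid, h=0.4, lam=0.05)
rng = np.random.default_rng(3)
for x in rng.normal(0.0, 1.0, 500):
    est.update(x)
mass = np.sum(est.density()) * (grid[1] - grid[0])
print(round(mass, 2))  # total probability mass on the grid, close to 1
```

The moving-window weights admit a similar update if the last W kernel evaluations are retained in a queue.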
In this paper we consider experiments on real and simulated dynamic data, in which we train
on an initial set of data (ordered by time), choosing any parameters by cross-validation, and
then test on observations in the second part of the data.
4. RESULTS
In this section we describe some results of our adaptive updating ideas and compare them
to conventional statistical classification methods in an example. The simplest and non-adaptive
approach is to use the classification rule that was learned from the training data and apply that to
all batches, with no updating. A small modification is to update the priors according to the new
data, and a further modification is to use the priors which are estimated using only the last batch.
An alternative method is to completely re-learn the rule at each time point.
We tried out some of the above ideas on two simulated datasets. At each of 1000 time points
we generate an example with three variables (X₁, X₂, X₃) from each of 2 classes. We use the first 500
observations as the training data and the remaining 1500 as test data. The distribution of each
class has two independent normal variables (with unit standard deviation) and a uniformly dis-
tributed (on [0, 1)) “noise” variable. The mean of the noise variable, μ₃ = 0.5, was independent
of time; the means of the normal variables vary with time as follows:
• dat1: Class 1 has μ₁, μ₂ = 0 for t ≤ 750 and μ₁, μ₂ = t/1000 for 751 ≤ t ≤ 1000, whereas
Class 2 differs in that μ₂ = 2 for t ≤ 750 and μ₂ = 2 + t/1000 for 751 ≤ t ≤ 1000, so
that there is no change to the distributions until two thirds of the way through the testing
phase, when there is a sudden jump followed by a slow drift.

Table 1. Error rates for the kernel classifier on simulated data. See text for (i)-(iv).
• dat2: Class 1 has μ₁, μ₂ = 0 and Class 2 has μ₁ = 2t/1000, μ₂ = 2 − 2t/1000 for 1 ≤ t ≤
1000. In this case there is a gradual shift, throughout the training and test phases, in the
mean of (X₁, X₂) for the second group from (0, 2) to (2, 0).
Since we split the testing data into batches of 50 observations (which always corresponds to 25
observations from each class), the change should happen in batch 21 for dat1, and should run
through the whole training and testing phase for dat2. Note that for both datasets the
observations were ordered so that the priors (however estimated) were always equal.
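For reference, data of the dat2 form could be generated as follows (a sketch under our reading of the description above; the function name is invented):

```python
import numpy as np

def make_dat2(rng):
    """One observation per class at each of 1000 time points: two
    independent N(mu, 1) variables plus a U[0, 1) noise variable.
    Class 2's normal means drift linearly from (0, 2) to (2, 0)."""
    rows = []
    for t in range(1, 1001):
        for cls, (m1, m2) in [(1, (0.0, 0.0)),
                              (2, (2 * t / 1000, 2 - 2 * t / 1000))]:
            x = [rng.normal(m1, 1), rng.normal(m2, 1), rng.uniform()]
            rows.append((t, cls, x))
    return rows

rng = np.random.default_rng(4)
data = make_dat2(rng)
train, test = data[:500], data[500:]
print(len(data), len(train), len(test))  # -> 2000 500 1500
```

The test portion would then be scored in 30 batches of 50 observations each.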
For the kernel classifier we tried four approaches:
(i) h₁, h₂ were trained on the training data, but the classifier was not updated (no-learn);
(ii) the classifier was updated using all observations thus far observed (with no change in the
smoothing parameters);
and the dynamic kernel estimator given by Eq. (5) was used with
(iii) a weighted exponential decay (λ), and
(iv) a window of width W learned from the training data.
In the latter two cases, two parameters for each of the two classes were chosen by
cross-validation. The results are given in Table 1.
The credit data covers a two-year period and consists of 156 273 observations, the first 5000
of which were used as the initial training data. Initially, there were 15 attributes (all categorical)
in one year, and 14 attributes in the second year. Since we are not dealing with the problem of a
change in the number of attributes, the extra variable was discarded. We coded all the variables
into a list of 0/1 attributes, and stepwise selection in linear discriminant analysis was used to
select 15 of these binary variables.
We display the error rates in each batch in Fig. 3. The kernel classifier which was kept fixed
gave an overall error rate of over 20%, whereas updating the prior to reflect the proportions in
the last batch gave a small improvement to 18.2%. The dynamic version (moving window) gave
a large improvement to 10.4%. Note that we neither scaled the binary data nor considered a
different smoothing parameter in each dimension. So in effect we made the assumption that the
variables were independent with common variance, which was certainly not the case.
Stationarity is a key issue which will affect the performance of any dynamic classification
algorithm. For example, if the changes which occur in the training phase are very different from
the nature of the changes which take place during testing then the way the parameters are updated
is likely to be deficient. For this reason it seems that any method should include a monitoring
process even if this monitoring is not normally used to update the rules.
Fig. 3. Error rates for two kernel classifiers. The ‘*’ points are for a non-dynamic classifier in which the priors were
updated according to the proportions observed in the previous batch. The ‘1’ points are for a moving window
classifier, Eq. (5) with (W₁, W₂) = (400, 3000) and (h₁, h₂) = (0.25, 0.25).
References
[1] S.L. Crawford, Extensions to the CART algorithm, International Journal of Man-Machine
Studies 31 (1989) 197-217.
[2] D. Michie, D.J. Spiegelhalter, C.C. Taylor, Eds., Machine Learning, Neural and Statistical
Classification (Ellis Horwood, Chichester, 1994).
[3] G. Nakhaeizadeh, C.C. Taylor, G. Kunisch, Dynamic Aspects of Statistical Classification.
In: Intelligent Adaptive Agents, AAAI Technical Report No. WS-96-04 (AAAI Press, Menlo
Park, CA, 1996) pp. 55-64.
[4] G. Nakhaeizadeh, C.C. Taylor, G. Kunisch, Dynamic Supervised Learning: Some Basic
Issues and Application Aspects. In: Classification, Data Analysis, and Knowledge
Organisation, B. Klar, O. Opitz, Eds. (Springer, Berlin, 1997).
[5] J.C. Schlimmer, R. Granger, Incremental learning from noisy data, Machine Learning 1
(1986) 317-354.
[6] P.E. Utgoff, Incremental learning of decision trees, Machine Learning 4 (1989) 161-186.
[7] P.E. Utgoff, An improved algorithm for incremental induction of decision trees, In:
Proceedings of Eleventh Machine Learning Conference, Rutgers University (Morgan
Kaufmann, 1994).
[8] M.P. Wand, M.C. Jones, Kernel Smoothing (Chapman and Hall, London, 1995).