03-bayes-nearest-neighbors
Error
Why can't we simply drive the probability of error to zero?
Noise: the label $Y$ is not a deterministic function of $X$, so even the best possible classifier will make errors on some inputs
Our goal is to formulate a simple rule for minimizing the risk $R(h) = P(h(X) \neq Y)$ when the joint distribution of $(X, Y)$ is known,
or, equivalently, for maximizing the probability of a correct decision $P(h(X) = Y)$
Note: $h$ is not restricted to any particular set $\mathcal{H}$, and hence we will have $R(h^*) \le R(h)$ for every possible classifier $h$
Terminology:
• $h^*(x) = \arg\max_y P(Y = y \mid X = x)$ is called a Bayes classifier
• $R^* = R(h^*)$ is called the Bayes risk
Proof
For convenience, assume $X$ is a continuous random variable with density $f(x)$. Then the probability of a correct decision is
$P(h(X) = Y) = \int P(Y = h(x) \mid X = x)\, f(x)\, dx$
Since we want to maximize this expression, we should design our classifier such that the integrand $P(Y = h(x) \mid X = x)$ is maximal for every $x$
Proof (Part 3)
Therefore, the optimal $h$ has
$h^*(x) = \arg\max_y P(Y = y \mid X = x) = \arg\max_y \frac{f(x \mid Y = y)\, \pi_y}{f(x)} = \arg\max_y f(x \mid Y = y)\, \pi_y$
where $\pi_y = P(Y = y)$. The middle equality is just Bayes rule, and the denominator $f(x)$ can be dropped since it does not depend on $y$
Note that in addition to our rigorous derivation, this classifier also coincides with
“common sense”
Variations
Different ways of expressing the Bayes classifier:
• $h^*(x) = \arg\max_y P(Y = y \mid X = x) = \arg\max_y f(x \mid Y = y)\, \pi_y$
• When $Y \in \{0, 1\}$: decide $h^*(x) = 1$ exactly when $\frac{f(x \mid Y = 1)}{f(x \mid Y = 0)} \ge \frac{\pi_0}{\pi_1}$ (a likelihood ratio test)
• When the classes are equally likely, $\pi_0 = \pi_1$: $h^*(x) = \arg\max_y f(x \mid Y = y)$ (a maximum likelihood classifier/detector)
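As a concrete sketch of the likelihood ratio form, here is a minimal Python example; the Gaussian class conditionals $N(0,1)$ and $N(2,1)$ and the equal priors are purely illustrative assumptions, not part of the notes.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical example (for illustration only): 1-D Gaussian class conditionals
# f(x | Y=0) = N(0, 1) and f(x | Y=1) = N(2, 1), with priors pi0 = pi1 = 0.5.
pi0, pi1 = 0.5, 0.5

def bayes_classify(x):
    # likelihood ratio test: decide 1 when f(x|Y=1)/f(x|Y=0) >= pi0/pi1
    lr = norm.pdf(x, 2.0, 1.0) / norm.pdf(x, 0.0, 1.0)
    return 1 if lr >= pi0 / pi1 else 0

print(bayes_classify(0.4), bayes_classify(1.7))  # -> 0 1
```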
Example
How do we calculate the Bayes risk? Writing $\eta(x) = P(Y = 1 \mid X = x)$, in the binary case we have
$R^* = E[\min(\eta(X), 1 - \eta(X))]$
or
$R^* = \int \min\big(\pi_0 f(x \mid Y = 0),\; \pi_1 f(x \mid Y = 1)\big)\, dx$
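As a quick numerical sketch of the second formula, again assuming (purely for illustration) two unit-variance Gaussian class conditionals with means 0 and 2 and equal priors:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical example: f(x|Y=0) = N(0,1), f(x|Y=1) = N(2,1), pi0 = pi1 = 0.5.
pi0, pi1 = 0.5, 0.5
f0 = lambda x: norm.pdf(x, 0.0, 1.0)
f1 = lambda x: norm.pdf(x, 2.0, 1.0)

# Bayes risk: integral of min(pi0*f0(x), pi1*f1(x)) dx, approximated on a grid.
x = np.linspace(-10.0, 12.0, 200001)
dx = x[1] - x[0]
bayes_risk = np.sum(np.minimum(pi0 * f0(x), pi1 * f1(x))) * dx
print(bayes_risk)  # ~0.1587, i.e. Phi(-1), for this symmetric setup
```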
Fundamental tradeoff
(Figure: classification error compared with the Bayes risk, the best error achievable by any classifier.)
One natural approach is to use the data to estimate the distribution, and then just
plug this into the formula for the Bayes classifier
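A minimal sketch of what such a plug-in classifier could look like, assuming (for illustration only) that each class-conditional density is modeled as a 1-D Gaussian and the priors are estimated by class frequencies:

```python
import numpy as np

# Plug-in sketch: estimate (mu, sigma, prior) per class from data, then plug
# the estimates into h(x) = argmax_y pi_y_hat * f_hat(x | Y = y).
def fit_plugin(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(), Xc.std() + 1e-12, len(Xc) / len(X))
    return params

def predict_plugin(params, x):
    def score(c):
        mu, sigma, prior = params[c]
        # Gaussian log-likelihood plus log-prior
        return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) + np.log(prior)
    return max(params, key=score)

# Usage with hypothetical 1-D data
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
y = np.array([0] * 100 + [1] * 100)
params = fit_plugin(X, y)
print(predict_plugin(params, 0.3), predict_plugin(params, 1.8))
```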
Plugin methods
Before we get to these, we will first talk about what is quite possibly the absolute
simplest learning algorithm there is…
Nearest neighbor classifier
The nearest neighbor classifier is easiest to state in words: given a new input $x$, find the training point closest to $x$ and assign $x$ that point's label
The nearest neighbor rule defines a Voronoi partition of the input space
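A minimal brute-force sketch of the nearest neighbor rule (Euclidean distance; the toy data is hypothetical):

```python
import numpy as np

# 1-nearest-neighbor prediction by brute force.
def nn_classify(X_train, y_train, x):
    # index of the training point closest to x
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

# Usage with toy 2-D data
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1])
print(nn_classify(X_train, y_train, np.array([2.5, 2.6])))  # -> 1
```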
Risk of the nearest neighbor classifier
We will begin by restricting our attention to the binary case where $Y \in \{0, 1\}$, and write $\eta(x) = P(Y = 1 \mid X = x)$
With a large training set, the nearest neighbor of a test point $x$ lies very close to $x$. If the true label is $Y = 1$, which happens with probability $\eta(x)$, the nearest neighbor rule errs with probability approximately $1 - \eta(x)$. Similarly, if $Y = 0$, then the rule errs with probability approximately $\eta(x)$
Thus, as $n \to \infty$ we have
$R_{\mathrm{NN}} \to E[\, 2\, \eta(X)(1 - \eta(X)) \,] \le 2\, E[\min(\eta(X), 1 - \eta(X))] = 2 R^*$
since $\eta(1 - \eta) \le \min(\eta, 1 - \eta)$
Asymptotically, the risk of the nearest neighbor classifier is at most twice the Bayes risk
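A small Monte Carlo sketch of this bound, reusing the hypothetical Gaussian example from above; with a large training set the estimated 1-NN risk should land between the Bayes risk ($\approx 0.159$) and twice the Bayes risk ($\approx 0.317$):

```python
import numpy as np

# Hypothetical setup as before: X | Y=y ~ N(2y, 1), equal priors,
# so the Bayes risk is Phi(-1) ~ 0.159.
rng = np.random.default_rng(0)

def sample(n):
    y = rng.integers(0, 2, n)
    x = rng.normal(2.0 * y, 1.0)
    return x, y

x_tr, y_tr = sample(20000)   # "large n" training set
x_te, y_te = sample(5000)

# brute-force 1-NN prediction for each test point
pred = np.array([y_tr[np.argmin(np.abs(x_tr - x))] for x in x_te])
print("estimated 1-NN risk:", np.mean(pred != y_te))
```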
k-nearest neighbors
We can drive the factor of 2 in this result down to 1 by generalizing the nearest neighbor rule to the $k$-nearest neighbor rule as follows:
Assign a label to $x$ by taking a majority vote over the $k$ training points closest to $x$
How do we define this more mathematically?
Let $N_k(x) \subseteq \{1, \ldots, n\}$ denote the indices of the $k$ training points closest to $x$
If $Y \in \{0, 1\}$, then we can write the $k$-nearest neighbor classifier as
$h_k(x) = 1$ if $\sum_{i \in N_k(x)} y_i \ge k/2$, and $h_k(x) = 0$ otherwise
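A minimal brute-force sketch of the $k$-nearest neighbor rule (Euclidean distance, majority vote; the toy data is hypothetical):

```python
import numpy as np

# k-nearest-neighbor prediction by brute force; ties in the vote go to the
# smaller label because argmax returns the first maximizer.
def knn_classify(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(dists)[:k]          # indices in N_k(x)
    votes = np.bincount(y_train[neighbors])    # vote count per label
    return np.argmax(votes)

# Usage with toy 2-D data
X_train = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X_train, y_train, np.array([2.6, 2.6]), k=3))  # -> 1
```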
Choosing k: Practice
Setting the parameter $k$ is a problem of model selection
What is the right choice of $k$?
Not much practical guidance from the theory, so we typically must rely on
estimates based on holdout sets or more sophisticated model selection techniques
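For example, here is a minimal holdout-based sketch for picking $k$; the brute-force $k$-NN routine, the candidate values, and the data splits are all illustrative assumptions:

```python
import numpy as np

# Brute-force k-NN prediction (Euclidean distance, majority vote).
def knn_predict(X_tr, y_tr, x, k):
    nbrs = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    return np.argmax(np.bincount(y_tr[nbrs]))

# Pick the k with the smallest error on a holdout set.
def choose_k(X_tr, y_tr, X_hold, y_hold, candidate_ks):
    errors = {}
    for k in candidate_ks:
        preds = np.array([knn_predict(X_tr, y_tr, x, k) for x in X_hold])
        errors[k] = np.mean(preds != y_hold)    # holdout error for this k
    return min(errors, key=errors.get)

# Usage: best_k = choose_k(X_tr, y_tr, X_hold, y_hold, range(1, 22, 2))
```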
Choosing k: Theory
Using a similar argument as before, one can show that if $k \to \infty$ and $k/n \to 0$ as $n \to \infty$, then the risk of the $k$-nearest neighbor classifier converges to the Bayes risk $R^*$ for any distribution on $(X, Y)$
This is known as universal consistency: given enough data, the algorithm will
eventually converge to a classifier that matches the Bayes risk
Summary
Given enough data, the $k$-nearest neighbor classifier will do just as well as pretty much any other method
Catch
• The amount of required data can be huge, especially if our feature space is high-
dimensional
• The parameter $k$ can matter a lot, so model selection can be very important
• Finding the nearest neighbors out of a set of millions of examples is still pretty
hard
– can be sped up using k-d trees (see the sketch after this list), but can still be relatively expensive to apply
– in contrast, many of the other algorithms we will study have an expensive “training”
phase, but application is cheap
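A minimal sketch of the k-d tree speedup using scipy's cKDTree; the data here is hypothetical random noise, just to show where the one-time index-building cost and the cheap per-query lookups appear:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical data: one million 3-D training points with binary labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1_000_000, 3))
y_train = rng.integers(0, 2, 1_000_000)

tree = cKDTree(X_train)                                   # index-building step
dists, idx = tree.query(rng.normal(size=(1, 3)), k=5)     # 5 nearest neighbors
pred = np.argmax(np.bincount(y_train[idx[0]]))            # majority vote
print(pred)
```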