03-bayes-nearest-neighbors

The document discusses empirical risk minimization (ERM) and the fundamental tradeoff between the richness of hypothesis sets and the guarantees of error minimization. It introduces the Bayes classifier, which minimizes risk based on known distributions, and explores the nearest neighbor classifier as a simple learning algorithm. Additionally, it emphasizes the importance of model selection and the challenges of applying these methods in high-dimensional spaces.

Mark Davenport

Empirical risk minimization (ERM)

Recall the definitions of the risk and the empirical risk:
$$R(h) = \mathbb{P}(h(X) \neq Y) \qquad \widehat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{h(x_i) \neq y_i\}$$

Ideally, we would like to choose
$$h^\star = \arg\min_{h} R(h)$$

Since we cannot compute $R(h)$, instead we choose
$$\widehat{h}_n = \arg\min_{h \in \mathcal{H}} \widehat{R}_n(h)$$

This makes sense if $\mathcal{H}$ is not too large, so that $\widehat{R}_n(h) \approx R(h)$ for all $h \in \mathcal{H}$

Unfortunately, we also want $\mathcal{H}$ to be large so that $\min_{h \in \mathcal{H}} R(h)$ can be as small as possible…
Fundamental tradeoff
More hypotheses ultimately sacrifices our guarantee that $\widehat{R}_n(h) \approx R(h)$ for all $h \in \mathcal{H}$

[Figure: error plotted against the “richness” of the hypothesis set]
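One standard way to make this tradeoff explicit is the following decomposition of the excess risk (here $\inf_{h}$ ranges over all classifiers):
$$R(\widehat{h}_n) - \inf_{h} R(h) = \underbrace{\Big(R(\widehat{h}_n) - \inf_{h \in \mathcal{H}} R(h)\Big)}_{\text{estimation error}} + \underbrace{\Big(\inf_{h \in \mathcal{H}} R(h) - \inf_{h} R(h)\Big)}_{\text{approximation error}}$$
A richer $\mathcal{H}$ shrinks the approximation error but tends to inflate the estimation error, and vice versa.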


What is a good hypothesis?
Ideally, we would like to have a small number of hypotheses, so that $\widehat{R}_n(h) \approx R(h)$ for all $h \in \mathcal{H}$, while also being lucky enough to have some $h \in \mathcal{H}$ with $R(h) \approx 0$

In general, this may not be possible

There may not be any function $h$ with $R(h) = 0$

Why not?

Noise: the label $Y$ is typically not a deterministic function of $X$

Suppose we knew the joint distribution of our data $(X, Y)$

– what is the optimal classification rule $h$?
– what are the fundamental limits on how small $R(h)$ can be?
Known distribution case
Consider $(X, Y)$ where
• $X$ is a random vector in $\mathbb{R}^d$
• $Y$ is a random variable taking values in a finite set of labels (depending on $X$)

Let $h$ be a classifier with probability of error/risk given by
$$R(h) = \mathbb{P}(h(X) \neq Y)$$

Our goal is to formulate a simple rule for minimizing $R(h)$ when the joint distribution of $(X, Y)$ is known

We will let $P_{XY}$ denote this joint distribution of $(X, Y)$


The joint distribution
For any set $A$ and any label $y$, $P_{XY}$ gives us a way to compute the probability that a randomly drawn $(X, Y)$ will satisfy $X \in A$ and $Y = y$

Conditioning on $X = x$ results in a conditional distribution on the class labels, known as the a posteriori distribution:
$$\mathbb{P}(Y = y \mid X = x)$$

Conditioning on $Y = y$ results in the class conditional distribution:
$$P_{X|Y}(x \mid y)$$


Factoring the joint distribution
It is often useful to think about the joint distribution in terms of these conditional distributions

For any fixed $(x, y)$ we can write
$$P_{XY}(x, y) = \mathbb{P}(Y = y \mid X = x)\, P_X(x)$$
or
$$P_{XY}(x, y) = P_{X|Y}(x \mid y)\, \mathbb{P}(Y = y)$$

Both ways of thinking will be useful!


The Bayes classifier
Theorem
The classifier
$$h^\star(x) = \arg\max_{y} \mathbb{P}(Y = y \mid X = x)$$
satisfies $R(h^\star) \le R(h)$ for any possible classifier $h$

Note: $h$ is not restricted to any particular set $\mathcal{H}$, and hence we will have $R(h^\star) \le \min_{h \in \mathcal{H}} R(h)$ for any $\mathcal{H}$

Terminology:
• $h^\star$ is called a Bayes classifier
• $R^\star = R(h^\star)$ is called the Bayes risk
Proof
For convenience, assume $X$ is a continuous random variable with class conditional densities $f_y(x)$

Let $\pi_y = \mathbb{P}(Y = y)$ denote the a priori class probabilities

Consider an arbitrary classifier $h$. Denote the decision regions by $\Gamma_y = \{x : h(x) = y\}$


Proof (Part 2)
We can write
$$1 - R(h) = \mathbb{P}(h(X) = Y) = \sum_{y} \pi_y \int_{\Gamma_y} f_y(x)\, dx = \int \pi_{h(x)}\, f_{h(x)}(x)\, dx$$

Since we want to maximize this expression, we should design our classifier such that, for every $x$, the integrand $\pi_{h(x)}\, f_{h(x)}(x)$ is maximal
Proof (Part 3)
Therefore, the optimal $h^\star$ has
$$h^\star(x) = \arg\max_{y} \pi_y\, f_y(x) = \arg\max_{y} \mathbb{P}(Y = y \mid X = x)$$
where the second equality holds because $\mathbb{P}(Y = y \mid X = x) = \pi_y f_y(x) / f_X(x)$ and the denominator does not depend on $y$. Bayes rule!

Note that in addition to our rigorous derivation, this classifier also coincides with “common sense”
Variations
Different ways of expressing the Bayes classifier (binary case):

• In general, declare $\widehat{y} = 1$ when $\pi_1 f_1(x) \ge \pi_0 f_0(x)$, i.e., when
$$\frac{f_1(x)}{f_0(x)} \ge \frac{\pi_0}{\pi_1}$$
(a likelihood ratio test)

• When $\pi_0 = \pi_1$, this reduces to declaring $\widehat{y} = 1$ when $f_1(x) \ge f_0(x)$ (the maximum likelihood classifier/detector)
Example
Suppose that the priors $\pi_0, \pi_1$ and the class conditional densities $f_0, f_1$ are specified explicitly. Plugging them into the likelihood ratio test yields a concrete decision rule.

Example
How do we calculate the Bayes risk?

When the priors are equal, the test reduces to declaring $\widehat{y} = 1$ if and only if the likelihood ratio exceeds one, and the Bayes risk is the probability of error incurred by this rule
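As a concrete worked instance (an illustrative choice, not necessarily the one used in the original slides): suppose $\pi_0 = \pi_1 = \tfrac{1}{2}$ and the class conditional densities are scalar Gaussians $f_y = \mathcal{N}(\mu_y, \sigma^2)$ with $\mu_1 > \mu_0$. The likelihood ratio test then reduces to a simple threshold,
$$\frac{f_1(x)}{f_0(x)} \ge 1 \;\Longleftrightarrow\; x \ge \frac{\mu_0 + \mu_1}{2},$$
and the Bayes risk is the probability that a sample lands on the wrong side of this threshold:
$$R^\star = \mathbb{P}\!\left(\mathcal{N}(0, \sigma^2) > \frac{\mu_1 - \mu_0}{2}\right) = Q\!\left(\frac{\mu_1 - \mu_0}{2\sigma}\right),$$
where $Q$ denotes the standard Gaussian tail function.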


Alternative cost/loss functions
So far we have focused on minimizing the risk $R(h) = \mathbb{P}(h(X) \neq Y)$

There are many situations where this is not appropriate

• cost-sensitive classification
– type I/type II errors or misses/false alarms may have very different costs, in which case it may be desirable to instead minimize a weighted risk of the form
$$c_0\, \mathbb{P}(h(X) = 1,\, Y = 0) + c_1\, \mathbb{P}(h(X) = 0,\, Y = 1)$$
– alternatively, it may be better to focus on the two error probabilities directly à la Neyman-Pearson classification
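For the weighted risk above (with the generic costs $c_0$ and $c_1$ standing in for whatever the application dictates), the same argument used in the proof of the Bayes classifier shows that the optimal rule is simply a re-thresholded likelihood ratio test:
$$\text{declare } \widehat{y} = 1 \iff \frac{f_1(x)}{f_0(x)} \ge \frac{c_0\, \pi_0}{c_1\, \pi_1}$$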
Alternative cost/loss functions
So far we have focused on minimizing the risk $R(h) = \mathbb{P}(h(X) \neq Y)$

There are many situations where this is not appropriate

• unbalanced datasets
– when one class dominates the other, the probability of error will place less emphasis on the smaller class
– the class proportions in our dataset may not be representative of the “wild”
– one can use the same ideas as before, or alternatively simply minimize something like the balanced error rate
$$\tfrac{1}{2}\, \mathbb{P}(h(X) = 1 \mid Y = 0) + \tfrac{1}{2}\, \mathbb{P}(h(X) = 0 \mid Y = 1)$$
or another criterion that weights the two class conditional error probabilities equally
Fundamental tradeoff

[Figure: error plotted against the “richness” of the hypothesis set, with the Bayes risk shown as the floor that the error cannot drop below]

What about learning?
We have just seen that when we know the true distribution underlying our dataset,
solving the classification problem is straightforward

Can we get close when all we have is the data?

One natural approach is to use the data to estimate the distribution, and then just
plug this into the formula for the Bayes classifier
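As a rough sketch of what such a plugin method might look like (assuming, purely for illustration, Gaussian class conditional densities with diagonal covariance; the function and variable names are placeholders, not notation from the slides):

```python
import numpy as np

def gaussian_plugin_classifier(X_train, y_train):
    """Fit a diagonal-covariance Gaussian to each class, estimate the priors from
    class frequencies, and plug the estimates into the Bayes classification rule."""
    classes = np.unique(y_train)
    params = []
    for c in classes:
        Xc = X_train[y_train == c]
        pi_hat = len(Xc) / len(X_train)          # estimated prior
        mu_hat = Xc.mean(axis=0)                 # estimated mean
        var_hat = Xc.var(axis=0) + 1e-6          # estimated (diagonal) variance
        params.append((pi_hat, mu_hat, var_hat))

    def predict(x):
        # Score each class by log(prior) + log(estimated class conditional density)
        scores = [np.log(pi) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
                  for pi, mu, var in params]
        return classes[int(np.argmax(scores))]

    return predict
```

The quality of the resulting classifier depends entirely on how well the estimated distributions match the truth.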
Plugin methods

Before we get to these, we will first talk about what is quite possibly the absolute
simplest learning algorithm there is…
Nearest neighbor classifier
The nearest neighbor classifier is easiest to state in words:

Assign to $x$ the same label as the closest training point to $x$

The nearest neighbor rule defines a Voronoi partition of the input space
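A minimal sketch of this rule in Python/NumPy (assuming Euclidean distance and NumPy arrays; the array names are placeholders rather than notation from the slides):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Return the label of the training point closest to x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance from x to every training point
    return y_train[np.argmin(dists)]             # label of the closest one
```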
Risk of the nearest neighbor classifier
We will begin by restricting our attention to the binary case where $\mathcal{Y} = \{0, 1\}$, and we write $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$

Consider the Bayes risk conditioned on $X = x$:
$$\mathbb{P}(h^\star(X) \neq Y \mid X = x)$$

Note that if $h^\star(x) = 1$, then we must have $\eta(x) \ge 1 - \eta(x)$

Similarly, if $h^\star(x) = 0$, then $1 - \eta(x) \ge \eta(x)$

Since $h^\star$ selects the label that maximizes $\mathbb{P}(Y = y \mid X = x)$, we thus have that
$$\mathbb{P}(h^\star(X) \neq Y \mid X = x) = \min\{\eta(x),\, 1 - \eta(x)\}$$


Risk of the nearest neighbor classifier
Now consider the risk of the nearest neighbor classifier $h_{\mathrm{NN}}$ conditioned on $X = x$

Note that for a fixed $x$, we are treating $Y$ as random

Here we will further treat the prediction $h_{\mathrm{NN}}(x)$ as being random, since it depends on the training dataset

Thus we have that
$$\mathbb{P}(h_{\mathrm{NN}}(x) \neq Y \mid X = x) = \mathbb{P}(Y_{(1)} \neq Y \mid X = x)$$
where $Y_{(1)}$ denotes the label of the training point nearest to $x$


Risk of the nearest neighbor classifier

Note that if $X_{(1)}$ is the nearest neighbor to $x$, then its label $Y_{(1)}$ equals 1 with probability $\eta(X_{(1)})$ (conditioned on $X_{(1)}$)

Thus, we can write
$$\mathbb{P}(h_{\mathrm{NN}}(x) \neq Y \mid X = x) = \mathbb{E}\Big[\eta(x)\big(1 - \eta(X_{(1)})\big) + \big(1 - \eta(x)\big)\,\eta(X_{(1)})\Big]$$


Intuition from asymptotics
In the limit as $n \to \infty$, we can assume that $X_{(1)} \to x$, and hence (for reasonable distributions) $\eta(X_{(1)}) \to \eta(x)$

Thus, as $n \to \infty$ we have
$$\mathbb{P}(h_{\mathrm{NN}}(x) \neq Y \mid X = x) \to 2\,\eta(x)\big(1 - \eta(x)\big)$$

It is easy to see that
$$2\,\eta(x)\big(1 - \eta(x)\big) \le 2\min\{\eta(x),\, 1 - \eta(x)\}$$
since the larger of $\eta(x)$ and $1 - \eta(x)$ is at most 1

Asymptotically, the risk of the nearest neighbor classifier is at most twice the Bayes risk
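Taking expectations over $X$ (and assuming the limit can be exchanged with the expectation) turns this pointwise bound into a bound on the overall risk:
$$\lim_{n \to \infty} R(h_{\mathrm{NN}}) = \mathbb{E}\big[2\,\eta(X)\big(1 - \eta(X)\big)\big] \le 2\,\mathbb{E}\big[\min\{\eta(X),\, 1 - \eta(X)\}\big] = 2 R^\star$$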
k-nearest neighbors
We can drive the factor of 2 in this result down to 1 by generalizing the nearest neighbor rule to the $k$-nearest neighbor rule as follows:
Assign a label to $x$ by taking a majority vote over the $k$ training points closest to $x$
How do we define this more mathematically?
Let $\mathcal{N}_k(x)$ denote the indices of the $k$ training points closest to $x$
If the labels are $y_i \in \{-1, +1\}$, then we can write the $k$-nearest neighbor classifier as
$$h_k(x) = \mathrm{sign}\left(\sum_{i \in \mathcal{N}_k(x)} y_i\right)$$
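A sketch of this rule, continuing the NumPy example from before (again assuming labels in $\{-1, +1\}$ and Euclidean distance; choosing $k$ odd avoids ties in the vote):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """k-nearest neighbor prediction for a single point x, with labels in {-1, +1}."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances from x to all training points
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return np.sign(y_train[nearest].sum())        # majority vote via the sign of the sum
```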
Example: [a sequence of figures illustrating the k-nearest neighbor classifier]
Choosing k: Practice
Setting the parameter $k$ is a problem of model selection

Setting $k$ by trying to minimize the training error is a particularly bad idea

What is the training error when $k = 1$?

No matter what, we always have a training error of zero when $k = 1$, since each training point is its own nearest neighbor

Not much practical guidance from the theory, so we typically must rely on estimates based on holdout sets or more sophisticated model selection techniques
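For instance, a simple holdout-based selection of $k$ might look like the following sketch (the candidate grid and helper names are arbitrary choices, not part of the lecture):

```python
import numpy as np

def knn_predict_all(X_train, y_train, X_eval, k):
    """Predict {-1, +1} labels for every row of X_eval using k-nearest neighbors."""
    preds = np.empty(len(X_eval))
    for i, x in enumerate(X_eval):
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        preds[i] = np.sign(y_train[nearest].sum())
    return preds

def choose_k_holdout(X_train, y_train, X_val, y_val, candidate_ks=(1, 3, 5, 9, 15, 25)):
    """Return the candidate k with the lowest error on the held-out validation set."""
    errors = [np.mean(knn_predict_all(X_train, y_train, X_val, k) != y_val)
              for k in candidate_ks]
    return candidate_ks[int(np.argmin(errors))]
```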
Choosing k: Theory
Using a similar argument as before, one can show that the gap between the asymptotic risk of the $k$-nearest neighbor classifier and the Bayes risk shrinks on the order of $1/\sqrt{k}$

Thus, by letting $k \to \infty$ and $k/n \to 0$ as $n \to \infty$, we can (asymptotically) expect to perform arbitrarily close to the Bayes risk

This is known as universal consistency: given enough data, the algorithm will eventually converge to a classifier that matches the Bayes risk
Summary
Given enough data, the $k$-nearest neighbor classifier will do just as well as pretty much any other method

Catch
• The amount of required data can be huge, especially if our feature space is high-dimensional
• The parameter $k$ can matter a lot, so model selection can be very important
• Finding the nearest neighbors out of a set of millions of examples is still pretty hard
– can be sped up using k-d trees (see the sketch below), but can still be relatively expensive to apply
– in contrast, many of the other algorithms we will study have an expensive “training” phase, but application is cheap
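A sketch of that speed-up using SciPy's k-d tree (the data here is synthetic, purely to make the snippet self-contained):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 5))        # synthetic training features
y_train = rng.choice([-1, 1], size=100_000)    # synthetic {-1, +1} labels

tree = cKDTree(X_train)                        # build once; queries are then fast
x = rng.normal(size=5)                         # a query point
_, idx = tree.query(x, k=15)                   # indices of the 15 nearest neighbors
prediction = np.sign(y_train[idx].sum())       # majority vote over those neighbors
```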
