03-bayes-nearest-neighbors

The document discusses empirical risk minimization (ERM) and the fundamental tradeoff between the richness of hypothesis sets and the guarantees of error minimization. It introduces the Bayes classifier, which minimizes risk based on known distributions, and explores the nearest neighbor classifier as a simple learning algorithm. Additionally, it emphasizes the importance of model selection and the challenges of applying these methods in high-dimensional spaces.

Mark Davenport

Empirical risk minimization (ERM)

Recall the definitions of the risk and the empirical risk:
$$R(h) = \mathbb{P}(h(X) \neq Y) \qquad \widehat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{h(x_i) \neq y_i\}$$

Ideally, we would like to choose
$$h^\star = \arg\min_{h} R(h)$$

Since we cannot compute $R(h)$, instead we choose
$$\widehat{h}_n = \arg\min_{h \in \mathcal{H}} \widehat{R}_n(h)$$

This makes sense if $\mathcal{H}$ is not too large, so that $\widehat{R}_n(h) \approx R(h)$ for all $h \in \mathcal{H}$

Unfortunately, we also want $\mathcal{H}$ to be large so that $\min_{h \in \mathcal{H}} R(h)$ can be as small as possible…
Fundamental tradeoff
More hypotheses ultimately sacrifices our guarantee that $\widehat{R}_n(h) \approx R(h)$ for all $h \in \mathcal{H}$

[Figure: error plotted against the “richness” of the hypothesis set]
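One standard way to make this tradeoff explicit is the following decomposition of the excess risk (here $\inf_{h}$ ranges over all classifiers):
$$R(\widehat{h}_n) - \inf_{h} R(h) = \underbrace{\Big(R(\widehat{h}_n) - \inf_{h \in \mathcal{H}} R(h)\Big)}_{\text{estimation error}} + \underbrace{\Big(\inf_{h \in \mathcal{H}} R(h) - \inf_{h} R(h)\Big)}_{\text{approximation error}}$$
A richer $\mathcal{H}$ shrinks the approximation error but tends to inflate the estimation error, and vice versa.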


What is a good hypothesis?
Ideally, we would like to have a small number of hypotheses, so that $\widehat{R}_n(h) \approx R(h)$ for all $h \in \mathcal{H}$, while also being lucky enough to have some $h \in \mathcal{H}$ with $R(h) \approx 0$

In general, this may not be possible

There may not be any function $h$ with $R(h) = 0$

Why not?

Noise: the label $Y$ is typically not a deterministic function of $X$

Suppose we knew the joint distribution of our data $(X, Y)$

– what is the optimal classification rule $h$?
– what are the fundamental limits on how small $R(h)$ can be?
Known distribution case
Consider $(X, Y)$ where
• $X$ is a random vector in $\mathbb{R}^d$
• $Y$ is a random variable taking values in a finite set of labels (depending on $X$)

Let $h$ be a classifier with probability of error/risk given by
$$R(h) = \mathbb{P}(h(X) \neq Y)$$

Our goal is to formulate a simple rule for minimizing $R(h)$ when the joint distribution of $(X, Y)$ is known

We will let $P_{XY}$ denote this joint distribution of $(X, Y)$


The joint distribution
For any set $A$ and any label $y$, $P_{XY}$ gives us a way to compute the probability that a randomly drawn $(X, Y)$ will satisfy $X \in A$ and $Y = y$

Conditioning on $X = x$ results in a conditional distribution on the class labels, known as the a posteriori distribution:
$$\mathbb{P}(Y = y \mid X = x)$$

Conditioning on $Y = y$ results in the class conditional distribution:
$$P_{X|Y}(x \mid y)$$


Factoring the joint distribution
It is often useful to think about the joint distribution in terms of these conditional distributions

For any fixed $(x, y)$ we can write
$$P_{XY}(x, y) = \mathbb{P}(Y = y \mid X = x)\, P_X(x)$$
or
$$P_{XY}(x, y) = P_{X|Y}(x \mid y)\, \mathbb{P}(Y = y)$$

Both ways of thinking will be useful!


The Bayes classifier
Theorem
The classifier
$$h^\star(x) = \arg\max_{y} \mathbb{P}(Y = y \mid X = x)$$
satisfies $R(h^\star) \le R(h)$ for any possible classifier $h$

Note: $h$ is not restricted to any particular set $\mathcal{H}$, and hence we will have $R(h^\star) \le \min_{h \in \mathcal{H}} R(h)$ for any $\mathcal{H}$

Terminology:
• $h^\star$ is called a Bayes classifier
• $R^\star = R(h^\star)$ is called the Bayes risk
Proof
For convenience, assume $X$ is a continuous random variable with class conditional densities $f_y(x)$

Let $\pi_y = \mathbb{P}(Y = y)$ denote the a priori class probabilities

Consider an arbitrary classifier $h$. Denote the decision regions by $\Gamma_y = \{x : h(x) = y\}$


Proof (Part 2)
We can write
$$1 - R(h) = \mathbb{P}(h(X) = Y) = \sum_{y} \pi_y \int_{\Gamma_y} f_y(x)\, dx = \int \pi_{h(x)}\, f_{h(x)}(x)\, dx$$

Since we want to maximize this expression, we should design our classifier such that, for every $x$, the integrand $\pi_{h(x)}\, f_{h(x)}(x)$ is maximal
Proof (Part 3)
Therefore, the optimal $h^\star$ has
$$h^\star(x) = \arg\max_{y} \pi_y\, f_y(x) = \arg\max_{y} \mathbb{P}(Y = y \mid X = x)$$
where the second equality holds because $\mathbb{P}(Y = y \mid X = x) = \pi_y f_y(x) / f_X(x)$ and the denominator does not depend on $y$. Bayes rule!

Note that in addition to our rigorous derivation, this classifier also coincides with “common sense”
Variations
Different ways of expressing the Bayes classifier (binary case):

• In general, declare $\widehat{y} = 1$ when $\pi_1 f_1(x) \ge \pi_0 f_0(x)$, i.e., when
$$\frac{f_1(x)}{f_0(x)} \ge \frac{\pi_0}{\pi_1}$$
(a likelihood ratio test)

• When $\pi_0 = \pi_1$, this reduces to declaring $\widehat{y} = 1$ when $f_1(x) \ge f_0(x)$ (the maximum likelihood classifier/detector)
Example
Suppose that the priors $\pi_0, \pi_1$ and the class conditional densities $f_0, f_1$ are specified explicitly. Plugging them into the likelihood ratio test yields a concrete decision rule.

Example
How do we calculate the Bayes risk?

When the priors are equal, the test reduces to declaring $\widehat{y} = 1$ if and only if the likelihood ratio exceeds one, and the Bayes risk is the probability of error incurred by this rule
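As a concrete worked instance (an illustrative choice, not necessarily the one used in the original slides): suppose $\pi_0 = \pi_1 = \tfrac{1}{2}$ and the class conditional densities are scalar Gaussians $f_y = \mathcal{N}(\mu_y, \sigma^2)$ with $\mu_1 > \mu_0$. The likelihood ratio test then reduces to a simple threshold,
$$\frac{f_1(x)}{f_0(x)} \ge 1 \;\Longleftrightarrow\; x \ge \frac{\mu_0 + \mu_1}{2},$$
and the Bayes risk is the probability that a sample lands on the wrong side of this threshold:
$$R^\star = \mathbb{P}\!\left(\mathcal{N}(0, \sigma^2) > \frac{\mu_1 - \mu_0}{2}\right) = Q\!\left(\frac{\mu_1 - \mu_0}{2\sigma}\right),$$
where $Q$ denotes the standard Gaussian tail function.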


Alternative cost/loss functions
So far we have focused on minimizing the risk $R(h) = \mathbb{P}(h(X) \neq Y)$

There are many situations where this is not appropriate

• cost-sensitive classification
– type I/type II errors or misses/false alarms may have very different costs, in which case it may be desirable to instead minimize a weighted risk of the form
$$c_0\, \mathbb{P}(h(X) = 1,\, Y = 0) + c_1\, \mathbb{P}(h(X) = 0,\, Y = 1)$$
– alternatively, it may be better to focus on the two error probabilities directly à la Neyman-Pearson classification
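For the weighted risk above (with the generic costs $c_0$ and $c_1$ standing in for whatever the application dictates), the same argument used in the proof of the Bayes classifier shows that the optimal rule is simply a re-thresholded likelihood ratio test:
$$\text{declare } \widehat{y} = 1 \iff \frac{f_1(x)}{f_0(x)} \ge \frac{c_0\, \pi_0}{c_1\, \pi_1}$$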
Alternative cost/loss functions
So far we have focused on minimizing the risk $R(h) = \mathbb{P}(h(X) \neq Y)$

There are many situations where this is not appropriate

• unbalanced datasets
– when one class dominates the other, the probability of error will place less emphasis on the smaller class
– the class proportions in our dataset may not be representative of the “wild”
– one can use the same ideas as before, or alternatively simply minimize something like the balanced error rate
$$\tfrac{1}{2}\, \mathbb{P}(h(X) = 1 \mid Y = 0) + \tfrac{1}{2}\, \mathbb{P}(h(X) = 0 \mid Y = 1)$$
or another criterion that weights the two class conditional error probabilities equally
Fundamental tradeoff

[Figure: error plotted against the “richness” of the hypothesis set, with the Bayes risk shown as the floor that the error cannot drop below]

What about learning?
We have just seen that when we know the true distribution underlying our dataset,
solving the classification problem is straightforward

Can we get close when all we have is the data?

One natural approach is to use the data to estimate the distribution, and then just
plug this into the formula for the Bayes classifier
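As a rough sketch of what such a plugin method might look like (assuming, purely for illustration, Gaussian class conditional densities with diagonal covariance; the function and variable names are placeholders, not notation from the slides):

```python
import numpy as np

def gaussian_plugin_classifier(X_train, y_train):
    """Fit a diagonal-covariance Gaussian to each class, estimate the priors from
    class frequencies, and plug the estimates into the Bayes classification rule."""
    classes = np.unique(y_train)
    params = []
    for c in classes:
        Xc = X_train[y_train == c]
        pi_hat = len(Xc) / len(X_train)          # estimated prior
        mu_hat = Xc.mean(axis=0)                 # estimated mean
        var_hat = Xc.var(axis=0) + 1e-6          # estimated (diagonal) variance
        params.append((pi_hat, mu_hat, var_hat))

    def predict(x):
        # Score each class by log(prior) + log(estimated class conditional density)
        scores = [np.log(pi) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
                  for pi, mu, var in params]
        return classes[int(np.argmax(scores))]

    return predict
```

The quality of the resulting classifier depends entirely on how well the estimated distributions match the truth.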
Plugin methods

Before we get to these, we will first talk about what is quite possibly the absolute
simplest learning algorithm there is…
Nearest neighbor classifier
The nearest neighbor classifier is easiest to state in words:

Assign to $x$ the same label as the closest training point to $x$

The nearest neighbor rule defines a Voronoi partition of the input space
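A minimal sketch of this rule in Python/NumPy (assuming Euclidean distance and NumPy arrays; the array names are placeholders rather than notation from the slides):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Return the label of the training point closest to x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance from x to every training point
    return y_train[np.argmin(dists)]             # label of the closest one
```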
Risk of the nearest neighbor classifier
We will begin by restricting our attention to the binary case where $\mathcal{Y} = \{0, 1\}$, and we write $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$

Consider the Bayes risk conditioned on $X = x$:
$$\mathbb{P}(h^\star(X) \neq Y \mid X = x)$$

Note that if $h^\star(x) = 1$, then we must have $\eta(x) \ge 1 - \eta(x)$

Similarly, if $h^\star(x) = 0$, then $1 - \eta(x) \ge \eta(x)$

Since $h^\star$ selects the label that maximizes $\mathbb{P}(Y = y \mid X = x)$, we thus have that
$$\mathbb{P}(h^\star(X) \neq Y \mid X = x) = \min\{\eta(x),\, 1 - \eta(x)\}$$


Risk of the nearest neighbor classifier
Now consider the risk of the nearest neighbor classifier $h_{\mathrm{NN}}$ conditioned on $X = x$

Note that for a fixed $x$, we are treating $Y$ as random

Here we will further treat the prediction $h_{\mathrm{NN}}(x)$ as being random, since it depends on the training dataset

Thus we have that
$$\mathbb{P}(h_{\mathrm{NN}}(x) \neq Y \mid X = x) = \mathbb{P}(Y_{(1)} \neq Y \mid X = x)$$
where $Y_{(1)}$ denotes the label of the training point nearest to $x$


Risk of the nearest neighbor classifier

Note that if $X_{(1)}$ is the nearest neighbor to $x$, then its label $Y_{(1)}$ equals 1 with probability $\eta(X_{(1)})$ (conditioned on $X_{(1)}$)

Thus, we can write
$$\mathbb{P}(h_{\mathrm{NN}}(x) \neq Y \mid X = x) = \mathbb{E}\Big[\eta(x)\big(1 - \eta(X_{(1)})\big) + \big(1 - \eta(x)\big)\,\eta(X_{(1)})\Big]$$


Intuition from asymptotics
In the limit as $n \to \infty$, we can assume that $X_{(1)} \to x$, and hence (for reasonable distributions) $\eta(X_{(1)}) \to \eta(x)$

Thus, as $n \to \infty$ we have
$$\mathbb{P}(h_{\mathrm{NN}}(x) \neq Y \mid X = x) \to 2\,\eta(x)\big(1 - \eta(x)\big)$$

It is easy to see that
$$2\,\eta(x)\big(1 - \eta(x)\big) \le 2\min\{\eta(x),\, 1 - \eta(x)\}$$
since the larger of $\eta(x)$ and $1 - \eta(x)$ is at most 1

Asymptotically, the risk of the nearest neighbor classifier is at most twice the Bayes risk
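Taking expectations over $X$ (and assuming the limit can be exchanged with the expectation) turns this pointwise bound into a bound on the overall risk:
$$\lim_{n \to \infty} R(h_{\mathrm{NN}}) = \mathbb{E}\big[2\,\eta(X)\big(1 - \eta(X)\big)\big] \le 2\,\mathbb{E}\big[\min\{\eta(X),\, 1 - \eta(X)\}\big] = 2 R^\star$$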
k-nearest neighbors
We can drive the factor of 2 in this result down to 1 by generalizing the nearest neighbor rule to the $k$-nearest neighbor rule as follows:
Assign a label to $x$ by taking a majority vote over the $k$ training points closest to $x$
How do we define this more mathematically?
Let $\mathcal{N}_k(x)$ denote the indices of the $k$ training points closest to $x$
If the labels are $y_i \in \{-1, +1\}$, then we can write the $k$-nearest neighbor classifier as
$$h_k(x) = \mathrm{sign}\left(\sum_{i \in \mathcal{N}_k(x)} y_i\right)$$
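A sketch of this rule, continuing the NumPy example from before (again assuming labels in $\{-1, +1\}$ and Euclidean distance; choosing $k$ odd avoids ties in the vote):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """k-nearest neighbor prediction for a single point x, with labels in {-1, +1}."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances from x to all training points
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return np.sign(y_train[nearest].sum())        # majority vote via the sign of the sum
```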
Example: [a sequence of figures illustrating the k-nearest neighbor classifier]
Choosing k: Practice
Setting the parameter $k$ is a problem of model selection

Setting $k$ by trying to minimize the training error is a particularly bad idea

What is the training error when $k = 1$?

No matter what, we always have a training error of zero when $k = 1$, since each training point is its own nearest neighbor

Not much practical guidance from the theory, so we typically must rely on estimates based on holdout sets or more sophisticated model selection techniques
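For instance, a simple holdout-based selection of $k$ might look like the following sketch (the candidate grid and helper names are arbitrary choices, not part of the lecture):

```python
import numpy as np

def knn_predict_all(X_train, y_train, X_eval, k):
    """Predict {-1, +1} labels for every row of X_eval using k-nearest neighbors."""
    preds = np.empty(len(X_eval))
    for i, x in enumerate(X_eval):
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        preds[i] = np.sign(y_train[nearest].sum())
    return preds

def choose_k_holdout(X_train, y_train, X_val, y_val, candidate_ks=(1, 3, 5, 9, 15, 25)):
    """Return the candidate k with the lowest error on the held-out validation set."""
    errors = [np.mean(knn_predict_all(X_train, y_train, X_val, k) != y_val)
              for k in candidate_ks]
    return candidate_ks[int(np.argmin(errors))]
```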
Choosing k: Theory
Using a similar argument as before, one can show that the gap between the asymptotic risk of the $k$-nearest neighbor classifier and the Bayes risk shrinks on the order of $1/\sqrt{k}$

Thus, by letting $k \to \infty$ and $k/n \to 0$ as $n \to \infty$, we can (asymptotically) expect to perform arbitrarily close to the Bayes risk

This is known as universal consistency: given enough data, the algorithm will eventually converge to a classifier that matches the Bayes risk
Summary
Given enough data, the $k$-nearest neighbor classifier will do just as well as pretty much any other method

Catch
• The amount of required data can be huge, especially if our feature space is high-dimensional
• The parameter $k$ can matter a lot, so model selection can be very important
• Finding the nearest neighbors out of a set of millions of examples is still pretty hard
– can be sped up using k-d trees (see the sketch below), but can still be relatively expensive to apply
– in contrast, many of the other algorithms we will study have an expensive “training” phase, but application is cheap
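A sketch of that speed-up using SciPy's k-d tree (the data here is synthetic, purely to make the snippet self-contained):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 5))        # synthetic training features
y_train = rng.choice([-1, 1], size=100_000)    # synthetic {-1, +1} labels

tree = cKDTree(X_train)                        # build once; queries are then fast
x = rng.normal(size=5)                         # a query point
_, idx = tree.query(x, k=15)                   # indices of the 15 nearest neighbors
prediction = np.sign(y_train[idx].sum())       # majority vote over those neighbors
```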
