Empirical Risk Minimization

This document discusses empirical risk minimization (ERM) for statistical machine learning. ERM finds a strategy h that minimizes the empirical risk computed from training examples, as a proxy for the true but unknown expected risk. While the empirical risk converges to the expected risk for a fixed h, this may not hold for the learned strategy h_m, requiring a uniform law of large numbers over the hypothesis space H. The document provides an example where ERM fails and derives a generalization bound showing the relationship between empirical risk, expected risk, and hypothesis space size for a finite H.


Statistical Machine Learning (BE4M33SSU)

Lecture 3: Empirical Risk Minimization


Czech Technical University in Prague
V. Franc

BE4M33SSU – Statistical Machine Learning, Winter 2022


Learning
2/12
The goal: Find a strategy h : X → Y minimizing R(h) using the
training set of examples

T^m = {(x^i, y^i) ∈ (X × Y) | i = 1, . . . , m}

drawn from i.i.d. random variables with unknown p(x, y).

Hypothesis class (space):

H ⊆ Y^X = {h : X → Y}

Learning algorithm: a function

A : ∪_{m=1}^∞ (X × Y)^m → H

which returns a strategy h_m = A(T^m) for a training set T^m


Learning: Empirical Risk Minimization approach
3/12
The expected risk R(h), i.e. the true but unknown objective, is replaced
by the empirical risk computed from the training examples T^m,

R_{T^m}(h) = (1/m) ∑_{i=1}^m ℓ(y^i, h(x^i))

The ERM-based algorithm returns h_m such that

h_m ∈ Argmin_{h∈H} R_{T^m}(h)     (1)

Depending on the choice of H and ℓ and the algorithm solving (1) we get
individual instances, e.g. Support Vector Machines, Linear Regression,
Logistic Regression, Neural Networks learned by back-propagation,
AdaBoost, Gradient Boosted Trees, ...
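A minimal sketch of the ERM rule (1) in Python, assuming a small finite hypothesis class of threshold classifiers on X = R and the 0/1 loss; the data, grid of thresholds and helper names below are illustrative choices, not part of the lecture.

# Sketch of ERM: enumerate a small finite hypothesis class and return a
# minimizer of the empirical risk on the training set T^m.
# Illustrative assumptions: X = R, Y = {-1, +1}, thresholds h(x) = sign(x - t), 0/1 loss.

import numpy as np

def zero_one_loss(y, y_pred):
    return float(y != y_pred)

def empirical_risk(h, T):
    """Average loss of hypothesis h on the training set T = [(x, y), ...]."""
    return sum(zero_one_loss(y, h(x)) for x, y in T) / len(T)

def erm(hypotheses, T):
    """Return a hypothesis from the finite class minimizing the empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, T))

# Finite hypothesis class: thresholds on a fixed grid.
thresholds = np.linspace(-1.0, 1.0, 21)
H = [lambda x, t=t: 1 if x - t >= 0 else -1 for t in thresholds]

# Toy training set drawn from a distribution unknown to the learner.
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=200)
ys = np.where(xs > 0.3, 1, -1)          # "true" labelling rule, for illustration
T = list(zip(xs, ys))

h_m = erm(H, T)
print("empirical risk of h_m:", empirical_risk(h_m, T))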
Example of ERM failure
4/12
Let X = [a, b] ⊂ R, Y = {+1, −1}, ℓ(y, y′) = [[y ≠ y′]], let p(x | y = +1)
and p(x | y = −1) be uniform distributions on X, and let p(y = +1) = 0.8.

The optimal strategy is h(x) = +1 with the Bayes risk R∗ = 0.2.




Consider a learning algorithm which for a given training set
T^m = {(x^1, y^1), . . . , (x^m, y^m)} returns the memorizing strategy

h_m(x) = y^j   if x = x^j for some j ∈ {1, . . . , m}
h_m(x) = −1    otherwise

The empirical risk is R_{T^m}(h_m) = 0 with probability 1 for any m.

The expected risk is R(h_m) = 0.8 for any m.
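A small Python simulation of this failure mode (a sketch; the choice X = [0, 1], the sample sizes and the dictionary-based memorizer are illustrative, not from the lecture): the memorizing strategy attains zero empirical risk while its risk on fresh data stays near 0.8.

# Sketch of the ERM-failure example: a memorizing strategy with zero empirical
# risk but expected risk about 0.8. Illustrative choices: X = [0, 1], 0/1 loss.

import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m examples: p(y = +1) = 0.8, x | y uniform on [0, 1]."""
    y = np.where(rng.random(m) < 0.8, 1, -1)
    x = rng.random(m)
    return x, y

m = 1000
x_train, y_train = sample(m)

memory = dict(zip(x_train, y_train))                 # the memorizing strategy h_m
h_m = lambda x: memory.get(x, -1)                    # predicts -1 on every unseen point

emp_risk = np.mean([h_m(x) != y for x, y in zip(x_train, y_train)])

x_test, y_test = sample(100_000)                     # fresh i.i.d. data
true_risk = np.mean([h_m(x) != y for x, y in zip(x_test, y_test)])

print(f"empirical risk: {emp_risk:.3f}")             # 0.000
print(f"estimated expected risk: {true_risk:.3f}")   # about 0.800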



Wrap up of the previous lecture
5/12
We use the empirical risk R_{S^l}(h) = (1/l) ∑_{i=1}^l ℓ(y^i, h(x^i)) as a proxy of the
true risk R(h) = E_{x,y∼p}[ℓ(y, h(x))].

In case of evaluation, h is fixed and, due to the law of large numbers,
R_{S^l}(h) gets close to R(h) if we have enough examples:

P( |R_{S^l}(h) − R(h)| ≥ ε ) ≤ 2 exp( −2 l ε² / (ℓ_max − ℓ_min)² )

We say that R_{S^l}(h) converges in probability to R(h), i.e.

∀ ε > 0 : lim_{l→∞} P( |R_{S^l}(h) − R(h)| ≥ ε ) = 0
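A quick numerical illustration of the evaluation-case bound (a sketch; the values l ∈ {100, 1000, 10000}, ε = 0.05 and the 0/1 loss with ℓ_max − ℓ_min = 1 are illustrative):

# Evaluate the Hoeffding bound 2 * exp(-2 * l * eps^2 / (l_max - l_min)^2)
# for a fixed hypothesis h. Illustrative values: 0/1 loss (range 1).

import math

def hoeffding_bound(l, eps, loss_range=1.0):
    """Upper bound on P(|R_Sl(h) - R(h)| >= eps) for a fixed h."""
    return 2.0 * math.exp(-2.0 * l * eps**2 / loss_range**2)

for l in (100, 1000, 10_000):
    print(l, hoeffding_bound(l, eps=0.05))
# The bound decays exponentially in l: roughly 1.21, 0.0135, 3.9e-22.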

In case of learning, h_m = A(T^m) is learned from T^m, and then R_{T^m}(h_m) does
not have to get close to R(h_m) even if we have enough examples:

∀ ε > 0 : lim_{m→∞} P( |R_{T^m}(h_m) − R(h_m)| ≥ ε ) ≠ 0
Why does the law of large numbers not apply for learning?
6/12
The Hoeffding inequality P( |µ̂ − µ| ≥ ε ) ≤ 2 exp( −2 m ε² / (b − a)² ), where
µ̂ = (1/m) ∑_{i=1}^m z^i, requires {z^1, . . . , z^m} to be a sample from i.i.d.
random variables with values in [a, b] and expected value µ.

T^m = {(x^1, y^1), . . . , (x^m, y^m)} is drawn from i.i.d. random variables with p(x, y).


Evaluation:
h is fixed independently of T^m, z^i = ℓ(y^i, h(x^i)), and {z^1, . . . , z^m} is i.i.d.

Therefore ∀ ε > 0 : lim_{m→∞} P( |R_{T^m}(h) − R(h)| ≥ ε ) = 0




Learning:
h_m = A(T^m), z^i = ℓ(y^i, h_m(x^i)), and thus {z^1, . . . , z^m} is not i.i.d.
(each z^i depends on the whole training set through h_m).

There is no guarantee that ∀ ε > 0 : lim_{m→∞} P( |R_{T^m}(h_m) − R(h_m)| ≥ ε ) = 0.

The task for the rest of the lecture is to show how to fix this.

To fix the problem we need a uniform law of large numbers
7/12
   
P( R(h_m) − R_{T^m}(h_m) ≥ ε ) ≤ P( sup_{h∈H} ( R(h) − R_{T^m}(h) ) ≥ ε ) ≤ B(m, H, ε)

H = {h(x) = sign(x − θ) | θ ∈ R}, ℓ(y, y′) = [[y ≠ y′]]

[Figure: R(h) and R_{T^m}(h) as functions of the threshold θ for m = 1000, marking
the ERM solution h_m with its empirical risk R_{T^m}(h_m) and expected risk R(h_m).]
Uniform Law of Large Numbers
8/12
Law of Large Numbers: for any p(x, y) generating T^m, and h ∈ H fixed
without using T^m, we have

∀ ε > 0 : lim_{m→∞} P( R(h) − R_{T^m}(h) ≥ ε ) = 0

(the event inside P(·): the empirical risk fails for h)

Uniform Law of Large Numbers: if for any p(x, y) generating T^m it holds
that

∀ ε > 0 : lim_{m→∞} P( sup_{h∈H} ( R(h) − R_{T^m}(h) ) ≥ ε ) = 0

(the event inside P(·): the empirical risk fails for some h ∈ H)

we say that the ULLN applies for H.


Alternatively we say: the empirical risk converges uniformly to the true


risk, or that the hypothesis class H has the uniform convergence property.
ULLN applies for finite hypothesis class
9/12
Assume a finite hypothesis class H = {h_1, . . . , h_K}.


Define the set of all “bad” training sets for a strategy h ∈ H as

B(h) = { T^m ∈ (X × Y)^m : |R_{T^m}(h) − R(h)| ≥ ε }

Hoeffding inequality generalized for a finite hypothesis class H (via the union bound):

P( max_{h∈H} |R_{T^m}(h) − R(h)| ≥ ε ) ≤ ∑_{h∈H} P( T^m ∈ B(h) ) ≤ 2 |H| exp( −2 m ε² / (b − a)² )

ULLN applies for a finite hypothesis class:

∀ ε > 0 : lim_{m→∞} P( max_{h∈H} |R_{T^m}(h) − R(h)| ≥ ε ) = 0
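Spelling out the two steps behind the bound above as a short derivation, consistent with the definitions on this slide:

\begin{align*}
P\Big(\max_{h \in H} |R_{T^m}(h) - R(h)| \ge \varepsilon\Big)
  &= P\Big(\bigcup_{h \in H} \big\{ T^m \in B(h) \big\}\Big)
     && \text{(some $h$ is bad = union of the bad-set events)} \\
  &\le \sum_{h \in H} P\big(T^m \in B(h)\big)
     && \text{(union bound)} \\
  &\le 2\,|H|\, e^{-2 m \varepsilon^2 / (b-a)^2}
     && \text{(Hoeffding inequality for each fixed $h$)}
\end{align*}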
Generalization bound for finite hypothesis class
10/12
Hoeffding inequality generalized for a finite hypothesis class H:

P( max_{h∈H} |R_{T^m}(h) − R(h)| ≥ ε ) ≤ 2 |H| exp( −2 m ε² / (b − a)² )

Find an upper bound ε on the discrepancy between R_{T^m}(h) and R(h)
which holds uniformly for all h ∈ H with probability at least 1 − δ:


   
P( max_{h∈H} |R_{T^m}(h) − R(h)| < ε ) = 1 − P( max_{h∈H} |R_{T^m}(h) − R(h)| ≥ ε )
                                       ≥ 1 − 2 |H| exp( −2 m ε² / (b − a)² ) = 1 − δ

and solving the last equality for ε yields

ε = (b − a) √( (log 2|H| + log(1/δ)) / (2m) )
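A small helper to evaluate this bound numerically (a sketch; the example values m = 10000, |H| = 1000, δ = 0.05 and the 0/1 loss with b − a = 1 are illustrative choices, not from the lecture):

# Evaluate epsilon = (b - a) * sqrt((log(2|H|) + log(1/delta)) / (2m))
# for a finite hypothesis class. Illustrative values only.

import math

def generalization_gap(m, num_hypotheses, delta, loss_range=1.0):
    """Bound on max_h |R_Tm(h) - R(h)| that holds with probability >= 1 - delta."""
    return loss_range * math.sqrt(
        (math.log(2 * num_hypotheses) + math.log(1.0 / delta)) / (2 * m))

print(generalization_gap(m=10_000, num_hypotheses=1_000, delta=0.05))   # about 0.023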
Generalization bound for finite hypothesis class
11/12

Theorem: Let T^m = {(x^1, y^1), . . . , (x^m, y^m)} ∈ (X × Y)^m be drawn from
i.i.d. random variables with p.d.f. p(x, y) and let H be a finite hypothesis
class. Then, for any 0 < δ < 1, with probability at least 1 − δ the inequality

R(h) ≤ R_{T^m}(h) + (b − a) √( (log 2|H| + log(1/δ)) / (2m) )
       (empirical risk)       (complexity term)

holds for all h ∈ H simultaneously and for any loss function ℓ : Y × Y → [a, b].
Recommendations that follow from the generalization bound:

1. Minimize the empirical risk.
2. Use as many training examples as possible.
3. Limit the size of the hypothesis space |H|.

Note that 1) and 3) are conflicting recommendations.
The generalization bound holds for any learning algorithm, not just ERM.

Structural Risk Minimization
12/12
Learn h : X → Y by minimizing the generalization bound

R(h) ≤ R_{T^m}(h) + (b − a) √( (log 2|H| + log(1/δ)) / (2m) )

where the second term is the complexity term ε(m, |H|, δ).

[Figure: R_{T^m}(h), the complexity term ε(m, |H|, δ) and the resulting bound on R(h)
over nested hypothesis classes H_1 ⊆ · · · ⊆ H_K with ERM solutions h_1, . . . , h_K;
the bound is minimized at h_{i*} from class H_{i*}.]
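A minimal Python sketch of this selection rule over a sequence of growing finite classes (illustrative; the threshold-classifier grids, data and δ below are not from the lecture, and the grids only play the role of the nested classes H_1, . . . , H_K): run ERM inside each class, add the complexity term, and keep the solution with the smallest bound.

# Sketch of structural risk minimization: pick the class (and its ERM solution)
# minimizing empirical risk + complexity term. Illustrative setup: growing
# classes of threshold classifiers, 0/1 loss (b - a = 1).

import math
import numpy as np

def empirical_risk(h, X, y):
    return np.mean([h(x) != t for x, t in zip(X, y)])

def complexity(m, class_size, delta, loss_range=1.0):
    return loss_range * math.sqrt(
        (math.log(2 * class_size) + math.log(1 / delta)) / (2 * m))

def make_class(num_thresholds):
    """Finite class of threshold classifiers sign(x - t) on a grid."""
    return [lambda x, t=t: 1 if x >= t else -1
            for t in np.linspace(-1.0, 1.0, num_thresholds)]

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=500)
y = np.where(X > 0.37, 1, -1)                      # "true" rule, unknown to the learner
delta = 0.05

best = None
for k in (5, 50, 500):                             # classes of increasing size
    H_k = make_class(k)
    h_k = min(H_k, key=lambda h: empirical_risk(h, X, y))   # ERM inside H_k
    bound = empirical_risk(h_k, X, y) + complexity(len(X), len(H_k), delta)
    if best is None or bound < best[0]:
        best = (bound, k, h_k)
    print(f"|H| = {k:3d}: bound on R(h) = {bound:.3f}")

print("selected class size:", best[1])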
