Empirical Risk Minimization

This document discusses empirical risk minimization (ERM) for statistical machine learning. ERM finds a strategy h that minimizes the empirical risk computed from training examples, as a proxy for the true but unknown expected risk. While the empirical risk converges to the expected risk for a fixed h, this may not hold for the learned strategy h_m, requiring a uniform law of large numbers over the hypothesis space H. The document provides an example where ERM fails and derives a generalization bound showing the relationship between empirical risk, expected risk, and hypothesis space size for a finite H.


Statistical Machine Learning (BE4M33SSU)

Lecture 3: Empirical Risk Minimization


Czech Technical University in Prague
V. Franc

BE4M33SSU – Statistical Machine Learning, Winter 2022


Learning
2/12
The goal: Find a strategy h : X → Y minimizing R(h) using the
training set of examples

T^m = {(x^i, y^i) ∈ (X × Y) | i = 1, . . . , m}

drawn from i.i.d. random variables with unknown p(x, y).

Hypothesis class (space):

H ⊆ Y^X = {h : X → Y}

Learning algorithm: a function

A : ∪_{m=1}^∞ (X × Y)^m → H

which returns a strategy h_m = A(T^m) for a training set T^m


Learning: Empirical Risk Minimization approach
3/12
The expected risk R(h), i.e. the true but unknown objective, is replaced
by the empirical risk computed from the training examples T^m,

R_{T^m}(h) = (1/m) ∑_{i=1}^m ℓ(y^i, h(x^i))

The ERM-based algorithm returns h_m such that

h_m ∈ Argmin_{h∈H} R_{T^m}(h)     (1)

Depending on the choice of H and ℓ and the algorithm solving (1) we get
individual instances, e.g. Support Vector Machines, Linear Regression,
Logistic Regression, Neural Networks learned by back-propagation,
AdaBoost, Gradient Boosted Trees, ...
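A minimal sketch of the ERM rule (1) in Python, assuming a small finite hypothesis class of threshold classifiers on X = R and the 0/1 loss; the data, grid of thresholds and helper names below are illustrative choices, not part of the lecture.

# Sketch of ERM: enumerate a small finite hypothesis class and return a
# minimizer of the empirical risk on the training set T^m.
# Illustrative assumptions: X = R, Y = {-1, +1}, thresholds h(x) = sign(x - t), 0/1 loss.

import numpy as np

def zero_one_loss(y, y_pred):
    return float(y != y_pred)

def empirical_risk(h, T):
    """Average loss of hypothesis h on the training set T = [(x, y), ...]."""
    return sum(zero_one_loss(y, h(x)) for x, y in T) / len(T)

def erm(hypotheses, T):
    """Return a hypothesis from the finite class minimizing the empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, T))

# Finite hypothesis class: thresholds on a fixed grid.
thresholds = np.linspace(-1.0, 1.0, 21)
H = [lambda x, t=t: 1 if x - t >= 0 else -1 for t in thresholds]

# Toy training set drawn from a distribution unknown to the learner.
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=200)
ys = np.where(xs > 0.3, 1, -1)          # "true" labelling rule, for illustration
T = list(zip(xs, ys))

h_m = erm(H, T)
print("empirical risk of h_m:", empirical_risk(h_m, T))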
Example of ERM failure
4/12
Let X = [a, b] ⊂ R, Y = {+1, −1}, ℓ(y, y′) = [[y ≠ y′]], let p(x | y = +1)
and p(x | y = −1) be uniform distributions on X, and let p(y = +1) = 0.8.

The optimal strategy is h(x) = +1 with the Bayes risk R∗ = 0.2.




Consider a learning algorithm which for a given training set
T^m = {(x^1, y^1), . . . , (x^m, y^m)} returns the memorizing strategy

h_m(x) = y^j   if x = x^j for some j ∈ {1, . . . , m}
h_m(x) = −1    otherwise

The empirical risk is R_{T^m}(h_m) = 0 with probability 1 for any m.

The expected risk is R(h_m) = 0.8 for any m.
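A small Python simulation of this failure mode (a sketch; the choice X = [0, 1], the sample sizes and the dictionary-based memorizer are illustrative, not from the lecture): the memorizing strategy attains zero empirical risk while its risk on fresh data stays near 0.8.

# Sketch of the ERM-failure example: a memorizing strategy with zero empirical
# risk but expected risk about 0.8. Illustrative choices: X = [0, 1], 0/1 loss.

import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m examples: p(y = +1) = 0.8, x | y uniform on [0, 1]."""
    y = np.where(rng.random(m) < 0.8, 1, -1)
    x = rng.random(m)
    return x, y

m = 1000
x_train, y_train = sample(m)

memory = dict(zip(x_train, y_train))                 # the memorizing strategy h_m
h_m = lambda x: memory.get(x, -1)                    # predicts -1 on every unseen point

emp_risk = np.mean([h_m(x) != y for x, y in zip(x_train, y_train)])

x_test, y_test = sample(100_000)                     # fresh i.i.d. data
true_risk = np.mean([h_m(x) != y for x, y in zip(x_test, y_test)])

print(f"empirical risk: {emp_risk:.3f}")             # 0.000
print(f"estimated expected risk: {true_risk:.3f}")   # about 0.800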



Wrap up of the previous lecture
5/12
We use the empirical risk R_{S^l}(h) = (1/l) ∑_{i=1}^l ℓ(y^i, h(x^i)) as a proxy of the
true risk R(h) = E_{x,y∼p}[ℓ(y, h(x))].

In case of evaluation, h is fixed and, due to the law of large numbers,
R_{S^l}(h) gets close to R(h) if we have enough examples:

P( |R_{S^l}(h) − R(h)| ≥ ε ) ≤ 2 exp( −2 l ε² / (ℓ_max − ℓ_min)² )

We say that R_{S^l}(h) converges in probability to R(h), i.e.

∀ ε > 0 : lim_{l→∞} P( |R_{S^l}(h) − R(h)| ≥ ε ) = 0
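A quick numerical illustration of the evaluation-case bound (a sketch; the values l ∈ {100, 1000, 10000}, ε = 0.05 and the 0/1 loss with ℓ_max − ℓ_min = 1 are illustrative):

# Evaluate the Hoeffding bound 2 * exp(-2 * l * eps^2 / (l_max - l_min)^2)
# for a fixed hypothesis h. Illustrative values: 0/1 loss (range 1).

import math

def hoeffding_bound(l, eps, loss_range=1.0):
    """Upper bound on P(|R_Sl(h) - R(h)| >= eps) for a fixed h."""
    return 2.0 * math.exp(-2.0 * l * eps**2 / loss_range**2)

for l in (100, 1000, 10_000):
    print(l, hoeffding_bound(l, eps=0.05))
# The bound decays exponentially in l: roughly 1.21, 0.0135, 3.9e-22.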

In case of learning, h_m = A(T^m) is learned from T^m, and then R_{T^m}(h_m) does
not have to get close to R(h_m) even if we have enough examples:

∀ ε > 0 : lim_{m→∞} P( |R_{T^m}(h_m) − R(h_m)| ≥ ε ) ≠ 0
Why does the law of large numbers not apply for learning?
6/12
The Hoeffding inequality P( |µ̂ − µ| ≥ ε ) ≤ 2 exp( −2 m ε² / (b − a)² ), where
µ̂ = (1/m) ∑_{i=1}^m z^i, requires {z^1, . . . , z^m} to be a sample from i.i.d.
random variables with values in [a, b] and expected value µ.

T^m = {(x^1, y^1), . . . , (x^m, y^m)} is drawn from i.i.d. random variables with p(x, y).


Evaluation:
h is fixed independently of T^m, z^i = ℓ(y^i, h(x^i)), and {z^1, . . . , z^m} is i.i.d.

Therefore ∀ ε > 0 : lim_{m→∞} P( |R_{T^m}(h) − R(h)| ≥ ε ) = 0




Learning:
h_m = A(T^m), z^i = ℓ(y^i, h_m(x^i)), and thus {z^1, . . . , z^m} is not i.i.d.
(each z^i depends on the whole training set through h_m).

There is no guarantee that ∀ ε > 0 : lim_{m→∞} P( |R_{T^m}(h_m) − R(h_m)| ≥ ε ) = 0.

The task for the rest of the lecture is to show how to fix this.

To fix the problem we need a uniform law of large numbers
7/12
   
P( R(h_m) − R_{T^m}(h_m) ≥ ε ) ≤ P( sup_{h∈H} ( R(h) − R_{T^m}(h) ) ≥ ε ) ≤ B(m, H, ε)

H = {h(x) = sign(x − θ) | θ ∈ R}, ℓ(y, y′) = [[y ≠ y′]]

[Figure: R(h) and R_{T^m}(h) as functions of the threshold θ for m = 1000, marking
the ERM solution h_m with its empirical risk R_{T^m}(h_m) and expected risk R(h_m).]
Uniform Law of Large Numbers
8/12
Law of Large Numbers: for any p(x, y) generating T^m, and h ∈ H fixed
without using T^m, we have

∀ ε > 0 : lim_{m→∞} P( R(h) − R_{T^m}(h) ≥ ε ) = 0

(the event inside P(·): the empirical risk fails for h)

Uniform Law of Large Numbers: if for any p(x, y) generating T^m it holds
that

∀ ε > 0 : lim_{m→∞} P( sup_{h∈H} ( R(h) − R_{T^m}(h) ) ≥ ε ) = 0

(the event inside P(·): the empirical risk fails for some h ∈ H)

we say that the ULLN applies for H.


Alternatively we say: the empirical risk converges uniformly to the true


risk, or that the hypothesis class H has the uniform convergence property.
ULLN applies for finite hypothesis class
9/12
Assume a finite hypothesis class H = {h_1, . . . , h_K}.


Define the set of all “bad” training sets for a strategy h ∈ H as

B(h) = { T^m ∈ (X × Y)^m : |R_{T^m}(h) − R(h)| ≥ ε }

Hoeffding inequality generalized for a finite hypothesis class H (via the union bound):

P( max_{h∈H} |R_{T^m}(h) − R(h)| ≥ ε ) ≤ ∑_{h∈H} P( T^m ∈ B(h) ) ≤ 2 |H| exp( −2 m ε² / (b − a)² )

ULLN applies for a finite hypothesis class:

∀ ε > 0 : lim_{m→∞} P( max_{h∈H} |R_{T^m}(h) − R(h)| ≥ ε ) = 0
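Spelling out the two steps behind the bound above as a short derivation, consistent with the definitions on this slide:

\begin{align*}
P\Big(\max_{h \in H} |R_{T^m}(h) - R(h)| \ge \varepsilon\Big)
  &= P\Big(\bigcup_{h \in H} \big\{ T^m \in B(h) \big\}\Big)
     && \text{(some $h$ is bad = union of the bad-set events)} \\
  &\le \sum_{h \in H} P\big(T^m \in B(h)\big)
     && \text{(union bound)} \\
  &\le 2\,|H|\, e^{-2 m \varepsilon^2 / (b-a)^2}
     && \text{(Hoeffding inequality for each fixed $h$)}
\end{align*}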
Generalization bound for finite hypothesis class
10/12
Hoeffding inequality generalized for a finite hypothesis class H:

P( max_{h∈H} |R_{T^m}(h) − R(h)| ≥ ε ) ≤ 2 |H| exp( −2 m ε² / (b − a)² )

Find an upper bound ε on the discrepancy between R_{T^m}(h) and R(h)
which holds uniformly for all h ∈ H with probability at least 1 − δ:


   
P( max_{h∈H} |R_{T^m}(h) − R(h)| < ε ) = 1 − P( max_{h∈H} |R_{T^m}(h) − R(h)| ≥ ε )
                                       ≥ 1 − 2 |H| exp( −2 m ε² / (b − a)² ) = 1 − δ

and solving the last equality for ε yields

ε = (b − a) √( (log 2|H| + log(1/δ)) / (2m) )
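A small helper to evaluate this bound numerically (a sketch; the example values m = 10000, |H| = 1000, δ = 0.05 and the 0/1 loss with b − a = 1 are illustrative choices, not from the lecture):

# Evaluate epsilon = (b - a) * sqrt((log(2|H|) + log(1/delta)) / (2m))
# for a finite hypothesis class. Illustrative values only.

import math

def generalization_gap(m, num_hypotheses, delta, loss_range=1.0):
    """Bound on max_h |R_Tm(h) - R(h)| that holds with probability >= 1 - delta."""
    return loss_range * math.sqrt(
        (math.log(2 * num_hypotheses) + math.log(1.0 / delta)) / (2 * m))

print(generalization_gap(m=10_000, num_hypotheses=1_000, delta=0.05))   # about 0.023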
Generalization bound for finite hypothesis class
11/12

Theorem: Let T^m = {(x^1, y^1), . . . , (x^m, y^m)} ∈ (X × Y)^m be drawn from
i.i.d. random variables with p.d.f. p(x, y) and let H be a finite hypothesis
class. Then, for any 0 < δ < 1, with probability at least 1 − δ the inequality

R(h) ≤ R_{T^m}(h) + (b − a) √( (log 2|H| + log(1/δ)) / (2m) )
       (empirical risk)       (complexity term)

holds for all h ∈ H simultaneously and for any loss function ℓ : Y × Y → [a, b].
Recommendations that follow from the generalization bound:

1. Minimize the empirical risk.
2. Use as many training examples as possible.
3. Limit the size of the hypothesis space |H|.

Note that 1) and 3) are conflicting recommendations.
The generalization bound holds for any learning algorithm, not just ERM.

Structural Risk Minimization
12/12
Learn h : X → Y by minimizing the generalization bound

R(h) ≤ R_{T^m}(h) + (b − a) √( (log 2|H| + log(1/δ)) / (2m) )

where the second term is the complexity term ε(m, |H|, δ).

[Figure: R_{T^m}(h), the complexity term ε(m, |H|, δ) and the resulting bound on R(h)
over nested hypothesis classes H_1 ⊆ · · · ⊆ H_K with ERM solutions h_1, . . . , h_K;
the bound is minimized at h_{i*} from class H_{i*}.]
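A minimal Python sketch of this selection rule over a sequence of growing finite classes (illustrative; the threshold-classifier grids, data and δ below are not from the lecture, and the grids only play the role of the nested classes H_1, . . . , H_K): run ERM inside each class, add the complexity term, and keep the solution with the smallest bound.

# Sketch of structural risk minimization: pick the class (and its ERM solution)
# minimizing empirical risk + complexity term. Illustrative setup: growing
# classes of threshold classifiers, 0/1 loss (b - a = 1).

import math
import numpy as np

def empirical_risk(h, X, y):
    return np.mean([h(x) != t for x, t in zip(X, y)])

def complexity(m, class_size, delta, loss_range=1.0):
    return loss_range * math.sqrt(
        (math.log(2 * class_size) + math.log(1 / delta)) / (2 * m))

def make_class(num_thresholds):
    """Finite class of threshold classifiers sign(x - t) on a grid."""
    return [lambda x, t=t: 1 if x >= t else -1
            for t in np.linspace(-1.0, 1.0, num_thresholds)]

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=500)
y = np.where(X > 0.37, 1, -1)                      # "true" rule, unknown to the learner
delta = 0.05

best = None
for k in (5, 50, 500):                             # classes of increasing size
    H_k = make_class(k)
    h_k = min(H_k, key=lambda h: empirical_risk(h, X, y))   # ERM inside H_k
    bound = empirical_risk(h_k, X, y) + complexity(len(X), len(H_k), delta)
    if best is None or bound < best[0]:
        best = (bound, k, h_k)
    print(f"|H| = {k:3d}: bound on R(h) = {bound:.3f}")

print("selected class size:", best[1])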
