SML_Lecture3
Recall: PAC learning of a finite hypothesis class
$$m \ge \frac{1}{\epsilon}\left(\log(|H|) + \log\left(\frac{1}{\delta}\right)\right)$$
• An equivalent generalization error bound:
$$R(h) \le \frac{1}{m}\left(\log(|H|) + \log\left(\frac{1}{\delta}\right)\right)$$
• Holds for any finite hypothesis class, assuming there is a consistent hypothesis, i.e. one with zero empirical risk
• The extra term compared to the rectangle learning example is $\frac{1}{\epsilon}\log(|H|)$
• The more hypotheses there are in H, the more training examples are needed, as the sketch below illustrates
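As a quick numerical illustration, here is a minimal Python sketch of the sample-complexity bound above; the function name and the example values are our own:

```python
import math

def pac_sample_complexity(H_size, epsilon, delta):
    # m >= (1/eps) * (log|H| + log(1/delta)) for a consistent learner
    # over a finite hypothesis class
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / epsilon)

# 1000 hypotheses, target error 5%, confidence 95%
print(pac_sample_complexity(1000, epsilon=0.05, delta=0.05))  # -> 199
```

Doubling |H| adds only log 2 ≈ 0.69 to the numerator, reflecting the logarithmic dependence on the size of the class.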
Learning with infinite hypothesis classes
Vapnik-Chervonenkis dimension
Intuition
Shattering
Figure source: https://datascience.stackexchange.com
How to show that VCdim(H) = d
Example: intervals on the real line
Lines in R²
VC-dimension of axis-aligned rectangles
• For five distinct points, consider the minimum bounding box of the points
• There are two possible configurations:
  1. One or more points lie in the interior of the box: then no axis-aligned rectangle can include all the points on the boundary while excluding a point in the interior
  2. At least one edge of the box contains two points: pick either of the two points and verify that it cannot be excluded while all the other points are included
• Thus no set of five points can be shattered; combined with a set of four points that is shattered, this establishes that VCdim(H) = 4 (a brute-force shattering check is sketched below)
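A shattering check for axis-aligned rectangles fits in a few lines: a labeling is realizable exactly when the bounding box of the positive points contains no negative point. The sketch below (function name and test points are ours) enumerates all labelings:

```python
from itertools import product

def shattered_by_rectangles(points):
    # True if every labeling of `points` is realized by some
    # axis-aligned rectangle
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, l in zip(points, labels) if l == 1]
        neg = [p for p, l in zip(points, labels) if l == 0]
        if not pos:
            continue  # the all-negative labeling: use an empty rectangle
        x_lo, x_hi = min(x for x, _ in pos), max(x for x, _ in pos)
        y_lo, y_hi = min(y for _, y in pos), max(y for _, y in pos)
        if any(x_lo <= x <= x_hi and y_lo <= y <= y_hi for x, y in neg):
            return False  # this labeling cannot be realized
    return True

print(shattered_by_rectangles([(0, 1), (0, -1), (1, 0), (-1, 0)]))          # True: 4 points in a diamond
print(shattered_by_rectangles([(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]))  # False: a 5th point breaks it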
Vapnik-Chervonenkis dimension formally
Visualization
VC dimension of finite hypothesis classes
VC dimension: Further examples
Half-time poll: VC dimension of threshold functions in R
1. VCdim = 1
2. VCdim = 2
3. VCdim = ∞
Convex polygons have VC dimension = ∞
Generalization bound based on the VC-dimension
Rademacher complexity
Experiment: how well does your hypothesis class fit noise?
Rademacher complexity
$$\hat{R}_S(H) = \frac{1}{2}\,\mathbb{E}_\sigma\left[\sup_{h\in H}\frac{1}{m}\sum_{i=1}^{m}\sigma_i h(x_i)\right]$$
• Thus
$$\frac{1}{m}\sum_{i=1}^{m}\sigma_i h(x_i) = \frac{1}{m}\left(\sum_i 1_{\{h(x_i)=\sigma_i\}} - \sum_i 1_{\{h(x_i)\neq\sigma_i\}}\right) = \frac{1}{m}\left(m - 2\sum_i 1_{\{h(x_i)\neq\sigma_i\}}\right) = 1 - 2\hat{\epsilon}(h)$$
where $\hat{\epsilon}(h)$ denotes the empirical error of h on the random labels $\sigma_i$
• Plug in:
$$\hat{R}_S(H) = \frac{1}{2}\,\mathbb{E}_\sigma\left[\sup_{h\in H}\left(1 - 2\hat{\epsilon}(h)\right)\right] = \frac{1}{2}\left(1 - 2\,\mathbb{E}_\sigma\inf_{h\in H}\hat{\epsilon}(h)\right) = \frac{1}{2} - \mathbb{E}_\sigma\inf_{h\in H}\hat{\epsilon}(h)$$
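The last identity suggests a direct Monte Carlo estimate of $\hat{R}_S(H)$: draw random ±1 labels, find the smallest empirical error any hypothesis achieves on them, and average over draws. Below is a minimal sketch for threshold functions on the real line; the helper names, the inclusion of sign-flipped thresholds, and the experimental setup are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_error_threshold(x, sigma):
    # Smallest empirical error over h_t(x) = sign(x - t) and its
    # negation, found by brute force over all distinct thresholds
    best = 1.0
    for t in np.concatenate(([-np.inf], np.sort(x))):
        pred = np.where(x > t, 1, -1)
        err = np.mean(pred != sigma)
        best = min(best, err, 1 - err)  # 1 - err: the flipped hypothesis
    return best

def rademacher_estimate(x, n_trials=1000):
    # R_hat_S(H) = 1/2 - E_sigma [ inf_h eps_hat(h) ]
    errs = [min_error_threshold(x, rng.choice([-1, 1], size=len(x)))
            for _ in range(n_trials)]
    return 0.5 - np.mean(errs)

x = rng.normal(size=50)
print(rademacher_estimate(x))  # small value: thresholds barely fit noise
```

A richer class would drive $\inf_h \hat{\epsilon}(h)$ toward zero and the estimate toward 1/2, the maximum under this definition.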
Generalization bound with Rademacher complexity
(Mohri et al. 2018): For any δ > 0, with probability at least 1 − δ over a sample of size m drawn i.i.d. from an unknown distribution D, for any h ∈ H we have:
$$R(h) \le \hat{R}_S(h) + \hat{R}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}$$
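As a sanity check with illustrative numbers (ours, not from the slides), the bound is easy to evaluate:

```python
import math

def rademacher_bound(emp_risk, rad_complexity, m, delta=0.05):
    # R(h) <= R_hat_S(h) + R_hat_S(H) + 3 * sqrt(log(2/delta) / (2m))
    return emp_risk + rad_complexity + 3 * math.sqrt(math.log(2 / delta) / (2 * m))

# hypothetical values: 5% training error, complexity 0.1, 500 examples
print(rademacher_bound(0.05, 0.1, m=500))  # -> about 0.33
```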
Example: Rademacher and VC bounds on a real dataset
• Prediction of protein subcellular localization
• 10-500 training examples, 172 test examples
• Comparing Rademacher and VC bounds using δ = 0.05
• Training and test error also shown
Rademacher vs. VC
Summary: Statistical learning theory