
CS-E4715 Supervised Machine Learning

Lecture 3: Learning with infinite hypothesis classes


Recall: PAC learnability

• A class C is PAC-learnable if there exists an algorithm A that, given
a training sample S, outputs a hypothesis hS whose generalization
error satisfies
Pr (R(hS ) ≤ ε) ≥ 1 − δ
• for any distribution D, for arbitrary ε, δ > 0 and a sample size m = |S|
that grows polynomially in 1/ε, 1/δ

1
Recall: PAC learning of a finite hypothesis class

• Sample complexity bound relying on the size of the hypothesis class


(Mohri et al., 2018): Pr (R(hS ) ≤ ε) ≥ 1 − δ if

m ≥ (1/ε) (log(|H|) + log(1/δ))

• An equivalent generalization error bound:

R(h) ≤ (1/m) (log(|H|) + log(1/δ))

• Holds for any finite hypothesis class assuming there is a consistent
hypothesis, one with zero empirical risk
• Extra term compared to the rectangle learning example is (1/ε) log(|H|)
• The more hypotheses there are in H, the more training examples are
needed
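
The bound is easy to evaluate numerically. Below is a minimal sketch (not part of the lecture material; the values of |H|, ε and δ are illustrative) that computes the smallest sample size the bound asks for:

import math

def sample_complexity(h_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (log|H| + log(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g. conjunctions over 10 boolean variables, roughly |H| = 3^10 hypotheses
print(sample_complexity(h_size=3**10, eps=0.05, delta=0.05))   # -> 280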

2
Learning with infinite hypothesis classes

• The size of the hypothesis class is a useful measure of complexity for


finite hypothesis classes (e.g. boolean formulae)
• However, most classifiers used in practice rely on infinite hypothesis
classes, e.g.
• H = axis-aligned rectangles in R2 (the example last lecture)
• H = hyperplanes in Rd (e.g. Support vector machines)
• H = neural networks with continuous input variables
• Need better tools to analyze these cases

3
Vapnik-Chervonenkis dimension
Intuition

• VC dimension can be understood as measuring the capacity of a


hypothesis class to adapt to different concepts
• It can be understood through the following thought experiment:
• Pick a fixed hypothesis class H, e.g. axis-aligned rectangles in R2
• Let us enumerate all possible labelings of a training set of size m:
Ym = {y1 , y2 , . . . , y2^m }, where yj = (yj1 , . . . , yjm ) and yji ∈ {0, 1} is
the label of the i'th example in the j'th labeling
• We are allowed to freely choose a distribution D generating the
inputs and to generate the input data x1 , . . . , xm
• VCdim(H) = the size of the largest training set for which we can find
a consistent classifier for every labeling in Ym
• Intuitively:
• low VCdim =⇒ easy to learn, low sample complexity
• high VCdim =⇒ hard to learn, high sample complexity
• infinite VCdim =⇒ cannot learn in PAC framework

4
Shattering

• The underlying concept in VC dimension is shattering


• Given a set of points S = {x1 , . . . , xm } and a fixed class of functions
H
• H is said to shatter S if for any possible partition of S into a positive
subset S+ and a negative subset S− we can find a hypothesis h ∈ H for
which h(x) = 1 if and only if x ∈ S+

Figure source:

https://datascience.stackexchange.com

5
How to show that VCdim(H) = d

• How to show that VCdim(H) = d for a hypothesis class


• We need to show two facts:
1. There exists a set of inputs of size d that can be shattered by
hypotheses in H (i.e. we can pick the set of inputs any way we like):
VCdim(H) ≥ d
2. There does not exist any set of inputs of size d + 1 that can be
shattered (i.e. we need to show a general property): VCdim(H) < d + 1

6
Example: intervals on a real line

• Let the hypothesis class be intervals in R


• Each hypothesis is defined by two parameters bh , eh ∈ R: the
beginning and end of the interval, h(x) = 1{bh ≤ x ≤ eh }
• We can shatter any set of two points by changing the endpoints of
the interval:

• We cannot shatter a three point set, as the middle point cannot be


excluded while the left-hand and right-hand side points are included

We conclude that the VC dimension of intervals on the real line is 2
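
The interval argument can also be checked by brute force. The sketch below (an illustration, not lecture code) enumerates every labeling of a point set and tests whether some interval realizes it; it reports True for two points and False for three:

from itertools import product

def interval_can_realize(points, labels):
    """Is there an interval [b, e] with h(x) = 1 iff b <= x <= e reproducing labels?"""
    positives = [x for x, y in zip(points, labels) if y == 1]
    if not positives:
        return True                          # an interval left of all points works
    b, e = min(positives), max(positives)    # tightest candidate interval
    return all((b <= x <= e) == (y == 1) for x, y in zip(points, labels))

def shatters(points):
    return all(interval_can_realize(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shatters([0.0, 1.0]))       # True  -> VCdim >= 2
print(shatters([0.0, 1.0, 2.0]))  # False -> e.g. the labeling (1, 0, 1) fails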


7
Lines in R2

• A hypothesis class of linear classifiers in R2 (label 1 on one side of
the line y = ax + b, label 0 on the other) shatters any set of three
non-collinear points in R2 .

• We conclude that VC dimension is ≥ 3

8
Lines in R2

Four points cannot be shattered by lines in R2 :

• There are only two possible configurations of four points in R2 :


1. All four points reside on the boundary of the convex hull
2. Three points form the convex hull and one is in interior
• In the first case (left), we cannot draw a line separating the top and
bottom points from the left-hand and right-hand side points
• In the second case, we cannot separate the interior point from the
points on the boundary of the convex hull with a line
• The two cases are sufficient to show that four points cannot be
shattered, hence VCdim = 3

9
VC-dimension of axis-aligned rectangles

• With axis aligned rectangles we can shatter a set of four points


(picture shows 4 of the 16 configurations)
• This implies VCdim(H) ≥ 4

10
VC-dimension of axis-aligned rectangles

• For five distinct points, consider the minimum bounding box of the
points
• There are two possible configurations:
1. There are one or more points in the interior of the box: then one
cannot include the points on the boundary and exclude the points in
the interior
2. At least one of the edges contains two points: in this case we can
pick either of the two points and verify that this point cannot be
excluded while all the other points are included
• Thus by the two cases, together with the four-point shattering above,
we have established that VCdim(H) = 4

11
Vapnik-Chervonenkis dimension formally

• Formally VCdim(H) is defined through the growth function

ΠH (m) = max_{ {x1 ,...,xm } ⊂ X } |{ (h(x1 ), . . . , h(xm )) : h ∈ H }|

• The growth function gives the maximum number of distinct labelings
the hypothesis class H can provide for any set of m input points
• The maximum of the growth function is 2^m for a set of m examples
• The Vapnik-Chervonenkis dimension is then

VCdim(H) = max{ m : ΠH (m) = 2^m }
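
The growth function of a simple class can be computed by brute force. As an illustration (not lecture code), the sketch below counts the labelings that intervals on the real line can realize on m points; for this class ΠH (m) = m(m+1)/2 + 1, so ΠH (m) = 2^m only for m ≤ 2, matching VCdim = 2 from the earlier example:

def interval_growth(points):
    """Number of distinct labelings of the points realizable by intervals [b, e].
    For intervals, any set of m distinct points attains the maximum,
    so this equals Pi_H(m)."""
    xs = sorted(points)
    m = len(xs)
    labelings = {(0,) * m}                        # the empty interval
    for i in range(m):
        for j in range(i, m):                     # interval covering xs[i..j]
            labelings.add(tuple(1 if i <= k <= j else 0 for k in range(m)))
    return len(labelings)

for m in range(1, 6):
    print(m, interval_growth(range(m)), 2 ** m)   # m(m+1)/2 + 1 vs 2^m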

12
Visualization

• The ratio of the growth function ΠH (m) to the maximum number of


labelings of a set of size m is shown
• Hypothesis class is 20-dimensional hyperplanes (VC dimension = 21)

13
VC dimension of finite hypothesis classes

• Any finite hypothesis class has VC dimension VCdim(H) ≤ log2 |H|


• To see this:
• Consider a set of m examples S = {x1 , . . . , xm }
• This set can be labeled in 2^m different ways, by choosing the labels
yi ∈ {0, 1} independently
• Each hypothesis h ∈ H fixes one labeling, a length-m binary
vector y(h, S) = (h(x1 ), . . . , h(xm ))
• All hypotheses in H together can provide at most |H| different
labelings in total (different vectors y(h, S), h ∈ H)
• If |H| < 2^m we cannot shatter S =⇒ we cannot shatter a set of
size m > log2 |H|

14
VC dimension: Further examples

Examples of classes with a finite VC dimension:

• convex d-polygons in R2 : VCdim = 2d + 1 (e.g. for general, not


restricted to axis-aligned, rectangles VCdim = 5)
• hyperplanes in Rd : VCdim = d + 1 - (e.g. single neural unit, linear
SVM)
• neural networks: VCdim = O(|E | log |E |), where E is the set of edges
in the network (for the sign activation function)
• boolean monomials of d variables: VCdim = d
• arbitrary boolean formulae of d variables: VCdim = 2^d

15
Half-time poll: VC dimension of threshold functions in R

Consider a hypothesis class H = {hθ } of threshold functions
hθ : R → {0, 1}, θ ∈ R:

hθ (x) = 1 if x > θ, and hθ (x) = 0 otherwise

What is the VC dimension of this hypothesis class?

1. VCdim = 1
2. VCdim = 2
3. VCdim = ∞

Answer to the poll in Mycourses by 11:15: Go to Lectures page and scroll


down to "Lecture 3 poll".
Answers are anonymous and do not affect grading of the course.
Convex polygons have VC dimension = ∞

• Let our hypothesis class be convex polygons in R2 without


restriction on the number of vertices d
• Let us draw an arbitrary circle in R2 - the distribution D will be
concentrated on the circumference of the circle
• This is a difficult distribution for learning polygons - we choose it on
purpose

16
Convex polygons have VC dimension = ∞

• Let us consider a set of m points with arbitrary binary labels


• For any m, let us position m points on the circumference of the
circle
• simulating drawing the inputs from the distribution D

17
Convex polygons have VC dimension = ∞

• Start from an arbitrary positive point (red circles)


• Traverse the circumference clockwise, skipping all negative points
and stopping at positive points

18
Convex polygons have VC dimension = ∞

• Connect adjacent positive points with an edge


• This forms a p-polygon inside the circle, where p is the number of
positive data points

19
Convex polygons have VC dimension = ∞

• Define h(x) = +1 for points inside the


polygon and h(x) = 0 outside
• Each of the 2^m labelings of m
examples gives us a p-polygon that
includes the p positive points in that
labeling and excludes the negative
points =⇒ we can shatter a set of
size m: VCdim(H) ≥ m
• Since m was arbitrary, we can grow it
without limit: VCdim(H) = ∞
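
The construction can also be verified computationally. The sketch below (an illustration, not lecture code) places m points on a unit circle and checks, for every one of the 2^m labelings, that the polygon whose vertices are the positive points classifies all points correctly:

import itertools
import math

def circle_points(m):
    return [(math.cos(2 * math.pi * i / m), math.sin(2 * math.pi * i / m))
            for i in range(m)]

def inside_convex_polygon(q, vertices):
    """True if q is inside or on the boundary of the convex polygon whose
    vertices are listed counterclockwise (empty set and a single vertex
    are treated as special cases)."""
    if len(vertices) == 0:
        return False
    if len(vertices) == 1:
        return q == vertices[0]
    n = len(vertices)
    for i in range(n):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % n]
        # q must not lie strictly on the outer side of any directed edge
        if (x2 - x1) * (q[1] - y1) - (y2 - y1) * (q[0] - x1) < -1e-9:
            return False
    return True

m = 8
pts = circle_points(m)
for labels in itertools.product([0, 1], repeat=m):          # all 2^m labelings
    polygon = [pts[i] for i in range(m) if labels[i] == 1]   # positive points
    predictions = tuple(int(inside_convex_polygon(p, polygon)) for p in pts)
    assert predictions == labels
print(f"all {2 ** m} labelings of {m} circle points realized -> VCdim >= {m}")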

20
Generalization bound based on the VC-dimension

• (Mohri, 2018) Let H be a family of functions taking values in


{−1, +1} with VC-dimension d. Then for any δ > 0, with
probability at least 1 − δ the following holds for all h ∈ H:
R(h) ≤ R̂(h) + √( 2 log(em/d) / (m/d) ) + √( log(1/δ) / (2m) )

• e ≈ 2.71828 is the base of the natural logarithm


• The bound reveals that the critical quantity is m/d, i.e. the number
of examples divided by the VC-dimension
• Manifestation of the Occam's razor principle: to justify an increase
in complexity, we need correspondingly more data
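
To get a feel for how slowly the bound shrinks, the sketch below evaluates it numerically (illustrative values only; d = 21 matches the 20-dimensional hyperplane example shown earlier, and the empirical risk 0.05 is made up):

import math

def vc_bound(emp_risk, m, d, delta):
    """emp_risk + sqrt(2*log(e*m/d) / (m/d)) + sqrt(log(1/delta) / (2*m))."""
    complexity = math.sqrt(2.0 * math.log(math.e * m / d) / (m / d))
    confidence = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return emp_risk + complexity + confidence

for m in (500, 5000, 50000):
    print(m, round(vc_bound(emp_risk=0.05, m=m, d=21, delta=0.05), 3))
# -> roughly 0.70, 0.30, 0.14: the bound is uninformative until m/d is large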

21
Rademacher complexity
Experiment: how well does your hypothesis class fit noise?

• Consider a set of training examples S0 = {(xi , yi )}, i = 1, . . . , m

• Generate M new datasets S1 , . . . , SM from S0 by randomly drawing
a new label σ ∈ Y for each training example in S0 :

Sk = {(xi , σik )}, i = 1, . . . , m

• Train a classifier hk minimizing the empirical risk on training set Sk ,
record its empirical risk

R̂(hk ) = (1/m) Σi=1..m 1{hk (xi ) ≠ σik }

• Compute the average empirical risk over all datasets:

ε̄ = (1/M) Σk=1..M R̂(hk )

22
Experiment: how well does your hypothesis class fit noise?

• Observe the quantity

R̂ = 1/2 − ε̄

• We have R̂ = 0 when ε̄ = 0.5, that is when the predictions
correspond to random coin flips (0.5 probability to predict either
class)
• We have R̂ = 0.5 when ε̄ = 0, that is when all hypotheses
hk , k = 1, . . . , M have zero empirical error (perfect fit to noise, not
good!)
• Intuitively we would like our hypothesis class
• to be able to separate noise from signal - to have a low R̂
• to still allow low empirical error on real data - otherwise it is
impossible to obtain low generalization error
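
The experiment is easy to simulate. In the sketch below (not the course's code) the ERM step is only approximated by fitting a depth-2 decision tree with scikit-learn, which is an assumed dependency; the inputs are made-up Gaussian points:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
m, M = 50, 200
X = rng.normal(size=(m, 2))                        # the fixed inputs of S_0

errors = []
for _ in range(M):
    sigma = rng.integers(0, 2, size=m)             # random labels defining S_k
    h_k = DecisionTreeClassifier(max_depth=2).fit(X, sigma)
    errors.append(np.mean(h_k.predict(X) != sigma))    # empirical risk on S_k

eps_bar = float(np.mean(errors))
print(f"average error on random labels: {eps_bar:.3f}")
print(f"R_hat = 1/2 - eps_bar = {0.5 - eps_bar:.3f}")  # larger for richer classes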

23
Rademacher complexity

• Rademacher complexity defines complexity as the capacity of


hypothesis class to fit random noise
• For binary classification with labels Y = {−1, +1} empirical
Rademacher complexity can be defined as
R̂S (H) = (1/2) Eσ [ sup_{h∈H} (1/m) Σi=1..m σi h(xi ) ]

• σi ∈ {−1, +1} are Rademacher random variables, drawn
independently from the uniform distribution (i.e. Pr {σi = 1} = 0.5)
• The expression inside the expectation takes the highest correlation,
over all hypotheses h ∈ H, between the random labels σi and the
predicted labels h(xi )

24
Rademacher complexity

R̂S (H) = (1/2) Eσ [ sup_{h∈H} (1/m) Σi=1..m σi h(xi ) ]

• Let us rewrite R̂S (H) in terms of empirical error


• Note that with labels Y = {+1, −1},

σi h(xi ) = 1 if σi = h(xi ), and −1 if σi ≠ h(xi )

• Thus

(1/m) Σi σi h(xi ) = (1/m) ( Σi 1{h(xi )=σi } − Σi 1{h(xi )≠σi } )
                   = (1/m) ( m − 2 Σi 1{h(xi )≠σi } ) = 1 − 2 ε̂(h)

25
Rademacher complexity

• Plug in

R̂S (H) = (1/2) Eσ [ sup_{h∈H} (1 − 2 ε̂(h)) ]
        = (1/2) (1 − 2 Eσ [ inf_{h∈H} ε̂(h) ]) = 1/2 − Eσ [ inf_{h∈H} ε̂(h) ]

• Now we have expressed the empirical Rademacher complexity in


terms of expected empirical error of classifying randomly labeled data
• But how does the Rademacher complexity help in model selection?
• We need to relate it to generalization error

26
Generalization bound with Rademacher complexity

(Mohri et al. 2018): For any δ > 0, with probability at least 1 − δ over a
sample drawn from an unknown distribution D, for any h ∈ H we have:
R(h) ≤ R̂S (h) + R̂S (H) + 3 √( log(2/δ) / (2m) )

The bound is composed of the sum of:

• The empirical risk of h on the training data S (with the original


labels): R̂S (h)
• The empirical Rademacher complexity: R̂S (H)
• A term that tends to zero as a function of the size of the training data
as O(1/√m), assuming constant δ
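
As with the VC bound, the terms can simply be plugged in. The sketch below uses made-up values for the empirical risk and the empirical Rademacher complexity, with m = 500 and δ = 0.05:

import math

def rademacher_bound(emp_risk, rad_complexity, m, delta):
    """emp_risk + rad_complexity + 3*sqrt(log(2/delta) / (2*m))."""
    return emp_risk + rad_complexity + 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * m))

print(round(rademacher_bound(emp_risk=0.05, rad_complexity=0.15, m=500, delta=0.05), 3))
# -> about 0.38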

27
Example: Rademacher and VC bounds on a real dataset

• Prediction of protein
subcellular localization
• 10-500 training examples,
172 test examples
• Comparing Rademacher and
VC bounds using δ = 0.05
• Training and test error also
shown

28
Example: Rademacher and VC bounds on a real dataset

• Rademacher bound is sharper


than the VC bound
• VC bound is not yet
informative with 500
examples (bound > 0.5)
using δ = 0.05
• The gap between the mean
of the error distribution (≈
test error) and the 0.05
probability tail (VC and
Rademacher bounds) is
evident (and expected)

29
Rademacher vs. VC

Note the differences between Rademacher complexity and VC dimension

• VC dimension is independent of any training sample or distribution


generating the data: it measures the worst-case where the data is
generated in a bad way for the learner
• Rademacher complexity depends on the training sample and is thus
dependent on the data-generating distribution
• VC dimension focuses on the extreme case of realizing all labelings of
the data
• Rademacher complexity measures smoothly the ability to realize
random labelings

30
Rademacher vs. VC

• Generalization bounds based on Rademacher Complexity are


applicable to any binary classifier (SVM, neural network, decision
tree)
• It motivates state-of-the-art learning algorithms such as support
vector machines
• But computing it might be hard, if we need to train a large number
of classifiers
• Vapnik-Chervonenkis dimension (VCdim) is an alternative that is
usually easier to derive analytically

31
Summary: Statistical learning theory

• Statistical learning theory focuses on analyzing the generalization
ability of learning algorithms
• The Probably Approximately Correct (PAC) framework is the most
studied theoretical framework, asking to bound the generalization error
(ε) with high probability (1 − δ), for an arbitrary level of error
ε > 0 and confidence δ > 0
• The Vapnik-Chervonenkis dimension lets us study the learnability of
infinite hypothesis classes through the concept of shattering
• Rademacher complexity is a practical alternative to the VC dimension,
typically giving sharper bounds (but requiring a lot of simulations to
be run)

32
