
CS-E4715 Supervised Machine Learning

Lecture 3: Learning with infinite hypothesis classes


Recall: PAC learnability

• A class C is PAC-learnable if there exists an algorithm A that, given
a training sample S, outputs a hypothesis hS whose generalization
error satisfies
Pr (R(hS ) ≤ ε) ≥ 1 − δ
• for any distribution D, for arbitrary ε, δ > 0 and a sample size m = |S|
that grows polynomially in 1/ε, 1/δ

1
Recall: PAC learning of a finite hypothesis class

• Sample complexity bound relying on the size of the hypothesis class


(Mohri et al., 2018): Pr (R(hS ) ≤ ε) ≥ 1 − δ if

m ≥ (1/ε) (log(|H|) + log(1/δ))

• An equivalent generalization error bound:

R(h) ≤ (1/m) (log(|H|) + log(1/δ))

• Holds for any finite hypothesis class assuming there is a consistent
hypothesis, one with zero empirical risk
• Extra term compared to the rectangle learning example is (1/ε) log(|H|)
• The more hypotheses there are in H, the more training examples are
needed
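
The bound is easy to evaluate numerically. Below is a minimal sketch (not part of the lecture material; the values of |H|, ε and δ are illustrative) that computes the smallest sample size the bound asks for:

import math

def sample_complexity(h_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (log|H| + log(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g. conjunctions over 10 boolean variables, roughly |H| = 3^10 hypotheses
print(sample_complexity(h_size=3**10, eps=0.05, delta=0.05))   # -> 280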

2
Learning with infinite hypothesis classes

• The size of the hypothesis class is a useful measure of complexity for


finite hypothesis classes (e.g. boolean formulae)
• However, most classifiers used in practice rely on infinite hypothesis
classes, e.g.
• H = axis-aligned rectangles in R2 (the example last lecture)
• H = hyperplanes in Rd (e.g. Support vector machines)
• H = neural networks with continuous input variables
• Need better tools to analyze these cases

3
Vapnik-Chervonenkis dimension
Intuition

• VC dimension can be understood as measuring the capacity of a


hypothesis class to adapt to different concepts
• It can be understood through the following thought experiment:
• Pick a fixed hypothesis class H, e.g. axis-aligned rectangles in R2
• Let us enumerate all possible labelings of a training set of size m:
Ym = {y1 , y2 , . . . , y2^m }, where yj = (yj1 , . . . , yjm ) and yji ∈ {0, 1} is
the label of the i'th example in the j'th labeling
• We are allowed to freely choose a distribution D generating the
inputs and to generate the input data x1 , . . . , xm
• VCdim(H) = the size of the largest training set for which we can find
a consistent classifier for every labeling in Ym
• Intuitively:
• low VCdim =⇒ easy to learn, low sample complexity
• high VCdim =⇒ hard to learn, high sample complexity
• infinite VCdim =⇒ cannot learn in PAC framework

4
Shattering

• The underlying concept in VC dimension is shattering


• Given a set of points S = {x1 , . . . , xm } and a fixed class of functions
H
• H is said to shatter S if for any possible partition of S into a positive
subset S+ and a negative subset S− we can find a hypothesis h ∈ H for
which h(x) = 1 if and only if x ∈ S+

Figure source:

https://datascience.stackexchange.com

5
How to show that VCdim(H) = d

• How to show that VCdim(H) = d for a hypothesis class


• We need to show two facts:
1. There exists a set of inputs of size d that can be shattered by
hypotheses in H (i.e. we can pick the set of inputs any way we like):
VCdim(H) ≥ d
2. There does not exist any set of inputs of size d + 1 that can be
shattered (i.e. we need to show a general property): VCdim(H) < d + 1

6
Example: intervals on a real line

• Let the hypothesis class be intervals in R


• Each hypothesis is defined by two parameters bh , eh ∈ R: the
beginning and end of the interval, h(x) = 1{bh ≤ x ≤ eh }
• We can shatter any set of two points by changing the endpoints of
the interval:

• We cannot shatter a three point set, as the middle point cannot be


excluded while the left-hand and right-hand side points are included

We conclude that the VC dimension of intervals on the real line is 2
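
The interval argument can also be checked by brute force. The sketch below (an illustration, not lecture code) enumerates every labeling of a point set and tests whether some interval realizes it; it reports True for two points and False for three:

from itertools import product

def interval_can_realize(points, labels):
    """Is there an interval [b, e] with h(x) = 1 iff b <= x <= e reproducing labels?"""
    positives = [x for x, y in zip(points, labels) if y == 1]
    if not positives:
        return True                          # an interval left of all points works
    b, e = min(positives), max(positives)    # tightest candidate interval
    return all((b <= x <= e) == (y == 1) for x, y in zip(points, labels))

def shatters(points):
    return all(interval_can_realize(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shatters([0.0, 1.0]))       # True  -> VCdim >= 2
print(shatters([0.0, 1.0, 2.0]))  # False -> e.g. the labeling (1, 0, 1) fails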


7
Lines in R2

• A hypothesis class of linear classifiers in R2 (label 1 on one side of
the line y = ax + b, label 0 on the other) shatters any set of three
non-collinear points in R2 .

• We conclude that VC dimension is ≥ 3

8
Lines in R2

Four points cannot be shattered by lines in R2 :

• There are only two possible configurations of four points in R2 :


1. All four points reside on the boundary of the convex hull
2. Three points form the convex hull and one is in interior
• In the first case (left), we cannot draw a line separating the top and
bottom points from the left-hand and right-hand side points
• In the second case, we cannot separate the interior point from the
points on the boundary of the convex hull with a line
• The two cases are sufficient to show that four points cannot be
shattered, hence VCdim = 3

9
VC-dimension of axis-aligned rectangles

• With axis aligned rectangles we can shatter a set of four points


(picture shows 4 of the 16 configurations)
• This implies VCdim(H) ≥ 4

10
VC-dimension of axis-aligned rectangles

• For five distinct points, consider the minimum bounding box of the
points
• There are two possible configurations:
1. There are one or more points in the interior of the box: then one
cannot include the points on the boundary and exclude the points in
the interior
2. At least one of the edges contains two points: in this case we can
pick either of the two points and verify that this point cannot be
excluded while all the other points are included
• Thus by the two cases, together with the four-point shattering above,
we have established that VCdim(H) = 4

11
Vapnik-Chervonenkis dimension formally

• Formally VCdim(H) is defined through the growth function

ΠH (m) = max_{ {x1 ,...,xm } ⊂ X } |{ (h(x1 ), . . . , h(xm )) : h ∈ H }|

• The growth function gives the maximum number of distinct labelings
the hypothesis class H can provide for any set of m input points
• The maximum of the growth function is 2^m for a set of m examples
• The Vapnik-Chervonenkis dimension is then

VCdim(H) = max{ m : ΠH (m) = 2^m }
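
The growth function of a simple class can be computed by brute force. As an illustration (not lecture code), the sketch below counts the labelings that intervals on the real line can realize on m points; for this class ΠH (m) = m(m+1)/2 + 1, so ΠH (m) = 2^m only for m ≤ 2, matching VCdim = 2 from the earlier example:

def interval_growth(points):
    """Number of distinct labelings of the points realizable by intervals [b, e].
    For intervals, any set of m distinct points attains the maximum,
    so this equals Pi_H(m)."""
    xs = sorted(points)
    m = len(xs)
    labelings = {(0,) * m}                        # the empty interval
    for i in range(m):
        for j in range(i, m):                     # interval covering xs[i..j]
            labelings.add(tuple(1 if i <= k <= j else 0 for k in range(m)))
    return len(labelings)

for m in range(1, 6):
    print(m, interval_growth(range(m)), 2 ** m)   # m(m+1)/2 + 1 vs 2^m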

12
Visualization

• The ratio of the growth function ΠH (m) to the maximum number of


labelings of a set of size m is shown
• Hypothesis class is 20-dimensional hyperplanes (VC dimension = 21)

13
VC dimension of finite hypothesis classes

• Any finite hypothesis class has VC dimension VCdim(H) ≤ log2 |H|


• To see this:
• Consider a set of m examples S = {x1 , . . . , xm }
• This set can be labeled in 2^m different ways, by choosing the labels
yi ∈ {0, 1} independently
• Each hypothesis h ∈ H fixes one labeling, a length-m binary
vector y(h, S) = (h(x1 ), . . . , h(xm ))
• All hypotheses in H together can provide at most |H| different
labelings in total (different vectors y(h, S), h ∈ H)
• If |H| < 2^m we cannot shatter S =⇒ we cannot shatter a set of
size m > log2 |H|

14
VC dimension: Further examples

Examples of classes with a finite VC dimension:

• convex d-polygons in R2 : VCdim = 2d + 1 (e.g. for general, not


restricted to axis-aligned, rectangles VCdim = 5)
• hyperplanes in Rd : VCdim = d + 1 - (e.g. single neural unit, linear
SVM)
• neural networks: VCdim = O(|E | log |E |), where E is the set of edges
in the network (for the sign activation function)
• boolean monomials of d variables: VCdim = d
• arbitrary boolean formulae of d variables: VCdim = 2^d

15
Half-time poll: VC dimension of threshold functions in R

Consider a hypothesis class H = {hθ } of threshold functions
hθ : R → {0, 1}, θ ∈ R:

hθ (x) = 1 if x > θ, and hθ (x) = 0 otherwise

What is the VC dimension of this hypothesis class?

1. VCdim = 1
2. VCdim = 2
3. VCdim = ∞

Answer to the poll in Mycourses by 11:15: Go to Lectures page and scroll


down to "Lecture 3 poll".
Answers are anonymous and do not affect grading of the course.
Convex polygons have VC dimension = ∞

• Let our hypothesis class be convex polygons in R2 without


restriction on the number of vertices d
• Let us draw an arbitrary circle in R2 - the distribution D will be
concentrated on the circumference of the circle
• This is a difficult distribution for learning polygons - we choose it on
purpose

16
Convex polygons have VC dimension = ∞

• Let us consider a set of m points with arbitrary binary labels


• For any m, let us position m points on the circumference of the
circle
• simulating drawing the inputs from the distribution D

17
Convex polygons have VC dimension = ∞

• Start from an arbitrary positive point (red circles)


• Traverse the circumference clockwise, skipping all negative points
and stopping at positive points

18
Convex polygons have VC dimension = ∞

• Connect adjacent positive points with an edge


• This forms a p-polygon inside the circle, where p is the number of
positive data points

19
Convex polygons have VC dimension = ∞

• Define h(x) = +1 for points inside the


polygon and h(x) = 0 outside
• Each of the 2^m labelings of m
examples gives us a p-polygon that
includes the p positive points in that
labeling and excludes the negative
points =⇒ we can shatter a set of
size m: VCdim(H) ≥ m
• Since m was arbitrary, we can grow it
without limit: VCdim(H) = ∞
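
The construction can also be verified computationally. The sketch below (an illustration, not lecture code) places m points on a unit circle and checks, for every one of the 2^m labelings, that the polygon whose vertices are the positive points classifies all points correctly:

import itertools
import math

def circle_points(m):
    return [(math.cos(2 * math.pi * i / m), math.sin(2 * math.pi * i / m))
            for i in range(m)]

def inside_convex_polygon(q, vertices):
    """True if q is inside or on the boundary of the convex polygon whose
    vertices are listed counterclockwise (empty set and a single vertex
    are treated as special cases)."""
    if len(vertices) == 0:
        return False
    if len(vertices) == 1:
        return q == vertices[0]
    n = len(vertices)
    for i in range(n):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % n]
        # q must not lie strictly on the outer side of any directed edge
        if (x2 - x1) * (q[1] - y1) - (y2 - y1) * (q[0] - x1) < -1e-9:
            return False
    return True

m = 8
pts = circle_points(m)
for labels in itertools.product([0, 1], repeat=m):          # all 2^m labelings
    polygon = [pts[i] for i in range(m) if labels[i] == 1]   # positive points
    predictions = tuple(int(inside_convex_polygon(p, polygon)) for p in pts)
    assert predictions == labels
print(f"all {2 ** m} labelings of {m} circle points realized -> VCdim >= {m}")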

20
Generalization bound based on the VC-dimension

• (Mohri, 2018) Let H be a family of functions taking values in


{−1, +1} with VC-dimension d. Then for any δ > 0, with
probability at least 1 − δ the following holds for all h ∈ H:
R(h) ≤ R̂(h) + √( 2 log(em/d) / (m/d) ) + √( log(1/δ) / (2m) )

• e ≈ 2.71828 is the base of the natural logarithm


• The bound reveals that the critical quantity is m/d, i.e. the number
of examples divided by the VC-dimension
• Manifestation of the Occam's razor principle: to justify an increase
in complexity, we need correspondingly more data
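
To get a feel for how slowly the bound shrinks, the sketch below evaluates it numerically (illustrative values only; d = 21 matches the 20-dimensional hyperplane example shown earlier, and the empirical risk 0.05 is made up):

import math

def vc_bound(emp_risk, m, d, delta):
    """emp_risk + sqrt(2*log(e*m/d) / (m/d)) + sqrt(log(1/delta) / (2*m))."""
    complexity = math.sqrt(2.0 * math.log(math.e * m / d) / (m / d))
    confidence = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return emp_risk + complexity + confidence

for m in (500, 5000, 50000):
    print(m, round(vc_bound(emp_risk=0.05, m=m, d=21, delta=0.05), 3))
# -> roughly 0.70, 0.30, 0.14: the bound is uninformative until m/d is large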

21
Rademacher complexity
Experiment: how well does your hypothesis class fit noise?

• Consider a set of training examples S0 = {(xi , yi )}, i = 1, . . . , m

• Generate M new datasets S1 , . . . , SM from S0 by randomly drawing
a new label σ ∈ Y for each training example in S0 :

Sk = {(xi , σik )}, i = 1, . . . , m

• Train a classifier hk minimizing the empirical risk on training set Sk ,
record its empirical risk

R̂(hk ) = (1/m) Σi=1..m 1{hk (xi ) ≠ σik }

• Compute the average empirical risk over all datasets:

ε̄ = (1/M) Σk=1..M R̂(hk )

22
Experiment: how well does your hypothesis class fit noise?

• Observe the quantity

R̂ = 1/2 − ε̄

• We have R̂ = 0 when ε̄ = 0.5, that is when the predictions
correspond to random coin flips (0.5 probability to predict either
class)
• We have R̂ = 0.5 when ε̄ = 0, that is when all hypotheses
hk , k = 1, . . . , M have zero empirical error (perfect fit to noise, not
good!)
• Intuitively we would like our hypothesis class
• to be able to separate noise from signal - to have a low R̂
• to still allow low empirical error on real data - otherwise it is
impossible to obtain low generalization error
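
The experiment is easy to simulate. In the sketch below (not the course's code) the ERM step is only approximated by fitting a depth-2 decision tree with scikit-learn, which is an assumed dependency; the inputs are made-up Gaussian points:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
m, M = 50, 200
X = rng.normal(size=(m, 2))                        # the fixed inputs of S_0

errors = []
for _ in range(M):
    sigma = rng.integers(0, 2, size=m)             # random labels defining S_k
    h_k = DecisionTreeClassifier(max_depth=2).fit(X, sigma)
    errors.append(np.mean(h_k.predict(X) != sigma))    # empirical risk on S_k

eps_bar = float(np.mean(errors))
print(f"average error on random labels: {eps_bar:.3f}")
print(f"R_hat = 1/2 - eps_bar = {0.5 - eps_bar:.3f}")  # larger for richer classes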

23
Rademacher complexity

• Rademacher complexity defines complexity as the capacity of


hypothesis class to fit random noise
• For binary classification with labels Y = {−1, +1} empirical
Rademacher complexity can be defined as
R̂S (H) = (1/2) Eσ [ sup_{h∈H} (1/m) Σi=1..m σi h(xi ) ]

• σi ∈ {−1, +1} are Rademacher random variables, drawn
independently from the uniform distribution (i.e. Pr {σi = 1} = 0.5)
• The expression inside the expectation takes the highest correlation,
over all hypotheses h ∈ H, between the random labels σi and the
predicted labels h(xi )

24
Rademacher complexity

R̂S (H) = (1/2) Eσ [ sup_{h∈H} (1/m) Σi=1..m σi h(xi ) ]

• Let us rewrite R̂S (H) in terms of empirical error


• Note that with labels Y = {+1, −1},

σi h(xi ) = 1 if σi = h(xi ), and −1 if σi ≠ h(xi )

• Thus

(1/m) Σi σi h(xi ) = (1/m) ( Σi 1{h(xi )=σi } − Σi 1{h(xi )≠σi } )
                   = (1/m) ( m − 2 Σi 1{h(xi )≠σi } ) = 1 − 2 ε̂(h)

25
Rademacher complexity

• Plug in

R̂S (H) = (1/2) Eσ [ sup_{h∈H} (1 − 2 ε̂(h)) ]
        = (1/2) (1 − 2 Eσ [ inf_{h∈H} ε̂(h) ]) = 1/2 − Eσ [ inf_{h∈H} ε̂(h) ]

• Now we have expressed the empirical Rademacher complexity in


terms of expected empirical error of classifying randomly labeled data
• But how does the Rademacher complexity help in model selection?
• We need to relate it to generalization error

26
Generalization bound with Rademacher complexity

(Mohri et al. 2018): For any δ > 0, with probability at least 1 − δ over a
sample drawn from an unknown distribution D, for any h ∈ H we have:
R(h) ≤ R̂S (h) + R̂S (H) + 3 √( log(2/δ) / (2m) )

The bound is composed of the sum of:

• The empirical risk of h on the training data S (with the original


labels): R̂S (h)
• The empirical Rademacher complexity: R̂S (H)
• A term that tends to zero as a function of the size of the training data
as O(1/√m), assuming constant δ
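
As with the VC bound, the terms can simply be plugged in. The sketch below uses made-up values for the empirical risk and the empirical Rademacher complexity, with m = 500 and δ = 0.05:

import math

def rademacher_bound(emp_risk, rad_complexity, m, delta):
    """emp_risk + rad_complexity + 3*sqrt(log(2/delta) / (2*m))."""
    return emp_risk + rad_complexity + 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * m))

print(round(rademacher_bound(emp_risk=0.05, rad_complexity=0.15, m=500, delta=0.05), 3))
# -> about 0.38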

27
Example: Rademacher and VC bounds on a real dataset

• Prediction of protein
subcellular localization
• 10-500 training examples,
172 test examples
• Comparing Rademacher and
VC bounds using δ = 0.05
• Training and test error also
shown

28
Example: Rademacher and VC bounds on a real dataset

• Rademacher bound is sharper


than the VC bound
• VC bound is not yet
informative with 500
examples (bound > 0.5)
using δ = 0.05
• The gap between the mean
of the error distribution (≈
test error) and the 0.05
probability tail (VC and
Rademacher bounds) is
evident (and expected)

29
Rademacher vs. VC

Note the differences between Rademacher complexity and VC dimension

• VC dimension is independent of any training sample or distribution


generating the data: it measures the worst-case where the data is
generated in a bad way for the learner
• Rademacher complexity depends on the training sample and is thus
dependent on the data-generating distribution
• VC dimension focuses on the extreme case of realizing all labelings of
the data
• Rademacher complexity measures smoothly the ability to realize
random labelings

30
Rademacher vs. VC

• Generalization bounds based on Rademacher Complexity are


applicable to any binary classifier (SVM, neural network, decision
tree)
• It motivates state-of-the-art learning algorithms such as support
vector machines
• But computing it might be hard, if we need to train a large number
of classifiers
• Vapnik-Chervonenkis dimension (VCdim) is an alternative that is
usually easier to derive analytically

31
Summary: Statistical learning theory

• Statistical learning theory focuses on analyzing the generalization
ability of learning algorithms
• The Probably Approximately Correct (PAC) framework is the most
studied theoretical framework, asking to bound the generalization error
(ε) with high probability (1 − δ), for an arbitrary level of error
ε > 0 and confidence δ > 0
• The Vapnik-Chervonenkis dimension lets us study the learnability of
infinite hypothesis classes through the concept of shattering
• Rademacher complexity is a practical alternative to the VC dimension,
typically giving sharper bounds (but requiring a lot of simulations to
be run)

32
