Foundations of Machine Learning
Module 7: Computational Learning Theory
Part A: Finite Hypothesis Space
Sudeshna Sarkar
IIT Kharagpur
Goal of Learning Theory
• To understand
– What kinds of tasks are learnable?
– What kind of data is required for learnability?
– What are the space and time requirements of the learning algorithm?
• To develop and analyze models
– Develop algorithms that provably meet desired criteria
– Prove guarantees for successful algorithms
Goal of Learning Theory
• Two core aspects of ML
– Algorithm Design. How to optimize?
– Confidence: how well will the learned rule perform on future data?
• We need particular settings (models)
– Probably Approximately Correct (PAC)
Pr[ P𝒟(c ⊕ h) ≤ ε ] ≥ 1 − δ
[Figure: target concept c and hypothesis h over the instance space; their symmetric difference c ⊕ h is the error region]
Prototypical Concept Learning Task
• Given
– Instances X (e.g., X = ℝ^d or X = {0,1}^d)
– Distribution 𝒟 over X
– Target function c
– Hypothesis space ℋ
– Training examples S = {(x_i, c(x_i))}, with x_i drawn i.i.d. from 𝒟
[Figure: instance space X with target concept c and hypothesis h; + and − mark labeled examples]
• Determine
– A hypothesis h ∈ ℋ s.t. h(x) = c(x) for all x in S?
– A hypothesis h ∈ ℋ s.t. h(x) = c(x) for all x in X?
• The algorithm optimizes over S to find a hypothesis h.
• Goal: find an h that has small error over 𝒟.
Computational Learning Theory
• Can we be certain about how the learning algorithm
generalizes?
• We would have to see all the examples.
• Inductive inference: generalizing beyond the training data is impossible unless we add more assumptions (e.g., priors over H).
We need a bias!
[Figure: instance space X with target concept c and hypothesis h]
Function Approximation
• How many labeled examples are needed to determine which of the 2^(2^N) hypotheses is the correct one?
• All 2^N instances in X must be labeled!
• Inductive inference: generalizing beyond the training data is impossible unless we add more assumptions (e.g., a bias).
• H = { h : X → Y }, so |H| = 2^|X| = 2^(2^N) for N Boolean features.
[Figure: instance space X with target c and two hypotheses h1, h2 that both fit the labeled examples]
Error of a hypothesis
The true error of hypothesis h, with respect to the target
concept c and observation distribution 𝒟 is the probability that h
will misclassify an instance drawn according to 𝒟
error𝒟(h) ≡ Pr_{x∼𝒟}[ c(x) ≠ h(x) ]
In a perfect world, we’d like the true error to be 0.
Consistent Case
Theorem
m ≥ (1/ε) ( ln|H| + ln(1/δ) )
labeled examples are sufficient so that with probability 1 − δ, all h ∈ H with err_D(h) ≥ ε have err_S(h) > 0.
Inconsistent Case
What if there is no perfect h?
Theorem: After m examples, with probability ≥ 1 − δ, all h ∈ H have |err_D(h) − err_S(h)| < ε, for
m ≥ (1/(2ε²)) ( ln|H| + ln(2/δ) )
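As a quick numerical companion to the two theorems above (not part of the original slides; the function names are mine), both bounds can be evaluated directly:

import math

def m_consistent(h_size, eps, delta):
    # Consistent case: m >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

def m_agnostic(h_size, eps, delta):
    # Inconsistent (agnostic) case: m >= (1/(2*eps^2)) * (ln|H| + ln(2/delta))
    return math.ceil((math.log(h_size) + math.log(2.0 / delta)) / (2.0 * eps ** 2))

print(m_consistent(1000, 0.1, 0.05))  # 100 examples suffice for |H| = 1000
print(m_agnostic(1000, 0.1, 0.05))    # 530: the 1/eps^2 factor makes the agnostic bound larger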
Sample complexity: example
• 𝒞 : Conjunction of n Boolean literals. Is 𝒞 PAC-learnable?
|ℋ| = 3^n
m ≥ (1/ε) ( n ln 3 + ln(1/δ) )
• Concrete examples:
– δ=ε=0.05, n=10 gives 280 examples
– δ=0.01, ε=0.05, n=10 gives 312 examples
– δ=ε=0.01, n=10 gives 1,560 examples
– δ=ε=0.01, n=50 gives 5,954 examples
• Result holds for any consistent learner, such as FindS.
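The concrete numbers in the list above follow directly from the bound m ≥ (1/ε)(n ln 3 + ln(1/δ)); a minimal check (the helper name is mine):

import math

def m_conjunctions(n, eps, delta):
    # Sample complexity bound for conjunctions of n Boolean literals: |H| = 3^n
    return math.ceil((n * math.log(3) + math.log(1.0 / delta)) / eps)

print(m_conjunctions(10, 0.05, 0.05))  # 280
print(m_conjunctions(10, 0.05, 0.01))  # 312
print(m_conjunctions(10, 0.01, 0.01))  # 1560
print(m_conjunctions(50, 0.01, 0.01))  # 5954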
Sample Complexity of Learning
Arbitrary Boolean Functions
• Consider any Boolean function over n Boolean features, e.g., the hypothesis space of DNF formulas or decision trees. There are 2^(2^n) of these, so a sufficient number of examples to PAC-learn such a concept is:
m ≥ (1/ε) ( ln 2^(2^n) + ln(1/δ) ) = (1/ε) ( 2^n ln 2 + ln(1/δ) )
which is exponential in n, so this bound does not give sample-efficient PAC learning of arbitrary Boolean functions.
Thank You
Concept Learning Task
“Days in which Aldo enjoys swimming”
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes
Thank You
Foundations of Machine Learning
Module 7: Computational Learning Theory
Part B: Infinite Hypothesis Spaces
Sudeshna Sarkar
IIT Kharagpur
Sample Complexity: Infinite
Hypothesis Spaces
• Need some measure of the expressiveness of infinite
hypothesis spaces.
• The Vapnik-Chervonenkis (VC) dimension provides
such a measure, denoted VC(H).
• Analogous to ln|H|, there are bounds for sample complexity using VC(H).
Shattering
• Consider a hypothesis space ℋ for a 2-class problem.
• A set of N points (instances) can be labeled as + or − in 2^N ways.
• If for every such labeling a function can be found in ℋ consistent with that labeling, we say that the set of instances is shattered by ℋ.
Three points in ℝ^2
• It is enough to find one set of three points that can be shattered.
• It is not necessary to be able to shatter every possible set of three points in 2 dimensions.
Shattering Instances
• Consider 2 instances, each described by a single real-valued feature, being shattered by a single interval.
[Figure: two points x and y on the real line; every +/− labeling of {x, y} can be picked out by one interval]
Shattering Instances (cont)
But 3 instances x < y < z on the real line cannot be shattered by a single interval. The eight labelings are:

  +            −
  (none)       x, y, z
  x            y, z
  y            x, z
  z            x, y
  x, y         z
  y, z         x
  x, y, z      (none)
  x, z         y      ← cannot be done by a single interval

In every case except {x, z}+ / {y}−, the positive points form a contiguous run that one interval can pick out.
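The table above can be verified by brute force: a labeling of points on the real line is realizable by one closed interval exactly when the positive points form a contiguous run. A small sketch (the helper names are my own, not from the lecture):

from itertools import product

def realizable_by_interval(points, labels):
    # With the points sorted, a single interval can produce a labeling iff the
    # positively labeled points form one contiguous run (possibly empty).
    order = sorted(range(len(points)), key=lambda i: points[i])
    signs = [labels[i] for i in order]
    pos = [k for k, s in enumerate(signs) if s]
    if not pos:
        return True                          # no positives: use an empty interval
    return all(signs[pos[0]:pos[-1] + 1])    # no negative inside the positive run

def shattered(points):
    # Shattered iff every one of the 2^N labelings is realizable.
    return all(realizable_by_interval(points, list(lab))
               for lab in product([False, True], repeat=len(points)))

print(shattered([1.0, 2.0]))        # True: any 2 points are shattered
print(shattered([1.0, 2.0, 3.0]))   # False: the labeling (+, -, +) cannot be realized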
VC Dimension
• The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered, then VC(H) = ∞.
• For single intervals on the real line, all sets of 2 instances can be shattered, but no set of 3 instances can, so VC(H) = 2.
VC Dimension
• An unbiased hypothesis space shatters the entire instance
space.
• The larger the subset of X that can be shattered, the more
expressive (and less biased) the hypothesis space is.
• The VC dimension of the set of oriented lines in 2-d is
three.
VC Dimension Example
Consider axis-parallel rectangles in the real plane, i.e., conjunctions of intervals on two real-valued features. Some set of 4 instances can be shattered (and no set of 5 instances can be).
• Therefore VC(H) = 4.
• Generalizes to axis-parallel hyper-rectangles (conjunctions of
intervals in n dimensions): VC(H)=2n.
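The claim that some 4 points are shattered by axis-parallel rectangles can also be checked mechanically: a labeling is realizable iff the bounding box of the positive points contains no negative point. A sketch using a "diamond" configuration (the points and helper names are my own illustration):

from itertools import product

def realizable_by_rectangle(points, labels):
    pos = [p for p, lab in zip(points, labels) if lab]
    if not pos:
        return True                              # all-negative: take an empty rectangle
    # Smallest axis-parallel rectangle covering the positive points.
    xlo, xhi = min(x for x, _ in pos), max(x for x, _ in pos)
    ylo, yhi = min(y for _, y in pos), max(y for _, y in pos)
    # Realizable iff no negative point falls inside that rectangle.
    return not any(xlo <= x <= xhi and ylo <= y <= yhi
                   for (x, y), lab in zip(points, labels) if not lab)

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]     # 4 points arranged as a diamond
print(all(realizable_by_rectangle(diamond, list(lab))
          for lab in product([False, True], repeat=4)))   # True -> these 4 points are shattered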
Upper Bound on Sample Complexity with VC
• Using VC(H) as the measure of hypothesis space complexity (Blumer et al., 1989), the following number of examples is sufficient for PAC learning:
m ≥ (1/ε) ( 4 log₂(2/δ) + 8·VC(H)·log₂(13/ε) )
Sample Complexity Lower Bound with VC
• There is also a general lower bound on the minimum number of
examples necessary for PAC learning (Ehrenfeucht, et al., 1989):
Consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8, 0 < δ < 1/100.
Then there exists a distribution 𝒟 and target concept in C such that if L observes fewer than
max[ (1/ε) log₂(1/δ), (VC(C) − 1) / (32ε) ]
examples, then with probability at least δ, L outputs a hypothesis
having error greater than ε.
• Ignoring constant factors, this lower bound is the same as the upper
bound except for the extra log2(1/ ε) factor in the upper bound.
Thank You
Foundations of Machine Learning
Sudeshna Sarkar
IIT Kharagpur
What is Ensemble Classification?
• Use multiple learning algorithms (classifiers)
• Combine the decisions
• Can be more accurate than the individual classifiers
• Generate a group of base-learners
• Different learners use different
– Algorithms
– Hyperparameters
– Representations (Modalities)
– Training sets
Why should it work?
• Works well only if the individual classifiers
disagree
– Error rate < 0.5 and errors are independent
– The ensemble's error rate is highly correlated with the correlation among the errors made by the different learners
Bias vs. Variance
• We would like low bias error and low variance error
• Ensembles using multiple trained (high variance/low
bias) models can average out the variance, leaving
just the bias
– Less worry about overfitting (stopping criteria, etc.) in the base models
Combining Weak Learners
• Combining weak learners
– Assume n independent models, each having accuracy of
70%.
– If all n give the same class output, then you can be confident it is correct with probability 1 − (1 − 0.7)^n = 1 − 0.3^n.
– Normally not completely independent, but unlikely that all n
would give the same output
• Using the majority output, the ensemble's accuracy can be better than the base accuracy of the individual models.
– If n1 models say class 1 and n2 < n1 models say class 2, then
P(class 1) = 1 − Binomial(n, n2, 0.7), where the binomial probability of exactly r successes out of n is
P(r) = [ n! / (r!(n − r)!) ] p^r (1 − p)^(n − r)
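As a sanity check on the independence argument above (my own illustration, not from the slides), the probability that a majority of n independent 70%-accurate classifiers votes for the correct class can be computed from the binomial distribution:

from math import comb

def majority_vote_accuracy(n, p):
    # P(more than half of n independent classifiers are correct), n odd,
    # each classifier correct independently with probability p.
    return sum(comb(n, r) * p ** r * (1 - p) ** (n - r)
               for r in range(n // 2 + 1, n + 1))

for n in (1, 5, 11, 21):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
# The ensemble accuracy climbs from 0.7 towards 1.0 as n grows,
# but only under the (optimistic) assumption of independent errors.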
Ensemble Creation Approaches
• Get less correlated errors between models
– Injecting randomness
• initial weights (e.g., in neural networks), different learning parameters, different splits (e.g., in decision trees), etc.
– Different Training sets
• Bagging, Boosting, different features, etc.
– Forcing differences
• different objective functions
– Different machine learning models
Ensemble Combining Approaches
• Unweighted Voting (e.g. Bagging)
• Weighted voting – based on accuracy (e.g. Boosting),
Expertise, etc.
• Stacking - Learn the combination function
Combine Learners: Voting
• Unweighted voting
• Linear combination
(weighted vote)
• weight ∝ accuracy
• weight ∝ 1/variance
y = Σ_{j=1}^{L} w_j d_j,   with w_j ≥ 0 and Σ_{j=1}^{L} w_j = 1
• Bayesian combination:
P(C_i | x) = Σ_{all models M_j} P(C_i | x, M_j) P(M_j)
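A minimal sketch of the linear-combination (weighted vote) rule above (the class scores and weights below are made up for illustration): each learner j reports a support value for every class, and the ensemble output is the weighted sum with w_j ≥ 0 and Σ w_j = 1.

def weighted_vote(scores, weights):
    # scores[j][i]: support learner j gives to class i; weights must be a distribution.
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    n_classes = len(scores[0])
    combined = [sum(w * s[i] for w, s in zip(weights, scores))
                for i in range(n_classes)]
    return combined.index(max(combined)), combined

# Three learners, two classes; weights roughly proportional to validation accuracy.
scores = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
weights = [0.5, 0.2, 0.3]
print(weighted_vote(scores, weights))   # class 0 wins: 0.5*0.9 + 0.2*0.4 + 0.3*0.7 = 0.74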
Fixed Combination Rules
Bayes Optimal Classifier
• The Bayes Optimal Classifier is an ensemble of all the
hypotheses in the hypothesis space.
• On average, no other ensemble can outperform it.
• The vote for each hypothesis is
– proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true, and
– multiplied by the prior probability of that hypothesis.
y = argmax_{c_j ∈ C} Σ_{h_i ∈ H} P(c_j | h_i) P(T | h_i) P(h_i)
where
• y is the predicted class,
• C is the set of all possible classes,
• H is the hypothesis space,
• T is the training data.
The Bayes Optimal Classifier represents a hypothesis
that is not necessarily in H.
But it is the optimal hypothesis in the ensemble space.
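A toy sketch of the Bayes optimal vote over a small finite hypothesis space (every name, prior, and likelihood below is illustrative, not from the lecture): each hypothesis votes for its predicted class with weight P(T | h) · P(h).

def bayes_optimal_predict(x, hypotheses, priors, likelihoods, classes):
    # hypotheses: callables h(x) -> class; priors[k] = P(h_k); likelihoods[k] = P(T | h_k)
    votes = {c: 0.0 for c in classes}
    for h, p_h, p_T_given_h in zip(hypotheses, priors, likelihoods):
        votes[h(x)] += p_T_given_h * p_h      # posterior weight, up to normalization
    return max(votes, key=votes.get)

h1 = lambda x: +1 if x > 0 else -1            # threshold at 0
h2 = lambda x: +1                             # always predicts +1
h3 = lambda x: -1                             # always predicts -1
print(bayes_optimal_predict(-2.0, [h1, h2, h3],
                            priors=[1/3, 1/3, 1/3],
                            likelihoods=[0.6, 0.3, 0.1],
                            classes=[-1, +1]))   # -> -1 (h1 and h3 outvote h2)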
Practicality of Bayes Optimal Classifier
• Cannot be practically implemented.
• Most hypothesis spaces are too large
• Many hypotheses output only a class or a value, not a probability.
• Estimating the prior probability of each hypothesis is not always possible.
BMA
• If the d_j are independent:
Var(y) = Var( (1/L) Σ_j d_j ) = (1/L²) Σ_j Var(d_j) = (1/L²) · L · Var(d_j) = (1/L) Var(d_j)
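A quick simulation of the Var(y) = Var(d)/L identity (the noise model and parameters are my own choice):

import random

random.seed(0)
L, trials = 10, 20000
# Each base learner's output d_j is simulated as zero-mean noise with variance 1.
single = [random.gauss(0, 1) for _ in range(trials)]
averaged = [sum(random.gauss(0, 1) for _ in range(L)) / L for _ in range(trials)]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(round(var(single), 3))    # close to 1
print(round(var(averaged), 3))  # close to 1/L = 0.1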
Sudeshna Sarkar
IIT Kharagpur
Bagging
• Bagging = “bootstrap aggregation”
– Draw N items from X with replacement
• Works best with base learners that have high variance (unstable)
– Decision trees and ANNs are unstable
– K-NN is stable
• Use bootstrapping to generate L training sets and
train one base-learner with each (Breiman, 1996)
• Use voting
Bagging
• Sampling with replacement
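A minimal bagging sketch (assuming scikit-learn and NumPy are available; the dataset and all settings are placeholders): draw L bootstrap samples, fit one high-variance base learner per sample, and predict by unweighted majority vote.

import numpy as np
from sklearn.tree import DecisionTreeClassifier   # any unstable base learner works

def bagging_fit(X, y, L=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(L):
        idx = rng.integers(0, n, size=n)          # bootstrap: n draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])   # shape (L, n_samples)
    # Unweighted majority vote; assumes integer class labels 0..K-1.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

# Tiny synthetic usage: two well-separated blobs with labels 0/1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print((bagging_predict(bagging_fit(X, y), X) == y).mean())   # close to 1.0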
AdaBoost
Given: (x_1, y_1), …, (x_m, y_m) where x_i ∈ X, y_i ∈ Y = {−1, +1}.
Initialize D_1(i) = 1/m.
For t = 1, …, T:
– Train the weak learner using distribution D_t.
– Get a weak classifier h_t : X → ℝ.
– Choose α_t ∈ ℝ to minimize the training error:
α_t = (1/2) ln( (1 − ε_t) / ε_t )
– Update:
D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
where Z_t = Σ_{i=1}^{m} D_t(i) exp(−α_t y_i h_t(x_i)) is a normalization factor, so that D_{t+1} is a distribution.
Output the final classifier:
H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )
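A compact implementation of the update rules above, using depth-1 trees (decision stumps) as the weak learners; the use of scikit-learn and all parameter choices are assumptions of this sketch, not part of the lecture.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    # y must take values in {-1, +1}; D is the weight distribution over the m examples.
    m = len(X)
    D = np.full(m, 1.0 / m)
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()                            # weighted training error eps_t
        if eps >= 0.5:                                      # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # alpha_t = (1/2) ln((1-eps_t)/eps_t)
        D = D * np.exp(-alpha * y * pred)                   # up-weight the examples h_t got wrong
        D /= D.sum()                                        # divide by Z_t to keep D a distribution
        stumps.append(h)
        alphas.append(alpha)
        if eps == 0:                                        # perfect stump: nothing left to re-weight
            break
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    agg = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(agg)                                     # H(x) = sign(sum_t alpha_t h_t(x))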
Strong weak classifiers
• If each classifier is (at least slightly) better than random, i.e., ε_t < 0.5, then the training error of the final classifier H drops exponentially fast:
(1/m) Σ_{i=1}^{m} δ(H(x_i) ≠ y_i) ≤ Π_t Z_t ≤ exp( −2 Σ_{t=1}^{T} (1/2 − ε_t)² )
Illustrating AdaBoost
[Figure: a one-dimensional toy dataset of + and − points, initially with equal weights on all data points. Three boosting rounds are shown: round 1 fits stump B1 (α = 1.9459), round 2 re-weights the misclassified points and fits B2 (α = 2.9323), round 3 fits B3 (α = 3.8744); the overall classifier is the α-weighted vote of B1, B2, and B3.]
Thank You