ML Chapter 7 (CLT) Notes

UNIT-3

● Two specific frameworks for analyzing learning algorithms are considered:
○ the probably approximately correct (PAC) framework
○ the mistake bound framework
● Probably approximately correct (PAC) framework
○ Identify classes of hypotheses that can and cannot be learned from a polynomial number of training examples.
○ Define a natural measure of complexity for hypothesis spaces that allows bounding the number of training examples required for inductive learning.
● Mistake bound framework
○ Examine the number of training errors that will be made by a learner before it determines the correct hypothesis.
Topic-1

INTRODUCTION
● Questions in studying machine learning
○ Is it possible to identify classes of learning problems that are inherently difficult or easy, independent of the learning algorithm?
○ Can one characterize the number of training examples necessary or sufficient to assure successful learning? How is this number affected if the learner is allowed to pose queries to the trainer?
○ Can one characterize the number of mistakes that a learner will make before learning the target function?
○ Can one characterize the inherent computational complexity of classes of learning problems?
● Answers to all these questions are not yet fully known, but partial answers from a computational theory of learning have begun to emerge.
● We focus here on the problem of inductively learning an unknown target function, given only training examples of this target function and a space of candidate hypotheses.
● Questions
○ How many training examples are sufficient to successfully learn the target function?
○ How many mistakes will the learner make before succeeding?
● Our goal is to set quantitative bounds on these measures, depending on attributes of the learning problem such as:
○ the size or complexity of the hypothesis space considered by the learner
○ the accuracy to which the target concept must be approximated
○ the probability that the learner will output a successful hypothesis
○ the manner in which training examples are presented to the learner
Topic-2

PROBABLY LEARNING AN APPROXIMATELY CORRECT HYPOTHESIS
● We consider a particular setting for the learning problem, called the probably approximately correct (PAC) learning model.
● Question
○ How many training examples and how much computation are required to learn various classes of target functions within this PAC model?
● For simplicity, we consider the case of learning boolean-valued concepts from noise-free training data.
● The results can be extended to real-valued target functions with noisy data.
(i). The Problem Setting
● Let X be the set of all possible instances over which target functions may be defined.
○ Example: X represents the set of all people, each described by the attributes age (e.g., young or old) and height (short or tall).
● Let C be the set of target concepts that our learner might be called upon to learn.
○ Each target concept c in C corresponds to some subset of X, or equivalently to some boolean-valued function c : X → {0, 1}.
○ Example: consider one target concept c in C as "people who are skiers."
■ If x is a positive example, c(x) = 1
■ If x is a negative example, c(x) = 0
● Assume that instances are generated at random from X according to some probability distribution D.
○ Example: D might be the distribution of instances generated by observing people who walk out of the largest sports store in Switzerland.
○ D is assumed to be stationary; i.e., the distribution does not change over time.
● The learner L considers
○ H - the set of possible hypotheses L may consider when attempting to learn the target concept.
○ Example: H might be the set of all hypotheses describable by conjunctions of the attributes age and height.
○ After observing a sequence of training examples, L must output some hypothesis h from H, which is its estimate of c.
(ii). Error of a Hypothesis
● We are interested in how closely the hypothesis h output by L approximates the target concept c.
● Define the true error of h with respect to c and distribution D as
  errorD(h) ≡ Pr[c(x) ≠ h(x)], where the probability is taken over instances x drawn at random according to D.
● The true error of h is just the error rate we expect when applying h to future instances drawn according to the probability distribution D.
Figure: the definition of error in graphical form.
● The error of h with respect to c is the probability that a randomly drawn instance will fall into the region where h and c disagree (their set difference) on its classification.
● The + and - points indicate positive and negative training examples.
Note:
● Error depends strongly on the unknown probability distribution D.
● Error of h with respect to c is not directly observable to the learner.
● L can only observe the performance of h over the training examples, and it
must choose its output h on this basis only.
● We use training error to refer to the fraction of training examples
misclassified by h.
● Question: "How likely is it that the observed training error for h gives a misleading estimate of the true error errorD(h)?"
● We define the sample error of h with respect to a set S of examples to be the fraction of S misclassified by h.
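● To make the distinction between sample error and true error concrete, the following sketch (a hypothetical setup, not part of the original notes) estimates both for an assumed distribution D, target concept c, and hypothesis h:

import random

random.seed(0)

# Hypothetical instance space: people described by (age, height),
# age in {"young", "old"}, height in {"short", "tall"}.
def draw_instance():
    """Sample one instance from an assumed distribution D."""
    age = random.choices(["young", "old"], weights=[0.7, 0.3])[0]
    height = random.choices(["short", "tall"], weights=[0.4, 0.6])[0]
    return (age, height)

# Hypothetical target concept c and hypothesis h (boolean-valued).
def c(x):  # assumed concept "skiers": young and tall people ski
    return x == ("young", "tall")

def h(x):  # learner's current guess: every young person is a skier
    return x[0] == "young"

# Sample error: fraction of a finite sample S misclassified by h.
S = [draw_instance() for _ in range(20)]
sample_error = sum(h(x) != c(x) for x in S) / len(S)

# True error errorD(h): approximated by Monte Carlo over many draws from D.
N = 100_000
true_error = sum(h(x) != c(x) for x in (draw_instance() for _ in range(N))) / N

print(f"sample error on 20 examples: {sample_error:.2f}")
print(f"estimated true error:        {true_error:.3f}")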
(iii). PAC Learnability
● Aim: to characterize classes of target concepts C that can be reliably learned from a reasonable number of randomly drawn training examples and a reasonable amount of computation.
● A first attempt: find the number of training examples needed to learn a hypothesis h with errorD(h) = 0.
● This requirement turns out to be futile, for two reasons:
1. Unless we provide training examples corresponding to every possible instance in X (an unrealistic assumption), there may be multiple hypotheses consistent with the training examples, and L cannot be certain to pick the one corresponding to the target concept.
2. Given that the training examples are drawn randomly, there will always be some nonzero probability that the training examples encountered by L will be misleading.
● To accommodate these two difficulties, we weaken the demands on the learner:
1. We will not require that the learner output a zero-error hypothesis; we require only that its error be bounded by some constant ε that can be made arbitrarily small.
2. We will not require that the learner succeed for every sequence of randomly drawn training examples; we require only that its probability of failure be bounded by some constant δ that can be made arbitrarily small.
● i.e., We require only that the learner probably learn a hypothesis that is approximately correct - hence the term probably approximately correct learning, or PAC learning for short.
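● Definition (PAC-learnable, as usually stated): Consider a concept class C defined over a set of instances X of length n, and a learner L using hypothesis space H. C is PAC-learnable by L using H if, for all c ∈ C, all distributions D over X, and all ε, δ with 0 < ε < 1/2 and 0 < δ < 1/2, learner L will, with probability at least (1 − δ), output a hypothesis h ∈ H with errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).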
● Although the definition appears to be concerned only with the computational resources required for learning, it also implies a bound on the number of training examples: if L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from only a polynomial number of training examples.
Limitation
● The PAC definition implicitly assumes that the learner's hypothesis space H contains a hypothesis with arbitrarily small error ε for every target concept in C.
● This is difficult to assure if C is not known in advance.
Topic-3

SAMPLE COMPLEXITY FOR FINITE HYPOTHESIS SPACES
● The growth in the number of required training examples with problem size is called the sample complexity of the learning problem.
● It is usually the quantity of greatest interest, because in most practical settings the limited availability of training data is what limits the success of the learner.
● We bound the sample complexity for a very broad class of learners, called consistent learners.
● A learner is consistent if it outputs hypotheses that perfectly fit the training data, i.e., h(x) = c(x) for every training example, whenever possible.
● Question: Can we derive a bound on the number of training examples required by any consistent learner, independent of the specific algorithm it uses to derive a consistent hypothesis? Yes.
● For this we use the version space VSH,D, defined to be the set of all hypotheses h ∈ H that correctly classify the training examples D:
  VSH,D ≡ {h ∈ H | (∀⟨x, c(x)⟩ ∈ D) h(x) = c(x)}
● The significance of VSH,D is that every consistent learner outputs a hypothesis belonging to VSH,D, regardless of the instance space X, hypothesis space H, or training data D.
● VSH,D is the subset of hypotheses h ∈ H that have zero training error (denoted by r = 0 in the figure).
● Their true error errorD(h) (denoted by error in the figure) may nonetheless be nonzero.
● VSH,D is said to be ε-exhausted when every hypothesis h remaining in VSH,D has errorD(h) < ε.
● Note: only an observer who knows the target concept c can determine with certainty whether VSH,D is ε-exhausted; the learner cannot.
● The following theorem bounds the probability that VSH,D will not be ε-exhausted after a given number of training examples, even without knowing c.
● Theorem (ε-exhausting the version space): If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that VSH,D is not ε-exhausted (with respect to c) is at most |H| e^(-εm).

Proof:
● Let h1, h2, . . ., hk be all the hypotheses in H that have true error greater than ε with respect to c. We fail to ε-exhaust the version space if and only if at least one of these hypotheses happens to be consistent with all m training examples.
● The probability that any single such hypothesis is consistent with one randomly drawn example is at most (1 - ε); since the m examples are drawn independently, the probability that it is consistent with all m examples is at most (1 - ε)^m.
● Since there are k such hypotheses, the probability that at least one of them is consistent with all m training examples is at most k(1 - ε)^m.
● Since k ≤ |H|, this is at most |H|(1 - ε)^m.
● Using the general inequality that if 0 ≤ ε ≤ 1 then (1 - ε) ≤ e^(-ε), this is at most |H| e^(-εm), which proves the theorem.
● This bounds the probability that m training examples will fail to eliminate all "bad" hypotheses (i.e., hypotheses with true error greater than ε), for any consistent learner using hypothesis space H.
● We can now determine the number of training examples required to reduce this probability of failure below some desired level δ:
  |H| e^(-εm) ≤ δ
● Solving for m gives
  m ≥ (1/ε)(ln|H| + ln(1/δ))
● This inequality gives the number of training examples sufficient for any consistent learner to successfully learn any target concept in H, for any desired values of δ and ε.
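● The bound is easy to evaluate numerically. The sketch below (illustrative code, not part of the original notes) computes m ≥ (1/ε)(ln|H| + ln(1/δ)) for a finite hypothesis space:

import math

def consistent_learner_sample_complexity(h_size, epsilon, delta):
    """Number of examples sufficient for any consistent learner,
    given |H| = h_size, error bound epsilon, and failure probability delta:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 973 hypotheses (e.g., the EnjoySport hypothesis space from
# earlier chapters), epsilon = 0.1, delta = 0.05 -> roughly 99 examples suffice.
print(consistent_learner_sample_complexity(973, 0.1, 0.05))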
(i). Agnostic Learning and Inconsistent Hypotheses
● An agnostic learner is a learner that makes no assumption that the target concept is representable by H and that simply finds the hypothesis with minimum training error; it makes no prior commitment about whether or not C ⊆ H.
● Consider the following setting:
○ Let D denote the set of training examples available to the learner.
○ Let 𝒟 denote the probability distribution over the entire set of instances.
○ Let errorD(h) denote the training error of hypothesis h, i.e., the fraction of the training examples in D that are misclassified by h.
● errorD(h) - error over the particular sample of training data D (training error)
● error𝒟(h) - error over the entire probability distribution 𝒟 (true error)
● Let hbest denote the hypothesis from H having lowest training error over the training examples.
● Question: How many training examples suffice to ensure (with high probability) that error𝒟(hbest) will be no more than errorD(hbest) + ε?
● We use the general Hoeffding bounds (additive Chernoff bounds), which characterize the deviation between the true probability of some event and its observed frequency over m independent trials.
● In particular, the Hoeffding bound states that if errorD(h) is measured over a set D containing m randomly drawn examples, then
  Pr[error𝒟(h) > errorD(h) + ε] ≤ e^(-2mε²)
● To assure that the best hypothesis found by L is bounded in this way, we must consider the probability that any one of the |H| hypotheses could have a large error:
  Pr[(∃h ∈ H) error𝒟(h) > errorD(h) + ε] ≤ |H| e^(-2mε²)
● If we call this probability δ and ask how many examples m suffice to hold δ to some desired value, we obtain
  m ≥ (1/(2ε²))(ln|H| + ln(1/δ))
● This generalization of the earlier bound applies when L still picks the best hypothesis h ∈ H, but where the best hypothesis may have nonzero training error.
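● For comparison, the corresponding calculation for the agnostic bound (same illustrative setup as the previous sketch):

import math

def agnostic_sample_complexity(h_size, epsilon, delta):
    """m >= (1/(2*epsilon^2)) * (ln|H| + ln(1/delta)): examples sufficient so
    that, with probability at least 1 - delta, the true error of every h in H
    is within epsilon of its training error."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2 * epsilon ** 2))

# Same |H|, epsilon, delta as before: the 1/epsilon^2 dependence makes the
# agnostic bound considerably larger (about 494 examples instead of about 99).
print(agnostic_sample_complexity(973, 0.1, 0.05))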
(ii). Conjunctions of Boolean Literals Are PAC-Learnable
● Consider the class C of target concepts described by conjunctions of boolean literals. A literal is any boolean variable (e.g., Old) or its negation (e.g., ¬Old); conjunctions of boolean literals therefore include target concepts such as "Old ∧ ¬Tall".
● Question: Is C PAC-learnable? Yes.
● i.e., Any consistent learner will require only a polynomial number of training examples to learn any c in C.
● Consider the hypothesis space H defined by conjunctions of literals based on n boolean variables.
● There are only three possibilities for each variable in any given hypothesis: include the variable as a literal, include its negation as a literal, or ignore it.
● Given n such variables, there are 3^n distinct hypotheses, so |H| = 3^n.
● Substituting |H| = 3^n into m ≥ (1/ε)(ln|H| + ln(1/δ)) gives the sample complexity for learning conjunctions of up to n boolean literals:
  m ≥ (1/ε)(n ln 3 + ln(1/δ))
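● Worked example (numbers chosen purely for illustration): for n = 10 boolean variables, ε = 0.1 and δ = 0.05,
  m ≥ (1/0.1)(10 ln 3 + ln 20) ≈ 10 × (10.99 + 3.00) ≈ 140,
  so roughly 140 randomly drawn examples suffice for any consistent learner over this hypothesis space.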
Proof
● The sample complexity for this concept class is polynomial in n, 1/ε, and 1/δ, and independent of size(c).
● To incrementally process each training example, the FIND-S algorithm requires effort linear in n and independent of 1/ε, 1/δ, and size(c).
● Therefore, this concept class is PAC-learnable by the FIND-S algorithm.
(iii). PAC-Learnability of Other Concept Classes
● The same sample-complexity bound can be used to analyze many other concept classes; two cases are considered below.
(A) UNBIASED LEARNERS
(B) K-TERM DNF AND K-CNF CONCEPTS
(A) UNBIASED LEARNERS
● Consider the unbiased concept class C that contains every teachable concept relative to X.
● C is the power set of X (the set of all subsets of X), so |C| = 2^|X|.
● X is defined by n boolean features, so |X| = 2^n and therefore |C| = 2^(2^n).
● To learn such an unbiased concept class, the learner must itself use an unbiased hypothesis space H = C.
● Substituting |H| = 2^(2^n) into m ≥ (1/ε)(ln|H| + ln(1/δ)) gives the sample complexity for learning the unbiased concept class relative to X:
  m ≥ (1/ε)(2^n ln 2 + ln(1/δ))
● Thus, the unbiased concept class has exponential sample complexity under the PAC model.
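● Worked example (illustrative numbers): with n = 10 features, ε = 0.1 and δ = 0.05, the bound is m ≥ 10 × (1024 × ln 2 + ln 20) ≈ 7,128 examples, and each additional feature roughly doubles this number - with n = 20 it already exceeds 7 million.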
(B) K-TERM DNF AND K-CNF CONCEPTS
● Consider the concept class C of k-term disjunctive normal form (k-term DNF) expressions.
● k-term DNF expressions are of the form T1 ∨ T2 ∨ . . . ∨ Tk, where each term Ti is a conjunction of n boolean attributes and their negations.
● Assume H = C. Then |H| is at most 3^(nk) (there are k terms, each of which may take on 3^n possible values).
● Substituting |H| = 3^(nk) into m ≥ (1/ε)(ln|H| + ln(1/δ)) gives
  m ≥ (1/ε)(nk ln 3 + ln(1/δ))
  which is polynomial in 1/ε, 1/δ, n, and k.
● Although the sample complexity is polynomial, finding a consistent hypothesis in H = C is not believed to be computationally tractable, so k-term DNF is not known to be PAC-learnable using H = C.
● However, any k-term DNF expression can easily be rewritten as a k-CNF expression (a conjunction of clauses, each a disjunction of up to k literals), so k-CNF is at least as expressive as k-term DNF.
● k-CNF has both polynomial sample complexity and polynomial time complexity.
● Hence, the concept class k-term DNF is PAC-learnable by an efficient algorithm using H = k-CNF.
Topic-4

SAMPLE COMPLEXITY FOR INFINITE HYPOTHESIS SPACES
● There are two drawbacks to characterizing sample complexity in terms of |H|:
○ It can lead to quite weak bounds (the bound on δ can be greater than 1 for large |H|).
○ In the case of infinite hypothesis spaces, the |H|-based bound cannot be applied at all.
● To measure the complexity of H, we instead use the Vapnik-Chervonenkis dimension of H (VC dimension, or VC(H), for short).
● Sample complexity bounds based on VC(H) apply to many infinite hypothesis spaces and are often tighter than those based on |H|.
(i). Shattering a Set of Instances
● The VC dimension measures the complexity of H, not by the number of distinct hypotheses |H|, but by the number of distinct instances from X that can be completely discriminated using H.
● We define the notion of shattering a set of instances as follows:
○ Consider some subset of instances S ⊆ X.
○ Each h ∈ H imposes some dichotomy on S; i.e., h partitions S into the two subsets {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}.
○ Given an instance set S, there are 2^|S| possible dichotomies.
○ We say that H shatters S if every possible dichotomy of S can be represented by some hypothesis from H.
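● The sketch below (illustrative, not from the original notes) checks shattering by brute force for a small hypothesis space of interval concepts on the real line, h[a,b](x) = 1 iff a ≤ x ≤ b: such a class shatters any two points, but it cannot shatter three, because the dichotomy that labels the two outer points positive and the middle point negative is not representable.

from itertools import product

def shatters(hypotheses, points):
    """Return True if the hypothesis set can realize every dichotomy of points."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Hypothetical hypothesis space: closed intervals [a, b] over a small grid;
# h(x) = 1 iff a <= x <= b (plus an always-negative hypothesis).
grid = [0, 1, 2, 3, 4, 5]
intervals = [(a, b) for a, b in product(grid, grid) if a <= b]
H = [lambda x, a=a, b=b: int(a <= x <= b) for a, b in intervals]
H.append(lambda x: 0)

print(shatters(H, [1.5, 3.5]))       # True: all 4 dichotomies of 2 points are realizable
print(shatters(H, [1.5, 2.5, 3.5]))  # False: cannot label the outer points 1 and the middle 0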
● Figure: a set S of three instances that is shattered by H.
● Each of the 2^3 = 8 dichotomies of these three instances is covered by some hypothesis: for every possible dichotomy of the instances, there exists a corresponding hypothesis.
(ii). The Vapnik-Chervonenkis Dimension
● The Vapnik-Chervonenkis dimension VC(H) of hypothesis space H, defined over instance space X, is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞.
● For any finite H, VC(H) ≤ log2|H|.
○ To see this, suppose VC(H) = d. Then H requires 2^d distinct hypotheses to shatter d instances; hence 2^d ≤ |H|, and so d = VC(H) ≤ log2|H|.
● The definition of the VC dimension implies that if we find any set of instances of size d that can be shattered, then VC(H) ≥ d.
● To show that VC(H) < d, we must show that no set of size d can be shattered.
Example-1:
● Consider the set X of instances corresponding to points on the x-y plane.
● Let H be the set of all linear decision surfaces in the plane, i.e., the hypotheses representable by a single perceptron unit with two inputs.
● Question: What is the VC dimension of this H?
○ Any two distinct points in the plane can be shattered by H, because we can find 2^2 = 4 linear surfaces that include neither, either one, or both of the points.
● Question: What about sets of three points?
○ If the points are not colinear, all 2^3 = 8 dichotomies can be realized by linear surfaces, so such a set is shattered.
○ Three colinear points, however, cannot be shattered.
● No set of four points can be shattered.
● So is VC(H) two or three? It is at least three (some set of three points is shattered), and since no set of four can be shattered, VC(H) = 3.
● More generally, the VC dimension of linear decision surfaces in an r-dimensional space is r + 1.
Example-2:
● Suppose each instance in X is described by three boolean literals l1, l2, l3, and each hypothesis h in H is a conjunction of up to three boolean literals.
● Question: What is VC(H)?
● Consider the set of three instances
  instance1: 100, instance2: 010, instance3: 001.
● This set of three instances can be shattered by H, because a hypothesis can be constructed for any desired dichotomy: to exclude instancei, add the literal ¬li to the hypothesis. For example, to include instance2 but exclude instance1 and instance3, use the hypothesis ¬l1 ∧ ¬l3.
● This argument easily extends from three features to n.
● Thus, the VC dimension for conjunctions of n boolean literals is at least n (in fact it is exactly n, although showing this is more difficult).
(iii). Sample Complexity and the VC Dimension
● Question: "How many randomly drawn training examples are sufficient to probably approximately learn any target concept in C?"
● Using VC(H) as the measure of the complexity of H (in place of ln|H|), the new bound is
  m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))
● This equation provides an upper bound on the number of training examples sufficient to probably approximately learn any target concept in C, for any desired δ and ε.
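● To compare the two kinds of upper bound numerically, the sketch below (illustrative only) evaluates both for conjunctions of n boolean literals, where |H| = 3^n and VC(H) = n:

import math

def bound_from_H(h_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

def bound_from_VC(vc, eps, delta):
    """m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps)) / eps)

n, eps, delta = 10, 0.1, 0.05
print(bound_from_H(3 ** n, eps, delta))  # |H| = 3^n  ->  about 140 examples
print(bound_from_VC(n, eps, delta))      # VC(H) = n  ->  a much larger value here
# For this small finite H the |H|-based bound is tighter; the VC-based bound
# matters when H is infinite or ln|H| is huge, where the first bound is unusable.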
● There is also a theorem providing a lower bound on the number of training examples necessary for successful learning: roughly, any learner must observe on the order of max[(1/ε) log(1/δ), VC(C)/ε] examples, or it will fail (with probability at least δ) on some distribution and some target concept in C.
● This lower bound is determined by the complexity of the concept class C, whereas the earlier upper bounds were determined by H.
(iv). VC Dimension for Neural Networks (optional)
● The VC dimension can also be calculated for a network of interconnected units, such as the feedforward networks trained by the BACKPROPAGATION procedure; this yields sample complexity bounds for neural network learning (details omitted here).
Topic-5

Mistake Bound Model of Learning
● Here we consider the mistake bound model of learning, in which the learner is evaluated by the total number of mistakes it makes before it converges to the correct hypothesis.
● Training examples x are presented to the learner one at a time. For each, the learner must predict c(x) before it is shown the correct target value by the trainer.
● Question: "How many mistakes will the learner make in its predictions before it learns the target concept?"
● Example: learning to predict which credit card purchases should be approved and which are fraudulent.
● Learning the target concept exactly means converging to a hypothesis h such that (∀x) h(x) = c(x).
(i). Mistake Bound for the FIND-S Algorithm
● Consider again H consisting of conjunctions of up to n boolean literals l1 . . . ln and their negations.
● Apply the FIND-S algorithm, which incrementally computes the maximally specific hypothesis consistent with the training examples:
○ Initialize h to the most specific hypothesis l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 . . . ln ∧ ¬ln.
○ For each positive training instance x, remove from h any literal that is not satisfied by x.
○ Output hypothesis h.
● FIND-S converges in the limit to a hypothesis that makes no errors, provided C ⊆ H and the training data is noise-free.
● Q: Can we prove a bound on the total number of mistakes that FIND-S will make before exactly learning the target concept c? Yes.
○ If c ∈ H, then FIND-S can never mistakenly classify a negative example as positive.
○ Reason: the current hypothesis h is always at least as specific as the target concept c.
○ Therefore, to calculate the number of mistakes, we need only count the mistakes made by misclassifying truly positive examples as negative.
● Q: How many such mistakes can occur before FIND-S learns c exactly?
○ The first mistaken positive example removes n of the initial 2n literals, and each subsequent mistake removes at least one more, so the total number of mistakes can be at most n + 1.
● This number of mistakes is required in the worst case, corresponding to learning the most general possible target concept (∀x) c(x) = 1 with a worst-case sequence of instances in which only one literal is removed per mistake.
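● The following runnable sketch (a hypothetical encoding chosen for illustration) plays out this worst case: instances are boolean tuples, the hypothesis is kept as a set of required literals, and the online mistakes of FIND-S are counted.

def find_s_mistakes(examples, n):
    """Run FIND-S online over (x, c_x) pairs, x a tuple of n booleans,
    and return the number of prediction mistakes made along the way."""
    # Most specific hypothesis: every literal and its negation, encoded as
    # (index, required_value) pairs; an instance must satisfy all of them.
    h = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, c_x in examples:
        prediction = all(x[i] == v for i, v in h)
        if prediction != c_x:
            mistakes += 1
        if c_x:  # generalize h on every positive example
            h = {(i, v) for i, v in h if x[i] == v}
    return mistakes

# Worst case for n = 3: the target concept labels everything positive,
# and each positive example removes as few literals as possible.
examples = [
    ((True, True, True), True),     # first mistake: removes 3 of the 6 literals
    ((False, True, True), True),    # each later mistake removes one more literal
    ((True, False, True), True),
    ((True, True, False), True),    # after this, h is the empty conjunction
    ((False, False, False), True),  # no further mistakes
]
print(find_s_mistakes(examples, 3))  # prints 4, i.e. n + 1 mistakes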
(ii). Mistake Bound for the HALVING Algorithm
● Consider an algorithm that learns by maintaining a description of the version space, incrementally refining the version space as each new training example is encountered.
● Examples: the CANDIDATE-ELIMINATION and LIST-THEN-ELIMINATE algorithms.
● We derive a worst-case bound on the number of mistakes that will be made by such a learner, for any finite H.
● The learner makes its prediction for a new instance x by taking a majority vote among the hypotheses in the current version space.
○ If the majority of version space hypotheses classify the new instance as positive, then this prediction is output by the learner.
○ Otherwise a negative prediction is output.
● This combination of learning the version space, together with using a majority vote to make subsequent predictions, is often called the HALVING algorithm.
● Q: What is the maximum number of mistakes that can be made by the HALVING algorithm, for an arbitrary finite H, before it exactly learns the target concept?
○ Learning the target concept "exactly" corresponds to reaching a state where the version space contains only a single hypothesis.
○ The HALVING algorithm can make a mistake only when the majority of hypotheses in its current version space incorrectly classify the new example.
○ In that case, once the correct classification is revealed to the learner, the version space is reduced to at most half its size (only the hypotheses that voted with the minority are retained).
● The maximum number of mistakes possible before the version space contains just one member is therefore ⌊log2|H|⌋; this is a worst-case bound, and the algorithm may in fact make no mistakes at all.
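● A compact sketch of the HALVING algorithm over an explicitly enumerated finite H (illustrative only; practical learners represent the version space implicitly):

def halving(hypotheses, examples):
    """hypotheses: list of functions x -> 0/1 containing the target concept.
    examples: iterable of (x, c_x) pairs. Returns the number of mistakes made."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, c_x in examples:
        votes = sum(h(x) for h in version_space)
        prediction = 1 if votes * 2 > len(version_space) else 0  # majority vote
        if prediction != c_x:
            mistakes += 1
        # Once the true label is revealed, keep only consistent hypotheses.
        version_space = [h for h in version_space if h(x) == c_x]
    return mistakes

# Hypothetical finite H: threshold concepts h_t(x) = 1 iff x >= t, for t = 0..7,
# with target t = 5. The mistake bound is floor(log2 |H|) = 3.
H = [lambda x, t=t: int(x >= t) for t in range(8)]
data = [(x, int(x >= 5)) for x in [3, 6, 4, 5, 7, 0]]
print(halving(H, data))  # prints 1 on this sequence, within the bound of 3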
● One extension to the HALVING algorithm is to allow the hypotheses to vote with different weights.
○ E.g., the Bayes optimal classifier takes such a weighted vote among hypotheses.
○ There, the weight assigned to each h is the estimated posterior probability that h describes the target concept c, given the training data D.
● Below we describe a different algorithm based on weighted voting, called the WEIGHTED-MAJORITY algorithm.
(iii). Optimal Mistake Bounds
● The above analyses give worst-case mistake bounds for two specific algorithms: FIND-S and the HALVING algorithm (which maintains the CANDIDATE-ELIMINATION version space).
● Q: What is the optimal mistake bound for an arbitrary concept class C, assuming H = C?
● By optimal mistake bound we mean the lowest worst-case mistake bound over all possible learning algorithms.
● More precisely, for any learning algorithm A and any target concept c:
○ Let MA(c) denote the maximum, over all possible sequences of training examples, of the number of mistakes made by A to exactly learn c.
○ For any nonempty concept class C, let MA(C) ≡ max(c ∈ C) MA(c). Then, for example:
○ MFind-S(C) = n + 1, where C is the concept class described by conjunctions of up to n boolean literals.
○ MHalving(C) ≤ log2(|C|) for any concept class C.
● We define the optimal mistake bound for a concept class C as
  Opt(C) ≡ min over all learning algorithms A of MA(C)
● Informally, Opt(C) is the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm (a definition due to Littlestone).
● For any C, there is an interesting relationship among the optimal mistake bound for C, the bound of the HALVING algorithm, and the VC dimension of C:
  VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|)
● Furthermore, there exist concept classes for which the four quantities above are exactly equal.
(iv). WEIGHTED-MAJORITY Algorithm
● Consider a generalization of the HALVING algorithm called the WEIGHTED-MAJORITY algorithm.
● The WEIGHTED-MAJORITY algorithm makes predictions by taking a weighted vote among a pool of prediction algorithms and learns by altering the weight associated with each prediction algorithm.
● It begins by assigning a weight of 1 to each prediction algorithm, then considers the training examples one at a time.
● Whenever a prediction algorithm misclassifies a new training example, its weight is decreased by multiplying it by some number β, where 0 ≤ β < 1.
○ If β = 0, then WEIGHTED-MAJORITY is identical to the HALVING algorithm.
○ If we choose some other value for β, no prediction algorithm is ever eliminated completely.
● If an algorithm misclassifies a training example, it simply receives a smaller vote in the future.
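● A minimal sketch of the procedure just described (illustrative; the pool of predictors here is hypothetical):

def weighted_majority(predictors, examples, beta=0.5):
    """predictors: list of functions x -> 0/1. examples: (x, label) pairs.
    Returns the final weights and the number of mistakes made by the
    weighted-majority vote; 0 <= beta < 1 controls the penalty."""
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, label in examples:
        # Weighted vote: compare total weight predicting 1 vs predicting 0.
        w1 = sum(w for w, p in zip(weights, predictors) if p(x) == 1)
        w0 = sum(w for w, p in zip(weights, predictors) if p(x) == 0)
        prediction = 1 if w1 >= w0 else 0
        if prediction != label:
            mistakes += 1
        # Demote every predictor that got this example wrong.
        weights = [w * beta if p(x) != label else w
                   for w, p in zip(weights, predictors)]
    return weights, mistakes

# Hypothetical pool: parity-of-x, always-1, always-0, and x-is-even predictors,
# with the true label given by "x is even".
pool = [lambda x: x % 2, lambda x: 1, lambda x: 0, lambda x: int(x % 2 == 0)]
data = [(x, int(x % 2 == 0)) for x in range(10)]
print(weighted_majority(pool, data))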
● Properties
○ It can accommodate inconsistent training data, because it does not eliminate a hypothesis that disagrees with an example but merely reduces its weight.
○ Its number of mistakes can be bounded in terms of the number of mistakes committed by the best prediction algorithm in the pool.
● Specifically, the number of mistakes committed by the WEIGHTED-MAJORITY algorithm can be bounded in terms of the number of mistakes made by the best prediction algorithm in the voting pool (for β = 1/2, the bound is at most 2.4(k + log2 n), where k is the number of mistakes of the best of the n predictors on the training sequence).