ML Notes
1 PAC Learning
Consider an instance space X, the set of examples, and a concept space C, the set of
target functions that could have generated the examples, such that there exists
an f ∈ C that is the hidden target function. For example, C could be all n-
conjunctions, all n-dimensional linear functions, etc.
The hypothesis space H is the set of all possible hypotheses that our learning
algorithm can choose from, where H is not necessarily equal to C. We consider
our training instances S ⊆ X × {0, 1} – including both positive and negative
examples of the target concept – such that training instances are generated by
a fixed unknown probability distribution D over X. Each training instance can
therefore be thought of as a (data, label) tuple (x, f(x)).
1.2 Intuition
Consider Figure 1, showing the space predicted by the target function f and the hy-
pothesis function h, where points inside the circle are positive, points outside
are negative, and the hypothesis is given by

h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
In this example, we have seen x1 active in all positive training instances (even though
it does not appear in f). Therefore, it is very likely that x1 will also be active in future
positive examples; if not, such examples make up only a small fraction of the distribution, so
the error should be small.

We can therefore define the error as the probability of an example receiving dif-
ferent labels from the hypothesis and the target function:

Error_D(h) = Pr_{x∼D}[f(x) ≠ h(x)]
2 Conjunctions
Claim: The elimination algorithm – which starts with the conjunction of all literals
and deletes any literal that is false in a positive example – PAC-learns the class of
conjunctions.

Proof
Consider that p(z) is the probability that a randomly chosen example is positive
and causes z to be deleted from h.¹
If z is in the target concept, then p(z) = 0; f is a conjunction and thus z can
never be false in a positive example if z ∈ f .
h will make mistakes only on positive examples. A mistake is made only when some
literal z that is in h but not in f is false in the example: in that case h predicts
negative while f labels the example positive.
¹Recall that h is a conjunction and z is a literal, such that if – during training – we see a positive example in which z is false (z = 0), we eliminate z from h.
Call a literal z bad if p(z) > ϵ/n; if every bad literal is eliminated from h, the
total error of h is at most n · (ϵ/n) = ϵ. Let z be a bad literal. We want to determine
the probability that z has not been eliminated from h after seeing a given number of
examples. The probability that a single example does not eliminate z is 1 − p(z) < 1 − ϵ/n,
so the probability that z survives m independent examples is less than (1 − ϵ/n)^m.

This is for one literal z. There are at most n bad literals, thus the probability
that some bad literal survives m examples is bounded by n(1 − ϵ/n)^m.
We want this probability to be bounded by δ, and thus we must choose m to
be sufficiently large. Consider that we want
n(1 − ϵ/n)^m < δ
Using 1 − x < e^{−x}, it is sufficient to require

n e^{−mϵ/n} < δ
Therefore, we need
m > (n/ϵ){ln(n) + ln(1/δ)}
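To make this concrete, here is a minimal Python sketch of the elimination algorithm and the sample bound derived above; the function names and the bit-vector encoding of examples are my own illustration, not from the notes.

```python
import math

def learn_conjunction(examples, n):
    """Elimination algorithm: start with all 2n literals and delete any literal
    that is false in a positive example. examples is an iterable of (x, y) pairs,
    where x is a length-n tuple of 0/1 values and y is the 0/1 label."""
    # literal (i, True) stands for x_i; (i, False) stands for its negation
    h = {(i, v) for i in range(n) for v in (True, False)}
    for x, y in examples:
        if y == 1:  # only positive examples cause eliminations
            h = {(i, v) for (i, v) in h if bool(x[i]) == v}
    return h

def predict(h, x):
    # h(x) = 1 iff every surviving literal is satisfied by x
    return int(all(bool(x[i]) == v for (i, v) in h))

def conjunction_sample_bound(n, eps, delta):
    # m > (n / eps) * (ln(n) + ln(1 / delta)), the bound derived above
    return math.ceil((n / eps) * (math.log(n) + math.log(1 / delta)))
```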
Given
• Instance space X
• Output space Y = {−1, +1}
• Distribution D, which is unknown over X × Y
• Training examples S, where each is drawn independently from D (|S| = m)
we can define the following
• True Error: Error_D(h) = Pr_{(x,y)∼D}[h(x) ≠ y]

• Empirical Error: Error_S(h) = Pr_{(x,y)∈S}[h(x) ≠ y] = (1/m) Σ_{i=1..m} 1[h(x_i) ≠ y_i]
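As a small illustration (my own toy data and hypothesis, not from the notes), the empirical error is just the misclassification rate on the sample:

```python
def empirical_error(h, S):
    # Fraction of examples in the sample S = [(x_i, y_i)] that h misclassifies.
    return sum(1 for x, y in S if h(x) != y) / len(S)

# Toy example: a hypothetical threshold hypothesis on one-dimensional data.
S = [(0.2, -1), (0.8, +1), (0.6, -1), (0.9, +1)]
h = lambda x: +1 if x > 0.5 else -1
print(empirical_error(h, S))  # 0.25: one of the four examples is misclassified
```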
Two Limitations
• Polynomial sample complexity, also called the information-theoretic
constraint, governs whether there is enough information in the sample to distin-
guish a hypothesis h that approximates f.

• Polynomial time complexity, also called computational complexity, which
governs whether there is an efficient algorithm that can process the sample and
produce a good hypothesis h.
To be PAC learnable, there must be a hypothesis h ∈ H with arbitrarily small
error for every f ∈ C. We generally assume H is a superset of C.
This is a worst-case definition: the algorithm must meet the accuracy requirement for every
distribution and every target function f ∈ C.
We want to prove the general claim that smaller hypothesis spaces are better.

Claim: The probability that there exists a hypothesis h ∈ H that is consistent
with m examples and satisfies Error(h) > ϵ is less than |H|(1 − ϵ)^m.
Proof
Let h be a bad hypothesis, i.e., Error(h) > ϵ. The probability that h is consistent with one example
is less than 1 − ϵ. Since the m examples are independently drawn, the probability
that h is consistent with all m examples is less than (1 − ϵ)^m.
The probability that any one of the hypotheses in H is consistent with m exam-
ples is then less than |H|(1 − ϵ)^m, by the union bound.
Given this fact, we now want this probability to be smaller than δ, that is,

|H|(1 − ϵ)^m < δ
4 Consistent Learners
Using the results from the previous section, we can get this general scheme for
PAC learning:
Given a sample of m examples, find some h ∈ H that is consistent with all m
examples. If m is large enough, a consistent hypothesis will be sufficiently close
to f. We can then check that m scales polynomially in the relevant parameters
(i.e., m is not too large). "Closeness" is guaranteed when
m > (1/ϵ)(ln |H| + ln(1/δ))
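As a rough sketch (the function name and rounding are my own), this bound translates directly into a sample-size calculator:

```python
import math

def consistent_learner_bound(log_H, eps, delta):
    """Smallest integer m strictly satisfying m > (1/eps) * (ln|H| + ln(1/delta)).
    log_H is the natural log of the hypothesis-space size, ln|H|."""
    return math.floor((1.0 / eps) * (log_H + math.log(1.0 / delta))) + 1
```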
4.1 Examples
Conjunctions
For conjunctions, the size of the hypothesis space is 3^n, since there are 3 possibilities
for each of the n variables (it can appear negated, appear unnegated, or not appear at all). There-
fore, the number of examples we need according to the PAC learning framework,
m, is given by
m > (1/ϵ){ln(3^n) + ln(1/δ)} = (1/ϵ){n ln 3 + ln(1/δ)}
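Plugging in illustrative numbers (my own choice of n, ϵ, and δ, not from the notes) gives a sense of scale:

```python
import math

# Conjunctions over n = 100 variables with eps = 0.1 and delta = 0.05.
n, eps, delta = 100, 0.1, 0.05
m = (1 / eps) * (n * math.log(3) + math.log(1 / delta))
print(math.ceil(m))  # 1129: on the order of a thousand examples suffice
```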
k-CNF

Next consider k-CNF formulas: conjunctions of clauses, where each clause is a disjunction of at most k literals,

f = C1 ∧ C2 ∧ ... ∧ Cm;  Ci = l1 ∨ l2 ∨ ... ∨ lk

To determine whether we can learn such a class of functions, we must know the size
of this hypothesis space. There are at most (2n)^k possible clauses of k literals
(chosen from the n variables and their negations), and a hypothesis is a conjunction
of any subset of them, so the hypothesis space has size 2^{(2n)^k}. Thus, the sample
complexity is given by

m > (1/ϵ){(2n)^k ln 2 + ln(1/δ)},

which is polynomial in n for any fixed k. Learning reduces to learning conjunctions:
introduce a new variable for each possible clause and learn a (monotone) conjunction
over these variables. For example, with n = 4, k = 2, and only unnegated literals,

y1 = x1 ∨ x2    y2 = x1 ∨ x3
y3 = x1 ∨ x4    y4 = x2 ∨ x3
y5 = x2 ∨ x4    y6 = x3 ∨ x4
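The reduction can be sketched in a few lines of Python; as a simplification of my own, the sketch enumerates only clauses of exactly k unnegated variables, mirroring the y1, ..., y6 example above (negated literals could be added the same way):

```python
from itertools import combinations

def clause_features(x, n, k):
    # One feature per clause of exactly k unnegated variables: the clause's truth value on x.
    return [int(any(x[i] for i in idx)) for idx in combinations(range(n), k)]

def learn_k_cnf(examples, n, k):
    """Elimination over the transformed features: keep a clause only if it is
    true in every positive example, i.e. learn a monotone conjunction over the y's."""
    clause_ids = list(combinations(range(n), k))
    kept = set(range(len(clause_ids)))
    for x, y in examples:
        if y == 1:
            feats = clause_features(x, n, k)
            kept = {j for j in kept if feats[j] == 1}
    return [clause_ids[j] for j in sorted(kept)]
```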
5 Agnostic Learning
To understand the intuition, consider tossing a biased coin. The more tosses,
the more likely the observed result will correspond with the expected result.
Similarly, the probability that the true error of a hypothesis in H exceeds its
training error by more than ϵ can be bounded as follows:
Pr[Error_D(h) > Error_TR(h) + ϵ] < e^{−2mϵ²}

Setting δ = |H|e^{−2mϵ²}, we get a generalization bound: a bound on how much
the true error Error_D can deviate from the observed (training) error Error_TR.
For any distribution D generating training and test instances, with probability
at least 1 − δ over the choice of a training set of size m (drawn i.i.d.), for all
h ∈ H:

Error_D(h) < Error_TR(h) + √( (log |H| + log(1/δ)) / (2m) )
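A small sketch (illustrative names and numbers, not from the notes) of evaluating this generalization bound:

```python
import math

def agnostic_bound(train_error, log_H, m, delta):
    # Error_D(h) < Error_TR(h) + sqrt((log|H| + log(1/delta)) / (2m))
    return train_error + math.sqrt((log_H + math.log(1.0 / delta)) / (2.0 * m))

# Example: training error 0.05, |H| = 3^100 (conjunctions), m = 10000, delta = 0.05.
print(agnostic_bound(0.05, 100 * math.log(3), 10_000, 0.05))  # ≈ 0.125
```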
6 VC Dimension
For both consistent and agnostic learners, we assumed finite hypothesis spaces.
We now consider an infinite hypothesis space.
As a motivating example, suppose we are trying to learn an axis-parallel rectangle in
the plane, where points inside the target rectangle are positive. We can simply choose
the maximum and minimum x and y values of the positive examples as the boundary of the
rectangle. This is generally a good algorithm because it learns efficiently, but we
cannot use the theorem from before to derive a bound, because the hypothesis space is
infinitely large.
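A minimal sketch of this tightest-fit rectangle learner (the function names and the point encoding are my own; it assumes at least one positive example):

```python
def learn_rectangle(points, labels):
    """points: list of (x, y) floats; labels: list of 0/1 with at least one positive.
    Returns the tightest axis-parallel rectangle (xmin, xmax, ymin, ymax) around the positives."""
    pos = [p for p, l in zip(points, labels) if l == 1]
    xs, ys = [p[0] for p in pos], [p[1] for p in pos]
    return min(xs), max(xs), min(ys), max(ys)

def predict_rectangle(rect, point):
    xmin, xmax, ymin, ymax = rect
    x, y = point
    return int(xmin <= x <= xmax and ymin <= y <= ymax)
```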
Therefore, we need to find out how to derive a bound given an infinitely large
hypothesis space.
6.3 Shattering
Consider the hypothesis class of left-bounded intervals on the real axis: a hypothesis
labels a point positive iff it lies in [0, a), for some a > 0.

It is trivial for this class to shatter a single point. However, for any set of two
points 0 ≤ p1 < p2 on the line, no interval [0, a) can label the left point negative and
the right point positive, so no set of two points can be shattered; the VC dimension of
this class is 1.
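A toy check of this argument (my own illustration): sweeping a over representative thresholds for two fixed points shows that the labeling (negative, positive) is never produced.

```python
# Two fixed points and candidate thresholds a below, between, and above them.
p1, p2 = 1.0, 2.0
candidates = [0.5, 1.5, 2.5]
achievable = {tuple(int(0 <= p < a) for p in (p1, p2)) for a in candidates}
print(achievable)  # {(0, 0), (1, 0), (1, 1)}: the labeling (0, 1) is never achieved
```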
6.4 Definition
VC dimension serves the same role as the size of the hypothesis space. Using
VC dimension as a measure of expressiveness, we can give an Occam algorithm
for infinite hypothesis spaces.
Given a sample of m examples, we find an h ∈ H that is consistent with all
m examples. If

m > (1/ϵ){8 VC(H) log(13/ϵ) + 4 log(2/δ)}

then with probability at least 1 − δ, h has error less than ϵ.
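A minimal sketch of this bound as a calculator (my own function name; base-2 logarithms are assumed, since the notes do not state the base):

```python
import math

def vc_sample_bound(vc_dim, eps, delta):
    # m > (1/eps) * (8 * VC(H) * log(13/eps) + 4 * log(2/delta))
    return math.ceil((1.0 / eps) * (8 * vc_dim * math.log2(13.0 / eps)
                                    + 4 * math.log2(2.0 / delta)))

# Axis-parallel rectangles in the plane have VC dimension 4.
print(vc_sample_bound(4, 0.1, 0.05))  # 2461 with these illustrative parameters
```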
Note that H does not have to be finite in order to use this bound. For finite H, the
VC dimension is related to |H|: if we want to shatter m points, H has to contain at
least 2^m hypotheses in order to realize every labeling of those m examples.
Thus |H| ≥ 2^{VC(H)}, and log(|H|) ≥ VC(H).
[Figure: (a) a set that can be shattered; (b) a set that cannot be shattered]
We’ve discussed upper bounds on the number of examples; that is, if we have
seen m examples, we can be reasonably sure our algorithm will perform well on
new examples. There is also a general lower bound on the minimum number of
examples necessary for PAC learning.
Consider any concept class C such that VC(C) ≥ 2. For any learner L and
sufficiently small ϵ, δ, there exists a distribution D and a target function in C such
that if L observes fewer than
m = max[ (1/ϵ) log(1/δ), (VC(C) − 1)/(32ϵ) ]
examples, then with probability at least δ, L outputs a hypothesis having
error(h) > ϵ.
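A small sketch of the lower bound (my own function name; base-2 logarithm assumed):

```python
import math

def pac_lower_bound(vc, eps, delta):
    # m = max[(1/eps) * log(1/delta), (VC(C) - 1) / (32 * eps)]
    return max((1 / eps) * math.log2(1 / delta), (vc - 1) / (32 * eps))

print(pac_lower_bound(4, 0.1, 0.05))  # ≈ 43 examples are necessary in the worst case
```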
This complements the upper bounds we have seen before.
Ignoring constant factors, the lower bound matches the upper bound,
except for the extra log(1/ϵ) factor in the upper bound.
7 Conclusion
The PAC framework provides a reasonable model for theoretically analyzing the
effectiveness of learning algorithms.
We discussed that the sample complexity for any consistent learner using
hypothesis space H can be determined from a measure of H's expressiveness
(|H|, VC(H)).
We discussed consistent and agnostic learners, showing that the log of the size
of a finite hypothesis space is the key quantity, and then extended this notion
to infinite hypothesis spaces.