ML Lecture Notes 2022 v0.0.1
These notes are based on the class MATH498: Modern Topics in Mathematics – Mathematical Foundations of Machine Learning at the University of Michigan, Fall 2021, taught by Maria Han Veiga.
Contents
1 Introduction
2 Probability review
2.1 Motivation
2.5.1 Definition
2.6.1 Definitions
2.7 Moments
2.9 Resources
3.2.1 Set-up
3.5 Optimisation
4.1 Learnability
5 Linear Models
5.1.3 Regularization
5.2 Classification
Chapter 1
Introduction
I hope these notes leave you well equipped to speak about ML, inspire you to start doing research in ML, or to apply for jobs that ask for ML. We will focus on theory as well as on applied machine learning.
The second iteration of these notes is a work in progress and a collaboration between Maria Han Veiga (PhD Applied Mathematics) and François Ged (PhD Probability Theory). We write these notes in a way that we believe is helpful for mathematicians to understand the fundamental principles of Machine Learning. Because these notes are aimed at senior undergraduates and early graduate students of Mathematics, some content will be presented as two types of remarks: ⋆ Remarks, which are more formal or technical asides, and Hand Wavy Remarks, which denote remarks that are more intuition / informal observations. You can imagine who is more likely to use which type of remark.
Chapter 2
Probability review
Contents
2.1 Motivation
2.2 Probability space
2.3 Independence and Conditioning
2.4 Random Variables
2.5 Discrete Random Variables
2.6 Continuous Random Variables
2.7 Moments
2.8 Samples from an unknown distribution
2.9 Resources
2.1 Motivation
One can think of spam detection: in this case, x is an email (or a representation of an email), −1 denotes spam, and +1 denotes not spam.
However, to set a milder decision rule, one might prefer to estimate the probability that the email x is spam, and only warn the user that the email is potentially spam if this probability is larger than some chosen threshold. Having a probability estimate of class membership is even more important when the number of classes is larger than two.
Example 2.2.1. Suppose we toss a fair coin twice and we observe the outcome
(the two tosses are independent). We have
• Ω = {HH, T T, HT, T H}
P(A|B) = P(A ∩ B) / P(B). (2.2)
Example 2.3.2. Given that I threw a fair die and got an even number, what is the probability it was 2? What about 3?
• A: {“Getting 2”}
• B: {“Getting even”}
P(A|B) = P(A ∩ B) / P(B) = (1/6) / (1/2) = 1/3
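As a quick sanity check (not part of the original example), one can estimate this conditional probability by simulation; the snippet below is a minimal Python sketch.

```python
# Estimate P(X = 2 | X is even) for a fair die by Monte Carlo and compare with 1/3.
import random

random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100_000)]
even = [r for r in rolls if r % 2 == 0]
estimate = sum(1 for r in even if r == 2) / len(even)
print(estimate)  # should be close to 1/3
```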
• If P(B) > 0 and {Ai}i≥1 are pairwise disjoint events, then
P(⋃_{i=1}^∞ Ai | B) = Σ_{i=1}^∞ P(Ai | B). (2.3)
for all A ∈ F.
• If (Bi )i≥1 ⊂ F form a partition of Ω (i.e. they are pairwise disjoint and
cover Ω) and P (Bi ) > 0 for all i ≥ 1, then for all A ∈ F
P(A) = Σ_{i=1}^∞ P(A|Bi) P(Bi). (2.5)
⋆ Remark 2. Note that defining a real random variable makes sense only
given a probability space (Ω, F, P ) and a σ-field G on R. If not specified, it
is common to assume that G is the Borel σ-field on R. (Recall that the Borel
σ-field is generated by all sub-intervals of the form (a, b] for all a, b ∈ R.)
• X(“HH”) = 2
• X(“HT ”) = 1
• X(“T H”) = 1
• X(“T T ”) = 0
2.5.1 Definition
A discrete random variable is one whose range X(Ω) (i.e., the set of values
it can take) is countable. The probability mass function (PMF) of a discrete
random variable is defined as pX(x) := P(X = x), for all x ∈ R, and in particular,
Σ_{x∈R} pX(x) = 1. (2.8)
(In the above sum, only countably many terms are non-null.) With a slight abuse of language, we say that a random variable X has distribution, or law, pX.
Consider two discrete random variables X and Y associated with the same
experiment. The probability law of each one of them is described by its
respective PMF pX or pY , called the marginal PMFs of the couple (X, Y ).
The marginal PMFs do not provide any information on possible relations
between these two random variables.
Example 2.5.1. Toss a fair coin and let X = 1 if the result is head, X = 0 if
it is tail. Let Y = X and Y ′ = 1 − X. Show that (X, Y ) and (X, Y ′ ) have
the same marginal PMFs but not the same joint PMFs.
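A minimal sketch of this exercise in Python (the enumeration below is ours, not from the notes); it makes explicit that the two couples share the same marginals while having different joint PMFs.

```python
# Enumerate the joint PMFs of (X, Y) and (X, Y') for a fair coin, X = 1 for heads.
from collections import Counter
from fractions import Fraction

outcomes = {0: Fraction(1, 2), 1: Fraction(1, 2)}

joint_XY, joint_XYprime = Counter(), Counter()
marg_Y, marg_Yprime = Counter(), Counter()
for x, p in outcomes.items():
    y, y_prime = x, 1 - x              # Y = X, Y' = 1 - X
    joint_XY[(x, y)] += p
    joint_XYprime[(x, y_prime)] += p
    marg_Y[y] += p
    marg_Yprime[y_prime] += p

print(dict(joint_XY))                  # {(0, 0): 1/2, (1, 1): 1/2}
print(dict(joint_XYprime))             # {(0, 1): 1/2, (1, 0): 1/2} -> different joints
print(dict(marg_Y), dict(marg_Yprime)) # identical marginals: {0: 1/2, 1: 1/2}
```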
Footnote 5: in this case λ is very large...
which is well defined whenever pY (y) > 0. Using the definition of conditional
probabilities, we obtain
pX|Y(x|y) = pX,Y(x, y) / pY(y). (2.12)
Visually, if we fix y, then the conditional PMF pX|Y (x|y) can be viewed as a
“slice” of the joint PMF pX,Y , but normalized so that it sums to one.
We just saw how the joint PMF encodes the distribution of a couple of
discrete random variables. Sometimes we will want to consider more than
two random variables at the same time. A sequence of random variables
(Xi )i≥1 is a sequence such that for all i ≥ 1, Xi is a random variable.
We say that the random variables in (Xi )i≥1 are independent if and only
if for all k ≥ 1 and all i1 , . . . , ik ∈ N pairwise distinct, it holds that
p(Xi1,...,Xik)(x1, . . . , xk) = ∏_{ℓ=1}^k pXiℓ(xℓ), ∀x1, . . . , xk ∈ R.
Most of the properties and concepts for continuous random variables will be
the same or analogous to its discrete counterpart (by swapping summation
with integration).
2.6.1 Definitions
When X takes real continuous values it is more natural to specify the prob-
ability of X being inside some interval P(a ≤ X ≤ b), a < b. By convention,
we specify P(X ≤ x) for all x ∈ R, which is known as the cumulative distri-
bution function (CDF) of X, denoted by FX (x).
fX(x) = ∂FX(x)/∂x. (2.16)
Since the CDF is monotonically increasing, fX(x) ≥ 0; and since
Using the PDF of a continuous random variable, we can compute the prob-
ability of various subsets of the real line:
P(a < X < b) = P(a ≤ X ≤ b) = ∫_a^b fX(t) dt, (2.18)
P(X ∈ B) = ∫_B fX(x) dx. (2.19)
⋆ Remark 3. From measure theory, for the last equation to make sense,
we need B to be Lebesgue measurable. Since it is the case of all Borel sets,
we can use this formula for all B that can be constructed from intervals. We
shall always work with such measurable sets throughout the class, without necessarily recalling it.
⋆ Remark 4. Any random variable can be decomposed into a continuous
part and a singular part (that does not need to be discrete but could be).
For example, if X = 0 with probability 1/2 and X = U ∼ U(0, 1) (the uniform distribution on (0, 1)) with probability 1/2, then X is neither continuous nor discrete.
and fX,Y is called the joint PDF. If the CDF is differentiable (not always
true), then
∂²FX,Y/∂x∂y (x, y) = fX,Y(x, y). (2.22)
Similar to the univariate case, we can compute the probability of an event B
with
P((X, Y) ∈ B) = ∫_B fX,Y(x, y) dx dy. (2.23)
where fY is the marginal PDF of Y and we assumed fY (y) > 0. The condi-
tional PDF is then
fX|Y(x|y) = fX,Y(x, y) / fY(y). (2.26)
We say that X and Y are independent if and only if their joint CDF,
equivalently joint PDF, can be factored:
FX,Y (x, y) = FX (x)FY (y) (2.27)
fX,Y (x, y) = fX (x)fY (y), ∀x, y ∈ R. (2.28)
Equivalently, for all x, y ∈ R such that fY (y) > 0,
FX|Y (x|y) = FX (x) (2.29)
fX|Y (x|y) = fX (x). (2.30)
for all x1 , . . . , xk .
2.7 Moments
In order to do so, one can look at the average value of X, if one were
to sample it many times. This value (that we call the expectation of X and
that we define below) requires that the variable does not take extremely large
values too often, otherwise this average may explode and thus be ill defined.
We formalise the property of a variable being non-extreme as integrability.
if X is discrete, and
E[X] = ∫_{−∞}^{∞} x fX(x) dx, (2.32)
if X is continuous.
Remark 1. The case of random variables that are neither discrete nor con-
tinuous is out of the scope of this class.
Remark 2. Note that E[X] does not have to be finite; in that case we set E[X] = +∞, so that E[X] is always well defined (as long as X is non-negative).
if X is discrete and
E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx
if X is continuous.
Below we list the most important basic properties of expectations, where X, Y are integrable:
• If X = c then E[X] = c.
For a random variable X with E[X 2 ] < ∞, that is, X is square integrable,
we can define its variance as
• The square root of the variance is the standard deviation, often denoted
by σX or just σ.
• var(aX) = a2 var(X).
• If X and Y are independent and square integrable, then E[XY ] =
E[X]E[Y ] and var(X + Y ) = var(X) + var(Y ).
For the case of two continuous random variables, we have the joint ex-
pectation
E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y(x, y) dx dy. (2.41)
The case where X is discrete and Y is continuous is similar with the integral
over the values of X replaced by a sum.
E[X p ]. (2.43)
with probability 1.
mean converges. It turns out that by adding the assumption of finite second
moment, the Central Limit Theorem gives us the order of magnitude of the
distance between the empirical mean and the true mean.
In the central limit theorem, we have the term Sm/m − E[X1], which converges almost surely to 0 as m → ∞ by the law of large numbers. Multiplying this difference by the factor √(m / Var[X1]), we get a random variable with normal distribution with mean 0 and variance 1. This factor grows at speed √m (while Var[X1] remains constant in m). This means that in the law of large numbers, the random variable Sm/m converges to E[X1] exactly at speed 1/√m. The constant factor simply normalises the Gaussian limit to have variance 1.
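The following small simulation (an illustration we add here, with fair coin tosses as the Xi) shows this 1/√m rate: the empirical standard deviation of Sm/m − E[X1] over many repetitions closely tracks √(Var[X1]/m) = 0.5/√m.

```python
# Empirical spread of S_m/m - E[X_1] for fair coin tosses, compared with 0.5/sqrt(m).
import numpy as np

rng = np.random.default_rng(0)
for m in [100, 1_000, 10_000]:
    samples = rng.integers(0, 2, size=(5_000, m))   # 5000 repetitions of m tosses
    deviations = samples.mean(axis=1) - 0.5         # S_m/m - E[X_1]
    print(m, deviations.std(), 0.5 / np.sqrt(m))    # empirical vs predicted std
```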
In this section, we see through a concrete example how the theorems from the previous section can be used to estimate an unknown probability from a finite sample.
Suppose that we have a fair coin. The fact that we specify the coin is fair implicitly specifies the probability P, such that P(“heads”) = P(“tails”) = 1/2. If instead we are given a coin and we do not know whether it is fair, then we can only say that there exists p ∈ [0, 1] such that P(“heads”) = 1 − P(“tails”) = p. We can, however, say more about the likely values of p by estimating it through repeated experiments, as follows:
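A hedged sketch of such a repeated experiment (the true bias p_true and the 95% confidence level below are illustrative choices, not from the notes): estimate p by the empirical frequency and use the CLT to quantify the uncertainty.

```python
# Toss the coin m times, estimate p, and form an approximate 95% confidence interval.
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.6                       # hypothetical unknown bias (illustrative)
m = 10_000
tosses = rng.random(m) < p_true    # True = "heads"
p_hat = tosses.mean()
half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / m)   # CLT-based interval
print(p_hat, (p_hat - half_width, p_hat + half_width))
```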
The example of the coin toss above is the frequentist paradigm of Statistics. It typically requires a large amount of data to be accurate. The other paradigm is Bayesian Statistics, which can yield effective estimates with little data, but often requires more computation and a prior (a priori) distribution. It is based on Bayes’ Theorem (Theorem 3), which we now state:
Theorem 3 (Bayes’ theorem). Let A, B ∈ F be such that P(A), P(B) > 0. Then P(A|B) = P(B|A) P(A) / P(B).
which can be rewritten as follows, using the chain rule for repeated applica-
tions of the definition of conditional probability:
Then, with this assumption, the original probability can be re-written as:
For a new example x, we can compute our best guess as the true label,
using:
ŷ = arg max_c p(y = c|x).
(let’s ignore the timing of the tests for this exercise) and the results were
x1 = 1, x2 = −1, x3 = 1, where 1 means that the test was positive and −1
negative.
2.9 Resources
• Measure, Integral and Probability, Marek Capinski and Ekkehard Kopp
• I have two dice and throw them; what is the probability that the sum of the two is larger than 6?
• Using Bayes’ theorem, show that if P(A) > 0 and P(B) > 0, then
P(B|A) = P(A ∩ B) / P(A) = P(B) P(A|B) / P(A). (2.45)
Chapter 3
Introduction to Machine Learning
Contents
3.1 Different paradigms
3.2 Supervised learning
3.3 Model selection
3.4 No free lunch theorem
3.5 Optimisation
3.6 ML pipeline in practice
3.7 List of tasks
There are three main paradigms in machine learning that sometimes share
similar ideas while having very specific techniques. Namely, they are
Example 3.2.2. On the other hand, predicting tomorrow’s weather (say tem-
perature) is a regression task: if a model predicts a temperature of 24◦ F
tomorrow in Ann Arbor and it happens to be 27◦ F, although not exact, this
prediction is more satisfactory than another of 12◦ F.
The term supervised refers to the fact that the examples used in building
the predictor come with labels, that is, learning how to distinguish hand-
written digits is done by presenting images and the right answers to the
algorithm. This is in contrast to unsupervised learning where no labels are
provided and the main goal is to find structure in the data (for example,
possible clusters, lower dimensional representations, etc...)
3.2.1 Set-up
• A probability distribution D on X ,
Remark 5. The dataset above supposes that the observations are perfect.
Often in practice, this is not the case (e.g. temperature measurements).
Such a dataset is said to be noisy and this noise is included in the model
such that yi = f (xi )+ϵi where (ϵ1 , · · · , ϵn ) is a random vector, often (but not
always) assumed to be Gaussian with mean 0 and independent coordinates.
More on that in the next chapter.
Remark 6. Choosing a parametric model corresponds to choosing a specific
hypothesis class H – hence we will interchangeably use model and hypothesis
class. For instance, if X ⊂ R, one could choose H = {h : x ↦ w1 cos(x) + w2 sin(x); w1, w2 ∈ R}. The numbers w1, w2 are called the parameters of
the model. Throughout these notes, we will aggregate them in a parameters
vector w ∈ RP where P will usually denote the number of parameters of
the model (here P = 2). Henceforth, we call an element of H a predictor
and we write fw instead of h for a function in H when we want to make the
dependency of the predictor on the parameters explicit.
Remark 7. Choosing a hypothesis class H is often an engineering choice; we might choose H according to simplicity, expressiveness, prior knowledge of our problem, etc. In this class, we will see, for example, perceptrons, support vector machines, kernel methods, ensemble methods, and neural networks.
The idea behind this definition is that the purpose of a training algorithm A is to send initial parameters w0 ∈ W to trained parameters A(w0) ∈ W using the dataset S, such that fA(w0) performs well at minimizing (3.3). In this context, the scheme used to choose w0 gives an initial predictor fw0 that need not perform well. One can choose w0 deterministically or randomly according to this scheme, which we call the initialization of the parameters. The training algorithm can itself use randomness.
Definition 3.3.1. Suppose we try to solve a task with the ERM framework
and let h ∈ H be a predictor. The difference between the generalisation loss
of h and its empirical loss is called the generalisation gap of h, that is
RD(h) − L(h) = Ex∼D[ℓ(h(x), f(x))] − (1/m) Σ_{i=1}^m ℓ(h(xi), yi).
• We say that overfitting occurs when a predictor h fits the data well but
is too complex to generalise outside the dataset, that is when L(h) ≈ 0
but RD (h) is large.
On a given task, to assess the efficacy of a model H, after having split the
dataset into a training set and a test set, we can simply look at the test error,
as this is an estimate for the generalisation loss.
Suppose that for a given task, you have the choice between several models H^(1), . . . , H^(n), and that you have no a priori reason to favour one over the others. How do we select the best model? Suppose that we train all of them on the training set, and compare the predictors thus obtained on the test set. We then select the predictor that had the lowest test error.
Indeed, suppose that the number of models n → ∞, then one can convince
oneself that it becomes more and more likely that one of the predictors is
such that h(xi ) ≈ yi for all (xi , yi ) ∈ Stest . This means that we cannot assess
whether the chosen hypothesis class is well chosen.
Validation set
One way to deal with that issue is to split the dataset into three disjoint
subsets:
Now the procedure becomes the following: train the n models on Strain ,
compare their performances on Sval , select the best model and retrain it on
Strain ∪ Sval , then assess its performance on Stest .
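A minimal sketch of this procedure (the helper names train and loss are illustrative placeholders, not an API from the notes):

```python
# Train on S_train, compare on S_val, retrain the winner on S_train + S_val,
# and report its loss on S_test as the generalisation estimate.
def select_and_evaluate(models, S_train, S_val, S_test, train, loss):
    trained = [train(H, S_train) for H in models]        # one predictor per model
    val_losses = [loss(h, S_val) for h in trained]
    best = val_losses.index(min(val_losses))             # best model on the validation set
    h_final = train(models[best], S_train + S_val)       # retrain on train + validation data
    return models[best], loss(h_final, S_test)           # performance assessed on the test set
```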
One analogy is to view the training set as textbook lectures and examples from which we learn a new concept, possibly encountering some confusing topics for which we have multiple possible interpretations; the validation set corresponds to practice problems and previous years’ exams, which help us choose which interpretation is the best; and the test set is the final exam of the course.
Cross-validation
In order to solve this issue, one can use cross-validation. Let H be fixed and let us use cross-validation to assess how good of a choice it is for a given task. Let us split the dataset S = {(xi, yi) : i ∈ {1, . . . , m}} into a training set Strain and a test set Stest as before. We now randomly partition the dataset Strain = {(xi, yi) : i ∈ {1, . . . , m′}} into k disjoint and covering subsets S1, . . . , Sk of roughly the same size. For i = 1 to k, we call Si the i-th fold. We denote by mi the size of the i-th fold, so that Σ_{i=1}^k mi = m′. We now proceed as follows: for all i ∈ {1, . . . , k}, train the model on
⋃_{j=1, j≠i}^{k} Sj,
Coming back to our initial question of choosing the best model among H^(1), . . . , H^(n), we can simply choose the one that minimises the cross-validation error, that is, the smallest CV(H^(i)). Then, we can retrain that model on the whole training dataset Strain, and estimate its generalisation error on Stest.
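The cross-validation error CV(H) can be sketched as follows (again, train and loss are illustrative placeholders, not from the notes):

```python
# k-fold cross-validation: CV(H) is the average validation loss over the folds.
import random

def cross_validation_error(H, S_train, k, train, loss, seed=0):
    data = list(S_train)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]               # k roughly equal folds
    errors = []
    for i in range(k):
        held_out = folds[i]
        rest = [z for j, fold in enumerate(folds) if j != i for z in fold]
        h_i = train(H, rest)                             # train on all folds but the i-th
        errors.append(loss(h_i, held_out))               # validate on the i-th fold
    return sum(errors) / k
```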
Choosing k
Recall that k is the number of folds used during cross-validation (CV). What
value of k should we choose?
• the influence of k on the size of the training sets and therefore on the attained approximators h1, ..., hk?
The no free lunch theorem is often talked about informally, one of the reasons
being that many similar – but not equivalent – versions of the theorem exist
in the literature. The theorem is often stated as follows:
The term “averaged” here does not have a formal meaning, but using the
3.5 Optimisation
wk+1 = wk − η∇C(wk ).
One step of the above recursion is called a gradient descent step or gradient
descent update.
AGD : w0 ↦ wK, for all w0 ∈ R^P.
Remark 8. The positive real number η is called the learning rate, as it governs
the size of the updates (see influence of stepsize in Figure 3.3). Both the
learning rate η and the number of steps K are left unchanged when applying
AGD to w0 . Hence, they are hyperparameters according to Definition 3.2.5.
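A minimal sketch of AGD in Python (the quadratic cost in the usage example is an illustrative choice, not from the notes):

```python
# Gradient descent: w_{k+1} = w_k - eta * grad C(w_k), run for K steps.
import numpy as np

def gradient_descent(grad_C, w0, eta=0.1, K=100):
    w = np.array(w0, dtype=float)
    for _ in range(K):
        w = w - eta * grad_C(w)          # one gradient descent update
    return w

# Example: C(w) = ||w - (1, 2)||^2 has gradient 2*(w - (1, 2)) and minimiser (1, 2).
w_star = gradient_descent(lambda w: 2 * (w - np.array([1.0, 2.0])), w0=[0.0, 0.0])
print(w_star)   # close to [1., 2.]
```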
Closely related to gradient descent is the gradient flow, which can be seen as
a continuous version of gradient descent when the learning rate goes to 0, as
we shall see. The interest of gradient flow is purely theoretical: it is often
easier to prove theorems by assuming that training is done under gradient
flow, and then argue that these theorems should hold true under gradient
descent for small enough learning rates (though rigorous results can also be
proven under gradient descent directly).
Because u(t) follows the negative of the gradient of C, one can show that
t 7→ C(u(t)) is decreasing and reaches its minimal value as t → ∞. Assuming
that u(t) → u∗ ∈ RP as t → ∞, one sees that ∇C(u∗ ) = 0 and the convexity
of C ensures that C(u∗ ) is a global minimum.
By choosing k = ⌊t/η⌋, where ⌊·⌋ denotes the integer part, we can take the
limit as η → 0+ and (assuming that ∇C is continuous to define the Riemann
For most machine learning problems (supervised), the pipeline will be similar:
• Once the best suited hypothesis class H has been chosen, train a predictor in H and evaluate its performance on the test set to judge the generalisation error.
Throughout the notes, we will learn about many different models (or hy-
pothesis classes). In order to get familiar with them and understand what
makes them different, we encourage the reader to think about the following
question for every encounter with a new H:
Q3. Is H well suited for this task? (For all the tasks below.)
For the tasks where it is provided, f denotes the labelling function (generating the dataset) and cannot be used to train the algorithm, as it is in general not known. We provide it to the reader so they may evaluate the performance of their models by computing the exact generalisation error and generalisation gap.
Task 1: Let X = R², Y = {0, 1} and suppose f : x ↦ 1{x1<x2}. You are given the dataset S = {((−1, 0), 1), ((−0.5, −1), 0), ((0, 0.5), 1), ((0, 1), 1), ((1, 0.5), 0), ((2, 0), 0)}.
Task 2: Let X = R² and Y = {0, 1}. Let f be 0 inside the closed unit disk, and 1 outside of it.
Task 3: Let X = R and Y = R. Take some points in R and label them by
f (x) = −x + 3.
Task 4: Let X = R and Y = R. Take some points in R and label them by
f (x) = x2 − 1.
Task 5: You find a text handwritten on a tablet in an unknown alphabet. Before
trying to decipher the text, you want to group the identical characters
together.
Task 6: You have access to all American literature and its translation in French.
You want your algorithm to learn how to translate a text from American
English to French.
Task 7: You know the rules of chess, i.e. how to legally move the pieces on a chessboard and how to assess the result of a chess game. Besides that, you have no knowledge of what constitutes a good or bad move. You want to build a chess engine that surpasses top human level.
Task 8: You are given a dataset of pictures of cats and you want to generate
new realistic images of cats.
Task 9: You work for Nitflex, a streaming service company, and want to give
good recommendations of movies to new subscribers, based on the data
of the older subscribers (movies they liked, categories, age, etc)
Chapter 4
Statistical Learning Theory
Contents
4.1 Learnability
4.2 Finite-sized hypothesis classes
4.3 Infinite sized hypothesis classes *
4.4 Bias-complexity tradeoff and Bias-variance tradeoff
⋆ Remark 8. If you are familiar with numerical analysis, you can think of this chapter as techniques to come up with a priori error estimates (i.e. before we sample the dataset), whereas in the previous chapter we were computing a posteriori error estimates (we measure the error for a specific model and a specific dataset).
4.1 Learnability
Since for a trained predictor h, the empirical error L(h) depends on the training set S, which is generated through random sampling under D, there will be randomness in the trained predictor h and therefore in RD(h). Thus, we can see RD(h) as a random variable. We cannot expect that S will suffice to direct the learner toward a good classifier (w.r.t. all of D) in case S is not representative of D.
Example 4.1.1. Suppose we have an urn with 30% black and 70% white balls, and we take a sample where we get “W W W W W”. In this case, our sample does not represent the underlying distribution of the balls. (Note that the probability of sampling this dataset is 0.7^5 ≈ 0.17, which is far from negligible.)
From the law of large numbers (Theorem 1), we know that more data
samples will ensure that the dataset is representative enough and avoid sit-
uations like in this example. But the finiteness of the dataset can hinder
learning. In the previous chapter, we saw ways to assess the quality of a
model, and how to select a good model among a collection of models to solve
a given task. In this chapter, we are instead concerned with studying the
learnability of a given hypothesis class H from a finite dataset.
Informally, this can also be seen as the labeling function f can be rep-
(i) some definitions require m(ϵ, δ) to grow polynomially with its parameters as they tend to 0, which, for example, does not allow m(ϵ, δ) = ϵ^{−1} 2^{1/δ};
Let H be the set of all functions from R to {0, 1}; I claim that it is not PAC
learnable and to demonstrate it, as you choose m(ϵ, δ), I will adversarially
construct D and f . After you fix m(ϵ, δ), I choose arbitrary 2m(ϵ, δ) pairwise
distinct points in R. I let D be the uniform discrete distribution on these
2m(ϵ, δ) points. Because H contains all {0, 1}-binary functions, I can label
them in any possible way with an f ∈ H so that the realisability assumption
holds. Because (at least) half of the 2m(ϵ, δ) points will not be in the dataset
S (recall that it is sampled with D), whatever your algorithm A is, even if it
finds a predictor h with L(h) = 0, it will not have learned anything on half
of the points, and therefore will be likely to not be ϵ-close to f . Conclusion:
H is not PAC learnable.
We saw that PAC learning offers guarantees on a hypothesis class that allows one to retrieve a labeling function within that class, since Definition 4.1.2
assumes the realisability assumption. However, the datasets a practitioner
meets are not, in general, labeled by a function that belongs to the hypothesis
class they use. Even in the case of a truly linear relationship between x and y,
it is enough, for example, to have noise in the samples so that the realisability
assumption does not hold for the set of linear predictors. This means that
It turns out that the realisability assumption can be removed and the
PAC learning formalism can be extended:
Definition 4.1.3. (Agnostic PAC learnability) A hypothesis class H is ag-
nostic PAC learnable if there exists a function mH : (0, 1)2 → N and a
learning algorithm A with the following property: for every ϵ,δ ∈ (0, 1) and
for every distribution D over X and labeling function f : X → {0, 1}, when
running the learning algorithm on m ≥ mH (ϵ, δ) i.i.d. samples, the algorithm
returns a hypothesis hA such that, with probability at least 1 − δ,
RD(hA) ≤ min_{h*∈H} RD(h*) + ϵ.
Remark 12. When the realisability assumption does not hold, no learner can
guarantee an arbitrarily small error ϵ. Under the definition of agnostic PAC
learning, a learner can still declare success if its error is not much larger
than the best error achievable by a predictor from the class H. This is in
contrast to PAC learning, in which the learner is required to achieve a small
error in absolute terms and not relative to the best error achievable by the
hypothesis class. In particular, for a given task, a hypothesis class H could
be a very poor choice, and still be agnostic PAC learnable. More informally
and concisely: agnostic PAC learnable does not imply good model choice.
The PAC learning formalism tells us that if H is too complex, we may not be
able to find good predictors in H from finitely many data samples. The first
restriction one may want to look at is when H is finite, that is, H contains
only a finite number of functions. Is H simple enough to be PAC learnable?
Throughout this section, we work under the assumption that H is finite.
Proof. Let hA denote the resulting predictor after training. Note that hA
depends on the dataset S; if we sample m i.i.d. data points according to D,
we look at S as a random variable with law Dm .
Fix ϵ, δ ∈ (0, 1) and let the map m(·, ·) be defined as in the Theorem.
Proving that H is PAC learnable amounts to proving that
We will make use of the three following basic facts that we admit without
proof:
Fix m ∈ N, let us bound Dm (S : RD (hA ) > ϵ). Define the set of bad
predictors
M := {S : ∃h ∈ Hb , L(h) = 0}.
(We used “misleading” to stress that even though the hypothesis class con-
tains a predictor with null empirical error, this predictor fails to achieve a
generalisation smaller than our threshold ϵ.) Finally, for all h ∈ H, define
the set of misleading datasets for h by
since we chose h in the bad hypothesis set Hb . In particular, using the basic
fact (i), we see that
Recall that our goal is to establish (4.1), which is not equivalent to the above
bound. Indeed, we just upper bounded the sum over bad hypotheses of
the probability to sample a misleading dataset. Nonetheless, the probability
Given a finite hypothesis class and two numbers ϵ, δ ∈ (0, 1), Theorem
4 provides a sufficient number of data points to learn a good predictor for
binary classification, that is to say with generalisation error smaller than ϵ,
with a probability greater than 1−δ, even in the worst case scenario where the
labeling function (in H) and the data distribution are chosen adversarially.
Definition 4.2.1. We say that a hypothesis class H has the uniform conver-
gence property if there is a function mU C : (0, 1)2 → N such that for every
ϵ, δ ∈ (0, 1) and for every probability distribution D over X , if m ≥ mU C (ϵ, δ)
and the m data samples in S are sampled i.i.d. with common law D, it holds
that
Let H be a finite hypothesis class. To show that it has the uniform conver-
gence property, we admit without proof that
D^m(|RD(h) − L(h)| > ϵ) ≤ 2e^{−2mϵ²}, ∀h ∈ H.
Hence, we can choose mUC(ϵ, δ) = log(2|H|/δ) / (2ϵ²) and we see that H has the uniform convergence property.
We now show that the uniform convergence property implies the agnostic
PAC learning property. Let hA denote the predictor obtained by the training
algorithm and let h∗ be the optimal predictor within the class, that is
Define the event Eunif := {∀h ∈ H : |RD (h) − L(h)| < ϵ}. On the event
Eunif , we have that
Note that the uniform convergence property ensures that Dm (Eunif ) > 1 −
δ, which means in particular that the above inequalities hold true with a
probability greater than 1 − δ.
This shows that by setting m(ϵ, δ) := mUC(ϵ/2, δ), H satisfies the agnostic PAC learnability property. More explicitly,
m(ϵ, δ) = 2 log(2|H|/δ) / ϵ²,
as claimed, which concludes the proof.
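As a quick numerical illustration (ours, not from the notes) of this sample complexity:

```python
# Sample size sufficient for agnostic PAC learning of a finite class, per the bound above.
import math

def sample_complexity(H_size, eps, delta):
    return math.ceil(2 * math.log(2 * H_size / delta) / eps**2)

print(sample_complexity(H_size=1000, eps=0.1, delta=0.05))   # about 2120 samples
```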
So far, we have assumed that |H| is finite. We showed that finite classes
are learnable and that the sample complexity of a hypothesis class is upper
bounded by an expression that involves the log of its size. Is there some-
thing similar to this, when we consider |H| = ∞? Namely, we want to say
something about the expressiveness of a set of functions.
H — VC-dim
Half intervals — 1
Intervals — 2
Half-spaces in the plane — 3
Neural networks — number of parameters
For infinite hypothesis sets, VCdim(H) takes the role of log(|H|) for finite hypothesis sets. For example: given a sample S with m examples, one can find some h ∈ H such that, with probability at least 1 − δ, the hypothesis fw has error less than ϵ if
m ≥ (1/ϵ) (8 VCdim(H) log(13/ϵ) + 4 log(2/δ)).
Consider again the setting of classification, with Y = {−1, 1}. Let us intro-
duce the notion of noisy labels.
This lack of a perfect labeling function f accounts for the noise on the label. For example, you can think of this as the features not containing all the information needed to attribute the label in a deterministic way. This setting is a bit closer to reality: because of, say, lack of information, noise, or other sources of uncertainty, the labelling function might not be deterministic.
Example 4.4.1. Suppose we have a model for tastiness of a papaya given the
colour and softness. Let’s say most soft papayas with bright colour are tasty.
However, we can have the situation that the papaya is soft and bright, and
still not tasty (e.g.: bad climate?), even if it’s unlikely.
Under this new assumption that there’s also noise on the labels, we can
write the theoretical optimal classifier:
Definition 4.4.1 (Bayes Optimal predictor). Given a probability distribution D over X × Y, the predictor is defined as
fbayes(x) = 1 if D(y = 1|x) ≥ 1/2, and −1 otherwise.
a deterministic map f .
RD(fbayes) ≤ RD(h)  ∀h ∈ H,
even when fbayes ∉ H,
and this is the minimal theoretical error possible. This leads us to defining noise as:
Definition 4.4.2. Given a distribution D over X × Y, the noise at point
x ∈ X is defined as:
The noise is a characteristic of the learning task and is indicative of its level of difficulty. For example, for a point x, if the noise is 1/2, it will be challenging to predict its label correctly. On the contrary, a noise of 0 means that there exists a labeling function.
This error does not depend on the sample size and is determined by the hypothesis class chosen. Enlarging the hypothesis class (e.g. making it more complicated) can decrease the approximation error.
Note that under the realisability assumption, the error is zero. In the
agnostic case, the approximation error can be large.
Example 4.4.2. Using a finite polynomial basis to represent a non-polynomial
function, there is an inherent approximation error.
Estimation error: Error between the approximation error and the error
achieved by the ERM predictor. This is in general non-null because the
empirical error is only an estimate of the generalisation error, therefore we do
not necessarily reach the minimal error over the hypothesis set. This quantity
depends on the training set size and on the size and complexity of the
hypothesis class (namely, ϵest increases logarithmically with the size of H and decreases as m increases).
E_{x,y,S}[(hS(x) − y)²]  (Expected Test Error)
= E_{x,S}[(hS(x) − h̄(x))²]  (Variance)  +  E_{x,y}[(ȳ(x) − y)²]  (Noise)
+ E_x[(h̄(x) − ȳ(x))²]  (Bias²),   (4.3)–(4.4)
where
h̄(·) = ES∼Dm[hS(·)]
denotes the expected predictor when sampling different datasets S from Dm, and
ȳ(x) = Ey|x[f(x) + ϵ],
the expected value of y given x (as we consider y to be noisy).
Proof. We consider the expected error for the predictor hS that we ob-
tained after training on the dataset S with the ERM framework, which can
be written as:
E_{(x,y)∼D}[(hS(x) − y)²] = ∫_x ∫_y (hS(x) − y)² D(x, y) dy dx.
We can write the expected Test Error (given the ERM framework and H):
E_{(x,y)∼D, S∼Dm}[(hS(x) − y)²] = ∫_S ∫_x ∫_y (hS(x) − y)² D(x, y) Dm(S) dx dy dS
Returning to (4.5), we are left with the variance and another term:
E_{(x,y)∼D, S∼Dm}[(hS(x) − y)²] = E_{x,S}[(hS(x) − h̄(x))²]  (Variance)  +  E_{x,y}[(h̄(x) − y)²]. (4.6)
Bias: What is the inherent error that you obtain from your model even
with infinite training data? This is due to your model being ”biased” to
a particular kind of solution (e.g. linear function). In other words, bias is
inherent to your model.
Footnote 1: By the property of conditional expectation, we have:
E_{x,y}[f(y)] = ∫_x ∫_y f(y) D(x, y) dx dy = ∫_x ∫_y f(y) D(y|x) D(x) dx dy (4.7)
= Ex[∫_y f(y) D(y|x) dy] = Ex[Ey|x[f(y)]]. (4.8)
Noise: How big is the data-intrinsic noise? This error measures ambigu-
ity due to your data distribution and feature representation. You can never
beat this, it is an aspect of the data.
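The following simulation (an illustrative setup we add here: degree-1 polynomial regression on a noisy quadratic target) estimates the three terms of the decomposition by averaging predictors over many sampled datasets.

```python
# Estimate variance, bias^2 and noise for ERM with squared loss in a simple class.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2                    # true labelling function (illustrative)
sigma = 0.3                           # label noise std
x_grid = np.linspace(-1, 1, 200)      # points where predictors are evaluated
m, n_datasets, degree = 20, 500, 1

preds = np.empty((n_datasets, x_grid.size))
for s in range(n_datasets):
    x = rng.uniform(-1, 1, m)
    y = f(x) + sigma * rng.standard_normal(m)
    coeffs = np.polyfit(x, y, degree)          # least squares fit in the chosen class
    preds[s] = np.polyval(coeffs, x_grid)

h_bar = preds.mean(axis=0)                     # expected predictor over datasets
variance = preds.var(axis=0).mean()
bias_sq = ((h_bar - f(x_grid))**2).mean()
noise = sigma**2
print(variance, bias_sq, noise)                # their sum approximates the expected test error
```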
Footnote 2: Link to paper: https://www.pnas.org/content/116/32/15849
Figure 4.3: Curves for training risk and test risk. (A) The classical U-shaped
risk curve arising from the bias-variance tradeoff. (B) The double-descent risk
curve, which incorporates the U-shaped risk curve (i.e. the classical regime)
together with the observed behaviour from using high-expressivity function
classes (i.e.: the modern interpolating regime), separated by the interpolation
threshold. The predictors to the right of the interpolation threshold have zero training risk. This is a current topic of discussion.
Figure 4.4: Double-descent risk curve for a fully connected neural network
on MNIST. Training and test errors are shown for different losses. The
dataset considered has 4000 datapoints, with feature dimension d = 784 and
K = 10 classes. The number of parameters of the network is given by (d + 1)H + (H + 1)K. The interpolation threshold (black dashed line) is observed at n · K.
Chapter 5
Linear Models
Contents
5.1 Linear Regression
5.1.1 Cost function choice
5.1.2 Explicit solution
5.1.3 Regularization
5.1.4 Representing nonlinear functions using basis functions
5.2 Classification
5.2.1 Perceptron algorithm
5.2.2 Support Vector Machine
5.2.3 Detour: Duality theory of constrained optimisation
5.2.4 Non-separable case
functions
H = {fw : x ↦ fw(x) = w^T x + b : w ∈ R^d, b ∈ R}, (5.1)
Linear predictors are nice because they are linear, and thus preserve nice
properties of the loss function ℓ : Y × Y → R+ to the (empirical) cost
function C : Rd → R+ , defined on the parameter space by
C(w) := (1/m) Σ_{i=1}^m ℓ(fw(xi), yi). (5.2)
For example, since the composition of differentiable and convex maps is dif-
ferentiable and convex, it holds that if ℓ is convex in its first argument, then
so is C on the parameter space. In particular, training on H with gradi-
ent flow on the parameters is guaranteed to converge to a global
minimum of the cost function (admitted without proof, see Section 3.5).
The squared error loss is nice for theoretical reasons: it is convex and differentiable and has no hyperparameters, but it is not robust to outliers (i.e. your total error might be dominated by an outlier, since the difference is squared). On the contrary, the absolute error loss is less prone to potential outliers (see Figure 5.2 and footnote 1), and is convex too. However, it is not differentiable at 0. The Huber loss is a compromise between the squared and absolute error losses: it is convex and both differentiable and robust to outliers. The price to pay is that it has a hyperparameter δ that has to be tuned.
Footnote 1: Notebook: https://colab.research.google.com/drive/1ucl_aC8Q_q8Y5DC4uPpBqi5DyiFbSIBi?usp=sharing
Least squares is the method that solves the empirical risk minimization prob-
lem for the hypothesis class (5.1) with respect to the squared loss. We want
to find w that minimizes
arg min_w C(w) = arg min_w L(fw) = arg min_w (1/(2m)) Σ_{i=1}^m (w^T xi − yi)².
(Note that here we are using the homogeneous notation: w = (w1 , ..., wn , b), xi =
(xi1, .., xin, 1)^T.) We will use the more compact notation and equivalent formulation
arg min_w C(w) = arg min_w (1/(2m)) ||Xw − Y||², (5.3)
Theorem 7. The linear regression problem with square loss as in (5.3) sat-
isfies the following:
w = (X T X)−1 X T Y ;
(ii) if X T X is not invertible, then there are infinitely many solutions. More-
over, the minimal L2 norm solution is given by
w = X⁺Y, where X⁺ denotes the pseudo-inverse of X;
and
∇w[2(Xw)^T y] = ∇w[2 w^T X^T y] = 2 X^T y.
This yields
∇C(w) = (1/(2m)) (2 X^T X w − 2 X^T y) = 0. (5.4)
Hence
X^T X w* = X^T Y
w* = (X^T X)^{−1} X^T Y, (5.5)
as claimed.
Note that to prove (ii), we can proceed as in (i) up to Equation (5.4), but then we cannot invert X^T X. We can still seek a solution to Xw = Y, but it is not unique. This means there is an infinite number of w* that achieve the same minimal square error on the training data. This is called an under-determined problem: we have too many degrees of freedom in the problem and not enough constraints (data).
Which solution should we seek in the under-determined case? We can solve the following minimization problem instead:
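A quick numerical check of Theorem 7 (a sketch we add here, using NumPy): the normal-equations solution matches np.linalg.lstsq, and the pseudo-inverse gives the minimal-norm solution.

```python
# Compare the closed-form least squares solution with library solvers.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[rng.standard_normal((50, 3)), np.ones(50)]   # homogeneous notation: last column of 1s
w_true = np.array([2.0, -1.0, 0.5, 0.3])
Y = X @ w_true + 0.01 * rng.standard_normal(50)

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)        # (X^T X)^{-1} X^T Y
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
w_pinv = np.linalg.pinv(X) @ Y                         # X^+ Y, minimal-norm solution
print(np.allclose(w_normal_eq, w_lstsq), np.allclose(w_lstsq, w_pinv))
```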
5.1.3 Regularization
The proof can be done as in the non-regularised case by setting the gra-
dient of the cost to 0.
Remark 16. We note that a minimal norm solution (as seen in Theorem 7)
is not a solution of the norm-regularised problem. Nevertheless, it can be
shown that as λ → 0, the solution of the L2 regularized problem converges
to the minimal L2 norm solution of the original (non-regularized) problem.
However, if we regularise with an L2 penalty on the parameters, i.e. we minimise min_{w∈R²} (1/4) Σ_{i=1}^2 (w^T xi − yi)² + λ||w||²₂ for some λ > 0, then w = (1/2, 1/2)^T is not a solution (one can convince oneself by taking the gradient and noting that it is non-zero).
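For completeness, a hedged sketch of the L2-regularised (ridge) solution (the placement of λ relative to any 1/(2m) factor is a convention choice, not fixed by the notes): w_ridge = (X^T X + λI)^{-1} X^T Y, which always exists, and tends to the minimal-norm solution as λ → 0 (cf. Remark 16).

```python
# Ridge regression closed form on a rank-deficient design, compared with the pseudo-inverse.
import numpy as np

def ridge_solution(X, Y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

X = np.array([[1.0, 1.0], [2.0, 2.0]])     # X^T X is not invertible here
Y = np.array([1.0, 2.0])
for lam in [1.0, 1e-3, 1e-6]:
    print(lam, ridge_solution(X, Y, lam), np.linalg.pinv(X) @ Y)  # converges to the minimal-norm solution
```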
fw (x) = w1 x2 + w0 x + b
Our linear coefficients are w = [b, w0 , w1 ], and our “features” become [1, x, x2 ].
We are finding a linear model on “nonlinear features”, namely, given by x2 .
In general, one can fit non-linear functions via linear regression using a
transformation ϕ which applies nonlinear transformations on our features:
fw(x) = Σ_{i=1}^d wi ϕi(x)
• If the input space is 2-dimensional: ϕ(x) = (1, x1, x2, x1², x2², x1x2, ..., x1^k, x2^k). (A small numerical sketch follows below.)
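A small numerical sketch (ours) of fitting the nonlinear target of Task 4 with a linear model on the features (1, x, x²):

```python
# Linear regression on polynomial features recovers a quadratic function.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100)
y = x**2 - 1 + 0.1 * rng.standard_normal(100)      # Task 4 target with a little noise

Phi = np.column_stack([np.ones_like(x), x, x**2])  # phi(x) = (1, x, x^2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # linear regression in feature space
print(w)   # approximately [-1, 0, 1]
```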
5.2 Classification
H = {fw : x ↦ fw(x) = sign(w^T x + b) : w ∈ R^n, b ∈ R}.
However, the loss function ℓ(y, y′) = 1{y≠y′}, called the 0-1 loss, is neither convex nor continuous. Thus, to make the optimisation problem easier, we introduce a surrogate loss, called the Perceptron loss (see footnote 7):
ℓperc(x, y) = max(0, −y(w^T x + b)).
Footnote 7: Note that if y and w^T x + b have the same sign, then the second term is negative and thus ℓperc(x, y) is zero. If the signs are opposite, then the error quantity will be strictly positive.
This ℓperc loss is useful for the sake of solving the optimisation problem. Then we can use the gradient descent algorithm to find a plane that linearly separates the data (if that is possible). The gradient descent update becomes:
wt+1 = wt − ηt (1/m) Σ_{i : yi(w^T xi + b) < 0} (−yi xi).
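A minimal sketch of this batch perceptron update in Python (the toy dataset below is illustrative, not from the notes):

```python
# Batch perceptron: repeatedly apply the gradient update on misclassified points.
import numpy as np

def train_perceptron(X, y, eta=0.1, n_steps=1000):
    X_h = np.c_[X, np.ones(len(X))]                     # homogeneous coordinates: b folded into w
    w = np.zeros(X_h.shape[1])
    for _ in range(n_steps):
        margins = y * (X_h @ w)
        wrong = margins <= 0                            # misclassified (or zero-margin) points
        if not wrong.any():
            break                                       # data separated: stop
        w += eta * (y[wrong, None] * X_h[wrong]).sum(axis=0) / len(X)
    return w

X = np.array([[-1.0, 0.0], [0.0, 0.5], [1.0, 0.5], [2.0, 0.0]])
y = np.array([1, 1, -1, -1])                            # labels in {-1, +1}
print(train_perceptron(X, y))
```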
The Support Vector Machine (SVM) is a linear classifier that can be viewed
as an extension of the Perceptron algorithm. In the context of binary classi-
fication in a linearly separable dataset, the Perceptron guarantees that you
find a separating hyperplane, while the SVM finds the maximum-margin
separating hyperplane. Refer to figure 5.4 for a comparison.
Figure 5.4: Two different separating hyperplanes for the same data set.
(Right:) The maximum margin hyperplane. The margin, γ, is the distance
from the hyperplane (solid line) to the closest points in either class (which
touch the parallel dotted lines).
Definition 5.2.1. (Margin) Consider a separating hyperplane defined through
w, b as the set of points P = {x ∈ Rn : wT x + b = 0}. The margin γ(w, b) is
defined as the distance from the hyperplane to the closest point across both
classes.
The SVM objective states that we want to find the hyperplane defined by
w, b such that the distance of that plane to points in the dataset is maximized,
and that the plane correctly classifies points.
What is d? It’s the distance vector resulting from subtracting x from its
projection xP , and its norm is the minimum distance between x and any
point on the hyperplane. Furthermore, note that d is colinear with w, so we
can write d = αw for some α ∈ R. Then,
w^T(x − d) + b = 0
w^T(x − αw) + b = 0
α = (w^T x + b) / (w^T w) = (w^T x + b) / ||w||²
Now that we know α, we can compute the length of d = αw, i.e. the distance of x to P, as
||d|| = √(d^T d) = √(α² w^T w) = |w^T x + b| / ||w||,
as claimed.
points of the class that is further away and increase γ, which contradicts that
γ is maximized.)
max_{w,b} min_{x∈S} |w^T x + b| / ||w||   such that ∀i, yi(w^T xi + b) ≥ 0
We can pull the denominator outside of the minimization because it does not
depend on x. Because the hyperplane is scale invariant, we can fix the scale
of w, b any way we want. Let’s be clever about it, and choose it such that
min_{x∈S} |w^T x + b| = 1
and note that the w that (satisfies the constraints and) minimizes ||w|| is
the same that minimizes ||w||2 . This is because f (x) = x2 is monotonically
increasing for x ≥ 0 and ||w|| ≥ 0.
min_w w^T w
s.t. ∀i, yi(w^T xi + b) ≥ 0
     min_i |w^T xi + b| = 1.
These constraints are still hard to deal with, however luckily we can show
that (for the optimal solution) they are equivalent to the much simpler (5.8).
|wT xi + b| = 1
as
yi (wT xi + b) = 1
because we are in the separable case. And if the minimum is 1 then all other
points are ≥ 1.
Now that we have established the simpler formulation of the SVM prob-
lem (5.8), we can numerically find a solution to it, as mentioned in Remark
18. Although we do not have a general closed-form solution, more can be
said about the maximal margin hyperplane.
In this section, we see how to handle constraints when dealing with an op-
timisation problem. We will then apply the method to SVM to express the
solution of the SVM problem (5.8).
Equality constraints
One can use the method of Lagrange multiplier to solve (5.9), by turning a
constrained optimisation into an unconstrained optimisation and introducing
penalties on the violation of the constraints.
Expanding this out, we have a vector where the first n entries with respect to w lead to:
∇w C(w) − Σ_i αi ∇w hi(w) = 0,
hi (w) = 0, i = 1, ..., r.
How does this solution relate to that of the original problem (5.9)? When
considering gprim (w), we are fixing a w and then maximize over α. We see
that as soon as one constraint is violated, say hi (w) ̸= 0, then we can let
αi go to plus or minus infinity (depending on the sign of hi (w)) to make
the Lagrangian blow up, therefore gprim (w) = +∞. In particular, gprim (w) is
finite if and only if w satisfy the constraints (provided that C(w) is finite).
Hence, the solution w∗ of (5.9) is the same as that of (5.12).
Inequality constraints
Similarly, we can write the Lagrangian for the following constrained optimi-
sation problem (instead of equality in the constraints, we seek for ≤ con-
straints):
Again, note that if the constraint on hi (w) is violated (i.e. if hi (w) > 0),
then if αi > 0, this will lead to an increase in the Lagrangian.
The primal problem, however, does not seem easier to solve than the
original problem itself. Indeed, in the minimisation-maximisation problem
(5.15), because we maximize over α after having chosen a w, when mini-
mizing over w, we are restricted to candidates that satisfy the constraints
(otherwise, the Lagrangian blows up as explained before). What we would
like to do is to first choose α and only then minimize over w. This is what is
called the dual optimisation problem, which turns the constrained min-max
optimisation problem into a penalised max/min optimisation problem where
we search for an optimal solution over α after minimizing over w. 11
How does solving this relate to solving (5.15)? We can note that the solution
to (5.16) returns a lower bound to (5.15): first note that by definition, for
every w, α, we have
gdual (α) ≤ p∗ ,
Footnote 11: If you are interested in this, start with the notion of duality in Linear Programming problems.
d∗ ≤ p∗ .
d∗ = p∗ .
This is called the strong duality condition. Spoiler alert: the strong duality condition holds for the SVM optimisation problem.
Theorem 11. For an optimisation problem for which the strong duality con-
dition holds, any primal optimal solution w∗ and dual optimal solution α∗
respect the KKT conditions. Conversely, if f and hi are affine for all i, then
the KKT conditions are sufficient for duality.
• this means that the optimal w∗ for the SVM problem (5.8) and α∗
satisfy the KKT conditions.
min_w w^T w
s.t. ∀i : yi(w^T xi + b) − 1 ≥ 0.
L(w, b, α) = w^T w − Σ_{i=1}^m αi (yi(w^T xi + b) − 1)
Then, let’s use the KKT conditions to say something about our solutions.
The optimal weight vector w∗ , which defines the normal to the hyperplane
that maximizes the margin, is a linear combination of the training vectors
x1 , ..., xm .
Footnote 13: Sometimes you will see this multiplied by 1/2 to simplify the gradient computation...
There are some vectors which lie on the margin, and for those, the corre-
sponding αi∗ is nonzero.
Note: While w is unique for the SVM problem, the support vectors are
not. E.g in a hyperplane in N dimensions, we need N + 1 points to define
a hyperplane. If there are more support vectors than N + 1, then we can
choose different support vectors to specify the hyperplane.
• If you were to move one of the support vectors and retrain the SVM,
the resulting hyperplane would change.
yj (wT xj + b) = 1
So far we have assumed that the data was linearly separable. This is often not the case: we cannot find a hyperplane that separates the two classes. In this case, there is no solution to the optimization problems stated above.
The slack variable ξi allows the input xi to be closer to the hyperplane (or
even be on the wrong side), but there is a penalty in the objective function
for such “slack”.
If λ is very large, the SVM becomes very strict and tries to get all points
to be on the right side of the hyperplane. If λ is very small, the SVM becomes
very loose and may “sacrifice” some points to obtain a simpler (i.e. lower
∥w∥22 ) solution.
If we plug this closed form into the objective of our SVM optimization problem, we obtain the following unconstrained version, as a loss function plus a regularizer:
min_{w,b}  w^T w  (ℓ2-regularizer)  +  λ Σ_{i=1}^n max(1 − yi(w^T xi + b), 0)  (hinge loss).
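A hedged sketch (ours) of minimising this unconstrained objective by subgradient descent; the toy data, step size and λ below are illustrative choices, not from the notes.

```python
# Soft-margin SVM by subgradient descent on w^T w + lam * sum_i max(1 - y_i(w^T x_i + b), 0).
import numpy as np

def train_soft_margin_svm(X, y, lam=10.0, eta=0.001, n_steps=5000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        margins = y * (X @ w + b)
        active = margins < 1                                   # inside the margin or misclassified
        grad_w = 2 * w - lam * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -lam * y[active].sum()
        w, b = w - eta * grad_w, b - eta * grad_b              # subgradient step
    return w, b

X = np.array([[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_soft_margin_svm(X, y)
print(np.sign(X @ w + b))   # should classify the four toy points correctly
```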
Chapter 6
Kernel methods
Linear classifiers are great, but what if there exists no linear decision bound-
ary? As it turns out, there is an elegant way to incorporate non-linearities
into most linear classifiers.
Some disadvantages:
In this chapter, we will talk about methods which use this idea of sending
elements of x ∈ X onto some higher dimensional space XH which we don’t
necessarily have to know much about, where our problem becomes easier to
solve. All we need to know about this new space is that it is a so-called
Reproducing kernel Hilbert space.
Before we can dive into what this means, we start with a brief review and
introduction to key concepts.
Note: throughout this chapter, we always assume that the output space
Y = R, to ease the notation.
In R^n, the dot product x · x′ = Σ_{i=1}^n xi x′i can be seen as a way to measure the similarity between two elements x, x′ (e.g. x · x′ = 0 if and only if x and x′ are orthogonal, i.e. nothing of x can be used to represent x′). In an arbitrary vector space, this notion can be generalised:
Note that by symmetry, we only need to check the linearity in the first
variable to get the bilinearity.
We will sometimes write ⟨·, ·⟩H to specify on which space we consider the
inner product, in order to avoid ambiguity.
Example 6.1.1. Inner product examples:
• The Euclidean space Rn , where the inner product is given by the dot
product:
⟨(x1 , · · · , xn ), (y1 , · · · , yn )⟩ = x1 y1 + · · · + xn yn .
T (f ) = ⟨f, g⟩H ∀f ∈ H.
We introduced the two notions of inner products and pd kernels in the previ-
ous section. The link between them may not seem straightforward. However,
as mentioned in the previous section, pd kernels share some similarities with
pd symmetric matrices, which satisfy the following:
⟨f, Kx ⟩H = f (x),
ϕx ∈ H : y ↦ x1y1 + x2y2 + x1x2y1y2,   R² → R.
∀x, x′ ∈ X, K(x, x′) = (x1x′1 + x2x′2 + c)² = (x1², x2², √2 x1x2, √(2c) x1, √(2c) x2, c) · (x′1², x′2², √2 x′1x′2, √(2c) x′1, √(2c) x′2, c)
⟨Kx, Kx′⟩ = Σ_{n=0}^∞ (1/√(n!)) e^{−x²/2} x^n · (1/√(n!)) e^{−x′²/2} x′^n
= e^{−(x²+x′²)/2} Σ_{n=0}^∞ (x x′)^n / n!
= e^{−(x−x′)²/2} = e^{−||x−x′||²/2}.
Intuitively, the Gaussian kernel sets the inner product in the feature space
between x, x′ to be close to zero if the instances are far away from each other
(in the original domain), and close to 1 if they are close.
Gaussian kernels are among the most frequently used kernels in applica-
tions.
Example 6.2.4. Kernelised SVM: Recall that using the Lagrange multipliers to solve a linearly separable classification task with SVM, the solution (w*, b*) has the following form
fw*(x) = sign(w*^T x + b*) = sign(Σ_{i=1}^m αi yi ((xi)^T x) + b*).
The “kernelised” SVM (an example can be seen in figure 9.4) yields:
fw*(x) = sign(Σ_{i=1}^m αi yi K(xi, x) + b*)
We simply replace the dot product (xi )T x by K(xi , x), i.e. we compare the
features through K, or equivalently, we compare the features in some implicit
higher dimensional space where the similarity is measured through an inner
product.
See figure 9.4 for an example of Kernel SVM using a Gaussian Kernel
(radial basis functions RBF), and again the following video https://www.
youtube.com/watch?v=OdlNM96sHio&t=0s for an example of the SVM using
a polynomial kernel.
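A short illustration (assuming scikit-learn is available; the parameter values are illustrative) of a kernelised SVM with the Gaussian (RBF) kernel on data that is not linearly separable, namely the unit-disk labels of Task 2:

```python
# RBF-kernel SVM on points labelled 0 inside the closed unit disk and 1 outside.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = (np.linalg.norm(X, axis=1) > 1).astype(int)

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
print(clf.score(X, y))   # close to 1 on this training set
```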
Figure 6.2: Linear SVM and SVM using a RBF kernel. Source: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
K(x, y) = x1 y1 + x2 y2 .
1. Check symmetry
K i,j = K(xi , xj ) i, j = 1, · · · , n,
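A small numerical check (ours) that a Gram matrix built from this kernel is symmetric and positive semi-definite:

```python
# The linear kernel K(x, y) = x1*y1 + x2*y2 gives the Gram matrix X X^T.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))
K = X @ X.T                                   # K[i, j] = K(x_i, x_j)
print(np.allclose(K, K.T))                    # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # all eigenvalues non-negative (up to rounding)
```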
Now that we got some intuition for what kernels are and how they work,
let us prove the main theorem (Moore-Aronszajn Theorem) of this section.
We then extend b on G2 × G2 as
b(f, g) = Σ_{i=1}^{rf} Σ_{j=1}^{rg} αi βj K(xi, yj).
⋆ Remark 11. The proof of this last point turns out to be quite technical and is beyond the scope of these notes. For our purpose, the intuition given by the above sketch should be enough. For the curious and motivated reader, we give a rough plan of how to complete the proof of the last point:
(i) Show that (fn )n≥1 || · ||G2 -Cauchy sequence ⇒ fn (x) Cauchy sequence
in R for all x ∈ R (use reproducing kernel then Cauchy-Schwarz in-
equality). In particular, (fn )n≥1 converges pointwise.
(ii) Show that if a Cauchy sequence fn → 0 pointwise as n → ∞ then
||fn ||G2 → 0, (fix N ∈ N large enough, write ⟨fn , fn ⟩G2 = ⟨fn − fN , fn ⟩G2 +
⟨fN , fn ⟩G2 and bound the two terms).
(iii) Show that for two Cauchy sequences (fn)n≥1, (gn)n≥1 in G2, (⟨fn, gn⟩G2)n≥1 is a Cauchy sequence in R. (Use the Cauchy-Schwarz inequality.)
(iv) For f, g ∈ H, define ⟨f, g⟩H := limn→∞⟨fn, gn⟩G2 and show using (ii) that it does not depend on the choice of the Cauchy sequences (fn)n≥1, (gn)n≥1 that converge pointwise to f and g.
(vii) Show that H is complete: take a Cauchy sequence (fn)n≥1 in H and use (vi) to define a sequence (gn)n≥1 in G2 such that limn→∞ ||fn − gn||H = 0; check that (gn)n≥1 is a Cauchy sequence in G2 that converges pointwise to a function g ∈ H and show that fn converges to g in H.
In this section, we will state results on pd kernels without proofs. We will see
that a kernel can be represented as a sum of its eigenfunctions, similar to the
eigendecomposition of a symmetric matrix. Thanks to this representation,
the inner product in the associated RKHS can be seen as an inner product
in L2 (µ), the set of square integrable functions against some measure µ with
compact support in X . We will then use this representation to see how the
inner product in the RKHS corresponds, for some specific examples, to a dot
product in Rn .
TK f : X → R,   x ↦ TK f(x) := ∫_X K(x, x′) f(x′) µ(dx′).
It turns out that we can decompose the kernel K using the eigenfunctions of the operator TK, and get the analogue of the above fact for pd kernels. For a measure µ on a set B, let L²(B, µ) := {f : B → R : ∫_B f(x)² µ(dx) < ∞}.
Theorem 14 (Mercer’s Theorem). Let K be a continuous pd kernel and µ
be a finite measure supported on a compact subset B ⊂ X . There exists an
orthonormal basis (ei )i≥1 of L2 (B, µ) consisting of eigenfunctions of TK with
non-negative eigenvalues (λi )i≥1 . Furthermore, for all i ≥ 1, if λi > 0, then
ei is continuous and for all x, x′ ∈ B, it holds that
K(x, x′) = \sum_{i ≥ 1} \lambda_i e_i(x) e_i(x′),
that is, the inner product in the RKHS H actually corresponds to a dot
product in Rn !
Exercise.
Let K(x, x′ ) := x1 x′1 + x2 x′2 + x1 x2 x′1 x′2 . Let B := [−c, c]2 ⊂ R2 for a
positive real number c and let µ(dx) := 1B (x)dx
T_K f(x) = \lambda f(x),  x ∈ B
⇔ a_1 x_1 + a_2 x_2 + a_3 x_1 x_2 = \lambda f(x),    (6.2)

where

a_1 := \int_{[−c,c]^2} x′_1 f(x′) dx′,   a_2 := \int_{[−c,c]^2} x′_2 f(x′) dx′,   a_3 := \int_{[−c,c]^2} x′_1 x′_2 f(x′) dx′.
One can check that e_1(x) = x_1/\lambda_1^{1/2}, e_2(x) = x_2/\lambda_2^{1/2} and e_3(x) = x_1 x_2/\lambda_3^{1/2} are eigenfunctions with respective eigenvalues \lambda_1 = \lambda_2 = \frac{2}{3}c^3 and \lambda_3 = \frac{4}{9}c^6. Let us verify the decomposition of Theorem 14:

\lambda_1 e_1(x) e_1(x′) + \lambda_2 e_2(x) e_2(x′) + \lambda_3 e_3(x) e_3(x′) = x_1 x′_1 + x_2 x′_2 + x_1 x_2 x′_1 x′_2 = K(x, x′),

as claimed.
The reader can also check that e1 , e2 and e3 are orthonormal and it is
clear from (6.2) that TK has no other eigenfunction with non-zero eigenvalue
(that is not a combination of these three).
Let e_\lambda(x) := (\lambda_1^{1/2} e_1(x), \lambda_2^{1/2} e_2(x), \lambda_3^{1/2} e_3(x))^T; then for any x, x′ ∈ B, we thus see that K(x, x′) = e_\lambda(x) · e_\lambda(x′), a dot product in R^3.
Note that we made an arbitrary choice for the compact set B = [−c, c]2
and the finite measure µ(dx) = 1B (x)dx. Changing c does not change
the eigenfunctions (inside the smallest B), but does change the eigenval-
ues. Choosing a different shape than a square for B would more profoundly
change the operator and then the eigenfunctions, which would lead to other
non-canonical feature maps and another Hilbert space, but with, again, an
equivalent inner space.
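The correspondence between the kernel and the dot product of the explicit feature map e_λ can be checked numerically. The following sketch (our own illustration, with arbitrary points) compares K(x, x′) with e_λ(x) · e_λ(x′); note that λ_i^{1/2} e_i(x) simplifies to x_1, x_2 and x_1 x_2, so the value of c plays no role in this check.

import numpy as np

def K(x, xp):
    # K(x, x') = x1*x1' + x2*x2' + x1*x2*x1'*x2'
    return x[0] * xp[0] + x[1] * xp[1] + x[0] * x[1] * xp[0] * xp[1]

def e_lambda(x):
    # Feature map: e_lambda(x) = (lambda_i^{1/2} e_i(x))_i = (x1, x2, x1*x2).
    return np.array([x[0], x[1], x[0] * x[1]])

rng = np.random.default_rng(1)
x, xp = rng.uniform(-1, 1, size=2), rng.uniform(-1, 1, size=2)

print(K(x, xp), e_lambda(x) @ e_lambda(xp))   # the two values coincide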
The representer theorem plays a central role in a large class of learning problems. It provides the means to reduce an infinite-dimensional optimization problem to a tractable finite-dimensional one.
Therefore, f_w(x_i) = f_{w,s}(x_i) for all i = 1, ..., m, i.e. the orthogonal part does not influence f_w at the points x_i.
Lastly, we must show that for a minimizer f_w, the orthogonal part f_w^⊥ does not enter the last term of the objective function, λ||f_w||²_H, which is given by the norm of f_w in H.
It turns out that by considering the squared error loss, the answer is yes.
We define respectively the objective function and the regularised objective
function for all h ∈ H as
L(h) := \frac{1}{m} \sum_{i=1}^{m} (h(x_i) − y_i)^2,
L_r(h) := L(h) + R(h),
where R(h) := λ||h||²_H for some λ > 0. The kernel regression and kernel ridge regression problems then read as min_{h∈H} L(h) and min_{h∈H} L_r(h), respectively.
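By the representer theorem, the minimiser of L_r can be written as h(·) = Σ_i α_i K(x_i, ·), and for the squared error loss the coefficients have a closed form, α = (K + mλI)^{−1} y. The following is a minimal sketch (our own code, with an RBF kernel chosen arbitrarily):

import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    # Gram matrix of the Gaussian kernel between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def kernel_ridge_fit(X, y, lam=0.01):
    # alpha = (K + m*lam*I)^{-1} y minimises (1/m) * sum_i (h(x_i) - y_i)^2 + lam * ||h||_H^2.
    m = len(X)
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + m * lam * np.eye(m), y)

def kernel_ridge_predict(X_train, alpha, X_new):
    # h(x) = sum_i alpha_i K(x_i, x)
    return rbf_kernel(X_new, X_train) @ alpha

# Toy one-dimensional regression problem.
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=30)

alpha = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(X, alpha, np.array([[0.25], [0.5]])))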
Chapter 7

Gaussian processes
Contents
7.1 Formal definition . . . . . . . . . . . . . . . . . . . 120
7.2 Gaussian processes and kernel methods . . . . . 123
Suppose now we observe two datapoints (x_1, y_1) and (x_2, y_2). Then, we wish to consider only functions which pass through those two points. In Figure 7.1(b), we see functions which are consistent with the observed data (dashed lines), and the solid line depicts the mean of all functions consistent with those observations. Notice how the uncertainty is reduced close to the observations (this is because we have the prior that the functions are smooth).
Figure 7.1: Panel (a) shows four samples drawn from the prior distribution.
Panel (b) shows the situation after two datapoints have been observed. The
mean prediction is shown as the solid line and four samples from the posterior
are shown as dashed lines. In both plots, the shaded region denotes twice
the standard deviation at each input value x.
The squared exponential (SE) covariance function has the form, for r = x − x′, x, x′ ∈ X:

K_{SE}(r) = \exp\left(−\frac{r^2}{2\ell^2}\right),

with parameter ℓ ∈ R defining the characteristic length-scale, which modifies the behaviour of the Gaussian process (see Figure 7.2).
Both covariance functions, obtained from the SE and the Matérn kernels, are so-called stationary covariance functions: they are functions of x − x′ and are thus invariant to translations in the input space. The covariance functions given above decay monotonically with r and are always positive. However, this is not a necessary condition for a covariance function: for instance, the kernel

K(x, x′) = σ_0^2 + x · x′

constitutes a non-stationary covariance function.
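As an illustration (our own sketch, independent of the figures), one can draw sample paths from a zero-mean GP prior with the SE covariance function for different length-scales ℓ; smaller ℓ gives wigglier functions.

import numpy as np

def k_se(x, xp, ell):
    # Squared exponential covariance K_SE(r) = exp(-r^2 / (2 ell^2)), with r = x - x'.
    return np.exp(-((x[:, None] - xp[None, :]) ** 2) / (2.0 * ell ** 2))

rng = np.random.default_rng(0)
xs = np.linspace(-5, 5, 200)

for ell in (0.3, 1.0, 3.0):
    C = k_se(xs, xs, ell) + 1e-8 * np.eye(len(xs))        # small jitter for numerical stability
    samples = rng.multivariate_normal(np.zeros(len(xs)), C, size=4)
    print(f"ell={ell}: drew {samples.shape[0]} prior sample paths on {samples.shape[1]} inputs")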
By Theorem 16, the solution to the optimisation problem above is²
with (KXX )i,j = k(xi , xj ), y = (y1 , ..., ym )T and kXx = (k(x1 , x), · · · , k(xm , x))T
the vector of inner products between the data and the new point x.
Note: since this GP serves as a prior, the mean function µ and the kernel
k should be chosen so that they reflect one’s prior knowledge or belief about
the regression function f .
² You can verify this by considering the Lagrangian.
• yi = f (xi ) + ξi , i = 1, ..., m
• ξi ∼ N (0, σ 2 ), i = 1, ..., m.
• f ∼ GP (µ, k)
f |Y ∼ GP (µ̄, k̄),
Using the theorem above, the following equivalence holds for GP-regression
and kernel ridge regression: We have µ̄ = fKRR if σ 2 = mλ, where
Remark 25. One of the disadvantages of Gaussian processes is that the size of the covariance matrix scales as m² for a dataset of size m; furthermore, inverting the covariance matrix takes approximately O(m³) operations.³
³ The current best asymptotic complexity is O(m^{2.376}) [?].
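A quick numerical sketch of the equivalence discussed above (our own illustration): we use the usual GP posterior mean with a zero prior mean, k_Xx^T (K_XX + σ²I)^{−1} y, and the kernel ridge regression predictor with regularisation mλ, so that with σ² = mλ the two predictions coincide.

import numpy as np

def k_se(a, b, ell=0.5):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * ell ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)
y = np.sin(X) + 0.1 * rng.normal(size=20)
x_new = np.linspace(-3, 3, 5)

m, lam = len(X), 0.05
sigma2 = m * lam                       # the correspondence sigma^2 = m * lambda

K = k_se(X, X)
k_star = k_se(X, x_new)                # entries k(x_i, x_new_j)

gp_mean = k_star.T @ np.linalg.solve(K + sigma2 * np.eye(m), y)     # GP posterior mean
krr = k_star.T @ np.linalg.solve(K + m * lam * np.eye(m), y)        # kernel ridge regression

print(np.allclose(gp_mean, krr))       # True: the two predictors coincide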
Chapter 8
Deep learning
Contents
8.1 Fully connected dense neural networks . . . . . 128
8.2 Back Propagation . . . . . . . . . . . . . . . . . . 132
8.3 Approximation Theorems . . . . . . . . . . . . . 139
8.4 * Infinitely wide neural networks . . . . . . . . . 141
8.5 Beyond feed forward neural networks . . . . . . 148
8.6 Tricks of the trade . . . . . . . . . . . . . . . . . . 151
One prototypical example often used to motivate the need for neural networks is the XOR function (Figure 8.1), a non-linear function that cannot be represented by a linear model but that a simple neural network can approximate.
In this chapter, we will start with the simplest type of deep neural net-
works: the fully connected dense neural networks (section 8.1). We will see
that some notions of learning theory that we’ve encountered will be chal-
lenged, and new mathematical theory is necessary to understand why neural
networks seem to work so well in practice.
Furthermore, neural networks are (by far) the models presented in these lecture notes that require the most care when being set up and trained. You will see that there are more hyper-parameters, engineering choices and ways to successfully or unsuccessfully train a neural network. We consolidate some practical advice, the tricks of the trade, in Section 8.6.
8.1.1 Definitions

A single neuron computes

a = σ(w^T x + b),

where x ∈ R^n are the inputs, w ∈ R^n the weights, b ∈ R the bias and σ : R → R an activation function.
What is f(x) doing? It takes a vector x and applies an affine transformation to it (given by W_1, b_1); this returns a vector of dimension d_1. Then σ is applied to this vector (typically component-wise).
And for a general L-layer neural network f : Rd → RdL we can write the
definition:
Definition 8.1.2. A fully connected feedforward neural network is given by
its architecture, namely, by hyper-parameters:
x^{(0)} = x,
x^{(k+1)} = σ_{k+1}(W_{k+1} x^{(k)} + b_{k+1}),   k = 0, ..., L−1,

and f(x) = x^{(L)}.
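A minimal sketch of this forward recursion (our own code; the widths and activations are arbitrary choices):

import numpy as np

def forward(x, weights, biases, activations):
    # x^(0) = x and x^(k+1) = sigma_{k+1}(W_{k+1} x^(k) + b_{k+1}); returns x^(L).
    h = x
    for W, b, sigma in zip(weights, biases, activations):
        h = sigma(W @ h + b)
    return h

rng = np.random.default_rng(0)
d0, d1, d2 = 3, 5, 1                          # input width 3, hidden width 5, output width 1
weights = [rng.normal(size=(d1, d0)), rng.normal(size=(d2, d1))]
biases = [np.zeros(d1), np.zeros(d2)]
activations = [np.tanh, lambda z: z]          # tanh hidden layer, identity output layer

x = rng.normal(size=d0)
print(forward(x, weights, biases, activations))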
Remark 26. There are some engineering choices: L, σ_i, d_i. These fix the number of degrees of freedom we have to find for W_i, b_i.
Example 8.1.1. What if all the σ are identity functions? Then

f(x) = W_L(· · · (W_1 x + b_1) · · ·) + b_L = (W_L · · · W_1) x + (b_L + W_L b_{L−1} + · · · + W_L · · · W_2 b_1),

i.e. the network reduces to a single affine map.
From the definition of fully-connected neural networks, we see that (as soon as there is a non-linear activation) the map w ↦ f_w is non-linear. As a consequence, even when the chosen loss L on the function space is convex, the cost C on the parameter space is not. In particular, for neural networks, there is no guarantee that local minima are global minima, which makes the success of training through gradient descent not obvious.
Why? If y = 1 and fw (x) > 0 (or y = −1 and fw (x) < 0), the value
exp(−yfw (x)) is small, and then we have ℓ(fw (x), y) ≈ 0. Otherwise, in case
of misclassification, we have: log(1 + e−fw (x)y ) > log 2.
L(f_w) = −\frac{1}{m} \sum_{i=1}^{m} \left( y_i \log f_w(x_i) + (1 − y_i) \log(1 − f_w(x_i)) \right).
vector:

f_w(x) = \frac{1}{\sum_{i=1}^{n} \exp(x_i^{(L−1)})} \left( \exp(x_1^{(L−1)}), \; \cdots, \; \exp(x_n^{(L−1)}) \right)^T,
where (with a slight abuse of notation) we write, for the i-th datapoint,
yi = (yi,1 , · · · , yi,n ). Observe that datapoint xi can only belong to one class,
so yi is zero everywhere, except on the class that xi belongs to. Again, a
perfect classifier achieves 0 loss.
The non-convexity issues we talked about for regression are similar for
classification. The softmax map and the cross-entropy loss can be generalised
to tackle multi-class classification.
8.2.1 Definition

Note that with gradient-based optimisation methods, we must find a way to update each parameter of the model. We want an efficient way to write the updates for each degree of freedom we have in the neural network.
For the weights w_{ij}^{L} in the weight matrix W_L, for example, we want to compute ∂_{w_{ij}^{L}} C(w), which is given through the chain rule. For the weight w_{ij}^{L−1}, we have an analogous expression. Note that some terms of the derivative have already been computed when we wrote the update for w_{ij}^{L}.
Z_n = W_n X_{n−1},
X_n = σ_n(Z_n),

where Z_n denotes the vector of the z_i and W_n the matrix of weights w_{ij}. Furthermore, X_0 denotes the input vector. Suppose we have computed ∂C/∂X_{n,i} for i = 1, ..., d_n (the width of layer n).
\frac{\partial C}{\partial Z_n} = \frac{\partial \sigma_n(z)}{\partial z}\Big|_{z=Z_n} \circ \frac{\partial C}{\partial X_n},
\quad
\frac{\partial C}{\partial W_n} = \frac{\partial C}{\partial Z_n} (X_{n−1})^T,
\quad
\frac{\partial C}{\partial X_{n−1}} = W_n^T \frac{\partial C}{\partial Z_n};
note that \frac{\partial \sigma_n(z)}{\partial z}\big|_{z=Z_n} produces the vector of the derivative of the activation function evaluated at Z_n, and ∘ denotes the Hadamard product¹ of two vectors.
To make things clearer for the next exposition, we can write

\frac{\partial C}{\partial W_n} = \frac{\partial C}{\partial Z_n} (X_{n−1})^T
= \left( \frac{\partial \sigma_n(z)}{\partial z}\Big|_{z=Z_n} \circ \frac{\partial C}{\partial X_n} \right) (X_{n−1})^T
= \left( \frac{\partial \sigma_n(z)}{\partial z}\Big|_{z=Z_n} \circ W_{n+1}^T \frac{\partial C}{\partial Z_{n+1}} \right) (X_{n−1})^T
= \left( \frac{\partial \sigma_n(z)}{\partial z}\Big|_{z=Z_n} \circ W_{n+1}^T \left( \frac{\partial \sigma_{n+1}(z)}{\partial z}\Big|_{z=Z_{n+1}} \circ W_{n+2}^T \cdots \frac{\partial \sigma_L(z)}{\partial z}\Big|_{z=Z_L} \circ \frac{\partial C}{\partial X_L} \right) \right) (X_{n−1})^T.
¹ For a = (a_1, ..., a_n) and b = (b_1, ..., b_n), the Hadamard product is a ∘ b = (a_1 b_1, ..., a_n b_n).
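These recursions translate directly into code. Below is a minimal numpy sketch (our own, without biases, following the convention Z_n = W_n X_{n−1}, X_n = σ_n(Z_n) with σ_n = tanh): it backpropagates ∂C/∂X_L through the layers and returns the gradients ∂C/∂W_n.

import numpy as np

def forward(x0, Ws):
    # Forward pass; stores all X_n and Z_n for use in the backward pass.
    Xs, Zs = [x0], []
    for W in Ws:
        Zs.append(W @ Xs[-1])
        Xs.append(np.tanh(Zs[-1]))
    return Xs, Zs

def backward(Ws, Xs, Zs, dC_dXL):
    # Given dC/dX_L, return the list of gradients dC/dW_n, layer by layer from the top down.
    grads = [None] * len(Ws)
    dX = dC_dXL
    for n in reversed(range(len(Ws))):
        dZ = (1.0 - np.tanh(Zs[n]) ** 2) * dX     # dC/dZ_n = sigma_n'(Z_n) o dC/dX_n
        grads[n] = np.outer(dZ, Xs[n])            # dC/dW_n = dC/dZ_n (X_{n-1})^T
        dX = Ws[n].T @ dZ                         # dC/dX_{n-1} = W_n^T dC/dZ_n
    return grads

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x0, y = rng.normal(size=3), rng.normal(size=2)

Xs, Zs = forward(x0, Ws)
dC_dXL = Xs[-1] - y                               # gradient of C = 0.5 * ||X_L - y||^2
print([g.shape for g in backward(Ws, Xs, Zs, dC_dXL)])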
Neural networks are usually trained with gradient-based algorithms, the gradient being computed with backpropagation. Since they are compositions of functions, this may cause gradients to explode or vanish at early layers. Let us first look at the chain rule (for simplicity in the scalar case w ∈ R) for the derivative of a map f_L(w) = g_L(g_{L−1}(. . . g_1(w)) · · ·): let f_ℓ(w) = g_ℓ(g_{ℓ−1}(. . . g_1(w)) · · ·) for all ℓ = 1, · · ·, L and f_0(w) ≡ w. We have
With small gradients, training gets stuck and takes too long to converge; with large gradients, training jumps over minima and moves away from good solutions.
Can we even train then? Fortunately there are heuristics that lead to
practical techniques allowing stable training. We mention two of them in the
forthcoming section 8.6.2.
The way a neural network’s weights are initialized prior to training has a
crucial effect on the success of training. Indeed, suppose that weights are all
initialized to be zero. Then, the updates for the weights are given as:
\frac{\partial C}{\partial W_n} = \delta_n X_{n−1}^T = 0,

as δ_n yields a zero vector, meaning that the weights will not change during training.
More generally, if the weights are all initialized to a constant value c, then
the update yields:
\frac{\partial C}{\partial W_n} = \delta_n X_{n−1}^T = \delta_n\, \sigma_{n−1}\!\left(c\, \mathbf{1}\mathbf{1}^T X_{n−2}\right)^T = \vec{\alpha}\, \vec{\beta}^{\,T} = c_n \mathbf{1}_{d_n \times d_{n−1}},

where \vec{\alpha}, \vec{\beta} are two constant vectors. In particular, all weights at the same layer n receive the same update c_n ∈ R.
Fix i ∈ \{1, . . . , d_n\}. By independence of x_{n−1,j} and w_{ij}^n, the mean of \sum_{j=1}^{d_{n−1}} w_{ij}^n x_{n−1,j} is 0, and if that sum is close enough to its mean, then we (roughly speaking)
This is, of course, very informal but we need not make a more formal claim
for understanding the intuition. Now we get that
"dn−1 #!
X
Var [Xn ] ≈ σn′ (0)2 Var n
wij xn−1,j
j=1 i=1,...,dn
dn−1
!
X
σn′ (0)2
n
= Var wij xn−1,j , (8.4)
j=1 i=1,...,dn
where we used the independence of the w_{ij}^n x_{n−1,j} for distinct j to take the sum out of the variance. Note that the w_{ij}^n's and the x_{n−1,j}'s are independent, since x_{n−1,j} only depends on the inputs and the weights of the previous layers, which, at initialization, are assumed to be independent from those of layer n.
Note that for two independent random variables U, V , it holds that
\mathrm{Var}(UV) = E[U^2]E[V^2] − E[U]^2 E[V]^2
= (E[U^2] − E[U]^2)(E[V^2] − E[V]^2) + E[U]^2 E[V^2] + E[U^2]E[V]^2 − 2E[U]^2 E[V]^2
= \mathrm{Var}(U)\mathrm{Var}(V) + \mathrm{Var}(U)E[V]^2 + \mathrm{Var}(V)E[U]^2.
If moreover U and V have mean 0, then Var(UV) = Var(U)Var(V). Assume that the inputs are centered; then x_{n−1,j} is centered too. Hence, coming back to (8.4), we have "shown" that

\mathrm{Var}[X_n] \approx \sigma_n'(0)^2\, \mathrm{Var}[W_n]^T \mathrm{Var}[X_{n−1}].
The i-th component of the vector Var[W_n]^T Var[X_{n−1}] is a sum of d_{n−1} elements involving w_{ij}^n x_{n−1,j}. One can choose Var[w_{ij}^n] = \frac{1}{d_{n−1}}, such that, provided that σ_n'(0) = 1 (e.g. for the tanh activation function), the variance of X_n is constant across layers.
Remark 27. This heuristic is valid only for the first step of gradient descent.
After that, weights are, for example, no longer independent.
Remark 28. A similar derivation can be done for other activation functions, but some terms do not cancel out so nicely, and this will change the initialisation.
• Xavier: w_{ij}^n ∼ Unif(0, 1/\sqrt{d_{n−1}}). This is heuristically justified as above for the tanh activation function.

• Le Cun: w_{ij}^n ∼ N(0, 1/d_{n−1}).

• He: w_{ij}^n ∼ N(0, 2/d_{n−1}). It can be motivated similarly as above for the ReLU activation. A rough code sketch of these schemes is given after this list.
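Here is the sketch (our own code; we read N(0, v) as a Gaussian with variance v, and conventions such as symmetric versus one-sided uniform intervals vary between references and libraries):

import numpy as np

def init_weights(d_out, d_in, scheme, rng):
    # Returns a weight matrix W_n of shape (d_n, d_{n-1}) = (d_out, d_in).
    if scheme == "xavier":
        return rng.uniform(0.0, 1.0 / np.sqrt(d_in), size=(d_out, d_in))
    if scheme == "lecun":
        return rng.normal(0.0, np.sqrt(1.0 / d_in), size=(d_out, d_in))
    if scheme == "he":
        return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))
    raise ValueError(f"unknown scheme: {scheme}")

rng = np.random.default_rng(0)
W = init_weights(256, 128, "he", rng)
print(W.std())   # roughly sqrt(2 / 128)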
The theorem below, by Cybenko (1989) [3], is often called the universal approximation theorem for neural networks. We state a weaker version of the theorem proved in the cited paper.²

² In the original paper, the theorem only needs σ to be a discriminatory function.
Theorem 18. Let σ be the sigmoid function, that is, σ(x) = \frac{1}{1+e^{−x}}. Then finite sums of the form

G(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + b_j)

are dense, with respect to the supremum norm, in C(I_n), the space of continuous functions on I_n, where I_n denotes the n-dimensional unit cube [0, 1]^n. That is, given any f ∈ C(I_n) and ϵ > 0, there is a sum G(x) of the above form for which

|G(x) − f(x)| < ϵ    ∀x ∈ I_n.
⋆ Remark 12. The following proof requires some familiarity with functional
analysis.
Proof. Assume without proof that the sigmoid has the following property (called the discriminatory property): let µ be a finite regular signed Borel measure; if

\int_{I_n} \sigma(w^T x + \theta)\, d\mu(x) = 0 \quad \forall (w, \theta) ∈ \mathbb{R}^n × \mathbb{R},

then µ = 0.

Let S = \{ f(x) = \sum_{j=1}^{N} \alpha_j \sigma(w_j^T x + b_j) : N ∈ \mathbb{N},\ \alpha_j, b_j ∈ \mathbb{R},\ w_j ∈ \mathbb{R}^n,\ j = 1, ..., N \}.
What the above theorem tells us is that neural networks can approximate
any arbitrary continuous function, similar to the Stone-Weierstrass theorem
for polynomials. It is worth noting that this fact is true for deeper networks
as well: conditioning on the output of layer L − 2 and considering it as the
input layer, the layers L − 2, L − 1 and L can be seen as a two-layer neural
network and one can directly apply the above theorem.
A similar result was obtained by Kurt Hornik (1991) [7]. The result above also extends to classification tasks. The same result also holds for more general sigmoidal functions and for the ReLU function.³
To simplify the notation, consider a neural network f_w with scalar input and output. Suppose moreover that it has a single hidden layer of width d ∈ N and with the identity activation function between the hidden and output layer,

³ A similar type of statement to the Weierstrass approximation theorem.
Sketch of proof. The main idea is to use the central limit theorem (Theorem 2). Indeed, one notes that in (8.5), the output f_w(x) corresponds to a sum of d i.i.d. variables and the convergence to a Gaussian distribution follows from the central limit theorem. (One needs to make sure that σ(X) has a second moment for X a Gaussian variable, which is the case since σ is Lipschitz.)
Knowing that infinitely wide neural networks with the above initialization
are Gaussian processes is informative about the prior knowledge we inject in
our model, but does not provide any insight on what happens during training.
In the next two sections, we present two recent techniques that will allow us
to say more about training.
get

f_{w_{t+1}}(x) − f_{w_t}(x) ≈ −\frac{\eta}{m} \sum_{i=1}^{m} \partial_{\hat y} \ell(y_i, \hat y)\big|_{\hat y = f_{w_t}(x_i)} \; \nabla_w f_{w_t}(x_i) \cdot \nabla_w f_{w_t}(x).    (8.7)
It turns out that the dot product on the right-hand side defines a time-dependent kernel

\Theta_t^{(d)}(x, x′) := \nabla_w f_{w_t}(x) \cdot \nabla_w f_{w_t}(x′),

where we made the dependency on the width d explicit. The kernel \Theta_t^{(d)} is called the Neural Tangent Kernel (NTK) of the neural network at time t. Because the initialization is random, the NTK at initialization \Theta_0^{(d)} is itself a random kernel. Moreover, the fact that its dynamics in time depends on the training makes the exact dynamics of the network f_{w_t} intractable.
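To make the definition concrete, here is a small numpy sketch (our own) of the empirical NTK for a single-hidden-layer network f_w(x) = d^{−1/2} Σ_j a_j tanh(b_j x + c_j); this explicit parametrisation is our own choice and may differ from (8.5). The gradient with respect to all parameters is written out by hand and the NTK is the dot product of two such gradients. Running it for increasing widths illustrates the concentration stated in Theorem 20.

import numpy as np

def grad_fw(x, a, b, c):
    # Gradient of f_w(x) = (1/sqrt(d)) * sum_j a_j * tanh(b_j * x + c_j) w.r.t. (a, b, c).
    d = len(a)
    pre = b * x + c
    s, ds = np.tanh(pre), 1.0 - np.tanh(pre) ** 2
    return np.concatenate([s, a * ds * x, a * ds]) / np.sqrt(d)

def empirical_ntk(x, xp, a, b, c):
    # Theta_t^(d)(x, x') = <grad_w f_w(x), grad_w f_w(x')>
    return grad_fw(x, a, b, c) @ grad_fw(xp, a, b, c)

rng = np.random.default_rng(0)
for d in (10, 100, 10000):
    a, b, c = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)  # i.i.d. N(0, 1) init
    print(d, empirical_ntk(0.3, -0.7, a, b, c))   # fluctuates less and less as d grows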
However, there is a very important result from Jacot et al [8] then gen-
eralized in many ways (e.g. in Yang [13]) that gives us what we want:
Theorem 20. There exists a deterministic kernel Θ(∞) : R × R → R such
that, in the setup above, for σ Lipschitz, twice differentiable with bounded
second derivative, for all x, x′ ∈ R and t ∈ R+ , it holds with probability 1
that
\lim_{d→∞} |\Theta_t^{(d)}(x, x′) − \Theta^{(∞)}(x, x′)| = 0,
where K (1) and K (2) are the kernels defined in Theorem 19, and K̇ (2) is
defined as K (2) with the derivative σ ′ instead of σ in its definition.
Remark 29. The assumption that the neural network has a single hidden
layer is superfluous and we make it solely for the sake of presentation.
⋆ Remark 13. One can show that Θ(∞) is positive semi-definite when the
input data {xi : i = 1, . . . , m} lie on the sphere. (In our case – scalar inputs
– it does not make much sense, however it does in the general d-dimensional
case.)
Many consequences can be drawn from Theorem 20; we now loosely dis-
cuss the most straightforward.
Convergence. Suppose that the loss is the mean square loss ℓ(y, y′) = \frac{1}{2}(y − y′)^2. For a vanishing learning rate η and as the number of steps of gradient descent tends to ∞ as 1/η,⁴ the training trajectory becomes

\partial_t f_{w_t}(x) = −\frac{1}{m} \sum_{j=1}^{m} \Theta^{(∞)}(x, x_j)\,(f_{w_t}(x_j) − y_j).
where \Theta^{(∞),−1} is the inverse of \Theta^{(∞)}. We see that, choosing x in the dataset, i.e. equal to some x_i, we get f_{w_∞}(x_i) = y_i, so the network perfectly fits the data and the empirical loss is zero.
⁴ This corresponds to gradient flow, as discussed in the optional Section 3.5.2, which can be seen as the continuous version of gradient descent.
It is worth noting that this Gaussian process does not correspond to the
Bayesian posterior given the prior fw0 unless one subtracts the initial output
function fw0 ; see e.g. [6] for more details on this.
Some limitations. The NTK regime (also called kernel regime, or lazy
regime) does not fully describe neural networks behavior for several reasons.
The first one is that neural networks used in practice do not contain an infi-
nite number of neurons... But we won’t be too concerned by that objection.
Another criticism of the NTK regime is that the kernel is fixed, a consequence of the fact that individual weights asymptotically do not move during training: the map f_{w_t} evolves during training as the result of infinitely many infinitesimal changes in the weights of the neurons. Loosely speaking, this causes the
network to not learn any features in the data, akin to a kernel method with
given kernel which fits the data as a linear combination of feature map of
the kernel, without trying to learn these features at any point. However, fea-
ture learning can be a crucial characteristic of successful models in practice;
see e.g. [4] for a language representation model using pre-training to learn
important features of the data.
In this section we stay very informal to present another training regime where
theoretical guarantees can be obtained.
We first note that in the infinite width limit d → ∞, the infinite network at
initialization is null because the weights are i.i.d. Gaussian and the law of
large numbers tells us that for Z1 , Z2 , Z3 i.i.d. N (0, 1),
f_w(x) = \int_{\mathbb{R}^3} w_2\, \sigma(w_1 x + b)\, \mu(dw_1, dw_2, db) = E[Z_1 \sigma(Z_2 x + Z_3)] = E[Z_1]\, E[\sigma(Z_2 x + Z_3)] = 0,

where µ := \lim_{d→∞} \mu_d.
• Pooling layer: maxpool (one defines the size of the pooling window), which returns the maximum value within each window.
This operation performs downsampling and selects the "dominant" pixel (i.e. the pixel with the largest value). We note that similar images (say one is a slight shift of another) yield similar down-sampled images.
Both the mapping G, called generator, and discriminator fdisc are neural
networks (with their own set of parameters) which we will train.
For a given generator G, maxfdisc ℓ(G, fdisc ) will optimise the discriminator
fdisc to reject samples G(z), by assigning high values to samples from the
distribution D and low values to the generated samples G(z). Whereas for a
given discriminator fdisc , minG ℓ(G, fdisc ) optimises G so that the generated
samples G(z) will attempt to fool the discriminator fdisc into assigning high
values.
E_{x∼D}[\log f_{disc}(x)] ≈ \frac{1}{m_1} \sum_{i=1}^{m_1} \log f_{disc}(x_i),
E_{z∼γ}[\log(1 − f_{disc}(G(z)))] ≈ \frac{1}{m_2} \sum_{i=1}^{m_2} \log(1 − f_{disc}(G(z_i))),

where x_1, ..., x_{m_1} come from the training set X and z_1, ..., z_{m_2} from a dummy distribution γ.
The optimisation is done in two steps: first, update the discriminator f_disc by taking the gradient with respect to the parameters of f_disc and computing a gradient ascent update; then, update the generator G by taking the gradient with respect to the parameters of G and computing a gradient descent update.
GANs have become very powerful tools for generative modelling, where we want to generate samples similar to a given dataset, for example new faces, new rooms, etc.⁵ Not only this: GANs can be used to solve problems in which we cannot easily formalise a loss function, or in situations where we do not have access to a loss function or are unable to compute gradients of the loss function.
This section is loosely based on the lecture series [10]: advice that has come from many years of theory and experimentation and that has led to substantial differences in terms of speed, ease of implementation and accuracy when it comes to putting algorithms to work in practice. A lot of the information in these lecture series is already implemented (by default) in machine learning libraries (e.g. tensorflow, keras, torch).
What are some things that can go wrong when using neural network models?⁶
X′ = \frac{X − \hat{X}}{\sigma_X^2}.
• Average of each input over the training set should be close to zero;
• Scale input variables so that their covariances are about the same;
w_{ij}^{t+1} = w_{ij}^{t} − η \frac{\partial L}{\partial w_{ij}}, \quad \text{where} \quad \frac{\partial L}{\partial w_{ij}^n} = x_j^{n−1} \frac{\partial L}{\partial z_i^n}.⁷

⁷ You can verify this by the chain rule.
Let us focus on the first layer, n = 1. The updates for the weights corresponding to a particular output node i (corresponding to row i of the weight matrix W^1) are proportional to x_j^0 \frac{\partial L}{\partial z_i^1}. If all components of an input vector are positive, all the updates of the weights that feed into node i will have the same sign, namely sign(\frac{\partial L}{\partial z_i^1}). This means those weights can only all decrease or all increase together for a given input. If the weight vector must change direction, it can only do so by zigzagging, which is inefficient and slow.
Normalising the inputs is not only a concern for the speed of convergence but really for the trainability of the network. Due to finite numerical precision, numerical overflow can occur, turning gradient updates into NaN updates.
• if ||∇_w C(w)|| > ϵ, update the parameters with ϵ \frac{∇_w C(w)}{||∇_w C(w)||},
A neural network can then be composed of many layers, some of them being
BN.
Despite the several heuristics and the empirical success of batch normal-
ization, there is, for now, no very robust theory nor consensus on the different
effects of batch normalization.
Regularisation
Early stopping
The rationale behind early stopping is that when we decide on the number of epochs to train a network, it is usually not a very well-informed decision. The idea is to monitor the training loss and a validation loss, stopping when the validation loss no longer decreases (or starts to increase); a minimal code sketch is given after the following steps.
1. Split the training data into a training set and a validation set, e.g. in
a 2-to-1 proportion.
2. Train only on the training set and evaluate the per-example error on
the validation set once in a while, e.g. after every fifth epoch.
3. Stop training as soon as the error on the validation set is higher than
it was the last time it was checked.
4. Use the weights the network had in that previous step as the result of
the training run.
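Schematically, the procedure reads as follows (a sketch only: train_one_epoch, validation_loss and the get_weights/set_weights hooks are placeholders for whatever model and data pipeline is used, and we check the validation loss every fifth epoch as suggested above):

def early_stopping_train(model, train_one_epoch, validation_loss,
                         max_epochs=200, check_every=5):
    # Train until the validation loss stops improving; return the model with the best weights seen.
    best_val, best_weights = float("inf"), model.get_weights()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)                    # one pass over the training set
        if epoch % check_every == 0:
            val = validation_loss(model)          # per-example error on the validation set
            if val > best_val:                    # got worse since last check: stop and roll back
                model.set_weights(best_weights)
                return model
            best_val, best_weights = val, model.get_weights()
    return model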
8.6.4 Dropout

Another way to regularise a network is to turn off neurons at random while training. This method is called dropout. It works as follows:
The reason why we rescale the weights when predicting is that, when the weights are updated, the sub-networks used contain on average a proportion p of the neurons; hence their weights tend to be bigger than they should be when using all of them together. Think about the sum of two perfect predictors: the resulting predictor performs poorly unless we divide its output by 2.
Dropout may also refer to other similar methods where individual weights
are dropped out instead of individual neurons.
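A sketch of dropout applied to one layer's activations (our own code): during training each neuron is kept with probability p and dropped otherwise; at prediction time all neurons are used and the activations (equivalently, the outgoing weights) are rescaled by p, as explained above.

import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(activations, p=0.8, training=True):
    # Keep each neuron with probability p during training; rescale by p at prediction time.
    if training:
        mask = rng.random(activations.shape) < p      # 1 = keep, 0 = drop
        return activations * mask
    return activations * p                            # all neurons used, scaled down

h = np.ones(10)
print(dropout_layer(h, p=0.8, training=True))    # some entries zeroed out at random
print(dropout_layer(h, p=0.8, training=False))   # every entry scaled by p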
Then we also have training hyperparameters, which specify how the training algorithm A behaves: number of epochs, batch size, learning rate (momentum, etc.), training set / validation set split, loss function (type, ℓ1/ℓ2 penalties). Many times, good hyperparameters come from experience, trial and error, theoretical justifications or empirical results. There are also computational frameworks that can help explore this large hyperparameter space, such as hyperopt [?].
Chapter 9
Ensemble methods
Contents
9.1 Weak learner . . . . . . . . . . . . . . . . . . . . . 160
9.2 Adaboost . . . . . . . . . . . . . . . . . . . . . . . 164
9.2.1 * A sufficient condition for weak-learnability . . . 168
9.2.2 Connections to other models . . . . . . . . . . . . 169
9.2.3 * Gradient Boosting . . . . . . . . . . . . . . . . . 172
9.3 Boosting regression . . . . . . . . . . . . . . . . . 172
The term ensemble methods is usually reserved for methods that generate a model by aggregating base learners. There are different ways to approach the construction of ensemble methods. In this chapter, we will focus on boosting.
• The first is the bias-complexity tradeoff. We have seen that the gen-
eralisation error of an ERM learner can be decomposed into a sum of
approximation error and estimation error, as described in equa-
tion (4.2). The more expressive the hypothesis class the learner is
searching over, the smaller the approximation error is, but the larger
the estimation error becomes. A learner is thus faced with the prob-
lem of picking a good trade-off between these two considerations. The
boosting paradigm allows the learner to have smooth control over this
trade-off. The learning starts with a basic class (that might
have a large approximation error), and as it progresses the
class that the predictor may belong to grows richer.
Figure 9.1 illustrates how properly aggregating base learners can solve a task on which the base learners, taken individually, do not perform particularly well.
Note that this definition is almost identical to the definition of PAC learning (which here we will call strong learning) shown in Section 3.7, with one crucial difference: strong learnability implies the ability to find an arbitrarily good classifier, with error rate at most ϵ for an arbitrarily small ϵ, when considering the non-agnostic case. In weak learnability, however, we only need to output a hypothesis whose error rate is at most 1/2 − γ for a fixed γ > 0, namely, whose error rate is slightly better than what a random labelling would give us. The hope is that it may be easier to come up with efficient weak learners than with efficient (full) PAC learners.
For simplicity, let us assume b = 1 (this flips the label ±1). Let S =
{(x1 , y1 ), ..., (xm , ym )} be a training set. We will show how to implement an
ERM rule, namely, how to find a decision stump that minimizes L(h).
L_D(f_w) = \sum_{i=1}^{m} D_i\, 1_{f_w(x_i) \neq y_i}.
Note that here we just wrote L_D(f_w) in such a way that we account for the mislabelling both when the true label is positive and the prediction is negative, and when the true label is negative and the prediction is positive. What we want to do now is to show that this can be further simplified, eventually yielding an easy minimization problem.
Fix j ∈ [d] and sort the training examples such that x_{1,j} ≤ x_{2,j} ≤ · · · ≤ x_{m,j}. Let us define the set Θ_j = \{ \frac{x_{i,j} + x_{i+1,j}}{2} : i ∈ [m−1] \} ∪ \{x_{1,j} − 1, x_{m,j} + 1\}. This essentially sets up a grid such that, for any θ ∈ R, there exists θ′ ∈ Θ_j that yields the same predictions on the sample S. Then, we can minimize over θ ∈ Θ_j.
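A sketch of this ERM rule for decision stumps (our own code, taking stumps of the form h(x) = b · sign(θ − x_j) with b ∈ {±1}): for each feature j we scan the thresholds in Θ_j and keep the triple (j, θ, b) with the smallest weighted error.

import numpy as np

def fit_stump(X, y, D):
    # Weighted ERM over decision stumps h(x) = b * sign(theta - x_j).
    best_err, best_stump = np.inf, None
    for j in range(X.shape[1]):
        xs = np.sort(X[:, j])
        thetas = np.concatenate([[xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]])
        for theta in thetas:
            for b in (+1, -1):
                pred = b * np.sign(theta - X[:, j])
                err = D[pred != y].sum()          # L_D = sum_i D_i 1{h(x_i) != y_i}
                if err < best_err:
                    best_err, best_stump = err, (j, theta, b)
    return best_stump

def stump_predict(X, stump):
    j, theta, b = stump
    return b * np.sign(theta - X[:, j])

# Tiny example with uniform weights D_i = 1/m.
X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([1, 1, -1, -1])
print(fit_stump(X, y, np.full(4, 0.25)))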
9.2 Adaboost
which is at most 1/2 − γ for a fixed γ ∈ (0, 1/2)² that does not depend on t. This can be seen as a linear combination of the weak learners h_t. We will see in the proof of the forthcoming Theorem 22 that this choice of w_t is "optimal" in some sense.
At the end of the round, AdaBoost updates the probability vector D(t) so
that examples on which ht is wrong will get a higher probability mass while
examples on which ht is correct will get a lower probability mass. This gives
more importance to the points that ht misclassifies, for the next weak learner
to focus on.3
Remark 33. There are two quantities which play the role of a weight. One is w_t, which weights the contribution of the weak learner h_t in the final model f_w. The other, given by the probability vector D^{(t)}, gives a weight to each data point.
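Putting the two kinds of weights together, here is a compact sketch of AdaBoost (our own code, using as weak learner the decision stump routine sketched earlier in this chapter and the standard choice w_t = ½ log((1 − ε_t)/ε_t), which is the choice alluded to above):

import numpy as np

def adaboost(X, y, weak_fit, weak_predict, T=20):
    # Returns the list of pairs (w_t, h_t) defining f_T = sum_t w_t h_t.
    m = len(y)
    D = np.full(m, 1.0 / m)                                # initial probability vector D^(1)
    ensemble = []
    for _ in range(T):
        h = weak_fit(X, y, D)
        pred = weak_predict(X, h)
        eps = np.clip(D[pred != y].sum(), 1e-12, 1 - 1e-12)  # weighted error eps_t (guarded)
        w = 0.5 * np.log((1 - eps) / eps)                  # weight of the weak learner h_t
        D = D * np.exp(-w * y * pred)                      # up-weight misclassified points
        D = D / D.sum()                                    # normalise by Z_t
        ensemble.append((w, h))
    return ensemble

def adaboost_predict(X, ensemble, weak_predict):
    scores = sum(w * weak_predict(X, h) for w, h in ensemble)
    return np.sign(scores)

# Example usage with the decision stump sketch above:
#   ensemble = adaboost(X, y, fit_stump, stump_predict, T=10)
#   y_hat = adaboost_predict(X, ensemble, stump_predict)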
Theorem 22. Let S be a training set and assume that at each iteration of
AdaBoost, the weak learner returns a hypothesis for which ϵt ≤ 1/2 − γ.
Then, the training error of the output hypothesis of AdaBoost is at most
L(h_S) = \frac{1}{m} \sum_{i=1}^{m} 1_{h_S(x_i) \neq y_i} ≤ \exp(−2γ^2 T).
Proof. For each round t, let f_t = \sum_{k≤t} w_k h_k, so that the output of AdaBoost is H_T := sign(f_T). In addition, let

Z_t = \sum_{i=1}^{m} D_i^{(t)} e^{−y_i w_t h_t(x_i)},

which is the normalisation factor so that D_i^{(t+1)}, as defined in the AdaBoost algorithm described in Figure 9.2, is indeed a probability distribution.

Unrolling the recurrence used to update D^{(T+1)}, for all i = 1, ..., m, we can write

D_i^{(T+1)} = D_i^{(1)} \prod_{t=1}^{T} \frac{\exp(−w_t y_i h_t(x_i))}{Z_t} = \frac{1}{m} \times \frac{\exp(−y_i \sum_{t=1}^{T} w_t h_t(x_i))}{\prod_{t=1}^{T} Z_t} = \frac{1}{m} \times \frac{\exp(−y_i f_T(x_i))}{\prod_{t=1}^{T} Z_t}.
Note that 1_{H_T(x) \neq y} ≤ e^{−y f_T(x)}, since x is misclassified by H_T if and only
We now rewrite Z_t as

Z_t = \sum_{i=1}^{m} D_i^{(t)} \exp(−w_t y_i h_t(x_i)) = \sum_{i : y_i = h_t(x_i)} D_i^{(t)} \exp(−w_t) + \sum_{i : y_i \neq h_t(x_i)} D_i^{(t)} \exp(w_t) = (1 − ϵ_t) \exp(−w_t) + ϵ_t \exp(w_t).
Z_t ≤ \sqrt{1 − 4γ^2},

recalling that g(x) = x(1 − x) is monotonically increasing on [0, 1/2]. We have thus proven that

R_S(H_T) ≤ (1 − 4γ^2)^{T/2}.
To conclude, recall the fact that for all x ∈ R, the exponential function
satisfies 1 + x ≤ exp(x) and use it so that 1 − 4γ 2 ≤ exp(−4γ 2 ), which entails
that
RS (HT ) ≤ exp(−2γ 2 T ),
as claimed.
Thanks to the above theorem, we see that even though each weak learner
performs only slightly better than a purely (uniform) random guess, Ad-
aBoost is able to choose a linear combination of them that yields a predictor
whose error on the dataset decreases exponentially fast in the number of
weak learners.
⋆ Remark 16. However, what we really care about is the true risk of the
output hypothesis, i.e. the generalisation error. It turns out that
R_D(h) ≤ R_S(h) + O\!\left( \sqrt{\frac{T\, VC(H)}{m}} \right).
Let all the weak hypotheses belong to some hypothesis class H. Suppose our training sample S is such that for some weak hypotheses g_1, · · ·, g_k from H, and for some nonnegative coefficients a_1, · · ·, a_k with \sum_{j=1}^{k} a_j = 1, and for some θ > 0, it holds that

y_i \sum_{j=1}^{k} a_j g_j(x_i) ≥ θ    (9.2)

for each example (x_i, y_i) in S. This condition implies that y_i can be computed by a weighted majority vote of the weak hypotheses:

y_i = \mathrm{sign}\!\left( \sum_{j=1}^{k} a_j g_j(x_i) \right),
where wt are computed by the boosting algorithm and weight the contribu-
tion of each respective ht . Their effect is to give higher influence to the more
accurate classifiers in the sequence.
In this section, we will first show that AdaBoost fits an additive model
in a base learner, optimising a novel exponential loss function. Then, we
will develop a class of gradient boosted models (GBMs), for boosting weak
learners for any loss function.
f(x) = \sum_{k=1}^{M} β_k\, b(x; γ_k),

where the β_k are expansion coefficients and b(x; γ) ∈ R are usually simple functions of the multivariate argument x, characterised by a set of parameters γ. E.g. for decision stumps, γ parametrises the split variables and split points. For example, additive expansions like this can describe single-hidden-layer neural networks.
At each iteration m, one solves for the optimal basis function b(x; γ_m) and the corresponding coefficient β_m to add to the current expansion f_{m−1}(x). This produces f_m(x), and the process is repeated. Previously added terms are not modified.⁴
⁴ It can be shown, for the squared loss, that

L(y_i, f_{m−1}(x_i) + β b(x_i; γ)) = (y_i − f_{m−1}(x_i) − β b(x_i; γ))^2 = (r_{im} − β b(x_i; γ))^2,

where r_{im} = y_i − f_{m−1}(x_i) is simply the residual of the current model on the i-th observation. Thus, for the squared-error loss, the term β_m b(x; γ_m) that best fits the current residuals is added to the expansion at each step.
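For the squared-error loss this stagewise procedure is easy to sketch (our own code, using a least-squares regression stump as the base learner b(x; γ) and adding a small learning rate, i.e. shrinkage, which is common in practice although not part of the derivation above):

import numpy as np

def fit_reg_stump(X, r):
    # Least-squares regression stump: threshold split with a constant prediction on each side.
    best_sse, best = np.inf, None
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            left = X[:, j] <= theta
            if left.all() or (~left).all():
                continue
            cl, cr = r[left].mean(), r[~left].mean()
            sse = ((r[left] - cl) ** 2).sum() + ((r[~left] - cr) ** 2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, theta, cl, cr)
    return best

def reg_stump_predict(X, stump):
    j, theta, cl, cr = stump
    return np.where(X[:, j] <= theta, cl, cr)

def gradient_boost_ls(X, y, M=100, lr=0.1):
    # Forward stagewise fitting: at each step, fit the base learner to the current residuals.
    f, stumps = np.zeros(len(y)), []
    for _ in range(M):
        r = y - f                                     # residuals r_im = y_i - f_{m-1}(x_i)
        stump = fit_reg_stump(X, r)
        f = f + lr * reg_stump_predict(X, stump)      # previously added terms are not modified
        stumps.append(stump)
    return stumps, f

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=100)
stumps, f = gradient_boost_ls(X, y)
print(np.mean((y - f) ** 2))     # training error decreases as M grows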
Although we do not go into detail in class about how this works, regression
with trees is also possible, generating a hypothesis of the form:
h(x) = \sum_{t=1}^{T} w_t h_t(x).
[1] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal.
Reconciling modern machine-learning practice and the classical
bias–variance trade-off. Proceedings of the National Academy of Sci-
ences, 116(32):15849–15854, 2019.
[2] Lénaïc Chizat and Francis Bach. On the global convergence of gradi-
ent descent for over-parameterized models using optimal transport. In
S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett, editors, Advances in Neural Information Processing
Systems, volume 31. Curran Associates, Inc., 2018.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: Pre-training of deep bidirectional transformers for language un-
derstanding. arXiv preprint arXiv:1810.04805, 2018.
[6] Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep
ensembles via the neural tangent kernel. Advances in neural information
processing systems, 33:1010–1022, 2020.
[8] Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent
kernel: Convergence and generalization in neural networks. In Samy
Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò
Cesa-Bianchi, and Roman Garnett, editors, NeurIPS, pages 8580–8589,
2018.
[9] Radford M Neal. Priors for infinite networks. In Bayesian Learning for
Neural Networks, pages 29–53. Springer, 1996.
[13] Greg Yang. Scaling limits of wide neural networks with weight sharing:
Gaussian process behavior, gradient independence, and neural tangent
kernel derivation. 2019.
[14] Greg Yang and Edward J Hu. Feature learning in infinite-width neural
networks. arXiv preprint arXiv:2011.14522, 2020.