


Mathematical Foundations of Machine
Learning

Lecture notes by:


Maria Han Veiga
François Gaston Ged

v0.0.1

November 19, 2022



These notes are based on the class MATH498: Modern Topics in Mathematics – Mathematical Foundations of Machine Learning at the University of Michigan, Fall 2021, taught by Maria Han Veiga.
Contents

1 Introduction

2 Probability review
   2.1 Motivation
   2.2 Probability space
   2.3 Independence and Conditioning
   2.4 Random Variables
   2.5 Discrete Random Variables
      2.5.1 Definition
      2.5.2 Examples of discrete random variables
      2.5.3 Multiple random variables (marginal, joint, conditional, independence)
      2.5.4 Sequence of discrete random variables
   2.6 Continuous Random Variables
      2.6.1 Definitions
      2.6.2 Some examples of continuous random variables
      2.6.3 Multiple random variables (marginal, joint, conditional, independence)
      2.6.4 Sequence of continuous random variables
   2.7 Moments
      2.7.1 Non-negative random variables
      2.7.2 The case of signed random variables
   2.8 Samples from an unknown distribution
      2.8.1 Law of large numbers and central limit theorem
      2.8.2 Estimating an unknown probability
   2.9 Resources

3 Introduction to Machine Learning
   3.1 Different paradigms
   3.2 Supervised learning
      3.2.1 Set-up
      3.2.2 Parameters and training
   3.3 Model selection
      3.3.1 Underfitting and overfitting
      3.3.2 Validation set and cross-validation
   3.4 No free lunch theorem
   3.5 Optimisation
      3.5.1 Gradient descent
      3.5.2 Gradient flow*
   3.6 ML pipeline in practice
   3.7 List of tasks

4 Statistical learning theory
   4.1 Learnability
      4.1.1 Realisability assumption
      4.1.2 PAC learnability
      4.1.3 Agnostic PAC learnability
   4.2 Finite-sized hypothesis classes
      4.2.1 PAC learnability
      4.2.2 Agnostic PAC learnability
   4.3 Infinite sized hypothesis classes *
   4.4 Bias-complexity tradeoff and Bias-variance tradeoff
      4.4.1 Existence of noise
      4.4.2 Bias-complexity trade-off
      4.4.3 Bias-Variance tradeoff

5 Linear Models
   5.1 Linear Regression
      5.1.1 Cost function choice
      5.1.2 Explicit solution
      5.1.3 Regularization
      5.1.4 Representing nonlinear functions using basis functions
   5.2 Classification
      5.2.1 Perceptron algorithm
      5.2.2 Support Vector Machine
      5.2.3 Detour: Duality theory of constrained optimisation
      5.2.4 Non-separable case

6 Kernel methods
   6.1 Inner products and kernels
   6.2 Reproducible kernel Hilbert spaces
   6.3 Mercer’s Theorem
   6.4 Representer theorem
   6.5 Kernel (ridge) regression

7 Gaussian processes
   7.1 Formal definition
      7.1.1 Kernel (covariance functions)
      7.1.2 Squared exponential covariance function
      7.1.3 Matérn class of covariance functions
      7.1.4 Dot product covariance functions
   7.2 Gaussian processes and kernel methods

8 Deep learning
   8.1 Fully connected dense neural networks
      8.1.1 Definitions
      8.1.2 Loss functions
   8.2 Back Propagation
      8.2.1 Definition
      8.2.2 Exploding and vanishing gradients
      8.2.3 Common initialization schemes
   8.3 Approximation Theorems
   8.4 * Infinitely wide neural networks
      8.4.1 Initialization scale
      8.4.2 The Neural Tangent Kernel
      8.4.3 Mean Field regime
   8.5 Beyond feed forward neural networks
      8.5.1 Convolutional neural networks
      8.5.2 Generative Adversarial Neural Networks
   8.6 Tricks of the trade
      8.6.1 Input regularisation
      8.6.2 Stabilising the gradients
      8.6.3 Preventing overfitting
      8.6.4 Dropout
      8.6.5 Dealing with hyper-parameters

9 Ensemble methods
   9.1 Weak learner
   9.2 Adaboost
      9.2.1 * A sufficient condition for weak-learnability
      9.2.2 Connections to other models
      9.2.3 * Gradient Boosting
   9.3 Boosting regression


Chapter 1

Introduction

These notes are aimed at being an introduction to machine learning, with a stronger focus on the mathematics behind many of the algorithms and techniques. While they are based on a math course, I still want to give you the opportunity to work hands-on, so it is important that we take a computer and also learn how to solve specific machine learning problems. I am not sure what it is that you want to do in your career or in the future, so I want these notes to give you the tools to be able to understand machine learning, read ML papers, both theoretical and applied, as well as use ML as another tool that you can use to solve problems, similarly to approximation theory, calculus, or analysis.

I hope these notes leave you well equipped to speak about ML, inspire you to start doing research in ML, or to apply for jobs that ask for ML. We will focus both on theory and on applied machine learning.

This second iteration of the notes is a work in progress and a collaboration between Maria Han Veiga (PhD Applied Mathematics) and François Ged (PhD Probability Theory). We write these notes in a way that we believe is helpful for mathematicians to understand the fundamental principles of Machine Learning. Because these notes are aimed at senior undergraduates and early graduate students of Mathematics, some content will appear under remarks of two kinds:

⋆ Remark 1. Star remarks denote remarks on possibly relevant mathematical information / more formal theory, which is not part of the scope of the course.

and

Hand Wavy Remark 1. Hand-wavy remarks denote remarks that are more intuition / informal observations.

You can imagine who is more likely to use which type of remark.
Chapter 2

Probability review

Contents
2.1 Motivation
2.2 Probability space
2.3 Independence and Conditioning
2.4 Random Variables
2.5 Discrete Random Variables
2.6 Continuous Random Variables
2.7 Moments
2.8 Samples from an unknown distribution
2.9 Resources

In our journey to understand machine learning, we will encounter several


sources of randomness, such as those coming from the collected data, which
are usually random observations (i.e. samples) of some unknown probability
distribution, the initial parameters of the model we will use to make predic-
tions, or the intrinsic randomness of some training algorithms. To build a
solid theory, we need some knowledge of probability theory and this is what
this chapter is about.


2.1 Motivation

Let us consider the task of binary classification, where we wish to learn a


mapping from inputs x ∈ X (we also call these features) to outputs Y =
{−1, +1} 1 . We can formalise the problem as a function approximation
problem. Given a labeled training set {(xi , yi ); i = 1 · · · m}, we assume that
there is some unknown function f : X → Y such that f (xi ) = yi for all
i = 1 · · · m, and the goal of learning is to approximate the function f by
a function fˆ : X → {−1, 1}. Then, for any input x ∈ X , one can make a
prediction of its label using ŷ = fˆ(x). This is what it is called a discriminative
model, as its goal is to discriminate between different classes.

One can think of spam detection, in this case, x is an email (or represen-
tation of an email) and −1 denotes spam, +1 denotes not spam.

However, to set a milder decision rule, one might prefer to estimate the
probability that the email x is a spam, and only warn the user the email is
potentially a spam if this probability is larger than some chosen threshold.
Having a probability estimate of class belonging is even more important when
the number of classes is larger than two.

Instead of discriminating whether x belongs to some class y, one might


want to create an object x that belongs to a given class y. This is the goal
of a generative model that tries to learn the conditional probability p(x|y)
instead.

Besides the possibility of making stochastic predictors and stochastic gen-


erators, other sources of randomness are more broadly encountered when
practicing Machine Learning. Namely, it is commonly assumed that the
dataset is randomly generated by some unknown probability distribution.
On the other hand, by using parametric models, one needs to set a value for
the initial parameters before training, and it is common to initialise them
at random values. Finally, the training procedure (i.e. the update of the
parameters to fit the data) itself can include some randomness. Hence, a
good understanding of machine learning requires some basic knowledge of
probability theory.
1 We will define more rigorously what X is later on.

2.2 Probability space

Probability theory is a mathematical framework that allows us to reason


about phenomena or experiments under uncertainty. A probabilistic model is
a mathematical model of a probabilistic experiment that satisfies the axioms
of probability theory, and allows us to calculate probabilities as well as to
reason about the outcomes of an experiment.

Definition 2.2.1. A probability space is a triplet (Ω, F, P ), where Ω is an


arbitrary set, F is a σ-field of subsets of Ω, and P is a measure on F such
that
P (Ω) = 1,
called probability measure (or in short, probability).

This means that:

• Ω is called the sample space, it is the set of all possible outcomes,

• F is a collection of subsets of Ω which is a σ-field2 ,

• the probability P maps elements of F (subsets of Ω) onto the real


interval [0, 1],

• an element A ∈ F is called an event, and P (A) ∈ [0, 1] is the probability


that A occurs.

Example 2.2.1. Suppose we toss a fair coin twice and we observe the outcome
(the two tosses are independent). We have

• Ω = {HH, T T, HT, T H}

• F = {{}, {HH}, {T T }, {HT }, {T H}, {HH, T T }, {HH, HT }, ..., Ω},

|F| = 2^|Ω| = 2^4 = 16
2 It must contain Ω, and it must be closed by complementation and by countable unions to be a σ-field; we will not manipulate σ-algebras in this class and therefore we do not dwell on it, to keep the focus on the necessary concepts.

Figure 2.1: Schematic of sample space Ω, event A and probability of A, P (A).

Events that depend on the outcome of the experiment can be written as


elements of the σ-field F. For example, the event “having exactly one head
during those two tosses” is the element {HT, T H}. Probabilities of events in
F are assigned by the probability measure P . We have for example P ({}) =
0 (always true for any probability measure by definition), and because all
outcomes are equally likely (the coin is fair), we have
P (“getting exactly one head”) = (# outcomes with exactly one head) / (# possible outcomes) = 2/4 = 1/2.
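In code, one can reproduce this counting argument with a short Python sketch: enumerate the sample space of two fair coin tosses, form the event as a subset, and divide the number of favourable outcomes by the total number of outcomes.

```python
from itertools import product

# Sample space for two independent tosses of a fair coin.
omega = ["".join(seq) for seq in product("HT", repeat=2)]  # ['HH', 'HT', 'TH', 'TT']

# The event "exactly one head" as a subset of the sample space.
A = [outcome for outcome in omega if outcome.count("H") == 1]  # ['HT', 'TH']

# All outcomes are equally likely, so P(A) = |A| / |Omega|.
print(len(A) / len(omega))  # 0.5
```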

2.3 Independence and Conditioning

We say that two events A, B ∈ F are independent if the occurrence or non-


occurrence of one does not affect the probability assigned to the other. More
formally, we can state

• A and B are independent if P (A ∩ B) = P (A)P (B).


• The events in a set {As |s ∈ S} ⊂ F are independent if for every finite
subset S0 ⊂ S we have
P ( ⋂_{s∈S0} As ) = ∏_{s∈S0} P (As ). (2.1)

To be more precise, the last point is the property of mutual independence


for a family of events. It is strictly stronger than the property of pairwise
independence that As and As′ must be independent for all s ̸= s′ in S.
Example 2.3.1. I tossed a fair coin 8 times (tosses are assumed independent)
and I got head 8 times, what’s the probability I get tails in my next throw?

Let B ∈ F with P (B) > 0. For every event A ∈ F, the conditional


probability that A occurs conditionally given that B occurs is denoted by
P (A|B) and defined by

P (A|B) = P (A ∩ B) / P (B). (2.2)

Example 2.3.2. Given I threw a fair dice and I got an even number, what is
the probability it was 2? What about 3?

• A: {“Getting 2”}

• B: {“Getting even”}

• A ∩ B: {“Getting 2”} and {“Getting even”} = {“Getting 2”}

P (A|B) = P (A ∩ B) / P (B) = (1/6) / (1/2) = 1/3.

The following properties can also be proven:



• If P (B) > 0 and {Ai }i≥1 are pairwise disjoint events, then

P ( ⋃_{i=1}^{∞} Ai | B ) = ∑_{i=1}^{∞} P (Ai |B). (2.3)

• If B ∈ F is such that P (B) > 0 and P (B c ) > 0, where B c denotes the


complement B c = Ω \ B, then

P (A) = P (A|B)P (B) + P (A|B c )P (B c ), (2.4)

for all A ∈ F.

• If (Bi )i≥1 ⊂ F form a partition of Ω (i.e. they are pairwise disjoint and
cover Ω) and P (Bi ) > 0 for all i ≥ 1, then for all A ∈ F

P (A) = ∑_{i≥1} P (A|Bi )P (Bi ). (2.5)
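The conditional probability formula and the law of total probability (2.5) can be checked by brute-force enumeration; a minimal Python sketch for the fair die of Example 2.3.2 (using exact fractions) could look as follows.

```python
from fractions import Fraction

omega = range(1, 7)                       # a fair six-sided die
P = {w: Fraction(1, 6) for w in omega}    # uniform probability measure

def prob(event):
    return sum(P[w] for w in omega if event(w))

def cond(event_a, event_b):
    # P(A | B) = P(A and B) / P(B)
    return prob(lambda w: event_a(w) and event_b(w)) / prob(event_b)

A = lambda w: w == 2
even = lambda w: w % 2 == 0
odd = lambda w: w % 2 == 1

print(cond(A, even))                                           # 1/3, as in Example 2.3.2
# Law of total probability over the partition {even, odd}:
print(cond(A, even) * prob(even) + cond(A, odd) * prob(odd))   # 1/6 = P(A)
```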

2.4 Random Variables

A real random variable provides us with a numerical value that is dependent


on the outcome of an experiment. It is a convenient way to express the
elements of Ω as numbers rather than abstract elements of sets.

Throughout this course, we will only consider real random variables or


multivariate real random variables, that is to say, random variables with
values in R or Rd for d ≥ 2.

Definition 2.4.1 (Real Random Variable). Let G be a σ-field on R. A real


random variable X is a function X : Ω → R, such that for all B ∈ G, it holds
that {ω ∈ Ω : X(ω) ∈ B} ∈ F.

⋆ Remark 2. Note that defining a real random variable makes sense only
given a probability space (Ω, F, P ) and a σ-field G on R. If not specified, it
is common to assume that G is the Borel σ-field on R. (Recall that the Borel
σ-field is generated by all sub-intervals of the form (a, b] for all a, b ∈ R.)

The event {ω ∈ Ω : X(ω) ∈ B} in the definition is simply the preimage


of B by X, also denoted by X −1 (B).

The above definition naturally extends to X : Ω 7→ Rd for all d ≥ 2.


Example 2.4.1. Toss independently n fair coins and observe the resulting
sequence. The state space consists of the set of all 2^n possible coin sequences.
Let our random variable X be the number of heads. 3 For k ≥ 0, what is
P (X = k)? Meaning, in n coin tosses, what is the probability we get exactly
k heads?
Example 2.4.2. Toss 2 independent and fair coins. Our random variable X
is the number of heads.

• X(“HH”) = 2

• X(“HT ”) = 1

• X(“T H”) = 1

• X(“T T ”) = 0

Perhaps in more practical terms, a real random variable transforms an


element ω from the original sample space (which could be abstract or difficult
to work with directly) into a numerical quantity X(ω) (real number) that
is more convenient or tangible to work with (e.g. quantities that we may
measure in a laboratory).

Once we are mapped by X from Ω to R, we may choose to view R as the


new “sample space” and create a new probability triple for itself. Then, on
this new measurable space (R, B), we can assign probabilities to the events
in B, and we call that the probability law for random variable X, denoted
by PX . The new probability space associated with the random variable X is
then (R, B, PX ). We can also map from R back to the original sample space
via X −1 (note this is the “preimage” inverse, not the usual inverse function).
Hence, we can write, for some event B in the new σ-field B,

PX (B) = P (X −1 (B)) = P ({ω : X(ω) ∈ B}). (2.6)


3 Convince yourself that X is indeed a random variable.

2.5 Discrete Random Variables

2.5.1 Definition

A discrete random variable is one whose range X(Ω) (i.e., the set of values
it can take) is countable. The probability mass function (PMF) of a discrete
random variable is defined as

pX (x) = P (X = x), ∀x ∈ R, (2.7)

and in particular,
∑_{x∈R} pX (x) = 1. (2.8)

(In the above sum, only countably many terms are non-null.) With a slight
abuse of language, we say that a random variable X has distribution, or law,
pX 4 .

2.5.2 Examples of discrete random variables


• (X ∼ U({a, . . . , b})) Discrete uniform with integer parameters a < b,
(e.g.: throwing a fair dice):
pX (x) = 1/(b − a + 1) if x ∈ {a, . . . , b} (there are b − a + 1 equally likely values) and pX (x) = 0 otherwise.

• (X ∼ Ber(p)) Bernoulli with parameter 0 ≤ p ≤ 1; (e.g.: yes or no):

pX (k) = pk (1 − p)1−k , k ∈ {0, 1} and pX (k) = 0 otherwise

• (X ∼ Bin(n, p)) Binomial with parameters n ∈ N and p ∈ [0, 1]; (e.g.:


number of heads after n independent coin tosses with a biased coin):
pX (k) = (n choose k) p^k (1 − p)^{n−k} for k = 0, 1, . . . , n and pX (k) = 0 otherwise
4 Even though technically its law is PX defined in the previous section (that is because PX characterizes pX ).

• (X ∼ Pois(λ)) Poisson with parameter λ > 0:

pX (k) = e^{−λ} λ^k /k! for k = 0, 1, . . . and pX (k) = 0 otherwise.

The Poisson distribution is typically used to model the number of times an event occurs in a unit amount of time, when these occurrences are thought to be independent and the average time between occurrences is 1/λ, so that the average number of occurrences per unit of time is λ (e.g. earthquakes, victories of France in the soccer world cup5 , . . . ).
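These distributions are available in standard numerical libraries; as a quick sanity check, the following Python sketch (assuming numpy is available) draws samples from a Binomial and a Poisson distribution and compares empirical frequencies with the PMF values given above.

```python
import numpy as np
from math import comb, exp, factorial

rng = np.random.default_rng(0)
n, p, lam, N = 10, 0.3, 2.0, 100_000

binom_samples = rng.binomial(n, p, size=N)
pois_samples = rng.poisson(lam, size=N)

for k in range(4):
    emp_b = np.mean(binom_samples == k)                 # empirical frequency of {X = k}
    pmf_b = comb(n, k) * p**k * (1 - p) ** (n - k)      # Binomial PMF
    emp_p = np.mean(pois_samples == k)
    pmf_p = exp(-lam) * lam**k / factorial(k)           # Poisson PMF
    print(k, round(emp_b, 3), round(pmf_b, 3), round(emp_p, 3), round(pmf_p, 3))
```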

2.5.3 Multiple random variables (marginal, joint, conditional, independence)

Consider two discrete random variables X and Y associated with the same
experiment. The probability law of each one of them is described by its
respective PMF pX or pY , called the marginal PMFs of the couple (X, Y ).
The marginal PMFs do not provide any information on possible relations
between these two random variables.

The statistical properties of two random variables X and Y are captured


by their joint PMF :

pX,Y (x, y) = P (X = x, Y = y). (2.9)

Example 2.5.1. Toss a fair coin and let X = 1 if the result is head, X = 0 if
it is tail. Let Y = X and Y ′ = 1 − X. Show that (X, Y ) and (X, Y ′ ) have
the same marginal PMFs but not the same joint PMFs.
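A quick way to verify the claim of Example 2.5.1 is to tabulate the joint and marginal PMFs explicitly; the following Python sketch does this with exact fractions.

```python
from fractions import Fraction

half = Fraction(1, 2)

def joint_and_marginals(pairs_with_prob):
    joint, px, py = {}, {}, {}
    for (x, y), pr in pairs_with_prob:
        joint[(x, y)] = joint.get((x, y), 0) + pr
        px[x] = px.get(x, 0) + pr
        py[y] = py.get(y, 0) + pr
    return joint, px, py

# X fair on {0, 1}; Y = X and Y' = 1 - X, as in Example 2.5.1.
same = [((x, x), half) for x in (0, 1)]
flip = [((x, 1 - x), half) for x in (0, 1)]

j1, px1, py1 = joint_and_marginals(same)
j2, px2, py2 = joint_and_marginals(flip)

print(px1 == px2 and py1 == py2)  # True: identical marginal PMFs
print(j1 == j2)                   # False: different joint PMFs
```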

We may also concatenate multiple random variables together into a ran-


dom vector X = [X1 , . . . , Xn ] and still use the notation pX (x) but with the
understanding that it is the joint PMF (in this case, x = (x1 , . . . , xn )). From
the joint PMF, we can recover the marginals via
pX (x) = ∑_y pX,Y (x, y), pY (y) = ∑_x pX,Y (x, y). (2.10)

5 in this case λ is very large...

The conditional PMF of X given Y is

pX|Y (x|y) = P (X = x|Y = y), (2.11)

which is well defined whenever pY (y) > 0. Using the definition of conditional
probabilities, we obtain
pX|Y (x|y) = pX,Y (x, y) / pY (y). (2.12)
Visually, if we fix y, then the conditional PMF pX|Y (x|y) can be viewed as a
“slice” of the joint PMF pX,Y , but normalized so that it sums to one.

Random variables X and Y are said to be independent if for all x, y ∈ R,


it holds that

pX,Y (x, y) = pX (x)pY (y), (2.13)

or equivalently for all x, y ∈ R such that pY (y) > 0,

pX|Y (x|y) = pX (x). (2.14)

Furthermore, if X and Y are independent, then functions of these random


variables h(X) and g(Y ) are also independent, provided that h(X) and g(Y )
are random variables. (One indeed needs to check that they satisfy Definition
2.4.1; for example, it is always the case when h and g are continuous).

2.5.4 Sequence of discrete random variables

We just saw how the joint PMF encodes the distribution of a couple of
discrete random variables. Sometimes we will want to consider more than
two random variables at the same time. A sequence of random variables
(Xi )i≥1 is a sequence such that for all i ≥ 1, Xi is a random variable.

We say that the random variables in (Xi )i≥1 are independent if and only
if for all k ≥ 1 and all i1 , . . . , ik ∈ N pairwise distinct, it holds that
p(Xi1 ,...,Xik ) (x1 , . . . , xk ) = ∏_{ℓ=1}^{k} pXiℓ (xℓ ), ∀x1 , . . . , xk ∈ R.

This is sometimes called mutual independence to stress that it is strictly stronger than pairwise independence, the latter being the weaker property that for all i ̸= j, the two random variables Xi , Xj are independent.
Example 2.5.2. Let X be a random variable on {0, 1} with the following law:
P (X = 1) = 1/2 = P (X = 0). Let Y have the same law and be independent
from X. Let Z be a random variable such that Z = X if Y = 1, and
Z = 1 − X if Y = 0. One can show that the triplet (X, Y, Z) is composed of
pairwise independent variables, but not mutually independent.

2.6 Continuous Random Variables

Most of the properties and concepts for continuous random variables will be
the same or analogous to its discrete counterpart (by swapping summation
with integration).

2.6.1 Definitions

When X takes real continuous values it is more natural to specify the prob-
ability of X being inside some interval P(a ≤ X ≤ b), a < b. By convention,
we specify P(X ≤ x) for all x ∈ R, which is known as the cumulative distri-
bution function (CDF) of X, denoted by FX (x).

A continuous (real) random variable is one that has a probability density


function (PDF) fX (x) such that
FX (x) = P (X ≤ x) = ∫_{−∞}^{x} fX (t) dt. (2.15)

Conversely, if the CDF is differentiable (not always true), then

fX (x) = ∂FX (x)/∂x. (2.16)
Since the CDF is monotonically increasing, then fX (x) ≥ 0; and since lim_{x→+∞} FX (x) = 1, then

∫_{−∞}^{+∞} fX (t) dt = 1. (2.17)

Using the PDF of a continuous random variable, we can compute the prob-
ability of various subsets of the real line:
P (a < X < b) = P (a ≤ X ≤ b) = ∫_a^b fX (t) dt, (2.18)
P (X ∈ B) = ∫_B fX (x) dx. (2.19)

⋆ Remark 3. From measure theory, for the last equation to make sense,
we need B to be Lebesgue measurable. Since it is the case of all Borel sets,
we can use this formula for all B that can be constructed from intervals. We
shall always work with such measurable sets throughout the class, without
necessarily recalling it.
⋆ Remark 4. Any random variable can be decomposed into a continuous
part and a singular part (that does not need to be discrete but could be).
For example, X = 0 with probability 1/2 and X = U ∼ U(0, 1) (the uniform
distribution on (0, 1)) with probability 1/2, then X is neither continuous nor
discrete.

2.6.2 Some examples of continuous random variables


• (X ∼ U(a, b)) Uniform with parameters a and b (and a < b);
fX (x) = 1/(b − a) for x ∈ [a, b]
The probability law of a uniform random variable on [a, b] is the Lebesgue
measure on [a, b] divided by b − a.
• (X ∼ Exp(λ)) Exponential with λ > 0,
fX (x) = λe^{−λx} for x ≥ 0, and fX (x) = 0 otherwise.
The exponential distribution is memoryless, that is
P(X > x + t|X > x) = P(X > t),
for all x, t > 0.

• (X ∼ N (µ, σ 2 )) Normal (Gaussian) with mean µ ∈ R and standard


deviation σ > 0;
fX (x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}. (2.20)
The distribution N (0, 1) is called the standard normal.
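These properties are easy to check numerically; the following Python sketch (assuming numpy is available) approximates the integral (2.17) for the standard normal density by a Riemann sum, and verifies the memoryless property of the exponential distribution through its survival function P (X > u) = e^{−λu}.

```python
import numpy as np

# Standard normal PDF integrates (approximately) to 1, cf. (2.17).
x = np.linspace(-8.0, 8.0, 100_001)
dx = x[1] - x[0]
pdf = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
print(pdf.sum() * dx)  # ~1.0 (Riemann sum)

# Memorylessness of Exp(lam): P(X > s + t | X > s) = P(X > t).
lam, s, t = 0.7, 1.3, 2.1
survival = lambda u: np.exp(-lam * u)              # P(X > u) = 1 - F_X(u)
print(survival(s + t) / survival(s), survival(t))  # both equal exp(-lam * t)
```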

2.6.3 Multiple random variables (marginal, joint, conditional, independence)

The joint CDF between X and Y is


FX,Y (x, y) = P (X ≤ x, Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y (u, v) du dv (2.21)

and fX,Y is called the joint PDF. If the CDF is differentiable (not always
true), then

∂²FX,Y /∂x∂y (x, y) = fX,Y (x, y). (2.22)
Similar to the univariate case, we can compute the probability of an event B
with
P ((X, Y ) ∈ B) = ∫_B fX,Y (x, y) dx dy. (2.23)

The marginal PDF of X is then


fX (x) = ∫_{−∞}^{∞} fX,Y (x, y) dy. (2.24)

The conditional CDF of X given Y is


FX|Y (x|y) = ∫_{−∞}^{x} fX,Y (u, y)/fY (y) du (2.25)

where fY is the marginal PDF of Y and we assumed fY (y) > 0. The condi-
tional PDF is then
fX|Y (x|y) = fX,Y (x, y) / fY (y). (2.26)

We say that X and Y are independent if and only if their joint CDF,
equivalently joint PDF, can be factored:
FX,Y (x, y) = FX (x)FY (y) (2.27)
fX,Y (x, y) = fX (x)fY (y), ∀x, y ∈ R. (2.28)
Equivalently, for all x, y ∈ R such that fY (y) > 0,
FX|Y (x|y) = FX (x) (2.29)
fX|Y (x|y) = fX (x). (2.30)

2.6.4 Sequence of continuous random variables

Sequences of continuous random variables are defined identically to sequences


of discrete random variables. The mutual independence property translates
similarly with the PDFs and CDFs as follows: we say that a sequence of
continuous random variables are mutually independent if and only if for all
k ≥ 1 and all i1 , . . . , ik ∈ N pairwise distinct, it holds that

FXi1 ,...,Xik (x1 , . . . , xk ) = ∏_{ℓ=1}^{k} FXiℓ (xℓ ),
or equivalently fXi1 ,...,Xik (x1 , . . . , xk ) = ∏_{ℓ=1}^{k} fXiℓ (xℓ ),

for all x1 , . . . , xk .

2.7 Moments

The probability density function of a continuous random variable, or the probability mass function of a discrete random variable X, provides us with the probabilities of all the possible values of X. It is often desirable to summarize this information in a single representative number.

In order to do so, one can look at the average value of X, if one were
to sample it many times. This value (that we call the expectation of X and
that we define below) requires that the variable does not take extremely large
values too often, otherwise this average may explode and thus be ill defined.
We formalise the property of a variable being non-extreme as integrability.

2.7.1 Non-negative random variables

To compute a mean under a probability distribution P , we need to integrate


a map against P . We first properly define integrals on functions taking non-
negative values. The case of non-positive functions is identical. Then, for
general functions, we will naturally obtain an integrability condition to define
its integral as the difference between its positive and its negative parts.

We have seen that a real random variable X is a function from a proba-


bility space (Ω, F, P ) to (R, G) (where, often, G is chosen as the Borel σ-field
on R). We have seen that this defines a probability measure pX on R that is
called the law (or distribution) of X. Hence, using pX , we can directly com-
pute integrals on R without explicitly defining (Ω, F, P ), e.g. to compute
the average value that X takes.
Definition 2.7.1. Suppose X is a non-negative random variable, that is
P (X ≥ 0) = 1. The expectation of X is defined by
E[X] = ∑_x x pX (x), (2.31)

if X is discrete, and

E[X] = ∫_{−∞}^{∞} x fX (x) dx, (2.32)

if X is continuous.
Remark 1. The case of random variables that are neither discrete nor con-
tinuous is out of the scope of this class.

Remark 2. Note that E[X] does not have to be finite, in which case E[X] =
+∞ so that E[X] is always well defined (as long as X is non-negative).

Instead of integrating X, one can integrate g(X) where g : R → R+ . The


only restriction is that g(X) is a random variable; this depends on g and this
is the case for all the maps we will consider in this class. (For example, it
is true as soon as g is piecewise continuous.) Hence, for any real random
variable X and g such that g(X) is a random variable, the following is well
defined:
E[g(X)] = ∑_x g(x) pX (x),

if X is discrete and

E[g(X)] = ∫_{−∞}^{∞} g(x) fX (x) dx

if X is continuous.

⋆ Remark 5. The expectation E[g(X)] can equivalently be written on the


canonical probability space (Ω, F, P ) as follows:
E[g(X)] = ∫_Ω g(X(ω)) P (dω).
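As a concrete illustration, the following Python sketch (assuming numpy; the Poisson parameter is chosen arbitrarily) computes E[X] and E[g(X)] for g(x) = x² from the defining sum, truncated at a large value, and compares them with empirical averages of samples.

```python
import numpy as np
from math import exp, factorial

lam = 3.0
g = lambda x: x**2

# Expectation from the defining sum, truncating the Poisson support at K.
K = 60
pmf = [exp(-lam) * lam**k / factorial(k) for k in range(K)]
mean_exact = sum(k * pk for k, pk in enumerate(pmf))   # = lam
g_exact = sum(g(k) * pk for k, pk in enumerate(pmf))   # = lam + lam**2

# Empirical averages from samples (law of large numbers, see Section 2.8).
rng = np.random.default_rng(0)
samples = rng.poisson(lam, size=200_000)
print(mean_exact, samples.mean())
print(g_exact, np.mean(g(samples)))
```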

2.7.2 The case of signed random variables

In the previous section, we introduced the expectation of a non-negative


random variable and assigned it a well-defined value. Identical arguments
can be made to define the expectation of non-positive variables. In this
section, we aim at defining E[X] for a random variable X that can take
signed values, that is P [X > 0] > 0 and P [X < 0] > 0.

If we try to define the expectation as in the previous section, a problem


will arise with infinite values. Let us see an example. Let X be such that for all k ∈ N, pX (k) = pX (−k) = (3/π²) · (1/k²), and pX (x) = 0 for all x ∉ N ∪ (−N). One can check that ∑_{k∈N∪(−N)} pX (k) = 1 so that pX is indeed a probability measure. However, if we try to define E[X] as ∑_{k∈N∪(−N)} k pX (k), the positive and negative parts will be

(3/π²) ∑_{k∈N} k/k² = +∞ (2.33)
and (3/π²) ∑_{k∈−N} k/k² = −∞. (2.34)

Since +∞ − ∞ is ill defined, we cannot make sense of E[X] in that case.

In order to avoid this issue, we need to restrict ourselves to integrable


random variables: let X+ = X·1_{X≥0} and X− = |X|·1_{X<0} so that X = X+ − X− (note that both X+ and X− are non-negative random variables).
Definition 2.7.2. We say that a real random variable X is integrable if
E[X+ ], E[X− ] < +∞. In this case, we define

E[X] = E[X+ ] − E[X− ]. (2.35)

The integrability condition is often written E[|X|] < ∞ since E[|X|] =


E[X+ ] + E[X− ], by linearity of the expectation (which follows from the lin-
earity of the integral).
Remark 3. When exactly one of E[X+ ] and E[X− ] is infinite, then it is pos-
sible to define E[X] = +∞ or −∞ according to whether X+ or X− is not
integrable. In this case, even though E[X] is well defined (infinite), X is not
integrable.

Below we list the basic most important properties for expectations, where
X, Y are integrable:

• If X ≥ 0 then E[X] ≥ 0 (Monotonicity)

• |E[X]| ≤ E[|X|] (triangle inequality)

• If X and Y are integrable, then E[aX + bY ] = aE[X] + bE[Y ] for all


a, b ∈ R. (Linearity)

• If X = c then E[X] = c.

• For any event A ∈ F, we have P (A) = E[1A ].


• If X and Y are square integrable, then E[|XY |] ≤ √(E[X²]E[Y²]), with equality if and only if X = aY for some a ∈ R. (Cauchy-Schwarz Inequality)

For a random variable X with E[X 2 ] < ∞, that is, X is square integrable,
we can define its variance as

var(X) = E[(X − E[X])2 ] = E[X 2 ] − E[X]2 .

• The square root of the variance is the standard deviation, often denoted
by σX or just σ.
• var(aX) = a2 var(X).
• If X and Y are independent and square integrable, then E[XY ] =
E[X]E[Y ] and var(X + Y ) = var(X) + var(Y ).

The covariance of two square integrable variables X and Y is defined as

Cov(X, Y ) = E [(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ]. (2.36)

When Cov(X, Y ) = 0, we say that X and Y are uncorrelated.

Independence implies uncorrelatedness, but uncorrelated random variables need not be independent. By rescaling the covariance, we obtain the correlation (assuming that neither X nor Y is deterministic):

ρ(X, Y ) = Cov(X, Y ) / √(Var(X)Var(Y )) ∈ [−1, 1]. (2.37)
The correlation ρ(X, Y ) can be thought of as a measure of the linear depen-
dence between X and Y .
Remark 4. The fact that the correlation always belongs to [−1, 1] is proven
using Cauchy-Schwarz Inequality. In particular, the equality case of this
inequality entails that ρ(X, Y ) = 1 (resp. −1) if and only if Y = aX for
some real a > 0 (resp. a < 0).
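The standard counterexample to the converse is X uniform on {−1, 0, 1} and Y = X²; the short Python sketch below computes Cov(X, Y ) = 0 exactly, while exhibiting a point where the joint PMF differs from the product of the marginals.

```python
from fractions import Fraction

third = Fraction(1, 3)
support = [-1, 0, 1]

# X uniform on {-1, 0, 1}, Y = X**2.
EX = sum(third * x for x in support)           # 0
EY = sum(third * x**2 for x in support)        # 2/3
EXY = sum(third * x * x**2 for x in support)   # 0
print(EXY - EX * EY)                           # 0 -> Cov(X, Y) = 0, uncorrelated

# But X and Y are not independent: p_{X,Y}(1, 1) != p_X(1) * p_Y(1).
p_joint_11 = third                  # X = 1 forces Y = 1
p_X1, p_Y1 = third, 2 * third       # P(X = 1), P(Y = 1) = P(X in {-1, 1})
print(p_joint_11, p_X1 * p_Y1)      # 1/3 vs 2/9
```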

In general, with two discrete integrable random variables, we can form


the joint expectation
E[g(X, Y )] = ∑_x ∑_y g(x, y) pX,Y (x, y). (2.38)

The conditional expectation of X given Y is defined to be


E[X|Y = y] = ∑_x x pX|Y (x|y). (2.39)

The total expectation theorem can be derived as well:


∑_y E[X|Y = y] pY (y) = E[X]. (2.40)

For the case of two continuous random variables, we have the joint ex-
pectation
E[g(X, Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y (x, y) dx dy. (2.41)

The conditional expectation of X given Y is defined to be


E[X|Y = y] = ∫_{−∞}^{∞} x fX|Y (x|y) dx (2.42)

The case where X is discrete and Y is continuous is similar with the integral
over the values of X replaced by a sum.

For a real p > 1, if X^p is integrable, its p-th moment is defined as

E[X^p]. (2.43)

For any p > q ≥ 1, it holds that if X^p is integrable, then X^q is integrable. Hence, the set of p ≥ 1 such that X^p is integrable is an interval (possibly empty).

⋆ Remark 6. The last statement, about the integrability of X^q being implied by that of X^p, is a consequence of the following more general fact: let L^p = L^p(Ω, F, m) be the space of functions f : Ω → R such that ∫_Ω |f (ω)|^p m(dω) < ∞, for some measure m. If m is finite (i.e. m(Ω) < ∞) and q ≤ p, then L^p ⊂ L^q.

Informally, for an integral to be ill-defined, one needs either the integrand


f (ω) to explode or the measure m to assign an infinite mass on a region where
the function is not close to zero. When m is bounded, only the large values
of f (ω) can be problematic. Therefore, if the values of |f (ω)|^p are not too large, it is also the case for |f (ω)|^q, since x^q < x^p when x > 1.

Since we work with a probability space (m = P ), we can just apply this


fact and obtain the above-mentioned property of the moments of a random
variable.

2.8 Samples from an unknown distribution

2.8.1 Law of large numbers and central limit theorem

Whereas Probability Theory addresses general questions related to (Ω, F, P ),


the field of Statistics is concerned with questions where the distribution of a
random variable pX is unknown. Intuitively, the more observations from pX ,
the more we should know about it. This intuition is formalised in the Law
of Large Numbers:
Theorem 1. Let (Xi )i≥1 be a sequence of i.i.d. and integrable random variables. For all m ∈ N, define Sm := ∑_{i=1}^{m} Xi . It holds that

lim_{m→∞} Sm /m = E[X1 ],

with probability 1.

When a statement holds with probability 1 under P , we say that the


statement holds P -almost surely. What the above theorem tells us is that
the empirical mean of i.i.d. and integrable random variables converges to
the true mean (i.e. the expectation). In other words, if we don’t know pX
but have access to an infinite number of independent observations, we can
retrieve the expectation of pX .

In practice, of course, we don’t have access to infinitely many observa-


tions. The law of large numbers does not tell us at what speed the empirical

mean converges. It turns out that by adding the assumption of finite second
moment, the Central Limit Theorem gives us the order of magnitude of the
distance between the empirical mean and the true mean.

Theorem 2. Let (Xi )i≥1 be a sequence of i.i.d. square integrable random variables. For all m ∈ N, define Sm = ∑_{i=1}^{m} Xi . For any real numbers a < b, it holds that

lim_{m→∞} P ( √(m/Var[X1 ]) (Sm /m − E[X1 ]) ∈ (a, b) ) = P [a < Y < b],

where Y ∼ N (0, 1).

In the central limit theorem, we have the term Sm /m − E[X1 ], which converges almost surely to 0 as m → ∞ by the law of large numbers. Multiplying this difference by the factor √(m/Var[X1 ]), we get a random number with normal distribution with mean 0 and variance 1. This factor grows at the speed √m (while Var[X1 ] remains constant in m). This means that in the law of large numbers, the random variable Sm /m converges to E[X1 ] exactly at speed 1/√m. The constant factor simply normalises the Gaussian limit to have variance 1.
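Both theorems are easy to visualise numerically; the following Python sketch (assuming numpy, with a fair coin as the underlying distribution) shows the empirical mean approaching E[X1] and the rescaled error √(m/Var[X1])(Sm/m − E[X1]) having standard deviation close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var = 0.5, 0.25          # mean and variance of a fair Bernoulli coin

# Law of large numbers: the empirical mean approaches mu as m grows.
for m in (10, 1_000, 100_000):
    x = rng.integers(0, 2, size=m)          # i.i.d. Bernoulli(1/2)
    print(m, x.mean())

# Central limit theorem: rescaled errors over many repetitions look N(0, 1).
m, reps = 2_000, 5_000
errors = np.array([rng.integers(0, 2, size=m).mean() - mu for _ in range(reps)])
z = np.sqrt(m / var) * errors
print(z.mean(), z.std())     # approximately 0 and 1
```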

2.8.2 Estimating an unknown probability

In this section, we see on a concrete example how the theorems from the
previous section can be used to estimate an unknown probability from a
finite sample.

The frequentist way

Suppose that we have a fair coin. The fact that we specify the coin is fair implicitly determines the probability P , such that P (“heads”) = P (“tails”) = 1/2. If instead we are given a coin and we do not know whether it is fair, then we can only say that there exists p ∈ [0, 1] such that P (“heads”) = 1 − P (“tails”) = p. We can, however, say more about the likely values of p by estimating it through repeated experiments, as follows:

We can see the outcome of a coin toss as that of a Bernoulli random


variable with parameter p, where the variable is equal to 1 if the coin yields
“heads”, and 0 if “tails”. Suppose that we toss the coin m ∈ N times and let Xi be the outcome of the i-th toss. Note that (Xi )_{i=1}^{m} are i.i.d. Using the law of large numbers, we know that Sm := ∑_{i=1}^{m} Xi is such that Sm /m → p as m → ∞. Hence, for a large enough m, we can estimate p by

p̂ := (# of heads) / (# of tosses) = Sm /m.
Naturally, if m = 3, it is unlikely that our estimate is satisfactory. Suppose that we want to have an error of order 1/100. Then we can use the central limit theorem, which tells us that the error of the estimate, that is p̂ − p, is of order 1/√m up to a random number that looks like a standard centered Gaussian for large enough m. Hence, to obtain an error of order 1/100, we need

1/√m ≤ 1/100 ⇔ m ≥ 100² = 10000.

That is, by tossing the coin ten thousand times, we have that p̂ − p ≈ (1/100) Y where Y ∼ N (0, 1).
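Numerically, the calculation looks as follows (a Python sketch assuming numpy; the true value of p is fixed here only to measure the error — in practice it is unknown).

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, m = 0.37, 10_000               # p is unknown in practice; fixed here to test

tosses = rng.random(m) < p_true        # each toss is heads with probability p_true
p_hat = tosses.mean()                  # estimate p by the empirical frequency of heads
print(p_hat, abs(p_hat - p_true))      # error typically of order 1/100
```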

Take-home message: In this class, “learning” means approximating a


function or a distribution given a dataset. From the above example, the law
of large numbers and the central limit theorems may be seen as first results
on learning. Assuming that the data are indeed i.i.d., the more data, the
better.

The Bayesian way

The example of the coin toss above is the frequentist paradigm of Statistics. It
typically requires a large amount of data to be accurate. The other paradigm
is Bayesian Statistics, which yields effective estimates with few data, but often
requires more computations and an a priori distribution. It is based on
Bayes’ Theorem 3, that we now state:

Theorem 3 (Bayes’ theorem). Let A, B ∈ F be such that P (A), P (B) > 0. Bayes’ theorem states that

P (B|A) = P (B)P (A|B) / P (A) (2.44)

The informal interpretation of Bayes’ Theorem in the context of learning


is the following: suppose B represents your a priori beliefs of the world, and
A is some observations that are linked to these beliefs. Ideally, you want
to update your beliefs according to A. That is exactly what this theorem
tells us: the probability of our beliefs B a posteriori (that is, after having
observed A), is given by the right-hand side of the equation of the statement.
Note that it depends on three terms: the prior probability we attribute to
our beliefs P (B), the probability of the observations given our prior beliefs
P (A|B), and a last term more difficult to interpret P (A). This last term can
further be decomposed as
P (A) = P (A|B)P (B) + P (A|B c )P (B c ),
so that we can compute P (A) according to whether our beliefs are true or
not, and the prior probability we assign to our beliefs.

Let us see how this works with an example.


Example 2.8.1. Naive Bayes classifier:

This is a simple “probabilistic classifier” based on applying Bayes’ theo-


rem with strong (naive) independence assumptions between the features.

Let us consider again a binary classification. The naive Bayes classifier is


a conditional probability model: given an input x to be classified, represented
by a vector x = (x1 , ..., xm ) representing m features, it assigns conditional
probabilities
p(y = +1|x1 , · · · , xm ), p(y = −1|x1 , · · · , xm ).
Using Bayes’ Theorem, we can write
p(y = +1|x1 , · · · , xm ) = p(y = +1)p(x|y = +1) / p(x).
The numerator is equivalent to the joint probability
p(y = +1, x1 , · · · , xm )

which can be rewritten as follows, using the chain rule for repeated applica-
tions of the definition of conditional probability:

p(y = +1, x1 , · · · , xm ) = p(x1 |x2 · · · , xm , y = +1)p(x2 , · · · , xm , y = +1)


= p(x1 |x2 · · · , xm , y = +1)p(x2 |x3 · · · , xm , y = +1)
p(x3 , · · · , xm , y = +1)
= ···
= p(x1 |x2 · · · , xm , y = +1)p(x2 |x3 · · · , xm , y = +1) · · ·
p(xm−1 |xm , y = +1)p(xm |y = +1)p(y = +1).

Suppose now that the so-called naive conditional independence assump-


tion holds, which tells us that all features in x are mutually independent,
conditional on the label (e.g. y = +1 or y = −1). Under this assumption:

p(xi |x1 , . . . , xi−1 , xi+1 , . . . , xm , y = +1) = p(xi |y = +1).

Then, with this assumption, the original probability can be re-written as:

p(y = +1|x1 , · · · , xm ) ∝ p(y = +1, x1 , · · · , xm )


∝ p(y = +1)p(x1 |y = +1) · · · p(xm |y = +1).

For a new example x, we can compute our best guess of the true label using:

ŷ = arg max_c p(y = c|x).

This is called the MAP estimate (maximum a posteriori).


Example 2.8.2. Application of naive Bayes classifier: We now use the
naive Bayes classifier to answer the following question: Suppose that there is
an infectious disease for which we have a test with false positive probability
0.01 and false negative probability 0.2. This means that a test on a non-
infected patient returns a positive result (“infected”) with probability 0.01
and a test on an infected patient returns a negative result with probability
0.2. Suppose moreover that we know that 10% of the population is infected.

More formally, let y ∈ {−1, 1} be such that y = 1 if the patient is infected,


−1 if they are non-infected. Suppose that the patient was tested three times

(let’s ignore the timing of the tests for this exercise) and the results were
x1 = 1, x2 = −1, x3 = 1, where 1 means that the test was positive and −1
negative.

The naive conditional independence assumption holds in this case, since


given y, the results of the tests x1 , x2 , x3 are independent. We can therefore
compute

p(y = +1|x1 = 1, x2 = −1, x3 = 1)


∝ p(y = +1)p(x1 = 1|y = +1)p(x2 = −1|y = +1)p(x3 = 1|y = +1)
∝ 0.1 × (1 − 0.2) × 0.2 × (1 − 0.2) = 0.0128.

Similar computations yield p(y = −1|x1 = 1, x2 = −1, x3 = 1) ∝ 0.0000891.


In particular, we deduce that p(y = +1|x1 = 1, x2 = −1, x3 = 1) > p(y =
−1|x1 = 1, x2 = −1, x3 = 1), and the naive Bayes classifier classifies the
tested individual as “infected”.
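The arithmetic of this example is easily reproduced; the following Python sketch evaluates both unnormalised posteriors under the naive independence assumption and returns the MAP label.

```python
# Prior and test characteristics from the example above.
prior = {+1: 0.1, -1: 0.9}             # P(y): 10% of the population is infected
p_pos = {+1: 1 - 0.2, -1: 0.01}        # P(test = 1 | y): true/false positive probabilities
tests = [1, -1, 1]                     # observed results x1, x2, x3

def unnormalised_posterior(y):
    # p(y) * prod_i p(x_i | y), using the naive conditional independence assumption
    score = prior[y]
    for x in tests:
        score *= p_pos[y] if x == 1 else 1 - p_pos[y]
    return score

scores = {y: unnormalised_posterior(y) for y in (+1, -1)}
print(scores)                          # {1: 0.0128, -1: 8.91e-05}
print(max(scores, key=scores.get))     # +1: the MAP label is "infected"
```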

2.9 Resources
• Measure, Integral and Probability, Marek Capinski and Ekkehard Kopp

Exercises/things to dwell on:

• We have a fair dice, we want to estimate the probability of getting 6


after throwing it once.

• Suppose we have two dice, we want to estimate the probability of get-


ting at least one 6 after throwing it twice.

• I have two dice and throw them, what is the probability I get the sum
of both of them to be larger than 6?

• What is the probability I get 6 on my second throw, after I rolled a 3?


And a 6?

• Suppose I have a target and I throw a dart randomly, what is the


probability I hit the center of the board?

• Suppose I have this list of words, sampling a word uniformly at random,


can you tell me the probability of the word occurrence?

• What happens if I want to compute the probability of a word I have


not seen in the list?

• Using Bayes’ theorem, show that if P (A) > 0 and P (B) > 0, then

P (B|A) = P (A ∩ B) / P (A) = P (B)P (A|B) / P (A). (2.45)
Chapter 3

Introduction to Machine Learning

Contents
3.1 Different paradigms
3.2 Supervised learning
3.3 Model selection
3.4 No free lunch theorem
3.5 Optimisation
3.6 ML pipeline in practice
3.7 List of tasks

Machine learning aims at building algorithms that autonomously learn


how to perform a task from examples. This definition is rather vague on
purpose, but to make it slightly clearer: by “autonomously”, we mean that
no expert is teaching (or coding by hand) the solution; by “learn” we mean
that we have a measure of performance of the algorithm’s output on the task.
In this chapter, we wish to give a general and accessible picture of Machine
Learning. We will introduce the notation, set-up and essential concepts of
the field, without dwelling on details. The current chapter should provide
enough formalism and intuition to make the forthcoming chapters appear
natural to the reader.


3.1 Different paradigms

There are three main paradigms in machine learning that sometimes share
similar ideas while having very specific techniques. Namely, they are

• supervised learning – we have access to labelled examples: spam detec-


tion using a dataset of emails, some of which we know are spam, the
others we know are not spam;
• unsupervised learning – examples are not labelled: the dataset is com-
posed of paintings, the algorithm must group them by guessing which
come from the same artist;
• reinforcement learning – examples are generated from interacting with
the environment: the algorithm controls a drone and learns how to
navigate by trial and error.

These different types of learning are not exclusive.

In these notes, we mostly focus on supervised learning; much of what we


treat can then be transferred to unsupervised and reinforcement learning. We
also concentrate on specific and common techniques of unsupervised learning
in Chapter ??. We hope that the reader is then able to learn autonomously
from the literature.

3.2 Supervised learning

In the previous section, we said that supervised learning is learning through


labelled examples. The problems that can be solved are divided into two
types: classification problems when the goal is to predict a categorical vari-
able, and regression problems when the prediction takes on numerical values
that are compared using a notion of distance.
Example 3.2.1. Recognizing handwritten digits is a classification task: each
digit belongs to one of the ten classes from “zero” to “nine”. If a model reads
a 2 instead of a 5, it is as wrong as if it had read a 1.

Example 3.2.2. On the other hand, predicting tomorrow’s weather (say tem-
perature) is a regression task: if a model predicts a temperature of 24◦ F
tomorrow in Ann Arbor and it happens to be 27◦ F, although not exact, this
prediction is more satisfactory than another of 12◦ F.

The term supervised refers to the fact that the examples used in building
the predictor come with labels, that is, learning how to distinguish hand-
written digits is done by presenting images and the right answers to the
algorithm. This is in contrast to unsupervised learning where no labels are
provided and the main goal is to find structure in the data (for example,
possible clusters, lower dimensional representations, etc...)

3.2.1 Set-up

The supervised learning set-up is as follows:

• An input space X ⊂ Rnin with nin ≥ 1 and an output space Y ⊂ Rnout


with nout ≥ 1,

• An unknown mapping f : X → Y we want to approximate,

• A probability distribution D on X ,

• A dataset S = {(xi , yi ); i = 1..m} such that f (xi ) = yi for all i = 1..m


and the xi ’s are independent and identically distributed (i.i.d.) with
common law D,

• A hypothesis class H, that is a set of functions mapping X to Y, sup-


posedly containing the unknown f (or good approximations of it),

• A loss function ℓ : Y × Y → R+ such that for h ∈ H, ℓ(h(xi ), yi )


measures the error of the prediction of h at xi from its true label yi =
f (xi ). It is thus standard to assume that ℓ(y, y ′ ) = ℓ(y ′ , y) and ℓ(y, y) =
0 for all y, y ′ ∈ Y.

• A training algorithm A, as defined in the forthcoming definition 3.2.4.



Remark 5. The dataset above supposes that the observations are perfect.
Often in practice, this is not the case (e.g. temperature measurements).
Such a dataset is said to be noisy and this noise is included in the model
such that yi = f (xi )+ϵi where (ϵ1 , · · · , ϵn ) is a random vector, often (but not
always) assumed to be Gaussian with mean 0 and independent coordinates.
More on that in the next chapter.
Remark 6. Choosing a parametric model corresponds to choosing a specific
hypothesis class H – hence we will interchangeably use model and hypothesis
class. For instance, if X ⊂ R, one could choose H = {h : x 7→ w1 cos(x) +
w2 sin(x); w1 , w2 ∈ R}. The numbers w1 , w2 are called the parameters of
the model. Throughout these notes, we will aggregate them in a parameters
vector w ∈ RP where P will usually denote the number of parameters of
the model (here P = 2). Henceforth, we call an element of H a predictor
and we write fw instead of h for a function in H when we want to make the
dependency of the predictor on the parameters explicit.
Remark 7. Choosing a hypothesis class H is often an engineering choice, we
might choose H according to simplicity, expressiveness, prior knowledge of
our problem, etc. In this class, we will see, for example: perceptrons, support vector machines, kernel methods, ensemble methods, neural networks, etc.

Then, we can define more concretely classification and regression prob-


lems.

Definition 3.2.1. A problem is said to be a classification problem if the


labels are categorical, or more formally, if Y is discrete and ℓ(y, y ′ ) = 1_{y̸=y ′} . If there are k ∈ N classes, we usually encode them such that Y = {1, · · · , k}.

Definition 3.2.2. A regression task is a learning task where Y takes numer-


ical values, i.e. Y ⊂ R^nout , and such that predictions are evaluated by a loss function ℓ : Y × Y → R+ that does more than simply discriminate whether
the prediction is right or wrong, but also provides a magnitude of error. This
is the key difference between classification and regression.

Definition 3.2.3. For a predictor h ∈ H, the generalisation error of h is


defined by

RD (h) = Ex∼D [ℓ(h(x), f (x))]. (3.1)



The goal of supervised learning is to solve the optimization problem

min_{h∈H} RD (h). (3.2)

In general, the generalisation error of a predictor is not directly accessible.


We can instead minimise the following quantity:

min_{h∈H} L(h) = min_{h∈H} (1/m) ∑_{i=1}^{m} ℓ(h(xi ), yi ). (3.3)

where the mapping L : H → R+ is called the empirical error (or training


error, empirical/training loss), as it depends on the dataset. What we de-
scribed above is the empirical risk minimization (ERM) framework, and
it assumes that a predictor which minimises (3.3), that we call h∗ , is close
to minimising (3.1). Even though we may not manage to find h∗ , the hope
is that a predictor h that performs decently well on the dataset is good at
predicting labels for inputs outside of the dataset. Informally and more con-
cisely, we expect L(h) small ⇒ RD (h) small. We will see, however, in the
forthcoming section 3.3 that for some h, L(h) can be small (even zero) but
RD (h) very large; this is called overfitting. Avoiding overfitting is both a
practical and theoretical topic of interest.
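As a minimal illustration of empirical risk minimisation, the following Python sketch (assuming numpy; the dataset and the finite hypothesis class are made up for illustration) evaluates the empirical loss (3.3) of each predictor in a small class H and returns the minimiser.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up regression dataset, generated by an unknown f(x) = 2x plus noise.
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.1 * rng.standard_normal(50)

# A small hypothesis class H = {x -> w * x : w in {-1, 0, 1, 2, 3}} and squared loss.
H = {w: (lambda x, w=w: w * x) for w in (-1.0, 0.0, 1.0, 2.0, 3.0)}
loss = lambda pred, target: (pred - target) ** 2

empirical_risk = {w: np.mean(loss(h(x), y)) for w, h in H.items()}
w_star = min(empirical_risk, key=empirical_risk.get)   # the ERM predictor
print(empirical_risk)
print("ERM predictor: h(x) =", w_star, "* x")           # w = 2.0 attains the smallest training loss
```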

3.2.2 Parameters and training

In remark 6, we explained that a parametric model defines a hypothesis


class. That is, denoting by W ⊂ RP the parameters space of the model, we
have H = {fw ; w ∈ W}. The parameters w can be trained to search over
predictors in H, hence they are called trainable parameters. The training
procedure simply refers to the following:
Definition 3.2.4 (informal). Given a dataset S and a hypothesis class H =
{fw ; w ∈ W} with W ⊂ RP the parameter space, we say that a map A =
AH,S : W → W is a training algorithm.

The idea behind this definition is that the purpose of a training algorithm
A is to send initial parameters w0 ∈ W to trained parameters A(w0 ) ∈ W
using the dataset S such that fA(w0 ) performs well at minimizing (3.3). In

this context, the scheme used to choose w0 gives an initial predictor fw0 that
needs not perform well. One can choose w0 deterministically or randomly
according to this scheme, that we call the initialization of the parameters.
The training algorithm can itself use randomness.

On the other hand, a parametric model can have non-trainable parame-


ters.

Definition 3.2.5. All parameters of H and A that are not modified by A


are called hyperparameters.

Changing the hyperparameters corresponds to changing the hypothesis


class or the algorithm A.
Example 3.2.3. Let X = R and Y = R. Let Hk be the set of polynomials of
degree at most k ∈ N. Given a dataset S = {(xi , yi ); i = 1..m}, one can try
to learn the task in H1 if one believes that the relationship between inputs
and outputs is linear, i.e. there exists a, b ∈ R such that yi = axi +b and more
generally, f (x) = ax + b for all x ∈ R. A training algorithm A : R2 → R2
that finds the best (a, b) does not modify the value of k (equal to 1 here),
hence it is a hyperparameter.
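A sketch of this example in Python (assuming numpy; the data are synthetic) is given below: the training algorithm returns the best coefficients (a, b) within H1 by least squares, while the degree k is fixed before training and never modified by it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)
y = 1.5 * x - 0.7 + 0.05 * rng.standard_normal(30)   # data close to a line

k = 1  # hyperparameter: degree of the polynomial class H_k, chosen before training

def train(x, y, degree):
    # Least-squares fit within H_degree; returns the trained parameter vector.
    return np.polyfit(x, y, deg=degree)

w = train(x, y, k)
print(w)  # approximately [1.5, -0.7]; k itself was never changed by the training step
```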

3.3 Model selection

In this section, we present a method that addresses the following question:

Given a learning task, how do we choose a good model?

Indeed, choosing a simple model, i.e. a parametric hypothesis class with


few parameters, may result in a predictor that is unable to fit the data,
whereas a complex model with too many parameters will fit the data but
may not be able to predict reasonable values outside of the dataset. These
issues are called underfitting and overfitting and are (slightly) more formally
defined below. We will see how splitting the dataset into a training set and
a test set helps avoid them.

Furthermore, having finitely many data samples prevents us from trying


out all possible models we have at hand, as it would cause a problem of
overfitting the test set. The cross-validation technique fixes this problem
by smartly making use of the data samples, as we will describe.

3.3.1 Underfitting and overfitting

Recall that in the ERM framework, in the hope of minimising (3.1), we seek to
minimise (3.3).

Definition 3.3.1. Suppose we try to solve a task with the ERM framework
and let h ∈ H be a predictor. The difference between the generalisation loss
of h and its empirical loss is called the generalisation gap of h, that is

    RD(h) − L(h) = Ex∼D[ℓ(h(x), f(x))] − (1/m) Σ_{i=1}^{m} ℓ(h(xi), yi).


The generalisation gap is in general inaccessible to the practitioner, as D
and f are unknown. However, there are ways one can estimate it, by splitting
the dataset into two disjoint subsets, a training set and a test set. Indeed, let Strain and
Stest be two disjoint subsets of S. To ease the notation, let us assume that
Strain = {(xi, yi) : i ∈ {1, . . . , m′}} and Stest = {(xi, yi) : i ∈ {m′ + 1, . . . , m}}
for some m′ < m. Then, it suffices to train a model on Strain to fit it, and to
estimate the generalisation error Ex∼D[ℓ(h(x), f(x))] by

    (1/(m − m′)) Σ_{i=m′+1}^{m} ℓ(h(xi), yi).


Note that if m − m′ tends to infinity, this becomes the exact generalisation


error by the law of large numbers (Theorem 1) since the data samples are
assumed i.i.d.
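
As an illustration, the following sketch (Python/NumPy; the data-generating process, the split size and the threshold model are assumptions made for the example) fits a predictor on the training set and estimates its generalisation error on the held-out test set, as in the formula above:

import numpy as np

rng = np.random.default_rng(1)

# Assumed example: x ~ Unif(0, 1), labels y = f(x) with f(x) = 1_{x > 0.3}
x = rng.uniform(0, 1, size=200)
y = (x > 0.3).astype(int)

# Split into disjoint training and test sets (here m' = 150)
m_prime = 150
x_train, y_train = x[:m_prime], y[:m_prime]
x_test, y_test = x[m_prime:], y[m_prime:]

# Fit a threshold classifier by ERM on the training set only
thresholds = np.linspace(0, 1, 101)
train_errors = [np.mean((x_train > t).astype(int) != y_train) for t in thresholds]
t_star = thresholds[int(np.argmin(train_errors))]

# Estimate the generalisation error by the test error (1/(m - m')) * sum of losses
test_error = np.mean((x_test > t_star).astype(int) != y_test)
print(f"training error: {min(train_errors):.3f}, estimated generalisation error: {test_error:.3f}")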

Definition 3.3.2. (Underfitting and overfitting)

• We say that underfitting occurs when a hypothesis class is too simple


to fit the data properly, that is when inf h∈H L(h) is large.

• We say that overfitting occurs when a predictor h fits the data well but
is too complex to generalise outside the dataset, that is when L(h) ≈ 0
but RD (h) is large.

In general, the training error decreases as we increase the complexity or


flexibility of our model (e.g., in polynomial fitting, as we use higher degrees
of polynomial functions). The generalisation error tends to also decrease
initially as complexity increases, but then increases as the model overfits the
training set. 1

Figure 3.1: Example of underfitting, optimal fitting and overfitting.

⋆ Remark 7. In Figure 3.1, overfitting occurs because the predictor is


too complex and fits the noise in the data. Even without noise, an overly
complex model can feature overfitting, e.g.: for a dataset of n points on a
line, a polynomial of degree n + k can perfectly fit all the datapoints and
look very different from a line.
1
This is not the complete story, as you will see further on.

3.3.2 Validation set and cross-validation

On a given task, to assess the efficacy of a model H, after having split the
dataset into a training set and a test set, we can simply look at the test error,
as this is an estimate for the generalisation loss.

Thanks to this split, we have a way to detect overfitting: for example, overfitting
is likely if the training error is very low but the test error is very high. If our model
overfits, we can simply change to another model. However, that is a naive
way of choosing a model that could lead to overfitting the test set.

Suppose that for a given task, you have the choice between several models
H(1), . . . , H(n), and that you have no a priori reason to favour one over the
others. How do we select the best model? Suppose that we train all of them
on the training set, and compare the predictors thus obtained on the test set.
We then select the predictor that had the lowest test error.

By proceeding that way, we expose ourselves to selecting a predictor


which, by chance, overfits Stest . Note that the predictor does not have
to be trained on Stest to overfit it.

Indeed, suppose that the number of models n tends to infinity; then one can convince
oneself that it becomes more and more likely that one of the predictors is
such that h(xi) ≈ yi for all (xi, yi) ∈ Stest. This means that we cannot assess
whether the hypothesis class was chosen well.

Validation set

One way to deal with that issue is to split the dataset into three disjoint
subsets:

• the training set Strain ,

• the validation set Sval ,

• the test set Stest .



Now the procedure becomes the following: train the n models on Strain ,
compare their performances on Sval , select the best model and retrain it on
Strain ∪ Sval , then assess its performance on Stest .

One analogy is to view the training set as the textbook lectures and examples
from which we learn a new concept, possibly encountering some confusing
topics that admit multiple interpretations; the validation set plays the role
of practice problems and previous years' exams, which help us choose the
best interpretation; and the test set is the final exam of the course.

Cross-validation

Although this procedure seems satisfactory, it can be greatly improved. Indeed,
note that for each class H(i), we select only one predictor h(i) to assess
how good of a choice H(i) is. This in turn makes the randomness of the finite
sampling and of the split of the dataset too important in the model selection;
ideally we want to choose the best model for a distribution D and a labelling
function f, not the best model for a given dataset split Strain, Sval, Stest.

In order to solve this issue, one can use cross-validation. Let H be fixed
and let us use cross-validation to assess how good of a choice it is for a given
task. Let us split the dataset S = {(xi, yi) : i ∈ {1, . . . , m}} into a training set
Strain and a test set Stest as before. We now randomly partition the training set
Strain = {(xi, yi) : i ∈ {1, . . . , m′}} into k disjoint and covering subsets
S1, . . . , Sk of roughly the same size. For i = 1 to k, we call Si the i-th fold.
We denote by mi the size of the i-th fold, so that Σ_{i=1}^{k} mi = m′. We now
proceed as follows: for all i ∈ {1, . . . , k}, train the model on

    ∪_{j=1, j≠i}^{k} Sj,


and denote by hi ∈ H the predictor thus obtained. We evaluate the performance
of hi on the i-th fold, which contains the only samples from the training
dataset that the predictor was not trained on; that is, we define

    Li(hi) := (1/mi) Σ_{(x,y)∈Si} ℓ(hi(x), y).


We now have a collection of k predictors belonging to H, each trained on a


different subset of the training dataset, and for each we have an estimate of
their generalisation error computed on the corresponding i-th fold they have
not seen. We now evaluate the quality of the model H for the task by

    CV(H) := (1/k) Σ_{i=1}^{k} Li(hi).


Coming back to our initial question of choosing the best model among
H(1), . . . , H(n), we can simply choose the one that minimises the cross-validation
error, that is, the one with the smallest CV(H(i)). Then, we can retrain that model on the
whole training dataset Strain, and estimate its generalisation error on Stest.
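
The procedure can be summarised in a short sketch (Python/NumPy; the polynomial least-squares model and the toy data are illustrative assumptions, and np.polyfit plays the role of the training algorithm):

import numpy as np

def cv_score(x, y, degree, k=5, seed=0):
    """k-fold cross-validation estimate CV(H) for polynomial least squares of a given degree."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))        # random partition of the training set
    folds = np.array_split(idx, k)       # S_1, ..., S_k of roughly equal size
    scores = []
    for i in range(k):
        val = folds[i]                                        # i-th fold, held out
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], deg=degree)   # train h_i on the other folds
        preds = np.polyval(coeffs, x[val])
        scores.append(np.mean((preds - y[val]) ** 2))         # L_i(h_i), squared loss here
    return np.mean(scores)                                    # CV(H) = average over folds

# Assumed toy data: noisy quadratic relationship
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = x**2 - 1 + 0.1 * rng.normal(size=60)

# Model selection: pick the degree with the smallest cross-validation error
degrees = [1, 2, 5, 10]
best = min(degrees, key=lambda d: cv_score(x, y, d))
print("CV errors:", {d: round(cv_score(x, y, d), 4) for d in degrees}, "-> selected:", best)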

Choosing k

Recall that k is the number of folds used during cross-validation (CV). What
value of k should we choose?

We can ask ourselves, what is:

• the influence of k when estimating the expected generalisation error?

• the influence of k on the size of the training sets and therefore on the
resulting predictors h1 , ..., hk ?

• the computational complexity of the training algorithm for different k?

Consider a dataset of fixed size m. At one extreme, if we choose
k = m′, then k-fold CV becomes leave-one-out cross-validation (LOO-CV).
With respect to the estimator of the generalisation error, most experiments
show a monotonically decreasing or constant variance of the generalisation
error estimator as k increases.2 We also note that this
creates the largest number of predictors h1 , . . . , hk , which might be very similar
to one another, with training sets of size m′′ = m′ − 1 (after leaving
2
with some exceptions on the stability of the algorithm

one out). Furthermore, it can be quite computationally expensive since it


requires solving m′ slightly smaller (m′ − 1)-sized subproblems of the same
type.

On the other extreme, a small k (such as k = 2) provides an estimator for


the generalisation error with a higher variance and higher bias. This is due
to the fact that the estimators are trained on distinct and smaller datasets
(m′′ = m′ /2) . However, the computational complexity of the training algo-
rithm is small as well (as we only train h1 and h2 ).

In practice, the number of folds k will depend on the dataset size, as
one must balance the training dataset size m′′ = (k − 1)m′/k against the
computational complexity of training k models. For example, Figure 3.2 shows
a hypothetical learning curve for a classifier, plotting 1 − RD(h), where RD(h) is the
true prediction error for a given estimator h, using 5-fold CV. In this case, if we
have a dataset of 200 points, then 5-fold CV would generate training sets
of size m′′ = 160, which would reach an error fairly close to the one obtained if the
full m′ = 200 points were used, and the trained predictors would not suffer from
much more bias. On the other hand, if we had a dataset of m′ = 50 points,
then using 5-fold CV would lead to training sets of size m′′ = 40, leading
to a substantially increased error of the predictor. The precise trade-off is thus
problem dependent and dataset size dependent.

As a compromise among bias, variance and computational cost, values of k = 5 or
k = 10 are commonly used choices for many applied problems.

3.4 No free lunch theorem

The no free lunch theorem is often talked about informally, one of the reasons
being that many similar – but not equivalent – versions of the theorem exist
in the literature. The theorem is often stated as follows:

“All optimization algorithms perform equally well when their performance


is averaged across all possible problems.”

The term “averaged” here does not have a formal meaning, but using the

Figure 3.2: A hypothetical learning curve for a classifier on a given task: a
plot of 1 − Err versus the size of the training set, considering a 5-fold CV.

Machine Learning formalism we introduced in this chapter, the no free lunch


theorem can be understood as follows: without any prior knowledge on the
data distribution D and the labelling function f , there is no way to guess
whether a pair of hypothesis class and training algorithm (H1 , A1 ) is likely
to perform better or worse than another pair (H2 , A2 ).

It means that there is no single best machine learning algorithm across


all possible prediction problems. This, in turn, motivates the development
of many different types of models, to cover the wide variety of data that
appears in the real world.

3.5 Optimisation

Optimisation is the field of Mathematics that is concerned with finding inputs


that optimise – i.e. minimise or maximise – a given function. This is precisely
what we aim for in the ERM framework! In this section, we present the basic
algorithm for training a parametric model.

3.5.1 Gradient descent

In the set-up of supervised learning, we informally defined the notion of


training algorithms in Definition 3.2.4. Many training algorithms that are
used in practice are inspired (sometimes mere variants) of the simple gradient
descent algorithm used to solve minimization problems.

Let C : RP → R be a convex, differentiable map. Recall that this means


that

C(v) ≥ C(u) + ∇C(u) · (v − u),

where · denotes the dot product in RP . Because it is convex, every crit-


ical point of C is a global minimum, i.e. ∇C(u) = 0 if and only if u =
arg minw∈RP C(w).

The gradient descent algorithm

The gradient descent algorithm is a first-order iterative optimisation algo-


rithm for finding a local minimum of a differentiable function.

Recall that S = {(xi , yi ); i = 1..m} is our dataset and H is our hypothesis


class.

Definition 3.5.1. Let w0 ∈ RP and fix η > 0. We then define wk , k ≥ 0


recursively as

wk+1 = wk − η∇C(wk ).

One step of the above recursion is called a gradient descent step or gradient
descent update.

Given a number of steps K ∈ N, gradient descent is the following training


algorithm:

AGD : w0 7→ wK ,

for all w0 ∈ RP .

Remark 8. The positive real number η is called the learning rate, as it governs
the size of the updates (see influence of stepsize in Figure 3.3). Both the
learning rate η and the number of steps K are left unchanged when applying
AGD to w0 . Hence, they are hyperparameters according to Definition 3.2.5.

Figure 3.3: Learning rate η


Remark 9. In the ERM framework, the gradient descent update equation
reads as

    wk+1 = wk − η ∇w L(fwk) = wk − (η/m) Σ_{i=1}^{m} ∇w ℓ(fwk(xi), yi).

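
As a minimal illustration of this update (Python/NumPy; the linear model, the squared loss and the synthetic data are assumptions made for the example, not the only possible choices):

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy regression dataset: y = 2x + 1 + noise, in homogeneous notation (bias absorbed)
m = 100
x = rng.uniform(-1, 1, size=(m, 1))
X = np.hstack([x, np.ones((m, 1))])          # features (x, 1), so w = (a, b)
y = 2 * x[:, 0] + 1 + 0.1 * rng.normal(size=m)

def grad_L(w):
    """Gradient of the empirical squared loss L(f_w) = (1/2m) * sum (w^T x_i - y_i)^2."""
    return X.T @ (X @ w - y) / m

# Gradient descent: w_{k+1} = w_k - eta * grad L(f_{w_k})
eta, K = 0.5, 500            # learning rate and number of steps (hyperparameters)
w = np.zeros(2)              # initialization w_0
for _ in range(K):
    w = w - eta * grad_L(w)

print("trained parameters (a, b):", np.round(w, 3))   # should end up close to (2, 1)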

3.5.2 Gradient flow*

Closely related to gradient descent is the gradient flow, which can be seen as
a continuous version of gradient descent when the learning rate goes to 0, as
we shall see. The interest of gradient flow is purely theoretical: it is often
easier to prove theorems by assuming that training is done under gradient
flow, and then argue that these theorems should hold true under gradient

Figure 3.4: Example of Gradient Descent in 1-dimensional problem (left) and


2-dimensional problem (right)

descent for small enough learning rates (though rigorous results can also be
proven under gradient descent directly).

Definition 3.5.2. We say that a family (u(t))t∈R+ ⊂ RP follows the (neg-


ative of the) gradient flow of C if it is solution of the following differential
equation:

∂t u(t) = −∇u C(u(t)).

Because u(t) follows the negative of the gradient of C, one can show that
t 7→ C(u(t)) is decreasing and reaches its minimal value as t → ∞. Assuming
that u(t) → u∗ ∈ RP as t → ∞, one sees that ∇C(u∗ ) = 0 and the convexity
of C ensures that C(u∗ ) is a global minimum.

To see that gradient descent is a discrete approximation of gradient flow,
let k ∈ N and use Definition 3.5.1 to write

    wk = w0 − η Σ_{ℓ=0}^{k−1} ∇C(wℓ).


By choosing k = ⌊t/η⌋, where ⌊·⌋ denotes the integer part, we can take the
limit as η → 0+ and (assuming that ∇C is continuous, so as to define the Riemann
integral) there exists w : R+ → RP such that

    lim_{η→0+} w⌊t/η⌋ = w0 − lim_{η→0+} η Σ_{ℓ=0}^{⌊t/η⌋−1} ∇C(wℓ) = w0 − ∫_0^t ∇C(w(s)) ds,

and therefore gradient descent, in rescaled time, converges to gradient flow.

3.6 ML pipeline in practice

For most machine learning problems (supervised), the pipeline will be similar:

• Given a dataset, split it into: training, validation and test sets.

• Choose a hypothesis class (or several) H.

• Define a metric of error ℓ.

• Use cross-validation to find the best suited hypothesis class H among a set
of possible hypothesis classes.

• Once the best suited hypothesis class H has been chosen, train a predictor
in H and evaluate its performance on the test set to judge the
generalisation error.
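
In practice, a library such as scikit-learn implements most of these steps. Below is a hedged sketch of the pipeline (the toy dataset, the candidate polynomial models and the split proportions are assumptions chosen for illustration):

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assumed toy data: noisy quadratic relationship
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = x[:, 0] ** 2 - 1 + 0.1 * rng.normal(size=200)

# Split off a test set that is only used at the very end
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Candidate hypothesis classes: polynomials of different degrees
candidates = {d: make_pipeline(PolynomialFeatures(degree=d), LinearRegression()) for d in (1, 2, 8)}

# Use cross-validation on the training set to select the best suited class
cv_errors = {d: -cross_val_score(mod, x_train, y_train, cv=5,
                                 scoring="neg_mean_squared_error").mean()
             for d, mod in candidates.items()}
best_degree = min(cv_errors, key=cv_errors.get)

# Retrain the selected model on the full training set and judge it on the test set
best_model = candidates[best_degree].fit(x_train, y_train)
test_mse = mean_squared_error(y_test, best_model.predict(x_test))
print("CV errors:", {d: round(e, 4) for d, e in cv_errors.items()},
      "| selected degree:", best_degree, "| test MSE:", round(test_mse, 4))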

3.7 List of tasks

As presented in this chapter, the machine learning formalism abstractly seems


to be suited for a wide variety of tasks as long as they can be phrased as
a function approximation problem. Below you will find a list of tasks to
illustrate how to use this formalism. For each of them, we encourage the
reader to answer the following questions:

Q1. Into what paradigm does the task fall? (Supervised/unsupervised)



Q2. If supervised, is it a regression or a classification task?

Throughout the notes, we will learn about many different models (or hy-
pothesis classes). In order to get familiar with them and understand what
makes them different, we encourage the reader to think about the following
question for every encounter with a new H:

Q3. Is H well suited for this task? (For all the tasks below.)

For the tasks where it is provided, f denotes the labelling function (generating
the dataset) and cannot be used to train the algorithm, as it is in
general not known. We provide it to the reader so they may evaluate the
performance of their models by computing the exact generalisation error
and generalisation gap.

Task 1: Let X = R2 and Y = {0, 1}. Suppose f : x 7→ 1{x1 <x2 } . You are given the
dataset S = {((−1, 0), 1), ((−0.5, −1), 0), ((0, 0.5), 1), ((0, 1), 1), ((1, 0.5), 0), ((2, 0), 0)}.
Task 2: Let X = R2 and Y = {0, 1}. Let f be 0 inside the closed unit disk, and
1 outside of it.
Task 3: Let X = R and Y = R. Take some points in R and label them by
f (x) = −x + 3.
Task 4: Let X = R and Y = R. Take some points in R and label them by
f (x) = x2 − 1.
Task 5: You find a text handwritten on a tablet in an unknown alphabet. Before
trying to decipher the text, you want to group the identical characters
together.
Task 6: You have access to all American literature and its translation in French.
You want your algorithm to learn how to translate a text from American
English to French.
Task 7: You know the rules of chess, i.e. how to legally move the pieces on a
chessboard and how to assess the result of a chess game. Besides that,
you have no knowledge of what is a good or bad move. You want to
build a chess engine that surpasses top human level.

Task 8: You are given a dataset of pictures of cats and you want to generate
new realistic images of cats.

Task 9: You work for Nitflex, a streaming service company, and want to give
good recommendations of movies to new subscribers, based on the data
of the older subscribers (movies they liked, categories, age, etc)
Chapter 4

Statistical learning theory

Contents
4.1 Learnability . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Finite-sized hypothesis classes . . . . . . . . . . . 59
4.3 Infinite sized hypothesis classes * . . . . . . . . . 64
4.4 Bias-complexity tradeoff and Bias-variance trade-
off . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

In this chapter, we take the point of view of Statistics to study learning


theory. In short, Statistics is concerned with estimating an unknown distri-
bution µ from observations of it. Typical questions can be: how many data
points should one collect to ensure that one’s estimator µ̂ is close enough (in
some sense) to µ? How complex should a model be to have a small empirical
error and a small generalisation error? More generally, we will be interested
in deriving approximation bounds of hypothesis classes.

⋆ Remark 8. If you are familiar with numerical analysis, you can think
of this chapter as providing techniques to come up with a priori error estimates (i.e.
before we sample the dataset), whereas in the previous
chapter we were computing a posteriori error estimates (we measure the
error for a specific model and a specific dataset).


4.1 Learnability

In this chapter, except if explicitly stated, we restrict ourselves to the case


of binary classification, that is Y = {0, 1}, as it simplifies the presentation.
As usual, we assume that the dataset S = {(xi , yi ) : i ∈ {1, . . . , m}} is such
that the xi ’s are i.i.d. with common law D in X and that yi = f (xi ) for some
labeling function f . Furthermore, we assume the 0-1 loss function 1h(x)̸=y .

Since, for a trained predictor h, the empirical error L(h) depends on the
training set S, which is generated through random sampling under D, there
will be randomness in the trained predictor h, and therefore in RD(h). Thus,
we can see RD(h) as a random variable. We cannot expect that S will suffice
to direct the learner toward a good classifier (w.r.t. all of D) in case S is
not representative of D.
Example 4.1.1. Suppose we have an urn with 30% black and 70% white balls,
and we take a sample where we get “W W W W W”. In this case, our
sample does not represent the underlying distribution of the balls. (Note
that the probability of sampling this dataset is 0.7^5 ≈ 0.17, which is far from
negligible.)

From the law of large numbers (Theorem 1), we know that more data
samples will ensure that the dataset is representative enough and avoid sit-
uations like in this example. But the finiteness of the dataset can hinder
learning. In the previous chapter, we saw ways to assess the quality of a
model, and how to select a good model among a collection of models to solve
a given task. In this chapter, we are instead concerned with studying the
learnability of a given hypothesis class H from a finite dataset.

4.1.1 Realisability assumption

Definition 4.1.1. (Realisability assumption) For a given hypothesis class H,
data distribution D and labeling function f, we say that the realisability
assumption holds for H, D, f if there exists h ∈ H such that RD,f(h) = 0.

Informally, this says that the labeling function f can be represented
by an element of H, at least on the support of D. Alternatively, if
f ∈ H, then clearly the realisability assumption holds true.
Example 4.1.2. It is easy to construct a case where the realisability assump-
tion does not hold: let X = R, D = Unif(−1, 1), f : x 7→ 1{|x|>1/2} and
H = {hw : x 7→ 1{x>w} }, where w ∈ R denotes the only parameter of the
model. One can check that the realisability assumption does not hold. (Note
that m being fixed, there is a positive probability on the dataset S that there
exists h ∈ H such that L(h) = 0, but this probability is not 1.)

4.1.2 PAC learnability

The realisability assumption ensures that there is a predictor in the hypothe-


sis class that fits the data perfectly. This guarantee, however, does not mean
that we are able to find this predictor. We need stronger properties for the
hypothesis class. This is why the Probably Approximately Correct learning
(PAC learning) framework was introduced.

Definition 4.1.2. (PAC learnability) A hypothesis class H is PAC learnable


if there exists a function mH : (0, 1)2 → N and a learning algorithm A with
the following property: for every ϵ, δ ∈ (0, 1), for every distribution D over X ,
and for every labeling function f : X → {0, 1}, if the realisability assumption
holds with respect to H, D, f, then when running A on m ≥ mH (ϵ, δ) i.i.d.
examples generated by D labeled by f , A returns a hypothesis hA such that,
with probability of at least 1 − δ, RD (hA ) ≤ ϵ.

We have two parameters in the PAC learnability. The accuracy parameter


ϵ determines how far we allow our predictor h to be from the optimal predictor
(“approximately correct”), and a confidence parameter δ that indicates how
likely h is to meet the accuracy requirement (“probably”).
Remark 10. The definition above does not imply anything on the computa-
tional aspects of learnability. Nonetheless:

(i) some definitions require m(ϵ, δ) to grow polynomially with its parameters
as they tend to 0, which, for example, does not allow m(ϵ, δ) = ϵ^{-1} 2^{1/δ};

(ii) some definitions include the number of computations of A as a param-


eter of m (and ask for polynomial growth).
Remark 11. The PAC learning definition is not easy to digest at first. After
some time, one may even wonder if the definition is not vacuous. Indeed, since
m can grow at any pace in terms of its parameters and there is no restriction
either on A, one can set m(ϵ, δ) extremely large (think of 10^{(ϵδ)^{−1000000}}). By
the law of large numbers and the central limit theorem (Theorems 1 and 2),
we know that L(h) → RD(h) as m → ∞ with an approximation error
of order O(1/√m), and since the realisability assumption holds, it feels like
we should always be likely to find a good enough predictor. However this
intuition does not take into account the following fact: m(ϵ, δ) and H are
fixed before choosing f and D. In Corollary 6.4 and Theorem 5.1 in [12], the
interested reader will find a construction of a non-PAC learnable hypothesis
class. We informally explain the main idea on a particular case:

Let H be the set of all functions from R to {0, 1}; I claim that it is not PAC
learnable and to demonstrate it, as you choose m(ϵ, δ), I will adversarially
construct D and f . After you fix m(ϵ, δ), I choose arbitrary 2m(ϵ, δ) pairwise
distinct points in R. I let D be the uniform discrete distribution on these
2m(ϵ, δ) points. Because H contains all {0, 1}-binary functions, I can label
them in any possible way with an f ∈ H so that the realisability assumption
holds. Because (at least) half of the 2m(ϵ, δ) points will not be in the dataset
S (recall that it is sampled with D), whatever your algorithm A is, even if it
finds a predictor h with L(h) = 0, it will not have learned anything on half
of the points, and therefore will be likely to not be ϵ-close to f . Conclusion:
H is not PAC learnable.

4.1.3 Agnostic PAC learnability

We saw that PAC learning offers guarantees on a hypothesis class that allows
one to retrieve a labeling function within that class, since Definition 4.1.2
assumes the realisability assumption. However, the datasets a practitioner
meets are not, in general, labeled by a function that belongs to the hypothesis
class they use. Even in the case of a truly linear relationship between x and y,
it is enough, for example, to have noise in the samples so that the realisability
assumption does not hold for the set of linear predictors. This means that

we can’t guarantee that the generalisation error RD (h) = 0; we still want to


find h ∈ H for which the generalisation error is nevertheless low.

It turns out that the realisability assumption can be removed and the
PAC learning formalism can be extended:
Definition 4.1.3. (Agnostic PAC learnability) A hypothesis class H is ag-
nostic PAC learnable if there exists a function mH : (0, 1)2 → N and a
learning algorithm A with the following property: for every ϵ,δ ∈ (0, 1) and
for every distribution D over X and labeling function f : X → {0, 1}, when
running the learning algorithm on m ≥ mH (ϵ, δ) i.i.d. samples, the algorithm
returns a hypothesis hA such that, with probability at least 1 − δ,

    RD(hA) ≤ min_{h∗∈H} RD(h∗) + ϵ.


Remark 12. When the realisability assumption does not hold, no learner can
guarantee an arbitrarily small error ϵ. Under the definition of agnostic PAC
learning, a learner can still declare success if its error is not much larger
than the best error achievable by a predictor from the class H. This is in
contrast to PAC learning, in which the learner is required to achieve a small
error in absolute terms and not relative to the best error achievable by the
hypothesis class. In particular, for a given task, a hypothesis class H could
be a very poor choice, and still be agnostic PAC learnable. More informally
and concisely: agnostic PAC learnable does not imply good model choice.

4.2 Finite-sized hypothesis classes

The PAC learning formalism tells us that if H is too complex, we may not be
able to find good predictors in H from finitely many data samples. The first
restriction one may want to look at is when H is finite, that is, H contains
only a finite number of functions. Is H simple enough to be PAC learnable?
Throughout this section, we work under the assumption that H is finite.

4.2.1 PAC learnability

It turns out that any finite hypothesis class is PAC learnable:



Theorem 4. If H is finite, then it is PAC learnable. Moreover, the map


m : (0, 1)2 → N in Definition 4.1.2 can be chosen as

    m(ϵ, δ) = ⌈ log(|H|/δ) / ϵ ⌉,


where ⌈x⌉ denotes the smallest integer greater or equal to x ∈ R.

Proof. Let hA denote the resulting predictor after training. Note that hA
depends on the dataset S; if we sample m i.i.d. data points according to D,
we look at S as a random variable with law Dm .

Fix ϵ, δ ∈ (0, 1) and let the map m(·, ·) be defined as in the Theorem.
Proving that H is PAC learnable amounts to proving that

Dm(ϵ,δ) (S : RD (hA ) > ϵ) < δ. (4.1)

We will make use of the three following basic facts that we admit without
proof:

(i) For all x ∈ R, it holds that 1 + x ≤ ex .

(ii) For all sets A, B ∈ F and probability P , it holds that P (A ∪ B) ≤


P (A) + P (B).

(iii) If A ⊂ B, then P (A) ≤ P (B)

Fix m ∈ N, let us bound Dm (S : RD (hA ) > ϵ). Define the set of bad
predictors

Hb := {h ∈ H : RD (h) > ϵ},

and define the set of misleading datasets of size m

M := {S : ∃h ∈ Hb , L(h) = 0}.

(We used “misleading” to stress that even though the hypothesis class con-
tains a predictor with null empirical error, this predictor fails to achieve a

generalisation smaller than our threshold ϵ.) Finally, for all h ∈ H, define
the set of misleading datasets for h by

M (h) := {S : h ∈ Hb , L(h) = 0}.

Note that by definition, we can write


    M = ∪_{h∈Hb} M(h).

In particular, thanks to the fact (ii) above, we have that


    Dm(M) ≤ Σ_{h∈Hb} Dm(M(h)).

Because the m samples of the dataset are independent, for all h ∈ Hb , we


have that

    Dm(M(h)) = Dm(S : h(xi) = yi, ∀i ∈ {1, . . . , m}) = Π_{i=1}^{m} D(xi : h(xi) = yi),


and by definition of the generalisation error,

Ex∼D (1h(x)̸=y ) = D (xi : h(xi ) ̸= yi )

we moreover have that for each data point,

D (xi : h(xi ) = yi ) = 1 − RD (h) ≤ 1 − ϵ,

since we chose h in the bad hypothesis set Hb . In particular, using the basic
fact (i), we see that

Dm (M (h)) ≤ (1 − ϵ)m ≤ e−ϵm .

Putting everything together, we have shown that


    Dm(M) ≤ Σ_{h∈Hb} e^{−ϵm} ≤ |H| e^{−ϵm}.

Recall that our goal is to establish (4.1), which is not equivalent to the above
bound. Indeed, we just upper bounded the sum, over bad hypotheses, of
the probability to sample a misleading dataset. Nonetheless, the probability
of sampling a misleading dataset for the trained predictor (as in (4.1)) is
upper bounded by the left hand-side above, thanks to the basic fact (iii):
under the realisability assumption the ERM predictor hA satisfies L(hA) = 0, so
{S : RD(hA) > ϵ} ⊂ ∪_{h∈Hb} M(h) = M, and we have that

    Dm(S : RD(hA) > ϵ) ≤ Dm(M) ≤ |H| e^{−ϵm}.

It suffices now to check that for m ≥ m(ϵ, δ) = ⌈log(|H|/δ)/ϵ⌉, the right hand-side
above is at most δ, which is the case, hence concluding the proof.

Given a finite hypothesis class and two numbers ϵ, δ ∈ (0, 1), Theorem
4 provides a sufficient number of data points to learn a good predictor for
binary classification, that is to say with generalisation error smaller than ϵ,
with a probability greater than 1−δ, even in the worst case scenario where the
labeling function (in H) and the data distribution are chosen adversarially.

We can analyse the bound of Theorem 4 and note that:

• as m → ∞, the probability of sampling a misleading dataset, for which some
h ∈ Hb achieves L(h) = 0, decreases;

• as |H| increases, so does the (adversarial) probability of finding a bad
hypothesis;

• the smaller we choose ϵ (i.e. the better the accuracy we want), the greater we
need to choose m.
Remark 13. It is important to note that Theorem 4 does not say that if
m < m(ϵ, δ), then training is likely to fail. It says that if m < m(ϵ, δ), then
there could exist a labeling function f ∈ H and a data distribution D that
could make training fail.
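
As a quick numerical illustration of Theorem 4 (a sketch; the values of |H|, ϵ and δ are arbitrary choices made for the example), the bound can be evaluated as follows:

import math

def pac_sample_complexity(H_size, eps, delta):
    """m(eps, delta) = ceil(log(|H| / delta) / eps) from Theorem 4."""
    return math.ceil(math.log(H_size / delta) / eps)

# Example: |H| = 10^6 hypotheses, accuracy eps = 0.05, confidence delta = 0.01
print(pac_sample_complexity(10**6, 0.05, 0.01))   # prints 369: a few hundred samples suffice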

4.2.2 Agnostic PAC learnability

Recall Definition 4.1.3, where agnostic PAC learnability was defined. We


proved in Theorem 4 that any finite hypothesis class is PAC learnable, and
derived a bound on the number of data samples that guarantees success of
training. We are now interested in removing the realisability assumption:

Theorem 5. If H is finite, then it is agnostic PAC learnable. Moreover, the


map m : (0, 1)2 → N in Definition 4.1.3 can be chosen as

    m(ϵ, δ) := ⌈ 2 log(2|H|/δ) / ϵ² ⌉.

Before we prove Theorem 5, we need to introduce the uniform convergence


property:

Definition 4.2.1. We say that a hypothesis class H has the uniform conver-
gence property if there is a function mU C : (0, 1)2 → N such that for every
ϵ, δ ∈ (0, 1) and for every probability distribution D over X , if m ≥ mU C (ϵ, δ)
and the m data samples in S are sampled i.i.d. with common law D, it holds
that

Dm (∃h ∈ H : |RD (h) − L(h)| > ϵ) < δ.

In a sentence, the uniform convergence property states that, provided that
m is large enough, the generalisation gap is likely to be small simultaneously for all
h ∈ H, whatever the data distribution D and the labeling function f.

Proof of Theorem 5. The strategy is the following:

• we show that every finite H has the uniform convergence property of


Definition 4.2.1;

• we show that if H has the uniform convergence property, then it is


agnostic PAC learnable with m(ϵ, δ) = mU C (ϵ/2, δ).

Let H be a finite hypothesis class. To show that it has the uniform conver-
gence property, we admit without proof that

    Dm(|RD(h) − L(h)| > ϵ) ≤ 2e^{−2mϵ²},   ∀h ∈ H.

It is a simple application of Hoeffding’s inequality (that can easily be found


online with its proof), that provides a bound on the probability that an
empirical mean is far from its expectation. (Note that the expectation of

L(h) integrated over the datasets under Dm is indeed given by RD(h).) We


now use the basic fact (ii) in the proof of Theorem 4 to write

    Dm( ∪_{h∈H} {|RD(h) − L(h)| > ϵ} ) ≤ Σ_{h∈H} 2e^{−2mϵ²} ≤ 2|H| e^{−2mϵ²}.


Hence, we can choose mUC(ϵ, δ) = ⌈ log(2|H|/δ) / (2ϵ²) ⌉ and we see that H has the
uniform convergence property.

We now show that the uniform convergence property implies the agnostic
PAC learning property. Let hA denote the predictor obtained by the training
algorithm and let h∗ be the optimal predictor within the class, that is

    h∗ := arg min_{h∈H} RD(h).

Define the event Eunif := {∀h ∈ H : |RD(h) − L(h)| < ϵ}. On the event
Eunif, we have that

    RD(hA) ≤ L(hA) + ϵ ≤ L(h∗) + ϵ ≤ RD(h∗) + 2ϵ,

where the second inequality uses that the ERM predictor hA minimises the
empirical error, so that L(hA) ≤ L(h∗).

Note that the uniform convergence property ensures that Dm (Eunif ) > 1 −
δ, which means in particular that the above inequalities hold true with a
probability greater than 1 − δ.

This shows that by setting m(ϵ, δ) := mUC(ϵ/2, δ), H satisfies the
agnostic PAC learnability property. More explicitly,

    m(ϵ, δ) = ⌈ 2 log(2|H|/δ) / ϵ² ⌉,

as claimed, which concludes the proof.

4.3 Infinite sized hypothesis classes *

So far, we have assumed that |H| is finite. We showed that finite classes
are learnable and that the sample complexity of a hypothesis class is upper

bounded by an expression that involves the log of its size. Is there some-
thing similar to this, when we consider |H| = ∞? Namely, we want to say
something about the expressiveness of a set of functions.

Example: linear classification in 2-d, with 2 points, 3 points, 4 points.

We say that linear classifiers are expressive enough to shatter 3 points (in
general position), but not 4.

Definition 4.3.1. A set S of examples is shattered by a set of functions H if


for every partition of the examples in S into positive and negative examples,
there is a function fw ∈ H that gives exactly these labels to the examples.

Definition 4.3.2. (Vapnik–Chervonenkis dimension) The VC-dimension of


a hypothesis class H, denoted VCdim(H), is the maximal size of a set C ⊂ X
that can be shattered by H. If H can shatter sets of arbitrarily large size,
we say H has infinite VC-dimension.

    Hypothesis class H              VC-dim
    Half intervals                  1
    Intervals                       2
    Half-spaces in the plane        3
    Neural networks                 on the order of the number of parameters

For infinite hypothesis sets, VCdim(H) takes the role of log(|H|) for finite
hypothesis sets. For example: given a sample S with m examples, any hypothesis
fw ∈ H consistent with S has, with probability at least 1 − δ, error less
than ϵ, provided that

    m ≥ (1/ϵ) ( 8 VCdim(H) log(13/ϵ) + 4 log(2/δ) ).

4.4 Bias-complexity tradeoff and Bias-variance tradeoff

4.4.1 Existence of noise

Consider again the setting of classification, with Y = {−1, 1}. Let us intro-
duce the notion of noisy labels.

Suppose we have a dataset S = {(x1 , y1 ), . . . , (xn , yn )}. We now suppose


that each data point and its corresponding label are independently drawn
from an unknown data distribution, i.e., (xi , yi ) ∼ D(X, Y ). Note the dif-
ference with the setup where the labels are given by a deterministic labeling
function f : here the same x can have a positive probability to be labeled by
y or y ′ , with y ̸= y ′ .

This lack of a perfect labeling function f accounts for the noise on the
label. For example, you can think of this as the features not containing all the
information needed to attribute the label in a deterministic way. This setting
is a bit closer to reality: because of lack of information, noise, or other sources
of uncertainty, the labelling function might not be deterministic.
Example 4.4.1. Suppose we have a model for tastiness of a papaya given the
colour and softness. Let’s say most soft papayas with bright colour are tasty.
However, we can have the situation that the papaya is soft and bright, and
still not tasty (e.g.: bad climate?), even if it’s unlikely.

Under this new assumption that there’s also noise on the labels, we can
write the theoretical optimal classifier:
Definition 4.4.1. Bayes Optimal predictor: Given a probability dis-
tribution D over X × Y, the predictor is defined as

    fbayes(x) = 1 if D(y = 1|x) ≥ 1/2, and −1 otherwise.

• In the deterministic case, D(y = 1|x) is either 1 or 0, because we have



a deterministic map f .

• Under uncertainty, on average, this predictor is optimal:

    RD(fbayes) ≤ RD(h)   ∀h ∈ H

• We rarely have access to this classifier, because it implies we can evaluate
the probability D(y = 1|x). Typically, fbayes ∉ H.

The error made by fbayes is, by definition,

    RD(fbayes) = Ex[1 − max{D(−1|x), D(1|x)}],

and this is the minimal theoretical error possible. This leads us to define the
noise as:
Definition 4.4.2. Given a distribution D over X × Y, the noise at point
x ∈ X is defined as:

    noise(x) = 1 − max{D(−1|x), D(1|x)}.

The noise is a characteristic of the learning task and is indicative of its
level of difficulty. For example, for a point x, if the noise is 1/2, it will be
challenging to predict its label correctly. On the contrary, a noise of 0 means
that there exists a (deterministic) labeling function.
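
To make these definitions concrete, here is a small sketch (Python/NumPy; the toy discrete distribution is an assumption made purely for illustration) computing the Bayes optimal predictor, the noise, and the resulting Bayes error:

import numpy as np

# Assumed toy distribution D over X x Y with X = {0, 1, 2} and Y = {-1, 1}:
# p_x[i] is the marginal probability of x = i, p_pos[i] is D(y = 1 | x = i).
p_x   = np.array([0.5, 0.3, 0.2])
p_pos = np.array([0.9, 0.5, 0.1])

# Bayes optimal predictor: predict 1 iff D(y = 1 | x) >= 1/2
f_bayes = np.where(p_pos >= 0.5, 1, -1)

# noise(x) = 1 - max{D(-1 | x), D(1 | x)}
noise = 1 - np.maximum(p_pos, 1 - p_pos)

# Bayes error: R_D(f_bayes) = E_x[noise(x)]
bayes_error = np.sum(p_x * noise)
print("f_bayes:", f_bayes, "noise:", noise, "Bayes error:", round(bayes_error, 3))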

4.4.2 Bias-complexity trade-off

In learning theory, we talk about the bias-complexity trade-off. The generalisation
error can be decomposed into two components (three components if we
include noise):

    RD(h) = ϵapprox + ϵest    (4.2)

where ϵapprox = min_{h′∈H} RD(h′) and ϵest = RD(h) − ϵapprox.

Approximation error: this is the minimum error achievable by a pre-


dictor in the hypothesis class H. This term measures how much error we
have because we restrict ourselves to a specific class i.e. how much inductive
bias we have.

This error does not depend on the sample size and is determined by the
hypothesis class chosen. Enlarging the hypothesis class (e.g. making it more
complicated) can decrease the approximation error.

Note that under the realisability assumption, the error is zero. In the
agnostic case, the approximation error can be large.
Example 4.4.2. Using a finite polynomial basis to represent a non-polynomial
function, there is an inherent approximation error.

Estimation error: the difference between the approximation error and the error
achieved by the ERM predictor. This is in general non-zero because the
empirical error is only an estimate of the generalisation error, therefore we do
not necessarily reach the minimal error over the hypothesis set. This quantity
depends on the training set size and on the size and complexity of the
hypothesis class (namely, ϵest increases logarithmically with the size of H and
decreases as m increases).

Since we want to minimize the total error, we have a trade-off, called
the bias-complexity trade-off. A rich H reduces the approximation error but
might lead to high estimation error (overfitting), whereas a simple H reduces
the estimation error, but increases the approximation error (underfitting).

4.4.3 Bias-Variance tradeoff

The Bias-Variance trade-off is typically referenced in computational statis-


tics and it does not use the learning framework we have been describing
(PAC). Nevertheless, it’s used quite often to describe the notion of managing
the complexity of a chosen model class with its different sources of errors.

In this section, we leave the binary classification setup to consider a re-


gression task with the mean squared error. More precisely:

Figure 4.1: Graphical representation of bias and variance (precision and


accuracy).

Lemma 6. Given a function f = f (x) to be approximated, a dataset for


training S = {(xi , yi ), i = 1, · · · , m} of fixed size, with yi = f (xi ) + ϵ, and
a predictor h that aims to approximate f , the expected mean-squared error
for the predictor hS , obtained after training on the dataset S with the ERM
framework, can be decomposed as:


    Ex,y,S[(hS(x) − y)²] = Ex,S[(hS(x) − h̄(x))²] + Ex,y[(ȳ(x) − y)²]        (4.3)
                          + Ex[(h̄(x) − ȳ(x))²],                              (4.4)

where the left hand-side is the expected test error and the three terms on
the right are, respectively, the variance, the noise and the bias²; here
h̄(·) = ES∼Dm [hS (·)] ,
denotes the expected predictor when sampling different datasets S from Dm ,
and
ȳ(x) = Ey|x [f (x) + ϵ] ,
the expected value for y given x (as we consider y being noisy).

Proof. We consider the expected error for the predictor hS that we ob-
tained after training on the dataset S with the ERM framework, which can
be written as:


    E(x,y)∼D[(hS(x) − y)²] = ∫x ∫y (hS(x) − y)² D(x, y) dy dx.


We can write the expected Test Error (given the ERM framework and
H):

    E(x,y)∼D, S∼Dm[(hS(x) − y)²] = ∫S ∫x ∫y (hS(x) − y)² D(x, y) Dm(S) dx dy dS.

The expected test error can then be decomposed as

    E(x,y)∼D, S∼Dm[(hS(x) − y)²] = E[((hS(x) − h̄(x)) + (h̄(x) − y))²]
        = E[(hS(x) − h̄(x))²] + 2E[(hS(x) − h̄(x))(h̄(x) − y)] + E[(h̄(x) − y)²].    (4.5)

The middle term of the above equation is 0:

    E(x,y)∼D, S∼Dm[(hS(x) − h̄(x))(h̄(x) − y)] = Ex,y[ES[hS(x) − h̄(x)] (h̄(x) − y)]
        = Ex,y[(ES[hS(x)] − h̄(x))(h̄(x) − y)]
        = Ex,y[(h̄(x) − h̄(x))(h̄(x) − y)]
        = Ex,y[0]
        = 0.

Returning to (4.5), we're left with the variance and another term:

    E(x,y)∼D, S∼Dm[(hS(x) − y)²] = Ex,S[(hS(x) − h̄(x))²] + Ex,y[(h̄(x) − y)²],    (4.6)

where the first term on the right is the variance.

Expanding the second term in the above equation:

    Ex,y[(h̄(x) − y)²] = Ex,y[(ȳ(x) − y)²] + Ex[(h̄(x) − ȳ(x))²]
                        + 2 Ex,y[(h̄(x) − ȳ(x))(ȳ(x) − y)],

where the first term is the noise and the second is the bias².

The third term in the equation above is 0, by the property of conditional
expectation stated in the footnote1 :

    Ex,y[(h̄(x) − ȳ(x))(ȳ(x) − y)] = Ex[Ey|x[ȳ(x) − y] (h̄(x) − ȳ(x))]
        = Ex[(ȳ(x) − Ey|x[y])(h̄(x) − ȳ(x))]
        = Ex[(ȳ(x) − ȳ(x))(h̄(x) − ȳ(x))]
        = Ex[0]
        = 0.

This gives us the decomposition of the expected test error as follows:

    Ex,y,S[(hS(x) − y)²] = Ex,S[(hS(x) − h̄(x))²] + Ex,y[(ȳ(x) − y)²] + Ex[(h̄(x) − ȳ(x))²],

that is, Expected Test Error = Variance + Noise + Bias².

Variance: Captures how much your model changes if you train on a


different training set. How “over-specialized” is your model to a particular
training set (overfitting)? If we have the best possible model for our training
data, how far off are we from the average model?

Bias: What is the inherent error that you obtain from your model even
with infinite training data? This is due to your model being ”biased” to
a particular kind of solution (e.g. linear function). In other words, bias is
inherent to your model.
1
    By the property of conditional expectation, we have:

    Ex,y[f(y)] = ∫x ∫y f(y) D(x, y) dx dy = ∫x ∫y f(y) D(y|x) D(x) dx dy    (4.7)
               = Ex[ ∫y f(y) D(y|x) dy ] = Ex[Ey|x[f(y)]].                  (4.8)

Figure 4.2: Representation of bias and variance contribution to error.

Noise: How big is the data-intrinsic noise? This error measures ambigu-
ity due to your data distribution and feature representation. You can never
beat this, it is an aspect of the data.
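
The decomposition of Lemma 6 can also be checked numerically. The sketch below (Python/NumPy; the target function, the polynomial model and the Monte Carlo sizes are assumptions for illustration) estimates the variance, bias² and noise terms by resampling many training sets:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # assumed target function
sigma = 0.2                                # label noise std, y = f(x) + eps
degree, m, n_datasets = 3, 20, 500         # model complexity, dataset size, Monte Carlo repeats

x_grid = np.linspace(0, 1, 200)            # points where we evaluate the predictors

# Train one predictor h_S per resampled dataset S
preds = np.empty((n_datasets, x_grid.size))
for s in range(n_datasets):
    x = rng.uniform(0, 1, m)
    y = f(x) + sigma * rng.normal(size=m)
    coeffs = np.polyfit(x, y, deg=degree)
    preds[s] = np.polyval(coeffs, x_grid)

h_bar = preds.mean(axis=0)                           # average predictor over datasets
variance = np.mean(preds.var(axis=0))                # E_{x,S}[(h_S(x) - h_bar(x))^2]
bias2 = np.mean((h_bar - f(x_grid)) ** 2)            # E_x[(h_bar(x) - ybar(x))^2], here ybar = f
noise = sigma ** 2                                   # E[(ybar(x) - y)^2]
print(f"variance={variance:.4f}  bias^2={bias2:.4f}  noise={noise:.4f}  "
      f"sum={variance + bias2 + noise:.4f}")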

⋆ Remark 9. What about neural networks? Don't models with millions of
parameters still generalise?

The bias–variance trade-off implies that a model should balance under-


fitting and overfitting: Rich enough to express underlying structure in data
and simple enough to avoid fitting spurious patterns. However, in modern
practice, very rich models such as neural networks are trained to exactly
fit (i.e., interpolate) the data. Classically, such models would be considered
overfitted, and yet they often obtain high accuracy on test data.

In [1]2 , the authors show a "double-descent" curve that incorporates the textbook
U-shaped bias–variance trade-off curve by showing how increasing model
capacity beyond the point of interpolation results in improved performance.

2
Link to paper: https://www.pnas.org/content/116/32/15849

Figure 4.3: Curves for training risk and test risk. (A) The classical U-shaped
risk curve arising from the bias-variance tradeoff. (B) The double-descent risk
curve, which incorporates the U-shaped risk curve (i.e. the classical regime)
together with the observed behaviour from using high-expressivity function
classes (i.e. the modern interpolating regime), separated by the interpolation
threshold. Predictors to the right of the interpolation threshold have zero
training risk. This is a current topic of discussion.

Figure 4.4: Double-descent risk curve for a fully connected neural network
on MNIST. Training and test errors are shown for different losses. The
dataset considered has 4000 datapoints, with feature dimension d = 784 and
K = 10 classes. The number of parameters of the network is given by
(d + 1)H + (H + 1)K. The interpolation threshold (black dashed line) is
observed at n · K.
Chapter 5

Linear Models

Contents
5.1 Linear Regression . . . . . . . . . . . . . . . . . . 76
5.1.1 Cost function choice . . . . . . . . . . . . . . . . . 76
5.1.2 Explicit solution . . . . . . . . . . . . . . . . . . . 78
5.1.3 Regularization . . . . . . . . . . . . . . . . . . . . 81
5.1.4 Representing nonlinear functions using basis func-
tions . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Classification . . . . . . . . . . . . . . . . . . . . . 85
5.2.1 Perceptron algorithm . . . . . . . . . . . . . . . . . 86
5.2.2 Support Vector Machine . . . . . . . . . . . . . . . 87
5.2.3 Detour: Duality theory of constrained optimisation 91
5.2.4 Non-separable case . . . . . . . . . . . . . . . . . . 98

In this chapter, we study the family of linear predictors, a very useful
hypothesis class. Linear predictors are intuitive, easy to interpret,
theoretically sound, and fit the data reasonably well in many natural learning
problems.

Let X = Rd and Y = R and consider the parametric class of affine


functions

    H = {fw : ⃗x 7→ fw(⃗x) = w⃗T ⃗x + b : w⃗ ∈ Rd , b ∈ R},    (5.1)


each function is parametrised by w⃗ and b, and takes as input a vector ⃗x ∈ Rd
and returns a scalar w⃗T ⃗x + b. One could consider the output space to be Rd′
for d′ ≥ 2 without changing the theory, but making the presentation slightly
more difficult. We will stick to Y = R throughout the chapter.

The homogeneous representation consists in including b in w⃗, so that
w⃗ = (w1 , ..., wd , b). The class above is then equivalent to taking the input
space to be Rd × {1} and fw(⃗x) = w⃗T ⃗x.

For binary classification, we can compose an element of H with the sign


function which returns the sign of a real number (positive or negative).

5.1 Linear Regression

5.1.1 Cost function choice

Linear predictors are nice because they are linear, and thus preserve nice
properties of the loss function ℓ : Y × Y → R+ to the (empirical) cost
function C : Rd → R+ , defined on the parameter space by

    C(w) := (1/m) Σ_{i=1}^{m} ℓ(fw(xi), yi).    (5.2)


For example, since fw is affine in the parameters, and the composition of a convex
and differentiable map with an affine map is convex and differentiable, it holds that
if ℓ is convex in its first argument, then so is C on the parameter space. In particular,
training on H with gradient flow on the parameters is guaranteed to converge
to a global minimum of the cost function (admitted without proof, see Section 3.5).

Common choices of loss functions are

• the squared error loss (or L2 loss) ℓ : (y, y′) 7→ ½ (y − y′)²,



• the absolute error loss (or L1 loss) ℓ : (y, y ′ ) 7→ |y − y ′ |,

• the Huber loss

    ℓ(y, y′, δ) = ½ (y − y′)²  for |y − y′| < δ,  and  δ(|y − y′| − ½ δ)  otherwise.

Figure 5.1: Different loss functions shapes.

The squared error loss is nice for theoretical reasons: convex, differentiable,
no hyperparameters; but it is not robust to outliers (i.e. your total
error might be dominated by an outlier, since the difference grows
quadratically). On the contrary, the absolute error loss is less sensitive to potential
outliers (see Figure 5.2 and footnote 1), and is convex too. However, it is not
differentiable at 0. The Huber loss is a compromise between squared and
absolute error losses: it is convex and both differentiable and robust to outliers.
The price to pay is that it has a hyperparameter δ that has to be tuned.

1
    Notebook: https://colab.research.google.com/drive/1ucl_aC8Q_q8Y5DC4uPpBqi5DyiFbSIBi?usp=sharing
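
For concreteness, the three losses above can be implemented as follows (a small Python/NumPy sketch; the value of δ is an arbitrary choice):

import numpy as np

def squared_loss(y, y_pred):
    return 0.5 * (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    return np.abs(y - y_pred)

def huber_loss(y, y_pred, delta=1.0):
    r = np.abs(y - y_pred)
    return np.where(r < delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

# The Huber loss is quadratic for small residuals and linear for large ones (robust to outliers)
residuals = np.array([0.1, 1.0, 10.0])
print(squared_loss(residuals, 0), absolute_loss(residuals, 0), huber_loss(residuals, 0))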

Figure 5.2: Example of optimizing the squared-error loss or absolute error
loss, in the presence of an outlier.

5.1.2 Explicit solution

Least squares is the method that solves the empirical risk minimization prob-
lem for the hypothesis class (5.1) with respect to the squared loss. We want
to find w that minimizes

    arg min_w C(w) = arg min_w L(fw) = arg min_w (1/(2m)) Σ_{i=1}^{m} (wT xi − yi)².


(Note that here we are using the homogeneous notation: w = (w1 , ..., wn , b), xi =
(xi1 , .., xin , 1)T .) We will use the more compact notation and equivalent for-
mulation

    arg min_w C(w) = arg min_w (1/(2m)) ||Xw − Y||²,    (5.3)


where X = (xij )ij ∈ Rm×n , Y = (y1 , . . . , ym )T , and || · || is the Euclidean


norm in Rm .

Theorem 7. The linear regression problem with square loss as in (5.3) sat-
isfies the following:

(i) if X T X is invertible, then the solution is unique, given by

w = (X T X)−1 X T Y ;

(ii) if X T X is not invertible, then there are infinitely many solutions. More-
over, the minimal L2 norm solution is given by

w = X + Y,

where X + denotes the pseudo inverse2 of X.


Remark 14. Note that if we assume that all datapoints are distinct (i.e. X is
full-rank), it holds that X T X is invertible if we are in the situation m ≥ n.
(at least as many datapoints m than degrees of freedom n) 3 Similarly, if we
are in the situation of m < n, then X T X is not invertible.4

Proof. We prove (i), as the proof of (ii) is an exercise of homework 2.

To solve this problem, we calculate the gradient of the cost function C


defined in (5.2) and set it to zero:

    ∇w C(w) = (1/(2m)) ∇w ||Xw − Y||² = (1/(2m)) ∇w (Xw − Y)T (Xw − Y)
             = (1/(2m)) ∇w [ (Xw)T (Xw) − (Xw)T Y − Y T (Xw) + Y T Y ]
             = (1/(2m)) ∇w [ (Xw)T (Xw) − 2(Xw)T Y ].


Using the lemma you proved in the homework, we write:

    ∇w (Xw)T (Xw) = ∇w (wT (X T X w)) = ((X T X) + (X T X)T ) w = 2(X T X) w,
2
The pseudo inverse of a matrix generalises the inverse to non-invertible matrices.
The geometric interpretation is the following: for any matrix X ∈ Rm×n , denote by
Im(X) ⊂ Rm its image. Then for all element v ∈ Im(X), there exists a unique element
u ∈ Rn such that Xu = v. The pseudo inverse X + ∈ Rn×m returns the following: for
any v ∈ Rm , consider its unique orthogonal projection π(v) onto Im(X), then X + v is the
unique u ∈ Rn such that Xu = π(v).
3
Why? Consider the dimension and rank of X T X.
4
Why? When m < n, X has at most rank m. As such, X T X ∈ Rn×n has also at most
rank m.

and
    ∇w 2(Xw)T Y = ∇w 2wT X T Y = 2X T Y.
This yields
    ∇C(w) = (1/(2m)) (2X T X w − 2X T Y) = 0.    (5.4)

Given that the Hessian of C is a positive semidefinite matrix, C is a convex
function. Hence, w∗ is a global minimum of C(w) if and only if ∇C(w∗) = 0;
we therefore look for w∗ satisfying this condition. This means w∗ is
a minimum if and only if it satisfies the normal equation

X T Xw∗ = X T Y

Since we assume that X T X is invertible, we have

w∗ = (X T X)−1 X T Y, (5.5)

as claimed.

Note that to prove (ii), we can proceed as in (i) up to Equation (5.4), but
then we cannot invert XT X. We can still seek a solution to Xw = Y, but
it is not unique. This means there is an infinite number of w∗ that achieve
the same minimal square error on the training data. This is called an under-
determined problem: we have too many degrees of freedom in the problem
and not enough constraints (data).

Which solution should we seek in the under-determined case? We can solve the
following minimization problem instead:

    w∗ = arg min_w ||w||²   subject to   w ∈ arg min_w ||Xw − Y||²,

i.e. we find, among the least-squares solutions, one with minimal L2 norm in
the parameter space.

To do so, one uses some linear algebra tools. Recall that if A = U ΣV T is
the Singular Value Decomposition5 of A, the pseudo-inverse of A is defined
as A+ = V Σ+ U T . Note that A+ is n × m and Σ+ is an n × m diagonal matrix
Σ+ = diag[1/σ1 , 1/σ2 , · · · , 1/σr , 0, · · · , 0], where r is the rank of A.

If A is invertible, the pseudo-inverse is the same as the inverse. If A is
not invertible, A+ A yields an n × n matrix which, in the basis given by V, is
diagonal with the first r diagonal entries equal to 1, and 0 for the others. AA+
is a similar m × m matrix in the basis given by U.

Note: if we plug in X = U ΣV T into the solution for the m > n case, we


see that
X + = (X T X)−1 X T
Remark 15. (No closed form solution) In general, one might not have
access to a closed form solution of a learning task. Then, we can use optimisation
techniques, such as gradient descent, as introduced in Section 3.5.

But even when a closed-form solution is available for linear models, we


might use gradient descent. This is because building (X T X)−1 can be expen-
sive, whereas computing the gradient of C and iterating is cheap. Further-
more, because C is convex, we can find the optimal solution (for appropriate
learning rate).
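
Below is a minimal sketch (Python/NumPy; the synthetic data is an assumption) comparing the closed-form solution of Theorem 7 with a library least-squares routine; as discussed, in practice one would often prefer np.linalg.lstsq or gradient descent over forming (XT X)−1 explicitly:

import numpy as np

rng = np.random.default_rng(0)

# Assumed data: y = 3*x1 - 2*x2 + 0.5 + noise, homogeneous notation (column of ones for b)
m = 200
features = rng.normal(size=(m, 2))
X = np.hstack([features, np.ones((m, 1))])
Y = 3 * features[:, 0] - 2 * features[:, 1] + 0.5 + 0.05 * rng.normal(size=m)

# (i) Normal-equation solution w = (X^T X)^{-1} X^T Y (valid here since m >= n)
w_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# (ii) Library solution via least squares (uses the pseudo-inverse / SVD internally)
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.round(w_normal, 3), np.round(w_lstsq, 3))   # both close to (3, -2, 0.5)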

5.1.3 Regularization

In the under-determined case n > m, in Theorem 7, we modified the objective
to include some norm of w to achieve a unique solution. Indeed, (infinitely)
many solutions may have zero empirical loss, but what we hope for is to have
to include some norm of w to achieve a unique solution. Indeed, (infinitely)
many solutions may have zero empirical loss, but what we hope for is to have
a small generalisation error as well. The practice of adding something extra
to the minimization is one way to prevent overfitting, as well as to attain a
unique solution.

Regularized Loss Minimization (RLM) is a learning rule in which we


jointly minimize the empirical risk (ERM) and a regularization function.
5
    Theorem 8. (Singular Value Decomposition (SVD)). Given a real m × n matrix A, it
can be decomposed as
A = U ΣV T ,
where U ∈ Rm×m , V ∈ Rn×n are orthogonal (U T U = U U T = I, V T V = V V T = I)
and Σ ∈ Rm×n is a diagonal matrix with non-negative real numbers in the diagonal. We
denote Σ = diag[σ1 , · · · , σr , 0, · · · , 0], each σi > 0 and r is the rank of A.

A regularization function is a mapping R : RP → R+ , and the regularized


loss minimization rule outputs a hypothesis in
    arg min_w (L(w) + R(w)).

As we mentioned, it helps prevent overfitting. Not only are we minimiz-


ing the empirical risk, but we are also minimizing some penalisation on the
model. Intuitively, the regularization function measures the complexity of
hypotheses. Regularization can also be seen as a stabilizer of the learning
algorithm. An algorithm is considered robust if a slight change of its input
does not change its output much (e.g. continuously).

The L2 regularisation is standard and is given by:


R(w) = λ||w||2 , λ > 0,
which in the context of linear regression leads to the Ridge regression,
defined by the following minimisation problem:

    min_w (1/(2m)) Σ_{i=1}^{m} (yi − wT xi)² + λ||w||²,   λ > 0,   ||w||² = Σ_{j=1}^{n} wj²    (5.6)


Theorem 9. Consider the ridge regression problem (5.6). If −mλ is not an


eigenvalue of X T X, then the solution is unique and is given by
w∗ = (X T X + λmI)−1 X T Y.

The proof can be done as in the non-regularised case by setting the gra-
dient of the cost to 0.
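As a numerical sanity check (a minimal sketch on synthetic data; NumPy and scikit-learn are assumed available), the closed form of Theorem 9 can be compared against a standard ridge solver. scikit-learn's Ridge minimises ||y − Xw||² + α||w||², so choosing α = λm reproduces the formula above:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
m, n = 50, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)
lam = 0.1

# closed form of Theorem 9: w* = (X^T X + lambda*m*I)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

w_sklearn = Ridge(alpha=lam * m, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_closed, w_sklearn))   # True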

Another type of regularisation is the L1 norm, namely

R(w) = λ|w|₁,   where |w|₁ = Σ_{i=1}^n |wi|,

which defines the Lasso regression, leading to the optimisation problem:

min_w  (1/(2m)) Σ_{i=1}^m (yi − w^T xi)² + λ|w|₁,   λ > 0.

However, there’s no closed form solution for the general case.



Remark 16. We note that a minimal norm solution (as seen in Theorem 7)
is not a solution of the norm-regularised problem. Nevertheless, it can be
shown that as λ → 0, the solution of the L2 regularized problem converges
to the minimal L2 norm solution of the original (non-regularized) problem.

The example below compares minimal L1 and L2 norm solutions, as well
as L2 regularized solutions.

Example 5.1.1. Let x1 = (1, 1)^T, x2 = (2, 2)^T, y1 = 1 and y2 = 2. One can
see that the solutions of min_{w∈R²} Σ_{i=1}^2 (w^T xi − yi)² are given by the line

w = (a, 1 − a)^T,   a ∈ R.

The minimal L2 norm solution is obtained for a = 1/2, whereas the


segment a ∈ [0, 1] contains all minimal L1 norm solutions of the problem.

However, if we regularise with an L2 penalty on the parameters, i.e. we
minimise min_{w∈R²} (1/4) Σ_{i=1}^2 (w^T xi − yi)² + λ||w||₂² for some λ > 0, then
w = (1/2, 1/2)^T is not a solution (one can convince oneself by taking the
gradient and noting that it is nonzero).

The conclusion is similar for the L1 -regularised problem6 .


Remark 17. Note that now the scale of the vectors in X matters. Why?
Since we want w⃗ to be small, features that span different magnitudes are
penalised unevenly and the contribution of a large feature will dominate the
regression. So it is necessary, when considering Ridge or Lasso regression,
to center and normalise the data X.

5.1.4 Representing nonlinear functions using basis functions

As we have seen, linear models have nice theoretical guarantees. What if


the relationship between inputs and outputs isn’t linear? It turns out we
6
An example of a regression weights being optimised under different regularisa-
tion strategies. https://developers.google.com/machine-learning/crash-course/
regularization-for-sparsity/l1-regularization

can use linear models to express non-linear relationships. Indeed, suppose


we have data exhibiting a quadratic relationship between inputs and outputs
(see Figure 5.3).

Figure 5.3: Quadratic relation

We can then understand that our approximator could benefit from having
non-linear features. Namely,

fw(x) = w1 x² + w0 x + b

Our linear coefficients are w = [b, w0, w1], and our “features” become [1, x, x²].
We are finding a linear model on “nonlinear features”, namely, given by x2 .

In general, one can fit non-linear functions via linear regression using a
transformation ϕ which applies nonlinear transformations on our features:

fw(x) = Σ_{i=1}^d wi ϕi(x)

For example, if we consider polynomial transformations of our feature space,


ϕ can be given as:

• Feature space is 1-dim: ϕ(x) → (1, x, x², ..., x^k)

• Feature space is 2-dim: ϕ(x) → (1, x1, x2, x1², x2², x1x2, ..., x1^k, x2^k)

• Feature space is d-dim: vector of all monomials in x1 to xd of degree
up to k.
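A minimal sketch (synthetic quadratic data; NumPy assumed) of fitting a nonlinear relationship with ordinary least squares on polynomial features, as described above:

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=100)
y = 1.5 * x**2 - 0.5 * x + 2.0 + 0.1 * rng.normal(size=100)   # quadratic ground truth

# nonlinear feature map phi(x) = (1, x, x^2); the model remains linear in w
Phi = np.column_stack([np.ones_like(x), x, x**2])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # approximately [2.0, -0.5, 1.5]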

5.2 Classification

Let us turn again to the classification problem, where we are considering a


dataset with binary labels, i.e.: S = {(xi , yi ) : i = 1..m} ⊂ Rn × {−1, 1}.

Assumption: We suppose that the dataset S ⊂ Rn × {−1, 1} is linearly


separable, i.e. there exists a hyperplane P = {x ∈ R^n : w^T x + b = 0}, for some
fixed nonzero w ∈ R^n and b ∈ R, that separates the data. Formally, for all i = 1..m, yi = 1
if and only if w^T xi + b > 0 and similarly, yi = −1 if and only if w^T xi + b < 0.

The class of half-spaces is defined as follows:

⃗ T ⃗x + b) : w
H = {fw : x → fw (x) = sign(w ⃗ ∈ Rn , b ∈ R}.

Let us illustrate this hypothesis class geometrically, considering the case


n = 2. Each hypothesis forms a hyperplane that is perpendicular to the
vector w⃗ = (w1 , w2 ) and intersects the vertical axis at the point (0, −b/w2 ).
The instances that are “above” the hyperplane, that is, share an acute angle
with w,
⃗ are labeled positively. Instances that are “below” the hyperplane,
that is, share an obtuse angle with w, ⃗ are labeled negatively.

How do we find a good w⃗? We can consider the following minimization:

arg min_w C(w) = arg min_{w⃗,b} Σ_{i=1}^m 1{yi ≠ sign(w⃗^T x⃗i + b)}.

However, the loss function ℓ(y, y′) = 1{y ≠ y′}, called the 0-1 loss, is neither convex
nor continuous. Thus, to make the optimisation problem easier, we introduce
a surrogate loss, called the Perceptron loss7 :

⃗ T ⃗x + b)).
ℓperc (x, y) = max(0, −y(w
7
Note that if y and w^T x + b have the same sign, then the second term is negative and
thus ℓperc(x, y) is zero. If the signs are opposite, then the error quantity will be strictly
positive.

5.2.1 Perceptron algorithm

This ℓperc loss is useful for the sake of solving the optimisation problem8 .

As previously seen, we compute the gradient of the cost:

C(w) = (1/m) Σ_{i=1}^m max(0, −yi(w^T xi + b))

∇C(w) = (1/m) Σ_{i=1}^m ∇_w max(0, −yi(w^T xi + b))

∇C(w) = (1/m) Σ_{i=1}^m { 0 if yi(w^T xi + b) ≥ 0 (correctly classified);  −yi xi otherwise (misclassified) }.

Then we can use the gradient descent algorithm to find a plane that
linearly separates the data (if that's possible)9. Then, the gradient descent
update becomes:

w_{t+1} = w_t − η_t (1/m) Σ_{i : yi(w^T xi + b) < 0} (−yi xi).

The Perceptron algorithm does not give us guarantees except that it


separates the data if the data is linearly separable (i.e.: if gradient descent
converges, it finds a local minimum which by convexity is a global minimum).
There are, however, infinitely many planes that separate the data.
8
What about at x = 0? The derivative is not unique, but it can be approached with
sub-gradients.
9
Consider the code in: https://colab.research.google.com/drive/
1xDC8vlx6Eepl34xmcN6Cm-GhwFStM5KG?usp=sharing, which shows the linearly sep-
arable case and a nonlinear separable case, and what happens with the perceptron
algorithm.
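To make the update rule concrete, here is a minimal sketch of batch (sub)gradient descent on the perceptron cost (synthetic separable data; NumPy assumed; the step size and iteration budget are arbitrary choices, and the bias b is updated with its own subgradient):

import numpy as np

rng = np.random.default_rng(3)
m = 200
X = rng.normal(size=(m, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.5)       # labels in {-1, +1}, linearly separable

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(1000):
    mis = y * (X @ w + b) < 0                       # misclassified points
    if not mis.any():
        break                                       # data separated: perceptron loss is zero
    w -= eta * (-(y[mis, None] * X[mis]).sum(axis=0) / m)
    b -= eta * (-y[mis].sum() / m)

print((y * (X @ w + b) < 0).sum(), "misclassified points after training")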

5.2.2 Support Vector Machine

The Support Vector Machine (SVM) is a linear classifier that can be viewed
as an extension of the Perceptron algorithm. In the context of binary classi-
fication in a linearly separable dataset, the Perceptron guarantees that you
find a separating hyperplane, while the SVM finds the maximum-margin
separating hyperplane. Refer to figure 5.4 for a comparison.

Figure 5.4: Two different separating hyperplanes for the same data set.
(Right:) The maximum margin hyperplane. The margin, γ, is the distance
from the hyperplane (solid line) to the closest points in either class (which
touch the parallel dotted lines).
Definition 5.2.1. (Margin) Consider a separating hyperplane defined through
w, b as the set of points P = {x ∈ Rn : wT x + b = 0}. The margin γ(w, b) is
defined as the distance from the hyperplane to the closest point across both
classes.

Then, we can state the SVM objective:

max_{w,b} γ(w, b)   s.t.   ∀i = 1..m,  yi(w^T xi + b) ≥ 0.    (5.7)

The SVM objective states that we want to find the hyperplane defined by
w, b such that the distance of that plane to points in the dataset is maximized,
and that the plane correctly classifies points.

The expression above is not yet computable. Namely, we need an explicit
expression for the margin γ in order to solve this optimisation problem;
however, you will see that it leads us to a max-min problem (5.7), which
is not trivial to solve. Luckily, it turns out that the SVM objective can be
reformulated in a much nicer form.
Proposition 1. The solution (w∗, b∗) of (5.7) is also the solution of

min_w  w^T w
s.t.  ∀i ∈ {1, . . . , m},  yi(w^T xi + b) ≥ 1.    (5.8)
Remark 18. This new formulation (5.8) is a quadratic optimization problem:
the objective is quadratic and the constraints are all linear. One can solve
it efficiently with a Quadratic Programming (QP) solver. It has a unique
solution whenever a separating hyperplane exists.

Before we prove Proposition 1, we need to establish some preliminary


results such as finding a convenient expression for the margin of a separating
hyperplane. Namely, how do we find the margin or, how do we compute the
distance of an arbitrary point x to a hyperplane P ?
Lemma 10. Let P be the hyperplane induced by a nonzero w ∈ R^n and
b ∈ R and let x ∈ R^n. We have that

dist(P, x) = |w^T x + b| / ||w||.

In particular, if S = {(x1, y1), . . . , (xm, ym)}, then the margin of P with
respect to S is given by

γ(w, b) = min_{x∈S} |w^T x + b| / ||w||.

Furthermore, we can rescale the parameters w and b, and thus the margin,
without changing the hyperplane (i.e. the margin is invariant to rescaling):

γ(βw, βb) = γ(w, b)   ∀β ∈ R, β ≠ 0.

Proof. Let d := x − xP ∈ Rn , where xP is the orthogonal projection of x onto


hyperplane P . We know that wT xP + b = 0 by definition, which entails that
wT (x − d) + b = 0.

Figure 5.5: Distance between an arbitrary point x and hyperplane P .

What is d? It’s the distance vector resulting from subtracting x from its
projection xP , and its norm is the minimum distance between x and any
point on the hyperplane. Furthermore, note that d is colinear with w, so we
can write d = αw for some α ∈ R. Then,

w^T(x − d) + b = 0
w^T(x − αw) + b = 0
α = (w^T x + b) / (w^T w) = (w^T x + b) / ||w||²

Now that we know α, we can compute the length of d = αw, i.e. the distance
of x to P, as

||d|| = √(d^T d) = √(α² w^T w) = |w^T x + b| / ||w||,
as claimed.

The margin of the hyperplane P to a dataset S follows by definition,


i.e., we are finding the point in S for which the distance to the hyperplane
is the smallest. From the expression of the margin, one then deduces its
scale-invariance property, which concludes the proof.
Remark 19. Symmetry of the margin: If the hyperplane is such that the
margin γ to a labeled dataset is maximized, it must lie right in the middle of
the two classes. In other words, γ must be the distance to the closest point
within both classes. (If not, you could move the hyperplane towards data
points of the class that is further away and increase γ, which contradicts that
γ is maximized.)

Proof of Proposition 1. By plugging the expression of γ from Lemma 10 in
the objective (5.7), we get

max_{w,b} min_{x∈S} |w^T x + b| / ||w||   such that   ∀i, yi(w^T xi + b) ≥ 0.
We can pull the denominator outside of the minimization because it does not
depend on x. Because the hyperplane is scale invariant, we can fix the scale
of w, b any way we want. Let’s be clever about it, and choose it such that

min_{x∈S} |w^T x + b| = 1.

Then, we can simplify the optimisation:

max_w (1/||w||) · 1,   which is equivalent to   min_w ||w||,

and note that the w that (satisfies the constraints and) minimizes ||w|| is
the same that minimizes ||w||2 . This is because f (x) = x2 is monotonically
increasing for x ≥ 0 and ||w|| ≥ 0.

Then, the new constrained optimization problem becomes:

min_w  w^T w
s.t.  ∀i,  yi(w^T xi + b) ≥ 0,
      min_i |w^T xi + b| = 1.

These constraints are still hard to deal with, however luckily we can show
that (for the optimal solution) they are equivalent to the much simpler (5.8).

( =⇒ ) We can write the constraints with the absolute value

|wT xi + b| = 1

as
yi (wT xi + b) = 1

because we are in the separable case. And if the minimum is 1 then all other
points are ≥ 1.

( ⇐= ) Assume that for all i ∈ {1, . . . , m}, yi(w^T xi + b) ≥ 1. How can we
guarantee that |w^T xi + b| = 1 for at least one point xi? Suppose it is not the
case, i.e. ρ := min_i yi(w^T xi + b) > 1; then dividing w and b by ρ yields a point
that still satisfies the constraints and has a strictly smaller norm, so w was
not minimal.

This concludes the proof.

Now that we have established the simpler formulation of the SVM prob-
lem (5.8), we can numerically find a solution to it, as mentioned in Remark
18. Although we do not have a general closed-form solution, more can be
said about the maximal margin hyperplane.

5.2.3 Detour: Duality theory of constrained optimisation

In this section, we see how to handle constraints when dealing with an op-
timisation problem. We will then apply the method to SVM to express the
solution of the SVM problem (5.8).

Equality constraints

Consider the following general constrained optimisation problem:

min_w C(w)    (5.9)
s.t.  hi(w) = 0,  i = 1, ..., r,

where the hi ’s are C 1 constraints.

One can use the method of Lagrange multiplier to solve (5.9), by turning a
constrained optimisation into an unconstrained optimisation and introducing
penalties on the violation of the constraints.

Definition 5.2.2. The Lagrangian function of (5.9) is defined for all w ∈ R^n
and α ∈ R^r by

L(w, α) = C(w) − Σ_{i=1}^r αi hi(w).    (5.10)

The coefficients αi ’s are called the Lagrange multipliers.

Intuition (hand-wavy). Suppose that w∗ is the solution of (5.9). In


particular, the constraints are satisfied and for each i ∈ {1, . . . , r}, if we
move in a direction w orthogonal to ∇w hi (w∗ ), we locally have hi (w∗ + ϵw) ≈
hi (w∗ ) + O(ϵ2 ). Now this reasoning applies if w is orthogonal to ∇w hi (w∗ )
simultaneously for all i ∈ {1, · · · , r}. If wT ∇w C(w∗ ) ̸= 0, then we found a
direction that decreases the cost in a neighborhood of w∗ , while respecting
the constraints, which contradicts that w∗ is optimal. This means that,
necessarily, wT ∇w C(w∗ ) = 0. This means that the gradient of the cost at w∗
is orthogonal to any w orthogonal to the gradient of the constraints at w∗ :
it holds that ∇w C(w∗ ) ∈ Span({∇w hi (w∗ ); i = 1, . . . , r}), that is, there exist
α1∗ , . . . , αr∗ such that
∇_w C(w∗) = Σ_{i=1}^r αi∗ ∇_w hi(w∗).    (5.11)

The above intuition justifies the definition of the Lagrangian function,
as it helps us write (5.11) in a concise manner as ∇_w L(w∗, α∗) = 0. By
computing the gradient of L(w, α) with respect to w and α, and setting it to 0,
we can solve for the unknowns. Whether we are minimizing or maximizing L,
the αi's play the role of penalising the solution if the constraints are violated.
Consider ∇_{w,α} L = 0 and write

∇_{w,α} C(w) − ∇_{w,α} Σ_i αi hi(w) = 0.

Expanding this out, we have a vector where the first n entries, with respect
to w, lead to

∇_w C(w) − Σ_i αi ∇_w hi(w) = 0,

and the last r entries lead to

hi(w) = 0,  i = 1, ..., r.

Define g_prim : w ↦ max_α L(w, α). The solution to (5.9)10 is given, in
terms of the Lagrangian, as:

p∗ = min_w g_prim(w).    (5.12)

How does this solution relate to that of the original problem (5.9)? When
considering gprim (w), we are fixing a w and then maximize over α. We see
that as soon as one constraint is violated, say hi (w) ̸= 0, then we can let
αi go to plus or minus infinity (depending on the sign of hi (w)) to make
the Lagrangian blow up, therefore gprim (w) = +∞. In particular, gprim (w) is
finite if and only if w satisfies the constraints (provided that C(w) is finite).
Hence, the solution w∗ of (5.9) is the same as that of (5.12).
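As a small worked illustration (not taken from the notes, only meant to make the mechanics explicit): minimise C(w) = w1² + w2² subject to h(w) = w1 + w2 − 1 = 0. The Lagrangian is L(w, α) = w1² + w2² − α(w1 + w2 − 1). Setting ∇_{w,α}L = 0 gives 2w1 − α = 0, 2w2 − α = 0 and w1 + w2 = 1, hence w1 = w2 = 1/2 and α = 1. At the optimum, ∇C(w∗) = (1, 1) is indeed a multiple of ∇h(w∗) = (1, 1), exactly as (5.11) predicts.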

Inequality constraints

Similarly, we can write the Lagrangian for the following constrained optimi-
sation problem (instead of equality in the constraints, we seek for ≤ con-
straints):

min_w C(w),    (5.13)
s.t.  hi(w) ≤ 0,  i = 1, ..., r.

Definition 5.2.3. The Lagrangian function of (5.13) is defined for all w ∈ R^n
and α ∈ [0, ∞)^r by

L(w, α) = C(w) + Σ_{i=1}^r αi hi(w),    (5.14)

where the αi ’s are called the Lagrange multipliers.


10
Also called as the primal problem/primal solution.

Again, note that if the constraint on hi (w) is violated (i.e. if hi (w) > 0),
then if αi > 0, this will lead to an increase in the Lagrangian.

As in the equality constraints case, the solution to (5.13) is given, in terms


of the Lagrangian, as:

p∗ = min_w g_prim(w) = min_w max_{α≥0} L(w, α).    (5.15)

The primal problem, however, does not seem easier to solve than the
original problem itself. Indeed, in the minimisation-maximisation problem
(5.15), because we maximize over α after having chosen a w, when mini-
mizing over w, we are restricted to candidates that satisfy the constraints
(otherwise, the Lagrangian blows up as explained before). What we would
like to do is to first choose α and only then minimize over w. This is what is
called the dual optimisation problem, which turns the constrained min-max
optimisation problem into a penalised max/min optimisation problem where
we search for an optimal solution over α after minimizing over w. 11

First, we write the dual function:

g_dual(α) = min_w L(w, α),

and then the dual optimisation problem:

d∗ = max_{α≥0} g_dual(α) = max_{α≥0} min_w L(w, α).    (5.16)

How does solving this relate to solving (5.15)? We can note that the solution
to (5.16) returns a lower bound to (5.15): first note that by definition, for
every w, α, we have

L(w, α) ≤ gprim (w).

Taking the minimum over w yields

gdual (α) ≤ p∗ ,
11
If you are interested in this, start with the notion of duality in Linear Programming
problems.

and then the maximum over α to get

d∗ ≤ p∗ .

In some cases, the solution to the original optimisation problem (primal)


is the same as the solution to the dual problem, i.e.:

d∗ = p∗ .

This is called strong duality condition. Spoiler alert: The strong duality
condition holds for the SVM optimisation problem.

To ensure strong duality, we need some conditions on the candidate so-


lutions, namely the Karush-Kuhn-Tucker (KKT) conditions12 : we say
that w∗ and α∗ fulfil the KKT conditions if and only if

∂L/∂wi (w∗, α∗) = 0,   i = 1, ..., n      (stationarity: ∇_w L = 0)
hi(w∗) ≤ 0,   i = 1, ..., r               (primal feasibility)
αi∗ ≥ 0,   i = 1, ..., r                  (dual feasibility)
αi∗ hi(w∗) = 0,   i = 1, ..., r           (complementary slackness)

We admit the following result:

Theorem 11. For an optimisation problem for which the strong duality con-
dition holds, any primal optimal solution w∗ and dual optimal solution α∗
respect the KKT conditions. Conversely, if C and the hi are affine for all i, then
the KKT conditions are sufficient for optimality.

Proof. Refer to the book “Convex Optimization” by Stephen P. Boyd and Lieven Vandenberghe.

Cool, now what? Two facts:


12
Subject to differentiability and convexity requirements.

• the SVM optimisation satisfies the strong duality condition.

• this means that the optimal w∗ for the SVM problem (5.8) and α∗
satisfy the KKT conditions.

SVM optimisation problem:

Recall the SVM optimisation problem

min_w  w^T w
s.t.  ∀i :  yi(w^T xi + b) − 1 ≥ 0.

The Lagrangian for the SVM problem13:

L(w, b, α) = w^T w − Σ_{i=1}^m αi (yi(w^T xi + b) − 1).

Observation 1: The SVM optimisation satisfies the strong duality con-


dition, we can solve either the dual problem (finding αi ) or the primal (finding
w).

Observation 2: The KKT conditions are obtained by setting the gradient
of the Lagrangian with respect to the primal variables w, b to zero and
plugging in the optimal solutions w∗ and α∗.

Then, let’s use the KKT conditions to say something about our solutions.

Using the stationarity condition:

∇_w L = w∗ − Σ_{i=1}^m αi∗ yi xi = 0   =⇒   w∗ = Σ_{i=1}^m αi∗ yi xi.    (5.17)

The optimal weight vector w∗ , which defines the normal to the hyperplane
that maximizes the margin, is a linear combination of the training vectors
x1 , ..., xm .
13
Sometimes you will see this multiplied by 1/2 to simplify the gradient computation...

Using the complementary slackness condition:

∀i, αi∗ [yi (wT xi + b) − 1] = 0 =⇒ αi∗ = 0 or yi (wT xi + b) = 1.

Some vectors lie exactly on the margin, and only for those can the corre-
sponding αi∗ be nonzero.

Combining with (5.17), a vector xi appears in the expansion of w∗ if and
only if αi∗ ≠ 0. The datapoints which appear in the expansion of w∗ are
called “support vectors”. The support vectors define the maximum margin
hyperplane.14

Note: While w is unique for the SVM problem, the support vectors are
not. E.g in a hyperplane in N dimensions, we need N + 1 points to define
a hyperplane. If there are more support vectors than N + 1, then we can
choose different support vectors to specify the hyperplane.

• If you were to move one of the support vectors and retrain the SVM,
the resulting hyperplane would change.

• If we move non-support vectors, the SVM hyperplane would not change


(provided you don’t move them too much, or they could turn into
support vectors themselves).

We thus have the following:

Proposition 2. The SVM classifier f∗ with parameters (w∗, b∗) that define
the maximal margin hyperplane for a separable dataset {(xi, yi); i = 1, . . . , m}
is of the form

f∗(x) = sign((w∗)^T x + b∗) = sign( (Σ_{i=1}^m αi∗ yi xi)^T x + b∗ ).
14
To find b, we can note that

yj (wT xj + b) = 1

for some j (one of the support vectors). Then b = yj − wT xj



Note that our approximating function is given by an inner product be-
tween support vectors and the new datapoint x, which we will exploit later
on to learn non-linear decision boundaries.

5.2.4 Non-separable case

So far we have assumed that the data was linearly separable. This is often not
the case: we cannot find a hyperplane that separates the two classes, and then
there is no solution to the optimization problems stated above.

We can fix this by allowing the constraints to be violated ever so slightly
with the introduction of slack variables:

min_{w,b,ξ}  w^T w + λ Σ_{i=1}^m ξi
s.t.  ∀i :  yi(w^T xi + b) ≥ 1 − ξi,
      ∀i :  ξi ≥ 0.

The slack variable ξi allows the input xi to be closer to the hyperplane (or
even be on the wrong side), but there is a penalty in the objective function
for such “slack”.

If λ is very large, the SVM becomes very strict and tries to get all points
to be on the right side of the hyperplane. If λ is very small, the SVM becomes
very loose and may “sacrifice” some points to obtain a simpler (i.e. lower
∥w∥22 ) solution.

Unconstrained Formulation: Let us consider the value of ξi for the


case of λ ̸= 0. Because the objective will always try to minimize ξi as much
as possible, the equation must hold as an equality and we have:

ξi = 1 − yi(w^T xi + b)   if yi(w^T xi + b) < 1,
ξi = 0                     if yi(w^T xi + b) ≥ 1.

This is equivalent to the following closed form:

ξi = max(1 − yi (wT xi + b), 0).



If we plug this closed form into the objective of our SVM optimization prob-
lem, we obtain the following unconstrained version, consisting of a loss function
and a regularizer:

min_{w,b}   w^T w  +  λ Σ_{i=1}^m max(1 − yi(w^T xi + b), 0),

where the first term plays the role of an ℓ2-regularizer and the sum is the
hinge loss.
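A minimal sketch of subgradient descent on this unconstrained hinge-loss objective (synthetic, noisy and possibly non-separable data; NumPy assumed; λ, the step size and the iteration count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(4)
m = 200
X = rng.normal(size=(m, 2))
y = np.sign(X @ np.array([1.0, 1.0]) - 0.2)
X += 0.3 * rng.normal(size=X.shape)                # noise: the classes may overlap

lam, eta = 1.0, 1e-3
w, b = np.zeros(2), 0.0
for _ in range(2000):
    active = y * (X @ w + b) < 1                    # points inside the margin or misclassified
    w -= eta * (2 * w - lam * (y[active, None] * X[active]).sum(axis=0))
    b -= eta * (-lam * y[active].sum())

print("training accuracy:", (np.sign(X @ w + b) == y).mean())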
Chapter 6

Kernel methods

Linear classifiers are great, but what if there exists no linear decision bound-
ary? As it turns out, there is an elegant way to incorporate non-linearities
into most linear classifiers.

We can make linear classifiers non-linear by applying a nonlinear mapping


ϕ on the input feature vectors X to a higher-dimensional space XH , where
linear separation is possible.

Figure 6.1: Nonlinear transformation on the input. Source: Scikit-learn.


A cool visualisation: https://www.youtube.com/watch?v=OdlNM96sHio&
t=0s

Some disadvantages:


• ϕ(x) might be very high dimensional.

• Do we have to build ϕ(x) from scratch?

In this chapter, we will talk about methods which use this idea of sending
elements x ∈ X onto some higher dimensional space X_H, which we don't
necessarily have to know much about, where our problem becomes easier to
solve. All we need to know about this new space is that it is a so-called
Reproducing kernel Hilbert space.

Before we can dive into what this means, we start with a brief review and
introduction to key concepts.

Note: throughout this chapter, we always assume that the output space
Y = R, to ease the notation.

6.1 Inner products and kernels

In R^n, the dot product x · x′ = Σ_{i=1}^n xi xi′ can be seen as a way to measure the
similarity between two elements x, x′ (e.g. x · x′ = 0 if and only if x and x′
are orthogonal, i.e. nothing of x can be used to represent x′ ). In an arbitrary
vector space, this notion can be generalised:

Definition 6.1.1. An inner product in a real vector space H is a map ⟨·, ·⟩ :


H × H → R that satisfies the following properties: for all u, v, w ∈ H and
α ∈ R,

1. ⟨u, v⟩ = ⟨v, u⟩, (symmetry)

2. ⟨u + αv, w⟩ = ⟨u, w⟩ + α ⟨v, w⟩, (bilinearity)

3. ⟨v, v⟩ ≥ 0 with equality if and only if v = 0. (positive-definiteness)

Note that by symmetry, we only need to check the linearity in the first
variable to get the bilinearity.

We will sometimes write ⟨·, ·⟩H to specify on which space we consider the
inner product, in order to avoid ambiguity.
Example 6.1.1. Inner product examples:

• The Euclidean space Rn , where the inner product is given by the dot
product:

⟨(x1 , · · · , xn ), (y1 , · · · , yn )⟩ = x1 y1 + · · · + xn yn .

• The vector space of real functions whose domain is a closed interval
[a, b], with inner product:

⟨f, g⟩ = ∫_a^b f(x) g(x) dx.

The inner product induces a norm on H:

||u||_H := √⟨u, u⟩_H.

In particular, once we have a norm, we have a (induced) metric (d(u, v) :=


||u − v||H ), and we can make sense of the notion of completeness of a space.
Remark 20. Reminder. A space H is said to be complete (w.r.t. a norm
||·||_H) if and only if every Cauchy sequence (un)n≥1 of elements of H converges
(w.r.t. the norm ||·||_H) to some limit u∗ ∈ H.
An inner product space that is complete w.r.t. the norm induced by its inner
product is called a Hilbert space.

Theorem 12. (Riesz Representation theorem) If T is a bounded linear
functional on a Hilbert space H, then there exists some g ∈ H such that

T(f) = ⟨f, g⟩_H   ∀f ∈ H.

This means every (bounded) linear operator T : H → R, on a Hilbert


space can be written as an inner product of some g ∈ H.

Definition 6.1.2. Let X be a non-empty set. We say that a symmetric1
function K : X × X → R is a positive definite kernel (pd kernel) if and only
if for any fixed n ∈ N and c1, · · · , cn ∈ R, it holds that

Σ_{i,j=1}^n ci cj K(xi, xj) ≥ 0,   ∀x1, · · · , xn ∈ X.

Remark 21. Caveat: Sometimes2 , the above is called a positive semidefinite


kernel, because it allows the sum in the definition to be 0 even for non-
null ci ’s. In the Machine Learning literature, “positive definite kernel” is
understood as defined in Definition 6.1.2 and this is the convention we will
use in this manuscript.

Note there’s no restrictions on X (except it’s non-empty). The property


that defines pd kernels can be rephrased as: for all n ∈ N and x1 , · · · , xn ∈ X ,
the matrix K that is given by

K i,j = K(xi , xj ) i, j = 1..n

is (symmetric) positive semidefinite. This matrix is called Gram matrix.

Exercise: Show that it is equivalent for a symmetric matrix to be positive


semidefinite and to have all its eigenvalues non-negative.

6.2 Reproducing kernel Hilbert spaces

We introduced the two notions of inner products and pd kernels in the previ-
ous section. The link between them may not seem straightforward. However,
as mentioned in the previous section, pd kernels share some similarities with
pd symmetric matrices, which satisfy the following:

• if A ∈ Rn×n is a pd symmetric matrix, then one can check that the


function b : (x, x′ ) 7→ xT Ax′ , x, x′ ∈ Rn , defines an inner product.
1
In this manuscript, it is implicitly assumed that a kernel is symmetric.
2
In Probability Theory

Viewing pd kernels as the infinite-dimensional analogue of pd symmetric


matrices, one can hope that a pd kernel is always linked to some inner prod-
uct. It is indeed the case but before we see it, we need to introduce some
concepts.

Let H be a Hilbert space of functions from X to R.


⋆ Remark 10. We are given a Hilbert space H of real functions on X.
Suppose that for all x ∈ X, the functional Lx : H → R defined by Lx(f) =
f(x) is a bounded operator on H, i.e.

∀x ∈ X, ∃Mx > 0 s.t. ∀f ∈ H : |f(x)| ≤ Mx ||f||_H.    (6.1)

Then, using the Riesz representation theorem, there exists a unique Kx ∈ H
such that Lx(f) = ⟨f, Kx⟩_H, so that one can define the kernel K(x, x′) :=
⟨Kx, Kx′⟩_H on X × X such that H is a RKHS.
Definition 6.2.1. We say that a kernel K on H satisfies the reproducing
property if and only if for all x ∈ X and all f ∈ H, it holds that

⟨f, Kx ⟩H = f (x),

where Kx := K(x, ·) ∈ H. In this case, we say that H is a reproducing kernel
Hilbert space (RKHS) with reproducing kernel K.

We note in particular that for any x, y ∈ X ,

K(x, y) = ⟨K(·, x), K(·, y)⟩H .


Example 6.2.1. On the reproducing property:

Let the feature map ϕ : R² → H, where H is the space of functions from
R² to R, be defined for all x ∈ R² by

ϕx ∈ H :  y ↦ x1y1 + x2y2 + x1x2y1y2,   R² → R.

Define the kernel K on R² × R² by

K(x, x′) = ⟨ϕx, ϕx′⟩_H := x1x1′ + x2x2′ + x1x2x1′x2′.



Fix u ∈ R3 and define fu : R2 → R by


fu (x) = u1 x1 + u2 x2 + u3 x1 x2 .
Note that fu ∈ H, since we can write fu = ϕ(u3 ,1) + ϕ(u1 −u3 ,0) + ϕ(0,u2 −1) ,
which is a linear combination of elements of H. We thus have a RKHS
H with feature map ϕ, and we can check that K enjoys the reproducing
property: since Kx = ϕx , we have that
⟨fu, Kx⟩_H = ⟨ϕ_(u3,1), ϕx⟩ + ⟨ϕ_(u1−u3,0), ϕx⟩ + ⟨ϕ_(0,u2−1), ϕx⟩ = fu(x),
as claimed.

Now we can ask ourselves, given a pd kernel K on X × X , is there a


RKHS of real functions on X associated with K? The answer is positive and
is known as the Moore-Aronszajn theorem:
Theorem 13. Moore-Aronszajn theorem: Let K : X × X → R be a
positive definite kernel. There exists a unique RKHS H ⊂ {f : X → R} with
reproducing kernel K. In particular, there exists a mapping ϕ : X → H such
that for all x, x′ ∈ X ,
K(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩H .

Note that Theorem 13 shows that a pd kernel induces a unique RKHS


and vice versa. However, for a given RKHS, the feature map ϕ is not unique.

In view of the above theorem, why do we introduce a kernel instead


of computing ϕ(x), and then the inner product in H? Because evaluating
the kernel is computationally more tractable. Furthermore, we can define a
kernel K without explicitly knowing what the space H is. Thanks to the
above theorem, a pd kernel allows us to measure the similarity of two points
x, x′ ∈ X through the implicit inner product of ϕ(x) and ϕ(x′ ) in a (unknown)
RKHS. This is sometimes referred to as the kernel trick in machine learning:
we send the input space to a higher dimensional space where the elements
can better be compared.

Equipped with those theoretical tools, we now look at some examples


where introducing a kernel can be useful.

Example 6.2.2. Polynomial kernel: For any constant c > 0, a polynomial
kernel of degree d ∈ N is the kernel K defined over X ⊂ R^N by:

∀x, x′ ∈ X,   K(x, x′) = (x · x′ + c)^d.

Let the input space be of dimension N = 2; a second degree polynomial
d = 2 corresponds to the following inner product in dimension 6:

∀x, x′ ∈ X,  K(x, x′) = (x1x1′ + x2x2′ + c)²
            = (x1², x2², √2 x1x2, √(2c) x1, √(2c) x2, c) · (x1′², x2′², √2 x1′x2′, √(2c) x1′, √(2c) x2′, c).

The features corresponding to a second-degree polynomial are the origi-


nal features (x1 , x2 ) as well as products of these features, and the constant
feature.3
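A quick numerical check of this identity (a NumPy sketch; phi below is just the 6-dimensional feature map written above, with c an arbitrary positive constant):

import numpy as np

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

rng = np.random.default_rng(5)
x, xp = rng.normal(size=2), rng.normal(size=2)
c = 1.0

k_direct = (x @ xp + c) ** 2            # kernel evaluation in the original 2-d space
k_feature = phi(x, c) @ phi(xp, c)      # explicit inner product in dimension 6
print(np.isclose(k_direct, k_feature))  # True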
Example 6.2.3. Gaussian kernel: For any constant σ > 0, a Gaussian
kernel or radial basis function (RBF) is the kernel K defined over X ⊂ RN
by:
∀x, x′ ∈ X,   K(x, x′) = exp( −||x − x′||² / (2σ²) ).

What mapping ϕ would lead to this kernel? Let us consider a simplification,
where σ = 1, and let the original instance space be R. Consider the mapping
ϕ(x) := ( (1/√(n!)) e^{−x²/2} x^n )_{n≥0}, so that Kx(·) = ⟨ ( (1/√(n!)) e^{−x²/2} x^n )_{n≥0}, · ⟩.
3
A cool visualisation using Kernel SVM, with a polynomial kernel: https://www.
youtube.com/watch?v=OdlNM96sHio&t=0s

⟨Kx, Kx′⟩ = Σ_{n=0}^∞ ( (1/√(n!)) e^{−x²/2} x^n ) ( (1/√(n!)) e^{−x′²/2} (x′)^n )
          = e^{−(x² + x′²)/2} Σ_{n=0}^∞ (x x′)^n / n!
          = e^{−(x − x′)²/2} = e^{−||x − x′||²/2}.

Intuitively, the Gaussian kernel sets the inner product in the feature space
between x, x′ to be close to zero if the instances are far away from each other
(in the original domain), and close to 1 if they are close.

Gaussian kernels are among the most frequently used kernels in applica-
tions.
Example 6.2.4. Kernelised SVM: Recall that using the Lagrange multipli-
ers to solve a linearly separable classification task with SVM, the solution
(w∗, b∗) has the following form

f_{w∗}(x) = sign(w∗^T x + b∗) = sign( Σ_{i=1}^m αi yi ((xi)^T x) + b∗ ).

The “kernelised” SVM (an example can be seen in figure 6.2) yields:

f_{w∗}(x) = sign( Σ_{i=1}^m αi yi K(xi, x) + b∗ ).

We simply replace the dot product (xi )T x by K(xi , x), i.e. we compare the
features through K, or equivalently, we compare the features in some implicit
higher dimensional space where the similarity is measured through an inner
product.

See figure 6.2 for an example of Kernel SVM using a Gaussian Kernel
(radial basis functions, RBF), and again the following video https://www.
youtube.com/watch?v=OdlNM96sHio&t=0s for an example of the SVM using
a polynomial kernel.

Figure 6.2: Linear SVM and SVM using a RBF kernel. Source:
https://scikit-learn.org/stable/auto_examples/classification/
plot_classifier_comparison.html

Example 6.2.5. Consider the function K : R2 × R2 → R given as:

K(x, y) = x1 y1 + x2 y2 .

Is this a valid kernel? I.e. Is it a symmetric, positive definite kernel?

1. Check symmetry

2. Consider for any n ∈ N, the set of points x1 , ..., xn ∈ R2 . Verify that


the matrix K, given by

K i,j = K(xi , xj ) i, j = 1, · · · , n,

is symmetric positive semidefinite.
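For item 2, a numerical (non-proof) sanity check on random points can look as follows (a NumPy sketch; the tolerance accounts for floating point error, and passing it of course does not replace the argument asked for in the exercise):

import numpy as np

def K(x, y):
    return x[0] * y[0] + x[1] * y[1]     # the kernel of Example 6.2.5

rng = np.random.default_rng(6)
pts = rng.normal(size=(20, 2))
G = np.array([[K(xi, xj) for xj in pts] for xi in pts])   # Gram matrix

print(np.allclose(G, G.T))                       # symmetric
print(np.linalg.eigvalsh(G).min() >= -1e-10)     # all eigenvalues >= 0, i.e. PSD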

Now that we got some intuition for what kernels are and how they work,
let us prove the main theorem (Moore-Aronszajn Theorem) of this section.

Proof. (Proof of theorem 13) Sketch of proof:

We are given a pd kernel K and we want to construct the RKHS H. We


build H up in the following way:

• Let G1 := {Kx : x ∈ X }, where we recall that Kx = K(x, ·).

• Let G₂ be the set of finite linear combinations of elements of G₁, i.e.

G₂ := { Σ_{i=1}^r αi K_{xi} : r ∈ N, αi ∈ R, xi ∈ X, ∀i = 1..r }

One can check that G2 is a vector space.

• We define an inner product in G₂ as follows: for all x, y ∈ X, define
b : G₁ × G₁ → R by b(Kx, Ky) := K(x, y).
Then, for all f, g ∈ G₂, there exist r_f, r_g ∈ N, αi, βj ∈ R and xi, yj ∈ X
for all i = 1..r_f, j = 1..r_g such that we can write f, g as:

f = Σ_{i=1}^{r_f} αi K_{xi},    g = Σ_{i=1}^{r_g} βi K_{yi}.

We then extend b on G₂ × G₂ as

b(f, g) = Σ_{i=1}^{r_f} Σ_{j=1}^{r_g} αi βj K(xi, yj).

We readily see by construction that b is symmetric and bilinear.



Moreover, since K is a pd kernel, it holds that b(f, f) ≥ 0 for all f ∈ G₂.
To show that b is positive definite, it remains to show that b(f, f) > 0
when f ≠ 0.
Exercise: show that |b(f, Kx)| ≤ √(b(f, f) K(x, x)), x ∈ X, f ∈ G₂
(Cauchy-Schwarz Inequality). Hint: look at b(f + Kx, f + Kx) and
b(f − Kx, f − Kx), then use that for all c, d ∈ R, c² + d² ≥ 2|cd|.
We choose x ∈ X such that f(x) ≠ 0 and we use the result of the above
exercise: 0 < f(x)² = b(f, Kx)² ≤ b(f, f) K(x, x). We thus have that b
is positive definite, which entails that b is an inner product on G₂.
We then define ⟨·, ·⟩_{G₂} := b(·, ·) and ||f||_{G₂} := √⟨f, f⟩.
• There is a last property that G2 lacks to be a Hilbert space: it is not
complete. One can define the space
H := { lim_{n→∞} fn ; (fn)n≥1 Cauchy sequence in (G₂, ||·||_{G₂}) }

where the limit is the pointwise limit. For f, g ∈ H, it is possible to


define ⟨f, g⟩H := limn→∞ ⟨fn , gn ⟩G2 that makes H a Hilbert space.

⋆ Remark 11. The proof of this last point turns out to be quite technical
and is beyond the scope of this notes. For our purpose, the intuition given by
the above sketch should be enough. For the curious and motivated reader,
we give a rough plan of how to complete the proof of the last point:

(i) Show that if (fn)n≥1 is a ||·||_{G₂}-Cauchy sequence, then (fn(x))n≥1 is a Cauchy
sequence in R for all x ∈ X (use the reproducing property then the Cauchy-Schwarz
inequality). In particular, (fn)n≥1 converges pointwise.
(ii) Show that if a Cauchy sequence fn → 0 pointwise as n → ∞ then
||fn ||G2 → 0, (fix N ∈ N large enough, write ⟨fn , fn ⟩G2 = ⟨fn − fN , fn ⟩G2 +
⟨fN , fn ⟩G2 and bound the two terms).

(iii) Show that for two Cauchy sequences (fn)n≥1, (gn)n≥1 in G₂, (⟨fn, gn⟩_{G₂})n≥1
is a Cauchy sequence in R. (Use the Cauchy-Schwarz inequality)
(iv) For f, g ∈ H, define ⟨f, g⟩H := limn→∞ ⟨fn , gn ⟩G2 and show using
(ii) that it does not depend on the choice of the Cauchy sequences
(fn )n≥1 , (gn )n≥1 that converge pointwisely to f and g.

(v) Show that ⟨·, ·⟩H is indeed an inner product.

(vi) Show that G2 is dense in H.

(vii) Show that H is complete: take a Cauchy sequence (fn )n≥1 in H and use
(vi) to define a sequence (gn )n≥1 in G2 such that limn→∞ ||fn − gn ||H =
0; check that (gn )n≥1 is a Cauchy sequence in G2 that pointwisely
converges to a function g ∈ H and show that fn converges to g in
H.

(viii) Check that K is the reproducing kernel of H.

6.3 Mercer’s Theorem

In this section, we will state results on pd kernels without proofs. We will see
that a kernel can be represented as a sum of its eigenfunctions, similar to the
eigendecomposition of a symmetric matrix. Thanks to this representation,
the inner product in the associated RKHS can be seen as an inner product
in L2 (µ), the set of square integrable functions against some measure µ with
compact support in X . We will then use this representation to see how the
inner product in the RKHS corresponds, for some specific examples, to a dot
product in Rn .

Definition 6.3.1. Let µ be a finite measure on a compact subset B ⊂ X.
The integral operator T_K : L²(µ) → L²(µ) induced by a pd kernel K and the
measure µ is defined by

T_K f : X → R,   x ↦ T_K f(x) := ∫_X K(x, x′) f(x′) µ(dx′).

We say that e : X → R is an eigenfunction of TK with eigenvalue λ if and


only if TK e = λe.

Pd kernels can be seen as infinite-dimensional generalization of positive


definite matrices. We admit the well known following fact:

A real matrix M ∈ Rn×n is semi-positive definite with rank k ≤ n if


and only if M = B T B for some matrix B k×n of rank k, and if M is positive
definite, then k = n. In particular, the columns b1 , . . . , bn ∈ Rk of B are such
that Mi,j = ⟨bi , bj ⟩, where the inner product denotes the dot product.

It turns out that we can decompose the kernel K using the eigenfunctions
of the operator T_K, and get the analogue of the above fact for pd kernels. For
a measure µ on a set B, let L²(B, µ) := { f : B → R : ∫_B f(x)² µ(dx) < ∞ }.
Theorem 14 (Mercer’s Theorem). Let K be a continuous pd kernel and µ
be a finite measure supported on a compact subset B ⊂ X . There exists an
orthonormal basis (ei )i≥1 of L2 (B, µ) consisting of eigenfunctions of TK with
non-negative eigenvalues (λi )i≥1 . Furthermore, for all i ≥ 1, if λi > 0, then
ei is continuous and for all x, x′ ∈ B, it holds that
K(x, x′) = Σ_{i≥1} λi ei(x) ei(x′),

where the series converges uniformly on B.

Suppose that T_K has finitely many non-zero eigenvalues, say n ∈ N. Then
with Mercer's Theorem, we can write K as a dot product:

K(x, x′) = Σ_{i=1}^n λi ei(x) ei(x′) = ⟨e_λ(x), e_λ(x′)⟩,

where we defined e_λ(x) := (√λ1 e1(x), . . . , √λn en(x)) ∈ R^n. Now by Theorem
13, this means that

⟨ϕ(x), ϕ(x′)⟩_H = ⟨e_λ(x), e_λ(x′)⟩,

that is, the inner product in the RKHS H actually corresponds to a dot
product in Rn !

Let’s come back to Example 6.2.1 to illustrate what this means.

Exercise.

Let K(x, x′ ) := x1 x′1 + x2 x′2 + x1 x2 x′1 x′2 . Let B := [−c, c]2 ⊂ R2 for a
positive real number c and let µ(dx) := 1B (x)dx

(i) Write the induced integral operator TK .

(ii) Show that e1 : R2 → R, x 7→ x1 is an eigenfunction of TK and find its


eigenvalue.

(iii) Find all the other eigenfunctions with nonzero eigenvalues.

(iv) Write an orthonormal basis of L2 (B, µ).

(v) Deduce the expression of K(x, x′ ) as a dot product in Rn for some


n ∈ N.

Example 6.3.1. In Example 6.2.1, we started from a feature map ϕx to define


the kernel K(x, x′ ) = x1 x′1 + x2 x′2 + x1 x2 x′1 x′2 . Visually, we recognized the
dot product in R3 of the vectors (x1 , x2 , x1 x2 )T and (x′1 , x′2 , x′1 x′2 )T . We now
show that this is exactly what can be derived from Mercer’s Theorem.

Let B := [−c, c]2 ⊂ R2 with c > 0 arbitrary. Let µ be the Lebesgue


measure on B. The associated integral operator reads as

T_K f(x) = ∫_{[−c,c]²} (x1x1′ + x2x2′ + x1x2x1′x2′) f(x′) dx′.

We look for the eigenfunctions of T_K:

T_K f(x) = λ f(x),   x ∈ B
⇔  a1 x1 + a2 x2 + a3 x1x2 = λ f(x),    (6.2)

where

a1 := ∫_{[−c,c]²} x1′ f(x′) dx′,
a2 := ∫_{[−c,c]²} x2′ f(x′) dx′,
a3 := ∫_{[−c,c]²} x1′x2′ f(x′) dx′.

One can check that e1(x) = x1/λ1^{1/2}, e2(x) = x2/λ2^{1/2} and e3(x) = x1x2/λ3^{1/2} are eigen-
functions with respective eigenvalues λ1 = λ2 = (4/3)c⁴ and λ3 = (4/9)c⁶. Let us
do it only for the first eigenfunction: we plug e1 into the left-hand
side of (6.2) and we get

(1/λ1^{1/2}) [ x1 ∫_{[−c,c]²} (x1′)² dx′ + x2 ∫_{[−c,c]²} x1′x2′ dx′ + x1x2 ∫_{[−c,c]²} (x1′)² x2′ dx′ ]
= e1(x) · ∫_{[−c,c]²} (x1′)² dx′ + 0 + 0
= e1(x) · (4/3)c⁴
= λ1 e1(x),

as claimed.

The reader can also check that e1 , e2 and e3 are orthonormal and it is
clear from (6.2) that TK has no other eigenfunction with non-zero eigenvalue
(that is not a combination of these three).
Let e_λ(x) := (λ1^{1/2} e1(x), λ2^{1/2} e2(x), λ3^{1/2} e3(x))^T; then for any x, x′ ∈ B, we
thus see that

K(x, x′ ) = ⟨ϕx , ϕx′ ⟩H = ⟨eλ (x), eλ (x′ )⟩ .

Conclusion: For a simple kernel, we used Mercer’s Theorem to express


it as a dot product in R3 instead of an abstract inner product in the RKHS of
its canonical feature maps ϕx, ϕx′. One thus speaks of non-canonical feature
maps φ(x) = (x1, x2, x1x2)^T, which define a Hilbert space that is not the
RKHS, but for which the inner product is equivalent.

Note that we made an arbitrary choice for the compact set B = [−c, c]2
and the finite measure µ(dx) = 1B (x)dx. Changing c does not change
the eigenfunctions (inside the smallest B), but does change the eigenval-
ues. Choosing a different shape than a square for B would more profoundly
change the operator and then the eigenfunctions, which would lead to other
non-canonical feature maps and another Hilbert space, but with, again, an
equivalent inner product.
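A discrete analogue of this discussion can be checked numerically: the eigendecomposition of the Gram matrix of K on a finite sample plays the role of the eigenfunction expansion (this is only an illustration on sampled points, not the integral operator itself; NumPy assumed):

import numpy as np

def K(x, xp):
    return x[0]*xp[0] + x[1]*xp[1] + x[0]*x[1]*xp[0]*xp[1]

c = 1.0
rng = np.random.default_rng(7)
pts = rng.uniform(-c, c, size=(200, 2))
G = np.array([[K(x, xp) for xp in pts] for x in pts])

lam, V = np.linalg.eigh(G)                       # eigendecomposition of the Gram matrix
print((lam > 1e-8).sum())                        # 3: the kernel has a 3-dimensional feature space
print(np.allclose(G, V @ np.diag(lam) @ V.T))    # G_ij = sum_k lam_k v_k(i) v_k(j)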

6.4 Representer theorem

The representer theorem plays an important role in a large class of learning prob-
lems. It provides the means to reduce an infinite dimensional optimization
problem to a tractable finite dimensional one.

Theorem 15. Let X be a set, K a positive definite kernel on X and H
its corresponding RKHS. Furthermore, let S = {(x1, y1), ..., (xm, ym)} be
a finite set of points in X × Y.

Let us consider h ∈ H as a candidate model, λ > 0 and ℓ an arbitrary
loss function. Then, the solution to the optimisation problem:

min_{h∈H}  Σ_{i=1}^m ℓ(h(xi), yi) + λ||h||²_H

admits a representation of the form:

∀x ∈ X,   fw(x) = Σ_{i=1}^m wi K(xi, x) = Σ_{i=1}^m wi K_{xi}(x).

Proof. Let Hs be the sub-space spanned by the training data:

Hs = { fw ∈ H : fw(x) = Σ_{i=1}^m wi K_{xi}(x),  (w1, · · · , wm) ∈ R^m }

Hs is a finite dimensional subspace of H.

Then, ∃Hs⊥ = {u ∈ H : ⟨u, v⟩ = 0, ∀v ∈ Hs }, such that

∀fw ∈ H, fw = fw,s + fw⊥ (orthogonal decomposition)

The part which is orthogonal has the following property:

∀i = 1, · · · , m,   fw⊥(xi) = ⟨fw⊥, K_{xi}⟩ = 0

by the reproducing property of RKHS (because Kxi ∈ Hs ).



Therefore,
∀i = 1, · · · , m fw (xi ) = fw,s (xi )
i.e the orthogonal part does not influence fw at points xi .

Lastly, we must show that for a minimum fw , fw⊥ does not enter the last
term of the objective function, given by λ||fw ||2H . The last term is given by
the norm of fw in H.

||fw ||2H = ||fw,s ||2H + ||fw⊥ ||2H .

As the objective function is strictly increasing in the last variable (the


norm), then
||fw⊥ ||2H = 0
minimizes the objective function. Thus, minimum fw ∈ Hs .

6.5 Kernel (ridge) regression

In chapter 5, we studied linear models and obtained the analytical solution of


linear regression in Theorem 7 and that of linear ridge regression in Theorem
9, when considering the squared error loss. In the current chapter, we saw
how kernel methods can express non-linear relationships between inputs and
outputs as linear combinations of elements of a RKHS, see in particular
Theorem 15.

Question: Can we derive an analytical solution with kernel methods?

It turns out that by considering the squared error loss, the answer is yes.
We define respectively the objective function and the regularised objective
function for all h ∈ H as
m
1 X
L(h) := (h(xi ) − yi )2 ,
m i=1
Lr (h) := L(h) + R(h),

where R(h) := λ||h||2H for some λ > 0. The kernel regression and kernel
ridge regression problems then read as

min_{h∈H} L(h),    (6.3)
and   min_{h∈H} Lr(h).    (6.4)

Theorem 16. Let S = {(xi , yi ) ∈ X × Y; i = 1..m} and recall that K =


(K(xi , xj ))1≤i,j≤m denotes the Gram matrix of the dataset. Let K(x, X) =
(K(x, x1 ), . . . , K(x, xm )).

(i) If K is invertible, then a solution to (6.3) is h∗ given by

h∗ (x) = K(x, X)K −1 Y.

(ii) If −mλ is not an eigenvalue of K, then (6.4) is minimised at h∗ given


by

h∗ (x) = K(x, X) (K + mλIm )−1 Y,

where Im is the m × m identity matrix.
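A minimal sketch of kernel ridge regression following (ii) (NumPy assumed; the Gaussian kernel, its bandwidth and λ are arbitrary illustration choices):

import numpy as np

def rbf(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(8)
m = 60
X = rng.uniform(-3, 3, size=(m, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=m)

lam = 1e-2
K = rbf(X, X)
alpha = np.linalg.solve(K + m * lam * np.eye(m), Y)    # (K + m*lambda*I)^{-1} Y

X_test = np.linspace(-3, 3, 200)[:, None]
y_pred = rbf(X_test, X) @ alpha                        # h*(x) = K(x, X)(K + m*lambda*I)^{-1} Y
print(y_pred.shape)                                    # (200,)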


Chapter 7

Gaussian processes

Contents
7.1 Formal definition
7.2 Gaussian processes and kernel methods

So far, we have considered models which have a clear functional structure,


meaning, we consider a class of functions (for example, linear functions).
Another approach to tackle the learning problem (both in regression and
classification), is to give a prior probability to every possible function, where
higher probabilities are given to functions that we consider to be more likely,
for example, because they are smoother than other functions.

The first mentioned approach has an obvious problem, in that we have to


decide upon the richness of the class of functions considered; if we are using
a model based on a certain class of functions (e.g. linear functions) and the
target function is not well modelled by this class, then the predictions will be
poor. We can increase the flexibility of the class of functions, but this runs
into the danger of overfitting, where we can obtain a good fit to the training
data, but perform badly when making test predictions.

The second approach appears to have a serious problem, in that surely


there is an uncountably infinite set of possible functions, and how are we
going to compute with this set in finite time? A Gaussian process is a


generalization of a Gaussian random variable. So far we have seen random


variables which are scalars or vectors, similarly a stochastic process is a
random function.
Example 7.0.1. Consider a 1-d regression problem. In 7.1(a) we show a num-
ber of functions drawn at random from the prior distribution over functions
specified by a particular Gaussian process, which favours smooth functions.
This prior is taken to represent our prior beliefs over the kinds of functions
we expect to observe, before seeing any data.

Suppose now we see two datapoints (x1 , y1 ) and (x2 , y2 ). Then, we wish
to consider only functions which pass by those two points. In 7.1(b), we see
functions which are consistent with the observed data (dashed lines), and the
solid line depicts the mean of all functions consistent with those observations.
Notice how uncertainty is reduced close to the observations (this is because
we have the prior that the functions are smooth).

Figure 7.1: Panel (a) shows four samples drawn from the prior distribution.
Panel (b) shows the situation after two datapoints have been observed. The
mean prediction is shown as the solid line and four samples from the posterior
are shown as dashed lines. In both plots, the shaded region denotes twice
the standard deviation at each input value x.

7.1 Formal definition

Definition 7.1.1. Let X be a nonempty set, K : X × X → R be a positive
definite kernel and µ : X → R be any real-valued function. Then a random
function f : X → R is said to be a Gaussian Process (GP) with mean function
µ and covariance kernel K, denoted by GP(µ, K), if the following holds: for
any finite set X = (x1, · · · , xm) ⊂ X of any size m ∈ N, the random vector

f_X = (f(x1), · · · , f(xm))^T ∈ R^m

follows the multivariate normal distribution N(µ_X, K_XX) with covariance
matrix K_XX = (K(xi, xj))_{i,j=1}^m ∈ R^{m×m} and mean vector

µ_X = (µ(x1), · · · , µ(xm))^T ∈ R^m.


Remark 22. The positive definite kernel K is equivalent to the covariance
kernel/covariance function.
Remark 23. This definition implies that if f is a Gaussian process, then
there exists a mean function µ : X → R and a covariance kernel K : X ×
X → R. On the other hand, it is also true that for any positive definite
kernel K and mean function µ, there exists a corresponding Gaussian process
f ∼ GP (µ, K). There exists a one-to-one correspondence between Gaussian
processes f ∼ GP (µ, K) and pairs (µ, K) of mean function µ and positive
definite kernel K.
Remark 24. Since K is the covariance function of a Gaussian process, by
definition it can be written as
K(x, x′ ) = Ef ∼GP (µ,K) [(f (x) − µ(x))(f (x′ ) − µ(x′ ))] , x, x′ ∈ X ,
where the expectation is with respect to the random function f ∼ GP (µ, K).
Example 7.1.1. (A concrete GP) The most common choices for µ(x) and
K(x, x′ ) are µ(x) = 0 and K(x, x′ ) = exp(−(x − x′ )2 /2).

7.1.1 Kernel (covariance functions)

The (covariance function) kernel is a crucial ingredient in a Gaussian process


predictor, as it encodes our assumptions about the function which we wish
to learn. From a slightly different viewpoint it is clear that in supervised


learning the notion of similarity between data points is crucial; it is a basic
similarity assumption that points with inputs x which are close are likely to
have similar target values y, and thus training points that are near to a test
point should be more informative about the prediction at that point. Under
the Gaussian process view it is the covariance function that defines nearness
or similarity.

What is the effect of choosing a kernel to define the GP (µ, K)?

7.1.2 Squared exponential covariance function

The squared exponential (SE) covariance function has the form, for r =
x − x′, x, x′ ∈ X:

K_SE(r) = exp( −r² / (2ℓ²) ),

with parameter ℓ > 0 defining the characteristic length-scale, which modifies
the behaviour of the Gaussian Process (see figure 7.2).

Definition 7.1.2. Mean square derivative: A random process X_t is mean-
square differentiable at time t0 if there is a random variable X′_{t0} such that

lim_{t→t0} E[ ( X′_{t0} − (X_t − X_{t0})/(t − t0) )² ] = 0.

This covariance function is infinitely differentiable which means that the


GP with this covariance function has Mean Square derivatives of all orders,
and is thus very smooth. This can be seen as a disadvantage, however,
the squared exponential is probably the most widely-used kernel within the
kernel machines field.

Figure 7.2: Varying the hyperparameter ℓ regulates the influence of neigh-


bouring points.
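The effect of ℓ can be reproduced numerically by drawing samples from the prior GP(0, K_SE) on a grid (a NumPy sketch; plotting is omitted and a small jitter is added to the covariance for numerical stability):

import numpy as np

def k_se(r, ell):
    return np.exp(-r**2 / (2 * ell**2))

x = np.linspace(-5, 5, 200)
R = x[:, None] - x[None, :]                       # pairwise differences

rng = np.random.default_rng(9)
for ell in (0.3, 1.0, 3.0):
    K = k_se(R, ell) + 1e-10 * np.eye(len(x))     # covariance matrix K_XX plus jitter
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
    print(ell, samples.shape)                      # plotting the samples reproduces figure 7.2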

7.1.3 Matérn class of covariance functions

The Matérn class of covariance functions is given by:

K_Matern(r) = (2^{1−ν} / Γ(ν)) · (√(2ν) r / ℓ)^ν · K_ν(√(2ν) r / ℓ),

with positive parameters ν and ℓ, where Kν is a modified Bessel function [?].


To see the influence of ν in the resulting GP, refer to figure 7.3. For the
Matérn class the process f (x) is k-times mean square differentiable if and
only if ν > k.

Figure 7.3: Varying the hyperparameter ν regulates regularity of the func-


tions in the GP. Note if ν → ∞, we obtain the SE covariance function.

Both covariance functions obtained from the the SE and the Matérn ker-
nels are so called stationary covariance functions, as they are a function of
x − x′ , and thus, invariant to translations in the input space. The covariance
functions given above decay monotonically with r and are always positive.
However, this is not a necessary condition for a covariance function 1

7.1.4 Dot product covariance functions

The kernel
K(x, x′ ) = σ02 + x · x′ ,
constitutes a non-stationary covariance function.

7.2 Gaussian processes and kernel methods

In this short introduction to Gaussian processes, we have seen some familiar


objects that were already introduced in the chapter about kernel methods.
1
E.g. a valid covariance function can have the form of a damped oscillation.

Is this a coincidence, or is there some more obvious relation between GPs


and kernel methods?

Recall that the ridge regression adds a regularisation penalty (scaled by


λ) to the cost term:

(1/m) Σ_{i=1}^m (yi − f(xi))² + λ||f||²_H.

By Theorem 16, the solution for the optimisation above2 is

f (x) = y T (KXX + mλI)−1 kXx ,

with (KXX )i,j = k(xi , xj ), y = (y1 , ..., ym )T and kXx = (k(x1 , x), · · · , k(xm , x))T
the vector of inner products between the data and the new point x.

We will see that a prediction using kernel ridge regression is equivalent


to the mean prediction of a GP regression.

Recall that we discussed the Bayesian approach based on Theorem 3 for


estimating an unknown probability distribution in Section 2.8.2: we fix a
prior distribution encoding our beliefs, we observe data samples, we update
our beliefs accordingly to obtain the posterior distribution.

The Gaussian process regression (also known as Kriging) is a Bayesian


non parametric method for regression. Being a Bayesian approach, the GP-
regression produces a posterior distribution of the unknown regression func-
tion f , provided by the training data (X, Y ), a prior distribution Π0 on f
and a likelihood function denoted by ℓX,Y (f ).

More specifically, the prior Π0 is defined as a GP (µ, k) with mean function


µ and covariance kernel k.

Note: since this GP serves as a prior, the mean function µ and the kernel
k should be chosen so that they reflect one’s prior knowledge or belief about
the regression function f .
2
You can verify by considering the Lagrangian.

The likelihood function is defined by a probabilistic model p(yi |f (xi )) for


the noise variables ξ1 , · · · , ξm , since this determines the distributions of the
observations Y with the additive noise model:

yi = f (xi ) + ξi , i = 1, ..., m

It is common to assume the ξi are i.i.d. centered Gaussian random variables with
variance σ² > 0:

ξi ∼ N(0, σ²),  i = 1, ..., m.
Theorem 17. Assume the following

• yi = f (xi ) + ξi , i = 1, ..., m

• ξi ∼ N (0, σ 2 ), i = 1, ..., m.

• f ∼ GP (µ, k)

and let X = (x1, ..., xm) ∈ X^m and Y = (y1, ..., ym)^T ∈ R^m. Then, we have

f | Y ∼ GP(µ̄, k̄),

where µ̄ : X → R and k̄ : X × X → R are given by:

µ̄(x) = µ(x) + k_{xX} (K_XX + σ²I_m)^{-1} (Y − µ_X),   x ∈ X    (7.1)
k̄(x, x′) = k(x, x′) − k_{xX} (K_XX + σ²I_m)^{-1} k_{Xx′},   x, x′ ∈ X    (7.2)

with k_{Xx} = k_{xX}^T = (k(x1, x), · · · , k(xm, x))^T, and (K_XX)_{i,j} = k(xi, xj).

Using the theorem above, the following equivalence holds for GP-regression
and kernel ridge regression: We have µ̄ = fKRR if σ 2 = mλ, where

1. µ̄ is the posterior mean function of GP-regression based on (X, Y ), the


GP prior f ∼ GP (0, k) and the modelling assumption where the noise
is i.i.d. N (0, σ 2 )

2. fKRR is the solution to kernel ridge regression based on (X, Y ), the


RKHS H and regularisation constant λ > 0.

Remark 25. One of the disadvantages of Gaussian processes is the fact that
the covariance matrix size will scale with relation m2 for dataset of size m,
and furthermore, to invert the covariance matrix takes approximately O(m3 )
operations.3

3
Current best asymptotic complexity is O(m2 .376) [?].
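A minimal sketch of GP regression following (7.1)–(7.2) (NumPy assumed; a zero prior mean and a squared exponential kernel are assumed, and the inverse is formed explicitly only for readability; in practice one would use a Cholesky solve, in line with Remark 25):

import numpy as np

def k_se(A, B, ell=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(10)
X = rng.uniform(-3, 3, size=8)                     # training inputs
Y = np.sin(X) + 0.1 * rng.normal(size=8)           # noisy observations
sigma2 = 0.01

Xs = np.linspace(-3, 3, 100)                       # test inputs
K_XX, K_sX, K_ss = k_se(X, X), k_se(Xs, X), k_se(Xs, Xs)

A = np.linalg.inv(K_XX + sigma2 * np.eye(len(X)))  # (K_XX + sigma^2 I)^{-1}
mu_post = K_sX @ A @ Y                             # posterior mean, eq. (7.1) with mu = 0
cov_post = K_ss - K_sX @ A @ K_sX.T                # posterior covariance, eq. (7.2)
print(mu_post.shape, cov_post.shape)               # (100,) (100, 100)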
Chapter 8

Deep learning

Contents
8.1 Fully connected dense neural networks
8.2 Back Propagation
8.3 Approximation Theorems
8.4 * Infinitely wide neural networks
8.5 Beyond feed forward neural networks
8.6 Tricks of the trade

In recent years, Machine Learning has often become identified with Neural Networks or Deep Learning, so we could not leave this class of models out of our exposition about Machine Learning.

One prototypical example often used to motivate the need for neural networks is the XOR function (Figure 8.1), a non-linear function that a simple neural network can approximate.

In this chapter, we will start with the simplest type of deep neural net-
works: the fully connected dense neural networks (section 8.1). We will see
that some notions of learning theory that we’ve encountered will be chal-
lenged, and new mathematical theory is necessary to understand why neural
networks seem to work so well in practice.


Figure 8.1: XOR function

Furthermore, neural networks are (by far) the models presented in these lecture notes that require the most care when being set up and trained. You will see that there are more hyper-parameters, engineering choices and ways to successfully or unsuccessfully train a neural network. We consolidate some practical advice, tricks of the trade, in section 8.6.

8.1 Fully connected dense neural networks

8.1.1 Definitions

Definition 8.1.1. Let us introduce the function (a unit):

a = σ(wT x + b),
where x ∈ Rn are inputs, w ∈ Rn weights, b ∈ R the bias and σ : R → R an
activation function.

We can write different models we have studied by considering different activation functions:

• linear regression: σ(z) = z

• binary classification: σ(z) = sign(z)

• logistic regression: σ(z) = 1/(1 + e^{−z})

A neural network is a combination of a lot of these units. For example, a


1-layer neural network f : Rd → Rd1 is given by:
f(x) = σ_1(W_1 x + b_1)

for x ∈ R^d, W_1 a matrix in R^{d_1×d} and b_1 ∈ R^{d_1}.

What is f(x) doing? It takes a vector x and applies an affine transformation to it (given by W_1 and b_1), which returns a vector of dimension d_1; σ_1 is then applied to this vector (typically component-wise).

A 2-layer neural network f : Rd → Rd2 is given as:


f (x) = σ2 (W2 σ1 (W1 x + b1 ) + b2 ),
where W2 is a matrix ∈ Rd2 ×d1 and b2 ∈ Rd2 .

And for a general L-layer neural network f : Rd → RdL we can write the
definition:
Definition 8.1.2. A fully connected feedforward neural network is given by
its architecture, namely, by hyper-parameters:

• L ∈ N the number of layers


• σ_i : R → R, activation functions (i = 1, ..., L)
• d0 , ..., dL , specifying the number of neurons in the input, output and
l-th hidden layer.

and it has the form:


f (x) = σL (WL σL−1 (· · · (W2 σ1 (W1 x + b1 ) + b2 ) · · · ) + bL ) ,
where the σ_i's are applied component-wise, that is, for z = (z_1, ..., z_{d_i})^T ∈ R^{d_i} we write σ_i(z) = (σ_i(z_1), ..., σ_i(z_{d_i}))^T. Alternatively, one can write recursively

x(0) = x,
x(k+1) = σk+1 (Wk+1 x(k) + bk+1 ),
and f (x) = x(L) .
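As an illustration of the recursive definition above, here is a minimal NumPy sketch of the forward pass; the layer widths and the choice of tanh activations (with the identity on the output layer) are arbitrary assumptions made for the example.

import numpy as np

def forward(x, weights, biases, activations):
    """Compute x^(L) = sigma_L(W_L ... sigma_1(W_1 x + b_1) ... + b_L)."""
    out = x
    for W, b, sigma in zip(weights, biases, activations):
        out = sigma(W @ out + b)   # x^(k+1) = sigma_{k+1}(W_{k+1} x^(k) + b_{k+1})
    return out

rng = np.random.default_rng(0)
dims = [3, 5, 4, 2]                # d_0, d_1, d_2, d_3
weights = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(len(dims) - 1)]
biases = [rng.normal(size=dims[k + 1]) for k in range(len(dims) - 1)]
activations = [np.tanh, np.tanh, lambda z: z]   # identity activation on the last layer

x = rng.normal(size=dims[0])
print(forward(x, weights, biases, activations))  # a vector in R^{d_L}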

A few properties of this definition are:

• One can both do classification and regression, depending on the acti-


vation function.

• Feature selection is “built-in”: if σL is the identity, then the neural


network output is WL x(L−1) + bL , which is a linear combination of the
penultimate layer’s output.

• The number of degrees of freedom of a neural network is given by:

dof = ∑_{l=1}^{L} (d_l d_{l−1} + d_l)

Remark 26. There are some engineering choices: L, σi , di . This fixes the
number of degrees of freedom we have to find for Wi , bi .
Example 8.1.1. What if all the σ_i are the identity function? Then

W_L(· · · (W_1 x + b_1) · · · ) + b_L = (W_L · · · W_1) x + (b_L + W_L b_{L−1} + W_L W_{L−1} b_{L−2} + · · · + W_L · · · W_2 b_1).

This is just a linear predictor.

8.1.2 Loss functions

Similarly to other supervised learning models we have seen, neural networks


are trained in the ERM / Regularised ERM framework.

When considering regression-type problems, a typical loss function is the Mean Squared Error (or the Mean Absolute Error). For a neural network f_w, we get

C(w) = L(f_w) = (1/2m) ∑_{i=1}^m ‖f_w(x_i) − y_i‖²_2 .

From the definition of fully-connected neural networks, we see that (as soon
as there is a non-linear activation) the map w 7→ fw is non-linear. As a

consequence, even when the chosen loss L on the function’s space is convex,
the cost C on the parameter’s space is not. In particular, for neural networks,
there is no guarantee that local minima are global minima, which makes the
success of training through gradient descent not obvious.

For a classification task, as before, we do not want to optimise over non-differentiable (or hard to differentiate) functions. So instead of the 0−1 loss ℓ(y, y′) = 1_{y ≠ y′}, we consider a surrogate loss, similar to what we did with the perceptron.

Consider a fully connected neural network f_w(x) = W_L x^{(L−1)} + b_L. To perform binary classification, one can choose the predictor to be sign(f_w(x)). A commonly used surrogate loss is the logistic loss, given by:

ℓ(f_w(x), y) = log(1 + exp(−y f_w(x))).

Why? If y = 1 and f_w(x) > 0 (or y = −1 and f_w(x) < 0), the value exp(−y f_w(x)) is small, and then we have ℓ(f_w(x), y) ≈ 0. Otherwise, in case of misclassification, we have log(1 + e^{−y f_w(x)}) > log 2.

Another common way to do binary classification is to choose d_{L−1} = 1 and σ_L(z) = sigmoid(z) = 1/(1 + exp(−z)), which returns a number in [0, 1] that can be interpreted as a probability.

Consider the two classes 0 and 1. To predict the class of an input x, one can then choose 0 if f_w(x) ≤ 0.5 and 1 otherwise. The loss function to minimize can be the cross-entropy loss, defined by

L(f_w) = −(1/m) ∑_{i=1}^m ( y_i log f_w(x_i) + (1 − y_i) log(1 − f_w(x_i)) ).

Note that if y_i = 0, then we want f_w(x_i) to be close to 0, and if y_i = 1, we want f_w(x_i) to be close to 1. Note that a perfect classifier achieves 0 loss.

Multi-class classification (i.e. a datapoint x can have a label in {0, ..., n − 1}) can easily be performed by considering d_{L−1} = n and σ_L = softmax, which returns a probability distribution on a finite set. For d_{L−1} = n we get the output vector:

f_w(x) = ( 1 / ∑_{i=1}^n exp(x_i^{(L−1)}) ) · ( exp(x_1^{(L−1)}), · · · , exp(x_n^{(L−1)}) )^T .

Then, the loss function is the categorical cross-entropy:

L(f_w) = −(1/m) ∑_{i=1}^m ∑_{j=1}^n ( y_{i,j} log f_w(x_i)_j + (1 − y_{i,j}) log(1 − f_w(x_i)_j) ),

where (with a slight abuse of notation) we write, for the i-th datapoint,
yi = (yi,1 , · · · , yi,n ). Observe that datapoint xi can only belong to one class,
so yi is zero everywhere, except on the class that xi belongs to. Again, a
perfect classifier achieves 0 loss.

The non-convexity issues we talked about for regression are similar for
classification. The softmax map and the cross-entropy loss can be generalised
to tackle multi-class classification.
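The following is a small NumPy sketch of the softmax output and of a categorical cross-entropy of the form −(1/m) ∑_i ∑_j y_{i,j} log f_w(x_i)_j (the most common variant, which keeps only the y log f term of the loss above); the logits and labels are made up for illustration.

import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability; this does not change the result
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(probs, y_onehot, eps=1e-12):
    # L = -(1/m) sum_i sum_j y_ij log p_ij ; zero for a perfect classifier
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])   # penultimate-layer outputs, n = 3 classes
y = np.array([[1, 0, 0],
              [0, 0, 1]])              # one-hot labels: each datapoint belongs to one class

p = softmax(logits)
print(p.sum(axis=1))                   # each row sums to 1, a probability on {0, 1, 2}
print(categorical_cross_entropy(p, y))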

8.2 Back Propagation

8.2.1 Definition

Note that, with gradient-based optimisation methods, we must find a way to update each parameter of the model. We want an efficient way to write the updates for each degree of freedom we have in the neural network.

The back-propagation algorithm can be thought of as a table-filling algo-


rithm that takes advantage of storing intermediate results.
Example 8.2.1. Note that our cost function, for a generic L-layer neural network, is given as:

C(w) = L(f_w) = ∑_{i=1}^m ℓ(y_i, f_w(x_i))
             = ∑_{i=1}^m ℓ( y_i, σ_L(W_L σ_{L−1}(· · · (W_2 σ_1(W_1 x_i + b_1) + b_2) · · · ) + b_L) ).

For a weight w_{ij}^L in the weight matrix W_L, for example, we want to compute ∂_{w_{ij}^L} C(w). Through the chain rule, this is given by:

∂_{w_{ij}^L} C(w) = (∂z_L / ∂w_{ij}^L) (∂σ_L / ∂z_L) (∂C(w) / ∂σ_L),

with z_L = W_L σ_{L−1}, so ∂z_L / ∂w_{ij}^L = σ_{L−1,j}.

For a weight w_{ij}^{L−1}, we have

∂_{w_{ij}^{L−1}} C(w) = (∂z_{L−1} / ∂w_{ij}^{L−1}) (∂σ_{L−1} / ∂z_{L−1}) (∂z_L / ∂σ_{L−1}) (∂σ_L / ∂z_L) (∂C(w) / ∂σ_L),

with z_{L−1} = W_{L−1} σ_{L−2}.

Note that some terms of the derivative have already been computed when we wrote the update for w_{ij}^L.

We can write a neural network recursively as:

Z_n = W_n X_{n−1},
X_n = σ_n(Z_n),

where Z_n denotes the vector of the z_i's and W_n the matrix of weights w_{ij}. Furthermore, X_0 denotes the input vector.

Suppose we have computed ∂C/∂X_{n,i} for i = 1, ..., d_n (the width of layer n).

Then we can compute recursively the following gradients:

∂C/∂z_{n,i} = σ_n′(z_{n,i}) · ∂C/∂x_{n,i},
∂C/∂w_{ij}^n = (∂C/∂z_{n,i}) · x_{n−1,j},
∂C/∂x_{n−1,j} = ∑_i w_{ij}^n · ∂C/∂z_{n,i}.

This can also be written in matrix-vector notation, namely,

∂C/∂Z_n = σ_n′(Z_n) ∘ ∂C/∂X_n,
∂C/∂W_n = (∂C/∂Z_n) (X_{n−1})^T,
∂C/∂X_{n−1} = W_n^T ∂C/∂Z_n,

where σ_n′(Z_n) denotes the vector of the activation function's derivative evaluated component-wise at Z_n, and ∘ denotes the Hadamard product between two vectors.¹

To make things clearer for the next exposition, we can write

∂C/∂W_n = (∂C/∂Z_n) (X_{n−1})^T
        = ( σ_n′(Z_n) ∘ ∂C/∂X_n ) (X_{n−1})^T
        = ( σ_n′(Z_n) ∘ W_{n+1}^T ∂C/∂Z_{n+1} ) (X_{n−1})^T
        = ( σ_n′(Z_n) ∘ W_{n+1}^T ( σ_{n+1}′(Z_{n+1}) ∘ W_{n+2}^T ( · · · σ_L′(Z_L) ∘ ∂C/∂X_L ) ) ) (X_{n−1})^T .
¹ For a = (a_1, ..., a_n) and b = (b_1, ..., b_n), a ∘ b = (a_1 b_1, ..., a_n b_n).

We call the term inside the parenthesis

δ^n := σ_n′(Z_n) ∘ W_{n+1}^T ( σ_{n+1}′(Z_{n+1}) ∘ W_{n+2}^T ( · · · σ_L′(Z_L) ∘ ∂C/∂X_L ) ),

and then the gradient reads:

∂C/∂W_n = δ^n (X_{n−1})^T .   (8.1)

We can compute the δ^n recursively, by

δ^L = σ_L′(Z_L) ∘ ∂C/∂X_L,   (8.2)
δ^{n−1} = σ_{n−1}′(Z_{n−1}) ∘ W_n^T δ^n,   n = 2, ..., L.   (8.3)
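To illustrate the recursion (8.1)–(8.3), here is a minimal NumPy sketch of backpropagation for the squared loss on a single example, assuming tanh activations and no biases for brevity; it is an illustrative implementation, not reference code from the notes.

import numpy as np

def backprop(x, y, Ws, sigma=np.tanh, dsigma=lambda z: 1 - np.tanh(z) ** 2):
    # Forward pass, storing the pre-activations Z_n and the activations X_n.
    Xs, Zs = [x], []
    for W in Ws:
        Zs.append(W @ Xs[-1])
        Xs.append(sigma(Zs[-1]))
    # dC/dX_L for the squared loss C = 0.5 * ||X_L - y||^2.
    dC_dX = Xs[-1] - y
    grads = [None] * len(Ws)
    # delta^L = sigma'(Z_L) o dC/dX_L, then recurse backwards through the layers.
    for n in reversed(range(len(Ws))):
        delta = dsigma(Zs[n]) * dC_dX           # eqs. (8.2)/(8.3)
        grads[n] = np.outer(delta, Xs[n])       # dC/dW_n = delta^n (X_{n-1})^T, eq. (8.1)
        dC_dX = Ws[n].T @ delta                 # dC/dX_{n-1} = W_n^T delta^n
    return grads

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
g = backprop(rng.normal(size=3), rng.normal(size=2), Ws)
print([gi.shape for gi in g])   # [(4, 3), (2, 4)], matching the weight matrices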

8.2.2 Exploding and vanishing gradients

Neural networks are usually trained with gradient-based algorithms, the gradient being computed with backpropagation. Since they are compositions of functions, this may cause gradients to explode or vanish at early layers. Let us first look at the chain rule (for simplicity in the scalar case w ∈ R) for the derivative of a map f_L(w) = g_L(g_{L−1}(· · · g_1(w)) · · · ): let f_ℓ(w) = g_ℓ(g_{ℓ−1}(· · · g_1(w)) · · · ) for all ℓ = 1, · · · , L and f_0(w) ≡ w. We have

f_L′(w) = g_L′(f_{L−1}(w)) f_{L−1}′(w) = · · · = ∏_{i=0}^{L−1} g_{L−i}′(f_{L−i−1}(w)).

We see that the derivative of the composition of L maps is a product of L derivatives. In particular, if each of them is of order, say, a ∈ (0, ∞), then f_L′(w) is of order a^L. For large L, if a < 1, then we get a very small gradient, whereas if a > 1, it can grow very large. Gradients that are too small or too large hinder training, for similar reasons as those discussed for the learning rate in Section 3.5, see Figure 3.3. Loosely speaking, with small gradients, training gets stuck and takes too long to converge; with large gradients, training jumps over minima and gets away from good solutions.

Recall that the backpropagation update at layer n ∈ {1, . . . , L} in a fully


connected neural network is given in Equation (8.1) in terms of δ n , which
is recursively defined from the last layer to the previous ones in (8.2). In
particular, the further we go backwards, the more terms in the product,
which, as we said, can be the source of training instability.

Note that since the derivatives of the activation functions σℓ ’s appear in


the product (8.2), the phenomena of exploding and vanishing gradients are
linked to the choices of these functions. For example, the sigmoid function
σ : x 7→ (1 + e−x )−1 has a vanishing derivative away from some interval
centered at 0. On the other hand, the derivative of a polynomial activation
function of degree at least 2 explodes far enough from 0.

Can we even train then? Fortunately there are heuristics that lead to
practical techniques allowing stable training. We mention two of them in the
forthcoming section 8.6.2.

Specific weights initialisation schemes are based on heuristics aiming at


stabilising the gradients.

8.2.3 Common initialization schemes

The way a neural network's weights are initialized prior to training has a crucial effect on the success of training. Indeed, suppose that the weights are all initialized to zero. Then, the updates for the weights are given as:

∂C/∂W_n = δ^n (X_{n−1})^T = 0,

as δ^n is a zero vector, meaning that the weights will not change during training.

More generally, if the weights are all initialized to a constant value c, then the update yields:

∂C/∂W_n = δ^n (X_{n−1})^T = δ^n σ_{n−1}( c (1^T X_{n−2}) 1 )^T = α⃗ β⃗^T = c_n 1_{d_n × d_{n−1}},

where α⃗ and β⃗ are two constant vectors (all of their components are equal). In particular, all weights at the same layer n receive the same update c_n ∈ R.

In general, it is best to initialize the weights at random (independent)


values, thus avoiding this problem. Commonly used initialization schemes
were born from heuristics to avoid the problems of exploding and vanishing
gradients we discussed before.

Intuition for choosing the variance of the weights at initialization.


Initialize the biases at 0. From (8.1) and (8.2), in order for the magnitude of
the gradient to be of the same order across all layers, we want the activations
Xn (or the preactivations Zn ) to have the same mean and the same variance
at every layer. Let’s see how to ensure it at initialization.

For simplicity, we assume:

• components of Wn are i.i.d. with mean 0 and variance a (with some


unspecified distribution for now)

• activation function σ is tanh.

Let's write the variance of X_n in terms of that of X_{n−1}. We have

Var[X_n] = Var[σ_n(W_n X_{n−1})] = Var[ σ_n( ∑_{j=1}^{d_{n−1}} w_{ij}^n x_{n−1,j} ) ]_{i=1,...,d_n}.

Fix i ∈ {1, . . . , d_n}. By independence of the x_{n−1,j} and the w_{ij}^n, the mean of ∑_{j=1}^{d_{n−1}} w_{ij}^n x_{n−1,j} is 0, and if that sum is close enough to its mean, then we (roughly speaking)

have the first order Taylor approximation

σ_n( ∑_{j=1}^{d_{n−1}} w_{ij}^n x_{n−1,j} ) ≈ σ_n′(0) ∑_{j=1}^{d_{n−1}} w_{ij}^n x_{n−1,j}.

This is, of course, very informal, but we need not make a more formal claim for understanding the intuition. Now we get that

Var[X_n] ≈ σ_n′(0)² Var[ ∑_{j=1}^{d_{n−1}} w_{ij}^n x_{n−1,j} ]_{i=1,...,d_n}
        = σ_n′(0)² ( ∑_{j=1}^{d_{n−1}} Var[ w_{ij}^n x_{n−1,j} ] )_{i=1,...,d_n},   (8.4)

where we used the independence of the w_{ij}^n x_{n−1,j} for distinct j to take the sum out of the variance. Note that the w_{ij}^n's and the x_{n−1,j}'s are independent, since x_{n−1,j} only depends on the inputs and the weights of the previous layers, which, at initialization, are assumed to be independent from those of layer n.
Note that for two independent random variables U, V, it holds that

Var(UV) = E[U²]E[V²] − E[U]²E[V]²
        = (E[U²] − E[U]²)(E[V²] − E[V]²) + E[U]²E[V²] + E[U²]E[V]² − 2E[U]²E[V]²
        = Var(U)Var(V) + Var(U)E[V]² + Var(V)E[U]².
If moreover U and V have mean 0, then Var(U V ) = Var(U )Var(V ). Assume
that the inputs are centered, then xn−1,j is centered too. Hence, coming back
to (8.4), we have "shown" that

Var[X_n] ≈ σ_n′(0)² Var[W_n] Var[X_{n−1}],

where Var[W_n] denotes the d_n × d_{n−1} matrix of entrywise variances. The i-th component of the vector Var[W_n] Var[X_{n−1}] is a sum of d_{n−1} terms Var[w_{ij}^n] Var[x_{n−1,j}]. One can choose Var[w_{ij}^n] = 1/d_{n−1}, so that, provided that σ_n′(0) = 1 (e.g. for the tanh activation function), the variance of X_n is constant across layers.
Remark 27. This heuristic is valid only for the first step of gradient descent.
After that, weights are, for example, no longer independent.
Remark 28. Similar derivation can be done for other activation functions, but
some terms don’t cancel out so nicely, and this will change the initialisation.

Some common initializations. The following is a list of commonly used initialization schemes. Biases are initialized at 0, and all weights are mutually independent:

• Xavier: w_{ij}^n ∼ Unif(0, 1/√d_{n−1}). This is heuristically justified as above for the tanh activation function.

• Le Cun: w_{ij}^n ∼ N(0, 1/d_{n−1}).

• He: w_{ij}^n ∼ N(0, 2/d_{n−1}). It can be motivated similarly as above for the ReLU activation.

Other initialization schemes, such as Glorot initialization, scale the variance


by dn−1 + dn .
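A small NumPy sketch of such schemes follows; the symmetric uniform interval used for the "Xavier" variant below (chosen so that the variance is 1/d_{n−1}) is an assumption of this sketch, as conventions differ between references.

import numpy as np

rng = np.random.default_rng(0)

def init_layer(d_in, d_out, scheme="le_cun"):
    """Return (W, b) for a layer R^{d_in} -> R^{d_out}; biases start at 0."""
    if scheme == "le_cun":                 # w ~ N(0, 1/d_in)
        W = rng.normal(0.0, np.sqrt(1.0 / d_in), size=(d_out, d_in))
    elif scheme == "he":                   # w ~ N(0, 2/d_in), suited to ReLU
        W = rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))
    elif scheme == "xavier_uniform":       # symmetric uniform with variance 1/d_in
        limit = np.sqrt(3.0 / d_in)
        W = rng.uniform(-limit, limit, size=(d_out, d_in))
    else:
        raise ValueError(scheme)
    return W, np.zeros(d_out)

W, b = init_layer(256, 128, "he")
print(W.var())   # close to 2/256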

Active research provides insights on the effect of the variance at initial-


ization on the training dynamics and generalization of neural networks; more
on this in the forthcoming section 8.4.

8.3 Approximation Theorems

In this section we give a brief introduction to some of the approximation results for neural networks. This is a very active research area, but we will focus on one fundamental result which is often cited in talks and in the literature as one of the first of its kind. [5] has a good introduction, focusing also on the necessary tools from analysis to understand the proofs. This section is not examinable; it is a very brief introduction to modern advances in the theory of neural networks.

The theorem below, by Cybenko (1989) [3] is often called the universal
approximation theorem for neural networks. We state a weaker version of
the theorem proved in the linked paper 2 .
2
In the original paper, the theorem needs only σ to be a discriminatory function.

Theorem 18. Let σ be the sigmoid function, that is σ(x) = 1/(1 + e^{−x}). Then finite sums of the form

G(x) = ∑_{j=1}^{N} α_j σ(w_j^T x + b_j)

are dense in C(I_n) with respect to the supremum norm, where C(I_n) is the space of continuous functions on I_n and I_n denotes the n-dimensional unit cube [0, 1]^n. Meaning, given any f ∈ C(I_n) and ϵ > 0, there is a sum G(x) of the above form for which

|G(x) − f(x)| < ϵ   ∀x ∈ I_n.

⋆ Remark 12. The following proof requires some familiarity with functional
analysis.

Proof. Assume without proof that the sigmoid has the following property (called the discriminatory property): let µ be a finite regular signed Borel measure; if

∫_{I_n} σ(w^T x + θ) dµ(x) = 0   ∀(w, θ) ∈ R^n × R,

then µ = 0.

Let S = { f(x) = ∑_{j=1}^{N} α_j σ(w_j^T x + b_j) : N ∈ N, α_j, b_j ∈ R, w_j ∈ R^n, j = 1, ..., N }.

This is a linear subspace of C(I_n). If the closure of S satisfies S̄ = C(I_n), we are done.

Assume that S̄ ≠ C(I_n). By the Hahn-Banach theorem, there exists a bounded linear form L ≠ 0 on C(I_n) such that L(S̄) = 0.

By the Riesz representation theorem, there exists a signed regular Borel measure µ that is nonzero (as L ≠ 0) such that:

L(h) = ∫_{I_n} h(x) dµ(x)   ∀h ∈ C(I_n).

Taking h ∈ S, we have L(h) = 0 which implies µ = 0 because of the discrim-


inatory property, which yields a contradiction.

What the above theorem tells us is that neural networks can approximate
any arbitrary continuous function, similar to the Stone-Weierstrass theorem
for polynomials. It is worth noting that this fact is true for deeper networks
as well: conditioning on the output of layer L − 2 and considering it as the
input layer, the layers L − 2, L − 1 and L can be seen as a two-layer neural
network and one can directly apply the above theorem.

A similar result was proved by Kurt Hornik (1991) [7]. The result above also extends to classification tasks, and the same result holds for more general sigmoidal functions and for the ReLU function.³

8.4 * Infinitely wide neural networks

Neural networks are often used heavily overparametrized. As a consequence,


for a given architecture, there may be many neural networks that perfectly
fit a given dataset. Yet, neural networks trained with (variations of) gradient
descent sometimes seem to generalize well or at least, better than expected,
for example when compared to overparametrised linear regression.

Constructing a general theory to explain their success seems out of reach.


Indeed, we will see in this section that different training regimes occur by
only changing the scale of the weights’ initialization.

The content of this section concerns recent development on the theory of


neural networks. It is far from exhaustive and its goal is to briefly present
two successful approaches attempting to explain the surprising success of
(overparametrized) neural networks.

8.4.1 Initialization scale

To simplify the notation, consider a neural network f_w with scalar input and output. Suppose moreover that it has a single hidden layer of width d ∈ N and the identity activation function between the hidden and output layer,
3
Similar type of statement as Weierstrass approximation theorem.

that is, for an input x ∈ R, the network predicts

f_w(x) := (1/d^γ) w_2 σ(w_1 x + b),

where w_1, b ∈ R^{d×1}, w_2 ∈ R^{1×d} are the learnable parameters of the network, and γ > 0 is a hyperparameter. We assume that the components of w_1, w_2 and b are i.i.d. standard Gaussian variables N(0, 1) at initialization, so that γ governs the initialization scale (recall that if X ∼ N(0, 1), then aX ∼ N(0, a²) for a ∈ R).

Instead of the above equation, we will prefer to (equivalently) write the network as the sum of its hidden neurons, that is

f_w(x) := (1/d^γ) ∑_{k=1}^{d} w_{2,k} σ(w_{1,k} x + b_k).   (8.5)

In the Bayesian learning paradigm, the prior distribution of a model (ini-


tialization) is crucial as it is supposed to encode our prior beliefs and radically
influences the learning (training). It is not clear that neural networks trained
by gradient descent perform Bayesian learning (in fact, they do not, see the
discussion at the end of Section 8.4.2). Nonetheless, Neal in [9] observed
the following result for 2-layer neural networks (later generalized to deeper
networks):
Theorem 19. Suppose that fw is a neural network at initialization defined
as in (8.5) with γ = 1/2. Suppose moreover that σ is a Lipschitz map. Then
it holds that the law of fw converges as d → ∞ towards GP (0, K (2) ), where
the kernel K (2) is given by

K (2) (x, x′ ) := Eg∼GP (0,K (1) ) [σ(g(x))σ(g(x′ ))]


where K (1) (x, x′ ) := xT x′ + 1.

Sketch of proof. The main idea is to use the central limit theorem 2. Indeed,
one notes that in (8.5), the output fw (x) corresponds to a sum of d i.i.d.
variables and the convergence to a Gaussian distribution follows from the
central limit theorem. (One needs to make sure that σ(X) has a second
moment for X a Gaussian variable, which is the case since σ is Lipschitz.)

For the covariance, we first observe that x ↦ w_1 x + b is a Gaussian process with covariance kernel K^{(1)} as in the statement. Then, one conditions on the values of the hidden layer {x_k^{(1)}; k = 1, . . . , d}, and checks that f_w(x) given x^{(1)} is a Gaussian process (as a linear combination of independent Gaussian processes) with covariance kernel (1/d) ∑_{k=1}^{d} x_k^{(1)} (x′)_k^{(1)}. One gets a limit as d → ∞ that does not depend on x^{(1)} nor (x′)^{(1)}, and getting rid of the conditioning yields the claim.

Knowing that infinitely wide neural networks with the above initialization
are Gaussian processes is informative about the prior knowledge we inject in
our model, but does not provide any insight on what happens during training.
In the next two sections, we present two recent techniques that will allow us
to say more about training.

8.4.2 The Neural Tangent Kernel

In this section, we consider a neural network defined as in (8.5) with γ = 1/2.


Since we study training, we denote by wt the network’s parameters at time
t of training.

From Theorem 19, we know that the network at initialization f_{w_0} is a Gaussian process with explicit kernel. Recall that gradient descent steps are given by

w_{t+1} = w_t − η (1/m) ∑_{i=1}^m ∂_ŷ ℓ(y_i, ŷ)|_{ŷ=f_{w_t}(x_i)} ∇_w f_{w_t}(x_i),   (8.6)

where η > 0 is the learning rate. We are interested in the dynamics of


training, meaning that we would like to know more about how fwt evolves
in time (and in particular what it looks like at convergence, when t → ∞
(provided that it converges)). We do a first-order Taylor expansion of fwt+1
to read:

fwt+1 (x) ≈ fwt (x) + ∂t wt · ∇w fwt (x)

Approximating ∂_t w_t with a finite difference, we can replace it with (8.6) and get:

f_{w_{t+1}}(x) − f_{w_t}(x) ≈ −η (1/m) ∑_{i=1}^m ∂_ŷ ℓ(y_i, ŷ)|_{ŷ=f_{w_t}(x_i)} ∇_w f_{w_t}(x_i) · ∇_w f_{w_t}(x).   (8.7)
It turns out that the dot product in the right-hand side defines a time-dependent kernel

Θ_t^{(d)}(x, x′) := ∇_w f_{w_t}(x) · ∇_w f_{w_t}(x′),

where we made the dependency on the width d explicit. The kernel Θ_t^{(d)} is called the Neural Tangent Kernel (NTK) of the neural network at time t. Because the initialization is random, the NTK at initialization Θ_0^{(d)} is itself a random kernel. Moreover, the fact that its dynamics in time depends on the training makes the exact dynamics of the network f_{w_t} intractable.

However, there is a very important result from Jacot et al [8] then gen-
eralized in many ways (e.g. in Yang [13]) that gives us what we want:
Theorem 20. There exists a deterministic kernel Θ^{(∞)} : R × R → R such that, in the setup above, for σ Lipschitz, twice differentiable with bounded second derivative, for all x, x′ ∈ R and t ∈ R_+, it holds with probability 1 that

lim_{d→∞} |Θ_t^{(d)}(x, x′) − Θ^{(∞)}(x, x′)| = 0.

Furthermore, the limiting NTK has the following expression

Θ(∞) (x, x′ ) = K (1) (x, x′ )K̇ (2) (x, x′ ) + K (2) (x, x′ ),

where K (1) and K (2) are the kernels defined in Theorem 19, and K̇ (2) is
defined as K (2) with the derivative σ ′ instead of σ in its definition.
Remark 29. The assumption that the neural network has a single hidden
layer is superfluous and we make it solely for the sake of presentation.
⋆ Remark 13. One can show that Θ(∞) is positive semi-definite when the
input data {xi : i = 1, . . . , m} lie on the sphere. (In our case – scalar inputs
– it does not make much sense, however it does in the general d-dimensional
case.)

⋆ Remark 14. The assumptions on σ can be greatly relaxed, as it is


enough to assume that the second order weak derivative of σ is polynomially
bounded. (It is the case, for example, of the ReLU activation.)

Many consequences can be drawn from Theorem 20; we now loosely dis-
cuss the most straightforward.

Training follows kernel gradient descent. Ignoring the second order term in Equation (8.7) and considering the limit as d → ∞, the NTK is fixed and training has the following dynamics:

f_{w_{t+1}}(x) − f_{w_t}(x) = −η (1/m) ∑_{i=1}^m ∂_ŷ ℓ(y_i, ŷ)|_{ŷ=f_{w_t}(x_i)} Θ^{(∞)}(x_i, x).

Convergence. Suppose that the loss is the mean square loss ℓ(y, y′) = (1/2)(y − y′)². For a vanishing learning rate η, and as the number of steps of gradient descent tends to ∞ as 1/η,⁴ the training trajectory becomes

∂_t f_{w_t}(x) = −(1/m) ∑_{j=1}^m Θ^{(∞)}(x, x_j) ( f_{w_t}(x_j) − y_j ).

Suppose that Θ^{(∞)} is positive semi-definite. This differential equation has an explicit solution and, in particular, it can be shown that f_{w_t} converges as t → ∞ to

f_{w_∞}(x) = f_{w_0}(x) − ∑_{i,j=1}^m Θ^{(∞)}(x, x_i) Θ^{(∞),−1}(x_i, x_j) ( f_{w_0}(x_j) − y_j ),   (8.8)

where Θ^{(∞),−1} denotes the inverse of the Gram matrix (Θ^{(∞)}(x_i, x_j))_{i,j=1,...,m}. We see that choosing x in the dataset, i.e. equal to some x_i, then f_{w_∞}(x_i) = y_i, so the network perfectly fits the data and the empirical loss is zero.
4
This corresponds to gradient flow as discussed in the optional section 3.5.2, which can
be seen as the continuous version of gradient descent.

Gaussian process. Since fw0 is a Gaussian process by Theorem 19 and


since we apply a linear map to the Gaussian vector (fw0 (xj ))j=1,...,m in the
expression of fw∞ in (8.8), the neural network at convergence fw∞ is itself a
Gaussian process.

It is worth noting that this Gaussian process does not correspond to the
Bayesian posterior given the prior fw0 unless one subtracts the initial output
function fw0 ; see e.g. [6] for more details on this.

Some limitations. The NTK regime (also called kernel regime, or lazy regime) does not fully describe the behavior of neural networks, for several reasons. The first one is that neural networks used in practice do not contain an infinite number of neurons... but we won't be too concerned by that objection. Another criticism of the NTK regime is that the kernel is fixed, as a result of the fact that individual weights asymptotically don't move during training: the map f_{w_t} evolves during training as a result of infinitely many infinitesimal changes in the weights of the neurons. Loosely speaking, this causes the network to not learn any features in the data, akin to a kernel method with a given kernel, which fits the data as a linear combination of the kernel's feature map, without trying to learn these features at any point. However, feature learning can be a crucial characteristic of successful models in practice; see e.g. [4] for a language representation model using pre-training to learn important features of the data.

8.4.3 Mean Field regime

In this section we stay very informal to present another training regime where
theoretical guarantees can be obtained.

We now assume that our 2-layer neural network is initialized as in (8.5)


with γ = 1. Compared to the previous section where we supposed γ = 1/2,
this initialization is small.

Let µ_d = µ_d(w_1, w_2, b) be the measure on the weight space consisting of the average of Dirac masses at the neurons' weights, that is

µ_d = (1/d) ∑_{k=1}^{d} δ_{(w_{1,k}, w_{2,k}, b_k)}.

Equation (8.5) with γ = 1 can now be rewritten in the integral form

f_w(x) = ∫_{R³} w_2 σ(w_1 x + b) µ_d(dw_1, dw_2, db).

We first note that in the infinite width limit d → ∞, the infinite network at initialization is null, because the weights are i.i.d. Gaussian and the law of large numbers tells us that for Z_1, Z_2, Z_3 i.i.d. N(0, 1),

f_w(x) = ∫_{R³} w_2 σ(w_1 x + b) µ(dw_1, dw_2, db) = E[Z_1 σ(Z_2 x + Z_3)] = E[Z_1] E[σ(Z_2 x + Z_3)] = 0,

where µ := lim_{d→∞} µ_d.

With an abuse of notation, we denote by µ_t the measure of the infinite-width neural network at time t of training. The advantage of writing the neural network in integral form and taking the limit is that one can study the dynamics of the measure µ_t during training instead of that of the function. We refrain from doing so as it involves many tools that are out of the scope of this class (from optimal transport theory, such as Wasserstein gradients). We refer the interested reader to the paper [2]. One of the main results of this paper can be informally stated as
Theorem 21 (Informal). Under some assumption on the support of the ini-
tial measure µ0 and if σ = ReLU, then if µt converges to some µ∞ as t → ∞,
then µ∞ is optimal in the sense that the induced network fw∞ achieves zero
loss.
Remark 30. Such a result is called a global convergence result: it does not
claim that the network converges, but if it does, then its limit perfectly fits the
data. It is important to note that the results in the mean-field regime require
the assumption of having a single hidden layer. We chose σ = ReLU, the
result holds for more general activations under some homogeneity assumption
of the network.

⋆ Remark 15. As opposed to the NTK regime, in the mean-field regime,


feature learning occurs and we can move from one to the other regime simply
by changing the initialization scale. For details on the impact of initialization
scale and feature learning, see the paper [14].

8.5 Beyond feed forward neural networks

8.5.1 Convolutional neural networks

Convolutional Neural Networks (CNNs) are designed in such a way that they can take into account the spatial structure of the input. They were inspired by the visual system of mice and were originally designed to work with images. Compared to standard neural networks, CNNs have far fewer parameters, which makes it possible to efficiently train very deep architectures.

A convolutional neural network typically has the following structure (we consider 2-D CNNs (images); there are also 1-D (e.g. speech) and 3-D (e.g. medical imaging) versions, and possibly higher dimensions):

Figure 8.2: Diagram of CNN [?]

Formulation of each component:

Input: matrix Rn×n

• Convolution layer: defined by the number of filters and the size of the filters (often 3×3, 4×4, 5×5). It performs a discrete convolution between a filter and the input (or an intermediary matrix). Below is the 2-d discrete convolution between a matrix f and a filter g (see the code sketch after this list):

(f ∗ g)(x, y) = ∑_{m=−M}^{M} ∑_{n=−N}^{N} f(x − n, y − m) g(n, m).

Several filters are formed.

• Pooling layer: maxpool (define the size of the max pool), returns the
maximum value for the size of the maxpool.
This operation performs downsampling and selects for the “dominant”
pixel (i.e. the pixel with largest value). We note that similar images
(say one is a slight shift of another) yield similar down-sampled images.

• Fully-connected layer: flatten input, map that to desired output size


(e.g. multiclass classification)

• Output activation function: softmax (return probabilities of belonging


to class i).
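Below is a minimal NumPy sketch of the convolution and max-pooling operations listed above (stride 1, no padding, non-overlapping 2×2 pooling windows). As is common in deep learning libraries, the "convolution" is implemented as a cross-correlation, i.e. without flipping the filter; the image and filter are arbitrary illustrative choices.

import numpy as np

def conv2d(f, g):
    """Valid 2-d cross-correlation of image f with filter g (stride 1, no padding)."""
    H, W = f.shape
    kH, kW = g.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i + kH, j:j + kW] * g)
    return out

def maxpool2d(f, size=2):
    """Non-overlapping max pooling; keeps the 'dominant' pixel of each window."""
    H, W = f.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = f[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

img = np.random.default_rng(0).normal(size=(8, 8))
filt = np.array([[1.0, 0.0, -1.0]] * 3)      # a simple 3x3 edge-like filter
print(maxpool2d(conv2d(img, filt)).shape)    # (3, 3)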

8.5.2 Generative Adversarial Neural Networks

The original Generative Adversarial Neural Network (GAN) was introduced


as a new generative framework from training data sets. It addressed the
question: “if you are given a data set of objects with a certain degree of
consistency, can we artificially generate similar objects?”

Mathematically, objects in the dataset are samples generated from a com-


mon probability distribution D on the input space X and thus have this
consistency between them.

In practice, we have observations of data points sampled from distribu-


tion D, and we want to approximate the underlying distribution with some
approximation G. Suppose X ⊂ Rn . We start with an initial distribution γ
defined in Rd and we define the mapping G : Rd → Rn such that if a random
variable z ∈ Rd has distribution γ, then G(z) has distribution D.

But how do we find a good approximation to D through G? We introduce another function, f_disc, called the discriminator, which is a classifier tasked to predict whether a particular sample x was drawn from the original distribution D or generated by G; namely, if f_disc(x) is high, then x is deemed to come from D, and if f_disc(x) is low, then x is deemed to be a generated sample G(z).

Both the mapping G, called generator, and discriminator fdisc are neural
networks (with their own set of parameters) which we will train.

The target loss function is given by:


ℓ(G, fdisc ) := Ex∼D [log fdisc (x)] + Ez∼γ [log(1 − fdisc (G(z)))] .

The GAN solves the minimax problem


min max ℓ(G, fdisc ).
G fdisc

For a given generator G, maxfdisc ℓ(G, fdisc ) will optimise the discriminator
fdisc to reject samples G(z), by assigning high values to samples from the
distribution D and low values to the generated samples G(z). Whereas for a
given discriminator fdisc , minG ℓ(G, fdisc ) optimises G so that the generated
samples G(z) will attempt to fool the discriminator fdisc into assigning high
values.

Note that we can't actually compute these expectations, so we must approximate them:

E_{x∼D}[log f_disc(x)] ≈ (1/m_1) ∑_{i=1}^{m_1} log f_disc(x_i),
E_{z∼γ}[log(1 − f_disc(G(z)))] ≈ (1/m_2) ∑_{i=1}^{m_2} log(1 − f_disc(G(z_i))),

where x_1, ..., x_{m_1} come from the training set X and z_1, ..., z_{m_2} from a dummy distribution γ.

The optimisation is done in two steps: first, update the discriminator f_disc, taking the gradient with respect to the parameters of f_disc and performing a gradient ascent update; then, update the generator G, taking the gradient with respect to the parameters of G and performing a gradient descent update.
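As an illustration of this alternating scheme, here is a schematic PyTorch sketch of one training iteration; the tiny generator/discriminator architectures, dimensions and optimiser settings are placeholder assumptions, not part of the notes.

import torch
import torch.nn as nn

latent_dim, data_dim = 4, 2
G = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, data_dim))
f_disc = nn.Sequential(nn.Linear(data_dim, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(f_disc.parameters(), lr=1e-3)

def gan_step(x_real):
    """One alternating update: gradient ascent on f_disc, then descent on G."""
    m = x_real.shape[0]
    # Discriminator: maximise (1/m1) sum log f_disc(x_i) + (1/m2) sum log(1 - f_disc(G(z_i))).
    z = torch.randn(m, latent_dim)
    loss_D = -(torch.log(f_disc(x_real)).mean()
               + torch.log(1 - f_disc(G(z).detach())).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # Generator: minimise (1/m2) sum log(1 - f_disc(G(z_i))).
    z = torch.randn(m, latent_dim)
    loss_G = torch.log(1 - f_disc(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

gan_step(torch.randn(32, data_dim))   # one step on a dummy "real" batch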

GANs have become very powerful tools for generative models, where we
want to generate similar samples to a given dataset, for example, new faces,
new rooms, etc5 . Not only this, GANs can be used to solve problems in which
we can’t easily formalise a loss function, or in situations where we don’t have
access to a loss function, or we are unable to compute gradients on the loss
function.

8.6 Tricks of the trade

This section is loosely based on the lecture series [10]: advice that has come from many years of theory and experimentation, and that has led to substantial differences in terms of speed, ease of implementation and accuracy when putting algorithms to work in practice. A lot of the information in these lecture series is already implemented (by default) in machine learning libraries (e.g. tensorflow, keras, torch).

In this section, I will give a summary of, and justification for, why these "tricks of the trade" are important in practice. Much of the advice is quite general and applies to most neural networks.

What are some things that can go wrong when using neural network
models6 ?

• Poor performance ( overfitting, stuck in local minima, etc...)

• Inability to train network (exploding gradients (NaN values) and flat


gradients (refer to section 8.2.2), diverging solutions, slow convergence)

• Very large hyper-parameter space


5
Check out https://thisxdoesnotexist.com/ for many examples of GANs used to
generate a variety of things!
6
This is of course not an exclusive problem of neural networks.

8.6.1 Input regularisation

We've seen in Chapter 5 that sometimes it is necessary to normalise the input (linear regression with regularisation). An input X is typically centered by subtracting the empirical mean X̂ and dividing by the empirical variance σ²_X:

X′ = (X − X̂) / σ²_X .

For neural networks, normalising the input is also a common practice,


namely, such that:

• Average of each input over the training set should be close to zero;

• Scale input variables so that their covariances are about the same;

• input variables should be uncorrelated if possible.

Shifting and scaling is quite simple. However, decorrelating the inputs


might be tricky. Sometimes, Principal component analysis is applied to the
feature matrix to remove linear correlations in input.

It is also observed that convergence is usually faster if the average of each


input over the training set is close to zero, meaning there are both positive
and negative values. Let us consider an example of the extreme case:
Example 8.6.1. Suppose all the components of an input vector are positive (or all negative).

The gradient update for any weight w_{ij}^n that sits on layer n is given as⁷:

w_{ij}^{t+1} = w_{ij}^{t} − η ∂L/∂w_{ij}^n,   where   ∂L/∂w_{ij}^n = x_j^{n−1} ∂L/∂z_i^n.

⁷ You can verify this by the chain rule.

Let us focus on the first layer, n = 1. The updates for the weights corresponding to a particular output node i (corresponding to row i of the weight matrix W¹) are proportional to x_j^0 ∂L/∂z_i^1. If all components of an input vector are positive, all the updates of the weights that feed into node i will have the same sign, namely sign(∂L/∂z_i^1). This means those weights can only all decrease or all increase together for a given input. If the weight vector must change direction, it can only do so by zigzagging, which is inefficient and slow.

Normalising the inputs is not only a concern for the speed of convergence but also for the trainability of the network. Due to finite numerical precision, numerical overflow can occur, turning gradient updates into NaN updates.

8.6.2 Stabilising the gradients

We saw in Section 8.2.3 the problems of vanishing and exploding gradients


when training a neural network. We discussed how avoiding these problems
motivated different weights initialization schemes. We have, however, no
guarantee that these initializations prevent these two phenomena to occur
beyond the first step of gradient descent. We present here two techniques
that have been introduced to keep the training stable.

Gradient clipping. What can we do if the gradient becomes too large? Simply normalize it! The following technique is called gradient clipping and is quite intuitive: let ϵ > 0, then

• if ‖∇_w C(w)‖ > ϵ, update the parameters with ϵ ∇_w C(w) / ‖∇_w C(w)‖,

• otherwise, use the usual update ∇_w C(w).

If some components of the gradient dominate the others, the latter, after clipping, might end up too small. Another practice is to clip each component of the gradient independently if it crosses a threshold ϵ.
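A minimal NumPy sketch of the two clipping variants just described; the threshold ϵ and the example gradient are arbitrary.

import numpy as np

def clip_by_norm(grad, eps):
    """Rescale the whole gradient if its norm exceeds eps."""
    norm = np.linalg.norm(grad)
    return grad if norm <= eps else eps * grad / norm

def clip_by_value(grad, eps):
    """Clip each component independently to [-eps, eps]."""
    return np.clip(grad, -eps, eps)

g = np.array([3.0, -4.0, 0.1])
print(clip_by_norm(g, 1.0))    # norm rescaled from 5 to 1
print(clip_by_value(g, 1.0))   # [ 1.  -1.   0.1]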

Batch normalization. Batch normalization is a way to guarantee that all


layers’ activations are centered and normalized with respect to the batch of
data samples the network is training on. We can formally define it as a new
type of layer of the neural network. A batch normalization layer, denoted by
BN, does the following:

• {x(i) ; i = 1, . . . , m} ⊂ Rd is some batch of input.


• µ := d1 di=1 x(i) is the empirical mean (vector) of the input batch.
P

• v := d1 di=1 (x(i) − µ)2 is the empirical variance (vector) of the input


P
batch, where the square function is applied component-wise.
(i)
• x(i) := x√v+ϵ
−µ
for all i = 1, . . . , m, where ϵ > 0 is a hyperparameter that
prevents the denominator to be null.
(i)
• BN(x(i) ) := (γk xk + βk )k=1,...,d , where γk , βk are parameters that can
be learnable.

A neural network can then be composed of many layers, some of them being
BN.
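A minimal NumPy sketch of the forward pass of a BN layer at training time (the running statistics used at test time are omitted); following the usual convention, the mean and variance are taken over the batch dimension, per feature.

import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Batch-normalisation layer: X has shape (m, d) = (batch, features)."""
    mu = X.mean(axis=0)                  # empirical mean over the batch
    v = X.var(axis=0)                    # empirical variance over the batch
    X_hat = (X - mu) / np.sqrt(v + eps)  # centred and normalised activations
    return gamma * X_hat + beta          # learnable rescaling per feature

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 10))
out = batch_norm(X, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 and ~1 per feature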

Despite the several heuristics and the empirical success of batch normal-
ization, there is, for now, no very robust theory nor consensus on the different
effects of batch normalization.

8.6.3 Preventing overfitting

Avoiding overfitting is a common desire in Machine Learning, to have better


hope that the model generalises to unseen data. For neural networks, there
are different strategies to prevent overfitting.

Regularisation

Similarly to previously seen models, we can also add a regularisation term on


the loss function, penalising the model parameters through a L1 or L2 norm.

Early stopping

The rationale behind early stopping is that when we decide on the number of epochs to train a network, it is not usually a very well informed decision. The idea is to monitor the training loss and a validation loss, and to stop training when the validation loss no longer decreases (or starts to increase).

1. Split the training data into a training set and a validation set, e.g. in
a 2-to-1 proportion.

2. Train only on the training set and evaluate the per-example error on
the validation set once in a while, e.g. after every fifth epoch.

3. Stop training as soon as the error on the validation set is higher than
it was the last time it was checked.

4. Use the weights the network had in that previous step as the result of
the training run.

In reality, the curves do not behave so nicely.

As we see, choosing a stopping criterion predominantly involves a tradeoff between training time and generalization error. However, some stopping criteria typically find better tradeoffs than others.

Empirically, early stopping leads to networks that are faster to train and helps prevent overfitting. It can, however, be "tricky" to get right because the behaviour of the errors is not straightforward, and it introduces its own tunable parameters (when is the validation error too large with respect to the training error? how many successive epochs without improvement do we allow?).
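A schematic sketch of such a training loop follows; the "patience" parameter (number of successive checks without improvement before stopping) is a common generalisation of step 3 above and is an assumption of this sketch, as are the dummy losses used in the example.

import numpy as np

def train_with_early_stopping(train_step, val_loss, max_epochs=200, patience=5, check_every=5):
    """Stop when the validation loss has not improved for `patience` checks.
    `train_step(epoch)` trains one epoch; `val_loss()` evaluates on the validation set."""
    best_val, best_epoch, checks_without_improvement = np.inf, 0, 0
    for epoch in range(max_epochs):
        train_step(epoch)
        if epoch % check_every == 0:
            v = val_loss()
            if v < best_val:
                best_val, best_epoch = v, epoch
                checks_without_improvement = 0
                # here one would also save a copy of the current weights
            else:
                checks_without_improvement += 1
                if checks_without_improvement >= patience:
                    break
    return best_epoch, best_val

# dummy example: the validation loss decreases, then increases again
losses = iter(np.concatenate([np.linspace(1.0, 0.2, 10), np.linspace(0.25, 0.8, 30)]))
print(train_with_early_stopping(lambda e: None, lambda: next(losses)))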

8.6.4 Dropout

Even though overparametrized neural networks seem to generalize better than expected, they are not completely immune to overfitting. An efficient way of regularizing them and enhancing their generalization performance is to turn off neurons at random while training. This method is called dropout. It works as follows:

• For each neuron, independently turn it off with a probability 1 − p ∈


(0, 1); p is called the keep probability, equivalently 1 − p is called the
dropout rate. It is one more hyperparameter to the model
• The sub-network consisting of neurons that have been kept is trained
for one step of gradient descent (or any optimization algorithm at use).
Only the weights of the neurons that have been kept are updated during
this step
• Repeat the two previous steps until the end of training.
• To make predictions with the trained neural network – or to evaluate
it on a test dataset – use the full neural network with weights rescaled
by the keep probability p.

The reason why we rescale the weights when predicting is because when the
weights are updated, the sub-networks used contain an average proportion
p of neurons, hence their weights tend to be bigger than they should when
using all of them together. Think about the sum of two perfect predictors:
the obtained predictor performs poorly unless we divide its output by 2.

Why is it a good idea to use dropout? As we said, this method trains


smaller networks inside the network, thus encouraging learning more sparse
functions (i.e. simpler, needing less parameters to be expressed). Each sub-
network hence has less opportunities to overfit. Another informal reason is
that the full network resembles an average of smaller networks. Averaging
simple predictors to construct a good predictor can be a very good idea to
enhance training and performance; in the next chapter, we will see a model
that is based on that idea, where the simple predictors do not even need to
perform that well.

Dropout may also refer to other similar methods where individual weights
are dropped out instead of individual neurons.

During training, we compare a neural network without dropout:

x̃_i^{(l+1)} = w_i^{(l+1)} x^{(l)} + b_i^{(l+1)},
x_i^{(l+1)} = σ(x̃_i^{(l+1)}),

and with dropout:

r_j^{(l)} ∼ Bernoulli(p),
z_j^{(l)} = r_j^{(l)} x_j^{(l)},
x̃_i^{(l+1)} = w_i^{(l+1)} z^{(l)} + b_i^{(l+1)},
x_i^{(l+1)} = σ(x̃_i^{(l+1)}).
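A minimal NumPy sketch of one layer with dropout, following the equations above, together with the test-time rescaling by the keep probability p; shapes and the tanh activation are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)

def layer_with_dropout(x, W, b, p, train=True, sigma=np.tanh):
    """One layer with dropout applied to its input activations x^(l)."""
    if train:
        r = rng.binomial(1, p, size=x.shape)    # r_j ~ Bernoulli(p), keep probability p
        z = r * x                               # dropped-out activations z^(l)
        return sigma(W @ z + b)
    else:
        return sigma(W @ (p * x) + b)           # at test time, rescale by p instead

x = rng.normal(size=5)
W, b = rng.normal(size=(3, 5)), np.zeros(3)
print(layer_with_dropout(x, W, b, p=0.8, train=True))
print(layer_with_dropout(x, W, b, p=0.8, train=False))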

8.6.5 Dealing with hyper-parameters

Architecture hyperparameters define the function space H in which our approximator lives. Some hyperparameters are: number of layers, width of each layer, activation functions, other types of layers (dropout, normalising layers, etc.).

Then we also have training hyperparameters, which specify how the training algorithm A behaves: number of epochs, batch size, learning rate (momentum, etc.), training set / validation set split, loss function (type, L1–L2 penalties).

Many times, good hyperparameters come from experience, trial and error,
theoretical justifications or empirical results. There are also computational
frameworks that can help explore this large hyperparameter space, such as
hyperopt [?].
Chapter 9

Ensemble methods

Contents
9.1 Weak learner . . . . . . . . . . . . . . . . . . . . . 160
9.2 Adaboost . . . . . . . . . . . . . . . . . . . . . . . 164
9.2.1 * A sufficient condition for weak-learnability . . . 168
9.2.2 Connections to other models . . . . . . . . . . . . 169
9.2.3 * Gradient Boosting . . . . . . . . . . . . . . . . . 172
9.3 Boosting regression . . . . . . . . . . . . . . . . . 172

Ensemble methods are usually reserved for methods that generate a model
using an aggregate of base learners. There are different ways to approach
the construction of ensemble methods. In this chapter, we will focus on
boosting.

Boosting is an algorithmic paradigm that grew out of a theoretical ques-


tion and became a very practical machine learning tool. The boosting ap-
proach uses a generalisation of linear predictors to address two major issues:

• The first is the bias-complexity tradeoff. We have seen that the gen-
eralisation error of an ERM learner can be decomposed into a sum of
approximation error and estimation error, as described in equa-
tion (4.2). The more expressive the hypothesis class the learner is


searching over, the smaller the approximation error is, but the larger
the estimation error becomes. A learner is thus faced with the prob-
lem of picking a good trade-off between these two considerations. The
boosting paradigm allows the learner to have smooth control over this
trade-off. The learning starts with a basic class (that might
have a large approximation error), and as it progresses the
class that the predictor may belong to grows richer.

• The second issue that boosting addresses is the computational com-


plexity of learning. A boosting algorithm amplifies the accuracy of
weak learners. Intuitively, one can think of a weak learner as an algo-
rithm that uses a simple ”rule of thumb” to output a hypothesis that
comes from an easy-to-learn hypothesis class and performs just slightly
better than a random guess. When a weak learner can be implemented
efficiently, boosting provides a tool for aggregating such weak hypothe-
ses to approximate gradually good predictors for larger, and harder to
learn, classes.

In this chapter, we start again by considering binary classification for the


sake of exposition, though ensemble methods apply to general classification
tasks and regression.

Figure 9.1 illustrates how properly aggregating base learners can solve a
task on which a base learners taken individually does not perform extremely
well.

9.1 Weak learner

Let us define the concept of γ-weak-learnability first. If we were to randomly


guess labels by tossing a fair coin each time, then in average, we would be
right half of the time and wrong just as much. A γ-weak-learner is a learner
that merely does slightly better than that.
Definition 9.1.1. A learning algorithm A is a γ-weak-learner for a class H
if there exists a function mH : (0, 1) → N such that for every δ ∈ (0, 1), for
every distribution D over X , and for every labeling function f : X → {±1},

Figure 9.1: An illustration of how AdaBoost behaves on a tiny toy problem


with m = 10 examples. Credit: [11] Figures 1.1 and 1.2, the reader will find
greater details on this example therein.
Left: Each row depicts one round, for t = 1, 2, 3. The left box in each
row represents the distribution D(t) , with the size of each example scaled
in proportion to its weight under that distribution. Each box on the right
shows the weak hypothesis ht , where darker shading indicates the region of
the domain predicted to be positive. Examples that are misclassified by ht
have been circled.
Right: The combined classifier for the toy example is computed as the sign
of the weighted sum of the three weak hypotheses, w1 h1 + w2 h2 + w3 h3 as
shown at the top. This is equivalent to the classifier shown at the bottom.

if the realizable assumption holds with respect to H, D, and f , then when


running the learning algorithm on m ≥ mH (δ) i.i.d. examples generated by
D and labeled by f , the algorithm returns a hypothesis fw such that, with
probability of at least 1 − δ, R(D,f ) (fw ) ≤ 1/2 − γ.
Definition 9.1.2. A hypothesis class H is γ-weak-learnable if there exists a
γ-weak-learner for that class.

Note that this definition is almost identical to the definition of PAC learn-
ing (here we will call strong learning) shown in chapter 3.7, with one crucial
difference: strong learnability implies the ability to find an arbitrarily good
classifier, with error rate at most ϵ for an arbitrarily small ϵ, when considering
the non-agnostic case. In weak learnability, however, we only need to output
a hypothesis whose error rate is at most 1/2 − γ for a fixed γ > 0, namely,
whose error rate is slightly better than what a random labeling would give
us. The hope is that it may be easier to come up with efficient weak learners
than with efficient (full) PAC learners.

One possible approach is to take a “simple” hypothesis class, denoted B,


and to apply ERM with respect to B (stands for base class) as the weak
learning algorithm. For this to work, we need that B will satisfy two require-
ments:

• ERMB is efficiently implementable.


• For every sample that is labeled by some hypothesis from H, any ERMB
hypothesis will have an error of at most 1/2 − γ.

Remark 31. What can be considered as weak learners?

• Decision stumps (a one-level decision tree):

H_DS = {x ↦ sign(θ − x_j) · b : θ ∈ R, j ∈ [d], b ∈ {±1}}

• Decision trees. These are typically defined recursively.

• While traditionally decision stumps/trees were the de facto weak learners, one can also use linear functions, splines, SVMs, shallow networks...

Let us see an example of how to find an optimal h ∈ HDS , in the class of


Decision stumps.
Example 9.1.1. Let X = Rd and consider the base hypothesis class over Rd
to be:
HDS = {x 7→ sign(θ − xj )b : θ ∈ R, j ∈ [d], b ∈ {±1}}.

For simplicity, let us assume b = 1 (this flips the label ±1). Let S =
{(x1 , y1 ), ..., (xm , ym )} be a training set. We will show how to implement an
ERM rule, namely, how to find a decision stump that minimizes L(h).

Let us introduce the vector D, a probability vector in R^m (that is, all elements of D are non-negative and ∑_i D_i = 1). The weak learner we describe receives D and the training set S and outputs a decision stump h : X → Y that minimizes the risk w.r.t. D:

L_D(f_w) = ∑_{i=1}^m D_i 1_{f_w(x_i) ≠ y_i}.

Note that if D = (1/m, ..., 1/m) then LD (fw ) = L(fw ).

Each decision stump is parametrised by an index j ∈ [d] (which selects the dimension of the feature vector to split over) and a threshold θ. Therefore, minimizing L_D(h) amounts to solving the minimization problem:

min_{j∈[d]} min_{θ∈R} ( ∑_{i: y_i = 1} D_i 1_{x_{i,j} > θ} + ∑_{i: y_i = −1} D_i 1_{x_{i,j} ≤ θ} ).   (9.1)

Note that here we just wrote L_D(f_w) in such a way as to show the mislabelling when the true label is positive and the prediction is negative, and when the true label is negative and the prediction is positive. What we want to do now is to show that this can be further simplified, eventually yielding an easy minimization problem.

Fix j ∈ [d] and sort the training examples such that x_{1,j} ≤ x_{2,j} ≤ · · · ≤ x_{m,j}. Let us define the set Θ_j = { (x_{i,j} + x_{i+1,j})/2 : i ∈ [m − 1] } ∪ { x_{1,j} − 1, x_{m,j} + 1 }. This is essentially setting up a grid for which, for any θ ∈ R, there exists θ′ ∈ Θ_j that yields the same predictions for the sample S. Then, we can
minimize θ over Θj .

This gives us an efficient procedure: choose j ∈ [d] and θ ∈ Θj that


minimize the objective value in (9.1). This yields a runtime complexity of
O(dm2 ), but it’s possible to minimize the objective in O(dm). Refer to [?]
if interested.
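A minimal NumPy sketch of this procedure follows (the straightforward O(dm²) version, with b = 1 fixed as above); the synthetic data are an arbitrary illustration.

import numpy as np

def best_stump(X, y, D):
    """ERM over decision stumps (b = 1): minimise sum_i D_i 1{sign(theta - x_ij) != y_i}."""
    m, d = X.shape
    best = (np.inf, None, None)          # (weighted error, feature j, threshold theta)
    for j in range(d):
        xs = np.sort(X[:, j])
        thresholds = np.concatenate((
            [xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
        for theta in thresholds:
            pred = np.sign(theta - X[:, j])
            err = D[pred != y].sum()
            if err < best[0]:
                best = (err, j, theta)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = np.sign(0.2 - X[:, 1])               # labels realisable by a stump on feature 1
D = np.full(10, 1 / 10)                  # uniform weights: L_D is the empirical risk
print(best_stump(X, y, D))               # error 0, j = 1, theta near 0.2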

9.2 Adaboost

AdaBoost (short for Adaptive Boosting) is an algorithm that has access to a weak learner and finds a hypothesis with a low empirical risk. The AdaBoost algorithm receives as input a training set of examples S = {(x_1, y_1), ..., (x_m, y_m)} and the boosting process proceeds in a sequence of consecutive rounds. At round t, the booster algorithm first defines a probability distribution over the samples of S, denoted by D^{(t)}.¹ Then, the booster algorithm passes the probability vector D^{(t)} and the sample S to the weak learner. The weak learner is assumed to return a "weak" hypothesis h_t, whose error,

ϵ_t = R_{D^{(t)}}(h_t) = ∑_{i=1}^m D_i^{(t)} 1_{h_t(x_i) ≠ y_i},

is at most 1/2 − γ for a fixed γ ∈ (0, 1/2)² that does not depend on t.
which is at most 1/2 − γ for a fixed γ ∈ (0, 1/2)2 that does not depend on t.

Then, AdaBoost assigns a weight w_t to h_t, given by

w_t = (1/2) log(1/ϵ_t − 1),

that is, the smaller the error, the larger the weight.
Remark 32. What is the weight w_t? In the end, AdaBoost returns a classifier that aggregates all the weak learners that were obtained after each round t, of the form:

f_w(x) = ∑_{t=1}^{T} w_t h_t(x).
¹ Represented as a probability vector, as defined previously.
² There is a probability of at most δ (from Definition 9.1.1) that the weak learner fails to have an error smaller than 1/2 − γ.

This can be seen as a linear combination of the weak learners ht . We will see
in the proof of the forthcoming theorem 22 that this choice of wt is ”optimal”
in some sense.

At the end of the round, AdaBoost updates the probability vector D(t) so
that examples on which ht is wrong will get a higher probability mass while
examples on which ht is correct will get a lower probability mass. This gives
more importance to the points that ht misclassifies, for the next weak learner
to focus on.3

The algorithm reads:

Figure 9.2: Algorithm for AdaBoost (from [11])
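A compact NumPy sketch of the rounds described in Figure 9.2, using a decision-stump weak learner as in the previous section (here also allowing b = ±1); this is an illustrative implementation on made-up data, not the notes' reference code.

import numpy as np

def stump_learner(X, y, D):
    """Weak learner: best decision stump x -> b*sign(theta - x_j) under weights D."""
    m, d = X.shape
    best_err, best_h = np.inf, None
    for j in range(d):
        xs = np.sort(X[:, j])
        for theta in np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1])):
            for b in (+1, -1):
                pred = b * np.sign(theta - X[:, j])
                err = D[pred != y].sum()
                if err < best_err:
                    best_err = err
                    best_h = (lambda x, j=j, theta=theta, b=b: b * np.sign(theta - x[:, j]))
    return best_h

def adaboost(X, y, T=20):
    m = X.shape[0]
    D = np.full(m, 1 / m)                      # D^(1): uniform distribution over the sample
    hs, ws = [], []
    for t in range(T):
        h = stump_learner(X, y, D)
        pred = h(X)
        eps = np.clip(D[pred != y].sum(), 1e-12, 1 - 1e-12)  # weighted error eps_t (guarded)
        w = 0.5 * np.log(1 / eps - 1)          # w_t = (1/2) log(1/eps_t - 1)
        D = D * np.exp(-w * y * pred)          # misclassified points get more mass
        D = D / D.sum()                        # normalise by Z_t
        hs.append(h); ws.append(w)
    return lambda x: np.sign(sum(w * h(x) for w, h in zip(ws, hs)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + X[:, 1])                 # labels given by a rule no single stump matches
H = adaboost(X, y, T=30)
print((H(X) == y).mean())                      # training error decreases with T, cf. Theorem 22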

Remark 33. There are two quantities which play the role of a weight. One is w_t, which weights the contribution of the weak learner h_t in the final model f_w. The other, given by the probability vector D^{(t)}, gives a weight to each data point.

In the next theorem, we suppose that the dataset S = {(xi , yi ); i = 1..m}


is such that the labels yi ’s are generated by a map f in a hypothesis class H
that is γ-weak-learnable.
³ A video that shows the AdaBoost algorithm: https://www.youtube.com/watch?v=k4G2VCuOMMg

Theorem 22. Let S be a training set and assume that at each iteration of AdaBoost, the weak learner returns a hypothesis for which ϵ_t ≤ 1/2 − γ. Then, the training error of the output hypothesis H_T of AdaBoost is at most

R_S(H_T) = (1/m) ∑_{i=1}^m 1_{H_T(x_i) ≠ y_i} ≤ exp(−2γ² T).

Proof. For each round t, let f_t = ∑_{k≤t} w_k h_k, so that the output of AdaBoost is H_T := sign(f_T). In addition, let

Z_t = ∑_{i=1}^m D_i^{(t)} e^{−y_i w_t h_t(x_i)},

which is the normalisation factor so that D^{(t+1)}, as defined in the AdaBoost algorithm described in Figure 9.2, is indeed a probability distribution.

Unrolling the recurrence defining D^{(T+1)}, for all i = 1, ..., m we can write

D_i^{(T+1)} = D_i^{(1)} ∏_{t=1}^{T} exp(−w_t y_i h_t(x_i)) / Z_t
           = (1/m) · exp(−y_i ∑_{t=1}^T w_t h_t(x_i)) / ∏_{t=1}^T Z_t
           = (1/m) · exp(−y_i f_T(x_i)) / ∏_{t=1}^T Z_t.

Note that 1_{H_T(x) ≠ y} ≤ e^{−y f_T(x)}, since x is misclassified by H_T if and only
CHAPTER 9. ENSEMBLE METHODS 167

if fT (x) and y have opposite signs. Therefore,


\begin{align*}
R_S(H_T) &= \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{H_T(x_i) \neq y_i} \\
&\leq \frac{1}{m} \sum_{i=1}^{m} \exp\left(-y_i f_T(x_i)\right) \\
&= \sum_{i=1}^{m} D_i^{(T+1)} \prod_{t=1}^{T} Z_t \\
&= \prod_{t=1}^{T} Z_t,
\end{align*}
where the third line uses the expression for $D_i^{(T+1)}$ obtained above, and the last one
uses that D(T+1) is a probability vector, hence sums to one.

We now rewrite Zt as
\begin{align*}
Z_t &= \sum_{i=1}^{m} D_i^{(t)} \exp\left(-w_t y_i h_t(x_i)\right) \\
&= \sum_{i : y_i = h_t(x_i)} D_i^{(t)} \exp(-w_t) + \sum_{i : y_i \neq h_t(x_i)} D_i^{(t)} \exp(w_t) \\
&= (1 - \epsilon_t) \exp(-w_t) + \epsilon_t \exp(w_t),
\end{align*}
by definition of ϵt. Furthermore, by definition of $w_t = \frac{1}{2}\log(1/\epsilon_t - 1)$, we get


\begin{align*}
Z_t &= (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} \\
&= \sqrt{4 \epsilon_t (1 - \epsilon_t)}.
\end{align*}

By our assumption, ϵt ≤ 1/2 − γ. Recalling that g(x) = x(1 − x) is monotonically
increasing on [0, 1/2], we can bound the quantity above to obtain
\[
Z_t \leq \sqrt{1 - 4\gamma^2}.
\]
We have thus proven that
\[
R_S(H_T) \leq (1 - 4\gamma^2)^{T/2}.
\]

To conclude, recall that for all x ∈ R, the exponential function satisfies
1 + x ≤ exp(x). Applying this with x = −4γ^2 gives 1 − 4γ^2 ≤ exp(−4γ^2), which entails
that
\[
R_S(H_T) \leq \exp(-2\gamma^2 T),
\]
as claimed.

Thanks to the above theorem, we see that even though each weak learner
performs only slightly better than a purely (uniform) random guess, AdaBoost
is able to choose a linear combination of them that yields a predictor
whose error on the dataset decreases exponentially fast in the number of
weak learners.
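To get a sense of the rate (an illustrative computation): with an edge of γ = 0.1, the bound reads RS(HT) ≤ exp(−0.02 T), so after T = 500 rounds the training error is at most exp(−10) ≈ 4.5 × 10^{−5}; in particular, the training error is exactly zero as soon as the bound drops below 1/m.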

⋆ Remark 16. However, what we really care about is the true risk of the
output hypothesis, i.e. the generalisation error. It turns out that
\[
R_D(h) \leq R_S(h) + O\left(\sqrt{\frac{T \cdot \mathrm{VC}(H)}{m}}\right),
\]
where VC(H) denotes the VC dimension of the base hypothesis class H. We will not
discuss this in this class.

9.2.1 * A sufficient condition for weak-learnability

Theorem 22 proves that the assumption of γ-weak learnability is sufficient to


ensure that AdaBoost will drive down the training error very quickly. But
when does this assumption of γ-weak learnability actually hold?

In this section, we provide a condition that implies the assumption of


empirical weak learnability. This condition is only in terms of the functional
relationship between the data instances and their labels, and does not involve
distributions over examples.

Let all the weak hypotheses belong to some hypothesis class H. Suppose
our training sample S is such that for some weak hypotheses g1, · · · , gk from
H, for some nonnegative coefficients a1, · · · , ak with $\sum_{j=1}^{k} a_j = 1$, and
for some θ > 0, it holds that
\[
y_i \sum_{j=1}^{k} a_j g_j(x_i) \geq \theta \tag{9.2}
\]

for each example (xi, yi) in S. This condition implies that yi can be computed
by a weighted majority vote of the weak hypotheses:
\[
y_i = \mathrm{sign}\left(\sum_{j=1}^{k} a_j g_j(x_i)\right),
\]
since by (9.2) the weighted sum multiplied by yi is strictly positive. In fact, (9.2)
demands more than a bare weighted majority being correct: the vote must be correct
with a margin of at least θ. When this condition holds for all i, we say the sample S
is linearly separable with margin θ.
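As a small illustration (the function and its interface below are our own, not from the notes), verifying condition (9.2) amounts to computing the smallest value of its left-hand side over the sample:

\begin{verbatim}
def margin(S, gs, a):
    """Smallest value of y_i * sum_j a_j g_j(x_i) over the sample S.

    S:  list of (x, y) pairs with y in {-1, +1},
    gs: list of weak hypotheses g_j (callables x -> {-1, +1}),
    a:  nonnegative coefficients summing to one.
    S is linearly separable with margin theta iff margin(S, gs, a) >= theta > 0.
    """
    return min(y * sum(a_j * g(x) for a_j, g in zip(a, gs)) for x, y in S)
\end{verbatim}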

9.2.2 Connections to other models

We saw, through AdaBoost, that boosting is essentially applying the weak
classification algorithm to repeatedly modified versions of the data, producing
a sequence of weak classifiers ht(x), t = 1, · · · , T. The predictions from all
of them are then combined through a weighted majority vote to produce the
final prediction:
\[
H_T(x) = \mathrm{sign}\left(\sum_{t=1}^{T} w_t h_t(x)\right),
\]
where the wt are computed by the boosting algorithm and weight the contribution
of each respective ht. Their effect is to give higher influence to the more
accurate classifiers in the sequence.

In this section, we will first show that AdaBoost fits an additive model
in a base learner, optimising an exponential loss function. Then, we
will introduce the class of gradient boosted models (GBMs), which boost weak
learners under an arbitrary loss function.

Boosting and additive models

Boosting is a way of fitting an additive expansion in a set of elementary
“basis” functions. Here, the basis functions are the individual classifiers ht(x) ∈
{−1, +1}. More generally, basis function expansions take the form
\[
f(x) = \sum_{k=1}^{M} \beta_k b(x; \gamma_k),
\]
where the βk are expansion coefficients and b(x; γ) ∈ R are usually simple functions
of the multivariate argument x, characterised by a set of parameters γ. For decision
stumps, for instance, γ parametrises the split variable and split point. Additive
expansions of this form can also describe single-hidden-layer neural networks.

Models based on additive expansions are fit by minimizing a loss function
averaged over the training data, such as the squared-error or a likelihood-based
loss function:
\[
\min_{\{\beta_k, \gamma_k\}} \sum_{i=1}^{m} L\left(y_i, \sum_{k=1}^{M} \beta_k b(x_i; \gamma_k)\right). \tag{9.3}
\]
For many loss functions L or basis functions b, this requires computationally
intensive numerical optimisation techniques. A simple alternative is to
rapidly solve the sub-problem of fitting just a single basis function:
\[
\min_{\{\beta, \gamma\}} \sum_{i=1}^{m} L\left(y_i, \beta b(x_i; \gamma)\right).
\]

Forward stagewise additive modelling approximates the solution of (9.3)
by sequentially adding new basis functions to the expansion, without adjusting
the parameters and coefficients of those that have already been added.
This is outlined in the algorithm shown in figure 9.3.

At each iteration m, one solves for the optimal basis function b(x; γm)
and corresponding coefficient βm to add to the current expansion fm−1(x). This
produces fm(x), and the process is repeated. Previously added terms are not
modified. Indeed, for the squared-error loss
\[
L(y, f(x)) = (y - f(x))^2,
\]
one has
\begin{align*}
L\left(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)\right) &= \left(y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma)\right)^2 \\
&= \left(r_{im} - \beta b(x_i; \gamma)\right)^2,
\end{align*}
where rim = yi − fm−1(xi) is simply the residual of the current model on the i-th
observation. Thus, for the squared-error loss, the term βm b(x; γm) that best fits
the current residuals is added to the expansion at each step.

Figure 9.3: Forward stagewise algorithm
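To connect the algorithm in figure 9.3 with the squared-error computation above, here is a minimal Python sketch specialised to squared-error loss; the helper fit_basis(X, r), assumed to return a basis function fitted by least squares to the residuals r (absorbing the coefficient β into the fit), is our own and not part of the notes.

\begin{verbatim}
import numpy as np

def forward_stagewise(X, y, fit_basis, M):
    """Forward stagewise additive modelling, specialised to squared-error loss.

    fit_basis(X, r) is assumed to return a callable b with b(X) an array of
    predictions, fitted by least squares to the current residuals r.
    Previously added terms are never revisited.
    """
    f = np.zeros(len(y))        # f_0 = 0
    terms = []
    for _ in range(M):
        r = y - f               # residuals of the current expansion
        b = fit_basis(X, r)     # solve the single-basis sub-problem for this loss
        f = f + b(X)            # add the new term; earlier terms stay fixed
        terms.append(b)
    return terms
\end{verbatim}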

AdaBoost is equivalent to forward stagewise additive modelling using the
exponential loss function
\[
L(y, f(x)) = \exp(-y f(x)).
\]
This equivalence between AdaBoost and forward stagewise additive modelling
was only discovered five years after AdaBoost's inception.


9.2.3 * Gradient Boosting

Very informally speaking, gradient boosting is currently a very popular method
that requires little fiddling to get a good classifier. Unlike neural networks,
which require choosing more hyper-parameters, the gradient boosting algorithm
produces predictive models with fewer engineering choices.

The main difference between AdaBoost and gradient boosting is that
the iterative corrections are not introduced by re-weighting the data, but
through gradient updates: each new weak learner is fitted to the negative gradient
of the loss evaluated at the current model's predictions. This has the advantage
that a general loss function can be used, and the procedure is computationally
efficient.
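A minimal sketch of this idea follows; the helpers loss_grad(y, f) (the pointwise derivative ∂L/∂f) and fit_base(X, target) (a weak regressor fitted to a target vector), as well as the shrinkage parameter lr, are our own illustrative assumptions.

\begin{verbatim}
import numpy as np

def gradient_boost(X, y, loss_grad, fit_base, T, lr=0.1):
    """Gradient boosting sketch: each round fits the base learner to the
    negative gradient of the loss (the "pseudo-residuals") evaluated at the
    current predictions, then takes a small (shrunken) additive step.
    """
    f = np.zeros(len(y))          # current predictions on the training set
    learners = []
    for _ in range(T):
        g = -loss_grad(y, f)      # pseudo-residuals
        h = fit_base(X, g)        # weak regressor fitted to the pseudo-residuals
        f = f + lr * h(X)         # shrunken additive update
        learners.append(h)
    return learners
\end{verbatim}

With the squared-error loss, the pseudo-residuals are (up to a constant factor) exactly the residuals y − f, so the procedure reduces to the forward stagewise sketch above.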

9.3 Boosting regression

Although we do not go into detail in class about how this works, regression
with trees is also possible, generating a hypothesis of the form
\[
h(x) = \sum_{t=1}^{T} w_t h_t(x).
\]

Figure 9.4: Example of Boosting for regression.


Bibliography

[1] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal.
Reconciling modern machine-learning practice and the classical
bias–variance trade-off. Proceedings of the National Academy of Sci-
ences, 116(32):15849–15854, 2019.

[2] Lénaïc Chizat and Francis Bach. On the global convergence of gradi-
ent descent for over-parameterized models using optimal transport. In
S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett, editors, Advances in Neural Information Processing
Systems, volume 31. Curran Associates, Inc., 2018.

[3] G. Cybenko. Approximation by superpositions of a sigmoidal function.


Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: Pre-training of deep bidirectional transformers for language un-
derstanding. arXiv preprint arXiv:1810.04805, 2018.

[5] Leonardo Ferreira Guilhoto. An overview of artificial neural networks


for mathematicians. 2018.

[6] Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep
ensembles via the neural tangent kernel. Advances in neural information
processing systems, 33:1010–1022, 2020.

[7] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer


feedforward networks are universal approximators. Neural Networks,
2(5):359–366, 1989.


[8] Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent
kernel: Convergence and generalization in neural networks. In Samy
Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò
Cesa-Bianchi, and Roman Garnett, editors, NeurIPS, pages 8580–8589,
2018.

[9] Radford M Neal. Priors for infinite networks. In Bayesian Learning for
Neural Networks, pages 29–53. Springer, 1996.

[10] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2014.

[11] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2014.

[12] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine


Learning - From Theory to Algorithms. Cambridge University Press,
2014.

[13] Greg Yang. Scaling limits of wide neural networks with weight sharing:
Gaussian process behavior, gradient independence, and neural tangent
kernel derivation. 2019.

[14] Greg Yang and Edward J Hu. Feature learning in infinite-width neural
networks. arXiv preprint arXiv:2011.14522, 2020.
