
Foundations of Machine Learning

Prof. Dr. Eyke Hüllermeier

LMU Munich
Institute of Informatics

Munich, April 29, 2025

— slides are in preparation and will be continuously updated —

Structure of the lecture

1. Introduction
2. Two simple methods
3. The general setting
4. Model induction and generalisation
5. Regularisation and validation
6. Nonlinear models
7. Model dichotomisation and ensembles
8. Semi-supervised learning
9. Reinforcement learning
A. Probability
ML settings and problems

Machine learning is mainly concerned with the development and analysis of learning
algorithms.

Different types of learning problems have already been formalised in terms of corresponding problem setups, such as online learning, active learning, reinforcement learning, semi-supervised learning, etc.

In this lecture, we shall focus on one of the simplest settings, so-called supervised learning (also known as learning from examples).


ML settings and problems

Supervised learning: The learning algorithm ("learner") is provided with explicit examples of what outputs the target model is supposed to produce for specific inputs (e.g., labeled handwritten digits).

Unsupervised learning: The learner merely observes the data (e.g., handwritten digits), but without any type of supervision. The main goal is to discover structure in the data, for example represented as a grouping of data into clusters.

Reinforcement learning: Feedback is provided to the learner, but is typically of an indirect nature and may come with a temporal delay (e.g., game playing). The goal is to learn a policy for acting in a dynamic environment.


Batch vs. online learning



Machine learning and model induction

Model induction: Process of generalisation, replacing specific data by a general (though hypothetical, tentative) model of the data-generating process (or parts thereof).

A learning algorithm A accepts as input a set of training data

D = (z1, …, zN) ∈ Z^N

consisting of "data points" zi ∈ Z.

As output, the learner returns a model h ∈ H, where H is a given model class (hypothesis space); formally, the learning algorithm can hence be seen as a mapping from training data to hypotheses:

A : D −→ H .


Example of unsupervised learning

Data points = handwritten symbols



Example of supervised learning: email addresses

First Name   Last Name     Email Address
Peter        Smith         [email protected]
Ryan         Scott         [email protected]
Lara         Wasserman     [email protected]
Jennifer     Mc Donalds    [email protected]
Henry        Kessinger     [email protected]
Julian       Hoffmeister   ???


Example of supervised learning: dangerous robots



Example of supervised learning: medical diagnosis



Example of supervised learning: facial expression recognition

Facial expression recognition by a smart eyewear for facial direction changes, repeatability, and
positional drift [?].



Basic setup of the (supervised) learning problem



Supervised learning

In supervised learning, the data space Z is split into
▶ an input (instance) space X and
▶ an outcome space Y,
i.e., data points (observations, examples) are tuples of the form z = (x, y) ∈ X × Y.

The interest is to learn a mapping from X to Y, or, more generally, from X to some prediction space Ŷ that models the dependence of outcomes (responses, dependent variable) on inputs (predictors, independent variables).

More precisely, suppose there is an unknown target function

f : X −→ Y

that captures this dependence optimally (in a sense to be specified later on).


Supervised learning

In order to learn the target function, the learning algorithm A is provided with a set

D = ((x1, y1), …, (xN, yN)) ∈ (X × Y)^N

of training examples (xi, yi) ∈ X × Y, often referred to as data points.

As an output, the learner produces a function

g : X −→ Y ,

which is an element of the underlying hypothesis space H.


Supervised learning

Thus, as already said, the learner itself can be seen as a mapping

A : D −→ H ,

where D is the set of all training data sets (possibly of different size).
The learner will typically try to find a hypothesis that “fits” the training data D
sufficiently well.
The result of the learning process, the model g, can be used for different purposes,
notably to make predictions ŷ = g(x) for future (query) instances x ∈ X ; the better g
approximates f , the better such predictions will be.
The hypothesis space H together with the learner A is also referred to as the learning
model.



Classification and regression

Important special cases of supervised learning include the problems of classification and
regression.
In classification, the outcome space consists of a finite (typically small to moderate)
number of class labels:
Y = {y1 , . . . , yK } .
In the case K = 2, we speak of a binary classification problem.
Examples: Handwritten digit recognition (K = 10), predicting the result of a football
match (K = 3), credit card applications/approval in a bank (K = 2) based on customer
information such as age, gender, annual salary, years in residence, years in job,
outstanding loans, etc.
In the case of regression, the outputs are real numbers, i.e., Y = R.



Feature vectors

Instances are often represented in terms of so-called feature vectors:

x = (x1 , . . . , xd )⊤ .

Thus, the instance space X = R^d corresponds to a d-dimensional Euclidean space (or a part thereof).

The entries of the d-dimensional vectors are called "features" and typically describe an instance in terms of specific properties.

Other representations are possible, e.g., in terms of structures such as sequences, graphs, etc.


Feature vectors

Imagine that data objects are handwritten digits (bitmaps):

How could such objects be represented in terms of feature vectors

x = (x1 , . . . , xd )⊤ ?

What could be useful (informative) features, helping to distinguish different digits?



Feature vectors

By fixing a feature representation, data objects are embedded as points in some d-dimensional (Euclidean) space.


Feature vectors
Embedding of MNIST data using principal component analysis (PCA), a method for dimensionality reduction via a linear transformation R^n −→ R^d (here n = 28² = 784, d = 2), and weights for the first principal component:
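For illustration, a minimal sketch of such a PCA embedding via the singular value decomposition, assuming NumPy; random data stands in for the flattened MNIST bitmaps:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 784))    # placeholder for N flattened 28x28 images

    Xc = X - X.mean(axis=0)             # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:2].T                   # project onto the first two principal components
    print(Z.shape)                      # (1000, 2): the 2-d embedding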



Structure of the lecture

1. Introduction
2. Two simple methods
3. The general setting
4. Model induction and generalisation
5. Regularisation and validation
6. Nonlinear models
7. Model dichotomisation and ensembles
8. Semi-supervised learning
9. Reinforcement learning
A. Probability



Exemplary ML methods

For the purpose of illustration, and to convey a basic idea of ML algorithms, we shall look
at two exemplary methods (model classes):
▶ linear models,
▶ nearest neighbor estimation.

Both approaches are in a sense quite basic, but played an important role in the history of
ML and are still commonly used.

They are also quite different in terms of their characteristics and properties (e.g.,
parametric vs. nonparametric, strong bias vs. weak bias, eager vs. lazy, etc.).

We consider both approaches for the tasks of classification and regression.



Linear regression

Suppose to be given a set

D = ((x1, y1), …, (xN, yN)) ∈ (X × Y)^N

of training examples (xi, yi) ∈ X × Y, where X = R^d and Y = R.

We are interested in learning a linear regression function

hw(x) = w0 + Σ_{j=1}^d wj · xj

that "fits" the data well in terms of the squared error loss

Ein(hw) = Ein(w) = (1/N) Σ_{i=1}^N (yi − hw(xi))² .




Accounting for the bias term

The wj are the parameters of the model, also called weights; they determine the influence of the individual features (direction and strength).

The constant w0 is called the intercept or bias term.

To simplify notation, we extend each instance x = (x1, …, xd)⊤ by a leading 1:

x = (x0, x1, …, xd)⊤ = (1, x1, …, xd)⊤

The linear model can then be written compactly as follows:

hw(x) = w0 + Σ_{j=1}^d wj · xj = Σ_{j=0}^d wj · xj = w⊤x = ⟨w, x⟩


Accounting for the bias term

Predictions on the training instances x1, …, xN can be expressed compactly in terms of matrix operations. To this end, we collect all instances in an N × (d + 1) matrix X, in which the i-th row is given by

xi⊤ = (1, xi,1, …, xi,d) .

For a weight vector w = (w0, w1, …, wd)⊤, we then have

X · w = (ŷ1, ŷ2, …, ŷN)⊤ ,

where ŷi = hw(xi) = ⟨w, xi⟩ is the approximation of yi produced by the linear predictor hw.


Ordinary least squares (OLS)

With y = (y1, …, yN)⊤ and ∥·∥ the Euclidean norm, the training error (in-sample error Ein) can be written as follows:

Ein(w) = (1/N) Σ_{i=1}^N (w⊤xi − yi)²
       = (1/N) ∥Xw − y∥²
       = (1/N) (Xw − y)⊤(Xw − y)
       = (1/N) (w⊤X⊤Xw − 2w⊤X⊤y + y⊤y) .

Hence, minimising Ein comes down to finding

w* = argmin_{w ∈ R^{d+1}} ( w⊤X⊤Xw − 2w⊤X⊤y + y⊤y ) .


Ordinary least squares (OLS)

The above problem can be solved by deriving the gradient

∇Ein(w) = (2/N) (X⊤Xw − X⊤y)

and finding the w for which it vanishes.

This leads to the following closed-form solution:

w* = (X⊤X)⁻¹X⊤y = X†y ,

with the pseudo-inverse

X† = (X⊤X)⁻¹X⊤

of X (where X⊤X is supposed to be invertible).
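For illustration, a minimal NumPy sketch of this closed-form solution on synthetic data (the toy data and all names are assumptions, not part of the lecture):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # leading 1 for the bias term
    w_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=N)

    w_star = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) w = X^T y
    print(w_star)                                # close to w_true

In practice, solving the normal equations this way (or using np.linalg.lstsq) is numerically preferable to forming the pseudo-inverse explicitly.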



Model misspecification

OLS will always deliver a predictor, even if the underlying relationship is not linear, i.e.,
the (linear) model is actually misspecified:



Binary classification

Suppose to be given a set

D = ((x1, y1), …, (xN, yN)) ∈ (X × Y)^N

of examples (xi, yi) ∈ X × Y, where X = R^d and Y = {−1, +1}.

We are interested in learning a linear classifier

hw(x) = sign(w⊤x) ,

which makes as few mistakes on the training data as possible, i.e., which minimises the (empirical) 0/1 loss

Ein(hw) = Ein(w) = Σ_{i=1}^N e(yi, hw(xi)) ,

where

e(y, ŷ) = ⟦y ≠ ŷ⟧ = 0 if y = ŷ, and 1 if y ≠ ŷ .


Learning linear classifiers

This leads to the challenging optimisation problem

w* ∈ argmin_{w ∈ R^{d+1}} Σ_{i=1}^N ⟦sign(w⊤xi) ≠ yi⟧ ,

which is neither continuous nor convex in w (and indeed NP-hard to solve).

Different (approximate) algorithms for tackling this problem have been proposed in the
literature — one of them is the perceptron learning algorithm (PLA), which plays an
important role in the history of AI.

Later, we will also discuss other ways of tackling the problem, e.g., through minimising a
more manageable surrogate loss instead of the 0/1 loss.



The perceptron learning algorithm

The original version of PLA starts with an initial weight vector w(0) and updates this vector iteratively.

In each iteration t = 0, 1, 2, …, the algorithm picks one of the examples (xi, yi) ∈ D at random; denote this example as (x(t), y(t)).

The current model w(t) is then updated as follows:

w(t + 1) ← w(t) + ½ (y(t) − ŷ) x(t) ,

where

ŷ = sign⟨w(t), x(t)⟩ .
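A minimal NumPy sketch of this procedure on separable toy data (all names are assumptions); instances are extended by a leading 1, and a score of exactly 0 is treated as ŷ = −1:

    import numpy as np

    def pla(X, y, max_iter=10_000, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])                   # w(0) = 0
        for _ in range(max_iter):
            i = rng.integers(len(y))               # pick a random example (x(t), y(t))
            y_hat = 1.0 if w @ X[i] > 0 else -1.0  # sign of the current score
            w = w + 0.5 * (y[i] - y_hat) * X[i]    # no-op if correct, w + y_i x_i otherwise
        return w

    rng = np.random.default_rng(1)
    X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
    y = np.sign(X @ np.array([0.2, 1.0, -1.0]))    # labels from a separating hyperplane
    w = pla(X, y)
    print(np.mean(np.sign(X @ w) != y))            # empirical 0/1 error, 0.0 after convergence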



The perceptron learning algorithm

Thus, if the example is classified correctly by the current weight vector w(t), then y(t) − ŷ = 0 and the model is simply retained.

Otherwise, ½ (y(t) − ŷ) = ½ (y(t) − (−y(t))) = y(t), and the weight vector is updated in a direction that reduces its mistake on the current example:

⟨w(t + 1), x(t)⟩ = ⟨w(t) + ½ (y(t) − ŷ) x(t), x(t)⟩
                 = ⟨w(t) + y(t) x(t), x(t)⟩
                 = ⟨w(t), x(t)⟩ + y(t) · ⟨x(t), x(t)⟩ ,

where ⟨w(t), x(t)⟩ is the old score and ⟨x(t), x(t)⟩ ≥ 0.

Since the model only changes in the case of a mistake, PLA is called an error-driven method.


The perceptron learning algorithm

As a remarkable property of PLA, one can prove the following: If the data D is linearly
separable, which means there exists a perceptron (a weight vector w ∗ ) that classifies all
examples correctly, then PLA will only make a finite number of updates (i.e., it will
eventually reach a state in which no more mistakes are made on D).

Theorem: Assume there exists a parameter vector w* such that ∥w*∥ = 1, and some γ > 0 such that

yi ⟨w*, xi⟩ ≥ γ

for all i ∈ [N]. Moreover, suppose ∥xi∥ ≤ R for all i ∈ [N]. Then the perceptron algorithm (with w(0) = 0) makes at most R²/γ² mistakes.


The pocket algorithm

In the case of data that is not linearly separable, PLA is not guaranteed to converge — in
this case, it is difficult to say anything about the performance of the algorithm.
A simple extension of PLA is the pocket algorithm.
It applies the PLA update rule but avoids any deterioration by memorizing the vector ŵ
with the lowest empirical error so far; or, stated differently, it updates the current weight
vector only if it indeed reduces the number of mistakes on the training data.
Thus, in contrast to the original PLA, it needs to compute the empirical error on the
entire training data in each iteration, which makes it rather slow.



The pocket algorithm

set ŵ to w(0) of PLA
for t = 0, 1, …, T − 1 do
    run PLA for one update to obtain w(t + 1)
    compute Ein(w(t + 1)) on D
    if Ein(w(t + 1)) < Ein(ŵ) then
        ŵ ← w(t + 1)
    end if
end for
return ŵ
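The same loop can be written concretely in NumPy; this sketch (names assumed) performs PLA updates but returns the weight vector with the lowest empirical error encountered:

    import numpy as np

    def pocket(X, y, T=1000, seed=0):
        rng = np.random.default_rng(seed)
        def e_in(w):                                        # empirical 0/1 error on D
            return np.mean(np.where(X @ w > 0, 1.0, -1.0) != y)
        w = np.zeros(X.shape[1])
        w_hat, best = w.copy(), e_in(w)
        for _ in range(T):
            i = rng.integers(len(y))
            y_hat = 1.0 if w @ X[i] > 0 else -1.0
            w = w + 0.5 * (y[i] - y_hat) * X[i]             # one PLA update
            err = e_in(w)                                   # full pass over D; this is what makes pocket slow
            if err < best:
                w_hat, best = w.copy(), err                 # keep the best vector "in the pocket"
        return w_hat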



The nearest neighbour method

A very basic machine learning (pattern recognition) approach is the nearest neighbour
or, more generally, k-nearest neighbour (kNN) method.
Instead of inducing a model from the training data D, the kNN method simply stores the
data itself (which qualifies it as a “lazy” learning method).
If a prediction for a new query instance x ∈ X is requested, kNN first retrieves the k nearest neighbours of x from D, namely the examples

(x_{i(1)}, y_{i(1)}), (x_{i(2)}, y_{i(2)}), …, (x_{i(k)}, y_{i(k)})

associated with those instances x_{i(1)}, …, x_{i(k)} having the smallest distances from x.

Obviously, this step assumes the instance space X to be equipped with a suitable distance measure.




The nearest neighbour method

Outcomes y_{i(1)}, …, y_{i(k)} are then combined into a prediction for x. The aggregation operator depends on the structure of Y.

In classification, aggregation is typically done via majority voting:

ŷ = argmax_{y ∈ Y} Σ_{j=1}^k ⟦y_{i(j)} = y⟧

Instead of predicting a single class label, probabilities on Y can be estimated in terms of relative frequencies (Ŷ = P(Y) = set of probability distributions on Y).

In regression, the outcomes are reasonably combined via (arithmetic) averaging:

ŷ = (1/k) Σ_{j=1}^k y_{i(j)}
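A compact NumPy sketch of both aggregation rules, assuming Euclidean distance (the function and parameter names are assumptions):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3, classify=True):
        dist = np.linalg.norm(X_train - x, axis=1)    # distances to all stored examples
        nn = np.argsort(dist)[:k]                     # indices i(1), ..., i(k)
        if classify:
            labels, counts = np.unique(y_train[nn], return_counts=True)
            return labels[np.argmax(counts)]          # majority vote
        return y_train[nn].mean()                     # arithmetic average for regression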



Nearest neighbor regression

kNN regression for k = 1, 3, 5, 7



The nearest neighbour method

kNN is a nonparametric method: unlike the linear predictor, it is not determined by a fixed number of parameter values (weights), and the complexity of the predictor increases with increasing sample size.

The basic kNN method has been extended in various ways; for example, in weighted kNN, closer instances have a higher influence on the aggregation than more remote ones.

The idea of case base editing is to store only a subset of the observations, removing those that may deteriorate rather than improve predictions.

Also note that the neighbourhood size k as well as the distance measure can be seen as parameters of the method; instead of predefining them, they can also be learned and adapted to the data (e.g., metric learning).


Weighted nearest neighbor regression

Weighted kNN regression for k = 1, 3, 5, 7 (weights ∝ exp(−2 distance))



Structure of the lecture

1. Introduction
2. Two simple methods
3. The general setting
4. Model induction and generalisation
5. Regularisation and validation
6. Nonlinear models
7. Model dichotomisation and ensembles
8. Semi-supervised learning
9. Reinforcement learning
A. Probability



Noisy targets

The dependency between instances and outcomes is commonly assumed to be non-deterministic, or, more specifically, probabilistic:

Y | x ∼ P ,

i.e., the outcome for a given instance x is considered as a random variable with (conditional) distribution P_{Y,x}(·) = P(· | x) on Y.


Noisy targets

Likewise, the occurrence of instances is modelled by a probability distribution P = P_X on X, so that the joint probability is given as follows:

P(x, y) = P(x) · P(y | x)

Depending on the underlying domains, the measures P can be defined in terms of probability mass or density functions p.


Loss function

The quality of predictions is generally quantified in terms of an error measure or loss function

e : Y × Y −→ R+   resp.   e : Y × Ŷ −→ R+

that specifies a penalty e(y, ŷ) for a prediction ŷ in light of a true outcome y. That function is part of the specification of the learning problem.

Examples include the simple 0/1 loss

e(y, ŷ) = ⟦y ≠ ŷ⟧

often used in classification, and the squared loss

e(y, ŷ) = (y − ŷ)²

commonly used in regression.


Extended setup of the learning problem



What is an optimal predictor?

In the case of a noisy target, a perfect predictor with no mistakes cannot exist.
Then, two questions naturally arise:
▶ First, what does the theoretically optimal target f now look like?
▶ Second, how to evaluate candidate hypotheses h ∈ H?
The latter question is important to decide on the optimality of a hypothesis in case f ̸∈ H.
To answer these questions, we combine the concept of a loss function with the
probabilistic modelling of input/output pairs as introduced above.



Expected loss

First, suppose an instance x to be fixed. What is the best prediction ŷ ∈ Y (or, more generally, ŷ ∈ Ŷ) the learner can make?

To answer this question, we adopt expected loss (error) minimisation as a rational decision principle.

For each prediction ŷ ∈ Y, the expected loss is given by

E_{y∼P(·|x)}[e(y, ŷ) | x] = ∫_Y e(y, ŷ) dP(y | x) .

In the case of a discrete output space (like in classification), this expression simplifies to a weighted sum:

E_{y∼P(·|x)}[e(y, ŷ) | x] = Σ_{y∈Y} e(y, ŷ) p(y | x) .


Expected loss

Example: For the case of classification with the 0/1 loss e(y, ŷ) = ⟦y ≠ ŷ⟧,

E[e(y, ŷ) | x] = Σ_{y∈Y} e(y, ŷ) p(y | x)
              = 0 · p(ŷ | x) + Σ_{y≠ŷ} 1 · p(y | x)
              = P(y ≠ ŷ | x)
              = 1 − p(ŷ | x) .

Thus, the expected loss is minimised by a mode of the distribution p(· | x), i.e., an outcome for which the probability is highest.


Expected loss

Ordinal scales (such as {win, tie, loss}) are often encoded in terms of natural numbers {1, 2, …, K}, and the absolute difference |y − ŷ| is adopted as an error function (L1-loss). This can be criticised, because this embedding assumes equidistance.

Suppose p(1 | x) = 0.4, p(2 | x) = 0.3, p(3 | x) = 0.3. Then

E[e(y, ŷ = 1) | x] = 0.4 · 0 + 0.3 · 1 + 0.3 · 2 = 0.9
E[e(y, ŷ = 2) | x] = 0.4 · 1 + 0.3 · 0 + 0.3 · 1 = 0.7
E[e(y, ŷ = 3) | x] = 0.4 · 2 + 0.3 · 1 + 0.3 · 0 = 1.1

More generally, one can prove that the L1-loss is minimised (in expectation) by a median of the distribution.


Expected loss

In the most general case, the loss is explicitly specified for each combination of prediction ŷ and observation y:

e(y, ŷ)   ŷ = y1   ŷ = y2   ŷ = y3
y = y1       0        2        5
y = y2       1        0        1
y = y3       3        2        0

Suppose p(y1 | x) = 0.2, p(y2 | x) = 0.5, p(y3 | x) = 0.3. Then

E[e(y, ŷ = y1) | x] = 0.2 · 0 + 0.5 · 1 + 0.3 · 3 = 1.4
E[e(y, ŷ = y2) | x] = 0.2 · 2 + 0.5 · 0 + 0.3 · 2 = 1.0
E[e(y, ŷ = y3) | x] = 0.2 · 5 + 0.5 · 1 + 0.3 · 0 = 1.5
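These numbers can be checked mechanically; a small NumPy sketch (the array names are assumptions):

    import numpy as np

    L = np.array([[0, 2, 5],     # losses e(y1, .)
                  [1, 0, 1],     # losses e(y2, .)
                  [3, 2, 0]])    # losses e(y3, .)
    p = np.array([0.2, 0.5, 0.3])

    expected = p @ L             # expected loss of each candidate prediction
    print(expected)              # [1.4 1.  1.5]
    print(expected.argmin())     # 1, i.e. predicting y2 minimises the expected loss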



Expected loss

In the case where the prediction space does not coincide with Y, a loss needs to be specified for each combination of prediction ŷ ∈ Ŷ and observation y ∈ Y.

For example, suppose the learner is allowed to predict a subset of candidate outcomes, i.e., Ŷ = 2^Y; the loss function may then look as follows:

e(y, ŷ)   {y1}   {y2}   {y3}   {y1,y2}   {y1,y3}   {y2,y3}   {y1,y2,y3}
y = y1       0      4      4        1         4         4            2
y = y2       5      0      5        1         1         5            2
y = y3       3      3      0        3         3         1            2

As before, given a probability distribution on Y, the expected loss can be computed for each ŷ ∈ Ŷ, and the prediction minimising this expectation can be found.


Bayes predictor

The (pointwise) Bayes predictor makes optimal decisions in every situation, i.e., for every instance x ∈ X; thus, it maps instances to expected loss minimisers:

f : x ↦ argmin_{ŷ ∈ Ŷ} E_{y∼P(·|x)}[e(y, ŷ)]

Note that, in general, the Bayes predictor is only a theoretical construct: it cannot be computed, because P is unknown.

Moreover, being defined in a pointwise manner, it is not necessarily an element of the hypothesis space H.


Bayes predictor

Example: Let X = {1, …, 10}, Y = {0, 1}, P(x) ≡ 1/10, and P(· | x) as follows:

x              1    2    3    4    5    6    7    8    9    10
P(y = 1 | x)   0   0.1  0.3  0.6  0.8  0.4  0.8  0.6  0.3    0

Then the (pointwise) Bayes predictor for the 0/1 loss is given as follows:

x       1   2   3   4   5   6   7   8   9   10
f(x)    0   0   0   1   1   0   1   1   0    0

This predictor is not an element of the hypothesis space H = {h_{a,b} | a, b ∈ R} comprised of "interval predictors"

h_{a,b}(x) = 1 if a ≤ x ≤ b, and 0 otherwise.


Generalisation performance, out-of-sample error

In general, the best we can hope for is finding some g ≈ f. But how do we assess the quality of a hypothesis h (in comparison to f)?

Following the same idea as before, namely, weighing the possible prediction errors of h with their probability of occurrence, we define the generalisation performance of a hypothesis in terms of the risk or out-of-sample error as follows:

Eout(h) = E_{(x,y)∼P}[e(y, h(x))] = ∫_{X×Y} e(y, h(x)) dP(x, y)

An optimal hypothesis is then obtained as a risk minimiser:

g ∈ argmin_{h∈H} Eout(h)

In addition to generalisation performance, other criteria can be considered: complexity, fairness, privacy, reliability, robustness, causality, usability, trust, etc.


Empirical risk minimisation

Again, note that the out-of-sample error Eout(h) is a theoretical measure that cannot be computed practically.

What can be computed instead is the in-sample error of h, also called empirical risk:

Ein(h) = (1/N) Σ_{i=1}^N e(yi, h(xi))

The principle of empirical risk minimisation (ERM) suggests finding a hypothesis with minimal empirical (instead of true) risk:

g ∈ argmin_{h∈H} (1/N) Σ_{i=1}^N e(yi, h(xi))

Does a low Ein(h) imply a low Eout(h)? Under which conditions will ERM be successful? These are fundamental questions calling for a theory of generalisation.
Structure of the lecture

1. Introduction
2. Two simple methods
3. The general setting
4. Model induction and generalisation
5. Regularisation and validation
6. Nonlinear models
7. Model dichotomisation and ensembles
8. Semi-supervised learning
9. Reinforcement learning
A. Probability



Evaluating a hypothesis

Fix a hypothesis h ∈ H.

To evaluate this hypothesis, we can draw a sample D and compute the in-sample error, which will provide us an unbiased estimate of the generalisation error Eout(h).

Exploiting concentration properties of the mean of a random variable U, assuring that the sample mean ν = (1/N) Σ_{i=1}^N Ui concentrates around the expected value µ = E(U) with high probability, one can even derive probabilistic bounds on the estimation error.

For example, the Hoeffding inequality states that

P(|ν − µ| > ϵ) ≤ 2 exp(−2ϵ²N)

for any random variable with support in [0, 1].
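A small Monte Carlo sketch, assuming NumPy, comparing the empirical deviation probability of a Bernoulli sample mean with the Hoeffding bound:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, N, eps, runs = 0.3, 100, 0.1, 100_000
    nu = rng.binomial(N, mu, size=runs) / N    # sample means of N Bernoulli(mu) draws
    freq = np.mean(np.abs(nu - mu) > eps)      # empirical P(|nu - mu| > eps)
    print(freq)                                # roughly 0.03
    print(2 * np.exp(-2 * eps**2 * N))         # Hoeffding bound, roughly 0.27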



Evaluating a hypothesis

In our case, this translates into the bound

P(|Ein − Eout| > ϵ) ≤ 2 exp(−2ϵ²N) .

Thus, given a sufficient sample size N = |D|, the out-of-sample error of h can be estimated very precisely.

Important prerequisites of the unbiasedness of the estimate and the validity of the probabilistic bound:
▶ First, D is a so-called i.i.d. sample, namely, the examples (xi, yi) are independent and identically distributed.
▶ Second, the hypothesis h is fixed beforehand, that is, it does not depend on any properties of the data D.


Urn model

If the loss function e is the simple 0/1 loss

e(y, ŷ) = ⟦y ≠ ŷ⟧ ,

the estimation of µ = Eout via ν = Ein can be illustrated by an urn experiment.

Urn with marbles = instances x ∈ X. A marble is green if the corresponding instance is classified correctly by h, otherwise red (we assume a deterministic dependency between instances and labels).

Thus, the fraction of red marbles in the urn is µ = Eout.

The in-sample error ν = Ein can then be considered as the fraction of red marbles when sampling N times from the urn with replacement.


Urn model

Ein(h)   0         1/4       1/2      3/4      1
Prob.    256/625   256/625   96/625   16/625   1/625

(Here the sample size is N = 4 and the fraction of red marbles is µ = 1/5.)


Selecting an optimal hypothesis

Unlike the case of hypothesis evaluation discussed above, a single h is not fixed
beforehand when learning from data.
Instead, learning is concerned with hypothesis optimisation: a presumably optimal
hypothesis g is selected from a set of candidates, specified by the hypothesis space H.
Since optimality refers to performance on the training data D, the hypothesis g is no
longer independent of the data.
Therefore, we cannot expect Ein(g) to be an unbiased estimate of Eout(g): we have E_{D∼P^N}[Ein(h)] = Eout(h) for fixed h, but

E_{D∼P^N}[ min_{h∈H} Ein(h) ] ≠ Eout( argmin_{h∈H} Ein(h) )

More specifically, since there is a tendency to select a g that performs especially well on the training data, Ein(g) is likely to underestimate Eout(g).
Examples

Consider a case of binary classification, and suppose Eout(h) = 1/2 for all h ∈ H: every hypothesis is a pure random guesser, yet the minimum of Ein(h) over a large H will typically lie well below 1/2, simply by chance.

As another example, consider the case where H is extremely rich and consists of all mappings X −→ Y: then a hypothesis with Ein(g) = 0 can always be found, no matter how large Eout(g) is.


Theorem (VC inequality, VC generalisation bound)

Let any hypothesis space H be given, and let N denote the size of the training data. The following holds for all ϵ > 0:

P( |Ein(g) − Eout(g)| > ϵ ) ≤ 4 mH(2N) exp(−ϵ²N/8)

Or, for any tolerance δ > 0,

Eout(g) ≤ Ein(g) + √( (8/N) log(4 mH(2N)/δ) )

with probability at least 1 − δ; the square-root term is denoted Ω(mH, N, δ).

Here, mH denotes the so-called growth function, which measures the capacity (complexity, flexibility) of the hypothesis space H.


Complexity of the hypothesis space

As suggested by these inequalities, successful learning involves two subproblems:
▶ make sure that Ein(g) ≈ 0,
▶ make sure that Eout(g) ≈ Ein(g), i.e., Ω(mH, N, δ) ≈ 0.

The first requirement calls for a rich hypothesis space, which guarantees the existence of a g with small training error.

The second requirement calls for a (small) hypothesis space of low complexity (small Ω(mH, N, δ)).

The complexity of hypothesis spaces H and, connected to this, the choice of the right complexity, are central themes of machine learning.


Dichotomies

Consider a set of instances X = {x1, …, xN} ⊂ X.

A dichotomy of X is a separation of X into positive and negative instances, i.e., an assignment of a label yi ∈ {−1, +1} to each xi.

A hypothesis h ∈ H ⊆ 2^X induces a dichotomy on X:

(y1, …, yN) = (h(x1), …, h(xN))

Let

H(x1, …, xN) = { (h(x1), …, h(xN)) | h ∈ H } .

Regardless of the size |H| (which might be infinite),

|H(x1, …, xN)| ≤ 2^N .


The growth function

The growth function characterises the flexibility of a hypothesis space. It counts the maximum number of dichotomies on N points:

mH(N) = max_{x1,…,xN ∈ X} |H(x1, …, xN)|

Examples: linear models, positive rays, positive intervals, convex sets.

The break point for a hypothesis space H is the smallest value k ∈ N such that

mH(k) < 2^k .

If there is no such value, then H does not have a break point (or k = ∞).


VC Dimension

The Vapnik-Chervonenkis dimension dVC(H) of a hypothesis set H is the largest value N for which mH(N) = 2^N, i.e., the largest number of points that can be shattered by H.

If mH(N) = 2^N for all N ∈ N, then dVC(H) = ∞.

Example: The VC dimension of the perceptron (class of linear classifiers H) in R^d is d + 1.

Theorem: If mH(k) < 2^k for some value k, then

mH(N) ≤ Σ_{i=0}^{k−1} (N choose i) = O(N^{k−1}) .

Important implication: If H has finite VC dimension, then the asymptotic growth of mH is only polynomial in N, and Ω(mH, N, δ) → 0 as N → ∞.


Penalty for model complexity

Note that, since mH(N) ≤ N^dVC + 1, the following bound is also valid:

Eout(g) ≤ Ein(g) + √( (8/N) log( 4((2N)^dVC + 1)/δ ) ) = Ein(g) + Ω′(dVC, N, δ)

Interpretation: The in-sample error is corrected by a term that depends on the model complexity, i.e., the complexity of the hypothesis space H; the more complex H, the smaller the in-sample error, but the higher the correction.


Penalty for model complexity

Figure (error versus VC dimension): dependency of errors on model complexity; the in-sample error decreases with growing VC dimension, the model-complexity penalty increases, and the out-of-sample error is minimised at an intermediate complexity.


Approximation-generalisation tradeoff

Analysis so far: H needs to achieve the right balance between approximating the training data and generalising to new data.

The generalisation bound confirms that, to guarantee a strong generalisation performance,
▶ H needs to be sufficiently flexible, as otherwise Ein(g) cannot be small;
▶ the flexibility of H should be limited, since otherwise there is a danger of poor generalisation due to overfitting (the correction term Ω(dVC, N) will be large).

Also note that, since Ω(dVC, N) depends on both the VC dimension of H and the size of the training data simultaneously, the choice of H needs to be adapted to N. In other words, different N may call for different hypothesis spaces.


Bias-variance decomposition

Figures: a target function f, a hypothesis space H ⊂ F, and a learned function g. A small hypothesis space leads to high bias and low variance; a rich hypothesis space leads to low bias and high variance.


Bias-variance decomposition

To get a second view on the approximation-generalisation tradeoff, we decompose the out-of-sample error into bias and variance.

This decomposition works best for the squared error loss (in regression), though it can be generalised to other loss functions.

Out-of-sample error for the squared error loss:

Eout(g^(D)) = Ex[ (g^(D)(x) − f(x))² ] .


Bias-variance decomposition

Taking the expectation over the data sample D, we obtain

ED[Eout(g^(D))] = ED[ Ex[ (g^(D)(x) − f(x))² ] ]
= Ex[ ED[ (g^(D)(x) − f(x))² ] ]
= Ex[ ED[g^(D)(x)²] − 2 ED[g^(D)(x)] f(x) + f(x)² ]
= Ex[ ED[g^(D)(x)²] − 2 ḡ(x) f(x) + f(x)² ]
= Ex[ ED[g^(D)(x)²] − 2 ḡ(x)² + ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)² ]
= Ex[ ED[g^(D)(x)²] − 2 ED[g^(D)(x)] ḡ(x) + ḡ(x)² + (ḡ(x) − f(x))² ]
= Ex[ ED[ (g^(D)(x) − ḡ(x))² ] + (ḡ(x) − f(x))² ] ,

where the first term is var(x) and the second term is bias(x).


Bias-variance decomposition

The function

ḡ(x) = ED[g^(D)(x)]

can be seen as a kind of average function learned from the data.

The bias-variance decomposition of the out-of-sample error is finally obtained by averaging over instances:

ED[Eout(g^(D))] = Ex[bias(x) + var(x)] = bias + var
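A simulation sketch of the decomposition, assuming NumPy: a constant hypothesis g^(D)(x) = mean of the observed outcomes is fit to samples of size 5 from the noise-free target f(x) = x on [0, 1]; analytically, bias = 1/12 and var = 1/60 in this setup:

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: x
    xs = np.linspace(0.0, 1.0, 201)         # grid for the expectation over x
    G = np.empty((5000, xs.size))
    for r in range(5000):                   # many training sets D of size N = 5
        xD = rng.uniform(0.0, 1.0, size=5)
        G[r] = f(xD).mean()                 # constant hypothesis: g^(D)(x) = mean of the y_i
    g_bar = G.mean(axis=0)                  # average function g-bar(x)
    bias = ((g_bar - f(xs)) ** 2).mean()    # E_x[bias(x)]
    var = G.var(axis=0).mean()              # E_x[var(x)]
    print(bias, var)                        # approx. 0.083 and 0.017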



Bias-variance decomposition

Target f(x) = x

Hypotheses of the form h_{t,a,b}(x) = a if x ≤ t, and b if x > t

Left: Example of a sample D and hypothesis g^(D); right: ḡ ± std


Noisy data

The above derivation assumes error-free data, i.e., the variability is entirely due to the randomness of the training data.

If the observed data is noisy, namely,

y = f(x) + ϵ ,

where ϵ is an error term with mean value 0 and standard deviation σ, the decomposition generalises to

ED[Eout(g^(D))] = Ex[bias(x) + var(x)] + σ² .


Remarks

Notice that the VC analysis does not depend on the learner A, only on H and N.
The bias-variance decomposition, on the other hand, does depend on A, because the
algorithm has an influence on which hypothesis g (D) is chosen for a given set of training
data D.
Thus, changing A will change ḡ, and therefore bias and var.
Since bias and variance cannot be computed in practice, the bias-variance decomposition
is only a conceptual tool.
To improve generalisation performance, it generally suggests two options:
(i) reducing variance without significantly increasing the bias, and
(ii) reducing bias without significantly increasing the variance.



Learning curve

Learning curves characterise the (expected) performance of an algorithm as a function of the sample size:

N ↦ ED[Ein(g^(D))]   and   N ↦ ED[Eout(g^(D))] .

Figure (expected error versus number of data points): typical learning curves for a simple (left) and a complex model (right).


Structure of the lecture

1. Introduction
2. Two simple methods
3. The general setting
4. Model induction and generalisation
5. Regularisation and validation
6. Nonlinear models
7. Model dichotomisation and ensembles
8. Semi-supervised learning
9. Reinforcement learning
A. Probability



Overfitting

Roughly speaking, overfitting means fitting the training data more than warranted, thereby generalising poorly beyond that data.

A clear indicator of overfitting is a low in-sample error together with a comparatively high out-of-sample error.

Typically, overfitting is caused by learning with a model class that is too complex. But what does "too complex" mean?

What truly matters is that the model complexity matches not the target function, but the quality and quantity of the training data.


Overfitting

Figure (input versus output): overfitting and underfitting both lead to poor generalisation.


Overfitting

Figure (expected error versus number of data points): in-sample and out-of-sample error of a simple (solid lines) versus a complex model (dashed lines).


Noise

A major source of overfitting is noise in the data.

Obviously, the learner should not attempt to fit the noisy part in the data. But how to
distinguish signal from noise?

Types of noise: random (stochastic) and deterministic.

Deterministic noise is that part of the target function that cannot be modeled, i.e., that
cannot be captured by the underlying model class.

In the bias-variance decomposition

ED(Eout) = σ² + bias + variance ,

stochastic noise is connected to σ², while deterministic noise is connected to the bias.


Overfitting

Example of deterministic noise.



Examples

Fitting a polynomial of degree 2 versus a polynomial of degree 10 to
(i) data with random noise, e.g., noisy observations of a polynomial of degree 10, and
(ii) data with deterministic noise (highly complex target), e.g., a polynomial of degree 50.
In both cases, the simpler model may lead to better generalisation.

Majority classifier.


Examples

Fitting data produced by f(x) = 1.4x⁴ − 0.3x³ + x² + 2x − 1 + Gaussian noise with a polynomial of degree 1 (above) and 4 (below), where the training data has size 10 (left) and 100 (right), respectively.


Regularisation

The notion of regularisation can be motivated by recalling the VC bound, telling us that

Eout(h) ≤ Ein(h) + Ω(H)

for all h ∈ H.

Instead of fixing a sufficiently simple hypothesis space H from the beginning, one may also start with a more complex one but prefer simple hypotheses in that space over more complex ones.

Formally, this can be accomplished by minimising a combination of Ein(h) and Ω(h), where Ω(h) is a measure of the complexity of the individual hypothesis h.


Constraints

The problem of selecting the "right" hypothesis space for the problem at hand is often tackled by considering a family of spaces

H0 ⊂ H1 ⊂ H2 ⊂ · · · ⊂ HK = H

of increasing complexity. Formally, this can be done by representing each of these spaces as a constrained version of H.

For example, suppose Hk is the space of polynomials of degree k. Then

Hk = { x ↦ Σ_{i=0}^K wi · x^i | w_{k+1} = … = w_K = 0 } .


Soft constraints

As an alternative to hard constraints of that kind, one may also use soft constraints that prevent weights from getting too large while not forcing them to vanish.

More specifically, consider the case of linear models (perhaps in some feature space), and let

H(C) = { h | h(x) = w⊤Φ(x), w⊤w ≤ C }

for a constant C ≥ 0.

Obviously, when C1 < C2, then

H(C1) ⊂ H(C2)   and therefore   dVC(H(C1)) ≤ dVC(H(C2)) .


Constrained optimisation

Using soft constraints, the learning problem essentially becomes a constrained optimisation problem:

w_reg = argmin_w Ein(w)   s.t.   w⊤w ≤ C     (1)

Let w_un denote the unconstrained solution, i.e., the empirical risk minimiser

w_un = argmin_w Ein(w) .

If w_un⊤ w_un ≤ C, then w_un ∈ H(C) and w_reg = w_un. Otherwise, Ein needs to be minimised under the equality constraint

w⊤w = C .


Constrained optimisation

From optimisation theory (constrained optimisation, Karush-Kuhn-Tucker conditions), it is known that solving the constrained problem (1) is equivalent to finding the unconstrained minimiser of Ein(w) + λw⊤w for some λ ≥ 0 that depends on C (or vice versa).

In other words, denoting by λ_C the λ that corresponds to C, the constrained solution

w_reg = argmin_w Ein(w)   s.t.   w⊤w ≤ C

is the same as the compromise solution

w_reg = argmin_w Ein(w) + λ_C w⊤w .

Example:

w_reg = argmin_{(w0,w1)} (1/N) Σ_{i=1}^N (w0 + w1 xi − yi)²   s.t.   (w1)² ≤ C


Constrained optimisation

Minimising (w − 5)² s.t. w² ≤ 4 is equivalent to minimising (w − 5)² + 1.5 · w²:

Starting with λ = 0 and w_reg = w_un = 5, increasing λ "moves" w_reg toward the feasible region (shaded in red) and hits the boundary at λ = 1.5.


Augmented error

General approach: Minimise the augmented error

Eaug(h, λ, Ω) = Ein(h) + (λ/N) Ω(h) ,

where the last term is called the penalty term and λ is the regularisation parameter. What is crucial here is the right choice of λ and Ω.

The special case Ω(w) = w⊤w is also called weight decay.

Of course, other types of regularisation can be used, for example

Σ_{q=0}^Q γq wq² ≤ C

with coefficients γq ∈ R.
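For the weight-decay case Ω(w) = w⊤w with squared error loss, the augmented error again has a closed-form minimiser, w_reg = (X⊤X + λI)⁻¹X⊤y; a NumPy sketch (simplifying in that the bias weight w0 is regularised as well):

    import numpy as np

    def ridge(X, y, lam):
        d = X.shape[1]
        # minimiser of (1/N)||Xw - y||^2 + (lam/N) w^T w
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)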



Testing and Validation

The augmented error as discussed above can be seen as a proxy of the out-of-sample error, which any machine learning method seeks to minimise:

Eout(h) ≈ Ein(h) + overfit penalty ,

where regularisation is just estimating the penalty term in terms of (λ/N) Ω(h).

In other words, regularised learning can be seen as minimising an estimate of the out-of-sample error.

Testing and validation, to be discussed next, tackle the problem of estimating the out-of-sample error in a more direct way.


Training and test data

We are mainly interested in how a model trained on data D will generalise, i.e., perform
on new data encountered in the future:

As we cannot anticipate the future, we try to mimic it:



Test data and test error

To estimate the generalisation performance of a predictor g ∈ H, it must be evaluated out of sample; therefore, suppose the original data D to be split into
▶ a training set Dtrain of size N − K
▶ and a test set Dtest of size K.

Let g− ∈ H be the hypothesis obtained by the learner on Dtrain.

The test error of g− is then computed as follows:

Etest(g−) = (1/K) Σ_{(x_n, y_n) ∈ Dtest} e(g−(x_n), y_n) .

One can easily show that this is an unbiased estimate of Eout(g−), i.e.,

E_{Dtest}[Etest(g−)] = Eout(g−) .


Example

Left: Data is generated by f(x) = 2.9x⁴ − 1.4x² + 1.1x + 0.7 + σ, where σ is normally distributed noise with standard deviation 0.3. Right: Predictor (polynomials of degree 3) learned on training data (green) and evaluated on test data (red).


Choice of K

The size of the test set needs to be chosen with care.

The larger K , the tighter the estimate of Eout (g − ) by Etest (g − ).

At the same time, however, Etest (g − ) itself will increase (in expectation), because the size
of the training data, N − K , is reduced; the variance of g − will also increase.

If K is small, and hence N − K large, then Eout (g − ) will be smaller in expectation (and
with smaller variance), however, its estimate by Etest (g − ) will be unreliable.

Thus, the optimal K will be somewhere in-between.



Choice of K
 
The learning curve depicts the expected loss E_{D_N}[Eout(g)], i.e., the expected loss of a predictor g learned on a (randomly generated) data set D_N of size N.


Cross-validation

Testing relies on the following chain of reasoning:

Eout(g) ≈ Eout(g−) ≈ Etest(g−)

The dilemma is that the first (approximate) equality calls for a small K, whereas the second one calls for a large K.

Cross-validation (CV) is a technique that makes better use of the data, thereby tightening the ≈ relations; in particular, it allows for deriving more reliable estimates of Eout(g−) for a given K.

The basic idea is to repeat the train/test split L = N/K times, each time using a different portion of K data points for testing, and then averaging the results.


Cross-validation

In general, L-fold CV splits the data into L folds

D1, …, DL

of size K ≈ N/L and trains L hypotheses

g1−, …, gL− .

Each hypothesis gl− is obtained on D \ Dl (all folds except the l-th one), while Dl is used to get Etest(gl−); the final cross-validated estimate is then given by

Ecv = (1/L) Σ_{l=1}^L Etest(gl−) .
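A generic NumPy sketch of this procedure (the fit/loss interface is an assumption; fit is expected to return a callable predictor):

    import numpy as np

    def cross_validate(X, y, fit, loss, L=10, seed=0):
        idx = np.random.default_rng(seed).permutation(len(y))
        folds = np.array_split(idx, L)                     # folds D_1, ..., D_L
        errors = []
        for l in range(L):
            test = folds[l]
            train = np.concatenate([folds[j] for j in range(L) if j != l])
            g_l = fit(X[train], y[train])                  # g_l^- trained on D \ D_l
            errors.append(loss(y[test], g_l(X[test])))     # E_test(g_l^-)
        return np.mean(errors)                             # E_cv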



Cross-validation

Schematic representation of L-fold cross-validation for L = 5:

The special case of K = 1 (L = N) is called leave-one-out cross-validation; in practice, 10-fold cross-validation is often used.


Example

Data is generated by f(x) = 2.9x⁴ − 1.4x² + 1.1x + 0.7 + σ, where σ is normally distributed noise with standard deviation 0.3 (upper left); the other pictures show the functions (polynomials of degree 4) fitted in a 5-fold CV.
Cross-validation

Note that ECV obtained by averaging does not estimate the generalisation performance of
a concrete predictor; instead, it estimates a property of the learner, namely, the
(expected) out-of-sample error of a hypothesis trained on a sample of size N − K .

ECV is again an unbiased estimate for this expected error.

Thanks to the averaging, it is more reliable than an estimate obtained from a single
validation set, which is reflected by a smaller variance.

The true variance of the estimate is difficult to compute, because the different training
sets are highly overlapping, and hence the Etest (gl− ) are not independent of each other.



Restoring D

The hypotheses gl− are merely used for estimating generalisation performance, but of
course, the learner is not obliged to adopt any of these in the end.

Instead, it makes sense to retrain and learn a hypothesis g on the entire data D.

Based on the discussion of learning curves, we expect Eout (g) ≤ Eout (g − ).

Thus, a deterioration of Eout due to a large K can be avoided.

Still, a large K increases the discrepancy between Eout (g) and Eout (g − ).



Model selection

So far, we (implicitly) assumed the learning model (learning algorithm A, hypothesis space H) to be fixed.

Normally, however, the learner has various degrees of freedom, which it tries to leverage for optimising the (result of the) learning process.

This includes parameters of the algorithm A (such as the neighbourhood size k or the regularisation parameter λ), which are called hyperparameters in this context (to distinguish them from parameters of hypotheses h).

Also included are choices regarding the hypothesis space, e.g., the choice of one among a set of candidate spaces H1, …, HM (e.g., the degree of the polynomial in regression).

Let us denote the learner's decision space by Θ; every θ ∈ Θ instantiates a concrete learning model.


Model selection

The choice of an instantiation θ is also referred to as model selection.

How can the learner select an optimal model, leading to a g with the lowest (expected) error?

Naïvely, the learner may try each θ (at least if Θ is finite), compute the test error as discussed above, and pick the instantiation θ* with the lowest error.

However, such a post-hoc selection will obviously produce a bias, and Etest(g_{θ*}) will no longer be an unbiased estimate of Eout(g_{θ*}): the data on which Etest is computed was used in the learning process and influenced the final choice of the predictor.

Actually, the learner must commit to a concrete learning model before inducing (and deploying) the final predictor (which also corresponds to the real practical scenario).

This is where validation data comes into play.


Validation versus test data
If the learner wants to make a decision θ, it has to do so on the training data only.

At the same time, to get an unbiased performance estimate, a predictor should not be
evaluated on data it has been trained on.

Consequently, the learner needs to put aside another part of the training data for
validation:
Dval ⊂ Dtrain = D \ Dtest

The validation data is used to assess different options θ and to choose the presumably
best one.

Like in the case of testing, the ideas of averaging over multiple splits of the data
(cross-validation) and of restoring the original data in the end can be applied, leading to a
procedure of nested cross-validation (with an outer loop for testing and an inner loop
for validation).
Regularisation and validation E. Hüllermeier 123/207
Nested cross-validation

Regularisation and validation E. Hüllermeier 124/207
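
For concreteness, a sketch of nested cross-validation in Python, reusing the cross_validation_error helper from above (thetas is a finite set of candidate instantiations; make_fit(θ) returning a fit function is our own interface assumption):

import numpy as np

def nested_cv_error(X, y, thetas, make_fit, predict, L_outer=5, L_inner=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), L_outer)
    errors = []
    for l in range(L_outer):
        test = folds[l]
        train = np.concatenate([folds[m] for m in range(L_outer) if m != l])
        # inner loop: model selection by cross-validation on the training part only
        best = min(thetas, key=lambda th: cross_validation_error(
            X[train], y[train], make_fit(th), predict, L=L_inner))
        model = make_fit(best)(X[train], y[train])   # retrain with the selected theta
        errors.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
    return np.mean(errors)   # a property of the learner, including its model selection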


Example

An inner loop of a nested CV, selection between polynomials of degree 1, 2, 3, 4.

Regularisation and validation E. Hüllermeier 125/207


How to split the data

The question of how to split the data for training and validation is also relevant for the
inner loop of the cross-validation scheme.
In particular, in the case of crossing learning curves, an inappropriate split may lead to
suboptimal model choice.

Regularisation and validation E. Hüllermeier 126/207


Remarks

Validation data is somewhere in-between training and test data as far as the “level of
contamination” is concerned.

It is not as clean as test data (influences decisions made during training) but not as
contaminated as training data (hypotheses are never evaluated on data on which they
have been trained).

Again, the result of a nested CV should be interpreted as a property of the learner A,


namely, the expected performance of a predictor produced by A on data of a certain size
when model selection is done via cross-validation.

Regularisation and validation E. Hüllermeier 127/207


Structure of the lecture

1. Introduction
2. Two simple methods
3. The general setting
4. Model induction and generalisation
5. Regularisation and validation
6. Nonlinear models
7. Model dichotomisation and ensembles
8. Semi-supervised learning
9. Reinforcement learning
A. Probability

Regularisation and validation E. Hüllermeier 128/207


Nonlinear transformation

The core of linear models is the inner product


⟨w, x⟩ = w⊤x = Σ_{i=0}^{d} wi xi    (⋆)

between parameter (weight) vector and feature values.

The term (⋆) is linear in both the feature values xi and the parameters wi .

For the algorithms, however, it is only the latter that really matters, because the
parameters are playing the role of the variables in the corresponding learning
(optimisation) problems, whereas the data (feature values xi ) are treated as constants.

This observation suggests the possibility of nonlinear transformations of the features


without any need for adapting the learning algorithms.

Nonlinear models E. Hüllermeier 129/207


The Z-space

Formally, consider a feature transform in the form of a mapping from the original
instance space to a feature space Z:

Φ : X −→ Z

This function is called feature map or linearisation function, and Z the feature space or
linearisation space. Φ maps every instance to a feature representation:

x = (x0, x1, . . . , xd)⊤ ↦ z = (z0, z1, . . . , zd̃)⊤ = ( ϕ0(x1, . . . , xd), ϕ1(x1, . . . , xd), . . . , ϕd̃(x1, . . . , xd) )⊤

Nonlinear models E. Hüllermeier 130/207


The Z-space

A linear model in the feature space, i.e., a model


w⊤z = w⊤Φ(x) = Σ_{i=0}^{d̃} wi zi ,

corresponds to a (possibly) nonlinear hypothesis



h(x) = h̃(Φ(x))

in the original space X ; we denote the set of such hypotheses by HΦ .


That is, the decision boundary

(w⊤Φ)⁻¹(0) = {x ∈ X | ⟨w, Φ(x)⟩ = 0}

induced in the original space X can be nonlinear.

Nonlinear models E. Hüllermeier 131/207


Generalisation

Estimation of generalisation performance remains valid if the VC dimension d of H is replaced by the VC dimension of HΦ, provided the transform Φ is defined prior to touching the data in any way.

More specifically,

dVC(HΦ) ≤ d̃ + 1 ,

the number of parameters w0, . . . , wd̃ of a linear model in Z.

Note that the image of X under Φ,


Φ(X ) = { z ∈ Z | ∃ x ∈ X : z = Φ(x) } ,

is in general only a subset of Z (original data is embedded in a manifold).

Nonlinear models E. Hüllermeier 132/207


Approximation-generalisation tradeoff

The choice of a proper transform Φ is closely connected to the discussion about the
approximation-generalisation tradeoff.

In general, choosing a more complex Φ (higher-dimensional Z) reduces the bias and


increases the variance.

Or, stated differently, the in-sample error will be reduced, but the VC-dimension will
grow; therefore, the bound on the out-of-sample error may or may not become tighter.

Nonlinear models E. Hüllermeier 133/207


Linear models and learning algorithms

By a linear model or linear learning machine (LLM) we mean a model that is linear in
the parameters to be determined.
Example: For instances x = (x1 , x2 ) ∈ R2 , the (regression) model
y = h(x) = w0 + w1 x1 + w2 x2 + w3 x1² + w4 x2² + w5 x1 x2
is linear in w = (w0 , . . . , w5 )⊤ . It can be written compactly as
y = h(x) = ⟨w, Φ(x)⟩ = w ⊤ Φ(x) = w ⊤ z ,
where

z = Φ(x) = ( ϕ1(x), ϕ2(x), ϕ3(x), ϕ4(x), ϕ5(x), ϕ6(x) )⊤ = ( 1, x1, x2, x1², x2², x1 x2 )⊤ .

Nonlinear models E. Hüllermeier 134/207
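
A minimal sketch of this example in Python: the feature map Φ is applied explicitly, and ordinary least squares is solved in the Z-space (the helper names are ours):

import numpy as np

def phi(X):
    # Feature map Phi: R^2 -> R^6 from the example above
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def fit_linear_in_z(X, y):
    # Ordinary least squares on the transformed data Z = Phi(X)
    w, *_ = np.linalg.lstsq(phi(X), y, rcond=None)
    return w

def predict(w, X):
    return phi(X) @ w   # linear in Z, nonlinear in X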


Linear models and learning algorithms

In the previous example, if the weight vector is given by w = (−6, 0, 0, 2, 4, −3)⊤, then
(w⊤Φ)⁻¹(0) is an ellipse in X = R², i.e., the linear decision boundary {z | ⟨w, z⟩ = 0} in
Z = R⁶ induces a nonlinear decision boundary in X .

Nonlinear models E. Hüllermeier 135/207


Excursus: Kernel-based machine learning

The idea of feature transformation and linearisation is closely connected to kernel-based


machine learning methods, such as support vector machines (SVM).

These methods learn linear models in a feature space (typically of very high dimension),
albeit in an indirect manner, and possibly even without explicitly knowing this space.

This is accomplished by means of a kernel function κ : X 2 −→ R, which is a function


that has a (latent) representation in terms of a linearisation:

κ(x, x′) = ⟨ϕ(x), ϕ(x′)⟩    (⋆)

Kernel trick: If a learning algorithm can be implemented in such a way that it only
operates on inner products ⟨z, z ′ ⟩, but never on individual data points z, then this
algorithm can be run in X instead of Z: Thanks to (⋆), every inner product in Z can be
computed by the kernel function in X .

Nonlinear models E. Hüllermeier 136/207
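
As a small numerical illustration of (⋆), the degree-2 polynomial kernel κ(x, x′) = (1 + ⟨x, x′⟩)² on X = R² coincides with the inner product under an explicit 6-dimensional feature map (a standard textbook example, not specific to this lecture):

import numpy as np

def kernel(x, xp):
    return (1.0 + x @ xp) ** 2   # evaluated directly in X

def phi(x):
    x1, x2 = x
    # explicit linearisation such that kernel(x, x') = <phi(x), phi(x')>
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(kernel(x, xp), phi(x) @ phi(xp))   # identical values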


Artificial neural networks

Artificial neural networks (ANN), which are inspired by the processing of information in
the human brain, constitute another important class of nonlinear models.

These models became quite popular in the 1980s and 1990s, but were then displaced by
kernel-based learning machines.

More recently, they reappeared in the guise of deep neural networks, which are now
considered as state of the art in applications such as image, speech, and natural language
processing.

Here, we only consider the basic version of ANNs, notably the multi-layer perceptron
(there are other lectures specifically dedicated to deep learning).

Nonlinear models E. Hüllermeier 137/207


Multilayer perceptron

A multilayer perceptron (MLP) consists of a sequence of (fully connected) layers of


neurons; we number the input layer by 0, the hidden layers from 1 to L − 1, and the
output layer by L.

Nonlinear models E. Hüllermeier 138/207


Multilayer perceptron

The MLP is parameterised by the weights


w = { w_{i,j}^(l) | 1 ≤ l ≤ L, 0 ≤ i ≤ d^(l−1), 1 ≤ j ≤ d^(l) } ,

where w_{i,j}^(l) is the weight of the connection from the i-th neuron in layer l − 1 to the j-th
neuron in layer l, and d^(l) is the width of the l-th layer (the 0th neuron accounts for a
constant bias of 1).
Inference is accomplished by successively computing
x_j^(l) = θ( s_j^(l) ) = θ( Σ_{i=0}^{d^(l−1)} w_{i,j}^(l) x_i^(l−1) ) ,

where θ is the (nonlinear) activation function.


In this way, an MLP realises a function hw : Rd −→ R.
Nonlinear models E. Hüllermeier 139/207
Information processing in a single neuron

Nonlinear models E. Hüllermeier 140/207


Training by gradient descent

Like for the linear perceptron, training an MLP by minimising (regularised) in-sample
error can be accomplished by gradient descent, i.e., by finding the derivative of the
in-sample error
Ein(w) = Σ_{n=1}^{N} e(h(x_n), yn)

with respect to w, and making a (small) step in negative direction of the gradient:

w ← w − η ∇w Ein (w) .

Instead of processing all training examples before making a gradient step, it is often more
efficient to process smaller batches at a time.
Batches can be sampled at random or processed systematically; an epoch of training
means that all examples have been processed once.

Nonlinear models E. Hüllermeier 141/207
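
A minimal sketch of mini-batch gradient descent in Python (grad_Ein is a placeholder for the gradient computation, e.g., the backpropagation procedure described next):

import numpy as np

def sgd(w, X, y, grad_Ein, eta=0.1, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    for _ in range(epochs):   # one epoch = every example processed once
        for batch in np.array_split(rng.permutation(N), max(1, N // batch_size)):
            w = w - eta * grad_Ein(w, X[batch], y[batch])   # step against the gradient
    return w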


Backpropagation

The backpropagation algorithm is a specific instance of stochastic gradient descent,


i.e., it (randomly) picks individual training examples (x n , yn ) and makes a step into the
negative direction of the gradient of the error function at that point (batch size 1).

What this algorithm needs is the entries of the vector ∇w Ein (w), i.e., the partial
derivatives
∂e(w) / ∂w_{i,j}^(l)

of the error (as a function of w) with respect to the weights w_{i,j}^(l) for all i, j, l.

Nonlinear models E. Hüllermeier 142/207


Backpropagation

Since

∂e(w)/∂w_{i,j}^(l) = ∂e(w)/∂s_j^(l) × ∂s_j^(l)/∂w_{i,j}^(l) ,

and

∂s_j^(l)/∂w_{i,j}^(l) = x_i^(l−1) ,

we only need

δ_j^(l) = ∂e(w)/∂s_j^(l)

for all layers l and neurons j.


Nonlinear models E. Hüllermeier 143/207
Backpropagation

Suppose the activation function to be given by

θ(s) = tanh(s) = ( exp(s) − exp(−s) ) / ( exp(s) + exp(−s) ) .

Note that, as a convenient property of this function,

θ′(s) = dθ(s)/ds = 1 − θ²(s) .
For the final layer (l = L, j = 1), the derivative is then given as follows:

δ_1^(L) = ∂e(w)/∂s_1^(L) = ∂e( θ(s_1^(L)), yn ) / ∂s_1^(L)

Nonlinear models E. Hüllermeier 144/207


Backpropagation

For the layers l = L − 1, L − 2, . . . , 1, we obtain recursively:


δ_i^(l−1) = ∂e(w)/∂s_i^(l−1)
          = Σ_{j=1}^{d^(l)} ∂e(w)/∂s_j^(l) × ∂s_j^(l)/∂x_i^(l−1) × ∂x_i^(l−1)/∂s_i^(l−1)
          = Σ_{j=1}^{d^(l)} δ_j^(l) × w_{i,j}^(l) × θ′( s_i^(l−1) )
          = ( 1 − (x_i^(l−1))² ) · Σ_{j=1}^{d^(l)} δ_j^(l) w_{i,j}^(l)

In this way, the error e( θ(s_1^(L)), yn ) at the output layer is propagated back through the


entire network, layer by layer.


Nonlinear models E. Hüllermeier 145/207
Backpropagation
1: initialize all weights w_{i,j}^(l) at random
2: while stopping condition false do
3:   pick n ∈ {1, . . . , N}
4:   compute all x_j^(l) {forward}
5:   compute all δ_j^(l) {backward}
6:   update the weights: w_{i,j}^(l) ← w_{i,j}^(l) − η x_i^(l−1) δ_j^(l)
7:   test stopping condition
8: end while
9: return final weights w_{i,j}^(l)

Nonlinear models E. Hüllermeier 146/207
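
For concreteness, a compact NumPy sketch of this algorithm for a network with a single hidden layer, tanh activation, identity output, and squared error (batch size 1, as in the pseudocode; layer width and learning rate are arbitrary choices):

import numpy as np

def train_mlp(X, y, d_hidden=5, eta=0.05, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d + 1, d_hidden))   # layer 1 (incl. bias row)
    W2 = rng.normal(scale=0.1, size=(d_hidden + 1, 1))   # layer 2 (incl. bias row)
    for _ in range(epochs):
        for n in rng.permutation(N):
            x0 = np.concatenate(([1.0], X[n]))            # input with bias x_0 = 1
            s1 = W1.T @ x0                                # forward pass
            x1 = np.concatenate(([1.0], np.tanh(s1)))
            s2 = W2.T @ x1                                # output score (identity activation)
            delta2 = 2.0 * (s2 - y[n])                    # backward pass for e = (s2 - y_n)^2
            delta1 = (1.0 - np.tanh(s1) ** 2) * (W2[1:, 0] * delta2)
            W2 -= eta * np.outer(x1, delta2)              # w <- w - eta * x_i * delta_j
            W1 -= eta * np.outer(x0, delta1)
    return W1, W2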


Stopping condition

A common way to control the learning process is to monitor generalisation performance


on a validation set, and to stop when the validation error starts to increase.

[Figure: training and validation error over the course of training; in practice, the validation error eventually starts to increase while the training error keeps decreasing.]

Nonlinear models E. Hüllermeier 147/207


Neural network as a feature map
Recalling the idea of linearisation, note that the mapping R^d −→ Z = R^{d^(L−1)} of the
original data from the input layer to the hidden layer (or, more generally, to the
penultimate layer) of a neural network can be seen as a feature embedding.
The final learning task is then solved on this representation in Z-space.
Importantly, this feature map is not predefined but also learned on the training data.

Nonlinear models E. Hüllermeier 148/207


Regression
The activation at the output layer, which maps the score s = s_1^(L) to the final
prediction ŷ , may differ from θ; it depends on the type of output that ought to be
produced, i.e., on the type of learning task at hand.

Obviously, the error function depends on the learning task, too.

For example, in the case of regression, the identity s ↦ s can be used for activation, and
the error function is typically taken as e(y , ŷ ) = (y − ŷ )2 .

Essentially, this means that standard linear regression (OLS) is done on the
linearisation, i.e., in the embedding space Z rather than the original data space.

Nonlinear models E. Hüllermeier 149/207


Classification

In the case of binary classification, one often produces predictions ŷ ∈ (0, 1), which are
then interpreted as probabilities (of the positive class); this can be accomplished by the
activation
s ↦ 1 / (1 + exp(−s)) .

A suitable error function in this case is the logistic loss (log-loss, cross-entropy loss)

e(y , ŷ ) = −y · log(ŷ ) − (1 − y ) log(1 − ŷ ) .

This approach comes down to doing logistic regression in the embedding space.

Nonlinear models E. Hüllermeier 150/207


Logistic regression

Logistic regression fits a generalised linear model:

f(p̂) = log( p̂ / (1 − p̂) ) = Σ_i wi · xi ,

where the left-hand side is the log-odds ratio and the right-hand side is the score s(x).

Finding weights wi minimising log-loss is equivalent to maximum likelihood estimation.


Nonlinear models E. Hüllermeier 151/207
Multi-class classification

The idea of predicting a probability distribution (Bernoulli in the binary case) can be
generalised to the case of K > 2 classes, e.g., by means of an output layer with K
neurons (instead of a single one) and a softmax transformation:

p̂i = ŷi = exp( s_i^(L) ) / ( exp( s_1^(L) ) + · · · + exp( s_K^(L) ) )

Again, a suitable error function is the cross-entropy loss:


e(y, ŷ) = − Σ_{i=1}^{K} ( yi · log(ŷi) + (1 − yi) · log(1 − ŷi) ) ,

where y = (y1 , . . . , yK )⊤ is a one-hot encoding of the outcome (i.e., yi = 1 if the i th class


was observed, and 0 for all other entries).

Nonlinear models E. Hüllermeier 152/207
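
A small sketch of the softmax output and this loss in NumPy (shifting the scores by their maximum is a standard implementation detail for numerical stability, not part of the formula itself):

import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))   # shift for numerical stability
    return z / z.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    # y is a one-hot vector, y_hat a probability vector
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

s = np.array([2.0, 0.5, -1.0])   # output scores for K = 3 classes
y = np.array([1.0, 0.0, 0.0])    # class 1 was observed
print(cross_entropy(y, softmax(s)))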


Classification with decision trees

Decision trees are based on the principle of recursive partitioning: The instance space is
(recursively) partitioned in such a way that each subspace is associated with a single class.

Nonlinear models E. Hüllermeier 153/207


Classification with decision trees

Example of a decision tree for the adult data (predict whether a person has yearly income
≥ 50K or < 50K):

Each path from the root to a leaf node can be considered as an IF-THEN rule; hence,
decision trees correspond to rule systems with a specific structure.
Nonlinear models E. Hüllermeier 154/207
Splitting with discrete and continuous features

If the feature Xi with discrete domain {xi,1 , . . . , xi,m } is used as a split-attribute at an


inner node of the tree, then m descendants of that node are created.

All training examples, for which Xi takes the value xi,j , i.e., which fulfil the logical
predicate (Xi = xi,j ), are assigned to the j th descendant.

For continuous features, one usually uses predicates of the form (Xi < t) and (Xi ≥ t),
which means that the splits are binary.

As candidates for the threshold t, one tries values located in-between two values of Xi in
the training data.

Nonlinear models E. Hüllermeier 155/207


Entropy as a measure of purity

The probability distribution on the set of classes Y = {y1 , . . . , yK } shall be given by


p = (p1 , . . . , pK ) .

The (Shannon) entropy of this distribution is


H(p) = − Σ_{k=1}^{K} pk log(pk) ,

where 0 log(0) = 0 by definition.


In decision tree learning, the pk are estimated by relative frequencies in the training data
(associated with a node in the tree), and hence entropy by

H(p̂) = − Σ_{k=1}^{K} p̂k log(p̂k) .

Nonlinear models E. Hüllermeier 156/207


Information gain

Let p̂ = (p̂1 , . . . , p̂K ) be the distribution at a node with n data points.

Let Xi be a split-attribute with m descendants and p̂ j = (p̂j,1 , . . . , p̂j,K ) the distribution


at the j th descendant with nj data points (n = n1 + . . . + nm ).

The information gain through the split is then defined as follows:


Gain = H(p̂) − Σ_{j=1}^{m} (nj / n) · H(p̂_j) .

The split with the highest information gain is adopted.

Stopping condition: number of data points too low (due to fragmentation) or no


(significant) information gain anymore.

Nonlinear models E. Hüllermeier 157/207
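
A minimal sketch of both quantities in Python (class labels as a 1-D array; the natural logarithm is used, the base only rescales the gain):

import numpy as np

def entropy(labels):
    # Shannon entropy of the empirical class distribution at a node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))   # 0 log 0 cannot occur: counts are positive

def information_gain(labels, child_labels):
    # child_labels: one label array per descendant of the candidate split
    n = len(labels)
    return entropy(labels) - sum(len(c) / n * entropy(c) for c in child_labels)

y = np.array(["A", "A", "B", "B", "B", "C"])
print(information_gain(y, [y[:2], y[2:]]))   # a candidate binary split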


Algorithm RecPart

Require: Training data D, attributes A
Ensure: Decision tree B
1: if all examples belong to class c then
2: return node (leaf) marked with class c
3: else
4: select an attribute Xi ∈ A
5: for all xi,1 , . . . , xi,m in the domain of Xi do
6: Dj ← {x ∈ D | xi = xi,j }
7: Bj ← RecPart(Dj , A \ {Xi })
8: end for
9: end if
10: return root Xi with subtrees B1 , . . . , Bm

Nonlinear models E. Hüllermeier 158/207


Implementations and extensions

Concrete algorithms for decision tree induction (e.g. C5.0, CART) follow the basic
recursive partitioning scheme (see algorithm RecPart), but differ with regard to various
technical aspects and extensions.
These algorithms can be seen as a greedy search in the space of all decision trees; thus,
they do not guarantee any optimality, neither in terms of accuracy nor size.
In general, the problem of learning optimal decision trees is NP-complete.
More recent methods seek to minimise an augmented error (using the number of leaves for
regularisation), based on
▶ mathematical programming (e.g., mixed integer programming) and SAT solvers,
▶ stochastic search methods,
▶ customised dynamic programming algorithms combined with branch-and-bound.

Nonlinear models E. Hüllermeier 159/207


Pruning

One important extension is a post-processing step called pruning: induced trees are
simplified to prevent over-fitting.
Reduced Error Pruning:
▶ Retain part of the training data for validation.
▶ Learn decision tree with the remaining training data.
▶ For each inner node, use the validation data to determine the performance improvement (if
any) if that node would become a leaf node (label through majority vote).
▶ The node with the maximal improvement is turned into a leaf.
▶ This process ends as soon as any further pruning would deteriorate performance.

Nonlinear models E. Hüllermeier 160/207


Entropy-based discretisation

Some machine learning algorithms can only handle discrete features, necessitating
discretisation of continuous features as a pre-processing step.

Most discretisation techniques partition the domain of a numerical variable into a finite
number of intervals, treating each interval as a categorical value.
▶ Equi-width binning: each interval (bin) has the same width
▶ Equi-frequency binning: each bin contains (approximately) the same number of data points

Entropy-based discretisation is a supervised discretisation technique, which makes use
of the target variable Y , essentially inducing a decision tree on a single dimension:
▶ find a split of the domain that minimises (weighted) entropy of the induced class
distributions,
▶ apply the same strategy recursively till a termination criterion is met.

Nonlinear models E. Hüllermeier 161/207


Regression trees

The principle of decision trees (classification trees) can also be applied to regression.
In the simplest case, leaf nodes are marked with the mean of the relevant values of the
target variable (leading to a piecewise constant function).
Variance reduction is often used as a split criterion:
Gain = V − Σ_{j=1}^{m} (|Dj| / |D|) · Vj ,

where

V = (1/|D|) Σ_{(x,y)∈D} (y − m)² ,    Vj = (1/|Dj|) Σ_{(x,y)∈Dj} (y − mj)² ,

m = (1/|D|) Σ_{(x,y)∈D} y ,    mj = (1/|Dj|) Σ_{(x,y)∈Dj} y .

Nonlinear models E. Hüllermeier 162/207


Regression trees

Example of a regression tree:

Nonlinear models E. Hüllermeier 163/207


Structure of the lecture

1. Introduction
2. Two simple methods
3. The general setting
4. Model induction and generalisation
5. Regularisation and validation
6. Nonlinear models
7. Model dichotomisation and ensembles
8. Semi-supervised learning
9. Reinforcement learning
A. Probability

Nonlinear models E. Hüllermeier 164/207


Reduction techniques: from multi-class to binary classification

Many ML methods are inherently restricted to the binary case and cannot directly solve
problems with multiple classes (multi-class, multinomial, polychotomous classification):

Ensemble methods E. Hüllermeier 165/207


Reduction techniques: from multi-class to binary classification

Such methods can be made amenable to multi-class problems via decomposition


(reduction) techniques, which reduce a single multi-class problem to a set of binary
classification problems.

At prediction time, a query x is submitted to each of the binary models, and their
predictions are combined into a prediction for the original multi-class problem.

We refer to the representation of a complex model in terms of a set of binary


(dichotomous) models as model dichotomisation.

The latter may come with an increased predictive accuracy, even for genuine multi-class
classifiers, because binary problems are simpler (“divide and conquer”).

Ensemble methods E. Hüllermeier 166/207


One-vs-rest decomposition

Given a problem with instance space X and classes Y = {y1 , y2 , . . . , yK }, one-vs-rest


decomposition (OvR) learns models h1 , . . . , hK , one for each class.
The task of hk is to distinguish instances belonging to class yk from instances that do not
belong to this class; correspondingly, if the original training data is given by

D = { (x_1, c1), (x_2, c2), . . . , (x_N, cN) } ⊂ X × Y ,

the model hk is trained on the transformed data


Dk = { (x_1, c1^(k)), (x_2, c2^(k)), . . . , (x_N, cN^(k)) } ⊂ X × {−1, +1} ,

where

cn^(k) = +1 if cn = yk , and −1 if cn ≠ yk .
To this end, any binary classifier can be used, called base learner in this context.
Ensemble methods E. Hüllermeier 167/207
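
A minimal sketch of OvR in Python, assuming a base learner with a fit(X, y) → model interface and a scoring predict(model, x) function (the names are placeholders):

import numpy as np

def train_ovr(X, c, classes, fit):
    # one scoring model per class, trained on data relabeled to {-1, +1}
    return {k: fit(X, np.where(c == k, +1.0, -1.0)) for k in classes}

def predict_ovr(models, predict, x):
    # choose the class whose model is most confident
    scores = {k: predict(m, x) for k, m in models.items()}
    return max(scores, key=scores.get)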
One-vs-rest decomposition

Ensemble methods E. Hüllermeier 168/207


One-vs-rest decomposition

Suppose all models have been trained, and a new query instance x is submitted for
classification.
Ideally, there is one k such that hk (x) = +1 while hi (x) = −1 for all i ̸= k; in that case,
the predicted class would be yk .
Conflict: Either none of the predictions is positive, or more than one of the models claim
the instance for the class it represents.
Therefore, one typically trains scoring classifiers hk : X −→ R; then, hk (x) can be seen
as the strength or the confidence that the class label of x is yk .
Given predictions of that kind, a natural classification rule is to choose yk such that

k = argmax_{1≤i≤K} hi(x) .

Ensemble methods E. Hüllermeier 169/207


One-vs-rest decomposition

In terms of complexity, K models are trained in the one-vs-rest decomposition scheme in total, each one on a training set of size N.

Thus, if the underlying base learner has a time complexity of O(N α ), the overall
complexity of training an OvR classifier is

O(KN α ) .

At prediction time, K predictors must be queried.

Ensemble methods E. Hüllermeier 170/207


All-pairs decomposition

Another simple decomposition technique is the all-pairs scheme.


A quadratic number of K (K − 1)/2 models hi,j , 1 ≤ i < j ≤ K , is trained, one for each
pair of classes yi and yj .
Model hi,j is intended to separate instances with class label yi from those belonging to
class yj (ignoring all other classes).
Given a set of training data D = {(x_n, cn)}_{n=1}^{N}, let

cn^(i,j) = +1 if cn = yi ,    0 if cn ∉ {yi, yj} ,    −1 if cn = yj .

The model hi,j is then trained on the set of data Di,j consisting of all examples
(x_n, cn^(i,j)) such that cn^(i,j) ≠ 0.

Ensemble methods E. Hüllermeier 171/207


All-pairs decomposition

Ensemble methods E. Hüllermeier 172/207


All-pairs decomposition

At classification time, a query x is submitted to all hi,j , and predictions hi,j (x) are
combined into an overall prediction.
Prediction hi,j (x) is interpreted as a “vote” in favor of yi or yj .
Suppose predictions are in [0, 1], and for all 1 ≤ i < j ≤ K , let

hj,i (x) = 1 − hi,j (x) .

Then, the total number of (weighted) votes in favor of class yk is given by


vk = Σ_{i≠k} hk,i(x) ,

and the class maximising this score is chosen as a prediction.


The case where hi,j (x) ∈ {0, 1} is called binary voting, whereas the general case with
predictions in [0, 1] is called weighted voting.
Ensemble methods E. Hüllermeier 173/207
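
A sketch of weighted voting at prediction time in Python (models[(i, j)] holds hi,j for i < j; the interface is our own assumption):

def predict_all_pairs(models, predict, x, K):
    # h_{j,i}(x) = 1 - h_{i,j}(x); votes v_k = sum of h_{k,i}(x) over i != k
    def h(i, j):
        return predict(models[(i, j)], x) if i < j else 1.0 - predict(models[(j, i)], x)
    votes = [sum(h(k, i) for i in range(K) if i != k) for k in range(K)]
    return max(range(K), key=lambda k: votes[k])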
All-pairs decomposition

Since each example (x n , cn ) ∈ D is involved in the training of exactly K − 1 models, the


total number of examples produced for training all binary models is (K − 1)N = O(KN).

Thus, although the number of models is now quadratic instead of linear in K , the total
number of examples produced is of the same order as for one-vs-rest.

More importantly, since the individual training sets are smaller, the all-pairs approach
tends to be even more efficient in the case of base learner complexity O(N α ) with α > 1,
i.e., if the running time of the base learner is super-linear.

Note, however, that the quadratic number of models increases space complexity (each
of them needs to be stored) as well as time complexity in the prediction step.

Ensemble methods E. Hüllermeier 174/207


Error correcting output codes (ECOC)

In principle, a base learner could use any partition of the original classes into positive,
negative, and neutral (ignored) examples.
One-vs-rest and all-pairs are special cases of a more general technique based on so-called
error correcting output codes.
Suppose a code matrix M of size K × L to be given, where the entry M(k, ℓ) is +1 if
class yk is considered as positive by the base learner hℓ, −1 if yk is considered as
negative, and 0 if that class is simply ignored by the learner.
The class yk is then encoded by the k-th row

m_k = ( M(k, 1), . . . , M(k, L) )

of the matrix M.

Ensemble methods E. Hüllermeier 175/207


Error correcting output codes (ECOC)

Training of each of the binary models hℓ , 1 ≤ ℓ ≤ L, is done as before, i.e., by


transforming (and possibly reducing) the original data D into a dichotomous data set.
At prediction time, a query instance x is submitted to each of the models, and the
predictions are combined into a vector

v = (v1 , . . . , vL ) = (h1 (x), . . . , hL (x)) ∈ {−1, +1}L .

The final (multi-class) prediction is then given by the class yk that minimises the
(generalised) Hamming distance between the code vector and the prediction:
k = argmin_{1≤i≤K} (1/2) · Σ_{ℓ=1}^{L} |M(i, ℓ) − hℓ(x)|    (⋆)

Ensemble methods E. Hüllermeier 176/207
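
Hamming decoding (⋆) takes only a few lines of Python (M is the K × L code matrix, preds the vector of binary predictions in {−1, +1}):

import numpy as np

def ecoc_decode(M, preds):
    # choose the class whose code word is closest to the prediction vector
    distances = 0.5 * np.abs(M - preds).sum(axis=1)   # generalised Hamming distance
    return int(np.argmin(distances))

M = np.array([[+1, -1, -1],    # one-vs-rest code for K = 3 classes
              [-1, +1, -1],
              [-1, -1, +1]])
print(ecoc_decode(M, np.array([+1, -1, -1])))   # decodes to class 0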


Error correcting output codes (ECOC)

The notion of “error correction” refers to the fact that the ECOC scheme may tolerate a
certain number of incorrect binary predictions hℓ (x) without affecting the correctness of
the overall (multi-class) prediction.

Suppose M is chosen in such a way that the Hamming distance between each pair of
classes yi and yj is at least d.

Each incorrect binary prediction increases the distance between v and m k by at most 1.

Thus, as long as the number of mistakes does not exceed (d − 1)/2, the correct class will
still be selected by Hamming decoding (⋆).

Ensemble methods E. Hüllermeier 177/207


Error correcting output codes (ECOC)

One-vs-rest corresponds to the ECOC with the K × K code matrix

      +1 −1 −1 · · · −1
      −1 +1 −1 · · · −1
M =    ⋮    ⋮    ⋮    ⋱    ⋮
      −1 −1 −1 · · · +1

All-pairs is obtained for the K × K(K − 1)/2 matrix

      +1 +1 · · ·  0
      −1  0 · · ·  0
       0 −1 · · ·  0
M =    0  0 · · ·  0
       ⋮    ⋮    ⋱    ⋮
       0  0 · · · +1
       0  0 · · · −1

(one column per pair of classes).

Ensemble methods E. Hüllermeier 178/207


Nested dichotomies


Example of a nested dichotomy for a 5-class problem. The first classifier (C1 ) is supposed to
separate class C from the meta-class {A, B, D, E }, i.e., the union of classes A, B, D, and E ;
likewise, the second classifier separates classes {A, D} from {B, E }, the third classifier class A
from D, and the fourth B from E .

Ensemble methods E. Hüllermeier 179/207


Nested dichotomies

Nested dichotomies (NDs) split the set of classes in a recursive way.


The combination of individual predictions into an overall (multi-class) prediction can be
accomplished in a very natural and consistent manner (in contrast to ECOC, no
inconsistencies are produced).
Starting at the root and following the respective classifier’s prediction at each node, one
eventually ends up in a leaf node labeled with a class.
If base learners are probabilistic, inference comes down to the chain rule of probability
(multiplication of probabilities along a path from root to leaf), e.g.,

P(A | x) = P({A, B, D, E} | x) × P({A, D} | x, {A, B, D, E}) × P(A | x, {A, D})

Ensemble methods E. Hüllermeier 180/207


Ensembles of nested dichotomies

The performance of the multi-class classifier eventually produced may strongly depend
on the structure of the dichotomy, because the latter specifies the set of binary
problems that need to be solved.

Therefore, instead of producing a single dichotomy, one often trains an ensemble of


nested dichotomies, creating different structures at random.

Given a query x, the predictions obtained by the different dichotomies are aggregated
into an overall prediction through majority voting or averaging (of probabilities).

Ensemble methods E. Hüllermeier 181/207


Ensembles in machine learning

More generally, in machine learning, the notion of ensemble refers to a set of models
h1 , . . . , hM that have been trained on the same data.

Unlike the case of decomposition methods, each of these models is in principle a


complete solution to the original learning problem.

Prediction for an instance x:



h(x) = AGG( h1(x), h2(x), . . . , hM(x) )

Aggregation depends on the learning problem (e.g., majority voting, mean, median).

Ensemble methods E. Hüllermeier 182/207


How do ensembles work?

Main purpose is to increase the accuracy of predictions.

Idea: Asking for the opinion of several experts is better than only asking a single one.

However, experts need to be sufficiently independent of each other (in spite of being
trained on the same data).

Formally, the improvement achieved by ensemble methods can be explained by looking at


the bias/variance decomposition of the prediction error: Ensembles tend to reduce the
bias or the variance or even both.

Main questions:
▶ Diversification: How to induce sufficiently different models from the same data?
▶ Aggregation: How to combine the predictions of the individual models?

Ensemble methods E. Hüllermeier 183/207


How to achieve diversity?

Applying the same learner to the same data will always yield the same model (unless the
learning algorithm is randomised).

There are basically two possibilities to produce different models: Either by modifying the
learning process or the data.

Ensemble methods E. Hüllermeier 184/207


Bagging

Bagging is based on a resampling technique called bootstrapping in statistics.


Given a data set D of size N = |D|, new data sets B of the same size are produced by
sampling from D with replacement.
Suppose M such bootstrap samples are produced, and denote them by B1 , B2 , . . . , BM .
Then, a model hi is trained on each of the data sets Bi , using the same base learner.
The overall model is given by the aggregation AGG of the models h1 , . . . , hM thus
produced.
Bagging tends to reduce variance and can be seen as an alternative to the use of
regularisation techniques.
To produce sufficiently diverse models, base learners with high variance should be
used; especially popular are ensembles of decision trees.

Ensemble methods E. Hüllermeier 185/207
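
A compact sketch of bagging in Python (fit and predict are placeholders for the base learner; the regression case with the mean as AGG):

import numpy as np

def bagging(X, y, fit, M=50, seed=0):
    # train M models, each on a bootstrap sample of size N drawn with replacement
    rng = np.random.default_rng(seed)
    N = len(y)
    samples = (rng.integers(0, N, size=N) for _ in range(M))
    return [fit(X[idx], y[idx]) for idx in samples]

def predict_bagged(models, predict, x):
    return np.mean([predict(m, x) for m in models])   # AGG = mean (regression)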


Boosting

Boosting combines a set of weak learners into a single strong learner.


The weak learners are successively trained on the data D. In each round, the examples in
the training data are re-weighted: more weight is put on those examples on which the
last hypothesis made a mistake, and less on those on which it was correct.

Ensemble methods E. Hüllermeier 186/207


AdaBoost

One of the most basic versions of boosting is the AdaBoost (Adaptive Boosting)
algorithm.
While bagging essentially reduces the variance part of the error (although it can also
reduce bias), boosting potentially reduces both bias and variance.
However, it is much more susceptible to noise in the data. While bagging will hardly ever
increase the error (compared to a single model), this is less guaranteed for boosting.

Ensemble methods E. Hüllermeier 187/207


Algorithm AdaBoost

Require: base learner A, training data D, ensemble size K
1: w^(0) ← (1, . . . , 1)/|D| {initialize uniform weights}
2: for k = 1 to K do
3:   hk ← A(D, w^(k−1))
4:   for n = 1 to |D| do
5:     ĉn ← hk(x_n) {make predictions on training data}
6:   end for
7:   ϵ̂^(k) ← Σ_n w_n^(k−1) ⟦ĉn ≠ cn⟧ {compute weighted training error}
8:   α^(k) ← (1/2) log( (1 − ϵ̂^(k)) / ϵ̂^(k) ) {compute “adaptive” parameter}
9:   for n = 1 to |D| do
10:    w_n^(k) ← (1/Z) w_n^(k−1) exp(−α^(k) ĉn cn) {re-weight and normalize}
11:  end for
12: end for
13: return h(x) = sgn( Σ_k α^(k) hk(x) ) {weighted voting classifier}

Ensemble methods E. Hüllermeier 188/207
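
A NumPy sketch of AdaBoost with decision stumps as weak learners (labels in {−1, +1}; the brute-force stump search is purely for illustration, and we assume 0 < ϵ̂ < 1/2 in every round):

import numpy as np

def fit_stump(X, y, w):
    # weighted-error-minimising stump: predict sign if x_d > t, else -sign
    best, best_err = None, np.inf
    for d in range(X.shape[1]):
        for t in np.unique(X[:, d]):
            for sign in (+1, -1):
                pred = np.where(X[:, d] > t, sign, -sign)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best, best_err = (d, t, sign), err
    return best

def stump_predict(stump, X):
    d, t, sign = stump
    return np.where(X[:, d] > t, sign, -sign)

def adaboost(X, y, K=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                  # uniform initial weights
    ensemble = []
    for _ in range(K):
        stump = fit_stump(X, y, w)
        pred = stump_predict(stump, X)
        eps = np.sum(w * (pred != y))        # weighted training error
        alpha = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-alpha * pred * y)    # re-weight ...
        w /= w.sum()                         # ... and normalise
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))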


Example: Classification with decision stumps

A decision stump is a decision tree with only a single split (two leaf nodes).

AdaBoost learns h(x) = sgn( 0.8 h1(x) + 0.35 h2(x) + 0.55 h3(x) )

Weights: w^(0) = (1/3, 1/3, 1/3), w^(1) = (1/2, 1/4, 1/4), w^(2) = (1/3, 1/6, 1/2)

Ensemble methods E. Hüllermeier 189/207


Weak learners

In principle, any type of learner can be used as a weak learner in boosting, as long as it is
better than random guessing.

Another requirement, however, is that the learner accepts weighted training examples
(most algorithms can be generalised correspondingly).

Typical examples of learners used in boosting include shallow decision trees (e.g.,
decision stumps) and linear classifiers.

Ensemble methods E. Hüllermeier 190/207


Random ensembles

Instead of modifying the training data, model diversity can also be achieved by
manipulating the learner.
Especially effective for learners that are very sensitive toward (minor) changes of the
learning process — the probably most prominent example is again decision trees.
Random forests combine bagging with randomisation of splits in decision tree learning.
The best attribute is chosen among a randomly selected subset of the attributes (a
common choice for the size of the candidate set is √d, with d the total number of attributes).
This leads to a reduction of the correlation between the trees forming the ensemble.
Extreme case: split attributes are chosen completely at random.

Ensemble methods E. Hüllermeier 191/207


Stacking

Stacking realises AGG again by a learning algorithm (instead of predefining it)


−→ learn how to best combine the predictions of the base learners.
The algorithm for the aggregation is called meta learner.
It takes predictions of base learners as input (and possibly x itself) and produces a final
prediction as output.
Training examples for this learner are of the form

(x′1, . . . , x′M, y) = ( (h1(x), h2(x), . . . , hM(x)), y ) .

Original examples (x, y ) used to produce the examples for the meta learner should not
have already been used for training the base learners.
Prediction for a query instance x:

ŷ = h( h1(x), h2(x), . . . , hM(x) )

Ensemble methods E. Hüllermeier 192/207


Stacking

Optionally, the original instance x can be given as an additional input to the meta-learner.

Ensemble methods E. Hüllermeier 193/207


Structure of the lecture

1. Introduction
2. Two simple methods
3. The general setting
4. Model induction and generalisation
5. Regularisation and validation
6. Nonlinear models
7. Model dichotomisation and ensembles
8. Semi-supervised learning
9. Reinforcement learning
A. Probability

Ensemble methods E. Hüllermeier 194/207


Labeled versus unlabeled data

Labeling of data often comes with a certain cost, while unlabeled data is more readily
available (for example, wet lab experiments in biology, human annotation in text or image
processing).
A natural idea is to harness the unlabeled data to improve (supervised) learning.
Methods making use of both labeled and unlabeled data are called semi-supervised
learning methods.
Often, these methods improve prediction accuracy, although theoretical guarantees are
difficult (if not impossible) to obtain.
In the following, we will assume that only the first L ≪ N training instances x 1 , . . . , x L
are labeled, while the rest is unlabeled:

DL = { (x_i, yi) }_{i=1}^{L}    and    DN = { x_j }_{j=L+1}^{N} .

Semi-supervised learning E. Hüllermeier 195/207


Semi-supervised learning

Semi-supervised learning E. Hüllermeier 196/207


Self-training

Self-training algorithms proceed from the assumption that their own predictions on
instances in DN are correct, at least those on which they are sufficiently sure.
Thus, starting with D = DL , such algorithms iterate the following steps:
▶ Learn a model h on the current training data D.
▶ Use h to label those instances x j ∈ DN \ D on which the prediction is sufficiently certain, i.e.,
set yj = h(x j ).

Instead of simply adding the newly labeled training instances, these examples could be
weighted; as in boosting, this of course requires learning algorithms that are able to
deal with weighted examples.

Semi-supervised learning E. Hüllermeier 197/207
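
A minimal sketch of the self-training loop in Python (fit and predict_conf are placeholders; the confidence threshold is an arbitrary choice):

import numpy as np

def self_training(X_lab, y_lab, X_unl, fit, predict_conf, threshold=0.95, max_iter=10):
    # predict_conf(model, X) -> (predicted labels, confidences)
    X, y, pool = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(max_iter):
        if len(pool) == 0:
            break
        model = fit(X, y)
        labels, conf = predict_conf(model, pool)
        sure = conf >= threshold                  # self-label only confident instances
        if not np.any(sure):
            break
        X = np.vstack([X, pool[sure]])            # add pseudo-labeled examples
        y = np.concatenate([y, labels[sure]])
        pool = pool[~sure]
    return fit(X, y)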


Self-training

Self-training is among the simplest approaches to semi-supervised learning and is often


used in real tasks such as natural language processing.
As a wrapper method, it can be applied to any type of (complex) learner.
However, the approach also exhibits disadvantages, including the following:
▶ Mistakes made in the beginning might be reinforced in later stages of the learning process.
▶ Convergence cannot be guaranteed in general.

Semi-supervised learning E. Hüllermeier 198/207


Co-training and multi-view learning

The idea of multi-view learning is to look at an object (e.g., a website) from two (or
more) different viewpoints (e.g., the pictures and the text on the website).
Formally, suppose the instance space X to be split into two parts, so that an instance is
represented in the form x_i = (x_i^(1), x_i^(2)).

Co-training proceeds from the assumption that each view alone is sufficient to train a
good classifier and, moreover, that x_i^(1) and x_i^(2) are conditionally independent given the
class.

Semi-supervised learning E. Hüllermeier 199/207


Co-training and multi-view learning

Co-training algorithms then repeat the following steps:


▶ Train two classifiers h^(1) and h^(2) from D_L^(1) and D_L^(2), respectively.
▶ Classify DN separately with h(1) and h(2) .
▶ Add the k most confident examples of h(1) to the labeled training data of h(2) .
▶ Add the k most confident examples of h(2) to the labeled training data of h(1) .

Co-training is a wrapper method that applies to all existing classifiers. It tends to be


less sensitive to mistakes than self-training.
However, a natural split of the features does not always exist (although the feature
subsets do not necessarily need to be disjoint).
Moreover, models using both views simultaneously may often perform better.

Semi-supervised learning E. Hüllermeier 200/207


Co-training and multi-view learning

There are many variants of co-training, such as example weighting, multiview learning
with majority vote labeling, etc.
Multiview learning (with m learners) can also be realised via regularisation:
min_{h∈H} Σ_{v=1}^{m} Σ_{i=1}^{L} e(yi, hv(x_i)) + λ1 Σ_{v=1}^{m} ∥hv∥² + λ2 Σ_{u,v=1}^{m} Σ_{j=L+1}^{N} (hu(x_j) − hv(x_j))²

where the first term penalises the incorrectness of hv and the third term the disagreement between hu and hv.

Minimising a joint loss function of this kind encourages the learners, not only to predict
correctly on the labeled data, but also to agree on the unlabeled data.

Semi-supervised learning E. Hüllermeier 201/207


Generative models

In contrast to discriminative approaches, generative methods first estimate a joint


distribution P on X × Y; predictions can then be derived by conditioning on a given
query x:
P(y | x) = P(x, y) / P(x) = P(x, y) / Σ_{ȳ∈Y} P(x, ȳ) ∝ P(x, y)

Generative methods can be applied in the semi-supervised context in a quite natural way,
because they can model the probability of observing an instance x j as a marginal
probability:

P(x) = Σ_{ȳ∈Y} P(x, ȳ)

Semi-supervised learning E. Hüllermeier 202/207


Generative models

Suppose the (joint) probability P to be parametrised by θ ∈ Θ. Then, under the


assumption of independent observations,
L(θ) = P(DL, DN | θ) = Π_{i=1}^{L} P(x_i, yi | θ) · Π_{j=L+1}^{N} P(x_j | θ)
                     = Π_{i=1}^{L} P(x_i, yi | θ) · Π_{j=L+1}^{N} Σ_{ȳ∈Y} P(x_j, ȳ | θ)

An estimation of θ can then be obtained via maximum likelihood estimation:

θ∗ = argmax_{θ∈Θ} P(DL, DN | θ)

Semi-supervised learning E. Hüllermeier 203/207


Cluster-and-label

Methods of that kind are formally well-grounded and often very effective, provided the
model assumptions are (approximately) correct; however, they may become
computationally complex, and the (log-)likelihood function may have local optima.
Generative methods are closely connected to methods based on clustering (after all,
unlabeled data provides an idea of the distribution of the data), which are computationally
less complex but purely heuristic.
In the simplest version, cluster-and-label methods simply work as follows:
▶ A clustering algorithm is applied to both labeled and unlabeled instances,
▶ and each cluster is then completely labeled by applying the majority rule.

Semi-supervised learning E. Hüllermeier 204/207


Graph-based algorithms

Instead of approximating the distribution of the data by means of standard clustering, the
(topological) structure of the data space can also be represented in terms of a graph.
Each data point corresponds to a node, and two nodes are connected by a (weighted)
edge if they are “similar” to each other. The assumption is that similar data points tend
to have the same class label.
Based on this assumption, the given labels (coming from examples (x i , yi ) ∈ DL ) can be
propagated over the whole graph.

Semi-supervised learning E. Hüllermeier 205/207


Graph-based algorithms

Typically, the problem is formalised as a graph-theoretic optimisation problem, such as the


mincut problem:
Fixing the labels (y1 , . . . , yL ) ∈ {0, 1}L of the labeled instances, find
(yL+1 , . . . , yN ) ∈ {0, 1}N−L so as to minimise
Σ_{1≤i<j≤N} wi,j · |yi − yj| .

This is a combinatorial optimisation problem that can be solved in polynomial time.

Semi-supervised learning E. Hüllermeier 206/207


Further reading

J.E. van Engelen and H.H. Hoos. A survey on semi-supervised learning. Machine
Learning, 109:373–440, 2020. DOI 10.1007/s10994-019-05855-6.

Semi-supervised learning E. Hüllermeier 207/207
