LMU Munich
Institute of Informatics
E. Hüllermeier
Structure of the lecture
1. Introduction
2. Two simple methods
3. The general setting
4. Model induction and generalisation
5. Regularisation and validation
6. Nonlinear models
7. Model dichotomisation and ensembles
8. Semi-supervised learning
9. Reinforcement learning
A. Probability
ML settings and problems
Machine learning is mainly concerned with the development and analysis of learning
algorithms.
In this lecture, we shall focus on one of the simplest settings, so-called supervised
learning (also known as learning from examples).
Unsupervised learning: Learner merely observes the data (e.g., handwritten digits), but
without any type of supervision. The main goal is to discover structure in the data, for
example represented as a grouping of data into clusters.
(Illustration: facial expression recognition by a smart eyewear, robust to facial direction changes, repeatability, and positional drift [?].)
The interest is to learn a mapping from X to Y or, more generally, from X to some prediction space Ŷ, that models the dependence of outcomes (responses, dependent variables) on inputs (predictors, independent variables):
f : X −→ Y
that captures this dependence optimally (in a sense to be specified later on).
In order to learn the target function, the learning algorithm A is provided with a set of training data

D = ( (x_i, y_i) )_{i=1}^N ∈ (X × Y)^N .

Based on D, the learner produces a hypothesis g : X −→ Y from an underlying hypothesis space H (the class of candidate functions the learner can choose from). Formally, the learner is thus a mapping

A : 𝔻 −→ H ,

where 𝔻 is the set of all training data sets (possibly of different size).
The learner will typically try to find a hypothesis that “fits” the training data D
sufficiently well.
The result of the learning process, the model g, can be used for different purposes,
notably to make predictions ŷ = g(x) for future (query) instances x ∈ X ; the better g
approximates f , the better such predictions will be.
The hypothesis space H together with the learner A is also referred to as the learning
model.
Important special cases of supervised learning include the problems of classification and
regression.
In classification, the outcome space consists of a finite (typically small to moderate)
number of class labels:
Y = {y1 , . . . , yK } .
In the case K = 2, we speak of a binary classification problem.
Examples: Handwritten digit recognition (K = 10), predicting the result of a football
match (K = 3), credit card applications/approval in a bank (K = 2) based on customer
information such as age, gender, annual salary, years in residence, years in job,
outstanding loans, etc.
In the case of regression, the outputs are real numbers, i.e., Y = R.
Instances are typically represented as feature vectors x = (x₁, …, x_d)^⊤. The entries of these d-dimensional vectors are called “features” and typically describe an instance in terms of specific properties.
Other representations are possible, e.g., in terms of structures such as sequences, graphs,
etc.
2. Two simple methods
For the purpose of illustration, and to convey a basic idea of ML algorithms, we shall look
at two exemplary methods (model classes):
▶ linear models,
▶ nearest neighbour estimation.
Both approaches are in a sense quite basic, but played an important role in the history of
ML and are still commonly used.
They are also quite different in terms of their characteristics and properties (e.g.,
parametric vs. nonparametric, strong bias vs. weak bias, eager vs. lazy, etc.).
A linear model h_w(x) = w₀ + w₁x₁ + ⋯ + w_d x_d is sought that “fits” the data well in terms of the squared error loss

E_in(h_w) = E_in(w) = (1/N) Σ_{i=1}^N ( y_i − h_w(x_i) )² .
The wj are the parameters of the model, also called weights; they determine the
influence of the individual features (direction and strength).
By convention, a constant feature x₀ = 1 is added to each instance:

x = (x₀, x₁, …, x_d)^⊤ = (1, x₁, …, x_d)^⊤ ,

and likewise x_i^⊤ = (1, x_{i,1}, …, x_{i,d}) for the training instances.
With y = (y₁, …, y_N)^⊤, X the N × (d+1) matrix with rows x_i^⊤, and ‖·‖ the Euclidean norm, the training error (in-sample error E_in) can be written as follows:

E_in(w) = (1/N) Σ_{i=1}^N ( w^⊤ x_i − y_i )²    (with ŷ_i = w^⊤ x_i)
        = (1/N) ‖Xw − y‖²
        = (1/N) (Xw − y)^⊤ (Xw − y)
        = (1/N) ( w^⊤ X^⊤ X w − 2 w^⊤ X^⊤ y + y^⊤ y ) .
Hence, minimising Ein comes down to finding
w* = argmin_{w ∈ R^{d+1}} ( w^⊤ X^⊤ X w − 2 w^⊤ X^⊤ y + y^⊤ y ) .
OLS will always deliver a predictor, even if the underlying relationship is not linear, i.e., the (linear) model is actually misspecified.
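To make this concrete, here is a minimal sketch of OLS via the normal equations X^⊤Xw = X^⊤y, assuming NumPy; the synthetic data and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): y depends linearly on two features, plus noise
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # prepend constant x0 = 1
w_true = np.array([0.7, 2.0, -1.3])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Minimising Ein(w) yields the normal equations X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # close to w_true
```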
Different (approximate) algorithms for tackling the problem of minimising the 0/1 loss in classification have been proposed in the literature — one of them is the perceptron learning algorithm (PLA), which plays an important role in the history of AI.
Later, we will also discuss other ways of tackling the problem, e.g., through minimising a
more manageable surrogate loss instead of the 0/1 loss.
The original version of PLA starts with an initial weight vector w(0) and updates this
vector iteratively.
In each iteration t = 0, 1, 2, . . ., the algorithm picks one of the examples (x i , yi ) ∈ D at
random; denote this example as (x(t), y (t)).
The current model w(t) is then updated as follows:
w(t + 1) ← w(t) + (1/2) ( y(t) − ŷ ) x(t) ,

where

ŷ = sign ⟨w(t), x(t)⟩ .
Thus, if the example is classified correctly by the current weight vector w(t), then
y (t) − ŷ = 0 and the model is simply retained.
Otherwise, (1/2)( y(t) − ŷ ) = (1/2)( y(t) − (−y(t)) ) = y(t), and the weight vector is updated in the direction of y(t) x(t).
Since the model only changes in the case of a mistake, PLA is called an error-driven
method.
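A minimal sketch of PLA; picking directly among the misclassified examples is equivalent to the description above, since picking a correctly classified example leaves w unchanged. Data is assumed to carry a leading constant feature:

```python
import numpy as np

def pla(X, y, max_iter=10_000, seed=0):
    """Perceptron learning algorithm; X has a leading column of 1s, y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])          # w(0) = 0
    for _ in range(max_iter):
        y_hat = np.sign(X @ w)
        wrong = np.flatnonzero(y_hat != y)
        if wrong.size == 0:           # no more mistakes on D: done
            return w
        i = rng.choice(wrong)         # pick a misclassified example at random
        w = w + y[i] * X[i]           # update: the factor (1/2)(y - yhat) equals y here
    return w                          # not separable (or max_iter too small)
```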
As a remarkable property of PLA, one can prove the following: If the data D is linearly
separable, which means there exists a perceptron (a weight vector w ∗ ) that classifies all
examples correctly, then PLA will only make a finite number of updates (i.e., it will
eventually reach a state in which no more mistakes are made on D).
Theorem: Assume there exists a parameter vector w* such that ‖w*‖ = 1, and some γ > 0 such that

y_i ⟨w*, x_i⟩ ≥ γ

for all i ∈ [N]. Moreover, suppose ‖x_i‖ ≤ R for all i ∈ [N]. Then the perceptron algorithm (with w(0) = 0) makes at most R²/γ² mistakes.
In the case of data that is not linearly separable, PLA is not guaranteed to converge — in
this case, it is difficult to say anything about the performance of the algorithm.
A simple extension of PLA is the pocket algorithm.
It applies the PLA update rule but avoids any deterioration by memorizing the vector ŵ
with the lowest empirical error so far; or, stated differently, it updates the current weight
vector only if it indeed reduces the number of mistakes on the training data.
Thus, in contrast to the original PLA, it needs to compute the empirical error on the
entire training data in each iteration, which makes it rather slow.
A very basic machine learning (pattern recognition) approach is the nearest neighbour
or, more generally, k-nearest neighbour (kNN) method.
Instead of inducing a model from the training data D, the kNN method simply stores the
data itself (which qualifies it as a “lazy” learning method).
If a prediction for a new query instance x ∈ X is requested, kNN first retrieves the k
nearest neighbours of x from D, namely the examples
( x_{i(1)}, y_{i(1)} ), ( x_{i(2)}, y_{i(2)} ), …, ( x_{i(k)}, y_{i(k)} )
associated with those instances x i(1) , . . . , x i(k) having the smallest distances from x.
Obviously, this step assumes the instance space X to be equipped with a suitable distance measure. A prediction for x is then obtained by aggregating the outcomes of the neighbours, e.g., by majority voting in classification or by averaging in regression.
The basic kNN method has been extended in various ways; for example, in weighted
kNN, closer instances have a higher influence in the aggregation than more remote ones.
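A minimal sketch of kNN classification with Euclidean distance and majority voting; the toy data is hypothetical:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the class of query x by majority vote among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distances to all stored examples
    nn = np.argsort(dists)[:k]                   # indices of the k nearest neighbours
    return Counter(y_train[nn]).most_common(1)[0][0]

# tiny hypothetical example
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0])))  # -> 1
```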
The idea of case base editing is to store only a subset of the observations, removing
those that may deteriorate rather than improve predictions.
Also note that the neighborhood size k as well as the distance measure can be seen as
parameters of the method; instead of predefining them, they can also be learned and
adapted to the data (e.g., metric learning).
3. The general setting
In general, the dependence of outcomes on inputs is not deterministic but noisy:

Y | x ∼ P ,

i.e., the outcome for a given instance x is considered as a random variable with (conditional) distribution P_{Y,x}(·) = P(· | x) on Y.
The quality of a prediction ŷ is measured in terms of a loss function, e.g., the 0/1 loss e(y, ŷ) = ⟦y ≠ ŷ⟧ in classification or the squared error loss e(y, ŷ) = (y − ŷ)² in regression.
In the case of a noisy target, a perfect predictor with no mistakes cannot exist.
Then, two questions naturally arise:
▶ First, what does the theoretically optimal target f now look like?
▶ Second, how to evaluate candidate hypotheses h ∈ H?
The latter question is important to decide on the optimality of a hypothesis in case f ̸∈ H.
To answer these questions, we combine the concept of a loss function with the
probabilistic modelling of input/output pairs as introduced above.
First, suppose an instance x to be fixed. What is the best prediction ŷ ∈ Y (or, more generally, ŷ ∈ Ŷ) the learner can make?
To answer this question, we adopt expected loss (error) minimisation as a rational
decision principle.
For each prediction ŷ ∈ Y, the expected loss is given by
E_{y∼P(·|x)} [ e(y, ŷ) | x ] = ∫_Y e(y, ŷ) dP(y | x) .
In the case of a discrete output space (like in classification), this expression simplifies to a
weighted sum:

E_{y∼P(·|x)} [ e(y, ŷ) | x ] = Σ_{y∈Y} e(y, ŷ) p(y | x) .
Example: For the case of classification with the 0/1 loss e(y , ŷ ) = Jy ̸= ŷ K,
E[ e(y, ŷ) | x ] = Σ_{y∈Y} e(y, ŷ) p(y | x)
                 = 0 · p(ŷ | x) + Σ_{ŷ≠y∈Y} 1 · p(y | x)
                 = P(y ≠ ŷ | x)
                 = 1 − p(ŷ | x) .
Thus, expected loss is minimised by a mode of the distribution p(· | x), i.e., an outcome
for which the probability is highest.
Ordinal scales (such as {win, tie, loss}) are often encoded in terms of natural numbers {1, 2, …, K}, and the absolute difference |y − ŷ| is adopted as an error function (L₁-loss).
This can be criticised, because the embedding assumes equidistance between neighbouring categories.
Suppose p(1 | x) = 0.4, p(2 | x) = 0.3, p(3 | x) = 0.3. Then
E[ e(y, ŷ = 1) | x ] = 0.4 · 0 + 0.3 · 1 + 0.3 · 2 = 0.9
E[ e(y, ŷ = 2) | x ] = 0.4 · 1 + 0.3 · 0 + 0.3 · 1 = 0.7
E[ e(y, ŷ = 3) | x ] = 0.4 · 2 + 0.3 · 1 + 0.3 · 0 = 1.1
More generally, one can prove that the L1 -loss is minimised (in expectation) by a median
of the distribution.
In the most general case, the loss is explicitly specified for each combination of prediction
ŷ and observation y :
e(y, ŷ)   ŷ = y₁   ŷ = y₂   ŷ = y₃
y = y₁      0        2        5
y = y₂      1        0        1
y = y₃      3        2        0
Suppose p(y1 | x) = 0.2, p(y2 | x) = 0.5, p(y3 | x) = 0.3. Then
E[ e(y, ŷ = y₁) | x ] = 0.2 · 0 + 0.5 · 1 + 0.3 · 3 = 1.4
E[ e(y, ŷ = y₂) | x ] = 0.2 · 2 + 0.5 · 0 + 0.3 · 2 = 1.0
E[ e(y, ŷ = y₃) | x ] = 0.2 · 5 + 0.5 · 1 + 0.3 · 0 = 1.5
In the case where the prediction space Ŷ does not coincide with Y, a loss needs to be specified for each combination of prediction ŷ ∈ Ŷ and observation y ∈ Y.
For example, the learner may be allowed to predict a subset of candidate outcomes, i.e., Ŷ = 2^Y; the loss function is then defined on 2^Y × Y.
As before, given a probability distribution on Y, the expected loss can be computed for each ŷ ∈ Ŷ, and the prediction minimising this expectation can be found.
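A minimal sketch reproducing the loss-matrix example above: the expected loss of each candidate prediction is a probability-weighted column sum of the loss matrix, and the optimal decision is its argmin:

```python
import numpy as np

# Loss matrix from the example: rows = observed y, columns = predicted yhat
E = np.array([[0, 2, 5],
              [1, 0, 1],
              [3, 2, 0]])
p = np.array([0.2, 0.5, 0.3])    # p(y1|x), p(y2|x), p(y3|x)

expected_loss = p @ E            # expected loss of each candidate prediction
print(expected_loss)             # [1.4, 1.0, 1.5]
print(np.argmin(expected_loss))  # 1, i.e. predict y2
```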
The (pointwise) Bayes predictor makes optimal decisions in every situation, i.e., for
every instance x ∈ X ; thus, it maps instances to expected loss minimisers:
f : x ↦ argmin_{ŷ∈Ŷ} E_{y∼P(·|x)} [ e(y, ŷ) ]
Note that, in general, the Bayes predictor is only a theoretical construct — it cannot be
computed, because P is unknown.
Moreover, being defined in a pointwise manner, it is not necessarily an element of the
hypothesis space H.
Example: Let X = {1, …, 10}, Y = {0, 1}, P(x) ≡ 1/10, and P(· | x) as follows:

x              1    2    3    4    5    6    7    8    9   10
P(y = 1 | x)   0   0.1  0.3  0.6  0.8  0.4  0.8  0.6  0.3   0

Then the (pointwise) Bayes predictor for the 0/1 loss is given as follows:

x       1  2  3  4  5  6  7  8  9  10
f(x)    0  0  0  1  1  0  1  1  0   0
In general, the best we can hope for is finding some g ≈ f . But how do we assess the
quality of a hypothesis h (in comparison to f )?
Following the same idea as before, namely, weighing the possible prediction errors of h
with their probability of occurrence, we define the generalisation performance of a
hypothesis in terms of the risk or out-of-sample error as follows:
E_out(h) = E_{(x,y)∼P} [ e(y, h(x)) ] = ∫_{X×Y} e(y, h(x)) dP(x, y)
Again, note that the out-of-sample error Eout (h) is a theoretical measure that cannot be
computed practically.
What can be computed instead is the in-sample error of h, also called empirical risk:
E_in(h) = (1/N) Σ_{i=1}^N e( y_i, h(x_i) )
The principle of empirical risk minimization (ERM) suggests finding a hypothesis with
minimal empirical (instead of true) risk:
g ∈ argmin_{h∈H} (1/N) Σ_{i=1}^N e( y_i, h(x_i) )
Does a low Ein (h) imply a low Eout (h)? Under which conditions will ERM be successful?
These are fundamental questions calling for a theory of generalisation.
4. Model induction and generalisation
Fix a hypothesis h ∈ H.
To evaluate this hypothesis, we can draw a sample D and compute the in-sample error,
which will provide us an unbiased estimate of the generalisation error Eout (h).
Exploiting concentration properties of the mean of a random variable U, assuring that the sample mean ν = (1/N) Σ_{i=1}^N U_i concentrates around the expected value μ = E(U) with high probability, one can even derive probabilistic bounds on the estimation error.
For example, the Hoeffding inequality states that
P( |ν − μ| > ϵ ) ≤ 2 exp(−2ϵ²N)
Thus, given a sufficient sample size N = |D|, the out-of-sample error of h can be
estimated very precisely.
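Solving 2 exp(−2ϵ²N) ≤ δ for N gives N ≥ ln(2/δ) / (2ϵ²). A one-line check of the required sample size for illustrative choices of ϵ and δ:

```python
import math

def hoeffding_sample_size(eps, delta):
    """Smallest N with 2*exp(-2*eps^2*N) <= delta, i.e. N >= ln(2/delta)/(2*eps^2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(hoeffding_sample_size(0.05, 0.05))  # 738: Ein within 0.05 of Eout w.p. >= 0.95
```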
Important prerequisites for the unbiasedness of the estimate and the validity of the probabilistic bound:
▶ First, D is a so-called i.i.d. sample, namely, the examples (x i , yi ) are independent and
identically distributed.
▶ Second, the hypothesis h is fixed beforehand, that is, it does not depend on any properties
of the data D.
As an illustration, consider classification with the 0/1 loss e(y, ŷ) = ⟦y ≠ ŷ⟧, and think of the data-generating process as an urn filled with marbles: a marble (instance) is red if it is misclassified by h and green otherwise.
The in-sample error ν = E_in can then be considered as the fraction of red marbles when sampling N times from the urn with replacement.
Unlike the case of hypothesis evaluation discussed above, a single h is not fixed
beforehand when learning from data.
Instead, learning is concerned with hypothesis optimisation: a presumably optimal
hypothesis g is selected from a set of candidates, specified by the hypothesis space H.
Since optimality refers to performance on the training data D, the hypothesis g is no
longer independent of the data.
Therefore, we cannot expect E_in(g) to be an unbiased estimate of E_out(g): we have E_{D∼P^N} [ E_in(h) ] = E_out(h) for fixed h, but

E_{D∼P^N} [ min_{h∈H} E_in(h) ] ≠ E_out( argmin_{h∈H} E_in(h) ) .
More specifically, since there is a tendency to select a g that performs especially well on
the training data, Ein (g) is likely to underestimate Eout (g).
Examples
Consider a case of binary classification, and suppose E_out(h) = 1/2 for all h ∈ H, i.e., no hypothesis is better than random guessing. Even then, the minimal in-sample error min_{h∈H} E_in(h) will typically be smaller than 1/2, simply by chance — the larger H, the more pronounced this effect.
As another example, consider the case where H is extremely rich and consists of all mappings X −→ Y. Then a hypothesis g with E_in(g) = 0 can always be found, regardless of how poorly it generalises.
Let any hypothesis space H be given, and let N denote the size of the training data. The
following holds for all ϵ > 0:
P( |E_in(g) − E_out(g)| > ϵ ) ≤ 4 m_H(2N) exp( −(1/8) ϵ² N )

Or, for any tolerance δ > 0,

E_out(g) ≤ E_in(g) + √( (8/N) log( 4 m_H(2N) / δ ) ) =: E_in(g) + Ω(m_H, N, δ)
Successful learning thus imposes two requirements. The first calls for a rich hypothesis space, which guarantees the existence of a g with small training error.
The second requirement calls for a (small) hypothesis space of low complexity (small
Ω(mH , N, δ)).
The complexity of hypothesis spaces H and, connected to this, the choice of the right
complexity, are central themes of machine learning.
Let

H(x₁, …, x_N) = { ( h(x₁), …, h(x_N) ) | h ∈ H }

denote the set of dichotomies generated by H on the points x₁, …, x_N; obviously,

| H(x₁, …, x_N) | ≤ 2^N .
The growth function characterises the flexibility of a hypothesis space. It counts the
maximum number of dichotomies on N points:
m_H(N) = max_{x₁,…,x_N ∈ X} | H(x₁, …, x_N) |
A break point of H is the smallest value k for which m_H(k) < 2^k.
If there is no such value, then H does not have a break point (k = ∞).
The Vapnik-Chervonenkis dimension dVC (H) of a hypothesis set H is the largest value
N for which mH (N) = 2N , i.e., the largest number of points that can be shattered by H.
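These notions can be illustrated by brute force. For instance, for interval classifiers on the real line (a hypothetical H with h(x) = +1 iff a ≤ x ≤ b), the sketch below enumerates all realisable dichotomies on n ordered points:

```python
def interval_dichotomies(n):
    """All dichotomies on n ordered points realisable by h(x) = +1 iff a <= x <= b."""
    dich = {(False,) * n}                       # empty interval: all points labeled -1
    for i in range(n):
        for j in range(i, n):                   # points i..j are labeled +1
            dich.add(tuple(i <= k <= j for k in range(n)))
    return dich

for n in range(1, 5):
    print(n, len(interval_dichotomies(n)), 2 ** n)
# 1 2 2 / 2 4 4 / 3 7 8 / 4 11 16
# m_H(3) = 7 < 2^3, so the break point is k = 3 and d_VC(intervals) = 2
```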
Note that, since m_H(N) ≤ N^{d_VC} + 1, the following bound is also valid:

E_out(g) ≤ E_in(g) + √( (8/N) log( 4 ( (2N)^{d_VC} + 1 ) / δ ) )
         = E_in(g) + Ω′(d_VC, N, δ)
Interpretation: The in-sample error is corrected by a term that depends on the model
complexity, i.e., the complexity of the hypothesis space H; the more complex H, the
smaller the in-sample error, but the higher the correction.
(Figure: error as a function of the VC dimension, showing the decreasing in-sample error, the increasing model-complexity term, and the resulting out-of-sample error.)
Analysis so far: H needs to achieve the right balance between approximating the training
data and generalizing on new data.
Also note that, since Ω(dVC , N) depends on both the VC dimension of H and the size of
the training data simultaneously, the choice of H needs to be adapted to N. In other
words, different N may call for different hypothesis spaces.
(Figures: target f, learned hypothesis g, and hypothesis space H ⊂ F.)
The function

ḡ(x) = E_D [ g^{(D)}(x) ]

can be seen as a kind of average function learned from the data.
Target f(x) = x; hypotheses of the form

h_{t,a,b}(x) = a if x ≤ t,  b if x > t .
The above derivation assumes error-free data, i.e., the variability is entirely due to the
randomness of the training data.
If the observed data is noisy, namely,

y = f(x) + ϵ ,

where ϵ is an error term with mean value 0 and standard deviation σ, the decomposition generalises to

E_D [ E_out( g^{(D)} ) ] = E_x [ bias(x) + var(x) ] + σ² ,

where bias(x) = ( ḡ(x) − f(x) )² and var(x) = E_D [ ( g^{(D)}(x) − ḡ(x) )² ].
Notice that the VC analysis does not depend on the learner A, only on H and N.
The bias-variance decomposition, on the other hand, does depend on A, because the
algorithm has an influence on which hypothesis g (D) is chosen for a given set of training
data D.
Thus, changing A will change ḡ, and therefore bias and var.
Since bias and variance cannot be computed in practice, the bias-variance decomposition
is only a conceptual tool.
To improve generalisation performance, it generally suggests two options:
(i) reducing variance without significantly increasing the bias, and
(ii) reducing bias without significantly increasing the variance.
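In a synthetic setting where f is known, bias and variance can be estimated by Monte Carlo simulation. A minimal sketch for the target f(x) = x from the example above, using constant hypotheses (the special case a = b of the h_{t,a,b} above) fitted by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x                       # target from the example above
N, runs = 5, 10_000                   # small training sets, many repetitions

x_grid = np.linspace(0, 1, 101)
preds = np.empty((runs, x_grid.size))
for r in range(runs):
    x = rng.uniform(0, 1, N)
    y = f(x)                          # noise-free data, as in the derivation
    preds[r] = y.mean()               # constant hypothesis: least-squares fit is the mean

g_bar = preds.mean(axis=0)                   # average function g(x)
bias2 = ((g_bar - f(x_grid)) ** 2).mean()    # E_x[bias(x)], here about 1/12
var = preds.var(axis=0).mean()               # E_x[var(x)], here about 1/(12 N)
print(f"bias^2 ~ {bias2:.3f}, variance ~ {var:.3f}")
```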
Typical learning curves (expected error as a function of the number of data points) for a simple (left) and a complex model (right).
5. Regularisation and validation
Roughly speaking, overfitting means fitting the training data more than warranted,
thereby generalising poorly beyond that data.
Typically, overfitting is caused by learning with a model class that is too complex.
What truly matters is how the model complexity matches, not the target function, but the quality and quantity of the training data.
Overfitting and underfitting both lead to poor generalisation.
In-sample and out-of-sample error of a simple (solid lines) versus a complex model (dashed lines).
Obviously, the learner should not attempt to fit the noisy part in the data. But how to
distinguish signal from noise?
Deterministic noise is that part of the target function that cannot be modeled, i.e., that
cannot be captured by the underlying model class.
The notion of regularisation can be motivated by recalling the VC bound, telling us that E_out(h) ≤ E_in(h) + Ω(m_H, N, δ) for all h ∈ H.
Instead of fixing a sufficiently simple hypothesis space H from the beginning, one may also start with a more complex one but favour simple hypotheses in that space over more complex ones.
Formally, this could be accomplished by minimizing a combination of Ein (h) and Ω(h),
where Ω(h) is a measure of the complexity of the individual hypothesis h.
The problem of selecting the “right” hypothesis space for the problem at hand is often
tackled by considering a family of spaces
H0 ⊂ H1 ⊂ H2 ⊂ · · · ⊂ HK = H
of increasing complexity. Formally, this can be done by representing each of these spaces
as a constrained version of H.
As an alternative to hard constraints of that kind, one may also use soft constraints that
prevent weights from getting too large while not forcing them to vanish.
More specifically, consider the case of linear models (perhaps in some feature space), and
H(C) = { h | h(x) = w^⊤ Φ(x), w^⊤ w ≤ C }

for a constant C ≥ 0.
Let w_un denote the unconstrained solution, i.e., the unrestricted empirical risk minimiser. If w_un^⊤ w_un ≤ C, then w_un ∈ H(C) and w_reg = w_un. Otherwise, E_in needs to be minimised under the equality constraint

w^⊤ w = C .
Example:
w_reg = argmin_{(w₀,w₁)} (1/N) Σ_{i=1}^N ( w₀ + w₁ x_i − y_i )²    s.t.  (w₁)² ≤ C
Starting with λ = 0 and wreg = wun = 5, increasing λ “moves” wreg toward the feasible
region (shaded in red) and hits the boundary at λ = 1.5.
Adding such a penalty to the training error yields the augmented error

E_aug(h) = E_in(h) + λ Ω(h) ,

which can be seen as a proxy of the out-of-sample error that any machine learning method ultimately seeks to minimise.
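For squared loss and Ω(h) = w^⊤w (ridge regression), the augmented error can be minimised in closed form. A minimal sketch under one common convention for scaling λ (hypothetical data; in practice the intercept w₀ is often excluded from the penalty):

```python
import numpy as np

def ridge(X, y, lam):
    """Minimiser of (1/N)||Xw - y||^2 + lam * w^T w (one common convention)."""
    N, d = X.shape
    return np.linalg.solve(X.T @ X + N * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 1))])
y = X @ np.array([1.0, 5.0]) + 0.1 * rng.normal(size=50)
for lam in [0.0, 0.1, 1.0]:
    print(lam, ridge(X, y, lam))  # weights shrink toward 0 as lam grows
```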
Testing and validation, to be discussed next, tackle the problem of estimating the
out-of-sample error in a more direct way.
We are mainly interested in how a model trained on data D will generalise, i.e., perform on new data encountered in the future.
For this purpose, part of the data is set aside as a test set of size K, and a hypothesis g⁻ is trained on the remaining N − K examples. One can easily show that the resulting test error E_test(g⁻) is an unbiased estimate of E_out(g⁻), i.e., E[ E_test(g⁻) ] = E_out(g⁻).
Left: Data is generated by f (x ) = 2.9x 4 − 1.4x 2 + 1.1x + 0.7 + σ, where σ is normal with standard
deviation 0.3. Right: Predictor (polynomials of degree 3) learned on training data (green) and
evaluated on test data (red).
At the same time, however, Etest (g − ) itself will increase (in expectation), because the size
of the training data, N − K , is reduced; the variance of g − will also increase.
If K is small, and hence N − K large, then Eout (g − ) will be smaller in expectation (and
with smaller variance), however, its estimate by Etest (g − ) will be unreliable.
The dilemma is that the first (approximate) equality calls for a small K , whereas the
second one calls for a large K .
Cross-validation (CV) is a technique that makes better use of the data, thereby
tightening the ≈ relations; in particular, it allows for deriving more reliable estimates of
Eout (g − ) for a given K .
The basic idea is to repeat the train/test split L = N/K times, each time using a
different portion of K data points for testing, and then averaging the results.
The data is partitioned into folds D₁, …, D_L, yielding hypotheses g₁⁻, …, g_L⁻.
Each hypothesis g_l⁻ is obtained on D \ D_l (all folds except the l-th one), while D_l is used to get E_test(g_l⁻); the final cross-validated estimate is then given by

E_cv = (1/L) Σ_{l=1}^L E_test( g_l⁻ ) .
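A minimal sketch of L-fold cross-validation; train_fn and test_fn are assumed interfaces for an arbitrary learner and error measure:

```python
import numpy as np

def cross_validate(X, y, train_fn, test_fn, L=5, seed=0):
    """L-fold CV: returns the average test error E_cv over the L held-out folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, L)
    errors = []
    for l in range(L):
        test = folds[l]
        train = np.concatenate([folds[m] for m in range(L) if m != l])
        g_minus = train_fn(X[train], y[train])             # trained on D \ D_l
        errors.append(test_fn(g_minus, X[test], y[test]))  # evaluated on D_l
    return np.mean(errors)

# usage, e.g., with the OLS sketch from above:
# E_cv = cross_validate(X, y,
#     train_fn=lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0],
#     test_fn=lambda w, X, y: np.mean((X @ w - y) ** 2))
```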
Data is generated by f (x ) = 2.9x 4 − 1.4x 2 + 1.1x + 0.7 + σ, where σ is normal with standard deviation
0.3 (upper left); the other pictures show the functions (polynomials of degree 4) fitted in a 5-fold CV.
Cross-validation
Note that E_cv, obtained by averaging, does not estimate the generalisation performance of a concrete predictor; instead, it estimates a property of the learner, namely, the (expected) out-of-sample error of a hypothesis trained on a sample of size N − K.
Thanks to the averaging, it is more reliable than an estimate obtained from a single
validation set, which is reflected by a smaller variance.
The true variance of the estimate is difficult to compute, because the different training
sets are highly overlapping, and hence the Etest (gl− ) are not independent of each other.
The hypotheses gl− are merely used for estimating generalisation performance, but of
course, the learner is not obliged to adopt any of these in the end.
Instead, it makes sense to retrain and learn a hypothesis g on the entire data D.
Still, a large K increases the discrepancy between Eout (g) and Eout (g − ).
At the same time, to get an unbiased performance estimate, a predictor should not be
evaluated on data it has been trained on.
Consequently, the learner needs to put aside another part of the training data for
validation:
Dval ⊂ Dtrain = D \ Dtest
The validation data is used to assess different options θ and to choose the presumably
best one.
Like in the case of testing, the ideas of averaging over multiple splits of the data
(cross-validation) and of restoring the original data in the end can be applied, leading to a
procedure of nested cross-validation (with an outer loop for testing and an inner loop
for validation).
Nested cross-validation
The question of how to split the data for training and validation is also relevant for the
inner loop of the cross-validation scheme.
In particular, in the case of crossing learning curves, an inappropriate split may lead to
suboptimal model choice.
Validation data is somehow in-between training and test data as far as the “level of
contamination” is concerned.
It is not as clean as test data (influences decisions made during training) but not as
contaminated as training data (hypotheses are never evaluated on data on which they
have been trained).
6. Nonlinear models
A linear model w^⊤x is linear in both the feature values x_i and the parameters w_i.
For the algorithms, however, it is only the latter that really matters, because the parameters play the role of the variables in the corresponding learning (optimisation) problems, whereas the data (feature values x_i) are treated as constants.
Formally, consider a feature transform in the form of a mapping from the original
instance space to a feature space Z:
Φ : X −→ Z
This function is called feature map or linearisation function, and Z the feature space or
linearisation space. Φ maps every instance to a feature representation:
x = (x₀, x₁, …, x_d)^⊤ ↦ z = (z₀, z₁, …, z_d̃)^⊤ = ( ϕ₀(x₁, …, x_d), ϕ₁(x₁, …, x_d), …, ϕ_d̃(x₁, …, x_d) )^⊤
A linear model in the feature space Z then takes the form

w^⊤ z = w^⊤ Φ(x) = Σ_{i=0}^{d̃} w_i z_i .

The complexity of the induced hypothesis space H_Φ is determined by the dimensionality of Z; more specifically,

d_VC(H_Φ) ≤ d̃ .
The choice of a proper transform Φ is closely connected to the discussion about the
approximation-generalisation tradeoff.
Or, stated differently, the in-sample error will be reduced, but the VC-dimension will
grow; therefore, the bound on the out-of-sample error may or may not become tighter.
By a linear model or linear learning machine (LLM) we mean a model that is linear in
the parameters to be determined.
Example: For instances x = (x1 , x2 ) ∈ R2 , the (regression) model
y = h(x) = w0 + w1 x1 + w2 x2 + w3 (x1 )2 + w4 (x2 )2 + w5 (x1 x2 )
is linear in w = (w0 , . . . , w5 )⊤ . It can be written compactly as
y = h(x) = ⟨w, Φ(x)⟩ = w ⊤ Φ(x) = w ⊤ z ,
where

z = Φ(x) = ( ϕ₁(x), ϕ₂(x), ϕ₃(x), ϕ₄(x), ϕ₅(x), ϕ₆(x) )^⊤ = ( 1, x₁, x₂, (x₁)², (x₂)², x₁x₂ )^⊤ .
In the previous example, if the weight vector is given by w = (−6, 0, 0, 2, 4, −3)^⊤, then the preimage h⁻¹(0) = { x | ⟨w, Φ(x)⟩ = 0 } is an ellipse in X = R², i.e., the linear decision boundary { z | ⟨w, z⟩ = 0 } in Z = R⁶ induces a nonlinear decision boundary in X.
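A small numerical check of this example (the test points are chosen arbitrarily):

```python
import numpy as np

def phi(x):
    """Feature map Phi from the example: (1, x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([-6.0, 0.0, 0.0, 2.0, 4.0, -3.0])

# The boundary <w, Phi(x)> = 0 is the ellipse 2*x1^2 + 4*x2^2 - 3*x1*x2 = 6
for x in [(0.0, 0.0), (2.0, 0.0), (0.0, 1.5)]:
    print(x, np.sign(w @ phi(np.array(x))))   # -1 inside the ellipse, +1 outside
```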
Kernel-based learning machines learn linear models in a feature space (typically of very high dimension), albeit in an indirect manner, and possibly even without explicitly knowing this space.
Kernel trick: If a learning algorithm can be implemented in such a way that it only operates on inner products ⟨z, z′⟩, but never on individual data points z, then this algorithm can be run in X instead of Z: thanks to the identity ⟨Φ(x), Φ(x′)⟩ = k(x, x′), every inner product in Z can be computed by the kernel function k in X.
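For instance, for the degree-2 polynomial kernel k(x, x′) = (1 + ⟨x, x′⟩)² on x ∈ R², an explicit feature map is known; the sketch below verifies the identity ⟨Φ(x), Φ(x′)⟩ = k(x, x′) numerically:

```python
import numpy as np

def phi2(x):
    """Degree-2 polynomial feature map for x in R^2 (note the sqrt(2) scalings)."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

def k(x, xp):
    """Polynomial kernel of degree 2."""
    return (1 + x @ xp) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(phi2(x) @ phi2(xp), k(x, xp))  # identical values
```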
Artificial neural networks (ANN), which are inspired by the processing of information in
the human brain, constitute another important class of nonlinear models.
These models became quite popular in the 1980s and 1990s, but were then displaced by kernel-based learning machines.
More recently, they reappeared in the guise of deep neural networks, which are now
considered as state of the art in applications such as image, speech, and natural language
processing.
Here, we only consider the basic version of ANNs, notably the multi-layer perceptron
(there are other lectures specifically dedicated to deep learning).
An MLP consists of neurons arranged in layers l = 0, 1, …, L and connected by weights, where w_{i,j}^{(l)} is the weight of the connection from the i-th neuron in layer l − 1 to the j-th neuron in layer l, and d^{(l)} is the width of the l-th layer (the 0-th neuron of each layer accounts for a constant bias of 1).
Inference is accomplished by successively computing

x_j^{(l)} = θ( s_j^{(l)} ) = θ( Σ_{i=0}^{d^{(l−1)}} w_{i,j}^{(l)} x_i^{(l−1)} ) ,

where θ is a (nonlinear) activation function.
Like for the linear perceptron, training an MLP by minimising (regularised) in-sample
error can be accomplished by gradient descent, i.e., by finding the derivative of the
in-sample error
E_in(w) = Σ_{n=1}^N e( h(x_n), y_n )
with respect to w, and making a (small) step in the negative direction of the gradient:
w ← w − η ∇w Ein (w) .
Instead of processing all training examples before making a gradient step, it is often more
efficient to process smaller batches at a time.
Batches can be sampled at random or processed systematically; an epoch of training
means that all examples have been processed once.
What this algorithm needs are the entries of the vector ∇_w E_in(w), i.e., the partial derivatives

∂e(w) / ∂w_{i,j}^{(l)}

of the error (as a function of w) with respect to the weights w_{i,j}^{(l)} for all i, j, l.
Since

∂e(w) / ∂w_{i,j}^{(l)} = ( ∂e(w) / ∂s_j^{(l)} ) · ( ∂s_j^{(l)} / ∂w_{i,j}^{(l)} ) ,

and

∂s_j^{(l)} / ∂w_{i,j}^{(l)} = x_i^{(l−1)} ,

we only need

δ_j^{(l)} = ∂e(w) / ∂s_j^{(l)} .
A common choice for the activation function is

θ(s) = tanh(s) = ( exp(s) − exp(−s) ) / ( exp(s) + exp(−s) ) ,

whose derivative is

θ′(s) = dθ(s)/ds = 1 − θ²(s) .
For the final layer (l = L, j = 1), the derivative is given directly:

δ_1^{(L)} = ∂e(w) / ∂s_1^{(L)} = ∂e( θ(s_1^{(L)}), y_n ) / ∂s_1^{(L)} .

For the hidden layers, the δ's are obtained by the backward recursion

δ_i^{(l−1)} = ( 1 − ( x_i^{(l−1)} )² ) Σ_{j=1}^{d^{(l)}} w_{i,j}^{(l)} δ_j^{(l)} .

In this way, the error e( θ(s_1^{(L)}), y_n ) at the output layer is propagated back through the network, layer by layer.
For example, in the case of regression, the identity s 7→ s can be used for activation, and
the error function is typically taken as e(y , ŷ ) = (y − ŷ )2 .
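As an illustration, here is a minimal sketch of a one-hidden-layer MLP for regression (tanh hidden activation, identity output, squared error), trained by full-batch gradient descent with backpropagation; data, architecture, and hyperparameters are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data: y = sin(x) + noise
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)

d_in, d_hid = 1, 10
W1 = rng.normal(scale=0.5, size=(d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.5, size=(d_hid, 1));    b2 = np.zeros(1)

eta = 0.05
for epoch in range(2000):
    # forward pass
    S1 = X @ W1 + b1            # pre-activations of the hidden layer
    X1 = np.tanh(S1)            # hidden activations
    Y_hat = X1 @ W2 + b2        # identity activation at the output

    # backward pass (mean squared error)
    N = len(X)
    delta2 = 2 * (Y_hat - y) / N            # d e / d s^(L)
    delta1 = (1 - X1**2) * (delta2 @ W2.T)  # backward recursion from above

    # gradient step
    W2 -= eta * X1.T @ delta2; b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1;  b1 -= eta * delta1.sum(axis=0)

print("final MSE:", float(np.mean((Y_hat - y) ** 2)))
```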
Essentially, this means that standard linear regression (OLS) is done on the
linearisation, i.e., in the embedding space Z rather than the original data space.
In the case of binary classification, one often produces predictions ŷ ∈ (0, 1), which are then interpreted as probabilities (of the positive class); this can be accomplished by the logistic (sigmoid) activation

s ↦ 1 / (1 + exp(−s)) .
A suitable error function in this case is the logistic loss (log-loss, cross-entropy loss) e(y, ŷ) = −y log(ŷ) − (1 − y) log(1 − ŷ) for y ∈ {0, 1}.
This approach comes down to doing logistic regression in the embedding space.
The idea of predicting a probability distribution (Bernoulli in the binary case) can be
generalised to the case of K > 2 classes, e.g., by means of an output layer with K
neurons (instead of a single one) and a softmax transformation:
p̂_i = ŷ_i = exp( s_i^{(L)} ) / ( exp( s_1^{(L)} ) + · · · + exp( s_K^{(L)} ) )
Decision trees are based on the principle of recursive partitioning: The instance space is (recursively) partitioned in such a way that each subspace is associated with a single class.
Example of a decision tree for the adult data (predict whether a person has a yearly income ≥ 50K or < 50K).
Each path from the root to a leaf node can be considered as an IF-THEN rule; hence, decision trees correspond to rule systems with a specific structure.
Splitting with discrete and continuous features
For a discrete feature X_i, a split creates one descendant node per value: all training examples for which X_i takes the value x_{i,j}, i.e., which fulfil the logical predicate (X_i = x_{i,j}), are assigned to the j-th descendant.
For continuous features, one usually uses predicates of the form (Xi < t) and (Xi ≥ t),
which means that the splits are binary.
As candidates for the threshold t, one tries values located in-between two values of Xi in
the training data.
Concrete algorithms for decision tree induction (e.g. C5.0, CART) follow the basic
recursive partitioning scheme (see algorithm RecPart), but differ with regard to various
technical aspects and extensions.
These algorithms can be seen as a greedy search in the space of all decision trees; thus,
they do not guarantee any optimality, neither in terms of accuracy nor size.
In general, the problem of learning optimal decision trees is NP-complete.
More recent methods seek to minimise an augmented error (using the number of leaves for regularisation), based on
▶ mathematical programming (e.g., mixed integer programming) and SAT solvers,
▶ stochastic search methods,
▶ customised dynamic programming algorithms combined with branch-and-bound.
One important extension is a post-processing step called pruning: induced trees are
simplified to prevent over-fitting.
Reduced Error Pruning:
▶ Retain part of the training data for validation.
▶ Learn decision tree with the remaining training data.
▶ For each inner node, use the validation data to determine the performance improvement (if
any) if that node would become a leaf node (label through majority vote).
▶ The node with the maximal improvement is turned into a leaf.
▶ This process is repeated until any further pruning would deteriorate performance.
Some machine learning algorithms can only handle discrete features, necessitating
discretisation of continuous features as a pre-processing step.
Most discretisation techniques partition the domain of a numerical variable into a finite
number of intervals, treating each interval as a categorical value.
▶ Equi-width binning: each interval (bin) has the same width
▶ Equi-frequency binning: each bin contains (approximately) the same number of data points
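A minimal sketch of both binning strategies using NumPy (the data vector is hypothetical):

```python
import numpy as np

x = np.array([1.0, 1.2, 1.3, 2.0, 4.5, 4.8, 9.0, 9.5])

# Equi-width: 4 bins of equal width over [min, max]
width_edges = np.linspace(x.min(), x.max(), 5)
print(np.digitize(x, width_edges[1:-1]))   # bin index per value: [0 0 0 0 1 1 3 3]

# Equi-frequency: 4 bins with (approximately) equal counts, via quantiles
freq_edges = np.quantile(x, [0.25, 0.5, 0.75])
print(np.digitize(x, freq_edges))          # two values per bin: [0 0 1 1 2 2 3 3]
```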
The principle of decision trees (classification trees) can also be applied to regression.
In the simplest case, leaf nodes are marked with the mean of the relevant values of the
target variable (leading to a piecewise constant function).
Variance reduction is often used as a split criterion:
Gain = V − Σ_{j=1}^m ( |D_j| / |D| ) · V_j

where

V = (1/|D|) Σ_{(x,y)∈D} (y − m)² ,    V_j = (1/|D_j|) Σ_{(x,y)∈D_j} (y − m_j)² ,

m = (1/|D|) Σ_{(x,y)∈D} y ,    m_j = (1/|D_j|) Σ_{(x,y)∈D_j} y .
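A minimal sketch of this criterion; note that np.var computes the population variance (division by n), matching the formulas above:

```python
import numpy as np

def variance_gain(y, groups):
    """Variance reduction of a split: V - sum_j |D_j|/|D| * V_j."""
    V = np.var(y)   # np.var subtracts the mean m internally
    return V - sum(len(g) / len(y) * np.var(g) for g in groups)

y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
split = [y[:3], y[3:]]          # candidate split into two subsets
print(variance_gain(y, split))  # large gain: the split separates low from high values
```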
7. Model dichotomisation and ensembles
Many ML methods are inherently restricted to the binary case and cannot directly solve
problems with multiple classes (multi-class, multinomial, polychotomous classification):
A common remedy is to decompose the multi-class problem into a set of binary problems, each of which is solved by a binary model. At prediction time, a query x is submitted to each of the binary models, and their predictions are combined into a prediction for the original multi-class problem.
Such a decomposition may even come with increased predictive accuracy compared to genuine multi-class classifiers, because the binary problems are simpler (“divide and conquer”).
In one-vs-rest decomposition, a binary model h_k is trained for each class y_k on relabeled training data, where

c_n^{(k)} = −1 if c_n ≠ y_k ,  +1 if c_n = y_k .
To this end, any binary classifier can be used, called base learner in this context.
One-vs-rest decomposition
Suppose all models have been trained, and a new query instance x is submitted for
classification.
Ideally, there is one k such that hk (x) = +1 while hi (x) = −1 for all i ̸= k; in that case,
the predicted class would be yk .
Conflict: Either none of the predictions is positive, or more than one of the models claim
the instance for the class it represents.
Therefore, one typically trains scoring classifiers hk : X −→ R; then, hk (x) can be seen
as the strength or the confidence that the class label of x is yk .
Given predictions of that kind, a natural classification rule is to choose yk such that
k = argmax_{1≤i≤K} h_i(x) .
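A minimal sketch of one-vs-rest training and prediction; fit_scorer stands for an arbitrary base learner returning a scoring function x ↦ h_k(x) and is an assumed interface:

```python
import numpy as np

def train_one_vs_rest(X, y, classes, fit_scorer):
    """Train one scoring classifier h_k per class (+1 for class y_k, -1 for the rest)."""
    return [fit_scorer(X, np.where(y == k, 1, -1)) for k in classes]

def predict_one_vs_rest(models, x):
    """Predict the class whose model outputs the highest score h_k(x)."""
    return int(np.argmax([h(x) for h in models]))

# fit_scorer could be, e.g., a linear model trained by OLS or logistic regression.
```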
Thus, if the underlying base learner has a time complexity of O(N^α), the overall complexity of training an OvR classifier is O(K · N^α).
In the all-pairs (one-vs-one) approach, one model h_{i,j} is trained for each pair of classes (y_i, y_j), 1 ≤ i < j ≤ K: the model h_{i,j} is trained on the set of data D_{i,j} consisting of all examples (x_n, c_n^{(i,j)}) such that c_n^{(i,j)} ≠ 0, where c_n^{(i,j)} = +1 if c_n = y_i, −1 if c_n = y_j, and 0 otherwise.
At classification time, a query x is submitted to all h_{i,j}, and the predictions h_{i,j}(x) are combined into an overall prediction.
A prediction h_{i,j}(x) is interpreted as a “vote” in favor of y_i or y_j.
Suppose predictions are in [0, 1], and for all 1 ≤ i < j ≤ K, let h_{j,i}(x) = 1 − h_{i,j}(x); the score of class y_i can then be defined as s_i = Σ_{j≠i} h_{i,j}(x), and the class with the highest score is predicted.
Thus, although the number of models is now quadratic instead of linear in K , the total
number of examples produced is of the same order as for one-vs-rest.
More importantly, since the individual training sets are smaller, the all-pairs approach
tends to be even more efficient in the case of base learner complexity O(N α ) with α > 1,
i.e., if the running time of the base learner is super-linear.
Note, however, that the quadratic number of models increases space complexity (each
of them needs to be stored) as well as time complexity in the prediction step.
In principle, a base learner could use any partition of the original classes into positive,
negative, and neutral (ignored) examples.
One-vs-rest and all-pairs are special cases of a more general technique based on so-called
error correcting output codes.
Suppose a code matrix M of size K × L to be given, where the entry M(k, ℓ) is +1 if class y_k is considered as positive by the base learner h_ℓ, −1 if y_k is considered as negative, and 0 if that class is simply ignored by the learner.
The class y_k is then encoded by the k-th row

m_k = ( M(k, 1), …, M(k, L) )

of the matrix M.
The final (multi-class) prediction is then given by the class yk that minimises the
(generalised) Hamming distance between the code vector and the prediction:
k = argmin_{1≤i≤K} (1/2) Σ_{ℓ=1}^L | M(i, ℓ) − h_ℓ(x) |    (⋆)
The notion of “error correction” refers to the fact that the ECOC scheme may tolerate a
certain number of incorrect binary predictions hℓ (x) without affecting the correctness of
the overall (multi-class) prediction.
Suppose M is chosen in such a way that the Hamming distance between each pair of classes y_i and y_j is at least d.
Each incorrect binary prediction increases the distance between the prediction vector v = (h₁(x), …, h_L(x)) and the code vector m_k of the true class by at most 1.
Thus, as long as the number of mistakes does not exceed (d − 1)/2, the correct class will still be selected by Hamming decoding (⋆).
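A minimal sketch of Hamming decoding (⋆); the code matrix below is hypothetical, with pairwise row distance d = 4, so a single flipped binary prediction is tolerated:

```python
import numpy as np

def ecoc_decode(M, preds):
    """Hamming decoding: class whose code row is closest to the binary predictions."""
    dists = 0.5 * np.abs(M - preds).sum(axis=1)   # generalised Hamming distance
    return int(np.argmin(dists))

# Example: K = 4 classes, L = 6 binary learners
M = np.array([[+1, +1, +1, -1, -1, -1],
              [+1, -1, -1, +1, +1, -1],
              [-1, +1, -1, +1, -1, +1],
              [-1, -1, +1, -1, +1, +1]])
preds = np.array([+1, -1, -1, +1, -1, -1])  # one binary prediction flipped
print(ecoc_decode(M, preds))                # still decodes to class index 1
```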
Another decomposition technique is based on nested dichotomies: the set of classes is recursively split into two meta-classes, and a binary classifier is trained for each split.
Example of a nested dichotomy for a 5-class problem. The first classifier (C1 ) is supposed to
separate class C from the meta-class {A, B, D, E }, i.e., the union of classes A, B, D, and E ;
likewise, the second classifier separates classes {A, D} from {B, E }, the third classifier class A
from D, and the fourth B from E .
The performance of the multi-class classifier eventually produced may strongly depend
on the structure of the dichotomy, because the latter specifies the set of binary
problems that need to be solved.
Given a query x, the predictions obtained for the different dichotomies are aggregated into an overall prediction through majority voting or, in the case of probabilities, averaging.
More generally, in machine learning, the notion of ensemble refers to a set of models
h1 , . . . , hM that have been trained on the same data.
Aggregation depends on the learning problem (e.g., majority voting, mean, median).
Idea: Asking for the opinion of several experts is better than only asking a single one.
However, experts need to be sufficiently independent of each other (in spite of being
trained on the same data).
Main questions:
▶ Diversification: How to induce sufficiently different models from the same data?
▶ Aggregation: How to combine the predictions of the individual models?
Applying the same learner to the same data will always yield the same model (unless the
learning algorithm is randomised).
There are basically two possibilities to produce different models: Either by modifying the
learning process or the data.
One of the most basic versions of boosting is the AdaBoost (Adaptive Boosting)
algorithm.
While bagging essentially reduces the variance part of the error (although it can also
reduce bias), boosting potentially reduces both bias and variance.
However, boosting is much more susceptible to noise in the data: while bagging will hardly ever increase the error (compared to a single model), no such guarantee can be given for boosting.
A decision stump is a decision tree with only a single split (two leaf nodes).
Example: AdaBoost learns h(x) = sgn( 0.8 h₁(x) + 0.35 h₂(x) + 0.55 h₃(x) ) with example weights w^{(0)} = (1/3, 1/3, 1/3), w^{(1)} = (1/2, 1/4, 1/4), w^{(2)} = (1/3, 1/6, 1/2).
In principle, any type of learner can be used as a weak learner in boosting, as long as it is
better than random guessing.
Another requirement, however, is that the learner accepts weighted training examples
(most algorithms can be generalised correspondingly).
Typical examples of learners used in boosting include shallow decision trees (e.g.,
decision stumps) and linear classifiers.
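A compact sketch of AdaBoost with decision stumps; the coefficient and weight-update formulas are the standard ones, not necessarily the exact variant shown in the example above:

```python
import numpy as np

def fit_stump(X, y, w):
    """Best decision stump (feature, threshold, polarity) under example weights w."""
    best = (None, None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (+1, -1):
                pred = np.where(X[:, j] < t, -pol, pol)
                err = w[pred != y].sum()        # weighted 0/1 error
                if err < best[3]:
                    best = (j, t, pol, err)
    return best

def adaboost(X, y, T=10):
    """AdaBoost with stumps; y in {-1, +1}. Returns stumps with coefficients alpha."""
    N = len(y)
    w = np.full(N, 1 / N)                       # uniform initial weights
    ensemble = []
    for _ in range(T):
        j, t, pol, err = fit_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] < t, -pol, pol)
        w = w * np.exp(-alpha * y * pred)       # increase weights of mistakes
        w = w / w.sum()
        ensemble.append((j, t, pol, alpha))
    return ensemble

def predict(ensemble, X):
    s = sum(a * np.where(X[:, j] < t, -pol, pol) for j, t, pol, a in ensemble)
    return np.sign(s)
```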
Instead of modifying the training data, model diversity can also be achieved by
manipulating the learner.
Especially effective for learners that are very sensitive toward (minor) changes of the
learning process — the probably most prominent example is again decision trees.
Random forests combine bagging with randomisation of splits in decision tree learning.
The best attribute is chosen among a randomly selected subset of the attributes (a common choice for the size of the candidate set is √d, with d the total number of attributes).
This leads to a reduction of the correlation between the trees forming the ensemble.
Extreme case: split attributes are chosen completely at random.
In stacking, a meta-learner h is trained on the predictions of the base learners h₁, …, h_M. Original examples (x, y) used to produce the examples for the meta-learner should not have already been used for training the base learners.
Prediction for a query instance x:

ŷ = h( h₁(x), h₂(x), …, h_M(x) )

Optionally, the original instance x can be given as an additional input to the meta-learner.
8. Semi-supervised learning
Labeling of data often comes with a certain cost, while unlabeled data is more readily
available (for example, wet lab experiments in biology, human annotation in text or image
processing).
A natural idea is to harness the unlabeled data to improve (supervised) learning.
Methods making use of both labeled and unlabeled data are called semi-supervised
learning methods.
Often, these methods improve prediction accuracy, although theoretical guarantees are
difficult (if not impossible) to obtain.
In the following, we will assume that only the first L ≪ N training instances x₁, …, x_L are labeled, while the rest is unlabeled:

D_L = ( (x_i, y_i) )_{i=1}^L    and    D_N = ( x_j )_{j=L+1}^N .
Self-training algorithms proceed from the assumption that their own predictions on
instances in DN are correct, at least those on which they are sufficiently sure.
Thus, starting with D = DL , such algorithms iterate the following steps:
▶ Learn a model h on the current training data D.
▶ Use h to label those instances x_j ∈ D_N \ D on which the prediction is sufficiently certain, i.e., set y_j = h(x_j).
▶ Add these newly labeled examples to D (see the sketch below).
Instead of simply adding the newly labeled training instances, these examples could be weighted; as in boosting, this of course requires learning algorithms that are able to deal with weighted examples.
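A minimal sketch of the self-training loop; fit and predict_proba are assumed interfaces of the underlying learner (labels encoded as 0, …, K−1 matching the probability columns), and tau is the certainty threshold:

```python
import numpy as np

def self_training(X_lab, y_lab, X_unlab, fit, predict_proba, tau=0.95, max_rounds=10):
    """Iteratively add confidently self-labeled examples (confidence threshold tau)."""
    X, y = X_lab.copy(), y_lab.copy()
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        h = fit(X, y)                       # learn on current training data D
        proba = predict_proba(h, X_unlab)   # class probabilities, shape (n, K)
        sure = proba.max(axis=1) >= tau     # sufficiently certain predictions
        if not sure.any():
            break
        X = np.vstack([X, X_unlab[sure]])
        y = np.concatenate([y, proba[sure].argmax(axis=1)])
        X_unlab = X_unlab[~sure]            # remove newly labeled instances
    return fit(X, y)
```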
The idea of multi-view learning is to look at an object (e.g., a website) from two (or
more) different viewpoints (e.g., the pictures and the text on the website).
Formally, suppose the instance space X to be split into two parts, so that an instance is represented in the form

x_i = ( x_i^{(1)}, x_i^{(2)} ) .
Co-training proceeds from the assumption that each view alone is sufficient to train a good classifier and, moreover, that x_i^{(1)} and x_i^{(2)} are conditionally independent given the class. Two classifiers are then trained, one per view, and each classifier labels unlabeled instances to augment the training data of the other.
There are many variants of co-training, such as example weighting, multiview learning
with majority vote labeling, etc.
Multiview learning (with m learners) can also be realised via regularisation:
min_{h₁,…,h_m ∈ H}  Σ_{v=1}^m Σ_{i=1}^L e( y_i, h_v(x_i) )  +  λ₁ Σ_{v=1}^m ‖h_v‖²  +  λ₂ Σ_{u,v=1}^m Σ_{j=L+1}^N ( h_u(x_j) − h_v(x_j) )² ,

where the first term measures the incorrectness of each h_v on the labeled data and the last term the disagreement between h_u and h_v on the unlabeled data.
Minimising a joint loss function of this kind encourages the learners, not only to predict
correctly on the labeled data, but also to agree on the unlabeled data.
Generative methods can be applied in the semi-supervised context in a quite natural way,
because they can model the probability of observing an instance x j as a marginal
probability:

P(x) = Σ_{ȳ∈Y} P(x, ȳ)
The model parameters are then fitted by maximising the likelihood of the labeled and unlabeled data together:

θ* = argmax_{θ∈Θ} P(D_L, D_N | θ)
Methods of that kind are formally well-grounded and often very effective, provided the
model assumptions are (approximately) correct; however, they may become
computationally complex, and the (log-)likelihood function may have local optima.
Generative methods are closely connected to methods based on clustering (eventually, unlabeled data provides an idea of the distribution of the data), which are computationally less complex but purely heuristic.
In the simplest version, cluster-and-label methods simply work as follows:
▶ A clustering algorithm is applied to both labeled and unlabeled instances,
▶ and each cluster is then completely labeled by applying the majority rule.
Instead of approximating the distribution of the data by means of standard clustering, the
(topological) structure of the data space can also be represented in terms of a graph.
Each data point corresponds to a node, and two nodes are connected by a (weighted)
edge if they are “similar” to each other. The assumption is that similar data points tend
to have the same class label.
Based on this assumption, the given labels (coming from examples (x i , yi ) ∈ DL ) can be
propagated over the whole graph.
J.E. van Engelen and H.H. Hoos. A survey on semi-supervised learning. Machine
Learning, 109:373–440, 2020. DOI 10.1007/s10994-019-05855-6.