Lecture 16: Boosting — Applied ML
Contents
Lecture 16: Boosting
16.1. Defining Boosting
16.2. Structure of a Boosting Algorithm
16.3. Adaboost
16.4. Ensembling
16.5. Additive Models
16.6. Gradient Boosting
In this lecture, we will cover a new class of machine learning algorithms based on
an idea called Boosting. Boosting is an effective way to combine the predictions
from simple models into more complex and powerful ones that often attain state-
of-the-art performance on many machine learning competitions and benchmarks.
We will begin by defining boosting and seeing how this concept relates to our
previous lectures about bagging, which we saw in the context of random forests
and decision trees.
16.1.1. Review
16.1.1.1. Review: Overfitting
Recall that we saw in our lecture on decision trees and random forests that a
common machine learning failure mode is that of overfitting:
A very expressive model (e.g., a high degree polynomial) fits the training
dataset perfectly.
The model also makes wildly incorrect predictions outside this dataset and doesn't generalize.
Recall also that we addressed overfitting via bagging: we train an ensemble of models on bootstrapped samples of the training data and average their predictions.

ensemble = []
for i in range(n_models):
    # collect a bootstrapped sample of the data and fit a model to it
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = Model().fit(X_i, y_i)
    ensemble.append(model)
The opposite failure mode is underfitting:

The model is too simple to fit the data well (e.g., approximating a high degree polynomial with linear regression).
As a result, the model is not accurate on the training data and is not accurate on new data.
16.1.2.1. Boosting
The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

Each $g_t$ fits the points where the previous models made errors.
Let's now describe the structure of a boosting algorithm more fully.

16.2. Structure of a Boosting Algorithm
Step 1: Fit a weak learner $g_0$ on the dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$. Let $f = g_0$.

Step 2: Compute weights $w^{(i)}$ for each $i$ based on the model predictions $f(x^{(i)})$ and the targets $y^{(i)}$. Give more weight to the points with errors.

Step 3: Fit another weak learner $g_1$ on $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ with weights $w^{(i)}$, which place more emphasis on the points on which the existing model is less accurate.
At each step, $f_t$ becomes more expressive since it is the sum of a larger number of weak learners that are each accurate on a different subset of the data.
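This loop can be summarized in a short sketch. The WeakLearner class (whose fit accepts per-example weights) and the compute_weights helper implementing Step 2 are hypothetical placeholders here, not a specific library API; a concrete instantiation, Adaboost, follows below.

import numpy as np

def boost(X, y, T, WeakLearner, compute_weights):
    """Generic boosting loop following the steps above (sketch)."""
    n = len(y)
    w = np.ones(n) / n                     # uniform data weights to start
    ensemble = []                          # list of (alpha_t, g_t) pairs
    f_X = np.zeros(n)                      # current ensemble predictions f(x^(i))
    for t in range(T):
        g = WeakLearner().fit(X, y, sample_weight=w)   # fit a weak learner on the weighted data
        alpha = 1.0                        # model weight; algorithm-specific (e.g., Adaboost)
        ensemble.append((alpha, g))
        f_X = f_X + alpha * g.predict(X)   # update the ensemble predictions
        w = compute_weights(w, y, f_X)     # give more weight to the points with errors
    return ensemble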
16.3. Adaboost

Given its historical importance, we begin with an introduction to Adaboost and use it to further illustrate the structure of boosting algorithms in general.
Today, there exist many algorithms that are considered types of boosting, even
though they’re not derived from the perspective of theoretical ML.
Step 1: Fit a weak learner $g_t$ on the dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ with data weights $w^{(i)}$ (initially uniform).

Step 2: Compute the weighted misclassification error of $g_t$:

$$e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g_t(x^{(i)})\}}{\sum_{i=1}^n w^{(i)}}.$$

Recall that $\mathbb{I}\{\cdot\}$ is an indicator function that takes on value 1 when the condition in the brackets is true and 0 otherwise.

Notice that if all the weights $w^{(i)}$ are the same, then this is just the misclassification rate. When each weight can be different, we get a "weighted" misclassification error.

Step 3: Compute the model weight and update the overall predictor:

$$\alpha_t = \log[(1 - e_t)/e_t], \qquad f \gets f + \alpha_t g_t.$$
Notice that $e_t$ intuitively measures how much influence $g_t$ should have on our overall predictor $f$. As $e_t$ approaches zero, meaning few of the highly weighted points were misclassified by $g_t$, $\alpha_t$ will be large, allowing $g_t$ to have a bigger contribution to our predictor. As $e_t$ approaches $1/2$ (recall that the $g_t$'s are 'weak learners'), this means that $g_t$ is not doing much better than random guessing. This will cause $\alpha_t$ to be close to zero, meaning that $g_t$ is contributing little to our overall prediction function.
Step 4: Compute new data weights $w^{(i)} \gets w^{(i)} \exp\left[\alpha_t \mathbb{I}\{y^{(i)} \neq f(x^{(i)})\}\right]$.
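Putting these steps together, here is a from-scratch sketch of Adaboost that uses depth-1 decision trees (stumps) from sklearn as weak learners. It assumes labels $y \in \{-1, 1\}$ and computes the error and reweighting from the newly fitted weak learner's predictions; it is an illustration, not a replacement for sklearn's AdaBoostClassifier.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Adaboost sketch; y must take values in {-1, +1}."""
    n = len(y)
    w = np.ones(n) / n                            # uniform data weights to start
    learners, alphas = [], []
    for t in range(T):
        # Step 1: fit a weak learner (a decision stump) on the weighted data
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = g.predict(X)
        # Step 2: weighted misclassification error e_t (clipped away from 0 and 1)
        e = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        # Step 3: model weight alpha_t = log[(1 - e_t) / e_t]
        alpha = np.log((1 - e) / e)
        learners.append(g)
        alphas.append(alpha)
        # Step 4: upweight the points that were misclassified
        w = w * np.exp(alpha * (pred != y))
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # weighted vote of the weak learners
    return np.sign(sum(a * g.predict(X) for a, g in zip(alphas, learners)))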
Let's see Adaboost in action on a simple two-dimensional classification dataset that is not linearly separable.

# https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_twoclass.html
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=2., n_samples=200, n_features=2,
                                 n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5, n_samples=300,
                                 n_features=2, n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2 + 1))
Let’s now train Adaboost on this dataset. We use the AdaBoostClassifier class
from sklearn.
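A training cell along the following lines, using 200 decision stumps as the base estimators (the variable name bdt is just a choice here), produces the fitted model printed below.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Adaboost with 200 depth-1 trees (decision stumps) as weak learners
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)
bdt.fit(X, y)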
AdaBoostClassifier(algorithm='SAMME',
base_estimator=DecisionTreeClassifier(max_depth=1),
n_estimators=200)
Visualizing the output of the algorithm, we see that it can learn a highly non-linear
decision boundary to separate the two classes.
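A plotting cell along these lines (assuming the fitted classifier is stored in bdt, as above) produces such a visualization.

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the classifier on a dense grid and draw its decision regions
plot_step = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))
Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolor='k')
plt.title('Adaboost decision boundary');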
16.4. Ensembling

The idea of ensembling is to combine many models into one. Bagging and boosting are ensembling techniques that reduce over- and under-fitting, respectively.

There are other approaches to ensembling that are useful to know about.
For example, in Bayesian machine learning, predictions are averaged over every possible model, weighted by the model's posterior probability:

$$P(y \mid x) = \int_\theta P(y \mid x, \theta)\, P(\theta \mid \mathcal{D})\, d\theta.$$
As we have seen with Adaboost, boosting algorithms are a form of ensembling that
yield high accuracy via a highly expressive non-linear model family. If trees are
used as weak learners, then we also have the added benefit of requiring little to no
preprocessing. However, as was the case with the random forest algorithm, boosting gives up the interpretability of the individual weak learners.
Next, we are going to see another perspective on boosting and derive new boosting algorithms.

16.5. Additive Models

Boosting can be seen as a way of fitting an additive model:

$$f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t).$$

The main model $f(x)$ consists of $T$ smaller models $g$ with weights $\alpha_t$ and parameters $\phi_t$.
Additive models are more general than a linear model, because $g$ can be non-linear in $\phi_t$ (and therefore so is $f$).
We can fit an additive model with a greedy procedure. Start with

$$f_0 = \arg\min_\phi \sum_{i=1}^n L\left(y^{(i)}, g(x^{(i)}; \phi)\right).$$

At each step $t > 0$, we then minimize

$$J(\alpha, \phi) = \sum_{i=1}^n L\left(y^{(i)}, f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right)$$

over $\alpha$ and $\phi$, and set $f_t = f_{t-1} + \alpha_t g(\cdot; \phi_t)$, where $(\alpha_t, \phi_t)$ denote the minimizers.

Note that at each step $f_{t-1}$ is fixed, and we are only optimizing over the weight $\alpha_t$ and the new model parameters $\phi_t$. This helps keep the optimization process tractable.
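To make this greedy procedure concrete, here is a sketch of the loop. The fit_weak_learner routine (which chooses $\phi_t$ given the current model) is a hypothetical placeholder, and the weight $\alpha_t$ is chosen here by a simple grid search; the next sections derive how these choices work out for specific losses.

import numpy as np

def forward_stagewise(X, y, loss, fit_weak_learner, T=10,
                      alpha_grid=np.linspace(0.01, 1.0, 20)):
    """Greedy additive modeling sketch.

    loss(y, f_X) returns a vector of per-example losses;
    fit_weak_learner(X, y, f_X) returns a model with .predict(X)."""
    f_X = np.zeros(len(y))                  # current predictions f_{t-1}(x^(i))
    ensemble = []
    for t in range(T):
        # fit a new weak learner g(.; phi_t); f_{t-1} stays fixed
        g = fit_weak_learner(X, y, f_X)
        g_X = g.predict(X)
        # choose alpha_t by minimizing the total loss over a grid of candidate weights
        alpha = min(alpha_grid, key=lambda a: np.sum(loss(y, f_X + a * g_X)))
        ensemble.append((alpha, g))
        f_X = f_X + alpha * g_X             # f_t = f_{t-1} + alpha_t * g(.; phi_t)
    return ensemble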
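Different choices of the loss $L$ give rise to different boosting algorithms. To build intuition, we can plot several common losses as a function of the prediction $f$ when the true label is $y = 1$. The particular dictionary of losses below (squared, exponential, logistic, and hinge) is just one reasonable choice for this comparison.

import numpy as np
import matplotlib.pyplot as plt

# Candidate losses L(y=1, f), written as functions of the prediction f
losses = {
    'Squared':     lambda f: (1 - f) ** 2,
    'Exponential': lambda f: np.exp(-f),
    'Logistic':    lambda f: np.log(1 + np.exp(-2 * f)),
    'Hinge':       lambda f: np.maximum(0, 1 - f),
}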
# plot them
f = np.linspace(0, 2)
fig, axes = plt.subplots(2,2)
for ax, (name, loss) in zip(axes.flatten(), losses.items()):
ax.plot(f, loss(f))
ax.set_title('%s Loss' % name)
ax.set_xlabel('Prediction f')
ax.set_ylabel('L(y=1,f)')
plt.tight_layout()
Notice that the exponential loss very heavily penalizes (i.e., exponentially)
misclassified points. This could potentially be an issue in the presence of outliers
or if we have some ‘noise’ in the labeling process, e.g., points were originally
classified by a human annotator with imperfect labeling.
Let's use the additive models framework to re-derive Adaboost. Consider the exponential loss $L(y, f) = \exp(-y f)$ for labels $y \in \{-1, 1\}$. At step $t$, the objective becomes

$$L_t = \sum_{i=1}^n e^{-y^{(i)}\left(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right)} = \sum_{i=1}^n w^{(i)} \exp\left(-y^{(i)} \alpha g(x^{(i)}; \phi)\right),$$

where we have defined $w^{(i)} = \exp\left(-y^{(i)} f_{t-1}(x^{(i)})\right)$.

Suppose that $g(x; \phi) \in \{-1, 1\}$. With a bit of algebraic manipulation, we get that:
$$L_t = e^{\alpha} \sum_{y^{(i)} \neq g(x^{(i)})} w^{(i)} + e^{-\alpha} \sum_{y^{(i)} = g(x^{(i)})} w^{(i)} = \left(e^{\alpha} - e^{-\alpha}\right) \sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)})\} + e^{-\alpha} \sum_{i=1}^n w^{(i)}.$$
From this expression, we recover the familiar updates

$$\phi_t = \arg\min_{\phi} \sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)}; \phi)\}$$

and

$$\alpha_t = \log[(1 - e_t)/e_t], \quad \text{where } e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)}; \phi_t)\}}{\sum_{i=1}^n w^{(i)}}.$$
These are the update rules for Adaboost, and it's not hard to show that the update rule for $w^{(i)}$ is the same as well.
Consider now the squared loss

$$L(y, f) = (y - f)^2.$$

At step $t$, we minimize

$$\sum_{i=1}^n \left(r^{(i)}_t - g(x^{(i)}; \phi)\right)^2,$$

where $r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model at time $t-1$. In other words, each new weak learner is fit to the residuals of the current model; this algorithm is known as L2Boost.
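Here is a compact sketch of this residual-fitting procedure, using small regression trees from sklearn as weak learners and a constant weight $\alpha_t = 1$ for simplicity.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2_boost(X, y, T=50, max_depth=2):
    """L2Boost sketch: repeatedly fit weak regressors to the residuals."""
    f_X = np.zeros(len(y))                  # current predictions f_{t-1}(x^(i))
    trees = []
    for t in range(T):
        r = y - f_X                         # residuals r_t^(i) = y^(i) - f_{t-1}(x^(i))
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        trees.append(g)
        f_X = f_X + g.predict(X)            # f_t = f_{t-1} + g_t (alpha_t = 1)
    return trees

def l2_boost_predict(X, trees):
    return sum(g.predict(X) for g in trees)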
Another popular choice is the log loss (again for labels $y \in \{-1, 1\}$),

$$L(y, f) = \log\left(1 + \exp(-2 y f)\right).$$

This looks like the log of the exponential loss; it is less sensitive to outliers since it doesn't penalize large errors as much.

In the additive models framework, at step $t$ we optimize

$$J(\alpha, \phi) = \sum_{i=1}^n \log\left(1 + \exp\left(-2 y^{(i)} \left(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right)\right)\right).$$

This gives a different weight update compared to Adaboost. This algorithm is called LogitBoost.
# plot them
f = np.linspace(0, 2)
fig, axes = plt.subplots(2,2)
for ax, (name, loss) in zip(axes.flatten(), losses.items()):
ax.plot(f, loss(f))
ax.set_title('%s Loss' % name)
ax.set_xlabel('Prediction f')
ax.set_ylabel('L(y=1,f)')
ax.set_ylim([0,1])
plt.tight_layout()
16.6. Gradient Boosting

We are now going to see another way of deriving boosting algorithms that is inspired by gradient descent. The additive modeling approach above has a few limitations:

There may exist other losses for which it is complex to derive boosting-type weight update rules.
At each step, we may need to solve a costly optimization problem over $\phi_t$.
Optimizing each $\phi_t$ greedily may cause us to overfit.
Let's start to motivate gradient boosting by taking a new lens to the boosting algorithms we saw above.

Consider L2Boost, which uses the squared loss

$$L(y, f) = \frac{1}{2}(y - f)^2.$$

At step $t$, we minimize

$$\sum_{i=1}^n \left(r^{(i)}_t - g(x^{(i)}; \phi)\right)^2,$$

where $r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model at time $t-1$.

Observe that we can also write this residual as

$$r^{(i)}_t = \left.\frac{\partial L(y^{(i)}, f)}{\partial f}\right|_{f = f_{t-1}(x^{(i)})}$$

(up to a sign), which we are now viewing as the gradient of the loss with respect to $f_{t-1}(x^{(i)})$. That is, we are trying to select the parameters $\phi_t$ that are closest to the residuals, i.e., closest to the gradients of the loss.

In the coming sections, we will explain why in L2Boost we are fitting the derivatives of the L2 loss.

Recall that a supervised learning model is a function

$$f_\theta : \mathcal{X} \to \mathcal{Y}$$

with a $d$-dimensional set of parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$. We want $f_\theta$ to perform well on new data $(x, y)$ sampled from the data distribution $P$; intuitively, we want the objective

$$J(\theta) = \mathbb{E}_{(x,y)\sim P}\left[L(y, f_\theta(x))\right]$$

to be small, i.e., we want the model to do well on infinite data.

Recall that formally, an expectation $\mathbb{E}_{x\sim P} f(x)$ is $\sum_{x \in \mathcal{X}} f(x) P(x)$ if $x$ is discrete and $\int_{x \in \mathcal{X}} f(x) P(x)\,dx$ if $x$ is continuous.

(Gradient Descent) The gradient of $J$ is the vector of partial derivatives

$$\nabla J(\theta) = \begin{bmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_d} \end{bmatrix}.$$

The $j$-th entry of the vector $\nabla J(\theta)$ is the partial derivative $\frac{\partial J(\theta)}{\partial \theta_j}$ of $J$ with respect to the $j$-th parameter.

We can optimize $J(\theta)$ using gradient descent via the usual update rule:

$$\theta_t \gets \theta_{t-1} - \alpha_t \nabla J(\theta_{t-1}).$$

Of course, we cannot compute the gradient on infinite data; in practice we approximate it with the empirical gradient over the training set,

$$\hat{\nabla} J(\theta) = \frac{1}{m} \sum_{i=1}^m \nabla L\left(y^{(i)}, f_\theta(x^{(i)})\right).$$
Intuitively, the gradient boosting algorithm asks, "what if instead of optimizing over the finite-dimensional parameters $\theta \in \mathbb{R}^d$, we try optimizing directly over infinite-dimensional functions?"
We may think of $f$ as an infinite-dimensional vector indexed by the inputs $x$:

$$f = \begin{bmatrix} \vdots \\ f(x) \\ \vdots \end{bmatrix}.$$

Keep in mind that $f$ should perform well in expectation on new data $x, y$ sampled from the data distribution $P$:

$J(f) = \mathbb{E}_{(x,y)\sim P}\left[L(y, f(x))\right]$ is "good".

Functional Gradients
The functional gradient of $J$ at $f$ is again an infinite-dimensional vector indexed by $x$:

$$\nabla J(f) = \begin{bmatrix} \vdots \\ \frac{\partial J(f)}{\partial f(x)} \\ \vdots \end{bmatrix}.$$

The functional gradient is a function that tells us how much we should "move" $f(x)$ at each point $x$. Given a good step size, the resulting new function will be closer to minimizing $J$.

Recall that we are taking the perspective that $f$ is a vector indexed by $x$. Thus the $x$-th entry of the vector $\nabla J(f)$ is the partial derivative $\frac{\partial J(f)}{\partial f(x)}$ of $J$ with respect to $f(x)$, the $x$-th component of $f$. We can compute it as

$$\nabla J(f)(x) = \frac{\partial J(f)}{\partial f(x)} = \frac{\partial}{\partial f(x)} \left( \mathbb{E}_{(x,y)\sim P}\left[L(y, f(x))\right] \right) = \left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f(x)}.$$

Previously, we optimized $J(\theta)$ using gradient descent via the update rule

$$\theta_t \gets \theta_{t-1} - \alpha_t \nabla J(\theta_{t-1}).$$

We can now optimize our objective using gradient descent in functional space via the same update:

$$f_t \gets f_{t-1} - \alpha_t \nabla J(f_{t-1}).$$

But recall as well that in the standard supervised learning approach that we reviewed above, we were not able to compute $\nabla J(\theta) = \mathbb{E}_{(x,y)\sim P}\left[\nabla L(y, f_\theta(x))\right]$ on infinite data, and instead we used the empirical approximation

$$\hat{\nabla} J(\theta) = \frac{1}{m} \sum_{i=1}^m \nabla L\left(y^{(i)}, f_\theta(x^{(i)})\right).$$

In the case of functional gradients, we also need to find an approximation $\hat{\nabla} J(f)$ to $\nabla J(f)$: $\nabla J(f)$ is a function that we need to "learn" from $\mathcal{D}$. (We will use supervised learning for this!)
Learning is necessary because of several difficulties with functional gradients:

We cannot represent $\nabla J(f)$ exactly, because it's a general function.
We cannot measure $\nabla J(f)$ at each $x$ (only at the $n$ training points).
Even if we could, the problem would be too unconstrained and generally intractable to optimize.

Thus, we form a dataset of functional derivatives evaluated at the training points,

$$\mathcal{D}_g = \left\{ \left( x^{(i)}, \underbrace{\left.\frac{\partial L(y^{(i)}, f)}{\partial f}\right|_{f = f(x^{(i)})}}_{\text{functional derivative } \nabla_f J(f)_i \text{ at } f(x^{(i)})} \right), \; i = 1, 2, \ldots, n \right\},$$

and we fit a model $g_{\theta_t} \in \mathcal{M}$ from a model class $\mathcal{M}$ to this dataset, so that

$$g_{\theta_t} \approx \nabla J(f_t).$$

The model extrapolates beyond the training set: given enough datapoints, $g_{\theta_t}$ learns $\nabla J(f_t)$ at any $x \in \mathcal{X}$. Think of $g_{\theta_t}$ as the projection of $\nabla J(f_t)$ onto the function class $\mathcal{M}$.

Functional descent will then have the form:

$$\underbrace{f_t(x)}_{\text{new function}} \gets \underbrace{f_{t-1}(x) - \alpha\, g_{\theta_t}(x)}_{\text{old function - gradient step}}.$$

If $g$ generalizes, this approximates $f_t \gets f_{t-1} - \alpha \nabla J(f_{t-1})$.
We now have the motivation and background needed to define gradient boosting. Gradient boosting is a procedure that performs functional gradient descent with approximate gradients.

Step 1: Fit a weak learner $g_t$ that approximates the functional gradient of the loss at the current model:

$$g_t(x) \approx \left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f_{t-1}(x)}.$$

Step 2: Take a step of gradient descent using approximate gradients with step $\alpha_t$:

$$f_t(x) \gets f_{t-1}(x) - \alpha_t g_t(x).$$

After $T$ steps, the resulting model has the form $f(x) = \sum_{t=1}^T \alpha_t g_t(x)$ (absorbing signs into the $g_t$). This looks like the output of a boosting algorithm!
As an example, let's return to the squared loss

$$L(y, f) = \frac{1}{2}(y - f)^2.$$

At step $t$, we minimize

$$\sum_{i=1}^n \left(r^{(i)}_t - g(x^{(i)}; \phi)\right)^2, \quad \text{where} \quad r^{(i)}_t = \left.\frac{\partial L(y^{(i)}, f)}{\partial f}\right|_{f = f_{t-1}(x^{(i)})},$$

which for the squared loss is (up to a sign) exactly the residual

$$r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)}).$$
This answers our question from above as to why in L2Boost we are fitting the derivatives of the L2 loss. The reason is that we are finding an approximation $g(\cdot; \phi)$ to $\nabla J(f)$, and to do so we minimize the squared loss between $\nabla J(f)(x^{(i)}) = r^{(i)}_t$ and $g(x^{(i)}; \phi)$ at our $n$ training points.
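To summarize, here is a from-scratch sketch of gradient boosting for a generic differentiable loss, using small regression trees as the weak learners. The loss_grad argument (the derivative $\partial L(y, f)/\partial f$ evaluated elementwise at the current predictions) and the constant step size alpha are simplifying assumptions of this sketch.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss_grad, T=100, alpha=0.1, max_depth=2):
    """Functional gradient descent with approximate gradients (sketch).

    loss_grad(y, f_X) returns dL(y, f)/df evaluated at f = f_X;
    e.g., for L(y, f) = 0.5 * (y - f)**2 it is (f_X - y)."""
    f_X = np.zeros(len(y))                       # f_0 = 0
    trees = []
    for t in range(T):
        grad = loss_grad(y, f_X)                 # functional gradient at the n training points
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, grad)   # g_t approximates the gradient
        trees.append(g)
        f_X = f_X - alpha * g.predict(X)         # gradient step: f_t = f_{t-1} - alpha * g_t
    return trees

def gradient_boost_predict(X, trees, alpha=0.1):
    return sum(-alpha * g.predict(X) for g in trees)

# e.g., for the squared loss: trees = gradient_boost(X, y, loss_grad=lambda y, f: f - y)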
Many boosting methods are special cases of gradient boosting in this way.
Regression losses:
L2, L1, and Huber (L1/L2 interpolation) losses.
Quantile loss: estimates quantiles of the distribution p(y|x).
Classification losses:
Log-loss, softmax loss, exponential loss, negative binomial likelihood,
etc.
We most often use small decision trees as the learner g t . Thus, input
preprocessing is minimal.
We can regularize by controlling tree size, step size α, and using early
stopping.
We can scale up gradient boosting to big data by sub-sampling data at each iteration (a form of stochastic gradient descent).
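In sklearn's GradientBoostingRegressor, for instance, these knobs correspond roughly to max_depth (tree size), learning_rate (step size), n_iter_no_change with validation_fraction (early stopping), and subsample (sub-sampling the data at each iteration); the values below are only illustrative, not tuned.

from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor(
    max_depth=3,             # control the size of each tree
    learning_rate=0.05,      # step size alpha
    n_estimators=500,        # number of boosting iterations
    subsample=0.5,           # fit each tree on a random half of the data
    n_iter_no_change=10,     # early stopping ...
    validation_fraction=0.1, # ... based on a held-out validation split
)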
Let's now look at gradient boosting in action on a one-dimensional regression problem. First we create the dataset: our target values come from a non-linear function f(x) plus some noise.
# https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html
X = np.atleast_2d(np.random.uniform(0, 10.0, size=100)).T
X = X.astype(np.float32)
# Create dataset
f = lambda x: x * np.sin(x)
y = f(X).ravel()
dy = 1.5 + 1.0 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise
# Visualize it
xx = np.atleast_2d(np.linspace(0, 10, 1000)).T
plt.plot(xx, f(xx), 'g:', label=r'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'b.', markersize=10, label=u'Observations');
from sklearn.ensemble import GradientBoostingRegressor

alpha = 0.95
clf = GradientBoostingRegressor(loss='squared_error', alpha=alpha,
                                n_estimators=250, max_depth=3,
                                learning_rate=.1, min_samples_leaf=9,
                                min_samples_split=9)
clf.fit(X, y)
GradientBoostingRegressor(alpha=0.95, min_samples_leaf=9,
min_samples_split=9,
n_estimators=250)
y_pred = clf.predict(xx)
plt.plot(xx, f(xx), 'g:', label=r'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'b.', markersize=10, label=u'Observations')
plt.plot(xx, y_pred, 'r-', label=u'Prediction');
By Cornell University
© Copyright 2023.