Lecture 16: Boosting

Contents
Lecture 16: Boosting
16.1. Defining Boosting
16.2. Structure of a Boosting Algorithm
16.3. Adaboost
16.4. Ensembling
16.5. Additive Models
16.6. Gradient Boosting

In this lecture, we will cover a new class of machine learning algorithms based on
an idea called Boosting. Boosting is an effective way to combine the predictions
from simple models into more complex and powerful ones that often attain state-
of-the-art performance on many machine learning competitions and benchmarks.

We will begin by defining boosting and seeing how this concept relates to our
previous lectures about bagging, which we saw in the context of random forests
and decision trees.

16.1. Defining Boosting

16.1.1. Review
16.1.1.1. Review: Overfitting
Recall that we saw in our lecture on decision trees and random forests that a
common machine learning failure mode is that of overfitting:

A very expressive model (e.g., a high-degree polynomial) fits the training
dataset perfectly.
The model also makes wildly incorrect predictions outside this dataset and
doesn’t generalize.

16.1.1.2. Review: Bagging


The idea of bagging was to reduce overfitting by averaging many models trained
on random subsets of the data.


ensemble = Ensemble([])

for i in range(n_models):
    # collect a bootstrap sample of the data and fit a model on it
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = Model().fit(X_i, y_i)
    ensemble.append(model)

# output average prediction at test time:
y_test = ensemble.average_prediction(X_test)
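
For reference, this idea is also available off the shelf; a minimal runnable sketch using sklearn (the choice of base model and the number of models are ours, not the lecture's):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# bag 100 decision trees, each fit on a bootstrap sample of the training set
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
# bagged.fit(X, y); bagged.predict(X_test) then averages the trees' votes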

16.1.2. Underfitting and Boosting


Underfitting is another common problem in machine learning that can be thought
of as the converse to overfitting.

The model is too simple to fit the data well (e.g., approximating a high degree
polynomial with linear regression).
As a result, the model is not accurate on training data and is not accurate on
new data.

16.1.2.1. Boosting
The idea of boosting is to reduce underfitting by combining models that correct
each others’ errors.

As in bagging, we combine many models $g_t$ into one ensemble $f$.

Unlike bagging, the $g_t$ are small and tend to underfit.

Each $g_t$ fits the points where the previous models made errors.

16.1.2.2. Weak Learners


A key ingredient of a boosting algorithm is a weak learner.

Intuitively, this is a model that is slightly better than random.


Examples of weak learners include: small linear models, small decision trees
(e.g., depth 1 or 2).

Let’s now move towards more fully describing the structure of a boosting algorithm.

16.2. Structure of a Boosting Algorithm

Step 1: Fit a weak learner $g_0$ on the dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$. Let $f = g_0$.

Step 2: Compute weights $w^{(i)}$ for each $i$ based on the model predictions $f(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors.

Step 3: Fit another weak learner $g_1$ on $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ with weights $w^{(i)}$, which place more emphasis on the points on which the existing model is less accurate.

Step 4: Set $f_1 = g_0 + \alpha_1 g_1$ for some weight $\alpha_1$. Go to Step 2 and repeat.

At each step, $f_t$ becomes more expressive since it is the sum of a larger number of weak learners that are each accurate on a different subset of the data.


In Python-like pseudocode, this looks as follows:

weights, ensemble = np.ones(n_data,), Ensemble([])

for i in range(n_models):
    # fit a weak learner on the weighted dataset
    model = SimpleBaseModel().fit(X, y, weights)
    predictions = model.predict(X)
    # compute this model's weight and update the data weights
    model_weight, weights = update_weights(weights, predictions)
    ensemble.add(model, model_weight)

# output consensus prediction at test time:
y_test = ensemble.predict(X_test)

16.3. Adaboost

Given its historical importance, we begin with an introduction to Adaboost and use it to further illustrate the structure of boosting algorithms in general.

Type: Supervised learning (classification).
Model family: Ensembles of weak learners (often decision trees).
Objective function: Exponential loss.
Optimizer: Forward stagewise additive model building (to be defined in more detail below).

As an interesting historical note, boosting algorithms were initially developed in the 1990s within theoretical machine learning.

Originally, boosting addressed the theoretical question of whether weak learners with >50% accuracy can be combined to form a strong learner.
Eventually, this research led to a practical algorithm called Adaboost.

Today, there exist many algorithms that are considered types of boosting, even though they’re not derived from the perspective of theoretical ML.

16.3.1. Defining Adaboost

We start with uniform weights $w^{(i)} = 1/n$ and $f = 0$. Then for $t = 1, 2, \ldots, T$:

Step 1: Fit a weak learner $g_t$ on $\mathcal{D}$ with weights $w^{(i)}$.

Step 2: Compute the misclassification error

$$e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g_t(x^{(i)})\}}{\sum_{i=1}^n w^{(i)}}$$

Recall that $\mathbb{I}\{\cdot\}$ is an indicator function that takes on value 1 when the condition in the brackets is true and 0 otherwise.
Notice that if all the weights $w^{(i)}$ are the same, then this is just the misclassification rate. When each weight can be different, we get a "weighted" misclassification error.

Step 3: Compute the model weight and update the function:

$$\alpha_t = \log[(1 - e_t)/e_t]$$
$$f \leftarrow f + \alpha_t g_t$$


Notice that $e_t$ intuitively measures how much influence $g_t$ should have on our overall predictor $f$. As $e_t$ approaches zero, meaning few of the highly weighted points were misclassified by $g_t$, $\alpha_t$ will be large, allowing $g_t$ to have a bigger contribution to our predictor. As $e_t$ approaches $1/2$ (recall that the $g_t$ are "weak learners"), $g_t$ is not doing much better than random guessing. This will cause $\alpha_t$ to be close to zero, meaning that $g_t$ contributes little to our overall prediction function.
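
For a quick numeric check (our own example, not from the lecture): $e_t = 0.2$ gives $\alpha_t = \log(0.8/0.2) \approx 1.39$, $e_t = 0.45$ gives $\alpha_t \approx 0.20$, and $e_t = 0.5$ gives $\alpha_t = 0$ exactly, so a learner at chance level is ignored.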

Step 4: Compute new data weights $w^{(i)} \leftarrow w^{(i)} \exp[\alpha_t \mathbb{I}\{y^{(i)} \neq f(x^{(i)})\}]$.

Exponentiation ensures that all the weights are positive.
If our predictor correctly classifies a point, its weight $w^{(i)}$ does not change. Any point that is misclassified by $f$ has its weight increased.
We again use $\alpha_t$ here, as above, to mediate how strongly we adjust the weights.
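
To make Steps 1-4 concrete, here is a minimal NumPy sketch of the full loop (our own illustration, not the lecture's reference implementation; the helper names and the use of depth-1 trees as weak learners are our choices). It assumes labels $y^{(i)} \in \{-1, +1\}$:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = X.shape[0]
    w = np.ones(n) / n                            # uniform initial weights
    models, alphas = [], []
    for t in range(T):
        g = DecisionTreeClassifier(max_depth=1)   # weak learner (a stump)
        g.fit(X, y, sample_weight=w)              # Step 1: fit on weighted data
        miss = (g.predict(X) != y)
        e = np.sum(w * miss) / np.sum(w)          # Step 2: weighted error e_t
        e = np.clip(e, 1e-10, 1 - 1e-10)          # numerical safety
        alpha = np.log((1 - e) / e)               # Step 3: model weight alpha_t
        w = w * np.exp(alpha * miss)              # Step 4: up-weight the errors
        models.append(g); alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # sign of the alpha-weighted vote of the weak learners
    return np.sign(sum(a * g.predict(X) for g, a in zip(models, alphas)))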

16.3.2. Adaboost: An Example


Let’s implement Adaboost on a simple dataset to see what it can do.

Let’s start by creating a classification dataset. We will intentionally create a 2D dataset where the classes are not easily separable.

# https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_twoclass.html
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=2., n_samples=200,
n_features=2, n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
n_samples=300, n_features=2, n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))

We can visualize this dataset using matplotlib.

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [12, 4]

# Plot the training points
plot_colors, plot_step, class_names = "br", 0.02, "AB"
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=c, cmap=plt.cm.Paired, s=60,
                edgecolor='k', label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right');


Let’s now train Adaboost on this dataset. We use the AdaBoostClassifier class
from sklearn.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create and fit an AdaBoosted decision tree
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)
bdt.fit(X, y)

AdaBoostClassifier(algorithm='SAMME',

base_estimator=DecisionTreeClassifier(max_depth=1),
n_estimators=200)

Visualizing the output of the algorithm, we see that it can learn a highly non-linear
decision boundary to separate the two classes.

xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

# plot decision boundary
Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

# plot training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=c, cmap=plt.cm.Paired, s=60,
                edgecolor='k', label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right');

16.4. Ensembling

Boosting and bagging are special cases of ensembling.


The idea of ensembling is to combine many models into one. Bagging and boosting
are ensembling techniques to reduce over- and under-fitting, respectively.

There are other approaches to ensembling that are useful to know about.

In stacking, we train $m$ independent models $g_j(x)$ (possibly from different model classes) and then train another model $f(x)$ to predict $y$ from the outputs of the $g_j$ (see the sketch below).
The Bayesian approach can also be seen as a form of ensembling,

$$P(y \mid x) = \int_\theta P(y \mid x, \theta) P(\theta \mid \mathcal{D}) \, d\theta,$$

where we average models $P(y \mid x, \theta)$ using weights $P(\theta \mid \mathcal{D})$.
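
A minimal sketch of stacking with sklearn (the base models and the final model here are our choices, not from the lecture):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# two independent base models g_j, plus a final model f fit on their outputs
stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=3)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression())
# stack.fit(X, y) trains the g_j, then trains f on their cross-validated outputs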

Ensembling is a useful technique in machine learning, as it often helps squeeze additional performance out of ML algorithms; however, this comes at the cost of additional (and potentially quite expensive) computation to train and use the ensemble.

As we have seen with Adaboost, boosting algorithms are a form of ensembling that yields high accuracy via a highly expressive non-linear model family. If trees are used as weak learners, then we also have the added benefit of requiring little to no preprocessing. However, as with the random forest algorithm, the interpretability of the individual weak learners is lost in the boosted ensemble.

16.4.1. Bagging vs. Boosting


We conclude this initial introduction to boosting by contrasting it to the bagging
approach we saw previously. While both concepts refer to methods for combining
the outputs of various models trained on the same dataset, there are important
distinctions between these concepts.

Bagging targets overfitting vs. boosting targets underfitting.


Bagging is a parallelizable method for combining models (e.g., each tree in
the random forest can be learned in parallel) vs. boosting is an inherently
sequential way to combine models.

Next, we are going to see another perspective on boosting and derive new
boosting algorithms.

16.5. Additive Models

Boosting can be seen as a way of fitting more general additive models:

$$f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t).$$

The main model $f(x)$ consists of $T$ smaller models $g$ with weights $\alpha_t$ and parameters $\phi_t$.

The parameters are the $\alpha_t$ plus the parameters $\phi_t$ of each $g$.


Additive models are more general than a linear model, because $g$ can be non-linear in $\phi_t$ (therefore so is $f$).

Boosting is a specific approach to training additive models. We will see a more general approach below.

16.5.1. Forward Stagewise Additive Modeling

A general way to fit additive models is the forward stagewise approach.

Suppose we have a loss $L : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$.

Start with

$$f_0 = \arg\min_\phi \sum_{i=1}^n L(y^{(i)}, g(x^{(i)}; \phi))$$

At each iteration $t$ we fit the best addition to the current model:

$$\alpha_t, \phi_t = \arg\min_{\alpha, \phi} \sum_{i=1}^n L(y^{(i)}, f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi))$$

Note that at each step $f_{t-1}$ is fixed, and we are only optimizing over the weight $\alpha_t$ and the new model parameters $\phi_t$. This helps keep the optimization process tractable.

We note some practical considerations in forward stagewise additive modeling:

Popular choices of $g$ include cubic splines, decision trees, and kernelized models.
We may use a fixed number of iterations $T$ or early stopping when the error on a hold-out set no longer improves.
An important design choice is the loss $L$.

16.5.2. Losses in Additive Models

We will now cover the various types of losses used in additive models and the implications of these different choices.

16.5.2.1. Exponential Loss

We start with the exponential loss. Given a binary classification problem with labels $\mathcal{Y} = \{-1, +1\}$, the exponential loss is defined as

$$L(y, f) = \exp(-y \cdot f).$$

When $y = 1$, $L$ is small when $f \to \infty$.
When $y = -1$, $L$ is small when $f \to -\infty$.


Let’s visualize the exponential loss and compare it to other losses.

from matplotlib import pyplot as plt
import numpy as np

plt.rcParams['figure.figsize'] = [12, 4]

# define the losses for a target of y=1


losses = {
'Hinge' : lambda f: np.maximum(1 - f, 0),
'L2': lambda f: (1-f)**2,
'L1': lambda f: np.abs(f-1),
'Exponential': lambda f: np.exp(-f)
}

# plot them
f = np.linspace(0, 2)
fig, axes = plt.subplots(2,2)
for ax, (name, loss) in zip(axes.flatten(), losses.items()):
ax.plot(f, loss(f))
ax.set_title('%s Loss' % name)
ax.set_xlabel('Prediction f')
ax.set_ylabel('L(y=1,f)')
plt.tight_layout()

Notice that the exponential loss penalizes misclassified points very heavily (i.e., exponentially). This could potentially be an issue in the presence of outliers, or if there is some "noise" in the labeling process, e.g., if points were originally classified by a human annotator with imperfect labeling.

Special Case: Adaboost

Adaboost is an instance of forward stagewise additive modeling with the exponential loss.

At each step $t$, we minimize

$$L_t = \sum_{i=1}^n e^{-y^{(i)}\left(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right)} = \sum_{i=1}^n w^{(i)} \exp\left(-y^{(i)} \alpha g(x^{(i)}; \phi)\right)$$

with $w^{(i)} = \exp(-y^{(i)} f_{t-1}(x^{(i)}))$.

We can derive the Adaboost update rules from this equation.

Suppose that $g(x; \phi) \in \{-1, 1\}$. With a bit of algebraic manipulation, we get that:

$$L_t = e^{\alpha} \sum_{y^{(i)} \neq g(x^{(i)})} w^{(i)} + e^{-\alpha} \sum_{y^{(i)} = g(x^{(i)})} w^{(i)} = (e^{\alpha} - e^{-\alpha}) \sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)})\} + e^{-\alpha} \sum_{i=1}^n w^{(i)},$$

where $\mathbb{I}\{\cdot\}$ is the indicator function.

From there, we get that:

$$\phi_t = \arg\min_\phi \sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)}; \phi)\}$$
$$\alpha_t = \log[(1 - e_t)/e_t],$$

where $e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)}; \phi_t)\}}{\sum_{i=1}^n w^{(i)}}$.

These are the update rules for Adaboost, and it’s not hard to show that the update rule for $w^{(i)}$ is the same as well.

16.5.2.2. Squared Loss

Another popular choice of loss is the squared loss, which allows us to derive a principled boosting algorithm for regression (as opposed to the exponential loss, which is used for classification). We define the squared loss as:

$$L(y, f) = (y - f)^2.$$

The resulting algorithm is often called L2Boost. At step $t$, we minimize

$$\sum_{i=1}^n (r_t^{(i)} - g(x^{(i)}; \phi))^2,$$

where $r_t^{(i)} = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model at time $t-1$.
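
A minimal sketch of L2Boost (our own illustration; the use of regression stumps as weak learners and the fixed step size are our choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2boost_fit(X, y, T=100, alpha=0.1):
    f = np.zeros(len(y))                      # current predictions f_{t-1}(x)
    models = []
    for t in range(T):
        r = y - f                             # residuals r_t
        g = DecisionTreeRegressor(max_depth=1).fit(X, r)  # fit g_t to residuals
        f = f + alpha * g.predict(X)          # f_t = f_{t-1} + alpha * g_t
        models.append(g)
    return models

def l2boost_predict(models, X, alpha=0.1):
    return alpha * sum(g.predict(X) for g in models)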

16.5.2.3. Logistic Loss

Another common loss is the logistic loss. When $\mathcal{Y} = \{-1, 1\}$, it is defined as:

$$L(y, f) = \log(1 + \exp(-2 \cdot y \cdot f)).$$

This resembles the log of the exponential loss; it is less sensitive to outliers since it doesn’t penalize large errors as much.

In the context of boosting, we minimize

$$J(\alpha, \phi) = \sum_{i=1}^n \log\left(1 + \exp\left(-2 y^{(i)} \left(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right)\right)\right).$$

This gives a different weight update compared to Adaboost. This algorithm is called LogitBoost.

Let’s plot some of these new losses as we did before.


from matplotlib import pyplot as plt
import numpy as np

plt.rcParams['figure.figsize'] = [12, 4]

# define the losses for a target of y=1


losses = {
'Hinge' : lambda f: np.maximum(1 - f, 0),
'L2': lambda f: (1-f)**2,
'Logistic': lambda f: np.log(1+np.exp(-2*f)),
'Exponential': lambda f: np.exp(-f)
}

# plot them
f = np.linspace(0, 2)
fig, axes = plt.subplots(2,2)
for ax, (name, loss) in zip(axes.flatten(), losses.items()):
ax.plot(f, loss(f))
ax.set_title('%s Loss' % name)
ax.set_xlabel('Prediction f')
ax.set_ylabel('L(y=1,f)')
ax.set_ylim([0,1])
plt.tight_layout()

To summarize what we have seen for additive models:

Additive models have the form $f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t)$.
These models can be fit using the forward stagewise additive approach.
This reproduces Adaboost (when using the exponential loss) and can be used to derive new boosting-type algorithms that optimize a wide range of objectives that are more robust to outliers and extend beyond classification.

We are now going to see another way of deriving boosting algorithms that is inspired by gradient descent.

16.6. Gradient Boosting

16.6.1. Limitations of Forward Stagewise Additive Modeling

Forward stagewise additive modeling is not without limitations.

There may exist other losses for which it is complex to derive boosting-type weight update rules.
At each step, we may need to solve a costly optimization problem over $\phi_t$.
Optimizing each $\phi_t$ greedily may cause us to overfit.

16.6.2. Motivating Gradient Boosting

Let’s start to motivate gradient boosting by taking a new lens to the boosting algorithms we saw above.

Consider, for example, L2Boost, which optimizes the L2 loss

$$L(y, f) = \frac{1}{2}(y - f)^2.$$

At step $t$, we minimize

$$\sum_{i=1}^n (r_t^{(i)} - g(x^{(i)}; \phi))^2,$$

where $r_t^{(i)} = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model at time $t-1$.

Observe that the residual is also (up to a sign) the derivative of the L2 loss $\frac{1}{2}(y^{(i)} - f)^2$ with respect to $f$ at $f_{t-1}(x^{(i)})$:

$$r_t^{(i)} = -\left.\frac{\partial L(y^{(i)}, f)}{\partial f}\right|_{f = f_{t-1}(x^{(i)})}$$

Thus, at step $t$, we are minimizing

$$\sum_{i=1}^n \Big( \underbrace{(y^{(i)} - f_{t-1}(x^{(i)}))}_{\text{negative derivative of } L \text{ at } f_{t-1}(x^{(i)})} - \; g(x^{(i)}; \phi_t) \Big)^2.$$

That is, we are trying to select the parameters $\phi_t$ that bring $g$ closest to the residuals, which we are now viewing as (negative) gradients with respect to $f_{t-1}(x^{(i)})$.

In the coming sections, we will explain why in L2Boost we are fitting the derivatives of the L2 loss.

16.6.3. Revisiting Supervised Learning

Let’s first recap classical supervised learning, and then contrast it with the gradient boosting approach to which we are building up.

16.6.3.1. Supervised Learning: The Model

Recall that a machine learning model is a function

$$f_\theta : \mathcal{X} \to \mathcal{Y}$$

that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$. The model has a $d$-dimensional set of parameters $\theta$:

$$\theta = (\theta_1, \theta_2, \ldots, \theta_d).$$

16.6.3.2. Supervised Learning: The Learning Objective

Intuitively, $f_\theta$ should perform well in expectation on new data $x, y$ sampled from the data distribution $\mathbb{P}$:

$$J(\theta) = \mathbb{E}_{(x,y)\sim \mathbb{P}}\left[L(y, f_\theta(x))\right] \text{ is "good".}$$

Here, $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a performance metric, and we take its expectation or average over all the possible samples $x, y$ from $\mathbb{P}$.

Recall that formally, an expectation $\mathbb{E}_{x\sim P} f(x)$ is $\sum_{x \in \mathcal{X}} f(x)P(x)$ if $x$ is discrete and $\int_{x \in \mathcal{X}} f(x)P(x)\,dx$ if $x$ is continuous. Thus,

$$J(\theta) = \mathbb{E}_{(x,y)\sim \mathbb{P}}\left[L(y, f_\theta(x))\right] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} L(y, f_\theta(x)) \, \mathbb{P}(x, y)$$

is the performance on an infinite-sized holdout set, where we have sampled every possible point.

16.6.3.3. Supervised Learning: The Optimizer (Gradient Descent)

The gradient $\nabla J(\theta)$ is the $d$-dimensional vector of partial derivatives:

$$\nabla J(\theta) = \begin{bmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_d} \end{bmatrix}.$$

The $j$-th entry of the vector $\nabla J(\theta)$ is the partial derivative $\frac{\partial J(\theta)}{\partial \theta_j}$ of $J$ with respect to the $j$-th component of $\theta$.

We can optimize $J(\theta)$ using gradient descent via the usual update rule:

$$\theta_t \leftarrow \theta_{t-1} - \alpha_t \nabla J(\theta_{t-1}).$$

However, in practice, we cannot measure

$$\nabla J(\theta) = \mathbb{E}_{(x,y)\sim \mathbb{P}}\left[\nabla L(y, f_\theta(x))\right]$$

on infinite data. We substitute $\nabla J(\theta)$ with an approximation $\hat\nabla J(\theta)$ measured on a dataset $\mathcal{D}$ sampled from $\mathbb{P}$:

$$\hat\nabla J(\theta) = \frac{1}{m} \sum_{i=1}^m \nabla L(y^{(i)}, f_\theta(x^{(i)})).$$

If the number of IID samples $m$ is large, this approximation holds (we call this a Monte Carlo approximation).
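
A tiny sketch of this recipe (our own example: linear regression with the L2 loss, with data, model, and step size all chosen by us):

import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 3)                     # samples from P
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(1000)

theta, alpha = np.zeros(3), 0.1
for t in range(200):
    idx = np.random.choice(len(X), size=100, replace=False)  # a dataset D ~ P
    residual = X[idx] @ theta - y[idx]
    grad = 2 * X[idx].T @ residual / len(idx)    # Monte Carlo estimate of grad J(theta)
    theta = theta - alpha * grad                 # the usual gradient descent update
print(theta)                                     # approaches [1.0, -2.0, 0.5]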

16.6.4. Supervised Learning Over Functions

Intuitively, the gradient boosting algorithm asks: "what if, instead of optimizing over the finite-dimensional parameters $\theta \in \mathbb{R}^d$, we try optimizing directly over infinite-dimensional functions?"

But what do we mean by "infinite-dimensional functions"? Letting our model space be the (unrestricted) set of functions $f : \mathcal{X} \to \mathcal{Y}$, each function is an infinite-dimensional vector indexed by $x \in \mathcal{X}$:

$$f = \begin{bmatrix} \vdots \\ f(x) \\ \vdots \end{bmatrix}.$$

The $x$-th component of the vector $f$ is $f(x)$. So rather than uniquely characterizing a function by some finite-dimensional vector of parameters, a point in function space can be uniquely characterized by the values that it takes on every possible input, of which there can be infinitely many. It’s as if we choose infinitely many parameters $\theta = (\ldots, f(x), \ldots)$ that specify function values, and we optimize over those.

The Learning Objective

Our learning objective $J(f)$ is now defined over $f$. Although the form of the objective is equivalent to the standard supervised learning setup we recalled above, we can think of optimizing $J$ over a "very high-dimensional" (potentially infinite) vector of "parameters" $f$.

Keep in mind that $f$ should perform well in expectation on new data $x, y$ sampled from the data distribution $\mathbb{P}$:

$$J(f) = \mathbb{E}_{(x,y)\sim \mathbb{P}}\left[L(y, f(x))\right] \text{ is "good".}$$

Functional Gradients

We would like to again optimize $J(f)$ using gradient descent:

$$\min_f J(f) = \min_f \mathbb{E}_{(x,y)\sim \mathbb{P}}\left[L(y, f(x))\right].$$

We may define the functional gradient of this loss at $f$ as an infinite-dimensional vector $\nabla J(f) : \mathcal{X} \to \mathbb{R}$ "indexed" by $x$:

$$\nabla J(f) = \begin{bmatrix} \vdots \\ \frac{\partial J(f)}{\partial f(x)} \\ \vdots \end{bmatrix}.$$

Let’s compare the parametric and the functional gradients.

The parametric gradient $\nabla J(\theta) \in \mathbb{R}^d$ is a vector of the same shape as $\theta \in \mathbb{R}^d$. Both $\nabla J(\theta)$ and $\theta$ are indexed by $j = 1, 2, \ldots, d$.

The functional gradient $\nabla J(f) : \mathcal{X} \to \mathbb{R}$ is a vector of the same shape as $f : \mathcal{X} \to \mathbb{R}$. Both $\nabla J(f)$ and $f$ are "indexed" by $x \in \mathcal{X}$.

The parametric gradient $\nabla J(\theta)$ at $\theta = \theta_0$ tells us how to modify $\theta_0$ in order to further decrease the objective $J$ starting from $J(\theta_0)$.

The functional gradient $\nabla J(f)$ at $f = f_0$ tells us how to modify $f_0$ in order to further decrease the objective $J$ starting from $J(f_0)$. We can think of this as the functional gradient telling us how to change the output of $f_0$ for each possible input in order to better optimize our objective.

This is best understood via a picture: the functional gradient is a function that tells us how much we "move" $f(x)$ at each point $x$. Given a good step size, the resulting new function will be closer to minimizing $J$.
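
To make the "vector indexed by x" view concrete, here is a toy numeric sketch (entirely our own example) of one functional gradient step under the L2 loss $L(y, f) = \frac{1}{2}(y - f)^2$, where $f$ is represented by its values at three training inputs:

import numpy as np

y = np.array([1.0, -2.0, 0.5])   # targets at three training inputs
f = np.zeros(3)                  # f_0, stored as its values at those inputs
grad = f - y                     # dL/df at each point: the functional gradient
f_new = f - 0.5 * grad           # move f(x) at every point; step size 0.5
print(f_new)                     # [ 0.5  -1.    0.25] -- closer to y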

Recall that we are taking the perspective that $f$ is a vector indexed by $x$. Thus the $x$-th entry of the vector $\nabla J(f)$ is the partial derivative $\frac{\partial J(f)}{\partial f(x)}$ of $J$ with respect to $f(x)$, the $x$-th component of $f$. So the functional gradient is

$$\nabla J(f)(x) = \frac{\partial J(f)}{\partial f(x)} = \frac{\partial}{\partial f(x)}\left(\mathbb{E}_{(x,y)\sim \mathbb{P}}\left[L(y, f(x))\right]\right) = \left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f(x)}.$$

This is an infinite-dimensional vector indexed by $x$.

Functional Gradient Descent

Previously, we optimized $J(\theta)$ using gradient descent via the update rule:

$$\theta_t \leftarrow \theta_{t-1} - \alpha_t \nabla J(\theta_{t-1}).$$

We can now optimize our objective using gradient descent in functional space via the same update:

$$f_t \leftarrow f_{t-1} - \alpha_t \nabla J(f_{t-1}).$$

After $T$ steps of this update, we get a model of the form

$$f_T = f_0 - \sum_{t=0}^{T-1} \alpha_t \nabla J(f_t).$$

Recall that each $\nabla J(f_t)$ is a function of $x$. Therefore $f_T$ is a function of $x$ as well, and as a function found through gradient descent, $f_T$ will minimize $J$.

The Challenge of Supervised Learning Over Functions

But recall as well that in the standard supervised learning approach we reviewed above, we were not able to compute $\nabla J(\theta) = \mathbb{E}_{(x,y)\sim \mathbb{P}}[\nabla L(y, f_\theta(x))]$ on infinite data, and instead we used the approximation

$$\hat\nabla J(\theta) = \frac{1}{m} \sum_{i=1}^m \nabla L(y^{(i)}, f_\theta(x^{(i)})).$$

In the case of functional gradients, we also need to find an approximation $\hat\nabla J(f)$. This is more challenging than before:

$\nabla J(f)(x) = \left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f(x)}$ is not an expectation, so we can’t approximate it with an average over the data.
We cannot represent $\nabla J(f)$ directly because it’s a general function.
We cannot measure $\nabla J(f)$ at each $x$ (only at the $n$ training points).
Even if we could, the problem would be too unconstrained and generally intractable to optimize.

Instead, $\nabla J(f)$ is a function that we need to "learn" from $\mathcal{D}$. (We will use supervised learning for this!)

16.6.5. Modeling Functional Gradients

We will address the above problem by learning a model of gradients.

In supervised learning, we only have access to $n$ data points that describe the true $\mathcal{X} \to \mathcal{Y}$ mapping (call it $f^*$).
We learn a model $f_\theta : \mathcal{X} \to \mathcal{Y}$ from a function class $\mathcal{M}$ to approximate $f^*$.
The model extrapolates beyond the training set. Given enough datapoints, $f_\theta$ learns the true mapping.

We can apply the same idea to gradients, learning $\nabla J(f)$:

We search for a model $g_{\theta_t} : \mathcal{X} \to \mathbb{R}$ within a more restricted function class $\mathcal{M}$ that can approximate the functional gradient: $g_{\theta_t} \in \mathcal{M}$, with $g_{\theta_t} \approx \nabla J(f_t)$.
The model extrapolates beyond the training set. Given enough datapoints, $g_{\theta_t}$ learns $\nabla J(f_t)$.
Think of $g_{\theta_t}$ as the projection of $\nabla J(f_t)$ onto the function class $\mathcal{M}$.

Functional descent will then have the form:

$$\underbrace{f_t(x)}_{\text{new function}} \leftarrow \underbrace{f_{t-1}(x) - \alpha g_{\theta_{t-1}}(x)}_{\text{old function}\; - \;\text{gradient step}}.$$

If $g$ generalizes, this approximates $f_t \leftarrow f_{t-1} - \alpha \nabla J(f_{t-1})$.

16.6.6. Fitting Functional Gradients

In practice, what does it mean to approximate a functional gradient $g \approx \nabla J(f)$? We can use standard supervised learning. Suppose we have a fixed function $f$ and we want to estimate the functional gradient of $L$,

$$\left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f(x)},$$

at any $x \in \mathcal{X}$.

Step 1: We define a loss $L_g$ (e.g., the L2 loss) to measure how well $g \approx \nabla J(f)$.

Step 2: We compute $\nabla J(f)$ on the training dataset:

$$\mathcal{D}_g = \left\{ \left(x^{(i)},\; \underbrace{\left.\frac{\partial L(y^{(i)}, f)}{\partial f}\right|_{f = f(x^{(i)})}}_{\text{functional derivative } \nabla J(f)_i \text{ at } f(x^{(i)})} \right),\; i = 1, 2, \ldots, n \right\}.$$

Step 3: We train a model $g : \mathcal{X} \to \mathbb{R}$ on $\mathcal{D}_g$ to predict functional gradients at any $x$:

$$g(x) \approx \left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f(x)}.$$

16.6.7. Gradient Boosting

We now have the motivation and background needed to define gradient boosting. Gradient boosting is a procedure that performs functional gradient descent with approximate gradients.

Start with $f(x) = 0$. Then, at each step $t > 1$:

Step 1: Create a training dataset $\mathcal{D}_g$ and fit $g_t(x^{(i)})$ using loss $L_g$:

$$g_t(x) \approx \left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f_{t-1}(x)}.$$

Step 2: Take a step of gradient descent using approximate gradients with step size $\alpha_t$:

$$f_t(x) = f_{t-1}(x) - \alpha_t \cdot g_t(x).$$

16.6.8. Interpreting Gradient Boosting

Notice how after $T$ steps we get an additive model of the form

$$f(x) = \sum_{t=1}^T \alpha_t g_t(x).$$

This looks like the output of a boosting algorithm! However, unlike before for forward stagewise additive models:

This works for any differentiable loss $L$.
It does not require any mathematical derivations for new $L$.
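
The two steps above translate directly into code. Below is a minimal sketch (our own illustration; the tree-based gradient model, step size, and helper names are our choices) that works for any differentiable loss supplied via its pointwise derivative loss_grad(y, f):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss_grad, T=100, alpha=0.1):
    f = np.zeros(len(y))                      # f_0(x) = 0 on the training set
    models = []
    for t in range(T):
        grads = loss_grad(y, f)               # Step 1: functional gradients at f_{t-1}
        g = DecisionTreeRegressor(max_depth=3).fit(X, grads)  # fit g_t on D_g
        f = f - alpha * g.predict(X)          # Step 2: f_t = f_{t-1} - alpha * g_t
        models.append(g)
    return models

def gradient_boost_predict(models, X, alpha=0.1):
    return -alpha * sum(g.predict(X) for g in models)

# For L(y, f) = 0.5 * (y - f)**2, dL/df = f - y, recovering L2Boost:
l2_grad = lambda y, f: f - y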

16.6.9. Returning to L2 Boosting

To better highlight the connections between boosting and gradient boosting, let’s return to the example of L2Boost, which optimizes the L2 loss

$$L(y, f) = \frac{1}{2}(y - f)^2.$$

At step $t$, we minimize

$$\sum_{i=1}^n (r_t^{(i)} - g(x^{(i)}; \phi))^2,$$

where $r_t^{(i)} = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model at time $t-1$.

Observe that the residual

$$r_t^{(i)} = y^{(i)} - f_{t-1}(x^{(i)})$$

is also (up to a sign) the gradient of the L2 loss with respect to $f$ at $f_{t-1}(x^{(i)})$:

$$r_t^{(i)} = -\left.\frac{\partial L(y^{(i)}, f)}{\partial f}\right|_{f = f_{t-1}(x^{(i)})}.$$

This answers our question from above as to why in L2Boost we are fitting the derivatives of the L2 loss. The reason is that we are finding an approximation $g(\cdot; \phi)$ to the (negative) functional gradient $\nabla J(f)$: we minimize the squared loss between $-\nabla J(f)(x^{(i)}) = r_t^{(i)}$ and $g(x^{(i)}; \phi)$ at our $n$ training points, so that the additive update $f_t = f_{t-1} + \alpha_t g_t$ is precisely a functional gradient step.

Many boosting methods are special cases of gradient boosting in this way.

16.6.10. Losses for Additive Models vs. Gradient Boosting

We have seen several losses that can be used with the forward stagewise additive approach.

The exponential loss $L(y, f) = \exp(-yf)$ gives us Adaboost.
The log-loss $L(y, f) = \log(1 + \exp(-2yf))$ is more robust to outliers.
The squared loss $L(y, f) = (y - f)^2$ can be used for regression.

Gradient boosting can optimize a wider range of losses.

Regression losses:
L2, L1, and Huber (an L1/L2 interpolation) losses.
Quantile loss: estimates quantiles of the distribution p(y|x).
Classification losses:
Log-loss, softmax loss, exponential loss, negative binomial likelihood, etc.

When using gradient boosting, these additional facts are useful (see the sklearn sketch below):

We most often use small decision trees as the learner $g_t$; thus, input preprocessing is minimal.
We can regularize by controlling tree size, step size $\alpha$, and using early stopping.
We can scale up gradient boosting to big data by sub-sampling the data at each iteration (a form of stochastic gradient descent).
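
As a concrete illustration of these knobs (our own choice of values, not from the lecture), sklearn's GradientBoostingRegressor exposes tree size, step size, sub-sampling, and early stopping directly:

from sklearn.ensemble import GradientBoostingRegressor

gbt = GradientBoostingRegressor(
    loss='quantile', alpha=0.9,     # estimate the 90th percentile of p(y|x)
    max_depth=3,                    # regularize by controlling tree size
    learning_rate=0.1,              # step size
    subsample=0.5,                  # sub-sample the data at each iteration
    n_estimators=500,
    validation_fraction=0.1,        # early stopping on a hold-out set
    n_iter_no_change=10,
)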

16.6.11. Algorithm: Gradient Boosting

As with the other algorithms we’ve seen, we can now present the algorithmic components of gradient boosting.

Type: Supervised learning (classification and regression).

Model family: Ensembles of weak learners (often decision trees).
Objective function: Any differentiable loss function.
Optimizer: Gradient descent in functional space; the weak learner uses its own optimizer.
Probabilistic interpretation: None in general; certain losses may have one.

16.6.12. Gradient Boosting: An Example


Let’s now try running Gradient Boosted Decision Trees (GBDT) on a small
regression dataset.

First we create the dataset. Our values come from a non-linear function f(x) plus
some noise.

# https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html
X = np.atleast_2d(np.random.uniform(0, 10.0, size=100)).T
X = X.astype(np.float32)

# Create dataset
f = lambda x: x * np.sin(x)
y = f(X).ravel()
dy = 1.5 + 1.0 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise

# Visualize it
xx = np.atleast_2d(np.linspace(0, 10, 1000)).T
plt.plot(xx, f(xx), 'g:', label=r'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'b.', markersize=10, label=u'Observations');

Next, we train a GBDT regressor, using the GradientBoostingRegressor class from sklearn.

from sklearn.ensemble import GradientBoostingRegressor

alpha = 0.95
clf = GradientBoostingRegressor(loss='squared_error', alpha=alpha,
n_estimators=250, max_depth=3,
learning_rate=.1,
min_samples_leaf=9,
min_samples_split=9)
clf.fit(X, y)

GradientBoostingRegressor(alpha=0.95, min_samples_leaf=9,
min_samples_split=9,
n_estimators=250)


We may now visualize its predictions:

y_pred = clf.predict(xx)
plt.plot(xx, f(xx), 'g:', label=r'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'b.', markersize=10, label=u'Observations')
plt.plot(xx, y_pred, 'r-', label=u'Prediction');

16.6.14. Pros and Cons of Gradient Boosting

Gradient boosted decision trees (GBTs) are one of the best off-the-shelf ML algorithms that exist, often on par with deep learning.

Attain state-of-the-art performance. GBTs have won the most Kaggle competitions.
Require little data preprocessing and tuning.
Work with any objective, including probabilistic ones.

Their main limitations are:

GBTs don’t work well with unstructured data such as images and audio.
Implementations are not as flexible as modern neural net libraries.

By Cornell University
© Copyright 2023.
