Lecture 16: Boosting

Contents
Lecture 16: Boosting
16.1. Defining Boosting
16.2. Structure of a Boosting Algorithm
16.3. Adaboost
16.4. Ensembling
16.5. Additive Models
16.6. Gradient Boosting

In this lecture, we will cover a new class of machine learning algorithms based on
an idea called Boosting. Boosting is an effective way to combine the predictions
from simple models into more complex and powerful ones that often attain state-
of-the-art performance on many machine learning competitions and benchmarks.

We will begin by defining boosting and seeing how this concept relates to our
previous lectures about bagging, which we saw in the context of random forests
and decision trees.

16.1. Defining Boosting

16.1.1. Review
16.1.1.1. Review: Overfitting
Recall that we saw in our lecture on decision trees and random forests that a
common machine learning failure mode is that of overfitting:

A very expressive model (e.g., a high-degree polynomial) fits the training
dataset perfectly.
The model also makes wildly incorrect predictions outside this dataset and
doesn’t generalize.

16.1.1.2. Review: Bagging


The idea of bagging was to reduce overfitting by averaging many models trained
on random subsets of the data.


ensemble = Ensemble([])

for i in range(n_models):
    # collect a bootstrap sample of the data and fit a model on it
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = Model().fit(X_i, y_i)
    ensemble.append(model)

# output average prediction at test time:
y_test = ensemble.average_prediction(X_test)
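
For reference, this idea is also available off the shelf; a minimal runnable sketch using sklearn (the choice of base model and the number of models are ours, not the lecture's):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# bag 100 decision trees, each fit on a bootstrap sample of the training set
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
# bagged.fit(X, y); bagged.predict(X_test) then averages the trees' votes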

16.1.2. Underfitting and Boosting


Underfitting is another common problem in machine learning that can be thought
of as the converse to overfitting.

The model is too simple to fit the data well (e.g., approximating a high degree
polynomial with linear regression).
As a result, the model is not accurate on training data and is not accurate on
new data.

16.1.2.1. Boosting
The idea of boosting is to reduce underfitting by combining models that correct
each others’ errors.

As in bagging, we combine many models $g_t$ into one ensemble $f$.

Unlike bagging, the $g_t$ are small and tend to underfit.

Each $g_t$ fits the points where the previous models made errors.

16.1.2.2. Weak Learners


A key ingredient of a boosting algorithm is a weak learner.

Intuitively, this is a model that is slightly better than random.


Examples of weak learners include: small linear models, small decision trees
(e.g., depth 1 or 2).

Let’s now move towards more fully describing the structure of a boosting algorithm.

16.2. Structure of a Boosting Algorithm

Step 1: Fit a weak learner $g_0$ on the dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$. Let $f = g_0$.

Step 2: Compute weights $w^{(i)}$ for each $i$ based on the model predictions $f(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors.

Step 3: Fit another weak learner $g_1$ on $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ with weights $w^{(i)}$, which place more emphasis on the points on which the existing model is less accurate.

Step 4: Set $f_1 = g_0 + \alpha_1 g_1$ for some weight $\alpha_1$. Go to Step 2 and repeat.

At each step, $f_t$ becomes more expressive since it is the sum of a larger number of weak learners that are each accurate on a different subset of the data.


In Python-like pseudocode, this looks as follows:

weights, ensemble = np.ones(n_data,), Ensemble([])

for i in range(n_models):
    # fit a weak learner on the weighted dataset
    model = SimpleBaseModel().fit(X, y, weights)
    predictions = model.predict(X)
    # compute this model's weight and update the data weights
    model_weight, weights = update_weights(weights, predictions)
    ensemble.add(model, model_weight)

# output consensus prediction at test time:
y_test = ensemble.predict(X_test)

16.3. Adaboost

Given its historical importance, we begin with an introduction to Adaboost and use it to further illustrate the structure of boosting algorithms in general.

Type: Supervised learning (classification).
Model family: Ensembles of weak learners (often decision trees).
Objective function: Exponential loss.
Optimizer: Forward stagewise additive model building (to be defined in more detail below).

As an interesting historical note, boosting algorithms were initially developed in the 1990s within theoretical machine learning.

Originally, boosting addressed the theoretical question of whether weak learners with >50% accuracy can be combined to form a strong learner.
Eventually, this research led to a practical algorithm called Adaboost.

Today, there exist many algorithms that are considered types of boosting, even though they’re not derived from the perspective of theoretical ML.

16.3.1. Defining Adaboost

We start with uniform weights $w^{(i)} = 1/n$ and $f = 0$. Then for $t = 1, 2, \ldots, T$:

Step 1: Fit a weak learner $g_t$ on $\mathcal{D}$ with weights $w^{(i)}$.

Step 2: Compute the misclassification error

$$e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g_t(x^{(i)})\}}{\sum_{i=1}^n w^{(i)}}$$

Recall that $\mathbb{I}\{\cdot\}$ is an indicator function that takes on value 1 when the condition in the brackets is true and 0 otherwise.
Notice that if all the weights $w^{(i)}$ are the same, then this is just the misclassification rate. When each weight can be different, we get a "weighted" misclassification error.

Step 3: Compute the model weight and update the function:

$$\alpha_t = \log[(1 - e_t)/e_t]$$
$$f \leftarrow f + \alpha_t g_t$$


Notice that $e_t$ intuitively measures how much influence $g_t$ should have on our overall predictor $f$. As $e_t$ approaches zero, meaning few of the highly weighted points were misclassified by $g_t$, $\alpha_t$ will be large, allowing $g_t$ to have a bigger contribution to our predictor. As $e_t$ approaches $1/2$ (recall that the $g_t$ are "weak learners"), $g_t$ is not doing much better than random guessing. This will cause $\alpha_t$ to be close to zero, meaning that $g_t$ contributes little to our overall prediction function.
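
For a quick numeric check (our own example, not from the lecture): $e_t = 0.2$ gives $\alpha_t = \log(0.8/0.2) \approx 1.39$, $e_t = 0.45$ gives $\alpha_t \approx 0.20$, and $e_t = 0.5$ gives $\alpha_t = 0$ exactly, so a learner at chance level is ignored.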

Step 4: Compute new data weights $w^{(i)} \leftarrow w^{(i)} \exp[\alpha_t \mathbb{I}\{y^{(i)} \neq f(x^{(i)})\}]$.

Exponentiation ensures that all the weights are positive.
If our predictor correctly classifies a point, its weight $w^{(i)}$ does not change. Any point that is misclassified by $f$ has its weight increased.
We again use $\alpha_t$ here, as above, to mediate how strongly we adjust the weights.
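
To make Steps 1-4 concrete, here is a minimal NumPy sketch of the full loop (our own illustration, not the lecture's reference implementation; the helper names and the use of depth-1 trees as weak learners are our choices). It assumes labels $y^{(i)} \in \{-1, +1\}$:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = X.shape[0]
    w = np.ones(n) / n                            # uniform initial weights
    models, alphas = [], []
    for t in range(T):
        g = DecisionTreeClassifier(max_depth=1)   # weak learner (a stump)
        g.fit(X, y, sample_weight=w)              # Step 1: fit on weighted data
        miss = (g.predict(X) != y)
        e = np.sum(w * miss) / np.sum(w)          # Step 2: weighted error e_t
        e = np.clip(e, 1e-10, 1 - 1e-10)          # numerical safety
        alpha = np.log((1 - e) / e)               # Step 3: model weight alpha_t
        w = w * np.exp(alpha * miss)              # Step 4: up-weight the errors
        models.append(g); alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # sign of the alpha-weighted vote of the weak learners
    return np.sign(sum(a * g.predict(X) for g, a in zip(models, alphas)))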

16.3.2. Adaboost: An Example


Let’s implement Adaboost on a simple dataset to see what it can do.

Let’s start by creating a classification dataset. We will intentionally create a 2D dataset where the classes are not easily separable.

# https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_twoclass.html
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=2., n_samples=200,
n_features=2, n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
n_samples=300, n_features=2, n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))

We can visualize this dataset using matplotlib.

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [12, 4]

# Plot the training points
plot_colors, plot_step, class_names = "br", 0.02, "AB"
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=c, cmap=plt.cm.Paired, s=60,
                edgecolor='k', label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right');


Let’s now train Adaboost on this dataset. We use the AdaBoostClassifier class
from sklearn.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create and fit an AdaBoosted decision tree
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)
bdt.fit(X, y)

AdaBoostClassifier(algorithm='SAMME',

base_estimator=DecisionTreeClassifier(max_depth=1),
n_estimators=200)

Visualizing the output of the algorithm, we see that it can learn a highly non-linear
decision boundary to separate the two classes.

xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

# plot decision boundary
Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

# plot training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=c, cmap=plt.cm.Paired, s=60,
                edgecolor='k', label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right');

16.4. Ensembling

Boosting and bagging are special cases of ensembling.


The idea of ensembling is to combine many models into one. Bagging and boosting
are ensembling techniques to reduce over- and under-fitting, respectively.

There are other approaches to ensembling that are useful to know about.

In stacking, we train $m$ independent models $g_j(x)$ (possibly from different model classes) and then train another model $f(x)$ to predict $y$ from the outputs of the $g_j$ (see the sketch below).
The Bayesian approach can also be seen as a form of ensembling,

$$P(y \mid x) = \int_\theta P(y \mid x, \theta) P(\theta \mid \mathcal{D}) \, d\theta,$$

where we average models $P(y \mid x, \theta)$ using weights $P(\theta \mid \mathcal{D})$.
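
A minimal sketch of stacking with sklearn (the base models and the final model here are our choices, not from the lecture):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# two independent base models g_j, plus a final model f fit on their outputs
stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=3)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression())
# stack.fit(X, y) trains the g_j, then trains f on their cross-validated outputs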

Ensembling is a useful technique in machine learning, as it often helps squeeze additional performance out of ML algorithms; however, this comes at the cost of additional (and potentially quite expensive) computation to train and use the ensemble.

As we have seen with Adaboost, boosting algorithms are a form of ensembling that yields high accuracy via a highly expressive non-linear model family. If trees are used as weak learners, then we also have the added benefit of requiring little to no preprocessing. However, as with the random forest algorithm, the interpretability of the individual weak learners is lost in the boosted ensemble.

16.4.1. Bagging vs. Boosting


We conclude this initial introduction to boosting by contrasting it to the bagging
approach we saw previously. While both concepts refer to methods for combining
the outputs of various models trained on the same dataset, there are important
distinctions between these concepts.

Bagging targets overfitting vs. boosting targets underfitting.


Bagging is a parallelizable method for combining models (e.g., each tree in
the random forest can be learned in parallel) vs. boosting is an inherently
sequential way to combine models.

Next, we are going to see another perspective on boosting and derive new
boosting algorithms.

16.5. Additive Models

Boosting can be seen as a way of fitting more general additive models:

$$f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t).$$

The main model $f(x)$ consists of $T$ smaller models $g$ with weights $\alpha_t$ and parameters $\phi_t$.

The parameters are the $\alpha_t$ plus the parameters $\phi_t$ of each $g$.


Additive models are more general than a linear model, because $g$ can be non-linear in $\phi_t$ (therefore so is $f$).

Boosting is a specific approach to training additive models. We will see a more general approach below.

16.5.1. Forward Stagewise Additive Modeling

A general way to fit additive models is the forward stagewise approach.

Suppose we have a loss $L : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$.

Start with

$$f_0 = \arg\min_\phi \sum_{i=1}^n L(y^{(i)}, g(x^{(i)}; \phi))$$

At each iteration $t$ we fit the best addition to the current model:

$$\alpha_t, \phi_t = \arg\min_{\alpha, \phi} \sum_{i=1}^n L(y^{(i)}, f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi))$$

Note that at each step $f_{t-1}$ is fixed, and we are only optimizing over the weight $\alpha_t$ and the new model parameters $\phi_t$. This helps keep the optimization process tractable.

We note some practical considerations in forward stagewise additive modeling:

Popular choices of $g$ include cubic splines, decision trees, and kernelized models.
We may use a fixed number of iterations $T$ or early stopping when the error on a hold-out set no longer improves.
An important design choice is the loss $L$.

16.5.2. Losses in Additive Models

We will now cover the various types of losses used in additive models and the implications of these different choices.

16.5.2.1. Exponential Loss

We start with the exponential loss. Given a binary classification problem with labels $\mathcal{Y} = \{-1, +1\}$, the exponential loss is defined as

$$L(y, f) = \exp(-y \cdot f).$$

When $y = 1$, $L$ is small when $f \to \infty$.
When $y = -1$, $L$ is small when $f \to -\infty$.


Let’s visualize the exponential loss and compare it to other losses.

from matplotlib import pyplot as plt
import numpy as np

plt.rcParams['figure.figsize'] = [12, 4]

# define the losses for a target of y=1


losses = {
'Hinge' : lambda f: np.maximum(1 - f, 0),
'L2': lambda f: (1-f)**2,
'L1': lambda f: np.abs(f-1),
'Exponential': lambda f: np.exp(-f)
}

# plot them
f = np.linspace(0, 2)
fig, axes = plt.subplots(2,2)
for ax, (name, loss) in zip(axes.flatten(), losses.items()):
ax.plot(f, loss(f))
ax.set_title('%s Loss' % name)
ax.set_xlabel('Prediction f')
ax.set_ylabel('L(y=1,f)')
plt.tight_layout()

Notice that the exponential loss penalizes misclassified points very heavily (i.e., exponentially). This could potentially be an issue in the presence of outliers, or if there is some "noise" in the labeling process, e.g., if points were originally classified by a human annotator with imperfect labeling.

Special Case: Adaboost

Adaboost is an instance of forward stagewise additive modeling with the exponential loss.

At each step $t$, we minimize

$$L_t = \sum_{i=1}^n e^{-y^{(i)}\left(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right)} = \sum_{i=1}^n w^{(i)} \exp\left(-y^{(i)} \alpha g(x^{(i)}; \phi)\right)$$

with $w^{(i)} = \exp(-y^{(i)} f_{t-1}(x^{(i)}))$.

We can derive the Adaboost update rules from this equation.

Suppose that $g(x; \phi) \in \{-1, 1\}$. With a bit of algebraic manipulation, we get that:

$$L_t = e^{\alpha} \sum_{y^{(i)} \neq g(x^{(i)})} w^{(i)} + e^{-\alpha} \sum_{y^{(i)} = g(x^{(i)})} w^{(i)} = (e^{\alpha} - e^{-\alpha}) \sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)})\} + e^{-\alpha} \sum_{i=1}^n w^{(i)},$$

where $\mathbb{I}\{\cdot\}$ is the indicator function.

From there, we get that:

$$\phi_t = \arg\min_\phi \sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)}; \phi)\}$$
$$\alpha_t = \log[(1 - e_t)/e_t],$$

where $e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)}; \phi_t)\}}{\sum_{i=1}^n w^{(i)}}$.

These are the update rules for Adaboost, and it’s not hard to show that the update rule for $w^{(i)}$ is the same as well.

16.5.2.2. Squared Loss

Another popular choice of loss is the squared loss, which allows us to derive a principled boosting algorithm for regression (as opposed to the exponential loss, which is used for classification). We define the squared loss as:

$$L(y, f) = (y - f)^2.$$

The resulting algorithm is often called L2Boost. At step $t$, we minimize

$$\sum_{i=1}^n (r_t^{(i)} - g(x^{(i)}; \phi))^2,$$

where $r_t^{(i)} = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model at time $t-1$.
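
A minimal sketch of L2Boost (our own illustration; the use of regression stumps as weak learners and the fixed step size are our choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2boost_fit(X, y, T=100, alpha=0.1):
    f = np.zeros(len(y))                      # current predictions f_{t-1}(x)
    models = []
    for t in range(T):
        r = y - f                             # residuals r_t
        g = DecisionTreeRegressor(max_depth=1).fit(X, r)  # fit g_t to residuals
        f = f + alpha * g.predict(X)          # f_t = f_{t-1} + alpha * g_t
        models.append(g)
    return models

def l2boost_predict(models, X, alpha=0.1):
    return alpha * sum(g.predict(X) for g in models)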

16.5.2.3. Logistic Loss

Another common loss is the logistic loss. When $\mathcal{Y} = \{-1, 1\}$, it is defined as:

$$L(y, f) = \log(1 + \exp(-2 \cdot y \cdot f)).$$

This resembles the log of the exponential loss; it is less sensitive to outliers since it doesn’t penalize large errors as much.

In the context of boosting, we minimize

$$J(\alpha, \phi) = \sum_{i=1}^n \log\left(1 + \exp\left(-2 y^{(i)} \left(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right)\right)\right).$$

This gives a different weight update compared to Adaboost. This algorithm is called LogitBoost.

Let’s plot some of these new losses as we did before.


from matplotlib import pyplot as plt
import numpy as np

plt.rcParams['figure.figsize'] = [12, 4]

# define the losses for a target of y=1


losses = {
'Hinge' : lambda f: np.maximum(1 - f, 0),
'L2': lambda f: (1-f)**2,
'Logistic': lambda f: np.log(1+np.exp(-2*f)),
'Exponential': lambda f: np.exp(-f)
}

# plot them
f = np.linspace(0, 2)
fig, axes = plt.subplots(2,2)
for ax, (name, loss) in zip(axes.flatten(), losses.items()):
ax.plot(f, loss(f))
ax.set_title('%s Loss' % name)
ax.set_xlabel('Prediction f')
ax.set_ylabel('L(y=1,f)')
ax.set_ylim([0,1])
plt.tight_layout()

To summarize what we have seen for additive models:

Additive models have the form $f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t)$.
These models can be fit using the forward stagewise additive approach.
This reproduces Adaboost (when using the exponential loss) and can be used to derive new boosting-type algorithms that optimize a wide range of objectives that are more robust to outliers and extend beyond classification.

We are now going to see another way of deriving boosting algorithms that is inspired by gradient descent.

16.6. Gradient Boosting

16.6.1. Limitations of Forward Stagewise Additive Modeling

Forward stagewise additive modeling is not without limitations.

There may exist other losses for which it is complex to derive boosting-type weight update rules.
At each step, we may need to solve a costly optimization problem over $\phi_t$.
Optimizing each $\phi_t$ greedily may cause us to overfit.

16.6.2. Motivating Gradient Boosting

Let’s start to motivate gradient boosting by taking a new lens to the boosting algorithms we saw above.

Consider, for example, L2Boost, which optimizes the L2 loss

$$L(y, f) = \frac{1}{2}(y - f)^2.$$

At step $t$, we minimize

$$\sum_{i=1}^n (r_t^{(i)} - g(x^{(i)}; \phi))^2,$$

where $r_t^{(i)} = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model at time $t-1$.

Observe that the residual is also (up to a sign) the derivative of the L2 loss $\frac{1}{2}(y^{(i)} - f)^2$ with respect to $f$ at $f_{t-1}(x^{(i)})$:

$$r_t^{(i)} = -\left.\frac{\partial L(y^{(i)}, f)}{\partial f}\right|_{f = f_{t-1}(x^{(i)})}$$

Thus, at step $t$, we are minimizing

$$\sum_{i=1}^n \Big( \underbrace{(y^{(i)} - f_{t-1}(x^{(i)}))}_{\text{negative derivative of } L \text{ at } f_{t-1}(x^{(i)})} - \; g(x^{(i)}; \phi_t) \Big)^2.$$

That is, we are trying to select the parameters $\phi_t$ that bring $g$ closest to the residuals, which we are now viewing as (negative) gradients with respect to $f_{t-1}(x^{(i)})$.

In the coming sections, we will explain why in L2Boost we are fitting the derivatives of the L2 loss.

16.6.3. Revisiting Supervised Learning

Let’s first recap classical supervised learning, and then contrast it with the gradient boosting approach to which we are building up.

16.6.3.1. Supervised Learning: The Model

Recall that a machine learning model is a function

$$f_\theta : \mathcal{X} \to \mathcal{Y}$$

that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$. The model has a $d$-dimensional set of parameters $\theta$:

$$\theta = (\theta_1, \theta_2, \ldots, \theta_d).$$

16.6.3.2. Supervised Learning: The Learning Objective

Intuitively, $f_\theta$ should perform well in expectation on new data $x, y$ sampled from the data distribution $\mathbb{P}$:

$$J(\theta) = \mathbb{E}_{(x,y)\sim \mathbb{P}}\left[L(y, f_\theta(x))\right] \text{ is "good".}$$

Here, $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a performance metric, and we take its expectation or average over all the possible samples $x, y$ from $\mathbb{P}$.

Recall that formally, an expectation $\mathbb{E}_{x\sim P} f(x)$ is $\sum_{x \in \mathcal{X}} f(x)P(x)$ if $x$ is discrete and $\int_{x \in \mathcal{X}} f(x)P(x)\,dx$ if $x$ is continuous. Thus,

$$J(\theta) = \mathbb{E}_{(x,y)\sim \mathbb{P}}\left[L(y, f_\theta(x))\right] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} L(y, f_\theta(x)) \, \mathbb{P}(x, y)$$

is the performance on an infinite-sized holdout set, where we have sampled every possible point.

16.6.3.3. Supervised Learning: The Optimizer (Gradient Descent)

The gradient $\nabla J(\theta)$ is the $d$-dimensional vector of partial derivatives:

$$\nabla J(\theta) = \begin{bmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_d} \end{bmatrix}.$$

The $j$-th entry of the vector $\nabla J(\theta)$ is the partial derivative $\frac{\partial J(\theta)}{\partial \theta_j}$ of $J$ with respect to the $j$-th component of $\theta$.

We can optimize $J(\theta)$ using gradient descent via the usual update rule:

$$\theta_t \leftarrow \theta_{t-1} - \alpha_t \nabla J(\theta_{t-1}).$$

However, in practice, we cannot measure

$$\nabla J(\theta) = \mathbb{E}_{(x,y)\sim \mathbb{P}}\left[\nabla L(y, f_\theta(x))\right]$$

on infinite data. We substitute $\nabla J(\theta)$ with an approximation $\hat\nabla J(\theta)$ measured on a dataset $\mathcal{D}$ sampled from $\mathbb{P}$:

$$\hat\nabla J(\theta) = \frac{1}{m} \sum_{i=1}^m \nabla L(y^{(i)}, f_\theta(x^{(i)})).$$

If the number of IID samples $m$ is large, this approximation holds (we call this a Monte Carlo approximation).
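
A tiny sketch of this recipe (our own example: linear regression with the L2 loss, with data, model, and step size all chosen by us):

import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 3)                     # samples from P
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(1000)

theta, alpha = np.zeros(3), 0.1
for t in range(200):
    idx = np.random.choice(len(X), size=100, replace=False)  # a dataset D ~ P
    residual = X[idx] @ theta - y[idx]
    grad = 2 * X[idx].T @ residual / len(idx)    # Monte Carlo estimate of grad J(theta)
    theta = theta - alpha * grad                 # the usual gradient descent update
print(theta)                                     # approaches [1.0, -2.0, 0.5]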

16.6.4. Supervised Learning Over Functions

Intuitively, the gradient boosting algorithm asks: "what if, instead of optimizing over the finite-dimensional parameters $\theta \in \mathbb{R}^d$, we try optimizing directly over infinite-dimensional functions?"

But what do we mean by "infinite-dimensional functions"? Letting our model space be the (unrestricted) set of functions $f : \mathcal{X} \to \mathcal{Y}$, each function is an infinite-dimensional vector indexed by $x \in \mathcal{X}$:

$$f = \begin{bmatrix} \vdots \\ f(x) \\ \vdots \end{bmatrix}.$$

The $x$-th component of the vector $f$ is $f(x)$. So rather than uniquely characterizing a function by some finite-dimensional vector of parameters, a point in function space can be uniquely characterized by the values that it takes on every possible input, of which there can be infinitely many. It’s as if we choose infinitely many parameters $\theta = (\ldots, f(x), \ldots)$ that specify function values, and we optimize over those.

The Learning Objective

Our learning objective $J(f)$ is now defined over $f$. Although the form of the objective is equivalent to the standard supervised learning setup we recalled above, we can think of optimizing $J$ over a "very high-dimensional" (potentially infinite) vector of "parameters" $f$.

Keep in mind that $f$ should perform well in expectation on new data $x, y$ sampled from the data distribution $\mathbb{P}$:

$$J(f) = \mathbb{E}_{(x,y)\sim \mathbb{P}}\left[L(y, f(x))\right] \text{ is "good".}$$

Functional Gradients

We would like to again optimize $J(f)$ using gradient descent:

$$\min_f J(f) = \min_f \mathbb{E}_{(x,y)\sim \mathbb{P}}\left[L(y, f(x))\right].$$

We may define the functional gradient of this loss at $f$ as an infinite-dimensional vector $\nabla J(f) : \mathcal{X} \to \mathbb{R}$ "indexed" by $x$:

$$\nabla J(f) = \begin{bmatrix} \vdots \\ \frac{\partial J(f)}{\partial f(x)} \\ \vdots \end{bmatrix}.$$

Let’s compare the parametric and the functional gradients.

The parametric gradient $\nabla J(\theta) \in \mathbb{R}^d$ is a vector of the same shape as $\theta \in \mathbb{R}^d$. Both $\nabla J(\theta)$ and $\theta$ are indexed by $j = 1, 2, \ldots, d$.

The functional gradient $\nabla J(f) : \mathcal{X} \to \mathbb{R}$ is a vector of the same shape as $f : \mathcal{X} \to \mathbb{R}$. Both $\nabla J(f)$ and $f$ are "indexed" by $x \in \mathcal{X}$.

The parametric gradient $\nabla J(\theta)$ at $\theta = \theta_0$ tells us how to modify $\theta_0$ in order to further decrease the objective $J$ starting from $J(\theta_0)$.

The functional gradient $\nabla J(f)$ at $f = f_0$ tells us how to modify $f_0$ in order to further decrease the objective $J$ starting from $J(f_0)$. We can think of this as the functional gradient telling us how to change the output of $f_0$ for each possible input in order to better optimize our objective.

This is best understood via a picture: the functional gradient is a function that tells us how much we "move" $f(x)$ at each point $x$. Given a good step size, the resulting new function will be closer to minimizing $J$.
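
To make the "vector indexed by x" view concrete, here is a toy numeric sketch (entirely our own example) of one functional gradient step under the L2 loss $L(y, f) = \frac{1}{2}(y - f)^2$, where $f$ is represented by its values at three training inputs:

import numpy as np

y = np.array([1.0, -2.0, 0.5])   # targets at three training inputs
f = np.zeros(3)                  # f_0, stored as its values at those inputs
grad = f - y                     # dL/df at each point: the functional gradient
f_new = f - 0.5 * grad           # move f(x) at every point; step size 0.5
print(f_new)                     # [ 0.5  -1.    0.25] -- closer to y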

Recall that we are taking the perspective that $f$ is a vector indexed by $x$. Thus the $x$-th entry of the vector $\nabla J(f)$ is the partial derivative $\frac{\partial J(f)}{\partial f(x)}$ of $J$ with respect to $f(x)$, the $x$-th component of $f$. So the functional gradient is

$$\nabla J(f)(x) = \frac{\partial J(f)}{\partial f(x)} = \frac{\partial}{\partial f(x)}\left(\mathbb{E}_{(x,y)\sim \mathbb{P}}\left[L(y, f(x))\right]\right) = \left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f(x)}.$$

This is an infinite-dimensional vector indexed by $x$.

Functional Gradient Descent

Previously, we optimized $J(\theta)$ using gradient descent via the update rule:

$$\theta_t \leftarrow \theta_{t-1} - \alpha_t \nabla J(\theta_{t-1}).$$

We can now optimize our objective using gradient descent in functional space via the same update:

$$f_t \leftarrow f_{t-1} - \alpha_t \nabla J(f_{t-1}).$$

After $T$ steps of this update, we get a model of the form

$$f_T = f_0 - \sum_{t=0}^{T-1} \alpha_t \nabla J(f_t).$$

Recall that each $\nabla J(f_t)$ is a function of $x$. Therefore $f_T$ is a function of $x$ as well, and as a function found through gradient descent, $f_T$ will minimize $J$.

The Challenge of Supervised Learning Over Functions

But recall as well that in the standard supervised learning approach we reviewed above, we were not able to compute $\nabla J(\theta) = \mathbb{E}_{(x,y)\sim \mathbb{P}}[\nabla L(y, f_\theta(x))]$ on infinite data, and instead we used the approximation

$$\hat\nabla J(\theta) = \frac{1}{m} \sum_{i=1}^m \nabla L(y^{(i)}, f_\theta(x^{(i)})).$$

In the case of functional gradients, we also need to find an approximation $\hat\nabla J(f)$. This is more challenging than before:

$\nabla J(f)(x) = \left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f(x)}$ is not an expectation, so we can’t approximate it with an average over the data.
We cannot represent $\nabla J(f)$ directly because it’s a general function.
We cannot measure $\nabla J(f)$ at each $x$ (only at the $n$ training points).
Even if we could, the problem would be too unconstrained and generally intractable to optimize.

Instead, $\nabla J(f)$ is a function that we need to "learn" from $\mathcal{D}$. (We will use supervised learning for this!)

16.6.5. Modeling Functional Gradients

We will address the above problem by learning a model of gradients.

In supervised learning, we only have access to $n$ data points that describe the true $\mathcal{X} \to \mathcal{Y}$ mapping (call it $f^*$).
We learn a model $f_\theta : \mathcal{X} \to \mathcal{Y}$ from a function class $\mathcal{M}$ to approximate $f^*$.
The model extrapolates beyond the training set. Given enough datapoints, $f_\theta$ learns the true mapping.

We can apply the same idea to gradients, learning $\nabla J(f)$:

We search for a model $g_{\theta_t} : \mathcal{X} \to \mathbb{R}$ within a more restricted function class $\mathcal{M}$ that can approximate the functional gradient: $g_{\theta_t} \in \mathcal{M}$, with $g_{\theta_t} \approx \nabla J(f_t)$.
The model extrapolates beyond the training set. Given enough datapoints, $g_{\theta_t}$ learns $\nabla J(f_t)$.
Think of $g_{\theta_t}$ as the projection of $\nabla J(f_t)$ onto the function class $\mathcal{M}$.

Functional descent will then have the form:

$$\underbrace{f_t(x)}_{\text{new function}} \leftarrow \underbrace{f_{t-1}(x) - \alpha g_{\theta_{t-1}}(x)}_{\text{old function}\; - \;\text{gradient step}}.$$

If $g$ generalizes, this approximates $f_t \leftarrow f_{t-1} - \alpha \nabla J(f_{t-1})$.

16.6.6. Fitting Functional Gradients

In practice, what does it mean to approximate a functional gradient $g \approx \nabla J(f)$? We can use standard supervised learning. Suppose we have a fixed function $f$ and we want to estimate the functional gradient of $L$,

$$\left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f(x)},$$

at any $x \in \mathcal{X}$.

Step 1: We define a loss $L_g$ (e.g., the L2 loss) to measure how well $g \approx \nabla J(f)$.

Step 2: We compute $\nabla J(f)$ on the training dataset:

$$\mathcal{D}_g = \left\{ \left(x^{(i)},\; \underbrace{\left.\frac{\partial L(y^{(i)}, f)}{\partial f}\right|_{f = f(x^{(i)})}}_{\text{functional derivative } \nabla J(f)_i \text{ at } f(x^{(i)})} \right),\; i = 1, 2, \ldots, n \right\}.$$

Step 3: We train a model $g : \mathcal{X} \to \mathbb{R}$ on $\mathcal{D}_g$ to predict functional gradients at any $x$:

$$g(x) \approx \left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f(x)}.$$

16.6.7. Gradient Boosting

We now have the motivation and background needed to define gradient boosting. Gradient boosting is a procedure that performs functional gradient descent with approximate gradients.

Start with $f(x) = 0$. Then, at each step $t > 1$:

Step 1: Create a training dataset $\mathcal{D}_g$ and fit $g_t(x^{(i)})$ using loss $L_g$:

$$g_t(x) \approx \left.\frac{\partial L(y, f)}{\partial f}\right|_{f = f_{t-1}(x)}.$$

Step 2: Take a step of gradient descent using approximate gradients with step size $\alpha_t$:

$$f_t(x) = f_{t-1}(x) - \alpha_t \cdot g_t(x).$$

16.6.8. Interpreting Gradient Boosting

Notice how after $T$ steps we get an additive model of the form

$$f(x) = \sum_{t=1}^T \alpha_t g_t(x).$$

This looks like the output of a boosting algorithm! However, unlike before for forward stagewise additive models:

This works for any differentiable loss $L$.
It does not require any mathematical derivations for new $L$.
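
The two steps above translate directly into code. Below is a minimal sketch (our own illustration; the tree-based gradient model, step size, and helper names are our choices) that works for any differentiable loss supplied via its pointwise derivative loss_grad(y, f):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss_grad, T=100, alpha=0.1):
    f = np.zeros(len(y))                      # f_0(x) = 0 on the training set
    models = []
    for t in range(T):
        grads = loss_grad(y, f)               # Step 1: functional gradients at f_{t-1}
        g = DecisionTreeRegressor(max_depth=3).fit(X, grads)  # fit g_t on D_g
        f = f - alpha * g.predict(X)          # Step 2: f_t = f_{t-1} - alpha * g_t
        models.append(g)
    return models

def gradient_boost_predict(models, X, alpha=0.1):
    return -alpha * sum(g.predict(X) for g in models)

# For L(y, f) = 0.5 * (y - f)**2, dL/df = f - y, recovering L2Boost:
l2_grad = lambda y, f: f - y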

16.6.9. Returning to L2 Boosting

To better highlight the connections between boosting and gradient boosting, let’s return to the example of L2Boost, which optimizes the L2 loss

$$L(y, f) = \frac{1}{2}(y - f)^2.$$

At step $t$, we minimize

$$\sum_{i=1}^n (r_t^{(i)} - g(x^{(i)}; \phi))^2,$$

where $r_t^{(i)} = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model at time $t-1$.

Observe that the residual

$$r_t^{(i)} = y^{(i)} - f_{t-1}(x^{(i)})$$

is also (up to a sign) the gradient of the L2 loss with respect to $f$ at $f_{t-1}(x^{(i)})$:

$$r_t^{(i)} = -\left.\frac{\partial L(y^{(i)}, f)}{\partial f}\right|_{f = f_{t-1}(x^{(i)})}.$$

This answers our question from above as to why in L2Boost we are fitting the derivatives of the L2 loss. The reason is that we are finding an approximation $g(\cdot; \phi)$ to the (negative) functional gradient $\nabla J(f)$: we minimize the squared loss between $-\nabla J(f)(x^{(i)}) = r_t^{(i)}$ and $g(x^{(i)}; \phi)$ at our $n$ training points, so that the additive update $f_t = f_{t-1} + \alpha_t g_t$ is precisely a functional gradient step.

Many boosting methods are special cases of gradient boosting in this way.

16.6.10. Losses for Additive Models vs. Gradient Boosting

We have seen several losses that can be used with the forward stagewise additive approach.

The exponential loss $L(y, f) = \exp(-yf)$ gives us Adaboost.
The log-loss $L(y, f) = \log(1 + \exp(-2yf))$ is more robust to outliers.
The squared loss $L(y, f) = (y - f)^2$ can be used for regression.

Gradient boosting can optimize a wider range of losses.

Regression losses:
L2, L1, and Huber (an L1/L2 interpolation) losses.
Quantile loss: estimates quantiles of the distribution p(y|x).
Classification losses:
Log-loss, softmax loss, exponential loss, negative binomial likelihood, etc.

When using gradient boosting, these additional facts are useful (see the sklearn sketch below):

We most often use small decision trees as the learner $g_t$; thus, input preprocessing is minimal.
We can regularize by controlling tree size, step size $\alpha$, and using early stopping.
We can scale up gradient boosting to big data by sub-sampling the data at each iteration (a form of stochastic gradient descent).
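
As a concrete illustration of these knobs (our own choice of values, not from the lecture), sklearn's GradientBoostingRegressor exposes tree size, step size, sub-sampling, and early stopping directly:

from sklearn.ensemble import GradientBoostingRegressor

gbt = GradientBoostingRegressor(
    loss='quantile', alpha=0.9,     # estimate the 90th percentile of p(y|x)
    max_depth=3,                    # regularize by controlling tree size
    learning_rate=0.1,              # step size
    subsample=0.5,                  # sub-sample the data at each iteration
    n_estimators=500,
    validation_fraction=0.1,        # early stopping on a hold-out set
    n_iter_no_change=10,
)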

16.6.11. Algorithm: Gradient Boosting

As with the other algorithms we’ve seen, we can now present the algorithmic components of gradient boosting.

Type: Supervised learning (classification and regression).

Model family: Ensembles of weak learners (often decision trees).
Objective function: Any differentiable loss function.
Optimizer: Gradient descent in functional space; the weak learner uses its own optimizer.
Probabilistic interpretation: None in general; certain losses may have one.

16.6.12. Gradient Boosting: An Example


Let’s now try running Gradient Boosted Decision Trees (GBDT) on a small
regression dataset.

First we create the dataset. Our values come from a non-linear function f(x) plus
some noise.

# https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html
X = np.atleast_2d(np.random.uniform(0, 10.0, size=100)).T
X = X.astype(np.float32)

# Create dataset
f = lambda x: x * np.sin(x)
y = f(X).ravel()
dy = 1.5 + 1.0 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise

# Visualize it
xx = np.atleast_2d(np.linspace(0, 10, 1000)).T
plt.plot(xx, f(xx), 'g:', label=r'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'b.', markersize=10, label=u'Observations');

Next, we train a GBDT regressor, using the GradientBoostingRegressor class from sklearn.

from sklearn.ensemble import GradientBoostingRegressor

alpha = 0.95
clf = GradientBoostingRegressor(loss='squared_error', alpha=alpha,
n_estimators=250, max_depth=3,
learning_rate=.1,
min_samples_leaf=9,
min_samples_split=9)
clf.fit(X, y)

GradientBoostingRegressor(alpha=0.95, min_samples_leaf=9,
min_samples_split=9,
n_estimators=250)


We may now visualize its predictions:

y_pred = clf.predict(xx)
plt.plot(xx, f(xx), 'g:', label=r'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'b.', markersize=10, label=u'Observations')
plt.plot(xx, y_pred, 'r-', label=u'Prediction');

16.6.14. Pros and Cons of Gradient Boosting

Gradient boosted decision trees (GBTs) are one of the best off-the-shelf ML algorithms that exist, often on par with deep learning.

Attain state-of-the-art performance. GBTs have won the most Kaggle competitions.
Require little data preprocessing and tuning.
Work with any objective, including probabilistic ones.

Their main limitations are:

GBTs don’t work well with unstructured data such as images and audio.
Implementations are not as flexible as modern neural net libraries.

By Cornell University
© Copyright 2023.
