16 Boosting
Ensemble methods are used to form combinations of predictors that achieve a bias-variance trade-
off better than the one achieved by the algorithm generating the predictors in the combination.
Many stochastic gradient descent algorithms, like Pegasos, can be viewed as ensemble methods
because they output a combination of all the predictors generated during the sequential run over
the training set. We now look at ensemble methods that are not based on online algorithms. As
usual, our focus is on binary classification.
Fix a training set (x1 , y1 ), . . . , (xm , ym ) for a binary classification problem with zero-one loss, and
assume an ensemble of classifiers h1 , . . . , hT is available (later, we describe how to obtain the
ensemble). Consider the majority classifier f defined by
\[ f(x) = \mathrm{sgn}\Bigg(\sum_{i=1}^T h_i(x)\Bigg) \tag{1} \]
Clearly, f is wrong on x if and only if at least half of the classifiers h1 , . . . , hT are wrong on x
(assume T is odd to avoid ties). We now study the conditions under which the majority classifier
achieves a small training error. Assume
\[ \mathbb{P}\big(h_1(x_Z) \neq y_Z \wedge \cdots \wedge h_T(x_Z) \neq y_Z\big) = \prod_{i=1}^T \mathbb{P}\big(h_i(x_Z) \neq y_Z\big) \tag{2} \]
where Z is a random variable uniformly distributed on the set {1, . . . , m} of indices of training
examples. In other words, each classifier is wrong independently of the others with respect to
the uniform distribution over the training set. The indicator functions of these events define the
training error `S (hi ) of each classifier hi . Indeed,
\[ \ell_S(h_i) = \frac{1}{m}\sum_{t=1}^m I\{h_i(x_t) \neq y_t\} = \mathbb{P}\big(h_i(x_Z) \neq y_Z\big) \]
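To fix ideas, here is a minimal numpy sketch (the function names are ours, purely illustrative) of the majority classifier (1) and of the training error computed as an average of indicators:

import numpy as np

def majority_vote(predictions):
    """predictions: (T, m) array with entries in {-1, +1}, one row per classifier."""
    # sgn of the sum of the T votes; with T odd the sum is never zero
    return np.sign(predictions.sum(axis=0))

def training_error(y_pred, y_true):
    """Fraction of training points on which y_pred and y_true disagree."""
    return np.mean(y_pred != y_true)

# Example with T = 3 hypothetical classifiers on m = 4 points
preds = np.array([[+1, -1, +1, +1],
                  [+1, +1, -1, +1],
                  [-1, +1, +1, -1]])
y = np.array([+1, +1, +1, -1])
print(training_error(majority_vote(preds), y))  # training error of the majority classifier f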
The majority classifier (1) is a simple example of an ensemble method, combining the predictions
of an ensemble of classifiers in order to boost the accuracy. We now bound the training error of f .
Introduce
\[ \ell_{\mathrm{ave}} = \frac{1}{T}\sum_{i=1}^T \ell_S(h_i) \]
Without loss of generality, we can assume that ℓ_S(h_i) ≤ 1/2 for each i = 1, . . . , T. This implies ℓ_ave ≤ 1/2 and we can write
\[
\ell_S(f) = \mathbb{P}\big(f(x_Z) \neq y_Z\big)
= \mathbb{P}\Bigg(\sum_{i=1}^T I\{h_i(x_Z) \neq y_Z\} > \frac{T}{2}\Bigg)
= \mathbb{P}\Bigg(\frac{1}{T}\sum_{i=1}^T I\{h_i(x_Z) \neq y_Z\} > \ell_{\mathrm{ave}} + \Big(\frac{1}{2} - \ell_{\mathrm{ave}}\Big)\Bigg)
\]
Now introduce the Bernoulli random variables B_1, . . . , B_T defined by B_i = I{h_i(x_Z) ≠ y_Z}. Note that these random variables are independent, due to our assumption (2). Also, E[B_i] = ℓ_S(h_i) and
\[ \frac{1}{T}\sum_{i=1}^T \mathbb{E}[B_i] = \ell_{\mathrm{ave}} \]
Since B_1, . . . , B_T are independent random variables taking values in [0, 1], Hoeffding's inequality applied to the last probability above gives
\[ \ell_S(f) \le e^{-2T(1/2 - \ell_{\mathrm{ave}})^2} \le e^{-2T\gamma^2} \tag{3} \]
whenever each classifier satisfies γ_i := 1/2 − ℓ_S(h_i) ≥ γ > 0, which implies 1/2 − ℓ_ave ≥ γ.
This result tells us that if we manage to obtain classifiers with independent training errors, in the sense of (2), then the training error of the majority vote classifier decreases exponentially with respect to Tγ², where γ measures how much better than random guessing each classifier h_i performs on the training set. Note that being able to reduce the training error on arbitrary datasets implies a decrease of the bias error.
Bagging. How can we obtain classifiers with independent training errors? A popular heuristic,
known as Bagging, applies to any learning algorithm A for binary classification and to any training
set S. Let m be the size of S. Bagging builds h1 , . . . , hT by drawing m examples uniformly at
random with replacement from S. This process is repeated T times so as to obtain the resampled
training sets S1 , . . . , ST . Then A is run on each Si setting hi = A(Si ). The idea is that the
resampling procedure helps enforce condition (2).
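As an illustration, here is a minimal numpy sketch of Bagging (names are ours, not from the text): A stands for any learning algorithm mapping a training set to a classifier with a predict method.

import numpy as np

def bagging(A, X, y, T, rng=np.random.default_rng(0)):
    """Build h_1, ..., h_T by running A on T bootstrap samples of size m."""
    m = len(y)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)      # m draws with replacement from S
        classifiers.append(A(X[idx], y[idx]))  # h_i = A(S_i)
    return classifiers

def bagged_predict(classifiers, X):
    # Majority vote (1) over the ensemble h_1, ..., h_T
    votes = np.array([h.predict(X) for h in classifiers])
    return np.sign(votes.sum(axis=0))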
One may wonder how different from S any Si can be. To find that out, we take a little detour and
compute the fraction of unique data points in Si . As you will see in a moment, more than one third of
the points of S are missing from Si in expectation! Let N be the number of unique points drawn,
and let Xt be the indicator function of the event that (xt , yt ) is drawn. Then the probability that
(xt , yt ) is not drawn is
\[ \mathbb{P}(X_t = 0) = \Big(1 - \frac{1}{m}\Big)^m \]
So we have
\[ \mathbb{E}[N] = \sum_{t=1}^m \mathbb{E}[X_t] = \sum_{t=1}^m \mathbb{P}(X_t = 1) = \sum_{t=1}^m \Big(1 - \Big(1-\frac{1}{m}\Big)^m\Big) = m - m\Big(1-\frac{1}{m}\Big)^m \]
Therefore, the expected fraction of unique points in S_i is
\[ \frac{\mathbb{E}[N]}{m} = 1 - \Big(1-\frac{1}{m}\Big)^m \approx 1 - \frac{1}{e} = 0.632\ldots \]
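A quick numerical check of the limit above (purely illustrative):

import numpy as np

# The expected fraction of unique points, 1 - (1 - 1/m)^m, is already
# close to 1 - 1/e for moderate values of m.
for m in (10, 100, 1000):
    print(m, 1 - (1 - 1/m)**m)
print("1 - 1/e =", 1 - 1/np.e)   # 0.6321...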
We saw that independence of errors helps reduce the bias by driving the training error to zero.
On the other hand, subsampling of the training set helps reduce the variance. If we think of the
m training points arranged in an m × d matrix (called the data matrix), then what bagging does
is subsampling the rows of this matrix. We now briefly describe another ensemble method that
increases the protection against overfitting by also subsampling the columns of the data matrix.
Random Forest. This ensemble method works by taking a majority vote over an ensemble
h1 , . . . , hT of tree predictors. Similarly to bagging, each tree predictor hi is obtained by running
a learning algorithm over a dataset Si obtained by subsampling the rows of the full data matrix.
However, the algorithm for learning tree predictors does not have direct access to S_i. Indeed, when considering a leaf ℓ for splitting, instead of being given S_{i,ℓ} (the set of training examples in S_i that are routed to ℓ), the algorithm has access to a version of S_{i,ℓ} containing a random subset of the original features (typically, √d features are sampled from the original d features). This additional
sampling provides a better control on the variance at the expense of the bias. Because of its good
performance on many learning tasks, Random Forest is often used as a baseline when testing a new
learning algorithm.
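As a small illustration of the column-subsampling step (the function name is ours, and the split-selection procedure itself is omitted):

import numpy as np

def sample_split_features(d, rng=np.random.default_rng(0)):
    """Random subset of about sqrt(d) feature indices made available for one split."""
    k = max(1, int(np.sqrt(d)))                    # typically sqrt(d) features
    return rng.choice(d, size=k, replace=False)    # column indices to consider

# Example: with d = 25 features, each split only sees 5 random columns
print(sample_split_features(25))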
Boosting. We now introduce boosting, a principled ensemble method that achieves the exponential
bound (3) on the training error without requiring the demanding condition (2). Boosting is an
incremental method to build classifiers of the form sgn(f ), where
\[ f = \sum_{i=1}^T w_i h_i \]
for some coefficients w_1, . . . , w_T ∈ R and base classifiers h_1, . . . , h_T chosen from a fixed class H.
In practice, also in order to save computational costs, the base classifiers are very simple. A typical choice for H is the class of decision stumps. These are all the classifiers of the form h_{i,τ} : R^d → {−1, 1} defined by h_{i,τ}(x) = ± sgn(x_i − τ), where i = 1, . . . , d, τ ∈ R, and x_i denotes the i-th coordinate of x.
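To make the definition concrete, here is a small numpy sketch of decision stumps (function names are ours): evaluating h_{i,τ} and exhaustively searching for the stump with smallest training error, where the search ranges over features, observed thresholds, and the sign.

import numpy as np

def stump_predict(X, i, tau, s):
    """Evaluate h_{i,tau}(x) = s * sgn(x_i - tau), with s in {-1, +1}, on the rows of X."""
    p = np.sign(X[:, i] - tau)
    p[p == 0] = 1          # break ties so that outputs are in {-1, +1}
    return s * p

def best_stump(X, y):
    """Brute-force search for the stump minimizing the (unweighted) training error."""
    m, d = X.shape
    best_err, best = np.inf, None
    for i in range(d):
        for tau in np.unique(X[:, i]):
            for s in (-1, +1):
                err = np.mean(stump_predict(X, i, tau, s) != y)
                if err < best_err:
                    best_err, best = err, (i, tau, s)
    return best, best_err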
The specific boosting algorithm we introduce is known as AdaBoost (adaptive boosting). Fix a
training set S with m examples (x1 , y1 ), . . . , (xm , ym ), and a sequence h1 , . . . , hT of base classifiers.
We now show how to choose the coefficients w_1, . . . , w_T so that the training error is bounded as in (3). AdaBoost uses the convex upper bound I{z ≤ 0} ≤ e^{−z} on the zero-one loss I{f(x_t) y_t ≤ 0}.
This gives
\[ \ell_S(f) = \frac{1}{m}\sum_{t=1}^m I\{f(x_t)y_t \le 0\} \le \frac{1}{m}\sum_{t=1}^m e^{-f(x_t)y_t} = \frac{1}{m}\sum_{t=1}^m e^{-\sum_{i=1}^T w_i h_i(x_t) y_t} \]
Note that other algorithms use different convex upper bounds on the zero-one loss. For example,
SVM uses the hinge loss.
Introduce now the functions L1 , . . . , LT defined by Li (t) = hi (xt )yt . Note that Li (t) ∈ {−1, 1} and
Li (t) = 1 if and only if hi (xt ) = yt . Recalling that Z is a random variable uniformly distributed in
{1, . . . , m}, we can view each Li (Z) as a random variable and write
\[ \ell_S(f) \le \frac{1}{m}\sum_{t=1}^m e^{-\sum_{i=1}^T w_i L_i(t)} = \mathbb{E}\Big[e^{-\sum_{i=1}^T w_i L_i(Z)}\Big] = \mathbb{E}\Big[\prod_{i=1}^T e^{-w_i L_i(Z)}\Big] \]
If condition (2) were true, we could write the expectation of the product as a product of expecta-
tions. In order to sidestep the condition, we change the probability space and write
"T # T
Y Y h i
−wi Li (Z) −wi Li (Zi )
E e = Ei e (4)
i=1 i=1
where each Z_i is a random variable distributed according to a probability P_i on {1, . . . , m} (the distributions P_i will be defined later), E_i denotes expectation with respect to P_i, and we set
\[ \varepsilon_i \stackrel{\mathrm{def}}{=} P_i\big(L_i(Z_i) = -1\big) = \sum_{t=1}^m I\{L_i(t) = -1\}\,P_i(t) \]
Note that ε_i is the error of h_i with respect to the probability P_i. Namely, ε_i is the weighted training error of h_i, where the weights are determined by P_i. Since L_i(Z_i) ∈ {−1, 1}, we have E_i[e^{−w_i L_i(Z_i)}] = e^{−w_i}(1 − ε_i) + e^{w_i}ε_i, and therefore
\[ \ell_S(f) \le \prod_{i=1}^T \Big(e^{-w_i}(1-\varepsilon_i) + e^{w_i}\varepsilon_i\Big) \tag{5} \]
Before computing the P_i, we show how to pick w_1, . . . , w_T in order to minimize (5). By computing the zeros of the derivative of e^{−w}(1 − ε_i) + e^{w}ε_i with respect to w, we find a single zero at
\[ w_i = \frac{1}{2}\ln\frac{1-\varepsilon_i}{\varepsilon_i}\;. \]
Note that the above expression is only defined for 0 < ε_i < 1. As we will see, P_i(t) > 0 for all t ∈ {1, . . . , m}. Hence ε_i ∈ {0, 1} implies that either h_i or −h_i has zero training error on S. If this happens, then we can throw away all h_j for j ≠ i and avoid using boosting altogether. Therefore, without loss of generality we may assume 0 < ε_i < 1 for all i = 1, . . . , T.
Note that w_i = 0 if and only if ε_i = 1/2, meaning that the weight (according to P_i) of the training points where h_i errs is exactly 1/2. Because such an h_i does not affect the value of f (since ε_i = 1/2 implies w_i = 0), without loss of generality we may also assume that ε_i ≠ 1/2.
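As a quick numerical illustration, suppose (hypothetically) that h_i has weighted error ε_i = 0.3 under P_i. Then
\[ w_i = \frac{1}{2}\ln\frac{1 - 0.3}{0.3} = \frac{1}{2}\ln\frac{7}{3} \approx 0.42 \]
so a base classifier that beats random guessing under P_i receives a positive weight, while an ε_i larger than 1/2 would give a negative weight, which amounts to flipping the sign of h_i's vote.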
Set γ_i := 1/2 − ε_i and note that ε_i ≠ 1/2 implies γ_i ≠ 0. Substituting the value of w_i into (5) gives e^{−w_i}(1 − ε_i) + e^{w_i}ε_i = 2√(ε_i(1 − ε_i)) = √(4ε_i(1 − ε_i)). Using the inequality 1 + x ≤ e^x, which holds for all x ∈ R, we get
\[
\ell_S(f) \le \prod_{i=1}^T \sqrt{4\varepsilon_i(1-\varepsilon_i)} = \prod_{i=1}^T \sqrt{1-4\gamma_i^2} \le \prod_{i=1}^T e^{-2\gamma_i^2} = e^{-2\sum_{i=1}^T \gamma_i^2} \le e^{-2T\gamma^2}
\]
where in the last step we assumed γ_i ≥ γ > 0 for all i = 1, . . . , T. This is the same bound as the one we proved in (3) under the condition (2). Note, however, that the definition of γ_i = 1/2 − ε_i changes, because ε_i is now the weighted training error of h_i.
Just like (3), this bound provides a pretty strong control on the bias. Using the observation that
`S (f ) = 0 if and only if `S (f ) < 1/m, we conclude that a number
\[ T > \frac{\ln m}{2\gamma^2} \]
of boosting rounds is sufficient to bring the training error of f down to zero.
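To get a rough sense of scale, suppose (hypothetically) that m = 1000 and that every base classifier has edge γ = 0.1. Then
\[ T > \frac{\ln 1000}{2 \times (0.1)^2} \approx \frac{6.91}{0.02} \approx 345 \]
so T = 346 rounds already suffice to bring the training error of f down to zero under these assumptions.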
It remains to define the probability distributions P_i. AdaBoost sets P_1 to the uniform distribution over {1, . . . , m}, so that P_1(t) = 1/m, and then recursively defines
\[ P_{i+1}(t) = \frac{P_i(t)\,e^{-w_i L_i(t)}}{\mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]} \qquad \text{for } t = 1, \dots, m \text{ and } i = 1, \dots, T \tag{6} \]
where
\[ \mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big] = \sum_{s=1}^m e^{-w_i L_i(s)}\, P_i(s) \]
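In code, one application of the update (6) is an element-wise reweighting followed by a normalization, where the normalizer is exactly E_i[e^{-w_i L_i(Z_i)}]. A minimal numpy sketch, with variable names that are ours:

import numpy as np

# One round of the AdaBoost reweighting (6).
# P : current distribution P_i over the m training examples (sums to 1)
# w : the coefficient w_i
# L : array of margins L_i(t) = h_i(x_t) y_t, with entries in {-1, +1}
def adaboost_update(P, w, L):
    unnormalized = P * np.exp(-w * L)          # P_i(t) * exp(-w_i L_i(t))
    return unnormalized / unnormalized.sum()   # divide by E_i[exp(-w_i L_i(Z_i))]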
It is easy to check that P1 , . . . , PT are indeed probability distributions on {1, . . . , m}. In particular,
Pi (t) > 0 and Pi (1) + · · · + Pi (m) = 1.
For this choice of P_i we can prove (4) as follows. First, we solve (6) for e^{−w_i L_i(t)}, obtaining
\[ e^{-w_i L_i(t)} = \mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]\,\frac{P_{i+1}(t)}{P_i(t)} \]
Then, we write
\[
\mathbb{E}\Big[\prod_{i=1}^T e^{-w_i L_i(Z)}\Big]
= \frac{1}{m}\sum_{t=1}^m \prod_{i=1}^T e^{-w_i L_i(t)}
= \frac{1}{m}\sum_{t=1}^m \prod_{i=1}^T \Big(\mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]\,\frac{P_{i+1}(t)}{P_i(t)}\Big)
\]
\[
= \frac{1}{m}\sum_{t=1}^m \frac{P_{T+1}(t)}{P_1(t)} \prod_{i=1}^T \mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]
= \Bigg(\sum_{t=1}^m P_1(t)\,\frac{P_{T+1}(t)}{P_1(t)}\Bigg)\prod_{i=1}^T \mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]
= \prod_{i=1}^T \mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]
\]
where we used the telescoping identity \(\prod_{i=1}^T P_{i+1}(t)/P_i(t) = P_{T+1}(t)/P_1(t)\), the fact that P_1(t) = 1/m, and P_{T+1}(1) + · · · + P_{T+1}(m) = 1.
These probability distributions have a simple interpretation when one studies how P_{i+1} depends on P_i. Fix P_i and suppose ε_i < 1/2. Then w_i > 0 and each P_{i+1}(t) is obtained by multiplying P_i(t) by the quantity e^{−w_i L_i(t)} (and then normalizing), where e^{−w_i L_i(t)} > 1 if and only if h_i(x_t) ≠ y_t. In other words, the weight of each training example (x_t, y_t) is increased, when P_i is updated to P_{i+1}, if and only if h_i errs on (x_t, y_t). Intuitively, the boosting process concentrates the weight on the training examples that are misclassified by the previous classifiers. A similar argument applies to the case ε_i > 1/2.
We are now ready to introduce the pseudo-code for AdaBoost. It is convenient to view the boosting process as a sequence of rounds between the boosting algorithm (the booster B) and the learning algorithm A.

[Figure: in each round, the booster B sends the distribution P_i to the learning algorithm A, which returns the base classifier h_i.]

AdaBoost
Input: training set S of m examples, number of rounds T, learning algorithm A.
Initialization: set P_1 to the uniform distribution over {1, . . . , m}.
For i = 1, . . . , T:
1. Give P_i to A and get h_i in response.
2. Compute ε_i for h_i.
3. If ε_i ∈ {0, 1/2, 1} then BREAK.
4. Let w_i ← (1/2) ln((1 − ε_i)/ε_i).
5. Compute P_{i+1} using (6).
If the for loop exited on BREAK, then deal with the special case (described below).
Else output f = sgn(w_1 h_1 + · · · + w_T h_T).
In each round i, the booster B gives P_i to A and gets h_i in response. If A returns an h_i such that ε_i = 1/2, then the boosting process stops and the booster outputs f = sgn(w_1 h_1 + · · · + w_{i−1} h_{i−1}). If A returns an h_i such that ε_i ∈ {0, 1}, then the boosting process also stops, but in this case the booster outputs f = h_i (when ε_i = 0) or f = −h_i (when ε_i = 1).
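To make the whole procedure concrete, here is a minimal, self-contained Python sketch of AdaBoost with decision stumps as base classifiers, following the pseudo-code above. All names are illustrative; the weighted stump search plays the role of the learning algorithm A, and for simplicity the special cases are handled by just stopping.

import numpy as np

def stump_predict(X, i, tau, s):
    p = np.sign(X[:, i] - tau)
    p[p == 0] = 1
    return s * p

def weighted_best_stump(X, y, P):
    """Return the stump (i, tau, s) minimizing the weighted error sum_t P(t) I{h(x_t) != y_t}."""
    m, d = X.shape
    best_err, best = np.inf, None
    for i in range(d):
        for tau in np.unique(X[:, i]):
            for s in (-1, +1):
                err = np.sum(P * (stump_predict(X, i, tau, s) != y))
                if err < best_err:
                    best_err, best = err, (i, tau, s)
    return best, best_err

def adaboost(X, y, T):
    m = len(y)
    P = np.full(m, 1.0 / m)              # P_1 is uniform over {1, ..., m}
    stumps, weights = [], []
    for _ in range(T):
        (i, tau, s), eps = weighted_best_stump(X, y, P)   # get h_i from A
        if eps in (0.0, 0.5, 1.0):                        # special cases: stop boosting
            break                                         # (the text outputs h_i or -h_i when eps_i is 0 or 1)
        w = 0.5 * np.log((1 - eps) / eps)                 # w_i = (1/2) ln((1 - eps_i)/eps_i)
        L = stump_predict(X, i, tau, s) * y               # L_i(t) = h_i(x_t) y_t
        P = P * np.exp(-w * L)                            # update (6) ...
        P = P / P.sum()                                   # ... normalized by E_i[exp(-w_i L_i(Z_i))]
        stumps.append((i, tau, s))
        weights.append(w)
    return stumps, weights

def adaboost_predict(X, stumps, weights):
    # f = sgn(w_1 h_1 + ... + w_T h_T)
    f = sum(w * stump_predict(X, i, tau, s) for (i, tau, s), w in zip(stumps, weights))
    return np.sign(f)

# Tiny usage example on synthetic data
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels given by a simple linear rule
    stumps, weights = adaboost(X, y, T=50)
    print("training error:", np.mean(adaboost_predict(X, stumps, weights) != y))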