ML 14 Boosting
Let X be the set of all possible examples or instances, i.e., the input space.
Let Y be the set of all possible labels or target values.
For binary classification, Y = {0, 1}.
For discrete (multiclass) classification, Y = {0, 1, ..., j}.
For regression, Y = ℝ.
Let K = {k1, k2, ..., km} be a sequence of weak classifiers, each of which outputs a classification k_j(x_i) for each x_i.
The k_j are determined iteratively, starting with k_1, which is a standard base classifier (e.g., logistic regression, decision tree); k_0 is a constant.
We build successive approximations to F ( x ).
$$\hat{F}_0(x) = \arg\min_{\alpha} \sum_{i=1}^{n} L(y_i, \alpha)$$
$$\hat{F}_m(x) = \hat{F}_{m-1}(x) + \arg\min_{h} \sum_{i=1}^{n} L\!\left(y_i, \hat{F}_{m-1}(x_i) + h(x_i)\right)$$
In practice the minimization over $h$ is approximated by a steepest-descent step:
$$\hat{F}_m(x) = \hat{F}_{m-1}(x) - \alpha_m \sum_{i=1}^{n} \nabla_{\hat{F}_{m-1}} L\!\left(y_i, \hat{F}_{m-1}(x_i)\right)$$
$$\alpha_m = \arg\min_{\alpha} \sum_{i=1}^{n} L\!\left(y_i, \hat{F}_{m-1}(x_i) - \alpha \nabla_{\hat{F}_{m-1}} L\!\left(y_i, \hat{F}_{m-1}(x_i)\right)\right),$$
where the derivatives are taken with respect to the functions $\hat{F}_i(x)$ for $i \in \{1, \ldots, m\}$.
C) Output FˆM ( x ).
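To make this functional gradient descent concrete, here is a minimal Python sketch (not from the lecture) of gradient boosting with squared-error loss, where the negative gradient is simply the residual $y_i - \hat{F}_{m-1}(x_i)$. It uses a common simplification: the base learner $h$ (scikit-learn's DecisionTreeRegressor here) is fit to the residuals, and a fixed learning rate stands in for the line-search step $\alpha_m$. All names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
    # F_0(x): for squared error, the constant minimizer is the mean of y.
    f0 = np.mean(y)
    F = np.full(len(y), f0)
    trees = []
    for m in range(n_rounds):
        residuals = y - F                      # negative gradient of (1/2)(y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                 # h_m approximates the negative gradient
        F += learning_rate * tree.predict(X)   # F_m = F_{m-1} + learning_rate * h_m
        trees.append(tree)
    return f0, trees

def predict(X, f0, trees, learning_rate=0.1):
    # Sum the constant F_0 and all scaled base-learner contributions.
    F = np.full(X.shape[0], f0)
    for tree in trees:
        F += learning_rate * tree.predict(X)
    return F
```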
AdaBoost
[Diagram: each boosting round fits a weak classifier k_m on the current sample weights and then updates the weights to w_{m+1}.]
Thus, $C_1(x_i) = \alpha_1 k_1(x_i)$ and $C_m(x_i) = C_{m-1}(x_i) + \alpha_m k_m(x_i)$.
From Wikipedia, Adaboost
Loss function, L
Since $C_m(x_i) = \sum_{j=1}^{m} \alpha_j k_j(x_i)$, we need to determine the "best" choices for the weak classifiers $k_m$ and their weights $\alpha_m$.
Since $w_i^{(1)} = \frac{1}{Z_0}$, $Z_0 = N$. At each iteration the weights are updated and renormalized, $w_i^{(m+1)} = w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)} / Z_{m+1}$ (see the normalization constant $Z_m$ below).
Note that $w_i^{(m)} > 0$ for all $i$ and $m$.
$$L = \sum_{i=1}^{N} e^{-y_i C_m(x_i)} = \sum_{i=1}^{N} e^{-y_i C_{m-1}(x_i)}\, e^{-y_i \alpha_m k_m(x_i)} = \sum_{i=1}^{N} w_i^{(m)}\, e^{-y_i \alpha_m k_m(x_i)}.$$
From Wikipedia, Adaboost
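The weighted form of the loss is easy to evaluate numerically. The sketch below (array names and the toy data are illustrative) computes $L = \sum_i w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)}$ for labels and weak-classifier outputs in $\{-1, +1\}$.

```python
import numpy as np

def exponential_loss(w, y, k_pred, alpha):
    """L = sum_i w_i * exp(-y_i * alpha * k_m(x_i)) with y_i, k_m(x_i) in {-1, +1}."""
    return np.sum(w * np.exp(-y * alpha * k_pred))

# Toy check: 4 samples with normalized weights, one misclassified.
w = np.full(4, 0.25)
y = np.array([1, -1, 1, -1])
k = np.array([1, -1, -1, -1])            # third sample is wrong
print(exponential_loss(w, y, k, alpha=0.5))
```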
Splitting L into two components
If $x_i$ is correctly classified by $k_m$, then either $y_i = 1$ and $k_m(x_i) = 1$, or $y_i = -1$ and $k_m(x_i) = -1$. In either case, $y_i k_m(x_i) = 1$.
If $x_i$ is incorrectly classified by $k_m$, then either $y_i = 1$ and $k_m(x_i) = -1$, or $y_i = -1$ and $k_m(x_i) = 1$. In either case, $y_i k_m(x_i) = -1$.
We can split the sum for $L$ between those points that are correctly classified by $k_m$ and those that are misclassified.
$$L = \sum_{i=1}^{N} w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)} = \sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m}$$
$$L = \sum_{i=1}^{N} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} \left( e^{\alpha_m} - e^{-\alpha_m} \right)$$
$$L = e^{-\alpha_m} \sum_{i=1}^{N} w_i^{(m)} + \left( e^{\alpha_m} - e^{-\alpha_m} \right) \sum_{y_i \neq k_m(x_i)} w_i^{(m)}.$$
From Wikipedia, Adaboost
Finding the minimum of L
$$L = e^{-\alpha_m} \sum_{i=1}^{N} w_i^{(m)} + \left( e^{\alpha_m} - e^{-\alpha_m} \right) \sum_{y_i \neq k_m(x_i)} w_i^{(m)}.$$
Differentiating
$$L = \sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m}$$
with respect to $\alpha_m$, we obtain
$$\frac{\partial L}{\partial \alpha_m} = -\sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m}$$
$$\frac{\partial L}{\partial \alpha_m} = -e^{-\alpha_m} \sum_{y_i = k_m(x_i)} w_i^{(m)} + e^{\alpha_m} \sum_{y_i \neq k_m(x_i)} w_i^{(m)}.$$
Setting $\frac{\partial L}{\partial \alpha_m} = 0$ gives
$$e^{-\alpha_m} \sum_{y_i = k_m(x_i)} w_i^{(m)} = e^{\alpha_m} \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$
$$\ln\!\left( e^{-\alpha_m} \sum_{y_i = k_m(x_i)} w_i^{(m)} \right) = \ln\!\left( e^{\alpha_m} \sum_{y_i \neq k_m(x_i)} w_i^{(m)} \right)$$
$$\ln e^{-\alpha_m} + \ln \sum_{y_i = k_m(x_i)} w_i^{(m)} = \ln e^{\alpha_m} + \ln \sum_{y_i \neq k_m(x_i)} w_i^{(m)}.$$
From Wikipedia, Adaboost
Finding αm that minimizes L …
$$\ln e^{-\alpha_m} + \ln \sum_{y_i = k_m(x_i)} w_i^{(m)} = \ln e^{\alpha_m} + \ln \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$
$$-\alpha_m + \ln \sum_{y_i = k_m(x_i)} w_i^{(m)} = \alpha_m + \ln \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$
$$2\alpha_m = \ln \sum_{y_i = k_m(x_i)} w_i^{(m)} - \ln \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$
$$\alpha_m = \frac{1}{2} \ln \frac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}.$$
From Wikipedia, Adaboost
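The same minimization can be checked symbolically. In the sketch below (a verification aid, not part of the lecture), $W_c$ and $W_e$ stand for the correctly and incorrectly classified weight sums, and SymPy recovers $\alpha_m = \tfrac{1}{2} \ln(W_c / W_e)$.

```python
import sympy as sp

alpha = sp.symbols('alpha', real=True)
W_c, W_e = sp.symbols('W_c W_e', positive=True)   # weight sums over correct / incorrect points
L = W_c * sp.exp(-alpha) + W_e * sp.exp(alpha)    # the split form of the loss
print(sp.solve(sp.diff(L, alpha), alpha))         # [log(W_c/W_e)/2], i.e. alpha_m = (1/2) ln(W_c/W_e)
```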
∊m = weighted error of km
$$\varepsilon_m = \frac{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}{\sum_{i=1}^{N} w_i^{(m)}} = \sum_{y_i \neq k_m(x_i)} w_i^{(m)},$$
since $\sum_{i=1}^{N} w_i^{(m)} = 1$, as the $w_i^{(m)}$ are normalized for each $m$.
$$1 - \varepsilon_m = \sum_{y_i = k_m(x_i)} w_i^{(m)} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} - \sum_{y_i \neq k_m(x_i)} w_i^{(m)} = \sum_{y_i = k_m(x_i)} w_i^{(m)}$$
$$\frac{1 - \varepsilon_m}{\varepsilon_m} = \frac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}},$$
and thus
$$\alpha_m = \frac{1}{2} \ln \frac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}} = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m},$$
which is the negative of the logit function of $\varepsilon_m$, multiplied by $\frac{1}{2}$.
From Mohri
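In code, the conversion between $\alpha_m$ and $\varepsilon_m$ is a pair of one-liners; the sketch below (function names are illustrative) also checks the stated connection to the logit via scipy.special.logit.

```python
import numpy as np
from scipy.special import logit

def alpha_from_epsilon(eps):
    """alpha_m = 1/2 * ln((1 - eps_m) / eps_m) = -logit(eps_m) / 2."""
    return 0.5 * np.log((1.0 - eps) / eps)

def epsilon_from_alpha(alpha):
    """Inverse relation: eps_m = 1 / (1 + e^{2 alpha_m})."""
    return 1.0 / (1.0 + np.exp(2.0 * alpha))

eps = 0.2
a = alpha_from_epsilon(eps)
print(a, -0.5 * logit(eps))          # identical values
print(epsilon_from_alpha(a))         # recovers 0.2
```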
∊m as a function of αm
$$\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m},$$
$$e^{2\alpha_m} = \frac{1 - \varepsilon_m}{\varepsilon_m},$$
$$\varepsilon_m e^{2\alpha_m} = 1 - \varepsilon_m,$$
$$\varepsilon_m \left( e^{2\alpha_m} + 1 \right) = 1,$$
$$\varepsilon_m = \frac{1}{e^{2\alpha_m} + 1} = \frac{e^{-\alpha_m}}{e^{\alpha_m} + e^{-\alpha_m}} = \frac{e^{-2\alpha_m}}{1 + e^{-2\alpha_m}}.$$
From Mohri
Normalization Constant, Zm
$$Z_{m+1} = \sum_{i=1}^{N} w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)}$$
$$= \sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)}$$
$$= \sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m}$$
$$= e^{-\alpha_m} \sum_{y_i = k_m(x_i)} w_i^{(m)} + e^{\alpha_m} \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$
$$= e^{-\alpha_m} (1 - \varepsilon_m) + e^{\alpha_m} \varepsilon_m$$
$$= (1 - \varepsilon_m) e^{-\alpha_m} + \varepsilon_m e^{\alpha_m}.$$
From Mohri
Normalization Constant, Zm
$$Z_{m+1} = (1 - \varepsilon_m) e^{-\alpha_m} + \varepsilon_m e^{\alpha_m}.$$
Since $\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m}$, $\; e^{-\alpha_m} = \sqrt{\frac{\varepsilon_m}{1 - \varepsilon_m}} \;$ and $\; e^{\alpha_m} = \sqrt{\frac{1 - \varepsilon_m}{\varepsilon_m}}$, so
$$Z_{m+1} = (1 - \varepsilon_m) \sqrt{\frac{\varepsilon_m}{1 - \varepsilon_m}} + \varepsilon_m \sqrt{\frac{1 - \varepsilon_m}{\varepsilon_m}}$$
$$= \sqrt{\varepsilon_m (1 - \varepsilon_m)} + \sqrt{\varepsilon_m (1 - \varepsilon_m)}$$
$$= 2 \sqrt{\varepsilon_m (1 - \varepsilon_m)}.$$
From Mohri
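A quick numerical check of this identity (assuming $\varepsilon_m \in (0,1)$ and $\alpha_m$ chosen as above):

```python
import numpy as np

eps = np.linspace(0.05, 0.95, 19)
alpha = 0.5 * np.log((1 - eps) / eps)
z_direct = (1 - eps) * np.exp(-alpha) + eps * np.exp(alpha)
z_closed = 2 * np.sqrt(eps * (1 - eps))
print(np.allclose(z_direct, z_closed))   # True
```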
Putting it all together
A) Initialize the sample weights, $w_i^{(1)} = \frac{1}{N}$.
B) For each iteration, $m = 1$ to $M$ (or until a convergence criterion has been satisfied):
   1) Fit the weak classifier $k_m$ to the training data using the current weights $w_i^{(m)}$.
   2) Compute the weighted error $\varepsilon_m$ and set $\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m}$.
   3) Update the weights, $w_i^{(m+1)} = w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)} / Z_{m+1}$.
C) Output the boosted classifier $C_M(x) = \sum_{m=1}^{M} \alpha_m k_m(x)$ and classify by its sign.
In addition, if $\gamma \leq \frac{1}{2} - \varepsilon_m$ for all $m \in [1, M]$, then
$$R(h) \leq \exp\!\left(-2\gamma^{2} M\right).$$
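The steps above translate almost line for line into code. Below is a minimal sketch (assuming labels in $\{-1, +1\}$ and using a depth-1 scikit-learn decision tree as the weak learner $k_m$; function names are illustrative, not the course's reference implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """y must take values in {-1, +1}. Returns the weak learners k_m and their weights alpha_m."""
    N = len(y)
    w = np.full(N, 1.0 / N)                          # A) w_i^(1) = 1/N
    stumps, alphas = [], []
    for m in range(M):                               # B) iterate m = 1, ..., M
        stump = DecisionTreeClassifier(max_depth=1)  # 1) fit k_m to the weighted sample
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)   # 2) weighted error eps_m
        alpha = 0.5 * np.log((1 - eps) / eps)        #    alpha_m = 1/2 ln((1 - eps_m)/eps_m)
        w = w * np.exp(-alpha * y * pred)            # 3) w_i^(m+1) prop. to w_i^(m) e^{-y_i alpha_m k_m(x_i)}
        w /= w.sum()                                 #    normalize by Z_{m+1}
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """C) sign of the weighted vote C_M(x) = sum_m alpha_m k_m(x)."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)
```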
Parameter Default
base_estimator DecisionTreeClassifier
n_estimators 50
learning_rate 1
algorithm ‘SAMME.R’
random_state None
Method Description
staged_decision_function(X) Return staged predictions for X.
staged_predict_proba(X) Predict class probabilities for X.
staged_score(X, y[, sample_weight]) Return staged scores for X, y.
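The two tables above appear to list the constructor defaults and staged-evaluation methods of scikit-learn's AdaBoostClassifier; a minimal usage sketch with an illustrative synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# staged_score yields the test accuracy after each boosting iteration.
staged = list(clf.staged_score(X_test, y_test))
print(staged[0], staged[-1])
```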
For $K$-class classification, gradient boosting uses the multinomial deviance loss
$$L\!\left(\{y_k, F_k(x)\}_1^K\right) = -\sum_{k=1}^{K} y_k \ln p_k(x), \qquad p_k(x) = \frac{e^{F_k(x)}}{\sum_{l=1}^{K} e^{F_l(x)}}.$$
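A small sketch of these two formulas for a single sample (array names are illustrative), which also exposes the pseudo-residuals $y_k - p_k(x)$ used in the next step:

```python
import numpy as np

def softmax(F):
    """p_k(x) = exp(F_k(x)) / sum_l exp(F_l(x)), computed stably."""
    z = np.exp(F - np.max(F))
    return z / z.sum()

def multinomial_deviance(y_onehot, F):
    """L = -sum_k y_k ln p_k(x)."""
    return -np.sum(y_onehot * np.log(softmax(F)))

F = np.array([1.2, -0.3, 0.4])        # current scores F_k(x) for K = 3 classes
y = np.array([1.0, 0.0, 0.0])         # true class is k = 1 (one-hot)
print(multinomial_deviance(y, F))
print(y - softmax(F))                 # pseudo-residuals y_k - p_k(x)
```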
$$\tilde{y}_{ik} = -\left[\frac{\partial L\!\left(\{y_{il}, F_l(x_i)\}_1^K\right)}{\partial F_k(x_i)}\right]_{\{F_l(x) = F_{l,m-1}(x)\}} = y_{ik} - p_{k,m-1}(x_i)$$
$$\alpha_{jkm} = \frac{K-1}{K}\,\frac{\sum_{x_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{x_i \in R_{jkm}} \left|\tilde{y}_{ik}\right|\left(1 - \left|\tilde{y}_{ik}\right|\right)}.$$
2) For $k = 1, K$
   a) $\tilde{y}_{ik} = y_{ik} - p_k(x_i)$ for $i = 1, N$
   b) $\{R_{jkm}\}_{j=1}^{J} = J$-terminal node tree$\left(\{\tilde{y}_{ik}, x_i\}_1^N\right)$
From Wikipedia, Adaboost
Gradient boosting algorithm
B) For each iteration, m = 1 to M ...
(or until a convergence criterion has been satisfied)
2) For k = 1, K ...
   c) $\alpha_{jkm} = \dfrac{K-1}{K}\,\dfrac{\sum_{x_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{x_i \in R_{jkm}} \left|\tilde{y}_{ik}\right|\left(1 - \left|\tilde{y}_{ik}\right|\right)}, \quad j = 1, J$
   d) $F_{km}(x) = F_{k(m-1)}(x) + \sum_{j=1}^{J} \alpha_{jkm}\, \mathbf{1}\!\left(x \in R_{jkm}\right)$
C) Use the final $\{F_{kM}(x)\}_1^K$ to obtain the probability estimates $\{p_{k,M}(x)\}_1^K$ and classify with
$$\hat{k}(x) = \arg\min_{1 \leq k \leq K} \sum_{k'=1}^{K} c(k, k')\, p_{k'M}(x),$$
where $c(k, k')$ is the cost of predicting class $k$ when the true class is $k'$.
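The final decision rule weighs the class-probability estimates by a misclassification-cost matrix; with 0-1 costs it reduces to taking the arg max of $p_{k,M}(x)$. A small sketch (the cost matrix here is illustrative):

```python
import numpy as np

def classify_with_costs(p, c):
    """k_hat(x) = argmin_k sum_k' c(k, k') * p_k'(x)."""
    return np.argmin(c @ p)

p = np.array([0.6, 0.3, 0.1])          # p_{k,M}(x) from the boosted model
c = 1.0 - np.eye(3)                    # 0-1 costs: c(k, k') = 1 if k != k'
print(classify_with_costs(p, c))        # expected-cost minimizer = argmax of p = 0
```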
Parameter Default
loss ‘deviance’
learning_rate 0.1
n_estimators 100
max_depth 3
criterion ‘friedman_mse’
min_samples_split 2
min_samples_leaf 1
min_weight_fraction_leaf 0
subsample 1.0
max_features None
max_leaf_nodes None
min_impurity_decrease 0.
init None
verbose 0
warm_start False
random_state None
presort ‘auto’
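The table above looks like the constructor defaults of scikit-learn's GradientBoostingClassifier (some, such as presort, have been removed in newer releases); a minimal usage sketch with illustrative synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                                 random_state=0)
gbc.fit(X_train, y_train)
print(gbc.score(X_test, y_test))
print(gbc.predict_proba(X_test[:1]))   # the p_{k,M}(x) estimates for one sample
```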