
Boosting

The City College of New York


CSc 59929 – Introduction to Machine Learning
Spring 2020 – Erik K. Grimmelmann, Ph.D.
Definitions

• A weak learner is a classifier whose results are only slightly correlated with the true classification.
• Ensemble methods are general techniques in ML for combining several predictors to create a predictor that is more accurate than its components.
• Boosting is an ML ensemble method for reducing bias and variance in supervised learning.

Review of weak learning

A concept class C is said to be weakly PAC-learnable if there exists an algorithm A, a constant γ with 1/2 > γ > 0, and a polynomial function poly(·, ·, ·, ·) such that for any 1 > δ > 0, for all distributions D on X, and for any target concept c ∈ C, the following holds for any sample size m ≥ poly(1/δ, n, size(c)):

    Pr_{S ~ D^m} [ R(h_S) ≤ 1/2 − γ ] ≥ 1 − δ,

where O(n) is an upper bound on the cost of the computational representation of x ∈ X and size(c) is an upper bound on the cost of the computational representation of c ∈ C.
γ is referred to as the edge of A.

from Mohri et al.
Boosting methods

• Most boosting methods consist of iteratively adding the results of a weak learner to form a strong learner.
• When the results of the weak learners are added, they are typically weighted by a factor related to each weak learner's accuracy.
• Each time the results are added, the sample data are reweighted before the next weak learner is applied to them.

Boosted classifiers

Let X be the set of all possible examples or instances, i.e., the input space.
Let Y be the set of all possible labels or target values.
For binary classification, Y = {0, 1}.
For discrete (multiclass) classification, Y = {0, 1, ..., j}.
For regression, Y ⊆ ℝ.
Let K = {k_1, k_2, ..., k_m} be a sequence of weak classifiers, each of which outputs a classification k_j(x_i) for each x_i.
The k_j are determined iteratively, starting with k_1, which is a standard base classifier (e.g., logistic regression, decision tree).

Boosted classifiers

The goal is to find an approximation F̂(x) to F(x) that minimizes the expected value of some specified loss function L(y, F(x)):

    F̂ = argmin_F E_{x,y} [ L(y, F(x)) ].

We assume that F̂(x) = Σ_{j=1}^M α_j h_j(x) + k_0, where each h_j(x) is a weak learner and k_0 is a constant.

We build successive approximations to F(x):

    F̂_0(x) = argmin_α Σ_{i=1}^n L(y_i, α),

    F̂_m(x) = F̂_{m−1}(x) + argmin_h Σ_{i=1}^n L( y_i, F̂_{m−1}(x_i) + h(x_i) ).

From Wikipedia, Gradient Boosting


Gradient descent

Gradient descent gives the update

    F̂_m(x) = F̂_{m−1}(x) − α_m Σ_{i=1}^n ∇_{F̂_{m−1}} L( y_i, F̂_{m−1}(x_i) ),

    α_m = argmin_α Σ_{i=1}^n L( y_i, F̂_{m−1}(x_i) − α ∇_{F̂_{m−1}} L( y_i, F̂_{m−1}(x_i) ) ),

where the derivatives are taken with respect to the functions F̂_i for i ∈ {1, ..., m}.

If y is discrete rather than continuous, we choose the candidate function h closest to the gradient of L, for which the coefficient α may then be calculated with the aid of a line search.

From Wikipedia, Gradient Boosting


Gradient descent algorithm
A) Initialize the model with the constant value

    F̂_0(x) = argmin_α Σ_{i=1}^n L(y_i, α).

B) For each iteration, m = 1 to M (or until a convergence criterion has been satisfied):

    1) Compute the so-called pseudo-residuals:

        r_im = − [ ∂L( y_i, F̂(x_i) ) / ∂F̂(x_i) ]_{F̂(x) = F̂_{m−1}(x)}   for i = 1, ..., n.

    2) Fit a base learner h_m(x) to the pseudo-residuals, i.e., train it using {(x_i, r_im)}_{i=1}^n.
From Wikipedia, Gradient Boosting
Gradient descent algorithm …
B) For each iteration, m = 1 to M (or until a convergence criterion has been satisfied) ...

    3) Compute the coefficient α_m by solving the following one-dimensional optimization problem:

        α_m = argmin_α Σ_{i=1}^n L( y_i, F̂_{m−1}(x_i) + α h_m(x_i) ).

    4) Update the model:

        F̂_m(x) = F̂_{m−1}(x) + α_m h_m(x).

C) Output F̂_M(x).

From Wikipedia, Gradient Boosting
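As a concrete illustration of steps A) through C), here is a minimal from-scratch sketch of gradient boosting for regression with squared loss, where the pseudo-residuals reduce to y_i − F̂_{m−1}(x_i). The use of scikit-learn's DecisionTreeRegressor as the base learner, the fixed learning rate in place of the line search for α_m, and the function names are illustrative choices, not part of the lecture notes.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, M=100, learning_rate=0.1, max_depth=2):
    """Sketch of gradient boosting with squared loss L(y, F) = (y - F)^2 / 2."""
    # A) Initialize with the constant that minimizes the loss (the mean, for squared loss).
    F0 = y.mean()
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):                           # B) boosting iterations
        residuals = y - F                        # 1) pseudo-residuals -dL/dF for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                   # 2) fit the base learner h_m to the residuals
        F = F + learning_rate * tree.predict(X)  # 3)-4) fixed step size instead of a line search
        trees.append(tree)
    return F0, trees

def predict_gb(X, F0, trees, learning_rate=0.1):
    # C) Output F_M(x) = F_0 + sum over m of (step size) * h_m(x)
    return F0 + learning_rate * sum(tree.predict(X) for tree in trees)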


Adaptive Boosting

AdaBoost

AdaBoost (short for Adaptive Boosting)

• Formulated by Yoav Freund and Robert Schapire


• Awarded the Gödel Prize in 2003 for this work.

Qualitative illustration

• Images are from Mohri.
• Assume that the base classifier can only determine decision boundaries that are axis-aligned.

[Illustration from Mohri: starting from the initial data (m = 0), each round m = 1, 2, 3 shows the axis-aligned weak classifier k_m chosen for that round and the reweighted sample w_{m+1} that results; the final panels show the three classifier components (m = 1, 2, 3) and the combined final classifier.]
Boosted classifiers, Cm(xi)
Let X be the set of all possible examples or instances, i.e., the input space.
Let Y be the set of all possible labels or target values.
For now we consider only binary classification, with labels encoded as Y = {−1, +1} (as required by the exponential-loss derivation below).
Let K = {k_1, k_2, ..., k_m} be a sequence of weak classifiers, each of which outputs a classification k_j(x_i) ∈ {−1, +1} for each x_i.
The k_j are determined iteratively, starting with k_1, which is a standard base classifier (e.g., logistic regression, decision tree).
We designate the sequence of boosted classifiers as C = {C_1, C_2, ..., C_m}, where

    C_m(x_i) = Σ_{j=1}^m α_j k_j(x_i)   with α_1 = 1.

Thus, C_1(x_i) = k_1(x_i) and C_m(x_i) = C_{m−1}(x_i) + α_m k_m(x_i).
From Wikipedia, Adaboost
Loss function, L
Since C_m(x_i) = Σ_{j=1}^m α_j k_j(x_i), we need to determine the "best" choices for the α_j and k_j, which we can do one j at a time, starting with j = 1.

We choose "best" to mean the choices that minimize the loss function L, the total error of C_m, which we define to be the sum of the exponential loss on each data point:

    L = Σ_{i=1}^N e^{−y_i C_m(x_i)}.

From Wikipedia, Adaboost


Weights, wi(m)
We introduce weights w_i^(m) to be applied to the input x_i at iteration m.

Let

    w_i^(1) = 1/Z_0   and   w_i^(m) = e^{−y_i C_{m−1}(x_i)} / Z_{m−1}   for m > 1,

where Z_{m−1} is a normalization factor chosen so that Σ_{i=1}^N w_i^(m) = 1.

Since w_i^(1) = 1/Z_0, Z_0 = N.

Note that w_i^(m) > 0 for all i and m.

From Wikipedia, Adaboost


Loss Function, L
Since w_i^(m) = e^{−y_i C_{m−1}(x_i)} / Z_{m−1} for m > 1, and C_m(x_i) = C_{m−1}(x_i) + α_m k_m(x_i),

    L = Σ_{i=1}^N e^{−y_i C_m(x_i)}
      = Σ_{i=1}^N e^{−y_i [ C_{m−1}(x_i) + α_m k_m(x_i) ]}
      = Σ_{i=1}^N e^{−y_i C_{m−1}(x_i)} e^{−y_i α_m k_m(x_i)}
      = Z_{m−1} Σ_{i=1}^N w_i^(m) e^{−y_i α_m k_m(x_i)}.

Since Z_{m−1} is a positive constant that does not depend on α_m or k_m, dropping it does not change the minimization, so in what follows we work with

    L = Σ_{i=1}^N w_i^(m) e^{−y_i α_m k_m(x_i)}.
From Wikipedia, Adaboost
Splitting L into two components
If x_i is correctly classified by k_m, then either y_i = 1 and k_m(x_i) = 1, or y_i = −1 and k_m(x_i) = −1. In either case, y_i k_m(x_i) = 1.
If x_i is incorrectly classified by k_m, then either y_i = 1 and k_m(x_i) = −1, or y_i = −1 and k_m(x_i) = 1. In either case, y_i k_m(x_i) = −1.
We can split the sum for L between those points that are correctly classified by k_m and those that are misclassified:

    L = Σ_{i=1}^N w_i^(m) e^{−y_i α_m k_m(x_i)}
      = Σ_{y_i = k_m(x_i)} w_i^(m) e^{−α_m} + Σ_{y_i ≠ k_m(x_i)} w_i^(m) e^{α_m}
      = Σ_{i=1}^N w_i^(m) e^{−α_m} + Σ_{y_i ≠ k_m(x_i)} w_i^(m) ( e^{α_m} − e^{−α_m} )
      = e^{−α_m} Σ_{i=1}^N w_i^(m) + ( e^{α_m} − e^{−α_m} ) Σ_{y_i ≠ k_m(x_i)} w_i^(m).
From Wikipedia, Adaboost
Finding the minimum of L
    L = e^{−α_m} Σ_{i=1}^N w_i^(m) + ( e^{α_m} − e^{−α_m} ) Σ_{y_i ≠ k_m(x_i)} w_i^(m).

The only term in the expression for L that depends on k_m is Σ_{y_i ≠ k_m(x_i)} w_i^(m).

Given that α_m > 0, it follows that e^{α_m} > e^{−α_m} and thus ( e^{α_m} − e^{−α_m} ) > 0.

Therefore, the k_m that minimizes L is the one that minimizes Σ_{y_i ≠ k_m(x_i)} w_i^(m).

Differentiating L = Σ_{y_i = k_m(x_i)} w_i^(m) e^{−α_m} + Σ_{y_i ≠ k_m(x_i)} w_i^(m) e^{α_m} with respect to α_m, we obtain

    ∂L/∂α_m = − Σ_{y_i = k_m(x_i)} w_i^(m) e^{−α_m} + Σ_{y_i ≠ k_m(x_i)} w_i^(m) e^{α_m}.

From Wikipedia, Adaboost


Finding αm that minimizes L
    ∂L/∂α_m = − Σ_{y_i = k_m(x_i)} w_i^(m) e^{−α_m} + Σ_{y_i ≠ k_m(x_i)} w_i^(m) e^{α_m}
             = − e^{−α_m} Σ_{y_i = k_m(x_i)} w_i^(m) + e^{α_m} Σ_{y_i ≠ k_m(x_i)} w_i^(m).

Setting this to zero and solving for α_m, we obtain

    e^{−α_m} Σ_{y_i = k_m(x_i)} w_i^(m) = e^{α_m} Σ_{y_i ≠ k_m(x_i)} w_i^(m),

    ln( e^{−α_m} Σ_{y_i = k_m(x_i)} w_i^(m) ) = ln( e^{α_m} Σ_{y_i ≠ k_m(x_i)} w_i^(m) ),

    ln( e^{−α_m} ) + ln( Σ_{y_i = k_m(x_i)} w_i^(m) ) = ln( e^{α_m} ) + ln( Σ_{y_i ≠ k_m(x_i)} w_i^(m) ).
From Wikipedia, Adaboost
Finding αm that minimizes L …

   
ln e −α m  + ln  ∑ wi( m )  = ln eα m  + ln  ∑ wi( m ) 
 yi km ( xi )   yi ≠ km ( xi ) 
   
−α m + ln  ∑ wi  = α m + ln  ∑ wi 
(m) (m)

 yi km ( xi )   yi ≠ km ( xi ) 
   
2α m ln  ∑ wi  − ln  ∑ wi  (m) (m)

=  yi km ( xi )   yi ≠ km ( xi ) 
 (m) 
 ∑ wi 
1  yi = km ( xi ) 
α m = ln
2  (m) 
 ∑ wi 
 yi ≠ km ( xi ) 
From Wikipedia, Adaboost
∊m = weighted error of km

The weighted error of k_m is ε_m, where

    ε_m = Σ_{y_i ≠ k_m(x_i)} w_i^(m) / Σ_{i=1}^N w_i^(m) = Σ_{y_i ≠ k_m(x_i)} w_i^(m),

since Σ_{i=1}^N w_i^(m) = 1, as the w_i^(m) are normalized for each m. Similarly,

    1 − ε_m = ( Σ_{y_i = k_m(x_i)} w_i^(m) + Σ_{y_i ≠ k_m(x_i)} w_i^(m) ) − Σ_{y_i ≠ k_m(x_i)} w_i^(m) = Σ_{y_i = k_m(x_i)} w_i^(m).

From Wikipedia, Adaboost


αm as a function of ∊m

    (1 − ε_m) / ε_m = Σ_{y_i = k_m(x_i)} w_i^(m) / Σ_{y_i ≠ k_m(x_i)} w_i^(m),

and thus

    α_m = (1/2) ln( Σ_{y_i = k_m(x_i)} w_i^(m) / Σ_{y_i ≠ k_m(x_i)} w_i^(m) ) = (1/2) ln( (1 − ε_m) / ε_m ),

which is the negative logit function of ε_m multiplied by 1/2.

From Mohri
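As a quick numerical check of this relationship (an illustrative calculation, not from the slides): a weak learner with weighted error ε_m = 0.3 receives weight α_m = (1/2) ln(0.7/0.3) ≈ 0.42, ε_m = 0.5 gives α_m = 0 (no better than chance, so no vote), and ε_m > 0.5 would give a negative weight.

import numpy as np
alpha = lambda eps: 0.5 * np.log((1 - eps) / eps)     # alpha_m as a function of eps_m
print([round(alpha(e), 3) for e in (0.1, 0.3, 0.5)])  # [1.099, 0.424, 0.0]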
∊m as a function of αm
    α_m = (1/2) ln( (1 − ε_m) / ε_m ),
    e^{2α_m} = (1 − ε_m) / ε_m,
    ε_m e^{2α_m} = 1 − ε_m,
    ε_m ( e^{2α_m} + 1 ) = 1,
    ε_m = 1 / ( e^{2α_m} + 1 ) = e^{−α_m} / ( e^{α_m} + e^{−α_m} ) = e^{−2α_m} / ( 1 + e^{−2α_m} ).

Not surprisingly, this is the logistic function of −2α_m.

From Mohri
Normalization Constant, Zm
    Z_m = Σ_{i=1}^N w_i^(m) e^{−y_i α_m k_m(x_i)}
        = Σ_{y_i = k_m(x_i)} w_i^(m) e^{−y_i α_m k_m(x_i)} + Σ_{y_i ≠ k_m(x_i)} w_i^(m) e^{−y_i α_m k_m(x_i)}
        = Σ_{y_i = k_m(x_i)} w_i^(m) e^{−α_m} + Σ_{y_i ≠ k_m(x_i)} w_i^(m) e^{α_m}
        = e^{−α_m} Σ_{y_i = k_m(x_i)} w_i^(m) + e^{α_m} Σ_{y_i ≠ k_m(x_i)} w_i^(m)
        = e^{−α_m} (1 − ε_m) + e^{α_m} ε_m
        = (1 − ε_m) e^{−α_m} + ε_m e^{α_m}.

Here Z_m is the factor that normalizes the updated weights w_i^(m+1) ∝ w_i^(m) e^{−y_i α_m k_m(x_i)}.

From Mohri
Normalization Constant, Zm
    Z_m = (1 − ε_m) e^{−α_m} + ε_m e^{α_m}.

Since α_m = (1/2) ln( (1 − ε_m) / ε_m ), we have e^{−α_m} = √( ε_m / (1 − ε_m) ) and e^{α_m} = √( (1 − ε_m) / ε_m ), so

    Z_m = (1 − ε_m) √( ε_m / (1 − ε_m) ) + ε_m √( (1 − ε_m) / ε_m )
        = √( ε_m (1 − ε_m) ) + √( ε_m (1 − ε_m) )
        = 2 √( ε_m (1 − ε_m) ).

From Mohri
Putting it all together
A) Initialize the sample weights: w_i^(1) = 1/N.

B) For each iteration, m = 1 to M (or until a convergence criterion has been satisfied):

    1) Choose the classifier k_m that minimizes the total weighted error Σ_{y_i ≠ k_m(x_i)} w_i^(m).

    2) Apply k_m to the sample data using the current sample weights w_i^(m).

    3) Use the results to calculate the error rate

        ε_m = Σ_{y_i ≠ k_m(x_i)} w_i^(m) / Σ_{i=1}^N w_i^(m).
From Wikipedia, Adaboost


Putting It All Together …
B) For each iteration, m = 1 to M (or until a convergence criterion has been satisfied) ...

    4) Use ε_m to calculate the weight for iteration m: α_m = (1/2) ln( (1 − ε_m) / ε_m ).

    5) Update the classifier: C_m(x_i) = C_{m−1}(x_i) + α_m k_m(x_i).

    6) Calculate the normalization factor Z_m = 2 √( ε_m (1 − ε_m) ).

    7) Update the sample weights for use in the next iteration:

        w_i^(m+1) = w_i^(m) e^{−y_i α_m k_m(x_i)} / Z_m.

C) Output C_M(x_i).
From Wikipedia, Adaboost
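The steps A) through C) translate directly into code. Below is a minimal from-scratch sketch that uses depth-1 decision trees (stumps) fitted with sample weights as the weak classifiers k_m; the labels are assumed to be in {−1, +1}, and the helper names and the use of scikit-learn's DecisionTreeClassifier are implementation choices, not part of the lecture notes.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, M=50):
    """Sketch of AdaBoost for labels y in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                    # A) initialize sample weights w_i^(1) = 1/N
    stumps, alphas = [], []
    for m in range(M):                         # B) boosting iterations
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # 1)-2) weak classifier k_m on the weighted sample
        pred = stump.predict(X)
        eps = w[pred != y].sum()               # 3) weighted error (the w_i^(m) already sum to 1)
        if eps >= 0.5:                         # no edge over random guessing: stop
            break
        eps = max(eps, 1e-12)                  # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)  # 4) classifier weight alpha_m
        w = w * np.exp(-alpha * y * pred)      # 7) reweight the samples ...
        w = w / w.sum()                        #    ... dividing by Z_m to renormalize
        stumps.append(stump)                   # 5) C_m = C_{m-1} + alpha_m k_m
        alphas.append(alpha)
    return stumps, alphas

def predict_adaboost(X, stumps, alphas):
    # C) Output: the sign of the weighted vote C_M(x)
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)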
Illustration again

[The illustration from Mohri is repeated: the initial data (m = 0), the weak classifiers k_m and reweighted samples w_{m+1} for m = 1, 2, 3, the three final classifier components, and the combined final classifier.]
Bound on the empirical error
  M 1  
2

The empirical error of the classifier h returned by AdaBoost satisfies

    R̂(h) ≤ exp( −2 Σ_{m=1}^M ( 1/2 − ε_m )² ).

In addition, if for all m ∈ [1, M], γ ≤ ( 1/2 − ε_m ), then

    R̂(h) ≤ exp( −2γ² M ).

Thus, the empirical error of AdaBoost decreases exponentially as a function of the number of rounds of boosting.

See Mohri for proof of the bound.

Scikit-learn AdaBoost Classifier Class
class sklearn.ensemble.AdaBoostClassifier(…)

Parameter Default
base_estimator DecisionTreeClassifier
n_estimators 50
learning_rate 1
algorithm ‘SAMME.R’
random_state None
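A minimal usage sketch of this class on a synthetic dataset (the dataset and parameter values below are illustrative, not recommendations):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Decision stumps boosted for 50 rounds, mirroring the defaults in the table above.
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, learning_rate=1.0, random_state=1)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))        # mean accuracy on the held-out data
print(ada.estimator_weights_[:5])       # the alpha_m for the first five weak learners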

Scikit-learn AdaBoost Classifier Attributes
Attribute Description
estimators_ The collection of fitted sub-estimators.
classes_ The class labels.
n_classes_ The number of classes.
estimator_weights_ Weights for each estimator in the boosted ensemble.
estimator_errors_ Classification error for each estimator in the boosted ensemble.
feature_importances_ The feature importances, if supported by the base_estimator.

Scikit-learn AdaBoost Classifier Methods
Method Description
decision_function(X) Compute the decision function of X
fit(X, y[, sample_weight]) Build a boosted classifier from the training set (X,y)
get_params([deep]) Get parameters for this estimator.
predict(X) Predict classes for X.
predict_log_proba(X) Predict class log-probabilities for X
predict_proba(X) Predict class probabilities for X
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.

Scikit-learn AdaBoost Classifier Methods

Method Description
staged_decision_function(X) Return staged predictions for X.
staged_predict_proba(X) Predict class probabilities for X.
staged_score(X, y[, sample_weight]) Return staged scores for X, y.

Gradient Tree Boosting

Gradient tree boosting

• Formulated by Friedman in 1999

Gradient tree boosting
For a K-class classification problem, encode each label with indicator variables y_k ∈ {0, 1}, k = 1, ..., K, with y_k = 1 if the class is k.
Choose the multinomial negative log-likelihood (deviance) loss function

    L( {y_k, F_k(x)}_1^K ) = − Σ_{k=1}^K y_k ln p_k(x),

where p_k(x) = Pr( y_k = 1 | x ).

Consistent with the earlier notes on gradient boosting, we take

    F_k(x) = ln p_k(x) − (1/K) Σ_{l=1}^K ln p_l(x),   or equivalently,

    p_k(x) = e^{F_k(x)} / Σ_{l=1}^K e^{F_l(x)}.

Gradient tree boosting
Substituting p_k(x) = e^{F_k(x)} / Σ_{l=1}^K e^{F_l(x)} into the expression for L( {y_k, F_k(x)}_1^K ) and differentiating with respect to F_k(x), we obtain the pseudo-residuals

    ỹ_ik = − [ ∂L( {y_il, F_l(x_i)}_1^K ) / ∂F_k(x_i) ]_{ {F_l(x) = F_{l,m−1}(x)}_1^K } = y_ik − p_{k,m−1}(x_i),

where p_{k,m−1}(x_i) is derived from F_{k,m−1}(x_i) through the expression for p_k(x) above. Thus, K trees are induced at each iteration m to predict the corresponding current residuals for each class on the probability scale. Each of these trees has J terminal nodes, with corresponding regions {R_jkm}_{j=1}^J.
Gradient tree boosting …
The model updates α_jkm corresponding to these regions are the solutions to

    {α_jkm} = argmin_{ {α_jk} } Σ_{i=1}^N Σ_{k=1}^K φ( y_ik, F_{k,m−1}(x_i) + Σ_{j=1}^J α_jk 1(x_i ∈ R_jm) ),

where φ(y_k, F_k) = − y_k ln p_k.

Some simple approximations result in the decomposition of the problem into a separate calculation for each terminal node of each tree. The result is

    α_jkm = ( (K − 1) / K ) · Σ_{x_i ∈ R_jkm} ỹ_ik / Σ_{x_i ∈ R_jkm} |ỹ_ik| ( 1 − |ỹ_ik| ).

Gradient boosting algorithm
A) Initialize F_k0(x) = 0 for k = 1, ..., K.

B) For each iteration, m = 1 to M (or until a convergence criterion has been satisfied):

    1) p_k(x) = e^{F_k(x)} / Σ_{l=1}^K e^{F_l(x)}   for k = 1, ..., K.

    2) For k = 1, ..., K:

        a) ỹ_ik = y_ik − p_k(x_i)   for i = 1, ..., N.

        b) {R_jkm}_{j=1}^J = the regions of a J-terminal-node tree fit to {(ỹ_ik, x_i)}_1^N.
From Wikipedia, Adaboost
Gradient boosting algorithm
B) For each iteration, m = 1 to M (or until a convergence criterion has been satisfied) ...

    2) For k = 1, ..., K ...

        c) α_jkm = ( (K − 1) / K ) · Σ_{x_i ∈ R_jkm} ỹ_ik / Σ_{x_i ∈ R_jkm} |ỹ_ik| ( 1 − |ỹ_ik| ),   for j = 1, ..., J.

        d) F_km(x) = F_{k,m−1}(x) + Σ_{j=1}^J α_jkm 1( x ∈ R_jkm ).

C) Use {F_{k,M}(x)}_1^K to obtain the corresponding probability estimates {p_{k,M}(x)}_1^K.

From Wikipedia, Adaboost


Gradient boosting algorithm
D) Use the estimates {p_{k,M}(x)}_1^K for the classification:

    k̂(x) = argmin_{1 ≤ k ≤ K} Σ_{k'=1}^K c(k, k') p_{k'M}(x),

where c(k, k') is the cost associated with predicting the k-th class when the truth is k'.

From Wikipedia, Adaboost
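A minimal from-scratch sketch of this K-class procedure, using scikit-learn regression trees as the J-terminal-node base learners; instead of modifying the fitted trees, the α_jkm values are stored per leaf and looked up through tree.apply(). The function names, the max_leaf_nodes choice, and the small constant guarding the denominator are illustrative, not part of the lecture notes.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def softmax(F):
    e = np.exp(F - F.max(axis=1, keepdims=True))   # p_k(x) = exp(F_k(x)) / sum_l exp(F_l(x))
    return e / e.sum(axis=1, keepdims=True)

def fit_multiclass_gtb(X, y, K, M=50, max_leaf_nodes=8):
    N = X.shape[0]
    Y = np.eye(K)[y]                     # one-hot indicators y_ik
    F = np.zeros((N, K))                 # A) initialize F_k0(x) = 0
    ensemble = []
    for m in range(M):                   # B) boosting iterations
        P = softmax(F)                   # 1) current class probabilities
        round_trees = []
        for k in range(K):               # 2) one tree per class
            resid = Y[:, k] - P[:, k]    # a) pseudo-residuals y_ik - p_k(x_i)
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
            tree.fit(X, resid)           # b) J-terminal-node regression tree -> regions R_jkm
            leaves = tree.apply(X)
            alpha = {}                   # c) alpha_jkm for each terminal region
            for j in np.unique(leaves):
                in_leaf = leaves == j
                num = resid[in_leaf].sum()
                den = (np.abs(resid[in_leaf]) * (1.0 - np.abs(resid[in_leaf]))).sum()
                alpha[j] = (K - 1.0) / K * num / (den + 1e-12)
            F[:, k] += np.array([alpha[j] for j in leaves])   # d) F_km = F_k,m-1 + sum_j alpha_jkm 1(x in R_jkm)
            round_trees.append((tree, alpha))
        ensemble.append(round_trees)
    return ensemble

def predict_gtb(X, ensemble, K):
    # C)-D) accumulate F_k,M(x), convert to probabilities, and pick the most probable class
    # (equivalent to the cost-based rule with a 0/1 misclassification cost c(k, k')).
    F = np.zeros((X.shape[0], K))
    for round_trees in ensemble:
        for k, (tree, alpha) in enumerate(round_trees):
            F[:, k] += np.array([alpha[j] for j in tree.apply(X)])
    return softmax(F).argmax(axis=1)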


Scikit-learn Gradient Boosting
Classifier Class
class sklearn.ensemble.GradientBoostingClassifier(…)

Parameter Default
loss ‘deviance’
learning_rate 0.1
n_estimators 100
max_depth 3
criterion ‘friedman_mse’
min_samples_split 2
min_samples_leaf 1
min_weight_fraction_leaf 0
subsample 1.0

Scikit-learn Gradient Boosting Classifier
Class
class sklearn.ensemble.GradientBoostingClassifier(…)

Parameter Default
max_features None
max_leaf_nodes None
min_impurity_decrease 0.
init None
verbose 0
warm_start False
random_state None
presort ‘auto’
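A minimal usage sketch with a few of these parameters set explicitly (the dataset and values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 100 depth-3 trees with shrinkage 0.1, mirroring the defaults in the tables above.
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, subsample=1.0, random_state=1)
gbc.fit(X_train, y_train)
print(gbc.score(X_test, y_test))        # mean accuracy on the held-out data
print(gbc.feature_importances_)         # impurity-based feature importances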

Scikit-learn Gradient Boosting Classifier
Attributes
Attribute Description
feature_importances_ The feature importances.
oob_improvement_ The improvement in loss (= deviance) on the out-of-bag samples relative to the previous iteration.
train_score_ The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i on the in-bag sample.
loss_ The concrete loss function.
init The estimator that provides the initial predictions.
estimators_ The collection of fitted sub-estimators.

Scikit-learn Gradient Boosting Classifier
Methods
Method Description
apply(X) Apply trees in the ensemble to X, return leaf indices
decision_function(X) Compute the decision function of X
fit(X, y[, sample_weight]) Fit the gradient boosting model.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict classes for X.
predict_log_proba(X) Predict class log-probabilities for X
predict_proba(X) Predict class probabilities for X
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.

Scikit-learn Gradient Boosting Classifier
Methods
Method Description
staged_decision_function(X) Return staged predictions for X.
staged_predict_proba(X) Predict class probabilities for X.
staged_score(X, y[, sample_weight]) Return staged scores for X, y.

