ML 14 Boosting
Let X be the set of all possible examples or instances, i.e., the input space.
Let Y be the set of all possible labels or target values.
For binary classification, Y = {0, 1}.
For discrete (multiclass) classification, Y = {0, 1, ..., j}.
For regression, Y = ℝ.
Let K = {k1, k2, ..., km} be a sequence of weak classifiers, each of which outputs a classification k_j(x_i) for each x_i.
The k_j are determined iteratively, starting with k_1, which is a standard base classifier (e.g., logistic regression, decision tree); k_0 is a constant.
We build successive approximations to F ( x ).
$$\hat{F}_0(x) = \arg\min_{\alpha} \sum_{i=1}^{n} L(y_i, \alpha)$$
$$\hat{F}_m(x) = \hat{F}_{m-1}(x) + \arg\min_{h} \sum_{i=1}^{n} L\!\left(y_i, \hat{F}_{m-1}(x_i) + h(x_i)\right)$$
In practice the minimization over $h$ is approximated by a steepest-descent step:
$$\hat{F}_m(x) = \hat{F}_{m-1}(x) - \alpha_m \sum_{i=1}^{n} \nabla_{\hat{F}_{m-1}} L\!\left(y_i, \hat{F}_{m-1}(x_i)\right)$$
$$\alpha_m = \arg\min_{\alpha} \sum_{i=1}^{n} L\!\left(y_i, \hat{F}_{m-1}(x_i) - \alpha \nabla_{\hat{F}_{m-1}} L\!\left(y_i, \hat{F}_{m-1}(x_i)\right)\right),$$
where the derivatives are taken with respect to the functions $\hat{F}_i(x)$ for $i \in \{1, \ldots, m\}$.
C) Output FˆM ( x ).
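To make this functional gradient descent concrete, here is a minimal Python sketch (not from the lecture) of gradient boosting with squared-error loss, where the negative gradient is simply the residual $y_i - \hat{F}_{m-1}(x_i)$. It uses a common simplification: the base learner $h$ (scikit-learn's DecisionTreeRegressor here) is fit to the residuals, and a fixed learning rate stands in for the line-search step $\alpha_m$. All names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
    # F_0(x): for squared error, the constant minimizer is the mean of y.
    f0 = np.mean(y)
    F = np.full(len(y), f0)
    trees = []
    for m in range(n_rounds):
        residuals = y - F                      # negative gradient of (1/2)(y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                 # h_m approximates the negative gradient
        F += learning_rate * tree.predict(X)   # F_m = F_{m-1} + learning_rate * h_m
        trees.append(tree)
    return f0, trees

def predict(X, f0, trees, learning_rate=0.1):
    # Sum the constant F_0 and all scaled base-learner contributions.
    F = np.full(X.shape[0], f0)
    for tree in trees:
        F += learning_rate * tree.predict(X)
    return F
```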
AdaBoost
[Diagram: each boosting round fits a weak classifier k_m on the current sample weights and then updates the weights to w_{m+1}.]
Thus, $C_1(x_i) = \alpha_1 k_1(x_i)$ and $C_m(x_i) = C_{m-1}(x_i) + \alpha_m k_m(x_i)$.
From Wikipedia, Adaboost
Loss function, L
Since $C_m(x_i) = \sum_{j=1}^{m} \alpha_j k_j(x_i)$, we need to determine the "best" choices for the weak classifiers $k_m$ and their weights $\alpha_m$.
Since $w_i^{(1)} = \frac{1}{Z_0}$, $Z_0 = N$. At each iteration the weights are updated and renormalized, $w_i^{(m+1)} = w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)} / Z_{m+1}$ (see the normalization constant $Z_m$ below).
Note that $w_i^{(m)} > 0$ for all $i$ and $m$.
$$L = \sum_{i=1}^{N} e^{-y_i C_m(x_i)} = \sum_{i=1}^{N} e^{-y_i C_{m-1}(x_i)}\, e^{-y_i \alpha_m k_m(x_i)} = \sum_{i=1}^{N} w_i^{(m)}\, e^{-y_i \alpha_m k_m(x_i)}.$$
From Wikipedia, Adaboost
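The weighted form of the loss is easy to evaluate numerically. The sketch below (array names and the toy data are illustrative) computes $L = \sum_i w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)}$ for labels and weak-classifier outputs in $\{-1, +1\}$.

```python
import numpy as np

def exponential_loss(w, y, k_pred, alpha):
    """L = sum_i w_i * exp(-y_i * alpha * k_m(x_i)) with y_i, k_m(x_i) in {-1, +1}."""
    return np.sum(w * np.exp(-y * alpha * k_pred))

# Toy check: 4 samples with normalized weights, one misclassified.
w = np.full(4, 0.25)
y = np.array([1, -1, 1, -1])
k = np.array([1, -1, -1, -1])            # third sample is wrong
print(exponential_loss(w, y, k, alpha=0.5))
```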
Splitting L into two components
If $x_i$ is correctly classified by $k_m$, then either $y_i = 1$ and $k_m(x_i) = 1$, or $y_i = -1$ and $k_m(x_i) = -1$. In either case, $y_i k_m(x_i) = 1$.
If $x_i$ is incorrectly classified by $k_m$, then either $y_i = 1$ and $k_m(x_i) = -1$, or $y_i = -1$ and $k_m(x_i) = 1$. In either case, $y_i k_m(x_i) = -1$.
We can split the sum for $L$ between those points that are correctly classified by $k_m$ and those that are misclassified.
$$L = \sum_{i=1}^{N} w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)} = \sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m}$$
$$L = \sum_{i=1}^{N} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} \left( e^{\alpha_m} - e^{-\alpha_m} \right)$$
$$L = e^{-\alpha_m} \sum_{i=1}^{N} w_i^{(m)} + \left( e^{\alpha_m} - e^{-\alpha_m} \right) \sum_{y_i \neq k_m(x_i)} w_i^{(m)}.$$
From Wikipedia, Adaboost
Finding the minimum of L
$$L = e^{-\alpha_m} \sum_{i=1}^{N} w_i^{(m)} + \left( e^{\alpha_m} - e^{-\alpha_m} \right) \sum_{y_i \neq k_m(x_i)} w_i^{(m)}.$$
Differentiating
$$L = \sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m}$$
with respect to $\alpha_m$, we obtain
$$\frac{\partial L}{\partial \alpha_m} = -\sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m}$$
$$\frac{\partial L}{\partial \alpha_m} = -e^{-\alpha_m} \sum_{y_i = k_m(x_i)} w_i^{(m)} + e^{\alpha_m} \sum_{y_i \neq k_m(x_i)} w_i^{(m)}.$$
Setting $\frac{\partial L}{\partial \alpha_m} = 0$ gives
$$e^{-\alpha_m} \sum_{y_i = k_m(x_i)} w_i^{(m)} = e^{\alpha_m} \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$
$$\ln\!\left( e^{-\alpha_m} \sum_{y_i = k_m(x_i)} w_i^{(m)} \right) = \ln\!\left( e^{\alpha_m} \sum_{y_i \neq k_m(x_i)} w_i^{(m)} \right)$$
$$\ln e^{-\alpha_m} + \ln \sum_{y_i = k_m(x_i)} w_i^{(m)} = \ln e^{\alpha_m} + \ln \sum_{y_i \neq k_m(x_i)} w_i^{(m)}.$$
From Wikipedia, Adaboost
Finding αm that minimizes L …
$$\ln e^{-\alpha_m} + \ln \sum_{y_i = k_m(x_i)} w_i^{(m)} = \ln e^{\alpha_m} + \ln \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$
$$-\alpha_m + \ln \sum_{y_i = k_m(x_i)} w_i^{(m)} = \alpha_m + \ln \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$
$$2\alpha_m = \ln \sum_{y_i = k_m(x_i)} w_i^{(m)} - \ln \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$
$$\alpha_m = \frac{1}{2} \ln \frac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}.$$
From Wikipedia, Adaboost
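The same minimization can be checked symbolically. In the sketch below (a verification aid, not part of the lecture), $W_c$ and $W_e$ stand for the correctly and incorrectly classified weight sums, and SymPy recovers $\alpha_m = \tfrac{1}{2} \ln(W_c / W_e)$.

```python
import sympy as sp

alpha = sp.symbols('alpha', real=True)
W_c, W_e = sp.symbols('W_c W_e', positive=True)   # weight sums over correct / incorrect points
L = W_c * sp.exp(-alpha) + W_e * sp.exp(alpha)    # the split form of the loss
print(sp.solve(sp.diff(L, alpha), alpha))         # [log(W_c/W_e)/2], i.e. alpha_m = (1/2) ln(W_c/W_e)
```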
∊m = weighted error of km
$$\varepsilon_m = \frac{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}{\sum_{i=1}^{N} w_i^{(m)}} = \sum_{y_i \neq k_m(x_i)} w_i^{(m)},$$
since $\sum_{i=1}^{N} w_i^{(m)} = 1$, as the $w_i^{(m)}$ are normalized for each $m$.
$$1 - \varepsilon_m = \sum_{y_i = k_m(x_i)} w_i^{(m)} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} - \sum_{y_i \neq k_m(x_i)} w_i^{(m)} = \sum_{y_i = k_m(x_i)} w_i^{(m)}$$
$$\frac{1 - \varepsilon_m}{\varepsilon_m} = \frac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}},$$
and thus
$$\alpha_m = \frac{1}{2} \ln \frac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}} = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m},$$
which is the negative of the logit function of $\varepsilon_m$, multiplied by $\frac{1}{2}$.
From Mohri
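In code, the conversion between $\alpha_m$ and $\varepsilon_m$ is a pair of one-liners; the sketch below (function names are illustrative) also checks the stated connection to the logit via scipy.special.logit.

```python
import numpy as np
from scipy.special import logit

def alpha_from_epsilon(eps):
    """alpha_m = 1/2 * ln((1 - eps_m) / eps_m) = -logit(eps_m) / 2."""
    return 0.5 * np.log((1.0 - eps) / eps)

def epsilon_from_alpha(alpha):
    """Inverse relation: eps_m = 1 / (1 + e^{2 alpha_m})."""
    return 1.0 / (1.0 + np.exp(2.0 * alpha))

eps = 0.2
a = alpha_from_epsilon(eps)
print(a, -0.5 * logit(eps))          # identical values
print(epsilon_from_alpha(a))         # recovers 0.2
```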
∊m as a function of αm
$$\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m},$$
$$e^{2\alpha_m} = \frac{1 - \varepsilon_m}{\varepsilon_m},$$
$$\varepsilon_m e^{2\alpha_m} = 1 - \varepsilon_m,$$
$$\varepsilon_m \left( e^{2\alpha_m} + 1 \right) = 1,$$
$$\varepsilon_m = \frac{1}{e^{2\alpha_m} + 1} = \frac{e^{-\alpha_m}}{e^{\alpha_m} + e^{-\alpha_m}} = \frac{e^{-2\alpha_m}}{1 + e^{-2\alpha_m}}.$$
From Mohri
Normalization Constant, Zm
$$Z_{m+1} = \sum_{i=1}^{N} w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)}$$
$$= \sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)}$$
$$= \sum_{y_i = k_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)} e^{\alpha_m}$$
$$= e^{-\alpha_m} \sum_{y_i = k_m(x_i)} w_i^{(m)} + e^{\alpha_m} \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$$
$$= e^{-\alpha_m} (1 - \varepsilon_m) + e^{\alpha_m} \varepsilon_m$$
$$= (1 - \varepsilon_m) e^{-\alpha_m} + \varepsilon_m e^{\alpha_m}.$$
From Mohri
Normalization Constant, Zm
$$Z_{m+1} = (1 - \varepsilon_m) e^{-\alpha_m} + \varepsilon_m e^{\alpha_m}.$$
Since $\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m}$, $\; e^{-\alpha_m} = \sqrt{\frac{\varepsilon_m}{1 - \varepsilon_m}} \;$ and $\; e^{\alpha_m} = \sqrt{\frac{1 - \varepsilon_m}{\varepsilon_m}}$, so
$$Z_{m+1} = (1 - \varepsilon_m) \sqrt{\frac{\varepsilon_m}{1 - \varepsilon_m}} + \varepsilon_m \sqrt{\frac{1 - \varepsilon_m}{\varepsilon_m}}$$
$$= \sqrt{\varepsilon_m (1 - \varepsilon_m)} + \sqrt{\varepsilon_m (1 - \varepsilon_m)}$$
$$= 2 \sqrt{\varepsilon_m (1 - \varepsilon_m)}.$$
From Mohri
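A quick numerical check of this identity (assuming $\varepsilon_m \in (0,1)$ and $\alpha_m$ chosen as above):

```python
import numpy as np

eps = np.linspace(0.05, 0.95, 19)
alpha = 0.5 * np.log((1 - eps) / eps)
z_direct = (1 - eps) * np.exp(-alpha) + eps * np.exp(alpha)
z_closed = 2 * np.sqrt(eps * (1 - eps))
print(np.allclose(z_direct, z_closed))   # True
```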
Putting it all together
A) Initialize the sample weights, $w_i^{(1)} = \frac{1}{N}$.
B) For each iteration, $m = 1$ to $M$ (or until a convergence criterion has been satisfied):
   1) Fit the weak classifier $k_m$ to the training data using the current weights $w_i^{(m)}$.
   2) Compute the weighted error $\varepsilon_m$ and set $\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m}$.
   3) Update the weights, $w_i^{(m+1)} = w_i^{(m)} e^{-y_i \alpha_m k_m(x_i)} / Z_{m+1}$.
C) Output the boosted classifier $C_M(x) = \sum_{m=1}^{M} \alpha_m k_m(x)$ and classify by its sign.
In addition, if $\gamma \leq \frac{1}{2} - \varepsilon_m$ for all $m \in [1, M]$, then
$$R(h) \leq \exp\!\left(-2\gamma^{2} M\right).$$
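The steps above translate almost line for line into code. Below is a minimal sketch (assuming labels in $\{-1, +1\}$ and using a depth-1 scikit-learn decision tree as the weak learner $k_m$; function names are illustrative, not the course's reference implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """y must take values in {-1, +1}. Returns the weak learners k_m and their weights alpha_m."""
    N = len(y)
    w = np.full(N, 1.0 / N)                          # A) w_i^(1) = 1/N
    stumps, alphas = [], []
    for m in range(M):                               # B) iterate m = 1, ..., M
        stump = DecisionTreeClassifier(max_depth=1)  # 1) fit k_m to the weighted sample
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)   # 2) weighted error eps_m
        alpha = 0.5 * np.log((1 - eps) / eps)        #    alpha_m = 1/2 ln((1 - eps_m)/eps_m)
        w = w * np.exp(-alpha * y * pred)            # 3) w_i^(m+1) prop. to w_i^(m) e^{-y_i alpha_m k_m(x_i)}
        w /= w.sum()                                 #    normalize by Z_{m+1}
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """C) sign of the weighted vote C_M(x) = sum_m alpha_m k_m(x)."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)
```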
Parameter Default
base_estimator DecisionTreeClassifier
n_estimators 50
learning_rate 1
algorithm ‘SAMME.R’
random_state None
Method Description
staged_decision_function(X) Return staged predictions for X.
staged_predict_proba(X) Predict class probabilities for X.
staged_score(X, y[, sample_weight]) Return staged scores for X, y.
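The two tables above appear to list the constructor defaults and staged-evaluation methods of scikit-learn's AdaBoostClassifier; a minimal usage sketch with an illustrative synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# staged_score yields the test accuracy after each boosting iteration.
staged = list(clf.staged_score(X_test, y_test))
print(staged[0], staged[-1])
```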
For $K$-class classification, gradient boosting uses the multinomial deviance loss
$$L\!\left(\{y_k, F_k(x)\}_1^K\right) = -\sum_{k=1}^{K} y_k \ln p_k(x), \qquad p_k(x) = \frac{e^{F_k(x)}}{\sum_{l=1}^{K} e^{F_l(x)}}.$$
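A small sketch of these two formulas for a single sample (array names are illustrative), which also exposes the pseudo-residuals $y_k - p_k(x)$ used in the next step:

```python
import numpy as np

def softmax(F):
    """p_k(x) = exp(F_k(x)) / sum_l exp(F_l(x)), computed stably."""
    z = np.exp(F - np.max(F))
    return z / z.sum()

def multinomial_deviance(y_onehot, F):
    """L = -sum_k y_k ln p_k(x)."""
    return -np.sum(y_onehot * np.log(softmax(F)))

F = np.array([1.2, -0.3, 0.4])        # current scores F_k(x) for K = 3 classes
y = np.array([1.0, 0.0, 0.0])         # true class is k = 1 (one-hot)
print(multinomial_deviance(y, F))
print(y - softmax(F))                 # pseudo-residuals y_k - p_k(x)
```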
$$\tilde{y}_{ik} = -\left[\frac{\partial L\!\left(\{y_{il}, F_l(x_i)\}_1^K\right)}{\partial F_k(x_i)}\right]_{\{F_l(x) = F_{l,m-1}(x)\}} = y_{ik} - p_{k,m-1}(x_i)$$
$$\alpha_{jkm} = \frac{K-1}{K}\,\frac{\sum_{x_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{x_i \in R_{jkm}} \left|\tilde{y}_{ik}\right|\left(1 - \left|\tilde{y}_{ik}\right|\right)}.$$
2) For $k = 1, K$
   a) $\tilde{y}_{ik} = y_{ik} - p_k(x_i)$ for $i = 1, N$
   b) $\{R_{jkm}\}_{j=1}^{J} = J$-terminal node tree$\left(\{\tilde{y}_{ik}, x_i\}_1^N\right)$
From Wikipedia, Adaboost
Gradient boosting algorithm
B) For each iteration, m = 1 to M ...
(or until a convergence criterion has been satisfied)
2) For k = 1, K ...
   c) $\alpha_{jkm} = \dfrac{K-1}{K}\,\dfrac{\sum_{x_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{x_i \in R_{jkm}} \left|\tilde{y}_{ik}\right|\left(1 - \left|\tilde{y}_{ik}\right|\right)}, \quad j = 1, J$
   d) $F_{km}(x) = F_{k(m-1)}(x) + \sum_{j=1}^{J} \alpha_{jkm}\, \mathbf{1}\!\left(x \in R_{jkm}\right)$
C) Use the final $\{F_{kM}(x)\}_1^K$ to obtain the probability estimates $\{p_{k,M}(x)\}_1^K$ and classify with
$$\hat{k}(x) = \arg\min_{1 \leq k \leq K} \sum_{k'=1}^{K} c(k, k')\, p_{k'M}(x),$$
where $c(k, k')$ is the cost of predicting class $k$ when the true class is $k'$.
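The final decision rule weighs the class-probability estimates by a misclassification-cost matrix; with 0-1 costs it reduces to taking the arg max of $p_{k,M}(x)$. A small sketch (the cost matrix here is illustrative):

```python
import numpy as np

def classify_with_costs(p, c):
    """k_hat(x) = argmin_k sum_k' c(k, k') * p_k'(x)."""
    return np.argmin(c @ p)

p = np.array([0.6, 0.3, 0.1])          # p_{k,M}(x) from the boosted model
c = 1.0 - np.eye(3)                    # 0-1 costs: c(k, k') = 1 if k != k'
print(classify_with_costs(p, c))        # expected-cost minimizer = argmax of p = 0
```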
Parameter Default
loss ‘deviance’
learning_rate 0.1
n_estimators 100
max_depth 3
criterion ‘friedman_mse’
min_samples_split 2
min_samples_leaf 1
min_weight_fraction_leaf 0
subsample 1.0
max_features None
max_leaf_nodes None
min_impurity_decrease 0.
init None
verbose 0
warm_start False
random_state None
presort ‘auto’
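The table above looks like the constructor defaults of scikit-learn's GradientBoostingClassifier (some, such as presort, have been removed in newer releases); a minimal usage sketch with illustrative synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                                 random_state=0)
gbc.fit(X_train, y_train)
print(gbc.score(X_test, y_test))
print(gbc.predict_proba(X_test[:1]))   # the p_{k,M}(x) estimates for one sample
```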