
Ensemble Learning

Hady W. Lauw
Photo by Felix Mittermeier from Pexels

IS712 Machine Learning


CLASSIFICATION AND REGRESSION TREES (CART)
Recursively partitioning the input space and defining a local model for each partition
Regression Tree
• Partition input space into regions
• Prediction is the mean response in each region
– Alternatively, fit a regression function locally

Classification Tree
• Partition input space into regions
• Prediction is the mode of class label distribution

Recursive Procedure to Grow a Tree

• Split function chooses the “best” feature j (among M features) and feature value t
(among the viable feature values of j) to split
$$(j^*, t^*) = \arg\min_{j \in \{1,\dots,M\}} \; \min_{t \in \mathcal{T}_j} \; \frac{|D_L|}{|D|}\,\mathrm{cost}\big(D_L = \{(\boldsymbol{x}_i, y_i) : x_{ij} \le t\}\big) + \frac{|D_R|}{|D|}\,\mathrm{cost}\big(D_R = \{(\boldsymbol{x}_i, y_i) : x_{ij} > t\}\big)$$
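A minimal sketch of this exhaustive search, assuming NumPy arrays and a pluggable cost function; the helper names (`best_split`, `sse_cost`) are illustrative, and the sum-of-squares cost used here anticipates the regression cost defined a few slides later.

```python
import numpy as np

def sse_cost(y):
    """Regression cost: sum of squared deviations from the node's mean response."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y, cost=sse_cost):
    """Search every feature j and threshold t for the lowest weighted cost of the two regions."""
    n, m = X.shape
    best_j, best_t, best_cost = None, None, np.inf
    for j in range(m):
        for t in np.unique(X[:, j]):          # viable feature values of feature j
            left = X[:, j] <= t
            if not left.any() or left.all():  # skip degenerate splits
                continue
            w_cost = (left.sum() / n) * cost(y[left]) + ((~left).sum() / n) * cost(y[~left])
            if w_cost < best_cost:
                best_j, best_t, best_cost = j, t, w_cost
    return best_j, best_t, best_cost

# Toy usage: one feature whose value separates low responses from high ones
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.9])
print(best_split(X, y))    # -> (0, 3.0, ...), i.e. split feature 0 at t = 3.0
```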
Is it worth splitting?

• Is there gain to be made from further splits?


– The distribution in each region may already be sufficiently homogeneous
– The gain may be too small
$$\mathrm{Gain} = \mathrm{cost}(D) - \left[\frac{|D_L|}{|D|}\,\mathrm{cost}\big(D_L = \{(\boldsymbol{x}_i, y_i) : x_{ij} \le t\}\big) + \frac{|D_R|}{|D|}\,\mathrm{cost}\big(D_R = \{(\boldsymbol{x}_i, y_i) : x_{ij} > t\}\big)\right]$$

• Are there significant risks of overfitting?


– The tree may already be too deep
– The number of examples in a particular region may be too small

Regression Cost
• For a subset of data points D, quantify:
$$\mathrm{cost}(D) = \sum_{i \in D} \big(y_i - f(\boldsymbol{x}_i)\big)^2$$

• In the simplest case, the prediction could just be the mean response
$$f(\boldsymbol{x}_i) = \frac{1}{|D|} \sum_{i \in D} y_i$$

• Alternatively, we can fit a regression function at each leaf node

Classification Cost: Misclassification
• First, we estimate the class proportions within a subset of data points D:
$$\hat{\pi}_c = \frac{1}{|D|} \sum_{i \in D} \mathbf{1}(y_i = c)$$

• Predicted class is the mode of the class distribution


$$\hat{y} = \arg\max_{c \in \mathcal{C}} \hat{\pi}_c$$

• Misclassification rate
$$\mathrm{cost}(D) = \frac{1}{|D|} \sum_{i \in D} \mathbf{1}(y_i \ne \hat{y}) = 1 - \hat{\pi}_{\hat{y}}$$

Classification Cost: Entropy
• Define entropy of class distribution
$$H(\hat{\boldsymbol{\pi}}) = -\sum_{c \in \mathcal{C}} \hat{\pi}_c \log \hat{\pi}_c$$

• Minimizing entropy is maximizing information gain


$$\mathrm{infoGain}(X_j < t, Y) = H(Y) - H(Y \mid X_j < t)$$

Classification Cost: Gini Index
• Define the Gini index of the class distribution
$$\mathrm{Gini}(\hat{\boldsymbol{\pi}}) = \sum_{c \in \mathcal{C}} \hat{\pi}_c (1 - \hat{\pi}_c) = \sum_{c \in \mathcal{C}} \hat{\pi}_c - \sum_{c \in \mathcal{C}} \hat{\pi}_c^2 = 1 - \sum_{c \in \mathcal{C}} \hat{\pi}_c^2$$

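A compact sketch of the three classification costs above, assuming integer labels in a NumPy array; a base-2 logarithm is assumed for the entropy so that the numbers match the comparison table on the next slide (the slides do not state the base).

```python
import numpy as np

def class_proportions(y):
    """Empirical class distribution (pi_hat) within a node."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def misclassification(y):
    return 1.0 - class_proportions(y).max()

def entropy(y):
    p = class_proportions(y)
    return float(-np.sum(p * np.log2(p)))   # base 2 (bits); an assumed convention

def gini(y):
    p = class_proportions(y)
    return float(1.0 - np.sum(p ** 2))

node = np.array([0] * 300 + [1] * 100)      # a node with class counts (300, 100)
print(misclassification(node), entropy(node), gini(node))   # ~0.25, ~0.811, ~0.375
```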
Classification Cost: Comparison

• Assume 2-class classification, where each class has 400 instances


Candidate split (class counts per region)    Misclassification Rate   Entropy   Gini Index
Regions (300, 100) and (100, 300)            0.25                     0.81      0.375
Regions (200, 400) and (199, 1)              0.25                     0.70      0.336

(Each value is the weighted average over the two regions; entropy is in bits.)

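As a sanity check, the table can be reproduced by weighting each region's cost by its share of the 800 instances (base-2 entropy assumed, as above):

```python
import numpy as np

def costs(counts):
    """(misclassification, entropy, gini) for one region, given its class counts."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return np.array([1 - p.max(), -(p * np.log2(p)).sum(), 1 - (p ** 2).sum()])

def weighted_costs(regions):
    """Weight each region's costs by its fraction of all instances."""
    total = sum(sum(r) for r in regions)
    return sum((sum(r) / total) * costs(r) for r in regions)

print(weighted_costs([(300, 100), (100, 300)]))   # ~[0.25, 0.81, 0.375]
print(weighted_costs([(200, 400), (199, 1)]))     # ~[0.25, 0.70, 0.336]
```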
Example: Iris Dataset

Example: Iris Dataset - Unpruned

Example: Iris Dataset - Pruned

BAGGING
Reducing variance via Bootstrap AGGregating
Random Tree Classifier
• Sample 𝑘 out of 𝑀 features randomly
– Heuristic: $k = \sqrt{M}$
• Build a full decision tree based only on the 𝑘 features
• This is a high-variance model

Classification tree on 2 out of 4 features in Iris dataset


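A brief sketch of one such high-variance random tree, assuming the Iris data and $k = \sqrt{M} = 2$ randomly chosen features:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
feats = rng.choice(X.shape[1], size=2, replace=False)   # k = sqrt(M) = 2 of the 4 features

# A full (unpruned) tree grown only on these features: it fits its sample closely,
# but which tree you get varies a lot with the chosen features and data.
tree = DecisionTreeClassifier().fit(X[:, feats], y)
print(feats, tree.get_depth(), tree.score(X[:, feats], y))
```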
Random Forest Classifier
• To lower the variance, we can “bag” many random trees

• Sample $L$ datasets from $D$ with replacement: $\{D_1, D_2, \dots, D_L\}$

• For each sampled dataset $D_l$:
– Sample $k$ out of $M$ features randomly
– Train a full classification tree $h_l(\boldsymbol{x})$ with the $k$ features
• The final classifier is the average of the trees
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x})$$

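A from-scratch sketch of this procedure (the class name SimpleRandomForest is illustrative; it assumes integer class labels and uses scikit-learn's DecisionTreeClassifier for the individual trees):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    """Bag of full trees, each grown on a bootstrap sample and k randomly chosen features."""

    def __init__(self, n_trees=100, k=None, random_state=0):
        self.n_trees, self.k = n_trees, k
        self.rng = np.random.default_rng(random_state)
        self.trees, self.feature_sets = [], []

    def fit(self, X, y):
        n, m = X.shape
        k = self.k or max(1, int(np.sqrt(m)))            # heuristic: k = sqrt(M)
        for _ in range(self.n_trees):
            rows = self.rng.integers(0, n, size=n)       # bootstrap: sample with replacement
            cols = self.rng.choice(m, size=k, replace=False)
            self.trees.append(DecisionTreeClassifier().fit(X[rows][:, cols], y[rows]))
            self.feature_sets.append(cols)
        return self

    def predict(self, X):
        # Combine the trees by majority vote (the averaging step for classification).
        votes = np.stack([t.predict(X[:, c]) for t, c in zip(self.trees, self.feature_sets)]).astype(int)
        return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = load_iris(return_X_y=True)
print(SimpleRandomForest(n_trees=50).fit(X, y).predict(X[:5]))   # predictions for 5 instances
```

In practice, sklearn.ensemble.RandomForestClassifier implements a refinement of this idea: it re-samples the candidate features at every split rather than once per tree.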
Sampling: without vs. with replacement

https://www.spss-tutorials.com/spss-sampling-basics/

Bagging: Bootstrap Aggregating
• We can apply bagging to other models as well

• Sample $L$ datasets from $D$ with replacement: $\{D_1, D_2, \dots, D_L\}$

• For each sampled dataset $D_l$:
– Sample $k$ out of $M$ features randomly
– Train a model $h_l(\boldsymbol{x})$ with the $k$ features
• The final model is the average of the predictions
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x})$$

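Scikit-learn ships this recipe as a generic wrapper around any base model; a brief usage sketch (the parameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

bag = BaggingClassifier(
    DecisionTreeClassifier(),   # base model; any estimator could be bagged
    n_estimators=100,           # L bootstrap samples / models
    max_features=0.5,           # each model sees a random subset of the features
    bootstrap=True,             # sample training instances with replacement
    random_state=0,
).fit(X, y)

print(bag.score(X, y))
```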
Bias-Variance Illustration

http://scott.fortmann-roe.com/docs/BiasVariance.html
Expected Loss for Regression

• Recall: assume a random noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is the source of residual error:

$$y = \tilde{y}(\boldsymbol{x} \mid \boldsymbol{w}') + \epsilon$$
$$p(y \mid \boldsymbol{x}, \boldsymbol{w}', \sigma^2) = \mathcal{N}\big(y \mid \tilde{y}(\boldsymbol{x} \mid \boldsymbol{w}'), \sigma^2\big)$$
$$\mathrm{E}[y \mid \boldsymbol{x}] = \int y \, p(y \mid \boldsymbol{x}, \boldsymbol{w}', \sigma^2)\, \mathrm{d}y = \tilde{y}(\boldsymbol{x} \mid \boldsymbol{w}')$$
$$\mathrm{var}[y \mid \boldsymbol{x}] = \mathrm{E}\big[(y - \mathrm{E}[y \mid \boldsymbol{x}])^2\big] = \mathrm{E}[\epsilon^2] = \mathrm{var}[\epsilon] + \mathrm{E}[\epsilon]^2 = \sigma^2$$

• Suppose that instead of learning from the "complete" data, we take sub-samples; then we expect some variations in the squared loss we'd observe
– Let $y$ be the observed response variable
– Let $\tilde{y}$ be the optimal function (the $\boldsymbol{x}$ and $\boldsymbol{w}'$ omitted for simplicity of notation)
– Let $\hat{y}$ be the function we learn from a particular sample
– We would like to characterize the expected squared loss $\mathrm{E}[(y - \hat{y})^2]$ under the distribution of subsamples
Bias-Variance Decomposition for Regression
$$\mathrm{E}[(y - \hat{y})^2] = \mathrm{E}[(\tilde{y} + \epsilon - \hat{y})^2] = \mathrm{E}\big[\big((\tilde{y} - \mathrm{E}[\hat{y}]) + \epsilon + (\mathrm{E}[\hat{y}] - \hat{y})\big)^2\big]$$
$$= \mathrm{E}[(\tilde{y} - \mathrm{E}[\hat{y}])^2] + \mathrm{E}[\epsilon^2] + \mathrm{E}\big[(\mathrm{E}[\hat{y}] - \hat{y})^2\big] + 2\,\mathrm{E}[\epsilon]\,\mathrm{E}[\tilde{y} - \mathrm{E}[\hat{y}]] + 2\,\mathrm{E}[\epsilon]\,\mathrm{E}[\mathrm{E}[\hat{y}] - \hat{y}] + 2\,\mathrm{E}[\tilde{y} - \mathrm{E}[\hat{y}]]\,\mathrm{E}[\mathrm{E}[\hat{y}] - \hat{y}]$$
$$= \mathrm{E}[(\tilde{y} - \mathrm{E}[\hat{y}])^2] + \sigma^2 + \mathrm{E}\big[(\mathrm{E}[\hat{y}] - \hat{y})^2\big]$$

Note that $\mathrm{E}[\epsilon] = 0$ and $\mathrm{E}\big[\mathrm{E}[\hat{y}] - \hat{y}\big] = \mathrm{E}[\hat{y}] - \mathrm{E}[\hat{y}] = 0$, so all the cross terms vanish.

• Squared bias $\mathrm{E}[(\tilde{y} - \mathrm{E}[\hat{y}])^2] = \mathrm{E}\big[(\tilde{y} - \mathrm{E}[h_l(\boldsymbol{x})])^2\big]$
– Contribution to squared loss due to deviation of the learnt function from the optimal
• Variance $\mathrm{E}[(\mathrm{E}[\hat{y}] - \hat{y})^2] = \mathrm{E}\big[(\mathrm{E}[h_l(\boldsymbol{x})] - h_l(\boldsymbol{x}))^2\big]$
– Contribution to squared loss due to sensitivity to different training subsamples

• Irreducible error $\sigma^2 = \mathrm{E}[(\tilde{y} - y)^2]$
– Contribution to squared loss due to random noise in the data

Bagging Reduces Variance
• Weak law of large numbers when samples are i.i.d:
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x}) \;\longrightarrow\; \mathrm{E}[h_l(\boldsymbol{x})] \quad \text{as } L \to \infty$$

• Variance $\mathrm{E}[(\mathrm{E}[\hat{y}] - \hat{y})^2] = \mathrm{E}\big[(\mathrm{E}[h_l(\boldsymbol{x})] - h_l(\boldsymbol{x}))^2\big]$

• If we replace $h_l$ with $\hat{h}$, the variance reduces to 0, if indeed the samples are i.i.d.

• Bagging samples are unlikely to be i.i.d., so the variance may not disappear completely, but it would likely still be reduced effectively

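A small simulation of this effect, under assumed settings (a noisy sine-wave regression problem, depth-limited scikit-learn trees, and variance measured at a single test point across repeated training sets):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.array([[0.5]])

def make_data(n=200):
    X = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=n)
    return X, y

def single_tree_prediction():
    X, y = make_data()
    return DecisionTreeRegressor(max_depth=6).fit(X, y).predict(x_test)[0]

def bagged_prediction(L=25):
    X, y = make_data()
    preds = []
    for _ in range(L):
        idx = rng.integers(0, len(y), size=len(y))     # bootstrap sample of the same data
        preds.append(DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx]).predict(x_test)[0])
    return np.mean(preds)                              # the bagged average of L trees

# Variance of the prediction at x_test across 100 independently drawn training sets
print("single tree :", np.var([single_tree_prediction() for _ in range(100)]))
print("bagged trees:", np.var([bagged_prediction() for _ in range(100)]))
```

The bagged variance does not reach zero, since trees grown on bootstrap samples of the same dataset are correlated, matching the caveat above.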
Unbiased Estimate of Test Error
• Each bagging sample $D_l$ only involves a subset of the training data $D$

• A specific training instance $(\boldsymbol{x}_n, y_n)$ is part of some samples, but not others
• Let $\mathcal{D}_{-n} = \{D_l \mid (\boldsymbol{x}_n, y_n) \notin D_l\}$ be the samples that do not contain this instance
• Let $\hat{h}_{-n}(\boldsymbol{x}) = \frac{1}{|\mathcal{D}_{-n}|} \sum_{D_l \in \mathcal{D}_{-n}} h_l(\boldsymbol{x})$ be the average of the models trained on $\mathcal{D}_{-n}$

• The out-of-bag error is the average such error across the $N$ instances in $D$
$$\epsilon_{OOB}(D) = \frac{1}{N} \sum_{n \in \{1,\dots,N\}} \mathrm{loss}\big(\hat{h}_{-n}(\boldsymbol{x}_n), y_n\big)$$

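A from-scratch sketch of the out-of-bag estimate for bagged classification trees (the helper name oob_error is illustrative, and integer class labels are assumed); scikit-learn's RandomForestClassifier exposes the same idea via oob_score=True.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, L=200, seed=0):
    rng = np.random.default_rng(seed)
    n, n_classes = len(y), len(np.unique(y))
    votes = np.zeros((n, n_classes))                  # per-instance votes from OOB trees only
    for _ in range(L):
        idx = rng.integers(0, n, size=n)              # bootstrap sample D_l
        oob = np.setdiff1d(np.arange(n), idx)         # instances not drawn into D_l
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        votes[oob, tree.predict(X[oob]).astype(int)] += 1
    covered = votes.sum(axis=1) > 0                   # instances left out by at least one sample
    pred = votes[covered].argmax(axis=1)
    return np.mean(pred != y[covered])                # average 0/1 loss, as in the formula above

X, y = load_iris(return_X_y=True)
print(oob_error(X, y))
```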
BOOSTING
Reducing bias via iteratively improving weak learners
Boosting
• Consider a binary classification problem 𝑦 ∈ {−1, 1}

• A weak learner is a model for binary classification that has slightly better
performance than random guesses
– Example: a shallow classification tree

• Boosting seeks to create a strong learner from a weighted combination of multiple weak learners

$$H(\boldsymbol{x}) = \mathrm{sign}\Big( \sum_t \alpha_t h_t(\boldsymbol{x}) \Big)$$

AdaBoost

• Training data $D$ has $N$ instances $\{(\boldsymbol{x}_n, y_n)\}_{n \in \{1,\dots,N\}}$

• Associate each instance $(\boldsymbol{x}_n, y_n)$ with a weight $w_n$
• Assume we can train a model $h_t$ that minimizes a weighted loss function
$$L_t = \sum_{n \in \{1,\dots,N\}} w_n^{(t)} \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$$
– $\mathbf{1}(\cdot)$ is an indicator function that yields 1 if the condition within holds, 0 otherwise
– An example model is a decision stump with a single feature split
• Initially, the weights of all instances are uniform: $w_n^{(1)} = \frac{1}{N}$
• Subsequently, weights of misclassified instances are adjusted

Algorithm
• For iteration 𝑡 from 1 to 𝑇
– Fit a classifier $h_t$ to the training data by minimizing the weighted loss function
$$L_t = \sum_{n \in \{1,\dots,N\}} w_n^{(t)} \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$$
– Evaluate error of this iteration
$$\epsilon_t = \frac{\sum_{n \in \{1,\dots,N\}} w_n^{(t)} \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)}{\sum_{n \in \{1,\dots,N\}} w_n^{(t)}}$$
– Evaluate coefficient of this classifier
$$\alpha_t = \ln \frac{1 - \epsilon_t}{\epsilon_t}$$
– Update data weight coefficients
$$w_n^{(t+1)} = w_n^{(t)} \exp\big(\alpha_t \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)\big)$$
• Final prediction is given by:
$$H(\boldsymbol{x}) = \mathrm{sign}\Big( \sum_{t \in \{1,\dots,T\}} \alpha_t h_t(\boldsymbol{x}) \Big)$$
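A from-scratch sketch of these steps, assuming labels in {−1, +1} and scikit-learn decision stumps (max_depth=1) as the weak learners; the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train up to T decision stumps with AdaBoost; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # uniform initial weights w_n = 1/N
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        eps = np.sum(w * miss) / np.sum(w)        # weighted error of this iteration
        if eps >= 0.5:                            # weak learner no better than chance: stop
            break
        alpha = np.log((1 - eps) / eps) if eps > 0 else 10.0   # cap alpha if the stump is perfect
        stumps.append(stump)
        alphas.append(alpha)
        if eps == 0:
            break
        w = w * np.exp(alpha * miss)              # up-weight the misclassified instances
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))

# Toy usage: a nonlinear 2-D problem with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] ** 2 > 0.5, 1, -1)
stumps, alphas = adaboost_fit(X, y)
print((adaboost_predict(stumps, alphas, X) == y).mean())   # training accuracy
```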
Minimizes Sequential Exponential Error

• Sequential ensemble classifier:


$$H_{t'}(\boldsymbol{x}_n) = \frac{1}{2} \sum_{t \in \{1,\dots,t'\}} \alpha_t h_t(\boldsymbol{x}_n)$$

• Sequential exponential error:


$$E = \sum_{n \in \{1,\dots,N\}} \exp\big(-y_n H_{t'}(\boldsymbol{x}_n)\big)$$
$$= \sum_{n \in \{1,\dots,N\}} \exp\Big(-y_n H_{t'-1}(\boldsymbol{x}_n) - \frac{1}{2}\, y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big)$$
$$= \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \exp\Big(-\frac{1}{2}\, y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big)$$
• Given $H_{t'-1}$, the weight $w_n^{(t')} = \exp\big(-y_n H_{t'-1}(\boldsymbol{x}_n)\big)$ is a constant, and we seek to find the minimizing $\alpha_{t'}$ and $h_{t'}$
Sequential Exponential Error (cont’d)

• Correctly classified instances: $C_{t'} = \{(\boldsymbol{x}_n, y_n) \mid y_n \cdot h_{t'}(\boldsymbol{x}_n) \ge 0\}$

• Wrongly classified instances: $\bar{C}_{t'} = \{(\boldsymbol{x}_n, y_n) \mid y_n \cdot h_{t'}(\boldsymbol{x}_n) < 0\}$
• Sequential exponential error:
$$E = \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \exp\Big(-\frac{1}{2}\, y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big)$$
$$= \exp\Big(-\frac{\alpha_{t'}}{2}\Big) \sum_{n \in C_{t'}} w_n^{(t')} + \exp\Big(\frac{\alpha_{t'}}{2}\Big) \sum_{n \in \bar{C}_{t'}} w_n^{(t')}$$
$$= \Big(\exp\Big(\frac{\alpha_{t'}}{2}\Big) - \exp\Big(-\frac{\alpha_{t'}}{2}\Big)\Big) \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \mathbf{1}(h_{t'}(\boldsymbol{x}_n) \ne y_n) + \exp\Big(-\frac{\alpha_{t'}}{2}\Big) \sum_{n \in \{1,\dots,N\}} w_n^{(t')}$$

• Minimizing the above with respect to $\alpha_{t'}$ gives us

$$\alpha_{t'} = \ln \frac{1 - \epsilon_{t'}}{\epsilon_{t'}}, \qquad \epsilon_{t'} = \frac{\sum_{n \in \{1,\dots,N\}} w_n^{(t')} \mathbf{1}(h_{t'}(\boldsymbol{x}_n) \ne y_n)}{\sum_{n \in \{1,\dots,N\}} w_n^{(t')}}$$
Illustration

Illustration (cont’d)

Interpretations of Boosting

• A form of L1 regularization
– Each weak learner is a decision stump that relies on a single feature
– Boosting “selects” among these weak learners (features) that work well

• Margin maximization
– By iteratively adjusting the weights of misclassified instances, boosting seeks the classifier that maximizes the margin

• Functional gradient descent


– The functions are the “parameters”
– GradientBoost is a generic algorithm for boosting that accommodates various loss functions

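For context, a brief scikit-learn sketch of gradient boosting with a configurable loss; the parameter values are arbitrary, and the availability of the "exponential" option (which recovers AdaBoost-like behaviour on binary problems) may depend on the library version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

gb = GradientBoostingClassifier(
    loss="exponential",    # exponential loss ~ AdaBoost; the logistic/deviance loss is the default
    n_estimators=200,      # number of weak learners
    max_depth=1,           # decision stumps as the weak learners
    learning_rate=0.1,
    random_state=0,
).fit(X, y)

print(gb.score(X, y))
```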
Boosting Loss Functions

Conclusion
• Classification and Regression Trees (CART)
– A class of models that partitions the input space into regions and models each region locally

• Ensemble learning
– Aggregating the predictions of multiple models

• Bagging
– Trains multiple models on sub-samples of the dataset
– Reduces variance of a high-variance learning algorithm without affecting bias

• Boosting
– Combining multiple weak learners into a strong learner
– Reduces bias by iteratively over-weighting misclassified instances
References

• [PRML] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
– Chapter 14 (Combining Models)

• [MLaPP] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
– Chapter 16 (Adaptive Basis Function Models)

You might also like