09_EnsembleLearning
Hady W. Lauw
Classification Tree
• Partition input space into regions
• Prediction is the mode of class label distribution
Recursive Procedure to Grow a Tree
• Split function chooses the “best” feature j (among M features) and feature value t
(among the viable feature values of j) to split
$$(j^*, t^*) = \arg\min_{j \in \{1,\dots,M\}} \; \min_{t \in \mathcal{T}_j} \; \frac{|D_L|}{|D|}\, cost\big(D_L = \{(\boldsymbol{x}_i, y_i): x_{ij} \le t\}\big) + \frac{|D_R|}{|D|}\, cost\big(D_R = \{(\boldsymbol{x}_i, y_i): x_{ij} > t\}\big)$$
where $\mathcal{T}_j$ denotes the set of viable values of feature $j$.
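As a concrete illustration, here is a minimal Python sketch of this greedy split search (an assumed implementation, not the lecture's reference code). It takes a generic `cost` function over the responses of a subset and tries every observed value of every feature as a candidate threshold.

```python
import numpy as np

def best_split(X, y, cost):
    """Greedy search for the best (feature, threshold) pair.

    X    : (N, M) array of features
    y    : (N,) array of responses/labels
    cost : function mapping an array of y-values to a scalar cost
    """
    N, M = X.shape
    best = (None, None, np.inf)            # (j*, t*, weighted cost)
    for j in range(M):                     # every feature
        for t in np.unique(X[:, j]):       # every viable threshold of feature j
            left = X[:, j] <= t
            right = ~left
            if left.all() or right.all():  # skip degenerate splits
                continue
            weighted = (left.sum() / N) * cost(y[left]) \
                     + (right.sum() / N) * cost(y[right])
            if weighted < best[2]:
                best = (j, t, weighted)
    return best
```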
Is it worth splitting?
Regression Cost
• For a subset of data points D, quantify:
$$cost(D) = \sum_{i \in D} \big(y_i - f(\boldsymbol{x}_i)\big)^2$$
• In the simplest case, the prediction could just be the mean response:
$$f(\boldsymbol{x}_i) = \frac{1}{|D|} \sum_{i' \in D} y_{i'}$$
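As a sketch (assuming numpy arrays), this cost is a one-liner and can be plugged directly into the `best_split` function above as its `cost` argument:

```python
import numpy as np

def regression_cost(y):
    """Sum of squared errors around the mean response of the subset."""
    return np.sum((y - y.mean()) ** 2)
```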
Classification Cost: Misclassification
• First, we estimate the class proportions within a subset of data points D:
$$\hat{\pi}_c = \frac{1}{|D|} \sum_{i \in D} \mathbf{1}(y_i = c)$$
• Misclassification rate
$$cost(D) = \frac{1}{|D|} \sum_{i \in D} \mathbf{1}(y_i \ne \hat{y}) = 1 - \hat{\pi}_{\hat{y}}$$
where $\hat{y} = \arg\max_c \hat{\pi}_c$ is the most probable class label in $D$.
Classification Cost: Entropy
• Define entropy of class distribution
$$H(\hat{\boldsymbol{\pi}}) = -\sum_{c \in C} \hat{\pi}_c \log \hat{\pi}_c$$
Classification Cost: Gini Index
• Define the Gini index of the class distribution
$$\begin{aligned}
Gini(\hat{\boldsymbol{\pi}}) &= \sum_{c \in C} \hat{\pi}_c (1 - \hat{\pi}_c) \\
&= \sum_{c \in C} \hat{\pi}_c - \sum_{c \in C} \hat{\pi}_c^2 \\
&= 1 - \sum_{c \in C} \hat{\pi}_c^2
\end{aligned}$$
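The three classification costs can all be computed from the empirical class proportions; below is a minimal numpy sketch (function names are illustrative, not from the lecture), and each can also serve as the `cost` argument of the split search above:

```python
import numpy as np

def class_proportions(y):
    """Empirical class distribution pi_hat within a subset of labels y."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def misclassification(y):
    return 1.0 - class_proportions(y).max()

def entropy(y):
    pi = class_proportions(y)
    return -np.sum(pi * np.log(pi))      # natural log; base 2 is also common

def gini(y):
    pi = class_proportions(y)
    return 1.0 - np.sum(pi ** 2)
```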
Classification Cost: Comparison
Example: Iris Dataset
Example: Iris Dataset - Unpruned
Example: Iris Dataset - Pruned
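Figures like these can be reproduced with scikit-learn; a hedged sketch (the exact features, pruning method, and hyper-parameters used for the slides' figures are assumptions here):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# Unpruned: grow the tree until every leaf is pure.
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)

# "Pruned": restrict the depth (cost-complexity pruning via ccp_alpha is an alternative).
pruned = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(export_text(unpruned, feature_names=iris.feature_names))
print(export_text(pruned, feature_names=iris.feature_names))
```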
BAGGING
Reducing variance via Bootstrap AGGregating
Random Tree Classifier
• Sample 𝑘 out of 𝑀 features randomly
– Heuristic: $k = \sqrt{M}$
• Build a full decision tree based only on the 𝑘 features
• This is a high-variance model
Sampling: without vs. with replacement
https://www.spss-tutorials.com/spss-sampling-basics/
Bagging: Bootstrap Aggregating
• Draw $L$ bootstrap samples (each of size $N$, sampled with replacement) from the training data, train one model per sample, and aggregate the predictions (majority vote for classification, average for regression)
• We can apply bagging to other models as well, as sketched below
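A compact way to realize bagging over random trees with scikit-learn; this is a sketch that assumes a recent scikit-learn version (≥ 1.2, where the parameter is `estimator` rather than `base_estimator`), and note that `max_features="sqrt"` resamples the feature subset at every split (the random-forest variant) rather than once per tree:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: each base tree is grown fully on a bootstrap sample,
# considering only sqrt(M) randomly chosen features per split.
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100,
    bootstrap=True,          # sample with replacement
    random_state=0,
)
print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```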
Bias-Variance Illustration
http://scott.fortmann-roe.com/docs/BiasVariance.html
Expected Loss for Regression
$$\mathrm{E}[y \mid \boldsymbol{x}] = \int y \, p(y \mid \boldsymbol{x}, \boldsymbol{w}', \sigma^2) \, \mathrm{d}y = \tilde{y}(\boldsymbol{x} \mid \boldsymbol{w}')$$
$$\mathrm{var}[y \mid \boldsymbol{x}] = \mathrm{E}\big[(y - \mathrm{E}[y \mid \boldsymbol{x}])^2\big] = \mathrm{E}[\epsilon^2] = \mathrm{var}(\epsilon) + \mathrm{E}[\epsilon]^2 = \sigma^2$$
• Suppose that instead of learning from the "complete" data we learn from sub-samples; we then expect some variation in the squared loss we would observe
– Let $y$ be the observed response variable
– Let $\tilde{y}$ be the optimal function (the $\boldsymbol{x}$ and $\boldsymbol{w}'$ are omitted for simplicity of notation)
– Let $\hat{y}$ be the function we learn from a particular sub-sample
– We would like to characterize the expected squared loss $\mathrm{E}[(y - \hat{y})^2]$ under the distribution of sub-samples
Bias-Variance Decomposition for Regression
$$\begin{aligned}
\mathrm{E}\big[(y - \hat{y})^2\big]
&= \mathrm{E}\big[(\tilde{y} + \epsilon - \hat{y})^2\big] \\
&= \mathrm{E}\Big[\big((\tilde{y} - \mathrm{E}[\hat{y}]) + \epsilon + (\mathrm{E}[\hat{y}] - \hat{y})\big)^2\Big] \\
&= \mathrm{E}\big[(\tilde{y} - \mathrm{E}[\hat{y}])^2\big] + \mathrm{E}[\epsilon^2] + \mathrm{E}\big[(\mathrm{E}[\hat{y}] - \hat{y})^2\big] \\
&\quad + 2\,\mathrm{E}[\epsilon]\,\mathrm{E}\big[\tilde{y} - \mathrm{E}[\hat{y}]\big] + 2\,\mathrm{E}[\epsilon]\,\mathrm{E}\big[\mathrm{E}[\hat{y}] - \hat{y}\big] + 2\,\mathrm{E}\big[\tilde{y} - \mathrm{E}[\hat{y}]\big]\,\mathrm{E}\big[\mathrm{E}[\hat{y}] - \hat{y}\big] \\
&= \mathrm{E}\big[(\tilde{y} - \mathrm{E}[\hat{y}])^2\big] + \sigma^2 + \mathrm{E}\big[(\mathrm{E}[\hat{y}] - \hat{y})^2\big]
\end{aligned}$$
Note that $\mathrm{E}[\epsilon] = 0$ and $\mathrm{E}\big[\mathrm{E}[\hat{y}] - \hat{y}\big] = \mathrm{E}[\hat{y}] - \mathrm{E}[\hat{y}] = 0$, so all three cross terms vanish.
• Bias$^2$: $\mathrm{E}\big[(\tilde{y} - \mathrm{E}[\hat{y}])^2\big]$
– Contribution to squared loss due to deviation of the learnt function from the optimal
• Noise: $\sigma^2$
• Variance: $\mathrm{E}\big[(\mathrm{E}[\hat{y}] - \hat{y})^2\big] = \mathrm{E}\big[(\mathrm{E}[h_l(\boldsymbol{x})] - h_l(\boldsymbol{x}))^2\big]$
Bagging Reduces Variance
• Weak law of large numbers when samples are i.i.d:
$$\bar{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x}) \;\to\; \mathrm{E}[h_l(\boldsymbol{x})] \quad \text{as } L \to \infty$$
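The variance reduction can be checked empirically; the simulation below is a sketch on synthetic data (the data-generating process, tree settings, and ensemble size are all illustrative assumptions). It compares how much a single bootstrapped tree's predictions vary across training sets with how much the bagged average varies.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_data(n=200):
    """Noisy sine curve: one training set drawn from the data distribution."""
    x = rng.uniform(0, 2 * np.pi, size=(n, 1))
    y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y

x_test = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)

def bagged_prediction(L):
    """Average the predictions of L deep trees, each fit on a bootstrap sample."""
    X, y = sample_data()
    preds = []
    for _ in range(L):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap: with replacement
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds.append(tree.predict(x_test))
    return np.mean(preds, axis=0)

# Variance of the learnt function across independent training sets,
# for a single tree (L=1) versus a bagged ensemble (L=50).
for L in (1, 50):
    runs = np.stack([bagged_prediction(L) for _ in range(20)])
    print(L, runs.var(axis=0).mean())
```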
Unbiased Estimate of Test Error
• Each bagging sample $D_l$ only involves a subset of the training data $D$
• A specific training instance $(\boldsymbol{x}_n, y_n)$ is part of some samples, but not others
• Let $D_{-n} = \{D_l \mid (\boldsymbol{x}_n, y_n) \notin D_l\}$ be the samples that do not contain this instance
• Let $\bar{h}_{-n}(\boldsymbol{x}) = \frac{1}{|D_{-n}|} \sum_{D_l \in D_{-n}} h_l(\boldsymbol{x})$ be the average of the models trained on $D_{-n}$
• The out-of-bag error is the average such error across the $N$ instances in $D$:
$$\epsilon_{OOB}(D) = \frac{1}{N} \sum_{n \in \{1,\dots,N\}} loss\big(\bar{h}_{-n}(\boldsymbol{x}_n), y_n\big)$$
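In scikit-learn, random forests can report this estimate directly; a brief sketch (dataset and hyper-parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True evaluates each instance only with the trees
# whose bootstrap sample did not contain that instance.
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0).fit(X, y)
print("OOB accuracy:", forest.oob_score_)   # = 1 - OOB error
```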
BOOSTING
Reducing bias via iteratively improving weak learners
Boosting
• Consider a binary classification problem 𝑦 ∈ {−1, 1}
• A weak learner is a binary classifier that performs only slightly better than random guessing
– Example: a shallow classification tree
AdaBoost
Algorithm
• Initialize the data weights $w_n^{(1)} = 1/N$ for $n = 1, \dots, N$
• For iteration $t$ from 1 to $T$
– Fit a classifier $h_t$ to the training data by minimizing the weighted loss function
$$L_t = \sum_{n \in \{1,\dots,N\}} w_n^{(t)} \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$$
– Evaluate the error of this iteration
$$\epsilon_t = \frac{\sum_{n \in \{1,\dots,N\}} w_n^{(t)} \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)}{\sum_{n \in \{1,\dots,N\}} w_n^{(t)}}$$
– Evaluate the coefficient of this classifier
$$\alpha_t = \ln \frac{1 - \epsilon_t}{\epsilon_t}$$
– Update the data weight coefficients
$$w_n^{(t+1)} = w_n^{(t)} \exp\big(\alpha_t \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)\big)$$
• Final prediction is given by:
$$H(\boldsymbol{x}) = \mathrm{sign}\left(\sum_{t \in \{1,\dots,T\}} \alpha_t \, h_t(\boldsymbol{x})\right)$$
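A minimal from-scratch sketch of these update rules (not a production implementation), assuming labels in $\{-1, +1\}$ and depth-1 trees ("decision stumps") as the weak learners:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps; y must take values in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                # initial data weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y)     # indicator of misclassification
        eps = np.sum(w * miss) / np.sum(w)
        if eps >= 0.5:                     # weak learner no better than chance: stop
            break
        eps = max(eps, 1e-10)              # guard against log of infinity when eps == 0
        alpha = np.log((1 - eps) / eps)
        w = w * np.exp(alpha * miss)       # up-weight the misclassified instances
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```

For labels in {0, 1}, map them to {-1, +1} first (e.g. `y = 2 * y - 1`) before calling `adaboost_fit`.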
Minimizes Sequential Exponential Error
• AdaBoost can be interpreted as sequentially (stage-wise) minimizing an exponential error function of the form $\sum_n \exp\big(-y_n f(\boldsymbol{x}_n)\big)$, where $f$ is the weighted combination of the weak learners chosen so far
Interpretations of Boosting
• A form of L1 regularization
– Each weak learner is a decision stump that relies on a single feature
– Boosting “selects” among these weak learners (features) that work well
• Margin maximization
– By iteratively adjusting weights of misclassified instances, boosting seeks the classifier
that maximizes the margin
Boosting Loss Functions
Conclusion
• Classification and Regression Trees (CART)
– A class of models that partition the input space into regions and model each region locally
• Ensemble learning
– Aggregating the predictions of multiple models
• Bagging
– Trains multiple models from sub-samples of the dataset
– Reduces variance of a high-variance learning algorithm without affecting bias
• Boosting
– Combining multiple weak learners into a strong learner
– Reduces bias by iteratively over-weighting misclassified instances