
Ensemble Learning

Hady W. Lauw
Photo by Felix Mittermeier from Pexels

IS712 Machine Learning


CLASSIFICATION AND REGRESSION TREES (CART)
Recursively partitioning the input space and defining a local model for each partition
Regression Tree
• Partition input space into regions
• Prediction is the mean response in each region
– Alternatively, fit a regression function locally

Classification Tree
• Partition input space into regions
• Prediction is the mode of class label distribution

Recursive Procedure to Grow a Tree

• Split function chooses the “best” feature j (among M features) and feature value t
(among the viable feature values of j) to split
$$(j^*, t^*) = \arg\min_{j \in \{1,\dots,M\}} \; \min_{t \in \mathcal{T}_j} \; \frac{|D_L|}{|D|}\,\mathrm{cost}\big(D_L = \{(\boldsymbol{x}_i, y_i) : x_{ij} \le t\}\big) + \frac{|D_R|}{|D|}\,\mathrm{cost}\big(D_R = \{(\boldsymbol{x}_i, y_i) : x_{ij} > t\}\big)$$
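A minimal sketch of this exhaustive search, assuming NumPy arrays and a pluggable cost function; the helper names (`best_split`, `sse_cost`) are illustrative, and the sum-of-squares cost used here anticipates the regression cost defined a few slides later.

```python
import numpy as np

def sse_cost(y):
    """Regression cost: sum of squared deviations from the node's mean response."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y, cost=sse_cost):
    """Search every feature j and threshold t for the lowest weighted cost of the two regions."""
    n, m = X.shape
    best_j, best_t, best_cost = None, None, np.inf
    for j in range(m):
        for t in np.unique(X[:, j]):          # viable feature values of feature j
            left = X[:, j] <= t
            if not left.any() or left.all():  # skip degenerate splits
                continue
            w_cost = (left.sum() / n) * cost(y[left]) + ((~left).sum() / n) * cost(y[~left])
            if w_cost < best_cost:
                best_j, best_t, best_cost = j, t, w_cost
    return best_j, best_t, best_cost

# Toy usage: one feature whose value separates low responses from high ones
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.9])
print(best_split(X, y))    # -> (0, 3.0, ...), i.e. split feature 0 at t = 3.0
```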
Is it worth splitting?

• Is there gain to be made from further splits?


– The distribution in each region may already be sufficiently homogeneous
– The gain may be too small
$$\mathrm{Gain} = \mathrm{cost}(D) - \left[\frac{|D_L|}{|D|}\,\mathrm{cost}\big(D_L = \{(\boldsymbol{x}_i, y_i) : x_{ij} \le t\}\big) + \frac{|D_R|}{|D|}\,\mathrm{cost}\big(D_R = \{(\boldsymbol{x}_i, y_i) : x_{ij} > t\}\big)\right]$$

• Are there significant risks of overfitting?


– The tree may already be too deep
– The number of examples in a particular region may be too small

Regression Cost
• For a subset of data points D, quantify:
$$\mathrm{cost}(D) = \sum_{i \in D} \big(y_i - f(\boldsymbol{x}_i)\big)^2$$

• In the simplest case, the prediction could just be the mean response
$$f(\boldsymbol{x}_i) = \frac{1}{|D|} \sum_{i \in D} y_i$$

• Alternatively, we can fit a regression function at each leaf node

Classification Cost: Misclassification
• First, we estimate the class proportions within a subset of data points D:
$$\hat{\pi}_c = \frac{1}{|D|} \sum_{i \in D} \mathbf{1}(y_i = c)$$

• Predicted class is the mode of the class distribution


$$\hat{y} = \arg\max_{c \in \mathcal{C}} \hat{\pi}_c$$

• Misclassification rate
$$\mathrm{cost}(D) = \frac{1}{|D|} \sum_{i \in D} \mathbf{1}(y_i \ne \hat{y}) = 1 - \hat{\pi}_{\hat{y}}$$

Classification Cost: Entropy
• Define entropy of class distribution
$$H(\hat{\boldsymbol{\pi}}) = -\sum_{c \in \mathcal{C}} \hat{\pi}_c \log \hat{\pi}_c$$

• Minimizing entropy is maximizing information gain


$$\mathrm{infoGain}(X_j < t, Y) = H(Y) - H(Y \mid X_j < t)$$

Classification Cost: Gini Index
• Define the Gini index of the class distribution
$$\mathrm{Gini}(\hat{\boldsymbol{\pi}}) = \sum_{c \in \mathcal{C}} \hat{\pi}_c (1 - \hat{\pi}_c) = \sum_{c \in \mathcal{C}} \hat{\pi}_c - \sum_{c \in \mathcal{C}} \hat{\pi}_c^2 = 1 - \sum_{c \in \mathcal{C}} \hat{\pi}_c^2$$

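A compact sketch of the three classification costs above, assuming integer labels in a NumPy array; a base-2 logarithm is assumed for the entropy so that the numbers match the comparison table on the next slide (the slides do not state the base).

```python
import numpy as np

def class_proportions(y):
    """Empirical class distribution (pi_hat) within a node."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def misclassification(y):
    return 1.0 - class_proportions(y).max()

def entropy(y):
    p = class_proportions(y)
    return float(-np.sum(p * np.log2(p)))   # base 2 (bits); an assumed convention

def gini(y):
    p = class_proportions(y)
    return float(1.0 - np.sum(p ** 2))

node = np.array([0] * 300 + [1] * 100)      # a node with class counts (300, 100)
print(misclassification(node), entropy(node), gini(node))   # ~0.25, ~0.811, ~0.375
```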
Classification Cost: Comparison

• Assume 2-class classification, where each class has 400 instances


Candidate split (class counts per region)    Misclassification Rate   Entropy   Gini Index
Regions (300, 100) and (100, 300)            0.25                     0.81      0.375
Regions (200, 400) and (199, 1)              0.25                     0.70      0.336

(Each value is the weighted average over the two regions; entropy is in bits.)

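As a sanity check, the table can be reproduced by weighting each region's cost by its share of the 800 instances (base-2 entropy assumed, as above):

```python
import numpy as np

def costs(counts):
    """(misclassification, entropy, gini) for one region, given its class counts."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return np.array([1 - p.max(), -(p * np.log2(p)).sum(), 1 - (p ** 2).sum()])

def weighted_costs(regions):
    """Weight each region's costs by its fraction of all instances."""
    total = sum(sum(r) for r in regions)
    return sum((sum(r) / total) * costs(r) for r in regions)

print(weighted_costs([(300, 100), (100, 300)]))   # ~[0.25, 0.81, 0.375]
print(weighted_costs([(200, 400), (199, 1)]))     # ~[0.25, 0.70, 0.336]
```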
Example: Iris Dataset

Example: Iris Dataset - Unpruned

Example: Iris Dataset - Pruned

BAGGING
Reducing variance via Bootstrap AGGregating
Random Tree Classifier
• Sample 𝑘 out of 𝑀 features randomly
– Heuristic: $k = \sqrt{M}$
• Build a full decision tree based only on the 𝑘 features
• This is a high-variance model

Classification tree on 2 out of 4 features in Iris dataset


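A brief sketch of one such high-variance random tree, assuming the Iris data and $k = \sqrt{M} = 2$ randomly chosen features:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
feats = rng.choice(X.shape[1], size=2, replace=False)   # k = sqrt(M) = 2 of the 4 features

# A full (unpruned) tree grown only on these features: it fits its sample closely,
# but which tree you get varies a lot with the chosen features and data.
tree = DecisionTreeClassifier().fit(X[:, feats], y)
print(feats, tree.get_depth(), tree.score(X[:, feats], y))
```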
Random Forest Classifier
• To lower the variance, we can “bag” many random trees

• Sample $L$ datasets from $D$ with replacement: $\{D_1, D_2, \dots, D_L\}$

• For each sampled dataset $D_l$:
– Sample $k$ out of $M$ features randomly
– Train a full classification tree $h_l(\boldsymbol{x})$ with the $k$ features
• The final classifier is the average of the trees
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x})$$

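A from-scratch sketch of this procedure (the class name SimpleRandomForest is illustrative; it assumes integer class labels and uses scikit-learn's DecisionTreeClassifier for the individual trees):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    """Bag of full trees, each grown on a bootstrap sample and k randomly chosen features."""

    def __init__(self, n_trees=100, k=None, random_state=0):
        self.n_trees, self.k = n_trees, k
        self.rng = np.random.default_rng(random_state)
        self.trees, self.feature_sets = [], []

    def fit(self, X, y):
        n, m = X.shape
        k = self.k or max(1, int(np.sqrt(m)))            # heuristic: k = sqrt(M)
        for _ in range(self.n_trees):
            rows = self.rng.integers(0, n, size=n)       # bootstrap: sample with replacement
            cols = self.rng.choice(m, size=k, replace=False)
            self.trees.append(DecisionTreeClassifier().fit(X[rows][:, cols], y[rows]))
            self.feature_sets.append(cols)
        return self

    def predict(self, X):
        # Combine the trees by majority vote (the averaging step for classification).
        votes = np.stack([t.predict(X[:, c]) for t, c in zip(self.trees, self.feature_sets)]).astype(int)
        return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = load_iris(return_X_y=True)
print(SimpleRandomForest(n_trees=50).fit(X, y).predict(X[:5]))   # predictions for 5 instances
```

In practice, sklearn.ensemble.RandomForestClassifier implements a refinement of this idea: it re-samples the candidate features at every split rather than once per tree.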
Sampling: without vs. with replacement

https://www.spss-tutorials.com/spss-sampling-basics/

Bagging: Bootstrap Aggregating
• We can apply bagging to other models as well

• Sample $L$ datasets from $D$ with replacement: $\{D_1, D_2, \dots, D_L\}$

• For each sampled dataset $D_l$:
– Sample $k$ out of $M$ features randomly
– Train a model $h_l(\boldsymbol{x})$ with the $k$ features
• The final model is the average of the predictions
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x})$$

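Scikit-learn ships this recipe as a generic wrapper around any base model; a brief usage sketch (the parameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

bag = BaggingClassifier(
    DecisionTreeClassifier(),   # base model; any estimator could be bagged
    n_estimators=100,           # L bootstrap samples / models
    max_features=0.5,           # each model sees a random subset of the features
    bootstrap=True,             # sample training instances with replacement
    random_state=0,
).fit(X, y)

print(bag.score(X, y))
```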
Bias-Variance Illustration

http://scott.fortmann-roe.com/docs/BiasVariance.html
Expected Loss for Regression

• Recall: assume a random noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is the source of residual error:

$$y = \tilde{y}(\boldsymbol{x} \mid \boldsymbol{w}') + \epsilon$$
$$p(y \mid \boldsymbol{x}, \boldsymbol{w}', \sigma^2) = \mathcal{N}\big(y \mid \tilde{y}(\boldsymbol{x} \mid \boldsymbol{w}'), \sigma^2\big)$$
$$\mathrm{E}[y \mid \boldsymbol{x}] = \int y \, p(y \mid \boldsymbol{x}, \boldsymbol{w}', \sigma^2)\, \mathrm{d}y = \tilde{y}(\boldsymbol{x} \mid \boldsymbol{w}')$$
$$\mathrm{var}[y \mid \boldsymbol{x}] = \mathrm{E}\big[(y - \mathrm{E}[y \mid \boldsymbol{x}])^2\big] = \mathrm{E}[\epsilon^2] = \mathrm{var}[\epsilon] + \mathrm{E}[\epsilon]^2 = \sigma^2$$

• Suppose that instead of learning from the "complete" data, we take sub-samples; then we expect some variations in the squared loss we'd observe
– Let $y$ be the observed response variable
– Let $\tilde{y}$ be the optimal function (the $\boldsymbol{x}$ and $\boldsymbol{w}'$ omitted for simplicity of notation)
– Let $\hat{y}$ be the function we learn from a particular sample
– We would like to characterize the expected squared loss $\mathrm{E}[(y - \hat{y})^2]$ under the distribution of subsamples
Bias-Variance Decomposition for Regression
$$\mathrm{E}[(y - \hat{y})^2] = \mathrm{E}[(\tilde{y} + \epsilon - \hat{y})^2] = \mathrm{E}\big[\big((\tilde{y} - \mathrm{E}[\hat{y}]) + \epsilon + (\mathrm{E}[\hat{y}] - \hat{y})\big)^2\big]$$
$$= \mathrm{E}[(\tilde{y} - \mathrm{E}[\hat{y}])^2] + \mathrm{E}[\epsilon^2] + \mathrm{E}\big[(\mathrm{E}[\hat{y}] - \hat{y})^2\big] + 2\,\mathrm{E}[\epsilon]\,\mathrm{E}[\tilde{y} - \mathrm{E}[\hat{y}]] + 2\,\mathrm{E}[\epsilon]\,\mathrm{E}[\mathrm{E}[\hat{y}] - \hat{y}] + 2\,\mathrm{E}[\tilde{y} - \mathrm{E}[\hat{y}]]\,\mathrm{E}[\mathrm{E}[\hat{y}] - \hat{y}]$$
$$= \mathrm{E}[(\tilde{y} - \mathrm{E}[\hat{y}])^2] + \sigma^2 + \mathrm{E}\big[(\mathrm{E}[\hat{y}] - \hat{y})^2\big]$$

Note that $\mathrm{E}[\epsilon] = 0$ and $\mathrm{E}\big[\mathrm{E}[\hat{y}] - \hat{y}\big] = \mathrm{E}[\hat{y}] - \mathrm{E}[\hat{y}] = 0$, so all the cross terms vanish.

• Squared bias $\mathrm{E}[(\tilde{y} - \mathrm{E}[\hat{y}])^2] = \mathrm{E}\big[(\tilde{y} - \mathrm{E}[h_l(\boldsymbol{x})])^2\big]$
– Contribution to squared loss due to deviation of the learnt function from the optimal
• Variance $\mathrm{E}[(\mathrm{E}[\hat{y}] - \hat{y})^2] = \mathrm{E}\big[(\mathrm{E}[h_l(\boldsymbol{x})] - h_l(\boldsymbol{x}))^2\big]$
– Contribution to squared loss due to sensitivity to different training subsamples

• Irreducible error $\sigma^2 = \mathrm{E}[(\tilde{y} - y)^2]$
– Contribution to squared loss due to random noise in the data

Bagging Reduces Variance
• Weak law of large numbers when samples are i.i.d:
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x}) \;\longrightarrow\; \mathrm{E}[h_l(\boldsymbol{x})] \quad \text{as } L \to \infty$$

• Variance $\mathrm{E}[(\mathrm{E}[\hat{y}] - \hat{y})^2] = \mathrm{E}\big[(\mathrm{E}[h_l(\boldsymbol{x})] - h_l(\boldsymbol{x}))^2\big]$

• If we replace $h_l$ with $\hat{h}$, the variance reduces to 0, if indeed the samples are i.i.d.

• Bagging samples are unlikely to be i.i.d., so the variance may not disappear completely, but it would likely still be reduced effectively

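A small simulation of this effect, under assumed settings (a noisy sine-wave regression problem, depth-limited scikit-learn trees, and variance measured at a single test point across repeated training sets):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.array([[0.5]])

def make_data(n=200):
    X = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=n)
    return X, y

def single_tree_prediction():
    X, y = make_data()
    return DecisionTreeRegressor(max_depth=6).fit(X, y).predict(x_test)[0]

def bagged_prediction(L=25):
    X, y = make_data()
    preds = []
    for _ in range(L):
        idx = rng.integers(0, len(y), size=len(y))     # bootstrap sample of the same data
        preds.append(DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx]).predict(x_test)[0])
    return np.mean(preds)                              # the bagged average of L trees

# Variance of the prediction at x_test across 100 independently drawn training sets
print("single tree :", np.var([single_tree_prediction() for _ in range(100)]))
print("bagged trees:", np.var([bagged_prediction() for _ in range(100)]))
```

The bagged variance does not reach zero, since trees grown on bootstrap samples of the same dataset are correlated, matching the caveat above.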
Unbiased Estimate of Test Error
• Each bagging sample $D_l$ only involves a subset of the training data $D$

• A specific training instance $(\boldsymbol{x}_n, y_n)$ is part of some samples, but not others
• Let $\mathcal{D}_{-n} = \{D_l \mid (\boldsymbol{x}_n, y_n) \notin D_l\}$ be the samples that do not contain this instance
• Let $\hat{h}_{-n}(\boldsymbol{x}) = \frac{1}{|\mathcal{D}_{-n}|} \sum_{D_l \in \mathcal{D}_{-n}} h_l(\boldsymbol{x})$ be the average of the models trained on $\mathcal{D}_{-n}$

• The out-of-bag error is the average such error across the $N$ instances in $D$
$$\epsilon_{OOB}(D) = \frac{1}{N} \sum_{n \in \{1,\dots,N\}} \mathrm{loss}\big(\hat{h}_{-n}(\boldsymbol{x}_n), y_n\big)$$

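A from-scratch sketch of the out-of-bag estimate for bagged classification trees (the helper name oob_error is illustrative, and integer class labels are assumed); scikit-learn's RandomForestClassifier exposes the same idea via oob_score=True.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, L=200, seed=0):
    rng = np.random.default_rng(seed)
    n, n_classes = len(y), len(np.unique(y))
    votes = np.zeros((n, n_classes))                  # per-instance votes from OOB trees only
    for _ in range(L):
        idx = rng.integers(0, n, size=n)              # bootstrap sample D_l
        oob = np.setdiff1d(np.arange(n), idx)         # instances not drawn into D_l
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        votes[oob, tree.predict(X[oob]).astype(int)] += 1
    covered = votes.sum(axis=1) > 0                   # instances left out by at least one sample
    pred = votes[covered].argmax(axis=1)
    return np.mean(pred != y[covered])                # average 0/1 loss, as in the formula above

X, y = load_iris(return_X_y=True)
print(oob_error(X, y))
```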
BOOSTING
Reducing bias via iteratively improving weak learners
Boosting
• Consider a binary classification problem 𝑦 ∈ {−1, 1}

• A weak learner is a model for binary classification that has slightly better
performance than random guesses
– Example: a shallow classification tree

• Boosting seeks to create a strong learner from a weighted combination of multiple weak learners

$$H(\boldsymbol{x}) = \mathrm{sign}\Big( \sum_t \alpha_t h_t(\boldsymbol{x}) \Big)$$

AdaBoost

• Training data $D$ has $N$ instances $\{(\boldsymbol{x}_n, y_n)\}_{n \in \{1,\dots,N\}}$

• Associate each instance $(\boldsymbol{x}_n, y_n)$ with a weight $w_n$
• Assume we can train a model $h_t$ that minimizes a weighted loss function
$$L_t = \sum_{n \in \{1,\dots,N\}} w_n^{(t)} \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$$
– $\mathbf{1}(\cdot)$ is an indicator function that yields 1 if the condition within holds, 0 otherwise
– An example model is a decision stump with a single feature split
• Initially, the weights of all instances are uniform: $w_n^{(1)} = \frac{1}{N}$
• Subsequently, weights of misclassified instances are adjusted

Algorithm
• For iteration 𝑡 from 1 to 𝑇
– Fit a classifier $h_t$ to the training data by minimizing the weighted loss function
$$L_t = \sum_{n \in \{1,\dots,N\}} w_n^{(t)} \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$$
– Evaluate error of this iteration
$$\epsilon_t = \frac{\sum_{n \in \{1,\dots,N\}} w_n^{(t)} \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)}{\sum_{n \in \{1,\dots,N\}} w_n^{(t)}}$$
– Evaluate coefficient of this classifier
$$\alpha_t = \ln \frac{1 - \epsilon_t}{\epsilon_t}$$
– Update data weight coefficients
$$w_n^{(t+1)} = w_n^{(t)} \exp\big(\alpha_t \, \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)\big)$$
• Final prediction is given by:
$$H(\boldsymbol{x}) = \mathrm{sign}\Big( \sum_{t \in \{1,\dots,T\}} \alpha_t h_t(\boldsymbol{x}) \Big)$$
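A from-scratch sketch of these steps, assuming labels in {−1, +1} and scikit-learn decision stumps (max_depth=1) as the weak learners; the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train up to T decision stumps with AdaBoost; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # uniform initial weights w_n = 1/N
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        eps = np.sum(w * miss) / np.sum(w)        # weighted error of this iteration
        if eps >= 0.5:                            # weak learner no better than chance: stop
            break
        alpha = np.log((1 - eps) / eps) if eps > 0 else 10.0   # cap alpha if the stump is perfect
        stumps.append(stump)
        alphas.append(alpha)
        if eps == 0:
            break
        w = w * np.exp(alpha * miss)              # up-weight the misclassified instances
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))

# Toy usage: a nonlinear 2-D problem with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] ** 2 > 0.5, 1, -1)
stumps, alphas = adaboost_fit(X, y)
print((adaboost_predict(stumps, alphas, X) == y).mean())   # training accuracy
```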
Minimizes Sequential Exponential Error

• Sequential ensemble classifier:


$$H_{t'}(\boldsymbol{x}_n) = \frac{1}{2} \sum_{t \in \{1,\dots,t'\}} \alpha_t h_t(\boldsymbol{x}_n)$$

• Sequential exponential error:


$$E = \sum_{n \in \{1,\dots,N\}} \exp\big(-y_n H_{t'}(\boldsymbol{x}_n)\big)$$
$$= \sum_{n \in \{1,\dots,N\}} \exp\Big(-y_n H_{t'-1}(\boldsymbol{x}_n) - \frac{1}{2}\, y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big)$$
$$= \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \exp\Big(-\frac{1}{2}\, y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big)$$
• Given $H_{t'-1}$, the weight $w_n^{(t')} = \exp\big(-y_n H_{t'-1}(\boldsymbol{x}_n)\big)$ is a constant, and we seek to find the minimizing $\alpha_{t'}$ and $h_{t'}$
Sequential Exponential Error (cont’d)

• Correctly classified instances: $C_{t'} = \{(\boldsymbol{x}_n, y_n) \mid y_n \cdot h_{t'}(\boldsymbol{x}_n) \ge 0\}$

• Wrongly classified instances: $\bar{C}_{t'} = \{(\boldsymbol{x}_n, y_n) \mid y_n \cdot h_{t'}(\boldsymbol{x}_n) < 0\}$
• Sequential exponential error:
$$E = \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \exp\Big(-\frac{1}{2}\, y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big)$$
$$= \exp\Big(-\frac{\alpha_{t'}}{2}\Big) \sum_{n \in C_{t'}} w_n^{(t')} + \exp\Big(\frac{\alpha_{t'}}{2}\Big) \sum_{n \in \bar{C}_{t'}} w_n^{(t')}$$
$$= \Big(\exp\Big(\frac{\alpha_{t'}}{2}\Big) - \exp\Big(-\frac{\alpha_{t'}}{2}\Big)\Big) \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \mathbf{1}(h_{t'}(\boldsymbol{x}_n) \ne y_n) + \exp\Big(-\frac{\alpha_{t'}}{2}\Big) \sum_{n \in \{1,\dots,N\}} w_n^{(t')}$$

• Minimizing the above with respect to $\alpha_{t'}$ gives us

$$\alpha_{t'} = \ln \frac{1 - \epsilon_{t'}}{\epsilon_{t'}}, \qquad \epsilon_{t'} = \frac{\sum_{n \in \{1,\dots,N\}} w_n^{(t')} \mathbf{1}(h_{t'}(\boldsymbol{x}_n) \ne y_n)}{\sum_{n \in \{1,\dots,N\}} w_n^{(t')}}$$
Illustration

Illustration (cont’d)

Interpretations of Boosting

• A form of L1 regularization
– Each weak learner is a decision stump that relies on a single feature
– Boosting “selects” among these weak learners (features) that work well

• Margin maximization
– By iteratively adjusting the weights of misclassified instances, boosting seeks the classifier that maximizes the margin

• Functional gradient descent


– The functions are the “parameters”
– GradientBoost is a generic algorithm for boosting that accommodates various loss functions

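For context, a brief scikit-learn sketch of gradient boosting with a configurable loss; the parameter values are arbitrary, and the availability of the "exponential" option (which recovers AdaBoost-like behaviour on binary problems) may depend on the library version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

gb = GradientBoostingClassifier(
    loss="exponential",    # exponential loss ~ AdaBoost; the logistic/deviance loss is the default
    n_estimators=200,      # number of weak learners
    max_depth=1,           # decision stumps as the weak learners
    learning_rate=0.1,
    random_state=0,
).fit(X, y)

print(gb.score(X, y))
```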
Boosting Loss Functions

Conclusion
• Classification and Regression Trees (CART)
– A class of models that partitions the input space into regions and models each region locally

• Ensemble learning
– Aggregating the predictions of multiple models

• Bagging
– Trains multiple models on sub-samples of the dataset
– Reduces variance of a high-variance learning algorithm without affecting bias

• Boosting
– Combining multiple weak learners into a strong learner
– Reduces bias by iteratively over-weighting misclassified instances
References

• [PRML] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
– Chapter 14 (Combining Models)

• [MLaPP] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
– Chapter 16 (Adaptive Basis Function Models)

You might also like