Bagging and Random Forests
Slides based on: STT450-550: Statistical Data Mining
Ensemble methods
• A single decision tree does not perform well
• Combining many trees (an ensemble) can help, but we need to make sure they do not all just learn the same thing
Problem!
• The decision trees discussed earlier suffer from high variance!
• If we randomly split the training data into two parts and fit a decision tree on each part, the results could be quite different
Problem! (illustrative figures omitted)
Figure source: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Ensemble%20(v6).pdf
Bagging
To solve this problem, we can use bagging (bootstrap aggregating).
What is bagging?
• Bagging is an extremely powerful idea based on two things:
• Averaging: reduces variance! (see the short note below)
• Bootstrapping: plenty of training datasets!
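For intuition (a standard argument, not from the original slides): if the B predictions were independent and identically distributed, each with variance σ², averaging them divides the variance by B:

\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)\right)
  = \frac{1}{B^{2}}\sum_{b=1}^{B}\operatorname{Var}\!\left(\hat{f}^{*b}(x)\right)
  = \frac{\sigma^{2}}{B}
\]

Bootstrapped trees are not truly independent, but the same mechanism is what drives the variance reduction.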
Bootstrapping is simple!
• Each bootstrap sample is a resampled version of the observed dataset, of the same size, obtained by random sampling with replacement from the original dataset (a small sketch of this follows below)
• For prediction:
• Regression: average the B predictions from the B trees
• Classification: majority vote among the B trees
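A minimal sketch of drawing bootstrap samples with NumPy; the names X, y and B are placeholders, not from the slides:

import numpy as np

def bootstrap_samples(X, y, B, seed=0):
    """Yield B bootstrap samples (same size as the data, drawn with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # sample n row indices with replacement
        yield X[idx], y[idx]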
http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Ensemble%20(v6).pdf 11
Bagging
• Reduces overfitting (variance)
• Easy to parallelize
Bagging for Regression Trees
• Construct B regression trees using B bootstrapped training datasets
• Average the resulting predictions (see the sketch below)
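A minimal sketch of bagged regression trees, assuming scikit-learn is available and reusing the hypothetical bootstrap_samples helper above:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_regression_predict(X_train, y_train, X_new, B=100, seed=0):
    """Fit B regression trees on bootstrap samples and average their predictions."""
    trees = []
    for Xb, yb in bootstrap_samples(X_train, y_train, B, seed=seed):
        trees.append(DecisionTreeRegressor().fit(Xb, yb))
    preds = np.column_stack([t.predict(X_new) for t in trees])  # shape (n_new, B)
    return preds.mean(axis=1)  # average over the B trees

scikit-learn's BaggingRegressor implements the same idea and, since the B trees are independent fits, can train them in parallel (n_jobs).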
Bagging for Classification Trees
• Construct B classification trees using B bootstrapped training datasets
• For prediction, there are two approaches (sketched below):
1. Record the class that each bootstrapped tree predicts and take the most commonly occurring class as the overall prediction (majority vote).
2. If our classifier produces probability estimates, we can average the probabilities and predict the class with the highest average probability.
• Both methods work well.
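A minimal sketch of both prediction rules, again reusing the hypothetical bootstrap_samples helper; data names are placeholders:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_classification_predict(X_train, y_train, X_new, B=100, seed=0):
    """Fit B classification trees on bootstrap samples; return both the
    majority-vote prediction and the averaged-probability prediction."""
    trees = [DecisionTreeClassifier().fit(Xb, yb)
             for Xb, yb in bootstrap_samples(X_train, y_train, B, seed=seed)]

    # Approach 1: majority vote over the B predicted classes
    votes = np.column_stack([t.predict(X_new) for t in trees])  # (n_new, B)
    def vote(row):
        vals, counts = np.unique(row, return_counts=True)
        return vals[np.argmax(counts)]
    majority = np.array([vote(row) for row in votes])

    # Approach 2: average the class probabilities, then take the most probable class
    # (assumes every class appears in each bootstrap sample so columns align)
    avg_proba = np.mean([t.predict_proba(X_new) for t in trees], axis=0)
    by_probability = trees[0].classes_[np.argmax(avg_proba, axis=1)]
    return majority, by_probability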
A Comparison of Error Rates
• (Figure omitted) Here the green line represents a simple majority vote approach
• Each bootstrap sample contains, on average, about 1 − 1/e ≈ 0.632 of the distinct original observations
• The remaining observations, about 1/e ≈ 0.368, are left "out-of-bag" and can be used to estimate the test error
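Where the constants 0.632 and 0.368 come from (a standard bootstrap calculation, stated here for completeness): the probability that a given observation is never drawn in n samples with replacement is

\[
\left(1-\frac{1}{n}\right)^{n} \xrightarrow{\;n\to\infty\;} e^{-1} \approx 0.368,
\qquad\text{so}\qquad
1-\left(1-\frac{1}{n}\right)^{n} \longrightarrow 1-e^{-1} \approx 0.632 .
\]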
Variable Importance Measure
• Bagging typically improves prediction accuracy over a single tree, but the model is now harder to interpret!
• We have hundreds of trees, and it is no longer clear which variables are most important to the procedure
• Thus bagging improves prediction accuracy at the expense of interpretability
• But we can still get an overall summary of the importance of each predictor using relative influence plots
Relative Influence Plots
• How do we decide which variables are most useful in predicting the response?
• We can compute something called relative influence plots (a sketch follows below)
• These plots give a score for each variable
• The scores represent the total decrease in MSE attributable to splits on a particular variable, averaged over the trees
• A score close to zero indicates the variable is not important and could be dropped
• The larger the score, the more influence the variable has
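A minimal sketch of computing such scores for a bagged ensemble, reusing the hypothetical bootstrap_samples helper. Note that scikit-learn's feature_importances_ reports normalized impurity-based importances rather than raw decreases in MSE, but the resulting ranking plays the same role:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_importance(X_train, y_train, feature_names, B=100, seed=0):
    """Average impurity-based importance scores over B bagged regression trees."""
    scores = np.zeros(X_train.shape[1])
    for Xb, yb in bootstrap_samples(X_train, y_train, B, seed=seed):
        scores += DecisionTreeRegressor().fit(Xb, yb).feature_importances_
    scores /= B
    order = np.argsort(scores)[::-1]  # most influential variables first
    return [(feature_names[i], scores[i]) for i in order]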
Example: Housing Data
• (Figure omitted: relative influence plot for the housing data)
• Median Income is by far the most important variable
• Longitude, Latitude and Average occupancy are the next most important
Random Forests
Random Forests for Classification
• A very efficient statistical learning method
• It builds on the idea of bagging, but it provides an improvement because it de-correlates the trees
• Decision tree:
• Easy to achieve 0% error rate on the training data
• If each training example has its own leaf ……
• Random forest: bagging of decision trees
• Resampling the training data is not sufficient
• Randomly restrict the features/questions used in each split
• How does it work? (A sketch follows below.)
• Build a number of decision trees on bootstrapped training samples, but when building these trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors (usually m ≈ √p)
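A minimal sketch in scikit-learn, whose RandomForestClassifier restricts the predictors considered at each split via max_features; the data names are placeholders:

from sklearn.ensemble import RandomForestClassifier

# B bootstrapped trees; at each split only a random subset of m ≈ sqrt(p)
# predictors is considered ("sqrt" is the usual choice for classification).
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                oob_score=True, random_state=0)
# forest.fit(X_train, y_train)     # X_train, y_train are placeholder names
# print(forest.oob_score_)         # out-of-bag accuracy estimate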
Random Forests for Regression
• Works the same way, except that the predictions of the B trees are averaged; a common default is m ≈ p/3
Why do we consider a random sample of m predictors instead of all p predictors for splitting?
• If there is one very strong predictor, most bagged trees will use it near the top, so all the bagged trees will look similar. Hence all the predictions from the bagged trees will be highly correlated
• Averaging highly correlated predictions reduces variance much less than averaging uncorrelated ones (see the note below)
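To make the de-correlation argument concrete (a standard result, not from these slides): for B identically distributed predictions with variance σ² and pairwise correlation ρ,

\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}.
\]

The second term vanishes as B grows, but the first does not; choosing only m < p candidate predictors at each split lowers the correlation ρ between the trees.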
Random Forest with different values of “m”
• (Figure omitted: error rates for random forests built with different values of m)
• Notice that when random forests are built with m = p, this amounts simply to bagging (a sketch of such a comparison follows below)
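A sketch of comparing different values of m using the out-of-bag error; max_features in scikit-learn plays the role of m, and the data names are placeholders:

from sklearn.ensemble import RandomForestRegressor

def compare_m_values(X_train, y_train, m_values, B=500, seed=0):
    """Return the out-of-bag R^2 of random forests for different values of m."""
    results = {}
    for m in m_values:  # m = p reproduces bagging
        forest = RandomForestRegressor(n_estimators=B, max_features=m,
                                       oob_score=True, random_state=seed)
        forest.fit(X_train, y_train)
        results[m] = forest.oob_score_
    return results

# Example (hypothetical data with p predictors):
# compare_m_values(X_train, y_train, m_values=[p, p // 2, int(p ** 0.5)])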