Chapter 7
Ensemble Learning and
Random Forests
Introduction
• Suppose you pose a complex question to thousands of random
people, then aggregate their answers.
• In many cases you will find that this aggregated answer is better than
an expert’s answer. This is called the wisdom of the crowd.
• Similarly, if you aggregate the predictions of a group of predictors
(such as classifiers or regressors), you will often get better predictions
than with the best individual predictor.
• A group of predictors is called an ensemble; thus, this technique is
called Ensemble Learning, and an Ensemble Learning algorithm is
called an Ensemble method.
Voting Classifiers
Suppose you have trained a few classifiers, each one achieving about
80% accuracy.
• A voting classifier that predicts the class getting the most votes (hard voting) often achieves a higher accuracy than the best classifier in the ensemble.
• Ensemble methods work best when the predictors are as
independent from one another as possible.
• One way to get diverse classifiers is to train them using very different
algorithms.
• If all classifiers are able to estimate class probabilities (i.e., they all have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting (see the sketch below).
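A minimal sketch (not from the slides) of a soft-voting ensemble built with Scikit-Learn's VotingClassifier; the moons dataset, the choice of base estimators, and the hyperparameter values are illustrative assumptions.

# Illustrative soft-voting ensemble on a toy dataset (assumed, not from the slides).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # predict_proba() needed for soft voting
    ],
    voting="soft",  # average the predicted class probabilities
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))  # often beats each individual classifier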
Bagging and Pasting
• One way to get a diverse set of classifiers is to use very different training algorithms.
• Another approach is to use the same training algorithm for every predictor and train
them on different random subsets of the training set.
• When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating).
• When sampling is performed without replacement, it is called pasting.
• Once all predictors are trained, the ensemble can make a prediction
for a new instance by simply aggregating the predictions of all
predictors.
• The aggregation function is typically the statistical mode (i.e., the most frequent prediction, as in hard voting) for classification, or the average for regression (see the sketch below).
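As a rough sketch (toy dataset and hyperparameters are assumptions), both bagging and pasting can be implemented with Scikit-Learn's BaggingClassifier; only the bootstrap flag differs.

# Bagging: 500 Decision Trees, each trained on 100 instances sampled with replacement.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,   # with replacement = bagging; set bootstrap=False for pasting
    n_jobs=-1,        # use all available CPU cores
    random_state=42,
)
bag_clf.fit(X_train, y_train)
print(bag_clf.score(X_test, y_test))

Note that BaggingClassifier automatically aggregates by soft voting when the base estimator can estimate class probabilities, and by hard voting otherwise.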
Out-of-Bag Evaluation
• With bagging, some instances may be sampled several times for any
given predictor, while others may not be sampled at all.
• Only about 63% of the training instances are sampled on average for each predictor. The remaining ~37% that are not sampled are called out-of-bag (OOB) instances.
• The OOB error is the mean prediction error on each training sample x_i, computed using only the predictors that did not have x_i in their bootstrap sample (see the sketch below).
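A short sketch of out-of-bag evaluation (dataset and hyperparameters are illustrative): with oob_score=True, Scikit-Learn evaluates each predictor on its OOB instances and exposes the averaged result as oob_score_, approximating the test-set accuracy without a separate validation set.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,   # request out-of-bag evaluation after training
    n_jobs=-1,
    random_state=42,
)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)             # OOB accuracy estimate
print(bag_clf.score(X_test, y_test))  # usually close to the OOB estimate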
Random Forests
• Random Forest is an ensemble of Decision Trees, generally trained via
the bagging method (or sometimes pasting), typically with
max_samples set to the size of the training set.
• Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees (see the sketch below).
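A minimal illustrative sketch (toy data and hyperparameter values assumed) of RandomForestClassifier:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A Random Forest is essentially bagged Decision Trees that additionally consider
# only a random subset of the features at each split.
rnd_clf = RandomForestClassifier(
    n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42
)
rnd_clf.fit(X_train, y_train)
print(rnd_clf.score(X_test, y_test))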
[Figure: Decision Tree]
[Figure: Random Forest]
Extra-Trees
• When you are growing a tree in a Random Forest, at each node only a
random subset of the features is considered for splitting.
• It is possible to make trees even more random by also using random
thresholds for each feature rather than searching for the best possible
thresholds (like regular Decision Trees do).
• A forest of such extremely random trees is called an Extremely
Randomized Trees ensemble (or Extra-Trees for short).
• This technique trades more bias for a lower variance.
• It also makes Extra-Trees much faster to train than regular Random
Forests, because finding the best possible threshold for each feature
at every node is one of the most time-consuming tasks of growing a
tree.
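For illustration (the toy dataset and hyperparameter values are assumptions), ExtraTreesClassifier mirrors the RandomForestClassifier API:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same API as RandomForestClassifier, but split thresholds are drawn at random
# instead of being searched for, trading a bit more bias for lower variance.
extra_clf = ExtraTreesClassifier(
    n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42
)
extra_clf.fit(X_train, y_train)
print(extra_clf.score(X_test, y_test))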
Feature Importance
• Scikit-Learn measures a feature’s importance by looking at how much
the tree nodes that use that feature reduce impurity on average
(across all trees in the forest).
• More precisely, it is a weighted average, where each node’s weight is
equal to the number of training samples that are associated with it.
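As a hedged example (the iris dataset is an assumption, not taken from the slides), Scikit-Learn exposes these scores through the feature_importances_ attribute after training:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# Importances are normalized so that they sum to 1.
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(name, round(score, 3))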
Boosting
• Boosting (originally called hypothesis boosting) refers to any
Ensemble method that can combine several weak learners into a
strong learner.
• The general idea of most boosting methods is to train predictors
sequentially, each trying to correct its predecessor.
Adaptive Boosting (AdaBoost)
• One way for a new predictor to correct its predecessor is to pay a bit
more attention to the training instances that the predecessor
underfitted.
• This results in new predictors focusing more and more on the hard
cases.
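A minimal sketch of Scikit-Learn's AdaBoostClassifier built on decision stumps (depth-1 trees); the dataset and hyperparameter values are illustrative assumptions.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)
print(ada_clf.score(X_test, y_test))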
AdaBoost in detail
(0) Initially, each instance weight is set to

    w^{(i)} = \frac{1}{m}

    where m is the number of training instances.

(1) Weighted error rate of the j-th predictor:

    r_j = \frac{\sum_{i=1,\; \hat{y}_j^{(i)} \neq y^{(i)}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

    where \hat{y}_j^{(i)} is the j-th predictor's prediction for the i-th instance.

(2) Predictor weight:

    \alpha_j = \eta \,\log\frac{1 - r_j}{r_j}
𝜼 is the learning rate hyperparameter (defaults to 1). The more accurate the
predictor is, the higher its weight will be.
If it is just guessing randomly, then its weight will be close to zero.
If its weight is negative, the predictor is less accurate than random guessing (a wrong predictor).
(3) Instance weight update: the weights of the misclassified instances are boosted,

    w^{(i)} \leftarrow w^{(i)} \exp(\alpha_j) \quad \text{if } \hat{y}_j^{(i)} \neq y^{(i)}, \text{ otherwise left unchanged},

    then all weights are normalized and a new predictor is trained with the updated weights.

(4) To make predictions, AdaBoost computes

    \hat{y}(x) = \underset{k}{\operatorname{argmax}} \sum_{j=1,\; \hat{y}_j(x)=k}^{N} \alpha_j

    where N is the number of predictors.
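To make the steps above concrete, here is a rough NumPy sketch of the same update rules using decision stumps as weak learners; it is a didactic toy under the assumptions of a binary moons dataset and 200 boosting rounds, not Scikit-Learn's implementation.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X)
w = np.ones(m) / m            # (0) initial instance weights w^(i) = 1/m
eta = 1.0                     # learning rate
alphas, stumps = [], []

for j in range(200):
    stump = DecisionTreeClassifier(max_depth=1, random_state=j)
    stump.fit(X, y, sample_weight=w)
    y_pred = stump.predict(X)
    miss = y_pred != y
    r = w[miss].sum() / w.sum()            # (1) weighted error rate
    alpha = eta * np.log((1 - r) / r)      # (2) predictor weight
    w[miss] *= np.exp(alpha)               # (3) boost weights of misclassified instances
    w /= w.sum()                           #     ...then normalize
    alphas.append(alpha)
    stumps.append(stump)

# (4) prediction: class with the largest total alpha among the stumps voting for it
votes = np.zeros((m, 2))                   # two classes in this toy problem
for alpha, stump in zip(alphas, stumps):
    votes[np.arange(m), stump.predict(X)] += alpha
print((votes.argmax(axis=1) == y).mean())  # training accuracy of the toy ensemble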
Scikit-Learn uses a multiclass version of AdaBoost called SAMME (which stands for Stagewise
Additive Modeling using a Multiclass Exponential loss function). When there are just two classes,
SAMME is equivalent to AdaBoost.
Gradient Boosting
• Just like AdaBoost, Gradient Boosting works by sequentially adding
predictors to an ensemble, each one correcting its predecessor.
• However, Gradient Boosting tries to fit the new predictor to the
residual errors made by the previous predictor.
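A sketch of this residual-fitting idea on a noisy quadratic (the data-generating function and hyperparameter values are assumptions): each new tree is fit to the residuals of the ensemble so far, and the ensemble predicts by summing the trees' predictions; GradientBoostingRegressor builds roughly the same ensemble more conveniently.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)   # noisy quadratic

tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
y2 = y - tree1.predict(X)                            # residuals of tree 1
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y2)
y3 = y2 - tree2.predict(X)                           # residuals of trees 1+2
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y3)

X_new = np.array([[0.4]])
y_pred = sum(tree.predict(X_new) for tree in (tree1, tree2, tree3))
print(y_pred)                                        # sum of the three trees' predictions

# Roughly the same ensemble, built by Scikit-Learn:
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=42)
gbrt.fit(X, y)
print(gbrt.predict(X_new))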
Stacking
• The last Ensemble method we will
discuss in this chapter is called
stacking (short for stacked
generalization).
• The idea: instead of using a trivial aggregation function (such as hard voting), train a model to perform the aggregation.
• To train the blender (meta learner), a common approach is to use a hold-out set (see the sketch after this list).
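A sketch using Scikit-Learn's StackingClassifier (the base estimators, blender, and toy dataset are illustrative assumptions); note that this class trains the blender on cross-validated predictions rather than on a single hold-out set.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the blender / meta learner
    cv=5,                                  # out-of-fold predictions feed the blender
)
stack_clf.fit(X_train, y_train)
print(stack_clf.score(X_test, y_test))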
• First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer.
• Next, the first-layer predictors make predictions on the second (held-out) subset; these predictions are used as the input features of a new training set on which the blender is trained.
Split the training set into three subsets:
• The first one is used to train the first layer,
• The second one is used to create the training
set used to train the second layer (using
predictions made by the predictors of the
first layer), and
• The third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer).
Once this is done, we can make a prediction for
a new instance by going through each layer
sequentially