
Chapter 7: Ensemble Learning and Random Forests

Introduction
• Suppose you pose a complex question to thousands of random
people, then aggregate their answers.
• In many cases you will find that this aggregated answer is better than
an expert’s answer. This is called the wisdom of the crowd.
• Similarly, if you aggregate the predictions of a group of predictors
(such as classifiers or regressors), you will often get better predictions
than with the best individual predictor.
• A group of predictors is called an ensemble; thus, this technique is
called Ensemble Learning, and an Ensemble Learning algorithm is
called an Ensemble method.

Géron, Hands-on Machine Learning

Voting Classifiers
Suppose you have trained a few classifiers, each one achieving about
80% accuracy.


A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard-voting classifier.

Géron, Hands-on Machine Learning

• A voting classifier often achieves higher accuracy than the best classifier in the ensemble.
• Ensemble methods work best when the predictors are as independent from one another as possible.
• One way to get diverse classifiers is to train them using very different algorithms.
• If all classifiers are able to estimate class probabilities (i.e., they all have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting.

Géron, Hands-on Machine Learning
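As a rough illustration of hard vs. soft voting, the sketch below uses scikit-learn's VotingClassifier; the three base classifiers, the make_moons toy dataset, and the hyperparameter values are assumptions for demonstration, not taken from the slides.

```python
# Hard vs. soft voting with VotingClassifier (illustrative sketch; the base
# classifiers and dataset are arbitrary choices).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # probability=True enables predict_proba
    ],
    voting="soft",  # averages class probabilities; use "hard" for majority voting
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))
```

Soft voting requires every estimator to expose predict_proba(), which is why the SVC above is created with probability=True.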

Biased Coin Tossing

Géron, Hands-on Machine Learning
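A minimal NumPy simulation of the biased-coin argument: a coin that lands heads only 51% of the time still produces a heads majority over many tosses far more often than 51% of the time, which is the law-of-large-numbers intuition behind ensembling. The toss and series counts below are illustrative assumptions.

```python
# Law of large numbers behind ensembling: majority votes of a slightly biased
# coin are right much more often than a single toss (illustrative simulation).
import numpy as np

rng = np.random.default_rng(42)
heads_proba = 0.51          # slight bias toward heads
n_tosses, n_series = 1000, 10_000

tosses = rng.random((n_series, n_tosses)) < heads_proba   # True = heads
majority_heads = tosses.mean(axis=1) > 0.5                # majority vote per series
print(f"P(single toss is heads)               ~ {heads_proba}")
print(f"P(majority of {n_tosses} tosses is heads) ~ {majority_heads.mean():.3f}")
```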

Bagging and Pasting
• One way to get a diverse set of classifiers is to use very different training algorithms.
• Another approach is to use the same training algorithm for every predictor and train
them on different random subsets of the training set.
• When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating).
• When sampling is performed without replacement, it is called pasting.
Géron, Hands-on Machine Learning

• Once all predictors are trained, the ensemble can make a prediction
for a new instance by simply aggregating the predictions of all
predictors.
• The aggregation function is typically the statistical mode for
classification, or the average for regression.
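A minimal sketch of both options with scikit-learn's BaggingClassifier; the Decision Tree base estimator, the dataset, and the hyperparameter values are illustrative assumptions.

```python
# Bagging vs. pasting with BaggingClassifier (illustrative sketch).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=True  -> sampling WITH replacement    (bagging)
# bootstrap=False -> sampling WITHOUT replacement (pasting)
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,     # set to False for pasting
    n_jobs=-1,          # the predictors are independent, so they train in parallel
    random_state=42,
)
bag_clf.fit(X_train, y_train)
print(bag_clf.score(X_test, y_test))  # aggregation = statistical mode of the 500 trees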

Out-of-Bag Evaluation
• With bagging, some instances may be sampled several times for any
given predictor, while others may not be sampled at all.
• Only about 63% of the training instances are sampled on average for each predictor (with m instances drawn m times with replacement, the chance that a given instance is picked at least once tends to 1 − 1/e ≈ 63%). The remaining ~37% of the training instances that are not sampled for a given predictor are called out-of-bag (OOB) instances.
• The OOB error is the mean prediction error over the training samples x_i, using only the trees that did not have x_i in their bootstrap sample.

Géron, Hands-on Machine Learning
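A minimal sketch of requesting an OOB evaluation from BaggingClassifier; the dataset and hyperparameter values are assumptions.

```python
# Out-of-bag evaluation: each predictor is scored on the ~37% of training
# instances it never saw, and the scores are averaged (illustrative sketch).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,      # request an automatic OOB evaluation after training
    n_jobs=-1,
    random_state=42,
)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)             # OOB accuracy estimate (no separate validation set needed)
print(bag_clf.score(X_test, y_test))  # usually close to the OOB estimate
# bag_clf.oob_decision_function_ holds the OOB class probabilities per training instance
```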

Random Patches and Random Subspaces


• The BaggingClassifier class supports instance sampling and feature sampling.
• Instance sampling is controlled by max_samples and bootstrap.
• Feature sampling is controlled by two hyperparameters: max_features and bootstrap_features.
• Sampling both training instances and features is called the Random Patches method.
• Keeping all training instances (by setting bootstrap=False and max_samples=1.0) but sampling features (by setting bootstrap_features=True and/or max_features < 1.0) is called the Random Subspaces method.

Géron, Hands-on Machine Learning
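A sketch of both settings, assuming the iris dataset and arbitrary sampling ratios.

```python
# Random Patches vs. Random Subspaces with BaggingClassifier (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random Patches: sample training instances AND features
random_patches = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    max_samples=0.7, bootstrap=True,             # instance sampling
    max_features=0.5, bootstrap_features=True,   # feature sampling
    random_state=42,
).fit(X, y)

# Random Subspaces: keep all training instances, sample only features
random_subspaces = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    max_samples=1.0, bootstrap=False,            # keep every training instance
    max_features=0.5, bootstrap_features=True,   # sample only the features
    random_state=42,
).fit(X, y)
```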

Random Forests
• Random Forest is an ensemble of Decision Trees, generally trained via
the bagging method (or sometimes pasting), typically with
max_samples set to the size of the training set.
• Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees.

Géron, Hands-on Machine Learning
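A minimal sketch of a RandomForestClassifier next to a roughly equivalent BaggingClassifier; the dataset and hyperparameter values are assumptions, and the equivalence is approximate.

```python
# RandomForestClassifier and a roughly equivalent bagging ensemble of
# Decision Trees (illustrative sketch).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rnd_clf = RandomForestClassifier(
    n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42
)
rnd_clf.fit(X_train, y_train)

# Roughly the same model spelled out with BaggingClassifier: each tree only
# considers a random subset of features at every split (max_features="sqrt").
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16, random_state=42),
    n_estimators=500, n_jobs=-1, random_state=42,
)
bag_clf.fit(X_train, y_train)
print(rnd_clf.score(X_test, y_test), bag_clf.score(X_test, y_test))
```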

[Figure: Decision Tree]

[Figure: Random Forest]

Extra-Trees
• When you are growing a tree in a Random Forest, at each node only a
random subset of the features is considered for splitting.
• It is possible to make trees even more random by also using random
thresholds for each feature rather than searching for the best possible
thresholds (like regular Decision Trees do).
• A forest of such extremely random trees is called an Extremely
Randomized Trees ensemble (or Extra-Trees for short).
• This technique trades more bias for a lower variance.
• It also makes Extra-Trees much faster to train than regular Random
Forests, because finding the best possible threshold for each feature
at every node is one of the most time-consuming tasks of growing a
tree.

Géron, Hands-on Machine Learning
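ExtraTreesClassifier has the same API as RandomForestClassifier; the sketch below, with an assumed toy dataset and hyperparameters, just swaps the class.

```python
# Extra-Trees: same API as a Random Forest, but splits use random thresholds
# instead of searching for the best one (illustrative sketch).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

extra_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
extra_clf.fit(X_train, y_train)
print(extra_clf.score(X_test, y_test))
# It is hard to tell in advance which of Extra-Trees or a Random Forest will
# do better on a given task; comparing both with cross-validation is the usual advice.
```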

Feature Importance
• Scikit-Learn measures a feature’s importance by looking at how much
the tree nodes that use that feature reduce impurity on average
(across all trees in the forest).
• More precisely, it is a weighted average, where each node’s weight is
equal to the number of training samples that are associated with it.

Géron, Hands-on Machine Learning
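A short sketch of reading feature_importances_ from a trained Random Forest; the iris dataset is an assumed (and common) choice for this demonstration.

```python
# Impurity-based feature importances of a Random Forest (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# feature_importances_ sums to 1: average impurity reduction per feature,
# weighted by the number of training samples reaching each node, over all trees.
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(f"{name:20s} {score:.3f}")
```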


Boosting
• Boosting (originally called hypothesis boosting) refers to any
Ensemble method that can combine several weak learners into a
strong learner.
• The general idea of most boosting methods is to train predictors
sequentially, each trying to correct its predecessor.

Géron, Hands-on Machine Learning

Adaptive Boosting (AdaBoost)
• One way for a new predictor to correct its predecessor is to pay a bit
more attention to the training instances that the predecessor
underfitted.
• This results in new predictors focusing more and more on the hard
cases.

Géron, Hands-on Machine Learning


• For example, when training an AdaBoost classifier, the algorithm first trains a base classifier (such as a Decision Tree) and uses it to make predictions on the training set.
• The algorithm then increases the relative weight of misclassified
training instances.
• Then it trains a second classifier, using the updated weights, and
again makes predictions on the training set, updates the instance
weights, and so on.
• There is one important drawback to this sequential learning
technique:
• it cannot be parallelized (or only partially), since each predictor can only be
trained after the previous predictor has been trained and evaluated. As a
result, it does not scale as well as bagging or pasting.

Géron, Hands-on Machine Learning
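A minimal sketch of this procedure with scikit-learn's AdaBoostClassifier; the decision-stump base estimator, the dataset, and the hyperparameter values are assumptions.

```python
# AdaBoost with a shallow Decision Tree ("decision stump") as the weak learner
# (illustrative sketch).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak learner: a single-split stump
    n_estimators=200,                     # predictors are trained one after another
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)
print(ada_clf.score(X_test, y_test))
```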

AdaBoost in detail
(0) Each instance weight is initialized to $w^{(i)} = \frac{1}{m}$, where $m$ is the number of training instances.
(1) Weighted error rate of the $j$-th predictor:
$$r_j = \frac{\sum_{\substack{i=1 \\ \hat{y}_j^{(i)} \neq y^{(i)}}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}$$
(2) Predictor weight:
$$\alpha_j = \eta \log\frac{1 - r_j}{r_j}$$
$\eta$ is the learning rate hyperparameter (defaults to 1). The more accurate the predictor is, the higher its weight $\alpha_j$ will be.
If the predictor is just guessing randomly, its weight will be close to zero.
If it is less accurate than random guessing (worse than chance), its weight will be negative.


(3) Instance weight update rule, for $i = 1, 2, \dots, m$:
$$w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)} \end{cases}$$
Then all the instance weights are normalized (i.e., divided by $\sum_{i=1}^{m} w^{(i)}$).
(4) Finally, a new predictor is trained using the updated weights, and the whole process is repeated.
(5) The algorithm stops when the desired number of predictors is reached, or when a perfect
predictor is found.
(6) AdaBoost predictions:
$$\hat{y}(\mathbf{x}) = \underset{k}{\operatorname{argmax}} \sum_{\substack{j=1 \\ \hat{y}_j(\mathbf{x}) = k}}^{N} \alpha_j$$
where $N$ is the number of predictors.
Scikit-Learn uses a multiclass version of AdaBoost called SAMME (which stands for Stagewise
Additive Modeling using a Multiclass Exponential loss function). When there are just two classes,
SAMME is equivalent to AdaBoost.

Géron, Hands-on Machine Learning
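To make the formulas above concrete, here is a rough NumPy transcription of the boosting loop and the final weighted vote for a binary problem with decision stumps as weak learners. It is a sketch of the equations, not Scikit-Learn's actual implementation, and it assumes $0 < r_j < 1$ at every round.

```python
# Rough NumPy transcription of the AdaBoost steps above (binary case,
# decision stumps as weak learners). A sketch, not sklearn's implementation.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X)
eta = 1.0                          # learning rate
w = np.full(m, 1 / m)              # (0) initial instance weights w^(i) = 1/m
alphas, stumps = [], []

for j in range(50):                                   # N = 50 predictors
    stump = DecisionTreeClassifier(max_depth=1, random_state=j)
    stump.fit(X, y, sample_weight=w)
    wrong = stump.predict(X) != y
    r = w[wrong].sum() / w.sum()                      # (1) weighted error rate r_j
    alpha = eta * np.log((1 - r) / r)                 # (2) predictor weight alpha_j (assumes 0 < r < 1)
    w[wrong] *= np.exp(alpha)                         # (3) boost weights of misclassified instances
    w /= w.sum()                                      #     then normalize
    alphas.append(alpha)
    stumps.append(stump)

# (6) prediction: the class with the largest sum of alpha_j over the predictors voting for it
def predict(X_new):
    votes = np.zeros((len(X_new), 2))                 # two classes: 0 and 1
    for alpha, stump in zip(alphas, stumps):
        votes[np.arange(len(X_new)), stump.predict(X_new)] += alpha
    return votes.argmax(axis=1)

print((predict(X) == y).mean())                       # training accuracy of the ensemble
```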

Gradient Boosting
• Just like AdaBoost, Gradient Boosting works by sequentially adding
predictors to an ensemble, each one correcting its predecessor.
• However, Gradient Boosting tries to fit the new predictor to the
residual errors made by the previous predictor.

Géron, Hands-on Machine Learning


Example: Gradient Boosting
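A minimal sketch in the spirit of this example, assuming a toy 1-D quadratic dataset: three regression trees are trained sequentially, each fit to the residuals of the previous ones, and then the same ensemble is built with scikit-learn's GradientBoostingRegressor.

```python
# Gradient boosting "by hand": each new tree fits the residual errors of the
# ensemble so far (illustrative sketch; data and depths are arbitrary).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)    # noisy quadratic target

tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
y2 = y - tree1.predict(X)                             # residuals of tree 1
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y2)
y3 = y2 - tree2.predict(X)                            # residuals of trees 1 + 2
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y3)

X_new = np.array([[0.4]])
print(sum(tree.predict(X_new) for tree in (tree1, tree2, tree3)))

# The same idea packaged in scikit-learn; with learning_rate=1.0 and 3 trees
# this produces essentially the same ensemble as the manual version above.
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=42)
gbrt.fit(X, y)
print(gbrt.predict(X_new))
```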


Stacking
• The last Ensemble method we will
discuss in this chapter is called
stacking (short for stacked
generalization).
• The idea: train a model to perform the aggregation, instead of using a trivial function such as hard voting.
• To train the blender (or meta learner), a common approach is to use a hold-out set.

Géron, Hands-on Machine Learning

• First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer.

Géron, Hands-on Machine Learning


• Next, the first layer’s predictors are used to make predictions on the second (held-out) subset.
• This ensures that the predictions are
“clean,” since the predictors never saw
these instances during training.
• For each instance in the hold-out set, there are three predicted values (one per first-layer predictor).
• We can create a new training set using
these predicted values as input features,
and keeping the target values.
• The blender is trained on this new training
set, so it learns to predict the target value,
given the first layer’s predictions.

Géron, Hands-on Machine Learning
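A minimal sketch of this hold-out approach, using three arbitrary first-layer predictors and a logistic-regression blender on an assumed toy dataset. Scikit-learn also provides a StackingClassifier that automates a similar scheme using cross-validated predictions instead of a single hold-out set.

```python
# Stacking with a hold-out set: first-layer predictors are trained on one
# subset; their predictions on the held-out subset become the features for
# the blender (illustrative sketch; models and split sizes are arbitrary).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=1000, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Subset 1 trains the first layer; the hold-out subset yields the "clean"
# predictions used to train the blender.
X_sub1, X_hold, y_sub1, y_hold = train_test_split(X_train, y_train, random_state=42)

first_layer = [
    RandomForestClassifier(random_state=42),
    ExtraTreesClassifier(random_state=42),
    SVC(probability=True, random_state=42),
]
for clf in first_layer:
    clf.fit(X_sub1, y_sub1)

# One predicted value per predictor and hold-out instance -> new training set
X_blend = np.column_stack([clf.predict(X_hold) for clf in first_layer])
blender = LogisticRegression(random_state=42).fit(X_blend, y_hold)

# Inference: run the first layer, then feed its predictions to the blender
X_test_blend = np.column_stack([clf.predict(X_test) for clf in first_layer])
print(blender.score(X_test_blend, y_test))
```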

Split the training set into three subsets:
• The first one is used to train the first layer,
• The second one is used to create the training
set used to train the second layer (using
predictions made by the predictors of the
first layer), and
• The third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer).
Once this is done, we can make a prediction for a new instance by going through each layer sequentially.

Géron, Hands-on Machine Learning

