Chapter 7: Ensemble

Ensemble methods like bagging, boosting, and stacking combine multiple machine learning models to improve performance. Bagging trains each model on randomly sampled subsets of the training data and averages the results. Boosting builds models sequentially by focusing on instances previous models misclassified. Stacking trains a meta-model to aggregate the predictions of the base models and further reduce error. These techniques can significantly improve accuracy and reduce variance compared to single models.

Use **soft** voting instead of hard voting when all the classifiers can estimate class probabilities (predict_proba) => the ensemble averages the predicted probabilities over all the individual classifiers, giving more weight to highly confident votes.
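For example (a minimal sketch; X_train and y_train are assumed to exist):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

voting_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier()),
                ("svc", SVC(probability=True))],  # SVC needs probability=True for predict_proba
    voting="soft")  # average class probabilities instead of counting hard votes
voting_clf.fit(X_train, y_train)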

When sampling is performed with replacement, this method is called bagging (short for bootstrap
aggregating).

=> Allows each training instance to be sampled multiple times for the same predictor.

When sampling is performed without replacement, it is called pasting.

=> Both bagging and pasting scale very well because the predictors can be trained in parallel (on different CPU cores or servers).

Bagging ends up with slightly higher bias than pasting (the extra diversity from bootstrapping means the predictors are less correlated), but lower variance.

oob (out-of-bag) instances are the training instances not sampled for a given predictor, around a third of the instances on average => they can serve as a free validation set (set oob_score=True).
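For example (a sketch; assumes X_train and y_train):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, oob_score=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)  # accuracy estimated on the out-of-bag instances only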

Random patches and random subspaces

Controlled by max_features & bootstrap_features (they work like max_samples & bootstrap, but for feature sampling)

Particularly useful for high-dimensional inputs such as images

Sampling both training instances and features is called the Random Patches method. Keeping all
training instances (by setting bootstrap=False and max_samples=1.0) but sampling features (by setting
bootstrap_features to True and/or max_features to a value smaller than 1.0) is called the Random
Subspaces method.

Ensembles on Random Patches

Ensembles on Random Subspaces

Sampling features => more diversity => more bias but lower variance
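For example, a Random Subspaces configuration might look like this (max_features=0.5 is just an example value):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=False, max_samples=1.0,            # keep all training instances
    bootstrap_features=True, max_features=0.5,   # sample features for each predictor
    n_jobs=-1)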

BaggingClassifier equivalent of RandomForestClassifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

Extra-Trees

Use random thresholds for each feature rather than searching for the best possible threshold =>
Extremely Randomized Trees ensemble

=> trades more bias for lower variance => much faster to train (finding the best possible threshold
for each feature at every node is the most time-consuming part of growing a tree)
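For example (a sketch; assumes X_train and y_train), ExtraTreesClassifier has the same API as RandomForestClassifier:

from sklearn.ensemble import ExtraTreesClassifier

extra_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
extra_clf.fit(X_train, y_train)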

Tip: It is hard to tell in advance whether a Random Forest or an Extra-Trees ensemble will do better; the only way to know is to try both and compare them with cross-validation.

Feature importance:

A Random Forest measures the relative importance of each feature

=> how much the tree nodes that use that feature reduce impurity on average (weighted across all trees in the forest)
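For example, a quick sketch on the iris dataset (used here purely for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)  # the importances sum to 1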

Boosting (hypothesis boosting):

Combine several weak learners into a strong learner.

Train predictors sequentially, each trying to correct its predecessor.

The most popular are AdaBoost (Adaptive Boosting) and Gradient Boosting.

AdaBoost:

Pays more attention to the training instances that its predecessor underfitted or misclassified

=> predictors focus more and more on the hard cases.

First train a base classifier

Then increase the relative weight of misclassified instances

Then train a second classifier using the updated weights, and so on.


=> similar to gradient descent, except that instead of tweaking a single predictor's parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

Warning: One drawback => this sequential training cannot be parallelized (each predictor can only be trained after the previous one has been trained and evaluated)
=> does not scale as well as bagging or pasting.

AdaBoost Algo:

Each instance weight w(i) is initially set to 1/m.

A first predictor is trained and its weighted error rate r_j is computed on the training set (the sum of the weights of the misclassified instances divided by the sum of all the weights).

Then calculate the predictor's weight α_j = η log((1 − r_j) / r_j), where η is the learning rate hyperparameter (defaults to 1).

The more accurate the predictor => the higher its weight. If it is just guessing randomly, its weight is close to 0. If it is often wrong (worse than random) => its weight is negative.

Next, AdaBoost updates the instance weights: misclassified instances are boosted, w(i) ← w(i) exp(α_j), while correctly classified instances keep their weight.

Then all weights are normalized (i.e., divided by ∑_{i=1}^{m} w(i)).

Then repeat: a new predictor is trained using the updated weights. The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.

**To predict, **AdaBoost computes the predictions of all the predictors and weights them by their predictor weights α_j. The predicted class is the one that receives the majority of the weighted votes.

Scikit-Learn uses a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function); if the predictors can estimate class probabilities (predict_proba()), it uses a variant called SAMME.R, which generally performs better.

When there are just two classes => SAMME is equivalent to AdaBoost.

A decision stump = a decision tree with max_depth=1 (a single decision node plus two leaf nodes).
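For example, AdaBoost over decision stumps (a sketch; assumes X_train and y_train; learning_rate=0.5 is just an example value, and recent Scikit-Learn releases may only accept algorithm="SAMME"):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stump
    n_estimators=200, algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)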

Gradient Boosting:

Like AdaBoost, but instead of tweaking the instance weights at every iteration, it fits each new predictor to the residual errors left by its predecessor.

Scaling down each tree's contribution with a low learning_rate (more trees are needed, but the ensemble usually generalizes better) is a regularization technique called shrinkage.
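A minimal sketch of the residual-fitting idea (assumes X, y, and some new instances X_new exist):

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)    # residuals of the first tree
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)   # residuals of the first two trees
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# the ensemble's prediction is the sum of the trees' predictions
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))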

Early stopping:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

# measure the validation error after each stage of training
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

# retrain with the optimal number of trees
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

You can also set warm_start=True => Scikit-Learn keeps the existing trees when fit() is called again, allowing incremental training (true early stopping):

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)  # keeps the trees already trained, adds the new ones
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping: validation error went up 5 iterations in a row

The subsample hyperparameter => fraction of the training instances used to train each tree => higher bias but lower variance => speeds up training considerably => this is called Stochastic Gradient Boosting.
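For example (subsample=0.25 is just an illustrative value):

from sklearn.ensemble import GradientBoostingRegressor

stochastic_gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=120, subsample=0.25)  # each tree sees a random 25% of the instances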

Check out the XGBoost library!

Stacking:

stacked generalization

Simple idea => train a model to aggregate the predictions of all the base predictors => the blender / meta learner.

To train a blender => use a hold-out set: the base predictors are trained on one subset, and their predictions on the held-out subset become the blender's training features (sketched below).

=> It is possible to train several different blenders (one using linear regression, another using a random forest)
=> a whole layer of blenders

=> The trick for multiple layers: split the training set into 3 subsets; the first one trains the 1st layer, the second one is used to build the training set for the 2nd layer (from the 1st layer's predictions on it), and the third one builds the training set for the 3rd layer (from the 2nd layer's predictions).
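A minimal sketch of the hold-out approach (not a specific library API; assumes X_train and y_train):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_sub1, X_hold, y_sub1, y_hold = train_test_split(X_train, y_train, test_size=0.5)

# layer 1: base predictors trained on the first subset
base_clfs = [RandomForestClassifier(), SVC(), LogisticRegression()]
for clf in base_clfs:
    clf.fit(X_sub1, y_sub1)

# the base predictors' predictions on the hold-out set become the blender's inputs
X_blend = np.column_stack([clf.predict(X_hold) for clf in base_clfs])
blender = LogisticRegression()  # the blender / meta learner
blender.fit(X_blend, y_hold)

# to predict, feed new instances through the base predictors first
def stacked_predict(X_new):
    features = np.column_stack([clf.predict(X_new) for clf in base_clfs])
    return blender.predict(features)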
