Chapter 7 - Ensemble
**Soft voting** instead of hard voting when all the classifiers can estimate class probabilities (predict_proba) => averages the predicted class probabilities over all the individual classifiers, so more confident votes carry more weight; it often performs better than hard voting.
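A minimal sketch of soft voting with VotingClassifier (X_train and y_train are assumed here, they are not defined in these notes):
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

voting_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier()),
                ("svc", SVC(probability=True))],  # SVC needs probability=True for predict_proba
    voting="soft")  # average the predicted class probabilities
voting_clf.fit(X_train, y_train)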
When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating); when sampling is performed without replacement, it is called pasting.
Bagging ends up slightly more biased than pasting, but the extra diversity means the predictors are less correlated, so the ensemble's variance is lower.
OOB (out-of-bag) instances are the training instances that a given predictor never sees during training, around a third of them; they can be used to evaluate the ensemble without a separate validation set.
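A quick sketch of OOB evaluation (X_train / y_train assumed, as above):
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, oob_score=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_  # accuracy estimated on the out-of-bag instances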
Sampling both training instances and features is called the Random Patches method. Keeping all
training instances (by setting bootstrap=False and max_samples=1.0) but sampling features (by setting
bootstrap_features to True and/or max_features to a value smaller than 1.0) is called the Random
Subspaces method.
Sampling features => more diversity => more bias but lower variance
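For example, a Random Subspaces setup could look like this (a sketch; the parameter values are illustrative):
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Subspaces: keep every training instance, sample only the features
subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=False, max_samples=1.0,           # all training instances
    bootstrap_features=True, max_features=0.5,  # random feature subsets
    n_jobs=-1)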
# bagging decision trees that use random splits (imports as above)
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1)
Extra-trees
Random thresholds for each feature rather than searching for the best possible threshold =>
Extremely Randomized Trees ensemble
=> trades more bias for lower variance => much faster to train (finding the best possible threshold for each feature at every node is time-consuming)
Tip: hard to tell in advance whether an Extra-Trees ensemble will do better or worse than a regular Random Forest; the only way to know is to try both and compare with cross-validation.
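A quick comparison sketch (X_train / y_train assumed; hyperparameters are illustrative):
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)

# neither wins by default: compare them on your data
print(cross_val_score(rnd_clf, X_train, y_train, cv=5).mean())
print(cross_val_score(ext_clf, X_train, y_train, cv=5).mean())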
Feature importance:
=> measured as how much the tree nodes that use a feature reduce impurity on average, across all trees in the forest (weighted by how many samples each node covers); exposed via feature_importances_
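For example, on the iris dataset (this mirrors the usual scikit-learn usage):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)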
AdaBoost:
Pays more attention to the training instances that its predecessor underfitted.
Warning: one drawback => training cannot be parallelized (each predictor can only be trained after the previous one has been trained and evaluated)
=> does not scale as well as bagging or pasting.
AdaBoost Algo:
Each instance weight w(i) is initially set to 1/m.
Then the predictor's weight αj is computed from its weighted error rate; η is the learning rate hyperparameter (defaults to 1).
The more accurate the predictor, the higher its weight. If it is just guessing randomly, its weight is close to 0; if it is often wrong (worse than random), its weight is negative.
The weights of the misclassified instances are then boosted (multiplied by exp(αj)) and all the weights are normalized.
Then repeat. The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.
**To predict**, AdaBoost computes the predictions of all the predictors and weights them by their predictor weights αj. The predicted class is the one that receives the majority of weighted votes.
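For reference, a sketch of the standard AdaBoost update rules (notation: m instances, predictor j, N predictors, learning rate η; this is the usual formulation, not copied from these notes):

$$r_j = \frac{\sum_{i=1,\; \hat{y}_j^{(i)} \neq y^{(i)}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}, \qquad \alpha_j = \eta \, \log\frac{1 - r_j}{r_j}$$

$$w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_j) & \text{otherwise} \end{cases}, \quad \text{then normalize so that } \textstyle\sum_i w^{(i)} = 1$$

$$\hat{y}(\mathbf{x}) = \underset{k}{\operatorname{argmax}} \sum_{j=1}^{N} \alpha_j \,\big[\hat{y}_j(\mathbf{x}) = k\big]$$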
Scikit-Learn uses a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function); if the predictors can estimate class probabilities (predict_proba()), it can use the SAMME.R variant, which generally performs better.
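A minimal AdaBoost sketch with decision stumps (X_train / y_train assumed; the algorithm argument is left out because recent scikit-learn releases deprecate the SAMME.R option):
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)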
Gradient Boosting:
Like AdaBoost, but instead of tweaking the instance weights at every iteration, it fits each new predictor to the residual errors made by the previous one.
Scaling down each tree's contribution with a low learning_rate requires more trees but usually generalizes better => this regularization technique is called shrinkage.
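A sketch of the idea with plain decision trees fitted on residuals (X, y, X_new assumed to be NumPy arrays, not defined in these notes):
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

# second tree is trained on the residual errors of the first
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

# third tree is trained on the residuals of the second
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# the ensemble's prediction is the sum of the trees' predictions
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))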
Early stopping:
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y)  # X, y assumed already defined

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

# validation error at each training stage, then keep the best number of trees
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
You can also set warm_start=True => scikit-learn keeps the existing trees when fit() is called again, allowing incremental training:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)  # keeps the trees already trained
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping: validation error went up 5 times in a row
subsample hyperparameter => fraction of the training instances used to train each tree => higher bias, lower variance => speeds up training considerably => Stochastic Gradient Boosting
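For example (a sketch; the 0.25 value is illustrative):
from sklearn.ensemble import GradientBoostingRegressor

# each tree is trained on a random 25% of the training instances
sgb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=120, subsample=0.25)
sgb_reg.fit(X_train, y_train)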
Stacking (stacked generalization):
Simple idea => train a model to aggregate the votes => blender/ meta learner
To train the blender => use a hold-out set: train the first-layer predictors on one subset, have them predict on the held-out subset, and use those predictions as the input features for the blender (sketch below)
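A minimal hold-out blender sketch (X, y assumed; the choice of base predictors and blender is illustrative):
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# keep half of the data as a hold-out set for the blender
X_1, X_hold, y_1, y_hold = train_test_split(X, y, test_size=0.5)

# first layer: base predictors trained on the first subset
layer1 = [RandomForestClassifier(n_estimators=100),
          ExtraTreesClassifier(n_estimators=100)]
for clf in layer1:
    clf.fit(X_1, y_1)

# their predictions on the hold-out set become the blender's training features
X_blend = np.column_stack([clf.predict(X_hold) for clf in layer1])
blender = LogisticRegression()
blender.fit(X_blend, y_hold)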
=> Possible to train several different blenders (one using linear regression, another using random forests) => a whole layer of blenders
=> The trick: split the training set into 3 subsets, the first one to train the 1st layer, the second one (via the 1st layer's predictions) to build the training set for the 2nd layer, the third one (via the 2nd layer's predictions) to build the training set for the 3rd layer
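Scikit-Learn also ships StackingClassifier / StackingRegressor, which implement the same idea but use cross-validated predictions instead of a single hold-out set; a sketch (X_train / y_train assumed):
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack_clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("svc", SVC(probability=True))],
    final_estimator=LogisticRegression(),  # the blender
    cv=5)  # out-of-fold predictions are used to train the blender
stack_clf.fit(X_train, y_train)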