Ensemble Methods
Introduction
• Ensemble learning helps improve machine learning results by
combining several models.
• Ensemble learning combines the predictions from multiple models to
reduce the variance of predictions and reduce generalization error.
• This approach typically produces better predictive performance than
any single model alone.
• Ensemble methods are meta-algorithms that combine several
machine learning techniques into one predictive model in order to
decrease variance (bagging), bias (boosting), or improve predictions
(stacking).
Ensemble Methods
• Models in an ensemble can differ from each other for a variety of
reasons, from the population they are built on to the technique used
to build them.
The differences can be due to:
1. Difference in Population.
2. Difference in Hypothesis.
3. Difference in Modeling Technique.
4. Difference in Initial Seed.
Error in Ensemble Learning (Variance vs. Bias)
The error emerging from any model can be broken down mathematically into three components: bias, variance,
and irreducible error (noise that no model can remove). These components are:
Bias error quantifies how far, on average, the predicted values are from the actual value. A high bias error means
an under-performing model that keeps missing important trends.
Variance, on the other hand, quantifies how much the predictions made for the same observation differ from each
other. A high-variance model will over-fit your training population and perform badly on any observation beyond
the training data. (In the accompanying diagram, assume the red spot is the real value and the blue dots are
predictions.)
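For squared-error loss, this decomposition is commonly written as:
Err(x) = Bias² + Variance + Irreducible Error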
Contd..
• Normally, as you increase the complexity of your model, you will see
a reduction in error due to lower bias. However, this only happens up
to a certain point.
• As you continue to make your model more complex, you end up
over-fitting it, and your model will start suffering from high
variance.
Ensemble Learning Types
Bootstrapping
• Bootstrap refers to random sampling with replacement. Bootstrapping
allows us to better understand the bias and the variance within a
dataset.
• Bootstrapping involves randomly sampling small subsets of data from
the dataset. Because the sampling is done with replacement, the same
example can be selected more than once, and every example in the
dataset has an equal probability of being selected. This method can
help us better estimate statistics such as the mean and standard
deviation of the dataset.
• Let’s assume we have a sample of ‘n’ values (x) and we’d like to get
an estimate of the mean of the sample.
mean(x) = 1/n * sum(x)
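As a minimal sketch of this idea (using NumPy; the sample data and the number of resamples below are purely illustrative), the bootstrap can be used to estimate the mean and its standard error:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # a hypothetical sample of n values

boot_means = []
for _ in range(1000):
    # draw n values from x with replacement: one bootstrap sample
    sample = rng.choice(x, size=len(x), replace=True)
    boot_means.append(sample.mean())

print("sample mean:", x.mean())                      # mean(x) = 1/n * sum(x)
print("bootstrap mean estimate:", np.mean(boot_means))
print("bootstrap standard error:", np.std(boot_means))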
Visual Interpretation of Bootstrapping
Parallel Ensemble Learning (Bagging)
• Bagging is a machine learning ensemble meta-algorithm intended to
improve the stability and accuracy of machine learning algorithms
used for classification and regression. It also reduces variance and
helps to overcome over-fitting.
• Parallel ensemble methods, where the base learners are generated in
parallel.
• Algorithms: Random Forest, Bagged Decision Trees, Extra Trees
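A minimal scikit-learn sketch of these parallel ensembles (the dataset and hyperparameters are illustrative, not part of the original slides):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Bagged Decision Trees": BaggingClassifier(n_estimators=100, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# Each model trains its base learners independently on resampled data
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())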
Sequential Ensemble Learning (Boosting)
• Boosting is a machine learning ensemble meta-algorithm for
principally reducing bias, and also variance, in supervised learning;
it is a family of machine learning algorithms that convert weak
learners into strong ones.
• Sequential ensemble methods, where the base learners are generated
sequentially.
• Examples: AdaBoost, Stochastic Gradient Boosting
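A minimal scikit-learn sketch of these sequential ensembles (dataset and hyperparameters are again illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# AdaBoost re-weights training examples after each weak learner
ada = AdaBoostClassifier(n_estimators=100, random_state=0)

# subsample < 1.0 turns gradient boosting into stochastic gradient boosting
sgb = GradientBoostingClassifier(n_estimators=100, subsample=0.8, random_state=0)

for name, model in [("AdaBoost", ada), ("Stochastic Gradient Boosting", sgb)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())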
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of
classifiers
• Boosting: weighted vote with a collection of classifiers
• Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown tuple X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• Proven to give improved accuracy in prediction
Basic Algorithm
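The pseudocode from the original slide is not reproduced here; the sketch below (a from-scratch illustration using NumPy and scikit-learn decision trees as base learners, with function names of my own choosing) follows the training and voting steps just described:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, random_state=0):
    """Learn k classifiers M1..Mk, each on a bootstrap sample Di of D."""
    rng = np.random.default_rng(random_state)
    d = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, d, size=d)               # sample d tuples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Each Mi votes; the bagged classifier M* returns the majority class.
    Assumes integer class labels 0..C-1."""
    votes = np.array([m.predict(X) for m in models])   # shape (k, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)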
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
AdaBoost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set
Di of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased; otherwise, it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi's
error rate is the sum of the weights of the misclassified tuples:
error(Mi) = Σj wj × err(Xj)
• The weight of classifier Mi's vote is log((1 − error(Mi)) / error(Mi))
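A minimal sketch of one round of this weight update (the function name and interface are my own, purely illustrative):

import numpy as np

def adaboost_round(w, y_true, y_pred):
    """One AdaBoost re-weighting step; w are the current tuple weights (summing to 1).
    Assumes 0 < error(Mi) < 0.5."""
    miss = (y_true != y_pred).astype(float)       # err(Xj): 1 if tuple j is misclassified
    error = np.sum(w * miss)                      # error(Mi): sum of weights of misclassified tuples
    alpha = np.log((1.0 - error) / error)         # weight of classifier Mi's vote
    # decrease the weights of correctly classified tuples, then re-normalize
    w = w * np.where(miss == 1.0, 1.0, error / (1.0 - error))
    w = w / w.sum()
    return w, alpha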
Random Forest Classifier
1. Take a random sample of size N with replacement from the data.
2. Take a random sample without replacement of the predictors.
3. Construct the first CART partition of the data.
4. Repeat Step 2 for each subsequent split until the tree is as large as
desired. Do not prune.
5. Repeat Steps 1–4 a large number of times.
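A hedged mapping of these steps onto scikit-learn's RandomForestClassifier (the CART construction in Steps 3-4 is handled internally; parameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,        # Step 5: repeat the whole procedure a large number of times
    bootstrap=True,          # Step 1: random sample of size N with replacement
    max_features="sqrt",     # Step 2: random sample of predictors at each split
    max_depth=None,          # Step 4: grow each tree as large as desired, no pruning
    random_state=0,
)
# rf.fit(X, y) would then grow the forest on a feature matrix X and labels y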
Example
• Each decision tree in the ensemble is built upon a random bootstrap sample of
the original data, which contains positive (green labels) and negative (red labels)
examples.
• Class prediction for new instances using a random forest model is based on a
majority voting procedure among all individual trees.
• Bagging features and samples simultaneously: At each tree split, a
random sample of m features is drawn, and only those m features are
considered for splitting.
• Typically m = √d or log2(d), where d is the total number of features
• For each tree grown on a bootstrap sample, the error rate for the
observations left out of that bootstrap sample is monitored. This is
called the "out-of-bag" (OOB) error rate.
• Random forests try to improve on bagging by "de-correlating" the
trees; each tree has the same expectation.
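A short sketch of monitoring the out-of-bag error with scikit-learn (dataset and parameters are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True evaluates each tree on the observations left out of its bootstrap sample
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)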
Advantages
• Random forest is considered a highly accurate and robust method
because of the number of decision trees participating in the process.
• It is much less prone to over-fitting, mainly because it averages the
predictions of many de-correlated trees, which reduces variance.
• The algorithm can be used for both classification and regression problems.
• Random forests can also handle missing values. There are two common
ways to do this: using median values to replace missing continuous
variables, and computing a proximity-weighted average of the missing values.
• You can get the relative feature importance, which helps in selecting the
most contributing features for the classifier.
Disadvantages
• Random forest is slow at generating predictions because it has
multiple decision trees: whenever it makes a prediction, all the trees
in the forest have to make a prediction for the same input and then
vote on it. This whole process is time-consuming.
• The model is harder to interpret than a single decision tree, where
you can easily explain a decision by following the path in the tree.
Finding Important Features
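A minimal sketch of extracting the relative feature importances mentioned in the advantages above (using scikit-learn and pandas; the dataset is illustrative):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(data.data, data.target)

# feature_importances_ holds each feature's relative contribution (values sum to 1)
importances = pd.Series(rf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))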