
Ensemble Methods

Introduction
• Ensemble learning helps improve machine learning results by
combining several models.
• Ensemble learning combines the predictions from multiple models to
reduce the variance of predictions and reduce generalization error.
• This approach generally yields better predictive performance than any single constituent model.
• Ensemble methods are meta-algorithms that combine several
machine learning techniques into one predictive model in order to
decrease variance (bagging), bias (boosting), or improve predictions
(stacking).
Ensemble Methods
• Models can differ from one another for a variety of reasons, ranging from the population they are trained on to the technique used to build them.
The differences can be due to:
1. Difference in Population.
2. Difference in Hypothesis.
3. Difference in Modeling Technique.
4. Difference in Initial Seed.
Error in Ensemble Learning (Variance vs. Bias)
The error of any model can be decomposed mathematically into three components:

Total Error = Bias² + Variance + Irreducible Error

Bias error quantifies how far, on average, the predicted values are from the actual values. A high bias error means we have an under-performing model that keeps missing important trends.
Variance, on the other hand, quantifies how much the predictions made on the same observation differ from each other. A high-variance model will over-fit the training population and perform badly on any observation beyond training. (In the accompanying diagram, assume the red spot is the real value and the blue dots are predictions.)
Contd..
• Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens up to a point.
• As you continue to make your model more complex, you end up over-fitting it, and the model starts to suffer from high variance.

Ensemble learning Types
Bootstrapping
• Bootstrap refers to random sampling with replacement. Bootstrapping allows us to better understand the bias and the variance within a dataset.
• Bootstrapping involves repeatedly drawing small random samples from the dataset with replacement, so every example has an equal probability of being selected on each draw. This method helps to estimate quantities such as the mean and standard deviation of the dataset.
• Suppose we have a sample of ‘n’ values (x) and we’d like to estimate the mean of the sample:
mean(x) = 1/n * sum(x)
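As a concrete illustration, here is a minimal sketch (assuming NumPy and a synthetic data array; the sample sizes are arbitrary) of estimating the mean and its uncertainty with bootstrap resampling:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=100)      # synthetic sample of n = 100 values

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # draw a bootstrap sample: the same size as x, sampled with replacement
    sample = rng.choice(x, size=len(x), replace=True)
    boot_means[i] = sample.mean()

print("sample mean:", x.mean())
print("bootstrap estimate of the mean:", boot_means.mean())
print("bootstrap standard error of the mean:", boot_means.std(ddof=1))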
Visual Interpretation of Bootstrapping
Parallel Ensemble Learning (Bagging)
• Bagging is a machine learning ensemble meta-algorithm intended to improve the stability and accuracy of machine learning algorithms used for classification and regression. It also reduces variance and helps to overcome over-fitting.
• Parallel ensemble method: the base learners are generated in parallel.
• Algorithms: Random Forest, Bagged Decision Trees, Extra Trees
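A minimal sketch of bagging with scikit-learn's BaggingClassifier (the synthetic dataset and parameter values are illustrative assumptions; by default each base learner is a decision tree):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 base learners trained in parallel, each on a bootstrap sample of the training data
bag = BaggingClassifier(n_estimators=100, bootstrap=True, n_jobs=-1, random_state=0)
bag.fit(X_train, y_train)
print("bagging test accuracy:", bag.score(X_test, y_test))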
Sequential Ensemble Learning (Boosting)
• Boosting is a machine learning ensemble meta-algorithm primarily intended to reduce bias, and also variance, in supervised learning; it refers to a family of algorithms that convert weak learners into strong ones.
• Sequential ensemble method: the base learners are generated sequentially.
• Examples: AdaBoost, Stochastic Gradient Boosting
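A minimal boosting sketch using scikit-learn's GradientBoostingClassifier (setting subsample below 1.0 gives stochastic gradient boosting; the dataset and parameter values are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added sequentially; each new tree focuses on the errors of the current ensemble
gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, subsample=0.8, random_state=0)
gbc.fit(X_train, y_train)
print("boosting test accuracy:", gbc.score(X_test, y_test))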
Ensemble Methods: Increasing the Accuracy

• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of
classifiers
• Boosting: weighted vote with a collection of classifiers
• Ensemble: combining a set of heterogeneous classifiers

Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown tuple X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• Has been shown to improve accuracy for prediction as well

Basic Algorithm
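A minimal Python sketch of the basic bagging procedure described above (the learn() helper and the assumption that each fitted model is callable on a tuple are illustrative, not part of any particular library):

import random
from collections import Counter

def bagging_train(D, k, learn):
    """Learn k models; D is a list of training tuples and learn(Di) returns a fitted model Mi."""
    d = len(D)
    models = []
    for _ in range(k):
        Di = [random.choice(D) for _ in range(d)]   # bootstrap: sample d tuples with replacement from D
        models.append(learn(Di))                    # learn model Mi on training set Di
    return models

def bagging_classify(models, x):
    """Classify an unknown tuple x: each model votes and the majority class wins."""
    votes = [M(x) for M in models]                  # assume each fitted model is callable: M(x) -> class label
    return Counter(votes).most_common(1)[0][0]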
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
AdaBoost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set
Di of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased; otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
error(Mi) = sum over j of ( wj * err(Xj) )
• The weight of classifier Mi's vote is
log( (1 - error(Mi)) / error(Mi) )
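A minimal sketch of using AdaBoost via scikit-learn (the dataset and parameters are illustrative assumptions; the library performs the tuple re-weighting and the weighted classifier vote internally):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 50 boosting rounds; each round pays more attention to the tuples the previous classifier misclassified
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", ada.score(X_test, y_test))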
Similarities Between Bagging and Boosting
• Bagging and Boosting are both commonly used methods and share the basic property of being ensemble methods. Their main similarities are:
1. Both are ensemble methods that obtain N learners from 1 base learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or by taking the majority of them, i.e. majority voting).
4. Both are good at reducing variance and provide higher stability.
Bagging vs. Boosting
1. Bagging: The simplest way of combining predictions that belong to the same type.
   Boosting: A way of combining predictions that belong to different types.
2. Bagging: Aims to decrease variance, not bias.
   Boosting: Aims to decrease bias, not variance.
3. Bagging: Each model receives equal weight.
   Boosting: Models are weighted according to their performance.
4. Bagging: Each model is built independently.
   Boosting: New models are influenced by the performance of previously built models.
5. Bagging: Different training data subsets are selected from the entire training dataset using row sampling with replacement.
   Boosting: Every new subset contains the elements that were misclassified by previous models.
6. Bagging: Tries to solve the over-fitting problem.
   Boosting: Tries to reduce bias.
7. Bagging: If the classifier is unstable (high variance), then apply bagging.
   Boosting: If the classifier is stable and simple (high bias), then apply boosting.
8. Bagging: Base classifiers are trained in parallel.
   Boosting: Base classifiers are trained sequentially.
9. Bagging: Example: the Random Forest model uses bagging.
   Boosting: Example: AdaBoost uses boosting techniques.
Random Forest
• Random Forest (Breiman, 2001):
• Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
• During classification, each tree votes and the most popular class is
returned
• Two Methods to construct Random Forest:
• Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
• Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
• Comparable in accuracy to AdaBoost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each
split, and faster than bagging or boosting

Classification of Class-Imbalanced Data Sets

• Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and
equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification:
• Oversampling: re-sampling of data from positive class
• Under-sampling: randomly eliminate tuples from negative
class
• Threshold-moving: moves the decision threshold, t, so that the rare class tuples are easier to classify and hence there is less chance of costly false negative errors (see the sketch after this list)
• Ensemble techniques: Ensemble multiple classifiers introduced
above
• Still difficult for class imbalance problem on multiclass tasks
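As an illustration of threshold-moving, here is a minimal sketch (assuming a scikit-learn classifier with predict_proba; the lowered threshold of 0.2 is an arbitrary illustrative value):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced data: roughly 5% positive (rare) class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_pos = clf.predict_proba(X_test)[:, 1]

# Moving the decision threshold t from 0.5 down to 0.2 makes the rare class easier to predict
default_pred = (proba_pos >= 0.5).astype(int)
moved_pred = (proba_pos >= 0.2).astype(int)
print("positives predicted at t = 0.5:", default_pred.sum())
print("positives predicted at t = 0.2:", moved_pred.sum())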

Random Forest Classifier
1. Take a random sample of size N with replacement from the data.
2. Take a random sample without replacement of the predictors.
3. Construct the first CART partition of the data.
4. Repeat Step 2 for each subsequent split until the tree is as large as
desired. Do not prune.
5. Repeat Steps 1–4 a large number of times.
Example

• Each decision tree in the ensemble is built upon a random bootstrap sample of
the original data, which contains positive (green labels) and negative (red labels)
examples.
• Class prediction for new instances using a random forest model is based on a
majority voting procedure among all individual trees.
• Bagging features and samples simultaneously: At each tree split, a
random sample of m features is drawn, and only those m features are
considered for splitting.
• Typically m = √d or log₂(d), where d is the number of features
• For each tree grown on a bootstrap sample, the error rate on the observations left out of that bootstrap sample is monitored. This is called the “out-of-bag” (OOB) error rate.
• Random forests try to improve on bagging by “de-correlating” the trees.
• Each tree has the same expectation.
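A minimal sketch of these ideas with scikit-learn's RandomForestClassifier (the synthetic dataset and parameter values are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# bootstrap=True: each tree is grown on a bootstrap sample of the rows
# max_features="sqrt": m = sqrt(d) features are considered at each split
# oob_score=True: accuracy is estimated on the out-of-bag observations of each tree
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", bootstrap=True,
                            oob_score=True, random_state=0)
rf.fit(X, y)
print("out-of-bag score:", rf.oob_score_)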
Advantages
• Random forests is considered a highly accurate and robust method because of the number of decision trees participating in the process.
• It is largely resistant to the overfitting problem, mainly because it takes the average of all the predictions, which cancels out the individual biases.
• The algorithm can be used in both classification and regression problems.
• Random forests can also handle missing values. There are two ways to
handle these: using median values to replace continuous variables, and
computing the proximity-weighted average of missing values.
• You can get the relative feature importance, which helps in selecting the
most contributing features for the classifier.
Disadvantages
• Random forests is slow in generating predictions because it has
multiple decision trees. Whenever it makes a prediction, all the trees
in the forest have to make a prediction for the same given input and
then perform voting on it. This whole process is time-consuming.
• The model is difficult to interpret compared to a decision tree, where
you can easily make a decision by following the path in the tree.
Finding important features

• Random forests also offers a good feature selection indicator.


• Scikit-learn exposes an attribute on the fitted model (feature_importances_) that shows the relative importance or contribution of each feature to the prediction. The relevance score of each feature is computed automatically during the training phase and then scaled so that the sum of all scores is 1.
• This score will help you choose the most important features and drop the least
important ones for model building.
• Random forest uses gini importance or mean decrease in impurity (MDI) to
calculate the importance of each feature.
• Gini importance is also known as the total decrease in node impurity: it measures how much each feature reduces the weighted impurity over all the splits in which it is used, averaged across the trees. The larger the decrease, the more significant the variable. Here, the mean decrease is a significant parameter for variable selection, and the Gini index can describe the overall explanatory power of the variables.
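A minimal sketch of reading these scores from a fitted scikit-learn random forest (the Iris dataset is used only as an illustration):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# feature_importances_ holds the Gini importance (MDI) of each feature; the scores sum to 1
importances = pd.Series(rf.feature_importances_, index=data.feature_names).sort_values(ascending=False)
print(importances)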
Random Forests vs Decision Trees

• Random forests is a set of multiple decision trees.


• Deep decision trees may suffer from overfitting, but random forests
prevents overfitting by creating trees on random subsets.
• Decision trees are computationally faster.
• Random forests is difficult to interpret, while a decision tree is easily
interpretable and can be converted to rules.
Which is the best, Bagging or Boosting?
• Bagging and Boosting decrease the variance of your single estimate as
they combine several estimates from different models. So the result
may be a model with higher stability.
• If the problem is that the single model gets a very low performance,
Bagging will rarely get a better bias. However, Boosting could
generate a combined model with lower errors as it optimises the
advantages and reduces pitfalls of the single model.
• By contrast, if the difficulty of the single model is over-fitting, then
Bagging is the best option. Boosting for its part doesn’t help to avoid
over-fitting; in fact, this technique is faced with this problem itself.
For this reason, Bagging is effective more often than Boosting.
Stacking & Blending
Stacking is a way of combining multiple models that introduces the concept of a meta-learner. It is less widely used than bagging and boosting. Unlike bagging and boosting, stacking may be (and normally is) used to combine models of different types. The procedure is as follows:
• Split the training set into two disjoint sets.
• Train several base learners on the first part.
• Test the base learners on the second part.
• Using the base learners' predictions on the second part as inputs, and the correct responses as outputs, train a higher-level (meta) learner.
• Example: Voting Classifier
Blending is a technique where we do a weighted averaging of the final results.
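A minimal sketch of stacking with scikit-learn's StackingClassifier (the choice of base learners and meta-learner below is an illustrative assumption):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous base learners; their predictions become the inputs of the higher-level (meta) learner
base_learners = [("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                 ("svc", SVC(probability=True, random_state=0))]
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print("stacking test accuracy:", stack.score(X_test, y_test))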
