Ensemble Method
An ensemble method combines the predictions of several models that are built separately, so that the combined model is more accurate and robust than any single one.
There are two main types of reducible error associated with learning methods:
1. Bias error: Also known as algorithm bias or AI bias, this is the error that occurs when an
algorithm produces results that are systematically skewed because of erroneous assumptions
made in the machine learning (ML) process.
Bias arises from wrong assumptions about the data, such as assuming the data is linear when,
in reality, it follows a complex function.
In short, bias is the inability of the model to capture the true relationship, which shows up
as a systematic difference between the model's predicted values and the actual values.
Let Y be the true value of a parameter, and let Y' be an estimator of Y based on a sample of
data. Then the bias of the estimator Y' is given by:
Bias(Y') = E(Y') - Y
where E(Y') is the expected value of the estimator Y'. Bias measures how well the model
fits the data.
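A minimal simulation sketch of this definition (NumPy assumed; the estimator and the distribution are chosen purely for illustration):

import numpy as np

# Illustrative only: estimate Bias(Y') = E(Y') - Y by repeated sampling.
# Here Y is the true variance of a normal distribution and Y' is the
# maximum-likelihood variance estimator np.var (which divides by n).
rng = np.random.default_rng(0)
true_variance = 4.0          # Y: variance of N(0, 2^2)
n, trials = 20, 100_000

estimates = np.array([np.var(rng.normal(0, 2, n)) for _ in range(trials)])

bias = estimates.mean() - true_variance   # E(Y') - Y
print(f"E(Y') ~ {estimates.mean():.3f}, bias ~ {bias:.3f}")
# The bias comes out close to -true_variance / n = -0.2, matching theory.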
Low Bias: Low bias value means fewer assumptions are taken to build the target
function. In this case, the model will closely match the training dataset.
High Bias: High bias value means more assumptions are taken to build the target
function. In this case, the model will not match the training dataset closely.
A high-bias model cannot capture the trend in the dataset; it is an underfitting model with a
high error rate, usually because the algorithm is too simple.
For example, a linear regression model may have high bias if the data has a non-linear
relationship.
Ways to reduce high bias:
Use a more complex model: One of the main causes of high bias is an overly simplified model
that cannot capture the complexity of the data. In such cases we can make the model more
complex, for example by increasing the number of hidden layers in a deep neural network, or by
using a more expressive model such as polynomial regression for non-linear datasets, a CNN for
image processing, or an RNN for sequence learning (see the sketch after this list).
Increase the number of features: Adding more features to the training dataset increases the
complexity of the model and improves its ability to capture the underlying patterns in the
data.
Reduce regularization of the model: Regularization techniques such as L1 or L2
regularization help prevent overfitting and improve the generalization ability of the model.
If the model has high bias, reducing the strength of the regularization, or removing it
altogether, can improve its performance.
Increase the size of the training data: A larger training set provides the model with more
examples to learn from, which can help reduce bias.
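As an illustration of the first remedy above, the following sketch (scikit-learn assumed; the synthetic data and polynomial degree are made up for illustration) shows a plain linear model underfitting a quadratic relationship, while a slightly more complex polynomial model reduces that bias:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Synthetic non-linear data: y = x^2 + noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, size=200)

# Plain linear regression: too simple for this data, so it has high bias
linear = LinearRegression().fit(X, y)
print("linear R^2:", linear.score(X, y))

# More complex model: polynomial features reduce the bias
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("poly   R^2:", poly.score(X, y))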
2. Variance:
Variance is the measure of spread in data around its mean position.
In machine learning, variance is the amount by which the performance of a predictive model
changes when it is trained on different subsets of the training data.
More specifically, variance measures how sensitive the model is to the particular subset of
the training data it was fitted on, i.e. how much its estimate changes when it is trained on a
new subset of the training dataset.
Low variance: Low variance means that the model is less sensitive to
changes in the training data and produces consistent estimates of
the target function across different subsets of data from the
same distribution. When low variance is paired with high bias, the model underfits and fails
to generalize on both training and test data.
High variance: High variance means that the model is very sensitive to
changes in the training data, so its estimate of the target function changes significantly
when it is trained on different subsets of data from the same distribution. This is the
overfitting case, where the model performs well on the training data but poorly on new,
unseen test data: it fits the training data so closely that it fails to generalize to new
data.
Ways to reduce high variance:
Cross-validation: By splitting the data into training and testing sets multiple times,
cross-validation can help identify whether a model is overfitting or underfitting, and it can
be used to tune hyperparameters to reduce variance (see the sketch after this list).
Feature selection: Choosing only the relevant features decreases the model's complexity and
can reduce the variance error.
Regularization: We can use L1 or L2 regularization to reduce variance in machine
learning models.
Simplifying the model: Reducing the complexity of the model, such as decreasing
the number of parameters or layers in a neural network, can also help reduce variance
and improve generalization performance.
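A small sketch of the cross-validation and regularization remedies above (scikit-learn assumed; the synthetic data and the alpha value are illustrative): on a small, noisy dataset with many features, an ordinary least-squares fit tends to show a larger spread of fold scores than a ridge-regularized fit:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Small, noisy dataset with many features: plain least squares
# tends to have high variance across cross-validation folds.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = X[:, 0] + rng.normal(0, 1.0, size=60)

for name, model in [("OLS", LinearRegression()), ("Ridge(alpha=10)", Ridge(alpha=10.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:16s} mean R^2 = {scores.mean():.2f}, std = {scores.std():.2f}")
# Regularization typically gives more stable (lower-variance) fold scores.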
High Bias, Low Variance: A model with high bias and low variance is said to be
underfitting.
High Variance, Low Bias: A model with high variance and low bias is said to be
overfitting.
High-Bias, High-Variance: A model has both high bias and high variance, which
means that the model is not able to capture the underlying patterns in the data (high
bias) and is also too sensitive to changes in the training data (high variance). As a
result, the model will produce inconsistent and inaccurate predictions on
average.
Low Bias, Low Variance: A model that has low bias and low variance is able to capture the
underlying patterns in the data (low bias) and is not too sensitive to changes in the
training data (low variance). This is the ideal scenario for a machine learning model, as it
generalizes well to new, unseen data and produces consistent and accurate predictions. In
practice, however, this ideal is rarely achieved exactly; the goal is to get as close to it as
possible.
Such a model is likely to be just complex enough to capture the structure of the
data, but not so complex that it overfits the training data. This can happen when the
model has been carefully tuned to achieve a good balance between bias and
variance, by adjusting the hyperparameters and selecting an appropriate model
architecture.
If the algorithm is too simple (a hypothesis with a linear equation), it may end up in a
high-bias, low-variance condition and thus be error-prone.
If the algorithm is too complex (a hypothesis with a high-degree equation), it may end up
with high variance and low bias.
In the latter condition, the model will not perform well on new entries. There is a balance
point between these two conditions, known as the Trade-off or Bias-Variance Trade-off.
This trade-off in complexity is why there is a trade-off between bias and variance: an
algorithm cannot be more complex and less complex at the same time. Graphically, the best
point is the level of model complexity at which the total error (bias plus variance) is
minimized.
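A rough way to see this trade-off empirically (scikit-learn assumed; the data and polynomial degrees are chosen for illustration) is to sweep model complexity and compare training and test error:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(mean_squared_error(y_train, model.predict(X_train)), 3),
          round(mean_squared_error(y_test, model.predict(X_test)), 3))
# Degree 1 underfits (high bias: both errors stay high); a very high degree
# can start to overfit (higher variance); a moderate degree balances the two.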
Ensemble model
The basic idea is to learn a set of classifiers (experts) and to allow them to vote.
Advantage: Improvement in predictive accuracy.
Disadvantage: It is difficult to understand an ensemble of classifiers.
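A minimal voting-ensemble sketch (using scikit-learn's VotingClassifier; the choice of dataset and base classifiers is illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Three different "experts"; the ensemble takes a majority (hard) vote.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier()),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
print("cv accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())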
Random Forest:
Random Forest is an extension of bagging. Each classifier in the ensemble is a decision
tree classifier and is generated using a random selection of attributes at each node to
determine the split. During classification, each tree votes and the most popular class is
returned.
Implementation steps of Random Forest –
1. Multiple subsets are created from the original data set, selecting observations with
replacement.
2. A subset of features is selected randomly and whichever feature gives the best split is
used to split the node iteratively.
3. Each tree is grown to its largest extent.
4. The above steps are repeated, and the final prediction is given by aggregating the
predictions from the n trees (see the sketch below).
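A minimal sketch of these steps using scikit-learn's RandomForestClassifier (the dataset and hyperparameters are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators trees, each trained on a bootstrap sample; max_features
# controls the random subset of features considered at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, forest.predict(X_test)))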
Ensemble methods
1. Averaging method: It is mainly used for regression problems. The method consists of
building multiple models independently and returning the average of the predictions of all the
models. In general, the combined output is better than an individual output because the
variance is reduced.
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
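The two lines above are only the tail of an averaging example; a minimal self-contained sketch (the dataset and base models are assumptions for illustration) that produces pred_final as the simple mean of the base-model predictions might look like:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train each base model independently.
models = [LinearRegression(), DecisionTreeRegressor(random_state=0),
          KNeighborsRegressor()]
predictions = [m.fit(X_train, y_train).predict(X_test) for m in models]

# pred_final is the simple average of the base-model predictions.
pred_final = np.mean(predictions, axis=0)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))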
Boosting: This is a sequential method; it aims to prevent a wrong base model from affecting
the final output. Instead of combining independently built base models, the method builds each
new model so that it depends on the previous one: the new model tries to correct the errors
made by its predecessor. Each of these models is called a weak learner. The final model (the
strong learner) is formed by taking the weighted mean of all the weak learners.
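A minimal boosting sketch using scikit-learn's AdaBoostClassifier (the dataset and number of estimators are illustrative; by default each weak learner is a depth-1 decision tree):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learners are added sequentially; each new one focuses on the samples
# the previous ones got wrong, and the final prediction is their weighted vote.
booster = AdaBoostClassifier(n_estimators=100, random_state=0)
booster.fit(X_train, y_train)
print("test accuracy:", booster.score(X_test, y_test))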
Stacking: The base models are trained on the complete dataset, and a meta-model is then
trained on the features returned (as output) by the base models. The base models in
stacking are typically different from one another. The meta-model learns how to combine the
base-model outputs to achieve the best accuracy.
Stacking differs from the basic ensembling methods in that it has first-level and
second-level models. Stacking features are first extracted by training all the first-level
models on the dataset. A second-level model is then trained on the stacking features from the
training data, and this model predicts the final output from the stacking features of the test
data (a sketch follows the algorithm below).
Algorithm:
1. Split the train dataset into n parts.
2. A base model (say, linear regression) is fitted on n-1 parts and predictions
are made for the nth part. This is done for each of the n parts of the
train set.
3. The base model is then fitted on the whole train dataset.
4. This model is used to predict on the test dataset.
5. Steps 2 to 4 are repeated for another base model, which results in
another set of predictions for the train and test datasets.
6. The predictions on the train dataset are used as features to build the new
(second-level) model.
7. This final model is used to make the predictions on the test dataset.
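A minimal sketch of this algorithm using scikit-learn's StackingClassifier, which carries out the out-of-fold prediction of steps 1-2 internally via cv (the dataset, base models and meta-model are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First-level (base) models produce out-of-fold predictions (cv=5), which
# become the stacking features for the second-level meta-model.
stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))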
6. Blending: It is similar to the stacking method explained above, but rather than using the
whole training dataset to generate the meta-features, a separate validation dataset is held
out and its predictions are used (a sketch follows the algorithm below).
Algorithm:
1. Split the dataset into train, validation and test sets.
2. Fit all the base models on the train dataset.
3. Make predictions with the base models on the validation and test datasets.
4. These predictions are used as features (meta-features) to build a second-level model,
trained on the validation set.
5. This second-level model is used to make the final predictions on the test meta-features.
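A minimal blending sketch following these steps (the dataset, split sizes and model choices are assumptions for illustration):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# 1. Split into train, validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# 2. Fit the base models on the train set only.
base_models = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier()]
for m in base_models:
    m.fit(X_train, y_train)

# 3. Base-model predictions on the held-out validation and test sets
#    become the meta-features.
val_meta = np.column_stack([m.predict(X_val) for m in base_models])
test_meta = np.column_stack([m.predict(X_test) for m in base_models])

# 4-5. Train the second-level model on the validation meta-features,
#      then predict on the test meta-features.
blender = LogisticRegression().fit(val_meta, y_val)
print("blend accuracy:", accuracy_score(y_test, blender.predict(test_meta)))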