Unit 4: Machine Learning
Ensemble Learning and Random Forest: Introduction to Ensemble Learning, Basic Ensemble Techniques (Max Voting,
Averaging, Weighted Average), Voting Classifiers, Bagging and Pasting, Out-of-Bag Evaluation, Random Patches and
Random Subspaces, Random Forests (Extra-Trees, Feature Importance), Boosting (AdaBoost, Gradient Boosting),
Stacking.
Use Case: Averaging is commonly used in bagging methods like Random Forest Regression, where multiple
regression trees provide their predictions and the final prediction is the average of all of them.
3. Weighted Average
Definition: The weighted average method is a more refined version of the averaging technique. In this
method, each model's prediction is given a different weight, and the final prediction is the weighted average
of all model predictions. The weight can be based on the model's performance or confidence in the
prediction.
How it works:
o Each model's prediction is multiplied by a weight that reflects the model's accuracy or reliability.
o The final prediction is calculated as the sum of all the weighted predictions divided by the total sum
of the weights.
Example:
o Suppose you have the following models and their predictions:
Model 1: Prediction = 10, Weight = 0.4
Model 2: Prediction = 12, Weight = 0.3
Model 3: Prediction = 11, Weight = 0.2
Model 4: Prediction = 9, Weight = 0.1
o The weighted average prediction is:
(0.4 × 10 + 0.3 × 12 + 0.2 × 11 + 0.1 × 9) / (0.4 + 0.3 + 0.2 + 0.1) = (4 + 3.6 + 2.2 + 0.9) / 1 = 10.7
Use Case: Weighted averaging can be used when models have different performances or when certain
models are known to be more reliable for certain types of data. For instance, models with lower error rates
might be assigned higher weights.
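Below is a minimal plain-Python sketch of the two averaging variants, using the illustrative predictions and weights from the example above:

# Illustrative predictions and weights taken from the example above
predictions = [10, 12, 11, 9]
weights = [0.4, 0.3, 0.2, 0.1]

# Simple averaging: every model counts equally
simple_avg = sum(predictions) / len(predictions)
print(simple_avg)  # 10.5

# Weighted averaging: sum of (weight * prediction) divided by the sum of the weights
weighted_avg = sum(w * p for w, p in zip(weights, predictions)) / sum(weights)
print(weighted_avg)  # 10.7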
Technique | Type of Problem | How it Works | Example
Max Voting | Classification | Each model votes for a class, and the class with the most votes is the final prediction. | Random Forest (classification)
Averaging | Regression | The final prediction is the average of all model predictions. | Random Forest Regression
Weighted Average | Regression/Classification | Predictions are weighted by the model's performance, and the final prediction is a weighted average. | Used in more sophisticated ensembles like boosting
Summary:
These ensemble techniques—max voting, averaging, and weighted averaging—are foundational methods for
combining predictions from multiple models, and they serve as the core of more advanced ensemble methods like
Random Forests (bagging), Boosting algorithms (like AdaBoost or Gradient Boosting), and Stacking.
/_________________________________________________________________________/
#Voting Classifiers
A Voting Classifier is a type of ensemble learning method that combines multiple machine learning models (also
known as "base models" or "weak learners") to make a final prediction based on the majority voting principle. It is
primarily used for classification tasks, where the goal is to combine the predictions of multiple classifiers to increase
accuracy and robustness.
Key Concept of Voting Classifiers:
Each base model in the ensemble makes a prediction for a given instance.
Voting occurs to decide the final predicted class:
o The class that receives the most votes from the individual models is chosen as the final prediction.
Voting classifiers are particularly useful when combining different models with complementary strengths, which
leads to improved overall performance.
Types of Voting Classifiers:
1. Hard Voting (Majority Voting):
o Definition: In hard voting, each classifier in the ensemble casts a vote for a class, and the class with
the most votes is chosen as the final prediction. If there is a tie (e.g., two classes have the same
number of votes), a predefined tie-breaking rule may be used.
o How it works:
Suppose you have an ensemble of N classifiers.
Each classifier assigns a predicted class label to the input data point.
The final prediction is the class label that appears most frequently across all classifiers.
o Example:
Model 1: Class 0
Model 2: Class 1
Model 3: Class 0
Model 4: Class 0
Model 5: Class 1
Hard Voting Result: Class 0 is chosen because it has the most votes (3 votes for Class 0 vs. 2
votes for Class 1).
o Use Case: This approach is simple and is often used when combining classifiers such as decision
trees, logistic regression, or support vector machines (SVMs) in an ensemble.
2. Soft Voting:
o Definition: Soft voting is a more advanced version of voting. Instead of using hard class labels, soft
voting relies on the predicted probabilities (class probabilities) for each class and takes a weighted
average of these probabilities to make the final prediction.
o How it works:
For each classifier, the predicted probability of each class is computed.
The predicted probabilities of each class are averaged (or summed, depending on the
method) across all classifiers.
The final predicted class is the one with the highest average (or summed) probability.
o Example:
Model 1: Probability for Class 0 = 0.6, Probability for Class 1 = 0.4
Model 2: Probability for Class 0 = 0.7, Probability for Class 1 = 0.3
Model 3: Probability for Class 0 = 0.5, Probability for Class 1 = 0.5
Soft Voting Result: Average probability for Class 0 = (0.6 + 0.7 + 0.5) / 3 = 0.6; average
probability for Class 1 = (0.4 + 0.3 + 0.5) / 3 = 0.4. Class 0 is chosen because it has the
higher average probability.
Example of a Voting Classifier in Python (scikit-learn):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
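Continuing this snippet, here is a hedged sketch of how both hard and soft voting could be set up with scikit-learn's VotingClassifier; the choice of base models and hyperparameters is illustrative, not prescribed by the notes:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Three base models with complementary strengths (illustrative choices)
log_clf = LogisticRegression(max_iter=1000)
tree_clf = DecisionTreeClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)  # probability=True is required for soft voting

# Hard voting: majority vote over predicted class labels
hard_voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('dt', tree_clf), ('svc', svm_clf)],
    voting='hard')
hard_voting_clf.fit(X_train, y_train)
print(f"Hard Voting Accuracy: {hard_voting_clf.score(X_test, y_test)}")

# Soft voting: average the predicted class probabilities and pick the highest
soft_voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('dt', tree_clf), ('svc', svm_clf)],
    voting='soft')
soft_voting_clf.fit(X_train, y_train)
print(f"Soft Voting Accuracy: {soft_voting_clf.score(X_test, y_test)}")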
2. Pasting
Pasting is a variation of bagging with a subtle but important difference in how the training subsets are created.
Key Concept of Pasting:
Data Sampling: Unlike bagging, pasting generates training subsets by sampling without replacement from
the original training data. In other words, no data point can appear more than once in any given training
subset.
Combining Predictions: After training the individual models on different subsets of the data, the predictions
are combined in the same way as in bagging: using majority voting for classification or averaging for
regression.
How Pasting Works:
1. Data Subsets: From the original dataset, N samples are drawn without replacement to create each
training subset.
2. Train Models: Each subset is used to train an individual model, just like in bagging.
3. Combine Results:
o For Classification: Each model casts a vote for a predicted class, and the majority vote is taken as the
final prediction.
o For Regression: The predictions from all models are averaged.
Key Difference Between Bagging and Pasting:
Bagging samples data with replacement, meaning a data point can appear multiple times in the same subset.
Pasting samples data without replacement, meaning each subset contains unique data points and no data
point appears more than once.
Example:
Imagine a dataset with 10 data points:
In Bagging: A training subset could contain, for example, data points {1, 2, 2, 4, 5, 7, 8, 8, 9, 10} (with
repetitions).
In Pasting: A training subset would contain {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} (with no repetitions).
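A small NumPy sketch of this difference in sampling (purely illustrative, not how scikit-learn implements it internally):

import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1, 11)  # a dataset with 10 data points: 1..10

# Bagging: draw WITH replacement, so duplicates can appear in a subset
bagging_subset = rng.choice(data, size=10, replace=True)

# Pasting: draw WITHOUT replacement, so each point appears at most once
# (in practice each subset is smaller than the full dataset so the subsets differ between models)
pasting_subset = rng.choice(data, size=8, replace=False)

print("Bagging subset:", np.sort(bagging_subset))
print("Pasting subset:", np.sort(pasting_subset))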
Advantages of Pasting:
No Redundancy: Since there are no repeated data points in each training subset, pasting might be more
efficient in utilizing the available training data.
Good for High-Variance Models: Like bagging, pasting helps reduce overfitting by combining multiple
models.
Disadvantages of Pasting:
Limited Diversity: Since no data points are repeated in each subset, the subsets might have more in common
with each other than in bagging. This could lead to a reduced level of model diversity, which can impact
performance.
Computational Cost: Like bagging, pasting requires training multiple models, which can be computationally
expensive.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pasting with a decision tree base model: bootstrap=False samples without replacement,
# and max_samples=0.8 gives each model a different subset of the training data
pasting_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, max_samples=0.8,
                                bootstrap=False, random_state=42)
pasting_clf.fit(X_train, y_train)
print(f"Pasting Accuracy: {pasting_clf.score(X_test, y_test)}")
Conclusion:
Bagging and Pasting are ensemble methods that improve the performance of machine learning models by
reducing overfitting and increasing robustness.
Bagging uses bootstrap sampling (sampling with replacement), while Pasting uses sampling without
replacement.
Both methods work by training multiple models on different subsets of data and combining their predictions,
but bagging tends to have more model diversity due to repeated samples in the training sets, while pasting
ensures no redundancy in the data subsets.
#Out-of-Bag (OOB) Evaluation
Out-of-Bag (OOB) Evaluation is a technique used to estimate the performance of an ensemble model, particularly in
methods like Bagging (Bootstrap Aggregating) and Random Forests, without needing a separate validation set or
cross-validation. It leverages the inherent structure of bootstrapping to evaluate model performance on data that
wasn't used in training individual models.
Key Concept of Out-of-Bag Evaluation:
In bagging, each model is trained on a bootstrap sample—a random subset of the training data with replacement.
Since each subset can have repeated data points, some data points are left out of the training set for each model.
These left-out data points are called Out-of-Bag (OOB) samples.
Out-of-Bag Evaluation uses these OOB samples to estimate the performance of the model. The main idea is:
For each data point in the training set, you can track how often it is left out of the bootstrap samples during
model training.
When predicting the class (in classification) or the value (in regression) for a data point, only the models that
did not see that point during training are used.
This allows for a validation-like process without needing to reserve a separate validation set.
How Out-of-Bag (OOB) Evaluation Works:
1. Bootstrapping: During the training phase, each model in the ensemble is trained on a bootstrap sample
(random subset with replacement). For each model, a portion of the data is left out (OOB samples).
2. OOB Prediction: Each data point has multiple models that did not see it during training (since the data point
was left out of the bootstrap sample). These models are used to predict the outcome for that data point.
3. Performance Estimation: The predictions made by the models that did not use a particular data point are
compared to the actual label (for classification) or value (for regression) of that data point. The performance
(e.g., accuracy, mean squared error) is averaged across all data points.
Example:
Consider a dataset with 1000 samples:
Each decision tree in the random forest is trained on a bootstrap sample, and each tree leaves out some
samples from the original dataset (because of sampling with replacement).
For each data point in the dataset, we can see how often it was left out and which trees are available to make
predictions for that data point.
If a data point was not in a tree’s training set, that tree can be used to predict the class or value of that point.
The final prediction is often the average or majority vote of the predictions from all trees that did not use
that point.
The OOB error rate is then calculated as the average error of all predictions made using OOB samples.
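As a side note, when the bootstrap sample has the same size as the training set, each model leaves out roughly 37% of the data points on average, since the probability that a given point is never drawn is (1 - 1/n)^n ≈ e^(-1) ≈ 0.368. A small, purely illustrative NumPy check:

import numpy as np

rng = np.random.default_rng(0)
n = 1000  # dataset size, matching the example above

# Draw one bootstrap sample (with replacement) and count how many points were never drawn
bootstrap_idx = rng.choice(n, size=n, replace=True)
oob_mask = ~np.isin(np.arange(n), bootstrap_idx)
print(f"Fraction of OOB samples: {oob_mask.mean():.3f}")  # typically close to 0.368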
Advantages of Out-of-Bag Evaluation:
No Need for a Separate Validation Set: OOB evaluation effectively uses the data points that were left out
during the training process to estimate the model's performance, so it eliminates the need for an extra
holdout validation set. This is especially useful when the dataset is small.
Efficient: Since the models are trained on different subsets of the data, OOB evaluation can be done without
additional data splits, which saves time and computation.
Accurate Estimate: The OOB estimate is generally comparable in quality to other validation techniques such as
cross-validation, because each model is tested only on data it did not see during training.
Detects Overfitting: Since every OOB prediction comes from models that never saw that data point, the score is
not inflated by the model memorizing its training data. The OOB samples act as a kind of built-in
pseudo-validation set that provides a largely unbiased estimate of model performance.
Example of Out-of-Bag Evaluation in Random Forests (with Python):
In scikit-learn, RandomForestClassifier and RandomForestRegressor provide built-in support for OOB evaluation.
Here's an example using RandomForestClassifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load dataset
X, y = load_iris(return_X_y=True)
# Split into train and test sets (no validation set used here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize Random Forest Classifier with OOB evaluation enabled
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
# Train the model
rf.fit(X_train, y_train)
# Access the Out-of-Bag score (similar to accuracy)
print(f"Out-of-Bag Score: {rf.oob_score_}")
# Evaluate on test data
test_accuracy = rf.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy}")