
Machine Learning 21AI63

MODULE 4
Decision Trees
Introduction to Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification
and regression. The goal is to create a model that predicts the value of a target variable by
learning simple decision rules inferred from the data features.

Key Terms in Decision Trees:


• Root Node: The starting point of the tree.
• Splitting: Dividing a node into multiple sub-nodes.
• Decision Node: A node that splits into more sub-nodes.
• Leaf Node: A node that does not split further and represents an outcome.
• Pruning: Removing sub-nodes to simplify the tree.
• Branch: A sub-section of the tree consisting of multiple nodes.

Figure: Decision tree structure

Figure: A decision tree for the concept PlayTennis


How Decision Trees Operate:

• A decision tree looks like an upside-down tree, starting with a root node.
• From the root node, the tree splits into decision nodes, and these further split until they
reach leaf nodes, which show the outcomes.
• Each decision node represents a condition, and the branches represent the possible answers.

Types of decision trees in machine learning


Decision trees in machine learning can either be classification trees or regression trees.

1. Classification trees predict categorical outcomes, such as whether an event happened or not (a "yes" or "no" decision) or which class an instance belongs to.
Example: Is an animal a reptile or a mammal?

2. Regression trees predict continuous numerical values based on previous data or information sources.
Example: Predicting house prices based on features like size and location.


Training and Visualizing a Decision Tree


To understand Decision Trees, let’s just build one and take a look at how it makes
predictions.

Example Using the Iris Dataset:


The iris dataset is a famous dataset used to practice classification algorithms. It contains data
about different iris flowers, including their petal lengths and widths, which we will use to
classify them into different species.

Step-by-Step Process:
1. Load the Dataset and Train the Model

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()
X = iris.data[:, 2:]  # Use petal length and width as features
y = iris.target       # Target variable (species)

# Initialize and train the Decision Tree Classifier
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

2. Visualize the Decision Tree


You can visualize the trained Decision Tree by first using the export_graphviz() function to output a graph definition file called iris_tree.dot:

from sklearn.tree import export_graphviz

export_graphviz(
tree_clf,
out_file="iris_tree.dot",
feature_names=iris.feature_names[2:], # Petal length and width
class_names=iris.target_names, # Species names
rounded=True,
filled=True
)


3. Convert the .dot File to an Image


Use the Graphviz tool to convert the .dot file to a more accessible format such as PNG:

$ dot -Tpng iris_tree.dot -o iris_tree.png

Figure: The trained iris Decision Tree

Making Predictions

Suppose you find an iris flower and you want to classify it.
→ Start at the Root Node:
• Check whether the petal length is smaller than 2.45 cm.
• If yes, move to the left child node.
• If no, move to the right child node.
→ Left Child Node (Depth 1, Left):
• This node is a leaf node, meaning it does not ask any more questions.
• Prediction: The flower is classified as Iris-Setosa
→ Right Child Node (Depth 1, Right):
• This node is not a leaf node, so it asks another question: is the petal width
smaller than 1.75 cm?
• If yes, move to the left child node (Depth 2, Left).
• If no, move to the right child node (Depth 2, Right).
→ Depth 2, Left Node:
• Prediction: The flower is classified as Iris-Versicolor.
→ Depth 2, Right Node:
• Prediction: The flower is classified as Iris-Virginica.


Attributes of a Node:
Samples: The number of training instances the node applies to.
• Example: 100 training instances have a petal length > 2.45 cm (Depth 1, Right), and among them, 54 have a petal width < 1.75 cm (Depth 2, Left).

Value: Number of training instances of each class in that node.


• Example: The bottom-right node (Depth 2, Right) applies to 0 Iris-Setosa, 1 Iris-Versicolor, and 45 Iris-Virginica.

Gini Impurity: Measures the node's impurity. A pure node (all instances belong to one class)
has a Gini score of 0.

Estimating Class Probabilities


• A Decision Tree can also estimate the probability that an instance belongs to a
particular class k.
• First, it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class k in this node.
• For example, consider a flower whose petals are 5 cm long and 1.5 cm wide. The
corresponding leaf node is the depth-2 left node, so the Decision Tree should output
the following probabilities:

• 0% for Iris-Setosa (0/54)


• 90.7% for Iris-Versicolor (49/54)


• 9.3% for Iris-Virginica (5/54)
The predicted class is Iris-Versicolor (class 1), since it has the highest probability.
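A minimal check of this in code, reusing the tree_clf trained earlier (the printed values are approximate and assume that same depth-2 tree):

# Class probabilities and prediction for a flower with petals 5 cm long and 1.5 cm wide
print(tree_clf.predict_proba([[5, 1.5]]))  # approximately [[0.    0.907 0.093]]
print(tree_clf.predict([[5, 1.5]]))        # [1] -> Iris-Versicolor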

The CART Training Algorithm

CART (Classification and Regression Trees) is a decision tree technique employed in machine learning to address both classification and regression tasks. It identifies patterns and relationships within a dataset and constructs a tree structure based on the variable values present in the data.

Algorithm
• The algorithm first splits the training set into two subsets using a single feature k and a threshold t_k (e.g., "petal length ≤ 2.45 cm").
• To choose k and t_k, it searches for the pair (k, t_k) that produces the purest subsets (weighted by their size).
• The cost function that the algorithm tries to minimize is given by
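J(k, t_k) = \frac{m_{left}}{m} G_{left} + \frac{m_{right}}{m} G_{right}

where G_{left/right} measures the impurity of the left/right subset and m_{left/right} is the number of instances in that subset (standard CART notation).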

• Once it has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively.
• It stops recursing once it reaches the maximum depth, or if it cannot find a split that will reduce impurity.


Algorithm

Step 1. Initialization:
• Begin with the entire dataset as the root node.
Step 2. Splitting Criteria:
• For each node, the algorithm considers all possible splits for each feature.
• The goal is to find the split that minimizes the impurity (for classification) or the
variance (for regression) in the resulting sub-nodes.

Classification: The impurity measure used is the Gini impurity.

Regression: The variance measure typically used is the mean squared error (MSE).

Step 3. Choose the Best Split


• Evaluate all possible splits for all features.
• Choose the split that results in the lowest Gini impurity (for classification) or the
lowest variance (for regression).
Step 4. Split the Node
• Split the dataset into two subsets based on the best split.
• Create two new child nodes and assign the data points to these nodes based on the
split criteria.
Step 5. Repeat
For each child node, repeat steps 2 to 4 until one of the stopping conditions is met:
• Maximum depth of the tree is reached.
• Minimum number of samples in a node is reached.
• No further reduction in impurity or variance can be achieved.
• Node contains only samples of a single class (for classification).


Computational Complexity

• Making predictions requires traversing the Decision Tree from the root to a leaf. Decision Trees are generally approximately balanced, so this traversal goes through roughly O(log2(m)) nodes, where m is the number of training instances. Each node checks only one feature, making predictions very fast, even for large datasets.
• The complexity of making a prediction with a Decision Tree is O(log2 (m)), meaning it
remains quick regardless of the number of features in your data.
• During training, the algorithm evaluates all features for all samples at each node. This
process results in a training complexity of O(n × m log(m)), where n is the number of
features and m is the number of training instances.

Gini Impurity or Entropy?

Information gain
• Information gain measures how well a given attribute separates the training examples
according to their target classification.
• It is used to select among the candidate attributes at each step while growing the tree.
• Information gain is the expected reduction in entropy caused by partitioning the examples according to this attribute.
• The information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as
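Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v.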

Entropy
• Entropy measures the impurity of a collection of examples.
• Given a collection S, containing positive and negative examples of some target
concept, the entropy of S relative to this Boolean classification is
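Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-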

Where,
p+ is the proportion of positive examples in S
p- is the proportion of negative examples in S.
• The entropy is 0 if all members of S belong to the same class
• The entropy is 1 when the collection contains an equal number of positive and
negative examples
• If the collection contains unequal numbers of positive and negative examples, the
entropy is between 0 and 1


Gini Index
The Gini Index, or Gini Impurity, measures the probability that a randomly chosen instance would be misclassified if it were labeled randomly according to the class distribution in the node. It is used to determine the best feature to split the data at each node in the tree.
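For a node i, it is computed from the class proportions p_{i,k} (the fraction of class-k instances among the node's training instances):

G_i = 1 - \sum_{k=1}^{n} p_{i,k}^{2}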

• Gini = 0: All elements in the node belong to a single class, hence the node is pure.
• Gini > 0: There are multiple classes present in the node, indicating impurity.
• Maximum Impurity (0.5): For a binary classification, this occurs when the classes are
perfectly split (50% each).

At each node of the decision tree, the algorithm calculates the Gini Index for all possible splits
and chooses the split that results in the lowest Gini Index for the child nodes, indicating the
purest possible nodes.

So should you use Gini impurity or entropy?


• Gini Impurity: Tends to be faster to compute and is a good default choice. It isolates
the most frequent class in its own branch of the tree.
• Entropy: Tends to produce slightly more balanced trees by considering the
information gain. This can be more informative but slightly slower to compute.
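In Scikit-Learn, the choice is made with the criterion hyperparameter of DecisionTreeClassifier; a minimal sketch:

from sklearn.tree import DecisionTreeClassifier

# Gini impurity is the default criterion; entropy can be selected explicitly
gini_clf = DecisionTreeClassifier(criterion="gini", max_depth=2)
entropy_clf = DecisionTreeClassifier(criterion="entropy", max_depth=2)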

Regularization Hyperparameters in Decision Trees

• Decision Trees: They make minimal assumptions about the training data, allowing the
tree structure to adapt closely to the data. This can lead to overfitting.
• Nonparametric Models: The number of parameters is not determined before training, so the model is free to fit closely to the training data. Decision Trees are an example of this.
• Parametric Models: These models have a predetermined number of parameters,
reducing the risk of overfitting but increasing the risk of underfitting.

To avoid overfitting the training data, the Decision Tree's freedom must be restricted during training; this is called regularization. It is done by controlling hyperparameters that limit the tree's growth and complexity.


Regularization Hyperparameters in Scikit-Learn: The DecisionTreeClassifier class has several parameters that restrict the shape of the Decision Tree:

1. max_depth: Limits the maximum depth of the tree. Reducing max_depth restricts
how deep the tree can grow, thus controlling overfitting.
2. min_samples_split: The minimum number of samples required to split an internal
node. Increasing this value means nodes must have more samples to split, which
reduces tree complexity.
3. min_samples_leaf: The minimum number of samples a leaf node must have. Prevents
creating leaf nodes with very few samples, reducing overfitting.
4. min_weight_fraction_leaf: Similar to min_samples_leaf but expressed as a fraction
of the total number of weighted instances. Ensures leaf nodes contain a minimum
fraction of the dataset, adding regularization
5. max_leaf_nodes: Limits the maximum number of leaf nodes. Restricts the overall size
and complexity of the tree
6. max_features: The maximum number of features to consider for splitting at each
node. Limits the number of features evaluated, simplifying the model and reducing the
risk of overfitting.

Adjusting max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, and max_features helps balance the model's complexity and its ability to generalize to new data.
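A minimal sketch of the effect of one of these hyperparameters, assuming a hypothetical noisy dataset generated with make_moons (the dataset choice is only an illustration):

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy dataset (an assumption) used only to illustrate regularization
X_moons, y_moons = make_moons(n_samples=100, noise=0.25, random_state=53)

unrestricted_clf = DecisionTreeClassifier()                   # free to overfit the noise
regularized_clf = DecisionTreeClassifier(min_samples_leaf=4)  # at least 4 samples per leaf
unrestricted_clf.fit(X_moons, y_moons)
regularized_clf.fit(X_moons, y_moons)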


Regression

Decision Trees are not only useful for classification tasks but also for regression tasks.
Regression with Decision Trees involves predicting continuous values instead of classes.

Building a Regression Tree


Use the DecisionTreeRegressor class from Scikit-Learn.

from sklearn.tree import DecisionTreeRegressor

# X and y are assumed to be an existing regression training set
# (a 1-D feature and a continuous target)
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)

Each leaf node in the regression tree predicts a continuous value. The predicted value is the
average of the target values of all training instances in that leaf node.

Example:
• Suppose you want to predict for a new instance with x1 = 0.6. Then start at the root of
the tree and traverse it according to the feature values until you reach a leaf node.
• The leaf node predicts value = 0.1106, which is the average target value of the training
instances in that node. This prediction results in a Mean Squared Error (MSE) of
0.0151 over these 110 instances.
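A minimal, self-contained sketch of this prediction; the 1-D noisy quadratic dataset below is an assumption chosen only to resemble the data behind the figures, so the exact leaf value will differ slightly:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D noisy quadratic dataset (an assumption for illustration)
np.random.seed(42)
X_quad = np.random.rand(200, 1)
y_quad = 4 * (X_quad[:, 0] - 0.5) ** 2 + np.random.randn(200) / 10

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X_quad, y_quad)

# Returns the average target value of the leaf that x1 = 0.6 falls into
print(tree_reg.predict([[0.6]]))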

This model's predictions are represented in the figure below. If max_depth is set to 2, the
predictions are less detailed. Increasing max_depth to 3 results in more detailed predictions.
The predicted value for each region is always the average target value of the instances in that
region.


The CART (Classification and Regression Tree) algorithm splits the dataset to minimize the
Mean Squared Error (MSE) rather than impurity. The equation below shows the cost function that the algorithm tries to minimize.
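In standard notation, with each node predicting the mean target value of its instances:

J(k, t_k) = \frac{m_{left}}{m} MSE_{left} + \frac{m_{right}}{m} MSE_{right}

where MSE_{node} = \frac{1}{m_{node}} \sum_{i \in node} (\hat{y}_{node} - y^{(i)})^2 and \hat{y}_{node} = \frac{1}{m_{node}} \sum_{i \in node} y^{(i)}.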

Decision Trees are prone to overfitting when dealing with regression tasks. Without any
regularization (i.e., using the default hyperparameters), you get the predictions shown on the left of the figure below. The model is obviously overfitting the training set very badly. Just setting min_samples_leaf=10 results in a much more reasonable model, represented on the right of the figure.


Question Bank

1. Define Decision Tree. How do Decision Trees operate?
2. Explain the types of decision trees in machine learning
3. Describe the process of training and visualizing decision tree using an example
4. Explain CART algorithm to train decision trees
5. Discuss the computational complexity of making predictions with Decision Trees
6. Compare Gini Impurity and Entropy in the context of decision trees with example
7. Define Information Gain and explain its significance in decision trees with an example
8. Describe the concept of Entropy and how it measures impurity in a dataset and
consider an appropriate example
9. Explain the Gini Index and its role in determining the best feature for splitting data.
10. Discuss the importance of regularization hyperparameters in decision trees.
11. Explain the process and purpose of "Pruning" in decision trees.
12. Differentiate between classification trees and regression trees in machine learning with
an example


Ensemble Learning and Random Forests


Introduction to Ensemble Learning

Ensemble Learning is a powerful technique in machine learning that combines the predictions
of multiple models to produce a better overall prediction.

The Concept of Ensemble Learning

Wisdom of the Crowd:


• Imagine asking a complex question to thousands of random people and then
combining their answers. Often, this combined answer is better than the answer from a
single expert. This is known as the "wisdom of the crowd."
• Similarly, combining the predictions from multiple models (classifiers or regressors)
often results in more accurate predictions than any individual model.

Definition:
• A group of predictors (models) is called an ensemble.
• The technique of combining multiple models is called Ensemble Learning.
• An algorithm that implements Ensemble Learning is called an Ensemble Method.

Example - Random Forest:


• Train multiple Decision Tree classifiers on different random subsets of the training
data. To make a prediction, get predictions from all the individual trees and choose the
class that gets the most votes. This ensemble of Decision Trees is called a Random
Forest.

Types of Ensemble Methods

1. Bagging (Bootstrap Aggregating): Train multiple models on different random subsets


of the training data and combine their predictions.
2. Boosting: Train multiple models sequentially, each trying to correct the errors of the
previous model. Combine their predictions to improve accuracy.
3. Stacking: Train multiple models and then train a meta-model to combine their
predictions.
4. Random Forests: A specific type of bagging that uses Decision Trees. Combines the
predictions of multiple Decision Trees to improve accuracy and robustness.


Voting Classifiers

• Voting Classifiers are a type of ensemble learning method that combines the predictions of multiple classifiers to improve accuracy.
• Imagine you have trained several classifiers, each achieving around 80% accuracy. For
example, you might have a Logistic Regression classifier, an SVM classifier, a
Random Forest classifier, and a K-Nearest Neighbors classifier. A simple way to
create an even better classifier is to combine their predictions.

Hard Voting:
• The most straightforward method is to aggregate the predictions from each classifier
and predict the class that gets the most votes. This is known as a hard voting
classifier.
• Surprisingly, this majority-vote classifier often achieves higher accuracy than the best
individual classifier in the ensemble. If each classifier is a weak learner (meaning it
does only slightly better than random guessing), the ensemble can still be a strong
learner (achieving high accuracy), provided there are a sufficient number of weak
learners and they are sufficiently diverse.
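A minimal Scikit-Learn sketch of a hard voting classifier; the make_moons dataset and the particular base classifiers are assumptions chosen only for illustration:

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical dataset (an assumption); any classification data works
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier()),
                ("svc", SVC())],
    voting="hard")  # predict the class that gets the most votes
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))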


Example:
• Suppose you have a slightly biased coin that has a 51% chance of coming up heads,
and a 49% chance of coming up tails. If you toss it 1,000 times, you will get roughly 510 heads and 490 tails, and hence a majority of heads.
• You will find that the probability of obtaining a majority of heads after 1,000
tosses is close to 75%
• The probability of getting a majority of heads increases with the number of tosses due
to the law of large numbers. With 10,000 tosses, the probability of a majority heads is
over 97%.

The figure shows 10 series of biased coin tosses. You can see that as the number of tosses
increases, the ratio of heads approaches 51%. Eventually all 10 series end up so close to 51%
that they are consistently above 50%.

• Similarly, suppose you build an ensemble containing 1,000 classifiers that are
individually correct only 51% of the time. If you predict the majority voted class, you
can hope for up to 75% accuracy!
• However, this high accuracy assumes all classifiers are perfectly independent and
make uncorrelated errors. In reality, since they are trained on the same data, they are
likely to make similar errors, reducing the ensemble’s overall accuracy.


Bagging and Pasting

Bagging and Pasting are ensemble learning techniques used to improve the accuracy and
robustness of machine learning models.

Creating Diverse Predictors:


• One way to create a diverse set of classifiers is to use different training algorithms.
• Another approach is to use the same training algorithm for every predictor but train
them on different random subsets of the training set.

Bagging (Bootstrap Aggregating):


• Each predictor is trained on a random subset of the training set, where sampling is done with replacement. This means some instances may be repeated in the same subset; this method is called bagging.
• Once all predictors are trained, their predictions are aggregated. For classification
tasks, the most frequent prediction is chosen (like a hard voting classifier). For
regression tasks, the average prediction is used.
• Each predictor trained on a subset has a higher bias, but when combined, the ensemble
reduces both bias and variance, resulting in better overall performance.

Pasting:
• Each predictor is trained on a random subset of the training set, where sampling is done without replacement. This means instances are not repeated in the same subset; this method is called pasting.
• Similar to bagging, predictions from all predictors are aggregated to make the final
prediction.
As shown in the figure above, predictors can be trained in parallel, using different CPU cores or servers, making both bagging and pasting scalable and efficient. Predictions can also be made in parallel, which speeds up the overall process.


Advantages of Bagging and Pasting


• Reduction in Variance: By training on different subsets, the overall model's variance
is reduced compared to a single predictor trained on the entire training set.
• Scalability: The ability to train and predict in parallel makes these methods suitable
for large datasets and complex models.

Example of Bagging and Pasting


• Bagging: Imagine training 10 decision tree classifiers. Each tree is trained on a subset
of the training data created by randomly sampling with replacement. The final
prediction for a new instance is based on the majority vote of these 10 trees.
• Pasting: Similarly, you train 10 decision tree classifiers, but this time each tree is
trained on a subset created by random sampling without replacement. The final
prediction is also based on the majority vote.

Bagging and Pasting in Scikit-Learn

• Scikit-Learn offers a simple API for both bagging and pasting with the
BaggingClassifier class (or BaggingRegressor for regression).
• The following code trains an ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement (bagging); if you want to use pasting instead, just set bootstrap=False. The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Create a bagging classifier
# (X_train, y_train, X_test are assumed to come from an existing train/test split)
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)

# Train the bagging classifier
bag_clf.fit(X_train, y_train)

# Make predictions
y_pred = bag_clf.predict(X_test)

BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
max_samples=100, bootstrap=True, n_jobs=-1): This creates an ensemble of 500
Decision Trees. Each tree is trained on 100 random samples from the training set with
replacement.


n_jobs=-1: Utilizes all available CPU cores for parallel processing, making training faster.
Below Figure compares the decision boundary of a single Decision Tree with the decision
boundary of a bagging ensemble of 500 trees, both trained on the moons dataset.

• A single Decision Tree might have a complex decision boundary that overfits the
training data.
• A bagging ensemble of trees will generally have a smoother decision boundary, which
reduces variance and improves generalization.

Out-of-Bag Evaluation
• In bagging, each predictor is trained on a random subset of the training set with
replacement. On average, about 63% of the training instances are used for training
each predictor. The remaining 37% of the instances are not used and are called out-of-bag (OOB) instances. Each predictor has a different set of OOB instances.
• OOB instances are not seen by the predictor during training, so they can be used to
evaluate the predictor's performance. This provides a way to evaluate the ensemble
without needing a separate validation set. The overall performance of the ensemble
can be assessed by averaging the OOB evaluations of all predictors.
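A minimal Scikit-Learn sketch of OOB evaluation with oob_score=True; the make_moons dataset is an assumption used only for illustration:

from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset (an assumption); any classification data works
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# oob_score=True requests an automatic out-of-bag evaluation after training
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)  # estimate of the accuracy on unseen data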


Random Patches and Random Subspaces

Bagging isn't limited to sampling training instances; it can also sample features. The BaggingClassifier class supports feature sampling, controlled by two hyperparameters: max_features and bootstrap_features.
• max_features: Specifies the number of features to sample.
• bootstrap_features: If set to True, features are sampled with replacement

Random Patches: Sampling both training instances and features is called the Random
Patches method. This method is used when dealing with high-dimensional data (e.g., images).

In Scikit-Learn:
• bootstrap=True: Sample training instances with replacement.
• max_samples=<fraction>: Fraction of training instances to sample.
• bootstrap_features=True: Sample features with replacement.
• max_features=<fraction>: Fraction of features to sample.

Random Subspaces: All training instances are used, but features are sampled.

In Scikit-Learn:
• bootstrap=False: Use all training instances.
• max_samples=1.0: Use all training instances.
• bootstrap_features=True: Sample features with replacement.
• max_features=<fraction>: Fraction of features to sample.
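A minimal sketch of both configurations with BaggingClassifier; the specific fractions are assumptions for illustration, and a subsequent fit on X_train, y_train would proceed as usual:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, n_jobs=-1,
    bootstrap=True, max_samples=0.75,
    bootstrap_features=True, max_features=0.5)

# Random Subspaces: keep all training instances, sample only features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, n_jobs=-1,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5)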

Benefits
• Increased Diversity: Sampling features adds more diversity to the predictors.
• Bias-Variance Tradeoff: This approach generally increases bias but decreases
variance, improving generalization.


Random Forests
Random forest is a supervised learning algorithm that combines multiple Decision Trees to
improve the accuracy and stability of predictions. The “forest” it builds is an ensemble of
decision trees, trained with the bagging method.

How Does a Random Forest Work?


Training Phase:
• Create multiple subsets of the original training data using bootstrapping (sampling
with replacement).
• Train a Decision Tree on each subset. Each tree is grown to its maximum depth.
• At each split in a tree, a random subset of features is considered for splitting instead of
considering all features. This adds randomness and reduces correlation between trees.
Prediction Phase:
• For a new data point, each tree in the forest makes a prediction.
• For classification tasks, the final prediction is made by taking the majority vote of all
the trees.
• For regression tasks, the final prediction is made by averaging the predictions of all
the trees.

Example Code - Here’s how to train a Random Forest classifier with 500 trees, each limited
to a maximum of 16 nodes, using all available CPU cores:

from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier
# (X_train, y_train, X_test are assumed to come from an existing train/test split)
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)

# Train the classifier
rnd_clf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rnd_clf.predict(X_test)

A RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.

The Random Forest algorithm introduces extra randomness when growing trees; instead of
searching for the very best feature when splitting a node, it searches for the best feature
among a random subset of features. This results in a greater tree diversity, which trades a
higher bias for a lower variance, generally yielding an overall better model.


The following BaggingClassifier is roughly equivalent to the above RandomForestClassifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

splitter="random": introduces extra randomness into how each node is split (random thresholds are evaluated for the candidate features instead of searching for the best possible split), which makes the bagged trees behave much like the trees in a Random Forest.


Boosting

Boosting refers to any Ensemble method that can combine several weak learners into a strong
learner. The general idea of most boosting methods is to train predictors sequentially, each
trying to correct its predecessor.

The most popular boosting methods are:


• AdaBoost (Adaptive Boosting)
• Gradient Boosting

AdaBoost

• AdaBoost is a machine learning algorithm that combines multiple weak learners to create a strong learner. A weak learner is a model that performs slightly better than random guessing. AdaBoost focuses on improving the performance of these weak learners by adjusting their weights based on their performance on the training data.
• For example, to build an AdaBoost classifier, a first base classifier (such as a Decision
Tree) is trained and used to make predictions on the training set. The relative weight of
misclassified training instances is then increased. A second classifier is trained using
the updated weights and again it makes predictions on the training set, weights are
updated, and so on.
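A minimal Scikit-Learn sketch of this procedure using AdaBoostClassifier with shallow trees (decision stumps) as the base learners; the moons dataset is an assumption used only for illustration:

from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset (an assumption); any classification data works
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 200 decision stumps trained sequentially with instance reweighting
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)
print(ada_clf.score(X_test, y_test))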

The below figure shows the decision boundaries of five consecutive predictors on the moons
dataset. The first classifier gets many instances wrong, so their weights get boosted. The
second classifier therefore does a better job on these instances, and so on. The plot on the
right represents the same sequence of predictors except that the learning rate is halved.


Once all predictors are trained, the ensemble makes predictions very much like bagging or
pasting, except that predictors have different weights depending on their overall accuracy on
the weighted training set.

AdaBoost algorithm
• Each instance weight w^(i) is initially set to 1/m. A first predictor is trained, and its weighted error rate r_1 is computed on the training set.
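In general, the weighted error rate of the j-th predictor is (Equation 1):

r_j = \frac{\sum_{i=1,\ \hat{y}_j^{(i)} \neq y^{(i)}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

where \hat{y}_j^{(i)} is the j-th predictor's prediction for the i-th instance.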

• The predictor’s weight αj is then computed using Equation - 2, where η is the learning
rate hyperparameter (defaults to 1). The more accurate the predictor is, the higher its
weight will be. If it is just guessing randomly, then its weight will be close to zero.
However, if it is most often wrong (i.e., less accurate than random guessing), then its
weight will be negative.
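Equation 2 (predictor weight):

\alpha_j = \eta \log \frac{1 - r_j}{r_j}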

• Next the instance weights are updated using Equation 3: the misclassified instances are
boosted.
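Equation 3 (weight update), for i = 1, 2, ..., m:

w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)} \end{cases}

Then all the instance weights are normalized (i.e., divided by \sum_{i=1}^{m} w^{(i)}).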


• Finally, a new predictor is trained using the updated weights, and the whole process is
repeated.
• The algorithm stops when the desired number of predictors is reached, or when a
perfect predictor is found.

To make predictions, AdaBoost computes the predictions of all the predictors and weighs
them using the predictor weights αj. The predicted class is the one that receives the majority
of weighted votes (see Equation - 4).
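Equation 4 (AdaBoost predictions):

\hat{y}(\mathbf{x}) = \underset{k}{\operatorname{argmax}} \sum_{\substack{j=1 \\ \hat{y}_j(\mathbf{x}) = k}}^{N} \alpha_j

where N is the number of predictors.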

Gradient Boosting

• Gradient Boosting is a powerful machine learning technique used for regression and
classification tasks. It builds an ensemble of models sequentially, each one correcting
the errors of its predecessor.
• In this approach, models are added one at a time to the ensemble. Each new model corrects the errors of the combined previous models.
• Instead of adjusting instance weights like AdaBoost, Gradient Boosting fits new
models to the residual errors (the difference between the actual and predicted values)
of the existing ensemble. Typically, Decision Trees are used as the base learners in
Gradient Boosting.
• The Learning Rate hyperparameter scales the contribution of each new model,
controlling the step size of each iteration. Setting a low learning rate usually improves
generalization by preventing overfitting, though it requires more iterations.
• To avoid overfitting, training can be stopped early when the validation error stops
improving.
• Stochastic Gradient Boosting: This technique introduces randomness by training each
model on a random subset of the data, which can reduce overfitting and improve
performance.


Steps to Implement Gradient Boosting

#Initial Model (X and y are assumed to be an existing regression training set)

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

#Second Model on Residuals


y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

#Third Model on New Residuals


y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

#Predictions with Ensemble (X_new: new instances to predict, assumed to be defined)

y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
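For reference, a roughly equivalent ensemble can be built directly with Scikit-Learn's GradientBoostingRegressor; a minimal sketch, where X, y, and X_new are assumed as in the code above and the hyperparameter values are illustrative:

from sklearn.ensemble import GradientBoostingRegressor

# Three depth-2 trees, as above; learning_rate scales each tree's contribution,
# and subsample < 1.0 would give Stochastic Gradient Boosting
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, subsample=1.0)
gbrt.fit(X, y)
y_pred = gbrt.predict(X_new)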

Below figure represents the predictions of these three trees in the left column, and the
ensemble’s predictions in the right column. In the first row, the ensemble has just one tree, so
its predictions are exactly the same as the first tree’s predictions. In the second row, a new
tree is trained on the residual errors of the first tree. On the right you can see that the
ensemble’s predictions are equal to the sum of the predictions of the first two trees. Similarly,
in the third row another tree is trained on the residual errors of the second tree.


Figure - Gradient Boosting


Stacking

Stacking (or stacked generalization) is an ensemble method that combines multiple models to
improve prediction accuracy. Unlike simple aggregation methods like voting, stacking uses
another model to learn how to best combine the outputs of the base models.

How Stacking Works

• First Layer - Base Models: Multiple models are trained on the training dataset. These
can be any type of model (e.g., decision trees, linear regression, etc.). Each model
makes predictions on the training data.
• Hold-out Set: The training dataset is split into two subsets: one for training the base
models and one for creating the hold-out predictions. The base models make
predictions on the hold-out set, producing an array of predictions for each instance.
• Second Layer - Blender (Meta Learner): The predictions from the base models are
used as input features to train a new model called the blender or meta learner. The
target values for training the blender are the actual values from the hold-out set.
• Making Predictions: To make a prediction on new data, the input is first passed
through the base models. The predictions from the base models are then fed into the
blender, which makes the final prediction.
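A minimal Scikit-Learn sketch of stacking using StackingClassifier; the dataset and the choice of base models and blender are assumptions for illustration (internally, cross-validated out-of-fold predictions play the role of the hold-out set described above):

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical dataset (an assumption); any classification data works
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()), ("svc", SVC())],  # base models
    final_estimator=LogisticRegression(),  # blender / meta learner
    cv=5)  # out-of-fold predictions are used to train the blender
stacking_clf.fit(X_train, y_train)
print(stacking_clf.score(X_test, y_test))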


Key Points
• Blender Training: The blender learns to optimize the combination of base model
predictions.
• Hold-out Set: Ensures that the predictions used to train the blender are independent
and not biased by the training data of the base models.
• Multiple Layers: It's possible to add multiple layers of models, but this increases
complexity and computational cost.


Question Bank

1. Define ensemble learning and explain the concept of the "wisdom of the crowd" in the
context of machine learning.
2. What is an ensemble method, and how does it improve prediction accuracy?
3. Describe the differences between bagging, boosting, and stacking.
4. Explain how a random forest combines multiple decision trees to make predictions.
5. What is a hard voting classifier? Describe how it works with an example.
6. Explain why combining multiple classifiers that are weak learners can still result in a
strong learner.
7. Differentiate between bagging and pasting in terms of their sampling methods.
8. Provide a real-world example where bagging could be used effectively.
9. Write a Python code snippet to implement a bagging classifier using Scikit-Learn with
decision trees as base estimators.
10. What is out-of-bag (OOB) evaluation, and how is it useful in bagging methods?
11. Explain how OOB evaluation provides an estimate of the ensemble's performance
without using a separate validation set.
12. How do the random patches and random subspaces methods increase diversity among
predictors?
13. Describe the benefits and potential drawbacks of using random patches in high-
dimensional data.
14. Explain the process of training a random forest, highlighting the role of bootstrapping
and feature selection.
15. Write a Python code snippet to train a random forest classifier using Scikit-Learn.
16. Compare and contrast a single decision tree with a random forest in terms of bias,
variance, and overall prediction accuracy.
17. Describe the AdaBoost algorithm and how it adjusts weights during training to
improve prediction accuracy.
18. What are the advantages and disadvantages of using AdaBoost compared to other
ensemble methods?
19. Explain how gradient boosting differs from AdaBoost and provide a step-by-step
explanation of its training process.
20. Write a Python code snippet to implement a basic gradient boosting model using
Scikit-Learn.
21. What is stacking in ensemble learning, and how does it differ from simple aggregation
methods like voting?
22. Describe the role of the meta-learner in stacking and its importance in improving
prediction accuracy.
