MODULE 4
Decision Trees
Introduction to Decision Trees
Decision Trees (DTs) are a non-parametric supervised learning method used for classification
and regression. The goal is to create a model that predicts the value of a target variable by
learning simple decision rules inferred from the data features.
• A decision tree looks like an upside-down tree, starting with a root node.
• From the root node, the tree splits into decision nodes, and these further split until they
reach leaf nodes, which show the outcomes.
• Each decision node represents a condition, and the branches represent the possible
answers.
Step-by-Step Process:
1. Load the Dataset and Train the Model
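The training code itself is not shown here; a minimal sketch, assuming the standard Scikit-Learn iris dataset, only the two petal features, and max_depth=2 (all assumptions), could look like the following. The export_graphviz call below relies on the names defined here.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Load the iris dataset and keep only petal length and width as features
iris = load_iris()
X = iris.data[:, 2:]   # petal length (cm), petal width (cm)
y = iris.target

# Train a small Decision Tree classifier (max_depth=2 is an assumption)
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)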
export_graphviz(
tree_clf,
out_file="iris_tree.dot",
feature_names=iris.feature_names[2:], # Petal length and width
class_names=iris.target_names, # Species names
rounded=True,
filled=True
)
Figure: Decision Tree trained on the iris dataset (rendered from iris_tree.dot).
Making Predictions
Suppose you find an iris flower and you want to classify it.
→ Start at the Root Node:
• Check: is the petal length smaller than 2.45 cm?
• If yes, move to the left child node.
• If no, move to the right child node.
→ Left Child Node (Depth 1, Left):
• This node is a leaf node, meaning it does not ask any more questions.
• Prediction: The flower is classified as Iris-Setosa.
→ Right Child Node (Depth 1, Right):
• This node is not a leaf node, so it asks another question: is the petal width
smaller than 1.75 cm?
• If yes, move to the left child node (Depth 2, Left).
• If no, move to the right child node (Depth 2, Right).
→ Depth 2, Left Node:
• Prediction: The flower is classified as Iris-Versicolor.
→ Depth 2, Right Node:
• Prediction: The flower is classified as Iris-Virginica.
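This walkthrough can be reproduced on the trained tree; the petal measurements below (5 cm long, 1.5 cm wide) are hypothetical values chosen to reach the depth-2 left node.

# Class proportions in the leaf that this (hypothetical) flower falls into
print(tree_clf.predict_proba([[5, 1.5]]))
# Predicted class index (1 corresponds to Iris-Versicolor in the iris dataset)
print(tree_clf.predict([[5, 1.5]]))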
Attributes of a Node:
Samples: Number of training instances it applies to.
• Example: 100 training instances have a petal length > 2.45 cm (Depth 1, Right), and
among them, 54 have a petal width < 1.75 cm (Depth 2, Left).
Gini Impurity: Measures the node's impurity. A pure node (all instances belong to one class)
has a Gini score of 0.
Algorithm
• The algorithm first splits the training set into two subsets using a single feature k and a
threshold t_k (e.g., “petal length ≤ 2.45 cm”).
• To choose k and t_k, it searches for the pair (k, t_k) that produces the purest subsets
(weighted by their size).
• The cost function that the algorithm tries to minimize is
J(k, t_k) = (m_left / m) · G_left + (m_right / m) · G_right
where G_left and G_right measure the impurity of the left and right subsets, and m_left and
m_right are the numbers of instances in those subsets.
• Once it has successfully split the training set in two, it splits the subsets using the same
logic, then the sub-subsets, and so on, recursively.
• Stops recursing once it reaches the maximum depth, or if it cannot find a split that will
reduce impurity.
Algorithm
Step 1. Initialization:
• Begin with the entire dataset as the root node.
Step 2. Splitting Criteria:
• For each node, the algorithm considers all possible splits for each feature.
• The goal is to find the split that minimizes the impurity (for classification) or the
variance (for regression) in the resulting sub-nodes.
Regression: The variance measure typically used is the mean squared error (MSE).
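As a rough illustration of this split search for a single feature (not Scikit-Learn's actual implementation; all names are made up), candidate thresholds are the midpoints between consecutive feature values, and each split is scored by the size-weighted Gini impurity of its two subsets.

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_for_feature(x, y):
    # x: 1-D array of feature values, y: array of class labels.
    # Try a threshold halfway between every pair of consecutive sorted feature
    # values and keep the one with the lowest size-weighted impurity.
    best_t, best_cost = None, float("inf")
    values = np.unique(x)
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost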
Computational Complexity
• Making predictions requires traversing the Decision Tree from the root to a leaf.
Decision Trees are generally approximately balanced, so this traversal goes through
roughly O(log2(m)) nodes, where m is the number of training instances. Each node
checks only one feature, making predictions very fast, even for large datasets.
• The complexity of making a prediction with a Decision Tree is therefore O(log2(m)),
independent of the number of features in your data.
• During training, the algorithm evaluates all features for all samples at each node. This
process results in a training complexity of O(n×mlog(m)), where n is the number of
features and m is the number of training instances.
Information gain
• Information gain measures how well a given attribute separates the training examples
according to their target classification.
• It is used to select among the candidate attributes at each step while growing the tree.
• Information gain is the expected reduction in entropy caused by partitioning the
examples according to this attribute.
• The information gain, Gain(S, A), of an attribute A relative to a collection of examples
S, is defined as
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for
which attribute A has value v.
Entropy
• Entropy measures the impurity of a collection of examples.
• Given a collection S, containing positive and negative examples of some target
concept, the entropy of S relative to this Boolean classification is
Entropy(S) = −p+ log2(p+) − p- log2(p-)
Where,
p+ is the proportion of positive examples in S,
p- is the proportion of negative examples in S,
and 0·log2(0) is defined to be 0.
• The entropy is 0 if all members of S belong to the same class
• The entropy is 1 when the collection contains an equal number of positive and
negative examples
• If the collection contains unequal numbers of positive and negative examples, the
entropy is between 0 and 1
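These properties can be checked numerically; the 9-positive / 5-negative collection below is just an illustrative example.

import math

def entropy(p_pos, p_neg):
    # Entropy of a Boolean collection; terms with proportion 0 contribute nothing
    return 0.0 - sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(9/14, 5/14))  # ~0.940: unequal mix, between 0 and 1
print(entropy(0.5, 0.5))    # 1.0: equal numbers of positive and negative examples
print(entropy(1.0, 0.0))    # 0.0: all members belong to the same class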
Gini Index
The Gini Index, or Gini Impurity, measures the probability that a randomly chosen instance
would be misclassified if it were labeled according to the class distribution in the node. It is
used to determine the best feature to split the data on at each node in the tree.
• Gini = 0: All elements in the node belong to a single class, hence the node is pure.
• Gini > 0: There are multiple classes present in the node, indicating impurity.
• Maximum Impurity (0.5): For a binary classification, this occurs when the classes are
perfectly split (50% each).
At each node of decision tree, the algorithm calculates the Gini Index for all possible splits
and chooses the split that results in the lowest Gini Index for the child nodes, indicating the
purest possible nodes.
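For reference, the Gini impurity of a node i can be written in terms of the proportions p_i,k of each class k in that node (standard definition):
G_i = 1 − Σ_k (p_i,k)²
For example, a hypothetical binary node containing 40 instances of one class and 10 of the other has G = 1 − (40/50)² − (10/50)² = 1 − 0.64 − 0.04 = 0.32.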
• Decision Trees: They make minimal assumptions about the training data, allowing the
tree structure to adapt closely to the data. This can lead to overfitting.
• Nonparametric Models: The number of parameters is not determined prior to training,
so the model is free to fit closely to the training data. Decision Trees are an
example of this.
• Parametric Models: These models have a predetermined number of parameters,
reducing the risk of overfitting but increasing the risk of underfitting.
To avoid overfitting the training data, you need to restrict the Decision Tree’s freedom during
training; this is called regularization. It is done by controlling hyperparameters that limit the
tree's growth and complexity (a short example follows the list of hyperparameters below).
1. max_depth: Limits the maximum depth of the tree. Reducing max_depth restricts
how deep the tree can grow, thus controlling overfitting.
2. min_samples_split: The minimum number of samples required to split an internal
node. Increasing this value means nodes must have more samples to split, which
reduces tree complexity.
3. min_samples_leaf: The minimum number of samples a leaf node must have. Prevents
creating leaf nodes with very few samples, reducing overfitting.
4. min_weight_fraction_leaf: Similar to min_samples_leaf but expressed as a fraction
of the total number of weighted instances. Ensures leaf nodes contain a minimum
fraction of the dataset, adding regularization
5. max_leaf_nodes: Limits the maximum number of leaf nodes. Restricts the overall size
and complexity of the tree
6. max_features: The maximum number of features to consider for splitting at each
node. Limits the number of features evaluated, simplifying the model and reducing the
risk of overfitting.
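A minimal sketch applying a couple of these hyperparameters (the specific values are arbitrary and would normally be tuned):

from sklearn.tree import DecisionTreeClassifier

# Regularized tree: limit depth and require at least 5 samples per leaf
reg_tree_clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
reg_tree_clf.fit(X, y)   # X, y from the iris example above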
Regression
Decision Trees are not only useful for classification tasks but also for regression tasks.
Regression with Decision Trees involves predicting continuous values instead of classes.
Each leaf node in the regression tree predicts a continuous value. The predicted value is the
average of the target values of all training instances in that leaf node.
Example:
• Suppose you want to make a prediction for a new instance with x1 = 0.6. Start at the root of
the tree and traverse it according to the feature values until you reach a leaf node.
• The leaf node predicts value = 0.1106, which is the average target value of the training
instances in that node. This prediction results in a Mean Squared Error (MSE) of
0.0151 over these 110 instances.
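A minimal sketch of such a regression tree, assuming a noisy quadratic dataset and max_depth=2 (both assumptions; the leaf value 0.1106 quoted above depends on the actual training data):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic data with a single feature x1
rng = np.random.RandomState(42)
X_reg = rng.rand(200, 1)
y_reg = 4 * (X_reg[:, 0] - 0.5) ** 2 + 0.05 * rng.randn(200)

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X_reg, y_reg)

# The prediction for x1 = 0.6 is the mean target value of the leaf it falls into
print(tree_reg.predict([[0.6]]))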
This model’s predictions are shown in the figure below. If max_depth is set to 2, the
predictions are less detailed. Increasing max_depth to 3 results in more detailed predictions.
The predicted value for each region is always the average target value of the instances in that
region.
The CART (Classification and Regression Tree) algorithm splits the dataset to minimize the
Mean Squared Error (MSE) rather than the impurity. The cost function that the algorithm
tries to minimize becomes
J(k, t_k) = (m_left / m) · MSE_left + (m_right / m) · MSE_right
where MSE_node is the mean squared error of the instances in a node, computed with respect
to ŷ_node, the average target value of those instances.
Decision Trees are prone to overfitting when dealing with regression tasks. Without any
regularization (i.e., using the default hyperparameters), you get predictions like those shown
on the left of the figure below, which obviously overfit the training set very badly. Just setting
min_samples_leaf=10 results in a much more reasonable model, shown on the right of the
figure.
Question Bank
Ensemble Learning
Ensemble Learning is a powerful technique in machine learning that combines the predictions
of multiple models to produce a better overall prediction.
Definition:
• A group of predictors (models) is called an ensemble.
• The technique of combining multiple models is called Ensemble Learning.
• An algorithm that implements Ensemble Learning is called an Ensemble Method.
Voting Classifiers
• Voting Classifiers are an ensemble learning method that combines the predictions of
multiple classifiers to improve accuracy.
• Imagine you have trained several classifiers, each achieving around 80% accuracy. For
example, you might have a Logistic Regression classifier, an SVM classifier, a
Random Forest classifier, and a K-Nearest Neighbors classifier. A simple way to
create an even better classifier is to combine their predictions.
Hard Voting:
• The most straightforward method is to aggregate the predictions from each classifier
and predict the class that gets the most votes. This is known as a hard voting
classifier.
• Surprisingly, this majority-vote classifier often achieves higher accuracy than the best
individual classifier in the ensemble. If each classifier is a weak learner (meaning it
does only slightly better than random guessing), the ensemble can still be a strong
learner (achieving high accuracy), provided there are a sufficient number of weak
learners and they are sufficiently diverse.
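A minimal hard-voting sketch with Scikit-Learn; the choice of classifiers and the moons dataset are assumptions made for illustration.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Combine three different classifiers by majority (hard) vote
voting_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC())],
    voting="hard")
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))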
Example:
• Suppose you have a slightly biased coin that has a 51% chance of coming up heads,
and 49% chance of coming up tails. If you toss it 1,000 times, you will get more or
less 510 heads and 490 tails, and hence a majority of heads.
• You will find that the probability of obtaining a majority of heads after 1,000
tosses is close to 75%.
• The probability of getting a majority of heads increases with the number of tosses due
to the law of large numbers. With 10,000 tosses, the probability of a majority heads is
over 97%.
The figure below shows 10 series of biased coin tosses. You can see that as the number of tosses
increases, the ratio of heads approaches 51%. Eventually all 10 series end up so close to 51%
that they are consistently above 50%.
• Similarly, suppose you build an ensemble containing 1,000 classifiers that are
individually correct only 51% of the time. If you predict the majority voted class, you
can hope for up to 75% accuracy!
• However, this high accuracy assumes all classifiers are perfectly independent and
make uncorrelated errors. In reality, since they are trained on the same data, they are
likely to make similar errors, reducing the ensemble’s overall accuracy.
Bagging and Pasting are ensemble learning techniques used to improve the accuracy and
robustness of machine learning models. Both train the same algorithm on different random
subsets of the training set.
Bagging:
• Each predictor is trained on a random subset of the training set where sampling is done
with replacement (bootstrapping), so the same instance can be sampled several times
for the same predictor.
Pasting:
• Each predictor is trained on a random subset of the training set where sampling is done
without replacement. This means instances are not repeated in the same subset; this
method is called pasting.
• In both methods, predictions from all predictors are aggregated to make the final
prediction.
As you can see in the figure above, predictors can be trained in parallel, using different CPU
cores or servers, making both bagging and pasting scalable and efficient. Predictions can also
be made in parallel, which speeds up the overall process.
• Scikit-Learn offers a simple API for both bagging and pasting with the
BaggingClassifier class (or BaggingRegressor for regression).
• The following code trains an ensemble of 500 Decision Tree classifiers, each trained
on 100 training instances randomly sampled from the training set with replacement
(bagging). If you want to use pasting instead, just set bootstrap=False. The n_jobs
parameter tells Scikit-Learn the number of CPU cores to use for training and
predictions. The training step is sketched below; the prediction step follows it.
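The training part, as a minimal sketch (X_train and y_train are assumed to come from a train/test split, such as the moons split in the voting example above):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)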
# Make predictions
y_pred = bag_clf.predict(X_test)
• BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100,
bootstrap=True, n_jobs=-1): creates an ensemble of 500 Decision Trees, each trained
on 100 random samples drawn from the training set with replacement.
• n_jobs=-1: utilizes all available CPU cores for parallel processing, making training faster.
The figure below compares the decision boundary of a single Decision Tree with the decision
boundary of a bagging ensemble of 500 trees, both trained on the moons dataset.
• A single Decision Tree might have a complex decision boundary that overfits the
training data.
• A bagging ensemble of trees will generally have a smoother decision boundary, which
reduces variance and improves generalization.
Out-of-Bag Evaluation
• In bagging, each predictor is trained on a random subset of the training set with
replacement. On average, about 63% of the training instances are used for training
each predictor. The remaining 37% of the instances are not used and are called out-of-
bag (OOB) instances. Each predictor has a different set of OOB instances.
• OOB instances are not seen by the predictor during training, so they can be used to
evaluate the predictor's performance. This provides a way to evaluate the ensemble
without needing a separate validation set. The overall performance of the ensemble
can be assessed by averaging the OOB evaluations of all predictors.
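In Scikit-Learn, OOB evaluation can be requested with oob_score=True; the oob_score_ attribute then holds the resulting accuracy estimate. A minimal sketch, reusing the assumed X_train and y_train:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, oob_score=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)   # accuracy estimated from the out-of-bag instances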
Bagging is not limited to sampling training instances; it can also sample features. The
BaggingClassifier class supports feature sampling, controlled by two hyperparameters:
max_features and bootstrap_features.
• max_features: Specifies the number of features to sample.
• bootstrap_features: If set to True, features are sampled with replacement
Random Patches: Sampling both training instances and features is called the Random
Patches method. This method is used when dealing with high-dimensional data (e.g., images).
In Scikit-Learn:
• bootstrap=True: Sample training instances with replacement.
• max_samples=<fraction>: Fraction of training instances to sample.
• bootstrap_features=True: Sample features with replacement.
• max_features=<fraction>: Fraction of features to sample.
Random Subspaces: All training instances are used, but features are sampled.
In Scikit-Learn:
• bootstrap=False: Use all training instances.
• max_samples=1.0: Use all training instances.
• bootstrap_features=True: Sample features with replacement.
• max_features=<fraction>: Fraction of features to sample.
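Expressed as BaggingClassifier arguments, the two schemes might look as follows (the 0.5 fractions and the use of Decision Trees are illustrative assumptions):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, n_jobs=-1,
    bootstrap=True, max_samples=0.5,
    bootstrap_features=True, max_features=0.5)

# Random Subspaces: keep all training instances, sample only features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, n_jobs=-1,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5)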
Benefits
• Increased Diversity: Sampling features adds more diversity to the predictors.
• Bias-Variance Tradeoff: This approach generally increases bias but decreases
variance, improving generalization.
Random Forests
Random forest is a supervised learning algorithm that combines multiple Decision Trees to
improve the accuracy and stability of predictions. The “forest” it builds is an ensemble of
decision trees, trained with the bagging method.
Example Code - Here’s how to train a Random Forest classifier with 500 trees, each limited
to a maximum of 16 nodes, using all available CPU cores:
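A minimal sketch of that training step (X_train, y_train again assumed from an earlier train/test split):

from sklearn.ensemble import RandomForestClassifier

# 500 trees, each limited to 16 leaf nodes, trained on all CPU cores
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)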
# Make predictions
y_pred_rf = rnd_clf.predict(X_test)
The Random Forest algorithm introduces extra randomness when growing trees; instead of
searching for the very best feature when splitting a node, it searches for the best feature
among a random subset of features. This results in a greater tree diversity, which trades a
higher bias for a lower variance, generally yielding an overall better model.
The following BaggingClassifier is roughly equivalent to the Random Forest above: it bags
Decision Trees that use random splits and are limited to 16 leaf nodes.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)
Boosting
Boosting refers to any Ensemble method that can combine several weak learners into a strong
learner. The general idea of most boosting methods is to train predictors sequentially, each
trying to correct its predecessor.
AdaBoost
The figure below shows the decision boundaries of five consecutive predictors on the moons
dataset. The first classifier gets many instances wrong, so their weights get boosted. The
second classifier therefore does a better job on these instances, and so on. The plot on the
right represents the same sequence of predictors except that the learning rate is halved.
Once all predictors are trained, the ensemble makes predictions very much like bagging or
pasting, except that predictors have different weights depending on their overall accuracy on
the weighted training set.
AdaBoost algorithm
• Each instance weight w(i) is initially set to 1/m, where m is the number of training
instances. A first predictor is trained and its weighted error rate r_1 is computed on the
training set; in general, the weighted error rate of the j-th predictor is
r_j = (sum of w(i) over the instances misclassified by predictor j) / (sum of all w(i))
• The predictor’s weight α_j is then computed using Equation 2, where η is the learning
rate hyperparameter (defaults to 1):
α_j = η · log((1 − r_j) / r_j)
The more accurate the predictor is, the higher its weight will be. If it is just guessing
randomly, then its weight will be close to zero. However, if it is most often wrong (i.e.,
less accurate than random guessing), then its weight will be negative.
• Next, the instance weights are updated using Equation 3: the misclassified instances are
boosted. For each instance, w(i) is left unchanged if the instance was classified correctly,
and is multiplied by exp(α_j) if it was misclassified; all the weights are then normalized
(divided by the sum of the weights).
• Finally, a new predictor is trained using the updated weights, and the whole process is
repeated.
• The algorithm stops when the desired number of predictors is reached, or when a
perfect predictor is found.
To make predictions, AdaBoost computes the predictions of all the predictors and weighs
them using the predictor weights α_j. The predicted class is the one that receives the majority
of weighted votes (Equation 4):
ŷ(x) = argmax_k Σ_j α_j, where the sum runs over the predictors j that predict class k for x.
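A minimal AdaBoost sketch with Scikit-Learn, using decision stumps (depth-1 trees) as weak learners; the hyperparameter values are illustrative, and the weak learner is passed positionally because its parameter name differs across Scikit-Learn versions (base_estimator vs. estimator).

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 200 boosted decision stumps with a halved learning rate
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)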
Gradient Boosting
• Gradient Boosting is a powerful machine learning technique used for regression and
classification tasks. It builds an ensemble of models sequentially, each one correcting
the errors of its predecessor.
• In this method, models are added to the ensemble one at a time. Each new model
corrects the errors of the combined previous models.
• Instead of adjusting instance weights like AdaBoost, Gradient Boosting fits new
models to the residual errors (the difference between the actual and predicted values)
of the existing ensemble. Typically, Decision Trees are used as the base learners in
Gradient Boosting.
• The Learning Rate hyperparameter scales the contribution of each new model,
controlling the step size of each iteration. Setting a low learning rate usually improves
generalization by preventing overfitting, though it requires more iterations.
• To avoid overfitting, training can be stopped early when the validation error stops
improving.
• Stochastic Gradient Boosting: This technique introduces randomness by training each
model on a random subset of the data, which can reduce overfitting and improve
performance.
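The three-tree example discussed next can be built by hand by fitting each new tree to the previous residuals; the quadratic dataset below is a made-up stand-in for the training data.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic training data (single feature)
rng = np.random.RandomState(42)
X_q = rng.rand(100, 1)
y_q = 3 * (X_q[:, 0] - 0.5) ** 2 + 0.05 * rng.randn(100)

# Initial model: fit a first regression tree to the data
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X_q, y_q)

# Second tree: trained on the residual errors of the first tree
y2 = y_q - tree_reg1.predict(X_q)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X_q, y2)

# Third tree: trained on the residual errors of the second tree
y3 = y2 - tree_reg2.predict(X_q)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X_q, y3)

# The ensemble predicts by summing the predictions of all three trees
X_new = np.array([[0.6]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))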
The figure below shows the predictions of these three trees in the left column, and the
ensemble’s predictions in the right column. In the first row, the ensemble has just one tree, so
its predictions are exactly the same as the first tree’s predictions. In the second row, a new
tree is trained on the residual errors of the first tree. On the right you can see that the
ensemble’s predictions are equal to the sum of the predictions of the first two trees. Similarly,
in the third row another tree is trained on the residual errors of the second tree.
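Scikit-Learn's GradientBoostingRegressor builds such an ensemble directly; this sketch mirrors the three manual trees above (learning_rate=1.0 means the trees' contributions are not shrunk).

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X_q, y_q)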
Stacking
Stacking (or stacked generalization) is an ensemble method that combines multiple models to
improve prediction accuracy. Unlike simple aggregation methods like voting, stacking uses
another model to learn how to best combine the outputs of the base models.
• First Layer - Base Models: Multiple models are trained on the training dataset. These
can be any type of model (e.g., decision trees, linear regression, etc.). Each model
makes predictions on the training data.
• Hold-out Set: The training dataset is split into two subsets: one for training the base
models and one for creating the hold-out predictions. The base models make
predictions on the hold-out set, producing an array of predictions for each instance.
• Second Layer - Blender (Meta Learner): The predictions from the base models are
used as input features to train a new model called the blender or meta learner. The
target values for training the blender are the actual values from the hold-out set.
• Making Predictions: To make a prediction on new data, the input is first passed
through the base models. The predictions from the base models are then fed into the
blender, which makes the final prediction.
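Scikit-Learn offers a StackingClassifier that trains the blender on cross-validated predictions of the base models (a close relative of the hold-out approach described above); a minimal sketch with assumed base models:

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Base models in the first layer; a Logistic Regression blender in the second layer
stacking_clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC())],
    final_estimator=LogisticRegression())
stacking_clf.fit(X_train, y_train)
print(stacking_clf.score(X_test, y_test))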
Key Points
• Blender Training: The blender learns to optimize the combination of base model
predictions.
• Hold-out Set: Ensures that the predictions used to train the blender are independent
and not biased by the training data of the base models.
• Multiple Layers: It's possible to add multiple layers of models, but this increases
complexity and computational cost.
Question Bank
1. Define ensemble learning and explain the concept of the "wisdom of the crowd" in the
context of machine learning.
2. What is an ensemble method, and how does it improve prediction accuracy?
3. Describe the differences between bagging, boosting, and stacking.
4. Explain how a random forest combines multiple decision trees to make predictions.
5. What is a hard voting classifier? Describe how it works with an example.
6. Explain why combining multiple classifiers that are weak learners can still result in a
strong learner.
7. Differentiate between bagging and pasting in terms of their sampling methods.
8. Provide a real-world example where bagging could be used effectively.
9. Write a Python code snippet to implement a bagging classifier using Scikit-Learn with
decision trees as base estimators.
10. What is out-of-bag (OOB) evaluation, and how is it useful in bagging methods?
11. Explain how OOB evaluation provides an estimate of the ensemble's performance
without using a separate validation set.
12. How do the random patches and random subspaces methods increase diversity among
predictors?
13. Describe the benefits and potential drawbacks of using random patches in high-
dimensional data.
14. Explain the process of training a random forest, highlighting the role of bootstrapping
and feature selection.
15. Write a Python code snippet to train a random forest classifier using Scikit-Learn.
16. Compare and contrast a single decision tree with a random forest in terms of bias,
variance, and overall prediction accuracy.
17. Describe the AdaBoost algorithm and how it adjusts weights during training to
improve prediction accuracy.
18. What are the advantages and disadvantages of using AdaBoost compared to other
ensemble methods?
19. Explain how gradient boosting differs from AdaBoost and provide a step-by-step
explanation of its training process.
20. Write a Python code snippet to implement a basic gradient boosting model using
Scikit-Learn.
21. What is stacking in ensemble learning, and how does it differ from simple aggregation
methods like voting?
22. Describe the role of the meta-learner in stacking and its importance in improving
prediction accuracy.