MODULE 4
Decision Trees
Introduction to Decision Trees
Decision Trees (DTs) are a non-parametric supervised learning method used for classification
and regression. The goal is to create a model that predicts the value of a target variable by
learning simple decision rules inferred from the data features.
• A decision tree looks like an upside-down tree, starting with a root node.
• From the root node, the tree splits into decision nodes, and these further split until they
reach leaf nodes, which show the outcomes.
• Each decision node represents a condition, and the branches represent the possible
answers.
Step-by-Step Process:
1. Load the Dataset and Train the Model
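The training code itself is not shown here; a minimal sketch, assuming the standard Scikit-Learn iris dataset, only the two petal features, and max_depth=2 (all assumptions), could look like the following. The export_graphviz call below relies on the names defined here.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Load the iris dataset and keep only petal length and width as features
iris = load_iris()
X = iris.data[:, 2:]   # petal length (cm), petal width (cm)
y = iris.target

# Train a small Decision Tree classifier (max_depth=2 is an assumption)
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)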
export_graphviz(
tree_clf,
out_file="iris_tree.dot",
feature_names=iris.feature_names[2:], # Petal length and width
class_names=iris.target_names, # Species names
rounded=True,
filled=True
)
Figure: Decision Tree trained on the iris dataset (rendered from iris_tree.dot).
Making Predictions
Suppose you find an iris flower and you want to classify it.
→ Start at the Root Node:
• Check: is the petal length smaller than 2.45 cm?
• If yes, move to the left child node.
• If no, move to the right child node.
→ Left Child Node (Depth 1, Left):
• This node is a leaf node, meaning it does not ask any more questions.
• Prediction: The flower is classified as Iris-Setosa.
→ Right Child Node (Depth 1, Right):
• This node is not a leaf node, so it asks another question: is the petal width
smaller than 1.75 cm?
• If yes, move to the left child node (Depth 2, Left).
• If no, move to the right child node (Depth 2, Right).
→ Depth 2, Left Node:
• Prediction: The flower is classified as Iris-Versicolor.
→ Depth 2, Right Node:
• Prediction: The flower is classified as Iris-Virginica.
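This walkthrough can be reproduced on the trained tree; the petal measurements below (5 cm long, 1.5 cm wide) are hypothetical values chosen to reach the depth-2 left node.

# Class proportions in the leaf that this (hypothetical) flower falls into
print(tree_clf.predict_proba([[5, 1.5]]))
# Predicted class index (1 corresponds to Iris-Versicolor in the iris dataset)
print(tree_clf.predict([[5, 1.5]]))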
Attributes of a Node:
Samples: Number of training instances it applies to.
• Example: 100 training instances have a petal length > 2.45 cm (Depth 1, Right), and
among them, 54 have a petal width < 1.75 cm (Depth 2, Left).
Gini Impurity: Measures the node's impurity. A pure node (all instances belong to one class)
has a Gini score of 0.
Algorithm
• The algorithm first splits the training set into two subsets using a single feature k and a
threshold t_k (e.g., “petal length ≤ 2.45 cm”).
• To choose k and t_k, it searches for the pair (k, t_k) that produces the purest subsets
(weighted by their size).
• The cost function that the algorithm tries to minimize is
J(k, t_k) = (m_left / m) · G_left + (m_right / m) · G_right
where G_left and G_right measure the impurity of the left and right subsets, and m_left and
m_right are the numbers of instances in those subsets.
• Once it has successfully split the training set in two, it splits the subsets using the same
logic, then the sub-subsets, and so on, recursively.
• Stops recursing once it reaches the maximum depth, or if it cannot find a split that will
reduce impurity.
Algorithm
Step 1. Initialization:
• Begin with the entire dataset as the root node.
Step 2. Splitting Criteria:
• For each node, the algorithm considers all possible splits for each feature.
• The goal is to find the split that minimizes the impurity (for classification) or the
variance (for regression) in the resulting sub-nodes.
Regression: The variance measure typically used is the mean squared error (MSE).
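As a rough illustration of this split search for a single feature (not Scikit-Learn's actual implementation; all names are made up), candidate thresholds are the midpoints between consecutive feature values, and each split is scored by the size-weighted Gini impurity of its two subsets.

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_for_feature(x, y):
    # x: 1-D array of feature values, y: array of class labels.
    # Try a threshold halfway between every pair of consecutive sorted feature
    # values and keep the one with the lowest size-weighted impurity.
    best_t, best_cost = None, float("inf")
    values = np.unique(x)
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost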
Computational Complexity
• Making predictions requires traversing the Decision Tree from the root to a leaf.
Decision Trees are generally approximately balanced, so this traversal goes through
roughly O(log2(m)) nodes, where m is the number of training instances. Each node
checks only one feature, making predictions very fast, even for large datasets.
• The complexity of making a prediction with a Decision Tree is therefore O(log2(m)),
independent of the number of features in your data.
• During training, the algorithm evaluates all features for all samples at each node. This
process results in a training complexity of O(n×mlog(m)), where n is the number of
features and m is the number of training instances.
Information gain
• Information gain measures how well a given attribute separates the training examples
according to their target classification.
• It is used to select among the candidate attributes at each step while growing the tree.
• Information gain is the expected reduction in entropy caused by partitioning the
examples according to this attribute.
• The information gain, Gain(S, A), of an attribute A relative to a collection of examples
S, is defined as
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for
which attribute A has value v.
Entropy
• Entropy measures the impurity of a collection of examples.
• Given a collection S, containing positive and negative examples of some target
concept, the entropy of S relative to this Boolean classification is
Entropy(S) = −p+ log2(p+) − p- log2(p-)
Where,
p+ is the proportion of positive examples in S,
p- is the proportion of negative examples in S,
and 0·log2(0) is defined to be 0.
• The entropy is 0 if all members of S belong to the same class
• The entropy is 1 when the collection contains an equal number of positive and
negative examples
• If the collection contains unequal numbers of positive and negative examples, the
entropy is between 0 and 1
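These properties can be checked numerically; the 9-positive / 5-negative collection below is just an illustrative example.

import math

def entropy(p_pos, p_neg):
    # Entropy of a Boolean collection; terms with proportion 0 contribute nothing
    return 0.0 - sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(9/14, 5/14))  # ~0.940: unequal mix, between 0 and 1
print(entropy(0.5, 0.5))    # 1.0: equal numbers of positive and negative examples
print(entropy(1.0, 0.0))    # 0.0: all members belong to the same class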
Gini Index
The Gini Index, or Gini Impurity, measures the probability that a randomly chosen instance
would be misclassified if it were labeled according to the class distribution in the node. It is
used to determine the best feature to split the data on at each node in the tree.
• Gini = 0: All elements in the node belong to a single class, hence the node is pure.
• Gini > 0: There are multiple classes present in the node, indicating impurity.
• Maximum Impurity (0.5): For a binary classification, this occurs when the classes are
perfectly split (50% each).
At each node of decision tree, the algorithm calculates the Gini Index for all possible splits
and chooses the split that results in the lowest Gini Index for the child nodes, indicating the
purest possible nodes.
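For reference, the Gini impurity of a node i can be written in terms of the proportions p_i,k of each class k in that node (standard definition):
G_i = 1 − Σ_k (p_i,k)²
For example, a hypothetical binary node containing 40 instances of one class and 10 of the other has G = 1 − (40/50)² − (10/50)² = 1 − 0.64 − 0.04 = 0.32.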
• Decision Trees: They make minimal assumptions about the training data, allowing the
tree structure to adapt closely to the data. This can lead to overfitting.
• Nonparametric Models: The number of parameters is not determined prior to training,
so the model is free to fit closely to the training data. Decision Trees are an
example of this.
• Parametric Models: These models have a predetermined number of parameters,
reducing the risk of overfitting but increasing the risk of underfitting.
To avoid overfitting the training data, you need to restrict the Decision Tree’s freedom during
training; this is called regularization. It is done by controlling hyperparameters that limit the
tree's growth and complexity (a short example follows the list of hyperparameters below).
1. max_depth: Limits the maximum depth of the tree. Reducing max_depth restricts
how deep the tree can grow, thus controlling overfitting.
2. min_samples_split: The minimum number of samples required to split an internal
node. Increasing this value means nodes must have more samples to split, which
reduces tree complexity.
3. min_samples_leaf: The minimum number of samples a leaf node must have. Prevents
creating leaf nodes with very few samples, reducing overfitting.
4. min_weight_fraction_leaf: Similar to min_samples_leaf but expressed as a fraction
of the total number of weighted instances. Ensures leaf nodes contain a minimum
fraction of the dataset, adding regularization
5. max_leaf_nodes: Limits the maximum number of leaf nodes. Restricts the overall size
and complexity of the tree
6. max_features: The maximum number of features to consider for splitting at each
node. Limits the number of features evaluated, simplifying the model and reducing the
risk of overfitting.
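A minimal sketch applying a couple of these hyperparameters (the specific values are arbitrary and would normally be tuned):

from sklearn.tree import DecisionTreeClassifier

# Regularized tree: limit depth and require at least 5 samples per leaf
reg_tree_clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
reg_tree_clf.fit(X, y)   # X, y from the iris example above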
Regression
Decision Trees are not only useful for classification tasks but also for regression tasks.
Regression with Decision Trees involves predicting continuous values instead of classes.
Each leaf node in the regression tree predicts a continuous value. The predicted value is the
average of the target values of all training instances in that leaf node.
Example:
• Suppose you want to make a prediction for a new instance with x1 = 0.6. Start at the root of
the tree and traverse it according to the feature values until you reach a leaf node.
• The leaf node predicts value = 0.1106, which is the average target value of the training
instances in that node. This prediction results in a Mean Squared Error (MSE) of
0.0151 over these 110 instances.
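A minimal sketch of such a regression tree, assuming a noisy quadratic dataset and max_depth=2 (both assumptions; the leaf value 0.1106 quoted above depends on the actual training data):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic data with a single feature x1
rng = np.random.RandomState(42)
X_reg = rng.rand(200, 1)
y_reg = 4 * (X_reg[:, 0] - 0.5) ** 2 + 0.05 * rng.randn(200)

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X_reg, y_reg)

# The prediction for x1 = 0.6 is the mean target value of the leaf it falls into
print(tree_reg.predict([[0.6]]))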
This model’s predictions are shown in the figure below. If max_depth is set to 2, the
predictions are less detailed. Increasing max_depth to 3 results in more detailed predictions.
The predicted value for each region is always the average target value of the instances in that
region.
The CART (Classification and Regression Tree) algorithm splits the dataset to minimize the
Mean Squared Error (MSE) rather than the impurity. The cost function that the algorithm
tries to minimize becomes
J(k, t_k) = (m_left / m) · MSE_left + (m_right / m) · MSE_right
where MSE_node is the mean squared error of the instances in a node, computed with respect
to ŷ_node, the average target value of those instances.
Decision Trees are prone to overfitting when dealing with regression tasks. Without any
regularization (i.e., using the default hyperparameters), you get predictions like those shown
on the left of the figure below, which obviously overfit the training set very badly. Just setting
min_samples_leaf=10 results in a much more reasonable model, shown on the right of the
figure.
Question Bank
Ensemble Learning
Ensemble Learning is a powerful technique in machine learning that combines the predictions
of multiple models to produce a better overall prediction.
Definition:
• A group of predictors (models) is called an ensemble.
• The technique of combining multiple models is called Ensemble Learning.
• An algorithm that implements Ensemble Learning is called an Ensemble Method.
Voting Classifiers
• Voting Classifiers are an ensemble learning method that combines the predictions of
multiple classifiers to improve accuracy.
• Imagine you have trained several classifiers, each achieving around 80% accuracy. For
example, you might have a Logistic Regression classifier, an SVM classifier, a
Random Forest classifier, and a K-Nearest Neighbors classifier. A simple way to
create an even better classifier is to combine their predictions.
Hard Voting:
• The most straightforward method is to aggregate the predictions from each classifier
and predict the class that gets the most votes. This is known as a hard voting
classifier.
• Surprisingly, this majority-vote classifier often achieves higher accuracy than the best
individual classifier in the ensemble. If each classifier is a weak learner (meaning it
does only slightly better than random guessing), the ensemble can still be a strong
learner (achieving high accuracy), provided there are a sufficient number of weak
learners and they are sufficiently diverse.
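A minimal hard-voting sketch with Scikit-Learn; the choice of classifiers and the moons dataset are assumptions made for illustration.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Combine three different classifiers by majority (hard) vote
voting_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC())],
    voting="hard")
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))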
Example:
• Suppose you have a slightly biased coin that has a 51% chance of coming up heads,
and 49% chance of coming up tails. If you toss it 1,000 times, you will get more or
less 510 heads and 490 tails, and hence a majority of heads.
• You will find that the probability of obtaining a majority of heads after 1,000
tosses is close to 75%.
• The probability of getting a majority of heads increases with the number of tosses due
to the law of large numbers. With 10,000 tosses, the probability of a majority heads is
over 97%.
The figure below shows 10 series of biased coin tosses. You can see that as the number of tosses
increases, the ratio of heads approaches 51%. Eventually all 10 series end up so close to 51%
that they are consistently above 50%.
• Similarly, suppose you build an ensemble containing 1,000 classifiers that are
individually correct only 51% of the time. If you predict the majority voted class, you
can hope for up to 75% accuracy!
• However, this high accuracy assumes all classifiers are perfectly independent and
make uncorrelated errors. In reality, since they are trained on the same data, they are
likely to make similar errors, reducing the ensemble’s overall accuracy.
Bagging and Pasting are ensemble learning techniques used to improve the accuracy and
robustness of machine learning models. Both train the same algorithm on different random
subsets of the training set.
Bagging:
• Each predictor is trained on a random subset of the training set where sampling is done
with replacement (bootstrapping), so the same instance can be sampled several times
for the same predictor.
Pasting:
• Each predictor is trained on a random subset of the training set where sampling is done
without replacement. This means instances are not repeated in the same subset; this
method is called pasting.
• In both methods, predictions from all predictors are aggregated to make the final
prediction.
As you can see in the figure above, predictors can be trained in parallel, using different CPU
cores or servers, making both bagging and pasting scalable and efficient. Predictions can also
be made in parallel, which speeds up the overall process.
• Scikit-Learn offers a simple API for both bagging and pasting with the
BaggingClassifier class (or BaggingRegressor for regression).
• The following code trains an ensemble of 500 Decision Tree classifiers, each trained
on 100 training instances randomly sampled from the training set with replacement
(bagging). If you want to use pasting instead, just set bootstrap=False. The n_jobs
parameter tells Scikit-Learn the number of CPU cores to use for training and
predictions. The training step is sketched below; the prediction step follows it.
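The training part, as a minimal sketch (X_train and y_train are assumed to come from a train/test split, such as the moons split in the voting example above):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)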
# Make predictions
y_pred = bag_clf.predict(X_test)
• BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100,
bootstrap=True, n_jobs=-1): creates an ensemble of 500 Decision Trees, each trained
on 100 random samples drawn from the training set with replacement.
• n_jobs=-1: utilizes all available CPU cores for parallel processing, making training faster.
The figure below compares the decision boundary of a single Decision Tree with the decision
boundary of a bagging ensemble of 500 trees, both trained on the moons dataset.
• A single Decision Tree might have a complex decision boundary that overfits the
training data.
• A bagging ensemble of trees will generally have a smoother decision boundary, which
reduces variance and improves generalization.
Out-of-Bag Evaluation
• In bagging, each predictor is trained on a random subset of the training set with
replacement. On average, about 63% of the training instances are used for training
each predictor. The remaining 37% of the instances are not used and are called out-of-
bag (OOB) instances. Each predictor has a different set of OOB instances.
• OOB instances are not seen by the predictor during training, so they can be used to
evaluate the predictor's performance. This provides a way to evaluate the ensemble
without needing a separate validation set. The overall performance of the ensemble
can be assessed by averaging the OOB evaluations of all predictors.
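In Scikit-Learn, OOB evaluation can be requested with oob_score=True; the oob_score_ attribute then holds the resulting accuracy estimate. A minimal sketch, reusing the assumed X_train and y_train:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, oob_score=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)   # accuracy estimated from the out-of-bag instances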
Bagging is not limited to sampling training instances; it can also sample features. The
BaggingClassifier class supports feature sampling, controlled by two hyperparameters:
max_features and bootstrap_features.
• max_features: Specifies the number of features to sample.
• bootstrap_features: If set to True, features are sampled with replacement
Random Patches: Sampling both training instances and features is called the Random
Patches method. This method is used when dealing with high-dimensional data (e.g., images).
In Scikit-Learn:
• bootstrap=True: Sample training instances with replacement.
• max_samples=<fraction>: Fraction of training instances to sample.
• bootstrap_features=True: Sample features with replacement.
• max_features=<fraction>: Fraction of features to sample.
Random Subspaces: All training instances are used, but features are sampled.
In Scikit-Learn:
• bootstrap=False: Use all training instances.
• max_samples=1.0: Use all training instances.
• bootstrap_features=True: Sample features with replacement.
• max_features=<fraction>: Fraction of features to sample.
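Expressed as BaggingClassifier arguments, the two schemes might look as follows (the 0.5 fractions and the use of Decision Trees are illustrative assumptions):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, n_jobs=-1,
    bootstrap=True, max_samples=0.5,
    bootstrap_features=True, max_features=0.5)

# Random Subspaces: keep all training instances, sample only features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, n_jobs=-1,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5)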
Benefits
• Increased Diversity: Sampling features adds more diversity to the predictors.
• Bias-Variance Tradeoff: This approach generally increases bias but decreases
variance, improving generalization.
Random Forests
Random forest is a supervised learning algorithm that combines multiple Decision Trees to
improve the accuracy and stability of predictions. The “forest” it builds is an ensemble of
decision trees, trained with the bagging method.
Example Code - Here’s how to train a Random Forest classifier with 500 trees, each limited
to a maximum of 16 nodes, using all available CPU cores:
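A minimal sketch of that training step (X_train, y_train again assumed from an earlier train/test split):

from sklearn.ensemble import RandomForestClassifier

# 500 trees, each limited to 16 leaf nodes, trained on all CPU cores
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)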
# Make predictions
y_pred_rf = rnd_clf.predict(X_test)
The Random Forest algorithm introduces extra randomness when growing trees; instead of
searching for the very best feature when splitting a node, it searches for the best feature
among a random subset of features. This results in a greater tree diversity, which trades a
higher bias for a lower variance, generally yielding an overall better model.
The following BaggingClassifier is roughly equivalent to the Random Forest above: it bags
Decision Trees that use random splits and are limited to 16 leaf nodes.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)
Boosting
Boosting refers to any Ensemble method that can combine several weak learners into a strong
learner. The general idea of most boosting methods is to train predictors sequentially, each
trying to correct its predecessor.
AdaBoost
The figure below shows the decision boundaries of five consecutive predictors on the moons
dataset. The first classifier gets many instances wrong, so their weights get boosted. The
second classifier therefore does a better job on these instances, and so on. The plot on the
right represents the same sequence of predictors except that the learning rate is halved.
Once all predictors are trained, the ensemble makes predictions very much like bagging or
pasting, except that predictors have different weights depending on their overall accuracy on
the weighted training set.
AdaBoost algorithm
• Each instance weight w(i) is initially set to 1/m, where m is the number of training
instances. A first predictor is trained and its weighted error rate r_1 is computed on the
training set; in general, the weighted error rate of the j-th predictor is
r_j = (sum of w(i) over the instances misclassified by predictor j) / (sum of all w(i))
• The predictor’s weight α_j is then computed using Equation 2, where η is the learning
rate hyperparameter (defaults to 1):
α_j = η · log((1 − r_j) / r_j)
The more accurate the predictor is, the higher its weight will be. If it is just guessing
randomly, then its weight will be close to zero. However, if it is most often wrong (i.e.,
less accurate than random guessing), then its weight will be negative.
• Next, the instance weights are updated using Equation 3: the misclassified instances are
boosted. For each instance, w(i) is left unchanged if the instance was classified correctly,
and is multiplied by exp(α_j) if it was misclassified; all the weights are then normalized
(divided by the sum of the weights).
• Finally, a new predictor is trained using the updated weights, and the whole process is
repeated.
• The algorithm stops when the desired number of predictors is reached, or when a
perfect predictor is found.
To make predictions, AdaBoost computes the predictions of all the predictors and weighs
them using the predictor weights α_j. The predicted class is the one that receives the majority
of weighted votes (Equation 4):
ŷ(x) = argmax_k Σ_j α_j, where the sum runs over the predictors j that predict class k for x.
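A minimal AdaBoost sketch with Scikit-Learn, using decision stumps (depth-1 trees) as weak learners; the hyperparameter values are illustrative, and the weak learner is passed positionally because its parameter name differs across Scikit-Learn versions (base_estimator vs. estimator).

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 200 boosted decision stumps with a halved learning rate
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)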
Gradient Boosting
• Gradient Boosting is a powerful machine learning technique used for regression and
classification tasks. It builds an ensemble of models sequentially, each one correcting
the errors of its predecessor.
• In this method, models are added to the ensemble one at a time. Each new model
corrects the errors of the combined previous models.
• Instead of adjusting instance weights like AdaBoost, Gradient Boosting fits new
models to the residual errors (the difference between the actual and predicted values)
of the existing ensemble. Typically, Decision Trees are used as the base learners in
Gradient Boosting.
• The Learning Rate hyperparameter scales the contribution of each new model,
controlling the step size of each iteration. Setting a low learning rate usually improves
generalization by preventing overfitting, though it requires more iterations.
• To avoid overfitting, training can be stopped early when the validation error stops
improving.
• Stochastic Gradient Boosting: This technique introduces randomness by training each
model on a random subset of the data, which can reduce overfitting and improve
performance.
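The three-tree example discussed next can be built by hand by fitting each new tree to the previous residuals; the quadratic dataset below is a made-up stand-in for the training data.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic training data (single feature)
rng = np.random.RandomState(42)
X_q = rng.rand(100, 1)
y_q = 3 * (X_q[:, 0] - 0.5) ** 2 + 0.05 * rng.randn(100)

# Initial model: fit a first regression tree to the data
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X_q, y_q)

# Second tree: trained on the residual errors of the first tree
y2 = y_q - tree_reg1.predict(X_q)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X_q, y2)

# Third tree: trained on the residual errors of the second tree
y3 = y2 - tree_reg2.predict(X_q)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X_q, y3)

# The ensemble predicts by summing the predictions of all three trees
X_new = np.array([[0.6]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))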
The figure below shows the predictions of these three trees in the left column, and the
ensemble’s predictions in the right column. In the first row, the ensemble has just one tree, so
its predictions are exactly the same as the first tree’s predictions. In the second row, a new
tree is trained on the residual errors of the first tree. On the right you can see that the
ensemble’s predictions are equal to the sum of the predictions of the first two trees. Similarly,
in the third row another tree is trained on the residual errors of the second tree.
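Scikit-Learn's GradientBoostingRegressor builds such an ensemble directly; this sketch mirrors the three manual trees above (learning_rate=1.0 means the trees' contributions are not shrunk).

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X_q, y_q)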
Stacking
Stacking (or stacked generalization) is an ensemble method that combines multiple models to
improve prediction accuracy. Unlike simple aggregation methods like voting, stacking uses
another model to learn how to best combine the outputs of the base models.
• First Layer - Base Models: Multiple models are trained on the training dataset. These
can be any type of model (e.g., decision trees, linear regression, etc.). Each model
makes predictions on the training data.
• Hold-out Set: The training dataset is split into two subsets: one for training the base
models and one for creating the hold-out predictions. The base models make
predictions on the hold-out set, producing an array of predictions for each instance.
• Second Layer - Blender (Meta Learner): The predictions from the base models are
used as input features to train a new model called the blender or meta learner. The
target values for training the blender are the actual values from the hold-out set.
• Making Predictions: To make a prediction on new data, the input is first passed
through the base models. The predictions from the base models are then fed into the
blender, which makes the final prediction.
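Scikit-Learn offers a StackingClassifier that trains the blender on cross-validated predictions of the base models (a close relative of the hold-out approach described above); a minimal sketch with assumed base models:

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Base models in the first layer; a Logistic Regression blender in the second layer
stacking_clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC())],
    final_estimator=LogisticRegression())
stacking_clf.fit(X_train, y_train)
print(stacking_clf.score(X_test, y_test))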
Key Points
• Blender Training: The blender learns to optimize the combination of base model
predictions.
• Hold-out Set: Ensures that the predictions used to train the blender are independent
and not biased by the training data of the base models.
• Multiple Layers: It's possible to add multiple layers of models, but this increases
complexity and computational cost.
Question Bank
1. Define ensemble learning and explain the concept of the "wisdom of the crowd" in the
context of machine learning.
2. What is an ensemble method, and how does it improve prediction accuracy?
3. Describe the differences between bagging, boosting, and stacking.
4. Explain how a random forest combines multiple decision trees to make predictions.
5. What is a hard voting classifier? Describe how it works with an example.
6. Explain why combining multiple classifiers that are weak learners can still result in a
strong learner.
7. Differentiate between bagging and pasting in terms of their sampling methods.
8. Provide a real-world example where bagging could be used effectively.
9. Write a Python code snippet to implement a bagging classifier using Scikit-Learn with
decision trees as base estimators.
10. What is out-of-bag (OOB) evaluation, and how is it useful in bagging methods?
11. Explain how OOB evaluation provides an estimate of the ensemble's performance
without using a separate validation set.
12. How do the random patches and random subspaces methods increase diversity among
predictors?
13. Describe the benefits and potential drawbacks of using random patches in high-
dimensional data.
14. Explain the process of training a random forest, highlighting the role of bootstrapping
and feature selection.
15. Write a Python code snippet to train a random forest classifier using Scikit-Learn.
16. Compare and contrast a single decision tree with a random forest in terms of bias,
variance, and overall prediction accuracy.
17. Describe the AdaBoost algorithm and how it adjusts weights during training to
improve prediction accuracy.
18. What are the advantages and disadvantages of using AdaBoost compared to other
ensemble methods?
19. Explain how gradient boosting differs from AdaBoost and provide a step-by-step
explanation of its training process.
20. Write a Python code snippet to implement a basic gradient boosting model using
Scikit-Learn.
21. What is stacking in ensemble learning, and how does it differ from simple aggregation
methods like voting?
22. Describe the role of the meta-learner in stacking and its importance in improving
prediction accuracy.