Machine Learning
GridSearchCV
Prashant Sundge
LinkedIn GitHub
If you found this kernel helpful, please consider giving it an upvote! Your support motivates me to create
more valuable content.
I'd also love to hear your feedback and comments. Let me know if you have any questions, suggestions,
or insights. Your input is highly appreciated!
Happy coding!
Table of Contents
1. Introduction
2. Definition of a Decision Tree
3. Function of a Decision Tree
4. Import Libraries
5. Read Dataset
6. IF ELSE Representation of Decision Tree
7. Tree Representation of Dataset
8. Decision Tree Basics
9. Decision Tree Terminologies
10. Decision Tree Terminology Naming Conventions
11. Example of Decision Tree
12. ROOT NODE IF ELSE NODE
13. Building the Tree
14. Entropy
15. Entropy and Gini Impurity Ranges
16. Information Gain
17. Iris Dataset With Plain Decision Tree
18. Tree Function Definitions
19. Model Evaluations
20. Confusion Matrix Function Definition
21. Cancer Dataset with Entropy in Decision Tree
22. Accuracy and Model Evaluations
23. Gini Impurity
24. Gini Impurity Formula
25. Hyperparameters and GridSearchCV
26. What are Hyperparameters
27. The Power of GridSearchCV
28. Putting It All Together
29. Play_tennis Dataset with Gini Impurity and Grid Search
30. Label Encoder
31. What Are Regression Trees?
32. Mean Square Error
33. Building a Regression Tree
34. Step 1: Initial Split and Calculation of Predicted Outputs and Mean Square Error
35. Step 2: Repeated Split and Mean Square Error Calculation
36. Step 3: Choosing the Split Point
37. Regression Dataset Model Predictions
38. Regression Tree plotted
39. Regression confusion matrix plotted
40. What happens when there are multiple independent variables?
41. Reference
Introduction
Definition of a Decision Tree:
A Decision Tree is a hierarchical and tree-like structure used in machine learning for both classification and
regression tasks. It systematically divides data into subsets based on the values of input features, ultimately
leading to decisions or predictions. It consists of nodes, branches, and leaves, where nodes represent
feature attributes, branches represent decision rules, and leaves represent the final outcomes or predictions.
In classification, a Decision Tree helps classify data into different categories or classes, while in regression, it
predicts numerical values based on the input features. The simplicity and interpretability of Decision Trees
make them valuable tools in machine learning, allowing users to understand and visualize decision-making
processes.
Import Libraries
In [1]: import pandas as pd
Read Dataset
In [2]: example=pd.read_excel("Example1.xlsx")
In [3]: example
0  F  Student     ENGINEER
1  F  Programmer  JAVA
2  M  Programmer  PYTHON
3  F  Programmer  JAVA
4  M  Student     ENGINEER
5  M  Student     ENGINEER
• Root Node: The initial node at the beginning of a decision tree, where the entire population or dataset
starts dividing based on various features or conditions.
• Decision Nodes: Nodes resulting from the splitting of root nodes are known as decision nodes. These
nodes represent intermediate decisions or conditions within the tree.
• Leaf Nodes: Nodes where further splitting is not possible, often indicating the final classification or
outcome. Leaf nodes are also referred to as terminal nodes.
• Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a sub-section of a decision tree
is referred to as a sub-tree. It represents a specific portion of the decision tree.
• Pruning: The process of removing or cutting down specific nodes in a decision tree to prevent
overfitting and simplify the model.
• Branch / Sub-Tree: A subsection of the entire decision tree is referred to as a branch or sub-tree. It
represents a specific path of decisions and outcomes within the tree.
• Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known as a parent
node, and the sub-nodes emerging from it are referred to as child nodes. The parent node represents a
decision or condition, while the child nodes represent the potential outcomes or further decisions
based on that condition.
Decision Tree Terminology Naming Conventions
In [7]: play_tennis
• If Outlook is Sunny:
▪ Subnode: Humidity
◦ If Humidity is High: Don't Play Tennis
◦ If Humidity is Normal: Play Tennis
• If Outlook is Overcast: Play Tennis
• If Outlook is Rainy:
▪ Subnode: Wind
◦ If Wind is Weak: Play Tennis
◦ If Wind is Strong: Don't Play Tennis
A decision tree is a machine learning algorithm that makes decisions based on the values of attributes
(features) in a dataset. It works as follows:
Splitting: To create child nodes, the algorithm selects an attribute and splits the data based on its values.
The attribute selection is based on criteria like Gini Impurity and Entropy.
Entropy:
• Entropy measures the average information content in a dataset. For a node, it's calculated as:

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $Entropy(S)$ is the entropy of node $S$, $c$ is the number of classes, and $p_i$ is the probability of a randomly chosen data point belonging to class $i$.
Entropy is minimized when all data points in the node belong to a single class, and it is a measure of the
disorder in the data.
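A minimal Python sketch of this calculation (illustrative, not the notebook's own code):

import numpy as np
import pandas as pd

def entropy(labels):
    # Entropy(S) = -sum over classes of p_i * log2(p_i)
    p = pd.Series(labels).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

print(entropy(['yes'] * 9 + ['no'] * 5))   # ≈ 0.94 for a 9-positive / 5-negative node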
In [8]: play_tennis
Example
• To illustrate the equation, we will do an example that calculates the entropy of our dataset
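The worked calculation itself is not shown in this extract; for the play-tennis data, which has 9 positive and 5 negative instances (see the information-gain section below), it would be:

$$Entropy(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.94$$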
When a dataset is completely homogeneous (all data points belong to a single class), it has zero impurity,
and the entropy is zero (Equation 1.4).
In contrast, if the dataset can be equally divided into two classes, it is entirely non-homogeneous, resulting
in maximum impurity of 100%, and the entropy is one (Equation 1.3).
Impurity and entropy are measures used to quantify the level of disorder or uncertainty in a dataset, with
higher values indicating greater impurity and uncertainty, and lower values indicating greater homogeneity
and certainty.
Information Gain:
• Information Gain is a measure used to assess the effectiveness of an attribute in classifying a training
dataset. It quantifies the expected reduction in entropy achieved by partitioning the dataset based on
this attribute.
• Information Gain, denoted as Gain(S, A), is a function of an attribute A relative to a collection of data S:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$$

Where:
• $Values(A)$ is the set of all possible values of attribute $A$.
• $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$.
• The Information Gain measures how much uncertainty or impurity is reduced when you split the
dataset based on attribute A. A higher Information Gain indicates that attribute A is more effective in
making distinctions within the dataset.
• Decision tree algorithms use Information Gain (or similar criteria) to determine the best attribute for
splitting the data at each node, aiming to create a tree structure that maximizes the reduction in
impurity as it grows.
To make this clearer, let's use this equation to measure the information gain of the attribute Wind from
the dataset of Figure 1. The dataset has 14 instances, so the sample space is 14, where the sample has 9
positive and 5 negative instances. The attribute Wind can take the values Weak or Strong. Therefore,
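assuming Figure 1 is the classic play-tennis table (Wind = Weak on 8 days with 6 positives, and Wind = Strong on 6 days with 3 positives), the calculation would be:

$$Entropy(S_{Weak}) = -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811,\qquad Entropy(S_{Strong}) = 1.0$$

$$Gain(S, Wind) = 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.0) \approx 0.048$$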
These two calculations should make it clear how we can compute information gain. The information
gain of the 4 attributes of the Figure 1 dataset are:
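$$Gain(S, Outlook) \approx 0.246,\quad Gain(S, Humidity) \approx 0.151,\quad Gain(S, Wind) \approx 0.048,\quad Gain(S, Temperature) \approx 0.029$$

(These figures assume Figure 1 is the classic play-tennis table, consistent with the 9/5 split quoted above.)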
It's crucial to keep in mind that the primary objective of assessing information gain is to pinpoint the
attribute that is most valuable for classifying the training set. Our ID3 algorithm will employ this selected
attribute as the root from which to construct the decision tree. Subsequently, it will once more compute
information gain to determine the attribute for the next node.
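A minimal Python sketch of the information-gain computation (the entropy helper is repeated so the snippet is self-contained; the column names are assumptions):

import numpy as np
import pandas as pd

def entropy(labels):
    # Entropy over the class proportions of the target column
    p = pd.Series(labels).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df, attribute, target='play'):
    # Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)
    total = entropy(df[target])
    weighted = sum(
        len(subset) / len(df) * entropy(subset[target])
        for _, subset in df.groupby(attribute)
    )
    return total - weighted

# e.g. information_gain(play_tennis, 'outlook') should be the largest of the four attributes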
Based on our calculations, it is evident that the attribute providing the most substantial information gain is
"Outlook." This attribute will serve as the foundational root of our decision tree.
In Figure 3, we present a visual representation of the decision tree constructed during the initial stage of the
ID3 algorithm. Here's a breakdown of the process:
• The training examples are effectively sorted into their respective descendant nodes within the tree
structure.
• One of the descendant nodes, labeled as "Overcast," contains only positive instances and, as a result, is
transformed into a leaf node with the classification "Yes."
• For the remaining two nodes, a critical question emerges: Which attribute should be chosen for further
testing? To address this, we extend these nodes by selecting attributes that offer the highest
information gain concerning the new subset of examples.
• The subsequent step involves identifying the attribute that is most suitable for testing within the
"Sunny" descendant node.
The Dataset in Figure 1 has the value Sunny on Day1, Day2, Day8, Day9, Day11. So the Sample Space S=5
here.
We can now measure the information gain of Temperature and Wind by following the same way we
measured Gain(S, Humidity). Finally, we will get:
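$$Gain(S_{sunny}, Humidity) \approx 0.970,\quad Gain(S_{sunny}, Temperature) \approx 0.570,\quad Gain(S_{sunny}, Wind) \approx 0.019$$

(Again assuming the classic play-tennis table behind Figure 1.)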
At this stage, Humidity emerges as the attribute that yields the highest information gain. Therefore, in the
"Sunny" descendant node following "Outlook," the attribute chosen is "Humidity."
• The "High" descendant node exclusively contains negative examples, while the "Normal" descendant
node exclusively contains positive examples. Consequently, both of these nodes transition into leaf
nodes and cannot be expanded further.
• If we apply the same procedure to extend the "Rain" descendant, we find that the attribute "Wind"
provides the most information. I'll leave this part for readers to perform the calculations themselves.
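The cells that load the iris data and create the train/test split are not shown in this extract; a minimal sketch of the assumed setup (test_size and random_state are assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()   # Bunch object with .data (features) and .target (class labels)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)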
In [11]: x=iris.data
y=iris.target
In [12]: y
Out[12]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [14]: clf=DecisionTreeClassifier()
clf.fit(X_train, y_train)
Out[14]: DecisionTreeClassifier()
In [15]: y_pred=clf.predict(X_test)
Accuracy: 1.00
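The cell that printed this accuracy is not shown; it presumably resembled:

from sklearn.metrics import accuracy_score
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")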
Tree Function Definitions
In [17]: import matplotlib.pyplot as plt
         from sklearn.tree import plot_tree

         def display_tree(model):
             # Create a figure with a larger size and set the background color
             plt.figure(figsize=(15, 10))
             plt.rcParams['axes.facecolor'] = 'lightgray'
             # Render the fitted tree; these keyword arguments come from the original cell
             plot_tree(
                 model,
                 rounded=True,
                 proportion=True,
                 precision=2,
                 fontsize=12,
             )
             plt.show()
In [18]: display_tree(clf)
Model Evaluations
In [19]: from sklearn.metrics import confusion_matrix,classification_report, accuracy_score, ConfusionMatrixDisplay
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
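The confusion_matrix_fun helper called later for the cancer model is not shown in this extract; a minimal sketch consistent with the imports above:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def confusion_matrix_fun(y_true, y_pred):
    # Compute the confusion matrix and draw it as a heatmap
    cm = confusion_matrix(y_true, y_pred)
    ConfusionMatrixDisplay(confusion_matrix=cm).plot(cmap='Blues')
    plt.show()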
load_breast_cancer
load_diabetes
load_digits
load_files
load_iris
load_linnerud
load_sample_image
load_sample_images
load_svmlight_file
load_svmlight_files
load_wine
In [24]: cancer=load_breast_cancer()
In [25]: x=cancer.data
y=cancer.target
In [26]: y
Out[26]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0,
0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,
1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,
1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
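The cells that split the cancer data and fit the entropy-criterion tree are not shown; an assumed sketch (the split parameters are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
cancer_model = DecisionTreeClassifier(criterion='entropy')   # matches the Out[29] repr that follows
cancer_model.fit(X_train, y_train)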
Out[29]: DecisionTreeClassifier(criterion='entropy')
In [30]: y_cancer_predict=cancer_model.predict(X_test)
Accuracy: 0.97
In [32]: display_tree(cancer_model)
In [33]: confusion_matrix_fun(y_test, y_cancer_predict)
Gini Impurity
Que- How can we determine the optimal feature for partitioning the dataset and what criteria should we
use to evaluate the quality of these partitions when constructing a decision tree?
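The formula itself is not visible in this extract; for a node whose classes occur with probabilities $p_i$, Gini Impurity is defined as:

$$Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$$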
The next step is to calculate the Gini Impurity for the 4 features (outlook, temp, humidity, windy), and
decide which feature will be the root node.
Let's calculate the Gini Impurity for Outlook. As you may notice, the outlook feature is a categorical variable
with three possible values (sunny, overcast, and rainy).
When outlook = sunny the split is (2 yes / 3 no), for outlook = overcast it is (4 yes / 0 no), and finally for
outlook = rainy it is (3 yes / 2 no).
We'll calculate the Gini Impurity of outlook by weighting the impurity of each branch by how many
elements it has.
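Working through this with the counts above (5 sunny, 4 overcast, and 5 rainy rows out of 14):

$$Gini(sunny) = 1 - \left(\tfrac{2}{5}\right)^2 - \left(\tfrac{3}{5}\right)^2 = 0.48,\quad Gini(overcast) = 0,\quad Gini(rainy) = 0.48$$

$$Gini(Outlook) = \tfrac{5}{14}(0.48) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.48) \approx 0.343$$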
Congratulations! You have just calculated the Gini Impurity for the first feature. The next step is to calculate
the Gini Gain, which is obtained by subtracting the weighted impurities of the branches from the original impurity.
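Using the dataset's 9-yes / 5-no class split:

$$Gini(S) = 1 - \left(\tfrac{9}{14}\right)^2 - \left(\tfrac{5}{14}\right)^2 \approx 0.459,\qquad GiniGain(S, Outlook) \approx 0.459 - 0.343 = 0.116$$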
The best split is chosen by maximizing the Gini Gain or by minimizing the Gini Impurity.
In our example, Outlook has the minimum Gini Impurity value and the maximum Gini Gain value, so it
will be chosen as the root decision to split our data.
• Criterion: Think of this as the decision-making principle for your model. It can be either "gini" or
"entropy," determining how your model chooses which questions to ask during training.
• Splitter: This hyperparameter is all about how the model makes choices. It can "split" by selecting the
"best" feature or do it "randomly." Like flipping a coin to decide.
• Max Depth: Imagine this as a tree in your backyard. The "max depth" is like deciding how tall this tree
can grow. You can set it to a number (like 10), or you can let it grow as tall as it wants (None).
• Min Samples Split: This hyperparameter tells the model how many samples need to be at a branch
before it splits. It's like saying, "Hey, only split if there are at least 5 apples on this branch."
• Min Samples Leaf: Now, think of this as a rule for when to stop growing a branch. It tells the model
not to make a new branch if there are fewer than a certain number of samples left.
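A short sketch showing these hyperparameters on a DecisionTreeClassifier (the values are illustrative, not recommendations):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion='gini',        # or 'entropy'
    splitter='best',         # or 'random'
    max_depth=10,            # None lets the tree grow fully
    min_samples_split=5,     # a node needs at least 5 samples before it can split
    min_samples_leaf=2,      # every leaf must keep at least 2 samples
)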
• Grid Search: It's like having a map of the entire haystack, marking specific spots where you think the
needle might be. In our case, these spots are different combinations of hyperparameter values.
• Random Search: Imagine instead of a map, you randomly drop pins into the haystack. Grid Search
checks every marked spot, while Random Search explores some of them, hoping to find the needle
faster.
• Manual Search: Sometimes, you don't need a map; you know the haystack well. You manually choose
where to look for the needle. This is like setting hyperparameters based on your intuition.
• Bayesian Optimization: This is like having a detective that learns from previous attempts. It doesn't
waste time revisiting spots where the needle isn't. It adapts and focuses on promising areas.
• Genetic Algorithms: Think of this as evolution. It starts with a population of possibilities and creates
new ones by mixing and mutating them. Over time, it gets closer to finding the best hyperparameters.
You define a grid of hyperparameters, setting the values you want to explore.
• You define a grid of hyperparameters, setting the values you want to explore.
• GridSearchCV, or one of the other methods, goes through each combination of hyperparameters and
trains your model with them.
• It evaluates the model's performance using cross-validation and selects the best set of
hyperparameters based on an evaluation metric like accuracy or error.
• Finally, you train your model using the best hyperparameters, making it perform at its peak on your
specific data.
Hyperparameter tuning is like finding the best settings for your machine learning model. It's a bit like
tuning a musical instrument - finding just the right notes to play. With techniques like GridSearchCV, you
can make your models sing beautifully on your data, and that's what makes you a powerful data scientist.
Remember, finding the right hyperparameters is not a one-time thing. It's an iterative process that requires
experimentation and fine-tuning. So, keep exploring, keep learning, and keep improving your models to
achieve the best results. Happy modeling!
Label Encoder
In [36]: from sklearn.preprocessing import LabelEncoder
In [37]: le=LabelEncoder()
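The cell that applies the encoder to the tennis columns is not shown; a plausible sketch (the source frame and column names are assumptions):

# Assumed: tennis is a copy of the play_tennis frame with its categorical columns encoded
tennis = play_tennis.copy()
for col in ['outlook', 'temp', 'humidity', 'windy', 'play']:
    tennis[col] = le.fit_transform(tennis[col])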
In [39]: tennis
    outlook  temp  humidity  windy  play
0         2     1         0      1     0
1         2     1         0      0     0
2         0     1         0      1     1
3         1     2         0      1     1
4         1     0         1      1     1
5         1     0         1      0     0
6         0     0         1      0     1
7         2     2         0      1     0
8         2     0         1      1     1
9         1     2         1      1     1
10        2     2         1      0     1
11        0     0         1      0     1
12        2     2         1      0     1
13        1     2         0      0     0
In [40]: y=tennis['play']
Out[42]: 0 0
1 0
2 1
3 1
4 1
5 0
6 1
7 0
8 1
9 1
10 1
11 1
12 1
13 0
Name: play, dtype: int32
In [43]: x
    outlook  temp  humidity  windy
0         2     1         0      1
1         2     1         0      0
2         0     1         0      1
3         1     2         0      1
4         1     0         1      1
5         1     0         1      0
6         0     0         1      0
7         2     2         0      1
8         2     0         1      1
9         1     2         1      1
10        2     2         1      0
11        0     0         1      0
12        2     2         1      0
13        1     2         0      0
In [47]: tennis_model=DecisionTreeClassifier()
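The cells defining param_grid and the train/test split for the tennis data are not shown; a plausible sketch (the grid values and split parameters are assumptions):

from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical grid; the notebook's actual values are not visible
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 4, 6],
    'min_samples_split': [2, 5, 10],
}
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)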
In [48]: # Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(tennis_model, param_grid, cv=5)
grid_search.fit(x, y)
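The best_estimator used in the next cell would be taken from the fitted search, for example:

best_estimator = grid_search.best_estimator_   # tree refit with the best hyperparameters found
print(grid_search.best_params_)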
In [49]: tennis_pred_y=best_estimator.predict(X_test)
Accuracy: 0.97
In [51]: display_tree(best_estimator)
accuracy 1.00 3
macro avg 1.00 1.00 1.00 3
weighted avg 1.00 1.00 1.00 3
What Are Regression Trees?
A regression tree is a machine learning model that is used for regression tasks, where the goal is to predict
continuous, numerical values (outputs) rather than discrete categories. It functions similarly to a decision
tree, but instead of making categorical decisions, it makes splits and decisions to estimate and predict
numeric outcomes.
• A classification tree chooses its splits using measures like Entropy and Information Gain. However, when
we're predicting continuous values, we can't use the same approach.
• We need a different way to measure how much our predictions differ from the actual target, and that's
where the Mean Square Error comes in.
It helps us understand how far off our predictions are from the real values we want to predict.
Source: https://medium.com/analytics-vidhya/regression-trees-decision-tree-for-regression-machine-learning-e4d7525d8047
Step 1: Initial Split and Calculation of Predicted Outputs and Mean Square
Error
• Sort the data based on X (already sorted in this case).
• Calculate the average of the first 2 rows in variable X, which is (1+2)/2 = 1.5 according to the given
dataset.
• Divide the dataset into two parts (Part A and Part B) based on the condition: X < 1.5 and X ≥ 1.5.
• Part A consists of only one point, which is the first row (1,1), and all the other points are in Part B.
• Calculate the average of all Y values in Part A and Part B separately. These two values are the predicted
output of the decision tree for X < 1.5 and X ≥ 1.5, respectively.
• Using the predicted and original values, calculate the Mean Square Error (MSE) and note it down.
This process creates a decision tree for regression by iteratively finding the best split points based on the
lowest Mean Square Error, which helps in making accurate predictions for continuous variables.
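A minimal sketch of this split-and-score loop on a toy 1-D dataset (the values are illustrative, not the notebook's df_reg):

import numpy as np

# Toy data, already sorted by X
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.0, 1.2, 3.5, 3.7, 4.0])

def split_mse(threshold):
    # Predict the mean of each side of the split and score the whole set with MSE
    left, right = y[X < threshold], y[X >= threshold]
    pred = np.where(X < threshold, left.mean(), right.mean())
    return float(np.mean((y - pred) ** 2))

# Candidate thresholds are the midpoints between consecutive X values
thresholds = (X[:-1] + X[1:]) / 2
best = min(thresholds, key=split_mse)
print(best, split_mse(best))   # the split with the lowest MSE becomes the root decision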
In [57]: x=df_reg
Out[60]: DecisionTreeClassifier()
In [61]: y_pred_reg=reg_model.predict(X_test)
Accuracy: 0.67
accuracy 0.67 3
macro avg 0.50 0.50 0.50 3
weighted avg 0.67 0.67 0.67 3
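Note that the cells above fit a DecisionTreeClassifier and report classification metrics on the regression data. A sketch of the regression counterpart, assuming the same kind of train/test split (the variable names are assumptions):

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

reg_tree = DecisionTreeRegressor()   # split points are chosen by minimising squared error
reg_tree.fit(X_train, y_train)
print(mean_squared_error(y_test, reg_tree.predict(X_test)))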
The logic behind the algorithm itself is not rocket science. All we are doing is splitting the dataset by
selecting the points that best split the data and minimise the mean square error. The way we select these
points is by going through an iterative process of calculating the mean square error for all the splits and
choosing the split that has the least value for the MSE. So it's only natural that this works.
What happens when there are multiple independent variables?
• Let us consider that there are 3 variables similar to the independent variable X from fig 2.2.
• At each node, all 3 variables would go through the same process that X went through in the above
example. The data would be sorted based on the 3 variables separately.
• The split points that minimise the MSE are calculated for all 3 variables. Out of the 3 variables and the
points calculated for them, the one that yields the least MSE would be chosen.
References
www.analyticsvidhya.com
https://medium.com/@jairiidriss
https://scikit-learn.org
https://medium.com/analytics-vidhya/regression-trees-decision-tree-for-regression-machine-learning-e4d7525d8047