Unit 4

A decision tree is a supervised learning technique that can be used for both classification and regression problems. It builds a tree-like model of decisions and their consequences. Random forest is an ensemble learning method that fits multiple decision trees on various sub-samples of the dataset and combines their predictions (averaging for regression, majority vote for classification) to improve predictive accuracy. Overfitting occurs when a model learns the detail and noise in the training data to the extent that its performance on new data suffers. Pruning helps reduce overfitting by removing parts of the decision tree that contribute little to classifying new data instances.

Uploaded by

Prathmesh Mane Deshmukh


Decision Trees

• A decision tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for classification problems.
• It is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
• A decision tree has two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and have no further branches.
Why use Decision Trees?

• Decision trees usually mimic the way humans think when making a decision, so they are easy to understand.
• The logic behind a decision tree is easy to follow because it is laid out as a tree-like structure.
Decision Tree Terminologies

• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which is then divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further once a leaf node is reached.
• Splitting: Splitting is the process of dividing a decision node or the root node into sub-nodes according to the given conditions.
• Branch/Sub Tree: A subsection of the tree formed by splitting a node.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child node: A node that is split into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called its child nodes. The root node is a parent with no parent of its own.
Working of Decision Tree Algorithm
• To predict the class of a given record, the algorithm starts from the root node of the tree.
• It compares the value of the root attribute with the corresponding attribute of the record and, based on the comparison, follows the matching branch to the next node.
• At the next node, the algorithm again compares the attribute value against the node's branches and moves further down. It continues this process until it reaches a leaf node of the tree.
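The traversal described above can be sketched in a few lines of Python. This is only an illustration: the nested-dictionary tree, the attribute names (outlook, humidity, windy) and the labels are made up for the example and do not come from the slides.

```python
# Hypothetical tree: internal nodes test one attribute, leaves hold a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": {"label": "no"}, "normal": {"label": "yes"}},
        },
        "overcast": {"label": "yes"},
        "rainy": {
            "attribute": "windy",
            "branches": {"true": {"label": "no"}, "false": {"label": "yes"}},
        },
    },
}


def predict(node, record):
    """Walk from the root to a leaf, following the branch that matches the record."""
    while "label" not in node:
        value = record[node["attribute"]]  # compare the node's attribute with the record
        node = node["branches"][value]     # follow the matching branch to the next node
    return node["label"]


record = {"outlook": "sunny", "humidity": "normal", "windy": "false"}
print(predict(tree, record))  # yes
```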
The complete process can be better understood using the algorithm below:
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets, one for each possible value of the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively build new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such a final node is called a leaf node.
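The five steps above can be sketched as a short recursive function. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes categorical attributes, uses weighted entropy as the attribute selection measure, and the helper names (`entropy`, `best_attribute`, `build_tree`) and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Impurity of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Step 2: pick the attribute whose split leaves the lowest weighted entropy."""
    def weighted_entropy(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(attributes, key=weighted_entropy)


def build_tree(rows, labels, attributes):
    # Stop when the node is pure or no attributes remain: emit a leaf node.
    if len(set(labels)) == 1 or not attributes:
        return {"label": Counter(labels).most_common(1)[0][0]}
    attr = best_attribute(rows, labels, attributes)      # Step 2
    node = {"attribute": attr, "branches": {}}           # Step 4
    remaining = [a for a in attributes if a != attr]
    for value in sorted({row[attr] for row in rows}):    # Step 3: one subset per value
        pairs = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        node["branches"][value] = build_tree(sub_rows, sub_labels, remaining)  # Step 5
    return node


# Step 1: the root starts with the complete (made-up) dataset.
rows = [
    {"outlook": "sunny", "windy": "false"},
    {"outlook": "sunny", "windy": "true"},
    {"outlook": "overcast", "windy": "false"},
    {"outlook": "rainy", "windy": "false"},
    {"outlook": "rainy", "windy": "true"},
]
labels = ["no", "no", "yes", "yes", "no"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree["attribute"])  # outlook
```

On this toy data the root splits on outlook, and only the "rainy" branch needs a second split on windy.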
Attribute Selection Measures

• While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve this problem there is a technique called Attribute Selection Measure (ASM).
• With this measure, we can select the best attribute for each node of the tree. There are two popular techniques for ASM:
1. Information Gain
2. Gini Index
1. Information Gain:

• Information gain is the measurement of the change in entropy after a dataset is segmented based on an attribute.
• It calculates how much information a feature provides about a class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute with the highest information gain is split first.
• It can be calculated using the formula below, where S_v is the subset of S for which attribute A takes the value v:
$$\text{Information Gain}(S, A) = \text{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$

Entropy: Entropy is a metric that measures the impurity of a given attribute. It specifies the randomness in the data. For a two-class problem, entropy can be calculated as:
$$\text{Entropy}(S) = -P(\text{yes}) \log_2 P(\text{yes}) - P(\text{no}) \log_2 P(\text{no})$$
Where,
• S = the set of samples
• P(yes) = probability of yes
• P(no) = probability of no
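The two formulas above can be checked numerically with a short sketch. The sample of eight labels and the attribute name "windy" are made up for illustration; the function names are likewise hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) = -sum(p * log2(p)) over the classes present in S."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy(S) minus the size-weighted entropy of the subsets induced by the attribute."""
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Made-up sample: 4 "yes" and 4 "no", split by a hypothetical attribute "windy".
labels = ["yes", "yes", "yes", "no", "yes", "no", "no", "no"]
windy = ["false", "false", "false", "false", "true", "true", "true", "true"]

print(round(entropy(labels), 3))                  # 1.0
print(round(information_gain(labels, windy), 3))  # 0.189
```

An evenly mixed sample has the maximum two-class entropy of 1.0; the split on "windy" leaves each subset at a 3:1 mix, so it removes only about 0.189 bits of uncertainty.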
2. Gini Index:

• The Gini index is a measure of impurity (or purity) used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high Gini index.
• The CART algorithm creates only binary splits, and it uses the Gini index to choose them.
• The Gini index can be calculated using the formula below, where P_j is the proportion of samples belonging to class j:
$$\text{Gini Index} = 1 - \sum_{j} P_j^2$$
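As a quick numerical check of the Gini formula, here is a small sketch. The function names and the label lists are assumptions made for the example; the weighted score for a binary split is the usual way CART compares candidate splits.

```python
from collections import Counter


def gini(labels):
    """Gini = 1 - sum_j p_j^2, where p_j is the fraction of samples in class j."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_of_binary_split(left, right):
    """Size-weighted Gini of the two children; a lower value means a purer split."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


print(gini(["yes", "yes", "no", "no"]))    # 0.5 (maximally mixed, two classes)
print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 (pure node)
print(round(gini_of_binary_split(["yes", "yes", "yes"], ["no", "no", "yes"]), 3))  # 0.222
```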
Advantages of the Decision Tree
• It is simple to understand, as it follows the same process a human follows when making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes of a problem.
• It requires less data cleaning than many other algorithms.
Disadvantages of the Decision Tree
• A decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be mitigated using the Random Forest algorithm.
• With more class labels, the computational complexity of the decision tree may increase.
Random Forest
• Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique.
• It can be used for both classification and regression problems in ML.
• It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
• "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."
• Instead of relying on one decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote of those predictions.
• A greater number of trees in the forest generally leads to higher accuracy and reduces the risk of overfitting.
[Figure: working of the Random Forest algorithm]
Why use Random Forest?
• It often takes less training time than many other algorithms.
• It predicts output with high accuracy and runs efficiently even on large datasets.
• It can maintain accuracy even when a large proportion of the data is missing.
How does Random Forest algorithm work?

• Random Forest works in two phases: the first creates the random forest by combining N decision trees, and the second makes a prediction with each tree created in the first phase.
• The working process can be explained in the steps below:
Step-1: Select K random data points from the training set.
Step-2: Build the decision tree associated with the selected data points (subset).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 and 2 until N trees have been built.
Step-5: For a new data point, find the prediction of each decision tree and assign the point to the category that wins the majority of votes.
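The steps above can be sketched in plain Python. To keep the example short, each "tree" is a one-split decision stump on a single numeric feature, which is a deliberate simplification and not how a full random forest grows its trees. The data, the function names and the fixed seed are assumptions made for the illustration.

```python
import random
from collections import Counter


def train_stump(points):
    """A one-split 'tree': pick the threshold on x that misclassifies the fewest points."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def train_forest(points, n_trees, k, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):                             # Steps 3-4: build N trees
        sample = [rng.choice(points) for _ in range(k)]  # Step 1: K random points
        forest.append(train_stump(sample))               # Step 2: one tree per subset
    return forest


def predict(forest, x):
    votes = Counter(tree(x) for tree in forest)          # Step 5: majority vote
    return votes.most_common(1)[0][0]


# Made-up 1-D data: small x values are class "a", large ones class "b".
data = [(1, "a"), (2, "a"), (3, "a"), (4, "a"), (6, "b"), (7, "b"), (8, "b"), (9, "b")]
forest = train_forest(data, n_trees=25, k=8)
print(predict(forest, 0), predict(forest, 10))
```

Individual stumps trained on an unlucky sample may be poor, but the majority vote over 25 of them smooths that out, which is the point of Step 5.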
Applications of Random Forest

• Banking: The banking sector mostly uses this algorithm to identify loan risk.
• Medicine: With the help of this algorithm, disease trends and disease risks can be identified.
• Land Use: Areas of similar land use can be identified with this algorithm.
• Marketing: Marketing trends can be identified using this algorithm.
Advantages of Random Forest

• Random Forest is capable of performing both classification and regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and reduces the overfitting issue.
Disadvantages of Random Forest
• Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
Overfitting and Pruning in Decision Trees: Improving the Model's Accuracy
What is Overfitting?
• Overfitting is a common problem that needs to be handled while training a decision tree model.
• Overfitting occurs when a model fits the training data too closely and becomes less accurate when it encounters new data or predicts future outcomes.
• In an overfit condition, a model memorizes the noise of the training data and fails to capture the essential patterns.
• In decision trees, in order to fit the data (even noisy data), the model keeps generating new nodes until the tree becomes too complex to interpret.
• The decision tree predicts well for the training data but can be inaccurate for new data. If a decision tree model is allowed to grow to its full depth, it can overfit the training data.
What is Pruning?

• Pruning is a technique that removes parts of the decision tree or prevents it from growing to its full depth.
• Pruning removes those parts of the decision tree that have little power to classify instances.
• Pruning can be of two types: Pre-Pruning and Post-Pruning.
• An unpruned tree is denser, more complex, and has a higher variance, which results in overfitting.
1. Pre-Pruning

• Pre-Pruning, also known as 'Early Stopping' or 'Forward Pruning', stops the growth of the decision tree, preventing it from reaching its full depth.
• It stops non-significant branches from being generated in the decision tree.
• Pre-Pruning involves tuning the hyperparameters before training the model.
• Pre-Pruning stops the tree-building process for leaves with few samples.
• During each stage of splitting the tree, the cross-validation error is monitored.
• If the value of the error does not continue to decrease, the tree's growth is stopped.
Hyperparameter Tuning for Pre-Pruning Decision Trees

• The hyperparameters that can be tuned for pre-pruning or early stopping are max_depth, min_samples_leaf, and min_samples_split.
1. max_depth: Specifies the maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. The larger the value of max_depth, the more complex the tree can be.
2. min_samples_leaf: Specifies the minimum number of samples required at a leaf node.
3. min_samples_split: Specifies the minimum number of samples required to split an internal node.
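These three names match the parameters of scikit-learn's `DecisionTreeClassifier`, so a hedged sketch with that library is shown here. The slides do not name a library or dataset; the Iris dataset, the split ratio and the specific hyperparameter values below are arbitrary choices for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Unrestricted tree: max_depth=None by default, so it grows until the leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruned tree: the three hyperparameters stop growth early (values are arbitrary).
pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, min_samples_split=10, random_state=0
).fit(X_train, y_train)

# Compare tree size and train/test accuracy of the two models.
for name, model in [("full", full), ("pre-pruned", pruned)]:
    print(name, model.get_depth(), model.get_n_leaves(),
          round(model.score(X_train, y_train), 3), round(model.score(X_test, y_test), 3))
```

In practice these values would be chosen by a search over candidate settings with cross-validation rather than fixed by hand.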
2. Post-Pruning

• Post-Pruning or 'backward pruning' is a technique that eliminates branches from a "completely grown" decision tree model to reduce its complexity and variance.
• This technique allows the decision tree to grow to its full depth, then removes branches to prevent the model from overfitting.
• By doing so, the model might slightly increase the training error but drastically decrease the testing error.
• In Post-Pruning, non-significant branches of the model are removed using the Cost Complexity Pruning (CCP) technique.
• Cost Complexity Pruning or 'Weakest Link Pruning' works by calculating a tree score based on the Sum of Squared Residuals (SSR) of the tree or subtree and a tree complexity penalty, which is a function of the number of leaves (terminal nodes) T in the tree or subtree.
• The SSR increases as the tree is pruned to fewer leaves, and the tree complexity penalty compensates for the difference in the number of leaves.
$$\text{Tree Score} = \text{SSR} + \alpha T$$
where alpha is a tuning parameter and T is the number of leaves.
• This pruning technique then calculates different values for alpha, giving a sequence of trees from the full-sized tree down to a single leaf.
• The procedure is repeated for each fold of a 10-fold cross-validation.
• The final value for alpha is the one that, on average, gave the lowest Sum of Squared Residuals on the testing data.
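To make the tree score concrete, the sketch below evaluates it for a hypothetical sequence of subtrees and shows how the preferred subtree shrinks as alpha grows. The leaf counts and SSR values are made up; only the formula comes from the text above.

```python
def tree_score(ssr, n_leaves, alpha):
    """Tree Score = SSR + alpha * T, where T is the number of leaves."""
    return ssr + alpha * n_leaves


# Hypothetical sequence of subtrees, from the full tree down to a single leaf.
# Each entry is (number of leaves, SSR); the values are invented for illustration.
subtrees = [(8, 10.0), (5, 16.0), (3, 30.0), (1, 90.0)]


def best_subtree(alpha):
    """Return the subtree with the lowest tree score for this alpha."""
    return min(subtrees, key=lambda t: tree_score(t[1], t[0], alpha))


for alpha in [0.0, 3.0, 10.0, 50.0]:
    leaves, ssr = best_subtree(alpha)
    print(alpha, leaves, tree_score(ssr, leaves, alpha))
# 0.0 8 10.0
# 3.0 5 31.0
# 10.0 3 60.0
# 50.0 1 140.0
```

With alpha = 0 the full tree always wins because only the SSR counts; larger values of alpha make each extra leaf more expensive, so smaller subtrees take over. scikit-learn exposes a related `ccp_alpha` parameter on its tree estimators.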
Ensemble methods
• Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree.
• The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner.
• Techniques for building ensembles of decision trees:
– 1. Bagging
– 2. Boosting
1. Bagging (Bootstrap Aggregation)
• It is used when the goal is to reduce the variance of a decision tree.
• The idea is to create several subsets of data from the training sample, chosen randomly with replacement.
• Each subset of the data is then used to train its own decision tree.
• As a result, we end up with an ensemble of different models.
• The average of the predictions from the different trees is used, which is more robust than a single decision tree.
• Random Forest is an extension of bagging. It takes one extra step: in addition to taking a random subset of the data, it also takes a random selection of features rather than using all features to grow each tree. A collection of many such random trees is called a Random Forest.
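The two sampling ideas in this section, bootstrap samples drawn with replacement and Random Forest's random feature subset, can be sketched as follows. To stay short, each "model" is only a stand-in that predicts the mean of its own bootstrap sample rather than a real tree; the targets, the feature names and the seed are invented for the example.

```python
import random
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features, k, rng):
    """Random Forest's extra step: each tree sees only k randomly chosen features."""
    return rng.sample(features, k)


rng = random.Random(42)
targets = [10.0, 12.0, 11.0, 30.0, 9.0, 13.0]    # made-up regression targets
features = ["age", "income", "score", "tenure"]  # hypothetical feature names

# Stand-in for a tree: each "model" just predicts the mean of its bootstrap sample.
models = [mean(bootstrap_sample(targets, rng)) for _ in range(200)]
bagged_prediction = mean(models)                 # aggregate by averaging

print(round(bagged_prediction, 2))
print(random_feature_subset(features, 2, rng))
```

Individual bootstrap means swing depending on whether the outlier 30.0 was drawn, while their average is much more stable, which is the variance reduction bagging aims for.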
2. Boosting
• It is an ensemble technique that creates a collection of predictors.
• In this technique, learners are trained sequentially: early learners fit simple models to the data, and the data is then analyzed for errors.
• In other words, we fit consecutive trees (on random samples), and at every step the goal is to reduce the net error of the prior tree.
• When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts the weak learners into a better-performing model.
• Gradient Boosting is an extension of the boosting method.
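The re-weighting idea can be illustrated with one round of an AdaBoost-style update. The slides do not name a specific boosting algorithm, so the AdaBoost formulas used here are an assumption chosen because they match the description of increasing the weights of misclassified inputs; the sample weights are made up.

```python
from math import exp, log


def reweight(weights, misclassified):
    """One AdaBoost-style round: raise the weights of misclassified samples, then renormalise."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * log((1 - error) / error)  # the learner's say; assumes 0 < error < 1
    updated = [w * exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.2] * 5                                # five samples, equal initial weights
misclassified = [False, False, True, False, False]  # the weak learner got sample 3 wrong
new_weights, alpha = reweight(weights, misclassified)

print([round(w, 3) for w in new_weights])  # [0.125, 0.125, 0.5, 0.125, 0.125]
print(round(alpha, 3))                     # 0.693
```

After one round the misclassified sample carries half of the total weight, so the next learner is pushed to get it right.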
Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
A supervised learning model takes direct feedback to check whether it is predicting the correct output. | An unsupervised learning model does not take any feedback.
A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights in an unknown dataset.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be classified into Clustering and Association problems.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, K-Nearest Neighbors (KNN), Multi-class Classification, Decision Tree, Bayesian Logic, etc. | It includes various algorithms such as clustering (for example, K-means) and the Apriori algorithm.
