
Practical Data Science

LVC 1: Decision Trees


A Decision Tree (DT) is a supervised learning algorithm used for classification (spam / not spam)
as well as regression (pricing a car or a house) problems.

A decision tree is like a flowchart where each internal node represents a test on an attribute and
each branch represents the outcome of that test. In a classification problem, each leaf node
represents a class label, i.e., the decision made after evaluating the attributes along the path, and the path from
the root node to a leaf represents a classification rule, also called a decision rule.

Let’s consider a simple example: “Who gets a loan?” Here, the decision tree represents a
sequence of questions that the bank might ask an applicant to figure out whether that applicant is
eligible for the loan or not. Each internal node represents a question and each leaf node represents
the class label - Get loan / Don’t get loan. As we move along the edges, we get decision rules. For
example, if an applicant is under 30 years old and has a salary of less than $2,500, then the applicant
won’t get a loan.
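The decision rule quoted above can be read directly as code. Below is a minimal sketch in Python, where the age-30 and $2,500 thresholds come from the example; the outcomes of the other branches are assumptions for illustration, since the full tree from the figure is not reproduced here.

```python
# Sketch of the loan decision rules above. Only the quoted rule (age under 30 and
# salary under $2,500 -> no loan) comes from the text; the other branch outcomes
# are assumed placeholders for illustration.
def loan_decision(age, salary):
    if age < 30:
        if salary < 2500:
            return "Don't get loan"   # decision rule from the example
        return "Get loan"             # assumed outcome for this branch
    return "Get loan"                 # assumed outcome for applicants aged 30+

print(loan_decision(25, 2000))        # Don't get loan
```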


Decision trees are among the most popular supervised learning algorithms. They have many
advantages but a few limitations as well.

Advantages of Decision Trees:

● Human-Algorithm Interaction
○ Simple to understand and interpret
○ Mirrors human decision-making more closely
○ Uses an open-box model, i.e., can visualize and understand the machine learning logic
(as opposed to a black-box model which is not interpretable)

● Versatile
○ Able to handle both numerical and categorical data
○ Powerful - can model arbitrary functions as long as we have sufficient data
○ Requires little data preparation
○ Performs well with large datasets

● Built-in feature selection


○ Naturally de-emphasizes irrelevant features
○ Develops a hierarchy in terms of the relevance of features

● Testable: Possible to validate a model using statistical tests

Limitations of Decision Trees:

● Trees can be non-robust: A small change in the training data can result in a large change in
the tree and consequently the final predictions.

● The problem of learning an optimal decision tree is known to be NP-Complete


○ Practical decision-tree learning algorithms are based on heuristics (greedy algorithms)
○ Such algorithms cannot guarantee obtaining the globally optimal decision tree

● Overfitting: Decision-tree learners can create over-complex trees that do not generalize well
beyond the training data.

Before we move further, let’s answer a simple question:

Why do we need decision trees? Why can’t we use linear classifiers?

Consider a scenario for binary classification. Let two continuous independent variables be X1 and X2
and the dependent variable be the color of the data point, i.e., either Red or Blue. The decision tree is
built top-down from a root node and involves partitioning the data into subsets that contain instances
with similar values (homogeneous). The data points, a sample decision tree, and the resulting homogeneous
subsets are given in the diagram below, which shows a tree of depth 2.

From the above figure, we observe that a simple depth-2 classification tree results in fewer
misclassifications than any linear classifier we could fit to separate these two classes (blue
and red), because a single straight line cannot separate the blue and red points well.
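As a rough sketch of this comparison (not the data from the figure), assume synthetic red/blue points where the blue class occupies one rectangular corner of the feature space. A depth-2 tree carves out that corner with two axis-aligned splits, while a single linear boundary cannot:

```python
# Sketch: depth-2 decision tree vs. a linear classifier on synthetic red/blue data.
# The dataset is an assumption standing in for the figure's points.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))                   # features X1, X2
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)    # "blue" = 1 inside the corner

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
linear = LogisticRegression().fit(X, y)

print("depth-2 tree accuracy:", tree.score(X, y))      # close to 1.0
print("linear model accuracy:", linear.score(X, y))    # noticeably lower
```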

How powerful are Decision Trees?

A decision tree can realize any Boolean function. The diagram below shows a decision tree for
the XOR Boolean function (which is false if A and B are both true or both false, and true otherwise).

Note: Realization is not unique as there can be many trees for the same function.
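A minimal sketch (assuming scikit-learn is available) showing that a depth-2 tree realizes XOR exactly on its four input combinations:

```python
# Sketch: a depth-2 decision tree reproduces the XOR truth table, even though
# XOR is not separable by any single linear boundary.
from sklearn.tree import DecisionTreeClassifier

inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]   # all combinations of A and B
xor = [0, 1, 1, 0]                          # XOR truth table

tree = DecisionTreeClassifier(max_depth=2).fit(inputs, xor)
print(tree.predict(inputs))                 # [0 1 1 0] -- the tree realizes XOR
```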

What are the steps to building a decision tree?

The algorithm follows the below steps to build a decision tree:

1. Pick a feature

2. Split the data based on that feature such that the outcome is binary,
i.e., no data point belongs to both sides of the split
3. Define the new decision rule
4. Repeat the process until each leaf node is homogeneous, i.e., all the data points in a leaf node
belong to the same class

Remark: At each split, we need to try all the different combinations of values for all the features, which makes
the algorithm computationally expensive.
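A minimal recursive sketch of these steps, assuming a hypothetical best_split(points, labels) helper that returns the feature index and split value; how to choose that split well is exactly the question addressed next:

```python
# Sketch of the recursive build loop. `best_split` is a hypothetical helper that
# picks the feature/value to split on (steps 1-2); it is not defined here.
def build_tree(points, labels, best_split, depth=0, max_depth=5):
    # Stopping rule (step 4): the node is homogeneous or the depth limit is hit.
    if len(set(labels)) <= 1 or depth == max_depth:
        return {"leaf": True, "label": max(set(labels), key=labels.count)}

    feature, value = best_split(points, labels)
    left = [i for i, x in enumerate(points) if x[feature] <= value]
    right = [i for i, x in enumerate(points) if x[feature] > value]
    if not left or not right:                      # no useful split was found
        return {"leaf": True, "label": max(set(labels), key=labels.count)}

    # The decision rule at this node (step 3) is "x[feature] <= value".
    return {
        "leaf": False, "feature": feature, "value": value,
        "left": build_tree([points[i] for i in left], [labels[i] for i in left],
                           best_split, depth + 1, max_depth),
        "right": build_tree([points[i] for i in right], [labels[i] for i in right],
                            best_split, depth + 1, max_depth),
    }
```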

Now, the question arises: how do we find the best split? How to identify the feature and the value
of that feature to split the data at each level? This motivates the use of entropy as an impurity
measure.

Entropy: It is a measure of the uncertainty or diversity embedded in a random variable. Suppose 𝑍 is a
random variable with the probability mass function 𝑃(𝑍); then the entropy of 𝑍 is given as:

𝐻(𝑍) = − ∑_(𝑧 ϵ 𝑍) 𝑃(𝑧) 𝑙𝑜𝑔(𝑃(𝑧)) = − 𝐸(𝑙𝑜𝑔(𝑃(𝑍)))

Where 𝐸 represents the expected value. For all calculations, we will use the log with base 2.
Note: Since the log of values between 0 and 1 is negative, the minus sign helps to avoid negative
values of entropy.

Let’s consider an example of a coin flip where the probability of heads is equal to 𝑝 and the
probability of tails is equal to 1 − 𝑝. Then, the entropy is given as:
− 𝑝 𝑙𝑜𝑔𝑝 − (1 − 𝑝) 𝑙𝑜𝑔(1 − 𝑝)

If we plot the entropy as a function of 𝑝, we get a graph like this:

From the above graph, we can observe that the entropy is minimum, i.e., 0, when 𝑝 = 0 or 𝑝 = 1, i.e.,
when we can only get heads or only tails, which implies that the entropy is minimum when the outcome
is homogeneous. Also, the entropy is maximum, i.e., 1, when 𝑝 = 0.5, which
implies that the entropy is maximum when both outcomes are equally
likely. So, we can say that the lower the impurity, the lower the entropy.
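A small sketch to reproduce this curve numerically, using log base 2 as in the rest of the document (numpy and matplotlib assumed available):

```python
# Sketch: entropy of a coin flip as a function of p, with log base 2.
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                      # avoid log(0)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, entropy)
plt.xlabel("p (probability of heads)")
plt.ylabel("entropy H(p)")
plt.show()                                              # 0 at p = 0 or 1, peak of 1 at p = 0.5
```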

In a similar way, we can calculate the entropy of two variables, i.e., the entropy for the joint
distribution of 𝑋 and 𝑌. This is called joint entropy of two variables and we can extend the formula of
the entropy for a single variable to two variables as shown below:

𝐻(𝑋, 𝑌) = − ∑_((𝑥, 𝑦) ϵ 𝑋 × 𝑌) 𝑃(𝑥, 𝑦) 𝑙𝑜𝑔(𝑃(𝑥, 𝑦))

Where 𝑃(𝑥, 𝑦) represents the joint distribution of 𝑋 and 𝑌.

So far, we have seen the entropy of a single random variable and of the joint distribution of two variables.
But in decision trees, we also want to find the entropy of the target variable for a given split, i.e., the
weighted average of the entropies of 𝑌 within each subset created by the split. This
is called conditional entropy and is denoted as 𝐻(𝑌 | 𝑋).

In a decision tree, our aim is to find the feature and the corresponding value such that if we split the
data, the reduction of entropy in the target variable given the split is highest, i.e., the difference
between the entropy of 𝑌 and the conditional entropy 𝐻(𝑌 | 𝑋) is maximized. This is called
information gain. Mathematically, it is written as:
𝐼𝐺(𝑌 | 𝑋) = 𝐻(𝑌) − 𝐻(𝑌 | 𝑋)

Our aim is to maximize the information gain at each split, or in other words, to minimize 𝐻(𝑌 | 𝑋), since 𝐻(𝑌)
is constant.

Empirical Computation of Entropy

Now that we know the theory behind entropy and information gain, let’s go through the steps to find
the best split in a decision tree.

1. Start with the complete training dataset


2. Pick a feature, say 𝑋(𝑚)
3. Describe the data based on that feature, i.e., {(𝑥𝑖(𝑚), 𝑦𝑖), 𝑖 = 1, 2, ...., 𝑁}

4. Split the data into two nodes, say 𝑆1 and 𝑆2, based on one of the values of the same feature

5. Compute 𝐻(𝑆1) and 𝐻(𝑆2)

6. Find the entropy of the split using the formula 𝑃(𝑆1) * 𝐻(𝑆1) + 𝑃(𝑆2) * 𝐻(𝑆2)

7. Compute information gain for the split


8. Repeat the process for other features and pick the one that maximizes information gain

Remark: We can see that the algorithm makes the locally optimal choice, i.e., it chooses the split that
maximizes the information gain at that stage, and does not try to find the globally optimal solution.
Hence, the decision tree is considered a greedy algorithm.
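A minimal sketch of steps 1-8 for a single binary split (log base 2); the function names are illustrative, not from any library:

```python
# Sketch: entropy, entropy of a split, and information gain for a binary split
# of the form X == split_value vs. X != split_value.
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum p * log2(p) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels, split_value):
    """IG(Y | X) = H(Y) - [P(S1) * H(S1) + P(S2) * H(S2)]."""
    s1 = [y for x, y in zip(feature_values, labels) if x == split_value]
    s2 = [y for x, y in zip(feature_values, labels) if x != split_value]
    n = len(labels)
    split_entropy = (len(s1) / n) * entropy(s1) + (len(s2) / n) * entropy(s2)
    return entropy(labels) - split_entropy
```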

Finally, let’s see a simple example to understand the computations involved in the steps
mentioned above. Consider the following dataset with 2 independent variables, 𝑋1 and 𝑋2, and one

target variable, 𝑌, where all the variables can only take Boolean values.

𝑋1   𝑋2   𝑌
T    T    T
T    F    T
T    T    T
T    F    T
F    T    T
F    F    F
F    T    F
F    F    F

Here, we have two choices to split the data. We can split the data on 𝑋1 = T vs. 𝑋1 = F, or on 𝑋2 = T vs.
𝑋2 = F. We need to find out which feature maximizes the information gain to find the best split.

First, let’s compute 𝐻(𝑌) (recall that we use log with base 2 for computations). In the target variable,
the class True occurs 5 out of 8 times and False occurs 3 out of 8 times. So,

𝐻(𝑌) = − 𝑝 𝑙𝑜𝑔(𝑝) − (1 − 𝑝) 𝑙𝑜𝑔(1 − 𝑝) = − (5/8) * 𝑙𝑜𝑔(5/8) − (3/8) * 𝑙𝑜𝑔(3/8) = 0.954

Now, the split on 𝑋1 will divide the target variable data into two disjoint sets, one for 𝑋1 = T (say 𝑆1)
and the other for 𝑋1 = F (say 𝑆2). The rows in the table below show the count of target classes in each
node after the split.

          𝑌 = T    𝑌 = F
𝑋1 = T      4        0
𝑋1 = F      1        3

Computing 𝐻(𝑆1) and 𝐻(𝑆2),

𝐻(𝑆1) = − (4/4) * 𝑙𝑜𝑔(4/4) − (0/4) * 𝑙𝑜𝑔(0/4) = 0, because 𝑙𝑜𝑔(1) = 0 and 0 * 𝑙𝑜𝑔(0) = 0

𝐻(𝑆2) = − (1/4) * 𝑙𝑜𝑔(1/4) − (3/4) * 𝑙𝑜𝑔(3/4) = 0.811

So, the entropy of the split is,


𝑃(𝑆1) * 𝐻(𝑆1) + 𝑃(𝑆2) * 𝐻(𝑆2) = (4/8) * 0 + (4/8) * 0.811 = 0.4055

The information gain by splitting on feature 𝑋1 is,

𝐼𝐺 = 𝐻(𝑌) − 0.4055 = 0.954 − 0.4055 = 0.5485

Similarly, the split on 𝑋2 will divide the target variable data into two disjoint sets, say 𝑆1 and 𝑆2.

          𝑌 = T    𝑌 = F
𝑋2 = T      3        1
𝑋2 = F      2        2

Computing 𝐻(𝑆1) and 𝐻(𝑆2) for 𝑋2,

𝐻(𝑆1) = − (3/4) * 𝑙𝑜𝑔(3/4) − (1/4) * 𝑙𝑜𝑔(1/4) = 0.811

𝐻(𝑆2) = − (2/4) * 𝑙𝑜𝑔(2/4) − (2/4) * 𝑙𝑜𝑔(2/4) = 1

So, the entropy of the split is,

𝑃(𝑆1) * 𝐻(𝑆1) + 𝑃(𝑆2) * 𝐻(𝑆2) = (4/8) * 0.811 + (4/8) * 1 = 0.9055

The information gain by splitting on feature 𝑋2 is,

𝐼𝐺 = 𝐻(𝑌) − 0.9055 = 0.954 − 0.9055 = 0.0485

Since the information gain is maximized by splitting on the feature 𝑋1, it is the best split for the data.

Remark: Observe that the entropy of a split will always be less than or equal to the entropy of the
target variable. Both the entropies will be equal (and consequently, the information gain will be zero) if
the feature we are splitting on is independent of the target variable and provides no information about
the target variable.
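A short sketch that re-checks these hand computations (tiny differences in the last decimal place come from the rounded intermediate values used in the text):

```python
# Sketch: verify H(Y), the split entropies, and the information gains above.
from math import log2

def H(p):                         # entropy of a binary node with class proportion p
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

H_Y = H(5 / 8)                                          # ~0.954
split_X1 = (4 / 8) * H(4 / 4) + (4 / 8) * H(1 / 4)      # ~0.406
split_X2 = (4 / 8) * H(3 / 4) + (4 / 8) * H(2 / 4)      # ~0.906

print(round(H_Y, 3), round(H_Y - split_X1, 3), round(H_Y - split_X2, 3))
# 0.954 0.549 0.049  -> splitting on X1 gives the larger information gain
```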

Appendix

Notations and Definitions:

● Feature Space: A vector of independent variables - 𝑋

● Outcome Class (Categorical): 𝑌

● Decision Rule: It is a function 𝑓: 𝑋 → 𝑌. It is the rule to identify which data point will belong to
which class.

● Misclassification error (Empirical error): It is equal to the number of misclassifications


divided by the total number of observations. Mathematically, it can be written as:

𝑅(𝑓) = (1/𝑁) ∑_(𝑖=1)^𝑁 𝐼(𝑓(𝑥𝑖) ≠ 𝑦𝑖)

Where 𝑥𝑖 is a data point, 𝑦𝑖 is the actual class, 𝑓(𝑥𝑖) is the prediction made by the decision tree, and 𝐼
is a function such that 𝐼(𝑥) = 1 if 𝑥 ≠ 0, and 0 otherwise.
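A minimal sketch of this quantity in code (names are illustrative):

```python
# Sketch: empirical misclassification error R(f) = (# wrong predictions) / N.
def misclassification_error(y_true, y_pred):
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

print(misclassification_error(["T", "T", "F", "F"], ["T", "F", "F", "F"]))  # 0.25
```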

● A Probabilistic Model: We can assume that 𝑋 and 𝑌 are random variables taking values in the feature
space and the set of outcome classes, respectively, and that each class is characterized by some joint
distribution of a subset of the independent variables.

● Sub-Class: A sub-class is the set of data points that have the same decision rule for a subset
of features and belong to the same class. Mathematically, it can be written as:

𝐶 = {(𝑥, 𝑦) | 𝑥(𝑘) = 𝑣𝑘 for all 𝑘 in 𝐾}, where 𝐾 is a subset of all feature indices
