
Practical Data Science

LVC 1: Decision Trees


A Decision Tree (DT) is a supervised learning algorithm used for classification (spam / not spam)
as well as regression (pricing a car or a house) problems.

A decision tree is like a flowchart where each internal node represents a test on an attribute and
each branch represents the outcome of that test. In a classification problem, each leaf node
represents a class label, i.e., the decision made after evaluating the attributes along the path, and the path from
the root node to a leaf represents a classification rule, also called a decision rule.

Let’s consider a simple example: “Who gets a loan?” Here, the decision tree represents a
sequence of questions that the bank might ask an applicant to figure out whether that applicant is
eligible for the loan or not. Each internal node represents a question and each leaf node represents
the class label - Get loan / Don’t get loan. As we move along the edges, we get decision rules. For
example, if an applicant is under 30 years old and has a salary of less than $2,500, then the applicant
won’t get a loan.
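The decision rule quoted above can be read directly as code. Below is a minimal sketch in Python, where the age-30 and $2,500 thresholds come from the example; the outcomes of the other branches are assumptions for illustration, since the full tree from the figure is not reproduced here.

```python
# Sketch of the loan decision rules above. Only the quoted rule (age under 30 and
# salary under $2,500 -> no loan) comes from the text; the other branch outcomes
# are assumed placeholders for illustration.
def loan_decision(age, salary):
    if age < 30:
        if salary < 2500:
            return "Don't get loan"   # decision rule from the example
        return "Get loan"             # assumed outcome for this branch
    return "Get loan"                 # assumed outcome for applicants aged 30+

print(loan_decision(25, 2000))        # Don't get loan
```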


Decision trees are among the most popular supervised learning algorithms. They have many
advantages but a few limitations as well.

Advantages of Decision Trees:

● Human-Algorithm Interaction
○ Simple to understand and interpret
○ Mirrors human decision-making more closely
○ Uses an open-box model, i.e., can visualize and understand the machine learning logic
(as opposed to a black-box model which is not interpretable)

● Versatile
○ Able to handle both numerical and categorical data
○ Powerful - can model arbitrary functions as long as we have sufficient data
○ Requires little data preparation
○ Performs well with large datasets

● Built-in feature selection


○ Naturally de-emphasizes irrelevant features
○ Develops a hierarchy in terms of the relevance of features

● Testable: Possible to validate a model using statistical tests

Limitations of Decision Trees:

● Trees can be non-robust: A small change in the training data can result in a large change in
the tree and consequently the final predictions.

● The problem of learning an optimal decision tree is known to be NP-Complete


○ Practical decision-tree learning algorithms are based on heuristics (greedy algorithms)
○ Such algorithms cannot guarantee obtaining the globally optimal decision tree

● Overfitting: Decision-tree learners can create over-complex trees that do not generalize well
beyond the training data.

Before we move further, let’s answer a simple question:

Why do we need decision trees? Why can’t we use linear classifiers?

Consider a scenario for binary classification. Let two continuous independent variables be X1 and X2
and the dependent variable be the color of the data point, i.e., either Red or Blue. The decision tree is
built top-down from a root node and involves partitioning the data into subsets that contain instances
with similar values (homogeneous). The data points, a sample decision tree, and the resulting homogeneous
subsets are given in the diagram below, which shows a tree of depth 2.

From the above figure, we observe that a simple depth-2 classification tree results in fewer
misclassifications than any linear classifier we could fit to separate these two classes (blue
and red), because a single straight line cannot separate the blue and red points well.
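As a rough sketch of this comparison (not the data from the figure), assume synthetic red/blue points where the blue class occupies one rectangular corner of the feature space. A depth-2 tree carves out that corner with two axis-aligned splits, while a single linear boundary cannot:

```python
# Sketch: depth-2 decision tree vs. a linear classifier on synthetic red/blue data.
# The dataset is an assumption standing in for the figure's points.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))                   # features X1, X2
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)    # "blue" = 1 inside the corner

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
linear = LogisticRegression().fit(X, y)

print("depth-2 tree accuracy:", tree.score(X, y))      # close to 1.0
print("linear model accuracy:", linear.score(X, y))    # noticeably lower
```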

How powerful are Decision Trees?

A decision tree can realize any Boolean function. The diagram below shows a decision tree for
the XOR Boolean function (which is false if A and B are both true or both false, and true otherwise).

Note: Realization is not unique as there can be many trees for the same function.
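A minimal sketch (assuming scikit-learn is available) showing that a depth-2 tree realizes XOR exactly on its four input combinations:

```python
# Sketch: a depth-2 decision tree reproduces the XOR truth table, even though
# XOR is not separable by any single linear boundary.
from sklearn.tree import DecisionTreeClassifier

inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]   # all combinations of A and B
xor = [0, 1, 1, 0]                          # XOR truth table

tree = DecisionTreeClassifier(max_depth=2).fit(inputs, xor)
print(tree.predict(inputs))                 # [0 1 1 0] -- the tree realizes XOR
```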

What are the steps to building a decision tree?

The algorithm follows the below steps to build a decision tree:

1. Pick a feature

2. Split the data based on that feature such that the outcome is binary,
i.e., no data point belongs to both sides of the split
3. Define the new decision rule
4. Repeat the process until each leaf node is homogeneous, i.e., all the data points in a leaf node
belong to the same class

Remark: At each split, we need to try all the different combinations of values for all the features, which makes
the algorithm computationally expensive.
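A minimal recursive sketch of these steps, assuming a hypothetical best_split(points, labels) helper that returns the feature index and split value; how to choose that split well is exactly the question addressed next:

```python
# Sketch of the recursive build loop. `best_split` is a hypothetical helper that
# picks the feature/value to split on (steps 1-2); it is not defined here.
def build_tree(points, labels, best_split, depth=0, max_depth=5):
    # Stopping rule (step 4): the node is homogeneous or the depth limit is hit.
    if len(set(labels)) <= 1 or depth == max_depth:
        return {"leaf": True, "label": max(set(labels), key=labels.count)}

    feature, value = best_split(points, labels)
    left = [i for i, x in enumerate(points) if x[feature] <= value]
    right = [i for i, x in enumerate(points) if x[feature] > value]
    if not left or not right:                      # no useful split was found
        return {"leaf": True, "label": max(set(labels), key=labels.count)}

    # The decision rule at this node (step 3) is "x[feature] <= value".
    return {
        "leaf": False, "feature": feature, "value": value,
        "left": build_tree([points[i] for i in left], [labels[i] for i in left],
                           best_split, depth + 1, max_depth),
        "right": build_tree([points[i] for i in right], [labels[i] for i in right],
                            best_split, depth + 1, max_depth),
    }
```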

Now, the question arises: how do we find the best split? How to identify the feature and the value
of that feature to split the data at each level? This motivates the use of entropy as an impurity
measure.

Entropy: It is a measure of the uncertainty or diversity embedded in a random variable. Suppose 𝑍 is a
random variable with the probability mass function 𝑃(𝑍); then the entropy of 𝑍 is given as:

𝐻(𝑍) = − ∑_(𝑧 ϵ 𝑍) 𝑃(𝑧) 𝑙𝑜𝑔(𝑃(𝑧)) = − 𝐸(𝑙𝑜𝑔(𝑃(𝑍)))

Where 𝐸 represents the expected value. For all calculations, we will use the log with base 2.
Note: Since the log of values between 0 and 1 is negative, the minus sign helps to avoid negative
values of entropy.

Let’s consider an example of a coin flip where the probability of heads is equal to 𝑝 and the
probability of tails is equal to 1 − 𝑝. Then, the entropy is given as:
− 𝑝 𝑙𝑜𝑔𝑝 − (1 − 𝑝) 𝑙𝑜𝑔(1 − 𝑝)

If we plot the entropy as a function of 𝑝, we get a graph like this:

From the above graph, we can observe that the entropy is minimum, i.e., 0, when 𝑝 = 0 or 𝑝 = 1, i.e.,
when we can only get heads or only tails, which implies that the entropy is minimum when the outcome
is homogeneous. Also, the entropy is maximum, i.e., 1, when 𝑝 = 0.5, which
implies that the entropy is maximum when both outcomes are equally
likely. So, we can say that the lower the impurity, the lower the entropy.
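A small sketch to reproduce this curve numerically, using log base 2 as in the rest of the document (numpy and matplotlib assumed available):

```python
# Sketch: entropy of a coin flip as a function of p, with log base 2.
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                      # avoid log(0)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, entropy)
plt.xlabel("p (probability of heads)")
plt.ylabel("entropy H(p)")
plt.show()                                              # 0 at p = 0 or 1, peak of 1 at p = 0.5
```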

In a similar way, we can calculate the entropy of two variables, i.e., the entropy for the joint
distribution of 𝑋 and 𝑌. This is called joint entropy of two variables and we can extend the formula of
the entropy for a single variable to two variables as shown below:

𝐻(𝑋, 𝑌) = − ∑_((𝑥, 𝑦) ϵ 𝑋 × 𝑌) 𝑃(𝑥, 𝑦) 𝑙𝑜𝑔(𝑃(𝑥, 𝑦))

Where 𝑃(𝑥, 𝑦) represents the joint distribution of 𝑋 and 𝑌.

So far, we have seen the entropy of a single random variable and of the joint distribution of two variables.
But in decision trees, we also want to find the entropy of the target variable for a given split, i.e., the
weighted average of the entropies of 𝑌 within each subset created by the split. This
is called conditional entropy and is denoted as 𝐻(𝑌 | 𝑋).

In a decision tree, our aim is to find the feature and the corresponding value such that if we split the
data, the reduction of entropy in the target variable given the split is highest, i.e., the difference
between the entropy of 𝑌 and the conditional entropy 𝐻(𝑌 | 𝑋) is maximized. This is called
information gain. Mathematically, it is written as:
𝐼𝐺(𝑌 | 𝑋) = 𝐻(𝑌) − 𝐻(𝑌 | 𝑋)

Our aim is to maximize the information gain at each split, or in other words, to minimize 𝐻(𝑌 | 𝑋), since 𝐻(𝑌)
is constant.

Empirical Computation of Entropy

Now that we know the theory behind entropy and information gain, let’s go through the steps to find
the best split in a decision tree.

1. Start with the complete training dataset


2. Pick a feature, say 𝑋(𝑚)
3. Describe the data based on that feature, i.e., {(𝑥𝑖(𝑚), 𝑦𝑖), 𝑖 = 1, 2, ...., 𝑁}

4. Split the data into two nodes, say 𝑆1 and 𝑆2, based on one of the values of the same feature

5. Compute 𝐻(𝑆1) and 𝐻(𝑆2)

6. Find the entropy of the split using the formula 𝑃(𝑆1) * 𝐻(𝑆1) + 𝑃(𝑆2) * 𝐻(𝑆2)

7. Compute information gain for the split


8. Repeat the process for other features and pick the one that maximizes information gain

Remark: We can see that the algorithm makes the locally optimal choice, i.e., it chooses the split that
maximizes the information gain at that stage, and does not try to find the globally optimal solution.
Hence, the decision tree is considered a greedy algorithm.
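A minimal sketch of steps 1-8 for a single binary split (log base 2); the function names are illustrative, not from any library:

```python
# Sketch: entropy, entropy of a split, and information gain for a binary split
# of the form X == split_value vs. X != split_value.
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum p * log2(p) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels, split_value):
    """IG(Y | X) = H(Y) - [P(S1) * H(S1) + P(S2) * H(S2)]."""
    s1 = [y for x, y in zip(feature_values, labels) if x == split_value]
    s2 = [y for x, y in zip(feature_values, labels) if x != split_value]
    n = len(labels)
    split_entropy = (len(s1) / n) * entropy(s1) + (len(s2) / n) * entropy(s2)
    return entropy(labels) - split_entropy
```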

Finally, let’s see a simple example to understand the computations involved in the steps
mentioned above. Consider the following dataset with 2 independent variables, 𝑋1 and 𝑋2, and one

target variable, 𝑌, where all the variables can only take Boolean values.

𝑋1   𝑋2   𝑌
T    T    T
T    F    T
T    T    T
T    F    T
F    T    T
F    F    F
F    T    F
F    F    F

Here, we have two choices to split the data. We can split the data on 𝑋1 = T vs. 𝑋1 = F, or on 𝑋2 = T vs.
𝑋2 = F. We need to find out which feature maximizes the information gain to find the best split.

First, let’s compute 𝐻(𝑌) (recall that we use log with base 2 for computations). In the target variable,
the class True occurs 5 out of 8 times and False occurs 3 out of 8 times. So,

𝐻(𝑌) = − 𝑝 𝑙𝑜𝑔(𝑝) − (1 − 𝑝) 𝑙𝑜𝑔(1 − 𝑝) = − (5/8) * 𝑙𝑜𝑔(5/8) − (3/8) * 𝑙𝑜𝑔(3/8) = 0.954

Now, the split on 𝑋1 will divide the target variable data into two disjoint sets, one for 𝑋1 = T (say 𝑆1)
and the other for 𝑋1 = F (say 𝑆2). The rows in the table below show the count of target classes in each
node after the split.

          𝑌 = T    𝑌 = F
𝑋1 = T      4        0
𝑋1 = F      1        3

Computing 𝐻(𝑆1) and 𝐻(𝑆2),

𝐻(𝑆1) = − (4/4) * 𝑙𝑜𝑔(4/4) − (0/4) * 𝑙𝑜𝑔(0/4) = 0, because 𝑙𝑜𝑔(1) = 0 and 0 * 𝑙𝑜𝑔(0) = 0

𝐻(𝑆2) = − (1/4) * 𝑙𝑜𝑔(1/4) − (3/4) * 𝑙𝑜𝑔(3/4) = 0.811

So, the entropy of the split is,


𝑃(𝑆1) * 𝐻(𝑆1) + 𝑃(𝑆2) * 𝐻(𝑆2) = (4/8) * 0 + (4/8) * 0.811 = 0.4055

The information gain by splitting on feature 𝑋1 is,

𝐼𝐺 = 𝐻(𝑌) − 0.4055 = 0.954 − 0.4055 = 0.5485

Similarly, the split on 𝑋2 will divide the target variable data into two disjoint sets, say 𝑆1 and 𝑆2.

          𝑌 = T    𝑌 = F
𝑋2 = T      3        1
𝑋2 = F      2        2

Computing 𝐻(𝑆1) and 𝐻(𝑆2) for 𝑋2,

𝐻(𝑆1) = − (3/4) * 𝑙𝑜𝑔(3/4) − (1/4) * 𝑙𝑜𝑔(1/4) = 0.811

𝐻(𝑆2) = − (2/4) * 𝑙𝑜𝑔(2/4) − (2/4) * 𝑙𝑜𝑔(2/4) = 1

So, the entropy of the split is,

𝑃(𝑆1) * 𝐻(𝑆1) + 𝑃(𝑆2) * 𝐻(𝑆2) = (4/8) * 0.811 + (4/8) * 1 = 0.9055

The information gain by splitting on feature 𝑋2 is,

𝐼𝐺 = 𝐻(𝑌) − 0.9055 = 0.954 − 0.9055 = 0.0485

Since the information gain is maximized by splitting on the feature 𝑋1, it is the best split for the data.

Remark: Observe that the entropy of a split will always be less than or equal to the entropy of the
target variable. Both the entropies will be equal (and consequently, the information gain will be zero) if
the feature we are splitting on is independent of the target variable and provides no information about
the target variable.
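A short sketch that re-checks these hand computations (tiny differences in the last decimal place come from the rounded intermediate values used in the text):

```python
# Sketch: verify H(Y), the split entropies, and the information gains above.
from math import log2

def H(p):                         # entropy of a binary node with class proportion p
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

H_Y = H(5 / 8)                                          # ~0.954
split_X1 = (4 / 8) * H(4 / 4) + (4 / 8) * H(1 / 4)      # ~0.406
split_X2 = (4 / 8) * H(3 / 4) + (4 / 8) * H(2 / 4)      # ~0.906

print(round(H_Y, 3), round(H_Y - split_X1, 3), round(H_Y - split_X2, 3))
# 0.954 0.549 0.049  -> splitting on X1 gives the larger information gain
```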

Appendix

Notations and Definitions:

● Feature Space: A vector of independent variables - 𝑋

● Outcome Class (Categorical): 𝑌

● Decision Rule: It is a function 𝑓: 𝑋 → 𝑌. It is the rule to identify which data point will belong to
which class.

● Misclassification error (Empirical error): It is equal to the number of misclassifications


divided by the total number of observations. Mathematically, it can be written as:

𝑅(𝑓) = (1/𝑁) ∑_(𝑖=1)^𝑁 𝐼(𝑓(𝑥𝑖) ≠ 𝑦𝑖)

Where 𝑥𝑖 is a data point, 𝑦𝑖 is the actual class, 𝑓(𝑥𝑖) is the prediction made by the decision tree, and 𝐼
is a function such that 𝐼(𝑥) = 1 if 𝑥 ≠ 0, and 0 otherwise.
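A minimal sketch of this quantity in code (names are illustrative):

```python
# Sketch: empirical misclassification error R(f) = (# wrong predictions) / N.
def misclassification_error(y_true, y_pred):
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

print(misclassification_error(["T", "T", "F", "F"], ["T", "F", "F", "F"]))  # 0.25
```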

● A Probabilistic Model: We can assume that 𝑋 and 𝑌 are random variables taking values in the feature
space and the set of outcome classes, respectively, and that each class is characterized by some joint
distribution of a subset of the independent variables.

● Sub-Class: A sub-class is the set of data points that have the same decision rule for a subset
of features and belong to the same class. Mathematically, it can be written as:

𝐶 = {(𝑥, 𝑦) | 𝑥(𝑘) = 𝑣𝑘 for all 𝑘 in 𝐾}, where 𝐾 is a subset of all feature indices
