LVC 1: Post-Session Summary
A decision tree is like a flow chart where each internal node represents a test on an attribute and
each branch represents the outcome of that test. In a classification problem, each leaf node
represents a class label, i.e., the decision reached after evaluating the attributes along the path, and the
path from the root node to a leaf represents a classification rule, also called a decision rule.
Let’s consider a simple example: who gets a loan? Here, the decision tree represents a
sequence of questions that the bank might ask an applicant to figure out whether that applicant is
eligible for the loan or not. Each internal node represents a question and each leaf node represents
a class label - Get loan / Don’t get loan. As we move along the edges, we get decision rules. For
example, if an applicant is under 30 years of age and has a salary of less than $2,500, then the applicant
won’t get a loan.
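Just as an illustration of how such decision rules read (any threshold beyond the one stated above is made up for this sketch), the tree can be written as nested if/else checks in Python:

def loan_decision(age, salary):
    # Toy decision rules for the loan example; only the "under 30 and
    # salary below $2,500" rule comes from the text, the rest is assumed.
    if age < 30:
        if salary < 2500:
            return "Don't get loan"
        return "Get loan"
    return "Get loan"

print(loan_decision(age=25, salary=2000))  # Don't get loan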
Decision trees are one of the most famous supervised learning algorithms. They have many
advantages but a few limitations as well.
● Human-Algorithm Interaction
○ Simple to understand and interpret
○ Mirrors human decision-making more closely
○ Uses an open-box model, i.e., can visualize and understand the machine learning logic
(as opposed to a black-box model which is not interpretable)
● Versatile
○ Able to handle both numerical and categorical data
○ Powerful - can model arbitrary functions as long as we have sufficient data
○ Requires little data preparation
○ Performs well with large datasets
[email protected]
Limitations of decision trees:
● Trees can be non-robust: A small change in the training data can result in a large change in
the tree and consequently the final predictions.
● Overfitting: Decision-tree solvers can create over-complex trees that do not generalize well
from the training data.
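Assuming scikit-learn is the tool being used, one common way to limit such overfitting is to constrain how far the tree can grow; the parameter values below are illustrative, not prescribed here:

from sklearn.tree import DecisionTreeClassifier

# max_depth caps the number of questions along any path, and
# min_samples_leaf forces each leaf to cover enough training points,
# both of which discourage over-complex trees.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)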
Consider a scenario for binary classification. Let two continuous independent variables be X1 and X2
and the dependent variable be the color of the data point, i.e., either Red or Blue. The decision tree is
built top-down from a root node and involves partitioning the data into subsets that contain instances
with similar values (homogeneous). The data points, a sample decision tree, and the homogeneous
subsets are shown in the diagram below, which depicts a tree of depth 2.
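Since the diagram itself is not reproduced here, a minimal sketch of the same setup (two continuous features, a Red/Blue target, and a tree of depth 2) using scikit-learn could look like the following; the synthetic data and labelling rule are assumptions made only for illustration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                        # the two features X1 and X2
y = np.where((X[:, 0] > 5) & (X[:, 1] > 5), "Red", "Blue")   # synthetic colour labels

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["X1", "X2"]))         # the two levels of splits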
A decision tree can realize any Boolean function. The below diagram shows a decision tree for
the XOR Boolean function (which is false if A and B are both true or both false, and true otherwise).
Note: Realization is not unique as there can be many trees for the same function.
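As a quick check of this claim (assuming scikit-learn), a tree fitted on the four possible Boolean inputs recovers XOR exactly; this is just one of the many possible realizations:

from sklearn.tree import DecisionTreeClassifier

# The four inputs (A, B) and the XOR target: true only when A and B differ.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict(X))   # [0 1 1 0], i.e., the XOR function
print(tree.get_depth())  # 2: split on one variable, then on the other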
1. Pick a feature
Remark: At each split, we need to try all the different feature-value combinations, which makes
the algorithm computationally expensive.
Now, the question arises: how do we find the best split? How do we identify the feature and the value
of that feature to split the data at each level? This motivates the use of entropy as an impurity
measure. For a discrete random variable 𝑋 with probability distribution 𝑝(𝑥), the entropy is defined as:
𝐻(𝑋) = 𝐸[− 𝑙𝑜𝑔 𝑝(𝑋)] = − ∑ 𝑝(𝑥) 𝑙𝑜𝑔 𝑝(𝑥)
where 𝐸 represents the expected value and the sum runs over all possible values of 𝑋. For all calculations, we will use the log with base 2.
Note: Since the log of values between 0 and 1 is negative, the minus sign helps to avoid negative
values of entropy.
Let’s consider an example of a coin flip where the probability of heads is equal to 𝑝 and the
probability of tails is equal to 1 − 𝑝. Then, the entropy is given as:
− 𝑝 𝑙𝑜𝑔𝑝 − (1 − 𝑝) 𝑙𝑜𝑔(1 − 𝑝)
From the above graph, we can observe that the entropy is minimum, i.e., 0, when 𝑝 = 0 or 𝑝 = 1, i.e.,
when we can only get heads or only tails, which implies that the entropy is minimum when the outcome
is certain. It is maximum, i.e., 1, when 𝑝 = 0.5, i.e., when both outcomes are equally likely and the
uncertainty is highest.
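To make the numbers concrete, here is a small Python sketch of this binary entropy (log base 2, as agreed above):

import math

def coin_entropy(p):
    # Entropy of a coin flip with P(heads) = p, in bits.
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(coin_entropy(0.0), coin_entropy(1.0))  # 0.0 0.0 -> minimum entropy
print(coin_entropy(0.5))                     # 1.0     -> maximum for a fair coin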
In a similar way, we can calculate the entropy of two variables, i.e., the entropy for the joint
distribution of 𝑋 and 𝑌. This is called the joint entropy of the two variables, and we can extend the formula of
the entropy for a single variable to two variables as shown below:
𝐻(𝑋, 𝑌) = − ∑ 𝑝(𝑥, 𝑦) 𝑙𝑜𝑔 𝑝(𝑥, 𝑦)
So far, we have seen the entropy of a single random variable and of the joint distribution of two variables.
But in decision trees, we also want to find out the entropy of the target variable for a given split. This
is called conditional entropy. It is denoted as 𝐻(𝑌 | 𝑋) and is the weighted average of the entropy of 𝑌 within each value of 𝑋:
𝐻(𝑌 | 𝑋) = ∑ 𝑝(𝑥) 𝐻(𝑌 | 𝑋 = 𝑥)
In a decision tree, our aim is to find the feature and the corresponding value such that if we split the
data, the reduction of entropy in the target variable given the split is highest, i.e., the difference
between the entropy of 𝑌 and the conditional entropy 𝐻(𝑌 | 𝑋) is maximized. This is called
information gain. Mathematically, it is written as:
𝐼𝐺(𝑌 | 𝑋) = 𝐻(𝑌) − 𝐻(𝑌 | 𝑋)
Our aim is to maximize the information gain at each split or, in other words, to minimize 𝐻(𝑌 | 𝑋), as 𝐻(𝑌)
is constant.
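A minimal sketch of these three quantities in plain Python (counting frequencies and using log base 2; the function names are just for this illustration):

from collections import Counter
from math import log2

def entropy(labels):
    # H(Y): entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    # H(Y | X): weighted average of the entropy of Y within each value of X.
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def information_gain(feature, labels):
    # IG(Y | X) = H(Y) - H(Y | X).
    return entropy(labels) - conditional_entropy(feature, labels)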
Now that we know the theory behind entropy and information gain, let’s go through the steps to find
the best split in a decision tree.
4. Split the data into two nodes, say 𝑆1 and 𝑆2, based on one of the values of the same feature
6. Find the entropy of the split using the formula 𝑃(𝑆1) * 𝐻(𝑆1) + 𝑃(𝑆2) * 𝐻(𝑆2)
Finally, let’s see a simple example to understand the computations involved in the steps
mentioned above. Consider the following dataset with 2 independent variables, 𝑋1 and 𝑋2, and one
target variable, 𝑌, where all the variables can only take Boolean values.
𝑋1 𝑋2 𝑌
T T T
T F T
T T T
T F T
F T T
F F F
F T F
F F F
Here, we have two choices to split the data. We can split the data on 𝑋1 = T and 𝑋1 = F, or on 𝑋2 = T and
𝑋2 = F. We need to find out which feature maximizes the information gain to find the best split.
First, let’s compute 𝐻(𝑌) (recall that we use log with base 2 for computations). In the target variable,
the class True occurs 5 out of 8 times and False occurs 3 out of 8 times. So,
𝐻(𝑌) = − (5/8) 𝑙𝑜𝑔(5/8) − (3/8) 𝑙𝑜𝑔(3/8) ≈ 0.954
Next, the split on 𝑋1 will divide the target variable data into two disjoint sets, one for 𝑋1 = T (say 𝑆1) and the other for 𝑋1 = F (say 𝑆2). The rows in the below
table show the count of target classes in each node after the split.
         𝑌 = T    𝑌 = F
𝑋1 = T     4        0
𝑋1 = F     1        3
Similarly, the split on 𝑋2 will divide the target variable data into two disjoint sets, say 𝑆1 and 𝑆2.
         𝑌 = T    𝑌 = F
𝑋2 = T     3        1
𝑋2 = F     2        2
For the split on 𝑋1, 𝑆1 is pure, so 𝐻(𝑆1) = 0, while 𝐻(𝑆2) = − (1/4) 𝑙𝑜𝑔(1/4) − (3/4) 𝑙𝑜𝑔(3/4) ≈ 0.811.
The entropy of the split is (4/8) * 0 + (4/8) * 0.811 ≈ 0.406, so 𝐼𝐺(𝑌 | 𝑋1) ≈ 0.954 − 0.406 ≈ 0.549.
For the split on 𝑋2, 𝐻(𝑆1) ≈ 0.811 and 𝐻(𝑆2) = 1, so the entropy of the split is (4/8) * 0.811 + (4/8) * 1 ≈ 0.906
and 𝐼𝐺(𝑌 | 𝑋2) ≈ 0.954 − 0.906 ≈ 0.049. Since the information gain is maximized while splitting on the feature 𝑋1, it is the best split for the
data.
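These values can be double-checked with a few lines of Python (log base 2 throughout; the counts are taken directly from the two tables above):

from math import log2

def node_entropy(counts):
    # Entropy of a node from its class counts.
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

H_Y = node_entropy([5, 3])                                                  # ~0.954

# Split on X1: S1 = {X1 = T} has counts (4, 0), S2 = {X1 = F} has (1, 3).
split_X1 = (4 / 8) * node_entropy([4, 0]) + (4 / 8) * node_entropy([1, 3])  # ~0.406
# Split on X2: S1 = {X2 = T} has counts (3, 1), S2 = {X2 = F} has (2, 2).
split_X2 = (4 / 8) * node_entropy([3, 1]) + (4 / 8) * node_entropy([2, 2])  # ~0.906

print(H_Y - split_X1)  # information gain for X1, ~0.549
print(H_Y - split_X2)  # information gain for X2, ~0.049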
Remark: Observe that the entropy of a split will always be less than or equal to the entropy of the
target variable. Both the entropies will be equal (and consequently, the information gain will be zero) if
the feature we are splitting on is independent of the target variable and provides no information about
the target variable.
● Decision Rule: It is a function 𝑓: 𝑋 → 𝑌. It is the rule to identify which data point will belong to
which class.
The risk of a decision rule 𝑓 over 𝑁 data points is given by:
𝑅(𝑓) = (1/𝑁) ∑𝑖 𝐼(𝑓(𝑥𝑖) ≠ 𝑦𝑖)
where 𝑥𝑖 is a data point, 𝑦𝑖 is the actual class, 𝑓(𝑥𝑖) is the prediction made by the decision tree, and 𝐼
is the indicator function such that 𝐼(𝑥) = 1 if 𝑥 ≠ 0 and 𝐼(𝑥) = 0 otherwise.
● A Probabilistic Model: We can assume that X and Y are random variables and that each
class is characterized by some joint distribution of a subset of the independent variables.
● Sub-Class: A sub-class is the set of data points that have the same decision rule for a subset
of features and belong to the same class. Mathematically, it can be written as: