Unit 4 - Decision Tree ID3
Introduction

Decision Trees are a type of Supervised Machine Learning (that is, you explain what the input is and what the corresponding output is in the training data) where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes, and the decision nodes are where the data is split.
An example of a decision tree can be explained using the binary tree above. Let's say you want to predict whether a person is fit given information like their age, eating habits, physical activity, etc. The decision nodes here are questions like 'What is the age?', 'Does the person exercise?', 'Does the person eat a lot of pizza?', and the leaves are outcomes like 'fit' or 'unfit'. In this case it is a binary classification problem (a yes/no type of problem).
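To make the structure concrete, here is a minimal sketch of the fitness example written as nested if/else checks in Python. The exact questions, their order, and the age threshold are illustrative assumptions on our part, since the figure with the actual tree is not reproduced here; the point is only that the interior checks play the role of decision nodes and the returned strings play the role of leaves.

def is_fit(age, exercises, eats_lots_of_pizza):
    # Decision node: 'What is the age?' (the threshold of 30 is an assumed illustration)
    if age < 30:
        # Decision node: 'Does the person exercise?'
        return "fit" if exercises else "unfit"
    # Decision node: 'Does the person eat a lot of pizza?'
    return "unfit" if eats_lots_of_pizza else "fit"

print(is_fit(age=25, exercises=True, eats_lots_of_pizza=False))   # fit
print(is_fit(age=45, exercises=False, eats_lots_of_pizza=True))   # unfit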
There are two main types of Decision Trees:

1. Classification Trees: what we've seen above is an example of a classification tree, where the outcome is a categorical variable like 'fit' or 'unfit'.
2. Regression Trees: here the decision or outcome variable is continuous, e.g. a number like 123.
There are many algorithms out there which construct Decision Trees, but one of the best is called the ID3 Algorithm. ID3 stands for Iterative Dichotomiser 3.

The steps in the ID3 algorithm are as follows:

1. Calculate the Entropy of the current dataset.
2. For each remaining attribute, calculate the Information Gain obtained by splitting the dataset on that attribute.
3. Choose the attribute with the highest Information Gain as the decision node and split the dataset on its values.
4. Repeat the procedure recursively on each subset with the remaining attributes, until all examples in a subset belong to the same class (the node becomes a leaf) or no attributes are left.

Before walking through these steps on an example, we'll go through a few definitions.
Entropy:

Entropy, also called Shannon Entropy, is denoted by H(S) for a finite set S and is a measure of the amount of uncertainty or randomness in the data:

H(S) = - Σ p(c) log2 p(c)

where the sum runs over the classes c present in S and p(c) is the proportion of examples in S belonging to class c.

Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss whose probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest possible, since there is no way of determining what the outcome might be. Alternatively, consider a coin which has heads on both sides; the outcome of such a toss can be predicted perfectly, since we know beforehand that it will always be heads. In other words, this event has no randomness, hence its entropy is zero. In general, lower values imply less uncertainty while higher values imply more uncertainty.
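As a quick sanity check of this definition, here is a minimal Python sketch of the entropy calculation. The function name entropy and its list-of-class-counts interface are our own choices, not something defined in these notes.

import math

def entropy(class_counts):
    # Shannon entropy H(S) for a set whose class membership counts are given,
    # e.g. [9, 5] means 9 examples of one class and 5 of the other.
    total = sum(class_counts)
    h = 0.0
    for count in class_counts:
        if count == 0:
            continue  # a class with probability 0 contributes nothing
        p = count / total
        h -= p * math.log2(p)
    return h

print(entropy([1, 1]))  # fair coin: maximum uncertainty, prints 1.0
print(entropy([2, 0]))  # two-headed coin: no uncertainty, prints 0.0
print(entropy([9, 5]))  # the 9 'Yes' / 5 'No' split used below, about 0.940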
Information Gain:

Information Gain, denoted by IG(S, A) for a set S, is the effective change in entropy after deciding on a particular attribute A. It measures the relative change in entropy with respect to the independent variables:

IG(S, A) = H(S) - H(S | A)

Alternatively,

IG(S, A) = H(S) - Σ P(x) * H(x)

where IG(S, A) is the information gain obtained by applying feature A, H(S) is the Entropy of the entire set, and the second term calculates the Entropy after applying the feature A, with P(x) being the probability of event x (the fraction of examples for which A takes the value x) and H(x) the entropy of that subset.
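The same quantity is easy to compute in code. Below is a minimal sketch; the function name information_gain and its counts-per-subset interface are our own choices. It takes the class counts of the whole set and the class counts of each subset produced by an attribute, and returns H(S) minus the weighted entropy of the subsets.

import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, subset_counts):
    # parent_counts: class counts of the whole set S, e.g. [9, 5]
    # subset_counts: one list of class counts per value of the attribute A
    total = sum(parent_counts)
    weighted = sum((sum(s) / total) * entropy(s) for s in subset_counts)
    return entropy(parent_counts) - weighted

# Toy example: an attribute that separates the classes completely has a gain equal
# to the parent entropy, while an attribute that tells us nothing has a gain of 0.
print(information_gain([4, 4], [[4, 0], [0, 4]]))  # 1.0
print(information_gain([4, 4], [[2, 2], [2, 2]]))  # 0.0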
Let's understand this with the help of an example. Consider a piece of data collected over the course of 14 days, where the features are Outlook, Temperature, Humidity, and Wind, and the outcome variable is whether Golf was played on the day. Now, our job is to build a predictive model which takes in the above 4 parameters and predicts whether Golf will be played on the day. We'll build a decision tree to do that using the ID3 algorithm.
Now, let's go ahead and grow the decision tree. The initial step is to calculate H(S), the Entropy of the current state. In
the above example, we can see in total there are 5 No’s and 9 Yes’s.
Yes No Total
9 5 14
H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

Remember that the Entropy is 0 if all members belong to the same class, and 1 when half of them belong to one class and the other half belong to the other class, i.e. perfect randomness. Here it is 0.94, which means the distribution is fairly random. Now, the next step is to choose the attribute that gives us the highest possible Information Gain, which we'll choose as the root node. Let's start with 'Wind':

IG(S, Wind) = H(S) - Σ P(x) * H(x)
where x ranges over the possible values of the attribute. Here, the attribute 'Wind' takes two possible values in the sample data, hence x = {Weak, Strong}. We'll have to calculate H(S_weak) and H(S_strong).
Amongst all the 14 examples we have 8 places where the wind is Weak and 6 where the wind is Strong.

Now, out of the 8 Weak examples, 6 of them were 'Yes' for Play Golf and 2 of them were 'No'. So we have:

H(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
Similarly, out of the 6 Strong examples, we have 3 examples where the outcome was 'Yes' for Play Golf and 3 where we had 'No':

H(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0

Here half of the items belong to one class while the other half belong to the other, hence we have perfect randomness. Now we have all the pieces required to calculate the Information Gain:

IG(S, Wind) = H(S) - (8/14) * H(S_weak) - (6/14) * H(S_strong)
            = 0.94 - (8/14)(0.811) - (6/14)(1.0)
            = 0.048
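These numbers are easy to verify with a few lines of Python; this is just a standalone check of the arithmetic above, not part of the original notes.

import math

H_S = -(9/14) * math.log2(9/14) - (5/14) * math.log2(5/14)      # entropy of the full set
H_weak = -(6/8) * math.log2(6/8) - (2/8) * math.log2(2/8)       # 8 Weak examples: 6 Yes, 2 No
H_strong = -(3/6) * math.log2(3/6) - (3/6) * math.log2(3/6)     # 6 Strong examples: 3 Yes, 3 No
IG_wind = H_S - (8/14) * H_weak - (6/14) * H_strong

print(round(H_S, 3), round(H_weak, 3), round(H_strong, 3), round(IG_wind, 3))
# 0.94 0.811 1.0 0.048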
This tells us that the Information Gain obtained by considering 'Wind' as the feature is 0.048. Now we must similarly calculate the Information Gain for all the features:

IG(S, Outlook) = 0.246
IG(S, Temperature) = 0.029
IG(S, Humidity) = 0.151
IG(S, Wind) = 0.048
We can clearly see that IG(S, Outlook) has the highest information gain of 0.246, hence we choose the Outlook attribute as the root node. At this point, the decision tree looks like this:

                 Outlook
               /    |    \
          Sunny  Overcast  Rain
            ?       Yes      ?
Here we observe that whenever the Outlook is Overcast, Play Golf is always 'Yes'. This is no coincidence: the simple subtree results precisely because the highest information gain is given by the attribute Outlook. Now that we've used Outlook, we have three attributes remaining: Humidity, Temperature, and Wind. And we had three possible values of Outlook: Sunny, Overcast, and Rain. The Overcast branch has already ended in the leaf node 'Yes', so we're left with two subtrees to compute: Sunny and Rain.
Carrying out the same procedure on the remaining subtrees, e.g. calculating the Information Gain of Temperature, Humidity, and Wind over the examples where the Outlook is Rain, will give us Wind as the attribute with the highest information gain for that branch. The final Decision Tree looks something like this.
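The figure with the final tree is not reproduced here, but the whole procedure can be summarised and checked with a short program. Below is a minimal recursive ID3 sketch in Python. The 14-row dataset hard-coded in it is an assumption on our part: it is the classic Play Golf (Play Tennis) table from Quinlan's ID3 example, which matches every count used in these notes (9 Yes / 5 No overall, the 8 Weak / 6 Strong Wind split, and the information gains 0.246 and 0.048), but the table itself is not listed in the notes. Running the sketch prints the same tree that the walkthrough above constructs by hand.

import math

# Assumed dataset: the classic Play Golf (Play Tennis) table from Quinlan's ID3 example.
# Each row is (Outlook, Temperature, Humidity, Wind, PlayGolf).
DATA = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
FEATURES = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(rows):
    # H(S) over the class labels (last column) of the given rows.
    labels = [row[-1] for row in rows]
    h = 0.0
    for label in set(labels):
        p = labels.count(label) / len(labels)
        h -= p * math.log2(p)
    return h

def information_gain(rows, feature_index):
    # IG(S, A) = H(S) minus the weighted entropy of the subsets created by attribute A.
    values = set(row[feature_index] for row in rows)
    weighted = 0.0
    for value in values:
        subset = [row for row in rows if row[feature_index] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - weighted

def id3(rows, feature_indices):
    labels = [row[-1] for row in rows]
    # Leaf: all examples share the same class, or no attributes are left.
    if len(set(labels)) == 1:
        return labels[0]
    if not feature_indices:
        return max(set(labels), key=labels.count)  # majority class
    # Choose the attribute with the highest information gain and split on it.
    best = max(feature_indices, key=lambda i: information_gain(rows, i))
    remaining = [i for i in feature_indices if i != best]
    tree = {}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[value] = id3(subset, remaining)
    return {FEATURES[best]: tree}

print(id3(DATA, list(range(len(FEATURES)))))
# Expected structure (the order of the branch keys may vary):
# {'Outlook': {'Overcast': 'Yes',
#              'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#              'Rain': {'Wind': {'Weak': 'Yes', 'Strong': 'No'}}}}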