Table 4.1. Data Set for Exercise 2
2. Consider the training examples shown in Table 4.1 for a binary classification
problem.
(a) Compute the Gini index for the overall collection of training examples.
Answer:
Gini = 1 − 2 × 0.5² = 0.5.
(b) Compute the Gini index for the Customer ID attribute.
Answer:
The gini for each Customer ID value is 0. Therefore, the overall gini
for Customer ID is 0.
(c) Compute the Gini index for the Gender attribute.
Answer:
The gini for Male is 1 − 2 × 0.5² = 0.5. The gini for Female is also 0.5.
Therefore, the overall gini for Gender is 0.5 × 0.5 + 0.5 × 0.5 = 0.5.
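A minimal Python sketch, using only the class proportions quoted in parts (a) and (c) (the gini helper below is illustrative, not from the text), that reproduces these values:

```python
# Gini index from class proportions, checked against parts (a) and (c).

def gini(proportions):
    """Gini index = 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

# (a) Overall collection: each class has proportion 0.5.
print(gini([0.5, 0.5]))                                  # 0.5

# (c) Gender: each child has gini 0.5 and holds half of the records.
print(0.5 * gini([0.5, 0.5]) + 0.5 * gini([0.5, 0.5]))   # 0.5
```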
(d) Compute the Gini index for the Car Type attribute using multiway
split.
Answer:
The gini for Family car is 0.375, Sports car is 0, and Luxury car is
0.2188. The overall gini is 0.1625.
(e) Compute the Gini index for the Shirt Size attribute using multiway
split.
Answer:
The gini for Small shirt size is 0.48, Medium shirt size is 0.4898, Large
shirt size is 0.5, and Extra Large shirt size is 0.5. The overall gini for
Shirt Size attribute is 0.4914.
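A minimal Python sketch of the multiway weighted Gini computation. Since Table 4.1 is not reproduced here, the per-value class counts below are assumptions chosen to be consistent with the per-value gini values quoted in part (d); the Shirt Size computation is analogous:

```python
# Weighted Gini of a multiway split from per-value class counts.
# The Car Type counts below are assumed (Table 4.1 is not shown here); they
# were chosen to match the per-value gini values quoted in part (d).

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(children):
    """children: per-attribute-value class counts, e.g. (C0, C1)."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

car_type = {"Family": (1, 3), "Sports": (8, 0), "Luxury": (1, 7)}  # assumed counts
for value, counts in car_type.items():
    print(value, round(gini(counts), 4))        # 0.375, 0.0, 0.2188
print(round(split_gini(car_type.values()), 4))  # 0.1625
```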
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
Answer:
Car Type because it has the lowest gini among the three attributes.
(g) Explain why Customer ID should not be used as the attribute test
condition even though it has the lowest Gini.
Answer:
The attribute has no predictive power since new customers are assigned
to new Customer IDs.
3. Consider the training examples shown in Table 4.2 for a binary classification
problem.
(a) What is the entropy of this collection of training examples with respect
to the positive class?
Answer:
There are four positive examples and five negative examples. Thus,
P (+) = 4/9 and P (−) = 5/9. The entropy of the training examples is
−4/9 log2 (4/9) − 5/9 log2 (5/9) = 0.9911.
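A minimal Python sketch, using only the class counts stated above, that reproduces this entropy:

```python
# Entropy of a collection with 4 positive and 5 negative examples.
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(entropy([4, 5]), 4))   # 0.9911
```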
(b) What are the information gains of a1 and a2 relative to these training
examples?
Answer:
For attribute a1, the corresponding counts are:

a1    +    −
T     3    1
F     1    4

The entropy of the a1 = T child is −(3/4) log2(3/4) − (1/4) log2(1/4) = 0.8113, and the entropy of the a1 = F child is −(1/5) log2(1/5) − (4/5) log2(4/5) = 0.7219, so the entropy after splitting on a1 is (4/9) × 0.8113 + (5/9) × 0.7219 = 0.7616. The information gain of a1 is therefore 0.9911 − 0.7616 = 0.2294.
For attribute a2, the corresponding counts are:

a2    +    −
T     2    3
F     2    2

The entropy after splitting on a2 is (5/9) × [−(2/5) log2(2/5) − (3/5) log2(3/5)] + (4/9) × [−(2/4) log2(2/4) − (2/4) log2(2/4)] = (5/9) × 0.9710 + (4/9) × 1 = 0.9839, so the information gain of a2 is 0.9911 − 0.9839 = 0.0072.
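A minimal Python sketch that recomputes both information gains from the count tables above (the helper names are illustrative):

```python
# Information gain of a split from (positive, negative) counts per attribute value.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(children):
    """children: list of (positive, negative) counts, one per attribute value."""
    n = sum(sum(child) for child in children)
    parent = entropy([sum(c[0] for c in children), sum(c[1] for c in children)])
    conditional = sum(sum(child) / n * entropy(child) for child in children)
    return parent - conditional

print(round(info_gain([(3, 1), (1, 4)]), 4))   # a1: 0.2294
print(round(info_gain([(2, 3), (2, 2)]), 4))   # a2: 0.0072
```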
(d) What is the best split (among a1 , a2 , and a3 ) according to the infor-
mation gain?
Answer:
According to information gain, a1 produces the best split.
(e) What is the best split (between a1 and a2 ) according to the classification
error rate?
Answer:
For attribute a1 : error rate = 2/9.
For attribute a2 : error rate = 4/9.
Therefore, according to error rate, a1 produces the best split.
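A minimal Python sketch that recomputes both error rates from the count tables in part (b), where each child node predicts its majority class:

```python
# Classification error rate of a split from (positive, negative) counts per value.

def error_rate(children):
    """Misclassified fraction when each child predicts its majority class."""
    n = sum(sum(child) for child in children)
    return sum(min(child) for child in children) / n

print(error_rate([(3, 1), (1, 4)]))   # a1: 2/9 ≈ 0.2222
print(error_rate([(2, 3), (2, 2)]))   # a2: 4/9 ≈ 0.4444
```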
(f) What is the best split (between a1 and a2 ) according to the Gini index?
Answer:
For attribute a1, the gini index is
(4/9) × [1 − (3/4)² − (1/4)²] + (5/9) × [1 − (1/5)² − (4/5)²] = 0.3444.
For attribute a2, the gini index is
(5/9) × [1 − (2/5)² − (3/5)²] + (4/9) × [1 − (2/4)² − (2/4)²] = 0.4889.
Since the gini index for a1 is smaller, it produces the better split.
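A minimal Python sketch that recomputes both Gini indices from the same count tables:

```python
# Weighted Gini of a split from (positive, negative) counts per attribute value.

def split_gini(children):
    n = sum(sum(child) for child in children)
    return sum(sum(c) / n * (1 - sum((x / sum(c)) ** 2 for x in c)) for c in children)

print(round(split_gini([(3, 1), (1, 4)]), 4))   # a1: 0.3444
print(round(split_gini([(2, 3), (2, 2)]), 4))   # a2: 0.4889
```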
4. Show that the entropy of a node never increases after splitting it into smaller
successor nodes.
Answer:
Let Y = {y1 , y2 , · · · , yc } denote the c classes and X = {x1 , x2 , · · · , xk } denote
the k attribute values of an attribute X. Before a node is split on X, the
entropy is:
E(Y) = − Σ_{j=1}^{c} P(yj) log2 P(yj) = − Σ_{j=1}^{c} Σ_{i=1}^{k} P(xi, yj) log2 P(yj),   (4.1)

where we have used the fact that P(yj) = Σ_{i=1}^{k} P(xi, yj) from the law of total probability.
After splitting on X, the entropy for each child node X = xi is:

E(Y|xi) = − Σ_{j=1}^{c} P(yj|xi) log2 P(yj|xi),   (4.2)
where P(yj|xi) is the fraction of examples with X = xi that belong to class yj. The entropy after splitting on X is given by the weighted entropy of the child nodes:

E(Y|X) = Σ_{i=1}^{k} P(xi) E(Y|xi)
       = − Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi) P(yj|xi) log2 P(yj|xi)
       = − Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 P(yj|xi),   (4.3)
where we have used a known fact from probability theory that P (xi , yj ) =
P (yj |xi )×P (xi ). Note that E(Y |X) is also known as the conditional entropy
of Y given X.
To answer this question, we need to show that E(Y |X) ≤ E(Y ). Let us com-
pute the difference between the entropies after splitting and before splitting,
i.e., E(Y |X) − E(Y ), using Equations 4.1 and 4.3:
E(Y|X) − E(Y) = Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 [ P(yj) / P(yj|xi) ]
              = Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi, yj) log2 [ P(xi) P(yj) / P(xi, yj) ].   (4.4)
To show that Equation 4.4 is never positive, we use the following property of the logarithmic function:

Σ_{k=1}^{d} ak log(zk) ≤ log( Σ_{k=1}^{d} ak zk ),   (4.5)

subject to the condition that Σ_{k=1}^{d} ak = 1. This property is a special case of a more general theorem for concave functions (such as the logarithm), known as Jensen's inequality.
Applying Equation 4.5 to Equation 4.4 with ak = P(xi, yj) and zk = P(xi)P(yj)/P(xi, yj) (the ak sum to 1, and the inequality also holds for base-2 logarithms), we obtain

E(Y|X) − E(Y) ≤ log2 [ Σ_{i=1}^{k} Σ_{j=1}^{c} P(xi) P(yj) ] = log2 [ ( Σ_{i=1}^{k} P(xi) ) ( Σ_{j=1}^{c} P(yj) ) ] = log2(1) = 0.

Therefore E(Y|X) ≤ E(Y), i.e., the entropy of a node never increases after it is split.
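As a numerical illustration of this result, a minimal Python sketch that checks E(Y|X) ≤ E(Y) for the a1 and a2 splits of Exercise 3:

```python
# Numerical check of E(Y|X) <= E(Y) using the (positive, negative) counts
# per attribute value from Exercise 3.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent = entropy([4, 5])                       # E(Y) = 0.9911
for name, children in [("a1", [(3, 1), (1, 4)]), ("a2", [(2, 3), (2, 2)])]:
    n = sum(sum(c) for c in children)
    cond = sum(sum(c) / n * entropy(c) for c in children)   # E(Y|X)
    print(name, round(cond, 4), cond <= parent)             # both print True
```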
A B Class Label
T F +
T T +
T T +
T F −
T T +
F F −
F F −
F F −
T T −
T F −