Data Mining Algorithms Classification L4
In a decision tree, the major challenge is identifying which attribute to split on at each level, starting from the root. This process is known as attribute selection. Two popular attribute selection measures are Information Gain and the Gini Index.
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy.
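In standard notation (these are the usual textbook definitions), for a training set S whose classes occur with proportions p_i:

Entropy(S) = − Σ_i p_i · log₂(p_i)

Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v)

where S_v is the subset of S for which attribute A takes value v. The attribute with the highest information gain produces the purest (lowest-entropy) subsets and is chosen for the split.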
2. Gini Index
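The worked answers below use the standard definition: for a set S with class proportions p_i,

Gini(S) = 1 − Σ_i p_i²

and the Gini index of a split is the weighted average over its partitions:

Gini_split = Σ_v (|S_v| / |S|) · Gini(S_v)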
(a) Compute the Gini index for the overall collection of training examples.
Answer: the two classes each cover half of the training examples, so Gini = 1 − 2 × 0.5² = 0.5.
(b) Compute the Gini index for the Customer ID attribute.
Answer:
Each Customer ID value identifies a single record, so every partition is pure and its Gini is 0. Therefore, the overall Gini for Customer ID is 0.
(c) Compute the Gini index for the Gender attribute.
Answer:
The Gini for Male is 1 − 2 × 0.5² = 0.5, and the Gini for Female is also 0.5. Since each gender covers half of the examples, the overall Gini for Gender is 0.5 × 0.5 + 0.5 × 0.5 = 0.5.
Homework (a code sketch for checking these computations follows the two items):
(d) Compute the Gini index for the Car Type attribute using multiway split.
(e) Compute the Gini index for the Shirt Size attribute using multiway split.
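A minimal Python sketch of these Gini computations. The gender and class counts below are illustrative assumptions chosen to match the 50/50 probabilities used in parts (a) and (c); the exercise's actual table should be substituted in.

from collections import Counter

def gini(labels):
    # Gini index of a collection of class labels: 1 - sum of p_i^2.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels):
    # Weighted Gini of a multiway split: sum_v (|S_v| / |S|) * Gini(S_v).
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        total += (len(subset) / n) * gini(subset)
    return total

# Assumed data: 10 males and 10 females, each group split evenly
# between classes C0 and C1.
gender = ["M"] * 10 + ["F"] * 10
labels = ["C0"] * 5 + ["C1"] * 5 + ["C0"] * 5 + ["C1"] * 5

print(gini(labels))                # part (a): 0.5
print(gini_split(gender, labels))  # part (c): 0.5

The same gini_split call answers (d) and (e) once the Car Type and Shirt Size columns are filled in from the exercise's table.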
Exercise:
(a) What is the entropy of this collection of training examples with respect to the positive class?
Answer:
There are four positive examples and five negative examples. Thus,
P(+) = 4/9 and P(−) = 5/9. The entropy of the training examples is
−(4/9) · log₂(4/9) − (5/9) · log₂(5/9) = 0.9911.
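As a quick numerical check, a minimal snippet using only the probabilities stated above:

import math

# Entropy of 4 positive / 5 negative examples: -sum(p * log2(p)).
p_pos, p_neg = 4 / 9, 5 / 9
entropy = -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
print(round(entropy, 4))  # 0.9911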
(b) What are the information gains of a1 and a2 relative to these training examples?
Answer:
For attribute a1, the corresponding counts and probabilities are: