ML - 4
A B C f
0 0 0 m0
0 0 1 m1
0 1 0 m2
0 1 1 m3
1 0 0 m4
1 0 1 m5
1 1 0 m6
1 1 1 m7
• In the last slide we considered a decision tree in which every
attribute is binary. A decision tree can also be built when attributes are
of a continuous data type
Decision Tree with numeric data
– Greedy strategy
• The tree is built in a top-down, recursive, divide-and-conquer manner
Discussions on Algorithm:
Called with three parameters: D, attribute_list, attribute_selection_method
D: data partition
attribute_list: list of attributes describing the tuples
attribute_selection_method: heuristic method for selecting the best attribute
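The three-parameter call described above can be sketched in Python. This is only an illustration under stated assumptions, not the algorithm as given in the lecture: D is assumed to be a list of (attribute-dict, class-label) pairs, attribute_selection_method is assumed to be any function that returns one attribute name, and the names build_tree, majority_class and first_attribute, as well as the toy data, are hypothetical.

```python
from collections import Counter


def majority_class(D):
    """Most common class label in partition D."""
    return Counter(label for _, label in D).most_common(1)[0][0]


def build_tree(D, attribute_list, attribute_selection_method):
    """Greedy, top-down, recursive construction of a decision tree."""
    labels = {label for _, label in D}
    if len(labels) == 1:          # every tuple has the same class: leaf
        return labels.pop()
    if not attribute_list:        # no attribute left: majority vote
        return majority_class(D)
    # Greedy step: the heuristic picks one attribute and is never revisited.
    best = attribute_selection_method(D, attribute_list)
    remaining = [a for a in attribute_list if a != best]
    node = {"attribute": best, "branches": {}}
    # Divide: one partition per observed value, then recurse on each.
    for value in sorted({row[best] for row, _ in D}):
        Dj = [(row, label) for row, label in D if row[best] == value]
        node["branches"][value] = build_tree(
            Dj, remaining, attribute_selection_method
        )
    return node


def first_attribute(D, attribute_list):
    """Placeholder heuristic (assumption): take the first attribute."""
    return attribute_list[0]


# Made-up partition with two binary attributes.
D = [
    ({"A": 0, "B": 0}, "no"),
    ({"A": 0, "B": 1}, "yes"),
    ({"A": 1, "B": 0}, "yes"),
    ({"A": 1, "B": 1}, "yes"),
]
tree = build_tree(D, ["A", "B"], first_attribute)
print(tree)
```

In practice the placeholder heuristic would be replaced by a measure such as information gain, passed in through the same third parameter.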
July 12, 2024 13
UNIVERSITY OF ENGINEERING & MANAGEMENT, KOLKATA
Definition
m=2
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let $p_i$ be the probability that an arbitrary tuple in D belongs to
class $C_i$, estimated by $|C_{i,D}|/|D|$
Expected information (entropy) needed to classify a tuple in D:
\[ Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \]
Information needed (after using A to split D into v partitions) to
classify D:
\[ Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j) \]
Information gained by branching on attribute A:
\[ Gain(A) = Info(D) - Info_A(D) \]
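The three quantities above can be computed directly from label counts. The following Python sketch is illustrative only: the function names info, info_after_split and gain are hypothetical, and the four-tuple data set is made up to keep the arithmetic easy to check by hand.

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(D): entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_after_split(values, labels):
    """Info_A(D): weighted entropy of the partitions induced by attribute A."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    return sum(len(p) / n * info(p) for p in partitions.values())


def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(values, labels)


# Made-up data: two classes, two candidate attributes.
labels = ["yes", "yes", "no", "no"]
attr_a = ["x", "x", "y", "y"]   # separates the classes perfectly
attr_b = ["x", "y", "x", "y"]   # carries no information about the class

print(info(labels), gain(attr_a, labels), gain(attr_b, labels))
```

With two equally likely classes Info(D) is 1 bit, so the perfectly separating attribute gains the full bit and the uninformative one gains nothing.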
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
Rows of the training table that survive on this slide (columns, left to right: first attribute, income, student, credit_rating, class):
>40     low     yes  excellent  no
31…40   low     yes  excellent  yes
<=30    medium  no   fair       no
<=30    low     yes  fair       yes
>40     medium  yes  fair       yes
<=30    medium  yes  excellent  yes
31…40   medium  no   excellent  yes
31…40   high    yes  fair       yes
>40     medium  no   excellent  no
Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
– Sort the values of A in increasing order
– Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
• $(a_i + a_{i+1})/2$ is the midpoint between the values $a_i$ and $a_{i+1}$
– The point with the minimum expected information
requirement for A is selected as the split-point for A
• Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
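The split-point search described above can be sketched as follows. This is a minimal illustration, assuming entropy as the expected-information measure; the names info and best_split_point are hypothetical and the numeric values are made up.

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(D): entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split_point(values, labels):
    """Return (split_point, expected_info) with minimum expected information.

    Returns None when A has fewer than two distinct values.
    """
    n = len(values)
    distinct = sorted(set(values))            # sort the values of A
    best = None
    for a_i, a_next in zip(distinct, distinct[1:]):
        split = (a_i + a_next) / 2            # midpoint of adjacent values
        d1 = [y for v, y in zip(values, labels) if v <= split]   # A <= split
        d2 = [y for v, y in zip(values, labels) if v > split]    # A > split
        expected = len(d1) / n * info(d1) + len(d2) / n * info(d2)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best


# Made-up continuous attribute with its class labels.
a_values = [25, 32, 47, 51]
classes = ["no", "no", "yes", "yes"]
split, expected = best_split_point(a_values, classes)
print(split)
```

Here the candidate midpoints are 28.5, 39.5 and 49.0, and 39.5 is chosen because both resulting partitions are pure.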
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a
large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
\[ SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right) \]
– GainRatio(A) = Gain(A)/SplitInfo_A(D)
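A small Python sketch may help show why the normalization matters. It is illustrative only: the function names are hypothetical, the data is made up, and returning 0.0 when SplitInfo is zero is a convention chosen here, not something stated on the slide.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * info(p) for p in partitions.values())
    return info(labels) - info_a


def split_info(values):
    """SplitInfo_A(D): entropy of the partition sizes themselves."""
    return info(values)


def gain_ratio(values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); 0.0 if SplitInfo is zero."""
    si = split_info(values)
    return gain(values, labels) / si if si > 0 else 0.0


# Made-up data: a unique-per-tuple attribute versus a binary attribute.
labels = ["yes", "yes", "no", "no"]
row_id = [1, 2, 3, 4]            # one value per tuple
binary = ["x", "x", "y", "y"]    # two values, also separates the classes

print(gain(row_id, labels), gain(binary, labels))
print(gain_ratio(row_id, labels), gain_ratio(binary, labels))
```

Both attributes reach the same information gain of 1 bit, but the many-valued one has SplitInfo of 2 bits against 1 bit for the binary one, so gain ratio prefers the binary attribute.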
• Ex.
• The three measures, in general, return good results but
– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one partition is
much smaller than the others
– Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and
purity in both partitions
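The Gini index is not defined on these slides, so the following Python sketch assumes the usual impurity definition, one minus the sum of squared class proportions, weighted by partition size after a split. The names gini and gini_split and the data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (assumed definition)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(values, labels):
    """Size-weighted Gini impurity of the partitions induced by an attribute."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    return sum(len(p) / n * gini(p) for p in partitions.values())


# Made-up data: one split yields pure partitions, the other does not help.
labels = ["yes", "yes", "no", "no"]
pure_split = ["x", "x", "y", "y"]
mixed_split = ["x", "y", "x", "y"]

print(gini(labels), gini_split(pure_split, labels), gini_split(mixed_split, labels))
```

A lower value after the split is better, so the split into two equal-sized pure partitions is the one this measure favors.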
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
• C-SEP: performs better than info. gain and gini index in certain cases
• G-statistic: has a close approximation to χ2 distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
– The best tree is the one that requires the fewest bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
• Multivariate splits (partition based on multiple variable combinations)
– CART: finds multivariate splits based on a linear combination of attributes
• Which attribute selection measure is the best?
[Figure: positive examples, with the regions covered by Rule 1, Rule 2 and Rule 3]
Rule Generation
• To generate a rule
    while (true)
        find the best predicate p
        if foil-gain(p) > threshold then add p to the current rule
        else break
[Figure: a rule specialized one predicate at a time (A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5), separating positive examples from negative examples]
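The rule-growing loop can be sketched in Python. The slides do not define foil-gain, so this sketch assumes the commonly used form pos' × (log2(pos'/(pos'+neg')) − log2(pos/(pos+neg))), where pos and neg count the examples covered before adding the predicate and pos' and neg' count those covered after. The names covers, foil_gain and grow_rule, the threshold of 0.0 and the data are all hypothetical.

```python
from math import log2


def covers(rule, example):
    """True if the example satisfies every (attribute, value) test in the rule."""
    return all(example.get(attr) == val for attr, val in rule)


def foil_gain(rule, predicate, examples):
    """Assumed FOIL gain of extending `rule` with `predicate`."""
    covered = [(x, y) for x, y in examples if covers(rule, x)]
    pos = sum(1 for _, y in covered if y)
    neg = len(covered) - pos
    extended = [(x, y) for x, y in covered if covers([predicate], x)]
    pos2 = sum(1 for _, y in extended if y)
    neg2 = len(extended) - pos2
    if pos == 0 or pos2 == 0:      # nothing positive left to gain
        return 0.0
    return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))


def grow_rule(examples, candidates, threshold=0.0):
    """Add the best predicate repeatedly while its gain exceeds the threshold."""
    rule = []
    while True:
        remaining = [p for p in candidates if p not in rule]
        if not remaining:
            break
        best = max(remaining, key=lambda p: foil_gain(rule, p, examples))
        if foil_gain(rule, best, examples) > threshold:
            rule.append(best)
        else:
            break
    return rule


# Made-up examples: (attribute values, is_positive).
examples = [
    ({"A3": 1, "A1": 2}, True),
    ({"A3": 1, "A1": 2}, True),
    ({"A3": 1, "A1": 0}, False),
    ({"A3": 0, "A1": 2}, False),
    ({"A3": 0, "A1": 2}, False),
    ({"A3": 0, "A1": 0}, False),
]
candidates = [("A3", 1), ("A1", 2), ("A1", 0)]
rule = grow_rule(examples, candidates)
print(" && ".join(f"{a}={v}" for a, v in rule))
```

On this toy data the loop first adds A3=1, then A1=2, and stops once no remaining predicate improves the rule.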
Thank You