CS446: Machine Learning

Lecture 21 (ML Models – Decision Trees – ID3)


Instructor:
Dr. Muhammad Kabir
Assistant Professor
[email protected]

School of Systems and Technology


Department of Computer Science
University of Management and Technology, Lahore
Decision Trees
– A decision tree is a tree-shaped diagram used to determine a course of action.
Each branch of the tree represents a possible decision, occurrence, or reaction.
– It is a tree-structured classifier, where internal nodes represent the features
of the dataset, branches represent the decision rules, and each leaf node
represents a class label.

2
Decision Trees
– A hierarchical data structure that represents data by implementing a
divide and conquer strategy
– Can be used as a non-parametric classification and regression method
– Given a collection of examples, learn a decision tree that represents it.
– Use this representation to classify new examples


3
The Representation
• Decision Trees are classifiers for instances represented as feature vectors
– color={red, blue, green} ; shape={circle, triangle, rectangle} ; label= {A, B, C}
• Nodes are tests for feature values
• There is one branch for each value of the feature
• Leaves specify the category (labels)
• Can categorize instances into multiple disjoint categories

[Figure: an example tree that tests Color at the root and Shape below it, with leaves labeled A, B, and C, used to contrast learning a decision tree with evaluating one.]
4
Decision Trees
Advantages
 Can be used for both classification and regression.
 Easy to interpret
 Can be used for non-linear data
 No need for normalization or scaling
 Not sensitive to outliers

Disadvantages
 Prone to overfitting
 Small changes in the data can alter the tree structure, causing instability
 Training time is relatively high
5
Decision Trees (Problem Solving?)
– Decision trees are non-parametric models capable of solving both classification
and regression problems (a minimal sketch using an off-the-shelf library follows below).

6
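As a concrete aside (not part of the original slides), the sketch below fits a decision tree classifier and a decision tree regressor on tiny toy datasets. It assumes scikit-learn is available; the toy data and parameter choices are purely illustrative.

```python
# Minimal sketch: decision trees for classification and regression
# (assumes scikit-learn is installed; toy data is illustrative only).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a discrete label from two numeric features.
X_cls = [[0, 0], [1, 0], [0, 1], [1, 1]]
y_cls = ["A", "A", "B", "B"]
clf = DecisionTreeClassifier(criterion="entropy")   # entropy-based splitting, as in ID3
clf.fit(X_cls, y_cls)
print(clf.predict([[0, 0.9]]))                      # -> ['B']

# Regression: predict a real-valued output.
X_reg = [[1.0], [2.0], [3.0], [4.0]]
y_reg = [1.1, 1.9, 3.2, 3.9]
reg = DecisionTreeRegressor(max_depth=2)
reg.fit(X_reg, y_reg)
print(reg.predict([[2.5]]))                         # a value near the training targets
```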
Decision Trees (Terminologies)

7
Decision Trees (Important Terms - Entropy)

8
Decision Trees (Important Terms - Entropy)

9
Decision Trees (Important Terms – Information Gain)

10
Decision Trees (Important Terms – Leaf Node)

11
Decision Trees (Important Terms – Root Node)

12
How does a Decision Tree work?

(Slides 13–28: a step-by-step graphical walkthrough.)
Decision Trees
► Output is a discrete category (classification).
► Real-valued outputs are possible (regression trees).

► There are efficient algorithms for processing large amounts of data (but not too many features).

► There are methods for handling noisy data (classification noise and attribute noise) and for handling missing attribute values.

[Figure: a tree with Color at the root and Shape tests below, with leaves labeled + and −.]

29
Decision Boundaries
• Usually, instances are represented as attribute-value pairs (color=blue,
shape = square, +)
• Numerical values can be used either by discretizing or by using thresholds
for splitting nodes
• In this case, the tree divides the feature space into axis-parallel rectangles,
each labeled with one of the labels

[Figure: a 2-D feature space (X, Y) partitioned into axis-parallel rectangles by the threshold tests X<3, X<1, Y>7, and Y<5, shown alongside the corresponding decision tree.]

30
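The tiny function below mirrors the kind of tree in the slide's figure: a nest of threshold tests over numeric features. The structure uses the figure's thresholds (X<3, X<1, Y>7, Y<5), but the leaf labels are hypothetical placeholders, since they cannot be read reliably from the slide; the point is only that each root-to-leaf path of threshold tests selects one axis-parallel rectangle of the (X, Y) plane.

```python
# Sketch: a decision tree over numeric features is a nest of threshold tests.
# Thresholds follow the slide's example; the '+'/'-' leaf labels are hypothetical.
def classify(x: float, y: float) -> str:
    if x < 3:
        if y > 7:
            return "+"      # rectangle: x < 3, y > 7
        if x < 1:
            return "-"      # rectangle: x < 1, y <= 7
        return "+"          # rectangle: 1 <= x < 3, y <= 7
    if y < 5:
        return "-"          # rectangle: x >= 3, y < 5
    return "+"              # rectangle: x >= 3, y >= 5

print(classify(2.0, 8.0), classify(4.0, 3.0))   # one label per rectangle
```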
Learning decision trees
(ID3 algorithm)

31
Decision Trees
• Can represent any Boolean function
• Can be viewed as a way to compactly represent a lot of data
• Natural representation: (20 questions)
• The evaluation of the Decision Tree Classifier is easy

• Clearly, given data, there are many ways to represent it as a decision tree.
• Learning a good representation from data is the challenge.

[Figure: Outlook at the root (Sunny, Overcast, Rain); Humidity under Sunny (High → No, Normal → Yes); Yes under Overcast; Wind under Rain (Strong → No, Weak → Yes).]
32
Will I play tennis today?
• Features
– Outlook: {Sun, Overcast, Rain}
– Temperature: {Hot, Mild, Cool}
– Humidity: {High, Normal, Low}
– Wind: {Strong, Weak}

• Labels
– Binary classification task: Y = {+, -}

33
Will I play tennis today?
    O  T  H  W  Play?
 1  S  H  H  W  -
 2  S  H  H  S  -
 3  O  H  H  W  +
 4  R  M  H  W  +
 5  R  C  N  W  +
 6  R  C  N  S  -
 7  O  C  N  S  +
 8  S  M  H  W  -
 9  S  C  N  W  +
10  R  M  N  W  +
11  S  M  N  S  +
12  O  M  H  S  +
13  O  H  N  W  +
14  R  M  H  S  -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

34
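For later reference (this is not from the slides), the table can be written as a small in-memory dataset. The list-of-dicts encoding below is just one convenient choice; the key names are illustrative.

```python
# The 14 training examples from the table, one dict per row.
# Keys: O = Outlook, T = Temperature, H = Humidity, W = Wind, Play = label.
data = [
    {"O": "S", "T": "H", "H": "H", "W": "W", "Play": "-"},  # 1
    {"O": "S", "T": "H", "H": "H", "W": "S", "Play": "-"},  # 2
    {"O": "O", "T": "H", "H": "H", "W": "W", "Play": "+"},  # 3
    {"O": "R", "T": "M", "H": "H", "W": "W", "Play": "+"},  # 4
    {"O": "R", "T": "C", "H": "N", "W": "W", "Play": "+"},  # 5
    {"O": "R", "T": "C", "H": "N", "W": "S", "Play": "-"},  # 6
    {"O": "O", "T": "C", "H": "N", "W": "S", "Play": "+"},  # 7
    {"O": "S", "T": "M", "H": "H", "W": "W", "Play": "-"},  # 8
    {"O": "S", "T": "C", "H": "N", "W": "W", "Play": "+"},  # 9
    {"O": "R", "T": "M", "H": "N", "W": "W", "Play": "+"},  # 10
    {"O": "S", "T": "M", "H": "N", "W": "S", "Play": "+"},  # 11
    {"O": "O", "T": "M", "H": "H", "W": "S", "Play": "+"},  # 12
    {"O": "O", "T": "H", "H": "N", "W": "W", "Play": "+"},  # 13
    {"O": "R", "T": "M", "H": "H", "W": "S", "Play": "-"},  # 14
]
print(sum(r["Play"] == "+" for r in data), "positive of", len(data))  # 9 of 14
```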
Basic Decision Trees Learning Algorithm
• Data is processed in batch (i.e., all the data is available).
• Recursively build a decision tree top down.

[Figure: the data table alongside the resulting tree — Outlook at the root; Humidity under Sunny (High → No, Normal → Yes); Yes under Overcast; Wind under Rain (Strong → No, Weak → Yes).]
Basic Decision Tree Algorithm
• Let S be the set of examples
  – Label is the target attribute (the prediction)
  – Attributes is the set of measured attributes
• ID3(S, Attributes, Label):
    If all examples are labeled the same, return a single-node tree with that Label
    Otherwise Begin
      A = the attribute in Attributes that best classifies S (create a Root node for the tree)
      For each possible value v of A:
        Add a new tree branch corresponding to A = v
        Let Sv be the subset of examples in S with A = v
        If Sv is empty: add a leaf node with the most common value of Label in S
          (why? so the tree can still answer, at evaluation time, for values unseen in S)
        Else: below this branch, add the subtree ID3(Sv, Attributes − {A}, Label)
    End
    Return Root
(A runnable sketch of this procedure is given below.)

36
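The sketch below is one way to render the pseudocode above in runnable Python. It chooses the attribute that "best classifies S" by information gain, as introduced on the following slides, and uses the most common label in S as the default for empty subsets; all function and variable names are my own, and the whole thing should be read as an illustration rather than a reference implementation.

```python
# ID3 sketch following the slide's pseudocode (illustrative, not authoritative).
from collections import Counter
from math import log2

def entropy(examples, label):
    """Entropy of the label distribution over a list of dict-valued examples."""
    counts = Counter(ex[label] for ex in examples)
    n = len(examples)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def information_gain(examples, attr, label):
    """Expected reduction in entropy from partitioning the examples on attr."""
    n = len(examples)
    remainder = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [ex for ex in examples if ex[attr] == v]
        remainder += len(subset) / n * entropy(subset, label)
    return entropy(examples, label) - remainder

def id3(examples, attributes, values, label):
    """values maps each attribute to its full list of possible values."""
    labels = [ex[label] for ex in examples]
    if len(set(labels)) == 1:                  # all examples labeled the same:
        return labels[0]                       # single-node tree with that label
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                         # no attributes left to test
        return majority
    # A = attribute in Attributes that best classifies S
    a = max(attributes, key=lambda attr: information_gain(examples, attr, label))
    tree = {a: {}}
    for v in values[a]:                        # one branch per possible value of A
        subset = [ex for ex in examples if ex[a] == v]
        if not subset:                         # Sv empty: fall back to majority label in S
            tree[a][v] = majority
        else:
            tree[a][v] = id3(subset, [x for x in attributes if x != a], values, label)
    return tree

def predict(tree, example, default="+"):
    """Walk the learned tree for one example ('evaluation time')."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(example[attr], default)
    return tree
```

Run on the 14 "play tennis" examples (encoded as in the sketch after the data table), with a values dict listing each attribute's possible values (e.g. "H": ["H", "N", "L"]), this procedure should pick Outlook at the root and reproduce the tree built on the closing slides.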
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
– But finding the minimal decision tree consistent with the data is NP-hard.
• The recursive algorithm is a greedy heuristic search for a
simple tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next
attribute to condition on.

37
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
– The main decision in the algorithm is the selection of the next attribute
to condition on.
• We want attributes that split the examples to sets that are
relatively pure in one label; this way we are closer to a leaf
node.
– The most popular heuristic is based on information gain, originating
with Quinlan's ID3 system.

38
Entropy
• Entropy (impurity, disorder) of a set of examples S, relative to a
binary classification, is:

    Entropy(S) = −p+ log2(p+) − p− log2(p−)

• p+ is the proportion of positive examples in S and
• p− is the proportion of negative examples in S
– If all the examples belong to the same category: Entropy = 0
– If all the examples are equally mixed (0.5, 0.5): Entropy = 1
– Entropy = level of uncertainty.
• In general, when pi is the fraction of examples labeled i:

    Entropy(S) = Entropy(p1, p2, …, pk) = − Σ_{i=1}^{k} pi log2(pi)
39
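The two boundary cases in the bullets above (entropy 0 for a pure set, entropy 1 for an evenly mixed one) are easy to confirm numerically. A small check, with an illustrative helper name:

```python
from math import log2

def binary_entropy(p_pos):
    """Entropy of a binary label distribution with positive fraction p_pos."""
    p_neg = 1.0 - p_pos
    return sum(-p * log2(p) for p in (p_pos, p_neg) if p > 0)  # treat 0*log(0) as 0

print(binary_entropy(1.0))   # all examples in one category -> 0.0
print(binary_entropy(0.5))   # equally mixed (0.5, 0.5)     -> 1.0
```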
Information Gain
(Reminder: high entropy = high level of uncertainty; low entropy = no uncertainty.)

• The information gain of an attribute a is the expected reduction
in entropy caused by partitioning on this attribute:

    Gain(S, a) = Entropy(S) − Σ_{v ∈ values(a)} (|Sv| / |S|) · Entropy(Sv)

• Where:
  – Sv is the subset of S for which attribute a has value v, and
  – the entropy of partitioning the data is calculated by weighting the
    entropy of each partition by its size relative to the original set

• Partitions of low entropy (imbalanced splits) lead to high gain
• Go back to check which of the A, B splits is better

[Figure: Outlook splitting into Sunny, Overcast, Rain.]

40
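To make the "partitions of low entropy lead to high gain" point concrete, the sketch below compares two hypothetical ways of splitting an evenly mixed set of 8 examples: one that yields pure (zero-entropy) partitions and one that leaves each partition as mixed as the original. The helper names and toy label lists are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(parent_labels, partitions):
    """Entropy(S) minus the size-weighted entropy of the partitions Sv."""
    n = len(parent_labels)
    expected = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent_labels) - expected

parent = ["+"] * 4 + ["-"] * 4                   # entropy 1.0
print(gain(parent, [["+"] * 4, ["-"] * 4]))      # pure partitions   -> gain 1.0
print(gain(parent, [["+", "+", "-", "-"]] * 2))  # still 50/50 mixed -> gain 0.0
```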
Will I play tennis today?
(The data table and attribute legend are repeated here from slide 34.)

41
Will I play tennis today?
Calculate the current entropy of the full dataset:

    p+ = 9/14,  p− = 5/14
    Entropy(Play) = −p+ log2(p+) − p− log2(p−)
                  = −(9/14) log2(9/14) − (5/14) log2(5/14)
                  ≈ 0.94
42
Information Gain: Outlook
    Gain(S, a) = Entropy(S) − Σ_{v ∈ values(a)} (|Sv| / |S|) · Entropy(Sv)

Outlook = Sunny:    p+ = 2/5, p− = 3/5   Entropy(O = S) = 0.971
Outlook = Overcast: p+ = 4/4, p− = 0     Entropy(O = O) = 0
Outlook = Rainy:    p+ = 3/5, p− = 2/5   Entropy(O = R) = 0.971

Expected entropy = Σ_v (|Sv| / |S|) · Entropy(Sv)
                 = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694

Information gain = 0.940 − 0.694 = 0.246
43
Information Gain: Humidity
    Gain(S, a) = Entropy(S) − Σ_{v ∈ values(a)} (|Sv| / |S|) · Entropy(Sv)

Humidity = High:   p+ = 3/7, p− = 4/7   Entropy(H = H) = 0.985
Humidity = Normal: p+ = 6/7, p− = 1/7   Entropy(H = N) = 0.592

Expected entropy = Σ_v (|Sv| / |S|) · Entropy(Sv)
                 = (7/14) × 0.985 + (7/14) × 0.592 = 0.789

Information gain = 0.940 − 0.789 = 0.151

44
Which feature to split on?
Information gain of each candidate attribute:
    Outlook:     0.246
    Humidity:    0.151
    Wind:        0.048
    Temperature: 0.029

→ Split on Outlook (a sketch that reproduces these numbers follows below)

45
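As a check (not part of the slides), the four gains can be recomputed from the 14-example table. The sketch below re-enters the same rows compactly and prints the gain of each attribute; helper names are illustrative.

```python
from collections import Counter
from math import log2

# Rows 1-14 from the table: Outlook, Temperature, Humidity, Wind, Play?
tokens = """S H H W -   S H H S -   O H H W +   R M H W +   R C N W +
            R C N S -   O C N S +   S M H W -   S C N W +   R M N W +
            S M N S +   O M H S +   O H N W +   R M H S -""".split()
keys = ["O", "T", "H", "W", "Play"]
data = [dict(zip(keys, tokens[i:i + 5])) for i in range(0, len(tokens), 5)]

def entropy(examples):
    n = len(examples)
    counts = Counter(e["Play"] for e in examples)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def gain(examples, attr):
    n = len(examples)
    split = {}
    for e in examples:
        split.setdefault(e[attr], []).append(e)
    return entropy(examples) - sum(len(s) / n * entropy(s) for s in split.values())

for name, attr in [("Outlook", "O"), ("Humidity", "H"), ("Wind", "W"), ("Temperature", "T")]:
    print(f"{name}: {gain(data, attr):.3f}")
# Prints roughly 0.247, 0.152, 0.048, 0.029; the slide's 0.246 and 0.151
# come from rounding the intermediate entropies before subtracting.
```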
An Illustrative Example (III)
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Gain(S, Outlook) = 0.246

[Figure: Outlook chosen as the root.]

46
An Illustrative Example (III)
Split on Outlook:

    Sunny:    examples 1, 2, 8, 9, 11   (2+, 3−)  → ?
    Overcast: examples 3, 7, 12, 13     (4+, 0−)  → Yes
    Rain:     examples 4, 5, 6, 10, 14  (3+, 2−)  → ?

47
An Illustrative Example (III)
    Outlook
    Sunny:    examples 1, 2, 8, 9, 11   (2+, 3−)  → ?
    Overcast: examples 3, 7, 12, 13     (4+, 0−)  → Yes
    Rain:     examples 4, 5, 6, 10, 14  (3+, 2−)  → ?

Continue until:
• every attribute is included in the path, or
• all examples in the leaf have the same label

48
An Illustrative Example (IV)
    Outlook
    Sunny:    examples 1, 2, 8, 9, 11   (2+, 3−)  → ?
    Overcast: examples 3, 7, 12, 13     (4+, 0−)  → Yes
    Rain:     examples 4, 5, 6, 10, 14  (3+, 2−)  → ?

Choosing the attribute for the Sunny branch:

    Gain(S_sunny, Humidity) ≈ 0.97 − (3/5)·0 − (2/5)·0 = 0.97
    Gain(S_sunny, Temp)     ≈ 0.97 − (2/5)·0 − (2/5)·1 − (1/5)·0 = 0.57
    Gain(S_sunny, Wind)     ≈ 0.97 − (2/5)·1 − (3/5)·0.92 = 0.02

Split on Humidity
49
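These three numbers can be checked directly on the five Sunny examples (rows 1, 2, 8, 9, 11); a short self-contained sketch with illustrative names:

```python
from collections import Counter
from math import log2

# Sunny subset, one tuple per row: (Temperature, Humidity, Wind, Play?).
sunny = [("H", "H", "W", "-"), ("H", "H", "S", "-"), ("M", "H", "W", "-"),
         ("C", "N", "W", "+"), ("M", "N", "S", "+")]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    labels = [r[-1] for r in rows]
    split = {}
    for r in rows:
        split.setdefault(r[col], []).append(r[-1])
    return entropy(labels) - sum(len(s) / len(rows) * entropy(s) for s in split.values())

print(f"Humidity {gain(sunny, 1):.2f}  Temp {gain(sunny, 0):.2f}  Wind {gain(sunny, 2):.2f}")
# -> Humidity 0.97  Temp 0.57  Wind 0.02
```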
An Illustrative Example (V)
    Outlook
    Sunny:    examples 1, 2, 8, 9, 11   (2+, 3−)  → ?
    Overcast: examples 3, 7, 12, 13     (4+, 0−)  → Yes
    Rain:     examples 4, 5, 6, 10, 14  (3+, 2−)  → ?

50
An Illustrative Example (V)
    Outlook
    Sunny:    examples 1, 2, 8, 9, 11   (2+, 3−)  → Humidity
              High → No,  Normal → Yes
    Overcast: examples 3, 7, 12, 13     (4+, 0−)  → Yes
    Rain:     examples 4, 5, 6, 10, 14  (3+, 2−)  → ?

51
An Illustrative Example (VI)
    Outlook
    Sunny:    examples 1, 2, 8, 9, 11   (2+, 3−)  → Humidity
              High → No,  Normal → Yes
    Overcast: examples 3, 7, 12, 13     (4+, 0−)  → Yes
    Rain:     examples 4, 5, 6, 10, 14  (3+, 2−)  → Wind
              Strong → No,  Weak → Yes

52
53
Video Lecture for ID3 - Understanding
https://www.youtube.com/watch?v=coOTEc-0OGw&ab_channel=MaheshHuddar

54
