Learning by Asking Questions: Decision Trees
Machine Learning (CS771A)
Piyush Rai
Aug 5, 2016
Indoor or Outdoor?
Decision Tree
Defined by a hierarchy of rules (in the form of a tree)
Rules form the internal nodes of the tree (topmost internal node = root)
Each internal node tests the value of some feature and “splits” data across the outgoing branches
Note: The tree need not be a binary tree
(Labeled) Training data is used to construct the Decision Tree1 (DT)
The DT can then be used to predict the label y of a test example x
1 Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees
Decision Tree: An Example
Identifying which region (blue or green) a point lies in (binary classification)
Each point has 2 features: its coordinates (x1, x2) on the 2D plane
Left: Training data, Right: A DT constructed using this data
The DT can be used to predict the region (blue/green) of a new test point
By testing the features of the test point
In the order defined by the DT (first x2 and then x1 )
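To make the prediction procedure concrete, here is a minimal Python sketch of how such a DT would classify a test point: it tests x2 at the root and x1 below it, in the order the tree defines. The thresholds t2 and t1, and the blue/green labels at the leaves, are hypothetical (the actual values come from the figure).

```python
def predict_region(x1, x2, t2=0.5, t1=0.5):
    """Classify a 2D point by testing x2 first, then x1, as the example DT does."""
    if x2 >= t2:            # root node: test feature x2
        return "green"      # hypothetical leaf label
    # otherwise descend and test feature x1
    return "blue" if x1 >= t1 else "green"   # hypothetical leaf labels

print(predict_region(0.3, 0.8))   # 'green' under these made-up thresholds
```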
Decision Tree: Another Example
Deciding whether to play or not to play Tennis on a Saturday
A binary classification problem (play vs no-play)
Each input (a Saturday) has 4 features: Outlook, Temp., Humidity, Wind
Question: Why does it make more sense to test the feature “outlook” first?
Answer: Of all the 4 features, it’s most informative
We will see shortly how to quantify the informativeness
Analogy: Playing the game 20 Questions (the most useful questions first)
We can assess the informativeness of each feature by looking at how much it reduces the entropy of the class distribution
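As a small sketch (not from the slides), the entropy of a node's class distribution can be computed directly from the label counts; the next slides use exactly this quantity.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of label counts."""
    total = sum(counts)
    # convention: a class with zero count contributes 0 (0 * log2 0 := 0)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([2, 3]))   # ~0.971 bits for a node with 2 positive, 3 negative examples
```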
Information Gain
IG(S, F) denotes the number of bits saved while encoding S once we know the value of the feature F
For this node (S = [2+, 3−]), the IG for the feature temperature:

IG(S, temperature) = H(S) − Σ_{v ∈ {hot, mild, cool}} (|S_v| / |S|) H(S_v)

S = [2+, 3−] ⇒ H(S) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
S_hot = [0+, 2−] ⇒ H(S_hot) = −0 · log2(0) − (2/2) log2(2/2) = 0 (taking 0 · log2 0 = 0 by convention)
S_mild = [1+, 1−] ⇒ H(S_mild) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1
S_cool = [1+, 0−] ⇒ H(S_cool) = −(1/1) log2(1/1) − (0/1) log2(0/1) = 0
IG(S, temperature) = 0.971 − (2/5)·0 − (2/5)·1 − (1/5)·0 = 0.570
Likewise we can compute: IG(S, humidity) = 0.970, IG(S, wind) = 0.019
Therefore, we choose “humidity” (with the highest IG = 0.970) for the level-2 left node
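The same computation can be written as a short Python sketch. The counts below are the ones for this node taken from the slide; the printed values match the numbers above up to rounding.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Node S = [2+, 3-]; (positive, negative) counts per value of "temperature"
S = (2, 3)
splits = {"hot": (0, 2), "mild": (1, 1), "cool": (1, 0)}

n = sum(S)
ig = entropy(S) - sum(sum(Sv) / n * entropy(Sv) for Sv in splits.values())
print(round(entropy(S), 2), round(ig, 2))   # 0.97 0.57
```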
Growing The Tree
Stop splitting a node once it consists of examples all having the same label (the node becomes “pure”)
Or we run out of features to test!
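A minimal, hedged sketch of the overall growing procedure (greedy, ID3-style construction, matching what the lecture describes): pick the highest-IG feature, split the data on its values, and recurse until a node is pure or no features remain. The names grow_tree, information_gain and the (feature_dict, label) representation are my own choices, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, f):
    """IG of splitting `examples` (a list of (feature_dict, label) pairs) on feature f."""
    labels = [y for _, y in examples]
    ig = entropy(labels)
    for v in {x[f] for x, _ in examples}:
        subset = [y for x, y in examples if x[f] == v]
        ig -= (len(subset) / len(labels)) * entropy(subset)
    return ig

def grow_tree(examples, features):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                 # node is "pure": all labels equal
        return labels[0]
    if not features:                          # ran out of features: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(examples, f))
    children = {v: grow_tree([(x, y) for x, y in examples if x[best] == v],
                             [f for f in features if f != best])
                for v in {x[best] for x, _ in examples}}
    return (best, children)                   # internal node: (feature tested, subtrees)

# usage sketch (hypothetical data in the spirit of the tennis example):
# tree = grow_tree([({"outlook": "sunny", "wind": "weak"}, "no"), ...], ["outlook", "wind"])
```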
Overfitting Illustration
Desired: a DT that is not too big in size, yet fits the training data reasonably well
Minimum Description Length (MDL): more details when we cover Model Selection
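As a hedged, practical illustration (not part of the lecture), libraries such as scikit-learn expose knobs that keep a tree from growing too big while still fitting the training data; all parameter values below are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

small_tree = DecisionTreeClassifier(
    criterion="entropy",    # split on entropy/information gain, as in the lecture
    max_depth=3,            # cap the depth up front (pre-pruning)
    min_samples_leaf=5,     # avoid tiny, likely-overfit leaves
    ccp_alpha=0.01,         # cost-complexity pruning after growing
    random_state=0,
).fit(X, y)

print(small_tree.get_depth(), small_tree.get_n_leaves())
```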
Decision Tree Extensions
Real-valued labels (Regression Trees2) by re-defining entropy or using other criteria measuring how similar to each other the y’s at a node are (see the variance-based sketch below)
More sophisticated decision rules at the internal nodes (anything that splits the data into homogeneous groups; e.g., a machine learning classifier)
2 Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees
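One common concrete choice for “how similar are the y’s at a node” is their variance: pick the split that most reduces the (weighted) variance of the targets in the children. A minimal sketch, assuming this variance-reduction criterion:

```python
import statistics

def variance(ys):
    """Population variance of the target values at a node."""
    return statistics.pvariance(ys) if len(ys) > 1 else 0.0

def variance_reduction(ys, left_ys, right_ys):
    """How much a candidate split makes the y's within each child more alike."""
    n = len(ys)
    weighted = (len(left_ys) / n) * variance(left_ys) \
             + (len(right_ys) / n) * variance(right_ys)
    return variance(ys) - weighted

# a split that separates small y's from large y's gives a large reduction
ys = [1.0, 1.2, 0.9, 5.1, 4.8]
print(variance_reduction(ys, ys[:3], ys[3:]))
```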
Some Aspects about Decision Trees
Some key strengths:
Simple and easy to interpret
Do not make any assumptions about the distribution of the data
Easily handle different types of features (real, categorical/nominal, etc.)
Very fast at test time (just need to check the features, starting at the root node and following the DT until you reach a leaf node)
Multiple DTs can be combined via ensemble methods (e.g., Decision Forest)
Each DT can be constructed using a (random) small subset of features
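As a hedged illustration of the ensemble idea (not from the slides), scikit-learn’s RandomForestClassifier combines many DTs, with each split considering only a random subset of the features; all parameter values below are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of decision trees in the ensemble
    max_features="sqrt",    # each split looks at a random subset of the features
    random_state=0,
).fit(X, y)

print(forest.score(X, y))   # training accuracy of the combined ensemble
```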