
Learning by Asking Questions: Decision Trees

Piyush Rai

Machine Learning (CS771A)

Aug 5, 2016



A Classification Problem

Indoor or Outdoor?

Pic credit: “Decision Forests: A Unified Framework” by Criminisi et al


Predicting by Asking Questions

How can we learn this tree using labeled training data?


Pic credit: “Decision Forests: A Unified Framework” by Criminisi et al
Decision Tree
Defined by a hierarchy of rules (in form of a tree)

Rules form the internal nodes of the tree (topmost internal node = root)
Each internal node tests the value of some feature and “splits” data across the outgoing branches
Note: The tree need not be a binary tree
(Labeled) Training data is used to construct the Decision Tree [1] (DT)
The DT can then be used to predict the label y of a test example x
[1] Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees
Decision Tree: An Example
Identifying the region (blue or green) a point lies in (binary classification)
Each point has 2 features: its co-ordinates {x1, x2} on the 2D plane
Left: Training data, Right: A DT constructed using this data

The DT can be used to predict the region (blue/green) of a new test point
By testing the features of the test point
In the order defined by the DT (first x2 and then x1)
Decision Tree: Another Example
Deciding whether to play or not to play Tennis on a Saturday
A binary classification problem (play vs no-play)
Each input (a Saturday) has 4 features: Outlook, Temp., Humidity, Wind
Left: Training data, Right: A decision tree constructed using this data

The DT can be used to predict play vs no-play for a new Saturday


By testing the features of that Saturday
In the order defined by the DT
Pic credit: Tom Mitchell
Decision Tree Construction

Now let’s look at the playing Tennis example

Question: Why does it make more sense to test the feature “outlook” first?
Answer: Of all the 4 features, it’s most informative
We will see shortly how to quantify the informativeness
Analogy: Playing the game 20 Questions (the most useful questions first)

Entropy
Entropy is a measure of randomness/uncertainty of a set
Assume our data is a set S of examples with C many classes
pc is the probability that a random element of S belongs to class c
.. basically, the fraction of elements of S belonging to class c

Probability vector p = [p1, p2, ..., pC] is the class distribution of the set S


Entropy of the set S: $H(S) = -\sum_{c \in C} p_c \log_2 p_c$

If a set S of examples (or any subset of it) has..


Some dominant classes =⇒ small entropy of the class distribution
Equiprobable classes =⇒ high entropy of the class distribution

We can assess informativeness of each feature by looking at how much it reduces the entropy of
the class distribution
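As a concrete check of this definition, here is a minimal Python sketch (an illustration, not part of the slides; the function name entropy is mine) that computes H(S) from a list of class labels. The 9-vs-5 example anticipates the tennis data used later:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_c p_c log2 p_c, where p_c is the fraction of labels in class c."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

# A set with a dominant class has low entropy; equiprobable classes give high entropy
print(entropy(["a"] * 9 + ["b"] * 5))   # ~0.940
print(entropy(["a"] * 7 + ["b"] * 7))   # 1.0
```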
Information Gain

Let’s assume each element of S has a set of features


Information Gain (IG) on knowing the value of some feature F:

$IG(S, F) = H(S) - \sum_{f \in F} \frac{|S_f|}{|S|}\, H(S_f)$

S_f denotes the subset of elements of S for which feature F has value f


IG (S, F ) = entropy of S minus the weighted sum of entropy of its children
IG (S, F ): Increase in our certainty about S once we know the value of F

IG (S, F ) denotes the no. of bits saved while encoding S once we know the value of the feature F
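Continuing the sketch above (again an illustration, not the course's reference code), IG(S, F) can be computed directly from this formula; the toy data format (a list of dicts plus a matching label list) is an assumption of the sketch:

```python
def information_gain(examples, labels, feature):
    """IG(S, F) = H(S) - sum_f |S_f|/|S| * H(S_f).

    `examples` is a list of dicts mapping feature names to values and
    `labels` is the matching list of class labels."""
    gain = entropy(labels)
    for value in set(x[feature] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[feature] == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain
```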

Entropy and Information Gain: Pictorially
Assume we have a 4-class problem. Each point has 2 features
Which feature should we test (i.e., split on) first?

Pic credit: “Decision Forests: A Unified Framework” by Criminisi et al


Computing Information Gain

Coming back to playing tennis..


Let’s begin with the root node of the DT and compute IG of each feature
Consider feature “wind” ∈ {weak, strong} and its IG w.r.t. the root node

Root node: S = [9+, 5−] (all training data: 9 play, 5 no-play)


Entropy: H(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94
S_weak = [6+, 2−] =⇒ H(S_weak) = 0.811
S_strong = [3+, 3−] =⇒ H(S_strong) = 1

$IG(S, \text{wind}) = H(S) - \frac{|S_\text{weak}|}{|S|} H(S_\text{weak}) - \frac{|S_\text{strong}|}{|S|} H(S_\text{strong}) = 0.94 - (8/14) \cdot 0.811 - (6/14) \cdot 1 = 0.048$
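To verify these numbers, the sketch below encodes the 14-day table from Tom Mitchell's textbook that the slides reference (the table itself is only in a figure not reproduced here, so treat the row values as an assumption) and reuses the information_gain sketch from earlier:

```python
features = ["outlook", "temperature", "humidity", "wind"]
rows = [  # (outlook, temperature, humidity, wind, play?)
    ("sunny",    "hot",  "high",   "weak",   "no"),   # D1
    ("sunny",    "hot",  "high",   "strong", "no"),   # D2
    ("overcast", "hot",  "high",   "weak",   "yes"),  # D3
    ("rain",     "mild", "high",   "weak",   "yes"),  # D4
    ("rain",     "cool", "normal", "weak",   "yes"),  # D5
    ("rain",     "cool", "normal", "strong", "no"),   # D6
    ("overcast", "cool", "normal", "strong", "yes"),  # D7
    ("sunny",    "mild", "high",   "weak",   "no"),   # D8
    ("sunny",    "cool", "normal", "weak",   "yes"),  # D9
    ("rain",     "mild", "normal", "weak",   "yes"),  # D10
    ("sunny",    "mild", "normal", "strong", "yes"),  # D11
    ("overcast", "mild", "high",   "strong", "yes"),  # D12
    ("overcast", "hot",  "normal", "weak",   "yes"),  # D13
    ("rain",     "mild", "high",   "strong", "no"),   # D14
]
tennis = [dict(zip(features, r[:4])) for r in rows]
play = [r[4] for r in rows]

print(round(information_gain(tennis, play, "wind"), 3))  # 0.048, as above
```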
Choosing the most informative feature
At the root node, the information gains are:
IG (S, wind) = 0.048 (we already saw)
IG (S, outlook) = 0.246
IG (S, humidity) = 0.151
IG (S, temperature) = 0.029
“outlook” has the maximum IG =⇒ chosen as the root node
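Running the same check for all four features at the root (reusing the earlier sketches and the assumed table) reproduces this ranking:

```python
for f in features:
    print(f, round(information_gain(tennis, play, f), 3))
# outlook has the largest gain (~0.25), then humidity (~0.15),
# wind (~0.05) and temperature (~0.03) -- matching the slide up to rounding
```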

Growing The Tree

How to decide which feature to test next?


Rule: Iterate - for each child node, select the feature with the highest IG

For level-2, left node: S = [2+, 3−] (days 1,2,8,9,11)


Compute the Information Gain for each feature (except outlook)
The feature with the highest Information Gain should be chosen for this node

Growing The Tree

For this node (S = [2+, 3−]), the IG for the feature temperature:
$IG(S, \text{temperature}) = H(S) - \sum_{v \in \{\text{hot, mild, cool}\}} \frac{|S_v|}{|S|}\, H(S_v)$

S = [2+, 3−] =⇒ H(S) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
S_hot = [0+, 2−] =⇒ H(S_hot) = −0 · log2(0) − (2/2) log2(2/2) = 0
S_mild = [1+, 1−] =⇒ H(S_mild) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1
S_cool = [1+, 0−] =⇒ H(S_cool) = −(1/1) log2(1/1) − (0/1) log2(0/1) = 0
IG(S, temperature) = 0.971 − (2/5) · 0 − (2/5) · 1 − (1/5) · 0 = 0.570
Likewise we can compute: IG(S, humidity) = 0.970, IG(S, wind) = 0.019
Therefore, we choose “humidity” (with highest IG = 0.970) for the level-2 left node
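The same level-2 computation can be reproduced by restricting the (assumed) table from earlier to the five sunny days:

```python
sunny = [x for x in tennis if x["outlook"] == "sunny"]
sunny_play = [y for x, y in zip(tennis, play) if x["outlook"] == "sunny"]
for f in ("temperature", "humidity", "wind"):
    print(f, round(information_gain(sunny, sunny_play, f), 3))
# temperature ~0.57, humidity ~0.97, wind ~0.02 (matching the slide up to rounding),
# so "humidity" is chosen for this node
```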
Growing The Tree

Level-2, middle node: no need to grow (already a leaf)


Level-2, right node: repeat the same exercise!
Compute IG for each feature (except outlook)
Exercise: Verify that wind has the highest IG for this node
Level-2 expansion gives us the following tree:



Growing The Tree: Stopping Criteria

Stop expanding a node further when:

It consists of examples all having the same label (the node becomes “pure”)
Or we run out of features to test!

Decision Tree Learning Algorithm
A recursive algorithm:
DT(Examples, Labels, Features):
If all examples are positive, return a single node tree Root with label = +
If all examples are negative, return a single node tree Root with label = -
If all features exhausted, return a single node tree Root with majority label
Otherwise, let F be the feature having the highest information gain
Root ← F
For each possible value f of F
Add a tree branch below Root corresponding to the test F = f
Let Examples_f be the set of examples with feature F having value f
Let Labels_f be the corresponding labels
If Examples_f is empty, add a leaf node below this branch with label = most common label in Examples
Otherwise, add the following subtree below this branch:
DT(Examples_f, Labels_f, Features − {F})
Note: Features − {F} removes feature F from the feature set Features
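A compact Python rendering of this recursion, building on the earlier sketches (the nested-dict tree representation is my own choice; unlike the pseudocode, branches are only created for feature values seen in the data, and unseen values fall back to the node's majority label at prediction time):

```python
def majority(labels):
    """Most common label (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(examples, labels, feature_names):
    # All examples carry the same label -> a pure leaf (covers the +/- cases above)
    if len(set(labels)) == 1:
        return labels[0]
    # All features exhausted -> leaf with the majority label
    if not feature_names:
        return majority(labels)
    # Otherwise split on the feature with the highest information gain
    best = max(feature_names, key=lambda f: information_gain(examples, labels, f))
    node = {"feature": best, "majority": majority(labels), "branches": {}}
    remaining = [f for f in feature_names if f != best]
    for value in set(x[best] for x in examples):
        sub = [(x, y) for x, y in zip(examples, labels) if x[best] == value]
        node["branches"][value] = build_tree([x for x, _ in sub],
                                             [y for _, y in sub], remaining)
    return node

def predict(tree, x):
    """Follow the feature tests from the root down to a leaf (a plain label)."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(x[tree["feature"]], tree["majority"])
    return tree

tree = build_tree(tennis, play, features)
print(tree["feature"])                    # 'outlook', as derived on the earlier slides
print(predict(tree, tennis[0]), play[0])  # 'no' 'no'
```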


Overfitting in Decision Trees

Overfitting Illustration

High training accuracy doesn’t necessarily imply high test accuracy

Avoiding Overfitting: Decision Tree Pruning

Desired: a DT that is not too big in size, yet fits the training data reasonably well

Mainly two approaches


Prune while building the tree (stopping early)
Prune after building the tree (post-pruning)

Criteria for judging which nodes could potentially be pruned


Use a validation set (separate from the training set)
Prune each possible node that doesn’t hurt the accuracy on the validation set (see the sketch after this list)
Greedily remove the node that improves the validation accuracy the most
Stop when the validation set accuracy starts worsening

Minimum Description Length (MDL): more details when we cover Model Selection
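One way to realize the validation-set criterion is bottom-up reduced-error pruning; the sketch below builds on the earlier build_tree/predict sketches and is only one reasonable variant of the greedy procedure described above, not necessarily the course's exact recipe. It collapses a subtree into its majority-label leaf whenever that does not hurt accuracy on the validation examples reaching that node:

```python
def prune(tree, val_x, val_y):
    """Reduced-error post-pruning of the nested-dict trees built earlier."""
    if not isinstance(tree, dict) or not val_x:
        return tree  # a leaf, or no validation data reaches this node
    # First prune the children, routing the validation examples down the branches
    for value, child in list(tree["branches"].items()):
        sub = [(x, y) for x, y in zip(val_x, val_y) if x[tree["feature"]] == value]
        tree["branches"][value] = prune(child, [x for x, _ in sub], [y for _, y in sub])
    # Keep the subtree only if it beats a single majority-label leaf on validation data
    keep = sum(predict(tree, x) == y for x, y in zip(val_x, val_y))
    collapse = sum(tree["majority"] == y for y in val_y)
    return tree["majority"] if collapse >= keep else tree
```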

Decision Tree Extensions

Real-valued features can be dealt with using thresholding (see the sketch below)

Real-valued labels (Regression Trees [2]) by re-defining entropy or using other criteria (how similar to each other are the y’s at any node)

Other criteria for judging feature informativeness


Gini-index, misclassification rate

More sophisticated decision rules at the internal nodes (anything that splits the data into
homogeneous groups; e.g., a machine learning classifier)

Handling features with differing costs

[2] Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees
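For the thresholding of real-valued features mentioned above, a common recipe (sketched here as an illustration, reusing the information_gain function; the helper name best_threshold is mine) is to sort the values, try the midpoints between consecutive distinct values as candidate thresholds, and keep the one whose induced binary split has the highest IG:

```python
def best_threshold(values, labels):
    """Return (gain, threshold) for the best binary split value <= t vs. value > t."""
    pairs = sorted(zip(values, labels))
    ys = [y for _, y in pairs]
    best_gain, best_t = -1.0, None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold strictly between equal values
        t = (v1 + v2) / 2.0
        split = [{"above": v > t} for v, _ in pairs]
        gain = information_gain(split, ys, "above")
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# e.g. a numeric temperature column paired with play/no-play labels
print(best_threshold([64, 65, 68, 69, 71, 75], ["yes", "no", "yes", "yes", "no", "no"]))
```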
Some Aspects about Decision Trees
Some key strengths:
Simple and easy to interpret
Do not make any assumptions about the distribution of the data
Easily handle different types of features (real, categorical/nominal, etc.)
Very fast at test time (just need to check the features, starting at the root node and following the DT until you reach a leaf node)
Multiple DTs can be combined via ensemble methods (e.g., Decision Forest)
Each DT can be constructed using a (random) small subset of features

Some key weaknesses:


Learning the optimal DT is NP-Complete. The existing algorithms are heuristics (e.g., greedy
selection of features)
Can be unstable if some labeled examples are noisy
Can sometimes become very complex unless some pruning is applied
Next Class:
Learning as Optimization
