Decision Trees
You have some past experiences (data) about restaurants: what kind of food they serve, how busy they are, their prices, and whether you ended up waiting or not.
Now you want to predict: will I have to wait if I go to a new restaurant with similar features?
The entropy of the target variable is
Entropy(V) = − Σ_k P(v_k) · log2 P(v_k)
Where:
V is the set of possible classes for the target variable (in this case, WillWait = {Yes, No}).
P(v_k) is the probability of class v_k in the dataset.
Step-by-step:
1. Count the occurrences of each class:
From the dataset of 12 samples:
Yes appears in: y1, y3, y4, y6, y8, y12 → 6 times
No appears in: y2, y5, y7, y9, y10, y11 → 6 times
So,
P(Yes) = 6/12 = 0.5
P(No) = 6/12 = 0.5
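This counting step is easy to verify in code. A minimal Python sketch, assuming the 12 WillWait labels are stored in a plain list (variable names are illustrative):

```python
from collections import Counter

# WillWait labels for y1..y12 (Yes for y1, y3, y4, y6, y8, y12)
labels = ["Yes", "No", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "No", "No", "No", "Yes"]

counts = Counter(labels)                          # Counter({'Yes': 6, 'No': 6})
probs = {c: n / len(labels) for c, n in counts.items()}
print(probs)                                      # {'Yes': 0.5, 'No': 0.5}
```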
2. Compute the entropy of the whole dataset:
B = −(0.5 · log2(0.5) + 0.5 · log2(0.5)) = 1.0
3. Split by Pat and compute the entropy of each group:
Group 1: Pat = None
p1 = 0, n1 = 2
B = 0 (pure group: all No)
Weight: 2/12 ≈ 0.167
Contribution: 0.167 · 0 = 0
Group 2: Pat = Some
p2 = 4, n2 = 0
B = 0 (pure group: all Yes)
Weight: 4/12 ≈ 0.333
Contribution: 0.333 · 0 = 0
Group 3: Pat = Full
p3 = 2, n3 = 4
B = −(2/6 · log2(2/6) + 4/6 · log2(4/6)) ≈ −[0.333 · (−1.585) + 0.667 · (−0.585)] ≈ 0.918
Weight: 6/12 = 0.5
Contribution: 0.5 · 0.918 ≈ 0.459
Step 4: Remainder(Pat) = 0 + 0 + 0.459 = 0.459
Step 5: Gain(Pat)
Gain(Pat) = 1.0 − 0.459 = 0.541
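The same remainder-and-gain arithmetic can be checked in code. A small sketch, using the per-value (positive, negative) counts for Pat from the groups above (function and variable names are illustrative):

```python
from math import log2

def boolean_entropy(p, n):
    """Entropy (in bits) of a node with p positive and n negative examples."""
    if p == 0 or n == 0:
        return 0.0                               # a pure node has zero entropy
    q = p / (p + n)
    return -(q * log2(q) + (1 - q) * log2(1 - q))

# (positives, negatives) for each value of Pat, as in the groups above
pat_groups = {"None": (0, 2), "Some": (4, 0), "Full": (2, 4)}
total = sum(p + n for p, n in pat_groups.values())          # 12 examples

remainder = sum((p + n) / total * boolean_entropy(p, n)
                for p, n in pat_groups.values())
gain = boolean_entropy(6, 6) - remainder                    # overall entropy is 1.0
print(round(remainder, 3), round(gain, 3))                  # 0.459 0.541
```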
We have:
Entropy = −(6/12 · log2(6/12) + 6/12 · log2(6/12)) = −2 · (1/2) · log2(1/2) = 1.0
Let’s show the results of calculating Information Gain (IG) for each
attribute:
Feature    Information Gain
Pat        0.541
Price      0.196
➡ So, Pat gives the highest information gain = 0.541, making it the root node.
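Choosing the root is then just taking the attribute with the largest gain. A tiny sketch using only the two gains reported above (the gains of the other attributes are not listed in this section, so they are omitted):

```python
# Information gain per attribute, as reported in the comparison above
gains = {"Pat": 0.541, "Price": 0.196}

root = max(gains, key=gains.get)
print(root)    # 'Pat' -- the attribute with the highest gain becomes the root
```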
Interpretation:
Splitting on Pat produces the purest groups (Some is all Yes, None is all No), so the tree tests Pat first.
Example 1: Pat
Step-by-Step Formula:
IG(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v)
Where:
S is the full set of examples and A is the attribute being tested.
S_v is the subset of examples for which A takes the value v.
From data:
Pat     WillWait
Some    Yes
Full    No
Some    Yes
Full    Yes
Full    No
Some    Yes
None    No
Some    Yes
Full    No
Full    No
None    No
Full    Yes
Counts:
None: 0 Yes, 2 No → Entropy = 0
Some: 4 Yes, 0 No → Entropy = 0
Full: 2 Yes, 4 No → Entropy ≈ 0.9183
Info Gain:
IG(S, Pat) = 1.0 − (2/12 · 0 + 4/12 · 0 + 6/12 · 0.9183) = 1.0 − 0.459 = 0.541
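The same 0.541 can be reproduced directly from the (Pat, WillWait) rows above. A short sketch, assuming the table is stored as a list of tuples (names are illustrative):

```python
from collections import Counter
from math import log2

# (Pat, WillWait) rows from the table above
rows = [("Some", "Yes"), ("Full", "No"), ("Some", "Yes"), ("Full", "Yes"),
        ("Full", "No"), ("Some", "Yes"), ("None", "No"), ("Some", "Yes"),
        ("Full", "No"), ("Full", "No"), ("None", "No"), ("Full", "Yes")]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

target = [w for _, w in rows]
gain = entropy(target)                      # 1.0 for the 6/6 split
for value in {v for v, _ in rows}:
    subset = [w for v, w in rows if v == value]
    gain -= len(subset) / len(rows) * entropy(subset)

print(round(gain, 3))                       # 0.541
```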
Example 2: Price
From the data, Price splits the 12 examples into subsets of sizes 6, 2, and 4, with subset entropies 1.0, 0, and 0.8113.
Info Gain:
IG(S, Price) = 1.0 − (6/12 · 1.0 + 2/12 · 0 + 4/12 · 0.8113) = 1.0 − (0.5 + 0 + 0.2704) = 0.2296
(Note: We got ~0.1957 earlier due to rounding/float precision)
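As a quick numerical check of this Price calculation, here is a sketch; the per-subset class splits (3, 3), (2, 0) and (3, 1) are inferred from the stated subset entropies 1.0, 0 and 0.8113, not given explicitly in the data:

```python
from math import log2

def boolean_entropy(p, n):
    if p == 0 or n == 0:
        return 0.0
    q = p / (p + n)
    return -(q * log2(q) + (1 - q) * log2(1 - q))

# Price subsets of sizes 6, 2 and 4 with the class splits implied above
price_subsets = [(3, 3), (2, 0), (3, 1)]
remainder = sum((p + n) / 12 * boolean_entropy(p, n) for p, n in price_subsets)
print(round(1.0 - remainder, 4))    # 0.2296
```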
Summary:
Feature    Information Gain
Pat        0.541
Price      0.2296
This equation is a statistical test used to measure how much the actual class distributions in the subsets deviate from what we'd expect if the feature were irrelevant. It is a χ² (chi-square) statistic of the kind used in decision tree learning, for example for χ²-based pruning or for split selection in algorithms such as CHAID, to decide whether a feature is useful for splitting.
Notation Breakdown
p: Total number of positive examples (e.g., WillWait = Yes)
n: Total number of negative examples (WillWait = No)
d: Number of distinct values the attribute takes (number of subsets after a split)
For subset k:
p_k: Actual number of positive examples in subset k
n_k: Actual number of negative examples in subset k
p̂_k, n̂_k: Expected positives and negatives if the attribute were irrelevant, i.e.
p̂_k = p · (p_k + n_k) / (p + n) and n̂_k = n · (p_k + n_k) / (p + n)
Δ = Σ_{k=1}^{d} [ (p_k − p̂_k)² / p̂_k + (n_k − n̂_k)² / n̂_k ]
This tells you how far each subset's actual class distribution is from the expected one.
Higher Δ → more likely the attribute is important.
Example (Simplified):
Let’s say you have 12 examples:
Total p = 6, n = 6
Split by Pat (3 values: None, Some, Full):
Actual counts:
Pat     Total   p_k (Yes)   n_k (No)
None    2       0           2
Some    4       4           0
Full    6       2           4
Expected counts (if Pat were irrelevant):
Pat     Total   p̂_k         n̂_k
None    2       1           1
Some    4       2           2
Full    6       3           3
Now compute Δ:
Δ = Σ_k [ (p_k − p̂_k)² / p̂_k + (n_k − n̂_k)² / n̂_k ]
For None:
(0 − 1)² / 1 + (2 − 1)² / 1 = 1 + 1 = 2
For Some:
(4 − 2)² / 2 + (0 − 2)² / 2 = 4/2 + 4/2 = 4
For Full:
(2 − 3)² / 3 + (4 − 3)² / 3 = 1/3 + 1/3 = 2/3
Total Δ = 2 + 4 + 2/3 ≈ 6.67
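This Δ computation is easy to reproduce. A short sketch, assuming the actual counts from the table above and deriving the expected counts as p̂_k = p · (p_k + n_k) / (p + n) (variable names are illustrative):

```python
# actual (positives, negatives) per value of Pat
actual = {"None": (0, 2), "Some": (4, 0), "Full": (2, 4)}
p = sum(pk for pk, _ in actual.values())   # 6
n = sum(nk for _, nk in actual.values())   # 6

delta = 0.0
for pk, nk in actual.values():
    size = pk + nk
    pk_hat = p * size / (p + n)            # expected positives if Pat were irrelevant
    nk_hat = n * size / (p + n)            # expected negatives
    delta += (pk - pk_hat) ** 2 / pk_hat + (nk - nk_hat) ** 2 / nk_hat

print(round(delta, 2))                     # 6.67
```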
Interpretation:
High Δ (like 6.67) → the actual counts deviate significantly from the expected counts → the attribute is useful.
Low Δ (≈ 0) → the class distribution looks random → the attribute is not useful.
This statistic can be used to perform a hypothesis test (comparing Δ against a χ² distribution with d − 1 degrees of freedom) or as an alternative to Information Gain when selecting attributes.
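For the hypothesis-test reading, Δ can be compared against a χ² distribution. A sketch using scipy, assuming d − 1 = 2 degrees of freedom for the three values of Pat (the 0.05 threshold is just the conventional choice):

```python
from scipy.stats import chi2

delta = 6.67
dof = 3 - 1                     # d - 1 degrees of freedom for d = 3 attribute values
p_value = chi2.sf(delta, dof)   # probability of a deviation this large if Pat were irrelevant
print(round(p_value, 3))        # ~0.036 -> below 0.05, so Pat looks genuinely informative
```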