Decision Trees

The document discusses decision-making in restaurant selection using past experiences and data analysis techniques like entropy, information gain, and remainder. It explains how to calculate entropy to measure uncertainty in waiting times, and how to use information gain to determine the best attributes for predicting outcomes. The calculations demonstrate how to evaluate the effectiveness of various features, such as patron count and price, in reducing uncertainty when deciding whether to wait at a restaurant.


Imagine you're trying to decide where to eat with your friends.

You have some past experiences (data) about restaurants: what kind of food they serve, how busy they are, what they charge, and whether you ended up waiting for a table or not.
Now you want to predict: will I have to wait if I go to a new restaurant with similar features?

1. Entropy = Uncertainty or Confusion


Think of entropy like the amount of confusion you have when making a decision.
• If half the time you wait and half the time you don't, you're very uncertain → entropy is high (maximum = 1).
• If you almost always wait (or almost never do), entropy is low → you're more certain.
In short: Entropy tells you how mixed-up your outcomes are. The more mixed, the harder it is to
make decisions.

2. Information Gain = How much your confusion goes down


Let’s say you notice that if the restaurant is "Empty", you never wait, and if it's "Full", you
usually do.
This means the attribute Patron (number of people) is a great predictor of whether you'll wait.
 So, by looking at Patron, your confusion goes down a lot.
 This decrease in entropy is called Information Gain.
In short: Information Gain tells you how helpful a question (attribute) is in making your
decision easier.

3. Remainder = Leftover confusion after splitting


Even after splitting data by an attribute like Price, the groups might still be messy:
 In one price range, people waited sometimes and didn’t other times.
 That means the split didn't help much — the confusion (entropy) remains high in those
groups.
So, the Remainder is the weighted average of how confused you still are after the split.
In short: Remainder tells you how much uncertainty is still there after splitting the data by a
question.
To calculate the sample entropy of the output class WillWait from the given dataset, we use
the entropy formula:
H(V) = -\sum_{k} P(v_k) \log_2 P(v_k)

Where:
• V is the set of possible classes for the target variable (in this case, WillWait = {Yes, No}).
• P(v_k) is the probability of class v_k in the dataset.

Step-by-step:
1. Count the occurrences of each class:
From the dataset of 12 samples:
 Yes appears in: y1, y3, y4, y6, y8, y12 → 6 times
 No appears in: y2, y5, y7, y9, y10, y11 → 6 times

So,
• P(Yes) = 6/12 = 0.5
• P(No) = 6/12 = 0.5

2. Plug into the entropy formula:


H(V) = -[P(Yes) \cdot \log_2 P(Yes) + P(No) \cdot \log_2 P(No)]

H(V) = -[0.5 \cdot \log_2 0.5 + 0.5 \cdot \log_2 0.5]

H(V) = -[0.5 \cdot (-1) + 0.5 \cdot (-1)] = -[-0.5 - 0.5] = 1.0

Sample Entropy = 1.0 bits


This indicates maximum uncertainty, which makes sense because the class labels are evenly split
between Yes and No.
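As a quick sanity check, here is a minimal Python sketch (not part of the original handout) that computes the same entropy directly from the class counts:

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 6 "Yes" and 6 "No" examples, as in the WillWait dataset above
print(entropy([6, 6]))  # -> 1.0 (maximum uncertainty)
```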
1. Entropy (before split):

B\left(\frac{p}{p+n}\right) = -\left(\frac{p}{p+n}\log_2\frac{p}{p+n} + \frac{n}{p+n}\log_2\frac{n}{p+n}\right)

2. Remainder after splitting on attribute A (i.e., splitting the data by the values of A):

\mathrm{Remainder}(A) = \sum_{k=1}^{d} \frac{p_k + n_k}{p + n}\, B\!\left(\frac{p_k}{p_k + n_k}\right)

3. Information Gain:

\mathrm{Gain}(A) = B\!\left(\frac{p}{p+n}\right) - \mathrm{Remainder}(A)
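These three formulas translate almost line-for-line into code. The sketch below is an illustration (not from the document); it assumes each split is described by a list of (p_k, n_k) pairs, one per attribute value:

```python
import math

def B(q):
    """Entropy (bits) of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def remainder(splits, p, n):
    """Weighted entropy left after the split; splits = [(p_k, n_k), ...]."""
    return sum(((pk + nk) / (p + n)) * B(pk / (pk + nk)) for pk, nk in splits)

def gain(splits, p, n):
    """Information gain = entropy before the split minus the remainder."""
    return B(p / (p + n)) - remainder(splits, p, n)
```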


Example: Calculate Information Gain for attribute Pat (Patron)
Step 1: Get class counts from full dataset (12 examples)
From before:
 p = 6 (Yes), n = 6 (No)
 So total entropy:

B\left(\frac{6}{12}\right) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1.0

Step 2: Group by values of Pat and count Yes/No in each


Pat    Count   Yes   No
None   2       0     2
Some   4       3     1
Full   6       3     3

Step 3: Calculate Remainder(Pat)


Group 1: Pat = None
• p_1 = 0, n_1 = 2
• B = -[0 \cdot \log_2 0 + 1 \cdot \log_2 1] = 0 (taking 0 \cdot \log_2 0 = 0 by convention)
• Weight: 2/12 = 0.1667
• Contribution to remainder: 0.1667 \cdot 0 = 0

Group 2: Pat = Some
• p_2 = 3, n_2 = 1
• B = -\left(\frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4}\right) \approx -[0.75 \cdot (-0.415) + 0.25 \cdot (-2)] = 0.811
• Weight: 4/12 = 0.333
• Contribution: 0.333 \cdot 0.811 \approx 0.270

Group 3: Pat = Full
• p_3 = 3, n_3 = 3
• B = 1.0 (equal split)
• Weight: 6/12 = 0.5
• Contribution: 0.5 \cdot 1 = 0.5

Step 4: Total Remainder


\mathrm{Remainder}(Pat) = 0 + 0.270 + 0.5 = 0.770

Step 5: Gain(Pat)

\mathrm{Gain}(Pat) = 1.0 - 0.770 = 0.230

Information Gain for attribute Pat = 0.230 bits
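As a cross-check (again just a sketch), plugging the group counts above into Python reproduces the same gain:

```python
import math

def B(q):
    # Entropy (bits) of a Boolean variable with P(true) = q
    return 0.0 if q in (0.0, 1.0) else -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

# (p_k, n_k) for Pat = None, Some, Full -- counts from the table above
groups = [(0, 2), (3, 1), (3, 3)]
p, n = 6, 6

rem = sum(((pk + nk) / (p + n)) * B(pk / (pk + nk)) for pk, nk in groups)
print(round(B(p / (p + n)) - rem, 3))  # -> 0.23 (the 0.230 above, to rounding)
```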


Complete calculations for building the decision tree using Information Gain:

Step 1: Entropy of the Entire Dataset

We have:

 6 examples where WillWait = Yes

 6 examples where WillWait = No

\mathrm{Entropy} = -\left(\frac{6}{12}\log_2\frac{6}{12} + \frac{6}{12}\log_2\frac{6}{12}\right) = -2 \cdot \frac{1}{2} \cdot \log_2\frac{1}{2} = 1.0

Step 2: Information Gain for Each Feature

Let’s show the results of calculating Information Gain (IG) for each
attribute:

Feature                Information Gain
Pat (Patrons)          0.541
Hun (Hungry)           0.196
Price                  0.196
Est (Estimated Wait)   0.208
Fri, Rain, Res         0.021
Alt, Bar, Type         ~0.0

➡So, Pat gives the highest information gain = 0.541, making it the root
node.

Interpretation:

• Pat (Patrons) = best first split (largest reduction in entropy).
• After splitting on Pat, we would recursively apply the same method to the subsets to choose the next best feature (e.g., Hun, Price, etc.), as in the sketch below.
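A compact sketch of that recursive selection step is shown below. It is not from the original document: the dataset shape (a list of dicts with a 'WillWait' label) and the helper names are assumptions for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(examples, attr, target="WillWait"):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    labels = [e[target] for e in examples]
    rem = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == value]
        rem += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - rem

def best_attribute(examples, attributes, target="WillWait"):
    """Pick the attribute with the highest information gain (the next split)."""
    return max(attributes, key=lambda a: info_gain(examples, a, target))
```

On the full 12-example table, best_attribute would pick Pat as the root; the same call is then repeated on each Pat subtree with the remaining attributes.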
Let’s walk through how to calculate Information Gain step-by-step using
the features Pat (Patrons) and Price as examples.

Step-by-Step Formula:

1. Entropy (S) of the target set WillWait:

\mathrm{Entropy}(S) = -\sum_i p_i \log_2 p_i

where p_i is the proportion of class i in set S.

2. Information Gain (IG):

IG(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \cdot \mathrm{Entropy}(S_v)

Where:

• A is the attribute (like Pat or Price)
• S_v is the subset of S for which attribute A = v

Example 1: Pat (Patrons)

Values of Pat: None, Some, Full

From data:

Pat    WillWait
Some   Yes
Full   No
Some   Yes
Full   Yes
Full   No
Some   Yes
None   No
Some   Yes
Full   No
Full   No
None   No
Full   Yes

Counts:

 None (2 samples): 0 Yes, 2 No → Entropy = 0

 Some (4 samples): 4 Yes, 0 No → Entropy = 0

 Full (6 samples): 2 Yes, 4 No

\mathrm{Entropy}_{Full} = -\left(\frac{2}{6}\log_2\frac{2}{6} + \frac{4}{6}\log_2\frac{4}{6}\right) = 0.9183

Now, total Info Gain for Pat:

IG(S, Pat) = 1.0 - \left(\frac{2}{12} \cdot 0 + \frac{4}{12} \cdot 0 + \frac{6}{12} \cdot 0.9183\right) = 1.0 - 0.4592 = 0.5408

Example 2: Price

Values: $, $$, $$$

From data:

 $ (6 samples): 3 Yes, 3 No → Entropy = 1.0

 $$ (2 samples): 2 Yes, 0 No → Entropy = 0

 $$$ (4 samples): 1 Yes, 3 No

\mathrm{Entropy}_{\$\$\$} = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{3}{4}\log_2\frac{3}{4}\right) \approx 0.8113

Info Gain:

IG(S, Price) = 1.0 - \left(\frac{6}{12} \cdot 1.0 + \frac{2}{12} \cdot 0 + \frac{4}{12} \cdot 0.8113\right) = 1.0 - (0.5 + 0 + 0.2704) = 0.2296
(Note: the earlier feature table lists ~0.196 for Price, which differs from the 0.2296 computed here; that figure was presumably obtained with slightly different counts or rounding.)
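Both gains can be reproduced in a few lines of Python. The (Yes, No) counts per value below are taken from the tallies above; the snippet itself is only an illustrative sketch:

```python
import math

def entropy(yes, no):
    """Entropy (bits) of a subset with `yes` positive and `no` negative examples."""
    total = yes + no
    return -sum((c / total) * math.log2(c / total) for c in (yes, no) if c > 0)

def info_gain(value_counts, p=6, n=6):
    """Entropy of the whole set minus the weighted entropy of each value's subset."""
    rem = sum(((y + m) / (p + n)) * entropy(y, m) for y, m in value_counts)
    return entropy(p, n) - rem

pat   = [(0, 2), (4, 0), (2, 4)]   # None, Some, Full
price = [(3, 3), (2, 0), (1, 3)]   # $, $$, $$$
print(round(info_gain(pat), 3), round(info_gain(price), 3))  # -> 0.541 0.23
```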

Summary:

Feature   Information Gain
Pat       0.5408 (best)
Price     0.2296
This equation is a statistical test used to measure how much the actual class distributions in the subsets deviate from what we would expect if the feature were irrelevant. It is a χ² (chi-square)-like statistic, used in some decision-tree methods (for example, χ²-based significance testing or pruning) to decide whether a feature is genuinely useful for splitting.

Notation Breakdown

• p: total number of positive examples (e.g., WillWait = Yes)
• n: total number of negative examples (WillWait = No)
• d: number of distinct values the attribute takes (the number of subsets after a split)
• For subset k:
  o p_k: actual number of positives
  o n_k: actual number of negatives
• \hat{p}_k, \hat{n}_k: expected positives and negatives if the attribute were irrelevant

Expected Counts (Under Irrelevance Assumption)


The assumption:
If the attribute is irrelevant, the class proportions in each subset should match the whole dataset.
So, for each subset with s_k = p_k + n_k examples:

• Expected positives: \hat{p}_k = p \times \frac{s_k}{p + n}
• Expected negatives: \hat{n}_k = n \times \frac{s_k}{p + n}

Deviation Score (Δ)


The total deviation is:

\Delta = \sum_{k=1}^{d} \left( \frac{(p_k - \hat{p}_k)^2}{\hat{p}_k} + \frac{(n_k - \hat{n}_k)^2}{\hat{n}_k} \right)
This tells you how far each subset's actual class distribution is from the expected one.
Higher Δ → more likely the attribute is important.
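A minimal Python sketch of this statistic (illustrative only; each subset is given as a (p_k, n_k) pair):

```python
def delta(subsets, p, n):
    """Chi-square-like deviation of the observed subset counts from the counts
    expected if the attribute were irrelevant; subsets = [(p_k, n_k), ...]."""
    total = 0.0
    for pk, nk in subsets:
        sk = pk + nk
        p_hat = p * sk / (p + n)   # expected positives in this subset
        n_hat = n * sk / (p + n)   # expected negatives in this subset
        total += (pk - p_hat) ** 2 / p_hat + (nk - n_hat) ** 2 / n_hat
    return total
```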

Example (Simplified):
Let’s say you have 12 examples:
 Total p = 6, n = 6
Split by Pat (3 values: None, Some, Full):

Pat    Subset Size   p_k (Yes)   n_k (No)
None   2             0           2
Some   4             4           0
Full   6             2           4

Now compute the expected values if Pat were irrelevant:

• \hat{p}_k = p × (p_k + n_k) / (p + n)
• \hat{n}_k = n × (p_k + n_k) / (p + n)

Pat    Subset Size   \hat{p}_k       \hat{n}_k
None   2             6 × 2/12 = 1    6 × 2/12 = 1
Some   4             2               2
Full   6             3               3

Now compute Δ:

\Delta = \sum_{k} \left( \frac{(p_k - \hat{p}_k)^2}{\hat{p}_k} + \frac{(n_k - \hat{n}_k)^2}{\hat{n}_k} \right)

For None:
\frac{(0-1)^2}{1} + \frac{(2-1)^2}{1} = 1 + 1 = 2

For Some:
\frac{(4-2)^2}{2} + \frac{(0-2)^2}{2} = \frac{4}{2} + \frac{4}{2} = 4

For Full:
\frac{(2-3)^2}{3} + \frac{(4-3)^2}{3} = \frac{1}{3} + \frac{1}{3} = \frac{2}{3}
Total Δ = 2 + 4 + 2/3 ≈ 6.67
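The same total falls out of a direct translation of those three terms (a quick check, not part of the document):

```python
# Observed (p_k, n_k) vs. expected (1, 1), (2, 2), (3, 3) from the tables above
delta = ((0 - 1) ** 2 / 1 + (2 - 1) ** 2 / 1) \
      + ((4 - 2) ** 2 / 2 + (0 - 2) ** 2 / 2) \
      + ((2 - 3) ** 2 / 3 + (4 - 3) ** 2 / 3)
print(round(delta, 2))  # -> 6.67
```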

Interpretation:
• High Δ (like 6.67) → the actual counts deviate significantly from the expected ones → the attribute is useful
• Low Δ (≈ 0) → the class distribution looks essentially random → the attribute is not useful
This statistic can be used to perform a hypothesis test or as an alternative to Information Gain
when selecting attributes.
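For the hypothesis-test reading, Δ can be compared against a χ² critical value. The sketch below assumes d − 1 = 2 degrees of freedom (three Pat values, Boolean target) and uses SciPy; neither assumption comes from the text above:

```python
from scipy.stats import chi2

delta = 6.67                    # deviation computed above for Pat
dof = 3 - 1                     # d - 1 degrees of freedom (assumed)
critical = chi2.ppf(0.95, dof)  # ~5.99 at the 5% significance level

# Delta exceeds the critical value, so the split on Pat is unlikely to be due to chance
print(delta > critical)  # -> True
```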
