02 DecisionTrees Done

A decision tree is a supervised learning algorithm used for classification and regression, which organizes data into a tree structure based on feature attributes. The ID3 algorithm is commonly used to select the best attributes for splitting data, utilizing information gain to measure the effectiveness of each attribute. The document also discusses the concept of entropy and how it relates to information gain in decision tree construction.

Decision Trees

A decision tree is a supervised learning algorithm that is used for classification and regression modeling.
Sample Dataset
• Columns denote features Xi
• Rows denote labeled instances
• Class label denotes whether a tennis game was played
Decision Tree
• A possible decision tree for the data:

• Each internal node: tests one attribute Xi
• Each branch from a node: selects one value for Xi
• Each leaf node: predicts Y
Decision Tree
• A possible decision tree for the data:

• What prediction would we make for
  <outlook=sunny, temperature=hot, humidity=high, wind=weak> ?

  NO
Decision Tree
• If features are continuous, internal nodes can
test the value of a feature against a threshold
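For instance, such a node stores a single threshold and routes an instance by comparing the feature value against it. A minimal Python sketch; the feature name and the threshold value are made up for illustration:

def internal_node_test(instance):
    # Internal node for a continuous feature Xi: compare its value to a threshold.
    # The threshold 3.5 is a hypothetical value, not from the slides.
    return "left_child" if instance["x1"] <= 3.5 else "right_child"

print(internal_node_test({"x1": 2.0}))  # left_child
print(internal_node_test({"x1": 7.2}))  # right_child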
Decision Tree Induced
Partition
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
  – or a probability distribution over labels

[Figure: decision boundary]
Another Example:
Restaurant Domain (Russell & Norvig)

Model a patron's decision of whether to wait for a table at a restaurant

~7,000 possible cases

A Decision Tree
from Introspection

Is this the best decision tree?


Preference bias: Ockham's Razor

Idea: the simplest consistent explanation is the best

• Therefore, the smallest decision tree that correctly classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard
• ...So instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small
Choosing the Best Attribute

Key problem: choosing which attribute to split a given set of examples
• Some possibilities are:
  – Random: select any attribute at random
  – Least-Values: choose the attribute with the smallest number of possible values
  – Most-Values: choose the attribute with the largest number of possible values
  – Max-Gain: choose the attribute that has the largest expected information gain, i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting the best attribute
Basic Algorithm for Top-Down Induction of Decision Trees
[ID3, C4.5 by Quinlan] (ID3 = Iterative Dichotomiser 3)

node = root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort the training examples to the leaf nodes.
5. If the training examples are perfectly classified, stop. Else, recurse over the new leaf nodes.

How do we choose which attribute is best?
The ID3 algorithm uses the Max-Gain method of selecting the best attribute.
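A minimal sketch of this loop in Python, assuming examples are dicts mapping attribute names to values and the tree is built as nested dicts; the gain argument is any attribute-scoring function (ID3 uses information gain, which is defined later in these slides). The names are illustrative, not from the slides.

from collections import Counter

def id3(examples, attributes, target, gain):
    """Top-down induction of a decision tree, returned as nested dicts."""
    labels = [ex[target] for ex in examples]
    # Stop when the examples at this node are perfectly classified...
    if len(set(labels)) == 1:
        return labels[0]
    # ...or when no attributes remain: predict the majority label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: pick the "best" decision attribute for this node (Max-Gain).
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    # Steps 3-5: one descendant per value of A; sort examples to it and recurse.
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target, gain)
    return tree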
Choosing an Attribute

Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

Which split is more informative: Patrons? or Type?
ID3-induced
Decision Tree
Compare the Two Decision Trees
Information Gain

Which test is more informative?


[Figure: two candidate tests: a split over whether Balance exceeds 50K (branches: less or equal 50K / over 50K), and a split over whether the applicant is employed (branches: Unemployed / Employed)]


Information Gain
Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples

Impurity

[Figure: three groups of examples, from a very impure group, to a less impure group, to minimum impurity]
Entropy: a common way to measure impurity

Entropy H(X) of a random variable X (summing over the possible values of X):

  H(X) = - Σ_i P(X = i) log2 P(X = i)

H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)

Why? Information theory:
• The most efficient code assigns -log2 P(X = i) bits to encode the message X = i
• So the expected number of bits to code one random X is Σ_i P(X = i) · (-log2 P(X = i)) = H(X)
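As a small numeric check of this definition (a Python sketch using only the standard library):

import math

def entropy(probabilities):
    # H(X) = -sum_i P(X=i) * log2 P(X=i); zero-probability values contribute 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0   -> a fair coin needs 1 bit per outcome
print(entropy([0.9, 0.1]))  # ~0.47 -> a biased coin needs fewer bits on average
print(entropy([1.0]))       # 0.0   -> a pure group has minimum impurity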
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.
From Entropy to Information Gain

Entropy H(X) of a random variable X:
  H(X) = - Σ_i P(X = i) log2 P(X = i)

Specific conditional entropy H(X|Y=v) of X given Y=v:
  H(X|Y=v) = - Σ_i P(X = i | Y = v) log2 P(X = i | Y = v)

Conditional entropy H(X|Y) of X given Y:
  H(X|Y) = Σ_v P(Y = v) H(X|Y=v)

Mutual information (Information Gain) of X and Y:
  I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
Information Gain

Information Gain is the mutual information between input attribute A and target variable Y.

Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A.
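These quantities can be estimated from a labeled sample with a few lines of Python (standard library only; the function names and the toy data are mine, not the slides'):

import math
from collections import Counter

def entropy_of(values):
    # H(X) estimated from an observed sample of X.
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    # H(X|Y) = sum_v P(Y=v) * H(X | Y=v)
    total = len(ys)
    return sum((count / total) * entropy_of([x for x, y in zip(xs, ys) if y == v])
               for v, count in Counter(ys).items())

def information_gain_xy(xs, ys):
    # I(X, Y) = H(X) - H(X|Y)
    return entropy_of(xs) - conditional_entropy(xs, ys)

# Toy sample: X = class label, Y = attribute value.
x = ["No", "No", "Yes", "Yes", "Yes", "No"]
y = ["Strong", "Strong", "Weak", "Weak", "Weak", "Strong"]
print(information_gain_xy(x, y))  # 1.0: the attribute fully determines the label here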
Calculating Information Gain

Information Gain = entropy(parent) - [average entropy(children)]

Entire population (30 instances), split into one child with 17 instances and one with 13 instances:

  parent entropy  = -(14/30) log2(14/30) - (16/30) log2(16/30) ≈ 0.996
  child 1 entropy = -(13/17) log2(13/17) - (4/17) log2(4/17)   ≈ 0.787   (17 instances)
  child 2 entropy = -(1/13) log2(1/13) - (12/13) log2(12/13)   ≈ 0.391   (13 instances)

  (Weighted) Average Entropy of Children = (17/30)(0.787) + (13/30)(0.391) ≈ 0.615

  Information Gain = 0.996 - 0.615 = 0.38
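A quick Python check of the arithmetic above (standard library only; the printed values differ from the slide figures only in the final rounding digit):

import math

def entropy_from_counts(counts):
    # Entropy of a class distribution given raw class counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

parent  = entropy_from_counts([14, 16])              # ~0.997
child1  = entropy_from_counts([13, 4])               # ~0.787  (17 instances)
child2  = entropy_from_counts([1, 12])               # ~0.391  (13 instances)
average = (17 / 30) * child1 + (13 / 30) * child2    # ~0.616
print("information gain:", parent - average)         # ~0.38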
Entropy-Based Automatic Decision Tree Construction

Training Set X:
  x1 = (f11, f12, …, f1m)
  x2 = (f21, f22, …, f2m)
  .
  .
  xn = (fn1, fn2, …, fnm)

At Node 1: what feature should be used? What values?

Quinlan suggested information gain, based on entropy, in his ID3 system.
Day  Outlook   Temp  Humidity  Wind    Play Tennis
 1   Sunny     Hot   High      Weak    No
 2   Sunny     Hot   High      Strong  No
 3   Overcast  Hot   High      Weak    Yes
 4   Rain      Mild  High      Weak    Yes
 5   Rain      Cool  Normal    Weak    Yes
 6   Rain      Cool  Normal    Strong  No
 7   Overcast  Cool  Normal    Strong  Yes
 8   Sunny     Mild  High      Weak    No
 9   Sunny     Cool  Normal    Weak    Yes
10   Rain      Mild  Normal    Weak    Yes
11   Sunny     Mild  Normal    Strong  Yes
12   Overcast  Mild  High      Strong  Yes
13   Overcast  Hot   Normal    Weak    Yes
14   Rain      Mild  High      Strong  No
Example

[Figure: the Play Tennis table (Days 1-14, as above) shown alongside a decision tree rooted at Outlook, with an Overcast branch predicting Yes, one branch testing Humidity (leaves No / Yes), and one branch testing Wind (leaves No / Yes)]
Example

[Figure: two candidate splits, Question 1 and Question 2, each with Yes / No branches]

Example

[Figure: two more candidate splits, Question 3 and Question 4, each with Yes / No branches]

Question 1 vs. Question 2:
  parent entropy E = 1 for both
  Question 1 children: E = 0.97 and E = 0.92
  Question 2 children: E = 0.72 and E = 0

Information Gain = E(parent) - Σ_k (n_k / n) · E(child k), so Question 2 is the more informative split.


(The Play Tennis table from above, Days 1-14, is repeated here as a reference for the gain calculations that follow.)
E(S) = 0.940 for the full Play Tennis sample (9 Yes, 5 No)

Splitting on Wind: children entropies E = 0.811 (Weak) and E = 1 (Strong)
  G(S, Wind) = 0.048
Splitting on Humidity: children entropies E = 0.985 (High) and E = 0.592 (Normal)
  G(S, Humidity) = 0.151
Splitting on Temp: children entropies E = 1 (Hot), E = 0.92 (Mild), E = 0.81 (Cool)
  G(S, Temp) = 0.029
Splitting on Outlook: children entropies E = 0.971 (Sunny), E = 0 (Overcast), E = 0.971 (Rain)
  G(S, Outlook) = 0.246

Outlook has the largest gain, so it is chosen as the root of the tree.
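These gains can be reproduced from the Play Tennis table with a short Python script (standard library only; differences in the last printed digit versus the slide figures are rounding):

import math
from collections import Counter

def entropy(values):
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def information_gain(examples, attribute, target):
    # Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    total = len(examples)
    remainder = sum(
        (count / total) * entropy([ex[target] for ex in examples if ex[attribute] == value])
        for value, count in Counter(ex[attribute] for ex in examples).items())
    return entropy([ex[target] for ex in examples]) - remainder

names = ["Outlook", "Temp", "Humidity", "Wind", "PlayTennis"]  # PlayTennis is the class label
rows = [  # Days 1-14 of the table above
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
examples = [dict(zip(names, r)) for r in rows]

for attr in ["Outlook", "Temp", "Humidity", "Wind"]:
    print(attr, round(information_gain(examples, attr, "PlayTennis"), 3))
# prints: Outlook 0.247, Temp 0.029, Humidity 0.152, Wind 0.048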
The resulting decision tree:

  Outlook
    Sunny    → Humidity:  High → No,  Normal → Yes
    Overcast → Yes
    Rain     → Wind:  Strong → No,  Weak → Yes
Example

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content: The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Link to the data set: Pima Indians Diabetes Database (kaggle.com), to download the .csv file.
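A sketch of how this dataset could be used to grow a tree with scikit-learn; it assumes the downloaded file is named diabetes.csv with the usual Kaggle column layout (eight numeric predictors plus an Outcome column) and that pandas and scikit-learn are installed. None of this is prescribed by the slides.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed file name and column layout for the Kaggle Pima Indians Diabetes data.
data = pd.read_csv("diabetes.csv")
X = data.drop(columns=["Outcome"])
y = data["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# criterion="entropy" selects splits by information gain, as in ID3; since every
# feature here is continuous, each internal node tests a feature against a threshold.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))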
Example
Example
