Lecture 13 – Supervised Learning – Decision Trees
CS-06504
Data Mining
Supervised Learning
(Decision Trees – Ch. 8 by Han and Kamber)
Supervised Learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). (Wikipedia)

• Training data includes both the input and the desired results.
• For some examples the correct results (targets) are known and are given as input to the model during the learning process.
• The construction of a proper training, validation and test set is crucial.
• These methods are usually fast and accurate.
• The model has to be able to generalize: give the correct results when new data are given as input, without knowing the target a priori.
Supervised Learning – Detailed Definition

In supervised learning we have input variables (X) and an output variable (Y), and we use an algorithm to learn the mapping function from the input to the output: Y = f(X).

The goal is to approximate the mapping function so well that when you have new input data (X), you can predict the output variable (Y) for that data.

It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
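A minimal sketch of the Y = f(X) idea in Python, assuming scikit-learn is available; the data and model choice are my own illustration, not from the lecture:

from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]        # input variables (X)
Y = [2, 4, 6, 8]                # output variable (Y); here the true f(x) = 2x

model = LinearRegression()
model.fit(X, Y)                 # "training": approximate f from labeled pairs
print(model.predict([[5]]))    # ~[10.0]: predicted output for new input data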
Supervised Learning
Supervised learning problems can be further grouped
into regression and classification problems.
Classification: A classification problem is when the output
variable is a category, such as “red” or “blue” or “disease”
and “no disease”.
Regression: A regression problem is when the output
variable is a real value, such as “dollars” or “weight”.
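A hedged sketch contrasting the two problem types, again assuming scikit-learn; the feature values and targets are made up for illustration:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[20], [25], [30], [35]]                  # a single numeric input feature

# Classification: the output variable is a category
clf = DecisionTreeClassifier().fit(X, ["no disease", "no disease", "disease", "disease"])
print(clf.predict([[32]]))                    # -> a category label

# Regression: the output variable is a real value
reg = DecisionTreeRegressor().fit(X, [100.0, 150.0, 200.0, 250.0])
print(reg.predict([[32]]))                    # -> a real value (e.g., dollars)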
Catching Tax Evasion

[Table: records with attributes Tid, Refund, Marital Status, Taxable Income and the class label Cheat]
What is Classification? (cont.)

The target function f, which maps each attribute set to a class label, is known as a classification model.
Examples of Classification Tasks
Predicting tumor cells as benign or malignant
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a learning algorithm is applied to the training set to learn a model.

Test Set (apply model):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is applied to the test data to predict the class and to validate the model.
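A hedged sketch of this induction/deduction loop using the same records, assuming scikit-learn; the numeric encoding of the categorical attributes is my own addition:

from sklearn.tree import DecisionTreeClassifier

def encode(attrib1, attrib2, attrib3):
    # Map categorical values to numbers; income stays numeric (in K).
    size = {"Small": 0, "Medium": 1, "Large": 2}
    return [1 if attrib1 == "Yes" else 0, size[attrib2], attrib3]

train = [("Yes","Large",125,"No"), ("No","Medium",100,"No"), ("No","Small",70,"No"),
         ("Yes","Medium",120,"No"), ("No","Large",95,"Yes"), ("No","Medium",60,"No"),
         ("Yes","Large",220,"No"), ("No","Small",85,"Yes"), ("No","Medium",75,"No"),
         ("No","Small",90,"Yes")]

X = [encode(a1, a2, a3) for a1, a2, a3, _ in train]   # Induction: learn model
y = [label for *_, label in train]
model = DecisionTreeClassifier().fit(X, y)

test = [("No","Small",55), ("Yes","Medium",80), ("Yes","Large",110),
        ("No","Small",95), ("No","Large",67)]
print(model.predict([encode(*r) for r in test]))      # Deduction: apply model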
Confusion Matrix

                          Predicted Class
                          Class = 1    Class = 0
Actual Class   Class = 1  f11          f10
               Class = 0  f01          f00

(fij = number of records of actual class i predicted as class j)
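A small sketch (my addition, not from the slides) of the standard metric read off this matrix: accuracy is the fraction of records on the diagonal, where actual and predicted class agree. The counts below are made up.

def accuracy(f11, f10, f01, f00):
    # f11 and f00 are the correctly classified records (the diagonal).
    return (f11 + f00) / (f11 + f10 + f01 + f00)

print(accuracy(f11=40, f10=10, f01=5, f00=45))  # 0.85 on these made-up counts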
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory-based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Decision Trees
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class
distribution
Example of a Decision Tree

Training data: records with attributes Tid, Refund (categorical), Marital Status (categorical), Taxable Income (continuous), and the class label Cheat.

Model: a decision tree whose splitting attributes are Refund, MarSt and TaxInc:

Refund?
  Yes -> NO
  No  -> MarSt?
          Single, Divorced -> TaxInc?
                                <= 80K -> NO
                                >  80K -> YES
          Married -> NO
Another Example of a Decision Tree

The same training data (Tid, Refund (categorical), Marital Status (categorical), Taxable Income (continuous), class label Cheat) can also be fit by a tree that splits first on MarSt, with branches Married vs. Single, Divorced. More than one tree can fit the same data.
Decision Tree Classification Task

Training Set (excerpt):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

A tree induction algorithm learns a decision tree from the training set; the tree then serves as the model that is applied to new records (e.g., Tid 11: No, Small, 55K, ?).
Apply Model to Test Data

New data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Model (decision tree):
Refund?
  Yes -> NO
  No  -> MarSt?
          Single, Divorced -> TaxInc?
                                <= 80K -> NO
                                >  80K -> YES
          Married -> NO

Walk the record down the tree, taking the branch that matches each test: Refund = No leads to the MarSt test, and MarSt = Married leads to the leaf NO. Assign Cheat to "No".
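A minimal sketch (my own rendering, not the lecture's code) of the tree above as nested conditionals, applied to the test record:

def classify(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"                      # leaf: NO
    if marital_status == "Married":
        return "No"                      # leaf: NO
    # Single or Divorced: test the continuous attribute
    return "No" if taxable_income <= 80 else "Yes"

print(classify("No", "Married", 80))     # -> "No": assign Cheat = "No"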
Decision Tree Classification Task

Training Set (excerpt):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

Tree induction algorithm -> Decision Tree (the learned model).

Test Set (new data):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: apply the model to the test set.
Tree Induction

Finding the best decision tree is NP-hard. Practical algorithms therefore use a greedy strategy: split the records on the attribute test that optimizes a chosen criterion, then recurse on each partition (a skeleton follows the list below).

Many algorithms:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
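A minimal sketch of this greedy, recursive strategy, in the spirit of Hunt's algorithm. Everything here (the record format and the best_split placeholder) is my own illustration, not code from the lecture; ID3/C4.5 would implement best_split with the information-gain criterion defined later.

def best_split(records, attributes):
    # Placeholder criterion: pick an arbitrary attribute. A real algorithm
    # (e.g. ID3) would pick the attribute with the highest information gain.
    return next(iter(attributes))

def induce_tree(records, attributes):
    labels = [r["class"] for r in records]
    if len(set(labels)) == 1:                  # pure node: make a leaf
        return labels[0]
    if not attributes:                         # no tests left: majority leaf
        return max(set(labels), key=labels.count)
    attr = best_split(records, attributes)     # greedy: locally best test
    children = {}
    for value in {r[attr] for r in records}:   # one branch per test outcome
        subset = [r for r in records if r[attr] == value]
        children[value] = induce_tree(subset, attributes - {attr})
    return {"test": attr, "branches": children}

records = [{"refund": "Yes", "class": "No"}, {"refund": "No", "class": "Yes"}]
print(induce_tree(records, {"refund"}))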
ID3 Algorithm

ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan, used to generate a decision tree from a dataset. C4.5 is its successor.

These algorithms employ a top-down, greedy search through the space of possible decision trees.
Which Attribute is the Best Classifier?

The central choice in the ID3 algorithm is selecting which attribute should be tested at the root of the tree, and then at each node in the tree.
Splitting Criterion

Example:
Two classes, +/-
100 records overall (50 +s and 50 -s)
A and B are two binary attributes
Records with A=0: 48+, 2-
Records with A=1: 2+, 48-
Records with B=0: 26+, 24-
Records with B=1: 24+, 26-

Splitting on A is better than splitting on B: A does a good job of separating the +s and -s.
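A minimal sketch (my addition) that verifies this claim numerically, using the entropy and gain definitions introduced on the next slides:

from math import log2

def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            e -= p * log2(p)
    return e

parent = entropy(50, 50)                      # 1.0 bit (perfectly mixed)

# Split on A: children (A=0: 48+, 2-) and (A=1: 2+, 48-), 50 records each
gain_A = parent - (50/100 * entropy(48, 2) + 50/100 * entropy(2, 48))

# Split on B: children (B=0: 26+, 24-) and (B=1: 24+, 26-)
gain_B = parent - (50/100 * entropy(26, 24) + 50/100 * entropy(24, 26))

print(round(gain_A, 3), round(gain_B, 3))     # ~0.758 vs ~0.001: A wins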
Entropy (for two classes, i.e. m = 2)

Entropy(D) = -p+ log2(p+) - p- log2(p-)

where p+ and p- are the proportions of positive and negative tuples in D. Entropy is 0 when all tuples belong to one class, and is maximal (1 bit) when the two classes are equally frequent.
Which Attribute is the Best Classifier?
Information Gain

The expected information needed to classify a tuple in D is

Info(D) = -Σ (i = 1..m) p_i log2(p_i)

where p_i is the probability that an arbitrary tuple in D belongs to class C_i.
Information Gain

Gain of an attribute split: compare the impurity of the parent node with the weighted average impurity of the child nodes.

Information gain is the reduction in entropy achieved by partitioning the dataset on an attribute A:

Gain(A) = Info(D) - Info_A(D),  where  Info_A(D) = Σ (j = 1..v) (|D_j| / |D|) × Info(D_j)

The attribute with the highest information gain is chosen as the splitting attribute.
DECISION TREES
Example

[Worked example figures not reproduced: over several slides, the Info Gain of each attribute is computed on a small training set and the tree is grown step by step.]

We calculate the Info Gain for each attribute and select the attribute having the highest Info Gain.
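A hedged, self-contained sketch of that selection rule. The info_gain function follows the Gain(A) definition above; the four-record dataset (outlook, windy -> class) is my own toy illustration, not the example from the slides.

from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def info_gain(records, attr):
    labels = [r["class"] for r in records]
    gain = entropy(labels)                      # Info(D)
    for value in {r[attr] for r in records}:
        subset = [r["class"] for r in records if r[attr] == value]
        gain -= (len(subset) / len(records)) * entropy(subset)  # minus Info_A(D)
    return gain

data = [  # hypothetical training records
    {"outlook": "sunny", "windy": "no",  "class": "yes"},
    {"outlook": "sunny", "windy": "yes", "class": "no"},
    {"outlook": "rainy", "windy": "no",  "class": "yes"},
    {"outlook": "rainy", "windy": "yes", "class": "no"},
]

gains = {a: info_gain(data, a) for a in ("outlook", "windy")}
print(gains)                                    # windy: 1.0, outlook: 0.0
print("split on:", max(gains, key=gains.get))   # -> windy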
DECISION TREES
From Decision Trees to Rules

Next step: make rules from the decision tree, one rule per root-to-leaf path.
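For illustration (my reading of the Refund/MarSt/TaxInc tree shown earlier, not a slide from the deck), each root-to-leaf path becomes one IF-THEN rule:

IF Refund = Yes THEN Cheat = No
IF Refund = No AND MarSt = Married THEN Cheat = No
IF Refund = No AND MarSt IN {Single, Divorced} AND TaxInc <= 80K THEN Cheat = No
IF Refund = No AND MarSt IN {Single, Divorced} AND TaxInc > 80K THEN Cheat = Yes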