
Lecture 13

CS-06504
Data Mining
Supervised Learning
(Decision Trees – Ch # 8 by Han and
Kamber)
Supervised Learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). (Wikipedia)

• Training data includes both the input and the desired results.
• For some examples the correct results (targets) are known and are given as input to the model during the learning process.
• The construction of a proper training, validation and test set is crucial.
• These methods are usually fast and accurate.
• They have to be able to generalize: give the correct results when new data are given as input without knowing a priori the target.
2
Supervised Learning – Detailed Definition
In supervised learning we have input variables (X) and an output
variable (Y) and we use an algorithm to learn the mapping
function from the input to the output. Y = f(X)
The goal is to approximate the mapping function so well that, when you have new input data (X), you can predict the output variable (Y) for that data.
It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
3
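As a concrete illustration (not on the slide), here is a minimal sketch of this fit-then-predict cycle, assuming scikit-learn is available; the feature encoding and toy data below are made up for the example:

    # A minimal fit-then-predict cycle for supervised learning.
    from sklearn.tree import DecisionTreeClassifier

    # Labeled training pairs: X = [refund (1=Yes, 0=No), income in K], Y = Cheat label
    X_train = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95], [0, 60]]
    y_train = ["No", "No", "No", "No", "Yes", "No"]

    model = DecisionTreeClassifier()     # the mapping f: X -> Y to be learned
    model.fit(X_train, y_train)          # "teacher" phase: learn from the known answers

    print(model.predict([[0, 80]]))      # predict Y for a new, unseen input X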
Supervised Learning
Supervised learning problems can be further grouped
into regression and classification problems.
Classification: A classification problem is when the output
variable is a category, such as “red” or “blue” or “disease”
and “no disease”.
Regression: A regression problem is when the output
variable is a real value, such as “dollars” or “weight”.

4
Catching tax-evasion
Tax-return data for year 2011:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A new tax return for 2012: is this a cheating tax return?

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

An instance of the classification problem: learn a method for discriminating between records of different classes (cheaters vs non-cheaters).
5
What is classification?
Classification is the task of learning a target function f that
maps attribute set x to one of the predefined class labels y
In the tax-return data, one of the attributes is the class attribute: in this case, Cheat. Refund and Marital Status are categorical attributes, Taxable Income is continuous.
Two class labels (or classes): Yes (1), No (0).
6
What is classification (cont…)
The target function f is known as a classification model

Descriptive modeling: an explanatory tool to distinguish between objects of different classes (e.g., understand why people cheat on their taxes)

Predictive modeling: predict the class of a previously unseen record

7
Examples of Classification Tasks
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Categorizing news stories as finance, weather, sports, entertainment, etc.
Identifying spam email, spam web pages, adult content
Understanding whether a web query has commercial intent or not
8
General approach to classification
The training set consists of records with known class labels.
The training set is used to build a classification model.
A labeled test set of previously unseen data records is used to evaluate the quality of the model.
The classification model is applied to new records with unknown class labels.

9
Illustrating Classification Task
Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test / New Data Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The training set is fed to a learning algorithm, which learns a model (induction); the model is then applied (deduction) to the test data to validate it and to new records to classify them.

10

Evaluation of classification models
Counts of test records that are correctly (or incorrectly)
predicted by the classification model
Confusion matrix:

                     Predicted Class = 1    Predicted Class = 0
Actual Class = 1     f11                    f10
Actual Class = 0     f01                    f00

Accuracy   = # correct predictions / total # of predictions = (f11 + f00) / (f11 + f10 + f01 + f00)

Error rate = # wrong predictions / total # of predictions = (f10 + f01) / (f11 + f10 + f01 + f00)
11
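A small sketch (not on the slide) of these two measures computed from the four confusion-matrix counts; the counts used below are hypothetical:

    # Accuracy and error rate from a 2-class confusion matrix.
    # f_ij = count of records with actual class i predicted as class j (made-up numbers).
    f11, f10 = 40, 10    # actual class 1: predicted 1, predicted 0
    f01, f00 = 5, 45     # actual class 0: predicted 1, predicted 0

    total = f11 + f10 + f01 + f00
    accuracy = (f11 + f00) / total       # fraction of correctly predicted records
    error_rate = (f10 + f01) / total     # fraction of wrongly predicted records

    print(accuracy, error_rate)          # -> 0.85 0.15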
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines

12
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines

13
Decision Trees
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class
distribution

14
Example of a Decision Tree
Training Data: the tax-return table from before (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class label).

Model: Decision Tree

Splitting attribute at the root: Refund
  Refund = Yes -> leaf NO
  Refund = No  -> test MarSt (Marital Status)
    MarSt = Married            -> leaf NO
    MarSt = Single or Divorced -> test TaxInc (Taxable Income)
      TaxInc <= 80K -> leaf NO
      TaxInc >  80K -> leaf YES

Each internal node tests a splitting attribute, each branch is a test outcome, and each leaf carries a class label.
15
Another Example of Decision Tree
For the same training data, a different tree also fits:

Root: MarSt (Marital Status)
  MarSt = Married            -> leaf NO
  MarSt = Single or Divorced -> test Refund
    Refund = Yes -> leaf NO
    Refund = No  -> test TaxInc
      TaxInc <= 80K -> leaf NO
      TaxInc >  80K -> leaf YES

There could be more than one tree that fits the same data!
16
Decision Tree Classification Task
Same flow as before, now with a decision tree learner: the training set (Tids 1-10, with attributes Attrib1-Attrib3 and known Class) is fed to the tree induction algorithm, which learns a decision tree model (induction); the model is then applied (deduction) to the test data / new data set (Tids 11-15, Class unknown) to validate it and to classify new records.
17
Apply Model to Test Data
New Data (class unknown): Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch matching the record at each test (slides 18-23 step through this traversal):
  Refund = No      -> go to the MarSt node
  MarSt = Married  -> reach the leaf labelled NO
  (the TaxInc test is never reached for this record)

Assign Cheat to "No".

18-23
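A sketch (not part of the slides) of this example tree written as plain if/else rules in Python, applied to the new 2012 record:

    # The example decision tree expressed as nested if/else tests.
    def classify(refund, marital_status, taxable_income):
        """Return the predicted Cheat label for one tax record."""
        if refund == "Yes":
            return "No"
        # Refund = No: test marital status next
        if marital_status == "Married":
            return "No"
        # Single or Divorced: test taxable income (in thousands)
        if taxable_income <= 80:
            return "No"
        return "Yes"

    # The new 2012 return: Refund = No, Married, 80K
    print(classify("No", "Married", 80))   # -> "No"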
Decision Tree Classification Task
The full pipeline once more: tree induction on the training set (Tids 1-10) learns the decision tree model; the model is then applied (deduction) to the test set / new data set (Tids 11-15, Class unknown).

24
Tree Induction
Finding the best decision tree is NP-hard

Greedy strategy:
Split the records based on the attribute test that optimizes a certain criterion.

Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT

25
ID3 Algorithm
ID3 (Iterative Dichotomiser 3) is an algorithm
invented by Ross Quinlan used to generate a decision
tree from a dataset.
C4.5 is its successor.
These algorithms employ a top-down, greedy search through the space of possible decision trees.

26
Which Attribute is the Best Classifier?
The central choice in the ID3 algorithm is selecting
which attribute should be tested at the root of the tree and
then at each node in the tree

We would like to select the attribute which is most useful for classifying examples.
For this we need a good quantitative measure.
For this purpose a statistical property, called information gain, is used.
27
Which Attribute is the Best Classifier?
Definition of Entropy
In order to define information gain precisely, we begin by
defining entropy
Entropy is a measure commonly used in information theory.
It measures the randomness, or disorder, of the information being processed in machine learning.
In other words, we can say that entropy is a machine learning metric that measures the unpredictability or impurity in the system.
Entropy characterizes the impurity of an arbitrary collection of examples.
28
Definition of Entropy
• When information is processed in the system, every piece of information has a specific value and can be used to draw conclusions from it.
• If it is easy to draw a valuable conclusion from a piece of information, entropy is lower; if entropy is higher, it is difficult to draw any conclusion from that piece of information.
• Entropy is a measure of disorder or uncertainty, and the goal of machine learning models (and of data scientists in general) is to reduce uncertainty.
29
Entropy
Entropy of a data set D is denoted by H(D).
The Ci are the possible classes.
pi = fraction of records from D that have class Ci.

H(D) = - Σi pi log2(pi)

The range of entropy is from 0 to log2(m), where m is the number of classes.
The maximum value is attained when all the classes have equal proportion.
30
Entropy Examples
Example:
10 records have class A
20 records have class B
30 records have class C
40 records have class D
Entropy = -[(.1 log2 .1) + (.2 log2 .2) + (.3 log2 .3) + (.4 log2 .4)]
Entropy = 1.846

31
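A quick numeric check of this example (a sketch, not part of the slides):

    import math

    def entropy(proportions):
        """H = -sum(p * log2(p)) over the class proportions."""
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    # 10, 20, 30, 40 records of classes A, B, C, D out of 100
    print(round(entropy([0.1, 0.2, 0.3, 0.4]), 3))   # -> 1.846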
Splitting Criterion
Example:
Two classes, +/-
100 records overall (50 +s and 50 -s)
A and B are two binary attributes
Records with A=0: 48+, 2-
Records with A=1: 2+, 48-
Records with B=0: 26+, 24-
Records with B=1: 24+, 26-
Splitting on A is better than splitting on B:
A does a good job of separating +s and -s
32
Entropy (for two classes, i.e. m=2)
For a two-class set with proportions p+ and p-:
H(D) = - p+ log2(p+) - p- log2(p-)

33
Which Attribute is the Best Classifier?
Information Gain
The expected information needed to classify a tuple in D is
Info(D) = - Σi pi log2(pi)

How much more information would we still need (after partitioning D on attribute A into subsets D1, ..., Dv) to arrive at an exact classification? This amount is measured by
InfoA(D) = Σj (|Dj| / |D|) * Info(Dj)

In general, we write Gain(D, A) = Info(D) - InfoA(D), where D is the collection of examples and A is an attribute.

34
Information Gain
Gain of an attribute split: compare the impurity of the parent node with the average impurity of the child nodes.
Information gain is the reduction in entropy obtained by partitioning the dataset on an attribute:

Gain(D, A) = Info(D) - InfoA(D)
(This is the information gain from A on D.)

Maximizing the gain is equivalent to minimizing the weighted average impurity measure of the child nodes.
35
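A short sketch (not on the slide) of this computation, checked against the A/B counts from the earlier splitting-criterion slide; the helper names are illustrative:

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    def gain(parent_counts, child_counts):
        """Gain = Info(parent) - weighted average Info of the children."""
        n = sum(parent_counts)
        weighted = sum(sum(c) / n * entropy(c) for c in child_counts)
        return entropy(parent_counts) - weighted

    # 50 + and 50 - records, split on A vs split on B (counts from slide 32)
    print(round(gain([50, 50], [[48, 2], [2, 48]]), 3))    # split on A -> ~0.758
    print(round(gain([50, 50], [[26, 24], [24, 26]]), 3))  # split on B -> ~0.001

Splitting on A yields a far larger gain, matching the earlier observation that A separates the +s and -s well.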
Information Gain

Low information gain  -> the children of the split still have high entropy
High information gain -> the children of the split have low entropy


36
DECISION TREES
Which Attribute is the Best Classifier?: Information
Gain

37
DECISION TREES
Example

The collection of examples has 9 positive values and 5 negative ones.
Entropy(D) = Entropy(S) = 0.940
Eight (6 positive and 2 negative) of these examples have the attribute value Wind = Weak.
Six (3 positive and 3 negative) of these examples have the attribute value Wind = Strong.

38
DECISION TREES
Example

The information gain obtained by separating the examples according to the attribute Wind is calculated as:

Gain(D, Wind) = Entropy(D) - (8/14) * Entropy(DWeak) - (6/14) * Entropy(DStrong)
              = 0.940 - (8/14) * 0.811 - (6/14) * 1.0
              ≈ 0.048

39
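A quick numeric check of this example (a sketch, not part of the slides):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    d_entropy = entropy([9, 5])                      # ~0.940
    weak, strong = entropy([6, 2]), entropy([3, 3])  # ~0.811 and 1.0
    gain_wind = d_entropy - (8 / 14) * weak - (6 / 14) * strong
    print(round(gain_wind, 3))                       # -> 0.048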
DECISION TREES
Example
We calculate the Info Gain for each attribute and
select the attribute having the highest Info Gain

40
DECISION TREES
Example

Which attribute should be selected as the first test?
"Outlook" provides the most information, so we put "Outlook" at the root node of the decision tree.
41
DECISION TREES

42
DECISION TREES
Example

The process of selecting a new attribute is now repeated for each (non-terminal) descendant node, this time using only the training examples associated with that node.
Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree. (A compact sketch of this recursive construction follows below.)
43
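A compact sketch (not from the slides) of this recursive, top-down construction; the record format (a list of dicts with a "Class" key) and the helper names are assumptions of this illustration:

    # Skeleton of ID3-style top-down, greedy tree induction (illustrative only).
    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(records, attr, target):
        parent = entropy([r[target] for r in records])
        weighted = 0.0
        for value in set(r[attr] for r in records):
            subset = [r[target] for r in records if r[attr] == value]
            weighted += len(subset) / len(records) * entropy(subset)
        return parent - weighted

    def id3(records, attributes, target="Class"):
        labels = [r[target] for r in records]
        if len(set(labels)) == 1 or not attributes:       # pure node or no attributes left
            return Counter(labels).most_common(1)[0][0]   # leaf: majority class label
        # Greedy step: test the attribute with the highest information gain
        best = max(attributes, key=lambda a: information_gain(records, a, target))
        rest = [a for a in attributes if a != best]       # each attribute used at most once per path
        tree = {best: {}}
        for value in set(r[best] for r in records):
            subset = [r for r in records if r[best] == value]
            tree[best][value] = id3(subset, rest, target)
        return tree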
DECISION TREES
Example

This process continues for each new leaf node until either:

1. Every attribute has already been included along this path through the tree, or
2. The training examples associated with a leaf node have zero entropy.

44
DECISION TREES
Example

45
DECISION TREES
From Decision Trees to Rules
Next Step: Make rules from the decision tree.

After building the decision tree, we trace each path from the root node to a leaf node, recording the test outcomes as antecedents and the leaf node classification as the consequent.
Simple way: one rule for each leaf.

For our example we have:
If the Outlook is Sunny and the Humidity is High then No
If the Outlook is Sunny and the Humidity is Normal then Yes
46
DECISION TREES FOR BOOLEAN FUNCTIONS
Example

Develop decision trees for the following Boolean functions.

(i) A ∧ B

A   B   A ∧ B
F   F   F
T   F   F
F   T   F
T   T   T

(ii) A  B    (iii) A XOR B

47
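For part (i), here is a sketch (an illustration, not an official solution) of one possible decision tree for A ∧ B, written as nested tests in Python:

    # One decision tree for A AND B: test A at the root, then B.
    def a_and_b(a: bool, b: bool) -> bool:
        if not a:          # A = F -> leaf F (B need not be tested)
            return False
        if not b:          # A = T, B = F -> leaf F
            return False
        return True        # A = T, B = T -> leaf T

    # Check the tree against the truth table above
    for a in (False, True):
        for b in (False, True):
            assert a_and_b(a, b) == (a and b)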
