09 Classification DecisionTree Concept Tool Tagged
Classification 1 (Introduction and Decision Tree)
Prepared by Raymond Wong
The examples used in Decision Tree are borrowed from LW Chan's notes
XLMiner screenshots captured by Qixu Chen
Presented by Raymond Wong
raywong@cse
COMP1942 1
Classification

Suppose there is a person.

Race   Income  Child  Insurance
white  high    no     ?

Decision tree:

root
  child=yes: 100% Yes, 0% No
  child=no:
    Income=high: 100% Yes, 0% No
    Income=low:  0% Yes, 100% No
Classification

Suppose there is a person (new set).

Race   Income  Child  Insurance
white  high    no     ?

Training set:

Race   Income  Child  Insurance
black  high    no     yes
white  high    yes    yes
black  low     no     no
white  low     no     no

Decision tree:

root
  child=yes: 100% Yes, 0% No
  child=no:  0% Yes, 100% No
Applications

Insurance
  According to the attributes of customers, determine which customers will buy an insurance policy.
Marketing
  According to the attributes of customers, determine which customers will buy a product such as a computer.
Bank Loan
  According to the attributes of customers, determine which customers are "risky" customers and which are "safe" customers.
Applications

Network
  According to the traffic patterns, determine whether the patterns are related to some "security attacks".
Software
  According to the experience of programmers, determine which programmers can fix certain bugs.
Same/Difference
Classification
Clustering
Classification Methods
Decision Tree
Bayesian Classifier
Nearest Neighbor Classifier
Decision Trees

ID3 (Iterative Dichotomiser)
C4.5
CART (Classification And Regression Trees)
Measurement
How to use the data mining tool
Entropy

Example 1
Consider a random variable which has a uniform distribution over 32 outcomes.
To identify an outcome, we need a label that takes 32 different values.
Thus, 5-bit strings suffice as labels (log2 32 = 5).
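The arithmetic behind the 5-bit claim can be checked directly; a minimal sketch, assuming nothing beyond the base-2 logarithm:

```python
import math

# 32 equally likely outcomes need log2(32) bits to label one outcome.
bits = math.log2(32)
print(bits)  # 5.0
```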
Entropy

Entropy is used to measure how informative a node is.
If we are given a probability distribution P = (p1, p2, ..., pn), then the information conveyed by this distribution, also called the entropy of P, is:

I(P) = -(p1 x log p1 + p2 x log p2 + ... + pn x log pn)

All logarithms here are in base 2.
Entropy

For example,
  If P is (0.5, 0.5), then I(P) is 1.
  If P is (0.67, 0.33), then I(P) is 0.92.
  If P is (1, 0), then I(P) is 0.
Entropy is a way to measure the amount of information (or uncertainty).
The smaller the entropy, the more informative the node is.
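The entropy values above can be reproduced with a short function; a sketch (the function name `entropy` is ours, not from the slides):

```python
import math

def entropy(probs):
    """I(P) = -(p1 log p1 + ... + pn log pn), base-2 logs; 0 log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(entropy([0.5, 0.5]), 4))   # 1.0
print(round(entropy([2/3, 1/3]), 4))   # 0.9183, i.e. about 0.92
print(entropy([1, 0]) == 0)            # True
```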
Entropy

Training set:

   Race   Income  Child  Insurance
1  black  high    no     yes
2  white  high    yes    yes
3  white  low     yes    yes
4  white  low     yes    yes
5  black  low     no     no
6  black  low     no     no
7  black  low     no     no
8  white  low     no     no

Info(T) = - ½ log ½ - ½ log ½ = 1

For attribute Race,
Info(T_black) = - ¾ log ¾ - ¼ log ¼ = 0.8113
Info(T_white) = - ¾ log ¾ - ¼ log ¼ = 0.8113
Info(Race, T) = ½ x Info(T_black) + ½ x Info(T_white) = 0.8113
Gain(Race, T) = Info(T) - Info(Race, T) = 1 - 0.8113 = 0.1887

For attribute Race, Gain(Race, T) = 0.1887
Entropy

(Same training set as above.)

Info(T) = - ½ log ½ - ½ log ½ = 1

For attribute Income,
Info(T_high) = - 1 log 1 - 0 log 0 = 0
Info(T_low) = - 1/3 log 1/3 - 2/3 log 2/3 = 0.9183
Info(Income, T) = ¼ x Info(T_high) + ¾ x Info(T_low) = 0.6887
Gain(Income, T) = Info(T) - Info(Income, T) = 1 - 0.6887 = 0.3113

For attribute Race, Gain(Race, T) = 0.1887
For attribute Income, Gain(Income, T) = 0.3113
root
  child=yes: 100% Yes, 0% No    {2, 3, 4}: Insurance 3 Yes, 0 No
  child=no:  20% Yes, 80% No    {1, 5, 6, 7, 8}: Insurance 1 Yes, 4 No

Info(T) = - ½ log ½ - ½ log ½ = 1

For attribute Child,
Info(T_yes) = - 1 log 1 - 0 log 0 = 0
Info(T_no) = - 1/5 log 1/5 - 4/5 log 4/5 = 0.7219
Info(Child, T) = 3/8 x Info(T_yes) + 5/8 x Info(T_no) = 0.4512
Gain(Child, T) = Info(T) - Info(Child, T) = 1 - 0.4512 = 0.5488

For attribute Race, Gain(Race, T) = 0.1887
For attribute Income, Gain(Income, T) = 0.3113
For attribute Child, Gain(Child, T) = 0.5488

Since Child gives the largest gain, the root splits on attribute Child.
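The three gains can be reproduced in a few lines of Python; a sketch over the slides' 8-tuple training set (variable and function names are ours):

```python
import math
from collections import Counter

# Training set: (Race, Income, Child, Insurance); Insurance is the class label.
DATA = [
    ("black", "high", "no",  "yes"),
    ("white", "high", "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("white", "low",  "no",  "no"),
]

def info(rows):
    """Info(T): entropy of the class distribution of rows."""
    n = len(rows)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, col):
    """Gain(A, T) = Info(T) - Info(A, T) for the attribute in column col."""
    n = len(rows)
    info_a = sum(len(sub) / n * info(sub)
                 for v in {r[col] for r in rows}
                 for sub in [[r for r in rows if r[col] == v]])
    return info(rows) - info_a

for name, col in [("Race", 0), ("Income", 1), ("Child", 2)]:
    print(name, round(gain(DATA, col), 4))
# Race 0.1887, Income 0.3113, Child 0.5488: Child wins the root split.
```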
root
  child=yes: 100% Yes, 0% No    {2, 3, 4}: Insurance 3 Yes, 0 No
  child=no:  20% Yes, 80% No    {1, 5, 6, 7, 8}: Insurance 1 Yes, 4 No

Now consider the child=no branch, T = {1, 5, 6, 7, 8}.

Info(T) = - 1/5 log 1/5 - 4/5 log 4/5 = 0.7219

For attribute Race,
Info(T_black) = - ¼ log ¼ - ¾ log ¾ = 0.8113
Info(T_white) = - 0 log 0 - 1 log 1 = 0
Info(Race, T) = 4/5 x Info(T_black) + 1/5 x Info(T_white) = 0.6490
Gain(Race, T) = Info(T) - Info(Race, T) = 0.7219 - 0.6490 = 0.0729

For attribute Race, Gain(Race, T) = 0.0729
root
  child=yes: 100% Yes, 0% No    {2, 3, 4}: Insurance 3 Yes, 0 No
  child=no:  20% Yes, 80% No    {1, 5, 6, 7, 8}: Insurance 1 Yes, 4 No

Still within the child=no branch, T = {1, 5, 6, 7, 8}.

Info(T) = - 1/5 log 1/5 - 4/5 log 4/5 = 0.7219

For attribute Income,
Info(T_high) = - 1 log 1 - 0 log 0 = 0
Info(T_low) = - 0 log 0 - 1 log 1 = 0
Info(Income, T) = 1/5 x Info(T_high) + 4/5 x Info(T_low) = 0
Gain(Income, T) = Info(T) - Info(Income, T) = 0.7219 - 0 = 0.7219

For attribute Race, Gain(Race, T) = 0.0729
For attribute Income, Gain(Income, T) = 0.7219
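The same computation on the child=no subset reproduces the branch-level gains; a sketch (names are ours), reusing the entropy and gain definitions from before:

```python
import math
from collections import Counter

DATA = [  # (Race, Income, Child, Insurance)
    ("black", "high", "no",  "yes"),
    ("white", "high", "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("white", "low",  "no",  "no"),
]

def info(rows):
    n = len(rows)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, col):
    n = len(rows)
    info_a = sum(len(sub) / n * info(sub)
                 for v in {r[col] for r in rows}
                 for sub in [[r for r in rows if r[col] == v]])
    return info(rows) - info_a

# Restrict to the child=no branch, i.e. tuples {1, 5, 6, 7, 8}.
branch = [r for r in DATA if r[2] == "no"]
print(round(info(branch), 4))     # 0.7219
print(round(gain(branch, 0), 4))  # Race:   0.0729
print(round(gain(branch, 1), 4))  # Income: 0.7219
```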
Since Income gives the larger gain, the child=no branch splits on Income.

root
  child=yes: 100% Yes, 0% No
  child=no:
    Income=high: 100% Yes, 0% No    {1}: Insurance 1 Yes, 0 No
    Income=low:  0% Yes, 100% No    {5, 6, 7, 8}: Insurance 0 Yes, 4 No

For attribute Race, Gain(Race, T) = 0.0729
For attribute Income, Gain(Income, T) = 0.7219
Decision tree:

root
  child=yes: 100% Yes, 0% No
  child=no:
    Income=high: 100% Yes, 0% No
    Income=low:  0% Yes, 100% No

Suppose there is a new person.

Race   Income  Child  Insurance
white  high    no     ?

Following the tree (child=no, then Income=high), the predicted Insurance is "yes".
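Classifying the new person amounts to walking the tree from the root; a hand-coded sketch of the final tree above (function name is ours):

```python
def classify(person):
    """The decision tree above, written out as nested conditions."""
    if person["Child"] == "yes":
        return "yes"                 # child=yes leaf: 100% Yes
    if person["Income"] == "high":
        return "yes"                 # child=no, Income=high leaf: 100% Yes
    return "no"                      # child=no, Income=low leaf: 100% No

new_person = {"Race": "white", "Income": "high", "Child": "no"}
print(classify(new_person))  # yes
```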
Decision tree:

root
  child=yes: 100% Yes, 0% No
  child=no:
    Income=high: 100% Yes, 0% No
    Income=low:  0% Yes, 100% No

Termination criteria?
  e.g., height of the tree
  e.g., accuracy of each node
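The whole construction (pick the attribute with the largest gain, recurse, stop on a pure node or a height limit) can be sketched as a small ID3-style builder; the names and the tuple-based tree representation are ours, not the slides':

```python
import math
from collections import Counter

DATA = [  # (Race, Income, Child, Insurance)
    ("black", "high", "no",  "yes"),
    ("white", "high", "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("white", "low",  "no",  "no"),
]

def info(rows):
    n = len(rows)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, col):
    n = len(rows)
    info_a = sum(len(sub) / n * info(sub)
                 for v in {r[col] for r in rows}
                 for sub in [[r for r in rows if r[col] == v]])
    return info(rows) - info_a

def build(rows, attrs, max_depth):
    labels = Counter(r[-1] for r in rows)
    # Termination: pure node, no attributes left, or tree-height limit reached.
    if len(labels) == 1 or not attrs or max_depth == 0:
        return labels.most_common(1)[0][0]         # leaf = majority class
    col = max(attrs, key=lambda c: gain(rows, c))  # split on the largest gain
    return (col, {v: build([r for r in rows if r[col] == v],
                           attrs - {col}, max_depth - 1)
                  for v in {r[col] for r in rows}})

tree = build(DATA, {0, 1, 2}, max_depth=3)
# Root splits on Child (column 2); the child=no branch splits on Income (column 1).
```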
Decision Trees

ID3
C4.5
CART
Measurement
How to use the data mining tool
C4.5

ID3
  Impurity measurement:
  Gain(A, T) = Info(T) - Info(A, T)

C4.5
  Impurity measurement (gain ratio):
  Gain(A, T) = (Info(T) - Info(A, T)) / SplitInfo(A)
  where SplitInfo(A) = - Σ_(v ∈ A) p(v) log p(v), summing over the values v of attribute A
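C4.5's measurement divides the ID3 gain by SplitInfo; a sketch on the running training set (names are ours). For Race the two values split the data 4/4, so SplitInfo(Race) = 1 and the gain ratio equals the plain gain:

```python
import math
from collections import Counter

DATA = [  # (Race, Income, Child, Insurance)
    ("black", "high", "no",  "yes"),
    ("white", "high", "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("white", "low",  "no",  "no"),
]

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info(rows):
    n = len(rows)
    return entropy([c / n for c in Counter(r[-1] for r in rows).values()])

def gain(rows, col):
    n = len(rows)
    info_a = sum(len(sub) / n * info(sub)
                 for v in {r[col] for r in rows}
                 for sub in [[r for r in rows if r[col] == v]])
    return info(rows) - info_a

def split_info(rows, col):
    """SplitInfo(A) = - sum over values v of A of p(v) log p(v)."""
    n = len(rows)
    return entropy([c / n for c in Counter(r[col] for r in rows).values()])

def gain_ratio(rows, col):
    return gain(rows, col) / split_info(rows, col)

print(round(gain_ratio(DATA, 0), 4))  # Race: 0.1887 (SplitInfo(Race) = 1)
print(round(gain_ratio(DATA, 1), 4))  # Income: about 0.3837
```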
Entropy

Info(T) = - ½ log ½ - ½ log ½ = 1

For attribute Race,
Info(T_black) = - ¾ log ¾ - ¼ log ¼ = 0.8113
Info(T_white) = - ¾ log ¾ - ¼ log ¼ = 0.8113
Entropy

Info(T) = - ½ log ½ - ½ log ½ = 1

For attribute Income,
Info(T_high) = - 1 log 1 - 0 log 0 = 0
Info(T_low) = - 1/3 log 1/3 - 2/3 log 2/3 = 0.9183
CART

Impurity measurement: Gini
I(P) = 1 - Σ_j p_j²
Gini

Info(T) = 1 - (½)² - (½)² = ½

For attribute Race,
Info(T_black) = 1 - (¾)² - (¼)² = 0.375
Info(T_white) = 1 - (¾)² - (¼)² = 0.375
Info(Race, T) = ½ x Info(T_black) + ½ x Info(T_white) = 0.375
Gain(Race, T) = Info(T) - Info(Race, T) = ½ - 0.375 = 0.125

For attribute Race, Gain(Race, T) = 0.125
Gini

Info(T) = 1 - (½)² - (½)² = ½

For attribute Income,
Info(T_high) = 1 - 1² - 0² = 0
Info(T_low) = 1 - (1/3)² - (2/3)² = 0.4444
Info(Income, T) = ¼ x Info(T_high) + ¾ x Info(T_low) = 0.3333
Gain(Income, T) = Info(T) - Info(Income, T) = ½ - 0.3333 = 0.1667
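CART's Gini computations follow the same pattern as before, with I(P) = 1 - Σ p² in place of entropy; a sketch (names are ours):

```python
from collections import Counter

DATA = [  # (Race, Income, Child, Insurance)
    ("black", "high", "no",  "yes"),
    ("white", "high", "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("white", "low",  "no",  "no"),
]

def gini(rows):
    """I(P) = 1 - sum of pj^2 over the class distribution of rows."""
    n = len(rows)
    return 1 - sum((c / n) ** 2 for c in Counter(r[-1] for r in rows).values())

def gini_gain(rows, col):
    n = len(rows)
    weighted = sum(len(sub) / n * gini(sub)
                   for v in {r[col] for r in rows}
                   for sub in [[r for r in rows if r[col] == v]])
    return gini(rows) - weighted

print(gini(DATA))                    # 0.5
print(round(gini_gain(DATA, 0), 4))  # Race:   0.125
print(round(gini_gain(DATA, 1), 4))  # Income: 0.1667
```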
Measurement
Confusion Matrix
Error Report
Lift Chart
Decile-wise lift chart
Others
Measurement – Confusion Matrix

Suppose there is a person (new set).

Race   Income  Child  Insurance
white  high    no     ?

Training set:

Race   Income  Child  Insurance
black  high    no     yes
white  high    yes    yes
black  low     no     no
white  low     no     no

Decision tree:

root
  child=yes: 100% Yes, 0% No
  child=no:  0% Yes, 100% No
Measurement – Error Report

Class    # Cases  # Errors  % Error
Yes      4        0         0.00
No       4        0         0.00
Overall  8        0         0.00

Scoring the training set with the decision tree:

Race   Income  Child  Actual  Predicted
black  high    no     yes     yes
white  high    yes    yes     yes
white  low     yes    yes     yes
white  low     yes    yes     yes
black  low     no     no      no
black  low     no     no      no
black  low     no     no      no
white  low     no     no      no
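A confusion matrix and error report tabulate actual vs. predicted labels; a sketch using the slide's training set, where the tree predicts every tuple correctly (names are ours):

```python
from collections import Counter

# (actual, predicted) Insurance labels for the 8 training tuples,
# scored with the decision tree; here every prediction is correct.
actual    = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
predicted = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

confusion = Counter(zip(actual, predicted))
print(confusion[("yes", "yes")], confusion[("yes", "no")])  # 4 0
print(confusion[("no", "no")], confusion[("no", "yes")])    # 4 0

errors = sum(a != p for a, p in zip(actual, predicted))
print(f"{errors} errors, {100 * errors / len(actual):.2f}% error")  # 0 errors, 0.00% error
```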
Measurement
Confusion Matrix
Error Report
Lift Chart
Decile-wise lift chart
Others
Measurement - Lift Chart

Lift charts
  are visual aids for measuring model performance
  consist of a lift curve and a baseline
Measurement - Lift Chart

Lift charts
  We need to define which value in the target attribute is a "success".
  In our running example, we can treat "Yes" as a success.
Measurement - Lift Chart

[Chart: cumulative lift curve. Y-axis: cumulative count (0 to 4); x-axis: # cases (1 to 8); series: "Cumulative Insurance of actual values" (the Lift Curve).]
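The lift curve plots, after each case (with the cases sorted so the predicted successes come first), the cumulative number of actual "Yes" values; the baseline is what a random ordering would give. A sketch (the names and the sorted ordering of the 8 cases are our assumptions):

```python
def cumulative_successes(actuals):
    """Running count of 'yes' labels: the y-values of the lift curve."""
    curve, total = [], 0
    for a in actuals:
        total += (a == "yes")
        curve.append(total)
    return curve

# 8 cases, sorted by predicted probability of success (successes first).
actuals = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
lift = cumulative_successes(actuals)
print(lift)  # [1, 2, 3, 4, 4, 4, 4, 4]

# Baseline: overall success rate (4/8) times the number of cases seen so far.
baseline = [(i + 1) * 4 / 8 for i in range(len(actuals))]
print(baseline)  # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
```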
Measurement
Confusion Matrix
Error Report
Lift Chart
Decile-wise lift chart
Others
Measurement - Decile-wise Lift Chart

A decile is any of the nine values that divide the sorted data into ten equal parts, so that each part represents 1/10 of the sample or population.
  E.g., 1st decile: the first 10% of tuples
  E.g., 2nd decile: the second 10% of tuples
  E.g., 3rd decile: the third 10% of tuples
Decile-wise Lift Chart

[Chart: decile-wise lift chart; e.g., 9th decile (87.5%): decile mean = 0.0; 10th decile (100%): decile mean = 0.0.]
Measurement
Confusion Matrix
Error Report
Lift Chart
Decile-wise lift chart
Others
Measurement - Others

We will discuss them later.
E.g., Precision, Recall, Specificity, F1-score
Decision Trees

ID3
C4.5
CART
Measurement
How to use the data mining tool
How to use the data mining tool

We can use XLMiner for classification (Decision Tree, CART).
How to use the data mining tool

We have the following 2 versions:
  XLMiner Desktop (installed on the CSE lab machines or your own computer)
  XLMiner Cloud (installed as a plugin in your Office 365 Excel)
Suppose there is a person.

Race   Income  Child  Insurance
white  high    no     ?

XLMiner requires that the input data be in "numeric" format.
We can transform from "categorical" to "numeric" with the XLMiner Transformation Tool first.
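What the Transformation Tool does, in effect, is assign each category a number; a hand-rolled sketch of the same idea (the function name `factorize` is ours, not XLMiner's):

```python
def factorize(values, start=1):
    """Map each distinct category to start, start+1, ... in order of first appearance."""
    mapping = {}
    codes = []
    for v in values:
        if v not in mapping:
            mapping[v] = start + len(mapping)
        codes.append(mapping[v])
    return codes, mapping

codes, mapping = factorize(["black", "white", "white", "black", "white"])
print(codes)    # [1, 2, 2, 1, 2]
print(mapping)  # {'black': 1, 'white': 2}
```

Passing `start=0` gives the tool's other option of numbering from 0.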
[Screenshots: in the Transformation Tool, select the Data source (Worksheet); # Rows: 9, # Columns: 4; choose the Variables to be factored.]
[Screenshot: Option to assign numbers 1, 2, 3, ... or 0, 1, 2, ...]
We have finished the transformation.
However, we want to make the classification process easier.
Thus, we "tidy up" the input format now for the later process of classification.
Copy and paste!
Now, we understand how to perform the transformation.
We also "tidied up" the data for the process of classification.
Next, we need to perform the data mining task for classification (Decision Tree, CART).
[Screenshots: select the Data Source (Workbook, Worksheet, Data range); # Columns: 4, # Rows in Training set: 8.]
[Screenshot: Variables; First row contains header; choose the Selected variables and the Output variables.]
[Screenshots: Binary Classification; Number of classes: 2; Success probability cutoff: 0.5; Preprocessing.]
[Screenshots: Tree Growth settings; limit number of Nodes, Splits, Records in terminal nodes; Maximum number of levels (to display): 5; Full grown tree.]
[Screenshots: Score training data (Detailed report, Summary report, Lift charts); Score new data (In worksheet).]
[Screenshots: for scoring new data, select the Data source (Workbook, Worksheet, Data range); # Rows in data: 1, # Columns in data: 4; Scale variables in input data; match Variables in new data (Match sequentially, Match by name, Unmatch selected).]
Resulting tree (with ordinal-coded attributes):

Child_ord <= 1.5 (# of cases in this branch = 5)
  Income_ord <= 1.5 (# of cases in this branch = 1)
  Income_ord > 1.5 (# of cases in this branch = 4)
Child_ord > 1.5 (# of cases in this branch = 3)
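Assuming the transformation assigned no=1/yes=2 and high=1/low=2 (our reading of the branch counts, not stated explicitly on the slides), the numeric thresholds map back to the categorical tree: Child_ord <= 1.5 means child=no, and Income_ord <= 1.5 means income=high. A sketch, with leaf labels taken from the decision tree built earlier:

```python
# Assumed ordinal codes from the transformation step.
CHILD_ORD = {"no": 1, "yes": 2}
INCOME_ORD = {"high": 1, "low": 2}

def classify_ord(child, income):
    if CHILD_ORD[child] <= 1.5:        # Child_ord <= 1.5  means  child = no
        if INCOME_ORD[income] <= 1.5:  # Income_ord <= 1.5 means  income = high
            return "yes"
        return "no"
    return "yes"                       # Child_ord > 1.5   means  child = yes

print(classify_ord("no", "high"))  # yes
print(classify_ord("no", "low"))   # no
print(classify_ord("yes", "low"))  # yes
```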
How to use the data mining tool

We have the following 2 versions:
  XLMiner Desktop (installed on the CSE lab machines or your own computer)
  XLMiner Cloud (installed as a plugin in your Office 365 Excel)
How to use the data mining tool (XLMiner Cloud)

The way of opening "Create Category Scores" in the XLMiner Cloud plugin in your Office 365 Excel:
"Data Science" tab → Transform → Categorical Data → Create Category Scores
How to use the data mining tool (XLMiner Cloud)

The steps and output format of "Create Category Scores" in XLMiner Cloud are similar to the steps in XLMiner Desktop.
The transformation result of the XLMiner Cloud platform is the same as that from XLMiner Desktop.
How to use the data mining tool (XLMiner Cloud)

The way of opening "Classification Tree" in the XLMiner Cloud plugin in your Office 365 Excel:
"Data Science" tab → Classify → Classification Tree
How to use the data mining tool (XLMiner Cloud)

The steps of performing "Classification Tree" in XLMiner Cloud are similar to the steps in XLMiner Desktop.
The decision tree and classification result of XLMiner Cloud are the same as those from XLMiner Desktop.
How to use the data mining tool (XLMiner Cloud)

The output format of XLMiner Cloud is similar to the output in XLMiner Desktop.
However, to display the classification tree and the lift charts, you need to open the "Charts" window.