
COMP1942
Classification 1 (Introduction and Decision Tree)

Prepared by Raymond Wong
The examples used in Decision Tree are borrowed from LW Chan's notes
XLMiner screenshots captured by Qixu Chen
Presented by Raymond Wong
raywong@cse
Classification

Suppose there is a person.

Race  | Income | Child | Insurance
white | high   | no    | ?

Decision tree:

root
  child=yes: 100% Yes, 0% No
  child=no:
    Income=high: 100% Yes, 0% No
    Income=low:  0% Yes, 100% No
Classification

Suppose there is a person (a new, unseen tuple).

Race  | Income | Child | Insurance
white | high   | no    | ?

Training set:

Race  | Income | Child | Insurance
black | high   | no    | yes
white | high   | yes   | yes
white | low    | yes   | yes
white | low    | yes   | yes
black | low    | no    | no
black | low    | no    | no
black | low    | no    | no
white | low    | no    | no

Decision tree (built from the training set):

root
  child=yes: 100% Yes, 0% No
  child=no:
    Income=high: 100% Yes, 0% No
    Income=low:  0% Yes, 100% No
Classification

Suppose there is a person.

Race  | Income | Child | Insurance
white | high   | no    | ?

Race, Income, and Child are the input attributes; Insurance is the target attribute.
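The learned tree on this slide is small enough to write out directly. Below is a minimal Python sketch (not part of the original slides) that hard-codes it as nested if/else rules and classifies the new person:

```python
# A minimal sketch (not from the slides): the learned decision tree,
# hard-coded as nested if/else rules.

def classify(person: dict) -> str:
    """Predict the Insurance label for one tuple using the slide's tree."""
    if person["Child"] == "yes":
        return "yes"            # child=yes node: 100% Yes
    if person["Income"] == "high":
        return "yes"            # child=no, Income=high node: 100% Yes
    return "no"                 # child=no, Income=low node: 100% No

# The new person from the slide: white, high income, no child.
print(classify({"Race": "white", "Income": "high", "Child": "no"}))  # -> yes
```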
Applications

 Insurance
   According to the attributes of customers, determine which customers will buy an insurance policy.
 Marketing
   According to the attributes of customers, determine which customers will buy a product such as computers.
 Bank Loan
   According to the attributes of customers, determine which customers are "risky" customers or "safe" customers.
Applications

 Network
   According to the traffic patterns, determine whether the patterns are related to some "security attacks".
 Software
   According to the experience of programmers, determine which programmers can fix certain bugs.
Same/Difference
 Classification
 Clustering

Classification Methods
 Decision Tree
 Bayesian Classifier
 Nearest Neighbor Classifier

Decision Trees

 Decision Trees
   ID3 (Iterative Dichotomiser)
   C4.5 (Classification)
   CART (Classification And Regression Trees)
 Measurement
 How to use the data mining tool
Entropy

 Example 1
   Consider a random variable which has a uniform distribution over 32 outcomes.
   To identify an outcome, we need a label that can take 32 different values.
   Thus, 5-bit strings suffice as labels (log2 32 = 5).
Entropy

 Entropy is used to measure how informative a node is.
 If we are given a probability distribution P = (p1, p2, ..., pn), then the information conveyed by this distribution, also called the entropy of P, is:

   I(P) = -(p1 log p1 + p2 log p2 + ... + pn log pn)

 All logarithms here are in base 2.
Entropy

 For example,
   If P is (0.5, 0.5), then I(P) is 1.
   If P is (0.67, 0.33), then I(P) is 0.92.
   If P is (1, 0), then I(P) is 0.
 Entropy is a way to measure the amount of information.
 The smaller the entropy, the purer (more informative) the node.
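For concreteness, here is a minimal Python sketch (not from the slides) of the entropy formula with base-2 logarithms, reproducing the three examples above:

```python
# A minimal sketch (not from the slides): computing I(P) with base-2 logs.
import math

def entropy(probabilities):
    """I(P) = -sum(p * log2(p)); p = 0 terms contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 1.0
print(entropy([2/3, 1/3]))    # ~0.918 (the slide rounds to 0.92)
print(entropy([1.0, 0.0]))    # 0.0
```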
Entropy

(Training set as on slide 3; T denotes the whole training set.)

Info(T) = - 1/2 log 1/2 - 1/2 log 1/2 = 1

For attribute Race,
  Info(Tblack) = - 3/4 log 3/4 - 1/4 log 1/4 = 0.8113
  Info(Twhite) = - 3/4 log 3/4 - 1/4 log 1/4 = 0.8113
  Info(Race, T) = 1/2 x Info(Tblack) + 1/2 x Info(Twhite) = 0.8113
  Gain(Race, T) = Info(T) - Info(Race, T) = 1 - 0.8113 = 0.1887

For attribute Race, Gain(Race, T) = 0.1887

Entropy

Info(T) = - 1/2 log 1/2 - 1/2 log 1/2 = 1

For attribute Income,
  Info(Thigh) = - 1 log 1 - 0 log 0 = 0
  Info(Tlow) = - 1/3 log 1/3 - 2/3 log 2/3 = 0.9183
  Info(Income, T) = 1/4 x Info(Thigh) + 3/4 x Info(Tlow) = 0.6887
  Gain(Income, T) = Info(T) - Info(Income, T) = 1 - 0.6887 = 0.3113

For attribute Race, Gain(Race, T) = 0.1887
For attribute Income, Gain(Income, T) = 0.3113
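The per-attribute bookkeeping above is mechanical, so a short sketch helps. The following Python (an illustration, not the course's code) recomputes Info and Gain over the 8-tuple training set:

```python
# A minimal sketch (not from the slides): reproducing the Gain numbers above.
import math
from collections import Counter

TRAINING = [
    ("black", "high", "no",  "yes"),
    ("white", "high", "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("white", "low",  "yes", "yes"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("black", "low",  "no",  "no"),
    ("white", "low",  "no",  "no"),
]
ATTRS = {"Race": 0, "Income": 1, "Child": 2}
TARGET = 3

def info(rows):
    """Entropy of the target labels in these rows."""
    counts = Counter(r[TARGET] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(attr, rows):
    """Gain(A, T) = Info(T) - sum over values v of |Tv|/|T| x Info(Tv)."""
    i = ATTRS[attr]
    groups = Counter(r[i] for r in rows)
    weighted = sum(groups[v] / len(rows)
                   * info([r for r in rows if r[i] == v]) for v in groups)
    return info(rows) - weighted

for a in ATTRS:
    print(a, round(gain(a, TRAINING), 4))
# Race 0.1887, Income 0.3113, Child 0.5488 -- Child's gain is derived the
# same way and appears on the next slide.
```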
Entropy

Candidate split on Child:

root
  child=yes: 100% Yes, 0% No    (tuples {2, 3, 4}; Insurance: 3 Yes, 0 No)
  child=no:  20% Yes, 80% No    (tuples {1, 5, 6, 7, 8}; Insurance: 1 Yes, 4 No)

Info(T) = - 1/2 log 1/2 - 1/2 log 1/2 = 1

For attribute Child,
  Info(Tyes) = - 1 log 1 - 0 log 0 = 0
  Info(Tno) = - 1/5 log 1/5 - 4/5 log 4/5 = 0.7219
  Info(Child, T) = 3/8 x Info(Tyes) + 5/8 x Info(Tno) = 0.4512
  Gain(Child, T) = Info(T) - Info(Child, T) = 1 - 0.4512 = 0.5488

For attribute Race, Gain(Race, T) = 0.1887
For attribute Income, Gain(Income, T) = 0.3113
For attribute Child, Gain(Child, T) = 0.5488

Child has the largest gain, so the root is split on Child.
Entropy

We now recurse on the child=no branch, so T = {1, 5, 6, 7, 8} (Insurance: 1 Yes, 4 No).

Info(T) = - 1/5 log 1/5 - 4/5 log 4/5 = 0.7219

For attribute Race,
  Info(Tblack) = - 1/4 log 1/4 - 3/4 log 3/4 = 0.8113
  Info(Twhite) = - 0 log 0 - 1 log 1 = 0
  Info(Race, T) = 4/5 x Info(Tblack) + 1/5 x Info(Twhite) = 0.6490
  Gain(Race, T) = Info(T) - Info(Race, T) = 0.7219 - 0.6490 = 0.0729

For attribute Race, Gain(Race, T) = 0.0729
Entropy

Still on the child=no branch, T = {1, 5, 6, 7, 8}.

Info(T) = - 1/5 log 1/5 - 4/5 log 4/5 = 0.7219

For attribute Income,
  Info(Thigh) = - 1 log 1 - 0 log 0 = 0
  Info(Tlow) = - 0 log 0 - 1 log 1 = 0
  Info(Income, T) = 1/5 x Info(Thigh) + 4/5 x Info(Tlow) = 0
  Gain(Income, T) = Info(T) - Info(Income, T) = 0.7219 - 0 = 0.7219

For attribute Race, Gain(Race, T) = 0.0729
For attribute Income, Gain(Income, T) = 0.7219
Entropy

Income has the larger gain, so the child=no node is split on Income:

root
  child=yes: 100% Yes, 0% No
  child=no:
    Income=high: 100% Yes, 0% No    (tuple {1}; Insurance: 1 Yes, 0 No)
    Income=low:  0% Yes, 100% No    (tuples {5, 6, 7, 8}; Insurance: 0 Yes, 4 No)
Entropy

The finished decision tree:

root
  child=yes: 100% Yes, 0% No
  child=no:
    Income=high: 100% Yes, 0% No
    Income=low:  0% Yes, 100% No

Suppose there is a new person.

Race  | Income | Child | Insurance
white | high   | no    | ?

Following the tree (child=no, then Income=high), the predicted Insurance is "yes".
Entropy

Termination criteria?

e.g., height of the tree
e.g., accuracy of each node
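Putting the pieces together, tree construction is a recursion: split on the highest-gain attribute and stop at a termination criterion (here, a pure node or no remaining attributes). A minimal sketch, reusing info(), gain(), ATTRS, TARGET, and TRAINING from the earlier snippets:

```python
# A minimal sketch (not from the slides) of the recursive tree-building loop.
from collections import Counter

def build_tree(rows, attrs):
    labels = Counter(r[TARGET] for r in rows)
    if len(labels) == 1 or not attrs:          # termination: pure node,
        return labels.most_common(1)[0][0]     # or no attribute left
    best = max(attrs, key=lambda a: gain(a, rows))
    i = ATTRS[best]
    rest = [a for a in attrs if a != best]
    return {best: {v: build_tree([r for r in rows if r[i] == v], rest)
                   for v in set(r[i] for r in rows)}}

print(build_tree(TRAINING, list(ATTRS)))
# (up to key order)
# {'Child': {'yes': 'yes', 'no': {'Income': {'high': 'yes', 'low': 'no'}}}}
```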
Decision Trees
 Decision Trees
 ID3
 C4.5
 CART
 Measurement
 How to use the data mining tool

C4.5

 ID3
   Impurity measurement:
     Gain(A, T) = Info(T) - Info(A, T)
 C4.5
   Impurity measurement:
     Gain(A, T) = (Info(T) - Info(A, T)) / SplitInfo(A)
   where SplitInfo(A) = - Σv∈A p(v) log p(v)
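A minimal sketch (not from the slides) of C4.5's measurement, reusing gain(), ATTRS, and TRAINING from the earlier snippets:

```python
# A minimal sketch (not from the slides): C4.5's gain ratio.
import math
from collections import Counter

def split_info(attr, rows):
    """SplitInfo(A) = -sum over values v of A of p(v) log2 p(v)."""
    counts = Counter(r[ATTRS[attr]] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(attr, rows):
    return gain(attr, rows) / split_info(attr, rows)

print(round(gain_ratio("Race", TRAINING), 4))    # 0.1887 (SplitInfo = 1)
print(round(gain_ratio("Income", TRAINING), 4))  # 0.3837 (SplitInfo = 0.8113)
```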
Entropy

Info(T) = - 1/2 log 1/2 - 1/2 log 1/2 = 1

For attribute Race,
  Info(Tblack) = - 3/4 log 3/4 - 1/4 log 1/4 = 0.8113
  Info(Twhite) = - 3/4 log 3/4 - 1/4 log 1/4 = 0.8113
  Info(Race, T) = 1/2 x Info(Tblack) + 1/2 x Info(Twhite) = 0.8113
  SplitInfo(Race) = - 1/2 log 1/2 - 1/2 log 1/2 = 1
  Gain(Race, T) = (Info(T) - Info(Race, T)) / SplitInfo(Race) = (1 - 0.8113) / 1 = 0.1887

For attribute Race, Gain(Race, T) = 0.1887
Entropy

Info(T) = - 1/2 log 1/2 - 1/2 log 1/2 = 1

For attribute Income,
  Info(Thigh) = - 1 log 1 - 0 log 0 = 0
  Info(Tlow) = - 1/3 log 1/3 - 2/3 log 2/3 = 0.9183
  Info(Income, T) = 1/4 x Info(Thigh) + 3/4 x Info(Tlow) = 0.6887
  SplitInfo(Income) = - 2/8 log 2/8 - 6/8 log 6/8 = 0.8113
  Gain(Income, T) = (Info(T) - Info(Income, T)) / SplitInfo(Income) = (1 - 0.6887) / 0.8113 = 0.3837

For attribute Race, Gain(Race, T) = 0.1887
For attribute Income, Gain(Income, T) = 0.3837
For attribute Child, Gain(Child, T) = ?
Decision Trees
 Decision Trees
 ID3
 C4.5
 CART
 Measurement
 How to use the data mining tool

CART

 Impurity Measurement
   Gini
     I(P) = 1 - Σj pj²
Gini

Info(T) = 1 - (1/2)² - (1/2)² = 1/2

For attribute Race,
  Info(Tblack) = 1 - (3/4)² - (1/4)² = 0.375
  Info(Twhite) = 1 - (3/4)² - (1/4)² = 0.375
  Info(Race, T) = 1/2 x Info(Tblack) + 1/2 x Info(Twhite) = 0.375
  Gain(Race, T) = Info(T) - Info(Race, T) = 1/2 - 0.375 = 0.125

For attribute Race, Gain(Race, T) = 0.125
Gini

Info(T) = 1 - (1/2)² - (1/2)² = 1/2

For attribute Income,
  Info(Thigh) = 1 - 1² - 0² = 0
  Info(Tlow) = 1 - (1/3)² - (2/3)² = 0.444
  Info(Income, T) = 1/4 x Info(Thigh) + 3/4 x Info(Tlow) = 0.333
  Gain(Income, T) = Info(T) - Info(Income, T) = 1/2 - 0.333 = 0.167

For attribute Race, Gain(Race, T) = 0.125
For attribute Income, Gain(Income, T) = 0.167
For attribute Child, Gain(Child, T) = ?
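The Gini computations follow the same pattern as the entropy ones. A minimal sketch, reusing TRAINING, ATTRS, and TARGET from the earlier snippets:

```python
# A minimal sketch (not from the slides): Gini impurity and Gini-based gain.
from collections import Counter

def gini(rows):
    """I(P) = 1 - sum over j of pj^2, over the target label distribution."""
    counts = Counter(r[TARGET] for r in rows)
    n = len(rows)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def gini_gain(attr, rows):
    i = ATTRS[attr]
    groups = Counter(r[i] for r in rows)
    weighted = sum(groups[v] / len(rows)
                   * gini([r for r in rows if r[i] == v]) for v in groups)
    return gini(rows) - weighted

print(round(gini_gain("Race", TRAINING), 3))    # 0.125
print(round(gini_gain("Income", TRAINING), 3))  # 0.167
```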
Decision Trees
 Decision Trees
 ID3
 C4.5
 CART
 Measurement
 How to use the data mining tool

Measurement
 Confusion Matrix
 Error Report
 Lift Chart
 Decile-wise lift chart
 Others

Measurement - Confusion Matrix

(The new tuple, the training set, and the decision tree are as on slide 3.)
Measurement - Confusion Matrix

Running each training tuple through the decision tree gives a predicted label next to its actual label:

Race  | Income | Child | Actual | Predicted
black | high   | no    | yes    | yes
white | high   | yes   | yes    | yes
white | low    | yes   | yes    | yes
white | low    | yes   | yes    | yes
black | low    | no    | no     | no
black | low    | no    | no     | no
black | low    | no    | no     | no
white | low    | no    | no     | no
Measurement - Confusion Matrix

Confusion matrix:

           | Predicted Yes | Predicted No
Actual Yes |             4 |            0
Actual No  |             0 |            4

Is this decision tree "good"?
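The confusion matrix is just a tally of (actual, predicted) pairs. A minimal sketch, reusing TRAINING, TARGET, and classify() from the earlier snippets:

```python
# A minimal sketch (not from the slides): tallying the confusion matrix.
from collections import Counter

actual = [r[TARGET] for r in TRAINING]
predicted = [classify({"Race": r[0], "Income": r[1], "Child": r[2]})
             for r in TRAINING]

matrix = Counter(zip(actual, predicted))
for a in ("yes", "no"):
    print(a, [matrix[(a, p)] for p in ("yes", "no")])
# yes [4, 0]
# no  [0, 4]
```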
Measurement
 Confusion Matrix
 Error Report
 Lift Chart
 Decile-wise lift chart
 Others

Measurement - Error Report

Class   | # Cases | # Errors | % Error
Yes     |       4 |        0 |    0.00
No      |       4 |        0 |    0.00
Overall |       8 |        0 |    0.00

(Training set with actual and predicted labels, and the decision tree, as on the previous slides.)
Measurement
 Confusion Matrix
 Error Report
 Lift Chart
 Decile-wise lift chart
 Others

Measurement - Lift Chart

 Lift charts
   Visual aids for measuring model performance
   Consist of a lift curve and a baseline
   The greater the area between the lift curve and the baseline, the better the model.
Measurement - Lift Chart

 Lift charts
   We need to define which value in the target attribute is a "success".
   In our running example, we can treat "Yes" as a success.
Lift Chart

[Chart: cumulative Insurance of actual values (the lift curve) against # cases (1-8); the curve climbs by 1 per case up to 4 over the first four cases and stays flat afterwards.]

Sort the tuples according to the predicted value, with "success" (or "yes") sorted at a higher priority, and break ties by the actual value:

Race  | Income | Child | Actual | Predicted
black | high   | no    | yes    | yes
white | high   | yes   | yes    | yes
white | low    | yes   | yes    | yes
white | low    | yes   | yes    | yes
black | low    | no    | no     | no
black | low    | no    | no     | no
black | low    | no    | no     | no
white | low    | no    | no     | no
Lift Chart

[Chart: the same lift curve, together with the baseline (cumulative Insurance using the average, i.e., 0.5 per case); the area between the lift curve and the baseline is highlighted.]

The larger the area between the lift curve and the baseline is, the better the classifier is.
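A minimal sketch (not from the slides) of the lift-curve points, reusing the actual and predicted lists from the confusion-matrix snippet:

```python
# A minimal sketch (not from the slides): cumulative lift curve and baseline.
rows = sorted(zip(actual, predicted),
              key=lambda ap: (ap[1] != "yes", ap[0] != "yes"))  # "yes" first

cumulative, curve = 0, []
for a, _ in rows:
    cumulative += (a == "yes")
    curve.append(cumulative)

baseline = [(i + 1) * curve[-1] / len(rows) for i in range(len(rows))]
print(curve)     # [1, 2, 3, 4, 4, 4, 4, 4]
print(baseline)  # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
```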
Measurement
 Confusion Matrix
 Error Report
 Lift Chart
 Decile-wise lift chart
 Others

Measurement - Decile-wise Lift Chart

 A decile is any of the nine values that divide the sorted data into ten equal parts, so that each part represents 1/10 of the sample or population.
   E.g., 1st decile: the first 10% of the tuples
   E.g., 2nd decile: the second 10% of the tuples
   E.g., 3rd decile: the third 10% of the tuples
Measurement - Decile-wise Lift Chart

[Chart: decile mean / global mean (0-4) per decile (1-10); the bars are at 2.0 for deciles 1-5 and at 0 for deciles 6-10.]

Global mean = 4/8 = 0.5

Sort the tuples according to the predicted value, with "success" (or "yes") sorted at a higher priority, and break ties by the actual value. The decile means of the actual values are then:

Decile      | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th
Decile mean | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
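The decile means above can be reproduced by averaging the actual values over each tenth of the sorted list, weighting tuples by fractional overlap (8 tuples do not divide evenly into 10 parts). A minimal sketch, not from the slides:

```python
# A minimal sketch (not from the slides): decile means and decile-wise lift
# for the 8 sorted tuples (actual values 1,1,1,1,0,0,0,0).
sorted_actual = [1, 1, 1, 1, 0, 0, 0, 0]        # sorted as in the lift chart
global_mean = sum(sorted_actual) / len(sorted_actual)   # 4/8 = 0.5

def decile_mean(values, d, parts=10):
    """Mean over the fraction [d/parts, (d+1)/parts) of the sorted list,
    weighting each tuple by how much of it falls inside the slice."""
    n = len(values)
    lo, hi = d / parts, (d + 1) / parts
    total = weight = 0.0
    for i, v in enumerate(values):
        overlap = max(0.0, min(hi, (i + 1) / n) - max(lo, i / n))
        total += overlap * v
        weight += overlap
    return total / weight

for d in range(10):
    m = decile_mean(sorted_actual, d)
    print(f"decile {d + 1}: mean = {m:.1f}, mean/global mean = {m / global_mean:.1f}")
# Deciles 1-5: mean 1.0, lift 2.0; deciles 6-10: mean 0.0, lift 0.0.
```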
Measurement
 Confusion Matrix
 Error Report
 Lift Chart
 Decile-wise lift chart
 Others

Measurement - Others
 We will discuss them later.
 E.g., Precision, Recall, Specificity, f1-score

Decision Trees
 Decision Trees
 ID3
 C4.5
 CART
 Measurement
 How to use the data mining tool

How to use the data mining tool

 We can use XLMiner for classification (Decision Tree, CART).
How to use the data mining tool

 We have the following 2 versions.
   XLMiner Desktop (installed in either the CSE lab machine or your computer)
   XLMiner Cloud (installed as a plugin in your Office 365 Excel)
How to use the data mining tool

Suppose there is a person.

Race  | Income | Child | Insurance
white | high   | no    | ?

 XLMiner requires that the input data have the following format:
   Input attributes: numeric
   Target attribute (or output attribute): categorical
 In our training set, both the input attributes and the target attribute are categorical.
 We can transform the input attributes from "categorical" to "numeric" with the XLMiner transformation tool first.
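For reference, the same categorical-to-numeric idea in plain Python (a sketch, not XLMiner's actual algorithm; the numbers here follow order of first appearance, which may differ from XLMiner's assignment):

```python
# A minimal sketch (not from the slides) of the categorical-to-numeric step:
# each category of an input attribute is assigned a number (1, 2, ...).
TRAINING_NUMERIC = []
codes = [{}, {}, {}]                 # one code table per input attribute
for row in TRAINING:                 # TRAINING from the earlier sketch
    coded = [codes[i].setdefault(v, len(codes[i]) + 1)   # numbers from 1
             for i, v in enumerate(row[:3])]
    TRAINING_NUMERIC.append(coded + [row[3]])            # target stays as-is

print(TRAINING_NUMERIC[0])   # e.g., [1, 1, 1, 'yes'] for (black, high, no)
```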
How to use the data mining tool

(The same slide, with the transformation's numeric codes overlaid on the training set: each categorical value of an input attribute is replaced by a number such as 1 or 2.)
How to use the data mining tool

 How can XLMiner perform the transformation?
   Open "classification-decisionTree.xlsx" in MS Excel on a CSE lab machine.
[Slides 52-56: XLMiner screenshots.]
Data source: choose the workbook, the worksheet, and the data range. Select the correct data range "$B$1:$E$10".
Variables: # Rows: 9, # Columns: 4. Move the variables to be factored into "Variables to be factored", and check "First row contains header".
[Slides 59-60: XLMiner screenshots.]
Option: "Assign numbers 1, 2, 3, ..." or "Assign numbers 0, 1, 2, ...".
[Slides 62-67: XLMiner screenshots.]
 We have finished the transformation.
 However, we want to make the classification process easier.
 Thus, we want to "tidy up" the input format for the later process of classification now.
Copy and paste!

[Slides 70-77: XLMiner screenshots of tidying up the data.]
 Now, we understand how to perform the transformation.
 We also "tidied up" the data for the process of classification.
 Next, we need to perform the data mining task for classification (Decision Tree, CART).
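XLMiner is a GUI tool, so there is no course code for this step. For readers without XLMiner, a rough scikit-learn equivalent of the CART run is sketched below (an assumption, not part of the course material). It reuses TRAINING_NUMERIC from the encoding sketch; Child_ord and Income_ord appear in the XLMiner output later, while Race_ord is assumed by analogy:

```python
# A minimal sketch (not from the slides): a CART-style run in scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [row[:3] for row in TRAINING_NUMERIC]   # numeric input attributes
y = [row[3] for row in TRAINING_NUMERIC]    # categorical target

tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
print(export_text(tree, feature_names=["Race_ord", "Income_ord", "Child_ord"]))

# Score the new person (white=2, high=1, no=1 under our encoding).
print(tree.predict([[2, 1, 1]]))   # -> ['yes']
```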
[Slides 79-81: XLMiner screenshots.]
Data source: choose the workbook, the worksheet, and the data range. Select the correct data range "$D$27:$G$35".
# Columns: 4; # Rows in training set: 8.
Variables: check "First row contains header"; move the variables in the input data into "Selected variables", and set the output variable.
[Slides 85-88: XLMiner screenshots.]
Binary classification. Target: "Classes"; Number of classes: 2; Success class: "Yes"; Success probability cutoff: 0.5.
Preprocessing: Partition data; Rescale data.
Tree growth - limit number of: Levels, Nodes, Splits, Records in terminal nodes (with a limit value for each).
[Slide 92: XLMiner screenshot.]
Maximum number of levels (to display): 5.
Full grown

[Slide 95: XLMiner screenshot.]
Score training data: Detailed report, Summary report, Lift charts.
Score new data: In worksheet.
Data source: choose the workbook, the worksheet, and the data range. Select the correct data range "$D$37:$G$38".
# Rows in data: 1; # Columns in data: 4. Check "First row contains headers".
Match the variables in the new data against the variables in the input data: Match by name, Match sequentially, Match selected; Unmatch all, Unmatch selected. ("Scale variables in input data" also appears in this dialog.)
[Slides 101-120: XLMiner screenshots of the classification run and its output.]
The tree produced by XLMiner:

Child_ord <= 1.5 (# of cases in this branch = 5)
  Income_ord <= 1.5 (# of cases in this branch = 1)
  Income_ord > 1.5 (# of cases in this branch = 4)
Child_ord > 1.5 (# of cases in this branch = 3)
[Slides 122-153: XLMiner screenshots.]
How to use the data mining tool

 We have the following 2 versions.
   XLMiner Desktop (installed in either the CSE lab machine or your computer)
   XLMiner Cloud (installed as a plugin in your Office 365 Excel)
How to use the data mining tool (XLMiner Cloud)

 The way of opening "Create Category Scores" in the XLMiner Cloud plugin in your Office 365 Excel:

   "Data Science" tab → Transform → Categorical Data → Create Category Scores
How to use the data mining tool (XLMiner Cloud)

 The steps and output format of "Create Category Scores" in XLMiner Cloud are similar to those in XLMiner Desktop.
 The transformation result of the XLMiner Cloud platform is the same as that from XLMiner Desktop.
How to use the data mining tool (XLMiner Cloud)

 The way of opening "Classification Tree" in the XLMiner Cloud plugin in your Office 365 Excel:

   "Data Science" tab → Classify → Classification Tree
How to use the data mining tool (XLMiner Cloud)

 The steps of performing "Classification Tree" in XLMiner Cloud are similar to those in XLMiner Desktop.
 The decision tree and classification result of XLMiner Cloud are the same as those from XLMiner Desktop.
How to use the data mining tool (XLMiner Cloud)

 The output format of XLMiner Cloud is similar to the output of XLMiner Desktop.
 However, to display the classification tree and the lift charts, you need to call up the "Charts" window.
[Slides 160-172: XLMiner Cloud screenshots.]
