
STA555 Data Mining
Decision Trees

What is a Decision Tree?
 A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is commonly used in classification problems.

 The goal is to create a model that predicts the value of a target variable based on several input variables.
Decision Tree
 Decision trees are useful for classification and prediction.

 A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target.

 The target variable is usually categorical, and the decision tree is used either to:
 (1) calculate the probability that a given record belongs to each category, or
 (2) classify the record by assigning it to the most likely class (or category).

 The algorithm used to construct a decision tree is referred to as recursive partitioning.

 Note: Decision trees can also be used to estimate the value of a continuous target variable (regression tree). However, multiple regression and neural network models are generally more appropriate when the target variable is continuous.
Examples of a Decision Tree
How a Decision Tree is Constructed
 The decision tree algorithm uses the target variable to determine how each input should be partitioned.
 In the end, the decision tree breaks the data into nodes, defined by the splitting rules at each step.
 Taken together, the rules for all the nodes form the decision tree model.
 A model that can be expressed as a collection of rules is very attractive.
 Rules are readily expressed in English so that we can understand them, as in the example that follows.
EXAMPLE OF AN ENGLISH RULE

*------------------------------------------------------------*
Node = 2
*------------------------------------------------------------*
if Median Home Value Region < 67650
then
Tree Node Identifier = 2
Number of Observations = 3983
Predicted: TargetB=0 = 0.54
Predicted: TargetB=1 = 0.46

*------------------------------------------------------------*
Node = 6
*------------------------------------------------------------*
if Median Home Value Region >= 67650 or MISSING
AND Age < 36.5
then
Tree Node Identifier = 6
Number of Observations = 410
Predicted: TargetB=0 = 0.58
Predicted: TargetB=1 = 0.42

*------------------------------------------------------------*
Node = 7
*------------------------------------------------------------*
if Median Home Value Region >= 67650 or MISSING
AND Age >= 36.5 or MISSING
then
Tree Node Identifier = 7
Number of Observations = 5293
Predicted: TargetB=0 = 0.47
Predicted: TargetB=1 = 0.53
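To show how such an English rule maps onto program logic, here is a minimal sketch (not part of the original listing) that encodes the Node 7 rule as a plain Python function; the variable names and the missing-value handling simply follow the listing above.

```python
import math

def is_missing(value):
    """Treat None or NaN as MISSING, as in the rule listing above."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def score_node_7(median_home_value_region, age):
    """Return the leaf's predicted class probabilities if the record falls in Node 7."""
    home_ok = is_missing(median_home_value_region) or median_home_value_region >= 67650
    age_ok = is_missing(age) or age >= 36.5
    if home_ok and age_ok:
        return {"TargetB=0": 0.47, "TargetB=1": 0.53}
    return None  # the record belongs to some other leaf

print(score_node_7(80000, 40))   # {'TargetB=0': 0.47, 'TargetB=1': 0.53}
```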
A Typical Decision Tree

 The box at the top of the diagram is the root node, which contains all the training data used to grow the tree.
 The root node has n children, and a rule that specifies which records go to which child. The rule is based on the most important input selected by the tree algorithm.
 The objective of the tree is to split these records/observations into nodes dominated by a single class.
 The nodes that ultimately get used are at the ends of their branches, with no children. These are the leaves of the tree.
[Figure: annotated decision tree. The root node at the top holds all the training data; a rule based on the most important input sends each record to a child node; the nodes with no children are the leaves, and the path from the root node to a leaf describes a rule for the records in that leaf.]
A Typical Decision Tree

 The path from the root node to a leaf describes a rule for the records/observations in that leaf.
 Decision trees assign scores to new
records/observations, simply by letting each
record/observation flow through the tree to arrive at its
appropriate leaf.
 Each leaf has a rule, which is based on the path through
the tree.
 The rules are used to assign new records/observations
to the appropriate leaf. The proportion of
records/observations in each class provides the scores.
[Figure: scoring with a decision tree. The path from the root node to a leaf describes a rule for the records in that leaf; each leaf's rule is based on the path through the tree; the rules assign new records to the appropriate leaf, and the proportion of records in each class provides the scores. Example: a new record with FS97NK = 4 and MSLG = 10 flows to a leaf giving Yhat = 0.]
A Simple Decision Tree
Target: Status (Buyer or Non-buyer), a categorical variable

Node 0 (all records): Buyer 600 (40%), Non-buyer 900 (60%)
Split on Income: less than $100,000 vs. $100,000 and above

 Node 1 (Income < $100,000): Buyer 350 (36.84%), Non-buyer 600 (63.16%)
  Split on Age: under 25 vs. 25 and above
   Node 3 (Age < 25): Buyer 50 (9.09%), Non-buyer 500 (90.91%)
   Node 4 (Age 25 and above): Buyer 300 (75%), Non-buyer 100 (25%)

 Node 2 (Income $100,000 and above): Buyer 250 (45.45%), Non-buyer 300 (54.55%)
  Split on Gender: female vs. male
   Node 5 (Female): Buyer 200 (50%), Non-buyer 200 (50%)
    Split on Race: Chinese vs. Malay & Indian
     Node 7 (Chinese): Buyer 170 (85%), Non-buyer 30 (15%)
     Node 8 (Malay & Indian): Buyer 30 (15%), Non-buyer 170 (85%)
   Node 6 (Male): Buyer 50 (33.33%), Non-buyer 100 (66.67%)

A customer with income less than $100,000 and age less than 25 is predicted to be a non-buyer (Node 3).

Note: Input variables that appear higher up in the decision tree can be deemed the more important variables in predicting the target variable. (A code sketch of fitting such a tree follows.)
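For readers who want to see a tree like the one above produced by software, here is a minimal sketch using scikit-learn on a small invented buyer/non-buyer dataset; the data values, column names and tree settings are illustrative assumptions, not the figures from the diagram.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical buyer/non-buyer records (invented for illustration).
df = pd.DataFrame({
    "income": [50_000, 120_000, 80_000, 150_000, 30_000, 110_000, 95_000, 60_000],
    "age":    [22, 45, 30, 50, 24, 28, 40, 35],
    "status": ["Non-buyer", "Buyer", "Buyer", "Buyer",
               "Non-buyer", "Non-buyer", "Buyer", "Non-buyer"],
})

X, y = df[["income", "age"]], df["status"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the splitting rules, analogous to the node diagram above.
print(export_text(tree, feature_names=["income", "age"]))
```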
Growing Decision Trees for a Binary Target Variable

 Two algorithms are involved in building a decision tree:

 Splitting algorithm. The process of partitioning/splitting the data set into subsets. Splits are formed on a particular variable/input and in a particular location. For each split, two determinations are made: the predictor/input variable used for the split, called the splitting variable, and the set of values for that variable (which are divided between the left child node and the right child node), called the split point. The splitting algorithm repeatedly splits the data into smaller and smaller groups in such a way that each new set of nodes has greater purity than its ancestors with respect to the target variable.

 Pruning algorithm. The process of reducing the size of the tree by turning some branch nodes into leaf nodes and removing the leaf nodes under the original branch. Pruning is useful because classification trees may fit the training data well but do a poor job of classifying new values. Lower branches may be strongly affected by outliers. Pruning enables you to find the next largest tree and minimize the problem. A simpler tree often avoids over-fitting.
Splitting algorithm
Finding the Initial Split
 The tree starts with all the records/observations in the training set at the root node.
 The first task is to split the records into children by creating a rule on the input variables.
 What are the best children? The answer is the ones that are purest in one of the target values, because the goal is to separate the values of the target as much as possible.
 For a binary target, purity is measured through the probability of membership in each class.
Splitting algorithm
Finding the Initial Split

 To perform the split, the algorithm considers all possible


splits on all input variables.
 The algorithm then chooses the best split value for each
variable. The best variable is the one that produces the
best split.
 The measure used to evaluate a potential split is purity
of the target variable in the children.
 Low purity means that the distribution of the target
variable in the children is similar to that of the parent
node, whereas high purity means that members of a
single class predominate.
The best split
• is the one that increases purity in the children by the greatest
amount
• creates nodes of similar size, do not create nodes containing
very few records
Example: Good & Poor Splits

[Figure: a good split contrasted with a poor split that produces nodes with very small sample sizes.]
Splitting on a Numeric Input Variable (X)
 When searching for a binary split on a numeric input variable, each distinct value that the variable takes is treated as a candidate value for the split.

 Splits on a numeric variable take the form X < N. All records where the value of X (the splitting variable) is less than some constant N are sent to one child, and all records where the value of X is greater than or equal to N are sent to the other.

 After each trial split, the increase in purity due to the split is measured. (Repeat the process for all possible cut-off values and choose the split that maximizes purity; see the sketch below.)
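The search just described can be sketched in a few lines of Python. The purity measure used here (weighted proportion of the majority class) and the variable names are illustrative assumptions, not a specific library's implementation.

```python
import numpy as np

def node_purity(y):
    """Proportion of the majority class in a node (1.0 = perfectly pure)."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

def best_numeric_split(x, y):
    """Try every distinct value of x as the cut-off N in 'X < N' and keep the best."""
    best_n, best_score = None, -np.inf
    for n in np.unique(x):
        left, right = y[x < n], y[x >= n]
        if len(left) == 0 or len(right) == 0:
            continue  # skip degenerate splits
        score = (len(left) * node_purity(left) + len(right) * node_purity(right)) / len(y)
        if score > best_score:
            best_n, best_score = n, score
    return best_n, best_score

x = np.array([20, 25, 30, 35, 40, 45, 50, 55])
y = np.array(["N", "N", "N", "B", "B", "B", "B", "B"])
print(best_numeric_split(x, y))   # cut-off 35 separates the classes perfectly
```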
Splitting on a Categorical Input Variable (X)

 The simplest algorithm for splitting on a categorical input variable is to create a new branch for each class that the categorical variable can take on.
 But high branching factors quickly reduce the population of training records available at each child node, making further splitting less likely and less reliable.
 A better and more common approach is to group together classes that, taken individually, predict similar outcomes (a rough sketch follows).
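As a rough illustration of the grouping idea (an assumption-laden sketch, not the exact procedure any particular tool uses), the categories of an input can be merged according to their target rates before splitting:

```python
import pandas as pd

# Hypothetical data: a categorical input 'region' and a binary target 'buyer'.
df = pd.DataFrame({
    "region": ["N", "N", "S", "S", "E", "E", "W", "W"],
    "buyer":  [1, 1, 0, 0, 1, 0, 0, 0],
})

# Buyer rate per category, then a crude two-way grouping around the overall rate.
rates = df.groupby("region")["buyer"].mean()
high = set(rates[rates >= df["buyer"].mean()].index)
df["region_grouped"] = df["region"].map(lambda c: "high" if c in high else "low")

print(rates)
print(df["region_grouped"].tolist())
```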
Splitting in the Presence of Missing Values
 One of the nicest things about decision trees is their ability to handle missing values in input fields by using null as an allowable value.
 This approach is preferable to discarding/deleting records/observations with missing values or trying to impute the missing values.
 Throwing out records is likely to create a biased training set, because the records with missing values are probably not a random sample of the population.
 Replacing missing values with imputed values runs the risk that important information carried by the very fact that a value is missing will be ignored by the model. (A minimal sketch follows.)
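A minimal pandas sketch of this idea (the field name is hypothetical): missing values are kept as their own allowable category rather than being dropped or imputed.

```python
import pandas as pd

# Keep missing values as an allowable level of the input.
income_band = pd.Series(["low", None, "high", "high", None, "low"])
income_band = income_band.fillna("MISSING")   # null becomes its own category
print(income_band.value_counts())
```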
Growing the Full Tree
 The initial split produces two or more children, each of which is then split in the same manner as the root node.
 This is called a recursive algorithm, because the same splitting method is used on the subsets of data in each child.
 Once again, all input fields are considered as candidates for the split, even fields already used for splits.
 Eventually, tree building stops, for one of three reasons:
 No split can be found that significantly increases the purity of any node's children.
 The number of records per node reaches some preset lower bound.
 The depth of the tree reaches some preset limit.
 At this point, the full decision tree has been grown (a compact sketch of the recursive procedure appears after the note below).
Note:
 Employing tight stopping criteria tends to create small, under-fitted decision trees. On the other hand, using loose stopping criteria tends to generate large decision trees that are over-fitted to the training set.
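The recursive growing procedure and the three stopping conditions can be sketched as follows. The purity measure, thresholds and data are simplified illustrations under my own assumptions, not a specific package's algorithm.

```python
import numpy as np

def purity(y):
    """Proportion of the majority class in a node."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

def grow(x, y, depth=0, min_records=4, max_depth=3):
    """Recursively split on a single numeric input x with a binary-style search."""
    node = {"n": len(y), "purity": round(purity(y), 3)}
    if len(y) < min_records or depth >= max_depth:
        return node                                   # stop: size or depth limit reached
    best = None
    for cut in np.unique(x)[1:]:                      # candidate cut-offs (X < cut)
        left, right = y[x < cut], y[x >= cut]
        gain = (len(left) * purity(left) + len(right) * purity(right)) / len(y) - purity(y)
        if gain > 0 and (best is None or gain > best[1]):
            best = (cut, gain)
    if best is None:
        return node                                   # stop: no split increases purity
    cut = best[0]
    node["split"] = f"x < {cut}"
    node["left"] = grow(x[x < cut], y[x < cut], depth + 1, min_records, max_depth)
    node["right"] = grow(x[x >= cut], y[x >= cut], depth + 1, min_records, max_depth)
    return node

x = np.array([20, 22, 24, 30, 36, 41, 45, 52, 58, 60])
y = np.array(["N", "N", "N", "B", "B", "B", "N", "B", "B", "B"])
print(grow(x, y))
```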
Recall: Split Criteria

 The best split is defined as one that does the best job of separating the data into groups where a single class predominates in each group.
 The measure used to evaluate a potential split is purity.
 The best split is one that increases the purity of the subsets by the greatest amount.
 A good split also creates nodes of similar size, or at least does not create very small nodes.
Tests for Choosing the Best Split

The choice of splitting criterion depends on the type of target variable, i.e. whether the target variable is categorical or numeric/interval, and not on the input variable. The type of the input variable does not matter.

Splitting criteria

Categorical target variable:
 Gini (population diversity)
 Entropy (information gain)
 Chi-square test using logworth

Interval target variable:
 Variance reduction
 F-test
Gini (Population Diversity) as a Splitting Criterion
 For the Gini measure, a score of 0.5 means that two
classes are represented equally.
 When a node has only one class, its score is 1.
 Because purer nodes have higher scores, the goal of
decision tree algorithms that use this measure is to
maximize the Gini score of the split.
 The Gini measure of a node is the sum of the squares of
the proportions of the classes in the node.
 A perfectly pure node has a Gini score of 1. A node that
is evenly balanced has a Gini score of 0.5.
Evaluating the split using Gini
Which of these two proposed splits increases purity the most? (The target classes are buyer and non-buyer.)

Gini score at the root node = 0.5² + 0.5² = 0.5

Proposed split 1: Gender (male vs. female)
 Gini_male = (0.1)² + (0.9)² = 0.82
 Gini_female = (0.9)² + (0.1)² = 0.82
 Gini score_gender = (10/20)(0.82) + (10/20)(0.82) = 0.820

Proposed split 2: Income (< 2000 vs. >= 2000)
 Gini_left (Income < 2000) = 1
 Gini_right (Income >= 2000) = (4/14)² + (10/14)² = 0.592
 Gini score_income = (6/20)(1) + (14/20)(0.592) = 0.714

A perfectly pure node would have a Gini score of 1. Since 0.820 > 0.714, gender gives the better split; the calculation is reproduced in code below.

Source: Berry and Linoff (2004)
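The Gini arithmetic above can be checked with a few lines of Python. The class counts are inferred from the stated proportions (male = 1 buyer / 9 non-buyers, female = 9/1, income < 2000 = 6/0, income >= 2000 = 4/10), so treat this as an illustrative reconstruction.

```python
# Gini as defined in these notes: sum of squared class proportions,
# so 1 = perfectly pure and 0.5 = an even binary mix.
def gini_node(counts):
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini score of a split; children is a list of class-count lists."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini_node(c) for c in children)

print(round(gini_split([[1, 9], [9, 1]]), 3))    # gender split -> 0.82
print(round(gini_split([[6, 0], [4, 10]]), 3))   # income split -> 0.714
```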
Split A
 Gini score at the root node: P(triangle)² + P(stars)² = 0.5² + 0.5² = 0.5
 Gini score for the left child: 0.125² + 0.875² = 0.78125
 Gini score for the right child: 0.8² + 0.2² = 0.68
 Purity of the split = (8/18)(0.78125) + (10/18)(0.68) = 0.725

Split B
 Gini score for the left child: (3/9)² + (6/9)² = 0.556
 Gini score for the right child: (6/9)² + (3/9)² = 0.556
 Purity of the split = (9/18)(0.556) + (9/18)(0.556) = 0.556

Split A is better than Split B, since the purity score for Split A is higher.
Entropy Reduction / Information Gain as a Splitting Criterion
 Entropy for a decision tree is the total of the entropy of all terminal nodes in the tree.
 Entropy measures impurity, or lack of information, in a decision tree.
 The best input variable is the one that gives the greatest reduction in entropy.
 When a node has only one class, its score is 0. Entropy values range from 0 (purer population) to 1 (equal numbers of each class). So purer nodes have lower scores, and the goal is to minimize the entropy score of the split.
 As a decision tree becomes purer, more orderly and more informative, its entropy approaches zero.
 The reduction in entropy is sometimes referred to as information gain.

Entropy:
 H = − Σᵢ Pᵢ log₂(Pᵢ)
 where Pᵢ is the probability of the i-th category of the target variable occurring in a particular node.
Evaluating the split using entropy
Entropy = −[P(dark) log₂ P(dark) + P(light) log₂ P(light)]

Which of these two proposed splits increases information gain the most?
Entropy at the root node = 1. (Recall that log₂(a) = log₁₀(a) / log₁₀(2).)

Proposed split 1: Gender (male vs. female)
 Entropy_male = −(0.9 log₂ 0.9 + 0.1 log₂ 0.1) = 0.469
 Entropy_female = −(0.1 log₂ 0.1 + 0.9 log₂ 0.9) = 0.469
 Entropy score_gender = (10/20)(0.469) + (10/20)(0.469) = 0.469
 Information gain = 1 − 0.469 = 0.531

Proposed split 2: Income (< 2000 vs. >= 2000)
 Entropy_left = −[1·log₂(1) + 0] = 0
 Entropy_right = −[(4/14) log₂(4/14) + (10/14) log₂(10/14)] = −(−0.52 − 0.35) = 0.8631
 Entropy score_income = (6/20)(0) + (14/20)(0.8631) = 0.6042
 Information gain = 1 − 0.6042 = 0.3958

Since 0.531 > 0.3958, gender gives the greater information gain and is the better split; the calculation is reproduced in code below.

Source: Berry and Linoff (2004)
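The same worked example can be reproduced for entropy; as before, the class counts are inferred from the stated proportions and are an assumption of this sketch.

```python
import math

def entropy(counts):
    """H = -sum p*log2(p) over the classes in a node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_split(children):
    """Weighted entropy of the children; children is a list of class-count lists."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

print(round(entropy([1, 9]), 3))                   # 0.469 (male / female node)
print(round(entropy_split([[1, 9], [9, 1]]), 3))   # 0.469 -> gain = 1 - 0.469 = 0.531
print(round(entropy_split([[6, 0], [4, 10]]), 3))  # 0.604 -> gain = 1 - 0.604 = 0.396
```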
Split A
 Left child: −(0.875 log₂ 0.875 + 0.125 log₂ 0.125) = 0.544
 Right child: −(0.200 log₂ 0.200 + 0.800 log₂ 0.800) = 0.722
 Total (weighted) entropy of the children after the split = (8/18)(0.544) + (10/18)(0.722) = 0.643

Split B
 Left child: −((4/5) log₂(4/5) + (1/5) log₂(1/5)) = 0.721
 Right child: −((5/13) log₂(5/13) + (8/13) log₂(8/13)) = 0.961
 Total (weighted) entropy of the children after the split = (5/18)(0.721) + (13/18)(0.961) = 0.894

Based on the lower total entropy value, Split A is better than Split B.
Chi-Square Test as a Splitting Criterion

 The chi-square test is a test of statistical significance.
 Its value measures how likely or unlikely it is that a split arose by chance.
 The higher the chi-square value, the less likely the split is due to chance, and not being due to chance means that the split is important.
 "Unlikely due to chance" simply means that, having tested the variables and found a significant p-value (p < α, where α = 0.05), the results you have found are unlikely to be due to chance (the results are significant).
 In SAS Enterprise Miner, the calculated value is called the logworth value.
Chi-Square Test as a Splitting Criterion (cont.)

 The best split based on logworth is determined as follows:
 Compute the chi-square statistic of association between the binary target and all potential splits of each competing input.
 For each input, determine the split with the highest logworth, where logworth = −log₁₀ of the chi-square p-value.
 Compare the best splits across all input variables and choose the one with the highest logworth as the best split.
Calculating Chi-Square and Logworth Values
 The chi-square statistic computes a measure of how different the observed number of observations in each of the four cells is from the expected number.
 The p-value associated with the null hypothesis (no association) is then computed.
 Enterprise Miner then computes the logworth of the p-value: logworth = −log₁₀(p-value).
 The split that generates the highest logworth for a given input variable is selected.
Which is the best split?

Split 1 (age < 51 vs. age >= 51)

Observed              Have heart disease?
                      0        1        Total
  age < 51            466      59       525
  age >= 51           1021     69       1090
  Total               1487     128      1615

Expected
  age < 51            483.39   41.61
  age >= 51           1003.6   86.39

  chi-square = 11.695, df = (r − 1)(c − 1) = 1, p-value = 0.000001, logworth = 6

Split 2 (age < 41 vs. age >= 41)

Observed              Have heart disease?
                      0        1        Total
  age < 41            534      16       550
  age >= 41           953      112      1065
  Total               1487     128      1615

Expected
  age < 41            506.41   43.591
  age >= 41           980.59   84.409

  chi-square = 28.763, df = 1, p-value = 0.000000001, logworth = 9

Split 2 has the higher logworth, so it is the better split.
Example
 First, review chi-square tests and contingency tables.

Observed              Heart disease
                      No       Yes      Total
  Low BP              95       5        100
  High BP             55       45       100
  Total               150      50       200

Expected              Heart disease
                      No       Yes
  Low BP              75       25
  High BP             75       25

χ² Test Statistic
 Each expected count = (row total × column total) / grand total, e.g. (100 × 150) / 200 = 75 and (100 × 50) / 200 = 25.
 χ² = Σ over all cells of (observed − expected)² / expected
    = 2(400/75) + 2(400/25) = 42.67
 Compared with the chi-square tables, this is significant.
 (Where should the high-BP cut-off be placed?)
Measuring “Worth” of a Split

 The p-value is the probability of a chi-square statistic as great as the one observed if independence is true. (Pr{χ² > 42.67} is 6.4E-11.)
 Such p-values get too small to work with comfortably, so we use the logworth instead.
 Logworth = −log₁₀(p-value) = 10.1938.
 The best (largest) chi-square corresponds to the maximum logworth (see the scipy sketch below).
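A short sketch with scipy (assuming the standard Pearson chi-square without continuity correction) reproduces the blood-pressure example and its logworth:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[95, 5],     # low BP:  no disease, disease
                     [55, 45]])   # high BP: no disease, disease

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof)        # 42.67, 1
print(p)                          # ~6.4e-11
print(round(-np.log10(p), 4))     # logworth ~ 10.19
```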
Logworth for Age Splits

[Figure: logworth plotted against candidate age split values; a split at age 47 maximizes the logworth.]
Pruning
 Pruning algorithm. The process of reducing the size of the tree by turning some branch nodes into leaf nodes and removing the leaf nodes under the original branch. Pruning is useful because classification trees may fit the training data well but do a poor job of classifying new values. Lower branches may be strongly affected by outliers. Pruning enables you to find the next largest tree and minimize the problem. A simpler tree often avoids over-fitting.

 Pruning methods, originally suggested by Breiman et al. (1984), were developed to solve this dilemma. Employing tight stopping criteria tends to create small, under-fitted decision trees. On the other hand, using loose stopping criteria tends to generate large decision trees that are over-fitted to the training set. Pruning is one of the techniques used to tackle over-fitting in decision trees.

 Under this methodology, a loose stopping criterion is used, letting the decision tree overfit the training set. The over-fitted tree is then cut back into a smaller tree by removing sub-branches that do not contribute to the generalization accuracy. It has been shown in various studies that employing pruning methods can improve the generalization performance of a decision tree, especially in noisy domains.

 The most commonly used approaches to pruning are those of CART, C5.0 and CHAID.
Pruning Algorithm: CART
 CART (Classification and Regression Trees) is a popular decision tree algorithm, introduced by Breiman et al. in 1984.

 The CART algorithm grows binary trees and continues splitting as long as new splits can be found that increase purity.

 Inside a complex tree there are many simpler subtrees, each of which represents a different trade-off between model complexity and accuracy.

 Through repeated pruning, the CART algorithm identifies a set of such subtrees as candidate models.

 These candidate subtrees are applied to the validation set, and the tree with the lowest validation-set misclassification rate (or average squared error for a numeric target) is selected as the final model (see the sketch below).
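CART-style pruning can be approximated with scikit-learn's cost-complexity pruning. The dataset here is synthetic and the code is a sketch of the idea (grow a large tree, generate candidate subtrees, keep the one with the lowest validation error), not SAS Enterprise Miner's or CART's exact procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data split into training and validation sets.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate complexity parameters derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_err = None, np.inf
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    err = 1 - tree.score(X_valid, y_valid)          # validation misclassification rate
    if err < best_err:
        best_alpha, best_err = alpha, err

print(best_alpha, round(best_err, 3))               # chosen subtree's alpha and error
```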
Pruning Algorithm: C5.0
 C5.0 is a more recent version of the decision-tree algorithm.
 The trees grown by C5.0 are similar to those grown by CART (although, unlike CART, C5.0 makes multiway splits on categorical variables).
 Like CART, the C5.0 algorithm first grows an overfit tree and then prunes it back to create a more stable model.
 The pruning strategy is quite different, however: C5.0 uses the training set itself to decide how the tree should be pruned.
Pruning Algorithm: CHAID
 CHAID (Chi-Squared Automatic Interaction Detection) was originally proposed by Kass in 1980.
 In the CHAID algorithm, a test of statistical significance (the chi-squared test) is used to test whether the distribution of the validation-set results looks different from the distribution of the training-set results.
 A split is pruned when the confidence level is less than some user-defined threshold, so only splits that are, say, 95% confident in the validation set remain.
