STA555 Data Mining: Decision Trees
What is a Decision Tree?
A decision tree is a type of supervised learning algorithm (one with a
pre-defined target variable) that is commonly used in classification problems.
Note: A decision tree can also be used to estimate the value of a continuous
target variable (a regression tree). However, multiple regression and neural
network models are generally more appropriate when the target variable is
continuous.
Examples of a Decision Tree
How a Decision Tree is Constructed
A decision tree uses the target variable to determine how each input should
be partitioned.
In the end, the decision tree breaks the data into nodes, defined by the
splitting rules at each step.
Taken together, the rules for all the nodes form the decision tree model.
A model that can be expressed as a collection of rules is very attractive:
the rules are readily expressed in English so that we can understand them.
EXAMPLE OF AN ENGLISH RULE
*------------------------------------------------------------*
Node = 2
*------------------------------------------------------------*
if Median Home Value Region < 67650
then
Tree Node Identifier = 2
Number of Observations = 3983
Predicted: TargetB=0 = 0.54
Predicted: TargetB=1 = 0.46
*------------------------------------------------------------*
Node = 6
*------------------------------------------------------------*
if Median Home Value Region >= 67650 or MISSING
AND Age < 36.5
then
Tree Node Identifier = 6
Number of Observations = 410
Predicted: TargetB=0 = 0.58
Predicted: TargetB=1 = 0.42
*------------------------------------------------------------*
Node = 7
*------------------------------------------------------------*
if Median Home Value Region >= 67650 or MISSING
AND Age >= 36.5 or MISSING
then
Tree Node Identifier = 7
Number of Observations = 5293
Predicted: TargetB=0 = 0.47
Predicted: TargetB=1 = 0.53
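As an illustration only: the sketch below shows how a comparable set of if/then
rules can be printed from a tree fitted in Python with scikit-learn (an
assumption of this sketch, not a tool shown in the slides). The tiny synthetic
data set, the column names, and the resulting rules are hypothetical and will
not reproduce the output above.

# Minimal sketch: fit a shallow classification tree and print its rules.
# Assumes scikit-learn and pandas; the data and thresholds are made up.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "MedianHomeValue": [45000, 52000, 120000, 98000, 150000, 70000, 64000, 210000],
    "Age":             [25,    61,    34,     45,    29,     52,    40,    38],
    "TargetB":         [0,     1,     0,      1,     0,      1,     0,     1],
})

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(df[["MedianHomeValue", "Age"]], df["TargetB"])

# export_text prints each path from the root to a leaf as an if/then rule
# on the input variables, similar in spirit to the English rules above.
print(export_text(tree, feature_names=["MedianHomeValue", "Age"]))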
A Typical Decision Tree
[Figure: a decision tree predicting buyers vs. non-buyers. The root node
(Node 0: 600 buyers, 40%; 900 non-buyers, 60%) is split on Income
(< $100,000 vs. $100,000 and above) into Node 1 (350 buyers, 36.84%;
600 non-buyers, 63.16%) and Node 2 (250 buyers, 45.45%; 300 non-buyers,
54.55%). Node 1 is then split on Age (< 25 vs. 25 and above) and Node 2 on
Gender (male vs. female); a deeper node is split on Race (Chinese vs.
Malay & Indian), producing leaf nodes such as one with 30 buyers (15%) and
170 non-buyers (85%) and another with 170 buyers (85%) and 30 non-buyers (15%).]
A customer with income less than $100,000 and age less than 25 is predicted
as a non-buyer (see the code sketch below).
Note: Input variables that are higher up in the decision tree can be deemed
to be the more important variables in predicting the target variable.
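To make the highlighted prediction concrete, here is a minimal sketch of the
same logic written as nested if/else statements. Only the "income below
$100,000 and age below 25 implies non-buyer" path comes from the figure; the
other branch outcomes are hypothetical placeholders.

def predict_buyer(income: float, age: float) -> str:
    # Toy scoring function mirroring the illustrative tree above.
    # Only the first path is taken from the slide; the remaining
    # return values are hypothetical placeholders.
    if income < 100_000:
        if age < 25:
            return "non-buyer"   # rule stated in the example
        return "buyer"           # hypothetical branch outcome
    return "buyer"               # hypothetical branch outcome

print(predict_buyer(income=80_000, age=22))   # -> non-buyer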
Growing Decision Trees for a Binary Target Variable
Pruning algorithm: the process of reducing the size of the tree by turning
some branch nodes into leaf nodes and removing the leaf nodes under the
original branch. Pruning is useful because classification trees may fit the
training data well but do a poor job of classifying new values; lower branches
may be strongly affected by outliers. Pruning enables you to find the
next-largest tree and minimize the problem. A simpler tree often avoids
over-fitting.
Splitting algorithm
Finding the Initial Split
The tree starts with all of the records/observations in the training set at
the root node.
The first task is to split the records into children by creating a rule on
the input variables.
What are the best children? They are the ones that are purest in one of the
target values, because the goal is to separate the values of the target as
much as possible.
For a binary target, purity is measured using the probability of membership
in each class.
Splitting algorithm
Finding the Initial Split
The best split is defined as the one that does the best job of separating the
data into groups where a single class predominates in each group.
The measure used to evaluate a potential split is purity.
The best split is the one that increases the purity of the subsets by the
greatest amount.
A good split also creates nodes of similar size, or at least does not create
very small nodes (see the sketch below).
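The following is a minimal sketch, not taken from the slides, of one way to
score a candidate split against these two criteria. The purity measure
(majority-class fraction), the class counts, and the minimum-leaf-size
threshold are all hypothetical choices.

def majority_purity(counts):
    # counts = [n_class0, n_class1] in one child node.
    return max(counts) / sum(counts)

def score_split(children, min_leaf=30):
    # Size-weighted purity of a split; returns None if any child node
    # would contain fewer than min_leaf records.
    n_total = sum(sum(c) for c in children)
    if any(sum(c) < min_leaf for c in children):
        return None                      # creates a very small node
    return sum(sum(c) / n_total * majority_purity(c) for c in children)

# Two hypothetical candidate splits of the same 1,500 records.
split_a = [[350, 600], [250, 300]]       # balanced child sizes
split_b = [[590, 890], [10, 10]]         # second child is tiny
print(score_split(split_a))              # a usable split (about 0.60)
print(score_split(split_b))              # None -> rejected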
Tests for Choosing Best Split
Splitting algorithm
[Figure: two candidate splits of the same group of customers, one on Income
(< 2000 vs. >= 2000) and one on Gender (male vs. female), labelled Split A and
Split B, with the mix of buyers and non-buyers shown in each child node.]
Split A is better than Split B since the purity score for Split A is higher.
Entropy Reduction / Information Gain as a Splitting Criterion
The entropy of a decision tree is the total of the entropy of all of its
terminal nodes.
Entropy measures impurity, or lack of information, in a decision tree.
The best input variable is the one that gives the greatest reduction in entropy.
When a node has only one class, its score is 0. For a binary target, entropy
values go from 0 (a pure population) to 1 (equal numbers of each class), so
purer nodes have lower scores, and the goal is to minimize the entropy score
of the split.
As a decision tree becomes purer, more orderly, and more informative, its
entropy approaches zero.
The reduction in entropy is sometimes referred to as information gain.
Entropy:
$H = -\sum_i P_i \log_2(P_i)$
where $P_i$ is the probability of the $i$-th category of the target variable
occurring in a particular node.
Evaluating the split using entropy
For two classes (dark and light):
$\text{Entropy} = -[P(\text{dark})\log_2 P(\text{dark}) + P(\text{light})\log_2 P(\text{light})]$
Which of the two proposed splits, Gender (male vs. female) or Income
(< 2000 vs. >= 2000), increases information gain the most?
Entropy at the root node = 1 (equal numbers of each class).
(Recall that $\log_2(a) = \log_{10}(a)/\log_{10}(2)$.)
For a child node containing only one class,
$\text{Entropy}_{\text{left}} = -[1 \cdot \log_2(1) + 0] = 0$.
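As a computational sketch of this comparison (the class counts below are
hypothetical, since the figure's exact counts are not reproduced here),
information gain can be calculated as the parent node's entropy minus the
size-weighted entropy of its child nodes:

from math import log2

def entropy(counts):
    # H = -sum(p_i * log2(p_i)) over the classes present in one node.
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child nodes.
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

# Hypothetical counts [dark, light] at the root and under two candidate splits.
root = [100, 100]                            # entropy = 1 at the root
split_gender = [[60, 40], [40, 60]]          # male / female
split_income = [[90, 10], [10, 90]]          # income < 2000 / income >= 2000

print(information_gain(root, split_gender))  # about 0.03
print(information_gain(root, split_income))  # about 0.53 -> preferred split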
Measuring "Worth" of a Split
Example: chi-square and logworth for two candidate splits on Age. The observed
and expected counts shown are for the target classes (0, 1) in the
"Age < cutoff" child node of each split.

Candidate split   Observed (0, 1)   n     Expected (0, 1)   Chi-square     Logworth
Age < 51          466, 59           525   483.39, 41.61     11.69538825    6
Age < 41          534, 16           550   506.41, 43.59     28.76270027    9

The split at Age < 41 has the larger chi-square statistic and logworth, so it
is the better of the two candidate splits.
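Logworth is commonly defined as -log10 of the p-value of the split's
chi-square test. The sketch below illustrates that definition, assuming one
degree of freedom and assuming scipy is available; decision tree software may
also apply multiple-comparison adjustments, so the rounded logworth values in
the table above are not necessarily reproduced exactly.

from math import log10
from scipy.stats import chi2

def logworth(chi_square, df=1):
    # -log10 of the chi-square p-value: larger logworth = better split.
    return -log10(chi2.sf(chi_square, df))

print(logworth(11.69538825))   # candidate split at Age < 51
print(logworth(28.76270027))   # candidate split at Age < 41 (higher worth)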
Example: the χ² Test Statistic
Observed counts of Heart Disease (No / Yes) by blood pressure group, with
expected counts in parentheses:

            Heart Disease: No   Heart Disease: Yes   Total
Low BP      95  (75)            5  (25)              100
High BP     55  (75)            45 (25)              100
Total       150                 50                   200

Each expected count is (row total × column total) / grand total, e.g.
Expected(Low BP, No) = 100 × 150 / 200 = 75 and
Expected(Low BP, Yes) = 100 × 50 / 200 = 25.

$\chi^2 = \sum_{\text{all cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(95-75)^2}{75} + \frac{(5-25)^2}{25} + \frac{(55-75)^2}{75} + \frac{(45-25)^2}{25} = 2(400/75) + 2(400/25) = 42.67$

Compared with the chi-square tables, this value is significant.
(Where is the High BP cutoff?)
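As a quick check on this arithmetic, assuming scipy is available, the same
statistic can be computed directly from the observed table (the Yates
continuity correction is disabled so that the result matches the hand
calculation):

from scipy.stats import chi2_contingency

# Observed counts: rows = Low BP, High BP; columns = No disease, Yes disease.
observed = [[95, 5],
            [55, 45]]

chi_sq, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi_sq)      # about 42.67
print(expected)    # [[75. 25.] [75. 25.]]
print(p_value)     # very small p-value -> the split is significant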
Pruning
Pruning methods, originally suggested in Breiman et al. (1984), were developed
to resolve this dilemma: employing tight stopping criteria tends to create
small, under-fitted decision trees, while using loose stopping criteria tends
to generate large decision trees that are over-fitted to the training set.
Pruning is one of the techniques used to tackle overfitting in decision trees.
Commonly used approaches to pruning include CART, CHAID, and C5.0.
Pruning Algorithm : CART
CART (Classification and Regression Trees) is a popular decision tree
algorithm introduced by Breiman et al. in 1984.
The CART algorithm grows binary trees and continues splitting as long
as new splits can be found that increase purity.
Inside a complex tree there are many simpler subtrees, each of which
represents a different trade-off between model complexity and accuracy.
These candidate subtrees are applied to the validation set, and the
tree with the lowest validation-set misclassification rate (or average
squared error for a numeric target) is selected as the final model.
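A hedged sketch of this idea is shown below, using scikit-learn's
cost-complexity pruning as a stand-in for the original CART implementation.
The synthetic data set and the 60/40 training/validation split are arbitrary
choices for illustration.

# Grow a large tree, enumerate the nested sequence of pruned subtrees via
# cost-complexity pruning, and keep the subtree with the lowest
# misclassification rate on the validation set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

best_alpha, best_error = 0.0, float("inf")
for alpha in alphas:
    alpha = max(float(alpha), 0.0)   # guard against tiny negative round-off
    subtree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    subtree.fit(X_train, y_train)
    error = 1.0 - subtree.score(X_valid, y_valid)   # validation misclassification
    if error < best_error:
        best_alpha, best_error = alpha, error

final_model = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0)
final_model.fit(X_train, y_train)
print(best_alpha, best_error)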
Pruning Algorithm : C5.0