Classification and Regression Trees
CLASSIFICATION TREES
Goal
• Classify an outcome based on a set of
predictors
• The output is a set of rules
Example
• Goal: classify a record as “will accept credit
card offer” or “will not accept”
• A rule might be “IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor)” (a sketch of extracting such rules follows below)
• Also called CART, Decision Trees, or just Trees
• Rules are represented by tree diagrams
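A minimal sketch of how such a tree and its rules can be produced with scikit-learn, assuming a hypothetical file bank.csv with predictor columns Income, Education, Family and a 0/1 label PersonalLoan (all of these names are illustrative, not taken from the original example):

```python
# Sketch: fit a classification tree and print its rules (hypothetical data).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

bank = pd.read_csv("bank.csv")                    # hypothetical file name
X = bank[["Income", "Education", "Family"]]       # hypothetical predictors
y = bank["PersonalLoan"]                          # 1 = acceptor, 0 = nonacceptor

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# export_text prints the fitted tree as indented IF/THEN-style conditions,
# one split per line (e.g. "|--- Income <= 92.50"), with the class at each leaf.
print(export_text(tree, feature_names=list(X.columns)))
```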
Two key ideas
• Recursive partitioning:
Repeatedly split the records into two parts so
as to achieve maximum homogeneity within the
new parts
• Pruning:
Simplify the tree by pruning peripheral
branches to avoid overfitting
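A minimal sketch of the pruning step using scikit-learn's cost-complexity pruning, assuming training and validation splits X_train, y_train, X_valid, y_valid already exist:

```python
# Sketch: grow a full tree, then prune it back to avoid overfitting.
from sklearn.tree import DecisionTreeClassifier

full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Candidate complexity parameters (alphas); larger alpha = more aggressive pruning
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha and keep the one that scores best on validation data
pruned_tree = max(
    (DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_valid, y_valid),
)
```

Selecting the pruned tree on a separate validation set, rather than on the training data, is what guards against overfitting.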
Recursive Partitioning
• Dependent (response) variable y
• The dependent variable is a categorical variable in
classification trees
• Predictor variables x1, x2, …, xp
• The predictor variables can be continuous, binary, or ordinal
• Recursive partitioning divides the p-dimensional
space of the predictor variables into non-
overlapping multidimensional rectangles
Recursive Partitioning Steps
• Select one of the predictor variables, say xi
• Select a value of xi, say si, that divides the
training data into two (not necessarily equal)
portions
• Then, one of these two parts is divided in a
similar manner by choosing a variable again
and a split value for the variable
Recursive Partitioning Steps
• This results in three multi-dimensional
rectangular regions
• The process is continued so that smaller and
smaller rectangular regions are obtained
• The idea is to divide the entire predictor space
into rectangles such that each rectangle is as
homogeneous or “pure” as possible
Recursive Partitioning Steps
• At each step, we measure how “pure” or homogeneous each of the resulting portions is
“Pure” = containing records of mostly one class
• Example: divide the records into those with lot size > 14.4 and those with lot size < 14.4
• After evaluating that split, try the next one, which is 15.4 (halfway between 14.8 and 16.0)
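A minimal sketch of this split search for one numeric predictor, assuming NumPy arrays x (e.g. lot size) and y (class labels); gini and best_split are illustrative names, and the Gini impurity measure used for scoring is defined later in this section:

```python
# Sketch: enumerate candidate split values (midpoints between consecutive
# sorted values, e.g. 14.4, 15.4, ...) and keep the split whose two parts
# have the lowest weighted impurity.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    values = np.unique(x)                        # sorted unique values
    candidates = (values[:-1] + values[1:]) / 2  # midpoints between neighbours
    best_value, best_score = None, None
    for s in candidates:
        left, right = y[x < s], y[x >= s]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best_score is None or score < best_score:
            best_value, best_score = s, score
    return best_value, best_score
```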
[Figure: The first split, by lot size]
[Figure: The second split, by income]
[Figure: The partitioning after all splits]
Note: Categorical predictors
• Examine all possible ways in which the categories can
be split.
• E.g., categories A, B, C can be split in 3 ways
{A} and {B, C}
{B} and {A, C}
{C} and {A, B}
• With many categories, the number of possible splits becomes huge (for m categories there are 2^(m-1) - 1 binary splits; see the sketch after this list)
• XLMiner supports only binary categorical variables
• R can handle any categorical variable
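A minimal sketch of enumerating every binary split of a categorical predictor; category_splits is an illustrative helper, not a library function:

```python
# Sketch: list all ways to divide a set of categories into two non-empty groups.
# {A, B, C} yields 3 splits; m categories yield 2**(m-1) - 1 splits.
from itertools import combinations

def category_splits(categories):
    cats = list(categories)
    splits = []
    for size in range(1, len(cats)):
        for group in combinations(cats, size):
            other = tuple(c for c in cats if c not in group)
            if (other, group) not in splits:     # skip the mirror image of a split
                splits.append((group, other))
    return splits

print(category_splits(["A", "B", "C"]))
# [(('A',), ('B', 'C')), (('B',), ('A', 'C')), (('C',), ('A', 'B'))]
```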
MEASURING IMPURITY
Measuring Impurity
• Gini impurity index
• Entropy
Gini Impurity Index
• The Gini impurity index for rectangle A is
I(A) = 1 - Σ_k p_k^2
where the sum runs over the m classes and p_k is the proportion of records in rectangle A that belong to class k
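A minimal sketch of computing both impurity measures from the class labels of the records in a rectangle; gini_impurity and entropy are illustrative names:

```python
# Sketch: Gini impurity and entropy of one rectangle.
# Both are 0 when the rectangle is pure (one class only) and largest
# when the classes are equally represented.
import numpy as np

def gini_impurity(labels):
    """I(A) = 1 - sum_k p_k^2"""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """entropy(A) = -sum_k p_k * log2(p_k)"""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = ["acceptor", "acceptor", "acceptor", "nonacceptor"]
print(gini_impurity(labels))   # 0.375
print(entropy(labels))         # about 0.811
```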
• Bagging, Random Forests, and Boosting are tools that can improve predictions/classifications, at the cost of interpretability and representability (there is no longer a single tree of rules to present)
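A minimal sketch of fitting these ensembles with scikit-learn, assuming training arrays X_train and y_train exist; each model aggregates many trees, which is why the single-tree rule representation is lost:

```python
# Sketch: the three ensemble methods named above, with default tree settings.
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)

bagging = BaggingClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
```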