Classification and Regression Trees
CLASSIFICATION TREES
Goal
• Classify an outcome based on a set of
predictors
• The output is a set of rules
Example
• Goal: classify a record as “will accept credit
card offer” or “will not accept”
• A rule might be “IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor)” (a sketch of extracting such rules follows below)
• Also called CART, Decision Trees, or just Trees
• Rules are represented by tree diagrams
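A minimal sketch of how such a tree and its rules can be produced with scikit-learn, assuming a hypothetical file bank.csv with predictor columns Income, Education, Family and a 0/1 label PersonalLoan (all of these names are illustrative, not taken from the original example):

```python
# Sketch: fit a classification tree and print its rules (hypothetical data).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

bank = pd.read_csv("bank.csv")                    # hypothetical file name
X = bank[["Income", "Education", "Family"]]       # hypothetical predictors
y = bank["PersonalLoan"]                          # 1 = acceptor, 0 = nonacceptor

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# export_text prints the fitted tree as indented IF/THEN-style conditions,
# one split per line (e.g. "|--- Income <= 92.50"), with the class at each leaf.
print(export_text(tree, feature_names=list(X.columns)))
```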
Two key ideas
• Recursive partitioning:
Repeatedly split the records into two parts so
as to achieve maximum homogeneity within the
new parts
• Pruning:
Simplify the tree by pruning peripheral
branches to avoid overfitting
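A minimal sketch of the pruning step using scikit-learn's cost-complexity pruning, assuming training and validation splits X_train, y_train, X_valid, y_valid already exist:

```python
# Sketch: grow a full tree, then prune it back to avoid overfitting.
from sklearn.tree import DecisionTreeClassifier

full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Candidate complexity parameters (alphas); larger alpha = more aggressive pruning
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha and keep the one that scores best on validation data
pruned_tree = max(
    (DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_valid, y_valid),
)
```

Selecting the pruned tree on a separate validation set, rather than on the training data, is what guards against overfitting.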
Recursive Partitioning
• Dependent (response) variable y
• The dependent variable is a categorical variable in
classification trees
• Predictor variables x1, x2, …, xp
• The predictor variables can be continuous, binary, or ordinal
• Recursive partitioning divides the p-dimensional
space of the predictor variables into non-
overlapping multidimensional rectangles
Recursive Partitioning Steps
• Select one of the predictor variables, say xi
• Select a value of xi, say si, that divides the
training data into two (not necessarily equal)
portions
• Then, one of these two parts is divided in a
similar manner by choosing a variable again
and a split value for the variable
Recursive Partitioning Steps
• This results in three multi-dimensional
rectangular regions
• The process is continued so that smaller and
smaller rectangular regions are obtained
• The idea is to divide the entire predictor space
into rectangles such that each rectangle is as
homogeneous or “pure” as possible
Recursive Partitioning Steps
• At each step, we measure how “pure” or homogeneous each of the resulting portions is
“Pure” = containing records of mostly one class
• Example: divide the records into those with lot size > 14.4 and those with lot size < 14.4
• After evaluating that split, try the next one, which is 15.4 (halfway between 14.8 and 16.0)
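A minimal sketch of this split search for one numeric predictor, assuming NumPy arrays x (e.g. lot size) and y (class labels); gini and best_split are illustrative names, and the Gini impurity measure used for scoring is defined later in this section:

```python
# Sketch: enumerate candidate split values (midpoints between consecutive
# sorted values, e.g. 14.4, 15.4, ...) and keep the split whose two parts
# have the lowest weighted impurity.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    values = np.unique(x)                        # sorted unique values
    candidates = (values[:-1] + values[1:]) / 2  # midpoints between neighbours
    best_value, best_score = None, None
    for s in candidates:
        left, right = y[x < s], y[x >= s]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best_score is None or score < best_score:
            best_value, best_score = s, score
    return best_value, best_score
```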
[Figure: The first split, by lot size]
[Figure: The second split, by income]
[Figure: The partitioning after all splits]
Note: Categorical predictors
• Examine all possible ways in which the categories can
be split.
• E.g., categories A, B, C can be split in 3 ways
{A} and {B, C}
{B} and {A, C}
{C} and {A, B}
• With many categories, the number of possible splits becomes huge (for m categories there are 2^(m-1) - 1 binary splits; see the sketch after this list)
• XLMiner supports only binary categorical variables
• R can handle any categorical variable
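A minimal sketch of enumerating every binary split of a categorical predictor; category_splits is an illustrative helper, not a library function:

```python
# Sketch: list all ways to divide a set of categories into two non-empty groups.
# {A, B, C} yields 3 splits; m categories yield 2**(m-1) - 1 splits.
from itertools import combinations

def category_splits(categories):
    cats = list(categories)
    splits = []
    for size in range(1, len(cats)):
        for group in combinations(cats, size):
            other = tuple(c for c in cats if c not in group)
            if (other, group) not in splits:     # skip the mirror image of a split
                splits.append((group, other))
    return splits

print(category_splits(["A", "B", "C"]))
# [(('A',), ('B', 'C')), (('B',), ('A', 'C')), (('C',), ('A', 'B'))]
```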
MEASURING IMPURITY
Measuring Impurity
• Gini impurity index
• Entropy
Gini Impurity Index
• The Gini impurity index for rectangle A is
I(A) = 1 - Σ_k p_k^2
where the sum runs over the m classes and p_k is the proportion of records in rectangle A that belong to class k
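A minimal sketch of computing both impurity measures from the class labels of the records in a rectangle; gini_impurity and entropy are illustrative names:

```python
# Sketch: Gini impurity and entropy of one rectangle.
# Both are 0 when the rectangle is pure (one class only) and largest
# when the classes are equally represented.
import numpy as np

def gini_impurity(labels):
    """I(A) = 1 - sum_k p_k^2"""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """entropy(A) = -sum_k p_k * log2(p_k)"""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = ["acceptor", "acceptor", "acceptor", "nonacceptor"]
print(gini_impurity(labels))   # 0.375
print(entropy(labels))         # about 0.811
```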
• Bagging, Random Forests, and Boosting are tools that can improve predictions/classifications, at the cost of interpretability and representability (there is no longer a single tree of rules to present)
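A minimal sketch of fitting these ensembles with scikit-learn, assuming training arrays X_train and y_train exist; each model aggregates many trees, which is why the single-tree rule representation is lost:

```python
# Sketch: the three ensemble methods named above, with default tree settings.
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)

bagging = BaggingClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
```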