
Introduction to decision trees and random forests

Ned Horning
American Museum of Natural History's
Center for Biodiversity and Conservation
[email protected]

What are decision trees?

A predictive model that applies a set of binary rules to calculate a target value
Can be used for classification (categorical variables) or regression (continuous variables) applications
Rules are developed using software available in many statistics packages
Different algorithms are used to determine the best split at a node

Example classification tree
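The example tree itself is a figure in the original slide. As a stand-in illustration, a small classification tree can be grown in R with the rpart package introduced later in this presentation; the iris data set and settings below are illustrative, not taken from the slide:

library(rpart)

# Grow a classification tree: predict species from four flower measurements
fit <- rpart(Species ~ ., data = iris, method = "class")

# Print the binary split rules; each terminal node is assigned a class
print(fit)

# Draw the tree structure with split labels and sample counts
plot(fit, margin = 0.1)
text(fit, use.n = TRUE)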

How do classification trees work?

Uses training data to build the model
The tree generator determines:
- which variable to split at a node and the value of the split
- whether to stop (make a terminal node) or split again
- how to assign terminal nodes to a class

Dividing feature space: recursive partitioning
Blue = water
Green = forest
Yellow = shrub
Brown = non-forest
Gray = cloud/shadow

Dividing feature space: recursive partitioning

A constant (class or predicted function value) is assigned to each rectangle


Editing (pruning) the tree

Overfitting is common since individual pixels can end up as terminal nodes
Classification trees can have hundreds or thousands of nodes, which need to be reduced by pruning
Pruning involves removing nodes to simplify the tree
Parameters such as minimum node size and maximum standard deviation of samples at a node can restrict tree size (see the sketch below)
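As a rough sketch of how such controls look in R's rpart package (the control values and pruning threshold below are illustrative, not taken from the slide):

library(rpart)

# Grow a deliberately large tree by relaxing the stopping rules
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 2, cp = 0.0))

# Inspect cross-validated error for each candidate subtree size
printcp(fit)

# Prune back to a simpler tree using a larger complexity parameter
pruned <- prune(fit, cp = 0.02)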

Regression trees

Regression calculates the relationship between predictor and response variables
Structure is similar to a classification tree
Terminal nodes hold predicted function (model) values
Predicted values are limited to the values in the terminal nodes
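A minimal regression tree sketch in R, assuming rpart and the built-in mtcars data (both illustrative), showing that predictions are limited to the values stored in the terminal nodes:

library(rpart)

# Regression tree: predict fuel economy from the other variables
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Predictions take only as many distinct values as there are terminal nodes
preds <- predict(fit)
length(unique(preds))   # equals the number of leaves in the tree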

Decision tree advantages

Easy to interpret the decision rules
Nonparametric, so it is easy to incorporate a range of numeric or categorical data layers and there is no need to select unimodal training data
Robust with regard to outliers in the training data
Classification is fast once the rules are developed

Drawbacks of decision trees

Decision trees tend to overfit the training data, which can give poor results when applied to the full data set
Splitting perpendicular to the feature space axes is not always efficient
Not possible to predict beyond the minimum and maximum limits of the response variable in the training data

Packages in R

tree: the original decision tree package
rpart: a newer and more actively maintained package (both are compared in the short sketch below)
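Both packages fit a tree with a one-line call; a quick, hedged comparison using the iris data purely for illustration:

library(tree)
library(rpart)

# The same classification model specified with each package
fit.tree  <- tree(Species ~ ., data = iris)
fit.rpart <- rpart(Species ~ ., data = iris, method = "class")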

What are ensemble models?

Combines the results from different models
The models can be of a similar type or of different types
The result from an ensemble model is usually better than the result from any one of the individual models

What is random forests?

An ensemble classifier that uses many decision tree models
Can be used for classification or regression
Accuracy and variable importance information is provided with the results

How random forests work

A different subset of the training data (~2/3) is selected, with replacement, to train each tree
The remaining training data (out-of-bag, OOB) are used to estimate error and variable importance
Class assignment is made by the number of votes from all of the trees; for regression, the average of the tree results is used (see the example below)
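A minimal example with the randomForest package; the iris data and the seed are illustrative. The printed output includes the OOB error estimate and, for classification, a confusion matrix:

library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# OOB (out-of-bag) error estimate and confusion matrix
print(rf)
rf$confusion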

Use a subset of variables

A randomly selected subset of variables is used to split each node
The number of variables used is decided by the user (the mtry parameter in R)
A smaller subset produces less correlation between trees (lower error rate) but lower predictive power for each tree (higher error rate)
The optimum range of values is often quite wide (see the tuning sketch below)
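One way to explore mtry is the tuneRF() helper in the randomForest package, which steps mtry up and down from a starting value and reports the OOB error at each setting; the values below are illustrative:

library(randomForest)

set.seed(42)
# Search for an mtry value with low OOB error, starting from the default
tuneRF(x = iris[, 1:4], y = iris$Species,
       ntreeTry = 200, stepFactor = 1.5, improve = 0.01)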

Common variables for random forests

Input data (predictor and response)
Number of trees
Number of variables to use at each split
Options to calculate error and variable importance information
Sampling with or without replacement

randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
             mtry=if (!is.null(y) && !is.factor(y))
                  max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             replace=TRUE, classwt=NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             importance=FALSE, localImp=FALSE, nPerm=1,
             proximity, oob.prox=proximity,
             norm.votes=TRUE, do.trace=FALSE,
             keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
             keep.inbag=FALSE, ...)
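Most of these arguments can be left at their defaults; a typical call might look like the following, with the iris data and argument values used purely for illustration:

library(randomForest)

rf <- randomForest(x = iris[, 1:4], y = iris$Species,
                   ntree = 500,         # number of trees
                   mtry = 2,            # variables tried at each split
                   replace = TRUE,      # sample with replacement
                   importance = TRUE,   # compute variable importance
                   proximity = TRUE)    # compute the proximity matrix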

Proximity measure

Proximity measures how frequently unique pairs of training samples (in and out of bag) end up in the same terminal node
Used to fill in missing data and to calculate outliers
(Figure: outlier measure for a classification example)
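A short sketch of working with the proximity matrix; outlier() and rfImpute() are functions in the randomForest package, and the data used here are illustrative:

library(randomForest)

rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)

# N x N matrix of how often each pair of samples shares a terminal node
dim(rf$proximity)

# Outlying measure per sample, computed from the proximity matrix
out <- outlier(rf$proximity, cls = iris$Species)

# Missing predictor values can be filled in with rfImpute(), e.g.
# (iris.with.NAs is a hypothetical data frame containing missing values):
# iris.filled <- rfImpute(Species ~ ., data = iris.with.NAs)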

Information from Random Forests

Classification accuracy
Variable importance
Outliers (classification)
Missing data estimation
Error rates for random forest objects
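Each of these outputs can be pulled from a fitted randomForest object; a brief sketch, assuming the model was fit with importance = TRUE on illustrative data:

library(randomForest)

rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

importance(rf)      # variable importance measures
varImpPlot(rf)      # importance plotted per variable
rf$confusion        # classification accuracy (OOB confusion matrix)
head(rf$err.rate)   # OOB error rate as trees are added
plot(rf)            # error rate vs. number of trees (next slide)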

Error rate vs. number of trees

Advantages of random forests

No need for pruning trees
Accuracy and variable importance are generated automatically
Overfitting is not a problem
Not very sensitive to outliers in training data
Easy to set parameters

Limitations of random forests

Regression can't predict beyond the range of the response variable in the training data
In regression, extreme values are often not predicted accurately: the model underestimates highs and overestimates lows

Common remote sensing applications of random forests

Classification
- Land cover classification
- Cloud/shadow screening

Regression
- Continuous fields (percent cover) mapping
- Biomass mapping

Resources to learn more about random forests

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox
http://en.wikipedia.org/wiki/Random_forest
The randomForest package (for R) description
