DecisionTrees RandomForest v2
random forests
Ned Horning
American Museum of Natural History's
Center for Biodiversity and Conservation
[email protected]
Example classification map legend: Blue = water, Green = forest, Yellow = shrub, Brown = non-forest, Gray = cloud/shadow
Regression trees
Regression calculates the relationship between predictor and response variables
Structure is similar to a classification tree
Terminal nodes hold the predicted function (model) values
Predicted values are limited to the values in the terminal nodes
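The last point above can be demonstrated directly. The slides don't include code; this is a sketch using Python's scikit-learn (an analog of the R tools the slides describe) showing that a regression tree can only emit the values stored in its terminal nodes:

```python
# Sketch (not from the slides): a regression tree's predictions can only
# take the values stored in its terminal nodes -- here, the leaf means of y.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.5, size=200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Every prediction equals one of the leaf values, so the set of distinct
# predictions is small even though y itself is continuous.
preds = tree.predict(X)
leaf_values = np.unique(preds)
print(len(leaf_values))  # at most 2**3 = 8 distinct values at depth 3
```

So no matter how fine-grained the inputs are, the output is a step function over the leaf values.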
Packages in R
An ensemble classifier using many decision tree models
Can be used for classification or regression
Accuracy and variable importance information is provided with the results
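The slides refer to R's randomForest package; the following is an analogous sketch using Python's scikit-learn, showing the two outputs mentioned above, an accuracy estimate (here from out-of-bag samples) and per-variable importance scores:

```python
# Sketch: analogous to R's randomForest(), using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)

print(rf.oob_score_)            # out-of-bag accuracy estimate
print(rf.feature_importances_)  # one importance score per predictor
```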
A randomly selected subset of variables is used to split each node
The number of variables used is set by the user (the mtry parameter in R)
A smaller subset produces less correlation between trees (lower error rate) but lower predictive power per tree (higher error rate)
The optimum range of mtry values is often quite wide
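The trade-off above can be explored by comparing out-of-bag error across settings. In scikit-learn (used here as an analog; the slides name R's mtry) the equivalent parameter is max_features:

```python
# Sketch: max_features in scikit-learn plays the role of mtry in R --
# the number of randomly chosen predictors tried at each split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

errors = {}
for m in (1, 5, 15, X.shape[1]):  # from tiny subsets up to all 30 predictors
    rf = RandomForestClassifier(n_estimators=100, max_features=m,
                                oob_score=True, random_state=0).fit(X, y)
    errors[m] = 1 - rf.oob_score_  # OOB error rate for this mtry
    print(m, round(errors[m], 3))
```

On most datasets the printed error rates stay close over a broad middle range of settings, which is the "optimum range is often quite wide" point.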
Proximity measure
Proximity measures how frequently unique pairs of training samples (in and out of bag) end up in the same terminal node
Used to fill in missing data and to calculate outliers
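As a sketch of the definition above: scikit-learn (used here as an analog to R's randomForest, which returns proximities directly) does not expose a proximity matrix, but one can be built from apply(), which reports the terminal node each sample lands in for every tree:

```python
# Sketch: proximity(i, j) = fraction of trees in which samples i and j
# end up in the same terminal node.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves = rf.apply(X)  # leaf index per sample per tree: (n_samples, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Diagonal is 1 (a sample always shares its own leaf); off-diagonal values
# near 1 mark similar samples, values near 0 mark dissimilar ones.
print(prox.shape, prox[0, 0])
```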
Outliers for classification
Classification accuracy
Variable importance
Outliers (classification)
Missing data estimation
Error rates for random forest objects
Regression can't predict beyond the range of the training data
In regression, extreme values are often not predicted accurately: the model underestimates highs and overestimates lows
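This limitation follows from how the forest predicts: each prediction is an average of training-set response values in terminal nodes, so it can never fall outside the range of y seen in training. A quick sketch using scikit-learn (an analog; not the slides' own code):

```python
# Sketch: a random forest regressor cannot extrapolate beyond the
# training range of the response variable.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 3.0 * X.ravel()  # y spans roughly 0..30 in the training data

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Ask for a prediction far beyond the training range of x.
pred = rf.predict([[100.0]])[0]
print(pred)  # capped near max(y), nowhere near the true value of 300
```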
Classification
Land cover classification
Cloud/shadow screening
Regression
Continuous fields (percent cover) mapping
Biomass mapping
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox
http://en.wikipedia.org/wiki/Random_forest