
Introduction to decision trees and random forests

Ned Horning
American Museum of Natural History's
Center for Biodiversity and Conservation
[email protected]

What are decision trees?

A predictive model that applies a set of binary rules to calculate a target value
Can be used for classification (categorical variables) or regression (continuous variables) applications
Rules are developed using software available in many statistics packages
Different algorithms are used to determine the best split at a node

Example classification tree
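The example tree itself is a figure in the original slide. As a stand-in illustration, a small classification tree can be grown in R with the rpart package introduced later in this presentation; the iris data set and settings below are illustrative, not taken from the slide:

library(rpart)

# Grow a classification tree: predict species from four flower measurements
fit <- rpart(Species ~ ., data = iris, method = "class")

# Print the binary split rules; each terminal node is assigned a class
print(fit)

# Draw the tree structure with split labels and sample counts
plot(fit, margin = 0.1)
text(fit, use.n = TRUE)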

How do classification trees work?

Uses training data to build the model
The tree generator determines:
- which variable to split at a node and the value of the split
- whether to stop (make a terminal node) or split again
- how to assign terminal nodes to a class

Dividing feature space: recursive partitioning
Blue = water
Green = forest
Yellow = shrub
Brown = non-forest
Gray = cloud/shadow

Dividing feature space: recursive partitioning

A constant (class or predicted function value) is assigned to each rectangle


Editing (pruning) the tree

Overfitting is common since individual pixels can end up as terminal nodes
Classification trees can have hundreds or thousands of nodes, which need to be reduced by pruning
Pruning involves removing nodes to simplify the tree
Parameters such as minimum node size and maximum standard deviation of samples at a node can restrict tree size (see the sketch below)
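As a rough sketch of how such controls look in R's rpart package (the control values and pruning threshold below are illustrative, not taken from the slide):

library(rpart)

# Grow a deliberately large tree by relaxing the stopping rules
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 2, cp = 0.0))

# Inspect cross-validated error for each candidate subtree size
printcp(fit)

# Prune back to a simpler tree using a larger complexity parameter
pruned <- prune(fit, cp = 0.02)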

Regression trees

Regression calculates the relationship between predictor and response variables
Structure is similar to a classification tree
Terminal nodes hold predicted function (model) values
Predicted values are limited to the values in the terminal nodes
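A minimal regression tree sketch in R, assuming rpart and the built-in mtcars data (both illustrative), showing that predictions are limited to the values stored in the terminal nodes:

library(rpart)

# Regression tree: predict fuel economy from the other variables
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Predictions take only as many distinct values as there are terminal nodes
preds <- predict(fit)
length(unique(preds))   # equals the number of leaves in the tree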

Decision tree advantages

Easy to interpret the decision rules
Nonparametric, so it is easy to incorporate a range of numeric or categorical data layers and there is no need to select unimodal training data
Robust with regard to outliers in the training data
Classification is fast once the rules are developed

Drawbacks of decision trees

Decision trees tend to overfit the training data, which can give poor results when applied to the full data set
Splitting perpendicular to the feature space axes is not always efficient
Not possible to predict beyond the minimum and maximum limits of the response variable in the training data

Packages in R

tree: the original decision tree package
rpart: a newer and more actively maintained package (both are compared in the short sketch below)
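Both packages fit a tree with a one-line call; a quick, hedged comparison using the iris data purely for illustration:

library(tree)
library(rpart)

# The same classification model specified with each package
fit.tree  <- tree(Species ~ ., data = iris)
fit.rpart <- rpart(Species ~ ., data = iris, method = "class")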

What are ensemble models?

Combines the results from different models
The models can be of a similar type or of different types
The result from an ensemble model is usually better than the result from any one of the individual models

What is random forests?

An ensemble classifier that uses many decision tree models
Can be used for classification or regression
Accuracy and variable importance information is provided with the results

How random forests work

A different subset of the training data (~2/3) is selected, with replacement, to train each tree
The remaining training data (out-of-bag, OOB) are used to estimate error and variable importance
Class assignment is made by the number of votes from all of the trees; for regression, the average of the tree results is used (see the example below)
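A minimal example with the randomForest package; the iris data and the seed are illustrative. The printed output includes the OOB error estimate and, for classification, a confusion matrix:

library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# OOB (out-of-bag) error estimate and confusion matrix
print(rf)
rf$confusion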

Use a subset of variables

A randomly selected subset of variables is used to split each node
The number of variables used is decided by the user (the mtry parameter in R)
A smaller subset produces less correlation between trees (lower error rate) but lower predictive power for each tree (higher error rate)
The optimum range of values is often quite wide (see the tuning sketch below)
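One way to explore mtry is the tuneRF() helper in the randomForest package, which steps mtry up and down from a starting value and reports the OOB error at each setting; the values below are illustrative:

library(randomForest)

set.seed(42)
# Search for an mtry value with low OOB error, starting from the default
tuneRF(x = iris[, 1:4], y = iris$Species,
       ntreeTry = 200, stepFactor = 1.5, improve = 0.01)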

Common variables for random forests

Input data (predictor and response)
Number of trees
Number of variables to use at each split
Options to calculate error and variable importance information
Sampling with or without replacement

randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
             mtry=if (!is.null(y) && !is.factor(y))
                  max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             replace=TRUE, classwt=NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             importance=FALSE, localImp=FALSE, nPerm=1,
             proximity, oob.prox=proximity,
             norm.votes=TRUE, do.trace=FALSE,
             keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
             keep.inbag=FALSE, ...)
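Most of these arguments can be left at their defaults; a typical call might look like the following, with the iris data and argument values used purely for illustration:

library(randomForest)

rf <- randomForest(x = iris[, 1:4], y = iris$Species,
                   ntree = 500,         # number of trees
                   mtry = 2,            # variables tried at each split
                   replace = TRUE,      # sample with replacement
                   importance = TRUE,   # compute variable importance
                   proximity = TRUE)    # compute the proximity matrix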

Proximity measure

Proximity measures how frequently unique pairs of training samples (in and out of bag) end up in the same terminal node
Used to fill in missing data and to calculate outliers
(Figure: outlier measure for a classification example)
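A short sketch of working with the proximity matrix; outlier() and rfImpute() are functions in the randomForest package, and the data used here are illustrative:

library(randomForest)

rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)

# N x N matrix of how often each pair of samples shares a terminal node
dim(rf$proximity)

# Outlying measure per sample, computed from the proximity matrix
out <- outlier(rf$proximity, cls = iris$Species)

# Missing predictor values can be filled in with rfImpute(), e.g.
# (iris.with.NAs is a hypothetical data frame containing missing values):
# iris.filled <- rfImpute(Species ~ ., data = iris.with.NAs)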

Information from Random Forests

Classification accuracy
Variable importance
Outliers (classification)
Missing data estimation
Error rates for random forest objects
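Each of these outputs can be pulled from a fitted randomForest object; a brief sketch, assuming the model was fit with importance = TRUE on illustrative data:

library(randomForest)

rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

importance(rf)      # variable importance measures
varImpPlot(rf)      # importance plotted per variable
rf$confusion        # classification accuracy (OOB confusion matrix)
head(rf$err.rate)   # OOB error rate as trees are added
plot(rf)            # error rate vs. number of trees (next slide)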

Error rate vs. number of trees

Advantages of random forests

No need for pruning trees
Accuracy and variable importance are generated automatically
Overfitting is not a problem
Not very sensitive to outliers in training data
Easy to set parameters

Limitations of random forests

Regression can't predict beyond the range of the response variable in the training data
In regression, extreme values are often not predicted accurately: the model underestimates highs and overestimates lows

Common remote sensing applications of random forests

Classification
- Land cover classification
- Cloud/shadow screening

Regression
- Continuous fields (percent cover) mapping
- Biomass mapping

Resources to learn more about random forests

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox
http://en.wikipedia.org/wiki/Random_forest
The randomForest package (for R) description
