An introduction to random forests
Eric Debreuve / Team Morpheme
Institutions: University Nice Sophia Antipolis / CNRS / Inria
Labs: I3S / Inria CRI SA-M / iBV
Outline
Machine learning
Decision tree
Random forest
Bagging
Random decision trees
Machine learning
Learning/training: build a classification or regression rule from a set of samples
[Diagram: samples (learning set) → machine learning algorithm → learned rule]
Prediction: class/category = rule(sample), or value = rule(sample)
Output: predicted class or predicted value
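A minimal sketch of this learn-then-predict pipeline, using scikit-learn as an illustration (any classifier exposing fit/predict would do; the synthetic data are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Learning set: samples X with their classes y (supervised case)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)

rule = DecisionTreeClassifier(random_state=0).fit(X, y)  # learning/training: build the rule
predicted_class = rule.predict(X[:1])                    # class/category = rule(sample)
print(predicted_class)
```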
(Un)Supervised learning
Supervised
Learning set = { (sample [acquisition], class [expert]) }
Unsupervised
Learning set = unlabeled samples
Semi-supervised
Learning set = some labeled samples + many unlabeled samples
Ensemble learning
Combining weak classifiers (of the same type)...
... in order to produce a strong classifier
Condition: diversity among the weak classifiers
Example: Boosting
Train each new weak classifier focusing on samples misclassified by
previous ones
Popular implementation: AdaBoost
Weak classifiers: only need to be better than random guess
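A hedged sketch of boosting with AdaBoost in scikit-learn (data and parameter values are illustrative). The default weak learner is a depth-1 decision tree ("stump"), i.e. barely better than a random guess; AdaBoost re-weights the learning samples so that each new weak classifier focuses on those misclassified by the previous ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

strong = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(strong.score(X, y))  # accuracy of the combined (strong) classifier on the learning set
```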
Outline
Machine learning
Decision tree
Random forest
Bagging
Random decision trees
Decision tree
[Diagram: a decision tree with a root node, internal question nodes Q1, Q2, Q3, and leaf/decision nodes D1 to D5]
Leaf nodes
Correspond to the decision to take (or conclusion to make) if reached
Normally, pruning
To avoid over-fitting of learning data
To achieve a trade-off between prediction accuracy and complexity
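As an illustration of this trade-off, a hedged sketch using scikit-learn's cost-complexity pruning (one possible pruning scheme; the ccp_alpha values are arbitrary): larger pruning strength gives a simpler tree, at some cost in accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in (0.0, 0.01, 0.05):  # arbitrary pruning strengths
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    print(alpha, tree.get_n_leaves(), tree.score(X_test, y_test))  # complexity vs. accuracy
```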
Examples of impurity measures (node with class proportions $p_k$):
Gini index = $\sum_k p_k (1 - p_k) = 1 - \sum_k p_k^2$
Entropy = $-\sum_k p_k \log_2 p_k$
Misclassification error = $1 - \max_k p_k$
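A small sketch of these three measures, for a node whose samples have class proportions p_k (notation and example values assumed):

```python
import numpy as np

def gini(p):                 # sum_k p_k (1 - p_k) = 1 - sum_k p_k^2
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):              # - sum_k p_k log2(p_k)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def misclassification(p):    # 1 - max_k p_k
    return 1.0 - np.max(p)

p = [0.7, 0.2, 0.1]          # example class proportions at a node
print(gini(p), entropy(p), misclassification(p))
```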
[Table: qualitative comparison of classifiers (SVM column visible; other columns and per-method ratings not recoverable) on: intrinsically multiclass, handles heterogeneous ("apple and orange") features, robustness to outliers, works with "small" learning sets, scalability to large learning sets, prediction accuracy, parameter tuning]
Outline
Machine learning
Decision tree
Random forest
Bagging
Random decision trees
Random forest
Definition
Collection of unpruned CARTs
Rule to combine individual tree decisions
Purpose
Improve prediction accuracy
Principle
Encouraging diversity among the trees
Solution: randomness
Bagging
Random decision trees (rCART)
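A hedged sketch of this definition with scikit-learn (parameter values are illustrative): a collection of randomized, unpruned trees, each grown on a bootstrap sample set and restricted to a random feature subset at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of (unpruned) randomized CARTs
    max_features="sqrt",   # random subset of features considered at each node
    bootstrap=True,        # each tree is trained on a bootstrap sample set (bagging)
    random_state=0,
).fit(X, y)
print(forest.score(X, y))
```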
Bagging (Bootstrap AGGregatING): two steps
Bootstrap sample sets
Aggregation
Random forest: q = p
Asymptotic proportion of unique samples in Lk = 100 × (1 − 1/e) % ≈ 63%
The remaining samples can be used for testing
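A quick numerical check of the ~63% figure (a sketch; the sample-set size N is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
bootstrap = rng.integers(0, N, size=N)        # draw N indices with replacement
unique_fraction = np.unique(bootstrap).size / N
print(unique_fraction, 1.0 - 1.0 / np.e)      # both close to 0.632
```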
Prediction
S: a new sample
Aggregation = majority vote among the K predictions/votes Ck(S)
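A minimal sketch of this aggregation, assuming `trees` is a list of K fitted classifiers exposing a `predict` method (names are illustrative):

```python
from collections import Counter

def forest_predict(trees, S):
    votes = [tree.predict([S])[0] for tree in trees]  # C_k(S), k = 1..K
    return Counter(votes).most_common(1)[0][0]        # majority vote
```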
Find the feature F among a random subset of features + threshold value T...
... that splits the samples assigned to N into 2 subsets Sleft and Sright...
... so as to maximize the label purity within these subsets
Assign (F,T) to N
If Sleft (resp. Sright) is pure or too small: make it a leaf node
else: repeat the split search on the corresponding child node (see the sketch below)
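A hedged sketch of this split search, using the Gini impurity as the purity criterion (function names, the exhaustive threshold scan, and the NumPy-array inputs are assumptions):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_random_split(X, y, n_random_features, rng):
    """Return the (feature F, threshold T) pair minimizing the weighted
    impurity of (Sleft, Sright) among a random subset of features."""
    best_f, best_t, best_impurity = None, None, np.inf
    for f in rng.choice(X.shape[1], size=n_random_features, replace=False):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if left.size == 0 or right.size == 0:
                continue
            impurity = (left.size * gini(left) + right.size * gini(right)) / y.size
            if impurity < best_impurity:
                best_f, best_t, best_impurity = f, t, impurity
    return best_f, best_t   # (F, T) to assign to node N
```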
[Table: qualitative comparison of CART, kNN, and SVM on: intrinsically multiclass, handles heterogeneous ("apple and orange") features, robustness to outliers, works with "small" learning sets, scalability to large learning sets, prediction accuracy, parameter tuning; the per-method ratings are not recoverable]
[Figure: example classification results with 1, 10, 100, and 500 rCARTs]
Fundamentally discrete
Functional data? (Example: curves)
Outline
Machine learning
Decision tree
Random forest
Bagging
Random decision trees
Kernel-induced features
Learning set L = { Si, i ∈ [1..N] }
Kernel K(x, y)
Features of sample S = { Ki(S) = K(Si, S), i ∈ [1..N] }
Samples S and Si can be vectors or functional data
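A hedged sketch of these kernel-induced features with a Gaussian kernel, then fed to a random forest (the kernel choice, gamma value, and synthetic data are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                  # learning samples S_i (rows); could be sampled curves
y = (X[:, :15].sum(axis=1) > 0).astype(int)     # illustrative labels

K_features = rbf_kernel(X, X, gamma=0.05)       # row i: { K(S_i, S_j), j = 1..N }
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(K_features, y)

S_new = rng.normal(size=(1, 30))                # a new sample S
print(forest.predict(rbf_kernel(S_new, X, gamma=0.05)))   # features K(S_i, S), i = 1..N
```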
Kernel K(x, y)
Symmetric: K(x, y) = K(y, x)
Positive semi-definite (Mercer's condition): $\sum_{i,j} c_i c_j K(x_i, x_j) \ge 0$ for any finite set of points $\{x_i\}$ and real coefficients $\{c_i\}$
Then K(x, y) = ⟨φ(x), φ(y)⟩ for some mapping φ to a feature space
Note: the mapping φ need not be known (it might not even have an explicit representation; e.g., Gaussian kernel)
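A small numerical illustration (a sketch; the Gaussian kernel and gamma value are assumptions): on any finite sample set, the kernel matrix should be symmetric and positive semi-definite.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
K = rbf_kernel(X, X, gamma=0.1)                 # Gaussian kernel matrix K(x_i, x_j)

print(np.allclose(K, K.T))                      # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # positive semi-definite (up to numerical precision)
```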
Outline
Machine learning
Decision tree
Random forest
Bagging
Random decision trees
An introduction to random forests
Thank you for your attention