DECISION TREE & TREE
ENSEMBLE
BY ANTHONY ANH QUOC DOAN
BACKGROUND &
QUALIFICATIONS
• Full stack programmer with over 10 years of professional experience
• Statistician/Data Scientist consultant with 2 years of professional experience
• Interned at the FDA and NASA JPL
• BS in Computer Science & an MS in Applied Statistics
MOTIVATION FOR TREE
ENSEMBLE (FOREST)
• Random Forest (RF) is an example of a tree ensemble, and RF is a great statistical learning model.
• It works well with small to medium-sized data, unlike a Neural Network, which requires a large amount of data to train. (1)
• It is not the black box everybody tells you it is. It is explainable (more so than a neural network).
• A great tool to have under your belt
NOT ANOTHER RANDOM
FOREST TALK...
• Mine is different. I will show you how data partitioning actually works for the binary tree structure.
• I will tie it to computer science and statistics.
• I will show you it's not a black box.
ROADMAP
SECTIONS
1. CART
2. Bagging
3. Random Forest
4. References
1. CART (CLASSIFICATION AND REGRESSION TREE)
CART
SECTION OVERVIEW
1. History
2. Quick Example
3. CART Overview
4. Recursive Data Partition (CART)
5. Classifying with Terminal Nodes
6. Basis Function
7. Node Splitting
8. Tree Pruning
9. Regression Tree
CLASSIFICATION TREE -
(BINARY TREE DATA STRUCTURE)
CLASSIFICATION TREE - QUICK
EXAMPLE
• We have several predictors (x's) and one response
(y, the thing we're trying to predict)
• Let's do cancer with a simple model.
1. response Y is either 0 or 1: whether the person has cancer or not
2. predictor - age (value range: 0 to 99 years old)
HOW CART (DATA PARTITION
ALGORITHM) WORKS
1. For each predictor, all possible splits of the
predictor values are considered
2. For each predictor, the best split is selected.
(we'll get to best split criteria)
3. With the best split of each predictor determined, we pick the best predictor (and its split) in that group.

CART Uses Binary Tree Data Structure
CART - DATA PARTITION
EXAMPLES
• Examples of step 1:
• marital status (categorical data: never married, married, and divorced)
• never married vs [married and divorced]
• married vs [never married and divorced]
• divorced vs [never married and married]
• age (value range: 21 to 24)
• So the possible splits (maintaining order) are as follows (see the R sketch after this list):
• 21 vs 22-24
• 21-22 vs 23-24
• 21-23 vs 24
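
The enumeration above can be sketched in a few lines of R; this toy example (the ages and outcomes are made up, and the Gini scoring is only illustrative, not the rpart internals) scores each order-preserving split of the age predictor:

# toy data: age and a 0/1 response (hypothetical values)
age <- c(21, 22, 23, 24)
y   <- c(0, 0, 1, 1)

gini <- function(p) p * (1 - p)   # binary Gini impurity, as defined later

# candidate cut points preserve the order: 21 | 22-24, 21-22 | 23-24, 21-23 | 24
for (cut in c(21.5, 22.5, 23.5)) {
  left  <- y[age <= cut]
  right <- y[age >  cut]
  impurity <- length(left)  / length(y) * gini(mean(left)) +
              length(right) / length(y) * gini(mean(right))
  cat("split at", cut, "-> weighted impurity:", impurity, "\n")
}
# the split with the lowest weighted impurity (here 22.5) is the best split for age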

VISUAL EXAMPLE
• The best way to illustrate this is to add another predictor to our model.
• We're going to add exercise per week (hours)
• recap of our response:
• y (0 - no cancer or 1 - cancer)
• our predictors:
• age (value range: 0 to 99 years)
• exercise per week (value: 0 to 168 hours)
Note here we can either partition the data
horizontally or vertically. We're choosing the
best split/partition.
REPEAT DATA PARTITION ALGORITHM,
CART, AGAIN FOR THE PARTITIONED DATA
RECURSIVELY...
R - CODE
flowers_data <- iris
# keep only two species so we get a simple two-class tree
flowers_data <- flowers_data[!(flowers_data$Species == 'virginica'),]

library(rpart)
# fit a classification tree
tree.model <- rpart(Species ~ . , data = flowers_data)

library(party)
library(partykit)
# convert to a party object for a nicer plot
tree.model.party <- as.party(tree.model)
plot(tree.model.party)
R - CODE
flowers_data <- iris   # all three species this time

library(rpart)
tree.model <- rpart(Species ~ . , data = flowers_data)

library(party)
library(partykit)
tree.model.party <- as.party(tree.model)
plot(tree.model.party)
BASIS FUNCTIONS
• X is a predictor
• M transformations of X
• β_m is the weight given to the mth transformation (the coefficient)
• h_m is the m-th transformation of X
• f(x) is the linear combination of transformed values of X
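
The formula these bullets describe appears only as an image in the original deck; a standard way to write a basis expansion with a single predictor is

f(X) = β_0 + Σ_{m=1}^{M} β_m h_m(X)

i.e., each transformation h_m(X) enters the model like a constructed predictor with its own coefficient β_m.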
BASIS FUNCTION EXAMPLE
THE BASIS EXAMPLE WE CARE ABOUT
THIS IS CART BASIS FUNCTION
The new summation is for multiple predictors. P = total
number of predictors.
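
The equation on this slide is also an image in the original; a standard multi-predictor form (following Berk) is

f(X) = β_0 + Σ_{j=1}^{P} Σ_{m=1}^{M_j} β_{jm} h_{jm}(X_j)

and for CART the transformations h end up being indicator functions of the regions (terminal nodes) produced by the splits, so f is a sum of region-wise constants.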
NODE SPLITTING
QUESTION: HOW IS A SPLIT DETERMINED?
• The goal is to split/partition the data until each node is homogeneous, i.e. has as little "impurity" as possible (mostly the outcome we want in a given node, and few of the other outcome).
HOW IS NODE IMPURITY
CALCULATED?
NODE IMPURITY - SETUP
• Our data to train is a random sample from a well
defined population
• given a node, node A
• p(y = 1 | A)
• impurity of node A is the probability that y = 1
given it is node A.
IMPURITY FUNCTION
• I(A) represent Impurity function that takes in a node
as our parameter.
• The restriction tells us that the impurity function is
nonnegative, symmetric when A contains all 0s or 1s,
and a maximum of half of each (coin toss).
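
Written out (the slide's equation is an image in the original), the impurity is a function Φ of that conditional probability, roughly

I(A) = Φ( p(y = 1 | A) ),  with Φ(p) ≥ 0, Φ(p) = Φ(1 − p), Φ(0) = Φ(1) = 0 (pure node), and a maximum at p = 1/2.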
IMPURITY - DEFINING Φ (PHI)
• There are several Φ functions that people use.
• The most commonly used is the Gini Index: Φ(p) = p (1-p)
• Others
• Bayes Error: Φ(p) = min(p, 1-p)
• Cross Entropy Function: Φ(p) = -p log(p) - (1-p) log(1-p) 
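
A quick R sketch (not from the original slides) comparing the three Φ functions; all are 0 in a pure node and peak at p = 0.5:

p <- seq(0.01, 0.99, by = 0.01)
gini    <- p * (1 - p)
bayes   <- pmin(p, 1 - p)
entropy <- -p * log(p) - (1 - p) * log(1 - p)

matplot(p, cbind(gini, bayes, entropy), type = "l", lty = 1,
        xlab = "p(y = 1 | A)", ylab = "impurity")
legend("topright", c("Gini", "Bayes error", "Cross entropy"), col = 1:3, lty = 1)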





GINI INDEX
• p(i|t) denotes the fraction of records belonging to class i at a given node t
• c is the number of classes (example: cancer, no
cancer)
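
The formula itself is an image in the original deck; the usual multi-class Gini index (as in Tan's Introduction to Data Mining) is

Gini(t) = 1 − Σ_{i=1}^{c} p(i|t)^2

which for two classes reduces to 2p(1 − p), a scaled version of the Φ(p) = p(1 − p) given above.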
REGRESSION TREE - SPLITTING
NODE
• The only difference is the impurity function of the node
• Which is just a within-node sum of squares for the response
• That is, the sum over all cases in node τ of the squared difference between each response and the mean of those cases.
• This is SSTO (the total sum of squares) in Linear Regression
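
In symbols (the slide's equation is an image in the original), the within-node sum of squares for node τ is roughly

I(τ) = Σ_{i ∈ τ} (y_i − ȳ(τ))^2

where ȳ(τ) is the mean response over the cases that fall in node τ.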
REGRESSION TREE EXAMPLE
Warning:
We're going to apply it to classification data as an example of how to use the equation, because I don't believe I have time here to go over linear regression and do a proper example.
• ȳ = (1/n) * sum(y_i)
• = (1/6) * (0+0+0+0+1+1)
• = 2/6
• = 1/3
• y_i represents one data point in the partition (0,0,0,0,1,1)
2. BAGGING (BOOTSTRAP AGGREGATION)
BAGGING
SECTION OVERVIEW
1. Benefit
2. Bootstrap Algorithm
3. Bootstrap example
4. Bagging Algorithm
5. Flaws
6. Why does this work?
WHY? BENEFITS.
• Reduces overfitting
• Reduces variance (individual trees' errors average out)
• Helps get around the bias-variance trade-off
BOOTSTRAP
• Before we dive into the bagging algorithm, we should go over the bootstrap
• The bootstrap is a statistical resampling method.
• Used when you can't afford to get more samples (think medical data, a poor grad student, cost, etc.)
BOOTSTRAP ALGORITHM
1. Used in statistics when you want to estimate a statistic of a random sample (a statistic: mean, variance, mode, etc.)
2. Using the bootstrap we diverge from traditional parametric statistics: we do not assume a distribution for the random sample
3. What we do is sample our only data set (aka the random sample) with replacement. We draw up to the number of observations in our original data.
4. We repeat step 3 a large number of times, B times. Once done we have B bootstrap random samples.
5. We then take the statistic of each bootstrap random sample and average them
BOOTSTRAP EXAMPLE
• Original data (random sample):
• {1,2,3,1,2,1,1,1} (n = 8)
• Bootstrap random sample data:
• {1,2,1,2,1,1,3} (n = 8); mean = 1.375
• {1,1,2,2,3,1,1} (n = 8); mean = 1.375
• {1,2,1,2,1,1,2} (n = 8); mean = 1.25
• The estimated mean for our original data is the mean of the statistic for
each bootstrap sample (1.375+1.375+1.25)/3 = ~1.3333
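
A minimal R sketch of the procedure on this example data (B and the resamples you get are illustrative; set.seed is only for reproducibility):

set.seed(42)
x <- c(1, 2, 3, 1, 2, 1, 1, 1)   # original data (n = 8)
B <- 1000                        # number of bootstrap samples

# draw B bootstrap samples of size n with replacement and take the mean of each
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
mean(boot_means)                 # average of the B bootstrap means: the bootstrap estimate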



BAGGING (BOOTSTRAP
AGGREGATION) ALGORITHM
1. Take a random sample of size N with replacement
from the data
2. Construct a classification tree as usual but do not
prune
3. Assign a class to each terminal node, and store the
class attached to each case coupled with the
predictor values for each observation
BAGGING ALGORITHM
4. Repeat Steps 1-3 a large number of times.
5. For each observation in the dataset, count the number of times over trees that it is classified in one category and the number of times over trees it is classified in the other category
6. Assign each observation to a final category by a majority vote over the set of trees. Thus, if 51% of the time over a large number of trees a given observation is classified as a "1", that becomes its classification.
7. Construct the confusion table from these class assignments.
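
A minimal R sketch of steps 1-7 using rpart (the loop, B = 101, and the two-class iris subset are illustrative assumptions, not a production implementation):

library(rpart)

set.seed(1)
dat <- iris[iris$Species != "virginica", ]   # two-class data, as in the earlier CART example
dat$Species <- droplevels(dat$Species)
B <- 101                                     # number of bagged trees (odd, to avoid ties)

votes <- matrix(NA, nrow = nrow(dat), ncol = B)
for (b in 1:B) {
  idx  <- sample(nrow(dat), replace = TRUE)                         # step 1: bootstrap sample of size N
  tree <- rpart(Species ~ ., data = dat[idx, ],
                control = rpart.control(cp = 0, minsplit = 2))      # step 2: grow fully, do not prune
  votes[, b] <- as.character(predict(tree, dat, type = "class"))    # step 3: store each case's class
}

# steps 5-6: majority vote over the B trees for each observation
bagged_class <- apply(votes, 1, function(v) names(which.max(table(v))))
table(bagged_class, dat$Species)             # step 7: confusion table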
FLAWS
• The problem with the bagging algorithm is that it uses CART.
• CART greedily searches for the best split using the Gini index.
• So we end up with trees that are structurally similar to each other, and their predictions are highly correlated.
• Random Forest addresses this.
WHY DOES THIS WORK?
• Outside the scope of this talk.
3. RANDOM FOREST
RANDOM FOREST
SECTION OVERVIEW
1. Problem RF solves
2. Building Random Forest Algorithm
3. Breaking Down the Random Forest Algorithm Part by Part
4. How to use Random Forest to Predict
5. R Code
1. PROBLEM RANDOM FOREST
IS TRYING TO SOLVE
BAGGING PROBLEM
With bagging we have an ensemble of
structurally similar trees. This causes
highly correlated trees. 



RANDOM FOREST SOLUTION
Create trees that have no correlation or weak
correlation.


2. BUILDING RANDOM FOREST
ALGORITHM
RANDOM FOREST ALGORITHM
1. Take a random sample of size N with
replacement from the data. 

2. Take a random sample without replacement of
the predictors. 

3. Construct the first CART partition of the data. 



BUILDING RANDOM FOREST
ALGORITHM
4. Repeat Step 2 for each subsequent
split until the tree is as large as desired.
Do not prune.
5. Repeat Steps 1–4 a large number of
times (e.g., 500). 

3. BREAKING DOWN THE RANDOM FOREST ALGORITHM PART BY PART
1. TAKE A RANDOM SAMPLE OF SIZE
N WITH REPLACEMENT FROM DATA
• This is just bootstrapping on our data (recall
bagging section)
2. TAKE A RANDOM SAMPLE WITHOUT
REPLACEMENT OF THE PREDICTORS
• Predictor sampling / bootstrapping
• Notice this samples our predictors, and it's without replacement.
• This is Random Forest's solution to the highly correlated trees that arise from the bagging algorithm.
Question (for step 2):
What is the maximum number of predictors/features to sample?
Answer:
Sample 2 to 3 predictors/features
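
For reference (not from the original slides): a common rule of thumb is to sample about √p predictors per split for classification and p/3 for regression, which is what the randomForest R package defaults to via its mtry argument:

library(randomForest)
# explicitly set how many predictors are sampled at each split (iris has 4 predictors, so mtry = 2 ≈ √4)
rf.model <- randomForest(Species ~ ., data = iris, mtry = 2)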
3. CONSTRUCT THE FIRST CART
PARTITION OF DATA
• We partition our first bootstrap sample and use the Gini index on the predictors sampled in step 2 to decide the split.
4. REPEAT STEP 2 FOR EACH SUBSEQUENT
SPLIT UNTIL THE TREE IS AS LARGE AS
DESIRED. DO NOT PRUNE. 
• Self-explanatory.
5. REPEAT STEPS 1–4 A LARGE NUMBER OF
TIMES (E.G., 500). 

 
• Steps 1 to 4 build one tree.
• You repeat steps 1 to 4 a large number of times to build a forest.
• There's no magic "large number." You can build 101, 201, 501, 1001, etc. There are research papers that suggest certain numbers, but it depends on your data, so just check model performance via model evaluation using cross-validation.
• I have no idea why the suggested numbers are usually even, but I choose an odd number of trees in case of ties.
4. HOW TO USE RANDOM
FOREST TO PREDICT
• You have an observation that you want to predict
• Say the observation is x = male, z = 23 years old
• You plug that into your random forest.
• Random Forest takes those predictors and gives them to each decision tree.
• Each decision tree gives one prediction, cancer or no cancer (0 or 1).
• You take all of those predictions (aka votes) and take the majority.
• This is why I suggest an odd number of trees, to break ties for binary responses.
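
A tiny R illustration of that majority vote (the per-tree votes here are made up):

votes <- c(1, 0, 1, 1, 0, 1, 1)             # one 0/1 prediction per tree (hypothetical)
as.numeric(names(which.max(table(votes))))  # majority class, here 1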
5. R CODE
set.seed(415)
library(randomForest)
iris_train <- iris[-1,]   # all rows except the first
iris_test <- iris[1,]     # hold out the first row as a test case
rf.model <- randomForest(Species ~ ., data = iris_train)
predict(rf.model, iris_test)
REFERENCES
• Statistical Learning from a Regression Perspective by Richard A. Berk
• Introduction to Data Mining by Pang-Ning Tan (http://www-users.cs.umn.edu/~kumar/dmbook/index.php)
• 11.12 From bagging to forest (https://onlinecourses.science.psu.edu/stat857/node/181)
• "An Empirical Comparison of Supervised Learning Algorithms" - https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf
• Random Forest created and trademarked by Leo Breiman (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#workings)
• "Bagging and Random Forest Ensemble Algorithms for Machine Learning" by Jason Brownlee (http://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/)
