Random Forest
Matteo Sani
University of Florence
Contents
1 Classification Trees
2 Ensemble Methods
  2.1 Bagging
  2.2 Random Forest
1 Classification Trees
Classification trees are a supervised learning algorithm introduced by [2]; the general idea is to recursively split the covariate space into homogeneous, non-overlapping partitions, within each of which the prediction of the response is constant. Suppose we have a sample X_1, ..., X_n, where each sample point is a p-variate random variable X_{i1}, ..., X_{ip}, i = 1, ..., n. In the initial setting there exists only a single region R, equal to the whole covariate space χ. The goal is to find a splitting point, defined as a pair (j, s_1), where j = 1, ..., p indexes one of the p covariates and s_1 is one of its observed values in the sample, that partitions the covariate space into two non-overlapping regions. Mathematically speaking:
(j, s_1): \quad R_1 = \{(X_1, \dots, X_p) \in \chi : X_j \le s_1\}, \qquad R_2 = \{(X_1, \dots, X_p) \in \chi : X_j > s_1\}    (1)
The best splitting point is the one that provides the largest improvement of an impurity measure ϕ, defined as a measure of the degree of heterogeneity in the class distribution. Possible choices of ϕ are:
Gini Index
G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}) = 1 - \sum_{k=1}^{K} \hat{p}_{mk}^2    (2)
Entropy
D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}    (3)
where p̂_{mk} is the proportion of observations from the k-th class in the m-th region. Looking at the Gini index, we notice that it can be seen as a sum of Bernoulli variances over the K classes; its minimum is attained when the p̂_{mk}'s take extreme values, i.e. when the partition is mainly composed of observations belonging to a single class. Once the split has occurred, a proportion p_1 of the observations is sent to the first region and a proportion p_2 to the second, and the impurity measures of the two regions, ϕ_{R_1} and ϕ_{R_2}, are recorded. Denoting by ϕ_R the impurity of the parent region, the change in impurity is given by:
\Delta i = \phi_R - p_1 \phi_{R_1} - p_2 \phi_{R_2}    (4)
so the best splitting point is obtained via minimization:
\min_{j, s_1} \; p_1 \phi_{R_1} + p_2 \phi_{R_2}    (5)
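As an illustration, the following is a minimal Python sketch of this search over the candidate pairs (j, s_1), using the Gini index as impurity measure; the function names (gini, best_split) and the brute-force loop over all observed values are illustrative choices, not a prescribed implementation.

import numpy as np

def gini(y):
    """Gini index (equation 2) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def best_split(X, y):
    """Exhaustive search for the pair (j, s1) that minimises
    p1 * phi_R1 + p2 * phi_R2 (equation 5), with phi the Gini index."""
    p = X.shape[1]
    best_j, best_s1, best_value = None, None, np.inf
    for j in range(p):                          # loop over the p covariates
        for s1 in np.unique(X[:, j]):           # loop over the observed values of X_j
            in_R1 = X[:, j] <= s1               # region R1
            in_R2 = ~in_R1                      # region R2
            if in_R1.all() or in_R2.all():
                continue                        # both regions must be non-empty
            value = in_R1.mean() * gini(y[in_R1]) + in_R2.mean() * gini(y[in_R2])
            if value < best_value:
                best_j, best_s1, best_value = j, s1, value
    return best_j, best_s1

Library implementations perform this search far more efficiently (e.g. by sorting each covariate once), but the underlying logic is the same.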
The procedure is applied recursively to both of the partitions created until some termination criterion is satisfied; typically a minimum number of observations within each region (or leaf node), or a maximum depth of the tree, is required. This is done because a very deep tree, or a tree in which each leaf node contains only a few observations, is more likely to overfit the data, which harms generalization and leads to poor predictive performance. Once all the partitions have been defined, the response for a new observation is predicted as the most frequent class in the partition it falls into.
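The same machinery is available off the shelf; for instance, a hedged scikit-learn example (the synthetic dataset and the parameter values are arbitrary) in which max_depth and min_samples_leaf play the role of the termination criteria just mentioned:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Termination criteria: maximum depth and minimum number of observations per leaf.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              min_samples_leaf=10, random_state=0)
tree.fit(X, y)

# Prediction: the most frequent class of the leaf each observation falls into.
print(tree.predict(X[:5]))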
2 Ensemble Methods
Decision trees are very simple to apply and to communicate: they represent graphically how people make decisions! The cost of this simplicity is a lower predictive accuracy compared to other classification algorithms, since single trees are more likely to overfit the data. However, the performance can be drastically improved with ensemble methods, i.e. approaches that combine many weak learners, so called because each of them provides poor predictions on its own. Ensemble methods are not tied to decision trees: the concept is completely general. As a drawback, these methods are no longer easily interpretable and the decision path is much less transparent than that of a single tree. Furthermore, it is no longer possible to depict the role of the variables in explaining or causing the response; it is only possible to measure the impact of each predictor on the predictive process (variable importance). The following are two ensemble models which use decision trees as weak learners.
2.1 Bagging
Bagging, short for Bootstrap aggregation, is a technique which combines many classification trees in order to improve the prediction accuracy. The bagging algorithm proceeds as follows.
• Given a training set, B new sets are obtained via the Bootstrap, i.e. by resampling the given set with replacement.
• On each Bootstrap sample a very deep, unpruned classification tree is fitted: it will have low bias (good performance on the training set) and high variance.
• Let f̂^{*b}(x) denote the prediction for a new observation x from the b-th tree. The bagging prediction f̂_{bag}(x) is given by the majority vote rule:
\hat{f}_{bag}(x) = \arg\max_{k} \sum_{b=1}^{B} I\{\hat{f}^{*b}(x) = k\}    (6)
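A minimal sketch of the procedure, assuming scikit-learn's DecisionTreeClassifier as the single-tree learner and integer-coded class labels; the function names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=100, random_state=0):
    """Fit B deep, unpruned classification trees, each on a bootstrap sample."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # resampling with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X_new):
    """Majority vote over the B trees (equation 6); assumes classes coded as 0, 1, ..."""
    votes = np.stack([tree.predict(X_new) for tree in trees])   # shape (B, n_new)
    return np.array([np.bincount(col).argmax() for col in votes.T])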
Out Of Bag Test Error Estimation. Bagging also provides a good estimate of the test error. It can be shown that, on average, each tree is fitted on about 2/3 of the entire sample; the remaining 1/3 are called the Out Of Bag (OOB) observations. Given a single observation i, there are therefore about B/3 trees for which it is out of bag: using only those trees to predict it and taking the majority vote, its prediction error is computed. Repeating this for all the observations in the sample yields an overall error rate, which provides an estimate of the test error.
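For instance, scikit-learn's BaggingClassifier (whose default base estimator is a decision tree) exposes this OOB estimate directly; the dataset and the number of trees below are arbitrary.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Deep, unpruned decision trees fitted on bootstrap samples.
bag = BaggingClassifier(n_estimators=200, oob_score=True, random_state=0)
bag.fit(X, y)

# Accuracy computed on the out-of-bag observations only:
# an estimate of the test accuracy without a separate test set.
print("OOB accuracy estimate:", bag.oob_score_)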
2.2 Random Forest
Random Forest, introduced by [3], builds on bagging: each tree is still fitted on a bootstrap sample, but at each split only a randomly chosen subset of the p covariates is considered as candidate splitting variables. This decorrelates the trees and further improves the predictive accuracy. The idea behind this is that if there is a variable with a high impact on the predictions, the trees will most likely make their first split on that variable, so the tree structures will be similar; by randomly choosing a subset of variables we also give a chance to the less important variables!
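A hedged example with scikit-learn's RandomForestClassifier, where max_features controls the size of the random subset of covariates considered at each split and feature_importances_ returns the variable importance mentioned above (the dataset and parameter values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# max_features="sqrt": at each split only a random subset of sqrt(p) covariates
# is considered as candidate splitting variables, which decorrelates the trees.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)
# Impact of each predictor on the predictive process (variable importance).
print("Variable importances:", rf.feature_importances_)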
Techniques for dealing with class imbalance fall into three categories:
• Oversampling
• Undersampling
• Hybrid
A lot of literature and many techniques exist on the topic, but we will focus only on three of them, one for each category.
3.2 SMOTE
Introduced by [4], SMOTE is a data augmentation technique which generates new synthetic samples by resampling the minority class until the new training set is balanced. Oversampling techniques such as ROS (Random Oversampling) are often criticized because they can lead to overfitting. This is not the case for SMOTE: instead of randomly copying existing samples, it generates new ones by taking the nearest neighbours into account.
Taking one minority sample X, we compute its k nearest neighbours T_1, ..., T_k, where k is the number of neighbours and must be set a priori (generally 5). We then use only N ≤ k of them, where N depends on the amount of oversampling desired, and for each chosen neighbour T_i we compute the difference between its feature vector and the feature vector of the minority sample X. This difference is multiplied by a random number δ ∈ (0, 1) and added to the feature vector of X, so that the new synthetic samples are obtained as:

X_i^{new} = X + \delta (T_i - X), \qquad \delta \in (0, 1)
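A minimal Python sketch of this interpolation step (the function name smote_samples and the parameter n_per_sample, i.e. the N above, are illustrative; a full implementation is provided, for example, by the imbalanced-learn package):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_samples(X_min, k=5, n_per_sample=1, random_state=0):
    """Generate synthetic minority samples as X_new = X + delta * (T_i - X)."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbours because the nearest neighbour of each point is the point itself.
    neigh = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = neigh.kneighbors(X_min)
    synthetic = []
    for i, x in enumerate(X_min):
        # Use only N <= k of the neighbours, depending on the amount of oversampling desired.
        chosen = rng.choice(idx[i][1:], size=n_per_sample, replace=False)
        for j in chosen:
            delta = rng.uniform(0, 1)           # random number delta in (0, 1)
            synthetic.append(x + delta * (X_min[j] - x))
    return np.vstack(synthetic)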
The number of new synthetic samples depends on the degree of imbalance, and hence on the ratio between the number of minority and majority samples. For example, if the ratio between minority and majority samples is 0.5, we will consider all the minority samples and, for each of them, use one of the k neighbours to create one new synthetic sample as described above, so that the two classes end up balanced.
3.3 SMOTE + RUS
In the original paper it is shown that SMOTE provides better performance when combined with Random Undersampling (RUS). The idea is to first oversample the minority class until a desired ratio is achieved, and then undersample the majority class in order to balance the data. In this way the issues induced by applying RUS alone are mitigated.
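A hedged sketch of this two-step strategy, assuming the imbalanced-learn package is available; the sampling_strategy values (oversample the minority class up to half the size of the majority class, then undersample the majority class down to a 1:1 ratio) are arbitrary choices for illustration.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Imbalanced toy data: roughly 10% minority class.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Step 1: SMOTE the minority class up to a minority/majority ratio of 0.5.
X_sm, y_sm = SMOTE(sampling_strategy=0.5, k_neighbors=5,
                   random_state=0).fit_resample(X, y)

# Step 2: randomly undersample the majority class until the classes are balanced.
X_res, y_res = RandomUnderSampler(sampling_strategy=1.0,
                                  random_state=0).fit_resample(X_sm, y_sm)

print(Counter(y), "->", Counter(y_res))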
References
[1] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, vol. 112. Springer, 2013.
[2] G. Breiman Leo, L. Kamil A, G. V. Di Prisco, and W. J. Freeman, "Classification of EEG spatial patterns with a tree-structured methodology: CART," IEEE Transactions on Biomedical Engineering, no. 12, pp. 1076–1086, 1986.
[3] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[5] A. Fernández, S. Garcia, F. Herrera, and N. V. Chawla, "SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary," Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, 2018.
[6] X. Tan, S. Su, Z. Huang, X. Guo, Z. Zuo, X. Sun, and L. Li, "Wireless sensor networks intrusion detection based on SMOTE and the random forest algorithm," Sensors, vol. 19, no. 1, p. 203, 2019.