Machine Learning With Oversampling and Undersampling Techniques: Overview Study and Experimental Results
Abstract—Data imbalance in machine learning refers to an unequal distribution of classes within a dataset. This issue is encountered mostly in classification tasks, in which the distribution of classes or labels in a given dataset is not uniform. The straightforward way to solve this problem is the resampling method: adding records to the minority class or deleting records from the majority class. In this paper, we experiment with the two widely adopted resampling techniques, oversampling and undersampling. In order to explore both techniques, we have chosen a public imbalanced dataset from the Kaggle website, Santander Customer Transaction Prediction, and have applied a group of well-known machine learning algorithms with the hyperparameters that give the best results for both resampling techniques. One of the key findings of this paper is that oversampling performs better than undersampling for different classifiers and obtains higher scores on different evaluation metrics.

Index Terms—Undersampling, Oversampling, Class Imbalance, Machine Learning, SVM, Random Forest, Naive Bayes, Recall, Precision, Accuracy

I. INTRODUCTION

In machine learning and statistics, classification is defined as training a system on a labeled dataset so that it can identify the class to which a new, unseen instance belongs. Recently, there has been enormous growth in data and, unfortunately, a lack of quality labeled data. Many traditional machine learning methods assume that the target classes have the same distribution. However, this assumption does not hold in several applications, for example weather forecasting [1], diagnosis of illnesses [2], and fraud detection [3], where nearly all of the instances are labeled with one class while only a few instances carry the other class. For this reason, models lean toward the majority class and neglect the minority class. This is reflected in model performance: such models perform poorly when the datasets are imbalanced. This is called the class imbalance problem. Thus, in such a situation, although good accuracy can be obtained, we do not obtain good enough scores on other evaluation metrics, such as precision, recall, F1-score [4], and ROC score.

Recently, there has been great interest in the class imbalance issue. Several researchers consider it a challenging problem that needs more attention to resolve [5], [6]. One of the common approaches is to use resampling techniques to make the dataset balanced. Resampling techniques can be applied by either undersampling or oversampling the dataset. Undersampling is the process of decreasing the number of majority-class instances or samples; common undersampling methods include Tomek links [7], cluster centroids [8], and others. Oversampling is performed by increasing the number of minority-class instances or samples, either by producing new instances or by repeating existing ones; an example of an oversampling method is Borderline-SMOTE [9]. Figure 1 shows the difference between the two techniques: oversampling and undersampling.

In this work, the imbalanced dataset of 'Santander Customer Transaction Prediction' from a Kaggle competition (released in February 2019)¹ has been used with different machine learning models to experiment with oversampling and undersampling techniques and to make a full comparison across different evaluation metrics. Our code for this experiment can be found on GitHub [10]. The results show that oversampling achieves better scores than undersampling for different machine learning classifier models.

This paper is organized as follows: related work is reviewed in Section II. Section III describes the dataset used in this article. Our methodology and evaluation metrics are presented in Section IV. Experiments and results are introduced in Section V. Finally, the conclusion of the paper is provided in Section VI.

¹https://www.kaggle.com/c/santander-customer-transaction-prediction
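To make the contrast between the two techniques concrete, the following minimal sketch balances a small synthetic dataset with random oversampling and random undersampling via the imbalanced-learn package [8]; the toy data and its 90/10 class ratio are illustrative assumptions, not the paper's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy data: roughly 90% majority (class 0) and 10% minority (class 1).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)
print(np.bincount(y))          # e.g. [897 103]

# Oversampling: randomly repeat minority instances until the classes match.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(np.bincount(y_over))     # [897 897]

# Undersampling: randomly drop majority instances until the classes match.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(np.bincount(y_under))    # [103 103]
```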
II. RELATED WORK

The imbalanced data challenge has recently attracted growing attention from researchers. The authors in [11] proposed a well-known method for undersampling; it works by eliminating the data points whose target class does not match the majority of their k nearest neighbors. In [12], the authors discussed several problems related to learning with skewed class distributions, for example the connection between class distributions and cost-sensitive learning, and the limitations of error rate and accuracy for measuring model performance. In [13], the authors presented a review of the most commonly used methods for learning from imbalanced classes. They claimed that the poor performance of the models produced by typical machine learning methods on imbalanced classes is mostly due to three main issues: error costs, class distribution, and accuracy.

The authors in [14] suggested resampling methods due to the difficulty of identifying the minority target. They applied a new resampling method that equally oversamples infrequent positives and undersamples the non-infected majority, based on synthetic cases created by class-specific sub-clustering. They stated that their new resampling technique achieved better results than traditional random resampling. In [15], the authors applied three different methods to an advertising dataset: logistic regression, Chi-squared automatic interaction detection, and a neural network. The performance of the three methods was compared by means of accuracy, AUC, and precision, over several different imbalanced datasets produced from the real dataset. They stated that precision is a good measure for imbalanced datasets. An additional method was introduced in [8], in which k-means clustering is used to balance the imbalanced instances by decreasing the number of majority instances. Also, the authors in [16] applied an undersampling method that removes data points from the majority class based on the distances between them.

Our work differs from the previous research in that we use the dataset of 'Santander Customer Transaction Prediction' from a Kaggle competition (released in February 2019) to compare oversampling and undersampling methods.

III. DATASET

The competition from the Kaggle website, "Santander Customer Transaction Prediction", is a binary classification challenge in which a dataset with numeric data fields is provided. The challenge is to predict and identify which customers will make a specific transaction in the future, regardless of the amount of money transacted. Knowing that the dataset is imbalanced, we used this data to tackle and review the imbalanced data problem. The competition posted two datasets, a training and a testing dataset. In general, the whole dataset contains 202 features and 200,000 entries, and has no missing values. Figure 2 shows the distribution of the target column (0 refers to the customers that will not make the transaction, and 1 refers to the customers that will make the transaction). From this figure, we can easily notice that the dataset is imbalanced.

Fig. 2: Distribution of target classes

IV. METHODOLOGY

In order to give a comprehensive view of the imbalanced data problem, we started by exploring the dataset. After downloading the training dataset from Kaggle, we split the data into train and target datasets. Then, we scaled the dataset. Moreover, we ranked the features and selected the important ones for our experiment. To that end, the underlying frequency distribution of the features was studied, and the correlation matrix between the features was calculated. As a result, we found that there is little correlation between the features, which means the features are mostly independent from each other. For this reason, a feature selection technique was applied in order to select the most important features and drop the rest. Figure 3 illustrates the distribution of some features from the dataset.
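The paper does not name the scaler or the feature-ranking technique it used, so the following sketch is only one plausible realization of these preparation steps: it assumes standard scaling, a random-forest importance ranking, and an arbitrary cutoff of the top 150 features. The train.csv file name and the ID_code/target columns follow the Kaggle competition's data layout.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Load the Kaggle training file and separate the features from the target.
df = pd.read_csv("train.csv")
y = df["target"]
X = df.drop(columns=["ID_code", "target"])

# Scale each feature to zero mean and unit variance.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Correlation matrix: near-zero off-diagonal values indicate that the
# features are mostly independent of each other.
corr = X_scaled.corr()
off_diag = corr.mask(np.eye(len(corr), dtype=bool)).abs()
print("max off-diagonal |correlation|:", off_diag.max().max())

# Rank the features by importance and keep the top k (cutoff assumed).
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_scaled, y)
ranking = pd.Series(rf.feature_importances_, index=X.columns)
X_selected = X_scaled[ranking.nlargest(150).index]
```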
After exploring the given dataset and preparing it to be compatible with machine learning algorithms, we used two resampling techniques, both of which work by changing the class distribution. We also studied different classification models for this experiment. Table I shows the different classifiers that are used with both resampling techniques. To compare the classifiers, we used several evaluation metrics (Accuracy, Precision, Recall, F1-Score [17], and ROC [18]); note that higher is better for all of them.

TABLE I: Different classifier models

Abbreviation   Machine learning classifier model
SVM(Linear)    Support Vector Machine with linear kernel
SVM(Poly)      Support Vector Machine with polynomial kernel
SVM(RBF)       Support Vector Machine with RBF kernel
NB             Gaussian Naive Bayes
LR             Logistic Regression
DT1            Decision Tree
DT2            Decision Tree
DT3            Decision Tree
RF             Random Forest
GB             Gradient Boosting Classifier
BC(NB)         Bagging Classifier with NB
BC(DT)         Bagging Classifier with DT
AB             AdaBoost [19]
VE             Voting ensemble of NB, LR, DT (depth=18), and RF

The F1-score is high only when both precision (the positive predictive value) and recall are high. Possibly the best common metric for measuring general classification performance is ROC [18]. In our work, we used the scikit-learn [24], numpy [25], and pandas [8] packages to implement these models and to prepare the data.

V. EXPERIMENTS AND RESULTS

A. Oversampling Minority Class

For our first experiment, we used the oversampling technique, applying the non-heuristic algorithm known as random oversampling. Its main objective is to balance the class distribution through random repetition of minority target instances. Figure 4 shows how the target class is distributed after using this method on our dataset; the balanced class count equals 120,000.
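A minimal sketch of this first experiment, continuing from the preparation sketch in Section IV (it reuses X_selected and y from there): the 80/20 split, the choice of the RF classifier, and its hyperparameters are illustrative assumptions rather than the paper's exact configuration, which is available in the authors' repository [10].

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Hold out a test split first so the test set keeps the real imbalance;
# only the training split is resampled.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, stratify=y, random_state=42)

# Random oversampling: repeat minority instances until the classes match.
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# Train one of the Table I classifiers (RF) on the balanced training data.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_test)

# Score with the metrics listed in Section IV (higher is better).
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-Score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```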
[6] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[7] I. Tomek, "A generalization of the k-NN rule," IEEE Transactions on Systems, Man, and Cybernetics, no. 2, pp. 121–126, 1976.
[8] G. Lemaître, F. Nogueira, and C. K. Aridas, "Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 559–563, 2017.
[9] H. Han, W.-Y. Wang, and B.-H. Mao, "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning," in International Conference on Intelligent Computing. Springer, 2005, pp. 878–887.
[10] https://github.com/Roweida-Mohammed/Code_For_Santander_Customer_Transaction_Prediction
[11] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Transactions on Systems, Man, and Cybernetics, no. 3, pp. 408–421, 1972.
[12] M. C. Monard and G. E. Batista, "Learning with skewed class distributions," Advances in Logic, Artificial Intelligence, and Robotics: LAPTEC, vol. 85, no. 2002, p. 173, 2002.
[13] S. Visa and A. Ralescu, "Issues in mining imbalanced data sets - a review paper," in Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, vol. 2005, 2005, pp. 67–73.
[14] G. Cohen, M. Hilario, H. Sax, S. Hugonnet, and A. Geissbuhler, "Learning from imbalanced data in surveillance of nosocomial infection," Artificial Intelligence in Medicine, vol. 37, no. 1, pp. 7–18, 2006.
[15] E. Duman, Y. Ekinci, and A. Tanrıverdi, "Comparing alternative classifiers for database marketing: The case of imbalanced datasets," Expert Systems with Applications, vol. 39, no. 1, pp. 48–53, 2012.
[16] I. Mani and I. Zhang, "kNN approach to unbalanced data distributions: A case study involving information extraction," in Proceedings of Workshop on Learning from Imbalanced Datasets, vol. 126, 2003.
[17] A. Estabrooks and N. Japkowicz, "A mixture-of-experts framework for learning from imbalanced data sets," in International Symposium on Intelligent Data Analysis. Springer, 2001, pp. 34–43.
[18] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
[19] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[20] Q. Gu, L. Zhu, and Z. Cai, "Evaluation measures of the classification performance of imbalanced data sets," in International Symposium on Intelligence Computation and Applications. Springer, 2009, pp. 461–471.
[21] N. V. Chawla, N. Japkowicz, and A. Kotcz, "Special issue on learning from imbalanced data sets," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 1–6, 2004.
[22] M. Hossin, M. Sulaiman, A. Mustapha, N. Mustapha, and R. Rahmat, "A hybrid evaluation metric for optimizing classifier," in 2011 3rd Conference on Data Mining and Optimization (DMO). IEEE, 2011, pp. 165–170.
[23] R. Ranawana and V. Palade, "Optimized precision - a new measure for classifier performance evaluation," in 2006 IEEE International Conference on Evolutionary Computation. IEEE, 2006, pp. 2254–2261.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.
[25] J. D. Hunter, "Matplotlib: A 2D graphics environment," Computing in Science & Engineering, vol. 9, no. 3, p. 90, 2007.
[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[27] P. Hart, "The condensed nearest neighbor rule (Corresp.)," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 515–516, 1968.
[28] M. Kubat, S. Matwin et al., "Addressing the curse of imbalanced training sets: One-sided selection," in ICML, vol. 97. Nashville, USA, 1997, pp. 179–186.
[29] I. Tomek, "Two modifications of CNN," IEEE Transactions on Systems, Man and Cybernetics, vol. 6, no. 6, pp. 769–772, 1976.
[30] J. Laurikkala, "Improving identification of difficult small classes by balancing class distribution," in Conference on Artificial Intelligence in Medicine in Europe. Springer, 2001, pp. 63–66.