Machine Learning With Oversampling and Undersampling Techniques: Overview Study and Experimental Results
Abstract—Data imbalance in machine learning refers to an unequal distribution of classes within a dataset. This issue is encountered mostly in classification tasks, in which the distribution of classes or labels in a given dataset is not uniform. The straightforward way to solve this problem is the resampling method: adding records to the minority class or deleting records from the majority class. In this paper, we experiment with the two widely adopted resampling techniques, oversampling and undersampling. In order to explore both techniques, we have chosen a public imbalanced dataset from the Kaggle website, Santander Customer Transaction Prediction, and have applied a group of well-known machine learning algorithms with the hyperparameters that give the best results for both resampling techniques. One of the key findings of this paper is that oversampling performs better than undersampling for different classifiers and obtains higher scores on different evaluation metrics.

Index Terms—Undersampling, Oversampling, Class Imbalance, Machine Learning, SVM, Random Forest, Naive Bayes, Recall, Precision, Accuracy

I. INTRODUCTION

In machine learning and statistics, classification is defined as training a system on a labeled dataset so that it can identify the class to which a new, unseen instance belongs. Recently, there has been enormous growth in data and, unfortunately, a lack of quality labeled data. Many traditional machine learning methods assume that the target classes have the same distribution. However, this assumption does not hold in several applications, for example weather forecasting [1], diagnosis of illnesses [2], and fraud detection [3], where nearly all of the instances are labeled with one class while only a few instances carry the other class. For this reason, models lean toward the majority class and neglect the minority class. This is reflected in model performance: such models perform poorly when the datasets are imbalanced. This is called the class imbalance problem. Thus, in such a situation, although good accuracy can be obtained, we do not obtain good enough scores on other evaluation metrics, such as precision, recall, F1-score [4], and ROC score.

Recently, there has been great interest in the class imbalance issue. Several researchers consider it a challenging problem that needs more attention to resolve [5], [6]. One of the common approaches is to use resampling techniques to make the dataset balanced. Resampling techniques can be applied by either undersampling or oversampling the dataset. Undersampling is the process of decreasing the number of majority-class instances or samples; common undersampling methods include Tomek links [7], cluster centroids [8], and others. Oversampling is performed by increasing the number of minority-class instances or samples, either by producing new instances or by repeating existing ones; an example of an oversampling method is Borderline-SMOTE [9]. Figure 1 shows the difference between the two techniques: oversampling and undersampling.

In this work, the imbalanced dataset of 'Santander Customer Transaction Prediction' from a Kaggle competition (released in February 2019)¹ has been used with different machine learning models to experiment with oversampling and undersampling techniques and to make a full comparison across different evaluation metrics. Our code for this experiment can be found on GitHub [10]. The results show that oversampling achieves better scores than undersampling for different machine learning classifier models.

This paper is organized as follows: related work is reviewed in Section II. Section III describes the dataset used in this article. Our methodology and evaluation metrics are presented in Section IV. Experiments and results are introduced in Section V. Finally, the conclusion of the paper is provided in Section VI.

¹https://www.kaggle.com/c/santander-customer-transaction-prediction
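To make the contrast between the two techniques concrete, the following minimal sketch balances a small synthetic dataset with random oversampling and random undersampling via the imbalanced-learn package [8]; the toy data and its 90/10 class ratio are illustrative assumptions, not the paper's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy data: roughly 90% majority (class 0) and 10% minority (class 1).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)
print(np.bincount(y))          # e.g. [897 103]

# Oversampling: randomly repeat minority instances until the classes match.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(np.bincount(y_over))     # [897 897]

# Undersampling: randomly drop majority instances until the classes match.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(np.bincount(y_under))    # [103 103]
```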
II. RELATED WORK

The imbalanced data challenge has recently attracted growing attention from researchers. The authors in [11] proposed a well-known method for undersampling; it works by eliminating the data points whose target class does not match the majority of their k nearest neighbors. In [12], the authors discussed several problems related to learning with skewed class distributions, for example the connection between class distributions and cost-sensitive learning, and the limitations of error rate and accuracy for measuring model performance. In [13], the authors presented a review of the most commonly used methods for learning from imbalanced classes. They claimed that the poor performance of the models produced by typical machine learning methods on imbalanced classes is mostly due to three main issues: error costs, class distribution, and accuracy.

The authors in [14] suggested resampling methods due to the difficulty of identifying the minority target. They applied a new resampling method that equally oversamples infrequent positives and undersamples the non-infected majority, based on synthetic cases created by class-specific sub-clustering. They stated that their new resampling technique achieved better results than traditional random resampling. In [15], the authors applied three different methods to an advertising dataset: logistic regression, Chi-squared automatic interaction detection, and a neural network. The performance of the three methods was compared by means of accuracy, AUC, and precision, over several different imbalanced datasets produced from the real dataset. They stated that precision is a good measure for imbalanced datasets. An additional method was introduced in [8], in which k-means clustering is used to balance the imbalanced instances by decreasing the number of majority instances. Also, the authors in [16] applied an undersampling method that removes data points from the majority class based on the distances between them.

Our work differs from the previous research in that we use the dataset of 'Santander Customer Transaction Prediction' from a Kaggle competition (released in February 2019) to compare oversampling and undersampling methods.

III. DATASET

The competition from the Kaggle website, "Santander Customer Transaction Prediction", is a binary classification challenge in which a dataset with numeric data fields is provided. The challenge is to predict and identify which customers will make a specific transaction in the future, regardless of the amount of money transacted. Knowing that the dataset is imbalanced, we used this data to tackle and review the imbalanced data problem. The competition posted two datasets, a training and a testing dataset. In general, the whole dataset contains 202 features and 200,000 entries, and has no missing values. Figure 2 shows the distribution of the target column (0 refers to the customers that will not make the transaction, and 1 refers to the customers that will make the transaction). From this figure, we can easily notice that the dataset is imbalanced.

Fig. 2: Distribution of target classes

IV. METHODOLOGY

In order to give a comprehensive view of the imbalanced data problem, we started by exploring the dataset. After downloading the training dataset from Kaggle, we split the data into train and target datasets. Then, we scaled the dataset. Moreover, we ranked the features and selected the important ones for our experiment. To that end, the underlying frequency distribution of the features was studied, and the correlation matrix between the features was calculated. As a result, we found that there is little correlation between the features, which means the features are mostly independent from each other. For this reason, a feature selection technique was applied in order to select the most important features and drop the rest. Figure 3 illustrates the distribution of some features from the dataset.
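The paper does not name the scaler or the feature-ranking technique it used, so the following sketch is only one plausible realization of these preparation steps: it assumes standard scaling, a random-forest importance ranking, and an arbitrary cutoff of the top 150 features. The train.csv file name and the ID_code/target columns follow the Kaggle competition's data layout.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Load the Kaggle training file and separate the features from the target.
df = pd.read_csv("train.csv")
y = df["target"]
X = df.drop(columns=["ID_code", "target"])

# Scale each feature to zero mean and unit variance.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Correlation matrix: near-zero off-diagonal values indicate that the
# features are mostly independent of each other.
corr = X_scaled.corr()
off_diag = corr.mask(np.eye(len(corr), dtype=bool)).abs()
print("max off-diagonal |correlation|:", off_diag.max().max())

# Rank the features by importance and keep the top k (cutoff assumed).
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_scaled, y)
ranking = pd.Series(rf.feature_importances_, index=X.columns)
X_selected = X_scaled[ranking.nlargest(150).index]
```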
After exploring the given dataset and preparing it to be compatible with machine learning algorithms, we used two resampling techniques, both of which work by changing the class distribution. We also studied different classification models for this experiment. Table I shows the different classifiers that are used with both resampling techniques. To compare the classifiers, we used several evaluation metrics (Accuracy, Precision, Recall, F1-Score [17], and ROC [18]); note that higher is better for all of them.

TABLE I: Different classifier models

Abbreviation   Machine learning classifier model
SVM(Linear)    Support Vector Machine with linear kernel
SVM(Poly)      Support Vector Machine with polynomial kernel
SVM(RBF)       Support Vector Machine with RBF kernel
NB             Gaussian Naive Bayes
LR             Logistic Regression
DT1            Decision Tree
DT2            Decision Tree
DT3            Decision Tree
RF             Random Forest
GB             Gradient Boosting Classifier
BC(NB)         Bagging Classifier with NB
BC(DT)         Bagging Classifier with DT
AB             AdaBoost [19]
VE             Voting ensemble of NB, LR, DT (depth=18), and RF

The F1-score is high only when both precision (the positive predictive value) and recall are high. Possibly the best common metric for measuring general classification performance is ROC [18]. In our work, we used the scikit-learn [24], numpy [25], and pandas [8] packages to implement these models and to prepare the data.

V. EXPERIMENTS AND RESULTS

A. Oversampling Minority Class

For our first experiment, we used the oversampling technique, applying the non-heuristic algorithm known as random oversampling. Its main objective is to balance the class distribution through random repetition of minority target instances. Figure 4 shows how the target class is distributed after using this method on our dataset; the balanced class count equals 120,000.
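A minimal sketch of this first experiment, continuing from the preparation sketch in Section IV (it reuses X_selected and y from there): the 80/20 split, the choice of the RF classifier, and its hyperparameters are illustrative assumptions rather than the paper's exact configuration, which is available in the authors' repository [10].

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Hold out a test split first so the test set keeps the real imbalance;
# only the training split is resampled.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, stratify=y, random_state=42)

# Random oversampling: repeat minority instances until the classes match.
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# Train one of the Table I classifiers (RF) on the balanced training data.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_test)

# Score with the metrics listed in Section IV (higher is better).
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-Score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```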
[6] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[7] I. Tomek, "A generalization of the k-NN rule," IEEE Transactions on Systems, Man, and Cybernetics, no. 2, pp. 121–126, 1976.
[8] G. Lemaître, F. Nogueira, and C. K. Aridas, "Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 559–563, 2017.
[9] H. Han, W.-Y. Wang, and B.-H. Mao, "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning," in International Conference on Intelligent Computing. Springer, 2005, pp. 878–887.
[10] https://github.com/Roweida-Mohammed/Code_For_Santander_Customer_Transaction_Prediction
[11] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Transactions on Systems, Man, and Cybernetics, no. 3, pp. 408–421, 1972.
[12] M. C. Monard and G. E. Batista, "Learning with skewed class distributions," Advances in Logic, Artificial Intelligence, and Robotics: LAPTEC, vol. 85, no. 2002, p. 173, 2002.
[13] S. Visa and A. Ralescu, "Issues in mining imbalanced data sets - a review paper," in Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, vol. 2005, 2005, pp. 67–73.
[14] G. Cohen, M. Hilario, H. Sax, S. Hugonnet, and A. Geissbuhler, "Learning from imbalanced data in surveillance of nosocomial infection," Artificial Intelligence in Medicine, vol. 37, no. 1, pp. 7–18, 2006.
[15] E. Duman, Y. Ekinci, and A. Tanrıverdi, "Comparing alternative classifiers for database marketing: The case of imbalanced datasets," Expert Systems with Applications, vol. 39, no. 1, pp. 48–53, 2012.
[16] I. Mani and I. Zhang, "kNN approach to unbalanced data distributions: A case study involving information extraction," in Proceedings of Workshop on Learning from Imbalanced Datasets, vol. 126, 2003.
[17] A. Estabrooks and N. Japkowicz, "A mixture-of-experts framework for learning from imbalanced data sets," in International Symposium on Intelligent Data Analysis. Springer, 2001, pp. 34–43.
[18] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
[19] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[20] Q. Gu, L. Zhu, and Z. Cai, "Evaluation measures of the classification performance of imbalanced data sets," in International Symposium on Intelligence Computation and Applications. Springer, 2009, pp. 461–471.
[21] N. V. Chawla, N. Japkowicz, and A. Kotcz, "Special issue on learning from imbalanced data sets," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 1–6, 2004.
[22] M. Hossin, M. Sulaiman, A. Mustapha, N. Mustapha, and R. Rahmat, "A hybrid evaluation metric for optimizing classifier," in 2011 3rd Conference on Data Mining and Optimization (DMO). IEEE, 2011, pp. 165–170.
[23] R. Ranawana and V. Palade, "Optimized precision - a new measure for classifier performance evaluation," in 2006 IEEE International Conference on Evolutionary Computation. IEEE, 2006, pp. 2254–2261.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.
[25] J. D. Hunter, "Matplotlib: A 2D graphics environment," Computing in Science & Engineering, vol. 9, no. 3, p. 90, 2007.
[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[27] P. Hart, "The condensed nearest neighbor rule (Corresp.)," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 515–516, 1968.
[28] M. Kubat, S. Matwin et al., "Addressing the curse of imbalanced training sets: One-sided selection," in ICML, vol. 97. Nashville, USA, 1997, pp. 179–186.
[29] I. Tomek, "Two modifications of CNN," IEEE Transactions on Systems, Man and Cybernetics, vol. 6, no. 6, pp. 769–772, 1976.
[30] J. Laurikkala, "Improving identification of difficult small classes by balancing class distribution," in Conference on Artificial Intelligence in Medicine in Europe. Springer, 2001, pp. 63–66.