Ensemble Models For Effective Classification of Big Data With Data Imbalance
Submitted by
K.MADASAMY
(Registration No. P 4463)
Dr. M.RAMASWAMI
Professor
Department of Computer Applications
School of Information Technology
Madurai Kamaraj University
DECEMBER 2019
Ensemble Models for Effective Classification of
Big Data with Data Imbalance
SYNOPSIS
1. Introduction
issue. Intrusion detection, financial fraud detection, detection of cancer and the like are some of the domains where normal or legitimate instances are huge in number, while the interesting instances belong to minority classes with a low number of entries. The ratio between the number of instances in the majority class and the number of instances in the minority classes is called the imbalance ratio. A dataset is considered to be balanced if its imbalance ratio is 1, i.e., it contains an equal number of instances of all the representative classes. An increase in this ratio leads to data imbalance [3].
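The imbalance ratio defined above can be computed directly from the class labels. The following sketch is illustrative, not part of the thesis implementation:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of majority-class count to minority-class count; 1.0 means balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# 95 legitimate transactions against 5 fraudulent ones gives a ratio of 19.0
labels = ["legit"] * 95 + ["fraud"] * 5
print(imbalance_ratio(labels))  # 19.0
```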
2. Review of Literature
This section presents some of the recent and most prominent works in the
area of classification task on imbalanced data. A boosting based model was recently
proposed by Gong et al. [5]. This model is a sampling based method that uses
multiple sampling models to achieve the desired balance. This model also proposes
a boosting technique for effective prediction of classes. A comparison of multiple
ensemble models for imbalanced data prediction was proposed by Galar et al. [6]. A
bagged ensemble specifically designed for credit card fraud detection was proposed
by Akila et al. [7]. This model proposes a bagging methodology for effective
detection of fraudulent cases in credit card transactions. A cost sensitive model to
handle imbalance was proposed by Liu et al. [8]. This is a probability estimation
based classifier model, aimed to effectively handle data imbalance. A credit
classification method to handle imbalanced data was proposed by Yu et al. [9].
This is a rebalancing mechanism that utilizes bagging and resampling models to
perform predictions. Several methods are aimed towards handling imbalances by
introducing sampling techniques. Such models include SMOTE by Chawla et al. in
[10] and an under-sampling model by Liu et al. in [11]. The best sampling model to
be used on imbalanced datasets is itself a research problem with many contributions
towards this analysis [12]. An overlap sensitive classifier using support vector
machines and K-nearest neighbour algorithms was proposed by Lee et al. in [13].
Another under-sampling model to handle imbalance was proposed by Triguero et al.
in [14]. This model proposes a classification algorithm and claims to effectively
operate on data with high imbalance levels. A Spark based Random Forest model
was proposed by Chen et al. in [15]. This model creates a parallelized Random
Forest algorithm for effective classification in the Spark environment. Other Spark
based models include SCARFF [16] and an ensemble model presented in [17].
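Several of the models surveyed above rebalance the training set by sampling before classification. As a minimal illustration of the idea, the sketch below performs plain random over-sampling of the minority classes; this is simpler than SMOTE, which synthesizes new points by interpolating between minority-class neighbours, and is not drawn from any of the cited implementations:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority-class samples until every class
    matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            i = rng.choice(idx)       # pick an existing minority sample
            X_out.append(X[i])        # and duplicate it
            y_out.append(cls)
    return X_out, y_out

X = [[0.1], [0.2], [0.3], [0.9]]
y = [0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # Counter({0: 3, 1: 3})
```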
3. Objectives
To analyse the impact of varied imbalance levels and their effects on the
performance of classifier models.
To propose parallel models that can train faster on Big Data with
imbalance.
The research is organized under seven chapters. The main focus of the thesis
is to propose an ensemble model for classification that can effectively handle data
imbalances in Big Data. Contents of the research are listed in Table 1.
Chapter 1 Introduction
supervised learning process, classification in particular, and the issues in Big Data that affect the process of classification. An outline of ensemble-based modelling has also been discussed. The chapter further explains the motivation, scope and
objectives of the thesis.
Chapter 3 analyses and identifies data imbalance to be one of the main issues
existing in the data involving real-time applications. Different categories of machine learning classifier models were deployed and analysed to measure their prediction efficiencies. Standard classification metrics were used for the analysis, and datasets of varied sizes, imbalance levels and classes were used for the pilot analysis. The datasets used in this contribution are meant for a preliminary study validating the direction of further research.
It was identified and suggested that ensembles are the best architectures to
handle the issue of data imbalance on real-time data. An analysis of the ensemble models Bagging, Boosting, Bucket of Models and Stacking in terms of their working principles has been discussed. It was observed that Boosting and Stacking have higher potential in handling data with varying levels of imbalance. Due to the distributed operational nature of these models, they can be easily parallelized.
Thus, they can also be effectively used on Big Data and can be effective for real-
time applications.
However, due to the heterogeneity of the base learners, the rules formulated by each
base learner will be distinct and different from each other. The test data is passed to
these base learners and predictions are obtained. These predictions form the input
data for the second phase. The first level predictions and the actual class labels are
grouped to form the training data for phase-2. The second phase is a meta-learner,
and uses a strong classifier for predictions. The predictions provided by the meta-learner are used as the final prediction. The complex learning architecture involving multi-level training processes enables effective handling of data imbalance and hence provides better results.
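The two-phase procedure described above can be sketched as follows. This is an illustrative outline assuming scikit-learn; the particular base learners (a decision tree, naive Bayes, k-nearest neighbours), the logistic regression meta-learner, and the synthetic data are assumptions for the sketch, not the exact configuration used in the thesis:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Imbalanced synthetic data: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Phase 1: heterogeneous base learners, each inducing distinct decision rules.
base = [DecisionTreeClassifier(random_state=0), GaussianNB(), KNeighborsClassifier()]
for clf in base:
    clf.fit(X_tr, y_tr)

# The base-learner predictions, grouped with the actual class labels,
# form the training data for phase 2.
meta_train = np.column_stack([clf.predict(X_tr) for clf in base])
meta = LogisticRegression().fit(meta_train, y_tr)

# Phase 2: the meta-learner issues the final prediction.
meta_test = np.column_stack([clf.predict(X_te) for clf in base])
final = meta.predict(meta_test)
print(final.shape)
```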
Chapter 6 examines the result analysis of the second and third contributions
made in the research and their related discussions.
Chapter 7 highlights the important conclusions derived from the whole research process and analysis, with suggestions for future research directions.
5. Performance Evaluation
This ensures that the proposed model is scalable in terms of both data size
and imbalance levels. The description of the various datasets is shown in Table 2.
These datasets were used to evaluate the prediction capability of the proposed
two-phase stacking ensemble model.
Table 3: Performance metrics of two-phase stacking ensemble model
The selected datasets shown in Table 2 were used to evaluate the prediction
performance of the proposed parallel heterogeneous voting ensemble model and the
results are shown in Table 4.
It could be observed from Table 4 that the aggregated metrics exhibit high performance, greater than 0.9, for all the metrics on all the datasets. The results effectively highlight the high predictive capability of the proposed ensemble model.
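A heterogeneous voting ensemble of the kind evaluated above can be sketched with scikit-learn's VotingClassifier. The choice of base learners and the synthetic data here are illustrative assumptions, not the thesis configuration; `n_jobs=-1` fits the base learners in parallel, mirroring the parallelized training of the proposed model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Imbalanced synthetic data for demonstration.
X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=1)

# Heterogeneous base learners; majority (hard) voting decides the final class,
# and n_jobs=-1 trains the base learners in parallel.
vote = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=1)),
                ("nb", GaussianNB()),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard",
    n_jobs=-1,
)
vote.fit(X, y)
print(vote.score(X, y))
```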
[Figure 1: Precision comparison of the Two-Phase Stacking and Heterogeneous Voting ensembles on the CoverType, Glass5, Wine, Yeast6 and Abalone datasets]
[Figure 2: Recall comparison of the Two-Phase Stacking and Heterogeneous Voting ensembles on the CoverType, Glass5, Wine, Yeast6 and Abalone datasets]
[Figure 3: Accuracy comparison of the Two-Phase Stacking and Heterogeneous Voting ensembles on the CoverType, Glass5, Wine, Yeast6 and Abalone datasets]
Further, the AUC values of proposed models have been compared with the
recently developed RHSBoost [5], a boosting based ensemble model.
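AUC summarises the trade-off between true prediction levels and false alarm rates in a single value. A minimal sketch of its computation using scikit-learn, with illustrative labels and scores rather than the thesis results:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical class labels and predicted minority-class probabilities.
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.3, 0.2, 0.4, 0.8, 0.7]

# AUC is 1.0 when every positive instance outranks every negative instance.
print(roc_auc_score(y_true, y_score))  # 1.0
```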
[Figure 4: Overall comparison of AUC values for the proposed models and RHSBoost on the CoverType, Glass5, Wine, Yeast6 and Abalone datasets]
Figure 4 shows the comparison of AUC values obtained from both the
proposed models with the RHSBoost, which indicates the effectiveness of the
model in terms of true prediction levels and false alarm rates. It is observed that the
average prediction accuracies of the proposed models reveal higher performance when compared with the RHSBoost model.
6. Conclusion
of imbalanced datasets. From the analysis it is suggested that an ensemble is the
best approach to handle data imbalance. The second contribution proposes a
two-phase stacking ensemble model as a solution for handling data with huge
imbalance. This model was tested on datasets with varied sizes, classes and
imbalance levels and was found to exhibit effective results. Also, a comparison with
the recently developed RHSBoost ensemble model reveals that the proposed two-
phase stacking ensemble model outperforms the RHSBoost ensemble model.
The third contribution proposes a parallel heterogeneous voting ensemble model,
which improves the results further and also incorporates parallelization to ensure
faster training and prediction which is more suitable for real-time Big Data
applications. A major limitation is that the proposed model is effective only on data with low to moderate imbalance levels. Future extensions
of this work will concentrate on modifying the model to enable it to handle high
imbalance levels. Further, the number of heterogeneous models and the number of
stages in the ensemble could be effectively reduced to lower the computational
complexity of the proposed models.
REFERENCES
[2] Akila, S. and Srinivasulu Reddy, U., "Data Imbalance: Effects and Solutions for Classification of Large and Highly Imbalanced Data", Proceedings of ICRECT, Vol. 16, pp. 28-34, 2016.
[3] López, V., Fernández, A., García, S., Palade, V. and Herrera, F., "An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics", Information Sciences, Vol. 250, pp. 113-141, 2013.
[7] Akila, S. and Srinivasulu Reddy, U., "Risk based bagged ensemble (RBE) for credit card fraud detection", International Conference on Inventive Computing and Informatics (ICICI), IEEE, 2017.
[8] Liu, Zhenbing, et al., "Cost-Sensitive Collaborative Representation Based Classification via Probability Estimation Addressing the Class Imbalance Problem", Artificial Intelligence and Robotics, Springer, pp. 287-294, 2018.
[9] Yu, Lean, et al., "A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data", Applied Soft Computing, Vol. 69, pp. 192-202, 2018.
[13] Lee, H. K. and Kim, S. B., "An Overlap-Sensitive Margin Classifier for Imbalanced and Overlapping Data", Expert Systems With Applications, doi:10.1016/j.eswa.2018.01.008, 2018.
[15] Chen, J., Li, K., Tang, Z., Bilal, K., Yu, S., Weng, C. and Li, K., "A parallel random forest algorithm for big data in a Spark cloud computing environment", IEEE Transactions on Parallel and Distributed Systems, Vol. 28, No. 4, pp. 919-933, 2017.
[16] Carcillo, F., Dal Pozzolo, A., Le Borgne, Y.-A., Caelen, O., Mazzer, Y. and Bontempi, G., "SCARFF: A scalable framework for streaming credit card fraud detection with Spark", Information Fusion, Vol. 41, pp. 182-194, 2018.
[17] Wu, Z., Lin, W., Zhang, Z., Wen, A. and Lin, L., "An ensemble random forest algorithm for insurance big data analysis", IEEE International Conference on Computational Science and Engineering (CSE) and Embedded and Ubiquitous Computing (EUC), Vol. 1, pp. 531-536, 2017.
List of Publications
[1]. Madasamy, K. and Ramaswami, M., "Performance Evaluation of Word Frequency Count in Hadoop Environment", International Journal of Innovative Research in Science, Engineering and Technology (IJIRSET), Volume 6, Issue 6, June 2017. [UGC Approved Journal]
[5]. Madasamy, K. and Ramaswami, M., "Two-Phase Stacking Ensemble to Effectively Handle Data Imbalances in Classification Problems", International Journal of Advanced Research in Computer Science (IJARCS), Volume 9, Issue 1, January 2018. [UGC Approved Journal]