
Ensemble Models for Effective Classification of Big Data with Data Imbalance

Synopsis submitted to Madurai Kamaraj University


in partial fulfilment of the requirements for the award of the Degree of
DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE

Submitted by

K.MADASAMY
(Registration No. P 4463)

Under the Guidance of

Dr. M.RAMASWAMI
Professor
Department of Computer Applications
School of Information Technology
Madurai Kamaraj University

DEPARTMENT OF COMPUTER APPLICATIONS


MADURAI KAMARAJ UNIVERSITY
Madurai - 625 021
Tamilnadu, India

DECEMBER – 2019
Ensemble Models for Effective Classification of
Big Data with Data Imbalance

SYNOPSIS

1. Introduction

The massive increase in online activities has resulted in the generation of huge amounts of data, the processing of which has the potential to drive business and research. The data, however, need to be processed appropriately to extract useful information, which requires specialized data mining techniques. Classification is a data mining process that categorizes data instances into predefined categories. Classifiers learn patterns contained in the given training data in order to predict class labels for unseen data. Domains such as customer-based product prediction, churn prediction, fraud detection, network intrusion detection, and particle analysis and prediction (e.g., the Higgs Boson) are some of the areas where classification based on machine learning models is effectively put to use.

The performance of classifier models is hindered by several intrinsic issues contained in the data. One such major issue is data imbalance [1]. Data is considered imbalanced if one of its classes exhibits dominance over the other classes, i.e., instances of one class are huge in number, while instances of the other classes are very few. The class that shows high dominance is referred to as the majority class, while the classes with low dominance are referred to as minority classes. Though this problem has been observed to be very prominent in several binary classification problems, its impact on multiclass datasets is found to be more profound [2] and is still largely unexplored. Scientific datasets are usually preprocessed and balanced, hence this issue cannot be explicitly observed in such scenarios. In real-time datasets, however, it is a very prominent issue. Intrusion detection, financial fraud detection and detection of cancer are some of the domains where normal or legitimate data are huge in number, while the interesting instances belong to minority classes with low numbers of entries. The ratio between the number of instances in the majority class and the number of instances in the minority classes is called the imbalance ratio. A dataset is considered balanced if its imbalance ratio is 1, i.e., it contains an equal number of instances of all the representative classes. A change in this ratio leads to data imbalance [3].

Data imbalance leads to several issues and especially affects the performance of machine learning models. The major issue is that imbalance creates bias during the prediction process, making the classifier more inclined towards predicting the majority classes and causing problems when predicting minority classes. Due to the huge number of instances contained in the majority classes, the classifier is overly trained on them, while the low instance counts in the minority classes mean the classifier receives little training on those classes. This biased training leads to poor predictions. Although the impact of data imbalance varies between classifiers, its presence cannot be overlooked [4].

2. Review of Literature

This section presents some of the recent and most prominent works in the area of classification on imbalanced data. A boosting-based model was recently proposed by Gong et al. [5]. This sampling-based method uses multiple sampling models to achieve the desired balance and also proposes a boosting technique for effective prediction of classes. A comparison of multiple ensemble models for imbalanced data prediction was presented by Galar et al. [6]. A bagged ensemble specifically designed for credit card fraud detection was proposed by Akila et al. [7]; it applies a bagging methodology for effective detection of fraudulent credit card transactions. A cost-sensitive model to handle imbalance was proposed by Liu et al. [8], a probability-estimation-based classifier aimed at effectively handling data imbalance. A credit classification method for imbalanced data was proposed by Yu et al. [9]; it is a rebalancing mechanism that utilizes bagging and resampling models to perform predictions. Several methods aim to handle imbalance by introducing sampling techniques; such models include SMOTE by Chawla et al. [10] and an under-sampling model by Liu et al. [11]. Which sampling model is best suited to imbalanced datasets is itself a research problem, with many contributions towards this analysis [12]. An overlap-sensitive classifier using support vector machines and the K-nearest neighbour algorithm was proposed by Lee et al. [13]. Another under-sampling model to handle imbalance was proposed by Triguero et al. [14]; it proposes a classification algorithm claimed to operate effectively on data with high imbalance levels. A Spark-based Random Forest model was proposed by Chen et al. [15]; it creates a parallelized Random Forest algorithm for effective classification in the Spark environment. Other Spark-based models include SCARFF [16] and an ensemble model presented in [17].
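
As an illustration of the sampling-based approach, the following sketch applies SMOTE [10] via the imbalanced-learn library on a synthetic dataset; this is the editor's hedged example, not the implementation from any of the cited works:

```python
# Hedged sketch: rebalancing a roughly 9:1 binary dataset with SMOTE [10]
# using the imbalanced-learn (imblearn) library.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                        # roughly {0: 900, 1: 100}

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                    # classes equalized by synthetic minority samples
```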

3. Objectives

The main objective of the proposed work is to effectively handle data imbalance while performing classification on real-time Big Data. The other objectives are:

• To analyse the impact of varied imbalance levels on the performance of classifier models.

• To analyse the performance of various classifier models on huge imbalanced data in order to identify a robust base learner.

• To identify the impact of ensemble models on imbalanced data.

• To propose models that can effectively handle data of varied imbalance levels and provide high performance.

• To propose parallel models that can train faster on imbalanced Big Data.

• To speed up the classification process using parallelization and to provide predictions in real-time applications.

4. Structure of the Research

The research is organized under seven chapters. The main focus of the thesis
is to propose an ensemble model for classification that can effectively handle data
imbalances in Big Data. Contents of the research are listed in Table 1.

Table 1. Structure of the Research

Chapters    Contents and Descriptions
Chapter 1   Introduction
Chapter 2   Review of Literature
Chapter 3   Data Imbalance and Classifiers: Impact and Solutions from a Big Data Perspective (Contribution 1)
Chapter 4   Two-Phase Stacking Ensemble to Effectively Handle Data Imbalances in Classification Problems (Contribution 2)
Chapter 5   Parallel Heterogeneous Voting Ensemble for Effective Classification of Imbalanced Data (Contribution 3)
Chapter 6   Results and Discussion
Chapter 7   Conclusion and Scope for Future Research

Chapter 1 briefly presents the essential preliminary concepts related to the proposed research. It describes the evolution of data from a manageable size to Big Data and the scenario that has led to this explosion in data. It discusses the basics of the supervised learning process, classification in particular, and the issues in Big Data that affect the process of classification. An outline of ensemble-based modelling is also discussed. The chapter further explains the motivation, scope and objectives of the thesis.

Chapter 2 outlines the review of literature, discussing prior studies in the domain of classification and specifically models handling data imbalance. In this chapter, the models are categorized as individual models, ensemble models, cost-sensitive models, sampling-based models and parallelization-based models.

Chapter 3 analyses and identifies data imbalance as one of the main issues in data from real-time applications. Different categories of machine learning classifier models were deployed and their prediction efficiencies measured. Standard classification metrics were used for the analysis, and datasets of varied sizes, imbalance levels and numbers of classes were used for a pilot study. The datasets used in this contribution are meant for a preliminary study to validate the direction of further research.

It was identified and suggested that ensembles are the best architectures to handle the issue of data imbalance on real-time data. Ensemble models such as Bagging, Boosting, Bucket of Models and Stacking were analysed and discussed in terms of their working principles. It was observed that Boosting and Stacking have higher potential for handling data with varying levels of imbalance. Due to the distributed operational nature of these models, they can be easily parallelized; thus, they can also be used effectively on Big Data and in real-time applications.

Chapter 4 proposes a two-phase stacking ensemble model that can be used to handle data imbalances during the process of classification. The proposed model is composed of two major training phases. The first phase uses multiple base learners, each trained on the entire training data. Due to the heterogeneity of the base learners, the rules formulated by each base learner are distinct from one another. The test data is passed to these base learners and predictions are obtained. These predictions form the input data for the second phase: the first-level predictions and the actual class labels are grouped to form the training data for phase 2. The second phase is a meta-learner that uses a strong classifier for predictions, and the predictions provided by the meta-learner are used as the final predictions. This complex learning architecture involving a multi-level training process enables effective handling of data imbalance and hence provides better results.
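
A minimal sketch of this two-phase architecture is given below. It is the editor's illustration using scikit-learn, not the author's exact implementation: the choice of base learners and meta-learner is an assumption, and scikit-learn's StackingClassifier builds the phase-2 training data from cross-validated phase-1 predictions rather than the exact train/test flow described above.

```python
# Hedged sketch: two-phase stacking with heterogeneous phase-1 base
# learners and a phase-2 meta-learner (estimator choices are assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier()),
                ("rf", RandomForestClassifier()),
                ("nb", GaussianNB())],        # phase 1: heterogeneous base learners
    final_estimator=LogisticRegression(),     # phase 2: meta-learner
)
stack.fit(X_tr, y_tr)
print(stack.score(X_te, y_te))
```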

Chapter 5 proposes a parallel heterogeneous voting ensemble model for effectively handling data imbalance during the process of classification. The proposed ensemble uses heterogeneous tree-based base learners, namely Decision Tree, Random Forest and Gradient Boosted Trees, for prediction, and a bagged ensemble is created from these base learners. All three models are intentionally selected to be tree-based in order to maintain a certain level of homogeneity in the rule creation process, and the base learners are chosen such that each model complements the others, resulting in enhanced predictions. The entire training data is passed to the base learners and the learning models are built. When the test data is passed to the ensemble model, it produces a set of predictions for each instance rather than a single prediction. The final voting phase combines these predictions: the prediction receiving the most votes is taken as the final prediction. This model is generic, and the heterogeneity aids in effective reduction of bias and variance in the prediction process.
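
The sketch below illustrates such a heterogeneous voting ensemble with the three tree-based learners named above; it is a hedged approximation by the editor, where scikit-learn's per-estimator parallelism (n_jobs) stands in for the synopsis's actual parallel deployment.

```python
# Hedged sketch: hard-voting ensemble over Decision Tree, Random Forest
# and Gradient Boosted Trees; n_jobs=-1 fits the base learners in parallel.
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.tree import DecisionTreeClassifier

voter = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier()),
                ("rf", RandomForestClassifier(n_jobs=-1)),
                ("gbt", GradientBoostingClassifier())],
    voting="hard",   # majority vote over the per-instance predictions
    n_jobs=-1,
)
voter.fit(X_tr, y_tr)      # X_tr, y_tr as in the previous sketch
print(voter.score(X_te, y_te))
```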

Chapter 6 presents the results and related discussion for the second and third contributions of the research.

Chapter 7 highlights the important conclusions derived from the whole research process and analysis, with suggestions for future research directions.

5. Performance Evaluation

This section analyses the performance of both the two-phase stacking ensemble model and the parallel heterogeneous voting ensemble model. A reasonable comparative analysis has been made to fulfil the objectives framed in this research work.

5.1. Performance evaluation of two-phase stacking ensemble model

Experiments were performed by implementing the ensemble model in Python. Five benchmark datasets with varying numbers of classes, sizes and imbalance levels were considered for analysis.

Table 2: Description of Datasets

Datasets     No. of Instances   No. of Attributes   Imbalance Ratio   No. of Classes   Source
CoverType    38501              10                  13.02             Multi (7)        UCI
Glass5       214                9                   22.81             Binary (2)       KEEL
Wine         4898               11                  25.77             Multi (3)        UCI
Yeast6       1484               8                   39.15             Binary (2)       KEEL
Abalone      4177               8                   115.03            Binary (2)       UCI

The description of the datasets is shown in Table 2; their varied sizes and imbalance levels ensure that the proposed model is evaluated for scalability in terms of both data size and imbalance level. These datasets were used to evaluate the prediction capability of the proposed two-phase stacking ensemble model.

7|Page
Table 3: Performance metrics of two-phase stacking ensemble model

Datasets     AUC       Accuracy   F1-Score   Recall    Precision
CoverType    0.96      0.92021    0.90979    0.92021   0.90768
Glass5       1.0       1.0        1.0        1.0       1.0
Wine         0.98649   0.95556    0.95585    0.95556   0.95803
Yeast6       0.94723   0.99191    0.99584    0.99446   0.99722
Abalone      0.99709   0.99454    0.99709    0.99419   1.0

Table 3 shows the aggregated performance metrics, namely AUC, Accuracy, F1-Score, Recall and Precision. Strong performance across these metrics indicates that the classifier model is effective. It can be observed that, irrespective of data size, imbalance level and number of classes, the model performs effectively, with values greater than 0.9 for all the metrics. This shows the high-performing nature of the classifier model on datasets with varying imbalance levels, and also that the proposed model is scalable.
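
For reference, such aggregated metrics can be computed as in the sketch below (an editor's illustration continuing the earlier sketches; "weighted" averaging is an assumption, since the synopsis does not state which aggregation was used):

```python
# Hedged sketch: aggregated metrics for a fitted model (here the earlier
# `stack`); the averaging scheme is an assumption.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_pred = stack.predict(X_te)
print("Accuracy :", accuracy_score(y_te, y_pred))
print("F1-Score :", f1_score(y_te, y_pred, average="weighted"))
print("Recall   :", recall_score(y_te, y_pred, average="weighted"))
print("Precision:", precision_score(y_te, y_pred, average="weighted"))
print("AUC      :", roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1]))  # binary case
```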

5.2. Performance evaluation of parallel heterogeneous voting ensemble model

The selected datasets shown in Table 2 were used to evaluate the prediction
performance of the proposed parallel heterogeneous voting ensemble model and the
results are shown in Table 4.

Table 4: Performance metrics of parallel heterogeneous voting ensemble

Datasets     AUC       Accuracy   F1-Score   Recall    Precision
CoverType    0.97128   0.92921    0.91390    0.92021   0.90774
Glass5       1.0       1.0        1.0        1.0       1.0
Wine         0.99583   0.96932    0.96082    0.96364   0.95803
Yeast6       0.96898   0.99770    0.99833    0.99802   0.99864
Abalone      1.0       1.0        1.0        1.0       1.0

It can be observed from Table 4 that the aggregated metrics exhibit high performance, greater than 0.9 for all metrics on all the datasets. The results effectively highlight the high predictive capability of the proposed ensemble model.

5.3. Performance Comparison

A comparison of the performances in terms of accuracy, precision and recall is shown in Figures 1, 2 and 3. It is clearly noticed that the parallel heterogeneous voting ensemble model exhibits higher precision and recall levels in most cases compared to the two-phase stacking ensemble model. The two models exhibit equal performance on a few datasets; however, the performance levels remain high (greater than 0.9), demonstrating the effectiveness of both models. Precision refers to the fraction of relevant instances among the retrieved instances, and recall refers to the fraction of relevant instances that have been retrieved from the overall data. High precision and recall imply effective identification of classes, in turn implying high performance.
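
As a hypothetical worked example of these definitions (numbers invented for illustration):

```python
# Precision and recall for a minority class from raw confusion counts.
tp, fp, fn = 90, 10, 5              # true positives, false positives, false negatives
precision = tp / (tp + fp)          # relevant among retrieved -> 0.90
recall    = tp / (tp + fn)          # retrieved among relevant -> ~0.947
print(precision, recall)
```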

[Figure 1: Performance Comparison of Precision Levels. Bar chart of precision per dataset (CoverType, Glass5, Wine, Yeast6, Abalone) for the Two-Phase Stacking and Heterogeneous Voting models.]

[Figure 2: Performance Comparison of Recall Levels. Bar chart of recall per dataset for the Two-Phase Stacking and Heterogeneous Voting models.]

A comparison of the accuracy levels is shown in Figure 3. The heterogeneous voting ensemble model shows higher accuracy on all the datasets compared to the two-phase stacking ensemble model, highlighting its improved performance.

[Figure 3: Performance Comparison of Accuracy. Bar chart of accuracy per dataset for the Two-Phase Stacking and Heterogeneous Voting models.]

Further, the AUC values of the proposed models have been compared with those of the recently developed RHSBoost [5], a boosting-based ensemble model.

[Figure 4: Comparison of AUC with RHSBoost. Bar chart of AUC per dataset for the Heterogeneous Voting, Two-Phase Stacking and RHSBoost models.]

Figure 4 compares the AUC values obtained from both the proposed models with those of RHSBoost; AUC indicates the effectiveness of a model in terms of true prediction levels and false alarm rates. It is observed that the proposed models show higher performance than the RHSBoost model.

6. Conclusion

Performing classification on Big Data is one of the major requirements of the present information revolution era. However, intrinsic issues such as the data imbalance contained in real-time data pose huge challenges for machine learning models. The proposed parallel heterogeneous voting ensemble model aims to effectively handle the imbalance in huge data and provide unbiased results. The first contribution examines the robustness of classifier models in handling datasets with varied levels of imbalance; from this analysis, it is suggested that an ensemble is the best approach to handle data imbalance. The second contribution proposes a two-phase stacking ensemble model as a solution for handling data with huge imbalance. This model was tested on datasets of varied sizes, classes and imbalance levels and was found to exhibit effective results; a comparison with the recently developed RHSBoost ensemble model reveals that the proposed two-phase stacking ensemble model outperforms it. The third contribution proposes a parallel heterogeneous voting ensemble model, which improves the results further and incorporates parallelization to ensure faster training and prediction, making it more suitable for real-time Big Data applications. A major limitation of this model is that it is effective only on data with low to moderate imbalance levels. Future extensions of this work will concentrate on modifying the model to handle high imbalance levels. Further, the number of heterogeneous models and the number of stages in the ensemble could be reduced to lower the computational complexity of the proposed models.

REFERENCES

[1] Mao, W., Wang, J., He, L. and Tian, Y., "Online sequential prediction of imbalance data with two-stage hybrid strategy by extreme learning machine", Neurocomputing, 2017.

[2] Akila, S. and Srinivasulu Reddy, U., "Data Imbalance: Effects and Solutions for Classification of Large and Highly Imbalanced Data", Proceedings of ICRECT, Vol. 16, pp. 28-34, 2016.

[3] López, V., Fernández, A., García, S., Palade, V. and Herrera, F., "An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics", Information Sciences, Vol. 250, pp. 113-141, 2013.

[4] Akila, S. and Srinivasulu Reddy, U., "Modelling a Stable Classifier for Handling Large Scale Data with Noise and Imbalance", IEEE International Conference on Computational Intelligence in Data Science, 2017.

[5] Gong, J. and Kim, H., "RHSBoost: Improving classification performance in imbalance data", Computational Statistics & Data Analysis, Vol. 111, pp. 1-13, 2017.

[6] Galar, M., Fernández, A., Barrenechea, E., Bustince, H. and Herrera, F., "A review on ensembles for the class imbalance problem: bagging, boosting, and hybrid-based approaches", IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), Vol. 42(4), pp. 463-484, 2012.

[7] Akila, S. and Srinivasulu Reddy, U., "Risk based bagged ensemble (RBE) for credit card fraud detection", International Conference on Inventive Computing and Informatics (ICICI), IEEE, 2017.

[8] Liu, Z. et al., "Cost-Sensitive Collaborative Representation Based Classification via Probability Estimation Addressing the Class Imbalance Problem", Artificial Intelligence and Robotics, Springer, pp. 287-294, 2018.

[9] Yu, L. et al., "A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data", Applied Soft Computing, Vol. 69, pp. 192-202, 2018.

[10] Chawla, N.V. et al., "SMOTE: Synthetic minority over-sampling technique", Journal of Artificial Intelligence Research, Vol. 16, pp. 321-357, 2002.

[11] Liu, X.Y., Wu, J. and Zhou, Z.H., "Exploratory undersampling for class-imbalance learning", IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 39(2), pp. 539-550, 2009.

[12] Barandela, R. et al., "The imbalanced training sample problem: Under or over sampling?", Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, pp. 806-814, 2004.

[13] Lee, H.K. and Kim, S.B., "An Overlap-Sensitive Margin Classifier for Imbalanced and Overlapping Data", Expert Systems with Applications, doi:10.1016/j.eswa.2018.01.008, 2018.

[14] Triguero, I., Galar, M., Merino, D., Maillo, J., Bustince, H. and Herrera, F., "Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark", 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 640-647, 2016.

[15] Chen, J., Li, K., Tang, Z., Bilal, K., Yu, S., Weng, C. and Li, K., "A parallel random forest algorithm for big data in a Spark cloud computing environment", IEEE Transactions on Parallel & Distributed Systems, Vol. 1, pp. 1-1, 2017.

[16] Carcillo, F., Dal Pozzolo, A., Le Borgne, Y.-A., Caelen, O., Mazzer, Y. and Bontempi, G., "SCARFF: A scalable framework for streaming credit card fraud detection with Spark", Information Fusion, Vol. 41, pp. 182-194, 2018.

[17] Wu, Z., Lin, W., Zhang, Z., Wen, A. and Lin, L., "An ensemble random forest algorithm for insurance big data analysis", IEEE International Conference on Computational Science and Engineering (CSE) and Embedded and Ubiquitous Computing (EUC), Vol. 1, pp. 531-536, 2017.

List of Publications

[1]. Madasamy, K. and Ramaswami, M., "Performance Evaluation of Word Frequency Count in Hadoop Environment", International Journal of Innovative Research in Science, Engineering and Technology (IJIRSET), Volume 6, Issue 6, June 2017. [UGC Approved Journal]

[2]. Madasamy, K. and Ramaswami, M., "Hadoop-Based Word Count Simulation on Amazon Cloud", International Journal of Innovative Research in Electrical, Electronics, Instrumentation and Control Engineering (IJIREEICE), Volume 5, Issue 7, July 2017. [UGC Approved Journal]

[3]. Madasamy, K. and Ramaswami, M., "A Panorama of Big Data Analytics with Hadoop", International Journal of Computational and Applied Mathematics (IJCAM), Volume 12, Number 1, 2017. [UGC Approved Journal]

[4]. Madasamy, K. and Ramaswami, M., "Data Imbalance and Classifiers: Impact and Solutions from a Big Data Perspective", International Journal of Computational Intelligence Research (IJCIR), Volume 13, Number 9, pp. 2267-2281, 2017. [UGC Approved Journal]

[5]. Madasamy, K. and Ramaswami, M., "Two-Phase Stacking Ensemble to Effectively Handle Data Imbalances in Classification Problems", International Journal of Advanced Research in Computer Science (IJARCS), Volume 9, Issue 1, January 2018. [UGC Approved Journal]

[6]. Madasamy, K. and Ramaswami, M., "Parallel Heterogeneous Voting Ensemble for Effective Classification of Imbalanced Data", International Journal of Innovative Research in Science, Engineering and Technology (IJIRSET), Volume 7, Issue 8, August 2018.

[7]. Madasamy, K. and Ramaswami, M., "Markov Decision on Data Backup Scheduling for Big Data", International Journal of Computational Intelligence and Informatics (IJCII), ISSN: 2349-6363, Vol. 7, No. 4, pp. 207-216, March 2018.

Papers presented in International Conferences

[1]. Madasamy, K. and Ramaswami, M., "Optimal Data Backup Processing Using Markov Decision Process in the Context of Big Data Analytics", International Conference on Mathematical Modeling and Computational Methods in Science and Engineering (ICMMCMSE-2017), February 20-22, 2017, Alagappa University, Karaikudi, Tamilnadu.

[2]. Madasamy, K. and Ramaswami, M., "Performance Evaluation of Ensemble-based Classifier Models of High Dimensional Data", 1st International Conference on Applied Soft Computing Techniques (ICASCT'17), April 22-23, 2017, Kalasalingam University, Krishnankoil, Tamilnadu.
