
Ensemble Models for Effective Classification of Big Data with Data Imbalance

Synopsis submitted to Madurai Kamaraj University


in partial fulfilment of the requirements for the award of the Degree of
DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE

Submitted by

K.MADASAMY
(Registration No. P 4463)

Under the Guidance of

Dr. M.RAMASWAMI
Professor
Department of Computer Applications
School of Information Technology
Madurai Kamaraj University

DEPARTMENT OF COMPUTER APPLICATIONS


MADURAI KAMARAJ UNIVERSITY
Madurai - 625 021
Tamilnadu, India

DECEMBER – 2019
Ensemble Models for Effective Classification of
Big Data with Data Imbalance

SYNOPSIS

1. Introduction

The massive increase in online activities has resulted in the generation of huge amounts of data, the processing of which has the potential to drive business and research. The data, however, need to be processed appropriately to extract useful information, which requires specialized data mining techniques. Classification is a data mining process that categorizes data instances into predefined categories. Classifiers learn patterns contained in the given training data in order to predict class labels for unseen data. Domains such as customer-based product prediction, churn prediction, fraud detection, network intrusion detection, and particle analysis and prediction (e.g., the Higgs Boson) are some of the areas where classification based on machine learning models is effectively put to use.

The performance of classifier models is hindered by several intrinsic issues contained in the data. One such major issue is data imbalance [1]. Data is considered imbalanced if one of its classes exhibits dominance over the other classes, i.e., instances of one class are huge in number, while instances of the other classes are very few. The class that shows high dominance is referred to as the majority class, while the classes with low dominance are referred to as minority classes. Though this problem has been observed to be very prominent in several binary classification problems, its impact on multiclass datasets is found to be more profound [2] and is still largely unexplored. Scientific datasets are usually preprocessed and balanced, hence this issue cannot be explicitly observed in such scenarios. In real-time datasets, however, it is a very prominent issue. Intrusion detection, financial fraud detection and detection of cancer are some of the domains where normal or legitimate data are huge in number, while the interesting instances belong to minority classes with low numbers of entries. The ratio between the number of instances in the majority class and the number of instances in the minority classes is called the imbalance ratio. A dataset is considered balanced if its imbalance ratio is 1, i.e., it contains an equal number of instances of all the representative classes. A change in this ratio leads to data imbalance [3].

Data imbalance leads to several issues and especially affects the performance of machine learning models. The major issue is that imbalance creates bias during the prediction process, making the classifier more inclined towards predicting the majority classes and causing problems when predicting minority classes. Due to the huge number of instances contained in the majority classes, the classifier is overly trained on them, while the low instance counts in the minority classes mean the classifier receives little training on those classes. This biased training leads to poor predictions. Although the impact of data imbalance varies between classifiers, its presence cannot be overlooked [4].

2. Review of Literature

This section presents some of the recent and most prominent works in the area of classification on imbalanced data. A boosting-based model was recently proposed by Gong et al. [5]. This sampling-based method uses multiple sampling models to achieve the desired balance and also proposes a boosting technique for effective prediction of classes. A comparison of multiple ensemble models for imbalanced data prediction was presented by Galar et al. [6]. A bagged ensemble specifically designed for credit card fraud detection was proposed by Akila et al. [7]; it applies a bagging methodology for effective detection of fraudulent credit card transactions. A cost-sensitive model to handle imbalance was proposed by Liu et al. [8], a probability-estimation-based classifier aimed at effectively handling data imbalance. A credit classification method for imbalanced data was proposed by Yu et al. [9]; it is a rebalancing mechanism that utilizes bagging and resampling models to perform predictions. Several methods aim to handle imbalance by introducing sampling techniques; such models include SMOTE by Chawla et al. [10] and an under-sampling model by Liu et al. [11]. Which sampling model is best suited to imbalanced datasets is itself a research problem, with many contributions towards this analysis [12]. An overlap-sensitive classifier using support vector machines and the K-nearest neighbour algorithm was proposed by Lee et al. [13]. Another under-sampling model to handle imbalance was proposed by Triguero et al. [14]; it proposes a classification algorithm claimed to operate effectively on data with high imbalance levels. A Spark-based Random Forest model was proposed by Chen et al. [15]; it creates a parallelized Random Forest algorithm for effective classification in the Spark environment. Other Spark-based models include SCARFF [16] and an ensemble model presented in [17].
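
As an illustration of the sampling-based approach, the following sketch applies SMOTE [10] via the imbalanced-learn library on a synthetic dataset; this is the editor's hedged example, not the implementation from any of the cited works:

```python
# Hedged sketch: rebalancing a roughly 9:1 binary dataset with SMOTE [10]
# using the imbalanced-learn (imblearn) library.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                        # roughly {0: 900, 1: 100}

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                    # classes equalized by synthetic minority samples
```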

3. Objectives

The main objective of the proposed work is to effectively handle data imbalance while performing classification on real-time Big Data. The other objectives are:

• To analyse the impact of varied imbalance levels on the performance of classifier models.

• To analyse the performance of various classifier models on huge imbalanced data in order to identify a robust base learner.

• To identify the impact of ensemble models on imbalanced data.

• To propose models that can effectively handle data of varied imbalance levels and provide high performance.

• To propose parallel models that can train faster on imbalanced Big Data.

• To speed up the classification process using parallelization and to provide predictions in real-time applications.

4. Structure of the Research

The research is organized under seven chapters. The main focus of the thesis
is to propose an ensemble model for classification that can effectively handle data
imbalances in Big Data. Contents of the research are listed in Table 1.

Table 1. Structure of the Research

Chapters    Contents and Descriptions
Chapter 1   Introduction
Chapter 2   Review of Literature
Chapter 3   Data Imbalance and Classifiers: Impact and Solutions from a Big Data Perspective (Contribution 1)
Chapter 4   Two-Phase Stacking Ensemble to Effectively Handle Data Imbalances in Classification Problems (Contribution 2)
Chapter 5   Parallel Heterogeneous Voting Ensemble for Effective Classification of Imbalanced Data (Contribution 3)
Chapter 6   Results and Discussion
Chapter 7   Conclusion and Scope for Future Research

Chapter 1 briefly presents the essential preliminary concepts related to the proposed research. It describes the evolution of data from a manageable size to Big Data and the scenario that has led to this explosion in data. It discusses the basics of the supervised learning process, classification in particular, and the issues in Big Data that affect the process of classification. An outline of ensemble-based modelling is also discussed. The chapter further explains the motivation, scope and objectives of the thesis.

Chapter 2 outlines the review of literature, discussing prior studies in the domain of classification and specifically models handling data imbalance. In this chapter, the models are categorized as individual models, ensemble models, cost-sensitive models, sampling-based models and parallelization-based models.

Chapter 3 analyses and identifies data imbalance as one of the main issues in data from real-time applications. Different categories of machine learning classifier models were deployed and their prediction efficiencies measured. Standard classification metrics were used for the analysis, and datasets of varied sizes, imbalance levels and numbers of classes were used for a pilot study. The datasets used in this contribution are meant for a preliminary study to validate the direction of further research.

It was identified and suggested that ensembles are the best architectures to handle the issue of data imbalance on real-time data. Ensemble models such as Bagging, Boosting, Bucket of Models and Stacking were analysed and discussed in terms of their working principles. It was observed that Boosting and Stacking have higher potential for handling data with varying levels of imbalance. Due to the distributed operational nature of these models, they can be easily parallelized; thus, they can also be used effectively on Big Data and in real-time applications.

Chapter 4 proposes a two-phase stacking ensemble model that can be used to handle data imbalances during the process of classification. The proposed model is composed of two major training phases. The first phase uses multiple base learners, each trained on the entire training data. Due to the heterogeneity of the base learners, the rules formulated by each base learner are distinct from one another. The test data is passed to these base learners and predictions are obtained. These predictions form the input data for the second phase: the first-level predictions and the actual class labels are grouped to form the training data for phase 2. The second phase is a meta-learner that uses a strong classifier for predictions, and the predictions provided by the meta-learner are used as the final predictions. This complex learning architecture involving a multi-level training process enables effective handling of data imbalance and hence provides better results.
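
A minimal sketch of this two-phase architecture is given below. It is the editor's illustration using scikit-learn, not the author's exact implementation: the choice of base learners and meta-learner is an assumption, and scikit-learn's StackingClassifier builds the phase-2 training data from cross-validated phase-1 predictions rather than the exact train/test flow described above.

```python
# Hedged sketch: two-phase stacking with heterogeneous phase-1 base
# learners and a phase-2 meta-learner (estimator choices are assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier()),
                ("rf", RandomForestClassifier()),
                ("nb", GaussianNB())],        # phase 1: heterogeneous base learners
    final_estimator=LogisticRegression(),     # phase 2: meta-learner
)
stack.fit(X_tr, y_tr)
print(stack.score(X_te, y_te))
```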

Chapter 5 proposes a parallel heterogeneous voting ensemble model for effectively handling data imbalance during the process of classification. The proposed ensemble uses heterogeneous tree-based base learners, namely Decision Tree, Random Forest and Gradient Boosted Trees, for prediction, and a bagged ensemble is created from these base learners. All three models are intentionally selected to be tree-based in order to maintain a certain level of homogeneity in the rule creation process, and the base learners are chosen such that each model complements the others, resulting in enhanced predictions. The entire training data is passed to the base learners and the learning models are built. When the test data is passed to the ensemble model, it produces a set of predictions for each instance rather than a single prediction. The final voting phase combines these predictions: the prediction receiving the most votes is taken as the final prediction. This model is generic, and the heterogeneity aids in effective reduction of bias and variance in the prediction process.
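
The sketch below illustrates such a heterogeneous voting ensemble with the three tree-based learners named above; it is a hedged approximation by the editor, where scikit-learn's per-estimator parallelism (n_jobs) stands in for the synopsis's actual parallel deployment.

```python
# Hedged sketch: hard-voting ensemble over Decision Tree, Random Forest
# and Gradient Boosted Trees; n_jobs=-1 fits the base learners in parallel.
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.tree import DecisionTreeClassifier

voter = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier()),
                ("rf", RandomForestClassifier(n_jobs=-1)),
                ("gbt", GradientBoostingClassifier())],
    voting="hard",   # majority vote over the per-instance predictions
    n_jobs=-1,
)
voter.fit(X_tr, y_tr)      # X_tr, y_tr as in the previous sketch
print(voter.score(X_te, y_te))
```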

Chapter 6 presents the results and related discussion for the second and third contributions of the research.

Chapter 7 highlights the important conclusions derived from the whole research process and analysis, with suggestions for future research directions.

5. Performance Evaluation

This section analyses the performance of both the two-phase stacking ensemble model and the parallel heterogeneous voting ensemble model. A reasonable comparative analysis has been made to fulfil the objectives framed in this research work.

5.1. Performance evaluation of two-phase stacking ensemble model

Experiments were performed by implementing the ensemble model in Python. Five benchmark datasets with varying numbers of classes, sizes and imbalance levels were considered for analysis.

Table 2: Description of Datasets

Datasets     No. of Instances   No. of Attributes   Imbalance Ratio   No. of Classes   Source
CoverType    38501              10                  13.02             Multi (7)        UCI
Glass5       214                9                   22.81             Binary (2)       KEEL
Wine         4898               11                  25.77             Multi (3)        UCI
Yeast6       1484               8                   39.15             Binary (2)       KEEL
Abalone      4177               8                   115.03            Binary (2)       UCI

The description of the datasets is shown in Table 2; their varied sizes and imbalance levels ensure that the proposed model is evaluated for scalability in terms of both data size and imbalance level. These datasets were used to evaluate the prediction capability of the proposed two-phase stacking ensemble model.

7|Page
Table 3: Performance metrics of two-phase stacking ensemble model

Datasets     AUC       Accuracy   F1-Score   Recall    Precision
CoverType    0.96      0.92021    0.90979    0.92021   0.90768
Glass5       1.0       1.0        1.0        1.0       1.0
Wine         0.98649   0.95556    0.95585    0.95556   0.95803
Yeast6       0.94723   0.99191    0.99584    0.99446   0.99722
Abalone      0.99709   0.99454    0.99709    0.99419   1.0

Table 3 shows the aggregated performance metrics, namely AUC, Accuracy, F1-Score, Recall and Precision. Strong performance across these metrics indicates that the classifier model is effective. It can be observed that, irrespective of data size, imbalance level and number of classes, the model performs effectively, with values greater than 0.9 for all the metrics. This shows the high-performing nature of the classifier model on datasets with varying imbalance levels, and also that the proposed model is scalable.
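
For reference, such aggregated metrics can be computed as in the sketch below (an editor's illustration continuing the earlier sketches; "weighted" averaging is an assumption, since the synopsis does not state which aggregation was used):

```python
# Hedged sketch: aggregated metrics for a fitted model (here the earlier
# `stack`); the averaging scheme is an assumption.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_pred = stack.predict(X_te)
print("Accuracy :", accuracy_score(y_te, y_pred))
print("F1-Score :", f1_score(y_te, y_pred, average="weighted"))
print("Recall   :", recall_score(y_te, y_pred, average="weighted"))
print("Precision:", precision_score(y_te, y_pred, average="weighted"))
print("AUC      :", roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1]))  # binary case
```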

5.2. Performance evaluation of parallel heterogeneous voting ensemble model

The selected datasets shown in Table 2 were used to evaluate the prediction
performance of the proposed parallel heterogeneous voting ensemble model and the
results are shown in Table 4.

Table 4: Performance metrics of parallel heterogeneous voting ensemble

Datasets     AUC       Accuracy   F1-Score   Recall    Precision
CoverType    0.97128   0.92921    0.91390    0.92021   0.90774
Glass5       1.0       1.0        1.0        1.0       1.0
Wine         0.99583   0.96932    0.96082    0.96364   0.95803
Yeast6       0.96898   0.99770    0.99833    0.99802   0.99864
Abalone      1.0       1.0        1.0        1.0       1.0

It can be observed from Table 4 that the aggregated metrics exhibit high performance, greater than 0.9 for all metrics on all the datasets. The results effectively highlight the high predictive capability of the proposed ensemble model.

5.3. Performance Comparison

A comparison of the performances in terms of accuracy, precision and recall is shown in Figures 1, 2 and 3. It is clearly noticed that the parallel heterogeneous voting ensemble model exhibits higher precision and recall levels in most cases compared to the two-phase stacking ensemble model. The two models exhibit equal performance on a few datasets; however, the performance levels remain high (greater than 0.9), demonstrating the effectiveness of both models. Precision refers to the fraction of relevant instances among the retrieved instances, and recall refers to the fraction of relevant instances that have been retrieved from the overall data. High precision and recall imply effective identification of classes, in turn implying high performance.
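
As a hypothetical worked example of these definitions (numbers invented for illustration):

```python
# Precision and recall for a minority class from raw confusion counts.
tp, fp, fn = 90, 10, 5              # true positives, false positives, false negatives
precision = tp / (tp + fp)          # relevant among retrieved -> 0.90
recall    = tp / (tp + fn)          # retrieved among relevant -> ~0.947
print(precision, recall)
```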

[Figure 1: Performance Comparison of Precision Levels. Bar chart of precision per dataset (CoverType, Glass5, Wine, Yeast6, Abalone) for the Two-Phase Stacking and Heterogeneous Voting models.]

[Figure 2: Performance Comparison of Recall Levels. Bar chart of recall per dataset for the Two-Phase Stacking and Heterogeneous Voting models.]

A comparison of the accuracy levels is shown in Figure 3. The heterogeneous voting ensemble model shows higher accuracy on all the datasets compared to the two-phase stacking ensemble model, highlighting its improved performance.

[Figure 3: Performance Comparison of Accuracy. Bar chart of accuracy per dataset for the Two-Phase Stacking and Heterogeneous Voting models.]

Further, the AUC values of the proposed models have been compared with those of the recently developed RHSBoost [5], a boosting-based ensemble model.

[Figure 4: Comparison of AUC with RHSBoost. Bar chart of AUC per dataset for the Heterogeneous Voting, Two-Phase Stacking and RHSBoost models.]

Figure 4 compares the AUC values obtained from both the proposed models with those of RHSBoost; AUC indicates the effectiveness of a model in terms of true prediction levels and false alarm rates. It is observed that the proposed models show higher performance than the RHSBoost model.

6. Conclusion

Performing classification on Big Data is one of the major requirements of the present information revolution era. However, intrinsic issues such as the data imbalance contained in real-time data pose huge challenges for machine learning models. The proposed parallel heterogeneous voting ensemble model aims to effectively handle the imbalance in huge data and provide unbiased results. The first contribution examines the robustness of classifier models in handling datasets with varied levels of imbalance; from this analysis, it is suggested that an ensemble is the best approach to handle data imbalance. The second contribution proposes a two-phase stacking ensemble model as a solution for handling data with huge imbalance. This model was tested on datasets of varied sizes, classes and imbalance levels and was found to exhibit effective results; a comparison with the recently developed RHSBoost ensemble model reveals that the proposed two-phase stacking ensemble model outperforms it. The third contribution proposes a parallel heterogeneous voting ensemble model, which improves the results further and incorporates parallelization to ensure faster training and prediction, making it more suitable for real-time Big Data applications. A major limitation of this model is that it is effective only on data with low to moderate imbalance levels. Future extensions of this work will concentrate on modifying the model to handle high imbalance levels. Further, the number of heterogeneous models and the number of stages in the ensemble could be reduced to lower the computational complexity of the proposed models.

REFERENCES

[1] Mao, W., Wang, J., He, L. and Tian, Y., "Online sequential prediction of imbalance data with two-stage hybrid strategy by extreme learning machine", Neurocomputing, 2017.

[2] Akila, S. and Srinivasulu Reddy, U., "Data Imbalance: Effects and Solutions for Classification of Large and Highly Imbalanced Data", Proceedings of ICRECT, Vol. 16, pp. 28-34, 2016.

[3] López, V., Fernández, A., García, S., Palade, V. and Herrera, F., "An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics", Information Sciences, Vol. 250, pp. 113-141, 2013.

[4] Akila, S. and Srinivasulu Reddy, U., "Modelling a Stable Classifier for Handling Large Scale Data with Noise and Imbalance", IEEE International Conference on Computational Intelligence in Data Science, 2017.

[5] Gong, J. and Kim, H., "RHSBoost: Improving classification performance in imbalance data", Computational Statistics & Data Analysis, Vol. 111, pp. 1-13, 2017.

[6] Galar, M., Fernández, A., Barrenechea, E., Bustince, H. and Herrera, F., "A review on ensembles for the class imbalance problem: bagging, boosting, and hybrid-based approaches", IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), Vol. 42(4), pp. 463-484, 2012.

[7] Akila, S. and Srinivasulu Reddy, U., "Risk based bagged ensemble (RBE) for credit card fraud detection", International Conference on Inventive Computing and Informatics (ICICI), IEEE, 2017.

[8] Liu, Z. et al., "Cost-Sensitive Collaborative Representation Based Classification via Probability Estimation Addressing the Class Imbalance Problem", Artificial Intelligence and Robotics, Springer, pp. 287-294, 2018.

[9] Yu, L. et al., "A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data", Applied Soft Computing, Vol. 69, pp. 192-202, 2018.

[10] Chawla, N.V. et al., "SMOTE: Synthetic minority over-sampling technique", Journal of Artificial Intelligence Research, Vol. 16, pp. 321-357, 2002.

[11] Liu, X.Y., Wu, J. and Zhou, Z.H., "Exploratory undersampling for class-imbalance learning", IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 39(2), pp. 539-550, 2009.

[12] Barandela, R. et al., "The imbalanced training sample problem: Under or over sampling?", Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, pp. 806-814, 2004.

[13] Lee, H.K. and Kim, S.B., "An Overlap-Sensitive Margin Classifier for Imbalanced and Overlapping Data", Expert Systems with Applications, doi:10.1016/j.eswa.2018.01.008, 2018.

[14] Triguero, I., Galar, M., Merino, D., Maillo, J., Bustince, H. and Herrera, F., "Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark", 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 640-647, 2016.

[15] Chen, J., Li, K., Tang, Z., Bilal, K., Yu, S., Weng, C. and Li, K., "A parallel random forest algorithm for big data in a Spark cloud computing environment", IEEE Transactions on Parallel & Distributed Systems, Vol. 1, pp. 1-1, 2017.

[16] Carcillo, F., Dal Pozzolo, A., Le Borgne, Y.-A., Caelen, O., Mazzer, Y. and Bontempi, G., "SCARFF: A scalable framework for streaming credit card fraud detection with Spark", Information Fusion, Vol. 41, pp. 182-194, 2018.

[17] Wu, Z., Lin, W., Zhang, Z., Wen, A. and Lin, L., "An ensemble random forest algorithm for insurance big data analysis", IEEE International Conference on Computational Science and Engineering (CSE) and Embedded and Ubiquitous Computing (EUC), Vol. 1, pp. 531-536, 2017.

List of Publications

[1]. Madasamy, K. and Ramaswami, M., "Performance Evaluation of Word Frequency Count in Hadoop Environment", International Journal of Innovative Research in Science, Engineering and Technology (IJIRSET), Volume 6, Issue 6, June 2017. [UGC Approved Journal]

[2]. Madasamy, K. and Ramaswami, M., "Hadoop-Based Word Count Simulation on Amazon Cloud", International Journal of Innovative Research in Electrical, Electronics, Instrumentation and Control Engineering (IJIREEICE), Volume 5, Issue 7, July 2017. [UGC Approved Journal]

[3]. Madasamy, K. and Ramaswami, M., "A Panorama of Big Data Analytics with Hadoop", International Journal of Computational and Applied Mathematics (IJCAM), Volume 12, Number 1, 2017. [UGC Approved Journal]

[4]. Madasamy, K. and Ramaswami, M., "Data Imbalance and Classifiers: Impact and Solutions from a Big Data Perspective", International Journal of Computational Intelligence Research (IJCIR), Volume 13, Number 9, pp. 2267-2281, 2017. [UGC Approved Journal]

[5]. Madasamy, K. and Ramaswami, M., "Two-Phase Stacking Ensemble to Effectively Handle Data Imbalances in Classification Problems", International Journal of Advanced Research in Computer Science (IJARCS), Volume 9, Issue 1, January 2018. [UGC Approved Journal]

[6]. Madasamy, K. and Ramaswami, M., "Parallel Heterogeneous Voting Ensemble for Effective Classification of Imbalanced Data", International Journal of Innovative Research in Science, Engineering and Technology (IJIRSET), Volume 7, Issue 8, August 2018.

[7]. Madasamy, K. and Ramaswami, M., "Markov Decision on Data Backup Scheduling for Big Data", International Journal of Computational Intelligence and Informatics (IJCII), ISSN: 2349-6363, Vol. 7, No. 4, pp. 207-216, March 2018.

Papers presented in International Conferences

[1]. Madasamy, K. and Ramaswami, M., "Optimal Data Backup Processing Using Markov Decision Process in the Context of Big Data Analytics", International Conference on Mathematical Modeling and Computational Methods in Science and Engineering (ICMMCMSE-2017), February 20-22, 2017, Alagappa University, Karaikudi, Tamilnadu.

[2]. Madasamy, K. and Ramaswami, M., "Performance Evaluation of Ensemble-based Classifier Models of High Dimensional Data", 1st International Conference on Applied Soft Computing Techniques (ICASCT'17), April 22-23, 2017, Kalasalingam University, Krishnankoil, Tamilnadu.
