Class Imbalance Paper
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2016.2619719, IEEE Access
JOURNAL OF IEEEACCESS 1
Abstract—Customer retention is a major issue for various service-based organizations, particularly the telecom industry, where predictive models that observe customer behavior are among the most valuable instruments in the customer retention process and in inferring customers' future behavior. However, the performance of predictive models is greatly affected when the real-world dataset is highly imbalanced. A dataset is called imbalanced if the sample size of one class is very much smaller or larger than that of the other classes. Over/under-sampling is the most commonly used technique for handling the class-imbalance problem (CIP) in various domains. In this study, we survey six well-known sampling techniques and compare their performance: the Mega-Trend Diffusion Function (MTDF), the Synthetic Minority Oversampling Technique (SMOTE), the Adaptive Synthetic sampling approach (ADASYN), Couples Top-N Reverse k-Nearest Neighbor (TRkNN), the Majority Weighted Minority Oversampling Technique (MWMOTE), and the Immune Centroids Oversampling Technique (ICOTE). Moreover, this study also evaluates four rules-generation algorithms (the Learning from Example Module, version 2 (LEM2), Covering, Exhaustive, and Genetic algorithms) using publicly available datasets. The empirical results demonstrate that MTDF and rules generation based on the Genetic algorithm achieved the best overall predictive performance compared to the rest of the evaluated oversampling methods and rule-generation algorithms.

Index Terms—SMOTE, ADASYN, Mega Trend Diffusion Function, Class Imbalance, Rough Set, Customer Churn, mRMR, ICOTE, MWMOTE, TRkNN.

I. INTRODUCTION

[1]). Churn-prone industries such as the telecommunication industry typically maintain customer relationship management (CRM) databases that are rich in unseen knowledge and certain patterns that may be exploited to acquire customer information in time for an industry's intelligent decision-making practice [2].

However, knowledge discovery in such rich CRM databases, which typically contain thousands or millions of customers' records, is a challenging and difficult task. Therefore, many industries must inescapably depend on customer-churn prediction models if they want to remain in the competitive market [1]. As a consequence, several competitive industries have implemented a wide range of statistical and intelligent machine learning (ML) techniques to develop predictive models that deal with customer churn [2].

Unfortunately, the performance of ML techniques is considerably affected by the CIP. The problem of an imbalanced dataset appears when the majority (negative) class has a higher proportion than the positive (minority) class [3], [4]. The skewed (imbalanced) distribution of data in the dataset poses challenges for machine learning and data mining algorithms [3], [5]. This is an area of research focusing on skewed class distributions where the minority class is targeted for classification [6]. Consider a dataset where the imbalance ratio is 1:99 (i.e., 99% of the instances belong to the majority class and 1% to the minority class). A classifier may achieve an accuracy of up to 99% just by ignoring that 1% of minority class instances; however, adopting such an approach will result in
2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
problem have been reported in the literature [2], [4]. These studies can be grouped into the following approaches based on how they deal with the class-imbalance issue: (i) the internal level: construct new methods or update existing ones to emphasize the significance of the positive class; and (ii) the external level: add data in the preprocessing stage, where the class distribution is resampled in order to reduce the influence of the imbalanced class distribution on the classification process. The internal-level approach is further divided into two groups [5]. First, the cost-sensitive approach, which falls between the internal- and external-level approaches, is based on reducing the misclassification cost of the positive (minority) class, leading to a reduction of the overall cost at both levels. Second, the ensemble/boosting approach uses multiple classifiers to follow an idea similar to that of the internal approach. In this study, we have used six well-known advanced oversampling techniques, namely the Mega-Trend Diffusion Function (MTDF), the Synthetic Minority Oversampling Technique (SMOTE), the Adaptive Synthetic sampling approach (ADASYN), the Majority Weighted Minority Oversampling Technique (MWMOTE), the Immune Centroids Oversampling Technique (ICOTE), and Couples Top-N Reverse k-Nearest Neighbor (TRkNN).

SMOTE [6] is commonly used as a benchmark oversampling algorithm [7], [8]. ADASYN is also an important oversampling technique, which improves learning of the sample distribution in an efficient way [9]. MTDF was first proposed by Li et al. [10] and reported improved performance in classifying imbalanced medical datasets [11]. CUBE is another advanced oversampling technique [12], but we have not considered this approach since, as noted by Japkowicz [13], it does not increase the predictive performance of the classifier. The above-mentioned oversampling approaches are used to handle imbalanced datasets and improve the performance of predictive models for customer churn, particularly in the telecommunication sector. In this study, Rough Set Theory (RST) is applied with four different rule-generation algorithms, namely (i) the Learning from Example Module, version 2 (LEM2), (ii) Genetic (Gen), (iii) Covering (Cov), and (iv) Exhaustive (Exh), to observe customer churn behavior, and all experiments are applied on publicly available datasets.

This paper is an extended version of our previous work [14] and makes the following contributions: (i) more datasets are employed to obtain more generalized results for the selected oversampling techniques and rules-generation algorithms; (ii) another well-known oversampling technique, ADASYN, is also used; (iii) a detailed analysis and discussion is provided on the performance of the targeted oversampling techniques (ADASYN, MTDF, SMOTE, MWMOTE, ICOTE, and TRkNN) followed by the rules-generation algorithms (Gen, Cov, LEM2, and Exh); and (iv) a detailed performance evaluation, in terms of balanced accuracy, the imbalance ratio, the area under the curve (AUC), and McNemar's statistical test, is performed to validate the results and avoid any biases. Many comparative studies [2], [7], [11], [15], [16] have already been carried out comparing oversampling and undersampling methods for handling the CIP; however, the proposed study differs from the previous studies in that, in addition to evaluating six oversampling techniques (SMOTE, ADASYN, MTDF, ICOTE, MWMOTE, and TRkNN), we also compare the performance of four rules-generation algorithms (Exh, Gen, Cov, and LEM2). The proposed study focuses on the following research questions (RQ):

• RQ1: Which attributes are most symptomatic in the targeted dataset for the prediction of customer churn?
• RQ2: Which of the oversampling techniques (SMOTE, MTDF, ADASYN, ICOTE, MWMOTE, and TRkNN) is more suitable for creating synthetic samples that not only handle the CIP in a telecommunication-sector dataset but also improve classification performance?
• RQ3: Which of the rule-generation algorithms (Exh, Gen, Cov, and LEM2) is more suitable for RST-based classification of customer churn in an imbalanced dataset?

The remainder of the paper is organized as follows: the next section presents existing work on class imbalance and approaches to handle the CIP. The background study and evaluation measures are explored in Section III. The experiments are detailed in Section IV. Section V explains the results, followed by Section VI, which concludes the paper.

II. HANDLING THE CLASS-IMBALANCE PROBLEM

A. Class imbalance problem (CIP)

This section first explains the CIP in the context of classification, followed by the techniques used for handling the CIP in a dataset and its relationship to potential domains. This section also contains a brief literature review on class imbalance/skewed distribution of samples in datasets. Prior to the overview of handling the CIP, we first address the notion of classification. The aim of classification is to train a classifier on some dataset so that it can correctly classify the unknown classes of unseen objects [4], [17]. If the samples in the dataset are not balanced, there is a great chance that the classification task will produce misleading results.

The CIP exists in many real-world classification settings, including Social Network Services [18], [19], [20], [21], [22], Banks & Financial Services [16], [23], [24], [25], [26], Credit Card Account Services [27], [28], Online Gaming Services [29], [30], Human Resource Management [31], [32], [33], Discussion & Answer Forums [34], Fault Prediction & Diagnosis [35], [11], User Profile Personalization [36], Wireless Networks [37], [38], 5G Future Networks [39], and Insurance & Subscription Services [40], [41], [42]. In a class-imbalance scenario in any application domain, almost all the objects belong to one specific class (the majority class) and far fewer objects are assigned to the other class (the minority class) [26]. Observation of such classification problems shows that training the classifier using conventional classification techniques yields high apparent performance, but the classifier tends to assign all samples to the majority class, which is usually not the desired goal of the classification study [26]. In
contrast, research studies [43], [44] show that the latest machine learning techniques yield low performance when dealing with large imbalanced datasets. Moreover, class imbalance may cause classification approaches to run into difficulties in learning, which eventually results in poor classification performance [15]. Therefore, learning from imbalanced class data has received a tremendous amount of attention from the machine learning and data mining research communities [45]. Figure 1 illustrates difficulties in imbalanced datasets: figure 1(a) describes the overlapping problem and figure 1(b) small disjuncts.

them with the samples population of the majority class [26]. Classification techniques usually produce better performance when the samples of both classes are nearly equally distributed in the dataset.

Figure 2(a) depicts an example of random ignorance of majority class samples, while figure 2(b) replicates the minority class samples.
used to rebalance the distribution of the class data [4], [5], [47], and therefore avoids modification of the learning algorithm.

The data-level approaches are usually used to preprocess the data before training the classifier, after which the data are passed to the learning algorithms [4], [49]. On the other hand, cost-sensitive and internal-level methods depend on the problem and require special knowledge of the targeted classifier along with the domain. Similarly, the cost-sensitive technique has the major problem of defining misclassification costs, which are not usually known or available in the data space [50]. A comparison between oversampling and undersampling has already been performed [2], with the conclusion that oversampling performs best for handling the CIP. It is also reported that the undersampling technique has the major problem of losing classifier performance when some potentially useful samples are discarded from the majority class [46]. For these reasons, our objective is to review in depth the state-of-the-art data-level methods for addressing the binary CIP. The data-level methods have the following advantages: (i) they are independent of the obligation to train the classifier; (ii) they are usually used in the data preprocessing stage of other methods (e.g., ensemble-based and cost-sensitive approaches); and (iii) they can easily be incorporated into other methods (e.g., internal methods) [51].

B. Techniques for handling CIP

Kubat & Matwin [52] addressed the CIP by applying an undersampling technique to the negative class while keeping the original instances of the positive class. They applied the geometric mean (related to the ROC curve) for performance evaluation of classifiers. Burez & Poel [2] reported that random undersampling can improve prediction accuracy compared to boosting techniques, which did not help in their experiments. However, randomly oversampling or resampling the minority class may result in over-fitting. Chawla et al. [6] introduced SMOTE, a novel and now widespread oversampling technique. SMOTE produces new positive observations based on a weighted mean/average of the k nearest neighboring positive observations. It reduces sample inconsistency and creates a correlation between objects of the positive class. The SMOTE oversampling technique was experimentally evaluated on a variety of datasets with various levels of imbalance and different sizes of data. SMOTE with the C4.5 and Ripper algorithms outperformed Ripper's Loss Ratio and Naïve Bayes [6], [45]. Wouter Verbeke et al. [15] illustrated an oversampling technique that simply copies the minority class data and adds it to the training set. They reported that merely oversampling the minority class with the same data (i.e., copied samples) did not show significant improvement in classifier performance. Therefore, they suggested using more appropriate oversampling methods (e.g., SMOTE). On the other hand, Jo & Japkowicz [3] showed that both the decision tree C4.5 and Backpropagation Neural Network (BNN) algorithms degrade classifier performance on small and complex datasets due to class imbalance. They proposed using cluster-based oversampling for small and complex datasets with class imbalance. Ling & Li [53] combined oversampling with undersampling of the minority and majority classes, respectively. They performed different experiments, such as undersampling the data in the negative class followed by oversampling the data in the positive class, and finally combining the oversampled with the under-sampled data. The conclusive results did not show significant improvement. Tang et al. [54] used support vector machines (SVM) and granular computing for handling the CIP through an undersampling technique. They removed noisy data (e.g., redundant or irrelevant data) from the dataset while keeping only those samples that carry maximally relevant information. They found that undersampling can significantly increase classification performance, but also observed that random undersampling might not provide highly accurate classification. On the other hand, G. Wu & Chang [55] performed an experiment showing weak performance of SVM on datasets that suffered from the CIP. Foster Provost [56] empirically observed that, during the classification process with imbalanced data, the number of positive class instances is usually very small. Therefore, the trained classifiers can accurately recognize objects of the negative class but not of the positive class. The reason is that the positive class cannot contribute as much as the negative class; consequently, the misclassification of instances belonging to the minority class cannot be reduced under the CIP. Batista et al. [48] introduced a comparative investigation of various sampling schemes (i.e., the Edited Nearest Neighbor rule, or ENN, and SMOTE) to balance the training set. They removed redundant or irrelevant data from the training process, which improved the mean number of induced rules and increased the performance of SMOTE+ENN. In connection with this work, another sampling technique was proposed by H. Guo [57] for handling the CIP. He modified the existing DataBoost procedure, and it performed much better than SMOTEBoost [49].

Haibo He et al. [9] introduced the ADASYN oversampling algorithm, an extension of the SMOTE algorithm. They reported that the ADASYN algorithm can self-decide the number of artificial data samples that need to be produced for the minority class. They also found that ADASYN not only provides a balanced data distribution but also forces the learning algorithm to focus on complex samples in the dataset. In contrast, the SMOTE algorithm [6] generates equal numbers of artificial data points for all minority samples, while the DataBoost-IM [57] algorithm generates different weights for changed minority class samples to compensate for the skewed data distribution. However, in their study, ADASYN produced more efficient results than SMOTE and DataBoost-IM.

III. BACKGROUND: OVERSAMPLING TECHNIQUES AND EVALUATION METRICS

This section presents a study of the six well-known oversampling techniques (i.e., MTDF, SMOTE, ADASYN, MWMOTE, TRkNN and ICOTE), the feature selection algorithm (i.e., mRMR), and Rough Set Theory (RST).

A. Mega-Trend-Diffusion Function (MTDF)

Der-Chiang Li et al. [10] introduced MTDF, a procedure to facilitate the systematic estimation of domain samples. It
B. SMOTE

To overcome the issue of over-fitting and extend the decision area of the positive class samples, a novel technique, SMOTE ("Synthetic Minority Oversampling TEchnique"), was introduced by Chawla et al. [45]. This technique produces artificial samples in feature space rather than data space. It oversamples the positive class by creating artificial data instead of using replacement or randomized sampling techniques. It was the first technique to introduce new samples into the learning dataset in order to enhance the data space and counter the scarcity in the distribution of samples [45]. Oversampling is a standard procedure in the classification of imbalanced data (e.g., of the minority class) [7], and it has received considerable attention from machine learning researchers in the recent decade. The pseudocode and details of the SMOTE algorithm can be found in [45].

E. TRkNN

TRkNN was originally proposed by Tsal & Yu [60] in 2016 to overcome the CIP and improve the accuracy rate in predicting the samples of the majority and minority classes. The TRkNN algorithm also addresses the issue of noisy and borderline samples in the CIP. The advantage of TRkNN is that it avoids the production of unnecessary minority examples. For an in-depth study, see the original work of Tsal et al. [60].

F. ICOTE

Xusheng et al. [61] introduced another oversampling technique, ICOTE, in 2015 to improve classification performance under the CIP. ICOTE is based on an immune network; it produces a set of immune centroids to broaden the decision space of the positive class. In this algorithm, the immune
network is used to produce artificial samples on clusters with high data densities, and these immune centroids are considered synthetic examples that resolve the CIP. For an in-depth study, see the original work of Xusheng et al. [61].

G. Rough Set Theory (RST)

RST [62] was initially proposed by Pawlak in 1982 as a mathematical tool to address ambiguity. The RST philosophy is centered on the assumption that there is information (knowledge, data) associated with every instance in the universe of discourse. RST rests on the precise idea of rough set approximation (i.e., LB = lower bound and UB = upper bound) and the boundary region (BR). The BR separates the LB from the UB (i.e., the boundary line). For example, samples that cannot be classified with certainty are members of the UB but not of the LB. It is difficult to characterize such borderline samples due to the unavailability of clear knowledge about these elements. Therefore, any rough concept is replaced by either the LB or UB approximation of the vague concept [58–60]. Mathematically, the concepts of LB, UB and BR are defined as follows: let X ⊆ U and let B induce an equivalence relation (i.e., a partition of the universe U into subsets whose members share the same attribute values) in the information system IS = (U, B), where U is a non-empty finite universe of objects and B is a set of features. Then

LB = ∪{ Y ∈ U/B : Y ⊆ X }

is the lower approximation, whose members certainly belong to X, while

UB = ∪{ Y ∈ U/B : Y ∩ X ≠ ∅ }

is the upper approximation, whose members may belong to X. The boundary region is BR = UB − LB. A detailed study can be found in [62], [63], [64], [65].

H. Rules Generation

Decision rules are often denoted as "IF C THEN D", where D represents the decision feature and C is the set of conditional attributes in the decision table [63]. Given two unary predicate formulae α(χ) and β(χ), where χ ranges over a finite set U, Łukasiewicz (1913) assigned to α(χ) the value card(‖α(χ)‖)/card(U), where ‖α(χ)‖ = {χ ∈ U : χ satisfies α}, while the fractional value assigned to the implication α(χ) ⇒ β(χ) is card(‖α(χ) ∧ β(χ)‖)/card(‖α(χ)‖), under the assumption that ‖α(χ)‖ ≠ ∅.

The decision rules can easily be built by overlaying the reduct sets on the IS. Mathematically, a rule can be represented as (a_i1 = v_1) ∧ ... ∧ (a_ik = v_k) ⇒ d = v_d, where 1 ≤ i_1 < ... < i_k ≤ m and v_i ∈ V_ai; for simplicity it can be represented as the IF-THEN statement "IF C THEN D", where C is the set of conditions and D is the decision part. To retrieve the decision rules, the following well-known rule-generation algorithms are used [65]:

• Exhaustive Algorithm (Exh): It takes subsets of attributes incrementally and then returns the reduced set and minimal decision rules. The generated decision rules are those with minimal descriptors in the conditional attributes. It requires particular care due to the extensive computation needed in the case of a large and complex Boolean reasoning method [66].
• Genetic Algorithm (Gen): This method depends on an order-based genetic algorithm combined with a heuristic. It is applied to minimize the computational cost in complex and large IS [64], [67].
• Covering Algorithm (Cov): It is a modified implementation of the Learning from Example Module, version 1 (LEM1) algorithm, deployed in the Rough Set Exploration System (RSES) as a rule-generation method. It was presented by Jerzy Grzymala [68].
• RSES LEM2 Algorithm (LEM2): It is a divide-and-conquer method coupled with the RST approximation; it depends on the local covering determination of every instance from the decision attribute [68], [69].

I. Evaluation Measures

It may not be possible to construct a classifier that perfectly classifies all the objects of the validation set [42], [2]. To evaluate classification performance, we calculate the counts of TP (true positives), FN (false negatives), FP (false positives) and TN (true negatives). An FP belongs to N (negative) but is incorrectly classified as P (positive); an FN actually belongs to P but is incorrectly classified as N. The class totals can be formulated as P = TP + FN and N = TN + FP. The following evaluation measures are used for performance validation of the proposed approach.

• Regular Accuracy (RA): It is a measure that calculates the classifier's overall accuracy. It is formulated as:

RA = (TP + TN) / (P + N)    (13)

• Sensitivity (Recall): It is the proportion of positive cases that are correctly classified as true positives, calculated as:

Recall = TP / P    (14)

• Precision: It is the fraction of predicted positive instances that are correctly characterized as churned. Formally, it can be expressed as:

Precision = TP / (TP + FP)    (15)

• F-Measure: It is the harmonic mean of precision and recall. A high F-measure value indicates that both precision and recall are reasonably high; it can also be considered a weighted average of recall and precision.

F-Measure = 2 × (Recall × Precision) / (Recall + Precision)    (16)

• Coverage: The ratio of objects from a class that are recognized by a classifier to the total number of instances in the class, where C is a classifier, A is a decision table, and Match_A(C) is the subset of objects in A that are classified by C.

Coverage_A(C) = |Match_A(C)| / ‖A‖    (17)

• Mutual Information (MI): It measures how much an attribute's value contributes to creating
TABLE III: The a and b values of the MTDF function for each attribute

Feature    Dataset 1 (a, b)      Dataset 2 (a, b)      Dataset 3 (a, b)      Dataset 4 (a, b)
F1         28.4636, 44.6562      0.0000, 1.0000        -30, 56.88            0.873547, 6.655153
F2         0.0000, 1.0000        380.069, 393.496      3.45, 15.58           2.412609, 8.02709
F3         173.517, 210.653      91.9169, 127.634      10.43, 20.17          0.361109, 4.892115
F4         0.0000, 1.0000        15.2519, 29.8818      308.83, 426.43        0.881878, 6.282582
F5         111.32, 138.139       2409.06, 2443.80      22.93, 36.07          35.89648, 63.77696
F6         207.681, 239.084      0.0000, 1.0000        3552064, 3263292      0.348405, 5.293856
F7         2.72140, 9.83579      2403.23, 2437.72      5444049, 5451211      0.350999, 5.211826
F8         2.0746, 5.80143       1292.73, 1318.407     37.58, 576.24         0.388244, 4.034051
F9         9.2768, 16.4491       98.0268, 103.876      6155159, 6168114      0.348485, 5.291311
F10        15.665, 24.8207       7646.24, 7687.29      105.22, 136.59        0.410333, 3.335514
TABLE IV: Attributes for the decision table using Dataset 1

Sets                    Description
Number of Objects       {5700 distinct objects}
Conditional Attributes  {Intl Plan, Day charges, Day mins, Custserv calls, Intl charges, Eve mins, Account length, Vmail plan}
Decision Attribute      {Churn}

TABLE V: Attributes for the decision table using Dataset 2

Sets                    Description
Number of Objects       {38162 distinct objects}
Conditional Attributes  {Var ar flag, Avg call intran, Tot usage days, Avg usage days, Avg call, Highend program flag, Avg call local, Avg call ob, Std vas,arc, Avg mins}
Decision Attribute      {Churn}

TABLE VI: Attributes for the decision table using Dataset 3

Sets                    Description
Number of Objects       {38162 distinct objects}
Conditional Attributes  {Var126, Var72, Var7, Var189, Var65, Var113, Var133, Var16, Var153, Var73}
Decision Attribute      {Churn}

TABLE VII: Attributes for the decision table using Dataset 4

Sets                    Description
Number of Objects       {7043 distinct objects}
Conditional Attributes  {Contacts, PaymentMethod, Dependents, DeviceProtection, Tenure, Gender, PaperlessBilling, SeniorCitizen, Partner, PhoneService}
Decision Attribute      {Churn}
Once the range of individual attributes (i.e., a and b) is defined by using MTDF, the next step is to generate the artificial samples using the above-mentioned oversampling technique, and then to prepare the decision tables for the base classifier (i.e., rough set theory). Tables IV, V, VI and VII represent the prepared decision tables for datasets 1, 2, 3 and 4, respectively.

These tables contain the objects, the conditional attributes whose conditions need to be fulfilled, and the decision attributes. For each experiment (i.e., the original population extended by the MTDF, SMOTE, ADASYN, MWMOTE, ICOTE and TRkNN oversampling techniques), the same structure of decision tables was used with the four different datasets.

The cut and discretization process is an important approach to reduce the dataset horizontally in order to handle large data efficiently. It is a common approach in RST, where attributes that contain continuous values are split into a finite number of intervals [47]. The cut and discretization processes were carefully applied to the prepared decision tables. Cuts were made at every iteration so as to minimize the number of cuts to the data, in light of the recommendation in [65]. For example, the cuts of the attribute "Day Min" in dataset 1 were grouped after the discretization process as listed in Table VIII. The first column denotes the groups, represented by numbers for simplicity and listed in ascending order. The second column gives the intervals obtained after the discretization process. The third column is the count of the attribute's values that fall into each group, while the last column is the percentage of the attribute's values in each interval. It is clear from Table VIII that the range of Day Min has been changed from its continuous nature into 14 different intervals or groups after the cut and discretization process.

TABLE VIII: Cuts distribution of attribute Day Min

Group#  Interval            Count  Percentage
1       {263.55, 281.05}    115    3.45%
2       {151.05, 163.45}    295    8.85%
3       {237.85, 251.85}    158    4.74%
4       {281, **}           103    3.09%
5       {163.45, 178.15}    339    10.17%
6       {217.65, 237.85}    337    10.11%
7       {178.15, 189.25}    235    7.05%
8       {251.85, 263.55}    95     2.85%
9       {108.8, 151.05}     687    20.61%
10      {195.45, 208.85}    302    9.06%
11      {189.25, 195.45}    168    5.04%
12      {*, 78.65}          109    3.27%
13      {78.65, 108.8}      202    6.06%
14      {208.85, 217.65}    188    5.64%
**=Maximum number, *=Minimum number

C. Training and Validation Sets

In data mining, training and validation is an extremely important step. Once the training process is finished, the validation step is performed to confirm the performance of the predictive models on known and unknown objects [77]. To perform the validation process, the following procedure is applied: (i) some of the data is excluded from the training set that is used for the learning process of the classifier; (ii) when the training process is finished and the classifier is considered trained, the excluded samples are used to validate the results of the trained classifier on unseen data samples. This procedure is known as cross-validation. K-fold cross-validation is applied in this study to avoid bias during the validation of methods.

V. RESULTS AND DISCUSSION

In this section, the four rules-generation algorithms using RST are considered on datasets 1, 2, 3 and 4, which were expanded using MTDF, SMOTE, ADASYN, ICOTE, MWMOTE and TRkNN to handle the CIP. Tables IX, X, XI and XII reflect the performance of the classifiers through the evaluation measures: they show the performance of the various oversampling techniques (i.e., MTDF, SMOTE, ADASYN, MWMOTE, ICOTE and TRkNN) on publicly available telecommunication datasets, applying a classification process based on rough set theory and four rules-generation algorithms (i.e., Gen, Exh, Cov and LEM2). It can be observed from Tables IX-XII that the Cov and LEM2 rules-generation algorithms achieved the highest performance compared to the Gen and Exh algorithms, but these two (i.e., Cov and LEM2) did not cover all the instances available in the dataset. Therefore, all the underlined values corresponding to the targeted oversampling techniques are excluded from further analysis in this study, while the bold values reflect the best-performing techniques in our empirical environment.

To report RQ2 and RQ3, the results reflected in Tables IX, X, XI and XII were thoroughly analyzed to cover each relevant detail. It is observed that both algorithms (i.e., Cov and LEM2) show higher accuracy but do not deliver full coverage (e.g., the underlined values do not follow from full coverage). The coverage of a learning algorithm is the number of samples or instances that can be learned by that algorithm from samples of a given size for a given accuracy [78].

On the other hand, both the Exh and Gen algorithms covered all instances in the given datasets (i.e., datasets 1, 2, 3 and 4). It can be observed in Figure 4 that both the Cov and LEM2 algorithms achieved higher accuracy compared to the Exh and Gen algorithms, but these results do not follow from full object coverage. These results (i.e., those not fully covering the dataset objects) were therefore ignored and are not shown in Figures 5 (a), (b), (c) and (d) for datasets 1, 2, 3 and 4, respectively. Figure 5 describes the performance of the oversampling algorithms (i.e., MTDF, SMOTE, ADASYN, ICOTE, MWMOTE and TRkNN) on the datasets that were prepared and balanced using these algorithms. It is observed from the results that the Gen and Exh algorithms performed better than the other two. Although the Gen and Exh algorithms achieved similar performance, the Gen algorithm acquired higher accuracies of 0.929, 0.862, 0.982 and 0.813 with 100% object coverage on datasets 1, 2, 3 and 4, respectively.
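The distinction between coverage and accuracy used above can be made concrete with a small sketch: a rule-based classifier may abstain on objects that no rule matches, so accuracy is computed only over the covered objects. The rule outcomes and data below are purely illustrative, not the study's actual rules or tooling.

```python
def evaluate(predictions, labels):
    """Compute coverage and accuracy for a classifier that may abstain.
    predictions: list where None means 'no rule matched this object'."""
    covered = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    coverage = len(covered) / len(labels)          # fraction of objects covered
    correct = sum(1 for p, y in covered if p == y)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# Illustrative: 8 objects, classifier abstains on two of them.
labels = ["churn", "stay", "stay", "churn", "stay", "stay", "churn", "stay"]
preds  = ["churn", "stay", None,   "churn", "stay", None,   "stay",  "stay"]

coverage, accuracy = evaluate(preds, labels)
print(coverage, accuracy)   # 0.75 and 5/6: high accuracy, incomplete coverage
```

This is why a high-accuracy rule set (like Cov or LEM2 here) can still be excluded from the comparison: its accuracy is measured over fewer objects than a full-coverage algorithm such as Gen or Exh.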
Fig. 4: Coverage of objects and accuracy of techniques on all datasets (1-4), where (a), (b), (c) and (d) represent the accuracy and coverage of the selected algorithms on datasets 1, 2, 3 and 4, respectively. The list of algorithms is given on the x-axis, while the y-axis reflects the number of samples (instances).

Fig. 5: The positive predictive value (precision) and sensitivity (recall), followed by the F-measure (the weighted harmonic mean of precision and recall), evaluate the classifiers' performance on datasets 1-4, where (a), (b), (c) and (d) reflect the preciseness and robustness of the targeted algorithms. The F-measure reaches its worst value at 0 and its best value at 1.

The F-measure and MI can give the best result. The MI value of Gen M is larger than that of the other targeted algorithms. It describes the correct ranking, through the MI measure, between the projected and positive churn in all datasets (i.e., datasets 1, 2, 3 and 4).
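As a reminder of how the measures in Fig. 5 combine, precision, recall and the F-measure can be computed from raw counts. This is a generic sketch; the counts below are made up for illustration and are not taken from the paper's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN); the F-measure is
    their harmonic mean, ranging from 0 (worst) to 1 (best)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative churn counts: 40 true positives, 10 false alarms, 20 misses.
p, r, f = precision_recall_f1(tp=40, fp=10, fn=20)
print(p, r, f)   # roughly 0.8, 0.667, 0.727
```

Because the F-measure is a harmonic mean, it is pulled toward the weaker of precision and recall, which is why it is preferred over plain accuracy on imbalanced churn data.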
[9], was applied. ADASYN uses a weighted distribution for the positive-class samples, reducing the bias in the generated synthetic samples [79]. The ADASYN algorithm also did not achieve sufficiently high performance (e.g., an average performance, based on the obtained AUC values, of 0.92, 0.85, 0.93 and 0.80 on datasets 1, 2, 3 and 4, respectively), due to the interpolation between positive-class samples and their randomly chosen same-class neighbors for producing the artificial samples. According to Rashma et al. [79], both algorithms (i.e., ADASYN and SMOTE) sometimes become inappropriate and cannot provide the required amount of useful synthetic positive-class samples. Finally, it is summarized from all the simulation results given in Tables IX-XIII, based on the various evaluation measures (detailed in Section III), that MTDF performed better as a whole than the rest of the applied oversampling techniques (i.e., SMOTE, ADASYN, MWMOTE, ICOTE and TRkNN) on the targeted domain datasets (i.e., Tables I and II). It is also observed that MTDF can solve the CIP for small datasets very efficiently, whereas the other oversampling techniques needed a considerable number of samples during the learning process to approximate the real function. Consequently, the existing number of samples in the used datasets created difficulty on small datasets for the average-performing (i.e., SMOTE, ADASYN and ICOTE) and worst-performing (i.e., MWMOTE and TRkNN) algorithms in this study. In order to completely cover the information gap in the datasets (e.g., Tables I and II) and avoid the overestimation of samples, MTDF more appropriately substituted the required samples with the help of both data-trend estimation and mega diffusion [80]. On the other hand, MTDF is based on the normal distribution, which is a compulsory condition in statistical data analysis [11]. Therefore, MTDF is the best technique for systematically estimating the existing samples or data in the dataset.

C. Statistical Test

In order to compare different classifiers and algorithms, this study supports our comparative assumptions through statistical evidence. For this work, a non-parametric test was used to provide statistical comparisons of some classifiers, according to the recommendations suggested by [4], [70], [81]. The reasons why this study uses a non-parametric statistical test are as follows [81], [82]: (i) these tests can handle both normally and non-normally distributed data, while parametric tests usually apply to normally distributed data only; (ii) non-parametric tests guarantee the reliability of the parametric test; and (iii) parametric tests are more likely to reject the null hypothesis than non-parametric tests unless their assumptions are violated, and these assumptions may not be satisfied, causing the statistical analysis to lose its credibility. Therefore, Janez Demsar [81] recommended using non-parametric tests rather than parametric tests. Furthermore, McNemar's test [83] was applied to evaluate the classifiers' performance by comparing the results of the best-performing algorithms (i.e., Gen M, Gen S, Gen A, Gen I, Gen W and Gen T).

Under H0 (the null hypothesis), different algorithms should have the same error rate, which means that classifier A = classifier B. McNemar's test is based on the chi-square (χ2) goodness-of-fit distribution, comparing the distribution of the counts expected under H0 to the observed values; it suffices to reject H0 in favor of the hypothesis that the algorithms have different performance when trained on the targeted dataset. McNemar's test value can be calculated using the formula given in equation (22) [83], [84]:

M = (|n01 − n10|)^2 / (n01 + n10) > χ2(1,α)    (22)

The probability that the quantity M is larger than χ2(1,0.95) = 3.841459 is less than 0.05 for a 95% confidence test with 1 degree of freedom [84]. If the null hypothesis were true, then Gen M = Gen S = Gen A under the chi-square statistic with 1 degree of freedom. In case H0 is correct, the probability that this quantity is greater than χ2(1,0.95) = 3.841459 is less than 0.05 for a 95% confidence test [84]. McNemar's test value was calculated using the formula discussed in Section 3.2 and equation (22). We can therefore reject the H0 that these classifiers have the same error rate, in favor of the hypothesis that these algorithms have different performance when trained on the same datasets (i.e., datasets 1, 2, 3 and 4). Table XIV reflects the performance of the classifiers.

TABLE XIV: p-value (P) and M values obtained from McNemar's test

Algorithms       Dataset 1   Dataset 2   Dataset 3   Dataset 4   Avg. Values
Gen M    P       0.012       0.078       0.003       0.0008      0.0234
         M       7.04878     3.84845     9.30769     8.27944     7.01359
Gen S    P       0.542       0.199       0.41        0.656       0.5092
         M       0.58139     1.86452     0.92452     0.05581     0.8565
Gen A    P       0.661       0.326       0.322       0.0007      0.32742
         M       0.0088      0.0127      -0.0117     8.27751     2.07183
Gen W    P       0.0001      0.9000      0.9980      0.851       0.79975
         M       7.70916     0.03937     0.0397      0.29388     0.27044
Gen I    P       0.018       0.235       0.508       0.9800      0.43525
         M       7.2867      1.52039     0.73529     0.01207     2.38861
Gen T    P       0.526       0.815       0.851       0.473       0.66625
         M       0.59346     0.17234     0.08707     0.59838     0.36281

An "M" value greater than the chi-square statistic χ2 = 3.841459 indicates rejection of the null hypothesis (shown underlined) with 95% confidence and 1 degree of freedom, while a "P" value lower than 0.05 (shown underlined) indicates that the performance difference between the classifiers is statistically significant. The overall best average performance among the classifiers is shown in bold.

D. Threats to Validity

• Open source tools and public datasets: To investigate the performance of the proposed solution, this study used four publicly available datasets from different sources related to the telecom sector. Open-source tools were also used for the evaluation and classification process; therefore, the results may not be generalizable to closed-source
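The M values reported in Table XIV follow equation (22) from the disagreement counts of two classifiers. A minimal sketch (the n01/n10 counts below are illustrative; note that [84] also gives a continuity-corrected variant that subtracts 1 in the numerator, omitted here to match equation (22)):

```python
def mcnemar_m(n01, n10):
    """McNemar's statistic from equation (22):
    n01 = cases only classifier A got wrong,
    n10 = cases only classifier B got wrong."""
    if n01 + n10 == 0:
        return 0.0
    return abs(n01 - n10) ** 2 / (n01 + n10)

CHI2_1_095 = 3.841459   # chi-square critical value, 1 dof, alpha = 0.05

# Illustrative disagreement counts between two classifiers.
m = mcnemar_m(n01=25, n10=7)
print(m, m > CHI2_1_095)   # 10.125 True: reject H0 at 95% confidence
```

Only the cases where the two classifiers disagree enter the statistic; objects both classify identically carry no information about which one is better.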
[21] P. D. Kusuma, D. Radosavljevik, F. W. Takes, and P. van der Putten, "Combining customer attribute and social network mining for prepaid mobile churn prediction," Proc. the 23rd Annual Belgian Dutch Conference on Machine Learning (BENELEARN), 2013, pp. 50–58.
[22] W. Jingjing, J. Chunxiao, and Q. e. a. Tony, "The value strength aided information diffusion in socially-aware mobile networks," Special section on recent advances in socially aware mobile networking, IEEE Access, vol. 4, pp. 3907–3919, Aug 2016.
[23] U. D. Prasad and S. Madhavi, "Prediction of churn behavior of bank customers using data mining tools," Business Intelligence Journal, vol. 5, no. 1, pp. 96–101, 2012.
[24] K. Chitra and B. Subashini, "Customer retention in banking sector using predictive data mining technique," ICIT 2011 The 5th International Conference on Information Technology, 2011.
[25] J. Bloemer, K. De Ruyter, and P. Peeters, "Investigating drivers of bank loyalty: the complex relationship between image, service quality and satisfaction," International Journal of Bank Marketing, vol. 16, no. 7, pp. 276–286, 1998.
[26] M. A. H. Farquad, V. Ravi, and S. B. Raju, "Churn prediction using comprehensible support vector machine: An analytical CRM application," Applied Soft Computing, vol. 19, pp. 31–40, 2014.
[27] C.-S. Lin, G.-H. Tzeng, and Y.-C. Chin, "Combined rough set theory and flow network graph to predict customer churn in credit card accounts," Expert Systems with Applications, vol. 38, no. 1, pp. 8–15, 2011.
[28] K. Lee, N. Chung, and K. Shin, "An artificial intelligence-based data mining approach to extracting strategies for reducing the churning rate in credit card industry," Journal of Intelligent Information Systems, vol. 8, no. 2, pp. 15–35, 2002.
[29] M. Suznjevic, I. Stupar, and M. Matijasevic, "MMORPG player behavior model based on player action categories," Proceedings of the 10th Annual Workshop on Network and Systems Support for Games. IEEE Press, 2011, p. 6.
[30] J. Kawale, A. Pal, and J. Srivastava, "Churn prediction in MMORPGs: A social influence based approach," in International Conference on Computational Science and Engineering, 2009. CSE'09, vol. 4. IEEE, 2009, pp. 423–428.
[31] M. L. Kane-Sellers, "Predictive models of employee voluntary turnover in a North American professional sales force using data-mining analysis," 2007.
[32] V. V. Saradhi and G. K. Palshikar, "Employee churn prediction," Expert Systems with Applications, vol. 38, no. 3, pp. 1999–2006, 2011.
[33] M. Saron and Z. A. Othman, "Academic talent model based on human resource data mart," International Journal of Research in Computer Science, vol. 2, no. 5, p. 29, 2012.
[34] G. Dror, D. Pelleg, O. Rokhlenko, and I. Szpektor, "Churn prediction in new users of Yahoo! Answers," Proceedings of the 21st International Conference on World Wide Web, ACM, 2012, pp. 829–834.
[35] M. Jaudet, N. Iqbal, and A. Hussain, "Neural networks for fault-prediction in a telecommunications network," Multitopic Conference, 2004. Proceedings of INMIC 2004. 8th International. IEEE, 2004, pp. 315–320.
[36] A. Hawalah and M. Fasli, "Dynamic user profiles for web personalisation," Expert Systems with Applications, vol. 42, no. 5, pp. 2547–2569, 2015.
[37] N. Ahad, J. Qadir, and N. Ahsan, "Neural networks in wireless networks: Techniques, applications and guidelines," Journal of Network and Computer Applications, vol. 68, pp. 1–27, 2016.
[38] Z. L. Cheng Fang, Jun Liu, "Fine-grained HTTP web traffic analysis based on large-scale mobile datasets," IEEE Access, vol. 4, pp. 4364–4373, Aug 2016.
[39] T. S. Rappaport, S. Sun, R. Mayzus, H. Zhao, Y. Azar, K. Wang, G. N. Wong, J. K. Schulz, M. Samimi, and F. Gutierrez, "Millimeter wave mobile communications for 5G cellular: It will work!" IEEE Access, vol. 1, pp. 335–349, 2013.
[40] R. A. Soeini and K. V. Rodpysh, "Applying data mining to insurance customer churn management," International Proceedings of Computer Science and Information Technology, vol. 30, pp. 82–92, 2012.
[41] K. Coussement and D. Van den Poel, "Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques," Expert Systems with Applications, vol. 34, no. 1, pp. 313–327, 2008.
[42] J. Burez and D. Van den Poel, "CRM at a pay-TV company: Using analytical models to reduce customer attrition by targeted marketing for subscription services," Expert Systems with Applications, vol. 32, no. 2, pp. 277–288, 2007.
[43] N. Chawla, N. Japkowicz, and A. Kolcz, "Special issue on learning from imbalanced datasets," ACM SIGKDD Explorations. ACM SIGKDD, 2004.
[44] S. Visa and A. Ralescu, "Issues in mining imbalanced data sets: a review paper," vol. 2005, pp. 67–73, 2005.
[45] N. V. Chawla, "Data mining for imbalanced datasets: An overview," in Data Mining and Knowledge Discovery Handbook. Springer, 2005, pp. 853–867.
[46] G. M. Weiss, "Mining with rarity: a unifying framework," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 7–19, 2004.
[47] F. He, X. Wang, and B. Liu, "Attack detection by rough set theory in recommendation system," in Granular Computing (GrC), 2010 IEEE International Conference on. IEEE, 2010, pp. 692–695.
[48] G. E. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 20–29, 2004.
[49] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: Improving prediction of the minority class in boosting," in European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 2003, pp. 107–119.
[50] B. Liu, Y. Ma, and C. K. Wong, "Improving an association rule based classifier," in European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 2000, pp. 504–509.
[51] K. Satyasree and J. Murthy, "An exhaustive literature review on class imbalance problem," Int. J. Emerg. Trends Technol. Comput. Sci., vol. 2, pp. 109–118, 2013.
[52] M. Kubat, S. Matwin et al., "Addressing the curse of imbalanced training sets: one-sided selection," in ICML, vol. 97. Nashville, USA, 1997, pp. 179–186.
[53] C. X. Ling and C. Li, "Data mining for direct marketing: Problems and solutions," in KDD, vol. 98, 1998, pp. 73–79.
[54] Y. Tang, S. Krasser, D. Alperovitch, and P. Judge, "Spam sender detection with classification modeling on highly imbalanced mail server behavior data," in Artificial Intelligence and Pattern Recognition, 2008, pp. 174–180.
[55] G. Wu and E. Y. Chang, "Kernel boundary alignment considering unbalanced data distribution," IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 786–796, 2006.
[56] F. Provost, "Machine learning from imbalanced data sets 101," Proceedings of the AAAI 2000 Workshop on Imbalanced Data Sets, 2000, pp. 1–3.
[57] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30–39, 2004.
[58] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[59] S. Barua, M. M. Islam, X. Yao, and K. Murase, "MWMOTE: majority weighted minority oversampling technique for imbalanced data set learning," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 2, pp. 405–425, 2014.
[60] M.-F. Tsai and S.-S. Yu, "Distance metric based oversampling method for bioinformatics and performance evaluation," Journal of Medical Systems, vol. 40, no. 7, pp. 1–9, 2016.
[61] X. Ai, J. Wu, V. S. Sheng, P. Zhao, Y. Yao, and Z. Cui, "Immune centroids over-sampling method for multi-class classification," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2015, pp. 251–263.
[62] Z. Pawlak, "Rough sets," International Journal of Computer & Information Sciences, vol. 11, no. 5, pp. 341–356, 1982.
[63] Z. Pawlak and A. Skowron, "Rough sets and conflict analysis," in E-Service Intelligence. Springer, 2007, pp. 35–74.
[64] J. G. Bazan, H. S. Nguyen, S. H. Nguyen, P. Synak, and J. Wróblewski, "Rough set algorithms in classification problem," in Rough Set Methods and Applications. Springer, 2000, pp. 49–88.
[65] J. G. Bazan and M. Szczuka, "The rough set exploration system," in Transactions on Rough Sets III. Springer, 2005, pp. 37–56.
[66] H. Nguyen and S. Nguyen, "Analysis of STULONG data by rough set exploration system (RSES)," in Proceedings of the ECML/PKDD Workshop, 2003, pp. 71–82.
[67] J. Wróblewski, "Genetic algorithms in decomposition and classification problems," in Rough Sets in Knowledge Discovery 2. Springer, 1998, pp. 471–487.
[68] J. W. Grzymala-Busse, "A new version of the rule induction system LERS," Fundamenta Informaticae, vol. 31, no. 1, pp. 27–39, 1997.
[69] J. W. Grzymala-Busse, "LERS: a system for learning from examples based on rough sets," in Intelligent Decision Support. Springer, 1992, pp. 3–18.
[70] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, "An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics," Information Sciences, vol. 250, pp. 113–141, 2013.
[71] "Data Source," http://www.sgi.com/tech/mlc/db/, Aug. 2015, [Online; accessed 1-Aug-2015].
[72] "Data Source," http://lamda.nju.edu.cn/yuy/dm07/assign2.htm, Oct. 2015, [Online; accessed 15-Oct-2015].
[73] "KDD'06 Challenge Dataset," http://www3.ntu.edu.sg/sce/pakdd2006/, Oct. 2015, [Online; accessed 15-Oct-2015].
[74] "IBM Telecom Dataset," https://www.ibm.com/analytics/watson-analytics/community/predictive-insights-in-the-telco-customer-churn-data-set/, Jan. 2016, [Online; accessed 1-Jan-2016].
[75] G. Holmes, A. Donkin, and I. H. Witten, "Weka: A machine learning workbench," in Intelligent Information Systems, 1994. Proceedings of the 1994 Second Australian and New Zealand Conference on. IEEE, 1994, pp. 357–361.
[76] Microsoft, "Standardize Function," http://office.microsoft.com/en-001/excel-help/standardize-function-HP010342919.aspx, 2014, [Online; accessed 9-Nov-2014].
[77] R. Bellazzi and B. Zupan, "Predictive data mining in clinical medicine: current issues and guidelines," International Journal of Medical Informatics, vol. 77, no. 2, pp. 81–97, 2008.
[78] H. S. Almuallim, "Concept coverage and its application to two learning tasks," 1992.
[79] M. R. K. Dhurjad and M. S. Banait, "A survey on oversampling techniques for imbalanced learning," International Journal of Application or Innovation in Engineering and Management, vol. 3, no. 1, pp. 279–284, 2014.
[80] N. H. Ruparel, N. M. Shahane, and D. P. Bhamare, "Learning from small data set to build classification model: A survey," IJCA Proceedings on International Conference on Recent Trends in Engineering and Technology 2013, vol. ICRTET, no. 4, pp. 23–26, May 2013.
[81] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, no. Jan, pp. 1–30, 2006.
[82] Minitab, "Choosing Between a Nonparametric Test and a Parametric Test," http://blog.minitab.com/blog/adventures-in-statistics/choosing-between-a-nonparametric-test-and-a-parametric-test, 2015, [Online; accessed 19-Dec-2015].
[83] B. S. Everitt, The Analysis of Contingency Tables. CRC Press, 1992.
[84] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.

Sajid Anwar obtained his BSc (Comp. Sc) and MSc (Comp. Sc) degrees from the University of Peshawar in 1997 and 1999, respectively. He obtained his MS (Comp. Sc) and PhD (in Software Architecture) from the University of NUCES-FAST, Pakistan, in 2007 and 2011, respectively. He is currently Assistant Professor of Computing Science and coordinator of the BS-Software Engineering program at the Institute of Management Sciences Peshawar, Pakistan. His research interests are concerned with Software Architecture, Software Requirement Engineering, Search-Based Software Engineering and Mining Software Repositories.

Awais Adnan is Assistant Professor and Coordinator of the Master Program, Department of Computer Science, Institute of Management Sciences Peshawar. He completed his PhD at IMSciences Peshawar and his MS at NUST Islamabad. He is manager of ORIC at IMSciences Peshawar, where he promotes and facilitates research students in the commercialization of their research. His major areas of interest are Multimedia and Machine Learning.

Muhammad Nawaz received his MSc (Computer Science) and MS in Information Technology from the University of Peshawar, Pakistan. He worked as a lecturer at the University of Peshawar, followed by a post as a Computer Programmer at Khyber Teaching Hospital, Peshawar, and was then appointed Assistant Professor in Multimedia at the Institute of Management Sciences, Peshawar, a position he still holds. Currently he is the Head of PhD and MS-Computer Sciences at IMSciences Peshawar.