Comparative Study of Classification Algorithms
Abstract— Data classification is one of the most important tasks in data mining: it identifies, on the basis of a training set, the category to which a new observation belongs. Preparing data before any data mining is an essential step to ensure the quality of the mined data. Different algorithms are used to solve classification problems. In this research four algorithms, namely support vector machine (SVM), C5.0, K-nearest neighbor (KNN) and Recursive Partitioning and Regression Trees (rpart), are compared before and after applying two feature selection techniques, the wrapper technique and the filter technique. The comparative study is implemented entirely in the R programming language. A direct-marketing-campaigns dataset from a banking institution is used to predict whether a client will subscribe to a term deposit. The dataset comprises 4521 instances: 3521 instances (78%) form the training set and 1000 instances (22%) form the testing set. The results show that C5.0 is superior to the other algorithms before applying the FS techniques, and that SVM is superior after applying them.
Keywords— Classification, Feature Selection, Wrapper Technique, Filter Technique, Support Vector Machine
(SVM), C5.0, K-Nearest Neighbor (KNN), Recursive Partitioning and Regression Trees (Rpart).
I. INTRODUCTION
The problem of data classification has numerous applications in a wide variety of mining settings, because it attempts to learn the relationship between a set of feature variables and a target variable of interest. An excellent overview of data classification may be found in [1]. Classification algorithms typically involve two phases. The first is the training phase, in which a model is constructed from the training instances. The second is the testing phase, in which the model is used to assign a label to an unlabeled test instance [1].
Classification consists of predicting a certain outcome based on a given input. In order to predict the outcome, the algorithm processes a training set containing a set of attributes together with the respective outcome, usually called the goal or prediction attribute. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. The algorithm is then given a dataset, called the prediction set, which contains the same set of attributes except for the prediction attribute, which is not yet known. The algorithm analyses the input and produces predictions, and the prediction accuracy defines how “good” the algorithm is [2]. The four classifiers used in this paper are shown in figure 1. However, many irrelevant, noisy or ambiguous attributes may be present in the data to be mined, and they need to be removed because they degrade the performance of the algorithms. Attribute selection methods are used to avoid overfitting, to improve model performance, and to provide faster and more cost-effective models [3]. The main purpose of the Feature Selection (FS) approach is to select a minimal and relevant feature subset for a given dataset while maintaining its original representation. FS not only reduces the dimensionality of the data but also enhances the performance of a classifier. The task of FS is therefore to search for the best possible feature subset for the problem to be solved [4].
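The filter idea can be sketched in a few lines. The paper's experiments are in R; the following is only an illustrative pure-Python sketch that ranks features by the absolute Pearson correlation of each feature with the class label and keeps the top k. The feature names and data are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_select(features, labels, k):
    """Rank features by |correlation| with the label; keep the top k."""
    scored = sorted(features.items(),
                    key=lambda kv: abs(pearson(kv[1], labels)),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Toy data: 'balance' tracks the label closely, 'day' is noise.
features = {"balance": [1.0, 2.0, 3.0, 4.0],
            "day":     [3.0, 1.0, 3.0, 1.0]}
labels = [0, 0, 1, 1]
print(filter_select(features, labels, 1))  # → ['balance']
```

A wrapper technique differs in that it scores candidate subsets by actually training and evaluating the classifier on each subset rather than by a classifier-independent statistic.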
This paper is organized as follows. Section 2 presents the four algorithms used to deal with the classification problem. Section 3 describes the FS techniques used. Section 4 demonstrates our experimental methodology, and section 5 presents the results. Finally, section 6 provides the conclusion and future work.
Figure 1: Classification tools used in this study: the KNN algorithm, the SVM algorithm, and two decision-tree algorithms (C5.0 and rPart).
B. C5.0 Algorithm
C5.0 is a Decision Tree (DT) algorithm developed by Quinlan as the successor to C4.5. It includes all the functionality of C4.5 and applies a number of new technologies [7]. The resulting DT is used to classify unseen data. Just as the C4.5 algorithm follows the rules of the ID3 algorithm, the C5 algorithm follows the rules of C4.5. The C5 algorithm has several notable features: i) a large DT can be viewed as a set of rules, which is easy to understand; ii) the C5 algorithm accounts for noise and missing data; iii) the problems of overfitting and error pruning are addressed by the C5 algorithm; and iv) the C5 classifier can anticipate which attributes are relevant to the classification and which are not [8].
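C5.0 itself is not reproduced here, but the split criterion it inherits from the ID3/C4.5 family, information gain based on entropy, can be sketched as follows. This is an illustrative pure-Python toy, not the R C50 implementation used in the experiments.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Entropy reduction from splitting on a categorical feature."""
    n = len(labels)
    split = {}
    for v, y in zip(feature_values, labels):
        split.setdefault(v, []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

# A perfectly predictive feature yields gain equal to the full entropy,
# while an uninformative one yields zero gain.
labels = ["yes", "yes", "no", "no"]
print(info_gain(["a", "a", "b", "b"], labels))  # → 1.0
print(info_gain(["a", "b", "a", "b"], labels))  # → 0.0
```

At each node the tree builder chooses the attribute with the highest gain (C4.5 and C5.0 refine this with the gain ratio) and recurses on the resulting subsets.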
A random 10% sample of the dataset is selected. The percentages of the training set and the test set are shown in figure 2. In the experiment the KNN algorithm classifies any new object based on a similarity function, a distance function that can be the Euclidean, Manhattan, Minkowski or another distance. It measures how far the new object is from its neighbors, and the number of neighbors considered is defined by K. For example, if K = 3, KNN searches for the three neighbors closest to the object according to the distance function, and the predicted class of the new object is determined by the majority class of those neighbors.
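The K = 3 procedure just described can be written directly. This is a minimal pure-Python sketch with Euclidean distance and invented toy points, not the R KNN call used in the experiments.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Predict the majority class among the k nearest training points."""
    neighbours = sorted(train, key=lambda p: euclidean(p[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy training set: (point, class)
train = [((0.0, 0.0), "no"), ((0.1, 0.2), "no"),
         ((1.0, 1.0), "yes"), ((0.9, 1.1), "yes"), ((0.2, 0.1), "no")]
print(knn_predict(train, (0.1, 0.1), k=3))  # → no
print(knn_predict(train, (1.0, 0.9), k=3))  # → yes
```

Swapping `euclidean` for a Manhattan or Minkowski distance changes only the similarity function, not the voting scheme.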
Figure 2: Split of the dataset into a training set (78%) and a testing set (22%).
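A 3521/1000 partition like the one used here can be produced with a simple shuffled split. The sketch below is illustrative Python rather than the paper's R code, and the random seed is an assumption since the paper does not state one.

```python
import random

def split_dataset(records, n_train, seed=1):
    """Shuffle records and partition them into training and testing sets."""
    rng = random.Random(seed)   # fixed seed for reproducibility; assumed, not from the paper
    shuffled = records[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

records = list(range(4521))     # one index per bank-marketing instance
train, test = split_dataset(records, n_train=3521)
print(len(train), len(test))    # → 3521 1000
```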
The SVM works by mapping data to a high-dimensional feature space so that data points can be categorized even when the data are not otherwise linearly separable. A separator between the categories is found, and the data are then transformed in such a way that the separator can be drawn as a hyperplane. The characteristics of new data can then be used to predict the group to which a new record should belong.
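The mapping idea can be illustrated without a full SVM solver. Points at x = ±1 (class A) surrounding a point at x = 0 (class B) are not linearly separable on the line, but lifting x to the pair (x, x²) makes a simple threshold on the second coordinate act as a separating hyperplane. The sketch below is a hand-built illustration of this lifting, not the R SVM used in the experiments; the threshold value is chosen by eye.

```python
def lift(x):
    """Map a 1-D point into a 2-D feature space."""
    return (x, x * x)

def classify(x, threshold=0.5):
    """Separate classes with the hyperplane x2 = threshold in lifted space."""
    _, x2 = lift(x)
    return "A" if x2 > threshold else "B"

# Class A at ±1 surrounds class B at 0: inseparable in 1-D,
# but trivially separable after the quadratic lift.
print([classify(x) for x in (-1.0, 0.0, 1.0)])  # → ['A', 'B', 'A']
```

A real SVM learns the hyperplane (and, via a kernel, avoids computing the lifted coordinates explicitly); this toy only shows why the lift makes separation possible.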
C5.0 and rPart are decision-tree methods, which divide a dataset into smaller and smaller subsets. In a decision tree each internal node represents a feature of the instance to be classified, each branch represents a value of that feature, and each leaf node represents a decision. Decision trees classify instances based on their feature values: classification of an instance starts from the root node, and the instance is sorted down the tree according to those values [8].
V. EXPERIMENTAL RESULTS
1. Before Feature Selection
Based on the dataset in hand, the results revealed that the C5.0 algorithm is the best at solving the classification problem and that KNN is the poorest. The performance of the different methods was compared by calculating the average error rate and accuracy rate of each algorithm using a confusion matrix.
The accuracy (AC) is the percentage of the total number of predictions that were correct. In terms of the confusion-matrix counts (true positives TP, true negatives TN, false positives FP and false negatives FN) it is determined using the following equation:

AC = (TP + TN) / (TP + TN + FP + FN)
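As a small sketch, the standard accuracy computation from the confusion-matrix counts can be written as follows. The FP/FN split in the example is assumed for illustration; the paper reports only the 845 + 54 correct predictions for C5.0 out of the 1000-instance test set.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that were correct: (TP + TN) / total."""
    return (tp + tn) / (tp + tn + fp + fn)

# 845 + 54 correct predictions out of 1000 gives the reported 89.9%;
# the split of the 101 errors into fp/fn here is illustrative only.
print(accuracy(tp=54, tn=845, fp=52, fn=49))  # → 0.899
```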
The C5.0 algorithm was able to correctly predict that 845 clients won't subscribe to the bank term deposit and that 54 clients will subscribe to it. The rPart algorithm correctly predicted that 854 clients won't subscribe to the term deposit.
The ROC curve is plotted for each of the four classifiers in figure 4. The closer the curve is to the left-hand border and then to the top border of the ROC space, the more accurate the test. The Area Under the ROC Curve (AUC) quantifies the overall ability of the test to differentiate between clients who will subscribe to the term deposit and those who won't. In the worst case the area under the ROC curve is 0.5, and in the best case (zero false positives and zero false negatives) the area is 1.00. More details about the accuracy and error rate of each classifier are shown in table 2.
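The AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, which gives a direct way to compute it from classifier scores. The following is an illustrative pure-Python sketch on invented toy scores, not the R output behind figure 4.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Perfect ranking gives 1.0; an uninformative ranking gives 0.5.
print(auc([0.9, 0.8], [0.2, 0.1]))  # → 1.0
print(auc([0.6, 0.2], [0.6, 0.2]))  # → 0.5
```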
4-a: ROC Curve for C5.0. 4-b: ROC Curve for rPart.
4-c: ROC Curve for SVM. 4-d: ROC Curve for KNN.
Figure 4: ROC Curves for the Algorithms.
Figure 5: Accuracy rates for the C5.0, rPart, SVM and KNN algorithms before FS.
Wrapper Technique
The experimental results revealed that the SVM algorithm is superior to the others at solving the classification problem when cost = 10. Even with cost = 1, SVM gives impressive results: its accuracy rate (96.7%) is higher than that of the rest. The accuracy and error rate for each classifier are shown in table 4, and a comparison between the four classifiers before and after applying FS is made in table 5.
After applying the FS technique, the SVM algorithm correctly predicts all the records with zero false positives and zero false negatives, giving a 100% accuracy rate, which means there is an improvement in the performance of the classifier. There is also an improvement in the performance of C5.0, rPart and KNN compared with their performance before FS.
6-a: Confusion matrix for SVM. 6-b: Confusion matrix for C5.0.
6-c: Confusion matrix for rPart. 6-d: Confusion matrix for KNN.
Figure 6: Confusion matrices for the classifiers with FS.
Since the ROC curve visualizes the performance of the classifiers, the AUC is used to identify the one with the best performance. As shown in figure 7, the AUC for SVM is 1.0, which means higher performance than the others. The AUC for KNN is under 0.5, indicating poor performance. For C5.0 and rPart the AUC is higher than 0.5, indicating good performance.
7-a: ROC Curve for SVM with FS. 7-b: ROC Curve for C5.0 with FS.
7-c: ROC Curve for rPart with FS. 7-d: ROC Curve for KNN with FS.
Figure 7: ROC Curve for the Algorithms with FS.
Figure 8: Accuracy and error rates for the SVM, C5.0, rPart and KNN algorithms after FS.
Removing irrelevant features from a dataset before doing any data mining has a great influence on the performance of the classifiers. Notably, the accuracy rate of all four classifiers increased and the error rate decreased. For SVM the accuracy rate moved from 88.1% to 100%, resulting in a zero error rate. For the C5.0 algorithm the accuracy rate moved from 89.9% to 90.4%, reducing the error rate by 0.5%. For rPart the accuracy rate increased from 88.8% to 89.3%, again reducing the error rate by 0.5%. For KNN the accuracy rate moved from 87.5% to 88.9%, reducing the error rate by 1.4%. A summary of the accuracy rates of the classifiers before and after FS is shown in table 5.
VI. CONCLUSION
Data mining (DM) includes many tasks; classification is one of them, and it can be solved by many algorithms. The data on which DM tasks depend may contain inconsistencies, missing records or irrelevant features, which make knowledge extraction very difficult. It is therefore essential to apply pre-processing techniques such as FS in order to enhance data quality. In this paper we compared the performance of SVM, C5.0, KNN and rpart before and after Feature Selection. From the obtained results we conclude that FS improves the data quality and the performance of all four classifiers, and that the SVM algorithm gives impressive results, outperforming the other algorithms.
REFERENCES
[1] C. C. Aggarwal, Data Classification: Algorithms and Applications, 2014.
[2] F. Voznika and L. Viana, “Data mining classification,” pp. 1–6, 1998.
[3] S. Beniwal and J. Arora, “Classification and Feature Selection Techniques in Data Mining,” vol. 1, no. 6, pp. 1–
6, 2012.
[4] D. Tomar and S. Agarwal, “A Survey on Pre-processing and Post-processing Techniques in Data Mining,” vol.
7, no. 4, pp. 99–128, 2014.
[5] I. Technologies, “Missing Value Imputation in Multi Attribute Data Set,” vol. 5, no. 4, pp. 5315–5321, 2014.
[6] E. Acuña and C. Rodríguez, “The treatment of missing values and its effect in the classifier accuracy,” pp. 1–9.