Classification of Multivariate Data Sets Without Missing Values Using Memory Based Classifiers - An Effectiveness Evaluation
CLASSIFICATION OF MULTIVARIATE DATA SETS WITHOUT MISSING VALUES USING MEMORY BASED CLASSIFIERS - AN EFFECTIVENESS EVALUATION
C. Lakshmi Devasena
Department of Computer Science and Engineering, Sphoorthy Engineering College, Hyderabad, India
[email protected]
ABSTRACT
Classification is the practice of allocating a given piece of input to one of a set of known categories. It is a crucial machine learning technique, and many classification problems arise in different application areas and need to be solved. Different types of classification algorithms, such as memory-based, tree-based and rule-based ones, are widely used. This work evaluates the performance of different memory based classifiers for the classification of multivariate data sets without missing values, taken from the UCI machine learning repository, using an open source machine learning tool. A comparison of the different memory based classifiers used, and a practical guideline for selecting the most suited algorithm for a classification task, are presented. Apart from that, some pragmatic criteria for describing and evaluating the classifiers are discussed.
KEYWORDS
Classification, IB1 Classifier, IBk Classifier, K Star Classifier, LWL Classifier
1. INTRODUCTION
In machine learning, classification refers to an algorithmic process for designating a given input to one among a set of given categories. An example would be assigning a given program to the "private" or "public" class. An algorithm that implements classification is known as a classifier. The input data can be termed an instance, and the categories are known as classes. The characteristics of an instance are described by a vector of features, which can be nominal, ordinal, integer-valued or real-valued. Many data mining algorithms work only in terms of nominal data and require that real or integer-valued data be converted into groups. Classification is a supervised procedure that learns to classify new instances based on the knowledge learnt from a previously classified training set of instances. The equivalent unsupervised procedure is known as clustering; it entails grouping data into classes based on an inherent similarity measure. Classification and clustering are instances of the more general problem of pattern recognition. In machine learning, classification systems induced from empirical data (examples) are first of all rated by their predictive accuracy. In practice, however, the interpretability or transparency of a classifier is often important as well. This work evaluates the effectiveness of memory-based classifiers in classifying multivariate data sets that contain no missing values.
DOI : 10.5121/ijaia.2013.4110
International Journal of Artificial Intelligence & Applications (IJAIA), Vol.4, No.1, January 2013
2. LITERATURE REVIEW
In [1], the performance of the Fuzzy C-Means (FCM) clustering algorithm is compared with that of the Hard C-Means (HCM) algorithm on the Iris flower data set; the authors conclude that fuzzy clustering is well suited to handling issues related to understanding pattern types, incomplete/noisy data, mixed information and human interaction, and can afford fairly accurate solutions faster. In [6], the issues of determining an appropriate number of clusters and of visualizing the strength of the clusters are addressed using the Iris data set.
3. DATA SET
The IRIS flower data set is one of the classic multivariate data sets, created by Sir Ronald Aylmer Fisher [3] in 1936. The IRIS data set consists of 150 instances of three different types of Iris plant, namely Iris setosa, Iris virginica and Iris versicolor, each with 50 instances. The length and width of the sepal and petal were measured from each sample of the three selected species of Iris flower. The four features measured and used to classify the type of plant are Sepal Length, Sepal Width, Petal Length and Petal Width [4]; the classification of the plant is made based on the combination of these four features. The other multivariate data sets selected for the performance evaluation of memory-based classifiers are the Car Evaluation, Glass Identification and Balance Scale data sets from the UCI Machine Learning Repository [8]. The Car Evaluation data set has six attributes (Buying Price, Maintenance Price, Number of Doors, Capacity, Size of Luggage Boot and Estimated Safety of the car) and consists of 1728 instances of four different classes. The Glass Identification data set has nine attributes (Refractive Index, Sodium, Potassium, Magnesium, Aluminium, Calcium, Silicon, Barium and Iron content) and consists of 214 instances of seven different classes, namely Building Windows Float Processed Glass, Vehicle Windows Float Processed Glass, Building Windows Non-Float Processed Glass, Vehicle Windows Non-Float Processed Glass, Containers Non-Window Glass, Tableware Non-Window Glass and Headlamps Non-Window Glass. The Balance Scale data set contains four attributes (Left Weight, Left Distance, Right Weight and Right Distance) and 625 instances.
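For reference, the four data sets described above can be captured in a small lookup structure (a sketch; the field names are my own, the attribute lists and counts are taken from the text):

```python
# Summary of the four UCI datasets used in the evaluation.
datasets = {
    "Iris": {
        "attributes": ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"],
        "instances": 150, "classes": 3},
    "Car Evaluation": {
        "attributes": ["Buying Price", "Maintenance Price", "Number of Doors",
                       "Capacity", "Size of Luggage Boot", "Estimated Safety"],
        "instances": 1728, "classes": 4},
    "Glass Identification": {
        "attributes": ["Refractive Index", "Sodium", "Potassium", "Magnesium",
                       "Aluminium", "Calcium", "Silicon", "Barium", "Iron"],
        "instances": 214, "classes": 7},
    "Balance Scale": {
        "attributes": ["Left Weight", "Left Distance", "Right Weight", "Right Distance"],
        "instances": 625, "classes": 3},
}

for name, info in datasets.items():
    print(f"{name}: {len(info['attributes'])} attributes, {info['instances']} instances")
```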
4. CLASSIFIERS USED
Different memory based classifiers are evaluated to find their effectiveness in classifying the chosen data sets. The classifiers evaluated here are the IB1, IBk, K Star and LWL classifiers.
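The simplest of these, IB1, predicts the class of the single closest training instance. A minimal sketch of that nearest-neighbour idea under plain Euclidean distance (WEKA's actual IB1 also normalizes attribute ranges, which this sketch omits):

```python
import math

def ib1_predict(train, query):
    """train: list of (feature_tuple, label) pairs.
    Returns the label of the training instance nearest to query."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # pick the (distance, label) pair with the smallest distance
    _, label = min((euclidean(x, query), y) for x, y in train)
    return label

# Tiny illustrative training set in the Iris feature space.
train = [((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
         ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
         ((6.3, 3.3, 6.0, 2.5), "Iris-virginica")]
print(ib1_predict(train, (5.0, 3.4, 1.5, 0.2)))  # → Iris-setosa
```

IBk generalizes this to the k closest instances with majority voting; K Star replaces the distance with an entropic measure, and LWL fits a locally weighted model around the query.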
The probability function P* is defined as the probability of all paths from instance a to instance b:

    P*(b|a) = sum of p(t) over all transformations t that map a to b        (2)

The K* function is then defined as

    K*(b|a) = -log2 P*(b|a)        (3)
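A toy numeric sketch of Equations (2) and (3): P* sums the probabilities of the individual transformation paths, and K* is the negative base-2 logarithm of that sum (the path probabilities below are made-up illustrative values):

```python
import math

def p_star(path_probs):
    """P*(b|a): sum of the probabilities of all paths from a to b."""
    return sum(path_probs)

def k_star(path_probs):
    """K*(b|a) = -log2 P*(b|a), the entropic 'distance'."""
    return -math.log2(p_star(path_probs))

# Two hypothetical paths from a to b, with probabilities 1/4 and 1/8:
print(k_star([0.25, 0.125]))  # -log2(0.375) ≈ 1.415
```

Note that because P*(a|a) < 1 in general (the identity path does not carry all the probability), K*(a|a) is non-zero, as the text below observes.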
K* is not strictly a distance function. For example, K*(a|a) is in general non-zero, and the function (as emphasized by the | notation) is not symmetric. Although possibly counter-intuitive, the lack of these properties does not interfere with the development of the K* algorithm below. The following properties are provable:

    K*(b|a) >= 0        (5)

    K*(c|b) + K*(b|a) >= K*(c|a)        (6)
    C = sum over i of (x_i^T β - y_i)^2        (7)

This process has a physical interpretation. The strengths of the springs are equal in the unweighted case, and the position of the hyperplane minimizes the sum of the energy stored in the springs (Equation 8). We will ignore a factor of 1/2 in all our energy calculations to simplify notation. The stored energy in the springs in this case is C of Equation 7, which is minimized by the physical process.
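The spring-energy view can be sketched directly: the criterion C of Equation 7 is just the summed squared residuals of the linear fit, i.e. the stored energy with the 1/2 factor dropped, as in the text:

```python
def criterion(X, beta, y):
    """C = sum_i (x_i . beta - y_i)^2: the unweighted criterion,
    read as the total energy stored in the residual 'springs'."""
    total = 0.0
    for xi, yi in zip(X, y):
        residual = sum(b * x for b, x in zip(beta, xi)) - yi
        total += residual ** 2
    return total

# Three points on the line y = x, with the constant 1 appended to each input:
X = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
y = [1.0, 2.0, 3.0]
print(criterion(X, (1.0, 0.0), y))  # exact fit: zero stored energy → 0.0
print(criterion(X, (1.0, 1.0), y))  # shifting the plane stretches every spring → 3.0
```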
    x_i^T β = y_i        (9)
In what follows we will assume that the constant 1 has been appended to all the input vectors x_i to include a constant term in the regression. The training data points can be collected in a matrix equation:

    Xβ = y        (10)
where X is a matrix whose ith row is x_i^T and y is a vector whose ith element is y_i. Thus, the dimensionality of X is n x d, where n is the number of training data points and d is the dimensionality of x. Estimating the parameters β using an unweighted regression minimizes the criterion C of Equation 7 [7], by solving the normal equations

    (X^T X)β = X^T y        (11)

for β:

    β = (X^T X)^-1 X^T y        (12)
Inverting the matrix X^T X is not the numerically best way to solve the normal equations, from the point of view of either efficiency or accuracy, and usually other matrix techniques are used to solve Equation 11.
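As a sketch (assuming NumPy), the explicit inverse of Equation 12 and a numerically preferable least-squares solver give the same coefficients on well-conditioned data:

```python
import numpy as np

# Training inputs with the constant 1 appended (Equation 10: X beta = y).
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y = np.array([2.1, 3.9, 6.0])

# Equation 12: explicit inverse of X^T X (works, but numerically fragile).
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Preferred: let a least-squares routine solve X beta ≈ y directly.
beta_lsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_inv, beta_lsq))  # → True
```

On ill-conditioned X the two can diverge badly, which is the text's point about avoiding the explicit inverse.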
The root mean-squared error is

    RMSE = sqrt( (sum over i of (a_i - c_i)^2) / n )        (15)

where a is the actual output and c is the expected output. The root mean-squared error is the commonly used measure for numeric prediction.
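The two error measures used throughout the evaluation can be sketched as follows (a = actual, c = expected, matching the text's notation):

```python
import math

def mae(actual, expected):
    """Mean absolute error."""
    return sum(abs(a - c) for a, c in zip(actual, expected)) / len(actual)

def rmse(actual, expected):
    """Root mean-squared error."""
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(actual, expected)) / len(actual))

# A perfect predictor (such as IB1 on these datasets) scores zero on both:
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
      rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 0.0 0.0
```

RMSE penalizes large individual errors more heavily than MAE, which is why both are reported side by side in the tables below.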
Data Set 1: Iris Dataset

Figure 1. Comparison based on Correctly Classified Instances Iris Dataset
Figure 2. Comparison based on MAE and RMSE values Iris Dataset

Table 4. Confusion Matrix for K* Classifier Iris Dataset

        A     B     C
  A    50     0     0
  B     0    50     0
  C     0     0    50
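Reading the classification accuracy off a confusion matrix like Table 4 is just the diagonal count over the total (a sketch; rows are actual classes, columns are predicted classes):

```python
def accuracy_pct(matrix):
    """Percentage of instances on the diagonal of a confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total

# Table 4: K* on the Iris dataset classifies all 150 instances correctly.
iris_kstar = [[50, 0, 0],
              [0, 50, 0],
              [0, 0, 50]]
print(accuracy_pct(iris_kstar))  # → 100.0
```

The same computation reproduces the accuracy figures reported for the other confusion matrices in this paper.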
Data Set 2: Car Evaluation Dataset

The performance of the memory based algorithms for the Car Evaluation Dataset in terms of Classification Accuracy, Time taken to test the Model, RMSE and MAE values is shown in Table 6. Comparison among the classifiers based on the correctly classified instances is shown in Fig. 3. Comparison among these classifiers based on MAE and RMSE values is shown in Fig. 4. The confusion matrices obtained for these classifiers are shown in Tables 7 to 10. The overall ranking is done based on the classification accuracy, MAE and RMSE values, and is given in Table 6. Based on the results obtained, the IB1 Classifier, with 100% accuracy and zero MAE and RMSE, got the first position in the ranking, followed by IBk, K Star and LWL, as shown in Table 6.

Table 6. Overall Results of Memory Based Classifiers CAR Dataset

Classifier   Correctly Classified   Classification   Time taken to       MAE   RMSE   Rank
             (Out of 1728)          Accuracy (%)     Test Model (sec)
IB1          1728                   100              0.62                0     0      1
IBk          1728                   100              0.62                -     -      2
K Star       1728                   100              3.49                -     -      3
LWL          1210                   70.02            2.72                -     -      4
Table 8. Confusion Matrix for IBk Classifier CAR Dataset

        A      B     C     D
  A   1210     0     0     0
  B      0   384     0     0
  C      0     0    69     0
  D      0     0     0    65
Table 9. Confusion Matrix for K Star Classifier CAR Dataset

        A      B     C     D
  A   1210     0     0     0
  B      0   384     0     0
  C      0     0    69     0
  D      0     0     0    65
Figure 3. Comparison based on Correctly Classified Instances CAR Dataset
Figure 4. Comparison based on MAE and RMSE values CAR Dataset

Table 10. Confusion Matrix for LWL Classifier CAR Dataset

        A      B     C     D
  A   1210     0     0     0
  B    384     0     0     0
  C     69     0     0     0
  D     65     0     0     0
A = Unacceptable, B = Acceptable, C = Good, D = Very Good

Data Set 3: Glass Identification Dataset
The performance of the memory based algorithms for the Glass Identification Dataset in terms of Classification Accuracy, Time taken to test the Model, RMSE and MAE values is shown in Table 11. Comparison among the classifiers based on the correctly classified instances is shown in Fig. 5. Comparison among these classifiers based on MAE and RMSE values is shown in Fig. 6. The confusion matrices obtained for these classifiers are shown in Tables 12 to 15. The overall ranking is done based on the classification accuracy, Time taken to test the Model, MAE and RMSE values. Based on the results obtained, the IB1 Classifier, with 100% accuracy and zero MAE and RMSE, got the first position in the ranking, followed by IBk, K Star and LWL, as shown in Table 11.

Table 11. Overall Results of Memory Based Classifiers Glass Dataset

Classifier   Correctly Classified   Classification   Time taken to       MAE   RMSE   Rank
             (Out of 214)           Accuracy (%)     Test Model (sec)
IB1          214                    100              0.08                0     0      1
IBk          214                    100              0.08                -     -      2
K Star       214                    100              0.70                -     -      3
LWL          97                     45.33            0.47                -     -      4
Figure 5. Comparison based on Correctly Classified Instances Glass Dataset

Figure 6. Comparison based on MAE and RMSE values Glass Dataset
A = Building Windows Float Processed, B = Building Windows Non-Float Processed, C = Vehicle Windows Float Processed, D = Vehicle Windows Non-Float Processed, E = Containers, F = Tableware, G = Headlamps
Data Set 4: Balance Scale Dataset
The performance of the memory based algorithms for the Balance Scale Dataset in terms of Classification Accuracy, Time taken to test the Model, RMSE and MAE values is shown in Table 16. Comparison among the classifiers based on the correctly classified instances is shown in Fig. 7. Comparison among these classifiers based on MAE and RMSE values is shown in Fig. 8. The confusion matrices obtained for these classifiers are shown in Tables 17 to 20.

Table 16. Overall Results of Memory Based Classifiers Balance Scale Dataset

Classifier   Correctly Classified   Classification   Time taken to       MAE   RMSE   Rank
             (Out of 625)           Accuracy (%)     Test Model (sec)
IB1          625                    100              0.3                 0     0      1
IBk          625                    100              0.3                 -     -      2
K Star       589                    94.24            0.62                -     -      3
LWL          352                    56.32            0.78                -     -      4
The overall ranking is done based on the classification accuracy, Time taken to test the Model, MAE and RMSE values. Based on the results obtained, the IB1 Classifier, with 100% accuracy and zero MAE and RMSE, got the first position in the ranking, followed by IBk, K Star and LWL, as shown in Table 16.
Figure 7. Comparison based on Number of Instances Correctly Classified Balance Scale Dataset
Figure 8. Comparison based on MAE and RMSE values Balance Scale Dataset
Table 17. Confusion Matrix for IB1 Classifier Balance Scale Dataset

        A     B     C
  A   288     0     0
  B     0    49     0
  C     0     0   288
Table 18. Confusion Matrix for IBk Classifier Balance Scale Dataset

        A     B     C
  A   288     0     0
  B     0    49     0
  C     0     0   288
Table 19. Confusion Matrix for K Star Classifier Balance Scale Dataset

        A     B     C
  A   288     0     0
  B    12    13    24
  C     0     0   288
Table 20. Confusion Matrix for LWL Classifier Balance Scale Dataset

        A     B     C
  A   176     0   112
  B    23     0    26
  C   112     0   176
7. CONCLUSIONS
In this performance evaluation work, memory based classifiers were tested to estimate their classification accuracy on multivariate data sets without missing values, using the Iris, Glass Identification, Balance Scale, Car Evaluation and Congressional Voting Records data sets. The experiments were done using an open source machine learning tool. The performance of the classifiers was measured and the results were compared. Among the four classifiers (IB1, IBk, K Star and LWL), the IB1 Classifier performs best on this classification problem. The IBk, K Star and LWL classifiers get the successive ranks based on classification accuracy and the other evaluation measures.
ACKNOWLEDGEMENTS
The author thanks the Management of Sphoorthy Engineering College and the faculty of the CSE Department for their cooperation.
REFERENCES

[1] Pawan Kumar and Deepika Sirohi, "Comparative Analysis of FCM and HCM Algorithm on Iris Data Set", International Journal of Computer Applications, Vol. 5, No. 2, pp. 33-37, August 2010.
[2] David Benson-Putnins, Margaret Monfardin, Meagan E. Magnoni and Daniel Martin, "Spectral Clustering and Visualization: A Novel Clustering of Fisher's Iris Data Set".
[3] R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics, 7, pp. 179-188, 1936.
[4] Patrick S. Hoey, "Statistical Analysis of the Iris Flower Dataset".
[5] M. Kuramochi and G. Karypis, "Gene Classification Using Expression Profiles: A Feasibility Study", International Journal on Artificial Intelligence Tools, 14(4), pp. 641-660, 2005.
[6] John G. Cleary and Leonard E. Trigg, "K*: An Instance-based Learner Using an Entropic Distance Measure".
[7] Christopher G. Atkeson, Andrew W. Moore and Stefan Schaal, "Locally Weighted Learning", October 1996.
[8] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.
Authors
C. Lakshmi Devasena has completed an MCA and an M.Phil. and is pursuing a Ph.D. She has nine years of teaching experience and two years of industrial experience. Her areas of research interest are image processing, medical image analysis, cryptography and data mining. She has published 16 papers in international journals and 12 papers in proceedings of international and national conferences, and has presented 30 papers at national and international conferences. At present, she is working as an Associate Professor at Sphoorthy Engineering College, Hyderabad, AP.