A Case Study On Data Classification Approach Using K-Nearest Neighbor
Abstract— Data mining is the process of obtaining knowledge and information from massive amounts of data, and it is mostly used for data analysis. In data mining, various techniques are used, such as association mining, regression, prediction, classification, and clustering. Classification is described as the process of identifying a collection of models (or functions) that explain and differentiate data classes and ideas, so that the model can be used to detect the classes of unknown objects or patterns whose class designations are not known. Classification is a supervised learning problem: in machine learning, it is the problem of identifying which patterns in a group belong to which class according to their characteristics. Pattern classification can be done using various classifiers. A classifier is a program that takes the feature vector of a pattern or data point as input and assigns it to one of a set of designated classes. Classifiers such as the Artificial Neural Network (ANN), the k-Nearest Neighbor (k-NN) classifier, and the Support Vector Machine (SVM) are used for pattern classification purposes. Focusing on the classification technique of data mining, this research work presents the accuracy of k-NN on three datasets from the UCI machine learning repository. The main goal of this paper is to provide a review of the accuracy of the k-NN classification technique on different datasets in data mining. The k-NN classifier is a simple but efficient approach widely used for classification in research.

Keywords— Classification, SVM, k-NN, Supervised Learning, Normalization.

I. INTRODUCTION

Data mining is nothing but the information repository in which a huge amount of data can be stored in, and retrieved from, databases and data warehouses. It can also be recognized as Knowledge Discovery in Databases (KDD), as it deals with techniques integrated from several disciplines such as machine learning, high-performance computing, database technology, pattern recognition, statistics, neural networks, information retrieval, and data visualization. The major components of a data mining system that make up its architecture are the database and the data warehouse; the server is responsible for fetching the user's data from the knowledge base according to the user's data mining request.

In machine learning, the prediction accuracy of categorization algorithms derived from empirical data (examples) is assessed first. In practice, however, a classifier's interpretability or transparency is frequently crucial. This research investigates the accuracy of the k-nearest neighbor classifier in classifying a dataset. Many remarkable and diverse classification learning algorithms, such as the Support Vector Machine (SVM), k-nearest neighbors, and the Naive Bayes Classifier (NBC), are of major significance in exploratory pattern analysis.

The following are the different types of data mining strategies:

Classification:
In this task, the provided data instance must be classified into one of the target classes that have already been identified. One example is determining whether a consumer in a credit card transaction database should be categorized as a trustworthy customer or a defaulter based on numerous demographic and prior purchase criteria.

Estimation:
An estimation model, like a classification model, is used to determine a value for an unknown output attribute. In contrast to classification, an estimation problem's output attribute is numeric rather than categorical. An example scenario: estimate an individual's pay.

Prediction:
It is difficult to tell the difference between prediction and classification or estimation. The only distinction is that the predictive model predicts a future outcome rather than current behavior. The output attribute can be categorical or numeric.

Association rule mining:
It is the process of extracting interesting hidden rules, known as association rules, from a large transactional data set. For example, the rule (milk, butter → biscuit) specifies that whenever milk and butter are purchased together, the biscuit is also purchased, allowing these goods to be sold together to improve the overall sales of each item.
Clustering:
It is a method of classification where the target classes are not known in advance. For example, given 100 consumers, they must be categorized based on specific similarity criteria, and the classes into which the customers should ultimately be placed are not predetermined.

The classification data mining approach is the one mostly used in this work. The goal of this research is to improve the k-NN approach's performance. The idea of k-NN is to compute the closest neighbors based on various values of k, where k defines how many nearest neighbors should be considered when classifying a value. The general concept of k-NN is to compute a distance metric that measures the distances between data points; the algorithm then attempts to locate the k closest answers in the dataset. Euclidean distance, which describes the distance between two points in space, is one example of such a distance metric. The initial step in k-NN is to calculate all of the distances between a data point and each of the data set's reference points. In the second step these distances are sorted, and the k nearest objects are chosen, from which the third and final stage, categorization, is executed. In other words, from the d-dimensional feature space, k-NN identifies the k nearest (or most similar) points to a data point among N points, where k is the number of neighbors considered from the dataset.
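To make these three steps concrete, the sketch below outlines them in Python with the Euclidean distance. It is only an illustration with hypothetical array names (`train_X`, `train_y`, `query`), not the MATLAB implementation used later in the paper.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Step 1: compute the distance from the query point to every reference point.
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))  # Euclidean distance
    # Step 2: sort the distances and keep the indices of the k nearest objects.
    nearest = np.argsort(distances)[:k]
    # Step 3: categorize the query point by a majority vote among the k neighbors.
    return Counter(train_y[nearest]).most_common(1)[0][0]
```

With an odd k, ties between the two classes cannot occur in a two-class problem, which is consistent with the odd values k = 3, 5, 7 used in the experiments later.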
One of the most essential data mining techniques is classification. It entails applying a model learned from the dataset's data points to generate predictions about the class label of fresh data/observations. The process of developing a collection of models that explain and identify data classes, so that the model can be used to predict the classes of objects or patterns whose class labels are unknown, is known as classification. Detecting spam in email communications based on the message header and content, detecting cancer based on MRI scan findings, and categorising galaxies based on their location and morphologies are just a few examples. The problem of classification is learning a target function, as shown in Fig. 1, that translates each attribute set x to one of the predetermined class labels y. The target function is also called the classification model.

Fig. 1. Classification model for a task (input: attribute set (x) → classification model → output: class label (y))

A. Motivation
The motivation of this research is to find the accuracy score of the k-NN classifier. It was discovered that a lot of work has already been done in different areas of data mining on different datasets when identifying the project topic in the domain of data mining. The classification challenge on various datasets has also drawn the attention of the data mining community in the recent decade, which can be attributed to the popularity of the nearest neighbor classifier and its variations, such as the k-Nearest Neighbor classifier (k-NN).

B. Objective
The majority of classification experiments in the field of data mining have been carried out on various datasets. This paper's objective is to provide a classification technique that takes less time to classify, using three distinct datasets and different k values, without degrading the performance of the classifier. Its main goal is an analytic study of data mining technology.

C. Organization
The following is a breakdown of the paper's structure: the purpose and goal of this study are included in Section I, which provides a brief introduction to the issue. The related work is described in Section II. Section III explains the suggested framework using a block diagram. The approach is presented in Section IV. Section V describes the data sets utilized in this study, and Section VI gives the implementation details and experimental result analysis. Finally, Section VII wraps up the study and discusses the proposal's future scope.

II. RELATED WORK
This part provides a summary of the literature review, which includes reviews of technical papers on the k-Nearest Neighbor classification approach as applied to diverse applications. It also provides an overview of current data mining research being conducted on diverse applications. The following findings and limitations were discovered from several papers:

In this work [1], the author focuses on the k-Nearest Neighbor approach for classification. In terms of the parameter k, the researcher tried this method with a variety of distances and classification criteria (majority, consensus, and random). The results indicated that the k-NN technique may be used with both the Euclidean and the Manhattan distance. These distances are useful for categorization and performance, but they take time. As a result, they develop two types of distance that yield the best outcomes (98.70 and 98.70, respectively).

The WBC dataset experiment [2] shows that combining Multilayer Perceptron (MLP) and J48 classifiers with Principal Component Analysis (PCA) feature selection beats the other classifiers. On the other hand, the Wisconsin Diagnostic Breast Cancer (WDBC) dataset revealed that using a single classifier, such as Sequential Minimal Optimization (SMO), or a fusion of SMO and MLP or SMO and IBK, is preferable to using several classifiers. Finally, the MLP, J48, SMO, and IBK fusion fared better than the other classifiers on the Wisconsin Prognostic Breast Cancer (WPBC) dataset.
To solve the issues of poor efficiency and reliance on k, the authors of paper [3] picked a few representatives from the training dataset, together with some additional information, to represent the whole training dataset. They utilized an optimal but varied k chosen by the dataset itself to eliminate the reliance on k, without user interaction, in the selection of each representative. The limitation discovered from this study is that the researchers need to focus on how to enhance the classification accuracy of marginal data that falls beyond the typical areas. In the WPBC dataset, the fusion of MLP, J48, SMO, and IBK outperformed the other classifiers.

In paper [4], the researcher noticed and focused on the choice of k values, and ultimately the experimental results show that the proposed approach consistently beats other classifiers across a wide range of k, and its efficacy has been shown with good performance.

In the work on pattern recognition in [5], the k-NN classifier is one of the most often used neighborhood classifiers. However, it has significant drawbacks, including high computation complexity, complete reliance on the training set, and no weight variation across classes. To address this, the study proposes a unique approach for improving k-NN classification performance using Genetic Algorithms (GA).

In paper [6], rather than computing the similarities between all of the training and test samples and then selecting k neighbors for classification, GA picks just k neighbors at each iteration, calculates the similarities, classifies the test samples using these neighbours, and determines the accuracy. The computation complexity of k-NN was decreased in this case. The performance of the Gk-NN classifier is compared against conventional k-NN, CART, and SVM using five distinct medical datasets from the UCI data collection. The trials and findings have shown that the suggested technique not only decreases the complexity of k-NN but also enhances its classification accuracy.

In general, k-NN computes a distance metric between points: the system searches the database for the k closest replies to the provided query (i.e., the query point). Euclidean distance, Manhattan distance, and other distance measures are examples. The supervised learning approach known as the k-nearest neighbour method (k-NN) has been used in a number of applications; the Euclidean distance is the most commonly used.

III. PROPOSED FRAMEWORK
This research work focuses on a data classification concept using k-NN. The model is presented in Fig. 2. After normalizing all the data of the datasets to values between 0 and 1, the k-NN classification technique is applied, in which the distance between each row and the others is calculated and the k nearest rows are considered as per the k value. In the next step, the predicted class label of each row of the dataset is obtained. Then, using the actual class labels and the predicted class labels, a confusion matrix is built [9]. From the confusion matrix, the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) values are calculated according to the formula. Finally, a comparison between the accuracies of the datasets is made. As the experimental data set, several two-class datasets such as BCW, Pima, and Zoo are used to determine how the performance differs based on the data. Here the data are collected in various sizes and types, and both nominal and numerical data are utilized to assess the outcomes. The complete implementation and analysis of the accuracy of the datasets using the k-Nearest Neighbor algorithm follows the model below, using the MATLAB tool [10, 11].

Fig. 2. Proposed model (Input dataset → Identify the class label → Normalize the data → Apply the k-NN algorithm → Use the confusion matrix → Calculate the accuracy → Comparison of accuracy)
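The confusion-matrix step of the framework can be sketched as below for a two-class problem. The labels 1 (positive) and 0 (negative) and the function name are assumptions for illustration, and the accuracy formula is the standard (TP + TN) / (TP + TN + FP + FN), not code from the paper's MATLAB implementation.

```python
def confusion_and_accuracy(actual, predicted):
    # Count the four confusion-matrix cells, assuming binary labels 1/0.
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return (tp, tn, fp, fn), accuracy
```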
IV. APPROACH

A. Normalization
Values of some attributes in a data set can be of a higher numeric range whereas others may be of a smaller range [15]. To apply some classification algorithms, such as neural networks and their variants or distance-based measures, the values of all the attributes are required to lie within a small range [16]. For example, input values for a neural network or k-NN can be -1, 0, or +1.

i. Min-Max Normalization
In instances where we know the range of values in our input data, we utilise min-max normalisation. This approach is used when a neural network is the learning machine, or when a naïve Bayesian classifier requires all features to lie in the range 0 to 1 [17]. The minimum and maximum of the target range are set to 0 and 1, respectively, in this work. It is given by the formula:

v′ = (v − minA) / (maxA − minA)   (1)

Where:
v is the original value of an instance of attribute A.
v′ is the new (normalized) value.
minA is the minimum value of the attribute in the original dataset.
maxA is the maximum value of the attribute in the original dataset.
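As a small illustration of Eq. (1), the following Python sketch applies min-max normalization column-wise to a NumPy array (one column per attribute). It assumes every attribute satisfies maxA > minA and is not the MATLAB routine used in the paper.

```python
import numpy as np

def min_max_normalize(X):
    # Eq. (1): v' = (v - minA) / (maxA - minA), applied per attribute (column).
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)  # every value ends up in [0, 1]
```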
ii. Z-Score Normalization
This normalization technique is based on the mean and standard deviation of a particular attribute A in the dataset [18]. It is therefore otherwise referred to as standard-deviation normalization or zero-mean normalization. First, the mean and standard deviation are calculated in the usual way, and then the formula follows:

z = (x − μA) / σA   (2)

Where:
x is the original value of an instance of attribute A.
z is the transformed value.
μA is the mean of attribute A.
σA is the standard deviation of attribute A.

iii. Decimal Scaling Normalization
Here each value of an attribute is divided by a power of ten:

v′ = v / 10^j   (3)

Where:
j is the smallest integer such that max(|v′|) < 1.
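A similar sketch for Eqs. (2) and (3), again per attribute and only for illustration: it assumes every attribute has a nonzero standard deviation and a nonzero maximum absolute value, and it is not the paper's MATLAB code.

```python
import numpy as np

def z_score_normalize(X):
    # Eq. (2): z = (x - mean_A) / std_A, applied per attribute (column).
    return (X - X.mean(axis=0)) / X.std(axis=0)

def decimal_scaling_normalize(X):
    # Eq. (3): v' = v / 10^j, with j the smallest integer giving max(|v'|) < 1.
    j = np.floor(np.log10(np.abs(X).max(axis=0))) + 1
    return X / (10.0 ** j)
```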
B. k-Nearest Neighbor Classification
Data points may be classified based on their distance from points in a training dataset, which is a basic yet effective method of classification. To calculate the distance, a variety of measures may be used, which are discussed next.

Distance Metrics
Given an mx-by-n data matrix X, represented by mx (1-by-n) row vectors x1, x2, ..., xmx, and an my-by-n data matrix Y, represented by my (1-by-n) row vectors y1, y2, ..., ymy, the distance between two row vectors (instances) x and y can be defined in the following ways, where the sums and the maximum run over the components i = 1, ..., m.

TABLE I. APPROACHES TO DEFINING THE DISTANCE BETWEEN INSTANCES (x AND y)

Minkowski: D(x, y) = (∑ |xi − yi|^r)^(1/r)
Manhattan: D(x, y) = ∑ |xi − yi|
Chebychev: D(x, y) = max_i |xi − yi|
Euclidean: D(x, y) = (∑ |xi − yi|^2)^(1/2)
Canberra: D(x, y) = ∑ |xi − yi| / (|xi| + |yi|)
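The metrics of Table I can be written directly as short functions. The Python sketch below is illustrative only (x and y are equal-length NumPy arrays, and the Canberra form assumes no coordinate pair is simultaneously zero); it is not the MATLAB implementation used in the experiments.

```python
import numpy as np

def minkowski(x, y, r):
    # (sum |xi - yi|^r)^(1/r); r = 1 gives Manhattan, r = 2 gives Euclidean.
    return (np.abs(x - y) ** r).sum() ** (1.0 / r)

def manhattan(x, y):
    return np.abs(x - y).sum()

def chebychev(x, y):
    return np.abs(x - y).max()

def euclidean(x, y):
    return np.sqrt(((x - y) ** 2).sum())

def canberra(x, y):
    return (np.abs(x - y) / (np.abs(x) + np.abs(y))).sum()
```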
V. DATA SET USED
A data set is often the contents of a single database table or statistical data matrix, with each table column representing a distinct variable and each row representing a specific member of the data set in question. A dataset is made up of a data matrix with m rows (representing the items) and k columns (corresponding to the measurements). The columns are commonly referred to as features, but they can also have a different background, as shown in Table II. A dataset is an enhanced version of a data matrix in this case: it has a size of m × k and may be used with the various MATLAB matrix operations.
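As a sketch of this m × k layout, the fragment below loads a dataset into an array whose rows are the items and whose columns are the features; the file name and the position of the class-label column are hypothetical.

```python
import numpy as np

# Hypothetical comma-separated file: k feature columns followed by one class-label column.
data = np.loadtxt("dataset.csv", delimiter=",")
X = data[:, :-1]   # m-by-k matrix of attribute values
y = data[:, -1]    # class label of each of the m rows
print(X.shape)     # (m, k)
```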
A. BCW (Breast Cancer Wisconsin) Data Set
The Wisconsin Breast Cancer datasets from the UCI Machine Learning Repository were used to distinguish malignant (cancerous) from benign (non-cancerous) samples. The data were provided by W. Nick Street and Olvi L. Mangasarian, Computer Sciences Dept., University of Wisconsin, Madison. Table II below contains a summary of all the datasets [19]. Each dataset contains a collection of numerical characteristics or attributes as well as a certain categorization of patterns.

TABLE II. DESCRIPTION OF BCW, PIMA AND ZOO DATASETS

VI. IMPLEMENTATION DETAILS & EXPERIMENTAL RESULTS
The experimental data collection includes several two-class datasets, namely BCW, Pima, and Zoo; the performance may or may not vary depending on the data. The data collected are of various sizes and sorts, and both nominal and numerical data are utilized to evaluate the results. The correctness of the datasets is assessed using the k-Nearest Neighbor technique, and the MATLAB tool is used for the implementation. In this method, the k-Nearest Neighbor model is used with the three datasets to investigate how the technique aids in the prediction of unknown class values, and the accuracy of the prediction is found using the confusion matrix [20].
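The experiment can be outlined as below: normalize the data, predict each row against all the remaining rows (a leave-one-out style pass, since no separate training and testing split is used), and report the accuracy for k = 3, 5, and 7. The sketch reuses the hypothetical `min_max_normalize` and `knn_predict` helpers from the earlier fragments and the arrays `X` and `y` loaded above; it is an outline of the procedure, not the paper's MATLAB implementation.

```python
import numpy as np

def evaluate_knn(X, y, k):
    Xn = min_max_normalize(X)          # scale every attribute into [0, 1]
    correct = 0
    for i in range(len(Xn)):
        # Classify row i using every other row as the reference set.
        train_X = np.delete(Xn, i, axis=0)
        train_y = np.delete(y, i, axis=0)
        if knn_predict(train_X, train_y, Xn[i], k) == y[i]:
            correct += 1
    return correct / len(Xn)           # accuracy as a fraction

for k in (3, 5, 7):
    print(f"k={k}: accuracy={evaluate_knn(X, y, k):.4f}")
```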
TABLE III. PERFORMANCE OF k-NN WITH DIFFERENT k VALUES (IN PERCENTAGE)
DATA SETS   K=3   K=5   K=7

As shown in Table III, it was observed that as the k value varies, the accuracy rate of k-NN also varies accordingly. On the BCW dataset, when the k value increases, the accuracy decreases. Likewise, on the Pima dataset, in this k-NN classifier without separate training and testing instances, the accuracy decreased as the k value increased. On the Zoo dataset, with different k values the accuracy of the classifier first increased, then decreased, and so on.

The accuracy rate of k-NN first increases and then falls as the k value grows. This is because bigger values of k minimize the influence of noise on the classification, but make class borders less clear. When the three datasets were compared with different k values, it was discovered that BCW has a better k-NN classification accuracy rate than the other two. Its maximum accuracy is obtained quickly while k is still a small number (k = 5) and then gradually lowers, whereas the highest accuracy is reached more slowly over the different k values for Pima and Zoo. This is most likely because the BCW feature vectors are denser in the multidimensional space than those of the other two.

VII. CONCLUSION & FUTURE WORK
In data mining and pattern recognition, the k-NN classifier is one of the most often used neighborhood classifiers. However, it has certain drawbacks, such as high computation complexity, complete reliance on the training set, and no weight variation across classes. The focus of this method is on the accuracy of various k selections to increase classification performance. For different odd values of k, the accuracy on each dataset initially increases, then falls, and then increases again. The outcome of the implementation is that k-NN is a very good classifier, and as the size of the data set grows larger it produces good results. Future work can use different datasets with some feature selection techniques to increase the performance of the classifier.
REFERENCES
[1] N. Suguna and K. Thanushkodi, "An Improved k-Nearest Neighbor Classification Using Genetic Algorithm," IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 4, No. 2, (2018).
[2] Gouda I. Salama, M. B. Abdelhalim, and Magdy Abd-elghany Zeid, "Experimental Comparison of Classifiers for Breast Cancer Diagnosis," IEEE, 978-1-4673-2961 (2016).
[3] Gongde Guo, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer, "KNN Model-Based Approach in Classification," Springer (2012).
[4] Gou, J., Du, L., Zhang, Y., and Xiong, T., "A New Distance-weighted k-nearest Neighbor Classifier," Journal of Information and Computational Science, 9(6), pp. 1429-1436 (2012).
[5] S. C. Bagui, S. Bagui, K. Pal, "Breast Cancer Detection using Nearest Neighbor Classification Rules," Pattern Recognition 36, pp. 25-34 (2003).
[6] Hamid Parvin, Hoseinali Alizadeh, Behrouz Minati, "A Modification on K-Nearest Neighbor Classifier," Global Journal of Computer Science and Technology, Vol. 10, Issue 14 (Ver. 1.0), November (2010).
[7] Agrawal, R., Imielinski, T., Swami, A., "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, pp. 914-925, December 1993.
[8] Nitin Bhatia, Vandana, "Survey of Nearest Neighbor Techniques," International Journal of Computer Science and Information Security, Vol. 8, No. 2, (2010).
[9] Angeline Christobel Y., Sivaprakasam, "An Empirical Comparison of Data Mining Classification Methods," International Journal of Computer Information Systems, Vol. 3, No. 2, (2011).
[10] V. Suresh Babu and P. Viswanath, "Weighted k-nearest leader classifier for large data sets," in PReMI, pp. 17-24 (2007).
[11] Aman Kataria, M. D. Singh, "A Review of Data Classification Using K-Nearest Neighbour Algorithm," International Journal of Emerging Technology and Advanced Engineering, Volume 3, Issue 6, June (2013).
[12] J. Tripathy, R. Dash, B. K. Pattanayak and B. Mohnty, "Automated Phrase Mining Using Post: The Best Approach," 2021 1st Odisha International Conference on Electrical Power Engineering, Communication and Computing Technology (ODICON), 2021, pp. 1-6, doi: 10.1109/ODICON50556.2021.9429014.
[13] Panda, Smruti Rekha, and Jogeswar Tripathy, "Odia offline typewritten character recognition using template matching with unicode mapping," 2015 International Symposium on Advanced Computing and Communication (ISACC), pp. 109-115, IEEE, 2015.
[14] D. Mohapatra, J. Tripathy, K. K. Mohanty and D. S. K. Nayak, "Interpretation of Optimized Hyper Parameters in Associative Rule Learning using Eclat and Apriori," 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), 2021, pp. 879-882, doi: 10.1109/ICCMC51019.2021.9418049.
[15] Mohapatra D., Tripathy J., Patra T.K., "Rice Disease Detection and Monitoring Using CNN and Naive Bayes Classification," in: Borah S., Pradhan R., Dey N., Gupta P. (eds), Soft Computing Techniques and Applications, Advances in Intelligent Systems and Computing, vol. 1248, Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-7394-1_2.
[16] Anil K. Jain, Robert P. W. Duin, Jianchang Mao, "Statistical Pattern Recognition: A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), pp. 4-37 (2000).
[17] E. Acuna, C. Rodriguez, "The treatment of missing values and its effect in the classifier accuracy," in: D. Banks, L. House, F. R. McMorris, P. Arabie, W. Gaul (Eds.), Classification, Clustering and Data Mining Applications, Springer, Berlin, pp. 639-648 (2004).
[18] Vijaya, P., Murty, M. N., Subramanian, D. K., "Leaders-subleaders: An efficient hierarchical clustering algorithm for large data sets," Pattern Recognition Letters 25, pp. 505-513 (2004).
[19] A. Frank, A. Asuncion, UCI Machine Learning Repository, http://www.archive.ics.uci.edu/ml (2011).
[20] https://archive.ics.uci.edu/ml/datasets.html.