KNN Classifier in R Programming

Last Updated : 02 May, 2025

K-Nearest Neighbors (KNN) is a supervised, non-linear classification algorithm. It is also non-parametric, meaning it makes no assumptions about the underlying data or its distribution.

Algorithm Structure

In the KNN algorithm, K specifies the number of neighbors to consider, and the algorithm works as follows:

  • Choose the number of neighbors, K.
  • Find the K nearest neighbors of the unknown data point according to distance.
  • Among these K neighbors, count the number of data points in each category.
  • Assign the new data point to the category with the most neighbors.

For the Nearest Neighbor classifier, the distance between two points is measured as the Euclidean distance.
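
For two points $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ in an $n$-dimensional feature space, the Euclidean distance is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$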

Example:

Consider a dataset with two categories, Red and Blue, into which we want to classify new points. Here K = 5, meaning we consider the 5 nearest neighbors according to Euclidean distance.

So, when a new data point arrives and 3 of its 5 nearest neighbors are Blue while 2 are Red, we assign the point to the category with the most neighbors (in this case, Blue), as the sketch below illustrates.
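
A minimal R sketch of this majority-vote step, assuming the labels of the 5 nearest neighbors have already been found (the neighbor_labels vector below is hypothetical, matching the example above):

R
# Hypothetical labels of the 5 nearest neighbors (3 Blue, 2 Red)
neighbor_labels <- c("Blue", "Blue", "Red", "Blue", "Red")

# Count the votes per category and pick the majority
votes <- table(neighbor_labels)
names(votes)[which.max(votes)]  # returns "Blue"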

Implementation of KNN

We will implement the K-Nearest Neighbors algorithm in the R programming language using the Iris dataset.

1. Installing the Required Packages

We will install the class package, which provides the knn() function for fitting a KNN model, caTools for splitting our dataset into training and testing sets, and ggplot2 for plotting.

R
install.packages("caTools") 
install.packages("class") 
install.packages("ggplot2")
  

library(caTools) 
library(class)
library(ggplot2)

2. Importing the Dataset

We will use the Iris dataset, a built-in dataset in R that contains 50 samples from each of 3 species of Iris (Iris setosa, Iris virginica and Iris versicolor). We use the str() function to show the feature names and structure of the dataset.

R
data(iris)
str(iris)

Output:

Structure of the data
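
As an optional sanity check of the class balance described above (50 samples per species), we can tabulate the Species column:

R
# Count the samples per species: 50 setosa, 50 versicolor, 50 virginica
table(iris$Species)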

3. Splitting data into train and test data

We first split the Iris dataset into training and testing sets using a 70:30 ratio, splitting on the Species column so that all three classes are represented proportionally. Then, we scale the numeric feature columns (first 4) to normalize their values, applying the training set's centering and scaling parameters to the test set so both are scaled consistently.

R
set.seed(123)  # arbitrary seed so the split is reproducible

# Split on the label column for a stratified 70:30 train/test split
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)

# Scale the 4 numeric features, reusing the training set's parameters for the test set
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4],
                    center = attr(train_scale, "scaled:center"),
                    scale = attr(train_scale, "scaled:scale"))

4. Fitting KNN Model

We fit a KNN model using the scaled training data with k = 1, so the model predicts a species label for each test point from its single nearest neighbor in the training set. The training labels are supplied through the cl argument (here, the Species column of the training set).

R
classifier_knn <- knn(train = train_scale, 
                      test = test_scale, 
                      cl = train_cl$Species, 
                      k = 1) 
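
knn() returns a factor of predicted labels, one per test row. As a quick optional check, we can inspect the first few predictions:

R
# First few predicted species labels for the test set
head(classifier_knn)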

5. Displaying a Confusion Matrix

We create a confusion matrix to compare the predicted labels with the actual species in the test set. This helps us evaluate how well the KNN model classified each species.

R
cm <- table(test_cl$Species, classifier_knn) 
cm

Output:

Confusion Matrix of the KNN model
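
From the confusion matrix we can also compute the overall accuracy, i.e. the fraction of test points on the diagonal; a minimal sketch:

R
# Overall accuracy: correctly classified test points / all test points
sum(diag(cm)) / sum(cm)

# Equivalently, compare the predictions to the true labels directly
1 - mean(classifier_knn != test_cl$Species)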

6. Evaluating the Model for different K values

We test multiple values of k to find the most suitable one for our KNN model. For each k, we calculate the misclassification error and plot the corresponding accuracy. This helps in selecting a k that balances bias and variance for better model performance.

R
# Candidate values of k to evaluate
k_values <- c(1, 3, 5, 7, 15, 19)

# Fit a KNN model for each k and record its accuracy on the test set
accuracy_values <- sapply(k_values, function(k) {
  classifier_knn <- knn(train = train_scale,
                        test = test_scale,
                        cl = train_cl$Species,
                        k = k)
  1 - mean(classifier_knn != test_cl$Species)  # accuracy = 1 - misclassification error
})

accuracy_data <- data.frame(K = k_values, Accuracy = accuracy_values)

# Plot accuracy against k (ggplot2 was loaded in step 1)
ggplot(accuracy_data, aes(x = K, y = Accuracy)) +
  geom_line(color = "lightblue", linewidth = 1) +
  geom_point(color = "lightgreen", size = 3) +
  labs(title = "Model Accuracy for Different K Values",
       x = "Number of Neighbors (K)",
       y = "Accuracy") +
  theme_minimal()

Output:

KNN model performance

From the graph, we observe the following accuracy trends for different values of k:

  • k = 1: The model achieved 91.66% accuracy.
  • k = 3: The accuracy remained the same at 91.66%, showing no improvement over k = 1.
  • k = 5: Accuracy increased to 95%, which is higher than at k = 1 and 3.
  • k = 7: The accuracy remained 95%, same as at k = 5.
  • k = 15: The accuracy dropped slightly to 92.5%.
  • k = 19: The accuracy further decreased to 90%, the lowest among all tested values.

Therefore, the optimal value of k for our model is 5.
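
Having settled on k = 5, we can refit the classifier with that value and re-examine its confusion matrix; a minimal sketch following the earlier steps:

R
# Refit the model with the chosen k = 5
classifier_knn5 <- knn(train = train_scale,
                       test = test_scale,
                       cl = train_cl$Species,
                       k = 5)

# Confusion matrix for the tuned model
table(test_cl$Species, classifier_knn5)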

In this article, we implemented the K-Nearest Neighbors (KNN) algorithm on the Iris dataset and evaluated model accuracy across different values of k. We found that accuracy peaked at k = 5 and k = 7, demonstrating the importance of tuning k for optimal performance.

