5. K-Nearest Neighbors Classifiers 2025

The document provides an overview of the K-Nearest Neighbors (K-NN) algorithm, which is primarily used for classification problems by comparing new data entries to existing data based on proximity. It outlines the procedure for implementing K-NN, including choosing the value of K, calculating distances using metrics like Euclidean and Manhattan distances, and classifying new entries based on majority voting among neighbors. Additionally, it includes a practical example of classifying iris species using the K-NN algorithm with the Iris dataset from scikit-learn.

Classification Algorithms
Pushparaj, Amrita Univ, Cbe

K-Nearest Neighbors Classifier (KNN)

K-Nearest Neighbors (K-NN) algorithm

• Used mostly for solving classification problems
• Compares a new data entry to the existing entries in a data set
• Based on its closeness or similarity to its K nearest neighbors, the algorithm assigns the new entry to a class or category in the training data


K-NN Procedure
• Step #1 - Assign a value to K.
• Step #2 - Calculate the distance between the new data entry and all existing data entries, then arrange the distances in ascending order.
• Step #3 - Find the K nearest neighbors to the new entry based on the calculated distances.
• Step #4 - Assign the new data entry to the majority class among those K neighbors.
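These steps translate directly into code. Below is a minimal from-scratch sketch in Python (not the document's own code; the knn_predict name and the toy data are illustrative only), using Euclidean distance and a simple majority vote:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step #2: distance from the new entry to every existing entry
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step #3: indices of the K smallest distances (ascending order)
    nearest = np.argsort(distances)[:k]
    # Step #4: majority vote among the K nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data: two features, two classes
X_train = np.array([[1.0, 2.0], [2.0, 1.5], [8.0, 8.5], [9.0, 9.0]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.8]), k=3))  # -> red

scikit-learn's KNeighborsClassifier, used later in this document, implements the same idea with more options and much better performance.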


[Figure: scatter plot of a data set with two classes, red and blue; a new data entry, the green point, has been introduced.]

• Assign a value to K, which denotes the number of neighbors to consider before classifying the new data entry.
• Let's assume the value of K is 3.
• Since K is 3, the algorithm will only consider the 3 nearest neighbors to the green point (the new entry).
• Out of the 3 nearest neighbors in the diagram, the majority class is red, so the new entry will be assigned to that class.
K-Nearest Neighbors Classifier and Model Example With Data Set
• We have two columns: Brightness and Saturation.
• Each row in the table has a class of either Red or Blue.
• Let's assume the value of K is 5.
• The new data entry:
[Table of Brightness/Saturation values not reproduced in this copy.]


Distance Metrics Used in the KNN Algorithm
• Euclidean Distance
• Measures the straight-line distance between the query point and the other point being measured
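The slide's formula image is not reproduced in this copy; for reference, the standard Euclidean distance between two n-dimensional points p and q is:

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}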


• Manhattan Distance
• Measures the sum of the absolute differences between the coordinates of two points (a city-block path rather than a straight line)
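Likewise, the standard Manhattan distance formula (supplied here for reference) is:

d(p, q) = \sum_{i=1}^{n} |p_i - q_i|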


How to Calculate Euclidean Distance

• Here's the new data entry:
[New entry values not reproduced in this copy.]
• To know its class, we have to calculate the distance from the new entry to the other entries in the data set using the Euclidean distance formula.

• The majority class within the 5 nearest neighbors to the new entry is Red. Therefore, we'll classify the new entry as Red.
How to Choose the Value of K in the K-NN Algorithm

• There is no single rule for choosing the value of K, but here are some common conventions to keep in mind:
• Choosing a very low value will most likely lead to inaccurate predictions, since the model becomes sensitive to noise.
• A commonly used value of K is 5.
• Use an odd number as the value of K in binary classification, so that majority votes cannot tie.
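In practice, K is usually chosen empirically rather than by convention alone. The following is a minimal sketch (not from the document) that uses scikit-learn's cross_val_score to compare candidate values of K by 5-fold cross-validation on the Iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# compare odd values of K by 5-fold cross-validation
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print("K={}: mean accuracy {:.3f}".format(k, scores.mean()))

The K with the highest mean accuracy is a reasonable choice for the final model.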


Classifying Iris Species
❖ An ML model for distinguishing the species of some iris flowers
❖ Data: iris measurements in cm:
❖ the length and width of the petals
❖ the length and width of the sepals
❖ Each iris instance has a class label giving its species (setosa, versicolor, or virginica)
❖ The goal is to build a machine learning model that can learn from the measurements of irises whose species is known, so that we can predict the species of a new iris.
❖ A three-class classification problem
❖ Data: the Iris dataset
[Figure: parts of the iris flower.]
Data: Iris dataset
Included in scikit-learn in the datasets module.
Load it by calling the load_iris function:

from sklearn.datasets import load_iris

iris_dataset = load_iris()

The iris object returned by load_iris is a Bunch object, which is very similar to a dictionary. It contains keys and values:

print("Keys of iris_dataset:\n{}".format(iris_dataset.keys()))

Keys of iris_dataset:
dict_keys(['target_names', 'feature_names', 'DESCR', 'data', 'target'])
The value of the key DESCR is a short description of the dataset:

print(iris_dataset['DESCR'][:193] + "\n...")

The value of the key target_names is an array of strings containing the species of flower:

print("Target names: {}".format(iris_dataset['target_names']))
Target names: ['setosa' 'versicolor' 'virginica']

print("Feature names:\n{}".format(iris_dataset['feature_names']))
Feature names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print("Type of data: {}".format(type(iris_dataset['data'])))
Type of data: <class 'numpy.ndarray'>
print("Shape of data: {}".format(iris_dataset['data'].shape))
Shape of data: (150, 4)
First five rows of data:
[[ 5.1 3.5 1.4 0.2]
print("First five rows of data:\n{}".format(iris_dataset['data'][:5])) [ 4.9 3. 1.4 0.2]
[ 4.7 3.2 1.3 0.2]
[ 4.6 3.1 1.5 0.2]
[ 5. 3.6 1.4 0.2]]

print("Type of target: {}".format(type(iris_dataset['target'])))

Type of target: <class 'numpy.ndarray'>

Pushparaj, Amrita Univ, Cbe


print("Shape of target: {}".format(iris_dataset['target'].shape))
Shape of target: (150,)

print("Target:\n{}".format(iris_dataset['target']))

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0000000000000111111111111111111111111
1111111111111111111111111122222222222
2222222222222222222222222222222222222
2 2]

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data’],
iris_dataset['target'], random_state=0)
Pushparaj, Amrita Univ, Cbe
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

X_train shape: (112, 4)


y_train shape: (112,)

print("X_test shape: {}".format(X_test.shape))


print("y_test shape: {}".format(y_test.shape))

X_test shape: (38, 4)


y_test shape: (38,)
Pushparaj, Amrita Univ, Cbe
Inspecting the data
To find abnormalities (e.g., some measurements may be in inches, not in centimeters).
Best way: visualization, such as a scatter plot or pair plot.
A pair plot shows all possible pairs of features.
To create the plot, we first convert the NumPy array into a pandas DataFrame.
pandas has a function to create pair plots called scatter_matrix.
The diagonal of this matrix is filled with histograms of each feature.

import pandas as pd
import mglearn

# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the dataframe, color by y_train
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                                 marker='o', hist_kwds={'bins': 20}, s=60,
                                 alpha=.8, cmap=mglearn.cm3)

[Figure: pair plot of the Iris training data, colored by class label.]
Building the Model: k-Nearest Neighbors
To set the parameters of the k-NN model:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

To build the model on the training set, we call the fit method of the knn object, which takes as arguments the NumPy array X_train containing the training data and the NumPy array y_train of the corresponding training labels:

knn.fit(X_train, y_train)
Making Predictions
Imagine we found an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of 0.2 cm. What species of iris would this be?

import numpy as np

X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))
X_new.shape: (1, 4)

To make a prediction, we call the predict method of the knn object:

prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
    iris_dataset['target_names'][prediction]))
Evaluating the Model

y_pred = knn.predict(X_test)
print("Test set predictions:\n{}".format(y_pred))

Test set predictions:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0 2]

print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))
Test set score: 0.97

We can also use the score method of the knn object, which computes the same mean accuracy:

print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
