KNN – Unit 1 Notes

Classification Algorithm in Machine Learning

As we know, Supervised Machine Learning algorithms can be broadly classified into Regression and Classification algorithms. Regression algorithms predict continuous output values, whereas Classification algorithms are needed to predict categorical values.

What is the Classification Algorithm?

The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data. In Classification, a program learns from the given dataset or observations and then classifies new observations into one of a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels, or categories.

Unlike regression, the output variable of Classification is a category, not a value, such as "Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised Learning technique, it takes labeled input data, which means each input comes with its corresponding output.

In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):

1. y = f(x), where y = categorical output

The best example of an ML classification algorithm is Email Spam Detector.

The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.

Classification algorithms can be better understood using the below diagram. In the diagram, there are two classes, Class A and Class B. Points within each class have features that are similar to one another and dissimilar to those of the other class.
The algorithm which implements the classification on a dataset is known as a classifier.
There are two types of Classifications:

o Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a Multi-class Classifier.
Examples: classification of types of crops, classification of types of music.

Learners in Classification Problems:

In the classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions.
Examples: K-NN algorithm, case-based reasoning.
2. Eager Learners: Eager learners develop a classification model based on a training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction.
Examples: Decision Trees, Naive Bayes, ANN.

Types of ML Classification Algorithms:

Classification algorithms can be divided mainly into two categories:

o Linear Models
  o Logistic Regression
  o Support Vector Machines
o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naive Bayes
  o Decision Tree Classification
  o Random Forest Classification

Evaluating a Classification model:

Once our model is complete, it is necessary to evaluate its performance, whether it is a Classification or a Regression model. A Classification model can be evaluated in the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
o For a good binary classification model, the value of log loss should be near 0.
o The value of log loss increases as the predicted value deviates from the actual value.
o A lower log loss represents a higher accuracy of the model.
o For binary classification, cross-entropy can be calculated as:

1. -(y log(p) + (1 - y) log(1 - p))
Where y = actual output, p = predicted probability.
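As a quick, hedged sketch (assuming scikit-learn is available), log loss can be computed from true labels and predicted probabilities like this; the labels and probabilities below are made up for illustration:

from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# log_loss averages -(y*log(p) + (1-y)*log(1-p)) over all samples
print(log_loss(y_true, y_prob))   # lower is better; close to 0 for a near-perfect model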

2. Confusion Matrix:

o The confusion matrix provides a matrix/table as output and describes the performance of the model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, giving the total number of correct predictions and incorrect predictions. The matrix looks like the table below:

                       Actual Positive         Actual Negative
Predicted Positive     True Positive (TP)      False Positive (FP)
Predicted Negative     False Negative (FN)     True Negative (TN)

3. AUC-ROC curve:

o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o It is used to visualize the performance of a binary classification model and can be extended to multi-class problems (for example, with a one-vs-rest approach).
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) is on the Y-axis and FPR (False Positive Rate) is on the X-axis.
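As an illustrative sketch (using a small synthetic dataset rather than any dataset from these notes), the ROC curve and AUC can be obtained with scikit-learn as follows:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Synthetic binary data purely for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_scores = clf.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_scores)  # FPR on X-axis, TPR on Y-axis
print("AUC:", roc_auc_score(y_test, y_scores))

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.show()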

Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some popular use cases
of Classification Algorithms:
o Email Spam Detection
o Speech Recognition
o Identification of cancer tumor cells
o Drugs Classification
o Biometric Identification, etc.

Regression vs. Classification in Machine Learning

Regression and Classification algorithms are both Supervised Learning algorithms. Both are used for prediction in machine learning and work with labeled datasets. The difference between them lies in the kinds of machine learning problems they are used for.

The main difference between Regression and Classification algorithms is that Regression algorithms are used to predict continuous values such as price, salary, age, etc., while Classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam, etc.

Consider the below diagram:


Classification:

Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters. In Classification, a computer program is trained on the
training dataset and based on that training, it categorizes the data into different classes.

o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naive Bayes
o Decision Tree Classification
o Random Forest Classification

Logistic Regression Code using Python:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('insurance_data.csv')
print(df)

# Assuming the CSV has a single feature column (e.g. 'age') and a binary
# label column (e.g. 'bought_insurance'); adjust the names to match your file.
X = df[['age']].values
Y = df['bought_insurance'].values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

model = LogisticRegression()
model.fit(X_train, Y_train)
print('Model got trained')

print(model.predict([[50]]))   # predicted class for age 50
print(model.predict([[25]]))   # predicted class for age 25

Regression:
Regression is a process of finding the correlations between dependent and independent
variables. It helps in predicting the continuous variables such as prediction of Market
Trends, prediction of House prices, etc.

The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).

Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training
is completed, it can easily predict the weather for future days.

Types of Regression Algorithm:

o Simple Linear Regression
o Multiple Linear Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression

Simple Linear Regression Code using Python:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv('student_scores.csv')
X = df.iloc[:, :-1].values
Y = df.iloc[:, 1].values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

regressor = LinearRegression()
regressor.fit(X_train, Y_train)
print("Model got trained with data")

pred = regressor.predict(X_test)
print("R^2 score on test data:", regressor.score(X_test, Y_test))

# Plot the fitted line over the training and test points
X_range = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
Y_range = regressor.predict(X_range)

plt.scatter(X_train, Y_train)
plt.scatter(X_test, Y_test)
plt.plot(X_range, Y_range)
plt.show()

Difference between Regression and Classification

o In Regression, the output variable must be of a continuous nature or a real value; in Classification, the output variable must be a discrete value.
o The task of the regression algorithm is to map the input value (x) to a continuous output variable (y); the task of the classification algorithm is to map the input value (x) to a discrete output variable (y).
o Regression algorithms are used with continuous data; Classification algorithms are used with discrete data.
o In Regression, we try to find the best-fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which can divide the dataset into different classes.
o Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, etc.; Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, identification of cancer cells, etc.
o Regression algorithms can be further divided into Linear and Non-linear Regression; Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.

Data Matrix Notation

One of the fundamental applications of matrices and vectors in machine learning is the
representation of data. In most machine learning tasks, data is typically organized in a tabular
format, where each row represents an observation and each column represents a feature or
attribute. This tabular structure can be represented as a matrix, where each element
corresponds to a specific value in the dataset.

The fuel of ML models, that is data, needs to be converted into arrays before you can feed it
into your models. The computations performed on these arrays include operations like matrix
multiplication (dot product). This further returns the output that is also represented as a
transformed matrix/tensor of numbers.

For example, consider a dataset of housing prices with features such as the number of
bedrooms, square footage, and location. This dataset can be represented as an m x n matrix,
where m is the number of observations (rows) and n is the number of features (columns). Each
element in the matrix represents the value of a specific feature for a particular observation.

Linear algebra basically deals with vectors and matrices (different shapes of arrays) and operations on these arrays. In NumPy, vectors are simply 1-dimensional arrays of numbers, but geometrically they have both magnitude and direction. Our data can be represented using vectors. In the figure below, one row of this data is represented by a feature vector which has 3 elements or components, representing 3 different dimensions. A vector with n entries lives in an n-dimensional vector space; in this case, we have 3 dimensions.

In the context of data science, particularly in machine learning, the significance of linear
algebra becomes evident when dealing with datasets that have numerous features, making
visualization and manual judgment challenging. While we can easily visualize and draw lines
in 2 or 3-dimensional Cartesian space, real-world datasets often involve a high-dimensional
space (N-dimensions) that is impractical to visualize. This is where the power of linear algebra
comes into play. It allows us to apply mathematical principles to machine learning models,
enabling the creation of decision boundaries or planes in N-dimensional space for accurate
data classification and analysis.

Data representation includes transforming data into vectors and matrices, which are structured
mathematical objects that can be manipulated to perform operations like addition,
multiplication, and transformation.
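As a small sketch of the m x n data-matrix idea described above (the numbers below are made up for illustration):

import numpy as np

# A tiny housing dataset: each row is one observation (house),
# each column is one feature: [bedrooms, square footage, location code]
X = np.array([
    [3, 1500, 1],
    [2,  900, 2],
    [4, 2100, 1],
])

print(X.shape)     # (3, 3): m = 3 observations, n = 3 features
print(X[0])        # the feature vector for the first observation
print(X[:, 1])     # the 'square footage' column across all observations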
K-Nearest Neighbor(KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on it.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to the training data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider
the below image:

o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as d = sqrt((x2 - x1)² + (y2 - y1)²).
o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B. Consider the below image:
o As we can see, the majority (3 of the 5) nearest neighbors are from Category A; hence this new data point must belong to Category A.
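The steps above can also be sketched directly in Python; this is a minimal, from-scratch illustration with made-up 2-D points and labels, not the scikit-learn implementation shown later:

import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    # Step 2-3: compute Euclidean distances and take the k nearest neighbours
    distances = [(math.dist(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    nearest = sorted(distances)[:k]
    # Step 4-5: count the labels among the k neighbours and pick the majority
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up data: Category A around (1, 1), Category B around (5, 5)
points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
labels = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_predict(points, labels, query=(2, 2), k=3))   # -> 'A'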
How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try several values to find the best among them. The most commonly preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and expose the model to the effects of outliers.
o Large values for K can smooth out noise, but if K is too large the model may misclassify points near the class boundaries.

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o It always needs a value of K to be determined, which may be complex at times.
o The computation cost is high because the distance to every training sample must be calculated for each prediction.

KNN – Implementation using Python (for Classification)

# Import necessary libraries

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier


from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load the Iris dataset

iris = load_iris()

X = iris.data

y = iris.target

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

# Create KNN classifier

knn = KNeighborsClassifier(n_neighbors=3)

# Train the model

knn.fit(X_train, y_train)

# Make predictions

y_pred = knn.predict(X_test)

# Evaluate the model

print("Confusion Matrix:")

print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")

print(classification_report(y_test, y_pred))

print("\nAccuracy Score:")

print(accuracy_score(y_test, y_pred))

Here's a simple implementation of the K-Nearest Neighbors (KNN) algorithm for classification using Python. We use the popular scikit-learn library, which provides a straightforward API for implementing machine learning models.

Explanation:

1. Loading the Dataset: The Iris dataset is a popular dataset used for classification
problems. It consists of 150 samples from three species of Iris flowers, with four
features (sepal length, sepal width, petal length, and petal width).
2. Data Splitting: The data is split into training and testing sets using train_test_split.
The test_size=0.3 indicates that 30% of the data will be used for testing.
3. Feature Scaling: KNN is a distance-based algorithm, so it's important to scale the
features to ensure that all features contribute equally to the distance calculation.
4. Training the Model: The KNeighborsClassifier is initialized with n_neighbors=3,
which means the model will consider the 3 nearest neighbors for classification.
5. Making Predictions: The model predicts the class labels for the test set.
6. Evaluation: The model's performance is evaluated using a confusion matrix,
classification report, and accuracy score.
KNN with Numerical Examples

KNN is one of the simplest machine learning algorithms for classification and regression problems, but it is mainly used for classification.

 In KNN classification, the output is a class membership. The given data point is classified based on the majority class of its neighbors: the data point is assigned to the most frequent class among its k nearest neighbors. Usually k is a small positive integer. If k = 1, then the data point is simply assigned to the class of that single nearest neighbor.

 In KNN regression, the output is simply some property value for the object. This value is the average of the values of the k nearest neighbors (see the sketch below).
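As a brief sketch of the regression case (using scikit-learn on made-up numbers), the prediction is simply the average of the targets of the k nearest neighbours:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D feature and continuous target values
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)

# The prediction for 2.6 is the mean of the targets of its 2 nearest neighbours (2 and 3)
print(reg.predict([[2.6]]))   # -> [2.5]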

The following two properties would define KNN well −

 Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase and uses all of the data for training while classifying.

 Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn't assume anything about the underlying data.

KNN Intuition

The KNN algorithm intuition is very simple to understand. It simply calculates the distance
between a sample data point and all the other training data points. The distance can be
Euclidean distance or Manhattan distance. Then, it selects the k nearest data points where k
can be any integer. Finally, it assigns the sample data point to the class to which the majority
of the k data points belong.

In KNN, "K" is the number of nearest neighbors, and we usually choose an odd value of K because it helps avoid ties when deciding the majority class.
In the above graph we can see there are two categories of data, and we need to classify the new data point based on its nearest neighbors.

Distance Measurement

We generally say that we will use distance to find the nearest neighbours of a query point Xq, but how exactly is distance measured mathematically between Xq and the other points? Without a concrete distance measure, we cannot conclude whether a point is nearest or not.

In theoretical terms, a distance measure is an objective score that summarizes the difference between two objects in a specific domain. There are several distance measurement techniques, but we only use some of them, listed below:

1. Euclidean distance

We mostly use this distance measurement technique to find the distance between two points. It is generally used to find the distance between two real-valued vectors.
2. Manhattan distance

This distance is also known as the taxicab distance or city block distance, because of the way it is calculated: the distance between two points is the sum of the absolute differences of their Cartesian coordinates.
3. Minkowski distance

It is a metric intended for real-valued vector spaces. We can calculate Minkowski distance
only in a normed vector space, which means in a space where distances can be represented as
a vector that has a length and the lengths cannot be negative.

There are a few conditions that the distance metric must satisfy:

1. Non-negativity: d(x, y) >= 0

2. Identity: d(x, y) = 0 if and only if x == y

3. Symmetry: d(x, y) = d(y, x)

4. Triangle Inequality: d(x, y) + d(y, z) >= d(x, z)
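The generalized Minkowski formula referred to in the next sentence appeared as an image in the original notes; a standard way to write it, for two n-dimensional points x and y, is:

\[
D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/p}
\]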

The above formula for Minkowski distance is in generalized form, and we can manipulate it to get different distance metrics.

The p value in the formula can be manipulated to give us different distances like:

 p = 1, when p is set to 1 we get Manhattan distance

 p = 2, when p is set to 2 we get Euclidean distance


Numerical Example

Q. From the given dataset, determine whether the point (x, y) = (57, 170) belongs to the Under-weight or Normal-weight class.

Solution:

In this approach we are going to use the Euclidean distance formula, with n (number of records) = 9, and we assume the K value to be 3.

Using d = sqrt((x2 - x1)² + (y2 - y1)²) for each record:

d1: x1 = 167, y1 = 51 and x2 = 170, y2 = 57
d1 = sqrt((170 - 167)² + (57 - 51)²) = 6.7

d2: x1 = 183, y1 = 56 and x2 = 170, y2 = 57
d2 = sqrt((170 - 183)² + (57 - 56)²) = 13

d3: x1 = 176, y1 = 69 and x2 = 170, y2 = 57
d3 = sqrt((170 - 176)² + (57 - 69)²) = 13.4

d4: x1 = 173, y1 = 64 and x2 = 170, y2 = 57
d4 = sqrt((170 - 173)² + (57 - 64)²) = 7.6

d5: x1 = 172, y1 = 65 and x2 = 170, y2 = 57
d5 = sqrt((170 - 172)² + (57 - 65)²) = 8.2

d6: x1 = 173, y1 = 64 and x2 = 170, y2 = 57
d6 = sqrt((170 - 173)² + (57 - 64)²) = 4.1

d7: x1 = 169, y1 = 58 and x2 = 170, y2 = 57
d7 = sqrt((170 - 169)² + (57 - 58)²) = 1.414

d8: x1 = 173, y1 = 57 and x2 = 170, y2 = 57
d8 = sqrt((170 - 173)² + (57 - 57)²) = 3

d9: x1 = 170, y1 = 55 and x2 = 170, y2 = 57
d9 = sqrt((170 - 170)² + (57 - 55)²) = 2

From the above results, the three nearest neighbours of the given point (57, 170) are: (57, 174) → Under-weight, (57, 173) → Normal-weight, and (58, 169) → Normal-weight. Out of these three points, two are Normal-weight and one is Under-weight, so the majority class is Normal-weight.

Final conclusion: the given point is Normal-weight.

Finding K Value:

The most important step in KNN is to find an optimal value of K, i.e., how many neighbours to consider. An optimal value of K reduces the effect of noise on the classification but makes the boundaries between classes less distinct.
Failure Cases of K-NN:

1. When the query point is far away from all the data points.

2. When we have a jumbled dataset.

A jumbled dataset has no useful structure in it; in such a situation, the algorithm may fail.
Distance Measures in K-NN:

There are mainly four distance measures used in Machine Learning, listed below.

1. Euclidean Distance

2. Manhattan Distance

3. Minkowski Distance

4. Cosine Distance

Euclidean Distance

The Euclidean distance between two points in either the plane or 3-dimensional space
measures the length of a segment connecting the two points. It is the most obvious way of
representing distance between two points. Euclidean distance marks the shortest route of the
two points.

The Pythagorean Theorem can be used to calculate the distance between two points, as shown in the figure below. If the points (x1, y1) and (x2, y2) are in 2-dimensional space, then the Euclidean distance between them is sqrt((x2 - x1)² + (y2 - y1)²).
Euclidean distance is also called the L2 norm of a vector.

A norm measures the length of a vector; the distance between two vectors is the norm of their difference.

The Euclidean distance of a point from the origin is given by sqrt(x1² + x2² + … + xn²), i.e., the length of its vector.


Manhattan Distance

The Manhattan distance between two vectors (city blocks) is equal to the one-norm of the
distance between the vectors. The distance function (also called a “metric”) involved is also
called the “taxi cab” metric.

The Manhattan distance between two vectors is called the L1 norm of a vector.

In the L2 norm we take the sum of the squared differences between the vector elements; in the L1 norm we take the sum of the absolute differences between the vector elements.

The Manhattan distance between two points (x1, y1) and (x2, y2) is:
|x1 - x2| + |y1 - y2|.
The Manhattan distance of a point from the origin is given by |x1| + |x2| + … + |xn|.

Minkowski Distance

Minkowski distance is a metric in a normed vector space. Minkowski distance is used for
distance similarity of vector. Given two or more vectors, find distance similarity of these
vectors.

Minkowski distance is called the Lp norm of a vector.

p ≠ 0; p is always greater than 0 (p > 0).

Euclidean distance from Minkowski distance

When p = 2, Minkowski distance is the same as the Euclidean distance.

Manhattan distance from Minkowski distance

When p = 1, Minkowski distance is the same as the Manhattan distance.


Cosine distance and cosine similarity:

Cosine similarity measures the similarity between two vectors of an inner product space. It is
measured by the cosine of the angle between two vectors and determines whether two vectors
are pointing in roughly the same direction. It is often used to measure document similarity in
text analysis.

The cosine similarity is advantageous because even if the two similar documents are far apart
by the Euclidean distance (due to the size of the document), chances are they may still be
oriented closer together. The smaller the angle, the higher the cosine similarity.

The relation between cosine similarity and cosine distance can be defined as below.

1. Similarity decreases when the distance between two vectors increases.

2. Similarity increases when the distance between two vectors decreases.


Cosine Similarity and Cosine Distance:

Cosine similarity says that to find the similarity between two points or vectors we need to find
the Angle between them.

The formula to find the Cosine Similarity and Distance is as below:

Cosine Similarity= cosθ

cosine Distance=1- cosθ

When x1 and x2 are very similar, cosine similarity(x1, x2) is close to 1; when they are very dissimilar, cosine similarity(x1, x2) is close to -1.
θ is the angle between x1 and x2.

Cosine distance uses the angle between two points, whereas Euclidean distance uses the geometrical distance between the two points.

If A and B are unit vectors, then ||A|| = ||B|| = 1,

and the similarity is cos(θ) = A·B.

Relationship between Euclidean distance and Cosine distance.

If A and B are unit vectors, then ||A|| = ||B|| = 1 and

[Euclidean distance(x1, x2)]² = 2(1 - cos(θ)) = 2 × cosine distance(x1, x2)
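A quick numerical check of this relationship (a sketch with arbitrarily chosen unit vectors, using NumPy):

import numpy as np

# Two arbitrary unit vectors
a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])           # 0.6^2 + 0.8^2 = 1, so it is a unit vector

cos_theta = np.dot(a, b)           # for unit vectors, cos(theta) = A.B
euclid_sq = np.sum((a - b) ** 2)   # squared Euclidean distance

print(euclid_sq)                   # 0.8
print(2 * (1 - cos_theta))         # also 0.8, i.e. 2 * cosine distance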

Example Demonstration

Consider two sentences:

sentence 1: hello world

sentence 2: hello hello

Mapping them to vectors with the vocabulary (hello, world), sentence 1 becomes (1, 1) and sentence 2 becomes (2, 0).

Now applying the Euclidean distance formula to this: sqrt((2 - 1)² + (0 - 1)²) = sqrt(2), which is around 1.4142.

Now applying the cosine similarity formula: cos(θ) = (1·2 + 1·0) / (sqrt(2) · 2) = 1/sqrt(2) ≈ 0.7071.

This stays the same even if sentence 2 is "hello hello hello": the Euclidean distance grows to 2.2361, but the cosine similarity is still 0.7071.

sentence 1: I am Iron Man

sentence 2: Iron can rust.

With the vocabulary (I, am, Iron, Man, can, rust), we get the vector (1, 1, 1, 1, 0, 0) for sentence 1 and (0, 0, 1, 0, 1, 1) for sentence 2.

Applying the cosine similarity formula to these vectors: cos(θ) = (A·B) / (||A|| · ||B||) = 1 / (2 · sqrt(3)).

The final similarity is 0.289, which seems reasonable given the sentences.
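The two worked examples above can be reproduced with a short sketch; the vectors are entered by hand from the bag-of-words counts described above:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example 1: vocabulary (hello, world)
s1 = np.array([1, 1])      # "hello world"
s2 = np.array([2, 0])      # "hello hello"
print(np.linalg.norm(s1 - s2))      # Euclidean distance ~ 1.4142
print(cosine_similarity(s1, s2))    # ~ 0.7071

# Example 2: vocabulary (I, am, Iron, Man, can, rust)
v1 = np.array([1, 1, 1, 1, 0, 0])   # "I am Iron Man"
v2 = np.array([0, 0, 1, 0, 1, 1])   # "Iron can rust"
print(cosine_similarity(v1, v2))    # ~ 0.289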

Use Cases and disadvantages

Use Cases:

1. Document Similarity: Cosine similarity is widely used in natural language processing to measure the similarity between documents. It's applied in plagiarism detection, document clustering, and content recommendation systems.

2. Recommendation Systems: In collaborative filtering-based recommendation systems, cosine similarity helps identify users or items with similar preferences. It's used in movie, music, and product recommendations.

3. Text Classification: In text classification tasks like spam detection, sentiment analysis, or
topic modeling, cosine similarity can be used to compare the similarity between a document
and predefined categories.
4. Information Retrieval: Search engines use cosine similarity to match user queries with
relevant documents by comparing the query vector to document vectors.

5. Clustering: Cosine similarity aids in clustering similar data points together. For example,
it’s used in grouping similar news articles or social media posts.

Disadvantages:

1. Sensitivity to Document Length: Cosine similarity doesn't consider the length of documents. Longer documents may have lower cosine similarity scores even if they share substantial content.

2. Lack of Semantic Understanding: Cosine similarity treats words or features as independent entities. It doesn't capture the semantic meaning of words, making it less effective in understanding context.

3. Sparse Data Issues: In high-dimensional spaces, where data is often sparse, cosine
similarity can be less reliable. It might not accurately reflect the true similarity between data
points.

4. Normalization Dependency: Cosine similarity depends on vector normalization. Different normalization methods can yield different results, making comparisons sensitive to preprocessing choices.

5. Negative Associations: Although cosine similarity can produce values between -1 and 1, it doesn't handle negative associations well, which might be relevant in certain applications.

The performance of the K-NN algorithm is influenced by three main factors :

1. The distance function or distance metric used to determine the nearest neighbors.

2. The decision rule used to derive a classification from the K-nearest neighbors.

3. The number of neighbors used to classify the new example.


Decision surface for K-NN as K changes:

When K = 1, the decision curve has sharp edges and is not smooth, and the classifier makes no mistakes on the training data.

When K = 5, the decision surface is a smoother curve and the classifier makes a few small mistakes.

In K-NN, the smoothness of the decision surface increases as K increases.

When K = n, the classifier assigns every query point to the majority class; with K = n the classifier makes more errors.

We have experimented with the K-NN decision surface on some toy datasets, shown in the images below. From these images we observe that as K increases, the decision surface becomes smoother.

Overfitting and Underfitting of a Model

Overfitting and Underfitting are the two main problems that occur in machine learning and
degrade the performance of the machine learning models.

The main goal of each machine learning model is to generalize well.


Here, generalization defines the ability of an ML model to provide suitable output for a given set of unseen inputs. It means that after being trained on the dataset, the model can produce reliable and accurate output. Hence, underfitting and overfitting are the two terms that need to be checked to judge the performance of the model and whether it is generalizing well or not.

Before understanding overfitting and underfitting, let's understand some basic terms that will help in understanding this topic well:

o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying
the machine learning algorithms. Or it is the difference between the predicted values
and the actual values.
o Variance: If the machine learning model performs well with the training dataset, but
does not perform well with the test dataset, then variance occurs.

Overfitting

Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. An overfitted model has low bias and high variance.

The chances of overfitting increase the more we train our model: the longer we train, the higher the chance of ending up with an overfitted model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of the overfitting can be understood by the below graph of the linear
regression output:

As we can see from the above graph, the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not. The goal of the regression model is to find the best-fit line; since we have not obtained a true best fit here, the model will generate prediction errors on new data.

How to avoid the Overfitting in Model

Both overfitting and underfitting cause the degraded performance of the machine learning
model. But the main cause is overfitting, so there are some ways by which we can reduce the
occurrence of overfitting in our model.

o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling

Underfitting

Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, but then the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training data, and
hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

Example: We can understand the underfitting using below output of the linear regression
model:
As we can see from the above diagram, the model is unable to capture the data points present
in the plot.

How to avoid underfitting:


o By increasing the training time of the model.
o By increasing the number of features.

Goodness of Fit

The "Goodness of fit" term is taken from the statistics, and the goal of the machine learning
models to achieve the goodness of fit. In statistics modeling, it defines how closely the result
or predicted values match the true values of the dataset.

The model with a good fit is between the underfitted and overfitted model, and ideally, it
makes predictions with 0 errors, but in practice, it is difficult to achieve it.

As when we train our model for a time, the errors in the training data go down, and the same
happens with test data. But if we train the model for a long duration, then the performance of
the model may decrease due to the overfitting, as the model also learn the noise present in the
dataset. The errors in the test dataset start increasing, so the point, just before the raising of
errors, is the good point, and we can stop here for achieving a good model.
The process of finding the right function for the given training data (Dtrain) is called fitting.

When k = 1 the model is overfitting the data, because it makes no errors on the training data.

When k = n the model is underfitting the data, because it makes more errors: it considers every query point to belong to the majority class.

The balance between overfitting and underfitting is a well-fit model. It makes some errors, because machine learning is not perfect; it is fine to make small mistakes.

But how can we be sure whether our model is underfitting or overfitting?

The answer is by plotting the errors.

We want our model to have maximum accuracy, or minimum error, on the cross-validation dataset.
Training Error: The training error is the error a trained model makes when it is run again on the data it was trained on. There is one logical assumption here, by the way: your training set does not include identical training samples belonging to different classes, i.e. conflicting information. Some real-world datasets might have this property, though.

Cross-validation Error: The error measured on the held-out data during cross-validation; it is used when choosing the best K.

When the training error is low and the validation error is high, we face the problem of overfitting, shown in the image above.

When the training error is high and the validation error is also high, we face the problem of underfitting, shown in the image above.
We choose the model where the training error and the validation error are both small and close to each other, shown in the image above as the best fit.

How to find the best K?

By cross-validation, we find the optimal K.

Cross-validation simply means keeping some of the data seen by the function (for training) and some of the data unseen by the function (for validation).

Cross-validation (CV) is one of the techniques used to test the effectiveness of machine learning models; it is also a resampling procedure used to evaluate a model when we have limited data. To perform CV we keep aside a sample/portion of the data that is not used to train the model, and later use this sample for testing/validating.

Below are the few common techniques used for CV.

1. Train_Test Split approach.

In this approach, we randomly split the complete data into training and test sets, then perform the model training on the training set and use the test set for validation purposes, ideally splitting the data 70:30 or 80:20. If our data is huge and our test sample and train sample have the same distribution, then this approach is acceptable.
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

df=pd.read_csv('student_scores.csv')

X=df.iloc[:,:-1].values

Y=df.iloc[:,1].values

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)

regressor=LinearRegression()
regressor.fit(X_train,Y_train)

print("Model got trained with data")

Python Code for Bias and Variance (to check whether it is overfit or underfit or bestfit):

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from mlxtend.evaluate import bias_variance_decomp

df=pd.read_csv('student_scores.csv')

X=df.iloc[:,:-1].values

Y=df.iloc[:,1].values

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)

model=LinearRegression()

mse,bias,var=bias_variance_decomp(model,X_train,Y_train,X_test,Y_test,loss='mse',num_rounds=200,random_seed=1)

print('MSE:',mse)

print('Bias:',bias)

print('Variance:',var)

o/p
MSE: 25.994907724912167
Bias: 22.412118374516112
Variance: 3.582789350396052

(This model has high bias and low variance, so it is an underfitting model.)
Pros of K Nearest Neighbors

 Simple algorithm and hence easy to interpret the predictions.

 Non-parametric, so it makes no assumptions about the underlying data pattern.

 Can be used for both classification and regression.

 The training step is much faster for nearest neighbors compared to other machine learning algorithms.

Cons of K Nearest Neighbors

 KNN is computationally expensive as it searches the nearest neighbors for the new point
at the prediction stage

 High memory requirement as KNN has to store all the data points

 Prediction stage is very costly

 Sensitive to outliers, accuracy is impacted by noise or irrelevant data.

The effectiveness of the k-Nearest Neighbors (KNN) algorithm

1. Simplicity and Interpretability

 Strengths: KNN is easy to understand and implement. Its simplicity comes from the
fact that it makes decisions based on the similarity between data points (distance
between neighbors).
 Weaknesses: KNN doesn't learn a model; it simply memorizes the training data and
makes predictions based on proximity, which might not generalize well in complex
problems.

2. Performance

 Strengths: KNN performs well for simple and well-distributed data where class
separation is clear.
 Weaknesses: It can struggle with high-dimensional data (the curse of
dimensionality). The distance metric may become less meaningful in many
dimensions, affecting the algorithm's accuracy. Additionally, it is computationally
expensive during inference since it has to calculate distances to every training point
for each prediction.

3. No Assumptions about Data

 Strengths: KNN is a non-parametric algorithm, meaning it doesn't make any assumptions about the underlying data distribution. This makes it flexible and applicable to many different problems.
 Weaknesses: The algorithm’s effectiveness is highly dependent on the quality and
amount of labeled data available. Without sufficient and balanced training data, KNN
can make inaccurate predictions.

4. Choice of Hyperparameters

 Strengths: KNN has only a few hyperparameters, primarily the number of neighbors
(k) and the distance metric used. It can be tuned easily using cross-validation.
 Weaknesses: Choosing an optimal k is crucial. If k is too small, KNN might be
sensitive to noise. If k is too large, it can misclassify based on distant or irrelevant
data points.

5. Effect of Noisy Data

 Weaknesses: KNN is sensitive to noise, particularly when k is small. Noisy data points may lead to incorrect classification if they are close to the test sample.

6. Scalability

 Weaknesses: KNN is computationally expensive, especially with a large dataset, because the algorithm computes distances from the query point to all points in the training set. Optimizations like KD-trees or Ball trees can help, but for very large datasets, KNN may not be scalable.
7. Distance Metric Selection

 The effectiveness of KNN is directly influenced by the choice of the distance metric
(e.g., Euclidean, Manhattan, Minkowski). For certain types of data, such as text or
categorical variables, specialized distance metrics (e.g., Hamming distance) may be
necessary to ensure good performance.

Summary of Effectiveness

 Good for: Simple datasets, small to medium-sized data, cases where model
interpretability is crucial.
 Challenges: High-dimensional data, large datasets, noise, and computational costs.

The effectiveness of K-Nearest Neighbors (KNN) can be quantitatively evaluated using metrics such as accuracy, precision, recall, and F1 score, and by visualizing performance through graphs like decision boundaries, confusion matrices, or ROC curves.
