KNN - Unit 1 Notes
As we know, Supervised Machine Learning algorithms can be broadly classified into
Regression and Classification algorithms. Regression algorithms predict the output for
continuous values, but to predict categorical values we need Classification algorithms.
The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns
from the given dataset or observations and then classifies new observations into a number of
classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes
can be called targets, labels, or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, it takes labeled input data, which means it contains inputs with the
corresponding outputs.
The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the diagram,
there are two classes, Class A and Class B. The points within each class have features that are
similar to each other and dissimilar to the points of the other class.
The algorithm which implements the classification on a dataset is known as a classifier.
There are two types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, then
it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG,
etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then
it is called a Multi-class Classifier.
Examples: Classification of types of crops, classification of types of music.
In classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it
receives the test dataset. Classification is then done on the basis of the most related
data stored in the training dataset. It takes less time in training but more time for
predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes
more time in learning and less time in prediction. Examples: Decision Trees, Naive
Bayes, ANN.
Classification algorithms can be further divided into mainly two categories:
o Linear Models
   o Logistic Regression
   o Support Vector Machines
o Non-linear Models
   o K-Nearest Neighbours
   o Kernel SVM
   o Naive Bayes
   o Decision Tree Classification
   o Random Forest Classification
Evaluating a Classification model:
1. Log Loss or Cross-Entropy Loss: For a binary classification it can be calculated as
-(y log(p) + (1 - y) log(1 - p))
where y = actual output and p = predicted probability.
2. Confusion Matrix: The confusion matrix provides a table as output that describes the
performance of the model; it contains the counts of correct and incorrect predictions
(true positives, true negatives, false positives, and false negatives).
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-
ROC Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-
axis and FPR(False Positive Rate) on X-axis.
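These three metrics can all be computed with scikit-learn. Below is a minimal sketch; the arrays y_true, y_pred and y_prob are hypothetical labels, predicted classes and predicted probabilities made up purely for illustration.
from sklearn.metrics import log_loss, confusion_matrix, roc_auc_score
# Hypothetical binary labels and model outputs (illustration only)
y_true = [0, 0, 1, 1, 1, 0]               # actual classes
y_pred = [0, 1, 1, 1, 0, 0]               # predicted classes
y_prob = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]   # predicted probability of class 1
print("Log Loss:", log_loss(y_true, y_prob))         # cross-entropy loss
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_prob))     # area under the ROC curve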
Classification algorithms can be used in different places. Below are some popular use cases
of Classification Algorithms:
o Email Spam Detection
o Speech Recognition
o Identifications of Cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.
Regression and Classification algorithms are Supervised Learning algorithms. Both the
algorithms are used for prediction in Machine learning and work with the labeled datasets.
But the difference between both is how they are used for different machine learning
problems.
The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc., while
Classification algorithms are used to predict/classify discrete values such as Male or
Female, True or False, Spam or Not Spam, etc.
Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters. In Classification, a computer program is trained on the
training dataset and based on that training, it categorizes the data into different classes.
o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naive Bayes
o Decision Tree Classification
o Random Forest Classification
Example: Python code for Classification (Logistic Regression):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
df=pd.read_csv('insurance_data.csv')
print(df)
# Assumes the first column holds the feature (e.g. age) and the last column the class label
X=df.iloc[:,:-1].values
Y=df.iloc[:,-1].values
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)
model=LogisticRegression()
model.fit(X_train,Y_train)
print(model.predict([[50]]))  # predicted class for input value 50
print(model.predict([[25]]))  # predicted class for input value 25
Regression:
Regression is a process of finding the correlations between dependent and independent
variables. It helps in predicting the continuous variables such as prediction of Market
Trends, prediction of House prices, etc.
The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).
Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training
is completed, it can easily predict the weather for future days.
Example: Python code for Regression (Linear Regression):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df=pd.read_csv('student_scores.csv')
X=df.iloc[:,:-1].values
Y=df.iloc[:,1].values
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)
regressor=LinearRegression()
regressor.fit(X_train,Y_train)
print("Model got trained with data")
pred = regressor.predict(X_test)
print(regressor.score(X_test, Y_test))
# Plot the fitted regression line over the range of the feature
X_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
Y_range = regressor.predict(X_range)
plt.scatter(X_train, Y_train)
plt.scatter(X_test, Y_test)
plt.plot(X_range, Y_range)
plt.show()
Difference between Regression and Classification:
o In Regression, the output variable must be of continuous nature or real value; in Classification, the output variable must be a discrete value.
o The task of the regression algorithm is to map the input value (x) with the continuous output variable (y); the task of the classification algorithm is to map the input value (x) with the discrete output variable (y).
o Regression algorithms are used with continuous data; Classification algorithms are used with discrete data.
o Regression algorithms can be used to solve regression problems such as weather prediction and house price prediction; Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, and identification of cancer cells.
o Regression algorithms can be further divided into Linear and Non-linear Regression; Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.
One of the fundamental applications of matrices and vectors in machine learning is the
representation of data. In most machine learning tasks, data is typically organized in a tabular
format, where each row represents an observation and each column represents a feature or
attribute. This tabular structure can be represented as a matrix, where each element
corresponds to a specific value in the dataset.
The fuel of ML models, that is data, needs to be converted into arrays before you can feed it
into your models. The computations performed on these arrays include operations like matrix
multiplication (dot product). This further returns the output that is also represented as a
transformed matrix/tensor of numbers.
For example, consider a dataset of housing prices with features such as the number of
bedrooms, square footage, and location. This dataset can be represented as an m x n matrix,
where m is the number of observations (rows) and n is the number of features (columns). Each
element in the matrix represents the value of a specific feature for a particular observation.
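As a small illustration of this idea, the hypothetical 4 x 3 NumPy matrix below represents four houses (rows) with three features (columns): bedrooms, square footage, and a numeric location code; the numbers and the weight vector are made up for demonstration only.
import numpy as np
# Each row = one observation (house), each column = one feature
# Columns: [bedrooms, square footage, location code]  (hypothetical values)
data = np.array([
    [3, 1500, 1],
    [2,  900, 2],
    [4, 2200, 1],
    [3, 1700, 3],
])
print(data.shape)   # (4, 3) -> m = 4 observations, n = 3 features
print(data[0])      # the feature vector of the first observation
# A simple matrix operation: a dot product with a (hypothetical) weight vector
weights = np.array([10.0, 0.5, 5.0])
print(data @ weights)   # one score per observation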
Linear algebra basically deals with vectors and matrices (different shapes of arrays) and
operations on these arrays. In NumPy, a vector is basically a 1-dimensional array of numbers,
but geometrically it has both magnitude and direction. Our data can be represented using a
vector. In the figure below, one row of this data is represented by a feature vector that has 3
elements or components representing 3 different dimensions. N entries in a vector make it an
n-dimensional vector space; in this case we have 3 dimensions.
In the context of data science, particularly in machine learning, the significance of linear
algebra becomes evident when dealing with datasets that have numerous features, making
visualization and manual judgment challenging. While we can easily visualize and draw lines
in 2 or 3-dimensional Cartesian space, real-world datasets often involve a high-dimensional
space (N-dimensions) that is impractical to visualize. This is where the power of linear algebra
comes into play. It allows us to apply mathematical principles to machine learning models,
enabling the creation of decision boundaries or planes in N-dimensional space for accurate
data classification and analysis.
Data representation includes transforming data into vectors and matrices, which are structured
mathematical objects that can be manipulated to perform operations like addition,
multiplication, and transformation.
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
Suppose there are two categories, Category A and Category B, and we have a new data
point x1; in which of these categories will it lie? To solve this type of problem, we need the
K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular new data point. Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider
the below image:
o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already studied
in geometry. It can be calculated as d = sqrt((x2 - x1)² + (y2 - y1)²).
o By calculating the Euclidean distance, we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
o As we can see, the 3 nearest neighbors are from category A; hence this new data point
must belong to category A. A small sketch of these steps in code follows.
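A minimal NumPy sketch of these steps is shown below; the training points, their labels and the new point x_new are hypothetical, and the sketch assumes k = 5, Euclidean distance and a simple majority vote as described above.
import numpy as np
from collections import Counter
# Hypothetical training data: 2-D points labelled 'A' or 'B'
X_train = np.array([[1, 2], [2, 3], [3, 3], [2, 1], [6, 7], [7, 8], [8, 8], [7, 6]])
y_train = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'])
x_new = np.array([3, 4])   # the new data point to classify
k = 5
# Step 2: Euclidean distance from the new point to every training point
distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
# Step 3: indices of the k nearest neighbours
nearest = np.argsort(distances)[:k]
# Step 4: assign the majority class among those neighbours
print("Predicted category:", Counter(y_train[nearest]).most_common(1)[0][0])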
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K = 1 or K = 2, can be noisy and lead to the effects of
outliers in the model.
o Large values for K are good, but very large values may make the boundaries between
classes less distinct.
Advantages of the KNN Algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of the KNN Algorithm:
o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the data
points for all the training samples.
KNN implementation in Python (Iris dataset):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets (30% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train the KNN classifier with 3 neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Make predictions
y_pred = knn.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
Explanation:
1. Loading the Dataset: The Iris dataset is a popular dataset used for classification
problems. It consists of 150 samples from three species of Iris flowers, with four
features (sepal length, sepal width, petal length, and petal width).
2. Data Splitting: The data is split into training and testing sets using train_test_split.
The test_size=0.3 indicates that 30% of the data will be used for testing.
3. Feature Scaling: KNN is a distance-based algorithm, so it's important to scale the
features to ensure that all features contribute equally to the distance calculation.
4. Training the Model: The KNeighborsClassifier is initialized with n_neighbors=3,
which means the model will consider the 3 nearest neighbors for classification.
5. Making Predictions: The model predicts the class labels for the test set.
6. Evaluation: The model's performance is evaluated using a confusion matrix,
classification report, and accuracy score.
KNN with Numerical Examples
KNN is one of the simplest machine learning algorithms; it can be used for both classification
and regression problems but is mainly used for classification.
In KNN classification, the output is a class membership. A given data point is classified
based on the majority class of its neighbors: the data point is assigned to the most
frequent class among its k nearest neighbors. Usually k is a small positive integer. If k = 1,
then the data point is simply assigned to the class of its single nearest neighbor.
In KNN regression, the output is simply some property value for the object. This value is
the average of the values of the k nearest neighbors.
Lazy learning algorithm - KNN is a lazy learning algorithm because it does not have a
specialized training phase; it uses all of the training data at classification time.
KNN Intuition
The KNN algorithm intuition is very simple to understand. It simply calculates the distance
between a sample data point and all the other training data points. The distance can be
Euclidean distance or Manhattan distance. Then, it selects the k nearest data points where k
can be any integer. Finally, it assigns the sample data point to the class to which the majority
of the k data points belong.
In KNN, "K" is the number of nearest neighbors, and we mostly opt for an odd value of K
because it helps to decide the majority class without ties.
In the above graph we can see there are two categories of data, and we need to classify the
new data point based on its nearest neighbors.
Distance Measurement
We generally say that we will use distance to find the nearest neighbours of any query point
Xq, but how is this distance mathematically measured between Xq and the other points?
Without a distance measure, we cannot conclude whether a point is nearest or not.
In a theoretical manner, we can say that a distance measure is an objective score that
summarizes the difference between two objects in a specific domain. There are several types
of distance measure techniques, but we only use some of them, listed below:
1. Euclidean distance
We mostly use this distance measurement technique to find the straight-line distance between
two points. It is generally used to find the distance between two real-valued vectors.
2. Manhattan distance
This distance is also known as taxicab distance or city block distance, that is because the way
this distance is calculated. The distance between two points is the sum of the absolute
differences of their Cartesian coordinates.
3. Minkowski distance
It is a metric intended for real-valued vector spaces. We can calculate Minkowski distance
only in a normed vector space, which means in a space where distances can be represented as
a vector that has a length and the lengths cannot be negative.
There are a few conditions that a distance metric must satisfy: non-negativity, identity (the
distance from a point to itself is zero), symmetry, and the triangle inequality.
The generalized formula for the Minkowski distance is D(x, y) = (Σ |xi - yi|^p)^(1/p), and we
can manipulate it to get different distance metrics.
The p value in the formula can be changed to give us different distances: p = 1 gives the
Manhattan distance and p = 2 gives the Euclidean distance.
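A minimal sketch of these three measures with NumPy and SciPy, using two hypothetical vectors; note that p = 1 and p = 2 in the Minkowski call reproduce the Manhattan and Euclidean results.
import numpy as np
from scipy.spatial import distance
a = np.array([2.0, 4.0, 6.0])   # hypothetical vectors, for illustration only
b = np.array([5.0, 8.0, 6.0])
print("Euclidean:", distance.euclidean(a, b))            # sqrt(3² + 4² + 0²) = 5.0
print("Manhattan:", distance.cityblock(a, b))            # |3| + |4| + |0| = 7.0
print("Minkowski p=1:", distance.minkowski(a, b, p=1))   # same as Manhattan
print("Minkowski p=2:", distance.minkowski(a, b, p=2))   # same as Euclidean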
Q. From the given dataset, determine whether the point (x, y) = (57, 170) belongs to the
Underweight or Normal Weight class.
Solution:
In this approach we use the Euclidean distance formula, with n (number of records) = 9 and
assuming a K value of 3.
Using the Euclidean distance d = sqrt((x2 - x1)² + (y2 - y1)²) between the query point and
each training record gives:
d2 = 13
d3 = 13.4
d4 = 7.6
d5 = 8.2
d7 = 1.414
d8 = 3
d9 = 2
Among these, the three smallest distances are d7 = 1.414, d9 = 2 and d8 = 3, so with K = 3
the class of (57, 170) is taken as the majority class among those three nearest records.
Finding the K Value:
The most important step in KNN is to find an optimal value of K, i.e., how many neighbours
to consider. A larger value of K reduces the effect of noise on the classification, but it also
makes the boundaries between classes less distinct.
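One common way to pick K in practice is to try a range of values, record the validation error for each, and choose a K in the flat low-error region. The sketch below uses a synthetic dataset generated by scikit-learn purely for illustration; it is not the weight/height data of the worked example above.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Synthetic two-class data, used only to illustrate the procedure
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
errors = []
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors.append(1 - knn.score(X_te, y_te))   # misclassification rate on held-out data
plt.plot(range(1, 26), errors, marker='o')
plt.xlabel('K (number of neighbours)')
plt.ylabel('Error rate')
plt.show()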
Example 2:
Failure Cases of K-NN:
The above image shows a jumbled dataset in which there is no useful structure. In such a
situation, the algorithm may fail.
Distance Measures in K-NN:
There are mainly four distance measures in Machine Learning, listed below:
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Cosine Distance
Euclidean Distance
The Euclidean distance between two points in either the plane or 3-dimensional space
measures the length of a segment connecting the two points. It is the most obvious way of
representing distance between two points. Euclidean distance marks the shortest route of the
two points.
The Pythagorean Theorem can be used to calculate the distance between two points, as shown
in the figure below. If the points (x1, y1) and (x2, y2) are in 2-dimensional space, then the
Euclidean distance between them is d = sqrt((x2 - x1)² + (y2 - y1)²).
Euclidean distance is also called the L2 norm of a vector.
The Manhattan distance between two vectors (city blocks) is equal to the one-norm of the
distance between the vectors. The distance function (also called a “metric”) involved is also
called the “taxi cab” metric.
In the L2 norm we take the sum of the squares of the differences between the elements of the
vectors; in the L1 norm we take the sum of the absolute differences between the elements of
the vectors.
The Manhattan distance between two points (x1, y1) and (x2, y2) is:
|x1 - x2| + |y1 - y2|
The Manhattan distance from the origin is |x1| + |y1|.
Minkowski Distance
Minkowski distance is a metric in a normed vector space. It is used to measure the distance
similarity between vectors: given two or more vectors, it tells us how far apart they are.
Cosine similarity measures the similarity between two vectors of an inner product space. It is
measured by the cosine of the angle between two vectors and determines whether two vectors
are pointing in roughly the same direction. It is often used to measure document similarity in
text analysis.
The cosine similarity is advantageous because even if the two similar documents are far apart
by the Euclidean distance (due to the size of the document), chances are they may still be
oriented closer together. The smaller the angle, the higher the cosine similarity.
The relation between cosine similarity and cosine distance can be defined as:
cosine distance = 1 - cosine similarity.
Cosine similarity says that to find the similarity between two points or vectors we need to
find the angle between them, where θ is the angle between x1 and x2. When x1 and x2 are
very similar, cosine similarity(x1, x2) is close to 1; when they are very dissimilar, cosine
similarity(x1, x2) is close to -1.
Cosine distance uses the angle between two points, whereas Euclidean distance uses the
geometric distance between them.
Example Demonstration
Suppose two short sentences are represented as word-count vectors and the formula above is
applied. The cosine similarity stays the same even if sentence 2 is "hello hello hello" rather
than a single "hello", because repeating a word only scales the vector without changing its
direction: the Euclidean distance grows to 2.2361, while the cosine similarity remains 0.7071.
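The sketch below reconstructs this demonstration under an assumed encoding: each sentence is turned into a word-count vector over the vocabulary ["hello", "world"], so "hello world" becomes (1, 1), "hello" becomes (1, 0) and "hello hello hello" becomes (3, 0). The exact sentences are an assumption, but the resulting numbers match the ones quoted above.
import numpy as np
def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
s1 = np.array([1, 1])   # "hello world"  (assumed word-count vector)
s2 = np.array([1, 0])   # "hello"
s3 = np.array([3, 0])   # "hello hello hello"
print(cosine_similarity(s1, s2))   # ~0.7071
print(cosine_similarity(s1, s3))   # ~0.7071 -> unchanged by repeating "hello"
print(np.linalg.norm(s1 - s2))     # 1.0
print(np.linalg.norm(s1 - s3))     # ~2.2361 -> Euclidean distance changes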
Use Cases:
1. Text Classification: In text classification tasks like spam detection, sentiment analysis, or
topic modeling, cosine similarity can be used to compare the similarity between a document
and predefined categories.
2. Information Retrieval: Search engines use cosine similarity to match user queries with
relevant documents by comparing the query vector to document vectors.
3. Clustering: Cosine similarity aids in clustering similar data points together. For example,
it is used in grouping similar news articles or social media posts.
Disadvantages:
1. Sparse Data Issues: In high-dimensional spaces, where data is often sparse, cosine
similarity can be less reliable. It might not accurately reflect the true similarity between data
points.
2. Negative Associations: Cosine similarity produces values between -1 and 1, but it does not
handle negative associations well, which might be relevant in certain applications.
The K-NN classifier mainly depends on two choices:
1. The distance function or distance metric used to determine the nearest neighbors.
2. The decision rule used to derive a classification from the K nearest neighbors.
When K = 1, the decision surface has sharp edges and non-smooth curves, and the classifier
makes no mistakes on the training data.
When K = 5, the decision surface is a smoother curve, and the classifier makes a few small
mistakes.
We have experimented with the K-NN decision surface on some toy datasets, shown in the
images below.
Looking at the images, we observe that as K increases, the decision surface becomes
smoother.
Overfitting and Underfitting are the two main problems that occur in machine learning and
degrade the performance of the machine learning models.
Before understanding overfitting and underfitting, let's understand some basic terms that
will help us understand this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying
the machine learning algorithms. Or it is the difference between the predicted values
and the actual values.
o Variance: If the machine learning model performs well with the training dataset, but
does not perform well with the test dataset, then variance occurs.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more
than the required data points, present in the given dataset. Because of this, the model starts
capturing the noise and inaccurate values present in the dataset, and all these factors reduce
the efficiency and accuracy of the model. The overfitted model has low bias and high variance.
The chances of overfitting increase the more we train our model: the more we train, the higher
the chance of producing an overfitted model.
Example: The concept of overfitting can be understood from the below graph of a linear
regression output:
As we can see from the above graph, the model tries to cover all the data points present in the
scatter plot. It may look efficient, but in reality it is not. The goal of the regression model is
to find the best-fit line, but here we have not obtained a true best fit, so the model will generate
prediction errors.
Both overfitting and underfitting degrade the performance of the machine learning model, but
the more common problem is overfitting, so there are some ways by which we can reduce its
occurrence (a short regularization sketch follows the list):
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
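As one illustration of the regularization idea from the list above, the sketch below compares an unregularized high-degree polynomial regression with Ridge regression (L2 regularization) on hypothetical noisy data; the dataset, the polynomial degree and the alpha value are assumptions chosen only to show the effect.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
# Hypothetical noisy data for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
# A high-degree polynomial fit with no penalty tends to overfit;
# Ridge penalizes large coefficients, which lowers variance
plain = make_pipeline(PolynomialFeatures(degree=12), LinearRegression()).fit(X_tr, y_tr)
ridge = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=1.0)).fit(X_tr, y_tr)
print("Unregularized train R2:", plain.score(X_tr, y_tr), " test R2:", plain.score(X_te, y_te))
print("Ridge         train R2:", ridge.score(X_tr, y_tr), " test R2:", ridge.score(X_te, y_te))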
Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying
trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early
stage, but as a result the model may not learn enough from the training data and may fail to
find the best fit of the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and
hence it reduces the accuracy and produces unreliable predictions.
Example: We can understand underfitting using the below output of a linear regression
model:
As we can see from the above diagram, the model is unable to capture the trend of the data
points present in the plot.
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine learning
models to achieve the goodness of fit. In statistics modeling, it defines how closely the result
or predicted values match the true values of the dataset.
The model with a good fit is between the underfitted and overfitted model, and ideally, it
makes predictions with 0 errors, but in practice, it is difficult to achieve it.
As we train our model over time, the errors on the training data go down, and the same
initially happens with the test data. But if we train the model for too long, its performance
may decrease due to overfitting, because the model also learns the noise present in the
dataset. The errors on the test dataset then start increasing, so the point just before the errors
start rising is the sweet spot, and we can stop there to achieve a good model.
The process of finding the right function for the given training data (Dtrain) is called fitting.
When k = 1, the model overfits the data because it makes no errors on the training points.
When k = n, the model underfits the data because it makes more errors; it considers every
query point to belong to the majority class.
The balance between overfitting and underfitting is a well-fit model; it makes some errors
because machine learning is not perfect, and it is fine to make small mistakes.
But how can we be sure whether our model is underfitting or overfitting?
We want our model to have maximum accuracy, or minimum error, on the cross-validation
dataset.
Training Error: Training errors occur when a trained model returns errors after running it on
the training data again; it starts returning wrong results. There is one logical assumption here,
by the way: your training set will not include the same training samples belonging to different
classes, i.e., conflicting information. Some real-world datasets might have this property, though.
Cross-validation Error: the error measured on held-out data, used when choosing the best K
by cross-validation.
When the train error is low and the validation error is high, we face the problem of overfitting,
as shown in the above image.
When the train error is high and the validation error is also high, we face the problem of
underfitting, as shown in the above image.
We choose our model where the train error and the validation error are both small and close
to each other, as shown in the above image (best fit).
Cross-validation simply makes some of the data "seen" by the function and keeps some of the
data "unseen" by it.
Cross-validation (CV) is one of the techniques used to test the effectiveness of machine
learning models; it is also a resampling procedure used to evaluate a model when we have
limited data. To perform CV we keep aside a sample/portion of the data that we do not use to
train the model, and later use this sample for testing/validating.
In the simplest (hold-out) approach, we randomly split the complete data into training and test
sets, then perform the model training on the training set and use the test set for validation,
ideally splitting the data 70:30 or 80:20. If our data is huge and the test sample and train
sample have the same distribution, then this approach is acceptable.
Python code for the hold-out approach:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df=pd.read_csv('student_scores.csv')
X=df.iloc[:,:-1].values
Y=df.iloc[:,1].values
# Hold-out split: 80% for training, 20% kept aside for validation
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)
regressor=LinearRegression()
regressor.fit(X_train,Y_train)
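Beyond a single hold-out split, k-fold cross-validation averages the score over several train/validation splits. Below is a minimal sketch, assuming the Iris data used earlier in these notes, of using cross_val_score to compare a few candidate K values for KNN.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
# 5-fold cross-validation accuracy for a few candidate values of K
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)   # five train/validation splits
    print("K =", k, "mean CV accuracy =", round(scores.mean(), 3))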
Python Code for Bias and Variance (to check whether it is overfit or underfit or bestfit):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.evaluate import bias_variance_decomp
df=pd.read_csv('student_scores.csv')
X=df.iloc[:,:-1].values
Y=df.iloc[:,1].values
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)
model=LinearRegression()
# Decompose the expected loss into bias and variance over 200 bootstrap rounds
mse,bias,var=bias_variance_decomp(model,X_train,Y_train,X_test,Y_test,loss='mse',num_rounds=200,random_seed=1)
print('MSE:',mse)
print('Bias:',bias)
print('Variance:',var)
Output:
MSE: 25.994907724912167
Bias: 22.412118374516112
Variance: 3.582789350396052
(This model has high bias and low variance, so it is an underfitting model.)
Pros of K Nearest Neighbors
The training step is much faster for nearest neighbors compared to other machine
learning algorithms.
Cons of K Nearest Neighbors
KNN is computationally expensive, as it searches for the nearest neighbors of the new point
at the prediction stage.
There is a high memory requirement, as KNN has to store all the data points.
1. Simplicity
Strengths: KNN is easy to understand and implement. Its simplicity comes from the
fact that it makes decisions based on the similarity between data points (distance
between neighbors).
Weaknesses: KNN doesn't learn a model; it simply memorizes the training data and
makes predictions based on proximity, which might not generalize well in complex
problems.
2. Performance
Strengths: KNN performs well for simple and well-distributed data where class
separation is clear.
Weaknesses: It can struggle with high-dimensional data (the curse of
dimensionality). The distance metric may become less meaningful in many
dimensions, affecting the algorithm's accuracy. Additionally, it is computationally
expensive during inference since it has to calculate distances to every training point
for each prediction.
4. Choice of Hyperparameters
Strengths: KNN has only a few hyperparameters, primarily the number of neighbors
(k) and the distance metric used. It can be tuned easily using cross-validation.
Weaknesses: Choosing an optimal k is crucial. If k is too small, KNN might be
sensitive to noise. If k is too large, it can misclassify based on distant or irrelevant
data points.
6. Choice of Distance Metric
The effectiveness of KNN is directly influenced by the choice of the distance metric
(e.g., Euclidean, Manhattan, Minkowski). For certain types of data, such as text or
categorical variables, specialized distance metrics (e.g., Hamming distance) may be
necessary to ensure good performance.
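For instance, the Hamming distance between two categorical vectors is simply the fraction of positions at which they disagree; the two feature vectors below are hypothetical.
import numpy as np
a = np.array(['red', 'small', 'round', 'smooth'])   # hypothetical categorical features
b = np.array(['red', 'large', 'round', 'rough'])
# Hamming distance = fraction of positions that differ (here 2 of 4 = 0.5)
print(np.mean(a != b))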
Summary of Effectiveness
Good for: Simple datasets, small to medium-sized data, cases where model
interpretability is crucial.
Challenges: High-dimensional data, large datasets, noise, and computational costs.