
"K-Nearest Neighbors (KNN) and Decision Trees in
Machine Learning"

Name – Sayan Das
Roll No – 14400121013
Subject Code – PEC-CS 701E
Subject Name – Machine Learning
Academic Session – 2024-2025
Department – Computer Science and Engineering
College Name – Neotia Institute of Technology, Management & Science (144)

Abstract- This report delves into two fundamental machine learning algorithms: K-Nearest Neighbors (KNN) and Decision Trees. KNN is a simple, instance-based learning algorithm that classifies data based on the proximity of data points in the feature space, utilizing distance metrics like Euclidean distance. Decision Trees, on the other hand, use a tree-like structure to make decisions based on a series of feature-based conditions, splitting data at various nodes to classify or predict outcomes. The report explains the workings of both algorithms with example datasets and mathematical formulas, such as Euclidean distance for KNN and Information Gain for Decision Trees, and provides a comparison of their practical applications. By analyzing the strengths and weaknesses of these algorithms, we explore their use in various domains, including pattern recognition, classification, and regression tasks.

Keywords—K-Nearest Neighbors, Decision Trees, Machine Learning, Classification, Euclidean Distance, Information Gain, Feature Space, Supervised Learning, Algorithm Comparison, Regression.

I. INTRODUCTION
Machine Learning (ML) is a branch of artificial
intelligence (AI) that allows computers to learn and make
decisions without being explicitly programmed. It's all about
using data to train models that can predict or classify new
data. Today, we’ll focus on two popular machine learning
algorithms:

• K-Nearest Neighbors (KNN)
• Decision Trees

These algorithms are easy to understand and are used for both classification and regression.

II. K-NEAREST NEIGHBORS (KNN) - OVERVIEW
A. What is KNN?
KNN is a simple algorithm used to classify data or predict values based on the data around it. It looks at the data points near a new point (called its "neighbors") to decide what class or value the new point should have.
Example: If you have a group of animals and most of the nearby animals are dogs, KNN will likely classify the new animal as a dog.

III. KNN - HOW IT WORKS
A. Step-by-Step Process
• Pick a number for K (the number of nearby points to look at).
• Measure the distance between the new data point and all other points in the dataset.
• Identify the K nearest points (neighbors) to the new point.
1) For Classification: Count how many of the nearest neighbors belong to each class. The new point is classified into the class with the most neighbors.
2) For Regression: Take the average value of the K nearest neighbors to predict the value for the new point.

IV. WHEN DO WE USE THE KNN ALGORITHM?
We can use KNN when:
• Data is labeled.
• Data is noise free.
• The dataset is small.

V. KNN - FORMULA
A. Formula to Calculate Distance
One common way to calculate the distance between points is the Euclidean distance, which measures how far apart two points are in space. The formula for this is:

d(p, q) = √( Σ (qi − pi)² ),  i = 1, …, n

• p and q are two data points.
• n is the number of features (like height, weight, age).
• The formula finds the "straight-line" distance between two points.
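To make the formula concrete, it can be written as a short Python function. This is only an illustrative sketch (the function name euclidean_distance and the sample points are my own, not part of the report):

from math import sqrt

def euclidean_distance(p, q):
    # Straight-line distance between two points with the same number of features.
    return sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

# Example: two points with three features (height, weight, age).
print(euclidean_distance((5.7, 72, 26), (5.8, 70, 25)))  # approximately 2.24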
VI. EXAMPLE FOR KNN (EUCLIDEAN DISTANCE)
Imagine we have a dataset like this:

Height  Weight  Age  Class
5.8     70      25   Healthy
6.0     80      30   Unhealthy
5.5     65      28   Healthy

Now there is a new person with:
Height: 5.7, Weight: 72, Age: 26

We want to classify whether this new person is "Healthy" or "Unhealthy" based on their height, weight, and age using KNN (we will use K = 3 for this example). To do that, let's calculate the Euclidean distance between the new person and each person in the dataset.

Euclidean Distance Formula:
d(p, q) = √( (q1 − p1)² + (q2 − p2)² + (q3 − p3)² )
where p is the new person and q is each person in the dataset.

Distance to Person 1 (5.8, 70, 25):
d = √( (5.8 − 5.7)² + (70 − 72)² + (25 − 26)² )
d = √( 0.01 + 4 + 1 ) = √5.01 ≈ 2.24

Distance to Person 2 (6.0, 80, 30):
d = √( (6.0 − 5.7)² + (80 − 72)² + (30 − 26)² )
d = √( 0.09 + 64 + 16 ) = √80.09 ≈ 8.95

Distance to Person 3 (5.5, 65, 28):
d = √( (5.5 − 5.7)² + (65 − 72)² + (28 − 26)² )
d = √( 0.04 + 49 + 4 ) = √53.04 ≈ 7.28

Results of KNN:
The closest distances are:
1. 2.24 (Person 1 - Healthy)
2. 7.28 (Person 3 - Healthy)
3. 8.95 (Person 2 - Unhealthy)

Since 2 out of 3 neighbors are "Healthy", the new person is classified as Healthy.
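The same worked example can be reproduced in a few lines of Python. This is only an illustrative sketch (the variable names and the use of math.dist and collections.Counter are my own choices, not part of the report), but it follows exactly the steps described above: compute the distances, keep the K = 3 nearest neighbors, and take a majority vote.

from math import dist
from collections import Counter

dataset = [
    ((5.8, 70, 25), "Healthy"),
    ((6.0, 80, 30), "Unhealthy"),
    ((5.5, 65, 28), "Healthy"),
]
new_person = (5.7, 72, 26)
k = 3

# Sort the known points by Euclidean distance to the new point and keep the K nearest.
neighbors = sorted(dataset, key=lambda item: dist(item[0], new_person))[:k]

# Majority vote among the neighbors decides the class.
votes = Counter(label for _, label in neighbors)
print(votes.most_common(1)[0][0])  # Healthy (2 of the 3 neighbors are Healthy)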
VII. APPLICATIONS OF KNN
A. Image recognition
Classifying images based on their pixel patterns.

B. Recommender systems
Suggesting products or content based on user
preferences and past behavior.
C. Pattern recognition
Identifying patterns in data sets, such as finding similar customers in a database.

D. Credit scoring
Assessing the creditworthiness of individuals based on their financial history.

VIII. KNN - ADVANTAGES & DISADVANTAGES
A. Advantages
• Very simple to understand and implement.
• No need for a training phase, so it can be used on-the-fly.
• Works well with small datasets.
B. Disadvantages
• Becomes slow with large datasets because it needs to calculate distances for every point.
• Sensitive to noisy data (irrelevant points can affect the outcome).

IX. DECISION TREES - OVERVIEW
A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a flowchart-like structure where each internal node represents a feature (attribute) of the data, each branch represents a possible value of that feature, and each leaf node represents a decision or prediction.

Let's assume we want to play badminton on a particular day, say Saturday. How will you decide whether to play or not? You go out and check if it's hot or cold, check the speed of the wind and the humidity, and see how the weather is, i.e. whether it is sunny, cloudy, or rainy. You take all these factors into account to decide if you want to play or not.

Fig. A decision tree for the concept Play Badminton.

X. DECISION TREE - HOW IT WORKS
A. Step-by-Step Process
• Start with all the data and consider all the features (like weather, temperature).
• Pick the best feature to split the data based on how much information it gives you (more on this in the next section).
• Create a node and divide the data based on the feature's values.
• Repeat this process for each child node until you cannot split anymore (a leaf is reached).

XI. DECISION TREE - FORMULA (INFORMATION GAIN)
A. Entropy and Information Gain
A decision tree uses entropy to measure uncertainty in a dataset. The formula for entropy is:

H(S) = − Σ pi log2(pi),  i = 1, …, c

where pi is the probability of class i and c is the number of classes.

Information Gain measures how much a feature A helps to separate the data:

Gain(S, A) = H(S) − Σ ( |Sv| / |S| ) · H(Sv),  summed over each value v of feature A

This helps the tree decide which feature to split on.
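The two formulas above can be written as small Python helper functions. The sketch below is illustrative only (the function names and the data layout are my own assumptions, not taken from the report): entropy() implements H(S), and information_gain() implements Gain(S, A) for a dataset stored as rows of feature values with a parallel list of class labels.

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum over classes of p_i * log2(p_i)
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    # Gain(S, A) = H(S) - sum over values v of A of (|S_v| / |S|) * H(S_v)
    total = len(labels)
    gain = entropy(labels)
    for value in set(row[feature_index] for row in rows):
        subset = [label for row, label in zip(rows, labels)
                  if row[feature_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain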
XII. EXAMPLE FOR DECISION TREE (INFORMATION GAIN)
Now, let's use a decision tree example. We have a dataset to decide whether to play tennis:

Outlook   Temp  Humidity  Wind    Play Tennis
Sunny     Hot   High      Weak    No
Overcast  Mild  Normal    Strong  Yes
Rainy     Cool  High      Strong  No

Step 1: Calculate Entropy for Play Tennis
We start by calculating the entropy of the target variable ("Play Tennis"):

H(S) = − Σ pi log2(pi)

where pi is the probability of each class.
• Probability of Yes = 1/3
• Probability of No = 2/3

H(S) = −( (1/3) log2(1/3) + (2/3) log2(2/3) )
H(S) = −( 0.333 × (−1.585) + 0.667 × (−0.585) ) ≈ 0.918

Step 2: Calculate Information Gain for "Outlook"
We now calculate the information gain for the Outlook feature.
• When Outlook is Sunny: 1 "No", 0 "Yes" → Entropy = 0
• When Outlook is Overcast: 0 "No", 1 "Yes" → Entropy = 0
• When Outlook is Rainy: 1 "No", 0 "Yes" → Entropy = 0

Since every Outlook value leads to a pure subset, the weighted child entropy is 0 and the information gain for Outlook equals H(S) ≈ 0.918, the maximum possible.
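As a quick check of the numbers above, the same calculation can be done directly in Python (an illustrative snippet, not part of the report):

from math import log2

# Entropy of the target: 1 "Yes" and 2 "No" out of 3 rows.
h_s = -((1/3) * log2(1/3) + (2/3) * log2(2/3))
print(round(h_s, 3))           # 0.918

# Every Outlook value (Sunny, Overcast, Rainy) covers exactly one row, so each
# child entropy is 0 and the weighted sum of child entropies vanishes.
gain_outlook = h_s - 0
print(round(gain_outlook, 3))  # 0.918, i.e. the gain equals H(S)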
XIII. DECISION TREE APPLICATIONS
A. Fraud detection
Identifying fraudulent transactions by analyzing patterns in financial data.

B. Medical diagnosis
Predicting diseases based on symptoms and medical history.

C. Loan approval
Assessing loan applications based on credit history and income.

D. Customer segmentation
Grouping customers based on their demographics, buying habits, and other characteristics for targeted marketing campaigns.

XIV. DECISION TREE - ADVANTAGES & DISADVANTAGES
A. Advantages
• Easy to visualize and interpret.
• Can handle both numerical and categorical data.
• Doesn't need much data preparation.
B. Disadvantages
• Can easily become too complex and overfit the data (making it perform worse on new data).
• Sensitive to small changes in the data.

XV. KNN VS. DECISION TREES
A. KNN
1) Type: Instance-based (it remembers the data).
2) Learning Type: Lazy learning (no training phase).
3) Cons: Slow with large datasets.
B. Decision Trees
1) Type: Model-based (it builds a model to predict).
2) Learning Type: Eager learning (it needs a training phase).
3) Cons: Prone to overfitting.
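The contrast in the list above can also be seen in a few lines of code. The sketch below uses scikit-learn (an assumption on my part; the report does not name any library) and reuses the tiny health dataset from the KNN example purely as a toy illustration.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy feature matrix (height, weight, age) and labels from the earlier KNN example.
X = [[5.8, 70, 25], [6.0, 80, 30], [5.5, 65, 28]]
y = ["Healthy", "Unhealthy", "Healthy"]

knn = KNeighborsClassifier(n_neighbors=3)      # lazy learner: it just stores the data
tree = DecisionTreeClassifier(random_state=0)  # eager learner: it builds a model up front

knn.fit(X, y)
tree.fit(X, y)

new_person = [[5.7, 72, 26]]
print(knn.predict(new_person))   # KNN votes among the stored neighbors at query time
print(tree.predict(new_person))  # the tree follows the split rules it learned during fit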
CONCLUSION
In conclusion, both K-Nearest Neighbors (KNN) and Decision Trees are essential algorithms in machine learning, offering intuitive and effective methods for classification and regression tasks. KNN leverages the proximity of data points in a feature space, making it suitable for problems where similar data points share similar outcomes. Decision Trees, on the other hand, split data recursively based on feature values, allowing for clear, rule-based predictions that are easily interpretable.

While KNN is computationally simple, it becomes resource-intensive for large datasets, and Decision Trees can sometimes overfit without proper tuning. Each algorithm has its strengths: KNN works well when the data is evenly distributed, and Decision Trees excel when interpretability is crucial. Understanding the trade-offs between these algorithms helps in choosing the right approach for a specific problem. Both remain popular and widely used in fields like finance, healthcare, and marketing due to their versatility and performance.

ACKNOWLEDGMENT
I would like to express my sincere gratitude to all those who have supported and contributed to the completion of this report on "K-Nearest Neighbors (KNN) and Decision Trees in Machine Learning".

I extend my heartfelt thanks to my mentor and educator, Suman Halder sir, who provided guidance and insights throughout the research and writing process. His valuable input has greatly enriched the content and structure of this report.

I also want to acknowledge the resources, textbooks, and academic materials that have served as essential references, allowing me to delve into the subject matter and present accurate information.

REFERENCES
[1] J. R. Quinlan, "Induction of Decision Trees," Mach. Learn., vol. 1, no. 1, pp. 81–106, 1986.
[2] T. Cover and P. Hart, "Nearest Neighbor Pattern Classification," IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, 1967.
[3] "K-Nearest Neighbors (KNN) Classification with R Tutorial." [Online]. Available: https://www.datacamp.com/tutorial/k-nearest-neighbors-knn-classification-with-r-tutorial. [Accessed: Sep. 13, 2024].
[4] "Decision Tree Tutorial & Notes." [Online]. Available: https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/ml-decision-tree/tutorial/. [Accessed: Sep. 13, 2024].
