KNN v2
Classification
Agenda
KNN Classification Algorithm
Solving Business Problems using KNN Algorithm
Hands-on
Sample Business Problem
Let’s assume a money lending company “XYZ”, like UpStart,
IndiaLends, etc.
The money lending company XYZ is interested in making its
lending system comfortable and safe for lenders as well as for
borrowers. The company holds a database of customer details.
Using a customer’s detailed information from the database, it
calculates a credit score (a discrete value) for each customer.
The calculated credit score helps the company and lenders
clearly understand the credibility of a customer.
So they can simply decide whether or not they should lend
money to a particular customer.
Sample Business Problem
The customer’s details could be:
Educational background details
Highest graduated degree
Cumulative grade points average (CGPA) or marks percentage
The reputation of the college
Consistency across earlier degrees
Cleared education loan dues
Employment details
Salary
Years of experience
Whether they received any onsite opportunities
Average job change duration
Sample Business Problem
The company (XYZ) uses these kinds of details to calculate the
credit score of a customer
The process of calculating the credit score from the customer’s
details is expensive
To reduce the cost of predicting credit scores, the company
realized that customers with similar background details receive
similar credit scores
So, they decided to use the already available customer data and
predict a new customer’s credit score by comparing it with
similar records
These kinds of problems, which require finding similar
customers, are handled by the K-nearest neighbor classifier
Introduction
The K-nearest neighbor classifier is one of the introductory supervised
classifiers that every data science learner should be aware of
Fix & Hodges proposed the K-nearest neighbor classifier algorithm in 1951
for performing pattern classification tasks
For simplicity, this classifier is called the KNN classifier
KNN addresses pattern recognition problems and is also a good
choice for many classification tasks
The simplest version of the K-nearest neighbor classifier predicts
the target label from the class of the nearest neighbor
The closest class is identified using a distance measure such as
Euclidean distance
K-Nearest Neighbour Algorithm
Given an unlabeled example E:
• Calculate the distance between E and all examples in the training set
• Select the K nearest examples to E in the training set
• Assign E to the most common class among its K nearest neighbors
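The three steps above can be sketched in Python using only the standard library. The data layout (a list of (features, label) pairs) is an illustrative assumption, not part of the slide:

```python
from collections import Counter
import math

def knn_classify(E, training_set, k):
    """Assign E to the most common class among its K nearest neighbors."""
    # 1. Distance between E and every training example (features, label)
    distances = [(math.dist(E, features), label)
                 for features, label in training_set]
    # 2. Select the K nearest examples to E
    nearest = sorted(distances)[:k]
    # 3. Majority vote among those K neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For example, `knn_classify((0.58, 0.25, 0.5), training_set, 3)` would classify a new customer from three nearest records.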
Distance Between Neighbors
Jay: Age = 35, Income = 95K, No. of credit cards = 3
Rina: Age = 41, Income = 215K, No. of credit cards = 2
• The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is
defined as:
D(X, Y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )
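The formula can be written out directly; Python's standard-library `math.dist` computes the same quantity. Applied to Jay and Rina's raw attributes, the income difference dominates the sum, which is why the variables are normalized in the later slides:

```python
import math

def euclidean(X, Y):
    # D(X, Y) = sqrt( sum_i (x_i - y_i)^2 )
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(X, Y)))

# Jay = (35, 95, 3), Rina = (41, 215, 2): the 120-unit income gap
# dwarfs the age and credit-card differences.
d = euclidean((35, 95, 3), (41, 215, 2))
```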
Customer | Age | Income | No. credit cards | Response
Jay      | 35  | 35K    | 3                | No
Hema     | 63  | 200K   | 1                | No
Tommy    | 59  | 170K   | 1                | No
Dravid   | 37  | 50K    | 2                | ?
K-Nearest Neighbours: Example
Dravid | 37 | 50K | 2 | ?
K-Nearest Neighbours
Jay: Age = 35, Income = 95K, No. of credit cards = 3
Rina: Age = 41, Income = 215K, No. of credit cards = 2
Example: Income
Highest income = 200K
Dravid’s income is normalized to 50/200, Rina’s income is normalized to
50/200, etc.
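The divide-by-maximum scaling described above can be sketched as a minimal helper, assuming purely numeric columns:

```python
# Divide-by-maximum scaling as in the slide: each attribute value is
# divided by the largest value of that attribute in the data set, so
# every scaled value falls in (0, 1].
def normalize_columns(rows):
    """rows: sequence of equal-length numeric tuples -> scaled lists."""
    maxima = [max(column) for column in zip(*rows)]
    return [[value / m for value, m in zip(row, maxima)] for row in rows]
```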
K-Nearest Neighbours
Normalization of Variables
Customer | Age           | Income          | No. credit cards | Response
Jay      | 35/63 = 0.56  | 35/200 = 0.175  | 3/4 = 0.75       | No
Rina     | 22/63 = 0.34  | 50/200 = 0.25   | 2/4 = 0.5        | Yes
Hema     | 63/63 = 1     | 200/200 = 1     | 1/4 = 0.25       | No
Tommy    | 59/63 = 0.93  | 170/200 = 0.85  | 1/4 = 0.25       | No
Neil     | 25/63 = 0.39  | 40/200 = 0.2    | 4/4 = 1          | Yes
Dravid   | 37/63 = 0.58  | 50/200 = 0.25   | 2/4 = 0.5        | Yes
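Using these normalized values (taking Jay's age as 35/63 and Tommy's income as 170/200, consistent with the raw table), a 3-nearest-neighbour vote for Dravid can be sketched as:

```python
import math
from collections import Counter

# Normalized (age, income, cards) triples and known responses.
neighbors = {
    "Jay":   ((35/63, 35/200, 3/4), "No"),
    "Rina":  ((22/63, 50/200, 2/4), "Yes"),
    "Hema":  ((63/63, 200/200, 1/4), "No"),
    "Tommy": ((59/63, 170/200, 1/4), "No"),
    "Neil":  ((25/63, 40/200, 4/4), "Yes"),
}
dravid = (37/63, 50/200, 2/4)

# Rank customers by Euclidean distance to Dravid and vote with the top 3.
ranked = sorted(neighbors.items(), key=lambda kv: math.dist(dravid, kv[1][0]))
top3 = ranked[:3]
vote = Counter(label for _, (_, label) in top3).most_common(1)[0][0]
# The three closest are Rina (Yes), Jay (No) and Neil (Yes) -> vote "Yes".
```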
K-Nearest Neighbor
Example: Married
Customer | Married | Income | No. credit cards | Response
Jay      | Yes     | 35K    | 3                | No
Rina     | No      | 50K    | 2                | Yes
Hema     | No      | 200K   | 1                | No
Tommy    | Yes     | 170K   | 1                | No
Neil     | No      | 40K    | 4                | Yes
Dravid   | Yes     | 50K    | 2                | Yes
Non-Numeric Data
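One common way to use a non-numeric attribute such as Married in a distance computation (an illustrative convention, not prescribed by the slide) is to score matching categories as distance 0 and differing ones as 1, while numeric attributes keep their squared differences:

```python
import math

def mixed_distance(a, b):
    """Distance over records mixing categorical (str) and numeric fields."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str):   # categorical: match -> 0, differ -> 1
            total += 0.0 if x == y else 1.0
        else:                    # numeric: squared difference as usual
            total += (x - y) ** 2
    return math.sqrt(total)
```

For example, `mixed_distance(("Yes", 0.25, 0.5), ("No", 0.25, 0.5))` returns 1.0: the records differ only in marital status.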
1. Euclidean Distance
In two dimensions, the distance between points (x1, x2) and (y1, y2):
Example: X = (5, 5), Y = (9, 8)
D(X, Y) = √((9 − 5)² + (8 − 5)²) = √(16 + 9) = √25 = 5
In three dimensions (the L2 norm): X = (5, 5, 7), Y = (9, 8, 3)
D(X, Y) = √((9 − 5)² + (8 − 5)² + (3 − 7)²) = √(16 + 9 + 16) = √41 ≈ 6.4
k-NN Variations
• Value of k
– Larger k increases confidence in the prediction
– Note that if k is too large, the decision may be skewed
toward the majority class
• Weighted evaluation of nearest neighbors
– A plain majority vote may unfairly skew the decision
– Revise the algorithm so that closer neighbors have
greater “vote weight”
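A distance-weighted vote along these lines might look like the following sketch, where each neighbor votes with weight 1/distance; the epsilon term is an assumed guard against division by zero on exact matches:

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-9):
    """neighbors: list of (distance, label) for the K nearest examples."""
    weights = defaultdict(float)
    for d, label in neighbors:
        # Closer neighbors contribute larger vote weight.
        weights[label] += 1.0 / (d + eps)
    return max(weights, key=weights.get)
```

With neighbors [(0.1, "Yes"), (2.0, "No"), (3.0, "No")], a plain majority says "No", but the weighted vote says "Yes" because the closest neighbor dominates.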
How to Choose "K"?
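One common heuristic, sketched below under the assumption of a held-out validation set (the slide itself does not prescribe a method), is to try several odd values of K and keep the one with the best validation accuracy:

```python
import math
from collections import Counter

def knn_predict(E, train, k):
    """Majority class among the k training examples nearest to E."""
    nearest = sorted(train, key=lambda ex: math.dist(E, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, validation, k):
    hits = sum(knn_predict(f, train, k) == y for f, y in validation)
    return hits / len(validation)

def choose_k(train, validation, candidates=(1, 3, 5, 7, 9)):
    # Odd candidate values avoid ties in two-class voting.
    return max(candidates, key=lambda k: accuracy(train, validation, k))
```

In practice, cross-validation over the training data is often used instead of a single validation split.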
Summary
KNN classification algorithm
Different distance measures
KNN algorithm
Advantages and disadvantages
Case study 1 (using KNN )
Thanks!