KNN v2
Classification
Agenda
KNN Classification Algorithm
Solving Business Problems using KNN Algorithm
Hands-on
Sample Business Problem
Let’s assume a money lending company “XYZ”, like UpStart,
IndiaLends, etc.
The money lending company XYZ is interested in making its
lending system comfortable and safe for lenders as well as for
borrowers. The company holds a database of customer details.
Using a customer’s detailed information from the database, it
calculates a credit score (a discrete value) for each customer.
The calculated credit score helps the company and lenders
clearly understand the credibility of a customer.
So they can simply decide whether or not they should lend
money to a particular customer.
Sample Business Problem
The customer’s details could be:
Educational background details
Highest graduated degree
Cumulative grade points average (CGPA) or marks percentage
The reputation of the college
Consistency across earlier degrees
Cleared education loan dues
Employment details
Salary
Years of experience
Whether they received any onsite opportunities
Average job change duration
Sample Business Problem
The company (XYZ) uses these kinds of details to calculate the
credit score of a customer
The process of calculating the credit score from the customer’s
details is expensive
To reduce the cost of predicting credit scores, the company
realized that customers with similar background details receive
similar credit scores
So, they decided to use the already available customer data and
predict a new customer’s credit score by comparing it with
similar records
These kinds of problems, which require finding similar
customers, are handled by the K-nearest neighbor classifier
Introduction
The K-nearest neighbor classifier is one of the introductory supervised
classifiers that every data science learner should be aware of
Fix & Hodges proposed the K-nearest neighbor classifier algorithm in 1951
for performing pattern classification tasks
For simplicity, this classifier is called the KNN classifier
KNN addresses pattern recognition problems and is also a good
choice for many classification tasks
The simplest version of the K-nearest neighbor classifier predicts
the target label from the class of the nearest neighbor
The closest class is identified using a distance measure such as
Euclidean distance
K-Nearest Neighbour Algorithm
Given an unlabeled example E:
• Calculate the distance between E and all examples in the training set
• Select the K nearest examples to E in the training set
• Assign E to the most common class among its K nearest neighbors
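The three steps above can be sketched in Python using only the standard library. The data layout (a list of (features, label) pairs) is an illustrative assumption, not part of the slide:

```python
from collections import Counter
import math

def knn_classify(E, training_set, k):
    """Assign E to the most common class among its K nearest neighbors."""
    # 1. Distance between E and every training example (features, label)
    distances = [(math.dist(E, features), label)
                 for features, label in training_set]
    # 2. Select the K nearest examples to E
    nearest = sorted(distances)[:k]
    # 3. Majority vote among those K neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For example, `knn_classify((0.58, 0.25, 0.5), training_set, 3)` would classify a new customer from three nearest records.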
Distance Between Neighbors
Jay: Age = 35, Income = 95K, No. of credit cards = 3
Rina: Age = 41, Income = 215K, No. of credit cards = 2
• The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is
defined as:
D(X, Y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )
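The formula can be written out directly; Python's standard-library `math.dist` computes the same quantity. Applied to Jay and Rina's raw attributes, the income difference dominates the sum, which is why the variables are normalized in the later slides:

```python
import math

def euclidean(X, Y):
    # D(X, Y) = sqrt( sum_i (x_i - y_i)^2 )
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(X, Y)))

# Jay = (35, 95, 3), Rina = (41, 215, 2): the 120-unit income gap
# dwarfs the age and credit-card differences.
d = euclidean((35, 95, 3), (41, 215, 2))
```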
Customer | Age | Income | No. credit cards | Response
Jay      | 35  | 35K    | 3                | No
Hema     | 63  | 200K   | 1                | No
Tommy    | 59  | 170K   | 1                | No
Dravid   | 37  | 50K    | 2                | ?
K-Nearest Neighbours: Example
Dravid | 37 | 50K | 2 | ?
K-Nearest Neighbours
Jay: Age = 35, Income = 95K, No. of credit cards = 3
Rina: Age = 41, Income = 215K, No. of credit cards = 2
Example: Income
Highest income = 200K
Dravid’s income is normalized to 50/200, Rina’s income is normalized to
50/200, etc.
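The divide-by-maximum scaling described above can be sketched as a minimal helper, assuming purely numeric columns:

```python
# Divide-by-maximum scaling as in the slide: each attribute value is
# divided by the largest value of that attribute in the data set, so
# every scaled value falls in (0, 1].
def normalize_columns(rows):
    """rows: sequence of equal-length numeric tuples -> scaled lists."""
    maxima = [max(column) for column in zip(*rows)]
    return [[value / m for value, m in zip(row, maxima)] for row in rows]
```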
K-Nearest Neighbours
Normalization of Variables
Customer | Age           | Income          | No. credit cards | Response
Jay      | 35/63 = 0.56  | 35/200 = 0.175  | 3/4 = 0.75       | No
Rina     | 22/63 = 0.34  | 50/200 = 0.25   | 2/4 = 0.5        | Yes
Hema     | 63/63 = 1     | 200/200 = 1     | 1/4 = 0.25       | No
Tommy    | 59/63 = 0.93  | 170/200 = 0.85  | 1/4 = 0.25       | No
Neil     | 25/63 = 0.39  | 40/200 = 0.2    | 4/4 = 1          | Yes
Dravid   | 37/63 = 0.58  | 50/200 = 0.25   | 2/4 = 0.5        | Yes
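Using these normalized values (taking Jay's age as 35/63 and Tommy's income as 170/200, consistent with the raw table), a 3-nearest-neighbour vote for Dravid can be sketched as:

```python
import math
from collections import Counter

# Normalized (age, income, cards) triples and known responses.
neighbors = {
    "Jay":   ((35/63, 35/200, 3/4), "No"),
    "Rina":  ((22/63, 50/200, 2/4), "Yes"),
    "Hema":  ((63/63, 200/200, 1/4), "No"),
    "Tommy": ((59/63, 170/200, 1/4), "No"),
    "Neil":  ((25/63, 40/200, 4/4), "Yes"),
}
dravid = (37/63, 50/200, 2/4)

# Rank customers by Euclidean distance to Dravid and vote with the top 3.
ranked = sorted(neighbors.items(), key=lambda kv: math.dist(dravid, kv[1][0]))
top3 = ranked[:3]
vote = Counter(label for _, (_, label) in top3).most_common(1)[0][0]
# The three closest are Rina (Yes), Jay (No) and Neil (Yes) -> vote "Yes".
```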
K-Nearest Neighbor
Example: Married
Customer | Married | Income | No. credit cards | Response
Jay      | Yes     | 35K    | 3                | No
Rina     | No      | 50K    | 2                | Yes
Hema     | No      | 200K   | 1                | No
Tommy    | Yes     | 170K   | 1                | No
Neil     | No      | 40K    | 4                | Yes
Dravid   | Yes     | 50K    | 2                | Yes
Non-Numeric Data
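One common way to use a non-numeric attribute such as Married in a distance computation (an illustrative convention, not prescribed by the slide) is to score matching categories as distance 0 and differing ones as 1, while numeric attributes keep their squared differences:

```python
import math

def mixed_distance(a, b):
    """Distance over records mixing categorical (str) and numeric fields."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str):   # categorical: match -> 0, differ -> 1
            total += 0.0 if x == y else 1.0
        else:                    # numeric: squared difference as usual
            total += (x - y) ** 2
    return math.sqrt(total)
```

For example, `mixed_distance(("Yes", 0.25, 0.5), ("No", 0.25, 0.5))` returns 1.0: the records differ only in marital status.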
1. Euclidean Distance
In two dimensions, the distance between points (x1, x2) and (y1, y2):
Example: X = (5, 5), Y = (9, 8)
D(X, Y) = √((9 − 5)² + (8 − 5)²) = √(16 + 9) = √25 = 5
In three dimensions (the L2 norm): X = (5, 5, 7), Y = (9, 8, 3)
D(X, Y) = √((9 − 5)² + (8 − 5)² + (3 − 7)²) = √(16 + 9 + 16) = √41 ≈ 6.4
k-NN Variations
• Value of k
– Larger k increases confidence in the prediction
– Note that if k is too large, the decision may be skewed
toward the majority class
• Weighted evaluation of nearest neighbors
– A plain majority vote may unfairly skew the decision
– Revise the algorithm so that closer neighbors have
greater “vote weight”
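A distance-weighted vote along these lines might look like the following sketch, where each neighbor votes with weight 1/distance; the epsilon term is an assumed guard against division by zero on exact matches:

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-9):
    """neighbors: list of (distance, label) for the K nearest examples."""
    weights = defaultdict(float)
    for d, label in neighbors:
        # Closer neighbors contribute larger vote weight.
        weights[label] += 1.0 / (d + eps)
    return max(weights, key=weights.get)
```

With neighbors [(0.1, "Yes"), (2.0, "No"), (3.0, "No")], a plain majority says "No", but the weighted vote says "Yes" because the closest neighbor dominates.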
How to Choose "K"?
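One common heuristic, sketched below under the assumption of a held-out validation set (the slide itself does not prescribe a method), is to try several odd values of K and keep the one with the best validation accuracy:

```python
import math
from collections import Counter

def knn_predict(E, train, k):
    """Majority class among the k training examples nearest to E."""
    nearest = sorted(train, key=lambda ex: math.dist(E, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, validation, k):
    hits = sum(knn_predict(f, train, k) == y for f, y in validation)
    return hits / len(validation)

def choose_k(train, validation, candidates=(1, 3, 5, 7, 9)):
    # Odd candidate values avoid ties in two-class voting.
    return max(candidates, key=lambda k: accuracy(train, validation, k))
```

In practice, cross-validation over the training data is often used instead of a single validation split.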
Summary
KNN classification algorithm
Different distance measures
KNN algorithm
Advantages and disadvantages
Case study 1 (using KNN )
Thanks!