Lecture 8: KNN (Part 1)
Instance-Based Classifiers
• Store the training records (a set of stored cases) and use them directly to predict the class label of unseen cases
• Example:
– Nearest neighbor
• Uses k “closest” points (nearest neighbors) for performing classification
Nearest Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably a duck
[Figure: the distance between a test record and each training record is computed; the nearest records determine the test record’s class]
• Compute the distance between two points p and q:
– Manhattan distance
$d(p, q) = \sum_i |p_i - q_i|$
– q-norm (Minkowski) distance
$d(p, q) = \left( \sum_i |p_i - q_i|^q \right)^{1/q}$
• q = 1 recovers the Manhattan distance; q = 2 gives the familiar Euclidean distance
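To make these metrics concrete, here is a minimal NumPy sketch; the helper name minkowski_distance and its exponent parameter r are our own (r stands in for the slide’s q, which already names one of the points):

import numpy as np

def minkowski_distance(p, q, r=2):
    # General q-norm distance: r=1 gives Manhattan, r=2 gives Euclidean
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(p - q) ** r) ** (1.0 / r)

p, q = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski_distance(p, q, r=1))  # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski_distance(p, q, r=2))  # Euclidean: sqrt(9 + 4 + 0) ≈ 3.6056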
• Determine the class from the nearest-neighbor list
– Take the majority vote of class labels among the k nearest neighbors:
$y' = \operatorname{argmax}_v \sum_{(\mathbf{x}_i, y_i) \in D_z} I(v = y_i)$
where $D_z$ is the set of k closest training examples to z and $I(\cdot)$ is the indicator function.
– Or weigh each vote according to distance:
$y' = \operatorname{argmax}_v \sum_{(\mathbf{x}_i, y_i) \in D_z} w_i \times I(v = y_i)$
• weight factor: $w_i = 1 / d(\mathbf{x}', \mathbf{x}_i)^2$
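A small sketch of the distance-weighted vote, assuming the k nearest neighbors arrive as (distance, label) pairs; weighted_vote and the epsilon guard against zero distance are our own additions:

from collections import defaultdict

def weighted_vote(neighbors):
    # neighbors: list of (distance, label) pairs for the k nearest examples
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d ** 2 + 1e-12)  # w_i = 1/d^2, epsilon avoids division by zero
    return max(scores, key=scores.get)

print(weighted_vote([(0.5, "duck"), (2.0, "goose"), (2.5, "duck")]))  # duck (score 4.16 vs 0.25)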
The KNN classification algorithm
Let k be the number of nearest neighbors and D be the set of training examples.
1. for each test example z = (x', y') do
2.   Compute d(x', x), the distance between z and every example (x, y) ∈ D
3.   Select D_z ⊆ D, the set of k closest training examples to z
4.   y' = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i)
5. end for
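The pseudocode translates almost line for line into Python. This is a minimal sketch assuming Euclidean distance and an unweighted majority vote; knn_classify and the toy loan data are our own:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, X_test, k=3):
    predictions = []
    for z in X_test:                                       # step 1: each test example
        dists = np.sqrt(((X_train - z) ** 2).sum(axis=1))  # step 2: distance to every example
        nearest = np.argsort(dists)[:k]                    # step 3: indices of D_z
        predictions.append(Counter(y_train[nearest]).most_common(1)[0][0])  # step 4: majority vote
    return np.array(predictions)

X_train = np.array([[25, 40_000], [30, 60_000], [45, 150_000], [50, 200_000]], dtype=float)
y_train = np.array(["Default", "Default", "Non-Default", "Non-Default"])
print(knn_classify(X_train, y_train, np.array([[28.0, 50_000.0]]), k=3))  # ['Default']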
KNN Classification
[Figure: scatter plot of loan amount (Loan$, $0 to $2,50,000) against Age (0 to 70), with points labeled Default and Non-Default]
Nearest Neighbor Classification…
• Choosing the value of k (the sketch below illustrates the trade-off):
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes
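A toy demonstration of the trade-off, with one deliberately mislabeled “noise” point planted inside class B; the data and names here are invented for illustration:

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)),   # class A, clustered near (0, 0)
               rng.normal(4.0, 1.0, (20, 2)),   # class B, clustered near (4, 4)
               [[4.0, 4.0]]])                   # noise point inside B, mislabeled A
y = np.array(["A"] * 20 + ["B"] * 20 + ["A"])

def predict(z, k):
    d = np.linalg.norm(X - z, axis=1)
    return Counter(y[np.argsort(d)[:k]]).most_common(1)[0][0]

z = np.array([4.0, 4.0])   # query lands exactly on the noise point
print(predict(z, k=1))     # "A": k too small, the lone noise point decides
print(predict(z, k=5))     # "B": a larger neighborhood outvotes the noise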
Nearest Neighbor Classification…
• Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes (see the sketch after this list)
– Example:
• height of a person may vary from 1.5 m to 1.8 m
• weight of a person may vary from 60 kg to 100 kg
• income of a person may vary from Rs 10K to Rs 2 Lakh
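A short sketch of why scaling matters, using the three attributes from the example above (min-max ranges taken straight from the slide):

import numpy as np

# Two people: (height in m, weight in kg, income in Rs)
a = np.array([1.5, 60.0, 200_000.0])
b = np.array([1.8, 100.0, 10_000.0])

# Raw Euclidean distance is dominated entirely by income
print(np.linalg.norm(a - b))            # ~190000.0

# Min-max scale each attribute to [0, 1] using the ranges from the slide
mins = np.array([1.5, 60.0, 10_000.0])
maxs = np.array([1.8, 100.0, 200_000.0])
a_s, b_s = (a - mins) / (maxs - mins), (b - mins) / (maxs - mins)
print(np.linalg.norm(a_s - b_s))        # ~1.732: all three attributes now contribute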
Nearest Neighbor Classification…
• Problem with Euclidean measure:
– High dimensional data
• curse of dimensionality: all vectors are almost equidistant to the query vector
– Can produce undesirable results, e.g. for binary vectors (demonstrated in the sketch below):
111111111110 vs 011111111111 → d = 1.4142
100000000000 vs 000000000001 → d = 1.4142
• Each pair differs in exactly two bits, so the almost-identical first pair and the completely disjoint second pair come out equally far apart.
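A quick experiment confirming the concentration effect: as dimensionality grows, the nearest and farthest of 1000 random points end up almost the same distance from the query (the setup is our own):

import numpy as np

rng = np.random.default_rng(42)
for dim in (2, 10, 100, 1000):
    points = rng.random((1000, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    # relative spread shrinks toward 0 as dim grows: everything is "equidistant"
    print(f"dim={dim:4d}  (max-min)/min = {(d.max() - d.min()) / d.min():.3f}")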