cs4302-lecture2
2024
Xin Li
School of Computer Science,
Beijing Institute of Technology
Inductive Learning (recap)
Induction: given a training set of examples of the form (x, f(x)), return a function ℎ that approximates f.
Supervised Learning
Two types of problems
1. Classification
2. Regression
Classification Example
Problem: Will you enjoy an outdoor sport based on the weather?
Training set (in each row the attributes form x and EnjoySport is f(x)):

Sky    Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Normal    Strong  Warm   Same      yes
Sunny  High      Strong  Warm   Same      yes
Sunny  High      Strong  Warm   Change    no
Sunny  High      Strong  Cool   Change    yes
More Examples

Problem              Domain    Range    Classification / Regression
Spam Detection
Speech recognition
Digit recognition
Housing valuation
Weather prediction
Hypothesis Space
Hypothesis space H: the set of all hypotheses ℎ that the learner may consider.
Learning is a search through the hypothesis space.
Objective: find an ℎ that minimizes misclassification (or, more generally, some error function) with respect to the training examples.
Generalization
A good hypothesis will generalize well, i.e., predict unseen examples correctly.
Usually: any hypothesis ℎ found to approximate the target function f well over a sufficiently large set of training examples will also approximate the target function well over unobserved examples.
Inductive Learning
Goal: find an ℎ that agrees with f on the training set.
ℎ is consistent if it agrees with f on all training examples.
With noisy data a consistent hypothesis may not exist: e.g., in weather prediction, identical conditions may lead to rainy and sunny days.
Inductive Learning
A learning problem is realizable if the hypothesis space contains the true function; otherwise it is unrealizable.
It is difficult to determine whether a learning problem is realizable since the true function is not known.
It is possible to use a very large hypothesis space, for example H = the class of all Turing machines.
But there is a tradeoff between the expressiveness of a hypothesis class and the complexity of finding a good hypothesis within it.
Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Figure: compute the distance from the test record to the stored training records and choose the nearest one(s).]
Nearest Neighbour Classification
Classification function: ℎ(x) = y_x*, where x* is the stored training example closest to x.
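A minimal sketch of this 1-nearest-neighbour rule in Python; the toy data and labels below are made up for illustration:

```python
import numpy as np

def nn_classify(x, X_train, y_train):
    """Return the label of the training example closest to x (1-NN)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every stored example
    x_star = np.argmin(dists)                    # index of the nearest neighbour
    return y_train[x_star]                       # h(x) = y_{x*}

# Toy data (made up): two classes in 2-D
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y_train = np.array(["no", "no", "yes", "yes"])
print(nn_classify(np.array([0.8, 0.9]), X_train, y_train))  # -> "yes"
```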
Euclidean Distance
    dist(p, q) = √( Σ_{k=1}^{n} (p_k − q_k)² )
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
Distance Matrix
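The matrix above can be reproduced with a short sketch; the 2-D coordinates of p1–p4 are the ones listed on the Minkowski slide that follows:

```python
import numpy as np

# The four 2-D points used in the distance matrices (from the slides)
points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

def euclidean(p, q):
    """dist(p, q) = sqrt( sum_k (p_k - q_k)^2 )"""
    return np.sqrt(np.sum((p - q) ** 2))

# Pairwise Euclidean distance matrix
D = np.array([[euclidean(p, q) for q in points] for p in points])
print(np.round(D, 3))
# [[0.    2.828 3.162 5.099]
#  [2.828 0.    1.414 3.162]
#  [3.162 1.414 0.    2.   ]
#  [5.099 3.162 2.    0.   ]]
```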
Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance:
    dist(p, q) = ( Σ_{k=1}^{n} |p_k − q_k|^r )^{1/r}
r = 1: Manhattan (L1) distance; r = 2: Euclidean (L2) distance; r → ∞: supremum (L∞) distance.
Do not confuse r with n: all of these distances are defined for any number of dimensions.
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1 (r = 1)
      p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2 (r = 2)
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞ (r → ∞)
      p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Distance Matrix
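A short sketch that reproduces all three matrices by varying r; the r → ∞ case is implemented as the maximum coordinate difference:

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

def minkowski(p, q, r):
    """dist(p, q) = (sum_k |p_k - q_k|^r)^(1/r); r=1: Manhattan, r=2: Euclidean."""
    if np.isinf(r):
        return np.max(np.abs(p - q))          # L_inf: supremum (max) distance
    return np.sum(np.abs(p - q) ** r) ** (1.0 / r)

for r in (1, 2, np.inf):
    D = np.array([[minkowski(p, q, r) for q in points] for p in points])
    print(f"r = {r}:\n{np.round(D, 3)}")
```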
Mahalanobis Distance
    mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ
where Σ is the covariance matrix of the data X:
    Σ_{j,k} = (1/(n − 1)) Σ_{i=1}^{n} (X_{ij} − X̄_j)(X_{ik} − X̄_k)

Example:
    Σ = [ 0.3  0.2
          0.2  0.3 ]
    A = (0.5, 0.5),  B = (0, 1),  C = (1.5, 1.5)
    Mahal(A, B) = 5
    Mahal(A, C) = 4
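A short sketch verifying the two values above; it computes the quadratic form (p − q) Σ⁻¹ (p − q)ᵀ, which matches the slide's 5 and 4:

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])        # covariance matrix from the slide
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(p, q):
    """(p - q) Sigma^{-1} (p - q)^T, the squared Mahalanobis distance."""
    d = p - q
    return d @ Sigma_inv @ d

A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])
print(mahalanobis(A, B))   # -> 5.0 (approximately)
print(mahalanobis(A, C))   # -> 4.0 (approximately)
```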
Voronoi Diagram
Partition of the input space implied by the nearest-neighbour function ℎ.
K-Nearest Neighbour
Nearest-Neighbor Classifiers
Requires three things:
– The set of stored records
– A distance metric to compute the distance between records, e.g. d(p, q) = √( Σ_i (p_i − q_i)² )
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record: compute its distance to the stored records, find the k nearest neighbors, and determine the class by taking the majority vote of the class labels among those k nearest neighbors.
[Figure: an unknown record X shown with its nearest neighbors for different values of k.]
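A minimal sketch of the whole procedure (compute distances, take the k nearest, majority vote); the data and the value of k are placeholders:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training records."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    nearest = np.argsort(dists)[:k]                       # indices of the k closest records
    votes = Counter(y_train[i] for i in nearest)          # count class labels among neighbours
    return votes.most_common(1)[0][0]                     # majority class

# Placeholder data: two 2-D classes
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_classify(np.array([4.5, 5.0]), X_train, y_train, k=3))  # -> "B"
```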
Effect of K
K controls the degree of smoothing.
Which partition do you prefer: k = 1, k = 3, or k = 31? Why?
Performance of a learning algorithm
A learning algorithm is good if it produces a hypothesis that does a good job of predicting the classifications of unseen examples.
Verify performance with a test set:
1. Collect a large set of examples
2. Divide into 2 disjoint sets: training set and test set
3. Learn hypothesis ℎ with the training set
4. Measure the percentage of test-set examples correctly classified by ℎ
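A sketch of steps 1–4, assuming scikit-learn is available and using its built-in iris data purely as a stand-in for a large set of collected examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                    # 1. collect a set of labelled examples

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)             # 2. split into disjoint training and test sets

h = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # 3. learn h on the training set

print(h.score(X_test, y_test))                       # 4. fraction of test examples classified correctly
```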
The effect of K
The best K depends on
the problem
the amount of training data
Underfitting
Definition: underfitting occurs when an algorithm finds a hypothesis ℎ whose accuracy, even on the training examples, falls short of what a better hypothesis could achieve.
Amount of underfitting of ℎ: the size of that accuracy gap.
Common cause:
Classifier is not expressive enough
Overfitting
Definition: overfitting occurs when an algorithm finds a hypothesis ℎ with high training accuracy but noticeably lower accuracy on unseen examples.
Amount of overfitting of ℎ: the gap between its training accuracy and its accuracy on unseen examples.
Common causes:
Classifier is too expressive
Noisy data
Lack of data
Choosing K
How should we choose K?
Ideally: select the K with the highest future accuracy.
Alternative: select the K with the highest test accuracy.
Problem: since we are choosing K based on the test set, the test set effectively becomes part of the training set when optimizing K. Hence we can no longer trust the test-set accuracy to be representative of future accuracy.
Robust validation
Cross-Validation
Repeatedly split the training data into two parts, one for training and one for validation, and report the average validation accuracy.
k-fold cross-validation: split the training data into k equal-size subsets. Run k experiments, each time validating on one subset and training on the remaining subsets. Compute the average validation accuracy over the k experiments.
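A minimal sketch of using k-fold cross-validation to pick the number of neighbours (anticipating the next slides); it assumes scikit-learn's cross_val_score, and the candidate K values and the iris data are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # placeholder training data

# For each candidate K, run 5-fold cross-validation on the training data
# and report the average validation accuracy.
for K in (1, 3, 5, 11, 31):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=K), X, y, cv=5)
    print(K, scores.mean())

# Pick the K with the highest average validation accuracy, then retrain on the
# full training set and evaluate once on the held-out test set.
```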
Selecting the Number of Neighbours by Cross-Validation
Selecting the Hyperparameters by Cross-Validation
Weighted K-Nearest Neighbour
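One common formulation of weighted k-NN gives each neighbour a vote weighted by the inverse of its distance; whether the slide uses exactly this weighting is an assumption. A minimal sketch under that assumption, with made-up data:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(x, X_train, y_train, k=3, eps=1e-12):
    """Each of the k nearest neighbours votes with weight 1 / distance (assumed scheme)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + eps)   # closer neighbours count more
    return max(votes, key=votes.get)

# Made-up data: plain majority vote with k=3 would say "B",
# but the single very close "A" neighbour dominates the weighted vote.
X_train = np.array([[0.1, 0.0], [1.0, 0.0], [0.0, 1.1]])
y_train = np.array(["A", "B", "B"])
print(weighted_knn_classify(np.array([0.0, 0.0]), X_train, y_train, k=3))  # -> "A"
```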
K-Nearest Neighbour Regression
We can also use KNN for regression.
Let y_x be a real value instead of a categorical label.
K-nearest neighbour regression: predict the average of the y-values of the K nearest neighbours, ℎ(x) = (1/K) Σ_{x' ∈ kNN(x)} y_x'.
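A minimal sketch of this, taking the prediction to be the mean of the K nearest neighbours' y-values; the 1-D data below is made up:

```python
import numpy as np

def knn_regress(x, X_train, y_train, k=3):
    """Predict the mean of the y-values of the k nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()       # h(x) = (1/k) * sum of neighbours' y-values

# Made-up 1-D regression data: y roughly follows x
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
print(knn_regress(np.array([3.5]), X_train, y_train, k=3))  # mean of 2.9, 4.2, 2.1 ≈ 3.07
```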
Nearest Neighbor Classification…
Problem with the Euclidean measure:
High-dimensional data: curse of dimensionality
Can produce counter-intuitive results, e.g. for binary vectors:
  111111111110 vs 011111111111   d = 1.4142
  100000000000 vs 000000000001   d = 1.4142
Both pairs are the same Euclidean distance apart, even though the first pair shares ten 1s and the second pair shares none.
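A two-line check of the distances quoted above:

```python
import numpy as np

a = np.array([int(c) for c in "111111111110"])
b = np.array([int(c) for c in "011111111111"])
c = np.array([int(c) for c in "100000000000"])
d = np.array([int(c) for c in "000000000001"])

print(np.linalg.norm(a - b))   # 1.4142... : vectors sharing ten 1s
print(np.linalg.norm(c - d))   # 1.4142... : vectors sharing no 1s
```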
Nearest Neighbor Classification…
k-NN classifiers are lazy learners:
They do not build models explicitly,
unlike eager learners such as decision tree induction and rule-based systems.
Classifying unknown records is therefore relatively expensive.