cs4302-lecture2
2024
Xin Li
School of Computer Science,
Beijing Institute of Technology
Inductive Learning (recap)
Induction: given a training set of examples of the form (x, f(x)), return a function ℎ that approximates f.
Supervised Learning
Two types of problems
1. Classification
2. Regression
Classification Example
Problem: Will you enjoy an outdoor sport based on the weather?
Training set (in each row the attributes form x and EnjoySport is f(x)):

Sky    Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Normal    Strong  Warm   Same      yes
Sunny  High      Strong  Warm   Same      yes
Sunny  High      Strong  Warm   Change    no
Sunny  High      Strong  Cool   Change    yes
More Examples

Problem              Domain    Range    Classification / Regression
Spam Detection
Speech recognition
Digit recognition
Housing valuation
Weather prediction
Hypothesis Space
Hypothesis space H: the set of all hypotheses ℎ that the learner may consider.
Learning is a search through the hypothesis space.
Objective: find an ℎ that minimizes misclassification (or, more generally, some error function) with respect to the training examples.
Generalization
A good hypothesis will generalize well, i.e., predict unseen examples correctly.
Usually: any hypothesis ℎ found to approximate the target function f well over a sufficiently large set of training examples will also approximate the target function well over unobserved examples.
Inductive Learning
Goal: find an ℎ that agrees with f on the training set.
ℎ is consistent if it agrees with f on all training examples.
With noisy data a consistent hypothesis may not exist: e.g., in weather prediction, identical conditions may lead to rainy and sunny days.
Inductive Learning
A learning problem is realizable if the hypothesis space contains the true function; otherwise it is unrealizable.
It is difficult to determine whether a learning problem is realizable since the true function is not known.
It is possible to use a very large hypothesis space, for example H = the class of all Turing machines.
But there is a tradeoff between the expressiveness of a hypothesis class and the complexity of finding a good hypothesis within it.
Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Figure: compute the distance from the test record to the stored training records and choose the nearest one(s).]
Nearest Neighbour Classification
Classification function: ℎ(x) = y_x*, where x* is the stored training example closest to x.
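A minimal sketch of this 1-nearest-neighbour rule in Python; the toy data and labels below are made up for illustration:

```python
import numpy as np

def nn_classify(x, X_train, y_train):
    """Return the label of the training example closest to x (1-NN)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every stored example
    x_star = np.argmin(dists)                    # index of the nearest neighbour
    return y_train[x_star]                       # h(x) = y_{x*}

# Toy data (made up): two classes in 2-D
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y_train = np.array(["no", "no", "yes", "yes"])
print(nn_classify(np.array([0.8, 0.9]), X_train, y_train))  # -> "yes"
```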
Euclidean Distance
    dist(p, q) = √( Σ_{k=1}^{n} (p_k − q_k)² )
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
Distance Matrix
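The matrix above can be reproduced with a short sketch; the 2-D coordinates of p1–p4 are the ones listed on the Minkowski slide that follows:

```python
import numpy as np

# The four 2-D points used in the distance matrices (from the slides)
points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

def euclidean(p, q):
    """dist(p, q) = sqrt( sum_k (p_k - q_k)^2 )"""
    return np.sqrt(np.sum((p - q) ** 2))

# Pairwise Euclidean distance matrix
D = np.array([[euclidean(p, q) for q in points] for p in points])
print(np.round(D, 3))
# [[0.    2.828 3.162 5.099]
#  [2.828 0.    1.414 3.162]
#  [3.162 1.414 0.    2.   ]
#  [5.099 3.162 2.    0.   ]]
```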
Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance:
    dist(p, q) = ( Σ_{k=1}^{n} |p_k − q_k|^r )^{1/r}
r = 1: Manhattan (L1) distance; r = 2: Euclidean (L2) distance; r → ∞: supremum (L∞) distance.
Do not confuse r with n: all of these distances are defined for any number of dimensions.
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1 (r = 1)
      p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2 (r = 2)
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞ (r → ∞)
      p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Distance Matrix
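A short sketch that reproduces all three matrices by varying r; the r → ∞ case is implemented as the maximum coordinate difference:

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

def minkowski(p, q, r):
    """dist(p, q) = (sum_k |p_k - q_k|^r)^(1/r); r=1: Manhattan, r=2: Euclidean."""
    if np.isinf(r):
        return np.max(np.abs(p - q))          # L_inf: supremum (max) distance
    return np.sum(np.abs(p - q) ** r) ** (1.0 / r)

for r in (1, 2, np.inf):
    D = np.array([[minkowski(p, q, r) for q in points] for p in points])
    print(f"r = {r}:\n{np.round(D, 3)}")
```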
Mahalanobis Distance
    mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ
where Σ is the covariance matrix of the data X:
    Σ_{j,k} = (1/(n − 1)) Σ_{i=1}^{n} (X_{ij} − X̄_j)(X_{ik} − X̄_k)

Example:
    Σ = [ 0.3  0.2
          0.2  0.3 ]
    A = (0.5, 0.5),  B = (0, 1),  C = (1.5, 1.5)
    Mahal(A, B) = 5
    Mahal(A, C) = 4
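A short sketch verifying the two values above; it computes the quadratic form (p − q) Σ⁻¹ (p − q)ᵀ, which matches the slide's 5 and 4:

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])        # covariance matrix from the slide
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(p, q):
    """(p - q) Sigma^{-1} (p - q)^T, the squared Mahalanobis distance."""
    d = p - q
    return d @ Sigma_inv @ d

A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])
print(mahalanobis(A, B))   # -> 5.0 (approximately)
print(mahalanobis(A, C))   # -> 4.0 (approximately)
```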
Voronoi Diagram
Partition of the input space implied by the nearest-neighbour function ℎ.
K-Nearest Neighbour
Nearest-Neighbor Classifiers
Requires three things:
– The set of stored records
– A distance metric to compute the distance between records, e.g. d(p, q) = √( Σ_i (p_i − q_i)² )
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record: compute its distance to the stored records, find the k nearest neighbors, and determine the class by taking the majority vote of the class labels among those k nearest neighbors.
[Figure: an unknown record X shown with its nearest neighbors for different values of k.]
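A minimal sketch of the whole procedure (compute distances, take the k nearest, majority vote); the data and the value of k are placeholders:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training records."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    nearest = np.argsort(dists)[:k]                       # indices of the k closest records
    votes = Counter(y_train[i] for i in nearest)          # count class labels among neighbours
    return votes.most_common(1)[0][0]                     # majority class

# Placeholder data: two 2-D classes
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_classify(np.array([4.5, 5.0]), X_train, y_train, k=3))  # -> "B"
```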
Effect of K
K controls the degree of smoothing.
Which partition do you prefer: k = 1, k = 3, or k = 31? Why?
Performance of a learning algorithm
A learning algorithm is good if it produces a hypothesis that does a good job of predicting the classifications of unseen examples.
Verify performance with a test set:
1. Collect a large set of examples
2. Divide into 2 disjoint sets: training set and test set
3. Learn hypothesis ℎ with the training set
4. Measure the percentage of test-set examples correctly classified by ℎ
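A sketch of steps 1–4, assuming scikit-learn is available and using its built-in iris data purely as a stand-in for a large set of collected examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                    # 1. collect a set of labelled examples

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)             # 2. split into disjoint training and test sets

h = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # 3. learn h on the training set

print(h.score(X_test, y_test))                       # 4. fraction of test examples classified correctly
```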
The effect of K
The best K depends on
the problem
the amount of training data
Underfitting
Definition: underfitting occurs when an algorithm finds a hypothesis ℎ whose accuracy, even on the training examples, falls short of what a better hypothesis could achieve.
Amount of underfitting of ℎ: the size of that accuracy gap.
Common cause:
Classifier is not expressive enough
Overfitting
Definition: overfitting occurs when an algorithm finds a hypothesis ℎ with high training accuracy but noticeably lower accuracy on unseen examples.
Amount of overfitting of ℎ: the gap between its training accuracy and its accuracy on unseen examples.
Common causes:
Classifier is too expressive
Noisy data
Lack of data
Choosing K
How should we choose K?
Ideally: select the K with the highest future accuracy.
Alternative: select the K with the highest test accuracy.
Problem: since we are choosing K based on the test set, the test set effectively becomes part of the training set when optimizing K. Hence we can no longer trust the test-set accuracy to be representative of future accuracy.
Robust validation
Cross-Validation
Repeatedly split the training data into two parts, one for training and one for validation, and report the average validation accuracy.
k-fold cross-validation: split the training data into k equal-size subsets. Run k experiments, each time validating on one subset and training on the remaining subsets. Compute the average validation accuracy over the k experiments.
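A minimal sketch of using k-fold cross-validation to pick the number of neighbours (anticipating the next slides); it assumes scikit-learn's cross_val_score, and the candidate K values and the iris data are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # placeholder training data

# For each candidate K, run 5-fold cross-validation on the training data
# and report the average validation accuracy.
for K in (1, 3, 5, 11, 31):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=K), X, y, cv=5)
    print(K, scores.mean())

# Pick the K with the highest average validation accuracy, then retrain on the
# full training set and evaluate once on the held-out test set.
```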
Selecting the Number of Neighbours by Cross-Validation
Selecting the Hyperparameters by Cross-Validation
Weighted K-Nearest Neighbour
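One common formulation of weighted k-NN gives each neighbour a vote weighted by the inverse of its distance; whether the slide uses exactly this weighting is an assumption. A minimal sketch under that assumption, with made-up data:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(x, X_train, y_train, k=3, eps=1e-12):
    """Each of the k nearest neighbours votes with weight 1 / distance (assumed scheme)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + eps)   # closer neighbours count more
    return max(votes, key=votes.get)

# Made-up data: plain majority vote with k=3 would say "B",
# but the single very close "A" neighbour dominates the weighted vote.
X_train = np.array([[0.1, 0.0], [1.0, 0.0], [0.0, 1.1]])
y_train = np.array(["A", "B", "B"])
print(weighted_knn_classify(np.array([0.0, 0.0]), X_train, y_train, k=3))  # -> "A"
```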
K-Nearest Neighbour Regression
We can also use KNN for regression.
Let y_x be a real value instead of a categorical label.
K-nearest neighbour regression: predict the average of the y-values of the K nearest neighbours, ℎ(x) = (1/K) Σ_{x' ∈ kNN(x)} y_x'.
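A minimal sketch of this, taking the prediction to be the mean of the K nearest neighbours' y-values; the 1-D data below is made up:

```python
import numpy as np

def knn_regress(x, X_train, y_train, k=3):
    """Predict the mean of the y-values of the k nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()       # h(x) = (1/k) * sum of neighbours' y-values

# Made-up 1-D regression data: y roughly follows x
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
print(knn_regress(np.array([3.5]), X_train, y_train, k=3))  # mean of 2.9, 4.2, 2.1 ≈ 3.07
```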
Nearest Neighbor Classification…
Problem with the Euclidean measure:
High-dimensional data: curse of dimensionality
Can produce counter-intuitive results, e.g. for binary vectors:
  111111111110 vs 011111111111   d = 1.4142
  100000000000 vs 000000000001   d = 1.4142
Both pairs are the same Euclidean distance apart, even though the first pair shares ten 1s and the second pair shares none.
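A two-line check of the distances quoted above:

```python
import numpy as np

a = np.array([int(c) for c in "111111111110"])
b = np.array([int(c) for c in "011111111111"])
c = np.array([int(c) for c in "100000000000"])
d = np.array([int(c) for c in "000000000001"])

print(np.linalg.norm(a - b))   # 1.4142... : vectors sharing ten 1s
print(np.linalg.norm(c - d))   # 1.4142... : vectors sharing no 1s
```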
Nearest Neighbor Classification…
k-NN classifiers are lazy learners:
They do not build models explicitly,
unlike eager learners such as decision tree induction and rule-based systems.
Classifying unknown records is therefore relatively expensive.