Lecture 3 Basics of Classification
(Adapted from various sources. The slides copied are only for
teaching purposes.)
BASIC IDEA
§ In the exams, the student’s grade was assigned based
on their marks as follows:
Mark ≥ "# : A
6# > Mark : F
Classification !
§ The classes are mutually exclusive and
collectively exhaustive.
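A grading rule of this kind can be sketched as a function that maps every mark to exactly one grade. The cutoffs and the intermediate grades below are hypothetical placeholders, not the thresholds from the slide; the sketch only illustrates that the rules never assign two grades to one mark and never leave a mark without a grade.

```python
# Hypothetical cutoffs for illustration only; not the course's real thresholds.
CUTOFFS = [(80, "A"), (70, "B"), (60, "C")]  # (minimum mark, grade), highest first
FAIL_GRADE = "F"

def grade(mark: float) -> str:
    """Return exactly one grade for any mark."""
    for minimum, label in CUTOFFS:
        if mark >= minimum:
            return label   # first matching rule wins, so grades never overlap
    return FAIL_GRADE      # fallback catches every remaining mark

for m in (95, 72, 60, 41):
    print(m, grade(m))
```

Checking the cutoffs from highest to lowest and returning on the first match is what keeps the classes from overlapping; the final fallback is what makes them cover every possible mark.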
Classification types
§ Unsupervised Classification
§ It is called clustering
§ We don’t know the classes or the number of possible classes.
How to
proceed?
The image is taken from https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
§ Training set (the teacher)
§ Collection of records with a set of
attributes and one class label.
How to
proceed?
Classifier
Another Application
§ X1,…,Xn ∈ {0,1} (Blue vs. Red pixels)
§ Y ∈ {5,6} (predict whether a digit is a 5 or a 6)
Normalization Constant
DB: collection of instances with known categories
EXAMPLE 1
Determining decision on scholarship application based on the following features:
Ø Instead, the presented training data is simply stored and, when a new
query instance is encountered, a set of similar, related instances is
retrieved from memory and used to classify the new query instance.
Ø No training is needed. Classification time for each test case is linear
in the training set size.
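The "store now, compare at query time" idea can be sketched as follows. This is a minimal illustration rather than a reference implementation: the class name `LazyClassifier`, the Euclidean distance, and the toy data are all assumptions, and for brevity it retrieves only the single most similar stored instance. `fit` does nothing except store the records, while `predict` computes one distance per stored record, which is why the cost per query grows linearly with the training set size.

```python
import math

class LazyClassifier:
    """Instance-based learner: 'training' only memorizes the data."""

    def fit(self, points, labels):
        # No model is built; the training records are stored as-is.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # One distance computation per stored record: O(n) per query.
        best_label, best_dist = None, math.inf
        for point, label in zip(self.points, self.labels):
            d = math.dist(point, query)  # Euclidean distance (assumed measure)
            if d < best_dist:
                best_label, best_dist = label, d
        return best_label

# Toy data, made up for illustration.
clf = LazyClassifier().fit([(0, 0), (1, 1), (5, 5)], ["-", "-", "+"])
print(clf.predict((4, 4)))
```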
K-NEAREST-NEIGHBORS
WHAT IS THE MOST LIKELY LABEL FOR C?
Solution: Look for the K nearest neighbors of c.
Take the majority label as c’s label.
Let’s suppose k = 3:
The 3 nearest points to c are: a, a and o.
Therefore, the most likely label for c is a.
SIMPLE ILLUSTRATION: THE COMPLEXITY
[Figure: scatter of training points labeled + and -, with a query point q]
What is the class of q?
q is + under 1-NN, but - under 5-NN
K-NEAREST NEIGHBORS
Get
For a given instance T, get the top k dataset instances that are “nearest” to T
• Select a reasonable distance measure
Inspect
Inspect the category of these k instances, choose the category C that represents the
most instances
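The two steps above (get the k nearest instances, then inspect their categories) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name `knn_predict`, the Euclidean distance, and the coordinates are all hypothetical. The points are chosen only to reproduce the situation from the illustration, where q is + under 1-NN but - under 5-NN.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of ((x, y), label) pairs. Returns the majority label
    among the k training points closest to query."""
    # Get: rank all training instances by distance to the query.
    ranked = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # Inspect: count the labels of the k nearest and take the most common.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical coordinates: the single nearest point is "+",
# but three of the five nearest points are "-".
train = [
    ((1.0, 0.0), "+"),
    ((0.0, 2.0), "-"),
    ((-2.0, 0.5), "-"),
    ((0.5, -2.5), "-"),
    ((3.0, 1.0), "+"),
    ((4.0, 4.0), "+"),
    ((-5.0, -5.0), "-"),
]
q = (0.0, 0.0)
print(knn_predict(train, q, k=1))
print(knn_predict(train, q, k=5))
```

With two classes, an odd k avoids tied votes. The sketch does not handle ties in any principled way; on a tie, `Counter.most_common` simply returns the label that was counted first.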
Evaluation has to be treated as hypothesis testing in statistics.
The value of the population parameter has to be statistically inferred based on the
sample statistics (i.e., a training set in pattern recognition).
Danger of overfitting
Validation is important!
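One common way to validate a k-NN classifier is leave-one-out validation: each record is classified using all the other records, and the fraction of correct answers is compared across values of k. The sketch below is an assumption about how this could be done, not necessarily the procedure used in this lecture; the helper names and the toy data (two clusters plus one noisy "+" point inside the "-" cluster) are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    # Majority label among the k training points nearest to query.
    ranked = sorted(train, key=lambda rec: math.dist(rec[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out: classify each record using all the other records."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]          # hold out record i
        hits += knn_predict(rest, point, k) == label
    return hits / len(data)

# Toy data, made up for illustration.
data = [
    ((0.0, 0.0), "-"), ((0.0, 1.0), "-"), ((1.0, 0.0), "-"), ((1.0, 1.0), "-"),
    ((0.5, 0.5), "+"),  # noisy point sitting inside the "-" cluster
    ((5.0, 5.0), "+"), ((5.0, 6.0), "+"), ((6.0, 5.0), "+"), ((6.0, 6.0), "+"),
]
for k in (1, 3, 5):
    print(k, loo_accuracy(data, k))
```

On this toy data, k = 1 follows the single noisy point and scores lower than k = 3, which is a small-scale picture of overfitting. With real data, the held-out estimate itself varies from sample to sample, so differences between values of k should be read with that uncertainty in mind.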
Questions???