KNN Evaluation

CSCI417

Machine Intelligence

Lecture # 2

Spring 2024

1
Tentative Course Topics

1. Machine Learning Basics
2. Classifying with k-Nearest Neighbors
3. Splitting datasets one feature at a time: decision trees
4. Classifying with probability theory: naïve Bayes
5. Logistic regression
6. Support vector machines
7. Model Evaluation and Improvement: Cross-validation, Grid Search, Evaluation Metrics, and Scoring
8. Ensemble learning and improving classification with the AdaBoost meta-algorithm
9. Introduction to Neural Networks - Building NN for classification (binary/multiclass)
10. Convolutional Neural Network (CNN)
11. Pretrained models (VGG, AlexNet, ...)
12. Machine learning pipeline and use cases

2
Other Names

• Non-parametric classification algorithm
• Instance-based algorithm
• Lazy learning algorithm
• Competitive learning algorithm

4
Classification revisited

• Classification is dividing up objects so that each is assigned to one of a number of discrete and definite categories known as classes.
• Examples:
  • customers who are likely to buy or not buy a particular product in a supermarket
  • people who are at high, medium or low risk of acquiring a certain illness

5
Classification revisited

Training Set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Model → Classifier (then applied to the Test Set)
6
Before K-NN vs. After K-NN

[Figure: in the (x1, x2) feature space, a new data point lies between Category 1 and Category 2; after applying K-NN, the new data point is assigned to Category 1.]
1. A positive integer k is specified, along with a new sample (e.g., k = 1, 3, 5); a code sketch of the full procedure follows this list

2. We select the k entries in our training data set which are closest to
the new sample

3. We find the most common classification of these entries

4. This is the classification we give to the new sample
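The four steps above translate directly into code. Below is a minimal NumPy sketch (my own illustrative function, not code from the lecture) that uses Euclidean distance; the toy training points and class labels are invented, while the query point reuses the (9.1, 11) sample from the example slides.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=5):
    """Classify one new sample with plain k-nearest neighbors.

    X_train : (n_samples, n_features) array of training points
    y_train : (n_samples,) array of class labels
    x_new   : (n_features,) array, the sample to classify
    k       : number of neighbors to consult
    """
    # Step 2: Euclidean distance from the new sample to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2 (cont.): indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Steps 3-4: majority vote among the k nearest labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy usage: two classes in a 2-D attribute space (invented data)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],      # class "A"
              [8.0, 9.0], [9.0, 11.0], [10.0, 10.0]])  # class "B"
y = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_classify(X, y, np.array([9.1, 11.0]), k=3))  # -> "B"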


Example

• Two classes
• Two attributes
• How to classify (9.1, 11)?

9
Example

• Two classes
• Two attributes
• How to classify (9.1, 11)?

10
STEP 1: Choose the number K of neighbors: K = 5

[Figure: the new data point plotted among Category 1 and Category 2 in the (x1, x2) feature space.]

[Figure: the Euclidean distance between two points P1(x1, y1) and P2(x2, y2):
d(P1, P2) = sqrt((x2 - x1)^2 + (y2 - y1)^2)]

[Figure: among the K = 5 nearest neighbors, 3 belong to Category 1 and 2 belong to Category 2, so the new data point is assigned to Category 1.]
Classifying with distance measurements: k-Nearest Neighbors (kNN)

• Store all training examples
  • Each point is a "vector" of attributes
• Classify new examples based on the most "similar" training examples
  • Similar means "closer" in vector space

16
Example

19
Components of K-NN classifier
Distance metric
− How do we measure distance between instances?
− Determines the layout of the example space
− Different choices are needed for categorical variables and continuous variables

The k hyper-parameter
− How large a neighborhood should we consider?
− Determines the complexity of the hypothesis space

26
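As an illustration of that distinction (a sketch of my own, not from the slides): Euclidean distance is a common choice for continuous variables, while a simple mismatch count (Hamming distance) can serve for categorical variables.

import numpy as np

def euclidean_distance(a, b):
    """Distance for continuous attribute vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(((a - b) ** 2).sum())

def hamming_distance(a, b):
    """Distance for categorical attribute vectors: number of mismatched attributes."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))             # 5.0
print(hamming_distance(["Yes", "Single"], ["No", "Single"]))  # 1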
Decision Boundaries

• Regions in feature space closest to every training example
• If a test point falls in the region corresponding to a given training point, return that point's label

28
Decision Boundary of a kNN

• Tunes the complexity of the hypothesis space
− If k = 1, every training example has its own neighborhood
− If k = N, the entire feature space is one neighborhood!

• Higher k yields smoother decision boundaries

30
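One quick way to see this effect in practice (a hedged sketch assuming scikit-learn is available; the dataset and the particular k values are arbitrary choices of mine, not from the lecture):

# Illustrative only: compare very small and very large k on a toy dataset.
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

for k in (1, 15, len(X)):           # k = 1, a moderate k, and k = N
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # k = 1 memorizes the training set (jagged boundary, 100% train accuracy);
    # k = N makes the whole training set one neighborhood, so the query is
    # effectively ignored and every point gets the same vote.
    print(f"k = {k:3d}  train accuracy = {model.score(X, y):.2f}")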
Variations on kNN

Weighted voting
• Default: all neighbors have equal weight
• Extension: weight the votes of neighbors by (inverse) distance
• The intuition behind weighted kNN is to give more weight to points that are nearby and less weight to points that are farther away.

Epsilon Ball Nearest Neighbors
• Same general principle as kNN, but change the method for selecting which training examples vote
• Instead of using the K nearest neighbors, use all training examples x′ such that

d(x, x′) ≤ ε

31
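A minimal sketch of both variations (the function names and the handling of an empty ball are my own choices, not from the slides):

import numpy as np
from collections import defaultdict

def weighted_knn(X_train, y_train, x_new, k=5, eps=1e-12):
    """k-NN with votes weighted by inverse distance."""
    d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] + eps)   # closer neighbors count more
    return max(votes, key=votes.get)

def epsilon_ball_nn(X_train, y_train, x_new, radius):
    """Epsilon-ball variant: every training point with d(x, x') <= radius votes."""
    d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    inside = d <= radius
    if not inside.any():
        return None                               # no neighbors inside the ball
    labels, counts = np.unique(y_train[inside], return_counts=True)
    return labels[np.argmax(counts)]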
https://www.cs.umd.edu/
Issues with KNN – Effect of K

32
Evaluation
Supervised Learning: Training and Testing
In supervised learning, the task is to learn a mapping from inputs to outputs given a
training dataset D = {(x_1, y_1), ..., (x_n, y_n)} of n input-output pairs.

Let f_θ be our ML algorithm that maps input examples x to output labels y,
parameterized by θ, where θ captures all the learnable parameters of our ML algorithm.

35
Classification Loss Function

Average misclassification rate.

Our goal is to minimize the loss function, i.e., find a set of parameters θ that makes
the misclassification rate as close to zero as possible.

Remember that, for continuous labels or response variables, a common loss function is the
Mean Squared Error (MSE).
36
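Written out, the two losses referred to above are the average 0/1 loss, L(θ) = (1/n) Σ_i 1[f_θ(x_i) ≠ y_i], and the MSE, (1/n) Σ_i (f_θ(x_i) − y_i)^2. A small sketch of both (the helper names are mine, purely illustrative):

import numpy as np

def misclassification_rate(y_true, y_pred):
    """Average 0/1 loss: fraction of predictions that differ from the labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true != y_pred)

def mean_squared_error(y_true, y_pred):
    """MSE for continuous labels / response variables."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

print(misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
print(mean_squared_error([3.0, 2.0], [2.5, 1.0]))          # (0.25 + 1.0)/2 = 0.625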
Performance Measurement

• The quality of predictions from a learned model is often expressed in terms of a
loss function. A loss function tells you how much you will be penalized for
making a guess g when the answer is actually a.
• There are many possible loss functions. Here are some frequently used examples:
• 0-1 loss applies to predictions drawn from finite domains.

37
Performance Measurement

• Squared loss: L(g, a) = (g − a)^2

• Linear (absolute) loss: L(g, a) = |g − a|

• Asymmetric loss: Consider a situation in which you are trying to predict whether
someone is having a heart attack. It might be much worse to predict "no" when the
answer is really "yes" than the other way around.
38
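A sketch of the three losses (the asymmetric penalty values are invented, purely to illustrate weighting a missed "yes" more heavily, as in the heart-attack example):

def squared_loss(guess, actual):
    return (guess - actual) ** 2

def linear_loss(guess, actual):
    # absolute / linear loss
    return abs(guess - actual)

def asymmetric_loss(guess, actual, fn_cost=10.0, fp_cost=1.0):
    """Binary labels: penalize predicting 'no' (0) when the answer is 'yes' (1)
    much more than the other way around. The costs 10 and 1 are illustrative."""
    if guess == actual:
        return 0.0
    return fn_cost if actual == 1 else fp_cost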
Accuracy – the simple metric

Accuracy = No. of items labeled correctly / Total no. of items

39
What fraction of the examples are classified correctly?

[Figure: 10 examples from classes C1 and C2 with a decision boundary; 9 of the 10 are classified correctly.]

Acc = 9/10

5

[Figure: two models, M1 and M2, each classifying examples from classes C1 and C2.]

• Acc(M1) = ?
• Acc(M2) = ?

What's the problem?

6
Shortcomings of Accuracy

Let’s delve into the possible classification cases. Either the classifier got a positive
example labeled as positive, or it made a mistake and marked it as negative. Conversely, a
negative example may have been mis-labeled as positive, or correctly guessed negative.
So we define the following metrics:

• True Positives (TP): number of positive examples, labeled as such.
• False Positives (FP): number of negative examples, labeled as positive.
• True Negatives (TN): number of negative examples, labeled as such.
• False Negatives (FN): number of positive examples, labeled as negative.

42
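These four counts can be tallied directly from the labels and the predictions. A minimal sketch, assuming 1 marks the positive class and 0 the negative class (the function name is mine):

import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 = positive."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)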
Example – Spam Classifier

In this case, accuracy = (10 + 100)/(10 + 100 + 25 + 15) = 73.3%. We may be tempted to think our
classifier is pretty decent, since it classified about 73% of all the messages correctly.

However, look what happens when we switch it for a dumb classifier that always says “no spam”:

43
A new dumb Spam Classifier

We get accuracy = (0 + 125)/(0 + 125 + 0 + 25) = 83.3%.


This looks crazy. We changed our model to a completely useless one, with exactly zero
predictive power, and yet, we got an increase in accuracy.

44
• Imbalanced data (distribution of classes)!
• Some errors matter more than others …
− Given medical record, predict patient has COVID or not
− Given an email, detect spam

• When classes are highly unbalanced, we focus on one target class (usually the rare class), denoted as the "positive" class.

7
Accuracy paradox

• This is called the accuracy paradox. When TP < FP, accuracy will always increase
when we change the classification rule to always output the "negative" category.
Conversely, when TN < FN, the same will happen when we change our rule to always
output "positive".

• So what can we do, so we are not tricked into thinking one classifier model is better
than another one, when it really isn't?

46
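The paradox can be checked with the numbers from the spam example above (TP = 10, TN = 100, FP = 25, FN = 15 for the first classifier; the dumb classifier turns every spam message into a false negative):

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Real spam classifier: TP=10, TN=100, FP=25, FN=15
print(accuracy(10, 100, 25, 15))   # 0.733...

# "Dumb" classifier that always answers "no spam": every spam becomes a FN
print(accuracy(0, 125, 0, 25))     # 0.833...  -> higher, despite zero predictive power

# The condition from the slide holds here: TP (10) < FP (25), so switching to
# "always negative" raises accuracy.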
                      Predicted
                      positive   negative
Actual    negative       FP         TN
          positive       TP         FN
8
[Figure: two models, M1 and M2, each classifying examples from classes C1 and C2.]

M1:
                      Predicted
                      positive   negative
Actual    negative       ?          ?
          positive       ?          ?

M2:
                      Predicted
                      positive   negative
Actual    negative       ?          ?
          positive       ?          ?
9
[Figure: two models, M1 and M2, each classifying examples from classes C1 and C2.]

M1:
                      Predicted
                      positive   negative
Actual    negative       2          6
          positive       1          1

M2:
                      Predicted
                      positive   negative
Actual    negative       1          7
          positive       0          2
10
Precision

Out of all the examples predicted positive, what percentage is truly positive, i.e., the
number of correct positive predictions divided by the total number of positive predictions
made by the model.

Precision = TP / (TP + FP)

                      Predicted
                      positive   negative
Actual    negative       FP         TN
          positive       TP         FN

M1:
                      Predicted
                      positive   negative
Actual    negative       2          6
          positive       1          1

Precision(M1) = 1 / (1 + 2) = 1/3

M2:
                      Predicted
                      positive   negative
Actual    negative       1          7
          positive       0          2

Precision(M2) = 0 / (0 + 1) = 0

Precision: % of positive predictions that are correct
Recall

What fraction of the actual positive examples are predicted as positive? Out of the
total number of actual positives, what percentage are predicted positive.

Recall = TP / (TP + FN)

                      Predicted
                      positive   negative
Actual    negative       FP         TN
          positive       TP         FN

M1:
                      Predicted
                      positive   negative
Actual    negative       2          6
          positive       1          1

Recall(M1) = 1 / (1 + 1) = 1/2

M2:
                      Predicted
                      positive   negative
Actual    negative       1          7
          positive       0          2

Recall(M2) = 0 / (0 + 2) = 0

Recall: % of gold positive examples that are found
12
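Both metrics follow directly from the confusion-matrix counts. A short sketch (helper names are mine) that reproduces the M1 and M2 numbers above:

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# M1: TP=1, FP=2, FN=1, TN=6   |   M2: TP=0, FP=1, FN=2, TN=7
print(precision(1, 2), recall(1, 1))   # 0.333..., 0.5
print(precision(0, 1), recall(0, 2))   # 0.0, 0.0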
Which is better: high precision with low recall, or vice-versa?

[Figure: precision vs. recall plot. A classifier that detects only a few positive examples but misses many others sits at high precision / low recall; a classifier that predicts everything as positive sits at high recall / low precision; the ideal classifier is at the top-right corner (precision = 1, recall = 1).]
15
False Positive or False Negative?

In the medical example (sick vs. healthy), what is worse: a False Positive or a False Negative?

                     Predicted
                     Sick      Healthy
Actual    Sick       1000      200  (False Negatives)
          Healthy    800       8000
                     (False Positives)

In the spam detector example (spam vs. not spam), what is worse: a False Positive or a False Negative?

                     Predicted
                     Sent to Spam   Sent to inbox
Actual    Spam       1000           200  (False Negatives)
          Not Spam   800            8000
                     (False Positives)
55
F measure: there should be a metric that combines both precision and recall.

F1 = 2 / (1/Precision + 1/Recall) = 2 · Precision · Recall / (Precision + Recall)
− Harmonic mean of precision and recall

• Weighted F measure: gives different weightage to recall and precision:

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)

Beta represents how many times recall is more important than precision.
If recall is twice as important as precision, the value of Beta is 2.
16
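A sketch of the combined measure (the helper name is mine); with β = 1 it reproduces the F1 value used for M1 below, and β = 2 counts recall as twice as important as precision:

def f_beta(precision, recall, beta=1.0):
    """F-measure: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(1/3, 1/2))           # F1 for M1 -> 0.4
print(f_beta(1/3, 1/2, beta=2))   # F2: recall weighted twice as heavily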
[Figure: two models, M1 and M2, each classifying examples from classes C1 and C2.]

             M1    M2
Precision    ?     ?
Recall       ?     ?
F1           ?     ?
17
[Figure: two models, M1 and M2, each classifying examples from classes C1 and C2.]

             M1           M2
Precision    1/3 = 0.33   0/1 = 0
Recall       1/2 = 0.5    0/2 = 0
F1           0.4          0
18
[Figure: 10 examples from three classes C1, C2, and C3.]

• Accuracy = ? = (3 + 3 + 1)/10 = 0.7

What's the problem?

• Accuracy is a good measure when classes are nearly balanced!
19
[Figure: a 3×3 confusion matrix template for classes C1, C2, C3, with rows = Actual and columns = Predicted.]
20
                 Predicted
                 C1    C2    C3
Actual    C1     3     0     1
          C2     0     3     1
          C3     1     0     1

          C1    C2    C3
P         ?     ?     ?
R         ?     ?     ?
F1        ?     ?     ?
21
                 Predicted
                 C1    C2    C3
Actual    C1     3     0     1
          C2     0     3     1
          C3     1     0     1

          C1     C2     C3
P         0.75   1      0.333
R         0.75   0.75   0.5
F1        0.75   0.86   0.4

Macro-F1 = (0.75 + 0.86 + 0.4)/3 = 0.67  (the average of the class-wise F1 scores)

22
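The per-class values and the macro average can be reproduced from the confusion matrix above; a short NumPy sketch (variable names are mine):

import numpy as np

# Rows = actual class, columns = predicted class (C1, C2, C3), as in the table above
cm = np.array([[3, 0, 1],
               [0, 3, 1],
               [1, 0, 1]])

precision = np.diag(cm) / cm.sum(axis=0)   # correct / total predicted, per class
recall    = np.diag(cm) / cm.sum(axis=1)   # correct / total actual, per class
f1        = 2 * precision * recall / (precision + recall)

print(precision.round(3))    # [0.75  1.    0.333]
print(recall.round(3))       # [0.75  0.75  0.5  ]
print(f1.round(2))           # [0.75  0.86  0.4 ]
print(round(f1.mean(), 2))   # Macro-F1 = 0.67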


Overfitting and Underfitting

Overfitting:
Occurs when you pay too much attention to the specifics of the training data and are not
able to generalize well.
• Often, this means that your model is fitting noise rather than whatever it is supposed to fit.
• Or it didn't have sufficient data to learn from.

Underfitting:
The learning algorithm had the opportunity to learn more from the training data, but didn't
(the model is too simple).
• Or it didn't have sufficient data to learn from.

66
67
CSCI417
Machine Intelligence

Thank you

Spring 2024

68
