
03 Classification Handout

This document summarizes a lecture on linear classification. It introduces classification problems and discusses modeling classification as regression by assigning categorical labels numerical values. It describes how linear classifiers use a decision boundary to separate classes and discusses learning classifiers by minimizing loss functions. Key concepts covered include decision boundaries, loss functions, and metrics for evaluating classification models like recall and precision.


CSC 411: Lecture 03: Linear Classification

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto

Zemel, Urtasun, Fidler (UofT) CSC 411: 03-Classification 1 / 24


Examples of Problems

What digit is this?


How can I predict this? What are my input features?
Regression

What do all these problems have in common?

Categorical outputs, called labels


(eg, yes/no, dog/cat/person/other)

Assigning each input vector to one of a finite number of labels is called


classification

Binary classification: two possible labels (eg, yes/no, 0/1, cat/dog)

Multi-class classification: multiple possible labels

We will first look at binary problems, and discuss multi-class problems later
in class



Today

Linear Classification (binary)


Key Concepts:
- Classification as regression
- Decision boundary
- Loss functions
- Metrics to evaluate classification



Classification vs Regression

We are interested in mapping the input x ∈ X to a label t ∈ Y


In regression typically Y = ℝ
Now Y is categorical



Classification as Regression

Can we do this task using what we have learned in previous lectures?


Simple hack: Ignore that the output is categorical!
Suppose we have a binary problem, t ∈ {−1, 1}
Assuming the standard model used for (linear) regression
y (x) = f (x, w) = wT x

How can we obtain w?


Use least squares, w = (XT X)−1 XT t. How are X and t constructed?
Which loss are we minimizing? Does it make sense?
ℓ_square (w, t) = (1/N) Σ_{n=1}^{N} ( t^(n) − wT x^(n) )^2

How do I compute a label for a new example? Let’s see an example
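As a sketch of this "simple hack" (the toy 1D data and the use of NumPy here are my additions, not from the lecture):

```python
import numpy as np

# Toy binary data: 1-D inputs with labels t in {-1, +1}.
# The first column of X is the constant bias feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.5], [1.0, 2.5]])
t = np.array([-1.0, -1.0, 1.0, 1.0])

# Least-squares solution w = (X^T X)^{-1} X^T t, computed via lstsq for stability.
w, *_ = np.linalg.lstsq(X, t, rcond=None)

# Label a new example by thresholding the regression output at zero.
x_new = np.array([1.0, 0.8])
y_new = np.sign(w @ x_new)
print(y_new)  # 1.0 -> the point falls on the positive side
```

Fitting the regression gives real-valued outputs; the sign of the output supplies the label for a new example.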


Classification as Regression

A 1D example (input x is one-dimensional):

The colors indicate labels (a blue plus denotes that t (i) is from the first
class, red circle that t (i) is from the second class)
Figure from G. Shakhnarovich



Decision Rules

Our classifier has the form


f (x, w) = w0 + wT x

A reasonable decision rule is


y =  1   if f (x, w) ≥ 0
    −1   otherwise

How can I mathematically write this rule?


y (x) = sign(w0 + wT x)
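A minimal sketch of this decision rule in code (the 2-D weight values here are hypothetical, chosen only for illustration):

```python
import numpy as np

def predict(x, w, w0):
    """Linear classifier: return +1 if w0 + w.x >= 0, else -1."""
    return 1 if w0 + np.dot(w, x) >= 0 else -1

# Hypothetical weights: the boundary is the line x1 + x2 - 1 = 0.
w, w0 = np.array([1.0, 1.0]), -1.0
print(predict(np.array([2.0, 0.5]), w, w0))  # 1  (positive side of the line)
print(predict(np.array([0.0, 0.0]), w, w0))  # -1 (negative side of the line)
```

Note the rule assigns +1 exactly on the boundary, matching the "≥ 0" case above.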

What does this function look like?


Decision Rules
A 1D example:

[Figure: 1D plot of w0 + wT x against x, crossing zero at the threshold; ŷ = +1 where the function is ≥ 0, ŷ = −1 where it is negative. Figure from G. Shakhnarovich]

How can I mathematically write this rule?

y (x) = sign(w0 + wT x)

This specifies a linear classifier: it has a linear boundary (hyperplane)

w0 + wT x = 0

which separates the space into two "half-spaces"



Example in 1D

The linear classifier has a linear boundary (hyperplane)

w0 + wT x = 0

which separates the space into two "half-spaces"


In 1D this is simply a threshold



Example in 2D

The linear classifier has a linear boundary (hyperplane)


w0 + wT x = 0
which separates the space into two "half-spaces"
In 2D this is a line
Example in 3D

The linear classifier has a linear boundary (hyperplane)


w0 + wT x = 0
which separates the space into two "half-spaces"
In 3D this is a plane
What about higher-dimensional spaces?
Geometry

wT x = 0 is a line passing through the origin and orthogonal to w


wT x + w0 = 0 shifts it by w0

Figure from G. Shakhnarovich
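One useful consequence of this geometry (a standard fact, not stated explicitly on the slide) is that the signed distance from a point x to the boundary is (w0 + wT x) / ‖w‖; a sketch with hypothetical weights:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance from x to the hyperplane w0 + w.x = 0.

    Positive on the side w points toward, negative on the other side.
    """
    return (w0 + np.dot(w, x)) / np.linalg.norm(w)

# Hypothetical boundary 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([1.0, 2.0]), w, w0))  # (3 + 8 - 5) / 5 = 1.2
```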



Learning Linear Classifiers

Learning consists of estimating a "good" decision boundary


We need to find w (direction) and w0 (location) of the boundary
What does “good” mean?
Is this boundary good?

We need a criterion that tells us how to select the parameters


Do you know any?



Loss functions

Classifying using a linear decision boundary reduces the data dimension to 1

y (x) = sign(w0 + wT x)

What is the cost of being wrong?


Loss function: L(y, t) is the loss incurred for predicting y when the correct
answer is t
For medical diagnosis: for a diabetes screening test, is it better to have false
positives or false negatives?
For movie ratings: the "truth" is that Alice thinks E.T. is worthy of a 4.
How bad is it to predict a 5? How about a 2?



Loss functions

A possible loss to minimize is the zero/one loss


L(y (x), t) =  0   if y (x) = t
               1   if y (x) ≠ t

Is this minimization easy to do? Why?
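A sketch of the zero/one loss on toy arrays (my own, not from the lecture). Averaged over a dataset it is just the error rate; because it is piecewise constant, its gradient is zero almost everywhere, which is what makes direct minimization hard:

```python
import numpy as np

def zero_one_loss(y, t):
    """0 if the prediction matches the target, 1 otherwise."""
    return 0 if y == t else 1

# Over a batch, the average 0-1 loss is the error rate.
y = np.array([1, -1, 1, 1])   # predictions
t = np.array([1, 1, 1, -1])   # targets
print(np.mean(y != t))  # 0.5 -> two of the four predictions are wrong
```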



Other Loss functions

Zero/one loss for a classifier


L0−1 (y (x), t) =  0   if y (x) = t
                   1   if y (x) ≠ t

Asymmetric Binary Loss

LABL (y (x), t) =  α   if y (x) = 1 ∧ t = 0
                   β   if y (x) = 0 ∧ t = 1
                   0   if y (x) = t

Squared (quadratic) loss


Lsquared (y (x), t) = (t − y (x))2

Absolute Error
Labsolute (y (x), t) = |t − y (x)|
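A quick comparison of these losses on a hypothetical example (the rating values and the α = 1, β = 10 weights are illustrative choices, not from the lecture):

```python
# Predicted vs. true movie rating under the squared and absolute losses.
t, y = 4.0, 2.0
print((t - y) ** 2)  # squared loss: 4.0
print(abs(t - y))    # absolute loss: 2.0

# Asymmetric binary loss with alpha=1 (false positive) and beta=10
# (false negative), e.g. a screening test where missing a case is
# ten times worse than a false alarm.
def abl(y, t, alpha=1.0, beta=10.0):
    if y == t:
        return 0.0
    return alpha if (y == 1 and t == 0) else beta

print(abl(1, 0))  # 1.0  -> false positive
print(abl(0, 1))  # 10.0 -> false negative
```

The squared loss penalizes the 2-point error four times as much as the absolute loss penalizes it relative to a 1-point error, which matters when ratings far from the truth should be punished harshly.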
More Complex Loss Functions

What if the movie predictions are used for rankings? Now the predicted
ratings don’t matter, just the order that they imply.
In what order does Alice prefer E.T., Amelie and Titanic?
Possibilities:
- 0-1 loss on the winner
- Permutation distance
- Accuracy of top K movies



Can we always separate the classes?

If we can separate the classes, the problem is linearly separable



Can we always separate the classes?

Causes of imperfect separation:


Model is too simple
Noise in the inputs (i.e., data attributes)
Simple features that do not account for all variations
Errors in data targets (mis-labelings)

Should we make the model complex enough to have perfect separation in the
training data?



Metrics
How to evaluate how good my classifier is? How is it doing on dog vs no-dog?



Metrics

How to evaluate how good my classifier is?


Recall: the fraction of relevant instances that are retrieved

R = TP / (TP + FN) = TP / (all ground-truth instances)

Precision: the fraction of retrieved instances that are relevant

P = TP / (TP + FP) = TP / (all predicted positives)

F1 score: harmonic mean of precision and recall

F1 = 2 · (P · R) / (P + R)
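A sketch computing these metrics on hypothetical predictions (the arrays are my own toy example):

```python
import numpy as np

# Hypothetical predictions and ground truth for a binary detector (1 = positive).
y = np.array([1, 1, 0, 1, 0, 0])  # predicted labels
t = np.array([1, 0, 0, 1, 1, 0])  # true labels

TP = np.sum((y == 1) & (t == 1))  # true positives:  2
FP = np.sum((y == 1) & (t == 0))  # false positives: 1
FN = np.sum((y == 0) & (t == 1))  # false negatives: 1

P = TP / (TP + FP)        # precision = 2/3
R = TP / (TP + FN)        # recall    = 2/3
F1 = 2 * P * R / (P + R)  # harmonic mean; here also 2/3 since P = R
print(P, R, F1)
```

When precision and recall are equal, the F1 score equals both; otherwise it sits closer to the smaller of the two, which is why it penalizes lopsided classifiers.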



More on Metrics
How to evaluate how good my classifier is?
Precision: the fraction of retrieved instances that are relevant
Recall: the fraction of relevant instances that are retrieved

Precision-recall curve

Average Precision (AP): area under the precision-recall curve



Metrics vs Loss

Metrics on a dataset are what we care about (performance)


We typically cannot directly optimize for the metrics
Our loss function should reflect the problem we are solving. We then hope it
will yield models that will do well on our dataset

