
Linear Models for Classification

Discriminant Functions · Generative Models · Discriminative Models

Classification: Hand-written Digit Recognition

Example: x_i is the image of a handwritten digit (here a "3"), with one-hot target t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0).

• Each input vector is classified into one of K discrete classes
• Denote the classes by C_k
• Represent each input image as a vector x_i ∈ R^784
• The target vector is t_i ∈ {0, 1}^10 (one-hot encoding; see the small sketch below)
• Given a training set {(x_1, t_1), ..., (x_N, t_N)}, the learning problem is to construct a "good" function y(x) from these
• y : R^784 → R^10

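A tiny sketch of the one-hot target encoding used above (the helper name is mine, not from the slides):

    import numpy as np

    def one_hot(digit, num_classes=10):
        """Encode a digit label as a target vector t in {0, 1}^10."""
        t = np.zeros(num_classes)
        t[digit] = 1.0
        return t

    print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.], the slide's example digit "3"
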
Generalized Linear Models

• As in the previous chapter on linear models for regression, we will use a "linear" model for classification:

    y(x) = f(w^T x + w_0)

• This is called a generalized linear model
• f(·) is a fixed non-linear function, e.g. the threshold function (see the code sketch below)

    f(u) = 1 if u ≥ 0, 0 otherwise

• The decision boundary between classes is then a linear function of x
• We can also apply a non-linearity to x first, as with the basis functions φ_i(x) used for regression

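A minimal numpy sketch of such a generalized linear model with a threshold non-linearity (the weight values are made up for illustration):

    import numpy as np

    def step(u):
        """Threshold non-linearity f(u): 1 if u >= 0, else 0."""
        return 1.0 if u >= 0 else 0.0

    def glm_predict(x, w, w0):
        """Generalized linear model y(x) = f(w^T x + w0)."""
        return step(w @ x + w0)

    # illustrative 2-d example: the decision boundary is the line w^T x + w0 = 0
    w, w0 = np.array([1.0, -2.0]), 0.5
    print(glm_predict(np.array([3.0, 1.0]), w, w0))  # 1.0, since w^T x + w0 = 1.5 >= 0
    print(glm_predict(np.array([0.0, 2.0]), w, w0))  # 0.0, since w^T x + w0 = -3.5 < 0
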
Generalized Linear Model

[Figure: the input space (x_1, x_2) and the corresponding feature space (φ_1, φ_2) after the basis transformation.]

    y(x) = f( w^T φ(x) + w_0 )

For a two-dimensional input with two basis functions:

    y = f( w_1 φ_1(x_1, x_2) + w_2 φ_2(x_1, x_2) + w_0 )

Outline

Discriminant Functions

Generative Models

Discriminative Models

Discriminant Functions with Two Classes

[Figure: geometry of the two-class linear discriminant: the decision boundary y = 0 separates regions R_1 (y > 0) and R_2 (y < 0); w is orthogonal to the boundary, whose distance from the origin is −w_0 / ||w||.]

• Start with the 2-class problem, t_i ∈ {0, 1}
• Simple linear discriminant:

    y(x) = w^T x + w_0

  apply a threshold function to y(x) to get the classification
• The projection of x in the direction of w is w^T x / ||w||

Multiple Classes

• A linear discriminant between two classes separates them with a hyperplane
• How to use this for multiple classes?
• One-versus-the-rest method: build K − 1 classifiers, each separating C_k from all other classes
• One-versus-one method: build K(K − 1)/2 classifiers, one between each pair of classes

[Figure: both constructions leave regions of input space (marked "?") where the classification is ambiguous, e.g. where the "C_1 vs. not C_1" and "C_2 vs. not C_2" classifiers disagree.]

Multiple Classes

[Figure: decision regions R_i, R_j, R_k of the K-class linear discriminant; any point x̂ on the line segment between two points x_A and x_B lying in R_k also lies in R_k.]

• A solution is to build K linear functions:

    y_k(x) = w_k^T x + w_k0

  and assign x to class arg max_k y_k(x) (see the code sketch below)
• This gives connected, convex decision regions:

    x̂ = λ x_A + (1 − λ) x_B   ⇒   y_k(x̂) = λ y_k(x_A) + (1 − λ) y_k(x_B)

  so if y_k is the largest discriminant at both x_A and x_B, it is also the largest at x̂

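A small sketch of this decision rule (the weight values are made up for illustration):

    import numpy as np

    def classify(x, W, w0):
        """Assign x to arg max_k y_k(x), where y_k(x) = w_k^T x + w_k0.

        W: (K, D) array whose rows are the w_k; w0: (K,) biases."""
        y = W @ x + w0            # all K discriminant values at once
        return int(np.argmax(y))  # index of the winning class

    # three illustrative linear discriminants in 2-d
    W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    w0 = np.array([0.0, 0.0, 0.5])
    print(classify(np.array([2.0, 0.5]), W, w0))  # 0, since y = [2.0, 0.5, -2.0]
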
Least Squares for Classification

• How do we learn the decision boundaries (w_k, w_k0)?
• One approach is to use least squares, similar to regression
• Find W to minimize the squared error over all examples and all components of the label vector:

    E(W) = (1/2) Σ_{n=1..N} Σ_{k=1..K} ( y_k(x_n) − t_nk )²

• After some algebra, we get a solution using the pseudo-inverse, as in regression (see the code sketch below)

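A minimal sketch of that pseudo-inverse solution, assuming the inputs are stacked in an N × D matrix X and the one-hot targets in an N × K matrix T (variable names are mine, not from the slides):

    import numpy as np

    def fit_least_squares(X, T):
        """Least-squares multi-class linear discriminant via the pseudo-inverse.

        X: (N, D) inputs; T: (N, K) one-hot targets.
        Returns W_tilde of shape (D + 1, K); its first row holds the biases w_k0."""
        X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a constant feature
        return np.linalg.pinv(X_tilde) @ T                   # minimizes E(W)

    def predict_least_squares(W_tilde, X):
        X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
        return np.argmax(X_tilde @ W_tilde, axis=1)          # class = arg max_k y_k(x_n)
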
Problems with Least Squares

[Figure: a two-class dataset with the least-squares decision boundary (left), and the same dataset with additional easy points far from the boundary added (right); the added points pull the least-squares boundary away and cause misclassifications.]

• The left plot looks okay: the least-squares decision boundary is similar to the logistic regression decision boundary (more later)
• It gets worse by adding easy points?!
• Why?
• If the target value is 1, points far from the boundary will have a high value of y(x), say 10; this is a large squared error, so the boundary is moved to reduce it

More Least Squares Problems

[Figure: a three-class dataset that can easily be separated by hyperplanes, but for which least squares fails to separate the classes.]

• Easily separated by hyperplanes, but not found using least squares!
• We'll address these problems later with better models
• First, a look at a different criterion for a linear discriminant

Fisher's Linear Discriminant

• The two-class linear discriminant acts as a projection

    y = w^T x

  followed by a threshold (classify as C_1 if y ≥ −w_0)
• In which direction w should we project?
• One which separates the classes "well"

Fisher's Linear Discriminant

[Figure: the two classes projected onto the line connecting the class means (left) versus onto the Fisher direction (right); the means-only projection leaves the classes overlapping, while the Fisher projection separates them.]

• A natural idea would be to project onto the direction of the line connecting the class means
• However, this is problematic if the classes have large variance in this direction
• Fisher criterion: maximize the ratio of inter-class separation (between-class) to intra-class variance (within-class)

Math time - FLD

• Projection: y_n = w^T x_n
• Inter-class separation is the distance between the projected class means (good):

    m_k = (1/N_k) Σ_{n∈C_k} w^T x_n

• Intra-class variance (bad):

    s_k² = Σ_{n∈C_k} (y_n − m_k)²

• Fisher criterion:

    J(w) = (m_2 − m_1)² / (s_1² + s_2²)

  maximize with respect to w

Math time - FLD

    J(w) = (m_2 − m_1)² / (s_1² + s_2²) = (w^T S_B w) / (w^T S_W w)

Between-class covariance:

    S_B = (m_2 − m_1)(m_2 − m_1)^T

Within-class covariance:

    S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T

Lots of math:

    w ∝ S_W^{-1} (m_2 − m_1)

If the within-class covariance S_W is isotropic, this reduces to the class mean difference vector (see the code sketch below)

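A minimal numpy sketch of computing the Fisher direction from two sets of labelled points (array names are illustrative):

    import numpy as np

    def fisher_direction(X1, X2):
        """w proportional to S_W^{-1} (m2 - m1), for classes given as (N_k, D) arrays."""
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        # within-class scatter, as defined on the slide
        S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
        w = np.linalg.solve(S_W, m2 - m1)   # solve S_W w = (m2 - m1) rather than inverting
        return w / np.linalg.norm(w)        # only the direction matters, so normalize
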
FLD Summary

• FLD is a dimensionality reduction technique (more later in the course)
• Its criterion for choosing the projection is based on the class labels
• It still suffers from outliers (e.g. the earlier least-squares example)
• Refer to p. 191 for Fisher's discriminant with multiple classes

Perceptrons

• The term perceptron is used to refer to many neural network structures (more in Chapter 5)
• The classic type is a fixed non-linear transformation of the input, one layer of adaptive weights, and a threshold:

    y(x) = f( w^T φ(x) )

• Developed by Rosenblatt in the 1950s
• The main difference compared to the methods we've seen so far is the learning algorithm

Perceptron Learning

• Two-class problem
• For ease of notation, we will use t = 1 for class C_1 and t = −1 for class C_2
• We saw that squared error was problematic
• Instead, we'd like to minimize the number of misclassified examples
• An example is misclassified if w^T φ(x_n) t_n < 0
• Perceptron criterion:

    E_P(w) = − Σ_{n∈M} w^T φ(x_n) t_n

  where the sum runs over the set M of misclassified examples only

Perceptron Learning Algorithm

• Minimize the error function using stochastic gradient descent (gradient descent per example):

    w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φ(x_n) t_n   (update only if x_n is misclassified)

• Iterate over all training examples, changing w only when an example is misclassified (see the code sketch below)
• Guaranteed to converge if the data are linearly separable
• Will not converge if they are not
• May take many iterations
• Sensitive to initialization

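A sketch of this learning loop, assuming the fixed basis has already been applied so that phi is an (N, M) array and the targets are in {−1, +1} (names are illustrative):

    import numpy as np

    def train_perceptron(phi, t, eta=1.0, max_epochs=100):
        """Perceptron learning: w <- w + eta * phi(x_n) * t_n on each mistake.

        phi: (N, M) feature vectors; t: (N,) targets in {-1, +1}.
        Converges only if the data are linearly separable."""
        w = np.zeros(phi.shape[1])                  # note: the result depends on initialization
        for _ in range(max_epochs):
            mistakes = 0
            for x_n, t_n in zip(phi, t):
                if (w @ x_n) * t_n <= 0:            # misclassified (or on the boundary)
                    w = w + eta * x_n * t_n         # perceptron update
                    mistakes += 1
            if mistakes == 0:                       # every example correctly classified
                return w
        return w
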
Perceptron Learning Illustration

[Figure: four snapshots of the perceptron algorithm converging on data points from two classes (red and blue) in a two-dimensional feature space (φ_1, φ_2).]

The top-left plot shows the initial parameter vector w as a black arrow together with the corresponding decision boundary (black line), where the arrow points towards the decision region classified as belonging to the red class. The data point circled in green is misclassified, so its feature vector is added to the current weight vector, giving the new decision boundary shown in the top-right plot. The bottom-left plot shows the next misclassified point to be considered, indicated by the green circle; its feature vector is again added to the weight vector, giving the decision boundary shown in the bottom-right plot, for which all data points are correctly classified.

Limitations of Perceptrons

• Perceptrons can only solve linearly separable problems in feature space
• The same is true of the other models in this chapter
• The canonical example of a non-separable problem is XOR
• Real datasets can look like this too

[Figure: the XOR problem on inputs (I_1, I_2) ∈ {0, 1}²: the two classes cannot be separated by a single line.]

Outline

Discriminant Functions

Generative Models

Discriminative Models

Probabilistic Generative Models

• Up to now we've looked at learning classification by choosing parameters to minimize an error function
• We'll now develop a probabilistic approach
• With 2 classes, C_1 and C_2:

    p(C_1|x) = p(x|C_1) p(C_1) / p(x)                                          (Bayes' rule)
    p(C_1|x) = p(x|C_1) p(C_1) / ( p(x, C_1) + p(x, C_2) )                     (sum rule)
    p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) )         (product rule)

• In generative models we specify the distribution p(x|C_k) that generates the data for each class

Probabilistic Generative Models - Example

• Let's say we observe x, the current temperature
• Determine whether we are in Vancouver (C_1) or Honolulu (C_2)
• Generative model:

    p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) )

• p(x|C_1) is a distribution over typical temperatures in Vancouver, e.g. p(x|C_1) = N(x; 10, 5)
• p(x|C_2) is a distribution over typical temperatures in Honolulu, e.g. p(x|C_2) = N(x; 25, 5)
• Class priors: p(C_1) = 0.1, p(C_2) = 0.9
• p(C_1|x = 15) = 0.0484 · 0.1 / ( 0.0484 · 0.1 + 0.0108 · 0.9 ) ≈ 0.33   (reproduced in the sketch below)

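The calculation above, reproduced numerically (a minimal sketch; N(x; µ, σ) is read here as a Gaussian with mean µ and standard deviation σ = 5, which matches the densities 0.0484 and 0.0108 quoted on the slide):

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    x = 15.0
    p_x_given_vancouver = gaussian_pdf(x, mu=10.0, sigma=5.0)   # ~0.0484
    p_x_given_honolulu = gaussian_pdf(x, mu=25.0, sigma=5.0)    # ~0.0108
    prior_vancouver, prior_honolulu = 0.1, 0.9

    posterior_vancouver = (p_x_given_vancouver * prior_vancouver) / (
        p_x_given_vancouver * prior_vancouver + p_x_given_honolulu * prior_honolulu)
    print(round(posterior_vancouver, 2))   # 0.33
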
Generalized Linear Models

• We can write the classifier in another form:

    p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) )
             = 1 / ( 1 + exp(−a) )  ≡  σ(a)

    where a = ln [ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ]

• This looks like gratuitous math, but if a takes a simple form, this is another generalized linear model of the kind we have been studying
• We will see how such a simple form, a = w^T x + w_0, arises naturally

Logistic Sigmoid

[Figure: plot of the logistic sigmoid σ(a), an S-shaped curve rising from 0 to 1 as a goes from −5 to 5.]

• The function σ(a) = 1 / (1 + exp(−a)) is known as the logistic sigmoid
• It squashes the real axis down to [0, 1]
• It is continuous and differentiable
• It avoids the problems encountered with least-squares fitting, where points that are "too correct" are penalized (more later)

Multi-class Extension

• There is a generalization of the logistic sigmoid to K > 2 classes:

    p(C_k|x) = p(x|C_k) p(C_k) / Σ_j p(x|C_j) p(C_j)
             = exp(a_k) / Σ_j exp(a_j)

    where a_k = ln [ p(x|C_k) p(C_k) ]

• This is also known as the softmax function (see the code sketch below)
• If some a_k >> a_j (for all j ≠ k), then p(C_k|x) goes to 1 and p(C_j|x) goes to 0

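A small sketch of the softmax posterior (the a_k values are arbitrary, chosen only to show the behaviour described above):

    import numpy as np

    def softmax(a):
        """p(C_k|x) = exp(a_k) / sum_j exp(a_j), computed stably by shifting by max(a)."""
        a = a - np.max(a)          # does not change the result, avoids overflow
        e = np.exp(a)
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))    # moderate differences -> soft assignment
    print(softmax(np.array([50.0, 1.0, 0.1])))   # a_k >> a_j -> p(C_k|x) close to 1
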
Gaussian Class-Conditional Densities

• Back to that a in the logistic sigmoid for 2 classes
• Let's assume the class-conditional densities p(x|C_k) are Gaussians with the same covariance matrix Σ:

    p(x|C_k) = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) · exp( −(1/2) (x − µ_k)^T Σ^{-1} (x − µ_k) )

• Then a takes a simple form:

    a = ln [ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ] = w^T x + w_0

• Note that the quadratic terms x^T Σ^{-1} x cancel because the covariance is shared

Maximum Likelihood Learning

• We can fit the parameters of this model using maximum likelihood
• The parameters are µ_1, µ_2, Σ, p(C_1) ≡ π, p(C_2) ≡ 1 − π
• Refer to them collectively as θ
• For a datapoint x_n from class C_1 (t_n = 1):

    p(x_n, C_1) = p(C_1) p(x_n|C_1) = π N(x_n | µ_1, Σ)

• For a datapoint x_n from class C_2 (t_n = 0):

    p(x_n, C_2) = p(C_2) p(x_n|C_2) = (1 − π) N(x_n | µ_2, Σ)

Maximum Likelihood Learning

• The likelihood of the training data is:

    p(t | π, µ_1, µ_2, Σ) = Π_{n=1..N} [ π N(x_n|µ_1, Σ) ]^{t_n} [ (1 − π) N(x_n|µ_2, Σ) ]^{1 − t_n}

• As usual, ln is our friend:

    l(t; θ) = Σ_{n=1..N} { t_n ln π + (1 − t_n) ln(1 − π) }                            (terms in π)
            + Σ_{n=1..N} { t_n ln N(x_n|µ_1, Σ) + (1 − t_n) ln N(x_n|µ_2, Σ) }         (terms in µ_1, µ_2, Σ)

• Maximize with respect to each group of parameters separately

Maximum Likelihood Learning - Class Priors

• Maximization with respect to the class prior parameter π is straightforward:

    ∂l(t; θ)/∂π = Σ_{n=1..N} [ t_n/π − (1 − t_n)/(1 − π) ] = 0   ⇒   π = N_1 / (N_1 + N_2)

• N_1 and N_2 are the numbers of training points in each class
• The prior is simply the fraction of points in each class

Maximum Likelihood Learning - Gaussian Parameters

• The other parameters can be found in the same fashion
• Class means:

    µ_1 = (1/N_1) Σ_{n=1..N} t_n x_n
    µ_2 = (1/N_2) Σ_{n=1..N} (1 − t_n) x_n

  i.e. the means of the training examples from each class
• Shared covariance matrix:

    Σ = (N_1/N) · (1/N_1) Σ_{n∈C_1} (x_n − µ_1)(x_n − µ_1)^T + (N_2/N) · (1/N_2) Σ_{n∈C_2} (x_n − µ_2)(x_n − µ_2)^T

  i.e. a weighted average of the two class covariances (see the code sketch below)

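A minimal sketch that puts these ML estimates together and evaluates the posterior through Bayes' rule with the fitted Gaussians (function and variable names are mine, not from the slides):

    import numpy as np

    def fit_shared_cov_gaussians(X, t):
        """ML estimates for the two-class generative model with a shared covariance.

        X: (N, D) inputs; t: (N,) labels with t_n = 1 for class C1 and 0 for C2."""
        X1, X2 = X[t == 1], X[t == 0]
        N1, N2 = len(X1), len(X2)
        pi = N1 / (N1 + N2)                           # prior p(C1)
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # class means
        S1 = (X1 - mu1).T @ (X1 - mu1) / N1
        S2 = (X2 - mu2).T @ (X2 - mu2) / N2
        Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)       # weighted average of class covariances
        return pi, mu1, mu2, Sigma

    def gaussian_density(x, mu, Sigma):
        D = len(mu)
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)
        return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

    def posterior_c1(x, pi, mu1, mu2, Sigma):
        """p(C1|x) by Bayes' rule with the fitted Gaussian class-conditionals."""
        num = pi * gaussian_density(x, mu1, Sigma)
        return num / (num + (1 - pi) * gaussian_density(x, mu2, Sigma))
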
Gaussian with Different Covariances

[Figure: two Gaussian class-conditional densities (blue and red) with different covariance matrices and the resulting decision boundary, which is no longer a straight line.]

    a = ln [ p(x|C_b) p(C_b) / ( p(x|C_r) p(C_r) ) ]
      = ln p(x|C_b) − ln p(x|C_r) + ln p(C_b) − ln p(C_r)
      = −(1/2) (x − µ_b)^T Σ_b^{-1} (x − µ_b) + (1/2) (x − µ_r)^T Σ_r^{-1} (x − µ_r) + const.

• When the covariances differ, the quadratic terms no longer cancel, so the decision boundary is a quadratic function of x

Probabilistic Generative Models Summary

• The posterior is given by a generalized linear model: a logistic sigmoid for K = 2 classes, a softmax for K > 2 classes
• Fitting Gaussians using the ML criterion is sensitive to outliers

Outline

Discriminant Functions

Generative Models

Discriminative Models

Probabilistic Discriminative Models

• The generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian)
• This resulted in a logistic sigmoid of a linear function of x
• A discriminative model instead uses that functional form explicitly:

    p(C_1|x) = σ( w^T x + w_0 ) = 1 / ( 1 + exp(−(w^T x + w_0)) )

  and finds w directly
• For the generative model we had 2M + M(M + 1)/2 + 1 parameters, where M is the dimensionality of x
• The discriminative model has only M + 1 parameters

Generative vs. Discriminative

Generative models:
• Can generate synthetic example data
• Perhaps accurate classification is equivalent to accurate synthesis (e.g. vision and graphics)
• Tend to have more parameters
• Require a good model of the class distributions

Discriminative models:
• Only usable for classification
• Don't solve a harder problem than you need to
• Tend to have fewer parameters
• Require a good model of the decision boundary

Logistic Regression

• This time there is no closed-form solution, since y_n = σ(w^T x_n)
• Could use (stochastic) gradient descent
• But there's a better iterative technique

Iterative Reweighted Least Squares

• Replace X by Φ(X) and y by t in (5.11) (see the code sketch below)

98
