
Linear Models for Classification

Discriminant Functions · Generative Models · Discriminative Models

Classification: Hand-written Digit Recognition

Example: x_i is the image of a handwritten digit (here a "3"), with one-hot target t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0).

• Each input vector is classified into one of K discrete classes
• Denote the classes by C_k
• Represent each input image as a vector x_i ∈ R^784
• The target vector is t_i ∈ {0, 1}^10 (one-hot encoding; see the small sketch below)
• Given a training set {(x_1, t_1), ..., (x_N, t_N)}, the learning problem is to construct a "good" function y(x) from these
• y : R^784 → R^10

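A tiny sketch of the one-hot target encoding used above (the helper name is mine, not from the slides):

    import numpy as np

    def one_hot(digit, num_classes=10):
        """Encode a digit label as a target vector t in {0, 1}^10."""
        t = np.zeros(num_classes)
        t[digit] = 1.0
        return t

    print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.], the slide's example digit "3"
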
Generalized Linear Models

• As in the previous chapter on linear models for regression, we will use a "linear" model for classification:

    y(x) = f(w^T x + w_0)

• This is called a generalized linear model
• f(·) is a fixed non-linear function, e.g. the threshold function (see the code sketch below)

    f(u) = 1 if u ≥ 0, 0 otherwise

• The decision boundary between classes is then a linear function of x
• We can also apply a non-linearity to x first, as with the basis functions φ_i(x) used for regression

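A minimal numpy sketch of such a generalized linear model with a threshold non-linearity (the weight values are made up for illustration):

    import numpy as np

    def step(u):
        """Threshold non-linearity f(u): 1 if u >= 0, else 0."""
        return 1.0 if u >= 0 else 0.0

    def glm_predict(x, w, w0):
        """Generalized linear model y(x) = f(w^T x + w0)."""
        return step(w @ x + w0)

    # illustrative 2-d example: the decision boundary is the line w^T x + w0 = 0
    w, w0 = np.array([1.0, -2.0]), 0.5
    print(glm_predict(np.array([3.0, 1.0]), w, w0))  # 1.0, since w^T x + w0 = 1.5 >= 0
    print(glm_predict(np.array([0.0, 2.0]), w, w0))  # 0.0, since w^T x + w0 = -3.5 < 0
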
Generalized Linear Model

[Figure: the input space (x_1, x_2) and the corresponding feature space (φ_1, φ_2) after the basis transformation.]

    y(x) = f( w^T φ(x) + w_0 )

For a two-dimensional input with two basis functions:

    y = f( w_1 φ_1(x_1, x_2) + w_2 φ_2(x_1, x_2) + w_0 )

Outline

Discriminant Functions

Generative Models

Discriminative Models

Discriminant Functions with Two Classes

[Figure: geometry of the two-class linear discriminant: the decision boundary y = 0 separates regions R_1 (y > 0) and R_2 (y < 0); w is orthogonal to the boundary, whose distance from the origin is −w_0 / ||w||.]

• Start with the 2-class problem, t_i ∈ {0, 1}
• Simple linear discriminant:

    y(x) = w^T x + w_0

  apply a threshold function to y(x) to get the classification
• The projection of x in the direction of w is w^T x / ||w||

Multiple Classes

• A linear discriminant between two classes separates them with a hyperplane
• How to use this for multiple classes?
• One-versus-the-rest method: build K − 1 classifiers, each separating C_k from all other classes
• One-versus-one method: build K(K − 1)/2 classifiers, one between each pair of classes

[Figure: both constructions leave regions of input space (marked "?") where the classification is ambiguous, e.g. where the "C_1 vs. not C_1" and "C_2 vs. not C_2" classifiers disagree.]

Multiple Classes

[Figure: decision regions R_i, R_j, R_k of the K-class linear discriminant; any point x̂ on the line segment between two points x_A and x_B lying in R_k also lies in R_k.]

• A solution is to build K linear functions:

    y_k(x) = w_k^T x + w_k0

  and assign x to class arg max_k y_k(x) (see the code sketch below)
• This gives connected, convex decision regions:

    x̂ = λ x_A + (1 − λ) x_B   ⇒   y_k(x̂) = λ y_k(x_A) + (1 − λ) y_k(x_B)

  so if y_k is the largest discriminant at both x_A and x_B, it is also the largest at x̂

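A small sketch of this decision rule (the weight values are made up for illustration):

    import numpy as np

    def classify(x, W, w0):
        """Assign x to arg max_k y_k(x), where y_k(x) = w_k^T x + w_k0.

        W: (K, D) array whose rows are the w_k; w0: (K,) biases."""
        y = W @ x + w0            # all K discriminant values at once
        return int(np.argmax(y))  # index of the winning class

    # three illustrative linear discriminants in 2-d
    W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    w0 = np.array([0.0, 0.0, 0.5])
    print(classify(np.array([2.0, 0.5]), W, w0))  # 0, since y = [2.0, 0.5, -2.0]
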
Least Squares for Classification

• How do we learn the decision boundaries (w_k, w_k0)?
• One approach is to use least squares, similar to regression
• Find W to minimize the squared error over all examples and all components of the label vector:

    E(W) = (1/2) Σ_{n=1..N} Σ_{k=1..K} ( y_k(x_n) − t_nk )²

• After some algebra, we get a solution using the pseudo-inverse, as in regression (see the code sketch below)

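A minimal sketch of that pseudo-inverse solution, assuming the inputs are stacked in an N × D matrix X and the one-hot targets in an N × K matrix T (variable names are mine, not from the slides):

    import numpy as np

    def fit_least_squares(X, T):
        """Least-squares multi-class linear discriminant via the pseudo-inverse.

        X: (N, D) inputs; T: (N, K) one-hot targets.
        Returns W_tilde of shape (D + 1, K); its first row holds the biases w_k0."""
        X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a constant feature
        return np.linalg.pinv(X_tilde) @ T                   # minimizes E(W)

    def predict_least_squares(W_tilde, X):
        X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
        return np.argmax(X_tilde @ W_tilde, axis=1)          # class = arg max_k y_k(x_n)
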
Problems with Least Squares

[Figure: a two-class dataset with the least-squares decision boundary (left), and the same dataset with additional easy points far from the boundary added (right); the added points pull the least-squares boundary away and cause misclassifications.]

• The left plot looks okay: the least-squares decision boundary is similar to the logistic regression decision boundary (more later)
• It gets worse by adding easy points?!
• Why?
• If the target value is 1, points far from the boundary will have a high value of y(x), say 10; this is a large squared error, so the boundary is moved to reduce it

More Least Squares Problems

[Figure: a three-class dataset that can easily be separated by hyperplanes, but for which least squares fails to separate the classes.]

• Easily separated by hyperplanes, but not found using least squares!
• We'll address these problems later with better models
• First, a look at a different criterion for a linear discriminant

Fisher's Linear Discriminant

• The two-class linear discriminant acts as a projection

    y = w^T x

  followed by a threshold (classify as C_1 if y ≥ −w_0)
• In which direction w should we project?
• One which separates the classes "well"

Fisher's Linear Discriminant

[Figure: the two classes projected onto the line connecting the class means (left) versus onto the Fisher direction (right); the means-only projection leaves the classes overlapping, while the Fisher projection separates them.]

• A natural idea would be to project onto the direction of the line connecting the class means
• However, this is problematic if the classes have large variance in this direction
• Fisher criterion: maximize the ratio of inter-class separation (between-class) to intra-class variance (within-class)

Math time - FLD

• Projection: y_n = w^T x_n
• Inter-class separation is the distance between the projected class means (good):

    m_k = (1/N_k) Σ_{n∈C_k} w^T x_n

• Intra-class variance (bad):

    s_k² = Σ_{n∈C_k} (y_n − m_k)²

• Fisher criterion:

    J(w) = (m_2 − m_1)² / (s_1² + s_2²)

  maximize with respect to w

Math time - FLD

    J(w) = (m_2 − m_1)² / (s_1² + s_2²) = (w^T S_B w) / (w^T S_W w)

Between-class covariance:

    S_B = (m_2 − m_1)(m_2 − m_1)^T

Within-class covariance:

    S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T

Lots of math:

    w ∝ S_W^{-1} (m_2 − m_1)

If the within-class covariance S_W is isotropic, this reduces to the class mean difference vector (see the code sketch below)

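A minimal numpy sketch of computing the Fisher direction from two sets of labelled points (array names are illustrative):

    import numpy as np

    def fisher_direction(X1, X2):
        """w proportional to S_W^{-1} (m2 - m1), for classes given as (N_k, D) arrays."""
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        # within-class scatter, as defined on the slide
        S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
        w = np.linalg.solve(S_W, m2 - m1)   # solve S_W w = (m2 - m1) rather than inverting
        return w / np.linalg.norm(w)        # only the direction matters, so normalize
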
FLD Summary

• FLD is a dimensionality reduction technique (more later in the course)
• Its criterion for choosing the projection is based on the class labels
• It still suffers from outliers (e.g. the earlier least-squares example)
• Refer to p. 191 for Fisher's discriminant with multiple classes

Perceptrons

• The term perceptron is used to refer to many neural network structures (more in Chapter 5)
• The classic type is a fixed non-linear transformation of the input, one layer of adaptive weights, and a threshold:

    y(x) = f( w^T φ(x) )

• Developed by Rosenblatt in the 1950s
• The main difference compared to the methods we've seen so far is the learning algorithm

Perceptron Learning

• Two-class problem
• For ease of notation, we will use t = 1 for class C_1 and t = −1 for class C_2
• We saw that squared error was problematic
• Instead, we'd like to minimize the number of misclassified examples
• An example is misclassified if w^T φ(x_n) t_n < 0
• Perceptron criterion:

    E_P(w) = − Σ_{n∈M} w^T φ(x_n) t_n

  where the sum runs over the set M of misclassified examples only

Perceptron Learning Algorithm

• Minimize the error function using stochastic gradient descent (gradient descent per example):

    w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φ(x_n) t_n   (update only if x_n is misclassified)

• Iterate over all training examples, changing w only when an example is misclassified (see the code sketch below)
• Guaranteed to converge if the data are linearly separable
• Will not converge if they are not
• May take many iterations
• Sensitive to initialization

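A sketch of this learning loop, assuming the fixed basis has already been applied so that phi is an (N, M) array and the targets are in {−1, +1} (names are illustrative):

    import numpy as np

    def train_perceptron(phi, t, eta=1.0, max_epochs=100):
        """Perceptron learning: w <- w + eta * phi(x_n) * t_n on each mistake.

        phi: (N, M) feature vectors; t: (N,) targets in {-1, +1}.
        Converges only if the data are linearly separable."""
        w = np.zeros(phi.shape[1])                  # note: the result depends on initialization
        for _ in range(max_epochs):
            mistakes = 0
            for x_n, t_n in zip(phi, t):
                if (w @ x_n) * t_n <= 0:            # misclassified (or on the boundary)
                    w = w + eta * x_n * t_n         # perceptron update
                    mistakes += 1
            if mistakes == 0:                       # every example correctly classified
                return w
        return w
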
Perceptron Learning Illustration

[Figure: four snapshots of the perceptron algorithm converging on data points from two classes (red and blue) in a two-dimensional feature space (φ_1, φ_2).]

The top-left plot shows the initial parameter vector w as a black arrow together with the corresponding decision boundary (black line), where the arrow points towards the decision region classified as belonging to the red class. The data point circled in green is misclassified, so its feature vector is added to the current weight vector, giving the new decision boundary shown in the top-right plot. The bottom-left plot shows the next misclassified point to be considered, indicated by the green circle; its feature vector is again added to the weight vector, giving the decision boundary shown in the bottom-right plot, for which all data points are correctly classified.

Limitations of Perceptrons

• Perceptrons can only solve linearly separable problems in feature space
• The same is true of the other models in this chapter
• The canonical example of a non-separable problem is XOR
• Real datasets can look like this too

[Figure: the XOR problem on inputs (I_1, I_2) ∈ {0, 1}²: the two classes cannot be separated by a single line.]

Outline

Discriminant Functions

Generative Models

Discriminative Models

Probabilistic Generative Models

• Up to now we've looked at learning classification by choosing parameters to minimize an error function
• We'll now develop a probabilistic approach
• With 2 classes, C_1 and C_2:

    p(C_1|x) = p(x|C_1) p(C_1) / p(x)                                          (Bayes' rule)
    p(C_1|x) = p(x|C_1) p(C_1) / ( p(x, C_1) + p(x, C_2) )                     (sum rule)
    p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) )         (product rule)

• In generative models we specify the distribution p(x|C_k) that generates the data for each class

Probabilistic Generative Models - Example

• Let's say we observe x, the current temperature
• Determine whether we are in Vancouver (C_1) or Honolulu (C_2)
• Generative model:

    p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) )

• p(x|C_1) is a distribution over typical temperatures in Vancouver, e.g. p(x|C_1) = N(x; 10, 5)
• p(x|C_2) is a distribution over typical temperatures in Honolulu, e.g. p(x|C_2) = N(x; 25, 5)
• Class priors: p(C_1) = 0.1, p(C_2) = 0.9
• p(C_1|x = 15) = 0.0484 · 0.1 / ( 0.0484 · 0.1 + 0.0108 · 0.9 ) ≈ 0.33   (reproduced in the sketch below)

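The calculation above, reproduced numerically (a minimal sketch; N(x; µ, σ) is read here as a Gaussian with mean µ and standard deviation σ = 5, which matches the densities 0.0484 and 0.0108 quoted on the slide):

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    x = 15.0
    p_x_given_vancouver = gaussian_pdf(x, mu=10.0, sigma=5.0)   # ~0.0484
    p_x_given_honolulu = gaussian_pdf(x, mu=25.0, sigma=5.0)    # ~0.0108
    prior_vancouver, prior_honolulu = 0.1, 0.9

    posterior_vancouver = (p_x_given_vancouver * prior_vancouver) / (
        p_x_given_vancouver * prior_vancouver + p_x_given_honolulu * prior_honolulu)
    print(round(posterior_vancouver, 2))   # 0.33
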
Generalized Linear Models

• We can write the classifier in another form:

    p(C_1|x) = p(x|C_1) p(C_1) / ( p(x|C_1) p(C_1) + p(x|C_2) p(C_2) )
             = 1 / ( 1 + exp(−a) )  ≡  σ(a)

    where a = ln [ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ]

• This looks like gratuitous math, but if a takes a simple form, this is another generalized linear model of the kind we have been studying
• We will see how such a simple form, a = w^T x + w_0, arises naturally

Logistic Sigmoid

[Figure: plot of the logistic sigmoid σ(a), an S-shaped curve rising from 0 to 1 as a goes from −5 to 5.]

• The function σ(a) = 1 / (1 + exp(−a)) is known as the logistic sigmoid
• It squashes the real axis down to [0, 1]
• It is continuous and differentiable
• It avoids the problems encountered with least-squares fitting, where points that are "too correct" are penalized (more later)

Multi-class Extension

• There is a generalization of the logistic sigmoid to K > 2 classes:

    p(C_k|x) = p(x|C_k) p(C_k) / Σ_j p(x|C_j) p(C_j)
             = exp(a_k) / Σ_j exp(a_j)

    where a_k = ln [ p(x|C_k) p(C_k) ]

• This is also known as the softmax function (see the code sketch below)
• If some a_k >> a_j (for all j ≠ k), then p(C_k|x) goes to 1 and p(C_j|x) goes to 0

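A small sketch of the softmax posterior (the a_k values are arbitrary, chosen only to show the behaviour described above):

    import numpy as np

    def softmax(a):
        """p(C_k|x) = exp(a_k) / sum_j exp(a_j), computed stably by shifting by max(a)."""
        a = a - np.max(a)          # does not change the result, avoids overflow
        e = np.exp(a)
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))    # moderate differences -> soft assignment
    print(softmax(np.array([50.0, 1.0, 0.1])))   # a_k >> a_j -> p(C_k|x) close to 1
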
Gaussian Class-Conditional Densities

• Back to that a in the logistic sigmoid for 2 classes
• Let's assume the class-conditional densities p(x|C_k) are Gaussians with the same covariance matrix Σ:

    p(x|C_k) = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) · exp( −(1/2) (x − µ_k)^T Σ^{-1} (x − µ_k) )

• Then a takes a simple form:

    a = ln [ p(x|C_1) p(C_1) / ( p(x|C_2) p(C_2) ) ] = w^T x + w_0

• Note that the quadratic terms x^T Σ^{-1} x cancel because the covariance is shared

Maximum Likelihood Learning

• We can fit the parameters of this model using maximum likelihood
• The parameters are µ_1, µ_2, Σ, p(C_1) ≡ π, p(C_2) ≡ 1 − π
• Refer to them collectively as θ
• For a datapoint x_n from class C_1 (t_n = 1):

    p(x_n, C_1) = p(C_1) p(x_n|C_1) = π N(x_n | µ_1, Σ)

• For a datapoint x_n from class C_2 (t_n = 0):

    p(x_n, C_2) = p(C_2) p(x_n|C_2) = (1 − π) N(x_n | µ_2, Σ)

Maximum Likelihood Learning

• The likelihood of the training data is:

    p(t | π, µ_1, µ_2, Σ) = Π_{n=1..N} [ π N(x_n|µ_1, Σ) ]^{t_n} [ (1 − π) N(x_n|µ_2, Σ) ]^{1 − t_n}

• As usual, ln is our friend:

    l(t; θ) = Σ_{n=1..N} { t_n ln π + (1 − t_n) ln(1 − π) }                            (terms in π)
            + Σ_{n=1..N} { t_n ln N(x_n|µ_1, Σ) + (1 − t_n) ln N(x_n|µ_2, Σ) }         (terms in µ_1, µ_2, Σ)

• Maximize with respect to each group of parameters separately

Maximum Likelihood Learning - Class Priors

• Maximization with respect to the class prior parameter π is straightforward:

    ∂l(t; θ)/∂π = Σ_{n=1..N} [ t_n/π − (1 − t_n)/(1 − π) ] = 0   ⇒   π = N_1 / (N_1 + N_2)

• N_1 and N_2 are the numbers of training points in each class
• The prior is simply the fraction of points in each class

Maximum Likelihood Learning - Gaussian Parameters

• The other parameters can be found in the same fashion
• Class means:

    µ_1 = (1/N_1) Σ_{n=1..N} t_n x_n
    µ_2 = (1/N_2) Σ_{n=1..N} (1 − t_n) x_n

  i.e. the means of the training examples from each class
• Shared covariance matrix:

    Σ = (N_1/N) · (1/N_1) Σ_{n∈C_1} (x_n − µ_1)(x_n − µ_1)^T + (N_2/N) · (1/N_2) Σ_{n∈C_2} (x_n − µ_2)(x_n − µ_2)^T

  i.e. a weighted average of the two class covariances (see the code sketch below)

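A minimal sketch that puts these ML estimates together and evaluates the posterior through Bayes' rule with the fitted Gaussians (function and variable names are mine, not from the slides):

    import numpy as np

    def fit_shared_cov_gaussians(X, t):
        """ML estimates for the two-class generative model with a shared covariance.

        X: (N, D) inputs; t: (N,) labels with t_n = 1 for class C1 and 0 for C2."""
        X1, X2 = X[t == 1], X[t == 0]
        N1, N2 = len(X1), len(X2)
        pi = N1 / (N1 + N2)                           # prior p(C1)
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # class means
        S1 = (X1 - mu1).T @ (X1 - mu1) / N1
        S2 = (X2 - mu2).T @ (X2 - mu2) / N2
        Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)       # weighted average of class covariances
        return pi, mu1, mu2, Sigma

    def gaussian_density(x, mu, Sigma):
        D = len(mu)
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)
        return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

    def posterior_c1(x, pi, mu1, mu2, Sigma):
        """p(C1|x) by Bayes' rule with the fitted Gaussian class-conditionals."""
        num = pi * gaussian_density(x, mu1, Sigma)
        return num / (num + (1 - pi) * gaussian_density(x, mu2, Sigma))
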
Gaussian with Different Covariances

[Figure: two Gaussian class-conditional densities (blue and red) with different covariance matrices and the resulting decision boundary, which is no longer a straight line.]

    a = ln [ p(x|C_b) p(C_b) / ( p(x|C_r) p(C_r) ) ]
      = ln p(x|C_b) − ln p(x|C_r) + ln p(C_b) − ln p(C_r)
      = −(1/2) (x − µ_b)^T Σ_b^{-1} (x − µ_b) + (1/2) (x − µ_r)^T Σ_r^{-1} (x − µ_r) + const.

• When the covariances differ, the quadratic terms no longer cancel, so the decision boundary is a quadratic function of x

Probabilistic Generative Models Summary

• The posterior is given by a generalized linear model: a logistic sigmoid for K = 2 classes, a softmax for K > 2 classes
• Fitting Gaussians using the ML criterion is sensitive to outliers

Outline

Discriminant Functions

Generative Models

Discriminative Models

Probabilistic Discriminative Models

• The generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian)
• This resulted in a logistic sigmoid of a linear function of x
• A discriminative model instead uses that functional form explicitly:

    p(C_1|x) = σ( w^T x + w_0 ) = 1 / ( 1 + exp(−(w^T x + w_0)) )

  and finds w directly
• For the generative model we had 2M + M(M + 1)/2 + 1 parameters, where M is the dimensionality of x
• The discriminative model has only M + 1 parameters

Generative vs. Discriminative

Generative models:
• Can generate synthetic example data
• Perhaps accurate classification is equivalent to accurate synthesis (e.g. vision and graphics)
• Tend to have more parameters
• Require a good model of the class distributions

Discriminative models:
• Only usable for classification
• Don't solve a harder problem than you need to
• Tend to have fewer parameters
• Require a good model of the decision boundary

Logistic Regression

• This time there is no closed-form solution, since y_n = σ(w^T x_n)
• Could use (stochastic) gradient descent
• But there's a better iterative technique

Iterative Reweighted Least Squares

• Replace X by Φ(X) and y by t in (5.11) (see the code sketch below)

98
