Discriminant, Generative, Discriminative Models
Discriminant Functions Generative Models Discriminative Models
Example training pair: input x_i with 1-of-K target encoding t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
y(x) = f(w^T x + w_0)
Outline
Discriminant Functions
Generative Models
Discriminative Models
• Start with the 2-class problem, t_i ∈ {0, 1}
• Simple linear discriminant: y(x) = w^T x + w_0, then apply a threshold function to get the classification
• The decision boundary y = 0 separates region R_1 (y > 0) from region R_2 (y < 0)
• w is orthogonal to the decision boundary, which lies at distance −w_0/‖w‖ from the origin
• The projection of x in the w direction is w^T x/‖w‖; any x decomposes as x = x_⊥ + y(x) w/‖w‖^2, where x_⊥ lies on the boundary
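The two-class rule above can be sketched in a few lines; the weights here are illustrative, not learned:

```python
import numpy as np

# Two-class linear discriminant y(x) = w^T x + w0 followed by a threshold.
# w and w0 below are illustrative values, not fitted parameters.
w = np.array([1.0, -1.0])
w0 = 0.5

def classify(x):
    """Return class 1 if y(x) = w^T x + w0 >= 0 (region R1), else class 0 (R2)."""
    y = w @ x + w0
    return 1 if y >= 0 else 0

print(classify(np.array([2.0, 0.0])))  # y = 2.5  -> class 1
print(classify(np.array([0.0, 2.0])))  # y = -1.5 -> class 0

# The boundary's distance from the origin is -w0/||w||
print(-w0 / np.linalg.norm(w))
```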
Multiple Classes
[Figure: one-versus-the-rest and one-versus-one constructions with classes C_1, C_2, C_3 and regions R_1, R_2, R_3; both leave ambiguous regions (marked “?”) where no class can be assigned.]
[Figure: convex decision region R_k containing points x_A, x_B, and x̂ on the line between them, with neighbouring regions R_i, R_j.]
• A solution is to build K linear functions:
  y_k(x) = w_k^T x + w_{k0}
  and assign x to class arg max_k y_k(x)
• Gives connected, convex decision regions:
  x̂ = λ x_A + (1 − λ) x_B
  ⇒ y_k(x̂) = λ y_k(x_A) + (1 − λ) y_k(x_B)
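The arg-max rule over K linear functions can be sketched as follows; the weight vectors are illustrative:

```python
import numpy as np

# K = 3 linear discriminants y_k(x) = w_k^T x + w_k0; assign x to
# arg max_k y_k(x). The rows of W and the offsets w0 are illustrative.
W = np.array([[ 1.0,  0.0],    # w_1
              [ 0.0,  1.0],    # w_2
              [-1.0, -1.0]])   # w_3
w0 = np.array([0.0, 0.0, 0.5])

def classify(x):
    scores = W @ x + w0            # y_k(x) for each class k
    return int(np.argmax(scores))  # 0-based class index

print(classify(np.array([2.0, 1.0])))  # scores [2, 1, -2.5] -> class 0
print(classify(np.array([0.0, 3.0])))  # scores [0, 3, -2.5] -> class 1
```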
• Another approach: project x to one dimension, y = w^T x, followed by a threshold: classify as C_1 if y ≥ −w_0
• In which direction w should we project?
• One which separates the classes “well”
• Projected class means:
  m_k = (1/N_k) Σ_{n∈C_k} w^T x_n
• Fisher criterion:
  J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2)
  maximize w.r.t. w, where s_k^2 is the variance of the projected points of class k
J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2) = (w^T S_B w) / (w^T S_W w)
Between-class covariance:
  S_B = (m_2 − m_1)(m_2 − m_1)^T
Within-class covariance:
  S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T
Lots of math:
  w ∝ S_W^{−1} (m_2 − m_1)
If the covariance S_W is isotropic, this reduces to the class-mean difference vector
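The closed-form direction w ∝ S_W^{−1}(m_2 − m_1) is easy to compute; here is a minimal sketch on synthetic two-class data (the class means and spreads below are illustrative):

```python
import numpy as np

# Fisher direction w ∝ S_W^{-1}(m2 - m1) on a synthetic two-class set.
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.5, size=(100, 2))   # class 1 samples
X2 = rng.normal([2.0, 1.0], 0.5, size=(100, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter S_W, summed over both classes (as on the slide)
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(SW, m2 - m1)    # w ∝ S_W^{-1}(m2 - m1)
w /= np.linalg.norm(w)

# The projected class means are separated along w (class 2 above class 1)
print((X2 @ w).mean() - (X1 @ w).mean() > 0)  # True
```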
FLD Summary
Perceptrons
Perceptron Learning
E_P(w) = − Σ_{n∈M} w^T φ(x_n) t_n
where M is the set of misclassified points and t_n ∈ {−1, +1}
w^{(τ+1)} = w^{(τ)} − η ∇E_P(w)
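The stochastic version of this update adds η φ(x_n) t_n whenever point n is misclassified. A minimal sketch on a small linearly separable toy set (the data points are illustrative; the bias is absorbed by a constant feature, and targets use the perceptron convention t ∈ {−1, +1}):

```python
import numpy as np

# Perceptron learning on a separable toy set, with phi(x) = (1, x1, x2).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0]])
t = np.array([-1, -1, 1, 1])                 # targets in {-1, +1}
Phi = np.hstack([np.ones((4, 1)), X])        # feature vectors phi(x_n)

w = np.zeros(3)
eta = 1.0
for _ in range(100):                         # epochs (converges long before)
    errors = 0
    for phi_n, t_n in zip(Phi, t):
        if t_n * (w @ phi_n) <= 0:           # misclassified (or on boundary)
            w = w + eta * phi_n * t_n        # w <- w + eta * phi(x_n) t_n
            errors += 1
    if errors == 0:                          # converged: no misclassifications
        break

print(np.all(np.sign(Phi @ w) == t))  # True: all points correctly classified
```

For separable data the perceptron convergence theorem guarantees this loop terminates in a finite number of updates.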
Convergence of the perceptron algorithm, showing data points from two classes (red and blue) in a two-dimensional feature space (φ1, φ2). The top-left plot shows the initial parameter vector w as a black arrow, together with the corresponding decision boundary (black line), where the arrow points towards the decision region classified as belonging to the red class. The data point circled in green is misclassified, so its feature vector is added to the current weight vector, giving the new decision boundary shown in the top-right plot. The bottom-left plot shows the next misclassified point to be considered, indicated by the green circle; its feature vector is again added to the weight vector, giving the decision boundary shown in the bottom-right plot, for which all data points are correctly classified.
Limitations of Perceptrons
• A perceptron can only represent linearly separable functions; e.g. XOR of binary inputs I_1, I_2 ∈ {0, 1} cannot be separated by a single linear boundary
Outline
Discriminant Functions
Generative Models
Discriminative Models
p(C_1|x) = p(x|C_1) p(C_1) / p(x)                                          (Bayes’ rule)
         = p(x|C_1) p(C_1) / (p(x, C_1) + p(x, C_2))                       (sum rule)
         = p(x|C_1) p(C_1) / (p(x|C_1) p(C_1) + p(x|C_2) p(C_2))           (product rule)
p(C_1|x) = p(x|C_1) p(C_1) / (p(x|C_1) p(C_1) + p(x|C_2) p(C_2))
• Class priors: p(C_1) = 0.1, p(C_2) = 0.9
• p(C_1|x = 15) = 0.0484·0.1 / (0.0484·0.1 + 0.0108·0.9) ≈ 0.33
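The worked example above is a one-liner; this sketch reuses the slide's priors and the stated likelihood values p(x=15|C_1) = 0.0484 and p(x=15|C_2) = 0.0108:

```python
# Posterior for class C1 via Bayes' rule, using the numbers from the slide.
p_c1, p_c2 = 0.1, 0.9             # class priors
lik_c1, lik_c2 = 0.0484, 0.0108   # likelihoods p(x=15|C1), p(x=15|C2)

posterior_c1 = lik_c1 * p_c1 / (lik_c1 * p_c1 + lik_c2 * p_c2)
print(round(posterior_c1, 2))  # 0.33
```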
p(C_1|x) = p(x|C_1) p(C_1) / (p(x|C_1) p(C_1) + p(x|C_2) p(C_2))
         = 1 / (1 + exp(−a)) ≡ σ(a)
where a = ln [ p(x|C_1) p(C_1) / (p(x|C_2) p(C_2)) ]
• This looks like gratuitous math, but if a takes a simple form this is another generalized linear model of the kind we have been studying
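The identity p(C_1|x) = σ(a) can be checked numerically; this sketch reuses the priors and likelihood values from the earlier worked example:

```python
import math

# Check that the posterior equals sigma(a) with a the log-odds
# a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))], using the earlier slide's numbers.
num = 0.0484 * 0.1   # p(x|C1) p(C1)
den = 0.0108 * 0.9   # p(x|C2) p(C2)

direct = num / (num + den)            # posterior computed directly
a = math.log(num / den)               # log-odds
sigma = 1.0 / (1.0 + math.exp(-a))    # sigma(a)

print(abs(direct - sigma) < 1e-12)  # True
```

This is exact algebra: σ(ln(num/den)) = 1/(1 + den/num) = num/(num + den).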
Logistic Sigmoid
[Figure: σ(a) plotted over a ∈ [−5, 5]; σ(0) = 0.5.]
Multi-class Extension
p(C_k|x) = exp(a_k) / Σ_j exp(a_j), with a_k = ln p(x|C_k) p(C_k)
• a.k.a. the softmax function
• If some a_k ≫ a_j for all j ≠ k, then p(C_k|x) goes to 1 and p(C_j|x) close to 0
• For Gaussian class-conditional densities with shared covariance Σ, the log-odds are linear in x:
a = ln [ p(x|C_1) p(C_1) / (p(x|C_2) p(C_2)) ] = w^T x + w_0
with w = Σ^{−1}(μ_1 − μ_2) and w_0 = −½ μ_1^T Σ^{−1} μ_1 + ½ μ_2^T Σ^{−1} μ_2 + ln(p(C_1)/p(C_2))
• Note that the quadratic terms x^T Σ^{−1} x cancel because the covariance is shared
p(t | π, μ_1, μ_2, Σ) = Π_{n=1}^N [π N(x_n|μ_1, Σ)]^{t_n} [(1 − π) N(x_n|μ_2, Σ)]^{1 − t_n}
• As usual, ln is our friend:
ℓ(t; θ) = Σ_{n=1}^N { t_n ln π + (1 − t_n) ln(1 − π) } + { t_n ln N(x_n|μ_1, Σ) + (1 − t_n) ln N(x_n|μ_2, Σ) }
where the first group of terms involves only π, and the second only μ_1, μ_2, Σ
∂ℓ(t; θ)/∂π = Σ_{n=1}^N [ t_n/π − (1 − t_n)/(1 − π) ] = 0
⇒ π = N_1 / (N_1 + N_2)
• N_1 and N_2 are the number of training points in each class
• The prior is simply the fraction of points in each class
Similarly, maximizing w.r.t. the shared covariance gives
Σ = (N_1/N) · (1/N_1) Σ_{n∈C_1} (x_n − μ_1)(x_n − μ_1)^T + (N_2/N) · (1/N_2) Σ_{n∈C_2} (x_n − μ_2)(x_n − μ_2)^T
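The maximum-likelihood estimates above (prior, class means, shared covariance) can be sketched on synthetic labeled data; the class means, spreads, and sample counts are illustrative:

```python
import numpy as np

# ML estimates for the generative Gaussian model: pi = N1/(N1+N2),
# class means mu1, mu2, and shared covariance Sigma = (N1/N)S1 + (N2/N)S2.
rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, size=(30, 2))   # class 1 (t_n = 1)
X2 = rng.normal([3.0, 1.0], 1.0, size=(70, 2))   # class 2 (t_n = 0)

N1, N2 = len(X1), len(X2)
N = N1 + N2
pi = N1 / N                              # prior = fraction of class-1 points
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1) / N1      # per-class covariance
S2 = (X2 - mu2).T @ (X2 - mu2) / N2
Sigma = (N1 / N) * S1 + (N2 / N) * S2    # shared covariance

print(pi)           # 0.3
print(Sigma.shape)  # (2, 2)
```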
a = ln [ p(x|C_b) p(C_b) / (p(x|C_r) p(C_r)) ]
Outline
Discriminant Functions
Generative Models
Discriminative Models
Logistic Regression
• This time there is no closed-form solution, since y_n = σ(w^T x_n)
• Could use (stochastic) gradient descent
• But there’s a better iterative technique
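A gradient-descent sketch, using the cross-entropy gradient ∇E(w) = Σ_n (y_n − t_n) φ_n with y_n = σ(w^T φ_n); the toy data, step size, and iteration count are illustrative:

```python
import numpy as np

# Logistic regression by batch gradient descent on a well-separated toy set.
rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0], 0.7, size=(50, 2))
X2 = rng.normal([3.0, 3.0], 0.7, size=(50, 2))
X = np.hstack([np.ones((100, 1)), np.vstack([X1, X2])])  # bias feature
t = np.array([0] * 50 + [1] * 50)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(3)
eta = 0.01
for _ in range(500):
    y = sigmoid(X @ w)
    w -= eta * X.T @ (y - t)    # gradient step: grad E = X^T (y - t)

acc = np.mean((sigmoid(X @ w) >= 0.5) == t)
print(acc >= 0.95)  # True: near-perfect training accuracy on separated data
```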
Iterative Reweighted Least Squares
Replace X by Φ(X) and y by t in Eq. 5.11
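IRLS is Newton's method for the logistic-regression log likelihood: each step solves a weighted least-squares problem with weights R = diag(y_n(1 − y_n)). A minimal sketch on overlapping toy data (the data and iteration count are illustrative):

```python
import numpy as np

# IRLS / Newton's method for logistic regression:
#   w <- w - (Phi^T R Phi)^{-1} Phi^T (y - t),  R = diag(y_n (1 - y_n))
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(40, 2)),
               rng.normal([2.0, 2.0], 1.0, size=(40, 2))])
Phi = np.hstack([np.ones((80, 1)), X])    # design matrix with bias feature
t = np.array([0] * 40 + [1] * 40)

w = np.zeros(3)
for _ in range(10):                        # a few Newton steps suffice
    y = 1.0 / (1.0 + np.exp(-Phi @ w))
    R = y * (1.0 - y)                      # diagonal of the weight matrix R
    H = Phi.T @ (Phi * R[:, None])         # Hessian  Phi^T R Phi
    g = Phi.T @ (y - t)                    # gradient Phi^T (y - t)
    w -= np.linalg.solve(H, g)             # Newton update

y = 1.0 / (1.0 + np.exp(-Phi @ w))
acc = np.mean((y >= 0.5) == t)
print(acc >= 0.8)  # True on this overlapping toy set
```

Because the classes overlap, the maximum-likelihood solution is finite and Newton's method converges in a handful of iterations; IRLS typically needs far fewer steps than plain gradient descent.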