Lecture 8 of CSC 311 at the University of Toronto covers multivariate Gaussians and Gaussian discriminant analysis, building on previous concepts in probabilistic models. It includes a review of linear algebra, eigenvectors, eigenvalues, and the spectral decomposition of symmetric matrices, which are crucial for understanding Gaussian distributions. The lecture also introduces the multivariate Gaussian distribution, its mean and covariance, and how it relates to univariate distributions, emphasizing the importance of these concepts in machine learning.

CSC 311: Introduction to Machine Learning

Lecture 8 - Multivariate Gaussians, GDA

Roger Grosse, Rahul G. Krishnan, Guodong Zhang

University of Toronto, Fall 2021


Overview

Last week, we started our tour of probabilistic models and introduced the fundamental concepts in the discrete setting.

Continuous random variables:
- Manipulating Gaussians to tackle interesting problems requires lots of linear algebra, so we'll begin with a linear algebra review.
- Additional reference: see also Chapter 4 of Mathematics for Machine Learning, by Deisenroth et al. https://mml-book.github.io/

Regression: linear regression as maximum likelihood estimation under a Gaussian noise model.

Generative classifier for continuous data: Gaussian discriminant analysis, a Bayes classifier for continuous variables.

Next week's lecture (PCA) draws heavily on today's linear algebra content, so be sure to review it offline.


Linear Algebra Review



Eigenvectors and Eigenvalues

Let B be a square matrix. An eigenvector of B is a vector v such that

    B v = λ v

for a scalar λ, which is called an eigenvalue.

A matrix of size D × D has at most D distinct eigenvalues, but may have fewer.

I will have very little to say about the general case, since in this course we will only be concerned with the case of symmetric matrices, which is much simpler.
- Today's tutorial covers the general case, as well as how to compute eigenvectors/eigenvalues.


Spectral Decomposition

If a matrix A is symmetric, then the situation is much simpler, due to a result called the Spectral Theorem:
- All of the eigenvalues are real-valued.
- There is a full set of linearly independent eigenvectors (i.e. D of them for a D × D matrix).
- I.e., these eigenvectors form a basis for R^D.
- These eigenvectors can be chosen to be real-valued.
- The eigenvectors can be chosen to be orthonormal.

In this class, we will only need to use eigenvectors and eigenvalues in the symmetric case. But it's important to remember why this case is so special.


Spectral Decomposition

Equivalently to the Spectral Theorem, a symmetric matrix A can be factorized with the Spectral Decomposition:

    A = Q Λ Q^T

where
- Q is an orthogonal matrix, and the columns q_i of Q are eigenvectors;
- Λ is a diagonal matrix, and the diagonal entries λ_i are the corresponding eigenvalues.

Check that this is reasonable: A q_i = ?
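
(Not from the slides.) A minimal numpy sketch of the spectral decomposition: np.linalg.eigh returns the eigenvalues and orthonormal eigenvectors of a symmetric matrix, and we can check that A = QΛQ^T and A q_i = λ_i q_i. The example matrix is made up.

```python
import numpy as np

# A small symmetric matrix (example values, not from the slides).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigh is specialized for symmetric (Hermitian) matrices: it returns real
# eigenvalues in ascending order and orthonormal eigenvectors as columns of Q.
eigvals, Q = np.linalg.eigh(A)
Lam = np.diag(eigvals)

# Spectral decomposition: A = Q Lambda Q^T
assert np.allclose(A, Q @ Lam @ Q.T)

# Each column q_i satisfies A q_i = lambda_i q_i
for i in range(A.shape[0]):
    assert np.allclose(A @ Q[:, i], eigvals[i] * Q[:, i])

# Q is orthogonal: Q^T Q = I
assert np.allclose(Q.T @ Q, np.eye(A.shape[0]))
```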


Spectral Decomposition

Because A has a full set of orthonormal eigenvectors {q_i}, we can use these as an orthonormal basis for R^D.

I.e., a vector x can be written in an alternate coordinate system:

    x = x̃_1 q_1 + · · · + x̃_D q_D

Converting between the two coordinate systems:

    x̃ = Q^T x        x = Q x̃

In the alternate coordinate system, A acts by rescaling the individual coordinates (i.e. "stretching" the space): A x = Q Λ Q^T x = Q (Λ x̃), so the i-th coordinate x̃_i is scaled by λ_i.

PSD Matrices

Symmetric matrices are important because they represent quadratic forms, f(v) = v^T A v.

[Figure: example quadratic forms that are positive definite, non-strictly PSD, negative definite, and indefinite.]

If v^T A v > 0 for all v ≠ 0, i.e. the quadratic form curves upwards, we say that A is positive definite and denote this A ≻ 0.

If v^T A v ≥ 0 for all v, we say A is positive semidefinite (PSD), denoted A ⪰ 0.

If v^T A v < 0 for all v ≠ 0, we say A is negative definite, denoted A ≺ 0.

PSD Matrices

Exercise: Show from the definition that nonnegative linear combinations of PSD matrices are PSD.

Related: If A is a random matrix which is always PSD, then E[A] is PSD. (The discrete case is a special case of the above.)

Exercise: Show that for any matrix B, the matrix B B^T is PSD.

Corollary: For a random vector x, the covariance matrix Cov(x) = E[(x − µ)(x − µ)^T] is a PSD matrix. (Special case of the above, since x − µ is a column vector, i.e. a D × 1 matrix.)
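
(Not from the slides.) A quick numerical sanity check of the two exercises above, using a made-up matrix B and made-up data X: B B^T and the empirical covariance have no negative eigenvalues (up to round-off).

```python
import numpy as np

rng = np.random.default_rng(0)

# Any B gives a PSD matrix B B^T, since v^T (B B^T) v = ||B^T v||^2 >= 0.
B = rng.normal(size=(4, 6))
M = B @ B.T
assert np.all(np.linalg.eigvalsh(M) >= -1e-10)       # nonnegative up to round-off

# The empirical covariance of data X (rows = samples) has the same form, so it is PSD too.
X = rng.normal(size=(100, 4))
Sigma_hat = np.cov(X, rowvar=False)
assert np.all(np.linalg.eigvalsh(Sigma_hat) >= -1e-10)
```
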
PSD Matrices

Claim: A is positive definite iff all of its eigenvalues are positive. It is PSD iff all of its eigenvalues are nonnegative.
- Expressing v in terms of the eigenbasis, ṽ = Q^T v:

      v^T A v = v^T Q Λ Q^T v = ṽ^T Λ ṽ = Σ_i λ_i ṽ_i²

- This is positive (nonnegative) for all v iff all the λ_i are positive (nonnegative).


PSD Matrices

If A is positive definite, then the contours of the quadratic form are elliptical.

If A is both diagonal and positive definite (i.e. its diagonal entries are positive), then the ellipses are axis-aligned. For example, with

    A = [ 0.5  0
          0    1 ]

the quadratic form is

    f(v) = v^T A v = Σ_i a_i v_i²

where a_i are the diagonal entries of A.


PSD Matrices

For general positive definite A = Q Λ Q^T, the contours of the quadratic form are elliptical, and the principal axes of the ellipses are aligned with the eigenvectors. For example:

    A = [  1  −1
          −1   2 ]

    f(v) = v^T Q Λ Q^T v = ṽ^T Λ ṽ = Σ_i λ_i ṽ_i²

In this example, λ_1 > λ_2.

All symmetric matrices are diagonal if you choose the right coordinate system.

Matrix Powers

The Spectral Decomposition makes it easy to compute powers of a matrix. Observe that

    A² = (Q Λ Q^T)² = Q Λ (Q^T Q) Λ Q^T = Q Λ² Q^T     (using Q^T Q = I)

Iterating this, for any integer k > 0,

    A^k = Q Λ^k Q^T.

Similarly, if A is invertible, then

    A^{−1} = (Q^T)^{−1} Λ^{−1} Q^{−1} = Q Λ^{−1} Q^T.

If A is PSD, then we can easily define the matrix square root:

    A^{1/2} = Q Λ^{1/2} Q^T.

Observe that A^{1/2} is PSD and (A^{1/2})² = A. This is the unique PSD square root of A.
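
(Not from the slides.) A small numpy sketch of the matrix square root via the eigendecomposition; the example matrix is made up.

```python
import numpy as np

def psd_sqrt(A):
    """Square root of a symmetric PSD matrix via A = Q diag(lam) Q^T."""
    lam, Q = np.linalg.eigh(A)
    lam = np.clip(lam, 0.0, None)                    # guard against tiny negative round-off
    return Q @ np.diag(np.sqrt(lam)) @ Q.T

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                           # symmetric PSD example (eigenvalues 1 and 3)
A_half = psd_sqrt(A)

assert np.allclose(A_half @ A_half, A)               # (A^{1/2})^2 = A
assert np.all(np.linalg.eigvalsh(A_half) >= 0)       # A^{1/2} is PSD
```
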
Determinant

The determinant |B| of a square matrix B determines how volumes change under the linear transformation B.

The definition of the determinant is complicated, and we won't need it in this course.

[Figure: illustration from Mathematics for Machine Learning.]

Determinant

Some basic properties:
- |B C| = |B| · |C|
- |B| = 0 iff B is singular
- |B^{−1}| = |B|^{−1} if B is invertible (nonsingular)
- |B^T| = |B|
- If Q is orthogonal, then |Q| = ±1 (i.e. orthogonal transformations preserve volume)
- If Λ is diagonal with entries {λ_i}, then |Λ| = Π_i λ_i

The determinant of a matrix equals the product of its eigenvalues. This is easy to show in the symmetric case:

    |A| = |Q Λ Q^T| = |Q| |Λ| |Q^T| = |Λ| = Π_i λ_i

Corollary: the determinant of a PSD matrix is nonnegative, and the determinant of a positive definite matrix is positive.
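
(Not from the slides.) A one-off numerical check that the determinant of a symmetric matrix equals the product of its eigenvalues, using a random made-up matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random symmetric matrix: |A| should equal the product of its eigenvalues.
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                                    # symmetrize
assert np.isclose(np.linalg.det(A), np.prod(np.linalg.eigvalsh(A)))
```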


Multivariate Gaussian Distribution



Univariate Gaussian distribution

Recall the Gaussian, or normal, distribution:

    N(x; µ, σ²) = 1/(√(2π) σ) · exp( −(x − µ)² / (2σ²) )

Parameterized by mean µ and variance σ².

The Central Limit Theorem says that sums of lots of independent random variables are approximately Gaussian.

In machine learning, we use Gaussians a lot because they make the calculations easy.


Multivariate Mean and Covariance

Mean:

    µ = E[x] = (µ_1, ..., µ_D)^T

Covariance:

    Σ = Cov(x) = E[(x − µ)(x − µ)^T] = [ σ_1²   σ_12  ···  σ_1D
                                         σ_12   σ_2²  ···  σ_2D
                                          ⋮      ⋮    ⋱     ⋮
                                         σ_D1   σ_D2  ···  σ_D² ]

The statistics µ and Σ uniquely define a multivariate Gaussian (or multivariate Normal) distribution, denoted N(µ, Σ) or N(x; µ, Σ).
- This is not true for distributions in general!


Multivariate Gaussian Distribution

PDF of the multivariate Gaussian distribution:

    N(x; µ, Σ) = 1/((2π)^{d/2} |Σ|^{1/2}) · exp( −½ (x − µ)^T Σ^{−1} (x − µ) )

Compare to the univariate case (d = 1, Σ = σ²):

    N(x; µ, σ²) = 1/(√(2π) σ) · exp( −(x − µ)² / (2σ²) )
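
(Not from the slides.) A minimal numpy sketch that evaluates this density directly from the formula, cross-checked against scipy.stats.multivariate_normal (scipy assumed to be installed); the example µ, Σ, and x are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal            # used only as a cross-check

def gaussian_pdf(x, mu, Sigma):
    """Evaluate N(x; mu, Sigma) from the formula above."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])                               # example parameters
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
x = np.array([0.2, 0.7])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should agree
```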


Bivariate Gaussian

Spherical covariances:

    Σ = [1 0; 0 1],   Σ = 0.5 [1 0; 0 1],   Σ = 2 [1 0; 0 1]

[Figure: probability density function for each Σ.]
[Figure: contour plot of the pdf for each Σ.]


Bivariate Gaussian

Axis-aligned (diagonal) covariances:

    Σ = [1 0; 0 1],   Σ = [2 0; 0 1],   Σ = [1 0; 0 2]

[Figure: probability density function for each Σ.]
[Figure: contour plot of the pdf for each Σ.]

Bivariate Gaussian

Correlated covariances:

    Σ = [1 0; 0 1],   Σ = [1 0.5; 0.5 1],   Σ = [1 0.8; 0.8 1]

The latter two factor as

    [1 0.5; 0.5 1] = Q_1 [1.5 0; 0 0.5] Q_1^T        [1 0.8; 0.8 1] = Q_2 [1.8 0; 0 0.2] Q_2^T

Test your intuition: does Q_1 = Q_2?

[Figure: probability density function for each Σ.]
[Figure: contour plot of the pdf for each Σ.]

Bivariate Gaussian

    Σ = [1 0; 0 1],   Σ = [1 0.5; 0.5 1],   Σ = [1 −0.5; −0.5 1]

The latter two factor as

    [1 0.5; 0.5 1] = Q_1 [1.5 0; 0 0.5] Q_1^T        [1 −0.5; −0.5 1] = Q_2 [λ_1 0; 0 λ_2] Q_2^T

Test your intuition: does Q_1 = Q_2? What are λ_1 and λ_2?

[Figure: probability density function for each Σ.]
[Figure: contour plot of the pdf for each Σ.]
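
(Not from the slides.) A quick numpy check you can run to verify your intuition for the two questions above; it prints the eigenvalues and eigenvectors of the three non-identity covariances shown on these slides.

```python
import numpy as np

for Sigma in (np.array([[1.0, 0.5], [0.5, 1.0]]),
              np.array([[1.0, 0.8], [0.8, 1.0]]),
              np.array([[1.0, -0.5], [-0.5, 1.0]])):
    lam, Q = np.linalg.eigh(Sigma)    # eigenvalues in ascending order; columns of Q are eigenvectors
    print("Sigma =\n", Sigma)
    print("eigenvalues:", lam)
    print("eigenvectors (columns):\n", Q, "\n")
```
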
Gaussian Intuition: (Multivariate) Shift + Scale

Recall that in the univariate case, all normal distributions are shaped like the standard normal distribution.

The densities are related to the standard normal by a shift (µ), a scale (or stretch, or dilation) σ, and a normalization factor.


Shift + Scale: Multivariate Case

Start with a standard (spherical) Gaussian x ∼ N(0, I), so E[x] = 0 and Cov(x) = I.

Consider what happens if we map x̂ = S x + b.

By linearity of expectation,

    E[x̂] = S E[x] + b = b.

By the linear transformation rule for covariance,

    Cov(x̂) = S Cov(x) S^T = S S^T.

It's possible to show that x̂ is also Gaussian distributed (but we won't show this here).


Shift + Scale: Multivariate Case

    E[S x + b] = b
    Cov(S x + b) = S S^T

In the univariate case, we obtain N(µ, σ²) by starting with N(0, 1), shifting by µ, and stretching by σ = √(σ²).

In the multivariate case, to obtain N(µ, Σ), we start with N(0, I), shift by µ, and scale by the matrix square root Σ^{1/2}. Recall: Σ^{1/2} = Q Λ^{1/2} Q^T.

Intuition: for each eigenvector q_i with corresponding eigenvalue λ_i, we stretch by a factor of √λ_i in the direction q_i. (A sampling sketch follows below.)
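
(Not from the slides.) A minimal sampling sketch of this shift-and-scale picture, assuming made-up values of µ and Σ: draw z ∼ N(0, I), form x̂ = Σ^{1/2} z + µ, and check the empirical mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])                    # example mean (for illustration)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                # example covariance (for illustration)

# Matrix square root via the spectral decomposition: Sigma^{1/2} = Q Lambda^{1/2} Q^T.
lam, Q = np.linalg.eigh(Sigma)
S = Q @ np.diag(np.sqrt(lam)) @ Q.T

# Shift + scale standard normal samples: x_hat = S z + mu, with z ~ N(0, I).
z = rng.standard_normal(size=(100_000, 2))
x_hat = z @ S.T + mu

# The empirical mean and covariance should be close to mu and Sigma.
print(x_hat.mean(axis=0))
print(np.cov(x_hat, rowvar=False))
```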


Gaussian Maximum Likelihood

Suppose we want to model the distribution of highest and lowest temperatures in Toronto in March, and we've recorded the following observations:

    (-2.5, -7.5)  (-9.9, -14.9)  (-12.1, -17.5)  (-8.9, -13.9)  (-6.0, -11.1)

Assume they're drawn from a Gaussian distribution with mean µ and covariance Σ. We want to estimate these using data.

Log-likelihood function:

    l(µ, Σ) = log Π_{i=1}^N [ 1/((2π)^{d/2} |Σ|^{1/2}) exp( −½ (x^(i) − µ)^T Σ^{−1} (x^(i) − µ) ) ]

            = Σ_{i=1}^N log [ 1/((2π)^{d/2} |Σ|^{1/2}) exp( −½ (x^(i) − µ)^T Σ^{−1} (x^(i) − µ) ) ]

            = Σ_{i=1}^N [ −log (2π)^{d/2} − log |Σ|^{1/2} − ½ (x^(i) − µ)^T Σ^{−1} (x^(i) − µ) ]

where the first term in the sum is a constant.

Optional intuition building: why does |Σ|^{1/2} show up in the Gaussian density p(x)? Hint: the determinant is the product of the eigenvalues.


Gaussian Maximum Likelihood

Maximize the log-likelihood by setting the derivative to zero:

    0 = dl/dµ = − Σ_{i=1}^N d/dµ [ ½ (x^(i) − µ)^T Σ^{−1} (x^(i) − µ) ]

              = Σ_{i=1}^N Σ^{−1} (x^(i) − µ) = 0

Here we use the identity ∇_x x^T A x = 2 A x.

Solving, we get µ̂ = (1/N) Σ_{i=1}^N x^(i). (In general, "hat" means estimator.)

This is just the sample mean of the observed values, or the empirical mean.


Gaussian Maximum Likelihood

We can do a similar calculation for the covariance matrix Σ (we skip the details).

Setting the partial derivatives to zero, just like before, we get:

    0 = ∂l/∂Σ  ⇒  Σ̂ = (1/N) Σ_{i=1}^N (x^(i) − µ̂)(x^(i) − µ̂)^T
                      = (1/N) (X − 1µ̂^T)^T (X − 1µ̂^T)

where 1 is an N-dimensional vector of 1s.

This is called the empirical covariance and comes up quite often (e.g., PCA soon!).

The derivation in the multivariate case is tedious. No need to worry about it. But it is good practice to derive this in one dimension. See the supplement (next slide).
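
(Not from the slides.) A short numpy sketch computing the empirical mean and covariance for the five temperature observations given above.

```python
import numpy as np

# The five (high, low) temperature observations from the earlier slide.
X = np.array([[-2.5, -7.5],
              [-9.9, -14.9],
              [-12.1, -17.5],
              [-8.9, -13.9],
              [-6.0, -11.1]])
N = X.shape[0]

mu_hat = X.mean(axis=0)                              # empirical mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / N                # empirical covariance (MLE divides by N)

print("mu_hat    =", mu_hat)
print("Sigma_hat =\n", Sigma_hat)
# Note: np.cov(X, rowvar=False) divides by N - 1 instead of N.
```
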
Supplement: MLE for univariate Gaussian

    0 = ∂l/∂µ = (1/σ²) Σ_{i=1}^N (x^(i) − µ)    ⇒    µ̂_ML = (1/N) Σ_{i=1}^N x^(i)

    0 = ∂l/∂σ = ∂/∂σ [ Σ_{i=1}^N ( −½ log 2π − log σ − (1/(2σ²)) (x^(i) − µ)² ) ]

              = Σ_{i=1}^N [ 0 − ∂/∂σ (log σ) − ( ∂/∂σ (1/(2σ²)) ) (x^(i) − µ)² ]

              = Σ_{i=1}^N [ −1/σ + (1/σ³) (x^(i) − µ)² ]

              = −N/σ + (1/σ³) Σ_{i=1}^N (x^(i) − µ)²

    ⇒    σ̂_ML = √( (1/N) Σ_{i=1}^N (x^(i) − µ)² )


Revisiting Linear Regression



Recap: Linear Regression

Given a training set of inputs and targets {(x^(i), t^(i))}_{i=1}^N.

Linear model:

    y = w^T ψ(x)

Squared error loss:

    L(y, t) = ½ (t − y)²

L2 regularization:

    R(w) = (λ/2) ǁwǁ²

Solution 1: solve analytically by setting the gradient to 0:

    w = (Ψ^T Ψ + λI)^{−1} Ψ^T t

Solution 2: solve approximately using gradient descent:

    w ← (1 − αλ) w − α Ψ^T (y − t)
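
(Not from the slides.) A small numpy sketch of the two solutions above on made-up data, using the identity feature map ψ(x) = x; the analytic solution and the gradient-descent iterate should agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with identity features psi(x) = x (assumed for illustration).
N, D = 50, 3
Psi = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
t = Psi @ w_true + 0.1 * rng.normal(size=N)

lam = 0.1     # regularization strength
alpha = 0.01  # learning rate

# Solution 1: analytic solution w = (Psi^T Psi + lam I)^{-1} Psi^T t
w_analytic = np.linalg.solve(Psi.T @ Psi + lam * np.eye(D), Psi.T @ t)

# Solution 2: gradient descent, w <- (1 - alpha*lam) w - alpha Psi^T (y - t)
w = np.zeros(D)
for _ in range(5000):
    y = Psi @ w
    w = (1 - alpha * lam) * w - alpha * Psi.T @ (y - t)

print(w_analytic)
print(w)              # should be close to the analytic solution
```
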
Linear Regression as Maximum Likelihood

We can give linear regression a probabilistic interpretation by assuming a Gaussian noise model:

    t | x ∼ N(w^T ψ(x), σ²)

Linear regression is just maximum likelihood under this model:

    (1/N) Σ_{i=1}^N log p(t^(i) | x^(i); w, b)
        = (1/N) Σ_{i=1}^N log N(t^(i); w^T ψ(x^(i)), σ²)
        = (1/N) Σ_{i=1}^N log [ 1/(√(2π) σ) exp( −(t^(i) − w^T ψ(x^(i)))² / (2σ²) ) ]
        = const − (1/(2Nσ²)) Σ_{i=1}^N (t^(i) − w^T ψ(x^(i)))²


Regularization as MAP Inference

We can view an L2 regularizer as MAP inference with a Gaussian prior.

Recall MAP inference:

    arg max_w log p(w | D) = arg max_w [ log p(w) + log p(D | w) ]

We just derived the likelihood term log p(D | w):

    log p(D | w) = −(1/(2Nσ²)) Σ_{i=1}^N (t^(i) − w^T x^(i) − b)² + const

Assume a Gaussian prior, w ∼ N(m, S):

    log p(w) = log N(w; m, S)
             = log [ 1/((2π)^{D/2} |S|^{1/2}) exp( −½ (w − m)^T S^{−1} (w − m) ) ]
             = −½ (w − m)^T S^{−1} (w − m) + const

Commonly, m = 0 and S = ηI, so

    log p(w) = −(1/(2η)) ǁwǁ² + const.

This is just L2 regularization!

Gaussian Discriminant Analysis



Generative vs Discriminative (Recap)

Two approaches to classification:

Discriminative approach: estimate parameters of the decision boundary/class separator directly from labeled examples.
- Model p(t | x) directly (e.g., logistic regression).
- Learn mappings from inputs to classes (linear/logistic regression, decision trees, etc.).
- Tries to solve: How do I separate the classes?

Generative approach: model the distribution of inputs characteristic of each class (Bayes classifier).
- Model p(x | t).
- Apply Bayes' Rule to derive p(t | x).
- Tries to solve: What does each class "look" like?

Classification: Diabetes Example

Gaussian discriminant analysis (GDA) is a Bayes classifier for continuous-valued inputs.

Observation per patient: white blood cell count & glucose value.

p(x | t = k) for each class is shaped like an ellipse, so we model each class as a multivariate Gaussian.

Gaussian Discriminant Analysis

Gaussian Discriminant Analysis in its general form assumes that p(x | t) is distributed according to a multivariate Gaussian distribution:

    p(x | t = k) = 1/((2π)^{D/2} |Σ_k|^{1/2}) exp( −½ (x − µ_k)^T Σ_k^{−1} (x − µ_k) )

where |Σ_k| denotes the determinant of the matrix.

Each class k has an associated mean vector µ_k and covariance matrix Σ_k.

How many parameters?
- Each µ_k has D parameters, for DK total.
- Each Σ_k has O(D²) parameters, for O(D²K) total; this could be hard to estimate (more on that later).


GDA: Learning

Learn the parameters for each class using maximum likelihood.

For simplicity, assume binary classification:

    p(t | φ) = φ^t (1 − φ)^{1−t}

You can compute the ML estimates in closed form (φ and µ_k are easy, Σ_k is tricky):

    φ = (1/N) Σ_{i=1}^N r_1^(i)

    µ_k = ( Σ_{i=1}^N r_k^(i) x^(i) ) / ( Σ_{i=1}^N r_k^(i) )

    Σ_k = ( Σ_{i=1}^N r_k^(i) (x^(i) − µ_k)(x^(i) − µ_k)^T ) / ( Σ_{i=1}^N r_k^(i) )

where r_k^(i) = 1[t^(i) = k].
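
(Not from the slides.) A compact numpy sketch of these maximum-likelihood estimates for binary GDA; the toy data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data: two classes drawn from two Gaussians (for illustration only).
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X1 = rng.multivariate_normal([2.0, 1.0], [[1.5, -0.4], [-0.4, 0.8]], size=300)
X = np.vstack([X0, X1])
t = np.concatenate([np.zeros(200, dtype=int), np.ones(300, dtype=int)])

# ML estimates from the slide: phi, mu_k, Sigma_k with r_k^(i) = 1[t^(i) = k].
phi = np.mean(t == 1)
params = {}
for k in (0, 1):
    X_k = X[t == k]
    mu_k = X_k.mean(axis=0)
    centered = X_k - mu_k
    Sigma_k = centered.T @ centered / X_k.shape[0]
    params[k] = (mu_k, Sigma_k)

print("phi =", phi)
for k, (mu_k, Sigma_k) in params.items():
    print(f"class {k}: mu = {mu_k}\nSigma =\n{Sigma_k}")
```
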
GDA Decision Boundary

Recall: for Bayes classifiers, we compute the decision boundary with Bayes' Rule:

    p(t | x) = p(t) p(x | t) / Σ_{t'} p(t') p(x | t')

Plug in the Gaussian p(x | t):

    log p(t_k | x) = log p(x | t_k) + log p(t_k) − log p(x)
                   = −(D/2) log(2π) − ½ log |Σ_k| − ½ (x − µ_k)^T Σ_k^{−1} (x − µ_k) + log p(t_k) − log p(x)

Decision boundary (between classes k and ℓ):

    (x − µ_k)^T Σ_k^{−1} (x − µ_k) = (x − µ_ℓ)^T Σ_ℓ^{−1} (x − µ_ℓ) + const

What's the shape of the boundary?
- We have a quadratic function in x, so the decision boundary is a conic section!

GDA Decision Boundary

[Figure: class-conditional likelihoods, the discriminant P(t_1 | x) = 0.5, and the posterior for t_1.]


GDA Decision Boundary

Our equation for the decision boundary:

    (x − µ_k)^T Σ_k^{−1} (x − µ_k) = (x − µ_ℓ)^T Σ_ℓ^{−1} (x − µ_ℓ) + const

Expand the products and factor out constants (w.r.t. x):

    x^T Σ_k^{−1} x − 2 µ_k^T Σ_k^{−1} x = x^T Σ_ℓ^{−1} x − 2 µ_ℓ^T Σ_ℓ^{−1} x + const

What if all classes share the same covariance Σ?
- We get a linear decision boundary!

    −2 µ_k^T Σ^{−1} x = −2 µ_ℓ^T Σ^{−1} x + const

    (µ_k − µ_ℓ)^T Σ^{−1} x = const


GDA Decision Boundary: Shared Covariances

[Figure: decision boundary with a shared covariance; the variances may be different.]


GDA vs Logistic Regression

Binary classification: if you examine p(t = 1 | x) under GDA and assume Σ_0 = Σ_1 = Σ, you will find that it looks like this:

    p(t = 1 | x, φ, µ_0, µ_1, Σ) = 1 / (1 + exp(−w^T x − b))

where (w, b) are chosen based on (φ, µ_0, µ_1, Σ).

Same model as logistic regression!
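
(Not from the slides.) A sketch of the standard mapping from (φ, µ_0, µ_1, Σ) to (w, b) under the shared-covariance assumption, namely w = Σ^{−1}(µ_1 − µ_0) and b = −½ µ_1^T Σ^{−1} µ_1 + ½ µ_0^T Σ^{−1} µ_0 + log(φ/(1 − φ)), checked numerically against the posterior computed directly from Bayes' rule. The example parameters are made up.

```python
import numpy as np

def gda_posterior(x, phi, mu0, mu1, Sigma):
    """p(t = 1 | x) computed directly from Bayes' rule with a shared covariance."""
    def log_gauss(x, mu):
        d = x - mu
        return -0.5 * d @ np.linalg.solve(Sigma, d)   # up to a constant shared by both classes
    a1 = np.log(phi) + log_gauss(x, mu1)
    a0 = np.log(1 - phi) + log_gauss(x, mu0)
    return 1.0 / (1.0 + np.exp(a0 - a1))

# Example parameters (made up for illustration).
phi = 0.3
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])

# Logistic-regression form of the same posterior.
Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu0)
b = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0 + np.log(phi / (1 - phi))

x = np.array([1.3, -0.2])
print(gda_posterior(x, phi, mu0, mu1, Sigma))
print(1.0 / (1.0 + np.exp(-(w @ x + b))))             # should match
```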


GDA vs Logistic Regression

When should we prefer GDA to logistic regression, and vice versa?

GDA makes a stronger modeling assumption: it assumes the class-conditional data is multivariate Gaussian.
- If this is true, GDA is asymptotically efficient (the best model in the limit of large N).
- If it's not true, the quality of the predictions might suffer.

Many class-conditional distributions lead to a logistic classifier.
- When these distributions are non-Gaussian (i.e., almost always), LR usually beats GDA.

GDA can easily handle missing features (how would you do that with LR?).


Gaussian Naive Bayes

What if x is high-dimensional?
- The Σ_k have O(D²K) parameters, which can be a problem if D is large.
- We already saw we can save a factor of K by using a shared covariance for the classes.
- Any other ideas?

Naive Bayes: assume the features are independent given the class:

    p(x | t = k) = Π_{j=1}^D p(x_j | t = k)

Assuming the likelihoods are Gaussian, how many parameters are required for the Naive Bayes classifier?
- This is equivalent to assuming the x_j are uncorrelated, i.e. Σ is diagonal.
- Hence, only D parameters for Σ!

Gaussian Naïve Bayes

The Gaussian Naïve Bayes classifier assumes that the likelihoods are Gaussian:

    p(x_j | t = k) = 1/(√(2π) σ_jk) exp( −(x_j − µ_jk)² / (2σ_jk²) )

(this is just a 1-dimensional Gaussian, one for each input dimension).

The model is the same as GDA with a diagonal covariance matrix.

Maximum likelihood estimates of the parameters:

    µ_jk = ( Σ_{i=1}^N r_k^(i) x_j^(i) ) / ( Σ_{i=1}^N r_k^(i) )

    σ_jk² = ( Σ_{i=1}^N r_k^(i) (x_j^(i) − µ_jk)² ) / ( Σ_{i=1}^N r_k^(i) )

where r_k^(i) = 1[t^(i) = k].
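
(Not from the slides.) A minimal numpy sketch of these per-dimension estimates; the toy data and labels are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points in D dimensions with integer labels in {0, ..., K-1}.
N, D, K = 300, 4, 3
X = rng.normal(size=(N, D))
t = rng.integers(0, K, size=N)

# Per-class, per-dimension means and variances, using r_k^(i) = 1[t^(i) = k].
mu = np.zeros((K, D))
var = np.zeros((K, D))
for k in range(K):
    X_k = X[t == k]
    mu[k] = X_k.mean(axis=0)
    var[k] = ((X_k - mu[k]) ** 2).mean(axis=0)   # MLE variance (divides by the class count)

print(mu)
print(var)
```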


Decision Boundary: Isotropic

We can go even further and assume the covariances are spherical, or isotropic.

In this case, Σ = σ²I (we just need one parameter!).

Going back to the class posterior for GDA:

    log p(t_k | x) = log p(x | t_k) + log p(t_k) − log p(x)
                   = −(D/2) log(2π) − ½ log |Σ_k| − ½ (x − µ_k)^T Σ_k^{−1} (x − µ_k) + log p(t_k) − log p(x)

Suppose for simplicity that p(t) is uniform. Plugging in Σ = σ²I and simplifying a bit,

    log p(t_k | x) − log p(t_ℓ | x) = −(1/(2σ²)) [ (x − µ_k)^T (x − µ_k) − (x − µ_ℓ)^T (x − µ_ℓ) ]
                                    = −(1/(2σ²)) [ ǁx − µ_kǁ² − ǁx − µ_ℓǁ² ]
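
(Not from the slides.) A tiny sketch of the resulting rule: with Σ = σ²I and a uniform prior, we simply assign x to the class with the nearest mean. The example means are made up.

```python
import numpy as np

# Example class means (illustrative).
means = np.array([[0.0, 0.0],
                  [3.0, 1.0],
                  [-1.0, 2.5]])

def classify_isotropic(x, means):
    """With Sigma = sigma^2 I and a uniform prior, pick the nearest class mean."""
    dists = np.sum((means - x) ** 2, axis=1)    # squared Euclidean distances
    return int(np.argmin(dists))

print(classify_isotropic(np.array([2.5, 0.5]), means))   # -> class 1
```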


Decision Boundary: Isotropic


The decision boundary bisects the class means!



Example



Generative models - Recap

GDA has a quadratic (conic) decision boundary.

With a shared covariance, GDA is similar to logistic regression.

Generative models:
- Flexible models; easy to add/remove classes.
- Handle missing data naturally.
- A more "natural" way to think about things, but usually doesn't work as well.

They try to solve a hard problem (modeling p(x)) in order to solve an easy problem (modeling p(t | x)).

Next up: unsupervised learning with PCA!
