lec08
A = QΛQ^T
where
) Q is an orthogonal matrix.
) The columns q_i of Q are eigenvectors of A.
) Λ is a diagonal matrix.
) The diagonal entries λ_i are the corresponding eigenvalues, so that Aq_i = λ_i q_i.
Since the eigenvectors form an orthonormal basis, any vector x can be written in that basis:
x = x̃_1 q_1 + · · · + x̃_D q_D,   where x̃ = Q^T x and x = Qx̃.
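To make the decomposition concrete, here is a minimal NumPy sketch (the matrix and vector values are arbitrary, chosen only for illustration):

```python
import numpy as np

# A small symmetric matrix (values chosen just for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# np.linalg.eigh returns the eigenvalues (diagonal of Lambda) and
# orthonormal eigenvectors (columns of Q) of a symmetric matrix.
lam, Q = np.linalg.eigh(A)

# Check the decomposition A = Q diag(lam) Q^T.
assert np.allclose(A, Q @ np.diag(lam) @ Q.T)

# Change of basis: coordinates of x in the eigenvector basis.
x = np.array([1.0, -2.0])
x_tilde = Q.T @ x                    # x̃ = Q^T x
assert np.allclose(x, Q @ x_tilde)   # x = Q x̃
```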
[Figure: example quadratic forms that are positive definite, positive semidefinite (PSD, non-strictly positive), negative definite, and indefinite.]
If v^T A v > 0 for all v ≠ 0, i.e. the quadratic form curves upwards, we say A is positive definite, denoted A ≻ 0.
If v^T A v ≥ 0 for all v, we say A is positive semidefinite (PSD), denoted A ⪰ 0.
If v^T A v < 0 for all v ≠ 0, we say A is negative definite, denoted A ≺ 0.
PSD Matrices
Exercise: Show from the definition that nonnegative linear combinations of PSD matrices are PSD.
Writing ṽ = Q^T v,
v^T A v = v^T QΛQ^T v
        = ṽ^T Λ ṽ
        = Σ_i λ_i ṽ_i²
) This is positive (nonnegative) for all v iff all the λ_i are positive (nonnegative).
Example:
A = ( 0.5  0 )
    ( 0    1 )
f(v) = v^T A v = Σ_i a_i v_i²
Example:
A = (  1  −1 )
    ( −1   2 )
f(v) = v^T A v = v^T QΛQ^T v = ṽ^T Λ ṽ = Σ_i λ_i ṽ_i²
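A small NumPy sketch of this computation, using the second example matrix above (not from the lecture; the test vector is arbitrary):

```python
import numpy as np

# Non-diagonal example from above.
A = np.array([[ 1.0, -1.0],
              [-1.0,  2.0]])

lam, Q = np.linalg.eigh(A)      # eigenvalues λ_i and eigenvectors (columns of Q)
print(lam)                      # both positive, so A is positive definite

v = np.array([0.7, -1.3])       # arbitrary test vector
v_tilde = Q.T @ v               # coordinates of v in the eigenvector basis

direct = v @ A @ v                      # v^T A v computed directly
via_eigs = np.sum(lam * v_tilde**2)     # Σ_i λ_i ṽ_i²
assert np.allclose(direct, via_eigs)
```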
A² = (QΛQ^T)² = QΛ Q^TQ ΛQ^T = QΛ²Q^T   (since Q^TQ = I)
Iterating this, for any integer k > 0,
A^k = QΛ^k Q^T.
Similarly, if A is invertible, then
A^{-1} = (Q^T)^{-1} Λ^{-1} Q^{-1} = QΛ^{-1}Q^T.
Likewise, for a PSD matrix we can define the square root
A^{1/2} = QΛ^{1/2}Q^T.
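A quick sketch checking these identities numerically with NumPy, reusing the example matrix from above (the helper name eig_power is just for illustration):

```python
import numpy as np

A = np.array([[ 1.0, -1.0],
              [-1.0,  2.0]])    # symmetric positive definite

lam, Q = np.linalg.eigh(A)

def eig_power(eigvals, eigvecs, p):
    """Apply a power p to A through its eigenvalues: Q Λ^p Q^T."""
    return eigvecs @ np.diag(eigvals**p) @ eigvecs.T

assert np.allclose(eig_power(lam, Q, 2), A @ A)              # A^2
assert np.allclose(eig_power(lam, Q, -1), np.linalg.inv(A))  # A^{-1}
A_half = eig_power(lam, Q, 0.5)                              # A^{1/2}
assert np.allclose(A_half @ A_half, A)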
Mean
μ = E[x] = ( μ_1 )
           (  ⋮  )
           ( μ_D )
Covariance
Σ = Cov(x) = E[(x − μ)(x − μ)^T] = ( σ_1²  σ_12  ···  σ_1D )
                                   ( σ_12  σ_2²  ···  σ_2D )
                                   (  ⋮     ⋮    ⋱     ⋮   )
                                   ( σ_D1  σ_D2  ···  σ_D² )
N(x; μ, σ²) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )
Recall that in the univariate case, all normal distributions are shaped
like the standard normal distribution
The densities are related to the standard normal by a shift (µ), a scale
(or stretch, or dilation) σ, and a normalization factor
If x ∼ N(0, I) and x̂ = Sx + b, then
E[x̂] = S E[x] + b = b
Cov(x̂) = S Cov(x) S^T = SS^T
so the transformed variable Sx + b has mean b and covariance SS^T.
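A minimal sketch verifying this empirically by sampling, with arbitrary illustrative choices of S and b:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 2
S = np.array([[1.0, 0.0],
              [0.5, 2.0]])      # illustrative transformation
b = np.array([3.0, -1.0])

# Draw standard normal samples and transform them: x_hat = S x + b.
x = rng.standard_normal((100_000, D))
x_hat = x @ S.T + b

print(x_hat.mean(axis=0))              # ≈ b
print(np.cov(x_hat, rowvar=False))     # ≈ S S^T
print(S @ S.T)
```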
Given data x^{(1)}, . . . , x^{(N)}, the log-likelihood is
ℓ(μ, Σ) = Σ_{i=1}^N log [ (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −½ (x^{(i)} − μ)^T Σ^{-1} (x^{(i)} − μ) ) ]
        = Σ_{i=1}^N [ −log (2π)^{D/2} − log |Σ|^{1/2} − ½ (x^{(i)} − μ)^T Σ^{-1} (x^{(i)} − μ) ]
where the first term, −log (2π)^{D/2}, is a constant.
Optional intuition building: why does |Σ|^{1/2} show up in the Gaussian density p(x)?
Hint: the determinant is the product of the eigenvalues.
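A quick numerical check of the hint (the covariance matrix below is made up for illustration):

```python
import numpy as np

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])          # arbitrary covariance matrix

eigvals = np.linalg.eigvalsh(Sigma)      # eigenvalues of a symmetric matrix
print(np.prod(eigvals))                  # product of eigenvalues ...
print(np.linalg.det(Sigma))              # ... equals the determinant |Σ|
```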
To maximize, set the derivative with respect to μ to zero:
0 = dℓ/dμ = −Σ_{i=1}^N d/dμ [ ½ (x^{(i)} − μ)^T Σ^{-1} (x^{(i)} − μ) ]
          = Σ_{i=1}^N Σ^{-1} (x^{(i)} − μ)
In the univariate case:
0 = ∂ℓ/∂μ = (1/σ²) Σ_{i=1}^N (x^{(i)} − μ)   ⟹   μ̂_ML = (1/N) Σ_{i=1}^N x^{(i)}

0 = ∂ℓ/∂σ = ∂/∂σ Σ_{i=1}^N [ −½ log 2π − log σ − (1/(2σ²)) (x^{(i)} − μ)² ]
          = Σ_{i=1}^N [ −∂/∂σ (½ log 2π) − ∂/∂σ (log σ) − ∂/∂σ ( (1/(2σ²)) (x^{(i)} − μ)² ) ]
          = Σ_{i=1}^N [ 0 − 1/σ + (1/σ³) (x^{(i)} − μ)² ]
          = −N/σ + (1/σ³) Σ_{i=1}^N (x^{(i)} − μ)²
⟹   σ̂_ML = sqrt( (1/N) Σ_{i=1}^N (x^{(i)} − μ̂_ML)² )
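A minimal sketch of these estimators in the univariate case, using synthetic data with known parameters so the estimates can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from a Gaussian with known parameters (for checking).
true_mu, true_sigma = 1.5, 0.8
x = rng.normal(true_mu, true_sigma, size=10_000)

# Maximum likelihood estimates derived above (μ̂ plugged in for μ).
mu_ml = x.mean()                               # μ̂_ML = (1/N) Σ x^(i)
sigma_ml = np.sqrt(np.mean((x - mu_ml)**2))    # σ̂_ML = sqrt((1/N) Σ (x^(i) − μ̂)²)

print(mu_ml, sigma_ml)    # ≈ 1.5, ≈ 0.8
```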
w = (Ψ^T Ψ + λI)^{-1} Ψ^T t
Solution 2: solve approximately using gradient descent
w ← (1 − αλ) w − α Ψ^T (y − t)
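A sketch comparing the two solutions on synthetic data (the design matrix, targets, λ, learning rate α, and iteration count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix Ψ (features) and targets t, just for illustration.
N, D = 200, 5
Psi = rng.standard_normal((N, D))
t = Psi @ rng.standard_normal(D) + 0.1 * rng.standard_normal(N)

lam, alpha = 0.1, 0.001

# Solution 1: closed form, w = (Ψ^T Ψ + λI)^{-1} Ψ^T t.
w_closed = np.linalg.solve(Psi.T @ Psi + lam * np.eye(D), Psi.T @ t)

# Solution 2: gradient descent, w ← (1 − αλ) w − α Ψ^T (y − t).
w = np.zeros(D)
for _ in range(20_000):
    y = Psi @ w
    w = (1 - alpha * lam) * w - alpha * Psi.T @ (y - t)

print(np.max(np.abs(w - w_closed)))   # the two solutions should nearly agree
```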
Linear Regression as Maximum Likelihood
t | x ∼ N(w^T ψ(x), σ²)
(1/N) Σ_{i=1}^N log p(t^{(i)} | x^{(i)}; w, b) = (1/N) Σ_{i=1}^N log N(t^{(i)}; w^T ψ(x^{(i)}), σ²)
    = (1/N) Σ_{i=1}^N log [ (1 / (√(2π) σ)) exp( −(t^{(i)} − w^T ψ(x^{(i)}))² / (2σ²) ) ]
    = const − (1/(2Nσ²)) Σ_{i=1}^N (t^{(i)} − w^T ψ(x^{(i)}))²
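A small numerical check that the average log-likelihood really decomposes into a constant minus the scaled sum of squared errors (the data and weight vector are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data and an arbitrary weight vector w (here ψ(x) = x).
N, D = 100, 3
Psi = rng.standard_normal((N, D))
t = Psi @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(N)
w, sigma = rng.standard_normal(D), 0.3

# Average Gaussian log-likelihood, computed directly from the density.
resid = t - Psi @ w
avg_ll = np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# The "const − (1/(2Nσ²)) Σ (t − w^T ψ(x))²" form from the derivation.
const = -0.5 * np.log(2 * np.pi * sigma**2)
avg_ll_decomposed = const - np.sum(resid**2) / (2 * N * sigma**2)

assert np.allclose(avg_ll, avg_ll_decomposed)
```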
With a Gaussian prior w ∼ N(m, S) on the weights,
log p(w) = −½ (w − m)^T S^{-1} (w − m) + const.
In Gaussian Discriminant Analysis (GDA), each class-conditional density is a multivariate Gaussian:
p(x | t = k) = (1 / ((2π)^{D/2} |Σ_k|^{1/2})) exp( −½ (x − μ_k)^T Σ_k^{-1} (x − μ_k) ),
where |Σ_k| denotes the determinant of the matrix.
Each class k has an associated mean vector μ_k and covariance matrix Σ_k.
How many parameters?
) Each μ_k has D parameters, for DK total.
) Each Σ_k has O(D²) parameters, for O(D²K) total, which could be hard to estimate (more on that later).
Σ_k = (1 / Σ_{i=1}^N r_k^{(i)}) Σ_{i=1}^N r_k^{(i)} (x^{(i)} − μ_k)(x^{(i)} − μ_k)^T
where r_k^{(i)} = 1[t^{(i)} = k].
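A minimal sketch of fitting the GDA parameters by maximum likelihood (the helper name fit_gda and the toy data are just for illustration):

```python
import numpy as np

def fit_gda(X, t, K):
    """Fit class priors, means, and covariances by maximum likelihood.
    X: (N, D) data matrix, t: (N,) integer class labels in {0, ..., K-1}."""
    pi, mu, Sigma = [], [], []
    for k in range(K):
        r_k = (t == k)                        # indicators r_k^(i) = 1[t^(i) = k]
        X_k = X[r_k]
        pi.append(r_k.mean())                 # class prior p(t = k)
        mu_k = X_k.mean(axis=0)               # class mean μ_k
        diff = X_k - mu_k
        Sigma.append(diff.T @ diff / r_k.sum())   # Σ_k from the formula above
        mu.append(mu_k)
    return np.array(pi), np.array(mu), np.array(Sigma)

# Tiny usage example with made-up data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([3, 3], 0.5, (50, 2))])
t = np.array([0] * 50 + [1] * 50)
pi, mu, Sigma = fit_gda(X, t, K=2)
print(pi, mu, sep="\n")
```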
GDA Decision Boundary
Recall: for Bayes classifiers, we compute the decision boundary with Bayes' Rule:
p(t | x) = p(t) p(x | t) / Σ_{t'} p(t') p(x | t')
Taking the log and plugging in the Gaussian class-conditional density:
log p(t_k | x) = log p(x | t_k) + log p(t_k) − log p(x)
             = −½ (x − μ_k)^T Σ_k^{-1} (x − μ_k) − ½ log |Σ_k| − (D/2) log 2π + log p(t_k) − log p(x)
Decision boundary:
(x − μ_k)^T Σ_k^{-1} (x − μ_k) = (x − μ_ℓ)^T Σ_ℓ^{-1} (x − μ_ℓ) + Const
What's the shape of the boundary?
) We have a quadratic function in x, so the decision boundary is a conic section!
GDA Decision Boundary
[Figure: class likelihoods and the posterior for t_1, with the discriminant P(t_1 | x) = 0.5 marked; the class variances may be different.]
(x − μ_k)^T Σ_k^{-1} (x − μ_k) = (x − μ_ℓ)^T Σ_ℓ^{-1} (x − μ_ℓ) + Const
x^T Σ_k^{-1} x − 2 μ_k^T Σ_k^{-1} x = x^T Σ_ℓ^{-1} x − 2 μ_ℓ^T Σ_ℓ^{-1} x + Const
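A sketch of classifying with these quadratic discriminants; the helper name and the class parameters below are made up for illustration:

```python
import numpy as np

def gda_log_posterior_unnorm(x, pi, mu, Sigma):
    """Log p(t = k | x) up to the shared −log p(x) and −(D/2) log 2π terms:
    −½(x − μ_k)^T Σ_k^{-1}(x − μ_k) − ½ log|Σ_k| + log p(t_k) for each class k."""
    scores = []
    for pi_k, mu_k, Sigma_k in zip(pi, mu, Sigma):
        diff = x - mu_k
        quad = diff @ np.linalg.solve(Sigma_k, diff)
        scores.append(-0.5 * quad
                      - 0.5 * np.log(np.linalg.det(Sigma_k))
                      + np.log(pi_k))
    return np.array(scores)

# Two made-up classes with different covariances (so the boundary is a conic).
pi = [0.5, 0.5]
mu = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

x = np.array([1.0, 1.2])
print(np.argmax(gda_log_posterior_unnorm(x, pi, mu, Sigma)))  # predicted class
```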
What if x is high-dimensional?
) The Σ_k have O(D²K) parameters, which can be a problem if D is large.
) We already saw we can save a factor of K by using a shared covariance for the classes.
) Any other idea you can think of?
Naive Bayes: assumes the features are independent given the class:
p(x | t = k) = Π_{j=1}^D p(x_j | t = k)
As before, the maximum likelihood estimates use the class indicators r_k^{(i)} = 1[t^{(i)} = k].
Classification again uses Bayes' Rule; with the naive Bayes factorization,
log p(t_k | x) = Σ_{j=1}^D log p(x_j | t_k) + log p(t_k) − log p(x)
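A minimal sketch of Gaussian naive Bayes, with one univariate Gaussian per feature per class (function names and data are illustrative, not from the lecture):

```python
import numpy as np

def fit_gaussian_nb(X, t, K):
    """Per-class, per-feature means and variances, plus class priors."""
    pi = np.array([(t == k).mean() for k in range(K)])
    mu = np.array([X[t == k].mean(axis=0) for k in range(K)])   # shape (K, D)
    var = np.array([X[t == k].var(axis=0) for k in range(K)])   # shape (K, D)
    return pi, mu, var

def predict_gaussian_nb(x, pi, mu, var):
    """argmax_k [ Σ_j log p(x_j | t = k) + log p(t = k) ]."""
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu)**2 / var, axis=1)
    return np.argmax(log_lik + np.log(pi))

# Tiny usage example with made-up data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([3, 3], 0.5, (50, 2))])
t = np.array([0] * 50 + [1] * 50)

pi, mu, var = fit_gaussian_nb(X, t, K=2)
print(predict_gaussian_nb(np.array([2.5, 2.8]), pi, mu, var))   # likely class 1
```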
Generative models:
) Flexible models, easy to add/remove class.
) Handle missing data naturally.
) A more “natural” way to think about the data, but usually doesn’t predict as well as discriminative approaches.