lec08
A = QΛQ^T
where
) Q is an orthogonal matrix.
) The columns q_i of Q are eigenvectors of A.
) Λ is a diagonal matrix.
) The diagonal entries λ_i are the corresponding eigenvalues, so that Aq_i = λ_i q_i.
Since the eigenvectors form an orthonormal basis, any vector x can be written in that basis:
x = x̃_1 q_1 + · · · + x̃_D q_D,   where x̃ = Q^T x and x = Qx̃.
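To make the decomposition concrete, here is a minimal NumPy sketch (the matrix and vector values are arbitrary, chosen only for illustration):

```python
import numpy as np

# A small symmetric matrix (values chosen just for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# np.linalg.eigh returns the eigenvalues (diagonal of Lambda) and
# orthonormal eigenvectors (columns of Q) of a symmetric matrix.
lam, Q = np.linalg.eigh(A)

# Check the decomposition A = Q diag(lam) Q^T.
assert np.allclose(A, Q @ np.diag(lam) @ Q.T)

# Change of basis: coordinates of x in the eigenvector basis.
x = np.array([1.0, -2.0])
x_tilde = Q.T @ x                    # x̃ = Q^T x
assert np.allclose(x, Q @ x_tilde)   # x = Q x̃
```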
[Figure: example quadratic forms that are positive definite, positive semidefinite (PSD, non-strictly positive), negative definite, and indefinite.]
If v^T A v > 0 for all v ≠ 0, i.e. the quadratic form curves upwards, we say A is positive definite, denoted A ≻ 0.
If v^T A v ≥ 0 for all v, we say A is positive semidefinite (PSD), denoted A ⪰ 0.
If v^T A v < 0 for all v ≠ 0, we say A is negative definite, denoted A ≺ 0.
PSD Matrices
Exercise: Show from the definition that nonnegative linear combinations of PSD matrices are PSD.
Writing ṽ = Q^T v,
v^T A v = v^T QΛQ^T v
        = ṽ^T Λ ṽ
        = Σ_i λ_i ṽ_i²
) This is positive (nonnegative) for all v iff all the λ_i are positive (nonnegative).
Example:
A = ( 0.5  0 )
    ( 0    1 )
f(v) = v^T A v = Σ_i a_i v_i²
Example:
A = (  1  −1 )
    ( −1   2 )
f(v) = v^T A v = v^T QΛQ^T v = ṽ^T Λ ṽ = Σ_i λ_i ṽ_i²
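A small NumPy sketch of this computation, using the second example matrix above (not from the lecture; the test vector is arbitrary):

```python
import numpy as np

# Non-diagonal example from above.
A = np.array([[ 1.0, -1.0],
              [-1.0,  2.0]])

lam, Q = np.linalg.eigh(A)      # eigenvalues λ_i and eigenvectors (columns of Q)
print(lam)                      # both positive, so A is positive definite

v = np.array([0.7, -1.3])       # arbitrary test vector
v_tilde = Q.T @ v               # coordinates of v in the eigenvector basis

direct = v @ A @ v                      # v^T A v computed directly
via_eigs = np.sum(lam * v_tilde**2)     # Σ_i λ_i ṽ_i²
assert np.allclose(direct, via_eigs)
```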
A² = (QΛQ^T)² = QΛ Q^TQ ΛQ^T = QΛ²Q^T   (since Q^TQ = I)
Iterating this, for any integer k > 0,
A^k = QΛ^k Q^T.
Similarly, if A is invertible, then
A^{-1} = (Q^T)^{-1} Λ^{-1} Q^{-1} = QΛ^{-1}Q^T.
Likewise, for a PSD matrix we can define the square root
A^{1/2} = QΛ^{1/2}Q^T.
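A quick sketch checking these identities numerically with NumPy, reusing the example matrix from above (the helper name eig_power is just for illustration):

```python
import numpy as np

A = np.array([[ 1.0, -1.0],
              [-1.0,  2.0]])    # symmetric positive definite

lam, Q = np.linalg.eigh(A)

def eig_power(eigvals, eigvecs, p):
    """Apply a power p to A through its eigenvalues: Q Λ^p Q^T."""
    return eigvecs @ np.diag(eigvals**p) @ eigvecs.T

assert np.allclose(eig_power(lam, Q, 2), A @ A)              # A^2
assert np.allclose(eig_power(lam, Q, -1), np.linalg.inv(A))  # A^{-1}
A_half = eig_power(lam, Q, 0.5)                              # A^{1/2}
assert np.allclose(A_half @ A_half, A)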
Mean
μ = E[x] = ( μ_1 )
           (  ⋮  )
           ( μ_D )
Covariance
Σ = Cov(x) = E[(x − μ)(x − μ)^T] = ( σ_1²  σ_12  ···  σ_1D )
                                   ( σ_12  σ_2²  ···  σ_2D )
                                   (  ⋮     ⋮    ⋱     ⋮   )
                                   ( σ_D1  σ_D2  ···  σ_D² )
N(x; μ, σ²) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )
Recall that in the univariate case, all normal distributions are shaped
like the standard normal distribution
The densities are related to the standard normal by a shift (µ), a scale
(or stretch, or dilation) σ, and a normalization factor
If x ∼ N(0, I) and x̂ = Sx + b, then
E[x̂] = S E[x] + b = b
Cov(x̂) = S Cov(x) S^T = SS^T
so the transformed variable Sx + b has mean b and covariance SS^T.
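A minimal sketch verifying this empirically by sampling, with arbitrary illustrative choices of S and b:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 2
S = np.array([[1.0, 0.0],
              [0.5, 2.0]])      # illustrative transformation
b = np.array([3.0, -1.0])

# Draw standard normal samples and transform them: x_hat = S x + b.
x = rng.standard_normal((100_000, D))
x_hat = x @ S.T + b

print(x_hat.mean(axis=0))              # ≈ b
print(np.cov(x_hat, rowvar=False))     # ≈ S S^T
print(S @ S.T)
```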
Given data x^{(1)}, . . . , x^{(N)}, the log-likelihood is
ℓ(μ, Σ) = Σ_{i=1}^N log [ (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −½ (x^{(i)} − μ)^T Σ^{-1} (x^{(i)} − μ) ) ]
        = Σ_{i=1}^N [ −log (2π)^{D/2} − log |Σ|^{1/2} − ½ (x^{(i)} − μ)^T Σ^{-1} (x^{(i)} − μ) ]
where the first term, −log (2π)^{D/2}, is a constant.
Optional intuition building: why does |Σ|^{1/2} show up in the Gaussian density p(x)?
Hint: the determinant is the product of the eigenvalues.
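A quick numerical check of the hint (the covariance matrix below is made up for illustration):

```python
import numpy as np

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])          # arbitrary covariance matrix

eigvals = np.linalg.eigvalsh(Sigma)      # eigenvalues of a symmetric matrix
print(np.prod(eigvals))                  # product of eigenvalues ...
print(np.linalg.det(Sigma))              # ... equals the determinant |Σ|
```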
To maximize, set the derivative with respect to μ to zero:
0 = dℓ/dμ = −Σ_{i=1}^N d/dμ [ ½ (x^{(i)} − μ)^T Σ^{-1} (x^{(i)} − μ) ]
          = Σ_{i=1}^N Σ^{-1} (x^{(i)} − μ)
In the univariate case:
0 = ∂ℓ/∂μ = (1/σ²) Σ_{i=1}^N (x^{(i)} − μ)   ⟹   μ̂_ML = (1/N) Σ_{i=1}^N x^{(i)}

0 = ∂ℓ/∂σ = ∂/∂σ Σ_{i=1}^N [ −½ log 2π − log σ − (1/(2σ²)) (x^{(i)} − μ)² ]
          = Σ_{i=1}^N [ −∂/∂σ (½ log 2π) − ∂/∂σ (log σ) − ∂/∂σ ( (1/(2σ²)) (x^{(i)} − μ)² ) ]
          = Σ_{i=1}^N [ 0 − 1/σ + (1/σ³) (x^{(i)} − μ)² ]
          = −N/σ + (1/σ³) Σ_{i=1}^N (x^{(i)} − μ)²
⟹   σ̂_ML = sqrt( (1/N) Σ_{i=1}^N (x^{(i)} − μ̂_ML)² )
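A minimal sketch of these estimators in the univariate case, using synthetic data with known parameters so the estimates can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from a Gaussian with known parameters (for checking).
true_mu, true_sigma = 1.5, 0.8
x = rng.normal(true_mu, true_sigma, size=10_000)

# Maximum likelihood estimates derived above (μ̂ plugged in for μ).
mu_ml = x.mean()                               # μ̂_ML = (1/N) Σ x^(i)
sigma_ml = np.sqrt(np.mean((x - mu_ml)**2))    # σ̂_ML = sqrt((1/N) Σ (x^(i) − μ̂)²)

print(mu_ml, sigma_ml)    # ≈ 1.5, ≈ 0.8
```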
w = (Ψ^T Ψ + λI)^{-1} Ψ^T t
Solution 2: solve approximately using gradient descent
w ← (1 − αλ) w − α Ψ^T (y − t)
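A sketch comparing the two solutions on synthetic data (the design matrix, targets, λ, learning rate α, and iteration count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix Ψ (features) and targets t, just for illustration.
N, D = 200, 5
Psi = rng.standard_normal((N, D))
t = Psi @ rng.standard_normal(D) + 0.1 * rng.standard_normal(N)

lam, alpha = 0.1, 0.001

# Solution 1: closed form, w = (Ψ^T Ψ + λI)^{-1} Ψ^T t.
w_closed = np.linalg.solve(Psi.T @ Psi + lam * np.eye(D), Psi.T @ t)

# Solution 2: gradient descent, w ← (1 − αλ) w − α Ψ^T (y − t).
w = np.zeros(D)
for _ in range(20_000):
    y = Psi @ w
    w = (1 - alpha * lam) * w - alpha * Psi.T @ (y - t)

print(np.max(np.abs(w - w_closed)))   # the two solutions should nearly agree
```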
Linear Regression as Maximum Likelihood
t | x ∼ N(w^T ψ(x), σ²)
(1/N) Σ_{i=1}^N log p(t^{(i)} | x^{(i)}; w, b) = (1/N) Σ_{i=1}^N log N(t^{(i)}; w^T ψ(x^{(i)}), σ²)
    = (1/N) Σ_{i=1}^N log [ (1 / (√(2π) σ)) exp( −(t^{(i)} − w^T ψ(x^{(i)}))² / (2σ²) ) ]
    = const − (1/(2Nσ²)) Σ_{i=1}^N (t^{(i)} − w^T ψ(x^{(i)}))²
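A small numerical check that the average log-likelihood really decomposes into a constant minus the scaled sum of squared errors (the data and weight vector are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data and an arbitrary weight vector w (here ψ(x) = x).
N, D = 100, 3
Psi = rng.standard_normal((N, D))
t = Psi @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(N)
w, sigma = rng.standard_normal(D), 0.3

# Average Gaussian log-likelihood, computed directly from the density.
resid = t - Psi @ w
avg_ll = np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# The "const − (1/(2Nσ²)) Σ (t − w^T ψ(x))²" form from the derivation.
const = -0.5 * np.log(2 * np.pi * sigma**2)
avg_ll_decomposed = const - np.sum(resid**2) / (2 * N * sigma**2)

assert np.allclose(avg_ll, avg_ll_decomposed)
```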
With a Gaussian prior w ∼ N(m, S) on the weights,
log p(w) = −½ (w − m)^T S^{-1} (w − m) + const.
In Gaussian Discriminant Analysis (GDA), each class-conditional density is a multivariate Gaussian:
p(x | t = k) = (1 / ((2π)^{D/2} |Σ_k|^{1/2})) exp( −½ (x − μ_k)^T Σ_k^{-1} (x − μ_k) ),
where |Σ_k| denotes the determinant of the matrix.
Each class k has an associated mean vector μ_k and covariance matrix Σ_k.
How many parameters?
) Each μ_k has D parameters, for DK total.
) Each Σ_k has O(D²) parameters, for O(D²K) total, which could be hard to estimate (more on that later).
Σ_k = (1 / Σ_{i=1}^N r_k^{(i)}) Σ_{i=1}^N r_k^{(i)} (x^{(i)} − μ_k)(x^{(i)} − μ_k)^T
where r_k^{(i)} = 1[t^{(i)} = k].
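A minimal sketch of fitting the GDA parameters by maximum likelihood (the helper name fit_gda and the toy data are just for illustration):

```python
import numpy as np

def fit_gda(X, t, K):
    """Fit class priors, means, and covariances by maximum likelihood.
    X: (N, D) data matrix, t: (N,) integer class labels in {0, ..., K-1}."""
    pi, mu, Sigma = [], [], []
    for k in range(K):
        r_k = (t == k)                        # indicators r_k^(i) = 1[t^(i) = k]
        X_k = X[r_k]
        pi.append(r_k.mean())                 # class prior p(t = k)
        mu_k = X_k.mean(axis=0)               # class mean μ_k
        diff = X_k - mu_k
        Sigma.append(diff.T @ diff / r_k.sum())   # Σ_k from the formula above
        mu.append(mu_k)
    return np.array(pi), np.array(mu), np.array(Sigma)

# Tiny usage example with made-up data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([3, 3], 0.5, (50, 2))])
t = np.array([0] * 50 + [1] * 50)
pi, mu, Sigma = fit_gda(X, t, K=2)
print(pi, mu, sep="\n")
```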
GDA Decision Boundary
Recall: for Bayes classifiers, we compute the decision boundary with Bayes' Rule:
p(t | x) = p(t) p(x | t) / Σ_{t'} p(t') p(x | t')
Taking the log and plugging in the Gaussian class-conditional density:
log p(t_k | x) = log p(x | t_k) + log p(t_k) − log p(x)
             = −½ (x − μ_k)^T Σ_k^{-1} (x − μ_k) − ½ log |Σ_k| − (D/2) log 2π + log p(t_k) − log p(x)
Decision boundary:
(x − μ_k)^T Σ_k^{-1} (x − μ_k) = (x − μ_ℓ)^T Σ_ℓ^{-1} (x − μ_ℓ) + Const
What's the shape of the boundary?
) We have a quadratic function in x, so the decision boundary is a conic section!
GDA Decision Boundary
[Figure: class likelihoods and the posterior for t_1, with the discriminant P(t_1 | x) = 0.5 marked; the class variances may be different.]
(x − μ_k)^T Σ_k^{-1} (x − μ_k) = (x − μ_ℓ)^T Σ_ℓ^{-1} (x − μ_ℓ) + Const
x^T Σ_k^{-1} x − 2 μ_k^T Σ_k^{-1} x = x^T Σ_ℓ^{-1} x − 2 μ_ℓ^T Σ_ℓ^{-1} x + Const
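A sketch of classifying with these quadratic discriminants; the helper name and the class parameters below are made up for illustration:

```python
import numpy as np

def gda_log_posterior_unnorm(x, pi, mu, Sigma):
    """Log p(t = k | x) up to the shared −log p(x) and −(D/2) log 2π terms:
    −½(x − μ_k)^T Σ_k^{-1}(x − μ_k) − ½ log|Σ_k| + log p(t_k) for each class k."""
    scores = []
    for pi_k, mu_k, Sigma_k in zip(pi, mu, Sigma):
        diff = x - mu_k
        quad = diff @ np.linalg.solve(Sigma_k, diff)
        scores.append(-0.5 * quad
                      - 0.5 * np.log(np.linalg.det(Sigma_k))
                      + np.log(pi_k))
    return np.array(scores)

# Two made-up classes with different covariances (so the boundary is a conic).
pi = [0.5, 0.5]
mu = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

x = np.array([1.0, 1.2])
print(np.argmax(gda_log_posterior_unnorm(x, pi, mu, Sigma)))  # predicted class
```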
What if x is high-dimensional?
) The Σ_k have O(D²K) parameters, which can be a problem if D is large.
) We already saw we can save a factor of K by using a shared covariance for the classes.
) Any other idea you can think of?
Naive Bayes: assumes the features are independent given the class:
p(x | t = k) = Π_{j=1}^D p(x_j | t = k)
As before, the maximum likelihood estimates use the class indicators r_k^{(i)} = 1[t^{(i)} = k].
Classification again uses Bayes' Rule; with the naive Bayes factorization,
log p(t_k | x) = Σ_{j=1}^D log p(x_j | t_k) + log p(t_k) − log p(x)
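A minimal sketch of Gaussian naive Bayes, with one univariate Gaussian per feature per class (function names and data are illustrative, not from the lecture):

```python
import numpy as np

def fit_gaussian_nb(X, t, K):
    """Per-class, per-feature means and variances, plus class priors."""
    pi = np.array([(t == k).mean() for k in range(K)])
    mu = np.array([X[t == k].mean(axis=0) for k in range(K)])   # shape (K, D)
    var = np.array([X[t == k].var(axis=0) for k in range(K)])   # shape (K, D)
    return pi, mu, var

def predict_gaussian_nb(x, pi, mu, var):
    """argmax_k [ Σ_j log p(x_j | t = k) + log p(t = k) ]."""
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu)**2 / var, axis=1)
    return np.argmax(log_lik + np.log(pi))

# Tiny usage example with made-up data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([3, 3], 0.5, (50, 2))])
t = np.array([0] * 50 + [1] * 50)

pi, mu, var = fit_gaussian_nb(X, t, K=2)
print(predict_gaussian_nb(np.array([2.5, 2.8]), pi, mu, var))   # likely class 1
```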
Generative models:
) Flexible models, easy to add/remove class.
) Handle missing data naturally.
) A more “natural” way to think about the data, but usually doesn’t predict as well as discriminative approaches.