
Estimating Parameters and Predictive Distributions: Some Simple Cases

CS772A: Probabilistic Machine Learning


Piyush Rai
2
Plan today
▪ Parameter estimation (point est. and posterior) and predictive distribution for
▪ Bernoulli observation model (binary-valued observations)
▪ Multinoulli observation model (discrete-valued observations)
▪ Focus today on cases with conjugate prior on parameters (easy to compute posterior)
▪ Gaussian distribution and some of its important properties
▪ Parameter estimation and predictive distribution for Gaussian observation models

3

Bernoulli Observation Model

4
Estimating a Coin’s Bias
▪ Consider a sequence of 𝑁 coin-toss outcomes (observations)
▪ Each observation 𝑦𝑛 is a binary random variable. Head: 𝑦𝑛 = 1, Tail: 𝑦𝑛 = 0
▪ Each 𝑦𝑛 is assumed generated by a Bernoulli distribution with parameter 𝜃 ∈ (0,1), where 𝜃 is the probability of a head

  Likelihood or observation model:  𝑝(𝑦𝑛 | 𝜃) = Bernoulli(𝑦𝑛 | 𝜃) = 𝜃^{𝑦𝑛} (1 − 𝜃)^{1−𝑦𝑛}

▪ Here 𝜃 is the unknown parameter (probability of head). Let's do MLE, assuming i.i.d. data
▪ Log-likelihood: Σ_{𝑛=1}^𝑁 log 𝑝(𝑦𝑛 | 𝜃) = Σ_{𝑛=1}^𝑁 [𝑦𝑛 log 𝜃 + (1 − 𝑦𝑛) log(1 − 𝜃)]
▪ Maximizing the log-likelihood, or minimizing the negative log-likelihood (NLL), w.r.t. 𝜃 gives

  𝜃_MLE = (Σ_{𝑛=1}^𝑁 𝑦𝑛) / 𝑁

  Thus the MLE solution is simply the fraction of heads, which makes intuitive sense.
▪ But suppose I tossed a coin 5 times and got 1 head and 4 tails: does that mean 𝜃 = 0.2? What if I see 0 heads and 5 tails: does that mean 𝜃 = 0? The MLE approach says so. Indeed, with a small number of training observations, MLE may overfit and may not be reliable. An alternative is MAP estimation, which can incorporate a prior distribution over 𝜃.
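▪ As a quick illustration (a minimal sketch, not part of the original slides; the data below is made up), the MLE formula above is just the empirical fraction of heads:

    import numpy as np

    # Hypothetical coin-toss data: 1 = head, 0 = tail (5 tosses, 1 head)
    y = np.array([1, 0, 0, 0, 0])

    # MLE of the Bernoulli parameter: fraction of heads
    theta_mle = y.mean()
    print(theta_mle)  # 0.2, illustrating how MLE can latch onto a small sample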
5
Estimating a Coin’s Bias
▪ Let’s do MAP estimation for the bias of the coin
▪ Each likelihood term is Bernoulli
𝑝(𝑦𝑛 | 𝜃) = Bernoulli(𝑦𝑛 | 𝜃) = 𝜃^{𝑦𝑛} (1 − 𝜃)^{1−𝑦𝑛}
▪ Also need a prior since we want to do MAP estimation
▪ Since 𝜃 ∈ (0,1), a reasonable choice of prior for 𝜃 would be Beta distribution
  𝑝(𝜃 | 𝛼, 𝛽) = Beta(𝜃 | 𝛼, 𝛽) = [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼−1} (1 − 𝜃)^{𝛽−1}

  Here Γ(·) is the gamma function, and 𝛼 and 𝛽 (both non-negative reals) are the two hyperparameters of this Beta prior. Using 𝛼 = 1 and 𝛽 = 1 makes the Beta prior a uniform prior. These hyperparameters can be set based on intuition, cross-validation, or even learned.
6
Estimating a Coin’s Bias
▪ The log posterior for the coin-toss model is log-likelihood + log-prior (up to a constant)

  𝐿𝑃(𝜃) = Σ_{𝑛=1}^𝑁 log 𝑝(𝑦𝑛 | 𝜃) + log 𝑝(𝜃 | 𝛼, 𝛽)

▪ Plugging in the expressions for Bernoulli and Beta and ignoring any terms that don't depend on 𝜃, the log posterior simplifies to

  𝐿𝑃(𝜃) = Σ_{𝑛=1}^𝑁 [𝑦𝑛 log 𝜃 + (1 − 𝑦𝑛) log(1 − 𝜃)] + (𝛼 − 1) log 𝜃 + (𝛽 − 1) log(1 − 𝜃)

▪ Maximizing the above log posterior (or minimizing its negative) w.r.t. 𝜃 gives

  𝜃_MAP = (Σ_{𝑛=1}^𝑁 𝑦𝑛 + 𝛼 − 1) / (𝑁 + 𝛼 + 𝛽 − 2)

▪ Using 𝛼 = 1 and 𝛽 = 1 gives the same solution as MLE. Recall that 𝛼 = 1 and 𝛽 = 1 makes the Beta distribution a uniform prior (hence making MAP equivalent to MLE).
▪ The prior's hyperparameters have an interesting interpretation: we can think of 𝛼 − 1 and 𝛽 − 1 as the numbers of heads and tails, respectively, seen before starting the coin-toss experiment (akin to "pseudo-observations"). Such interpretations of a prior's hyperparameters as pseudo-observations exist for various other prior distributions as well, in particular distributions belonging to the "exponential family" of distributions.
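▪ A minimal sketch of the MAP formula above (not from the original slides), using the same hypothetical 5 tosses and illustrative hyperparameters 𝛼 = 𝛽 = 2 (i.e., one pseudo-head and one pseudo-tail):

    import numpy as np

    y = np.array([1, 0, 0, 0, 0])   # hypothetical tosses: 1 head, 4 tails
    alpha, beta = 2.0, 2.0          # assumed Beta hyperparameters

    N = len(y)
    theta_map = (y.sum() + alpha - 1) / (N + alpha + beta - 2)
    print(theta_map)                # 2/7 ≈ 0.286, pulled away from the MLE of 0.2 by the prior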
7
The Posterior Distribution
▪ Let’s do fully Bayesian inference and compute the posterior distribution
▪ Bernoulli likelihood: 𝑝(𝑦𝑛 | 𝜃) = Bernoulli(𝑦𝑛 | 𝜃) = 𝜃^{𝑦𝑛} (1 − 𝜃)^{1−𝑦𝑛}
▪ Beta prior: 𝑝(𝜃) = Beta(𝜃 | 𝛼, 𝛽) = [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼−1} (1 − 𝜃)^{𝛽−1}
▪ The posterior can be computed as (hyperparameters 𝛼, 𝛽 not shown on the conditioning side, for brevity)

  𝑝(𝜃 | 𝒚) = 𝑝(𝜃) 𝑝(𝒚 | 𝜃) / 𝑝(𝒚) = 𝑝(𝜃) Π_{𝑛=1}^𝑁 𝑝(𝑦𝑛 | 𝜃) / 𝑝(𝒚)
           = [Γ(𝛼+𝛽)/(Γ(𝛼)Γ(𝛽))] 𝜃^{𝛼−1}(1−𝜃)^{𝛽−1} Π_{𝑛=1}^𝑁 𝜃^{𝑦𝑛}(1−𝜃)^{1−𝑦𝑛}  /  ∫ [Γ(𝛼+𝛽)/(Γ(𝛼)Γ(𝛽))] 𝜃^{𝛼−1}(1−𝜃)^{𝛽−1} Π_{𝑛=1}^𝑁 𝜃^{𝑦𝑛}(1−𝜃)^{1−𝑦𝑛} 𝑑𝜃

  Note that Π_{𝑛=1}^𝑁 𝜃^{𝑦𝑛}(1−𝜃)^{1−𝑦𝑛} = 𝜃^{Σ𝑛 𝑦𝑛} (1 − 𝜃)^{𝑁 − Σ𝑛 𝑦𝑛}. Denote the number of heads by 𝑁1 = Σ_{𝑛=1}^𝑁 𝑦𝑛 and the number of tails by 𝑁0 = 𝑁 − 𝑁1.

▪ Here, even without computing the denominator (the marginal likelihood), we can identify the posterior
▪ It is a Beta distribution, since 𝑝(𝜃 | 𝒚) ∝ 𝜃^{𝛼+𝑁1−1} (1 − 𝜃)^{𝛽+𝑁0−1}
▪ Thus 𝑝(𝜃 | 𝒚) = Beta(𝜃 | 𝛼 + 𝑁1, 𝛽 + 𝑁0)
▪ Exercise: Show that the normalization constant equals Γ(𝛼 + 𝛽 + 𝑁) / [Γ(𝛼 + Σ_{𝑛=1}^𝑁 𝑦𝑛) Γ(𝛽 + 𝑁 − Σ_{𝑛=1}^𝑁 𝑦𝑛)]. Hint: use the fact that the posterior must integrate to 1, i.e., ∫ 𝑝(𝜃 | 𝒚) 𝑑𝜃 = 1
▪ Here, finding the posterior boiled down to simply "multiply, add stuff, and identify"
▪ Here, the posterior has the same form as the prior (both Beta): a property of conjugate priors.
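▪ A minimal sketch of this count-and-add posterior update (not from the original slides; scipy is used only to query the resulting Beta, and the data/hyperparameters are the same illustrative ones as before):

    import numpy as np
    from scipy.stats import beta as beta_dist

    y = np.array([1, 0, 0, 0, 0])
    alpha, beta = 2.0, 2.0                         # assumed prior hyperparameters

    N1 = int(y.sum())                              # number of heads
    N0 = len(y) - N1                               # number of tails
    posterior = beta_dist(alpha + N1, beta + N0)   # Beta(alpha + N1, beta + N0)

    print(posterior.mean())                        # posterior mean of theta
    print(posterior.interval(0.95))                # a 95% credible interval for theta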
8
Conjugacy and Conjugate Priors
▪ Many pairs of distributions are conjugate to each other
▪ Bernoulli (likelihood) + Beta (prior) ⇒ Beta posterior
▪ Binomial (likelihood) + Beta (prior) ⇒ Beta posterior
▪ Multinomial (likelihood) + Dirichlet (prior) ⇒ Dirichlet posterior
▪ Poisson (likelihood) + Gamma (prior) ⇒ Gamma posterior
▪ Gaussian (likelihood) + Gaussian (prior) ⇒ Gaussian posterior (not true in general, but in some cases, e.g., when the variance of the Gaussian likelihood is fixed)


▪ and many other such pairs ..
▪ Tip: If two distributions are conjugate to each other, their functional forms are similar
▪ Example: Bernoulli and Beta have the forms

  Bernoulli(𝑦 | 𝜃) = 𝜃^{𝑦} (1 − 𝜃)^{1−𝑦}
  Beta(𝜃 | 𝛼, 𝛽) = [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼−1} (1 − 𝜃)^{𝛽−1}

  This is why, when we multiply them while computing the posterior, the exponents get added and we get the same form for the posterior as the prior, just with updated hyperparameters. It also means we can identify the posterior and its hyperparameters simply by inspection.

▪ More on conjugate priors when we look at exponential family distributions


9
Predictive Distribution
▪ Suppose we want to compute the probability that the next outcome 𝑦_{𝑁+1} will be a head (= 1)
▪ The posterior predictive distribution (PPD) averages over all 𝜃's, weighted by their respective posterior probabilities:

  𝑝(𝑦_{𝑁+1} = 1 | 𝒚) = ∫₀¹ 𝑝(𝑦_{𝑁+1} = 1, 𝜃 | 𝒚) 𝑑𝜃 = ∫₀¹ 𝑝(𝑦_{𝑁+1} = 1 | 𝜃) 𝑝(𝜃 | 𝒚) 𝑑𝜃
                     = ∫₀¹ 𝜃 × 𝑝(𝜃 | 𝒚) 𝑑𝜃
                     = 𝔼_{𝑝(𝜃|𝒚)}[𝜃]          (expectation of 𝜃 w.r.t. the Beta posterior 𝑝(𝜃 | 𝒚) = Beta(𝜃 | 𝛼 + 𝑁1, 𝛽 + 𝑁0))
                     = (𝛼 + 𝑁1) / (𝛼 + 𝛽 + 𝑁)

▪ Therefore the PPD will be 𝑝(𝑦_{𝑁+1} | 𝒚) = Bernoulli(𝑦_{𝑁+1} | 𝔼_{𝑝(𝜃|𝒚)}[𝜃])
▪ For models where the likelihood and prior are conjugate to each other, the PPD can be computed easily in closed form (more on this when we talk about exponential family distributions)
▪ The plug-in predictive distribution using a point estimate 𝜃̂ (e.g., from MLE/MAP):

  𝑝(𝑦_{𝑁+1} = 1 | 𝒚) ≈ 𝑝(𝑦_{𝑁+1} = 1 | 𝜃̂) = 𝜃̂,   i.e.,   𝑝(𝑦_{𝑁+1} | 𝒚) ≈ Bernoulli(𝑦_{𝑁+1} | 𝜃̂)
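▪ A sketch (not from the original slides) contrasting the posterior predictive probability of a head with the plug-in predictive based on the MAP estimate, using the same illustrative data and hyperparameters as before:

    import numpy as np

    y = np.array([1, 0, 0, 0, 0])
    alpha, beta = 2.0, 2.0

    N, N1 = len(y), int(y.sum())
    p_head_ppd = (alpha + N1) / (alpha + beta + N)           # posterior predictive: E[theta | y]
    theta_map = (N1 + alpha - 1) / (N + alpha + beta - 2)    # plug-in with the MAP estimate
    print(p_head_ppd, theta_map)                             # 0.333... vs 0.2857...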
10

Multinoulli Observation Model

11
The Posterior Distribution (MLE/MAP left as an exercise)

▪ Assume 𝑁 discrete obs 𝒚 = {𝑦1 , 𝑦2 , … , 𝑦𝑁 } with each 𝑦𝑛 ∈ {1,2, … , 𝐾}, e.g.,


▪ 𝑦𝑛 represents the outcome of a dice roll with 𝐾 faces
▪ 𝑦𝑛 represents the class label of the 𝑛𝑡ℎ example in a classification problem (total 𝐾 classes)
▪ 𝑦𝑛 represents the identity of the 𝑛th word in a sequence of words
▪ Assume the likelihood to be multinoulli with unknown parameters 𝝅 = [𝜋1, 𝜋2, … , 𝜋𝐾] (these sum to 1)

  𝑝(𝑦𝑛 | 𝝅) = multinoulli(𝑦𝑛 | 𝝅) = Π_{𝑘=1}^𝐾 𝜋𝑘^{𝕀[𝑦𝑛 = 𝑘]}     (a generalization of Bernoulli to 𝐾 > 2 discrete outcomes)

▪ 𝝅 is a vector of probabilities (a "probability vector"), e.g.,
▪ Biases of the 𝐾 sides of the dice
▪ Prior class probabilities in multi-class classification (𝑝(𝑦𝑛 = 𝑘) = 𝜋𝑘)
▪ Probabilities of observing each of the 𝐾 words in a vocabulary
▪ Assume a conjugate prior (Dirichlet) on 𝝅 with hyperparameters 𝜶 = [𝛼1, 𝛼2, … , 𝛼𝐾], each 𝛼𝑘 ≥ 0
▪ The Dirichlet is a generalization of the Beta distribution to 𝐾-dimensional probability vectors. 𝜶 is called the concentration parameter of the Dirichlet (assumed known for now). Large values of 𝜶 give a Dirichlet peaked around its mean (the next slides illustrate this).
12
Brief Detour: Dirichlet Distribution

▪ An important distribution. It models non-negative vectors 𝝅 that also sum to one, i.e., probability vectors


▪ A random draw from a 𝐾-dim Dirichlet will be a point on the (𝐾−1)-dim probability simplex

  [Figure: the probability simplex of a 2-dim simplex (representing a 3-dim Dirichlet) and the coordinates of various points on the simplex, e.g., the corners (1,0,0), (0,1,0), (0,0,1), edge points such as (1/2,1/2,0), and interior points such as (1/2,1/4,1/4).]
13
Brief Detour: Dirichlet Distribution
▪ A visualization of the Dirichlet distribution for different values of the concentration parameter

  [Figure: visualizations of the PDFs of some 3-dim Dirichlet distributions, each generated using a different concentration parameter vector 𝜶, with axes 𝜋1, 𝜋2, 𝜋3. 𝜶 controls the shape of the Dirichlet (just like the Beta distribution's hyperparameters): if all 𝛼𝑘's are 1, it looks like a uniform distribution over the simplex; if all 𝛼𝑘's are large, the density peaks around the center of the simplex.]

▪ Interesting fact: we can generate a 𝐾-dim Dirichlet random variable by independently generating 𝐾 gamma random variables and normalizing them to sum to 1 (see the sketch below)
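▪ A minimal sketch of that gamma-normalization fact (not from the original slides; the concentration parameter below is just an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([2.0, 1.0, 5.0])       # illustrative concentration parameter of a 3-dim Dirichlet

    # Draw K independent Gamma(alpha_k, 1) variables and normalize them to sum to 1
    g = rng.gamma(shape=alpha, scale=1.0)
    pi = g / g.sum()
    print(pi, pi.sum())                      # a point on the 2-dim probability simplex

    # Sanity check against NumPy's built-in Dirichlet sampler
    print(rng.dirichlet(alpha))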
14
The Posterior Distribution
▪ The posterior 𝑝(𝝅 | 𝒚) is easy to compute due to conjugacy between the multinoulli and the Dirichlet

  𝑝(𝝅 | 𝒚, 𝜶) = 𝑝(𝝅, 𝒚 | 𝜶) / 𝑝(𝒚 | 𝜶) = 𝑝(𝝅 | 𝜶) 𝑝(𝒚 | 𝝅, 𝜶) / 𝑝(𝒚 | 𝜶) = 𝑝(𝝅 | 𝜶) 𝑝(𝒚 | 𝝅) / 𝑝(𝒚 | 𝜶)

  Here 𝑝(𝝅 | 𝜶) is the prior and 𝑝(𝒚 | 𝝅) is the likelihood. The marginal likelihood 𝑝(𝒚 | 𝜶) = ∫ 𝑝(𝝅 | 𝜶) 𝑝(𝒚 | 𝝅) 𝑑𝝅 doesn't need to be computed in this case because of conjugacy.

▪ Assuming the 𝑦𝑛's are i.i.d. given 𝝅, 𝑝(𝒚 | 𝝅) = Π_{𝑛=1}^𝑁 𝑝(𝑦𝑛 | 𝝅), and therefore

  𝑝(𝝅 | 𝒚, 𝜶) ∝ Π_{𝑘=1}^𝐾 𝜋𝑘^{𝛼𝑘 − 1} × Π_{𝑛=1}^𝑁 Π_{𝑘=1}^𝐾 𝜋𝑘^{𝕀[𝑦𝑛 = 𝑘]} = Π_{𝑘=1}^𝐾 𝜋𝑘^{𝛼𝑘 + Σ_{𝑛=1}^𝑁 𝕀[𝑦𝑛 = 𝑘] − 1}

▪ Even without computing the marginal likelihood 𝑝(𝒚 | 𝜶), we can see that the posterior is a Dirichlet
▪ Denoting by 𝑁𝑘 = Σ_{𝑛=1}^𝑁 𝕀[𝑦𝑛 = 𝑘] the number of observations with value 𝑘 (similar to the numbers of heads and tails in the coin-bias estimation problem),

  𝑝(𝝅 | 𝒚, 𝜶) = Dirichlet(𝝅 | 𝛼1 + 𝑁1, 𝛼2 + 𝑁2, … , 𝛼𝐾 + 𝑁𝐾)

▪ Note: 𝑁1, 𝑁2, … , 𝑁𝐾 are the sufficient statistics for this estimation problem
▪ We only need the sufficient statistics to estimate the parameters; the values of the individual observations aren't needed (another property of the exponential family of distributions; more on this later)
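▪ A minimal sketch of this count-and-add Dirichlet update (not from the original slides; 𝐾 = 3, a symmetric Dirichlet prior, and the data are all made up for illustration, with outcomes coded as 0, …, 𝐾−1):

    import numpy as np

    K = 3
    y = np.array([0, 2, 2, 1, 2, 0, 2])     # hypothetical observations, each in {0, ..., K-1}
    alpha = np.ones(K)                      # assumed Dirichlet hyperparameters (uniform prior)

    N_k = np.bincount(y, minlength=K)       # sufficient statistics: counts per outcome
    posterior_alpha = alpha + N_k           # Dirichlet(pi | alpha_1 + N_1, ..., alpha_K + N_K)
    print(posterior_alpha)                  # [3. 2. 5.]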
15
The Predictive Distribution
▪ Finally, let's also look at the posterior predictive distribution for this model
▪ The PPD is the probability distribution of a new 𝑦∗ ∈ {1, 2, … , 𝐾}, given training data 𝒚 = {𝑦1, 𝑦2, … , 𝑦𝑁}. It will be a multinoulli; we just need the probabilities of each of the 𝐾 outcomes

  𝑝(𝑦∗ | 𝒚, 𝜶) = ∫ 𝑝(𝑦∗ | 𝝅) 𝑝(𝝅 | 𝒚, 𝜶) 𝑑𝝅

▪ Here 𝑝(𝑦∗ | 𝝅) = multinoulli(𝑦∗ | 𝝅) and 𝑝(𝝅 | 𝒚, 𝜶) = Dirichlet(𝝅 | 𝛼1 + 𝑁1, 𝛼2 + 𝑁2, … , 𝛼𝐾 + 𝑁𝐾)
▪ We can compute the posterior predictive probability of each of the 𝐾 possible outcomes:

  𝑝(𝑦∗ = 𝑘 | 𝒚, 𝜶) = ∫ 𝑝(𝑦∗ = 𝑘 | 𝝅) 𝑝(𝝅 | 𝒚, 𝜶) 𝑑𝝅
                   = ∫ 𝜋𝑘 × Dirichlet(𝝅 | 𝛼1 + 𝑁1, … , 𝛼𝐾 + 𝑁𝐾) 𝑑𝝅
                   = (𝛼𝑘 + 𝑁𝑘) / (Σ_{𝑘=1}^𝐾 𝛼𝑘 + 𝑁)     (the expectation of 𝜋𝑘 w.r.t. the Dirichlet posterior)

▪ Thus the PPD is a multinoulli with probability vector whose 𝑘-th entry is (𝛼𝑘 + 𝑁𝑘) / (Σ_{𝑘=1}^𝐾 𝛼𝑘 + 𝑁). Note how these probabilities have been "smoothed" by the use of the prior and the averaging over the posterior; a similar effect was achieved in the Beta-Bernoulli model, too
▪ The plug-in predictive will also be a multinoulli, but with its probability vector given by a point estimate of 𝝅
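▪ A sketch (not from the original slides) of the smoothed PPD vector, continuing the previous made-up example and comparing against the unsmoothed MLE fractions used by a plug-in predictive:

    import numpy as np

    K = 3
    y = np.array([0, 2, 2, 1, 2, 0, 2])
    alpha = np.ones(K)

    N_k = np.bincount(y, minlength=K)
    ppd = (alpha + N_k) / (alpha.sum() + len(y))   # posterior predictive probabilities of the K outcomes
    mle = N_k / len(y)                             # plug-in predictive with the MLE of pi
    print(ppd)                                     # [0.3 0.2 0.5]
    print(mle)                                     # approx. [0.286 0.143 0.571]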
16

Gaussian Observation Model

17
Gaussian Distribution (Univariate)
▪ Distribution over real-valued scalar random variables 𝑌 ∈ ℝ, e.g., the heights of students in a class
▪ Defined by a scalar mean 𝜇 and a scalar variance 𝜎²

  𝒩(𝑌 = 𝑦 | 𝜇, 𝜎²) = (1 / √(2𝜋𝜎²)) exp(−(𝑦 − 𝜇)² / (2𝜎²))

▪ Mean: 𝔼[𝑌] = 𝜇
▪ Variance: var[𝑌] = 𝜎²
▪ The inverse of the variance is called the precision: 𝛽 = 1/𝜎². The Gaussian PDF in terms of the precision is

  𝒩(𝑌 = 𝑦 | 𝜇, 𝛽) = √(𝛽 / (2𝜋)) exp(−(𝛽/2) (𝑦 − 𝜇)²)
18
Gaussian Distribution (Multivariate)
▪ Distribution over real-valued vector random variables 𝒀 ∈ ℝ𝐷
▪ Defined by a mean vector 𝜇 ∈ ℝ𝐷 and a covariance matrix 𝚺
  𝒩(𝒀 = 𝒚 | 𝝁, 𝚺) = (1 / √((2𝜋)^𝐷 |𝚺|)) exp(−(1/2) (𝒚 − 𝝁)^⊤ 𝚺^{−1} (𝒚 − 𝝁))

  [Figure: the PDF of a two-dimensional Gaussian.]

▪ Note: The covariance matrix 𝚺 must be symmetric and PSD
▪ All eigenvalues are non-negative
▪ 𝒛^⊤ 𝚺 𝒛 ≥ 0 for any real vector 𝒛
▪ The covariance matrix also controls the shape of the Gaussian
▪ Sometimes we work with the precision matrix (the inverse of the covariance matrix): 𝚲 = 𝚺^{−1}
19
Covariance Matrix for Multivariate Gaussian
[Figure: contour plots of 2-D Gaussians with spherical, diagonal, and full covariance matrices.]
▪ Spherical covariance: equal spreads (variances) along all dimensions
▪ Diagonal covariance: unequal spreads (variances) along the dimensions, but still axis-parallel
▪ Full covariance: unequal spreads (variances) along all directions, including spreads along oblique directions
20
Multivariate Gaussian: Marginals and Conditionals

21
Transformation of Random Variables
▪ Suppose 𝑌 = 𝑓(𝑋) = 𝐴𝑋 + 𝑏 is a linear function of a vector-valued r.v. 𝑋 (𝐴 is a matrix and 𝑏 is a vector, both constants)
▪ Suppose 𝔼[𝑋] = 𝜇 and cov[𝑋] = Σ; then, for the vector-valued r.v. 𝑌,

  𝔼[𝑌] = 𝔼[𝐴𝑋 + 𝑏] = 𝐴𝜇 + 𝑏
  cov[𝑌] = cov[𝐴𝑋 + 𝑏] = 𝐴Σ𝐴^⊤

▪ Likewise, if 𝑌 = 𝑓(𝑋) = 𝑎^⊤𝑋 + 𝑏 is a linear function of a vector-valued r.v. 𝑋 (𝑎 is a vector and 𝑏 is a scalar, both constants)
▪ Suppose 𝔼[𝑋] = 𝜇 and cov[𝑋] = Σ; then, for the scalar-valued r.v. 𝑌,

  𝔼[𝑌] = 𝔼[𝑎^⊤𝑋 + 𝑏] = 𝑎^⊤𝜇 + 𝑏
  var[𝑌] = var[𝑎^⊤𝑋 + 𝑏] = 𝑎^⊤Σ𝑎
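▪ These identities can be checked empirically with a quick Monte Carlo sketch (not from the original slides; the specific 𝐴, 𝑏, 𝜇, Σ below are arbitrary illustrative values):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    A = np.array([[1.0, 2.0],
                  [0.0, 3.0]])
    b = np.array([0.5, -1.0])

    X = rng.multivariate_normal(mu, Sigma, size=200_000)   # samples of X (one per row)
    Y = X @ A.T + b                                        # Y = A X + b for each sample

    print(Y.mean(axis=0), A @ mu + b)                      # empirical vs analytical E[Y]
    print(np.cov(Y.T), A @ Sigma @ A.T)                    # empirical vs analytical cov[Y]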
22
Linear Gaussian Model (LGM)
▪ An LGM defines a noisy linear transform of a Gaussian r.v. 𝜽 with prior 𝑝(𝜽) = 𝒩(𝜽 | 𝝁, 𝚲^{−1})

  𝒚 = 𝑨𝜽 + 𝒃 + 𝝐,    where the noise vector 𝝐 is drawn independently from 𝒩(𝝐 | 𝟎, 𝑳^{−1})

  Both 𝜽 and 𝒚 are vectors (they can be of different sizes). We also assume 𝑨, 𝒃, 𝚲, 𝑳 to be known; only 𝜽 is unknown.

▪ Easy to see that, conditioned on 𝜽, 𝒚 too has a Gaussian distribution

  Conditional distribution:  𝑝(𝒚 | 𝜽) = 𝒩(𝒚 | 𝑨𝜽 + 𝒃, 𝑳^{−1})

▪ Treating 𝑝(𝜽) as the prior and 𝑝(𝒚 | 𝜽) as the likelihood, and defining 𝚺 = (𝚲 + 𝑨^⊤𝑳𝑨)^{−1},

  Posterior of 𝜽:  𝑝(𝜽 | 𝒚) = 𝑝(𝒚 | 𝜽) 𝑝(𝜽) / 𝑝(𝒚) = 𝒩(𝜽 | 𝚺(𝑨^⊤𝑳(𝒚 − 𝒃) + 𝚲𝝁), 𝚺)
  Marginal distribution:  𝑝(𝒚) = ∫ 𝑝(𝒚 | 𝜽) 𝑝(𝜽) 𝑑𝜽 = 𝒩(𝒚 | 𝑨𝝁 + 𝒃, 𝑨𝚲^{−1}𝑨^⊤ + 𝑳^{−1})

▪ Many probabilistic ML models are LGMs
▪ These results are very widely used (PRML Chap. 2 contains a proof); a small numerical sketch follows below
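▪ A direct NumPy transcription of the posterior and marginal formulas above (a sketch, not a definitive implementation; the example values of 𝑨, 𝒃, 𝚲, 𝑳, 𝒚 are arbitrary illustrative choices):

    import numpy as np

    def lgm_posterior_and_marginal(y, A, b, mu, Lambda, L):
        """Posterior p(theta | y) and marginal p(y) for y = A theta + b + eps,
        with theta ~ N(mu, Lambda^{-1}) and eps ~ N(0, L^{-1})."""
        Sigma = np.linalg.inv(Lambda + A.T @ L @ A)              # posterior covariance
        post_mean = Sigma @ (A.T @ L @ (y - b) + Lambda @ mu)    # posterior mean
        marg_mean = A @ mu + b                                   # marginal mean of y
        marg_cov = A @ np.linalg.inv(Lambda) @ A.T + np.linalg.inv(L)
        return post_mean, Sigma, marg_mean, marg_cov

    # Tiny illustrative example: theta is 2-dim, y is 3-dim
    A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
    b = np.zeros(3)
    mu = np.zeros(2)
    Lambda = np.eye(2)          # prior precision of theta
    L = 4.0 * np.eye(3)         # noise precision
    y = np.array([0.9, 2.1, 3.8])

    print(lgm_posterior_and_marginal(y, A, b, mu, Lambda, L))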
23
Posterior Distribution for a Gaussian's Mean (its MLE/MAP estimation left as an exercise)

▪ Given: 𝑁 i.i.d. scalar observations 𝒚 = {𝑦1, 𝑦2, … , 𝑦𝑁}, assumed drawn from 𝒩(𝑦 | 𝜇, 𝜎²), with 𝜎² assumed known

  Likelihood:  𝑝(𝑦𝑛 | 𝜇, 𝜎²) = 𝒩(𝑦𝑛 | 𝜇, 𝜎²) ∝ exp(−(𝑦𝑛 − 𝜇)² / (2𝜎²))
  Overall likelihood:  𝑝(𝒚 | 𝜇, 𝜎²) = Π_{𝑛=1}^𝑁 𝑝(𝑦𝑛 | 𝜇, 𝜎²)

▪ Note: Each 𝑦𝑛 being drawn from 𝒩(𝑦 | 𝜇, 𝜎²) is equivalent to the following:

  𝑦𝑛 = 𝜇 + 𝜖𝑛,   where 𝜖𝑛 ∼ 𝒩(0, 𝜎²)

  Thus 𝑦𝑛 is like a noisy version of 𝜇, with zero-mean Gaussian noise added to it.

▪ Let's estimate the mean 𝜇 given 𝒚 using fully Bayesian inference (not point estimation)
24
A prior distribution for the mean
▪ To compute the posterior, we need a prior over 𝜇
▪ Let's choose a Gaussian prior

  𝑝(𝜇 | 𝜇0, 𝜎0²) = 𝒩(𝜇 | 𝜇0, 𝜎0²) ∝ exp(−(𝜇 − 𝜇0)² / (2𝜎0²))

▪ The prior basically says that, a priori, we believe 𝜇 is close to 𝜇0
▪ The prior's variance 𝜎0² denotes how certain we are about this belief
▪ We will assume that the prior's hyperparameters (𝜇0, 𝜎0²) are known
▪ Since 𝜎² in the likelihood 𝒩(𝑦 | 𝜇, 𝜎²) is known, the Gaussian prior 𝒩(𝜇 | 𝜇0, 𝜎0²) on 𝜇 is also conjugate to the likelihood (thus the posterior of 𝜇 will also be Gaussian)
25
The posterior distribution for the mean
▪ The posterior distribution for the unknown mean parameter 𝜇 (on the conditioning side, we skip all fixed parameters and hyperparameters from the notation):

  𝑝(𝜇 | 𝒚) = 𝑝(𝒚 | 𝜇) 𝑝(𝜇) / 𝑝(𝒚) ∝ Π_{𝑛=1}^𝑁 exp(−(𝑦𝑛 − 𝜇)² / (2𝜎²)) × exp(−(𝜇 − 𝜇0)² / (2𝜎0²))

▪ Easy to see that the above will be proportional to the exponential of a quadratic function of 𝜇. Simplifying:

  𝑝(𝜇 | 𝒚) ∝ exp(−(𝜇 − 𝜇𝑁)² / (2𝜎𝑁²))

  i.e., a Gaussian posterior (not a surprise, since the chosen prior was conjugate to the likelihood), with

  1/𝜎𝑁² = 1/𝜎0² + 𝑁/𝜎²        (contribution from the prior + contribution from the data)
  𝜇𝑁 = [𝜎² / (𝑁𝜎0² + 𝜎²)] 𝜇0 + [𝑁𝜎0² / (𝑁𝜎0² + 𝜎²)] 𝑦̄,    where 𝑦̄ = (Σ_{𝑛=1}^𝑁 𝑦𝑛) / 𝑁

  The Gaussian posterior's precision is the sum of the prior's precision and the noise precisions of all the observations. The Gaussian posterior's mean is a convex combination of the prior's mean and 𝑦̄, which is also the MLE solution for 𝜇.

▪ What happens to the posterior as 𝑁 (the number of observations) grows very large?
▪ The data (likelihood part) overwhelms the prior
▪ The posterior's variance 𝜎𝑁² will approximately be 𝜎²/𝑁 (and goes to 0 as 𝑁 → ∞), meaning we become very, very certain about the estimate of 𝜇
▪ The posterior's mean 𝜇𝑁 approaches 𝑦̄ (which is also the MLE solution)
26
The Predictive Distribution
▪ If given a point estimate 𝜇̂ (the best point estimate, e.g., from MLE/MAP), the plug-in predictive distribution for a test 𝑦∗ would be

  𝑝(𝑦∗ | 𝜇̂, 𝜎²) = 𝒩(𝑦∗ | 𝜇̂, 𝜎²)

  This is an approximation of the true PPD 𝑝(𝑦∗ | 𝒚).
▪ On the other hand, the posterior predictive distribution of 𝑦∗ would be

  𝑝(𝑦∗ | 𝒚) = ∫ 𝑝(𝑦∗ | 𝜇, 𝜎²) 𝑝(𝜇 | 𝒚) 𝑑𝜇
            = ∫ 𝒩(𝑦∗ | 𝜇, 𝜎²) 𝒩(𝜇 | 𝜇𝑁, 𝜎𝑁²) 𝑑𝜇
            = 𝒩(𝑦∗ | 𝜇𝑁, 𝜎² + 𝜎𝑁²)

  The "extra" variance 𝜎𝑁² in the PPD is due to the averaging over the posterior's uncertainty. A useful fact: when we have conjugacy, the posterior predictive distribution also has a closed form (we will see this result more formally when talking about exponential family distributions). Here we have used the fact that if the conditional is Gaussian (and the distribution of the conditioning variable is Gaussian), then the marginal is also Gaussian (PRML [Bis 06], Eq. 2.115; also mentioned in the prob-stats refresher slides).
▪ For an alternative way to get the above result, note that, for test data,

  𝑦∗ = 𝜇 + 𝜖,   with 𝜇 ∼ 𝒩(𝜇𝑁, 𝜎𝑁²) (using the posterior of 𝜇, since we are at the test stage now) and 𝜖 ∼ 𝒩(0, 𝜎²)

  Since 𝜇 and 𝜖 are both Gaussian r.v.'s and are independent, 𝑦∗ also has a Gaussian posterior predictive, and the respective means and variances of 𝜇 and 𝜖 get added up:

  𝑝(𝑦∗ | 𝒚) = 𝒩(𝑦∗ | 𝜇𝑁, 𝜎² + 𝜎𝑁²)
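▪ Continuing the earlier sketch (not from the original slides; the numbers below are placeholders standing in for the 𝜇𝑁, 𝜎𝑁², 𝜎² computed previously), the plug-in predictive and the PPD differ only by the extra variance 𝜎𝑁²:

    import numpy as np
    from scipy.stats import norm

    muN, sigmaN_2, sigma2 = 1.95, 0.05, 1.0    # placeholder values for illustration

    y_star = 3.0
    plug_in = norm(loc=muN, scale=np.sqrt(sigma2))             # N(y* | muN, sigma^2)
    ppd = norm(loc=muN, scale=np.sqrt(sigma2 + sigmaN_2))      # N(y* | muN, sigma^2 + sigmaN^2)
    print(plug_in.pdf(y_star), ppd.pdf(y_star))                # the PPD is slightly wider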
27
Gaussian Observation Model: Some Other Facts
▪ MLE/MAP for 𝜇, 𝜎² (or both) is straightforward in Gaussian observation models
▪ The posterior is also straightforward in most situations for such models
▪ (As we saw) computing the posterior of 𝜇 is easy (using a Gaussian prior) if the variance 𝜎² is known
▪ Likewise, computing the posterior of 𝜎² is easy (using a gamma prior on the precision 1/𝜎², or equivalently an inverse-gamma prior on 𝜎²) if the mean 𝜇 is known
▪ If 𝜇 and 𝜎² are both unknown, posterior computation requires computing 𝑝(𝜇, 𝜎² | 𝒚)
▪ Computing the joint posterior 𝑝(𝜇, 𝜎² | 𝒚) exactly requires a jointly conjugate prior 𝑝(𝜇, 𝜎²)
▪ The "Gaussian-gamma" ("Normal-gamma") distribution is such a conjugate prior: a product of a normal and a gamma distribution
▪ Note: Computing joint posteriors exactly is possible only in rare cases such as this one
▪ If each observation 𝒚𝑛 ∈ ℝ^𝐷, we can assume the likelihood/observation model 𝒩(𝒚 | 𝝁, 𝚺)
▪ To estimate a vector-valued mean 𝝁 ∈ ℝ^𝐷, we can use a multivariate Gaussian prior
▪ To estimate a 𝐷 × 𝐷 positive definite covariance matrix 𝚺, we can use a Wishart prior (on the precision matrix 𝚺^{−1}; equivalently, an inverse-Wishart prior on 𝚺)
▪ If 𝝁 and 𝚺 are both unknown, we can use a Normal-Wishart distribution as a conjugate prior
