
Estimating Parameters and Predictive Distributions: Some Simple Cases

CS772A: Probabilistic Machine Learning


Piyush Rai
2
Plan today
▪ Parameter estimation (point est. and posterior) and predictive distribution for
▪ Bernoulli observation model (binary-valued observations)
▪ Multinoulli observation model (discrete-valued observations)
▪ Focus today on cases with conjugate prior on parameters (easy to compute posterior)
▪ Gaussian distribution and some of its important properties
▪ Parameter estimation and predictive distribution for Gaussian observation models

3

Bernoulli Observation Model

4
Estimating a Coin’s Bias
▪ Consider a sequence of 𝑁 coin-toss outcomes (observations)
▪ Each observation 𝑦𝑛 is a binary random variable. Head: 𝑦𝑛 = 1, Tail: 𝑦𝑛 = 0
▪ Each 𝑦𝑛 is assumed generated by a Bernoulli distribution with parameter 𝜃 ∈ (0,1), where 𝜃 is the probability of a head

  Likelihood or observation model:  𝑝(𝑦𝑛 | 𝜃) = Bernoulli(𝑦𝑛 | 𝜃) = 𝜃^{𝑦𝑛} (1 − 𝜃)^{1−𝑦𝑛}

▪ Here 𝜃 is the unknown parameter (probability of head). Let's do MLE, assuming i.i.d. data
▪ Log-likelihood: Σ_{𝑛=1}^𝑁 log 𝑝(𝑦𝑛 | 𝜃) = Σ_{𝑛=1}^𝑁 [𝑦𝑛 log 𝜃 + (1 − 𝑦𝑛) log(1 − 𝜃)]
▪ Maximizing the log-likelihood, or minimizing the negative log-likelihood (NLL), w.r.t. 𝜃 gives

  𝜃_MLE = (Σ_{𝑛=1}^𝑁 𝑦𝑛) / 𝑁

  Thus the MLE solution is simply the fraction of heads, which makes intuitive sense.
▪ But suppose I tossed a coin 5 times and got 1 head and 4 tails: does that mean 𝜃 = 0.2? What if I see 0 heads and 5 tails: does that mean 𝜃 = 0? The MLE approach says so. Indeed, with a small number of training observations, MLE may overfit and may not be reliable. An alternative is MAP estimation, which can incorporate a prior distribution over 𝜃.
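▪ As a quick illustration (a minimal sketch, not part of the original slides; the data below is made up), the MLE formula above is just the empirical fraction of heads:

    import numpy as np

    # Hypothetical coin-toss data: 1 = head, 0 = tail (5 tosses, 1 head)
    y = np.array([1, 0, 0, 0, 0])

    # MLE of the Bernoulli parameter: fraction of heads
    theta_mle = y.mean()
    print(theta_mle)  # 0.2, illustrating how MLE can latch onto a small sample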
5
Estimating a Coin’s Bias
▪ Let’s do MAP estimation for the bias of the coin
▪ Each likelihood term is Bernoulli
𝑝(𝑦𝑛 | 𝜃) = Bernoulli(𝑦𝑛 | 𝜃) = 𝜃^{𝑦𝑛} (1 − 𝜃)^{1−𝑦𝑛}
▪ Also need a prior since we want to do MAP estimation
▪ Since 𝜃 ∈ (0,1), a reasonable choice of prior for 𝜃 would be Beta distribution
  𝑝(𝜃 | 𝛼, 𝛽) = Beta(𝜃 | 𝛼, 𝛽) = [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼−1} (1 − 𝜃)^{𝛽−1}

  Here Γ(·) is the gamma function, and 𝛼 and 𝛽 (both non-negative reals) are the two hyperparameters of this Beta prior. Using 𝛼 = 1 and 𝛽 = 1 makes the Beta prior a uniform prior. These hyperparameters can be set based on intuition, cross-validation, or even learned.
6
Estimating a Coin’s Bias
▪ The log posterior for the coin-toss model is log-likelihood + log-prior (up to a constant)

  𝐿𝑃(𝜃) = Σ_{𝑛=1}^𝑁 log 𝑝(𝑦𝑛 | 𝜃) + log 𝑝(𝜃 | 𝛼, 𝛽)

▪ Plugging in the expressions for Bernoulli and Beta and ignoring any terms that don't depend on 𝜃, the log posterior simplifies to

  𝐿𝑃(𝜃) = Σ_{𝑛=1}^𝑁 [𝑦𝑛 log 𝜃 + (1 − 𝑦𝑛) log(1 − 𝜃)] + (𝛼 − 1) log 𝜃 + (𝛽 − 1) log(1 − 𝜃)

▪ Maximizing the above log posterior (or minimizing its negative) w.r.t. 𝜃 gives

  𝜃_MAP = (Σ_{𝑛=1}^𝑁 𝑦𝑛 + 𝛼 − 1) / (𝑁 + 𝛼 + 𝛽 − 2)

▪ Using 𝛼 = 1 and 𝛽 = 1 gives the same solution as MLE. Recall that 𝛼 = 1 and 𝛽 = 1 makes the Beta distribution a uniform prior (hence making MAP equivalent to MLE).
▪ The prior's hyperparameters have an interesting interpretation: we can think of 𝛼 − 1 and 𝛽 − 1 as the numbers of heads and tails, respectively, seen before starting the coin-toss experiment (akin to "pseudo-observations"). Such interpretations of a prior's hyperparameters as pseudo-observations exist for various other prior distributions as well, in particular distributions belonging to the "exponential family" of distributions.
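▪ A minimal sketch of the MAP formula above (not from the original slides), using the same hypothetical 5 tosses and illustrative hyperparameters 𝛼 = 𝛽 = 2 (i.e., one pseudo-head and one pseudo-tail):

    import numpy as np

    y = np.array([1, 0, 0, 0, 0])   # hypothetical tosses: 1 head, 4 tails
    alpha, beta = 2.0, 2.0          # assumed Beta hyperparameters

    N = len(y)
    theta_map = (y.sum() + alpha - 1) / (N + alpha + beta - 2)
    print(theta_map)                # 2/7 ≈ 0.286, pulled away from the MLE of 0.2 by the prior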
7
The Posterior Distribution
▪ Let’s do fully Bayesian inference and compute the posterior distribution
▪ Bernoulli likelihood: 𝑝(𝑦𝑛 | 𝜃) = Bernoulli(𝑦𝑛 | 𝜃) = 𝜃^{𝑦𝑛} (1 − 𝜃)^{1−𝑦𝑛}
▪ Beta prior: 𝑝(𝜃) = Beta(𝜃 | 𝛼, 𝛽) = [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼−1} (1 − 𝜃)^{𝛽−1}
▪ The posterior can be computed as (hyperparameters 𝛼, 𝛽 not shown on the conditioning side, for brevity)

  𝑝(𝜃 | 𝒚) = 𝑝(𝜃) 𝑝(𝒚 | 𝜃) / 𝑝(𝒚) = 𝑝(𝜃) Π_{𝑛=1}^𝑁 𝑝(𝑦𝑛 | 𝜃) / 𝑝(𝒚)
           = [Γ(𝛼+𝛽)/(Γ(𝛼)Γ(𝛽))] 𝜃^{𝛼−1}(1−𝜃)^{𝛽−1} Π_{𝑛=1}^𝑁 𝜃^{𝑦𝑛}(1−𝜃)^{1−𝑦𝑛}  /  ∫ [Γ(𝛼+𝛽)/(Γ(𝛼)Γ(𝛽))] 𝜃^{𝛼−1}(1−𝜃)^{𝛽−1} Π_{𝑛=1}^𝑁 𝜃^{𝑦𝑛}(1−𝜃)^{1−𝑦𝑛} 𝑑𝜃

  Note that Π_{𝑛=1}^𝑁 𝜃^{𝑦𝑛}(1−𝜃)^{1−𝑦𝑛} = 𝜃^{Σ𝑛 𝑦𝑛} (1 − 𝜃)^{𝑁 − Σ𝑛 𝑦𝑛}. Denote the number of heads by 𝑁1 = Σ_{𝑛=1}^𝑁 𝑦𝑛 and the number of tails by 𝑁0 = 𝑁 − 𝑁1.

▪ Here, even without computing the denominator (the marginal likelihood), we can identify the posterior
▪ It is a Beta distribution, since 𝑝(𝜃 | 𝒚) ∝ 𝜃^{𝛼+𝑁1−1} (1 − 𝜃)^{𝛽+𝑁0−1}
▪ Thus 𝑝(𝜃 | 𝒚) = Beta(𝜃 | 𝛼 + 𝑁1, 𝛽 + 𝑁0)
▪ Exercise: Show that the normalization constant equals Γ(𝛼 + 𝛽 + 𝑁) / [Γ(𝛼 + Σ_{𝑛=1}^𝑁 𝑦𝑛) Γ(𝛽 + 𝑁 − Σ_{𝑛=1}^𝑁 𝑦𝑛)]. Hint: use the fact that the posterior must integrate to 1, i.e., ∫ 𝑝(𝜃 | 𝒚) 𝑑𝜃 = 1
▪ Here, finding the posterior boiled down to simply "multiply, add stuff, and identify"
▪ Here, the posterior has the same form as the prior (both Beta): a property of conjugate priors.
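▪ A minimal sketch of this count-and-add posterior update (not from the original slides; scipy is used only to query the resulting Beta, and the data/hyperparameters are the same illustrative ones as before):

    import numpy as np
    from scipy.stats import beta as beta_dist

    y = np.array([1, 0, 0, 0, 0])
    alpha, beta = 2.0, 2.0                         # assumed prior hyperparameters

    N1 = int(y.sum())                              # number of heads
    N0 = len(y) - N1                               # number of tails
    posterior = beta_dist(alpha + N1, beta + N0)   # Beta(alpha + N1, beta + N0)

    print(posterior.mean())                        # posterior mean of theta
    print(posterior.interval(0.95))                # a 95% credible interval for theta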
8
Conjugacy and Conjugate Priors
▪ Many pairs of distributions are conjugate to each other
▪ Bernoulli (likelihood) + Beta (prior) ⇒ Beta posterior
▪ Binomial (likelihood) + Beta (prior) ⇒ Beta posterior
▪ Multinomial (likelihood) + Dirichlet (prior) ⇒ Dirichlet posterior
▪ Poisson (likelihood) + Gamma (prior) ⇒ Gamma posterior
▪ Gaussian (likelihood) + Gaussian (prior) ⇒ Gaussian posterior (not true in general, but in some cases, e.g., when the variance of the Gaussian likelihood is fixed)


▪ and many other such pairs ..
▪ Tip: If two distributions are conjugate to each other, their functional forms are similar
▪ Example: Bernoulli and Beta have the forms

  Bernoulli(𝑦 | 𝜃) = 𝜃^{𝑦} (1 − 𝜃)^{1−𝑦}
  Beta(𝜃 | 𝛼, 𝛽) = [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼−1} (1 − 𝜃)^{𝛽−1}

  This is why, when we multiply them while computing the posterior, the exponents get added and we get the same form for the posterior as the prior, just with updated hyperparameters. It also means we can identify the posterior and its hyperparameters simply by inspection.

▪ More on conjugate priors when we look at exponential family distributions


9
Predictive Distribution
▪ Suppose we want to compute the probability that the next outcome 𝑦_{𝑁+1} will be a head (= 1)
▪ The posterior predictive distribution (PPD) averages over all 𝜃's, weighted by their respective posterior probabilities:

  𝑝(𝑦_{𝑁+1} = 1 | 𝒚) = ∫₀¹ 𝑝(𝑦_{𝑁+1} = 1, 𝜃 | 𝒚) 𝑑𝜃 = ∫₀¹ 𝑝(𝑦_{𝑁+1} = 1 | 𝜃) 𝑝(𝜃 | 𝒚) 𝑑𝜃
                     = ∫₀¹ 𝜃 × 𝑝(𝜃 | 𝒚) 𝑑𝜃
                     = 𝔼_{𝑝(𝜃|𝒚)}[𝜃]          (expectation of 𝜃 w.r.t. the Beta posterior 𝑝(𝜃 | 𝒚) = Beta(𝜃 | 𝛼 + 𝑁1, 𝛽 + 𝑁0))
                     = (𝛼 + 𝑁1) / (𝛼 + 𝛽 + 𝑁)

▪ Therefore the PPD will be 𝑝(𝑦_{𝑁+1} | 𝒚) = Bernoulli(𝑦_{𝑁+1} | 𝔼_{𝑝(𝜃|𝒚)}[𝜃])
▪ For models where the likelihood and prior are conjugate to each other, the PPD can be computed easily in closed form (more on this when we talk about exponential family distributions)
▪ The plug-in predictive distribution using a point estimate 𝜃̂ (e.g., from MLE/MAP):

  𝑝(𝑦_{𝑁+1} = 1 | 𝒚) ≈ 𝑝(𝑦_{𝑁+1} = 1 | 𝜃̂) = 𝜃̂,   i.e.,   𝑝(𝑦_{𝑁+1} | 𝒚) ≈ Bernoulli(𝑦_{𝑁+1} | 𝜃̂)
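▪ A sketch (not from the original slides) contrasting the posterior predictive probability of a head with the plug-in predictive based on the MAP estimate, using the same illustrative data and hyperparameters as before:

    import numpy as np

    y = np.array([1, 0, 0, 0, 0])
    alpha, beta = 2.0, 2.0

    N, N1 = len(y), int(y.sum())
    p_head_ppd = (alpha + N1) / (alpha + beta + N)           # posterior predictive: E[theta | y]
    theta_map = (N1 + alpha - 1) / (N + alpha + beta - 2)    # plug-in with the MAP estimate
    print(p_head_ppd, theta_map)                             # 0.333... vs 0.2857...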
10

Multinoulli Observation Model

11
The Posterior Distribution (MLE/MAP left as an exercise)

▪ Assume 𝑁 discrete obs 𝒚 = {𝑦1 , 𝑦2 , … , 𝑦𝑁 } with each 𝑦𝑛 ∈ {1,2, … , 𝐾}, e.g.,


▪ 𝑦𝑛 represents the outcome of a dice roll with 𝐾 faces
▪ 𝑦𝑛 represents the class label of the 𝑛𝑡ℎ example in a classification problem (total 𝐾 classes)
▪ 𝑦𝑛 represents the identity of the 𝑛th word in a sequence of words
▪ Assume the likelihood to be multinoulli with unknown parameters 𝝅 = [𝜋1, 𝜋2, … , 𝜋𝐾] (these sum to 1)

  𝑝(𝑦𝑛 | 𝝅) = multinoulli(𝑦𝑛 | 𝝅) = Π_{𝑘=1}^𝐾 𝜋𝑘^{𝕀[𝑦𝑛 = 𝑘]}     (a generalization of Bernoulli to 𝐾 > 2 discrete outcomes)

▪ 𝝅 is a vector of probabilities (a "probability vector"), e.g.,
▪ Biases of the 𝐾 sides of the dice
▪ Prior class probabilities in multi-class classification (𝑝(𝑦𝑛 = 𝑘) = 𝜋𝑘)
▪ Probabilities of observing each of the 𝐾 words in a vocabulary
▪ Assume a conjugate prior (Dirichlet) on 𝝅 with hyperparameters 𝜶 = [𝛼1, 𝛼2, … , 𝛼𝐾], each 𝛼𝑘 ≥ 0
▪ The Dirichlet is a generalization of the Beta distribution to 𝐾-dimensional probability vectors. 𝜶 is called the concentration parameter of the Dirichlet (assumed known for now). Large values of 𝜶 give a Dirichlet peaked around its mean (the next slides illustrate this).
12
Brief Detour: Dirichlet Distribution

▪ An important distribution. It models non-negative vectors 𝝅 that also sum to one, i.e., probability vectors


▪ A random draw from a 𝐾-dim Dirichlet will be a point on the (𝐾−1)-dim probability simplex

  [Figure: the probability simplex of a 2-dim simplex (representing a 3-dim Dirichlet) and the coordinates of various points on the simplex, e.g., the corners (1,0,0), (0,1,0), (0,0,1), edge points such as (1/2,1/2,0), and interior points such as (1/2,1/4,1/4).]
13
Brief Detour: Dirichlet Distribution
▪ A visualization of the Dirichlet distribution for different values of the concentration parameter

  [Figure: visualizations of the PDFs of some 3-dim Dirichlet distributions, each generated using a different concentration parameter vector 𝜶, with axes 𝜋1, 𝜋2, 𝜋3. 𝜶 controls the shape of the Dirichlet (just like the Beta distribution's hyperparameters): if all 𝛼𝑘's are 1, it looks like a uniform distribution over the simplex; if all 𝛼𝑘's are large, the density peaks around the center of the simplex.]

▪ Interesting fact: we can generate a 𝐾-dim Dirichlet random variable by independently generating 𝐾 gamma random variables and normalizing them to sum to 1 (see the sketch below)
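▪ A minimal sketch of that gamma-normalization fact (not from the original slides; the concentration parameter below is just an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([2.0, 1.0, 5.0])       # illustrative concentration parameter of a 3-dim Dirichlet

    # Draw K independent Gamma(alpha_k, 1) variables and normalize them to sum to 1
    g = rng.gamma(shape=alpha, scale=1.0)
    pi = g / g.sum()
    print(pi, pi.sum())                      # a point on the 2-dim probability simplex

    # Sanity check against NumPy's built-in Dirichlet sampler
    print(rng.dirichlet(alpha))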
14
The Posterior Distribution
▪ The posterior 𝑝(𝝅 | 𝒚) is easy to compute due to conjugacy between the multinoulli and the Dirichlet

  𝑝(𝝅 | 𝒚, 𝜶) = 𝑝(𝝅, 𝒚 | 𝜶) / 𝑝(𝒚 | 𝜶) = 𝑝(𝝅 | 𝜶) 𝑝(𝒚 | 𝝅, 𝜶) / 𝑝(𝒚 | 𝜶) = 𝑝(𝝅 | 𝜶) 𝑝(𝒚 | 𝝅) / 𝑝(𝒚 | 𝜶)

  Here 𝑝(𝝅 | 𝜶) is the prior and 𝑝(𝒚 | 𝝅) is the likelihood. The marginal likelihood 𝑝(𝒚 | 𝜶) = ∫ 𝑝(𝝅 | 𝜶) 𝑝(𝒚 | 𝝅) 𝑑𝝅 doesn't need to be computed in this case because of conjugacy.

▪ Assuming the 𝑦𝑛's are i.i.d. given 𝝅, 𝑝(𝒚 | 𝝅) = Π_{𝑛=1}^𝑁 𝑝(𝑦𝑛 | 𝝅), and therefore

  𝑝(𝝅 | 𝒚, 𝜶) ∝ Π_{𝑘=1}^𝐾 𝜋𝑘^{𝛼𝑘 − 1} × Π_{𝑛=1}^𝑁 Π_{𝑘=1}^𝐾 𝜋𝑘^{𝕀[𝑦𝑛 = 𝑘]} = Π_{𝑘=1}^𝐾 𝜋𝑘^{𝛼𝑘 + Σ_{𝑛=1}^𝑁 𝕀[𝑦𝑛 = 𝑘] − 1}

▪ Even without computing the marginal likelihood 𝑝(𝒚 | 𝜶), we can see that the posterior is a Dirichlet
▪ Denoting by 𝑁𝑘 = Σ_{𝑛=1}^𝑁 𝕀[𝑦𝑛 = 𝑘] the number of observations with value 𝑘 (similar to the numbers of heads and tails in the coin-bias estimation problem),

  𝑝(𝝅 | 𝒚, 𝜶) = Dirichlet(𝝅 | 𝛼1 + 𝑁1, 𝛼2 + 𝑁2, … , 𝛼𝐾 + 𝑁𝐾)

▪ Note: 𝑁1, 𝑁2, … , 𝑁𝐾 are the sufficient statistics for this estimation problem
▪ We only need the sufficient statistics to estimate the parameters; the values of the individual observations aren't needed (another property of the exponential family of distributions; more on this later)
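▪ A minimal sketch of this count-and-add Dirichlet update (not from the original slides; 𝐾 = 3, a symmetric Dirichlet prior, and the data are all made up for illustration, with outcomes coded as 0, …, 𝐾−1):

    import numpy as np

    K = 3
    y = np.array([0, 2, 2, 1, 2, 0, 2])     # hypothetical observations, each in {0, ..., K-1}
    alpha = np.ones(K)                      # assumed Dirichlet hyperparameters (uniform prior)

    N_k = np.bincount(y, minlength=K)       # sufficient statistics: counts per outcome
    posterior_alpha = alpha + N_k           # Dirichlet(pi | alpha_1 + N_1, ..., alpha_K + N_K)
    print(posterior_alpha)                  # [3. 2. 5.]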
15
The Predictive Distribution
▪ Finally, let's also look at the posterior predictive distribution for this model
▪ The PPD is the probability distribution of a new 𝑦∗ ∈ {1, 2, … , 𝐾}, given training data 𝒚 = {𝑦1, 𝑦2, … , 𝑦𝑁}. It will be a multinoulli; we just need the probabilities of each of the 𝐾 outcomes

  𝑝(𝑦∗ | 𝒚, 𝜶) = ∫ 𝑝(𝑦∗ | 𝝅) 𝑝(𝝅 | 𝒚, 𝜶) 𝑑𝝅

▪ Here 𝑝(𝑦∗ | 𝝅) = multinoulli(𝑦∗ | 𝝅) and 𝑝(𝝅 | 𝒚, 𝜶) = Dirichlet(𝝅 | 𝛼1 + 𝑁1, 𝛼2 + 𝑁2, … , 𝛼𝐾 + 𝑁𝐾)
▪ We can compute the posterior predictive probability of each of the 𝐾 possible outcomes:

  𝑝(𝑦∗ = 𝑘 | 𝒚, 𝜶) = ∫ 𝑝(𝑦∗ = 𝑘 | 𝝅) 𝑝(𝝅 | 𝒚, 𝜶) 𝑑𝝅
                   = ∫ 𝜋𝑘 × Dirichlet(𝝅 | 𝛼1 + 𝑁1, … , 𝛼𝐾 + 𝑁𝐾) 𝑑𝝅
                   = (𝛼𝑘 + 𝑁𝑘) / (Σ_{𝑘=1}^𝐾 𝛼𝑘 + 𝑁)     (the expectation of 𝜋𝑘 w.r.t. the Dirichlet posterior)

▪ Thus the PPD is a multinoulli with probability vector whose 𝑘-th entry is (𝛼𝑘 + 𝑁𝑘) / (Σ_{𝑘=1}^𝐾 𝛼𝑘 + 𝑁). Note how these probabilities have been "smoothed" by the use of the prior and the averaging over the posterior; a similar effect was achieved in the Beta-Bernoulli model, too
▪ The plug-in predictive will also be a multinoulli, but with its probability vector given by a point estimate of 𝝅
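▪ A sketch (not from the original slides) of the smoothed PPD vector, continuing the previous made-up example and comparing against the unsmoothed MLE fractions used by a plug-in predictive:

    import numpy as np

    K = 3
    y = np.array([0, 2, 2, 1, 2, 0, 2])
    alpha = np.ones(K)

    N_k = np.bincount(y, minlength=K)
    ppd = (alpha + N_k) / (alpha.sum() + len(y))   # posterior predictive probabilities of the K outcomes
    mle = N_k / len(y)                             # plug-in predictive with the MLE of pi
    print(ppd)                                     # [0.3 0.2 0.5]
    print(mle)                                     # approx. [0.286 0.143 0.571]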
16

Gaussian Observation Model

17
Gaussian Distribution (Univariate)
▪ Distribution over real-valued scalar random variables 𝑌 ∈ ℝ, e.g., the heights of students in a class
▪ Defined by a scalar mean 𝜇 and a scalar variance 𝜎²

  𝒩(𝑌 = 𝑦 | 𝜇, 𝜎²) = (1 / √(2𝜋𝜎²)) exp(−(𝑦 − 𝜇)² / (2𝜎²))

▪ Mean: 𝔼[𝑌] = 𝜇
▪ Variance: var[𝑌] = 𝜎²
▪ The inverse of the variance is called the precision: 𝛽 = 1/𝜎². The Gaussian PDF in terms of the precision is

  𝒩(𝑌 = 𝑦 | 𝜇, 𝛽) = √(𝛽 / (2𝜋)) exp(−(𝛽/2) (𝑦 − 𝜇)²)
18
Gaussian Distribution (Multivariate)
▪ Distribution over real-valued vector random variables 𝒀 ∈ ℝ𝐷
▪ Defined by a mean vector 𝜇 ∈ ℝ𝐷 and a covariance matrix 𝚺
  𝒩(𝒀 = 𝒚 | 𝝁, 𝚺) = (1 / √((2𝜋)^𝐷 |𝚺|)) exp(−(1/2) (𝒚 − 𝝁)^⊤ 𝚺^{−1} (𝒚 − 𝝁))

  [Figure: the PDF of a two-dimensional Gaussian.]

▪ Note: The covariance matrix 𝚺 must be symmetric and PSD
▪ All eigenvalues are non-negative
▪ 𝒛^⊤ 𝚺 𝒛 ≥ 0 for any real vector 𝒛
▪ The covariance matrix also controls the shape of the Gaussian
▪ Sometimes we work with the precision matrix (the inverse of the covariance matrix): 𝚲 = 𝚺^{−1}
19
Covariance Matrix for Multivariate Gaussian
[Figure: contour plots of 2-D Gaussians with spherical, diagonal, and full covariance matrices.]
▪ Spherical covariance: equal spreads (variances) along all dimensions
▪ Diagonal covariance: unequal spreads (variances) along the dimensions, but still axis-parallel
▪ Full covariance: unequal spreads (variances) along all directions, including spreads along oblique directions
20
Multivariate Gaussian: Marginals and Conditionals

21
Transformation of Random Variables
▪ Suppose 𝑌 = 𝑓(𝑋) = 𝐴𝑋 + 𝑏 is a linear function of a vector-valued r.v. 𝑋 (𝐴 is a matrix and 𝑏 is a vector, both constants)
▪ Suppose 𝔼[𝑋] = 𝜇 and cov[𝑋] = Σ; then, for the vector-valued r.v. 𝑌,

  𝔼[𝑌] = 𝔼[𝐴𝑋 + 𝑏] = 𝐴𝜇 + 𝑏
  cov[𝑌] = cov[𝐴𝑋 + 𝑏] = 𝐴Σ𝐴^⊤

▪ Likewise, if 𝑌 = 𝑓(𝑋) = 𝑎^⊤𝑋 + 𝑏 is a linear function of a vector-valued r.v. 𝑋 (𝑎 is a vector and 𝑏 is a scalar, both constants)
▪ Suppose 𝔼[𝑋] = 𝜇 and cov[𝑋] = Σ; then, for the scalar-valued r.v. 𝑌,

  𝔼[𝑌] = 𝔼[𝑎^⊤𝑋 + 𝑏] = 𝑎^⊤𝜇 + 𝑏
  var[𝑌] = var[𝑎^⊤𝑋 + 𝑏] = 𝑎^⊤Σ𝑎
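▪ These identities can be checked empirically with a quick Monte Carlo sketch (not from the original slides; the specific 𝐴, 𝑏, 𝜇, Σ below are arbitrary illustrative values):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    A = np.array([[1.0, 2.0],
                  [0.0, 3.0]])
    b = np.array([0.5, -1.0])

    X = rng.multivariate_normal(mu, Sigma, size=200_000)   # samples of X (one per row)
    Y = X @ A.T + b                                        # Y = A X + b for each sample

    print(Y.mean(axis=0), A @ mu + b)                      # empirical vs analytical E[Y]
    print(np.cov(Y.T), A @ Sigma @ A.T)                    # empirical vs analytical cov[Y]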
22
Linear Gaussian Model (LGM)
▪ An LGM defines a noisy linear transform of a Gaussian r.v. 𝜽 with prior 𝑝(𝜽) = 𝒩(𝜽 | 𝝁, 𝚲^{−1})

  𝒚 = 𝑨𝜽 + 𝒃 + 𝝐,    where the noise vector 𝝐 is drawn independently from 𝒩(𝝐 | 𝟎, 𝑳^{−1})

  Both 𝜽 and 𝒚 are vectors (they can be of different sizes). We also assume 𝑨, 𝒃, 𝚲, 𝑳 to be known; only 𝜽 is unknown.

▪ Easy to see that, conditioned on 𝜽, 𝒚 too has a Gaussian distribution

  Conditional distribution:  𝑝(𝒚 | 𝜽) = 𝒩(𝒚 | 𝑨𝜽 + 𝒃, 𝑳^{−1})

▪ Treating 𝑝(𝜽) as the prior and 𝑝(𝒚 | 𝜽) as the likelihood, and defining 𝚺 = (𝚲 + 𝑨^⊤𝑳𝑨)^{−1},

  Posterior of 𝜽:  𝑝(𝜽 | 𝒚) = 𝑝(𝒚 | 𝜽) 𝑝(𝜽) / 𝑝(𝒚) = 𝒩(𝜽 | 𝚺(𝑨^⊤𝑳(𝒚 − 𝒃) + 𝚲𝝁), 𝚺)
  Marginal distribution:  𝑝(𝒚) = ∫ 𝑝(𝒚 | 𝜽) 𝑝(𝜽) 𝑑𝜽 = 𝒩(𝒚 | 𝑨𝝁 + 𝒃, 𝑨𝚲^{−1}𝑨^⊤ + 𝑳^{−1})

▪ Many probabilistic ML models are LGMs
▪ These results are very widely used (PRML Chap. 2 contains a proof); a small numerical sketch follows below
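▪ A direct NumPy transcription of the posterior and marginal formulas above (a sketch, not a definitive implementation; the example values of 𝑨, 𝒃, 𝚲, 𝑳, 𝒚 are arbitrary illustrative choices):

    import numpy as np

    def lgm_posterior_and_marginal(y, A, b, mu, Lambda, L):
        """Posterior p(theta | y) and marginal p(y) for y = A theta + b + eps,
        with theta ~ N(mu, Lambda^{-1}) and eps ~ N(0, L^{-1})."""
        Sigma = np.linalg.inv(Lambda + A.T @ L @ A)              # posterior covariance
        post_mean = Sigma @ (A.T @ L @ (y - b) + Lambda @ mu)    # posterior mean
        marg_mean = A @ mu + b                                   # marginal mean of y
        marg_cov = A @ np.linalg.inv(Lambda) @ A.T + np.linalg.inv(L)
        return post_mean, Sigma, marg_mean, marg_cov

    # Tiny illustrative example: theta is 2-dim, y is 3-dim
    A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
    b = np.zeros(3)
    mu = np.zeros(2)
    Lambda = np.eye(2)          # prior precision of theta
    L = 4.0 * np.eye(3)         # noise precision
    y = np.array([0.9, 2.1, 3.8])

    print(lgm_posterior_and_marginal(y, A, b, mu, Lambda, L))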
23
Posterior Distribution for a Gaussian's Mean (its MLE/MAP estimation left as an exercise)

▪ Given: 𝑁 i.i.d. scalar observations 𝒚 = {𝑦1, 𝑦2, … , 𝑦𝑁}, assumed drawn from 𝒩(𝑦 | 𝜇, 𝜎²), with 𝜎² assumed known

  Likelihood:  𝑝(𝑦𝑛 | 𝜇, 𝜎²) = 𝒩(𝑦𝑛 | 𝜇, 𝜎²) ∝ exp(−(𝑦𝑛 − 𝜇)² / (2𝜎²))
  Overall likelihood:  𝑝(𝒚 | 𝜇, 𝜎²) = Π_{𝑛=1}^𝑁 𝑝(𝑦𝑛 | 𝜇, 𝜎²)

▪ Note: Each 𝑦𝑛 being drawn from 𝒩(𝑦 | 𝜇, 𝜎²) is equivalent to the following:

  𝑦𝑛 = 𝜇 + 𝜖𝑛,   where 𝜖𝑛 ∼ 𝒩(0, 𝜎²)

  Thus 𝑦𝑛 is like a noisy version of 𝜇, with zero-mean Gaussian noise added to it.

▪ Let's estimate the mean 𝜇 given 𝒚 using fully Bayesian inference (not point estimation)
24
A prior distribution for the mean
▪ To compute the posterior, we need a prior over 𝜇
▪ Let's choose a Gaussian prior

  𝑝(𝜇 | 𝜇0, 𝜎0²) = 𝒩(𝜇 | 𝜇0, 𝜎0²) ∝ exp(−(𝜇 − 𝜇0)² / (2𝜎0²))

▪ The prior basically says that, a priori, we believe 𝜇 is close to 𝜇0
▪ The prior's variance 𝜎0² denotes how certain we are about this belief
▪ We will assume that the prior's hyperparameters (𝜇0, 𝜎0²) are known
▪ Since 𝜎² in the likelihood 𝒩(𝑦 | 𝜇, 𝜎²) is known, the Gaussian prior 𝒩(𝜇 | 𝜇0, 𝜎0²) on 𝜇 is also conjugate to the likelihood (thus the posterior of 𝜇 will also be Gaussian)
25
The posterior distribution for the mean
▪ The posterior distribution for the unknown mean parameter 𝜇 (on the conditioning side, we skip all fixed parameters and hyperparameters from the notation):

  𝑝(𝜇 | 𝒚) = 𝑝(𝒚 | 𝜇) 𝑝(𝜇) / 𝑝(𝒚) ∝ Π_{𝑛=1}^𝑁 exp(−(𝑦𝑛 − 𝜇)² / (2𝜎²)) × exp(−(𝜇 − 𝜇0)² / (2𝜎0²))

▪ Easy to see that the above will be proportional to the exponential of a quadratic function of 𝜇. Simplifying:

  𝑝(𝜇 | 𝒚) ∝ exp(−(𝜇 − 𝜇𝑁)² / (2𝜎𝑁²))

  i.e., a Gaussian posterior (not a surprise, since the chosen prior was conjugate to the likelihood), with

  1/𝜎𝑁² = 1/𝜎0² + 𝑁/𝜎²        (contribution from the prior + contribution from the data)
  𝜇𝑁 = [𝜎² / (𝑁𝜎0² + 𝜎²)] 𝜇0 + [𝑁𝜎0² / (𝑁𝜎0² + 𝜎²)] 𝑦̄,    where 𝑦̄ = (Σ_{𝑛=1}^𝑁 𝑦𝑛) / 𝑁

  The Gaussian posterior's precision is the sum of the prior's precision and the noise precisions of all the observations. The Gaussian posterior's mean is a convex combination of the prior's mean and 𝑦̄, which is also the MLE solution for 𝜇.

▪ What happens to the posterior as 𝑁 (the number of observations) grows very large?
▪ The data (likelihood part) overwhelms the prior
▪ The posterior's variance 𝜎𝑁² will approximately be 𝜎²/𝑁 (and goes to 0 as 𝑁 → ∞), meaning we become very, very certain about the estimate of 𝜇
▪ The posterior's mean 𝜇𝑁 approaches 𝑦̄ (which is also the MLE solution)
26
The Predictive Distribution
▪ If given a point estimate 𝜇̂ (the best point estimate, e.g., from MLE/MAP), the plug-in predictive distribution for a test 𝑦∗ would be

  𝑝(𝑦∗ | 𝜇̂, 𝜎²) = 𝒩(𝑦∗ | 𝜇̂, 𝜎²)

  This is an approximation of the true PPD 𝑝(𝑦∗ | 𝒚).
▪ On the other hand, the posterior predictive distribution of 𝑦∗ would be

  𝑝(𝑦∗ | 𝒚) = ∫ 𝑝(𝑦∗ | 𝜇, 𝜎²) 𝑝(𝜇 | 𝒚) 𝑑𝜇
            = ∫ 𝒩(𝑦∗ | 𝜇, 𝜎²) 𝒩(𝜇 | 𝜇𝑁, 𝜎𝑁²) 𝑑𝜇
            = 𝒩(𝑦∗ | 𝜇𝑁, 𝜎² + 𝜎𝑁²)

  The "extra" variance 𝜎𝑁² in the PPD is due to the averaging over the posterior's uncertainty. A useful fact: when we have conjugacy, the posterior predictive distribution also has a closed form (we will see this result more formally when talking about exponential family distributions). Here we have used the fact that if the conditional is Gaussian (and the distribution of the conditioning variable is Gaussian), then the marginal is also Gaussian (PRML [Bis 06], Eq. 2.115; also mentioned in the prob-stats refresher slides).
▪ For an alternative way to get the above result, note that, for test data,

  𝑦∗ = 𝜇 + 𝜖,   with 𝜇 ∼ 𝒩(𝜇𝑁, 𝜎𝑁²) (using the posterior of 𝜇, since we are at the test stage now) and 𝜖 ∼ 𝒩(0, 𝜎²)

  Since 𝜇 and 𝜖 are both Gaussian r.v.'s and are independent, 𝑦∗ also has a Gaussian posterior predictive, and the respective means and variances of 𝜇 and 𝜖 get added up:

  𝑝(𝑦∗ | 𝒚) = 𝒩(𝑦∗ | 𝜇𝑁, 𝜎² + 𝜎𝑁²)
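▪ Continuing the earlier sketch (not from the original slides; the numbers below are placeholders standing in for the 𝜇𝑁, 𝜎𝑁², 𝜎² computed previously), the plug-in predictive and the PPD differ only by the extra variance 𝜎𝑁²:

    import numpy as np
    from scipy.stats import norm

    muN, sigmaN_2, sigma2 = 1.95, 0.05, 1.0    # placeholder values for illustration

    y_star = 3.0
    plug_in = norm(loc=muN, scale=np.sqrt(sigma2))             # N(y* | muN, sigma^2)
    ppd = norm(loc=muN, scale=np.sqrt(sigma2 + sigmaN_2))      # N(y* | muN, sigma^2 + sigmaN^2)
    print(plug_in.pdf(y_star), ppd.pdf(y_star))                # the PPD is slightly wider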
27
Gaussian Observation Model: Some Other Facts
▪ MLE/MAP for 𝜇, 𝜎² (or both) is straightforward in Gaussian observation models
▪ The posterior is also straightforward in most situations for such models
▪ (As we saw) computing the posterior of 𝜇 is easy (using a Gaussian prior) if the variance 𝜎² is known
▪ Likewise, computing the posterior of 𝜎² is easy (using a gamma prior on the precision 1/𝜎², or equivalently an inverse-gamma prior on 𝜎²) if the mean 𝜇 is known
▪ If 𝜇 and 𝜎² are both unknown, posterior computation requires computing 𝑝(𝜇, 𝜎² | 𝒚)
▪ Computing the joint posterior 𝑝(𝜇, 𝜎² | 𝒚) exactly requires a jointly conjugate prior 𝑝(𝜇, 𝜎²)
▪ The "Gaussian-gamma" ("Normal-gamma") distribution is such a conjugate prior: a product of a normal and a gamma distribution
▪ Note: Computing joint posteriors exactly is possible only in rare cases such as this one
▪ If each observation 𝒚𝑛 ∈ ℝ^𝐷, we can assume the likelihood/observation model 𝒩(𝒚 | 𝝁, 𝚺)
▪ To estimate a vector-valued mean 𝝁 ∈ ℝ^𝐷, we can use a multivariate Gaussian prior
▪ To estimate a 𝐷 × 𝐷 positive definite covariance matrix 𝚺, we can use a Wishart prior (on the precision matrix 𝚺^{−1}; equivalently, an inverse-Wishart prior on 𝚺)
▪ If 𝝁 and 𝚺 are both unknown, we can use a Normal-Wishart distribution as a conjugate prior
