CS772-Lec3
Estimating a Coin’s Bias
▪ Consider a sequence of 𝑁 coin-toss outcomes (observations)
▪ Each observation 𝑦𝑛 is a binary random variable. Head: 𝑦𝑛 = 1, Tail: 𝑦𝑛 = 0
▪ Each 𝑦𝑛 is assumed generated by a Bernoulli distribution with param 𝜃 ∈ (0,1)
▪ Likelihood or observation model: $p(y_n \mid \theta) = \text{Bernoulli}(y_n \mid \theta) = \theta^{y_n}(1-\theta)^{1-y_n}$
▪ Here 𝜃 is the unknown parameter (the probability of a head). Let's do MLE, assuming i.i.d. data
▪ Log-likelihood: $\sum_{n=1}^{N}\log p(y_n \mid \theta) = \sum_{n=1}^{N}\left[y_n\log\theta + (1-y_n)\log(1-\theta)\right]$
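As an aside (not part of the slides), here is a minimal Python sketch of this computation: it uses the well-known closed-form MLE $\hat\theta = \frac{1}{N}\sum_{n=1}^{N} y_n$ (the fraction of heads) and checks numerically that this value maximizes the log-likelihood above. The data and variable names are purely illustrative.

```python
# A minimal sketch (not from the slides): MLE for the coin's bias via the
# closed form theta_MLE = (sum of y_n) / N, plus a grid-search sanity check
# that it maximizes the Bernoulli log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                               # assumed true bias, for illustration only
y = rng.binomial(1, theta_true, size=100)      # N = 100 i.i.d. coin tosses

theta_mle = y.mean()                           # closed-form MLE: fraction of heads

def log_likelihood(theta, y):
    # sum_n [y_n log(theta) + (1 - y_n) log(1 - theta)]
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

grid = np.linspace(1e-3, 1 - 1e-3, 999)
theta_grid_best = grid[np.argmax([log_likelihood(t, y) for t in grid])]
print(theta_mle, theta_grid_best)              # the two estimates should agree closely
```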
MAP Estimation for the Coin-Bias Model
▪ Now place a Beta(𝛼, 𝛽) prior on 𝜃 and do MAP estimation
▪ Plugging in the expressions for Bernoulli and Beta and ignoring any terms that don't depend on 𝜃, the log posterior simplifies to
$LP(\theta) = \sum_{n=1}^{N}\left[y_n\log\theta + (1-y_n)\log(1-\theta)\right] + (\alpha-1)\log\theta + (\beta-1)\log(1-\theta)$
▪ Maximizing the above log posterior (or minimizing its negative) w.r.t. 𝜃 gives
$\theta_{MAP} = \frac{\sum_{n=1}^{N} y_n + \alpha - 1}{N + \alpha + \beta - 2}$
▪ Using 𝛼 = 1 and 𝛽 = 1 gives us the same solution as MLE. Recall that 𝛼 = 1 and 𝛽 = 1 for the Beta distribution is in fact equivalent to a uniform prior (hence making MAP equivalent to MLE)
▪ The prior's hyperparameters have an interesting interpretation: we can think of 𝛼 − 1 and 𝛽 − 1 as the numbers of heads and tails, respectively, seen before starting the coin-toss experiment (akin to "pseudo-observations"). Such interpretations of the prior's hyperparameters as pseudo-observations exist for various other prior distributions as well, in particular those belonging to the "exponential family" of distributions
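A minimal sketch (not part of the slides) of the MAP estimate above, with toy data and illustrative hyperparameter choices:

```python
# A minimal sketch (not from the slides): MAP estimate for the coin's bias
# with a Beta(alpha, beta) prior, using the closed form
# theta_MAP = (sum_n y_n + alpha - 1) / (N + alpha + beta - 2).
import numpy as np

def theta_map(y, alpha, beta):
    y = np.asarray(y)
    return (y.sum() + alpha - 1) / (len(y) + alpha + beta - 2)

y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # toy data: 7 heads, 3 tails
print(theta_map(y, alpha=1, beta=1))           # uniform prior: same as MLE (0.7)
print(theta_map(y, alpha=5, beta=5))           # pseudo-observations pull the estimate toward 0.5
```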
The Posterior Distribution
▪ Let’s do fully Bayesian inference and compute the posterior distribution
▪ Bernoulli likelihood: $p(y_n \mid \theta) = \text{Bernoulli}(y_n \mid \theta) = \theta^{y_n}(1-\theta)^{1-y_n}$
▪ Beta prior: $p(\theta) = \text{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$, with hyperparameters 𝛼, 𝛽
▪ Notation: $N_1 = \sum_{n=1}^{N} y_n$ is the number of heads and $N_0 = N - N_1$ is the number of tails
▪ The posterior can be computed as (hyperparameters 𝛼, 𝛽 not shown in the conditioning, for brevity)
$p(\theta \mid \boldsymbol{y}) = \frac{p(\theta)\,p(\boldsymbol{y} \mid \theta)}{p(\boldsymbol{y})} = \frac{p(\theta)\prod_{n=1}^{N} p(y_n \mid \theta)}{p(\boldsymbol{y})} = \frac{\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\prod_{n=1}^{N}\theta^{y_n}(1-\theta)^{1-y_n}}{\int \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\prod_{n=1}^{N}\theta^{y_n}(1-\theta)^{1-y_n}\,d\theta}$
where $\prod_{n=1}^{N}\theta^{y_n}(1-\theta)^{1-y_n} = \theta^{\sum_{n=1}^{N} y_n}(1-\theta)^{N-\sum_{n=1}^{N} y_n} = \theta^{N_1}(1-\theta)^{N_0}$
▪ Here, even without computing the denominator (the marginal likelihood), we can identify the posterior
▪ It is a Beta distribution, since $p(\theta \mid \boldsymbol{y}) \propto \theta^{\alpha+N_1-1}(1-\theta)^{\beta+N_0-1}$
▪ Thus $p(\theta \mid \boldsymbol{y}) = \text{Beta}(\theta \mid \alpha+N_1, \beta+N_0)$
▪ Exercise: Show that the normalization constant equals $\frac{\Gamma(\alpha+\beta+N)}{\Gamma\left(\alpha+\sum_{n=1}^{N} y_n\right)\Gamma\left(\beta+N-\sum_{n=1}^{N} y_n\right)}$. Hint: use the fact that the posterior must integrate to 1, i.e., $\int p(\theta \mid \boldsymbol{y})\,d\theta = 1$
▪ Here, finding the posterior boiled down to simply “multiply, add stuff, and identify”
▪ Here, the posterior has the same form as the prior (both Beta): this is the defining property of conjugate priors.
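As an illustrative aside (not from the slides), the posterior can be represented directly with scipy.stats; the data and hyperparameters below are toy values:

```python
# A minimal sketch (not from the slides): the Beta-Bernoulli posterior
# Beta(alpha + N1, beta + N0), represented with scipy.stats.
import numpy as np
from scipy import stats

y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])    # toy coin-toss data
alpha, beta = 2.0, 2.0                          # prior hyperparameters
N1 = int(y.sum())                               # number of heads
N0 = len(y) - N1                                # number of tails

posterior = stats.beta(alpha + N1, beta + N0)   # p(theta | y)
print(posterior.mean())                         # posterior mean of theta
print(posterior.interval(0.95))                 # a 95% credible interval for theta
```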
Conjugacy and Conjugate Priors
▪ Many pairs of distributions are conjugate to each other
▪ Bernoulli (likelihood) + Beta (prior) ⇒ Beta posterior
▪ Binomial (likelihood) + Beta (prior) ⇒ Beta posterior
▪ Multinomial (likelihood) + Dirichlet (prior) ⇒ Dirichlet posterior
▪ Poisson (likelihood) + Gamma (prior) ⇒ Gamma posterior
▪ Gaussian (likelihood) + Gaussian (prior) ⇒ Gaussian posterior (not true in general, but in some cases, e.g., when the variance of the Gaussian likelihood is fixed)
The Predictive Distribution (Beta-Bernoulli)
▪ The plug-in predictive distribution using a point estimate $\hat\theta$ (e.g., from MLE/MAP) is
$p(y_{N+1} \mid \boldsymbol{y}) \approx p(y_{N+1} \mid \theta = \hat\theta) = \text{Bernoulli}(y_{N+1} \mid \hat\theta)$, so in particular $p(y_{N+1}=1 \mid \boldsymbol{y}) \approx \hat\theta$
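The sketch below (not from the slides) contrasts this plug-in predictive with the posterior predictive $p(y_{N+1}=1 \mid \boldsymbol{y}) = \frac{\alpha+N_1}{\alpha+\beta+N}$ (the posterior mean of 𝜃, a standard result for the Beta-Bernoulli model); the toy data are chosen to make the difference visible:

```python
# A minimal sketch (not from the slides) contrasting the plug-in predictive
# (based on a point estimate of theta) with the full posterior predictive
# p(y_{N+1}=1 | y) = (alpha + N1) / (alpha + beta + N) for the Beta-Bernoulli model.
import numpy as np

y = np.array([1, 1, 1])            # a tiny data set: 3 heads, 0 tails
alpha, beta = 1.0, 1.0             # uniform Beta prior
N, N1 = len(y), int(y.sum())

theta_mle = N1 / N                                   # point estimate of theta
plug_in = theta_mle                                  # p(y*=1) under the plug-in predictive
posterior_pred = (alpha + N1) / (alpha + beta + N)   # posterior-averaged predictive

print(plug_in)         # 1.0  -- the plug-in predictive is overconfident
print(posterior_pred)  # 0.8  -- smoothed by the prior and posterior averaging
```

Note how the posterior predictive never assigns probability exactly 0 or 1, unlike the plug-in predictive.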
The Posterior Distribution (MLE/MAP left as an exercise)
[Figure: example probability vectors 𝝅 on the 3-dimensional simplex, e.g., (1/2,1/4,1/4), (1/4,1/2,1/4), (0,1,0), (0,0,1)]
Brief Detour: Dirichlet Distribution
▪ A visualization of the Dirichlet distribution for different values of the concentration parameter 𝜶
▪ 𝜶 controls the shape of the Dirichlet (just like the Beta distribution's hyperparameters)
▪ If all 𝛼𝑘's are 1, the Dirichlet is like a uniform distribution over the simplex
▪ If all 𝛼𝑘's are large, the PDF peaks around the center of the simplex
[Figure: PDFs of some 3-dim Dirichlet distributions, each generated using a different concentration parameter vector 𝜶; the axes of each panel are 𝜋1, 𝜋2, 𝜋3]
▪ Posterior 𝑝(𝝅|𝒚) is easy to compute due to conjugacy between the multinoulli and the Dirichlet
$p(\boldsymbol{\pi} \mid \boldsymbol{y}, \boldsymbol{\alpha}) = \frac{p(\boldsymbol{\pi}, \boldsymbol{y} \mid \boldsymbol{\alpha})}{p(\boldsymbol{y} \mid \boldsymbol{\alpha})} = \frac{p(\boldsymbol{\pi} \mid \boldsymbol{\alpha})\,p(\boldsymbol{y} \mid \boldsymbol{\pi}, \boldsymbol{\alpha})}{p(\boldsymbol{y} \mid \boldsymbol{\alpha})} = \frac{p(\boldsymbol{\pi} \mid \boldsymbol{\alpha})\,p(\boldsymbol{y} \mid \boldsymbol{\pi})}{p(\boldsymbol{y} \mid \boldsymbol{\alpha})}$
where the marginal likelihood is $p(\boldsymbol{y} \mid \boldsymbol{\alpha}) = \int p(\boldsymbol{\pi} \mid \boldsymbol{\alpha})\,p(\boldsymbol{y} \mid \boldsymbol{\pi})\,d\boldsymbol{\pi}$ (we don't need to compute it in this case, because of conjugacy)
▪ We only need the sufficient statistics (the counts 𝑁𝑘) to estimate the parameters; the values of individual observations aren't needed (another property of the exponential family of distributions – more on this later)
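A minimal sketch (not from the slides): since the posterior is $\text{Dirichlet}(\boldsymbol{\pi} \mid \alpha_1+N_1, \ldots, \alpha_K+N_K)$ (as used on the next slide), computing it just amounts to adding the observed counts to the prior concentration parameters; the data below are toy values:

```python
# A minimal sketch (not from the slides): the Dirichlet-multinoulli posterior.
# With counts N_k of each outcome, conjugacy gives
# p(pi | y, alpha) = Dirichlet(alpha_1 + N_1, ..., alpha_K + N_K).
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])               # prior concentration parameters (K = 3)
y = np.array([0, 2, 1, 0, 0, 2, 1, 0])          # toy observations, each in {0, ..., K-1}
counts = np.bincount(y, minlength=len(alpha))   # sufficient statistics N_k

alpha_post = alpha + counts                     # posterior concentration parameters
print(alpha_post)                               # [5. 3. 3.] for this toy data
print(alpha_post / alpha_post.sum())            # posterior mean of pi
```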
The Predictive Distribution
▪ Finally, let’s also look at the posterior predictive distribution for this model
▪ The PPD is the probability distribution of a new $y_* \in \{1,2,\ldots,K\}$, given training data $\boldsymbol{y} = \{y_1, y_2, \ldots, y_N\}$. It will be a multinoulli; we just need to estimate the probabilities of each of the 𝐾 outcomes
$p(y_* \mid \boldsymbol{y}, \boldsymbol{\alpha}) = \int p(y_* \mid \boldsymbol{\pi})\,p(\boldsymbol{\pi} \mid \boldsymbol{y}, \boldsymbol{\alpha})\,d\boldsymbol{\pi}$
▪ Here $p(y_* \mid \boldsymbol{\pi}) = \text{multinoulli}(y_* \mid \boldsymbol{\pi})$ and $p(\boldsymbol{\pi} \mid \boldsymbol{y}, \boldsymbol{\alpha}) = \text{Dirichlet}(\boldsymbol{\pi} \mid \alpha_1+N_1, \alpha_2+N_2, \ldots, \alpha_K+N_K)$
▪ We can compute the posterior predictive probability of each of the 𝐾 possible outcomes:
$p(y_* = k \mid \boldsymbol{y}, \boldsymbol{\alpha}) = \int p(y_* = k \mid \boldsymbol{\pi})\,p(\boldsymbol{\pi} \mid \boldsymbol{y}, \boldsymbol{\alpha})\,d\boldsymbol{\pi} = \int \pi_k \times \text{Dirichlet}(\boldsymbol{\pi} \mid \alpha_1+N_1, \ldots, \alpha_K+N_K)\,d\boldsymbol{\pi} = \frac{\alpha_k + N_k}{\sum_{k=1}^{K}\alpha_k + N}$
(the last step is the expectation of $\pi_k$ w.r.t. the Dirichlet posterior)
▪ Thus the PPD is a multinoulli with probability vector $\left[\frac{\alpha_k + N_k}{\sum_{k=1}^{K}\alpha_k + N}\right]_{k=1}^{K}$. Note how these probabilities have been "smoothened" due to the use of the prior and the averaging over the posterior (a similar effect was achieved in the Beta-Bernoulli model, too)
▪ The plug-in predictive will also be a multinoulli, but with probability vector given by a point estimate of 𝝅
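A minimal sketch (not from the slides) of these smoothed predictive probabilities, with toy counts chosen so that one outcome is never observed:

```python
# A minimal sketch (not from the slides): posterior predictive probabilities
# for the Dirichlet-multinoulli model, p(y*=k | y, alpha) = (alpha_k + N_k) / (sum_k alpha_k + N),
# compared with the plug-in predictive based on the MLE of pi.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])           # prior concentration parameters
counts = np.array([4, 2, 0])                # N_k: the third outcome was never observed
N = counts.sum()

ppd = (alpha + counts) / (alpha.sum() + N)  # smoothed predictive probabilities
mle = counts / N                            # plug-in predictive with the MLE of pi

print(ppd)   # approx. [0.556, 0.333, 0.111] -- unseen outcome still gets nonzero probability
print(mle)   # [0.667, 0.333, 0.0]           -- plug-in assigns it zero probability
```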
Gaussian Distribution (Univariate)
▪ Distribution over real-valued scalar random variables 𝑌 ∈ ℝ, e.g., height of
students in a class
▪ Defined by a scalar mean 𝜇 and a scalar variance 𝜎 2
$\mathcal{N}(Y = y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$
▪ Mean: $\mathbb{E}[Y] = \mu$
▪ Variance: $\text{var}[Y] = \sigma^2$
▪ The inverse of the variance is called the precision: $\beta = \frac{1}{\sigma^2}$. The Gaussian PDF in terms of the precision is $\mathcal{N}(Y = y \mid \mu, \beta) = \sqrt{\frac{\beta}{2\pi}}\exp\left(-\frac{\beta}{2}(y-\mu)^2\right)$
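A small sketch (not from the slides) verifying that the two parameterizations (variance vs. precision) give the same density; the values used are illustrative:

```python
# A minimal sketch (not from the slides): the univariate Gaussian PDF written
# in terms of the variance sigma^2 and equivalently in terms of the precision beta = 1/sigma^2.
import numpy as np

def gauss_pdf_var(y, mu, sigma2):
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gauss_pdf_prec(y, mu, beta):
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (y - mu) ** 2)

y, mu, sigma2 = 1.3, 0.5, 2.0
print(gauss_pdf_var(y, mu, sigma2))          # both parameterizations give
print(gauss_pdf_prec(y, mu, 1.0 / sigma2))   # the same density value
```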
Gaussian Distribution (Multivariate)
▪ Distribution over real-valued vector random variables 𝒀 ∈ ℝ𝐷
▪ Defined by a mean vector 𝝁 ∈ ℝ𝐷 and a 𝐷 × 𝐷 covariance matrix 𝚺
$\mathcal{N}(\boldsymbol{Y} = \boldsymbol{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^D\,|\boldsymbol{\Sigma}|}}\exp\left(-\frac{1}{2}(\boldsymbol{y}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\boldsymbol{y}-\boldsymbol{\mu})\right)$
[Figure: a two-dimensional Gaussian]
▪ Spherical covariance: equal spreads (variances) along all dimensions
▪ Diagonal covariance: unequal spreads (variances) along the dimensions, but still axis-parallel
▪ Full covariance: unequal spreads (variances) along the dimensions, and also spreads along oblique directions
[Figure: contours of two-dimensional Gaussians with spherical, diagonal, and full covariance matrices; axes range from -5 to 5]
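An illustrative sketch (not from the slides) of the three covariance types, using NumPy to draw samples and recover each covariance matrix empirically; the specific matrices are toy choices:

```python
# A minimal sketch (not from the slides): sampling 2-D Gaussians with
# spherical, diagonal, and full covariance matrices, mirroring the three cases above.
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(2)

cov_spherical = 1.0 * np.eye(2)            # equal variances, axis-aligned contours
cov_diagonal  = np.diag([2.0, 0.5])        # unequal variances, still axis-aligned
cov_full      = np.array([[2.0, 0.9],
                          [0.9, 1.0]])     # correlated dimensions: oblique spread

for cov in (cov_spherical, cov_diagonal, cov_full):
    samples = rng.multivariate_normal(mu, cov, size=5000)
    print(np.cov(samples.T))               # empirical covariance approximately recovers each matrix
```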
Multivariate Gaussian: Marginals and Conditionals
Transformation of Random Variables
▪ Suppose 𝑌 = 𝑓(𝑋) = 𝐴𝑋 + 𝑏 is a linear function of a vector-valued r.v. 𝑋 (𝐴 is a matrix and 𝑏 is a vector, both constants)
▪ Suppose $\mathbb{E}[X] = \mu$ and $\text{cov}[X] = \Sigma$; then, for the vector-valued r.v. 𝑌,
$\mathbb{E}[Y] = \mathbb{E}[AX + b] = A\mu + b$
$\text{cov}[Y] = \text{cov}[AX + b] = A\Sigma A^\top$
▪ Likewise, if 𝑌 = 𝑓(𝑋) = 𝑎⊤𝑋 + 𝑏 is a linear function of a vector-valued r.v. 𝑋 (𝑎 is a vector and 𝑏 is a scalar, both constants)
▪ Suppose $\mathbb{E}[X] = \mu$ and $\text{cov}[X] = \Sigma$; then, for the scalar-valued r.v. 𝑌,
$\mathbb{E}[Y] = \mathbb{E}[a^\top X + b] = a^\top\mu + b$
$\text{var}[Y] = \text{var}[a^\top X + b] = a^\top\Sigma a$
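A small Monte Carlo sketch (not from the slides) checking the vector-valued rules with NumPy; the particular 𝜇, Σ, 𝐴, 𝑏 are arbitrary toy values:

```python
# A minimal sketch (not from the slides) checking the linear-transformation
# rules E[AX + b] = A mu + b and cov[AX + b] = A Sigma A^T by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([0.5, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b                      # apply y = A x + b to each sample

print(Y.mean(axis=0), A @ mu + b)    # empirical vs analytical mean
print(np.cov(Y.T))                   # approximately A Sigma A^T
print(A @ Sigma @ A.T)
```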
Linear Gaussian Model (LGM)
▪ An LGM defines a noisy linear transform of a Gaussian r.v. 𝜽 with $p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$:
$\boldsymbol{y} = \boldsymbol{A}\boldsymbol{\theta} + \boldsymbol{b} + \boldsymbol{\epsilon}$
▪ The noise vector 𝝐 is drawn independently from $\mathcal{N}(\boldsymbol{\epsilon} \mid \boldsymbol{0}, \boldsymbol{L}^{-1})$. Both 𝜽 and 𝒚 are vectors (they can be of different sizes). We also assume 𝑨, 𝒃, 𝚲, 𝑳 to be known; only 𝜽 is unknown
Estimating the Mean of a Gaussian
▪ A simple special case: each observation 𝑦𝑛 is drawn from a univariate Gaussian $\mathcal{N}(y \mid \mu, \sigma^2)$, where the variance 𝜎² is assumed to be known
▪ Likelihood: $p(y_n \mid \mu, \sigma^2) = \mathcal{N}(y_n \mid \mu, \sigma^2) \propto \exp\left(-\frac{(y_n - \mu)^2}{2\sigma^2}\right)$
▪ Overall likelihood: $p(\boldsymbol{y} \mid \mu, \sigma^2) = \prod_{n=1}^{N} p(y_n \mid \mu, \sigma^2)$
▪ Note: it is easy to see that drawing each 𝑦𝑛 from $\mathcal{N}(y \mid \mu, \sigma^2)$ is equivalent to $y_n = \mu + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$; thus 𝑦𝑛 is like a noisy version of 𝜇 with zero-mean Gaussian noise added to it
▪ Let’s estimate mean 𝜇 given 𝒚 using fully Bayesian inference (not point estimation)
A Prior Distribution for the Mean
▪ To compute the posterior, we need a prior over 𝜇
▪ Let's choose a Gaussian prior:
$p(\mu \mid \mu_0, \sigma_0^2) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2) \propto \exp\left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right)$
▪ The prior basically says that a priori we believe 𝜇 is close to 𝜇0
▪ The prior’s variance 𝜎02 denotes how certain we are about our belief
▪ We will assume that the prior's hyperparameters (𝜇0, 𝜎0²) are known
▪ The plug-in predictive distribution, using a point estimate $\hat\mu$ of the mean, is $p(y_* \mid \hat\mu, \sigma^2) = \mathcal{N}(y_* \mid \hat\mu, \sigma^2)$. This is only an approximation of the true PPD $p(y_* \mid \boldsymbol{y})$
▪ On the other hand, the posterior predictive distribution of $y_*$ would be
$p(y_* \mid \boldsymbol{y}) = \int p(y_* \mid \mu, \sigma^2)\,p(\mu \mid \boldsymbol{y})\,d\mu = \int \mathcal{N}(y_* \mid \mu, \sigma^2)\,\mathcal{N}(\mu \mid \mu_N, \sigma_N^2)\,d\mu = \mathcal{N}(y_* \mid \mu_N, \sigma^2 + \sigma_N^2)$
where $\mathcal{N}(\mu \mid \mu_N, \sigma_N^2)$ is the Gaussian posterior over 𝜇. This "extra" variance $\sigma_N^2$ in the PPD is due to the averaging over the posterior's uncertainty
▪ A useful fact: when we have conjugacy, the posterior predictive distribution also has a closed form (we will see this result more formally when talking about exponential family distributions). Here we also used the fact that if the conditional is Gaussian, then the marginal is also Gaussian
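As an illustrative aside (the derivation of the posterior is not shown in this extract), the sketch below uses the standard Gaussian-Gaussian conjugate update $\sigma_N^2 = \left(\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}\right)^{-1}$ and $\mu_N = \sigma_N^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_n y_n}{\sigma^2}\right)$ to compute the posterior and the PPD above; all numerical values are toy choices:

```python
# A minimal sketch (not from the slides' derivation, which isn't shown in this
# extract): the standard Gaussian-Gaussian conjugate update for the mean,
# giving p(mu | y) = N(mu | mu_N, sigma_N^2), and the resulting PPD
# p(y* | y) = N(y* | mu_N, sigma^2 + sigma_N^2).
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                                    # known likelihood variance
mu0, sigma0_sq = 0.0, 10.0                      # prior hyperparameters for mu
y = rng.normal(2.0, np.sqrt(sigma2), size=20)   # toy data with true mean 2.0

N = len(y)
sigmaN_sq = 1.0 / (1.0 / sigma0_sq + N / sigma2)        # posterior variance of mu
muN = sigmaN_sq * (mu0 / sigma0_sq + y.sum() / sigma2)  # posterior mean of mu

print(muN, sigmaN_sq)                # posterior over mu
print(muN, sigma2 + sigmaN_sq)       # PPD mean and variance: note the extra sigma_N^2
```

As the prior becomes vaguer (larger 𝜎0²) or 𝑁 grows, 𝜇𝑁 approaches the sample mean and 𝜎𝑁² shrinks, so the PPD variance approaches the plug-in value 𝜎².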