Notes on Beta and Dirichlet Distributions
Beta Distribution
Definition
$$p(\theta; a, b) = \mathrm{Beta}(\theta; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}$$
Normalization
$$\int p(\theta; a, b)\, d\theta = 1$$

$$\int \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}\, d\theta = 1$$

$$\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int \theta^{a-1} (1-\theta)^{b-1}\, d\theta = 1$$

$$\int \theta^{a-1} (1-\theta)^{b-1}\, d\theta = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
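This identity can be checked numerically; the following is a minimal sketch, assuming SciPy is available, with arbitrary example values a = 2.5, b = 4.0 chosen only for illustration:

```python
# Numerical check of  ∫ θ^(a-1) (1-θ)^(b-1) dθ = Γ(a)Γ(b)/Γ(a+b)
from scipy import integrate, special

a, b = 2.5, 4.0  # arbitrary example hyper-parameters

# Left-hand side: numerical integration over [0, 1]
lhs, _ = integrate.quad(lambda t: t**(a - 1) * (1 - t)**(b - 1), 0.0, 1.0)

# Right-hand side: the Beta function expressed with Gamma functions
rhs = special.gamma(a) * special.gamma(b) / special.gamma(a + b)

print(lhs, rhs)  # both ≈ 0.0277 for (a, b) = (2.5, 4.0)
```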
Mean
$$\begin{aligned}
\mathbb{E}[\theta] &= \int \theta \times \mathrm{Beta}(\theta; a, b)\, d\theta \\
&= \int \theta \times \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}\, d\theta \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int \theta^{a} (1-\theta)^{b-1}\, d\theta \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \times \frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)} \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \times \frac{a \times \Gamma(a)\Gamma(b)}{(a+b) \times \Gamma(a+b)} \\
&= \frac{a}{a+b}
\end{aligned}$$
Mode
$$\begin{aligned}
\ell(\theta) = \log p(\theta; a, b) &= \log \left[ \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1} \right] \\
&= \log \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} + (a-1)\log\theta + (b-1)\log(1-\theta)
\end{aligned}$$

$$\frac{d\ell(\theta)}{d\theta} = (a-1)\frac{1}{\theta} - (b-1)\frac{1}{1-\theta} = 0$$

$$\theta_{\max} = \frac{a-1}{a+b-2}$$
Variance
Using $\mathrm{var}(\theta) = \mathbb{E}[\theta^2] - (\mathbb{E}[\theta])^2$, where $\mathbb{E}[\theta^2]$ is computed in the same way as the mean above,

$$\begin{aligned}
\mathrm{var}(\theta) &= \frac{a(a+1)}{(a+b)(a+b+1)} - \left(\frac{a}{a+b}\right)^2 \\
&= \frac{ab}{(a+b)^2 (a+b+1)}
\end{aligned}$$
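These closed forms are easy to sanity-check against scipy.stats.beta; a minimal sketch with illustrative values a = 3, b = 5 (both greater than 1, so the mode formula applies):

```python
# Check the closed-form mean, mode, and variance of Beta(a, b)
# against scipy.stats.beta (and a numerical argmax for the mode).
import numpy as np
from scipy import stats

a, b = 3.0, 5.0  # arbitrary example with a, b > 1

mean_formula = a / (a + b)
mode_formula = (a - 1) / (a + b - 2)
var_formula = a * b / ((a + b) ** 2 * (a + b + 1))

dist = stats.beta(a, b)
print(mean_formula, dist.mean())   # 0.375 vs 0.375
print(var_formula, dist.var())     # ~0.02604 vs ~0.02604

grid = np.linspace(1e-6, 1 - 1e-6, 100001)
print(mode_formula, grid[np.argmax(dist.pdf(grid))])  # 1/3 vs ~0.3333
```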
Plots
The above figure shows plots of the Beta distribution Beta(θ ∣ a, b) as a function of θ for various values of the hyper-parameters a and b.
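A minimal plotting sketch along the same lines, assuming Matplotlib and SciPy are available; the (a, b) settings below are illustrative choices, not necessarily the ones from the original figure:

```python
# Plot Beta(θ | a, b) as a function of θ for a few (a, b) settings.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

theta = np.linspace(1e-4, 1 - 1e-4, 500)
for a, b in [(0.5, 0.5), (1, 1), (2, 2), (2, 5), (5, 2)]:  # example settings
    plt.plot(theta, stats.beta.pdf(theta, a, b), label=f"a={a}, b={b}")

plt.xlabel(r"$\theta$")
plt.ylabel(r"Beta($\theta \mid a, b$)")
plt.legend()
plt.show()
```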
Prior
θ ∼ Beta(θ; a, b)
Likelihood
D = (X1 , ⋯ , Xi , ⋯ , XN )
Xi ∼ Bernoulli(θ)
p(Xi = 1) = θ
p(Xi = 0) = 1 − θ
Posterior
$$p(\theta \mid D) = \frac{p(D, \theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta} \propto p(D \mid \theta)\, p(\theta)$$

With $N_1$ the number of observations with $X_i = 1$ and $N_0 = N - N_1$ the number with $X_i = 0$, the likelihood is $p(D \mid \theta) = \theta^{N_1}(1-\theta)^{N_0}$, so

$$p(\theta \mid D) \propto \theta^{N_1}(1-\theta)^{N_0} \times \theta^{a-1}(1-\theta)^{b-1} = \theta^{N_1 + a - 1}(1-\theta)^{N_0 + b - 1}$$

By the normalization result above,

$$\int \theta^{N_1 + a - 1} (1-\theta)^{N_0 + b - 1}\, d\theta = \frac{\Gamma(a + N_1)\Gamma(b + N_0)}{\Gamma(a + b + N_0 + N_1)}$$

There is a constant $C$ such that $\int p(\theta \mid D)\, d\theta = 1 = \int C\, \theta^{N_1 + a - 1}(1-\theta)^{N_0 + b - 1}\, d\theta$. So $C$ is

$$C = \frac{\Gamma(a + b + N_0 + N_1)}{\Gamma(a + N_1)\Gamma(b + N_0)}$$

$$p(\theta \mid D) = \frac{\Gamma(a + b + N_0 + N_1)}{\Gamma(a + N_1)\Gamma(b + N_0)}\, \theta^{N_1 + a - 1}(1-\theta)^{N_0 + b - 1}$$
If we set $M = a + b$ and $N = N_0 + N_1$, then

$$\begin{aligned}
\mathbb{E}[\theta \mid D] &= \frac{a + N_1}{M + N} = \frac{\frac{a}{M} \times M + \frac{N_1}{N} \times N}{M + N} \\
&= \frac{M}{M+N} \times \frac{a}{M} + \frac{N}{M+N} \times \frac{N_1}{N} \\
&= \lambda \times \frac{a}{M} + (1 - \lambda) \times \frac{N_1}{N}
\end{aligned}$$

If $\lambda = \frac{M}{M+N}$, then $1 - \lambda = \frac{N}{M+N}$. We know $\frac{a}{M}$ is the prior mean of $\theta$, and $\frac{N_1}{N}$ is the MLE of $\theta$.
This shows that the posterior mean is a convex combination of the prior mean and the MLE, which captures the notion that the posterior is a compromise between what we previously believed and what the data is telling us. So the weaker the prior, the smaller λ is, and hence the closer the posterior mean is to the MLE. One can show similarly that the posterior mode is a convex combination of the prior mode and the MLE, and that it too converges to the MLE as the amount of data grows.
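A small numeric sketch of this convex-combination view (the prior hyper-parameters and the counts below are made-up toy values):

```python
# Beta posterior mean as a convex combination of the prior mean a/M
# and the MLE N1/N (toy numbers, chosen only for illustration).
a, b = 2.0, 2.0          # prior hyper-parameters
N1, N0 = 7, 3            # observed heads / tails
M, N = a + b, N1 + N0

posterior_mean = (a + N1) / (M + N)
lam = M / (M + N)
convex_combination = lam * (a / M) + (1 - lam) * (N1 / N)

print(posterior_mean, convex_combination)  # both 0.642857...
```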
So far, we have been focusing on inference of the unknown parameter(s). Let us now turn our attention to prediction of future observable data. Consider predicting the probability of heads in a single future trial under a Beta(a + N1, b + N0) posterior. We have
$$\begin{aligned}
p(X_{\text{new}} = 1 \mid D) &= \int_0^1 p(X_{\text{new}} = 1, \theta \mid D)\, d\theta \\
&= \int_0^1 p(X_{\text{new}} = 1 \mid \theta, D) \times p(\theta \mid D)\, d\theta \\
&= \int_0^1 p(X_{\text{new}} = 1 \mid \theta) \times p(\theta \mid D)\, d\theta \\
&= \int_0^1 \theta \times \mathrm{Beta}(\theta; a + N_1, b + N_0)\, d\theta \\
&= \mathbb{E}[\theta \mid D] = \frac{a + N_1}{a + b + N}
\end{aligned}$$
Thus we see that the mean of the posterior predictive distribution is equivalent (in this case) to plugging in the posterior mean parameter: $p(X_{\text{new}} \mid D) = \mathrm{Bernoulli}(X_{\text{new}} \mid \mathbb{E}[\theta \mid D])$.
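As a quick check of this equivalence, one can integrate θ against the Beta posterior numerically and compare with the closed form; a sketch assuming SciPy and the same toy numbers as above:

```python
# Check p(X_new = 1 | D) = E[θ | D] for the Beta-Bernoulli model by
# integrating θ · Beta(θ; a + N1, b + N0) numerically (toy numbers).
from scipy import stats, integrate

a, b = 2.0, 2.0
N1, N0 = 7, 3
posterior = stats.beta(a + N1, b + N0)

predictive, _ = integrate.quad(lambda t: t * posterior.pdf(t), 0.0, 1.0)
closed_form = (a + N1) / (a + b + N1 + N0)

print(predictive, closed_form)  # both 0.642857...
```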
Dirichlet Distribution
Definition
$$\alpha = (\alpha_1, \cdots, \alpha_k, \cdots, \alpha_K), \qquad \theta = (\theta_1, \cdots, \theta_k, \cdots, \theta_K)$$
$$p(\theta; \alpha) = \mathrm{Dir}(\theta; \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$
Normalization
$$\int \mathrm{Dir}(\theta; \alpha)\, d\theta = 1$$

$$\int \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\, d\theta = 1$$

$$\frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \int \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\, d\theta = 1$$

$$\int \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\, d\theta = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}$$
Mean
$$\begin{aligned}
\mathbb{E}[\theta_k] &= \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \times \frac{\alpha_k \times \prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right) \times \sum_{i=1}^{K} \alpha_i} \\
&= \frac{\alpha_k}{\sum_{i=1}^{K} \alpha_i}
\end{aligned}$$
Mode
$$\begin{aligned}
\ell(\theta) = \log p(\theta; \alpha) &= \log \left[ \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \right] \\
&= \log \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} + \sum_{k=1}^{K} (\alpha_k - 1) \log \theta_k
\end{aligned}$$
Because $\sum_{k=1}^{K} \theta_k = 1$, we add this constraint with a Lagrange multiplier $\lambda$:

$$\ell(\theta) = \log \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} + \sum_{k=1}^{K} (\alpha_k - 1) \log \theta_k + \lambda \left(1 - \sum_{k=1}^{K} \theta_k\right)$$

$$\frac{\partial \ell(\theta)}{\partial \theta_k} = \frac{\alpha_k - 1}{\theta_k} - \lambda = 0$$

We can get

$$\theta_k = \frac{\alpha_k - 1}{\lambda}$$

Again, because $\sum_{k=1}^{K} \theta_k = 1$, we can get

$$\sum_{k=1}^{K} \frac{\alpha_k - 1}{\lambda} = 1$$

So

$$\lambda = \sum_{k=1}^{K} (\alpha_k - 1)$$

$$\theta_k^{(\max)} = \frac{\alpha_k - 1}{\lambda} = \frac{\alpha_k - 1}{\sum_{k=1}^{K} (\alpha_k - 1)} = \frac{\alpha_k - 1}{\left(\sum_{k=1}^{K} \alpha_k\right) - K}$$
Variance
$$\begin{aligned}
\mathbb{E}[\theta_k^2] &= \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \times \frac{\alpha_k (\alpha_k + 1) \times \prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right) \times \sum_{i=1}^{K} \alpha_i \times \left(1 + \sum_{i=1}^{K} \alpha_i\right)} \\
&= \frac{\alpha_k (\alpha_k + 1)}{\sum_{i=1}^{K} \alpha_i \times \left(1 + \sum_{i=1}^{K} \alpha_i\right)}
\end{aligned}$$

Finally, we get

$$\begin{aligned}
\mathrm{var}(\theta_k) &= \frac{\alpha_k (\alpha_k + 1)}{\sum_{i=1}^{K} \alpha_i \times \left(1 + \sum_{i=1}^{K} \alpha_i\right)} - \left(\frac{\alpha_k}{\sum_{i=1}^{K} \alpha_i}\right)^2 \\
&= \frac{\alpha_k \left(\left(\sum_{i=1}^{K} \alpha_i\right) - \alpha_k\right)}{\left(\sum_{i=1}^{K} \alpha_i\right)^2 \times \left(1 + \sum_{i=1}^{K} \alpha_i\right)}
\end{aligned}$$
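The mean, variance, and mode formulas above can be sanity-checked against scipy.stats.dirichlet; a minimal sketch with an illustrative α = (2, 3, 5), all components greater than 1 so the mode formula applies:

```python
# Check the Dirichlet mean, variance, and mode formulas (toy α).
import numpy as np
from scipy import stats

alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()

mean_formula = alpha / a0
var_formula = alpha * (a0 - alpha) / (a0**2 * (a0 + 1))
mode_formula = (alpha - 1) / (a0 - len(alpha))

dist = stats.dirichlet(alpha)
print(mean_formula, dist.mean())   # [0.2 0.3 0.5] in both cases
print(var_formula, dist.var())     # element-wise agreement

# The closed-form mode should beat random points on the simplex.
rng = np.random.default_rng(0)
best_random = max(dist.logpdf(rng.dirichlet(np.ones(3))) for _ in range(1000))
print(dist.logpdf(mode_formula), best_random)  # mode log-density is larger
```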
In the above figure, (a) the Dirichlet distribution when K = 3 defines a distribution over the simplex, which can be represented by the triangular surface. Points on this surface satisfy $0 \le \theta_k \le 1$ and $\sum_{k=1}^{3} \theta_k = 1$. (b) Plot of the Dirichlet density when α = (2, 2, 2). (c) α = (20, 2, 2). (d) α = (0.1, 0.1, 0.1).
In the above figure, all K = 10 components satisfy αk = 100.
In the above figure, all K = 10 components satisfy αk = 0.1.
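The qualitative behaviour in these figures (large αk concentrates samples near the uniform vector, small αk pushes them toward sparse corners) can be reproduced by sampling; a minimal sketch using numpy.random, with an arbitrary seed and sample count:

```python
# Effect of the concentration parameter on Dirichlet samples.
import numpy as np

rng = np.random.default_rng(0)
K = 10
for alpha_k in (100.0, 1.0, 0.1):
    samples = rng.dirichlet(alpha_k * np.ones(K), size=5)
    print(f"alpha_k = {alpha_k}:")
    print(np.round(samples, 3))  # rows are probability vectors summing to 1
```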
Prior
θ ∼ Dirichlet(θ; α)
Likelihood
D = (X1 , ⋯ , Xi , ⋯ , XN )
Xi ∼ Multinoulli(θ)
$$p(D \mid \theta) = \prod_{k=1}^{K} \theta_k^{N_k}$$

where $N_k$ is the number of observations with $X_i = k$, so

$$\sum_{k=1}^{K} N_k = N$$
Posterior
$$p(\theta \mid D) = \frac{p(D, \theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta} \propto p(D \mid \theta)\, p(\theta)$$

$$p(\theta \mid D) \propto \prod_{k=1}^{K} \theta_k^{N_k} \times \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} = \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1}$$

$$\int \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1}\, d\theta = \frac{\prod_{k=1}^{K} \Gamma(N_k + \alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K} (N_k + \alpha_k)\right)}$$
There is a constant C
$$\int p(\theta \mid D)\, d\theta = 1 = \int C \times \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1}\, d\theta$$
So C is
$$C = \frac{\Gamma\!\left(\sum_{k=1}^{K} (N_k + \alpha_k)\right)}{\prod_{k=1}^{K} \Gamma(N_k + \alpha_k)}$$
$$p(\theta \mid D) = \frac{\Gamma\!\left(\sum_{k=1}^{K} (N_k + \alpha_k)\right)}{\prod_{k=1}^{K} \Gamma(N_k + \alpha_k)} \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1}$$
$$\mathbb{E}[\theta_k \mid D] = \frac{\alpha_k + N_k}{\sum_{i=1}^{K} (\alpha_i + N_i)}$$

If we set $M = \sum_{k=1}^{K} \alpha_k$ and we know $N = \sum_{k=1}^{K} N_k$, then we can get

$$\begin{aligned}
\mathbb{E}[\theta_k \mid D] &= \frac{\frac{\alpha_k}{M} \times M + \frac{N_k}{N} \times N}{M + N} \\
&= \frac{M}{M+N} \times \frac{\alpha_k}{M} + \frac{N}{M+N} \times \frac{N_k}{N} \\
&= \lambda \times \frac{\alpha_k}{M} + (1 - \lambda) \times \frac{N_k}{N}
\end{aligned}$$

If $\lambda = \frac{M}{M+N}$, then $1 - \lambda = \frac{N}{M+N}$. We know $\frac{\alpha_k}{M}$ is the prior mean of $\theta_k$, and $\frac{N_k}{N}$ is the MLE of $\theta_k$.
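A small numeric sketch of the Dirichlet update and this convex-combination view; the prior pseudo-counts and observed counts below are toy values:

```python
# Dirichlet-multinoulli update: the counts N_k add to the prior pseudo-counts
# α_k, and the posterior mean interpolates between the prior mean α_k/M and
# the MLE N_k/N (toy numbers, chosen only for illustration).
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])      # prior pseudo-counts
counts = np.array([5, 3, 0])           # observed N_k
M, N = alpha.sum(), counts.sum()

posterior_alpha = alpha + counts
posterior_mean = posterior_alpha / posterior_alpha.sum()

lam = M / (M + N)
convex = lam * (alpha / M) + (1 - lam) * (counts / N)

print(posterior_mean)  # [0.5455 0.3636 0.0909]
print(convex)          # identical
```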
The posterior predictive distribution for a single multinoulli trial is given by the following expression:

$$\begin{aligned}
p(X_{\text{new}} = j \mid D) &= \int p(X_{\text{new}} = j, \theta \mid D)\, d\theta \\
&= \int \theta_j \times \mathrm{Dir}(\theta; N_1 + \alpha_1, \cdots, N_K + \alpha_K)\, d\theta \\
&= \mathbb{E}[\theta_j \mid D] = \frac{\alpha_j + N_j}{\sum_{k=1}^{K} (\alpha_k + N_k)}
\end{aligned}$$
The above expression avoids the zero-count problem. In fact, this form of Bayesian smoothing is even more important in the multinomial case than in the binary case, since the likelihood of data sparsity increases once we start partitioning the data into many categories.
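A minimal sketch of this zero-count point, reusing the toy counts from the previous sketch: the MLE assigns zero probability to the unseen category, while the smoothed posterior predictive does not:

```python
# Zero-count problem: MLE vs. posterior predictive under a symmetric
# Dirichlet(1, 1, 1) prior (toy counts, chosen only for illustration).
import numpy as np

counts = np.array([5, 3, 0])
alpha = np.ones(3)              # symmetric prior pseudo-counts

mle = counts / counts.sum()
predictive = (counts + alpha) / (counts + alpha).sum()

print(mle)         # [0.625 0.375 0.   ]  -> unseen category gets zero
print(predictive)  # [0.545 0.364 0.091]  -> smoothed, never exactly zero
```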