Notes on Beta and Dirichlet Distributions

Beta Distribution

Definition

$$p(\theta; a, b) = \mathrm{Beta}(\theta; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}$$

Normalization

$$\int p(\theta; a, b)\,d\theta = 1$$

$$\int \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}\,d\theta = 1$$

$$\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int \theta^{a-1}(1-\theta)^{b-1}\,d\theta = 1$$

$$\int \theta^{a-1}(1-\theta)^{b-1}\,d\theta = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
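A quick numeric sanity check of the last identity (a minimal sketch; the values a = 2, b = 3 are arbitrary, and it assumes SciPy is available):

```python
from math import gamma
from scipy.integrate import quad

a, b = 2.0, 3.0  # arbitrary example hyper-parameters

# Left-hand side: integrate theta^(a-1) * (1 - theta)^(b-1) over [0, 1]
lhs, _ = quad(lambda t: t ** (a - 1) * (1 - t) ** (b - 1), 0.0, 1.0)

# Right-hand side: Gamma(a) Gamma(b) / Gamma(a + b)
rhs = gamma(a) * gamma(b) / gamma(a + b)

print(lhs, rhs)  # both are 1/12 = 0.0833... here
```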

Mean

$$
\begin{aligned}
E[\theta] &= \int \theta \times \mathrm{Beta}(\theta; a, b)\,d\theta \\
&= \int \theta \times \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}\,d\theta \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int \theta^{a}(1-\theta)^{b-1}\,d\theta \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \times \frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)} \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \times \frac{a\,\Gamma(a)\Gamma(b)}{(a+b)\,\Gamma(a+b)} \\
&= \frac{a}{a+b}
\end{aligned}
$$

Mode

$$
\begin{aligned}
\ell(\theta) = \log p(\theta; a, b) &= \log \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1} \\
&= \log \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} + (a-1)\log\theta + (b-1)\log(1-\theta)
\end{aligned}
$$

We calculate the derivative with respect to θ and set it to zero:

$$\frac{d\ell(\theta)}{d\theta} = \frac{a-1}{\theta} - \frac{b-1}{1-\theta} = 0$$

$$\theta_{\max} = \frac{a-1}{a+b-2}$$

Variance

$$
\begin{aligned}
\mathrm{var}(\theta) &= E[(\theta - E(\theta))^2] \\
&= E[\theta^2 - 2\theta E(\theta) + E(\theta)^2] \\
&= E[\theta^2] - E[\theta]^2
\end{aligned}
$$

$$
\begin{aligned}
E[\theta^2] &= \int \theta^2 \times \mathrm{Beta}(\theta; a, b)\,d\theta \\
&= \int \theta^2 \times \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}\,d\theta \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int \theta^{a+1}(1-\theta)^{b-1}\,d\theta \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \times \frac{\Gamma(a+2)\Gamma(b)}{\Gamma(a+b+2)} \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \times \frac{a(a+1)\,\Gamma(a)\Gamma(b)}{(a+b)(a+b+1)\,\Gamma(a+b)} \\
&= \frac{a(a+1)}{(a+b)(a+b+1)}
\end{aligned}
$$

$$
\begin{aligned}
\mathrm{var}(\theta) &= \frac{a(a+1)}{(a+b)(a+b+1)} - \left(\frac{a}{a+b}\right)^2 \\
&= \frac{ab}{(a+b)^2(a+b+1)}
\end{aligned}
$$
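A minimal numeric check of the three closed forms above against samples (assumes NumPy; a = 2, b = 5 is an arbitrary example):

```python
import numpy as np

a, b = 2.0, 5.0                       # arbitrary example hyper-parameters
rng = np.random.default_rng(0)
samples = rng.beta(a, b, size=1_000_000)

# Mean and variance: Monte Carlo estimates vs. a/(a+b) and ab/((a+b)^2 (a+b+1))
print(samples.mean(), a / (a + b))
print(samples.var(), a * b / ((a + b) ** 2 * (a + b + 1)))

# Mode: maximize the (unnormalized) density on a grid vs. (a-1)/(a+b-2)
grid = np.linspace(1e-4, 1 - 1e-4, 10_000)
density = grid ** (a - 1) * (1 - grid) ** (b - 1)
print(grid[density.argmax()], (a - 1) / (a + b - 2))
```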

Plots

Figure (not shown here): plots of the Beta distribution Beta(θ; a, b) as a function of θ for various values of the hyper-parameters a and b.
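A small matplotlib sketch that produces this kind of plot; the (a, b) pairs are illustrative choices, not necessarily the ones in the original figure:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta = np.linspace(0.001, 0.999, 500)
for a, b in [(0.5, 0.5), (1, 1), (2, 3), (8, 4)]:   # illustrative hyper-parameters
    plt.plot(theta, beta.pdf(theta, a, b), label=f"a={a}, b={b}")
plt.xlabel("theta")
plt.ylabel("Beta(theta; a, b)")
plt.legend()
plt.show()
```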

The beta-Bernoulli model

Prior

θ ∼ Beta(θ; a, b)

Likelihood

$$D = (X_1, \cdots, X_i, \cdots, X_N)$$

$$X_i \sim \mathrm{Bernoulli}(\theta), \qquad p(X_i = 1) = \theta, \qquad p(X_i = 0) = 1 - \theta$$

$$p(D \mid \theta) = \theta^{N_1}(1-\theta)^{N_0}, \qquad N_1 + N_0 = N$$

where $N_1$ is the number of observations with $X_i = 1$ and $N_0$ is the number with $X_i = 0$.

Posterior

$$p(\theta \mid D) = \frac{p(D, \theta)}{p(D)} = \frac{p(D \mid \theta)\,p(\theta)}{\int p(D \mid \theta)\,p(\theta)\,d\theta} \propto p(D \mid \theta)\,p(\theta)$$

$$p(\theta \mid D) \propto \theta^{N_1}(1-\theta)^{N_0} \times \theta^{a-1}(1-\theta)^{b-1} = \theta^{N_1+a-1}(1-\theta)^{N_0+b-1}$$

$$\int \theta^{N_1+a-1}(1-\theta)^{N_0+b-1}\,d\theta = \frac{\Gamma(a+N_1)\,\Gamma(b+N_0)}{\Gamma(a+b+N_0+N_1)}$$

There is a normalizing constant $C$ such that

$$\int p(\theta \mid D)\,d\theta = 1 = \int C \times \theta^{N_1+a-1}(1-\theta)^{N_0+b-1}\,d\theta$$

So $C$ is

$$C = \frac{\Gamma(a+b+N_0+N_1)}{\Gamma(a+N_1)\,\Gamma(b+N_0)}$$

And we can get

$$p(\theta \mid D) = \frac{\Gamma(a+b+N_0+N_1)}{\Gamma(a+N_1)\,\Gamma(b+N_0)}\,\theta^{N_1+a-1}(1-\theta)^{N_0+b-1}$$
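A minimal sketch of this conjugate update in code (NumPy; the prior Beta(2, 2) and the observations are made-up examples):

```python
import numpy as np

a, b = 2.0, 2.0                             # made-up prior hyper-parameters
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # made-up Bernoulli observations

N1 = int(data.sum())      # number of ones
N0 = len(data) - N1       # number of zeros

# Conjugacy: the posterior is again a Beta distribution
a_post, b_post = a + N1, b + N0

posterior_mean = a_post / (a_post + b_post)
posterior_mode = (a_post - 1) / (a_post + b_post - 2)
print(a_post, b_post, posterior_mean, posterior_mode)
```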

Because the posterior distribution $p(\theta \mid D)$ is $\mathrm{Beta}(\theta; N_1 + a, N_0 + b)$, we can easily calculate the corresponding mean, mode and variance from the previous section on the Beta distribution. For example, the posterior mean is

$$E[\theta \mid D] = \int \theta \times p(\theta \mid D)\,d\theta = \frac{a+N_1}{a+N_1+b+N_0}$$

If we set $M = a + b$ and we know $N = N_0 + N_1$, then we can get

$$
\begin{aligned}
E[\theta \mid D] &= \frac{\frac{a}{M} \times M + \frac{N_1}{N} \times N}{M+N} \\
&= \frac{M}{M+N} \times \frac{a}{M} + \frac{N}{M+N} \times \frac{N_1}{N} \\
&= \lambda \times \frac{a}{M} + (1-\lambda) \times \frac{N_1}{N}
\end{aligned}
$$

where $\lambda = \frac{M}{M+N}$ and $1-\lambda = \frac{N}{M+N}$. We know $\frac{a}{M}$ is the prior mean of $\theta$, and $\frac{N_1}{N}$ is the MLE of $\theta$.

This shows that the posterior mean is a convex combination of the prior mean and the MLE, which captures the notion that the posterior is a compromise between what we previously believed and what the data is telling us. The weaker the prior, the smaller λ is, and hence the closer the posterior mean is to the MLE. One can show similarly that the posterior mode is a convex combination of the prior mode and the MLE, and that it too converges to the MLE.
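A small numeric illustration of this shrinkage (made-up prior and counts):

```python
# Posterior mean as a convex combination of the prior mean and the MLE
a, b = 2.0, 2.0          # made-up prior: prior mean a/M = 0.5
N1, N0 = 7, 3            # made-up data: MLE N1/N = 0.7

M, N = a + b, N1 + N0
lam = M / (M + N)

prior_mean = a / M
mle = N1 / N
posterior_mean = lam * prior_mean + (1 - lam) * mle

# The same number computed directly from the Beta(a + N1, b + N0) posterior
print(posterior_mean, (a + N1) / (a + b + N1 + N0))   # both 0.642857...
```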

Posterior predictive distribution

So far, we have been focusing on inference of the unknown parameter(s). Let us now turn our attention to prediction of future observable data. Consider predicting the probability of heads in a single future trial under a $\mathrm{Beta}(a+N_1, b+N_0)$ posterior.

We have

$$
\begin{aligned}
p(X_{new} = 1 \mid D) &= \int_0^1 p(X_{new} = 1, \theta \mid D)\,d\theta \\
&= \int_0^1 p(X_{new} = 1 \mid \theta, D) \times p(\theta \mid D)\,d\theta \\
&= \int_0^1 p(X_{new} = 1 \mid \theta) \times p(\theta \mid D)\,d\theta \\
&= \int_0^1 \theta \times \mathrm{Beta}(\theta; a+N_1, b+N_0)\,d\theta \\
&= E(\theta \mid D) = \frac{a+N_1}{a+b+N}
\end{aligned}
$$

Thus we see that the mean of the posterior predictive distribution is equivalent (in this case) to plugging in the posterior mean parameters: $p(X_{new} = 1 \mid D) = \mathrm{Bernoulli}(X_{new} \mid E(\theta \mid D))$.
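A short sketch contrasting the closed-form predictive with a Monte Carlo average over posterior draws (NumPy; same made-up numbers as above):

```python
import numpy as np

a, b = 2.0, 2.0          # made-up prior
N1, N0 = 7, 3            # made-up counts
rng = np.random.default_rng(0)

# Closed form: p(X_new = 1 | D) = (a + N1) / (a + b + N)
closed_form = (a + N1) / (a + b + N1 + N0)

# Monte Carlo: average the Bernoulli success probability over posterior draws of theta
theta_draws = rng.beta(a + N1, b + N0, size=1_000_000)
monte_carlo = theta_draws.mean()

print(closed_form, monte_carlo)   # both ~0.6429
```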

Dirichlet Distribution

Definition

$$\alpha = (\alpha_1, \cdots, \alpha_k, \cdots, \alpha_K), \qquad \theta = (\theta_1, \cdots, \theta_k, \cdots, \theta_K)$$

$$p(\theta; \alpha) = \mathrm{Dir}(\theta; \alpha) = \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$

where $\theta_k \geq 0$ and $\sum_{k=1}^{K} \theta_k = 1$.

Normalization

$$\int \mathrm{Dir}(\theta; \alpha)\,d\theta = 1$$

$$\int \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\,d\theta = 1$$

$$\frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \int \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\,d\theta = 1$$

$$\int \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\,d\theta = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma(\sum_{k=1}^{K} \alpha_k)}$$
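As a consistency check, setting $K = 2$ with $\theta_1 = \theta$, $\theta_2 = 1 - \theta$, $\alpha_1 = a$, $\alpha_2 = b$ recovers the Beta distribution,

$$\mathrm{Dir}(\theta, 1-\theta;\, a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1} = \mathrm{Beta}(\theta; a, b)$$

so the results below specialize to the Beta results above.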

Mean

$$
\begin{aligned}
E[\theta_k] &= \int \theta_k \times \mathrm{Dir}(\theta; \alpha)\,d\theta \\
&= \int \theta_k \times \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\,d\theta \\
&= \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \int \theta_k^{(\alpha_k-1)+1} \prod_{i\neq k} \theta_i^{\alpha_i - 1}\,d\theta \\
&= \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \int \theta_k^{(\alpha_k+1)-1} \prod_{i\neq k} \theta_i^{\alpha_i - 1}\,d\theta
\end{aligned}
$$

Because $\Gamma(\alpha_k + 1) = \Gamma(\alpha_k) \times \alpha_k$, we can get

$$
\begin{aligned}
\int \theta_k^{(\alpha_k+1)-1} \prod_{i\neq k} \theta_i^{\alpha_i - 1}\,d\theta &= \frac{\Gamma(\alpha_k+1) \prod_{i\neq k} \Gamma(\alpha_i)}{\Gamma(1 + \sum_{i=1}^{K} \alpha_i)} \\
&= \frac{\alpha_k\,\Gamma(\alpha_k) \times \prod_{i\neq k} \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{K} \alpha_i) \times \sum_{i=1}^{K} \alpha_i} \\
&= \frac{\alpha_k \times \prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{K} \alpha_i) \times \sum_{i=1}^{K} \alpha_i}
\end{aligned}
$$

We can get

$$
\begin{aligned}
E[\theta_k] &= \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \times \frac{\alpha_k \times \prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{K} \alpha_i) \times \sum_{i=1}^{K} \alpha_i} \\
&= \frac{\alpha_k}{\sum_{i=1}^{K} \alpha_i}
\end{aligned}
$$

Mode

$$
\begin{aligned}
\ell(\theta) = \log p(\theta; \alpha) &= \log \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \\
&= \log \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} + \sum_{k=1}^{K} (\alpha_k - 1)\log\theta_k
\end{aligned}
$$

Because $\sum_{k=1}^{K} \theta_k = 1$, we add a Lagrange multiplier $\lambda$ for this constraint and maximize

$$\ell(\theta) = \log \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} + \sum_{k=1}^{K} (\alpha_k - 1)\log\theta_k + \lambda\left(1 - \sum_{k=1}^{K} \theta_k\right)$$

We calculate the partial derivative with respect to $\theta_k$:

$$\frac{\partial \ell(\theta)}{\partial \theta_k} = \frac{\alpha_k - 1}{\theta_k} - \lambda = 0$$

We can get

$$\theta_k = \frac{\alpha_k - 1}{\lambda}$$

Again, because $\sum_{k=1}^{K} \theta_k = 1$, we can get

$$\sum_{k=1}^{K} \frac{\alpha_k - 1}{\lambda} = 1$$

So

$$\lambda = \sum_{k=1}^{K} (\alpha_k - 1)$$

Finally, we can get

$$\theta_{k(\max)} = \frac{\alpha_k - 1}{\lambda} = \frac{\alpha_k - 1}{\sum_{k=1}^{K} (\alpha_k - 1)} = \frac{\alpha_k - 1}{\left(\sum_{k=1}^{K} \alpha_k\right) - K}$$
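A quick numeric check of the mode formula for K = 3, using a brute-force grid search over the simplex (α = (3, 4, 5) is an arbitrary example):

```python
import numpy as np

alpha = np.array([3.0, 4.0, 5.0])   # arbitrary example parameters
K = len(alpha)

# Brute-force maximization of the (unnormalized) log-density on the 2-simplex,
# parameterized by (theta1, theta2) with theta3 = 1 - theta1 - theta2.
grid = np.linspace(1e-3, 1 - 1e-3, 300)
best_lp, best_theta = -np.inf, None
for t1 in grid:
    for t2 in grid:
        t3 = 1.0 - t1 - t2
        if t3 <= 0:
            continue
        theta = np.array([t1, t2, t3])
        lp = np.sum((alpha - 1) * np.log(theta))   # normalizer dropped: constant in theta
        if lp > best_lp:
            best_lp, best_theta = lp, theta

closed_form = (alpha - 1) / (alpha.sum() - K)
print(best_theta)     # approximately [0.222, 0.333, 0.444]
print(closed_form)    # [0.2222..., 0.3333..., 0.4444...]
```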

Variance

$$\mathrm{var}(\theta_k) = E[\theta_k^2] - E[\theta_k]^2$$

First, we can get

$$
\begin{aligned}
E[\theta_k^2] &= \int \theta_k^2 \times \mathrm{Dir}(\theta; \alpha)\,d\theta \\
&= \int \theta_k^2 \times \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\,d\theta \\
&= \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \int \theta_k^{(\alpha_k-1)+2} \prod_{i\neq k} \theta_i^{\alpha_i - 1}\,d\theta \\
&= \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \int \theta_k^{(\alpha_k+2)-1} \prod_{i\neq k} \theta_i^{\alpha_i - 1}\,d\theta
\end{aligned}
$$

Because $\Gamma(\alpha_k + 2) = \Gamma(\alpha_k) \times \alpha_k \times (\alpha_k + 1)$, we can get

$$
\begin{aligned}
\int \theta_k^{(\alpha_k+2)-1} \prod_{i\neq k} \theta_i^{\alpha_i - 1}\,d\theta &= \frac{\Gamma(\alpha_k+2) \prod_{i\neq k} \Gamma(\alpha_i)}{\Gamma(2 + \sum_{i=1}^{K} \alpha_i)} \\
&= \frac{\alpha_k(\alpha_k+1)\,\Gamma(\alpha_k) \times \prod_{i\neq k} \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{K} \alpha_i) \times \sum_{i=1}^{K} \alpha_i \times (1 + \sum_{i=1}^{K} \alpha_i)} \\
&= \frac{\alpha_k(\alpha_k+1) \times \prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{K} \alpha_i) \times \sum_{i=1}^{K} \alpha_i \times (1 + \sum_{i=1}^{K} \alpha_i)}
\end{aligned}
$$

We have

$$
\begin{aligned}
E[\theta_k^2] &= \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \times \frac{\alpha_k(\alpha_k+1) \times \prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{K} \alpha_i) \times \sum_{i=1}^{K} \alpha_i \times (1 + \sum_{i=1}^{K} \alpha_i)} \\
&= \frac{\alpha_k(\alpha_k+1)}{\sum_{i=1}^{K} \alpha_i \times (1 + \sum_{i=1}^{K} \alpha_i)}
\end{aligned}
$$

Finally, we get

$$
\begin{aligned}
\mathrm{var}(\theta_k) &= \frac{\alpha_k(\alpha_k+1)}{\sum_{i=1}^{K} \alpha_i \times (1 + \sum_{i=1}^{K} \alpha_i)} - \left(\frac{\alpha_k}{\sum_{i=1}^{K} \alpha_i}\right)^2 \\
&= \frac{\alpha_k\left(\left(\sum_{i=1}^{K} \alpha_i\right) - \alpha_k\right)}{\left(\sum_{i=1}^{K} \alpha_i\right)^2 \times \left(1 + \sum_{i=1}^{K} \alpha_i\right)}
\end{aligned}
$$
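A minimal sampling check of the mean and variance formulas (NumPy; α = (3, 4, 5) is an arbitrary example):

```python
import numpy as np

alpha = np.array([3.0, 4.0, 5.0])    # arbitrary example parameters
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=1_000_000)   # shape (1_000_000, 3)

a0 = alpha.sum()
mean_closed = alpha / a0
var_closed = alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1))

print(samples.mean(axis=0), mean_closed)
print(samples.var(axis=0), var_closed)
```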
Figure (not shown here): (a) The Dirichlet distribution when K = 3 defines a distribution over the simplex, which can be represented by a triangular surface. Points on this surface satisfy $0 \leq \theta_k \leq 1$ and $\sum_{k=1}^{3} \theta_k = 1$. (b) Plot of the Dirichlet density when α = (2, 2, 2). (c) α = (20, 2, 2). (d) α = (0.1, 0.1, 0.1).

Further figures (not shown here) plot the Dirichlet distribution with K = 10 and symmetric parameters, with all αk equal to 1, 10, 100, 0.1, 0.01, and 0.001 respectively.
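A small sketch that generates this kind of figure, drawing one sample from a symmetric Dirichlet with K = 10 for a few values of αk (matplotlib; the particular αk values are an illustrative subset):

```python
import numpy as np
import matplotlib.pyplot as plt

K = 10
rng = np.random.default_rng(0)

fig, axes = plt.subplots(1, 4, figsize=(14, 3), sharey=True)
for ax, a in zip(axes, [100.0, 1.0, 0.1, 0.01]):   # illustrative symmetric parameters
    theta = rng.dirichlet(np.full(K, a))           # one draw from Dir(a, ..., a)
    ax.bar(range(1, K + 1), theta)
    ax.set_title(f"alpha_k = {a}")
    ax.set_xlabel("k")
axes[0].set_ylabel("theta_k")
plt.tight_layout()
plt.show()
```

Large αk gives near-uniform draws, while small αk gives sparse draws concentrated on a few components, which is what the sequence of figures illustrates.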

The Dirichlet-multinoulli model

Prior

θ ∼ Dirichlet(θ; α)

Likelihood

$$D = (X_1, \cdots, X_i, \cdots, X_N), \qquad X_i \sim \mathrm{Multinoulli}(\theta)$$

$$p(D \mid \theta) = \prod_{k=1}^{K} \theta_k^{N_k}, \qquad \sum_{k=1}^{K} N_k = N$$

where $N_k$ is the number of observations with $X_i = k$.

Posterior

$$p(\theta \mid D) = \frac{p(D, \theta)}{p(D)} = \frac{p(D \mid \theta)\,p(\theta)}{\int p(D \mid \theta)\,p(\theta)\,d\theta} \propto p(D \mid \theta)\,p(\theta)$$

$$p(\theta \mid D) \propto \prod_{k=1}^{K} \theta_k^{N_k} \times \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} = \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1}$$

$$\int \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1}\,d\theta = \frac{\prod_{k=1}^{K} \Gamma(N_k + \alpha_k)}{\Gamma(\sum_{k=1}^{K} (N_k + \alpha_k))}$$

There is a normalizing constant $C$ such that

$$\int p(\theta \mid D)\,d\theta = 1 = \int C \times \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1}\,d\theta$$

So $C$ is

$$C = \frac{\Gamma(\sum_{k=1}^{K} (N_k + \alpha_k))}{\prod_{k=1}^{K} \Gamma(N_k + \alpha_k)}$$

And we can get

$$p(\theta \mid D) = \frac{\Gamma(\sum_{k=1}^{K} (N_k + \alpha_k))}{\prod_{k=1}^{K} \Gamma(N_k + \alpha_k)} \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1}$$
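A minimal sketch of this conjugate update (NumPy; the categories, observations, and symmetric Dirichlet(1, ..., 1) prior are made-up examples):

```python
import numpy as np

K = 4
alpha = np.ones(K)                          # made-up symmetric prior
data = np.array([0, 2, 2, 1, 2, 0, 2])      # made-up observations, values in {0, ..., K-1}

counts = np.bincount(data, minlength=K)     # N_k for each category
alpha_post = alpha + counts                 # conjugacy: posterior is Dir(alpha + counts)

posterior_mean = alpha_post / alpha_post.sum()
print(alpha_post, posterior_mean)
```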

Because the posterior distribution $p(\theta \mid D)$ is $\mathrm{Dirichlet}(\theta; N_1 + \alpha_1, \cdots, N_K + \alpha_K)$, we can easily calculate the corresponding mean, mode and variance from the previous section on the Dirichlet distribution. For example, the posterior mean is

$$E[\theta_k \mid D] = \frac{\alpha_k + N_k}{\sum_{i=1}^{K} (\alpha_i + N_i)}$$

If we set $M = \sum_{k=1}^{K} \alpha_k$ and we know $N = \sum_{k=1}^{K} N_k$, then we can get

$$
\begin{aligned}
E[\theta_k \mid D] &= \frac{\frac{\alpha_k}{M} \times M + \frac{N_k}{N} \times N}{M+N} \\
&= \frac{M}{M+N} \times \frac{\alpha_k}{M} + \frac{N}{M+N} \times \frac{N_k}{N} \\
&= \lambda \times \frac{\alpha_k}{M} + (1-\lambda) \times \frac{N_k}{N}
\end{aligned}
$$

where $\lambda = \frac{M}{M+N}$ and $1-\lambda = \frac{N}{M+N}$. We know $\frac{\alpha_k}{M}$ is the prior mean of $\theta_k$, and $\frac{N_k}{N}$ is the MLE of $\theta_k$.

Posterior predictive distribution

The posterior predictive distribution for a single multinoulli trial is given by the following expression:

$$
\begin{aligned}
p(X_{new} = j \mid D) &= \int p(X_{new} = j, \theta \mid D)\,d\theta \\
&= \int p(X_{new} = j \mid \theta, D) \times p(\theta \mid D)\,d\theta \\
&= \int p(X_{new} = j \mid \theta) \times p(\theta \mid D)\,d\theta \\
&= \int \theta_j \times \mathrm{Dir}(\theta; N_1 + \alpha_1, \cdots, N_K + \alpha_K)\,d\theta \\
&= E(\theta_j \mid D) = \frac{\alpha_j + N_j}{\sum_{k=1}^{K} (\alpha_k + N_k)}
\end{aligned}
$$

The above expression avoids the zero-count problem. In fact, this form of Bayesian smoothing is even more important in the multinomial case than in the binary case, since the likelihood of data sparsity increases once we start partitioning the data into many categories.
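A short sketch of the zero-count point: with this Bayesian (add-α) smoothing, a category never observed in D still gets nonzero predictive probability, whereas the MLE assigns it zero (made-up counts; α = 1 corresponds to add-one smoothing):

```python
import numpy as np

alpha = np.ones(4)                 # made-up symmetric Dirichlet prior (add-one smoothing)
counts = np.array([5, 3, 2, 0])    # made-up counts; the last category was never observed

mle = counts / counts.sum()                               # assigns probability 0 to the last category
predictive = (alpha + counts) / (alpha + counts).sum()    # strictly positive everywhere

print(mle)         # [0.5   0.3   0.2   0.  ]
print(predictive)  # [0.428... 0.285... 0.214... 0.071...]
```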

References

David M. Blei, Probabilistic Topic Models, tutorial at KDD 2011 (https://www.cs.princeton.edu/~blei/kdd-tutorial.pdf)

Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006

Kevin Murphy, Machine Learning: A Probabilistic Perspective, 2012
