
Machine Learning II

Part I: An overview of Bayesian Nonparametrics


The Dirichlet process and models based on the DP

Igor Prünster
Bocconi University

1 / 40
Discrete nonparametric priors

If the de Finetti measure Q selects (a.s.) discrete distributions, i.e. P̃ is a discrete random probability measure

$$\tilde P(\,\cdot\,) = \sum_{i \ge 1} \tilde p_i\, \delta_{Z_i}(\,\cdot\,), \qquad (♢)$$

then any (exchangeable) vector $X^{(n)} := (X_1, \ldots, X_n)$ generated by Q will exhibit ties with positive probability, i.e. feature $K_n$ distinct observations

$$X_1^*, \ldots, X_{K_n}^*$$

with frequencies $N_1, \ldots, N_{K_n}$ such that $\sum_{i=1}^{K_n} N_i = n$.

Species sampling: (♢) model for species distribution within a population


• $X_i^*$ is the label of the i–th distinct species in the sample;
• $N_i$ is the frequency of the i–th distinct species;
• $K_n$ is the total number of distinct species in the sample.
=⇒ Species metaphor

2 / 40
Dirichlet process from a predictive point of view

Problem: Assume (Xn )n≥1 is an exchangeable sequence.


▶ Predict the distribution of Xn+1 conditional on a sample X (n) with Kn distinct
values X1∗ , . . . , XK∗n and frequencies N1 , . . . , NKn ;
▶ Prior guess at law of any of the Xi ’s is P ∗ ;
▶ The strength of the prior belief is measured by a parameter θ > 0.

Idea: Predict the distribution of $X_{n+1}$ as a linear combination of $P^*$ and the empirical measure $n^{-1}\sum_{i=1}^{K_n} N_i\, \delta_{X_i^*}$, namely

$$P[X_{n+1} \in \cdot \mid X^{(n)}] \;=\; \underbrace{\frac{\theta}{\theta+n}}_{P[X_{n+1}=\text{``new''}\,\mid\, X^{(n)}]} \underbrace{P^*(\cdot)}_{\text{prior guess}} \;+\; \underbrace{\frac{n}{\theta+n}}_{P[X_{n+1}=\text{``old''}\,\mid\, X^{(n)}]}\, \underbrace{\frac{1}{n}\sum_{i=1}^{K_n} N_i\, \delta_{X_i^*}(\cdot)}_{\text{empirical measure}}$$

=⇒ Predictive distributions of the Dirichlet process (DP) [Ferguson, 1973].


Remark: The de Finetti measure Q of (Xn )n≥1 is a DP prior iff the prediction rule is
a linear combination of P ∗ and the empirical measure [Regazzini, 1978; Lo, 1991]

3 / 40
Figure: Plot of a Dirichlet process predictive cumulative distribution function given $X^{(5)}$ with continuous prior guess.

4 / 40
The Dirichlet distribution
The original definition of the Dirichlet process [Ferguson, 1973] was in terms of a
consistent family of finite–dimensional Dirichlet distributions.

Dirichlet distribution

Let $\alpha = (\alpha_1, \ldots, \alpha_k)$ be a vector of nonnegative numbers and set $|\alpha| = \sum_{i=1}^k \alpha_i$. Define the (k − 1)-dimensional simplex

$$\Delta_{k-1} = \Big\{ (x_1, \ldots, x_{k-1}) : x_1 \ge 0, \ldots, x_{k-1} \ge 0,\; \sum_{i=1}^{k-1} x_i \le 1 \Big\}.$$

The random vector $(p_1, \ldots, p_k)$ is said to have Dirichlet distribution with parameters $\alpha_1, \ldots, \alpha_k$, denoted

$$(p_1, \ldots, p_k) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_k),$$

if it has density w.r.t. the Lebesgue measure on $\Delta_{k-1}$ given by

$$d_k(p_1, \ldots, p_{k-1}; \alpha) = \frac{\Gamma(|\alpha|)}{\prod_{i=1}^k \Gamma(\alpha_i)}\, p_1^{\alpha_1 - 1} \cdots p_{k-1}^{\alpha_{k-1} - 1} \Big(1 - \sum_{i=1}^{k-1} p_i\Big)^{\alpha_k - 1}.$$

If any $\alpha_i = 0$, set the corresponding $p_i = 0$ and consider the density on the lower dimensional set. Note that if k = 2, one obtains the beta distribution.
5 / 40
Dirichlet Distributions
Examples of Dirichlet distributions over p = (p1, p2, p3), which can be plotted in 2D since p3 = 1 − p1 − p2:

The plot displays:

◦ Left panel: Heat maps of Dirichlet distributions on the ∆2 simplex.


◦ Right panel: Dirichlet densities on the ∆2 simplex with (α1 , α2 , α3 ) equal
to (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4), respectively.

6 / 40
Key properties of the Dirichlet distribution

▶ Construction as normalization of independent gamma r.v.'s (illustrated numerically in the sketch below):

Let $Y_j \overset{\text{ind}}{\sim} \mathrm{Ga}(\alpha_j, 1)$, $\alpha_j > 0$, for $j = 1, \ldots, k$, and define

$$p_j = \frac{Y_j}{\sum_{i=1}^k Y_i},$$

then

$$(p_1, \ldots, p_k) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_k).$$

▶ Additivity:
If $(p_1, \ldots, p_k) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_k)$ and $0 < r_1 < \cdots < r_\ell = k$, then

$$\Big(\sum_{i=1}^{r_1} p_i,\; \ldots,\; \sum_{i=r_{\ell-1}+1}^{r_\ell} p_i\Big) \sim \mathrm{Dir}_\ell\Big(\sum_{i=1}^{r_1} \alpha_i,\; \ldots,\; \sum_{i=r_{\ell-1}+1}^{r_\ell} \alpha_i\Big).$$

This property suggests that it is useful to see the vector of parameters $(\alpha_1, \ldots, \alpha_k)$ as a measure s.t. $\alpha(A) = \sum_{i \in A} \alpha_i$. Note also

$$p_i \sim \mathrm{Beta}\Big(\alpha_i,\; \sum_{j=1}^k \alpha_j - \alpha_i\Big).$$
6 / 40
▶ Conjugacy
Let (Xn )n≥1 be categorical r.v.’s s.t.

P(X1 = j | p1 , . . . , pk ) = pj , j = 1, . . . , k
(p1 , . . . , pk ) ∼ Dir(α1 , . . . , αk ).

Then by Bayes Theorem

(p1 , . . . , pk ) | X1 = j ∼ Dir(α1 , . . . , αj + 1, . . . , αk ).
For $X^{(n)}$ s.t. $n_i$ is the frequency of value i (i = 1, . . . , k) with $\sum_{j=1}^k n_j = n$, we have

$$(p_1, \ldots, p_k) \mid X^{(n)} \sim \mathrm{Dir}(\alpha_1 + n_1, \ldots, \alpha_k + n_k).$$

=⇒ The Dirichlet prior is conjugate w.r.t. the multinomial model, which extends the conjugacy seen in the beta–binomial case.
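A minimal sketch of this conjugate update (NumPy assumed; the prior parameters and counts are arbitrary example values):

import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([1.0, 2.0, 3.0])      # prior Dirichlet parameters (example values)
counts = np.array([4, 0, 6])           # observed frequencies n_1, ..., n_k

alpha_post = alpha + counts            # Dir(alpha_1 + n_1, ..., alpha_k + n_k)
print(alpha_post)                      # [5. 2. 9.]
print(alpha_post / alpha_post.sum())   # posterior mean of (p_1, ..., p_k)
print(rng.dirichlet(alpha_post, size=3))   # a few posterior draws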

7 / 40
▶ Predictive structure

$$P(X_1 = j) = \int_{\Delta_k} P(X_1 = j \mid p_1, \ldots, p_k)\, \mathrm{Dir}(dp_1, \ldots, dp_k \mid \alpha)$$
$$= \int_{\Delta_k} p_j\, \frac{\Gamma(|\alpha|)}{\prod_{i=1}^k \Gamma(\alpha_i)}\, p_1^{\alpha_1 - 1} \cdots p_k^{\alpha_k - 1}\, dp_1 \cdots dp_k$$
$$= \frac{\Gamma(|\alpha|)}{\prod_{i=1}^k \Gamma(\alpha_i)} \int_{\Delta_k} p_1^{\alpha_1 - 1} \cdots p_j^{\alpha_j + 1 - 1} \cdots p_k^{\alpha_k - 1}\, dp_1 \cdots dp_k$$
$$= \frac{\Gamma(|\alpha|)\, \Gamma(\alpha_j + 1)}{\Gamma(\alpha_j)\, \Gamma(|\alpha| + 1)} = \frac{\alpha_j}{|\alpha|}.$$

Similarly for $X_2 \mid X_1$, we obtain

$$P(X_2 = j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{|\alpha| + 1}.$$

Hence a Dirichlet sample (i.e., a categorical–Dirichlet model) is governed by the Pólya urn scheme

$$P(X_1 = j) = \frac{\alpha_j}{|\alpha|}, \qquad P(X_2 = j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{|\alpha| + 1}, \qquad \ldots$$
$$P(X_{n+1} = j \mid X_1, \ldots, X_n) = \frac{\alpha_j + \sum_{i=1}^n \delta_{X_i}(j)}{|\alpha| + n}$$
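A small Monte Carlo sketch of this urn scheme (NumPy assumed; α is an arbitrary example); it checks the predictive probability of the second draw against the formula above:

import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])            # example urn composition

def polya_urn_sample(alpha, n, rng):
    """Draw X_1, ..., X_n from the Dirichlet-categorical Polya urn."""
    counts = np.zeros_like(alpha)
    draws = []
    for _ in range(n):
        probs = (alpha + counts) / (alpha.sum() + counts.sum())
        j = rng.choice(len(alpha), p=probs)
        counts[j] += 1
        draws.append(j)
    return np.array(draws)

# Monte Carlo check of P(X_2 = 0 | X_1 = 0) = (alpha_0 + 1) / (|alpha| + 1)
samples = np.array([polya_urn_sample(alpha, 2, rng) for _ in range(50_000)])
first_is_0 = samples[:, 0] == 0
print((samples[first_is_0, 1] == 0).mean())      # empirical frequency
print((alpha[0] + 1) / (alpha.sum() + 1))        # = 2/7 ≈ 0.286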

8 / 40
Interpretation of the Pólya urn predictives: assume the αi are integers, then we have
• an urn with k colors of unknown proportions (p1 , . . . , pk )
• we put a Dirichlet prior on the proportions
• without observations, we expect that the probability of observing a ball of
colour j is P(X1 = j) = E[pj ] = αj / |α|
• after observing X1, we update our prior knowledge on the urn composition to

$$\mathrm{Dir}\big(\alpha_1 + \delta_{X_1}(1), \ldots, \alpha_k + \delta_{X_1}(k)\big)$$

so we expect that the probability of sampling color j as the second draw is

$$E(p_j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{|\alpha| + 1}$$

and so on.
▶ Exchangeability: By de Finetti's representation Theorem it follows that a sample $X^{(n)}$ drawn from the Dirichlet multinomial model is exchangeable. In fact, denoting the frequencies by $n_1, \ldots, n_k$ s.t. $\sum_{i=1}^k n_i = n$, we have

$$P(X_1 = x_1, \ldots, X_n = x_n) = \frac{(\alpha_1)_{(n_1)} \cdots (\alpha_k)_{(n_k)}}{|\alpha|_{(n)}}$$

with $a_{(n)} = a(a+1)\cdots(a+n-1)$ (ascending factorial or Pochhammer symbol). As expected, it does not depend on the order of appearance of the observations, only on their multiplicities.
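A small check of this formula (standard-library Python plus NumPy; the sequence and parameters are arbitrary examples): multiplying the Pólya urn predictives along a sequence reproduces the closed form, and reversing the sequence gives the same value.

import numpy as np
from math import prod

alpha = np.array([1.0, 2.0, 3.0])            # example Dirichlet parameters

def rising(a, n):
    """Ascending factorial a_(n) = a (a+1) ... (a+n-1)."""
    return prod(a + i for i in range(n))

def joint_prob_sequential(xs, alpha):
    """P(X_1 = x_1, ..., X_n = x_n) via the chain of Polya urn predictives."""
    counts = np.zeros_like(alpha)
    p = 1.0
    for x in xs:
        p *= (alpha[x] + counts[x]) / (alpha.sum() + counts.sum())
        counts[x] += 1
    return p

xs = [0, 2, 2, 1, 0]                         # an arbitrary sequence
n_counts = [xs.count(j) for j in range(len(alpha))]

print(joint_prob_sequential(xs, alpha))      # sequential product of predictives
print(prod(rising(a, n) for a, n in zip(alpha, n_counts)) / rising(alpha.sum(), len(xs)))
print(joint_prob_sequential(xs[::-1], alpha))   # same value: order does not matter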
9 / 40
The Dirichlet process

The Dirichlet process was introduced by Ferguson (1973) as a RPM with Dirichlet
distributed finite–dimensional distributions. See Ghosal (2010) and Ghosal & van
der Vaart (2017) for recent reviews.

Dirichlet process (DP)

Let α be a finite non null measure on (X, X ). A RPM P̃ on (X, X ) is


said to be a Dirichlet process (DP), denoted P̃ ∼ D (α), if for every finite
measurable partition A1 , . . . , Ak of X, we have
$$(\tilde P(A_1), \ldots, \tilde P(A_k)) \sim \mathrm{Dir}(\alpha(A_1), \ldots, \alpha(A_k)).$$

The existence of a DP is shown by a variation of Kolmogorov's consistency conditions combined with an instrumental use of de Finetti's representation Theorem. See Regazzini (1996) for details.
It is often useful to reparametrize the base measure as α = θP0 and correspondingly write P̃ ∼ D(θP0), where
▶ θ := α(X) > 0 is called the total mass of α;
▶ P0(·) = α(·)/α(X) ∈ P is termed the centering distribution (or "prior guess").
The advantage is that we may want to choose P0 and θ separately, rather than the measure α directly, as parameters of the process.
10 / 40
Key properties

1. A priori moments

Prior specification via moments

Let P̃ ∼ D(α). Then for any A, B ∈ X

$$E[\tilde P(A)] = \frac{\alpha(A)}{\alpha(X)} = P_0(A)$$
$$\mathrm{Var}(\tilde P(A)) = \frac{P_0(A)\, P_0(A^c)}{\theta + 1}$$
$$\mathrm{Cov}(\tilde P(A), \tilde P(B)) = \frac{P_0(A \cap B) - P_0(A)\, P_0(B)}{\theta + 1}$$

The proofs are straightforward by noting that marginally, for any A ∈ X, one has $\tilde P(A) \sim \mathrm{Beta}(\alpha(A), \alpha(A^c))$. E.g. $E[\tilde P(A)] = \frac{\alpha(A)}{\alpha(A) + \alpha(A^c)} = \frac{\alpha(A)}{\alpha(X)} = P_0(A)$.

Hence, a natural way to select the parameters of the DP is:


▶ P0 is the prior guess
▶ θ controls the strength of belief in P0 (often also called precision parameter).
If θ ↑ ∞, Var(P̃(A)) ↓ 0 and P̃ degenerates on P0 .

11 / 40
A single realization of a DP

Figures: Probability mass function (left) and cumulative distribution function (right) of a single realization of a DP (solid) with P0 = Beta(5, 10) centering distribution (dotted) and θ = 20.

12 / 40
Realizations of a DP

Figures: Cumulative distribution functions corresponding to 10 realizations (grey step functions) of the DP P̃ ∼ D(θP0) with θ = 10 and centering distributions (black lines) P0 ≡ Beta(1, 1) = Unif(0, 1) (left) and P0 ≡ Beta(5, 10) (right). Recall that E[P̃] = P0.

13 / 40
Figures: Cumulative distribution functions corresponding to 10 realizations (grey step functions) of the DP P̃ ∼ D(θP0) with centering distribution P0 ≡ Beta(5, 5) (black lines) and θ = 1 (left) and θ = 50 (right).

14 / 40
2. Posterior distribution and conjugacy

Posterior distribution

Let $(X_n)_{n \ge 1}$ be an X-valued exchangeable sequence whose de Finetti measure Q is the law of a Dirichlet process, i.e.

$$X_i \mid \tilde P \overset{\text{iid}}{\sim} \tilde P, \qquad \tilde P \sim \mathrm{D}(\theta P_0).$$

Then

$$\tilde P \mid X_1, \ldots, X_n \sim \mathrm{D}\Big(\theta P_0 + \sum_{i=1}^n \delta_{X_i}\Big).$$

In the posterior the parameter measure α is updated by adding point masses at the observed locations. This is consistent with the special case of the beta–binomial model: set A = {Head} and A^c = {Tail}, then

$$\tilde P(A) \mid X^{(n)} \sim \mathrm{Beta}\Big(\underbrace{\theta P_0(A)}_{=\alpha} + \underbrace{\textstyle\sum_{i=1}^n \delta_{X_i}(A)}_{\#\text{ successes } k},\;\; \underbrace{\theta P_0(A^c)}_{=\beta} + \underbrace{\textstyle\sum_{i=1}^n \delta_{X_i}(A^c)}_{\#\text{ failures } n-k}\Big).$$
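A small sketch of this update, representing the posterior mean cdf directly (NumPy assumed; the function name posterior_mean_cdf is illustrative, P0 = Unif(0, 1) and the data are arbitrary choices):

import numpy as np

theta = 10.0                                  # total mass (example)
data = np.array([0.23, 0.41, 0.41, 0.77])     # observed sample (example values)
n = len(data)

def p0_cdf(t):
    """Centering distribution P0 = Unif(0, 1): cdf is t on [0, 1]."""
    return np.clip(t, 0.0, 1.0)

def posterior_mean_cdf(t):
    """E[ P((-inf, t]) | data ] when P | data ~ D(theta*P0 + sum_i delta_{X_i})."""
    empirical = (data[None, :] <= np.atleast_1d(t)[:, None]).mean(axis=1)
    return (theta * p0_cdf(np.atleast_1d(t)) + n * empirical) / (theta + n)

grid = np.linspace(0.0, 1.0, 5)
print(posterior_mean_cdf(grid))               # posterior mean cdf on a small grid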

15 / 40
3. Predictive distributions
Generative construction via the generalized Pólya urn [Blackwell & MacQueen, 1973]
▶ Step 1: for X1 we have

$$P(X_1 \in A) = E[P(X_1 \in A \mid \tilde P)] = E[\tilde P(A)] = P_0(A) \qquad (A \in \mathscr{X})$$

=⇒ the marginal distribution of X1 (often called the prior predictive) is the centering distribution P0.
▶ Step 2: for X2 | X1 we have

$$P(X_2 \in A \mid X_1) = E[\tilde P(A) \mid X_1] = \int_{\mathbb{P}} P(A)\, Q(dP \mid X_1)$$

where $\tilde P(A) \mid X_1 \overset{d}{=} \mathrm{Beta}\big(\alpha(A) + \delta_{X_1}(A),\, \alpha(A^c) + \delta_{X_1}(A^c)\big)$, so that

$$P(X_2 \in A \mid X_1) = \frac{\theta P_0(A) + \delta_{X_1}(A)}{\theta + 1} = \frac{\theta}{\theta+1}\, P_0(A) + \frac{1}{\theta+1}\, \delta_{X_1}(A)$$

▶ Step n+1: iterating, for $X_{n+1} \mid X^{(n)}$ we obtain

$$P(X_{n+1} \in A \mid X^{(n)}) = \frac{\theta}{\theta+n}\, P_0(A) + \frac{n}{\theta+n}\, \frac{1}{n} \sum_{i=1}^n \delta_{X_i}(A)$$

i.e., as seen before, the linear combination of the prior guess and the empirical measure, with weights depending on the total mass parameter θ and the sample size n.
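A minimal sketch of this sequential generative scheme (NumPy assumed; the choices P0 = N(0, 1) and θ = 2 are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def dp_polya_urn(theta, n, p0_sampler, rng):
    """Draw X_1, ..., X_n marginally from a DP(theta * P0) via the generalized Polya urn."""
    xs = []
    for i in range(n):
        # with prob theta/(theta+i) draw a new value from P0, otherwise copy an old one
        if rng.random() < theta / (theta + i):
            xs.append(p0_sampler(rng))
        else:
            xs.append(xs[rng.integers(i)])
    return np.array(xs)

x = dp_polya_urn(theta=2.0, n=20, p0_sampler=lambda r: r.standard_normal(), rng=rng)
print(np.round(x, 3))
print("distinct values K_n =", len(np.unique(x)))   # ties appear with positive probability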

16 / 40
Example: Predictive distribution


The plot displays:


◦ Cdf of the prior guess P ∗ ≡ Beta(2, 2) (dotted line).
◦ Cdf of the predictive distribution (solid line) for a DP with θ = 10 and 5 observed
data points (rug).
Note that, since the predictive is a linear combination of a continuous cdf (F*) and the empirical cdf, the resulting cdf is piecewise differentiable, with jumps occurring at the observed locations; the jump at each location equals its multiplicity divided by θ + n. In the picture all five observed locations are distinct, so each jump has size 1/(θ + n) = 1/15.
17 / 40
Equivalently, one can state the predictive distribution as

$$X_{n+1} = \begin{cases} Z \sim P_0 & \text{with probability } \theta/(\theta+n) \\ X_1 & \text{with probability } 1/(\theta+n) \\ \;\;\vdots & \\ X_n & \text{with probability } 1/(\theta+n) \end{cases}$$

Remarks:
▶ Interpretation:
Let P0 be diffuse, i.e. P0({x}) = 0 for all x ∈ X, so it has no atoms. Then
• n/(θ + n) is the probability that Xn+1 is an "old" value, i.e. already observed in X1, . . . , Xn;
• θ/(θ + n) is the probability that Xn+1 is a "new" value, not previously observed.
▶ Asymptotics:
• As n increases, data provide more and more information and the weight
associated to the prior guess vanishes =⇒ data swamp the prior
• Frequentist evaluation: suppose the data are iid from a "true unknown" P0; then, as n → ∞, $\frac{1}{n}\sum_{i=1}^n \delta_{X_i} \overset{w}{\to} P_0$ and $E[\tilde P(\cdot) \mid X^{(n)}] \to P_0$. One can show that $\mathrm{Var}[\tilde P(\cdot) \mid X^{(n)}] \to 0$ as n → ∞ and, hence, by Chebyshev's inequality, we conclude that, for any weak neighbourhood of P0, denoted by W(P0), we have $Q(W(P_0) \mid X^{(n)}) \to 1$.
=⇒ the DP is weakly consistent at P0 .
18 / 40
4. Support
We have said that it is desirable for a nonparametric prior Q to have large or
possibly full support.
Take home message: the DP prior has full support i.e. weak neighbourhoods of any
distribution P ∗ have positive probability.
A little bit more precisely:
Define for any P ∈ P,

$$\mathrm{supp}(P) = \bigcap \{A : A \text{ closed and } P(A^c) = 0\}.$$

Then x ∈ supp(P) iff P(A) > 0 for every open set A which contains x.
Ferguson (1973, 1974) showed that for any θ > 0

$$\mathrm{supp}\big(\mathrm{D}_{\theta P_0}\big) = \big\{ P \in \mathbb{P} : \mathrm{supp}(P) \subset \mathrm{supp}(P_0) \big\}$$

Hence the DP prior has full support. This implies that, given P* whose support is contained in that of P0,

$$\mathrm{D}_{\theta P_0}\big(A(P^*)\big) > 0$$

i.e. every weakly open subset of distributions that contains P* (whose support is contained in that of P0) has positive mass.

19 / 40
5. DP is a discrete nonparametric prior
Blackwell (1973) showed that, even though the DP prior has full support, the
realizations of P̃ are discrete distributions (almost surely). This is true even when
the base measure P0 of the DP is continuous (e.g. P0 = N(0, 1)).
Typical realizations of a DP are as follows:

Figure: Top: realization of a DP P̃ ∼ D(θP0) (solid black) with θ = 5 and P0 = Ga(20, 1) (dotted line). Bottom: realizations of DPs seen as cdf's (grey step functions) for θ = 1, 5, 20 (left to right) against the cdf of P0 (solid line).
20 / 40
6. Distinct values and induced partition
The discreteness of the DP implies that we observe ties with positive probability, i.e. P(Xn+1 = Xi) > 0 for all i = 1, . . . , n, which is apparent from the prediction rule.
This induces a random partition of the integers {1, . . . , n}, determined by grouping the observations with the same value. E.g. if

X1 = X3 = X4 , X2 = X5

the induced partition of {1, . . . , 5} is

{1, 3, 4}, {2, 5}.

We will denote by $X_1^*, \ldots, X_{K_n}^*$ the distinct values in $X_1, \ldots, X_n$ in order of appearance, where $X_i^*$ is the common value for observations in group i, which has multiplicity $n_i$, i.e.

$$X_{r_1} = \cdots = X_{r_{n_i}} = X_i^*$$

and $(r_1, \ldots, r_{n_i}) \subset \{1, \ldots, n\}$ identify those observations.
To reflect this, the predictive distribution can be rewritten as

$$P(X_{n+1} \in A \mid X^{(n)}) = \frac{\theta}{\theta+n}\, P_0(A) + \frac{n}{\theta+n}\, \frac{1}{n} \sum_{i=1}^{K_n} n_i\, \delta_{X_i^*}(A)$$

from which $P(X_{n+1} = X_i^* \mid X^{(n)}) = n_i/(\theta + n)$.

21 / 40
As seen, the sampling scheme for the observations is termed the generalized Pólya urn scheme. The induced sampling scheme on the partition process is called the Chinese restaurant process (see e.g. Pitman, 2006):
▶ The 1st customer arrives at the restaurant and sits at Table 1.
▶ The 2nd customer arrives at the restaurant and sits at Table 1 with probability 1/(θ + 1) or at a new table with probability θ/(θ + 1).
▶ The 3rd customer arrives at the restaurant: (a) if the first 2 customers sit at the same table, then she sits at that table with probability 2/(θ + 2) and at a new table with probability θ/(θ + 2); (b) if the first 2 customers sit at different tables, then she sits at each of these with probability 1/(θ + 2) and at a new one with probability θ/(θ + 2), and so on.
▶ The allocation of customers to the tables determines a random partition of the observations into clusters.

For these reasons (and for the connection with mixture models) Kn is also referred to as the number of clusters. Note that the probability of allocating a new observation to an existing cluster is proportional to the cluster frequency.
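A compact sketch of the Chinese restaurant process as a partition sampler (NumPy assumed; θ and n are arbitrary example values):

import numpy as np

rng = np.random.default_rng(0)

def chinese_restaurant_process(theta, n, rng):
    """Return table assignments (a partition of {0, ..., n-1}) under CRP(theta)."""
    tables = []            # tables[c] = number of customers at table c
    assignment = []
    for _ in range(n):
        # seat at an occupied table with prob proportional to its size,
        # or at a new table with prob proportional to theta
        probs = np.array(tables + [theta], dtype=float)
        c = rng.choice(len(probs), p=probs / probs.sum())
        if c == len(tables):
            tables.append(1)
        else:
            tables[c] += 1
        assignment.append(c)
    return assignment, tables

assignment, tables = chinese_restaurant_process(theta=5.0, n=30, rng=rng)
print(assignment)                                   # cluster label of each customer
print("K_n =", len(tables), "cluster sizes:", tables)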

22 / 40
Figure: Different clustering structures induced by the total mass parameter θ equal to 1, 5, and 20 from left to right.

23 / 40
Let us look at the behaviour of $K_n$. Consider a DP with non-atomic base measure P* and θ > 0 and define the r.v.'s

$$D_i = \begin{cases} 1 & \text{if } X_i \notin \{X_1, \ldots, X_{i-1}\} \quad (X_i \text{ new}) \\ 0 & \text{else} \quad (X_i \text{ old}) \end{cases}$$

which are independent and $D_i \sim \mathrm{Bern}\big(\tfrac{\theta}{\theta + i - 1}\big)$. Hence, we have

$$E[K_n] = E\Big[\sum_{i=1}^n D_i\Big] = \sum_{i=1}^n \frac{\theta}{\theta + i - 1}$$

Moreover, it can be shown that (see e.g. Pitman, 2006):

▶ Recall $(\theta)_{(n)} := \theta(\theta+1)\cdots(\theta + n - 1)$ and denote by |s(n, k)| the signless Stirling number of the first kind¹; then

$$P[K_n = k] = |s(n, k)|\, \frac{\theta^k}{(\theta)_{(n)}} \qquad \text{[Antoniak, 1974]}$$

▶ $K_n / \log n \to \theta$ a.s. [Korwar & Hollander, 1973]
▶ $E[K_n] \approx \mathrm{Var}[K_n] \approx \theta \log n$
▶ $(K_n - \theta \log n)/\sqrt{\theta \log n} \overset{d}{\to} N(0, 1)$

Remark: No parameter really controls the growth rate of $K_n$. Much recent research has concentrated on constructing alternative P̃'s with a more flexible growth rate of $K_n$.

¹ $|s(n, k)| = |s(n-1, k-1)| + (n-1)\,|s(n-1, k)|$ with |s(0, 0)| = 1, |s(n, 0)| = 0, and |s(n, k)| = 0 if n < k.
24 / 40
Growth rate of the DP

500 trajectories of $K_n$ for n = 1, . . . , 1000 and θ = 20, together with $E[K_n]$ ± two standard deviations as given above.
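A sketch reproducing this kind of experiment (NumPy assumed); only the representation of K_n as a sum of independent Bernoulli D_i's is used, so no atoms need to be simulated:

import numpy as np

rng = np.random.default_rng(0)
theta, n, n_paths = 20.0, 1000, 500

# D_i ~ Bern(theta / (theta + i - 1)), independent; K_n is their cumulative sum
i = np.arange(1, n + 1)
p_new = theta / (theta + i - 1)
D = rng.random((n_paths, n)) < p_new          # broadcast over the 500 paths
K = D.cumsum(axis=1)

print(K[:, -1].mean(), p_new.sum())           # empirical vs exact E[K_n]
print(K[:, -1].var(), theta * np.log(n))      # Var[K_n] vs the rough theta*log(n) approximation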
25 / 40
7. Exchangeable partition probability function
Any sample $X^{(n)}$ from a discrete RPM P̃ induces an exchangeable random partition $\Pi_n$, i.e. a random partition whose law is invariant under permutation of its elements and hence is a symmetric function of the group sizes only. For a partition $A_1, \ldots, A_k$ of {1, . . . , n}, this reads

$$P(\Pi_n = \{A_1, \ldots, A_k\}) = p(|A_1|, \ldots, |A_k|), \qquad |A_i| = \mathrm{card}(A_i).$$

This clearly holds for every n and hence any P̃ induces an infinite exchangeable partition $(\Pi_n)_{n \ge 1}$.
The distribution of a specific DP sample featuring $K_n = k$ distinct observations with frequencies $n_1, \ldots, n_k$ is given by

$$p_k^{(n)}(n_1, \ldots, n_k) = \frac{\theta^k}{(\theta)_{(n)}} \prod_{i=1}^k (n_i - 1)!$$

This distribution is called the Exchangeable Partition Probability Function (EPPF) [Pitman, 1995] and characterizes the partition distribution. See e.g. Pitman (2006).
One can also encode the partition in terms of counts or multiplicities, i.e. $m_1$ groups of size 1, $m_2$ groups of size 2, and so on, with $\sum_{i=1}^n m_i = K_n$ and $\sum_{i=1}^n i\, m_i = n$.
For the DP this leads to Ewens' sampling formula [Ewens, 1972], a cornerstone of Population Genetics, which is given by

$$\mathrm{ESF}_\theta(m_1, \ldots, m_n) = \frac{n!}{(\theta)_{(n)}} \prod_{i=1}^n \frac{\theta^{m_i}}{m_i!\; i^{m_i}}.$$

See Crane (2016) for a recent review on the Ewens sampling formula.
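A small sketch computing the EPPF and the Ewens sampling formula for the partition {1, 3, 4}, {2, 5} used earlier (standard-library Python; θ = 2 is an arbitrary choice). As a consistency check, the ESF should equal the EPPF times the number of partitions of {1, . . . , 5} with the same block sizes, here 5!/(3!·2!) = 10:

from math import factorial, prod

def rising(a, n):
    """Ascending factorial a_(n)."""
    return prod(a + i for i in range(n))

def dp_eppf(theta, group_sizes):
    """EPPF of the DP: probability of a given partition with these block sizes."""
    n, k = sum(group_sizes), len(group_sizes)
    return theta**k / rising(theta, n) * prod(factorial(ni - 1) for ni in group_sizes)

def ewens_sf(theta, m):
    """Ewens sampling formula; m[i-1] = number of blocks of size i, i = 1..n."""
    n = sum(i * mi for i, mi in enumerate(m, start=1))
    return (factorial(n) / rising(theta, n)
            * prod(theta**mi / (factorial(mi) * i**mi) for i, mi in enumerate(m, start=1)))

theta = 2.0
print(dp_eppf(theta, [3, 2]))                       # partition {1,3,4}, {2,5}: sizes (3, 2)
print(ewens_sf(theta, [0, 1, 1, 0, 0]))             # multiplicities: one block of size 2, one of size 3
print(ewens_sf(theta, [0, 1, 1, 0, 0]) / dp_eppf(theta, [3, 2]))   # ratio should be 10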
26 / 40
Constructions of the DP
1. DP via finite-dimensional distributions
This corresponds to the original construction of Ferguson (1973), which we took as the definition of the DP.
2. DP via predictive distributions (or generalized Pólya urn)
We saw that a DP leads to predictive distributions of the form

$$P[X_{n+1} \in \cdot \mid X^{(n)}] = \frac{\theta}{\theta+n}\, P^*(\cdot) + \frac{1}{\theta+n} \sum_{i=1}^n \delta_{X_i}(\cdot)$$

and already stated the converse, i.e. that predictive distributions of this form imply a DP prior as de Finetti measure. This can be proved by noting that:
▶ The distribution of $X^{(n)}$ is recovered from the predictives as

$$P(X_1 \in dx_1, \ldots, X_n \in dx_n) = P^*(dx_1) \prod_{i=2}^n \frac{\theta P^*(dx_i) + \sum_{j<i} \delta_{X_j}(dx_i)}{\theta + i - 1}$$

and shown to be exchangeable, similarly to the finite Dirichlet case with $K_n$ in place of k.
▶ From de Finetti's Theorem, there exists a RPM P̃ such that $X_i \mid \tilde P \overset{\text{iid}}{\sim} \tilde P$.
▶ Since we have seen that the DP generates this predictive sequence, the uniqueness of the de Finetti measure implies that P̃ is a DP.
27 / 40
3. DP via stick–breaking [Sethuraman, 1994]
▶ Consider a sequence of i.i.d. r.v.'s $V_i \sim \mathrm{Beta}(1, \theta)$ with θ > 0. To construct the random probability weights of a discrete RPM, break a unit-length stick sequentially according to the $V_i$'s: the first piece has length $V_1$, the second piece is a fraction $V_2$ of the remaining length $(1 - V_1)$, and so on. This yields

$$\tilde p_1 = V_1, \qquad \tilde p_2 = V_2(1 - V_1), \qquad \ldots \qquad \Longrightarrow \qquad \tilde p_i = V_i \prod_{j=1}^{i-1} (1 - V_j), \quad i = 1, 2, \ldots$$

▶ Sample i.i.d. locations $(Z_i)_{i \ge 1}$ from P* on X.
▶ Define a RPM as $\tilde P = \sum_{i \ge 1} \tilde p_i\, \delta_{Z_i}$, which is shown to coincide with the DP.

Remarks:
▶ By this construction it is apparent that the DP is a discrete RPM.
▶ Models based on the stick-breaking strategy are sometimes referred to as
residual allocation models.
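A truncated stick-breaking sketch (NumPy assumed; the truncation level and the Beta(5, 10) centering distribution are arbitrary choices, so the result only approximates a DP draw):

import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(theta, p0_sampler, n_atoms, rng):
    """Approximate draw from a DP(theta * P0), truncated at n_atoms atoms."""
    V = rng.beta(1.0, theta, size=n_atoms)
    # p_i = V_i * prod_{j<i} (1 - V_j); the last weight absorbs the leftover mass
    weights = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    weights[-1] += 1.0 - weights.sum()
    atoms = p0_sampler(n_atoms, rng)
    return atoms, weights

atoms, weights = dp_stick_breaking(
    theta=20.0, p0_sampler=lambda n, r: r.beta(5, 10, size=n), n_atoms=500, rng=rng)

# Mean of the realized random measure: roughly the P0 mean 5/15, since E[P~] = P0
print((weights * atoms).sum())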
28 / 40
4. DP as transformation of a process with independent increments

Process with independent increments

A stochastic process {Xt : t ≥ 0} on X = R+ is said to be a process with


independent increments if X0 = 0 a.s. and for any n ≥ 1 and 0 ≤ t0 < t1 <
. . . < tn , the r.v.
Xt0 , Xt1 − Xt0 , . . . , Xtn − Xtn−1
are mutually independent.

In fact, two more constructions of the Dirichlet process [Ferguson, 1973 & 1974;
Doksum, 1974] are based on processes with independent increments:
▶ Normalization: Let {ξt : t ≥ 0} be a gamma process, i.e. a process with independent increments s.t. $\xi_t \sim \mathrm{Ga}(\alpha((0, t]), 1)$, with α a finite measure on R₊. Then

$$\tilde F(t) = \frac{\xi_t}{\lim_{t \to \infty} \xi_t}$$

is the random cdf associated to a DP on R₊.
▶ Neutrality to the right: Let {ζt : t ≥ 0} be a "suitable" process with independent increments. Then

$$\tilde F(t) = 1 - e^{-\zeta_t}$$

is the random cdf associated to a Dirichlet process on R₊.


29 / 40
A simulated trajectory of a gamma process
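A sketch of how such a trajectory can be simulated on a grid using independent gamma increments (NumPy assumed; the grid, θ = 20 and the base measure α = θ · Unif(0, 1) are arbitrary choices). Normalizing by the final value gives an approximate DP cdf, as in the normalization construction above:

import numpy as np

rng = np.random.default_rng(0)
theta = 20.0
grid = np.linspace(0.0, 1.0, 501)                   # time grid on [0, 1]

# alpha = theta * Unif(0, 1): the increment over (t, t+dt] has shape theta * dt
shapes = theta * np.diff(grid)
increments = rng.gamma(shape=shapes, scale=1.0)
xi = np.concatenate(([0.0], increments.cumsum()))   # gamma process trajectory on the grid

F = xi / xi[-1]                                     # normalized: approximate DP cdf on [0, 1]
print(F[::100])                                     # a few values of the random cdf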

30 / 40
Mixtures of Dirichlet processes (MDP)

In order to reflect uncertainty about our prior knowledge encoded in the precision
parameter θ and centering distribution P ∗ or simply to increase flexibility, one often
introduces a further hierarchy on α = θP ∗ by putting a prior also on θ and/or on
P ∗ . This leads to mixtures of Dirichlet processes (MDP) (Antoniak, 1974):

$$X_i \mid \tilde P, \tilde Z \overset{\text{iid}}{\sim} \tilde P, \qquad \tilde P \mid \tilde Z \sim \mathrm{D}(\alpha_{\tilde Z}) \;\;\text{(prior)}, \qquad \tilde Z \sim \pi \;\;\text{(hyperprior)}$$

where the parameter measure αZ is now indexed by Z , which is a r.v. with


distribution π.
According to the support of π one obtains a finite, countable or uncountable mixture of DPs. Typical examples of MDPs are

$$\tilde P \mid \tilde\theta \sim \mathrm{D}(\tilde\theta P^*), \qquad \tilde\theta \sim \mathrm{Ga}(a, b)$$
$$\tilde P \mid (\tilde\mu, \tilde\sigma) \sim \mathrm{D}\big(\theta\, N(\tilde\mu, \tilde\sigma^2)\big), \qquad \tilde\mu \sim N(m, s^2), \quad \tilde\sigma \sim \text{Inv-Ga}(\alpha, \beta)$$

where in the latter $\tilde\sigma^{-1} \sim \mathrm{Ga}(\alpha, \beta^{-1})$.


Note that realizations are still discrete, which should be intuitive from the hierarchy.
First we sample Z , and given Z we sample a DP, which is discrete a.s.

31 / 40
The plot displays:

◦ Left panel: Realization of a DP P̃ ∼ D(θ · Ga(20, 1)) together with the pdf of its base measure.
◦ Right panel: Realization of an MDP P̃ | Z̃ ∼ D(θ · Ga(Z̃, 1)) with Z̃ ∈ {3, 10, 25} with probabilities 0.1, 0.6, 0.3 respectively, together with the pdf of its mixture base measure.

32 / 40
Properties of MDPs

▶ Prior mean:
Since $E[\tilde P(A) \mid \tilde Z] = \frac{\alpha_{\tilde Z}(A)}{\alpha_{\tilde Z}(X)} = P^*_{\tilde Z}(A)$, we have $E[\tilde P(A)] = \int_Z P^*_z(A)\, \pi(dz)$.
▶ Posterior distribution:
By conditioning on $\tilde Z$, we have

$$\tilde P \mid \tilde Z, X^{(n)} \sim \mathrm{D}\Big(\alpha_{\tilde Z} + \sum_{i=1}^n \delta_{X_i}\Big)$$

and the posterior becomes

$$\tilde P \mid X^{(n)} \sim \int_Z \mathrm{D}\Big(\alpha_z + \sum_{i=1}^n \delta_{X_i}\Big)\, \pi(dz \mid X^{(n)}).$$

▶ Posterior mean:
The posterior mean is given by

$$E[\tilde P(A) \mid X^{(n)}] = \int_Z \frac{\alpha_z(A) + \sum_{i=1}^n \delta_{X_i}(A)}{\alpha_z(X) + n}\, \pi(dz \mid X^{(n)}).$$

If $\theta := \alpha_{\tilde Z}(X)$ does not depend on $\tilde Z$, it becomes

$$\frac{\theta}{\theta+n} \int_Z P^*_z(A)\, \pi(dz \mid X^{(n)}) + \frac{n}{\theta+n}\, \frac{1}{n} \sum_{i=1}^n \delta_{X_i}(A)$$

where, unlike for the DP, the centering measure is also updated.
33 / 40
Dirichlet process mixtures
Although discrete distributions can approximate (weakly) any continuous
distribution, both the DP and MDP are inappropriate when the data come from a
continuous distribution.
Dirichlet process mixtures (DPM) were introduced by Lo (1984) as a model for
density estimation (an analog to the popular frequentist kernel density estimation).
DPMs differ substantially from MDPs in that they are mixtures w.r.t. the Dirichlet
process.
Consider a kernel f(y | x) on Y indexed by a parameter x ∈ X, such that for any x, f(y | x) is a density². Then define the mixture of kernels

$$\tilde f(y) = \int_X f(y \mid x)\, \tilde P(dx), \qquad \tilde P \sim \mathrm{D}(\alpha),$$

=⇒ the Y's are the data and the X's are latent variables which parametrize each kernel.
Since the DP is a discrete RPM, the model reduces to a countable mixture

$$\tilde f(y) = \int_X f(y \mid x) \sum_{i \ge 1} \tilde p_i\, \delta_{X_i}(dx) = \sum_{i \ge 1} \tilde p_i \int_X f(y \mid x)\, \delta_{X_i}(dx) = \sum_{i \ge 1} \tilde p_i\, f(y \mid X_i).$$

² A typical choice corresponds to f a Gaussian density on Y = R with parameter x = (µ, σ) ∈ X = R × R₊.
34 / 40
A DPM can be equivalently formulated in hierarchical form

$$Y_i \mid X_i, \tilde P \overset{\text{ind}}{\sim} f(\,\cdot \mid X_i), \quad i = 1, \ldots, n \qquad \text{[level of the observables]}$$
$$X_i \mid \tilde P \overset{\text{iid}}{\sim} \tilde P, \quad i = 1, \ldots, n \qquad \text{[latent level]}$$
$$\tilde P \sim \mathrm{D}(\alpha)$$

The discrete or continuous nature of the kernel f (y | x ) determines the selected


distributions, together with their support. For instance:
▶ If f (y | x ) ≡ Po(x ), f˜ is a DP mixture of Poisson distributions, selecting
discrete distributions on Z+ .
▶ If x = (µ, σ) ∈ R × R+ and f (y | x ) ≡ N(µ, σ 2 ), f˜ is a location-scale mixture
of Gaussians on R.

An important property of the DPM model is the large support, relative to the choice
of kernel. For example:
▶ A DP location-scale mixture of normals induces a prior with full support on
the space of absolutely continuous distributions on R.
▶ A DP mixture of Unif(−x, x) distributions has full support on the space of unimodal symmetric densities.
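A sketch that simulates data from one draw of a Gaussian (location) DPM, combining the truncated stick-breaking and hierarchical representations above (NumPy assumed; all hyperparameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
theta, n_atoms, n_obs = 2.0, 200, 100

# Truncated stick-breaking draw of P~ with P0 = N(0, 5^2) on the location parameter
V = rng.beta(1.0, theta, size=n_atoms)
w = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
w /= w.sum()                                   # renormalize the truncated weights
locs = rng.normal(0.0, 5.0, size=n_atoms)      # atoms X_i ~ P0

# Hierarchical sampling: latent X_i ~ P~, then Y_i | X_i ~ N(X_i, 1)
latent = rng.choice(locs, size=n_obs, p=w)
y = rng.normal(latent, 1.0)

print("distinct latent values (clusters):", len(np.unique(latent)))
print(np.round(y[:10], 2))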

35 / 40
Example: Realizations of a DPM


Realizations of a DPM with Gaussian (left) and Poisson (right) kernels.


◦ Left panel: Draw from a location Gaussian DPM (solid line); mixture components
N(xi∗ , 1) (colored dashed) with means xi∗ ’s resulting from the DP sample (vertical
black solid with height representing their weight). Observations simulated from the
mixture are represented as rugs, colored according to the generating kernel (colored
dashed).
◦ Right panel: Draw from a Poisson DPM with analogous interpretation of the various
lines.
36 / 40
DPMs are the most popular BNP model, for two main reasons:
▶ Flexible density estimation:
Parametric models imply severe constraints on the considered distributions (e.g. unimodality, tail behaviour, overdispersion). The DPM avoids such constraints and flexibly adapts to any data-generating distribution.
▶ Model-based clustering:
Given that P̃ is discrete, $K_n \le n$ distinct latent variables are generated. Here, $K_n$ represents the number of mixture components and the n observations are clustered into $K_n$ groups, in the sense that two observations belong to the same cluster if they come from the same mixture component $f(y \mid X_i^*)$, i = 1, . . . , Kn.

The DPM posterior is not available in closed form. Hence, one relies on Markov chain Monte Carlo (MCMC) techniques, in particular Gibbs sampling schemes, which are standard apart from having to deal with P̃ (infinite-dimensional!). There are two classes of algorithms according to the way P̃ is handled:
▶ Marginal algorithms: "P̃ is integrated out", leading one to work with the predictive distributions (or the Blackwell–MacQueen Pólya urn scheme) or the EPPF [Escobar & West, 1995; MacEachern & Müller, 1998].
▶ Conditional algorithms: Simulating P̃ is part of the algorithm and there are several devices to do it. The simplest is truncation: take N large enough and simulate $\sum_{i=1}^N \tilde p_i\, \delta_{X_i} \approx \tilde P$³. For DPMs there are algorithms which allow exact simulation even without sampling an infinite sequence, known as retrospective samplers [Papaspiliopoulos & Roberts, 2008] and slice samplers [Walker, 2007].

³ There are strategies to select N so as to control the approximation error.
37 / 40
Simulation experiment: DPM density estimate


The plot displays:


◦ True data generating density (dashed) given by a mixture of 3 Gaussians.
◦ Posterior density estimate obtained with a Gaussian DPM based on 500 data points (rug, colored according to the allocated kernel).
The inference was performed using the popular R package DPpackage of Jara et al. (2011)
with the third default prior specification in the package.
38 / 40
Some References

• Antoniak (1974). Mixtures of Dirichlet processes with applications to Bayesian


nonparametric problems. Ann. Statist. 2, 1152–1174.
• Blackwell (1973). Discreteness of Ferguson Selections. Ann. Statist. 1, 356–8.
• Blackwell & MacQueen (1973). Ferguson distributions via Pólya urn schemes.
Ann. Statist. 1, 353–5.
• Crane (2016). The ubiquitous Ewens sampling formula. Statist. Science 31, 1–19.
• Doksum (1974). Tailfree and neutral random probabilities and their posterior
distributions. Ann. Probab. 2, 183–201.
• Escobar & West (1995). Bayesian density estimation and inference using
mixtures. J. Amer. Stat. Assoc. 90, 577–88.
• Ewens (1972). The sampling theory of selectively neutral alleles. Theor. Pop.
Biol. 3, 87–112.
• Ferguson (1973). A Bayesian analysis of some nonparametric problems. Ann.
Statist. 1, 209-30.
• Ferguson (1974). Prior distributions on spaces of probability measures. Ann.
Statist. 2, 615-29.
• Ghosal (2010). The Dirichlet process, related priors and posterior asymptotics. In
Bayesian Nonparametrics (Hjort, Holmes, Müller & Walker, eds.). Cambridge Univ.
Press, Cambridge.
• Ghosal & van der Vaart (2017). Fundamentals of Bayesian nonparametric
inference. Cambridge University Press.
• Ghosh & Ramamoorthi (2003). Bayesian nonparametrics. Springer, New York.

39 / 40
• Hjort, Holmes, Müller & Walker (Eds.) (2010). Bayesian Nonparametrics.
Cambridge: Cambridge Univ. Press.
• Jara, Hanson, Quintana, Müller and Rosner (2011). DPpackage: Bayesian semi-
and nonparametric modeling in R. J. Statistical Software 40, 1-30.
• Korwar & Hollander (1973). Contribution to the theory of Dirichlet processes.
Ann. Probab. 1, 705-11.
• Lo (1984). On a class of Bayesian nonparametric estimates. Ann. Statist. 12,
351-57.
• Lo (1991). A characterization of the Dirichlet process. Statist. Probab. Lett. 12,
185–187.
• MacEachern & Müller (1998). Estimating mixture of Dirichlet process models. J.
Comput. Graph. Statist. 7, 223– 238.
• Papaspiliopoulos & Roberts (2008). Retrospective MCMC for Dirichlet process
hierarchical models. Biometrika 95, 169–186.
• Pitman (1995). Exchangeable and partially exchangeable random partitions.
Prob. Th. and Rel. Fields 102, 145–158.
• Pitman (2006). Combinatorial Stochastic Processes. Lecture Notes in Math.,
vol.1875, Springer, Berlin.
• Regazzini (1978). Intorno ad alcune questioni relative alla definizione del premio
secondo la teoria della credibilità. Giorn. Istit. Ital. Attuari, 41, 77–89.
• Regazzini (1996). Impostazione nonparametrica di problemi d’inferenza bayesiana.
Imati Tech. Report 96-21, http://web.mi.imati.cnr.it/iami/abstracts/96-21.html
• Sethuraman (1994). A constructive definition of Dirichlet priors. Statist. Sinica 4, 639–650.
• Walker (2007). Sampling the Dirichlet mixture model with slices. Comm. Statist. Sim. Comput. 36, 45–54.
40 / 40
