2 DP Handout
Igor Prünster
Bocconi University
Discrete nonparametric priors
If the prior Q selects discrete distributions almost surely, then any (exchangeable) vector X^(n) := (X_1, . . . , X_n) generated by Q will exhibit ties with positive probability, i.e. feature K_n distinct observations
$$X_1^*, \dots, X_{K_n}^*$$
with frequencies N_1, . . . , N_{K_n} such that $\sum_{i=1}^{K_n} N_i = n$.
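To make the tie mechanism concrete, here is a minimal Python sketch (atoms, weights and n are illustrative choices, not from the handout) that samples from a fixed discrete distribution and reports K_n and the frequencies:

```python
# A minimal sketch: sampling n draws from a fixed discrete distribution to
# illustrate ties, the number K_n of distinct values and the frequencies
# N_1, ..., N_{K_n}. All numerical values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
atoms = np.array([-1.3, 0.2, 1.7, 3.1])        # support points of a discrete P
weights = np.array([0.4, 0.3, 0.2, 0.1])       # probability masses

n = 10
sample = rng.choice(atoms, size=n, p=weights)  # X_1, ..., X_n (ties occur w.p. > 0)
distinct, freqs = np.unique(sample, return_counts=True)
print("K_n =", len(distinct))                  # number of distinct observations
print("frequencies:", freqs, "sum =", freqs.sum())  # N_1 + ... + N_{K_n} = n
```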
Dirichlet process from a predictive point of view
$$\mathbb{P}[X_{n+1} \in \cdot \mid X^{(n)}] = \underbrace{\frac{\theta}{\theta+n}\, P^*(\cdot)}_{\text{prior guess}} \; + \; \underbrace{\frac{n}{\theta+n}\, \frac{1}{n}\sum_{i=1}^{K_n} N_i\, \delta_{X_i^*}(\cdot)}_{\text{empirical measure}}$$
where $\theta/(\theta+n) = \mathbb{P}[X_{n+1} = \text{``new''} \mid X^{(n)}]$ and $n/(\theta+n) = \mathbb{P}[X_{n+1} = \text{``old''} \mid X^{(n)}]$.
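As an illustration, the following sketch evaluates this predictive cdf as the stated convex combination; the prior guess P* = N(0, 1), θ = 2 and the toy sample are assumed values for illustration:

```python
# Hedged sketch: the DP predictive P[X_{n+1} <= t | X^(n)] as a convex
# combination of the prior guess P* (here N(0,1)) and the empirical measure.
import numpy as np
from scipy.stats import norm

theta = 2.0
x_obs = np.array([0.5, 0.5, 1.2])   # observed sample (note the tie)
n = len(x_obs)

def predictive_cdf(t):
    # theta/(theta+n) * P*((-inf, t]) + n/(theta+n) * empirical cdf at t
    prior_part = theta / (theta + n) * norm.cdf(t)
    empirical_part = n / (theta + n) * np.mean(x_obs <= t)
    return prior_part + empirical_part

print(predictive_cdf(1.0))
```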
The Dirichlet distribution
The original definition of the Dirichlet process [Ferguson, 1973] was in terms of a
consistent family of finite–dimensional Dirichlet distributions.
Dirichlet distribution
$(p_1, \dots, p_K) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$, with $\alpha_i > 0$, if the vector has density on the simplex $\Delta_K$ given by
$$\frac{\Gamma(|\alpha|)}{\prod_{i=1}^K \Gamma(\alpha_i)}\; p_1^{\alpha_1 - 1} \cdots p_K^{\alpha_K - 1}, \qquad |\alpha| := \sum_{i=1}^K \alpha_i.$$
If any α_i = 0, set the corresponding p_i = 0 and consider the density on the lower dimensional set. Note that if K = 2, one obtains the beta distribution.
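A quick numerical check of the K = 2 special case using numpy's Dirichlet sampler (parameter values are illustrative):

```python
# Sketch: draws from Dir(alpha); for K = 2 the first coordinate is
# Beta(alpha_1, alpha_2), which we check via the simulated mean.
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0])
p = rng.dirichlet(alpha, size=100_000)

# first coordinate should match Beta(2, 3): mean 2/5
print(p[:, 0].mean(), alpha[0] / alpha.sum())
```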
Dirichlet Distributions
Examples of Dirichlet distributions over p = (p_1, p_2, p_3), which can be plotted in 2D since p_3 = 1 − p_1 − p_2:
Key properties of the Dirichlet distribution
▶ Gamma representation:
If $Y_1, \dots, Y_k$ are independent with $Y_j \sim \mathrm{Ga}(\alpha_j, 1)$ and
$$p_j = \frac{Y_j}{\sum_{i=1}^k Y_i},$$
then
$$(p_1, \dots, p_k) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_k)$$
▶ Additivity:
If $(p_1, \dots, p_k) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_k)$ and $0 < r_1 < \cdots < r_\ell = k$, then
$$\Big(\sum_{i=1}^{r_1} p_i, \; \dots, \; \sum_{i=r_{\ell-1}+1}^{r_\ell} p_i\Big) \sim \mathrm{Dir}_\ell\Big(\sum_{i=1}^{r_1} \alpha_i, \; \dots, \; \sum_{i=r_{\ell-1}+1}^{r_\ell} \alpha_i\Big)$$
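Both properties are easy to check by simulation; the sketch below (with illustrative parameter values) compares first moments under the gamma representation and under grouping:

```python
# Simulation sketch of two properties: normalized independent Ga(alpha_j, 1)
# variables are Dir(alpha), and grouped sums of a Dirichlet vector are again
# Dirichlet with summed parameters (checked via first moments).
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([1.0, 2.0, 3.0, 4.0])

# gamma representation: p_j = Y_j / sum_i Y_i with Y_j ~ Ga(alpha_j, 1)
Y = rng.gamma(shape=alpha, size=(100_000, len(alpha)))
p = Y / Y.sum(axis=1, keepdims=True)
print(p.mean(axis=0), alpha / alpha.sum())   # both approx alpha_j / |alpha|

# additivity with groups {1,2} and {3,4}:
# (p1+p2, p3+p4) ~ Dir(alpha1+alpha2, alpha3+alpha4) = Dir(3, 7)
grouped = np.stack([p[:, :2].sum(axis=1), p[:, 2:].sum(axis=1)], axis=1)
print(grouped.mean(axis=0), np.array([3.0, 7.0]) / 10.0)
```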
▶ Conjugacy:
Let $(X_n)_{n \ge 1}$ be categorical r.v.'s s.t.
$$\mathbb{P}(X_1 = j \mid p_1, \dots, p_k) = p_j, \qquad j = 1, \dots, k,$$
$$(p_1, \dots, p_k) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_k).$$
Then
$$(p_1, \dots, p_k) \mid X_1 = j \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_j + 1, \dots, \alpha_k).$$
For $X^{(n)}$ s.t. $n_i$ is the frequency of value $i$ ($i = 1, \dots, k$) with $\sum_{j=1}^k n_j = n$, we have
$$(p_1, \dots, p_k) \mid X^{(n)} \sim \mathrm{Dir}(\alpha_1 + n_1, \dots, \alpha_k + n_k).$$
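In code the conjugate update is just a vector addition; a minimal sketch with illustrative prior parameters and counts:

```python
# Minimal sketch of the conjugate update: with counts n_1, ..., n_k the
# posterior is Dir(alpha_1 + n_1, ..., alpha_k + n_k). Values are illustrative.
import numpy as np

alpha = np.array([1.0, 1.0, 2.0])   # prior parameters
counts = np.array([4, 0, 6])        # frequencies n_i observed in X^(n), n = 10
alpha_post = alpha + counts         # Dir(5, 1, 8)
print("posterior mean:", alpha_post / alpha_post.sum())
```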
▶ Predictive structure
$$\mathbb{P}(X_1 = j) = \int_{\Delta_K} \mathbb{P}(X_1 = j \mid p_1, \dots, p_K)\, \mathrm{Dir}(dp_1, \dots, dp_K \mid \alpha)$$
$$= \int_{\Delta_K} p_j\, \frac{\Gamma(|\alpha|)}{\prod_{i=1}^K \Gamma(\alpha_i)}\; p_1^{\alpha_1 - 1} \cdots p_{K-1}^{\alpha_{K-1} - 1}\, p_K^{\alpha_K - 1}\; dp_1 \cdots dp_K$$
$$= \frac{\Gamma(|\alpha|)}{\prod_{i=1}^K \Gamma(\alpha_i)} \int_{\Delta_K} p_1^{\alpha_1 - 1} \cdots p_j^{\alpha_j + 1 - 1} \cdots p_K^{\alpha_K - 1}\; dp_1 \cdots dp_K$$
$$= \frac{\Gamma(|\alpha|)}{\Gamma(|\alpha| + 1)}\, \frac{\Gamma(\alpha_j + 1)}{\Gamma(\alpha_j)} = \frac{\alpha_j}{|\alpha|}$$
Similarly for $X_2 \mid X_1$, we obtain
$$\mathbb{P}(X_2 = j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{|\alpha| + 1}.$$
Interpretation of the Pólya urn predictives: assume the α_i are integers, then we have
• an urn with k colours of unknown proportions (p_1, . . . , p_k)
• we put a Dirichlet prior on the proportions
• without observations, we expect that the probability of observing a ball of colour j is $\mathbb{P}(X_1 = j) = \mathbb{E}[p_j] = \alpha_j / |\alpha|$
• after observing X_1, we update our prior knowledge on the urn composition to $\mathrm{Dir}\big(\alpha_1 + \delta_{X_1}(1), \dots, \alpha_k + \delta_{X_1}(k)\big)$
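A short simulation sketch of this reinforcement dynamic (the initial composition is an illustrative choice):

```python
# Sketch of the k-color Polya urn implied by the predictives: draw a color
# with probability proportional to the current composition, then add one
# ball of the drawn color.
import numpy as np

rng = np.random.default_rng(3)
alpha = np.array([2.0, 1.0, 1.0])   # initial urn composition (alpha_1, ..., alpha_k)
urn = alpha.copy()

draws = []
for _ in range(20):
    j = rng.choice(len(urn), p=urn / urn.sum())  # P(X = j) proportional to current alpha_j
    urn[j] += 1.0                                # reinforce the observed color
    draws.append(j)
print(draws, urn)
```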
The Dirichlet process was introduced by Ferguson (1973) as an RPM with Dirichlet distributed finite–dimensional distributions. See Ghosal (2010) and Ghosal & van der Vaart (2017) for recent reviews.
1. A priori moments
$$\mathbb{E}[\tilde P(A)] = \frac{\alpha(A)}{\alpha(\mathbb{X})} = P_0(A)$$
$$\mathrm{Var}(\tilde P(A)) = \frac{P_0(A)\, P_0(A^c)}{\theta + 1}$$
$$\mathrm{Cov}(\tilde P(A), \tilde P(B)) = \frac{P_0(A \cap B) - P_0(A)\, P_0(B)}{\theta + 1}$$
The proofs are straightforward by noting that marginally, for any $A \in \mathscr{X}$, one has $\tilde P(A) \sim \mathrm{Beta}(\alpha(A), \alpha(A^c))$. E.g. $\mathbb{E}[\tilde P(A)] = \frac{\alpha(A)}{\alpha(A) + \alpha(A^c)} = \frac{\alpha(A)}{\alpha(\mathbb{X})} = P_0(A)$.
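Since the marginal is Beta, the moments can be checked directly; a sketch with assumed values θ = 5 and P_0(A) = 0.3:

```python
# Check of the a priori moments through the Beta marginal: with theta = alpha(X)
# and a set A with P_0(A) = 0.3, P~(A) ~ Beta(theta * 0.3, theta * 0.7).
from scipy.stats import beta

theta, p0A = 5.0, 0.3
dist = beta(theta * p0A, theta * (1 - p0A))
print(dist.mean(), p0A)                           # E[P~(A)] = P_0(A)
print(dist.var(), p0A * (1 - p0A) / (theta + 1))  # Var = P_0(A)P_0(A^c)/(theta+1)
```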
A single realization of a DP
[Figure: a single DP realization on [0, 1], shown as probability masses (left) and the corresponding cdf (right).]
Realizations of a DP
[Figure: independent realizations of DP random cdfs on [0, 1].]
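Realizations like these can be drawn directly from the finite-dimensional Dirichlet distributions of the definition; a sketch assuming P_0 = Unif(0, 1) and a regular grid:

```python
# Sketch of how such cdf realizations can be drawn: by Ferguson's definition,
# the increments of F~ over a grid partition of [0, 1] are jointly
# Dir(theta * P_0(A_1), ..., theta * P_0(A_k)); here P_0 = Unif(0, 1).
import numpy as np

rng = np.random.default_rng(4)
theta = 10.0
grid = np.linspace(0.0, 1.0, 201)
p0_increments = np.diff(grid)        # P_0(A_j) for A_j = (t_{j-1}, t_j]

for _ in range(5):                   # five independent realizations
    increments = rng.dirichlet(theta * p0_increments)
    cdf = np.concatenate([[0.0], np.cumsum(increments)])  # F~ on the grid
    # plotting (grid, cdf) would reproduce figures like the ones above
```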
2. Posterior distribution and conjugacy
Posterior distribution
Let $X_i \mid \tilde P \overset{iid}{\sim} \tilde P$ ($i = 1, \dots, n$) with $\tilde P \sim \mathscr{D}_{\theta P_0}$. Then
$$\tilde P \mid X_1, \dots, X_n \sim \mathscr{D}\Big(\theta P_0 + \sum_{i=1}^n \delta_{X_i}\Big)$$
In the posterior the parameter measure α is updated by adding point masses at the observed locations. This is consistent with the special case of the beta–binomial model: set A = {Head} and A^c = {Tail}; then
$$\tilde P(A) \mid X^{(n)} \sim \mathrm{Beta}\Big(\underbrace{\theta P_0(A)}_{=\alpha} + \underbrace{\textstyle\sum_{i=1}^n \delta_{X_i}(A)}_{\#\text{ successes } k}, \;\; \underbrace{\theta P_0(A^c)}_{=\beta} + \underbrace{\textstyle\sum_{i=1}^n \delta_{X_i}(A^c)}_{\#\text{ failures } n-k}\Big)$$
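For a fixed set A the update thus reduces to the Beta posterior above; a minimal sketch with illustrative indicator data:

```python
# Sketch of the conjugate DP update on a fixed set A: the posterior of P~(A)
# is Beta(theta*P_0(A) + k, theta*P_0(A^c) + n - k). Values are illustrative.
import numpy as np
from scipy.stats import beta

theta, p0A = 5.0, 0.3
x_in_A = np.array([True, False, True, True])   # indicators delta_{X_i}(A)
n, k = len(x_in_A), x_in_A.sum()

posterior = beta(theta * p0A + k, theta * (1 - p0A) + (n - k))
print(posterior.mean())   # equals (theta*P_0(A) + k) / (theta + n)
```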
3. Predictive distributions
Generative construction via generalized Pólya urn [Blackwell & MacQueen, 1973]
▶ Step 1: for X_1 we have $X_1 \sim P^*$.
▶ Step n+1: $X_{n+1} \mid X^{(n)}$ is drawn from
$$\mathbb{P}[X_{n+1} \in \cdot \mid X^{(n)}] = \frac{\theta}{\theta+n}\, P^*(\cdot) + \frac{1}{\theta+n} \sum_{i=1}^n \delta_{X_i}(\cdot),$$
i.e. as seen before the linear combination of prior guess and empirical measure with weights depending on the total mass parameter θ and the sample size n.
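A sketch of the resulting sampler, assuming P* = N(0, 1) and illustrative values of θ and n:

```python
# Sketch of the generalized Polya urn sampler with continuous base measure
# P* = N(0, 1): each new draw is "new" w.p. theta/(theta+i-1), otherwise it
# repeats a uniformly chosen previous value (i.e. an old value with
# probability proportional to its frequency).
import numpy as np

rng = np.random.default_rng(5)
theta, n = 1.0, 50
x = [rng.normal()]                        # Step 1: X_1 ~ P*
for i in range(2, n + 1):
    if rng.random() < theta / (theta + i - 1):
        x.append(rng.normal())            # a new value from P*
    else:
        x.append(x[rng.integers(i - 1)])  # an old value, prop. to frequency
print(len(set(x)), "distinct values among", n)
```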
Remarks:
▶ Interpretation:
Let P_0 be diffuse, i.e. P_0({x}) = 0 for all x ∈ X, so it has no atoms. Then
• n/(θ + n) is the probability that X_{n+1} is an "old" value, i.e. already observed in X_1, . . . , X_n
• θ/(θ + n) is the probability that X_{n+1} is a "new" value, not previously observed.
▶ Asymptotics:
• As n increases, data provide more and more information and the weight associated to the prior guess vanishes =⇒ data swamp the prior.
• Frequentist evaluation: suppose the data are iid from a "true unknown" P_0; then, as n → ∞, $\frac{1}{n}\sum_{i=1}^n \delta_{X_i} \overset{w}{\to} P_0$ and $\mathbb{E}[\tilde P(\cdot) \mid X^{(n)}] \to P_0$. One can show that $\mathrm{Var}[\tilde P(\cdot) \mid X^{(n)}] \to 0$ as n → ∞ and, hence, by Chebyshev's inequality, we conclude that, for any weak neighbourhood of P_0, denoted by W(P_0), we have $Q(W(P_0) \mid X^{(n)}) \to 1$
=⇒ the DP is weakly consistent at P_0.
4. Support
We have said that it is desirable for a nonparametric prior Q to have large or
possibly full support.
Take home message: the DP prior has full support i.e. weak neighbourhoods of any
distribution P ∗ have positive probability.
A little bit more precisely:
Define, for any P ∈ P,
$$\mathrm{supp}(P) = \bigcap \{A : A \text{ closed and } P(A^c) = 0\}.$$
Then x ∈ supp(P) iff P(A) > 0 for every open set A which contains x.
Ferguson (1973, 1974) showed that, for any θ > 0,
$$\mathrm{supp}\big(\mathscr{D}_{\theta P_0}\big) = \big\{P \in \mathcal{P} : \mathrm{supp}(P) \subset \mathrm{supp}(P_0)\big\}$$
Hence the DP prior has full support whenever P_0 does. This implies that, given P* whose support is contained in that of P_0,
$$\mathscr{D}_{\theta P_0}\big(\mathcal{A}(P^*)\big) > 0$$
i.e. every weakly open subset $\mathcal{A}(P^*)$ of distributions that contains P* (whose support is contained in that of P_0) has positive mass.
5. DP is a discrete nonparametric prior
Blackwell (1973) showed that, even though the DP prior has full support, the
realizations of P̃ are discrete distributions (almost surely). This is true even when
the base measure P0 of the DP is continuous (e.g. P0 = N(0, 1)).
Typical realizations of a DP are as follows:
[Figure: typical DP realizations (random probability masses at the sampled locations); a sample from such a realization exhibits ties, e.g. X_1 = X_3 = X_4, X_2 = X_5.]
As seen, the sampling scheme for the observations is termed generalized Pólya urn
scheme. The induced sampling scheme on the partition process is called Chinese
restaurant process (see e.g. Pitman, 2006):
▶ The 1st customer arrives at the restaurant and sits at Table 1.
▶ The 2nd customer arrives at the restaurant and sits at Table 1 with probability
1/(θ + 1) or at a new table with probability θ/(θ + 1).
▶ The 3rd customer arrives at the restaurant: (a) if the first 2 customers sit at the same table, then she sits at that table with probability 2/(θ + 2) and at a new table with probability θ/(θ + 2); (b) if the first 2 customers sit at different tables, then she sits at each of these with probability 1/(θ + 2) and at a new one with probability θ/(θ + 2); and so on.
▶ The allocation of customers at the tables determines a random partition of the
observations into clusters.
For this reason (and for the connection with mixture models) K_n is also referred to as the number of clusters. Note that the allocation of a new observation to an existing cluster has probability proportional to the cluster frequency. A simulation sketch of the process is given below.
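```python
# Minimal sketch of the Chinese restaurant process: customer i sits at an
# occupied table with probability proportional to its occupancy, or at a new
# table with probability proportional to theta. Values are illustrative.
import numpy as np

rng = np.random.default_rng(6)

def crp(n, theta):
    tables = [1]                                  # customer 1 sits at table 1
    for i in range(2, n + 1):
        probs = np.array(tables + [theta]) / (theta + i - 1)
        j = rng.choice(len(probs), p=probs)
        if j == len(tables):
            tables.append(1)                      # open a new table
        else:
            tables[j] += 1                        # join an existing table
    return tables

occupancies = crp(100, theta=1.0)
print(occupancies, "K_n =", len(occupancies))
```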
Figure: Different clustering structures induced by the total mass parameter θ equal to 1, 5,
and 20 from left to right.
Let us look at the behaviour of K_n. Consider a DP with non–atomic base measure P* and θ > 0 and define the r.v.'s
$$D_i = \begin{cases} 1 & \text{if } X_i \notin \{X_1, \dots, X_{i-1}\} \quad (X_i \text{ new}) \\ 0 & \text{else} \quad (X_i \text{ old}) \end{cases}$$
which are independent and $D_i \sim \mathrm{Bern}\big(\tfrac{\theta}{\theta + i - 1}\big)$. Hence, we have
$$\mathbb{E}[K_n] = \mathbb{E}\Big[\sum_{i=1}^n D_i\Big] = \sum_{i=1}^n \frac{\theta}{\theta + i - 1},$$
which grows like θ log n as n → ∞.
See Crane (2016) for a recent review on the Ewens sampling formula.
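A quick numerical check of the formula and of its logarithmic growth (parameter values are illustrative):

```python
# Numerical check of E[K_n] = sum_{i=1}^n theta/(theta+i-1), which is of
# order theta * log(n) for large n.
import numpy as np

theta, n = 1.0, 1000
i = np.arange(1, n + 1)
exact = np.sum(theta / (theta + i - 1))
print(exact, theta * np.log(n))   # same logarithmic order for large n
```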
Constructions of the DP
1. DP via finite-dimensional distributions
This corresponds to the original construction of Ferguson (1973), which we gave as the definition of the DP.
2. DP via predictive distributions (or generalized Pólya urn)
We saw that a DP leads to predictive distributions of the form
$$\mathbb{P}[X_{n+1} \in \cdot \mid X^{(n)}] = \frac{\theta}{\theta+n}\, P^*(\cdot) + \frac{1}{\theta+n} \sum_{i=1}^n \delta_{X_i}(\cdot)$$
and already stated the vice versa, i.e. that predictive distributions of this form imply a DP prior as de Finetti measure. This can be proved by noting that:
▶ The distribution of X^(n) is recovered from the predictives as
$$\mathbb{P}(X_1 \in dx_1, \dots, X_n \in dx_n) = P^*(dx_1) \prod_{i=2}^n \frac{\theta P^*(dx_i) + \sum_{j<i} \delta_{x_j}(dx_i)}{\theta + i - 1}$$
3. DP via stick-breaking [Sethuraman, 1994]
Draw $V_i \overset{iid}{\sim} \mathrm{Beta}(1, \theta)$ and split the unit interval sequentially: a first piece of length V_1 (leaving 1 − V_1), then a fraction V_2 of the remaining length (leaving (1 − V_2)(1 − V_1)), and so on:
$$\tilde p_1 = V_1, \qquad \tilde p_i = V_i \prod_{j=1}^{i-1} (1 - V_j) \quad (i \ge 2).$$
With atoms $Z_i \overset{iid}{\sim} P^*$ independent of the $V_i$'s, $\tilde P = \sum_{i \ge 1} \tilde p_i\, \delta_{Z_i}$ is a DP; a truncated sampler is sketched below.
Remarks:
▶ By this construction it is apparent that the DP is a discrete RPM.
▶ Models based on the stick-breaking strategy are sometimes referred to as residual allocation models.
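```python
# Truncated stick-breaking sketch (Sethuraman's construction): V_i iid
# Beta(1, theta), weights p_i = V_i * prod_{j<i} (1 - V_j), atoms iid from
# P* = N(0, 1). Truncation level and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(7)
theta, N = 2.0, 500                         # truncation level N

V = rng.beta(1.0, theta, size=N)
weights = V * np.concatenate([[1.0], np.cumprod(1.0 - V)[:-1]])
atoms = rng.normal(size=N)                  # P* = N(0, 1)
print(weights.sum())                        # close to 1 for large N
# P~ is approximated by sum_i weights[i] * delta_{atoms[i]}
```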
4. DP as transformation of a process with independent increments
In fact, two more constructions of the Dirichlet process [Ferguson, 1973 & 1974;
Doksum, 1974] are based on processes with independent increments:
▶ Normalization: Let {ξ_t : t ≥ 0} be a gamma process, i.e. a process with independent increments s.t. $\xi_t \sim \mathrm{Ga}(\alpha((0,t]), 1)$, with α a finite measure on R^+. Then
$$\tilde F(t) = \frac{\xi_t}{\lim_{t \to \infty} \xi_t}$$
is the random cdf associated to a DP on R^+.
▶ Neutrality to the right: Let {ζ_t : t ≥ 0} be a "suitable" process with independent increments. Then
$$\tilde F(t) = 1 - e^{-\zeta_t}$$
is the random cdf associated to a DP on R^+.
Mixtures of Dirichlet processes (MDP)
In order to reflect uncertainty about our prior knowledge encoded in the precision
parameter θ and centering distribution P ∗ or simply to increase flexibility, one often
introduces a further hierarchy on α = θP ∗ by putting a prior also on θ and/or on
P ∗ . This leads to mixtures of Dirichlet processes (MDP) (Antoniak, 1974):
$$X_i \mid \tilde P, \tilde Z \overset{iid}{\sim} \tilde P$$
$$\tilde P \mid \tilde Z \sim \mathscr{D}(\alpha_{\tilde Z}) \quad \text{(prior)}$$
$$\tilde Z \sim \pi \quad \text{(hyperprior)}$$
Typical examples:
$$\tilde P \mid \tilde\theta \sim \mathscr{D}(\tilde\theta P^*), \qquad \tilde\theta \sim \mathrm{Ga}(a, b)$$
$$\tilde P \mid (\tilde\mu, \tilde\sigma) \sim \mathscr{D}\big(\theta\, N(\tilde\mu, \tilde\sigma^2)\big), \qquad \tilde\mu \sim N(m, s^2), \quad \tilde\sigma \sim \text{Inv-Ga}(\alpha, \beta)$$
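A prior draw from the first example can be sketched by drawing θ̃ and then running (truncated) stick-breaking given θ̃; hyperparameters, centering P* and truncation level are illustrative assumptions:

```python
# Sketch of a draw from the MDP prior with a Ga(a, b) hyperprior on the total
# mass: draw theta~, then a truncated DP given theta~.
import numpy as np

rng = np.random.default_rng(8)
a, b, N = 2.0, 1.0, 500

theta = rng.gamma(a, 1.0 / b)          # theta~ ~ Ga(a, b) (b is a rate; numpy uses scale)
V = rng.beta(1.0, theta, size=N)       # stick-breaking given theta~
weights = V * np.concatenate([[1.0], np.cumprod(1.0 - V)[:-1]])
atoms = rng.normal(size=N)             # centering P* = N(0, 1)
# (weights, atoms) approximate a realization of P~ | theta~
```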
Properties of MDPs
▶ Prior mean:
Since $\mathbb{E}[\tilde P(A) \mid \tilde Z] = \frac{\alpha_{\tilde Z}(A)}{\alpha_{\tilde Z}(\mathbb{X})} = P^*_{\tilde Z}(A)$, we have $\mathbb{E}[\tilde P(A)] = \int_{\mathcal{Z}} P^*_z(A)\, \pi(dz)$.
▶ Posterior distribution:
By conditioning on $\tilde Z$, we have
$$\tilde P \mid \tilde Z, X^{(n)} \sim \mathscr{D}\Big(\alpha_{\tilde Z} + \sum_{i=1}^n \delta_{X_i}\Big)$$
and the posterior becomes
$$\tilde P \mid X^{(n)} \sim \int_{\mathcal{Z}} \mathscr{D}\Big(\alpha_z + \sum_{i=1}^n \delta_{X_i}\Big)\, \pi(dz \mid X^{(n)}).$$
▶ Posterior mean:
The posterior mean is given by
$$\mathbb{E}[\tilde P(A) \mid X^{(n)}] = \int_{\mathcal{Z}} \frac{\alpha_z(A) + \sum_{i=1}^n \delta_{X_i}(A)}{\alpha_z(\mathbb{X}) + n}\, \pi(dz \mid X^{(n)}),$$
where, unlike for the DP, the centering measure is also updated, through the posterior $\pi(dz \mid X^{(n)})$ on the hyperparameter.
Dirichlet process mixtures
Although discrete distributions can approximate (weakly) any continuous
distribution, both the DP and MDP are inappropriate when the data come from a
continuous distribution.
Dirichlet process mixtures (DPM) were introduced by Lo (1984) as a model for
density estimation (an analog to the popular frequentist kernel density estimation).
DPMs differ substantially from MDPs in that they are mixtures w.r.t. the Dirichlet
process.
Consider a kernel f(y | x) on Y, indexed by a parameter x ∈ X, such that for any x, f(y | x) is a density². Then define the mixture of kernels
$$\tilde f(y) = \int_{\mathbb{X}} f(y \mid x)\, \tilde P(dx), \qquad \tilde P \sim \mathscr{D}(\alpha)$$
=⇒ Y's are the data and X's are latent variables which parametrize each kernel.
Since the DP is a discrete RPM, the model reduces to a countable mixture
$$\tilde f(y) = \int_{\mathbb{X}} f(y \mid x) \sum_{i \ge 1} \tilde p_i\, \delta_{X_i}(dx) = \sum_{i \ge 1} \tilde p_i \int_{\mathbb{X}} f(y \mid x)\, \delta_{X_i}(dx) = \sum_{i \ge 1} \tilde p_i\, f(y \mid X_i).$$
² A typical choice corresponds to f a Gaussian density on Y = R with parameter x = (µ, σ) ∈ X = R × R^+.
A DPM can be equivalently formulated in hierarchical form
$$Y_i \mid X_i, \tilde P \overset{ind}{\sim} f(\cdot \mid X_i) \qquad i = 1, \dots, n \qquad \text{[level of the observables]}$$
$$X_i \mid \tilde P \overset{iid}{\sim} \tilde P \qquad i = 1, \dots, n \qquad \text{[latent level]}$$
$$\tilde P \sim \mathscr{D}(\alpha)$$
An important property of the DPM model is the large support, relative to the choice
of kernel. For example:
▶ A DP location-scale mixture of normals induces a prior with full support on
the space of absolutely continuous distributions on R.
▶ A DP mixture of Unif(−X, X) distributions has full support on the space of unimodal symmetric densities.
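A prior-predictive sketch of a DP location mixture of normals, with P̃ integrated out via the Pólya urn; the kernel and hyperparameters are illustrative assumptions:

```python
# Prior-predictive sketch of a DP location mixture of normals: latent X_i are
# drawn via the Polya urn (P~ integrated out), then Y_i | X_i ~ N(X_i, 1).
import numpy as np

rng = np.random.default_rng(9)
theta, n = 1.0, 200

x = [rng.normal(0.0, 3.0)]                 # X_1 ~ P* = N(0, 9)
for i in range(2, n + 1):
    if rng.random() < theta / (theta + i - 1):
        x.append(rng.normal(0.0, 3.0))     # new latent location from P*
    else:
        x.append(x[rng.integers(i - 1)])   # old location, prop. to frequency
y = rng.normal(loc=np.array(x), scale=1.0) # Y_i | X_i ~ f(. | X_i) = N(X_i, 1)
# y is a draw from the DPM prior predictive; few distinct X_i => few modes
```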
Example: Realizations of a DPM
[Figure: two realizations of a DPM density.]
The DPM posterior is not available in closed form. Hence, one relies on Markov chain Monte Carlo (MCMC) techniques, in particular Gibbs sampling schemes, which are standard apart from having to deal with P̃ (infinite-dimensional!). There are two classes of algorithms according to the way P̃ is handled:
▶ Marginal algorithms: “ P̃ is integrated out ” leading to work with the predictive
distributions (or Blackwell–MacQueen Pólya urn scheme) or the EPPF
[Escobar & West, 1995; MacEachern & Müller, 1998].
▶ Conditional algorithms: Simulating P̃ is part of the algorithm and there are several devices to do it. The simplest is truncation: take N large enough and simulate $\sum_{i=1}^N \tilde p_i\, \delta_{X_i} \approx \tilde P$³. For DPMs there are algorithms which allow exact simulation even without sampling an infinite sequence, known as retrospective [Papaspiliopoulos & Roberts, 2008] and slice samplers [Walker, 2007].
³ There are strategies to select N so as to control the approximation error; a simple one is sketched below.
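One such strategy (a sketch) uses the fact that under stick-breaking the expected leftover mass after N sticks is $\mathbb{E}[1 - \sum_{i \le N} \tilde p_i] = (\theta/(\theta+1))^N$, since each residual factor $1 - V_i$ has mean θ/(θ+1):

```python
# Sketch of a truncation rule: choose the smallest N whose expected leftover
# stick-breaking mass (theta/(theta+1))^N falls below a tolerance eps.
# theta and eps are illustrative.
import numpy as np

theta, eps = 2.0, 1e-6
N = int(np.ceil(np.log(eps) / np.log(theta / (theta + 1.0))))
print(N)   # smallest N with (theta/(theta+1))^N <= eps
```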
Simulation experiment: DPM density estimate
[Figure: DPM density estimate for simulated data.]
• Hjort, Holmes, Müller & Walker (Eds.) (2010). Bayesian Nonparametrics. Cambridge: Cambridge Univ. Press.
• Jara, Hanson, Quintana, Müller & Rosner (2011). DPpackage: Bayesian semi- and nonparametric modeling in R. J. Statistical Software 40, 1–30.
• Korwar & Hollander (1973). Contribution to the theory of Dirichlet processes. Ann. Probab. 1, 705–711.
• Lo (1984). On a class of Bayesian nonparametric estimates. Ann. Statist. 12, 351–357.
• Lo (1991). A characterization of the Dirichlet process. Statist. Probab. Lett. 12, 185–187.
• MacEachern & Müller (1998). Estimating mixture of Dirichlet process models. J. Comput. Graph. Statist. 7, 223–238.
• Papaspiliopoulos & Roberts (2008). Retrospective MCMC for Dirichlet process hierarchical models. Biometrika 95, 169–186.
• Pitman (1995). Exchangeable and partially exchangeable random partitions. Probab. Theory Related Fields 102, 145–158.
• Pitman (2006). Combinatorial Stochastic Processes. Lecture Notes in Math., vol. 1875. Springer, Berlin.
• Regazzini (1978). Intorno ad alcune questioni relative alla definizione del premio secondo la teoria della credibilità. Giorn. Istit. Ital. Attuari 41, 77–89.
• Regazzini (1996). Impostazione nonparametrica di problemi d'inferenza bayesiana. IMATI Tech. Report 96-21, http://web.mi.imati.cnr.it/iami/abstracts/96-21.html
• Sethuraman (1994). A constructive definition of the DP prior. Stat. Sin. 2, 639–650.
• Walker (2007). Sampling the Dirichlet mixture model with slices. Comm. Statist. Sim. Comput. 36, 45–54.