
Machine Learning II

Part I: An overview of Bayesian Nonparametrics


The Dirichlet process and models based on the DP

Igor Prünster
Bocconi University

1 / 40
Discrete nonparametric priors

If the de Finetti measure Q selects (a.s.) discrete distributions, i.e. P̃ is a discrete random probability measure

$$\tilde P(\,\cdot\,) = \sum_{i \ge 1} \tilde p_i\, \delta_{Z_i}(\,\cdot\,), \qquad (♢)$$

then any (exchangeable) vector $X^{(n)} := (X_1, \ldots, X_n)$ generated by Q will exhibit ties with positive probability, i.e. feature $K_n$ distinct observations

$$X_1^*, \ldots, X_{K_n}^*$$

with frequencies $N_1, \ldots, N_{K_n}$ such that $\sum_{i=1}^{K_n} N_i = n$.

Species sampling: (♢) model for species distribution within a population


• $X_i^*$ is the label of the i–th distinct species in the sample;
• $N_i$ is the frequency of the i–th distinct species;
• $K_n$ is the total number of distinct species in the sample.
=⇒ Species metaphor

2 / 40
Dirichlet process from a predictive point of view

Problem: Assume (Xn )n≥1 is an exchangeable sequence.


▶ Predict the distribution of Xn+1 conditional on a sample X (n) with Kn distinct
values X1∗ , . . . , XK∗n and frequencies N1 , . . . , NKn ;
▶ Prior guess at law of any of the Xi ’s is P ∗ ;
▶ The strength of the prior belief is measured by a parameter θ > 0.

Idea: Predict the distribution of $X_{n+1}$ as a linear combination of $P^*$ and the empirical measure $n^{-1}\sum_{i=1}^{K_n} N_i\, \delta_{X_i^*}$, namely

$$P[X_{n+1} \in \cdot \mid X^{(n)}] \;=\; \underbrace{\frac{\theta}{\theta+n}}_{P[X_{n+1}=\text{``new''}\,\mid\, X^{(n)}]} \underbrace{P^*(\cdot)}_{\text{prior guess}} \;+\; \underbrace{\frac{n}{\theta+n}}_{P[X_{n+1}=\text{``old''}\,\mid\, X^{(n)}]}\, \underbrace{\frac{1}{n}\sum_{i=1}^{K_n} N_i\, \delta_{X_i^*}(\cdot)}_{\text{empirical measure}}$$

=⇒ Predictive distributions of the Dirichlet process (DP) [Ferguson, 1973].


Remark: The de Finetti measure Q of (Xn )n≥1 is a DP prior iff the prediction rule is
a linear combination of P ∗ and the empirical measure [Regazzini, 1978; Lo, 1991]

3 / 40
Figure: Plot of a Dirichlet process predictive cumulative distribution function given $X^{(5)}$ with continuous prior guess.

4 / 40
The Dirichlet distribution
The original definition of the Dirichlet process [Ferguson, 1973] was in terms of a
consistent family of finite–dimensional Dirichlet distributions.

Dirichlet distribution

Let $\alpha = (\alpha_1, \ldots, \alpha_k)$ be a vector of nonnegative numbers and set $|\alpha| = \sum_{i=1}^k \alpha_i$. Define the (k − 1)-dimensional simplex

$$\Delta_{k-1} = \Big\{ (x_1, \ldots, x_{k-1}) : x_1 \ge 0, \ldots, x_{k-1} \ge 0,\; \sum_{i=1}^{k-1} x_i \le 1 \Big\}.$$

The random vector $(p_1, \ldots, p_k)$ is said to have Dirichlet distribution with parameters $\alpha_1, \ldots, \alpha_k$, denoted

$$(p_1, \ldots, p_k) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_k),$$

if it has density w.r.t. the Lebesgue measure on $\Delta_{k-1}$ given by

$$d_k(p_1, \ldots, p_{k-1}; \alpha) = \frac{\Gamma(|\alpha|)}{\prod_{i=1}^k \Gamma(\alpha_i)}\, p_1^{\alpha_1 - 1} \cdots p_{k-1}^{\alpha_{k-1} - 1} \Big(1 - \sum_{i=1}^{k-1} p_i\Big)^{\alpha_k - 1}.$$

If any $\alpha_i = 0$, set the corresponding $p_i = 0$ and consider the density on the lower dimensional set. Note that if k = 2, one obtains the beta distribution.
5 / 40
Dirichlet Distributions
Examples of Dirichlet distributions over p = (p1, p2, p3), which can be plotted in 2D since p3 = 1 − p1 − p2:

The plot displays:

◦ Left panel: Heat maps of Dirichlet distributions on the ∆2 simplex.


◦ Right panel: Dirichlet densities on the ∆2 simplex with (α1 , α2 , α3 ) equal
to (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4), respectively.

6 / 40
Key properties of the Dirichlet distribution

▶ Construction as normalization of independent gamma r.v.'s (illustrated numerically in the sketch below):

Let $Y_j \overset{\text{ind}}{\sim} \mathrm{Ga}(\alpha_j, 1)$, $\alpha_j > 0$, for $j = 1, \ldots, k$, and define

$$p_j = \frac{Y_j}{\sum_{i=1}^k Y_i},$$

then

$$(p_1, \ldots, p_k) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_k).$$

▶ Additivity:
If $(p_1, \ldots, p_k) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_k)$ and $0 < r_1 < \cdots < r_\ell = k$, then

$$\Big(\sum_{i=1}^{r_1} p_i,\; \ldots,\; \sum_{i=r_{\ell-1}+1}^{r_\ell} p_i\Big) \sim \mathrm{Dir}_\ell\Big(\sum_{i=1}^{r_1} \alpha_i,\; \ldots,\; \sum_{i=r_{\ell-1}+1}^{r_\ell} \alpha_i\Big).$$

This property suggests that it is useful to see the vector of parameters $(\alpha_1, \ldots, \alpha_k)$ as a measure s.t. $\alpha(A) = \sum_{i \in A} \alpha_i$. Note also

$$p_i \sim \mathrm{Beta}\Big(\alpha_i,\; \sum_{j=1}^k \alpha_j - \alpha_i\Big).$$
6 / 40
▶ Conjugacy
Let (Xn )n≥1 be categorical r.v.’s s.t.

P(X1 = j | p1 , . . . , pk ) = pj , j = 1, . . . , k
(p1 , . . . , pk ) ∼ Dir(α1 , . . . , αk ).

Then by Bayes Theorem

(p1 , . . . , pk ) | X1 = j ∼ Dir(α1 , . . . , αj + 1, . . . , αk ).
For $X^{(n)}$ s.t. $n_i$ is the frequency of value i (i = 1, . . . , k) with $\sum_{j=1}^k n_j = n$, we have

$$(p_1, \ldots, p_k) \mid X^{(n)} \sim \mathrm{Dir}(\alpha_1 + n_1, \ldots, \alpha_k + n_k).$$

=⇒ The Dirichlet prior is conjugate w.r.t. the multinomial model, which extends the conjugacy seen in the beta–binomial case.
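A minimal sketch of this conjugate update (NumPy assumed; the prior parameters and counts are arbitrary example values):

import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([1.0, 2.0, 3.0])      # prior Dirichlet parameters (example values)
counts = np.array([4, 0, 6])           # observed frequencies n_1, ..., n_k

alpha_post = alpha + counts            # Dir(alpha_1 + n_1, ..., alpha_k + n_k)
print(alpha_post)                      # [5. 2. 9.]
print(alpha_post / alpha_post.sum())   # posterior mean of (p_1, ..., p_k)
print(rng.dirichlet(alpha_post, size=3))   # a few posterior draws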

7 / 40
▶ Predictive structure

$$P(X_1 = j) = \int_{\Delta_k} P(X_1 = j \mid p_1, \ldots, p_k)\, \mathrm{Dir}(dp_1, \ldots, dp_k \mid \alpha)$$
$$= \int_{\Delta_k} p_j\, \frac{\Gamma(|\alpha|)}{\prod_{i=1}^k \Gamma(\alpha_i)}\, p_1^{\alpha_1 - 1} \cdots p_k^{\alpha_k - 1}\, dp_1 \cdots dp_k$$
$$= \frac{\Gamma(|\alpha|)}{\prod_{i=1}^k \Gamma(\alpha_i)} \int_{\Delta_k} p_1^{\alpha_1 - 1} \cdots p_j^{\alpha_j + 1 - 1} \cdots p_k^{\alpha_k - 1}\, dp_1 \cdots dp_k$$
$$= \frac{\Gamma(|\alpha|)\, \Gamma(\alpha_j + 1)}{\Gamma(\alpha_j)\, \Gamma(|\alpha| + 1)} = \frac{\alpha_j}{|\alpha|}.$$

Similarly for $X_2 \mid X_1$, we obtain

$$P(X_2 = j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{|\alpha| + 1}.$$

Hence a Dirichlet sample (i.e., a categorical–Dirichlet model) is governed by the Pólya urn scheme

$$P(X_1 = j) = \frac{\alpha_j}{|\alpha|}, \qquad P(X_2 = j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{|\alpha| + 1}, \qquad \ldots$$
$$P(X_{n+1} = j \mid X_1, \ldots, X_n) = \frac{\alpha_j + \sum_{i=1}^n \delta_{X_i}(j)}{|\alpha| + n}$$
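A small Monte Carlo sketch of this urn scheme (NumPy assumed; α is an arbitrary example); it checks the predictive probability of the second draw against the formula above:

import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])            # example urn composition

def polya_urn_sample(alpha, n, rng):
    """Draw X_1, ..., X_n from the Dirichlet-categorical Polya urn."""
    counts = np.zeros_like(alpha)
    draws = []
    for _ in range(n):
        probs = (alpha + counts) / (alpha.sum() + counts.sum())
        j = rng.choice(len(alpha), p=probs)
        counts[j] += 1
        draws.append(j)
    return np.array(draws)

# Monte Carlo check of P(X_2 = 0 | X_1 = 0) = (alpha_0 + 1) / (|alpha| + 1)
samples = np.array([polya_urn_sample(alpha, 2, rng) for _ in range(50_000)])
first_is_0 = samples[:, 0] == 0
print((samples[first_is_0, 1] == 0).mean())      # empirical frequency
print((alpha[0] + 1) / (alpha.sum() + 1))        # = 2/7 ≈ 0.286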

8 / 40
Interpretation of the Pólya urn predictives: assume the αi are integers, then we have
• an urn with k colors of unknown proportions (p1 , . . . , pk )
• we put a Dirichlet prior on the proportions
• without observations, we expect that the probability of observing a ball of
colour j is P(X1 = j) = E[pj ] = αj / |α|
• after observing X1, we update our prior knowledge on the urn composition to

$$\mathrm{Dir}\big(\alpha_1 + \delta_{X_1}(1), \ldots, \alpha_k + \delta_{X_1}(k)\big)$$

so we expect that the probability of sampling color j as the second draw is

$$E(p_j \mid X_1) = \frac{\alpha_j + \delta_{X_1}(j)}{|\alpha| + 1}$$

and so on.
▶ Exchangeability: By de Finetti's representation Theorem it follows that a sample $X^{(n)}$ drawn from the Dirichlet multinomial model is exchangeable. In fact, denoting the frequencies by $n_1, \ldots, n_k$ s.t. $\sum_{i=1}^k n_i = n$, we have

$$P(X_1 = x_1, \ldots, X_n = x_n) = \frac{(\alpha_1)_{(n_1)} \cdots (\alpha_k)_{(n_k)}}{|\alpha|_{(n)}}$$

with $a_{(n)} = a(a+1)\cdots(a+n-1)$ (ascending factorial or Pochhammer symbol). As expected, it does not depend on the order of appearance of the observations, only on their multiplicities.
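A small check of this formula (standard-library Python plus NumPy; the sequence and parameters are arbitrary examples): multiplying the Pólya urn predictives along a sequence reproduces the closed form, and reversing the sequence gives the same value.

import numpy as np
from math import prod

alpha = np.array([1.0, 2.0, 3.0])            # example Dirichlet parameters

def rising(a, n):
    """Ascending factorial a_(n) = a (a+1) ... (a+n-1)."""
    return prod(a + i for i in range(n))

def joint_prob_sequential(xs, alpha):
    """P(X_1 = x_1, ..., X_n = x_n) via the chain of Polya urn predictives."""
    counts = np.zeros_like(alpha)
    p = 1.0
    for x in xs:
        p *= (alpha[x] + counts[x]) / (alpha.sum() + counts.sum())
        counts[x] += 1
    return p

xs = [0, 2, 2, 1, 0]                         # an arbitrary sequence
n_counts = [xs.count(j) for j in range(len(alpha))]

print(joint_prob_sequential(xs, alpha))      # sequential product of predictives
print(prod(rising(a, n) for a, n in zip(alpha, n_counts)) / rising(alpha.sum(), len(xs)))
print(joint_prob_sequential(xs[::-1], alpha))   # same value: order does not matter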
9 / 40
The Dirichlet process

The Dirichlet process was introduced by Ferguson (1973) as a RPM with Dirichlet
distributed finite–dimensional distributions. See Ghosal (2010) and Ghosal & van
der Vaart (2017) for recent reviews.

Dirichlet process (DP)

Let α be a finite non null measure on (X, X ). A RPM P̃ on (X, X ) is


said to be a Dirichlet process (DP), denoted P̃ ∼ D (α), if for every finite
measurable partition A1 , . . . , Ak of X, we have
$$(\tilde P(A_1), \ldots, \tilde P(A_k)) \sim \mathrm{Dir}(\alpha(A_1), \ldots, \alpha(A_k)).$$

The existence of a DP is shown by a variation of Kolmogorov's consistency conditions combined with an instrumental use of de Finetti's representation Theorem. See Regazzini (1996) for details.
It is often useful to reparametrize the base measure as α = θP0 and correspondingly write P̃ ∼ D(θP0), where
▶ θ := α(X) > 0 is called the total mass of α;
▶ P0(·) = α(·)/α(X) ∈ P is termed the centering distribution (or "prior guess").
The advantage is that we may want to choose P0 and θ separately, rather than the measure α directly, as parameters of the process.
10 / 40
Key properties

1. A priori moments

Prior specification via moments

Let P̃ ∼ D(α). Then for any A, B ∈ X

$$E[\tilde P(A)] = \frac{\alpha(A)}{\alpha(X)} = P_0(A)$$
$$\mathrm{Var}(\tilde P(A)) = \frac{P_0(A)\, P_0(A^c)}{\theta + 1}$$
$$\mathrm{Cov}(\tilde P(A), \tilde P(B)) = \frac{P_0(A \cap B) - P_0(A)\, P_0(B)}{\theta + 1}$$

The proofs are straightforward by noting that marginally, for any A ∈ X, one has $\tilde P(A) \sim \mathrm{Beta}(\alpha(A), \alpha(A^c))$. E.g. $E[\tilde P(A)] = \frac{\alpha(A)}{\alpha(A) + \alpha(A^c)} = \frac{\alpha(A)}{\alpha(X)} = P_0(A)$.

Hence, a natural way to select the parameters of the DP is:


▶ P0 is the prior guess
▶ θ controls the strength of belief in P0 (often also called precision parameter).
If θ ↑ ∞, Var(P̃(A)) ↓ 0 and P̃ degenerates on P0 .

11 / 40
A single realization of a DP

Figures: Probability mass function (left) and cumulative distribution function (right) of a single realization of a DP (solid) with P0 = Beta(5, 10) centering distribution (dotted) and θ = 20.

12 / 40
Realizations of a DP

Figures: Cumulative distribution functions corresponding to 10 realizations (grey step functions) of the DP P̃ ∼ D(θP0) with θ = 10 and centering distributions (black lines) P0 ≡ Beta(1, 1) = Unif(0, 1) (left) and P0 ≡ Beta(5, 10) (right). Recall that E[P̃] = P0.

13 / 40
Figures: Cumulative distribution functions corresponding to 10 realizations (grey step functions) of the DP P̃ ∼ D(θP0) with centering distribution P0 ≡ Beta(5, 5) (black lines) and θ = 1 (left) and θ = 50 (right).

14 / 40
2. Posterior distribution and conjugacy

Posterior distribution

Let $(X_n)_{n \ge 1}$ be an X-valued exchangeable sequence whose de Finetti measure Q is the law of a Dirichlet process, i.e.

$$X_i \mid \tilde P \overset{\text{iid}}{\sim} \tilde P, \qquad \tilde P \sim \mathrm{D}(\theta P_0).$$

Then

$$\tilde P \mid X_1, \ldots, X_n \sim \mathrm{D}\Big(\theta P_0 + \sum_{i=1}^n \delta_{X_i}\Big).$$

In the posterior the parameter measure α is updated by adding point masses at the observed locations. This is consistent with the special case of the beta–binomial model: set A = {Head} and A^c = {Tail}, then

$$\tilde P(A) \mid X^{(n)} \sim \mathrm{Beta}\Big(\underbrace{\theta P_0(A)}_{=\alpha} + \underbrace{\textstyle\sum_{i=1}^n \delta_{X_i}(A)}_{\#\text{ successes } k},\;\; \underbrace{\theta P_0(A^c)}_{=\beta} + \underbrace{\textstyle\sum_{i=1}^n \delta_{X_i}(A^c)}_{\#\text{ failures } n-k}\Big).$$
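A small sketch of this update, representing the posterior mean cdf directly (NumPy assumed; the function name posterior_mean_cdf is illustrative, P0 = Unif(0, 1) and the data are arbitrary choices):

import numpy as np

theta = 10.0                                  # total mass (example)
data = np.array([0.23, 0.41, 0.41, 0.77])     # observed sample (example values)
n = len(data)

def p0_cdf(t):
    """Centering distribution P0 = Unif(0, 1): cdf is t on [0, 1]."""
    return np.clip(t, 0.0, 1.0)

def posterior_mean_cdf(t):
    """E[ P((-inf, t]) | data ] when P | data ~ D(theta*P0 + sum_i delta_{X_i})."""
    empirical = (data[None, :] <= np.atleast_1d(t)[:, None]).mean(axis=1)
    return (theta * p0_cdf(np.atleast_1d(t)) + n * empirical) / (theta + n)

grid = np.linspace(0.0, 1.0, 5)
print(posterior_mean_cdf(grid))               # posterior mean cdf on a small grid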

15 / 40
3. Predictive distributions
Generative construction via the generalized Pólya urn [Blackwell & MacQueen, 1973]
▶ Step 1: for X1 we have

$$P(X_1 \in A) = E[P(X_1 \in A \mid \tilde P)] = E[\tilde P(A)] = P_0(A) \qquad (A \in \mathscr{X})$$

=⇒ the marginal distribution of X1 (often called the prior predictive) is the centering distribution P0.
▶ Step 2: for X2 | X1 we have

$$P(X_2 \in A \mid X_1) = E[\tilde P(A) \mid X_1] = \int_{\mathbb{P}} P(A)\, Q(dP \mid X_1)$$

where $\tilde P(A) \mid X_1 \overset{d}{=} \mathrm{Beta}\big(\alpha(A) + \delta_{X_1}(A),\, \alpha(A^c) + \delta_{X_1}(A^c)\big)$, so that

$$P(X_2 \in A \mid X_1) = \frac{\theta P_0(A) + \delta_{X_1}(A)}{\theta + 1} = \frac{\theta}{\theta+1}\, P_0(A) + \frac{1}{\theta+1}\, \delta_{X_1}(A)$$

▶ Step n+1: iterating, for $X_{n+1} \mid X^{(n)}$ we obtain

$$P(X_{n+1} \in A \mid X^{(n)}) = \frac{\theta}{\theta+n}\, P_0(A) + \frac{n}{\theta+n}\, \frac{1}{n} \sum_{i=1}^n \delta_{X_i}(A)$$

i.e., as seen before, the linear combination of the prior guess and the empirical measure, with weights depending on the total mass parameter θ and the sample size n.
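A minimal sketch of this sequential generative scheme (NumPy assumed; the choices P0 = N(0, 1) and θ = 2 are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def dp_polya_urn(theta, n, p0_sampler, rng):
    """Draw X_1, ..., X_n marginally from a DP(theta * P0) via the generalized Polya urn."""
    xs = []
    for i in range(n):
        # with prob theta/(theta+i) draw a new value from P0, otherwise copy an old one
        if rng.random() < theta / (theta + i):
            xs.append(p0_sampler(rng))
        else:
            xs.append(xs[rng.integers(i)])
    return np.array(xs)

x = dp_polya_urn(theta=2.0, n=20, p0_sampler=lambda r: r.standard_normal(), rng=rng)
print(np.round(x, 3))
print("distinct values K_n =", len(np.unique(x)))   # ties appear with positive probability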

16 / 40
Example: Predictive distribution


The plot displays:


◦ Cdf of the prior guess P ∗ ≡ Beta(2, 2) (dotted line).
◦ Cdf of the predictive distribution (solid line) for a DP with θ = 10 and 5 observed
data points (rug).
Note that, since the predictive is a linear combination of a continuous cdf (F*) and the empirical cdf, the resulting cdf is piecewise differentiable, with jumps occurring at the observed locations; the jump at each location equals its multiplicity divided by θ + n. In the picture all five observed locations are distinct, so each jump has size 1/(θ + n) = 1/15.
17 / 40
Equivalently, one can state the predictive distribution as

$$X_{n+1} = \begin{cases} Z \sim P_0 & \text{with probability } \theta/(\theta+n) \\ X_1 & \text{with probability } 1/(\theta+n) \\ \;\;\vdots & \\ X_n & \text{with probability } 1/(\theta+n) \end{cases}$$

Remarks:
▶ Interpretation:
Let P0 be diffuse, i.e. P0({x}) = 0 for all x ∈ X, so it has no atoms. Then
• n/(θ + n) is the probability that Xn+1 is an "old" value, i.e. already observed in X1, . . . , Xn;
• θ/(θ + n) is the probability that Xn+1 is a "new" value, not previously observed.
▶ Asymptotics:
• As n increases, data provide more and more information and the weight
associated to the prior guess vanishes =⇒ data swamp the prior
• Frequentist evaluation: suppose the data are iid from a "true unknown" P0; then, as n → ∞, $\frac{1}{n}\sum_{i=1}^n \delta_{X_i} \overset{w}{\to} P_0$ and $E[\tilde P(\cdot) \mid X^{(n)}] \to P_0$. One can show that $\mathrm{Var}[\tilde P(\cdot) \mid X^{(n)}] \to 0$ as n → ∞ and, hence, by Chebyshev's inequality, we conclude that, for any weak neighbourhood of P0, denoted by W(P0), we have $Q(W(P_0) \mid X^{(n)}) \to 1$.
=⇒ the DP is weakly consistent at P0 .
18 / 40
4. Support
We have said that it is desirable for a nonparametric prior Q to have large or
possibly full support.
Take home message: the DP prior has full support i.e. weak neighbourhoods of any
distribution P ∗ have positive probability.
A little bit more precisely:
Define for any P ∈ P,

$$\mathrm{supp}(P) = \bigcap \{A : A \text{ closed and } P(A^c) = 0\}.$$

Then x ∈ supp(P) iff P(A) > 0 for every open set A which contains x.
Ferguson (1973, 1974) showed that for any θ > 0

$$\mathrm{supp}\big(\mathrm{D}_{\theta P_0}\big) = \big\{ P \in \mathbb{P} : \mathrm{supp}(P) \subset \mathrm{supp}(P_0) \big\}$$

Hence the DP prior has full support. This implies that, given P* whose support is contained in that of P0,

$$\mathrm{D}_{\theta P_0}\big(A(P^*)\big) > 0$$

i.e. every weakly open subset of distributions that contains P* (whose support is contained in that of P0) has positive mass.

19 / 40
5. DP is a discrete nonparametric prior
Blackwell (1973) showed that, even though the DP prior has full support, the
realizations of P̃ are discrete distributions (almost surely). This is true even when
the base measure P0 of the DP is continuous (e.g. P0 = N(0, 1)).
Typical realizations of a DP are as follows:

Figure: Top: realization of a DP P̃ ∼ D(θP0) (solid black) with θ = 5 and P0 = Ga(20, 1) (dotted line). Bottom: realizations of DPs seen as cdf's (grey step functions) for θ = 1, 5, 20 (left to right) against the cdf of P0 (solid line).
20 / 40
6. Distinct values and induced partition
The discreteness of the DP implies that we observe ties with positive probability, i.e. P(Xn+1 = Xi) > 0 for all i = 1, . . . , n, which is apparent from the prediction rule.
This induces a random partition of the integers {1, . . . , n}, determined by grouping the observations with the same value. E.g. if

X1 = X3 = X4 , X2 = X5

the induced partition of {1, . . . , 5} is

{1, 3, 4}, {2, 5}.

We will denote by $X_1^*, \ldots, X_{K_n}^*$ the distinct values in $X_1, \ldots, X_n$ in order of appearance, where $X_i^*$ is the common value for observations in group i, which has multiplicity $n_i$, i.e.

$$X_{r_1} = \cdots = X_{r_{n_i}} = X_i^*$$

and $(r_1, \ldots, r_{n_i}) \subset \{1, \ldots, n\}$ identify those observations.
To reflect this, the predictive distribution can be rewritten as

$$P(X_{n+1} \in A \mid X^{(n)}) = \frac{\theta}{\theta+n}\, P_0(A) + \frac{n}{\theta+n}\, \frac{1}{n} \sum_{i=1}^{K_n} n_i\, \delta_{X_i^*}(A)$$

from which $P(X_{n+1} = X_i^* \mid X^{(n)}) = n_i/(\theta + n)$.

21 / 40
As seen, the sampling scheme for the observations is termed the generalized Pólya urn scheme. The induced sampling scheme on the partition process is called the Chinese restaurant process (see e.g. Pitman, 2006):
▶ The 1st customer arrives at the restaurant and sits at Table 1.
▶ The 2nd customer arrives at the restaurant and sits at Table 1 with probability 1/(θ + 1) or at a new table with probability θ/(θ + 1).
▶ The 3rd customer arrives at the restaurant: (a) if the first 2 customers sit at the same table, then she sits at that table with probability 2/(θ + 2) and at a new table with probability θ/(θ + 2); (b) if the first 2 customers sit at different tables, then she sits at each of these with probability 1/(θ + 2) and at a new one with probability θ/(θ + 2), and so on.
▶ The allocation of customers to the tables determines a random partition of the observations into clusters.

For these reasons (and for the connection with mixture models) Kn is also referred to as the number of clusters. Note that the probability of allocating a new observation to an existing cluster is proportional to the cluster frequency.
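A compact sketch of the Chinese restaurant process as a partition sampler (NumPy assumed; θ and n are arbitrary example values):

import numpy as np

rng = np.random.default_rng(0)

def chinese_restaurant_process(theta, n, rng):
    """Return table assignments (a partition of {0, ..., n-1}) under CRP(theta)."""
    tables = []            # tables[c] = number of customers at table c
    assignment = []
    for _ in range(n):
        # seat at an occupied table with prob proportional to its size,
        # or at a new table with prob proportional to theta
        probs = np.array(tables + [theta], dtype=float)
        c = rng.choice(len(probs), p=probs / probs.sum())
        if c == len(tables):
            tables.append(1)
        else:
            tables[c] += 1
        assignment.append(c)
    return assignment, tables

assignment, tables = chinese_restaurant_process(theta=5.0, n=30, rng=rng)
print(assignment)                                   # cluster label of each customer
print("K_n =", len(tables), "cluster sizes:", tables)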

22 / 40
Figure: Different clustering structures induced by the total mass parameter θ equal to 1, 5, and 20 from left to right.

23 / 40
Let us look at the behaviour of $K_n$. Consider a DP with non-atomic base measure P* and θ > 0 and define the r.v.'s

$$D_i = \begin{cases} 1 & \text{if } X_i \notin \{X_1, \ldots, X_{i-1}\} \quad (X_i \text{ new}) \\ 0 & \text{else} \quad (X_i \text{ old}) \end{cases}$$

which are independent and $D_i \sim \mathrm{Bern}\big(\tfrac{\theta}{\theta + i - 1}\big)$. Hence, we have

$$E[K_n] = E\Big[\sum_{i=1}^n D_i\Big] = \sum_{i=1}^n \frac{\theta}{\theta + i - 1}$$

Moreover, it can be shown that (see e.g. Pitman, 2006):

▶ Recall $(\theta)_{(n)} := \theta(\theta+1)\cdots(\theta + n - 1)$ and denote by |s(n, k)| the signless Stirling number of the first kind¹; then

$$P[K_n = k] = |s(n, k)|\, \frac{\theta^k}{(\theta)_{(n)}} \qquad \text{[Antoniak, 1974]}$$

▶ $K_n / \log n \to \theta$ a.s. [Korwar & Hollander, 1973]
▶ $E[K_n] \approx \mathrm{Var}[K_n] \approx \theta \log n$
▶ $(K_n - \theta \log n)/\sqrt{\theta \log n} \overset{d}{\to} N(0, 1)$

Remark: No parameter really controls the growth rate of $K_n$. Much recent research has concentrated on constructing alternative P̃'s with a more flexible growth rate of $K_n$.

¹ $|s(n, k)| = |s(n-1, k-1)| + (n-1)\,|s(n-1, k)|$ with |s(0, 0)| = 1, |s(n, 0)| = 0, and |s(n, k)| = 0 if n < k.
24 / 40
Growth rate of the DP

500 trajectories of $K_n$ for n = 1, . . . , 1000 and θ = 20, together with $E[K_n]$ ± two standard deviations as given above.
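A sketch reproducing this kind of experiment (NumPy assumed); only the representation of K_n as a sum of independent Bernoulli D_i's is used, so no atoms need to be simulated:

import numpy as np

rng = np.random.default_rng(0)
theta, n, n_paths = 20.0, 1000, 500

# D_i ~ Bern(theta / (theta + i - 1)), independent; K_n is their cumulative sum
i = np.arange(1, n + 1)
p_new = theta / (theta + i - 1)
D = rng.random((n_paths, n)) < p_new          # broadcast over the 500 paths
K = D.cumsum(axis=1)

print(K[:, -1].mean(), p_new.sum())           # empirical vs exact E[K_n]
print(K[:, -1].var(), theta * np.log(n))      # Var[K_n] vs the rough theta*log(n) approximation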
25 / 40
7. Exchangeable partition probability function
Any sample $X^{(n)}$ from a discrete RPM P̃ induces an exchangeable random partition $\Pi_n$, i.e. a random partition whose law is invariant under permutation of its elements and hence is a symmetric function of the group sizes only. For a partition $A_1, \ldots, A_k$ of {1, . . . , n}, this reads

$$P(\Pi_n = \{A_1, \ldots, A_k\}) = p(|A_1|, \ldots, |A_k|), \qquad |A_i| = \mathrm{card}(A_i).$$

This clearly holds for every n and hence any P̃ induces an infinite exchangeable partition $(\Pi_n)_{n \ge 1}$.
The distribution of a specific DP sample featuring $K_n = k$ distinct observations with frequencies $n_1, \ldots, n_k$ is given by

$$p_k^{(n)}(n_1, \ldots, n_k) = \frac{\theta^k}{(\theta)_{(n)}} \prod_{i=1}^k (n_i - 1)!$$

This distribution is called the Exchangeable Partition Probability Function (EPPF) [Pitman, 1995] and characterizes the partition distribution. See e.g. Pitman (2006).
One can also encode the partition in terms of counts or multiplicities, i.e. $m_1$ groups of size 1, $m_2$ groups of size 2, and so on, with $\sum_{i=1}^n m_i = K_n$ and $\sum_{i=1}^n i\, m_i = n$.
For the DP this leads to Ewens' sampling formula [Ewens, 1972], a cornerstone of Population Genetics, which is given by

$$\mathrm{ESF}_\theta(m_1, \ldots, m_n) = \frac{n!}{(\theta)_{(n)}} \prod_{i=1}^n \frac{\theta^{m_i}}{m_i!\; i^{m_i}}.$$

See Crane (2016) for a recent review on the Ewens sampling formula.
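A small sketch computing the EPPF and the Ewens sampling formula for the partition {1, 3, 4}, {2, 5} used earlier (standard-library Python; θ = 2 is an arbitrary choice). As a consistency check, the ESF should equal the EPPF times the number of partitions of {1, . . . , 5} with the same block sizes, here 5!/(3!·2!) = 10:

from math import factorial, prod

def rising(a, n):
    """Ascending factorial a_(n)."""
    return prod(a + i for i in range(n))

def dp_eppf(theta, group_sizes):
    """EPPF of the DP: probability of a given partition with these block sizes."""
    n, k = sum(group_sizes), len(group_sizes)
    return theta**k / rising(theta, n) * prod(factorial(ni - 1) for ni in group_sizes)

def ewens_sf(theta, m):
    """Ewens sampling formula; m[i-1] = number of blocks of size i, i = 1..n."""
    n = sum(i * mi for i, mi in enumerate(m, start=1))
    return (factorial(n) / rising(theta, n)
            * prod(theta**mi / (factorial(mi) * i**mi) for i, mi in enumerate(m, start=1)))

theta = 2.0
print(dp_eppf(theta, [3, 2]))                       # partition {1,3,4}, {2,5}: sizes (3, 2)
print(ewens_sf(theta, [0, 1, 1, 0, 0]))             # multiplicities: one block of size 2, one of size 3
print(ewens_sf(theta, [0, 1, 1, 0, 0]) / dp_eppf(theta, [3, 2]))   # ratio should be 10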
26 / 40
Constructions of the DP
1. DP via finite-dimensional distributions
This corresponds to the original construction of Ferguson (1973), which we took as the definition of the DP.
2. DP via predictive distributions (or generalized Pólya urn)
We saw that a DP leads to predictive distributions of the form

$$P[X_{n+1} \in \cdot \mid X^{(n)}] = \frac{\theta}{\theta+n}\, P^*(\cdot) + \frac{1}{\theta+n} \sum_{i=1}^n \delta_{X_i}(\cdot)$$

and already stated the converse, i.e. that predictive distributions of this form imply a DP prior as de Finetti measure. This can be proved by noting that:
▶ The distribution of $X^{(n)}$ is recovered from the predictives as

$$P(X_1 \in dx_1, \ldots, X_n \in dx_n) = P^*(dx_1) \prod_{i=2}^n \frac{\theta P^*(dx_i) + \sum_{j<i} \delta_{X_j}(dx_i)}{\theta + i - 1}$$

and shown to be exchangeable, similarly to the finite Dirichlet case with $K_n$ in place of k.
▶ From de Finetti's Theorem, there exists a RPM P̃ such that $X_i \mid \tilde P \overset{\text{iid}}{\sim} \tilde P$.
▶ Since we have seen that the DP generates this predictive sequence, the uniqueness of the de Finetti measure implies that P̃ is a DP.
27 / 40
3. DP via stick–breaking [Sethuraman, 1994]
▶ Consider a sequence of i.i.d. r.v.'s $V_i \sim \mathrm{Beta}(1, \theta)$ with θ > 0. To construct the random probability weights of a discrete RPM, break a unit-length stick sequentially according to the $V_i$'s: the first piece has length $V_1$, the second piece is a fraction $V_2$ of the remaining length $(1 - V_1)$, and so on. This yields

$$\tilde p_1 = V_1, \qquad \tilde p_2 = V_2(1 - V_1), \qquad \ldots \qquad \Longrightarrow \qquad \tilde p_i = V_i \prod_{j=1}^{i-1} (1 - V_j), \quad i = 1, 2, \ldots$$

▶ Sample i.i.d. locations $(Z_i)_{i \ge 1}$ from P* on X.
▶ Define a RPM as $\tilde P = \sum_{i \ge 1} \tilde p_i\, \delta_{Z_i}$, which is shown to coincide with the DP.

Remarks:
▶ By this construction it is apparent that the DP is a discrete RPM.
▶ Models based on the stick-breaking strategy are sometimes referred to as
residual allocation models.
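A truncated stick-breaking sketch (NumPy assumed; the truncation level and the Beta(5, 10) centering distribution are arbitrary choices, so the result only approximates a DP draw):

import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(theta, p0_sampler, n_atoms, rng):
    """Approximate draw from a DP(theta * P0), truncated at n_atoms atoms."""
    V = rng.beta(1.0, theta, size=n_atoms)
    # p_i = V_i * prod_{j<i} (1 - V_j); the last weight absorbs the leftover mass
    weights = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    weights[-1] += 1.0 - weights.sum()
    atoms = p0_sampler(n_atoms, rng)
    return atoms, weights

atoms, weights = dp_stick_breaking(
    theta=20.0, p0_sampler=lambda n, r: r.beta(5, 10, size=n), n_atoms=500, rng=rng)

# Mean of the realized random measure: roughly the P0 mean 5/15, since E[P~] = P0
print((weights * atoms).sum())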
28 / 40
4. DP as transformation of a process with independent increments

Process with independent increments

A stochastic process {Xt : t ≥ 0} on X = R+ is said to be a process with


independent increments if X0 = 0 a.s. and for any n ≥ 1 and 0 ≤ t0 < t1 <
. . . < tn , the r.v.
Xt0 , Xt1 − Xt0 , . . . , Xtn − Xtn−1
are mutually independent.

In fact, two more constructions of the Dirichlet process [Ferguson, 1973 & 1974;
Doksum, 1974] are based on processes with independent increments:
▶ Normalization: Let {ξt : t ≥ 0} be a gamma process, i.e. a process with independent increments s.t. $\xi_t \sim \mathrm{Ga}(\alpha((0, t]), 1)$, with α a finite measure on R₊. Then

$$\tilde F(t) = \frac{\xi_t}{\lim_{t \to \infty} \xi_t}$$

is the random cdf associated to a DP on R₊.
▶ Neutrality to the right: Let {ζt : t ≥ 0} be a "suitable" process with independent increments. Then

$$\tilde F(t) = 1 - e^{-\zeta_t}$$

is the random cdf associated to a Dirichlet process on R₊.


29 / 40
A simulated trajectory of a gamma process
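A sketch of how such a trajectory can be simulated on a grid using independent gamma increments (NumPy assumed; the grid, θ = 20 and the base measure α = θ · Unif(0, 1) are arbitrary choices). Normalizing by the final value gives an approximate DP cdf, as in the normalization construction above:

import numpy as np

rng = np.random.default_rng(0)
theta = 20.0
grid = np.linspace(0.0, 1.0, 501)                   # time grid on [0, 1]

# alpha = theta * Unif(0, 1): the increment over (t, t+dt] has shape theta * dt
shapes = theta * np.diff(grid)
increments = rng.gamma(shape=shapes, scale=1.0)
xi = np.concatenate(([0.0], increments.cumsum()))   # gamma process trajectory on the grid

F = xi / xi[-1]                                     # normalized: approximate DP cdf on [0, 1]
print(F[::100])                                     # a few values of the random cdf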

30 / 40
Mixtures of Dirichlet processes (MDP)

In order to reflect uncertainty about our prior knowledge encoded in the precision
parameter θ and centering distribution P ∗ or simply to increase flexibility, one often
introduces a further hierarchy on α = θP ∗ by putting a prior also on θ and/or on
P ∗ . This leads to mixtures of Dirichlet processes (MDP) (Antoniak, 1974):

$$X_i \mid \tilde P, \tilde Z \overset{\text{iid}}{\sim} \tilde P, \qquad \tilde P \mid \tilde Z \sim \mathrm{D}(\alpha_{\tilde Z}) \;\;\text{(prior)}, \qquad \tilde Z \sim \pi \;\;\text{(hyperprior)}$$

where the parameter measure αZ is now indexed by Z , which is a r.v. with


distribution π.
According to the support of π one obtains a finite, countable or uncountable mixture of DPs. Typical examples of MDPs are

$$\tilde P \mid \tilde\theta \sim \mathrm{D}(\tilde\theta P^*), \qquad \tilde\theta \sim \mathrm{Ga}(a, b)$$
$$\tilde P \mid (\tilde\mu, \tilde\sigma) \sim \mathrm{D}\big(\theta\, N(\tilde\mu, \tilde\sigma^2)\big), \qquad \tilde\mu \sim N(m, s^2), \quad \tilde\sigma \sim \text{Inv-Ga}(\alpha, \beta)$$

where in the latter $\tilde\sigma^{-1} \sim \mathrm{Ga}(\alpha, \beta^{-1})$.


Note that realizations are still discrete, which should be intuitive from the hierarchy.
First we sample Z , and given Z we sample a DP, which is discrete a.s.

31 / 40
The plot displays:

◦ Left panel: Realization of a DP P̃ ∼ D(θ · Ga(20, 1)) together with the pdf of its base measure.
◦ Right panel: Realization of an MDP P̃ | Z̃ ∼ D(θ · Ga(Z̃, 1)) with Z̃ ∈ {3, 10, 25} with probabilities 0.1, 0.6, 0.3 respectively, together with the pdf of its mixture base measure.

32 / 40
Properties of MDPs

▶ Prior mean:
Since $E[\tilde P(A) \mid \tilde Z] = \frac{\alpha_{\tilde Z}(A)}{\alpha_{\tilde Z}(X)} = P^*_{\tilde Z}(A)$, we have $E[\tilde P(A)] = \int_Z P^*_z(A)\, \pi(dz)$.
▶ Posterior distribution:
By conditioning on $\tilde Z$, we have

$$\tilde P \mid \tilde Z, X^{(n)} \sim \mathrm{D}\Big(\alpha_{\tilde Z} + \sum_{i=1}^n \delta_{X_i}\Big)$$

and the posterior becomes

$$\tilde P \mid X^{(n)} \sim \int_Z \mathrm{D}\Big(\alpha_z + \sum_{i=1}^n \delta_{X_i}\Big)\, \pi(dz \mid X^{(n)}).$$

▶ Posterior mean:
The posterior mean is given by

$$E[\tilde P(A) \mid X^{(n)}] = \int_Z \frac{\alpha_z(A) + \sum_{i=1}^n \delta_{X_i}(A)}{\alpha_z(X) + n}\, \pi(dz \mid X^{(n)}).$$

If $\theta := \alpha_{\tilde Z}(X)$ does not depend on $\tilde Z$, it becomes

$$\frac{\theta}{\theta+n} \int_Z P^*_z(A)\, \pi(dz \mid X^{(n)}) + \frac{n}{\theta+n}\, \frac{1}{n} \sum_{i=1}^n \delta_{X_i}(A)$$

where, unlike for the DP, the centering measure is also updated.
33 / 40
Dirichlet process mixtures
Although discrete distributions can approximate (weakly) any continuous
distribution, both the DP and MDP are inappropriate when the data come from a
continuous distribution.
Dirichlet process mixtures (DPM) were introduced by Lo (1984) as a model for
density estimation (an analog to the popular frequentist kernel density estimation).
DPMs differ substantially from MDPs in that they are mixtures w.r.t. the Dirichlet
process.
Consider a kernel f(y | x) on Y indexed by a parameter x ∈ X, such that for any x, f(y | x) is a density². Then define the mixture of kernels

$$\tilde f(y) = \int_X f(y \mid x)\, \tilde P(dx), \qquad \tilde P \sim \mathrm{D}(\alpha),$$

=⇒ the Y's are the data and the X's are latent variables which parametrize each kernel.
Since the DP is a discrete RPM, the model reduces to a countable mixture

$$\tilde f(y) = \int_X f(y \mid x) \sum_{i \ge 1} \tilde p_i\, \delta_{X_i}(dx) = \sum_{i \ge 1} \tilde p_i \int_X f(y \mid x)\, \delta_{X_i}(dx) = \sum_{i \ge 1} \tilde p_i\, f(y \mid X_i).$$

² A typical choice corresponds to f a Gaussian density on Y = R with parameter x = (µ, σ) ∈ X = R × R₊.
34 / 40
A DPM can be equivalently formulated in hierarchical form

$$Y_i \mid X_i, \tilde P \overset{\text{ind}}{\sim} f(\,\cdot \mid X_i), \quad i = 1, \ldots, n \qquad \text{[level of the observables]}$$
$$X_i \mid \tilde P \overset{\text{iid}}{\sim} \tilde P, \quad i = 1, \ldots, n \qquad \text{[latent level]}$$
$$\tilde P \sim \mathrm{D}(\alpha)$$

The discrete or continuous nature of the kernel f (y | x ) determines the selected


distributions, together with their support. For instance:
▶ If f (y | x ) ≡ Po(x ), f˜ is a DP mixture of Poisson distributions, selecting
discrete distributions on Z+ .
▶ If x = (µ, σ) ∈ R × R+ and f (y | x ) ≡ N(µ, σ 2 ), f˜ is a location-scale mixture
of Gaussians on R.

An important property of the DPM model is the large support, relative to the choice
of kernel. For example:
▶ A DP location-scale mixture of normals induces a prior with full support on
the space of absolutely continuous distributions on R.
▶ A DP mixture of Unif(−x, x) distributions has full support on the space of unimodal symmetric densities.
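A sketch that simulates data from one draw of a Gaussian (location) DPM, combining the truncated stick-breaking and hierarchical representations above (NumPy assumed; all hyperparameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
theta, n_atoms, n_obs = 2.0, 200, 100

# Truncated stick-breaking draw of P~ with P0 = N(0, 5^2) on the location parameter
V = rng.beta(1.0, theta, size=n_atoms)
w = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
w /= w.sum()                                   # renormalize the truncated weights
locs = rng.normal(0.0, 5.0, size=n_atoms)      # atoms X_i ~ P0

# Hierarchical sampling: latent X_i ~ P~, then Y_i | X_i ~ N(X_i, 1)
latent = rng.choice(locs, size=n_obs, p=w)
y = rng.normal(latent, 1.0)

print("distinct latent values (clusters):", len(np.unique(latent)))
print(np.round(y[:10], 2))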

35 / 40
Example: Realizations of a DPM


Realizations of a DPM with Gaussian (left) and Poisson (right) kernels.


◦ Left panel: Draw from a location Gaussian DPM (solid line); mixture components
N(xi∗ , 1) (colored dashed) with means xi∗ ’s resulting from the DP sample (vertical
black solid with height representing their weight). Observations simulated from the
mixture are represented as rugs, colored according to the generating kernel (colored
dashed).
◦ Right panel: Draw from a Poisson DPM with analogous interpretation of the various
lines.
36 / 40
DPMs are the most popular BNP model, for two main reasons:
▶ Flexible density estimation:
Parametric models imply severe constraints on the considered distributions (e.g. unimodality, tail behaviour, overdispersion). The DPM avoids such constraints and flexibly adapts to any data-generating distribution.
▶ Model-based clustering:
Given that P̃ is discrete, $K_n \le n$ distinct latent variables are generated. Here, $K_n$ represents the number of mixture components and the n observations are clustered into $K_n$ groups, in the sense that two observations belong to the same cluster if they come from the same mixture component $f(y \mid X_i^*)$, i = 1, . . . , Kn.

The DPM posterior is not available in closed form. Hence, one relies on Markov chain Monte Carlo (MCMC) techniques, in particular Gibbs sampling schemes, which are standard apart from having to deal with P̃ (infinite-dimensional!). There are two classes of algorithms according to the way P̃ is handled:
▶ Marginal algorithms: "P̃ is integrated out", leading one to work with the predictive distributions (or the Blackwell–MacQueen Pólya urn scheme) or the EPPF [Escobar & West, 1995; MacEachern & Müller, 1998].
▶ Conditional algorithms: Simulating P̃ is part of the algorithm and there are several devices to do it. The simplest is truncation: take N large enough and simulate $\sum_{i=1}^N \tilde p_i\, \delta_{X_i} \approx \tilde P$³. For DPMs there are algorithms which allow exact simulation even without sampling an infinite sequence, known as retrospective samplers [Papaspiliopoulos & Roberts, 2008] and slice samplers [Walker, 2007].

³ There are strategies to select N so as to control the approximation error.
37 / 40
Simulation experiment: DPM density estimate


The plot displays:


◦ True data generating density (dashed) given by a mixture of 3 Gaussians.
◦ Posterior density estimate obtained with a Gaussian DPM based on 500 data points (rug, colored according to the allocated kernel).
The inference was performed using the popular R package DPpackage of Jara et al. (2011)
with the third default prior specification in the package.
38 / 40
Some References

• Antoniak (1974). Mixtures of Dirichlet processes with applications to Bayesian


nonparametric problems. Ann. Statist. 2, 1152–1174.
• Blackwell (1973). Discreteness of Ferguson Selections. Ann. Statist. 1, 356–8.
• Blackwell & MacQueen (1973). Ferguson distributions via Pólya urn schemes.
Ann. Statist. 1, 353–5.
• Crane (2016). The ubiquitous Ewens sampling formula. Statist. Science 31, 1–19.
• Doksum (1974). Tailfree and neutral random probabilities and their posterior
distributions. Ann. Probab. 2, 183–201.
• Escobar & West (1995). Bayesian density estimation and inference using
mixtures. J. Amer. Stat. Assoc. 90, 577–88.
• Ewens (1972). The sampling theory of selectively neutral alleles. Theor. Pop.
Biol. 3, 87–112.
• Ferguson (1973). A Bayesian analysis of some nonparametric problems. Ann.
Statist. 1, 209-30.
• Ferguson (1974). Prior distributions on spaces of probability measures. Ann.
Statist. 2, 615-29.
• Ghosal (2010). The Dirichlet process, related priors and posterior asymptotics. In
Bayesian Nonparametrics (Hjort, Holmes, Müller & Walker, eds.). Cambridge Univ.
Press, Cambridge.
• Ghosal & van der Vaart (2017). Fundamentals of Bayesian nonparametric
inference. Cambridge University Press.
• Ghosh & Ramamoorthi (2003). Bayesian nonparametrics. Springer, New York.

39 / 40
• Hjort, Holmes, Müller & Walker (Eds.) (2010). Bayesian Nonparametrics.
Cambridge: Cambridge Univ. Press.
• Jara, Hanson, Quintana, Müller and Rosner (2011). DPpackage: Bayesian semi-
and nonparametric modeling in R. J. Statistical Software 40, 1-30.
• Korwar & Hollander (1973). Contribution to the theory of Dirichlet processes.
Ann. Probab. 1, 705-11.
• Lo (1984). On a class of Bayesian nonparametric estimates. Ann. Statist. 12,
351-57.
• Lo (1991). A characterization of the Dirichlet process. Statist. Probab. Lett. 12,
185–187.
• MacEachern & Müller (1998). Estimating mixture of Dirichlet process models. J.
Comput. Graph. Statist. 7, 223– 238.
• Papaspiliopoulos & Roberts (2008). Retrospective MCMC for Dirichlet process
hierarchical models. Biometrika 95, 169–186.
• Pitman (1995). Exchangeable and partially exchangeable random partitions.
Prob. Th. and Rel. Fields 102, 145–158.
• Pitman (2006). Combinatorial Stochastic Processes. Lecture Notes in Math.,
vol.1875, Springer, Berlin.
• Regazzini (1978). Intorno ad alcune questioni relative alla definizione del premio
secondo la teoria della credibilità. Giorn. Istit. Ital. Attuari, 41, 77–89.
• Regazzini (1996). Impostazione nonparametrica di problemi d’inferenza bayesiana.
Imati Tech. Report 96-21, http://web.mi.imati.cnr.it/iami/abstracts/96-21.html
• Sethuraman (1994). A constructive definition of Dirichlet priors. Statist. Sinica 4, 639–650.
• Walker (2007). Sampling the Dirichlet mixture model with slices. Comm. Statist. Sim. Comput. 36, 45–54.
40 / 40
