Density exploration methods
Pierre Jacob
Department of Statistics, University of Oxford
pierre.jacob at stats.ox.ac.uk
March 2014
Pierre Jacob Density exploration 1/ 49
Outline
1 MCMC and multimodal target distributions
2 Parallel MCMC, tempering and equi-energy moves
3 Wang–Landau algorithm
MCMC and multimodal target distributions
Algorithm 1 Metropolis–Hastings targeting π
1: Init X0 ∈ X.
2: for t = 1 to T do
3: Sample X⋆ from some proposal distribution q(Xt−1, ·).
4: Compute the acceptance ratio:
α(Xt−1, X⋆) = min( 1, [ π(X⋆) q(X⋆, Xt−1) ] / [ π(Xt−1) q(Xt−1, X⋆) ] ).
5: With probability α(Xt−1, X⋆), set Xt = X⋆;
otherwise Xt = Xt−1.
6: end for
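As a concrete reference point, here is a minimal Python sketch of Algorithm 1, assuming a symmetric Gaussian random-walk proposal (so the q-ratio cancels) and a target supplied as an unnormalized log-density:

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_iters, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings targeting pi, given log pi up to a constant.

    The Gaussian proposal is symmetric, q(x, y) = q(y, x), so the acceptance
    ratio reduces to min(1, pi(x_star) / pi(x_{t-1})).
    """
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iters + 1)
    chain[0] = x = float(x0)
    for t in range(1, n_iters + 1):
        x_star = x + step * rng.standard_normal()
        # Accept with probability min(1, pi(x_star)/pi(x)), computed in log scale.
        if np.log(rng.uniform()) < log_pi(x_star) - log_pi(x):
            x = x_star
        chain[t] = x
    return chain

# Example: target a standard normal via its unnormalized log-density.
chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_iters=5000)
```

With an asymmetric proposal, the ratio q(X⋆, Xt−1)/q(Xt−1, X⋆) would have to be reinstated in the acceptance probability.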
MCMC and multimodal target distributions
MCMC methods make it possible to approximate
∫ φ(x) π(x) dx,
as long as π can be evaluated or estimated point-wise; they do so by generating a chain
X0, X1, . . . , XT.
The guarantees are largely asymptotic, as T → ∞.
For multimodal target distributions the non-asymptotic regime
might be very different.
MCMC and multimodal target distributions
Figure : Posterior distribution of (µ1, µ2) in a Gaussian mixture model.
See Stephens (1997), Bayesian methods for mixtures of normal distributions, PhD thesis. Figure obtained using PAWL.
MCMC and multimodal target distributions
Figure : Posterior distribution of (r, θ) in a theta-Ricker Hidden Markov
model. See Polansky et al. (2009), Likelihood ridges and multimodality
in population growth rate models. Figure obtained using SMC².
MCMC and multimodal target distributions
Figure : Toy example: a mixture of well-separated normal distributions.
MCMC and multimodal target distributions
Figure : Markov chain still stuck in one mode after 50,000 iterations.
MCMC and multimodal target distributions
Figure : Feast your eyes on the moustarget distribution!
MCMC and multimodal target distributions
Note that multimodal distributions are not difficult to sample
from if the modes are not well separated.
In fact we can [re]define a mode as a region from which
Metropolis–Hastings cannot escape.
Non-asymptotic Error Bounds for Sequential MCMC Methods in
Multimodal Settings, N. Schweizer, 2012.
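The sticking behaviour in the figures can be reproduced on an assumed toy target (an equal-weight mixture of N(0, 1) and N(10, 1), not the exact target of the slides): a random-walk chain started in the left mode essentially never reaches the right one.

```python
import numpy as np

def log_mixture(x):
    # Assumed toy target: equal-weight mixture of N(0, 1) and N(10, 1).
    a, b = -0.5 * x**2, -0.5 * (x - 10.0)**2
    m = np.maximum(a, b)
    return m + np.log(0.5 * np.exp(a - m) + 0.5 * np.exp(b - m))

rng = np.random.default_rng(0)
x = 0.0                                       # initialize in the left mode
samples = np.empty(20000)
for t in range(samples.size):
    x_star = x + 0.5 * rng.standard_normal()  # modest random-walk steps
    if np.log(rng.uniform()) < log_mixture(x_star) - log_mixture(x):
        x = x_star
    samples[t] = x

# Fraction of time spent near the mode at 10 (1/2 under the target).
frac_right = float(np.mean(samples > 5.0))
```

Crossing the valley at x = 5 requires climbing roughly 12 log-density units against the acceptance ratio, so the chain stays put for any practical run length.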
Parallel MCMC
A first idea is to run N chains independently, from various
starting points chosen to be “spread out”.
The chains can thus find multiple modes; this approach also brings
other benefits, such as parallelization and convergence diagnostics.
What if there are > N modes? What if all the chains are
initialized in the attraction zone of the same mode?
Parallel MCMC
Figure : Parallel MCMC on the moustarget distribution
Parallel MCMC
Figure : Parallel MCMC on the moustarget distribution (traces of Y for 10 chains over 10,000 iterations).
Parallel Tempering
The idea of parallel tempering is to run N chains targeting
different versions of π, of “increasing difficulty”.
Introduce “inverse temperatures”:
0 < γ1 < . . . < γN = 1.
Introduce “tempered” distributions πγn for n = 1, . . . , N.
For γ ≈ 0, πγ is considered easier to sample because the
variations of π are smaller.
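A back-of-the-envelope illustration with assumed numbers: since log πγ = γ log π up to a constant, tempering scales every log-density gap by γ, which is why πγ is easier to explore for small γ.

```python
# Tempering: pi^gamma has log-density gamma * log pi (up to a constant), so
# the log gap between a mode and a valley scales linearly with gamma.
log_pi_mode, log_pi_valley = -1.0, -13.5   # illustrative values
gaps = {g: g * (log_pi_mode - log_pi_valley) for g in (1.0, 0.5, 0.1)}
# gamma = 1.0 gives a gap of 12.5; gamma = 0.1 shrinks it to 1.25,
# which a random-walk chain crosses easily.
```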
Parallel Tempering
One MCMC chain per inverse temperature, for instance using
a Metropolis-Hastings kernel targeting πγn .
Note that the local modes of πγ are the same for every γ.
The chains interact through “swap moves”.
Parallel Tempering
When a “swap move” is to be performed, do the following.
Sample indices k1, k2 uniformly in {1, . . . , N}.
With acceptance probability
min( 1, [ π^γk1(xk2) π^γk2(xk1) ] / [ π^γk1(xk1) π^γk2(xk2) ] ),
exchange the values of xk1 and xk2.
This doesn’t change the joint target distribution
π^γ1 ⊗ π^γ2 ⊗ . . . ⊗ π^γN.
In particular the N-th chain still targets π^γN = π.
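The swap acceptance probability can be sketched as follows, using only log-densities for numerical stability (the states and temperatures below are illustrative):

```python
import numpy as np

def log_swap_accept(log_pi, gammas, xs, k1, k2):
    """Log of the swap acceptance probability between chains k1 and k2.

    With tempered targets pi^gamma, log pi_gamma(x) = gamma * log pi(x),
    so the ratio only involves the two chains being swapped.
    """
    num = gammas[k1] * log_pi(xs[k2]) + gammas[k2] * log_pi(xs[k1])
    den = gammas[k1] * log_pi(xs[k1]) + gammas[k2] * log_pi(xs[k2])
    return min(0.0, num - den)

lp = lambda x: -0.5 * x**2
# Swapping identical states is always accepted.
p_equal = np.exp(log_swap_accept(lp, [0.5, 1.0], [1.0, 1.0], 0, 1))
# Pushing a low-density state up to the colder chain is accepted with prob < 1.
p_uphill = np.exp(log_swap_accept(lp, [0.5, 1.0], [2.0, 0.0], 0, 1))
```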
Parallel Tempering
Figure : Parallel Tempering on the moustarget distribution, with γ
equally spaced in [0.5, 1].
Parallel Tempering
Figure : Parallel Tempering on the moustarget distribution, with γ
equally spaced in [0.5, 1] (traces of Y for 10 chains over 10,000 iterations).
Parallel Tempering
Figure : Parallel Tempering on the moustarget distribution, with γ
equally spaced in [0.1, 1].
Parallel Tempering
Figure : Parallel Tempering on the moustarget distribution, with γ
equally spaced in [0.1, 1] (traces of Y for 10 chains over 10,000 iterations).
Parallel Tempering
The choice of the inverse temperatures (γn), n = 1, . . . , N, is essential.
Taking γ1 very low increases the exploration for the chain
targeting πγ1 .
If the increments γn − γn−1 are too large, the swap moves
tend to be rejected, which decreases the exploration for the
“upper” chains.
SMC sampler
Sequence of distributions, for instance πγn for n = 1, . . . , N
such that 0 < γ1 < . . . < γN = 1. Say N = 100.
M particles (say 10,000), sequentially importance sampling
from µ to πγ1 and then from πγn−1 to πγn .
When the effective sample size is low, resample and then
MCMC move for each particle (say 5 steps for each particle).
The ability to recover modes is sensitive to the choice of the
initial distribution µ.
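A hedged sketch of such an SMC sampler on a one-dimensional stand-in problem (standard-normal target, Gaussian µ, geometric path π_n ∝ µ^{1−γn} π^{γn}; all settings are illustrative, not the moustarget):

```python
import numpy as np

rng = np.random.default_rng(0)
log_pi = lambda x: -0.5 * x**2            # stand-in target: N(0, 1), unnormalized
log_mu = lambda x: -0.5 * (x / 5.0)**2    # initial distribution mu = N(0, 25)

M = 5000                                  # particles
x = rng.normal(0.0, 5.0, size=M)          # draw from mu
logw = np.zeros(M)
gammas = np.linspace(0.0, 1.0, 21)        # tempering ladder
for g_prev, g in zip(gammas[:-1], gammas[1:]):
    # Incremental importance weight along the path pi_n propto mu^(1-g) pi^g.
    logw += (g - g_prev) * (log_pi(x) - log_mu(x))
    w = np.exp(logw - logw.max()); w /= w.sum()
    if 1.0 / np.sum(w**2) < M / 2:        # effective sample size is low
        x = x[rng.choice(M, size=M, p=w)] # resample
        logw[:] = 0.0
        log_target = lambda y, g=g: (1 - g) * log_mu(y) + g * log_pi(y)
        for _ in range(5):                # 5 MH moves per particle
            x_star = x + rng.standard_normal(M)
            accept = np.log(rng.uniform(size=M)) < log_target(x_star) - log_target(x)
            x = np.where(accept, x_star, x)

w = np.exp(logw - logw.max()); w /= w.sum()
mean_est = float(np.sum(w * x))
var_est = float(np.sum(w * x**2) - mean_est**2)
```

The final weighted particles approximate π; as the slides note, whether distant modes survive to the end depends on µ placing particles near them initially.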
SMC sampler
Figure : SMC sampler on the moustarget distribution, with
µ = N( (400, −100)ᵀ, diag(32², 32²) ).
SMC sampler
Figure : SMC sampler on the moustarget distribution, with
µ = N( (400, −400)ᵀ, diag(100², 100²) ).
Equi-energy sampler
Same initial setting as parallel tempering: N chains (X^1_t), . . . , (X^N_t),
each targeting πγn using a MH kernel.
The first chain (X^1_t) simply targets πγ1 using MH.
For an upper chain (X^n_t), with probability ε perform an
equi-energy move; otherwise perform a MH step targeting πγn.
Adaptive Equi-Energy Sampler : Convergence and
Illustration, Schreck, Fort, Moulines, 2013.
Equi-energy sampler
An equi-energy move consists of proposing a point from the
history of the chain just below, (X^{n−1}_t), and then accepting it
with a MH-type acceptance probability.
[Whereas in Parallel Tempering we proposed the current state
of another chain.]
The proposal is restricted to points with roughly similar values
of π(x) (hence “equi-energy”).
Equi-energy sampler
Introduce a sequence ξ0 = 0 < ξ1 < . . . < ξS = +∞ cutting
the density axis R+ into S intervals.
Introduce H(x, y) such that H(x, y) = 1 if π(x) and π(y) are
in the same interval [ξk, ξk+1).
Introduce the proposal distribution given X^n_t and
θ = {X^{n−1}_k}1≤k≤t:
g_θ(X^n_t, dy) ∝ Σ_{k=1}^t H(X^{n−1}_k, X^n_t) δ_{X^{n−1}_k}(dy).
Then the proposed point is accepted with probability
min( 1, π^{γn−γn−1}(y) / π^{γn−γn−1}(X^n_t) )
(similar to the swap acceptance probability in Parallel Tempering).
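A sketch of the equi-energy proposal step; for numerical convenience the rings are cut on log π values rather than on π itself, which defines the same kind of partition (names and cut points are illustrative):

```python
import numpy as np

def equi_energy_proposal(log_pi, x_current, history, log_rings, rng):
    """Propose uniformly among the lower chain's stored states that fall in
    the same energy ring as the current state.

    `log_rings` are the ring cut points, placed here on log pi values; two
    states are in the same ring exactly when H(x, y) = 1.
    """
    history = np.asarray(history)
    ring_of = lambda v: np.searchsorted(log_rings, v)
    same_ring = np.flatnonzero(ring_of(log_pi(history)) == ring_of(log_pi(x_current)))
    if same_ring.size == 0:
        return None  # no stored state shares the current ring
    return float(history[rng.choice(same_ring)])

lp = lambda x: -0.5 * np.asarray(x)**2
rng = np.random.default_rng(0)
# Current state 0.0 is in the high-density ring, as are 0.1 and -0.2 (but not 3.0).
y = equi_energy_proposal(lp, 0.0, [0.1, 3.0, -0.2], np.array([-2.0]), rng)
```

The returned point y would then go through the acceptance step with the tempered ratio shown above.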
Wang–Landau algorithm
The main idea is to force the chain to avoid regions that have
already been visited.
The concept of region is formalized by a partition of the state
space.
The self-avoiding effect is achieved by an adaptation of the
transition kernel.
Determining the density of states for classical statistical
models: A random walk algorithm to produce a flat
histogram, F. Wang and D. Landau, Physical Review E 2001.
Wang–Landau algorithm
Partition the state space:
X = ∪_{i=1}^d Xi.
Desired frequencies of visit:
ϕ = (ϕ1, . . . , ϕd) such that Σ_{i=1}^d ϕi = 1.
Wang–Landau algorithm
Figure : Partition of the state space into 11 bins.
Wang–Landau algorithm
Penalized distribution, for any θ = (θ1, . . . , θd):
πθ(x) ∝ π(x) / θ(J(x)),
where J(x) is such that x ∈ X_{J(x)}.
There is a θ⋆ such that:
∀i ∈ {1, . . . , d}, ∫_{Xi} πθ⋆(x) dx = ϕi,
i.e. πθ⋆ gives the desired mass ϕi to each bin Xi.
These ideal penalties θ⋆ are not available.
Wang–Landau algorithm
Figure : Biased target distribution: each bin now carries the same mass.
Wang–Landau algorithm
Figure : Markov chain exploring the biased target over 50,000 iterations.
Wang–Landau algorithm
Algorithm 2 Wang-Landau with deterministic schedule (ηt)
1: Init θ0 > 0, X0 ∈ X.
2: for t = 1 to T do
3: Sample Xt from Kθt−1 (Xt−1, ·), MH kernel targeting πθt−1 .
4: Update the penalties:
log θt(i) ← log θt−1(i) + ηt (1IXi (Xt) − ϕi)
5: end for
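A minimal sketch of Algorithm 2 on an assumed toy target (standard normal, spatial bins cut at ±1 and ±2, schedule ηt = 1/t); the penalties push the chain into tail bins it would otherwise rarely visit:

```python
import numpy as np

rng = np.random.default_rng(0)
log_pi = lambda x: -0.5 * x**2                  # assumed toy target, unnormalized
edges = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # cut points: d = 6 spatial bins
d = edges.size + 1
phi = np.full(d, 1.0 / d)                       # desired visit frequencies
bin_of = lambda x: int(np.searchsorted(edges, x))

log_theta = np.zeros(d)
visits = np.zeros(d)
x = 0.0
T = 50_000
for t in range(1, T + 1):
    # One MH step targeting pi_theta(x) propto pi(x) / theta(J(x)).
    x_star = x + rng.standard_normal()
    log_ratio = (log_pi(x_star) - log_theta[bin_of(x_star)]) \
              - (log_pi(x) - log_theta[bin_of(x)])
    if np.log(rng.uniform()) < log_ratio:
        x = x_star
    i = bin_of(x)
    visits[i] += 1
    # Penalty update with the deterministic schedule eta_t = 1/t.
    log_theta += (1.0 / t) * ((np.arange(d) == i) - phi)

freqs = visits / T  # tail bins end up visited far more often than under pi
```

Under plain MH on this target the two outer bins would each receive roughly 2% of the visits; here the adapted penalties pull the empirical frequencies toward ϕi = 1/6.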
Wang–Landau algorithm
If ηt → 0 “fast enough”, (θt)t≥0 converges.
If ϕi = 1/d for each bin i:
θt(i) → ∫_{Xi} π(x) dx =: ψi as t → ∞,
at least up to a multiplicative constant.
(Xt)t≥0 is asymptotically distributed according to πθ⋆ .
Convergence of the Wang-Landau algorithm,
G. Fort, B. Jourdain, E. Kuhn, T. Lelievre, G. Stoltz
2012, on arXiv.
Wang–Landau algorithm
Choice of (ηt) can have a huge impact on the results.
Define the counters:
νt(i) := Σ_{n=1}^t 1I_{Xi}(Xn).
Flat Histogram (FH) is reached when:
max_{i∈{1,...,d}} | νt(i)/t − ϕi | < c
for some c > 0.
Instead of decreasing (ηt) at each iteration, decrease only
when the Flat Histogram criterion is reached.
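The criterion itself is a one-liner; the constant c below is illustrative:

```python
import numpy as np

def flat_histogram(visits, phi, c=0.25):
    """Flat Histogram criterion: max_i | nu_t(i)/t - phi_i | < c."""
    t = visits.sum()
    return bool(t > 0 and np.max(np.abs(visits / t - np.asarray(phi))) < c)

balanced = flat_histogram(np.array([10.0, 10.0, 10.0]), [1/3, 1/3, 1/3])
lopsided = flat_histogram(np.array([30.0, 0.0, 0.0]), [1/3, 1/3, 1/3])
```

In the stochastic-schedule algorithm that follows, κt is incremented (so η is decreased one step) each time this check passes.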
Wang–Landau algorithm
Algorithm 3 Wang-Landau with stochastic schedule (ηκt )
1: Init θ0 = 1, X0 ∈ X.
2: Init κ0 ← 0.
3: for t = 1 to T do
4: Sample Xt from Kθt−1 (Xt−1, ·), MH kernel targeting πθt−1 .
5: If (FH) then κt ← κt−1 + 1, otherwise κt ← κt−1.
6: Update the penalties:
log θt(i) ← log θt−1(i) + ηκt (1IXi (Xt) − ϕi)
7: end for
Wang–Landau algorithm
To be sure that eventually, for any c > 0:
max_{i∈{1,...,d}} | νt(i)/t − ϕi | < c,
we have proved:
∀i ∈ {1, . . . , d}, νt(i)/t → ϕi in probability as t → ∞,
for any fixed η > 0,
which implies:
E[ inf{ t ≥ 0 : ∀i ∈ {1, . . . , d}, | νt(i)/t − ϕi | < c } ] < ∞.
The Wang-Landau algorithm reaches the Flat Histogram
criterion in finite time, PJ & R. Ryder, AAP 2013.
Wang–Landau algorithm
N chains (X^(1)_t, . . . , X^(N)_t) using the same kernel K_θt
targeting πθt at time t.
The interaction is made through the common penalties (θt).
The update was
log θt(i) ← log θt−1(i) + η (1I_{Xi}(Xt) − ϕi).
Wang–Landau algorithm
N chains (X^(1)_t, . . . , X^(N)_t) using the same kernel K_θt
targeting πθt at time t.
The interaction is made through the common penalties (θt).
The update is now
log θt(i) ← log θt−1(i) + η ( (1/N) Σ_{k=1}^N 1I_{Xi}(X^(k)_t) − ϕi ).
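The interacting update can be sketched directly from this formula (arguments illustrative):

```python
import numpy as np

def parallel_wl_update(log_theta, chain_bins, phi, eta):
    """Penalty update driven by N interacting chains: the single-chain
    indicator 1I_Xi(X_t) is replaced by the fraction of the N chains
    currently sitting in bin i."""
    d = log_theta.size
    occupancy = np.bincount(chain_bins, minlength=d) / len(chain_bins)
    return log_theta + eta * (occupancy - np.asarray(phi))

# Four chains, three bins: two chains in bin 0, one in bin 1, one in bin 2.
new_log_theta = parallel_wl_update(np.zeros(3), [0, 0, 1, 2], [1/3] * 3, 1.0)
```

Averaging over chains smooths the noisy single-chain indicator, so the shared penalties adapt more stably.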
Wang–Landau algorithm
Default choice of partition: along density values
(always 1-dimensional!).
Introduce a sequence ξ0 = 0 < ξ1 < . . . < ξd = +∞ cutting
the density axis R+ in d intervals.
Some sense of a good range can be grasped from pilot runs.
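A sketch of the default density-based bin assignment; the cut points are placed on log π here for numerical convenience (values illustrative):

```python
import numpy as np

def density_bin(log_pi_x, log_xi):
    """Bin index of a state under the density-based partition: the sorted
    cut points log_xi split the (log) density axis into d intervals."""
    return int(np.searchsorted(log_xi, log_pi_x))

log_xi = np.array([-5.0, -1.0])  # d = 3 bins on the (log) density axis
bins = [density_bin(v, log_xi) for v in (-10.0, -3.0, 0.0)]
```

Because the partition lives on the one-dimensional density axis, the same code applies whatever the dimension of the state x.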
Wang–Landau algorithm
Figure : Wang-Landau on the moustarget distribution. Colours represent
the partition.
Wang–Landau algorithm
Figure : First bin of the partition.
Wang–Landau algorithm
Figure : Last bin of the partition.
Wang–Landau algorithm
Figure : Evolution of log θt along the iterations.
Wang–Landau algorithm
In some situations we might have some intuition on the
direction along which the modes are spread.
For instance, if we knew that the modes of the moustarget
distribution were along the y-axis:
∀i ∈ {1, . . . , d} Xi = R × (yi, yi+1)
with −∞ = y1 < y2 < . . . < yd+1 = +∞.
Wang–Landau algorithm
Figure : Wang-Landau using the y-axis to partition the space.
Wang–Landau algorithm
Figure : Exploration of the bins.
Wang–Landau algorithm
Figure : Evolution of log θ along the iterations.
Bibliography
The Wang-Landau algorithm in general state spaces:
applications and convergence analysis, Y. Atchadé and J.
Liu, Statistica Sinica 2010.
Determining the density of states for classical statistical
models: A random walk algorithm to produce a flat
histogram, F. Wang and D. Landau, Physical Review E 2001.
An Adaptive Interacting Wang-Landau Algorithm for
Automatic Density Exploration, L. Bornn, PJ, P. Del Moral,
A. Doucet, JCGS 2013.
The Wang-Landau algorithm reaches the Flat Histogram
criterion in finite time, PJ & R. Ryder, AAP 2013.
Adaptive Equi-Energy Sampler : Convergence and
Illustration, Schreck, Fort, Moulines, 2013.
Efficiency of the Wang-Landau algorithm: a simple test
case, G. Fort, B. Jourdain, E. Kuhn, T. Lelièvre, G. Stoltz,
2014.
Pierre Jacob Density exploration 49/ 49

More Related Content

What's hot (19)

PDF
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Christian Robert
 
PDF
Bayesian inversion of deterministic dynamic causal models
khbrodersen
 
PDF
Chris Sherlock's slides
Christian Robert
 
PDF
Unbiased Bayes for Big Data
Christian Robert
 
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
Maximum likelihood estimation of regularisation parameters in inverse problem...
Valentin De Bortoli
 
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
Introduction to MCMC methods
Christian Robert
 
PDF
Complexity of exact solutions of many body systems: nonequilibrium steady sta...
Lake Como School of Advanced Studies
 
PDF
Nested sampling
Christian Robert
 
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
MCMC and likelihood-free methods
Christian Robert
 
PDF
Bayesian hybrid variable selection under generalized linear models
Caleb (Shiqiang) Jin
 
PDF
Approximate Bayesian Computation with Quasi-Likelihoods
Stefano Cabras
 
PDF
Poster for Bayesian Statistics in the Big Data Era conference
Christian Robert
 
PDF
Lesage
eric_gautier
 
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Christian Robert
 
Bayesian inversion of deterministic dynamic causal models
khbrodersen
 
Chris Sherlock's slides
Christian Robert
 
Unbiased Bayes for Big Data
Christian Robert
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Valentin De Bortoli
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
Introduction to MCMC methods
Christian Robert
 
Complexity of exact solutions of many body systems: nonequilibrium steady sta...
Lake Como School of Advanced Studies
 
Nested sampling
Christian Robert
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
MCMC and likelihood-free methods
Christian Robert
 
Bayesian hybrid variable selection under generalized linear models
Caleb (Shiqiang) Jin
 
Approximate Bayesian Computation with Quasi-Likelihoods
Stefano Cabras
 
Poster for Bayesian Statistics in the Big Data Era conference
Christian Robert
 
Lesage
eric_gautier
 

Similar to Density exploration methods (20)

PDF
How many components in a mixture?
Christian Robert
 
PDF
Metodo Monte Carlo -Wang Landau
angely alcendra
 
PDF
Testing for mixtures at BNP 13
Christian Robert
 
PDF
A bit about мcmc
Alexander Favorov
 
PDF
2018 MUMS Fall Course - Bayesian inference for model calibration in UQ - Ralp...
The Statistical and Applied Mathematical Sciences Institute
 
PDF
An investigation of inference of the generalized extreme value distribution b...
Alexander Decker
 
PPTX
Monte Carlo Berkeley.pptx
HaibinSu2
 
PDF
Discussion of Matti Vihola's talk
Christian Robert
 
PDF
CLIM: Transition Workshop - Accounting for Model Errors Due to Sub-Grid Scale...
The Statistical and Applied Mathematical Sciences Institute
 
PPTX
JOURNALnew
Mohomed Abraj.
 
PDF
Introduction to advanced Monte Carlo methods
Christian Robert
 
PDF
intro
ssuser9ed16a1
 
PDF
Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013
Christian Robert
 
PDF
Unbiased Markov chain Monte Carlo
JeremyHeng10
 
PDF
Gibbs flow transport for Bayesian inference
JeremyHeng10
 
PDF
Sampling and Markov Chain Monte Carlo Techniques
Tomasz Kusmierczyk
 
PPT
HUST-talk-1.pptof uncertainty quantification. Volume 6. Springer, 2017. SFK08...
bappadasgolkunda
 
PDF
Pre-computation for ABC in image analysis
Matt Moores
 
PDF
A new implementation of k-MLE for mixture modelling of Wishart distributions
Frank Nielsen
 
How many components in a mixture?
Christian Robert
 
Metodo Monte Carlo -Wang Landau
angely alcendra
 
Testing for mixtures at BNP 13
Christian Robert
 
A bit about мcmc
Alexander Favorov
 
2018 MUMS Fall Course - Bayesian inference for model calibration in UQ - Ralp...
The Statistical and Applied Mathematical Sciences Institute
 
An investigation of inference of the generalized extreme value distribution b...
Alexander Decker
 
Monte Carlo Berkeley.pptx
HaibinSu2
 
Discussion of Matti Vihola's talk
Christian Robert
 
CLIM: Transition Workshop - Accounting for Model Errors Due to Sub-Grid Scale...
The Statistical and Applied Mathematical Sciences Institute
 
JOURNALnew
Mohomed Abraj.
 
Introduction to advanced Monte Carlo methods
Christian Robert
 
Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013
Christian Robert
 
Unbiased Markov chain Monte Carlo
JeremyHeng10
 
Gibbs flow transport for Bayesian inference
JeremyHeng10
 
Sampling and Markov Chain Monte Carlo Techniques
Tomasz Kusmierczyk
 
HUST-talk-1.pptof uncertainty quantification. Volume 6. Springer, 2017. SFK08...
bappadasgolkunda
 
Pre-computation for ABC in image analysis
Matt Moores
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
Frank Nielsen
 
Ad

More from Pierre Jacob (14)

PDF
Talk at CIRM on Poisson equation and debiasing techniques
Pierre Jacob
 
PDF
ISBA 2022 Susie Bayarri lecture
Pierre Jacob
 
PDF
Monte Carlo methods for some not-quite-but-almost Bayesian problems
Pierre Jacob
 
PDF
Estimation of the score vector and observed information matrix in intractable...
Pierre Jacob
 
PDF
Estimation of the score vector and observed information matrix in intractable...
Pierre Jacob
 
PDF
Estimation of the score vector and observed information matrix in intractable...
Pierre Jacob
 
PDF
Estimation of the score vector and observed information matrix in intractable...
Pierre Jacob
 
PDF
Current limitations of sequential inference in general hidden Markov models
Pierre Jacob
 
PDF
On non-negative unbiased estimators
Pierre Jacob
 
PDF
Path storage in the particle filter
Pierre Jacob
 
PDF
SMC^2: an algorithm for sequential analysis of state-space models
Pierre Jacob
 
PDF
PAWL - GPU meeting @ Warwick
Pierre Jacob
 
PDF
Presentation of SMC^2 at BISP7
Pierre Jacob
 
PDF
Presentation MCB seminar 09032011
Pierre Jacob
 
Talk at CIRM on Poisson equation and debiasing techniques
Pierre Jacob
 
ISBA 2022 Susie Bayarri lecture
Pierre Jacob
 
Monte Carlo methods for some not-quite-but-almost Bayesian problems
Pierre Jacob
 
Estimation of the score vector and observed information matrix in intractable...
Pierre Jacob
 
Estimation of the score vector and observed information matrix in intractable...
Pierre Jacob
 
Estimation of the score vector and observed information matrix in intractable...
Pierre Jacob
 
Estimation of the score vector and observed information matrix in intractable...
Pierre Jacob
 
Current limitations of sequential inference in general hidden Markov models
Pierre Jacob
 
On non-negative unbiased estimators
Pierre Jacob
 
Path storage in the particle filter
Pierre Jacob
 
SMC^2: an algorithm for sequential analysis of state-space models
Pierre Jacob
 
PAWL - GPU meeting @ Warwick
Pierre Jacob
 
Presentation of SMC^2 at BISP7
Pierre Jacob
 
Presentation MCB seminar 09032011
Pierre Jacob
 
Ad

Recently uploaded (20)

PPTX
Introduction to Probability(basic) .pptx
purohitanuj034
 
PPTX
Rules and Regulations of Madhya Pradesh Library Part-I
SantoshKumarKori2
 
PPTX
LDP-2 UNIT 4 Presentation for practical.pptx
abhaypanchal2525
 
PPTX
Applied-Statistics-1.pptx hardiba zalaaa
hardizala899
 
PPTX
Introduction to pediatric nursing in 5th Sem..pptx
AneetaSharma15
 
PPTX
Gupta Art & Architecture Temple and Sculptures.pptx
Virag Sontakke
 
PPTX
Electrophysiology_of_Heart. Electrophysiology studies in Cardiovascular syste...
Rajshri Ghogare
 
PPTX
Constitutional Design Civics Class 9.pptx
bikesh692
 
PPTX
Command Palatte in Odoo 18.1 Spreadsheet - Odoo Slides
Celine George
 
PPTX
Unlock the Power of Cursor AI: MuleSoft Integrations
Veera Pallapu
 
PPTX
Cleaning Validation Ppt Pharmaceutical validation
Ms. Ashatai Patil
 
PPTX
FAMILY HEALTH NURSING CARE - UNIT 5 - CHN 1 - GNM 1ST YEAR.pptx
Priyanshu Anand
 
PPTX
YSPH VMOC Special Report - Measles Outbreak Southwest US 7-20-2025.pptx
Yale School of Public Health - The Virtual Medical Operations Center (VMOC)
 
PDF
TOP 10 AI TOOLS YOU MUST LEARN TO SURVIVE IN 2025 AND ABOVE
digilearnings.com
 
PPTX
HEALTH CARE DELIVERY SYSTEM - UNIT 2 - GNM 3RD YEAR.pptx
Priyanshu Anand
 
PDF
My Thoughts On Q&A- A Novel By Vikas Swarup
Niharika
 
DOCX
pgdei-UNIT -V Neurological Disorders & developmental disabilities
JELLA VISHNU DURGA PRASAD
 
PPT
DRUGS USED IN THERAPY OF SHOCK, Shock Therapy, Treatment or management of shock
Rajshri Ghogare
 
PDF
EXCRETION-STRUCTURE OF NEPHRON,URINE FORMATION
raviralanaresh2
 
PPTX
INTESTINALPARASITES OR WORM INFESTATIONS.pptx
PRADEEP ABOTHU
 
Introduction to Probability(basic) .pptx
purohitanuj034
 
Rules and Regulations of Madhya Pradesh Library Part-I
SantoshKumarKori2
 
LDP-2 UNIT 4 Presentation for practical.pptx
abhaypanchal2525
 
Applied-Statistics-1.pptx hardiba zalaaa
hardizala899
 
Introduction to pediatric nursing in 5th Sem..pptx
AneetaSharma15
 
Gupta Art & Architecture Temple and Sculptures.pptx
Virag Sontakke
 
Electrophysiology_of_Heart. Electrophysiology studies in Cardiovascular syste...
Rajshri Ghogare
 
Constitutional Design Civics Class 9.pptx
bikesh692
 
Command Palatte in Odoo 18.1 Spreadsheet - Odoo Slides
Celine George
 
Unlock the Power of Cursor AI: MuleSoft Integrations
Veera Pallapu
 
Cleaning Validation Ppt Pharmaceutical validation
Ms. Ashatai Patil
 
FAMILY HEALTH NURSING CARE - UNIT 5 - CHN 1 - GNM 1ST YEAR.pptx
Priyanshu Anand
 
YSPH VMOC Special Report - Measles Outbreak Southwest US 7-20-2025.pptx
Yale School of Public Health - The Virtual Medical Operations Center (VMOC)
 
TOP 10 AI TOOLS YOU MUST LEARN TO SURVIVE IN 2025 AND ABOVE
digilearnings.com
 
HEALTH CARE DELIVERY SYSTEM - UNIT 2 - GNM 3RD YEAR.pptx
Priyanshu Anand
 
My Thoughts On Q&A- A Novel By Vikas Swarup
Niharika
 
pgdei-UNIT -V Neurological Disorders & developmental disabilities
JELLA VISHNU DURGA PRASAD
 
DRUGS USED IN THERAPY OF SHOCK, Shock Therapy, Treatment or management of shock
Rajshri Ghogare
 
EXCRETION-STRUCTURE OF NEPHRON,URINE FORMATION
raviralanaresh2
 
INTESTINALPARASITES OR WORM INFESTATIONS.pptx
PRADEEP ABOTHU
 

Density exploration methods

  • 1. Density exploration methods Pierre Jacob Department of Statistics, University of Oxford pierre.jacob at stats.ox.ac.uk March 2014 Pierre Jacob Density exploration 1/ 49
  • 2. Outline 1 MCMC and multimodal target distributions 2 Parallel MCMC, tempering and equi-energy moves 3 Wang–Landau algorithm Pierre Jacob Density exploration 2/ 49
  • 3. Outline 1 MCMC and multimodal target distributions 2 Parallel MCMC, tempering and equi-energy moves 3 Wang–Landau algorithm Pierre Jacob Density exploration 3/ 49
  • 4. MCMC and multimodal target distributions Algorithm 1 Metropolis–Hastings targeting π 1: Init X0 ∈ X. 2: for t = 1 to T do 3: Sample X⋆ from some proposal distribution q(Xt−1, ·). 4: Compute the acceptance ratio: α(Xt−1, X⋆ ) = min ( 1, π(X⋆) π(Xt−1) q(X⋆, Xt−1) q(Xt−1, X⋆) ) . 5: With probability α(Xt−1, X⋆), set Xt = X⋆; otherwise Xt = Xt−1. 6: end for Pierre Jacob Density exploration 3/ 49
  • 5. MCMC and multimodal target distributions MCMC methods allow to approximate ∫ φ(x)π(x)dx, as long as π can be evaluated / estimated point-wise, and generate X0, X1, . . . , XT . The guarantees are largely asymptotic in T going to infinity. For multimodal target distributions the non-asymptotic regime might be very different. Pierre Jacob Density exploration 4/ 49
  • 6. MCMC and multimodal target distributions −4 −2 0 2 4 6 8 −4 −2 0 2 4 6 8 µ1 µ2 Figure : Posterior distribution of (µ1, µ2) in a Gaussian mixture model. See Stephens (1997), Bayesian methods for mixtures of normal distribu- tions, PhD thesis. Figure obtained using PAWL. Pierre Jacob Density exploration 5/ 49
  • 7. MCMC and multimodal target distributions r θ −2 −1 0 1 2 −2 −1 0 1 2 Figure : Posterior distribution of (r, θ) in a theta-Ricker Hidden Markov model. See Polansky et al. (2009), Likelihood ridges and multimodality in population growth rate models. Figure obtained using SMC2 . Pierre Jacob Density exploration 6/ 49
  • 8. MCMC and multimodal target distributions 0.00 0.05 0.10 0.15 0.20 0.25 −5 0 5 10 15 X density Figure : Toy example: a mixture of well-separated normal distributions. Pierre Jacob Density exploration 7/ 49
  • 9. MCMC and multimodal target distributions 0.0 0.1 0.2 0.3 0.4 0.5 −5 0 5 10 15 AMH density Figure : Markov chain still stuck in one mode after 50, 000 iterations. Pierre Jacob Density exploration 8/ 49
  • 10. MCMC and multimodal target distributions −750 −500 −250 0 0 200 400 600 800 X Y 0.25 0.50 0.75 density Figure : Feist your eyes on the moustarget distribution! Pierre Jacob Density exploration 9/ 49
  • 11. MCMC and multimodal target distributions Note that multimodal distributions are not difficult to sample from if the modes are not well separated. In fact we can [re]define a mode as a region from where Metropolis-Hastings cannot escape. Non-asymptotic Error Bounds for Sequential MCMC Methods in Multimodal Settings. N. Schweizer 2012 Pierre Jacob Density exploration 10/ 49
  • 12. Outline 1 MCMC and multimodal target distributions 2 Parallel MCMC, tempering and equi-energy moves 3 Wang–Landau algorithm Pierre Jacob Density exploration 11/ 49
  • 13. Parallel MCMC A first idea is to run N chains independently, from various starting points chosen to be “spread out”. The chains can thus find multiple modes, and other benefits such as parallelization and convergence diagnostics. What if there are > N modes? What if all the chains are initialized in the attraction zone of the same mode? Pierre Jacob Density exploration 11/ 49
  • 14. Parallel MCMC Figure : Parallel MCMC on the moustarget distribution Pierre Jacob Density exploration 12/ 49
  • 15. Parallel MCMC −400 −300 −200 −100 0 100 0 2500 5000 7500 10000 iterations Y indexchain 1 2 3 4 5 6 7 8 9 10 Figure : Parallel MCMC on the moustarget distribution Pierre Jacob Density exploration 13/ 49
  • 16. Parallel Tempering The idea of parallel tempering is to run N chains targeting different versions of π, of “increasing difficulty”. Introduce “inverse temperatures”: 0 < γ1 < . . . < γN = 1. Introduce “tempered” distributions πγn for n = 1, . . . , N. For γ ≈ 0, πγ is considered easier to sample because the variations of π are smaller. Pierre Jacob Density exploration 14/ 49
  • 17. Parallel Tempering One MCMC chain per inverse temperature, for instance using a Metropolis-Hastings kernel targeting πγn . Note that the local modes of πγ are the same for every γ. The chains interact through “swap moves”. Pierre Jacob Density exploration 15/ 49
  • 18. Parallel Tempering When a “swap move” is to be performed, do the following. Sample indices k1, k2 uniformly in {1, . . . , N}. With acceptance probability min ( 1, πγk1 (xk2 )πγk2 (xk1 ) πγk1 (xk1 )πγk2 (xk2 ) ) , exchange the value of xk1 and xk2 . This doesn’t change the joint target distribution πγ1 ⊗ πγ2 ⊗ . . . ⊗ πγN . In particular the N-th chain still targets πγN = π. Pierre Jacob Density exploration 16/ 49
  • 19. Parallel Tempering Figure : Parallel Tempering on the moustarget distribution, with γ equally spaced in [0.5, 1]. Pierre Jacob Density exploration 17/ 49
  • 20. Parallel Tempering −400 −300 −200 −100 0 100 0 2500 5000 7500 10000 iterations Y indexchain 1 2 3 4 5 6 7 8 9 10 Figure : Parallel Tempering on the moustarget distribution, with γ equally spaced in [0.5, 1]. Pierre Jacob Density exploration 18/ 49
  • 21. Parallel Tempering Figure : Parallel Tempering on the moustarget distribution, with γ equally spaced in [0.1, 1]. Pierre Jacob Density exploration 19/ 49
  • 22. Parallel Tempering −1000 −500 0 0 2500 5000 7500 10000 iterations Y indexchain 1 2 3 4 5 6 7 8 9 10 Figure : Parallel Tempering on the moustarget distribution, with γ equally spaced in [0.1, 1]. Pierre Jacob Density exploration 20/ 49
  • 23. Parallel Tempering The choice of (γn)N n=1 is essential. Taking γ1 very low increases the exploration for the chain targeting πγ1 . If the increments γn − γn−1 are too large, the swap moves tend to be rejected, which decreases the exploration for the “upper” chains. Pierre Jacob Density exploration 21/ 49
SMC sampler
Sequence of distributions, for instance πγn for n = 1, . . . , N such
that 0 < γ1 < . . . < γN = 1. Say N = 100.
M particles (say 10,000), sequentially importance sampling from µ to
πγ1 and then from πγn−1 to πγn.
When the effective sample size is low, resample and then apply an MCMC
move to each particle (say 5 steps per particle).
The ability to recover modes is sensitive to the choice of the initial
distribution µ.
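The tempered SMC recipe above can be sketched as follows, here on the real line. This is a minimal illustration under assumed interfaces (`logpi`, `logmu`, `sample_mu` are hypothetical names), with a Gaussian random-walk MH move and resampling when the ESS drops below M/2.

```python
import math
import random

def ess(logw):
    # effective sample size computed from unnormalised log-weights
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    return sum(w) ** 2 / sum(x * x for x in w)

def resample(xs, logw):
    # multinomial resampling proportionally to the normalised weights
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    return random.choices(xs, weights=w, k=len(xs))

def mh_move(x, logtarget, steps, sd):
    # a few Metropolis-Hastings steps with Gaussian random-walk proposals
    for _ in range(steps):
        y = x + random.gauss(0.0, sd)
        if math.log(random.random()) < logtarget(y) - logtarget(x):
            x = y
    return x

def smc_sampler(logpi, logmu, sample_mu, gammas, M=1000, mh_steps=5, sd=1.0):
    """Move M particles from mu through pi**gamma_1, ..., pi**gamma_N."""
    xs = [sample_mu() for _ in range(M)]
    # first reweighting: from mu to pi**gamma_1
    logw = [gammas[0] * logpi(x) - logmu(x) for x in xs]
    for g_prev, g in zip(gammas, gammas[1:]):
        if ess(logw) < M / 2:
            xs = resample(xs, logw)
            logw = [0.0] * M
            xs = [mh_move(x, lambda z: g_prev * logpi(z), mh_steps, sd)
                  for x in xs]
        # incremental weight: pi(x)**(g - g_prev)
        logw = [lw + (g - g_prev) * logpi(x) for lw, x in zip(logw, xs)]
    return xs, logw
```

The returned weighted particles approximate π = πγN; expectations under π are estimated by self-normalised weighted averages.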
SMC sampler
Figure: SMC sampler on the moustarget distribution, with
µ = N((400, −100)ᵀ, diag(32², 32²)).
SMC sampler
Figure: SMC sampler on the moustarget distribution, with
µ = N((400, −400)ᵀ, diag(100², 100²)).
Equi-energy sampler
Same initial setting as parallel tempering: N chains (X(n)t)n=1,...,N,
each targeting πγn using an MH kernel.
The first chain (X(1)t) simply targets πγ1 using MH.
For an upper chain (X(n)t), with probability ε perform an equi-energy
move; otherwise perform an MH step targeting πγn.
Adaptive Equi-Energy Sampler: Convergence and Illustration, Schreck,
Fort, Moulines, 2013.
Equi-energy sampler
An equi-energy move consists in proposing a point from the history of
the chain just below, (X(n−1)t), and accepting it with an MH-type
acceptance probability.
[Whereas in Parallel Tempering we proposed the current state of
another chain.]
The proposal is restricted to points with roughly similar values of
π(x) (hence “equi-energy”).
Equi-energy sampler
Introduce a sequence ξ0 = 0 < ξ1 < . . . < ξS = +∞ cutting the density
axis R+ into S intervals.
Introduce H(x, y) such that H(x, y) = 1 if π(x) and π(y) are in the
same interval [ξk, ξk+1), and 0 otherwise.
Introduce the proposal distribution given X(n)t and
θ = {X(n−1)k}1≤k≤t:
gθ(X(n)t, dy) ∝ Σk=1..t H(X(n−1)k, X(n)t) δX(n−1)k(dy).
The proposed point y is then accepted with probability
min ( 1, π(y)^(γn−γn−1) / π(X(n)t)^(γn−γn−1) )
(similar to the swap acceptance probability in Parallel Tempering).
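One equi-energy move can be sketched as below. For numerical convenience this sketch cuts the log-density axis rather than the density axis (the log is monotone, so the two partitions define the same rings); all names are illustrative.

```python
import math
import random

def equi_energy_move(x, logpi, history, rings, gamma, gamma_prev):
    """One equi-energy move for the chain at inverse temperature gamma.

    history    : past states of the chain just below (temperature gamma_prev)
    rings      : increasing thresholds cutting the log-density axis
    Returns the new state (the proposal if accepted, x otherwise).
    """
    def ring(z):
        # index of the energy ring containing log pi(z)
        lp = logpi(z)
        return sum(1 for xi in rings if lp >= xi)

    # propose uniformly among past states lying in the same ring as x
    candidates = [y for y in history if ring(y) == ring(x)]
    if not candidates:
        return x
    y = random.choice(candidates)
    # accept with probability min(1, [pi(y)/pi(x)]**(gamma - gamma_prev))
    log_alpha = (gamma - gamma_prev) * (logpi(y) - logpi(x))
    if math.log(random.random()) < min(0.0, log_alpha):
        return y
    return x
```

In a full sampler this move would be attempted with probability ε at each iteration, falling back on a standard MH step otherwise.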
Outline
1 MCMC and multimodal target distributions
2 Parallel MCMC, tempering and equi-energy moves
3 Wang–Landau algorithm
Wang–Landau algorithm
The main idea is to force the chain to avoid regions that have already
been visited.
The concept of region is formalized by a partition of the state space.
The self-avoiding effect is achieved by an adaptation of the
transition kernel.
Determining the density of states for classical statistical models: A
random walk algorithm to produce a flat histogram, F. Wang and D.
Landau, Physical Review E 2001.
Wang–Landau algorithm
Partition the state space: X = ∪i=1..d Xi.
Desired frequencies of visit: ϕ = (ϕ1, . . . , ϕd) such that
Σi=1..d ϕi = 1.
Wang–Landau algorithm
Figure: Partition of the state space into 11 bins.
Wang–Landau algorithm
Penalized distribution, for any θ = (θ1, . . . , θd):
πθ(x) ∝ π(x) / θ(J(x))
where J(x) is such that x ∈ XJ(x).
There is a θ⋆ such that:
∀i ∈ {1, . . . , d}, ∫Xi πθ⋆(x)dx = ϕi,
i.e. πθ⋆ gives the desired mass ϕi to each bin Xi.
These ideal penalties θ⋆ are not available.
Wang–Landau algorithm
Figure: Biased target distribution: each bin now carries the same mass.
Wang–Landau algorithm
Figure: Markov chain exploring the biased target over 50,000 iterations.
Wang–Landau algorithm
Algorithm 2 Wang–Landau with deterministic schedule (ηt)
1: Init θ0 > 0, X0 ∈ X.
2: for t = 1 to T do
3: Sample Xt from Kθt−1(Xt−1, ·), an MH kernel targeting πθt−1.
4: Update the penalties:
log θt(i) ← log θt−1(i) + ηt (1Xi(Xt) − ϕi)
5: end for
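A minimal sketch of Algorithm 2 on the real line follows, assuming a spatial partition given by cut points and the common deterministic schedule ηt = η0/t (the slides only require ηt → 0 fast enough); the function name and defaults are illustrative.

```python
import bisect
import math
import random

def wang_landau(logpi, cuts, phi, T=10000, eta0=1.0, sd=1.0):
    """Wang-Landau with deterministic schedule eta_t = eta0 / t.

    cuts : increasing cut points b_1 < ... < b_{d-1} partitioning the
           real line into d bins X_1, ..., X_d
    phi  : desired visit frequencies, summing to 1
    Returns the final log-penalties (log theta_T(1), ..., log theta_T(d)).
    """
    d = len(cuts) + 1
    J = lambda x: bisect.bisect(cuts, x)   # index of the bin containing x
    log_theta = [0.0] * d
    x = 0.0
    for t in range(1, T + 1):
        # MH step targeting pi_theta(x) = pi(x) / theta(J(x))
        y = x + random.gauss(0.0, sd)
        log_alpha = (logpi(y) - log_theta[J(y)]) - (logpi(x) - log_theta[J(x)])
        if math.log(random.random()) < log_alpha:
            x = y
        # penalty update: log theta_t(i) += eta_t * (1_{X_i}(X_t) - phi_i)
        eta = eta0 / t
        for i in range(d):
            log_theta[i] += eta * ((1.0 if i == J(x) else 0.0) - phi[i])
    return log_theta
```

A useful invariant: since the indicators sum to 1 and the ϕi sum to 1, each update leaves the sum of the log-penalties unchanged.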
Wang–Landau algorithm
If ηt → 0 “fast enough”, (θt)t≥0 converges.
If for each bin i, ϕi = 1/d, then, as t → ∞:
θt(i) → ∫Xi π(x)dx =: ψi,
at least up to a multiplicative constant.
(Xt)t≥0 is asymptotically distributed according to πθ⋆.
Convergence of the Wang-Landau algorithm, G. Fort, B. Jourdain, E.
Kuhn, T. Lelievre, G. Stoltz, 2012, on arXiv.
Wang–Landau algorithm
The choice of (ηt) can have a huge impact on the results.
Define the counters:
νt(i) := Σn=1..t 1Xi(Xn).
Flat Histogram (FH) is reached when:
max i∈{1,...,d} |νt(i)/t − ϕi| < c
for some c > 0.
Instead of decreasing (ηt) at each iteration, decrease it only when
the Flat Histogram criterion is reached.
Wang–Landau algorithm
Algorithm 3 Wang–Landau with stochastic schedule (ηκt)
1: Init θ0 = 1, X0 ∈ X.
2: Init κ0 ← 0.
3: for t = 1 to T do
4: Sample Xt from Kθt−1(Xt−1, ·), an MH kernel targeting πθt−1.
5: If (FH) then κt ← κt−1 + 1, otherwise κt ← κt−1.
6: Update the penalties:
log θt(i) ← log θt−1(i) + ηκt (1Xi(Xt) − ϕi)
7: end for
Wang–Landau algorithm
To be sure that eventually, for any c > 0:
max i∈{1,...,d} |νt(i)/t − ϕi| < c,
we have proved:
∀i ∈ {1, . . . , d}, νt(i)/t → ϕi in probability as t → ∞,
for any fixed η > 0, which implies:
E[ inf { t ≥ 0 : ∀i ∈ {1, . . . , d}, |νt(i)/t − ϕi| < c } ] < ∞.
The Wang-Landau algorithm reaches the Flat Histogram criterion in
finite time, PJ & R. Ryder, AAP 2013.
Wang–Landau algorithm
N chains (X(1)t, . . . , X(N)t) using the same kernel Kθt targeting
πθt at time t.
The interaction is made through the common penalties (θt).
The update was:
log θt(i) ← log θt−1(i) + η (1Xi(Xt) − ϕi).
Wang–Landau algorithm
N chains (X(1)t, . . . , X(N)t) using the same kernel Kθt targeting
πθt at time t.
The interaction is made through the common penalties (θt).
The update is now:
log θt(i) ← log θt−1(i) + η ( (1/N) Σk=1..N 1Xi(X(k)t) − ϕi ).
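The interacting-chains update replaces the single indicator by the empirical bin frequencies of the N chains at time t; a sketch, with illustrative names:

```python
def update_penalties(log_theta, states, J, phi, eta):
    """Shared penalty update for N interacting Wang-Landau chains.

    log_theta : current log-penalties (log theta(1), ..., log theta(d))
    states    : current states X(1)_t, ..., X(N)_t of the N chains
    J         : J(x) = index of the bin containing x
    phi       : desired visit frequencies
    eta       : current learning rate
    """
    d = len(log_theta)
    # empirical frequency of each bin among the N current states
    freq = [0.0] * d
    for x in states:
        freq[J(x)] += 1.0 / len(states)
    return [lt + eta * (f - p) for lt, f, p in zip(log_theta, freq, phi)]
```

With N = 1 this reduces to the single-chain update, and averaging over chains reduces the variance of each penalty increment.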
Wang–Landau algorithm
Default choice of partition: along density values (always
one-dimensional!).
Introduce a sequence ξ0 = 0 < ξ1 < . . . < ξd = +∞ cutting the density
axis R+ into d intervals.
Some sense of a good range can be grasped from pilot runs.
Wang–Landau algorithm
Figure: Wang–Landau on the moustarget distribution. Colours represent
the partition.
Wang–Landau algorithm
Figure: First bin of the partition.
Wang–Landau algorithm
Figure: Last bin of the partition.
Wang–Landau algorithm
Figure: Evolution of log θt along the iterations.
Wang–Landau algorithm
In some situations we might have some intuition about the direction
along which the modes are spread.
For instance, if we knew that the modes of the moustarget distribution
were spread along the y-axis:
∀i ∈ {1, . . . , d}, Xi = R × (yi, yi+1),
with −∞ = y1 < y2 < . . . < yd = +∞.
Wang–Landau algorithm
Figure: Wang–Landau using the y-axis to partition the space.
Wang–Landau algorithm
Figure: Exploration of the bins.
Wang–Landau algorithm
Figure: Evolution of log θ along the iterations.
Bibliography
The Wang-Landau algorithm in general state spaces: applications and
convergence analysis, Y. Atchadé and J. Liu, Statistica Sinica 2010.
Determining the density of states for classical statistical models: A
random walk algorithm to produce a flat histogram, F. Wang and D.
Landau, Physical Review E 2001.
An Adaptive Interacting Wang-Landau Algorithm for Automatic Density
Exploration, L. Bornn, PJ, P. Del Moral, A. Doucet, JCGS 2013.
The Wang-Landau algorithm reaches the Flat Histogram criterion in
finite time, PJ & R. Ryder, AAP 2013.
Adaptive Equi-Energy Sampler: Convergence and Illustration, Schreck,
Fort, Moulines, 2013.
Efficiency of the Wang-Landau algorithm: a simple test case, G. Fort,
B. Jourdain, E. Kuhn, T. Lelievre, G. Stoltz, 2014.