Stochalgslides
Sándor Baran
1 / 417
Literature
I Christian P. Robert, George Casella: Monte Carlo Statistical
Methods. Second Edition. Springer, New York, 2004.
3 / 417
Uniform simulation
Simulation of “uniform random numbers”: producing a deterministic sequence in [0, 1].
[Figure: histograms of the generated values yn, scatterplot of successive pairs (xn, xn+1) and sample autocorrelation function (ACF) of the sequence.]
8 / 417
Congruential generators
Definition. A congruential generator on {0, 1, . . . , M} is defined
by the function
D(x) = (ax + b) mod (M + 1).
Plot of (0.003k, D(0.003k)) with D(x) = 69069x mod 1 (fraction part) for
k = 1, 2, . . . , 333. 9 / 417
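A minimal R sketch of such a generator (illustrative only; the constants a, b and M below are assumptions, not the generator behind the plot):

cong_gen <- function(n, seed, a = 69069, b = 0, M = 2^32 - 1) {
  # D(x) = (a*x + b) mod (M + 1), rescaled to [0, 1)
  x <- numeric(n)
  x[1] <- seed
  for (i in 2:n) x[i] <- (a * x[i - 1] + b) %% (M + 1)
  x / (M + 1)
}
u <- cong_gen(1000, seed = 12345)   # pseudo-uniform sequence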
RANDU: a problematic generator
RANDU: random number generator used by IBM from the 1960s
[Figure: histogram of the values xn+2 − 6xn+1 + 9xn and histogram of the triplet counts, illustrating the lattice structure of RANDU.]
10 / 417
Shift-register generators
Definition. For a given k × k matrix T with binary entries , the
associated shift-register generator is given by the transformation
xn+1 = T xn ,   where xn = Σ_{j=0}^{k−1} enj 2^j ,  enj ∈ {0, 1}.
Congruential generator:
Shift-register generators:
Combination:
Xn = In + Jn + Kn mod 232 .
12 / 417
Performance of Kiss
Plots of pairs (xn , xn+1 ), (xn , xn+2 ), (xn , xn+5 ) and (xn , xn+10 ) for a sample of
5000 generations from Kiss.
13 / 417
Inverse transform
Definition. For a non-decreasing function F on R the generalized inverse of F, denoted by F−, is defined as
F−(u) := inf{x : F(x) ≥ u}.
Lemma. If U ∼ U(0, 1), then the random variable F−(U) has the distribution F.
Proof. For all u ∈ [0, 1] and x ∈ F−([0, 1]),
F(F−(u)) ≥ u   and   F−(F(x)) ≤ x.
In this way
{(u, x) : F−(u) ≤ x} = {(u, x) : F(x) ≥ u},
so
P(F−(U) ≤ x) = P(U ≤ F(x)) = F(x).   2
14 / 417
Exponential variable generation
If X ∼ Exp(λ), then F(x) = 1 − e^{−λx}, x > 0, and
F−(u) = F^{−1}(u) = −(1/λ) log(1 − u),  u ∈ (0, 1).
If U ∼ U(0, 1) then 1 − U ∼ U(0, 1), so −(1/λ) log(U) ∼ Exp(λ).
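A short R sketch of this inverse-transform generator next to the built-in rexp (both appear in the figure below):

lambda <- 1
u <- runif(10000)
x_inv  <- -log(u) / lambda            # inverse transform, using U ~ 1 - U
x_rexp <- rexp(10000, rate = lambda)  # reference sample from R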
[Figure: two histograms, panels “Exp(1) from Uniform” and “Exp(1) from R”, with the Exp(1) density overlaid.]
Histograms of 10000 exponential random variables generated using the inverse
transform and R function rexp.
15 / 417
Other distributions with invertible CDF-s
Logistic distribution Log(µ, β):
PDF: f(x) = e^{−(x−µ)/β} / (β (1 + e^{−(x−µ)/β})²);   CDF: F(x) = 1 / (1 + e^{−(x−µ)/β});
F−(u) = F^{−1}(u) = µ + β log(u/(1 − u)).
Cauchy distribution C(µ, σ):
PDF: f(x) = 1 / (πσ (1 + ((x−µ)/σ)²));   CDF: F(x) = 1/2 + (1/π) arctan((x−µ)/σ);
F−(u) = F^{−1}(u) = µ + σ tan(π(u − 1/2)).
16 / 417
Distributions connected to exponential
Uj : i.i.d U(0, 1); Xj := − log(Uj ) ∼ Exp(1).
17 / 417
Distributions connected to exponential
Uj : i.i.d U(0, 1); Xj := − log(Uj ) ∼ Exp(1).
Y = [ (1/(2n)) Σ_{j=1}^{n} Xj ] / [ (1/(2m)) Σ_{j=n+1}^{n+m} Xj ] = [ (1/(2n)) Σ_{j=1}^{n} log(Uj) ] / [ (1/(2m)) Σ_{j=n+1}^{n+m} log(Uj) ] ∼ F_{2n,2m} ,   n, m ∈ N.
18 / 417
Normal variable generation. Box-Müller algorithm
X1 , X2 : independent N (0, 1). (X1 , X2 ) is rotation invariant.
(r , θ): polar coordinates of (X1 , X2 ).
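The polar representation gives the standard Box–Muller transform: r² = −2 log U1 ∼ Exp(1/2) and θ = 2πU2 ∼ U(0, 2π) yield two independent N(0, 1) variables. A minimal R sketch:

n  <- 5000
u1 <- runif(n); u2 <- runif(n)
r     <- sqrt(-2 * log(u1))   # radius
theta <- 2 * pi * u2          # angle
x1 <- r * cos(theta)          # X1 ~ N(0, 1)
x2 <- r * sin(theta)          # X2 ~ N(0, 1), independent of X1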
19 / 417
Discrete random variables. General method
X ∼ Pθ : Pθ is a discrete distribution on non-negative integers.
Calculate and store:
p0 := P(X ≤ 0), p1 := P(X ≤ 1), p2 := P(X ≤ 2), . . .
until the values reach 1 with a given number of decimals.
Generate U ∼ U(0, 1) and take
X =k if pk−1 < U < pk , where p−1 := 0.
Examples.
Binomial Bin(8, .4)
p0 = 0.016796, p1 = 0.106376, p2 = 0.315395, . . . , p8 = 1.
Poisson P(2)
p0 = 0.135335, p1 = 0.406006, p2 = 0.676676, . . . , p10 = 0.999992.
Storage problems. Often it suffices to calculate probabilities in the
interval mean ± 3 × std.deviation.
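A small R sketch of the general method for the Bin(8, 0.4) example: the cumulative probabilities pk are precomputed and the first index k with U ≤ pk is returned.

p <- cumsum(dbinom(0:8, size = 8, prob = 0.4))    # p0, p1, ..., p8
gen_discrete <- function(n, p) {
  u <- runif(n)
  vapply(u, function(ui) sum(ui > p), integer(1)) # k such that p_{k-1} < u <= p_k
}
x <- gen_discrete(10000, p)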
20 / 417
Beta generation
U1 , U2 , . . . , Un : i.i.d. sample from U(0, 1).
U(1) , U(2) , . . . , U(n) : ordered sample. U(i) ∼ Be(i, n − i + 1).
Only applies for integer parameters in the Beta distribution.
U1, U2: independent U(0, 1). The conditional distribution of
U1^{1/α} / (U1^{1/α} + U2^{1/β})   given   U1^{1/α} + U2^{1/β} ≤ 1
is Be(α, β).  [Jöhnk, 1964]
Algorithm:
1. Generate u1, u2 independently from U(0, 1);
2. Accept x = u1^{1/α} / (u1^{1/α} + u2^{1/β}) as a draw from Be(α, β) if u1^{1/α} + u2^{1/β} ≤ 1;
3. Return to 1. otherwise.
Rapid decrease of probability of acceptance with the increase of α
and β.
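A minimal R sketch of Jöhnk's algorithm as stated above (the chosen α and β are arbitrary examples):

johnk_beta <- function(alpha, beta) {
  repeat {
    v1 <- runif(1)^(1 / alpha)
    v2 <- runif(1)^(1 / beta)
    if (v1 + v2 <= 1) return(v1 / (v1 + v2))  # accepted Be(alpha, beta) draw
  }
}
x <- replicate(1000, johnk_beta(0.5, 1.5))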
21 / 417
Accept-reject method
Problems with the previously showed methods:
I The inverse transform method applies only for a few
distributions.
I Many distributions do not have good representations.
Key idea:
Use a simpler density g called instrumental density to simulate
from the desired density f (target density).
Constraints on g :
1. f and g have compatible supports (i.e. g (x) > 0 when
f (x) > 0).
2. There exists a constant M such that f (x)/g (x) ≤ M for all x.
Generate X from g and accept it as a variable Y from f with probability f(X)/(M g(X)).
It suffices to know f up to a normalizing constant.
22 / 417
Accept-reject algorithm
Generate from Y ∼ f with the help of X ∼ g , where f (x) ≤ Mg (x).
Algorithm
1. Generate x from the distribution g and u from
U(0, 1);
2. Accept y = x as a draw from Y if u ≤ f (x)/Mg (x);
3. Return to 1. otherwise.
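A generic R sketch of the algorithm; f may be known only up to a constant, rg() draws from the instrumental density g, and M bounds f/g:

accept_reject <- function(n, f, g, rg, M) {
  out <- numeric(n); i <- 0
  while (i < n) {
    x <- rg(1)                            # candidate from g
    if (runif(1) <= f(x) / (M * g(x))) {  # accept with probability f/(M g)
      i <- i + 1; out[i] <- x
    }
  }
  out
}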
Lemma. The above algorithm produces a value of a random
variable Y distributed according to f .
Proof. The CDF of Y equals
P(Y < y) = P(X < y | U ≤ f(X)/(M g(X))) = P(X < y, U ≤ f(X)/(M g(X))) / P(U ≤ f(X)/(M g(X)))
= [ ∫_{−∞}^{y} ∫_{0}^{f(x)/(M g(x))} du g(x) dx ] / [ ∫_{−∞}^{∞} ∫_{0}^{f(x)/(M g(x))} du g(x) dx ]
= [ (1/M) ∫_{−∞}^{y} f(x) dx ] / [ (1/M) ∫_{−∞}^{∞} f(x) dx ] = ∫_{−∞}^{y} f(x) dx,
that is Y ∼ f.   2
Remark. The probability of acceptance is 1/M, the expected
number of trials until acceptance is M. 23 / 417
Example. Normal from Cauchy
Density functions:
N(0, 1): f(x) = (1/√(2π)) e^{−x²/2};   C(0, 1): g(x) = 1/(π(1 + x²)).
f(x)/g(x) ≤ √(2π/e) = 1.5203, the maximum is reached at x = ±1.
Mean number of trials: M := √(2π/e) = 1.5203.
Probability of acceptance: 1/M = 0.6577.
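Using the generic accept_reject() sketch above with this bound:

M <- sqrt(2 * pi / exp(1))                # ~ 1.5203
x <- accept_reject(10000, f = dnorm, g = dcauchy, rg = rcauchy, M = M)
# on average about M Cauchy proposals are needed per accepted N(0, 1) value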
Tuning
Instrumental density: C(0, σ) with PDF g_σ(x) = 1/(πσ(1 + x²/σ²)).
f(x)/g_σ(x) ≤ M_σ := √(2π/e) e^{(σ²−1)/2}/σ = M e^{(σ²−1)/2}/σ   and   M_σ ≥ M_1 = M.
C(0, 1) is the best choice among Cauchy distributions for simula-
ting N (0, 1).
24 / 417
Example. Normal from double exponential (Laplace)
Density functions:
N(0, 1): f(x) = (1/√(2π)) e^{−x²/2};   L(λ): g_λ(x) = (λ/2) e^{−λ|x|}.
CDF of L(λ): G_λ(x) = 1/2 + (1/2) sign(x)(1 − e^{−λ|x|}).
f(x)/g_λ(x) ≤ M_λ := √(2/π) λ^{−1} e^{λ²/2}.
The minimum of M_λ is attained at λ = 1, and M_1 = √(2e/π).
L(1) is the best choice among double exponential distributions for
simulating N (0, 1).
Mean number of trials: M1 = 1.3155.
Probability of acceptance: 1/M = 0.7602.
25 / 417
Example. Gamma Accept-Reject
If α ∈ N and U1 , U2 , . . . , Uα are i.i.d. U(0, 1)
Y = −β Σ_{j=1}^{α} log(Uj) ∼ Ga(α, β).
Corresponding mean number of trials:  M_{a,α} = Γ(a) (α/e)^α / (Γ(α) (a/e)^a).
26 / 417
Example. Truncated normal distribution
f(x) = 1/Φ((µ − µ⁻)/σ) · (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}   if x ≥ µ⁻,
where µ⁻ denotes the truncation point.
27 / 417
Example. Truncated normal from translated exponential
Assume µ = 0, σ = 1.
f(x) = 1/(1 − Φ(µ⁻)) · (1/√(2π)) e^{−x²/2};   g_λ(x) = λ e^{−λ(x−µ⁻)},  x ≥ µ⁻.
For x ≥ µ⁻:
f(x)/g_λ(x) = 1/(1 − Φ(µ⁻)) · (1/(λ√(2π))) e^{−x²/2 + λ(x−µ⁻)} ≤ M_λ := 1/(1 − Φ(µ⁻)) · (1/√(2π)) M̃_λ ,
where
M̃_λ := exp(λ²/2 − λµ⁻)/λ  if λ > µ⁻;   M̃_λ := exp(−(µ⁻)²/2)/λ  otherwise.
First part has minimum at λ* = (µ⁻ + √((µ⁻)² + 4))/2, second part at λ̃ = µ⁻.
λ* is the optimal choice. Mean number of trials:
M_{λ*} = 1/(1 − Φ(µ⁻)) · (1/√(2π)) M̃_{λ*} < 1/(1 − Φ(µ⁻)),   e.g. if µ⁻ > 0.
28 / 417
Envelope accept-reject method
In some cases PDF f is too complex to be evaluated at each step.
Suppose there exists a function g` , a PDF gm and a constant M
such that
g` (x) ≤ f (x) ≤ Mgm (x).
Algorithm
1. Generate x from gm and u from U(0, 1);
2. Accept y = x as a draw from Y if u ≤ g` (x)/Mgm (x);
3. Otherwise accept y = x if u ≤ f (x)/Mgm (x).
Lemma. The above algorithm produces a value of a random vari-
able Y distributed according to f .
Proof. Arguing as in the accept–reject lemma, the accepted value satisfies
P(Y < y) = [ (1/M) ∫_{−∞}^{y} f(x) dx ] / [ (1/M) ∫_{−∞}^{∞} f(x) dx ] = ∫_{−∞}^{y} f(x) dx,
that is Y ∼ f . 2
30 / 417
Example. Lower bound for normal generation
exp(−x²/2) ≥ 1 − x²/2.
ϕ(x) = (1/√(2π)) e^{−x²/2} ≥ (1/√(2π)) (1 − x²/2) =: g_ℓ(x).
Useless if for X ∼ g_m we have |X| ≥ √2.
For X ∼ C(0, 1): P(|X| ≥ √2) = 0.3918.
For X ∼ L(1): P(|X| ≥ √2) = 0.2431.
31 / 417
Poisson generation
Simple method
Poisson process: N ∼ P(λ), Xn ∼ Exp(λ), n ∈ N.
e−(x−µ)/β 1
PDF : f (x) = 2 ; CDF : F (x) = ;
β 1 + e−(x−µ)/β 1+ e−(x−µ)/β
u
F −1 (u) = µ + β log .
1−u
Easy to simulate using uniform distribution on [0, 1].
32 / 417
Poisson from logistic
Instrumental variable: L = bX + 0.5c, where X ∼ Log (µ, β).
Restrict the values of L to non-negative integers, that is the range
of X to [−0.5, ∞).
If X > −0.5
P(N = n) = P(n − 0.5 ≤ X < n + 0.5)
1 1 −(0.5+µ)/β
= − 1 + e
1 + e−(n+0.5−µ)/β 1 + e−(n−0.5−µ)/β
−(n−0.5−µ)/β − e−(n+0.5−µ)/β
e
1 + e−(0.5+µ)/β
= −(n+0.5−µ)/β
−(n−0.5−µ)/β
1+e 1+e
Reparametrization [Atkinson, 1979]: α = µ/β, γ = 1/β.
eα−γ(n−0.5) − eα−γ(n+0.5)
P(N = n) = 1 + e−(α+0.5γ)
1 + eα−γ(n+0.5) 1 + eα−γ(n−0.5)
γeα−γn −(α+0.5γ)
≈ 2 1 + e .
1 + eα−γn
33 / 417
Atkinson’s algorithm
Ratio of probabilities: λn / P(N = n) eλ n! .
Algorithm √
0. Define γ = π/ 3λ, α = λβ and k = log c − λ − log b;
1. Generate u1 ∼ U(0, 1) and calculate
x = α − log (1 − u1 )/u1 /γ
until x > −0.5;
2. Define n = bx + 0.5c and generate u2 ∼ U(0, 1);
3. Accept n ∼ P(λ) if
2
α − γx + log u2 / 1 + exp(α − γx) ≤ k + n log λ − log n!.
34 / 417
Log-concave densities
Definition. A density f called log-concave if log f is concave.
Example. Exponential family (special form):
Log-concave if
2
g 00 (x)g (x) − g 0 (x)
∂2 ∂2
log f (x) = log g (x) = < 0.
∂x 2 ∂x 2 g 2 (x)
∂2 2e−(x−µ)/β
log g (x) = − 2 < 0.
∂x 2 β 2 1 + e−(x−µ)/β
35 / 417
Envelope accept-reject method
f : log-concave density. Sn := {xi ∈ supp f | i = 0, 1, . . . , n + 1}.
h(xi ) = log f (xi ) is known (at least up to constant).
Line Li,i+1 through xi , h(xi ) and xi+1 , h(xi+1 ) is below h(x) if
x ∈ [xi , xi+1 ] and above it otherwise.
For x ∈ [xi , xi+1 ], i = 0, 1, . . . , n:
hn (x) := min Li−1,i (x), Li+1,i+2 (x) and hn (x) := Li,i+1 (x).
Outside [x0 , xn+1 ] define
hn (x) := min L0,1 (x), Ln,n+1 (x) and hn (x) := −∞.
Envelopes:
hn (x) ≤ h(x) = log f (x) ≤ hn (x), x ∈ supp f .
For f n (x) = exp hn (x) and f n (x) = exp hn (x) we have
f n (x) ≤ f (x) ≤ f n (x) = ω n gn (x), x ∈ supp f .
ωn : norming constant; gn (x): PDF.
36 / 417
Adaptive Rejection Sampling (ARS)
Generate an observation from distribution f .
ARS algorithm
0. Initialize n and Sn ;
1. Generate x ∼ gn , u ∼ U(0, 1);
2. Accept x if u ≤ f n (x)/ω n gn (x);
3. Otherwise, accept x if u ≤ f (x)/ω n gn (x) and update
Sn to Sn+1 = Sn ∪ {x}.
Prior for (a, b): N (0, σ 2 ) × N (0, τ 2 ). Posterior for (a, b):
X n n n
2 2 2 2
X X
π(a, b) ∝ exp a yj + b xj yj − ea ebxj e−a /(2σ )−b /(2τ ) .
j=1 j=1 j=1
n
∂2 X
log π(a, b) = −ea ebxj − σ −2 < 0,
∂a2
j=1
n
∂2 X
log π(a, b) = −ea xj2 ebxj − τ −2 < 0.
∂b 2
j=1
SLLN: hn → Ef h(X ) as n → ∞ with probability 1.
42 / 417
Variance
Integral and approximation:
E_f h(X) = ∫_X h(x) f(x) dx,    h̄_n := (1/n) Σ_{j=1}^{n} h(Xj).
If E_f h²(X) < ∞,
Var h̄_n = (1/n) ∫_X (h(x) − E_f h(X))² f(x) dx.
Sample variance:
v_n := (1/n²) Σ_{j=1}^{n} (h(Xj) − h̄_n)².
Black: mean; blue: mean ± two standard errors; red: exact value.
45 / 417
Example. Normal quantiles
Evaluate
Φ(t) := ∫_{−∞}^{t} (1/√(2π)) e^{−y²/2} dy.
Generate X1, X2, . . . , Xn ∼ N(0, 1), calculate
Φ̂(t) := (1/n) Σ_{j=1}^{n} I{Xj ≤ t}.
Var Φ̂(t) = Φ(t)(1 − Φ(t))/n;   for t ≈ 0:  Var Φ̂(t) ≈ 1/(4n).
quantiles 0.5 0.75 0.8 0.9 0.95 0.975 0.99 0.999 0.9999
n/t 0.0000 0.6745 0.8416 1.2816 1.6449 1.9600 2.3263 3.0902 3.7190
102 0.5200 0.7800 0.8200 0.9000 0.9300 0.9700 0.9900 1.0000 1.0000
103 0.4890 0.7440 0.7950 0.9040 0.9460 0.9710 0.9880 1.0000 1.0000
104 0.4947 0.7524 0.8031 0.9018 0.9506 0.9764 0.9906 0.9990 1.0000
105 0.4980 0.7472 0.7972 0.8987 0.9492 0.9751 0.9900 0.9990 0.9999
106 0.4991 0.7493 0.7993 0.8994 0.9499 0.9752 0.9900 0.9990 0.9999
107 0.5003 0.7501 0.8001 0.9000 0.9499 0.9751 0.9900 0.9990 0.9999
108 0.5001 0.7500 0.8001 0.9000 0.9500 0.9750 0.9900 0.9990 0.9999
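A sketch of how one row of the table can be reproduced (the values obtained depend on the random seed):

t_vals  <- c(0, 0.6745, 0.8416, 1.2816, 1.6449, 1.96, 2.3263, 3.0902, 3.719)
x       <- rnorm(1e4)                                    # the n = 10^4 row
phi_hat <- vapply(t_vals, function(t) mean(x <= t), numeric(1))
round(phi_hat, 4)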
46 / 417
Example. Tail probability of normal
Evaluate P(X > 5), where X ∼ N (0, 1).
Exact value: 2.866516 × 10−7 .
Calculate
1 − Φ(5) = ∫_{5}^{∞} (1/√(2π)) e^{−y²/2} dy   and   1/2 − Φ1(5) = 1/2 − ∫_{0}^{5} (1/√(2π)) e^{−y²/2} dy.
n = 10^5:   1 − Φ̂(5) = 0,   Φ̂(5) = 1,   Φ̂1(5) = 0.50089,   1/2 − Φ̂1(5) = −0.00089.
47 / 417
Example. Tail probability of normal
48 / 417
Importance sampling. Cauchy tail probability
Evaluate p = P(X > 2), where X ∼ C(0, 1), that is
p = ∫_{2}^{∞} 1/(π(1 + x²)) dx.
Exact value: p = 0.1475836.
Generate X1, X2, . . . , Xn ∼ C(0, 1), calculate
p̂1 := (1/n) Σ_{j=1}^{n} I{Xj > 2}.
E p̂1 = p,   Var p̂1 = p(1 − p)/n = 0.1258/n.
By symmetry one can consider
p̂2 := (1/(2n)) Σ_{j=1}^{n} I{|Xj| > 2}.
E p̂2 = p,   Var p̂2 = p(1 − 2p)/(2n) = 0.0520/n.
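An R sketch comparing the two estimators with the exact value (pcauchy is used only as a reference):

n  <- 1e5
x  <- rcauchy(n)
p1 <- mean(x > 2)              # Var = p(1 - p)/n
p2 <- mean(abs(x) > 2) / 2     # Var = p(1 - 2p)/(2n), uses both tails
c(p1, p2, exact = 1 - pcauchy(2))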
49 / 417
Importance sampling. Cauchy tail probability
p = 1/2 − ∫_{0}^{2} 1/(π(1 + x²)) dx = 1/2 − E[h(U)],    p̂3 := 1/2 − (1/n) Σ_{j=1}^{n} h(Uj),
where
h(x) := 2/(π(1 + x²))   and   U ∼ U(0, 2).
Var p̂3 = (E h²(U) − (E[h(U)])²)/n = 0.0285/n.
50 / 417
Importance sampling. Cauchy tail probability
Evaluate p = P(X > 2), where X ∼ C(0, 1).
p = ∫_{2}^{∞} 1/(π(1 + x²)) dx = ∫_{0}^{1/2} y^{−2}/(π(1 + y^{−2})) dy = ∫_{0}^{1/2} 1/(π(1 + y²)) dy = (1/4) E[h(V)],
where
h(x) := 2/(π(1 + x²))   and   V ∼ U(0, 1/2).
Generate V1, V2, . . . , Vn ∼ U(0, 1/2), calculate
p̂4 := (1/(4n)) Σ_{j=1}^{n} h(Vj).
Var p̂4 = (E h²(V) − (E[h(V)])²)/(16n) = 9.5525 × 10^{−5}/n.
[Figure: two panels, “Exponential” (left) and “Lognormal” (right), plotted against the parameter λ.]
as n → ∞ with probability 1.
Variance of the approximation is finite only if
E_g[h²(X) f²(X)/g²(X)] = E_f[h²(X) f(X)/g(X)] = ∫_X h²(x) (f²(x)/g(x)) dx < ∞.
Instrumental distributions with unbounded ratios f/g are not appropriate; the variance of the estimators is large.
Distributions g with thicker tails than f are preferred.
Sufficient conditions e.g.
I f(x)/g(x) < M, ∀x ∈ X and E_f h²(X) < ∞;
f(x) = 1/(π(1 + x²));   g(x) = (1/√(2π)) e^{−x²/2}.
For approximation use a sample Y1, Y2, . . . , Yn ∼ N(0, 1).
P(3 ≤ X ≤ 9) ≈ (1/n) Σ_{j=1}^{n} (f(Yj)/g(Yj)) I{3 ≤ Yj ≤ 9} = (1/n) Σ_{j=1}^{n} √(2/π) (e^{Yj²/2}/(1 + Yj²)) I{3 ≤ Yj ≤ 9}.
57 / 417
Example. Cauchy probability
Evaluate P(3 ≤ X ≤ 9), X ∼ C(0, 1) using N (0, 1).
59 / 417
Alternative method
X1 , X2 , . . . , Xn : a sample from distribution g .
Instead of
(1/n) Σ_{j=1}^{n} h(Xj) f(Xj)/g(Xj)   use   [ Σ_{j=1}^{n} h(Xj) f(Xj)/g(Xj) ] / [ Σ_{j=1}^{n} f(Xj)/g(Xj) ].
Normal distribution:
f²(x)/g(x) ∝ e^{x²(ν−2)/(2ν)} / (1 + x²/ν)^{ν+1}   (does not have finite integral on R).
Cauchy distribution:
f²(x)/g(x) ∝ (1 + x²) / (1 + x²/ν)^{ν+1}   (has finite integral on R).
61 / 417
Example. Student’s t distribution I. Mean
Empirical means of estimators of E_f[(X/(1 − X))^{1/2}], X ∼ T12, for 500 replications: sampling from T12 (solid black), importance sampling from C(0, 1) (dashed red) and from N(0, 12/10) (dashed blue).
Means are stable. h1(x) = (x/(1 − x))^{1/2} has a singularity at x = 1, so h1²(x) is not integrable under f.
None of the estimators have finite variance! 62 / 417
Example. Student’s t distribution I. Variance
1/2
Empirical range of estimators of Ef X /(1 − X ) , X ∼ T12 , for 500 repli-
cations: sampling from T12 (left), importance sampling from C(0, 1) (center)
and from N (0, 12/10) (right).
63 / 417
Example. Student’s t distribution I. Variance
1/2
Empirical range of estimators of Ef X /(1 − X ) , X ∼ T12 , for 5000 rep-
lications: sampling from T12 (left), importance sampling from C(0, 1) (center)
and from N (0, 12/10) (right).
64 / 417
Example. Student’s t distribution via reflected gamma
Find an instrumental distribution g with a better behaviour of
(1 − x)g (x) at x = 1.
Reflected gamma distribution RGa(α, β, γ) with PDF
β α |x − γ|α−1 e−β|x−γ|
, x ∈ R, α, β > 0.
2Γ(α)
If X ∼ RGa(α, β, γ) then |X − γ| ∼ Ga(α, β).
g : PDF of RGa(α, 1, 1).
f 2 (x) |x|e−|x−1|
h12 (x) ∝ .
g (x) |x − 1|α |1 + x 2 /ν|ν+1
Integrable around x = 1 for α < 1. Problems at ∞, infinite
variance but no influence on the approximation.
Y ∼ Ga(α, β); Z : uniform on {−1, 1}, independent of Y .
X = YZ + γ ∼ RGa(α, β, γ).
65 / 417
Example. Student’s t distribution via reflected gamma
1/2
Empirical range of estimators of Ef X /(1 − X ) , X ∼ T12 , sampling from
RGa(0.5, 1, 1) for 500 (left) and for 5000 (right) replications.
66 / 417
Example. Student’s t distribution II.
67 / 417
Example. Student’s t distribution II.
(black), importance sampling from C(0, 1) (red), from N (0, 12/10) (blue) and
from U(0, 1/2.1) (green). Final values: 6.69, 6.53, 5.51 and 6.52. True value:
6.52.
68 / 417
Example. Student’s t distribution III.
Approximate Ef [h3 (X )], X ∼ T12 , where
x5
h3 (x) := I .
1 + (x − 3)2 {x≥0}
69 / 417
Example. Student’s t distribution III.
sampling from T12 (black), importance sampling from C(0, 1) (red), from
N (0, 12/10) (blue) and from Exp(1) (green). Final values: 4.30, 4.40, 4.15
and 4.53. True value: 4.64.
70 / 417
Importance sampling and Accept-Reject
Approximate Z
J := h(x)f (x)dx.
X
Accept-Reject method
g : PDF of the instrumental distribution satisfying f (x) ≤ Mg (x).
Simulate Y ∼ g (x) and accept it as variable X ∼ f (x) with proba-
bility f (Y )/ Mg (Y ) .
Ratio f /g is bounded. g can be used as instrumental distribution
in importance sampling.
X1 , X2 , . . . , Xn : sample from f obtained by Accept-Reject method.
Y1 , Y2 , . . . , YN : a sample of random length N from g used to pro-
duce the sample from f .
{Y1 , Y2 , . . . , YN } = {X1 , X2 , . . . , Xn } ∪ {Z1 , Z2 , . . . , ZN−n }.
Z1 , Z2 , . . . , ZN−n : values rejected by Accept-Reject algorithm.
71 / 417
Estimators
Approximate Z
J := h(x)f (x)dx.
X
Estimators:
n N
1X 1 X f (Yk )
δ1 := h(Xj ) and δ2 := h(Yk ) .
n N g (Yk )
j=1 k=1
If f /g is known up to a constant:
N N
,
X f (Yk ) X f (Yk )
δ3 := h(Yk ) .
g (Yk ) g (Yk )
k=1 k=1
Decomposition of δ2 :
n N−n
n 1 X N −n 1 X f (Zk )
δ2 = h(Xj ) + h(Zk ) .
N n n N −n g (Zk )
j=1 k=1
Mg (x) − f (x)
.
M −1
Estimator associated with the rejected values:
N−n
1 X (M − 1)f (Zk )
δ4 := h(Zj ).
N −n Mg (Zk ) − f (Zk )
k=1
73 / 417
Example. Gamma Accept-Reject
Generate Ga(α, 1), α 6∈ N, α ≥ 1 using instrumental distribution
Ga(a, b) with a = bαc and b = a/α.
Density functions:
Approximate:
E X /(1 + X ) , X ∼ Ga(α, 1).
74 / 417
Example. Gamma Accept-Reject
Convergence of estimators δ1 (black), δ5 (green) and δ6 (blue) of E X /(1+X )],
X ∼ Ga(α, 1) to the true value of 0.7496829 (dashed red) for α = 3.7. Bound:
M = 1.116356. Final values: 0.749958, 0.751682 and 0.749880.
75 / 417
Example. Gamma Accept-Reject
Convergence of estimators δ1 (black), δ5 (green) and δ6 (blue) of E X /(1+X )],
X ∼ Ga(α, 1) to the true value of 0.7496829 (dashed red) for α = 3.7. Bound:
M1 = 1.163983. Final values: 0.750516, 0.752792 and 0.750705.
76 / 417
Example. Gamma Accept-Reject
Convergence of estimators δ1 (black), δ5 (green) and δ6 (blue) of E X /(1+X )],
X ∼ Ga(α, 1) to the true value of 0.708122 (dashed red) for α = 3.08. Bound:
M = 1.013969. Final values: 0.708323, 0.705941 and 0.708242.
77 / 417
Laplace approximation
Alternative to Monte Carlo integration. Evaluate
Z
f (x|θ)dx, f ≥ 0,
A
for a fixed θ.
Write f (x|θ) = exp h(x|θ) .
Expand h(x|θ) around a point x0 :
(x − x0 )2 00
h(x|θ) = h(x0 |θ) + (x − x0 )h0 (x0 |θ) + h (x0 |θ)
2!
(x − x0 )3 000
+ h (x0 |θ) + R3 (x),
3!
lim R3 (x)/(x − x0 )3 = 0.
x→x0
xθ )2 00
Z Z
(x−b
eh(x|θ) dx ≈ eh(bxθ |θ) e 2 h (bxθ |θ)
A A
(x − xbθ )3 000 (x − xbθ )6 000
2
× 1+ h (bxθ |θ) + h (b xθ |θ) dx.
3! 2(3!)2
Number of terms used: order of approximation.
79 / 417
First-order approximation
Approximation:
xθ )2 00
Z Z Z
(x−b
xθ |θ)
f (x|θ)dx = eh(x|θ)
dx ≈ e h(b
e 2 h (bxθ |θ) dx
A A
ZA
xθ )2 00
(x−b
xθ |θ) e 2 h (bxθ |θ) dx.
= f (b
A
Consider normal CDF with mean xbθ and variance −1/ h00 (b
xθ |θ)).
xbθ is the maximum point of h(x|θ), so h00 (b
xθ |θ) < 0.
Take A = [a, b]:
s
Z b Z b
h(x|θ) xθ |θ)
h(b 2π
f (x|θ)dx = e dx ≈ e
a a −h00 (b
xθ |θ)
hp p i
× Φ −h00 (b
xθ |θ)(b − xbθ ) − Φ −h00 (b
xθ |θ)(b − xbθ ) .
α−1
h(x| α, β) ≈ (α − 1) log xbα,β − βb
xα,β − 2
(x − xbα,β )2
2bxα,β
1
(βx − α + 1)2 .
= (α − 1) log(α − 1) − log β − 1 −
2(α − 1)
81 / 417
Example. Gamma approximation
First-order Laplace approximation:
s
Z b α α−1 α 2πb 2
xα,β
β x β
e−βx dx ≈ α−1 −βb
xbα,β e xα,β
a Γ(α) Γ(α) α−1
" s ! s !#
α−1 α−1
× Φ 2
b − xbα,β −Φ 2
a − xbα,β
xbα,β xbα,β
β α α − 1 α−1 1−α p
= e 2π(α − 1)/β
Γ(α) β
h √ √ i
× Φ (bβ − α + 1)/ α − 1 − Φ (aβ − α + 1)/ α − 1 .
Example. α = 6, β = 0.5, xbα,β = 10.
Interval Approximation Exact value
(9, 11) 0.174016 0.174012
(8, 12) 0.339580 0.339451
(3, 17) 0.867908 0.845947
(12, ∞) 0.321957 0.445680
82 / 417
Monitoring variation
Approximate
Z n
1X
h(x)f (x)dx with hn := h(Xj ), Xj ∼ f (x).
X n
j=1
2
Example. h(x) := cos(10x) − sin(60x) , f (x) ≡ 1, X = [0, 1].
Black: mean; blue: 95 % confidence bounds based on CLT; red: exact value. 84 / 417
Problem with the confidence interval
At a given n the confidence interval based on CLT is valid, but for
the whole path it is not.
Possible solutions:
I Univariate monitoring: run several independent sequences and
estimate variance using these samples.
Greedy in computing time
I Multivariate monitoring:
derive the joint distribution of the
sequence hn .
Calculations might be problematic.
85 / 417
Univariate monitoring. Example
Evaluate Z 1 2
cos(10x) − sin(60x) dx.
0
500 independent samples of size 10000 are generated from U(0, 1).
Liu [1996]:
Var δnh ≈ n−1 Var f h(X ) 1 + Var g (W ) .
88 / 417
Example. Cauchy prior
n n
X
X θj 1
δbnπ (x) := , θj ∼ N (x, 1).
1 + θj2 1 + θj2
j=1 j=1
Parameter: x = 2.5. Band: entire range of 500 samples; black: mean of run-
ning means; green: 90 % empirical confidence bounds; blue: 95 % confidence
bounds based on CLT.
90 / 417
Example. Cauchy prior
n n
X
−(x−θj )2 /2 2 /2
X
δenπ (x) := θj e e−(x−θj ) , θj ∼ C(0, 1).
j=1 j=1
Parameter: x = 2.5. Band: entire range of 500 samples; black: mean of run-
ning means; green: 90 % empirical confidence bounds; blue: 95 % confidence
bounds based on CLT.
91 / 417
Multivariate monitoring
Approximate
Z n
1X
h(x)f (x)dx with hn := h(Xj ), Xj ∼ f (x).
X n
j=1
To obtain bounds for hn determine the joint distribution of hn .
Assume: Xi ∼ N (µ, σ 2 ).
σ2
E X m = µ, Cov X k , X ` = , k, ` = 1, 2, . . . , n.
max{k, `}
92 / 417
Multivariate monitoring. Example
For Xi ∼ N (µ, σ 2 ) one has
Xn := (X 1 , X 2 , . . . , X n )> ∼ Nn µ1n , Σn ,
where
1 1 1/2 1/3 1/4 1/5 . . . 1/n
1 1/2 1/2 1/3 1/4 1/5 . . . 1/n
1n := 1 , Σn := σ 2 1/3 1/3 1/3 1/4 1/5 . . . 1/n.
.. .. .. .. .. .. ..
. . . . . . ... .
1 1/n 1/n 1/n 1/n 1/n . . . 1/n
Distribution:
(
> Xn2 if σ 2 is known;
Σ−1
Xn − µ1n n Xn − µ1n ∼
nFn,ν if σ 2 is unknown,
b2 ∼ Xν independent of Xn .
and in the latter case we have σ
93 / 417
Multivariate monitoring. Example
Inverse of Σn :
2 −2 0 0 0 ... 0
−2 8 −6 0 0 ... 0
1
0 −6 18 −12 0
Σ−1 = 2 ... 0 .
n
σ .. .. .. .. ..
. . . . . ... −n(n − 1)
0 0 0 0 . . . −n(n − 1) n2
qn : appropriate Xn2 or Fn,ν quantile. Confidence limits:
n > o
µ : Xn − µ1n Σ−1
n X n − µ1n ≤ qn
s
2 > −1
2 σ (Xn Σn Xn − qn )
= µ : µ ∈ Xn ± Xn − .
n
Variance inequality
Var E[δ(X )|Y ] ≤ Var δ(X ) .
Estimator
δ ∗ (Y ) := E[δ(X )|Y ]
has smaller variance than δ(X ).
95 / 417
Example. Student’s t expectation
Let h(x) := exp(−x 2 ), X ∼ T (ν, µ, σ 2 ) and consider E[h(X )].
Approximation:
n
1X
exp − Xj2 , Xj ∼ T (ν, µ, σ 2 ).
δn :=
n
j=1
96 / 417
Example. Student’s t expectation
Approximate
E exp(−X 2 ) , X ∼ T (ν, µ, σ 2 ).
where
%i := E I{Ui ≤ωi } N, Y1 , . . . , YN .
Let κ ∈ N.
%i = P I{Ui ≤ωi } N = κ, Y1 , . . . , YN
P Qn−2 Qκ−2
(i1 ,...,in−2 ) j=1 ωij j=n−1 (1−ωij )
= ωi P Qn−1 Qκ−2 , i = 1, 2, . . . , κ−1,
(i1 ,...,in−1 ) j=1 ωij j=n (1−ωij )
where a = a0,n < a1,n < . . . < an,n = b and maxj {aj+1,n −aj,n } → 0 as
n → ∞.
of Z 1
h(x)dx
0
−2
has a variance of order O n .
Define U(−1) := 0, U(n+1) := 1 and consider
n
X h U(j) + h U(j+1)
δn :=
e U(j+1) − U(j) .
2
j=−1
Riemann approximation:
n−1
1 X 4 −X
δ2,n := h X(j) X(j) e (j) X(j+1) − X(j) .
Γ(4)
j=1
Convergence of the estimators δ1,n (blue) and δ2,n (black) to the true value of
6.0245 (dashed red).
105 / 417
Importance sampling
Approximate
Z b Z b
f (x)
J := h(x)f (x)dx = h(x) g (x)dx.
a a g (x)
g : PDF of the instrumental distribution.
Monte Carlo approximations of J :
n
1X
δ1,n := h(Xj ), Xj ∼ f (x);
n
j=1
n
∗ 1 X f (Yj )
δ1,n := h(Yj ) , Yj ∼ g (x).
n g (Yj )
j=1
Riemann approximations of J :
n−1
X
δ2,n := h X(j) f X(j) X(j+1) − X(j) ;
j=1
n−1
X
∗
δ2,n := h Y(j) f Y(j) Y(j+1) − Y(j) .
j=1 106 / 417
Example. Student’s t expectation
Let h(x) := (1 + exp(x))I{x≤0} , X ∼ Tν , and consider
Z 0
Γ (ν + 1)/2 − ν+1
1 + ex 1 + x 2 /ν
E[h(X )] = √ 2
dx.
πνΓ(ν/2) −∞
Importance sampling MonteCarlo approximation with instrumental
distribution N 0, ν/(ν − 2) :
√ n
N 2Γ (ν +1)/2 X − ν+1 − Yj (ν−2)
1+eYj 1+Yj2 /ν
δ1,n := √ 2
e 2ν I{Yj ≤0} ,
ν −2Γ(ν/2)n j=1
where Y1 , Y2 , . . . , Yn ∼ N 0, ν/(ν − 2) . Unstable.
Importance sampling Monte Carlo approximation with instrumental
distribution C 0, 1):
√ n
C πΓ (ν + 1)/2 X − ν+1
1+eZj 1+Zj2 /ν 1+Zj2 I{Zj ≤0} ,
δ1,n := √ 2
νΓ(ν/2)n
j=1
where Z1 , Z2 , . . . , Zn ∼ C 0, 1 .
107 / 417
Example. Normal instrumental distribution
Approximate
Z 0
Γ (ν + 1)/2 − ν+1
1 + ex 1 + x 2 /ν
√ 2
dx
πνΓ(ν/2) −∞
with instrumental distribution N 0, ν/(ν − 2) .
Riemann approximation:
n−1 − ν+1
N Γ (ν +1)/2 X Y(j)
2 2
δ2,n := √ 1+e 1+Y(j) /ν I{Y(j) ≤0} Y(j+1)−Y(j)
πνΓ(ν/2)
j=1
N N N
Convergence of the estimators δ1,n (blue), δ2,n (black) and δ3,n (green) for
ν = 2.5 to the true value of 0.7330 (dashed red).
109 / 417
Example. Cauchy instrumental distribution
Approximate
Z 0
Γ (ν + 1)/2 − ν+1
1 + ex 1 + x 2 /ν
√ 2
dx
πνΓ(ν/2) −∞
with instrumental distribution C(0, 1).
Riemann approximation:
n−1 − ν+1
C Γ (ν +1)/2 X
2
1+eZ(j) 1+Z(j)
2
δ2,n := √ /ν I{Z(j) ≤0} Z(j+1)−Z(j)
πνΓ(ν/2)
j=1
C C C
Convergence of the estimators δ1,n (blue), δ2,n (black) and δ3,n (green) for
ν = 2.5 to the true value of 0.7330 (dashed red).
111 / 417
Acceleration methods. Correlated simulations
In Monte Carlo integration usually independent samples are simu-
lated.
Use of positively or negatively correlated samples might reduce the
variance.
112 / 417
Comparison of estimators
f (x|θ): PDF with parameter θ to be estimated.
L δ(x), θ : loss function for the estimator δ(x) of θ.
h i
Risk function: R(δ, θ) := E L δ(X ), θ , X ∼ f (x|θ).
Xj , Yj ∼ f (x|θ), j = 1, 2, . . . , n.
Positive correlation between L δ1 (Xj ), θ and L δ2 (Yj ), θ reduces
b 1 , θ) − R(δ
the variance of R(δ b 2 , θ).
113 / 417
Useful remarks
2. Possibly use the same sample for the comparison of risks for
every value of θ. E.g. the same uniform sample is used for genera-
tion of samples from f (x|θ) for every θ. In many cases there exists
a transformation Mθ on X such that if X 0 ∼ f (x|θ0 ) then we have
Mθ X 0 ∼ f (x|θ).
114 / 417
Example. Positive-part James-Stein estimator
Problem: estimate the unknown mean vector θ of a p-dimensional
normal distribution using a single observation X ∼ Np θ, Ip .
Loss function: L(δ(X), θ) := ‖δ(X) − θ‖².
Risk function: R(δ, θ) = E[‖δ(X) − θ‖²]  (squared error risk).
Least squares estimator: δ_LS(x) := x.
James–Stein estimator: δ_JS(x) := (1 − (p − 2)/‖x‖²)₊ x.
Approximate R(δ⁺_JS,a, θ) by
(1/n) Σ_{j=1}^{n} ‖(1 − a/‖Xj‖²)₊ Xj − θ‖²,   X1, X2, . . . , Xn ∼ Np(θ, Ip).
Transformation: Mθ x := x + θ − θ0, x ∈ R^p.  If X′ ∼ Np(θ0, Ip), then Mθ X′ ∼ Np(θ, Ip).
Approximate squared error risks of δ⁺_JS,a for N5(θ, I5) as a function of ‖θ‖ for
different values of a. Transformations of a single sample of size 1000. 117 / 417
Comparison of risks
+
Approximation of the squared error risk of δJS,a :
n +
2
1 X
1− a
2
Xj − θ
, X1 , X2 , . . . , Xn ∼ Np θ, Ip .
n
kXj k
j=1
+
Approximate squared error risks of δJS,a for N5 θ, I5 as a function of kθk for
different values of a. Individual samples of size 1000 for each θ. 118 / 417
Antithetic variables
Approximate Z ∞
J := h(x)f (x)dx
−∞
2
120 / 417
Example. Gamma expectation
Let h(x) := x log(x), X ∼ Ga(4, 1), and consider
E[h(X)] = (1/Γ(4)) ∫_{0}^{∞} h(x) x³ e^{−x} dx.
H(x) := h(F−(x)), where F− is the quantile function of Ga(4, 1).
[Figure: plot of H(x) for x ∈ (0, 1).]
Antithetic approximation:
δ_{2,2n} := (1/(2n)) Σ_{j=1}^{n} (h(Xj) + h(Yj)).
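A minimal R sketch of the antithetic pairing, using qgamma as the quantile function F−:

h <- function(x) x * log(x)
u <- runif(5000)
x <- qgamma(u,     shape = 4, rate = 1)   # X = F^-(U)
y <- qgamma(1 - u, shape = 4, rate = 1)   # Y = F^-(1 - U), antithetic to X
delta_anti <- mean((h(x) + h(y)) / 2)     # compare with the true value 6.0245
cor(h(x), h(y))                           # strongly negative correlation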
122 / 417
Example. Gamma expectation
Approximate
E X log(X ) , X ∼ Ga(4, 1).
Empirical correlation of h(X ) and h(Y ): −0.7604.
Convergence of the estimators δ1,2n (blue) and δ2,2n (black) to the true value of
6.0245 (dashed red). 123 / 417
Antithetic variables. Inversion at a higher level
Approximate
Z ∞ n
1 X
J := h(x)f (x)dx using δn := h(Xj ) + h(Yj ) ,
−∞ 2n
j=1
Works in the cases when h(x) is approximately linear and for large
sample sizes.
124 / 417
Example. Beta expectation
Let h(x) := cos(x) + 2 sin(x), X ∼ Be(α, α), and consider
Z 1
1
E[h(X )] = h(x)x α−1 (1 − x)α−1 dx.
B(α, α) 0
2.2
2.0
1.8
h(x)
1.6
1.4
1.2
1.0
E[X ] = 1/2 and Be(α, α) is symmetric around the mean. h(x) can
be approximated by a linear function. 125 / 417
Example. Beta expectation
Let h(x) := cos(x) + 2 sin(x), X ∼ Be(α, α), and consider
Z 1
1
E[h(X )] = h(x)x α−1 (1 − x)α−1 dx.
B(α, α) 0
Generate X1 , X2 , . . . , X2n ∼ Be(α, α).
Define
Yk := 1 − Xk , k = 1, 2, . . . , n.
Antithetic approximation:
n
1 X
δ2,2n := h(Xj ) + h(Yj ) .
2n
j=1
126 / 417
Example. Beta expectation
Approximate
E cos(X ) + 2 sin(X ) , X ∼ Be(2, 2).
Empirical correlation of h(X ) and h(Y ): −0.9421.
Convergence of the estimators δ1,2n (blue) and δ2,2n (black) to the true value of
1.7909 (dashed red). 127 / 417
Control variates
Approximate Z ∞
J := h(x)f (x)dx
−∞
with an estimator δ1 .
h0 : a function such that J0 := E h0 (X ) , X ∼ f (x), is known.
δ3 : Unbiased estimator of J0 .
Weighted estimator: δ2 := δ1 + β δ3 − J0 , β ∈ R.
where X , X1 , X2 , . . . , Xn ∼ f (x).
Control variate estimator:
n
1X ∗
δ2 := h(Xj ) + β h0 (Xj ) − E h0 (X ) ,
n
j=1
θj
on h01 (θj ), h02 (θj ), . . . , h0r (θj ) , j = 1, 2, . . . n.
1 + θj2
132 / 417
Example. Cauchy prior
h i
Approximate J := E θ 1 + θ2 , θ ∼ N (x, 1).
Control variates: h0k (θ) := (θ − x)2k−1 , k = 1, 2, 3, 4, 5.
−6
x 10
1.2
0.8
L(x,y)
0.6
0.4
0.2
0
1
0.8 1
0.6 0.8
0.4 0.6
0.4
0.2 0.2
y 0 0
x 137 / 417
Optimization methods
Find
arg max h(x), h : X ⊂ Rd 7→ R.
x∈X
[Figure] Sequence of MLEs corresponding to 500 simulations from C(5, 1): (left) by optimizing the log-likelihood (black) and likelihood (red); (right) by optimizing the perturbed log-likelihood g(θ) = −sin²(100(θ − 5)) − log ℓ(θ) (black) and likelihood ℓ(θ) (red).
139 / 417
Numerical methods in R
Newton–Raphson method: nlm
Designed for minimization.
Iteration searching for the minimum of h : X ⊂ R^d → R:
x_{k+1} = x_k − [∂²h/(∂x ∂x^T)(x_k)]^{−1} ∇h(x_k).
Log-likelihood function:
log ℓ(µ1, µ2 | x1, x2, . . . , xn) = Σ_{j=1}^{n} log(0.25 ϕ(xj − µ1) + 0.75 ϕ(xj − µ2)).
140 / 417
Newton–Raphson iterations
[Figure: Newton–Raphson iteration paths in the (µ1, µ2) plane.]
142 / 417
Example
2
The graph of h(x) = cos(10x) − sin(60x) (top) and scatterplot of
h(ui ), ui ∼ U(0, 1), i = 1, 2, . . . 5000 (bottom).
143 / 417
Precision
[Figure] Range of 1000 sequences of successive maxima of h(x) = (cos(10x) − sin(60x))² found by uniform sampling over 1000 iterations. True maximum: red line.
144 / 417
Generalization of gradient method
Find
arg max h(x), h : X ⊂ Rd 7→ R.
x∈X
Gradient method produces a sequence of form
xk+1 = xk + αk ∇h(xk ), αk > 0.
If h is not differentiable or an analytic expression of h is not avai-
lable:
xk+1 = xk + αk ∇h(x
b k ),
Plot of function h(x, y ) on [−1, 1]2 . Minimum at (0, 0). 147 / 417
Test runs
150 / 417
Simulation of the annealing process
X : a finite set describing the possible states of the solid.
h : X 7→ R: energy function. h(x), x ∈ X , is the energy corres-
ponding to state x.
Aim: find the state xopt minimizing the energy function.
x: current state with energy h(x).
y : a candidate state generated from x by a perturbation
mechanism.
Metropolis criterion [Metropolis et al., 1953]
I If h(y ) − h(x) ≤ 0 then accept y as a new current state.
I If h(y ) − h(x) > 0 then accept y as a new current state with
probability
h(x) − h(y )
exp .
kB T
T: temperature of the heat bath.
kB: Boltzmann constant relating energy at the individual particle level with temperature. kB = 1.3806488(13) × 10⁻²³ J/K.
151 / 417
Boltzmann distribution
Physical process: if the cooling is sufficiently slow, the solid reach-
es the thermal equilibrium at each temperature T .
Simulation: thermal equilibrium is reached after generating a large
number of transitions at a given temperature T .
Boltzmann distribution: distribution of states at a given temperature T corresponding to thermal equilibrium:
P(X = x) = (1/Z(T)) exp(−h(x)/(kB T)).
152 / 417
Combinatorial problems
Find
N : X 7→ 2X ,
Algorithm
1. Generate a candidate solution y from Xx ;
2. Accept x = y with probability
+
exp − h(y ) − h(x) /Tk ;
1
qx (T ) = exp − h(x)/T , x ∈ X , where
N0 (T )
X
N0 (T ) := exp − h(y )/T .
y ∈X
Then
1
lim qx (T ) = qx∗ := IXopt (x).
T &0 card Xopt
Proof. limx&0 exp(α/x) = 0 if α < 0, and 1 if α = 0.
exp − h(x)/T exp (hopt − h(x))/T
lim qx (T ) = lim P = lim P
y ∈X exp − h(y )/T y ∈X exp (hopt − h(y ))/T
T &0 T &0 T &0
1
= lim P IXopt (x)
T &0
y ∈X exp (h opt − h(y ))/T
exp (hopt − h(x))/T 1
+ lim P IX \Xopt (x) = IXopt (x) + 0. 2
T &0
y ∈X exp (h opt − h(y ))/T card Xopt 162 / 417
Remarks
XnT : the MC corresponding to the SA algorithm at temperature T .
I
1
lim qx (T ) = qx∗ := IXopt (x)
T &0 card Xopt
implies
lim lim P XnT = x = lim qx (T ) = qx∗
T &0 n→∞ T &0
or
lim lim P XnT ∈ Xopt = 1.
T &0 n→∞
1
I Gxy := C IXx (y ) can be replaced by Gxy = Gyx , x, y ∈ X .
1
I Gxy := card(Xx ) IXx (y )
leads to stationary distribution
card(Xx ) exp − h(x)/T
qex (T ) = P , x ∈ X ,
y ∈X card(Xy ) exp − h(y )/T
and limT &0 qex (T ) = qx∗ . [Lundy & Mees, 1986]
These results require infinitely many steps at a given temperature!163 / 417
Inhomogeneous Markov chains
Consider a sequence of homogeneous Markov chains of length L
corresponding to L steps at a given temperature.
164 / 417
A convergence result
Theorem. Let (X , h) be an instance of a combinatorial minimiza-
tion problem and consider the SA algorithm, where the generation
probabilities are uniform on the (equal) neighbourhoods and accep-
tance probabilities are according to Metropolis criterion. Assume:
1. For all x, y ∈ X there exist an integer r ≥ 1 and a sequence u0, u1, . . . , ur ∈ X with u0 = x, ur = y such that G_{u_k u_{k+1}} > 0, k = 0, 1, . . . , r − 1.
2. The sequence T′_ℓ satisfies
T′_ℓ ≥ (L + 1)Δ / log(ℓ + 2),   where Δ := max_{x,y∈X} {h(y) − h(x) : y ∈ X_x},
Definition. Let w ∈ X_loc; the depth d(w) of the local minimum w is the smallest τ such that there is a solution y ∈ X with h(y) < h(w) that is reachable at height h(w) + τ from w. If w is a global minimum then d(w) := ∞.
166 / 417
Necessary and sufficient condition
Theorem. [Hajek, 1988] Let (X , h) be an instance of a combina-
torial minimization problem and let Xn be the MC corresponding
to the SA algorithm. Assume:
1. For all x, y ∈ X there exist an integer r ≥ 1 and a sequence u0, u1, . . . , ur ∈ X with u0 = x, ur = y such that G_{u_k u_{k+1}} > 0, k = 0, 1, . . . , r − 1.
2. For all τ > 0 and x, y ∈ X, x is reachable from y at height τ if and only if y is reachable from x at height τ.
If Tn & 0 as n → ∞ then
lim P Xn ∈ Xopt = 1
n→∞
if and only if
∞
X
exp − D/Tk = ∞, (1)
k=1
where D := max d(w )|w ∈ Xloc \ Xopt .
Remark. For Tn := Γ/ log(n) condition (1) is equivalent to Γ ≥ D.
167 / 417
Continuous simulated annealing algorithm
Find
arg min h(x), h : X ⊆ Rd 7→ R.
x∈X
Neigbourhood of x: a random perturbation ζ is generated from a
distribution with PDF g (|y − x|) (centered at x).
Tk : temperature at the kth step.
Algorithm
1. Generate z from distribution g (|y − xk |);
2. Accept xk+1 = z with probability
+
exp − h(z) − h(xk ) /Tk ;
Take xk+1 = xk , otherwise;
3. Update Tk to Tk+1 .
Can be modeled by a discrete time continuous state space MC.
The results highly depend on g and on the cooling sequence Tk .
A numerical optimization (e.g. Newton–Raphson) can be launched from the optimal point of SA. 168 / 417
Example
Find the maximum of
2
h(x) = cos(10x) − sin(60x) , x ∈ X = [0, 1],
Algorithm
1. Generate z from distribution U(ak , bk ), where
ak := max{xk − %k , 0}, bk := min{xk + %k , 1};
2. Accept x_{k+1} = z with probability
exp(−(h(x_k) − h(z))⁺ / T_k).
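A compact R sketch of this annealing run with the cooling choice T_k = 1/log(k + 1), ρ_k = √T_k (one of the two settings used in the figure below):

h <- function(x) (cos(10 * x) - sin(60 * x))^2
sa_max <- function(n_iter = 2500) {
  x <- runif(1); best <- x
  for (k in 1:n_iter) {
    Tk  <- 1 / log(k + 1)
    rho <- sqrt(Tk)
    z   <- runif(1, max(x - rho, 0), min(x + rho, 1))       # proposal on [a_k, b_k]
    if (runif(1) <= exp(-max(h(x) - h(z), 0) / Tk)) x <- z  # Metropolis step
    if (h(x) > h(best)) best <- x
  }
  c(final = x, best = best, h_best = h(best))
}
sa_max()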
[Figure] Four different runs of SA with random starting point for Tk = 1/log(k + 1), ρk = √Tk (red) and Tk = 1/(k + 1)², ρk = 5√Tk (green).
170 / 417
Example
Find the minimum of
2
h(x, y ) = x sin(20y ) + y sin(20x) cosh sin(10x)x
2
+ x cos(10y ) − y sin(10x) cosh cos(20y )y .
Cooling sequences:
Tk := (0.95)k ,
Tk := 1/ 10(k + 1) ,
p
Tk := 1/ log(k + 1), Tk := 1/ 10 log(k + 1) .
171 / 417
Example
175 / 417
Test runs. Single global minimum
Test function: h(x, y ) := x 2 +2y 2 −0.3 cos(3πx)−0.4 cos(4πy )+ 0.7.
Final points of 100 test runs with random starting points from U [−1, 1]2 .
176 / 417
Visited points. Single global minimum
Test function: h(x, y ) := x 2 +2y 2 −0.3 cos(3πx)−0.4 cos(4πy )+ 0.7.
Points visited during a single test run (every tenth). Starting: ◦; final point: ?.
Starting point Optimum point Final point Acceptance
x y x y x y ratio
−0.4179 −0.2710 −0.0006 −0.0007 0.0146 −0.0218 50 %
177 / 417
Example. Multiple global minima
Test function: h(x, y ) := x 2 +2y 2 −0.3 cos(3πx)+0.4 cos(4πy )+ 0.7.
Global minima at (0, −0.23502848) and (0, 0.23502848).
178 / 417
Test runs. Multiple global minima
Test function: h(x, y ) := x 2 +2y 2 −0.3 cos(3πx)+0.4 cos(4πy )+ 0.7.
Final points of 100 test runs with random starting points from U [−1, 1]2 .
Points visited during a single test run (every tenth). Starting: ◦; final point: ?.
Starting point Optimum point Final point Acceptance
x y x y x y ratio
0.6505 0.6896 −0.0004 −0.2354 −0.0027 −0.2632 49 %
180 / 417
Remarks on implementation
Problem: find the ML estimates of the transition probabilities of a
partially observed homogeneous MC with s states.
Input data: observations of a MC where zeros denote the missing
points.
Final points of 100 test runs with random starting points from U [0, 1]2 .
Optimal point: ◦.
183 / 417
Example
Final points of 100 test runs with random starting points from U [0, 1]2 .
Optimal point: ◦.
185 / 417
Stochastic approximation
Find
arg max h(x), h : X ⊂ Rd 7→ R.
x∈X
Missing observations:
z = (zm+1 , zm+2 , . . . , zn ), zj > a, j = m + 1, m + 2, . . . , n.
Complete data likelihood:
Ym n
Y
c
L (θ|y, z) = f (yj − θ) f (zj − θ).
j=1 j=m+1
Z
L(θ|y) = E Lc (θ|y, Z)y = Lc (θ|y, z)f (z|y, θ)dz.
Z
f (z|y, θ): conditional PDF of missing data on observed.
187 / 417
Example. Censored normal distribution
Data: Y1 , Y2 , . . . , Yn ∼ N (θ, 1) i.i.d. Censoring at a.
Problems:
I Optimization algorithms require evaluations of hb1,n (x) in many
points x, so many samples of size n should be generated.
I The generated sample changes with every value of x which
makes the sequence of evaluations of hb1,n (x) unstable. Con-
vergence properties of the optimization algorithm are usually
not preserved.
189 / 417
Importance sampling approach
Find arg max x∈X h(x), X ⊂ Rd , where h(x) = E H(x, Z ) .
Derivatives of h
b2,n (x) aproximate derivatives of h(x).
190 / 417
Drawbacks
Maximize
n
f (Yj |x)
Z
b2,n (x) := 1
X
h H(x, Yj ) instead of h(x) := H(x, z)f (z|x)dx.
n g (Yj ) X
j=1
1. ĥ_{2,n}(x) is often far less smooth than h(x). Problems might arise with regularity, unimodality may vanish, etc. It is more difficult to optimize.
2. The choice of g is critical, since a function should be approxi-
mated. The convergence is not uniform and might be slow e.g.
if g (y ) = f (y |x0 ) and x0 is far from the true maximum x∗ .
3. The number of generated sample values Yj should vary with x
to achieve the same precision in the approximation of h(x).
Usually impossible.
191 / 417
Example. Probit regression
Distribution of Y ∈ {0, 1} given a covariate X is
P(Y = 1|X = x) = 1 − P(Y = 0|X = x) = Φ(θ0 + θ1 x).
Find the posterior mode of θ0 .
Sample: (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ).
Marginal posterior mode given a flat prior on θ1 : arg maxθ0 h(θ0 )
Z Y n
y 1−yj
h(θ0 ) := Φ(θ0 + θ1 xj ) j Φ(−θ0 − θ1 xj ) dθ1 .
j=1
Real data: Pima Indians Diabetes dataset Pima.tr (R library MASS), 200 cases. This tribe has the highest prevalence of type 2 diabetes in the world.
X : body mass index; Y : presence of diabetes.
Earlier study:
estimate of θ1 is µ = 0.1052 with standard error σ = 0.0296.
192 / 417
Example. Probit regression
Find the maximum of
n
Z Y
y 1−yj
h(θ0 ) := Φ(θ0 + θ1 xj ) j Φ(−θ0 − θ1 xj ) dθ1 .
j=1
M n y 1−yj
1 XY Φ(θ0 + θ1,m xj ) j Φ(−θ0 − θ1,m xj )
h(θ0 ) :=
b .
M f5 (θ1,m ; µ, σ 2 )
m=1 j=1
Two approaches:
I Different t sample for each value of θ0 .
I One single t sample for all values of θ0 .
193 / 417
Example. Probit regression
Approximations of h(θ0 ) based on 1000 simulations from T (5, 0.1, 0.1) distri-
bution. Range of 100 replications when a different t sample is used for each
value of θ0 and a single approximation (top); when one t sample is used for all
values of θ0 and a single approximation (bottom).
194 / 417
Example. Probit regression
Approximate
n
Z Y
yj 1−yj
h(θ0 ) := Φ(θ0 + θ1 xj ) Φ(−θ0 − θ1 xj ) dθ1 .
j=1
Approximations of h(θ0 ) based on 1000 simulations from T (5, 0.1, 0.1) distri-
bution. Mean of 100 replications when a different t sample is used for each
value of θ0 (red) and when one t sample is used for all values of θ0 (blue).
195 / 417
Monte Carlo maximization
Find
Algorithm
0. Choose x1 ∈ X and error ε.
At the ith step
1. Generate y1 , y2 , . . . yn ∼ f (z|xi ) and calculate
n
bi (x) := 1
X f (yj |x)
h H(x, yj ) ;
n f (yj |xi )
j=1
Kiefer-Wolfowitz algorithm
Find [Kiefer & Wolfowitz, 1952]:
arg max h(x), where h(x) := E H(x, Z ) .
x∈X
for every 0 < δ < 1 then the sequence (Xj ) converges to x ∗ almost
surely.
198 / 417
The EM algorithm
EM: expectation-maximization. Deterministic optimization tech-
nique. [Dempster, Laird & Rubin, 1977].
X = (X1 , X2 , . . . , Xn ): a sample from joint PDF g (x|θ), where
Z
g (x|θ) = f (x, z|θ)dz.
Z
Z ∈ Z: unobserved data-vector.
Find the maximum-likelihood estimator of θ, that is
arg max `(θ|x), where `(θ|x) := log L(θ|x) and L(θ|x) = g (x|θ).
θ
199 / 417
The EM algorithm
Expected log-likelihood:
Q(θ|θ0 , x) := Eθ0 `C (θ|X, Z)X = x .
Theorem. If the expected complete data log-likelihood Q(θ|θ0 , x
is continuous
in both θ and θ0 , then every point of an EM
sequ-
ence θ(j) is a stationary point of L(θ|x), and L θ(j) x converges
b b
monotonically to L θbx for some stationary point θ.
b
k=1
with ω = (ω1 , ω2 , . . . , ωM ), µ = (µ1 , µ2 , . . . µM ).
g (x|µ, σ 2 ): PDF of the truncated
normal distribution with lower
cut-off at 0. Notation: N0 µ, σ 2 .
1
σϕ (x − µ)/σ
g x µ, σ 2 :=
, x ≥ 0, and 0 otherwise.
Φ µ/σ
Mean κ and variance σ 2
!2
σϕ µ/σ µ ϕ µ/σ ϕ µ/σ
κ = µ+ and %2 = σ 2 1− − .
Φ µ/σ σ Φ µ/σ Φ µ/σ
zj = (z1,j , z2,j , . . . , zM,j )> has one entry 1, all other entries are 0.
n X
X M
zj,k = n.
j=1 k=1
j=1 j=m+1
N a (θ, 1):
normal N (θ0 , 1). with upper truncation at a.
Maximum of Q(θ|θ0, y) at:  θ̂ = (m ȳ + (n − m) E_{θ0}[Z1]) / n.
EM sequence:
θ̂^(j+1) = (m/n) ȳ + ((n − m)/n) (θ̂^(j) + ϕ(a − θ̂^(j)) / (1 − Φ(a − θ̂^(j)))). 207 / 417
EM for censored normal data
Observed data log-likelihood function:
ℓ(θ|y) = (n − m) log(1 − Φ(a − θ)) − (1/2) Σ_{j=1}^{m} (yj − θ)² − (m/2) log(2π).
[Figure]
8 EM sequences (dotted lines) starting from random points for censored normal
N (4, 1) likelihood with n = 40, m = 23, a = 4.5, y = 3.4430 and the obser-
ved data log-likelihood (orange). 208 / 417
Monte Carlo EM
Expected log-likelihood:
(Z1, Z2, X2, X3, X4) ∼ M(n; 1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4),   Z1 + Z2 = X1.
Given θ = θ^(0):
Z2 | X1 ∼ Bin(X1, θ^(0)/(2 + θ^(0))),   so   E[Z2 | X1] = θ^(0) X1 / (2 + θ^(0)).
E step: replace z2 in ℓ^C by z̃2 := E[Z2 | X1 = x1] = θ^(0) x1 / (2 + θ^(0)).
M step: maximize ℓ^C(θ | z1, z̃2, x2, x3, x4).
Iteration:
θ^(k+1) = (θ^(k) x1/(2 + θ^(k)) + x4) / (θ^(k) x1/(2 + θ^(k)) + x2 + x3 + x4).
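An R sketch of this EM iteration for the data of Dempster, Laird & Rubin (1977), x = (125, 18, 20, 34):

x <- c(125, 18, 20, 34)
theta <- 0.5                                        # arbitrary starting value
for (k in 1:25) {
  z2    <- theta * x[1] / (2 + theta)               # E step: E[Z2 | X1]
  theta <- (z2 + x[4]) / (z2 + x[2] + x[3] + x[4])  # M step
}
theta                                               # approaches the MLE, about 0.6268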
211 / 417
Genetic linkage. Monte Carlo version
Replace
E[Z2 | X1 = x1] = θ^(0) x1 / (2 + θ^(0))   with   Ȳ = (1/n) Σ_{j=1}^{n} Yj,
where
Y1, Y2, . . . , Yn ∼ Bin(x1, θ^(0)/(2 + θ^(0)))   or   n Ȳ ∼ Bin(n x1, θ^(0)/(2 + θ^(0))).
Iteration:
θ̃^(k+1) = (Ȳ^(k) + x4) / (Ȳ^(k) + x2 + x3 + x4),   where  n Ȳ^(k) ∼ Bin(n x1, θ^(k)/(2 + θ^(k))).
212 / 417
EM for genetic linkage
Log-likelihood:
`(θ|x1 , x2 , x3 , x4 ) = x1 log(2+θ)+(x2 +x3 ) log(1−θ)+x4 log(θ)+const.
Observations [Dempster, Laird & Rubin, 1977]: (125, 18, 20, 34).
8 EM sequences (dotted lines) starting from random points of [0, 1] and the
log-likelihood function (orange). 213 / 417
MCEM for genetic linkage
Log-likelihood:
`(θ|x1 , x2 , x3 , x4 ) = x1 log(2+θ)+(x2 +x3 ) log(1−θ)+x4 log(θ)+const.
Observations [Dempster, Laird & Rubin, 1977]: (125, 18, 20, 34).
EM sequence (orange) starting from 0.1 and range of 500 MCEM sequences for
n = 10 (left) and n = 100 (right). 214 / 417
EM standard errors
Under some regularity conditions the asymptotic variance
−1 of the
maximum likelihood estimator θb of θ∗ ∈ Rd is F(θ∗ ) .
F(θ): Fisher information matrix on θ corresponding to X ∼ g (x|θ).
2
∂`(θ|X) ∂`(θ|X) ∂ `(θ|X)
F(θ) := Eθ = −Eθ .
∂θ ∂θ> ∂θ∂θ>
Approximation of the information matrix:
2
∗ ∂ `(θ|x)
F(θ ) ≈ − .
∂θ∂θ> θ=θb
Oakes [1999]:
∂ 2 Q(θ0 |θ, x)
2x1 1
0
= .
∂θ ∂θ
θ0 =θ (2 + θ)2 θ
216 / 417
MCEM standard errors
Expected log-likelihood: Q(θ0 |θ, x) := Eθ `C (θ0 |X, Z)X = x .
Reformulation:
∂ 2 `(θ|x)
2 C
∂ ` (θ|X, Z)
= Eθ X = x
∂θ2 ∂θ2
!2
∂` (θ|X, Z) 2
C C
∂` (θ|X, Z)
+ Eθ
∂θ X = x − Eθ ∂θ X = x
2 C C
∂ ` (θ|X, Z) ∂` (θ|X, Z)
= Eθ X = x + Varθ X = x .
∂θ2 ∂θ
217 / 417
Approximation
Z1 , Z2 , . . . , Zn ∼ k(z|θ, x): sample from the missing data PDF.
Estimated variance:
−1
∂ 2 `(θ|x)
b ≈
Var(θ) − ,
∂θ2
θ=θb
where
n
∂ 2 `(θ|x) 1 X ∂ 2 `C (θ|x, Zj )
≈
∂θ2 n ∂θ2
j=1
n n
!2
1X ∂`C (θ|x, Zj ) 1 X ∂`C (θ|x, Zi )
+ − .
n ∂θ n ∂θ
j=1 i=1
218 / 417
Basic definitions
Definition. Let X be a nonempty set. A transition kernel is a
function K defined on X × B(X ) such that
I For all x ∈ X , K (x, ·) is a probability measure.
I For all A ∈ B(X ), K (·, A) is measurable.
223 / 417
Resolvent
Definition. The Markov chain Xnm , m ∈ N, with transition
ker-
nel K m (x, A) is called the m-skeleton of the chain Xn .
Xnε
The MC with transition kernel Kε is called the Kε -chain.
Remark. Xnε is a sub-chain of Xn with indices selected ran-
δn ∼ N 0, σ 2
Xn+1 = αXn + δn+1 , i.i.d.
Transition kernel density: κ(x, y ) = ϕ (y − αx)/σ /σ.
Theorem. Let Xn be a ψ-irreducible Markov chain. For every
set A ∈ B(X ) such that ψ(A) > 0 there exists m ∈ N and a small
set C ⊆ A such that ψ(C ) > 0 and νm (C ) > 0. Moreover, X can
be decomposed in a countable partition of small sets.
228 / 417
Example. AR(1) process
AR(1) process:
δn ∼ N 0, σ 2
Xn+1 = αXn + δn+1 , i.i.d.
229 / 417
Creating atoms
Construction of a companion Markov chain possessing an atom:
Athreya & Ney [1978]
Transition kernel for the split chain X̌n := (Xn , ω̌n ) with ω̌n ∈ {0, 1}:
Ǩ (x0 , Ai ) := K ∗ (x0 , Ai ), x0 ∈ X0 \ C0 ;
K ∗ (x0 , Ai ) − εν ∗ (Ai )
Ǩ (x0 , Ai ) := , x0 ∈ C0 ; i = 0, 1,
1−ε
∗
Ǩ (x1 , Ai ) := ν (Ai ), x1 ∈ X1 .
EC : the set of time points for which C is a small set with measure
proportional to νM .
239 / 417
Example. AR(1) process
AR(1) process:
δn ∼ N 0, σ 2
Xn+1 = αXn + δn+1 , i.i.d.
Transition kernel density:
1 (y −αx)2
κ(x, y ) = √ e− 2σ2 .
2πσ
2
Stationary distribution: N µ, τ .
Equation to be solved for µ and τ 2 :
Z ∞
(z−µ)2 1 (z−αx)2 (x−µ)2
−
e 2τ 2 = √ e− 2σ2 e− 2τ 2 dx.
−∞ 2πσ
Derived equations
σ2
µ = αµ, τ 2 = α2 τ 2 + σ 2 resulting µ = 0, τ2 = .
1 − α2
Stationary solution exists iff |α| < 1, and it is the N 0, σ 2 /(1−α2 ) .
243 / 417
h-Ergodicity
Definition. We say that a Markov chain (Xn ) is h-ergodic if
h ≥ 1 and
(a) (Xn ) is positive Harris recurrent with invariant probability
distribution π;
(b) the integral π(h) is finite;
(c) for every initial condition x ∈ X of the chain
lim
K n (x, ·) − π
h → 0.
n→∞
ψ Shc = 0
and K (x, Sh ) = 1, x ∈ Sh ,
respectively. If x ∈ Sh then
lim
K n (x, ·) − π
h = 0.
n→∞
(c) If Xn is h-regular, then X
is h-ergodic. Conversely, if
n
Xn is h-ergodic then Xn restricted to a full absorbing set
is h-regular.
Remark. If Xn is h-ergodic then it may not be h-regular.
245 / 417
Geometric ergodicity
Interested in the speed of convergence to the invariant distribution.
n=1
n=1
246 / 417
Geometrically ergodic atoms
Definition. An accessible atom α is called geometrically ergodic
if there exists rα > 1 such that
∞
X
rαn K n (α, α) − π(α) < ∞.
n=1
n=1
n=1
and define
τα
X 2
γf2 := π(α)Eα
f (Xj ) − π(f ) .
j=1
Then γf2 < ∞ and if γf2 > 0 then CLT holds for f , that is
n
1 X D
f (Xj ) − π(f ) −→ N 0, γf2 .
√
n
j=1
252 / 417
The Markov chain Monte Carlo (MCMC) principle
Monte Carlo integration: approximate
Z
J := h(x)f (x)dx.
Generate a Markov chain (X^(t)) with stationary distribution f. Metropolis–Hastings step: given x^(t),
1. Generate yt ∼ q(y | x^(t));
2. Take
x^(t+1) = yt with probability ρ(x^(t), yt), and x^(t+1) = x^(t) with probability 1 − ρ(x^(t), yt),
where
ρ(x, y) := min{ f(y) q(x|y) / (f(x) q(y|x)), 1 }. 255 / 417
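A generic R sketch of this step; f is the (possibly unnormalized) target, q_sim(x) draws from q(·|x) and q_dens(y, x) evaluates q(y|x):

mh <- function(n, x0, f, q_sim, q_dens) {
  x <- numeric(n); x[1] <- x0
  for (t in 1:(n - 1)) {
    y   <- q_sim(x[t])
    rho <- min(f(y) * q_dens(x[t], y) / (f(x[t]) * q_dens(y, x[t])), 1)
    x[t + 1] <- if (runif(1) <= rho) y else x[t]   # accept or stay
  }
  x
}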
Metropolis-Hastings algorithm. Remarks
1. The algorithm always accepts a value yt if f (yt )/q yt x (t) is
greater than f x (t) /q x (t) yt . Symmetric q: f (yt ) > f x (t) .
Z Z Z Z
κ(x, y )f (x)dy dx = %(x, y )q(y |x)f (x)dy dx
X A X A
Z Z Z Z
+ 1 − r (x) δx (y )f (x)dy dx = %(y , x)q(x|y )f (y )dy dx
ZX A Z Z X A
+ 1 − r (x) IA (x)f (x)dx = %(y , x)q(x|y )dx f (y )dy
ZX Z A X
Z Z
+ 1−r (x) f (x)dx = r (y )f (y )dy + 1−r (x) f (x)dx = f (y )dy . 2
A A A A
258 / 417
Recurrency
Theorem. If the Metropolis-Hastings chain X (t) is f -irreducible
Algorithm
Given x (t) ,
1. Generate yt ∼ g (y ).
2. Take
y f (yt ) g (x (t) )
t with probability min f (x (t) ) g (yt ) , 1 ,
x (t+1) =
(t)
x otherwise.
set. The chain is π-irreducible: for all n there exists xn ∈ C and m ∈ N such that
Pxn X (m) ∈ Dn , τC > m > 0.
(3)
c
k
For x ∈ Dn ∩ C we have Px (τC ≥ k) ≥ 1 − 2/n . Radius of convergence of
∞
X
Ex z τC = Px (τC ≥ k)z k
k=0
c
τ − 2). xn ∈ C and by (3) we can reach Dn ∩ C without hitting
is less than n/(n
C . Hence Exn z C also diverges outside n/(n − 2). As n is arbitrary, C cannot
be Kendall for any γ > 1. 2
268 / 417
Acceptance rate
Theorem. If there exists a constant M such that
3
4
2
2
1
0
X
X
0
-1
-2
-2
-3
0 2000 4000 6000 8000 10000 0 100 200 300 400 500
Iterations Iterations
1.0
0.4
0.8
0.3
0.6
Density
ACF
0.2
0.4
0.1
0.2
0.0
0.0
-4 -2 0 2 4 0 10 20 30 40
Lag
15000
500
300
X
X
5000
100
0
-100
0 2000 4000 6000 8000 10000 0 100 200 300 400 500
Iterations Iterations
1.0
0.4
0.8
0.3
0.6
Density
ACF
0.2
0.4
0.1
0.2
0.0
0.0
-4 -2 0 2 4 0 10 20 30 40
Lag
Algorithm
Given x (t) ,
1. Generate yt ∼ g y − x (t) .
2. Take
( n o
f (yt )
yt with probability min , 1 ,
x (t+1) = f (x (t) )
x (t) otherwise.
276 / 417
Example. Random walk normal generator
Hastings [1970]: generate a sample from N (0, 1) using random
walk with U(−δ, δ) errors.
Acceptance probability:
ρ(x^(t), yt) = exp(((x^(t))² − yt²)/2) ∧ 1.
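A short R sketch of this random walk sampler; δ = 1 below is an arbitrary choice, not necessarily the value behind the figures:

rw_mh_normal <- function(n, delta = 1, x0 = 0) {
  x <- numeric(n); x[1] <- x0
  for (t in 1:(n - 1)) {
    y <- x[t] + runif(1, -delta, delta)            # uniform random walk proposal
    if (runif(1) <= exp((x[t]^2 - y^2) / 2)) x[t + 1] <- y else x[t + 1] <- x[t]
  }
  x
}
path <- rw_mh_normal(10000)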
θ(x) := (d/dx) log f(x)
is defined for all sufficiently large x and
lim_{x→∞} θ(x) = θ∞
286 / 417
Acceptance rate of random walk MH algorithm
High acceptance rate might indicate that the algorithm moves
“slowly”.
If x^(t) and yt are close to each other, one has
ρ(x^(t), yt) = min{ f(yt)/f(x^(t)), 1 } ≈ 1.
Limited moves on the support of f . Problematic e.g. for multimo-
dal densities where the modes are separated with zones of small
probability. If the variance of g is small, difficult to jump from one
mode to the other.
Rao-Blackwellized estimator:
T T
X
RB 1 X
δT = h(Yt )E I{X (t) =Yt } Y1 , . . . YT
T
t=1 j=t
T T
X
1 X
(t)
= h(Yt ) P X = Yt Y1 , . . . YT .
T
t=1 j=t
289 / 417
Independent Metropolis-Hastings algorithm
Define
f (Yi ) ωj
τi := P X (i) = Yi Y1 , . . . Yi ,
ωi := , %ij := ∧ 1,
g (Yi ) ωi
j
Y
ζii := 1, ζij := (1 − %ik ) (i < j).
k=i+1
290 / 417
Langevin algorithm
f : PDF corresponding to probability distribution π.
Wt : standard Brownian motion (Wiener process).
Lt : Langevin diffusion satisfying SDE
dLt = (σ²/2) ∇ log f(Lt) dt + σ dWt
has stationary distribution π. Discretization:
L_{t+1} − L_t = (σ²/2) ∇ log f(L_t) + σ ε_t,   ε_t ∼ N(0, 1).
[Grenander & Miller, 1994; Roberts & Stramer, 2002]
Metropolis–Hastings proposal corresponding to candidate PDF g:
Y_t = X^(t) + (σ²/2) ∇ log f(X^(t)) + σ ε_t,   ε_t ∼ g.
Acceptance probability:
ρ(x, y) := min{ [f(y) g((x − y)/σ − σ ∇ log f(y)/2)] / [f(x) g((y − x)/σ − σ ∇ log f(x)/2)], 1 }.
291 / 417
Properties
Metropolis-Hastings proposal corresponding to candidate PDF g :
σ2
Yt = X (t) + ∇ log f X (t) + σεt ,
εt ∼ g .
2
Besag [1994]: g is the PDF of N (0, 1).
Metropolis-adjusted Langevin algorithm [Roberts & Tweedie, 1996].
Theorem. [Roberts & Tweedie, 1996] If ∇ log f (x) → 0 as
kxk → ∞ then the Metropolis-adjusted Langevin algorithm is not
geometrically ergodic.
Optimal acceptance rate: 0.547 [Roberts & Rosenthal, 1998].
Example. f : PDF of N (0, 1).
Metropolis-adjusted Langevin algorithm:
Yt |X (t) = x (t) ∼ N x (t) 1 − σ 2 /2 , σ 2 .
Random walk Metropolis Hastings with the same error distribution:
Yt |X (t) = x (t) ∼ N x (t) , σ 2 .
292 / 417
Example. Non-identifiable normal model
Consider model
Z ∼ N((θ1 + θ2)/2, 1)   with priors θ1, θ2 ∼ N(0, 1).
Posterior distribution:
f(θ1, θ2 | z) ∝ exp{ −(1/2) [ (5/4)θ1² + (5/4)θ2² + (1/2)θ1θ2 − (θ1 + θ2) z ] }.
Candidate distribution for given (θ1^(t), θ2^(t)):
N2( (σ²/2) ( (2z − θ2^(t) − 5θ1^(t))/4 , (2z − θ1^(t) − 5θ2^(t))/4 ) + (θ1^(t), θ2^(t)),  σ² I2 ).
Theorem. Simulating
X ∼ f (x)
is equivalent to simulating
(X , U) ∼ U S(f ) .
296 / 417
MCMC approach
Aim: for a given PDF f (x) simulate values from uniform distribu-
tion on
S(f ) := (x, u) : 0 < u < f (x) .
MCMC approach: generate a Markov chain with stationary distri-
bution U S(f ) .
Natural solution: use a random walk on S(f ). Go only one direc-
tion at a time. Single-variable slice sampler [Neal, 2003].
Starting point: (x, u) ∈ S(f ).
I The move along the u-axis corresponds to conditional distri-
bution
U|X = x ∼ U u : u ≤ f (x) .
Results a change from (x, u) to (x, u 0 ) ∈ S(f ).
I The subsequent move along the x-axis corresponds to conditi-
onal distribution
X |U = u 0 ∼ U x : u 0 ≤ f (x) .
f(x) = C f1(x).
Step 1: U^(t+1) | X^(t) = x ∼ U(0, f1(x)), so that
(X^(t), U^(t+1)) ∼ f(x) I_{[0, f1(x)]}(u)/f1(x) ∝ I{0 ≤ u ≤ f1(x)}.
Step 2: X^(t+1) | U^(t+1) = u ∼ U(A^(t+1)), where
A^(t+1) := {x : u ≤ f1(x)}.
301 / 417
Product slice sampler
Algorithm
Given x (t) ,
(t+1)
1. Generate u1 ∼ U 0, f1 x (t) .
.
.
.
(t+1)
k. Generate uk ∼ U 0, fk x (t) .
k+1. Generate x (t+1) ∼ U A(t+1) , where
n o
(t+1)
A(t+1) := x : ui ≤ fi (x), i = 1, 2, . . . , k .
Decomposition:
f1 (x) := 1+sin2 (3x) , f2 (x) := 1+cos4 (5x) , f3 (x) := exp −x 2 /2 .
-4 -2 0 2 4
X
with
A(u) := x : u ≤ f1 (x) .
Transition kernel for f1 X (t) :
ZZ dudx
(t+1) (t)
P f1 X ≤ y f1 X =v =
I{u≤f1 (x)≤y } I{0≤u≤v }
λ A(u) v
1 y ∧v λ A(u) −λ A(y )
1 v λ A(y )
Z Z
= du = max 1− , 0 du.
v 0 λ A(u) v 0 λ A(u)
Sequence plot and ACF for the series produced by the slice sampler correspon-
ding to f1 (z) := z d−1 exp(−z) for d = 1, 5, 10, 50. 308 / 417
Connection to slice sampling
Fundamental theorem: simulation from the joint PDF f (x, y ) is
equivalent to simulation from uniform distribution on
S(f ) := (x, y , u) : 0 ≤ u ≤ f (x, y ) .
Starting from (x, y , u) ∈ S(f ) sample one component at a time:
1. Generate a realization x 0 of X from U x : u ≤ f (x, y ) .
Algorithm
0. Take X (0) = x (0) .
Given x (t) , y (t) ,
1. Generate y (t+1) ∼ fY |X y x (t) .
2. Generate x (t+1) ∼ fX |Y x y (t+1) .
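A minimal R illustration of this two-stage sampler for an assumed bivariate normal target with correlation ρ (a standard toy example, not one taken from these slides); both full conditionals are N(ρ · other, 1 − ρ²):

gibbs_binorm <- function(n, rho = 0.8) {
  x <- y <- numeric(n)
  for (t in 1:(n - 1)) {
    y[t + 1] <- rnorm(1, rho * x[t],     sqrt(1 - rho^2))  # Y | X = x
    x[t + 1] <- rnorm(1, rho * y[t + 1], sqrt(1 - rho^2))  # X | Y = y
  }
  cbind(x, y)
}
chain <- gibbs_binorm(5000)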
Transition kernel density of X(t):

    κ(x, x′) = ∫ fU|X(u|x) fX|U(x′|u) du.

The chains X(t) and Y(t) are said to be conjugate to each other with the interleaving property if
  (c) under stationarity (X(t), Y(t−1)) and (X(t), Y(t)) are identically distributed.

Theorem. Each chain produced by the two-stage Gibbs sampler is reversible, and the two chains are interleaving.
320 / 417
Proof
Transition kernel density of X(t): κX(x, x′) = ∫ fY|X(y|x) fX|Y(x′|y) dy.

    fX(x) κX(x, x′) = fX(x) ∫ fY|X(y|x) fX|Y(x′|y) dy = ∫ f(x, y) fX|Y(x′|y) dy
                    = ∫ f(x, y) fY|X(y|x′) fX(x′) / fY(y) dy = fX(x′) κX(x′, x).

The detailed balance equation holds, so X(t) is reversible. A similar argument applies to Y(t).
Conditions (a) and (b) follow directly from the generating mechanism of the Gibbs sampler. Under stationarity the joint CDF of (X(t−1), Y(t−1)) is

    P( X(t−1) < x, Y(t−1) < y ) = ∫_(−∞)^x ∫_(−∞)^y f(u, v) dv du.

Since

    fX|Y(x|y) fY(y) = f(x, y),

the random vectors (X(t−1), Y(t−1)) and (X(t), Y(t−1)) are identically distributed, implying condition (c). Hence, X(t) and Y(t) are interleaving.    2
321 / 417
Reversible two-stage Gibbs sampler
Algorithm  Given y(t),
  1. Generate ω ∼ fX|Y(x | y(t)).
  2. Generate y(t+1) ∼ fY|X(y | ω).
  3. Generate x(t+1) ∼ fX|Y(x | y(t+1)).

Proof of reversibility. With g(x, y, x′, y′) := f(x, y) ∫ fX|Y(ω|y) fY|X(y′|ω) dω · fX|Y(x′|y′),

    g(x, y, x′, y′) = [f(x′, y′)/fY(y′)] ∫ fX|Y(ω|y) fY|X(y′|ω) dω · fX|Y(x|y) fY(y)
                    = f(x′, y′) ∫ f(ω, y) [fY|X(y′|ω)/fY(y′)] dω · fX|Y(x|y)
                    = f(x′, y′) ∫ fX|Y(ω|y′) fY|X(y|ω) dω · fX|Y(x|y) = g(x′, y′, x, y),

implying reversibility.    2
322 / 417
Connection to EM algorithm
X = (X1 , X2 , . . . , Xn ): observed data from joint PDF g (x|θ), where
    g(x|θ) = ∫ f(x, z|θ) dz.

Z ∈ Z: unobserved (missing) data vector.
Complete data likelihood: LC(θ|x, z) = f(x, z|θ).
Incomplete data likelihood: L(θ|x) = g(x|θ).
Missing data density: k(z|θ, x) = LC(θ|x, z)/L(θ|x).

    L*(θ|x, z) := LC(θ|x, z) / ∫ LC(θ|x, z) dθ,    if ∫ LC(θ|x, z) dθ < ∞.

Two-stage Gibbs sampler:

    Z | Θ = θ ∼ k(z|θ, x),    Θ | Z = z ∼ L*(θ|x, z).

The E-step, which often reduces to the calculation of Eθ[Z | X = x], is replaced by simulation from k(z|θ, x).
The M-step, which is a maximization in θ, is replaced by simulation from L*(θ|x, z).    323 / 417
Example. Gibbs sampler for censored normal data
Data: X1 , X2 , . . . , Xn ∼ N (θ, 1) censored at a.
Observed: x = (x1 , x2 , . . . , xm ); unobserved: z = (zm+1 , . . . , zn ).
Distribution of the missing data:

    Zi | Θ = θ ∼ φ(z − θ) / (1 − Φ(a − θ)),    z ≥ a,    i = m + 1, m + 2, . . . , n.

Conditional posterior of θ:

    π(θ | x, z) ∝ exp{ −Σ_{j=1}^m (xj − θ)²/2 − Σ_{j=m+1}^n (zj − θ)²/2 },

which corresponds to

    N( (m x̄ + (n − m) z̄)/n, 1/n ).
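A sketch of the resulting Gibbs sampler in R; the observed values x, the censoring point a and the total sample size n are illustrative placeholders, and the truncated normal draws use CDF inversion.

## Gibbs sampler for censored normal data (illustrative sketch)
set.seed(5)
a <- 4.5
x <- c(3.6, 3.9, 4.1, 4.2, 4.3, 4.4, 4.0, 3.8)   # hypothetical uncensored observations
m <- length(x); n <- 12                          # n - m = 4 censored observations
T <- 5000
theta <- numeric(T); theta[1] <- mean(x)
for (t in 2:T) {
  ## Z_i | theta ~ N(theta, 1) truncated to [a, Inf), by inversion of the CDF
  z <- theta[t - 1] + qnorm(runif(n - m, pnorm(a - theta[t - 1]), 1))
  ## theta | x, z ~ N((m*xbar + (n - m)*zbar)/n, 1/n)
  theta[t] <- rnorm(1, (m * mean(x) + (n - m) * mean(z)) / n, sqrt(1 / n))
}
hist(theta, breaks = 40, freq = FALSE)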
[Figure: histograms of the simulated θ and Z values.]
Invariant distribution of Θ(t): the incomplete data likelihood L(θ|x), that is,

    L(θ′|x) = ∫ κ(θ, θ′|x) L(θ|x) dθ.

Algorithm
Given x(t) = (x1(t), . . . , xp(t)),
  1. Generate x1(t+1) ∼ f1(x1 | x2(t), . . . , xp(t)).
  2. Generate x2(t+1) ∼ f2(x2 | x1(t+1), x3(t), . . . , xp(t)).
     ...
  p. Generate xp(t+1) ∼ fp(xp | x1(t+1), . . . , xp−1(t+1)).
All simulations can be univariate.
328 / 417
Example. Autoexponential model
Joint PDF [Besag, 1974]:
    f(x1, x2, x3) ∝ exp{ −(x1 + x2 + x3 + θ12 x1x2 + θ23 x2x3 + θ31 x3x1) },    x1, x2, x3 > 0.
329 / 417
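The full conditionals of the autoexponential model are exponential, e.g. X1 | x2, x3 ∼ Exp(1 + θ12 x2 + θ31 x3). A minimal Gibbs sampler sketch in R, with illustrative parameter values θ12 = θ23 = θ31 = 0.5:

## Gibbs sampler for the autoexponential model (illustrative sketch)
set.seed(6)
th12 <- th23 <- th31 <- 0.5
T <- 10000
x <- matrix(1, nrow = T, ncol = 3)
for (t in 2:T) {
  x1 <- rexp(1, 1 + th12 * x[t - 1, 2] + th31 * x[t - 1, 3])
  x2 <- rexp(1, 1 + th12 * x1          + th23 * x[t - 1, 3])
  x3 <- rexp(1, 1 + th23 * x2          + th31 * x1)
  x[t, ] <- c(x1, x2, x3)
}
colMeans(x)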
Example. Ising model
X : D × D binary matrices with entries +1 or −1.
Joint PDF (Boltzmann distribution):
    f(x) ∝ exp{ −J Σ_{(i,j)∈N} xi xj − H Σ_i xi },    xi ∈ {−1, 1}.
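With this sign convention the single-site (heat bath) conditional is P(xi = +1 | rest) = 1/(1 + exp(2(J S + H))), where S is the sum of the neighbouring spins. A sketch of one possible Gibbs implementation on a D × D torus; D, J, H and the number of sweeps are illustrative choices.

## Single-site Gibbs updates for the Ising model (illustrative sketch)
set.seed(7)
D <- 20; J <- -0.4; H <- 0            # with the minus signs above, negative J favours aligned spins
x <- matrix(sample(c(-1, 1), D * D, replace = TRUE), D, D)
for (sweep in 1:200) {
  for (i in 1:D) for (j in 1:D) {
    S <- x[ifelse(i == 1, D, i - 1), j] + x[ifelse(i == D, 1, i + 1), j] +
         x[i, ifelse(j == 1, D, j - 1)] + x[i, ifelse(j == D, 1, j + 1)]   # periodic boundary
    p.plus <- 1 / (1 + exp(2 * (J * S + H)))
    x[i, j] <- if (runif(1) < p.plus) 1 else -1
  }
}
mean(x)   # average magnetization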
Algorithm
Given (y1(t), . . . , yp(t)),
  1. Generate a permutation σ = (σ1, σ2, . . . , σp) of {1, 2, . . . , p}.
  2. Generate yσ1(t+1) ∼ gσ1(yσ1 | yj(t), j ≠ σ1).
     ...
  p+1. Generate yσp(t+1) ∼ gσp(yσp | yj(t+1), j ≠ σp).
The resulting chain has stationary distribution f.
334 / 417
The general Hammersley-Clifford theorem
Theorem. Under the positivity condition the joint PDF g satisfies

    g(y1, y2, . . . , yp) ∝ Π_{j=1}^p [ gσj(yσj | yσ1, . . . , yσ(j−1), y′σ(j+1), . . . , y′σp)
                                      / gσj(y′σj | yσ1, . . . , yσ(j−1), y′σ(j+1), . . . , y′σp) ]

for every permutation σ and every fixed y′.

Theorem. The density g is the PDF of the stationary distribution for the Markov chain Y(t) produced by the completion Gibbs sampler, and if Y(t) is ergodic, then the limiting distribution of the corresponding subchain X(t) is the marginal density f(x) of g(y).

Proof. Transition kernel density:

    κ(y, y′) = g1(y′1 | y2, . . . , yp) g2(y′2 | y′1, y3, . . . , yp) · · · gp(y′p | y′1, . . . , y′p−1).

Denote

    g(j)(y1, . . . , yj−1, yj+1, . . . , yp) := ∫ g(y1, y2, . . . , yp) dyj,    j = 1, 2, . . . , p.
336 / 417
Proof
Let Y(t) ∼ g and A be measurable.

    P( Y(t+1) ∈ A ) = ∫ IA(y′) κ(y, y′) g(y) dy dy′
                    = ∫ IA(y′) g1(y′1 | y2, . . . , yp) · · · gp(y′p | y′1, . . . , y′p−1) g(y) dy dy′.
337 / 417
Proof
we obtain

    P( Y(t+1) ∈ A ) = ∫ IA(y′) g3(y′3 | y′1, y′2, y4, . . . , yp) · · · gp(y′p | y′1, . . . , y′p−1)
                          × g(y′1, y′2, y3, . . . , yp) dy3 · · · dyp dy′.

Finally,

    P( Y(t+1) ∈ A ) = ∫_A g(y′1, y′2, . . . , y′p) dy′1 dy′2 · · · dy′p,

so g is the stationary distribution of Y(t), and the corresponding subchain X(t) has PDF f.    2
338 / 417
Recurrence, ergodicity
Theorem. If the joint PDF g corresponding to the completion
Gibbs sampler satisfies the positivity condition, then the generated
Markov chain is irreducible with respect to the invariant measure.
Theorem. Under positivity condition, if the transition kernel of
the completion Gibbs sampler is absolutely continuous with respect
to the invariant measure corresponding to g , then the generated
Markov chain is Harris recurrent and ergodic.
Theorem. Assume the transition kernel of the Gibbs chain Y(t)
Proof. Let i ∈ {1, 2, . . . , p}. The ith step of the Gibbs sampler from vector y = (y1, y2, . . . , yp) generates y′ = (y′1, y′2, . . . , y′p) using candidate density qi(y′|y) := gi(y′i | yj, j ≠ i), with y′j = yj for j ≠ i.
Acceptance probability:

    ϱ(y, y′) = g(y′) qi(y|y′) / [g(y) qi(y′|y)]
             = g(y′) qi(yi | y1, . . . , yi−1, yi+1, . . . , yp) / [g(y) qi(y′i | y1, . . . , yi−1, yi+1, . . . , yp)]
             = [qi(y′i | y1, . . . , yi−1, yi+1, . . . , yp) qi(yi | y1, . . . , yi−1, yi+1, . . . , yp)]
               / [qi(yi | y1, . . . , yi−1, yi+1, . . . , yp) qi(y′i | y1, . . . , yi−1, yi+1, . . . , yp)] = 1.    2
Acceptance probability:
    ϱ((y1, y2), (y′1, y′2)) = g(y′1, y′2) κ((y′1, y′2), (y1, y2)) / [ g(y1, y2) κ((y1, y2), (y′1, y′2)) ].
2. Accept-Reject method.
Candidate: Exp(λ) with PDF g(x) := λ e^(−λx) I{x > 0}. Upper bound:

    f(x)/g(x) ≤ [√(2π) σ Φ(µ/σ) λ]^(−1) exp(λµ + λ²σ²/2).

Optimal choice of the parameter λ:    λ* := (−µ + √(µ² + 4σ²)) / (2σ²).
345 / 417
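An Accept-Reject sketch in R for N(µ, σ²) truncated to (0, ∞) with the Exp(λ*) candidate; the acceptance test simplifies to U ≤ exp(−(x − µ − λσ²)²/(2σ²)). The values of µ and σ are illustrative.

## Accept-Reject for a truncated normal with an exponential candidate (illustrative sketch)
set.seed(8)
rtnorm.ar <- function(n, mu, sigma) {
  lambda <- (-mu + sqrt(mu^2 + 4 * sigma^2)) / (2 * sigma^2)   # optimal rate lambda*
  out <- numeric(n)
  for (i in 1:n) {
    repeat {
      x <- rexp(1, lambda)
      if (runif(1) <= exp(-(x - mu - lambda * sigma^2)^2 / (2 * sigma^2))) { out[i] <- x; break }
    }
  }
  out
}
y <- rtnorm.ar(10000, mu = -5, sigma = 1)
mean(y)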
Case study. Bivariate truncated normal
Example. µ1 = −5, µ2 = 2, σ1² = 1, σ2² = 2, σ12 = 0.8.
Acceptance probability for rejection sampling: 2.8665 × 10⁻⁷.
Full conditionals:
    λi | β, ti, xi ∼ Ga(xi + α, ti + β),    i = 1, 2, . . . , 10;

    β | λ1, . . . , λ10 ∼ Ga( γ + 10α, δ + Σ_{j=1}^{10} λj ).

    κ(β, β′) = ∫ [ (δ + Σ_{j=1}^{10} λj)^(γ+10α) (β′)^(γ+10α−1) / Γ(γ + 10α) ] exp( −β′(δ + Σ_{j=1}^{10} λj) )
                 × Π_{i=1}^{10} [ (ti + β)^(xi+α) / Γ(xi + α) ] λi^(xi+α−1) exp( −λi(ti + β) ) dλ1 · · · dλ10
             ≥ [ δ^(γ+10α) (β′)^(γ+10α−1) / Γ(γ + 10α) ] exp(−β′δ) Π_{i=1}^{10} ( ti/(ti + β′) )^(xi+α).

The entire space X = R+ is a small set for the transition kernel with density κ(β, β′): β(t) is uniformly ergodic.
Uniform ergodicity of λ(t) can be derived from the uniform ergodicity of β(t).
350 / 417
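A Gibbs sampler sketch in R for these full conditionals. The failure counts x, the operating times t and the hyperparameters α, γ, δ below are illustrative values loosely based on the classical pump failure data, not necessarily those used on the slides.

## Gibbs sampler for the pump failure hierarchical model (illustrative sketch)
set.seed(9)
x <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
t.obs <- c(94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5)
alpha <- 1.8; gamma <- 0.01; delta <- 1
T <- 5000
beta <- numeric(T); beta[1] <- 1
lambda <- matrix(0, T, 10)
for (s in 2:T) {
  lambda[s, ] <- rgamma(10, shape = x + alpha, rate = t.obs + beta[s - 1])           # lambda_i | beta
  beta[s] <- rgamma(1, shape = gamma + 10 * alpha, rate = delta + sum(lambda[s, ]))  # beta | lambda
}
mean(beta); colMeans(lambda)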
Classical example. Nuclear pump failures
[Figure: histograms (Density) and ACF plots of the simulated λ1, λ2 and β chains.]
Metropolis-Hastings:
I Candidate densities, at best, are approximations of the target
density f .
I Some generated values might be rejected. If the candidate differs greatly (e.g. in spread) from the target, the performance is poor.
I More flexibility in generating new values.
352 / 417
Example. Bivariate normal mixture
Target distribution: mixture of bivariate normals
    (Y1, Y2)⊤ ∼ p1 N2(µ1, Σ1) + p2 N2(µ2, Σ2) + p3 N2(µ3, Σ3),

    µi = (µi,1, µi,2)⊤,    Σi = [[σi,1², σi,12], [σi,12, σi,2²]],    i = 1, 2, 3.

Marginals:

    Yk ∼ p1 N(µ1,k, σ1,k²) + p2 N(µ2,k, σ2,k²) + p3 N(µ3,k, σ3,k²),    k = 1, 2.

Full conditionals:

    Y1 | Y2 = y2 ∼ Σ_{j=1}^3 qj,2(y2) N( µj,1 + (σj,12/σj,2²)(y2 − µj,2), det(Σj)/σj,2² ),

    Y2 | Y1 = y1 ∼ Σ_{j=1}^3 qj,1(y1) N( µj,2 + (σj,12/σj,1²)(y1 − µj,1), det(Σj)/σj,1² ),

    qi,k(y) := [ (pi/σi,k) φ((y − µi,k)/σi,k) ] / [ Σ_{j=1}^3 (pj/σj,k) φ((y − µj,k)/σj,k) ],    i = 1, 2, 3,  k = 1, 2.
353 / 417
Example. Bivariate normal mixture
Weights: p1 = 0.3, p2 = 0.45, p3 = 0.25.
Means: µ1 = [−4, 4]⊤, µ2 = [−1, −2]⊤, µ3 = [0, 0]⊤.
Covariance matrices:

    Σ1 = [[1, −0.5], [−0.5, 0.3]],    Σ2 = [[1, 1.2], [1.2, 1.5]],    Σ3 = [[1, −0.7], [−0.7, 0.6]].
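A Gibbs sampler sketch in R using the full conditionals of the previous slide with these weights, means and covariance matrices; the chain length is an illustrative choice.

## Gibbs sampler for the three-component bivariate normal mixture (illustrative sketch)
set.seed(10)
p  <- c(0.3, 0.45, 0.25)
mu <- rbind(c(-4, 4), c(-1, -2), c(0, 0))
Sigma <- list(matrix(c(1, -0.5, -0.5, 0.3), 2),
              matrix(c(1, 1.2, 1.2, 1.5), 2),
              matrix(c(1, -0.7, -0.7, 0.6), 2))
qw <- function(y, k) {            # mixture weights q_{j,k}(y) of the conditional
  w <- sapply(1:3, function(j) p[j] * dnorm(y, mu[j, k], sqrt(Sigma[[j]][k, k])))
  w / sum(w)
}
cond.draw <- function(y, k) {     # draw coordinate 3 - k given coordinate k equals y
  j <- sample(1:3, 1, prob = qw(y, k))
  m <- mu[j, 3 - k] + Sigma[[j]][1, 2] / Sigma[[j]][k, k] * (y - mu[j, k])
  v <- det(Sigma[[j]]) / Sigma[[j]][k, k]
  rnorm(1, m, sqrt(v))
}
T <- 10000
Y <- matrix(0, T, 2)
for (t in 2:T) {
  Y[t, 1] <- cond.draw(Y[t - 1, 2], 2)   # Y1 | Y2
  Y[t, 2] <- cond.draw(Y[t, 1], 1)       # Y2 | Y1
}
colMeans(Y)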
The first 300 successive moves of the Gibbs chain on the surface of the target
distribution. 355 / 417
Example. Bivariate normal mixture
10000 points generated by the Gibbs sampler on the surface of the target distribution.    356 / 417
Example. Bad choice of coordinates for the Gibbs sampler
E, E′: disks in R² with radius 1 and centers (1, 1) and (−1, −1).
f: uniform PDF on E ∪ E′, that is

    f(y1, y2) = (1/(2π)) [ IE(y1, y2) + IE′(y1, y2) ].

Full conditionals:

    Y1 | Y2 = y2 ∼ U( −1 − √(1 − (1 + y2)²), −1 + √(1 − (1 + y2)²) )   if y2 < 0;
                   U(  1 − √(1 − (1 − y2)²),  1 + √(1 − (1 − y2)²) )   if y2 ≥ 0;

    Y2 | Y1 = y1 ∼ U( −1 − √(1 − (1 + y1)²), −1 + √(1 − (1 + y1)²) )   if y1 < 0;
                   U(  1 − √(1 − (1 − y1)²),  1 + √(1 − (1 − y1)²) )   if y1 ≥ 0.

A chain started in one disk can therefore never reach the other: this choice of coordinates makes the Gibbs sampler reducible.
363 / 417
Metropolizing the Gibbs sampler
Y = (Y1 , Y2 , . . . Yp ): discrete random vector with distribution π.
Y(1) , Y(2) , . . . , Y(T ) : a sample from Y generated by an MCMC
method with transition matrix Q for approximation of
    J := E h(Y)    with    JT := (1/T) Σ_{t=1}^T h(Y(t)).

Define:

    ν(h, π, Q) := lim_{T→∞} T Var(JT).

Definition. Let Q1 and Q2 be transition matrices. We say that Q1 ⪯ Q2 if the off-diagonal elements of Q2 are larger than those of Q1.

Theorem. [Peskun, 1973] Suppose each of the irreducible transition matrices Q1 and Q2 is reversible for the same distribution π. If Q1 ⪯ Q2, then

    ν(h, π, Q1) ≥ ν(h, π, Q2).
364 / 417
Randomized Gibbs sampler
Algorithm
Given y(t) = (y1(t), . . . , yp(t)),
  1. Generate a coordinate k from a pre-assigned distribution (q1, q2, . . . , qp).
  2. Generate yk(t+1) ∼ πk(yk | yj(t), j ≠ k) and set yj(t+1) = yj(t) for j ≠ k.
If we force the generation of yk(t+1) ≠ yk(t), an additional Metropolis-Hastings acceptance-rejection step is needed to correct the possible deviation from the target distribution.
365 / 417
Metropolizing the Randomized Gibbs sampler
Metropolized version of the Randomized Gibbs sampler [Liu, 1995]:
Algorithm
Given y(t) = (y1(t), . . . , yp(t)),
  1. Generate a coordinate k from a pre-assigned distribution (q1, q2, . . . , qp).
  2. Generate ωk ≠ yk(t) with probability

         πk(ωk | yj(t), j ≠ k) / [ 1 − πk(yk(t) | yj(t), j ≠ k) ].

  3. Accept yk(t+1) = ωk with probability

         min{ [ 1 − πk(yk(t) | yj(t), j ≠ k) ] / [ 1 − πk(ωk | yj(t), j ≠ k) ], 1 }.
366 / 417
Transition matrices
(q1, q2, . . . , qp): distribution of the coordinates.

    ω := (y1(t), . . . , yk−1(t), ωk, yk+1(t), . . . , yp(t)),    ωk ≠ yk(t).

Transition probability of the metropolized sampler:

    Q2(y(t), ω) = qk · πk(ωk | yj(t), j ≠ k) / [ 1 − πk(yk(t) | yj(t), j ≠ k) ]
                     · min{ [ 1 − πk(yk(t) | yj(t), j ≠ k) ] / [ 1 − πk(ωk | yj(t), j ≠ k) ], 1 }
                = qk min{ πk(ωk | yj(t), j ≠ k) / [ 1 − πk(yk(t) | yj(t), j ≠ k) ],
                          πk(ωk | yj(t), j ≠ k) / [ 1 − πk(ωk | yj(t), j ≠ k) ] }.

For the randomized Gibbs sampler Q1(y(t), ω) = qk πk(ωk | yj(t), j ≠ k) ≤ Q2(y(t), ω), so Q1 ⪯ Q2 and Peskun's theorem gives

    ν(h, π, Q1) ≥ ν(h, π, Q2).
367 / 417
Example. Beta-Binomial combination
Consider X | θ ∼ Bin(n, θ) with θ ∼ Be(a, b).
Joint PDF:

    f(x, θ) = (n choose x) [Γ(a + b)/(Γ(a)Γ(b))] θ^(x+a−1) (1 − θ)^(n−x+b−1),    x = 0, 1, . . . , n.
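The two full conditionals are X | θ ∼ Bin(n, θ) and θ | x ∼ Be(x + a, n − x + b), giving a particularly simple Gibbs sampler; n = 16, a = 2 and b = 4 below are illustrative values.

## Gibbs sampler for the Beta-Binomial pair (illustrative sketch)
set.seed(11)
n <- 16; a <- 2; b <- 4
T <- 10000
X <- integer(T); theta <- numeric(T); theta[1] <- 0.5
for (t in 2:T) {
  X[t]     <- rbinom(1, n, theta[t - 1])
  theta[t] <- rbeta(1, X[t] + a, n - X[t] + b)
}
table(X) / T   # estimate of the Beta-Binomial marginal of X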
    Y | X = x ∼ N(ϱx, 1 − ϱ²),    X | Y = y ∼ N(ϱy, 1 − ϱ²).
Hierarchical centering:
    Yk,ℓ = ηk + εk,ℓ,    ηk ∼ N(µ, σα²).

Posterior correlations of parameters:

    ϱµ,ηk = −(1 + MNσα²/σy²)^(−1/2),    ϱηk,ηℓ = (1 + MNσα²/σy²)^(−1),    k ≠ ℓ.    374 / 417
Example. Random effect model
Simple random effect model:
Yk,` = µ + αk + εk,` , k = 1, 2, . . . , M, ` = 1, 2, . . . , N.
Error terms: εk,ℓ ∼ N(0, σy²), where the variance σy² is known.
Sweeping [Vines, Gilks & Wild, 1995; Gilks & Roberts, 1996].
Take φk := αk − ᾱ and ν := µ + ᾱ, where ᾱ := M⁻¹ Σ_{k=1}^M αk. Then

    Yk,ℓ = ν + φk + εk,ℓ,    k = 1, 2, . . . , M,  ℓ = 1, 2, . . . , N.

Parameters:

    (φ1, φ2, . . . , φM−1)⊤ ∼ NM−1(0, σα² ΣM−1),    φM = −Σ_{k=1}^{M−1} φk.

ΣM−1: an (M − 1) × (M − 1) matrix with 1 − 1/M on the main diagonal and −1/M everywhere else. No prior independence.
Posterior correlations of parameters:

    ϱν,φk = 0,    ϱφk,φℓ = −1/M.    375 / 417
Improper priors
Blind use of the Gibbs sampler: focusing only on the conditional PDFs.
Improper prior: instead of a probability measure, the density of the prior corresponds to a σ-finite measure [Robert, 2007].
In case of improper priors the conditional PDFs may not correspond to any joint PDF. The function given by the Hammersley-Clifford theorem is not integrable. Propriety of the joint PDF should be checked.
Full conditionals:

    π(µ | σ, x) ∝ e^(−(x−µ)²/(2σ²)),    π(σ | µ, x) ∝ (1/σ³) e^(−(x−µ)²/(2σ²)),

corresponding to N(x, σ²) for µ and Exp((x − µ)²/2) for 1/σ².
377 / 417
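A short R sketch of the Gibbs sampler built on these two conditionals; since they do not correspond to any proper joint PDF, the trace of log(σ) drifts off instead of stabilizing. The observation x, the starting values and the chain length are illustrative.

## Gibbs sampler based on incompatible (improper) conditionals (illustrative sketch)
set.seed(12)
x <- 3
T <- 2000
mu <- sigma <- numeric(T)
mu[1] <- x; sigma[1] <- 1
for (t in 2:T) {
  mu[t]    <- rnorm(1, x, sigma[t - 1])                     # mu | sigma, x ~ N(x, sigma^2)
  sigma[t] <- 1 / sqrt(rexp(1, rate = (x - mu[t])^2 / 2))   # 1/sigma^2 | mu, x ~ Exp((x - mu)^2/2)
}
plot(log(sigma), type = "l")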
Example. Normal improper posterior
[Figure: trace plots of log(|θ|) and log(σ), and a histogram of the velocity data (km/s).]    380 / 417
Bayesian model choice
Definition. A Bayesian variable dimensional model is defined as a
collection of models
    Mk = { pk(·|θk); θk ∈ Θk },    k = 1, 2, . . . , K,

associated with a collection of priors πk(θk) on the parameters of the models and a prior distribution on the indices of these models {ϱ(k), k = 1, 2, . . . , K}.
π(k, θk) := ϱ(k)πk(θk): a density on the parameter space

    Θ := ⋃_k {k} × Θk.

One can select the model with the highest posterior probability or use the model average as the predictive PDF.
381 / 417
Possible problems with the model choice
Models should be coherent. E.g. if M1 = M2 ∪ M3, a natural requirement can be p(M1 | y) = p(M2 | y) + p(M3 | y).
The coefficients of each model should be considered as an entire set, and the coefficient sets of different models should be treated as separate entities.
Example. AR(p) process:

    Mp : Yt = Σ_{j=1}^p αjp Yt−j + σp εt.

The best fitting AR(p+1) model is not necessarily the best fitting AR(p) model with an additional parameter α(p+1)(p+1): the variances σp² might differ and the parameters are not independent a posteriori.
Computational problems:
I Parameter spaces are often infinite dimensional.
I Computation of posterior quantities often involves integration over different parameter spaces.
I The representation of the parameter space requires more advanced MCMC methods.
382 / 417
Reversible jumps MCMC
Green’s Algorithm. [Green, 1995; Robert, 2007]
π: target distribution on Θ with PDF f (x), x = (k, θk ) ∈ Θ.
Aim: construction of an irreducible aperiodic transition kernel K
satisfying the detailed balance equation

    ∫_A ∫_B K(x, dy) π(dx) = ∫_B ∫_A K(y, dx) π(dy),    A, B ∈ B(Θ).

For this type of move Gk should have a density with respect to a measure on R³ concentrated on

    { (θ, θ(1), θ(2)) : θ = (θ(1) + θ(2))/2 }.

The reversed move from C1 to C2 should have a proposal distribution also concentrated on this set. E.g. given x = (1, θ), generate u from some distribution h independently of θ and set θ(1) = θ + u, θ(2) = θ − u.    387 / 417
Derivation of the symmetric measure
C1 = {1} × Θ1 , C2 = {2} × Θ2 : two parameter subspaces.
Dimensions: Θ1 ⊆ Rn1 and Θ2 ⊆ Rn2 .
q: a move which always switches between these subspaces.
q(x, C1 ) = 0, x ∈ C1 ; q(x, C2 ) = 0, x ∈ C2 .
pk` : probability of choosing a move from Ck to C` .
u12 , u21 : continuous random vectors with dimensions m1 and m2
generated from PDFs h12 and h21 independently from θ1 and θ2 .
Dimension matching:
n1 + m1 = n2 + m2 and there is a bijection T between (θ1, u12) and (θ2, u21).

Symmetric measure ξ: for A ⊆ C1, B ⊆ C2

    ξ(A × B) = ξ(B × A) := λ({ (θ1, u1) : θ1 ∈ A, θ2 ∈ B, (θ2, u2) = T(θ1, u1) }).
λ: Lebesgue measure on Rn1 +m1 .
For general A, B ∈ B(Θ):
ξ(A × B) := ξ (A ∩ C1 ) × (B ∩ C2 ) + ξ (A ∩ C2 ) × (B ∩ C1 ) .
388 / 417
Density of the symmetric measure
For A ⊆ C1 , B ⊆ C2 :
ξ(A × B) := λ (θ1 , u1 ) : θ1 ∈ A, θ2 ∈ B, (θ2 , u2 ) = T (θ1 , u1 ) .
Completions: u12 ∼ h12 , u21 ∼ h21 .
For x = (1, θ1) ∈ C1, y = (2, θ2) ∈ C2:

    g(x, y) := f(x) p12 h12(u12),    g(y, x) := f(y) p21 h21(u21) |det JT(y)|,

and g(x, y) := 0 otherwise.
JT(y): the Jacobian of the transformation T.
f: PDF of the target measure π.
g(x, y), x, y ∈ Θ, is the density with respect to ξ of the joint measure

    G(A × B) := ∫_A ∫_B π(dx) q(x, dy).

Acceptance probability of a move from x to y:

    ϱ(x, y) = min{ f(y) p21 h21(u21) |det JT(y)| / [ f(x) p12 h12(u12) ], 1 }.
389 / 417
Green’s algorithm
pk` : probability of choosing a move from model Mk to M` .
hk` : PDF of completion corresponding to move from Mk to M` .
Tk` : function specifying the move from Mk to M` .
Algorithm
Given x(t) = (k, θk(t)),
  1. Select a model Mℓ with probability pkℓ.
  2. Generate ukℓ ∼ hkℓ(u).
  3. Set (θℓ, uℓk) = Tkℓ(θk(t), ukℓ).
  4. Take θℓ(t+1) = θℓ with probability

         min{ f(ℓ, θℓ) pℓk hℓk(uℓk) / [ f(k, θk(t)) pkℓ hkℓ(ukℓ) ] · |det JTkℓ(θk(t), ukℓ)|, 1 },

     and take θk(t+1) = θk(t) otherwise.
390 / 417
Example
Subspaces:
Θ = C1 ∪ C 2 , C1 := {1}×R, C2 := {2}×R2 .
Moves
    T(θ(1), θ(2)) : (2, θ(1), θ(2)) ∈ C2 ↦ (1, (θ(1) + θ(2))/2) ∈ C1,

    T⁻¹(θ, u) : (1, θ, u) ∈ C1 ↦ (2, θ − u, θ + u) ∈ C2,    u ∼ h.

Jacobians:

    det JT(θ(1), θ(2)) = 1/2,    det JT⁻¹(θ, u) = 2.

Acceptance probability for a move from (1, θ) to (2, θ(1), θ(2)):

    min{ 2 f(2, θ(1), θ(2)) p21 / [ f(1, θ) p12 h(u) ], 1 },    u = (θ(2) − θ(1))/2.
p12 , p21 : probabilities of choosing moves C1 → C2 and C2 → C1 ,
respectively. 391 / 417
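A reversible jump sketch in R for these two subspaces, with a purely illustrative target, f(1, θ) = 0.5 ϕ(θ) on C1 and f(2, θ(1), θ(2)) = 0.5 ϕ(θ(1))ϕ(θ(2)) on C2, completion h = N(0, 1), p12 = p21 = 1/2, and an additional within-model random walk move so that the chain also mixes inside each subspace; all of these choices are assumptions, not taken from the slides.

## Reversible jump between {1} x R and {2} x R^2 (illustrative sketch)
set.seed(13)
f <- function(s) if (s$k == 1) 0.5 * dnorm(s$th) else 0.5 * prod(dnorm(s$th))
T <- 20000
state <- list(k = 1, th = 0)
ks <- integer(T)
for (t in 1:T) {
  if (runif(1) < 0.5) {                      # within-model random walk MH move
    prop <- state
    prop$th <- state$th + rnorm(length(state$th), 0, 0.5)
    if (runif(1) < f(prop) / f(state)) state <- prop
  } else if (state$k == 1) {                 # jump C1 -> C2: (theta, u) -> (theta - u, theta + u)
    u <- rnorm(1)
    prop <- list(k = 2, th = c(state$th - u, state$th + u))
    if (runif(1) < 2 * f(prop) / (f(state) * dnorm(u))) state <- prop    # Jacobian = 2
  } else {                                   # jump C2 -> C1: theta = (th1 + th2)/2, u = (th2 - th1)/2
    u <- (state$th[2] - state$th[1]) / 2
    prop <- list(k = 1, th = mean(state$th))
    if (runif(1) < f(prop) * dnorm(u) / (2 * f(state))) state <- prop    # Jacobian = 1/2
  }
  ks[t] <- state$k
}
table(ks) / T   # both models have posterior probability 1/2 for this toy target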
Fixed dimension reassessment
For a move from Mk to Mℓ translating θk ∈ Θk to θℓ ∈ Θℓ with nk := dim(Θk) < dim(Θℓ) =: nℓ, an equivalent move in fixed dimension can be described.
Green’s algorithm (assuming θ` should not be completed)
Add uk` ∈ Uk` to θk such that Θk × Uk` and Θ` are in bijection.
Metropolis-Hastings like step
A move between (θk , uk` ) and θ` with PDFs f (k, θk )hk` (uk` ) and
f (`, θ` ), respectively, with deterministic candidate distribution
θ` = Tk` (θk , uk` ).
Randomization
Candidate:
    θℓ ∼ Nnℓ( Tkℓ(θk, ukℓ), εInℓ ),    ε > 0.

Reciprocal candidate:

    (θk, ukℓ) = Tℓk(θ) = Tkℓ⁻¹(θ),    θ ∼ Nnℓ( θℓ, εInℓ ).
In : n × n unit matrix. 392 / 417
Acceptance probability
Candidate:

    θℓ ∼ Nnℓ( Tkℓ(θk, ukℓ), εInℓ ),    ε > 0.

PDF of the candidate:

    (2πε)^(−nℓ/2) exp( −‖θℓ − Tkℓ(θk, ukℓ)‖²/(2ε) ).

Reciprocal candidate:

    Tℓk(θ) = Tkℓ⁻¹(θ),    θ ∼ Nnℓ( θℓ, εInℓ ).

PDF of the reciprocal candidate:

    (2πε)^(−nℓ/2) exp( −‖θℓ − Tkℓ(θk, ukℓ)‖²/(2ε) ) |det JTkℓ(θk, ukℓ)|.
Possible moves:
(i) linear → linear: (α0, α1) → (α0′, α1′), where (α0′, α1′) ∼ f0 f1.
(ii) linear → quadratic: (α0, α1) → (α0, α1, α2′), where α2′ ∼ f2.
(iii) quadratic → quadratic: (α0, α1, α2) → (α0′, α1′, α2′), where (α0′, α1′, α2′) ∼ f0 f1 f2.
(iv) quadratic → linear: (α0, α1, α2) → (α0′, α1′), where (α0′, α1′) = (α0, α1).
399 / 417
Example. Gaussian mixture
Model [Richardson & Green, 1997]:
    Mk : Yi ∼ Σ_{j=1}^k (ωj,k/σj,k) φ((x − µj,k)/σj,k),    i = 1, 2, . . . , n.

(d) Hyperparameter:

    β | · · · ∼ Ga( γ + kα, ρ + Σ_{j=1}^k σj,k^(−2) ).
Birth or death move (f): reversible jump with the same proposal
probabilities bk , dk .
403 / 417
Example. Gaussian mixture. Split or combine move
Combine move
(j1 , j2 ): randomly chosen pair of adjacent indices to be merged.
Adjacent: µj1 ,k < µj2 ,k and no other µj,k in µj1 ,k , µj2 ,k .
j ∗ : index of the new component. k is reduced by 1.
Reallocation: set zi,k−1 = j ∗ for yi with zi,k = j1 or zi,k = j2 .
Change of weights and normal parameters: the zeroth, first and second moments should match:

    ωj*,k−1 = ωj1,k + ωj2,k,    ωj*,k−1 µj*,k−1 = ωj1,k µj1,k + ωj2,k µj2,k,

    ωj*,k−1 (σ²j*,k−1 + µ²j*,k−1) = ωj1,k (σ²j1,k + µ²j1,k) + ωj2,k (σ²j2,k + µ²j2,k).
Transformation:

    Tk,k−1 : (ωj1,k, ωj2,k, µj1,k, µj2,k, σ²j1,k, σ²j2,k)

The acceptance ratio contains the factors

    × [ dk+1/(bk palloc) ] [ g2,2(u1) g2,2(u2) g1,1(u3) ]^(−1)
    × ωj*,k |µj1,k+1 − µj2,k+1| σ²j1,k+1 σ²j2,k+1 / [ u2 (1 − u2)² u3 (1 − u2) σ²j*,k ].
Lk : likelihood of the sample y corresponding to model Mk .
`1 , `2 : number of observations proposed to be assigned to j1 and j2 .
B(x, y ): beta function. gs,t : PDF of Be(s, t).
palloc : probability of the particular allocation we made. 406 / 417
Example. Gaussian mixture. Birth or death move
Birth move. Probability for Mk : bk . Weights and parameters for
the new component:
    ωj*,k+1 ∼ Be(1, k),    µj*,k+1 ∼ N(ξ, κ⁻¹),    σ²j*,k+1 ∼ Ga(α, β).
[Figure: posterior distribution of the number of components k (k = 1, . . . , 20).]
Histogram and predictive PDFs of the Galaxy data. Solid line: overall predictive PDF; dashed lines: conditional predictive PDFs for k = 4 (blue), 5 (green), 6 (red), 7 (magenta) and 8 (orange).    409 / 417
References
Athreya, K. and Ney, P. (1978)
A new approach to the limit theory of recurrent Markov chains. Trans. Amer. Math. Soc. 245, 493–501.
Atkinson, A. C. (1979)
The computer generation of Poisson random variables. J. Roy. Statist. Soc. Ser. C Appl. Statist. 28, 29–35.
Bédard, M. (2008)
Optimal acceptance rates for Metropolis algorithms: moving beyond 0.234. Stochastic Process. Appl. 118,
2198–2222.
Besag, J. (1974)
Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Royal Statist. Soc.
Series B 36, 192–236.
Besag, J. (1994)
Discussion of “Markov chains for exploring posterior distributions”. Ann. Statist. 22, 1734–1741.
Geyer, C. J. (1996)
Estimation and optimization of functions. In Markov chain Monte Carlo in Practice. Gilks, W. R.,
Richardson, S. T. and Spiegelhalter D. J. (eds.). 241–258. Chapman & Hall, London.
412 / 417
References
Gilks, W. and Roberts, G. (1996)
Strategies for improving MCMC. In Markov chain Monte Carlo in Practice. Gilks, W. R., Richardson, S. T.
and Spiegelhalter D. J. (eds.). 89–114. Chapman & Hall, London.
Green, P. J. (1995)
Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82,
711-732.
Grenander, U. and Miller, M. (1994)
Representations of knowledge in complex systems (with discussion). J. Royal Statist. Soc. Series B 56,
549–603.
Hàjek, B. (1988)
Cooling schedules for optimal annealing. Math. Operation Research 13, 311–329.
Hastings, W. (1970)
Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.
Ingrassia, S. (1992)
A comparison between the simulated annealing and the EM algorithms in normal mixture decompositions.
Stat. Comput. 2, 203–211.
Jeffreys, H. (1946)
An Invariant Form for the Prior Probability in Estimation Problems. Proc. R. Soc. Lond. A 186, 453–461.
Jeffreys, H. (1961)
Theory of Probability (3rd edition). Oxford University Press, Oxford.
413 / 417
References
Johnson, A. A. (2010)
Geometric Ergodicity of Gibbs Samplers. PhD Thesis, University of Minnesota.
Jöhnk, M. (1964)
Erzeugung von betaverteilten und gammaverteilten Zufallszahlen. Metrika 8, 5–15.
Komárek, A. (2009)
A new R package for Bayesian estimation of multivariate normal mixtures allowing for selection of the
number of components and interval-censored data. Comput. Statist. Data Anal. 53, 3932–3947.
Lin, S. (1965)
Computer solutions of the travelling salesman problem. Bell System Tech. J. 44, 2245–2269.
Liu, J. (1995).
Metropolized Gibbs sampler: An improvement. Technical report. Department of Statistics, Stanford
University, CA.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953)
Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092.
Neal, R. (2003)
Slice sampling (with discussion). Ann. Statist. 31, 705–767.
Neath, R. C. (2012)
On convergence properties of the Monte Carlo EM algorithm. arXiv : 1206.4768.
Oakes, D. (1999)
Direct calculation of the information matrix via the EM algorithm. J. R. Statist. Soc. B 61, 479–482.
Peskun, P. (1973)
Optimum Monte Carlo sampling using Markov chains. Biometrika 60, 607–612.
Rao, C. R. (1965)
Linear Statistical Inference and its Applications. Wiley, New York.
Robert, C. P. (2007)
The Bayesian Choice. From Decision-Theoretic Foundations to Computational Implementation. Second
Edition. Springer, New York.
Robert, C. P. (1995)
Simulation of truncated normal variables. Stat. Comput. 5, 121–125.
Tierney, L. (1994)
Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22, 1701–1786.
Wu, C. (1983)
On the convergence properties of the EM algorithm. Ann. Statist. 11, 95–103.
417 / 417