The document provides an overview of stochastic algorithms and their applications. It discusses random variable simulation using techniques like Monte Carlo integration and Markov chains. Specific algorithms covered include the Metropolis-Hastings algorithm, slice sampler, Gibbs sampler, and methods for variable dimension models. Literature references are provided for further reading on Monte Carlo methods and stochastic simulation.


Stochastic Algorithms

Sándor Baran

Summer Semester, 2012/13

1 / 417
Literature
- Christian P. Robert, George Casella: Monte Carlo Statistical Methods. Second Edition. Springer, New York, 2004.
- Christian P. Robert, George Casella: Introducing Monte Carlo Methods with R. Springer, New York, 2010.
- Emile Aarts, Jan Korst: Simulated Annealing and Boltzmann Machines. Wiley, New York, 1989.
- Brian D. Ripley: Stochastic Simulation. Wiley, New York, 1987.
- Reuven Y. Rubinstein: Simulation and the Monte Carlo Method. Wiley, New York, 1981.
- Sean Meyn, Richard Tweedie: Markov Chains and Stochastic Stability. Springer, New York, 1993.
2 / 417
Contents
Random Variable Simulation
Monte Carlo Integration
Controlling Monte Carlo Variance
Monte Carlo Optimization
Markov Chains
The Metropolis-Hastings Algorithm
The Slice Sampler
The Two-Stage Gibbs Sampler
The Multi-Stage Gibbs Sampler
Variable Dimension Models
References

3 / 417
Uniform simulation
Simulation of "uniform random numbers": producing a deterministic sequence in [0, 1].

Definition. A uniform pseudo-random number generator is an algorithm which, starting from an initial value u_0 and a transformation D, produces a sequence (u_i) = (D^i(u_0)) of values in [0, 1] such that for all n ∈ N the values (u_1, u_2, . . . , u_n) reproduce the behaviour of an i.i.d. sample (V_1, V_2, . . . , V_n) of uniform random variables when compared through a usual set of tests.

u_1, u_2, . . . , u_n is considered a realization of a random sequence U_1, U_2, . . . , U_n, and using this realization one tests

H_0: U_1, U_2, . . . , U_n are i.i.d. U[0, 1],

e.g. with the Kolmogorov-Smirnov test, or with a correlation test based on an ARMA(p, q) model.

Set of tests by Marsaglia: Die Hard
www.stat.fsu.edu/pub/diehard/
5 / 417
Example: logistic function
Function: D(x) = 4x(1 − x) ∈ [0, 1].

Iteration yields a sequence which has the same behaviour as a random sequence from the arcsine distribution with

PDF: f(x) = 1/(π √(x(1 − x)));  CDF: F(x) = 1/2 + arcsin(2x − 1)/π.

Iterations: x_0 = 0.33, x_{n+1} = D(x_n), y_n = F(x_n), n = 1, 2, . . . , 100000.

[Figure: histograms of y_n and y_{n+100}, both close to uniform on [0, 1].]
6 / 417
Uniform simulation with R
N uniform random numbers on [a, b] in R:
runif(N, min=a, max=b)

Example: x<-runif(10^4) (10000 numbers from U(0, 1))

[Figure: histogram of x, scatterplot of (x_n, x_{n+1}) and autocorrelation function up to lag 40, all consistent with i.i.d. uniforms.]

Kolmogorov-Smirnov test: p = 0.7473

To specify the seed and type of generator: set.seed

7 different random number generators are "built-in".
7 / 417
Period
M: the largest integer accepted by the computer.
Uniform pseudo-random number generators usually take values on {0, 1, . . . , M} and then divide the obtained value by M + 1.

Definition. The period T_0 of a generator (u_i) = (D^i(u_0)) is the smallest integer T such that u_{i+T} = u_i for every i, that is, D^T is equal to the identity function.

A generator of the form x_{n+1} = f(x_n) has period at most M + 1.

Solution: use several sequences x_n^i simultaneously, possibly generated by different methods.

Example: the Kiss generator ("Keep it simple, stupid").

Kiss uses two different techniques: congruential generation and shift-register generation.
8 / 417
Congruential generators
Definition. A congruential generator on {0, 1, . . . , M} is defined by the function

D(x) = (ax + b) mod (M + 1).

Performance depends heavily on the choice of a. The graph of the function D transformed into [0, 1] should range over [0, 1]².
a ∉ Q would be a good choice, but this is impossible because of the finite representation of a in the computer.

[Figure: plot of (0.003k, D(0.003k)) with D(x) = 69069x mod 1 (fractional part) for k = 1, 2, . . . , 333.]

9 / 417
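The congruential recursion is easy to sketch; a minimal Python version (the deck's examples are in R — the constants below reuse the multiplier 69069 quoted on the slides purely for illustration, not as a recommended generator):

```python
def congruential(u0, a=69069, b=0, m=2**32, n=10):
    """Iterate x -> (a*x + b) mod m and scale into [0, 1) by dividing by m."""
    xs = []
    x = u0
    for _ in range(n):
        x = (a * x + b) % m
        xs.append(x / m)
    return xs

# same seed, same deterministic "random" sequence
sample = congruential(12345, n=5)
```

The sequence is fully determined by the seed u_0, which is exactly why statistical tests (Kolmogorov-Smirnov, Die Hard) are needed to judge whether it imitates i.i.d. uniforms.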
RANDU: a problematic generator
RANDU: random number generator used by IBM from the 1960s:

v_{n+1} = 65539 v_n mod 2^31,  x_n = v_n / 2^31.

Triplets of consecutive values lie on hyperplanes of the form

x_{n+2} − 6x_{n+1} + 9x_n = c,  c ∈ Z.

R vector randu: 400 rows of three consecutive values produced by RANDU.

[Figure: histogram of the RANDU values and plot of x_{n+2} − 6x_{n+1} + 9x_n against the triplet index, showing that the combination takes only a few integer levels.]
10 / 417
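The hyperplane defect can be checked exactly: since 65539 = 2^16 + 3, one has 65539² ≡ 6 · 65539 − 9 (mod 2^31), hence v_{n+2} − 6v_{n+1} + 9v_n ≡ 0 (mod 2^31) for every n. A quick sketch (the odd seed is arbitrary):

```python
def randu(v0, n):
    """Generate n + 1 values of the RANDU recursion v -> 65539*v mod 2^31."""
    vs = [v0]
    for _ in range(n):
        vs.append((65539 * vs[-1]) % 2**31)
    return vs

v = randu(1, 100)
# every consecutive triplet satisfies v[n+2] - 6*v[n+1] + 9*v[n] = 0 (mod 2^31)
residues = [(v[i + 2] - 6 * v[i + 1] + 9 * v[i]) % 2**31 for i in range(98)]
```

All residues are exactly zero, which is the arithmetic behind the 15 parallel planes visible in 3D scatterplots of RANDU triplets.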
Shift-register generators
Definition. For a given k × k matrix T with binary entries, the associated shift-register generator is given by the transformation

x_{n+1} = T x_n,  where x_n = Σ_{j=0}^{k−1} e_{nj} 2^j,  e_{nj} ∈ {0, 1}.

Uses only binary operations, very fast. Addition is a logical XOR.

Shift matrices: L has ones on the first superdiagonal and zeros elsewhere; R has ones on the first subdiagonal and zeros elsewhere, so that

L (e_1, e_2, . . . , e_k)^T = (e_2, e_3, . . . , e_k, 0)^T,
R (e_1, e_2, . . . , e_k)^T = (0, e_1, . . . , e_{k−1})^T.

11 / 417
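On machine words these matrices act as bit shifts: under the convention that e_1 is the least significant bit of x, L shifts toward the low bits (x >> 1) and R toward the high bits (x << 1 mod 2^k), so (I + L^a)x = x XOR (x >> a) and (I + R^a)x = x XOR (x << a mod 2^k) — the classic xorshift step. A hedged sketch (the exponents 15 and 17 are the ones quoted for the J-sequence of Kiss; the bit-order convention is one plausible reading of the matrices above):

```python
MASK32 = (1 << 32) - 1

def xor_left(x, a):
    """(I + L^a) x: XOR x with x shifted a positions toward the LSB."""
    return x ^ (x >> a)

def xor_right(x, a):
    """(I + R^a) x: XOR x with x shifted a positions toward the MSB, mod 2^32."""
    return (x ^ (x << a)) & MASK32

def shift_register_step(x):
    """One update (I + L^15)(I + R^17) x on a 32-bit word: apply R-part first."""
    return xor_left(xor_right(x, 17), 15)

j = 123456789
stream = []
for _ in range(5):
    j = shift_register_step(j)
    stream.append(j)
```

Because the map is linear over GF(2), it fixes 0 and only binary operations are needed — no multiplication, which is why these generators are so fast.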
The Kiss generator

Congruential generator:

In+1 = 69069 × In + 23606797 mod 232 .

Shift-register generators:

Jn+1 = (I + L15 )(I + R 17 )Jn mod 232 ,


Kn+1 = (I + L13 )(I + R 18 )Kn mod 231 .

Combination:
Xn = In + Jn + Kn mod 232 .

The period of Kiss is of order 295 . Passes “Die Hard” tests.

12 / 417
Performance of Kiss

[Figure: plots of pairs (x_n, x_{n+1}), (x_n, x_{n+2}), (x_n, x_{n+5}) and (x_n, x_{n+10}) for a sample of 5000 generations from Kiss.]
13 / 417
Inverse transform
Definition. For a non-decreasing function F on R the generalized inverse of F, denoted F^−, is defined as

F^−(u) := inf{x : F(x) ≥ u}.

Lemma. If U ∼ U(0, 1), then the random variable F^−(U) has distribution F.

Proof. For all u ∈ [0, 1] and x ∈ F^−([0, 1])

F(F^−(u)) ≥ u  and  F^−(F(x)) ≤ x.

In this way

{(u, x) : F^−(u) ≤ x} = {(u, x) : F(x) ≥ u},

so

P(F^−(U) ≤ x) = P(U ≤ F(x)) = F(x).  □
14 / 417
Exponential variable generation
If X ∼ Exp(λ), then F(x) = 1 − e^{−λx}, x > 0, and

F^−(u) = F^{−1}(u) = −(1/λ) log(1 − u),  u ∈ [0, 1].

If U ∼ U(0, 1), then 1 − U ∼ U(0, 1) and −(1/λ) log(U) ∼ Exp(λ).

[Figure: histograms of 10000 exponential random variables generated using the inverse transform and the R function rexp; the two agree closely.]

15 / 417
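The inverse-transform recipe for the exponential can be sketched in a few lines (the deck uses R; this is an illustrative Python version with an arbitrary seed):

```python
import math
import random

def rexp_inverse(lam, n, rng):
    """Exp(lam) draws via the inverse transform x = -log(1 - u)/lam, u ~ U(0,1).
    Using 1 - rng.random() keeps the argument of log in (0, 1]."""
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

rng = random.Random(42)
xs = rexp_inverse(2.0, 100000, rng)
mean = sum(xs) / len(xs)   # should be close to 1/lam = 0.5
```

The same pattern works for any distribution with an explicit F^{−1}, which is exactly the limitation noted later: few distributions have one.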
Other distributions with invertible CDF-s
Logistic distribution Log(µ, β):

PDF: f(x) = (1/β) e^{−(x−µ)/β} / (1 + e^{−(x−µ)/β})²;  CDF: F(x) = 1/(1 + e^{−(x−µ)/β});

F^−(u) = F^{−1}(u) = µ + β log(u/(1 − u)).

Cauchy distribution C(µ, σ):

PDF: f(x) = 1/(πσ (1 + ((x − µ)/σ)²));  CDF: F(x) = 1/2 + (1/π) arctan((x − µ)/σ);

F^−(u) = F^{−1}(u) = µ + σ tan(π(u − 1/2)).
16 / 417
Distributions connected to exponential
U_j: i.i.d. U(0, 1); X_j := −log(U_j) ∼ Exp(1).

Chi-square distribution χ²_n:

Connection to exponential: χ²_2 = Exp(1/2).
If Y_1 ∼ χ²_n and Y_2 ∼ χ²_m are independent, then Y_1 + Y_2 ∼ χ²_{n+m}.

Y = 2 Σ_{j=1}^n X_j = −2 Σ_{j=1}^n log(U_j) ∼ χ²_{2n},  n ∈ N.

Gamma distribution Ga(α, β):

Connection to chi-square: if Y ∼ χ²_n, then Y/β ∼ Ga(n/2, β/2).

Y = (1/β) Σ_{j=1}^n X_j = −(1/β) Σ_{j=1}^n log(U_j) ∼ Ga(n, β),  n ∈ N.

17 / 417
Distributions connected to exponential
U_j: i.i.d. U(0, 1); X_j := −log(U_j) ∼ Exp(1).

Beta distribution Be(α, β):

Connection to chi-square: if Y_1 ∼ χ²_n and Y_2 ∼ χ²_m are independent, then Y_1/(Y_1 + Y_2) ∼ Be(n/2, m/2).

Y = (Σ_{j=1}^n X_j) / (Σ_{j=1}^{n+m} X_j) = (Σ_{j=1}^n log(U_j)) / (Σ_{j=1}^{n+m} log(U_j)) ∼ Be(n, m),  n, m ∈ N.

F-distribution F_{ν_1,ν_2}:

Connection to chi-square: if Y_1 ∼ χ²_{ν_1} and Y_2 ∼ χ²_{ν_2} are independent, then (Y_1/ν_1)/(Y_2/ν_2) ∼ F_{ν_1,ν_2}.

Y = ((1/(2n)) Σ_{j=1}^n X_j) / ((1/(2m)) Σ_{j=n+1}^{n+m} X_j) ∼ F_{2n,2m},  n, m ∈ N.

18 / 417
Normal variable generation. Box-Müller algorithm
X_1, X_2: independent N(0, 1). (X_1, X_2) is rotation invariant.
(r, θ): polar coordinates of (X_1, X_2):

r² = X_1² + X_2² ∼ χ²_2 = Exp(1/2),  θ ∼ U(0, 2π).

U_1, U_2: independent U(0, 1). Then

X_1 := √(−2 log(U_1)) cos(2πU_2),  X_2 := √(−2 log(U_1)) sin(2πU_2)

are independent N(0, 1).

Box-Müller algorithm
1. Generate u_1, u_2 independently from U(0, 1);
2. Define x_1 := √(−2 log(u_1)) cos(2πu_2), x_2 := √(−2 log(u_1)) sin(2πu_2);
3. Take x_1 and x_2 as independent draws from N(0, 1).

19 / 417
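The three steps above can be sketched directly (Python for illustration; the seed is arbitrary, and 1 − u is used to keep log away from 0):

```python
import math
import random

def box_muller(n, rng):
    """Generate n N(0,1) draws via the Box-Müller transform, one pair at a time."""
    out = []
    for _ in range(n // 2 + 1):
        u1 = 1.0 - rng.random()          # in (0, 1], so log(u1) is finite
        u2 = rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        out.append(r * math.cos(2 * math.pi * u2))
        out.append(r * math.sin(2 * math.pi * u2))
    return out[:n]

rng = random.Random(0)
zs = box_muller(100000, rng)
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs)
```

Both outputs of a pair are usable: they are exactly independent N(0, 1), not merely approximately so.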
Discrete random variables. General method
X ∼ P_θ: P_θ is a discrete distribution on the non-negative integers.
Calculate and store

p_0 := P(X ≤ 0), p_1 := P(X ≤ 1), p_2 := P(X ≤ 2), . . .

until the values reach 1 to a given number of decimals.
Generate U ∼ U(0, 1) and take

X = k  if  p_{k−1} < U < p_k,  where p_{−1} := 0.

Examples.
Binomial Bin(8, 0.4): p_0 = 0.016796, p_1 = 0.106376, p_2 = 0.315395, . . . , p_8 = 1.
Poisson P(2): p_0 = 0.135335, p_1 = 0.406006, p_2 = 0.676676, . . . , p_10 = 0.999992.

Storage problems: often it suffices to calculate probabilities in the interval mean ± 3 standard deviations.
20 / 417
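The lookup step can be sketched as a linear scan over the stored CDF; reusing the P(2) values quoted on the slide:

```python
def discrete_inverse(u, cdf_values):
    """Return the smallest k with u <= p_k, where cdf_values[k] = P(X <= k)."""
    for k, pk in enumerate(cdf_values):
        if u <= pk:
            return k
    return len(cdf_values)  # u falls beyond the stored (truncated) table

# stored CDF of the Poisson P(2) distribution, truncated as on the slide
p2_cdf = [0.135335, 0.406006, 0.676676]
k = discrete_inverse(0.5, p2_cdf)   # p_1 = 0.406006 < 0.5 <= p_2, so k = 2
```

For large supports a binary search (e.g. Python's `bisect`) over the stored table avoids the linear scan.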
Beta generation
U_1, U_2, . . . , U_n: i.i.d. sample from U(0, 1).
U_(1), U_(2), . . . , U_(n): ordered sample. U_(i) ∼ Be(i, n − i + 1).
Only applies for integer parameters of the Beta distribution.

U_1, U_2: independent U(0, 1). The conditional distribution of

U_1^{1/α} / (U_1^{1/α} + U_2^{1/β})  given  U_1^{1/α} + U_2^{1/β} ≤ 1

is Be(α, β). [Jöhnk, 1964]

Algorithm:
1. Generate u_1, u_2 independently from U(0, 1);
2. Accept x = u_1^{1/α} / (u_1^{1/α} + u_2^{1/β}) as a draw from Be(α, β) if u_1^{1/α} + u_2^{1/β} ≤ 1;
3. Return to 1. otherwise.

The probability of acceptance decreases rapidly as α and β increase.
21 / 417
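Jöhnk's scheme is a three-line loop; a Python sketch (seed and parameters illustrative):

```python
import random

def johnk_beta(alpha, beta, rng):
    """One Be(alpha, beta) draw via Jöhnk's (1964) accept-reject scheme."""
    while True:
        x = rng.random() ** (1.0 / alpha)
        y = rng.random() ** (1.0 / beta)
        if x + y <= 1.0:          # conditioning event of the lemma
            return x / (x + y)

rng = random.Random(1)
draws = [johnk_beta(0.5, 1.5, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)    # E = alpha/(alpha + beta) = 0.25
```

For small α and β the acceptance region is large; for large parameters the while-loop runs many times per draw, which is the weakness noted on the slide.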
Accept-reject method
Problems with the previously shown methods:
- The inverse transform method applies only for a few distributions.
- Many distributions do not have convenient representations.

Key idea: use a simpler density g, called the instrumental density, to simulate from the desired density f (the target density).

Constraints on g:
1. f and g have compatible supports (i.e. g(x) > 0 when f(x) > 0).
2. There exists a constant M such that f(x)/g(x) ≤ M for all x.

Generate X from g and accept it as a variable Y from f with probability f(X)/(M g(X)).

It suffices to know f up to a normalizing constant.
22 / 417
Accept-reject algorithm
Generate Y ∼ f with the help of X ∼ g, where f(x) ≤ M g(x).

Algorithm
1. Generate x from the distribution g and u from U(0, 1);
2. Accept y = x as a draw from Y if u ≤ f(x)/(M g(x));
3. Return to 1. otherwise.

Lemma. The above algorithm produces a value of a random variable Y distributed according to f.

Proof. The CDF of Y equals

P(Y < y) = P(X < y | U ≤ f(X)/(M g(X))) = P(X < y, U ≤ f(X)/(M g(X))) / P(U ≤ f(X)/(M g(X)))

= [∫_{−∞}^y ∫_0^{f(x)/(M g(x))} du g(x) dx] / [∫_{−∞}^∞ ∫_0^{f(x)/(M g(x))} du g(x) dx]

= [(1/M) ∫_{−∞}^y f(x) dx] / [(1/M) ∫_{−∞}^∞ f(x) dx] = ∫_{−∞}^y f(x) dx,

that is, Y ∼ f. □

Remark. The probability of acceptance is 1/M; the expected number of trials until acceptance is M.

23 / 417
Example. Normal from Cauchy
Density functions:

N(0, 1): f(x) = (1/√(2π)) e^{−x²/2};  C(0, 1): g(x) = 1/(π(1 + x²)).

f(x)/g(x) ≤ √(2π/e) = 1.5203; the maximum is reached at x = ±1.

Mean number of trials: M := √(2π/e) = 1.5203.
Probability of acceptance: 1/M = 0.6577.

Tuning
Instrumental density: C(0, σ) with PDF g_σ(x) = 1/(πσ(1 + x²/σ²)).

f(x)/g_σ(x) ≤ M_σ := √(2π/e) e^{(σ²−1)/2}/σ = M e^{(σ²−1)/2}/σ,  and M_σ ≥ M_1 = M.

C(0, 1) is the best choice among Cauchy distributions for simulating N(0, 1).
24 / 417
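The normal-from-Cauchy example can be sketched end to end, with the Cauchy draws themselves produced by inversion (Python for illustration; seed arbitrary):

```python
import math
import random

M = math.sqrt(2 * math.pi / math.e)   # bound on f/g, approx. 1.5203

def normal_from_cauchy(n, rng):
    """N(0,1) draws by accept-reject with a C(0,1) instrumental density."""
    out, trials = [], 0
    while len(out) < n:
        x = math.tan(math.pi * (rng.random() - 0.5))   # C(0,1) via inversion
        trials += 1
        f = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        g = 1.0 / (math.pi * (1 + x * x))
        if rng.random() <= f / (M * g):
            out.append(x)
    return out, trials

rng = random.Random(7)
xs, trials = normal_from_cauchy(20000, rng)
acc_rate = len(xs) / trials   # should be close to 1/M = 0.6577
```

The empirical acceptance rate estimates 1/M, matching the "mean number of trials" calculation above.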
Example. Normal from double exponential (Laplace)
Density functions:

N(0, 1): f(x) = (1/√(2π)) e^{−x²/2};  L(λ): g_λ(x) = (λ/2) e^{−λ|x|}.

CDF of L(λ): G_λ(x) = 1/2 + (1/2) sign(x)(1 − e^{−λ|x|}).

Inverse CDF: G_λ^{−1}(u) = −(1/λ) sign(u − 1/2) log(1 − 2|u − 1/2|).

f(x)/g_λ(x) ≤ M_λ := √(2/π) λ^{−1} e^{λ²/2}.

The minimum of M_λ is attained at λ = 1, and M_1 = √(2e/π).
L(1) is the best choice among double exponential distributions for simulating N(0, 1).

Mean number of trials: M_1 = 1.3155.
Probability of acceptance: 1/M_1 = 0.7602.
25 / 417
Example. Gamma Accept-Reject
If α ∈ N and U_1, U_2, . . . , U_α are i.i.d. U(0, 1), then

Y = −(1/β) Σ_{j=1}^α log(U_j) ∼ Ga(α, β).

For α ∉ N, α ≥ 1: use the instrumental distribution Ga(a, b) with a = ⌊α⌋.

Assume β = 1. Density functions:

Ga(α, 1): f(x) = x^{α−1} e^{−x}/Γ(α);  Ga(a, b): g_b(x) = b^a x^{a−1} e^{−bx}/Γ(a).

f(x)/g_b(x) = (Γ(a)/(Γ(α) b^a)) x^{α−a} e^{(1−b)x} ≤ M_b := (Γ(a)/(Γ(α) b^a)) ((α − a)/((1 − b)e))^{α−a},  b < 1.

b^a (1 − b)^{α−a} attains its maximum at b = a/α, which is the optimal choice for simulating Ga(α, 1).

Corresponding mean number of trials: M_{a/α} = (Γ(a)/Γ(α)) (α/e)^α/(a/e)^a.
26 / 417
Example. Truncated normal distribution

N(µ, σ²) distribution truncated to x ≥ µ⁻. PDF:

f(x) = (1/Φ((µ − µ⁻)/σ)) (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)},  x ≥ µ⁻.

Naive method: simulate x from N(µ, σ²) and accept it if x ≥ µ⁻ holds.
Mean number of trials: 1/Φ((µ − µ⁻)/σ).

Instrumental distribution: translated exponential Exp(λ, µ⁻):

PDF: g_λ(x) = λ e^{−λ(x−µ⁻)}, x ≥ µ⁻;  CDF: G_λ(x) = 1 − e^{−λ(x−µ⁻)}, x ≥ µ⁻;

G_λ^{−1}(u) = µ⁻ − (1/λ) log(1 − u).
27 / 417
Example. Truncated normal from translated exponential
Assume µ = 0, σ = 1, with truncation point µ⁻:

f(x) = (1/(1 − Φ(µ⁻))) (1/√(2π)) e^{−x²/2};  g_λ(x) = λ e^{−λ(x−µ⁻)},  x ≥ µ⁻.

For x ≥ µ⁻:

f(x)/g_λ(x) = (1/(1 − Φ(µ⁻))) (1/(λ√(2π))) e^{−x²/2+λ(x−µ⁻)} ≤ M_λ := (1/(1 − Φ(µ⁻))) (1/√(2π)) M̃_λ,

where

M̃_λ := exp(λ²/2 − λµ⁻)/λ  if λ > µ⁻;  M̃_λ := exp(−(µ⁻)²/2)/λ  otherwise.

The first expression has its minimum at λ* = µ⁻/2 + (1/2)√((µ⁻)² + 4), the second at λ̃ = µ⁻.
λ* is the optimal choice. Mean number of trials:

M_{λ*} = (1/(1 − Φ(µ⁻))) (1/√(2π)) M̃_{λ*} < 1/(1 − Φ(µ⁻)),  e.g. if µ⁻ > 0,

that is, better than the naive method.
28 / 417
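A sketch of the translated-exponential sampler for µ = 0, σ = 1 (Python for illustration). The acceptance probability exp(−(x − λ)²/2) below is obtained by normalizing f/g_λ by its maximum, attained at x = λ (valid here since λ* > µ⁻):

```python
import math
import random

def truncated_normal(mu_minus, n, rng):
    """N(0,1) truncated to [mu_minus, inf) by accept-reject with a
    translated exponential Exp(lambda*, mu_minus) instrumental distribution."""
    lam = (mu_minus + math.sqrt(mu_minus ** 2 + 4)) / 2   # optimal rate lambda*
    out = []
    while len(out) < n:
        x = mu_minus - math.log(1.0 - rng.random()) / lam  # shifted exponential
        # f(x)/(M g(x)) simplifies to exp(-(x - lam)^2 / 2)
        if rng.random() <= math.exp(-(x - lam) ** 2 / 2):
            out.append(x)
    return out

rng = random.Random(3)
xs = truncated_normal(2.0, 20000, rng)
mean = sum(xs) / len(xs)   # E[X | X > 2] = phi(2)/(1 - Phi(2)), approx. 2.3732
```

Unlike the naive method, whose cost explodes as µ⁻ grows (1/Φ(−2) is already about 44 trials per draw), this sampler stays efficient far in the tail.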
Envelope accept-reject method
In some cases the PDF f is too complex to be evaluated at each step.
Suppose there exist a function g_ℓ, a PDF g_m and a constant M such that

g_ℓ(x) ≤ f(x) ≤ M g_m(x).

Algorithm
1. Generate x from g_m and u from U(0, 1);
2. Accept y = x as a draw from Y if u ≤ g_ℓ(x)/(M g_m(x));
3. Otherwise accept y = x if u ≤ f(x)/(M g_m(x)).

Lemma. The above algorithm produces a value of a random variable Y distributed according to f.

The probability that f is not evaluated:

P(U ≤ g_ℓ(X)/(M g_m(X))) = ∫_{−∞}^∞ ∫_0^{g_ℓ(x)/(M g_m(x))} du g_m(x) dx = (1/M) ∫_{−∞}^∞ g_ℓ(x) dx.
29 / 417
Proof

The CDF of Y equals

P(Y < y) = P(X < y | U ≤ g_ℓ(X)/(M g_m(X)) or g_ℓ(X)/(M g_m(X)) < U ≤ f(X)/(M g_m(X)))

= [P(X < y, U ≤ g_ℓ(X)/(M g_m(X))) + P(X < y, g_ℓ(X)/(M g_m(X)) < U ≤ f(X)/(M g_m(X)))] / P(U ≤ f(X)/(M g_m(X)))

= [∫_{−∞}^y ∫_0^{g_ℓ(x)/(M g_m(x))} du g_m(x) dx + ∫_{−∞}^y ∫_{g_ℓ(x)/(M g_m(x))}^{f(x)/(M g_m(x))} du g_m(x) dx] / ∫_{−∞}^∞ ∫_0^{f(x)/(M g_m(x))} du g_m(x) dx

= [(1/M) ∫_{−∞}^y f(x) dx] / [(1/M) ∫_{−∞}^∞ f(x) dx] = ∫_{−∞}^y f(x) dx,

that is, Y ∼ f. □

30 / 417
Example. Lower bound for normal generation

Taylor series expansion:

exp(−x²/2) ≥ 1 − x²/2.

Lower bound for N(0, 1):

φ(x) = (1/√(2π)) e^{−x²/2} ≥ (1/√(2π)) (1 − x²/2) =: g_ℓ(x).

The bound is useless if, for X ∼ g_m, we have |X| ≥ √2.

For X ∼ C(0, 1): P(|X| ≥ √2) = 0.3918.
For X ∼ L(1): P(|X| ≥ √2) = 0.2431.

31 / 417
Poisson generation
Simple method
Poisson process: N ∼ P(λ), X_n ∼ Exp(λ), n ∈ N:

P(N = k) = P(X_1 + · · · + X_k ≤ 1 < X_1 + · · · + X_{k+1}).

Generate exponentials until their sum exceeds 1.

On average λ exponentials are needed to get one Poisson draw. Not effective for large values of λ.

Poisson from the logistic distribution Log(µ, β):

PDF: f(x) = (1/β) e^{−(x−µ)/β}/(1 + e^{−(x−µ)/β})²;  CDF: F(x) = 1/(1 + e^{−(x−µ)/β});

F^{−1}(u) = µ + β log(u/(1 − u)).

Easy to simulate using the uniform distribution on [0, 1].
32 / 417
Poisson from logistic
Instrumental variable: L = ⌊X + 0.5⌋, where X ∼ Log(µ, β).
Restrict the values of L to non-negative integers, that is, the range of X to [−0.5, ∞).
For X > −0.5:

P(N = n) = P(n − 0.5 ≤ X < n + 0.5)

= [1/(1 + e^{−(n+0.5−µ)/β}) − 1/(1 + e^{−(n−0.5−µ)/β})] (1 + e^{−(0.5+µ)/β})

= [(e^{−(n−0.5−µ)/β} − e^{−(n+0.5−µ)/β}) / ((1 + e^{−(n+0.5−µ)/β})(1 + e^{−(n−0.5−µ)/β}))] (1 + e^{−(0.5+µ)/β}).

Reparametrization [Atkinson, 1979]: α = µ/β, γ = 1/β:

P(N = n) = [(e^{α−γ(n−0.5)} − e^{α−γ(n+0.5)}) / ((1 + e^{α−γ(n+0.5)})(1 + e^{α−γ(n−0.5)}))] (1 + e^{−(α+0.5γ)})

≈ (γ e^{α−γn}/(1 + e^{α−γn})²) (1 + e^{−(α+0.5γ)}).
33 / 417
Atkinson’s algorithm
Ratio of probabilities: λ^n / (P(N = n) e^λ n!).
Difficult to compute a bound and hence to optimize in (α, γ).

Atkinson (1979): γ = π/√(3λ), α = λγ.
The first two moments of the logistic and of the Poisson then coincide.
Bound for the ratio (interpolation in 1/λ): c = 0.767 − 3.36/λ.

Algorithm
0. Define γ = π/√(3λ), α = λγ and k = log c − λ − log γ;
1. Generate u_1 ∼ U(0, 1) and calculate x = (α − log((1 − u_1)/u_1))/γ until x > −0.5;
2. Define n = ⌊x + 0.5⌋ and generate u_2 ∼ U(0, 1);
3. Accept n ∼ P(λ) if α − γx + log(u_2/(1 + exp(α − γx))²) ≤ k + n log λ − log n!.
34 / 417
Log-concave densities
Definition. A density f is called log-concave if log f is concave.

Example. Exponential family (special form):

f(x) = g(x) e^{θx − ψ(θ)},  θ, x ∈ R.

Log-concave if

∂²/∂x² log f(x) = ∂²/∂x² log g(x) = (g″(x)g(x) − (g′(x))²)/g²(x) < 0.

For N(0, 1) we have ∂² log g(x)/∂x² = −1 < 0.

For Log(µ, β) we have

∂²/∂x² log g(x) = −(2/β²) e^{−(x−µ)/β}/(1 + e^{−(x−µ)/β})² < 0.

35 / 417
Envelope accept-reject method
f: log-concave density. S_n := {x_i ∈ supp f | i = 0, 1, . . . , n + 1}.
h(x_i) = log f(x_i) is known (at least up to a constant).

The line L_{i,i+1} through (x_i, h(x_i)) and (x_{i+1}, h(x_{i+1})) is below h(x) if x ∈ [x_i, x_{i+1}] and above it otherwise.

For x ∈ [x_i, x_{i+1}], i = 0, 1, . . . , n, define the upper and lower envelopes

h̄_n(x) := min(L_{i−1,i}(x), L_{i+1,i+2}(x))  and  h_n(x) := L_{i,i+1}(x).

Outside [x_0, x_{n+1}] define

h̄_n(x) := min(L_{0,1}(x), L_{n,n+1}(x))  and  h_n(x) := −∞.

Envelopes:

h_n(x) ≤ h(x) = log f(x) ≤ h̄_n(x),  x ∈ supp f.

For the lower bound f_n(x) = exp h_n(x) and the upper bound f̄_n(x) = exp h̄_n(x) we have

f_n(x) ≤ f(x) ≤ f̄_n(x) = ω̄_n g_n(x),  x ∈ supp f.

ω̄_n: norming constant; g_n(x): PDF.
36 / 417
Adaptive Rejection Sampling (ARS)
Generate an observation from distribution f.

ARS algorithm
0. Initialize n and S_n;
1. Generate x ∼ g_n, u ∼ U(0, 1);
2. Accept x if u ≤ f_n(x)/(ω̄_n g_n(x));
3. Otherwise, accept x if u ≤ f(x)/(ω̄_n g_n(x)) and update S_n to S_{n+1} = S_n ∪ {x}.

S_n is only updated when f(x) is calculated.

The accuracy of the envelopes f_n and f̄_n increases, which reduces the number of evaluations of f.
Important condition: ω̄_n < ∞.
If supp f is not bounded from below (above), then L_{0,1} (L_{n,n+1}) should have a positive (negative) slope.
37 / 417
Simulation from the instrumental distribution
h̄_n(x) induces a set of points {x_0 = y_0, y_1, . . . , y_{r_n+1} = x_{n+1}} which is a refinement of S_n. y_{−1} := −∞, y_{r_n+2} := ∞ if supp f = R.
On [y_j, y_{j+1}] we have h̄_n(x) = α_j x + β_j, that is, f̄_n(x) = e^{α_j x + β_j}.

g_n(x) = ω̄_n^{−1} Σ_{j=−1}^{r_n+1} e^{α_j x + β_j} I_{[y_j, y_{j+1}]}(x),

ω̄_n = Σ_{j=−1}^{r_n+1} ∫_{y_j}^{y_{j+1}} e^{α_j x + β_j} dx = Σ_{j=−1}^{r_n+1} e^{β_j} (e^{α_j y_{j+1}} − e^{α_j y_j})/α_j.

Generate an observation x from distribution g_n.

Supplemental ARS algorithm
1. Select the interval [y_j, y_{j+1}] with probability e^{β_j} (e^{α_j y_{j+1}} − e^{α_j y_j})/(ω̄_n α_j);
2. Generate u ∼ U(0, 1) and take x = α_j^{−1} log(e^{α_j y_j} + u (e^{α_j y_{j+1}} − e^{α_j y_j})).
38 / 417
Example. Poisson regression
Sample: (Y_1, x_1), (Y_2, x_2), . . . , (Y_n, x_n).

Y_j | x_j ∼ P(exp(a + b x_j)),  j = 1, 2, . . . , n.

Likelihood function:

ℓ(a, b) = (Π_{j=1}^n Y_j!)^{−1} exp(a Σ_{j=1}^n y_j + b Σ_{j=1}^n x_j y_j − e^a Σ_{j=1}^n e^{b x_j}).

Prior for (a, b): N(0, σ²) × N(0, τ²). Posterior for (a, b):

π(a, b) ∝ exp(a Σ_{j=1}^n y_j + b Σ_{j=1}^n x_j y_j − e^a Σ_{j=1}^n e^{b x_j} − a²/(2σ²) − b²/(2τ²)).

∂²/∂a² log π(a, b) = −e^a Σ_{j=1}^n e^{b x_j} − σ^{−2} < 0,
∂²/∂b² log π(a, b) = −e^a Σ_{j=1}^n x_j² e^{b x_j} − τ^{−2} < 0.

Log-concave, so the ARS algorithm directly applies.

39 / 417
Some related problems
1. Show that the following Box-Müller algorithm produces a value of a standard normal random variable [Robert & Casella, 2004; Exercise 2.9].

Algorithm
1. Generate y_1, y_2 independently from Exp(1) until y_2 > (1 − y_1)²/2;
2. Generate u from U(0, 1) and take x = y_1 if u ≤ 1/2, x = −y_1 if u > 1/2.

2. Show that if U, V ∼ U(0, 1) are independent and α, β > 0, then the conditional distribution of

U^{1/α}/(U^{1/α} + V^{1/β})  given  U^{1/α} + V^{1/β} ≤ 1

is Be(α, β).
40 / 417
Classical Monte Carlo Integration

Problem: given a random variable X ∈ 𝒳 with PDF f(x), evaluate

E_f[h(X)] = ∫_𝒳 h(x) f(x) dx.

Monte Carlo solution: generate an i.i.d. sample (X_1, X_2, . . . , X_n) from the distribution f(x) and evaluate the empirical mean

h̄_n := (1/n) Σ_{j=1}^n h(X_j).

SLLN: h̄_n → E_f[h(X)] as n → ∞ with probability 1.

42 / 417
Variance
Integral and approximation:

E_f[h(X)] = ∫_𝒳 h(x) f(x) dx,  h̄_n := (1/n) Σ_{j=1}^n h(X_j).

If E_f[h²(X)] < ∞,

Var(h̄_n) = (1/n) ∫_𝒳 (h(x) − E_f[h(X)])² f(x) dx.

Sample variance:

v_n := (1/n²) Σ_{j=1}^n (h(X_j) − h̄_n)².

CLT: for large n,

(h̄_n − E_f[h(X)])/√(v_n) is approximately N(0, 1).

This yields confidence bounds for the approximation.
43 / 417
Example. Simple integration
Evaluate

∫_0^1 h(x) dx,  where h(x) = (cos(10x) − sin(60x))².

[Figure: graph of h(x) on [0, 1].]

Exact value: 1.0145.

Generate U_1, U_2, . . . , U_n ∼ U(0, 1) and calculate (1/n) Σ_{j=1}^n h(U_j).
44 / 417
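The estimate together with its CLT standard error can be sketched in a few lines (Python for illustration; seed and sample size arbitrary):

```python
import math
import random

def mc_integral(h, n, rng):
    """Monte Carlo estimate of the integral of h over [0, 1], with the
    CLT-based standard error sqrt(v_n) from the previous slide."""
    vals = [h(rng.random()) for _ in range(n)]
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / n
    se = math.sqrt(var / n)
    return mean, se

h = lambda x: (math.cos(10 * x) - math.sin(60 * x)) ** 2
rng = random.Random(0)
est, se = mc_integral(h, 200000, rng)   # exact value is about 1.0145
```

Reporting est ± 2·se gives the confidence band plotted on the next slide.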
Example. Simple integration
Approximate

∫_0^1 (cos(10x) − sin(60x))² dx  with  (1/n) Σ_{j=1}^n h(U_j),  U_j ∼ U(0, 1).

[Figure: running mean (black), mean ± two standard errors (blue) and exact value (red) against n.]
45 / 417
Example. Normal quantiles
Evaluate

Φ(t) := ∫_{−∞}^t (1/√(2π)) e^{−y²/2} dy.

Generate X_1, X_2, . . . , X_n ∼ N(0, 1) and calculate

Φ̂(t) := (1/n) Σ_{j=1}^n I_{{X_j ≤ t}}.

Var(Φ̂(t)) = Φ(t)(1 − Φ(t))/n. For t ≈ 0: Var(Φ̂(t)) ≈ 1/(4n).

Precision of four decimals: 2√(1/(4n)) ≤ 10^{−4} ⟺ n ≥ 10^8.

Estimates Φ̂(t) (rows: sample size n; columns: exact quantiles t):

quantile  0.5     0.75    0.8     0.9     0.95    0.975   0.99    0.999   0.9999
t         0.0000  0.6745  0.8416  1.2816  1.6449  1.9600  2.3263  3.0902  3.7190
n=10^2    0.5200  0.7800  0.8200  0.9000  0.9300  0.9700  0.9900  1.0000  1.0000
n=10^3    0.4890  0.7440  0.7950  0.9040  0.9460  0.9710  0.9880  1.0000  1.0000
n=10^4    0.4947  0.7524  0.8031  0.9018  0.9506  0.9764  0.9906  0.9990  1.0000
n=10^5    0.4980  0.7472  0.7972  0.8987  0.9492  0.9751  0.9900  0.9990  0.9999
n=10^6    0.4991  0.7493  0.7993  0.8994  0.9499  0.9752  0.9900  0.9990  0.9999
n=10^7    0.5003  0.7501  0.8001  0.9000  0.9499  0.9751  0.9900  0.9990  0.9999
n=10^8    0.5001  0.7500  0.8001  0.9000  0.9500  0.9750  0.9900  0.9990  0.9999

46 / 417
Example. Tail probability of normal
Evaluate P(X > 5), where X ∼ N(0, 1).
Exact value: 2.866516 × 10^{−7}.

Calculate

1 − Φ(5) = ∫_5^∞ (1/√(2π)) e^{−y²/2} dy  and  1/2 − Φ_1(5),  where Φ_1(5) := ∫_0^5 (1/√(2π)) e^{−y²/2} dy.

Generate X_1, X_2, . . . , X_n ∼ N(0, 1) and calculate

1 − Φ̂(5) = (1/n) Σ_{j=1}^n I_{{X_j > 5}}  and  1/2 − Φ̂_1(5) = 1/2 − (1/n) Σ_{j=1}^n I_{{0 ≤ X_j ≤ 5}}.

n = 10^5: 1 − Φ̂(5) = 0, Φ̂(5) = 1, Φ̂_1(5) = 0.50089, 1/2 − Φ̂_1(5) = −0.00089.

47 / 417
Example. Tail probability of normal

Evaluate P(X > 5), where X ∼ N(0, 1).
Exact value: 2.866516 × 10^{−7}.

Calculate (substituting y = 1/v):

1 − Φ(5) = ∫_5^∞ (1/√(2π)) e^{−y²/2} dy = ∫_0^{1/5} (1/(√(2π) v²)) e^{−1/(2v²)} dv.

Generate U_1, U_2, . . . , U_n ∼ U(0, 1/5) and calculate

(1/n) Σ_{j=1}^n (1/(5√(2π) U_j²)) e^{−1/(2U_j²)}.

n = 10^5: 2.850975 × 10^{−7}.

48 / 417
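A sketch of the substitution estimator (Python for illustration; seed arbitrary, and 1 − U is used to keep the draw away from 0):

```python
import math
import random

def normal_tail_substitution(n, rng):
    """Estimate P(X > 5), X ~ N(0,1), from U(0, 1/5) draws via the
    substitution y = 1/v."""
    total = 0.0
    for _ in range(n):
        u = (1.0 - rng.random()) / 5.0          # U(0, 1/5], avoids division by 0
        total += math.exp(-1.0 / (2 * u * u)) / (5 * math.sqrt(2 * math.pi) * u * u)
    return total / n

rng = random.Random(11)
est = normal_tail_substitution(200000, rng)
# exact value: 2.866516e-07
```

Every draw now contributes a nonzero term, in contrast with the indicator estimator of the previous slide, which returns exactly 0 unless an astronomically rare X_j > 5 occurs.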
Importance sampling. Cauchy tail probability
Evaluate p = P(X > 2), where X ∼ C(0, 1), that is,

p = ∫_2^∞ dx/(π(1 + x²)).

Exact value: p = 0.1475836.

Generate X_1, X_2, . . . , X_n ∼ C(0, 1) and calculate

p̂_1 := (1/n) Σ_{j=1}^n I_{{X_j > 2}}.

E[p̂_1] = p,  Var(p̂_1) = p(1 − p)/n = 0.1258/n.

By symmetry one can consider

p̂_2 := (1/(2n)) Σ_{j=1}^n I_{{|X_j| > 2}}.

E[p̂_2] = p,  Var(p̂_2) = p(1 − 2p)/(2n) = 0.0520/n.
49 / 417
Importance sampling. Cauchy tail probability

Evaluate p = P(X > 2), where X ∼ C(0, 1).

p = ∫_2^∞ dx/(π(1 + x²)) = 1/2 − ∫_0^2 dx/(π(1 + x²)) = 1/2 − E[h(U)],

where

h(x) := 2/(π(1 + x²))  and  U ∼ U(0, 2).

Generate U_1, U_2, . . . , U_n ∼ U(0, 2) and calculate

p̂_3 := 1/2 − (1/n) Σ_{j=1}^n h(U_j).

Var(p̂_3) = (E[h²(U)] − (E[h(U)])²)/n = 0.0285/n.


50 / 417
Importance sampling. Cauchy tail probability
Evaluate p = P(X > 2), where X ∼ C(0, 1). Substituting y = 1/x:

p = ∫_2^∞ dx/(π(1 + x²)) = ∫_0^{1/2} y^{−2} dy/(π(1 + y^{−2})) = ∫_0^{1/2} dy/(π(1 + y²)) = (1/4) E[h(V)],

where h(x) := 2/(π(1 + x²)) and V ∼ U(0, 1/2).

Generate V_1, V_2, . . . , V_n ∼ U(0, 1/2) and calculate

p̂_4 := (1/(4n)) Σ_{j=1}^n h(V_j).

Var(p̂_4) = (E[h²(V)] − (E[h(V)])²)/(16n) = 9.5525 × 10^{−5}/n.

Approximation     p̂_1      p̂_2      p̂_3      p̂_4
Generated sample  C(0, 1)  C(0, 1)  U(0, 2)  U(0, 1/2)
Variance          0.1258/n 0.0520/n 0.0285/n 9.5525 × 10^{−5}/n

A variance reduction of order 10^{−3}: √1000 ≈ 33 times smaller standard error for the same number of simulations. 51 / 417
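The best estimator p̂_4 can be sketched directly (Python for illustration; seed arbitrary):

```python
import math
import random

def cauchy_tail_p4(n, rng):
    """Estimate p = P(X > 2), X ~ C(0,1), as (1/(4n)) sum of h(V_j),
    V_j ~ U(0, 1/2), with h(x) = 2/(pi (1 + x^2))."""
    total = 0.0
    for _ in range(n):
        v = rng.random() / 2.0
        total += 2.0 / (math.pi * (1 + v * v))
    return total / (4 * n)

rng = random.Random(13)
est = cauchy_tail_p4(100000, rng)   # exact: 0.1475836
```

With variance 9.5525 × 10^{−5}/n, n = 10^5 gives a standard error of about 3 × 10^{−5}, far below that of the naive indicator estimator p̂_1.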
Importance sampling
Definition. The method of importance sampling is an evaluation of the integral

E_f[h(X)] = ∫_𝒳 h(x) f(x) dx

based on generating a sample X_1, X_2, . . . , X_n from a given distribution g (the instrumental distribution) and approximating

E_f[h(X)] ≈ (1/n) Σ_{j=1}^n h(X_j) f(X_j)/g(X_j).

Importance sampling fundamental identity:

E_f[h(X)] = ∫_𝒳 h(x) (f(x)/g(x)) g(x) dx = E_g[h(X) f(X)/g(X)],

provided supp(h × f) ⊂ supp(g).
52 / 417
Example. Exponential vs. lognormal
Let X ∼ Exp(1/λ) or X ∼ LN(0, σ²) with e^{σ²/2} = λ. PDFs:

f(x) = (1/λ) e^{−x/λ}  or  g(x) = (1/(xσ√(2π))) e^{−(log x)²/(2σ²)},  x > 0.

In both cases E[X] = λ, so X can serve as an estimator of λ.
Compare the performances with respect to the scaled squared error loss

L(δ, λ) = (δ − λ)²/λ².

Calculate E[L(X, λ)] for both distributions.
For the approximation use a sample X_1, X_2, . . . , X_n ∼ LN(0, σ²): generate Y_1, Y_2, . . . , Y_n ∼ N(0, 1) and set X_j = exp(σY_j).

Approximations of E[L(X, λ)] when X is

exponential:  R̂_1 := (1/(nλ²)) Σ_{j=1}^n (X_j − λ)² (σ√(2π)/λ) X_j e^{−X_j/λ} e^{(log X_j)²/(2σ²)};

lognormal:  R̂_2 := (1/(nλ²)) Σ_{j=1}^n (X_j − λ)².

53 / 417
Example. Exponential vs. lognormal

True values of E[L(X, λ)]: exponential: 1; lognormal: (λ − 1)(λ + 1).

[Figure: approximations of E[L(X, λ)] as functions of λ for the exponential (left) and lognormal (right) cases; sample size n = 10^5.]
54 / 417
Example. Tail probability of normal
Evaluate P(X > 5), X ∼ N(0, 1). Exact value: 2.866516 × 10^{−7}.
g: PDF of Exp(1) translated to 5:

f(x) = (2π)^{−1/2} e^{−x²/2};  g(x) = e^{−(x−5)},  x > 5.

For the approximation use a sample Y_1, Y_2, . . . , Y_n ∼ g:

P(X > 5) ≈ (1/n) Σ_{j=1}^n f(Y_j)/g(Y_j) = (1/n) Σ_{j=1}^n (1/√(2π)) e^{−Y_j²/2 + Y_j − 5}.

[Figure: convergence of the approximation over 1000 iterations. Final value (n = 10^3): 2.921121 × 10^{−7}.]

55 / 417
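A sketch of this importance sampling estimator (Python for illustration; the translated exponential is drawn by inversion, seed arbitrary):

```python
import math
import random

def normal_tail_is(n, rng):
    """Importance sampling estimate of P(X > 5), X ~ N(0,1), with
    instrumental density Exp(1) translated to 5."""
    total = 0.0
    for _ in range(n):
        y = 5.0 - math.log(1.0 - rng.random())            # Y ~ 5 + Exp(1)
        total += math.exp(-y * y / 2 + y - 5) / math.sqrt(2 * math.pi)
    return total / n

rng = random.Random(5)
est = normal_tail_is(50000, rng)
# exact value: 2.866516e-07
```

All instrumental draws land in the tail region of interest, so even small samples give several significant digits of a 10^{−7} probability.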


Finite variance estimators
By the SLLN,

(1/n) Σ_{j=1}^n h(X_j) f(X_j)/g(X_j) → E_f[h(X)] = ∫_𝒳 h(x) f(x) dx

as n → ∞ with probability 1.

The variance of the approximation is finite only if

E_g[h²(X) f²(X)/g²(X)] = E_f[h²(X) f(X)/g(X)] = ∫_𝒳 h²(x) (f²(x)/g(x)) dx < ∞.

Instrumental distributions with unbounded ratio f/g are not appropriate: the variance of the estimators is large.
Distributions g with thicker tails than f are preferred.

Sufficient conditions, e.g.:
- f(x)/g(x) < M for all x ∈ 𝒳 and E_f[h²(X)] < ∞;
- 𝒳 is compact, f(x) < F and g(x) > ε for all x ∈ 𝒳.

56 / 417
Example. Cauchy probability

Evaluate P(3 ≤ X ≤ 9), X ∼ C(0, 1). Exact value: 0.06719309.
g: PDF of N(0, 1).

f(x) = 1/(π(1 + x²));  g(x) = (1/√(2π)) e^{−x²/2}.

For the approximation use a sample Y_1, Y_2, . . . , Y_n ∼ N(0, 1):

P(3 ≤ X ≤ 9) ≈ (1/n) Σ_{j=1}^n (f(Y_j)/g(Y_j)) I_{{3 ≤ Y_j ≤ 9}} = (1/n) Σ_{j=1}^n √(2/π) (e^{Y_j²/2}/(1 + Y_j²)) I_{{3 ≤ Y_j ≤ 9}}.

The ratio f(x)/g(x) is unbounded.

57 / 417
Example. Cauchy probability
Evaluate P(3 ≤ X ≤ 9), X ∼ C(0, 1), using N(0, 1).

[Figure: evolution of the importance sampling estimator; exact value (red line): 0.06719309.]
58 / 417
Assessing minimal variance
Theorem. The choice of PDF g that minimizes the variance of the estimator

(1/n) Σ_{j=1}^n h(X_j) f(X_j)/g(X_j)  is  g*(x) = |h(x)| f(x) / ∫_𝒳 |h(z)| f(z) dz.

Proof. Let X ∼ g. Then

Var(h(X)f(X)/g(X)) = E_g[h²(X)f²(X)/g²(X)] − (E_g[h(X)f(X)/g(X)])².

The second term is independent of g. By Jensen's inequality,

E_g[h²(X)f²(X)/g²(X)] ≥ (E_g[|h(X)|f(X)/g(X)])² = (∫_𝒳 |h(x)| f(x) dx)².

This lower bound, which is independent of g, is attained when g = g*. □

Useless in practice: g* contains the integral to be estimated.

59 / 417
Alternative method
X_1, X_2, . . . , X_n: a sample from distribution g.
Instead of

(1/n) Σ_{j=1}^n h(X_j) f(X_j)/g(X_j)  use  (Σ_{j=1}^n h(X_j) f(X_j)/g(X_j)) / (Σ_{j=1}^n f(X_j)/g(X_j))

to approximate E_f[h(X)]. The SLLN ensures convergence to E_f[h(X)].

Biased (with small bias), but an improvement in variance.

The optimality theorem suggests g ∝ |h| f, resulting in

(Σ_{j=1}^n h(X_j) f(X_j)/g(X_j)) / (Σ_{j=1}^n f(X_j)/g(X_j)) = (Σ_{j=1}^n h(X_j) |h(X_j)|^{−1}) / (Σ_{j=1}^n |h(X_j)|^{−1}).

Not optimal, might be unstable.

Look for distributions g for which |h| f/g is almost constant and the estimator has finite variance.
60 / 417
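The self-normalized estimator can be sketched generically; a minimal version (the N(0, 1) target with C(0, 1) instrumental density is an illustrative pairing, not taken from the slides):

```python
import math
import random

def self_normalized_is(h, f, g, sample_g, n, rng):
    """Self-normalized importance sampling: a ratio of weighted sums, so
    f/g only needs to be known up to a constant."""
    num = den = 0.0
    for _ in range(n):
        x = sample_g(rng)
        w = f(x) / g(x)
        num += h(x) * w
        den += w
    return num / den

# toy check: E[X^2] = 1 for f = N(0,1), instrumental g = C(0,1)
f = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
g = lambda x: 1.0 / (math.pi * (1 + x * x))
sample_g = lambda rng: math.tan(math.pi * (rng.random() - 0.5))
rng = random.Random(2)
est = self_normalized_is(lambda x: x * x, f, g, sample_g, 100000, rng)
```

Because numerator and denominator share the same weights, any multiplicative constant in f (or g) cancels — the property that matters for Bayesian posteriors known only up to normalization.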
Example. Student’s t distribution
X ∼ T_ν, ν ≥ 1, with PDF

f(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}.

Approximate E_f[h_i(X)], i = 1, 2, 3, with

h_1(x) := √(x/(1 − x)),  h_2(x) := x⁵ I_{{x ≥ 2.1}},  h_3(x) := (x⁵/(1 + (x − 3)²)) I_{{x ≥ 0}}.

Possible instrumental distributions: C(0, 1) and N(0, ν/(ν − 2)).

Normal distribution:

f²(x)/g(x) ∝ e^{x²(ν−2)/(2ν)}/(1 + x²/ν)^{ν+1}  (does not have a finite integral on R).

Cauchy distribution:

f²(x)/g(x) ∝ (1 + x²)/(1 + x²/ν)^{ν+1}  (has a finite integral on R).
61 / 417
Example. Student’s t distribution I. Mean

[Figure: empirical means of estimators of E_f[(X/(1 − X))^{1/2}], X ∼ T_12, for 500 replications over 2000 iterations: sampling from T_12 (solid black), importance sampling from C(0, 1) (dashed red) and from N(0, 12/10) (dashed blue).]

The means are stable. h_1(x) = (x/(1 − x))^{1/2} has a singularity at x = 1, so h_1²(x) is not integrable under f.
None of the estimators has finite variance!

62 / 417
Example. Student’s t distribution I. Variance

[Figure: empirical range of estimators of E_f[(X/(1 − X))^{1/2}], X ∼ T_12, for 500 replications: sampling from T_12 (left), importance sampling from C(0, 1) (center) and from N(0, 12/10) (right).]
63 / 417
Example. Student’s t distribution I. Variance

[Figure: empirical range of estimators of E_f[(X/(1 − X))^{1/2}], X ∼ T_12, for 5000 replications: sampling from T_12 (left), importance sampling from C(0, 1) (center) and from N(0, 12/10) (right).]
64 / 417
Example. Student’s t distribution via reflected gamma
Find an instrumental distribution g with better behaviour of (1 − x) g(x) at x = 1.

Reflected gamma distribution RGa(α, β, γ) with PDF

β^α |x − γ|^{α−1} e^{−β|x−γ|}/(2Γ(α)),  x ∈ R, α, β > 0.

If X ∼ RGa(α, β, γ), then |X − γ| ∼ Ga(α, β).

g: PDF of RGa(α, 1, 1):

h_1²(x) f²(x)/g(x) ∝ |x| e^{|x−1|} / (|x − 1|^α (1 + x²/ν)^{ν+1}).

Integrable around x = 1 for α < 1. Problems at ∞: infinite variance, but no influence on the approximation.

To generate: Y ∼ Ga(α, β); Z uniform on {−1, 1}, independent of Y. Then

X = YZ + γ ∼ RGa(α, β, γ).
65 / 417
Example. Student’s t distribution via reflected gamma

[Figure: empirical range of estimators of E_f[(X/(1 − X))^{1/2}], X ∼ T_12, sampling from RGa(0.5, 1, 1) for 500 (left) and 5000 (right) replications.]

66 / 417
Example. Student’s t distribution II.

Approximate E_f[h_2(X)], X ∼ T_12, where h_2(x) := x⁵ I_{{x ≥ 2.1}}.

h_2(x) has restricted support; moreover, substituting u = 1/x,

E_f[h_2(X)] = ∫_{2.1}^∞ x⁵ f(x) dx = ∫_0^{1/2.1} u^{−7} f(1/u) du.

f(x): PDF of T_12.

Use U(0, 1/2.1) as an instrumental distribution. Approximation:

(1/(2.1 n)) Σ_{j=1}^n U_j^{−7} f(1/U_j),  U_j ∼ U(0, 1/2.1).

67 / 417
Example. Student’s t distribution II.

[Figure: convergence of estimators of E_f[X⁵ I_{{X ≥ 2.1}}], X ∼ T_12: sampling from T_12 (black), importance sampling from C(0, 1) (red), from N(0, 12/10) (blue) and from U(0, 1/2.1) (green). Final values: 6.69, 6.53, 5.51 and 6.52. True value: 6.52.]
68 / 417
Example. Student’s t distribution III.
Approximate E_f[h3(X)], X ∼ T12, where

h3(x) := [x⁵/(1 + (x − 3)²)] I_{x≥0}.

h3(x) again has a restricted support and

E_f[h3(X)] = ∫_0^∞ [x⁵/(1 + (x − 3)²)] f(x) dx = ∫_0^∞ [x⁵ e^x/(1 + (x − 3)²)] f(x) e^{−x} dx.

f (x): PDF of T12 .

Use Exp(1) as an instrumental distribution. Approximation:

(1/n) Σ_{j=1}^n [Y_j⁵ e^{Y_j}/(1 + (Y_j − 3)²)] f(Y_j), Y_j ∼ Exp(1).
69 / 417
Example. Student’s t distribution III.
Convergence of estimators of E_f[X⁵/(1 + (X − 3)²) I_{X≥0}], X ∼ T12: sampling from T12 (black), importance sampling from C(0, 1) (red), from N(0, 12/10) (blue) and from Exp(1) (green). Final values: 4.30, 4.40, 4.15 and 4.53. True value: 4.64.
70 / 417
Importance sampling and Accept-Reject
Approximate

J := ∫_X h(x) f(x) dx.

Accept-Reject method
g : PDF of the instrumental distribution satisfying f (x) ≤ Mg (x).
Simulate Y ∼ g(x) and accept it as a variable X ∼ f(x) with probability f(Y)/(M g(Y)).
Ratio f /g is bounded. g can be used as instrumental distribution
in importance sampling.
X1 , X2 , . . . , Xn : sample from f obtained by Accept-Reject method.
Y1 , Y2 , . . . , YN : a sample of random length N from g used to pro-
duce the sample from f .
{Y1 , Y2 , . . . , YN } = {X1 , X2 , . . . , Xn } ∪ {Z1 , Z2 , . . . , ZN−n }.
Z1 , Z2 , . . . , ZN−n : values rejected by Accept-Reject algorithm.
71 / 417
Estimators
Approximate

J := ∫_X h(x) f(x) dx.
Estimators:
δ1 := (1/n) Σ_{j=1}^n h(X_j) and δ2 := (1/N) Σ_{k=1}^N h(Y_k) f(Y_k)/g(Y_k).

If f/g is known only up to a constant:

δ3 := Σ_{k=1}^N h(Y_k) f(Y_k)/g(Y_k) / Σ_{k=1}^N f(Y_k)/g(Y_k).
Decomposition of δ2:

δ2 = (n/N)(1/n) Σ_{j=1}^n h(X_j) + ((N−n)/N)(1/(N−n)) Σ_{k=1}^{N−n} h(Z_k) f(Z_k)/g(Z_k).

δ2 and δ3 contain random sums, as N ∼ Neg(n, 1/M) (negative binomial). This makes it difficult to compare δ1 and δ2. Moreover, the distribution of Z_k is not g(x). 72 / 417
Corrected estimator
PDF of Z_k, k = 1, 2, . . . , N − n:

(M g(x) − f(x))/(M − 1).

Estimator associated with the rejected values:

δ4 := (1/(N−n)) Σ_{k=1}^{N−n} [(M − 1) f(Z_k)/(M g(Z_k) − f(Z_k))] h(Z_k).

Corrected estimators for J:

δ5 := (n/N) δ1 + ((N−n)/N) δ4, where δ1 := (1/n) Σ_{j=1}^n h(X_j);

δ6 := (n/N) δ1 + ((N−n)/N) [Σ_{k=1}^{N−n} h(Z_k) f(Z_k)/(M g(Z_k) − f(Z_k))] / [Σ_{k=1}^{N−n} f(Z_k)/(M g(Z_k) − f(Z_k))].

73 / 417
Example. Gamma Accept-Reject
Generate Ga(α, 1), α ∉ N, α ≥ 1, using the instrumental distribution Ga(a, b) with a = ⌊α⌋ and b = a/α.
Density functions:

Ga(α, 1): f(x) = x^{α−1} e^{−x}/Γ(α); Ga(a, b): g_b(x) = b^a x^{a−1} e^{−bx}/Γ(a).

f(x)/g_b(x) ≤ M := [Γ(a)/Γ(α)] (α/e)^α/(a/e)^a = [Γ(a)/Γ(α)] exp(α(log α − 1) − a(log a − 1)).

Approximate:

E[X/(1 + X)], X ∼ Ga(α, 1).

Bounds for Accept-Reject:

M, M1 := exp(α(log α − 1) − a(log a − 1))/2.
74 / 417
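The accept-reject sampler with the bound M can be sketched as follows (an illustrative Python version — the course's figures were produced in R; seed, sample size and function names are my own):

```python
import numpy as np
from scipy import stats
from scipy.special import gamma as G

def ar_gamma(alpha, n, rng):
    """Accept-reject sample of size n from Ga(alpha, 1) using Ga(a, b),
    a = floor(alpha), b = a/alpha, with the bound M from the slide."""
    a = int(np.floor(alpha))
    b = a / alpha
    M = G(a) / G(alpha) * np.exp(alpha * (np.log(alpha) - 1) - a * (np.log(a) - 1))
    out = []
    while len(out) < n:
        y = rng.gamma(shape=a, scale=1.0 / b, size=n)   # candidates Y ~ Ga(a, b)
        u = rng.uniform(size=n)
        ratio = stats.gamma.pdf(y, alpha) / (M * stats.gamma.pdf(y, a, scale=1.0 / b))
        out.extend(y[u <= ratio])                       # accept with prob f/(M g)
    return np.array(out[:n]), M

rng = np.random.default_rng(1)
x, M = ar_gamma(3.7, 20_000, rng)
est = np.mean(x / (1.0 + x))   # delta_1 for E[X/(1+X)]
```

For α = 3.7 the bound M evaluates to about 1.1164, matching the value reported on the next slide.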
Example. Gamma Accept-Reject
Convergence of estimators δ1 (black), δ5 (green) and δ6 (blue) of E[X/(1+X)], X ∼ Ga(α, 1), to the true value of 0.7496829 (dashed red) for α = 3.7. Bound: M = 1.116356. Final values: 0.749958, 0.751682 and 0.749880.

75 / 417
Example. Gamma Accept-Reject
Convergence of estimators δ1 (black), δ5 (green) and δ6 (blue) of E[X/(1+X)], X ∼ Ga(α, 1), to the true value of 0.7496829 (dashed red) for α = 3.7. Bound: M1 = 1.163983. Final values: 0.750516, 0.752792 and 0.750705.

76 / 417
Example. Gamma Accept-Reject
Convergence of estimators δ1 (black), δ5 (green) and δ6 (blue) of E[X/(1+X)], X ∼ Ga(α, 1), to the true value of 0.708122 (dashed red) for α = 3.08. Bound: M = 1.013969. Final values: 0.708323, 0.705941 and 0.708242.

77 / 417
Laplace approximation
Alternative to Monte Carlo integration. Evaluate

∫_A f(x|θ) dx, f ≥ 0,

for a fixed θ.

Write f(x|θ) = exp(h(x|θ)).
Expand h(x|θ) around a point x0:

h(x|θ) = h(x0|θ) + (x − x0) h′(x0|θ) + [(x − x0)²/2!] h″(x0|θ) + [(x − x0)³/3!] h‴(x0|θ) + R3(x),

where lim_{x→x0} R3(x)/(x − x0)³ = 0.

Choose x0 = x̂_θ: the value that maximizes h(x|θ), so h′(x̂_θ|θ) = 0.
78 / 417
Expansions
Approximation:

∫_A f(x|θ) dx = ∫_A e^{h(x|θ)} dx ≈ e^{h(x̂_θ|θ)} ∫_A e^{(x−x̂_θ)² h″(x̂_θ|θ)/2} e^{(x−x̂_θ)³ h‴(x̂_θ|θ)/3!} dx.

Expand the cubic term using e^y ≈ 1 + y + y²/2:

e^{(x−x̂_θ)³ h‴(x̂_θ|θ)/3!} ≈ 1 + [(x − x̂_θ)³/3!] h‴(x̂_θ|θ) + [(x − x̂_θ)⁶/(2(3!)²)] (h‴(x̂_θ|θ))².

∫_A e^{h(x|θ)} dx ≈ e^{h(x̂_θ|θ)} ∫_A e^{(x−x̂_θ)² h″(x̂_θ|θ)/2}
× {1 + [(x − x̂_θ)³/3!] h‴(x̂_θ|θ) + [(x − x̂_θ)⁶/(2(3!)²)] (h‴(x̂_θ|θ))²} dx.

Number of terms used: order of the approximation.
79 / 417
First-order approximation
Approximation:

∫_A f(x|θ) dx = ∫_A e^{h(x|θ)} dx ≈ e^{h(x̂_θ|θ)} ∫_A e^{(x−x̂_θ)² h″(x̂_θ|θ)/2} dx = f(x̂_θ|θ) ∫_A e^{(x−x̂_θ)² h″(x̂_θ|θ)/2} dx.

Consider the normal CDF with mean x̂_θ and variance −1/h″(x̂_θ|θ).
x̂_θ is the maximum point of h(x|θ), so h″(x̂_θ|θ) < 0.
Take A = [a, b]:

∫_a^b f(x|θ) dx = ∫_a^b e^{h(x|θ)} dx ≈ e^{h(x̂_θ|θ)} √(2π/(−h″(x̂_θ|θ)))
× [Φ(√(−h″(x̂_θ|θ)) (b − x̂_θ)) − Φ(√(−h″(x̂_θ|θ)) (a − x̂_θ))].

Φ: CDF of N (0, 1).


80 / 417
Example. Gamma approximation
First-order approximation of the integral of the Ga(α, β) PDF on [a, b]:

∫_a^b [β^α x^{α−1}/Γ(α)] e^{−βx} dx = [β^α/Γ(α)] ∫_a^b e^{h(x|α,β)} dx,

where h(x|α, β) := (α − 1) log x − βx.

Second-order Taylor expansion of h around x0:

h(x|α, β) ≈ (α−1) log x0 − βx0 + [(α−1)/x0 − β](x − x0) − [(α−1)/(2x0²)](x − x0)².

Maximum of h(x|α, β) at x0 = x̂_{α,β} = (α − 1)/β:

h(x|α, β) ≈ (α−1) log x̂_{α,β} − β x̂_{α,β} − [(α−1)/(2 x̂²_{α,β})](x − x̂_{α,β})²
= (α − 1)(log(α − 1) − log β − 1) − [1/(2(α − 1))](βx − α + 1)².
81 / 417
Example. Gamma approximation
First-order Laplace approximation:

∫_a^b [β^α x^{α−1}/Γ(α)] e^{−βx} dx ≈ [β^α/Γ(α)] x̂^{α−1}_{α,β} e^{−β x̂_{α,β}} √(2π x̂²_{α,β}/(α−1))
× [Φ(√((α−1)/x̂²_{α,β}) (b − x̂_{α,β})) − Φ(√((α−1)/x̂²_{α,β}) (a − x̂_{α,β}))]

= [β^α/Γ(α)] [(α − 1)/β]^{α−1} e^{1−α} [√(2π(α − 1))/β]
× [Φ((bβ − α + 1)/√(α − 1)) − Φ((aβ − α + 1)/√(α − 1))].

Example. α = 6, β = 0.5, x̂_{α,β} = 10.
Interval Approximation Exact value
(9, 11) 0.174016 0.174012
(8, 12) 0.339580 0.339451
(3, 17) 0.867908 0.845947
(12, ∞) 0.321957 0.445680
82 / 417
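The closed-form approximation above is easy to evaluate and compare with the exact gamma probabilities from the table (a sketch in Python/SciPy; the function name is my own):

```python
import math
from scipy import stats

def laplace_gamma(alpha, beta, a, b):
    """First-order Laplace approximation of P(a < X < b), X ~ Ga(alpha, beta),
    following the formula on the slide."""
    Phi = stats.norm.cdf
    const = (beta ** alpha / math.gamma(alpha)
             * ((alpha - 1) / beta) ** (alpha - 1) * math.exp(1 - alpha)
             * math.sqrt(2 * math.pi * (alpha - 1)) / beta)
    s = math.sqrt(alpha - 1)
    return const * (Phi((b * beta - alpha + 1) / s) - Phi((a * beta - alpha + 1) / s))

# the (9, 11) row of the table: alpha = 6, beta = 0.5
approx = laplace_gamma(6, 0.5, 9, 11)
exact = stats.gamma.cdf(11, 6, scale=2) - stats.gamma.cdf(9, 6, scale=2)
```

As the table shows, the approximation is excellent near the mode x̂ = 10 and deteriorates in the tails, e.g. on (12, ∞).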
Monitoring variation
Approximate

∫_X h(x) f(x) dx with h̄_n := (1/n) Σ_{j=1}^n h(X_j), X_j ∼ f(x).

Example. h(x) := (cos(10x) − sin(60x))², f(x) ≡ 1, X = [0, 1].

Black: mean; blue: 95 % confidence bounds based on CLT; red: exact value. 84 / 417
Problem with the confidence interval
At a given n the confidence interval based on CLT is valid, but for
the whole path it is not.

Another Monte Carlo sequence will not stay within the envelope with probability 0.95.

h1 , h2 , h3 , . . . are not independent!

Possible solutions:
I Univariate monitoring: run several independent sequences and
estimate variance using these samples.
Greedy in computing time
I Multivariate monitoring: derive the joint distribution of the sequence (h̄_n).
Calculations might be problematic.

85 / 417
Univariate monitoring. Example
Evaluate

∫_0^1 (cos(10x) − sin(60x))² dx.
500 independent samples of size 10000 are generated from U(0, 1).

Dashed black: running mean of a single sequence; dashed green: 95 % bounds corresponding to the running mean based on CLT; blue: 95 % empirical confidence bounds; orange: mean of running means; red: exact value.
86 / 417
Example. Cauchy prior
Single observation x of X ∼ N(θ, 1) with prior θ ∼ C(0, 1):

f(x|θ) := (1/√(2π)) e^{−(x−θ)²/2}, g(θ) := 1/(π(1 + θ²)),

g(θ|x) ∝ e^{−(x−θ)²/2}/(1 + θ²).

Posterior mean:

δ^π(x) = ∫_{−∞}^∞ [θ/(1+θ²)] e^{−(x−θ)²/2} dθ / ∫_{−∞}^∞ [1/(1+θ²)] e^{−(x−θ)²/2} dθ.

For approximation consider a sample θ1, θ2, . . . , θn ∼ N(x, 1) and

δ̂^π_n(x) := Σ_{j=1}^n θ_j/(1+θ_j²) / Σ_{j=1}^n 1/(1+θ_j²).

Law of Large Numbers: δ̂^π_n(x) → δ^π(x) as n → ∞.

Problems with the estimation of the variance of δ̂^π_n(x).
87 / 417
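The self-normalized estimator δ̂^π_n, and the variant δ̃^π_n that samples from the Cauchy prior instead (used a few slides later), can be sketched as follows (illustrative Python; seed, x = 2.5 and sample size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(7)
x, n = 2.5, 200_000

# delta-hat: theta_j ~ N(x, 1), self-normalized weights 1/(1 + theta^2)
th = rng.normal(x, 1.0, size=n)
post_mean_n = np.sum(th / (1 + th**2)) / np.sum(1.0 / (1 + th**2))

# delta-tilde: theta_j ~ C(0, 1), weights exp(-(x - theta)^2 / 2)
tc = rng.standard_cauchy(size=n)
w = np.exp(-0.5 * (x - tc) ** 2)
post_mean_c = np.sum(tc * w) / np.sum(w)
```

Both are consistent for δ^π(x), so for large n they should agree; the Cauchy prior shrinks the posterior mean from x toward 0.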
Variance estimation
Importance sampling with self-normalization: approximate

E_f[h(X)] = ∫_X h(x) f(x) dx

by

δ_n^h := Σ_{j=1}^n h(X_j) f(X_j)/g(X_j) / Σ_{j=1}^n f(X_j)/g(X_j) = Σ_{j=1}^n h(X_j) W_j / Σ_{j=1}^n W_j,

where X_j ∼ g(x) and W_j := κ f(X_j)/g(X_j) with arbitrary κ ≠ 0.

S_n^h := Σ_{j=1}^n h(X_j) W_j, S_n^1 := Σ_{j=1}^n W_j.

Lemma. The variance of δ_n^h can be approximated by

(1/(n²κ²)) [Var(S_n^h) − 2 E_f[h(X)] Cov(S_n^h, S_n^1) + E_f[h(X)]² Var(S_n^1)].

Liu [1996]:

Var(δ_n^h) ≈ n^{−1} Var_f(h(X)) (1 + Var_g(W)).
  
88 / 417
Example. Cauchy prior
δ̂^π_n(x) := Σ_{j=1}^n θ_j/(1+θ_j²) / Σ_{j=1}^n 1/(1+θ_j²), θ_j ∼ N(x, 1).

Parameter: x = 2.5. Band: 95 % empirical confidence interval from 500 samples; black: running mean of a single sequence; green: 95 % bounds corresponding to the running mean based on CLT; blue: 95 % corrected confidence bounds. 89 / 417
Example. Cauchy prior
δ̂^π_n(x) := Σ_{j=1}^n θ_j/(1+θ_j²) / Σ_{j=1}^n 1/(1+θ_j²), θ_j ∼ N(x, 1).

Parameter: x = 2.5. Band: entire range of 500 samples; black: mean of running means; green: 90 % empirical confidence bounds; blue: 95 % confidence bounds based on CLT.
90 / 417
Example. Cauchy prior
δ̃^π_n(x) := Σ_{j=1}^n θ_j e^{−(x−θ_j)²/2} / Σ_{j=1}^n e^{−(x−θ_j)²/2}, θ_j ∼ C(0, 1).

Parameter: x = 2.5. Band: entire range of 500 samples; black: mean of running means; green: 90 % empirical confidence bounds; blue: 95 % confidence bounds based on CLT.
91 / 417
Multivariate monitoring
Approximate

∫_X h(x) f(x) dx with h̄_n := (1/n) Σ_{j=1}^n h(X_j), X_j ∼ f(x).

To obtain bounds for h̄_n determine the joint distribution of (h̄_n).

Example. X1, X2, . . . , Xn ∼ f(x) i.i.d. with µ := E[X1].
Produce an error bound for the sequence (X̄_m)_{1≤m≤n}, where

X̄_m := (1/m) Σ_{j=1}^m X_j.

Assume: X_i ∼ N(µ, σ²). Then

E[X̄_m] = µ, Cov(X̄_k, X̄_ℓ) = σ²/max{k, ℓ}, k, ℓ = 1, 2, . . . , n.
92 / 417
Multivariate monitoring. Example
For X_i ∼ N(µ, σ²) one has

X_n := (X̄_1, X̄_2, . . . , X̄_n)^⊤ ∼ N_n(µ 1_n, Σ_n),

where 1_n := (1, 1, . . . , 1)^⊤ and Σ_n has entries (Σ_n)_{k,ℓ} = σ²/max{k, ℓ}, i.e.

Σ_n = σ² [ 1 1/2 1/3 . . . 1/n ; 1/2 1/2 1/3 . . . 1/n ; 1/3 1/3 1/3 . . . 1/n ; . . . ; 1/n 1/n 1/n . . . 1/n ].

Distribution:

(X_n − µ1_n)^⊤ Σ_n^{−1} (X_n − µ1_n) ∼ χ²_n if σ² is known; ∼ n F_{n,ν} if σ² is unknown,

and in the latter case we have σ̂² ∼ χ²_ν independent of X_n.
93 / 417
Multivariate monitoring. Example
Inverse of Σ_n (tridiagonal):

Σ_n^{−1} = (1/σ²) [ 2 −2 0 . . . 0 ; −2 8 −6 . . . 0 ; 0 −6 18 −12 . . . 0 ; . . . ; 0 . . . −n(n−1) n² ],

i.e. diagonal entries 2k² for k < n and n² for k = n, off-diagonal entries −k(k+1).

q_n: appropriate χ²_n or F_{n,ν} quantile. Confidence limits:

{µ : (X_n − µ1_n)^⊤ Σ_n^{−1} (X_n − µ1_n) ≤ q_n}
= {µ : µ ∈ X̄_n ± √( X̄_n² − σ²(X_n^⊤ Σ_n^{−1} X_n − q_n)/n )}.

Confidence bounds for the running mean X̄_k:

X̄_k ± √( X̄_k² − σ²(X_k^⊤ Σ_k^{−1} X_k − q_k)/k ), k = 1, 2, . . . , n.
94 / 417
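The tridiagonal form of Σ_n^{−1} can be verified numerically (a short NumPy check, with σ² = 1; the variable names are my own):

```python
import numpy as np

n = 6
i, j = np.indices((n, n)) + 1          # 1-based row/column indices
Sigma = 1.0 / np.maximum(i, j)         # (Sigma)_{k,l} = 1/max(k, l), sigma^2 = 1

# stated tridiagonal inverse: diagonal 2k^2 (k < n) and n^2, off-diagonal -k(k+1)
k = np.arange(1, n)
diag = np.concatenate([2 * k**2, [n**2]]).astype(float)
off = (-k * (k + 1)).astype(float)
Sigma_inv = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
```

Multiplying the two matrices recovers the identity, confirming the stated inverse.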
Rao-Blackwellization
Variance inequality:

Var(E[δ(X)|Y]) ≤ Var(δ(X)).

δ(X): estimator of E[h(X)], X ∼ f(x).

X can be simulated from the joint distribution f(x, y) of (X, Y) satisfying

∫_{−∞}^∞ f(x, y) dy = f(x).

The estimator

δ*(Y) := E[δ(X)|Y]

has smaller variance than δ(X).

95 / 417
Example. Student’s t expectation
Let h(x) := exp(−x²), X ∼ T(ν, µ, σ²), and consider E[h(X)].
Approximation:

δ_n := (1/n) Σ_{j=1}^n exp(−X_j²), X_j ∼ T(ν, µ, σ²).

Construction of William Gosset (Student): Z ∼ N(0, 1), V ∼ χ²_ν,

X = µ + σ Z/√(V/ν) ∼ T(ν, µ, σ²).

The conditional distribution of X given Y = V/ν is normal with mean µ and variance σ²/Y, where Y ∼ Ga(ν/2, ν/2). Conditional PDF:

f(x|y) = [√y/(√(2π) σ)] e^{−(x−µ)² y/(2σ²)}.

Rao-Blackwellized approximation:

δ*_n := (1/n) Σ_{j=1}^n E[exp(−X_j²) | Y_j] = (1/n) Σ_{j=1}^n (2σ²/Y_j + 1)^{−1/2} e^{−µ²/(2σ²/Y_j + 1)}.

96 / 417
Example. Student’s t expectation
Approximate

E[exp(−X²)], X ∼ T(ν, µ, σ²).

Convergence of the estimators δ_n (blue) and δ*_n (black) for ν = 5, µ = 3, σ = 0.5 to the true value of 0.0077 (dashed red).
97 / 417
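Both estimators can be coded in a few lines (an illustrative Python sketch of the construction above; seed and sample size are my own):

```python
import numpy as np

rng = np.random.default_rng(3)
nu, mu, sigma, n = 5, 3.0, 0.5, 100_000

y = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n)   # Y = V/nu ~ Ga(nu/2, nu/2)
z = rng.normal(size=n)
x = mu + sigma * z / np.sqrt(y)                       # X ~ T(nu, mu, sigma^2)

delta = np.mean(np.exp(-x**2))                        # plain Monte Carlo
s = 2 * sigma**2 / y + 1
delta_rb = np.mean(np.exp(-mu**2 / s) / np.sqrt(s))   # Rao-Blackwellized version
```

The Rao-Blackwellized terms are conditional expectations, so their empirical variance is smaller than that of the raw terms.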
Accept-reject framework
Approximate E[h(X)], X ∼ f(x), with

δ_n := (1/n) Σ_{j=1}^n h(X_j).

f, g: the target and the instrumental density, respectively.
X1, X2, . . . , Xn: a sample from f generated by the accept-reject method.
N: (random) number of variables Y_i ∼ g(x) generated in order to obtain a sample of size n from f(x). N satisfies

Σ_{i=1}^N I_{U_i ≤ ω_i} = n, Σ_{i=1}^{N−1} I_{U_i ≤ ω_i} = n − 1, ω_i := f(Y_i)/(M g(Y_i)),

with U_i ∼ U(0, 1), i = 1, 2, . . . , N. Then

δ_n = (1/n) Σ_{i=1}^N h(Y_i) I_{U_i ≤ ω_i}.
98 / 417
Rao-Blackwellized estimator
δ*_n := E[(1/n) Σ_{i=1}^N h(Y_i) I_{U_i ≤ ω_i} | N, Y1, . . . , YN] = (1/n) Σ_{i=1}^N ϱ_i h(Y_i),

where

ϱ_i := E[I_{U_i ≤ ω_i} | N, Y1, . . . , YN].

Let κ ∈ N. Conditionally on N = κ,

ϱ_i = P(U_i ≤ ω_i | N = κ, Y1, . . . , Yκ)
= ω_i [Σ_{(i1,...,i_{n−2})} Π_{j=1}^{n−2} ω_{i_j} Π_{j=n−1}^{κ−2} (1 − ω_{i_j})] / [Σ_{(i1,...,i_{n−1})} Π_{j=1}^{n−1} ω_{i_j} Π_{j=n}^{κ−2} (1 − ω_{i_j})], i = 1, 2, . . . , κ−1,

and ϱ_κ = 1. The numerator (denominator) sum is taken over all subsets of size n − 2 (n − 1) of {1, . . . , i − 1, i + 1, . . . , κ}.

δ*_n is an average over all possible permutations of the realized sample; it depends only on (N, Y_{(1)}, . . . , Y_{(N−1)}, Y_N).
99 / 417
Recursion for the weights
ϱ_i = ω_i [Σ_{(i1,...,i_{n−2})} Π_{j=1}^{n−2} ω_{i_j} Π_{j=n−1}^{κ−2} (1−ω_{i_j})] / [Σ_{(i1,...,i_{n−1})} Π_{j=1}^{n−1} ω_{i_j} Π_{j=n}^{κ−2} (1−ω_{i_j})], ω_i := f(Y_i)/(M g(Y_i)).

For k ≤ m < κ let

S_k(m) := Σ_{(i1,...,i_k)} Π_{j=1}^k ω_{i_j} Π_{j=k+1}^m (1 − ω_{i_j}), with {i1, . . . , i_m} = {1, . . . , m},

S_k(m) := 0 for k > m, and let S_k^i(m) denote the analogous sum with the indices restricted to {1, . . . , m} \ {i}.

Recursion:

S_k(m) = ω_m S_{k−1}(m − 1) + (1 − ω_m) S_k(m − 1);
S_k^i(m) = ω_m S_{k−1}^i(m − 1) + (1 − ω_m) S_k^i(m − 1).

Weights:

ϱ_i = ω_i S^i_{n−2}(κ − 1)/S_{n−1}(κ − 1), i = 1, 2, . . . , κ − 1.
100 / 417
Riemann approximation
Riemann sum for
Z b n−1
X
h(x)f (x)dx is h(aj,n )f (aj,n )(aj+1,n − aj,n ),
a j=0

where a = a0,n < a1,n < . . . < an,n = b and maxj {aj+1,n −aj,n } → 0 as
n → ∞.

Definition. The method of simulation by Riemann sums approxi-


mates the integral
Z b n−1
X   
h(x)f (x)dx by h X(j) f X(j) X(j+1) − X(j) ,
a j=0

where X0 , X1 , . . . , Xn are i.i.d. random variables from distribution f


and X(0) ≤ X(1) ≤ . . . ≤ X(n) are the order statistics associated
with this sample. [Yakowitz, Krimmel & Szidarovszky, 1978]
101 / 417
Order of variance
Theorem. Let U_{(0)} < U_{(1)} < . . . < U_{(n)} be an ordered sample from U(0, 1). If the first derivative h′ is bounded on [0, 1], the estimator

δ_n := Σ_{j=0}^{n−1} h(U_{(j)}) (U_{(j+1)} − U_{(j)}) + h(0) U_{(0)} + h(U_{(n)}) (1 − U_{(n)})

of

∫_0^1 h(x) dx

has a variance of order O(n^{−2}).
Define U_{(−1)} := 0, U_{(n+1)} := 1 and consider

δ̃_n := Σ_{j=−1}^n [(h(U_{(j)}) + h(U_{(j+1)}))/2] (U_{(j+1)} − U_{(j)}).

Theorem. [Yakowitz, Krimmel & Szidarovszky, 1978] If h″ is bounded then the variance of δ̃_n is of order O(n^{−4}).

Remark. For classical Monte Carlo integration the order is O(n^{−1}). 102 / 417

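The trapezoidal estimator δ̃_n is a one-liner on the sorted sample (an illustrative Python sketch; the test function x² and the seed are my own choices, picked because the integral is known exactly):

```python
import numpy as np

def riemann_trapezoid(h, u):
    """delta_tilde_n: trapezoidal Riemann sum over an ordered U(0,1) sample,
    with the conventions U_(-1) := 0 and U_(n+1) := 1."""
    grid = np.concatenate(([0.0], np.sort(u), [1.0]))
    hv = h(grid)
    return np.sum(0.5 * (hv[:-1] + hv[1:]) * np.diff(grid))

rng = np.random.default_rng(0)
est = riemann_trapezoid(lambda x: x**2, rng.uniform(size=2000))  # target: 1/3
```

With n = 2000 the error is far below what a plain Monte Carlo average of h(U_j) could achieve, consistent with the O(n^{−4}) variance.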
General setup
Remark. For d dimensions the order of δ̃_n is O(n^{−4/d}).
For d ≥ 4 the advantage over classical Monte Carlo is lost.

X_{(0)} < X_{(1)} < . . . < X_{(n)}: ordered sample from distribution f.
F^−: the generalized inverse of the CDF F of f.
U_{(0)} < U_{(1)} < . . . < U_{(n)}: ordered sample from U(0, 1).
Representation: X_{(j)} = F^−(U_{(j)}), j = 0, 1, . . . , n.

Approximation of ∫_a^b h(x) f(x) dx:

Σ_{j=0}^{n−1} h(X_{(j)}) f(X_{(j)}) (X_{(j+1)} − X_{(j)})
= Σ_{j=0}^{n−1} h(F^−(U_{(j)})) f(F^−(U_{(j)})) (F^−(U_{(j+1)}) − F^−(U_{(j)}))
≈ Σ_{j=0}^{n−1} H(U_{(j)}) (U_{(j+1)} − U_{(j)}), where H(x) := h(F^−(x)).

The theorem about the order of the variance applies. 103 / 417


Example. Gamma expectation
Let h(x) := x log(x), X ∼ Ga(4, 1), and consider

E[h(X)] = (1/Γ(4)) ∫_0^∞ h(x) x³ e^{−x} dx.

Monte Carlo approximation:

δ_{1,n} := (1/n) Σ_{j=1}^n h(X_j), X_j ∼ Ga(4, 1).

Riemann approximation:

δ_{2,n} := (1/Γ(4)) Σ_{j=1}^{n−1} h(X_{(j)}) X_{(j)}³ e^{−X_{(j)}} (X_{(j+1)} − X_{(j)}).

X_{(1)} ≤ X_{(2)} ≤ . . . ≤ X_{(n)}: the order statistics associated with the sample X1, X2, . . . , Xn from Ga(4, 1).
104 / 417
Example. Gamma expectation
Approximate

E[X log(X)], X ∼ Ga(4, 1).

Convergence of the estimators δ_{1,n} (blue) and δ_{2,n} (black) to the true value of 6.0245 (dashed red).
105 / 417
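The two estimators δ_{1,n} and δ_{2,n} can be compared directly (an illustrative Python sketch; seed and n = 5000 are my own):

```python
import numpy as np
from math import gamma as G

rng = np.random.default_rng(2)
n = 5000
x = np.sort(rng.gamma(shape=4.0, size=n))   # ordered Ga(4, 1) sample
h = lambda t: t * np.log(t)

delta1 = np.mean(h(x))                                  # plain Monte Carlo
f = x**3 * np.exp(-x) / G(4)                            # Ga(4, 1) density at X_(j)
delta2 = np.sum(h(x[:-1]) * f[:-1] * np.diff(x))        # Riemann approximation
```

Note δ_{2,n} carries a small truncation error from the missing mass outside [X_{(1)}, X_{(n)}], but its fluctuations are much smaller than those of δ_{1,n}.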
Importance sampling
Approximate

J := ∫_a^b h(x) f(x) dx = ∫_a^b h(x) [f(x)/g(x)] g(x) dx.

g: PDF of the instrumental distribution.
Monte Carlo approximations of J:

δ_{1,n} := (1/n) Σ_{j=1}^n h(X_j), X_j ∼ f(x);
δ*_{1,n} := (1/n) Σ_{j=1}^n h(Y_j) f(Y_j)/g(Y_j), Y_j ∼ g(x).

Riemann approximations of J:

δ_{2,n} := Σ_{j=1}^{n−1} h(X_{(j)}) f(X_{(j)}) (X_{(j+1)} − X_{(j)});
δ*_{2,n} := Σ_{j=1}^{n−1} h(Y_{(j)}) f(Y_{(j)}) (Y_{(j+1)} − Y_{(j)}). 106 / 417
Example. Student’s t expectation
Let h(x) := (1 + exp(x)) I_{x≤0}, X ∼ T_ν, and consider

E[h(X)] = [Γ((ν + 1)/2)/(√(πν) Γ(ν/2))] ∫_{−∞}^0 (1 + e^x)(1 + x²/ν)^{−(ν+1)/2} dx.

Importance sampling Monte Carlo approximation with instrumental distribution N(0, ν/(ν − 2)):

δ^N_{1,n} := [√2 Γ((ν+1)/2)/(√(ν−2) Γ(ν/2) n)] Σ_{j=1}^n (1 + e^{Y_j})(1 + Y_j²/ν)^{−(ν+1)/2} e^{Y_j²(ν−2)/(2ν)} I_{Y_j ≤ 0},

where Y1, Y2, . . . , Yn ∼ N(0, ν/(ν − 2)). Unstable.
Importance sampling Monte Carlo approximation with instrumental distribution C(0, 1):

δ^C_{1,n} := [√π Γ((ν+1)/2)/(√ν Γ(ν/2) n)] Σ_{j=1}^n (1 + e^{Z_j})(1 + Z_j²/ν)^{−(ν+1)/2} (1 + Z_j²) I_{Z_j ≤ 0},

where Z1, Z2, . . . , Zn ∼ C(0, 1).
107 / 417
Example. Normal instrumental distribution
Approximate

[Γ((ν + 1)/2)/(√(πν) Γ(ν/2))] ∫_{−∞}^0 (1 + e^x)(1 + x²/ν)^{−(ν+1)/2} dx

with instrumental distribution N(0, ν/(ν − 2)).
Riemann approximation:

δ^N_{2,n} := [Γ((ν+1)/2)/(√(πν) Γ(ν/2))] Σ_{j=1}^{n−1} (1 + e^{Y_{(j)}})(1 + Y_{(j)}²/ν)^{−(ν+1)/2} I_{Y_{(j)} ≤ 0} (Y_{(j+1)} − Y_{(j)}).

Normalized Riemann approximation:

δ^N_{3,n} := [Σ_{j=1}^{n−1} (1 + e^{Y_{(j)}})(1 + Y_{(j)}²/ν)^{−(ν+1)/2} I_{Y_{(j)} ≤ 0} (Y_{(j+1)} − Y_{(j)})] / [Σ_{j=1}^{n−1} (1 + Y_{(j)}²/ν)^{−(ν+1)/2} (Y_{(j+1)} − Y_{(j)})].

Y_{(1)} ≤ Y_{(2)} ≤ . . . ≤ Y_{(n)}: the order statistics associated with the sample Y1, Y2, . . . , Yn from N(0, ν/(ν − 2)).
108 / 417
Example. Normal instrumental distribution
Approximate

E[(1 + exp(X)) I_{X≤0}], X ∼ T_ν.

Convergence of the estimators δ^N_{1,n} (blue), δ^N_{2,n} (black) and δ^N_{3,n} (green) for ν = 2.5 to the true value of 0.7330 (dashed red).
109 / 417
Example. Cauchy instrumental distribution
Approximate

[Γ((ν + 1)/2)/(√(πν) Γ(ν/2))] ∫_{−∞}^0 (1 + e^x)(1 + x²/ν)^{−(ν+1)/2} dx

with instrumental distribution C(0, 1).
Riemann approximation:

δ^C_{2,n} := [Γ((ν+1)/2)/(√(πν) Γ(ν/2))] Σ_{j=1}^{n−1} (1 + e^{Z_{(j)}})(1 + Z_{(j)}²/ν)^{−(ν+1)/2} I_{Z_{(j)} ≤ 0} (Z_{(j+1)} − Z_{(j)}).

Normalized Riemann approximation:

δ^C_{3,n} := [Σ_{j=1}^{n−1} (1 + e^{Z_{(j)}})(1 + Z_{(j)}²/ν)^{−(ν+1)/2} I_{Z_{(j)} ≤ 0} (Z_{(j+1)} − Z_{(j)})] / [Σ_{j=1}^{n−1} (1 + Z_{(j)}²/ν)^{−(ν+1)/2} (Z_{(j+1)} − Z_{(j)})].

Z_{(1)} ≤ Z_{(2)} ≤ . . . ≤ Z_{(n)}: the order statistics associated with the sample Z1, Z2, . . . , Zn from C(0, 1).
110 / 417
Example. Cauchy instrumental distribution
Approximate

E[(1 + exp(X)) I_{X≤0}], X ∼ T_ν.

Convergence of the estimators δ^C_{1,n} (blue), δ^C_{2,n} (black) and δ^C_{3,n} (green) for ν = 2.5 to the true value of 0.7330 (dashed red).
111 / 417
Acceleration methods. Correlated simulations
In Monte Carlo integration independent samples are usually simulated.
The use of positively or negatively correlated samples might reduce the variance.

Example. Estimate the difference J1 − J2 , where


Z Z
J1 := g1 (x)f1 (x)dx, J2 := g2 (x)f2 (x)dx.

δi : Monte Carlo approximation of Ji , i = 1, 2.

δ1 and δ2 independent: Var(δ1 − δ2 ) = Var(δ1 ) + Var(δ2 ).

δ1 and δ2 positively correlated:

Var(δ1 −δ2 ) = Var(δ1 )+Var(δ2 )−2 Cov(δ1 , δ2 ) < Var(δ1 )+Var(δ2 ).

112 / 417
Comparison of estimators
f(x|θ): PDF with parameter θ to be estimated.
L(δ(x), θ): loss function for the estimator δ(x) of θ.
Risk function: R(δ, θ) := E[L(δ(X), θ)], X ∼ f(x|θ).

Estimators δ1 and δ2 can be compared through their risks R(δ1 , θ)


and R(δ2 , θ) which usually don’t have analytic expressions.
Approximations:
R̂(δ1, θ) := (1/n) Σ_{j=1}^n L(δ1(X_j), θ), R̂(δ2, θ) := (1/n) Σ_{j=1}^n L(δ2(Y_j), θ).

Xj , Yj ∼ f (x|θ), j = 1, 2, . . . , n.
 
Positive correlation between L(δ1(X_j), θ) and L(δ2(Y_j), θ) reduces the variance of R̂(δ1, θ) − R̂(δ2, θ).
the variance of R(δ b 2 , θ).

113 / 417
Useful remarks

1. The same sample X1, X2, . . . , Xn should be used in the evaluation of R(δ1, θ) and R(δ2, θ). If L(δ1(x), θ) and L(δ2(x), θ) are monotonic in x in the same sense, then

Var((1/n) Σ_{j=1}^n [L(δ1(X_j), θ) − L(δ2(X_j), θ)]) ≤ Var(R̂(δ1, θ) − R̂(δ2, θ)).

2. Possibly use the same sample for the comparison of risks for every value of θ. E.g. the same uniform sample is used for generation of samples from f(x|θ) for every θ. In many cases there exists a transformation M_θ on X such that if X′ ∼ f(x|θ0), then M_θ X′ ∼ f(x|θ).

Smooths the estimates of the risk functions.

114 / 417
Example. Positive-part James-Stein estimator
Problem: estimate the unknown mean vector θ of a p-dimensional normal distribution using a single observation X ∼ N_p(θ, I_p).

Loss function: L(δ(X), θ) := ‖δ(X) − θ‖².
Risk function: R(δ, θ) = E[‖δ(X) − θ‖²] (squared error risk).
Least squares estimator: δ_LS(x) := x.
James-Stein estimator: δ_JS(x) := (1 − (p−2)/‖x‖²) x.

James and Stein [1961]: for p ≥ 3

R(δ_JS, θ) = p − E[(p−2)²/(p − 2 + 2Y)] < p = R(δ_LS, θ), Y ∼ P(‖θ‖²/2).

Positive-part James-Stein estimator:

δ^+_{JS,a}(x) := (1 − a/‖x‖²)^+ x, 0 < a < 2(p − 2).

For p ≥ 3 we have R(δ^+_{JS,a}, θ) ≤ R(δ_JS, θ). 115 / 417
Example. Risk evaluation
Squared error risk of the positive-part James-Stein estimator:

R(δ^+_{JS,a}, θ) = E[ ‖(1 − a/‖X‖²)^+ X − θ‖² ], X ∼ N_p(θ, I_p).

Approximate R(δ^+_{JS,a}, θ) by

(1/n) Σ_{j=1}^n ‖(1 − a/‖X_j‖²)^+ X_j − θ‖², X1, X2, . . . , Xn ∼ N_p(θ, I_p).

Transformation: M_θ x := x + θ − θ0, x ∈ R^p.

If X′ ∼ N_p(θ0, I_p), then M_θ X′ ∼ N_p(θ, I_p).

Generate a sample X′1, X′2, . . . , X′n ∼ N_p(0, I_p).
For all values of θ use X_j := X′_j + θ, j = 1, 2, . . . , n.
116 / 417
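The common-random-numbers trick above can be sketched directly (illustrative Python; p = 5, a = p − 2 = 3, the seed and the two test values of θ are my own):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, a = 5, 1000, 3.0
z = rng.normal(size=(n, p))                    # common N_p(0, I_p) sample X'_j

def risk_hat(theta):
    """Approximate squared error risk of the positive-part JS estimator,
    reusing the same base sample shifted by theta."""
    x = z + theta                              # X_j = X'_j + theta
    s = np.sum(x**2, axis=1)
    shrink = np.maximum(1.0 - a / s, 0.0)      # (1 - a/||x||^2)^+
    return np.mean(np.sum((shrink[:, None] * x - theta) ** 2, axis=1))

r_zero = risk_hat(np.zeros(p))                 # risk near the origin
r_far = risk_hat(np.array([10.0, 0, 0, 0, 0])) # risk for large ||theta||
```

Because every θ reuses the same z, the estimated risk curve is smooth in ‖θ‖, which is exactly the point of the "single sample" figure.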
Comparison of risks
Approximation of the squared error risk of δ^+_{JS,a}:

(1/n) Σ_{j=1}^n ‖(1 − a/‖X_j‖²)^+ X_j − θ‖², X1, X2, . . . , Xn ∼ N_p(θ, I_p).

Approximate squared error risks of δ^+_{JS,a} for N5(θ, I5) as a function of ‖θ‖ for different values of a. Transformations of a single sample of size 1000. 117 / 417
Comparison of risks
Approximation of the squared error risk of δ^+_{JS,a}:

(1/n) Σ_{j=1}^n ‖(1 − a/‖X_j‖²)^+ X_j − θ‖², X1, X2, . . . , Xn ∼ N_p(θ, I_p).

Approximate squared error risks of δ^+_{JS,a} for N5(θ, I5) as a function of ‖θ‖ for different values of a. Individual samples of size 1000 for each θ. 118 / 417
Antithetic variables
Approximate Z ∞
J := h(x)f (x)dx
−∞

using X1 , X2 , . . . , Xn ∼ f (x) and Y1 , Y2 , . . . , Yn ∼ f (x).


If h(X_j) and h(Y_j) are negatively correlated,

(1/(2n)) Σ_{j=1}^n [h(X_j) + h(Y_j)]

is more efficient than an approximation based on 2n i.i.d. variables.


Y1 , Y2 , . . . , Yn : antithetic variables.
Rubinstein [1981]: Generate U1 , U2 , . . . , Un ∼ U(0, 1).
F − : generalized inverse of the CDF corresponding to f (x).
Define: Xj = F − (Uj ), Yj = F − (1 − Uj ), j = 1, 2, . . . , n.
119 / 417
Antithetic variables
Theorem. Let H := h ∘ F^−, U ∼ U(0, 1), X := F^−(U) and Y := F^−(1 − U). If H is monotone and H′ is continuous, then h(X) and h(Y) are negatively correlated.

Proof. Since X, Y ∼ f(x), we have E[h(X)] = E[h(Y)] = J and

Cov(h(X), h(Y)) = E[h(X) h(Y)] − J² = E[H(U) H(1 − U)] − J².

Assume H(u) is non-decreasing, H′ is continuous and H(1) > H(0). Define

g(u) := ∫_0^u H(1 − t) dt − uJ, g(0) = g(1) = 0.

g′(u) = H(1 − u) − J is monotone non-increasing, and

H(0) < ∫_0^1 H(u) du = E[h(X)] = J < H(1) implies g′(0) > 0, g′(1) < 0.

Thus g(u) ≥ 0, u ∈ [0, 1], so ∫_0^1 g(u) H′(u) du ≥ 0. Integration by parts shows

0 ≥ ∫_0^1 g′(u) H(u) du = ∫_0^1 H(1−u) H(u) du − J ∫_0^1 H(u) du = E[H(U) H(1−U)] − J². □
120 / 417
Example. Gamma expectation
Let h(x) := x log(x), X ∼ Ga(4, 1), and consider

E[h(X)] = (1/Γ(4)) ∫_0^∞ h(x) x³ e^{−x} dx.

H(x) := h(F^−(x)), where F^− is the quantile function of Ga(4, 1).

[Plot of H(x) for x ∈ [0, 1]: H is increasing.]

The conditions of the theorem are fulfilled. 121 / 417


Example. Gamma expectation
Let h(x) := x log(x), X ∼ Ga(4, 1), and consider

E[h(X)] = (1/Γ(4)) ∫_0^∞ h(x) x³ e^{−x} dx.

Generate U1, U2, . . . , U_{2n} ∼ U(0, 1).
Define

X_j := F^−(U_j), j = 1, 2, . . . , 2n; Y_k := F^−(1 − U_k), k = 1, 2, . . . , n.

Monte Carlo approximation:

δ_{1,2n} := (1/(2n)) Σ_{j=1}^{2n} h(X_j).

Antithetic approximation:

δ_{2,2n} := (1/(2n)) Σ_{j=1}^n [h(X_j) + h(Y_j)].

122 / 417
Example. Gamma expectation
Approximate

E[X log(X)], X ∼ Ga(4, 1).

Empirical correlation of h(X) and h(Y): −0.7604.

Convergence of the estimators δ_{1,2n} (blue) and δ_{2,2n} (black) to the true value of 6.0245 (dashed red). 123 / 417
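The antithetic construction via the quantile function can be sketched as follows (illustrative Python using SciPy's gamma quantile as F^−; seed and n are my own):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 5000
u = rng.uniform(size=n)
h = lambda t: t * np.log(t)

x = stats.gamma.ppf(u, 4)          # X_j = F^-(U_j),   Ga(4, 1)
y = stats.gamma.ppf(1.0 - u, 4)    # antithetic Y_j = F^-(1 - U_j)

delta2 = np.mean(0.5 * (h(x) + h(y)))          # antithetic estimator
corr = np.corrcoef(h(x), h(y))[0, 1]           # should be strongly negative
```

The strong negative correlation is what drives the variance reduction visible in the figure.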
Antithetic variables. Inversion at a higher level
Approximate

J := ∫_{−∞}^∞ h(x) f(x) dx using δ_n := (1/(2n)) Σ_{j=1}^n [h(X_j) + h(Y_j)],

where X_j, Y_j ∼ f(x) and h(X_j), h(Y_j) are negatively correlated.

Geweke [1988]: If f(x) is symmetric around µ, use Y_j := 2µ − X_j. Then

δ_n = (1/(2n)) Σ_{j=1}^n [h(X_j) + h(2µ − X_j)].

Limiting case: if h(x) is linear, Var(δ_n) = 0. Not interesting.

Works in the cases when h(x) is approximately linear and for large
sample sizes.
124 / 417
Example. Beta expectation
Let h(x) := cos(x) + 2 sin(x), X ∼ Be(α, α), and consider

E[h(X)] = (1/B(α, α)) ∫_0^1 h(x) x^{α−1} (1 − x)^{α−1} dx.

[Plot of h(x) for x ∈ [0, 1].]

E[X] = 1/2 and Be(α, α) is symmetric around the mean. h(x) can be approximated by a linear function. 125 / 417
Example. Beta expectation
Let h(x) := cos(x) + 2 sin(x), X ∼ Be(α, α), and consider

E[h(X)] = (1/B(α, α)) ∫_0^1 h(x) x^{α−1} (1 − x)^{α−1} dx.

Generate X1, X2, . . . , X_{2n} ∼ Be(α, α).
Define Y_k := 1 − X_k, k = 1, 2, . . . , n.

Monte Carlo approximation:

δ_{1,2n} := (1/(2n)) Σ_{j=1}^{2n} h(X_j).

Antithetic approximation:

δ_{2,2n} := (1/(2n)) Σ_{j=1}^n [h(X_j) + h(Y_j)].

126 / 417
Example. Beta expectation
Approximate

E[cos(X) + 2 sin(X)], X ∼ Be(2, 2).

Empirical correlation of h(X) and h(Y): −0.9421.

Convergence of the estimators δ_{1,2n} (blue) and δ_{2,2n} (black) to the true value of 1.7909 (dashed red). 127 / 417
Control variates
Approximate

J := ∫_{−∞}^∞ h(x) f(x) dx

with an estimator δ1.
h0: a function such that J0 := E[h0(X)], X ∼ f(x), is known.

Example. µ is the median of f and h0(x) := I_{x≥µ}. Then J0 = 1/2.

δ3: unbiased estimator of J0.
Weighted estimator: δ2 := δ1 + β(δ3 − J0), β ∈ R.

E[δ2] = E[δ1] and Var(δ2) = Var(δ1) + β² Var(δ3) + 2β Cov(δ1, δ3).

Optimal choice:

β* = −Cov(δ1, δ3)/Var(δ3), which results in Var(δ2) = (1 − ϱ²13) Var(δ1).

ϱ13: correlation between δ1 and δ3.
128 / 417
Special case
Approximations:

E[h(X)] ≈ δ1 := (1/n) Σ_{j=1}^n h(X_j), E[h0(X)] ≈ δ3 := (1/n) Σ_{j=1}^n h0(X_j),

where X, X1, X2, . . . , Xn ∼ f(x).
Control variate estimator:

δ2 := (1/n) Σ_{j=1}^n [h(X_j) + β*(h0(X_j) − E[h0(X)])],

β* = −Cov(h(X), h0(X))/Var(h0(X)).

r control variates: h01(x), h02(x), . . . , h0r(x):

δ2 := (1/n) Σ_{j=1}^n [h(X_j) + Σ_{k=1}^r β*_k (h0k(X_j) − E[h0k(X)])],

β*_k = −Cov(h(X), h0k(X))/Var(h0k(X)).

Estimation of the β*_k: regression of h(X) over h01(X), . . . , h0r(X). 129 / 417
Example. Control variate integration
Approximate

P(X > a) = ∫_a^∞ f(x) dx with δ1 := (1/n) Σ_{j=1}^n I_{X_j > a}, X_j ∼ f(x).

For some µ the value of P(X > µ) is known (a > µ is assumed):

P(X > µ) ≈ δ3 = (1/n) Σ_{j=1}^n I_{X_j > µ}, and

δ2 = (1/n) Σ_{j=1}^n I_{X_j > a} + β [(1/n) Σ_{j=1}^n I_{X_j > µ} − P(X > µ)].

Cov(δ1, δ3) = P(X > a)(1 − P(X > µ))/n,
Var(δ3) = P(X > µ)(1 − P(X > µ))/n.

Improvement in variance if

−2 P(X > a)/P(X > µ) = −2 Cov(δ1, δ3)/Var(δ3) < β < 0.
130 / 417
Example. Tail probability of Student’s t
Approximate P(X > 3), where X ∼ T5 .
Table of T5 : P(X > 3.365) = 0.01, P(X > 2.571) = 0.025.
Let µ = 2.571 and β = −0.01/0.025 = −0.4.
Optimal value: β ∗ = −0.6023.

Convergence of the estimators δ1 (blue), δ2 (black) and the optimal δ2* (green) to the true value of 0.0150 (dashed red). 131 / 417
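The control variate estimator for this tail probability takes only a few lines (illustrative Python; seed and n are my own, β = −0.4 is the slide's choice):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
x = rng.standard_t(df=5, size=n)

p_mu = 0.025                       # known from the table: P(X > 2.571) = 0.025
delta1 = np.mean(x > 3.0)          # crude estimate of P(X > 3)
delta3 = np.mean(x > 2.571)        # control variate estimate of P(X > 2.571)
beta = -0.4                        # = -0.01/0.025
delta2 = delta1 + beta * (delta3 - p_mu)
```

Since −2·0.015/0.025 = −1.2 < β < 0, the choice β = −0.4 is inside the improvement region derived on the previous slide.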
Example. Cauchy prior
Approximate

J := ∫_{−∞}^∞ [θ/(1 + θ²)] (1/√(2π)) e^{−(x−θ)²/2} dθ with δ1 := (1/n) Σ_{j=1}^n θ_j/(1 + θ_j²),

where θ1, θ2, . . . , θn ∼ N(x, 1).

Odd central moments of N(x, 1) are zero. Control variates:

h0k(θ) := (θ − x)^{2k−1}, k = 1, 2, . . . , r.

Estimators of β*1, β*2, . . . , β*r by linear regression:

θ_j/(1 + θ_j²) on (h01(θ_j), h02(θ_j), . . . , h0r(θ_j)), j = 1, 2, . . . , n.

132 / 417
Example. Cauchy prior
Approximate J := E[θ/(1 + θ²)], θ ∼ N(x, 1).
Control variates: h0k(θ) := (θ − x)^{2k−1}, k = 1, 2, 3, 4, 5.

Impact of the control variates on the approximation of J. Convergence of the Monte Carlo estimator δ1 (blue) and the estimators δ_{2,k} using k = 1, 2, 3, 4, 5 control variates (shades of green, darker means larger k) to the true value of 0.3450 (dashed red). 133 / 417
Optimization problems
Find

arg max_{x∈X} h(x) or arg min_{x∈X} h(x), h : X ⊂ R^d → R.

If h has several local maxima or minima, numerical algorithms (e.g. the gradient method) can be trapped in a local optimum.
Example. h(x, y) := ax² + by² − c cos(αx) − d cos(γy) + c + d.

Function h(x, y ) with a = 1, b = 2, c = 0.3, d = 0.4 and α = 3π, γ = 4π.


135 / 417
Example. Partially observed Markov chain
{X_t}: homogeneous Markov chain with state space S = {1, 2}.
Sample (0 denotes an unobserved state):

. . . 0 0 1 0 0 2 2 1 0 0 0 0 1 0 0 2 2 1 0 0 0 0 1 0 0 2 2 1 0 0 0 0 1 0 0 2 2 1 0 . . .

k-step transition probabilities:

p_{ij}^{(k)} = P(X_k = j | X_0 = i), i, j ∈ {1, 2}, k = 1, 2, . . . .

k-step transition matrix: P^{(k)} = (p_{ij}^{(k)})_{i,j=1}^2 = P^k, P = P^{(1)}.
π = (π1, π2): stationary distribution, π^⊤ P = π^⊤.
Likelihood function:

L(P) = π1 p12^{(3)} p22 p21 p11^{(5)} p12^{(3)} p22 p21 p11^{(5)} p12^{(3)} p22 p21 p11^{(5)} p12^{(3)} p22 p21.

ML estimator of P:

P̂ := arg max_P L(P).
136 / 417
Example. Likelihood function
With x = p12, y = p21 and λ := 1 − x − y one has p12^{(k)} = x(1 − λ^k)/(x + y) and p11^{(k)} = (y + xλ^k)/(x + y), so

L(P) = L(x, y) = [y/(x + y)] [(1/(x + y))(x − (1 − x − y)³ x)(1 − y) y]⁴ [(1/(x + y))(y + (1 − x − y)⁵ x)]³.

Surface plot of the likelihood L(x, y) over (x, y) ∈ [0, 1]². 137 / 417
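The likelihood surface can be evaluated and maximized over a grid (an illustrative Python sketch of the formula above; function name, grid and resolution are my own):

```python
import numpy as np

def lik(x, y):
    """Likelihood L(x, y) of the partially observed chain, x = p12, y = p21."""
    lam = 1.0 - x - y
    pi1 = y / (x + y)
    t1 = (x - lam**3 * x) / (x + y) * (1.0 - y) * y   # p12^(3) * p22 * p21
    t2 = (y + lam**5 * x) / (x + y)                   # p11^(5)
    return pi1 * t1**4 * t2**3

# crude grid search for the MLE
g = np.linspace(0.01, 0.99, 99)
X, Y = np.meshgrid(g, g)
L = lik(X, Y)
i = np.unravel_index(np.argmax(L), L.shape)
x_hat, y_hat = X[i], Y[i]
```

A grid search like this only locates a rough maximizer; the stochastic optimization methods of this chapter refine such searches.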
Optimization methods
Find
arg max h(x), h : X ⊂ Rd 7→ R.
x∈X

X is finite: combinatorial optimization.


I Numerical methods
Uses the analytic properties of h and of the domain X , e.g.
gradient method. Produces a deterministic sequence of points
approximating the optimum point.
I Stochastic methods
I Stochastic search or stochastic exploration.
Uses some properties of h and explores X. The objective function is approximately maximized by a random sequence of points. E.g. random search, stochastic gradient method, simulated annealing.
I Stochastic approximation
Finds an acceptable random approximation ĥ of h to be maximized. Not concerned with exploring X. E.g. EM algorithm.
138 / 417
Numerical methods in R
One-dimensional continuous functions: optimize
Example. Find the MLE of θ of a Cauchy C(θ, 1).
Likelihood function:
n
Y 1
`(θ|x1 , x2 , . . . , xn ) = .
1 + (xj − θ)2
j=1
True parameter: θ = 5.

Sequence of MLEs corresponding to 500 simulations from C(5, 1): (left) by optimizing the log-likelihood (black) and the likelihood (red); (right) by optimizing the perturbed log-likelihood g(θ) = − sin²(100(θ − 5)) − log ℓ(θ) (black) and the likelihood ℓ(θ) (red).
139 / 417
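The course uses R's optimize for this; an equivalent sketch with SciPy's bounded scalar minimizer (seed, sample size and search interval are my own choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.standard_cauchy(size=500) + 5.0        # sample from C(5, 1)

def neg_log_lik(theta):
    """Negative log-likelihood of C(theta, 1), up to an additive constant."""
    return np.sum(np.log1p((x - theta) ** 2))

res = minimize_scalar(neg_log_lik, bounds=(3.0, 7.0), method="bounded")
theta_hat = res.x
```

Working on the log scale avoids the numerical underflow that makes direct maximization of ℓ(θ) unstable, which is the point of the left panel of the figure.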
Numerical methods in R
Newton-Raphson method: nlm
Designed for minimization.
Iteration searching for the minimum of h : X ⊂ R^d → R:

x_{k+1} = x_k − [∂²h/(∂x∂x^⊤)(x_k)]^{−1} ∇h(x_k).

Example. Find the MLE of the mean parameters µ1 and µ2 in the mixture model

N(µ1, 1) with prob. 1/4; N(µ2, 1) with prob. 3/4.

Log-likelihood function:

log ℓ(µ1, µ2 | x1, x2, . . . , xn) = Σ_{j=1}^n log(0.25 φ(x_j − µ1) + 0.75 φ(x_j − µ2)).

140 / 417
Newton-Raphson iterations

Six Newton-Raphson sequences for log ℓ based on a sample of 400 observations from the normal mixture with µ1 = 0, µ2 = 2.5.
141 / 417
Stochastic exploration. Basic solution
Find
arg max h(x), h : X ⊂ Rd 7→ R.
x∈X

If X is bounded, simulate u1 , u2 , . . . , un ∼ U(X ).


Use approximation hn∗ := max{h(u1 ), h(u2 ), . . . , h(un )}.
Convergence to the true maximum, but very slow.

Example. Find the maximum of

h(x) = (cos(10x) − sin(60x))², x ∈ [0, 1].
True maximum: h∗ = 3.868428 at 0.3396384 and at 0.6028403.


Approximations:
n 100 500 1000 2000 5000
hn∗ 3.840309 3.868426 3.868426 3.868426 3.868428

142 / 417
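The uniform-sampling search above is trivial to implement (illustrative Python; seed is my own, n = 5000 as in the last column of the table):

```python
import numpy as np

h = lambda x: (np.cos(10 * x) - np.sin(60 * x)) ** 2

rng = np.random.default_rng(6)
u = rng.uniform(size=5000)     # u_1, ..., u_n ~ U(0, 1)
h_star = np.max(h(u))          # h*_n = max{h(u_1), ..., h(u_n)}
```

Convergence to the true maximum 3.868428 is guaranteed but slow: each extra digit of accuracy requires a much larger sample, which motivates the smarter search methods that follow.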
Example
The graph of h(x) = (cos(10x) − sin(60x))² (top) and scatterplot of h(u_i), u_i ∼ U(0, 1), i = 1, 2, . . . , 5000 (bottom).
143 / 417
Precision
Range of 1000 sequences of successive maxima of h(x) = (cos(10x) − sin(60x))² found by uniform sampling over 1000 iterations. True maximum: red line.
144 / 417
Generalization of gradient method
Find

arg max_{x∈X} h(x), h : X ⊂ R^d → R.

The gradient method produces a sequence of the form

x_{k+1} = x_k + α_k ∇h(x_k), α_k > 0.

If h is not differentiable or an analytic expression of h is not available:

x_{k+1} = x_k + α_k ∇̂h(x_k),

where for x = (x1, x2, . . . , xd)^⊤

∇̂h(x) = ( [h(x1+β1, x2, . . . , xd) − h(x1−β1, x2, . . . , xd)]/(2β1), . . . ,
[h(x1, x2, . . . , xd+βd) − h(x1, x2, . . . , xd−βd)]/(2βd) )^⊤.

If, e.g., −h and X are convex, the sequence converges to the true maximum point x* of h : X ⊂ R^d → R.
145 / 417
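A sketch of the finite-difference gradient iteration in Python, written for maximization to match the arg max formulation; the constant step sizes and the concave test function are illustrative:

```python
def fd_gradient(h, x, beta):
    """Central-difference approximation of the gradient of h at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += beta[i]
        xm[i] -= beta[i]
        grad.append((h(xp) - h(xm)) / (2.0 * beta[i]))
    return grad

def gradient_ascent(h, x, n_iter=200, alpha=0.1, beta=1e-5):
    """x_{k+1} = x_k + alpha * grad-hat h(x_k), with constant alpha and beta."""
    b = [beta] * len(x)
    for _ in range(n_iter):
        g = fd_gradient(h, x, b)
        x = [xi + alpha * gi for xi, gi in zip(x, g)]
    return x

# Concave illustrative target with maximum at (1, -2).
x_star = gradient_ascent(lambda v: -(v[0] - 1.0) ** 2 - (v[1] + 2.0) ** 2,
                         [0.0, 0.0])
```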
Random search algorithms
Random search double trial algorithm:
∇_{ζk} h(xk) ≈ [h(xk + βk ζk) − h(xk − βk ζk)] / (2βk) = Δh(xk, βk ζk) / (2βk),
(βk): decreasing sequence;
(ζk): random sequence with members uniformly distributed on the d-dimensional unit sphere.
Iteration:
xk+1 = xk + (αk / (2βk)) Δh(xk, βk ζk) ζk.
Coincides with the gradient method if ζk is taken in the direction of the gradient ∇h(xk).
Nonlinear tactic random search algorithm:
Iteration:
xk+1 = xk + (αk / βk) Yk I{Yk > 0} ζk,  where Yk := h(xk + βk ζk) − h(xk),
and αk > 0, βk > 0.
146 / 417
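A Python sketch of the double trial algorithm; since the test problem below is a minimization (as in the examples that follow), the sign of the step is flipped relative to the ascent form above, and the sequences αk = 1/k, βk = 1/√k are illustrative:

```python
import math
import random

def unit_vector(d):
    """Uniform direction on the d-dimensional unit sphere (normalized Gaussians)."""
    z = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]

def double_trial_search(h, x, n_iter=2000):
    """Double-trial random search for minimization (sign flipped vs. ascent)."""
    for k in range(1, n_iter + 1):
        alpha, beta = 1.0 / k, 1.0 / math.sqrt(k)       # illustrative sequences
        zeta = unit_vector(len(x))
        delta = h([xi + beta * zi for xi, zi in zip(x, zeta)]) \
              - h([xi - beta * zi for xi, zi in zip(x, zeta)])  # Delta h(x_k, beta_k zeta_k)
        x = [xi - alpha / (2.0 * beta) * delta * zi for xi, zi in zip(x, zeta)]
    return x

random.seed(2)
x_min = double_trial_search(lambda v: v[0] ** 2 + v[1] ** 2, [1.0, 1.0])
```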
Example
h(x, y) = (x sin(20y) + y sin(20x))² cosh(sin(10x)x)
        + (x cos(10y) − y sin(10x))² cosh(cos(20y)y).

Plot of function h(x, y) on [−1, 1]². Minimum at (0, 0).
147 / 417
Test runs

Realizations of the random search double trial algorithm for sequences
αk := 1/(100 log(k + 1)), βk := 1/(log(k + 1))^0.1 (left) and αk := 1/(10k), βk := 1/(10k) (right), started from (0.65, 0.8).

Figure Iterations K (xK , yK ) h(xK , yK ) min h(xk , yk )


Left 359 (0.2050, 0.2192) 0.2405514 0.1154418
Right 266 (0.4912, −0.0060) 0.2382754 0.0715281
148 / 417
Test runs

Realizations of the random search double trial algorithm for sequences αk := 1/(k + 1), βk := 1/(k + 1)^0.5 (left) and αk := 1/(k + 1), βk := 1/(k + 1)^0.1 (right), started from (0.65, 0.8).

Figure Iterations K (xK , yK ) h(xK , yK ) min h(xk , yk )


Left 113 (−0.0078, −0.0058) 7.1298 × 10−5 5.9981 × 10−5
Right 105 (−0.0006, 0.0398) 9.8670 × 10−7 5.9110 × 10−8
149 / 417
Physical Annealing

Annealing: A thermal process for obtaining low energy states of a


solid in a heat bath.
Steps:
I Increase the temperature of the heat bath to a maximum
value at which the solid melts
I Decrease carefully the temperature of the heat bath until the
particles arrange themselves in the ground state of the solid.
Liquid phase: all particles arrange themselves randomly.
Ground state: particles are arranged in a highly structured lattice,
the energy of the system is minimal.
The ground state is obtained if the maximum temperature is high enough and the cooling is sufficiently slow. Otherwise a meta-stable state is reached.

150 / 417
Simulation of the annealing process
X : a finite set describing the possible states of the solid.
h : X 7→ R: energy function. h(x), x ∈ X , is the energy corres-
ponding to state x.
Aim: find the state xopt minimizing the energy function.
x: current state with energy h(x).
y : a candidate state generated from x by a perturbation
mechanism.
Metropolis criterion [Metropolis et al., 1953]
I If h(y ) − h(x) ≤ 0 then accept y as a new current state.
I If h(y ) − h(x) > 0 then accept y as a new current state with
probability
exp( (h(x) − h(y)) / (kB T) ).
T : temperature of the heat bath.
kB : Boltzmann constant relating energy at the individual particle level with temperature. kB = 1.3806488(13) × 10⁻²³ J/K.
151 / 417
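The Metropolis criterion itself is only a few lines; in the combinatorial setting below kB is conventionally set to 1, and the energies and temperatures in the sketch are illustrative:

```python
import math
import random

def metropolis_accept(h_x, h_y, T, kB=1.0):
    """Metropolis criterion: a move from energy h(x) to h(y) at temperature T
    is always accepted downhill, and uphill with prob. exp((h(x)-h(y))/(kB*T))."""
    if h_y <= h_x:
        return True
    return random.random() < math.exp((h_x - h_y) / (kB * T))

always = metropolis_accept(2.0, 1.0, T=0.5)     # downhill move: accepted
never = metropolis_accept(1.0, 2.0, T=1e-9)     # uphill at (almost) zero temperature
```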
Boltzmann distribution
Physical process: if the cooling is sufficiently slow, the solid reaches thermal equilibrium at each temperature T.
Simulation: thermal equilibrium is reached after generating a large
number of transitions at a given temperature T .
Boltzmann distribution: distribution of states at a given temperature T corresponding to thermal equilibrium:
P(X = x) = (1/Z(T)) exp(−h(x)/(kB T)).

X : current state of the solid.


Z(T): partition function defined as
Z(T) = Σ_{x∈X} exp(−h(x)/(kB T)).

152 / 417
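The Boltzmann distribution over a finite state space can be computed directly; the three-state energy values below are hypothetical and kB = 1:

```python
import math

def boltzmann(energies, T, kB=1.0):
    """Boltzmann probabilities exp(-h(x)/(kB*T)) / Z(T) over a finite state space."""
    w = [math.exp(-h / (kB * T)) for h in energies]
    Z = sum(w)                         # partition function Z(T)
    return [wi / Z for wi in w]

p_hot = boltzmann([0.0, 1.0, 2.0], T=100.0)   # high T: nearly uniform
p_cold = boltzmann([0.0, 1.0, 2.0], T=0.01)   # low T: mass on the ground state
```

This already shows the mechanism exploited by annealing: as T decreases, the distribution concentrates on the minimum-energy states.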
Combinatorial problems
Find

arg max_{x∈X} h(x)  or  arg min_{x∈X} h(x),  h : X ↦ R,  X is finite.

h: cost function, equivalent to the energy function.


x ∈ X : solution of the optimization problem, equivalent to a state.

Example. (Travelling salesman problem)


Given n cities and the n × n distance matrix (dpq ) of the cities.
Tour: a closed path visiting each city exactly once.
Problem: find the tour of minimal length.

A solution: a cyclic permutation π = (π(1), π(2), . . . , π(n)).
π(k): successor of city k. π^ℓ(k) ≠ k, ℓ = 1, 2, . . . , n − 1, and π^n(k) = k for all k.
X : set of all cyclic permutations π on n cities. card(X) = (n − 1)!.
Cost function: h(π) = Σ_{k=1}^n d_{k,π(k)}.
153 / 417
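With π stored as a successor map, the cost function h(π) = Σk d_{k,π(k)} is a one-liner in Python; the 4-city distance matrix below is a hypothetical example:

```python
def tour_length(pi, d):
    """Cost h(pi) = sum_k d[k][pi[k]] of a cyclic permutation pi
    (pi[k] = successor of city k)."""
    return sum(d[k][pi[k]] for k in range(len(pi)))

# Hypothetical 4 cities at the corners of a unit square; tour 0 -> 1 -> 2 -> 3 -> 0.
d = [[0.0, 1.0, 1.414, 1.0],
     [1.0, 0.0, 1.0, 1.414],
     [1.414, 1.0, 0.0, 1.0],
     [1.0, 1.414, 1.0, 0.0]]
length = tour_length([1, 2, 3, 0], d)
```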
Neighbourhood structures
X : solution space of the combinatorial optimization problem.
h : X 7→ R: cost function.
(X , h): an instance of the combinatorial optimization problem.
Definition. Let (X , h) be an instance of a combinatorial optimi-
zation problem. A neighbourhood structure is a mapping

N : X 7→ 2X ,

which defines for each solution x a set Xx ⊂ X of solutions that are "close" to x in some sense. The set Xx is called the neighbourhood of solution x and each y ∈ Xx is called a neighbour of x. We assume that y ∈ Xx ⇐⇒ x ∈ Xy.
Definition. Let (X , h) be an instance of a combinatorial optimi-
zation problem and N be a neighbourhood structure. A generation
mechanism is a means of selecting a solution y from the neighbourhood Xx of a solution x.
154 / 417
Example. Travelling salesman problem
(X , h): an instance of a travelling salesman problem with n cities.
Nk : k-change neighbourhood structure. [Lin, 1965]
Xx : solutions that can be obtained from x by removing k edges
from the tour corresponding to x and replacing them with k other
edges such that again a tour is obtained.
N2 (p, q): a 2-change connected to cities p and q, which changes
a solution x into solution y .
Connection between cyclic permutations πx and πy:
1 < k < n: an integer for which πx^k(p) = q.
πy(p) = πx^{−1}(q) = πx^{k−1}(p),    πy(πx(p)) = q = πx^k(p),
πy(πx^r(p)) = πx^{r−1}(p),  r = 2, 3, . . . , k − 1,
πy(s) = πx(s) for all other cities.
Xx := { y ∈ X : πy is obtained from πx by a 2-change }.
card(Xx) = (n − 1)(n − 2) and each pair of solutions can be connected by a sequence of 2-changes.
155 / 417
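For illustration, a 2-change is easiest to code with the tour stored as a visiting order rather than the successor permutation used above: removing two edges and reconnecting amounts to reversing a segment. A sketch, not the slide's exact representation:

```python
def two_change(tour, i, j):
    """2-change (2-opt) move: reverse the segment tour[i+1..j], which removes
    the edges (tour[i], tour[i+1]) and (tour[j], tour[j+1]) and reconnects."""
    assert 0 <= i < j < len(tour)
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]

neighbour = two_change([0, 1, 2, 3, 4], 1, 3)
```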
Example. Ising model
X : D × D binary matrices with entries +1 or −1.
Energy function:
h(x) = −J Σ_{(i,j)∈N} xi xj − H Σ_i xi.
xi: the ith entry of matrix x ∈ X.
N : neighbourhood structure, e.g. (i, j) ∈ N if the entries are neighbours horizontally or vertically.
The sum is taken over all neighbouring pairs (i, j).
Probability of a configuration: Boltzmann distribution corresponding to h(x).
Aim: find the minimum energy (most probable) configuration.
Conditional distributions:
P(xi = 1 | xj, j ≠ i) = exp( 2(H + J Σ_{j:(i,j)∈N} xj) ) / ( 1 + exp( 2(H + J Σ_{j:(i,j)∈N} xj) ) ).
156 / 417
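The energy function can be evaluated directly; a small Python sketch for the horizontal/vertical neighbourhood structure, with J, H and the configuration as illustrative choices:

```python
def ising_energy(x, J=1.0, H=0.0):
    """Energy h(x) = -J * sum of x_i x_j over horizontal/vertical neighbour
    pairs - H * sum of all entries, for a D x D matrix with entries +1/-1."""
    D = len(x)
    pair_sum = 0
    for i in range(D):
        for j in range(D):
            if i + 1 < D:
                pair_sum += x[i][j] * x[i + 1][j]   # vertical neighbour pair
            if j + 1 < D:
                pair_sum += x[i][j] * x[i][j + 1]   # horizontal neighbour pair
    field_sum = sum(sum(row) for row in x)
    return -J * pair_sum - H * field_sum

e_aligned = ising_energy([[1, 1], [1, 1]])   # all-aligned 2x2 configuration
```

For H = 0 both all-aligned configurations (+1 and −1) attain the minimum energy.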
Discrete simulated annealing algorithm
(X , h): an instance of a combinatorial minimization problem
searching for
arg min_{x∈X} h(x).

(Tk ): decreasing sequence of “temperature” values.


L: number of iterations with fixed temperature Tk.

Algorithm
1. Generate a candidate solution y from Xx ;
2. Accept x = y with probability
   exp( −(h(y) − h(x))⁺/Tk );
   repeat step 1, otherwise;


3. After L iterations update Tk to Tk+1 .
For a ∈ R, a+ = a if a > 0 and a+ = 0 otherwise.
157 / 417
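The algorithm above on a toy state space, in Python; the neighbourhood map, cooling schedule and target are all illustrative choices:

```python
import math
import random

def simulated_annealing(h, neighbours, x0, temps, L=50):
    """Discrete SA: L Metropolis-type steps at each temperature of the
    schedule; returns the best state visited."""
    x, best = x0, x0
    for T in temps:
        for _ in range(L):
            y = random.choice(neighbours(x))          # candidate from X_x
            if random.random() < math.exp(-max(h(y) - h(x), 0.0) / T):
                x = y                                 # accept w.p. exp(-(h(y)-h(x))^+/T)
            if h(x) < h(best):
                best = x
    return best

# Toy problem: minimize h(x) = (x - 13)^2 over the integers 0..20.
random.seed(3)
nb = lambda x: [max(x - 1, 0), min(x + 1, 20)]
temps = [2.0 * 0.9 ** k for k in range(40)]           # geometric cooling
x_opt = simulated_annealing(lambda x: (x - 13) ** 2, nb, 0, temps)
```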
Markov chain representation
Xn : the state of the algorithm at the nth step. A Markov chain
with state space X .
Definition. Let (X , h) be an instance of a combinatorial minimi-
zation problem. The transition probabilities for the SA algorithm
are defined as:

Pxy (n) := Pxy (Tn ) := P Xn = y |Xn−1 = x

Gxy (Tn )Axy (Tn ) if x 6= y ;
= 1 − P P (T ) if x = y , x, y ∈ X .
 xz n
z∈X ,z6=x

Gxy (Tn ): generation probability. The probability of generating


solution y from solution x at temperature Tn .

Axy (Tn ): acceptance probability. The probability of accepting


solution y generated from solution x at temperature Tn .
158 / 417
Transitions
Special generation and acceptance probabilities are considered.
Definition. Generation probability: Assume card(Xx ) = C for all
x ∈ X.
Gxy(Tn) = Gxy := (1/C) I_{Xx}(y),  x, y ∈ X.

Uniform choice from the neighbourhood of x.


Definition. Acceptance probability (Metropolis criterion):
Axy(Tn) := exp( −(h(y) − h(x))⁺/Tn ),  x, y ∈ X.

Remark. If Tn = T for all n then Xn is a homogeneous Markov


chain with transition probabilities
Pxy = P(Xn = y | Xn−1 = x) = Gxy Axy(T) if x ≠ y;  1 − Σ_{z∈X, z≠x} Pxz(T) if x = y.
159 / 417
Stationary distribution
Theorem. Let (X , h) be an instance of a combinatorial minimiza-
tion problem and denote by P(T ) the transition matrix associated
with the SA algorithm. Assume that for all x, y ∈ X there exists
an integer r ≥ 1 and u0 , u1 , . . . , ur ∈ X with u0 = x, ur = y such
that Guk uk+1 > 0, k = 0, 1, . . . , r − 1. Then the Markov chain Xn
has a stationary distribution q(T ) with components

qx(T) = (1/N0(T)) exp(−h(x)/T),  x ∈ X,  where
N0(T) := Σ_{y∈X} exp(−h(y)/T).

Proof. Three statements to prove:


1. Irreducibility. ∀x, y ∈ X ∃n ≥ 1 : P^n_{xy}(T) > 0;
2. Aperiodicity. By irreducibility it suffices to show: ∃x ∈ X : Pxx(T) > 0;
3. Detailed balance equation. qx (T )Pxy (T ) = qy (T )Pyx (T ), ∀x, y ∈ X .
160 / 417
Proof
Irreducibility: By the condition on the generating probabilities
P^r_{xy}(T) = Σ_{v1∈X} Σ_{v2∈X} · · · Σ_{v_{r−1}∈X} P_{x v1}(T) P_{v1 v2}(T) · · · P_{v_{r−1} y}(T)
  ≥ G_{x u1} A_{x u1}(T) G_{u1 u2} A_{u1 u2}(T) · · · G_{u_{r−1} y} A_{u_{r−1} y}(T) > 0,
as all acceptance probabilities are positive.
Aperiodicity: Let x, y ∈ X with h(x) < h(y) and Gxy > 0 (such a pair always exists by the condition on the generating probabilities if X ≠ Xopt). Then Axy(T) < 1, so
Pxx(T) = 1 − Σ_{z∈X, z≠x} Gxz Axz(T) = 1 − Gxy Axy(T) − Σ_{z∈X, z≠x,y} Gxz Axz(T)
  > 1 − Gxy − Σ_{z∈X, z≠x,y} Gxz = 1 − Σ_{z∈X, z≠x} Gxz ≥ 0.
Detailed balance equation: Gxy = Gyx, so it suffices to show
qx(T) Axy(T) = qy(T) Ayx(T),  ∀x, y ∈ X.
qx(T) Axy(T) = N0(T)^{−1} exp(−h(x)/T) exp(−(h(y) − h(x))⁺/T)
  = N0(T)^{−1} exp(−h(y)/T) exp( −(h(x) − h(y))/T − (h(y) − h(x))⁺/T )
  = N0(T)^{−1} exp(−h(y)/T) exp( −(h(x) − h(y))⁺/T ) = qy(T) Ayx(T).  □
161 / 417
Limit of stationary distributions
Xopt : set of optimal solutions. hopt : optimal cost.
If x ∈ Xopt then for all y ∈ X we have hopt = h(x) ≤ h(y ).
Theorem. Let (X, h) be an instance of a combinatorial minimization problem and consider the stationary distribution
qx(T) = ( Σ_{y∈X} exp(−h(y)/T) )^{−1} exp(−h(x)/T),  x ∈ X.
Then
lim_{T↘0} qx(T) = q∗_x := (1/card(Xopt)) I_{Xopt}(x).
Proof. lim_{x↘0} exp(α/x) = 0 if α < 0, and 1 if α = 0. Hence
lim_{T↘0} qx(T) = lim_{T↘0} exp(−h(x)/T) / Σ_{y∈X} exp(−h(y)/T)
  = lim_{T↘0} exp((hopt − h(x))/T) / Σ_{y∈X} exp((hopt − h(y))/T)
  = lim_{T↘0} [ 1 / Σ_{y∈X} exp((hopt − h(y))/T) ] I_{Xopt}(x)
    + lim_{T↘0} [ exp((hopt − h(x))/T) / Σ_{y∈X} exp((hopt − h(y))/T) ] I_{X\Xopt}(x)
  = (1/card(Xopt)) I_{Xopt}(x) + 0.  □
162 / 417
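The limit can also be checked numerically; a Python sketch with a hypothetical four-state energy landscape having two global minima (h = 0):

```python
import math

def stationary_q(energies, T):
    """Components q_x(T) = exp(-h(x)/T) / N0(T) of the stationary distribution."""
    w = [math.exp(-h / T) for h in energies]
    N0 = sum(w)
    return [wi / N0 for wi in w]

# Two global minima (h = 0) among four states: q(T) -> uniform on X_opt as T -> 0.
q = stationary_q([0.0, 0.0, 1.0, 3.0], T=0.01)
```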
Remarks
XnT : the MC corresponding to the SA algorithm at temperature T .
I lim_{T↘0} qx(T) = q∗_x := (1/card(Xopt)) I_{Xopt}(x)
  implies
  lim_{T↘0} lim_{n→∞} P(XnT = x) = lim_{T↘0} qx(T) = q∗_x
  or
  lim_{T↘0} lim_{n→∞} P(XnT ∈ Xopt) = 1.
I Gxy := (1/C) I_{Xx}(y) can be replaced by any symmetric Gxy = Gyx, x, y ∈ X.
I Gxy := (1/card(Xx)) I_{Xx}(y) leads to stationary distribution
  q̃x(T) = card(Xx) exp(−h(x)/T) / Σ_{y∈X} card(Xy) exp(−h(y)/T),  x ∈ X,
  and lim_{T↘0} q̃x(T) = q∗_x. [Lundy & Mees, 1986]
These results require infinitely many steps at a given temperature!
163 / 417
Inhomogeneous Markov chains
Consider a sequence of homogeneous Markov chains of length L
corresponding to L steps at a given temperature.

T′ℓ: temperature of the ℓth chain, T′ℓ ≥ T′ℓ+1, ℓ = 0, 1, . . . .
Tk: temperature at the kth step of the algorithm:
Tk = T′ℓ for ℓL < k ≤ (ℓ + 1)L.
The SA algorithm can be represented as an inhomogeneous Markov chain Xn with transition probabilities
Pxy(n) := P(Xn = y | Xn−1 = x) = Gxy(Tn) Axy(Tn) if x ≠ y;  1 − Σ_{z∈X, z≠x} Pxz(Tn) if x = y.

164 / 417
A convergence result
Theorem. Let (X , h) be an instance of a combinatorial minimiza-
tion problem and consider the SA algorithm, where the generation
probabilities are uniform on the (equal) neighbourhoods and acceptance probabilities are according to the Metropolis criterion. Assume:
1. For all x, y ∈ X there exists an integer r ≥ 1 and a sequence
   u0, u1, . . . , ur ∈ X with u0 = x, ur = y such that G_{uk uk+1} > 0,
   k = 0, 1, . . . , r − 1.
2. The sequence T′ℓ satisfies
   T′ℓ ≥ (L + 1)Δ / log(ℓ + 2),  where Δ := max_{x,y∈X} { h(y) − h(x) : y ∈ Xx },
   ℓ = 0, 1, . . . , and L is chosen as the maximum over y ∈ X of the minimum number of transitions required to reach xopt from y.
Then the limiting distribution q∗ of the corresponding MC Xn has components
q∗_x := (1/card(Xopt)) I_{Xopt}(x),  that is,  lim_{n→∞} P(Xn ∈ Xopt) = 1.
165 / 417
Depth of local minima

Definition. Let x, y ∈ X , then y is reachable at height τ from x


if there exists an integer r ≥ 1 and u0 , u1 , . . . , ur ∈ X with u0 = x
and ur = y such that for all k = 0, 1, . . . , r − 1 we have Guk uk+1 > 0
and h(uk ) ≤ τ .

Xloc : the set of local minima. Xopt ⊆ Xloc .

Definition. Let w ∈ Xloc , then the depth d(w ) of the local mini-
mum w is the smallest τ such that there is a solution y ∈ X with
h(y ) < h(w ) that is reachable at height h(w ) + τ from w . If w is
a global minimum then d(w ) := ∞.

166 / 417
Necessary and sufficient condition
Theorem. [Hajek, 1988] Let (X , h) be an instance of a combina-
torial minimization problem and let Xn be the MC corresponding
to the SA algorithm. Assume:
1. For all x, y ∈ X there exists an integer r ≥ 1 and sequence
u0 , u1 , . . . , ur ∈ X with u0 = x, ur = y such that Guk uk+1 > 0,
k = 0, 1, . . . , r − 1.
2. For all τ > 0 and x, y ∈ X , x is reachable from y at height τ
if and only if y is reachable from x at height τ .
If Tn ↘ 0 as n → ∞, then
lim_{n→∞} P(Xn ∈ Xopt) = 1
if and only if

Σ_{k=1}^∞ exp(−D/Tk) = ∞,    (1)
where D := max{ d(w) : w ∈ Xloc \ Xopt }.
Remark. For Tn := Γ/log(n), condition (1) is equivalent to Γ ≥ D.
167 / 417
Continuous simulated annealing algorithm
Find
arg min_{x∈X} h(x),  h : X ⊆ Rd ↦ R.
Neighbourhood of x: a random perturbation ζ is generated from a
distribution with PDF g (|y − x|) (centered at x).
Tk : temperature at the kth step.
Algorithm
1. Generate z from distribution g (|y − xk |);
2. Accept xk+1 = z with probability
exp( −(h(z) − h(xk))⁺/Tk );
Take xk+1 = xk , otherwise;
3. Update Tk to Tk+1 .
Can be modeled by a discrete time continuous state space MC.
The results highly depend on g and on the cooling sequence Tk .
A numerical optimization (e.g. Newton-Raphson) can be launched from the optimal point of SA.
168 / 417
Example
Find the maximum of
h(x) = (cos(10x) − sin(60x))²,  x ∈ X = [0, 1],

that is the minimum of e.g. −h(x).


g : PDF of U(−%k , %k ). Scale %k can be updated at each step.

Algorithm
1. Generate z from distribution U(ak , bk ), where
ak := max{xk − %k , 0}, bk := min{xk + %k , 1};
2. Accept xk+1 = z with probability
exp( −(h(xk) − h(z))⁺/Tk );

Take xk+1 = xk , otherwise;


3. Update Tk to Tk+1 , %k to %k+1 .
Stopping rule: h has not changed e.g. in the last half of the steps.
169 / 417
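The [0, 1] algorithm above in Python, with the illustrative choices Tk = 1/log(k + 1) and ϱk = √Tk and the best visited point returned (a sketch; the slides also track a stopping rule):

```python
import math
import random

def sa_maximize(h, n_iter=5000):
    """Continuous SA on [0,1] maximizing h with U(a_k, b_k) proposals,
    T_k = 1/log(k+1) and rho_k = sqrt(T_k); returns the best point visited."""
    x = random.random()
    best = x
    for k in range(1, n_iter + 1):
        T = 1.0 / math.log(k + 1.0)
        rho = math.sqrt(T)
        a, b = max(x - rho, 0.0), min(x + rho, 1.0)
        z = random.uniform(a, b)
        if random.random() < math.exp(-max(h(x) - h(z), 0.0) / T):
            x = z
        if h(x) > h(best):
            best = x
    return best

random.seed(4)
h = lambda x: (math.cos(10.0 * x) - math.sin(60.0 * x)) ** 2
x_best = sa_maximize(h)
```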
Example

Four different runs of SA with random starting point for Tk = 1/log(k + 1), ϱk = √Tk (red) and Tk = 1/(k + 1)², ϱk = 5√Tk (green).
170 / 417
Example
Find the minimum of
h(x, y) = (x sin(20y) + y sin(20x))² cosh(sin(10x)x)
        + (x cos(10y) − y sin(10x))² cosh(cos(20y)y).

Perturbations: normal random variables with variance σk², where
σk := ϱk √Tk.

%k : scaling factor depending on acceptance ratio. Increases when


many candidate points are accepted and decreases otherwise.

Cooling sequences:
Tk := (0.95)^k,    Tk := 1/(10(k + 1)),
Tk := 1/log(k + 1),    Tk := 1/√(10 log(k + 1)).

171 / 417
Example

SA realizations for cooling sequences Tk := (0.95)^k, Tk := 1/(10(k + 1)), Tk := 1/log(k + 1), Tk := 1/√(10 log(k + 1)), started from (0.65, 0.8).
172 / 417
Coordinatewise perturbations
Find
arg min_{x∈X} h(x),  h : X ⊆ Rd ↦ R.
Tk : temperature at the kth step.

vk = (vk,1, vk,2, . . . , vk,d): step size vector at the kth step.
Generation of a candidate point z = (z1, z2, . . . , zd) from the current state xk = (xk,1, xk,2, . . . , xk,d) is performed coordinatewise.
Algorithm
1. Choose a direction κ uniformly from {1, 2, . . . , d};
2. Generate u from U(−1, 1);
3. Set zi = xk,i + vk,i u if i = κ, and zi = xk,i if i ≠ κ.
Coordinates of vk are regularly adjusted in order to keep the acceptance ratio around 50 %.
Temperature is reduced after a given number of adjustments of the
step size.
173 / 417
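The coordinatewise candidate generation of steps 1-3, sketched in Python; the current state and step size vector are illustrative:

```python
import random

def coordinate_candidate(x, v):
    """One step of the coordinatewise generation: perturb a uniformly chosen
    coordinate kappa of x by v[kappa] * u with u ~ U(-1, 1)."""
    kappa = random.randrange(len(x))     # direction chosen uniformly
    u = random.uniform(-1.0, 1.0)
    z = list(x)
    z[kappa] = x[kappa] + v[kappa] * u   # only coordinate kappa changes
    return z

random.seed(5)
z = coordinate_candidate([0.0, 0.0, 0.0], [0.5, 0.5, 0.5])
```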
Theoretical results
XnT : Markov chain corresponding to the SA algorithm with coor-
dinatewise perturbations and fixed step size vector vk = v.

Theorem. Suppose that exp(−h(x)/T), T > 0, is integrable over X = Rd. Then the invariant distribution of XnT is given by the probability measure
µT(Q) = ∫_Q exp(−h(x)/T) dx / ∫_X exp(−h(x)/T) dx,  Q ∈ B(X),

where B(X ) denotes the Borel σ-algebra.

Xopt : finite set of globally optimal points.

Under some regularity conditions on h, as T ↘ 0, the probability measures µT converge weakly to a probability measure µ∗ concentrated on Xopt.
174 / 417
Example. Single global minimum
Test function: h(x, y ) := x 2 +2y 2 −0.3 cos(3πx)−0.4 cos(4πy )+ 0.7.
Global minimum at the origin.

175 / 417
Test runs. Single global minimum
Test function: h(x, y ) := x 2 +2y 2 −0.3 cos(3πx)−0.4 cos(4πy )+ 0.7.
Final points of 100 test runs with random starting points from U([−1, 1]²).
176 / 417
Visited points. Single global minimum
Test function: h(x, y ) := x 2 +2y 2 −0.3 cos(3πx)−0.4 cos(4πy )+ 0.7.
Points visited during a single test run (every tenth). Starting point: ◦; final point: ⋆.
Starting point Optimum point Final point Acceptance
x y x y x y ratio
−0.4179 −0.2710 −0.0006 −0.0007 0.0146 −0.0218 50 %
177 / 417
Example. Multiple global minima
Test function: h(x, y ) := x 2 +2y 2 −0.3 cos(3πx)+0.4 cos(4πy )+ 0.7.
Global minima at (0, −0.23502848) and (0, 0.23502848).

178 / 417
Test runs. Multiple global minima
Test function: h(x, y ) := x 2 +2y 2 −0.3 cos(3πx)+0.4 cos(4πy )+ 0.7.
Final points of 100 test runs with random starting points from U([−1, 1]²).


Relative frequencies of the optimal points from 500 test runs:


(0, −0.23502848): 259/500,  (0, 0.23502848): 241/500.
179 / 417
Visited points. Multiple global minima
Test function: h(x, y ) := x 2 +2y 2 −0.3 cos(3πx)+0.4 cos(4πy )+ 0.7.

Points visited during a single test run (every tenth). Starting point: ◦; final point: ⋆.
Starting point Optimum point Final point Acceptance
x y x y x y ratio
0.6505 0.6896 −0.0004 −0.2354 −0.0027 −0.2632 49 %
180 / 417
Remarks on implementation
Problem: find the ML estimates of the transition probabilities of a
partially observed homogeneous MC with s states.
Input data: observations of a MC where zeros denote the missing
points.

Likelihood function has s(s − 1) variables. Random choice of directions is very time consuming: cyclically go through all directions. [Corana, Marchesi, Martini & Ridella, 1987]

The magnitude of the likelihood L is very small, e.g. for s = 2 and 36 time points with 16 observations it is around 10⁻⁶. Search for the minimum of −log(L) or of log(−log(L)).

Tailor the initial temperature to the behaviour of the function to be minimized. [Ingrassia, 1992]

Choose carefully the parameters driving the algorithm.


181 / 417
Example


Likelihood function L(x, y ) (x = p12 , y = p21 ) of observations


...0 0 1 0 0 2 2 1 0 0 0 0 1 0 0 2 2 1 0 0 0 0 1 0 0 2 2 1 0 0 0 0 1 0 0 2 2 1 0...
Global maximum at: (0.91455394, 0.58750368).
182 / 417
Test runs

Final points of 100 test runs with random starting points from U([0, 1]²).


Optimal point: ◦.
183 / 417
Example


Likelihood function K (x, y ) (x = p12 , y = p21 ) of observations


...0 0 2 0 0 0 2 0 0 1 2 2 2 2 0 0 1 0 0 2 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0...
Global maximum at: (1, 0.76096759).
184 / 417
Test runs


Final points of 100 test runs with random starting points from U([0, 1]²).


Optimal point: ◦.
185 / 417
Stochastic approximation
Find
arg max_{x∈X} h(x),  h : X ⊂ Rd ↦ R.

Works more directly with the objective function h instead of the


fast exploration of the space X .
Uses an acceptable (random) approximation of h.

Missing data models


Models where the likelihood function can be expressed as
g(x|θ) = ∫_Z f(x, z|θ) dz.
More generally, h(x) is given as
h(x) := E[H(x, Z)].
Z : unobserved quantity.
Typical situation: censoring models.
186 / 417
Example. Censored data likelihood
Data: Y1 , Y2 , . . . , Yn ∼ f (y − θ) i.i.d. F (y ): CDF of f (y ).
Uncensored observations: y = (y1 , y2 , . . . , ym );
Censored: (ym+1 , ym+2 , . . . , yn ), yj = a, j = m+1, m+2, . . . , n.
Observed data likelihood function:
L(θ|y) = (1 − F(a − θ))^{n−m} Π_{j=1}^m f(yj − θ).
Missing observations:
z = (zm+1, zm+2, . . . , zn),  zj > a,  j = m + 1, m + 2, . . . , n.
Complete data likelihood:
L^c(θ|y, z) = Π_{j=1}^m f(yj − θ) · Π_{j=m+1}^n f(zj − θ).
L(θ|y) = E[L^c(θ|y, Z) | y] = ∫_Z L^c(θ|y, z) f(z|y, θ) dz.
f(z|y, θ): conditional PDF of the missing data given the observed data.
187 / 417
Example. Censored normal distribution
Data: Y1 , Y2 , . . . , Yn ∼ N (θ, 1) i.i.d. Censoring at a.


Likelihood function of censored observations from a N(4, 1) sample of size 40 with censoring at a = 4.5 (red); observed-data likelihood (black); likelihood of the original sample (green). Original likelihood functions (left); "equalized" likelihoods (right).
188 / 417
Monte Carlo approximation
Find
arg max_{x∈X} h(x),  where  h(x) = E[H(x, Z)],  Z ∼ f(z|x),  X ⊂ Rd.
Generate Z1, Z2, . . . , Zn ∼ f(z|x) and find the maximum of
ĥ1,n(x) := (1/n) Σ_{j=1}^n H(x, Zj).
ĥ1,n(x) → h(x) as n → ∞ for all fixed x ∈ X.
Problems:
I Optimization algorithms require evaluations of ĥ1,n(x) in many points x, so many samples of size n should be generated.
I The generated sample changes with every value of x, which makes the sequence of evaluations of ĥ1,n(x) unstable. Convergence properties of the optimization algorithm are usually not preserved.
189 / 417
Importance sampling approach
Find arg max_{x∈X} h(x), X ⊂ Rd, where h(x) = E[H(x, Z)].
Geyer [1996]: Generate Y1, Y2, . . . , Yn ∼ g(y) and find the maximum of
ĥ2,n(x) := (1/n) Σ_{j=1}^n H(x, Yj) f(Yj|x)/g(Yj),  x = (x1, x2, . . . , xd)ᵀ.
g: instrumental distribution, not depending on x.
For example: g(y) = f(y|x0).
h(x) = E[H(x, Y) f(Y|x)/g(Y)],  Y ∼ g(y).
Under some smoothness conditions:
∂h(x)/∂xk = E[ (∂H(x, Y)/∂xk) · f(Y|x)/g(Y) + H(x, Y) · (∂f(Y|x)/∂xk) · 1/g(Y) ],
∂ĥ2,n(x)/∂xk = (1/n) Σ_{j=1}^n [ (∂H(x, Yj)/∂xk) · f(Yj|x)/g(Yj) + H(x, Yj) · (∂f(Yj|x)/∂xk) · 1/g(Yj) ].
Derivatives of ĥ2,n(x) approximate derivatives of h(x).
190 / 417
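A small numerical check of ĥ2,n in Python, with hypothetical choices H(x, z) = z, f(·|x) = N(x, 1) and instrumental g = N(0, 1), so that h(x) = x and one sample can be reused for every x:

```python
import math
import random

def norm_pdf(z, m=0.0):
    """Standard-form N(m, 1) density."""
    return math.exp(-(z - m) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

def h2n(x, H, f, g, y):
    """Importance-sampling approximation (1/n) sum H(x,Y_j) f(Y_j|x)/g(Y_j)."""
    return sum(H(x, yj) * f(yj, x) / g(yj) for yj in y) / len(y)

random.seed(7)
y = [random.gauss(0.0, 1.0) for _ in range(20000)]       # one Y sample for all x
val = h2n(1.0, lambda x, z: z, norm_pdf, norm_pdf, y)    # estimates h(1) = 1
```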
Drawbacks
Maximize
ĥ2,n(x) := (1/n) Σ_{j=1}^n H(x, Yj) f(Yj|x)/g(Yj)   instead of   h(x) := ∫ H(x, z) f(z|x) dz.

1. ĥ2,n(x) is often far less smooth than h(x). Problems might arise with regularity, unimodality may vanish, etc. It is more difficult to optimize.
2. The choice of g is critical, since a function should be approximated. The convergence is not uniform and might be slow, e.g. if g(y) = f(y|x0) and x0 is far from the true maximum x∗.
3. The number of generated sample values Yj should vary with x to achieve the same precision in the approximation of h(x). Usually impossible.

191 / 417
Example. Probit regression
Distribution of Y ∈ {0, 1} given a covariate X is
P(Y = 1|X = x) = 1 − P(Y = 0|X = x) = Φ(θ0 + θ1 x).
Find the posterior mode of θ0 .
Sample: (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ).
Marginal posterior mode given a flat prior on θ1: arg max_{θ0} h(θ0), where
h(θ0) := ∫ Π_{j=1}^n Φ(θ0 + θ1 xj)^{yj} Φ(−θ0 − θ1 xj)^{1−yj} dθ1.

Real data: Pima Indians Diabetes dataset Pima.tr (R library MASS), 200 cases. This tribe has the highest prevalence of type 2 diabetes in the world.
X : body mass index; Y : presence of diabetes.

Earlier study:
estimate of θ1 is µ = 0.1052 with standard error σ = 0.0296.
192 / 417
Example. Probit regression
Find the maximum of
h(θ0) := ∫ Π_{j=1}^n Φ(θ0 + θ1 xj)^{yj} Φ(−θ0 − θ1 xj)^{1−yj} dθ1.
Monte Carlo approximation of h(θ0) via importance sampling with instrumental distribution T(5, µ, σ²).
θ1,1, θ1,2, . . . , θ1,M: sample from T(5, µ, σ²) with pdf f5(θ; µ, σ²).
ĥ(θ0) := (1/M) Σ_{m=1}^M [ Π_{j=1}^n Φ(θ0 + θ1,m xj)^{yj} Φ(−θ0 − θ1,m xj)^{1−yj} ] / f5(θ1,m; µ, σ²).

Two approaches:
I Different t sample for each value of θ0 .
I One single t sample for all values of θ0 .
193 / 417
Example. Probit regression


Approximations of h(θ0 ) based on 1000 simulations from T (5, 0.1, 0.1) distri-
bution. Range of 100 replications when a different t sample is used for each
value of θ0 and a single approximation (top); when one t sample is used for all
values of θ0 and a single approximation (bottom).
194 / 417
Example. Probit regression
Approximate
h(θ0) := ∫ Π_{j=1}^n Φ(θ0 + θ1 xj)^{yj} Φ(−θ0 − θ1 xj)^{1−yj} dθ1.

Approximations of h(θ0 ) based on 1000 simulations from T (5, 0.1, 0.1) distri-
bution. Mean of 100 replications when a different t sample is used for each
value of θ0 (red) and when one t sample is used for all values of θ0 (blue).
195 / 417
Monte Carlo maximization
Find
arg max_{x∈X} h(x),  where  h(x) = E[H(x, Z)],  Z ∼ f(z|x),  X ⊂ Rd.
Algorithm
0. Choose x1 ∈ X and an error ε.
At the ith step
1. Generate y1, y2, . . . , yn ∼ f(z|xi) and calculate
   ĥi(x) := (1/n) Σ_{j=1}^n H(x, yj) f(yj|x)/f(yj|xi);
2. Find x∗ = arg max ĥi(x);
3. Update xi to xi+1 = x∗;
4. Repeat until e.g. ||xi+1 − xi|| < ε.
196 / 417
Robbins-Monro algorithm
Solve [Robbins & Monro, 1951]:
 
h(x) = β,  x ∈ X,  where h(x) := E[H(x, Z)].
Generate a Markov chain:
Xj+1 = Xj + γj(β − Zj),  Zj ∼ f(z|Xj).
f(z|x): conditional PDF of Z given X.

Kiefer-Wolfowitz algorithm
Find [Kiefer & Wolfowitz, 1952]:
 
arg max_{x∈X} h(x),  where h(x) := E[H(x, Z)].
Generate a Markov chain:
Xj+1 = Xj + γj (Z2j − Z2j−1)/αj,  Z2j−1 ∼ f(z|Xj − αj),  Z2j ∼ f(z|Xj + αj).
197 / 417
Convergence of the Robbins-Monro algorithm

Solve h(x) = β, x ∈ X, where h(x) := E[H(x, Z)].
Xj+1 = Xj + γj(β − Zj),  Zj ∼ f(z|Xj).
Theorem. Assume (γj) is a sequence of positive numbers such that
Σ_{j=1}^∞ γj = ∞,   Σ_{j=1}^∞ γj² < ∞,
and for the simulated values Zj we have
E[H(Xj, Zj) | Xj] = h(Xj)  and  |Zj| < B a.s.
for a fixed bound B. If there exists x∗ ∈ X such that
inf_{δ≤|x−x∗|≤1/δ} (x − x∗)(h(x) − β) > 0
for every 0 < δ < 1, then the sequence (Xj) converges to x∗ almost surely.
198 / 417
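A Python sketch of the Robbins-Monro iteration with γj = 1/j (which satisfies the step-size conditions of the theorem); the target h(x) = x and the uniform noise are illustrative:

```python
import random

def robbins_monro(sample_z, beta, x0=0.0, n_iter=5000):
    """Robbins-Monro: X_{j+1} = X_j + gamma_j (beta - Z_j) with gamma_j = 1/j,
    where Z_j ~ f(z|X_j) and E[Z_j | X_j] = h(X_j)."""
    x = x0
    for j in range(1, n_iter + 1):
        z = sample_z(x)
        x = x + (1.0 / j) * (beta - z)
    return x

# Solve h(x) = 2 for h(x) = x, observed through U(-1, 1) noise.
random.seed(6)
root = robbins_monro(lambda x: x + random.uniform(-1.0, 1.0), beta=2.0)
```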
The EM algorithm
EM: expectation-maximization. A deterministic optimization technique. [Dempster, Laird & Rubin, 1977].
X = (X1 , X2 , . . . , Xn ): a sample from joint PDF g (x|θ), where
g(x|θ) = ∫_Z f(x, z|θ) dz.
Z ∈ Z: unobserved data-vector.
Find the maximum-likelihood estimator of θ, that is
arg max_θ ℓ(θ|x),  where ℓ(θ|x) := log L(θ|x) and L(θ|x) = g(x|θ).

Complete data log-likelihood:
ℓ^C(θ|x, z) := log L^C(θ|x, z),  where L^C(θ|x, z) = f(x, z|θ).
Given X ∼ g(x|θ0) for a certain θ0, the best mean squares approximation of ℓ^C(θ|X, Z) is E_{θ0}[ℓ^C(θ|X, Z) | X].
199 / 417
The EM algorithm
Expected log-likelihood:
Q(θ|θ0, x) := E_{θ0}[ℓ^C(θ|X, Z) | X = x].
Maximize Q(θ|θ0, x) in θ, replace θ0 with this maximum, and update the expected log-likelihood.
Algorithm
0. Find a starting value θ̂(0) and an error ε.
At the jth step
1. E-step: Compute Q(θ | θ̂(j), x) := E_{θ̂(j)}[ℓ^C(θ|X, Z) | X = x];
2. M-step: Maximize Q(θ | θ̂(j), x) and take θ̂(j+1) := arg max_θ Q(θ | θ̂(j), x);
3. Repeat until e.g. ||θ̂(j+1) − θ̂(j)|| < ε.


200 / 417
Monotonicity of the EM algorithm
Theorem. [Dempster, Laird & Rubin, 1977] The EM sequence (θ̂(j)) satisfies
L(θ̂(j+1)|x) ≥ L(θ̂(j)|x),  j = 0, 1, 2, . . . ,
and equality holds if and only if Q(θ̂(j+1) | θ̂(j), x) = Q(θ̂(j) | θ̂(j), x).
Proof. Conditional density of Z given X: k(z|θ, x) = f(x, z|θ)/g(x|θ).
Log-likelihood:
ℓ(θ|x) = ℓ^C(θ|x, z) − log k(z|θ, x),  so  ℓ(θ|x) = Q(θ | θ̂(j), x) − H(θ | θ̂(j), x)
with H(θ | θ̂(j), x) := E_{θ̂(j)}[log k(Z|θ, X) | X = x]. Consider the difference
ℓ(θ̂(j+1)|x) − ℓ(θ̂(j)|x) = [ Q(θ̂(j+1) | θ̂(j), x) − Q(θ̂(j) | θ̂(j), x) ]
  − [ H(θ̂(j+1) | θ̂(j), x) − H(θ̂(j) | θ̂(j), x) ].
For all θ we have, by Jensen's inequality,
H(θ | θ̂(j), x) − H(θ̂(j) | θ̂(j), x) = E_{θ̂(j)}[ log( k(Z|θ, X)/k(Z|θ̂(j), X) ) | X = x ]
  ≤ log E_{θ̂(j)}[ k(Z|θ, X)/k(Z|θ̂(j), X) | X = x ] = log ∫_Z k(z|θ, x) dz = 0.  □
201 / 417
Convergence
Monotonicity result
L(θ̂(j+1)|x) ≥ L(θ̂(j)|x)
does not imply convergence
lim_{j→∞} θ̂(j) = θ∗,  where θ∗ := arg max_θ ℓ(θ|x).
Theorem. If the expected complete data log-likelihood Q(θ|θ0, x) is continuous in both θ and θ0, then every limit point of an EM sequence (θ̂(j)) is a stationary point of L(θ|x), and L(θ̂(j)|x) converges monotonically to L(θ̂|x) for some stationary point θ̂.
Convergence is only guaranteed to a stationary point.
Wu [1983]: theorem about the convergence to a local maximum. The assumptions are difficult to check.
202 / 417
EM for truncated normal mixture
Mixture PDF:
p(x | ω, µ, σ²) := Σ_{k=1}^M ωk g(x | µk, σ²)
with ω = (ω1, ω2, . . . , ωM), µ = (µ1, µ2, . . . , µM).
g(x | µ, σ²): PDF of the truncated normal distribution with lower cut-off at 0. Notation: N0(µ, σ²).
g(x | µ, σ²) := (1/σ) ϕ((x − µ)/σ) / Φ(µ/σ),  x ≥ 0, and 0 otherwise.
Mean κ and variance ϱ²:
κ = µ + σ ϕ(µ/σ)/Φ(µ/σ)   and   ϱ² = σ² [ 1 − (µ/σ) ϕ(µ/σ)/Φ(µ/σ) − (ϕ(µ/σ)/Φ(µ/σ))² ].
Aim: find the MLE of ω, µ and σ² based on a sample from the mixture distribution p. [Lee & Scott, 2012]
203 / 417
ture distribution p. [Lee & Scott, 2012] 203 / 417
Log-likelihood function
x = (x1, x2, . . . , xn): realization of a sample X = (X1, X2, . . . , Xn) from the mixture distribution p.
Log-likelihood:
ℓ(ω, µ, σ² | x) = Σ_{j=1}^n log [ Σ_{k=1}^M ωk g(xj | µk, σ²) ].
Unobserved quantities: z = (zk,j), k = 1, 2, . . . , M, j = 1, 2, . . . , n.
Definition: zk,j = 1 if xj is from g(x | µk, σ²).
zj = (z1,j, z2,j, . . . , zM,j)ᵀ has one entry 1, all other entries are 0.
Σ_{j=1}^n Σ_{k=1}^M zk,j = n.
If z was observed, the MLE of ω would have components
ω̂k = (1/n) Σ_{j=1}^n zk,j,  k = 1, 2, . . . , M.
204 / 417
Complete data log-likelihood
Complete data log-likelihood:
ℓ^C(ω, µ, σ² | x, z) = Σ_{j=1}^n Σ_{k=1}^M zk,j [ log(ωk) + log g(xj | µk, σ²) ].
zk,j is a realization of a Bernoulli random variable Zk,j:
E[Zk,j | X] = P(Zk,j = 1 | X) = P(Zk,j = 1 | Xj) = ωk g(Xj | µk, σ²) / Σ_{i=1}^M ωi g(Xj | µi, σ²).
E-step: Given ωk^(r), µk^(r), k = 1, 2, . . . , M, and σ^(r):
zk,j^(r+1) := ωk^(r) g(xj | µk^(r), σ^{2(r)}) / Σ_{i=1}^M ωi^(r) g(xj | µi^(r), σ^{2(r)}).
M-step:
ωk^(r+1) = (1/n) Σ_{j=1}^n zk,j^(r+1).
Find the maximum of ℓ^C(ω^(r+1), µ, σ² | x, z^(r+1)) in µ and σ.
205 / 417
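The E-step weights z_{k,j} in Python; for brevity a plain (untruncated) normal pdf stands in for the truncated density g of the slides, so the sketch illustrates the weighting mechanism only:

```python
import math

def norm_pdf(x, mu, sigma2):
    """N(mu, sigma2) density, used here in place of the truncated g."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def e_step(x, omega, mu, sigma2, g=norm_pdf):
    """Weights z_{k,j} = omega_k g(x_j|mu_k,s2) / sum_i omega_i g(x_j|mu_i,s2);
    returns one row [z_{1,j}, ..., z_{M,j}] per observation x_j."""
    z = []
    for xj in x:
        w = [ok * g(xj, mk, sigma2) for ok, mk in zip(omega, mu)]
        s = sum(w)
        z.append([wi / s for wi in w])
    return z

z = e_step([0.0, 5.0], omega=[0.5, 0.5], mu=[0.0, 5.0], sigma2=1.0)
```

Each row sums to one, and observations close to a component mean get weight near one for that component.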
Maximization
Complete data log-likelihood:
ℓ^C(ω, µ, σ² | x, z) = Σ_{j=1}^n Σ_{k=1}^M zk,j [ log(ωk) − (1/2) log(2π) − (1/2) log σ²
  − (xj − µk)²/(2σ²) − log Φ(µk/σ) ].
Equations
∂ℓ^C(ω, µ, σ² | x, z)/∂µk = 0   and   ∂ℓ^C(ω, µ, σ² | x, z)/∂σ² = 0
result in iteration steps for µ and σ²:
µk^(r+1) := ( Σ_{j=1}^n zk,j^(r+1) )^{−1} Σ_{j=1}^n zk,j^(r+1) ( xj − σ^(r) ϕ(µk^(r)/σ^(r)) / Φ(µk^(r)/σ^(r)) ),
σ^{2(r+1)} := (1/n) Σ_{j=1}^n Σ_{k=1}^M zk,j^(r+1) [ (xj − µk^(r+1))² + σ^(r) µk^(r+1) ϕ(µk^(r+1)/σ^(r)) / Φ(µk^(r+1)/σ^(r)) ].
206 / 417
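The E- and M-steps can be turned into code directly. The following pure-Python sketch is my own illustration (function names, the rejection sampler and the single-component sanity check are not from the slides); it keeps a common σ across components, as in the derivation:

```python
import math
import random

def phi(u):
    """Standard normal PDF."""
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def Phi(u):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

def g(x, mu, sigma):
    """PDF of N0(mu, sigma^2), the normal truncated from below at 0."""
    return phi((x - mu) / sigma) / (sigma * Phi(mu / sigma)) if x >= 0 else 0.0

def em_step(xs, w, mu, sigma):
    """One EM iteration for an M-component truncated-normal mixture (common sigma)."""
    n, M = len(xs), len(w)
    # E-step: responsibilities z[k][j] = P(component k | x_j)
    z = [[0.0] * n for _ in range(M)]
    for j, x in enumerate(xs):
        dens = [w[k] * g(x, mu[k], sigma) for k in range(M)]
        s = sum(dens)
        for k in range(M):
            z[k][j] = dens[k] / s
    # M-step: weights, then means with the -sigma*phi/Phi correction, then sigma^2
    new_w = [sum(z[k]) / n for k in range(M)]
    corr = [phi(mu[k] / sigma) / Phi(mu[k] / sigma) for k in range(M)]
    new_mu = [sum(z[k][j] * xs[j] for j in range(n)) / (n * new_w[k]) - sigma * corr[k]
              for k in range(M)]
    corr2 = [phi(new_mu[k] / sigma) / Phi(new_mu[k] / sigma) for k in range(M)]
    new_var = sum(z[k][j] * ((xs[j] - new_mu[k]) ** 2 + sigma * new_mu[k] * corr2[k])
                  for k in range(M) for j in range(n)) / n
    return new_w, new_mu, math.sqrt(new_var)

# Single-component sanity check: data from N0(1, 2^2), drawn by rejection.
random.seed(0)
xs = []
while len(xs) < 2000:
    v = random.gauss(1.0, 2.0)
    if v >= 0:
        xs.append(v)
w, mu, sigma = [1.0], [0.5], 1.0
for _ in range(200):
    w, mu, sigma = em_step(xs, w, mu, sigma)
```

With M = 1 the E-step is trivial and the iteration reduces to the fixed-point equations for the truncated-normal MLE, so the estimates should end up near the true (µ, σ) = (1, 2).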
EM for censored normal data
Data: Y₁, Y₂, . . . , Yₙ ∼ N(θ, 1) censored at a.
Observed: y = (y₁, y₂, . . . , y_m); unobserved: z = (z_{m+1}, . . . , zₙ).
Complete data likelihood:

L^c(θ|y, z) ∝ Π_{j=1}^m exp( −(y_j − θ)²/2 ) Π_{j=m+1}^n exp( −(z_j − θ)²/2 ).

Expected complete data log-likelihood:

Q(θ|θ₀, y) = −(1/2) Σ_{j=1}^m (y_j − θ)² − (1/2) Σ_{j=m+1}^n E_{θ₀}[(Z_j − θ)²],   Z_j ∼ N^a(θ₀, 1).

N^a(θ, 1): normal N(θ, 1) truncated from below at a (the unobserved values lie beyond the censoring point).
Maximum of Q(θ|θ₀, y) at: θ̂ = ( m ȳ + (n − m) E_{θ₀}[Z₁] ) / n.
EM sequence:

θ̂_{(j+1)} = (m/n) ȳ + ((n − m)/n) [ θ̂_{(j)} + ϕ(a − θ̂_{(j)}) / (1 − Φ(a − θ̂_{(j)})) ].
207 / 417
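The EM sequence above is a one-line fixed-point iteration; here is a minimal sketch (my own illustration, using the slides' example numbers n = 40, m = 23, a = 4.5, ȳ = 3.4430):

```python
import math

def phi(u):
    """Standard normal PDF."""
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def Phi(u):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

def em_censored(ybar, m, n, a, theta0, iters=100):
    """EM for N(theta, 1) data censored at a:
    theta <- (m/n)*ybar + ((n-m)/n)*(theta + phi(a-theta)/(1-Phi(a-theta)))."""
    theta = theta0
    for _ in range(iters):
        theta = (m / n) * ybar + ((n - m) / n) * (theta + phi(a - theta) / (1 - Phi(a - theta)))
    return theta

theta_hat = em_censored(ybar=3.4430, m=23, n=40, a=4.5, theta0=3.0)
# Self-consistency: theta_hat should be a fixed point of the EM map.
residual = theta_hat - ((23 / 40) * 3.4430
                        + (17 / 40) * (theta_hat + phi(4.5 - theta_hat)
                                       / (1 - Phi(4.5 - theta_hat))))
```

The map is a contraction here, so the sequence settles quickly at the fixed point near the maximizer of the observed-data log-likelihood.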
EM for censored normal data
Observed data log-likelihood function:

ℓ(θ|y) = (n − m) log(1 − Φ(a − θ)) − (1/2) Σ_{i=1}^m (y_i − θ)² − (1/2) log(2π).

[Figure: observed log-likelihood plotted against θ ∈ [3.0, 5.0].]
8 EM sequences (dotted lines) starting from random points for censored normal N(4, 1) likelihood with n = 40, m = 23, a = 4.5, ȳ = 3.4430 and the observed data log-likelihood (orange).
208 / 417
Monte Carlo EM
Expected log-likelihood:

Q(θ|θ₀, x) := E_{θ₀}[ ℓ^C(θ|X, Z) | X = x ].

Approximation via Z₁, Z₂, . . . , Zₙ ∼ k(z|θ₀, x) [Wei & Tanner, 1990]:

Q̂(θ|θ₀, x) := (1/n) Σ_{j=1}^n ℓ^C(θ|x, Z_j).

At the jth step simulate from k(z|θ̂_{(j)}, x).
Importance sampling [Geyer & Thompson, 1992]:

L(θ|x)/L(θ_{(0)}|x) = g(x|θ)/g(x|θ_{(0)}) = E_{θ_{(0)}}[ f(X, Z|θ)/f(X, Z|θ_{(0)}) | X = x ]
                   ≈ (1/n) Σ_{j=1}^n L^C(θ|x, Z_j)/L^C(θ_{(0)}|x, Z_j),

where Z₁, Z₂, . . . , Zₙ ∼ k(z|θ_{(0)}, x).
Convergence results: e.g. Neath [2012].
209 / 417
Toy example. Genetic linkage
Rao [1965], Dempster, Laird & Rubin [1977]:
Observations (x₁, x₂, x₃, x₄) from multinomial distribution

M( n; 1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ),   0 ≤ θ ≤ 1.

Likelihood:

L(θ|x₁, x₂, x₃, x₄) = n!/(x₁! x₂! x₃! x₄! 4ⁿ) (2 + θ)^{x₁} (1 − θ)^{x₂+x₃} θ^{x₄}.

Unobserved variables: z₁, z₂, where x₁ = z₁ + z₂ and (z₁, z₂, x₂, x₃, x₄) is an observation on

(Z₁, Z₂, X₂, X₃, X₄) ∼ M( n; 1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ).

Complete data likelihood:

L^C(θ|z₁, z₂, x₂, x₃, x₄) = n!/(z₁! z₂! x₂! x₃! x₄! 4ⁿ) 2^{z₁} θ^{z₂+x₄} (1 − θ)^{x₂+x₃}.

Complete data log-likelihood:

ℓ^C(θ|z₁, z₂, x₂, x₃, x₄) = (z₂ + x₄) log(θ) + (x₂ + x₃) log(1 − θ) + const.
210 / 417
Toy example. Genetic linkage
Complete data log-likelihood:

ℓ^C(θ|z₁, z₂, x₂, x₃, x₄) = (z₂ + x₄) log(θ) + (x₂ + x₃) log(1 − θ) + const.

Maximum point θ* = (z₂ + x₄)/(z₂ + x₂ + x₃ + x₄).

(Z₁, Z₂, X₂, X₃, X₄) ∼ M( n; 1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ),   Z₁ + Z₂ = X₁.

Given θ = θ_{(0)}:

Z₂|X₁ ∼ Bin( X₁, θ_{(0)}/(2 + θ_{(0)}) )   so   E[Z₂|X₁] = θ_{(0)} X₁/(2 + θ_{(0)}).

E step: replace z₂ in ℓ^C by z̃₂ := E[Z₂|X₁ = x₁] = θ_{(0)} x₁/(2 + θ_{(0)}).
M step: maximize ℓ^C(θ|z₁, z̃₂, x₂, x₃, x₄).
Iteration:

θ_{(k+1)} = ( θ_{(k)} x₁/(2 + θ_{(k)}) + x₄ ) / ( θ_{(k)} x₁/(2 + θ_{(k)}) + x₂ + x₃ + x₄ ).
211 / 417
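The E- and M-steps above take only a few lines; a minimal sketch (my own illustration, not from the slides):

```python
def em_linkage(x, theta0=0.5, iters=100):
    """EM for the genetic linkage model: alternate the E-step
    z2 = theta*x1/(2+theta) with the M-step theta = (z2+x4)/(z2+x2+x3+x4)."""
    x1, x2, x3, x4 = x
    theta = theta0
    for _ in range(iters):
        z2 = theta * x1 / (2 + theta)              # E-step: E[Z2 | X1 = x1]
        theta = (z2 + x4) / (z2 + x2 + x3 + x4)    # M-step
    return theta

# Dempster, Laird & Rubin's data (125, 18, 20, 34); the MLE is about 0.6268.
theta_hat = em_linkage((125, 18, 20, 34))
```

The limit does not depend on the starting point: the score equation reduces to the quadratic 197θ² − 15θ − 68 = 0, whose root in (0, 1) is θ̂ ≈ 0.6268.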
Genetic linkage. Monte Carlo version
Replace

E[Z₂|X₁ = x₁] = θ_{(0)} x₁/(2 + θ_{(0)})   with   Ȳ = (1/n) Σ_{j=1}^n Y_j,

where

Y₁, Y₂, . . . , Yₙ ∼ Bin( x₁, θ_{(0)}/(2 + θ_{(0)}) )   or   nȲ ∼ Bin( n x₁, θ_{(0)}/(2 + θ_{(0)}) ).

Iteration:

θ̃_{(k+1)} = ( Ȳ^{(k)} + x₄ ) / ( Ȳ^{(k)} + x₂ + x₃ + x₄ ),   nȲ^{(k)} ∼ Bin( n x₁, θ_{(k)}/(2 + θ_{(k)}) ).

SLLN: θ̃_{(k+1)} → θ_{(k+1)} a.s. as n → ∞.
212 / 417
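The Monte Carlo version replaces the exact conditional expectation by a binomial average; a sketch (my own illustration; names and the default n are assumptions):

```python
import random

def mcem_linkage(x, theta0=0.5, iters=30, n=500, seed=42):
    """Monte Carlo EM for the genetic linkage model: E[Z2|x1] is replaced by
    Ybar = nY/n with nY ~ Bin(n*x1, theta/(2+theta))."""
    rng = random.Random(seed)
    x1, x2, x3, x4 = x
    theta = theta0
    for _ in range(iters):
        p = theta / (2 + theta)
        nY = sum(1 for _ in range(n * x1) if rng.random() < p)  # nY ~ Bin(n*x1, p)
        ybar = nY / n
        theta = (ybar + x4) / (ybar + x2 + x3 + x4)
    return theta

theta_tilde = mcem_linkage((125, 18, 20, 34))
```

For moderate n the sequence fluctuates in a narrow band around the deterministic EM limit θ̂ ≈ 0.6268, which is exactly what the MCEM figures on the following slide show.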
EM for genetic linkage
Log-likelihood:

ℓ(θ|x₁, x₂, x₃, x₄) = x₁ log(2 + θ) + (x₂ + x₃) log(1 − θ) + x₄ log(θ) + const.

Observations [Dempster, Laird & Rubin, 1977]: (125, 18, 20, 34).

[Figure: log-likelihood plotted against θ ∈ [0, 1].]
8 EM sequences (dotted lines) starting from random points of [0, 1] and the log-likelihood function (orange).
213 / 417
MCEM for genetic linkage
Log-likelihood:

ℓ(θ|x₁, x₂, x₃, x₄) = x₁ log(2 + θ) + (x₂ + x₃) log(1 − θ) + x₄ log(θ) + const.

Observations [Dempster, Laird & Rubin, 1977]: (125, 18, 20, 34).

[Figure: two panels of MCEM sequences over 6 iterations.]
EM sequence (orange) starting from 0.1 and range of 500 MCEM sequences for n = 10 (left) and n = 100 (right).
214 / 417
EM standard errors
Under some regularity conditions the asymptotic variance of the maximum likelihood estimator θ̂ of θ* ∈ R^d is [F(θ*)]^{−1}.
F(θ): Fisher information matrix on θ corresponding to X ∼ g(x|θ).

F(θ) := E_θ[ ∂ℓ(θ|X)/∂θ · ∂ℓ(θ|X)/∂θᵀ ] = −E_θ[ ∂²ℓ(θ|X)/∂θ∂θᵀ ].

Approximation of the information matrix:

F(θ*) ≈ − ∂²ℓ(θ|x)/∂θ∂θᵀ |_{θ=θ̂}.

Oakes [1999]:

∂²ℓ(θ|x)/∂θ∂θᵀ = [ ∂²Q(θ'|θ, x)/∂θ'∂θ'ᵀ + ∂²Q(θ'|θ, x)/∂θ'∂θᵀ ] |_{θ'=θ}.

The expression depends only on the observed data. Provides an approximation for the variance of the ML estimator θ̂.
215 / 417
Example. Genetic linkage
Log-likelihood for x = (x₁, x₂, x₃, x₄):

ℓ(θ|x) = x₁ log(2 + θ) + (x₂ + x₃) log(1 − θ) + x₄ log(θ) + const.

Second partial derivative:

∂²ℓ(θ|x)/∂θ² = − x₁/(2 + θ)² − (x₂ + x₃)/(1 − θ)² − x₄/θ².

Expected log-likelihood:

Q(θ'|θ, x) = ( θx₁/(2 + θ) ) log(θ') + (x₂ + x₃) log(1 − θ') + x₄ log(θ') + const.

Partial derivatives:

∂²Q(θ'|θ, x)/∂θ'² |_{θ'=θ} = − ( θx₁/(2 + θ) ) (1/θ²) − (x₂ + x₃)/(1 − θ)² − x₄/θ²,

∂²Q(θ'|θ, x)/∂θ'∂θ |_{θ'=θ} = ( 2x₁/(2 + θ)² ) (1/θ).
216 / 417
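Oakes' identity can be verified numerically for this example, and it yields an approximate standard error at the MLE; a small sketch (my own illustration, not from the slides):

```python
import math

# Dempster, Laird & Rubin's genetic linkage data.
x1, x2, x3, x4 = 125, 18, 20, 34

def d2_loglik(t):
    """Second derivative of the observed data log-likelihood."""
    return -x1 / (2 + t) ** 2 - (x2 + x3) / (1 - t) ** 2 - x4 / t ** 2

def d2Q_own(t):
    """d^2 Q(t'|t, x)/dt'^2 evaluated at t' = t."""
    return -(t * x1 / (2 + t)) / t ** 2 - (x2 + x3) / (1 - t) ** 2 - x4 / t ** 2

def d2Q_mixed(t):
    """d^2 Q(t'|t, x)/dt' dt evaluated at t' = t."""
    return 2 * x1 / ((2 + t) ** 2 * t)

# Approximate standard error of the MLE (theta_hat ~ 0.6268).
theta_hat = 0.6268
se = 1 / math.sqrt(-d2_loglik(theta_hat))
```

By Oakes' identity, d2Q_own(t) + d2Q_mixed(t) equals d2_loglik(t) for every t; the two x₁-terms combine to −x₁/(2 + θ)². The resulting standard error is about 0.051.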
MCEM standard errors
Expected log-likelihood: Q(θ'|θ, x) := E_θ[ ℓ^C(θ'|X, Z) | X = x ].
Oakes' identity (d = 1):

∂²ℓ(θ|x)/∂θ² = [ ∂²Q(θ'|θ, x)/∂θ'² + ∂²Q(θ'|θ, x)/∂θ'∂θ ] |_{θ'=θ}.

Reformulation:

∂²ℓ(θ|x)/∂θ² = E_θ[ ∂²ℓ^C(θ|X, Z)/∂θ² | X = x ]
             + E_θ[ ( ∂ℓ^C(θ|X, Z)/∂θ )² | X = x ] − ( E_θ[ ∂ℓ^C(θ|X, Z)/∂θ | X = x ] )²
             = E_θ[ ∂²ℓ^C(θ|X, Z)/∂θ² | X = x ] + Var_θ[ ∂ℓ^C(θ|X, Z)/∂θ | X = x ].
217 / 417
Approximation
Z₁, Z₂, . . . , Zₙ ∼ k(z|θ, x): sample from the missing data PDF.
Have already been used in approximation:

Q(θ'|θ, x) ≈ Q̂(θ'|θ, x) := (1/n) Σ_{j=1}^n ℓ^C(θ'|x, Z_j).

Estimated variance:

Var(θ̂) ≈ [ − ∂²ℓ(θ|x)/∂θ² |_{θ=θ̂} ]^{−1},

where

∂²ℓ(θ|x)/∂θ² ≈ (1/n) Σ_{j=1}^n ∂²ℓ^C(θ|x, Z_j)/∂θ²
             + (1/n) Σ_{j=1}^n [ ∂ℓ^C(θ|x, Z_j)/∂θ − (1/n) Σ_{i=1}^n ∂ℓ^C(θ|x, Z_i)/∂θ ]².
218 / 417
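For the genetic linkage example this Monte Carlo variance approximation is easy to carry out, since the missing data satisfy Z₂|x₁ ∼ Bin(x₁, θ/(2+θ)). A sketch (my own illustration; variable names and the sample size are assumptions):

```python
import math
import random

random.seed(7)
x1, x2, x3, x4 = 125, 18, 20, 34
theta = 0.6268                     # (approximate) MLE
p = theta / (2 + theta)
n = 20000
# Simulate Z2 ~ Bin(x1, p) n times.
z2s = [sum(1 for _ in range(x1) if random.random() < p) for _ in range(n)]

# Score and second derivative of the complete data log-likelihood
# l^C(theta) = (z2+x4)*log(theta) + (x2+x3)*log(1-theta) + const.
d1 = [(z2 + x4) / theta - (x2 + x3) / (1 - theta) for z2 in z2s]
d2 = [-(z2 + x4) / theta ** 2 - (x2 + x3) / (1 - theta) ** 2 for z2 in z2s]
m1 = sum(d1) / n
est_d2l = sum(d2) / n + sum((v - m1) ** 2 for v in d1) / n
se_hat = 1 / math.sqrt(-est_d2l)
```

The Monte Carlo estimate of ∂²ℓ/∂θ² agrees with the exact value near −377, giving a standard error of about 0.051, matching the analytic computation from the Oakes identity.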
Basic definitions
Definition. Let X be a nonempty set. A transition kernel is a function K defined on X × B(X) such that
I For all x ∈ X, K(x, ·) is a probability measure.
I For all A ∈ B(X), K(·, A) is measurable.

Definition. Given a sequence of transition kernels Kn, n ∈ N, a sequence X0, X1, . . . , Xn, . . . is a Markov chain if for all A ∈ B(X), n ∈ N and x0, x1, . . . , x_{n−1} ∈ X we have

P(Xn ∈ A|X0 = x0, X1 = x1, . . . , X_{n−1} = x_{n−1}) = P(Xn ∈ A|X_{n−1} = x_{n−1}) = ∫_A Kn(x_{n−1}, dx) = Kn(x_{n−1}, A).

Remark. Let A1, A2, . . . , An ∈ B(X) and x0 ∈ X. Then

P( (X1, . . . , Xn) ∈ A1 × . . . × An | X0 = x0 )
  = ∫_{A_{n−1}} . . . ∫_{A_1} K1(x0, dx1) . . . K_{n−1}(x_{n−2}, dx_{n−1}) Kn(x_{n−1}, An).

Distribution µ of X0: initial distribution.
220 / 417


Homogeneous Markov chains
Definition. A Markov chain (Xn) is called homogeneous if for all n ∈ N, x ∈ X and A ∈ B(X) we have P(X_{n+1} ∈ A|Xn = x) = P(X1 ∈ A|X0 = x), that is Kn ≡ K and K(x, A) = P(X1 ∈ A|X0 = x).
κ(x, y): transition kernel density:

P(X1 ∈ A|X0 = x) = ∫_A κ(x, y)dy = ∫_A K(x, dy).

Definition. The n-step transition kernel of a homogeneous Markov chain is defined as K⁰(x, A) := I{x∈A}, while for n ≥ 1

K^n(x, A) := ∫_X K(x, dy) K^{n−1}(y, A),   x ∈ X, A ∈ B(X).

Theorem. (Chapman–Kolmogorov equations) For any m with 0 ≤ m ≤ n we have

K^n(x, A) = ∫_X K^m(x, dy) K^{n−m}(y, A),   x ∈ X, A ∈ B(X).
221 / 417
Occupation, return and stopping times
Definition. For any set A ∈ B(X) the occupation time ηA is the number of visits by (Xn) to A after time zero and is given by

ηA := Σ_{n=1}^∞ I{Xn∈A}.

For any set A ∈ B(X) the variable

τA := inf{n ≥ 1 : Xn ∈ A}

is called the first return time on A. τA := ∞ if Xn ∉ A for every n.
Px, Ex: conditional probability and expectation given X0 = x.
Ex[ηA]: the average number of passages in A.
Px(τA < ∞): probability of return to A in a finite number of steps.
Definition. A random variable ζ ∈ N ∪ {∞} is a stopping time on (Xn) if for all n ∈ N the event {ζ = n} is measurable with respect to σ{X0, X1, . . . , Xn}.
Remark. τA is a stopping time.
222 / 417
Weak and strong Markov property
Theorem. (Weak Markov property) Let (Xn) be a Markov chain and h be a bounded and measurable function. Then for all n ∈ N and x0, x1, . . . , xn ∈ X we have

E[ h(X_{n+1}, X_{n+2}, . . .) | Xn = xn, X_{n−1} = x_{n−1}, . . . , X0 = x0 ] = E[ h(X_{n+1}, X_{n+2}, . . .) | Xn = xn ].

Theorem. (Strong Markov property) Let (Xn) be a Markov chain, h be a bounded and measurable function and ζ a stopping time. Then conditional on ζ < ∞

E[ h(X_{ζ+1}, X_{ζ+2}, . . .) | Xζ = xζ, X_{ζ−1} = x_{ζ−1}, . . . , X0 = x0 ] = E[ h(X_{ζ+1}, X_{ζ+2}, . . .) | Xζ = xζ ].
223 / 417
Resolvent
Definition. The Markov chain (X_{nm}), m ∈ N, with transition kernel K^m(x, A) is called the m-skeleton of the chain (Xn).
The resolvent Kε, 0 < ε < 1, associated with Markov kernel K is

Kε(x, A) := (1 − ε) Σ_{i=0}^∞ εⁱ Kⁱ(x, A).

The MC (Xn^ε) with transition kernel Kε is called the Kε-chain.
Remark. (Xn^ε) is a sub-chain of (Xn) with indices selected randomly from a geometric distribution with parameter ε.

Example. AR(1) process:

X_{n+1} = αXn + δ_{n+1},   δn ∼ f(x),   n = 1, 2, . . . .

Transition kernel: for x, y ∈ R

P(X_{n+1} < y|Xn = x) = P(αXn + δ_{n+1} < y|Xn = x) = P(αx + δ_{n+1} < y) = P(δ_{n+1} < y − αx).

Transition kernel density: κ(x, y) = f(y − αx).
224 / 417
Irreducibility
Discrete state space: a Markov chain (Xn) is irreducible if all states communicate, i.e. Px(τ{y} < ∞) > 0 for all x, y ∈ X.
In the continuous case Px(τ{y} < ∞) ≡ 0 might hold.
Definition. Given a measure ϕ on B(X), the Markov chain (Xn) with transition kernel K is ϕ-irreducible if for every A ∈ B(X) with ϕ(A) > 0 and for all x ∈ X there exists n > 0 such that K^n(x, A) > 0 (that is, Px(τA < ∞) > 0). The chain is strongly ϕ-irreducible if n = 1 for all measurable A.
Theorem. The chain (Xn) is ϕ-irreducible if and only if for every x ∈ X and every A ∈ B(X) with ϕ(A) > 0, one of the following properties holds:
(a) there exists n ∈ N such that K^n(x, A) > 0;
(b) Ex[ηA] > 0;
(c) Kε(x, A) > 0 for every 0 < ε < 1.
225 / 417
Maximal irreducibility
Theorem. If (Xn) is ϕ-irreducible, then there exists a probability measure ψ on B(X), called the maximal irreducibility measure, such that
(a) the Markov chain (Xn) is ψ-irreducible;
(b) if there exists a measure ϱ such that (Xn) is ϱ-irreducible then ϱ is dominated by ψ, that is ϱ ≪ ψ;
(c) if ψ(A) = 0 then ψ( {x : Px(τA < ∞) > 0} ) = 0;
(d) the measure ψ is equivalent to

ψ₀(A) := ∫_X K_{1/2}(x, A)ϕ(dx),   A ∈ B(X),

that is ψ ≪ ψ₀ and ψ₀ ≪ ψ.

Given an irreducibility measure ϕ, the theorem provides a construction for the maximal irreducibility measure.
Definition. A Markov chain will be called ψ-irreducible if it is ϕ-irreducible for some ϕ and ψ is the maximal irreducibility measure.
226 / 417
Example. AR(1) process
AR(1) process:

X_{n+1} = αXn + δ_{n+1},   δn ∼ N(0, σ²) i.i.d.

Transition kernel density: κ(x, y) = ϕ((y − αx)/σ)/σ.
λ: Lebesgue measure on B(X).
For A ∈ B(X) with λ(A) > 0

K(x, A) = ∫_A κ(x, y)dy = ∫_A (1/(√(2π)σ)) e^{−(y−αx)²/(2σ²)} dy > 0.

(Xn) is strongly λ-irreducible.

Let δn ∼ U(−1, 1) and α > 1.
X_{n+1} − Xn ≥ (α − 1)Xn − 1 ≥ 0 for Xn ≥ 1/(α − 1).
The Markov chain is increasing, cannot return to previous values. Not irreducible.
227 / 417
Atoms, small sets
Definition. A set α ∈ B(X) is called an atom for the Markov chain (Xn) if there exists a measure ν on B(X) such that

K(x, A) = ν(A),   x ∈ α, A ∈ B(X).

If (Xn) is ψ-irreducible and ψ(α) > 0 then α is called an accessible atom.
Remark. A single point is always an atom. If X is countable and the chain is irreducible then every point is an accessible atom.
Definition. A set C is small if there exists m ∈ N and a nonzero measure νm on B(X) such that for all x ∈ C we have

K^m(x, A) ≥ νm(A),   A ∈ B(X).

Theorem. Let (Xn) be a ψ-irreducible Markov chain. For every set A ∈ B(X) such that ψ(A) > 0 there exists m ∈ N and a small set C ⊆ A such that ψ(C) > 0 and νm(C) > 0. Moreover, X can be decomposed in a countable partition of small sets.
228 / 417
Example. AR(1) process
AR(1) process:

X_{n+1} = αXn + δ_{n+1},   δn ∼ N(0, σ²) i.i.d.

Transition kernel density:

κ(x, y) = (1/(√(2π)σ)) e^{−(y−αx)²/(2σ²)}.

If x ∈ [ω, ω̄], a lower bound for κ(x, y):

κ₁(y) := (1/(√(2π)σ)) exp( (−y² + 2αωy − α²(ω² ∨ ω̄²))/(2σ²) )   if y ≥ 0,
κ₁(y) := (1/(√(2π)σ)) exp( (−y² + 2αω̄y − α²(ω² ∨ ω̄²))/(2σ²) )   if y < 0.

C = [ω, ω̄] is a small set with m = 1 and

ν₁(A) := ∫_A κ₁(y)dy,   A ∈ B(X).
229 / 417
Creating atoms
Construction of a companion Markov chain possessing an atom: Athreya & Ney [1978].

Definition. We say that the minorizing condition holds if there exists a set C ∈ B(X), a constant ε > 0 and a probability measure ν on B(X) with ν(C) = 1 and ν(C^c) = 0 such that

K(x, A) ≥ εν(A),   x ∈ C, A ∈ B(X).

Assumptions: the minorizing condition holds and Px(τC < ∞) ≡ 1, x ∈ X.
Change of transition kernel: if Xn ∈ C

X_{n+1} ∼ ν(·) with probability ε,   X_{n+1} ∼ Q(Xn, ·) with probability 1 − ε,

where

Q(x, A) := ( K(x, A) − εν(A) )/(1 − ε).
230 / 417
Renewal times
No change in transition kernel given Xn = x:

P(X_{n+1} ∈ A|Xn = x) = εν(A) + (1 − ε)( K(x, A) − εν(A) )/(1 − ε) = K(x, A).

As Px(τC < ∞) ≡ 1 the chain reaches C with probability 1 and sooner or later X_{n+1} is generated from ν.
τ: the time when X_{n+1} is generated from ν. Stopping time with

Px(Xn ∈ A, τ = n) = ν(A)Px(τ = n).

Definition. A renewal time (or regeneration time) is a stopping time τ with the property that (Xτ, X_{τ+1}, . . .) is independent of (X_{τ−1}, X_{τ−2}, . . .).
τ is a renewal time for the Markov chain.

τ0 := 0,   τj := inf{n > τ_{j−1} : Xn ∈ C and X_{n+1} ∼ ν},   j ∈ N.

(τj) satisfies

Px(Xn ∈ A, τj = n) = ν(A)Px(τj = n)

and splits the chain to independent parts.
231 / 417
Construction
Splitting X: X̌ := X × {0, 1}.
Components: X0 := X × {0} (upper level) and X1 := X × {1} (lower level).
Elements: x0 := (x, 0) ∈ X0, x1 := (x, 1) ∈ X1.
σ-algebra on X̌:

B(X̌) := σ( A0, A1 : A0 := A × {0}, A1 := A × {1}, A ∈ B(X) ).

Splitting a measure µ on B(X):

µ*(A0) := µ(A ∩ C)(1 − ε) + µ(A ∩ C^c),   µ*(A1) := µ(A ∩ C)ε.

µ*(A0 ∪ A1) = µ(A) for all A ∈ B(X). If A ⊆ C^c then µ*(A0) = µ(A).

Transition kernel for the split chain X̌n := (Xn, ω̌n) with ω̌n ∈ {0, 1}:

Ǩ(x0, Ai) := K*(x0, Ai),   x0 ∈ X0 \ C0;
Ǩ(x0, Ai) := ( K*(x0, Ai) − εν*(Ai) )/(1 − ε),   x0 ∈ C0;   i = 0, 1,
Ǩ(x1, Ai) := ν*(Ai),   x1 ∈ X1.

X1 is an atom for (X̌n) but Ǩ^n(xi, X1 \ C1) = 0 (i = 0, 1) for all n ≥ 1. Only C1 can be reached from the lower level.

α̌ := C1 := C × {1} is an atom for the split chain (X̌n).
232 / 417
Period
(Xn): ψ-irreducible Markov chain.
C: a small set with associated integer M and a corresponding measure νM, where νM(C) > 0.
When the chain starts in C there is a positive probability of returning to C at time M.

EC := {n ≥ 1 : C is νn-small, with νn := δn νM for some δn > 0}.

EC: the set of time points for which C is a small set with measure proportional to νM.
EC is closed under addition. If n, m ∈ EC then n + m ∈ EC, since

K^{n+m}(x, B) ≥ ∫_C K^m(x, dy) K^n(y, B) ≥ δm δn νM(C) νM(B),   x ∈ C.

Greatest common divisor of EC: "natural" period for C.
233 / 417
Cycles, aperiodicity
Theorem. Let (Xn) be a ψ-irreducible Markov chain, C ∈ B(X) be a νM-small set with ψ(C) > 0 and let d be the g.c.d. of EC. Then there exist disjoint sets D0, D1, . . . , D_{d−1} ∈ B(X) (called a d-cycle) such that
(a) for x ∈ Di, K(x, D_{i+1}) = 1, i = 1, 2, . . . , d − 1, mod d;
(b) ψ( (∪_{i=0}^{d−1} Di)^c ) = 0.
The d-cycle is maximal in the sense that for any other collection {D'_k, k = 0, 1, . . . , d' − 1} satisfying (a)–(b) we have d' dividing d. If d' = d then, by reordering the indices if necessary, D'_i = Di ψ almost everywhere.

Definition. Let (Xn) be a ψ-irreducible Markov chain. The largest d for which a d-cycle occurs is called the period of (Xn). If d = 1 then the chain (Xn) is called aperiodic. If there exists a ν₁-small set C with ν₁(C) > 0, then the chain is strongly aperiodic.

Remark. The Kε-chain is strongly aperiodic for all 0 < ε < 1.
234 / 417
Classification of states
ηA: occupation time of a set A ∈ B(X), defined by

ηA := Σ_{n=1}^∞ I{Xn∈A}.

Definition. A set A ∈ B(X) is called recurrent if Ex[ηA] = ∞ for all x ∈ A.
The set A is uniformly transient if there exists a constant M such that Ex[ηA] < M for all x ∈ A.
The set A is transient if there exists a covering of X by uniformly transient sets, that is a countable collection of uniformly transient sets Bi such that A = ∪i Bi.
Theorem. Let (Xn) be a ψ-irreducible Markov chain with an accessible atom α.
(a) If α is recurrent, then every set A ∈ B(X) such that ψ(A) > 0 is recurrent.
(b) If α is transient then X is transient.
235 / 417
Transient and recurrent Markov chains
Definition. A Markov chain is recurrent if there exists a measure ψ such that (Xn) is ψ-irreducible and for every A ∈ B(X) such that ψ(A) > 0 we have Ex[ηA] = ∞ for all x ∈ A. The chain (Xn) is transient if it is ψ-irreducible and X is transient.
Theorem. If (Xn) is a ψ-irreducible Markov chain, then (Xn) is either recurrent or transient.
Theorem. A ψ-irreducible Markov chain (Xn) is recurrent if there exists a small set C ∈ B(X) with ψ(C) > 0 such that

Px(τC < ∞) = 1   for all x ∈ C.

Theorem. If (Xn) is a ψ-irreducible Markov chain, then every set A ∈ B(X) with ψ(A) = 0 is transient.
Theorem. Let (Xn) be a ψ-irreducible Markov chain. Then there exist disjoint sets D0, D1, . . . , D_{d−1} ∈ B(X) such that
(a) for x ∈ Di, K(x, D_{i+1}) = 1, i = 1, 2, . . . , d − 1, mod d;
(b) ψ( (∪_{i=0}^{d−1} Di)^c ) = 0.
236 / 417
Harris recurrence
Recurrence: for every A ∈ B(X) such that ψ(A) > 0, Ex[ηA] = ∞ holds for all x ∈ A.
Definition. A set A ∈ B(X) is Harris recurrent if Px(ηA = ∞) = 1 for all x ∈ A. The chain (Xn) is Harris recurrent if there exists a measure ψ such that (Xn) is ψ-irreducible and every A ∈ B(X) such that ψ(A) > 0 is Harris recurrent.
Theorem. If for every A ∈ B(X) we have Px(τA < ∞) = 1 for all x ∈ A then Px(ηA = ∞) = 1 for every x ∈ X, and (Xn) is Harris recurrent.
Remark. If X is countable then Ex[η{x}] = ∞ if and only if Px(τ{x} < ∞) = 1 for every x ∈ X. Recurrence and Harris recurrence coincide.
Theorem. A ψ-irreducible Markov chain (Xn) is Harris recurrent if there exists a small set C ∈ B(X) with ψ(C) > 0 such that

Px(τC < ∞) = 1   for all x ∈ X.
237 / 417
Invariant measures
Definition. A σ-finite measure π on B(X) is invariant for the transition kernel K and for the associated Markov chain (Xn) if

π(A) = ∫_X K(x, A)π(dx),   A ∈ B(X).

Definition. If (Xn) is ψ-irreducible and admits an invariant probability measure π then (Xn) is called positive. If the chain does not admit such a measure then we call it null recurrent.
If the invariant measure π is a probability measure and X0 ∼ π then Xn ∼ π for all n ∈ N, so by the Markov property the process is stationary, that is the joint distribution of Xn, X_{n+1}, . . . , X_{n+k} does not depend on n.
Theorem. If the chain (Xn) is positive then it is recurrent.
Theorem. If (Xn) is a recurrent chain then there exists an invariant σ-finite measure which is unique up to a multiplicative factor.
Definition. If (Xn) is Harris recurrent and positive then it is called Harris positive.
238 / 417
Example. Random walk
Random walk on R:

X_{n+1} = Xn + δ_{n+1},   δn ∼ f(x),   n = 1, 2, . . . .

Transition kernel density: κ(x, y) = f(y − x).
λ: Lebesgue measure on B(R). A ∈ B(R).

∫_R K(x, A)λ(dx) = ∫_R ∫_A κ(x, y)dy λ(dx) = ∫_R ∫_A f(y − x)dy λ(dx)
  = ∫_R ∫_R I_{A−x}(y)f(y)dy λ(dx) = ∫_R f(y) ∫_R I_{A−y}(x)λ(dx)dy = λ(A).

As λ(R) = ∞, and the invariant measure is unique up to a constant term, the random walk cannot be positive recurrent.

Remark. One can prove that the random walk is null-recurrent.
239 / 417
Example. AR(1) process
AR(1) process:

X_{n+1} = αXn + δ_{n+1},   δn ∼ N(0, σ²) i.i.d.

Transition kernel density:

κ(x, y) = (1/(√(2π)σ)) e^{−(y−αx)²/(2σ²)}.

Stationary distribution: N(µ, τ²).
Equation to be solved for µ and τ²:

e^{−(z−µ)²/(2τ²)} = ∫_{−∞}^∞ (1/(√(2π)σ)) e^{−(z−αx)²/(2σ²)} e^{−(x−µ)²/(2τ²)} dx.

Derived equations

µ = αµ,   τ² = α²τ² + σ²,   resulting in µ = 0, τ² = σ²/(1 − α²).

Stationary solution exists iff |α| < 1, and it is the N(0, σ²/(1 − α²)).
If |α| < 1 then the AR(1) process is positive, hence it is recurrent.
240 / 417
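The stationary distribution N(0, σ²/(1 − α²)) is easy to see empirically by simulating the chain; a short sketch (my own illustration, not from the slides):

```python
import random

# Simulate the AR(1) chain X_{n+1} = alpha*X_n + delta_{n+1} and compare
# the empirical moments with the stationary N(0, sigma^2/(1-alpha^2)).
random.seed(3)
alpha, sigma = 0.8, 1.0
x = 0.0
xs = []
for _ in range(200_000):
    x = alpha * x + random.gauss(0.0, sigma)
    xs.append(x)

mean = sum(xs) / len(xs)
var = sum((y - mean) ** 2 for y in xs) / len(xs)
target_var = sigma ** 2 / (1 - alpha ** 2)   # = 2.777... for alpha = 0.8
```

Note that the sample is autocorrelated, so the Monte Carlo error is larger than for an i.i.d. sample of the same size; the empirical variance still settles near σ²/(1 − α²).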
Reversibility
Definition. A stationary Markov chain (Xn) is reversible if the conditional distribution of X_{n+1} given X_{n+2} = x equals the conditional distribution of X_{n+1} given Xn = x.

Definition. A Markov chain with transition kernel density κ satisfies the detailed balance condition if there exists a function f satisfying

κ(y, x)f(y) = κ(x, y)f(x),   x, y ∈ X.

Theorem. Suppose that a Markov chain with transition kernel density κ satisfies the detailed balance condition with a probability density function corresponding to a measure π. Then
(a) π is the invariant measure of the chain.
(b) The chain is reversible.

Provides a sufficient condition for a density function to be the stationary density. E.g. if |α| < 1 then the AR(1) process is reversible.
241 / 417
Convergence to invariant distribution
Discrete case:

lim_{n→∞} P(Xn = y|X0 = x) = π({y}),   x ∈ X.

Definition. If µ is a signed measure on B(X) then the total variation norm ‖µ‖ is defined as

‖µ‖ := sup_{f:|f|≤1} |µ(f)| = sup_{A∈B(X)} µ(A) − inf_{A∈B(X)} µ(A),   µ(f) := ∫_X f(x)µ(dx).

Remark. If µ₁ and µ₂ are measures on B(X) then

(1/2)‖µ₁ − µ₂‖ = ‖µ₁ − µ₂‖_TV := sup_{A∈B(X)} |µ₁(A) − µ₂(A)|.

Theorem. If measure π is invariant for the transition kernel K then for every initial distribution µ

‖ ∫_X K^n(x, ·)µ(dx) − π ‖

is non-increasing in n.
242 / 417
Ergodicity
Theorem. If (Xn) is positive Harris and aperiodic then for every initial distribution µ

‖ ∫_X K^n(x, ·)µ(dx) − π ‖ → 0   as n → ∞.

Corollary. For every bounded function f and initial distribution µ

lim_{n→∞} | E[f(Xn)] − E[f(X)] | = 0,   X ∼ π.

Definition. Let h : X → [1, ∞) and µ be a signed measure on B(X). The h-norm ‖µ‖_h is defined as

‖µ‖_h := sup_{f:|f|≤h} |µ(f)|,   µ(f) := ∫_X f(x)µ(dx).
243 / 417
h-Ergodicity
Definition. We say that a Markov chain (Xn) is h-ergodic if h ≥ 1 and
(a) (Xn) is positive Harris recurrent with invariant probability distribution π;
(b) the integral π(h) is finite;
(c) for every initial condition x ∈ X of the chain

lim_{n→∞} ‖K^n(x, ·) − π‖_h = 0.

Definition. A set C ∈ B(X) is called h-regular if h ≥ 1 and for each B ∈ B(X) with ψ(B) > 0

sup_{x∈C} Ex[ Σ_{k=0}^{τB−1} h(Xk) ] < ∞.

The chain (Xn) is called h-regular if there exists a countable cover of X with h-regular sets.
244 / 417
Convergence
Theorem. Let (Xn) be positive recurrent and aperiodic.
(a) If π(h) = ∞ then K^n(x, h) → ∞ as n → ∞ for all x ∈ X.
(b) If π(h) < ∞ then the set Sh of h-regular sets is full and absorbing, that is

ψ(Sh^c) = 0   and   K(x, Sh) = 1, x ∈ Sh,

respectively. If x ∈ Sh then

lim_{n→∞} ‖K^n(x, ·) − π‖_h = 0.

(c) If (Xn) is h-regular, then (Xn) is h-ergodic. Conversely, if (Xn) is h-ergodic then (Xn) restricted to a full absorbing set is h-regular.

Remark. If (Xn) is h-ergodic then it may not be h-regular.
245 / 417
Geometric ergodicity
Interested in the speed of convergence to the invariant distribution.

Definition. A Markov chain (Xn) is geometrically h-ergodic with h ≥ 1, if (Xn) is positive Harris with π(h) < ∞ and there exists a constant rh > 1 such that

Σ_{n=1}^∞ rh^n ‖K^n(x, ·) − π‖_h < ∞

for all x ∈ X. If h ≡ 1 then we call (Xn) geometrically ergodic.

If α is an atom then K(x, α) = ν(α). Notation K(α, α) is valid.

Remark. If (Xn) has an atom α then geometric ergodicity implies that for a real number r > 1

Σ_{n=1}^∞ r^n |K^n(α, α) − π(α)| < ∞.
246 / 417
Geometrically ergodic atoms
Definition. An accessible atom α is called geometrically ergodic if there exists rα > 1 such that

Σ_{n=1}^∞ rα^n |K^n(α, α) − π(α)| < ∞.

An accessible atom α is called a Kendall atom of rate γ if there exists γ > 1 such that

Eα[γ^{τα}] < ∞.

Theorem. Let (Xn) be ψ-irreducible, with invariant probability measure π and assume that there exists a geometrically ergodic atom α. Then there exist γ > 1, r > 1 and R < ∞ such that

Σ_{n=1}^∞ r^n ‖K^n(x, ·) − π‖ ≤ R Ex[γ^{τα}] < ∞

on a set Sγ which is full and absorbing.
247 / 417
General geometric convergence
Definition. An accessible set A ∈ B(X) is called a Kendall set of rate γ if there exists γ > 1 such that

sup_{x∈A} Ex[γ^{τA}] < ∞.

Theorem. Let (Xn) be ψ-irreducible and aperiodic with invariant probability measure π, and assume that there exists a Kendall small set C ∈ B(X). Then there exist γ > 1, r > 1 and R < ∞ such that

Σ_{n=1}^∞ r^n ‖K^n(x, ·) − π‖ ≤ R Ex[γ^{τC}] < ∞

on a set Sγ which is full and absorbing.

Mengersen & Tweedie [1996]: For ψ-irreducible and aperiodic Markov chains geometric ergodicity is equivalent to the existence of a Kendall small set C with ψ(C) > 0.
248 / 417
Uniform ergodicity
Definition. A Markov chain (Xn) is called uniformly ergodic if

sup_{x∈X} ‖K^n(x, ·) − π‖ → 0   as n → ∞.

Theorem. For a Markov chain (Xn) the following are equivalent:
(a) (Xn) is uniformly ergodic.
(b) There exist r > 1 and R < ∞ such that for all x

‖K^n(x, ·) − π‖ ≤ R r^{−n}.

(c) The state space X is νm-small for some m.
(d) (Doeblin's condition) The chain is aperiodic and there exists a probability measure ϕ on B(X) and constants ε < 1, δ > 0, m ∈ N such that whenever ϕ(A) > ε

inf_{x∈X} K^m(x, A) > δ.

(e) The chain is aperiodic and there exists a small set C and a γ > 1 such that

sup_{x∈X} Ex[γ^{τC}] < ∞.
249 / 417
Harmonic functions
Definition. A measurable function h : X → R is harmonic for the chain (Xn) if for all x ∈ X

∫_X K(x, dy)h(y) = h(x),   that is   E[h(X_{n+1})|Xn = x] = h(x).

Theorem. If (Xn) is recurrent then every bounded harmonic function is constant ψ almost everywhere.
Theorem. For a positive Markov chain, if the only bounded harmonic functions are the constant functions then the chain is Harris recurrent.
Theorem. If (Xn) is Harris recurrent then the constants are the only bounded harmonic functions.
X₁, X₂, . . . , Xₙ: observations of a Markov chain.
Interested in the limiting behaviour of

Sn(f) := (1/n) Σ_{j=1}^n f(Xj)   as n → ∞.
250 / 417
Ergodic theorems
Theorem. Suppose that (Xn) is positive Harris and that any of the LLN or the CLT hold for some f and some initial distribution. Then this same limit holds for every initial distribution.
Theorem. The following statements are equivalent when an invariant probability measure π exists for the Markov chain (Xn).
(a) For each f ∈ L¹(π)

lim_{n→∞} Sn(f) = ∫_X f(x)π(dx).

(b) The Markov chain (Xn) is positive Harris.

Theorem. If (Xn) has a σ-finite invariant measure π, the following two statements are equivalent:
(a) If f, g ∈ L¹(π) with ∫_X g(x)π(dx) ≠ 0 then

lim_{n→∞} Sn(f)/Sn(g) = ∫_X f(x)π(dx) / ∫_X g(x)π(dx).

(b) The Markov chain (Xn) is Harris recurrent.
251 / 417
Central Limit Theorem
Theorem. Let f : X → R and (Xn) be Harris positive with an accessible atom α. Suppose

Eα[τα²] < ∞,   Eα[ ( Σ_{j=1}^{τα} |f(Xj)| )² ] < ∞,

and define

γf² := π(α) Eα[ ( Σ_{j=1}^{τα} (f(Xj) − π(f)) )² ].

Then γf² < ∞ and if γf² > 0 then the CLT holds for f, that is

(1/√n) Σ_{j=1}^n ( f(Xj) − π(f) ) →_D N(0, γf²).
252 / 417
The Markov chain Monte Carlo (MCMC) principle
Monte Carlo integration: approximate

J := ∫ h(x)f(x)dx.

Classical approach: simulate an i.i.d. sample from f(x).
Importance sampling: simulate an i.i.d. sample from a different distribution.
MCMC approach: simulate a sample which is approximately distributed according to f(x) using an ergodic Markov chain.
Definition. An MCMC method for the simulation of a distribution f(x) is any method producing an ergodic Markov chain (X^{(t)}) with stationary distribution f(x).
Ergodicity ensures the almost sure convergence

(1/T) Σ_{t=1}^T h(X^{(t)}) → J   as T → ∞.
254 / 417
Metropolis-Hastings algorithm
f(y): target density.
q(y|x): conditional density (candidate density) satisfying:
I for a given x it is easy to simulate from q(y|x);
I either explicitly available (up to a multiplicative constant independent of x) or symmetric, that is q(y|x) = q(x|y).
Ratio f(y)/q(y|x) is known up to a multiplicative constant independent of x.
Metropolis-Hastings algorithm
Given x^{(t)},
1. Generate y_t ∼ q(y|x^{(t)}).
2. Take

x^{(t+1)} = y_t with probability ϱ(x^{(t)}, y_t),   x^{(t+1)} = x^{(t)} with probability 1 − ϱ(x^{(t)}, y_t),

where

ϱ(x, y) := min{ (f(y)/f(x)) (q(x|y)/q(y|x)), 1 }.
255 / 417
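The two steps above translate directly into code. A minimal sketch of the algorithm (my own illustration; the function and argument names are assumptions, and the N(0,1) target with a Gaussian random-walk candidate is only an example):

```python
import math
import random

def metropolis_hastings(log_f, q_sample, q_logpdf, x0, n, rng):
    """Generic Metropolis-Hastings. log_f: log target density (up to an additive
    constant); q_sample(x) draws y ~ q(.|x); q_logpdf(y, x) = log q(y|x)."""
    x = x0
    chain = []
    for _ in range(n):
        y = q_sample(x)
        # log of rho(x, y) = min{ (f(y)/f(x)) * (q(x|y)/q(y|x)), 1 }
        log_rho = min(0.0, log_f(y) - log_f(x) + q_logpdf(x, y) - q_logpdf(y, x))
        if rng.random() < math.exp(log_rho):
            x = y                      # accept the candidate
        chain.append(x)                # otherwise the chain stays at x
    return chain

# Example: N(0,1) target, symmetric Gaussian random-walk candidate
# (the q terms cancel in rho, but are kept to show the general formula).
rng = random.Random(0)
chain = metropolis_hastings(
    log_f=lambda x: -0.5 * x * x,
    q_sample=lambda x: x + rng.gauss(0.0, 1.0),
    q_logpdf=lambda y, x: -0.5 * (y - x) ** 2,
    x0=0.0, n=100_000, rng=rng)
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
```

Working with log-densities avoids overflow and lets the target be specified only up to a multiplicative constant, exactly as the algorithm requires.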
Metropolis-Hastings algorithm. Remarks
1. The algorithm always accepts a value y_t if f(y_t)/q(y_t|x^{(t)}) is greater than f(x^{(t)})/q(x^{(t)}|y_t). For symmetric q: f(y_t) > f(x^{(t)}).
Values y_t resulting in a decrease of the ratio f/q may also be accepted.
2. If f(x^{(0)}) > 0 then f(x^{(t)}) > 0 for all t ∈ N. For theoretical reasons define ϱ(x, y) := 0 if both f(x) = 0 and f(y) = 0.
3. The sample produced by the algorithm is not i.i.d. Repeated values may occur.
4. E := supp f cannot be truncated by q. Avoid the situation of existence of a set A ⊂ E such that

∫_A f(y)dy > 0   and   ∫_A q(y|x)dy = 0, ∀x ∈ E.

If x^{(0)} ∉ A then A will never be visited. Regularity condition:

supp f ⊂ ∪_{x∈E} supp q(·|x).

5. If x ∉ E then ∫_E q(y|x)dy = 1.
256 / 417
Invariant distribution
Theorem. For every conditional distribution q, whose support includes E, f is a stationary distribution of the Markov chain (X^{(t)}) produced by the Metropolis-Hastings algorithm.

Proof. The probability that the chain remains at the same position if X^{(t)} = x:

1 − r(x),   where r(x) := ∫_X ϱ(x, y)q(y|x)dy.

For a set A ∈ B(X)

P(X^{(1)} ∈ A|X^{(0)} = x) = ∫_A ϱ(x, y)q(y|x)dy + (1 − r(x)) ∫_A δ_x(y)dy = ∫_A κ(x, y)dy,

where δ_x(y) is the Dirac delta function with mass at x. For every compactly supported continuous function g

∫_X g(y)δ_x(y)dy = g(x).

Transition kernel density:

κ(x, y) = ϱ(x, y)q(y|x) + (1 − r(x))δ_x(y).

Transition kernel:

K(x, dy) = ϱ(x, y)q(y|x)dy + (1 − r(x))δ_x(dy).
257 / 417
Proof of invariance
To prove that f is the PDF of the invariant distribution of K one has to show

∫_A f(y)dy = ∫_X K(x, A)f(x)dx = ∫_X ∫_A κ(x, y)f(x)dy dx.

Transition kernel density:

κ(x, y) = ϱ(x, y)q(y|x) + (1 − r(x))δ_x(y).

By the definition of ϱ(x, y):

ϱ(x, y)q(y|x)f(x) = ϱ(y, x)q(x|y)f(y).

∫_X ∫_A κ(x, y)f(x)dy dx = ∫_X ∫_A ϱ(x, y)q(y|x)f(x)dy dx + ∫_X ∫_A (1 − r(x))δ_x(y)f(x)dy dx
  = ∫_X ∫_A ϱ(y, x)q(x|y)f(y)dy dx + ∫_X (1 − r(x))I_A(x)f(x)dx
  = ∫_A [ ∫_X ϱ(y, x)q(x|y)dx ] f(y)dy + ∫_A (1 − r(x))f(x)dx
  = ∫_A r(y)f(y)dy + ∫_A (1 − r(x))f(x)dx = ∫_A f(y)dy.   □

258 / 417
Recurrency
Theorem. If the Metropolis-Hastings chain (X^{(t)}) is f-irreducible (i.e. irreducible with respect to the invariant distribution π with PDF f) then it is Harris recurrent.

Proof. h: bounded harmonic function, Ex[h(X^{(1)})] = h(x). Prove: h ≡ const.
(X^{(t)}) is recurrent with transition kernel density

κ(x, y) = ϱ(x, y)q(y|x) + (1 − r(x))δ_x(y).

h is constant π almost everywhere, namely

h(x) = π(h) := ∫_X h(x)π(dx) = ∫_X h(x)f(x)dx,   π a.e.

If π(A) = 0, A ∈ B(X), then for all x ∈ E we have ∫_A ϱ(x, y)q(y|x)dy = 0. Hence,

h(x) = Ex[h(X^{(1)})] = ∫_X κ(x, y)h(y)dy = ∫_X ϱ(x, y)q(y|x)h(y)dy + (1 − r(x))h(x)
     = ∫_X ϱ(x, y)q(y|x)π(h)dy + (1 − r(x))h(x) = π(h)r(x) + (1 − r(x))h(x).

π-irreducibility implies r(x) > 0 for all x ∈ E, so h(x) = π(h) on E.
If x ∉ E then ϱ(x, y) = 1 and ∫_E q(y|x)dy = 1 imply h(x) = π(h).   □
259 / 417
Convergence
Theorem. Assume that the Metropolis-Hastings chain (X^{(t)}) is irreducible with respect to the invariant distribution π with PDF f.
(a) If h ∈ L¹(π), then

lim_{T→∞} (1/T) Σ_{t=1}^T h(X^{(t)}) = π(h) = ∫_X h(x)f(x)dx,   π a.e.

(b) If, in addition, (X^{(t)}) is aperiodic, then

lim_{n→∞} ‖ ∫_X K^n(x, ·)µ(dx) − π ‖ = 0

for every initial distribution µ.
Proof. π-irreducibility implies Harris recurrence. The ergodic theorem applies. Since (X^{(t)}) is positive Harris, (b) follows from the aperiodicity.   □

Remark. Irreducibility holds if q(y|x) > 0 for all x, y ∈ E.
Remark. Aperiodicity holds if

P[ f(X^{(t)}) q(Y_t|X^{(t)}) ≤ f(Y_t) q(X^{(t)}|Y_t) ] < 1.
260 / 417
Sufficient conditions of irreducibility and aperiodicity
Theorem. [Roberts & Tweedie, 1996] Assume f is bounded and
positive on every compact set of its support E. If there exist constants ε > 0, δ > 0 such that

q(y|x) > ε  if |x − y| < δ,
then the Metropolis-Hastings chain X (t) is f -irreducible and


aperiodic. Moreover, every nonempty compact set is a small set.


Idea: q allows moves in a neighbourhood of the current point. If
% > 0 in this neighbourhood, any subset of E can be reached in a
finite number of steps.
Corollary. If the conditions of the theorem hold then

lim_{T→∞} (1/T) Σ_{t=1}^T h(X(t)) = π(h),  π a.e.

for h ∈ L1(π), while for every initial distribution µ we have

lim_{n→∞} ‖ ∫_X K^n(x, ·)µ(dx) − π ‖ = 0.

261 / 417
Proof
x(0): arbitrary starting point from E.
For all A ⊂ E there exist m ∈ N and x(i) ∈ E, i = 1, 2, . . . , m, such that x(m) ∈ A
and |x(i+1) − x(i)| < δ.
Thus x(0) and A can be linked through balls with radius δ. The acceptance probability
of x(i) starting from the (i−1)st ball is positive, so K^m(x(0), A) > 0.
For every x ∈ B := B(x(0), δ/2)

Px(A) ≥ ∫_A %(x, y)q(y|x) dy ≥ ∫_{A∩Dx} [f(y)/f(x)] q(x|y) dy + ∫_{A∩Dx^c} q(y|x) dy,

where Dx := {y : f(y)q(x|y) ≤ f(x)q(y|x)}. Hence,

Px(A) ≥ ∫_{A∩Dx∩B} [f(y)/f(x)] q(x|y) dy + ∫_{A∩Dx^c∩B} q(y|x) dy
≥ (inf_B f / sup_B f) [ ∫_{A∩Dx∩B} q(x|y) dy + ∫_{A∩Dx^c∩B} q(y|x) dy ] ≥ ε (inf_B f / sup_B f) λ(A∩B).

B is small with m = 1, and ν1 is a multiple of the uniform distribution on B.
m = 1 implies that the chain is aperiodic. Balls are small sets.
Meyn & Tweedie [1993]: for irreducible and aperiodic chains a finite union of small
sets is small. Hence, compact sets are small.  □
262 / 417
Example. Beta generation
Aim: generate a sample from Be(2.7, 6.3) using MH algorithm.
Candidate density: PDF of U(0, 1). Independent of the previous
value.
Simulated MH sequence with U(0, 1) candidate and Be(2.7, 6.3) target.


263 / 417
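The experiment above was run in R; the same independent MH sampler can be sketched in Python (numpy assumed; the function name `indep_mh_beta` is illustrative, not from the slides). Since the U(0, 1) candidate density is constant, the acceptance ratio reduces to f(yt)/f(x(t)), and the normalizing constant of the beta density cancels:

```python
import numpy as np

def indep_mh_beta(a=2.7, b=6.3, T=5000, seed=42):
    """Independent Metropolis-Hastings for a Be(a, b) target
    with a U(0, 1) candidate density."""
    rng = np.random.default_rng(seed)
    f = lambda x: x**(a - 1) * (1 - x)**(b - 1)  # unnormalized Be(a, b) PDF
    x = rng.uniform()                            # x(0)
    chain = np.empty(T)
    for t in range(T):
        y = rng.uniform()                        # candidate from g = U(0, 1)
        # g is constant, so the MH ratio is just f(y)/f(x)
        if rng.uniform() <= min(f(y) / f(x), 1.0):
            x = y
        chain[t] = x
    return chain

sample = indep_mh_beta()
print(sample.mean())  # should be close to a/(a+b) = 0.3
```

The sequence is correlated (rejections repeat the previous value), but its marginal distribution converges to Be(2.7, 6.3).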
Goodness of fit
Histograms of 5000 beta random variables generated using the MH algorithm


and R function rbeta.
Kolmogorov-Smirnov test of Be(2.7, 6.3) fit: p = 0.2772.
Kolmogorov-Smirnov test of equality of the samples: p = 0.4806.
264 / 417
Independent Metropolis-Hastings algorithm
Accept-Reject method: generate a value X from the instrumental
density g and accept it as a value from f with probability f(X)/(M g(X)).
Condition: f(x) ≤ M g(x) for all x.
Independent Metropolis-Hastings algorithm: q(y |x) = g (y ).

Algorithm
Given x (t) ,
1. Generate yt ∼ g (y ).
2. Take

        x(t+1) = yt    with probability min{ [f(yt)/f(x(t))] [g(x(t))/g(yt)], 1 },
        x(t+1) = x(t)  otherwise.

The generated sequence is not independent, as the acceptance
probability depends on the previous value.
(X(t)) is irreducible and aperiodic iff g > 0 almost everywhere on E.

265 / 417
Geometric and uniform ergodicity
Theorem. [Mengersen & Tweedie, 1996] The independent Metro-
polis-Hastings algorithm produces a uniformly ergodic chain if the-
re exists a constant M such that
f (x) ≤ Mg (x), x ∈ E := supp f . (2)
In this case
‖K^n(x, ·) − π‖_TV ≤ 2 (1 − 1/M)^n,
where π is the measure corresponding to PDF f . If for every M
there exists a set of positive π measure where (2) does not hold
then the chain is not even geometrically ergodic.
Proof. Transition kernel density:

κ(x, y) = %(x, y)g(y) + (1 − r(x))δx(y),  where %(x, y) = min{ [f(y)/f(x)] [g(x)/g(y)], 1 }.

We have

κ(x, y) ≥ g(y) min{ [f(y)/f(x)] [g(x)/g(y)], 1 } = min{ f(y)g(x)/f(x), g(y) } ≥ f(y)/M.

Hence, the whole state space is small and the chain is uniformly ergodic.
266 / 417
Proof of the bound for the total variation distance

‖K(x, ·) − π‖_TV = 2 sup_A ∫_A (κ(x, y) − f(y)) dy = 2 ∫_{y: f(y)≥κ(x,y)} (f(y) − κ(x, y)) dy
≤ 2 ∫_{y: f(y)≥κ(x,y)} (1 − 1/M) f(y) dy ≤ 2 (1 − 1/M).

We have

∫_A (κ^2(x, y) − f(y)) dy = ∫_E [ ∫_A (κ(u, y) − f(y)) dy ] (κ(x, u) − f(u)) du,

where κ^n denotes the transition kernel density corresponding to K^n, implying

‖K^2(x, ·) − π‖_TV ≤ 2 (1 − 1/M)^2.

Using induction we obtain

∫_A (κ^{n+1}(x, y) − f(y)) dy = ∫_E [ ∫_A (κ^n(u, y) − f(y)) dy ] (κ(x, u) − f(u)) du,

resulting in the upper bound

‖K^n(x, ·) − π‖_TV ≤ 2 (1 − 1/M)^n.

Assume that for all M there exists x ∈ E such that f(x) > M g(x). Define
Dn := {x : f(x)/g(x) ≥ n}, satisfying π(Dn) > 0, n ∈ N.
267 / 417
Proof of necessity of condition f(x) ≤ Mg(x)
Define Ax := {y : %(x, y) = 1} and Rx := {y : %(x, y) < 1}.
Ax: a transition from x is surely accepted; Rx: a transition from x may be rejected.
Let x ∈ Dn. For y ∈ Ax we have 1 ≤ [f(y)/g(y)] [g(x)/f(x)] ≤ n^{−1} f(y)/g(y), so

K(x, Ax \ {x}) = ∫_{Ax} g(y) dy ≤ n^{−1} ∫_{Ax} f(y) dy ≤ n^{−1},
K(x, Rx) = ∫_{Rx} g(y) [f(y)/g(y)] [g(x)/f(x)] dy ≤ n^{−1} ∫_{Rx} f(y) dy ≤ n^{−1}.

For x ∈ Dn we have K(x, {x}) = 1 − K(x, Ax \ {x}) − K(x, Rx) ≥ 1 − 2/n.
Assume (X(t)) is geometrically ergodic, and let C be the corresponding Kendall small
set. The chain is π-irreducible: for all n there exist xn ∈ C and m ∈ N such that

P_{xn}( X(m) ∈ Dn, τC > m ) > 0.   (3)

For x ∈ Dn ∩ C^c we have Px(τC ≥ k) ≥ (1 − 2/n)^k. The radius of convergence of

Ex[z^{τC}] = Σ_{k=0}^∞ Px(τC ≥ k) z^k

is less than n/(n − 2). Since xn ∈ C, by (3) we can reach Dn ∩ C^c without hitting
C. Hence E_{xn}[z^{τC}] also diverges outside n/(n − 2). As n is arbitrary, C cannot
be Kendall for any γ > 1.  □
268 / 417
Acceptance rate
Theorem. If there exists a constant M such that

f(x) ≤ M g(x),  x ∈ E := supp f,

then for the stationary Metropolis-Hastings algorithm the mean
acceptance probability is at least 1/M.
Proof. The expected acceptance probability is

E[%(X(t), Yt)] = E[ min{ [f(Yt)/f(X(t))] [g(X(t))/g(Yt)], 1 } ]
= ∫∫ I{f(y)g(x)/(g(y)f(x)) > 1} f(x)g(y) dx dy + ∫∫ I{f(y)g(x)/(g(y)f(x)) ≤ 1} [f(y)g(x)/(g(y)f(x))] f(x)g(y) dx dy
= 2 ∫∫ I{f(y)g(x)/(g(y)f(x)) ≥ 1} f(x)g(y) dx dy
≥ 2 ∫∫ I{f(y)g(x)/(g(y)f(x)) ≥ 1} f(x) [f(y)/M] dx dy = (2/M) P( f(X1)/g(X1) ≥ f(X2)/g(X2) ) = 1/M,

where X1 and X2 are independent random variables from distribution f.  □
Remark. The independent Metropolis-Hastings algorithm is more
effective than the Accept-Reject method.
269 / 417
Example. Cauchy generation
Aim: generate a sample from C(0, 1) using independent Metropo-
lis-Hastings algorithm starting from the stationary distribution.
Candidate density: N (0, 1).
Problem: if the starting value (generated from C(0, 1)) is large, the
next value is hardly accepted.
Example: x (0) = 25.7378 (obtained for the 14th trial).
g x (0) /f x (0) = 1.19 × 10−141
 

Large values are heavily weighted, resulting long constant sequen-


ces. The histogram is closer to the normal, than to the Cauchy.
Kolmogorov-Smirnov test of C(0, 1) fit: p < 2.2 × 10−16 .
Candidate density: T1/2
Large values regularly appear, but the algorithm can accept a new
candidate with a fair probability.
Kolmogorov-Smirnov test of C(0, 1) fit: p = 0.5052.
270 / 417
Example. Cauchy from normal


Simulated independent MH sequence with N (0, 1) candidate and C(0, 1) target


(upper panel), histogram with Cauchy (red) and normal (blue) PDFs (lower,
left) and ACF of the sample (lower, right).
271 / 417
Example. Cauchy from Student’s t


Simulated independent MH sequence with T1/2 candidate and C(0, 1) target


(upper panel), histogram with Cauchy (red) and normal (blue) PDFs (lower,
left) and ACF of the sample (lower, right).
272 / 417
Example. Cauchy CDF evaluation
Evaluate P(X < 2), where X ∼ C(0, 1), using independent Metro-
polis-Hastings sequences X (t) with N (0, 1) and T1/2 candidates.
Approximation:

(1/T) Σ_{t=1}^T I{X(t) < 2}.

Convergence of Cauchy sequences generated by independent MH algorithm


with N (0, 1) (blue) and T1/2 (black) candidates to the true value of 0.8524
(dashed red). 273 / 417
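As a minimal Python sketch of the T_{1/2}-candidate chain and of the CDF estimate above (numpy assumed; `indep_mh_cauchy` is an illustrative name, and the t density is written out so no extra library is needed):

```python
import math
import numpy as np

def indep_mh_cauchy(T=10000, nu=0.5, seed=1):
    """Independent MH for a C(0,1) target with a Student t_nu candidate.
    For nu = 1/2 the candidate tails are heavier than the Cauchy tails,
    so f <= M g holds and the chain is uniformly ergodic."""
    rng = np.random.default_rng(seed)
    f = lambda x: 1.0 / (1.0 + x * x)  # unnormalized C(0,1) PDF
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    g = lambda x: c * (1.0 + x * x / nu) ** (-(nu + 1) / 2)  # t_nu PDF
    x = rng.standard_t(nu)
    chain = np.empty(T)
    for t in range(T):
        y = rng.standard_t(nu)
        if rng.uniform() <= min(f(y) * g(x) / (f(x) * g(y)), 1.0):
            x = y
        chain[t] = x
    return chain

chain = indep_mh_cauchy()
est = np.mean(chain < 2)  # estimates P(X < 2) = 1/2 + arctan(2)/pi ≈ 0.8524
print(est)
```

Replacing `standard_t(nu)` by `standard_normal()` reproduces the poor N(0, 1)-candidate behaviour described above: the chain sticks at large values for long stretches.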
Adaptive Rejection Metropolis Sampling (ARMS)
Generalization of the ASR algorithm for non log-concave densities
[Gilks, Best & Tan, 1995].
f: density. Sn := {xi ∈ supp f | i = 0, 1, . . . , n + 1}.
h(xi) = log f(xi) is known (at least up to a constant).
y = Li,i+1(x): the line through (xi, h(xi)) and (xi+1, h(xi+1)).

h̃n(x) := max{ Li,i+1(x), min{ Li−1,i(x), Li+1,i+2(x) } }

if xi ≤ x < xi+1, i = 1, 2, . . . , n − 1. Boundaries:

h̃n(x) := L0,1(x)                      if x < x0;
          max{ L0,1(x), L1,2(x) }      if x0 ≤ x < x1;
          max{ Ln,n+1(x), Ln−1,n(x) }  if xn ≤ x < xn+1;
          Ln,n+1(x)                    if x ≥ xn+1.

In general h̃n(x) is not an upper bound for h(x) := log f(x).
Candidate density: gn ∝ exp(h̃n(x)).
274 / 417
The ARMS algorithm
Two steps: Accept-Reject and Metropolis-Hastings.

ψn(x) ∝ min{ f(x), exp(h̃n(x)) },  gn ∝ exp(h̃n(x)).

Algorithm
0. Initialize n and Sn.
Given x(t),
1. Generate y ∼ gn, u ∼ U(0, 1).
2. Accept y if u ≤ f(y)/exp(h̃n(y)). Otherwise, update
Sn to Sn+1 = Sn ∪ {y} and update h̃n (and gn) to h̃n+1.
3. Generate v ∼ U(0, 1) and take

        x(t+1) = y     if v ≤ min{ [f(y)/f(x(t))] [ψn(x(t))/ψn(y)], 1 },
        x(t+1) = x(t)  otherwise.

Remark. The Accept-Reject part generates values from ψn.
Remark. The resulting Markov chain is not time homogeneous.
General convergence and ergodicity results cannot be applied.
275 / 417
Random walks
Given the current state X (t) simulate Yt according to
Yt = X (t) + εt , εt ∼ g , independent of X (t) .
Candidate density: q(y |x) = g (y − x).
If g is positive in a neighbourhood of 0 then X (t) is f -irreducible


and aperiodic and in this way it is ergodic.


Symmetric g , that is g (−x) = g (x): original algorithm of Metro-
polis, et al. [1953].

Algorithm
Given x(t),
1. Generate yt ∼ g(y − x(t)).
2. Take

        x(t+1) = yt    with probability min{ f(yt)/f(x(t)), 1 },
        x(t+1) = x(t)  otherwise.
276 / 417
Example. Random walk normal generator
Hastings [1970]: generate a sample from N (0, 1) using random
walk with U(−δ, δ) errors.
Acceptance probability:

%(x(t), yt) = exp{ ((x(t))² − yt²)/2 } ∧ 1.

Convergence of the means of standard normal sequences generated by random


walk MH algorithm with U(−δ, δ) errors for δ = 0.1 (blue), δ = 0.5 (green) and
δ = 1 (black).
277 / 417
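The Hastings experiment can be sketched in Python as follows (numpy assumed; `rw_mh_normal` is an illustrative name). The symmetric U(−δ, δ) increments make the candidate ratio cancel, leaving only the target ratio:

```python
import numpy as np

def rw_mh_normal(delta, T=5000, seed=0):
    """Random walk MH for a N(0,1) target with U(-delta, delta) increments.
    Returns the chain and the observed acceptance rate."""
    rng = np.random.default_rng(seed)
    x, accepted = 0.0, 0
    chain = np.empty(T)
    for t in range(T):
        y = x + rng.uniform(-delta, delta)
        # symmetric proposal: acceptance is min{exp((x^2 - y^2)/2), 1}
        if rng.uniform() <= min(np.exp((x * x - y * y) / 2.0), 1.0):
            x, accepted = y, accepted + 1
        chain[t] = x
    return chain, accepted / T

for delta in (0.1, 0.5, 1.0):
    chain, rate = rw_mh_normal(delta)
    print(delta, rate, chain.mean())
```

Small δ gives a very high acceptance rate but slow exploration, which is exactly the poor fit seen for δ = 0.1 above.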
Example. Random walk normal generator

Fit of the standard normal sequences generated by random walk MH algorithm


with U(−δ, δ) errors for δ = 0.1 (left), δ = 0.5 (center) and δ = 1 (right).
Kolmogorov-Smirnov test of standard normal fit:
δ = 0.1: p < 2.2 × 10^{−16};  δ = 0.5: p = 9.383 × 10^{−8};  δ = 1: p = 0.3422.
278 / 417
Ergodicity
Theorem. [Mengersen & Tweedie, 1996] If supp f = R and
q(y |x) = g (y − x) = g (x − y ) then the Metropolis-Hastings chain
is not uniformly ergodic.
Definition. A continuous PDF f satisfying f (x) > 0 for all x ∈ R is
called log-concave in the tails if there exists α > 0 and x1 such that
log f (x) − log f (y ) ≥ α|x − y | for all y ≤ x ≤ −x1 or x1 ≤ x ≤ y .
Theorem. [Mengersen & Tweedie, 1996] Suppose f is log-conca-
ve in the tails and symmetric and denote by π the corresponding
probability measure. Then for any continuous symmetric positive
density g the random walk Metropolis-Hastings chain is geometri-
cally ergodic, with
‖K^n(x, ·) − π‖_V ≤ R r^{−n} V(x)

for some R < ∞, r > 1 and V(x) = e^{s|x|} for any s < α, where α
is the log-concavity constant of f . If f is not symmetric then the
same conclusion holds provided the candidate density g satisfies
g (x) ≤ be−α|x| for some finite b. 279 / 417
Example. Comparison of tail effects
Random walk MH algorithm with N (0, 1) candidate density.
Target density: N (0, 1) with PDF ϕ. Mean: 0.
Log-concave in the tails (yielding geometric ergodicity), since

lim_{x→∞} (d/dx) log ϕ(x) = −∞.

Target density: a distribution with PDF

ψ(x) ∝ (1 + |x|)^{−3},  x ∈ R.

Its mean is also zero, but ψ is not log-concave in the tails, since

lim_{x→∞} (d/dx) log ψ(x) = 0.

The asymptotic behaviour of the sums of the form

(1/T) Σ_{t=1}^T X(t)

is investigated. 500 independent chains are initialized at x(0) = 0.
280 / 417


Example. Comparison of tail effects

90 % empirical confidence envelopes of the means of 500 independent sequen-


ces produced by random walk MH algorithm initialized at 0 with N (0, 1) errors
for targets ϕ (left) and ψ (right). For both targets the same normal errors and
the same uniform values for controlling acceptance are used.
281 / 417
Necessary and sufficient condition
Theorem. [Mengersen & Tweedie, 1996] Let f be a continuous
PDF satisfying f (x) > 0 for all x ∈ R and assume that

θ(x) := (d/dx) log f(x)
is defined for all sufficiently large x and

lim θ(x) = θ∞
x→∞

is also defined (although possibly −∞). Further, assume that g (y )


is continuous, symmetric and

∫_{−∞}^{∞} |y| g(y) dy < ∞.

Then the random walk Metropolis-Hastings chain is geometrically


ergodic if and only if θ∞ < 0.
282 / 417
Optimizing the acceptance rates
Commonly used types of Metropolis-Hastings algorithms:
(a) Fully automated Metropolis-Hastings, e.g. ARMS.
(b) Independent Metropolis-Hastings with candidate density g sa-
tisfying f ≤ Mg to ensure uniform ergodicity.
(c) Random walk Metropolis-Hastings.
Case (a): One can only play with the initial point x(0). No big influence on the efficiency of the algorithm.
Case (b): Choose a candidate g that maximizes the mean acceptance rate

% := E[%(X, Y)] = E[ min{ [f(Y)/f(X)] [g(X)/g(Y)], 1 } ] = 2 P( f(Y)/g(Y) ≥ f(X)/g(X) ),

where X ∼ f and Y ∼ g are independent. Lower bound: % ≥ 1/M.
A large difference between f and g (e.g. in spread) causes large variations in the acceptance ratio %(X, Y). The more faithfully g reproduces f, the better the performance of the algorithm.
ces f , the better is the performance of the algorithm.
283 / 417
Parametrized candidate distribution
Use a parametrized candidate distribution g (x|θ).
I Choose an initial parameter θ0. Run the algorithm and calculate the
estimate %̂(θ0) of the acceptance rate. Adjust θ in order to increase %.
I Similarly to the Accept-Reject algorithm, choose a parameter
which optimizes the lower bound of %.

Example. Target: N(0, 1) distribution with PDF ϕ(x).
Candidate: Laplace distribution L(λ) with PDF

g(x|λ) = (λ/2) e^{−λ|x|},  x ∈ R.

Upper bound: ϕ(x)/g(x|λ) ≤ Mλ := √(2/π) λ^{−1} e^{λ²/2}.
The minimum of Mλ is attained at λ = 1, so L(1) is the optimal
choice among Laplace distributions for simulating N(0, 1).
M1 = √(2e/π) = 1.3155, while e.g. M3 = 23.9411.
Lower bounds for %: 1/M1 = 0.7602, 1/M3 = 0.0418.
284 / 417
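The bound Mλ can be checked numerically; a plain-Python sketch (the helper name `M` is just for illustration). Note that d/dλ log Mλ = λ − 1/λ vanishes at λ = 1, confirming the optimum:

```python
import math

def M(lam):
    """Upper bound on phi(x)/g(x | lam): sqrt(2/pi) * exp(lam^2/2) / lam.
    The maximum of the ratio over x is attained at |x| = lam."""
    return math.sqrt(2.0 / math.pi) * math.exp(lam * lam / 2.0) / lam

print(M(1.0))  # sqrt(2e/pi) ≈ 1.3155, the smallest possible bound
print(M(3.0))  # ≈ 23.9411, a much worse bound
```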
Example. Standard normal generation using Laplace

Convergence of the means of standard normal sequences generated by independent MH algorithm with L(1) (black) and L(3) (blue) candidates.

Estimated acceptance rates (and lower bounds):


%̂(1) = 0.8298 (0.7602);  %̂(3) = 0.4692 (0.0418).
p-values of Kolmogorov-Smirnov tests of standard normal fit of the sequences (and of the last 3000 values):
L(1): 0.6451 (0.6489);  L(3): 2.84 × 10^{−8} (3.48 × 10^{−5}).
285 / 417
Example. Standard normal generation using Laplace

Autocovariances of standard normal sequences generated by independent MH


algorithm with L(1) (left) and L(3) (right) candidates.

286 / 417
Acceptance rate of random walk MH algorithm
High acceptance rate might indicate that the algorithm moves
“slowly”.
If x(t) and yt are close to each other one has

%(x(t), yt) = min{ f(yt)/f(x(t)), 1 } ≈ 1.

This limits the moves on the support of f, which is problematic e.g. for multimodal densities where the modes are separated by zones of small probability. If the variance of g is small, it is difficult to jump from one mode to the other.

Asymptotically optimal acceptance rate (dimension d → ∞): 0.234.


Necessary and sufficient conditions are given.
[Gelman, Roberts & Gilks, 1996; Roberts, Gelman & Gilks, 1997]
Suggestions: for d = 1, 2 use % ≈ 1/2; for d > 3 use % ≈ 1/4.
Other asymptotically optimal acceptance rates, e.g. Bedard [2008].
287 / 417
Example. Random walk normal generator
Generate a standard normal sample using U(−δ, δ) errors.
Estimated acceptance rates:
δ = 0.1 : 0.9833; δ = 0.5 : 0.9037; δ = 1 : 0.8060.

Fit of the standard normal sequences generated by random walk MH algorithm


with U(−δ, δ) errors for δ = 0.1 (left), δ = 0.5 (center) and δ = 1 (right).
288 / 417
Rao-Blackwellization
Approximate E[h(X)], X ∼ f(x), with δ_T^MH := (1/T) Σ_{t=1}^T h(X(t)).
X(1), X(2), . . . , X(T): sample produced by a MH algorithm.
Y1, Y2, . . . , YT: candidate variables, Yt | X(t) = x(t) ∼ g(y | x(t)).
U1, U2, . . . , UT: auxiliary U(0, 1) variables controlling acceptance.

δ_T^MH = (1/T) Σ_{t=1}^T [ h(Yt) I{X(t) = Yt} + h(X(t−1)) I{X(t) = X(t−1)} ]
       = (1/T) Σ_{t=1}^T h(Yt) Σ_{j=t}^T I{X(j) = Yt}.

Rao-Blackwellized estimator:

δ_T^RB = (1/T) Σ_{t=1}^T h(Yt) Σ_{j=t}^T E[ I{X(j) = Yt} | Y1, . . . , YT ]
       = (1/T) Σ_{t=1}^T h(Yt) Σ_{j=t}^T P( X(j) = Yt | Y1, . . . , YT ).
289 / 417
Independent Metropolis-Hastings algorithm
Define

ωi := f(Yi)/g(Yi),  %ij := (ωj/ωi) ∧ 1,  τi := P( X(i) = Yi | Y1, . . . , Yi ),
ζii := 1,  ζij := Π_{k=i+1}^{j} (1 − %ik)  (i < j).

Theorem. [Casella & Robert, 1996] The Rao-Blackwellized estimator δ_T^RB can be written as

δ_T^RB = (1/T) Σ_{t=1}^T ϕt h(Yt),

where ϕt is the expected number of times Yt occurs in the sample,
given by

ϕt = τt Σ_{k=t}^T ζtk,  with  τt = Σ_{k=0}^{t−1} τk ζ_{k(t−1)} %kt  (t > 0).

290 / 417
Langevin algorithm
f: PDF corresponding to probability distribution π.
Wt: standard Brownian motion (Wiener process).
Lt: the Langevin diffusion satisfying the SDE

dLt = (σ²/2) ∇ log f(Lt) dt + σ dWt

has stationary distribution π. Discretization:

L_{t+1} − L_t = (σ²/2) ∇ log f(L_t) + σ εt,  εt ∼ N(0, 1).

[Grenander & Miller, 1994; Roberts & Stramer, 2002]
Metropolis-Hastings proposal corresponding to candidate PDF g:

Yt = X(t) + (σ²/2) ∇ log f(X(t)) + σ εt,  εt ∼ g.

Acceptance probability:

%(x, y) := min{ [ f(y) g( (x − y)/σ − (σ/2)∇ log f(y) ) ] / [ f(x) g( (y − x)/σ − (σ/2)∇ log f(x) ) ], 1 }.
291 / 417
Properties
Metropolis-Hastings proposal corresponding to candidate PDF g:

Yt = X(t) + (σ²/2) ∇ log f(X(t)) + σ εt,  εt ∼ g.

Besag [1994]: g is the PDF of N(0, 1).
Metropolis-adjusted Langevin algorithm [Roberts & Tweedie, 1996].
Theorem. [Roberts & Tweedie, 1996] If ∇ log f(x) → 0 as
‖x‖ → ∞, then the Metropolis-adjusted Langevin algorithm is not
geometrically ergodic.
Optimal acceptance rate: 0.574 [Roberts & Rosenthal, 1998].
Example. f: PDF of N(0, 1).
Metropolis-adjusted Langevin algorithm:

Yt | X(t) = x(t) ∼ N( x(t)(1 − σ²/2), σ² ).

Random walk Metropolis-Hastings with the same error distribution:

Yt | X(t) = x(t) ∼ N( x(t), σ² ).
292 / 417
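For the standard normal example above, ∇ log f(x) = −x, so the MALA proposal is N(x(1 − σ²/2), σ²). A Python sketch (numpy assumed; `mala_normal` is an illustrative name, and log densities are used for numerical stability):

```python
import numpy as np

def mala_normal(sigma=1.0, T=10000, seed=3):
    """Metropolis-adjusted Langevin chain for a N(0,1) target."""
    rng = np.random.default_rng(seed)
    log_f = lambda x: -x * x / 2.0
    # log q(a|b) up to an additive constant: a ~ N(b(1 - sigma^2/2), sigma^2)
    log_q = lambda a, b: -(a - b * (1.0 - sigma * sigma / 2.0)) ** 2 / (2.0 * sigma * sigma)
    x = 0.0
    chain = np.empty(T)
    for t in range(T):
        y = x * (1.0 - sigma * sigma / 2.0) + sigma * rng.standard_normal()
        # MH correction: log [ f(y) q(x|y) / (f(x) q(y|x)) ]
        log_rho = log_f(y) + log_q(x, y) - log_f(x) - log_q(y, x)
        if np.log(rng.uniform()) <= min(log_rho, 0.0):
            x = y
        chain[t] = x
    return chain

chain = mala_normal()
print(chain.mean(), chain.var())  # should approach 0 and 1
```

Dropping the `log_q` terms from `log_rho` would turn this into the plain random walk comparison above, since the proposal would then be treated as symmetric.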
Example. Non-identifiable normal model
Consider the model

Z ∼ N( (θ1 + θ2)/2, 1 )  with priors θ1, θ2 ∼ N(0, 1).

Posterior distribution:

f(θ1, θ2 | z) ∝ exp{ −(1/2) [ (5/4)θ1² + (5/4)θ2² + (1/2)θ1θ2 − (θ1 + θ2)z ] }.

Candidate distribution for given (θ1(t), θ2(t)):

N2( (θ1(t), θ2(t)) + (σ²/2) ( (2z − θ2(t) − 5θ1(t))/4, (2z − θ1(t) − 5θ2(t))/4 ), σ² I2 ).

I2: the 2 × 2 identity matrix.
σ = 1.46 yields an acceptance rate around 1/2.
293 / 417
Example. Non-identifiable normal model

Convergence of the empirical averages for the Metropolis-adjusted Langevin algorithm (solid black) and i.i.d. simulations (dotted black) of (θ1, θ2) for the estimation of E[θ1] (left) and E[θ2] (right), and the corresponding 90% confidence intervals (blue: Langevin; green: i.i.d.) for z = 4.3.
294 / 417
Fundamental theorem of simulation
f(x): a density we want to simulate from.

f(x) = ∫_0^{f(x)} du = ∫_{−∞}^{∞} I{0 ≤ u ≤ f(x)} du.

f(x) is the marginal density in X of the joint distribution

(X, U) ∼ U(S(f)),  where S(f) := {(x, u) : 0 ≤ u ≤ f(x)}.

U: auxiliary variable, not directly related to the original problem.
Generating from f "reduces" to generating uniform random variables
on {(x, u) : 0 < u < f(x)}.

Theorem. Simulating X ∼ f(x) is equivalent to simulating (X, U) ∼ U(S(f)).

296 / 417
MCMC approach
Aim: for a given PDF f(x) simulate values from the uniform distribution on

S(f) := {(x, u) : 0 < u < f(x)}.

MCMC approach: generate a Markov chain with stationary distribution U(S(f)).
Natural solution: use a random walk on S(f), moving in only one
direction at a time. Single-variable slice sampler [Neal, 2003].
Starting point: (x, u) ∈ S(f).
I The move along the u-axis corresponds to the conditional distribution

U | X = x ∼ U({u : u ≤ f(x)}).

This results in a change from (x, u) to (x, u′) ∈ S(f).
I The subsequent move along the x-axis corresponds to the conditional distribution

X | U = u′ ∼ U({x : u′ ≤ f(x)}).

This results in a change from (x, u′) to (x′, u′) ∈ S(f).
297 / 417


Single-variable slice sampler
Algorithm
Given x(t),
1. Generate u(t+1) ∼ U(0, f(x(t))).
2. Generate x(t+1) ∼ U(A(t+1)), where

A(t+1) := {x : u(t+1) ≤ f(x)}.

The algorithm remains valid if instead of f(x) we use some f1(x)
such that f(x) = C f1(x).

Example. Generate values from the PDF

fd(x) := exp(−x^{1/d}) / Γ(d + 1),  x > 0, d > 0.

Conditional distributions:

U | X = x ∼ U(0, fd(x)),  X | U = u ∼ U(0, [−log(Γ(d+1)u)]^d).

Large values of d might cause computational problems.


298 / 417
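Both conditionals above are uniform on explicit intervals, so the sampler needs no rejection step. A Python sketch (numpy assumed; `slice_fd` is an illustrative name):

```python
import math
import numpy as np

def slice_fd(d, T=5000, seed=7):
    """Single-variable slice sampler for f_d(x) = exp(-x^(1/d)) / Gamma(d+1):
    alternate u | x ~ U(0, f_d(x)) and x | u ~ U(0, (-log(Gamma(d+1) u))^d)."""
    rng = np.random.default_rng(seed)
    gamma_d1 = math.gamma(d + 1)
    fd = lambda x: math.exp(-x ** (1.0 / d)) / gamma_d1
    x = 1.0
    chain = np.empty(T)
    for t in range(T):
        u = rng.uniform(0.0, fd(x))
        x = rng.uniform(0.0, (-math.log(gamma_d1 * u)) ** d)
        chain[t] = x
    return chain

chain = slice_fd(2)
print(chain.mean())  # E[X] = Gamma(2d+1) / (2 Gamma(d+1)) = 6 for d = 2
```

For large d the factor Γ(d + 1) overflows and the interval endpoint (−log(Γ(d+1)u))^d becomes numerically unstable, which is the computational problem mentioned above.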
Example. Single variable slice sampler
Target density: fd(x) := exp(−x^{1/d})/Γ(d + 1), x > 0, d > 0.



Slice sampler histogram of samples from fd for d = 2 (left) and d = 5 (right).


299 / 417
Invariance
Both steps of the slice sampler algorithm preserve the uniform
distribution on S(f) := {(x, u) : 0 < u < f(x)}.
Step 1: X(t) ∼ f(x) and U(t+1) | X(t) = x ∼ U(0, f1(x)), where f(x) = C f1(x).

(X(t), U(t+1)) ∼ f(x) I_[0,f1(x)](u)/f1(x) ∝ I{0 ≤ u ≤ f1(x)}.

Step 2: X(t+1) | U(t+1) = u ∼ U(A(t+1)), where A(t+1) := {x : u ≤ f1(x)}.

(X(t), U(t+1), X(t+1)) ∼ f(x) [ I_[0,f1(x)](u)/f1(x) ] [ I_{A(t+1)}(v)/λ(A(t+1)) ].

Integrate with respect to x:

(U(t+1), X(t+1)) ∼ C ∫_{−∞}^{∞} I_[0,f1(x)](u) [ I_{A(t+1)}(v)/λ(A(t+1)) ] dx
= C I{0 ≤ u ≤ f1(v)} ∫_{−∞}^{∞} I{0 ≤ u ≤ f1(x)}/λ(A(t+1)) dx ∝ I{0 ≤ u ≤ f1(v)}.

300 / 417
Product slice sampler
Main difficultyof one-variable slice sampler: generating uniformly
from A(t) := x : u (t) ≤ f (x) .
Assume the decomposition

f(x) ∝ Π_{i=1}^{k} fi(x),  fi > 0,

fi is not necessarily a density. E.g. Bayesian likelihood with flat pri-


or, where fi might be the ith individual likelihood.
fi(x) = ∫ I{0 ≤ ui ≤ fi(x)} dui.

f(x) is the marginal of the joint PDF

p(x, u1, . . . , uk) ∝ Π_{i=1}^{k} I{0 ≤ ui ≤ fi(x)}.

301 / 417
Product slice sampler
Algorithm
Given x(t),
1. Generate u1(t+1) ∼ U(0, f1(x(t))).
   ...
k. Generate uk(t+1) ∼ U(0, fk(x(t))).
k+1. Generate x(t+1) ∼ U(A(t+1)), where

A(t+1) := {x : ui(t+1) ≤ fi(x), i = 1, 2, . . . , k}.

Example. Target density:

f(x) ∝ (1 + sin²(3x)) (1 + cos⁴(5x)) exp(−x²/2).

Decomposition:

f1(x) := 1 + sin²(3x),  f2(x) := 1 + cos⁴(5x),  f3(x) := exp(−x²/2).

The generated x value should be uniform over

{x : |x| ≤ √(−2 log u3)} ∩ {x : sin²(3x) ≥ u1 − 1} ∩ {x : cos⁴(5x) ≥ u2 − 1}.
302 / 417
Example. Product slice sampler
Target density:
f (x) ∝ 1 + sin2 (3x) 1 + cos4 (5x) exp − x 2 /2 .
  

Histogram of a sample from f produced by 10000 iterations of a product slice sampler.
303 / 417
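The slides do not prescribe how to draw uniformly from the intersection set; one simple option is rejection sampling inside the interval |x| ≤ √(−2 log u3) given by the Gaussian factor. A Python sketch under that assumption (numpy assumed; `product_slice` is an illustrative name):

```python
import numpy as np

def product_slice(T=5000, seed=11):
    """Product slice sampler for f(x) ∝ (1+sin^2 3x)(1+cos^4 5x) exp(-x^2/2).
    The uniform draw over the intersection set is done by rejection
    inside the interval dictated by the Gaussian slice."""
    rng = np.random.default_rng(seed)
    f1 = lambda x: 1.0 + np.sin(3 * x) ** 2
    f2 = lambda x: 1.0 + np.cos(5 * x) ** 4
    f3 = lambda x: np.exp(-x * x / 2.0)
    x = 0.0
    chain = np.empty(T)
    for t in range(T):
        u1 = rng.uniform(0.0, f1(x))
        u2 = rng.uniform(0.0, f2(x))
        u3 = rng.uniform(0.0, f3(x))
        a = np.sqrt(-2.0 * np.log(u3))  # |x| <= a from the Gaussian factor
        while True:                      # rejection: uniform on the intersection
            y = rng.uniform(-a, a)
            if f1(y) >= u1 and f2(y) >= u2:
                x = y
                break
        chain[t] = x
    return chain

chain = product_slice()
print(chain.mean())  # f is symmetric, so the mean should be near 0
```

The rejection loop always terminates: the current x satisfies all three slice constraints, so the acceptance region is nonempty.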
Transition probabilities
Single-variable slice sampler with target f(x) = C f1(x):

U | X(t) = x(t) ∼ U(0, f1(x(t))),  X(t+1) | U = u ∼ U(A(u)),

with

A(u) := {x : u ≤ f1(x)}.

Transition kernel for f1(X(t)):

P( f1(X(t+1)) ≤ y | f1(X(t)) = v ) = ∫∫ I{u ≤ f1(x) ≤ y} I{0 ≤ u ≤ v} [ 1/(λ(A(u)) v) ] du dx
= (1/v) ∫_0^{y∧v} [ λ(A(u)) − λ(A(y)) ] / λ(A(u)) du = (1/v) ∫_0^v max{ 1 − λ(A(y))/λ(A(u)), 0 } du.

X(t+1) has the same transition kernel, so the convergence properties of the two chains are identical. The behaviour is fully characterized by λ(A(u)).
304 / 417
Uniform ergodicity
Theorem. [Mira & Tierney, 2002] If f1 is bounded and supp f1 is
bounded, the single-variable slice sampler is uniformly ergodic.
Proof. One may assume that f1 ≤ 1 and supp f1 = [0, 1]. The slice sampler is
uniformly ergodic iff the whole space supp f1 is a small set. Consider

ℓ(v) := P( f1(X(t+1)) ≤ y | f1(X(t)) = v ) = (1/v) ∫_0^{y∧v} ( 1 − λ(A(y))/λ(A(u)) ) du.

For v ≥ y the function ℓ(v) is decreasing. λ(A(u)) is decreasing in u, so the
integrand is also decreasing in u. Thus, for v < y

ℓ′(v) = −(1/v²) ∫_0^v ( 1 − λ(A(y))/λ(A(u)) ) du + (1/v) ( 1 − λ(A(y))/λ(A(v)) ) ≤ 0.

Hence, ℓ(v) is decreasing for all v ∈ [0, 1]. By L'Hospital's rule

max_{v∈[0,1]} ℓ(v) = lim_{v→0} ℓ(v) = lim_{v→0} ( 1 − λ(A(y))/λ(A(v)) ) = 1 − λ(A(y)),
min_{v∈[0,1]} ℓ(v) = lim_{v→1} ℓ(v) = ∫_0^y ( 1 − λ(A(y))/λ(A(u)) ) du.

As the transition probabilities are bounded from above and below by nondegenerate bounds which are independent of v, the whole space is small.  □
305 / 417
Geometric ergodicity
Theorem. [Roberts & Rosenthal, 1999] Assume f1 is bounded and
 
Q(u) := λ A(u) , where A(u) := x : u ≤ f1 (x)
is differentiable and there exists a constant α > 1 such that
Q 0 (u)u 1+1/α is non-increasing, at least on an open set containing
0. Then the single-variable slice sampler is geometrically ergodic.
Examples. [Roberts & Rosenthal, 1999]
1. Let supp f = R+ and assume f (x) ∝ f1 (x) := exp(−γx), γ > 0,
at least on the tails. For small u

A(u) := x : x ≤ log(1/u)/γ , Q(u) = log(1/u)/γ.
Thus, Q 0 (u)u 1+1/α = −u 1/α γ −1 is decreasing.
2. Let supp f = R+ and assume f (x) ∝ f1 (x) := x −δ , δ > 0, at
least on the tails. For small u
A(u) := x : x ≤ u −1/δ }, Q(u) = u −1/δ .


Thus, Q 0 (u)u 1+1/α = −u 1/α−1/δ δ −1 is non-increasing if α ≤ δ.


306 / 417
Example. Exact convergence rate.
Target distribution: Exp(1) with PDF f (x) = exp(−x), x > 0 and
corresponding probability measure π.
The single-variable slice sampler is geometrically ergodic.
Roberts & Rosenthal [1999]: if n ≥ 23,

‖K^n(x, ·) − π‖_TV ≤ 0.054865 (0.985015)^n (n − 15.7043).

E.g. for n = 530 we have ‖K^n(x, ·) − π‖_TV ≤ 0.0095.

Example. A poor slice sampler


Target distribution: f(x) ∝ exp(−‖x‖), x ∈ R^d.
Simulating from f is equivalent to simulating the radius z = ‖x‖ with PDF

g(z) ∝ z^{d−1} exp(−z),  z > 0.

The performance of the slice sampler deteriorates as the dimension d increases.
307 / 417
Example. A poor slice sampler
Sequence plot and ACF for the series produced by the slice sampler corresponding to f1(z) := z^{d−1} exp(−z) for d = 1, 5, 10, 50.
308 / 417
Connection to slice sampling
Fundamental theorem: simulation from the joint PDF f(x, y) is
equivalent to simulation from the uniform distribution on

S(f) := {(x, y, u) : 0 ≤ u ≤ f(x, y)}.

Starting from (x, y, u) ∈ S(f) sample one component at a time:
1. Generate a realization x′ of X from U({x : u ≤ f(x, y)}).
2. Generate a realization y′ of Y from U({y : u ≤ f(x′, y)}).
3. Generate a realization u′ of U from U(0, f(x′, y′)).
fX (x), fY (y ): marginal PDFs corresponding to X and Y , resp.
fX |Y (x|y ), fY |X (y |x): conditional PDFs.
Remarks.
(a) Generating from U({x : u ≤ f(x, y)}) is equivalent to generating
from U({x : u/fY(y) ≤ fX|Y(x|y)}).
(b) The generations along the three axes do not need to follow the
same order all the time to keep the (stationary) uniform
distribution on S(f).
310 / 417
Two-stage Gibbs sampler
Fix e.g. Y = y and repeat the generation of U and X . This gives
a simulation of
X ∼ fX |Y (x|y ).
In the same way one can obtain simulation of
Y ∼ fY |X (y |x).
Repetition of these steps yields the two-stage Gibbs sampler or
Data Augmentation algorithm [Tanner & Wong, 1987; 2010].

Algorithm
0. Take X(0) = x(0).
Given (x(t), y(t)),
1. Generate y(t+1) ∼ fY|X(y | x(t)).
2. Generate x(t+1) ∼ fX|Y(x | y(t+1)).


Each step of the Gibbs sampler can be considered as if infinitely


many steps were made with a slice sampler.
311 / 417
Markov property
Definition. Let (X1, X2, . . . , Xp) ∼ f(x1, x2, . . . , xp) and denote
by f^(k) the marginal PDF corresponding to Xk. If f^(k)(xk) > 0 for
every k = 1, 2, . . . , p implies that f(x1, x2, . . . , xp) > 0, then f
satisfies the positivity condition.
Theorem. The sequences (X(t)) and (Y(t)) produced by the two-stage
Gibbs sampler are Markov chains with stationary distributions
fX and fY, respectively. If the positivity condition holds for the joint
PDF f then both chains are strongly irreducible with respect to
their stationary distributions.
Proof. Transition kernel density of (X(t)): κ(x, x′) = ∫ fY|X(y|x) fX|Y(x′|y) dy.
fX is the stationary distribution of (X(t)) since

fX(x′) = ∫ fX|Y(x′|y) fY(y) dy = ∫ fX|Y(x′|y) ∫ fY|X(y|x) fX(x) dx dy
       = ∫ [ ∫ fX|Y(x′|y) fY|X(y|x) dy ] fX(x) dx = ∫ κ(x, x′) fX(x) dx.

If the positivity condition on f holds then fX|Y(x|y) > 0 on the support of f, so
every A ∈ B(X) can be visited in a single iteration of the Gibbs sampler. Thus,
(X(t)) is strongly irreducible.  □
312 / 417
Example. Normal bivariate Gibbs sampler
Target density:

(X, Y) ∼ N2( (0, 0)ᵀ, [[1, %], [%, 1]] ),  |%| < 1.

Marginal distributions: X, Y ∼ N(0, 1).
Conditional distributions:

Y | X = x ∼ N(%x, 1 − %²),  X | Y = y ∼ N(%y, 1 − %²).

Gibbs sampler algorithm
Given x(t),
1. Generate y(t+1) ∼ N(%x(t), 1 − %²).
2. Generate x(t+1) ∼ N(%y(t+1), 1 − %²).

Marginal process (X(t)):

X(t) = %² X(t−1) + εt,  εt ∼ N(0, 1 − %⁴),

an AR(1) process with stationary distribution N(0, 1). Recursion:

X(t) | X(0) = x(0) ∼ N( %^{2t} x(0), 1 − %^{4t} ) →_D N(0, 1) as t → ∞.
313 / 417
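The two conditional draws above translate directly into code; a Python sketch (numpy assumed; `gibbs_binormal` is an illustrative name):

```python
import numpy as np

def gibbs_binormal(rho=0.4, T=5000, seed=5):
    """Two-stage Gibbs sampler for the bivariate normal with correlation rho:
    alternately draw Y | X = x ~ N(rho x, 1 - rho^2)
    and X | Y = y ~ N(rho y, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho * rho)   # conditional standard deviation
    x = 0.0
    xs, ys = np.empty(T), np.empty(T)
    for t in range(T):
        y = rho * x + s * rng.standard_normal()
        x = rho * y + s * rng.standard_normal()
        xs[t], ys[t] = x, y
    return xs, ys

xs, ys = gibbs_binormal()
print(np.corrcoef(xs, ys)[0, 1])  # should be close to rho = 0.4
```

The stored x-values form exactly the AR(1) process X(t) = %² X(t−1) + εt described above.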

Example. Normal bivariate Gibbs sampler

0.4

0.4
0.2 0.3

0.2 0.3
Density

Density
0.1

0.1
0.0

0.0
-4 -2 0 2 4 -4 -2 0 2 4
X Y

X Y
0.8

0.8
ACF

ACF
0.4

0.4
0.0

0 5 10 15 20 25 30 35 0.0 0 5 10 15 20 25 30 35
Lag Lag

Histograms (upper panel) and autocorrelations (lower panel) of samples from


marginal variables X (left) and Y (right) produced by the Gibbs sampler based
on 5000 iterations for % = 0.4.
314 / 417
Example. Normal bivariate Gibbs sampler

Convergence of the empirical covariances to the true covariance of % = 0.4 of


samples from marginal variables X and Y produced by the Gibbs sampler.

Shapiro-Wilk test for joint normality: p = 0.0538.


Kolmogorov-Smirnov normality tests:
X : p = 0.5772; Y : p = 0.1310.
315 / 417
Example. Beta-Binomial combination
Consider
X |Θ = θ ∼ Bin(n, θ), Θ ∼ Be(a, b).
Joint PDF:

f(x, θ) = (n choose x) [Γ(a + b)/(Γ(a)Γ(b))] θ^{x+a−1} (1 − θ)^{n−x+b−1},

where x = 0, 1, . . . , n, θ ∈ [0, 1]. Conditional distributions:

X | Θ = θ ∼ Bin(n, θ),  Θ | X = x ∼ Be(x + a, n − x + b).

Marginal distribution of X: beta-binomial, with

P(X = x) = (n choose x) B(x + a, n − x + b)/B(a, b),  x = 0, 1, . . . , n.

B(α, β): the beta function, defined as

B(α, β) := ∫_0^1 x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β),  α, β > 0.
316 / 417
Example. Beta-Binomial combination
Gibbs sampler with conditional distributions
X |Θ = θ ∼ Bin(n, θ), Θ|X = x ∼ Be(x + a, n − x + b).

Histograms of marginal distributions from the Gibbs sampler based on 10000


iterations for n = 17, a = 4, b = 7. 317 / 417
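A minimal Python sketch of this two-stage sampler; to stay within the standard library, the Bin(n, θ) draw is done as n Bernoulli trials:

```python
import random

def gibbs_beta_binomial(n, a, b, n_iter, seed=1):
    """Alternate X | theta ~ Bin(n, theta) and theta | x ~ Be(x + a, n - x + b)."""
    rng = random.Random(seed)
    theta, xs, thetas = 0.5, [], []
    for _ in range(n_iter):
        x = sum(rng.random() < theta for _ in range(n))  # Bin(n, theta)
        theta = rng.betavariate(x + a, n - x + b)
        xs.append(x)
        thetas.append(theta)
    return xs, thetas

xs, thetas = gibbs_beta_binomial(17, 4, 7, 10000)
mean_x = sum(xs) / len(xs)   # beta-binomial mean n*a/(a+b) = 17*4/11, about 6.18
```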
Back to the slice sampler
Joint distribution to generate from: uniform distribution on
S(fX) := {(x, u) : 0 ≤ u ≤ fX(x)}.
The slice sampler can be considered as a special two-stage Gibbs sampler with conditional densities
fX|U(x|u) = I{0≤u≤fX(x)} / ∫ I{0≤u≤fX(x)} dx, fU|X(u|x) = I{0≤u≤fX(x)} / ∫ I{0≤u≤fX(x)} du.
The generated sequence (X(t)) is a Markov chain with transition kernel density
κ(x, x′) = ∫ fU|X(u|x)fX|U(x′|u) du
and stationary distribution having PDF fX(x).

For any marginal fX(x) a Gibbs sampler can be induced with the help of an arbitrary conditional density g(y|x). Conditional densities:
fX|Y(x|y) = g(y|x)fX(x) / ∫ g(y|x)fX(x) dx, fY|X(y|x) = g(y|x)fX(x) / ∫ g(y|x)fX(x) dy.
The Hammersley-Clifford theorem, recurrence, ergodicity
Conditional distributions contain sufficient information to generate a sample from the joint distribution.
Theorem. The joint distribution associated with the conditional PDFs fY|X(y|x) and fX|Y(x|y) has the joint density
f(x, y) = fY|X(y|x) / ∫ [fY|X(y|x)/fX|Y(x|y)] dy.
Proof. As f(x, y) = fY|X(y|x)fX(x) = fX|Y(x|y)fY(y), we have
∫ [fY|X(y|x)/fX|Y(x|y)] dy = ∫ [fY(y)/fX(x)] dy = 1/fX(x),
implying the result. □
Theorem. Under the positivity condition, if the transition kernel with density
κ((x, y), (x′, y′)) = fY|X(y′|x)fX|Y(x′|y′)
is absolutely continuous with respect to the probability measure π corresponding to f(x, y) for all (x, y), then the chain ((X(t), Y(t))) is Harris recurrent and ergodic with stationary distribution π.
Reversibility and interleaving property
Definition. [Liu, Wong & Kong, 1994] Two Markov chains (X(t)) and (Y(t)) are said to be conjugate to each other with the interleaving property (or interleaved) if
(a) X(t) and X(t+1) are independent conditionally on Y(t);
(b) Y(t−1) and Y(t) are independent conditionally on X(t);
(c) under stationarity (X(t), Y(t−1)) and (X(t), Y(t)) are identically distributed.
Theorem. For two interleaved Markov chains (X(t)) and (Y(t)), if (X(t)) is ergodic (geometrically ergodic), then (Y(t)) is also ergodic (geometrically ergodic).
Theorem. Each of the chains (X(t)) and (Y(t)) generated by a two-stage Gibbs sampler is reversible and the two chains are interleaving.
Remark. The entire chain ((X(t), Y(t))) is not necessarily reversible.
Proof
Transition kernel density of (X(t)): κX(x, x′) = ∫ fY|X(y|x)fX|Y(x′|y) dy. Then
fX(x)κX(x, x′) = fX(x) ∫ fY|X(y|x)fX|Y(x′|y) dy = ∫ f(x, y)fX|Y(x′|y) dy
= ∫ f(x, y) [fY|X(y|x′)fX(x′)/fY(y)] dy = fX(x′)κX(x′, x).
The detailed balance equation holds, so (X(t)) is reversible. A similar argument applies for (Y(t)).

Conditions (a) and (b) follow directly from the generating mechanism of the Gibbs sampler. Under stationarity the joint CDF of (X(t−1), Y(t−1)) is
P(X(t−1) < x, Y(t−1) < y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du.
The joint CDF of (X(t), Y(t−1)) is
P(X(t) < x, Y(t−1) < y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX|Y(u|v)fY(v) dv du.
Since
fX|Y(x|y)fY(y) = f(x, y),
the random vectors (X(t−1), Y(t−1)) and (X(t), Y(t−1)) are identically distributed, implying condition (c). Hence, (X(t)) and (Y(t)) are interleaving. □
Reversible two-stage Gibbs sampler
Algorithm Given y(t),
1. Generate ω ∼ fX|Y(x|y(t)).
2. Generate y(t+1) ∼ fY|X(y|ω).
3. Generate x(t+1) ∼ fX|Y(x|y(t+1)).
Theorem. Under stationarity the chain ((X(t), Y(t))) generated by the above algorithm is reversible.
Proof. The PDF g(x, y, x′, y′) of (X(t), Y(t), X(t+1), Y(t+1)) equals
g(x, y, x′, y′) = f(x, y) ∫ fX|Y(ω|y)fY|X(y′|ω) dω fX|Y(x′|y′)
= [f(x′, y′)/fY(y′)] ∫ fX|Y(ω|y)fY|X(y′|ω) dω fX|Y(x|y)fY(y)
= f(x′, y′) ∫ f(ω, y) [fY|X(y′|ω)/fY(y′)] dω fX|Y(x|y)
= f(x′, y′) ∫ fX|Y(ω|y′)fY|X(y|ω) dω fX|Y(x|y) = g(x′, y′, x, y),
implying reversibility. □
Connection to EM algorithm
X = (X1 , X2 , . . . , Xn ): observed data from joint PDF g (x|θ), where
Z
g (x|θ) = f (x, z|θ)dz.
Z
Z ∈ Z: unobserved (missing) data-vector.
Complete data likelihood: LC (θ|x, z) = f (x, z|θ).
Incomplete data likelihood: L(θ|x) = g (x|θ).
Missing data density: k(z|θ, x) = LC(θ|x, z)/L(θ|x).
L∗(θ|x, z) := LC(θ|x, z) / ∫ LC(θ|x, z) dθ, provided ∫ LC(θ|x, z) dθ < ∞.
Two-stage Gibbs sampler:
Z|Θ = θ ∼ k(z|θ, x), Θ|Z = z ∼ L∗(θ|x, z).
The E-step, which often reduces to the calculation of Eθ[Z|X = x], is replaced by simulation from k(z|θ, x).
The M-step, which is a maximization in θ, is replaced by simulation from L∗(θ|x, z).
Example. Gibbs sampler for censored normal data
Data: X1, X2, . . . , Xn ∼ N(θ, 1) censored at a.
Observed: x = (x1, x2, . . . , xm); unobserved: z = (zm+1, . . . , zn).
Distribution of the missing data (supported on z ≥ a):
Zi|Θ = θ ∼ ϕ(z − θ)/(1 − Φ(a − θ)), i = m + 1, m + 2, . . . , n.
Complete data likelihood:
LC(θ|x, z) ∝ ∏_{j=1}^{m} exp(−(xj − θ)²/2) ∏_{j=m+1}^{n} exp(−(zj − θ)²/2),
which corresponds to
N((mx̄ + (n − m)z̄)/n, 1/n).
L∗(θ|x, z) exists, so the Gibbs sampler can be run.
Example. Gibbs sampler for censored normal data
Conditional distributions:
Zi|Θ = θ ∼ ϕ(z − θ)/(1 − Φ(a − θ)), Θ|Z = z ∼ N((mx̄ + (n − m)z̄)/n, 1/n).

[Figure] Histograms of posterior distributions of Θ and Z corresponding to a N(4, 1) sample censored at a = 4.5. Sample size: n = 40; size of observed data: m = 29; number of iterations of the Gibbs sampler: 5000.
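A Python sketch of this data-augmentation sampler. The observed sample below is simulated as a stand-in for the one used on the slide (true mean 4, censoring point a = 4.5, n = 40), and the left-truncated normal is drawn by inversion using statistics.NormalDist:

```python
import random
from statistics import NormalDist

nd = NormalDist()
rng = random.Random(7)

# Stand-in data: a N(4, 1) sample of size n = 40, censored at a = 4.5
a, n = 4.5, 40
full = [rng.gauss(4.0, 1.0) for _ in range(n)]
obs = [v for v in full if v < a]          # observed part x
m = len(obs)
xbar = sum(obs) / m

def rtrunc_left(theta, a, rng):
    """Inversion sampling from N(theta, 1) truncated to (a, inf)."""
    p = nd.cdf(a - theta)
    return theta + nd.inv_cdf(p + rng.random() * (1.0 - p))

theta, thetas = xbar, []
for _ in range(5000):
    z = [rtrunc_left(theta, a, rng) for _ in range(n - m)]   # impute missing data
    zbar = sum(z) / (n - m)
    theta = rng.gauss((m * xbar + (n - m) * zbar) / n, (1.0 / n) ** 0.5)
    thetas.append(theta)

theta_hat = sum(thetas) / len(thetas)   # concentrates near the true mean 4
```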
Transition kernel
Two-stage Gibbs sampler corresponding to the EM algorithm:
Z|Θ = θ ∼ k(z|θ, x), Θ|Z = z ∼ L∗(θ|x, z).
Transition kernel density of the Markov chain (Θ(t)):
κ(θ, θ′|x) = ∫_Z k(z|θ, x)L∗(θ′|x, z) dz.
Invariant distribution of (Θ(t)): the incomplete data likelihood L(θ|x), that is,
L(θ′|x) = ∫ κ(θ, θ′|x)L(θ|x) dθ.
If LC(θ|x, z) is integrable in θ, then L(θ|x) generates a finite measure and the chain (Θ(t)) is positive.
Remark. The Markov chain (Θ(t)) is ergodic.
EM algorithm: gives the ML estimator of θ from L(θ|x).
Gibbs sampler: produces the entire function L(θ|x).
Multi-stage Gibbs sampler
X = (X1, X2, . . . , Xp) ∈ X: random vector (p > 1) with PDF f(x).
f1, f2, . . . , fp: univariate conditional densities (full conditionals).
Xi|x1, . . . , xi−1, xi+1, . . . , xp ∼ fi(xi|x1, . . . , xi−1, xi+1, . . . , xp),
where Xi|x1, . . . , xi−1, xi+1, . . . , xp denotes
Xi|X1 = x1, . . . , Xi−1 = xi−1, Xi+1 = xi+1, . . . , Xp = xp.
Algorithm
Given x(t) = (x1(t), . . . , xp(t)),
1. Generate x1(t+1) ∼ f1(x1|x2(t), . . . , xp(t)).
2. Generate x2(t+1) ∼ f2(x2|x1(t+1), x3(t), . . . , xp(t)).
...
p. Generate xp(t+1) ∼ fp(xp|x1(t+1), . . . , xp−1(t+1)).
All simulations can be univariate.
Example. Autoexponential model
Joint PDF [Besag, 1974]:
f(x1, x2, x3) ∝ exp(−(x1 + x2 + x3 + θ12x1x2 + θ23x2x3 + θ31x3x1)),
for x1, x2, x3 > 0, where θij > 0 are known.
Full conditional densities are exponential, easy to simulate. E.g.
X3|x1, x2 ∼ Exp(1 + θ23x2 + θ31x1).
Marginals and conditionals are more complicated:
f(x1, x2) ∝ exp(−(x1 + x2 + θ12x1x2)) / (1 + θ23x2 + θ31x1),
f(x1) ∝ ∫₀^∞ [exp(−(x1 + x2 + θ12x1x2)) / (1 + θ23x2 + θ31x1)] dx2.
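The three exponential full conditionals give a systematic-scan Gibbs sampler in a few lines. A Python sketch (the interaction values θij = 0.5 and the starting point are illustrative):

```python
import random

def gibbs_autoexp(t12, t23, t31, n_iter, seed=3):
    """Systematic-scan Gibbs sampler for the trivariate autoexponential model."""
    rng = random.Random(seed)
    x1 = x2 = x3 = 1.0
    draws = []
    for _ in range(n_iter):
        x1 = rng.expovariate(1.0 + t12 * x2 + t31 * x3)  # X1 | x2, x3
        x2 = rng.expovariate(1.0 + t12 * x1 + t23 * x3)  # X2 | x1, x3
        x3 = rng.expovariate(1.0 + t23 * x2 + t31 * x1)  # X3 | x1, x2
        draws.append((x1, x2, x3))
    return draws

draws = gibbs_autoexp(0.5, 0.5, 0.5, 20000)
m1 = sum(d[0] for d in draws) / len(draws)   # E[X1] < 1 since all conditional rates >= 1
```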
Example. Ising model
X: D × D binary matrices with entries +1 or −1.
Joint PDF (Boltzmann distribution):
f(x) ∝ exp(−J Σ_{(i,j)∈N} xixj − H Σ_i xi), xi ∈ {−1, 1}.
xi: the ith entry of the matrix x ∈ X.
N: neighbourhood structure. E.g. (i, j) ∈ N if they are neighbours horizontally or vertically.
The first sum is taken over all neighbouring pairs (i, j).
Full conditionals:
f(xi|xj, j ≠ i) = exp(−Hxi − Jxi Σ_{j:(i,j)∈N} xj) / [exp(−H − J Σ_{j:(i,j)∈N} xj) + exp(H + J Σ_{j:(i,j)∈N} xj)]
= exp(−(H + J Σ_{j:(i,j)∈N} xj)(xi + 1)) / [1 + exp(−2(H + J Σ_{j:(i,j)∈N} xj))].
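With s = Σ_{j:(i,j)∈N} xj, the last formula gives P(xi = +1 | rest) = 1/(1 + exp(2(H + Js))), so one systematic Gibbs sweep over the grid is immediate. A Python sketch (free boundary; grid size and the weak coupling J = −0.2, H = 0 are illustrative, and with this sign convention J < 0 favours equal neighbours):

```python
import math, random

def ising_gibbs_sweep(x, J, H, rng):
    """One systematic Gibbs sweep for f(x) ∝ exp(-J Σ x_i x_j - H Σ x_i)."""
    D = len(x)
    for i in range(D):
        for j in range(D):
            s = 0                          # sum over the 4-neighbourhood
            if i > 0: s += x[i - 1][j]
            if i < D - 1: s += x[i + 1][j]
            if j > 0: s += x[i][j - 1]
            if j < D - 1: s += x[i][j + 1]
            p_plus = 1.0 / (1.0 + math.exp(2.0 * (H + J * s)))  # P(x_ij = +1 | rest)
            x[i][j] = 1 if rng.random() < p_plus else -1

rng = random.Random(0)
D = 16
x = [[rng.choice([-1, 1]) for _ in range(D)] for _ in range(D)]
for _ in range(50):
    ising_gibbs_sweep(x, J=-0.2, H=0.0, rng=rng)
mag = sum(sum(row) for row in x) / D**2   # mean spin; near 0 for weak coupling
```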
Completion
Definition. Given a PDF f, a density g that satisfies
∫_Z g(x, z) dz = f(x)
is called a completion of f.
g is chosen so that the full conditionals are easy to simulate from.
Instead of f(x), the Gibbs sampler is implemented on the PDF g(y) of Y = (Y1, Y2, . . . , Yp), where y = (x, z) ∈ Y ⊆ Rp. Full conditionals:
Yi|y1, . . . , yi−1, yi+1, . . . , yp ∼ gi(yi|y1, . . . , yi−1, yi+1, . . . , yp).
Algorithm
Given y(t) = (y1(t), . . . , yp(t)),
1. Generate y1(t+1) ∼ g1(y1|y2(t), . . . , yp(t)).
2. Generate y2(t+1) ∼ g2(y2|y1(t+1), y3(t), . . . , yp(t)).
...
p. Generate yp(t+1) ∼ gp(yp|y1(t+1), . . . , yp−1(t+1)).
Example. Cauchy-Normal posterior
Target PDF to simulate from:
f(θ) ∝ exp(−θ²/2) [1 + (θ − θ0)²]^{−ν}, ν > 0, θ ∈ R.
f(θ) is the posterior distribution at x = 0 corresponding to the model
X|Θ = θ ∼ N(θ, 1), Θ ∼ C(θ0, 1).
f(θ) can be considered as a marginal corresponding to the completion
g(θ, z) ∝ e^{−θ²/2} e^{−[1+(θ−θ0)²]z/2} z^{ν−1}.
Full conditionals:
g1(θ|z) = √((1 + z)/(2π)) exp(−[(1 + z)/2] (θ − zθ0/(1 + z))²),
g2(z|θ) = {[1 + (θ − θ0)²]/2}^ν [z^{ν−1}/Γ(ν)] exp(−[1 + (θ − θ0)²]z/2).
Conditional distributions:
Z|Θ = θ ∼ Ga(ν, [1 + (θ − θ0)²]/2), Θ|Z = z ∼ N(zθ0/(1 + z), 1/(1 + z)).
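These two conditionals give an immediate completion Gibbs sampler. A Python sketch (θ0 = 3 and ν = 1 are illustrative; note that random.gammavariate is parametrized by shape and scale, so the Ga(ν, rate) draw uses scale = 1/rate):

```python
import random

def gibbs_cauchy_normal(theta0, nu, n_iter, seed=11):
    """Completion Gibbs sampler for f(theta) ∝ exp(-theta^2/2)[1+(theta-theta0)^2]^(-nu)."""
    rng = random.Random(seed)
    theta, thetas = theta0, []
    for _ in range(n_iter):
        rate = (1.0 + (theta - theta0) ** 2) / 2.0
        z = rng.gammavariate(nu, 1.0 / rate)   # Z | theta ~ Ga(nu, rate)
        theta = rng.gauss(z * theta0 / (1.0 + z), (1.0 / (1.0 + z)) ** 0.5)
        thetas.append(theta)
    return thetas

thetas = gibbs_cauchy_normal(theta0=3.0, nu=1.0, n_iter=20000)
m = sum(thetas) / len(thetas)   # between the likelihood centre 0 and the prior mode 3
```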
Reversible Gibbs sampler
The Gibbs sampler considered so far: Gibbs sampler with systematic scan. The obtained Markov chain is not reversible.
Modification: Gibbs sampler with symmetric scan. Results in a reversible Markov chain (Y(t)).
Algorithm
Given (y1(t), . . . , yp(t)),
1. Generate y1∗ ∼ g1(y1|y2(t), . . . , yp(t)).
2. Generate y2∗ ∼ g2(y2|y1∗, y3(t), . . . , yp(t)).
...
p−1. Generate yp−1∗ ∼ gp−1(yp−1|y1∗, . . . , yp−2∗, yp(t)).
p. Generate yp(t+1) ∼ gp(yp|y1∗, . . . , yp−1∗).
p+1. Generate yp−1(t+1) ∼ gp−1(yp−1|y1∗, . . . , yp−2∗, yp(t+1)).
...
2p−1. Generate y1(t+1) ∼ g1(y1|y2(t+1), . . . , yp(t+1)).
Random scan Gibbs sampler
The reversible Gibbs sampler requires 2p − 1 random generations for a single step of (Y(t)), and the first p − 1 values are not used.
Alternative: choose a random order of updating the coordinates [Geman & Geman, 1984; Liu, Wong & Kong, 1995; Johnson, 2009].
Algorithm
Given (y1(t), . . . , yp(t)),
1. Generate a permutation σ = (σ1, σ2, . . . , σp) of {1, 2, . . . , p}.
2. Generate yσ1(t+1) ∼ gσ1(yσ1|yj(t), j ≠ σ1).
...
p+1. Generate yσp(t+1) ∼ gσp(yσp|yj(t+1), j ≠ σp).
The algorithm results in a reversible Markov chain (Y(t)) with stationary distribution f.
The general Hammersley-Clifford theorem
Theorem. Under the positivity condition the joint PDF g satisfies
g(y1, y2, . . . , yp) ∝ ∏_{j=1}^{p} gσj(yσj|yσ1, . . . , yσj−1, y′σj+1, . . . , y′σp) / gσj(y′σj|yσ1, . . . , yσj−1, y′σj+1, . . . , y′σp)
for every permutation σ = (σ1, σ2, . . . , σp) of {1, 2, . . . , p} and y′ = (y′1, y′2, . . . , y′p) ∈ Y.
Proof. Denote
g(j)(y1, . . . , yj−1, yj+1, . . . , yp) := ∫ g(y1, y2, . . . , yp) dyj, j = 1, 2, . . . , p.
For a given y′ ∈ Y we have
g(y1, y2, . . . , yp) = gp(yp|y1, . . . , yp−1) g(p)(y1, . . . , yp−1)
= [gp(yp|y1, . . . , yp−1)/gp(y′p|y1, . . . , yp−1)] g(y1, . . . , yp−1, y′p)
= [gp(yp|y1, . . . , yp−1)/gp(y′p|y1, . . . , yp−1)] · [gp−1(yp−1|y1, . . . , yp−2, y′p)/gp−1(y′p−1|y1, . . . , yp−2, y′p)] g(y1, . . . , yp−2, y′p−1, y′p)
= ∏_{j=1}^{p} [gj(yj|y1, . . . , yj−1, y′j+1, . . . , y′p)/gj(y′j|y1, . . . , yj−1, y′j+1, . . . , y′p)] g(y′1, y′2, . . . , y′p).
The same reasoning is valid for an arbitrary permutation σ. □
Markov properties
Consider the completion Gibbs sampler with PDFs f(x) and g(y) = g(x, z) satisfying
∫_Z g(x, z) dz = f(x).
The Gibbs sampler produces a Markov chain (Y(t)).
(X(t)): marginal subchain corresponding to the marginal density f(x) of g(y).
Theorem. The density g is the PDF of the stationary distribution for the Markov chain (Y(t)) produced by the completion Gibbs sampler, and if (Y(t)) is ergodic, then f is the PDF of the limiting distribution of the subchain (X(t)).
Proof. Transition kernel density of (Y(t)):
κ(y, y′) = g1(y′1|y2, . . . , yp) g2(y′2|y′1, y3, . . . , yp) · · · gp(y′p|y′1, . . . , y′p−1).
Denote
g(j)(y1, . . . , yj−1, yj+1, . . . , yp) := ∫ g(y1, y2, . . . , yp) dyj, j = 1, 2, . . . , p.
Proof
Let Y(t) ∼ g and A be measurable. Then
P(Y(t+1) ∈ A) = ∫ IA(y′)κ(y, y′)g(y) dy dy′
= ∫ IA(y′) [g1(y′1|y2, . . . , yp) · · · gp(y′p|y′1, . . . , y′p−1)] × [g1(y1|y2, . . . , yp)g(1)(y2, . . . , yp)] dy1 . . . dyp dy′1 . . . dy′p.
Integrating out with respect to y1 and using
g1(y′1|y2, . . . , yp)g(1)(y2, . . . , yp) = g(y′1, y2, . . . , yp)
we obtain
P(Y(t+1) ∈ A) = ∫ IA(y′) [g2(y′2|y′1, y3, . . . , yp) · · · gp(y′p|y′1, . . . , y′p−1)] g(y′1, y2, . . . , yp) dy2 . . . dyp dy′1 . . . dy′p
= ∫ IA(y′) [g2(y′2|y′1, y3, . . . , yp) · · · gp(y′p|y′1, . . . , y′p−1)] × [g2(y2|y′1, y3, . . . , yp)g(2)(y′1, y3, . . . , yp)] dy2 . . . dyp dy′1 . . . dy′p.
Proof
Integrating out now with respect to y2 and using
g2(y′2|y′1, y3, . . . , yp)g(2)(y′1, y3, . . . , yp) = g(y′1, y′2, y3, . . . , yp)
we obtain
P(Y(t+1) ∈ A) = ∫ IA(y′) [g3(y′3|y′1, y′2, y4, . . . , yp) · · · gp(y′p|y′1, . . . , y′p−1)] g(y′1, y′2, y3, . . . , yp) dy3 . . . dyp dy′1 . . . dy′p.
Finally,
P(Y(t+1) ∈ A) = ∫_A g(y′1, y′2, . . . , y′p) dy′1 dy′2 . . . dy′p,
so g is the PDF of the stationary distribution of (Y(t)).
If (Y(t)) is ergodic, then g is the PDF of the limiting distribution of Y(t) as t → ∞. As (X(t)) is the marginal subchain of (Y(t)), the limiting distribution of (X(t)) has PDF f. □
Recurrence, ergodicity
Theorem. If the joint PDF g corresponding to the completion Gibbs sampler satisfies the positivity condition, then the generated Markov chain is irreducible with respect to the invariant measure.
Theorem. Under the positivity condition, if the transition kernel of the completion Gibbs sampler is absolutely continuous with respect to the invariant measure corresponding to g, then the generated Markov chain is Harris recurrent and ergodic.
Theorem. Assume the transition kernel of the Gibbs chain (Y(t)) is absolutely continuous with respect to the invariant measure π with PDF g.
(a) If h1, h2 ∈ L1(π) with ∫ h2(y)π(dy) ≠ 0, then
lim_{T→∞} [Σ_{t=1}^{T} h1(Y(t))] / [Σ_{t=1}^{T} h2(Y(t))] = ∫ h1(y)π(dy) / ∫ h2(y)π(dy)  π-a.e.
(b) If (Y(t)) is aperiodic, then for every initial distribution µ
lim_{n→∞} ‖∫_Y K^n(y, ·)µ(dy) − π‖ = 0.
Connection to Metropolis-Hastings
Theorem. The p-dimensional completion Gibbs sampler is equivalent to the composition of p Metropolis-Hastings algorithms with acceptance probabilities uniformly equal to 1.
Proof. Let i ∈ {1, 2, . . . , p}. The ith step of the Gibbs sampler from the vector y = (y1, y2, . . . , yp) generates y′ = (y′1, y′2, . . . , y′p) using the candidate density
qi(y′|y) = δ_{(y1,...,yi−1,yi+1,...,yp)}(y′1, . . . , y′i−1, y′i+1, . . . , y′p) × qi(y′i|y1, . . . , yi−1, yi+1, . . . , yp),
where qi(·|y1, . . . , yi−1, yi+1, . . . , yp) is the full conditional gi(·|y1, . . . , yi−1, yi+1, . . . , yp). Acceptance probability:
%(y, y′) = [g(y′)qi(yi|y1, . . . , yi−1, yi+1, . . . , yp)] / [g(y)qi(y′i|y1, . . . , yi−1, yi+1, . . . , yp)]
= [qi(y′i|y1, . . . , yi−1, yi+1, . . . , yp)qi(yi|y1, . . . , yi−1, yi+1, . . . , yp)] / [qi(yi|y1, . . . , yi−1, yi+1, . . . , yp)qi(y′i|y1, . . . , yi−1, yi+1, . . . , yp)] = 1. □
Remark. The (global) acceptance probability of a vector y is not necessarily 1.
Example. Autoexponential model
Joint PDF of the bivariate autoexponential model:
g(y1, y2) ∝ exp(−y1 − y2 − θ12y1y2), y1, y2 > 0.
Full conditionals:
g1(y1|y2) = (1 + θ12y2) exp(−(1 + θ12y2)y1),
g2(y2|y1) = (1 + θ12y1) exp(−(1 + θ12y1)y2).
Transition kernel density:
κ((y1, y2), (y′1, y′2)) = g1(y′1|y2)g2(y′2|y′1).
Acceptance probability:
%((y1, y2), (y′1, y′2)) = [g(y′1, y′2)κ((y′1, y′2), (y1, y2))] / [g(y1, y2)κ((y1, y2), (y′1, y′2))]
= [(1 + θ12y′2)(1 + θ12y1)] / [(1 + θ12y2)(1 + θ12y′1)] · exp(−θ12(y′2y1 − y′1y2)).
% is different from 1 for almost every vector (y1, y2, y′1, y′2).
Case study. Bivariate truncated normal
f(y1, y2): PDF of the bivariate normal distribution N2(µ, Σ) with constraint y1 ≥ 0 and
µ = (µ1, µ2)ᵀ, Σ = (σ1² σ12; σ12 σ2²).
For y1, y2 ∈ R, y1 ≥ 0,
f(y1, y2) = exp(−½ (y1 − µ1, y2 − µ2) Σ⁻¹ (y1 − µ1, y2 − µ2)ᵀ) / (2π √det(Σ) Φ(µ1/σ1)).
Simplest method: generate a pair (Y1, Y2) ∼ N2(µ, Σ) and accept if Y1 ≥ 0. Acceptance probability: Φ(µ1/σ1).
R package: mnormt; function: rmnorm.
The higher the ratio µ1/σ1, the better the acceptance rate.
Example.
Sample size: 10000; µ1 = 1, µ2 = 2, σ1² = 1, σ2² = 2, σ12 = 0.8.
Acceptance probability: 0.8413; acceptance rate: 0.8424.
Case study. Bivariate truncated normal
Marginals:
f(y1) = (1/σ1) ϕ((y1 − µ1)/σ1) / Φ(µ1/σ1), y1 > 0;
f(y2) = (1/σ2) ϕ((y2 − µ2)/σ2) Φ( (y2 − (µ2 − µ1σ2²/σ12)) / (σ2(σ1²σ2² − σ12²)^{1/2}/σ12) ) / Φ(µ1/σ1), y2 ∈ R.

[Figure] Histograms of marginals Y1 and Y2 of a sample of size 10000 from a bivariate normal under constraint Y1 ≥ 0. Rejection sampling using a non-truncated bivariate normal.
Case study. Bivariate truncated normal
Negative values of µ1 yield very low acceptance rates.
Example. µ1 = −1, µ2 = 2, σ1² = 1, σ2² = 2, σ12 = 0.8.
Sample size: 10000; number of generated values: 63633.
Acceptance probability: 0.1587; acceptance rate: 0.1576.

Ratio µ1/σ1      −0.5    −1      −1.5    −2      −2.5    −3
Acc. probability 0.3085  0.1587  0.0668  0.0228  0.0062  0.0013

Gibbs sampler for truncated normal vectors [Robert, 1995].
Full conditionals:
Y1|Y2 = y2 ∼ N0(µ1 + (σ12/σ2²)(y2 − µ2), (σ1²σ2² − σ12²)/σ2²);
Y2|Y1 = y1 ∼ N(µ2 + (σ12/σ1²)(y1 − µ1), (σ1²σ2² − σ12²)/σ1²).
Reduces to the problem of simulation from a univariate truncated normal distribution.
Case study. Bivariate truncated normal
Simulation from the truncated normal N0(µ, σ²). PDF:
f(x) := [1/(σΦ(µ/σ))] ϕ((x − µ)/σ) I{x>0}.
1. Inversion method.
U ∼ U(0, 1), X = µ + σΦ⁻¹(1 − UΦ(µ/σ)) ∼ N0(µ, σ²).
R functions: qnorm and pnorm.
Negative values of µ/σ cause problems. E.g. if µ/σ = −5.5:
pnorm(-5.5): 1.8990×10⁻⁸; 1-pnorm(-5.5): 1; qnorm(1): ∞.
2. Accept-Reject method.
Candidate: Exp(λ) with PDF g(x) := λe^{−λx} I{x>0}. Upper bound:
f(x)/g(x) ≤ [√(2π) σ Φ(µ/σ) λ]⁻¹ exp(λµ + λ²σ²/2).
Optimal choice of the parameter λ: λ∗ := (−µ + √(µ² + 4σ²)) / (2σ²).
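Both generators can be sketched in Python (statistics.NormalDist plays the role of R's qnorm/pnorm here). The accept-reject test uses the simplification f(x)/(Mg(x)) = exp(−(x − µ − λσ²)²/(2σ²)), obtained by completing the square in the bound above:

```python
import math, random
from statistics import NormalDist

nd = NormalDist()

def rtrunc_inversion(mu, sigma, rng):
    """N0(mu, sigma^2) by inversion: X = mu + sigma * Phi^(-1)(1 - U * Phi(mu/sigma))."""
    u = max(rng.random(), 1e-12)   # keep the inv_cdf argument strictly inside (0, 1)
    return mu + sigma * nd.inv_cdf(1.0 - u * nd.cdf(mu / sigma))

def rtrunc_ar(mu, sigma, rng):
    """N0(mu, sigma^2) by accept-reject from an Exp(lambda*) candidate."""
    lam = (-mu + math.sqrt(mu * mu + 4.0 * sigma * sigma)) / (2.0 * sigma * sigma)
    shift = mu + lam * sigma * sigma          # maximizer of the ratio f/g
    while True:
        x = rng.expovariate(lam)
        if rng.random() <= math.exp(-((x - shift) ** 2) / (2.0 * sigma * sigma)):
            return x

rng = random.Random(5)
s_inv = [rtrunc_inversion(1.0, 1.0, rng) for _ in range(20000)]   # mu/sigma = 1: stable
s_ar = [rtrunc_ar(-5.0, 1.0, rng) for _ in range(20000)]          # mu/sigma = -5
m_inv = sum(s_inv) / len(s_inv)   # theoretical mean about 1.288
m_ar = sum(s_ar) / len(s_ar)      # theoretical mean about 0.187
```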
Case study. Bivariate truncated normal
Example. µ1 = −5, µ2 = 2, σ1² = 1, σ2² = 2, σ12 = 0.8.
Acceptance probability for rejection sampling: 2.8665×10⁻⁷.

[Figure] Histograms of marginals Y1 and Y2 of a sample of size 10000 from a bivariate normal under constraint Y1 ≥ 0. Gibbs sampler with mixed generation (inversion method and Accept-Reject) of univariate truncated normals.
Hierarchical models
The Gibbs sampler is a natural way of sampling from hierarchical models, where the target density f can be decomposed as
f(x) = ∫ f1(x|z1)f2(z1|z2) . . . fr(zr|zr+1)fr+1(zr+1) dz1 . . . dzr+1.
Example. Two-level hierarchy.
X1, X2, . . . , Xn: sample. Θ = (Θ1, Θ2, . . . , Θp) and Γ = (Γ1, Γ2, . . . , Γs).
Xi|Θ = θ ∼ fi(x|θ), i = 1, 2, . . . , n, θ = (θ1, θ2, . . . , θp);
Θj|Γ = γ ∼ πj(θ|γ), j = 1, 2, . . . , p, γ = (γ1, γ2, . . . , γs);
Γk ∼ h(γ), k = 1, 2, . . . , s.
Joint PDF:
g(x, θ, γ) = ∏_{i=1}^{n} fi(xi|θ) ∏_{j=1}^{p} πj(θj|γ) ∏_{k=1}^{s} h(γk).
Given a realization x = (x1, x2, . . . , xn) of the sample, the full conditionals satisfy
Θj ∝ πj(θj|γ) ∏_{i=1}^{n} fi(xi|θ), j = 1, 2, . . . , p,
Γk ∝ h(γk) ∏_{j=1}^{p} πj(θj|γ), k = 1, 2, . . . , s.
Classical example. Nuclear pump failures
Data: number of failures and times of observation of 10 pumps in the nuclear plant Farley 1 (Unit 1, Joseph M. Farley Nuclear Generating Station, Dothan, Alabama)
[Gaver & O'Muircheartaigh, 1987; Gaver & Lehoczky, 1987].

Pump      1      2      3      4       5     6      7     8     9     10
Failures  5      1      5      14      3     19     1     1     4     22
Time      94.32  15.72  62.88  125.76  5.24  31.44  1.05  1.05  2.10  10.48

The number of failures of the ith pump follows a Poisson process with parameter λi, i = 1, 2, . . . , 10.
The number Xi of failures for an observation time ti: P(λiti).
Prior distributions: independent Gamma.
Hierarchical model:
Xi ∼ P(λiti), i = 1, 2, . . . , 10;
λj ∼ Ga(α, β), j = 1, 2, . . . , 10;
β ∼ Ga(γ, δ).
α, γ, δ: fixed hyperparameters.
Values: α = 1.8, γ = 0.01, δ = 1 [Gelfand & Smith, 1990].
Classical example. Nuclear pump failures
Joint posterior PDF:
g(λ1, . . . , λ10, β|t1, . . . , t10, x1, . . . , x10)
∝ ∏_{i=1}^{10} [(λiti)^{xi} e^{−λiti} λi^{α−1} e^{−βλi}] β^{10α} β^{γ−1} e^{−δβ}
∝ ∏_{i=1}^{10} [λi^{xi+α−1} e^{−λi(ti+β)}] β^{10α+γ−1} e^{−δβ}.
Full conditionals:
λi|β, ti, xi ∼ Ga(xi + α, ti + β), i = 1, 2, . . . , 10;
β|λ1, . . . , λ10 ∼ Ga(γ + 10α, δ + Σ_{j=1}^{10} λj).
Remark. The Markov chains (β(t)) and (λ(t)) generated by the Gibbs sampler, where λ(t) = (λ1(t), . . . , λ10(t)), are uniformly ergodic.
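The two conjugate conditionals give a direct Gibbs sampler on the pump data above. A Python sketch (random.gammavariate takes shape and scale, so the rate parameters ti + β and δ + Σλj enter as reciprocals; seed and starting value are illustrative):

```python
import random

# Farley-1 pump data (failures x_i and observation times t_i) from the slide
fail = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]
time = [94.32, 15.72, 62.88, 125.76, 5.24, 31.44, 1.05, 1.05, 2.10, 10.48]
alpha, gamma, delta = 1.8, 0.01, 1.0

rng = random.Random(2024)
beta, betas, lam1s = 1.0, [], []
for _ in range(5000):
    # lambda_i | beta ~ Ga(x_i + alpha, rate t_i + beta)
    lam = [rng.gammavariate(x + alpha, 1.0 / (t + beta)) for x, t in zip(fail, time)]
    # beta | lambda ~ Ga(gamma + 10*alpha, rate delta + sum(lambda))
    beta = rng.gammavariate(gamma + 10 * alpha, 1.0 / (delta + sum(lam)))
    betas.append(beta)
    lam1s.append(lam[0])

lam1_hat = sum(lam1s) / len(lam1s)   # posterior mean of lambda_1, around 0.07
beta_hat = sum(betas) / len(betas)   # posterior mean of beta, around 2.4
```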
Classical example. Nuclear pump failures
Transition kernel density for the chain (β(t)):
κ(β, β′) = ∫ [(δ + Σ_{j=1}^{10} λj)^{γ+10α} / Γ(γ + 10α)] (β′)^{γ+10α−1} exp(−β′(δ + Σ_{j=1}^{10} λj))
× ∏_{i=1}^{10} [(ti + β)^{xi+α} / Γ(xi + α)] λi^{xi+α−1} exp(−λi(ti + β)) dλ1 . . . dλ10
≥ [δ^{γ+10α} / Γ(γ + 10α)] (β′)^{γ+10α−1} exp(−β′δ) ∏_{i=1}^{10} (ti/(ti + β′))^{xi+α}.
The entire space X = R+ is a small set for the transition kernel with density κ(β, β′): (β(t)) is uniformly ergodic.
Uniform ergodicity of (λ(t)) can be derived from the uniform ergodicity of (β(t)).
Classical example. Nuclear pump failures

[Figure] Histograms (upper panel) and autocorrelations (lower panel) of samples from marginal distributions of λ1, λ2 and β produced by the Gibbs sampler based on 5000 iterations. Hyperparameter values: α = 1.8, γ = 0.01, δ = 1.
Gibbs sampler vs. Metropolis-Hastings
Gibbs sampler:
I Conditional densities are derived from the target density f.
I All generated values are accepted.
I Changes only one coordinate at a time. The decomposition of f given a particular coordinate system does not necessarily agree with the form of f.
I Generally converges rather slowly. Might be attracted to the closest local mode of f. Has difficulties in exploring supp f.

Metropolis-Hastings:
I Candidate densities, at best, are approximations of the target density f.
I Some generated values might be rejected. If the candidate highly differs (e.g. in spread) from the target, the performance is poor.
I More flexibility in generating new values.
Example. Bivariate normal mixture
Target distribution: mixture of bivariate normals
(Y1, Y2)ᵀ ∼ p1N2(µ1, Σ1) + p2N2(µ2, Σ2) + p3N2(µ3, Σ3),
µi = (µi,1, µi,2)ᵀ, Σi = (σi,1² σi,12; σi,12 σi,2²), i = 1, 2, 3.
Marginals:
Yk ∼ p1N(µ1,k, σ1,k²) + p2N(µ2,k, σ2,k²) + p3N(µ3,k, σ3,k²), k = 1, 2.
Full conditionals:
Y1|Y2 = y2 ∼ Σ_{j=1}^{3} qj,2(y2) N(µj,1 + (σj,12/σj,2²)(y2 − µj,2), det(Σj)/σj,2²),
Y2|Y1 = y1 ∼ Σ_{j=1}^{3} qj,1(y1) N(µj,2 + (σj,12/σj,1²)(y1 − µj,1), det(Σj)/σj,1²),
where
qi,k(y) := [pi/σi,k ϕ((y − µi,k)/σi,k)] / [Σ_{j=1}^{3} pj/σj,k ϕ((y − µj,k)/σj,k)], i = 1, 2, 3, k = 1, 2.
Example. Bivariate normal mixture
Weights: p1 = 0.3, p2 = 0.45, p3 = 0.25.
Means: µ1 = [−4, 4]ᵀ, µ2 = [−1, −2]ᵀ, µ3 = [0, 0]ᵀ.
Covariance matrices:
Σ1 = (1 −0.5; −0.5 0.3), Σ2 = (1 1.2; 1.2 1.5), Σ3 = (1 −0.7; −0.7 0.6).

[Figure] Histograms of marginals Y1 and Y2 of a sample of size 10000 from a mixture of bivariate normals produced by the Gibbs sampler.
Example. Bivariate normal mixture
[Figure] The first 300 successive moves of the Gibbs chain on the surface of the target distribution.
Example. Bivariate normal mixture
[Figure] 10000 points generated by the Gibbs sampler on the surface of the target distribution.
Example. Bad choice of coordinates for the Gibbs sampler
E, E′: disks in R² with radius 1 and centers (1, 1) and (−1, −1).
f: uniform PDF on E ∪ E′, that is,
f(y1, y2) = (1/2π)[IE(y1, y2) + IE′(y1, y2)].
Full conditionals:
Y1|Y2 = y2 ∼ U(−1 − √(1 − (1 + y2)²), −1 + √(1 − (1 + y2)²)) if y2 < 0;
Y1|Y2 = y2 ∼ U(1 − √(1 − (1 − y2)²), 1 + √(1 − (1 − y2)²)) if y2 ≥ 0;
Y2|Y1 = y1 ∼ U(−1 − √(1 − (1 + y1)²), −1 + √(1 − (1 + y1)²)) if y1 < 0;
Y2|Y1 = y1 ∼ U(1 − √(1 − (1 − y1)²), 1 + √(1 − (1 − y1)²)) if y1 ≥ 0.
The Markov chain produced by the Gibbs sampler is not irreducible. It remains in the quadrant in which it is initialized.
Change of coordinates z1 := y1 + y2 and z2 := y1 − y2 solves the problem.
Mixtures, cycles
Joint application of the Gibbs sampler and the Metropolis-Hastings algorithm might combine the advantages of both methods.
Hybrid MCMC approach: Tierney [1994].
Definition. A hybrid MCMC algorithm is a Markov chain Monte Carlo method which simultaneously utilizes both Gibbs sampling steps and Metropolis-Hastings steps. If K1, K2, . . . , Kn are kernels which correspond to these different steps and if (α1, α2, . . . , αn) is a probability distribution, a mixture of K1, K2, . . . , Kn is an algorithm associated with the kernel
K̃ = α1K1 + α2K2 + . . . + αnKn
and a cycle of K1, K2, . . . , Kn is an algorithm with kernel
K∗ = K1 ◦ K2 ◦ . . . ◦ Kn,
where "◦" denotes the composition of kernels.
Remark. Gibbs sampler in p dimensions: a cycle of p Metropolis-Hastings kernels.
Remarks
I One can increase the speed of convergence of the Gibbs sampler if, e.g., in every mth iteration the Gibbs generation is replaced with a Metropolis-Hastings step corresponding to a candidate with larger dispersion.
Other approach: in each step use a Metropolis-Hastings generation mechanism with probability 1/m.
Mixing helps against being trapped around a local mode of the PDF corresponding to the stationary distribution.
I If one of the kernels in the mixture is irreducible or aperiodic, then the same holds for the mixture.
I If all kernels have the same stationary distribution π, then both their mixture and the corresponding cycle have π as stationary distribution.
I The irreducibility of a cycle K∗ forming a Gibbs sampler does not require the irreducibility of the components (e.g. Gibbs as a cycle of Metropolis-Hastings steps).
Ergodicity of mixtures and cycles
Theorem. [Tierney, 1994] If K1 and K2 are two kernels with the same stationary distribution π and if K1 produces a uniformly ergodic Markov chain, the mixture kernel
K̃ = αK1 + (1 − α)K2 (0 < α < 1)
is also uniformly ergodic. Moreover, if X is a small set for K1 with m = 1, the kernel cycles K1 ◦ K2 and K2 ◦ K1 are uniformly ergodic.
Proof. K1 produces a uniformly ergodic Markov chain if and only if X is small: there exist m ∈ N, εm > 0 and a probability measure νm such that
K1^m(x, A) ≥ εmνm(A) ∀A ∈ B(X), ∀x ∈ X.
For K̃ one has
(αK1 + (1 − α)K2)^m (x, A) ≥ α^m K1^m(x, A) ≥ α^m εmνm(A),
so X is small for K̃, too. Further, if X is small for K1 with m = 1, then
(K1 ◦ K2)(x, A) = ∫_X K2(x, dy)K1(y, A) ≥ ε1ν1(A) ∫_X K2(x, dy) = ε1ν1(A),
(K2 ◦ K1)(x, A) = ∫_X K1(x, dy)K2(y, A) ≥ ε1 ∫_X ν1(dy)K2(y, A) = ε1(K2 ◦ ν1)(A).
X is small both for K1 ◦ K2 and K2 ◦ K1. □
Hybrid MCMC
Metropolis-Hastings steps can be added at elementary levels of Gibbs sampling. E.g., if it is difficult to simulate from the conditional PDF gi(yi|yj, j ≠ i), replace this step with a simulation from an instrumental PDF qi.
Algorithm
For i = 1, 2, . . . , p, given (y1(t+1), . . . , yi−1(t+1), yi(t), . . . , yp(t)),
1. Simulate ỹi ∼ qi(yi|y1(t+1), . . . , yi−1(t+1), yi(t), . . . , yp(t)).
2. Take yi(t+1) = ỹi with probability %, and yi(t+1) = yi(t) with probability 1 − %, where
% := 1 ∧ [ gi(ỹi|y1(t+1), . . . , yi−1(t+1), yi+1(t), . . . , yp(t)) / qi(ỹi|y1(t+1), . . . , yi−1(t+1), yi(t), yi+1(t), . . . , yp(t)) ]
/ [ gi(yi(t)|y1(t+1), . . . , yi−1(t+1), yi+1(t), . . . , yp(t)) / qi(yi(t)|y1(t+1), . . . , yi−1(t+1), ỹi, yi+1(t), . . . , yp(t)) ].
Example. ARCH model
Gaussian Autoregressive Conditionally Heteroscedastic (ARCH) model:
Zt = (1 + βZt−1²)^{1/2} δt, δt ∼ N(0, 1), Z1 ∼ N(0, 1),
Xt = aZt + εt, εt ∼ Np(0, σ²Ip).
θ := (a, β, σ): model parameter, a ∈ Rp, 0 < β, σ ∈ R.
x = (x1, x2, . . . , xT)ᵀ: observed values of X1, X2, . . . , XT.
δt, εt: i.i.d. errors, independent from each other.
Z = (Z1, Z2, . . . , ZT)ᵀ: missing data.
Conditional density of the missing data:
g(zᵀ|xᵀ, θ) ∝ (1/σ^{2T}) exp(−(1/2σ²) Σ_{t=1}^{T} ‖xt − azt‖²) e^{−z1²/2} ∏_{t=2}^{T} e^{−zt²/2(1+βzt−1²)} / (1 + βzt−1²)^{1/2}.
Impossible to simulate directly from g(zᵀ|xᵀ, θ).
Example. ARCH model
Full conditionals of the latent variables Zt:
gt(zt|xt, zt−1, zt+1, θ) ∝ exp(−‖xt − azt‖²/2σ² − zt²/2(1 + βzt−1²) − zt+1²/2(1 + βzt²)).
The term zt+1²/2(1 + βzt²) makes simulation still difficult.
Metropolis-Hastings step with instrumental distribution:
qt(zt|xt, zt−1, θ) ∝ exp(−‖xt − azt‖²/2σ² − zt²/2(1 + βzt−1²)).
Simulating from qt(zt|xt, zt−1, θ) is equivalent to simulating from
N( (1 + βzt−1²)aᵀxt / [(1 + βzt−1²)‖a‖² + σ²], (1 + βzt−1²)σ² / [(1 + βzt−1²)‖a‖² + σ²] ).
Metropolizing the Gibbs sampler
Y = (Y1, Y2, . . . , Yp): discrete random vector with distribution π.
Y(1), Y(2), . . . , Y(T): a sample from Y generated by an MCMC method with transition matrix Q for the approximation of
J := E[h(Y)] with JT := (1/T) Σ_{t=1}^{T} h(Y(t)).
Define:
ν(h, π, Q) := lim_{T→∞} T Var(JT).
Definition. Let Q1 and Q2 be transition matrices. We say that Q1 ⪯ Q2 if the off-diagonal elements of Q2 are larger than those of Q1.
Theorem. [Peskun, 1973] Suppose each of the irreducible transition matrices Q1 and Q2 is reversible for the same distribution π. If Q1 ⪯ Q2, then we have
ν(h, π, Q1) ≥ ν(h, π, Q2).
Randomized Gibbs sampler
Algorithm
Given y(t) = (y1(t), . . . , yp(t)),
1. Generate a coordinate k from a pre-assigned distribution (q1, q2, . . . , qp).
2. Generate yk(t+1) ∼ πk(yk|yj(t), j ≠ k).
yk(t): the actual value of the kth coordinate.
The algorithm results in a reversible Markov chain.
Since πk is discrete, there is a positive probability of yk(t+1) = yk(t).
If we force yk(t+1) ≠ yk(t), an additional Metropolis-Hastings acceptance-rejection step is needed to correct the possible deviation from the target distribution.
Metropolizing the Randomized Gibbs sampler
Metropolized version of the Randomized Gibbs sampler [Liu, 1995]:
Algorithm
Given y(t) = (y1(t), . . . , yp(t)),
1. Generate a coordinate k from a pre-assigned distribution (q1, q2, . . . , qp).
2. Generate ωk ≠ yk(t) with probability
πk(ωk|yj(t), j ≠ k) / [1 − πk(yk(t)|yj(t), j ≠ k)].
3. Accept yk(t+1) = ωk with probability
min{ [1 − πk(yk(t)|yj(t), j ≠ k)] / [1 − πk(ωk|yj(t), j ≠ k)], 1 }.
Transition matrices
(q1, q2, . . . , qp): distribution of coordinates.
ω := (y1(t), . . . , yk−1(t), ωk, yk+1(t), . . . , yp(t)), ωk ≠ yk(t).
Transition probability of the metropolized sampler:
Q2(y(t), ω) = qk [πk(ωk|yj(t), j ≠ k) / (1 − πk(yk(t)|yj(t), j ≠ k))] min{ [1 − πk(yk(t)|yj(t), j ≠ k)] / [1 − πk(ωk|yj(t), j ≠ k)], 1 }
= qk min{ πk(ωk|yj(t), j ≠ k) / [1 − πk(yk(t)|yj(t), j ≠ k)], πk(ωk|yj(t), j ≠ k) / [1 − πk(ωk|yj(t), j ≠ k)] }.
Transition probability of the randomized Gibbs sampler (off-diagonal):
Q1(y(t), ω) = qk πk(ωk|yj(t), j ≠ k).
Remark. Q1(y(t), ω) ≤ Q2(y(t), ω), so Q1 ⪯ Q2, implying
ν(h, π, Q1) ≥ ν(h, π, Q2).
Example. Beta-Binomial combination
Consider
  X | Θ = θ ∼ Bin(n, θ),   Θ ∼ Be(a, b).
Joint PDF:
  f(x, θ) = (n choose x) [Γ(a+b)/(Γ(a)Γ(b))] θ^{x+a−1} (1−θ)^{n−x+b−1},
where x = 0, 1, . . . , n, θ ∈ [0, 1]. Conditional distributions:
  X | Θ = θ ∼ Bin(n, θ),   Θ | X = x ∼ Be(x+a, n−x+b).
Marginal distribution of X: beta-binomial,
  P(X = x) = (n choose x) B(x+a, n−x+b)/B(a, b),   x = 0, 1, . . . , n.
A Metropolis-Hastings step is applied at the generation from the binomial distribution [Liu, 1995].
368 / 417
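As an illustration, both samplers for this model can be sketched in a few lines. This is a sketch under the slides' parameter values n = 17, a = 4, b = 7 (seed and chain length are arbitrary choices); the metropolized step implements the forced move and the Metropolis-Hastings correction of the previous slides.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, a, b = 17, 4, 7  # parameter values from the slides

def binom_pmf(x, theta):
    # pi(x | theta): Bin(n, theta) probabilities
    return comb(n, x) * theta**x * (1.0 - theta)**(n - x)

def gibbs_step(x, theta):
    # plain two-stage Gibbs sweep
    theta = rng.beta(x + a, n - x + b)
    x = rng.binomial(n, theta)
    return x, theta

def metropolized_step(x, theta):
    theta = rng.beta(x + a, n - x + b)
    probs = np.array([binom_pmf(j, theta) for j in range(n + 1)])
    p_old = probs[x]
    probs[x] = 0.0                      # force a move: propose x' != x
    x_new = rng.choice(n + 1, p=probs / probs.sum())
    # accept with min{(1 - pi(x|theta)) / (1 - pi(x'|theta)), 1}
    if rng.random() < min((1.0 - p_old) / (1.0 - probs[x_new]), 1.0):
        x = x_new
    return x, theta

x, theta = 0, 0.5
sample = np.empty(20000)
for t in range(20000):
    x, theta = metropolized_step(x, theta)
    sample[t] = x
# theoretical beta-binomial mean: n*a/(a+b) = 17*4/11, about 6.18
```

Replacing `metropolized_step` by `gibbs_step` gives the plain sampler; the metropolized chain always proposes a new value of X and rejects it only through the Metropolis-Hastings correction, which is what reduces its autocorrelation.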
Example. Beta-Binomial combination
[Figure: two rows of three panels — histogram of the beta-binomial marginal of X, histogram of the Be marginal of θ, and the ACF of X.]

Histograms of marginal distributions and the autocovariances of the first coordinates of samples of size 10000 produced by the Gibbs sampler (upper panel) and the metropolized Gibbs sampler (lower panel). Parameter values: n = 17, a = 4, b = 7. 369 / 417
Example. Beta-Binomial combination
[Figure: ACF of X for the two samplers.]

Autocovariances of the first coordinates of samples of size 10000 produced by the Gibbs sampler (black) and the metropolized Gibbs sampler (blue). Parameter values: n = 17, a = 4, b = 7.
370 / 417
Reparametrization
Convergence properties of the Gibbs sampler are highly affected by the choice of coordinates. E.g. in two dimensions, highly correlated coordinates yield slow convergence.

Example. Target density: bivariate normal distribution.


   
  (X, Y) ∼ N₂((0, 0)ᵀ, (1 ρ; ρ 1)),   |ρ| < 1.

Marginal distributions: X , Y ∼ N (0, 1). Conditional distributions:

  Y | X = x ∼ N(ρx, 1 − ρ²),   X | Y = y ∼ N(ρy, 1 − ρ²).

Transformation: X + Y and X − Y are independent.


 
Simulate independently: V ∼ N(0, 2(1+ρ)), W ∼ N(0, 2(1−ρ)).
Take: X := (V + W )/2, Y := (V − W )/2.
(X , Y ) are distributed as the target.
371 / 417
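A minimal illustrative sketch of the two strategies (seed, ρ and sample size are arbitrary): the plain Gibbs chain inherits lag-1 autocorrelation ρ² in each coordinate, while drawing the independent coordinates V and W and transforming back produces i.i.d. pairs from the target.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, T = 0.9, 20000
s = np.sqrt(1.0 - rho**2)       # conditional standard deviation

# plain Gibbs sampler: alternate the two full conditionals
x = y = 0.0
xs_gibbs = np.empty(T)
for t in range(T):
    x = rng.normal(rho * y, s)  # X | Y = y ~ N(rho*y, 1 - rho^2)
    y = rng.normal(rho * x, s)  # Y | X = x ~ N(rho*x, 1 - rho^2)
    xs_gibbs[t] = x

# transformed sampler: V = X + Y and W = X - Y are independent
v = rng.normal(0.0, np.sqrt(2 * (1 + rho)), T)
w = rng.normal(0.0, np.sqrt(2 * (1 - rho)), T)
xs_indep = (v + w) / 2

def acf1(z):
    # empirical lag-1 autocorrelation
    z = z - z.mean()
    return float(z[:-1] @ z[1:] / (z @ z))
```

With ρ = 0.9 the lag-1 autocorrelation of the Gibbs chain is close to ρ² = 0.81, while for the transformed sampler it is essentially zero.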
Example. Bivariate normal distribution
[Figure: ACF plots for ρ = 0.3, 0.6 and 0.9.]

Autocovariances of the first coordinates of samples of size 5000 produced by the Gibbs sampler without (upper panel) and with (lower panel) coordinate transformation.
372 / 417
Example. Bivariate normal distribution

[Figure: running empirical covariances of (X, Y).]

Convergence of the empirical covariance to the true covariance ρ = 0.3 (top), ρ = 0.6 (middle) and ρ = 0.9 (bottom) of samples from marginal variables X and Y produced by the Gibbs sampler without (black lines) and with (blue lines) coordinate transformation. 373 / 417
Example. Random effect model
Simple random effect model:
Yk,` = µ + αk + εk,` , k = 1, 2, . . . , M, ` = 1, 2, . . . , N.
Error terms: ε_{k,ℓ} ∼ N(0, σ_y²), variance σ_y² is known.
Priors: α_k ∼ N(0, σ_α²), variance σ_α² is known. Flat prior on µ.
[Gelfand, Sahu & Carlin, 1995; Gilks & Roberts, 1996]


Posterior correlations of parameters:
  ρ_{µ,αk} = −(1 + Mσ_y²/(Nσ_α²))^{−1/2},   ρ_{αk,αℓ} = (1 + Mσ_y²/(Nσ_α²))^{−1},   k ≠ ℓ.

Hierarchical centering:
  Y_{k,ℓ} = η_k + ε_{k,ℓ},   η_k ∼ N(µ, σ_α²).
Posterior correlations of parameters:
  ρ_{µ,ηk} = −(1 + MNσ_α²/σ_y²)^{−1/2},   ρ_{ηk,ηℓ} = (1 + MNσ_α²/σ_y²)^{−1},   k ≠ ℓ. 374 / 417
Example. Random effect model
Simple random effect model:
Yk,` = µ + αk + εk,` , k = 1, 2, . . . , M, ` = 1, 2, . . . , N.
Error terms: ε_{k,ℓ} ∼ N(0, σ_y²), variance σ_y² is known.
Priors: α_k ∼ N(0, σ_α²), variance σ_α² is known. Flat prior on µ.

Sweeping [Vines, Gilks & Wild, 1995; Gilks & Roberts, 1996].
Take φ_k := α_k − ᾱ and ν := µ + ᾱ, where ᾱ := (1/M) Σ_{k=1}^{M} α_k. Then
  Y_{k,ℓ} = ν + φ_k + ε_{k,ℓ},   k = 1, 2, . . . , M,   ℓ = 1, 2, . . . , N.
Parameters:
  (φ_1, φ_2, . . . , φ_{M−1})ᵀ ∼ N_{M−1}(0, σ_α² Σ_{M−1}),   φ_M = −Σ_{k=1}^{M−1} φ_k.
Σ_{M−1}: an (M−1) × (M−1) matrix with 1 − 1/M on the main diagonal and −1/M everywhere else. No prior independence.
Posterior correlations of parameters:
  ρ_{ν,φk} = 0,   ρ_{φk,φℓ} = −1/M. 375 / 417
Improper priors
Blind use of the Gibbs sampler: focusing only on conditional PDFs.
Improper prior: instead of a probability measure, the density of the prior corresponds to a σ-finite measure [Robert, 2007].
In case of improper priors the conditional PDFs may not correspond to any joint PDF: the function given by the Hammersley-Clifford theorem is not integrable. Propriety of the joint PDF should be checked.

Example. Exponential full conditionals.

  Y1 | Y2 = y2 ∼ Exp(y2),   Y2 | Y1 = y1 ∼ Exp(y1).
The only function which could have served as a joint PDF is
  g(y1, y2) ∝ exp(−y1 y2),   y1, y2 > 0.
g(y1, y2) is not a proper PDF, as ∫₀^∞ ∫₀^∞ g(y1, y2) dy2 dy1 = ∞.
376 / 417
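The failure can be watched numerically: writing Y1^(t+1) = E1/Y2^(t) and Y2^(t) = E2/Y1^(t) with independent Exp(1) variables E1, E2 shows that log Y1 performs a zero-drift random walk, so the chain has no stationary distribution to settle into. An illustrative sketch (seed and chain length arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 5000
y1, y2 = 1.0, 1.0
log_y1 = np.empty(T)
for t in range(T):
    # numpy's exponential takes scale = 1/rate
    y1 = rng.exponential(1.0 / y2)   # Y1 | Y2 = y2 ~ Exp(y2)
    y2 = rng.exponential(1.0 / y1)   # Y2 | Y1 = y1 ~ Exp(y1)
    log_y1[t] = np.log(y1)
```

log Y1 wanders arbitrarily far from its starting value instead of stabilizing around a stationary distribution.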
Example. Normal improper posterior
Observations from N(µ, σ²), parameter vector: θ := (µ, σ).
Jeffreys prior [Jeffreys, 1946, 1961; Robert, 2007]:
  π(µ, σ) ∝ √(det I(µ, σ)) ∝ 1/σ².
I(µ, σ): Fisher information matrix on (µ, σ).
Posterior (formally from Bayes' theorem):
  π(µ, σ | x) = f(x | µ, σ) π(µ, σ) / ∫∫ f(x | µ, σ) π(µ, σ) dµ dσ.
Norming constant:
  ∫∫ f(x | µ, σ) π(µ, σ) dµ dσ = ∫∫ (2π)^{−1/2} σ^{−3} e^{−(x−µ)²/(2σ²)} dµ dσ = ∞.
Full conditionals:
  π(µ | σ, x) ∝ e^{−(x−µ)²/(2σ²)},   π(σ | µ, x) ∝ σ^{−3} e^{−(x−µ)²/(2σ²)},
corresponding to N(x, σ²) for µ and Exp((x−µ)²/2) for 1/σ².
377 / 417
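Iterating the two full conditionals shows the degeneracy directly: the increments of log σ have negative expectation (about −0.69 per sweep), so the chain drifts off to σ → 0 instead of converging. An illustrative sketch with a single observation x = 0 (seed and starting values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = 0.0                       # a single observation
mu, sigma = 1.0, 1.0
log_sigma = np.empty(200)
for t in range(200):
    mu = rng.normal(x, sigma)                 # mu | sigma, x ~ N(x, sigma^2)
    # 1/sigma^2 | mu, x ~ Exp((x - mu)^2 / 2); numpy uses scale = 1/rate
    prec = rng.exponential(2.0 / (x - mu) ** 2)
    sigma = 1.0 / np.sqrt(prec)
    log_sigma[t] = np.log(sigma)
```

log σ decreases roughly linearly in t: the "posterior" is improper and the Gibbs sampler honestly reports this by never stabilizing.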
Example. Normal improper posterior
[Figure: trace plots on logarithmic scale of the chains for µ and σ.]

First 400 iterations of the chains corresponding to parameters µ and σ produced by the Gibbs sampler. Logarithmic scales.
378 / 417
Motivation
The statistical model is not defined precisely: the dimension of the parameter space is not fixed. The MCMC methods considered so far cannot be applied.
Example. Mixture modelling.
Data: velocities Y of 82 galaxies in Corona Borealis region.
Model: Gaussian mixture, number of components is not given.
  M_k : Y_i ∼ Σ_{j=1}^{k} ω_{j,k} N(µ_{j,k}, σ²_{j,k}),   i = 1, 2, . . . , 82.
[Figure: histogram (density scale) of the 82 galaxy velocities (km/s).] 380 / 417
Bayesian model choice
Definition. A Bayesian variable dimensional model is defined as a
collection of models

  M_k = {p_k(· | θ_k); θ_k ∈ Θ_k},   k = 1, 2, . . . , K,
associated with a collection of priors π_k(θ_k) on the parameters of the models and a prior distribution {ρ(k), k = 1, 2, . . . , K} on the indices of these models.
π(k, θ_k) := ρ(k)π_k(θ_k): a density on the parameter space
  Θ := ∪_k ({k} × Θ_k).

Posterior of model M_k given observation y [Robert, 2007]:
  p(M_k | y) = ρ(k) ∫_{Θ_k} p_k(y | θ_k) π_k(θ_k) dθ_k / Σ_{j=1}^{K} ρ(j) ∫_{Θ_j} p_j(y | θ_j) π_j(θ_j) dθ_j.

One can select the model with the highest posterior or use the model average as the predictive PDF.
381 / 417
Possible problems with the model choice
Models should be coherent. E.g. if M1 = M2 ∪ M3, a natural requirement is p(M1 | y) = p(M2 | y) + p(M3 | y).
The coefficients for each model should be considered as an entire
set and treated as separate entities.
Example. AR(p) process:
  M_p : Y_t = Σ_{j=1}^{p} α_{jp} Y_{t−j} + σ_p ε_t.
The best fitting AR(p+1) model is not necessarily the best fitting AR(p) model with an additional parameter α_{(p+1)(p+1)}: the variances σ_p² might differ and the parameters are not independent a posteriori.

Computational problems:
I Parameter spaces are often infinite dimensional.
I Computation of posterior quantities often involves integration over different parameter spaces.
I The representation of the parameter space requires more advanced MCMC methods.
382 / 417
Reversible jumps MCMC
Green’s Algorithm. [Green, 1995; Robert, 2007]
π: target distribution on Θ with PDF f (x), x = (k, θk ) ∈ Θ.
Aim: construction of an irreducible aperiodic transition kernel K
satisfying detailed balance equation
  ∫_A ∫_B K(x, dy) π(dx) = ∫_B ∫_A K(y, dx) π(dy),   A, B ∈ B(Θ).

Kernel K is decomposed according to the model in which it proposes a move.
q_k: transition measure corresponding to M_k.
q_k(x, A): the probability of proposing a move (jump) from x to A, where Σ_k q_k(x, Θ) ≤ 1.
The probability that no move from x is proposed: 1 − Σ_k q_k(x, Θ).
ρ_k(x, y): probability of accepting a move from x to y. Yet undefined.
383 / 417
Green’s Algorithm
Transition kernel:
  K(x, B) = Σ_k ∫_B ρ_k(x, y) q_k(x, dy) + ω(x) I_B(x),   B ∈ B(Θ).
ω(x): probability of not moving from x; either no move is proposed or the proposed move is rejected:
  ω(x) := Σ_k ∫_Θ (1 − ρ_k(x, y)) q_k(x, dy) + 1 − Σ_k q_k(x, Θ).
Detailed balance equation:
  Σ_k ∫_A π(dx) ∫_B ρ_k(x, y) q_k(x, dy) + ∫_{A∩B} π(dx) ω(x)
    = Σ_k ∫_B π(dy) ∫_A ρ_k(y, x) q_k(y, dx) + ∫_{A∩B} π(dy) ω(y),   A, B ∈ B(Θ).
The detailed balance equation holds if for each k, A, B we have
  ∫_A π(dx) ∫_B ρ_k(x, y) q_k(x, dy) = ∫_B π(dy) ∫_A ρ_k(y, x) q_k(y, dx). 384 / 417
Choice of acceptance probability
Assumption. The joint measure
  G_k(A × B) := ∫_A ∫_B π(dx) q_k(x, dy)
has a finite PDF g_k(x, y) with respect to a symmetric measure ξ_k on Θ × Θ.
Condition for the detailed balance equation to hold:
  ∫_A π(dx) ∫_B ρ_k(x, y) q_k(x, dy) = ∫_A ∫_B g_k(x, y) ρ_k(x, y) ξ_k(dx, dy)
    = ∫_B ∫_A g_k(y, x) ρ_k(y, x) ξ_k(dy, dx) = ∫_B π(dy) ∫_A ρ_k(y, x) q_k(y, dx).
As ξ_k is symmetric, the above equation is satisfied if
  g_k(x, y) ρ_k(x, y) = g_k(y, x) ρ_k(y, x).
Metropolis-Hastings choice:
  ρ_k(x, y) := min{ g_k(y, x)/g_k(x, y), 1 }.
385 / 417
Remarks
I Zero denominator in the acceptance probability
    ρ_k(x, y) := min{ g_k(y, x)/g_k(x, y), 1 }
  does not cause a problem: the probability of proposing a move from x to y is then zero.
I Symmetric measure ξk and PDF gk are still undefined. Given
gk the definition of the sampling method is constructive.
I In defining the measure ξk one has a great flexibility to exploit
the structure of the problem at hand. Intuition is needed to
find the “good” jump strategies.
I Target π need not be normalized, but relative normalizations between different subspaces are needed. It is not necessary that the priors π_k(θ_k) are properly normalized; there must be only one unknown multiplicative constant among all priors, unless only posteriors conditional on k are needed.
386 / 417
Dimension matching
Jump between models M_{k1} and M_{k2}: a move from Θ_{k1} to Θ_{k2}.
If dim(Θ_{k1}) > dim(Θ_{k2}), the jump can be represented as a deterministic transformation
  θ_{k2} = T(θ_{k1}).
Dimension matching condition [Green, 1995]: the opposite jump from Θ_{k2} to Θ_{k1} should be concentrated on
  {θ_{k1} : θ_{k2} = T(θ_{k1})}.
Example. Assume Θ = C1 ∪ C2 with C1 := {1} × R and C2 := {2} × R², and the target π has proper densities on both subspaces.
A possible move: (2, θ^(1), θ^(2)) ∈ C2 ↦ (1, (θ^(1) + θ^(2))/2) ∈ C1.
For this type of move G_k should have a density with respect to a measure on R³ concentrated on
  {(θ, θ^(1), θ^(2)) : θ = (θ^(1) + θ^(2))/2}.
The reversed move from C1 to C2 should have a proposal distribution also concentrated on this set. E.g. given x = (1, θ), generate u from some distribution h independently of θ and set θ^(1) = θ + u, θ^(2) = θ − u. 387 / 417
Derivation of the symmetric measure
C1 = {1} × Θ1, C2 = {2} × Θ2: two parameter subspaces, Θ1 ⊆ R^{n1} and Θ2 ⊆ R^{n2}.
q: a move which always switches between these subspaces:
  q(x, C1) = 0, x ∈ C1;   q(x, C2) = 0, x ∈ C2.
p_{kℓ}: probability of choosing a move from C_k to C_ℓ.
u12, u21: continuous random vectors of dimensions m1 and m2, generated from PDFs h12 and h21 independently of θ1 and θ2.
Dimension matching: n1 + m1 = n2 + m2 and there is a bijection T between (θ1, u12) and (θ2, u21).
Symmetric measure ξ: for A ⊆ C1, B ⊆ C2,
  ξ(A × B) = ξ(B × A) := λ({(θ1, u1) : θ1 ∈ A, θ2 ∈ B, (θ2, u2) = T(θ1, u1)}),
λ: Lebesgue measure on R^{n1+m1}.
For general A, B ∈ B(Θ):
  ξ(A × B) := ξ((A ∩ C1) × (B ∩ C2)) + ξ((A ∩ C2) × (B ∩ C1)).
388 / 417
Density of the symmetric measure
For A ⊆ C1, B ⊆ C2:
  ξ(A × B) := λ({(θ1, u1) : θ1 ∈ A, θ2 ∈ B, (θ2, u2) = T(θ1, u1)}).
Completions: u12 ∼ h12, u21 ∼ h21.
For x = (1, θ1) ∈ C1, y = (2, θ2) ∈ C2:
  g(x, y) := f(x) p12 h12(u12),   g(y, x) := f(y) p21 h21(u21) |det J_T(y)|,
otherwise g(x, y) := 0.
J_T(y): the Jacobian of the transformation T.
f: PDF of the target measure π.
g(x, y), x, y ∈ Θ, is the density with respect to ξ of the joint measure
  G(A × B) := ∫_A ∫_B π(dx) q(x, dy).
Acceptance probability of a move from x to y:
  ρ(x, y) = min{ f(y) p21 h21(u21) |det J_T(y)| / (f(x) p12 h12(u12)), 1 }.
389 / 417
Green’s algorithm
p_{kℓ}: probability of choosing a move from model M_k to M_ℓ.
h_{kℓ}: PDF of the completion corresponding to the move from M_k to M_ℓ.
T_{kℓ}: function specifying the move from M_k to M_ℓ.

Algorithm
Given x^(t) = (k, θ_k^(t)),
  1. Select a model M_ℓ with probability p_{kℓ}.
  2. Generate u_{kℓ} ∼ h_{kℓ}(u).
  3. Set (θ_ℓ, u_{ℓk}) = T_{kℓ}(θ_k^(t), u_{kℓ}).
  4. Take θ_ℓ^(t+1) = θ_ℓ with probability
       min{ [f(ℓ, θ_ℓ) p_{ℓk} h_{ℓk}(u_{ℓk})] / [f(k, θ_k^(t)) p_{kℓ} h_{kℓ}(u_{kℓ})] · |det J_{T_{kℓ}}(θ_k^(t), u_{kℓ})|, 1 },
     and take θ_k^(t+1) = θ_k^(t) otherwise.
390 / 417
Example
Subspaces:
  Θ = C1 ∪ C2,   C1 := {1} × R,   C2 := {2} × R².
Moves:
  T(θ^(1), θ^(2)) : (2, θ^(1), θ^(2)) ∈ C2 ↦ (1, (θ^(1)+θ^(2))/2, (θ^(2)−θ^(1))/2);
  T^{−1}(θ, u) : (1, θ, u), u ∼ h ↦ (2, θ − u, θ + u) ∈ C2.
Jacobians:
  |det J_T(θ^(1), θ^(2))| = 1/2,   |det J_{T^{−1}}(θ, u)| = 2.
Acceptance probability for a move from (1, θ) to (2, θ^(1), θ^(2)):
  min{ [f(2, θ^(1), θ^(2)) p21] / [f(1, θ) p12 h(u)] · 2, 1 },   u = (θ^(2) − θ^(1))/2.
p12 , p21 : probabilities of choosing moves C1 → C2 and C2 → C1 ,
respectively. 391 / 417
Fixed dimension reassessment
For a move from M_k to M_ℓ translating θ_k ∈ Θ_k to θ_ℓ ∈ Θ_ℓ with n_k := dim(Θ_k) < dim(Θ_ℓ) =: n_ℓ, an equivalent move in fixed dimension can be described.
Green's algorithm (assuming θ_ℓ need not be completed)
Add u_{kℓ} ∈ U_{kℓ} to θ_k such that Θ_k × U_{kℓ} and Θ_ℓ are in bijection.
Metropolis-Hastings-like step
A move between (θ_k, u_{kℓ}) and θ_ℓ with PDFs f(k, θ_k)h_{kℓ}(u_{kℓ}) and f(ℓ, θ_ℓ), respectively, with deterministic candidate distribution θ_ℓ = T_{kℓ}(θ_k, u_{kℓ}).
Randomization
Candidate:
  θ_ℓ ∼ N_{n_ℓ}(T_{kℓ}(θ_k, u_{kℓ}), εI_{n_ℓ}),   ε > 0.
Reciprocal candidate:
  (θ_k, u_{kℓ}) = T_{ℓk}(θ) = T_{kℓ}^{−1}(θ),   θ ∼ N_{n_ℓ}(θ_ℓ, εI_{n_ℓ}).
I_n: the n × n unit matrix. 392 / 417
Acceptance probability
Candidate:
  θ_ℓ ∼ N_{n_ℓ}(T_{kℓ}(θ_k, u_{kℓ}), εI_{n_ℓ}),   ε > 0.
PDF of the candidate:
  (2πε)^{−n_ℓ/2} exp(−‖θ_ℓ − T_{kℓ}(θ_k, u_{kℓ})‖²/(2ε)).
Reciprocal candidate:
  T_{ℓk}(θ) = T_{kℓ}^{−1}(θ),   θ ∼ N_{n_ℓ}(θ_ℓ, εI_{n_ℓ}).
PDF of the reciprocal candidate:
  (2πε)^{−n_ℓ/2} exp(−‖θ_ℓ − T_{kℓ}(θ_k, u_{kℓ})‖²/(2ε)) |det J_{T_{kℓ}}(θ_k, u_{kℓ})|.
Metropolis-Hastings acceptance probability:
  min{ [f(ℓ, θ_ℓ) exp(−‖θ_ℓ − T_{kℓ}(θ_k, u_{kℓ})‖²/(2ε)) |det J_{T_{kℓ}}(θ_k, u_{kℓ})|]
       / [f(k, θ_k) h_{kℓ}(u_{kℓ}) exp(−‖θ_ℓ − T_{kℓ}(θ_k, u_{kℓ})‖²/(2ε))], 1 }.
No dependence on ε. Taking into account the probabilities of the two moves gives the acceptance probability of Green's algorithm. 393 / 417
Example. Poisson vs. negative binomial
y = (y1, y2, . . . , yn): realization of an i.i.d. sample Y1, Y2, . . . , Yn from a random variable Y.
Competing models:
  M1 : Y ∼ P(λ);   M2 : Y ∼ NB(λ, κ).
Likelihood functions:
  L1(λ | y) = Π_{j=1}^{n} (λ^{y_j}/y_j!) e^{−λ},
  L2(λ, κ | y) = Π_{j=1}^{n} λ^{y_j} Γ(1/κ + y_j) (1 + κλ)^{−1/κ} / [y_j! Γ(1/κ) (1/κ + λ)^{y_j}].
Means and variances:
  M1 : E[Y] = λ, Var(Y) = λ;   M2 : E[Y] = λ, Var(Y) = λ(1 + κλ).
κ characterizes the overdispersion relative to a Poisson distribution.
Applications:
I Tumor counts in genetically engineered mice. [Newton & Hastie, 2006]
I Number of goals of 1040 matches in the Premier League for the seasons 2005/06 and 2006/07. [Hastie & Green, 2012] 394 / 417
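The two log-likelihoods can be coded directly from the formulas above; written on the log scale they are numerically stable, and the negative binomial one reduces to the Poisson one as κ → 0, which is a useful sanity check. An illustrative sketch:

```python
import numpy as np
from math import lgamma, log

def loglik_poisson(lam, y):
    # log L1(lambda | y) = sum_j [y_j log(lambda) - lambda - log(y_j!)]
    return sum(v * log(lam) - lam - lgamma(v + 1) for v in y)

def loglik_negbin(lam, kappa, y):
    # log L2(lambda, kappa | y) with r = 1/kappa:
    # sum_j [y_j log(lambda) + lgamma(r + y_j) - lgamma(r)
    #        - r log(1 + kappa*lambda) - log(y_j!) - y_j log(r + lambda)]
    r = 1.0 / kappa
    return sum(v * log(lam) + lgamma(r + v) - lgamma(r)
               - r * np.log1p(kappa * lam) - lgamma(v + 1)
               - v * log(r + lam) for v in y)
```

For small κ the negative binomial log-likelihood is close to the Poisson one, reflecting NB(λ, κ) → P(λ).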
Example. Poisson vs. negative binomial
Competing models:
  M1 : Y ∼ P(λ);   M2 : Y ∼ NB(λ, κ).
Parameters: θ1 = λ ∈ Θ1 = R₊,   θ2 = (θ2^(1), θ2^(2)) = (λ, κ) ∈ Θ2 = R²₊.
Prior probabilities of the two models: ρ_P = ρ_NB = 1/2.
Priors of the parameters:
  θ1, θ2^(1) ∼ Ga(α_λ, β_λ),   θ2^(2) ∼ Ga(α_κ, β_κ).
PDF f of the target distribution π:
  f(k, θ_k | y) ∝ (1/2) π1(θ1) L1(θ1 | y) for k = 1;
  f(k, θ_k | y) ∝ (1/2) π2(θ2^(1), θ2^(2)) L2((θ2^(1), θ2^(2)) | y) for k = 2,
where
  π1(θ1) = γ(θ1; α_λ, β_λ),   π2(θ2^(1), θ2^(2)) = γ(θ2^(1); α_λ, β_λ) γ(θ2^(2); α_κ, β_κ).
γ(θ; α, β): PDF of the Ga(α, β) distribution.
395 / 417
Example. Poisson vs. negative binomial
Competing models:
  M1 : Y ∼ P(λ);   M2 : Y ∼ NB(λ, κ).
All possible moves have equal probability. Within-model moves are classical fixed-dimensional Metropolis-Hastings moves.
Move Poisson → negative binomial.
x = (1, θ): current state of the algorithm.
Generate u ∼ N(0, σ²) (PDF denoted by h) with fixed σ and set
  x′ = (2, θ′),   θ′ = (θ′^(1), θ′^(2)) = T12(θ, u) = (θ, µ exp(u)).
κ is lognormal, multiplicatively centered around µ.
Jacobian of the transformation: |det J_{T12}(θ, u)| = µ exp(u).
Move negative binomial → Poisson.
  (θ, u) = T21(θ′^(1), θ′^(2)) = (θ′^(1), log(θ′^(2)/µ)).
Jacobian of the transformation: |det J_{T21}(θ′^(1), θ′^(2))| = 1/θ′^(2).
396 / 417
Example. Poisson vs. negative binomial
Competing models:
  M1 : Y ∼ P(λ);   M2 : Y ∼ NB(λ, κ).
Move Poisson → negative binomial. Acceptance probability:
  min{ [f(2, θ′ | y)/f(1, θ | y)] √(2πσ²) exp(u²/(2σ²)) µ exp(u), 1 }.
Move negative binomial → Poisson. Acceptance probability:
  min{ [f(1, θ | y)/f(2, θ′ | y)] (2πσ²)^{−1/2} exp(−(log(θ′^(2)/µ))²/(2σ²)) (1/θ′^(2)), 1 }.
Implementation issues [Hastie & Green, 2012]:
I One can also consider a completion (θ, u), where u ∼ Exp(β).
I Examining the empirical value of Var(Y)/E[Y] for the given data set, µ = 0.015 is the optimal choice. For e.g. µ = 1 no trans-dimensional moves were accepted.
I Posterior probabilities: p(M1 | y) = 0.708, p(M2 | y) = 0.292.
I Acceptance rates: σ = 1.5 : 58%; σ = 0.05 : 8%.
397 / 417
Example. Linear vs. quadratic regression
Competing models:
  y_i = β0 + β1 x_i + ε_i   and   y_i = β0 + β1 x_i + β2 x_i² + ε_i,   i = 1, 2, . . . , n.
Matrix form:
  y = Xβ + ε,   ε ∼ N_n(0, σ² I_n).
Least squares estimator (LSE) of β:
  β̂ = (XᵀX)^{−1} Xᵀ y ∼ N_k(β, σ²(XᵀX)^{−1}),   k = 2, 3.
Spectral decomposition of XᵀX (obtained e.g. from the SVD of X):
  XᵀX = P∆Pᵀ.
∆: diagonal matrix containing the eigenvalues λ_j of XᵀX.
P: orthonormal matrix of the normed eigenvectors of XᵀX.
Transformed model:
  y = X*α + ε,   X* := XP,   α := Pᵀβ.
LSE of α: α̂ = ∆^{−1} X*ᵀ y ∼ N_k(α, σ²∆^{−1}), so α̂_j ∼ N(α_j, σ²/λ_j).
 
398 / 417
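The orthogonalization is a one-liner with an eigendecomposition. An illustrative sketch on simulated data (design, coefficients, noise level and seed are assumptions, not from the slides) verifying that the back-transformed estimator Pα̂ coincides with the usual least squares estimate:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(-1.0, 1.0, n)
X = np.column_stack([np.ones(n), x, x**2])          # quadratic design
beta_true, sigma = np.array([1.0, 2.0, 0.5]), 0.3
y = X @ beta_true + rng.normal(0.0, sigma, n)

# spectral decomposition X'X = P Delta P'
lam, P = np.linalg.eigh(X.T @ X)
Xstar = X @ P                                       # columns are orthogonal
alpha_hat = (Xstar.T @ y) / lam                     # Delta^{-1} Xstar' y
beta_hat = P @ alpha_hat                            # back to the original basis
```

Here X*ᵀX* = ∆ is diagonal, so the components α̂_j are uncorrelated with variances σ²/λ_j, which is what makes coordinate-wise model moves tractable.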
Example. Linear vs. quadratic regression
Regression model:
  y = X*α + ε,   α̂ ∼ N_k(α, σ²∆^{−1}),   k = 2, 3,   α̂_j ∼ N(α_j, σ²/λ_j).
Competing models: linear (j = 0, 1) and quadratic (j = 0, 1, 2).
Prior for α_j: N(0, τ²).
Posterior for α_j: N(b_j α̂_j, σ² b_j/λ_j), where b_j := λ_jτ²/(λ_jτ² + σ²). Posterior PDF: f_j.
Possible moves:
(i) linear → linear: (α0, α1) → (α0′, α1′), where (α0′, α1′) ∼ f0 f1.
(ii) linear → quadratic: (α0, α1) → (α0, α1, α2′), where α2′ ∼ f2.
(iii) quadratic → quadratic: (α0, α1, α2) → (α0′, α1′, α2′), where (α0′, α1′, α2′) ∼ f0 f1 f2.
(iv) quadratic → linear: (α0, α1, α2) → (α0′, α1′), where (α0′, α1′) = (α0, α1).
399 / 417
Example. Gaussian mixture
Model [Richardson & Green, 1997]:
  M_k : Y_i ∼ Σ_{j=1}^{k} (ω_{j,k}/σ_{j,k}) φ((x − µ_{j,k})/σ_{j,k}),   i = 1, 2, . . . , n.
y = (y1, y2, . . . , yn): realization of the sample Y1, Y2, . . . , Yn.
Priors:
  µ_{j,k} ∼ N(ξ, κ^{−1}),   σ_{j,k}^{−2} ∼ Ga(α, β),   β ∼ Ga(γ, ρ),
  ω_k = (ω_{1,k}, . . . , ω_{k,k}) ∼ D_k(δ, . . . , δ),   ρ(k) = 1/k_max or k ∼ P(λ).
R: range of the data, that is R := y_max − y_min.
Parameters depending on the sample:
  ξ = (y_max + y_min)/2,   κ = c/R²,   ρ = d/R²,   c, d > 0.
Free parameters: α, γ, δ. Suggestion: α > 1 > γ.
z_{i,k}: allocation of Y_i. Independent variables with distribution
  P(z_{i,k} = j) = ω_{j,k},   j = 1, 2, . . . , k,   i = 1, 2, . . . , n. 400 / 417
Example. Gaussian mixture
Model:
  M_k : Y_i ∼ Σ_{j=1}^{k} (ω_{j,k}/σ_{j,k}) φ((x − µ_{j,k})/σ_{j,k}),   i = 1, 2, . . . , n.

Permutation invariant. Ordering is fixed by assuming


µ1,k < µ2,k < . . . < µk,k .
Moves of the reversible MCMC (scanned systematically):
(a) updating the weights ω k ;
(b) updating the normal parameters µj,k and σj,k ;
(c) updating the allocations zi,k ;
(d) updating the hyperparameter β;
(e) splitting one mixture component into two, or combining two
into one;
(f) the birth or death of an empty component. Not represented in
the sample, i.e. no sample value is allocated to it.
Moves (e) and (f): changing k by 1 and make the corresponding
changes in ω k , µj,k , σj,k , zi,k . 401 / 417
Example. Gaussian mixture
Moves (a)–(d) do not change the dimension of the parameter set: standard Gibbs steps. [Diebolt & Robert, 1994]
x | · · · : conditional distribution of x given all other variables.
Full conditionals:
(a) Weights:
  ω_k | · · · ∼ D_k(δ + n_{1,k}, . . . , δ + n_{k,k}),   n_{j,k} := Σ_{i=1}^{n} I{z_{i,k} = j}.
Independent Ga(δ + n_{j,k}, θ) values scaled to sum to 1.
(b) Normal parameters:
  µ_{j,k} | · · · ∼ N( (σ_{j,k}^{−2} Σ_{i: z_{i,k}=j} y_i + κξ) / (σ_{j,k}^{−2} n_{j,k} + κ),  (σ_{j,k}^{−2} n_{j,k} + κ)^{−1} ).
Generate a candidate; accept if the ordering is unchanged.
  σ_{j,k}^{−2} | · · · ∼ Ga( α + n_{j,k}/2,  β + (1/2) Σ_{i: z_{i,k}=j} (y_i − µ_{j,k})² ).
402 / 417
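Updates (a) and (b) translate directly into code. An illustrative sketch for fixed k (steps (c)–(f) and the order-restriction check on the candidate means are omitted; values are assumptions for demonstration):

```python
import numpy as np

rng = np.random.default_rng(6)

def update_weights(z, k, delta):
    # (a): omega | ... ~ Dirichlet(delta + n_1, ..., delta + n_k)
    counts = np.bincount(z, minlength=k)
    return rng.dirichlet(delta + counts)

def update_means(y, z, sigma2, k, xi, kappa):
    # (b): mu_j | ... ~ N((sum_i y_i/sigma_j^2 + kappa*xi) / prec_j, 1/prec_j)
    # with prec_j = n_j/sigma_j^2 + kappa
    mu = np.empty(k)
    for j in range(k):
        yj = y[z == j]
        prec = len(yj) / sigma2[j] + kappa
        mu[j] = rng.normal((yj.sum() / sigma2[j] + kappa * xi) / prec,
                           1.0 / np.sqrt(prec))
    return mu
```

With many observations and a vague prior (small κ) the drawn means concentrate near the within-component sample means, as expected from the conjugate update.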
Example. Gaussian mixture
Full conditionals:
(c) Allocations:
  P(z_{i,k} = j | · · · ) ∝ (ω_{j,k}/σ_{j,k}) exp(−(y_i − µ_{j,k})²/(2σ_{j,k}²)).
(d) Hyperparameter:
  β | · · · ∼ Ga( γ + kα, ρ + Σ_{j=1}^{k} σ_{j,k}^{−2} ).
Split or combine move (e): reversible jump mechanism.
b_k, d_k := 1 − b_k: probabilities of attempting to split or combine.
Values: d_1 = 0, b_{k_max} = 0 and b_k = d_k = 1/2, k = 2, 3, . . . , k_max − 1.
Birth or death move (f): reversible jump with the same proposal probabilities b_k, d_k.
403 / 417
Example. Gaussian mixture. Split or combine move
Combine move
(j1, j2): randomly chosen pair of adjacent indices to be merged.
Adjacent: µ_{j1,k} < µ_{j2,k} and no other µ_{j,k} lies in (µ_{j1,k}, µ_{j2,k}).
j*: index of the new component. k is reduced by 1.
Reallocation: set z_{i,k−1} = j* for each y_i with z_{i,k} = j1 or z_{i,k} = j2.
Change of weights and normal parameters: the zeroth, first and second moments should match:
  ω_{j*,k−1} = ω_{j1,k} + ω_{j2,k},
  ω_{j*,k−1} µ_{j*,k−1} = ω_{j1,k} µ_{j1,k} + ω_{j2,k} µ_{j2,k},
  ω_{j*,k−1} (σ²_{j*,k−1} + µ²_{j*,k−1}) = ω_{j1,k} (σ²_{j1,k} + µ²_{j1,k}) + ω_{j2,k} (σ²_{j2,k} + µ²_{j2,k}).
Transformation:
  T_{k,k−1} : (ω_{j1,k}, ω_{j2,k}, µ_{j1,k}, µ_{j2,k}, σ²_{j1,k}, σ²_{j2,k}) ↦ (ω_{j*,k−1}, u1, µ_{j*,k−1}, u2, σ²_{j*,k−1}, u3).
The acceptance probability is a straightforward transformation of the acceptance probability corresponding to the split move. 404 / 417
Example. Gaussian mixture. Split or combine move
Split move
j*: a randomly chosen index to be split. k is increased by 1.
Generate u1 ∼ Be(2, 2), u2 ∼ Be(2, 2), u3 ∼ Be(1, 1) and set
  ω_{j1,k+1} = ω_{j*,k} u1,   ω_{j2,k+1} = ω_{j*,k} (1 − u1),
  µ_{j1,k+1} = µ_{j*,k} − u2 σ_{j*,k} √(ω_{j2,k+1}/ω_{j1,k+1}),
  µ_{j2,k+1} = µ_{j*,k} + u2 σ_{j*,k} √(ω_{j1,k+1}/ω_{j2,k+1}),
  σ²_{j1,k+1} = u3 (1 − u2²) σ²_{j*,k} ω_{j*,k}/ω_{j1,k+1},
  σ²_{j2,k+1} = (1 − u3)(1 − u2²) σ²_{j*,k} ω_{j*,k}/ω_{j2,k+1}.
The obtained values satisfy the system of equations considered at the combine move.
If the adjacency condition is not satisfied, the move is immediately rejected.
Reallocation of the sample values y_i with z_{i,k} = j* between j1 and j2 is done according to the full conditionals.
405 / 417
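A quick numerical check that the split formulas really preserve the zeroth, first and second moments required by the combine move (an illustrative sketch; the component values and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

def split(w, mu, sigma2):
    # completion variables as on the slide
    u1, u2, u3 = rng.beta(2, 2), rng.beta(2, 2), rng.beta(1, 1)
    w1, w2 = w * u1, w * (1 - u1)
    s = np.sqrt(sigma2)
    mu1 = mu - u2 * s * np.sqrt(w2 / w1)
    mu2 = mu + u2 * s * np.sqrt(w1 / w2)
    s21 = u3 * (1 - u2**2) * sigma2 * w / w1
    s22 = (1 - u3) * (1 - u2**2) * sigma2 * w / w2
    return (w1, mu1, s21), (w2, mu2, s22)

w, mu, sigma2 = 0.4, 1.3, 0.7
(w1, m1, v1), (w2, m2, v2) = split(w, mu, sigma2)
# the zeroth, first and second moments are matched exactly:
assert abs(w1 + w2 - w) < 1e-12
assert abs(w1 * m1 + w2 * m2 - w * mu) < 1e-12
assert abs(w1 * (v1 + m1**2) + w2 * (v2 + m2**2) - w * (sigma2 + mu**2)) < 1e-12
```

The identities hold for any draw of (u1, u2, u3), which is exactly what makes the split/combine pair a valid dimension-matching bijection.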
Example. Gaussian mixture. Acceptance probability
Split move: acceptance probability min{1, A_{k,k+1}}, where
  A_{k,k+1} = (L_{k+1}/L_k) · (ρ(k+1)/ρ(k)) · (k+1)
    × [ω_{j1,k+1}^{δ−1+ℓ1} ω_{j2,k+1}^{δ−1+ℓ2}] / [ω_{j*,k}^{δ−1+ℓ1+ℓ2} B(δ, kδ)]
    × √(κ/(2π)) exp(−(κ/2)[(µ_{j1,k+1} − ξ)² + (µ_{j2,k+1} − ξ)² − (µ_{j*,k} − ξ)²])
    × (β^α/Γ(α)) (σ²_{j1,k+1} σ²_{j2,k+1}/σ²_{j*,k})^{−α−1} exp(−β(σ_{j1,k+1}^{−2} + σ_{j2,k+1}^{−2} − σ_{j*,k}^{−2}))
    × d_{k+1}/(b_k p_alloc) × [g_{2,2}(u1) g_{2,2}(u2) g_{1,1}(u3)]^{−1}
    × [ω_{j*,k} |µ_{j1,k+1} − µ_{j2,k+1}| σ²_{j1,k+1} σ²_{j2,k+1}] / [u2 (1 − u2²) u3 (1 − u3) σ²_{j*,k}].
L_k: likelihood of the sample y corresponding to model M_k.
ℓ1, ℓ2: number of observations proposed to be assigned to j1 and j2.
B(x, y): beta function. g_{s,t}: PDF of Be(s, t).
p_alloc: probability of the particular allocation made. 406 / 417
Example. Gaussian mixture. Birth or death move
Birth move. Probability for M_k: b_k. Weights and parameters for the new component:
  ω_{j*,k+1} ∼ Be(1, k),   µ_{j*,k+1} ∼ N(ξ, κ^{−1}),   σ_{j*,k+1}^{−2} ∼ Ga(α, β).
The existing weights are rescaled to sum to 1.
Acceptance probability: min{1, A_{k,k+1}}, where
  A_{k,k+1} = (ρ(k+1)/ρ(k)) · (k+1) · [ω_{j*,k+1}^{δ−1} (1 − ω_{j*,k+1})^{n+kδ−k}] / B(δ, kδ)
    × d_{k+1}/((k0 + 1) b_k) · [1/g_{1,k}(ω_{j*,k+1})] · (1 − ω_{j*,k+1})^k.
k0: number of empty components before the birth.
Death move. Probability for M_{k+1}: d_{k+1} = 1 − b_{k+1}. A random choice is made among the existing empty components and the selected component is deleted.
The weights are rescaled to sum to 1.
Acceptance probability: min{1, A_{k,k+1}^{−1}}. 407 / 417
Case study. Galaxy data
Data: velocities Y (km/s) of 82 galaxies in Corona Borealis region.
Implementation: R package mixAK [Komárek, 2009]
Range: R = 25.107. Prior for k: uniform on {1, 2, . . . , 30}.
Data dependent parameters:
ξ = 21.73, κ = 1/R 2 = 1/630.36, ρ = 10/R 2 = 0.016.
Fixed parameters: α = 2, γ = 0.2, δ = 1.
100000 iterations in burning period, 500000 iterations for sampling.
[Figure: bar plot of the posterior probabilities of k, k = 1, . . . , 20.]

Posterior distribution of the number of components. 408 / 417


Case study. Galaxy data
Overall predictive PDF: weighted average of the estimated k-component predictive PDFs, where the weights are the posterior probabilities P(M_k | y).
[Figure: histogram of the galaxy velocities (km/s) with overlaid predictive PDFs.]

Histogram and predictive PDFs of the Galaxy data. Solid line: overall predictive PDF; dashed lines: conditional predictive PDFs for k = 4 (blue), 5 (green), 6 (red), 7 (magenta) and 8 (orange). 409 / 417
References
Athreya, K. and Ney, P. (1978)
A new approach to the limit theory of recurrent Markov chains. Trans. Amer. Math. Soc. 245, 493–501.

Atkinson, A. C. (1979)
The computer generation of Poisson random variables. J. Roy. Statist. Soc. Ser. C Appl. Statist. 28, 29–35.

Bédard, M. (2008)
Optimal acceptance rates for Metropolis algorithms: moving beyond 0.234. Stochastic Process. Appl. 118,
2198–2222.
Besag, J. (1974)
Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Royal Statist. Soc.
Series B 36, 192-326.

Besag, J. (1994)
Discussion of “Markov chains for exploring posterior distributions”. Ann. Statist. 22, 1734–1741.

Casella, G. and Robert, C. (1996)


Rao-Blackwellisation of sampling schemes. Biometrika 83, 81–94.

Corana, A., Marchesi, M., Martini., C. and Ridella, S. (1987)


Minimizing multimodal functions of continuous variables with the “simulated annealing” algorithm. ACM
Trans. Math. Software 13, 262–280.

Dempster, A., Laird, N. and Rubin, D. (1977)


Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Royal Statist. Soc.
Series B 39, 1–38.

Diebolt, J. and Robert, C. P. (1994)


Estimation of finite mixture distributions through Bayesian sampling. J. R. Statist. Soc. B 56, 363–375.

Gaver, D. P. and Lehoczky, J. P. (1987)


Statistical analysis of hierarchical stochastic models: examples and approaches. Ann. Oper. Res. 8, 217–227.
411 / 417
References
Gaver, D. P. and O’Muircheartaigh, L. (1987)
Robust empirical Bayes analysis of event rates. Technometrics 29, 1–15.

Gelfand, A., Sahu, S. and Carlin, B. (1995)


Efficient parametrization for normal linear mixed models. Biometrika 82, 479–488.

Gelfand, A. and Smith, A. (1990)


Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85, 398–409.

Gelman, A., Roberts, G. and Gilks, W. (1996)


Efficient Metropolis jumping rules. In Bayesian Statistics 5. Berger, J., Bernardo, J., Dawid, A., Lindley, D.,
and Smith, A. (eds). 599–608. Oxford University Press, Oxford.

Geman, S. and Geman, D. (1984)


Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal.
Mach. Intell. 6, 721–741.
Geweke, J. (1988)
Antithetic Acceleration of Monte Carlo Integration in Bayesian Inference. Institute of Statistics and Decision
Sciences, Duke University.

Geyer, C. J. (1996)
Estimation and optimization of functions. In Markov chain Monte Carlo in Practice. Gilks, W. R.,
Richardson, S. T. and Spiegelhalter D. J. (eds.). 241–258. Chapman & Hall, London.

Geyer, C. and Thompson, E. (1992)


Constrained Monte Carlo maximum likelihood for dependent data (with discussion). J. Royal Statist. Soc.
Series B 54, 657–699.

Gilks, W., Best, N. and Tan, K. (1995)


Adaptive rejection Metropolis sampling within Gibbs sampling. J. R. Stat. Soc. Ser. C. Appl. Stat. 44,
455–472.

412 / 417
References
Gilks, W. and Roberts, G. (1996)
Strategies for improving MCMC. In Markov chain Monte Carlo in Practice. Gilks, W. R., Richardson, S. T.
and Spiegelhalter D. J. (eds.). 89–114. Chapman & Hall, London.

Green, P. J. (1995)
Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82,
711-732.
Grenander, U. and Miller, M. (1994)
Representations of knowledge in complex systems (with discussion). J. Royal Statist. Soc. Series B 56,
549–603.
Hàjek, B. (1988)
Cooling schedules for optimal annealing. Math. Operation Research 13, 311–329.

Hastie, D. I. and Green, P. J. (2012)


Model choice using reversible jump Markov chain Monte Carlo. Stat. Neerl. 66, 309–338.

Hastings, W. (1970)
Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.

Ingrassia, S. (1992)
A comparison between the simulated annealing and the EM algorithms in normal mixture decompositions.
Stat. Comput. 2, 203–211.

James, W. and Stein, C. (1961)


Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical
Statistics and Probability, Vol. 1 , University of California Press, 361–379.

Jeffreys, H. (1946)
An Invariant Form for the Prior Probability in Estimation Problems. Proc. R. Soc. Lond. A 186, 453–461.

Jeffreys, H. (1961)
Theory of Probability (3rd edition). Oxford University Press, Oxford.
413 / 417
References
Johnson, A. A. (2010)
Geometric Ergodicity of Gibbs Samplers. PhD Thesis, University of Minnesota.

Jöhnk, M. (1964)
Erzeugung von betaverteilten und gammaverteilten Zufallszahlen. Metrika 8, 5–15.

Kiefer, J. and Wolfowitz, J. (1952)


Stochastic estimation of the maximum of a regression function. Ann. Mathemat. Statist. 23, 462–466.

Komárek, A. (2009)
A new R package for Bayesian estimation of multivariate normal mixtures allowing for selection of the
number of components and interval-censored data. Comput. Statist. Data Anal. 53, 3932–3947.

Lee, G and Scott, C. (2012)


EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Comput.
Statist. Data Anal. 56, 2816–2829.

Lin, S. (1965)
Computer solutions of the travelling salesman problem. Bell System Tech. J. 44, 2245–2269.

Liu, J. (1995).
Metropolized Gibbs sampler: An improvement. Technical report. Deptartment of Statistics, Stanford
University, CA.

Liu, J., Wong, W. and Kong, A. (1994)


Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and sampling
schemes. Biometrika 81, 27–40.

Liu, J., Wong, W. and Kong, A. (1995)


Correlation structure and convergence rate of the Gibbs sampler with various scans. J. Royal Statist. Soc.
Series B 57, 157–169.

Lundy, M. and Mees, A. (1986)


Convergence of an annealing algorithm. Math. Programming 34, 111–124. 414 / 417
References
Mengersen, K. and Tweedie, R. (1996)
Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24, 101–121.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953)
Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092.

Meyn, S. and Tweedie, R. (1993)
Markov Chains and Stochastic Stability. Springer, New York.

Mira, A. and Tierney, L. (2002)
On the use of auxiliary variables in Markov chain Monte Carlo sampling. Scand. J. Stat. 29, 1–12.

Neal, R. (2003)
Slice sampling (with discussion). Ann. Statist. 31, 705–767.

Neath, R. C. (2012)
On convergence properties of the Monte Carlo EM algorithm. arXiv:1206.4768.

Newton, M. A. and Hastie, D. I. (2006)
Assessing Poisson variation of intestinal tumour multiplicity in mice carrying a Robertsonian translocation.
J. R. Stat. Soc. Ser. C. Appl. Stat. 55, 123–138.

Oakes, D. (1999)
Direct calculation of the information matrix via the EM algorithm. J. R. Statist. Soc. B 61, 479–482.

Peskun, P. (1973)
Optimum Monte Carlo sampling using Markov chains. Biometrika 60, 607–612.

Rao, C. R. (1965)
Linear Statistical Inference and its Applications. Wiley, New York.

Richardson, S. and Green, P. J. (1997)
On Bayesian analysis of mixtures with an unknown number of components. J. R. Statist. Soc. B 59,
731–792.

Robbins, H. and Monro, S. (1951)
A stochastic approximation method. Ann. Mathemat. Statist. 22, 400–407.

Robert, C. P. (2007)
The Bayesian Choice. From Decision-Theoretic Foundations to Computational Implementation. Second
Edition. Springer, New York.

Robert, C. P. (1995)
Simulation of truncated normal variables. Stat. Comput. 5, 121–125.

Roberts, G., Gelman, A. and Gilks, W. (1997)
Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Applied Prob. 7,
110–120.

Roberts, G. and Rosenthal, J. (1998)
Markov chain Monte Carlo: Some practical implications of theoretical results (with discussion). Canadian J.
Statist. 26, 5–32.

Roberts, G. and Rosenthal, J. (1999)
Convergence of slice sampler Markov chains. J. R. Statist. Soc. B 61, 643–660.

Roberts, G. and Stramer, O. (2002)
Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab. 4, 337–357.

Roberts, G. and Tweedie, R. (1996)
Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms.
Biometrika 83, 95–110.

Rubinstein, R. Y. (1981)
Simulation and the Monte Carlo method. Wiley, New York.

Tanner, M. and Wong, W. (1987)
The calculation of posterior distributions by data augmentation. J. American Statist. Assoc. 82, 528–550.

Tanner, M. and Wong, W. (2010)
From EM to data augmentation: The emergence of MCMC Bayesian computation in the 1980s. Statist. Sci.
25, 506–516.

Tierney, L. (1994)
Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22, 1701–1786.

Vines, S., Gilks, W. and Wild, P. (1995)
Fitting multiple random effect models. Technical report. MRC Biostatistics Unit, Cambridge University.

Wei, G. and Tanner, M. (1990)
Posterior computations for censored regression data. J. American Statist. Assoc. 85, 829–839.

Wu, C. (1983)
On the convergence properties of the EM algorithm. Ann. Statist. 11, 95–103.

Yakowitz, S., Krimmel, J. E. and Szidarovszky, F. (1978)
Weighted Monte Carlo integration. SIAM J. Numer. Anal. 15, 1289–1300.
