Gaussian Markov
Random Fields
Theory and Applications
Håvard Rue
Leonhard Held
Rue, Havard.
Gaussian Markov random fields : theory and applications / Havard Rue & Leonhard Held.
p. cm. -- (Monographs on statistics and applied probability ; 104)
Includes bibliographical references and index.
ISBN 1-58488-432-0 (alk. paper)
1. Gaussian Markov random fields. I. Held, Leonhard. II. Title. III. Series.
QA274.R84 2005
519.2'33--dc22 2004061870
Preface
1 Introduction
1.1 Background
1.1.1 An introductory example
1.1.2 Conditional autoregressions
1.2 The scope of this monograph
1.2.1 Numerical methods for sparse matrices
1.2.2 Statistical inference in hierarchical models
1.3 Applications of GMRFs
5 Approximation techniques
5.1 GMRFs as approximations to Gaussian fields
5.1.1 Gaussian fields
5.1.2 Fitting GMRFs to Gaussian fields
5.1.3 Results
5.1.4 Regular lattices and boundary conditions
5.1.5 Example: Swiss rainfall data
5.2 Approximating hidden GMRFs
5.2.1 Constructing non-Gaussian approximations
5.2.2 Example: A stochastic volatility model
5.2.3 Example: Reanalyzing Tokyo rainfall data
5.3 Bibliographic notes
Appendices
A Common distributions
References
Introduction
1.1 Background
This monograph considers Gaussian Markov random fields (GMRFs)
covering both theory and applications. A GMRF is really a simple
construct: It is just a (finite-dimensional) random vector following a
multivariate normal (or Gaussian) distribution. However, we will be
concerned with more restrictive versions where the GMRF satisfies ad-
ditional conditional independence assumptions, hence the term Markov.
Conditional independence is a powerful concept. Let x = (x1, x2, x3)T
be a random vector; then x1 and x2 are conditionally independent given
x3 if, for a known value of x3, learning x2 tells us nothing new about
the distribution of x1. Under this condition the joint density π(x) must
have the representation
π(x) = π(x1 | x3 ) π(x2 | x3 ) π(x3 ),
which is a simplification of a general representation
π(x) = π(x1 | x2 , x3 ) π(x2 | x3 ) π(x3 ).
The conditional independence property implies that π(x1 |x2 , x3 ) is
simplified to π(x1 |x3 ), which is easier to understand, to represent, and
to interpret.
    Q = ⎛  1    −φ                      ⎞
        ⎜ −φ  1 + φ²   −φ               ⎟
        ⎜        ⋱        ⋱       ⋱     ⎟
        ⎜             −φ   1 + φ²   −φ  ⎟
        ⎝                     −φ     1  ⎠
with zero entries outside the diagonal and first off-diagonals. The
conditional independence assumptions impose certain restrictions on
the precision matrix. The tridiagonal form is due to the fact that xi
and xj are conditionally independent for |i − j| > 1, given the rest.
This also holds in general for any GMRF: If Qij = 0 for i ≠ j,
then xi and xj are conditionally independent given the other variables
{xk : k ≠ i and k ≠ j}, and vice versa. The sparse structure of Q
prepares the ground for fast computations of GMRFs to which we return
in Section 1.2.1.
The simple relationship between conditional independence and the
zero structure of the precision matrix is not evident in the covariance
matrix Σ = Q−1 , which is a (completely) dense matrix with entries
    σij = φ^|i−j| / (1 − φ²).
For example, for n = 7,
                    ⎛ 1   φ   φ²  φ³  φ⁴  φ⁵  φ⁶ ⎞
                    ⎜ φ   1   φ   φ²  φ³  φ⁴  φ⁵ ⎟
                    ⎜ φ²  φ   1   φ   φ²  φ³  φ⁴ ⎟
    Σ = 1/(1 − φ²)  ⎜ φ³  φ²  φ   1   φ   φ²  φ³ ⎟ .
                    ⎜ φ⁴  φ³  φ²  φ   1   φ   φ² ⎟
                    ⎜ φ⁵  φ⁴  φ³  φ²  φ   1   φ  ⎟
                    ⎝ φ⁶  φ⁵  φ⁴  φ³  φ²  φ   1  ⎠
It is therefore difficult to derive conditional independence properties from
the structure of Σ. Clearly, the entries in Σ only give (direct) information
about the marginal dependence structure, not the conditional one.
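To make the contrast concrete, here is a small numerical sketch (Python/NumPy, not part of the original text) that builds the tridiagonal AR(1) precision matrix above and checks that its inverse is the dense matrix with entries φ^|i−j|/(1 − φ²):

```python
import numpy as np

n, phi = 7, 0.9

# Tridiagonal precision matrix of a stationary AR(1) process with unit
# innovation variance: corners 1, interior diagonal 1 + phi^2, off-diagonals -phi.
Q = np.diag(np.r_[1.0, np.full(n - 2, 1.0 + phi**2), 1.0])
Q += np.diag(np.full(n - 1, -phi), k=1) + np.diag(np.full(n - 1, -phi), k=-1)

# The covariance matrix is completely dense with entries phi^|i-j| / (1 - phi^2).
i, j = np.indices((n, n))
Sigma = phi ** np.abs(i - j) / (1.0 - phi**2)

print(np.allclose(np.linalg.inv(Q), Sigma))           # True
print(np.count_nonzero(Q), np.count_nonzero(Sigma))   # 3n - 2 versus n^2
```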
The conditional specification of GMRFs was pioneered by Besag (1974, 1975). These models are also
known by the name conditional autoregressions, abbreviated as CAR
models. There is also an alternative and more restrictive approach to
CAR models, the so-called simultaneous autoregressions (SAR), which
we will not discuss specifically. This approach dates back to Whittle
(1954); see, for example, Cressie (1993) for further details.
The n full conditionals (1.5) must satisfy some consistency conditions
to ensure that a joint normal density exists with these full conditionals.
Figure 1.2 The graph of the 366 administrative regions in Sardinia where two
regions sharing a common border are neighbors. Neighbors in the graph are
linked by edges or indicated by overlapping nodes.
This is a typical example where MCMC is the only way for statistical
inference, but where the choice of the particular MCMC algorithm is
crucial. In Section 4, we will describe an MCMC algorithm that jointly
updates the GMRF w and the hyperparameters θ (here the unknown
precisions of x and v) in one block, thus ensuring good mixing and
convergence properties of the algorithm.
Suppose now there is also covariate information zi available for each
observation yi, where zi is of dimension p, say. A common approach is to
assume that yi is Poisson with mean exp(xi + vi + zi^T β), where β is
a vector of unknown regression parameters with a multivariate normal
prior with some mean and some precision matrix, which can be zero. We
do not give the exact details here, but β can also be merged with the
GMRF w to a larger GMRF of dimension 2n + p, which still inherits
the sparse structure. Furthermore, block updates of the enlarged field,
jointly with the unknown hyperparameters, are still possible.
Merging two or more GMRFs into a larger one typically preserves
the local features of the GMRF and simplifies the structure of the
model. This is important mainly for computational reasons, as we can
then construct efficient MCMC algorithms. Note that if θ is fixed
and yi is normal, this will correspond to independent simulation from
the posterior, no matter how large the dimension of the GMRF. For
nonnormal observations, as in the above Poisson case, we will use a GMRF
approximation to the full conditional of the GMRF as part of the block update.
Figure 1.3 The graph of w (1.7) where the graph of x is in Figure 1.1(a) and
(b), respectively. The nodes corresponding to u are displayed in gray.
2.1 Preliminaries
2.1.1 Matrices and vectors
Vectors and matrices are typeset in bold, like x and A. The transpose
of A is denoted by AT . The notation A = (Aij ) means that the element
in the ith row and jth column of A is Aij . For a vector we use the same
notation, x = (xi ). We denote by xi:j the vector (xi , xi+1 , . . . , xj )T . For
an n × m matrix A with columns A1 , A2 , . . . , Am , vec(A) denotes the
vector obtained by stacking the columns one above the other, vec(A) =
(AT1 , AT2 , . . . , ATm )T . A submatrix of A is obtained by deleting some
rows and/or columns of A. A submatrix of an n × n matrix A is called a
principal submatrix if it can be obtained by deleting rows and columns of
the same index, so, for example,

    B = ⎛ A11  A13 ⎞
        ⎝ A31  A33 ⎠

is a principal submatrix
of A. An r × r submatrix is called a leading principal submatrix of A, if
it can be obtained by deleting the last n − r rows and columns.
We use the notation diag(A) and diag(a), where A is an n × n matrix
and a is a vector of length n: diag(A) is the vector containing the diagonal
of A, and diag(a) is the n × n diagonal matrix with a on its diagonal.
The symbol ‘⊙’ denotes elementwise multiplication and ‘⊘’ denotes
elementwise division. We will use the symbol ‘∘’ for raising each element
of a matrix A to a scalar power a, i.e., element ij of A∘a is Aij^a.
[Figure, panels (a) and (b), showing (most of) the states of the USA. Each
state represents a region, and regions sharing a common border are neighbors.]
Comparing (2.8) and (2.9) we see that π(xi |x−i ) is normal. Comparing
the coefficients for the quadratic term, we obtain (2.6). Comparing the
coefficients for the linear term, we obtain
    E(xi | x−i) = − (1/Qii) Σ_{j: j∼i} Qij xj .
If x has mean µ, then x−µ has mean zero, hence replacing xi and xj by
xi − µi and xj − µj , respectively, gives (2.5). To show (2.7), we proceed
similarly and consider
    π(xi, xj | x−ij) ∝ exp( − (1/2) (xi, xj) ⎛ Qii  Qij ⎞ ⎛ xi ⎞ + linear terms ).   (2.10)
                                              ⎝ Qji  Qjj ⎠ ⎝ xj ⎠
We compare this density with the density of the bivariate normal random
variable (xi , xj )T with covariance matrix Σ = (Σij ), which has density
proportional to
    exp( − (1/2) (xi, xj) ⎛ Σii  Σij ⎞⁻¹ ⎛ xi ⎞ + linear terms ).   (2.11)
                           ⎝ Σji  Σjj ⎠   ⎝ xj ⎠
Figure 2.3 Illustration of the various Markov properties. (a) The pairwise
Markov property; the two black nodes are conditionally independent given
the gray nodes. (b) The local Markov property; the black and white nodes
are conditionally independent given the gray nodes. (c) The global Markov
property; the black and striped nodes are conditionally independent given the
gray nodes.
Conditional distribution
We split the indices into the nonempty sets A and denote by B the set
−A, so that
    x = ⎛ xA ⎞ .     (2.13)
        ⎝ xB ⎠
Partition the mean and precision accordingly,
    µ = ⎛ µA ⎞ ,   and   Q = ⎛ QAA  QAB ⎞ .     (2.14)
        ⎝ µB ⎠               ⎝ QBA  QBB ⎠
Our next result is a generalization of Theorem 2.3.
Theorem 2.5 Let x be a GMRF wrt G = (V, E) with mean µ and
precision matrix Q > 0. Let A ⊂ V and B = V \ A where A, B = ∅. The
conditional distribution of xA |xB is then a GMRF wrt the subgraph G A
with mean µA|B and precision matrix QA|B > 0, where
    µA|B = µA − QAA⁻¹ QAB (xB − µB)     (2.15)
and
    QA|B = QAA .
This is a powerful result for two reasons. First, we have explicit knowl-
edge of QA|B through the principal submatrix QAA , so no computation
is needed to obtain the conditional precision matrix. Constructing
the subgraph G A does not change the structure; it just removes all
nodes not in A and the corresponding edges. This is important for the
computational issues that will be discussed in Section 2.3. Secondly, since
Qij is zero unless j ∈ ne(i), the conditional mean only depends on values
of µ and Q in A ∪ ne(A). This is a great advantage if A is a small subset
of V and in striking contrast to the corresponding general result for the
normal distribution, see Section 2.1.7.
Example 2.2 To illustrate Theorem 2.5, we compute the mean and
precision of xi given x−i , which are found using A = {i} as (2.5)
and (2.6). This result is frequently used for single-site Gibbs sampling in
GMRF models, to which we return in Section 4.1.
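As a concrete illustration of Theorem 2.5 (a sketch in Python/NumPy, not from the book), the conditional precision of xA | xB is read off Q directly as QAA, and only the conditional mean requires a solve; the result is checked against brute-force conditioning of the joint normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small SPD precision matrix Q and mean mu for a 6-dimensional GMRF.
n = 6
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)
mu = rng.standard_normal(n)

A = [0, 2, 5]                 # indices in the set A
B = [1, 3, 4]                 # indices in B = V \ A
xB = rng.standard_normal(len(B))

# Theorem 2.5: Q_{A|B} = Q_AA and mu_{A|B} = mu_A - Q_AA^{-1} Q_AB (x_B - mu_B).
Q_AA = Q[np.ix_(A, A)]
Q_AB = Q[np.ix_(A, B)]
mu_cond = mu[A] - np.linalg.solve(Q_AA, Q_AB @ (xB - mu[B]))

# Brute-force check via the covariance matrix Sigma = Q^{-1}.
S = np.linalg.inv(Q)
S_AA, S_AB, S_BB = S[np.ix_(A, A)], S[np.ix_(A, B)], S[np.ix_(B, B)]
mu_check = mu[A] + S_AB @ np.linalg.solve(S_BB, xB - mu[B])
Q_check = np.linalg.inv(S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T))

print(np.allclose(mu_cond, mu_check), np.allclose(Q_AA, Q_check))  # True True
```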
Since (2.21) and (2.22) must be identical it follows that κi βij = κj βji
for i ≠ j. The density of x can then be expressed as

    log π(x) = const − (1/2) Σ_{i=1}^{n} κi xi² − (1/2) Σ_{i≠j} κi βij xi xj ;
and

    Q̃ij ≠ 0 ⇐⇒ {i, j} ∈ E for all i ≠ j.

An MGMRFp is also a GMRF with dimension np with identical mean
vector and precision matrix. All results valid for a GMRF are then
also valid for an MGMRFp, with obvious changes as the graph for an
MGMRFp is of size n and defined wrt the vectors {xi}, while for a GMRF it
is of size np and defined wrt their scalar elements.
Interpretation of Q̃ii and Q̃ij can be derived from the full conditional
π(xi | x−i). The extensions of (2.5) and (2.6) are

    E(xi | x−i) = µi − Q̃ii⁻¹ Σ_{j: j∼i} Q̃ij (xj − µj)

    Prec(xi | x−i) = Q̃ii .
Sample x ∼ N (µ, Σ)
We start with Algorithm 2.3, the most commonly used algorithm for
sampling from a multivariate normal random variable x ∼ N (µ, Σ).
Then x has the required distribution, as

    Cov(x) = Cov(L̃z) = L̃ Cov(z) L̃T = L̃L̃T = Σ     (2.24)

and E(x) = µ. To obtain repeated samples, we do step 1 only once.
because |Σ| = |L̃L̃T| = |L̃| |L̃T| = |L̃|². Hence we obtain

    log π(x) = − (n/2) log(2π) − Σ_{i=1}^{n} log L̃ii − (1/2) uT u,     (2.25)

where u solves L̃u = x − µ.
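A compact NumPy sketch of this sampling and density evaluation (my own illustration, assuming a generic SPD Σ), mirroring (2.24) and (2.25):

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_mvn(mu, Sigma, rng):
    """Algorithm-2.3-style sampler: x = mu + L~ z with Sigma = L~ L~^T."""
    L = np.linalg.cholesky(Sigma)          # step 1: Cholesky triangle of Sigma
    z = rng.standard_normal(len(mu))       # step 2: z ~ N(0, I)
    return mu + L @ z                      # step 3: x ~ N(mu, Sigma)

def logpdf_mvn(x, mu, Sigma):
    """Log density via (2.25): uses log|Sigma| = 2 * sum(log L~_ii)."""
    L = np.linalg.cholesky(Sigma)
    u = np.linalg.solve(L, x - mu)         # forward substitution: L~ u = x - mu
    n = len(mu)
    return -0.5 * n * np.log(2 * np.pi) - np.log(np.diag(L)).sum() - 0.5 * u @ u

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
mu = np.array([1.0, -1.0])
x = sample_mvn(mu, Sigma, rng)

# Cross-check against a library implementation of the multivariate normal density.
print(np.allclose(logpdf_mvn(x, mu, Sigma),
                  multivariate_normal(mu, Sigma).logpdf(x)))   # True
```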
Theorem 2.7 Let x be a GMRF wrt the labelled graph G, with mean
µ and precision matrix Q > 0. Let L be the Cholesky triangle of Q.
Then for i ∈ V,
    E(xi | x(i+1):n) = µi − (1/Lii) Σ_{j=i+1}^{n} Lji (xj − µj)   and
    Prec(xi | x(i+1):n) = Lii² .
In matrix terms, this is a direct consequence of Q = LLT, which gives

    Qii = Lii² + Σ_{j=1}^{i−1} Lij² .
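For a GMRF given by its precision matrix, the same machinery runs through the Cholesky triangle of Q; the following NumPy sketch (mine, dense for clarity although the point is to exploit sparsity) samples x ∼ N(µ, Q⁻¹) by solving LᵀV = Z by back substitution:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)

# Tridiagonal AR(1)-type precision as a small test case (dense here for clarity;
# in practice Q would be stored sparsely and factorized with a sparse Cholesky).
n, phi = 100, 0.8
Q = (np.diag(np.r_[1.0, np.full(n - 2, 1 + phi**2), 1.0])
     + np.diag(np.full(n - 1, -phi), 1) + np.diag(np.full(n - 1, -phi), -1))
mu = np.zeros(n)

L = np.linalg.cholesky(Q)                  # Q = L L^T, computed once
Z = rng.standard_normal((n, 5000))         # columns are iid N(0, I) vectors
V = solve_triangular(L.T, Z, lower=False)  # back substitution: L^T V = Z
X = mu[:, None] + V                        # each column ~ N(mu, Q^{-1})

print(np.cov(X)[0, :3])                    # approaches Q^{-1}[0, :3]
print(np.linalg.inv(Q)[0, :3])             # = [2.78, 2.22, 1.78] for phi = 0.8
```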
Sample x ∼ NC (b, Q)
For repeated samples we do step 1 and steps 5–7 only once. Note that
if z = 0 then x∗ is the conditional mean (2.28). The following trivial
but very useful example illustrates the use of (2.30).
Example 2.3 Let x1, . . . , xn be independent normal variables with
mean µi and variance σi². To sample x conditional on Σ_i xi = 0, we
first sample xi ∼ N(µi, σi²) for i = 1, . . . , n and compute the constrained
sample x∗ via

    xi∗ = xi − c σi² ,   where   c = Σ_j xj / Σ_j σj² .
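Example 2.3 is the special case A = 1ᵀ, e = 0 of the general 'conditioning by kriging' correction x∗ = x − Q⁻¹Aᵀ(AQ⁻¹Aᵀ)⁻¹(Ax − e); the following NumPy sketch (mine, dense for brevity) checks that the closed-form correction above agrees with that general formula:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
mu = rng.standard_normal(n)
sigma2 = rng.uniform(0.5, 2.0, n)        # independent variances, Q = diag(1/sigma2)

# Unconstrained sample x_i ~ N(mu_i, sigma_i^2).
x = mu + np.sqrt(sigma2) * rng.standard_normal(n)

# Example 2.3: condition on sum(x) = 0 via x*_i = x_i - c * sigma_i^2.
c = x.sum() / sigma2.sum()
x_star = x - c * sigma2
print(x_star.sum())                      # 0 (up to rounding)

# Same correction in the general kriging form with A = 1^T, e = 0, Q^{-1} = diag(sigma2).
A = np.ones((1, n))
Qinv = np.diag(sigma2)
x_star2 = x - Qinv @ A.T @ np.linalg.solve(A @ Qinv @ A.T, A @ x - 0.0)
print(np.allclose(x_star, x_star2))      # True
```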
The log density π(x|Ax) can be rapidly evaluated at x∗ using the
identity
    π(x | Ax) = π(x) π(Ax | x) / π(Ax).     (2.31)
Note that we can compute each term on the right-hand side easier than
the term on the left-hand side: The unconstrained density π(x) is a
GMRF and the log density is computed using (2.26) and L computed in
Algorithm 2.6, step 1. The degenerate density π(Ax|x) is either zero or
a constant, which must be one for A = I. A change of variables gives us
    log π(Ax | x) = − (1/2) log |AAT| ,
i.e., we need to compute the determinant of a k × k matrix, which is
found from its Cholesky factorization. Finally, the denominator π(Ax)
in (2.31) is normal with mean Aµ and covariance matrix AQ−1 AT . The
corresponding Cholesky triangle L is available from Algorithm 2.6, step
7. The log density can then be computed from (2.25).
the log density. Note that all Cholesky triangles required to evaluate the
log density are already computed in Algorithm 2.7.
The stochastic version of Example 2.3 now follows.
Example 2.4 Let x1, . . . , xn be independent normal variables with
variance σi² and mean µi. We now observe e ∼ N(Σ_i xi, σε²). To sample
from π(x | e), we sample xi ∼ N(µi, σi²), unconditionally, for i = 1, . . . , n,
while we condition on ε ∼ N(e, σε²). A conditional sample x∗ is then

    xi∗ = xi − c σi² ,   where   c = (Σ_j xj − ε) / (Σ_j σj² + σε²) .
We can merge soft and hard constraints into one framework if we allow
Σǫ to be SPSD, but we have chosen not to, as the details are somewhat
tedious.
We start with a dense matrix Q > 0 with dimension n, and show how
to compute the Cholesky triangle L, so Q = LLT , which can be written
as
    Qij = Σ_{k=1}^{j} Lik Ljk ,   i ≥ j.
We now define
    vi = Qij − Σ_{k=1}^{j−1} Lik Ljk ,   i ≥ j,

and we immediately see that Ljj² = vj and Lij Ljj = vi for i > j. If we
know {vi} for fixed j, then Ljj = √vj and Lij = vi /√vj , for i = j + 1 to
n. This gives the jth column in L. The algorithm is completed by noting
that {vi } for fixed j only depends on elements of L in the first j − 1
columns of L. Algorithm 2.8 gives the pseudocode using vector notation
for simplicity: vj:n = Qj:n,j is short for vk = Qkj for k = j to n and so
on. The overall process involves n³/3 flops. If Q is symmetric but not
SPD, then vj ≤ 0 for some j and the algorithm fails.
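A direct transcription of this column-by-column scheme into Python/NumPy (my own sketch of the procedure the text refers to as Algorithm 2.8, written densely and without the vector shorthand):

```python
import numpy as np

def cholesky_by_columns(Q):
    """Compute the Cholesky triangle L with Q = L L^T, column by column."""
    Q = np.asarray(Q, dtype=float)
    n = Q.shape[0]
    L = np.zeros_like(Q)
    for j in range(n):
        # v_{j:n} = Q_{j:n, j} minus contributions from the first j-1 columns of L.
        v = Q[j:, j] - L[j:, :j] @ L[j, :j]
        if v[0] <= 0.0:
            raise ValueError("Q is not symmetric positive definite")
        L[j, j] = np.sqrt(v[0])            # L_jj = sqrt(v_j)
        L[j + 1:, j] = v[1:] / L[j, j]     # L_ij = v_i / sqrt(v_j), i > j
    return L

Q = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 2.0],
              [0.0, 2.0, 5.0]])
L = cholesky_by_columns(Q)
print(np.allclose(L @ L.T, Q), np.allclose(L, np.linalg.cholesky(Q)))  # True True
```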
and the corresponding precision matrix Q

    Q = ⎛ ×  ×  ×     ⎞
        ⎜ ×  ×     ×  ⎟
        ⎜ ×     ×  ×  ⎟
        ⎝    ×  ×  ×  ⎠
where the ×’s denote nonzero terms. The only possible zero terms in L
(in general) are L32 and L41 due to Corollary 2.2. Considering L32 we
see that F(2, 3) = {4}. This is not a separating subset for 2 and 3 due to
node 1, hence L32 is not known to be zero using Corollary 2.2. For L41
we see that F (1, 4) = {2, 3}, which does separate 1 and 4, hence L41 = 0.
In total, L has the following structure:
    L = ⎛ ×           ⎞
        ⎜ ×  ×        ⎟
        ⎜ ×  √  ×     ⎟ ,
        ⎝    ×  ×  ×  ⎠

where the possibly nonzero entry L32 is marked as ‘√’.
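The prediction can be checked numerically; the following NumPy sketch (mine, with arbitrary SPD values placed on the nonzero pattern above, i.e., the graph 1∼2, 1∼3, 2∼4, 3∼4) shows that L41 vanishes while L32 in general does not:

```python
import numpy as np

# Precision matrix on 4 nodes with edges 1~2, 1~3, 2~4, 3~4 (1-based labels).
Q = np.array([[ 3.0, -1.0, -1.0,  0.0],
              [-1.0,  3.0,  0.0, -1.0],
              [-1.0,  0.0,  3.0, -1.0],
              [ 0.0, -1.0, -1.0,  3.0]])

L = np.linalg.cholesky(Q)
print(np.round(L, 3))
# L[3, 0] (i.e., L41) is exactly zero: F(1,4) = {2,3} separates nodes 1 and 4.
# L[2, 1] (i.e., L32) is nonzero fill-in: F(2,3) = {4} does not separate 2 and 3.
print(abs(L[3, 0]) < 1e-12, abs(L[2, 1]) > 1e-12)   # True True
```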
Figure 2.4 (a) The map of Germany with n = 544 regions, and (b) the
corresponding graph for the GMRF where neighboring regions share a common
border.
Figure 2.5 (a) The black regions make north and south conditionally indepen-
dent, and (b) displays the automatically computed reordering starting from the
white region ending at the black region. This reordering produces the precision
matrix in Figure 2.6(b).
Figure 2.6 (a) The precision matrix Q in the original ordering, and (b) the
precision matrix after appropriate reordering to obtain a band matrix with
small bandwidth. Only the nonzero terms are shown and those are indicated by
a dot.
Figure 2.7 Two graphs with a slight change in the ordering. Graph (a) requires
O(n³) flops to factorize, while graph (b) only requires O(n). Here, n represents
the number of nodes that are neighbors of the center node. The fill-in is maximal
in (a) and minimal in (b).
There has been much work in the numerical and computer science liter-
ature on other reordering schemes, focusing on reducing the number of
fill-ins rather than reducing the bandwidth. Why reordering
schemes that reduce the number of fill-ins may be better can be seen
from the following example. For a GMRF with the graph shown in Figure
Figure 2.8 Figure (a) displays the ordering found using a nested dissection
algorithm where the ordering is from white to black. (b) displays the reordered
precision matrix and (c) its Cholesky triangle. In (b) and (c) only the nonzero
elements are shown and those are indicated by a dot.
Table 2.1 The average CPU time (in seconds) for factorizing Q into LLT
and solving LLT µ = b, for a band matrix of order n and bandwidth p, using
band-Cholesky factorization and band forward- or back-substitution.
We now add 10 additional nodes, which are neighbors with all others.
This makes the bandwidth maximal so the BCF is not a good choice.
Using MSCF we have obtained the results shown in Table 2.2. The fill-in
Table 2.2 The average CPU time (in seconds) for factorizing Q into LLT and
solving LLT µ = b, for a band matrix of order n and bandwidth p, with 10
additional nodes that are neighbors to all others. The factorization routine is MSCF.
Figure 2.10 Two examples of spatial GMRF models; (a) shows a GMRF on
a lattice used as an approximation to a Gaussian field with an exponential
correlation function, (b) the graph found from Delaunay triangulation of a
planar point set.
Figure 2.11 The neighbors to the black pixel; (a) the 3 × 3 neighborhood system
and (b) the 5 × 5 neighborhood system.
unit square. If adjacent tiles are neighbors, then we obtain the graph
found by the Delaunay triangulation of the points. This graph is similar
to the one defined by regions of Germany in Figure 2.4, although here,
the outline and the configuration of the regions are random as well.
We will report some timing results for regular lattices only, as they are similar
for irregular lattices. The neighbors to a pixel i will be those 8 (24) in the
3 × 3 (5 × 5) window centered at i, and the dimension of the lattice will
be 100 × 100, 150 × 150, and 200 × 200. Table 2.3 summarizes our results.
The speed of the algorithms is again impressive. The performance in the
solve part is quite similar, but for the largest lattice the MSCF really
outperforms the BCF. The reason is the O(n^{3/2}) cost for MSCF compared
to O(n²) for the BCF, which is of clear importance for large lattices. For
Table 2.3 The average CPU time (in seconds) for factorizing Q into LLT and
solving LLT µ = b, for a 1002 , 1502 and 2002 square lattice with 3 × 3 and
5 × 5 neighborhood, using the BCF and MSCF method.
Table 2.4 The average CPU time (in seconds) using the MSCF algorithm for
factorizing Q into LLT and solving LLT µ = b, for the spatiotemporal GMRF
using T time steps, and with and without 10 global nodes. The dimension of
the graph is 544 × T .
Λ = diag(λ0 , λ1 , . . . , λn−1 ).
we obtain Cij = cj−i mod n , i.e., all circulant matrices can be expressed
as F ΛF H for some diagonal matrix Λ.
The following theorem now states that the class of circulant matrices
is closed under some matrix operations.
Theorem 2.10 Let C and D be n × n circulant matrices. Then
1. C and D commute, i.e., CD = DC, and CD is circulant
2. C ± D is circulant
3. C^p is circulant, p = 1, 2, . . .
4. if C is nonsingular then C^p is circulant, p = −1, −2, . . .
Proof. Recall that a circulant matrix is uniquely described by its
n eigenvalues as they all share the same eigenvectors. Let ΛC and
ΛD denote the diagonal matrices with the eigenvalues of C and D,
respectively, on the diagonal. Then
CD = F ΛC F H F ΛD F H
= F (ΛC ΛD )F H ;
hence CD is circulant with eigenvalues {λCi λDi }. The matrices com-
mute since ΛC ΛD = ΛD ΛC for diagonal matrices. Using the same
argument,
C ± D = F (ΛC ± ΛD) F^H ;
hence C ± D is circulant. Similarly,
    C^p = F ΛC^p F^H ,   p = ±1, ±2, . . .
as ΛC^p is a diagonal matrix.
The matrix F in (2.42) is well known as the discrete Fourier transform
(DFT) matrix, so computing F v for a vector v is the same as computing
the DFT of v. Taking the inverse DFT (IDFT) of v is the same as
calculating F^H v. Note that if n can be factorized as a product of small
primes, the computation of F v requires only O(n log n) flops. ‘Small
primes’ is the ‘traditional’ requirement, but the (superb) library FFTW,
which is a comprehensive collection of fast C routines for computing the
discrete Fourier transform (http://www.fftw.org), allows arbitrary sizes
and employs O(n log n) algorithms for all sizes. Small primes are still
computationally most efficient.
and
                             ⎛ Σ_{j=0}^{n−1} vj ω^{−j·0}     ⎞
                             ⎜ Σ_{j=0}^{n−1} vj ω^{−j·1}     ⎟
    IDFT(v) = F^H v = (1/√n) ⎜            ⋮                  ⎟ .
                             ⎝ Σ_{j=0}^{n−1} vj ω^{−j(n−1)}  ⎠
Recall that ‘⊙’ denotes elementwise multiplication, ‘⊘’ denotes elemen-
twise division, and ‘∘’ is elementwise power, see Section 2.1.1. Let C be
a circulant matrix with base c, then the matrix-vector product Cv can
be computed as
    Cv = F Λ F^H v
       = F √n diag(F c) F^H v
       = √n DFT( DFT(c) ⊙ IDFT(v) ).

The product of two circulant matrices C and D, with base c and d,
respectively, can be written as

    CD = F (ΛC ΛD) F^H                                    (2.44)
       = F √n diag(F c) √n diag(F d) F^H.                 (2.45)

Since CD is a circulant matrix with (unknown) base p, say, then

    CD = F √n diag(F p) F^H.                              (2.46)

Comparing (2.46) and (2.44), we see that

    √n diag(F p) = √n diag(F c) √n diag(F d);

hence

    p = √n IDFT( DFT(c) ⊙ DFT(d) ).

Solving Cx = b can be done similarly, since

    x = C⁻¹ b = F Λ⁻¹ F^H b = (1/√n) DFT( IDFT(b) ⊘ DFT(c) ).
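These identities translate directly into FFT calls; a short NumPy sketch (mine; note that numpy.fft.fft is the unnormalized transform, so the eigenvalues of a circulant matrix with base c are simply fft(c) and the √n factors above are absorbed):

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(4)
n = 64
c = np.r_[2.5, -1.0, np.zeros(n - 3), -1.0]   # base of a circulant (SPD) matrix C
d = rng.standard_normal(n)                    # base of a second circulant matrix D
v = rng.standard_normal(n)
b = rng.standard_normal(n)

C = circulant(c)                              # dense reference; first column is c
lam = np.fft.fft(c)                           # eigenvalues of C

# Matrix-vector product, base of the product of two circulant matrices, and solve.
Cv = np.fft.ifft(lam * np.fft.fft(v)).real
p = np.fft.ifft(lam * np.fft.fft(d)).real     # base of CD
x = np.fft.ifft(np.fft.fft(b) / lam).real     # solves C x = b

print(np.allclose(Cv, C @ v))                       # True
print(np.allclose(circulant(p), C @ circulant(d)))  # True
print(np.allclose(C @ x, b))                        # True
```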
    Q = (FN ⊗ Fn) Λ (FN ⊗ Fn)^H ,
algorithm. We can make use of it since Im(v) has the same distribution
as Re(v), and Im(v) and Re(v) are independent.
The log density can be evaluated as

    − (Nn/2) log 2π + (1/2) log |Q| − (1/2) vec(x)^T Q vec(x),

where

    log |Q| = Σ_{ij} log Λij
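For a GMRF on a torus the eigenvalues Λ are obtained from the two-dimensional DFT of the base of Q, so the log determinant and linear solves reduce to FFT operations; a brief NumPy sketch (mine, using as an arbitrary example the base of a first-order precision with conditional precision 4.1 and weight 1 on the four nearest neighbours):

```python
import numpy as np

N, n = 32, 32

# Base of the precision matrix Q on an N x n torus (an arbitrary SPD example).
base = np.zeros((N, n))
base[0, 0] = 4.1
base[1, 0] = base[-1, 0] = base[0, 1] = base[0, -1] = -1.0

lam = np.fft.fft2(base).real          # eigenvalues of Q (real since the base is even)
print((lam > 0).all())                # True: the base defines an SPD precision
log_det_Q = np.log(lam).sum()         # log|Q| = sum_ij log Lambda_ij

# Solve Q vec(x) = vec(b) entirely in the spectral domain.
rng = np.random.default_rng(5)
b = rng.standard_normal((N, n))
x = np.fft.ifft2(np.fft.fft2(b) / lam).real

# Check: Q x is the circular convolution of the base with x.
Qx = np.fft.ifft2(np.fft.fft2(base) * np.fft.fft2(x)).real
print(np.allclose(Qx, b))             # True
```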
(a) (b)
Figure 2.13 Illustrations to Example 2.8, the sample (a) and the first column
of the base of Q−1 in (b).
base equal to

    ⎛ 1.00 0.42 0.39 0.28 0.24 0.20 0.18 0.16 0.14 ··· ⎞
    ⎜ 0.42 0.33 0.31 0.26 0.23 0.20 0.18 0.16 0.14 ··· ⎟
    ⎜ 0.39 0.31 0.29 0.25 0.22 0.19 0.17 0.15 0.14 ··· ⎟
    ⎜ 0.28 0.26 0.25 0.22 0.20 0.18 0.16 0.15 0.13 ··· ⎟
    ⎜ 0.24 0.23 0.22 0.20 0.18 0.17 0.15 0.14 0.13 ··· ⎟
    ⎜ 0.20 0.20 0.19 0.18 0.17 0.15 0.14 0.13 0.12 ··· ⎟ ,
    ⎜ 0.18 0.18 0.17 0.16 0.15 0.14 0.13 0.12 0.11 ··· ⎟
    ⎜ 0.16 0.16 0.15 0.15 0.14 0.13 0.12 0.11 0.10 ··· ⎟
    ⎜ 0.14 0.14 0.14 0.13 0.13 0.12 0.11 0.10 0.09 ··· ⎟
    ⎝  ⋮    ⋮    ⋮    ⋮    ⋮    ⋮    ⋮    ⋮    ⋮   ⋱  ⎠

respectively.
Both norms can be expressed in terms of the eigenvalues {λk } of AT A:
    ‖A‖s² = max_k λk ,   and   ‖A‖w² = (1/n) trace(A^T A) = (1/n) Σ_k λk .
Consider first term 1. Using Theorem 2.11 with f(·) = log(·) and that Tn
and Cn are asymptotically equivalent matrices with bounded eigenvalues
from above and from below (by assumption), then

    lim_{n→∞} (1/n) log |Tn| = lim_{n→∞} (1/n) log |Cn| .
where i and j are points in the d-dimensional space and γ(·) is the so-
called covariance function (see Definition 5.1 in Section 5.1), they proved
that
    − (1/2) log |ΣN| + (1/2) log |SN| = O(n^{d−1}),
                                                          (2.57)
    − (1/2) x^T ΣN⁻¹ x + (1/2) x^T SN⁻¹ x = Op(n^{d−1}).
The result (2.57) also holds for partial derivatives wrt parameters that
govern γ(·). The error in the deterministic and stochastic part of (2.57)
is of the same order. The consequence is that the log likelihood and its
circulant approximation differ by Op (nd−1 ). We can also use the results
of Kent and Mardia (1996) (their Lemma 4.3 in particular) to give the
same bound on the difference of the conditional log likelihood (or its
circulant approximation) and the log likelihood.
Let θ̂ be the exact MLE and θ̃ the MLE com-
puted using the circulant approximation. Maximum likelihood theory
states that, under some mild regularity conditions, θ̂ is asymptotically
normal with

    N^{1/2}(θ̂ − θ) ∼ N(0, H),

where H > 0. The standard deviation for a component of θ̂ is then
O(N^{−1/2}). The bias in the MLE is for this problem O(N^{−1}) (Mardia,
1990). Kent and Mardia (1996) show that θ̃ has bias of O(1/n). From
this we can conclude that, for d = 1, the bias caused by the circulant
approximation is of smaller order than the standard deviation. The
circulant approximation is harmless and θ̃ has the same asymptotic
circulant approximation is harmless and θ has the same asymptotic
properties as θ̂. For d = 2 the bias caused by the circulant approximation
is of the same order as the standard deviation, so the error we make is
of the same order as the random error. The circulant approximation
is then tolerable, bearing in mind this bias. For d ≥ 3 the bias is of
larger order than the standard deviation so the error due to the circulant
approximation dominates completely. An alternative is then to use the
modified Whittle approximation to the log likelihood that is discussed
in Section 2.6.5.
where it exists.
The covariance γij can be extracted from the CGF using
    ∂^{i+j} Γ(z1, z2) / (∂z1^i ∂z2^j) |_{(z1, z2) = (0, 0)} = γij
for i ≥ 0 and j ≥ 0, and the SDF can be expressed using the CGF as
    f(ω1, ω2) = (1/(4π²)) Γ(exp(−ιω1), exp(−ιω2)).     (2.61)
We need the following result.
Lemma 2.4 The covariances of the conditional autoregression satisfy
the recurrence equations

    Σ_{kl ∈ I∞} θkl γ_{i+k, j+l} = { 1 if ij = 00,  0 otherwise }.     (2.62)

Define

    dij = γij − (1/θ00) ( δij − Σ_{kl ∈ I∞ \ 00} θkl γ_{i+k, j+l} ) .
which follows by expanding the terms and then using Lemma 2.4. To
show (2.60) we start with
Var(xij ) = Var(E(xij | x−ij )) + E(Var(xij | x−ij ))
to compute
    E(Var(xij | x−ij)) = γ00 − Var(E(xij | x−ij))

        = γ00 − E( (1/θ00²) Σ_{kl ∈ I∞\00} Σ_{k'l' ∈ I∞\00} θkl θk'l' x_{i+k, j+l} x_{i+k', j+l'} )

        = γ00 − (1/θ00²) Σ_{kl ∈ I∞\00} Σ_{k'l' ∈ I∞\00} θkl θk'l' γ_{k'−k, l'−l}

        = γ00 − (1/θ00²) Σ_{kl ∈ I∞\00} θkl (−θ00) γkl

        = (1/θ00) ( θ00 γ00 + Σ_{kl ∈ I∞\00} θkl γkl )

        = 1/θ00 ,
where we have used Lemma 2.4 twice. From this (2.60) follows since
Var(xij |x−ij ) is a constant.
We end by presenting Whittle’s approximation, which uses the SDF
of the process on I∞ to approximate the log likelihood if x is observed
If we can determine Θ+∞, then any configuration in this set is valid for
all n > 2m.
Theorem 2.16 The set Θ+∞ as defined in (2.70) is nonempty, bounded,
convex and

    Θ+∞ = { θ : Σ_{ij} θij cos(iω1 + jω2) > 0 for all (ω1, ω2) ∈ [0, 2π)² }.     (2.71)
Note that (2.71) corresponds to the SDF (2.58) being strictly positive.
Proof. The diagonal dominance criterion, see Section 2.1.6, states that
if Σ_{ij≠00} |θij| < 1 then Q(θ) is SPD for any n > 2m; hence Θ+∞ is non-
empty. Further, Θ+∞ is bounded, as |θij| < 1 (see Section 2.1.6) for any
θ ∈ Θ+n.
Lakshmanan and Derin (1993) showed that (2.71) is equivalent to a
bivariate reciprocal polynomial not having any zeros inside the unit
bicircle. They use some classical results concerning the geometry of
zero sets of reciprocal polynomials and obtain a complete procedure for
verifying the validity of any θ and for identifying Θ+ ∞ . The approach
taken is still somewhat complex but explicit results for m = (1, 1)T are
known. We will now report these results.
Let m = (1, 1)^T so θ can be represented as

    ⎡ sym   θ01   θ11   ⎤
    ⎢ sym    1    θ10   ⎥ ,
    ⎣ sym   sym   θ1,−1 ⎦

where the entries marked with ‘sym’ follow from θij = θ−i,−j. Then
θ ∈ Θ+∞ iff the following four conditions are satisfied:
    ρ = 4(θ11² + θ01² + θ1,−1² − θ10²) − 1 < 0,                          (2.73)

    2(4θ11 θ1,−1 − θ10²) + 2(4θ01 (θ11 + θ1,−1) − 2θ10) + ρ < 0,

    2(4θ11 θ1,−1 − θ10²) − 2(4θ01 (θ11 + θ1,−1) − 2θ10) + ρ < 0,

and either

    R = 16(4θ11 θ1,−1 − θ10²)² − (4θ01 (θ11 + θ1,−1) − 2θ10)² < 0

or

    R ≥ 0,  and  R² − [ −8ρ(4θ11 θ1,−1 − θ10²) + 3(4θ01 (θ11 + θ1,−1) − 2θ10)² ]² < 0.   (2.74)
These inequalities are somewhat involved but special cases are of great
interest. First consider the case where θ01 = θ10 and θ11 = θ1−1 = 0,
which gives the requirement |θ01 | < 1/4.
Figure 2.14 The valid parameter space Θ+∞ where θ01 = θ10 and θ11 = θ1,−1;
the restriction to diagonal dominance, Θ++ defined in (2.76), is shown in
gray.
Suppose now we include the diagonal terms, still with θ01 = θ10 and θ11 = θ1,−1;
the inequalities (2.73) to (2.74) then reduce to |θ01| − θ11 < 1/4 and θ11 < 1/4.
Figure 2.14 shows Θ+∞ in this case. The smaller gray area in Figure 2.14
is found from using a sufficient condition only, the diagonal dominance
criterion, which we discuss in Section 2.7.2.
The analytical results obtained for a stationary GMRF on Tn are
also informative for a nonstationary GMRF on the lattice In′ with full
conditionals
    E(xij | x−ij) = − Σ_{i'j' ≠ 00, (i+i', j+j') ∈ In'} θi'j' x_{i+i', j+j'}   and

    Prec(xij | x−ij) = 1.
The full conditionals equal those in (2.64) in the interior of the lattice,
but differ at the boundary. Let Θ+,In' be the valid parameter space for
the GMRF on In'. We now observe that the (block Toeplitz) precision
matrix for the GMRF on the lattice is a principal submatrix of the
(block-circulant) precision matrix for the GMRF on the torus, if n' ≤
n − m. The consequence is that

    Θ+n ⊆ Θ+,In'   for n' ≤ n − m.     (2.75)
Further details are provided in Section 5.1.4. The block-Toeplitz and the
so

    λ ≥ Aii − Σ_{j: j≠i} |Aij| .
The lower bound is strictly positive as A is diagonal dominant and Aii >
0. As λ was any eigenvalue, all n eigenvalues of A are strictly positive
and A has full rank. Let Λ be a diagonal matrix with the eigenvalues
on the diagonal and the corresponding eigenvectors in the corresponding
column of V , so A = V ΛV^T. For x ≠ 0,
    x^T A x = x^T (V ΛV^T) x = (V^T x)^T Λ (V^T x) > 0,
hence A is SPD.
3.1 Preliminaries
3.1.1 Some additional definitions
The null space of a matrix A is the set of all vectors x such that Ax = 0.
The nullity of A is the dimension of the null space. For an n × m matrix
the rank is m − k, where k is the nullity. For a singular n × n
matrix A with nullity k, we denote by |A|∗ the product of the n − k non-
zero eigenvalues of A.
6. Matrix multiplication
(A ⊗ B)(C ⊗ D) = AC ⊗ BD
3.1.3 Polynomials
Many IGMRFs are invariant to the addition of a polynomial of a certain
degree. Here we introduce the necessary notation for polynomials, first
on a line and then in higher dimensions.
Let s1 < s2 < · · · < sn denote the ordered locations on the line and
define s = (s1 , . . . , sn )T . Let pk (si ) denote a polynomial of degree k,
    A^T = (e1, e2, . . . , ek)     (3.11)

be decomposed as

    x = (c1 e1 + · · · + ck ek) + (dk+1 ek+1 + · · · + dn en)
      = x∥ + x⊥,     (3.14)

where x∥ is the part of x in the subspace spanned by the columns of
A^T, the null space of Q, and x⊥ is the part of x orthogonal to x∥,
where of course (x∥)^T x⊥ = 0. For a given x, the coefficients in (3.14)
are ci = ei^T x and dj = ej^T x.
Using this decomposition we immediately see that
π ∗ (x) = π ∗ (x⊥ ), (3.15)
so π∗(x) does not depend on c1, . . . , ck. Hence, π∗ is invariant to the
addition of any x∥ and this is the important feature of IGMRFs. Also
note that π∗(x⊥) is equal to π(x | Ax = a).
We can interpret π∗(x) as a limiting form of a proper density π̃(x).
Let π̆(x∥) be the density of x∥ and define the proper density

    π̃(x) = π∗(x⊥) π̆(x∥).

Let π̆(x∥) be a zero mean Gaussian with precision matrix γI. Then in
the limit as γ → 0,
    π̃(x) ∝ π∗(x).     (3.16)
Roughly speaking, π∗(x) can be decomposed into the proper density for
x⊥ ∈ R^{n−k} times a diffuse improper density for x∥ ∈ R^k.
Example 3.2 Consider again Example 3.1, where we now look at the
improper density π∗ defined in (3.13) with ‘mean’ zero and ‘precision’
Q. Suppose we are interested in the density value π∗ of the vector x =
(0, −2, 0, −2)^T, which can be factorized into x = 2e1 + 2e4, hence x∥ =
2e1 and x⊥ = 2e4. Since e1 is a constant vector, the density π∗(x)
is invariant to the addition of any arbitrary constant to x. This can
be interpreted as a diffuse prior on the overall level of x, i.e., x ∼
N(0, κ⁻¹I) with κ → 0.
Using this interpretation of π ∗ (x), we will now define how to generate a
‘sample’ from π ∗ (x), where we use quotes to emphasize that the density
is actually improper. Since the rank deficiency is only due to x∥, we
define that a sample from π ∗ (x) means a sample from the proper part
π ∗ (x⊥ ), bearing in mind (3.15). For known eigenvalues and eigenvectors
of Q it is easy to sample from π ∗ (x⊥ ) using Algorithm 3.1.
Example 3.3 We have generated 1000 samples from π ∗ defined in Ex-
ample 3.2, shown in a ‘pairs plot’ in Figure 3.1. At first sight, these
samples seem well-behaved and proper, but note that 1T x = 0 for all
samples, and that the empirical correlation matrix is singular.
Figure 3.1 Pairs plot for 1000 samples from an improper GMRF with mean
and precision defined in Example 3.2.
As will become clear later, in most cases the matrix Q and the
eigenvectors e1 , . . . , ek are known explicitly by construction. However,
the remaining eigenvalues and eigenvectors will typically not be known.
Hence an alternative algorithm based on Algorithm 2.6 will be useful.
Here we use the fact that π ∗ (x⊥ ) equals π(x|Ax = 0), from which we
can sample in two steps: first sample from the unconstrained density
and then correct the obtained sample for the constraint Ax = 0 via
Equation (2.30). More specifically, for π(x) we have to use a zero mean
GMRF with SPD precision matrix
    Q̃ = Q + Σ_{i=1}^{k} ai ei ei^T .
Note that this method works for any strictly positive values of a1, . . . , ak.
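A NumPy sketch of this two-step construction (my own illustration for a first-order random-walk-type improper precision, whose null space is spanned by e1 ∝ 1): sample from the proper GMRF with the augmented precision Q̃ = Q + a e1 e1ᵀ and then correct the sample for the constraint 1ᵀx = 0 with the kriging-type correction from Example 2.3's general form:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50

# Improper RW1-type precision: Q = D^T D with D the first-difference matrix,
# so Q 1 = 0 and the null space is spanned by e1 = 1/sqrt(n).
D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)
Q = D.T @ D
e1 = np.ones(n) / np.sqrt(n)

# Step 1: sample from the proper zero-mean GMRF with precision Qtilde.
a = 1.0                                   # any strictly positive value works
Qtilde = Q + a * np.outer(e1, e1)
L = np.linalg.cholesky(Qtilde)
x = np.linalg.solve(L.T, rng.standard_normal(n))

# Step 2: correct for the constraint A x = 0 with A = 1^T (conditioning by kriging).
A = np.ones((1, n))
Sigma_At = np.linalg.solve(Qtilde, A.T)                 # Qtilde^{-1} A^T
x_star = x - (Sigma_At @ np.linalg.solve(A @ Sigma_At, A @ x)).ravel()

print(abs(x_star.sum()) < 1e-10)          # True: the sample lives in the proper part
```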
    E(xi | x−i) = − (1/Qii) Σ_{j: j∼i} Qij xj ,

where − Σ_{j: j∼i} Qij / Qii = 1. Hence, the conditional mean of xi is simply
a weighted mean of its neighbors, but does not involve an overall level.
In applications, this ‘local’ behavior is often desirable. We can then
concentrate on the deviation from any overall mean level without having
to specify the overall mean level itself. Many IGMRFs are constructed
such that the deviation from the overall level is a smooth curve in time
or a smooth surface in space.
Cov(xj − xi , xl − xk ) = 0.
These properties are well known and coincide with those of a Wiener
process observed in discrete time. We will define the Wiener process
shortly in Definition 3.4.
The density for x is derived from its n − 1 increments (3.20) as
    π(x | κ) ∝ κ^{(n−1)/2} exp( − (κ/2) Σ_{i=1}^{n−1} (Δxi)² )

              = κ^{(n−1)/2} exp( − (κ/2) Σ_{i=1}^{n−1} (xi+1 − xi)² )

              = κ^{(n−1)/2} exp( − (1/2) x^T Q x ),     (3.21)
δi = si+1 − si . (3.25)
Figure 3.2 Illustrations of the properties of the RW1 model with n = 99: (a)
displays 10 samples of x⊥, (b) displays Var(xi⊥) for i = 1, . . . , n normalized
with the average variance, and (c) displays Corr(x⊥n/2, xi⊥) for i = 1, . . . , n.
for 1 < i < n where the Qi,i−1 terms are found via Qi,i−1 = Qi−1,i .
A proper correction at the boundary (implicitly we use a diffuse prior
for W (0) rather than the fixed W (0) = 0) gives the remaining diagonal
terms Q11 = κ/δ1 , Qnn = κ/δn−1 . Clearly, Q1 = 0 still holds and the
joint density of x,
( n−1
)
κ
π(x | κ) ∝ κ(n−1)/2 exp − (xi+1 − xi )2 /δi , (3.26)
2 i=1
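A small NumPy sketch (mine) that assembles the precision matrix implied by (3.26) for irregularly spaced locations and checks the stated properties Qi,i+1 = −κ/δi, Q11 = κ/δ1, Qnn = κ/δn−1 and Q1 = 0:

```python
import numpy as np

rng = np.random.default_rng(7)
kappa = 2.0
s = np.sort(rng.uniform(0, 10, 20))       # irregular locations s_1 < ... < s_n
delta = np.diff(s)                        # delta_i = s_{i+1} - s_i
n = len(s)

# Q = kappa * D^T diag(1/delta) D, with D the first-difference matrix, so that
# x^T Q x = kappa * sum_i (x_{i+1} - x_i)^2 / delta_i as in (3.26).
D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)
Q = kappa * D.T @ np.diag(1.0 / delta) @ D

print(np.allclose(np.diag(Q, 1), -kappa / delta))          # off-diagonals -kappa/delta_i
print(np.isclose(Q[0, 0], kappa / delta[0]),
      np.isclose(Q[-1, -1], kappa / delta[-1]))            # boundary corrections
print(np.allclose(Q @ np.ones(n), 0.0))                    # Q 1 = 0: improper, rank n-1
```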
Figure 3.4 Figures (a) and (b) display two samples from an IGMRF with
density (3.30) where two regions sharing a common border are considered
as neighbors. Figure (c) displays the variance in relation to the number of
neighbors, demonstrating that the variance decreases if the number of neighbors
increases.
Figure 3.5 Figures (a) and (b) display two samples of x⊥ from an IGMRF with
density (3.33) where two regions sharing a common border are considered as
neighbors and weights based on the distance between centroids are used. Figure
(c) displays the variance in relation to the number of neighbors, demonstrating
that the variance decreases if the number of neighbors increases.
Figure 3.6 Illustrations of the properties of the RW2 model with n = 99: (a)
displays 10 samples of x⊥, (b) displays Var(xi⊥) for i = 1, . . . , n normalized
with the average variance, and (c) displays Corr(x⊥n/2, xi⊥) for i = 1, . . . , n.
where only the upper left part of the base is shown. To write out the
conditional expectations, it is convenient to use a graphical notation,
where, for example, (3.42) looks like
    ◦ • ◦          ◦ ◦ ◦
    • ◦ •   − 4    ◦ • ◦ .     (3.45)
    ◦ • ◦          ◦ ◦ ◦
The format is to calculate the sum over all xij ’s in the locations of the
‘•’. The ‘◦’s are there only to fix the spatial configuration. When this
notation is used within a sum, then the sum-index denotes the center
node.
Returning to (3.44), then (3.42) gives the following full conditionals
in the interior
                               ⎛     ◦◦◦◦◦        ◦◦◦◦◦        ◦◦•◦◦  ⎞
                               ⎜     ◦◦•◦◦        ◦•◦•◦        ◦◦◦◦◦  ⎟
    E(xij | x−ij) = (1/20)     ⎜  8  ◦•◦•◦   − 2  ◦◦◦◦◦   − 1  •◦◦◦•  ⎟
                               ⎜     ◦◦•◦◦        ◦•◦•◦        ◦◦◦◦◦  ⎟
                               ⎝     ◦◦◦◦◦        ◦◦◦◦◦        ◦◦•◦◦  ⎠

    Prec(xij | x−ij) = 20κ.
              n1−1  n2−1  ⎛ ◦ • ◦          ◦ ◦ ◦ ⎞ 2
    − (κ/2)    Σ     Σ    ⎜ • ◦ •   − 4    ◦ • ◦ ⎟  ,     (3.46)
              i=2   j=2   ⎝ ◦ • ◦          ◦ ◦ ◦ ⎠
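The coefficients above follow from squaring the increment stencil in (3.45); a short sketch (mine, using scipy.signal.convolve2d) recovers the conditional weights 8, −2, −1 and the conditional precision 20κ from the stencil itself:

```python
import numpy as np
from scipy.signal import convolve2d

# Increment stencil of (3.45): four nearest neighbours minus 4 times the centre.
stencil = np.array([[0,  1, 0],
                    [1, -4, 1],
                    [0,  1, 0]], dtype=float)

kappa = 1.0
# Squaring the increments gives a precision whose row, for an interior node,
# is the autocorrelation of the stencil with itself.
row = kappa * convolve2d(stencil, stencil[::-1, ::-1])
print(row.astype(int))
# [[ 0  0  1  0  0]
#  [ 0  2 -8  2  0]
#  [ 1 -8 20 -8  1]
#  [ 0  2 -8  2  0]
#  [ 0  0  1  0  0]]

centre = row[2, 2]                       # conditional precision = 20 * kappa
weights = -row / centre                  # conditional mean weights 8/20, -2/20, -1/20
print(centre, weights[2, 1], weights[1, 1], weights[2, 0])   # 20.0 0.4 -0.1 -0.05
```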
Figure 3.7 Two samples from an IGMRF on a torus with dimension 256×256,
where (a) used increments (3.45) and (b) used increments (3.52).
From (3.56) we see that the IGMRF defined through (3.55) is invariant
to the addition of constants to any rows and columns. This is an example
of an IGMRF with more than one unspecified level. The density of x is
    π(x | κ) ∝ κ^{(n1−1)(n2−1)/2}
              × exp( − (κ/2) Σ_{i=1}^{n1−1} Σ_{j=1}^{n2−1} (Δ^{(1,0)} Δ^{(0,1)} xij)² )     (3.57)
with ∆(1,0) ∆(0,1) xij = xi+1,j+1 − xi+1,j − xi,j+1 + xij . Note that the
conditional mean of xij in the interior depends on its eight nearest sites
and is
            ◦ • ◦            • ◦ •
    (1/2)   • ◦ •   − (1/4)  ◦ ◦ ◦ ,
            ◦ • ◦            • ◦ •
which equals a least-squares locally quadratic fit through these eight
neighbors. The conditional precision is 4κ.
We note that the representation of the precision matrix Q as the
Kronecker product Q = κ(R1 ⊗ R2 ) is also useful for computing |Q|∗ ,
because (extending (3.1))
Figure 3.8 Two samples from model (3.55) defined on a 256 × 256 torus
with sum-to-zero constraints on each row and column.
does not simplify and we need to take the values at all other locations
into account. In other words, the precision matrix will be a completely
dense matrix. However, the conditional densities simplify if we augment
the process with its derivatives,
    u^{(m)}(0, t) = ( ∫_0^s + ∫_s^t ) ((t − h)^{k−1−m} / (k − 1 − m)!) dW(h)

                  = u^{(m)}(s, t) + ∫_0^s ((t − h)^{k−1−m} / (k − 1 − m)!) dW(h)

                  = u^{(m)}(s, t) + ∫_0^s (((t − s) + (s − h))^{k−1−m} / (k − 1 − m)!) dW(h)

                  = u^{(m)}(s, t)
                    + ∫_0^s Σ_{j=0}^{k−1−m} ( k−1−m choose j ) ((t − s)^j (s − h)^{k−1−m−j} / (k − 1 − m)!) dW(h)

                  = u^{(m)}(s, t)
                    + Σ_{j=0}^{k−1−m} ((t − s)^j / j!) ∫_0^s ((s − h)^{k−1−m−j} / (k − 1 − m − j)!) dW(h)

                  = u^{(m)}(s, t) + Σ_{j=0}^{k−1−m} ((t − s)^j / j!) u^{(m+j)}(0, s).     (3.62)
If we define
u(s, t) = (u(0) (s, t), u(1) (s, t), . . . , u(k−1) (s, t))T
Figure 3.9 Illustrations of the properties of the CRW2 model with n = 99 and
irregular locations: (a) displays 10 samples of x⊥ (ignoring the derivative), (b)
displays Var(xi⊥) for i = 1, . . . , n normalized with the average variance, and
(c) displays Corr(x⊥n/2, xi⊥) for i = 1, . . . , n.
One of the main areas where GMRF models are used in statistics is
hierarchical models. Here, GMRFs serve as a convenient formulation to
model stochastic dependence between parameters, and thus implicitly,
dependence between observed data. The dependence can be of various
kinds, such as temporal, spatial, or even spatiotemporal.
A hierarchical GMRF model is characterized through several stages of
observables and parameters. A typical scenario is as follows. In the first
stage we will formulate a distributional assumption for the observables,
dependent on latent parameters. If we have observed a time series of
binary observations y, we may assume a Bernoulli model with unknown
probability pi for yi , i = 1, . . . , n: yi ∼ B(pi ). Given the parameters of
the observation model, we assume the observations to be conditionally
independent. In the second stage we assign a prior model for the unknown
parameters, here pi . This is where GMRFs enter. For example, we could
choose an autoregressive model for the logit-transformed probabilities
xi = logit(pi ). Finally, a prior distribution is assigned to unknown
parameters (or hyperparameters) of the GMRF, such as the precision
parameter κ of the GMRF x. This is the third stage of a hierarchical
model. There may be further stages if necessary.
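To fix ideas, here is a minimal sketch (mine, with arbitrary hyperparameter values) of the three stages just described for binary time-series data: a Bernoulli observation model, a GMRF prior on the logit-transformed probabilities, and a gamma hyperprior on the GMRF precision, combined into an (unnormalized) log posterior:

```python
import numpy as np

def log_posterior(x, log_kappa, y, Q_structure, a=1.0, b=0.01):
    """Unnormalized log posterior of (x, kappa) for y_i ~ Bernoulli(logit^{-1}(x_i)).

    Stage 1: conditionally independent Bernoulli observations given x.
    Stage 2: GMRF prior x ~ N(0, (kappa * Q_structure)^{-1}) (possibly improper).
    Stage 3: gamma(a, b) hyperprior on the precision kappa.
    """
    kappa = np.exp(log_kappa)
    n = len(y)
    # Stage 1: Bernoulli log likelihood with p_i = logit^{-1}(x_i).
    loglik = np.sum(y * x - np.log1p(np.exp(x)))
    # Stage 2: log GMRF prior up to a constant; the rank (n - 1) used here is
    # the one appropriate for a first-order random walk structure.
    logprior_x = 0.5 * (n - 1) * log_kappa - 0.5 * kappa * x @ (Q_structure @ x)
    # Stage 3: gamma hyperprior on kappa (including the Jacobian of the log transform).
    logprior_kappa = a * log_kappa - b * kappa
    return loglik + logprior_x + logprior_kappa

# Example: RW1 structure matrix for a short binary time series.
n = 30
D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)
R = D.T @ D
y = np.r_[np.zeros(15), np.ones(15)]
print(log_posterior(np.zeros(n), 0.0, y, R))
```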
In a regression context, our simple example would thus correspond to a
generalized linear model where the intercept is varying over some domain
according to a GMRF with unknown hyperparameters. More generally,
so-called generalized additive models can be fitted using GMRFs. We will
give examples of such models later. GMRFs are also useful in extended
regression models where covariate effects are allowed to vary over some
domain. Such models have been termed varying coefficient models
(Hastie and Tibshirani, 1990) and the domain over which the effect is
allowed to vary is called the effect modifier. Again, GMRF models can
be used to analyze this class of models, see Fahrmeir and Lang (2001a)
for a generalized additive model based on GMRFs involving varying
coefficients.
For statistical inference we will mainly use Markov chain Monte Carlo
(MCMC) techniques, e.g., Robert and Casella (1999) or Gilks et al. (1996).
A simple example
Before we compare analytical results about the rate of convergence for
various sampling schemes, we need to define it. Let θ (1) , θ (2) , . . . denote a
Markov chain with target distribution π(θ) and initial value θ (0) ∼ π(θ).
The rate of convergence of the Markov chain can be characterized by
studying how quickly E(h(θ (t) )|θ (0) ) approaches the stationary value
E(h(θ)) for all square π-integrable functions h(·). Let ρ be the minimum
number such that for all h(·) and for all r > ρ
    lim_{k→∞} E[ ( E( h(θ^{(k)}) | θ^{(0)} ) − E(h(θ)) )² ] r^{−2k} = 0.     (4.1)
    µ∗ ∼ q(µ∗ | µ^{(k−1)})
    x∗ | µ∗ ∼ N(µ∗ 1, Q⁻¹)     (4.5)
Figure 4.1 Figure (a) shows the marginal chain for µ over 1000 iterations
using n = 100, σ² = 1/10 and φ = 0.9. The algorithm
updates successively µ and x from their full conditionals. Figure (b) displays
the pairs (µ^{(k)}, 1^T Qx^{(k)}), with µ^{(k)} on the horizontal axis. The slow mixing
(and convergence) of µ is due to the strong dependence with 1^T Qx^{(k)} as only
horizontal and vertical moves are allowed. The arrows illustrate how a joint
update can improve the mixing (and convergence).
y | x, µ ∼ N (x, H −1 ),
    θ∗ ∼ q(θ∗ | θ^{(k−1)})
    x∗ ∼ π(x | θ∗, y).     (4.8)
The proposal (θ∗ , x∗ ) is then accepted/rejected jointly. We denote this
as the one-block algorithm.
If we consider only the θ chain, then we are in fact sampling from the
posterior marginal π(θ|y) using the proposal (4.8). This is evident from
the acceptance probability for the joint proposal, which is
    α = min{ 1, π(θ∗ | y) / π(θ^{(k−1)} | y) }.
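A schematic implementation of this one-block update (my own sketch; log_post_theta and sample_x_given are placeholders standing in for the model-specific pieces, and in the Gaussian-likelihood case x∗ is drawn exactly from π(x | θ∗, y)):

```python
import numpy as np

def one_block_update(theta, x, y, rng, log_post_theta, sample_x_given,
                     proposal_sd=0.1):
    """One iteration of the one-block algorithm (4.8).

    theta           : 1-D array of hyperparameters (current state)
    log_post_theta  : callable returning log pi(theta | y) up to a constant
    sample_x_given  : callable drawing x ~ pi(x | theta, y) (a GMRF here)
    """
    theta_star = theta + proposal_sd * rng.standard_normal(theta.shape)  # symmetric q
    x_star = sample_x_given(theta_star, y, rng)

    # With a symmetric proposal for theta, accepting (theta*, x*) jointly uses
    # the ratio of posterior marginals pi(theta | y), as noted in the text.
    log_alpha = log_post_theta(theta_star, y) - log_post_theta(theta, y)
    if np.log(rng.uniform()) < log_alpha:
        return theta_star, x_star, True
    return theta, x, False
```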
Figure 4.2 (a) The precision matrix Q of s, t|y, κ in the original ordering, and
(b) after appropriate reordering to obtain a band matrix with small bandwidth.
Only the nonzero terms are shown and those are indicated by a dot.
Figure 4.3 Trace plot showing the log of the three precisions κt (top, solid
line), κs (middle, dashed line) and κy (bottom, dotted line) for the first 1000
iterations. The acceptance rate was about 30%.
Figure 4.4 Observed and predicted counts (posterior median within 2.5 and
97.5% quantiles) for the drivers data without the covariate.
Figure 4.5 Observed and predicted counts for the drivers data with the seat belt
covariate.
The data ỹ do not introduce extra dependence between the x^L_i's, as ỹi
acts as a noisy observation of x^L_i. Denote by ni the number of neighbors
to location i and let L(i) be

    L(i) = {k : x^L(k) = x^L_i},
Figure 4.6 The value of κ for the first 1000 iterations of the subblock algorithm,
where κS is the solid line (top), κC is the dashed line, κL is the dotted line
and κy is the dashed-dotted line (bottom).
Table 4.1 Posterior median and 95% credible interval for fixed effects
The estimates of the fixed effects β and the intercept µ are given in
Table 4.1.
Figure 4.7 Nonparametric effects for floor space (a) and year of construction
(b). The figures show the posterior median within 2.5 and 97.5% quantiles.
The distribution of the observed data is indicated with jittered dots.
case studies.
Figure 4.8 Estimated posterior median effect for the location variable. The
shaded areas are districts with no houses, such as parks or fields.
Also in this case we use the subblock algorithm using two blocks (κ, x)
and (w, λ). The second block is sampled using (4.22) in the correct
order, first wij (for all ij) from the L(xi , 1) distribution, truncated to
Figure 4.9 Observed frequencies and fitted probabilities with uncertainty bounds
for the Tokyo rainfall data. (a): probit link. (b): logit link.
be positive if yij = 1 and negative if yij = 0, and then λij (for all ij)
from (4.21).
Figure 4.9(a) displays the binomial frequencies, scaled to the interval
[0, 1], and the estimated underlying probabilities pi obtained from the
probit regression approach, while (b) gives the corresponding results
obtained with the logistic link function. There is virtually no difference
between the results using the two link functions. Note that the credible
Figure 4.10 Trace plots using the subblock algorithm and the single-site Gibbs
sampler; (a) and (c) show the traces of log κ and g −1 (x180 ) using the subblock
algorithm, while (b) and (d) show the traces of log κ and g −1 (x180 ) using the
single-site Gibbs sampler.
intervals do not get wider at the beginning and the end of the time series,
due to the circular RW2 model.
We take this opportunity to compare the subblock algorithm with a
naïve single-site Gibbs sampler, which is known to converge slowly for this
problem (Knorr-Held, 1999). It is important to remember that this is
not a hard problem nor is the dimension high. We present the results
for the logit link. Figure 4.10 shows the trace of log κ using the subblock
algorithm in (a) and the single-site Gibbs sampler in (b), and the trace of
g −1 (x180 ) in (c) and (d). Both algorithms were run for 10000 iterations.
The results clearly demonstrate that the single-site Gibbs sampler has
γk = uk + vk .
Figure 4.11 (a) The precision matrix Q in the original ordering, and (b) the
precision matrix after appropriate reordering to reduce the number of nonzero
terms in the Cholesky triangle shown in (c). Only the nonzero terms are shown
and those are indicated by a dot.
Figure 4.12 Nonparametric effect of age group. Posterior median of the log
odds within 2.5 and 97.5% quantiles. The distribution of the observed covariate
is indicated with jittered dots.
Figure 4.13 Estimated odds ratio (posterior median) for (a) the spatially
structured component ui and (b) the sum of the spatially structured and
unstructured variable ui + vi . The shaded region is West Berlin.
Figure 4.14 Normal approximation (dashed line) of the posterior density (4.24)
(solid line) for y = 3, µ = 0 and κ = 0.001 based on a quadratic Taylor
expansion around η0 for η0 = 0, 0.5, 1, 1.5. The value of η0 is indicated with
a small dot in each plot.
Table 4.3 Acceptance rates and number of iterations per second for Tokyo
rainfall data
day i equals

    π(yi | xi) ∝ exp( xi Σ_{j=1}^{mi} yij − mi log(1 + exp(xi)) ).
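The GMRF approximation used throughout (see also Figure 4.14 and Section 5.2) replaces each such non-Gaussian term by its second-order Taylor expansion in xi around a current value; a minimal NumPy sketch (mine) for this binomial log likelihood, yielding the quadratic coefficients that enter the canonical b-vector and the diagonal precision correction of the approximating GMRF:

```python
import numpy as np

def binomial_logit_quadratic(x0, y_sum, m):
    """Second-order Taylor expansion of f(x) = x*y_sum - m*log(1 + exp(x)) at x0.

    Returns (b, c) such that f(x) is approximated (up to a constant) by
    b*x - 0.5*c*x^2; c adds to the diagonal of the precision of the GMRF
    approximation and b to its canonical mean vector.
    """
    p0 = 1.0 / (1.0 + np.exp(-x0))          # logistic mean at the expansion point
    grad = y_sum - m * p0                   # f'(x0)
    hess = -m * p0 * (1.0 - p0)             # f''(x0) < 0, so c = -hess > 0
    c = -hess
    b = grad + c * x0                       # match the gradient of b*x - 0.5*c*x^2 at x0
    return b, c

# Example for one day with m_i = 2 trials, sum of responses 1, expanded at x0 = 0.5.
print(binomial_logit_quadratic(0.5, 1.0, 2.0))
```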
Figure 4.15 The standardized mortality ratios for oral cavity cancer (left) and
lung cancer (right) in Germany, 1986-1990.
is fulfilled. Hence, we consider only the relative risk, not the absolute
risk.
The common approach is to assume that the observed counts yij are
conditionally independent Poisson observations,
yij ∼ P(eij exp(ηij )),
where ηij denotes the log relative risk in area i for disease j.
The standardized mortality ratios (SMRs) yij /eij are displayed in Fig-
ure 4.15 for both diseases. For some background information on why the
SMRs are not suitable for estimating the relative risk, see, for example,
Mollié (1996). We will first consider a model for a separate analysis of
disease risk for each disease, then we will discuss a model for a joint
analysis of both diseases.
η = µ1 + u + v, (4.28)
For π(κ) we choose independent G(1, 0.01) priors for each of the two
precisions. Note that the prior for x = (µ, uT , v T )T conditioned on κ is a
GMRF. However, we see that yi depends on the sum of three components
of x, due to (4.28). This is an implementation nuisance and can be solved
here by reparameterization using η instead of v,

    η | u, κ, µ ∼ N(µ1 + u, κv⁻¹ I).
Figure 4.16 (a) The precision matrix (4.29) in the original ordering, and
(b) the precision matrix after appropriate reordering to reduce the number of
nonzero terms in the Cholesky triangle shown in (c). Only the nonzero terms
are shown and those are indicated by a dot.
Figure 4.17 Estimated relative risks (posterior median) of the spatial compo-
nent exp(u) for (a) lung cancer and (b) oral cavity cancer.
    η2 | u1, u2, µ, κ ∼ N(µ2 1 + δ⁻¹ u1, κη2⁻¹ I),

where κη1 and κη2 are the unknown precisions of η1 and η2, respectively.
Additionally, we impose sum-to-zero constraints for both u1 and u2.
Assuming µ is zero mean normal with covariance matrix κµ⁻¹ I,
then x = (µ^T, u1^T, u2^T, η1^T, η2^T)^T is a priori a GMRF. The posterior
distribution is
    π(x, κ, δ | y) ∝ κη1^{n/2} κη2^{n/2} κu1^{(n−1)/2} κu2^{(n−1)/2} exp( − (1/2) x^T Q x )

                     × exp( Σ_{j=1}^{2} Σ_{i=1}^{n} [ yij ηij − eij exp(ηij) ] )

                     × π(κ) π(δ),
where

         ⎛ Qµµ    Qµu1    Qµu2    Qµη1    Qµη2  ⎞
         ⎜        Qu1u1   Qu1u2   Qu1η1   Qu1η2 ⎟
    Q =  ⎜                Qu2u2   Qu2η1   Qu2η2 ⎟ ,     (4.31)
         ⎜ sym.                   Qη1η1   Qη1η2 ⎟
         ⎝                                Qη2η2 ⎠
Figure 4.18 (a) The precision matrix (4.31) in the original ordering, and
(b) the precision matrix after appropriate reordering to reduce the number of
nonzero terms in the Cholesky triangle shown in (c). Only the nonzero terms
are shown and those are indicated by a dot.
Figure 4.19 Estimated relative risks (posterior median) for (a) the shared
component exp(u1 ) (related to tobacco) and (b) the oral-specific component
exp(u2 ) (related to alcohol).
Approximation techniques
Figure 5.1 The Matérn CF with range r = 1, and ν = 1/2, 1, 3/2, 5/2, and
100 (from left to right). The case ν = 1/2 corresponds to the exponential CF
while ν = 100 is essentially the Gaussian CF.
where now C(1) = 0.04979 ≈ 0.05. Note that C(0) = 1 in all four cases,
so the CFs are also correlation functions. Of course, if we multiply C(h)
with σ 2 then the variance becomes σ 2 .
The powered exponential CF includes both the exponential (α = 1)
and the Gaussian (α = 2), and so does the often-recommended Matérn
CF with ν = 1/2 and ν → ∞, respectively. The Matérn CF is displayed
in Figure 5.1 for ν = 1/2, 1, 3/2, 5/2, and 100.
A further parameter r, called the range, is often introduced to scale
the Euclidean distance, so the CF is C(h/r). The range parameter can
be interpreted as the (minimum) distance h for which the correlation
function C(h) = 0.05. Hence, two locations more than distance r apart
are essentially uncorrelated and hence also nearly independent.
Nonisotropic CFs can be constructed from isotropic CFs by replacing
the Euclidean distance h between locations s and t with
    h' = √( (t − s)^T A (t − s) ),
Prec(xij | x−ij ) = θ1 .
with positive weights wij > 0 for all ij. The CF is isotropic and its value
at (i, j) only depends on the distance to (0, 0). It is therefore natural to
choose wij ∝ 1/d((i, j), (0, 0)) for ij ≠ 00. However, we will put slightly
more weight on lags with distance close to the range, and use

    wij ∝ { 1                                                if ij = 00
          { (1 + r/d((i, j), (0, 0))) / d((i, j), (0, 0))    otherwise.
5.1.3 Results
We will now present some typical results showing how well the fitted
GMRFs approximate the CFs in Section 5.1. We concentrate on the
exponential and Gaussian CF using both a 5 × 5 and 7 × 7 neighborhood
with range 30 and 50. The size of the torus is taken as 512 × 512.
Figure 5.2(a) and Figure 5.2(c) show the fit obtained for the
exponential CF with range 30 using a 5 × 5 and 7 × 7 neighborhood,
respectively. Figure 5.2(b) and Figure 5.2(d) show similar results for the
Gaussian CF with range 50. The fitted CF is drawn with a solid line
and the target CF with a dashed line; the difference
between the two is shown in Figure 5.3.
The approximation obtained is quite accurate. For the exponential
CF the absolute difference is less than 0.01 using a 5 × 5 neighborhood,
while it is less than 0.005 using a 7 × 7 neighborhood. The Gaussian CF
Figure 5.2 The figures display the correlation function (CF) for the fitted
GMRF (solid line) and the target CF (dashed line) with the following
parameters: (a) exponential CF with range 30 and a 5 × 5 neighborhood, (b)
Gaussian CF with range 50 and a 5 × 5 neighborhood, (c) exponential CF with
range 30 and a 7 × 7 neighborhood, and (d) Gaussian CF with range 50 and a
7 × 7 neighborhood.
is more difficult to fit, which is due to the CF type, not the increase
in the range. In order to fit the CF accurately for small lags, the fitted
correlation needs to be negative for larger lags. However, the absolute
difference is still reasonably small and about 0.04 and 0.008 for the 5 × 5
and 7 × 7 neighborhood, respectively. The improvement by enlarging the
neighborhood is larger for the Gaussian CF than for the exponential CF.
The results obtained in Figure 5.2 are quite typical for different range
parameters and other CFs. For other values of the range, the shape of
the fitted CF is about the same and only the horizontal scale is different
(due to the different range). We do not present the fits using the powered
exponential CF (5.1) for 1 ≤ α ≤ 2, and the Matérn CF (5.2), but they
are also quite good. The errors are typically between those obtained for
the exponential and the Gaussian CF.
For the CFs shown in Figure 5.3 the GMRF coefficients (compare (5.5)
Figure 5.3 The figures display the difference between the target correlation
function (CF) and the fitted CF for the corresponding fits displayed in Figure
5.2. The difference goes to zero for lags larger than shown in the figures.
Figure 5.4 The figures display the correlation function (CF) for the fitted
GMRF (solid line) and the Gaussian CF (dashed line) with range 50. The
figures are produced using the coefficients as computed for Figure 5.2(b) using
only (a) 3 significant digits, (b) 5 significant digits, (c) 7 significant digits and
(d) 9 significant digits, for θi/θ1, i = 2, . . . , 5, while θ0 was not truncated. For
(a) the parameters are outside the valid parameter space, so Q(θ) is not SPD.
The marginal precisions based on the coefficients in (b), (c), and (d) are 276,
17, and 0.95, respectively.
Figure 5.5 The marginal precision with the coefficients in (5.13) corresponding
to a Gaussian CF with range 50, on a torus with size from 32 to 512. The
marginal precision is about 1 for dimension larger than 300, i.e., 5 times the
range in this case.
Figure 5.6 The figure displays an unwrapped n1 × n2 torus. The set B has
thickness m and A is an (n1 − m) × (n2 − m) lattice.
Figure 5.7 The variance of xii , for i = 1, . . . , 200, when using the 5 × 5
coefficients corresponding to Gaussian (solid line) and exponential (dashed
line) CFs with range 50, on a 200 × 200 lattice.
Figure 5.8 An n1 ×n2 lattice (A) with an additional boundary (B) of thickness
m.
Figure 5.9 Swiss rainfall data at 100 sample locations. The plot displays the
locations and the observed values ranging from black to white.
Figure 5.10 Spatial interpolation of the Swiss rainfall data using a square-
root transform and an exponential CF with range 125. The top figure shows
the predictions and the bottom figure shows the prediction error (stdev), both
computed using plug-in estimates.
Approximation A1
Approximation A1 is found by removing all terms in (5.28) apart from
πG (xi |xi+1 , . . . , xn ), so
πA1 (xi | xi+1 , . . . , xn , y1 , . . . , yi ) = πG (xi | xi+1 , . . . , xn ). (5.29)
Using (5.27), we obtain
πA1(x | y) = ∏_{i=n}^{1} πG(xi | xi+1, . . . , xn) = πG(x);
hence A1 is the GMRF approximation. This construction offers an
alternative interpretation of the GMRF approximation.
Approximation A2
Approximation A2 is found by including the term we think is the most
important one missed in A1, which is the term involving the data yi ,
πA2 (xi | xi+1 , . . . , xn , y1 , . . . , yi ) ∝ πG (xi | xi+1 , . . . , xn )
× exp(−hi (xi , yi )). (5.30)
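To make (5.30) concrete, the following C sketch evaluates the unnormalized log of πA2 for a single xi on a grid and then normalizes over the grid; in the text below, such one-dimensional densities are represented by log-splines with K regions. The conditional mean cmean and precision cprec of the Gaussian full conditional are inputs, and h_i is a hypothetical placeholder for the model-specific data term; the routine is a sketch, not part of GMRFLib.

#include <math.h>

/* Evaluate (5.30) on a grid of x_i values: the Gaussian full conditional
 * from pi_G, with conditional mean cmean and precision cprec, multiplied
 * by exp(-h_i(x_i, y_i)). The result is normalized to sum to one over the
 * grid. h_i is a hypothetical model-specific placeholder. */
extern double h_i(double x_i, double y_i);

void eval_A2(double *dens, const double *xgrid, int ngrid,
             double cmean, double cprec, double y_i)
{
    double lmax = -HUGE_VAL, sum = 0.0;
    for (int k = 0; k < ngrid; k++) {
        double d = xgrid[k] - cmean;
        dens[k] = -0.5 * cprec * d * d - h_i(xgrid[k], y_i);  /* log of (5.30), up to a constant */
        if (dens[k] > lmax)
            lmax = dens[k];
    }
    for (int k = 0; k < ngrid; k++) {
        dens[k] = exp(dens[k] - lmax);  /* subtract the maximum for numerical stability */
        sum += dens[k];
    }
    for (int k = 0; k < ngrid; k++)
        dens[k] /= sum;
}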
Approximation A3
Here, x(1) , . . . , x(M ) denote the M samples, one set for each value of xi .
Some comments on each of the steps are needed.
• To sample from πG (x1 , . . . , xi−1 |xi , . . . , xn ) we make use of (5.23). Let
jmin (i) be the smallest j ∈ J (i). Then, iid samples can be produced
by solving (5.23) from row i − 1 until jmin (i), for iid z’s, and then
adding the mean µG .
• We make Î(xi | J(i)) continuous wrt xi using the same random number
stream to produce the samples. A (much) more computationally
efficient approach is to make use of the fact that only the conditional
mean in πG (x1 , . . . , xi−1 |xi , . . . , xn ) will change if xi varies. Then, we
may sample M samples with zero conditional mean and simply add
the conditional mean that varies linearly with xi .
• Antithetic variables are also useful for estimating I(xi |J (i)). Anti-
thetic normal variates can also involve the scale and not only the sign
(Durbin and Koopman, 1997). Let z be a sample from N (0, I) and
define
z̃ = z/√(zT z)
so that z̃ has unit length. Let x̃ solve LT x̃ = z̃. With the correct
scaling, x̃ will be a sample from N (0, Q−1 ). The correct scaling is
Table 5.1 The obtained acceptance rate and the number of iterations per second
on a 2.6-GHz CPU, for approximations A1 to A3b.
An acceptance rate of 0.91 is impressive, but we can push this limit even
further using more computing.
Approximation A2 offers in many cases a significant improvement
over A1 at a modest additional computational cost. The reduction
from 23.5 iterations per second to 5.3 per second is larger than it
would be for a large spatial GMRF. This is because a larger amount
of time will be spent factorizing the precision matrix and locating the
maximum, which is common for all approximations.
The number of regions K in the log-spline approximations also
influences the accuracy. If we increase K we improve the approximation,
most notably when the acceptance rate is high. In most cases a value of
K between 10 and 20 is sufficient.
A more challenging situation appears when we fix the parameters at
different values from their maximum likelihood estimates. The effect
of the different approximations can then be drastic. As an example, if
we reduce τε by a factor of 10 while keeping the other two parameters
unchanged, A1 produces an acceptance rate of essentially zero. The
acceptance rates for A2 and A3a are about 0.04 and 0.10, respectively.
In our experience, much computing is required to obtain an acceptance
rate in the high 90s, while in practice, only a ‘sufficiently’ accurate
approximation is required, i.e., one that produces an acceptance rate
well above zero. Unfortunately, the approximations do not behave
uniformly over the space of the hyperparameters. Although a GMRF
approximation can be adequate near the global mode, it may not be
sufficiently accurate for other values of the hyperparameters. Further,
the accuracy of the approximation decreases for increasing dimension.
which is used for sampling (κ, x) jointly from the posterior and
estimating π(κ|y) considering the samples of κ only.
The only unknown term in (5.36) is the denominator. An approximation
to the posterior marginal for κ can be obtained using an approximation
π̃(x | κ, y) in the denominator of

π̃(κ | y) ∝ π(y | x) π(x | κ) π(κ) / π̃(x | κ, y).    (5.37)
Note that the rhs now depends on the value of x. In particular, we can
choose x as a function of κ such that the denominator is as accurate as
possible.
To illustrate the dependence on x in (5.37), we computed π̃(κ | y) using
the GMRF approximation (A1) in the rhs, and evaluated the rhs at x = 0
and at the mode x = x∗(κ), which depends on κ. The result is shown
in Figure 5.11. The difference between the two estimates is quite large.
Intuitively, the GMRF approximation is most accurate at the mode and
therefore we should evaluate the rhs of (5.37) at x∗ (κ) and not at any
other point. Note that in this case (5.37) is a Laplace approximation,
see, for example, Tierney et al. (1989).
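To make this recipe concrete, here is a schematic C sketch (not GMRFLib code) that evaluates the rhs of (5.37) at x = x∗(κ) over a grid of κ-values and normalizes the result. All helper functions are hypothetical placeholders for model-specific code, not library routines.

#include <math.h>

/* Hypothetical model-specific helpers (placeholders, not GMRFLib routines). */
extern void   find_mode(double *xstar, int n, double kappa);         /* mode x*(kappa) of pi(x | kappa, y) */
extern double log_lik(const double *x, int n);                       /* log pi(y | x)                      */
extern double log_prior_x(const double *x, int n, double kappa);     /* log pi(x | kappa)                  */
extern double log_prior_kappa(double kappa);                         /* log pi(kappa)                      */
extern double log_gmrf_approx(const double *x, int n, double kappa); /* log of the GMRF approximation at x */

/* Evaluate (5.37) at x = x*(kappa) for each grid value kappa[k] and
 * normalize post[] to sum to one over the grid; xstar is workspace of length n. */
void posterior_kappa(const double *kappa, double *post, int K, double *xstar, int n)
{
    double lmax = -HUGE_VAL, sum = 0.0;
    for (int k = 0; k < K; k++) {
        find_mode(xstar, n, kappa[k]);
        post[k] = log_lik(xstar, n) + log_prior_x(xstar, n, kappa[k])
                + log_prior_kappa(kappa[k]) - log_gmrf_approx(xstar, n, kappa[k]);
        if (post[k] > lmax)
            lmax = post[k];
    }
    for (int k = 0; k < K; k++) {
        post[k] = exp(post[k] - lmax);  /* subtract the maximum for numerical stability */
        sum += post[k];
    }
    for (int k = 0; k < K; k++)
        post[k] /= sum;
}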
Figure 5.11 The estimated posterior marginal for κ using (5.37) and the
GMRF approximation, evaluating the rhs using the mode x∗ (κ) (solid line)
or x = 0 (dashed line).
Figure 5.12 The histogram of the posterior marginal for κ based on 2000
successive samples from the independence sampler constructed from A2. The
solid line is the approximation π̃(κ | y).
for some function f(xi). Of particular interest are the choices f(xi) = xi
and f(xi) = xi², which are required to compute the posterior mean and
variance. We now approximate (5.39) using (5.35),

E(f(xi) | y) ≈ ∫∫ f(xi) π̃(x, κ | y) dκ dx
             = ∫ [ ∫ f(xi) π̃(x | κ, y) dx ] π̃(κ | y) dκ.

Using approximation A1, we can easily compute the marginal π̃(xi | κ, y),
and hence

E(f(xi) | y) ≈ ∫ [ ∫ f(xi) π̃(xi | κ, y) dxi ] π̃(κ | y) dκ
             ≈ Σκ [ ∫ f(xi) π̃(xi | κ, y) dxi ] π̃(κ | y) ω(κ)    (5.40)
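For f(xi) = xi the inner integral in (5.40) is just the conditional mean under approximation A1, so the outer integral reduces to a weighted sum over the κ-grid. A minimal C sketch of this sum is given below; the array names are illustrative (not GMRFLib identifiers) and assume the conditional means, the grid values of π̃(κ | y), and the weights ω(κ) have already been computed.

/* A minimal sketch of (5.40) for f(x_i) = x_i (not GMRFLib code):
 *   cond_mean[k][i] = E(x_i | kappa_k, y) under approximation A1,
 *   post_kappa[k]   = approximate pi(kappa_k | y) on the grid,
 *   omega[k]        = integration weight omega(kappa_k).              */
void posterior_mean(double *mean, int n, double **cond_mean,
                    const double *post_kappa, const double *omega, int K)
{
    for (int i = 0; i < n; i++) {
        mean[i] = 0.0;
        for (int k = 0; k < K; k++)
            mean[i] += cond_mean[k][i] * post_kappa[k] * omega[k];
    }
}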
Figure B.1 The representation for this graph is n = 4, nnbs = [1, 3, 1, 1],
nbs[1] = [2], nbs[2] = [1, 3, 4], nbs[3] = [2] and nbs[4] = [2].
The graph can also be specified using the following format,
4
1 1 2
2 3 1 3 4
3 1 2
4 1 2
The first number is n, then each line gives the relevant information
for each node: node 1 has 1 neighbor, which is node 2, node 2 has 3
neighbors, which are nodes 1, 3, and 4, and so on. Note that there is
some redundancy here because we know that if i ∼ j then j ∼ i.
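As an illustration, a minimal reader for this format might look as follows; this is only a sketch, not the GMRFLib routine, indices are kept 1-based as in the example, and error handling is omitted.

#include <stdio.h>
#include <stdlib.h>

/* A sketch of reading the format above: first n, then for each node its
 * index, its number of neighbors and the list of neighbors. */
typedef struct {
    int n;      /* number of nodes                    */
    int *nnbs;  /* nnbs[i] = number of neighbors of i */
    int **nbs;  /* nbs[i]  = the neighbors of node i  */
} graph_t;

graph_t *read_graph(FILE *fp)
{
    graph_t *g = malloc(sizeof(graph_t));
    fscanf(fp, "%d", &g->n);
    g->nnbs = calloc(g->n + 1, sizeof(int));    /* 1-based indexing */
    g->nbs  = calloc(g->n + 1, sizeof(int *));
    for (int k = 0; k < g->n; k++) {
        int i;
        fscanf(fp, "%d", &i);                   /* node index          */
        fscanf(fp, "%d", &g->nnbs[i]);          /* number of neighbors */
        g->nbs[i] = calloc(g->nnbs[i], sizeof(int));
        for (int j = 0; j < g->nnbs[i]; j++)
            fscanf(fp, "%d", &g->nbs[i][j]);
    }
    return g;
}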
We then need to define the elements in the Q matrix. We know that
the only nonzero terms in Q are those Qij where i ∼ j or i = j. A
convenient way to represent this is to define the function
Qfunc(i, j), for i = j or i ∼ j,
returning Qij .
To illustrate the graph-object and the use of the Qfunc-function,
Algorithm B.1 demonstrates how to compute y = Qx. Note that only
the nonzero terms in Q are used to compute Qx. Recall that i ≁ i, i.e., a
node is not a neighbor of itself, so we need to add the diagonal terms explicitly.
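A minimal sketch in the spirit of Algorithm B.1 (not the GMRFLib source itself) is shown below; it uses n, nnbs, and nbs from the graph representation and a Qfunc with a user argument, as in the example later in this appendix.

/* Compute y = Qx using only the nonzero entries of Q: the diagonal, which
 * is added explicitly since a node is not a neighbor of itself, and the
 * terms Q_ij for j a neighbor of i. Indices are 1-based as in Figure B.1. */
void Qx(double *y, const double *x, int n, const int *nnbs, int **nbs,
        double (*Qfunc)(int i, int j, char *arg), char *arg)
{
    for (int i = 1; i <= n; i++) {
        y[i] = Qfunc(i, i, arg) * x[i];        /* diagonal term  */
        for (int k = 0; k < nnbs[i]; k++) {
            int j = nbs[i][k];
            y[i] += Qfunc(i, j, arg) * x[j];   /* neighbor terms */
        }
    }
}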
Figure B.3 The conditional mean (dashed line), the empirical mean (dashed-
dotted line) and one sample (solid line) from a circular RW1 model with κ = 1,
n = 366, conditional on x1 = 1 and x245 = 10.
and one sample from the conditional distribution are shown in Figure
B.3.
Here, gi (xi ) represents the log-likelihood term, but can represent any
(reasonable) function of xi . The GMRFLib_blockupdate routine con-
structs from (B.4) its GMRF approximation π̃(x) and samples from it the
proposal x∗ (forward step). Of course, the reverse or backward step is
also performed, constructing the GMRF approximation starting from x∗
and evaluating the log density of x using the GMRF approximation. The
computations are similar in the case where only xA is updated keeping
x−A fixed, which we use in the subblock algorithm.
As we always do a joint update of the GMRF (or parts of it) with
the corresponding hyperparameters θ, we allow the terms µ, Q, c, b, d,
and {gi(·)} in (B.4) to depend on the hyperparameters θ. The acceptance
probability for the joint update (4.8) is then min{1, R}, where
double link(double x)
{ /* the link function */
return exp(x)/(1+exp(x));
}
double log_gamma(double x, double a, double b)
{ /* return the log density of a gamma variable with mean a/b. */
return ((a-1.0)*log(x)-(x)*b);
}
double Qfunc(int i, int j, char *kappa)
{
return *((double *)kappa) * (i==j ? 2.0 : -1.0);
}
int gi(double *gis, double *x_i, int m, int idx, double *not_in_use, char *arg)
{
/* compute g_i(x_i) for m values of x_idx: x_i[0], ..., x_i[m-1]. return the values in gis[0],
 * ..., gis[m-1]. additional (user) arguments are passed through the pointer arg; here, the
 * data itself. */
double *x_old, *x_new, kappa_old=100.0, kappa_new; /* old and new (the proposal) x and kappa */
x_old = calloc(n, sizeof(double)); /* allocate space */
x_new = calloc(n, sizeof(double)); /* allocate space */
while(1)
{ /* just keep on until the process is killed */
kappa_new = scale_proposal(6.0)*kappa_old;
double log_accept; /* GMRFLib_blockupdate does all the job... */
GMRFLib_blockupdate(&log_accept, x_new, x_old, b, b, c, c, mean, mean, d, d,
gi, (char *) &data, gi, (char *) &data,
fixed, graph, Qfunc, (char *)&kappa_new, Qfunc, (char *)&kappa_old, NULL, NULL, NULL, NULL,
constr, constr, NULL, NULL);