
MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY

General Editors

V. Isham, N. Keiding, T. Louis, N. Reid, R. Tibshirani, and H. Tong

1 Stochastic Population Models in Ecology and Epidemiology M.S. Bartlett (1960)


2 Queues D.R. Cox and W.L. Smith (1961)
3 Monte Carlo Methods J.M. Hammersley and D.C. Handscomb (1964)
4 The Statistical Analysis of Series of Events D.R. Cox and P.A.W. Lewis (1966)
5 Population Genetics W.J. Ewens (1969)
6 Probability, Statistics and Time M.S. Bartlett (1975)
7 Statistical Inference S.D. Silvey (1975)
8 The Analysis of Contingency Tables B.S. Everitt (1977)
9 Multivariate Analysis in Behavioural Research A.E. Maxwell (1977)
10 Stochastic Abundance Models S. Engen (1978)
11 Some Basic Theory for Statistical Inference E.J.G. Pitman (1979)
12 Point Processes D.R. Cox and V. Isham (1980)
13 Identification of Outliers D.M. Hawkins (1980)
14 Optimal Design S.D. Silvey (1980)
15 Finite Mixture Distributions B.S. Everitt and D.J. Hand (1981)
16 Classification A.D. Gordon (1981)
17 Distribution-Free Statistical Methods, 2nd edition J.S. Maritz (1995)
18 Residuals and Influence in Regression R.D. Cook and S. Weisberg (1982)
19 Applications of Queueing Theory, 2nd edition G.F. Newell (1982)
20 Risk Theory, 3rd edition R.E. Beard, T. Pentikäinen and E. Pesonen (1984)
21 Analysis of Survival Data D.R. Cox and D. Oakes (1984)
22 An Introduction to Latent Variable Models B.S. Everitt (1984)
23 Bandit Problems D.A. Berry and B. Fristedt (1985)
24 Stochastic Modelling and Control M.H.A. Davis and R. Vinter (1985)
25 The Statistical Analysis of Compositional Data J. Aitchison (1986)
26 Density Estimation for Statistics and Data Analysis B.W. Silverman (1986)
27 Regression Analysis with Applications G.B. Wetherill (1986)
28 Sequential Methods in Statistics, 3rd edition
G.B. Wetherill and K.D. Glazebrook (1986)
29 Tensor Methods in Statistics P. McCullagh (1987)
30 Transformation and Weighting in Regression
R.J. Carroll and D. Ruppert (1988)
31 Asymptotic Techniques for Use in Statistics
O.E. Barndorff-Nielsen and D.R. Cox (1989)
32 Analysis of Binary Data, 2nd edition D.R. Cox and E.J. Snell (1989)
33 Analysis of Infectious Disease Data N.G. Becker (1989)
34 Design and Analysis of Cross-Over Trials B. Jones and M.G. Kenward (1989)
35 Empirical Bayes Methods, 2nd edition J.S. Maritz and T. Lwin (1989)
36 Symmetric Multivariate and Related Distributions
K.T. Fang, S. Kotz and K.W. Ng (1990)
37 Generalized Linear Models, 2nd edition P. McCullagh and J.A. Nelder (1989)

38 Cyclic and Computer Generated Designs, 2nd edition
J.A. John and E.R. Williams (1995)
39 Analog Estimation Methods in Econometrics C.F. Manski (1988)
40 Subset Selection in Regression A.J. Miller (1990)
41 Analysis of Repeated Measures M.J. Crowder and D.J. Hand (1990)
42 Statistical Reasoning with Imprecise Probabilities P. Walley (1991)
43 Generalized Additive Models T.J. Hastie and R.J. Tibshirani (1990)
44 Inspection Errors for Attributes in Quality Control
N.L. Johnson, S. Kotz and X. Wu (1991)
45 The Analysis of Contingency Tables, 2nd edition B.S. Everitt (1992)
46 The Analysis of Quantal Response Data B.J.T. Morgan (1992)
47 Longitudinal Data with Serial Correlation—A State-Space Approach
R.H. Jones (1993)
48 Differential Geometry and Statistics M.K. Murray and J.W. Rice (1993)
49 Markov Models and Optimization M.H.A. Davis (1993)
50 Networks and Chaos—Statistical and Probabilistic Aspects
O.E. Barndorff-Nielsen, J.L. Jensen and W.S. Kendall (1993)
51 Number-Theoretic Methods in Statistics K.-T. Fang and Y. Wang (1994)
52 Inference and Asymptotics O.E. Barndorff-Nielsen and D.R. Cox (1994)
53 Practical Risk Theory for Actuaries
C.D. Daykin, T. Pentikäinen and M. Pesonen (1994)
54 Biplots J.C. Gower and D.J. Hand (1996)
55 Predictive Inference—An Introduction S. Geisser (1993)
56 Model-Free Curve Estimation M.E. Tarter and M.D. Lock (1993)
57 An Introduction to the Bootstrap B. Efron and R.J. Tibshirani (1993)
58 Nonparametric Regression and Generalized Linear Models
P.J. Green and B.W. Silverman (1994)
59 Multidimensional Scaling T.F. Cox and M.A.A. Cox (1994)
60 Kernel Smoothing M.P. Wand and M.C. Jones (1995)
61 Statistics for Long Memory Processes J. Beran (1995)
62 Nonlinear Models for Repeated Measurement Data
M. Davidian and D.M. Giltinan (1995)
63 Measurement Error in Nonlinear Models
R.J. Carroll, D. Ruppert and L.A. Stefanski (1995)
64 Analyzing and Modeling Rank Data J.I. Marden (1995)
65 Time Series Models—In Econometrics, Finance and Other Fields
D.R. Cox, D.V. Hinkley and O.E. Barndorff-Nielsen (1996)
66 Local Polynomial Modeling and its Applications J. Fan and I. Gijbels (1996)
67 Multivariate Dependencies—Models, Analysis and Interpretation
D.R. Cox and N. Wermuth (1996)
68 Statistical Inference—Based on the Likelihood A. Azzalini (1996)
69 Bayes and Empirical Bayes Methods for Data Analysis
B.P. Carlin and T.A. Louis (1996)
70 Hidden Markov and Other Models for Discrete-Valued Time Series
I.L. MacDonald and W. Zucchini (1997)
71 Statistical Evidence—A Likelihood Paradigm R. Royall (1997)
72 Analysis of Incomplete Multivariate Data J.L. Schafer (1997)

73 Multivariate Models and Dependence Concepts H. Joe (1997)
74 Theory of Sample Surveys M.E. Thompson (1997)
75 Retrial Queues G. Falin and J.G.C. Templeton (1997)
76 Theory of Dispersion Models B. Jørgensen (1997)
77 Mixed Poisson Processes J. Grandell (1997)
78 Variance Components Estimation—Mixed Models, Methodologies and
Applications P.S.R.S. Rao (1997)
79 Bayesian Methods for Finite Population Sampling
G. Meeden and M. Ghosh (1997)
80 Stochastic Geometry—Likelihood and computation
O.E. Barndorff-Nielsen, W.S. Kendall and M.N.M. van Lieshout (1998)
81 Computer-Assisted Analysis of Mixtures and Applications—
Meta-analysis, Disease Mapping and Others D. Böhning (1999)
82 Classification, 2nd edition A.D. Gordon (1999)
83 Semimartingales and their Statistical Inference B.L.S. Prakasa Rao (1999)
84 Statistical Aspects of BSE and vCJD—Models for Epidemics
C.A. Donnelly and N.M. Ferguson (1999)
85 Set-Indexed Martingales G. Ivanoff and E. Merzbach (2000)
86 The Theory of the Design of Experiments D.R. Cox and N. Reid (2000)
87 Complex Stochastic Systems
O.E. Barndorff-Nielsen, D.R. Cox and C. Klüppelberg (2001)
88 Multidimensional Scaling, 2nd edition T.F. Cox and M.A.A. Cox (2001)
89 Algebraic Statistics—Computational Commutative Algebra in Statistics
G. Pistone, E. Riccomagno and H.P. Wynn (2001)
90 Analysis of Time Series Structure—SSA and Related Techniques
N. Golyandina, V. Nekrutkin and A.A. Zhigljavsky (2001)
91 Subjective Probability Models for Lifetimes
Fabio Spizzichino (2001)
92 Empirical Likelihood Art B. Owen (2001)
93 Statistics in the 21st Century
Adrian E. Raftery, Martin A. Tanner, and Martin T. Wells (2001)
94 Accelerated Life Models: Modeling and Statistical Analysis
Vilijandas Bagdonavicius and Mikhail Nikulin (2001)
95 Subset Selection in Regression, Second Edition Alan Miller (2002)
96 Topics in Modelling of Clustered Data
Marc Aerts, Helena Geys, Geert Molenberghs, and Louise M. Ryan (2002)
97 Components of Variance D.R. Cox and P.J. Solomon (2002)
98 Design and Analysis of Cross-Over Trials, 2nd Edition
Byron Jones and Michael G. Kenward (2003)
99 Extreme Values in Finance, Telecommunications, and the Environment
Bärbel Finkenstädt and Holger Rootzén (2003)
100 Statistical Inference and Simulation for Spatial Point Processes
Jesper Møller and Rasmus Plenge Waagepetersen (2004)
101 Hierarchical Modeling and Analysis for Spatial Data
Sudipto Banerjee, Bradley P. Carlin, and Alan E. Gelfand (2004)
102 Diagnostic Checks in Time Series Wai Keung Li (2004)
103 Stereology for Statisticians Adrian Baddeley and Eva B. Vedel Jensen (2004)
104 Gaussian Markov Random Fields: Theory and Applications
Havard Rue and Leonard Held (2005)

Monographs on Statistics and Applied Probability 104

Gaussian Markov
Random Fields
Theory and Applications

Håvard Rue
Leonhard Held

Boca Raton London New York Singapore

Published in 2005 by
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW
Boca Raton, FL 33487-2742

© 2005 by Taylor & Francis Group


Chapman & Hall/CRC is an imprint of Taylor & Francis Group
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number-10: 1-58488-432-0 (Hardcover)
International Standard Book Number-13: 978-1-58488-432-3 (Hardcover)
Library of Congress Card Number 2004061870
This book contains information obtained from authentic and highly regarded sources. Reprinted material is
quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts
have been made to publish reliable data and information, but the author and the publisher cannot assume
responsibility for the validity of all materials or for the consequences of their use.
No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic,
mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and
recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC) 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration
for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate
system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only
for identification and explanation, without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Rue, Havard.
Gaussian Markov random fields : theory and applications / Havard Rue & Leonhard Held.
p. cm. -- (Monographs on statistics and applied probability ; 104)
Includes bibliographical references and index.
ISBN 1-58488-432-0 (alk. paper)
1. Gaussian Markov random fields. I. Held, Leonhard. II. Title. III. Series.

QA274.R84 2005
519.2'33--dc22 2004061870

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com

Taylor & Francis Group is the Academic Division of T&F Informa plc.

To Mona and Ulrike

Contents

Preface

1 Introduction
1.1 Background
1.1.1 An introductory example
1.1.2 Conditional autoregressions
1.2 The scope of this monograph
1.2.1 Numerical methods for sparse matrices
1.2.2 Statistical inference in hierarchical models
1.3 Applications of GMRFs

2 Theory of Gaussian Markov random fields


2.1 Preliminaries
2.1.1 Matrices and vectors
2.1.2 Lattice and torus
2.1.3 General notation and abbreviations
2.1.4 Conditional independence
2.1.5 Undirected graphs
2.1.6 Symmetric positive-definite matrices
2.1.7 The normal distribution
2.2 Definition and basic properties of GMRFs
2.2.1 Definition
2.2.2 Markov properties of GMRFs
2.2.3 Conditional properties of GMRFs
2.2.4 Specification through full conditionals
2.2.5 Multivariate GMRFs⋆
2.3 Simulation from a GMRF
2.3.1 Some basic numerical linear algebra
2.3.2 Unconditional simulation of a GMRF
2.3.3 Conditional simulation of a GMRF
2.4 Numerical methods for sparse matrices
2.4.1 Factorizing a sparse matrix
2.4.2 Bandwidth reduction
2.4.3 Nested dissection

2.5 A numerical case study of typical GMRFs
2.5.1 GMRF models in time
2.5.2 Spatial GMRF models
2.5.3 Spatiotemporal GMRF models
2.6 Stationary GMRFs⋆
2.6.1 Circulant matrices
2.6.2 Block-circulant matrices
2.6.3 GMRFs with circulant precision matrices
2.6.4 Toeplitz matrices and their approximations
2.6.5 Stationary GMRFs on infinite lattices
2.7 Parameterization of GMRFs⋆
2.7.1 The valid parameter space
2.7.2 Diagonal dominance
2.8 Bibliographic notes

3 Intrinsic Gaussian Markov random fields


3.1 Preliminaries
3.1.1 Some additional definitions
3.1.2 Forward differences
3.1.3 Polynomials
3.2 GMRFs under linear constraints
3.3 IGMRFs of first order
3.3.1 IGMRFs of first order on the line
3.3.2 IGMRFs of first order on lattices
3.4 IGMRFs of higher order
3.4.1 IGMRFs of higher order on the line
3.4.2 IGMRFs of higher order on regular lattices⋆
3.4.3 Nonpolynomial IGMRFs of higher order
3.5 Continuous-time random walks⋆
3.6 Bibliographic notes

4 Case studies in hierarchical modeling


4.1 MCMC for hierarchical GMRF models
4.1.1 A brief introduction to MCMC
4.1.2 Blocking strategies
4.2 Normal response models
4.2.1 Example: Drivers data
4.2.2 Example: Munich rental guide
4.3 Auxiliary variable models
4.3.1 Scale mixtures of normals
4.3.2 Hierarchical-t formulations
4.3.3 Binary regression models
4.3.4 Example: Tokyo rainfall data

4.3.5 Example: Mapping cancer incidence
4.4 Nonnormal response models
4.4.1 The GMRF approximation
4.4.2 Example: Joint disease mapping
4.5 Bibliographic notes

5 Approximation techniques
5.1 GMRFs as approximations to Gaussian fields
5.1.1 Gaussian fields
5.1.2 Fitting GMRFs to Gaussian fields
5.1.3 Results
5.1.4 Regular lattices and boundary conditions
5.1.5 Example: Swiss rainfall data
5.2 Approximating hidden GMRFs
5.2.1 Constructing non-Gaussian approximations
5.2.2 Example: A stochastic volatility model
5.2.3 Example: Reanalyzing Tokyo rainfall data
5.3 Bibliographic notes

Appendices

A Common distributions

B The library GMRFLib


B.1 The graph object and the function Qfunc
B.2 Sampling from a GMRF
B.3 Implementing block-updating algorithms for hierarchical
GMRF models

References

Preface

This monograph describes Gaussian Markov random fields (GMRFs)
and some of their applications in statistics. At first sight, this seems to be
a rather specialized topic, as the wider class of Markov random fields
is probably known only to researchers in spatial statistics and image
analysis. However, GMRFs have applications far beyond these two areas,
for example in structural time-series analysis, analysis of longitudinal
and survival data, spatiotemporal statistics, graphical models, and semi-
parametric statistics.
Despite the wide range of applications, there is a unified framework
for representing, understanding, and computing with GMRFs using
the graph formulation. Our main motivation to write this monograph
is to provide the first comprehensive account of the main properties of
GMRFs, to emphasize the strong connection between GMRFs and nu-
merical methods for sparse matrices, and to outline various applications
of GMRFs for statistical inference.
Complex hierarchical models are at the core of modern statistics,
and GMRFs play a central role in this framework to describe the
spatial and temporal dynamics of nature and real systems. Statistical
inference in hierarchical models, however, can typically only be done
using simulation, in particular through Markov chain Monte Carlo
(MCMC) methods. Thus we emphasize computational issues, which
allow us to construct fast and reliable algorithms for (Bayesian) inference
in hierarchical models with GMRF components. We emphasize the
concept of blocking, i.e., updating all or nearly all of the parameters
jointly, which we believe to be perhaps the only way to overcome
problems with convergence and mixing of ordinary MCMC algorithms.
We hope that the reader will share our enthusiasm and that the examples
provided in this book will stimulate further research in this area.
The book can be loosely categorized as follows. We begin in Chapter 1
by introducing GMRFs through two simple examples, an autoregressive
model in time and a conditional autoregressive model in space. We then
briefly discuss numerical methods for sparse matrices, and why they are
important for simulation-based inference in GMRF models. We illustrate
this through a simple hierarchical model. We finally describe various
areas where GMRFs are used in statistics.

Chapter 2 is the main theoretical chapter, describing the most impor-
tant results for GMRFs. It starts by introducing the necessary notation
and describing the central concept of conditional independence. GMRFs
are then defined and studied in detail. Efficient direct simulation from
a GMRF is described using numerical techniques for sparse matrices.
A numerical case study illustrates the performance of the algorithms
in different scenarios. Finally, two optional sections follow: the first
describes the theory of stationary GMRFs, where circulant and block-
circulant matrices become important; the second discusses the problem of
how to parameterize the precision matrix, the inverse covariance matrix,
of a GMRF without destroying positive definiteness.
In Chapter 3 we give a detailed discussion of intrinsic GMRFs
(IGMRFs). IGMRFs have precision matrices that are no longer of full
rank. They are of central importance in Bayesian hierarchical
models, where they are often used as a nonstationary prior distribution
for dependent parameters in space or in time. A key concept to
understanding IGMRFs is the conditional distribution of a proper
GMRF under linear constraints. We then describe IGMRFs of various
kinds, on the line, the lattice, the torus, and on irregular graphs. A final
optional section is devoted to the representation of integrated Wiener
process priors as IGMRFs.
In Chapter 4 we discuss various applications of GMRFs for hierarchical
modeling. We outline how to use MCMC algorithms in hierarchical
models with GMRF components. We start with some general comments
regarding MCMC via blocking. We then discuss models with normal
observations, auxiliary variable models for probit and logistic regression
and nonnormal regression models, all with latent GMRF components.
The GMRFs may have a temporal or a spatial component, or they relate
to particular covariate effects in a semiparametric regression framework.
Finally, in Chapter 5 we first describe how GMRFs can be used to
approximate so-called Gaussian fields, i.e., normally distributed random
vectors where the covariance matrix rather than its inverse, the precision
matrix, is specified. The final section in Chapter 5 is devoted to the
problem of how to construct improved and non-GMRF approximations
to hidden GMRFs.
Appendices A and B describe the distributions we use and the
implementation of the algorithms in the public-domain library GMRFLib.
Chapters 2 and 3 are fairly self-contained and do not require much
prior knowledge from the reader, except for some familiarity with
probability theory and linear algebra. Chapters 4 and 5 assume that the
reader is experienced in the area of Bayesian hierarchical models and
their statistical analysis via MCMC, perhaps at the level of standard
textbooks such as Carlin and Louis (1996), Gilks et al. (1996), Robert
and Casella (1999), or Gelman et al. (2004).
This monograph can be read chronologically. Sections marked with
a ‘⋆’ indicate more advanced material which can be skipped at first
reading. We might ask too much of some readers' patience in Chapters 2
and 3, which are motivated by the various applications of GMRFs for
hierarchical modeling in Chapter 4. It might therefore be useful to skim
through Chapter 4 before reading Chapters 2 and 3 in detail.
This book was conceived in the spring of 2003 but the main body of
work was done in the first half of 2004. We are indebted to Julian Besag,
who read his seminal paper on Markov random fields (Besag, 1974) to
the Royal Statistical Society 30 years ago, for his many contributions to
this field since then, and for introducing LH to MRFs in 1995/1996
during a visit to the University of Washington. We also appreciate his
comments on the initial draft and his sending us a copy of Mondal and Besag
(2004).
We thank Hans R. Künsch for sharing his wisdom with HR during a
visit to the ETH Zürich in February 2004, Ludwig Fahrmeir, Stefan
Lang, and Håkon Tjelmeland for many good discussions, and Dag
Myrhaug for providing a quiet working environment for HR. The inter-
action and collaboration with (past) Ph.D. students about this theme
have been valuable, thanks to Sveinung Erland, Turid Follestad, Oddvar
K. Husby, Günter Raßer, Volker Schmid, and Ingelin Steinsland. The
support of the German Research Foundation (DFG, Sonderforschungs-
bereich 386) and the department of mathematical sciences at NTNU is
also appreciated. HR also thanks Anne Kajander for all administrative
help.
Håkon Tjelmeland and Geir Storvik read carefully through the initial
drafts and provided numerous comments and critical questions. Thank
you! Also the comments from Arnoldo Frigessi, Martin Sköld and Hanne
T. Wist were much appreciated. The collaboration with Chapman &
Hall/CRC was always smooth and constructive.
We look forward to returning to everyday life and enjoying our
families, Kristine and Mona, Valentina and Ulrike. Thank you for your
patience!

Håvard Rue, Trondheim
Leonhard Held, Munich
Summer 2004

CHAPTER 1

Introduction

1.1 Background
This monograph considers Gaussian Markov random fields (GMRFs)
covering both theory and applications. A GMRF is really a simple
construct: It is just a (finite-dimensional) random vector following a
multivariate normal (or Gaussian) distribution. However, we will be
concerned with more restrictive versions where the GMRF satisfies ad-
ditional conditional independence assumptions, hence the term Markov.
Conditional independence is a powerful concept. Let x = (x1, x2, x3)T
be a random vector; then x1 and x2 are conditionally independent given
x3 if, for a known value of x3, discovering x2 tells you nothing new about
the distribution of x1. Under this condition the joint density π(x) must
have the representation
π(x) = π(x1 | x3 ) π(x2 | x3 ) π(x3 ),
which is a simplification of a general representation
π(x) = π(x1 | x2 , x3 ) π(x2 | x3 ) π(x3 ).
The conditional independence property implies that π(x1 |x2 , x3 ) is
simplified to π(x1 |x3 ), which is easier to understand, to represent, and
to interpret.

1.1.1 An introductory example


As a simple example of a GMRF, consider an autoregressive process of
order 1 with standard normal errors, which is often expressed as
$$x_t = \phi x_{t-1} + \epsilon_t, \qquad \epsilon_t \overset{\text{iid}}{\sim} N(0, 1), \qquad |\phi| < 1, \tag{1.1}$$
where the index t represents time. Assumptions about conditional
independence are not stated explicitly here, but show up more clearly if
we express (1.1) in the conditional form
$$x_t \mid x_1, \ldots, x_{t-1} \sim N(\phi x_{t-1}, 1) \tag{1.2}$$
for t = 2, . . . , n. In this model, xs and xt with 1 ≤ s < t ≤ n are
conditionally independent given {xs+1, . . . , xt−1} whenever t − s > 1.

In addition to (1.2), let us now assume that the marginal distribution
of x1 is normal with mean zero and variance 1/(1 − φ²), which is simply
the stationary distribution of this process. Then the joint density of x is
$$\pi(x) = \pi(x_1)\,\pi(x_2 \mid x_1)\cdots\pi(x_n \mid x_{n-1})
        = \frac{1}{(2\pi)^{n/2}}\,|Q|^{1/2}\exp\Bigl(-\tfrac{1}{2}\,x^T Q x\Bigr), \tag{1.3}$$
where the precision matrix Q is the tridiagonal matrix
$$Q = \begin{pmatrix}
1 & -\phi & & & \\
-\phi & 1+\phi^2 & -\phi & & \\
 & \ddots & \ddots & \ddots & \\
 & & -\phi & 1+\phi^2 & -\phi \\
 & & & -\phi & 1
\end{pmatrix}$$
with zero entries outside the diagonal and first off-diagonals. The
conditional independence assumptions impose certain restrictions on
the precision matrix. The tridiagonal form is due to the fact that xi
and xj are conditionally independent for |i − j| > 1, given the rest.
This also holds in general for any GMRF: If Qij = 0 for i ≠ j,
then xi and xj are conditionally independent given the other variables
{xk : k ≠ i and k ≠ j}, and vice versa. The sparse structure of Q
prepares the ground for fast computations of GMRFs to which we return
in Section 1.2.1.
The simple relationship between conditional independence and the
zero structure of the precision matrix is not evident in the covariance
matrix Σ = Q⁻¹, which is a (completely) dense matrix with entries
$$\sigma_{ij} = \frac{1}{1-\phi^2}\,\phi^{|i-j|}.$$
For example, for n = 7,
$$\Sigma = \frac{1}{1-\phi^2}\begin{pmatrix}
1 & \phi & \phi^2 & \phi^3 & \phi^4 & \phi^5 & \phi^6 \\
\phi & 1 & \phi & \phi^2 & \phi^3 & \phi^4 & \phi^5 \\
\phi^2 & \phi & 1 & \phi & \phi^2 & \phi^3 & \phi^4 \\
\phi^3 & \phi^2 & \phi & 1 & \phi & \phi^2 & \phi^3 \\
\phi^4 & \phi^3 & \phi^2 & \phi & 1 & \phi & \phi^2 \\
\phi^5 & \phi^4 & \phi^3 & \phi^2 & \phi & 1 & \phi \\
\phi^6 & \phi^5 & \phi^4 & \phi^3 & \phi^2 & \phi & 1
\end{pmatrix}.$$
It is therefore difficult to derive conditional independence properties from
the structure of Σ. Clearly, the entries in Σ only give (direct) information
about the marginal dependence structure, not the conditional one. For
example, in the autoregressive model, xs and xt are marginally dependent
for any finite s and t as long as φ ≠ 0.
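
To make the correspondence concrete, the following small sketch (in
Python with NumPy; our illustration, not code from the book, and the
helper name ar1_precision is ours) builds the tridiagonal precision matrix
Q of the stationary AR(1) model and checks that its inverse has exactly
the entries σij = φ^|i−j|/(1 − φ²) given above:

import numpy as np

def ar1_precision(n, phi):
    # Tridiagonal precision matrix of the stationary AR(1) model, cf. (1.3)
    Q = np.zeros((n, n))
    Q[0, 0] = Q[n - 1, n - 1] = 1.0
    Q[np.arange(1, n - 1), np.arange(1, n - 1)] = 1.0 + phi ** 2
    idx = np.arange(n - 1)
    Q[idx, idx + 1] = Q[idx + 1, idx] = -phi
    return Q

n, phi = 7, 0.8
Q = ar1_precision(n, phi)

# Dense covariance with entries sigma_ij = phi^{|i-j|} / (1 - phi^2)
i, j = np.indices((n, n))
Sigma = phi ** np.abs(i - j) / (1.0 - phi ** 2)

print(np.allclose(np.linalg.inv(Q), Sigma))   # True: Sigma is the inverse of Q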
Simplifications due to conditional independence do not only appear
for the directed conditional distributions as in (1.2), but also for
the undirected conditional distributions, often called full conditionals
{π(xt |x−t )}, where x−t denotes all elements in x but xt . In the
autoregressive example,
$$x_t \mid x_{-t} \sim \begin{cases}
N(\phi x_{t+1},\, 1) & t = 1,\\
N\bigl(\tfrac{\phi}{1+\phi^2}(x_{t-1}+x_{t+1}),\; \tfrac{1}{1+\phi^2}\bigr) & 1 < t < n,\\
N(\phi x_{n-1},\, 1) & t = n,
\end{cases} \tag{1.4}$$
so xt depends in general both on xt−1 and xt+1 . Equation (1.4) is
important as it allows for an alternative specification of the first-
order autoregressive models through the full conditionals π(xt |x−t ) for
t = 1, . . . , n. In fact, by starting with these full conditionals, we obtain
an alternative and completely equivalent representation of this model
with the same joint density for x. This is not so obvious as for the
directed conditional distributions (1.2), where the joint density is simply
the product of the densities corresponding to (1.2) for t = 2, . . . , n times
the (marginal) density of x1 .
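
The pattern behind (1.4) is general: for any zero-mean GMRF, the full
conditional of xt has precision Qtt and mean −(1/Qtt) Σ_{j≠t} Qtj xj, a
standard Gaussian conditioning identity developed in Chapter 2. A short
sketch (our illustration, reusing the ar1_precision helper defined in the
previous sketch) verifies this against (1.4):

import numpy as np

def full_conditional(Q, x, t):
    # Full conditional of x_t given x_{-t} for a zero-mean GMRF:
    # precision Q_tt and mean -(1/Q_tt) * sum_{j != t} Q_tj x_j
    prec = Q[t, t]
    mean = -(Q[t, :] @ x - Q[t, t] * x[t]) / prec
    return mean, 1.0 / prec          # conditional mean and variance

n, phi = 7, 0.8
Q = ar1_precision(n, phi)
x = np.random.default_rng(2).standard_normal(n)
mean, var = full_conditional(Q, x, 3)
# Agrees with (1.4): mean = phi/(1+phi^2) * (x[2] + x[4]), var = 1/(1+phi^2)
print(np.isclose(mean, phi / (1 + phi ** 2) * (x[2] + x[4])),
      np.isclose(var, 1.0 / (1 + phi ** 2)))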

1.1.2 Conditional autoregressions


We now make the discussion more general, leaving autoregressive models.
Let x be associated with observations or some property of points or
regions in the spatial domain. For example, xi could be the value of
pixel i in an image, the height of tile i in a tessellation or the relative
risk for some disease in the ith district. Now there is no natural ordering
of the indices and (1.3) is no longer useful to specify the joint density
of x. A common approach is then to specify the joint density of a zero
mean GMRF implicitly by specifying each of the n full conditionals
$$x_i \mid x_{-i} \sim N\Bigl(\sum_{j:\,j\neq i}\beta_{ij}x_j,\; \kappa_i^{-1}\Bigr), \tag{1.5}$$

which was pioneered by Besag (1974, 1975). These models are also
known by the name conditional autoregressions, abbreviated as CAR
models. There is also an alternative and more restrictive approach to
CAR models, the so-called simultaneous autoregressions (SAR), which
we will not discuss specifically. This approach dates back to Whittle
(1954), see for example, Cressie (1993) for further details.
The n full conditionals (1.5) must satisfy some consistency conditions
to ensure that a joint normal density exists with these full conditionals.

These conditions reduce to requiring that Q = (Qij), with elements
$$Q_{ij} = \begin{cases} \kappa_i & i = j,\\ -\kappa_i\beta_{ij} & i \neq j, \end{cases}$$
is symmetric and positive definite. Symmetry is ensured by κi βij = κj βji
for all i ≠ j, while positive definiteness requires κi > 0 for all i =
1, . . . , n, but imposes further (and often quite complicated) constraints
on the βij 's. A common (perhaps too common!) approach to ensure
positive definiteness is to require that Q is diagonally dominant, which
means that, in each row (or column) of Q, the diagonal entry is larger
than the sum of the absolute off-diagonal entries. This is a sufficient but
not necessary condition for positive definiteness.
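
As a small numerical sketch (our own illustration; the helper names are
hypothetical), the following assembles Q from given κi and βij on a chain
graph and checks both symmetry and diagonal dominance:

import numpy as np

def car_precision(kappa, beta):
    # Q_ii = kappa_i and Q_ij = -kappa_i * beta_ij for i != j
    n = len(kappa)
    Q = np.diag(kappa).astype(float)
    for i in range(n):
        for j in range(n):
            if i != j:
                Q[i, j] = -kappa[i] * beta[i, j]
    return Q

def is_diagonally_dominant(Q):
    # Sufficient (but not necessary) condition for positive definiteness
    off = np.sum(np.abs(Q), axis=1) - np.abs(np.diag(Q))
    return bool(np.all(np.diag(Q) > off))

n = 4
kappa = np.ones(n)
beta = np.zeros((n, n))
for i in range(n - 1):               # chain graph: the neighbors of i are i-1 and i+1
    beta[i, i + 1] = beta[i + 1, i] = 0.4

Q = car_precision(kappa, beta)
print(np.allclose(Q, Q.T), is_diagonally_dominant(Q))   # True True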
The conditional independence properties of this GMRF can now be
found by simply checking whether Qij is zero or not. If Qij = 0, then xi and xj
are conditionally independent given the rest, and if Qij ≠ 0 then they are
conditionally dependent. It is useful to represent these findings using an
undirected graph with nodes {1, . . . , n} and an edge between nodes i and
j ≠ i if and only if Qij ≠ 0. We then say that x is a GMRF with respect
to this graph. The neighbors of node i are all nodes j ≠ i with βij ≠ 0,
hence all nodes on which the full conditional (1.5) depends. Going back
to the autoregressive model (1.4), the neighbors of i are {i − 1, i + 1} for
i = 2, . . . , n − 1, and {2} and {n − 1} for nodes 1 and n, respectively.
In general the neighbors of i are often those that are, in one way or
the other, in the ‘proximity’ of node i. The common approach is first to
specify the graph by choosing a suitable set of neighbors to each node,
and then to choose βij for each pair i ∼ j of neighboring nodes i and j.
Figure 1.1 displays two such graphs, (a) a linear graph corresponding
to (1.2) with n = 50 and (b) the graph corresponding to the 16 states of
Germany where two states are neighbors if they share a common border.
The graph in (b) is not drawn to mimic the map of Germany but only to
visualize the graph itself. The number of neighbors in (b) varies between
2 and 9.
Figure 1.2 displays a graph constructed similarly to Figure 1.1(b), but
which now corresponds to the 366 regions in Sardinia. The neighborhood
structure is now slightly more complex and the number of neighbors
varies between 1 and 13 with a median of 5. This is a typical (but
simple) graph for applications of GMRF models.
The case where Q is symmetric and positive semidefinite is of partic-
ular interest. This class is known under the name intrinsic conditional
autoregressions or intrinsic GMRF s (IGMRFs). The density of x is
then improper but, by construction, x defines a proper distribution on
a specific lower-dimensional space. For example, if each row (or column)
of Q sums up to zero, then Q has rank n − 1 and the (improper) density
of x is invariant to the addition of a constant to all components in x.
This is of benefit if the level of x is unknown or perhaps not constant
but varies smoothly over the region of interest. More generally, one can
for example construct IGMRFs that are invariant to the addition of
polynomials. IGMRFs play a central role in hierarchical models, which
we discuss later.

Figure 1.1 (a) The linear graph corresponding to an autoregressive process of
order 1, (b) the graph of the 16 states of Germany where two states sharing a
common border are considered to be neighbors.

1.2 The scope of this monograph


The main scope of this monograph is as follows:
• To provide a systematic presentation of the main theoretical results
for GMRFs and intrinsic GMRFs. We will focus mainly on finite
GMRFs, but also discuss GMRFs on infinite lattices.
• To present and discuss numerical methods for sparse matrices and how
these can be used to simulate from a GMRF and how to evaluate the
log density of a GMRF. Both tasks also can be done under various
forms of conditioning and linear constraints.
• To discuss hierarchical GMRF models, which illustrate the use of
GMRFs in various areas using different choices for the distribution of
the observed data.

• To provide a unified framework for fast and reliable Bayesian inference
in hierarchical GMRF models based on Markov chain Monte Carlo
(MCMC) simulation. Typically, all or nearly all unknown parameters
are updated simultaneously or at least in large blocks. An important
part of the algorithms is to use fast numerical methods for sparse
matrices.

Figure 1.2 The graph of the 366 administrative regions in Sardinia where two
regions sharing a common border are neighbors. Neighbors in the graph are
linked by edges or indicated by overlapping nodes.
Perhaps the two most innovative methodological contributions in this
monograph are the connection between GMRFs and numerical methods
for sparse matrices, and the fast and reliable MCMC block algorithms
for Bayesian inference in hierarchical models with GMRF components.
We briefly describe the main ideas in the following.

1.2.1 Numerical methods for sparse matrices

Sparse matrices appear naturally for GMRFs, as Qij ≠ 0 only if i and
j are neighbors. By construction (most) precision matrices for GMRFs
are sparse where only O(n) of the terms in Q are nonzero. We can take

advantage of this for computing the Cholesky factorization of Q,
Q = LLT ,
where L is a lower-triangular matrix. It turns out that L can inherit the
nonzero pattern of Q so it can be sparse as well. However, how sparse L is
depends heavily on the ordering of the indices of the GMRF x. Therefore
the indices are permuted in advance to obtain a matrix L with as few
as possible nonzero entries. The computational savings stem from the
simple fact that we do not need to compute terms that are known to be
zero. Hence, only the nonzero terms in L are computed and stored. For
larger GMRFs, for example with 10,000 to 100,000 nodes, this results in
a huge speedup and low memory usage. The classical approach to obtain
such a matrix L is to construct a permutation of the indices of x such
that the permuted Q becomes a band matrix. Then a band-Cholesky
factorization can be used to compute L. In this case L will be a (lower)
band matrix with the same bandwidth as Q. For an introduction to
numerical methods for sparse matrices, see Dongarra et al. (1998), Duff
et al. (1989), George and Liu (1981), and Gupta (2002) for a comparison.
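
To illustrate the bandwidth idea, the sketch below (our own; it uses the
reverse Cuthill-McKee routine from SciPy, one classical choice of
band-reducing permutation) scrambles the natural ordering of a lattice
GMRF precision matrix and then recovers a small bandwidth by reordering:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    A = A.tocoo()
    return int(np.max(np.abs(A.row - A.col)))

# Precision matrix of a GMRF on a 20 x 20 lattice with nearest-neighbor structure
m = 20
T = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(m, m))
Q = (sp.kron(sp.identity(m), T)
     + sp.kron(sp.diags([-1.0, -1.0], [-1, 1], shape=(m, m)), sp.identity(m))).tocsr()

rng = np.random.default_rng(0)
p = rng.permutation(Q.shape[0])
Q_scrambled = Q[p, :][:, p]                         # destroy the banded ordering

perm = reverse_cuthill_mckee(Q_scrambled.tocsr(), symmetric_mode=True)
Q_banded = Q_scrambled[perm, :][:, perm]
print(bandwidth(Q_scrambled), bandwidth(Q_banded))  # large versus small again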
As an illustration, suppose we want to simulate from a GMRF.
Simulation-based inference via MCMC is typically the only way for
(Bayesian) inference in complex hierarchical models, and efficient simu-
lation of GMRFs is therefore one of the central themes of the book. To
simulate from a zero mean GMRF with precision matrix Q, we compute
the Cholesky triangle L and then solve
$$L^T x = z, \tag{1.6}$$
where z is a vector of independent standard normal variables. The sparse
structure of L will also make this step more efficient. It is easy to
show that the solution of (1.6) has precision matrix Q as required. The
generalization to arbitrary mean µ is trivial. Algorithms for conditional
simulation of GMRFs can also be constructed such that all sparse
matrices involved are taken advantage of. The same is true if one is
interested in the evaluation of the log density of the GMRF. Roughly,
the cost is O(n), O(n^{3/2}), and O(n²) for GMRFs in time, space, and
space × time, respectively.
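
A minimal sketch of this sampling recipe (using the dense NumPy Cholesky
for clarity; in practice a sparse Cholesky factorization is what yields the
speedups described above, and ar1_precision is the helper from the earlier
sketch):

import numpy as np

def sample_gmrf(Q, mu=None, rng=None):
    # Sample x with precision matrix Q by solving L^T x = z, cf. (1.6), with Q = L L^T
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(Q)                  # lower-triangular Cholesky factor
    z = rng.standard_normal(Q.shape[0])        # independent standard normals
    x = np.linalg.solve(L.T, z)                # Cov(x) = L^{-T} L^{-1} = Q^{-1}
    return x if mu is None else mu + x

Q = ar1_precision(200, 0.9)
samples = np.stack([sample_gmrf(Q) for _ in range(2000)])
# The empirical covariance approaches Q^{-1} as the number of samples grows
print(np.abs(np.cov(samples.T) - np.linalg.inv(Q)).max())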
The symbiotic connection between GMRFs and numerical methods for
sparse matrices has been known implicitly and for special cases for a
long time. For autoregressive models and state-space models in general,
fast O(n) algorithms exist, derived from the Kalman filter and its
variants. The forward-filtering backward-sampling algorithm (Carter and
Kohn, 1994, Frühwirth-Schnatter, 1994) uses intermediate results from
the Kalman filter to sample from a (hidden) GMRF, but reduces to
factorizing a positive definite (block-)tridiagonal matrix (Knorr-Held

and Rue, 2002, Appendix). Lavine (1999) uses this algorithm for a
two- and three-dimensional GMRF on a regular lattice and derives
algorithms for the evaluation of the log density and for sampling
a GMRF. This algorithm is similar to the one derived by Moura
and Balram (1992). However, the extension from the regular lattice
to a general graph is difficult using this approach. Pace and Barry
(1997) propose using general numerical methods for sparse matrices to
evaluate the log density. Rue (2001) derives algorithms for conditional
sampling, evaluation of the corresponding log density, and demonstrates
how to use these numerical methods to construct block-updating MCMC
algorithms further developed by Knorr-Held and Rue (2002). Rue and
Follestad (2003) provide additional details of Rue (2001, Appendix) and
a statistical interpretation of numerical methods for sparse matrices and
various permutation approaches.
A nice feature about modern techniques for sparse matrices is that
the permutation adapts to the graph of the GMRF under study, hence
such methods provide close-to-optimal algorithms for most cases of
interest. This is of great advantage as it allows us to merge the different
GMRFs usually involved in a hierarchical GMRF model into a larger
one, which makes it possible to construct a unified approach to MCMC-
based inference for hierarchical GMRF models. This will be sketched in
the following section.

1.2.2 Statistical inference in hierarchical models

GMRFs are frequently used in hierarchical models in order to allow for


stochastic dependence between a set of unknown parameters. A typical
setup uses three stages where unknown hyperparameters θ specify a
GMRF x. The field x is now connected to data y, which are commonly
assumed to be conditionally independent given x. In the simplest case,
each observation yi depends only on a corresponding ith element xi in x,
so y and x have the same dimension. Hence the three stages are specified
as
$$\theta \sim \pi(\theta)$$
$$x \sim \pi(x \mid \theta)$$
$$y_i \overset{\text{iid}}{\sim} \pi(y_i \mid x_i), \quad i = 1, \ldots, n.$$

The posterior distribution is
$$\pi(x, \theta \mid y) \propto \pi(\theta)\,\pi(x \mid \theta)\prod_{i=1}^{n}\pi(y_i \mid x_i).$$
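
A concrete sketch of this three-stage setup (our own illustration; the
gamma hyperprior, the first-order random-walk prior for x, and the
Poisson likelihood are assumptions made only for this example): forward
simulation and evaluation of the unnormalized log posterior.

import numpy as np
from scipy.stats import gamma, poisson

rng = np.random.default_rng(1)
n = 100

# Stage 1: hyperparameter theta, here the precision of the latent field
theta = gamma.rvs(a=10.0, scale=1.0, random_state=rng)

# Stage 2: latent field x | theta, here a first-order random walk (an intrinsic GMRF)
x = np.cumsum(rng.normal(scale=theta ** -0.5, size=n))

# Stage 3: conditionally independent Poisson observations y_i | x_i
y = poisson.rvs(mu=np.exp(x), random_state=rng)

def log_posterior(x, theta, y):
    # log pi(theta) + log pi(x | theta) + sum_i log pi(y_i | x_i), up to a constant
    lp = gamma.logpdf(theta, a=10.0, scale=1.0)
    lp += 0.5 * (n - 1) * np.log(theta) - 0.5 * theta * np.sum(np.diff(x) ** 2)
    lp += np.sum(poisson.logpmf(y, mu=np.exp(x)))
    return lp

print(log_posterior(x, theta, y))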

For example, yi could be a normal variable with mean xi , Bernoulli
with mean 1/(1 + exp(xi )) or Poisson with mean exp(xi ). Consider for
example the Poisson case. In many applications there will be so-called
extra-Poisson variation, and a common approach to deal with this is to
add independent zero mean normal random effects vi to the model so
that yi is now Poisson with mean exp(xi + vi ).
Since v is also a GMRF with a diagonal precision matrix and x and
v are assumed to be independent, they form a joint GMRF of size
2n. However, conditional on y, both xi and vi will depend on yi . An
alternative approach is to parameterize the model from v to u = x + v,
which defines a GMRF w of size 2n,
 
x
w= . (1.7)
u
The graph of w is displayed in Figure 1.3, where the graph of x corresponds
either to Figure 1.1(a) or (b) and the gray nodes in the graph correspond
to u. Using the new GMRF w, each observation yi is now only
connected to wn+i, and the posterior distribution has the form
$$\pi(w, \theta \mid y) \propto \pi(\theta)\,\pi(w \mid \theta)\prod_{i=1}^{n}\pi(y_i \mid w_{n+i}).$$
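
The joint precision matrix of w follows from a standard Gaussian change
of variables (x, v) → (x, u = x + v) and keeps the sparse, local structure;
the sketch below (our own, reusing ar1_precision from the earlier example)
makes this explicit:

import numpy as np

def joint_precision(Qx, Qv):
    # Precision of w = (x, u), u = x + v, with x and v independent with
    # precisions Qx and Qv: substitute v = u - x in the joint density of (x, v)
    return np.block([[Qx + Qv, -Qv],
                     [-Qv,      Qv]])

n = 5
Qx = ar1_precision(n, 0.8)       # latent AR(1) component
Qv = 2.0 * np.eye(n)             # independent random effects with precision 2
Qw = joint_precision(Qx, Qv)     # 2n x 2n; observation y_i attaches only to u_i = w_{n+i}
print(Qw.shape, np.count_nonzero(Qw))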

This is a typical example where MCMC is the only way for statistical
inference, but where the choice of the particular MCMC algorithm is
crucial. In Chapter 4, we will describe an MCMC algorithm that jointly
updates the GMRF w and the hyperparameters θ (here the unknown
precisions of x and v) in one block, thus ensuring good mixing and
convergence properties of the algorithm.
Suppose now there is also covariate information zi available for each
observation yi, where zi is of dimension p, say. A common approach is to
assume now that yi is Poisson with mean exp(xi + vi + ziT β), where β is
a vector of unknown regression parameters, with a multivariate normal
prior with some mean and some precision matrix, which can be zero. We
do not give the exact details here, but β can also be merged with the
GMRF w to a larger GMRF of dimension 2n + p, which still inherits
the sparse structure. Furthermore, block updates of the enlarged field,
jointly with unknown hyperparameters, is still possible.
Merging two or more GMRFs into a larger one typically preserves
the local features of the GMRF and simplifies the structure of the
model. This is important mainly for computational reasons, as we can
then construct efficient MCMC algorithms. Note that if θ is fixed
and yi is normal, this will correspond to independent simulation from
the posterior, no matter how large the dimension of the GMRF. For
nonnormal observations, as in the above Poisson case, we will use

the Metropolis-Hastings algorithm combined with Taylor expansions to
construct appropriate GMRF block proposals for the posterior distribu-
tion. However, for binary responses we will introduce so-called auxiliary
variables in the model, which avoid the use of Taylor expansions.

1.3 Applications of GMRFs


GMRFs have an enormous list of applications, dating back to 1880,
at least, with Thiele’s first-order random walk model for time-series
analysis, see Lauritzen (1981). We will now briefly describe some main
areas of application, not mutually disjoint, where GMRFs are being used,
pointing the interested reader to some key references.
Structural time-series analysis Autoregressive models are GMRFs
on a linear graph and are part of the standard time-series literature.
Extensions to state-space models add normal observations, which makes
the conditional distribution of the hidden state x also a GMRF. Some
of the theoretical results derived in this area depend particularly on
the linear graph and its sequential representation. Computational
algorithms used are based on the Kalman filter and its variants.
Approximate inference for state-space models with nonnormal obser-
vations is discussed in Fahrmeir (1992). Simulation-based inference for
normal state-space models is described in Carter and Kohn (1994),
Frühwirth-Schnatter (1994), and Shephard (1994), while simulation-
based inference for state-space models with nonnormal observations
is proposed in Shephard and Pitt (1997) and Knorr-Held (1999). The
connection of these algorithms to our more general graph-oriented
approach will be discussed in Chapter 4, see also Knorr-Held and
Rue (2002, Appendix A). Good references to time-series analysis and
state-space models are Brockwell and Davis (1987), Harvey (1989)
and West and Harrison (1997).
Analysis of longitudinal and survival data GMRF priors, in par-
ticular their temporal versions, are used extensively to analyze
longitudinal and survival data. Some key references for state-space
approaches are Fahrmeir (1994), Gamerman and West (1987), Jones
(1993), see also Fahrmeir and Knorr-Held (2000, Sec. 18.3.3). The
analysis of longitudinal or survival data with additional GMRFs on
spatial components is described in Banerjee et al. (2003), Carlin and
Banerjee (2003), Crook et al. (2003), Knorr-Held (2000a), Knorr-Held
and Besag (1998), Knorr-Held and Richardson (2003), and Banerjee
et al. (2004) among others. Analysis of rates with several time scales is
described in Berzuini and Clayton (1994), Besag et al. (1995), Knorr-
Held and Rainer (2001), and Bray (2002). Finally, applications of
GMRF priors to longitudinal data in sports are described in Glickman
and Stern (1998), Knorr-Held (2000b), Rue and Salvesen (2000), and
Held and Vollnhals (2005).

Figure 1.3 The graph of w (1.7) where the graph of x is in Figure 1.1(a) and
(b), respectively. The nodes corresponding to u are displayed in gray.
Graphical models GMRFs are central in the area of graphical models.
One problem is to estimate Q and its (associated) graph from data,
see Dempster (1972), Giudici and Green (1999), Whittaker (1990),
and Dobra et al. (2003) for an application in statistical genetics.
More generally, GMRFs are used in a larger setting involving not
only undirected but also directed or chain graphs, and perhaps
nonnormal or discrete random variables. Some theoretical results
regarding GMRFs that we do not cover in this monograph can be
found in this area, see for example, Speed and Kiiveri (1986) and the
books by Whittaker (1990) and Lauritzen (1996). Exact propagation
algorithms for graphical models also include algorithms for GMRFs,
see Lauritzen and Jensen (2001). Wilkinson and Yeung (2002, 2004)
discuss propagation algorithms and the connection to the sparse
matrix approach taken in this monograph.
Semiparametric regression and splines A similar task appearing
in both semiparametric statistics and spline models is to describe a
smooth curve in time or a surface in space, see for example, Fahrmeir
and Lang (2001a,c), Heikkinen and Arjas (1998). A semiparametric
approach is often based on intrinsic GMRF models using either a
first- or second-order random walk model in time or on the line.
Spline models are formulated differently, but Wahba (1978) derived
the connection between the posterior expectation of a diffuse inte-
grated Wiener process and polynomial splines. Second-order random
walk models, as they are commonly defined, can be seen as an
approximation to a discretely observed integrated Wiener process.
However, this connection can be made rigorous as we will discuss
later using results of Wecker and Ansley (1983) and Jones (1981). A
more recent approach taken by Lang and Brezger (2004) is to use
IGMRF models for the coefficients of B-splines. The presentation of
statistical modeling approaches using generalized linear models by
Fahrmeir and Tutz (2001) illustrates the use of GMRFs and splines
for semi-parametric regression in various settings.
Image analysis Image analysis is perhaps the first main area of
application of spatial GMRFs, see for example, techniques for image
restoration using the Wiener filter (Hunt, 1973), texture modeling,
and texture discrimination (Chellappa and Chatterjee, 1985, Chel-
lappa et al., 1985, Cross and Jain, 1983, Descombes et al., 1999, Rellier
et al., 2002). Further applications of GMRFs in image analysis include
modeling stationary fields (Chellappa and Jain, 1993, Chellappa
and Kashyap, 1982, Dubes and Jain, 1989, Kashyap and Chellappa,
1983), modeling inhomogeneous fields (Aykroyd, 1998, Dreesman
and Tutz, 2001), segmentation (Dryden et al., 2003, Manjunath
and Chellappa, 1991), low-level vision (Marroquin et al., 2001),
blind restoration (Jeffs et al., 1998), deformable templates (Amit
et al., 1991, Grenander, 1993, Grenander and Miller, 1994, Hobolth
et al., 2002, Hurn et al., 2001, Kent et al., 2000, 1996, Ripley and
Sutherland, 1990, Rue and Husby, 1998), object identification (Rue
and Hurn, 1999), 3D reconstruction (Lindgren, 1997, Lindgren et al.,
1997), restoring ultrasound images (Husby et al., 2001, Husby and
Rue, 2004) and adjusted maximum likelihood and pseudolikelihood
estimation (Besag, 1975, 1977a,b, Dryden et al., 2002). GMRFs are
also used in edge-preserving restoration using auxiliary variables, see
Geman and Yang (1995). This field is simply too large to be treated
fairly, see also Hurn et al. (2003) for a statistically oriented (but still
incomplete) overview.
Spatial statistics The use of GMRFs in this field is extensive, see for
example, Banerjee et al. (2004), Cressie (1993), and the references
therein. Some more recent applications include the analysis of spatial
binary data (Pettitt et al., 2002, Weir and Pettitt, 1999, 2000),
non-stationary models (Dreesman and Tutz, 2001), geostatistical
applications using GMRF approximations for Gaussian fields (Allcroft
and Glasbey, 2003, Follestad and Rue, 2003, Hrafnkelsson and Cressie,
2003, Husby and Rue, 2004, Rue and Follestad, 2003, Rue et al., 2004,
Rue and Tjelmeland, 2002, Steinsland and Rue, 2003, Werner, 2004),
analysis of data in social science, see Fotheringham et al. (2002), Hain-
ing (1990) and the references therein, spatial econometrics, see Anselin
and Florax (1995) and the references therein, multivariate GMRFs
(Gamerman et al., 2003, Gelfand and Vounatsou, 2003, Mardia, 1988),
space-varying regression models (Assunção et al., 1998, Gamerman
et al., 2003), analysis of agricultural field experiments (Bartlett, 1978,
Besag et al., 1995, Besag and Higdon, 1999), applications in spatial
and space-time epidemiology (Besag et al., 1991, Cressie and Chan,
1989, Knorr-Held, 2000a, Knorr-Held and Besag, 1998, Knorr-Held
et al., 2002, Knorr-Held and Rue, 2002, Mollié, 1996, Natario and
Knorr-Held, 2003, Schmid and Held, 2004), in environmental statistics
(Huerta et al., 2004, Lindgren and Rue, 2004, Wikle et al., 1998),
to inverse problems (Higdon et al., 2003) and so on. The list seems
endless.


CHAPTER 2

Theory of Gaussian Markov random fields

In this chapter, we will present the basic properties of a GMRF. As a


GMRF is normal, all results valid for a normal distribution are also valid
for a GMRF. However, in order to apply GMRFs in Bayesian hierarchical
models, we need to sample from GMRFs and to compute certain
properties of GMRFs under various conditions. What makes GMRFs
extremely useful in practice, is that the things we often need to compute
are particularly fast to compute for a GMRF. The key is naturally the
sparseness of the precision matrix and the structure of its nonzero terms.
It will be useful to represent GMRFs on a graph representing the nonzero
pattern of the precision matrix. This representation serves two purposes.
First, it will provide a unified way of interpreting and understanding a
GMRF through conditional independence, either for a GMRF in time, on
a lattice, or on some more general structure. Secondly, this representation
will also provide a unified way to actually compute various properties for
a GMRF and to generate samples from it, by using numerical methods
for sparse matrices.

2.1 Preliminaries
2.1.1 Matrices and vectors
Vectors and matrices are typeset in bold, like x and A. The transpose
of A is denoted by AT . The notation A = (Aij ) means that the element
in the ith row and jth column of A is Aij . For a vector we use the same
notation, x = (xi ). We denote by xi:j the vector (xi , xi+1 , . . . , xj )T . For
an n × m matrix A with columns A1 , A2 , . . . , Am , vec(A) denotes the
vector obtained by stacking the columns one above the other, vec(A) =
(AT1 , AT2 , . . . , ATm )T . A submatrix of A is obtained by deleting some
rows and/or columns of A. A submatrix of an n × n matrix A is called a
principal submatrix if it can be obtained by deleting rows and columns of
the same index, so for example,
$$B = \begin{pmatrix} A_{11} & A_{13} \\ A_{31} & A_{33} \end{pmatrix}$$
is a principal submatrix of A. An r × r submatrix is called a leading
principal submatrix of A if it can be obtained by deleting the last n − r
rows and columns.
We use the notation diag(A) and diag(a), where A is an n × n matrix
and a a vector of length n, for the n × n diagonal matrices
$$\mathrm{diag}(A) = \begin{pmatrix} A_{11} & & \\ & \ddots & \\ & & A_{nn} \end{pmatrix}
\qquad\text{and}\qquad
\mathrm{diag}(a) = \begin{pmatrix} a_1 & & \\ & \ddots & \\ & & a_n \end{pmatrix},$$
respectively. We denote by I the identity matrix.


The matrix A is called upper triangular if Aij = 0 whenever i > j and
lower triangular if Aij = 0 whenever i < j. The bandwidth of a matrix
A is max{|i − j| : Aij ≠ 0}. The lower bandwidth is max{|i − j| : Aij ≠
0 and i > j}. The determinant of an n × n matrix A is denoted by |A|
and equals the product of the eigenvalues of A. The rank of A, denoted
by rank(A), is the number of linearly independent rows or columns of the
matrix. The trace of A is the sum of the diagonal elements,
$$\mathrm{trace}(A) = \sum_i A_{ii}.$$
For elementwise multiplication of two matrices of size n × m, we use the
symbol '⊙', i.e.,
$$A \odot B = \begin{pmatrix} A_{11}B_{11} & \cdots & A_{1m}B_{1m} \\ \vdots & \ddots & \vdots \\ A_{n1}B_{n1} & \cdots & A_{nm}B_{nm} \end{pmatrix}.$$
Similarly, '⊘' denotes elementwise division, and an analogous elementwise
notation is used for raising each element of a matrix A to a scalar power a,
i.e., element ij of the resulting matrix is A_{ij}^a.
ij of A  a is Aaij .

2.1.2 Lattice and torus

We denote by In a (regular) lattice (or grid) of size n = (n1 , n2 ) (for a


two-dimensional lattice). The location of pixel or site ij is denoted by
(i, j). Let x take values on In and denote by xij the value of x at site
ij, for i = 1, . . . , n1 and j = 1, . . . , n2 . We add where needed a ‘,’ in the
indices, like x11,1 , to avoid confusion. On an infinite lattice I∞ the sites
ij are numbered as i = 0, ±1, ±2, . . . , and j = 0, ±1, ±2, . . ..
A torus is a lattice with cyclic (or toroidal) boundary conditions and
is denoted by Tn. For notational convenience, the dimension is n = (n1, n2)
(for a two-dimensional torus) and all indices are taken modulo n and run
from 0 to n1 − 1 and n2 − 1, respectively. If a GMRF x is defined on
Tn, the toroidal boundary conditions imply that x−2,n2 equals xn1−2,0,
as −2 mod n1 equals n1 − 2 and n2 mod n2 equals 0. Figure 2.1(a)
illustrates the form of a torus.
With an irregular lattice, a slightly imprecise term, we mean a spatial
configuration of regions i = 1, . . . , n, where (most often) the regions
share common borders. A typical example is displayed in Figure 2.1(b),
showing (most of) the states of the USA. Each state represents a region,
and neighboring states share common borders.

Figure 2.1 (a) Illustration of a torus obtained on a two-dimensional lattice with
cyclic boundary conditions, (b) the states of the United States as an illustration
of an irregular lattice.

2.1.3 General notation and abbreviations


For C ⊂ I = {1, . . . , n}, define xC = {xi : i ∈ C}. With −C we denote
the set I − C, so that x−C = {xi : i ∈ −C}. For two sets A and B,
A \ B = {i : i ∈ A and i ∉ B}.
We will make no notational difference between a random variable and
a specific realization of a random variable. The notation π(·) is a generic
notation for the density of its arguments, like π(x) for the density of x
and π(xA | x−A ) for the conditional density of xA , given a realization of
x−A . By ‘∼’ we mean ‘distributed as’, so if x ∼ L then x is distributed
according to the law L. We denote generically the expected value by
E(·), the variance by Var(·), the covariance by Cov(·), the precision by
Prec(·) = Cov(·)−1 , and the correlation by Corr(·, ·).
We use the shortcut iff for ‘if and only if’, wrt for ‘with respect to’,
and lhs (rhs) for the left- or right-hand side of an equation.
One flop is defined as one floating-point operation. For example,
evaluating x + a*b requires two flops: one multiplication and one
addition.

2.1.4 Conditional independence


To compute the conditional density of xA , given x−A , we will repeatedly
use that
\[
\pi(x_A \mid x_{-A}) = \frac{\pi(x_A, x_{-A})}{\pi(x_{-A})} \propto \pi(x). \tag{2.1}
\]
This is true since the denominator does not depend on xA .
A key concept for understanding GMRFs is conditional independence.
Clearly, two random variables x and y are independent iff π(x, y) =

π(x)π(y). We write this as x ⊥ y. Two variables x and y are called
conditionally independent given z, iff π(x, y|z) = π(x|z)π(y|z). We write
this as
x ⊥ y | z.
Note that x and y might be (marginally) dependent, although they
are conditionally independent given z. Using the following factorization
criterion for conditional independence, it is easy to verify conditional
independence.
Theorem 2.1
x⊥y|z ⇐⇒ π(x, y, z) = f (x, z)g(y, z) (2.2)
for some functions f and g, and for all z with π(z) > 0.
Example 2.1 For π(x, y, z) ∝ exp(x+xz+yz), on some bounded region,
we see that x ⊥ y|z. However, this is not the case for π(x, y, z) ∝
exp(xyz).
The concept of conditional independence easily extends to the multi-
variate case, where x and y are called conditionally independent given
z, iff π(x, y|z) = π(x|z)π(y|z), which we write as x ⊥ y|z. The
factorization theorem still holds in this case.

2.1.5 Undirected graphs


We will use undirected graphs for representing the conditional indepen-
dence structure in a GMRF. An undirected graph G is a tuple G = (V, E),
where V is the set of nodes in the graph, and E is the set of edges {i, j},
where i, j ∈ V and i ≠ j. If {i, j} ∈ E, there is an undirected edge between
node i and node j; otherwise, there is no edge between them.
A graph is fully connected if {i, j} ∈ E for all i, j ∈ V with i ≠ j. In most
cases we will assume that V = {1, 2, . . . , n}, in which case the graph
is called labelled. A simple example of an undirected graph is shown in
Figure 2.2.

Figure 2.2 An example of an undirected labelled graph with n = 3 nodes, here
V = {1, 2, 3} and E = {{1, 2}, {2, 3}}. We also see that ne(1) = 2, ne(2) =
{1, 3}, ne({1, 2}) = 3, and 2 separates 1 and 3.

The neighbors of node i are all nodes in G having an edge to node i,
ne(i) = {j ∈ V : {i, j} ∈ E} .
We can extend this definition to a set A ⊂ V, where we define the
neighbors of A as
\[
\mathrm{ne}(A) = \Bigl(\bigcup_{i \in A} \mathrm{ne}(i)\Bigr) \setminus A.
\]
The neighbors of A are all nodes not in A, but adjacent to a node in A.
Figure 2.2 illustrates this definition.
A path from i1 to im is a sequence of distinct nodes in V, i1 , i2 , . . . , im ,
for which (ij , ij+1 ) ∈ E for j = 1, . . . , m − 1. A subset C ⊂ V separates
two nodes i ∉ C and j ∉ C, if every path from i to j contains at least one
node from C. Two disjoint sets A ⊂ V \ C and B ⊂ V \ C are separated
by C, if all i ∈ A and j ∈ B are separated by C, i.e., we cannot walk
on the graph starting somewhere in A ending somewhere in B without
passing through C.
We write i \overset{G}{\sim} j if nodes i and j are neighbors in the graph G, or just i ∼ j
when the graph is implicit. A direct consequence of the definition is that
i ∼ j ⇔ j ∼ i.
We need the notion of a subgraph. Let A be a subset of V. Then
G A denotes the graph restricted to A, i.e., the graph we obtain after
removing all nodes not belonging to A and all edges where at least one
node does not belong to A. Precisely, G A = {V A , E A }, where V A = A
and
E^A = \{\{i, j\} \in E : i, j \in A\}.
For example, if we let G be the graph in Figure 2.2 and A = {1, 2}, then
V A = {1, 2} and E A = {{1, 2}}.
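To make these graph operations concrete, the following minimal Python sketch (the helper names are ours, not from the text) computes ne(i), ne(A), and the subgraph restricted to A for the three-node graph of Figure 2.2.

import numpy as np  # not strictly needed here, but used in later sketches

V = {1, 2, 3}
E = {frozenset({1, 2}), frozenset({2, 3})}

def ne(i, E):
    """Neighbors of a single node i."""
    return {j for e in E if i in e for j in e if j != i}

def ne_set(A, E):
    """Neighbors of a set A: all nodes adjacent to A but not in A."""
    return set().union(*(ne(i, E) for i in A)) - set(A)

def subgraph(A, E):
    """Graph restricted to A: keep only edges with both endpoints in A."""
    return set(A), {e for e in E if e <= set(A)}

print(ne(2, E))            # {1, 3}
print(ne_set({1, 2}, E))   # {3}
print(subgraph({1, 2}, E)) # ({1, 2}, {frozenset({1, 2})})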

2.1.6 Symmetric positive-definite matrices


An n × n matrix A is positive definite iff
x^T A x > 0, ∀x ≠ 0.
If A is also symmetric, then it is called a symmetric positive-definite
(SPD) matrix. We only consider SPD matrices and sometimes use the
notation ‘A > 0’ for an SPD matrix A.
Some of the properties of a SPD matrix A are the following.
1. rank(A) = n.
2. |A| > 0.
3. Aii > 0.
4. A_{ii} A_{jj} − A_{ij}^2 > 0, for i ≠ j.

5. A_{ii} + A_{jj} − 2|A_{ij}| > 0, for i ≠ j.
6. \max_i A_{ii} > \max_{i \neq j} |A_{ij}|.
7. A−1 is SPD.
8. All principal submatrices of A are SPD.
If A and B are SPD, then so is A + B, but the converse is not true in
general. If A and B are SPD and AB = BA, then AB is SPD.
The following conditions are each necessary and sufficient for a symmet-
ric matrix A to be SPD:
1. All the eigenvalues λ1 , . . . , λn of A are strictly positive.
2. There exists a matrix C such that A = CC T . If C is lower triangular
it is called the Cholesky triangle of A.
3. All leading principal submatrices have strictly positive determinants.
A sufficient but not necessary condition for a (symmetric) matrix to be
SPD is the diagonal dominance criterion:
\[
A_{ii} - \sum_{j: j \neq i} |A_{ij}| > 0, \qquad \forall i.
\]
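As a quick numerical illustration (a sketch with made-up matrices, not from the text), one can test a symmetric matrix for positive definiteness by attempting a Cholesky factorization, and check the sufficient diagonal-dominance criterion directly:

import numpy as np

def is_spd(A):
    """Test positive definiteness of a symmetric matrix via Cholesky."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

def is_diagonally_dominant(A):
    """Sufficient (not necessary) criterion: A_ii - sum_{j != i} |A_ij| > 0."""
    off = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    return np.all(np.diag(A) - off > 0)

A = np.array([[2.0, -1.0], [-1.0, 2.0]])
print(is_spd(A), is_diagonally_dominant(A))   # True True
B = np.array([[1.0, 2.0], [2.0, 1.0]])        # symmetric but indefinite
print(is_spd(B), is_diagonally_dominant(B))   # False False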

An n × n matrix A is called positive semidefinite iff
\[
x^T A x \geq 0, \qquad \forall x \neq 0.
\]
If A is also symmetric, then it is called a symmetric positive semidefinite
(SPSD) matrix. An SPSD matrix A is sometimes denoted by ‘A ≥ 0’.

2.1.7 The normal distribution


We now recall the multivariate normal distribution and give some of its
basic properties. This makes the difference to a GMRF more clear. Other
distributions are defined in Appendix A.
The density of a normal random variable x = (x1 , . . . , xn )T , n < ∞,
with mean µ (n×1 vector) and SPD covariance matrix Σ (n×n matrix),
is
\[
\pi(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\Bigl(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Bigr), \qquad x \in \mathbb{R}^n. \tag{2.3}
\]
Here, µi = E(xi ), Σij = Cov(xi , xj ), Σii = Var(xi ) > 0 and
Corr(xi , xj ) = Σij /(Σii Σjj )1/2 . We write this as x ∼ N (µ, Σ). A
standard normal distribution is obtained if n = 1, µ = 0 and Σ11 = 1.
We now divide x into two parts, x = (xTA , xTB )T , and split µ and Σ
accordingly:
\[
\mu = \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}
\quad\text{and}\quad
\Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix}.
\]

Here are some basic properties of the normal distribution.
1. xA ∼ N (µA , ΣAA )
2. ΣAB = 0 iff xA and xB are independent.
3. The conditional distribution π(xA |xB ) is N (µA|B , ΣA|B ), where
\[
\mu_{A|B} = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B)
\quad\text{and}\quad
\Sigma_{A|B} = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}.
\]

4. If x ∼ N (µ, Σ) and x′ ∼ N (µ′ , Σ′ ) are independent, then x + x′ ∼
N (µ + µ′ , Σ + Σ′ ).
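A direct numerical check of property 3 (a sketch with made-up numbers, not an example from the text) conditions a bivariate normal on its second component:

import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
xB = 3.0  # observed value of the second component

# Conditional mean and variance of x_A | x_B from property 3
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (xB - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(mu_cond, var_cond)   # 1.8  1.36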

2.2 Definition and basic properties of GMRFs


2.2.1 Definition
Let x = (x1 , . . . , xn )T have a normal distribution with mean µ and
covariance matrix Σ. Define the labelled graph G = (V, E), where V =
{1, . . . , n} and E is such that there is no edge between nodes i and j iff
xi ⊥ xj |x−ij , where x−ij is short for x−{i,j} . Then we say that x is a
GMRF wrt G.
Before we define a GMRF formally, let us investigate the connection
between the graph G and the parameters of the normal distribution.
Since the mean µ does not have any influence on the pairwise conditional
independence properties of x, we can deduce that this information must
be ‘hidden’ solely in the covariance matrix Σ. It turns out that the
inverse covariance matrix, the precision matrix Q = Σ−1 plays the key
role.
Theorem 2.2 Let x be normally distributed with mean µ and precision
matrix Q > 0. Then for i ≠ j,
xi ⊥ xj | x−ij ⇐⇒ Qij = 0.
This is a nice and useful result. It simply says that the nonzero pattern
of Q determines G, so we can read off from Q whether xi and xj are
conditionally independent. We will return to this in a moment. On the
other hand, for a given graph G, we know the nonzero terms in Q. This
can be used to provide a parameterization of Q, being aware that we
also require Q > 0.
Before providing the proof of Theorem 2.2, we state the formal
definition of a GMRF.
Definition 2.1 (GMRF) A random vector x = (x1 , . . . , xn )T ∈ Rn
is called a GMRF wrt a labelled graph G = (V, E) with mean µ and

precision matrix Q > 0, iff its density has the form
\[
\pi(x) = (2\pi)^{-n/2} |Q|^{1/2} \exp\Bigl(-\frac{1}{2}(x - \mu)^T Q (x - \mu)\Bigr) \tag{2.4}
\]
and
Qij ≠ 0 ⇐⇒ {i, j} ∈ E for all i ≠ j.
If Q is a completely dense matrix then G is fully connected. This
implies that any normal distribution with SPD covariance matrix is also
a GMRF and vice versa. We will focus on the case when Q is sparse, as
it is here that the nice properties of GMRFs are really useful.
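As a concrete illustration (a sketch; the matrix below is made up, and the code is not from the text), the edge set of G can be read directly off the nonzero pattern of Q:

import numpy as np

# A small SPD precision matrix; zeros in the off-diagonal encode missing edges.
Q = np.array([[ 3.0, -1.0,  0.0, -1.0],
              [-1.0,  3.0, -1.0,  0.0],
              [ 0.0, -1.0,  3.0, -1.0],
              [-1.0,  0.0, -1.0,  3.0]])

n = Q.shape[0]
edges = {(i + 1, j + 1) for i in range(n) for j in range(i + 1, n)
         if Q[i, j] != 0.0}
print(edges)   # {(1, 2), (1, 4), (2, 3), (3, 4)} -- a cycle on four nodes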
Proof. [Theorem 2.2] We partition x as (xi , xj , x−ij ) and then use
the multivariate version of the factorization criterion (Theorem 2.1) on
π(xi , xj , x−ij ). Fix i ≠ j and assume µ = 0 without loss of generality.
From (2.4) we get
\[
\pi(x_i, x_j, x_{-ij}) \propto \exp\Bigl(-\frac{1}{2}\sum_{k,l} x_k Q_{kl} x_l\Bigr)
\propto \exp\Bigl(\underbrace{-\frac{1}{2} x_i x_j (Q_{ij} + Q_{ji})}_{\text{term 1}} \;\underbrace{-\;\frac{1}{2}\sum_{\{k,l\} \neq \{i,j\}} x_k Q_{kl} x_l}_{\text{term 2}}\Bigr).
\]
Term 2 does not involve xi xj , while term 1 involves xi xj iff Qij ≠ 0.


Comparing with (2.2) in Theorem 2.1, we see that
π(xi , xj , x−ij ) = f (xi , x−ij )g(xj , x−ij )
for some functions f and g, iff Qij = 0. The claim then follows.
We have argued that the natural way to describe a GMRF is
by its precision matrix Q. The elements of Q have nice conditional
interpretations.
Theorem 2.3 Let x be a GMRF wrt G = (V, E) with mean µ and
precision matrix Q > 0, then
\[
\mathrm{E}(x_i \mid x_{-i}) = \mu_i - \frac{1}{Q_{ii}} \sum_{j: j \sim i} Q_{ij}(x_j - \mu_j), \tag{2.5}
\]
\[
\mathrm{Prec}(x_i \mid x_{-i}) = Q_{ii}, \tag{2.6}
\]
and
\[
\mathrm{Corr}(x_i, x_j \mid x_{-ij}) = -\frac{Q_{ij}}{\sqrt{Q_{ii} Q_{jj}}}, \qquad i \neq j. \tag{2.7}
\]
The diagonal elements of Q are the conditional precisions of xi given
x−i , while the off-diagonal elements, with a proper scaling, provide
information about the conditional correlation between xi and xj , given
x−ij . These results should be compared to the interpretation of the

elements of the covariance matrix Σ = (Σij ): as Var(xi ) = Σii and
Corr(xi , xj ) = \Sigma_{ij}/\sqrt{\Sigma_{ii}\Sigma_{jj}}, the covariance matrix gives information
about the marginal variance of xi and the marginal correlation between
xi and xj . The marginal interpretation given by Σ is intuitive and di-
rectly informative, as it reduces the interpretation from an n-dimensional
distribution to a one- or two-dimensional distribution. The elements of Q,
by contrast, are hard (or nearly impossible) to interpret marginally,
as we have to integrate out x−i or x−ij from the joint distribution
parameterized in terms of Q. In matrix terms this is immediate; by
definition Q−1 = Σ, and Σii depends generally on all elements in Q,
and vice versa.
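The conditional reading of Q in (2.6)–(2.7) is easy to contrast numerically with the marginal reading of Σ; a sketch (made-up Q, not from the text):

import numpy as np

Q = np.array([[ 3.0, -1.0,  0.0, -1.0],
              [-1.0,  3.0, -1.0,  0.0],
              [ 0.0, -1.0,  3.0, -1.0],
              [-1.0,  0.0, -1.0,  3.0]])
Sigma = np.linalg.inv(Q)

i, j = 0, 1
# Conditional quantities read directly off Q, equations (2.6) and (2.7)
prec_cond = Q[i, i]
corr_cond = -Q[i, j] / np.sqrt(Q[i, i] * Q[j, j])
# Marginal quantities read off Sigma; they involve all elements of Q
var_marg = Sigma[i, i]
corr_marg = Sigma[i, j] / np.sqrt(Sigma[i, i] * Sigma[j, j])
print(prec_cond, corr_cond)
print(var_marg, corr_marg)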

Proof. [Theorem 2.3] First recall that a univariate normal random


variable xi with mean γ and precision κ has density proportional to
\[
\exp\Bigl(-\frac{1}{2}\kappa x_i^2 + \kappa x_i \gamma\Bigr). \tag{2.8}
\]
Assume for the moment that µ = 0 and apply (2.1) to (2.4):
\[
\pi(x_i \mid x_{-i}) \propto \exp\Bigl(-\frac{1}{2} x^T Q x\Bigr)
\propto \exp\Bigl(-\frac{1}{2} x_i^2 Q_{ii} - x_i \sum_{j: j \sim i} Q_{ij} x_j\Bigr). \tag{2.9}
\]

Comparing (2.8) and (2.9) we see that π(xi |x−i ) is normal. Comparing
the coefficients for the quadratic term, we obtain (2.6). Comparing the
coefficients for the linear term, we obtain
\[
\mathrm{E}(x_i \mid x_{-i}) = -\frac{1}{Q_{ii}} \sum_{j: j \sim i} Q_{ij} x_j.
\]

If x has mean µ, then x−µ has mean zero, hence replacing xi and xj by
xi − µi and xj − µj , respectively, gives (2.5). To show (2.7), we proceed
similarly and consider
\[
\pi(x_i, x_j \mid x_{-ij}) \propto \exp\Bigl(-\frac{1}{2}(x_i, x_j)\begin{pmatrix} Q_{ii} & Q_{ij} \\ Q_{ji} & Q_{jj} \end{pmatrix}\begin{pmatrix} x_i \\ x_j \end{pmatrix} + \text{linear terms}\Bigr). \tag{2.10}
\]
We compare this density with the density of the bivariate normal random
variable (xi , xj )T with covariance matrix Σ = (Σij ), which has density
proportional to
\[
\exp\Bigl(-\frac{1}{2}(x_i, x_j)\begin{pmatrix} \Sigma_{ii} & \Sigma_{ij} \\ \Sigma_{ji} & \Sigma_{jj} \end{pmatrix}^{-1}\begin{pmatrix} x_i \\ x_j \end{pmatrix} + \text{linear terms}\Bigr). \tag{2.11}
\]

Comparing (2.10) with (2.11), we obtain
\[
\begin{pmatrix} Q_{ii} & Q_{ij} \\ Q_{ji} & Q_{jj} \end{pmatrix} = \begin{pmatrix} \Sigma_{ii} & \Sigma_{ij} \\ \Sigma_{ji} & \Sigma_{jj} \end{pmatrix}^{-1},
\]
which implies that \Sigma_{ii} = Q_{jj}/\Delta, \Sigma_{jj} = Q_{ii}/\Delta, and \Sigma_{ij} = -Q_{ij}/\Delta,
where \Delta = Q_{ii}Q_{jj} - Q_{ij}^2. Using these expressions and the definition of
conditional correlation we obtain
\[
\mathrm{Corr}(x_i, x_j \mid x_{-ij}) = -\frac{Q_{ij}/\Delta}{\sqrt{(Q_{jj}/\Delta)(Q_{ii}/\Delta)}} = -\frac{Q_{ij}}{\sqrt{Q_{ii} Q_{jj}}}.
\]

2.2.2 Markov properties of GMRFs


We have defined the graph G from checking if xi ⊥ xj |x−ij or not.
Theorem 2.2 says this is the same as checking if the corresponding off-
diagonal entry of the precision matrix, Qij , is zero or not. Hence G is
constructed from the nonzero pattern of Q. An interesting and useful
property of a GMRF is that more information regarding conditional
independence can be extracted from G. We consider now the local Markov
property and the global Markov property, additional to the pairwise
Markov property used to define G. It turns out that all these properties
are equivalent for a GMRF.
Theorem 2.4 Let x be a GMRF wrt G = (V, E). Then the following
are equivalent.
The pairwise Markov property:
xi ⊥ xj | x−ij if {i, j} ∉ E and i ≠ j.
The local Markov property:
xi ⊥ x−{i,ne(i)} | xne(i) for every i ∈ V.
The global Markov property:
x A ⊥ x B | xC (2.12)
for all disjoint sets A, B and C where C separates A and B, and A and
B are non-empty.
Figure 2.3 illustrates Theorem 2.4. The proof is a consequence of
a more general result, stating the equivalence of the various Markov
properties under some conditions satisfied for GMRFs. A simpler proof
can be constructed in the Gaussian case (Speed and Kiiveri, 1986), but
we omit it here.

The global Markov property immediately implies the local and
pairwise Markov property, but the converse is a bit surprising. Note
that the union of A, B, and C does not need to be V, so properties of
the marginal distribution can also be derived from G.
If C in (2.12) is empty, then xA and xB are independent.


Figure 2.3 Illustration of the various Markov properties. (a) The pairwise
Markov property; the two black nodes are conditionally independent given
the gray nodes. (b) The local Markov property; the black and white nodes
are conditionally independent given the gray nodes. (c) The global Markov
property; the black and striped nodes are conditionally independent given the
gray nodes.

2.2.3 Conditional properties of GMRFs
We will now discuss an important result of GMRFs; the conditional
distribution for a subset xA of x given the rest x−A . In this context
the canonical parameterization will be useful, a parameterization that is
easily updated under successive conditioning. Although all computations
can be expressed with matrices, we will also consider a more graph-
oriented view in Appendix B, which allows for efficient computation of
the conditional densities.

Conditional distribution
We split the indices into the nonempty sets A and denote by B the set
−A, so that
\[
x = \begin{pmatrix} x_A \\ x_B \end{pmatrix}. \tag{2.13}
\]
Partition the mean and precision accordingly,
\[
\mu = \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}
\quad\text{and}\quad
Q = \begin{pmatrix} Q_{AA} & Q_{AB} \\ Q_{BA} & Q_{BB} \end{pmatrix}. \tag{2.14}
\]
Our next result is a generalization of Theorem 2.3.
Theorem 2.5 Let x be a GMRF wrt G = (V, E) with mean µ and
precision matrix Q > 0. Let A ⊂ V and B = V \ A where A, B ≠ ∅. The
conditional distribution of xA |xB is then a GMRF wrt the subgraph G A
with mean µA|B and precision matrix QA|B > 0, where
\[
\mu_{A|B} = \mu_A - Q_{AA}^{-1} Q_{AB} (x_B - \mu_B) \tag{2.15}
\]
and
\[
Q_{A|B} = Q_{AA}.
\]
This is a powerful result for two reasons. First, we have explicit knowl-
edge of QA|B through the principal submatrix QAA , so no computation
is needed to obtain the conditional precision matrix. Constructing
the subgraph G A does not change the structure; it just removes all
nodes not in A and the corresponding edges. This is important for the
computational issues that will be discussed in Section 2.3. Secondly, since
Qij is zero unless j ∈ ne(i), the conditional mean only depends on values
of µ and Q in A ∪ ne(A). This is a great advantage if A is a small subset
of V and in striking contrast to the corresponding general result for the
normal distribution, see Section 2.1.7.
Example 2.2 To illustrate Theorem 2.5, we compute the mean and
precision of xi given x−i , which are found using A = {i} as (2.5)
and (2.6). This result is frequently used for single-site Gibbs sampling in
GMRF models, to which we return in Section 4.1.
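A small numerical sketch of Theorem 2.5 (made-up Q, µ and observed values; A and B are index sets in 0-based numpy indexing):

import numpy as np

Q = np.array([[ 3.0, -1.0,  0.0, -1.0],
              [-1.0,  3.0, -1.0,  0.0],
              [ 0.0, -1.0,  3.0, -1.0],
              [-1.0,  0.0, -1.0,  3.0]])
mu = np.zeros(4)
A = [0, 1]          # indices of x_A
B = [2, 3]          # indices of x_B
xB = np.array([1.0, -0.5])

QAA = Q[np.ix_(A, A)]
QAB = Q[np.ix_(A, B)]
# Conditional mean (2.15): solve QAA (mu_{A|B} - mu_A) = -QAB (xB - mu_B)
mu_cond = mu[A] - np.linalg.solve(QAA, QAB @ (xB - mu[B]))
# The conditional precision is simply the principal submatrix QAA
print(mu_cond)
print(QAA)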

Proof. [Theorem 2.5] The proof is similar to Theorem 2.3, but uses
matrices. Assume µ = 0 and compute the conditional density,
\[
\pi(x_A \mid x_B) \propto \exp\Bigl(-\frac{1}{2}(x_A^T, x_B^T)\begin{pmatrix} Q_{AA} & Q_{AB} \\ Q_{BA} & Q_{BB} \end{pmatrix}\begin{pmatrix} x_A \\ x_B \end{pmatrix}\Bigr)
\propto \exp\Bigl(-\frac{1}{2} x_A^T Q_{AA} x_A - (Q_{AB} x_B)^T x_A\Bigr).
\]
Comparing this with the density of a normal with precision P and
mean γ,
\[
\pi(z) \propto \exp\Bigl(-\frac{1}{2} z^T P z + (P\gamma)^T z\Bigr),
\]
we see that QAA is the conditional precision matrix and the conditional
mean is given by the solution of
QAA µA|B = −QAB xB .
Note that QAA > 0 since Q > 0. If x has mean µ then x − µ has mean
zero, hence (2.15) follows. The subgraph G A follows from the nonzero
elements of QAA .
To compute the conditional mean µA|B , we need to solve the linear
system
QAA (µA|B − µA ) = −QAB (xB − µB )
but not necessarily invert QAA . We postpone the discussion of this
numerical issue until Section 2.3.

The canonical parameterization


The canonical parameterization for a GMRF will be useful for successive
conditioning.
Definition 2.2 (Canonical parameterization) A GMRF x wrt G
with canonical parameters b and Q > 0 has density
\[
\pi(x) \propto \exp\Bigl(-\frac{1}{2} x^T Q x + b^T x\Bigr),
\]
i.e., the precision matrix is Q and the mean is µ = Q−1 b. We write the
canonical parameterization as
x ∼ NC (b, Q).
The relation to the normal distribution is that N (µ, Q−1 ) = NC (Qµ, Q).
Partition the indices into two nonempty sets A and B, and partition
x, b and Q accordingly as in (2.13) and (2.14). Two lemmas follow easily.
Lemma 2.1 Let x ∼ NC (b, Q), then
xA | xB ∼ NC (bA − QAB xB , QAA ).

Lemma 2.2 Let x ∼ NC (b, Q) and y|x ∼ N (x, P −1 ), then
x | y ∼ NC (b + P y, Q + P ). (2.16)
These results are useful for computing conditional densities with several
sources of conditioning, for example, conditioning on observed data
and a subset of variables. We can successively update the canonical
parameterization, without explicitly computing the mean, until we
actually need it. Computing the mean requires the solution of Qµ = b,
but only matrix-vector products are required to update the canonical
parameterization.
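A sketch of how the canonical parameters can be updated under conditioning on noisy observations (Lemma 2.2), deferring the solve for the mean until the end; all numbers are made up and the code is not from the text:

import numpy as np

# Prior GMRF in canonical form, x ~ N_C(b, Q)
Q = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
b = np.zeros(3)

# Observe y | x ~ N(x, P^{-1}) with known precision P and data y
P = np.diag([4.0, 4.0, 4.0])
y = np.array([0.5, -0.2, 1.0])

# Lemma 2.2: x | y ~ N_C(b + P y, Q + P); only matrix-vector work so far
b_post = b + P @ y
Q_post = Q + P

# The posterior mean is computed only when needed, by solving Q_post mu = b_post
mu_post = np.linalg.solve(Q_post, b_post)
print(mu_post)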

2.2.4 Specification through full conditionals


An alternative to specifying a GMRF by its mean and precision matrix,
is to specify it implicitly through the full conditionals {π(xi |x−i )}. This
approach was pioneered by Besag (1974, 1975) and the models are also
known by the name conditional autoregressions, abbreviated as CAR
models. We will now discuss this possibility and the specific conditions
we must impose on the full conditionals to correspond to a valid GMRF.
Suppose we specify the full conditionals as normals with
\[
\mathrm{E}(x_i \mid x_{-i}) = \mu_i - \sum_{j: j \sim i} \beta_{ij}(x_j - \mu_j) \tag{2.17}
\]
and
\[
\mathrm{Prec}(x_i \mid x_{-i}) = \kappa_i > 0 \tag{2.18}
\]
for i = 1, . . . , n, for some {βij , i ≠ j}, and vectors µ and κ. Clearly, ∼ is
defined implicitly by the nonzero terms of {βij }. These full conditionals
must be consistent so that there exists a joint density π(x) that will
give rise to these full conditional distributions. Since ∼ is symmetric,
this immediately gives the requirement that if βij ≠ 0 then βji ≠ 0.
Comparing term by term with (2.5) and (2.6), we see that if we choose
the entries of the precision matrix Q as
Qii = κi , and Qij = κi βij
and also require that Q is symmetric, i.e.,
κi βij = κj βji ,
then we have a candidate for a joint density giving the specified full
conditionals provided Q > 0. The next result says that this candidate is
unique.
Theorem 2.6 Given the n normal full conditionals with conditional
mean and precision as in (2.17) and (2.18), then x is a GMRF wrt a
labelled graph G = (V, E) with mean µ and precision matrix Q = (Qij ),

where
\[
Q_{ij} = \begin{cases} \kappa_i \beta_{ij} & i \neq j \\ \kappa_i & i = j \end{cases}
\]
provided κi βij = κj βji for i ≠ j, and Q > 0.
To prove this result we need Brook’s lemma.
Lemma 2.3 (Brook’s lemma) Let π(x) be the density for x ∈ Rn
and define Ω = {x ∈ Rn : π(x) > 0}. Let x, x′ ∈ Ω, then
\[
\frac{\pi(x)}{\pi(x')} = \prod_{i=1}^{n} \frac{\pi(x_i \mid x_1, \ldots, x_{i-1}, x'_{i+1}, \ldots, x'_n)}{\pi(x'_i \mid x_1, \ldots, x_{i-1}, x'_{i+1}, \ldots, x'_n)} \tag{2.19}
\]
\[
\phantom{\frac{\pi(x)}{\pi(x')}} = \prod_{i=1}^{n} \frac{\pi(x_i \mid x'_1, \ldots, x'_{i-1}, x_{i+1}, \ldots, x_n)}{\pi(x'_i \mid x'_1, \ldots, x'_{i-1}, x_{i+1}, \ldots, x_n)}. \tag{2.20}
\]
If we fix x′ then (2.19) (and (2.20)) represents π(x), up to a constant
of proportionality, using the set of full conditionals {π(xi |x−i )}. The
constant of proportionality is found using that π(x) integrates to unity.
Proof. [Brook’s lemma] Start with the identity
\[
\frac{\pi(x_n \mid x_1, \ldots, x_{n-1})}{\pi(x'_n \mid x_1, \ldots, x_{n-1})}
= \frac{\pi(x_1, \ldots, x_{n-1}, x_n)\,/\,\pi(x_1, \ldots, x_{n-1})}{\pi(x_1, \ldots, x_{n-1}, x'_n)\,/\,\pi(x_1, \ldots, x_{n-1})},
\]
from which it follows that
\[
\pi(x_1, \ldots, x_n) = \frac{\pi(x_n \mid x_1, \ldots, x_{n-1})}{\pi(x'_n \mid x_1, \ldots, x_{n-1})}\, \pi(x_1, \ldots, x_{n-1}, x'_n).
\]
Express the last term on the rhs similarly to obtain
\[
\pi(x_1, \ldots, x_n) = \frac{\pi(x_n \mid x_1, \ldots, x_{n-1})}{\pi(x'_n \mid x_1, \ldots, x_{n-1})}
\times \frac{\pi(x_{n-1} \mid x_1, \ldots, x_{n-2}, x'_n)}{\pi(x'_{n-1} \mid x_1, \ldots, x_{n-2}, x'_n)}
\times \pi(x_1, \ldots, x_{n-2}, x'_{n-1}, x'_n).
\]
By repeating this process (2.19) follows. The alternative (2.20) is proved
similarly starting with
\[
\pi(x_1, \ldots, x_n) = \frac{\pi(x_1 \mid x_2, \ldots, x_n)}{\pi(x'_1 \mid x_2, \ldots, x_n)}\, \pi(x'_1, x_2, \ldots, x_n)
\]
and proceeding forward.
Proof. [Theorem 2.6] Assume µ = 0 and fix x′ = 0. Then (2.19)
simplifies to
\[
\log \frac{\pi(x)}{\pi(0)} = -\frac{1}{2}\sum_{i=1}^{n} \kappa_i x_i^2 - \sum_{i=2}^{n}\sum_{j=1}^{i-1} \kappa_i \beta_{ij} x_i x_j. \tag{2.21}
\]

Using (2.20) we obtain
\[
\log \frac{\pi(x)}{\pi(0)} = -\frac{1}{2}\sum_{i=1}^{n} \kappa_i x_i^2 - \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \kappa_i \beta_{ij} x_i x_j. \tag{2.22}
\]

Since (2.21) and (2.22) must be identical it follows that κi βij = κj βji
for i ≠ j. The density of x can then be expressed as
\[
\log \pi(x) = \text{const} - \frac{1}{2}\sum_{i=1}^{n} \kappa_i x_i^2 - \frac{1}{2}\sum_{i \neq j} \kappa_i \beta_{ij} x_i x_j;
\]

hence x is zero-mean multivariate normal provided Q > 0. The precision
matrix has elements Qij = κi βij for i ≠ j and Qii = κi .
In matrix terms (defining βii = 0), the precision matrix is
\[
Q = \mathrm{diag}(\kappa)\bigl(I + [\beta_{ij}]\bigr);
\]
hence Q > 0 ⇐⇒ I + [\beta_{ij}] > 0.
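A sketch of this construction for a CAR specification on a chain of four nodes (the precisions and weights are made up), including the symmetry check κi βij = κj βji and the positive-definiteness check:

import numpy as np

n = 4
kappa = np.array([2.0, 3.0, 3.0, 2.0])       # conditional precisions
beta = np.zeros((n, n))                      # conditional regression weights
for i, j in [(0, 1), (1, 2), (2, 3)]:        # neighbors on a chain
    beta[i, j] = -1.0 / kappa[i]
    beta[j, i] = -1.0 / kappa[j]

# Consistency requirement: kappa_i * beta_ij == kappa_j * beta_ji
assert np.allclose(kappa[:, None] * beta, (kappa[:, None] * beta).T)

Q = np.diag(kappa) @ (np.eye(n) + beta)
assert np.all(np.linalg.eigvalsh(Q) > 0)     # Q > 0, so the GMRF is proper
print(Q)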

2.2.5 Multivariate GMRFs⋆


A multivariate GMRF (MGMRF) is a multivariate extension of a GMRF
that has been shown to be useful in applications. To motivate its
construction, let x be a GMRF wrt G. The Markov property implies
that
π(xi | x−i ) = π(xi | {xj : j ∼ i}).
We associate xi with node i. The nodes often have a
physical interpretation, like a pixel in a lattice or an administrative region
of a country, and this may also be used to define the neighbors to node i.
For an illustration, see Figure 2.1(b). The extension is now to associate
a vector with dimension p, xi , with each of the n nodes, leading to a
GMRF of size np. We denote such an MGMRF by x = (xT1 , . . . , xTn )T .
The Markov property in terms of the nodes is then preserved, meaning
that
π(xi | x−i ) = π(xi | {xj : j ∼ i}).
where ∼ is wrt the same graph G. Let µ = (µT1 , . . . , µTn )T be the mean
of x where E(xi ) = µi , and \tilde{Q} = (\tilde{Q}_{ij}) its precision matrix. Note that
each element \tilde{Q}_{ij} is a p × p matrix.
It follows directly from Theorem 2.2 that
\[
x_i \perp x_j \mid x_{-ij} \iff \tilde{Q}_{ij} = 0.
\]
The definition of an MGMRF with dimension p is an extension of the
definition of a GMRF (Definition 2.1).

Definition 2.3 (MGMRFp ) A random vector x = (xT1 , . . . , xTn )T
where dim(xi ) = p, is called a MGMRFp wrt G = (V = {1, . . . , n}, E)
with mean µ and precision matrix \tilde{Q} > 0, iff its density has the form
\[
\pi(x) = \Bigl(\frac{1}{2\pi}\Bigr)^{np/2} |\tilde{Q}|^{1/2} \exp\Bigl(-\frac{1}{2}(x - \mu)^T \tilde{Q} (x - \mu)\Bigr)
= \Bigl(\frac{1}{2\pi}\Bigr)^{np/2} |\tilde{Q}|^{1/2} \exp\Bigl(-\frac{1}{2}\sum_{ij} (x_i - \mu_i)^T \tilde{Q}_{ij} (x_j - \mu_j)\Bigr)
\]
and
\[
\tilde{Q}_{ij} \neq 0 \iff \{i, j\} \in E \quad \text{for all } i \neq j.
\]
An MGMRFp is also a GMRF with dimension np with identical mean
vector and precision matrix. All results valid for a GMRF are then
also valid for an MGMRFp , with obvious changes, as the graph for an
MGMRFp is of size n and defined wrt the vector-valued nodes {xi }, while for
a GMRF it is of size np and defined wrt the scalar components.
Interpretations of \tilde{Q}_{ii} and \tilde{Q}_{ij} can be derived from the full conditional
π(xi |x−i ). The extensions of (2.5) and (2.6) are
\[
\mathrm{E}(x_i \mid x_{-i}) = \mu_i - \tilde{Q}_{ii}^{-1} \sum_{j: j \sim i} \tilde{Q}_{ij}(x_j - \mu_j)
\quad\text{and}\quad
\mathrm{Prec}(x_i \mid x_{-i}) = \tilde{Q}_{ii}.
\]

In some applications, the full conditionals
\[
\mathrm{E}(x_i \mid x_{-i}) = \mu_i - \sum_{j: j \sim i} \beta_{ij}(x_j - \mu_j)
\quad\text{and}\quad
\mathrm{Prec}(x_i \mid x_{-i}) = \kappa_i > 0
\]
are used to define the MGMRFp , for some p × p matrices {β ij , i ≠ j},
{κi }, and vectors µi . Again, ∼ is defined implicitly by the nonzero
matrices {β ij }. The requirements for the joint density to exist are similar
to those for p = 1 (see Theorem 2.6): κi β ij = β Tji κj for i ≠ j and \tilde{Q} > 0.
The p × p elements of \tilde{Q} are
\[
\tilde{Q}_{ij} = \begin{cases} \kappa_i \beta_{ij} & i \neq j \\ \kappa_i & i = j \end{cases};
\]
hence \tilde{Q} > 0 ⇐⇒ I + [\beta_{ij}] > 0.

2.3 Simulation from a GMRF


This chapter will be more computationally oriented, presenting algo-
rithms for

• Simulation of a GMRF
• Evaluation of the log density
• Calculating conditional densities
• Simulation conditional on a subset of a GMRF, a hard constraint,
or a soft constraint, and the corresponding evaluation of the log
conditional densities.
We will formulate all these tasks as simple matrix operations on the
precision matrix Q, which we know is sparse, hence easier to store
and faster to compute. One example is the Cholesky factorization Q =
LLT , where L is a lower triangular matrix referred to as the Cholesky
triangle. It turns out that L can be sparse as well and thus inherits the
(somewhat modified) nonzero pattern from Q. In general, computing
this factorization requires O(n3 ) flops, while a sparse Q will typically
reduce this to O(n) for temporal, O(n3/2 ) for spatial and O(n2 ) for
spatiotemporal GMRFs. Similarly, solving for example Lx = b, will also
be faster as L is sparse.
We postpone the discussion of numerical methods for sparse matrices
to Section 2.4, and will now discuss how simulation and evaluation of
the log density can be done based on Q.

2.3.1 Some basic numerical linear algebra


We start with some basic facts on numerical linear algebra.
Let A be an n × n SPD matrix, then there exists a unique Cholesky
triangle L such that L is a lower triangular matrix where Lii > 0 ∀i and
A = LLT . Computing L costs n3 /3 flops. This factorization is the basis
for solving systems like Ax = b or AX = B for k right-hand sides,
or equivalently, computing x = A−1 b or X = A−1 B. For example, we
solve Ax = b using Algorithm 2.1. Clearly, x is the solution of Ax = b

Algorithm 2.1 Solving Ax = b where A > 0


1: Compute the Cholesky factorization, A = LLT
2: Solve Lv = b
T
3: Solve L x = v
4: Return x

because x = (L−1 )T v = L−T (L−1 b) = (LLT )−1 b = A−1 b as required.


Step 2 is called forward substitution, as the solution v is computed in
a forward loop (recall that L is lower triangular),
\[
v_i = \frac{1}{L_{ii}} \Bigl(b_i - \sum_{j=1}^{i-1} L_{ij} v_j\Bigr), \qquad i = 1, \ldots, n.
\]

The cost is in general n² flops. Step 3 is called back substitution, as
the solution x is computed in a backward loop (recall that LT is upper
triangular),
\[
x_i = \frac{1}{L_{ii}} \Bigl(v_i - \sum_{j=i+1}^{n} L_{ji} x_j\Bigr), \qquad i = n, \ldots, 1. \tag{2.23}
\]

If we need to compute A−1 B, where B is a n × k matrix, we do this


by computing the solution X of AX = B for each column of X. More
specifically, we solve AX j = B j , where X j is the jth column of X and
B j is the jth column of B, see Algorithm 2.2.

Algorithm 2.2 Solving AX = B where A > 0


1: Compute the Cholesky factorization, A = LLT
2: for j = 1 to k do
3: Solve Lv = B j
4: Solve LT X j = v
5: end for
6: Return X

Note that choosing k = n and B = I, we obtain X = A−1 . Hence,


solving Ax = b in comparison to computing x = A−1 b, gives a speedup
of 4. There is no need to compute explicitly the inverse A−1 .
If A is a general invertible n × n matrix, but no longer SPD, then
similar algorithms apply with only minor modifications: We compute the
LU decomposition, as A = LU , where L is lower triangular and U is
upper triangular, and replace LT by U in Algorithm 2.1 and Algorithm
2.2.
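In practice these steps map directly onto standard library routines; a minimal sketch using scipy (matrix and right-hand side made up, not an example from the text):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])

L = cholesky(A, lower=True)                # A = L L^T
v = solve_triangular(L, b, lower=True)     # forward substitution, L v = b
x = solve_triangular(L.T, v, lower=False)  # back substitution, L^T x = v
assert np.allclose(A @ x, b)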

2.3.2 Unconditional simulation of a GMRF


In this section we discuss simulation from GMRFs for the different
parameterizations.

Sample x ∼ N (µ, Σ)
We start with Algorithm 2.3, the most commonly used algorithm for
sampling from a multivariate normal random variable x ∼ N (µ, Σ).
Then x has the required distribution, as
\[
\mathrm{Cov}(x) = \mathrm{Cov}(\tilde{L} z) = \tilde{L}\tilde{L}^T = \Sigma \tag{2.24}
\]
and E(x) = µ. To obtain repeated samples, we do step 1 only once.

Algorithm 2.3 Sampling x ∼ N (µ, Σ)
1: Compute the Cholesky factorization, \Sigma = \tilde{L}\tilde{L}^T
2: Sample z ∼ N (0, I)
3: Compute v = \tilde{L}z
4: Compute x = µ + v
5: Return x

The log density is computed using (2.3), where
\[
\frac{1}{2}\log |\Sigma| = \sum_{i=1}^{n} \log \tilde{L}_{ii}
\]
because |\Sigma| = |\tilde{L}\tilde{L}^T| = |\tilde{L}||\tilde{L}^T| = |\tilde{L}|^2. Hence we obtain
\[
\log \pi(x) = -\frac{n}{2}\log(2\pi) - \sum_{i=1}^{n} \log \tilde{L}_{ii} - \frac{1}{2} u^T u, \tag{2.25}
\]
where u is the solution of \tilde{L}u = x − µ. If x is sampled using Algorithm
2.3 then u = z.
For a GMRF, we assume Q is known and Σ known only implicitly,
hence we aim at deriving an algorithm similar to (2.24) but using a
factorization of the precision matrix Q = LLT . In Section 2.4 we will
discuss how to compute this factorization rapidly taking the sparsity of
Q into account, and discover that the sparsity of Q may also be inherited
by L.

Sample x ∼ N (µ, Q−1 )


To sample x ∼ N (µ, Q−1 ), where Q = LLT , we use the following result:
If z ∼ N (0, I), then the solution of LT x = z has covariance matrix
Cov(x) = Cov(L−T z) = (LLT )−1 = Q−1 .
Hence we obtain Algorithm 2.4. For repeated samples, we do step 1 only

Algorithm 2.4 Sampling x ∼ N (µ, Q−1 )


1: Compute the Cholesky factorization, Q = LLT
2: Sample z ∼ N (0, I)
T
3: Solve L v = z
4: Compute x = µ + v
5: Return x

once. Step 3 solves the linear system LT v = z using back substitution

(2.23), from which we obtain the following result as a by-product, giving
some interpretation to the elements of L.

Theorem 2.7 Let x be a GMRF wrt to the labelled graph G, with mean
µ and precision matrix Q > 0. Let L be the Cholesky triangle of Q.
Then for i ∈ V,
\[
\mathrm{E}(x_i \mid x_{(i+1):n}) = \mu_i - \frac{1}{L_{ii}} \sum_{j=i+1}^{n} L_{ji}(x_j - \mu_j)
\quad\text{and}\quad
\mathrm{Prec}(x_i \mid x_{(i+1):n}) = L_{ii}^2.
\]

Theorem 2.7 provides an alternative representation of a GMRF as a non-
homogeneous autoregressive process defined backward in the indices (or
a virtual time). It will be shown later in Section 5.2 that this is a useful
representation. The following corollary is immediate when we compare
L2ii = Prec(xi | x(i+1):n ) with Qii = Prec(xi | x−i ).

Corollary 2.1 Q_{ii} \geq L_{ii}^2 for all i.

In matrix terms, this is a direct consequence of Q = LL^T, which gives
\[
Q_{ii} = L_{ii}^2 + \sum_{j=1}^{i-1} L_{ij}^2.
\]
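A sketch of Algorithm 2.4 in numpy (Q and µ are made up); note that only one triangular solve per sample is needed once L is computed, and the empirical covariance of repeated samples should approximate Q−1:

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(1)
Q = np.array([[ 3.0, -1.0,  0.0, -1.0],
              [-1.0,  3.0, -1.0,  0.0],
              [ 0.0, -1.0,  3.0, -1.0],
              [-1.0,  0.0, -1.0,  3.0]])
mu = np.zeros(4)

L = cholesky(Q, lower=True)                     # Q = L L^T, done once

def sample():
    z = rng.standard_normal(4)
    v = solve_triangular(L.T, z, lower=False)   # solve L^T v = z
    return mu + v

xs = np.array([sample() for _ in range(20000)])
print(np.cov(xs.T).round(2))                    # approximately Q^{-1}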

Sample x ∼ NC (b, Q)

To sample from a GMRF defined from its canonical representation
(see Definition 2.2) we use Algorithm 2.5. This algorithm samples from
N (Q−1 b, Q−1 ). The mean Q−1 b is computed using Algorithm 2.1.

Algorithm 2.5 Sampling x ∼ NC (b, Q)


1: Compute the Cholesky factorization, Q = LLT
2: Solve Lw = b
3: Solve LT µ = w
4: Sample z ∼ N (0, I)
5: Solve LT v = z
6: Compute x = µ + v
7: Return x

For repeated samples, we do steps 1–3 only once. This algorithm
requires three back or forward substitutions compared to only one when
the mean is known.

The log density of a sample
The log density of a sample x ∼ N (µ, Q−1 ) or x ∼ NC (b, Q) is easily
calculated using (2.4), where
\[
\frac{1}{2}\log |Q| = \sum_{i=1}^{n} \log L_{ii}
\]
because |Q| = |LL^T| = |L||L^T| = |L|^2. Hence we obtain


\[
\log \pi(x) = -\frac{n}{2}\log(2\pi) + \sum_{i=1}^{n} \log L_{ii} - \frac{1}{2}q, \tag{2.26}
\]
where
\[
q = (x - \mu)^T Q (x - \mu). \tag{2.27}
\]
If x is generated via Algorithm 2.4 or Algorithm 2.5, q simplifies to q = z^T z.
Otherwise we use (2.27) and first compute µ, if necessary, and then
v = x − µ, w = Qv, and q = v T w.
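A sketch of (2.26) in numpy, cross-checked against the standard multivariate normal density (Q, µ and x are made up):

import numpy as np
from scipy.linalg import cholesky
from scipy.stats import multivariate_normal

Q = np.array([[3.0, -1.0], [-1.0, 2.0]])
mu = np.array([0.5, -0.5])
x = np.array([1.0, 0.0])

L = cholesky(Q, lower=True)
q = (x - mu) @ Q @ (x - mu)
logdens = -0.5 * len(x) * np.log(2 * np.pi) + np.sum(np.log(np.diag(L))) - 0.5 * q

assert np.isclose(logdens, multivariate_normal(mu, np.linalg.inv(Q)).logpdf(x))
print(logdens)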

2.3.3 Conditional simulation of a GMRF


Sampling from π(xA |x−A ) where x ∼ N (µ, Q−1 )
From Theorem 2.5 we know that the conditional distribution π(xA |xB ),
where xB = x−A , is
\[
x_A \mid x_B \sim N\bigl(\mu_A - Q_{AA}^{-1} Q_{AB} (x_B - \mu_B),\; Q_{AA}^{-1}\bigr).
\]

To sample from π(xA |xB ), it is convenient to first subtract the marginal
mean µA and to write xA − µA |xB in the canonical parameterization:
\[
x_A - \mu_A \mid x_B \sim N_C\bigl(-Q_{AB}(x_B - \mu_B),\; Q_{AA}\bigr).
\]
Hence we can use Algorithm 2.5 to sample from π(xA − µA |xB ), and
then we simply add µA . Some more insight will be given to the term
QAB (xB − µB ) in Appendix B.

Sampling from π(x|Ax = e) where x ∼ N (µ, Q−1 )


We now consider the important case, where we want to sample from a
GMRF under an additional linear constraint
Ax = e,
where A is a k × n matrix, 0 < k < n, with rank k, and e is a vector of
length k. We will denote this problem sampling under a hard constraint.
This problem occurs quite frequently in practice, for example we might
require that the sum of the xi ’s is zero, which corresponds to k = 1,
A = 1T and e = 0.

The linear constraint ensures that the conditional distribution is
normal, but singular as the rank of the constrained covariance matrix is
n − k. For this reason, more care must be taken when sampling from this
distribution. One approach is to compute the mean and the covariance
from the joint distribution of x and Ax, which is normal with moments
\[
\mathrm{E}\begin{pmatrix} x \\ Ax \end{pmatrix} = \begin{pmatrix} \mu \\ A\mu \end{pmatrix}
\quad\text{and}\quad
\mathrm{Cov}\begin{pmatrix} x \\ Ax \end{pmatrix} = \begin{pmatrix} Q^{-1} & Q^{-1}A^T \\ AQ^{-1} & AQ^{-1}A^T \end{pmatrix}.
\]
We condition on Ax = e, which leads to the conditional moments µ∗ =
E(x|Ax) and Σ∗ = Cov(x|Ax), where
µ∗ = µ − Q−1 AT (AQ−1 AT )−1 (Aµ − e) (2.28)
and
Σ∗ = Q−1 − Q−1 AT (AQ−1 AT )−1 AQ−1 . (2.29)
We can sample from this distribution as follows. As the conditional
covariance matrix Σ∗ is singular, we first compute the eigenvalues and
eigenvectors, and factorize Σ∗ as V ΛV T where V has the eigenvectors
on each column and Λ is a diagonal matrix with the corresponding
eigenvalues on the diagonal. This is a different factorization than the
one used in Algorithm 2.3, but any matrix C = V Λ1/2 , which satisfies
CC T = Σ∗ will do. Note that k of the eigenvalues are zero. We can now
generate a sample by computing v = Cz, where z ∼ N (0, I), and then
add the conditional mean. We can compute the log density as
\[
\log \pi(x \mid Ax = e) = -\frac{n-k}{2}\log 2\pi - \frac{1}{2}\sum_{i: \Lambda_{ii} > 0} \log \Lambda_{ii} - \frac{1}{2}(x - \mu^*)^T \Sigma^{-} (x - \mu^*),
\]
where \Sigma^{-} = V \Lambda^{-} V^T. Here (\Lambda^{-})_{ii} is \Lambda_{ii}^{-1} if \Lambda_{ii} > 0 and zero otherwise.
In total, this is a quite computationally demanding procedure, as the
algorithm is not able to take advantage of the sparse structure of Q.
There is an alternative procedure that corrects for the constraint, at
nearly no cost if k ≪ n. In the geostatistics literature this is called
conditioning by Kriging. The result is the following: If we sample from
the unconstrained GMRF x ∼ N (µ, Q−1 ) and then compute
x∗ = x − Q−1 AT (AQ−1 AT )−1 (Ax − e), (2.30)

then x∗ has the correct conditional distribution. This is clear after
comparing the mean and covariance of x∗ with (2.28) and (2.29). Note
that AQ−1 AT is a dense k × k matrix, hence its factorization is fast to
compute for small k. Algorithm 2.6 generates such a constrained sample,
where we denote the dimension of some of the matrices by subscripts.

Algorithm 2.6 Sampling x|Ax = e where x ∼ N (µ, Q−1 )
1: Compute the Cholesky factorization, Q = LLT
2: Sample z ∼ N (0, I)
3: Solve LT v = z
4: Compute x = µ + v
5: Compute V n×k = Q−1 AT using Algorithm 2.2 with L from step 1
6: Compute W k×k = AV
7: Compute U k×n = W −1 V T using Algorithm 2.2
8: Compute c = Ax − e
9: Compute x∗ = x − U T c
10: Return x∗

For repeated samples we do step 1 and steps 5–7 only once. Note that
if z = 0 then x∗ is the conditional mean (2.28). The following trivial
but very useful example illustrates the use of (2.30).
Example 2.3 Let x1 , . . . , xn be independent normal variables with
mean µi and variance σi2 . To sample x conditional on \sum_i x_i = 0, we
first sample xi ∼ N (µi , σi2 ) for i = 1, . . . , n and compute the constrained
sample x∗ via
\[
x_i^* = x_i - c\,\sigma_i^2, \qquad\text{where}\quad c = \sum_j x_j \Big/ \sum_j \sigma_j^2.
\]
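The sum-to-zero correction in Example 2.3 is a one-line computation; the sketch below (made-up means and variances) also checks it against the general formula (2.30) with Q = diag(1/σi²), A = 1ᵀ, and e = 0:

import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0, 0.5])
sigma2 = np.array([1.0, 4.0, 0.25])

x = rng.normal(mu, np.sqrt(sigma2))          # unconstrained sample
c = x.sum() / sigma2.sum()
x_star = x - c * sigma2                      # correction from Example 2.3
assert np.isclose(x_star.sum(), 0.0)

# Same correction via (2.30)
Q_inv = np.diag(sigma2)
A = np.ones((1, 3))
x_gen = x - (Q_inv @ A.T @ np.linalg.inv(A @ Q_inv @ A.T) @ (A @ x - 0.0)).ravel()
assert np.allclose(x_star, x_gen)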
The log density π(x|Ax) can be rapidly evaluated at x∗ using the
identity
\[
\pi(x \mid Ax) = \frac{\pi(x)\,\pi(Ax \mid x)}{\pi(Ax)}. \tag{2.31}
\]
Note that we can compute each term on the right-hand side easier than
the term on the left-hand side: The unconstrained density π(x) is a
GMRF and the log density is computed using (2.26) and L computed in
Algorithm 2.6, step 1. The degenerate density π(Ax|x) is either zero or
a constant, which must be one for A = I. A change of variables gives us
\[
\log \pi(Ax \mid x) = -\frac{1}{2}\log |AA^T|,
\]
i.e., we need to compute the determinant of a k × k matrix, which is
found from its Cholesky factorization. Finally, the denominator π(Ax)
in (2.31) is normal with mean Aµ and covariance matrix AQ−1 AT . The
corresponding Cholesky triangle L is available from Algorithm 2.6, step
7. The log density can then be computed from (2.25).

Sampling from π(x|e) where x ∼ N (µ, Q−1 ) and e|x ∼ N (Ax, Σǫ )
Let x be a GMRF, where some linear transformation Ax is observed
with additional normal noise:
e | x ∼ N (Ax, Σǫ ). (2.32)
Here, e is a vector of length k < n, A is a k × n matrix of rank k, and
Σǫ > 0 is the covariance matrix of e. The log density of x|e is then
\[
\log \pi(x \mid e) = -\frac{1}{2}(x - \mu)^T Q (x - \mu) - \frac{1}{2}(e - Ax)^T \Sigma_{\epsilon}^{-1} (e - Ax) + \text{const}, \tag{2.33}
\]
i.e.,
\[
x \mid e \sim N_C\bigl(Q\mu + A^T \Sigma_{\epsilon}^{-1} e,\; Q + A^T \Sigma_{\epsilon}^{-1} A\bigr), \tag{2.34}
\]
which could also be derived using (2.16). However, the precision matrix
in (2.34) is usually a completely dense matrix and the nice sparse
structure of Q is lost. For example, if we observe the sum of xi with
unit variance noise, the conditional precision is Q + 11T , which is a
completely dense matrix. In general, we have to sample from (2.33) using
Algorithm 2.3 or Algorithm 2.4, which is computationally expensive for
large n.
There is however an alternative approach that is feasible for k ≪ n.
If we extend (2.30) to
x∗ = x − Q−1 AT (AQ−1 AT + Σǫ )−1 (Ax − ǫ),
where
ǫ ∼ N (e, Σǫ ),
e is the observed value, and x ∼ N (µ, Q−1 ), then it is easy to show that
x∗ has the correct conditional distribution (2.34). We denote this case
sampling under a soft constraint. Algorithm 2.7 is similar to Algorithm
2.6.
Note that if z = 0 and ǫ = e in Algorithm 2.7, then x∗ is the
conditional mean. For repeated samples, we do step 1 and steps 5-7
only once.
Also for soft constraints we can evaluate the log density at x∗ using
(2.31) with Ax = e:
\[
\pi(x \mid e) = \frac{\pi(x)\,\pi(e \mid x)}{\pi(e)}.
\]
The unconstrained density π(x) is a GMRF and the log density is
computed using (2.26). Regarding π(e|x), we know from (2.32) that
π(e|x) is normal with mean Ax and covariance Σǫ . We use (2.25)
to compute the log density. Finally, e is normal with mean Aµ and
covariance matrix AQ−1 AT + Σǫ , hence we can use (2.25) to compute

Algorithm 2.7 Sampling from π(x|e) where x ∼ N (µ, Q−1 ) and e|x ∼
N (Ax, Σǫ )
1: Compute the Cholesky factorization, Q = LLT
2: Sample z ∼ N (0, I)
3: Solve LT v = z
4: Compute x = µ + v
5: Compute V n×k = Q−1 AT using Algorithm 2.2 with L from step 1
6: Compute W k×k = AV + Σǫ
7: Compute U k×n = W −1 V T using Algorithm 2.2
8: Sample ǫ ∼ N (e, Σǫ ) using Algorithm 2.3.
9: Compute c = Ax − ǫ
10: Compute x∗ = x − U T c
11: Return x∗

the log density. Note that all Cholesky triangles required to evaluate the
log density are already computed in Algorithm 2.7.
The stochastic version of Example 2.3 now follows.
Example 2.4 Let x1 , . . . , xn be independent normal variables with
variance σi2 and mean µi . We now observe e ∼ N(\sum_i x_i, σǫ2 ). To sample
from π(x|e), we sample xi ∼ N (µi , σi2 ), unconditionally, for i = 1, . . . , n,
while we condition on ǫ ∼ N (e, σǫ2 ). A conditional sample x∗ is then
\[
x_i^* = x_i - c\,\sigma_i^2, \qquad\text{where}\quad c = \frac{\sum_j x_j - \epsilon}{\sum_j \sigma_j^2 + \sigma_\epsilon^2}.
\]

We can merge soft and hard constraints into one framework if we allow
Σǫ to be SPSD, but we have chosen not to, as the details are somewhat
tedious.

2.4 Numerical methods for sparse matrices


This section will give a brief introduction to numerical methods for
sparse matrices. During our discussion of simulation algorithms for
GMRFs, we have shown that they all can be expressed such that the
main tasks are to
1. Compute the Cholesky factorization of Q = LLT where Q > 0 is
sparse, and
2. Solve Lv = b and LT x = z
The second task is faster to compute than the first, but sparsity of
Q is also advantageous in this case. We restrict the discussion to
sparse Cholesky factorizations but the ideas also apply to sparse LU
factorizations for nonsymmetric matrices.

The goal is to explain why a sparse Q allows for fast factorization, how
we can take advantage of it, why we permute the nodes before factorizing
the matrix, and how statisticians can benefit from recent research results
in numerical mathematics. At the end, we will report a small case study
factorizing some typical matrices for GMRFs, using classical and more
recent methods.

2.4.1 Factorizing a sparse matrix

We start with a dense matrix Q > 0 with dimension n, and show how
to compute the Cholesky triangle L, so Q = LLT , which can be written
as
\[
Q_{ij} = \sum_{k=1}^{j} L_{ik} L_{jk}, \qquad i \geq j.
\]

We now define
\[
v_i = Q_{ij} - \sum_{k=1}^{j-1} L_{ik} L_{jk}, \qquad i \geq j,
\]

and we immediately see that L_{jj}^2 = v_j and L_{ij} L_{jj} = v_i for i > j. If we
know {v_i} for fixed j, then L_{jj} = \sqrt{v_j} and L_{ij} = v_i/\sqrt{v_j}, for i = j + 1 to
n. This gives the jth column in L. The algorithm is completed by noting
that {vi } for fixed j only depends on elements of L in the first j − 1
columns of L. Algorithm 2.8 gives the pseudocode using vector notation
for simplicity: vj:n = Qj:n,j is short for vk = Qkj for k = j to n and so
on. The overall process involves n3 /3 flops. If Q is symmetric but not
SPD, then vj ≤ 0 for some j and the algorithm fails.

Algorithm 2.8 Cholesky factorization of Q > 0


1: for j = 1 to n do
2: vj:n = Qj:n,j
3: for k = 1 to j − 1 do vj:n = vj:n − Lj:n,k Ljk
4: Lj:n,j = vj:n /√vj
5: end for
6: Return L

The Cholesky factorization is computed without pivoting and its
numerical stability follows (roughly) from \sum_{k=1}^{i} L_{ik}^2 = Q_{ii}, hence
L_{ij}^2 \leq Q_{ii}, which shows that the entries in the Cholesky triangle are
nicely bounded.

Now we explore the possibilities of a sparse Q to speed up Algorithm
2.8. Recall Theorem 2.7 where we showed that
\[
\mathrm{E}(x_i \mid x_{(i+1):n}) = \mu_i - \frac{1}{L_{ii}} \sum_{j=i+1}^{n} L_{ji}(x_j - \mu_j)
\]
and Prec(xi |x(i+1):n ) = L_{ii}^2. Another interpretation of Theorem 2.7 is
the following result.
Theorem 2.8 Let x be a GMRF wrt G, with mean µ and precision
matrix Q > 0. Let L be the Cholesky triangle of Q and define for 1 ≤
i < j ≤ n the set
F (i, j) = {i + 1, . . . , j − 1, j + 1, . . . , n},
which is the future of i except j. Then
xi ⊥ xj | xF (i,j) ⇐⇒ Lji = 0. (2.35)
Proof. [Theorem 2.8] Assume for simplicity that µ = 0 and fix 1 ≤ i <
j ≤ n. Theorem 2.7 gives that
\[
\pi(x_{i:n}) \propto \exp\Biggl(-\frac{1}{2}\sum_{k=i}^{n} L_{kk}^2 \Bigl(x_k + \frac{1}{L_{kk}}\sum_{j=k+1}^{n} L_{jk} x_j\Bigr)^{\!2}\Biggr)
= \exp\Bigl(-\frac{1}{2} x_{i:n}^T Q^{(i:n)} x_{i:n}\Bigr),
\]
where Q^{(i:n)}_{ij} = L_{ii} L_{ji}. Using Theorem 2.2, it then follows that
xi ⊥ xj | xF (i,j) ⇐⇒ Lii Lji = 0,

which is equivalent to (2.35) since Lii > 0 as Q(i:n) > 0.


The implications of Theorem 2.8 are immediate: If we can verify that
Lji is zero, we do not have to compute it in Algorithm 2.8, hence we
save computations. However, as Theorem 2.8 relates zeros in the lower
triangle of L to conditional independence properties of the successive
marginals \{\pi(x_{i:n})\}_{i=1}^{n}, there is no easy way to use Theorem 2.8 to check
if Lji = 0, except computing it and seeing whether it turns out to be zero!
Theorem 2.4 provides a simple, sufficient criterion for checking if
Lji = 0, making use of the global Markov property. We state this as a
Corollary to Theorem 2.8.
Corollary 2.2 If F (i, j) separates i < j in G, then Lji = 0.
Proof. [Corollary 2.2] The global Markov property (2.12) ensures that if
F (i, j) separates i < j in G, then xi ⊥ xj | xF (i,j) . Hence, Lji = 0 using
Theorem 2.8.

Note that Corollary 2.2 does not make use of the actual values in Q to
decide if Lji = 0, but only uses the conditional independence structure
represented by G. Hence, if Lji = 0 using Corollary 2.2, then it is zero
for all Q > 0 with the same graph G. The reverse statement in Corollary
2.2 is of course not true in general. A simple counter-example is the
following.
Example 2.5 Let
\[
L = \begin{pmatrix} L_{11} & & & \\ L_{21} & L_{22} & & \\ L_{31} & 0 & L_{33} & \\ L_{41} & L_{42} & L_{43} & L_{44} \end{pmatrix}
\]
so that Q = LL^T with Q32 = L21 L31 . Here 2 and 3 are not separated
by F (2, 3) = {4}, although L32 = 0.
The approach is then to make use of Corollary 2.2 to check if Lji = 0,
for all 1 ≤ i < j ≤ n, and to compute only those Lji ’s, that are not
known to be zero, using Algorithm 2.8. Note that we always need to
compute Lji for i ∼ j and j > i, since F (i, j) cannot separate i and j
when Qij ≠ 0. A simple example illustrates the procedure.
Example 2.6 Consider the graph

1 — 2
|   |
3 — 4

and the corresponding precision matrix Q,
\[
Q = \begin{pmatrix} \times & \times & \times & \\ \times & \times & & \times \\ \times & & \times & \times \\ & \times & \times & \times \end{pmatrix},
\]
where the ×’s denote nonzero terms. The only possible zero terms in L
(in general) are L32 and L41 due to Corollary 2.2. Considering L32 we
see that F (2, 3) = {4}. This is not a separating subset for 2 and 3 due to
node 1, hence L32 is not known to be zero using Corollary 2.2. For L41
we see that F (1, 4) = {2, 3}, which does separate 1 and 4, hence L41 = 0.
In total, L has the following structure:
\[
L = \begin{pmatrix} \times & & & \\ \times & \times & & \\ \times & \surd & \times & \\ & \times & \times & \times \end{pmatrix},
\]
where the possibly nonzero entry L32 is marked as ‘\surd’.
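The structure claimed in Example 2.6 is easy to confirm numerically; a sketch with one particular Q having this graph (the values are made up):

import numpy as np

# Precision matrix with edges {1,2}, {1,3}, {2,4}, {3,4}; no edges {1,4}, {2,3}
Q = np.array([[ 3.0, -1.0, -1.0,  0.0],
              [-1.0,  3.0,  0.0, -1.0],
              [-1.0,  0.0,  3.0, -1.0],
              [ 0.0, -1.0, -1.0,  3.0]])
L = np.linalg.cholesky(Q)
print(np.round(L, 3))
# L[3, 0] (i.e., L_41) is exactly zero, as predicted by Corollary 2.2,
# while L[2, 1] (L_32) is a fill-in: nonzero although Q_32 = 0.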

Applying Corollary 2.2, we know that L is always at least as dense
as the lower triangular part of Q, i.e., the number nL of (possible)
nonzero elements in L is always larger than the number nQ of nonzero
elements in the lower triangular part of Q (including the diagonal).
Thus we are concerned about the number of fill-ins nL − nQ , which
we sometimes just simply call the fill-in.
Ideally, nL = nQ , but there are many other graphs where nL ≫ nQ .
We therefore compare different graphs through the fill-in ratio R =
nL /nQ . Clearly, R ≥ 1 and the closer R is to unity, the more efficient is
the Cholesky factorization of a given precision matrix Q. For example,
in Example 2.6, nQ = 8, nL = 9 and hence R = 9/8.
It will soon become clear that fill-in depends crucially on the ordering
of the nodes and is of major concern in numerical linear algebra for
sparse matrices. We will return shortly to this issue, but first discuss a
simple example and its consequences.
Example 2.7 Let x be a Gaussian autoregressive process of order 1,
where
xt | x1:(t−1) ∼ N (φxt−1 , σ 2 ), |φ| < 1, t = 2, . . . , n
with x1 ∼ N (µ1 , σ12 ), say. Now xi ⊥ xi+k |xrest for k > 1 and hence
\[
Q = \begin{pmatrix}
\times & \times & & & & & \\
\times & \times & \times & & & & \\
 & \times & \times & \times & & & \\
 & & \times & \times & \times & & \\
 & & & \times & \times & \times & \\
 & & & & \times & \times & \times \\
 & & & & & \times & \times
\end{pmatrix}
\]
is tridiagonal. The ×’s denote the nonzero terms. The (possible) nonzero
terms in L can now be determined using Corollary 2.2. Since i ∼ i + 1 it
follows that Li+1,i is not known to be zero. For k > 1, F (i, i+k) separates
i and i + k, hence all the remaining terms are zero. The consequence is
that L is (in general) lower tridiagonal,
\[
L = \begin{pmatrix}
\times & & & & & & \\
\times & \times & & & & & \\
 & \times & \times & & & & \\
 & & \times & \times & & & \\
 & & & \times & \times & & \\
 & & & & \times & \times & \\
 & & & & & \times & \times
\end{pmatrix}.
\]
Note that in this example R = 1 and that the bandwidth of both Q and
L equals one.

We can extend Example 2.7 to any autoregressive process of order
p > 1, where Q will be a band matrix with bandwidth p. For k > p,
F (i, i + k) separates i and i + k, hence L is (in general) lower triangular
with the same lower bandwidth p. As a consequence, we have proved
that the bandwidth is preserved during Cholesky factorization.
Theorem 2.9 Let Q > 0 be a band matrix with bandwidth p and
dimension n, then the Cholesky triangle of Q has (lower) bandwidth
p.
This result is well known and a direct proof is available in Golub and
van Loan (1996, Theorem 4.3.1).
We can now do the trivial modification of Algorithm 2.8 to avoid
computing Lij and reading Qij for |i − j| > p. The band version of the
algorithm is Algorithm 2.9.

Algorithm 2.9 Band-Cholesky factorization of Q with bandwidth p


1: for j = 1 to n do
2: λ = min{j + p, n}
3: vj:λ = Qj:λ,j
4: for k = max{1, j − p} to j − 1 do
5: i = min{k + p, n}
6: vj:i = vj:i − Lj:i,k Ljk
7: end for
8: Lj:λ,j = vj:λ /√vj
9: end for
10: Return L

The overall process involves n(p2 + 3p) flops assuming n ≫ p. For an


autoregressive process of order p, this is the cost of factorizing Q. The
costs are linear in n and have been reduced dramatically compared to
the general cost n3 /3.
Similar efficiency gains also show up if we want to solve LT x = z via
back-substitution:
\[
x_i = \frac{1}{L_{ii}} \Bigl(z_i - \sum_{j=i+1}^{\min\{i+p,\,n\}} L_{ji} x_j\Bigr), \qquad i = n, \ldots, 1,
\]

where the cost is 2np flops assuming n ≫ p. Again, the algorithm is


linear in n and we have gained one order of magnitude compared to n2
flops required in the general case.
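Band factorization and band solves are available as library routines; a sketch using scipy's banded Cholesky (the AR(1)-type precision values are our own illustrative choice, not from the text):

import numpy as np
from scipy.linalg import cholesky_banded, cho_solve_banded

# Precision matrix of a stationary AR(1) process (bandwidth p = 1), stored in
# lower banded form: row 0 holds the diagonal, row 1 the subdiagonal.
n, phi, sigma2 = 1000, 0.9, 1.0
diag = np.full(n, (1.0 + phi**2) / sigma2)
diag[0] = diag[-1] = 1.0 / sigma2
sub = np.full(n, -phi / sigma2)          # the last entry of this row is unused
Qb = np.vstack([diag, sub])

Lb = cholesky_banded(Qb, lower=True)     # banded Cholesky factor, O(n p^2) work
b = np.ones(n)
mu = cho_solve_banded((Lb, True), b)     # solves Q mu = b in O(n p) work
print(mu[:3])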

2.4.2 Bandwidth reduction


Now we turn to the spatial case where we will demonstrate that the band-
Cholesky factorization and the band forward- and back-substitution are



Figure 2.4 (a) The map of Germany with n = 544 regions, and (b) the
corresponding graph for the GMRF where neighboring regions share a common
border.

also applicable for spatial GMRFs. We illustrate this by considering the
map of Germany in Figure 2.4(a), where we assume that a GMRF is
defined on the regions such that regions sharing a common border are
neighbors. The graph for the GMRF is shown in Figure 2.4(b).
If we want to apply the band-Cholesky algorithm, we need to make
sure that Q is a band matrix. The ordering of the regions is typically
arbitrary (here they are determined through administrative rules), so
we cannot expect Q to have a pattern that makes band-Cholesky
factorization particularly useful. This is illustrated later in Figure 2.6(a)
displaying Q in the original ordering with bandwidth 542. It is, however,
easy to permute the nodes: Select one of the n! possible permutations
and define the corresponding permutation matrix P such that i^P = P i,
where i = (1, . . . , n)^T , and i^P is the new ordering of the vertices. This
means that node 5, say, is renamed to node i^P_5 . We can then try to choose
P such that the corresponding precision matrix
QP = P QP T (2.36)
is a band matrix with a small bandwidth. Typically it will be impossible
to obtain the optimal permutation from all n! possible ones, but a sub-
optimal ordering will do as well. For a given ordering, we solve Qµ = b
as follows. Compute the reordered problem, QP as in (2.36), bP = P b.



Figure 2.5 (a) The black regions make north and south conditionally indepen-
dent, and (b) displays the automatically computed reordering starting from the
white region ending at the black region. This reordering produces the precision
matrix in Figure 2.6(b).

Solve QP µP = bP and then map the solution back, µ = P T µP .


Return now to Corollary 2.2, which states a sufficient condition for
making Lji = 0 for |i − j| > p. We need to ensure that, after reordering,
xi and xj are conditionally independent given xF (i,j) if |i − j| > p. Suppose
now we separate the south and the north of Germany through a third
subset of regions, the black regions in Figure 2.5(a). Then the north
and the south are conditionally independent given the black regions.
Clearly we can obtain a similar separation, by sliding the black line
from top to bottom. The bandwidth turns out to be the maximal number
of (black) regions needed to divide the north and the south. With this
reordering, the precision matrix will have a small bandwidth. Section 2.5
gives details about the algorithms used for computing the reordering.
One automatically computed ordering is shown in Figure 2.5(b), from
white to black. The reordered Q shown in Figure 2.6(b) has bandwidth
43. As the bandwidth is about O(√n), the costs will be O(n²) for the
factorization for this and similar types of graphs.
To give an idea of the (amazing) speed of such algorithms, the
factorization of QP required about 0.0018 seconds on a 1200-MHz CPU.
Solving Qµ = b required about 0.0006 seconds. However, the fill-in ratio
for this graph is R = 5.3, which suggests that we may do even better



Figure 2.6 (a) The precision matrix Q in the original ordering, and (b) the
precision matrix after appropriate reordering to obtain a band matrix with
small bandwidth. Only the nonzero terms are shown and those are indicated by
a dot.


Figure 2.7 Two graphs with a slight change in the ordering. Graph (a) requires
O(n³) flops to factorize, while graph (b) only requires O(n). Here, n represents
the number of nodes that are neighbors of the center node. The fill-in is maximal
in (a) and minimal in (b).

with other kinds of orderings.
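Bandwidth-reducing permutations of this kind are available in standard sparse-matrix libraries; a sketch using the reverse Cuthill–McKee ordering in scipy (a random sparse pattern stands in for a spatial neighborhood graph, so the numbers are illustrative only and this is not the algorithm used for the Germany example):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

n = 200
A = sp.random(n, n, density=0.02, random_state=3, format="csr")
Q = (A + A.T + n * sp.eye(n)).tocsr()     # symmetric and strongly diagonally dominant

def bandwidth(M):
    i, j = M.nonzero()
    return np.max(np.abs(i - j))

perm = reverse_cuthill_mckee(Q, symmetric_mode=True)
Qp = Q[perm, :][:, perm]                  # Q_P = P Q P^T
print(bandwidth(Q), bandwidth(Qp))        # the reordered matrix has a much smaller bandwidth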

2.4.3 Nested dissection

There has been much work in the numerical and computer science liter-
ature on other reordering schemes that focus on reducing the number of
fill-ins rather than the bandwidth. Why reordering
schemes that reduce the number of fill-ins may be better can be seen
from the following example. For a GMRF with the graph shown in Figure

2.7(a), the precision matrix and its Cholesky triangle are
\[
Q = \begin{pmatrix}
\times & \times & \times & \times & \times & \times & \times \\
\times & \times & & & & & \\
\times & & \times & & & & \\
\times & & & \times & & & \\
\times & & & & \times & & \\
\times & & & & & \times & \\
\times & & & & & & \times
\end{pmatrix},
\quad
L = \begin{pmatrix}
\times & & & & & & \\
\times & \times & & & & & \\
\times & \surd & \times & & & & \\
\times & \surd & \surd & \times & & & \\
\times & \surd & \surd & \surd & \times & & \\
\times & \surd & \surd & \surd & \surd & \times & \\
\times & \surd & \surd & \surd & \surd & \surd & \times
\end{pmatrix}. \tag{2.37}
\]
In this case the fill-in is maximal and, for general n, the cost is O(n3 )
flops to compute the factorization, where n is the number of nodes in the
circle. The reason is that all nodes depend on 1, hence F (i, j) is never a
separating subset for i < j.
However, if we switch the numbers for 1 and n = 7 as in Figure 2.7(b),
we obtain the following precision matrix and its corresponding Cholesky
triangle:
\[
Q = \begin{pmatrix}
\times & & & & & & \times \\
 & \times & & & & & \times \\
 & & \times & & & & \times \\
 & & & \times & & & \times \\
 & & & & \times & & \times \\
 & & & & & \times & \times \\
\times & \times & \times & \times & \times & \times & \times
\end{pmatrix},
\quad
L = \begin{pmatrix}
\times & & & & & & \\
 & \times & & & & & \\
 & & \times & & & & \\
 & & & \times & & & \\
 & & & & \times & & \\
 & & & & & \times & \\
\times & \times & \times & \times & \times & \times & \times
\end{pmatrix}. \tag{2.38}
\]
The situation is now quite different, the fill-in is zero, and we can
factorize Q in only O(n) flops. The remarkable difference to (2.37) is
that conditioning on node 7 in Figure 2.7(b) makes all other nodes
conditionally independent.
The idea in this example generalizes as follows to determine a good
ordering with less fill-in:
• Select a (small) set of nodes whose removal divides the graph into two
disconnected subgraphs of almost equal size.
• Order the chosen nodes after all the nodes in both subgraphs.
• Apply this procedure recursively to the nodes in each subgraph.
This is the idea of reordering based on nested dissection. To demonstrate
how this applies to our current example with the graph in Figure 2.4(b),
we computed such a (slightly modified) ordering (from white to black) as
shown in Figure 2.8. Section 2.5 gives details of the algorithms used
for computing the reordering. We see here the idea in practice: first the
map is divided into two, then these two parts are divided further, etc.
At some stage the recursion is stopped. The reordered precision matrix
and its Cholesky triangle are shown in Figure 2.8(b) and (c), where the
fill-in is 2866. This corresponds to a fill-in ratio of R = 2.5. This is to be
compared to R = 5.3 for the band reordering shown in Figure 2.5.



Figure 2.8 Figure (a) displays the ordering found using a nested dissection
algorithm where the ordering is from white to black. (b) displays the reordered
precision matrix and (c) its Cholesky triangle. In (b) and (c) only the nonzero
elements are shown and those are indicated by a dot.

3/2
Algorithms based on nested dissection ordering require O(n ) flops
in such and similar cases and gives √ O(n log n) fill-ins, see George and
Liu (1981, Ch. 8) for details. This is n faster than the band-Cholesky
approach, but the difference is not that large for many problems of
reasonable size. The next section contains an empirical comparison for
various ‘typical’ graphs.
George and Liu (1981, Ch. 8) proves also that any reordering would
require at least O(n3/2√ ) flops
√ for the factorization and produce at least
O(n log n) fill-ins for a n× n lattice with a local neighborhood. Hence,
the nested dissection reordering is optimal in the order of magnitude
sense.
The band-Cholesky factorization approach is quite simple, intuitive,
and requires only trivial changes to Algorithm 2.9. Implementation thus
requires only a few (hundreds) lines of code, apart from the reordering
itself that is somewhat more demanding. More general factorizations
like factorizing Figure 2.8(b) to get Figure 2.8(c) efficiently, require
substantial knowledge of numerical algorithms, data structures and
high-performance computing. The corresponding libraries easily require
10, 000 lines of code. The factorizing is usually performed in several steps:

1. A reordering phase, where the sparse matrix Q is analyzed to produce


a suitable ordering with reduced fill-in.
2. A symbolical factorization phase, where (informally) the (possible)
nonzero pattern of L is determined and data structures to compute
the factorization are constructed.
3. A numerical factorization phase, where the numerical values of L are
computed.
4. A solve phase in which LT x = z and/or Lv = b is solved.
The results from step 1 and 2 can be reused if we factorize several Q’s
with the same nonzero pattern; This is typical for applications of GMRFs
in MCMC algorithms. The data handling problem in such algorithms
is significant and it is important to implement this well (step 2) to
gain efficiency. We can illustrate this as follows. In the band-Cholesky
factorization, we only need to store the (p + 1) × n rectangle, as it
contains all nonzeros in L, and all loops in Algorithm 2.9 run over this
rectangle. If the nonzeros terms of L are spread out, we need to use
indirect addressing and loops like
for i=1, M
x(indx(i)) = x(indx(i)) + a*w(i)
endfor
There is severe potential of loss of performance using indirect address-
ing, but clever data handling and data structures can prevent or reduce
degradation loss.

©฀2005฀by฀Taylor & Francis Group, LLC


52 THEORY OF GMRFs
In summary, we recommend leaving the issue of constructing and
implementing algorithms for factorizing sparse matrices to the numerical
and computer science experts. However, statisticians should use their
results and libraries for efficient statistical computing. We end this
section quoting from Gupta (2002) who summarizes his findings on
recent advances for sparse linear solvers:
. . . recent sparse solvers have significantly improved the state of the art of
the direct solution of general sparse systems.
. . . recent years have seen some remarkable advances in the general sparse
direct-solver algorithms and software.

2.5 A numerical case study of typical GMRFs


We will now present a small case study using the algorithms from Section
2.4 on a set of typical GMRFs. The aim is to verify empirically the
computational requirements of various algorithms, and to gain some
experience in choosing algorithms for different kinds of problems.
As it will become clear in later sections, applications of GMRF models
can often be divided into three categories.
1. GMRF models in time or on a line. This includes autoregressive
models and models for smooth functions. Neighbors to xt are then
those {xs } such that |s − t| ≤ p.
2. Spatial GMRF models. Here, the graph is either a regular lattice,
or irregular induced by a tessellation or by regions. Neighbors to xi
(spatial index i), are those j spatially ‘close’ to i, where ‘close’ is
defined from its context.
3. Spatiotemporal GMRF models. These are often appropriate exten-
sions of temporal or spatial models.
We also include the case where additional nodes depend on all other
nodes. This occurs in many situations like the following. Let x be a
GMRF with a common mean µ, then
x | µ ∼ N (µ1, Q−1 ),
where Q is sparse. Assume µ ∼ N (0, σ 2 ), then (x, µ) is also a GMRF
where the node µ is a neighbor of all xi ’s.
We apply two different algorithms in our test:
1. The band-Cholesky factorization (BCF) as in Algorithm 2.9. Here we
use the LAPACK routines DPBTRF and DTBSV for the factorization
and the forward- or back-substitution, respectively, and the Gibbs-
Poole-Stockmeyer algorithm for bandwidth reduction as implemented
in Lewis (1982).

©฀2005฀by฀Taylor & Francis Group, LLC


A NUMERICAL CASE STUDY OF TYPICAL GMRFs 53

Figure 2.9 The graph for an autoregressive process with n = 5 and p = 2.

2. The multifrontal supernodal Cholesky factorization (MSCF) imple-


mentation in the library TAUCS (version 2.0) (Toledo et al., 2002) using
the nested dissection reordering from the library METIS (Karypis and
Kumar, 1998).
Both algorithms are available in the GMRFLib library described in Ap-
pendix B used throughout the book. We use plain LAPACK and
BLAS libraries compiled from scratch. Improved performance can be
obtained by replacing these with either vendor-supplied BLAS libraries
(if available) or libraries from the ATLAS (Automatically Tuned Linear
Algebra Software) project, see http://math-atlas.sourceforge.net/.
The tasks we want to investigate are:
1. Factorizing Q into LLT
2. Solving LLT µ = b, i.e., first solving Lw = b and then LT µ = w.
To produce a random sample from a GMRF we need to solve LT x = z,
so the cost is half the cost of solving the linear system in step 2 if the
factorizing is known. All tests reported here have been conducted on a
1200-MHz CPU.

2.5.1 GMRF models in time


Let Q be a band matrix with bandwidth p and dimension n. This
corresponds to an autoregressive process of order p as discussed in
Example 2.7 for p = 1. The graph of an autoregressive process with
n = 5 and p = 2 is shown in Figure 2.9. For such a problem, using BCF
will be (theoretically) optimal with zero fill-in.
Table 2.1 reports the average CPU time (in seconds) used (using 10
replications) for n equals 103 , 104 , 105 , and p equals 5 and 25. The
results obtained are quite impressive, which is often the case for ‘long
and thin’ problems. Computing the factorization and solving the system
requires np2 and 2np flops, respectively. This theoretical behavior is
approximately supported from the results, as small bandwidth makes
loops shorter, which gives a reduction in performance. The MSCF is less
optimal for band matrices. For p = 5 the factorization is about 25 times
slower, and the solving step is about 2 to 3 times slower, compared
to Table 2.1. This is due to a fill-in of about n/2 and because more
complicated data structures than needed are used.

©฀2005฀by฀Taylor & Francis Group, LLC


54 THEORY OF GMRFs
3 4
n = 10 n = 10 n = 105
CPU time p = 5 p = 25 p = 5 p = 25 p = 5 p = 25
Factorize 0.0005 0.0019 0.0044 0.0271 0.0443 0.2705
Solve 0.0000 0.0004 0.0031 0.0109 0.0509 0.1052

Table 2.1 The average CPU time for (in seconds) factorizing Q into LLT
and solving LLT µ = b, for a band matrix of order n and bandwidth p, using
band-Cholesky factorization and band forward- or back-substitution.

We now add 10 additional nodes, which are neighbors with all others.
This makes the bandwidth maximal so the BCF is not a good choice.
Using MSCF we have obtained the results shown in Table 2.2. The fill-in

n = 103 n = 104 n = 105


CPU time p = 5 p = 25 p = 5 p = 25 p = 5 p = 25
Factorize 0.0119 0.0335 0.1394 0.4085 1.6396 4.1679
Solve 0.0007 0.0035 0.0138 0.0306 0.1541 0.3078

Table 2.2 The average CPU time (in seconds) for factorizing Q into LLT and
solving LLT µ = b, for a band matrix of order n, bandwidth p with additional
10 nodes that are neighbors to all others. The factorization routine is MSCF.

is now approximately pn, which is due to the nested dissection ordering


used. In this particular case we can compare the result with the optimal
reordering giving no fill-ins. This is obtained by placing the 10 global
nodes after the n others so we obtain a nonzero structure as in (2.38).
With this optimal ordering we obtain a speedup up to about 1.5 for
p = 25 and slightly less for p = 5. However, the effect of not choosing the
optimal ordering is not dramatic if the ordering chosen is ‘reasonable’.
The nested dissection ordering gives good results in all cases considered
so far, which will also become clear from the spatial examples shown
next.

2.5.2 Spatial GMRF models


Spatial applications of GMRF models have graphs that are typically
either a regular or irregular lattice. Two such examples are provided in
Figure 2.10. Figure (a) shows a realization of a GMRF on a regular lattice
used as an approximation to a Gaussian field with given correlation
function (here the exponential). The neighbors to pixel i are those 24
pixels in a 5 × 5 window centered at i, illustrated in Figure 2.11(b).
We will discuss such approximations in Section 5.1. Figure (b) shows
the Dirichlet tessellation found from randomly distributed points on the

©฀2005฀by฀Taylor & Francis Group, LLC


A NUMERICAL CASE STUDY OF TYPICAL GMRFs 55

(a) (b)

Figure 2.10 Two examples of spatial GMRF models; (a) shows a GMRF on
a lattice used as an approximation to a Gaussian field with an exponential
correlation function, (b) the graph found from Delaunay triangulation of a
planar point set.

(a) (b)

Figure 2.11 The neighbors to the black pixel; (a) the 3 × 3 neighborhood system
and (b) the 5 × 5 neighborhood system.

unit square. If adjacent tiles are neighbors, then we obtain the graph
found by the Delaunay triangulation of the points. This graph is similar
to the one defined by regions of Germany in Figure 2.4, although here,
the outline and the configuration of the regions are random as well.
We will report some timing results for lattices only, as they are similar
for irregular lattices. The neighbors to a pixel i, will be those 8 (24) in the
3 × 3 (5 × 5) window centered at i, and the dimension of the lattice will
be 100 × 100, 150 × 150, and 200 × 200. Table 2.3 summarizes our results.
The speed of the algorithms is again impressive. The performance in the
solve part is quite similar, but for the largest lattice the MSCF really
outperform the BCF. The reason is the O(n3/2 ) cost for MSCF compared
to O(n2 ) for the BCF, which is of clear importance for large lattices. For

©฀2005฀by฀Taylor & Francis Group, LLC


56 THEORY OF GMRFs
2
n = 100 n = 1502 n = 2002
CPU time Method 3×3 5×5 3×3 5×5 3×3 5×5
Factorize BCF 0.51 1.02 2.60 4.93 13.30 38.12
MSCF 0.17 0.62 0.55 1.92 1.91 4.90
Solve BCF 0.03 0.05 0.10 0.16 0.24 0.43
MSCF 0.01 0.04 0.04 0.11 0.08 0.21

Table 2.3 The average CPU time (in seconds) for factorizing Q into LLT and
solving LLT µ = b, for a 1002 , 1502 and 2002 square lattice with 3 × 3 and
5 × 5 neighborhood, using the BCF and MSCF method.

time

Figure 2.12 A common neighborhood structure in spatiotemporal GMRF


models. In addition to spatial neighbors, also the same node in next and
previous time-step can be neighbors.

large lattices, we need to consider also the memory requirement. MSCF


has lower memory requirement than BCF, about O(n log n) compared
to O(n3/2 ). The consequence is that BCF runs into memory problems
just over n = 2002 while the MSCF runs smoothly until n = 3502 or so,
for our machine with 512 Mb memory.
We now add additional 10 nodes that are neighbors to all others, and
repeat the test for MSCF and 5 × 5 neighborhood. For the factorization,
we obtained 0.73, 2.40, and 5.38. The solve-part was nearly unchanged.
There is not that much extra computational costs adding global nodes.

2.5.3 Spatiotemporal GMRF models


Spatiotemporal GMRF models are extensions of spatial GMRF models
to account for additional temporal variation. A typical situation is shown
in Figure 2.12. Think of a sequence of T graphs in time, like the graph in
Figure 2.4(b). Fix one slice and one node in it. Let this node, xit , say, be
the black node in Figure 2.12. Its spatial neighbors are those spatially
close {xjt : j ∼ i}, here shown schematically using 4 neighbors. A

©฀2005฀by฀Taylor & Francis Group, LLC


STATIONARY GMRFs 57
common extension to a spatiotemporal GMRF is to take additional
neighbors in time into account, like the same node in the next and
previous slices; that is, xi,t−1 and xi,t+1 . The results presented in Table
2.4 use this model and the graph of Germany in Figure 2.4(b), for T = 10
and T = 100, with and without additional 10 nodes that are neighbors
to all others. The results show a quite heavy dependency on T . This is

Without global nodes With 10 global nodes


CPU time T = 10 T = 100 T = 10 T = 100
Factorize 0.25 39.96 0.31 39.22
Solve 0.02 0.42 0.02 0.42

Table 2.4 The average CPU time (in seconds) using the MSCF algorithm for
factorizing Q into LLT and solving LLT µ = b, for the spatiotemporal GMRF
using T time steps, and with and without 10 global nodes. The dimension of
the graph is 544 × T .

due to the denser structure of Q due to the dependency both in space


and time. For a cube with n nodes, the MSCF requires O(n2 ) flops to
factorize, with neighbors similar to Figure 2.12.

2.6 Stationary GMRFs⋆


Stationary GMRFs are a special class of GMRFs obtained under rather
strong assumptions on both the graph G and the elements in Q. The
graph is most often a torus Tn (see Section 2.1.2) and the full conditionals
{π(xi |x−i )} have constant parameters not depending on i. This makes
Q a (block) circulant matrix, for which nice analytical results about
the eigenstructure are available. The practical advantage is that typical
operations on such GMRFs can be done using the discrete Fourier
transform (DFT). The computational complexity of typical operations
is then O(n log n) and does not depend on the number of neighbors.
Circulant matrices are also well adapted for theoretical studies and we
will later use them in Section 2.7.
We will first discuss circulant matrices in general and then apply
the results obtained to GMRFs. At the end of this section we will
extend these results to scenarios under slightly less strict assumptions.
In particular, we will discuss matrices of Toeplitz form and show that
they can be approximated by circulant matrices.

2.6.1 Circulant matrices


We will now present some analytical results for (real) circulant matrices.
For notational convenience, we will change our notation in this section

©฀2005฀by฀Taylor & Francis Group, LLC


58 THEORY OF GMRFs
slightly and denote the elements of a vector of length n by the indices
0, . . . , n − 1, and the elements of an n × n matrix by the indices
(0, 0), . . . , (n − 1, n − 1).
Circulant matrices have the property that its eigenvalues and eigen-
vectors are related to the discrete Fourier transform. This allows for
fast algorithms operating on circulant matrices, obtaining their inverse,
computing the product of two circulant matrices, and so on.
Definition 2.4 (Circulant matrix) An n × n matrix C is circulant
iff it has the form
⎛ ⎞
c0 c1 c2 . . . cn−1
⎜cn−1
⎜ c0 c1 . . . cn−2 ⎟

⎜cn−2 cn−1 c0 . . . cn−3 ⎟  
C=⎜ ⎟ = cj−i mod n
⎜ .. .. .. .. ⎟
⎝ . . . . ⎠
c1 c2 c3 . . . c0
for some vector c = (c0 , c1 , . . . , cn−1 )T . We call c the base of C.
A circulant matrix is hence fully specified by only one column or one
row.
The eigenvalues and eigenvectors of a circulant matrix C play a central
role. Any eigenvalue λ and eigenvector e of C is a solution of the equation
Ce = λe. This can be written row by row as n difference equations,
j−1
 n−1

cn−j+i ei + ci−j ei = λej
i=0 i=j

for j = 0, . . . , n − 1 and is equivalent to


n−1−j
 n−1

ci ei+j + ci ei−(n−j) = λej . (2.39)
i=0 i=n−j

These linear difference equations have constant coefficients, so we ‘guess’


that a solution has the form ej ∝ ρj for some complex scalar ρ. We will
now verify that this is indeed the case.
For ej ∝ ρj , equation (2.39) reduces to
n−1−j
 n−1

ci ρi + ρ−n ci ρi = λ.
i=0 i=n−j

If we choose ρ such that ρ−n = 1, then


n−1

λ= ci ρi (2.40)
i=0

©฀2005฀by฀Taylor & Francis Group, LLC


STATIONARY GMRFs 59
and
1
e = √ (1, ρ, ρ2 , . . . , ρn−1 )T . (2.41)
n

The factor n appears in (2.41) because we require that eT e = 1.√The
n nth roots of 1 are {exp(2πι j/n), j = 0, . . . , n − 1} where ι = −1.
The jth eigenvalue is found using (2.40) for each nth root of 1,
n−1

λj = ci exp(−2πι ij/n)
i=0

and the corresponding jth eigenvector is


1
ej = √ (1, exp(−2πι j/n), exp(−2πι j2/n), . . . , exp(−2πι j(n−1)/n))T
n
for j = 0, . . . , n − 1.
We now define the eigenvector matrix,
 
F = e0 | e1 | . . . | en−1
⎛ ⎞
1 1 1 ... 1
⎜1 ω 1
ω 2
... ω n−1 ⎟
1 ⎜⎜ ⎟
2
ω4 ω 2(n−1) ⎟
= √ ⎜1 ω ... ⎟ (2.42)
n ⎜ .. .. .. .. ⎟
⎝. . . . ⎠
1 ω n−1 ω 2(n−1) ... ω (n−1)(n−1)

where ω = exp(−2πι/n). Note that F does not depend on c. Further-


more, let Λ be a diagonal matrix containing the eigenvalues,

Λ = diag(λ0 , λ1 , . . . , λn−1 ).

Note that F is unitary, i.e., F −1 = F H , where F H is the conjugate


transpose of F and

Λ = n diag(F c).
We can verify that C = F ΛF H by a direct calculation:
n−1
1
Cij = exp(2πι k(j − i)/n) λk
n
k=0
n−1
 n−1

1
= exp(2πι k(j − i)/n) cl exp(−2πι kl/n)
n
k=0 l=0
n−1 n−1
1 
= cl exp(2πι k(j − i − l)/n). (2.43)
n
l=0 k=0

©฀2005฀by฀Taylor & Francis Group, LLC


60 THEORY OF GMRFs
Using
n−1

 n if i − j = −l mod n
exp(2πι k(j − i − l)/n) =
k=0
0 otherwise

we obtain Cij = cj−i mod n , i.e., all circulant matrices can be expressed
as F ΛF H for some diagonal matrix Λ.
The following theorem now states that the class of circulant matrices
is closed under some matrix operations.
Theorem 2.10 Let C and D be n × n circulant matrices. Then
1. C and D commute, i.e., CD = DC, and CD is circulant
2. C ± D is circulant
3. C p is circulant, p = 1, 2, . . .
4. if C is non singular then C p is circulant, p = −1, −2, . . .
Proof. Recall that a circulant matrix is uniquely described by its
n eigenvalues as they all share the same eigenvectors. Let ΛC and
ΛD denote the diagonal matrices with the eigenvalues of C and D,
respectively, on the diagonal. Then
CD = F ΛC F H F ΛD F H
= F (ΛC ΛD )F H ;
hence CD is circulant with eigenvalues {λCi λDi }. The matrices com-
mute since ΛC ΛD = ΛD ΛC for diagonal matrices. Using the same
argument,
C ± D = F (ΛD ± ΛC )F H ;
hence C ± D is circulant. Similarly,
C p = F ΛpC F H , p = ±1, ±2, . . .
as ΛpC is a diagonal matrix.
The matrix F in (2.42) is well know as the discrete Fourier transform
(DFT) matrix, so computing F v for a vector v is the same as computing
the DFT of v. Taking the inverse DFT (IDFT) of v is the same as
calculating F H v. Note that if n can be factorized as a product of small
primes, the computation of F v requires only O(n log n) flops. ‘Small
primes’ is the ‘traditional’ requirement, but the (superb) library FFTW,
which is a comprehensive collection of fast C routines for computing the
discrete Fourier transform (http://www.fftw.org), allows arbitrary size
and employs O(n log n) algorithms for all sizes. Small primes are still
computational most efficient.

©฀2005฀by฀Taylor & Francis Group, LLC


STATIONARY GMRFs 61
The link to the DFT is useful for computing with circulant matrices.
Define the DFT and IDFT of v as
⎛ n−1 ⎞
j=0 vj ω j0
⎜ n−1 j1 ⎟
1 ⎜ j=0 vj ω ⎟
DFT(v) = F v = √ ⎜ ⎜ . ⎟

n⎝ .. ⎠
n−1 j(n−1)
j=0 v j ω

and
⎛ n−1 ⎞
j=0 vj ω −j0
⎜ n−1 −j1 ⎟
1 ⎜ j=0 vj ω ⎟
IDFT(v) = F v = √ ⎜
H
⎜ . ⎟.

n⎝ .. ⎠
n−1 −j(n−1)
j=0 v j ω
Recall that ‘⊙’ denotes elementwise multiplication, ‘⊘’ denotes elemen-
twise division, and ‘’ is elementwise power, see Section 2.1.1. Let C be
a circulant matrix with base c, then the matrix-vector product Cv can
be computed as
Cv = F ΛF H v

= F n diag(F c) F H v

= n DFT(DFT(c) ⊙ IDFT(v)).
The product of two circulant matrices C and D, with base c and d,
respectively, can be written as
CD = F (ΛC ΛD ) F H (2.44)
√ √ 
= F n diag(F c) n diag(F d) F H . (2.45)
Since CD is a circulant matrix with (unknown) base p, say, then
√ 
CD = F n diag(F p) F H . (2.46)
Comparing (2.46) and (2.44), we see that
√ √ √
n diag(F p) = n diag(F c) n diag(F d);
hence

p= n IDFT (DFT(c) ⊙ DFT(d)) .
Solving Cx = b can be done similarly, since
x = C −1 b
= F Λ−1 F H b
1
= √ DFT(IDFT(b)) ⊘ DFT(c)).
n

©฀2005฀by฀Taylor & Francis Group, LLC


62 THEORY OF GMRFs
The inverse of C is
C −1 = F Λ−1 F H ;
hence the base of C −1 is
1
IDFT(DFT(c)  (−1)).
n

2.6.2 Block-circulant matrices


A natural generalization of circulant matrices are block-circulant ma-
trices. These matrices share the same properties as circulant matrices.
Algorithms for block-circulant matrices extend easily from those for
circulant matrices by, loosely speaking, replacing the discrete Fourier
transform with the two-dimensional discrete Fourier transform. Block-
circulant matrices are central for stationary GMRFs defined on a torus,
as we will see later.
Definition 2.5 (Block-circulant matrix) An N n × N n matrix C is
block circulant with N × N blocks, iff it can be written as
⎛ ⎞
C0 C1 C 2 . . . C N −1
⎜C N −1
⎜ C0 C 1 . . . C N −2 ⎟⎟
⎜C N −2 C N −1 C 0 . . . C N −3 ⎟  
C=⎜ ⎟ = C j−i mod N
⎜ .. .. .. .. ⎟
⎝ . . . . ⎠
C1 C2 C3 . . . C0
where C i is a circulant n × n matrix with base ci . The base of C is the
n × N matrix
 
c = c0 c1 . . . cN −1 .
A block-circulant matrix is fully specified by one block column, one block
row, or the base. The elements of C are defined by the base c; element
(k, l) in block (i, j) of C, is element l − k mod n of base cj−i mod N .
To compute the eigenvalues and eigenvectors of C, we will use results
from Section 2.6.1. Let F n and F N be the eigenvector matrix as
defined in (2.42) where the subscript denotes the dimension. As each
H
√ matrix is diagonalized by F n (i.e. C i = F n Λi F n , where
circulant
Λi = n diag(F n ci )), we see that
⎛ ⎞⎛ ⎞⎛ H ⎞
Fn Λ0 . . . ΛN −1 Fn
⎜ .. ⎟ ⎜ .. .. .. ⎟ ⎜ .. ⎟
C = ⎝ . ⎠⎝ . . . ⎠⎝ . ⎠
Fn Λ1 ... Λ0 FH
n

= FN N H
n Λ (F n )

©฀2005฀by฀Taylor & Francis Group, LLC


STATIONARY GMRFs 63
with obvious notation. Each Λi is diagonal so a symmetric permutation
of rows and columns in Λ will result in a block-diagonal matrix with
circulant blocks. Let P be the permutation matrix that takes the ith
row of the block row j to the jth row of the block row i. For example,
for n = N = 3,
⎛ ⎞
1
⎜ 1 ⎟
⎜ ⎟
⎜ 1 ⎟
⎜ ⎟
⎜ ⎟
⎜ 1 ⎟
⎜ ⎟
P =⎜ ⎜ 1 ⎟.

⎜ 1 ⎟
⎜ ⎟
⎜ ⎟
⎜ 1 ⎟
⎜ ⎟
⎝ 1 ⎠
1
Then P P = I and
⎛ ⎞
D0
⎜ D1 ⎟
⎜ ⎟
P ΛP =⎜ .. ⎟ = D,
⎝ . ⎠
Dn
where D i is a circulant matrix. The jth element of di , the base of D i ,
is the ith diagonal element of Λj , so
⎛ ⎞ ⎛ ⎞
d0 F n c0
⎜ d1 ⎟ √ ⎜ F n c1 ⎟
⎜ ⎟ ⎜ ⎟
⎜ .. ⎟ = n P ⎜ .. ⎟.
⎝ . ⎠ ⎝ . ⎠
dn−1 F n cN −1
Since D i is diagonalized by F N (i.e. D i = F N Γi F H
√ N , where Γi =
N diag(F N di )), we obtain
 
C = FN n P FN
n
Γ (F nN )H P (F N
n)
H
, (2.47)

where F nN = diag(F N ) and Γ = diag(Γ0 , . . . , Γn−1 ). (Here, ‘diag’


operates on matrices instead of scalars.)
We have demonstrated by (2.47) that the nice factorization result
obtained for circulant matrices extends also to block-circulant matrices
and so does Theorem 2.10. It will also extend to higher dimensions, i.e.,
a block-circulant matrix where each block is a block-circulant matrix
and so on, by following the same route.
Although Equation (2.47) gives the recipe of how to compute the
eigenvalues, we typically do not want to use this expression directly.

©฀2005฀by฀Taylor & Francis Group, LLC


64 THEORY OF GMRFs
Instead, we can use the relation between the two-dimensional discrete
Fourier transform, and the eigenvectors and eigenvalues of a block-
circulant matrix. Let the block-diagonal matrix Γ contain all nN
eigenvalues on the diagonal. Store these eigenvalues in an n × N matrix
Π, where row i of Π is the diagonal of Γi . Since F is the discrete Fourier
transform matrix, we can compute Π as follows: compute the DFT of
each row
√ of the base c, compute the DFT of each column and scale both
with N n. The result is that Π is the two-dimensional discrete Fourier
transform of the base c. Similarly, the block matrix F N n
n P F N is the
two-dimensional discrete Fourier transform matrix.
Computations for block-circulant matrices are as easy as for circulant
matrices, if we extend the notation to two-dimensional discrete Fourier
transforms. Let a be an n × N matrix. The two-dimensional DFT of a,
DFT2(a) is an n × N matrix with elements
n−1 N −1  
1  ii′ jj ′
√ ai′ j ′ exp −2πι ( + ) ,
nN i′ =0 j ′ =0 n N

i = 0, . . . , n − 1, j = 0, . . . , N − 1, and the inverse DFT of a, IDFT2(a),


is an n × N matrix with elements
n−1 N −1  
1  ii′ jj ′
√ ai′ j ′ exp 2πι ( + ) .
nN ′ ′
n N
i =0 j =0

Using this notation, the n × N matrix



Π = nN DFT2(c)
contains all eigenvalues of C; a block-circulant matrix with base c. Let
v be an n × N matrix and vec(v) its vector representation obtained by
stacking the columns one above the other, see Section 2.1.1. The matrix-
vector product vec(u) = Cvec(v) can then be computed as

u = nN DFT2(DFT2(c) ⊙ IDFT2(v)). (2.48)
The product of two block-circulant matrices C and D, with base c and
d, respectively, is a block-circulant matrix with base

nN IDFT2 (DFT2(c) ⊙ DFT2(d)) . (2.49)
The solution of Cvec(x) = vec(b) is
1
x= √
DFT2(IDFT2(b) ⊘ DFT2(c))
nN
while the inverse of C has base
1
IDFT2(DFT2(c)  (−1)). (2.50)
nN

©฀2005฀by฀Taylor & Francis Group, LLC


STATIONARY GMRFs 65
2.6.3 GMRFs with circulant precision matrices
The relevance of block-circulant matrices regarding GMRFs appears
when we study a stationary GMRF on a torus.
This puts strong assumptions on both the graph and Q, but it is an
important special case. For illustration, a torus is shown in Figure 2.1(a).
Let a zero mean GMRF be defined on Tn through the conditional
moments
1 
E(xij | x−ij ) = − θi′ j ′ xi+i′ ,j+j ′ ,
θ00 ′ ′ (2.51)
i j =00

Prec(xij | x−ij ) = θ00 ,


where usually only a few of the θi′ j ′ ’s are nonzero, for example, |i′ | ≤ 1
and |j ′ | ≤ 1, or |i′ | ≤ 2 and |j ′ | ≤ 2. Let G be the graph induced by the
torus Tn and the nonzero {θij }. The precision matrix Q is
Q(i,j),(i′ ,j ′ ) = θi−i′ ,j−j ′
and θi′ j ′ = θ−i′ ,−j ′ due to symmetry. Here we assume that the elements
are stored by row, i.e., (i, j) = i + jn1 , so Q is a block-cyclic matrix with
base θ. We assume further that Q is SPD.
The so-defined GMRF is called stationary if the mean vector is
constant and if
Cov(xij , xi′ j ′ ) = c(i − i′ , j − j ′ )
for some function c(·, ·), i.e., the covariance matrix is a block-cyclic
matrix with base c. Often c(i − i′ , j − j ′ ) depends on i − i′ and j − j ′ only
through the Euclidean distance (on the torus) between (i, j) and (i′ , j ′ ).
The precision matrix for a stationary GMRF is then block-cyclic by the
generalization of Theorem 2.10, with the consequence that a stationary
GMRF has the full conditionals (2.51). Similarly, a GMRF with full
conditionals (2.51) and constant mean is stationary.
Fast algorithms can now be derived based on operations on block-
circulant matrices using the discrete Fourier transform. As Q is SPD,
all eigenvalues are real, positive, and all eigenvectors are real.
To describe how to sample a zero mean GMRF x defined in (2.51),
let x be stored as an n × N matrix and similar with z. The spectral
decomposition of Q is Q = V ΛV T , so solving
Λ1/2 V T vec(x) = vec(z),
where vec(z) ∼ NnN (0, I), gives
vec(x) = V Λ−1/2 vec(z).
This can be computed using the DFT2 as illustrated in Algorithm 2.10
where Λ is an n × N matrix. The imaginary part of v is not used in this

©฀2005฀by฀Taylor & Francis Group, LLC


66 THEORY OF GMRFs
Algorithm 2.10 Sampling a zero mean GMRF with block-circulant
precision
iid iid
1: Sample z, where Re(zij ) ∼ N (0, 1) and
√ Im(zij ) ∼ N (0, 1)
2: Compute the (real) eigenvalues, Λ = nN DFT2(θ)
3: v = DFT2((Λ  (− 12 )) ⊙ z)
4: x = Re(v)
5: Return x

algorithm. We can make use of it since Im(v) has the same distribution
as Re(v), and Im(v) and Re(v) are independent.
The log density can be evaluated as
Nn 1 1 T
− log 2π + log |Q| − vec(x) Q vec(x),
2 2 2
where 
log |Q| = log Λij
ij

with Λ as computed in Algorithm 2.10. To obtain the quadratic term


T
q = vec(x) Q vec(x), we use (2.48) to obtain vec(u) = Q vec(x), and
then q = vec(x)T vec(u).
Example 2.8 Let Tn be a 128 × 128 torus, where
1
E(xij | x−ij ) = (xi+1,j + xi−1,j + xi,j+1 + xi,j−1 )
4+δ
Prec(xij | x−ij ) = 4 + δ, δ > 0.
The precision matrix Q is then block-circulant with base (128 × 128
matrix)
⎛ ⎞
4 + δ −1 −1
⎜ −1 ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
c=⎜ ⎜ ⎟,

⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎝ ⎠
−1
where we display only the nonzero terms. Note that Q is symmetric
and diagonal dominant hence Q > 0. A sample using Algorithm 2.10
is displayed in Figure 2.13(a) using δ = 0.1. We also compute the
correlation matrix, i.e., the scaled covariance matrix Q−1 , which has

©฀2005฀by฀Taylor & Francis Group, LLC


STATIONARY GMRFs 67

1.0
0.8
0.6
0.4
0.2
0.0
0 20 40 60 80 100 120

(a) (b)

Figure 2.13 Illustrations to Example 2.8, the sample (a) and the first column
of the base of Q−1 in (b).

base equal to
⎛ ⎞
1.00 0.42 0.39 0.28 0.24 0.20 0.18 0.16 0.14 ···
⎜0.42 0.33 0.31 0.26 0.23 0.20 0.18 0.16 0.14 · · ·⎟
⎜ ⎟
⎜0.39 0.31 0.29 0.25 0.22 0.19 0.17 0.15 0.14 · · ·⎟
⎜ ⎟
⎜0.28 0.26 0.25 0.22 0.20 0.18 0.16 0.15 0.13 · · ·⎟
⎜ ⎟
⎜0.24 0.23 0.22 0.20 0.18 0.17 0.15 0.14 0.13 · · ·⎟
⎜ ⎟
⎜0.20 0.20 0.19 0.18 0.17 0.15 0.14 0.13 0.12 · · ·⎟ .
⎜ ⎟
⎜0.18 0.18 0.17 0.16 0.15 0.14 0.13 0.12 0.11 · · ·⎟
⎜ ⎟
⎜0.16 0.16 0.15 0.15 0.14 0.13 0.12 0.11 0.10 · · ·⎟
⎜ ⎟
⎜0.14 0.14 0.14 0.13 0.13 0.12 0.11 0.10 0.09 · · ·⎟
⎝ ⎠
.. .. .. .. .. .. .. .. .. ..
. . . . . . . . . .

The first column of the base is displayed in Figure 2.13(b).

2.6.4 Toeplitz matrices and their approximations

The toroidal assumption for a torus, where opposite sides of a regular


lattice in d dimensions are adjacent, may for many applications be
somewhat artificial.
However, the nice analytical properties of circulant matrices and their
superior computational properties through the connection to the DFT
raises the question whether it is possible to approximate nontoroidal
GMRFs with toroidal GMRFs. We will now look at approximations of
so-called Toeplitz matrices through circulant matrices.

©฀2005฀by฀Taylor & Francis Group, LLC


68 THEORY OF GMRFs
Definition 2.6 (Toeplitz matrix) An n×n matrix T is called Toeplitz
iff it has the form
⎛ ⎞
t0 t1 t2 . . . tn−1
⎜ t−1 t0 t1 . . . tn−2 ⎟
⎜ ⎟  
⎜ t−2 t t . . . tn−3 ⎟
T =⎜ −1 0 ⎟ = tj−i
⎜ .. .. .. .. ⎟
⎝ . . . . ⎠
t−(n−1) t−(n−2) t−(n−3) ... t0
for a vector t = (t−(n−1) , . . . , t−1 , t0 , t1 , . . . , tn−1 )T , called the base of T .
If T is symmetric, then tk = t−k and the base is t = (t0 , t1 , . . . , tn−1 )T .
A Toeplitz matrix is fully specified by one column and one row. A
symmetric Toeplitz matrix is fully specified by either one column or
one row.
Example 2.9 Let xt , t = . . . , −1, 0, 1, . . ., be a zero mean stationary
Gaussian autoregressive process of order K where t denotes time. The
precision matrix of xn = (x0 , . . . , xn−1 )T where n > 2p, conditioned on
xt = 0 for t ∈ {0, . . . , n − 1}, is then a symmetric Toeplitz matrix with
base
t = (θ0 , θ1 , . . . , θp , 0, . . . , 0)T , (2.52)
say. The (conditional) log density of xn is
n 1 1
log 2π + log |T n | − xTn T n xn .
log π(xn ) = − (2.53)
2 2 2
If n is large, then we might approximate T n with a circulant matrix C n
with base
c = (θ0 , . . . , θp , 0, . . . , 0, θp , . . . , θ1 )T . (2.54)
to obtain a more computational feasible log density,
n 1 1
log 2π + log |C n | − xTn C n xn ,
log πc (xn ) = − (2.55)
2 2 2
which is an approximation to (2.53).
The rationale for the approximation in Example 2.9 is that T n and C n
are asymptotically equivalent (to be defined) as n → ∞. This will enable
us to prove rather easily that
 
1 
 log π(xn ) − 1 log πc (xn ) → 0,
n n 

almost surely as n → ∞. As a consequence, the (conditional) maximum


likelihood estimator (MLE) of some unknown parameters θ tends to the
same limit as the one obtained using the circulant approximation.
To define asymptotically equivalent matrices, we need to define the
strong and weak matrix norm.

©฀2005฀by฀Taylor & Francis Group, LLC


STATIONARY GMRFs 69
Definition 2.7 (Weak and strong norm) Let A be a real n × n
matrix. The strong and weak norm is defined as
1/2
As = max xT (AT A)x and
x : xT x=1
⎛ ⎞1/2
1 
Aw = ⎝ A2 ⎠ ,
n ij ij

respectively.
Both norms can be expressed in terms of the eigenvalues {λk } of AT A:
1 1
A2s = max λk , and A2w = trace(AT A) = λk .
k n n
k

If A is symmetric with eigenvalues {αk } then λk = αk2 .


Definition 2.8 (Asymptotically equivalent matrices) Two
sequences of n × n matrices An and B n are said to be asymptotically
equivalent if
1. There exists M < ∞ such that An s < M and B n s < M , and
2. An − B n w → 0 as n → ∞.
Asymptotically equivalent matrices have nice properties, for example,
certain functions of the eigenvalues converge to the same limit, see for
example Gray (2002, Thm. 2.4).
Theorem 2.11 Let An and B n be asymptotically equivalent matrices
with eigenvalues αn,k and βn,k and suppose there exist m > 0 and a
finite M such that
m < |αn,k | < M and m < |βn,k | < M.
Let f (·) be a continuous function on [m, M ], then
1 1
lim f (αn,k ) = lim f (βn,k )
n→∞ n n→∞ n
k k

if one of the limits exists.


The proof of Theorem 2.11 is too long to give here, but the idea is to
prove that the mean of powers of the eigenvalues converges to the same
limit as a remedy to show convergence for any polynomial f (·). Then the
Stone-Weierstrass approximation theorem is used to obtain the result for
any continuous function f (·).
We are now able to prove that the error we do by approximating
T n with C n is asymptotically negligible, hence we can approximate the
conditional log density by (2.55) and evaluate it utilizing the connection
to the DFT.

©฀2005฀by฀Taylor & Francis Group, LLC


70 THEORY OF GMRFs
Theorem 2.12 Let xn ∼ N (0, T −1
n ),
where the precision matrix T n
is Toeplitz with base (2.52) and let C n be the corresponding circulant
approximation with base (2.54). Assume the eigenvalues of T n and C n
are bounded away from 0, then as n → ∞,
 
1  1 1 1 1 
 log |T n | − xn T n xn − log |C n | + xn C n xn  → 0 (2.56)
n 2 2 2 2
almost surely.
Proof. First we show that T n and C n are asymptotically equivalent.
Note that T n and C n only differ in O(p2 ) terms. This ensures that
T n − C n w → 0. The eigenvalues of T n and C n are bounded from
above since the elements of T n and C n are bounded and so are the
eigenvalues of T Tn T n and C Tn C n . The strong norm is then bounded.
By Definition 2.8, T n and C n are asymptotically equivalent.
The triangle inequality applied to (2.56) gives the following upper
bound:
1   1  T 
log |T n | − log |C n | + xn T n xn − xTn C n xn  .
2n   2n  
term 1 term 2

Consider first term 1. Using Theorem 2.11 with f (·) = log(·) and that T n
and C n are asymptotically equivalent matrices with bounded eigenvalues
from above and from below (by assumption), then
1 1
lim log |T n | = lim log |C n |.
n→∞ n n→∞ n

Regarding term 2, note that only O(p2 ) terms in T n − C n are nonzero.


Further, each xi is bounded in probability since the eigenvalues of T n and
C n are bounded from below (by assumption) and above. This ensures
that term 2 tends to zero almost surely.
In applications we may use (2.55) to approximate the MLE of unknown
parameters θ. However, two issues arise.
1. We want the MLE from the (exact) log likelihood and not the
conditional log likelihood.
2. We also need to consider the rate of convergence of the circulant
approximation to the log likelihood and its partial derivatives wrt
parameters that govern γ(·), to compare the bias caused by a circulant
approximation with the bias and random error of the MLE.
Intuitively, the boundary conditions are increasingly important for
increasing dimension. For the d-dimensional sphere with radius r, the
volume is O(rd ) while the surface is O(rd−1 ). Hence the appropriateness
of the circulant approximation may depend on the dimension.

©฀2005฀by฀Taylor & Francis Group, LLC


STATIONARY GMRFs 71
We will now report both more precise and more general results by
Kent and Mardia (1996) which also generalize results of Guyon (1982).
Let x be a zero mean stationary Gaussian process on a d-dimensional
lattice with size N = n × n × · · · × n. Let ΣN be the (block) Toeplitz
covariance matrix and S N its (block) circulant approximation. Under
some regularity conditions on

Cov(xi , xj ) = γ(i − j),

where i and j are points in the d-dimensional space and γ(·) is the so-
called covariance function (see Definition 5.1 in Section 5.1), they proved
that
1 1
− log |ΣN | + log |S N | = O(nd−1 )
2 2 (2.57)
1 1 T −1
− xT Σ−1
N x + x S N x = Op (n
d−1
).
2 2
The result (2.57) also holds for partial derivatives wrt parameters that
govern γ(·). The error in the deterministic and stochastic part of (2.57)
is of the same order. The consequence is that the log likelihood and its
circulant approximation differ by Op (nd−1 ). We can also use the results
of Kent and Mardia (1996) (their Lemma 4.3 in particular) to give the
same bound on the difference of the conditional log likelihood (or its
circulant approximation) and the log likelihood.
Let θ̂ be the true MLE estimator and θ  be the MLE estimator com-
puted using the circulant approximation. Maximum likelihood theory
states that, under some mild regularity conditions, θ̂ is asymptotically
normal with
N 1/2 (θ̂ − θ) ∼ N (0, H)
where H > 0. The standard deviation for a component of θ is then
O(N −1/2 ). The bias in the MLE is for this problem O(N −1 ) (Mardia,
1990). Kent and Mardia (1996) show that θ  has bias of O(1/n). From
this we can conclude, that for d = 1 the bias caused by the circulant
approximation is of smaller order than the standard deviation. The
circulant approximation is harmless and θ  has the same asymptotic
properties as θ̂. For d = 2 the bias caused by the circulant approximation
is of the same order as the standard deviation, so the error we make is
of the same order as the random error. The circulant approximation
is then tolerable, bearing in mind this bias. For d ≥ 3 the bias is of
larger order than the standard deviation so the error due to the circulant
approximation dominates completely. An alternative is then to use the
modified Whittle approximation to the log likelihood that is discussed
in Section 2.6.5.

©฀2005฀by฀Taylor & Francis Group, LLC


72 THEORY OF GMRFs
2.6.5 Stationary GMRFs on infinite lattices
As an alternative for a zero mean stationary GMRF on the torus we may
consider a zero mean GMRFs on an infinite lattice I∞ . Such a process
does exist on I∞ (Rosanov, 1967) and can be represented and defined
using its spectral density function (SDF). We will use the common term
conditional autoregression for this process. An approximation to the log
likelihood of a finite restriction of the process to In can be constructed
using the SDF. This section defines conditional autoregression on I∞ ,
the SDF and the log likelihood approximation using the SDF.
Let x (shorthand for {xi }) be a zero mean and stationary Gaussian
process on I∞ , and define the covariances {γij },
γij = E(xkl xk+i,l+j ),
which does not depend on k and l due to stationarity. We assume
throughout that ij∈I∞ |γij | is finite.
The covariances {γij } define the spectral density function of x.
Definition 2.9 (Spectral density function) The spectral density func-
tion of x is
1 
f (ω1 , ω2 ) = γij exp (−ι(iω1 + jω2 )) ,
4π 2
ij∈I∞

where ι = −1 and (ω1 , ω2 ) ∈ (−π, π]2 .
The SDF is the Fourier transform of {γij }, hence we can express γij
from the SDF using the inverse transform:
 π  π
γij = f (ω1 , ω2 ) exp (ι(iω1 + jω2 )) dω1 dω2 .
−π −π

Since γij = γ−i,−j , the SDF is real.


A conditional autoregression on I∞ is a Gaussian process with a SDF
of a specific form.
Definition 2.10 (Conditional autoregression) A zero mean station-
ary Gaussian process on I∞ is called a conditional autoregression if the
SDF is
1 1
f (ω1 , ω2 ) =  , (2.58)
4π 2 θ
ij∈I∞ ij exp (−ι(iω1 + jω2 ))
where
1. the number of nonzero coefficients θij is finite
2. θij = θ−i,−j
3. θ00 > 0
4. {θij } is so that f (ω1 , ω2 ) > 0 for all (ω1 , ω2 ) ∈ (−π, π]2

©฀2005฀by฀Taylor & Francis Group, LLC


STATIONARY GMRFs 73
Site ij and kl are called neighbors iff θi−k,j−l = 0 and ij = kl.
Conditions 2, 3, and 4 correspond to Q being SPD. The definition of
the conditional autoregression is consistent with the definition of a finite
GMRF and the term ‘neighbor’ has also the same meaning.
Theorem 2.13 The full conditionals of a conditional autoregression are
normal with moments
1 
E(xij | x−ij ) = − θkl xi+k,j+k (2.59)
θ00
kl∈I∞ \00

Prec(xij | x−ij ) = θ00 . (2.60)


To prove Theorem 2.13 we proceed as follows. First, define the covariance
generating function (CGF) that compactly represents {γij }.
Definition 2.11 (Covariance generating function) The covariance
generating function is defined by

Γ(z1 , z2 ) = γij z1i z2j ,
ij∈I∞

where it exists.
The covariance γij can be extracted from the CGF using
∂ i+j 

i j
Γ(z1 , z2 ) = γij
∂z1 ∂z2 (z1 ,z2 )=(0,0)

for i ≥ 0 and j ≥ 0, and the SDF can be expressed using the CGF as
1
f (ω1 , ω2 ) =
Γ(exp(−ιω1 ), exp(−ιω2 )). (2.61)
4π 2
We need the following result.
Lemma 2.4 The covariances of the conditional autoregression satisfy
the recurrence equations

 1, ij = 00
θkl γi+k,j+l = (2.62)
kl∈I
0, otherwise.

Proof. [Lemma 2.4] Let δij be 1 if ij = 00 and zero otherwise.


Then (2.62) is equivalent to
1 
γij = (δij − θkl γi+k,j+l ).
θ00
kl∈I∞ \00

Define
1 
dij = γij − (δij − θkl γi+k,j+l ).
θ00
kl∈I∞ \00

©฀2005฀by฀Taylor & Francis Group, LLC


74 THEORY OF GMRFs

By showing that ij∈I∞ dij z1i z2j
= 0 for all (z1 , z2 ) where Γ(z1 , z2 )
exists, we can conclude that dij = 0 for all ij. We obtain
 1 1  
dij z1i z2j = Γ(z1 , z2 ) − + θkl z1i z2j γi+k,j+l
θ00 θ00
ij∈I∞ ij∈I∞ kl∈I∞ \00
⎛ ⎞
1  1
= Γ(z1 , z2 ) ⎝1 + θkl z1−k z2−l ⎠ −
θ00 θ00
kl∈I∞ \00
1 1
= − = 0,
θ00 θ00
using (2.58) expressed by the CGF (2.61).
Proof. [Theorem 2.13] We can verify (2.59) by showing that
⎛⎛ ⎞ ⎞
1  
E ⎝⎝xij + θkl xi+k,j+l ⎠ θk′ l′ xi+k′ ,j+l′ ⎠ = 0,
θ00 ′ ′
kl∈I∞ \00 k l ∈I∞ \00

which follows by expanding the terms and then using Lemma 2.4. To
show (2.60) we start with
Var(xij ) = Var(E(xij | x−ij )) + E(Var(xij | x−ij ))
to compute
E(Var(xij | x−ij )) = γ00 − Var(E(xij | x−ij ))
⎛ ⎞
1  
= γ00 − E ⎝ 2 θkl θk′ l′ xi+k,j+l xi+k′ ,j+l′ ⎠
θ00 ′ ′ kl∈I∞ \00 k l ∈I∞ \00
1  
= γ00 − 2 θkl θk′ l′ γk′ −k,l′ −l
θ00
kl∈I∞ \00 k′ l′ ∈I ∞ \00

1 
= γ00 − 2 θkl (−θ00 )γkl
θ00
kl∈I∞ \00
⎛ ⎞
1 ⎝ 
= θ00 γ00 + θkl γkl ⎠
θ00
kl∈I∞ \00
1
= ,
θ00
where we have used Lemma 2.4 twice. From this (2.60) follows since
Var(xij |x−ij ) is a constant.
We end by presenting Whittle’s approximation, which uses the SDF
of the process on I∞ to approximate the log likelihood if x is observed

©฀2005฀by฀Taylor & Francis Group, LLC


PARAMETERIZATION OF GMRFs 75
at a finite lattice In (Whittle, 1954). The empirical covariances of x are
1 
γ̂ij = xkl xk+i,l+j
n
kl∈I∞

and the empirical SDF is


1 
fˆ(ω1 , ω2 ) = γ̂ij exp(−ι(iω1 + jω2 )).
4π 2
ij∈I∞

Then Whittle’s approximation is


n
log π(x) ≈ − log 2π
2  
π π
n
− 2 log(4π 2 f (ω1 , ω2 )) dω1 dω2
8π −π −π
 π  π ˆ
n f (ω1 , ω2 )
− 2 dω1 dω2 . (2.63)
8π −π −π f (ω1 , ω2 )
The properties of this approximation have been studied by Guyon
(1982). The approximation shares the same property as the circulant
approximation in Section 2.6.4 (Kent and Mardia, 1996), but modifica-
tions to (2.63) can be made to correct for the bias either using unbiased
estimates for γ̂ij (Guyon, 1982), or better, using data tapers (Dahlhaus
and Künsch, 1987).

2.7 Parameterization of GMRFs⋆


In this section we will consider the case, where Q is a function of some
parameters θ. In this case it is important to know the set of values of θ for
which Q is SPD, hence x is a GMRF. Suppose the precision matrix Q is
parameterized by some parameter vector θ ∈ Θ. We assume throughout
that Q(θ) is symmetric and has strictly positive diagonal entries for
all θ ∈ Θ. In some cases the parameterization is such that Q(θ) is by
definition positive definite for all θ ∈ Θ. One such example is
⎛ ⎞
θ1  θ2 
π(x | θ) ∝ exp ⎝− (xi − xj )2 − x2 ⎠
2 i∼j 2 i i

which is a GMRF with a SPD precision matrix if θ1 > 0 and θ2 > 0.


In this section we are concerned with the case where the parameter
space Θ has to be restricted to ensure Q(θ) > 0, hence we need to know
the valid parameter space
Θ+ = {θ ∈ Θ : Q(θ) > 0} .

©฀2005฀by฀Taylor & Francis Group, LLC


76 THEORY OF GMRFs
+
Unfortunately, it is hard to determine Θ in general. However, it is
always possible to check if θ ∈ Θ+ by a direct verification if Q(θ) > 0
or not. This is most easily done by trying to compute the Cholesky
factorization, which will be successful iff Q(θ) > 0, see Section 2.4.
Although this ‘brute force’ method is possible, it is often computa-
tionally expensive, intellectually not satisfying and gives little insight in
properties of Θ+ . To obtain some knowledge we will in Section 2.7.1
study Θ+ for a stationary GMRF on a torus (see Section 2.6). By using
the properties of (block) circulant matrices some analytical results are
possible to obtain. These analytical results are also useful for precision
matrices that are Toeplitz, a relationship we comment on at the end
of Section 2.7.1.
Since the characterization of Θ+ is difficult, a frequently used
approach is to use a sufficient condition, diagonal dominance, to ensure
that Q(θ) is SPD. This is the theme for Section 2.7.2. This sufficient
condition restricts Θ to a subset Θ++ , say, where Θ++ ⊆ Θ+ . The
parameter space Θ++ can in most cases be determined analytically
without much effort. We will compare Θ++ with Θ+ using the exact
results obtained in Section 2.7.1.

2.7.1 The valid parameter space


Let a zero mean GMRF be defined on the torus Tn through the
conditional moments

E(xij | x−ij ) = − θi′ j ′ xi+i′ ,j+j ′ and
i′ j ′ =00 (2.64)
Prec(xij | x−ij ) = 1,
where the sum is over a few terms only, for example, |i′ | ≤ 1 and |j ′ | ≤ 1.
The elements in Q are found from (2.64) using Theorem 2.6:
Q(i,j),(i′ ,j ′ ) = θi−i′ ,j−j ′ , (2.65)
where θi′ j ′ = θ−i′ ,−j ′ due to the symmetry of Q. Here (i, j) is short for
the index i + jn1 .
We now specify the parameters θ as
{θi′ j ′ , i′ = 0, ±1, . . . , ±m1 , j ′ = 0, ±1, . . . , ±m2 } (2.66)
for some fixed m = (m1 , m2 )T , where the remaining terms are zero
and θ00 = 1. We assume in the following that n > 2m, to simplify the
discussion.
The precision matrix (2.65) is a block-circulant matrix and its
properties were discussed in Section 2.6. For this class of matrices, the

©฀2005฀by฀Taylor & Francis Group, LLC


PARAMETERIZATION OF GMRFs 77
eigenvalues are known to be
  ′ 
ii jj ′
λij (θ) = θi′ j ′ cos 2π + (2.67)
′ ′
n1 n2
ij

for i = 0, . . . , n1 − 1 and j = 0, . . . , n2 − 1. Recall that the eigenvalues


can be computed using the two-dimensional discrete Fourier transform
of the matrix (θi′ j ′ ), see Section 2.6.
Some properties about Θ+ can be derived from (2.67). We need the
notion of a polyhedron, which is defined as a space that can be built
from line segments, triangles, tetrahedra, and their higher-dimensional
analogues by gluing them together along their faces. Alternatively, a
polyhedron can be viewed as an intersection of half spaces.
Theorem 2.14 Let x be a GMRF on Tn with dimension n > 2m with
full conditionals as in (2.64) and m as defined in (2.66). Then the valid
parameter space Θ+ is a bounded and convex polyhedron.
A bounded polyhedron is also called a polytope.
Proof. From (2.67) it is clear that the eigenvalues are linear in θ. Since
Q(θ) is SPD iff all eigenvalues are strictly positive, it follows that
Θ+ = {θ : Cθ > 0}
for some matrix C, hence Θ+ is a polyhedron. Let θ ′ and θ ′ be two
configurations in Θ+ , then consider
θ(α) = αθ ′ + (1 − α)θ ′
for 0 ≤ α ≤ 1. As
C(θ(α)) = C(αθ ′ + (1 − α)θ ′ ) = αCθ ′ + (1 − α)Cθ ′ > 0,
it follows that θ(α) ∈ Θ+ and that Θ+ is convex. Furthermore, Θ+ is
bounded as maxi=j |Qij | < 1 (Section 2.1.6) as Qii = 1 for all i.
A further complication is that Θ+ also depends on n. To investigate
this issue, we now write Θ+ n to make this dependency explicit.
Assume θ ∈ Θ+ n and change the dimension to n′ keeping θ fixed. Can
+
we conclude that θ ∈ Θn′ ? A simple counterexample demonstrates that
this is not true in general. Let n = (10, 1)T with
θ00 = 1, θ±1,0 = 0.745, θ±2,0 = 0.333,
then all eigenvalues are positive. If we change the dimension to n′ =
(11, 1)T or n′ = (9, 1)T , then the smallest eigenvalue is negative. This
is rather disappointing; if we estimate θ for one grid size, we cannot
necessarily use the estimates for a different grid size.
However, if we reduce the dimension from n to n′ > 2m, where n/n′
is a positive integer (or a vector of positive integers), then the following
result states that Θ+ +
n ⊆ Θ n′ .

©฀2005฀by฀Taylor & Francis Group, LLC


78 THEORY OF GMRFs
Theorem 2.15 Let A and B be block-circulant matrices of dimension
(k1 n1 k2 n2 )2 × (k1 n1 k2 n2 )2 and (n1 n2 )2 × (n1 n2 )2 , respectively, with
entries A(i,j),(i′ ,j ′ ) = θi−i′ ,j−j ′ and B(i,j),(i′ ,j ′ ) = θi−i′ ,j−j ′ where k1 and
k2 are positive integers. Here, θ is defined in (2.66) with n > 2m and
θi′ j ′ = θ−i′ ,−j ′ . If A is SPD then B is SPD.
Proof. Both A and B are symmetric as θi′ j ′ = θ−i′ ,−j ′ . The eigenvalues
for A are  ′ 
 ii jj ′
λA
ij (θ) = θ ′
ij ′ cos 2π + (2.68)
′ ′
k1 n1 k2 n2
ij
for i = 0, . . . , k1 n1 − 1, j = 0, . . . , k2 n2 − 1. The eigenvalues for B are
  ′ 
ii jj ′
λB
ij (θ) = θ ′
ij ′ cos 2π + (2.69)
′ ′
n1 n2
ij

for i = 0, . . . , n1 − 1, j = 0, . . . , n2 − 1. Comparing (2.68) and (2.69), we


see that
λB A
ij (θ) = λk0 i,k1 j (θ)
for i = 0, . . . , n1 − 1, j = 0, . . . , n2 − 1, as n > 2m. Since A is SPD then
all λA
ij (θ) are strictly positive ∀ij, hence B is SPD.
One consequence of Theorem 2.15 is, for example, that
Θ+ + + +
n ⊇ Θ2n ⊇ Θ4n ⊇ Θ8n ,

but as we have shown by a counterexample,


Θ+ +
n ⊇ Θn+1

in general. Since the size of Θ+ + +


n , Θ2n , Θ4n , . . . is nonincreasing, we may
+
hope that the intersection of all Θn ’s is nonempty. We define

Θ+∞ = Θ+ n. (2.70)
n>2m

If we can determine Θ+
∞, then any configuration in this set is valid for
all n > 2m.
Theorem 2.16 The set Θ+ ∞ as defined in (2.70) is nonempty, bounded,
convex and
⎧ ⎫
⎨  ⎬
Θ+∞ = θ : θij cos(iω1 + jω2 ) > 0, (ω1 , ω2 ) ∈ [0, 2π)2 . (2.71)
⎩ ⎭
ij

Note that (2.71) corresponds to the SDF (2.58) being strictly positive.
 The diagonal dominance criterion, see Section 2.1.6, states that
Proof.
if ij |θij | < 1 then Q(θ) is SPD for any n > 2m, hence Θ+ ∞ is non-
+
empty. Further, Θ∞ is bounded as |θij | < 1 (see Section 2.1.6), and

©฀2005฀by฀Taylor & Francis Group, LLC


PARAMETERIZATION OF GMRFs 79
convex by direct verification. For the remaining part of the proof, it is
sufficient to show that θ ∈ Θ+ +
∞ implies that θ ∈ Θn for any n > 2m,
+ +
and if θ ∈ Θ∞ then θ ∈ Θn for at least one n > 2m. The first part
follows by the definition of Θ+ +
∞ . For the latter part, fix θ ∈ Θ∞ so that

θij cos(iω1∗ + jω2∗ ) < 0 (2.72)
ij

for some irrational numbers (ω1∗ , ω2∗ ). By continuity of (2.72), we can


find rational numbers i′ /n1 and j ′ /n2 such that
 ii′ jj ′
θij cos(2π + 2π ) < 0;
ij
n1 n2

hence θ ∈ Θ+ n.
Lakshmanan and Derin (1993) showed that (2.71) is equivalent to a
bivariate reciprocal polynomial not having any zeros inside the unit
bicircle. They use some classical results concerning the geometry of
zero sets of reciprocal polynomials and obtain a complete procedure for
verifying the validity of any θ and for identifying Θ+ ∞ . The approach
taken is still somewhat complex but explicit results for m = (1, 1)T are
known. We will now report these results.
Let m = (1, 1)T so θ can be represented as
⎡ ⎤
sym θ01 θ11
⎣sym 1 θ10 ⎦ ,
sym sym θ1−1
where the entries marked with ‘sym’ follow from θij = θ−i,−j . Then
θ ∈ Θ+
∞ iff the following four conditions are satisfied:

2 2 2 1 2
ρ = 4(θ11 + θ01 + θ1−1 − θ10 ) − 1 < 0, (2.73)
2
 2

2 4θ11 θ1−1 − θ10 + 2 (4θ01 (θ11 + θ1−1 ) − 2θ10 ) + ρ < 0,
 2

2 4θ11 θ1−1 − θ10 − 2 (4θ01 (θ11 + θ1−1 ) − 2θ10 ) + ρ < 0,
and either
 2 2
 2
R = 16 4θ11 θ1−1 − θ10 − (4θ01 (θ11 + θ1−1 ) − 2θ10 ) < 0
or
&  2

R ≥ 0, and R2 − − 8ρ 4θ11 θ1−1 − θ10 +
2 '2
3 (4θ01 (θ11 + θ1−1 ) − 2θ10 ) < 0. (2.74)
These inequalities are somewhat involved but special cases are of great
interest. First consider the case where θ01 = θ10 and θ11 = θ1−1 = 0,
which gives the requirement |θ01 | < 1/4.

©฀2005฀by฀Taylor & Francis Group, LLC


80 THEORY OF GMRFs

θ11

1/4

1/2 1/2

1/4 1/4 θ01

1/4

Figure 2.14 The valid parameter space Θ+∞ where θ01 = θ10 and θ11 = θ1−1 ,
where restriction to diagonal dominance Θ++ defined in (2.76), is shown as
gray.

Suppose now we include the diagonal terms θ01 = θ10 and θ11 = θ1−1 ,
the inequalities (2.73) to (2.74) reduce to |θ01 |−θ11 < 1/4, and θ11 < 1/4.
Figure 2.14 shows Θ+ in this case. The smaller gray area in Figure 2.14
is found from using a sufficient condition only, the diagonal dominant
criterion, which we discuss in Section 2.7.2.
The analytical results obtained for a stationary GMRF on Tn are
also informative for a nonstationary GMRF on the lattice In′ with full
conditionals

E(xij | x−ij ) = − θi′ j ′ xi+i′ ,j+j ′ and
i′ j ′ =00
(i+i′ ,j+j ′ )∈In′

Prec(xij | x−ij ) = 1.
The full conditionals equal those in (2.64) in the interior of the lattice,
but differ at the boundary. Let Θ+,In′ be the valid parameter space for
the GMRF on In′ . We now observe that the (block Toeplitz) precision
matrix for the GMRF on the lattice is a principal submatrix of the
(block-circulant) precision matrix for the GMRF on the torus, if n′ ≤
n − m. The consequence is that
+,I
Θ+
n ⊆ Θ n′ for n′ ≤ n − m. (2.75)
Further details are provided in Section 5.1.4. The block-Toeplitz and the

©฀2005฀by฀Taylor & Francis Group, LLC


PARAMETERIZATION OF GMRFs 81
block-circulant precision matrix are asymptotically equivalent matrices
+,I
so Θ+ ′
∞ is also the intersection of {Θn′ } for all n > 2m. This follows
from Grenander and Szegö (1984, p. 65) provided there exists δ > 0 such
that the SDF f (ω1 , ω2 ) > δ > 0, see also Gray (2002, Sec. 4.4).

2.7.2 Diagonal dominance


A frequently used approach in practice is to impose a sufficient condition,
diagonal dominance, to ensure that Q(θ) is SPD. The diagonal domi-
nance criterion is most often easy to treat analytically. On the downside,
the criterion is known to be conservative. After presenting this criterion
we will compare Θ++ with the exact results for Θ+ obtained in Section
2.7.1.
Recall from
 Section 2.1.6 that a matrix A is called diagonal dominant,
if |Aii | > j:j =i |Aij |, ∀i. If a precision matrix is diagonal dominant,
then

Qii > |Qij |, ∀i
j:j =i

as the diagonal is always strictly positive. It is easy to show that this is


a sufficient condition for Q > 0:
Theorem 2.17 Let A be an n×n diagonal-dominant symmetric matrix
with strictly positive diagonal entries, then A is SPD.
The converse of Theorem 2.17 is not true in general.
Proof. Let λ be an eigenvalue of A with eigenvector v and define i =
arg maxj|vj |, break ties arbitrarily. Since Av = λv it follows that λvi −
Aii vi = j:j =i Aij vj . Using the triangle inequality we see that
 
|λvi − Aii vi | ≤ |Aij vj | ≤ |vi | |Aij |
j:j =i j:j =i

so

λ ≥ Aii − |Aij |.
j:j =i

The lower bound is strictly positive as A is diagonal dominant and Aii >
0. As λ was any eigenvalue, all n eigenvalues of A are strictly positive
and A has full rank. Let Λ be a diagonal matrix with the eigenvalues
on the diagonal and the corresponding eigenvectors in the corresponding
column of V , so A = V ΛV T . For x = 0,
xT Ax = xT (V ΛV T )x = (V T x)T Λ(V T x) > 0,
hence A is SPD.

©฀2005฀by฀Taylor & Francis Group, LLC


82 THEORY OF GMRFs
Using the diagonal dominance criterion on the stationary GMRF
in Section 2.7.1, we obtain
⎧ ⎫
⎨  ⎬
Θ++ = θ : |θi′ j ′ | < 1 . (2.76)
⎩ ′ ′

i j =00

Compared to the complexity of locating Θ+ given by the inequali-


ties (2.73) to (2.74) for m = (1, 1)T , the simplicity of (2.76) is striking.
However, if Θ+ ∞ is too conservative we might lose more than we gain.
Let us reconsider the special cases of the inequalities (2.73) to (2.74).
First consider the case where θ01 = θ10 and θ11 = θ1−1 = 0. Diagonal
dominance gives the requirement |θ01 | < 1/4. This is the same as the
diagonal dominance criterion, hence Θ++ = Θ+ ∞ . If we include the
diagonal terms θ01 = θ10 and θ11 = θ1−1 , we obtain the constraints:
|θ01 | + |θ11 | < 1/4. Both Θ+∞ and Θ
++
are shown in Figure 2.14, where
++
Θ is the gray area. We see that Θ+ ∞ is twice the size of Θ
++
, hence
using a diagonal-dominant parameterization is much more restrictive
than necessary.
Extrapolating these results suggests that diagonal dominance as
a criterion for positive definiteness becomes more restrictive for an
increasing number of parameters. Although we do not have a formal
proof of this claim, we have verified it through simulation in the one-
dimensional case using an autoregressive process of order M . We can
define Θ+ implicit using the partial correlation function defined on
(−1, 1)M and compute the parameters θ using the Levinson algorithm
(Brockwell and Davis, 1987).

2.8 Bibliographic notes


The definitions of matrices and SPD matrices are extracted from Searle
(1982) and Harville (1997). The definitions of graph related terms
and conditional independence are extracted from Whittaker (1990) and
Lauritzen (1996).
Guyon (1995), Mardia (1988), Whittaker (1990) and Lauritzen (1996)
are alternative sources for the results in Section 2.2. Brook’s lemma
(Brook, 1964) is discussed by Besag (1974). Multivariate GMRFs are
discussed in Mardia (1988).
Circulant matrices are discussed by Davis (1979) and Gray (2002).
The derivation of the eigenvalues and eigenvectors in Section 2.6.1 follow
Gray (2002).
Algorithm 2.10 is the FFT algorithm to sample stationary Gaussian
fields on toruses and has been reinvented several times. An early
reference is Woods (1972), see also Dietrich and Newsam (1997), Hunt

©฀2005฀by฀Taylor & Francis Group, LLC


BIBLIOGRAPHIC NOTES 83
(1973), Krogstad (1989) and Wood and Chan (1994). Dietrich and
Newsam (1996) discuss a nice extension to conditional simulation by
cyclic embedding.
Gray (2002) discuss Toeplitz matrices and their circulant approxima-
tions, which is our source for Section 2.6.4, see also Grenander and Szegö
(1984) for a more technical discussion. The proofs in Section 2.6.5 are
based on some unpublished notes by J. Besag. Box and Tiao (1973) also
make use of (2.31) to evaluate a log density under constraints.
The statistical aspects in Section 2.3 and Section 2.4 are from Rue
(2001) and Rue and Follestad (2003). Numerical methods for sparse
matrices are discussed in Dongarra et al. (1998), Duff et al. (1989),
George and Liu (1981) and Gupta (2002) gives a review and comparison.
Section 2.3.1 follows any standard reference in numerical linear algebra,
for example, Golub and van Loan (1996). Wilkinson and Yeung (2002,
2004) discuss propagation algorithms and their connection to the sparse
matrix approach. Rue (2001) presents a divide-and-conquer strategy for
the simulation of large GMRFs using iterative numerical techniques for
SPD linear systems. Steinsland (2003) discusses sampling from GMRFs
using parallel algorithms for sparse matrices while Wilkinson (2004)
discuss parallel computing relevant for statistics in general.
A similar result to Theorem 2.7 can also be derived for the Cholesky
triangle of the covariance matrix. For details regarding conditioning by
kriging, see, for example, Chilés and Delfiner (1999), Cressie (1993) and
Lantuéjoul (2002).
Gaussian random fields with a continuous parameter obeying the
Markov property are not discussed in this chapter. Refer to Künsch
(1979) for a well-written survey of the subject, see also Pitt (1971),
Wong (1969), and Pitt and Robeva (2003).

©฀2005฀by฀Taylor & Francis Group, LLC


CHAPTER 3

Intrinsic Gaussian Markov


random fields

This chapter will introduce a special type of GMRFs, so-called intrinsic


GMRFs (IGMRF). IGMRFs are improper, i.e., they have precision
matrices not of full rank. We will use these quite extensively later on
as prior distributions in various applications. Of particular importance
are IGMRFs that are invariant to any trend that is a polynomial of the
locations of the nodes up to a specific order.
IGMRFs have been studied in some depth on regular lattices. There
have been different attempts to define the order of an IGMRF, Besag and
Kooperberg (1995) adopt a definition for GMRFs to IGMRFs, where the
order is defined through the chosen neighborhood structure. However,
Matheron (1973) uses the term to describe the level of degeneracy of
continuum intrinsic processes, and subsequently Künsch (1987) does the
same for IGMRFs. This chapter also describes IGMRFs on irregular
lattices and nonpolynomial IGMRFs, so a more general definition is
needed. Inspired by Matheron (1973) and Künsch (1987), we define the
order of an IGMRF as the rank deficiency of its precision matrix. This
seems to be a natural choice, in particular since we only discuss IGMRF’s
on finite graphs. Note, however, that any autoregression that merely has
an indeterminate mean is zero order according to Künsch (1987), but
first order according to our definition.
To prepare the forthcoming construction of IGMRF’s, we will start
with a section where we first introduce some additional notation and
describe forward differences and polynomials. We then discuss the
conditional properties of x|Ax, where x is a GMRF. This will be
valuable to understand and construct IGMRFs on finite graphs.

3.1 Preliminaries
3.1.1 Some additional definitions
The null space of a matrix A is the set of all vectors x such that Ax = 0.
The nullity of A is the dimension of the null space. For an n × m matrix
the rank is min{n, m} − k where k is the nullity. For a singular n × n
matrix A with nullity k, we denote by |A|∗ the product of the n − k non-

©฀2005฀by฀Taylor & Francis Group, LLC


86 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
zero eigenvalues of A. We label this term the generalized determinant
due to lack of any standard terminology.
The Kronecker product of two matrices A and B is denoted by A ⊗ B
and produces a larger matrix with a special block structure. Let A be a
n × m matrix and B a p × q matrix, then their Kronecker product
⎛ ⎞
A11 B . . . A1m B

A ⊗ B = ⎝ ... .. .. ⎟
. . ⎠
An1 B ... Anm B
is an np × mq matrix. For example,
⎛ ⎞
    1 2 3 2 4 6
1 2 1 2 3 ⎜4 5 6 8 10 12 ⎟
⊗ =⎜ ⎟
⎝0 0 0 −1 −2 −3⎠ .
0 −1 4 5 6
0 0 0 −4 −5 −6
Let A, B, C, and D be matrices of appropriate dimensions. The basic
properties of the Kronecker product are the following:
1. For a scalar a
A ⊗ (aB) = a(A ⊗ B)
(aA) ⊗ B = a(A ⊗ B)

2. Kronecker product distributes over addition


(A + B) ⊗ C = (A ⊗ C) + (B ⊗ C)
A ⊗ (B + C) = (A ⊗ B) + (A ⊗ C)

3. The Kronecker product is associative


(A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)

4. The Kronecker product is in general not commutative


A ⊗ B = B ⊗ A

5. Transpose does not invert order


(A ⊗ B)T = AT ⊗ B T

6. Matrix multiplication
(A ⊗ B)(C ⊗ D) = AC ⊗ BD

7. For invertible matrices A and B


(A ⊗ B)−1 = A−1 ⊗ B −1

©฀2005฀by฀Taylor & Francis Group, LLC


PRELIMINARIES 87
8. For an n × n matrix A and m × m matrix B
|A ⊗ B| = |A|m |B|n (3.1)

9. Rank of the Kronecker product of two matrices


rank(A ⊗ B) = rank(A) rank(B)

3.1.2 Forward differences


Intrinsic GMRFs of order k in one dimension are often constructed using
forward differences of order k. Here we introduce the necessary notation.

Definition 3.1 (Forward difference) Define the first-order forward


difference of a function f (·) as
∆f (z) = f (z + 1) − f (z).
Higher-order forward differences are defined recursively:
∆k f (z) = ∆∆k−1 f (z)
so
∆2 f (z) = f (z + 2) − 2f (z + 1) + f (z) (3.2)
and in general for k = 1, 2, . . .,
k  
k k j k
∆ f (z) = (−1) (−1) f (z + j).
j=0
j

For a vector z = (z1 , z2 , . . . , zn )T , ∆z has elements ∆zi = zi+1 − zi ,


i = 1, . . . , n − 1.
We can think of the forward difference of kth order as an approxi-
mation to the kth derivative of f (z). Consider, for example, the first
derivative
f (z + h) − f (z)
f ′ (z) = lim ,
h→0 h
which for h = 1 equals the first-order forward difference.

3.1.3 Polynomials
Many IGMRFs are invariant to the addition of a polynomial of a certain
degree. Here we introduce the necessary notation for polynomials, first
on a line and then in higher dimensions.
Let s1 < s2 < · · · < sn denote the ordered locations on the line and
define s = (s1 , . . . , sn )T . Let pk (si ) denote a polynomial of degree k,

©฀2005฀by฀Taylor & Francis Group, LLC


88 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
evaluated at the locations s,
1 1
pk (si ) = β0 + β1 si + β2 s2i + · · · + βk ski , (3.3)
2 k!
with some coefficients β k = (β0 , . . . , βk )T . In matrix notation, this is
⎛ ⎞ ⎛ 1 k
⎞⎛ ⎞
pk (s1 ) 1 s1 . . . k! s1 β0
⎜ pk (s2 ) ⎟ ⎜1 s2 . . . 1 sk2 ⎟ ⎜β1 ⎟
⎜ ⎟ ⎜ k! ⎟⎜ ⎟
⎜ .. ⎟ = ⎜ .. .. .. ⎟ ⎜ .. ⎟ , (3.4)
⎝ . ⎠ ⎝. . . ⎠⎝ . ⎠
1 k
pk (sn ) 1 sn . . . k! sn βk
which we write compactly as
pk = S k β k . (3.5)
The matrix S k is called the polynomial design matrix of degree k.
Throughout we assume that S k is of full rank k + 1.
Polynomials can also be defined in higher dimension, but this requires
more notation. Let
si = (si1 , si2 , . . . , sid )
denote the spatial location of node i, where sij is the location of node i
in the jth dimension, j = 1, . . . , d. We now use a compact notation and
define
j = (j1 , j2 , . . . , jd ),
sji = sji11 sji22 · · · sjidd , and
j! = j1 !j2 ! · · · jd !.
A polynomial trend of degree k in d dimensions will consist of
 
d+k
mk,d = (3.6)
k
terms and is expressed as
 1
pk,d (si ) = βj sji , (3.7)
j!
0≤j1 +···+jd ≤k

where jl ∈ {0, 1, . . . , k} for l ∈ {1, 2, . . . , d}.


For illustration, for d = 2 the number of terms mk,d is 3, 6 and 10 for
k = 1, 2 and 3. For d = k = 2, (3.7) is
1 1
p2,2 (si1 , si2 ) = β00 +β10 si1 +β01 si2 + β20 s2i1 +β11 si1 si2 + β02 s2i2 . (3.8)
2 2
Similar to (3.5), let pk,d denote the vector of the polynomial evaluated
at each of the n locations. We can express this vector as
pk,d = S k,d β k,d , (3.9)

©฀2005฀by฀Taylor & Francis Group, LLC


GMRFs UNDER LINEAR CONSTRAINTS 89
where β k,d is a vector of all the coefficients and S k,d is the polynomial
1 j
designmatrix with elements j! si . We do not consider degenerated cases
and assume therefore that S k,d is of full rank.

3.2 GMRFs under linear constraints

IGMRFs have much in common with GMRFs conditional on linear


constraints. In this section we consider proper (full-rank) GMRFs, and
derive their precision matrix, conditional on such linear constraints. We
then introduce informally IGMRFs. IGMRFs are always improper, i.e.,
their precision matrices do not have full rank.
Let x be a zero mean GMRF of dimension n with precision matrix Q >
0. Let λ1 , λ2 , . . . , λn be the eigenvalues and e1 , . . . , en the corresponding
eigenvectors of Q, such that

Q= λi ei eTi = V ΛV T ,
i

where V = (e1 , e2 , . . . , en ), V T V = I and Λ = diag(λ1 , λ2 , . . . , λn ).


Consider now the conditional density

π(x | Ax = a), (3.10)

where the k × n matrix A has the special form

AT = (e1 , e2 , . . . , ek ) (3.11)

and a = (a1 , . . . , ak )T is arbitrary. The specific form of A is not a


restriction as we will explain at the end of this section.
To derive the explicit form of (3.10), it is useful to change variables to
y = V T x, which can easily be shown to have mean zero and Prec(y) =
iid
Λ, i.e., yi ∼ N (0, λ−1 i ), i = 1 . . . , n. Now Ax = y 1:k where y i:j =
(yi , yi+1 , . . . , yj )T and
n

π(y | Ax = a) = 1[y1:k =a] π(yi ).
i=k+1

Hence E(y|Ax = a) = (aT , 0T )T and Prec(y|Ax = a) = Λ,  where


 = diag(0, . . . , 0, λk+1 , . . . , λn ), from which it follows that
Λ
 
a
E(x | Ax = a) = V = a1 e1 + · · · + ak ek and
0
Prec(x | Ax = a) = V ΛV  T.

©฀2005฀by฀Taylor & Francis Group, LLC


90 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
Hence (3.10) can be written as
n
n−k 1 
log π(x | Ax = a) = − log 2π + log λi
2 2
i=k+1
  T   
1 a  T a
− x−V V ΛV x−V (3.12)
2 0 0
n
n−k 1  1 
=− log 2π + log λi − xT Qx,
2 2 2
i=k+1

where Q  = V ΛV  T . Note that e1 , . . . , ek do not contribute to (3.12)


explicitly, and that (3.12) does depend on a only implicitly in the sense
that π(x|Ax = a) is nonzero only for Ax = a. Hence, for the specific
choice of A in (3.11), (3.10) has a particularly simple form. It can
be obtained from the corresponding unconstrained density π(x) by (a)
setting all eigenvalues in Λ that correspond to eigenvectors in A to zero
and (b) adjusting the normalizing constant accordingly.
Example 3.1 Assume
⎛ ⎞
6 −1 0 −1
⎜ −1 6 −1 0 ⎟

Q=⎝ ⎟,
0 −1 6 −1 ⎠
−1 0 −1 6

then Λ = diag(4, 6, 6, 8) and e1 = 1/2 · (1, 1, 1, 1)T , e2 = 1/2 ·
(1, 0, −1, 0)T , e3 = 1/2 · (0, −1, 0, 1)T and e4 = 1/2 · (1, −1, 1, −1)T .
If we now condition on eT1 x = 0, we obtain the conditional precision
matrix: ⎛ ⎞
5 −2 −1 −2
⎜ 5 −2 −1 ⎟
 = ⎜ −2
Q ⎟.
⎝ −1 −2 5 −2 ⎠
−2 −1 −2 5
Note that each row (and of course each column) sums up to zero.
Of key interest for IGMRFs is an improper version of the (log)
density (3.12), which we define as
n
n−k 1  1 
log π ∗ (x) = − log 2π + log λi − xT Qx (3.13)
2 2 2
i=k+1

for any x ∈ Rn . The rank of Q is n − k and appears because π ∗ (x) is


now defined for all x ∈ R , and not just for those x that satisfy Ax = a,
n

as was the case for (3.12).

©฀2005฀by฀Taylor & Francis Group, LLC


GMRFs UNDER LINEAR CONSTRAINTS 91
To see more specifically what is going on, note that any x ∈ R can n

be decomposed as
x = (c1 e1 + · · · + ck ek ) + (dk+1 ek+1 + · · · + dn en )
= x + x⊥ , (3.14)
where x is the part of x in the subspace spanned by the columns of
 and x⊥ is the part of x orthogonal to x ,
AT , the null space of Q,
T ⊥
where of course (x ) x = 0. For a given x, the coefficients in (3.14)
are ci = eTi x and dj = eTj x.
Using this decomposition we immediately see that
π ∗ (x) = π ∗ (x⊥ ), (3.15)
so π ∗ (x) does not depend on c1 , . . . , ck . Hence, π ∗ is invariant to the
addition of any x and this is the important feature of IGMRFs. Also
note that π ∗ (x⊥ ) is equal to π(x|Ax = a).
We can interpret π ∗ (x) as a limiting form of a proper density π̃(x).
Let π̆(x ) be the density of x and define the proper density
π̃(x) = π ∗ (x⊥ ) π̆(x ).
Let π̆(x ) be a zero mean Gaussian with precision matrix γI. Then in
the limit as γ → 0,
π̃(x) ∝ π ∗ (x). (3.16)
Roughly speaking, π ∗ (x) can be decomposed into the proper density for
x⊥ ∈ Rn−k times a diffuse improper density for x ∈ Rk .
Example 3.2 Consider again Example 3.1, where we now look at the
improper density π ∗ defined in (3.13) with ‘mean’ zero and ‘precision’
 Suppose we are interested in the density value π ∗ of the vector x =
Q.
(0, −2, 0, −2)T , which can be factorized into x = 2e1 + 2e4 , hence x =
2e1 and x⊥ = 2e4 . Since e1 is a constant vector, the density π ∗ (x)
is invariant to the addition of any arbitrary constant to x. This can
be interpreted as a diffuse prior on the overall level of x, i.e., x ∼
N (0, κ−1 I) with κ → 0.
Using this interpretation of π ∗ (x), we will now define how to generate a
‘sample’ from π ∗ (x), where we use quotes to emphasize that the density
is actually improper. Since the rank deficiency is only due to x , we
define that a sample from π ∗ (x) means a sample from the proper part
π ∗ (x⊥ ), bearing in mind (3.15). For known eigenvalues and eigenvectors
of Q  it is easy to sample from π ∗ (x⊥ ) using Algorithm 3.1.
Example 3.3 We have generated 1000 samples from π ∗ defined in Ex-
ample 3.2, shown in a ‘pairs plot’ in Figure 3.1. At first sight, these
samples seem well-behaved and proper, but note that 1T x = 0 for all
samples, and that the empirical correlation matrix is singular.

©฀2005฀by฀Taylor & Francis Group, LLC


92 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
Algorithm 3.1 Sampling from an improper GMRF with mean zero
1: for j = k + 1 to n do
2: yj ∼ N (0, λ−1
j )
3: end for
4: Return x = yk+1 ek+1 + yk+2 ek+2 + · · · + yn en

−10 0 10 20 −10 0 10 20

5 10
x1

−5 0
−15
20
10

x2
0
−10

20
10
x3

0
−10
20
10

x4
0
−10

−15 −5 0 5 10 −10 0 10 20

Figure 3.1 Pairs plot for 1000 samples from an improper GMRF with mean
and precision defined in Example 3.2.

As it will become clear later, in most cases the matrix Q  and the
eigenvectors e1 , . . . , ek are known explicitly by construction. However,
the remaining eigenvalues and eigenvectors will typically not be known.
Hence an alternative algorithm based on Algorithm 2.6 will be useful.
Here we use the fact that π ∗ (x⊥ ) equals π(x|Ax = 0), from which we
can sample in two steps: first sample from the unconstrained density
and then correct the obtained sample for the constraint Ax = 0 via
Equation (2.30). More specifically, for π(x) we have to use a zero mean
GMRF with SPD precision matrix
k

+
Q=Q ai ei eTi .
i=1

Note that this method works for any strictly positive values of a1 , . . . , ak ;

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF FIRST ORDER 93
for simplicity we may use a1 = . . . = ak = 1.
Example 3.4 In the previous example e1 eT1 is a matrix with entries
equal to 1/4, hence Q can be obtained by adding an arbitrary but strictly
 For any such Q, the correction step (2.30) now
positive value to Q.
simply corresponds to the subtraction of the (unweighted) mean value of
the sample.
Finally, we will comment on the specific form of A in (3.11). At first
sight this choice seems rather restrictive, but we will now explain why
this is not the case. Consider the more general situation where we want
to compute the conditional density of x|Bx = b for any k × n matrix
B with rank 0 < k < n. If Cov(x) = Σ > 0, then
−1
Cov(x | Bx) = Σ − ΣB T BΣB T BΣ. (3.17)

The generality of our argument is evident by verifying that the k columns


of B T span the null space of Cov(x|Bx), i.e.,
 −1 
Σ − ΣB T BΣB T BΣ B T = 0.

The condition Bx = b is then equivalent to Ax = a in (3.11) expressed


in terms of those eigenvectors of (3.17) that have zero eigenvalues.

3.3 IGMRFs of first order


Using the results from Section 3.2, we will now define (polynomial)
intrinsic GMRFs of first order. We start by defining an improper GMRF
with rank deficiency k.
Definition 3.2 (Improper GMRF) Let Q be an n × n SPSD matrix
with rank n − k > 0. Then x = (x1 , . . . , xn )T is an improper GMRF of
rank n − k with parameters (µ, Q), if its density is
 
−(n−k)
1/2 1
π(x) = (2π) 2 (|Q|∗ ) exp − (x − µ)T Q(x − µ) . (3.18)
2
Further, x is an improper GMRF wrt to the labelled graph G = (V, E),
where
Qij = 0 ⇐⇒ {i, j} ∈ E for all i = j.
Recall that | · |∗ denote the generalized determinant as defined in Section
3.1.1. The parameters (µ, Q) do no longer represent the mean and the
precision since they formally do not exist; however, for convenience we
will continue to denote them as the ‘mean’ and the ‘precision’, even
without the quotes. The Markov properties of an IGMRF are to be
interpreted as those obtained from the limit of a proper density. This is

©฀2005฀by฀Taylor & Francis Group, LLC


94 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
similar to the argument leading to (3.16). Let the columns of AT span
the null space of Q, then define

Q(γ) = Q + γAT A. (3.19)

Now each element in Q(γ) tends to the corresponding one in Q as γ → 0.


Similarly, a statement like
1 
E(xi | x−i ) = µi − Qij (xj − µj )
Qii j∼i

(using (2.5)) will be meaningful by the same limiting argument.


An IGMRF of first order is an improper GMRF of rank n − 1, where
the vector 1 spans the null space of Q.

Definition 3.3 (IGMRF of first order) An intrinsic GMRF of first


order is an improper GMRF of rank n − 1 where Q1 = 0.

The condition Q1 = 0 simply means that j Qij = 0, for all i.
We can relate this to the discussion in Section 3.2, using AT = 1.
It then follows directly that the density for an IGMRF of first order
is invariant to the addition of c1 1, for any arbitrary c1 , see (3.14)
and (3.15). To illustrate this feature, let µ = 0 so

1 
E(xi | x−i ) = − Qij xj ,
Qii j:j∼i

where − j:j∼i Qij /Qii = 1. Hence, the conditional mean of xi is simply
a weighted mean of its neighbors, but does not involve an overall level.
In applications, this ‘local’ behavior is often desirable. We can then
concentrate on the deviation from any overall mean level without having
to specify the overall mean level itself. Many IGMRFs are constructed
such that the deviation from the overall level is a smooth curve in time
or a smooth surface in space.

3.3.1 IGMRFs of first order on the line

We will now construct a widely used model known as the first-order


random walk. We first assume that the location of the nodes i are all
positive integers, i.e., i = 1, 2, . . . , n. It is not uncommon in this case
to think of i as time t. The distance between the nodes is constant and
equal to 1. We will later discuss a modification for the case where the
nodes are nonequally spaced, i.e., the distance between the nodes varies.

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF FIRST ORDER 95
The first-order random walk model for regular locations

The first-order random walk model is constructed assuming independent


increments
iid
∆xi ∼ N (0, κ−1 ), i = 1, . . . , n − 1. (3.20)
This immediately implies that

xj − xi ∼ N (0, (j − i)κ−1 ) for i < j.

Also, if the intersection between {i, . . . , j} and {k, . . . , l} is empty for


i < j and k < l, then

Cov(xj − xi , xl − xk ) = 0.

These properties are well known and coincide with those of a Wiener
process observed in discrete time. We will define the Wiener process
shortly in Definition 3.4.
The density for x is derived from its n − 1 increments (3.20) as
( n−1
)
(n−1)/2 κ 2
π(x | κ) ∝ κ exp − (∆xi )
2 i=1
( n−1
)
κ 
= κ(n−1)/2 exp − (xi+1 − xi )2
2 i=1
 
1
= κ(n−1)/2 exp − xT Qx , (3.21)
2

where Q = κR and where R is the so-called structure matrix :


⎛ ⎞
1 −1
⎜−1 2 −1 ⎟
⎜ ⎟
⎜ −1 2 −1 ⎟
⎜ ⎟
⎜ .. .. .. ⎟
R=⎜ . . . ⎟. (3.22)
⎜ ⎟
⎜ −1 2 −1 ⎟
⎜ ⎟
⎝ −1 2 −1⎠
−1 1

The form of R follows easily from


n−1

(∆xi )2 = (Dx)T (Dx) = xT D T Dx = xT Rx,
i=1

©฀2005฀by฀Taylor & Francis Group, LLC


96 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
where the (n − 1) × n matrix D has the form
⎛ ⎞
−1 1
⎜ −1 1 ⎟
⎜ ⎟
D=⎜ .. .. ⎟.
⎝ . . ⎠
−1 1
Note that the eigenvalues of R are equal to
λi = 2 − 2 cos (π(i − 1)/n) , i = 1, . . . , n, (3.23)
which can be used for analytic calculation of the generalized determinant
appearing in (3.18).
It is clear that Q1 = 0 by either verifying that
 π(x|κ) is invariant to
the addition of a constant vector c1 1, or that j Qij = 0 from (3.22).
The rank of Q is n − 1. Hence (3.21) is an IGMRF of first order,
by Definition 3.3. We denote this model by RW1(κ) or short RW1 as
an abbreviation for a random walk of first order.
The invariance to the addition of any constant to the overall mean is
evident from the full conditional distributions
1
xi | x−i , κ ∼ N ( (xi−1 + xi+1 ), 1/(2κ)), 1 < i < n, (3.24)
2
because there is no shrinkage toward an overall mean. An alternative
interpretation of the conditional mean can be obtained by fitting a first-
order polynomial, i.e., a simple line
p(j) = β0 + β1 j,
locally through the points (i−1, xi−1 ) and (i+1, xi+1 ) using least squares.
The conditional mean turns out to be equal to p(i).
If we fix the random walk at locations 1, . . . , i, future values have the
conditional distribution
xi+k | x1 , . . . , xi , κ ∼ N (xi , k/κ), 0 < i < i + k ≤ n.
Hence, this model gives a constant forecast equal to the last observed xi
with linearly increasing variance.
To give some intuition about the form of realizations of a RW1 model,
we have generated 10 samples of x⊥ with n = 99 and κ = 1. We set
x = 0 as described in Section 3.2. The samples are shown in Figure
3.2(a). The (theoretical) variances Var(x⊥ i ) for i = 1, . . . , n, are shown
in Figure 3.2(b) after being normalized with the average variance. There
is more variability at and near the ends compared to the interior. To
study the correlation structure, we have also computed Corr(x⊥ ⊥
n/2 , xi )
for i = 1, . . . , n and the result is shown in Figure 3.2(c). The behavior
is quasiexponential where the negative correlation at the endpoints with

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF FIRST ORDER 97
xn/2 is due to the sum-to-zero constraint x = 0. Note that a different κ
only involves a change of the scale; the correlation structure is invariant
with respect to κ, but not with respect to n.
Alternatively, we could define a random walk of first order on a circle,
i.e., we would include the pair x1 ∼ xn in the graph and correspondingly
the term (x1 − xn )2 in the sum in the exponent of (3.21). The precision
matrix will then be circulant, and so will be the inverse, which implies a
constant variance, see Section 2.6.1. Note that such a circular first-order
random walk is also a IGMRF of first order, the distribution of x will
still be invariant to the addition of a constant.
We now have a closer look at the elements in the structure matrix
(3.22) of the (noncircular) RW1 model. Each row in R (except for the
first and the last one) has coefficients −1, 2, −1, which are simply the
coefficients in −∆2 as defined in (3.2). So, if ∆xi are the increments,
then R consists of −∆2 terms, apart from corrections at the boundary.
We may interpret

−xi−1 + 2xi − xi+1


as an estimate of the negative second derivative of an underlying
continuous function x(t) at t = i, making use of the observations at
{i − 1, i, i + 1}. In Section 3.4 when we consider random walks of second
order, ∆2 xi are the increments and the precision matrix will consist
of −∆4 terms plus boundary corrections. A closer look into the theory
of constructing continuous splines will provide further insight in this
direction, see, for example, Gu (2002).

The first-order random walk model for irregular locations

We will now discuss the case where xi is assigned to a location si but


where the distance between si+1 and si is not constant. Without loss of
generality we assume that s1 < s2 < · · · < sn , and define the distance

δi = si+1 − si . (3.25)

To obtain the precision matrix in this case, we will consider xi as the


realization of an integrated Wiener process in continuous time, at time
si .
Definition 3.4 (Wiener process) A Wiener process with precision κ
is a continuous-time stochastic process W (t) for t ≥ 0 with W (0) = 0
and such that the increments W (t) − W (s) are normal with mean 0
and variance (t − s)/κ for any 0 ≤ s < t. Furthermore, increments for
nonoverlapping time intervals are independent. For κ = 1, this process
is called a standard Wiener process.

©฀2005฀by฀Taylor & Francis Group, LLC


98 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS

15
10
5
0
−5

0 20 40 60 80 100

(a)
2.0
1.5
1.0
0.5
0.0

0 20 40 60 80 100

(b)
1.0
0.8
0.6
0.4
0.2
0.0
−0.2

0 20 40 60 80 100

(c)

Figure 3.2 Illustrations of the properties of the RW1 model with n = 99: (a)
displays 10 samples of x⊥ , (b) displays Var(x⊥ i ) for i = 1, . . . , n normalized
with the average variance, and (c) displays Corr(x⊥ ⊥
n/2 , xi ) for i = 1, . . . , n.

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF FIRST ORDER 99
The full conditional in the above case has moments
δi δi−1
E(xi | x−i , κ) = xi−1 + xi+1
δi−1 + δi δi−1 + δi
 
1 1
Prec(xi | x−i , κ) = κ + .
δi−1 δi
Here κ is a precision parameter and chosen so that we obtain the same
result as in (3.24) if δi = 1 for all i.
The precision matrix can now be obtained using Theorem 2.6 as

1 1
⎨ δi−1 + δi j = i

Qij = κ − δ1 j =i+1

⎩ i
0 otherwise

for 1 < i < n where the Qi,i−1 terms are found via Qi,i−1 = Qi−1,i .
A proper correction at the boundary (implicitly we use a diffuse prior
for W (0) rather than the fixed W (0) = 0) gives the remaining diagonal
terms Q11 = κ/δ1 , Qnn = κ/δn−1 . Clearly, Q1 = 0 still holds and the
joint density of x,
( n−1
)
κ 
π(x | κ) ∝ κ(n−1)/2 exp − (xi+1 − xi )2 /δi , (3.26)
2 i=1

is invariant to the addition of a constant. The scaling with δi is because


Var(xi+1 − xi ) = δi /κ according to the continuous-time model.
The interpretation of a RW1 model as a discretely observed Wiener
process (continuous in time) justifies the corrections needed for non-
equally spaced locations. Hence, the underlying model is the same, it is
only observed differently.
To compare this model with its regular counterpart, we repro-
duced Figure 3.2 with n = 99, but now with the locations s2 , . . . , s49
sampled uniformly between s1 = 1 and s50 = 50. The locations si
for i = 51, . . . , 99 are obtained requiring symmetry around s50 = 50:
si + s100−i = 100 for i = 1, . . . , n = 99. The results are shown in Figure
3.3.1 and show a very similar behavior as in the regular case. Note,
however, that in the case of nonsymmetric distributed random locations
between s1 = 1 and s99 = 99 the marginal variances and correlations
are not exactly symmetric around s50 = 50 (not shown).

When the mean is only locally constant

The approach to model Q through forward differences of first order as


normal increments does more than ‘just’ being invariant to the addition

©฀2005฀by฀Taylor & Francis Group, LLC


100 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS

10
5
0
−5

0 20 40 60 80 100

(a)
2.0
1.5
1.0
0.5
0.0

0 20 40 60 80 100

(b)
1.0
0.8
0.6
0.4
0.2
0.0
−0.2

0 20 40 60 80 100

(c)

Figure 3.3 Illustrations of the properties of the RW1 model with n = 99


irregular locations with s1 = 1 and sn = n: (a) displays 10 samples of x⊥
, (b) displays Var(x⊥i ) for i = 1, . . . , n normalized with the average variance,
and (c) displays Corr(x⊥ ⊥
n/2 , xi ) for i = 1, . . . , n. The horizontal axis relates to
the locations si .

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF FIRST ORDER 101
of a constant. Consider the alternative IGMRF
( n
)
κ 
π(x) ∝ κ(n−1)/2 exp − (xi − x)2 , (3.27)
2 i=1

where x is the empirical mean of x. Without writing out the precision


matrix, we immediately see that π(x) is invariant to the addition of a
constant and that its rank is n−1, hence (3.27) defines an IGMRF of first
order. Although both (3.27) and the RW1 model (3.21) are maximized
at x ∝ 1, the benefit of the RW1 model is evident when we consider the
following vector x. Assume n is even and

0, 1 ≤ i ≤ n/2
xi = (3.28)
1, n/2 < i ≤ n.

Thus x is locally constant with two levels. If we evaluate the density


of (3.28) under the RW1 model and the alternative (3.27), we obtain
κ κ
κ(n−1)/2 exp − and κ(n−1)/2 exp −n ,
2 8
respectively. The log ratio of the densities is then of order O(n). The
reason for the drastic difference is that the RW1 model only penalizes
the local deviation from a constant level (interpret ∆xi as the derivative
at location i) whereas (3.27) penalizes the global deviation from a
constant level. This local behavior of the RW1 model is obviously quite
advantageous in applications if the mean level of x is approximately
or locally constant. A similar argument will also apply to polynomial
IGMRFs of higher order; those will be constructed using forward
differences of order k as independent normal increments.

3.3.2 IGMRFs of first order on lattices


For regular or irregular lattices, the construction of IGMRFs of first
order follows the same concept, but is now based on ‘independent’
increments writing ‘·’ due to (hidden) linear constraints imposed by the
more complicated geometry. We first look at the irregular case.

First-order IGMRFs on irregular lattices


Let us reconsider Figure 2.4(a) and the map of the 544 regions in
Germany. Here we may define two regions as neighbors if they share
a common border. The corresponding graph is shown in Figure 2.4(b).
Between neighboring regions i and j, say, we define a normal increment
xi − xj ∼ N (0, κ−1 ) (3.29)

©฀2005฀by฀Taylor & Francis Group, LLC


102 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
and the assumption of ‘independent’ increments yields the IGMRF
model:
⎛ ⎞
κ 
π(x) ∝ κ(n−1)/2 exp ⎝− (xi − xj )2 ⎠ . (3.30)
2 i∼j

Here i ∼ j denotes the set of all unordered pairs of neighbors. The


requirement for the pair to be unordered prevents us from double
counting as i ∼ j ⇔ j ∼ i. Note that the number of increments |i ∼ j|
is typically larger than n, but the rank of the corresponding precision
matrix is still n − 1. This implies that there are hidden constraints in
the increments due to the more complicated geometry on a lattice than
on the line, hence the use of the term ‘independent’. To see this consider
the following simple example.
Example 3.5 Let n = 3 where all nodes are neighbors. Then (3.29)
gives x1 − x2 = ǫ1 , x2 − x3 = ǫ2 , and x3 − x1 = ǫ3 , where ǫ1 , ǫ2 and
ǫ3 are the increments. Adding the two first equations and comparing
with the last implies that ǫ1 + ǫ2 + ǫ3 = 0, which is the ‘hidden’ linear
constraint.
However, under the linear constraints the density of x is (3.30) by the
following argument. Let ǫ be |i ∼ j| independent increments and Aǫ the
linear constraints saying that the increments sum to zero over all circuits
in the graph. Then π(ǫ|Aǫ) ∝ π(ǫ) if ǫ is a configuration satisfying the
constraint from (2.31). We now change variables from ǫ to x and we
obtain the exponent in (3.30). See also Besag and Higdon (1999, p. 740).
The normalization constant follows from the generalized determinant of
Q, the precision matrix of x.
The density (3.30) is equal to the density of an RW1 model if we are
on a line and i ∼ j iff |i − j| = 1. However, in general there is no longer
an underlying continuous stochastic process that we can relate to this
density. Hence, if we change the spatial resolution or split a region into
two new ones, we really change the model.
Let ni denote the number of neighbors of region i. The precision matrix
Q in (3.30) has elements


⎨ni i = j,
Qij = κ −1 i ∼ j, (3.31)


0 otherwise,

from which it follows directly that


1  1
xi | x−i , κ ∼ N ( xj , ). (3.32)
ni j:j∼i ni κ

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF FIRST ORDER 103
Similar to the RW1 model for nonequally spaced locations, we can
incorporate positive and symmetric weights wij for each pair of adjacent
nodes i and j in (3.30). For example, one could use the inverse Euclidean
distance between the centroids of each region.
Assuming the ‘independent’ increments
xi − xj ∼ N (0, 1/(wij κ)),
the joint density becomes
⎛ ⎞
κ
π(x) ∝ κ(n−1)/2 exp ⎝− wij (xi − xj )2 ⎠ . (3.33)
2 i∼j

The corresponding precision matrix Q now has elements


⎧

⎨ k:k∼i wik i = j,
Qij = κ −wij i ∼ j, (3.34)


0 otherwise,
and the full conditional π(xi |x−i , κ) is normal with mean and precision

xj wij 
j:j∼i and κ wij ,
j:j∼i wij j:j∼i

respectively. It is straightforward to verify that Q1 = 0 both for (3.31)


and (3.34).
Example 3.6 We will now have a closer look at the IGMRFs defined
in (3.30) and (3.33) using the graph shown in Figure 2.4(b) found from
the map of the administrative regions in Germany, as shown in Figure
2.4(a). Throughout we show samples of x⊥ and discuss its properties,
as argued for in Section 3.2. Additionally, we fix κ = 1. Note that the
correlation structure for x⊥ does not depend on κ. Of interest is how
samples from the density look, but also how much the variance of x⊥ i
varies through the graph.
We first consider (3.30) and display in Figure 3.4 two samples of
x⊥ . The gray scale is linear from the minimum value to the maximum
value, and the samples shows a relatively smooth variation over the
map. The dependence of the conditional precision on the number of
neighbors (3.32) is typically inherited to the (marginal) variance, al-
though the actual form of the graph also plays a certain role. In (c) we
display the scaled variance in relationship to the number of neighbors.
The scaling is so that the average variance is 1. We added some random
noise to the number of neighbors to make each dot visible. It can be seen
clearly that, on average, the variance increases with decreasing number of
neighbors, with a range from 0.7 to 2.0. This is a considerable variation.

©฀2005฀by฀Taylor & Francis Group, LLC


104 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
In particular, the large values of the variance for regions with only one
neighbor are worrisome. A possible (ad hoc) remedy is to choose all
adjacent regions, but if there is only one neighbor, we also add all the
neighbors of this neighbor. Unfortunately, the variation of the variance
for the others is about the same (not shown).
It is tempting to make an adjustment for the ‘distance’ d(i, j) between
region i and j, which is incorporated in (3.33) using the weights wij =
1/d(i, j). A feasible choice is to use the Euclidean distance between the
centroids of region i and j. When we have access to the outline of each
region, this is easy to compute using contour-integrals. Note that the
correlation structure only depends on the distances up to a multiplicative
constant for the same reason as the correlation structure does not depend
on the actual value of κ.
Figure 3.5(a) and (b) displays two samples of x⊥ with the distance
correction included and (c) the variance in relation to the number of
neighbors. The samples look slightly smoother compared to Figure 3.4,
but a closer study is needed to quantify this more objectively. Most
regions with only one neighbor are in the interior of its neighbor, typically
representing a larger city within its rural suburbs. In this case, the
distance between the city region and its surrounding region may be small.
One (ad hoc) remedy out of this is to apply the same correction as for
the unweighted model but increase the number of neighbors for those
regions with only one neighbor. The adjustment with distances results in
a more homogeneous variance, but the variance still decreases due to the
increasing number of neighbors (not shown).

First-order IGMRFs on regular lattices


For a lattice In with n = n1 n2 nodes, let ij or (i, j) denote the node in
the ith row and jth column that also defines its location. In the interior,
we can now define the nearest four sites of ij as its neighbors, i.e.,
(i + 1, j), (i − 1, j), (i, j + 1), (i, j − 1).
Without further weights, the corresponding precision matrix is (3.31),
and the full conditionals of xi are given in (3.32). In the interior of the
lattice, π(xij |x−ij , κ) is normal with mean
1
(xi+1,j + xi−1,j + xi,j+1 + xi,j−1 ) (3.35)
4
and precision 4κ. Of course, one could also introduce weights in the
formulation as in (3.34).
However, here an extended anisotropic model can be used, weighting
the horizontal and vertical neighbors differently. More specifically,

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF FIRST ORDER 105

(a) (b)
2.0
1.5
1.0
0.5

2 4 6 8 10

(c)

Figure 3.4 Figures (a) and (b) display two samples from an IGMRF with
density (3.30) where two regions sharing a common border are considered
as neighbors. Figure (c) displays the variance in relation to the number of
neighbors, demonstrating that the variance decreases if the number of neighbors
increases.

©฀2005฀by฀Taylor & Francis Group, LLC


106 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS

(a) (b)
2.0
1.5
1.0
0.5

2 4 6 8 10

(c)

Figure 3.5 Figures (a) and (b) display two samples of x⊥ from an IGMRF with
density (3.33) where two regions sharing a common border are considered as
neighbors and weights are used based in the distance between centroids. Figure
(c) displays the variance in relation to the number of neighbors, demonstrating
that the variances decreases if the number of neighbors increases.

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF FIRST ORDER 107
suppose the conditional mean is
1 ′
(α (xi+1,j + xi−1,j ) + α′′ (xi,j+1 + xi,j−1 )) (3.36)
4
with positive parameters α′ and α′′ , constrained to fulfill α′ +α′′ = 2. The
conditional precision is still 4κ. This model can also be obtained by using
independent first-order increments in each direction and conditioning on
the sums over closed loops being zero (Besag and Higdon, 1999), see
Künsch (1999) for a rigorous proof in the infinite lattice case.
In applications, α′ (or α′′ ) can be treated as an unknown parameter,
so the degree of anisotropy can be estimated from the data. To estimate
α′ it is necessary to compute |Q|∗ of the corresponding precision matrix
Q in the density (3.18). Note that Q can be written as a Kronecker
product

Q = α′ Rn1 ⊗ I n2 + α′′ I n1 ⊗ Rn2 ,


where Rn is the structure matrix (3.22) of the RW1 model of dimension
n × n and I m is the identity matrix of dimension m × m. Hence, using
(3.23) and a result for such sums of Kronecker products, the eigenvalues
of Q can be calculated without imposing toroidal boundary conditions,

λij = 2 α′ (1 − cos(π(i − 1)/n1 )) + α′′ (1 − cos(π(j − 1)/n2 ))

for i = 1, . . . , n1 and j = 1, . . . , n2 . The generalized determinant |Q|∗


can then be easily computed.
Recently Mondal and Besag (2004) have shown that there is a
close relationship between first-order IGMRFs on regular (infinite)
lattices and the (continuous) de Wijs process (Chilés and Delfiner,
1999, Matheron, 1971). In fact, the de Wijs process can be interpreted
as the limit of model (3.35) on a sequence of ever finer lattices that
are superimposed on the study region. There is also a corresponding
result for the asymmetric model (3.36). This is an important result, as
it provides a link to an underlying continuous process, similar to the
close correspondence between a first-order random walk and the Wiener
process. The de Wijs process has good analytic properties and was for
this reason widely used by the pioneers of geostatistics. It also has a
remarkable property, often called the de Wijs formula: Let V be a set in
Rd and {vi } a partition of V , where V and each vi are geometrically
similar. Let v be one of {vi } selected uniformly. Then, the variance
between the mean of the process in v and V (the so-called dispersion
variance) is proportional to the log ratio of the volumes of V and v
regardless of the scale.

©฀2005฀by฀Taylor & Francis Group, LLC


108 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
An alternative limit argument for the first-order IGRMF
The first-order IGMRF is an improper GMRF of rank n − 1 where Q1 =
0. In Section 3.2 we argue for this construction as the limit when the
precision of x = 1T x tends to zero. In particular, Q can be seen as the
limit of the (proper) precision matrix
Q(γ) = Q + γ11T , γ > 0,
as γ → 0+ , see (3.19). Consider instead Q as the limit of

Q(γ) = Q + γI,
which is not in coherence with the general discussion in Section 3.2. Note

that Q(γ) is a completely dense matrix while Q(γ) is sparse if Q is.
To see that this is an alternative formulation to obtain IGMRFs in

the limit, let x ∼ N (0, Q(γ) −1
). Then
Prec(x | 1T x) − Q = cγ. (3.37)

where  ·  is either the weak norm, in which case c = (n − 1)/n or the
strong norm, in which case c = 1, see Definition 2.7 for the definition of
the weak and strong norm.
For example, suppose we use the precision matrix of a first-order
IGMRF and want to make it proper. The arguments in Section 3.2
then suggest to use Q(γ) with a small γ, with the consequence that
Q(γ) is a completely dense matrix. However, (3.37) suggests that a
good approximation is to use Q(γ), which maintains the sparsity of the
precision matrix Q. The error in the approximation measured in either
the weak or strong norm is cγ.
To verify (3.37), let {(λi , ei )} denote the eigenvalue/vector pairs for
of Q for i = 1, . . . , n where λ1 = 0 and e1 = 1. The corresponding pairs

of Q(γ) are
(γ, 1), (λ2 + γ, e2 ), . . . , (λn + γ, en ).
Note that the eigenvectors remain the same. The eigenvalue/vector pairs
of the conditional precision matrix of x, conditional on 1T x, are given
in Section 3.2 as
(0, 1), (λ2 + γ, e2 ), . . . , (λn + γ, en ).
and (3.37) follows.

3.4 IGMRFs of higher order


We will now discuss higher-order IGMRFs on the line and on regular
lattices. Higher-order IGMRFs have a rank deficiency larger than one
and can be defined on the line or in higher dimensions. The main idea

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF HIGHER ORDER 109
is to extend the class of functions for which the improper density is
invariant. We first consider (polynomial) IGMRFs of order k, which are
invariant to the addition of all polynomials with degree less than or equal
to k −1. For example, a second-order IGMRF is invariant to the addition
of a first-order polynomial, i.e., a line in one dimension and a plane in
two dimensions. In Section 3.4.3 we consider examples of nonpolynomial
IGMRFs.

3.4.1 IGMRFs of higher order on the line


Let s1 < s2 < · · · < sn denote the ordered locations of x = (x1 , . . . , xn )
and define s = (s1 , . . . , sn )T . An IGMRF of order k on the line is an
improper GMRF of rank n − k, where the columns of the design matrix
S k−1 , as defined in (3.4), are a basis for the null space of the precision
matrix Q.
Definition 3.5 (IGMRF of kth order on the line) An IGMRF of
order k is an improper GMRF of rank n − k, where QS k−1 = 0, with
S k−1 defined as in (3.4).
The rank deficiency of Q is k. The condition QS k−1 = 0 simply means
that
1 1
− (x + pk−1 )T Q(x + pk−1 ) = − xT Qx
2 2
for any coefficients β0 , . . . , βk−1 in (3.3). The density (3.18) is hence
invariant to the addition of any polynomial of degree k − 1, pk−1 , to x.
An alternative view is to decompose x as
x = trend(x) + residuals(x)
= H k−1 x + (I − H k−1 )x, (3.38)
where the ‘trend’ is of degree k − 1. Note that the ‘trend’ corresponds
to x and the ‘residuals’ corresponds to x⊥ in (3.14). The projection
matrix
H k−1 = S k−1 (S Tk−1 S k−1 )−1 S Tk−1
projects x down to the space spanned by the columns of S k−1 , and
I −H k−1 to the space orthogonal to that. The matrix H k−1 is commonly
named the hat matrix and I −H k−1 is symmetric, idempotent, satisfying
S Tk−1 (I − H k−1 ) = 0 and has rank n − k.
If we reconsider the quadratic term using the decomposition (3.38),
we obtain
1 1 T
− xT Qx = − ((I − H k−1 )x) Q ((I − H k−1 )x) ,
2 2
where the simplification is due to QH k−1 = 0. The interpretation is
that the density of a kth order IGMRF only depends on the ‘residuals’
after removing any polynomial ‘trend’ of degree k − 1.

©฀2005฀by฀Taylor & Francis Group, LLC


110 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
The second-order random walk model for regular locations
Assume now that si = i for i = 1, . . . , n, so the distance between
consecutive nodes is constant and equal to 1. Following the forward
difference approach, we may now use the second-order increments
∆2 xi ∼ N (0, κ−1 )
for i = 1, . . . , n − 2, to define the joint density of x:
( n−2
)
(n−2)/2 κ 2
π(x) ∝ κ exp − (xi − 2xi+1 + xi+2 ) (3.39)
2 i=1
 
(n−2)/2 1 T
= κ exp − x Qx ,
2
where the precision matrix is
⎛ ⎞
1 −2 1
⎜−2 5 −4 1 ⎟
⎜ ⎟
⎜ 1 −4 6 −4 1 ⎟
⎜ ⎟
⎜ 1 −4 6 −4 1 ⎟
⎜ ⎟
⎜ .. .. .. .. .. ⎟
Q=κ ⎜ . . . . . ⎟. (3.40)
⎜ ⎟
⎜ 1 −4 6 −4 1 ⎟
⎜ ⎟
⎜ 1 −4 6 −4 1 ⎟
⎜ ⎟
⎝ 1 −4 5 −2⎠
1 −2 1
We can verify directly that QS 1 = 0 and that the rank of Q is n − 2.
Hence this is an example of an IGMRF of second-order, invariant to
the addition of any line to x. This model is known as the second-order
random walk model, which we denote by RW2(κ) or simply RW2.
Remark. Although this is the RW2 model defined and used in the
literature, in Section 3.5 we will demonstrate that we cannot extend
it consistently to the case where the locations are irregular. Similar
problems occur if we increase the resolution from n to 2n locations,
say. Therefore, in Section 3.5 an alternative derivation with the desired
continuous time interpretation will be presented, where we are able to
correct consistently for irregular locations.
The full conditionals of the second-order random walk are easy to read
off from Q. The conditional mean and precision is
4 1
E(xi | x−i , κ) = (xi+1 + xi−1 ) − (xi+2 + xi−2 ),
6 6
Prec(xi | x−i , κ) = 6κ,
respectively, for 2 < i < n − 2 with obvious modifications in the other
cases. Some intuition about the coefficients in the conditional mean can

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF HIGHER ORDER 111
be gained if we consider the second-order polynomial
1
p(j) = β0 + β1 j + β2 j 2
2
and compute the coefficients by a local least-squares fit to the points
(i − 2, xi−2 ), (i − 1, xi−1 ), (i + 1, xi+1 ), (i + 2, xi+2 ).
Just as the conditional mean of the first-order random walk equals
the local first-order polynomial interpolation, here it turns out that
E(xi |x−i , κ) = p(i).
If we fix the second-order random walk at the locations 1, . . . , i, future
values have the conditional moments
E(xi+k | x1 , . . . , xi , κ) = (1 + k)xi − kxi−1 ,
(3.41)
Prec(xi+k | x1 , . . . , xi , κ) = κ/(1 + 22 + · · · + k 2 ),
where 2 ≤ i < i + k ≤ n. Hence the conditional mean is the linear
extrapolation based on the last two observations xi−1 and xi , with
cubically increasing variance, since 1 + 22 + · · · + k 2 = k(k + 1)(2k + 1)/6.
Based on the discussion in Section 3.3.1, we note that the precision
matrix Q consists of terms −∆4 apart from the corrections near the
boundary, meaning that
xi−2 − 4xi−1 + 6xi − 4xi−1 + xi+2
can be interpreted as an estimate of the negative 4th derivative of an
underlying continuous function x(t) at location t = i making use of the
observed values at {i − 2, . . . , i + 2}.
Figure 3.4.1 displays some samples of x⊥ from the RW2 model using
n = 99, the variance and the correlation between x⊥ ⊥
n/2 and xi for
i = 1, . . . , n. The samples are now much smoother than those obtained
in Figure 3.2 using the RW1 model. On the other hand, the variability
in the variance is somewhat larger, especially near the boundary. The
correlation structure is also more prominent, as the two constraints
induce a strong negative correlation.

The second-order random walk model for irregular locations


We will now discuss possible modifications to extend the definition of
the RW2 model to irregular locations. Those proposals are somewhat
ad hoc, so later, in Section 3.5, we present a theoretically sounder, but
also slightly more involved approach based on an underlying continuous
integrated Wiener process. The two models presented here are to be
considered as simpler alternatives that ‘weight’ in some sense the RW2
model, with weights typically proportional to the inverse distances
between consecutive locations.

©฀2005฀by฀Taylor & Francis Group, LLC


112 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS

200
150
100
50
0
−50
−100
−150

0 20 40 60 80 100

(a)
4
3
2
1
0

0 20 40 60 80 100

(b)
1.0
0.5
0.0
−0.5
−1.0

0 20 40 60 80 100

(c)

Figure 3.6 Illustrations of the properties of the RW2 model with n = 99: (a)
displays 10 samples of x⊥ , (b) displays Var(x⊥ i ) for i = 1, . . . , n normalized
with the average variance, and (c) displays Corr(x⊥ ⊥
n/2 , xi ) for i = 1, . . . , n.

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF HIGHER ORDER 113
A weighted version of the RW2 model can be obtained in many ways.
One simple approach is to rewrite the second-order difference in (3.39)
as
xi − 2xi+1 + xi+2 = (xi+2 − xi+1 ) − (xi+1 − xi )
and to replace this with a weighted version
wi+1 (xi+2 − xi+1 ) − wi (xi+1 − xi ),
where wi > 0. This leads to the joint density
( n−2  2 )
κ 2 wi wi
π(x | κ) ∝ exp − w xi+2 − (1 + )xi+1 + xi .
2 i=1 i+1 wi+1 wi+1

assuming a diffuse prior for x1 and x2 . It is clear that for i = 3, . . . , n,


 
wi−2 wi−2
E (xi | xi−1 , xi−2 , κ) = 1 + xi−1 − xi−2
wi−1 wi−1
2
Var (xi | xi−1 , xi−2 , κ) = 1/(κwi−1 ).
If wi is an inverse distance between the locations si+1 and si , i.e., wi =
1/δi , the conditional mean is the linear extrapolation of the values xi−2
and xi−1 to time i. The conditional mean is a consistent generalization
of (3.41), but the conditional variance is quadratic in δi−1 rather than
cubic.
An alternative approach fixes this problem. Start with the unweighted
directed model
xi+1 | x1 , . . . , xi , κ ∼ N (2xi − xi−1 , κ−1 )
from which a generalization of equations (3.41) is derived,
 
k k
E(xi+k | xi , xi−s ) = 1 + xi − xi−s
s s
k(k + s)(2ks + 1)
Var(xi+k | xi , xi−s ) = .
6κs
Extending the integer ‘distances’ k and s to real valued distances, this
would thus suggests defining a RW2 model for irregular locations as
 
δi−1 δi−1
E(xi | xi−1 , xi−2 ) = 1 + xi−1 − xi−2
δi−2 δi−2
δi−1 (δi−1 + δi−2 )(2δi−1 δi−2 + 1)
Var(xi | xi−1 , xi−2 ) =
6κδi−2
for i = 3, . . . , n. Again, the conditional mean is the linear extrapolation
of the values xi−1 and xi−2 , but the conditional variance now has a
different form and is cubic in the δ’s.

©฀2005฀by฀Taylor & Francis Group, LLC


114 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
3.4.2 IGMRFs of higher order on regular lattices⋆
We now consider the construction of polynomial IGMRFs of higher order
on regular lattices of dimension d > 1. Basically, the idea is that the
precision matrix is orthogonal to a polynomial design matrix of a certain
degree. We first define this in general and then consider special cases for
d = 2.

The general construction


An IGMRF of kth order in d dimensions is an improper GMRF of rank
n − mk−1,d , where the columns of the polynomial design matrix S k−1,d
are a basis of the null space of the precision matrix Q.
Definition 3.6 (IGMRFs of order k in dimension d) An IGMRF
of order k in dimension d, is an improper GMRF of rank n − mk−1,d
where QS k−1,d = 0 with mk−1,d and S k−1,d as defined in (3.6)
and (3.9).

A second-order polynomial IGMRF in two dimensions


Let us consider a regular lattice In in d = 2 dimensions. To construct a
second-order IGMRF we choose the increments
(xi+1,j + xi−1,j + xi,j+1 + xi,j−1 ) − 4xi,j . (3.42)
The motivation for this choice is that (3.42) is

∆2(1,0) + ∆2(0,1) xi−1,j−1 ,

where we generalize the ∆ operator in the obvious way to account for


direction, so that ∆(1,0) is the forward difference in direction (1, 0) and
similar to ∆(0,1) . Adding the first-order polynomial
p1,2 (i, j) = β00 + β10 i + β01 j, (3.43)
i.e., a simple plane, to x will cancel in (3.42) for any choice of coefficients
β00 , β10 , and β01 .
The precision matrix (apart from boundary effects) should have
nonzero elements corresponding to
2 
− ∆2(1,0) + ∆2(0,1) = − ∆4(1,0) + 2∆2(1,0) ∆2(0,1) + ∆4(0,1) ,

which is a negative difference approximation to the biharmonic differen-


tial operator
 2 2
∂ ∂2 ∂4 ∂4 ∂4
+ = + 2 + .
∂x2 ∂y 2 ∂x4 ∂x2 ∂y 2 ∂y 4

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF HIGHER ORDER 115
The fundamental solution of the biharmonic equation
 4 
∂ ∂4 ∂4
+ 2 2 2 + 4 φ(x, y) = 0
∂x4 ∂x ∂y ∂y
is the thin plate spline, which is the two-dimensional analogue of the
cubic spline in one dimension, see, for example, Bookstein (1989), Gu
(2002), or Green and Silverman (1994).
Starting from (3.42), we now want to compute the coefficients in
Q. Although a manual calculation is possible, we want to automate
this process to easily compute more refined second-order IGMRFs. To
compute the coefficients in the interior only, we wrap out the lattice In
on a torus Tn , and then decompose Q into Q = D T D, where D is a
block circulant matrix with base
⎛ ⎞
−4 1 1
⎜1 ⎟
⎜ ⎟
⎜ ⎟.
⎜ ⎟
⎝ ⎠
1
Therefore also Q will be circulant with base q, say, which we can compute
using (2.49). The result is
⎛ ⎞
20 −8 1
⎜−8 2 ⎟
⎜ ⎟
q=⎜ 1⎜ ⎟, (3.44)

⎝ ⎠

where only the upper left part of the base is shown. To write out the
conditional expectations, it is convenient to use a graphical notation,
where, for example, (3.42) looks like
◦•◦ ◦◦◦
•◦• −4 ◦ • ◦. (3.45)
◦•◦ ◦◦◦

The format is to calculate the sum over all xij ’s in the locations of the
‘•’. The ‘◦”s are there only to fix the spatial configuration. When this
notation is used within a sum, then the sum-index denotes the center
node.
Returning to (3.44), then (3.42) gives the following full conditionals
in the interior
( ◦◦◦◦◦ ◦◦◦◦◦ ◦◦•◦◦
)
1 ◦◦•◦◦ ◦•◦•◦ ◦◦◦◦◦
E(xij | x−ij ) = 8 ◦•◦•◦ −2 ◦◦◦◦◦ −1 •◦◦◦•
20 ◦◦•◦◦
◦◦◦◦◦
◦•◦•◦
◦◦◦◦◦
◦◦◦◦◦
◦◦•◦◦
Prec(xij | x−ij ) = 20κ.

©฀2005฀by฀Taylor & Francis Group, LLC


116 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
The coefficients of Q that are affected by the boundary are found from
expanding the quadratic term

κ  2 −1
n1 −1 n
◦•◦ ◦◦◦
2
− •◦• −4 ◦•◦ , (3.46)
2 i=2 j=2 ◦•◦ ◦◦◦

but there are different approaches to actually compute them. We now


switch notation for a moment and index xij ’s using one index only, xi .
The first approach is to write the quadratic term in (3.46) as
1  1  
− ( wki xi )2 = − wki xi wkj xj
2 i
2 i j
k k
1 
= − xi xj Qij (3.47)
2 i j

for some weights wki ’s and therefore



Qij = wki wkj . (3.48)
k

Some bookkeeping is usually needed as only the nonzero wij ’s and


Qij ’s need to be stored. This approach is implemented in GMRFLib,
see Appendix B.
Example 3.7 Consider the expression
1 
− (x1 − x2 + x3 )2 + (x2 − x3 )2 ,
2
which corresponds to (3.47) with w11 = 1, w12 = −1, w13 = 1, w21 = 0,
w22 = 1, w23 = −1. Using (3.48), we obtain, for example,
Q23 = w12 w13 + w22 w23 = −2, and
Q22 = w12 w12 + w22 w22 = 2.
A less elegant but often quite simple approach to compute the Qij ’s
is to note that 
∂2  Qii i = j

U (x) = , (3.49)
∂xi ∂xj x=0 Qij i > j
where
1 
U (x) = ( wkl xl )2 .
2
k l
Let 1i be a vector with its ith element equal to one and the rest of the
elements zero. Using (3.49) Qij can be expressed as

U (1i ) − 2U (0) + U (−1i ) i=j
Qij = (3.50)
U (1i + 1j ) + U (0) − U (1j ) − U (1i ) i = j.

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF HIGHER ORDER 117
Of course, we do not need to evaluate all terms in U (·), only those that
contain xi and/or xj will suffice. This approach can also be extended
to obtain the canonical parameterization (see Definition 2.2) if U (·) is
extended to also include linear terms.
Example 3.8 Reconsider Example 3.7. By using (3.50) we obtain
1&
Q23 = ((0 − 1 + 1)2 + (1 − 1)2 ) + ((0 − 0 + 0)2 + (0 − 0)2 )
2 '
−((0 − 0 + 1)2 + (0 − 1)2 ) − ((0 − 1 + 0)2 + (1 − 0)2 )
= −2 and
1&
Q22 = ((0 − 1 + 0)2 + (1 − 0)2 )) − 2((0 − 0 + 0)2 + (0 − 0)2 )
2 '
+((0 − (−1) + 0)2 + (−1 − 0)2 ) = 2.

Alternative IGMRFs in two dimensions


Although (3.42) is an obvious first choice, it has some drawbacks.
First (3.46) does not contain the four corners x1,1 , x1,n2 , xn1 ,1 , and
xn1 ,n2 , so we need to add such terms manually. Furthermore, using only
the terms
◦•◦
•••
◦•◦
to obtain a difference approximation to
∂2 ∂2
+ (3.51)
∂x2 ∂y 2
is not optimal. The discretization error is quite different in the direction
45 degrees to the main directions, hence we could expect a ‘directional
effect’ in our model. The common way out is to use difference approxi-
mations where the discretization error is isotropic instead of anisotropic,
so it does not depend on the rotation of the coordinate system. The
classic (numerical) choice is known under the name Mehrstellen stencil
and given as
10 ◦ ◦ ◦ 2 ◦•◦ 1 •◦•
− ◦•◦ + •◦• + ◦ ◦ ◦. (3.52)
3 ◦◦◦ 3 ◦•◦ 6 •◦•
Using these differences as increments, we will still obtain nonzero
terms in Q that approximate the biharmonic differential operator;
however, the approximation is better and isotropic. The corresponding
full conditionals are now similar to obtain as for the increments (3.42),
( ◦◦◦◦◦ ◦◦•◦◦
1 ◦◦•◦◦ ◦◦◦◦◦
E(xij | x−ij ) = 144 ◦ • ◦ • ◦ − 18 • ◦ ◦ ◦ •
468 ◦◦•◦◦
◦◦◦◦◦
◦◦◦◦◦
◦◦•◦◦
◦◦◦◦◦ ◦•◦•◦ •◦◦◦•
)
◦ • ◦ • ◦ • ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦
+8 ◦ ◦ ◦ ◦ ◦ −8 ◦ ◦ ◦ ◦ ◦ −1 ◦ ◦ ◦ ◦ ◦
◦ • ◦ • ◦ • ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ • ◦ • ◦ • ◦ ◦ ◦ •

©฀2005฀by฀Taylor & Francis Group, LLC


118 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
Prec(xij | x−ij ) = 13κ.

As the coefficients in Q will approximate the biharmonic differential


operator, it is tempting to start with such an approximation, for
example, defining
( ◦◦ ◦ ◦ ◦ ◦◦◦◦ ◦
1 16 ◦◦ • ◦ ◦ 2 ◦•◦• ◦
E(xij | x−ij ) = ◦• ◦ • ◦ − ◦◦◦◦ ◦
15 3 ◦◦
◦◦






3 ◦•◦•
◦◦◦◦


◦ ◦ • ◦◦ •◦ ◦◦•
)
5 ◦ ◦ ◦ ◦◦ 1 ◦◦ ◦◦◦
− • ◦ ◦ ◦• − ◦◦ ◦◦◦ (3.53)
6 ◦





◦◦
◦◦
12 ◦• ◦◦ ◦◦◦
◦◦•
Prec(xij | x−ij ) = 15κ.

The big disadvantage is that the corresponding increments do not


depend on only a few neighbors, but on all x in a larger neighborhood
than the neighborhood in (3.53). This can be verified by solving Q =
D T D for D using (2.49).
If we use all neighbors in a 5×5 window around i to approximate (3.51)
isotropically, then we may use the increment
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
9 ◦ ◦ ◦ ◦ ◦ 16 ◦ ◦ • ◦ ◦ 2 ◦ • ◦ • ◦
− ◦ ◦ • ◦ ◦ + ◦ • ◦ • ◦ + ◦ ◦ ◦ ◦ ◦
2 ◦









15 ◦









15 ◦









◦ ◦ • ◦ ◦ • ◦ ◦ ◦ •
1 ◦ ◦ ◦ ◦ ◦ 1 ◦ ◦ ◦ ◦ ◦
− • ◦ ◦ ◦ • − ◦ ◦ ◦ ◦ ◦
15 ◦









120 ◦








for which E(xij | x−ij , κ) is


◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
( ◦



































1 ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
132 096 ◦ ◦ ◦ • ◦ • ◦ ◦ ◦ −25 568 ◦ ◦ • ◦ ◦ ◦ • ◦ ◦
358 420 ◦



































◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
+2 048 ◦ • ◦ ◦ ◦ ◦ ◦ • ◦ −66 • ◦ ◦ ◦ ◦ ◦ ◦ ◦ •
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ • ◦ ◦ ◦
◦ ◦ ◦ • ◦ • ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ • ◦ ◦
−14 944 ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ −1 792 ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ • ◦ • ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ • ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ • ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF HIGHER ORDER 119
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ • ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ • ◦ ◦
◦ • ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
+288 ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ −1 464 ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ • ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ • ◦ ◦
◦ ◦ ◦ • ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ • ◦ ◦
◦ ◦ • ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ • ◦ ◦ ◦ ◦ ◦ • ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ •
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
+256 ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ −16 ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ • ◦ ◦ ◦ ◦ ◦ • ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ •
◦ ◦ • ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ • ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ •




































)
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
+32 ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ −1 ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ • ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ •
and
17 921
Prec(xij | x−ij , κ) =
κ.
720
The expressions for the full conditionals easily get quite complicated
using higher-order stencils.
Example 3.9 Figure 3.7 displays two samples using the two simpler
schemes (3.45) and (3.52) on a torus of dimension 256 × 256. We use
the torus only for computational reasons. By studying the corresponding
correlation matrices, we find that the deviation is largest on the direction
45 degrees to the horizontal axis, as expected. This is due to the fact
that (3.45) is anisotropic while (3.52) is isotropic. The difference is
between −10−4 and 10−4 , hence quite small and not of any practical
importance. As (3.52) includes four corner terms that (3.45) misses on
a regular lattice, we generally recommend using (3.52).
Remark. If we consider GMRFS defined via (3.45) and (3.52) on the
torus rather than the lattice, we obtain (strictly speaking) a first-order
and not a second-order IGMRF, because the rank of the precision matrix
will be one. Note that we cannot adjust for polynomials of order larger
than zero, since those are in general not cyclic.
An alternative model, proposed by Besag and Kooperberg (1995),
starts directly with the normal full conditionals defined through
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ • ◦ ◦
1 ◦ ◦ • ◦ ◦ 1 ◦ • ◦ • ◦ 1 ◦ ◦ ◦ ◦ ◦
E(xij | x−ij ) = ◦ • ◦ • ◦ + ◦ ◦ ◦ ◦ ◦ − • ◦ ◦ ◦ •
4 ◦









8 ◦









8 ◦








◦ (3.54)
Prec(xij | x−ij ) = κ.
The motivation for the specific form in (3.54) is that the least-squares
locally quadratic fit (3.8) through these twelve points generates these
coefficients. Furthermore, the model is invariant to the addition of a

©฀2005฀by฀Taylor & Francis Group, LLC


120 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS

1.0

1.0
0.8

0.8
0.6

0.6
0.4

0.4
0.2

0.2
0.0

0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

(a) (b)

Figure 3.7 Two samples from an IGMRF on a torus with dimension 256×256,
where (a) used increments (3.45) and (b) used increments (3.52).

plane of the form (3.43). However, there is no representation of this


IGMRF based on simple increments so the model has the same drawback
as (3.53).

3.4.3 Nonpolynomial IGMRFs of higher order


In general, the matrix S that spans the null space of Q, i.e., QS = 0,
does not need to be a polynomial design matrix. It can, for example, be
constructed so that the IGMRF has more than one unspecified overall
level. We will now describe two examples, where such a generalization
can be useful. As a side product, the first introduces a general device to
construct higher-order IGMRFs using the Kronecker product.

Construction of IGMRFs using the Kronecker product


A useful approach to construct an IGMRF of higher order is to define
its structure matrix as the Kronecker product of structure matrices of
lower-order IGMRFs.
For illustration, consider a regular lattice In . Now define the structure
matrix R of an IGMRF model of x as the Kronecker product of two
RW1 structure matrices of the form (3.22) of dimension n1 and n2 ,
respectively:
R = R1 ⊗ R2 .
Clearly R has rank (n1 − 1)(n2 − 1). It can easily be shown that this
specification corresponds to a model with differences of differences as

©฀2005฀by฀Taylor & Francis Group, LLC


IGMRFs OF HIGHER ORDER 121
increments:
∆(1,0) ∆(0,1) xij ∼ N (0, κ−1 ), (3.55)
for i = 1, . . . , n1 − 1, and j = 1, . . . , n2 − 1. Here ∆(1,0) and ∆(0,1) are
defined as in Section 3.4.2, hence
◦• •◦
∆(1,0) ∆(0,1) xij = •◦ − ◦ •. (3.56)

From (3.56) we see that the IGMRF defined through (3.55) is invariant
to the addition of constants to any rows and columns. This is an example
of an IGMRF with more than one unspecified level. The density of x is
(n1 −1)(n2 −1)
π(x | κ) ∝ κ 2
⎛ ⎞
n
1 −1 n
2 −1
κ
× exp ⎝− (∆(1,0) ∆(0,1) xij )2 ⎠ (3.57)
2 i=1 j=1

with ∆(1,0) ∆(0,1) xij = xi+1,j+1 − xi+1,j − xi,j+1 + xij . Note that the
conditional mean of xij in the interior depends on its eight nearest sites
and is
1 ◦•◦ 1 •◦•
•◦• − ◦ ◦ ◦,
2 ◦•◦ 4 •◦•
which equals a least-squares locally quadratic fit through these eight
neighbors. The conditional precision is 4κ.
We note that the representation of the precision matrix Q as the
Kronecker product Q = κ(R1 ⊗ R2 ) is also useful for computing |Q|∗ ,
because (extending (3.1))

|R1 ⊗ R2 |∗ = (|R1 |∗ )n2 −1 (|R2 |∗ )n1 −1 ,

where n1 − 1 and n2 − 1 is the rank of R1 and R2 , respectively.


Such a model is useful for smoothing a spatial surface while ac-
commodating arbitrary row and column effects. Alternatively, one may
incorporate sum-to-zero constraints on all rows and columns. This model
is straightforward to generalize to a torus and then corresponds to the
Kronecker product of two circular RW1 models. Figure 3.8 displays two
samples from this model on a torus of dimension 256 × 256 using these
constraints.
Under suitable sum-to-zero constraints, the Kronecker product con-
struction is useful as a general device to specify interaction models, as
proposed in Clayton (1996). For example, to define an IGMRF on a space
× time domain, one might take the Kronecker product of the structure
matrix of a RW1 model (3.22) and the structure matrix of a spatial
IGMRF of first order, as defined in equation (3.31).

©฀2005฀by฀Taylor & Francis Group, LLC


122 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS

1.0

1.0
0.8

0.8
0.6

0.6
0.4

0.4
0.2

0.2
0.0

0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

(a) (b)

Figure 3.8 Two samples from model (3.55) defined on a torus of dimension
256 × 256 lattice with sum-to-zero constraints on each row and column.

An IGMRF model for seasonal variation


Assume the location of the nodes i are all positive integers, i.e., i =
1, 2, . . . , n. Assume seasonal data are given with some periodicity of
period length m, say. For example, m = 12 for monthly data with a
yearly cycle.
A simple model for seasonal variation is obtained by assuming that
the sums xi + xi+1 + · · · + xi+m−1 are the increments with precision κ,
i = 1, . . . , n − m + 1. The joint density is
( n−m+1
)
n−m+1 κ  2
π(x | κ) ∝ κ 2 exp − (xi + xi+1 + · · · + xi+m−1 ) .
2 i=1
(3.58)
For example, for m = 4 the corresponding precision matrix Q has the
form
⎛ ⎞
1 1 1 1
⎜ 1 2 2 2 1 ⎟
⎜ ⎟
⎜ 1 2 3 3 2 1 ⎟
⎜ ⎟
⎜ 1 2 3 4 3 2 1 ⎟
⎜ ⎟
⎜ 1 2 3 4 3 2 1 ⎟
⎜ ⎟
⎜ . .. . .. . .
.. .. .. . . .. . .. ⎟
Q = κ⎜ ⎟ . (3.59)
⎜ ⎟
⎜ 1 2 3 4 3 2 1 ⎟
⎜ ⎟
⎜ 1 2 3 4 3 2 1 ⎟
⎜ ⎟
⎜ 1 2 3 3 2 1 ⎟
⎜ ⎟
⎝ 1 2 2 2 1 ⎠
1 1 1 1

©฀2005฀by฀Taylor & Francis Group, LLC


CONTINUOUS-TIME RANDOM WALKS 123
The bandwidth of Q is m − 1 while the rank of Q is n − m + 1. Thus, the
rank deficiency is larger than 1, but this is not a polynomial IGMRF.
Instead, QS = 0 for
⎛ ⎞
1 0 0
⎜0 1 0⎟
⎜ ⎟
⎜0 0 1⎟
⎜ ⎟
⎜−1 −1 −1⎟
⎜ ⎟
⎜1 0 0⎟
⎜ ⎟
S=⎜ 0 1 0⎟ , (3.60)
⎜ ⎟
⎜0 0 1 ⎟
⎜ ⎟
⎜−1 −1 −1⎟
⎜ ⎟
⎜1 0 0⎟
⎝ ⎠
.. .. ..
. . .
i.e., π(x) is invariant to the addition of any vector
c = (c1 , c2 , c3 , c4 , c1 , c2 , c3 , c4 , c1 , . . .)T
4
to x, as long as i=1 ci = 0 is fulfilled. This model is quite often used
in structural time-series analysis because it is completely nonparametric
and flexible; nonparametric, since no parametric form is assumed for
the seasonal pattern, flexible, because the seasonal pattern is allowed
to change over time. The latter point is evident from (3.58), where the
seasonal effects do not sum up to zero, as in a fixed seasonal model, but
to normal white noise.
Similar constructions could be made in a spatial context, for example,
for a two-dimensional seasonal pattern on regular lattices. As a simple
example, one could consider a model where the increments are all sums
of the form
•• •••
•• or • • • .
•••
This can easily be extended to sums over all m1 × m2 submatrices. Note
that this model can be constructed using the Kronecker product of two
seasonal structure matrices of period length m1 and m2 , respectively.

3.5 Continuous-time random walks⋆


In this section we introduce a class of random walk models, which satisfy
two important properties. First, they are consistent with respect to the
choice of the locations and the resolution. Secondly, they have a Markov
property that makes the precision matrix sparse so we can do efficient
computations. The idea is that we view the unknown parameters x as
realizations of an integrated Wiener process in continuous time. We will
denote this class of models as continuous-time random walks of kth
order, which we abbreviate as CRWk.

©฀2005฀by฀Taylor & Francis Group, LLC


124 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
We have already seen one particular example of a CRWk, the RW1
model for irregular locations with density defined in (3.26). However, it
will become clear that the CRW2 is different from the irregular RW2
and this is also the case for higher-order models.
The starting point is to consider a realization of an underlying
continuous-time process η(t), a (k − 1)-fold integrated Wiener process
where k is a positive integer. We will describe this concept first for
general k, and then work out the details for the important case k = 2.
The cases k = 3 and 4 will be briefly sketched at the end.
Definition 3.7 (A (k − 1)-fold integrated Wiener process) Let η(t)
be a (k − 1)-fold integrated standard Wiener process
 t
(t − h)k−1
η(t) = dW (h), (3.61)
0 (k − 1)!

where W (h) is a standard Wiener process. Let x = (x1 , . . . , xn )T be a


realization of η(t) at the locations 0 < s1 < s2 < . . . < sn . Let η(0) have
a diffuse prior, N (0, τ −1 ), where τ → 0. Then√the density π(x) of x is a
standard CRWk model and the density of x/ κ is a CRWk model with
precision κ.
Due to the (k−1)-fold integration, a CRWk model will also be an IGMRF
of order k.
Although we have not written up the density of a CRWk model
explicitly, the covariance matrix of the Gaussian density of a conditional
standard CRWk model is found from
 s
(s − h)k−1 (t − h)k−1
Cov(η(t), η(s) | η(0) = 0) = dh
0 ((k − 1)!)2
for 0 < s < t. Furthermore, E(η(t)|η(0) = 0) = 0 for all t > 0. However,
due to the correlation structure of η(t), the conditional mean

E (η(si ) | η(s1 ), . . . , η(si−1 ), η(si+1 ), . . . , η(sn ))

does not simplify and we need to take the values at all other locations
into account. In other words, the precision matrix will be a completely
dense matrix. However, the conditional densities simplify if we augment
the process with its derivatives,

η(t) = (η(t), η (1) (t), . . . , η (k−1) (t))T ,

where η (m) is the mth derivative of η(t),


 t
(m) (t − h)k−1−m
η (t) = dW (h),
0 (k − 1 − m)!

©฀2005฀by฀Taylor & Francis Group, LLC


CONTINUOUS-TIME RANDOM WALKS 125
for m = 0, . . . , k −1. The simplification is due to the following argument.
Let 0 < s < t and define

u(m) (s, t) = η (m) (t) − η (m) (s)


 t
(t − h)k−1−m
= dW (h)
s (k − 1 − m)!

for m = 0, . . . , k − 1. Note that u(0) (s, t) = η(t) − η(s). Then consider


the evolution from 0 to t, as first from 0 to s, then from s to t,

 s  t
(t − h)k−1−m
u(m) (0, t) = + dW (h)
0 s (k − 1 − m)!
 s
(m) (t − h)k−1−m
= u (s, t) + dW (h)
0 (k − 1 − m)!
= u(m) (s, t)
 s
((t − s) + (s − h))k−1−m
+ dW (h)
0 (k − 1 − m)!
= u(m) (s, t)
 k − 1 − m
 s k−1−m
+
0 j=0
j
(t − s)j (s − h)k−1−m−j
dW (h)
(k − 1 − m)!
= u(m) (s, t)
k−1−m
  s
(t − s)j (s − h)k−1−m−j
+ dW (h)
j=0
j! 0 (k − 1 − m − j)!
k−1−m
 (t − s)j (m+j)
= u(m) (s, t) + u (0, s). (3.62)
j=0
j!

If we define

u(s, t) = (u(0) (s, t), u(1) (s, t), . . . , u(k−1) (s, t))T

then (3.62) can be written as

u(0, t) = T (s, t)u(0, s) + u(s, t), (3.63)

©฀2005฀by฀Taylor & Francis Group, LLC


126 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
where
⎛ (t−s)2 (t−s)3 (t−s)k−1

t−s
1 1! 2! 3! ... (k−1)!
⎜ (t−s)2

(t−s)k−2 ⎟
⎜ 1 t−s
...
⎜ 1! 2! (k−2)! ⎟
⎜ t−s (t−s)k−3 ⎟
⎜ 1 ... ⎟
T (s, t) = ⎜ 1! (k−3)! ⎟ .
⎜ .. .. ⎟
⎜ . . ⎟
⎜ ⎟
⎝ 1 t−s ⎠
1!
1
It is perhaps easiest to interpret (3.63) conditionally. For known u(0, s)
we can add the normal distributed vector u(s, t) to T (s, t)u(0, s) in order
to obtain u(0, t). Since
E(u(s, t) | u(0, s)) = 0,
we may write
u(0, t) | u(0, s) ∼ N (T (s, t)u(0, s), Σ(s, t)). (3.64)
Element ij of Σ(s, t), Σij (s, t), is
 t
(t − h)k−i (t − h)k−j
Σij (s, t) = dh
s (k − i)! (k − j)!
(t − s)2k+1−i−j
= .
(2k + 1 − i − j) (k − i)! (k − j)!
The practical use of this result is to use the derived model for η(t) at the
locations of interest under a diffuse prior for the initial conditions η(0),
and then integrate out all the derivatives of order 1, . . . , k − 1. However,
for simulation-based inference using MCMC methods, we will simulate
the full vector η(t) at all locations, and simply ignore the derivatives in
the analysis.
Note also that the augmented model is computationally fast, as the
bandwidth for the corresponding precision matrix is 2k − 1, hence
compared to the (inconsistent) RW2 model, the computational cost is
about
kn × (2k − 1)2 , versus n × 22 ,
using the computational complexity of band-matrices, see Section 2.4.
For k = 2, the computational effort for a CRW2 model is about 18/4 =
4.5 times the costs required for a RW2 model. However, if n is not too
large, the practical cost will be very similar.
We will now derive the necessary details for the important case k = 2,
including the precision matrix for irregular locations. We assume for
simplicity that κ = 1. Let t = (s1 , . . . , sn ) be the locations of x =

©฀2005฀by฀Taylor & Francis Group, LLC


CONTINUOUS-TIME RANDOM WALKS 127
T
(x1 , . . . , xn ) and recall (3.25). Then (3.64) gives
     
η(si+1 ) 1 δi η(si )
= + u(δi ),
η (1) (si+1 ) 0 1 η (1) (si )
where    3 
0 δ /3 δi2 /2
u(δi ) ∼ N , i2 .
0 δi /2 δi
So,
 
1 12/δi3 −6/δi2
log π(η(si+1 ) | η(si )) = − u(δi )T u(δi ),
2 −6/δi2 4/δi
where      
η(si+1 ) 1 δi η(si )
u(δi ) = − .
η (1) (si+1 ) 0 1 η (1) (si )
Define the matrices
 
12/δi3 6/δi2
Ai =
6/δi2 4/δi
 
−12/δi3 6/δi2
Bi =
−6/δi2 2/δi
 
12/δi3 −6/δi2
Ci =
−6/δi2 4/δi
for i = 1, . . . , n − 1, then some straightforward algebra shows that
(η(s1 ), η (1) (s1 ), η(s2 ), η (1) (s2 ), . . . , η(sn ), η (1) (sn ))T
has precision matrix
⎛ ⎞
A1 B1
⎜B T1 A2 + C 1 B2 ⎟
⎜ ⎟
⎜ B T2 A3 + C 2 B3 ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟. (3.65)
⎜ .. .. .. ⎟
⎜ . . . ⎟
⎜ ⎟
⎝ B n−1 ⎠
B Tn−1 C n−1
Here we have used diffuse initial conditions for (η(s1 ), η (1) (s1 ))T . The
precision matrix for a CRW2 model with precision κ is found by
scaling (3.65) by κ. Note that κ for the CRW2 model is the precision for
the first-order increments, while for the RW2 model κ is the precision
for the second-order increments.
The null space of Q is spanned by the two vectors
(s1 , 1, s2 , 1, s3 , 1, . . . , sn , 1)T and (1, 0, 1, 0, 1, 0, . . . , 1, 0)T ,

©฀2005฀by฀Taylor & Francis Group, LLC


128 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
which can be verified directly. The density is invariant to the addition
of any constant to the locations, and to the addition of an arbitrary
constant to the derivatives with an obvious correction for the locations.
When the locations are equidistant and δi = 1 for all i, then the
matrices Ai , B i , and C i do not depend on i, and are equal to
     
12 6 −12 6 12 −6
Ai = Bi = Ci = .
6 4 −6 2 −6 4

Figure 3.5 displays 10 samples of x⊥ (ignoring the derivative) from


the CRW2 model for κ = 1 and n = 99, the marginal variances and
the correlations between x⊥ ⊥
n/2 and xi for i = 1, . . . , n. Note that we
fixed s50 = 50 and require that si + s100−i = 100 holds for all locations
i = 1, . . . , n, as in Figure 3.3.1. The samples show a very similar behavior
as those obtained in Figure 3.4.1 using the (equally spaced) RW2 model.
The (theoretical) variances and correlations Corr(x⊥ ⊥
n/2 , xi ) are also very
similar.
When we go to higher-order models, then the precision matrix for
(η(s1 ), . . . , η (k−1) (s1 ), . . . , η(sn ), . . . , η (k−1) (sn ))T
has the same structure as (3.65), but the matrices Ai , B i , and C i will
differ. For completeness, we will give the result for k = 3:
⎛ ⎞
720/δi5 360/δi4 60/δi3
Ai = ⎝360/δi4 192/δi3 36/δi2 ⎠ ,
60/δi3 36/δi2 9/δi
⎛ ⎞
−720/δi5 360/δi4 −60/δi3
B i = ⎝−360/δi4 168/δi3 −24/δi2 ⎠
−60/δi3 24/δi2 −3/δi
⎛ ⎞
720/δi5 −360/δi4 60/δi3
C i = ⎝−360/δi4 192/δi3 −36/δi2 ⎠ ,
60/δi3 −36/δi2 9/δi
and for k = 4,
⎛ ⎞
100800/δi7 50400/δi6 10080/δi5 840/δi4
⎜ 50400/δ 6 25920/δi5 5400/δi4 480/δi4 ⎟
Ai = ⎜
⎝ 10080/δi5
i ⎟
5400/δi4 1200/δi3 120/δi2 ⎠
840/δi4 480/δi3 120/δi2 16/δi
⎛ ⎞
−100800/δi7 50400/δi6 −10080/δi5 −840/δi4
⎜ −50400/δi6 24480/δi5 −4680/δi4 −360/δi3 ⎟
Bi = ⎜
⎝ −10080/δi5

4680/δi4 −840/δi3 −60/δi2 ⎠
−840/δi4 360/δi3 −60/δi2 4/δi

©฀2005฀by฀Taylor & Francis Group, LLC


CONTINUOUS-TIME RANDOM WALKS 129

200
150
100
50
0
−50
−100

0 20 40 60 80 100

(a)
4
3
2
1
0

0 20 40 60 80 100

(b)
1.0
0.5
0.0
−0.5
−1.0

0 20 40 60 80 100

(c)

Figure 3.9 Illustrations of the properties of the CRW2 model with n = 99 and
irregular locations: (a) displays 10 samples of x⊥ (ignoring the derivative), (b)
displays Var(x⊥ i ) for i = 1, . . . , n normalized with the average variance, and
(c) displays Corr(x⊥ ⊥
n/2 , xi ) for i = 1, . . . , n.

©฀2005฀by฀Taylor & Francis Group, LLC


130 INTRINSIC GAUSSIAN MARKOV RANDOM FIELDS
⎛ ⎞
100800/δi7 −50400/δi6 10080/δi5 −840/δi4
⎜−50400/δ 6 25920/δ 5 −5400/δ 4 480/δ 3 ⎟
Ci = ⎜
⎝ 10080/δi5
i i i i ⎟
.
−5400/δi4 1200/δi3 −120/δi2 ⎠
−840/δi4 480/δi3 −120/δi2 16/δi
It is straightforward to generate these and further results for higher order
using a proper tool for symbolic computation.

3.6 Bibliographic notes


The properties of Kronecker products are extracted from Searle (1982)
and Harville (1997).
RW1, RW2, and seasonal models of the form (3.58) are frequently used
in time series analysis, see Harvey (1989), West and Harrison (1997) and
Kitagawa and Gersch (1996).
Continuous intrinsic models in geostatistics have been pioneered by
Matheron (1973) and are now part of the standard literature, see, for
example, Cressie (1993), Chilés and Delfiner (1999) and Lantuéjoul
(2002). Künsch (1987) generalizes stationary autoregressions on a two-
dimensional infinite lattice to intrinsic autoregressions using related ideas
in the geostatistics literature and derives a spectral approximation to the
log likelihood of intrinsic autoregressions. This approximation is similar
to the Whittle (1954) approximation (2.32) and is discussed further by
Kent and Mohammadzadeh (1999). Besag and Kooperberg (1995) argue
for the use of intrinsic autoregressions, derive some theoretical results,
and discuss corrections of undesirable second-order characteristics for
small arrays and nonlattice applications. For maximum likelihood
estimation in intrinsic models, see also Mardia (1990). The first-order
IGMRF on irregular lattices (3.30) was made popular by Besag et al.
(1991).
Our presentation of IGMRFs is based on the construction on finite
graphs leaving the interesting infinite lattice case, which is similar
to Section 2.6.5, undiscussed. Although intrinsic models are traditionally
invariant to polynomial trends, this is not always the case and is our
motivation for Section 3.4.3.
The construction of IGMRFs has similarities to the construction of
splines (Gu, 2002, Wahba, 1990). This has motivated Section 3.2 and
Section 3.4.2. As an example about making IGMRFs proper which
is justified by (3.37), see Fernández and Green (2002). For analytical
results about the eigenstructure of some structure matrices, also for
higher dimension, see Gorsich et al. (2002). Interaction models based
on the Kronecker product are discussed in Clayton (1996), see Knorr-
Held (2000a) and Schmid and Held (2004) for further extensions.
An excellent source for the numerical stencils used in Section 3.4.2 is

©฀2005฀by฀Taylor & Francis Group, LLC


BIBLIOGRAPHIC NOTES 131
Patra and Karttunen (2004).
Lindgren and Rue (2004) discuss the construction of IGMRF’s on
triangulated spheres for applications in environmental statistics.
The construction of the CRW2 in Section 3.5 is from Wecker and
Ansley (1983) who consider fitting polynomial splines to a nonequally
spaced time series using algorithms derived from the Kalman filter, see
also Jones (1981) and Kohn and Ansley (1987). Wecker and Ansley
(1983) base their work on the results of Wahba (1978) about the
connection between the posterior expectation of a diffuse integrated
Wiener process and polynomial splines. See, for example, Shepp (1966)
for background on the (k − 1)-fold integrated Wiener process.

©฀2005฀by฀Taylor & Francis Group, LLC


CHAPTER 4

Case studies in hierarchical


modeling

One of the main areas where GMRF models are used in statistics are
hierarchical models. Here, GMRFs serve as a convenient formulation to
model stochastic dependence between parameters, and thus implicitly,
dependence between observed data. The dependence can be of various
kinds, such as temporal, spatial or even spatiotemporal.
A hierarchical GMRF model is characterized through several stages of
observables and parameters. A typical scenario is as follows. In the first
stage we will formulate a distributional assumption for the observables,
dependent on latent parameters. If we have observed a time series of
binary observations y, we may assume a Bernoulli model with unknown
probability pi for yi , i = 1, . . . , n: yi ∼ B(pi ). Given the parameters of
the observation model, we assume the observations to be conditionally
independent. In the second stage we assign a prior model for the unknown
parameters, here pi . This is where GMRFs enter. For example, we could
choose an autoregressive model for the logit-transformed probabilities
xi = logit(pi ). Finally, a prior distribution is assigned to unknown
parameters (or hyperparameters) of the GMRF, such as the precision
parameter κ of the GMRF x. This is the third stage of a hierarchical
model. There may be further stages if necessary.
In a regression context, our simple example would thus correspond to a
generalized linear model where the intercept is varying over some domain
according to a GMRF with unknown hyperparameters. More generally,
so-called generalized additive models can be fitted using GMRFs. We will
give examples of such models later. GMRFs are also useful in extended
regression models where covariate effects are allowed to vary over some
domain. Such models have been termed varying coefficient models
(Hastie and Tibshirani, 1990) and the domain over which the effect is
allowed to vary is called the effect modifier. Again, GMRF models can
be used to analyze this class of models, see Fahrmeir and Lang (2001a)
for a generalized additive model based on GMRFs involving varying
coefficients.
For statistical inference we will mainly use Markov chain Monte Carlo
(MCMC) techniques, e.g., Robert and Casella (1999) or Gilks et al.

©฀2005฀by฀Taylor & Francis Group, LLC


134 CASE STUDIES IN HIERARCHICAL MODELING
(1996). This will involve simulation from (possibly large) GMRFs, and
we will make extensive use of the methods discussed in Chapter 2. In
Section 4.1 we will give a brief summary of MCMC methods.

The hierarchical approach to model dependent data has been domi-


nant in recent years. However, traditional Markov random field models
have been used in a nonhierarchical setting as a direct model for the
observed data, not as a model for unobserved parameters, compare,
for example, Besag (1974) or Künsch (1987). Such direct modeling ap-
proaches have shown not to be flexible enough for applied data analysis.
For example, Markov random field models for Poisson observations,
so-called auto-Poisson models, can only model negative dependence
between neighboring sites (Besag, 1974). In contrast, a hierarchical
model with Poisson observations and a latent GMRF on the (log) rates
is able to capture positive dependence between observations.

There is a vast literature on various applications of GMRFs in


hierarchical models. We therefore do not attempt to cover the whole
area, but only give examples of applications in different settings. First
we will look at normal responses. Inference in this class of models
is fairly straightforward. In the temporal domain, this model class
corresponds to so-called state-space-models (Harvey, 1989, West and
Harrison, 1997) and we will outline analogies in our inferential methods
and those used in traditional state-space models, such as the Kalman
filter and smoother and the so-called forward-filtering-backward-sampling
algorithms for inference via MCMC.

The second class of models is characterized through the fact that


the sampling algorithms for statistical inference are similar to the
normal case, once we introduce so-called auxiliary variables. This is
typically achieved within a so-called scale mixtures of normals model
formulation. There are two main areas in this class: The first is to
account for nonnormal (but still continuous) distributions and typically
uses Student-tν distributions for the observational error distribution or
for the independent increments defining GMRFs. The second area are
models for binary and multicategorical responses. In particular, we will
discuss how probit and logit regression models can be implemented using
an auxiliary variable approach. Finally, we will discuss models where
such auxiliary variable approaches are not available. Those include,
for example, Poisson regression models or certain regression models for
survival data. We will discuss how GMRF approximations can be used
to facilitate MCMC via the Metropolis-Hastings algorithm.

©฀2005฀by฀Taylor & Francis Group, LLC


MCMC for hierarchical GMRF models 135
4.1 MCMC for hierarchical GMRF models
For further understanding and to introduce our notation, it is necessary
to give a brief introduction to Markov chain Monte Carlo (MCMC)
methods before discussing strategies for block updating in hierarchical
GMRF models.

4.1.1 A brief introduction to MCMC


Suppose θ is an unknown scalar parameter, and we are interested
in the posterior distribution π(θ|y) after observing some data y. We
suppress the dependence on y in this section and simply write π(θ) for
the posterior (target) distribution. The celebrated Metropolis-Hastings
algorithm, which forms the basis of most MCMC algorithms, can be used
to generate a Markov chain θ(1) , θ(2) , . . . , θ(k) , . . . that converges (under
mild regularity conditions) to π(θ):
1. Start with some arbitrary starting value θ(0) where π(θ(0) ) > 0. Set
k = 1.
2. Generate a proposal θ∗ from some proposal kernel q(θ∗ |θ(k−1) ) that in
general depends on the current value θ(k−1) of the simulated Markov
chain. Set θ(k) = θ∗ with probability
* +
π(θ∗ ) q(θ(k−1) |θ∗ )
α = min 1, ;
π(θ(k−1) ) q(θ∗ |θ(k−1) )
otherwise set θ(k) = θ(k−1) .
3. Set k = k + 1 and go back to 2.
Step 2 is often called the acceptance step, because the proposed value θ∗
is accepted with probability α as the new value of the Markov chain.
Depending on the specific choice of the proposal kernel q(θ∗ |θ), very
different algorithms result. There are two important subclasses: if q(θ∗ |θ)
does not depend on the current value of θ, i.e., q(θ∗ |θ) = q(θ∗ ), the
proposal is called an independence proposal. Another important class can
be obtained if q(θ∗ |θ) = q(θ|θ∗ ), in which case the acceptance probability
α simplifies to the ratio of the target density, evaluated at the proposed
new and the old value. These includes so-called random-walk proposals,
where q(θ∗ |θ) is symmetric around the current value θ. Typical examples
of random-walk proposals add a mean zero uniform or normal (or any
other symmetric) distribution to the current value of θ.
A trivial case occurs if the proposal distribution equals the target
distribution, i.e., q(θ∗ |θ) = π(θ∗ ), the acceptance probability α then
always equals one. So direct independent sampling from π(θ) is a
special case of the Metropolis-Hastings algorithm. However, if π(θ) is
nonstandard, it may not be straightforward to sample from π(θ) directly.

©฀2005฀by฀Taylor & Francis Group, LLC


136 CASE STUDIES IN HIERARCHICAL MODELING
The beauty of the Metropolis-Hastings algorithm is that we can use
(under suitable regularity conditions) any distribution as the proposal
and the algorithm will still converge to the target distribution. However,
the rate of convergence toward π(θ) and the degree of dependence
between successive samples of the Markov chain (its mixing properties)
will depend on the chosen proposal.
MCMC algorithms are often controlled through the acceptance proba-
bility α or its expected value E(α), assuming the chain is in equilibrium.
For a random-walk proposal, a too large value of E(α) implies that
the proposal distribution is too narrow around the current value, so
effectively the Markov chain will only make small steps. On the other
hand, if E(α) is very small, nearly all proposals are rejected and the
algorithm will stay too long in certain values of the target distribution.
Some intermediate values of E(α) in the interval 25 to 40% often work
well in practice, see also Roberts et al. (1997) for some analytical results.
Therefore, the spread of random-walk proposals is chosen so that E(α) is
in this interval. For an independence proposal, the situation is different as
a high E(α) indicates that the proposal distribution q(θ∗ ) approximates
the target π(θ) quite well.
The simple algorithm described above forms the basis of nearly all
MCMC methods. However, usually we are interested in a multivariate
(posterior) distribution π(θ) of a random vector θ of high dimension,
not in a single parameter θ. Some modifications are necessary to apply
the above algorithm to the multivariate setting.
Historically, most MCMC algorithms have been based on updating
each scalar component θi , i = 1, . . . , p of θ, conditional on the values
of the other parameter θ −i , using the Metropolis-Hastings algorithm.
Essentially, we apply the Metropolis-Hastings algorithm in turn to every
component θi of θ with arbitrary proposal kernels qi (θi∗ |θi , θ −i ). As long
as we update each component of θ, this algorithm will converge to the
target distribution π(θ).
Of particular prominence is the so-called Gibbs sampler algorithm,
where each component θi is updated with a random variable from its full
conditional π(θi |θ −i ). Note that this is a special case of the component-
wise Metropolis-Hastings algorithm, since α simply equals unity in this
case. Of course, any other proposal kernel can be used, where all terms
are now conditional on the current values of θ −i , to update θi . For more
details on these algorithms see, for example, Tierney (1994) or Besag
et al. (1995).
However, it was immediately realized that such single-site updating
can be disadvantageous if parameters are highly dependent in the
posterior distribution π(θ). The problem is that the Markov chain may
move around very slowly in its target (posterior) distribution. A general

©฀2005฀by฀Taylor & Francis Group, LLC


MCMC FOR HIERARCHICAL GMRF MODELS 137
approach to circumvent this problem is to update parameters in larger
blocks, θ j , say, where the bold face indicates that θ j is a vector of
components of θ. The choice of blocks are often controlled by what is
possible to do in practice. Ideally, we should choose a small number of
blocks with large dependence within the blocks but with less dependence
between blocks. The extreme case is to update all parameters θ in one
block.
Blocking is particularly easy if θ is a GMRF. Then we can apply
one of the algorithms discussed in Section 2.3 to simulate θ in one
step. However, in Bayesian hierarchical models, more parameters are
involved typically. For example, θ = (κ, x) where an unknown scalar
precision parameter κ may be of interest additional to the GMRF x. It
is tempting to form two blocks in these cases, where we update from the
full conditionals of each block; sample x from π(x|κ) and subsequently
sample κ from π(κ|x). However, there is often strong dependence
between κ and x in the posterior, and then a joint Metropolis-Hastings
update of κ and x is preferable. In the following examples we update
the GMRF x (or parts of it) and its associated hyperparameters in one
block, which is not as difficult as it seems. Why this modification is
important and why it works is discussed next.

4.1.2 Blocking strategies


To illustrate and discuss our strategy for block updating in hierarchical
GMRF models, we will start discussing a simple (normal) example where
explicit analytical results are possible to obtain. This will illustrate why
a joint update of x and its hyperparameters is important. At the end,
we discuss the more general case.

A simple example
Before we compare analytical results about the rate of convergence for
various sampling schemes, we need to define it. Let θ (1) , θ (2) , . . . denote a
Markov chain with target distribution π(θ) and initial value θ (0) ∼ π(θ).
The rate of convergence of the Markov chain can be characterized by
studying how quickly E(h(θ (t) )|θ (0) ) approaches the stationary value
E(h(θ)) for all square π-integrable functions h(·). Let ρ be the minimum
number such that for all h(·) and for all r > ρ
,  2 -
lim E E h(θ (k) ) | θ (0) − E (h(θ)) r−2k = 0. (4.1)
k→∞

We say that ρ is the rate of convergence. For normal target distribution


it is sufficient to consider linear functions h(·) only (see for example
Roberts and Sahu (1997)).

©฀2005฀by฀Taylor & Francis Group, LLC


138 CASE STUDIES IN HIERARCHICAL MODELING
Assume now that x is a first-order autoregressive process
xt − µ = γ(xt−1 − µ) + νt , t = 2, . . . , n, (4.2)
2
where |γ| < 1, {νt } are iid normals with zero mean and variance σ , and
σ2 2
x1 ∼ N (µ, 1−γ 2 ), which is the stationary distribution of xi . Let γ, σ ,

and µ be fixed parameters. We can of course sample from this model


directly, but here we want to apply an MCMC algorithm to generate
samples from π(x).
At each iteration a single-site Gibbs sampler will sample xt from the
full conditional π(xt |x−t ) for t = 1, . . . , n,

2
⎨N (µ + γ(x2 − µ), σ )
⎪ t = 1,
γ σ2
xt | x−t ∼ N (µ + 1+γ 2 (xt−1 + xt+1 − 2µ), 1+γ 2 ) t = 2, . . . , n − 1,

⎩ 2
N (µ + γ(xn−1 − µ), σ ) t = n.
For large n, the rate of convergence of this algorithm is (Pitt and
Shephard, 1999, Theorem 1)
γ2
ρ=4 . (4.3)
(1 + γ 2 )2
For |γ| close to one the rate of convergence can be slow: If γ = 1 − δ
for small δ > 0, then ρ = 1 − δ 2 + O(δ 3 ). The reason is that strong
dependency within x allow for larger moves in the joint posterior.
To circumvent this problem, we may update x in one block. This is
possible as x is a GMRF. The block algorithm converges immediately
and provides iid samples from the joint density.
We now relax the assumptions of fixed hyperparameters. Consider
a hierarchical formulation where the mean of xt , µ, is unknown and
assigned with a standard normal prior,
µ ∼ N (0, 1) and x | µ ∼ N (µ1, Q−1 ),
where Q is the precision matrix of the GMRF x|µ. The joint density of
(µ, x) is normal. We have two natural blocks, µ and x.
Since both full conditionals π(µ|x) and π(x|µ) are normal, it is
tempting to form a two-block Gibbs sampler and to update µ and x
with samples from their full conditionals,
 T 
(k) (k) 1 Qx(k−1) T −1
µ |x ∼ N , (1 + 1 Q1)
1 + 1T Q1 (4.4)
x(k) |µ(k) ∼ N (µ(k) 1, Q−1 ).
The presence of the hyperparameter µ will slow down the convergence
compared to the case when µ is fixed. Due to the nice structure of (4.4)
we can characterize explicitly the marginal chain of {µ(k) }.

©฀2005฀by฀Taylor & Francis Group, LLC


MCMC FOR HIERARCHICAL GMRF MODELS 139
(1) (2)
Theorem 4.1 The marginal chain µ , µ , . . . from the two-block
Gibbs sampler defined in (4.4) and started in equilibrium, is a first-order
autoregressive process
µ(k) = φµ(k−1) + ǫk ,
where
1T Q1
φ=
1 + 1T Q1
iid
and ǫk ∼ N (0, 1 − φ2 ).
Proof. It follows directly that the marginal chain µ(1) , µ(2) , . . . is a first-
order autoregressive process; the marginal distribution of µ is normal
with zero mean and unit variance, the chain has the Markov property
π(µ(k) | µ(1) , . . . , µ(k−1) ) = π(µ(k) | µ(k−1) )
and the density of (µ(1) , µ(2) , . . . , µ(k) ) is normal for k = 1, 2, . . . . The
coefficient φ is found by computing the covariance at lag 1,

Cov(µ(k) , µ(k+1) ) = E µ(k) µ(k+1)

= E µ(k) E µ(k+1) | µ(k)
 
= E µ(k) E E µ(k+1) | x(k) | µ(k)
  T 
(k) 1 Qx(k) (k)
= E µ E |µ
1 + 1T Q1
1T Q1
= Var(µ(k) ),
1 + 1T Q1
which is known to be φ times the variance Var(µ(k) ) for a first-order
autoregressive process. The variance of ǫk is determined such that the
variance of µ(k) is 1.
It can be shown that for a two-block Gibbs sampler the marginal chains
and the joint chain have the same rate of convergence (Liu et al., 1994,
Thm. 3.2). Applying (4.1) to the marginal chain µ(1) , µ(2) , . . ., we see
that the rate of convergence is ρ = |φ|.
It is illustrative to discuss the behavior of the two-block Gibbs sampler
for large n. For the autoregressive model (4.2), Qii = (1 + γ 2 )/σ 2 except
for i = 1 and n where it is 1/σ 2 , and Qi,i+1 = −γ/σ 2 . This implies that
1T Q1 is asymptotically equal to n(1 − γ)2 /σ 2 , so that
n(1 − γ)2 /σ 2 Var(xt ) 1 − γ 2
φ= 2 2
=1− + O(1/n2 ).
1 + n(1 − γ) /σ n (1 − γ)2
When n is large, φ is close to 1 and the chain will both mix and converge
slowly even though we use a two-block Gibbs sampler. The minimum

©฀2005฀by฀Taylor & Francis Group, LLC


140 CASE STUDIES IN HIERARCHICAL MODELING
number of iterations, k ∗ , needed before the correlation between µ(k) and
µ(k+l) is smaller than ζ = 0.05, say, is
1 1 (1 − γ)2
k ∗ /| log(ζ)| = −1/ log(φ) = +n + O(1/n).
2 Var(xt ) 1 − γ 2

Since (1 − γ)2 /(1 − γ 2 ) is strictly decreasing in the interval (−1, 1), we


conclude the following:
• For constant Var(xt ), k ∗ increases for decreasing γ.

• For constant γ, k ∗ increases for decreasing Var(xt ).


One might be tempted to believe that k ∗ should increase for increasing
γ 2 due to (4.3). However, since we update x in one block this is not
the case: The variance of µ|x in (4.4) increases for increasing γ and
increasing Var(xt ), and this weakens the dependence between µ and x.
Figure 4.1(a) shows a simulation with length 1 000 of the marginal
chain for µ using n = 100, σ 2 = 1/10, φ = 0.9, and the plot of the
pairs (µ(k) , 1T Qx(k) ) in Figure 4.1(b). The reason for the slow mixing
(and convergence) of the µ chain is the strong dependence between µ(k)
and 1T Qx(k) , the sufficient statistics of µ(k) in the full conditional (4.4).
The two-block Gibbs sampler only moves either horizontally (update
µ) or vertically (update x). Note that this is just the same as in the
standard example sampling from a two-dimensional normal distribution
using Gibbs sampling (see for example Gilks et al. (1996, Chapter 1)).
The discussion so far has only revealed the seemingly obvious,
that blocking improves mainly within the block. If there is strong
dependence between blocks, the MCMC algorithm may still suffer from
slow convergence. Our solution is to update (µ, x) jointly. Since µ is
univariate, we can use a simple scheme for updating µ as long as we
delay the accept/reject step until x also is updated. The joint proposal
is generated as follows:

µ∗ ∼ q(µ∗ | µ(k−1) )
(4.5)
x∗ |µ∗ ∼ N (µ∗ 1, Q−1 )

and then we accept/reject (µ∗ , x∗ ) jointly. Here, q(µ∗ |µ(k−1) ) can be a


simple random-walk proposal or some other suitable proposal distribu-
tion. To see how a joint update of (µ, x) can be helpful, consider Figure
4.1(b). A proposal from µ(k−1) to µ∗ may take µ∗ out of the diagonal,
while sampling x∗ from π(x∗ |µ∗ ) will take it back into the diagonal
again. Hence the mixing (and convergence) can be very much improved.

Assuming q(·|·) in (4.5) is a symmetric proposal, the acceptance

©฀2005฀by฀Taylor & Francis Group, LLC


MCMC FOR HIERARCHICAL GMRF MODELS 141

3
2
1
0
−3 −2 −1
0 200 400 600 800 1000

(a)
200
100
0
−100
−200

−3 −2 −1 0 1 2 3

(b)

Figure 4.1 Figure (a) shows the marginal chain for µ over 1000 iterations of
the marginal chain for µ using n = 100, σ 2 = 1/10 and φ = 0.9. The algorithm
updates successively µ and x from their full conditionals. Figure (b) displays
the pairs (µ(k) , 1T Qx(k) ), with µ(k) on the horizontal axis. The slow mixing
(and convergence) of µ is due to the strong dependence with 1T Qx(k) as only
horizontal and vertical moves are allowed. The arrows illustrate how a joint
update can improve the mixing (and convergence).

probability for (µ∗ , x∗ ) becomes


* +
1
α = min 1, exp(− ((µ∗ )2 − (µ(k−1) )2 )) . (4.6)
2
Note that only the marginal density of µ is needed in (4.6): Since we
sample x from its full conditional, we effectively integrate x out of the
joint density π(µ, x). The minor modification to delay the accept/reject
step until x is updated as well can give a large improvement.
A further extension of (4.2) is to condition on observed normal data
y. The distribution of interest is then π(x, µ|y). Assume that

y | x, µ ∼ N (x, H −1 ),

©฀2005฀by฀Taylor & Francis Group, LLC


142 CASE STUDIES IN HIERARCHICAL MODELING
where H is a known precision matrix. Consider the two-block Gibbs
sampler which first updates µ from π(µ|x) and then x from π(x|µ, y). It
is straightforward to extend Theorem 4.1 to this case. The marginal chain
{µ(k) } is still a first-order autoregressive process, but the autoregressive
parameter φ is now
1T Q(Q + H)−1 Q1
φ= . (4.7)
1 + 1T Q1
Assume for simplicity that H is a diagonal matrix with κ on the
diagonal, meaning that yi |x, µ ∼ N (xi , 1/κ). We can (with little effort)
compute (4.7) using the asymptotically negligible circulant approxima-
tion to Q, see Section 2.6.4, to obtain the limiting value of φ as n → ∞,
1
φ= .
1 + κσ 2 /(1 − γ)2
In this case φ < 1 for all n when κ > 0. The two-block Gibbs sampler
does no longer converge arbitrarily slow as n increases. However, in
practice the convergence is often still slow and a joint update will be
of great advantage.

Block algorithms for hierarchical GMRF models


Let us now consider a more general setup that contains all the
forthcoming case studies: Some hyperparameters θ control the GMRF
x of size n and some of the nodes of x are observed by data y. The
posterior is then
π(x, θ | y) ∝ π(θ) π(x | θ) π(y | x, θ).
Assume for a moment that we are able to sample from π(x|θ, y), i.e.,
the full conditional of x is a GMRF. We will later discuss options when
this full conditional is not a GMRF. The following algorithm is now a
direct generalization of (4.5) and updates (θ, x) in one block:

θ∗ ∼ q(θ ∗ | θ (k−1) )
(4.8)
x∗ ∼ π(x | θ ∗ , y).
The proposal (θ∗ , x∗ ) is then accepted/rejected jointly. We denote this
as the one-block algorithm.
If we consider only the θ chain, then we are in fact sampling from the
posterior marginal π(θ|y) using the proposal (4.8). This is evident from
the acceptance probability for the joint proposal, which is
 .
π(θ ∗ | y)
α = min 1, .
π(θ (k−1) | y)

©฀2005฀by฀Taylor & Francis Group, LLC


MCMC FOR HIERARCHICAL GMRF MODELS 143
The dimension of θ is typically low and often between 1 and 5, say.
Hence the proposed algorithm should not experience any serious mixing
problems. By sampling x from π(x|θ ∗ , y), we are in fact integrating x
out of π(x, θ|y). The computational cost per iteration depends on n (the
dimension of x) and is (usually) dominated by the cost of sampling from
π(x|θ ∗ , y). The fast algorithms in Section 2 for GMRFs are therefore
very useful.
However, the one-block algorithm is not always feasible for the
following reasons:
1. The full conditional of x can be a GMRF with a precision matrix
that is not sparse. This will prohibit a fast factorization, hence a joint
update is feasible but not computationally efficient.
2. The data can be nonnormal so the full conditional of x is not a GMRF
and sampling x∗ using (4.8) is not possible (in general).
The first problem can often be approached using subblocks of (θ, x),
following an idea of Knorr-Held and Rue (2002). Assume a natural
splitting exists for both θ and x into
(θ a , xa ), (θ b , xb ) and (θ c , xc ), (4.9)
say. The sets a, b, and c do not need to be disjoint. One class of examples
where such an approach is fruitful is (geo-)additive models where a, b,
and c represent three different covariate effects with their respective
hyperparameters. We will discuss such an example in Section 4.2.2. In
this class of models the full conditional of xa has a sparse precision
matrix and similarly for xb and xc . The subblock approach is then to
update each subblock in (4.9), using
θ ∗a ∼ q(θ ∗a | θ)
(4.10)
x∗a ∼ π(xa | x−a , θ ∗a , θ −a , y)
(dropping the superscript (k − 1) from here on) and then accept-
ing/rejecting (θ ∗a , x∗a ) jointly. Subblocks b and c are updated similarly.
We denote this as the subblock algorithm.
When the observed data is nonnormal, the full conditional of x will
not be a GMRF. However, we may use auxiliary variables z such that the
conditional distribution of x is still a GMRF. Typical examples include
logit and probit regression models for binary and multicategorical data,
and Student-tν distributed observations. We will discuss the auxiliary
variables approach in Section 4.3.
A more general idea is to construct a GMRF approximation to
π(x|θ, y) using a second-order Taylor expansion. Such an approximation
can be surprisingly accurate in many cases and can be interpreted as
integrating x approximately out of π(x, θ|y). Using a GMRF approxi-
mation will generalize the one-block and subblock algorithms. The most

©฀2005฀by฀Taylor & Francis Group, LLC


144 CASE STUDIES IN HIERARCHICAL MODELING
prominent example is Poisson regression, which we discuss in Section
4.4.
The subblock approach can obviously also be used in connection
with auxiliary variables and the GMRF approximation, when the full
conditional has a nonsparse precision matrix and its distribution is not
a GMRF.

4.2 Normal response models


We now look at hierarchical GMRF models, where the response variable
is assumed to be normally distributed with conditional mean given as
a linear function of underlying unknown parameters x. These unknown
parameters are assumed to follow a (possibly intrinsic) GMRF a priori,
typically with additional unknown hyperparameters κ. It is clear that
the conditional posterior π(x|y, κ) is still a GMRF so direct sampling
from this distribution is particularly easy using the algorithms described
in Section 2.3.
We will now describe two case studies. The first is concerned with the
analysis of a time series, using latent trend and seasonal components
and additional covariate information. It is not important that the latent
GMRF is defined on a temporal domain rather than a spatial domain,
but in this special case there are close connections to algorithms based
on the Kalman filter. We will describe the analogies briefly at the end
of this example. The second example describes the spatial analysis of
rent prices in Munich using a geoadditive model with nonparametric
and fixed covariate effects.

4.2.1 Example: Drivers data


In this example we consider a regular time series giving the monthly
totals of car drivers in Great Britain killed or seriously injured January
1969 to December 1984 (Harvey, 1989). This time series has length n =
192 and exhibits a strong seasonal pattern. One of our objectives is to
predict the pattern in the next m = 12 months.
We first assume that the square root counts yi , i = 1, . . . , n, are
conditionally independent normal variables,
yi ∼ N (si + ti , 1/κy ),
where the mean is a sum of a smooth trend ti and a seasonal effect si .
We assume s = (s1 , . . . , sn+m ) follows the seasonal model (3.58) with
season length 12 and precision κs and t = (t1 , . . . , tn+m ) follows the RW2
model (3.39) with precision κt . The trend t and the seasonal effect s are
assumed to be independent. Note that no observations yi are available

©฀2005฀by฀Taylor & Francis Group, LLC


NORMAL RESPONSE MODELS 145
for i = n + 1, . . . , n + m, but we can still include the corresponding
parameters of s and t for prediction.
Let κ denote the three precisions κy , κs , and κt , which is the vector
of hyperparameters in this model. The task is to do inference for
(κ, s, t) using the one-block algorithm (4.8). For illustration, we will
do this explicitly by first deriving the joint density π(s, t, y|κ) and then
condition on y. The joint density is
π(s, t, y | κ) = π(y | s, t, κy ) π(s | κs ) π(t | κt ), (4.11)
a GMRF with precision matrix Q, say. We then use (3.47) and (3.48)
to derive the desired precision matrix of the conditional distribution
π(s, t|y, κ). Due to Theorem 2.5, this precision matrix is simply a
principal matrix of Q. Note that (4.11) is improper but the conditional
distribution of interest is proper.
The details are as follows. First partition the precision matrix Q as
⎛ ⎞
Qss Qst Qsy
Q = ⎝ Qts Qtt Qty ⎠ ,
Qys Qyt Qyy
where Qss and Qtt are of dimension (n + m) × (n + m) while Qyy is of
dimension n×n. The dimensions of the other entries follow immediately.
Since s and t are a priori independent, let us start with the density
π(y|s, t, κ), which might add some dependence between s and t. The
conditional density of the data y is
( n
)
κy 
π(y | s, t, κy ) ∝ exp − (yi − si − ti )2 . (4.12)
2 i=1

We immediately see that (4.12) induces dependence between yi and si ,


yi and ti , and si and ti . Specifically, Qyy is a diagonal matrix with
entries κy , Qst is diagonal where the first n entries are κy and the other
m entries are zero, while Qsy and Qty , both of dimension (n + m) × n,
have nonzero entries −κy only at elements with the same row and column
index. The terms Qss and Qtt are the sum of two terms, one part from
the prior and an additional term on the diagonal due to (4.12). Finally,
Qss is the analog of (3.59) with seasonal length 12 rather than 4 and
precision κs plus a diagonal matrix with κy on the diagonal, while Qtt
equals (3.40) with precision κt plus a diagonal matrix with κy on the
diagonal.
The density π(s, t|y, κ) can now be found using Theorem 2.5 or Lemma
2.1. It is easiest to represent in its canonical parameterization:
      
s Qsy Qss Qst
| y, κ ∼ NC − y, .
t Qty Qts Qtt

©฀2005฀by฀Taylor & Francis Group, LLC


146 CASE STUDIES IN HIERARCHICAL MODELING

400

400
300

300
200

200
100

100
0

0
0 100
(a)
200 300 400 0 100
(b)
200 300 400

Figure 4.2 (a) The precision matrix Q of s, t|y, κ in the original ordering, and
(b) after appropriate reordering to obtain a band matrix with small bandwidth.
Only the nonzero terms are shown and those are indicated by a dot.

The nonzero structure of this precision matrix is displayed in Figure 4.2


before and after suitable reordering to reduce the bandwidth. Note that
before reordering, the submatrices of the seasonal model Qss and the
RW2 model Qtt are clearly visible.
Additional to s and t we also want to perform Bayesian inference
on the unknown precision parameters κ. Under a Poisson model for
observed counts, the square root counts are approximately normal with
constant variance 1/4, but, in order to allow for overdispersion, we
assume a G(4, 4) prior for κy . For κt we use a G(1, 0.0005) prior and for
κs a G(1, 0.1) prior. All of these priors are assumed to be independent.
Of course, other choices could be made as well.
We now propose a new configuration (s, t, κ) using the one-block
algorithm (4.8). Specifically, we do
κ∗s ∼ q(κ∗s | κs )
κ∗t ∼ q(κ∗t | κt )
κ∗y ∼ q(κ∗y | κy )
 ∗
s
∼ π(s, t | κ∗ , y).
t∗
To update the precisions, we will make a simple choice and follow a
suggestion from Knorr-Held and Rue (2002). Let κ∗s = f κs where the
scaling factor f has density
\[
\pi(f) \propto 1 + 1/f, \qquad f \in [1/F, F], \tag{4.13}
\]
and zero otherwise. Here, F > 1 is a tuning parameter. Other choices for the proposal distribution may be more appropriate, but we avoid a discussion here to ease the presentation. The choice (4.13) is reasonable, as the log is often a variance-stabilizing transformation for precisions. It is also convenient, as
\[
\frac{q(\kappa_s^* \mid \kappa_s)}{q(\kappa_s \mid \kappa_s^*)} = 1,
\]
since the density of κ∗s is proportional to (κ∗s + κs)/(κ∗s κs) on the interval κs[1/F, F]. We use the same proposal for κt and κy as well.

Figure 4.3 Trace plot showing the log of the three precisions κt (top, solid line), κs (middle, dashed line) and κy (bottom, dotted line) for the first 1000 iterations. The acceptance rate was about 30%.
The joint proposal (κ∗ , s∗ , t∗ ) is accepted/rejected jointly with prob-
ability,
\[
\begin{aligned}
\alpha &= \min\left\{ 1,\; \frac{\pi(s^*, t^*, \kappa^* \mid y)}{\pi(s, t, \kappa \mid y)} \times \frac{\pi(s, t \mid \kappa, y)}{\pi(s^*, t^* \mid \kappa^*, y)} \right\} \\
&= \min\left\{ 1,\; \frac{\pi(\kappa^* \mid y)}{\pi(\kappa \mid y)} \right\}.
\end{aligned}
\]
Only the posterior marginals for κ are involved since we are integrating
out (s, t). It is important not to forget to include correct normalization
constants (using the generalized determinant) of the IGMRF priors for
s and t when computing the acceptance probability.
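As a small self-contained check of the proposal (4.13) and its unit proposal ratio (an illustrative Python snippet, not part of the original text; the value F = 2 is arbitrary):

```python
import numpy as np

F = 2.0                                  # tuning parameter F > 1
Z = (F - 1.0 / F) + 2.0 * np.log(F)      # normalizing constant of pi(f) on [1/F, F]

def q(kappa_new, kappa_old):
    """Proposal density of kappa_new = f * kappa_old with f ~ pi(f) on [1/F, F]."""
    f = kappa_new / kappa_old
    if not (1.0 / F <= f <= F):
        return 0.0
    return (1.0 + 1.0 / f) / (Z * kappa_old)   # change of variables df = d(kappa_new)/kappa_old

def sample_f(rng):
    """Sample f from pi(f) proportional to 1 + 1/f on [1/F, F] by rejection."""
    bound = 1.0 + F                      # maximum of 1 + 1/f on [1/F, F], attained at f = 1/F
    while True:
        f = rng.uniform(1.0 / F, F)
        if rng.uniform(0.0, bound) <= 1.0 + 1.0 / f:
            return f

rng = np.random.default_rng(0)
kappa = 3.7
kappa_star = sample_f(rng) * kappa
# the Hastings ratio q(kappa*|kappa)/q(kappa|kappa*) equals one
print(q(kappa_star, kappa) / q(kappa, kappa_star))
```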
We have chosen the tuning parameters such that the acceptance rate
was approximately 30% using the same factor F for all three precisions.
The mixing of all precision parameters was quite good, see Figure 4.3.
More refined methods could be used, for example, incorporating the posterior dependence between the precision parameters in the proposal distribution for κ∗, but that is not necessary in this example. Note that the algorithm is effectively a simple Metropolis-Hastings algorithm for a three-dimensional density. We neither expect nor experience any problems regarding convergence or mixing of this algorithm.

Figure 4.4 Observed and predicted counts (posterior median within 2.5 and 97.5% quantiles) for the drivers data without the covariate.
Figure 4.4 displays the data and quantiles of the predictive distribution
of y (posterior median within 2.5 and 97.5% quantiles), back on the
original (squared) scale. The predictive distribution has been obtained by
simply adding zero-mean normal noise with precision κy(j) to each sample of s(j) + t(j); here j denotes the jth sample obtained from the MCMC
algorithm. Note that the figure also displays the predictive distribution
for 1985 where no data are available. There is evidence for overdispersion
with a posterior median of the observational precision κy equal to 0.49
rather than 4 that would have been expected under a Poisson observation
model.
We now extend the above model to include a regression parameter for
the effect of compulsory wearing of seat belts due to a law introduced
on January 31, 1983. The mean response of yi is now
\[
\mathrm{E}(y_i \mid s, t, \beta) = \begin{cases} s_i + t_i & i = 1, \ldots, 169 \\ s_i + t_i + \beta & i = 170, \ldots, 204. \end{cases}
\]
The additional parameter β, for which we assume a normal prior with zero precision, can easily be added to s and t to obtain a larger GMRF.
It is straightforward to sample the GMRF (s, t, β)|(y, κ) in one block so
we essentially use the same algorithm as without the covariate.
The posterior median of β is −5.0 (95% CI: [−6.8, −3.2]), very close
to the observed difference in square root counts of the corresponding two
periods shortly before and after January 31, 1983, which is −5.07. The
inclusion of this covariate has some effect on the estimated precision
parameters, which change from 0.49 to 0.54 for κy, from 495 to 1283 for κt, and from 28.8 to 27.6 for κs (all posterior medians). The sharp increase in the random-walk precision indicates that the previous model was undersmoothing the overall trend, because it ignores the information about the seat belt law, where a sudden nonsmooth drop in incidence could be expected. Note also that the overdispersion has now slightly decreased.

Figure 4.5 Observed and predicted counts for the drivers data with the seat belt covariate.
Observed and predicted counts can be seen in Figure 4.5 and the
slightly better fit of this model before and after January 1983 is visible.

The connection to methods based on the Kalman filter

An alternative method for direct simulation from the conditional posterior π(s, t, β|y, κ) is to use the forward-filtering-backward-sampling
(FFBS) algorithm by Carter and Kohn (1994) and Frühwirth-Schnatter
(1994). This requires forcing our model into a state-space form, which is
possible, but requires a high-dimensional state space and a degenerate
stochastic model with deterministic relationships between the param-
eters. These tricks are necessary in order to apply the Kalman filter
recursions and apply the FFBS algorithm. Knorr-Held and Rue (2002,
Appendix A) have shown that, for nondegenerate Markov models, the
GMRF approach using band-Cholesky factorization is equivalent to the
FFBS algorithm. The same relation also holds if we run the forward-
filtering-backward-sampling on the nondeterministic part of the state-
space equations, suggested by Frühwirth-Schnatter (1994) and de Jong
and Shephard (1995). However, the GMRF approach using sparse-matrix
methods is superior over the Kalman-filter as it will run faster. Sparse-
matrix methods also offer great simplification conceptually, extend
trivially to general graphs, and can easily deal with conditioning, hard
and soft constraints. Moreover, the same computer code can be used for
GMRF models in time and in space (or even in space-time) on arbitrary
graphs.

4.2.2 Example: Munich rental guide
As a second example for a normal response model, we consider a Bayesian
semiparametric regression model with an additional spatial effect. For
more introduction to this subject, see Hastie and Tibshirani (2000),
Fahrmeir and Lang (2001a) and Fahrmeir and Tutz (2001, section 8.5).
This class of models has recently been coined geoadditive by Kammann
and Wand (2003).
Here we build a model similar to Fahrmeir and Tutz (2001, Example
8.7) for the 2003 Munich rental data. The response variable yi is the
rent (per square meter in Euros) for a flat and the covariates are the
spatial location, floor space, year of construction, and various indicator
variables such as an indicator for a flat with no central heating, no
bathroom, large balcony facing south or west, etc. A regression analysis
of such data provides a rental guide that is published by most larger
cities. According to German law, the owners can base an increase in
the amount they charge on an ‘average rent’ of a comparable flat. This
information is provided in an official rental guide. The dataset we will
consider consists of n = 2 035 observations.
Important covariates of each observation include the size of the flat,
z S , which ranges between 17 and 185 square meters, and the year
of construction z C , with values between 1918 and 2001. We adopt a
nonparametric modeling approach for the effect of these two covariates.
Let sS = (sS1, . . . , sSnS) denote the ordered distinct covariate values of zS and define sC = (sC1, . . . , sCnC) similarly. We now define the
corresponding parameter values xS and xC and assume that both xS
and xC follow the CRW2 model (3.61) at the locations sS and sC ,
respectively. We estimate a distinct parameter value for each value of
the covariates xS and xC , respectively.
There is also information in which district of Munich each flat is
located. In total there are 380 districts in Munich, and we assume an
unweighted IGMRF model (3.30) of order one for the spatial effect of
a corresponding parameter vector xL defined at each district. Only 312
of the districts actually contain data in our sample, so for the other
districts the estimates are based on extrapolation.
Finally, we also include a number of additional binary indicators as
fixed effects. We subsume all these covariates and an entry for the
intercept in a vector z i with parameters β. We assume β to have a diffuse
prior. For reasons of identifiability we place sum-to-zero restrictions on
both CRW2 models and the spatial IGMRF model.
We now assume that the response variables yi , i = 1, . . . , n, are
normally distributed:
 
\[
y_i \sim \mathcal{N}\!\left( \mu + x^S(i) + x^C(i) + x^L(i) + z_i^T \beta,\; 1/\kappa_y \right)
\]

with precision κy. The notation xS(i) (and similarly xC(i) and xL(i)) denotes the entry in xS corresponding to the value of the covariate sS
for observation i.
We used independent gamma G(1.0, 0.01) priors for all precision
parameters in the model κ = (κy , κS , κC , κL ). The posterior distribution
is
π(xS , xC , xL , µ, β, κ | y) ∝ π(xS | κS ) π(xC | κC ) π(xL | κL )π(µ)
× π(β) π(κ) π(y | xS , xC , xL , β, κy , µ).
As in Section 4.2.1, the components xS, xC, xL, and β are
a priori independent. However, conditional on the data they become
dependent, but their full conditionals are still GMRFs. To study the
dependence structure, let us consider in detail the likelihood term that
is proportional to
\[
\exp\left( -\frac{\kappa_y}{2} \sum_i \left( y_i - \left( \mu + x^S(i) + x^C(i) + x^L(i) + z_i^T \beta \right) \right)^2 \right).
\]

The dependence structure introduced is now different from the one in Section 4.2.1. Each combination of covariate values xS(i), xC(i),
xL (i), and z i , introduces dependence within this combination and makes
the specific term in the precision matrix nonzero. For these data there
are 1980 different combinations of the first three covariates and 2011
different combinations if we also include z i .
The precision matrix of (xS , xC , xL , µ, β) will therefore be nonsparse,
so a joint update will not be computationally efficient, ruling out the one-
block algorithm. We therefore switch to the subblock algorithm. There is
a natural grouping of the variables of interest and we use four subblocks,
(xS , κS ), (xC , κC ), (xL , κL ), and (β, µ, κy ).
We expect the dependence within each block to be stronger than
between the blocks. The last block consists of all fixed effects plus the
intercept jointly and the precision κy of the response variable. This
can be advantageous, if there is high posterior dependency between the
intercept and the fixed covariate effects.
The subblock algorithm now updates one block at a time, using
\[
\begin{aligned}
\kappa_L^* &\sim q(\kappa_L^* \mid \kappa_L) \\
x^{L,*} &\sim \pi(x^{L,*} \mid \text{the rest})
\end{aligned}
\]
and then accepts/rejects (κ∗L , xL,∗ ) jointly. The other subblocks are
updated similarly. A nice feature of the subblock algorithm is that
the full conditional of xL (and similarly of xS and xC) has the same
Markov properties as the prior. This will be clear when we now derive

π(xL,∗ | the rest). Introduce ‘fake’ data ỹ,
\[
\tilde{y}_i = y_i - \left( \mu + x^S(i) + x^C(i) + z_i^T \beta \right),
\]
then the full conditional of xL is
\[
\pi(x^L \mid \text{the rest}) \propto \exp\left( -\frac{\kappa_L}{2} \sum_{i \sim j} (x^L_i - x^L_j)^2 \right) \times \exp\left( -\frac{\kappa_y}{2} \sum_k \left( \tilde{y}_k - x^L(k) \right)^2 \right).
\]

The data ỹ do not introduce extra dependence between the xLi's, as ỹi acts as a noisy observation of xLi. Denote by ni the number of neighbors to location i and let L(i) be
\[
L(i) = \{ k : x^L(k) = x^L_i \},
\]
where its size is |L(i)|. The full conditional of xL is a GMRF with


canonical parameters (b, Q), where

\[
b_i = \kappa_y \sum_{k \in L(i)} \tilde{y}_k
\qquad \text{and} \qquad
Q_{ij} = \begin{cases} \kappa_L n_i + \kappa_y |L(i)| & \text{if } i = j \\ -\kappa_L & \text{if } i \sim j \\ 0 & \text{otherwise.} \end{cases}
\]
The additional sum-to-zero constraint is dealt with as described in Sec-
tion 2.3.3. Note that this is not equivalent to just recentering xL,∗ .
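To make the bookkeeping concrete, here is a minimal Python sketch (illustrative only, not from the text; the district graph, function name and observation-to-district mapping are made up for the example) that assembles (b, Q) for a toy set of districts:

```python
import numpy as np

def xL_full_conditional(neighbors, obs_district, y_tilde, kappa_L, kappa_y):
    """Canonical parameters (b, Q) of the full conditional of x^L.

    neighbors:    list of neighbor index lists, one per district
    obs_district: district index of each observation (array of ints)
    y_tilde:      'fake' data, one entry per observation
    """
    n = len(neighbors)
    b = np.zeros(n)
    Q = np.zeros((n, n))
    # likelihood part: each observation in district i adds kappa_y to Q_ii
    # and kappa_y * y_tilde to b_i
    for k, i in enumerate(obs_district):
        b[i] += kappa_y * y_tilde[k]
        Q[i, i] += kappa_y
    # prior part: kappa_L * n_i on the diagonal, -kappa_L for each neighbor pair
    for i, nbrs in enumerate(neighbors):
        Q[i, i] += kappa_L * len(nbrs)
        for j in nbrs:
            Q[i, j] -= kappa_L
    return b, Q

# toy example: 4 districts on a line, 6 observations
neighbors = [[1], [0, 2], [1, 3], [2]]
obs_district = np.array([0, 0, 1, 2, 3, 3])
y_tilde = np.array([0.2, -0.1, 0.5, 0.0, -0.3, 0.1])
b, Q = xL_full_conditional(neighbors, obs_district, y_tilde, kappa_L=10.0, kappa_y=1.0)
print(b)
print(Q)
```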
The values of κ for the first 1000 iterations of the subblock algorithm
are shown in Figure 4.6. The scaling factor F in (4.13) was tuned to give
an acceptance rate between 30% and 40%.
Figure 4.7 displays the estimated nonparametric effects of the floor
space (xS ) and the year of construction (xC ). There is a clear non-
linear effect of xS with a positive effect for smaller flats. The effect of
the variable xC is approximately linear for flats built after 1940, but
older flats, especially those built at the beginning of the 20th century
are more expensive than those built around 1940. Note that detailed
information about this covariate seems to be available only from 1966
on, whereas before 1966 the variable seems to be rather crudely categorized,
see Figure 4.7(b). Figure 4.8 displays the posterior median spatial effect,
which is similar to earlier analysis. The large estimated spatial effect of
the district in the east is due to very few observations in this district,
plus the fact that the district has only one neighbor, so the amount of
spatial smoothing will be small. This could easily be fixed by adding more
neighbors to this district, see the general discussion in Example 3.6.


Figure 4.6 The value of κ for the 1000 first iterations of the subblock algorithm,
where κS is the solid line (top), κC is the dashed line, κL is the dotted line
and κy is the dashed-dotted line (bottom).

Covariate                     post. median    95% credible interval
Intercept µ                   8.81            (8.60, 9.03)
Good location                 0.48            (0.26, 0.70)
Excellent location            1.63            (0.99, 2.23)
No hot water                  −2.02           (−2.56, −1.47)
No central heating            −1.34           (−1.68, −0.96)
No tiles in bathroom          −0.54           (−0.74, −0.30)
Special bathroom interior     0.55            (0.23, 0.87)
Special kitchen interior      1.19            (0.86, 1.52)

Table 4.1 Posterior median and 95% credible interval for fixed effects

The estimates of the fixed effects β and the intercept µ are given in
Table 4.1.

4.3 Auxiliary variable models


Auxiliary variables can in some cases be introduced into the model to
retrieve GMRF full conditionals that are otherwise lost by nonnormality.
We first discuss the construction of scale mixtures of normals and then
discuss how auxiliary variables can be useful for replacing normal with
Student-t distributions. We give more emphasis to the binomial probit
and logit model for categorical data, which we also illustrate with two case studies.

Figure 4.7 Nonparametric effects for floor space (a) and year of construction (b). The figures show the posterior median within 2.5 and 97.5% quantiles. The distribution of the observed data is indicated with jittered dots.

4.3.1 Scale mixtures of normals

Scale mixtures of normals play an important role in hierarchical modeling. Suppose x | λ ∼ N(0, λ−1), where λ > 0 is a precision parameter with some prespecified distribution. Then x is called a normal scale mixture.

Figure 4.8 Estimated posterior median effect for the location variable. The shaded areas are districts with no houses, such as parks or fields.

Although π(x|λ) is normal, its marginal density f(x) is not normal unless λ is a constant. However, f is necessarily both symmetric
and unimodal. Kelker (1971) and Andrews and Mallows (1974) show
the following result establishing necessary and sufficient conditions for
x ∼ f (x) to have a normal scale mixture representation.
Theorem 4.2 If x has density f (x) symmetric around 0, then there
exist independent random variables z and v, with z standard normal
such that x = z/v iff the derivatives of f(x) satisfy
\[
\left( -\frac{d}{dy} \right)^{k} f(\sqrt{y}\,) \;\ge\; 0
\]
for y > 0 and for k = 1, 2, . . ..
Many important symmetric random variates are scale mixtures of
normals, in particular the Student-tν distribution with ν degrees of
freedom, which includes the Cauchy distribution as a special case for
ν = 1, the Laplace distribution, and the logistic distribution. Table 4.2
gives the corresponding mixing distribution for the precision parameter
λ that generates these distributions as scale mixtures of normals.

Distribution of x     Mixing distribution of λ
Student-tν            G(ν/2, ν/2)
Logistic              1/(2K)², where K is Kolmogorov-Smirnov distributed
Laplace               1/(2E), where E is exponentially distributed

Table 4.2 Important scale mixtures of normals

The Kolmogorov-Smirnov distribution and the logistic distribution are defined in Appendix A and are abbreviated as KS and L, respectively.
The representation of the logistic distribution as a scale mixture of
normals will become important for the logistic regression model for
binary response data.
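As a quick, self-contained check of the first row of Table 4.2 (illustrative Python, not part of the original text), Student-tν draws can be generated through the scale-mixture construction and compared with direct draws:

```python
import numpy as np

rng = np.random.default_rng(0)
nu, n = 4.0, 200_000

# scale mixture: lambda ~ Gamma(nu/2, rate nu/2), x | lambda ~ N(0, 1/lambda)
lam = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)   # numpy uses the scale parameterization
x_mix = rng.standard_normal(n) / np.sqrt(lam)

# direct Student-t draws for comparison
x_t = rng.standard_t(df=nu, size=n)

# both variances should be close to nu / (nu - 2) = 2 for nu = 4
print(x_mix.var(), x_t.var())
```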

4.3.2 Hierarchical-t formulations


Using a scale mixture approach for GMRF models facilitates the analysis
of certain models for continuous responses. The main idea is to include in
the MCMC algorithm the mixing variable λ, or a vector of such variables
λ, as so-called auxiliary variables. The auxiliary variables are not of
interest but can be helpful to ensure GMRF full conditionals.
As a simple example consider the RW1 model discussed in Section
3.3.1. Suppose we wish to replace the assumption of normally distributed
increments by a Student-tν distribution to allow for larger jumps in the
sequence x. This can be done using n − 1 independent G(ν/2, ν/2) scale
mixture variables λi :
\[
\Delta x_i \mid \lambda_i \overset{\text{iid}}{\sim} \mathcal{N}\!\left(0, (\kappa \lambda_i)^{-1}\right), \qquad i = 1, \ldots, n-1.
\]
With observed data yi ∼ N(xi, κy−1) for i = 1, . . . , n, the posterior
density for (x, λ) is
π(x, λ | y) ∝ π(x | λ) π(λ) π(y | x).
Note that x|(y, λ) is now a GMRF while λ1 , . . . , λn−1 |(x, y) are
conditionally independent gamma distributed with parameters (ν + 1)/2
and (ν + κ(∆xi )2 )/2.
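The corresponding Gibbs step for the mixing variables is short in practice; a minimal Python sketch (illustrative, not from the text; the GMRF update of x | y, λ is assumed to be handled elsewhere) is:

```python
import numpy as np

def update_lambda(x, kappa, nu, rng):
    """Sample lambda_1,...,lambda_{n-1} | x from their conditionally independent
    Gamma((nu+1)/2, (nu + kappa*(dx_i)^2)/2) full conditionals (rate parameterization)."""
    dx = np.diff(x)                        # increments of the RW1 field
    shape = (nu + 1.0) / 2.0
    rate = (nu + kappa * dx**2) / 2.0
    return rng.gamma(shape=shape, scale=1.0 / rate)

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(10))     # a toy latent path
print(update_lambda(x, kappa=1.0, nu=4.0, rng=rng))
```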
The replacement of a normal distribution with a Student-tν dis-
tribution is quite popular and is sometimes called the hierarchical-
t formulation. The same approach can also be used to replace the
observational normal distribution with a Student-tν distribution. In this
case the posterior of (x, λ) is
π(x, λ | y) ∝ π(x) π(y | x, λ) π(λ).

The full conditional of x is a GMRF while λ1 , . . . , λn |(x, y) are
conditionally independent.

4.3.3 Binary regression models

Consider a Bernoulli observational model for binary responses with latent parameters that follow a GMRF x, which in turn usually depends
on further hyperparameters θ. Denote by B(p) a Bernoulli distribution
with probability p for 1 and 1 − p for 0. The most prominent regression
models in this framework are logit and probit models, where
yi ∼ B(g −1 (z Ti x)) (4.14)
for i = 1, . . . , m. Here z i is the vector of covariates, assumed to be fixed,
and g(p) is a link function:
\[
g(p) = \begin{cases} \log\!\big(p/(1-p)\big) & \text{logit link} \\ \Phi^{-1}(p) & \text{probit link} \end{cases}
\]

where Φ(·) denotes the standard normal distribution function.


These models have an equivalent representation using auxiliary vari-
ables w = (w1 , . . . , wm )T , where
\[
\begin{aligned}
\epsilon_i &\overset{\text{iid}}{\sim} G \\
w_i &= z_i^T x + \epsilon_i \\
y_i &= \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{otherwise.} \end{cases}
\end{aligned}
\]

Here, G(·) is the distribution function of the standard normal distribution in the probit case and of the standard logistic distribution (see Appendix A) in the logit case.
Note that yi is deterministic conditional on the sign of the stochastic
auxiliary variable wi . The equivalence can be seen immediately from
Prob(yi = 1) = Prob(wi > 0) = Prob(z Ti x + ǫi > 0) = G(z Ti x),
using that the density of ǫi is symmetric about zero. This auxiliary
variable approach was proposed by Albert and Chib (1993) for the probit
link, and Chen and Dey (1998) and Holmes and Held (2003) for the logit
link.
The motivation for introducing auxiliary variables is to ease the
construction of MCMC algorithms. We will discuss this issue in detail
in the following, first for the simpler probit link and then for the logit
link.

MCMC for probit regression using auxiliary variables
Let x|θ be a zero mean GMRF of size n and assume first that z Ti x = xi
and m = n. Using the probit link, the posterior distribution is
π(x, w, θ | y) ∝ π(θ) π(x | θ) π(w | x) π(y | w). (4.15)
The full conditional of x is then
\[
\pi(x \mid \theta, w) \propto \exp\left( -\frac{1}{2} x^T Q(\theta) x - \frac{1}{2} \sum_i (x_i - w_i)^2 \right),
\]
which is a GMRF:
x | θ, w ∼ NC (w, Q(θ) + I). (4.16)
The full conditional of w factorizes as

\[
\pi(w \mid x, y) = \prod_i \pi(w_i \mid x_i, y_i), \tag{4.17}
\]
where wi|(x, y) is a standard normal with mean xi, but truncated to be positive if yi = 1 or to be negative if yi = 0. A natural approach to
sample from (4.15) is to use two subblocks (θ, x) and w. The block (θ, x)
is sampled using (4.10) and w is updated using the factorization (4.17)
and algorithms for truncated normals, see, for example, Robert (1995).
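A compact sketch of these two updates for the simple case zTi x = xi (illustrative Python, not from the text; a fixed toy precision matrix stands in for Q(θ), and θ itself is not updated here):

```python
import numpy as np
from scipy.stats import norm

def sample_w(x, y, rng):
    """Sample w_i | x_i, y_i: standard normal with mean x_i, truncated to
    (0, inf) if y_i = 1 and to (-inf, 0) if y_i = 0 (inverse-CDF method)."""
    u = rng.uniform(size=len(x))
    p0 = norm.cdf(0.0 - x)                 # P(w_i <= 0)
    lo = np.where(y == 1, p0, 0.0)
    hi = np.where(y == 1, 1.0, p0)
    return x + norm.ppf(lo + u * (hi - lo))

def sample_x(w, Q, rng):
    """Sample x | theta, w ~ N_C(w, Q(theta) + I)."""
    A = Q + np.eye(len(w))
    L = np.linalg.cholesky(A)
    mu = np.linalg.solve(A, w)
    return mu + np.linalg.solve(L.T, rng.standard_normal(len(w)))

rng = np.random.default_rng(0)
n = 6
Q = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # toy proper precision matrix
x = np.zeros(n)
y = rng.integers(0, 2, size=n)
for _ in range(5):          # a few sweeps of the two blocks (theta kept fixed here)
    w = sample_w(x, y, rng)
    x = sample_x(w, Q, rng)
print(x)
```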
Consider now the general setup (4.14) using a probit link. The full
conditional of x is then

\[
x \mid \theta, w \sim \mathcal{N}_C\!\left( Z^T w,\; Q(\theta) + Z^T Z \right), \tag{4.18}
\]
where the m × n matrix Z is
\[
Z = \begin{pmatrix} z_1^T \\ z_2^T \\ \vdots \\ z_m^T \end{pmatrix}.
\]
The full conditional of w is now

\[
\pi(w \mid x, y) = \prod_i \pi(w_i \mid x, y_i),
\]
where wi|(x, yi) is standard normal with mean zTi x, but truncated to be positive if yi = 1 or to be negative if yi = 0. Sparseness of the
precision matrix for x|(θ, w) now depends also on the sparseness of
Z T Z. Typically, but not always, if n is large then Z T Z is sparse, while if
n is small then Z T Z is dense. Hence, sampling x from its full conditional
is often computationally feasible.
If m is not too large, we can also integrate out x to obtain
w | y ∼ N (0, I + ZQ(θ)−1 Z T ) 1[w, y]. (4.19)

Here 1[w, y] is a shorthand for the truncation induced by wi > 0 if yi = 1
and wi < 0 if yi = 0 for each i. Sampling from a truncated normal in high dimension is hard; therefore, an alternative to sampling w from π(w|x, y) is to sample each component wi from π(wi | w−i, y) using (4.19). See
Holmes and Held (2003) for further details and a comparison.

MCMC for logistic regression using auxiliary variables


For the logit link we need to introduce one additional vector of length m
of auxiliary variables to transform the logistically distributed ǫi into a scale
mixture of normals. Using the result in Table 4.2 we obtain the following
representation
\[
\begin{aligned}
\psi_i &\overset{\text{iid}}{\sim} \text{KS} \\
\lambda_i &= 1/(2\psi_i)^2 \\
w_i \mid \lambda_i &\sim \mathcal{N}(z_i^T x,\, 1/\lambda_i) \\
y_i &= \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{if } w_i < 0. \end{cases}
\end{aligned}
\]
The variable ψi is introduced for clarity only as it is a deterministic
function of λi . The posterior of interest is now
π(x, w, λ, θ | y) ∝ π(θ) π(x | θ) π(λ) π(w | x, λ) π(y | w).
The full conditional of x is

\[
x \mid \theta, w \sim \mathcal{N}_C\!\left( Z^T \Lambda w,\; Q(\theta) + Z^T \Lambda Z \right) \tag{4.20}
\]

where Λ = diag(λ). The minor adjustments are due to the different precisions of the wi's rather than all precisions equal to unity as in the
probit case.
The full conditional of w factorizes as

\[
\pi(w \mid x, \lambda, y) = \prod_i \pi(w_i \mid x, \lambda_i, y_i),
\]
where wi|(x, λi, yi) is normal with mean zTi x and precision λi, but truncated to be positive if yi = 1 and negative if yi = 0. The full conditional of λ factorizes similarly:
\[
\pi(\lambda \mid x, w) = \prod_i \pi(\lambda_i \mid x, w_i). \tag{4.21}
\]

It is a nonstandard task to sample from (4.21) as each term on the rhs involves the Kolmogorov-Smirnov distribution, for which the distribution
function is only known as an infinite series, see Appendix A. Holmes
and Held (2003) describe an efficient and exact approach based on the

series method (Devroye, 1986) that avoids the density evaluation. The
alternative algorithm proposed in Chen and Dey (1998) approximates
the Kolmogorov-Smirnov density by a finite evaluation of this series.
The discussion so far suggests using the subblock approach and
constructing an MCMC algorithm updating each of the following three
subblocks conditionally on the rest: (θ, x), w and λ. However, further
progress can be made if we merge w and λ into one block using
π(w, λ | x, y) = π(w | x, y) π(λ | w, x), (4.22)
where we have integrated λ analytically out of π(w|x, y, λ) to obtain
π(w|x, y). It now follows that

\[
\pi(w \mid x, y) = \prod_i \pi(w_i \mid x, y_i),
\]
where wi|(x, yi) is L(zTi x, 1) distributed, but truncated to be positive if yi = 1 and negative if yi = 0. It is easy to sample from this distribution using inversion.
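For example, such a truncated logistic draw by inversion takes only a few lines (illustrative Python, not from the text):

```python
import numpy as np

def sample_w_logit(eta, y, rng):
    """Sample w_i ~ Logistic(eta_i, 1), truncated to w_i > 0 if y_i = 1
    and to w_i < 0 if y_i = 0, by inverting the logistic CDF."""
    F0 = 1.0 / (1.0 + np.exp(eta))        # P(w_i <= 0): logistic CDF at zero
    u = rng.uniform(size=len(eta))
    p = np.where(y == 1, F0 + u * (1.0 - F0), u * F0)   # uniform on the allowed CDF range
    return eta + np.log(p / (1.0 - p))    # inverse CDF (the logit function)

rng = np.random.default_rng(0)
eta = np.array([-1.0, 0.3, 2.0])
y = np.array([1, 0, 1])
print(sample_w_logit(eta, y, rng))
```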
To summarize, we suggest using the subblock algorithm with two
subblocks (θ, x) and (w, λ). The first block is updated using (4.18)
together with a simple proposal for θ, while the second block is updated
using (4.22), sampling first w then λ.

4.3.4 Example: Tokyo rainfall data


For illustration, we consider a simple but much analyzed binomial time
series, taken from Kitagawa (1987). Each day during the years 1983
and 1984, it was recorded whether there was more than 1 mm rainfall
in Tokyo. It is of interest to estimate the underlying probability pi of
rainfall at calendar day i = 1, . . . , 366, which is assumed to be gradually
changing with time. Note that for i = 60, which corresponds to February 29, only one binary observation is available, while for all other calendar
days there are two. In total, we have m = 366 + 365 = 731 binary
observations and an underlying GMRF x of dimension n = 366.
In contrast to earlier modeling approaches (Fahrmeir and Tutz, 2001,
Kitagawa, 1987), we assume a circular RW2 model for x = g(p) with
precision κ. This explicitly connects the end and the beginning of the
time series, because smooth changes between the last week in December
and the first week in January are also to be expected. Such a model
cannot be directly analyzed with a state-space modeling approach. The
precision matrix of x is now a circular precision matrix Q = κR
with circulant structure matrix R with base (6, −4, 1, 0, . . . , 0, 1, −4)T .
Comparing Q to the precision matrix (3.40) of the ordinary RW2 model,
we see that only the entries in the first two and last two rows and columns are different. Note that the rank of Q is n−1 and larger than the rank of
the noncircular RW2 model. The precision κ is assigned a G(1.0, 0.0001)
prior.
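The circulant structure matrix is easy to construct and check numerically; a small illustrative Python fragment (not from the text) is:

```python
import numpy as np

def circular_rw2_structure(n):
    """Circulant structure matrix R of the circular RW2 model,
    with base (6, -4, 1, 0, ..., 0, 1, -4)."""
    base = np.zeros(n)
    base[[0, 1, 2, n - 2, n - 1]] = [6.0, -4.0, 1.0, 1.0, -4.0]
    # each row is the base vector cyclically shifted
    return np.array([np.roll(base, i) for i in range(n)])

R = circular_rw2_structure(12)
print(np.linalg.matrix_rank(R))   # n - 1 = 11; the null space is spanned by the constant vector
```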
If we assume a probit link, then the observational model is
\[
\begin{aligned}
y_{i1}, y_{i2} &\sim \mathcal{B}\!\left(g^{-1}(x_i)\right), \qquad i \neq 60, \\
y_{60,1} &\sim \mathcal{B}\!\left(g^{-1}(x_{60})\right).
\end{aligned}
\]
We have two observations for each calendar day except for February 29.
Therefore we use a double index for the observed data. The data only
contain information about the sum yi1 + yi2, so we assign (completely arbitrarily) yi1 = 1 and yi2 = 0 if the sum is 1.
We now introduce auxiliary variables wij for each binary response
variable yij . Let mi denote the number of observations of xi , which is 2
except for i = 60, and let m = (m1 , . . . , mn )T . Further, let
\[
w_{i\bullet} = \sum_{j=1}^{m_i} w_{ij}
\]
and w• = (w1•, . . . , wn•)T. The full conditional of x can now be derived either by extending (4.16) or from (4.18), as
\[
x \mid \text{the rest} \sim \mathcal{N}_C\!\left( w_\bullet,\; Q + \operatorname{diag}(m) \right). \tag{4.23}
\]
The full conditional of w is (4.17) where wij is standard normal with
mean xi , but truncated to be positive if yij = 1 and negative if yij = 0.
The MCMC algorithm uses two subblocks (κ, x) and w, and successively
updates each block conditional on the rest.
For comparison, we have also implemented the auxiliary approach
for the logistic regression model, using a G(1.0, 0.000289) prior for
κ to accommodate the different variance of the logistic distribution
compared to the standard normal distribution. Note that the ratio of the
expectations of the two priors (logit versus probit) is 0.000289/0.0001 =
2.89, approximately equal to π²/3 · (15/16)², the common factor to
translate logit into probit results.
We need to introduce another layer of auxiliary variables λ to
transform the logistic distribution into a scale mixture of normals,
see Section 4.3.3. The full conditional of x now follows from (4.20),
which is of the form (4.23) using w̃• and m̃ instead, where
\[
\tilde{w}_{i\bullet} = \sum_{j=1}^{m_i} \lambda_{ij} w_{ij}, \qquad \text{and} \qquad \tilde{m}_i = \sum_{j=1}^{m_i} \lambda_{ij}.
\]

Also in this case we use the subblock algorithm using two blocks (κ, x)
and (w, λ). The second block is sampled using (4.22) in the correct
order, first wij (for all ij) from the L(xi, 1) distribution, truncated to be positive if yij = 1 and negative if yij = 0, and then λij (for all ij) from (4.21).

Figure 4.9 Observed frequencies and fitted probabilities with uncertainty bounds for the Tokyo rainfall data. (a): probit link. (b): logit link.
Figure 4.9(a) displays the binomial frequencies, scaled to the interval
[0, 1], and the estimated underlying probabilities pi obtained from the
probit regression approach, while (b) gives the corresponding results
obtained with the logistic link function. There is virtually no difference
between the results using the two link functions. Note that the credible

intervals do not get wider at the beginning and the end of the time series, due to the circular RW2 model.

Figure 4.10 Trace plots using the subblock algorithm and the single-site Gibbs sampler; (a) and (c) show the traces of log κ and g−1(x180) using the subblock algorithm, while (b) and (d) show the traces of log κ and g−1(x180) using the single-site Gibbs sampler.
We take this opportunity to compare the subblock algorithm with a
naı̈ve single-site Gibbs sampler, which is known to converge slowly for this
problem (Knorr-Held, 1999). It is important to remember that this is
not a hard problem nor is the dimension high. We present the results
for the logit link. Figure 4.10 shows the trace of log κ using the subblock
algorithm in (a) and the single-site Gibbs sampler in (b), and the trace of
g −1 (x180 ) in (c) and (d). Both algorithms were run for 10000 iterations.
The results clearly demonstrate that the single-site Gibbs sampler has

severe problems. The trace of log κ has not yet reached its level even after
10 000 iterations while the subblock algorithm seems to converge after
just a few iterations. The computational cost per iteration is comparable.
However, trace plots similar to those in Figure 4.10(b) are not at all
uncommon using single-site schemes for moderately complex problems, especially now that hierarchical models are becoming increasingly popular.
The discussion in Section 4.1.2 is relevant here as well.
The sum of repeated independent Bernoulli observations is binomially distributed if the probability of success is constant; hence this example
also demonstrates how to use the auxiliary approach for binomial
regression.
We end this example with a comment on an alternative MCMC updating algorithm in the probit model. We could add a further
auxiliary variable w60,2 to make up for the ‘missing’ second observation
for February 29. The precision matrix for the full conditional of x will
now be circulant; hence we could have used the fast discrete Fourier
transform and Algorithm 2.10 to simulate from it. However, this would
not have been possible in the logit model as there the auxiliary variables
w have different precisions.

4.3.5 Example: Mapping cancer incidence

As a second example for an auxiliary variable approach, we consider a problem in mapping cancer incidence where the stage of the disease at
time of diagnosis is known. For an introduction to the topic see Knorr-
Held et al. (2002). Data were available on all incidence cases of cervical
cancer in the former German Democratic Republic (GDR) from 1979, stratified
by district and age group. Each of the n = 6 690 cases of cervical cancer
has been classified into either a premalignant (3755 cases) or a malignant
(2935 cases) stage. It is of interest to estimate the spatial variation of the
incidence ratio of premalignant to malignant cases in the 216 districts,
after adjusting for age effects. Age was categorized into 15 age groups.
Let yi = 1 denote a premalignant case and yi = 0 a malignant case.
We assume a logistic binary regression model yi ∼ B(pi ), i = 1, . . . , n
with
logit(pi ) = α + βj(i) + γk(i) ,
where j(i) and k(i) denote age group and district of case i, respectively.
The age group effects β are assumed to follow a RW2 model while for the
spatial effect γ we assume that it is the sum of an IGMRF model (3.30)
plus additional unstructured variation:

γk = uk + vk .

Figure 4.11 (a) The precision matrix Q in the original ordering, and (b) the
precision matrix after appropriate reordering to reduce the number of
terms in the Cholesky triangle shown in (c). Only the nonzero terms are shown
and those are indicated by a dot.

Here, u follows the IGMRF model (3.30) with precision κu and v is normal with zero mean and diagonal precision matrix with entries κv.
For the corresponding precision parameters we assume a G(1.0, 0.01) for
both κu and κv and a G(1.0, 0.0005) for κβ . A diffuse prior is assumed
for the overall mean α and sum-to-zero constraints are placed both on
β and u. Note that this is not necessary for the unstructured effect v,
which has a proper prior with mean zero a priori.
Let κ = (κβ , κu , κv )T denote the vector of all precision parameters
in the model. Similar to the previous example, we used auxiliary
variables w1 , . . . , wn and λ1 , . . . , λn to facilitate the implementation of
the logistic regression model. Figure 4.11 displays the precision matrix
of (α, β, u, v)|(κ, w, λ) before (a) and after (b) appropriate reordering
to reduce the number of nonzero terms in the Cholesky triangle (c).
Note how the reordering algorithm puts the α and β variables at the
end, so the remaining variables become conditionally independent after
factorization as discussed in Section 2.4.3.
In our MCMC algorithm we group all variables into two subblocks and
update all variables in one subblock conditional on the rest. The first
subblock consists of (α, β, u, v, κ), while the auxiliary variables (w, λ)
form the other block.
Figure 4.12 displays the estimated age-group effect, which is close
to a linear effect on the log odds scale with a slightly increasing slope
for increasing age. Figure 4.13 displays the estimates of the spatially
structured component exp(u) and the total spatial effect exp(γ) =
exp(u + v). It can clearly be seen that the total spatial variation
is dominated by the spatially structured component. However, the
unstructured component plays a nonnegligible role, as the total pattern
is slightly rougher (range 0.34 to 5.2) than the pattern of the spatially
structured component alone (range 0.43 to 4.4).


Figure 4.12 Nonparametric effect of age group. Posterior median of the log
odds within 2.5 and 97.5% quantiles. The distribution of the observed covariate
is indicated with jittered dots.


Figure 4.13 Estimated odds ratio (posterior median) for (a) the spatially
structured component ui and (b) the sum of the spatially structured and
unstructured variable ui + vi . The shaded region is West Berlin.

The estimates are similar to the results obtained by Knorr-Held et al.
(2002), who analyzed the corresponding data from 1975. Incidentally,
the intercept α, which is not of central interest, has a posterior median
of 0.79 with 95% credible interval [0.62, 1.00].

4.4 Nonnormal response models


We will now look at hierarchical GMRF models where the likelihood
is nonnormal. We have seen in the previous section that the binomial
model with probit and logit link can be reformulated using auxiliary
variables so that the full conditional of the latent GMRF is still a GMRF.
However, in other situations such an augmentation of the parameter
space is not possible. In these cases, it is useful to approximate the
log likelihood using a second-order Taylor expansion and to use this
GMRF approximation as a Metropolis-Hastings proposal in an MCMC
algorithm. A natural choice is to expand around the current state of
the Markov chain. However, the corresponding approximation can be
improved by repeating this process and expanding around the mean
of this approximation to generate an improved approximation. Such a
strategy will ultimately converge (under some regularity conditions), so
the mean of the approximation equals the mode of the full conditional.
The method has thus much in common with algorithms to calculate
the maximum likelihood estimator in generalized linear models, such as
Fisher scoring or iteratively reweighted least squares.

4.4.1 The GMRF approximation


The univariate case
The approach taken is best described by a simple example. Suppose
there is only one observation y from a Poisson distribution with mean
λ. Suppose further that we place a normal prior on η = log λ, η ∼
N (µ, κ−1 ), so the posterior distribution π(η|y) is
\[
\begin{aligned}
\pi(\eta \mid y) &\propto \pi(\eta)\, \pi(y \mid \eta) \\
&= \exp\left( -\frac{\kappa}{2} (\eta - \mu)^2 + y\eta - \exp(\eta) \right) = \exp(f(\eta)), \tag{4.24}
\end{aligned}
\]
2
say. In order to approximate f (η), a common approach is to construct
a quadratic Taylor expansion of the (unnormalized) log-posterior f (η)
around a suitable value η0,
\[
\begin{aligned}
f(\eta) &\approx f(\eta_0) + f'(\eta_0)(\eta - \eta_0) + \tfrac{1}{2} f''(\eta_0)(\eta - \eta_0)^2 \\
&= a + b\eta - \tfrac{1}{2} c \eta^2.
\end{aligned}
\]
2

Figure 4.14 Normal approximation (dashed line) of the posterior density (4.24)
(solid line) for y = 3, µ = 0 and κ = 0.001 based on a quadratic Taylor
expansion around η0 for η0 = 0, 0.5, 1, 1.5. The value of η0 is indicated with
a small dot in each plot.

Here, b = f′(η0) − f′′(η0)η0 and c = −f′′(η0). The value of a is not relevant for the following.
We can now approximate π(η|y) by π̃(η|y), where
\[
\tilde{\pi}(\eta \mid y) \propto \exp\left( -\frac{1}{2} c \eta^2 + b \eta \right), \tag{4.25}
\]
which is in the form of the canonical parametrization NC(b, c). Hence π̃(η|y) is a normal distribution with mean µ1(η0) = b/c and precision κ1(η0) = c. We explicitly emphasize the dependence of µ1 and κ1 on
η0 and that all that enters in this construction is the value of f (η)
and its first and second derivative at η0 . Figure 4.14 illustrates this
approximation for y = 3, µ = 0, κ = 0.001, and η0 = 0, 0.5, 1, and 1.5.
One can clearly see that the approximation is better, the closer η0 is to
the mode of π(η|y).
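The construction is easily reproduced numerically; the following illustrative Python snippet (not from the text) uses the same toy values y = 3, µ = 0, κ = 0.001 and iterates the approximation toward the mode:

```python
import numpy as np

y, mu, kappa = 3.0, 0.0, 0.001

def quadratic_approx(eta0):
    """Return (b, c) such that f(eta) ~ a + b*eta - 0.5*c*eta^2 around eta0,
    for f(eta) = -0.5*kappa*(eta - mu)^2 + y*eta - exp(eta)."""
    f1 = -kappa * (eta0 - mu) + y - np.exp(eta0)   # f'(eta0)
    f2 = -kappa - np.exp(eta0)                     # f''(eta0)
    c = -f2
    b = f1 - f2 * eta0
    return b, c

eta = 0.0
for it in range(6):                 # Newton-Raphson: the mean b/c of the approximation
    b, c = quadratic_approx(eta)    # becomes the next expansion point
    eta = b / c
    print(it, eta)                  # converges to the mode of pi(eta | y), about log(3) here
```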
The idea is now to use this normal distribution as a proposal
distribution in a Metropolis-Hastings step. For η0 we may simply take
the current value of the simulated Markov chain. More specifically, we
use a proposal distribution q(η ∗ |η0 ), which is normal with mean µ1 (η0 )
and precision κ1 (η0 ). Following the Metropolis-Hastings algorithm, this

©฀2005฀by฀Taylor & Francis Group, LLC


NONNORMAL RESPONSE MODELS 169
proposal is then accepted with probability
\[
\alpha = \min\left\{ 1,\; \frac{\pi(\eta^* \mid y)\, q(\eta_0 \mid \eta^*)}{\pi(\eta_0 \mid y)\, q(\eta^* \mid \eta_0)} \right\}. \tag{4.26}
\]
Note that this involves not only the evaluation of q(η ∗ |η0 ), which is
available as a by-product of the construction of the proposal, but also
the evaluation of q(η0 |η ∗ ). Thus, we also have to construct a second
quadratic approximation around the proposed value η ∗ and evaluate the
density of the corresponding normal distribution at the current point η0
in order to evaluate q(η0 |η ∗ ) and α.
It is instructive to consider this algorithm in the case where y|η is not
Poisson, but normal with unknown mean η and known precision λ,
\[
\pi(\eta \mid y) \propto \exp\left( -\frac{1}{2} (\lambda + \kappa)\eta^2 + (\lambda y + \kappa\mu)\eta \right).
\]
Thus, π(η|y) is already in the quadratic form of (4.25) and hence
normal with mean (λy + κµ)/(λ + κ) and precision (λ + κ). A quadratic
approximation to log π(η|y) at any point η0 will just reproduce the
same quadratic function independent of η0 . Using this distribution as
a proposal distribution leads to α = 1 in (4.26). Thus, if log π(y|η) is
already a quadratic function the algorithm outlined above leads to the
mean/mode of π(η|y) in one step, independent of η0 .
In the more general setting it is well known that iterated applications
of such quadratic approximations converge (under regularity conditions)
to the mode of π(η|y). More specifically, we set η1 = µ1 and repeat the
quadratic approximation process to calculate µ2 (η1 ) and κ2 (η1 ), then set
η2 = µ2 (η1 ) and repeat this process until convergence. This algorithm
is in fact just the well-known Newton-Raphson algorithm; a slightly
modified version is known in statistics as Fisher scoring or iteratively
reweighted least squares.
It is seen from Figure 4.14 that the normal approximation to the
density π(η|y) improves, the closer η0 is to the mode of π(η|y). This
suggests that one may apply iterative quadratic approximations to the
posterior density until convergence in order to obtain a normal proposal
with mean equal to the posterior mode. Such a proposal should have
relatively large acceptance rates and will also be independent of η0 .
On the other hand, it is desirable to avoid iterative algorithms
within iterative algorithms such as the Metropolis-Hastings algorithm,
as this is likely to slow down the speed of the Metropolis-Hastings
algorithm. Nevertheless, it can be advantageous to apply the quadratic
approximation not only once, but twice or perhaps even more, in order
to improve the approximation of the proposal density to the posterior
density. Note that we also have to construct these iterated

approximations around the proposed value η ∗ in order to evaluate
q(η0 |η ∗ ).

Generalization to the multivariate case


The idea described in the previous section can easily be generalized
to a multivariate setting. Suppose for simplicity that there are n
conditionally independent observations y1 , . . . , yn from a nonnormal
distribution where yi is an indirect observation of xi . Here x is a GMRF
with precision matrix Q and mean µ possibly depending on further
hyperparameters. The full conditional π(x|y) is then
\[
\pi(x \mid y) \propto \exp\left( -\frac{1}{2} (x - \mu)^T Q (x - \mu) + \sum_{i=1}^{n} \log \pi(y_i \mid x_i) \right).
\]
We now use a second-order Taylor expansion of \(\sum_{i=1}^{n} \log \pi(y_i \mid x_i)\) around µ0, say, to construct a suitable GMRF proposal density π̃(x|y). To be specific,
\[
\begin{aligned}
\tilde{\pi}(x \mid y) &\propto \exp\left( -\frac{1}{2} x^T Q x + \mu^T Q x + \sum_i \left( a_i + b_i x_i - \tfrac{1}{2} c_i x_i^2 \right) \right) \\
&\propto \exp\left( -\frac{1}{2} x^T \big(Q + \operatorname{diag}(c)\big) x + (Q\mu + b)^T x \right), \tag{4.27}
\end{aligned}
\]
where ci might be set to zero if it is negative. The canonical parameter-
ization is
NC (Qµ + b, Q + diag(c))
with mean µ1 , say. The approximation depends on µ0 as both b and c
depend on µ0 . Similar to the univariate case, we can repeat this process
and expand around µ1 to improve the approximation. The improvement
is due to µ1 being closer to the mode of π(x|y) than x0 . This is (most)
often the case as µ1 is one step of the multivariate Newton-Raphson
method to locate the mode of π(x|y). After m iterations when µm equals
the mode of π(x|y), we denote the approximation π̃(x|y) as the GMRF
approximation.
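A compact sketch of this iteration for conditionally independent Poisson observations yi with mean exp(xi) (illustrative Python, not from the text; dense linear algebra and a toy precision matrix are used for brevity, whereas in practice band or sparse Cholesky factorizations would be used):

```python
import numpy as np

def gmrf_approximation(y, Q, mu, n_iter=20):
    """Iterated quadratic approximation of pi(x | y) for y_i ~ Poisson(exp(x_i))
    and a GMRF prior x ~ N(mu, Q^{-1}); returns the canonical parameters of the
    Gaussian approximation constructed at (approximately) the mode."""
    x0 = mu.copy()
    for _ in range(n_iter):
        # expand sum_i log pi(y_i | x_i) = sum_i (y_i x_i - exp(x_i)) around x0
        c = np.exp(x0)                    # minus the second derivative
        b = y - np.exp(x0) + c * x0       # first derivative minus f'' * x0
        Qc = Q + np.diag(c)
        bc = Q @ mu + b
        x0 = np.linalg.solve(Qc, bc)      # mean of N_C(Q mu + b, Q + diag(c))
    return bc, Qc

rng = np.random.default_rng(0)
n = 8
Q = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # toy proper precision matrix
mu = np.zeros(n)
y = rng.poisson(lam=np.exp(mu)).astype(float)
b, Qapp = gmrf_approximation(y, Q, mu)
print(np.linalg.solve(Qapp, b))           # mode of pi(x | y) / mean of the approximation
```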
An important feature of (4.27) is that the GMRF approximation
inherits the Markov property of the prior on x, which is very useful for
MCMC simulation. A closer look at (4.27) reveals that this is because
yi depends on xi only. If covariates z i are also included, then yi will
typically depend on z Ti x. In this case the Markov properties may or
may not be inherited, depending on the z i ’s. Since Taylor expansion
is somewhat cumbersome in high dimensions, we will always try to
parameterize to ensure that yi depends only on xi . We will illustrate
this in the following examples.

There are also other strategies for locating the mode of log π(x|y)
than the Newton-Raphson method. Algorithms based on line search
in a specific direction are often based on the gradient of log π(x|y) or
that part of the gradient orthogonal to previous search directions. Such
approaches are particularly feasible as the gradient is
∇ log π(x | y) = −(Q + diag(c))x + Qµ + b.
To evaluate the gradient involves potentially costly matrix-vector prod-
ucts like Qx. However, this is computationally fast as it only requires
O(n) flops for common GMRFs, see Algorithm B.1 in the Appendix. In
our experience, line-search methods for optimizing log π(x|y) should be
preferred for huge GMRFs such as GMRFs for spatiotemporal problems,
while the Newton-Raphson approach is preferred for smaller GMRFs.
However, it is important to make all optimization methods robust as they
need to be fail-safe when they are to be used within an MCMC algorithm.
More details on robust strategies and constrained optimization for
constrained GMRFs are available in the documentation of the GMRFLib-
library described in Appendix B.
The Taylor expansion of log π(yi |xi ) is most accurate at the point
around which we expand. However, concerning an approximation to
π(x|y) it is more important that the error is small in the region where
the major part of the probability mass is. This motivates using numerical
approximations to obtain the terms (ai , bi , ci ) in favor of analytical
expressions of the Taylor expansion of log π(yi |xi ). Let f (η) = log π(yi |η)
and η0 be the point around which to construct an approximation. For
δ > 0 we use
\[
\begin{aligned}
c_i &= -\frac{f(\eta_0 + \delta) - 2 f(\eta_0) + f(\eta_0 - \delta)}{\delta^2} \\
b_i &= \frac{f(\eta_0 + \delta) - f(\eta_0 - \delta)}{2\delta} + \eta_0\, c_i
\end{aligned}
\]
and similarly for ai . The value of δ should not be too small, for example,
δ between 10−2 and 10−4 . This depends of course on the scale of xi .
These values of (ai , bi , ci ) ensure that the error in the approximation is
zero not only at η0 , but also at η0 ± δ. See also the discussion and the
examples in Rue (2001).
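In code, this three-point fit takes only a few lines (illustrative Python, not from the text):

```python
import numpy as np

def fit_quadratic(f, eta0, delta=1e-3):
    """Fit f(eta) ~ a + b*eta - 0.5*c*eta^2 exactly at eta0 and eta0 +/- delta."""
    fp, f0, fm = f(eta0 + delta), f(eta0), f(eta0 - delta)
    c = -(fp - 2.0 * f0 + fm) / delta**2
    b = (fp - fm) / (2.0 * delta) + eta0 * c
    a = f0 - b * eta0 + 0.5 * c * eta0**2
    return a, b, c

# example: the Poisson log-likelihood term log pi(y | eta) = y*eta - exp(eta) with y = 3
a, b, c = fit_quadratic(lambda eta: 3.0 * eta - np.exp(eta), eta0=1.0)
print(a, b, c)    # c is close to exp(1.0), the negative second derivative at eta0 = 1
```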
In those cases where the GMRF approximation is not sufficiently
accurate, we can go beyond the Gaussian and construct non-Gaussian
approximations. We will return to this issue in Section 5.2.

Example: Revisiting Tokyo rainfall data


As a simple example, we now revisit the Tokyo rainfall data from Section
4.3.4. Assuming a logistic regression model, the likelihood for calendar day i equals
\[
\pi(y_i \mid x_i) \propto \exp\left( x_i \sum_{j=1}^{m_i} y_{ij} - m_i \log\!\big(1 + \exp(x_i)\big) \right).
\]

Number of iterations     Acceptance rates    Iterations per second
1                        67.6%               64.8
2                        82.3%               45.4
3                        83.1%               34.8
until convergence        83.2%               33.1

Table 4.3 Acceptance rates and number of iterations per second for Tokyo rainfall data

We construct a GMRF approximation for the full conditional of x following the procedure in Section 4.4.1.
To illustrate the performance of the GMRF approximation, we have
computed the acceptance rate and the speed of the MCMC algorithm
for different values of the number of Newton-Raphson iterations to
construct the GMRF approximation. When we stop the Newton-
Raphson iterations before the mode is found, we obtain an approximation
to the GMRF approximation. We have used an independence sampler
for (κ, x), which is described in Section 5.2.3, by first sampling κ from
an approximation to the marginal posterior π(κ|y), and then sampling
from an approximation to π(x|κ, y). Finally, we accept/reject the two
jointly. These numbers are reported in Table 4.3. It can be seen that one
iteration already results in a very decent acceptance rate of 67.6%. For
two iterations, the rate increases to 82.3% but the algorithm runs slower.
The acceptance rates are only marginally larger if the proposal is defined
exactly at the posterior mode. In this example, it seems not necessary to
use more than two iterations. For comparison, we note that the auxiliary
variable algorithms are considerably faster, with 153.6 (221.7) iterations
per second on the same computer in the logit (probit) case.

4.4.2 Example: Joint disease mapping


In this final case study, we present a spatial analysis of oral cavity and lung cancer mortality rates in Germany, 1986–1990 (Held et al., 2004).
denotes lung cancer. The number of cases in region i depends also on
the number of people in that region, and their age distribution. The
expected number of cases of disease j in region i is calculated based on

this information such that
\[
\sum_i e_{ij} = \sum_i y_{ij}, \qquad j = 1, 2,
\]
is fulfilled. Hence, we consider only the relative risk, not the absolute risk.

Figure 4.15 The standardized mortality ratios for oral cavity cancer (left) and lung cancer (right) in Germany, 1986-1990.
The common approach is to assume that the observed counts yij are
conditionally independent Poisson observations,
yij ∼ P(eij exp(ηij )),
where ηij denotes the log relative risk in area i for disease j.
The standardized mortality ratios (SMRs) yij /eij are displayed in Fig-
ure 4.15 for both diseases. For some background information on why the
SMRs are not suitable for estimating the relative risk, see, for example,
Mollié (1996). We will first consider a model for a separate analysis of
disease risk for each disease, then we will discuss a model for a joint
analysis of both diseases.

Separate analysis of disease risk


We suppress the second index for simplicity. One of the most commonly
used methods for a separate spatial analysis assumes that the log-relative risk η can be decomposed into

η = µ1 + u + v, (4.28)

where µ is an intercept, u a spatially structured component, and v an unstructured component (Besag et al., 1991, Mollié, 1996). The
intercept is often assumed to be zero mean normal with precision κµ .
The unstructured component is typically modeled with independent zero
mean normal variables with precision κv , say. The spatially structured
component u is typically an IGMRF of first order of the form (3.30) with
precision κu . Here we use the simple form without additional weights. A
sum-to-zero restriction is placed on u to ensure identifiability of µ. If the
prior on µ has zero precision, an equivalent model is to drop µ and the
sum-to-zero restriction, but such a formulation is not so straightforward
to generalize to a joint analysis of disease risk.
Let κ = (κu , κv )T denote the two unknown precisions while κµ is
assumed to be fixed to some value; we simply set κµ = 0. The posterior distribution is now
\[
\begin{aligned}
\pi(\mu, u, v, \kappa \mid y) \propto{}& \exp\left( -\frac{\kappa_\mu}{2} \mu^2 \right) \kappa_v^{n/2} \exp\left( -\frac{\kappa_v}{2} \sum_i v_i^2 \right) \\
&\times \kappa_u^{(n-1)/2} \exp\left( -\frac{\kappa_u}{2} \sum_{i \sim j} (u_i - u_j)^2 \right) \\
&\times \exp\left( \sum_i \big( y_i (\mu + u_i + v_i) - e_i \exp(\mu + u_i + v_i) \big) \right) \\
&\times \pi(\kappa).
\end{aligned}
\]

For π(κ) we choose independent G(1, 0.01) priors for each of the two
precisions. Note that the prior for x = (µ, uT , v T )T conditioned on κ is a
GMRF. However, we see that yi depends on the sum of three components
of x, due to (4.28). This is an implementation nuisance and can be solved
here by reparameterization using η instead of v,

η | u, κ, µ ∼ N(µ1 + u, κv−1 I).

In the new parameterization x = (µ, uT, ηT)T and the posterior distribution is now
\[
\pi(x, \kappa \mid y) \propto \kappa_v^{n/2} \kappa_u^{(n-1)/2} \exp\left( -\frac{1}{2} x^T Q x \right) \times \exp\left( \sum_i \big( y_i \eta_i - e_i \exp(\eta_i) \big) \right) \pi(\kappa),
\]

κµ + nκv κv 1T −κv 1T
Q = ⎝ κv 1 κu R + κv I −κv I ⎠ (4.29)
−κv 1 −κv I κv I
and R is the structure matrix for the IGMRF of first order


⎨mi if i = j
Rij = −1 if i ∼ j (4.30)


0 otherwise,
where mi is the number
 of neighbors to i. Note that Q is a 2n+1×2n+1
matrix with 6n + i mi nonzero off-diagonal terms. In this example
n = 544. Further, the spatially structured term u has a sum-to-zero
constraint, 1T u = 0.
We now apply the one-block algorithm and generate a joint proposal
using
κ∗u ∼ q(κ∗u | κu )
κ∗v ∼ q(κ∗v | κv )
x∗ (x | κ∗ , y)
∼ π
and then accept/reject (κ∗ , x∗ ) jointly. We use the simple choice (4.13)
to update the precisions but this can be improved upon. The GMRF
approximation is quite accurate in this case, and the factor F in (4.13)
is tuned to obtain an acceptance rate between 30 and 40%. The algorithm
does about 14 iterations per second on a 2.6-MHz processor. Since
we approximately integrate out x, the MCMC algorithm is essentially
a simple Metropolis-Hastings algorithm for a two-dimensional density.
We refer to Knorr-Held and Rue (2002) for a thorough discussion of
constructing MCMC algorithms doing block updating for these models.
Figure 4.16 displays the nonzero terms of the precision matrix (4.29)
in (a), after reordering to reduce the number of fill-ins in (b) and the
Cholesky triangle in (c). The structure of (4.29) is clearly recognized. The
estimated relative risks (posterior median) for the spatial component
exp(u) are displayed in Figure 4.17 where (a) displays the results for
lung cancer and (b) for oral cavity cancer.
If we choose a diffuse prior for µ using κµ = 0, some technical issues
appear using the GMRF approximation. Note that the full conditional of
x and the GMRF approximation are both proper due to the constraint.
However, since we are using Algorithm 2.6 to correct for the constraint
1T u = 0, we must require the GMRF approximation to be proper
also without the constraint. This can be accomplished by modifying the
GMRF approximation, either using a small positive value for κµ or adding a small positive constant to the diagonal of R. The last option is

justified by (3.37). The acceptance probability is of course evaluated using this modified GMRF approximation, so the stationary limit of the MCMC algorithm is unaltered.

Figure 4.16 (a) The precision matrix (4.29) in the original ordering, and (b) the precision matrix after appropriate reordering to reduce the number of nonzero terms in the Cholesky triangle shown in (c). Only the nonzero terms are shown and those are indicated by a dot.

Figure 4.17 Estimated relative risks (posterior median) of the spatial component exp(u) for (a) lung cancer and (b) oral cavity cancer.

Joint analysis of disease risk


A natural extension of a separate analysis is to consider a joint analysis of two or more diseases (Held et al., 2004, Knorr-Held and Best, 2001, Knorr-Held and Rue, 2002). In this example we will consider a joint
analysis of oral cavity cancer and lung cancer shown in Figure 4.15.
We assume that there is a latent spatial component u1 , shared by
both diseases and again modeled through an IGMRF. An additional
unknown scale parameter δ > 0 is included to allow for a different
risk gradient of the shared component for the two diseases. A further
spatial component u2 may enter for one of the diseases, which we again
model through an IGMRF. In our setting, one of the diseases is oral
cavity cancer, while the second one is lung cancer. Both are known to
be related to tobacco smoking, but only oral cancer is known to be
related to alcohol consumption. Considering the latent components as
unobserved covariates, representing the two main risk factors tobacco
(u1 ) and alcohol (u2 ), it is natural to include u2 only for oral cancer.
More specifically we assume that
\[
\begin{aligned}
\eta_1 \mid u_1, u_2, \mu, \kappa &\sim \mathcal{N}\!\left(\mu_1 \mathbf{1} + \delta u_1 + u_2,\; \kappa_{\eta_1}^{-1} I\right) \\
\eta_2 \mid u_1, u_2, \mu, \kappa &\sim \mathcal{N}\!\left(\mu_2 \mathbf{1} + \delta^{-1} u_1,\; \kappa_{\eta_2}^{-1} I\right),
\end{aligned}
\]

where κη1 and κη2 are the unknown precisions of η 1 and η 2 , respectively.
Additionally, we impose sum-to-zero constraints for both u1 and u2 .
Assuming µ is zero mean normal with covariance matrix κµ−1 I, then x = (µT, u1T, u2T, η1T, η2T)T is a priori a GMRF. The posterior
distribution is
 
\[
\begin{aligned}
\pi(x, \kappa, \delta \mid y) \propto{}& \kappa_{\eta_1}^{n/2} \kappa_{\eta_2}^{n/2} \kappa_{u_1}^{(n-1)/2} \kappa_{u_2}^{(n-1)/2} \exp\left( -\frac{1}{2} x^T Q x \right) \\
&\times \exp\left( \sum_{j=1}^{2} \sum_{i=1}^{n} \big( y_{ij} \eta_{ij} - e_{ij} \exp(\eta_{ij}) \big) \right) \\
&\times \pi(\kappa)\, \pi(\delta),
\end{aligned}
\]
where
\[
Q = \begin{pmatrix}
Q_{\mu\mu} & Q_{\mu u_1} & Q_{\mu u_2} & Q_{\mu \eta_1} & Q_{\mu \eta_2} \\
 & Q_{u_1 u_1} & Q_{u_1 u_2} & Q_{u_1 \eta_1} & Q_{u_1 \eta_2} \\
 & & Q_{u_2 u_2} & Q_{u_2 \eta_1} & Q_{u_2 \eta_2} \\
 & \text{sym.} & & Q_{\eta_1 \eta_1} & Q_{\eta_1 \eta_2} \\
 & & & & Q_{\eta_2 \eta_2}
\end{pmatrix}, \tag{4.31}
\]

1T u1 = 0 and 1T u2 = 0. Defining the two n × 2 matrices C1 = (1 0) and C2 = (0 1), the elements of Q are as follows:
\[
\begin{aligned}
Q_{\mu\mu} &= \kappa_\mu I + \kappa_{\eta_1} C_1^T C_1 + \kappa_{\eta_2} C_2^T C_2 \\
Q_{\mu u_1} &= \kappa_{\eta_1} \delta\, C_1^T \\
Q_{\mu u_2} &= \kappa_{\eta_1} C_1^T + \kappa_{\eta_2} \delta^{-1} C_2^T
\end{aligned}
\]

©฀2005฀by฀Taylor & Francis Group, LLC


178 CASE STUDIES IN HIERARCHICAL MODELING
Qµη1 = −κη1 C T1
Qµη2 = −κη2 C T2
Qu 1 u 1 = κ u1 R + κ η 1 δ 2 I
Qu1 u2 = κη 1 δ I
Qu1 η1 = −κη1 δ I
Qu1 η2 = 0
Qu2 u2 = κu2 R + κη1 I + κη2 δ −2 I
Qu2 η1 = −κη1 I
Qu2 η2 = −κη2 δ −1 I
Qη1 η1 = κη 1 I
Qη1 η2 = 0
Qη2 η2 = κη2 I.
Here, R is the structure matrix of the IGMRF (4.30). We assign a N(0, 0.17²) prior on log δ and independent G(1.0, 0.01) priors on all precisions.
There are now various ways to update (x, κ, δ). One option is to
update all in one block, by first updating the hyperparameters (κ, δ),
then sample a proposal for x using the GMRF approximation and then
accept/reject jointly. Although this is feasible, in this example it is both
faster and sufficient to use subblocks. Those can be chosen as
(κη1 , µ1 , η 1 ), (κη2 , µ2 , η 2 ), (δ, κu1 , u1 ), (δ, κu2 , u2 ),
but other choices are possible. However, it is important to update each
field jointly with its hyperparameter. Note that some variables can occur
in more than one subblock, like δ in this example. We might also choose
to merge the two last subblocks into
(κu1 , κu2 , δ, µ, u1 , u2 ).
This is justified as the full conditional of (µT , uT1 , uT2 )T is a GMRF.
Figure 4.18 displays the nonzero terms of the precision matrix (4.31)
in (a), after reordering to reduce the number of fill-ins in (b) and the
Cholesky triangle L in (c). It took 0.03 seconds to reorder and factorize
the matrix on a 2.6-GHz computer. The number of nonzero terms in L
is 24 987 including 15 081 fill-ins.
The estimated shared component u1 and the component u2 only
relevant for oral cancer are displayed in Figure 4.19. The two estimates
display very different spatial patterns and reflect quite nicely known
geographical differences in tobacco (u1) and alcohol (u2) consumption.
The posterior median for δ is estimated to be 0.70 with a 95% credible



Figure 4.18 (a) The precision matrix (4.31) in the original ordering, and (b) the precision matrix after appropriate reordering to reduce the number of nonzero terms in the Cholesky triangle shown in (c). Only the nonzero terms are shown and those are indicated by a dot.


Figure 4.19 Estimated relative risks (posterior median) for (a) the shared
component exp(u1 ) (related to tobacco) and (b) the oral-specific component
exp(u2 ) (related to alcohol).

©฀2005฀by฀Taylor & Francis Group, LLC


180 CASE STUDIES IN HIERARCHICAL MODELING
interval [0.52, 0.88], suggesting that the shared component carries more
weight for lung cancer than for oral cancer. For more details on this
particular application we refer to Held et al. (2004), who also discuss
the connection to ecological regression models and generalizations to
more than two diseases.

4.5 Bibliographic notes


Our blocking strategy in hierarchical GMRF models is from Knorr-
Held and Rue (2002). Pitt and Shephard (1999) discuss in great detail
convergence and reparameterization issues for the first-order autore-
gressive process with normal observations also valid for the limiting
RW1 model. Papaspiliopoulos et al. (2003) discuss reparameterization
issues, and convergence for the two-block Gibbs sampler in normal
hierarchical models extending previous results by Gelfand et al. (1995).
Wilkinson (2003) comments on reparameterization issues and the joint
update approach taken here. Gamerman et al. (2003) compare various
block algorithms while Wilkinson and Yeung (2004) use the one-block
approach for a normal response model. For theoretical results regarding
blocking in MCMC, see Liu et al. (1994) and Roberts and Sahu (1997).
Steinsland and Rue (2003) present an alternative approach that updates
the GMRF in (4.8) as a sequence of overlapping blocks, a major benefit
for large GMRFs. Barone and Frigessi (1989) studied overrelaxation
MCMC methods for simulating from a GMRF, Barone et al. (2001) study
the case of general overrelaxation combined with blocking while Barone
et al. (2002) study the combination of coupling and overrelaxation.
The program BayesX is an open-source software tool for performing
complex Bayesian inference using GMRFs (among others) (Brezger
et al., 2003). BayesX uses numerical methods for sparse matrices and
updates large blocks of the GMRF using the GMRF approximation
for nonnormal data.
Other distributions that can be written as scale mixture of normals
are the symmetric stable and the symmetric gamma distributions, see
Devroye (1986) for the corresponding distribution of the mixing variables
λ in each case. For more information on scale mixtures of normals, see
Andrews and Mallows (1974), Barndorff-Nielsen et al. (1982), Kelker
(1971) and Devroye (1986).
The hierarchical-t formulation is used particularly in state-space
models (Carlin et al., 1992, Carter and Kohn, 1996) but also in
spatial models, for example, in agricultural field experiments (Besag and
Higdon, 1999). Additionally, one may also add a prior on the degrees of
freedom ν (Besag and Higdon, 1999).
The use of GMRF priors in additive models has been proposed in

Fahrmeir and Lang (2001b), see also Fahrmeir and Knorr-Held (2000).
Biller and Fahrmeir (1997) use priors as those discussed in Section 3.5.
Similarly, GMRF priors can be used to model time-changing covariate
effects and are discussed in Harvey (1989) and Fahrmeir and Tutz (2001),
see also Fahrmeir and Knorr-Held (2000). Spatially varying covariate
effects were proposed in Assunção et al. (1998) and further developed in
Assunção et al. (2002) and Gamerman et al. (2003).
It is well known that the auxiliary variable approach is also useful
for multicategorical response data. Consider first the case, where the
response categories are ordered. Albert and Chib (1993) and Albert and
Chib (2001) have shown how to use the auxiliary variable approach
for the cumulative probit and sequential probit model. Similarly, it is
straightforward to adopt the auxiliary variable approach for logistic
regression to the cumulative and sequential model. For an introduction
to these models see Fahrmeir and Tutz (2001, Chapter 3). For an
application of such models with latent IGMRFs, but without auxiliary
variables, see Fahrmeir and Knorr-Held (2000) and Knorr-Held et al.
(2002). Turning to models for unordered response categories, the aux-
iliary variable approach can also be used in the multinomial probit
model, as noted by Albert and Chib (1993), and in the multinomial logit
model, as described in Holmes and Held (2003). For an application of the
multinomial probit model using GMRF priors and auxiliary variables see
Fahrmeir and Lang (2001c). The auxiliary variable approach also
extends to a certain class of non-Gaussian (intrinsic) MRFs such that the
full conditional for the MRF is a GMRF, see Geman and Yang (1995).
The use of the algorithms to construct appropriate GMRF approx-
imations in Section 4.4.1 was first advocated by Gamerman (1997)
in the context of generalized linear mixed models. Follestad and
Rue (2003) and Rue and Follestad (2003) discuss the construction of
GMRF approximations using soft constraints (see Section 2.3.3) where
aggregated Poisson depends on the sum of all relative risks in a region.
Various other examples of hierarchical models with latent GMRF
components can also be found in Banerjee et al. (2004). Sun et al. (1999)
consider propriety of the posterior distribution of hierarchical models
using IGMRFs. Ferreira and De Oliveira (2004) discuss default/reference
priors for parameters of GMRFs for use in an ‘objective Bayesian
analysis’.



CHAPTER 5

Approximation techniques

This chapter is reserved for the presentation of two recent developments


regarding GMRFs. At the time of writing these new developments have
not been fully explored but extend the range of applications regarding
GMRFs and bring GMRFs into new areas.
Section 5.1 provides a link between GMRFs and Gaussian fields used
in geostatistics. Commonly used Gaussian fields on regular lattices can
be well approximated using GMRF models with a small neighborhood.
The benefit of using GMRF approximations instead of Gaussian fields
is purely computational, as efficient computation of GMRFs utilizes
algorithms for sparse matrices as discussed in Chapter 2.
The second topic is motivated by the extensive use of the GMRF
approximation to the full conditional of the prior GMRF x in Chapter
4. The GMRF approximation is found by Taylor expanding to second
order the nonquadratic terms around the mode. In those cases where this
approximation is not sufficiently accurate, we might wish to ‘go beyond
the Gaussian’ and construct non-Gaussian approximations. However,
to apply non-Gaussian approximations in the setting of Chapter 4, we
need to be able to sample exactly from the approximation and compute
the normalizing constant. Despite these rather strict requirements, we
will present a class of non-Gaussian approximations that satisfy these
requirements and are adaptive in the sense that the approximation
adapts itself to the particular full conditional for x under study.

5.1 GMRFs as approximations to Gaussian fields


This section discusses the link between Gaussian fields (GFs) used in
geostatistics and GMRFs. We will restrict ourselves to isotropic Gaussian
fields on regular lattices In . We will demonstrate that GMRFs with a
local neighborhood can well approximate isotropic Gaussian fields with
commonly used covariance functions, in the sense that each element in
the covariance matrix of the GMRF is close to the corresponding element
of the covariance matrix of the Gaussian field. This allows us to use
GMRFs as approximations to Gaussian fields on regular lattices. The
advantage is purely computational, but the speedup can be $O(n^{3/2})$.
This is our approach to solve (partially) what Banerjee et al. (2004) call



the big n problem. Further, the Markov property of the GMRFs can be
valuable for applying these models as a component in a larger complex
model, especially if simulation-based methods for inference are used.

5.1.1 Gaussian fields


Let {z(s), s ∈ D} be a stochastic process where D ⊂ Rd and s ∈ D
represents the location. In most applications, d is either 1, 2, or 3.
Definition 5.1 (Gaussian field) The process {z(s), s ∈ D} is a
Gaussian field if for any k ≥ 1 and any locations s1 , . . . , sk ∈ D,
(z(s1 ), . . . , z(sk ))T is normally distributed. The mean function and
covariance function (CF) of z are
µ(s) = E(z(s)), C(s, t) = Cov(z(s), z(t)),
which are both assumed to exist for all s and t. The Gaussian field is
stationary if µ(s) = µ for all s ∈ D and if the covariance function only
depends on s − t. A stationary Gaussian field is called isotropic if the
covariance function only depends on the Euclidean distance between s
and t, i.e., C(s, t) = C(h) with h = ‖s − t‖.
For any finite set of locations, a CF must necessarily induce a positive definite covariance matrix, i.e.,
\[
\sum_i\sum_j a_i a_j\, C(s_i, s_j) > 0
\]
must hold for any k ≥ 1, any s1, . . . , sk, and any (real) coefficients a1, . . . , ak not all equal to zero. If this is the case, then the CF is called positive definite. All continuous CFs on $\mathbb{R}^d$ can be represented as Fourier transforms of a finite measure, a rather deep result known as Bochner's theorem, see, for example, Cressie (1993). Bochner's theorem is often used to construct CFs or to verify that a (possible) CF is positive definite.
In most applications one of the following isotropic CFs is used in geostatistics:
\[
\begin{aligned}
\text{Exponential}\quad & C(h) = \exp(-3h)\\
\text{Gaussian}\quad & C(h) = \exp(-3h^2)\\
\text{Powered exponential}\quad & C(h) = \exp(-3h^\alpha), \quad 0 < \alpha \le 2 & (5.1)\\
\text{Matérn}\quad & C(h) = \frac{1}{\Gamma(\nu)\,2^{\nu-1}}\,(s_\nu h)^\nu K_\nu(s_\nu h). & (5.2)
\end{aligned}
\]
Here $K_\nu$ is the modified Bessel function of the second kind and order $\nu > 0$, and $s_\nu$ is a function of $\nu$ such that the covariance function is scaled to C(1) = 0.05. For similar reasons the multiplicative factor 3 enters in the exponent of the exponential, Gaussian and powered exponential CF,


Figure 5.1 The Matérn CF with range r = 1, and ν = 1/2, 1, 3/2, 5/2, and
100 (from left to right). The case ν = 1/2 corresponds to the exponential CF
while ν = 100 is essentially the Gaussian CF.

where now C(1) = 0.04979 ≈ 0.05. Note that C(0) = 1 in all four cases,
so the CFs are also correlation functions. Of course, if we multiply C(h)
by σ² then the variance becomes σ².
The powered exponential CF includes both the exponential (α = 1)
and the Gaussian (α = 2), and so does the often-recommended Matérn
CF with ν = 1/2 and ν → ∞, respectively. The Matérn CF is displayed
in Figure 5.1 for ν = 1/2, 1, 3/2, 5/2, and 100.
A further parameter r, called the range, is often introduced to scale
the Euclidean distance, so the CF is C(h/r). The range parameter can
be interpreted as the (minimum) distance h for which the correlation
function satisfies C(h) = 0.05. Hence, two locations more than distance r apart
are essentially uncorrelated and thus nearly independent.
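Since the scaling constant s_ν in the Matérn CF is only defined implicitly through C(1) = 0.05, a small numerical illustration may help. The following R sketch (the function name matern.cf and the search interval passed to uniroot are choices made here for illustration, not from the text) finds s_ν numerically and evaluates the Matérn correlation (5.2) at scaled distances h/r:

## Hedged sketch: Matern correlation (5.2), scaled so that C(1) = 0.05.
matern.cf <- function(h, nu, r = 1) {
  C0 <- function(h, s) {
    h <- pmax(h, 1e-12)                     # avoid 0 * Inf at h = 0; C(0) = 1 in the limit
    (s*h)^nu * besselK(s*h, nu) / (gamma(nu) * 2^(nu - 1))
  }
  s <- uniroot(function(s) C0(1, s) - 0.05, c(1e-3, 100))$root   # s_nu with C(1) = 0.05
  C0(h/r, s)
}
curve(matern.cf(x, nu = 1/2), 0, 1.5)       # nu = 1/2 reproduces the exponential CF

For ν = 1/2 the root is s_ν ≈ 3, which recovers the exponential CF exp(−3h) listed above.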
Nonisotropic CFs can be constructed from isotropic CFs by replacing the Euclidean distance h between locations s and t with
\[
h' = \sqrt{(t - s)^T A\,(t - s)},
\]
where A is SPD. This gives elliptical contours of the covariance function, whereas isotropic ones always have circular contours. The Euclidean distance is obtained if A = I.



5.1.2 Fitting GMRFs to Gaussian fields
We will now formulate how to fit GMRFs to Gaussian fields and then
discuss in detail some practical and technical issues. We assume the
Gaussian field is isotropic with zero mean and restricted to the torus
Tn . The torus is preferred over the lattice for computational reasons and
to make the GMRF stationary (see Section 2.6.3).
Let z denote a zero mean Gaussian field on Tn . We denote by
d((i, j), (i′ , j ′ )) the Euclidean distance between zij and zi′ j ′ on the torus.
The covariance between zij and zi′ j ′ is
Cov(zij , zi′ j ′ ) = C(d((i, j), (i′ , j ′ ))/r),
where C(·/r) is one of the isotropic CFs presented in Section 5.1.1 and
r is the range parameter. A value of r = 10 means that the range is 10
pixels. Denote by Σ the covariance matrix of z.
Let x be a GMRF defined on Tn with precision matrix Q(θ) depending
on some parameter vector θ = (θ1 , θ2 , . . .)T . Let Σ(θ) = Q(θ)−1 denote
the covariance matrix of x. Let π(x; θ) be the density of the GMRF
and π(z) the density of the Gaussian field we want to fit. The best fit is
obtained using the ‘optimal’ parameter value
\[
\theta^* = \arg\min_{\theta\in\Theta_\infty^+} D\big(\pi(x;\theta),\, \pi(z)\big). \tag{5.3}
\]
Here, D(·, ·) is a metric or measure of discrepancy between the two densities and $\Theta_\infty^+$ is the space of valid values of θ as defined in (2.71).
In practice, we face a trade-off problem between sparseness of Q(θ) and
how close the two densities are. If Σ(θ) = Σ then the two densities are
identical but Q(θ) is typically a completely dense matrix. If Q(θ) is
(very) sparse, the fit may not be sufficiently accurate.
The fitted parameters may also depend on the size of the torus,
which means that we may need a different set of parameters for each
size. However, it will be demonstrated in Section 5.1.3 that the fitted
parameters are nearly invariant to the size of the torus if the size is large
enough compared to the range.
In order to obtain a working algorithm, we need to be more specific
about the choice of the neighborhood, the parameterization, the distance
measure and the numerical algorithms to compute the best fit.

Choosing the torus Tn instead of the lattice In


The covariance matrix Σ and the precision matrix Q(θ) are block-
circulant matrices, see Section 2.6.3. We denote their bases by σ and
q(θ), respectively. Denote by σ(θ) the base of Q(θ)−1 .
Numerical algorithms for block-circulant matrices are discussed in Sec-
tion 2.6.2 and are all based on the discrete Fourier transform. For this



reason, the size of the torus is selected as products of small primes.
However, delicate technical issues appear if we define a Gaussian field
on Tn using a CF valid on R2 , as it might not be a valid CF on the torus.
Wood (1995) discusses this issue in the one-dimensional case where a line
is wrapped onto a circle and proves that the exponential CF will still
be valid while the Gaussian CF will never be valid no matter the size
of the circle compared to the range. However, considering only the CF
restricted to a lattice, Dietrich and Newsam (1997) prove essentially,
that if the size of the lattice is large enough compared to the range,
then the (block) circulant matrix will be SPD. See also Grenander and
Szegö (1984) and Gray (2002) for conditions when the (block) Toeplitz
matrix and its circulant embedding are asymptotically equivalent. The
consequence is simply to choose the size of the torus large enough
compared to the range.

Choice of neighborhood and parameterization

Regarding the neighborhood and parameterization, we will use a square


window of size (2m + 1) × (2m + 1) centered at each (i, j), where
m = 2 or 3. The reason to use a square neighborhood is purely
numerical (see Section 2), as if xi,j+2 is a neighbor to xij then the
additional cost is negligible if we also let xi+2,j+2 be a neighbor of
xij . Hence, by choosing the square neighborhood we maximize the
number of parameters, keeping the computational costs nearly constant.
Further, since the CF is isotropic, we must impose similar symmetries
regarding the precision matrix. For m = 2 this implies that E(xij |x−ij )
is parameterized with 6 parameters
\[
-\frac{1}{\theta_1}\left(
\theta_2\begin{smallmatrix}\circ&\circ&\circ&\circ&\circ\\ \circ&\circ&\bullet&\circ&\circ\\ \circ&\bullet&\circ&\bullet&\circ\\ \circ&\circ&\bullet&\circ&\circ\\ \circ&\circ&\circ&\circ&\circ\end{smallmatrix}
+\theta_3\begin{smallmatrix}\circ&\circ&\circ&\circ&\circ\\ \circ&\bullet&\circ&\bullet&\circ\\ \circ&\circ&\circ&\circ&\circ\\ \circ&\bullet&\circ&\bullet&\circ\\ \circ&\circ&\circ&\circ&\circ\end{smallmatrix}
+\theta_4\begin{smallmatrix}\circ&\circ&\bullet&\circ&\circ\\ \circ&\circ&\circ&\circ&\circ\\ \bullet&\circ&\circ&\circ&\bullet\\ \circ&\circ&\circ&\circ&\circ\\ \circ&\circ&\bullet&\circ&\circ\end{smallmatrix}
+\theta_5\begin{smallmatrix}\circ&\bullet&\circ&\bullet&\circ\\ \bullet&\circ&\circ&\circ&\bullet\\ \circ&\circ&\circ&\circ&\circ\\ \bullet&\circ&\circ&\circ&\bullet\\ \circ&\bullet&\circ&\bullet&\circ\end{smallmatrix}
+\theta_6\begin{smallmatrix}\bullet&\circ&\circ&\circ&\bullet\\ \circ&\circ&\circ&\circ&\circ\\ \circ&\circ&\circ&\circ&\circ\\ \circ&\circ&\circ&\circ&\circ\\ \bullet&\circ&\circ&\circ&\bullet\end{smallmatrix}
\right)
\tag{5.4}
\]
using the same notation as in Section 3.4.2, while
\[
\mathrm{Prec}(x_{ij}\mid \mathbf{x}_{-ij}) = \theta_1.
\]

However, we have to ensure that the corresponding precision matrix is SPD, see the discussion in Section 2.7.
We will later display the coefficients as
\[
\theta_1
\begin{bmatrix}
 & & \theta_6/\theta_1\\
 & \theta_3/\theta_1 & \theta_5/\theta_1\\
1 & \theta_2/\theta_1 & \theta_4/\theta_1
\end{bmatrix}
\tag{5.5}
\]
representing the lower part of the upper right quadrant. For m = 3 we have 10 available parameters:
\[
\theta_1
\begin{bmatrix}
 & & & \theta_{10}/\theta_1\\
 & & \theta_6/\theta_1 & \theta_9/\theta_1\\
 & \theta_3/\theta_1 & \theta_5/\theta_1 & \theta_8/\theta_1\\
1 & \theta_2/\theta_1 & \theta_4/\theta_1 & \theta_7/\theta_1
\end{bmatrix}
\tag{5.6}
\]
with obvious notation.

Choice of metric between the densities

When considering the choice of metric between the two densities, we


note that we can restrict ourselves to the case where both densities have
unit marginal precision, so Prec(xij ) = Prec(zij ) = 1 for all ij. We can
obtain this situation if the CF satisfies C(0) = 1 and σ(θ) is scaled so
that element 00 is 1. The scaling makes one of the parameters redundant,
hence we set θ1 = 1 without loss of generality. After the best fit is found,
we compute the value of θ1 giving unit marginal precision and scale the
solution accordingly. The coefficients found in the minimization are those
within the brackets in (5.5) and (5.6).
As both densities have unit marginal precision, the covariance matrix
equals the correlation matrix. Let ρ be the base of the correlation matrix
of the Gaussian field and ρ(θ) the base of the correlation matrix of the
GMRF. Since ρ uniquely determines the density of z and ρ(θ) uniquely
determines the density of x, we can define the norm between the densities
in (5.3) by a norm between the corresponding correlation matrices. We
use the weighted 2-norm,

\[
\|\rho - \rho(\theta)\|_w^2 = \sum_{ij}\big(\rho_{ij} - \rho_{ij}(\theta)\big)^2 w_{ij}
\]

with positive weights w_{ij} > 0 for all ij. The CF is isotropic and its value at (i, j) only depends on the distance to (0, 0). It is therefore natural to choose w_{ij} ∝ 1/d((i, j), (0, 0)) for ij ≠ 00. However, we will put slightly more weight on lags with distance close to the range, and use
\[
w_{ij} \propto
\begin{cases}
1 & \text{if } ij = 00,\\[4pt]
\dfrac{1 + r/d\big((i,j),(0,0)\big)}{d\big((i,j),(0,0)\big)} & \text{otherwise.}
\end{cases}
\]

The coefficients giving the best fit are then found as

\[
\theta^* = \arg\min_{\theta\in\Theta_\infty^+}\ \|\rho - \rho(\theta)\|_w^2, \tag{5.7}
\]
which has to be computed numerically.



Numerical optimization: computing the objective function and its
gradient
Numerical optimization methods for solving (5.7) need a function that evaluates the objective function
\[
U(\theta) = \|\rho - \rho(\theta)\|_w^2
\]
and, if possible, the gradient
\[
\Big(\frac{\partial}{\partial\theta_2}, \frac{\partial}{\partial\theta_3}, \ldots\Big)^T U(\theta). \tag{5.8}
\]
In (5.8), recall that we have fixed θ1 = 1. Both the objective function and its gradient ignore the constraint that $\theta \in \Theta_\infty^+$. We will return to this issue shortly.
The objective function and its gradient can be computed in O(n log n)
flops using algorithms for block-circulant matrices based on the discrete
Fourier transform as discussed in Section 2.6.2. The details are as follows.
The base ρ is determined by the target correlation function of the GF.
The base ρ(θ) is computed using (2.50),
\[
\sigma(\theta) = \frac{1}{n_1 n_2}\,\mathrm{IDFT2}\Big(\mathrm{DFT2}\big(q(\theta)\big)^{(-1)}\Big),
\]
which we need to scale to obtain unit precision:
\[
\rho(\theta) = \sigma(\theta)/\sigma_{00}(\theta). \tag{5.9}
\]
Here, the power (−1) is taken elementwise (see Section 2.1.1) and $\sigma_{00}(\theta)$ is the element 00 of σ(θ). The objective function is then
\[
U(\theta) = \sum_{ij}\big(\rho_{ij} - \rho_{ij}(\theta)\big)^2 w_{ij}. \tag{5.10}
\]

Because the CF is defined on a torus and assumed to be isotropic, we


only need to sum over 1/8 of all indices in (5.10).
The computation of the gradient (5.8) is somewhat more involved. Let A be a nonsingular matrix depending on a parameter α. Since AA^{-1} = I it follows from the product rule that
\[
\Big(\frac{\partial}{\partial\alpha}A\Big)A^{-1} + A\,\frac{\partial}{\partial\alpha}\big(A^{-1}\big) = 0,
\qquad\text{where}\qquad
\Big(\frac{\partial}{\partial\alpha}A\Big)_{ij} = \frac{\partial}{\partial\alpha}A_{ij}.
\]
So
\[
\frac{\partial}{\partial\theta_k}\Sigma(\theta) = -\Sigma(\theta)\Big(\frac{\partial}{\partial\theta_k}Q(\theta)\Big)\Sigma(\theta)
\]
is a block-circulant matrix with base $\frac{\partial}{\partial\theta_k}\sigma(\theta)$ equal to
\[
\frac{1}{n_1 n_2}\,\mathrm{IDFT2}\Big(-\,\mathrm{DFT2}\big(q(\theta)\big)^{(-2)}\odot\,\mathrm{DFT2}\Big(\frac{\partial}{\partial\theta_k}q(\theta)\Big)\Big). \tag{5.11}
\]
Recall that '⊙' is elementwise multiplication, see Section 2.1.1, and the power (−2) is again taken elementwise. Eq. (5.11) is easily derived using the techniques presented in Section 2.6.2. Note that the base ∂q(θ)/∂θ_k only contains zeros and ones. More specifically, for m = 2 the base contains four 1's for k ∈ {2, 3, 4, 6} and eight 1's for k = 5, see (5.4). From (5.9) we obtain
\[
\frac{\partial}{\partial\theta_k}\rho(\theta) = \frac{1}{\sigma_{00}(\theta)}\,\frac{\partial}{\partial\theta_k}\sigma(\theta) - \frac{1}{\sigma_{00}(\theta)^2}\,\frac{\partial\sigma_{00}(\theta)}{\partial\theta_k}\,\sigma(\theta);
\]
hence
\[
\frac{\partial}{\partial\theta_k}U(\theta) = -2\sum_{ij} w_{ij}\,\big(\rho_{ij} - \rho_{ij}(\theta)\big)\,\frac{\partial\rho_{ij}(\theta)}{\partial\theta_k}.
\]
We perform the optimization in R (Ihaka and Gentleman, 1996) using the routine optim with the option BFGS, a quasi-Newton method also known as a variable metric algorithm. The algorithm uses function values and gradients to build up a picture of the surface to be optimized. A further advantage is that the implementation allows for returning a nonvalid function value if the parameters are outside the valid area, i.e., if some of the eigenvalues of Q(θ) are negative.
As discussed in Section 2.7, it is hard to determine $\Theta_\infty^+$. Our strategy is to replace $\Theta_\infty^+$ by $\Theta_n^+$ where n is the size of the torus. After the solution is found, we will take a large n′ and verify that $\theta^* \in \Theta_{n'}^+$. If θ* passes this test (which is nearly always the case), we accept θ*, otherwise we rerun the optimization with a larger n.
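To make the pieces (5.9), (5.10), and (5.7) concrete, the following R sketch evaluates U(θ) via the two-dimensional FFT and passes it to optim with the BFGS option. The helper make.base() is an assumption made here (not from the text): it should place θ, with θ1 = 1, symmetrically on an n1 × n2 array forming the base of Q(θ). A large penalty value is returned when θ is outside the valid parameter space.

## Hedged sketch of the fit (5.7); make.base() is an assumed helper.
fit.gmrf <- function(theta.start, rho.target, w, n1, n2) {
  U <- function(theta) {
    lam <- Re(fft(make.base(theta, n1, n2)))         # eigenvalues of the block-circulant Q(theta)
    if (any(lam <= 0)) return(1e10)                  # outside the valid parameter space
    sig <- Re(fft(1/lam, inverse = TRUE))/(n1*n2)    # base of Q(theta)^{-1}, cf. (2.50)
    rho <- sig/sig[1, 1]                             # scale to unit marginal precision, (5.9)
    sum(w*(rho.target - rho)^2)                      # objective function (5.10)
  }
  optim(theta.start, U, method = "BFGS")
}

The gradient (5.8) could be supplied to optim in the same fashion via (5.11); it is omitted here only to keep the sketch short.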

5.1.3 Results
We will now present some typical results showing how well the fitted
GMRFs approximate the CFs in Section 5.1. We concentrate on the
exponential and Gaussian CF using both a 5 × 5 and 7 × 7 neighborhood
with range 30 and 50. The size of the torus is taken as 512 × 512.
Figure 5.2(a) and Figure 5.2(c) show the fit obtained for the exponential CF with range 30 using a 5 × 5 and a 7 × 7 neighborhood, respectively. Figure 5.2(b) and Figure 5.2(d) show similar results for the Gaussian CF with range 50. The fitted CF is drawn with a solid line and the target CF with a dashed line; the difference between the two is shown in Figure 5.3.
The approximation obtained is quite accurate. For the exponential
CF the absolute difference is less than 0.01 using a 5 × 5 neighborhood,
while it is less than 0.005 using a 7 × 7 neighborhood. The Gaussian CF

Figure 5.2 The figures display the correlation function (CF) for the fitted
GMRF (solid line) and the target CF (dashed line) with the following
parameters: (a) exponential CF with range 30 and a 5 × 5 neighborhood, (b)
Gaussian CF with range 50 and a 5 × 5 neighborhood, (c) exponential CF with
range 30 and a 7 × 7 neighborhood, and (d) Gaussian CF with range 50 and a
7 × 7 neighborhood.

is more difficult to fit, which is due to the CF type, not the increase
in the range. In order to fit the CF accurately for small lags, the fitted
correlation needs to be negative for larger lags. However, the absolute
difference is still reasonably small and about 0.04 and 0.008 for the 5 × 5
and 7 × 7 neighborhood, respectively. The improvement by enlarging the
neighborhood is larger for the Gaussian CF than for the exponential CF.
The results obtained in Figure 5.2 are quite typical for different range
parameters and other CFs. For other values of the range, the shape of
the fitted CF is about the same and only the horizontal scale is different
(due to the different range). We do not present the fits using the powered
exponential CF (5.1) for 1 ≤ α ≤ 2, and the Matérn CF (5.2), but they
are also quite good. The errors are typically between those obtained for
the exponential and the Gaussian CF.
For the CFs shown in Figure 5.3 the GMRF coefficients (compare (5.5)

Figure 5.3 The figures display the difference between the target correlation
function (CF) and the fitted CF for the corresponding fits displayed in Figure
5.2. The difference goes to zero for lags larger than shown in the figures.

and (5.6)) are
\[
26.685
\begin{bmatrix}
 & & 0.091\\
 & 0.304 & -0.191\\
1 & -0.537 & 0.275
\end{bmatrix}
\tag{5.12}
\]
and
\[
14.876
\begin{bmatrix}
 & & & 0.063\\
 & & 0.085 & -0.100\\
 & -0.015 & -0.002 & 0.071\\
1 & -0.280 & -0.005 & -0.033
\end{bmatrix}
\]
for the exponential CF and
\[
118\,668.081
\begin{bmatrix}
 & & -0.083\\
 & 0.000 & 0.166\\
1 & -0.333 & -0.165
\end{bmatrix}
\tag{5.13}
\]
and
\[
73\,382.052
\begin{bmatrix}
 & & & 0.009\\
 & & 0.173 & -0.031\\
 & -0.263 & -0.002 & -0.057\\
1 & -0.122 & 0.029 & 0.106
\end{bmatrix}
\]
for the Gaussian CF. We have truncated the values of θi /θ1 showing
only 3 digits.
Note that the coefficients are all far from producing a diagonally
dominant precision matrix. This supports the claim made in Section 2.7.2
that imposing diagonal dominance can be severely restrictive for larger
neighborhoods. Further, we do not find the magnitude of the coefficients
and (some of) their signs particularly intuitive. Although the coefficients
have a nice conditional interpretation (see Theorem 2.3), it seems hard to
extrapolate these to knowledge of the covariance matrix without doing
the matrix inversion. A way to interpret the coefficients of the fitted
GMRF is to consider them purely as a mapping of the parameters of the
CF. The parameters in the CF are easier to interpret.
The fitted GMRF is sensitive to small changes in the coefficients. To illustrate this issue, we took the coefficients in (5.13) and used 3, 5, 7, and 9 significant digits to represent θ_i/θ_1, for i = 2, . . . , 6. The scaling factor θ_1 is not truncated. Using these truncated coefficients we computed the CF and compared it with the Gaussian CF with range 50. The result is shown in Figure 5.4. Using only 3 significant digits (panel a) makes Q(θ) not SPD and these parameters are not valid. With 5 significant digits (panel b) the range is far too small and the marginal precision is 276, far too large. Increasing the number of digits to 7 (panel c) decreases the marginal precision to 17, while for 9 digits (panel d) the marginal precision is now 0.95 and the fit has improved considerably. We need 12 digits to reproduce Figure 5.2(b). The coefficients for the exponential CF are somewhat less sensitive, but we still need 8 significant digits to reproduce Figure 5.2(a).
The coefficients obtained as the solution of (5.7) also depend on the
size of the torus. How strong this dependency is can be investigated by
using the coefficients computed for one size to compute the CF and the
marginal precision using a torus of a different size. If the computed CF is
close to the target CF and the marginal precision
is near 1, then we may use the computed coefficients on toruses of a
different size. This is an advantage as we only need one set of coefficients
for each CF. It turns out that as long as the size of the torus is large
compared to the range r, the effect of the size of the torus is (very) small.
We illustrate this result using the coefficients in (5.13) corresponding
to a Gaussian CF with range 50. We applied these coefficients to toruses
of size n × n with n between 32 and 512. For each size n we computed

Figure 5.4 The figures display the correlation function (CF) for the fitted GMRF (solid line) and the Gaussian CF (dashed line) with range 50. The figures are produced using the coefficients as computed for Figure 5.2(b) using only (a) 3 significant digits, (b) 5 significant digits, (c) 7 significant digits and (d) 9 significant digits for θ_i/θ_1, i = 2, . . . , 6, while θ_1 was not truncated. For (a) the parameters are outside the valid parameter space so Q(θ) is not SPD. The marginal precisions based on the coefficients in (b), (c), and (d) are 276, 17, and 0.95, respectively.

the marginal precision, shown in Figure 5.5. The marginal precision is


less than one for small n, becomes larger for larger n, before it stabilizes
at unity for n > 300. This corresponds to a size that is 6 times the
range. The marginal precision is not equal to one if the CF does not
reach its stable value at zero at lag n1 /2. Due to the cyclic boundary
conditions the CF is cyclic as well. Repeating this exercise using the
coefficients (5.12) obtained for the exponential CF with range 30 shows
that in this case, too, the dimension of the torus must be about 6 times
the range.

Figure 5.5 The marginal precision with the coefficients in (5.13) corresponding
to a Gaussian CF with range 50, on a torus with size from 32 to 512. The
marginal precision is about 1 for dimensions larger than 300, i.e., 6 times the
range in this case.

5.1.4 Regular lattices and boundary conditions


In applications where we use zero mean Gaussian fields restricted
to a regular lattice, cyclic boundary conditions are rarely justified
from a modeling point of view. This means that we must consider
approximations on the lattice In rather than the torus Tn .
However, the full conditional π(xij |x−ij ) of a GMRF approximation
will have a complicated structure with coefficients depending on ij and
in particular on the distance from the boundary, in order to fulfill
stationarity and isotropy.
Consider a (2m + 1) × (2m + 1) neighborhood. A naive approach is to specify the full conditionals as
\[
\mathrm{E}(x_{ij}\mid \mathbf{x}_{-ij}) = -\frac{1}{\theta_{00}}\sum_{kl \ne 00}\theta_{kl}\,x_{i+k,j+l} \tag{5.14}
\]
\[
\mathrm{Prec}(x_{ij}\mid \mathbf{x}_{-ij}) = \theta_{00}, \tag{5.15}
\]
using the same coefficients as obtained through (5.7) and simply setting x_{i+k,j+l} = 0 (the mean value) in (5.14) for (i + k, j + l) ∉ I_n. Note that the coefficients in (5.14) equal those in (5.5) and in (5.6) with a slight change in notation.
The full conditional π(x_{ij}|x_{−ij}) now has constant coefficients, but the marginal precision and the CF are no longer constant; they vary with ij and depend on the distance from the boundary. This effect is not desirable and we will now discuss strategies to avoid these features.
Our first concern, however, is to make sure that it is valid to take
the coefficients on Tn and use them to construct a GMRF on In with

Figure 5.6 The figure displays an unwrapped n1 × n2 torus. The set B has
thickness m and A is an (n1 − m) × (n2 − m) lattice.

full conditionals as in (5.14) and (5.15). In other words, under what


conditions will Q(θ), defined by (5.14) and (5.15), be SPD? The solution
is trivial as soon as we get the right view.
Consider now Figure 5.6 where we have unwrapped the torus Tn .
Define the set B with thickness m shown in gray and let A denote
the remaining sites. Consider the density of xA |xB . Since B is thick
enough to destroy the effect of the cyclic boundary conditions, QAA
equals the precision matrix obtained when using the coefficients θ on
an (n1 − m) × (n2 − m) lattice. Since all principal submatrices of Q
are SPD (see Section 2.1.6), then QAA is SPD. Hence we can use the
coefficients computed in Section 5.1.2 on Tn to define GMRFs through
the full conditionals (5.14) and (5.15) on lattices not larger than (n1 −
m) × (n2 − m), see also (2.75).
To illustrate the boundary effect of using the full conditionals (5.14)
and (5.15), we use the 5 × 5 coefficients corresponding to Gaussian and
exponential CFs with range 50 and define a GMRF on a 200×200 lattice.
We computed the marginal variance for xii , as a function of i using the
algorithms in Section 2.3.1. Figure 5.7 shows the result.
The marginal variance at the four corners is 0.0052 (Gaussian) and
0.102 (exponential), which is small compared to the target unit variance.
Increasing the distance from the border the variance increases, and
reaches unity at a distance of around 75 (Gaussian) and 50 (exponential).
This corresponds to 1.5 (Gaussian) and 1.0 (exponential) times the
range. The different behavior of the Gaussian and exponential CF is
mainly due to the different smoothness properties of the CF at lag 0.
The consequence of Figure 5.7 is to enlarge the lattice to remove the
effect of xij being zero outside the lattice. As an example, if the problem

Figure 5.7 The variance of xii , for i = 1, . . . , 200, when using the 5 × 5
coefficients corresponding to Gaussian (solid line) and exponential (dashed
line) CFs with range 50, on a 200 × 200 lattice.

at hand requires a 200 × 200 lattice and we expect a maximum range


of about 50, then using GMRF approximations, we should enlarge the
region of interest and use a lattice with dimension between 300 × 300
and 400 × 400.
An alternative approach is to scale the coefficients near the boundary,
in ‘some appropriate fashion’. However, if the precision matrix is not
diagonally dominant, it is not clear how to modify the coefficients near and
at the boundary to ensure that the modified precision matrix remains
SPD.
We will now discuss an alternative approach that apparently solves
this problem by an embedding technique. This approach is technically
more involved, but the computational complexity to factorize the precision matrix is $O(n^{3/2}\log n)$, where n = n_1 n_2. Let x be a zero mean GMRF defined through its full conditionals (5.14) and (5.15) on the infinite lattice $I_\infty$. Let
\[
\gamma_{ij} = \mathrm{E}(x_{ij}\,x_{00})
\]
be the covariance at lag ij, which is related to the coefficients θ of the GMRF by
\[
\gamma_{ij} = \frac{1}{4\pi^2}\int_{-\pi}^{\pi}\!\!\int_{-\pi}^{\pi}
\frac{\cos(\omega_1 i + \omega_2 j)}{\sum_{kl}\theta_{kl}\cos(\omega_1 k + \omega_2 l)}\;d\omega_1\,d\omega_2, \tag{5.16}
\]
see Section 2.6.5. Figure 5.8 illustrates an (n1 + 2m) × (n2 + 2m) lattice,
which is to be considered as a subset of I∞ . Further, the thickness of B
is m, such that
xA ⊥ x(A∪B)c | xB .

Figure 5.8 An n1 ×n2 lattice (A) with an additional boundary (B) of thickness
m.

where $(A\cup B)^c$ is the complement of $A\cup B$. We factorize the density for $x_{A\cup B}$ as
\[
\pi(x_{A\cup B}) = \pi(x_B)\,\pi(x_A \mid x_B).
\]
Here, $x_B$ is Gaussian with covariance matrix $\Sigma_{BB}$ with elements found from (5.16), while $x_A \mid x_B$ has density
\[
\pi(x_A \mid x_B) \propto \exp\Big(-\frac{1}{2}\,x_{A\cup B}^T\,Q\,x_{A\cup B}\Big)
\propto \exp\Big(-\frac{1}{2}\,x_A^T Q_{AA}\,x_A - x_A^T Q_{AB}\,x_B\Big),
\]
where $Q_{(i,j),(k,l)} = \theta_{i-k,j-l}$. The marginal density of $x_A$ now has the correct covariance matrix with elements found from (5.16).
To simulate from π(xA∪B ) we first sample xB from π(xB ) then
xA from π(xA |xB ). Conditional simulation requires, however, the joint
precision matrix for xA∪B :
   
\[
\mathrm{Prec}\begin{pmatrix} x_A\\ x_B \end{pmatrix}
=
\begin{pmatrix}
Q_{AA} & Q_{AB}\\[2pt]
Q_{AB}^T & Q_{AB}^T Q_{AA}^{-1} Q_{AB} + \Sigma_{BB}^{-1}
\end{pmatrix}. \tag{5.17}
\]

Note that B has size $n_B$ of order $O(n^{1/2})$. To factorize the (dense) $n_B \times n_B$ matrix
\[
Q_{AB}^T Q_{AA}^{-1} Q_{AB} + \Sigma_{BB}^{-1}, \tag{5.18}
\]
we need $O(n_B^3) = O(n^{3/2})$ flops. This is the same cost needed to factorize $Q_{AA}$. However, to compute (5.18), we need $Q_{AB}^T Q_{AA}^{-1} Q_{AB}$. Let $L_A$ be the Cholesky triangle of $Q_{AA}$; then $Q_{AB}^T Q_{AA}^{-1} Q_{AB} = G^T G$ where $L_A G = Q_{AB}$. We can compute G by solving $n_B$ linear systems each of cost $O(n\log n)$, hence the total cost is $O(n^{3/2}\log n)$. In total, (5.17) can be computed and factorized using $O(n^{3/2}\log n)$ flops.
To compute the covariances of the GMRF on $I_\infty$ needed for $\Sigma_{BB}$, there is no need to use (5.16), which involves numerical integration of an awkward integrand. As $Q_n$ is a block Toeplitz matrix we can construct its cyclic approximation $C_n$. Since $Q_n$ and $C_n$ are asymptotically equivalent, so are $Q_n^{-1}$ and $C_n^{-1}$ under mild conditions, see Gray (2002, Theorem 2.1). The consequence is that $\{\gamma_{ij}\}$ can be taken as the base of $C_n^{-1}$ for a large value of n.
The embedding approach extends directly to regions of nonsquare shape S as long as we consider $S \cap I_n$ and can embed it with a boundary region of thickness m.

5.1.5 Example: Swiss rainfall data


In applications we are often faced with the problem of fitting a stationary
Gaussian field to some observed data. The observed data are most often
of the form
(s1 , x1 ), (s2 , x2 ), . . . , (sd , xd ),
where si is the spatial location for the ith observation xi , and d the
number of observations. This is a routine problem within geostatistics
and different approaches exist, see, for example, Cressie (1993), Chilès
and Delfiner (1999) and Diggle et al. (2003). We will not focus here on the
fitting problem itself, but on the question of how we may take advantage
of the GMRF approximations to Gaussian fields in this process.
Recall that we consider the GMRF approximations only as computa-
tionally more efficient representations of the corresponding GF. However,
we do not recommend computing maximum likelihood estimates of
the parameters in the GMRF directly from the observed data in
geostatistical applications. Such an approach may give surprising results.
This is related to model error and to the form of the sufficient statistics,
see Rue and Tjelmeland (2002) for examples and further discussion.
To fix the discussion consider the data displayed in Figure 5.9, available in the geoR library (Ribeiro Jr. and Diggle, 2001). The data are the measured rainfall on May 8, 1986 from 467 locations in Switzerland; Figure 5.9 displays the observations at 100 of these locations. The
motivating scientific problem is to construct a continuous spatial map
of rainfall values from the observed data, but we have simplified the
problem for presentation issues. Assume the spatial field is a GF with a
common mean µ and a Matérn CF with unknown range, variance, and
smoothness ν. Denote these parameters by θ. Estimating the parameters
from the observed data using a Bayesian or a likelihood approach,

Figure 5.9 Swiss rainfall data at 100 sample locations. The plot displays the
locations and the observed values ranging from black to white.

involves evaluating the log-likelihood,
\[
-\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma(\theta)| - \frac{1}{2}(x - \mu\mathbf{1})^T\Sigma(\theta)^{-1}(x - \mu\mathbf{1}), \tag{5.19}
\]
for several different values of the unknown parameters θ. Note that the d × d matrix Σ(θ) is a dense matrix and, due to the irregular locations {s_i}, it has no special structure. Hence, a general Cholesky factorization algorithm must be used to factorize Σ(θ), which costs d³/3 flops.
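As a small illustration of this cost, a direct evaluation of (5.19) could be written as follows in R, where Sigma is the dense d × d covariance matrix built from the Matérn CF at the observed locations (its construction is not shown, and the function name is a choice made here):

## Hedged sketch: evaluate the log-likelihood (5.19) via a dense Cholesky factorization.
loglik.gf <- function(x, mu, Sigma) {
  L <- chol(Sigma)                              # upper triangular, Sigma = t(L) %*% L
  r <- backsolve(L, x - mu, transpose = TRUE)   # solves t(L) r = x - mu
  -0.5*length(x)*log(2*pi) - sum(log(diag(L))) - 0.5*sum(r^2)
}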
If we estimate parameters using only d = 100 observations, then the
computational cost is not that high and there is no need to involve
GMRF approximations at this stage. However, if we use all the data,
d = 467, factorizing Σ(θ) will be costly. We may take advantage of
the GMRF approximations to GFs. Assume for simplicity ν = 1/2 so
the CF is exponential. We have precomputed the best fits for values of
the range r = 1, 1.1, 1.2, . . ., and so on. Since GMRF approximations are
for regular lattices only, we first need to replace the continuous area of
interest with a fine grid of size n, say. The resolution of the grid has
to be high enough so we can assign each observed data point (si , xi )
to the nearest grid point without (too much) error. Using some kind of
interpolation is also possible. The boundary conditions should be treated



as discussed in Section 5.1.4.
Let x denote the GMRF on the lattice of size n, and denote by D the set of sites that are observed and by M those sites not observed. Note that x = {x_D, x_M}. The likelihood for the observed data x_D is not directly available using GMRF approximations, but is available indirectly as
\[
\pi(x_D \mid \theta) = \frac{\pi(x \mid \theta)}{\pi(x_M \mid x_D, \theta)},
\]
which holds for any x_M. We may use x_M = 0, say. Both terms on the rhs are easy to evaluate if x is a GMRF, and the cost is $O(n^{3/2})$ flops. Comparing this cost with the cost of evaluating (5.19), we see that the breakpoint is n ≈ d². If n < d² we will gain by using GMRF approximations and if n > d² it will be better to use (5.19) directly.
Choosing a resolution of the grid of 1 × 1 km2 , we would use (5.19)
for d = 100, while for the full dataset, where d = 467, it would be
computationally more efficient to use the GMRF approximations.
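A hedged sketch of the indirect likelihood evaluation above, written with dense matrices for clarity (sparse-matrix routines would be needed to reach the O(n^{3/2}) cost), is:

## log pi(x_D | theta) = log pi(x | theta) - log pi(x_M | x_D, theta), evaluated at x_M = 0,
## for a zero mean GMRF with precision matrix Q; D holds the indices of the observed sites.
loglik.gmrf <- function(Q, xD, D) {
  n <- nrow(Q); M <- setdiff(1:n, D)
  x <- numeric(n); x[D] <- xD                  # the full field with x_M = 0
  logjoint <- sum(log(diag(chol(Q)))) - 0.5*n*log(2*pi) - 0.5*drop(t(x) %*% Q %*% x)
  QMM <- Q[M, M, drop = FALSE]
  muM <- -solve(QMM, Q[M, D, drop = FALSE] %*% xD)   # conditional mean of x_M given x_D
  logcond <- sum(log(diag(chol(QMM)))) - 0.5*length(M)*log(2*pi) -
             0.5*drop(t(muM) %*% QMM %*% muM)        # log density of x_M = 0 given x_D
  logjoint - logcond
}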
The next stage in the process is to provide spatial predictions for
the whole area of interest with error bounds. Although it is also
important to propagate the uncertainty in the parameter estimation into the prediction (Draper, 1995), we will for a short moment treat the estimated parameters as fixed and apply the simple plug-in approach. Maximum likelihood estimation using the exponential CF on the square-root transformed data gives the estimates r̂ = 125, σ̂² = 83, and µ̂ = 21, which we will use further.
Spatial predictions are usually performed by imposing a fine regular
lattice of size n over the area of interest. Let x be the GF on this lattice.
The best prediction is then
E(x | observed data, θ̂) (5.20)
and the associated uncertainty is determined through the conditional
variance. To compute or estimate (5.20) we can make use of a GMRF
approximation as the size of the grid is usually large. The locations of
the observed data must again be moved to the nearest grid point using
the GMRF approximations. The conditional variance is not directly
available from the GMRF formulation, but is easily estimated using
iid samples from the conditional distribution. Figure 5.10 shows the
estimated conditional mean and standard deviation using a 200 × 200
lattice restricted to the area of interest.
To account for the uncertainty in the parameter estimates it will be
required to use simulation-based inference using MCMC techniques. This
also allows us to deal with non-Gaussian data, for example, binomial or
Poisson observations. For a thorough discussion of this issue in relation
to the Swiss rainfall data, see Diggle et al. (2003, Section 2.8). Simulating

Figure 5.10 Spatial interpolation of the Swiss rainfall data using a square-
root transform and an exponential CF with range 125. The top figure shows
the predictions and the bottom figure shows the prediction error (stdev), both
computed using plug-in estimates.



from the posterior density of the parameters of interest will explore their
uncertainty, which can (and should) be accounted for in the predictions.
Therefore, the predictive samples have to be generated as part of the
MCMC algorithm. Averages over the samples will give an estimate of
the expectation conditioned on the observed data. This is no different
from averaging over predictions made using different parameter values.
The GMRF approximations only offer a computational gain, but this can be important as it is typically computationally demanding to account for the parameter uncertainty using GFs.

5.2 Approximating hidden GMRFs


In this section we will discuss how to approximate a nonnormal density
of the form
\[
\pi(x \mid y) \propto \exp\Big(-\frac{1}{2}\,x^T Q x - \sum_i g_i(x_i, y_i)\Big). \tag{5.21}
\]

We assume that the terms gi(xi, yi) also contain nonquadratic terms in
xi, so that x|y is not normal. If x is a GMRF wrt G that is partially
observed through y, then x|y is called a hidden GMRF abbreviated as
HGMRF. Note that (5.21) defines x as a Markov random field wrt (the
same graph) G, but it is not Gaussian.
Densities of the form (5.21) occurred frequently in Chapter 4 as full
conditionals for a GMRF conditioned on nonnormal and independent
observations {yi } where yi only depends on xi . For example, assume yi
is Poisson with mean exp(xi ), then the full conditional π(x|y) is of the
form (5.21) with
gi (xi , yi ) = −xi yi + exp(xi ).
Alternatively, if yi is a Bernoulli variable with mean exp(xi )/(1 +
exp(xi )), then gi (xi , yi ) reads
−xi yi + log (1 + exp(xi ))
using a logit link. In both cases, the gi (xi , yi ) contain nonquadratic terms
of xi .
One approach taken in Chapter 4 was to construct a GMRF approx-
imation to π(x|y), which we here denote as πG (x). The mean of the
approximation equals the mode of π(x|y), x∗ , and the precision matrix
is
Q + diag(c). Here, c_i is the coefficient in the second-order Taylor expansion of g_i(x_i, y_i) at the mode x_i^*,
\[
g_i(x_i, y_i) \approx a_i + b_i x_i + \frac{1}{2}\,c_i x_i^2, \tag{5.22}
\]



where the coefficients ai , bi , and ci depend on x∗i and yi . See also the
discussion in Section 4.4. If n is large and/or the gi (xi , yi ) terms are too
influential, the GMRF approximation may not be sufficiently accurate.
This will be evident in the MCMC algorithms as the acceptance rate for
a proposal sampled from the GMRF approximation will become (very)
low. In such cases it would be beneficial to construct a non-Gaussian
approximation with improved accuracy compared to the GMRF approx-
imation. An approach to construct improved approximations compared
to the GMRF approximation is the topic in this section. We assume
throughout that (5.21) is unimodal.
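For concreteness, the following R sketch constructs the GMRF approximation in the Poisson case gi(xi, yi) = −xi yi + exp(xi) by iterating the expansion (5.22) around the current estimate until it reaches the mode x*; dense solve() is used for clarity only, and the fixed number of iterations is a choice made here.

## Hedged sketch of the GMRF approximation for Poisson observations.
gmrf.approx <- function(Q, y, niter = 20) {
  x <- rep(0, length(y))                       # expansion point, iterated towards the mode x*
  for (k in 1:niter) {
    cc <- exp(x)                               # c_i = second derivative of g_i at x
    bb <- y - exp(x) + cc*x                    # canonical mean parameter at the current x
    x <- drop(solve(Q + diag(cc), bb))         # mean of the current Gaussian approximation
  }
  list(mean = x, prec = Q + diag(cc))          # precision Q + diag(c) at the mode
}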
In Section 5.2.1 we will discuss how to construct approximations to a
HGMRF, then apply these to a stochastic volatility model in Section
5.2.2 and to construct independence samplers for the Tokyo rainfall
data in Section 5.2.3. More complex applications are found in Rue et al.
(2004), which also combine these new approximations with the GMRF
approximations to GFs in Section 5.1.

5.2.1 Constructing non-Gaussian approximations


Before we start discussing how an improved approximation to (5.21)
can be constructed we should remind ourselves why we need an
approximation and what properties we must require from one.
The improved approximation is needed to update x in one block, preferably jointly with its hyperparameters. For this reason, we need to be able to sample from the improved approximation directly and have access to the normalizing constant. For the normal distribution, this is feasible, but there are not that many other candidates (for high dimension) around with these properties. We will outline an approach to construct improved approximations that have these properties, in addition to being adaptive to the particular g_i's under study.
The first key observation is that we can convert a GMRF wrt G to a
nonhomogeneous autoregressive model, defined backward in ‘time’, using
the Cholesky triangle of the reordered precision matrix,
\[
L^T x = z. \tag{5.23}
\]
Here, z ∼ N(0, I) and $Q = LL^T$. The representation is sequential backward in time, as (5.23) is equivalent to
\[
\begin{aligned}
x_n &= \frac{1}{L_{nn}}\,z_n\\
x_{n-1} &= \frac{1}{L_{n-1,n-1}}\big(z_{n-1} - L_{n,n-1}\,x_n\big)\\
x_{n-2} &= \frac{1}{L_{n-2,n-2}}\big(z_{n-2} - L_{n,n-2}\,x_n - L_{n-1,n-2}\,x_{n-1}\big),
\end{aligned}
\]



and so on. In short, (5.23) represents π(x) as
\[
\pi(x) = \prod_{i=n}^{1}\pi(x_i \mid x_{i+1},\ldots,x_n). \tag{5.24}
\]

The autoregressive process is nonhomogeneous as the coefficients Lij are


not a function of i − j. The order of the process also varies and can be
as large as n − 1. However, L is by construction sparse.
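In R, (5.23) amounts to a single backward substitution; a minimal sketch of sampling x ∼ N(0, Q^{-1}) this way (chol() returns the upper triangle L^T) is:

sample.gmrf <- function(Q) {
  Lt <- chol(Q)              # Q = t(Lt) %*% Lt, so Lt = L^T
  z <- rnorm(nrow(Q))
  backsolve(Lt, z)           # solves L^T x = z, backward in 'time'
}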
By using (5.24) we may rewrite (5.21) as
\[
\pi(x \mid y) = \frac{1}{Z}\prod_{i=n}^{1}\pi(x_i \mid x_{i+1},\ldots,x_n)\,\exp\big(-g_i(x_i, y_i)\big)
= \prod_{i=n}^{1}\pi(x_i \mid x_{i+1},\ldots,x_n, y_1,\ldots,y_i),
\]
where Z is the normalizing constant and
\[
\begin{aligned}
\pi(x_i \mid x_{i+1},\ldots,x_n, y_1,\ldots,y_i) \propto{}& \pi(x_i \mid x_{i+1},\ldots,x_n)\,\exp\big(-g_i(x_i,y_i)\big)\\
&\times\int\exp\Big(-\sum_{j=1}^{i-1}g_j(x_j,y_j)\Big)\,\pi(x_1,\ldots,x_{i-1}\mid x_i,\ldots,x_n)\;dx_1\cdots dx_{i-1}.
\end{aligned}
\tag{5.25}
\]


The second key observation is to note that if we can approximate (5.25) by
\[
\tilde{\pi}(x_i \mid x_{i+1},\ldots,x_n, y_1,\ldots,y_i), \tag{5.26}
\]
say, then we can approximate π(x|y) by
\[
\tilde{\pi}(x \mid y) = \prod_{i=n}^{1}\tilde{\pi}(x_i \mid x_{i+1},\ldots,x_n, y_1,\ldots,y_i). \tag{5.27}
\]

Since (5.26) is univariate, we can construct the approximation so that the normalizing constant is known. As the approximation (5.27) is defined sequentially backward in 'time', it automatically satisfies the two requirements for an improved approximation:
1. We can sample from π̃(x|y) directly (and exactly), by successively sampling
\[
\begin{aligned}
x_n &\sim \tilde{\pi}(x_n \mid y_1,\ldots,y_n)\\
x_{n-1} &\sim \tilde{\pi}(x_{n-1} \mid x_n, y_1,\ldots,y_{n-1})\\
x_{n-2} &\sim \tilde{\pi}(x_{n-2} \mid x_{n-1}, x_n, y_1,\ldots,y_{n-2}),
\end{aligned}
\]
and so on, until we sample x_1.



2. The density π̃(x|y) is normalized since each term (5.26) is normalized.
Before we discuss how we can construct approximations to (5.25) by
neglecting ‘not that important’ terms, we will simply shift the reference
from π(x) to the GMRF approximation of (5.21), πG (x), say, so
that (5.25) reads
\[
\begin{aligned}
\pi(x_i \mid x_{i+1},\ldots,x_n, y_1,\ldots,y_i) \propto{}& \pi_G(x_i \mid x_{i+1},\ldots,x_n)\,\exp\big(-h_i(x_i,y_i)\big)\\
&\times\int\exp\Big(-\sum_{j=1}^{i-1}h_j(x_j,y_j)\Big)\,\pi_G(x_1,\ldots,x_{i-1}\mid x_i,\ldots,x_n)\;dx_1\cdots dx_{i-1},
\end{aligned}
\tag{5.28}
\]
where
\[
h_i(x_i, y_i) = g_i(x_i, y_i) - \big(a_i + b_i x_i + \tfrac{1}{2}\,c_i x_i^2\big),
\]
using the Taylor expansion in (5.22). Further, let µG denote the mean
in πG (x).
Starting from (5.28) we will construct three classes of approximations,
which we denote by A1, A2, and A3.

Approximation A1
Approximation A1 is found by removing all terms in (5.28) apart from $\pi_G(x_i \mid x_{i+1},\ldots,x_n)$, so
\[
\pi_{A1}(x_i \mid x_{i+1},\ldots,x_n, y_1,\ldots,y_i) = \pi_G(x_i \mid x_{i+1},\ldots,x_n). \tag{5.29}
\]
Using (5.27), we obtain
\[
\pi_{A1}(x \mid y) = \prod_{i=n}^{1}\pi_G(x_i \mid x_{i+1},\ldots,x_n) = \pi_G(x);
\]
hence A1 is the GMRF approximation. This construction offers an alternative interpretation of the GMRF approximation.

Approximation A2
Approximation A2 is found by including the term we think is the most
important one missed in A1, which is the term involving the data yi ,
\[
\pi_{A2}(x_i \mid x_{i+1},\ldots,x_n, y_1,\ldots,y_i) \propto \pi_G(x_i \mid x_{i+1},\ldots,x_n)\,\exp\big(-h_i(x_i, y_i)\big). \tag{5.30}
\]



Note that A2 can be a great improvement over A1: if the precision matrix in π_G(x) is a diagonal matrix, then A1 can be very inaccurate for large n while A2 will be exact.
However, since (5.30) can take any form, we need to introduce a second level of approximation by using a finite-dimensional representation of (5.30), π̃_{A2}. For this we use log-linear or log-quadratic splines. It is easy to sample from such a density and to compute the normalizing constant. One potential general problem using a spline representation is that the support of the distribution can be unknown to some extent. However, this is not really problematic here, as we may take advantage of the GMRF approximation which is centered at the mode. Let µ_i and σ_i² denote the conditional mean and variance in (5.29); then the probability mass is (with a high degree of certainty, except in artificial cases) in the region
\[
[\mu_i - f\sigma_i,\ \mu_i + f\sigma_i] \tag{5.31}
\]
with f = 5 or 6, say. This can be used to determine the knots for the
log-spline representation. To construct a log-linear spline representation,
we can divide (5.31) into K equally spaced regions and evaluate the log
of (5.30) for the values of xi at each knot. We then define a straight
line interpolating the values in between the knots. A density that is
piecewise log-linear is particularly easy to integrate analytically and
straightforward to sample from. Additionally, we must add two border
regions from −∞ to the first knot, and from the last knot to ∞. This
is needed to obtain unbounded support of our approximation. A log-
quadratic spline approximation can be derived similarly, see Rue et al.
(2004, Appendix) for more details.
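To make the construction concrete, the following R sketch builds a piecewise log-linear representation on K segments inside the region (5.31) and draws one sample by inverting the piecewise CDF; the two unbounded border regions are left out to keep the sketch short, and logdens is assumed here to be a vectorized (possibly unnormalized) log density.

## Hedged sketch: sample once from a piecewise log-linear approximation of a 1-D density.
sample.loglinear <- function(logdens, mu, sigma, K = 20, f = 5) {
  knots <- seq(mu - f*sigma, mu + f*sigma, length.out = K + 1)
  lv <- logdens(knots); lv <- lv - max(lv)                 # stabilize before exponentiating
  a <- lv[-(K + 1)]
  b <- (lv[-1] - lv[-(K + 1)])/diff(knots)                 # slope on each segment
  mass <- ifelse(abs(b) < 1e-12,                           # exact integral of exp(a + b t)
                 exp(a)*diff(knots),
                 exp(a)*(exp(b*diff(knots)) - 1)/b)
  k <- sample(K, 1, prob = mass)                           # pick a segment
  u <- runif(1); h <- diff(knots)[k]
  if (abs(b[k]) < 1e-12) knots[k] + u*h
  else knots[k] + log1p(u*(exp(b[k]*h) - 1))/b[k]          # invert the CDF within the segment
}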

Approximation A3

In approximation A3 we also include the integral term in (5.28), which we write as
\[
I(x_i) = \mathrm{E}\Big(\exp\Big(-\sum_{j=1}^{i-1}h_j(x_j, y_j)\Big)\Big),
\]
where the expectation is wrt $\pi_G(x_1,\ldots,x_{i-1}\mid x_i,\ldots,x_n)$ and we need to evaluate the integral as a function of x_i only. To approximate I(x_i), we may include only the terms in the expectation we think are the most important, $\mathcal{J}(i)$, say,
\[
I\big(x_i \mid \mathcal{J}(i)\big) = \mathrm{E}\Big(\exp\Big(-\sum_{j\in\mathcal{J}(i)}h_j(x_j, y_j)\Big)\Big).
\]



A first natural choice is
\[
\mathcal{J}(i) = \{j : j < i \text{ and } i \sim j\}, \tag{5.32}
\]
meaning that we include the h_j terms such that j is a neighbor of i. Note that the ordering of the indices in (5.24), required to make L sparse, does not necessarily correspond to the original graph, so $j \in \mathcal{J}(i)$ does not need to be close to i. However, if we use a band approach to factorize Q and the GMRF model is an autoregressive process of order p on a line, then $j \in \mathcal{J}(i)$ will be close to i:
\[
\mathcal{J}(i) = \{j : \max(1, i - p) \le j < i\}.
\]
We can improve (5.32) by also including in J (i) all neighbors to the
neighbors of i less than i, and so on. The main assumption is that the
importance decays with the distance (on the graph) to i. This is not
unreasonable as πG is located at the mode, but is of course not true in
general.
After choosing J (i) we need to approximate I(xi |J (i)) for each value
of xi corresponding to the K + 1 knots. A natural choice is to sample M
iid samples from πG (x1 , . . . , xi−1 |xi , . . . , xn ), for each value of xi , and
then estimate I(xi |J (i)) by the empirical mean,
\[
\hat{I}\big(x_i \mid \mathcal{J}(i)\big) = \frac{1}{M}\sum_{m=1}^{M}\exp\Big(-\sum_{j\in\mathcal{J}(i)}h_j\big(x_j^{(m)}, y_j\big)\Big). \tag{5.33}
\]
Here, $x^{(1)},\ldots,x^{(M)}$ denote the M samples, one set for each value of x_i. Some comments are needed on each of these steps.
• To sample from πG (x1 , . . . , xi−1 |xi , . . . , xn ) we make use of (5.23). Let
jmin (i) be the smallest j ∈ J (i). Then, iid samples can be produced
by solving (5.23) from row i − 1 until jmin (i), for iid z’s, and then
adding the mean µG .
• We make $\hat{I}(x_i \mid \mathcal{J}(i))$ continuous wrt x_i by using the same random number
stream to produce the samples. A (much) more computationally
efficient approach is to make use of the fact that only the conditional
mean in πG (x1 , . . . , xi−1 |xi , . . . , xn ) will change if xi varies. Then, we
may sample M samples with zero conditional mean and simply add
the conditional mean that varies linearly with xi .
• Antithetic variables are also useful for estimating $I(x_i \mid \mathcal{J}(i))$. Antithetic normal variates can also involve the scale and not only the sign (Durbin and Koopman, 1997). Let z be a sample from N(0, I) and define
\[
\tilde{z} = z/\sqrt{z^T z}
\]
so that $\tilde{z}$ has unit length. Let $\tilde{x}$ solve $L^T\tilde{x} = \tilde{z}$. With the correct scaling, $\tilde{x}$ will be a sample from $N(0, Q^{-1})$. The correct scaling is found using the fact that $z^T z \sim \chi^2_n$. Let $F_{\chi^2_n}$ denote the cumulative distribution function of a $\chi^2_n$ variable, and $F^{-1}_{\chi^2_n}$ its inverse. For a sample $u_1$ from a uniform density between 0 and 1, both
\[
\tilde{x}\,\sqrt{F^{-1}_{\chi^2_n}(u_1)} \qquad\text{and}\qquad \tilde{x}\,\sqrt{F^{-1}_{\chi^2_n}(1 - u_1)}
\]

are correct samples. The antithetic behavior is in the scaling: if one is close to 0, then the other is far away. Additionally, we may also use the classical trick of flipping the sign without altering the marginal distribution (a small sketch of this construction is given after the list). The benefit of using antithetic variables is that we
can produce many antithetic samples without any computational
effort from just one sample from πG (x1 , . . . , xi−1 |xi , . . . , xn ). This is
computationally very efficient.
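As a small illustration of the scaling step, the following C sketch turns one
unit-length solution x̃ of Lᵀx̃ = z̃ into the four antithetic samples described
above. It is our own minimal example and not part of GMRFLib; in particular the
function name, the calling convention, and the use of the GSL routine
gsl_cdf_chisq_Pinv for the inverse χ²_n distribution function are assumptions.

#include <math.h>
#include <gsl/gsl_cdf.h>                    /* provides gsl_cdf_chisq_Pinv() */

/* Given xt[0..n-1], the solution of L^T xt = z/||z|| (a unit-length
 * direction), and a uniform u1 in (0,1), fill out[0..3][0..n-1] with four
 * antithetic samples from N(0, Q^{-1}): the scaling is antithetic in u1,
 * and the sign flip is the classical antithetic trick. */
void antithetic_scaled_samples(int n, const double *xt, double u1, double **out)
{
    double s1 = sqrt(gsl_cdf_chisq_Pinv(u1, (double) n));       /* sqrt(F^{-1}(u1))     */
    double s2 = sqrt(gsl_cdf_chisq_Pinv(1.0 - u1, (double) n)); /* sqrt(F^{-1}(1 - u1)) */
    int i;
    for (i = 0; i < n; i++) {
        out[0][i] =  s1 * xt[i];
        out[1][i] = -s1 * xt[i];
        out[2][i] =  s2 * xt[i];
        out[3][i] = -s2 * xt[i];
    }
}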
Approximation A3 is indexed by the stream of random numbers used,
and by keeping this sequence fixed, we can produce several samples from
the same approximation.

Approximating constrained HGMRFs


A natural extension is to incorporate constraints into the non-Gaussian
approximation and approximate (5.21) under the constraint Ax = e.
The rank of A is typically small, with the most prominent example being A =
1T , the sum-to-zero constraint. A natural first approach is to make use
of (2.30)
x∗ = x − Q−1 AT (AQ−1 AT )−1 (Ax − e).
Here, x is a sample from the HGMRF approximation, x∗ is the sample
corrected for the constraints, and for Q we may use the precision
matrix for the GMRF approximation. In the Gaussian case the density
of x∗ is correct. For the non-Gaussian cases the density of x∗ is an
approximation to the constrained HGMRF approximation. The density
of x∗ is expected to be fairly accurate for constraints that are not too
influential. To evaluate the density of x∗, we may use (2.31), but the
denominator is problematic as we need to evaluate π(Ax). Although
we can estimate the value of this density at x using either the GMRF
approximation or iid samples from the HGMRF approximation and
techniques for density estimation, we are unable to evaluate this density
exactly.
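For the most prominent special case above, the sum-to-zero constraint A = 1ᵀ
with e = 0, the correction (2.30) simplifies (our own rearrangement, shown for
illustration) to the rank-one update

x^* = x - \frac{\mathbf{1}^T x}{\mathbf{1}^T Q^{-1}\mathbf{1}}\, Q^{-1}\mathbf{1},

so only the single solve Qv = 1 is required, and this solve can reuse the
factorization of Q already computed for the GMRF approximation.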

5.2.2 Example: A stochastic volatility model


We will illustrate the use of the improved approximations on a simple
stochastic volatility model for the pound-dollar daily exchange rate from
October 1, 1981, to June 28, 1985, previously analyzed by Durbin and



Koopman (2000), among others. Let {et} denote the exchange rate; then
the time series of interest is the log ratio {yt }, where
yt = log(et /et−1 ), t = 1, . . . , n = 945.
A simple model to describe {yt } is the following:
yt ∼ N (0, exp(xt )/τy ),
where {xt } is assumed to follow a first-order autoregressive process,
xt = φ xt−1 + εt,
where εt is iid zero mean normal noise with precision τε. The unknown
parameters to be estimated are τε, τy, and φ.
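In the notation used for the nonquadratic likelihood terms of a hierarchical
GMRF model, the contribution from observation t follows directly from
y_t ∼ N(0, exp(x_t)/τ_y); a short derivation (ours, added for illustration)
gives, up to an additive constant,

g_t(x_t, y_t) = \log \pi(y_t \mid x_t) = -\tfrac{1}{2}\, x_t - \tfrac{\tau_y}{2}\, y_t^2 \exp(-x_t),

and it is this nonquadratic term that the approximations compared below treat
with increasing accuracy.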
We can follow the same strategy for constructing a one-block algorithm,
making a joint proposal for (τε, τy, φ, x). Propose a change to each
of the parameters τε, τy, and φ, and then, conditioned on these values,
sample x from an approximation to the full conditional for x. Then,
accept or reject all parameters jointly.
The maximum likelihood estimates of the hyperparameters are
φ = 0.975, τy = 2.49, and τε = 33.57.   (5.34)
To compare the improved approximations to the GMRF approximation,
we will measure the accuracy of the approximations using the acceptance
rate for fixed values of the hyperparameters as advocated by Robert
and Casella (1999, Section 6.4.1). We will compare the following
approximations to π(x|y, θ):
A1 The GMRF approximation.
A2 The approximation (5.30) that includes the likelihood term gi (xi , yi ).
We use K = 20 regions in our quadratic spline approximation.
A3a An improved approximation including the integral term (5.33)
using
J (i) = {i − 20, i − 19, . . . , i − 1},
with obvious changes near the boundary. We estimate (5.33) using
M = 1 ‘sample’ only; the conditional mean computed under the
GMRF approximation A1.
A3b Same as A3a, but using M = 10 samples and where each is
duplicated into 4 using antithetic techniques.
Fixing the hyperparameters at their maximum likelihood estimates (5.34),
we obtained the acceptance rates as displayed in Table 5.1. The results
demonstrate that we can improve the GMRF approximation at the
cost of more computing. The acceptance rate increases by roughly 0.2
for each level of the approximations. Obtaining an acceptance rate of



Approximation Acceptance rate Iter/sec
A1 0.33 23.5
A2 0.43 5.3
A3a 0.61 1.04
A3b 0.91 0.06

Table 5.1 The obtained acceptance rate and the number of iterations per second
on a 2.6-GHz CPU, for approximations A1 to A3b.

0.91 is impressive, but we can push this limit even further using more
computing.
Approximation A2 offers in many cases a significant improvement
compared to A1 without paying too much computationally. The reduc-
tion from 23.5 iterations per second to 5.3 per second is larger than it
would be for a large spatial GMRF. This is because a larger amount
of time will be spent factorizing the precision matrix and locating the
maximum, which is common for all approximations.
The number of regions in the log-spline approximations K also
influences the accuracy. If we increase K we improve the approximation,
most notably when the acceptance rate is high. In most cases a value of
K between 10 and 20 is sufficient.
A more challenging situation appears when we fix the parameters at
different values from their maximum likelihood estimates. The effect
of the different approximations can then be drastic. As an example, if
we reduce τε by a factor of 10 while keeping the other two parameters
unchanged, A1 produces an acceptance rate of essentially zero. The
acceptance rate for A2 and A3a is about 0.04 and 0.10, respectively.
In our experience, much computing is required to obtain an acceptance
rate in the high 90s, while in practice, only a ‘sufficiently’ accurate
approximation is required, i.e., one that produces an acceptance rate
well above zero. Unfortunately, the approximations do not behave
uniformly over the space of the hyperparameters. Although a GMRF
approximation can be adequate near the global mode, it may not be
sufficiently accurate for other values of the hyperparameters. Further,
the accuracy of the approximation decreases for increasing dimension.

5.2.3 Example: Reanalyzing Tokyo rainfall data


We will now revisit the logistic RW2 model with binomial observations
used to analyze the Tokyo rainfall data in Section 4.3.4. The purpose
of this example is to illustrate how the various approximations can be
used to construct an independence sampler for (κ, x) and to discuss how
we can avoid simulation completely at the cost of some approximation



error.

Constructing an independence sampler


We assume from here on the logit-link and a G(1.0, 0.000289) prior for
the precision κ of the second-order random walk x. To construct an
independence sampler for (κ, x) we will construct a joint proposal of the
following form:
π̃(x, κ | y) = π̃(κ | y) π̃(x | y, κ).   (5.35)
The first term is an approximation to the posterior marginal for κ, while
the second term is an approximation to the full conditional for x.
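Since the proposal (5.35) does not depend on the current state, the resulting
sampler is a standard independence Metropolis–Hastings algorithm: a proposed
pair (κ∗, x∗) drawn from (5.35) is accepted with probability (the usual
acceptance ratio, stated here for completeness)

\alpha = \min\left\{ 1,\; \frac{\pi(x^*, \kappa^* \mid y)\,\tilde{\pi}(x, \kappa \mid y)}{\pi(x, \kappa \mid y)\,\tilde{\pi}(x^*, \kappa^* \mid y)} \right\},

where π̃ denotes the joint proposal density in (5.35).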
The posterior marginal can be approximated starting from the simple
identity

π(κ | y) = π(x, κ | y) / π(x | κ, y)
         ∝ π(y | x) π(x | κ) π(κ) / π(x | κ, y).   (5.36)
We can evaluate the rhs for any fixed (valid) value of x, as the lhs
does not depend on x. Note that (5.36) is equivalent to the alternative
formulation

π(κ | y) = ∫ π(x, κ | y) dx,

which is used for sampling (κ, x) jointly from the posterior and
estimating π(κ|y) considering the samples of κ only.
The only unknown term in (5.36) is the denominator. An approx-
imation to the posterior marginal for κ can be obtained using an
approximation π̃(x | κ, y) in the denominator of

π̃(κ | y) ∝ π(y | x) π(x | κ) π(κ) / π̃(x | κ, y).   (5.37)
Note that the rhs now depends on the value of x. In particular, we can
choose x as a function of κ such that the denominator is as accurate as
possible.
To illustrate the dependency on x in (5.37), we computed π̃(κ|y) using
the GMRF approximation (A1) in the rhs, evaluating the rhs at x = 0
and at the mode x = x∗ (κ), which depends on κ. The result is shown
in Figure 5.11. The difference between the two estimates is quite large.
Intuitively, the GMRF approximation is most accurate at the mode and
therefore we should evaluate the rhs of (5.37) at x∗ (κ) and not at any
other point. Note that in this case (5.37) is a Laplace approximation,
see, for example, Tierney et al. (1989).


Figure 5.11 The estimated posterior marginal for κ using (5.37) and the
GMRF approximation, evaluating the rhs using the mode x∗ (κ) (solid line)
or x = 0 (dashed line).

We will use the same approximations as defined in Section 5.2.2


with obvious changes due to cyclic boundary conditions. The posterior
marginal for κ was approximated using each of these approximations,
evaluating (5.37) at x∗(κ). The estimates were indistinguishable on
the plot and the densities all coincide with the solid line in Figure 5.11.
It is no contradiction that the different approximations produce nearly
the same estimate for π(κ|y): if the densities produced
with A1 and A2 are proportional, then the constant of proportionality will cancel after
re-normalizing (5.36). To illustrate this point, let z1 ∼ N (0, σ 2 ) and
z2 ∼ N (0, σ 2 )1[z2 >0] , then the densities for z1 and z2 are proportional
for positive arguments for any value of σ 2 , but the densities are very
different.
We now construct an independence sampler using the approximations
for each term in (5.35). The acceptance rates obtained were 0.83, 0.87,
0.94, and 0.95 for approximation A1, A2, A3a, and A3b respectively. Note
that the correlation in the marginal chain for κ, κ^(1), κ^(2), . . . satisfies
Corr(κ^(i), κ^(i+k)) ≈ (1 − α)^|k|,   (5.38)
where α is the acceptance rate for the independence sampler. The
result (5.38) holds exactly in an ideal situation where an independent
proposal is accepted with a fixed probability α.
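To give a feel for (5.38) (our own arithmetic, not from the source): with the
acceptance rate α = 0.83 obtained for A1 and α = 0.95 for A3b, (5.38) gives

\mathrm{Corr}(\kappa^{(i)}, \kappa^{(i+1)}) \approx 1 - 0.83 = 0.17 \ \text{for A1}, \qquad \approx 0.05 \ \text{for A3b},

so even a moderate acceptance rate leaves little autocorrelation in the κ chain.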


Figure 5.12 The histogram of the posterior marginal for κ based on 2000
successive samples from the independence sampler constructed from A2. The
solid line is the approximation π̃(κ|y).

A comparison of (5.38) with the estimated correlations from each of the
four independence samplers verifies that the approximations are indeed
very accurate.
We now run the independence sampler based on A2. Figure 5.12
displays the histogram of the first 2000 realizations of κ together
with the approximation π̃(κ|y) (solid line). The histogram is in good accordance with
the approximated posterior marginal. With very long runs of the
independence sampler, the estimated posterior marginal for κ fits the
approximation very well. However, we are not able to detect any ‘errors’
in our approximation. The reason is that the estimate based on the
simulations will always be influenced by Monte Carlo error, which is
of Op(1/√N), where N is the number of simulations. This naturally
raises the question whether we can avoid simulation completely in this
example.

Approximative inference not using simulation


We will now briefly discuss deterministic alternatives to simulation-
based inference. Such approaches will only be approximative but fast
to compute.
Consider first inference for κ based on π(κ|y). We can avoid simulation
if we base our inference on the computed approximation π̃(κ|y). This



is a univariate density and we can easily construct a log-linear or log-
quadratic spline representation of it. Although not everything can be
computed analytically, it will be available using numerical techniques.
The inference for x is more involved. Assume for simplicity that we
aim to estimate
 
E(f(x_i) | y) = ∫∫ f(x_i) π(x, κ | y) dκ dx   (5.39)

for some function f (xi ). Of particular interest are the choices f (xi ) = xi
and f(x_i) = x_i², which are required to compute the posterior mean and
variance. We now approximate (5.39) using (5.35),

E(f(x_i) | y) ≈ ∫∫ f(x_i) π̃(x, κ | y) dκ dx
             = ∫ [ ∫ f(x_i) π̃(x | κ, y) dx ] π̃(κ | y) dκ.

Using approximation A1, we can easily compute the marginal π̃(x_i | κ, y),
and hence

E(f(x_i) | y) ≈ ∫ [ ∫ f(x_i) π̃(x_i | κ, y) dx_i ] π̃(κ | y) dκ
             ≈ ∑_κ [ ∫ f(x_i) π̃(x_i | κ, y) dx_i ] π̃(κ | y) ω(κ)   (5.40)

with some weights ω(κ) over a selection of values of κ. The integral is


just one-dimensional and may be computed analytically or numerically.
A similar technique can be applied using approximations A2, A3a, and
A3b, since we can always arrange for the reordering such that index i
becomes index n − 1 after reordering. Hence, π̃(x_i | κ, y) is available as a
log-linear or log-quadratic spline representation. Numerical evaluation of
the inner integral can then be used.
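To make the structure of (5.40) concrete, the following C sketch evaluates the
outer finite sum. It is our own illustration and not a GMRFLib routine: the
helper inner_moment() (the numerically evaluated inner integral for one value
of κ) and the arrays kappa[] and w[] (the selected values of κ and the
corresponding weights π̃(κ|y)ω(κ)) are hypothetical placeholders.

/* Approximate E(f(x_i) | y) by the finite sum in (5.40).  inner_moment(i, kappa)
 * is assumed to return the numerically computed inner integral
 *   \int f(x_i) pi~(x_i | kappa, y) dx_i
 * for one value of kappa; kappa[k] and w[k] hold the K selected values of
 * kappa and the weights pi~(kappa[k] | y) * omega(kappa[k]).  The weights are
 * renormalized, since pi~(kappa | y) may only be known up to a constant. */
double posterior_functional(int i, int K, const double *kappa, const double *w,
                            double (*inner_moment)(int, double))
{
    double sum = 0.0, wsum = 0.0;
    int k;
    for (k = 0; k < K; k++) {
        sum  += inner_moment(i, kappa[k]) * w[k];
        wsum += w[k];
    }
    return sum / wsum;
}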
The error using (5.40) comes from two sources: the error from using the
approximation itself and the error from replacing the two-dimensional integral
by a finite sum. The last error is easy to control, so the main contribution
to the error comes from using the approximation itself. This error is
hard to control, but some insight can be gained if we study the effect of
increasing the accuracy of the approximation (from A1 to A2 or A3a,
say). In any case, an error of the same order as the Monte Carlo error
in a simulation-based approach will be acceptable.

5.3 Bibliographic notes


For a background on Gaussian fields, see Cressie (1993) or Chilés and
Delfiner (1999). Section 5.1 is based on Rue and Tjelmeland (2002)



but contains an extended discussion. GMRF approximations to GFs
have also been applied to nonisotropic correlation functions (Rue and
Tjelmeland, 2002) and to spatiotemporal Gaussian fields (Allcroft and
Glasbey, 2003). Hrafnkelsson and Cressie (2003) also suggest using
the neighborhood radius of the GMRF as a calibration parameter to
approximate the CFs of the Matérn family. GMRF approximations have
been applied by Follestad and Rue (2003), Husby and Rue (2004), Rue
and Follestad (2003), Rue et al. (2004), Steinsland (2003), Steinsland
and Rue (2003) and Werner (2004).
Section 5.2 is based on Rue et al. (2004), which contains more
challenging examples. The approximations are also applied in Steinsland
and Rue (2003) constructing overlapping block proposals for HGMRFs.



APPENDIX A

Common distributions

The definitions of common distributions used are given here.


The normal distribution See Section 2.1.7.
The Student-tν distribution The density of a Student-tν distributed
variable x with ν degrees of freedom is

π(x) = 1/(√ν B(ν/2, 1/2)) · ( ν/(ν + x²) )^((1+ν)/2),

where B(a, b) is the beta function

B(a, b) = Γ(a)Γ(b)/Γ(a + b)

and Γ(z) is the gamma function, which equals (z − 1)! for z = 1, 2, . . . and

Γ(z) = ∫₀^∞ t^(z−1) exp(−t) dt

in general. The mean is 0 and the variance is ν/(ν − 2) for ν > 2. We


abbreviate this as x ∼ tν .
The gamma distribution The density of a gamma-distributed variable
τ > 0 with shape parameter a > 0 and inverse-scale parameter b > 0, is

π(τ) = (b^a / Γ(a)) τ^(a−1) exp(−bτ).

The mean of τ equals a/b and the variance equals a/b². We abbreviate
this as τ ∼ G(a, b). For a = 1 we obtain the exponential distribution
with parameter 1/b. If also b = 1 we obtain the standard exponential
distribution, which has both mean and variance equal to 1.
The Laplace distribution The density of a standard Laplace distributed
variable x is

π(x) = (1/2) exp(−|x|).
The mean is 0 and the variance is 2.

The logistic distribution The density of a logistic-distributed variable
x with parameters a and b > 0 is

π(x) = (1/b) · exp(−(x − a)/b) / [1 + exp(−(x − a)/b)]².

The mean is a and the variance is π²b²/3. We abbreviate this as x ∼
L(a, b). For a = 0 and b = 1 we obtain the standard logistic distribution.
The Kolmogorov-Smirnov distribution The distribution function for a
Kolmogorov-Smirnov-distributed variable x is

G(x) = ∑_{k=−∞}^{∞} (−1)^k exp(−2k²x²),   x > 0.

We abbreviate this as x ∼ KS.


The Bernoulli and binomial distribution A binary random variable x is
Bernoulli-distributed if Prob(x = 1) = p and Prob(x = 0) = 1 − p. The
mean is p and the variance p(1 − p). We abbreviate this as x ∼ B(p).
The sum s of n independent Bernoulli-distributed variables {x_i}, where
x_i ∼ B(p), is binomial distributed. We abbreviate this as s ∼ B(n, p),
where Prob(s = k) = (n choose k) p^k (1 − p)^(n−k) for k = 0, . . . , n. The mean of s is
np and the variance is np(1 − p).
The Poisson distribution A discrete random variable x ∈ {0, 1, 2, . . .}
is Poisson-distributed if

Prob(x = k) = (λ^k / k!) exp(−λ),   k = 0, 1, 2, . . . ,

where λ ≥ 0. Both the mean and the variance of x equal λ. We
abbreviate this as x ∼ P(λ).



APPENDIX B

The library GMRFLib

This appendix contains a short discussion of how we actually organize


and perform the computations in the (open-source) library GMRFLib
(Rue and Follestad, 2002). The library is written in C and Fortran.
We first describe the graph-object and the function Qfunc defining the
elements {Qij} in the precision matrix, then how to sample a subset
xA of x conditionally on x−A when x is a GMRF, and finally how
to construct a block-updating MCMC algorithm for hierarchical GMRF
models. Along with these examples we also give C code illustrating how
to implement these tasks in GMRFLib, ending up with the code to analyze
the Tokyo rainfall data.
At the time of writing, the library GMRFLib (version 2.0) supports two
sparse matrix libraries: the band-matrix routines in the Lapack-library
(Anderson et al., 1995) using the Gibbs-Poole-Stockmeyer reorder
algorithm for bandwidth reduction (Lewis, 1982), and the multifrontal
supernodal Cholesky factorisation implementation in the TAUCS-library
(Toledo et al., 2002) using the nested dissection reordering from the
METIS-library (Karypis and Kumar, 1998).
We now give a brief introduction to GMRFLib. The library contains
many more useful features not discussed here and we refer to the manual
for further details.

B.1 The graph object and the function Qfunc


The graph-object is a representation of a labelled graph and has the
following structure:
n The size of the graph.
nnbs A vector defining the number of neighbors, so nnbs[i] is the number
of neighbors to node i.
nbs A vector of vectors defining which nodes are the neighbors; node i
has k = nnbs[i] neighbors, which are the nodes nbs[i][1], nbs[i][2], . . .,
nbs[i][k].
For example, the graph in Figure B.1 has the following representation:
n = 4, nnbs = [1, 3, 1, 1], nbs[1] = [2], nbs[2] = [1, 3, 4], nbs[3] = [2],
and nbs[4] = [2]. The graph can be defined in an external text-file with


Figure B.1 The representation for this graph is n = 4, nnbs = [1, 3, 1, 1],
nbs[1] = [2], nbs[2] = [1, 3, 4], nbs[3] = [2] and nbs[4] = [2].

the following format:
4
1 1 2
2 3 1 3 4
3 1 2
4 1 2
The first number is n; then each line gives the relevant information
for each node: node 1 has 1 neighbor, which is node 2; node 2 has 3
neighbors, which are nodes 1, 3, and 4; and so on. Note that there is
some redundancy here because we know that if i ∼ j then j ∼ i.
We then need to define the elements in the Q matrix. We know that
the only nonzero terms in Q are those Qij where i ∼ j or i = j. A
convenient way to represent this is to define the function
Qfunc(i, j), for i = j or i ∼ j,
returning Qij .
To illustrate the graph-object and the use of the Qfunc-function,
Algorithm B.1 demonstrates how to compute y = Qx. Note that only
the nonzero terms in Q are used to compute Qx. Recall that i is not a
neighbor of itself, so we need to add the diagonal terms explicitly.

Algorithm B.1 Computing y = Qx


1: for i = 1 to n do
2: y[i] = x[i] ∗ Qfunc(i, i)
3: for k = 1 to nnbs[i] do
4: j = nbs[i][k]
5: y[i] = y[i] + x[j] ∗ Qfunc(i, j)
6: end for
7: end for
8: Return y
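A direct C translation of Algorithm B.1, using the graph-object fields and the
Qfunc convention described above (and the GMRFLib_graph_tp type used in the
program below), might look as follows; the helper itself is our own sketch and
not a GMRFLib routine, and node indices run from 0 to n − 1 as in GMRFLib.

/* Compute y = Qx using only the nonzero elements of Q, as in Algorithm B.1.
 * 'graph' is a GMRFLib graph-object and 'Qfunc' returns Q_{ij} for i = j or
 * i ~ j; 'arg' is passed through to Qfunc unchanged. */
void compute_Qx(GMRFLib_graph_tp *graph,
                double (*Qfunc)(int, int, char *), char *arg,
                const double *x, double *y)
{
    int i, k, j;
    for (i = 0; i < graph->n; i++) {
        y[i] = x[i] * Qfunc(i, i, arg);            /* diagonal term */
        for (k = 0; k < graph->nnbs[i]; k++) {     /* terms for j ~ i */
            j = graph->nbs[i][k];
            y[i] += x[j] * Qfunc(i, j, arg);
        }
    }
}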

The following C-program illustrates the Qfunc-function and the graph-


object. The program creates the graph for a circular RW1 model and



defines the corresponding Qfunc-function. It then writes out the nonzero
terms in the precision matrix. A third argument is passed to Qfunc to
transfer additional parameters. Note that the nodes in the graph-object
in GMRFLib are numbered from 0 to n − 1.



#include <stdio.h>
#include "GMRFLib/GMRFLib.h" /* definitions of GMRFLib */
double Qfunc(int i, int j, char *kappa)
{
/* this function returns the element Q_{ij} in the precision matrix with the additional
* parameter in Qfunc_argument. recall that this function is *only* called with pairs ij where
* i\sim j or i=j */
if (i == j)
return 2.0* *((double *)kappa); /* return Q_{ii} */
else
return - *((double *)kappa); /* return Q_{ij} where i \sim j */
}
int main(int argc, char **argv)
{
/* create the graph for a circular RW1 */
GMRFLib_graph_tp *graph; /* pointer to the graph */
int n=5, bandwidth=1, cyclic=GMRFLib_TRUE; /* size of the graph, the bandwidth and cyclic flag */
GMRFLib_make_linear_graph(&graph, n, bandwidth, cyclic);
/* display the graph and the Q_{ij}’s using kappa = 1 */
int i, j, k;
double kappa = 1.0; /* use kappa=1 */
printf("the size of the graph is n = %1d\n", graph->n);
for(i=0; i<graph->n; i++)
{
printf("node %1d have %d neighbors\n", i, graph->nnbs[i]);
printf("\tQ(%1d, %1d) = %.3f\n", i, i, Qfunc(i, i, (char *)&kappa));
for(k=0;k<graph->nnbs[i];k++)
{
j = graph->nbs[i][k];
printf("\tQ(%1d, %1d) = %.3f\n", i, j, Qfunc(i, j, (char *)&kappa));
}
}
return(0);
}



The output of the program is
the size of the graph is n = 5
node 0 have 2 neighbors
Q(0, 0) = 2.000
Q(0, 1) = -1.000
Q(0, 4) = -1.000
node 1 have 2 neighbors
Q(1, 1) = 2.000
Q(1, 0) = -1.000
Q(1, 2) = -1.000
node 2 have 2 neighbors
Q(2, 2) = 2.000
Q(2, 1) = -1.000
Q(2, 3) = -1.000
node 3 have 2 neighbors
Q(3, 3) = 2.000
Q(3, 2) = -1.000
Q(3, 4) = -1.000
node 4 have 2 neighbors
Q(4, 4) = 2.000
Q(4, 0) = -1.000
Q(4, 3) = -1.000

B.2 Sampling from a GMRF


We will now outline how to produce a sample from a GMRF. Although
a parameterization using the mean µ and the precision matrix Q
is sufficient, we often face the situation that these are only known
implicitly. This is caused by conditioning on a part of x and/or other
variables such as observed data. In order to avoid a lot of cumbersome
computing for the user, the library GMRFLib contains a high-level
interface to address a more general problem that in our experience covers
most situations of interest. The task is to sample from π(xA |xB ) (and/or
to evaluate the normalized density) where for notational convenience we
denote the set −A by B. Additionally, there may be a hard or soft
constraint, but we do not discuss this option here.
The joint density is assumed to be of the form:
log π(x) = −(1/2)(x − µ)ᵀ(Q + diag(c))(x − µ) + bᵀx + const.   (B.1)
Here, µ is a parameter and not necessarily the mean as b can be nonzero.
Furthermore, diag(c) is a diagonal matrix with the vector c on the
diagonal. The extra cost of allowing an extended parameterization is
negligible compared with the cost of computing the factorisation of the precision
matrix. To compute the conditional density, we expand (B.1) into terms
involving xA and xB to obtain its canonical parameterization

xA | xB ∼ NC(b̃, Q̃),



where

b̃ = bA + (QAA + diag(cA))µA − QAB(xB − µB)   (B.2)
Q̃ = QAA + diag(cA).   (B.3)

The graph is G^A, as adding diag(cA) does not change it.
We now need to compute G^A. It is straightforward but somewhat
technical to do this efficiently, so we skip the details here. The graph-
object is labelled; therefore we need to know to which node in G a node
in G^A corresponds. Let this mapping be m, such that node i in G^A
corresponds to node m[i] in G.
It is not that hard to compute b̃, as is clear from Algorithm B.2.
The only simplification made is the following: If i ∈ A and j ∼ i, then
either j ∈ A or j ∈ B, so we can compute all terms in one loop.

Algorithm B.2 Computing b̃ in (B.2)
1: for iA = 1 to nA do
2:   i = m[iA]
3:   b̃[iA] = b[i] + (Qfunc(i, i) + c[i]) ∗ µ[i]
4:   for k = 1 to nnbs[i] do
5:     j = nbs[i][k]
6:     if j ∈ B then
7:       b̃[iA] = b̃[iA] − Qfunc(i, j) ∗ (x[j] − µ[j])
8:     else
9:       b̃[iA] = b̃[iA] + Qfunc(i, j) ∗ µ[j]
10:    end if
11:  end for
12: end for
13: Return b̃
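In the same spirit, a C sketch of Algorithm B.2 (our own illustration, not a
GMRFLib routine) could look as follows; the index map m[] and the indicator
array in_B[], marking the nodes that are conditioned upon, are bookkeeping
assumed to be set up by the caller.

/* Compute b~ of (B.2) following Algorithm B.2.  nA is the number of nodes
 * in A, m[iA] maps node iA in G^A to its node in G, and in_B[j] is nonzero
 * when node j belongs to B (i.e., is conditioned upon). */
void compute_bb(int nA, const int *m, const char *in_B,
                GMRFLib_graph_tp *graph,
                double (*Qfunc)(int, int, char *), char *arg,
                const double *b, const double *c, const double *mu,
                const double *x, double *bb)
{
    int iA, i, k, j;
    for (iA = 0; iA < nA; iA++) {
        i = m[iA];
        bb[iA] = b[i] + (Qfunc(i, i, arg) + c[i]) * mu[i];
        for (k = 0; k < graph->nnbs[i]; k++) {
            j = graph->nbs[i][k];
            if (in_B[j])
                bb[iA] -= Qfunc(i, j, arg) * (x[j] - mu[j]);
            else
                bb[iA] += Qfunc(i, j, arg) * mu[j];
        }
    }
}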

We now apply Algorithm 2.5 using G^A, b̃, and Q̃func defined as

Q̃func(i, j) ≡ Qfunc(m[i], m[j]) + 1[i=j] c[m[i]]

to obtain a sample x̃. Finally, we may insert this sample into x,

x[m[i]] = x̃[i],   i = 1, . . . , nA,
and we are done.
Example B.1 Consider the graph on the left in Figure B.2 with 6
nodes. We want to sample from π(xA |xB ) where A = {3, 5, 6} and the
joint distribution of x is given in (B.1).
The subgraph G^A is shown on the right in Figure B.2, where it is
indicated how each node in G^A connects to a node in G. The mapping


Figure B.2 Illustration of how to compute the conditional distribution
π(xA|xB). The graph G is shown on the left, where A = {3, 5, 6} is marked. The
subgraph G^A is shown on the right, indicating which node in G^A corresponds
to a node in G. The mapping is m = [3, 5, 6].

m is m = [3, 5, 6], meaning that node i in G^A corresponds to node m[i]


in G.
The conditional precision matrix is QAA + diag(cA), which reads

⎛ Q33 + c3   Q35        Q36      ⎞
⎜ Q53        Q55 + c5   Q56      ⎟
⎝ Q63        Q65        Q66 + c6 ⎠

and following Algorithm B.2, we obtain b̃:

b̃1 = b3 + (Q33 + c3)µ3 − Q31(x1 − µ1) + Q35 µ5
b̃2 = b5 + (Q55 + c5)µ5 − Q52(x2 − µ2) + Q53 µ3
b̃3 = b6 + (Q66 + c6)µ6 + Q65 µ5 + Q63 µ3.
Note that x4 is not used since 4 ∉ ne(A). We sample x̃ from its canonical
parameterization using Algorithm 2.5. Finally, we may insert the sample
into x,

x3 = x̃1,   x5 = x̃2,   x6 = x̃3.
The following C-program illustrates how to produce (conditional) sam-
ples from a GMRF. We continue with the circular RW1 model and con-
dition on x1 = 1 and x245 = 10. The function GMRFLib_init_problem
computes (and stores in problem) the conditional density and all inter-
mediate variables needed to produce samples (using GMRFLib_sample) or
to evaluate the log density of some configuration x (using the function
GMRFLib_evaluate). In the program we extract the conditional mean
and compute the empirical mean of 100 iid samples. These quantities


Figure B.3 The conditional mean (dashed line), the empirical mean (dashed-
dotted line) and one sample (solid line) from a circular RW1 model with κ = 1,
n = 366, conditional on x1 = 1 and x245 = 10.

and one sample from the conditional distribution are shown in Figure
B.3.



#include <assert.h>
#include <stdio.h>
#if !defined(__FreeBSD__)
#include <malloc.h>
#endif
#include <stdlib.h>
#include "GMRFLib/GMRFLib.h" /* include definitions of GMRFLib */

double Qfunc(int i, int j, char *kappa)


{
return *((double *)kappa) * (i==j ? 2.0 : -1.0);
}
int main(int argc, char **argv)
{
assert(argv[1]);
(*GMRFLib_uniform_init)(atoi(argv[1])); /* init the RNG with the seed in the first argument */
/* create the graph for a circular RW1 */
GMRFLib_graph_tp *graph; /* pointer to the graph */
int n = 366; /* size of the graph */
int bandwidth = 1; /* the bandwidth is 1 */
int cyclic = GMRFLib_TRUE; /* cyclic graph */
GMRFLib_make_linear_graph(&graph, n, bandwidth, cyclic);

GMRFLib_problem_tp *problem; /* hold the problem */


GMRFLib_constr_tp *constr = NULL; /* no constraints */
double *x, *mean=NULL, *b=NULL, *c=NULL; /* various vectors some are NULL */
char *fixed; /* indicate which x_i’s are fixed or not */

x = calloc(n, sizeof(double)); /* allocate space */


fixed = calloc(n, sizeof(char)); /* allocate space */

/* sample from a circular RW1 conditioned on x_0 and x_{2n/3} */


fixed[0] = 1; /* fix x[0] */
x[0] = 1.0; /* ...to 1.0 */



fixed[2*n/3] = 1; /* fix x[2*n/3] */
x[2*n/3] = 10.0; /* ...to 10.0 */
double kappa = 1.0; /* set kappa = 1 */

GMRFLib_init_problem(&problem, x, b, c, mean, graph, Qfunc, (char *)&kappa, /* init the problem */


fixed, constr, GMRFLib_NEW_PROBLEM);
/* extract the conditional mean, which is available for the sub_graph only. map it back using
* the mapping problem->map. */
mean = x; /* use same storage */
int i;
for(i=0;i<problem->sub_graph->n;i++) mean[problem->map[i]] = problem->sub_mean[i];

int j, m=100; /* sample m=100 samples to estimate the empirical mean */


double *emean = calloc(n, sizeof(double));
for(j=0;j<m;j++)
{
GMRFLib_sample(problem);
for(i=0;i<n;i++) emean[i] += problem->sample[i];
}
for(i=0;i<n;i++) /* write the results to stdout */
{
emean[i] /= m;
printf("%d %f %f %f\n", i, mean[i], emean[i], problem->sample[i]);
}
return(0);
}



B.3 Implementing block-updating algorithms for hierarchical
GMRF models

GMRFLib also has a high-level routine to construct block updating


in hierarchical GMRF models, GMRFLib_blockupdate. This routine
locates the mode, constructs the GMRF approximation and computes
the contribution to the acceptance probability. Due to the general
framework, it is straightforward to implement all examples in Section
4.4 as soon as the GMRF is defined.
We extend (B.1) to include nonquadratic terms in the style of (5.21),
so the full density is defined as

log π(x) = −(1/2)(x − µ)ᵀ(Q + diag(c))(x − µ) + bᵀx + ∑_i d_i g_i(x_i) + const.   (B.4)

Here, gi (xi ) represents the log-likelihood term, but can represent any
(reasonable) function of xi. The GMRFLib_blockupdate routine con-
structs from (B.4) its GMRF approximation π̃(x) and samples from it the
proposal x∗ (forward step). Of course, the reverse or backward step is
also performed, constructing the GMRF approximation starting from x∗
and evaluating the log density of x using the GMRF approximation. The
computations are similar in the case where only xA is updated keeping
x−A fixed, which we use in the subblock algorithm.
As we always do a joint update of the GMRF (or parts of it) with
the corresponding hyperparameters θ we allow the terms µ, Q, c, b, d,
and {gi (·)} in (B.4) to depend on hyperparameters θ. The acceptance
probability for the joint update (4.8) is then min{1, R}, where

R = ( π(x∗ | θ∗, y) π̃(x | θ, y) ) / ( π(x | θ, y) π̃(x∗ | θ∗, y) )
    × ( π(θ∗) q(θ | θ∗) ) / ( π(θ) q(θ∗ | θ) ),

where the first factor is ‘term 1’, the second is ‘term 2’, and π(θ) is
the prior for θ. GMRFLib_blockupdate samples x∗ and


computes term 1 except for the normalization constant with respect to
θ. This information is not available in (B.4). Term 2 and the ratio of the
normalization constants must be added by the user.
For the Tokyo rainfall example using the logit-link, gi is

gi (xi ) = yi log(p(xi )) + (mi − yi ) log(1 − p(xi )),

where p(xi ) = exp(xi )/(1+exp(xi )), yi is the observed number of counts


on day i and mi = 2 except for i = 60 where m60 = 1. The weights {di }



are all 1 in this case, and µ, c, and b are vectors of zeros. Term 2, together
with the ratio of the normalization constants, is

( (κ∗)^(a−1) exp(−bκ∗) / (κ^(a−1) exp(−bκ)) ) · ( (κ∗)^((n−1)/2) / κ^((n−1)/2) ),

since q(θ|θ∗)/q(θ∗|θ) = 1 using (4.13). The prior parameters are a = 1.0 and

b = 0.000289. The following C program is an implementation of the one-


block algorithm using GMRFLib.



#include <assert.h>
#include <math.h>
#include <string.h>
#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>
#include "GMRFLib/GMRFLib.h" /* include definitions of GMRFLib */

#define Uniform (*GMRFLib_uniform) /* use the RNG in GMRFLib. return Unif(0,1)’s */


#define MIN(a,b) ((a) < (b) ? (a) : (b)) /* MIN macro */

typedef struct /* the type holding the data */


{
double *y, *n; /* observed counts and number of days */
}
data_tp;

double link(double x)
{ /* the link function */
return exp(x)/(1+exp(x));
}
double log_gamma(double x, double a, double b)
{ /* return the log density, up to an additive constant, of a gamma variable with mean a/b. */
return ((a-1.0)*log(x)-(x)*b);
}
double Qfunc(int i, int j, char *kappa)
{
return *((double *)kappa) * (i==j ? 2.0 : -1.0);
}
int gi(double *gis, double *x_i, int m, int idx, double *not_in_use, char *arg)
{
/* compute g_i(x_i) for m values of x_idx: x_i[0], ..., x_i[m-1]. return its values in gis[0],
* ..., gis[m-1]. additional (user) arguments are passed through the pointer gi_arg, here, the
* data itself. */



int k;
double p;
data_tp *data;
data = (data_tp *) arg;
for(k=0; k<m; k++)
{
p = link(x_i[k]);
gis[k] = data->y[idx]*log(p) + (data->n[idx] - data->y[idx])*log(1.0-p);
}
return 0;
}
double scale_proposal(double F)
{
/* return a sample from f ~ \pi(f) \propto 1+1/f, on the interval [1/F, F]. write the density as
* a mixture and sample each component with the correct probability. */
double len = F - 1/F;
if (F == 1.0) return 1.0;
if (Uniform() < len/(len+2*log(F)))
return 1/F + len*Uniform();
else
return pow(F, 2.0*Uniform()-1.0);
}
int main(int argc, char **argv)
{
assert(argv[1]);
(*GMRFLib_uniform_init)(atoi(argv[1])); /* init the RNG with the seed in the first argument */

GMRFLib_graph_tp *graph; /* pointer to the graph */


int n = 366; /* size of the graph for the Tokyo example */
int bandwidth = 1; /* the bandwidth is 1 for RW1 */
int cyclic = GMRFLib_TRUE; /* cyclic graph */
GMRFLib_make_linear_graph(&graph, n, bandwidth, cyclic);

GMRFLib_constr_tp *constr = NULL; /* no constraints */



double *d, *mean=NULL, *b=NULL, *c=NULL; /* various vectors some are NULL */
char *fixed = NULL; /* no x_i’s are fixed */

data_tp data; /* hold the data */


data.y = calloc(n, sizeof(double)); /* allocate space */
data.n = calloc(n, sizeof(double)); /* allocate space */

FILE *fp; int i; /* read data */


fp = fopen("tokyo.rainfall.data.dat", "r"); assert(fp);
for(i=0;i<n;i++) fscanf(fp, "%lf %lf\n", &data.y[i], &data.n[i]);
fclose(fp);

double *x_old, *x_new, kappa_old=100.0, kappa_new; /* old and new (the proposal) x and kappa */
x_old = calloc(n, sizeof(double)); /* allocate space */
x_new = calloc(n, sizeof(double)); /* allocate space */

d = calloc(n, sizeof(double)); /* allocate space */


for(i=0;i<n;i++) d[i] = 1.0; /* all are equal to 1 */

while(1)
{ /* just keep on until the process is killed */
kappa_new = scale_proposal(6.0)*kappa_old;
double log_accept; /* GMRFLib_blockupdate does all the job... */
GMRFLib_blockupdate(&log_accept, x_new, x_old, b, b, c, c, mean, mean, d, d,
gi, (char *) &data, gi, (char *) &data,
fixed, graph, Qfunc, (char *)&kappa_new, Qfunc, (char *)&kappa_old, NULL, NULL, NULL, NULL,
constr, constr, NULL, NULL);

double A = 1.0, B = 0.000289; /* prior parameters for kappa */


/* add terms to the acceptance probability not computed by GMRFLib_blockupdate: prior for
* kappa and the normalising constant. */
log_accept += ((n-1.0)/2.0*log(kappa_new) + log_gamma(kappa_new, A, B))
- ((n-1.0)/2.0*log(kappa_old) + log_gamma(kappa_old, A, B));



static double p_acc = 0.0; /* sum of the accept probabilities */
double acc_prob = exp(MIN(log_accept, 0.0));
p_acc += acc_prob;
if (Uniform() < acc_prob)
{ /* accept the proposal */
memcpy(x_old, x_new, n*sizeof(double));
kappa_old = kappa_new;
}
static int count = 0; /* number of iterations */
if (!((++count)%10)) /* output every 10th */
{
printf(" %.3f", kappa_old);
for(i=0;i<n;i++) printf(" %.3f", link(x_old[i]));
printf("\n"); fflush(stdout);
fprintf(stderr, "mean accept prob %f\n", p_acc/count); /* monitor the mean accept probability */
}
}
return(0);
}



References

Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and


polychotomous response data. Journal of the American Statistical
Association, 88(422), 669–679.
Albert, J. H. and Chib, S. (2001). Sequential ordinal modeling with
applications to survival data. Biometrics, 57, 829–836.
Allcroft, D. J. and Glasbey, C. A. (2003). A latent Gaussian Markov ran-
dom field model for spatio-temporal rainfall disaggregation. Journal
of the Royal Statistical Society, Series C, 52, 487–498.
Amit, Y., Grenander, U., and Piccioni, M. (1991). Structural image
restoration through deformable templates. Journal of the American
Statistical Association, 86, 376–387.
Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J. J., Croz,
J. D., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov,
S., and Sorensen, D. C. (1995). LAPACK Users’ Guide, 2nd edition.
Philadelphia: Society for Industrial and Applied Mathematics.
Andrews, D. F. and Mallows, C. L. (1974). Scale mixtures of normal
distributions. Journal of the Royal Statistical Society, Series B, 36(1),
99–102.
Anselin, L. and Florax, R., Eds. (1995). New Directions in Spatial
Econometrics. New York: Springer-Verlag.
Assunção, R. M., Assunção, J. J., and Lemos, M. B. (1998). Induced
technical change: A Bayesian spatial varying parameter model. In
Proceedings of XVI Latin American Meeting of Econometric Society:
Catholic University of Peru, Peru.
Assunção, R. M., Potter, J. E., and Cavenaghi, S. M. (2002). A
Bayesian space varying parameter model applied to estimating fertility
schedules. Statistics in Medicine, 21(14), 2057–2075.
Aykroyd, R. G. (1998). Bayesian estimation for homogeneous and
inhomogeneous Gaussian random fields. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 20(5), 533–539.
Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2004). Hierarchical
Modeling and Analysis for Spatial Data, volume 101 of Monographs
on Statistics and Applied Probability. London: Chapman & Hall.

Banerjee, S., Wall, M. M., and Carlin, B. P. (2003). Frailty modeling for
spatially correlated survival data, with application to infant mortality
in Minnesota. Biostatistics, 4, 123–142.
Barndorff-Nielsen, O., Kent, J. T., and Sørensen, M. (1982). Normal
variance-mean mixtures and z distributions. International Statistical
Review, 50(2), 145–159.
Barone, P. and Frigessi, A. (1989). Improving stochastic relaxation
for Gaussian random fields. Probability in the Engineering and
Informational Sciences, 3(4), 369–389.
Barone, P., Sebastiani, G., and Stander, J. (2001). General over-
relaxation Markov chain Monte Carlo algorithms for Gaussian den-
sities. Statistics & Probability Letters, 52(2), 115–124.
Barone, P., Sebastiani, G., and Stander, J. (2002). Over-relaxation
methods and coupled Markov chains for Monte Carlo simulation.
Statistics and Computing, 12(1), 17–26.
Bartlett, M. S. (1978). Nearest neighbour models in the analysis of field
experiments (with discussion). Journal of the Royal Statistical Society,
Series B, 40(2), 147–174.
Berzuini, C. and Clayton, C. (1994). Bayesian survival analysis on
multiple time scales. Statistics in Medicine, 13, 823–838.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice
systems (with discussion). Journal of the Royal Statistical Society,
Series B, 36(2), 192–225.
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician,
24(3), 179–195.
Besag, J. (1977a). Efficiency of pseudolikelihood estimation for simple
Gaussian fields. Biometrika, 64(3).
Besag, J. (1977b). Errors-in-variables estimation for Gaussian lattice
schemes. Journal of the Royal Statistical Society, Series B, 39(1),
73–78.
Besag, J., Green, P. J., Higdon, D., and Mengersen, K. (1995). Bayesian
computation and stochastic systems (with discussion). Statistical
Science, 10(1), 3–66.
Besag, J. and Higdon, D. (1999). Bayesian analysis of agricultural field
experiments (with discussion). Journal of the Royal Statistical Society,
Series B, 61(4), 691–746.
Besag, J. and Kooperberg, C. (1995). On conditional and intrinsic
autoregressions. Biometrika, 82(4), 733–746.

Besag, J., York, J., and Mollié, A. (1991). Bayesian image restoration
with two applications in spatial statistics (with discussion). Annals of
the Institute of Statistical Mathematics, 43(1), 1–59.
Biller, C. and Fahrmeir, L. (1997). Bayesian spline-type smoothing in
generalized regression models. Computational Statistics, 12, 135–151.
Bookstein, F. L. (1989). Principal warps: Thin-plate splines and
the decomposition of deformations. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 11(6), 567–585.
Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical
Analysis. Addison-Wesley Publishing Co., Reading, Mass.-London-
Don Mills, Ont.
Bray, I. (2002). Application of Markov chain Monte Carlo methods
to projecting cancer incidence and mortality. Journal of the Royal
Statistical Society, Series C, 51(2), 151–164.
Brezger, A., Kneib, T., and Lang, S. (2003). BayesX: Software for
Bayesian inference. Department of statistics, University of Munich,
version 1.1 edition. http://www.stat.uni-muenchen.de/∼lang/bayesx.
Brockwell, P. J. and Davis, R. A. (1987). Time Series: Theory and
Methods. Berlin: Springer-Verlag.
Brook, D. (1964). On the distinction between the conditional probability
and the joint probability approaches in the specification of nearest-
neighbour systems. Biometrika, 51(3 and 4), 481–483.
Carlin, B. P. and Banerjee, S. (2003). Hierarchical multivariate CAR
models for spatio-temporally correlated survival data (with discus-
sion). In Bayesian Statistics, 7 (pp. 45–63). New York: Oxford Univ.
Press.
Carlin, B. P. and Louis, T. A. (1996). Bayes and Empirical Bayes
Methods for Data Analysis, volume 69 of Monographs on Statistics
and Applied Probability. London: Chapman & Hall.
Carlin, B. P., Polson, N. G., and Stoffer, D. S. (1992). A Monte Carlo
approach to non-normal and nonlinear state-space modeling. Journal
of the American Statistical Association, 87(418), 493–500.
Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space
models. Biometrika, 81(3), 541–543.
Carter, C. K. and Kohn, R. (1996). Markov chain Monte Carlo in
conditionally Gaussian state space models. Biometrika, 83(3), 589–
601.
Chellappa, R. and Chatterjee, S. (1985). Classification of textures using
Gaussian Markov random fields. IEEE Transactions on Acoustics
Speech and Signal Processing, 33, 959–963.

Chellappa, R., Chatterjee, S., and Bagdazian, R. (1985). Texture syn-
thesis and compression using Gaussian-Markov random field models.
IEEE Transaction On Systems, Man and Cybernetics, 15(2), 298–303.
Chellappa, R. and Jain, A. K., Eds. (1993). Markov Random Fields.
Boston, MA: Academic Press Inc.
Chellappa, R. and Kashyap, R. L. (1982). Digital image restoration using
spatial interaction models. IEEE Transaction on Acoustics Speech and
Signal Processing, ASSP-30(3), 614–625.
Chen, M. H. and Dey, D. K. (1998). Bayesian modeling of correlated bi-
nary responses via scale mixture of multivariate normal link functions.
Sankhyā. The Indian Journal of Statistics. Series A, 60(3), 322–343.
Chilés, J. P. and Delfiner, P. (1999). Geostatistics: Modeling Spatial
Uncertainty. Wiley Series in Probability and Statistics. Chichester:
John Wiley & Sons, Ltd.
Clayton, D. G. (1996). Generalized linear mixed models. In W. R. Gilks,
S. Richardson, and D. J. Spiegelhalter (Eds.), Markov Chain Monte
Carlo in Practice (pp. 275–301). London: Chapman & Hall.
Cressie, N. A. C. (1993). Statistics for spatial data. Wiley Series
in Probability and Mathematical Statistics: Applied Probability and
Statistics. New York: John Wiley & Sons Inc. Revised reprint of the
1991 edition, A Wiley-Interscience Publication.
Cressie, N. A. C. and Chan, N. H. (1989). Spatial modeling of regional
variables. Journal of the American Statistical Association, 84(406),
393–401.
Crook, A. M., Knorr-Held, L., and Hemingway, H. (2003). Measuring
spatial effects in time to event data: A case study using months from
angiography to coronary artery bypass graft CABG. Statistics in
Medicine, (22), 2943–2961.
Cross, G. R. and Jain, A. K. (1983). Markov random field texture
models. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 5(1), 25–39.
Dahlhaus, R. and Künsch, H. R. (1987). Edge effects and efficient
parameter estimation for stationary random fields. Biometrika, 74(4),
877–882.
Davis, P. J. (1979). Circulant Matrices. New York: John Wiley & Sons,
Ltd.
de Jong, P. and Shephard, N. (1995). The simulation smoother for time
series models. Biometrika, 82(2), 339–350.
Dempster, A. P. (1972). Covariance selection. Biometrics, 28(1), 157–
175.

Descombes, X., Sigelle, M., and Préteux, F. (1999). Estimating Gaussian
Markov random field parameters in a nonstationary framework:
Application to remote sensing imaging. IEEE Transactions on Image
Processing, 8(4), 490–503.
Devroye, L. (1986). Non-uniform Random Variate Generation. Berlin:
Springer-Verlag. A copy of the book is freely available from L.
Devroye’s homepage http://jeff.cs.mcgill.ca/~luc.
Dietrich, C. R. and Newsam, G. N. (1996). A fast and exact method
for multidimensional Gaussian stochastic simulations: Extension to
realizations conditioned on direct and indirect measurements. Water
Resources Research, 32(6), 1643–1652.
Dietrich, C. R. and Newsam, G. N. (1997). Fast and exact simulation
of stationary Gaussian processes through circulant embedding of the
covariance matrix. SIAM Journal of Scientific Computing, 18(4),
1088–1107.
Diggle, P. J., Ribeiro Jr., P. J., and Christensen, O. F. (2003). An
introduction to model-based Geostatistics. In J. Møller (Ed.), Spatial
Statistics and Computational Methods, Lecture Notes in Statistics; 173
(pp. 43–86). Berlin: Springer-Verlag.
Dobra, A., Hans, C., Jones, B., Nevins, J. R., and West, M. (2003).
Sparse graphical models for exploring gene expression data. Technical
Report 7, Statistical and Applied Mathemathical Sciences Institute,
www.samsi.info.
Dongarra, J. J., Duff, I. S., Sorensen, D. C., and van der Vorst, H. A.
(1998). Numerical Linear Algebra for High-performance Computers.
Software, Environments, and Tools. Philadelphia, PA: Society for
Industrial and Applied Mathematics (SIAM).
Draper, D. (1995). Assessment and propagation of model uncertainty
(with discussion). Journal of the Royal Statistical Society, Series B,
57(1), 45–97.
Dreesman, J. M. and Tutz, G. (2001). Non-stationary conditional models
for spatial data based on varying coefficients. The Statistician, 50(1),
1–15.
Dryden, I. L., Ippoliti, L., and Romagnoli, L. (2002). Adjusted maximum
likelihood and pseudo-likelihood estimation for noisy Gaussian Markov
random fields. Journal of Computational and Graphical Statistics, 11,
370–388.
Dryden, I. L., Scarr, M. R., and Taylor, C. C. (2003). Bayesian texture
segmentation of weed and crop images using reversible jump Markov
chain Monte Carlo methods. Journal of the Royal Statistical Society.
Series C. Applied Statistics, 52(1), 31–50.

Dubes, R. and Jain, A. K. (1989). Random field models in image
analysis. Journal of Applied Statistics, 16(2), 131–164.
Duff, I. S., Erisman, A. M., and Reid, J. K. (1989). Direct Methods
for Sparse Matrices, 2nd edition. Monographs on Numerical Analysis.
New York: The Clarendon Press Oxford University Press. Oxford
Science Publications.
Durbin, J. and Koopman, S. J. (1997). Monte Carlo maximum likelihood
estimation for non-Gaussian state space models. Biometrika, 84(3),
669–684.
Durbin, J. and Koopman, S. J. (2000). Time series analysis of non-
Gaussian observations based on state space models from both classical
and Bayesian perspectives (with discussion). Journal of the Royal
Statistical Society, Series B, 62(1), 3–56.
Fahrmeir, L. (1992). Posterior mode estimation by extended Kalman
filtering for multivariate dynamic generalised linear models. Journal
of the American Statistical Association, 87, 501–509.
Fahrmeir, L. (1994). Dynamic modelling and penalized likelihood
estimation for discrete time survival data. Biometrika, 81(2), 317–
330.
Fahrmeir, L. and Knorr-Held, L. (2000). Dynamic and semiparametric
models. In M. G. Schimek (Ed.), Smoothing and Regression: Ap-
proaches, Computation, and Application (pp. 513–544). New-York:
John Wiley & Sons, Ltd.
Fahrmeir, L. and Lang, S. (2001a). Bayesian inference for generalized
additive mixed models based on Markov random field priors. Journal
of the Royal Statistical Society, Series C, 50(2), 201–220.
Fahrmeir, L. and Lang, S. (2001b). Bayesian inference for generalized
additive mixed models based on Markov random field priors. Journal
of the Royal Statistical Society, Series C, 50(2), 201–220.
Fahrmeir, L. and Lang, S. (2001c). Bayesian semiparametric regression
analysis of multicategorical time-space data. Annals of the Institute
of Statistical Mathematics, 53(1), 11–30.
Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling
Based on Generalized Linear Models, 2nd edition. Berlin: Springer-
Verlag.
Fernández, C. and Green, P. J. (2002). Modelling spatially correlated
data via mixtures: A Bayesian approach. Journal of the Royal
Statistical Society, Series B, 64(4), 805–826.
Ferreira, M. A. R. and De Oliveira, V. (2004). Bayesian analysis for a
class of Gaussian Markov random fields. Technical Report, Statistical
Laboratory, Universidade Federal do Rio de Janeiro, Brazil.

Follestad, T. and Rue, H. (2003). Modelling spatial variation in disease
risk using Gaussian Markov random field proxies for Gaussian random
fields. Statistics Report No. 3, Department of Mathematical Sciences,
Norwegian University of Science and Technology, Trondheim, Norway.
Fotheringham, A. S., Brunsdon, C., and Charlton, M. (2002). Geo-
graphically Weighted Regression: The Analysis of Spatially Varying
Relationships. New York: John Wiley & Sons, Ltd.
Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear
models. Journal of Time Series Analysis, 15(2), 183–202.
Gamerman, D. (1997). Sampling from the posterior distribution in
generalized linear mixed models. Statistics and Computing, 7(1), 57–
68.
Gamerman, D., Moreira, A. R. B., and Rue, H. (2003). Space-varying
regression models: Specifications and simulations. Computational
Statistics and Data Analysis, 42(3), 513–533.
Gamerman, D. and West, M. (1987). An application of dynamic survival
models in unemployment studies. The Statistician, 36, 269–274.
Gelfand, A. E., Sahu, S. K., and Carlin, B. P. (1995). Efficient
parameterisations for normal linear mixed models. Biometrika, 82(3),
479–488.
Gelfand, A. E. and Vounatsou, P. (2003). Proper multivariate condi-
tional autoregressive models for spatial data analysis. Biostatistics,
4(1), 11–25.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004).
Bayesian Data Analysis, 2nd edition. Texts in Statistical Science
Series. Chapman & Hall/CRC, Boca Raton, FL.
Geman, D. and Yang, C. (1995). Nonlinear image recovery with half-
quadratic regularization. IEEE Transactions on Image Processing,
4(7), 923–945.
George, A. and Liu, J. W. H. (1981). Computer solution of large sparse
positive definite systems. Englewood Cliffs, N.J.: Prentice-Hall Inc.
Prentice-Hall Series in Computational Mathematics.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov
Chain Monte Carlo in Practice. London: Chapman & Hall.
Giudici, P. and Green, P. J. (1999). Decomposable graphical Gaussian
model determination. Biometrika, 86(4), 785–801.
Glickman, M. E. and Stern, H. S. (1998). A state-space model for
national football league scores. Journal of the American Statistical
Association, 93(1), 25–35.

Golub, G. H. and van Loan, C. F. (1996). Matrix Computations, 3rd
edition. Johns Hopkins University Press, Baltimore.
Gorsich, D. J., Genton, M. G., and Strang, G. (2002). Eigenstructures of
spatial design matrices. Journal of Multivariate Analysis, 80, 138–165.
Gray, R. M. (2002). Toeplitz and circulant matrices: A review. Free book
available from http://ee.stanford.edu/∼gray, Department of Electrical
Engineering, Stanford University.
Green, P. J. and Silverman, B. (1994). Nonparametric Regression
and Generalized Linear Models: A Roughness Penalty Approach.
Monographs on Statistics and Applied Probability. London: Chapman
& Hall.
Grenander, U. (1993). General Pattern Theory. Oxford: Oxford
University Press.
Grenander, U. and Miller, M. I. (1994). Representations of knowledge
in complex systems (with discussion). Journal of the Royal Statistical
Society, Series B, 56(4), 549–603.
Grenander, U. and Szegö, G. (1984). Toeplitz Forms and Their
Applications, 2nd edition. Chelsea Publ. Co: New York.
Gu, C. (2002). Smoothing Spline ANOVA Models. Springer Series in
Statistics. New York: Springer-Verlag.
Gupta, A. (2002). Recent advances in direct methods for solving
unsymmetric sparse systems of linear equations. ACM Transactions
on Mathematical Software (TOMS), 28(3), 301–324.
Guyon, X. (1982). Parameter estimation for a stationary process on a
d-dimentional lattice. Biometrika, 69(1), 95–105.
Guyon, X. (1995). Random Fields on a Network. Series in Probability
and Its Applications. New York: Springer-Verlag.
Haining, R. (1990). Spatial Data Analysis in the Social and Environ-
mental Sciences. Cambridge: Cambridge University Press.
Harvey, A. C. (1989). Forecasting, Structural Time Series Models and
the Kalman Filter. Cambridge: Cambridge University Press.
Harville, D. A. (1997). Matrix Algebra From a Statistician’s Perspective.
New York: Springer-Verlag.
Hastie, T. and Tibshirani, R. J. (2000). Bayesian backfitting (with
discussion). Statistical Science, 15(3), 196–223.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models,
volume 43 of Monographs on Statistics and Applied Probability.
London: Chapman & Hall.

Heikkinen, J. and Arjas, E. (1998). Non-parametric Bayesian estimation
of a spatial Poisson intensity. Scandinavian Journal of Statistics,
25(3), 435–450.
Held, L., Natario, I., Fenton, S., Rue, H., and Becker, N. (2004). Towards
joint disease mapping. (To appear), Statistical Methods in Medical
Research, xx(xx), xx–xx.
Held, L. and Vollnhals, R. (2005). Dynamic rating of European football
teams. (To appear), IMA Journal of Management Mathematics,
xx(xx), xx–xx.
Higdon, D., Lee, H., and Holloman, C. (2003). Markov chain Monte
Carlo-based approaches for inference in computationally intensive
inverse problems (with discussion). In Bayesian Statistics, 7 (Tenerife,
2002) (pp. 181–197). New York: Oxford Univ. Press.
Hobolth, A., Kent, J. T., and Dryden, I. L. (2002). On the relation
between edge and vertex modelling in shape analysis. Scandinavian
Journal of Statistics, 29(3), 355–374.
Holmes, C. C. and Held, L. (2003). On the simulation of Bayesian binary
and polyhotomous regression models using auxiliary variables. Tech-
nical Report Discussion paper 306, Ludwig-Maximilians-Universität
München, Institut für Statistik.
Hrafnkelsson, B. and Cressie, N. A. C. (2003). Hierarchical modeling of
count data with application to nuclear fall-out. Environmental and
Ecological Statistics, 10, 179–200.
Huerta, G., Sansó, G., and Stroud, J. R. (2004). A spatiotemporal model
for Mexico City ozone levels. Journal of the Royal Statistical Society,
Series C, 53(2), 231–248.
Hunt, B. R. (1973). The application of constrained least squares esti-
mation to image restoration by digital computer. IEEE Transaction
on Computers, C-22(9).
Hurn, M. A., Husby, O. K., and Rue, H. (2003). Advances in Bayesian
image analysis. In P. J. Green, N. L. Hjort, and S. Richardson (Eds.),
Highly Structured Stochastic Systems, Oxford Statistical Science Se-
ries, no 27 (pp. 301–322). Oxford University Press.
Hurn, M. A., Steinsland, I., and Rue, H. (2001). Parameter estimation
for a deformable template model. Statistics and Computing, 11(4),
337–346.
Husby, O. K., Lie, T., Langø, T., Hokland, J., and Rue, H. (2001).
Bayesian 2D deconvolution: A model for diffuse ultrasound scattering.
IEEE Transaction of Ultrasonic Ferroelectric Frequency and Control,
48(1), 121–130.

Husby, O. K. and Rue, H. (2004). Estimating blood vessel areas in
ultrasound images using a deformable template model. Statistical
modelling, 4(3), 211–226.
Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis
and graphics. Journal of Computational and Graphical Statistics, 5(3),
299–314.
Jeffs, B. D., Hong, S., and Christou, J. (1998). A generalized Gauss
Markov model for space objects in blind restoration of adaptive optics
telescope images. In Proceedings of the 1998 International Conference
on Image Processing (ICIP ’98), number 3 (pp. 737–741).: Institute
of Electrical and Electronics Engineers.
Jones, R. H. (1981). Fitting a continuous time autoregression to discrete
data. In Applied Time Series Analysis, II (Tulsa, Okla., 1980) (pp.
651–680). New York: Academic Press.
Jones, R. H. (1993). Longitudinal Data with Serial Correlation: A State-
space Approach, volume 47 of Monographs on Statistics and Applied
Probability. London: Chapman & Hall.
Kammann, E. E. and Wand, M. P. (2003). Geoadditive models. Journal
of the Royal Statistical Society, Series C, 52(1), 1–18.
Karypis, G. and Kumar, V. (1998). METIS. A software
package for partitioning unstructured graphs, partitioning
meshes, and computing fill-reducing orderings of sparse
matrices. Version 4.0. Manual, University of Minnesota,
Department of Computer Science/Army HPC Research Center.
http://www-users.cs.umn.edu/~karypis/metis/index.html.
Kashyap, R. L. and Chellappa, R. (1983). Estimation and choice of
neighbors in spatial-interaction models of images. IEEE Transactions
on Information Theory, IT-29(1), 60–72.
Kelker, D. (1971). Infinite divisibility and variance mixture of the normal
distribution. Annals of Mathematical Statistics, 42(2), 802–808.
Kent, J. T., Dryden, I. L., and Anderson, C. R. (2000). Using circulant
symmetry to model featureless objects. Biometrika, 87(3), 527–544.
Kent, J. T. and Mardia, K. V. (1996). Spectral and circulant approxima-
tions to the likelihood for stationary Gaussian random fields. Journal
of Statistical Planning and Inference, 50(3), 379–394.
Kent, J. T., Mardia, K. V., and Walder, A. N. (1996). Conditional cyclic
Markov random fields. Advances in Applied Probability (SGSA), 28,
1–12.
Kent, J. T. and Mohammadzadeh, M. (1999). Spectral approximation
to the likelihood for an intrinsic Gaussian random field. Journal of
Multivariate Analysis, 70, 136–155.

Kitagawa, G. (1987). Non-Gaussian state-space modeling of nonstation-
ary time series (with discussion). Journal of the American Statistical
Association, 82(400), 1032–1063.
Kitagawa, G. and Gersch, W. (1996). Smoothness Priors Analysis of
Time Series. Lecture Notes in Statistics no. 116. New York: Springer-
Verlag.
Knorr-Held, L. (1999). Conditional prior proposals in dynamic models.
Scandinavian Journal of Statistics, 26(1), 129–144.
Knorr-Held, L. (2000a). Bayesian modelling of inseparable space-time
variation in disease risk. Statistics in Medicine, 19(17-18), 2555–2567.
Knorr-Held, L. (2000b). Dynamic rating of sports teams. The
Statistician, 49(2), 261–276.
Knorr-Held, L. and Besag, J. (1998). Modelling risk from a disease in
time and space. Statistics in Medicine, 17(18), 2045–2060.
Knorr-Held, L. and Best, N. G. (2001). A shared component model for
detecting joint and selective clustering of two diseases. Journal of the
Royal Statistical Society, Series A, 164, 73–85.
Knorr-Held, L. and Rainer, E. (2001). Projections of lung cancer
mortality in West Germany: A case study in Bayesian prediction.
Biostatistics, 2, 109–129.
Knorr-Held, L., Raßer, G., and Becker, N. (2002). Disease mapping of
stage-specific cancer incidence data. Biometrics, 58, 492–501.
Knorr-Held, L. and Richardson, S. (2003). A hierarchical model
for space-time surveillance data on meningococcal disease incidence.
Journal of the Royal Statistical Society. Series C. Applied Statistics,
52(2), 169–183.
Knorr-Held, L. and Rue, H. (2002). On block updating in Markov
random field models for disease mapping. Scandinavian Journal of
Statistics, 29(4), 597–614.
Kohn, R. and Ansley, C. F. (1987). A new algorithm for spline smoothing
based on smoothing a stochastic process. SIAM Journal of Scientific
and Statistical Computing, 8(1), 33–48.
Krogstad, H. E. (1989). Simulation of multivariate Gaussian time series.
Communications in Statistics: Simulation and Computation, 18(3),
929–941.
Künsch, H. R. (1979). Gaussian Markov random fields. Journal of
the Faculty of Science. University of Tokyo. Section IA. Mathematics,
26(1), 53–73.
Künsch, H. R. (1987). Intrinsic autoregressions and related models on
the two-dimensional lattice. Biometrika, 74(3), 517–524.

Künsch, H. R. (1999). Contribution to the discussion of the paper by
Besag and Higdon: Bayesian analysis of agricultural field experiments.
Journal of the Royal Statistical Society, Series B, 61(4), 721–722.
Lakshmanan, S. and Derin, H. (1993). Valid parameter space for 2-D
Gaussian Markov random fields. IEEE Transactions on Information
Theory, 39(2), 703–709.
Lang, S. and Brezger, A. (2004). Bayesian P-splines. Journal of
Computational and Graphical Statistics, 13(1).
Lantuéjoul, C. (2002). Geostatistical Simulation. Models and Algorithms.
Berlin: Springer-Verlag.
Lauritzen, S. L. (1981). Time series analysis in 1880: A discussion of
contributions made by T. N. Thiele. International Statistical Review,
49(3), 319–331.
Lauritzen, S. L. (1996). Graphical Models, volume 17 of Oxford Statistical
Science Series. New York: The Clarendon Press Oxford University
Press. Oxford Science Publications.
Lauritzen, S. L. and Jensen, F. (2001). Stable local computation with
conditional Gaussian distributions. Statistics and Computing, 11(2),
191–203.
Lavine, M. (1999). Another look at conditionally Gaussian Markov
random fields. In Bayesian Statistics, 6 (Alcoceber, 1998) (pp. 371–
387). New York: Oxford University Press.
Lewis, J. G. (1982). Algorithm 582: The Gibbs-Poole-Stockmeyer
and Gibbs-King algorithms for reordering sparse matrices. ACM
Transactions on Mathematical Software, 8(2), 190–194.
Lindgren, F. (1997). Flame reconstruction. In K. Mardia and C. A. Gill
(Eds.), LASR: The Art and Science of Bayesian Image Analysis (pp.
52–59). Dept. of Mathematical Statistics, University of Leeds.
Lindgren, F., Johansson, B., and Holst, J. (1997). Flame reconstruction
in spark ignition engines. In SAE Fall Fuels and Lubricants Meeting
and Exposition. SAE paper 972825.
Lindgren, F. and Rue, H. (2004). Intrinsic Gaussian Markov random
fields on triangulated spheres. Technical Report 2004:25, Centre for
Mathematical Sciences, Lund University.
Liu, J. S., Wong, W. H., and Kong, A. (1994). Covariance structure of
the Gibbs sampler with applications to the comparisons of estimators
and augmentation schemes. Biometrika, 81(1), 27–40.
Manjunath, B. S. and Chellappa, R. (1991). Unsupervised texture
segmentation using Markov random field models. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 13(5), 478–482.

Mardia, K. V. (1988). Multidimensional multivariate Gaussian Markov
random fields with application to image processing. Journal of
Multivariate Analysis, 24(2), 265–284.
Mardia, K. V. (1990). Maximum likelihood estimation for spatial models.
In D. A. Griffith (Ed.), Spatial Statistics: Past, Present and Future
(pp. 203–253). Institute of Mathematical Geography, Ann Arbor,
Michigan.
Marroquin, J. L., Velasco, F. A., Rivera, M., and Nakamura, M.
(2001). Gauss-Markov measure field models for low-level vision. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 23(4),
337–348.
Matheron, G. (1971). The theory of regionalized variables and its
applications. Les Cahiers du Centre de Morphologie Mathematique,
Centre de Geostatistique, Fontainebleau.
Matheron, G. (1973). The intrinsic random functions and their applica-
tions. Advances in Applied Probability, 5, 437–468.
Mollié, A. (1996). Bayesian mapping of disease. In W. R. Gilks, S.
Richardson, and D. J. Spiegelhalter (Eds.), Markov Chain Monte
Carlo in Practice (pp. 359–379). London: Chapman & Hall.
Mondal, D. and Besag, J. (2004). Variogram calculations for first-order
intrinsic autoregressions. Technical Report, Department of Statistics,
University of Washington, Seattle.
Moura, J. M. F. and Balram, N. (1992). Recursive structure of
non-causal Gauss-Markov random fields. IEEE Transactions on
Information Theory, 38, 334–354.
Natario, I. and Knorr-Held, L. (2003). Non-parametric ecological
regression and spatial variation. Biometrical Journal, 45, 670–688.
Pace, R. K. and Barry, R. P. (1997). Fast CARs. Journal of Statistical
Computation and Simulation, 59(2), 123–147.
Papaspiliopoulos, O., Roberts, G. O., and Sköld, M. (2003). Non-
centered parameterizations for hierarchical models and data augmen-
tation (with discussion). In Bayesian Statistics, 7 (pp. 307–326). New
York: Oxford Univ. Press.
Patra, M. and Karttunen, M. (2004). Stencils with isotropic
discretisation error for differential operators. Preprint available as
http://www.lce.hut.fi/research/polymer/downloads/stencil paper.pdf,
Submitted to Numerical Methods for Partial Differential Equations,
xx(xx), xx–xx.
Pettitt, A. N., Weir, I. S., and Hart, A. G. (2002). A conditional
autoregressive Gaussian process for irregularly spaced multivariate
data with application to modelling large sets of binary data. Statistics
and Computing, 12(4), 353–367.
Pitt, L. D. (1971). A Markov property for Gaussian processes with a
multidimensional parameter. Archive for Rational Mechanics and Analysis, 43, 367–391.
Pitt, L. D. and Robeva, R. S. (2003). On the sharp Markov property
for Gaussian random fields and spectral synthesis in spaces of Bessel
potentials. The Annals of Probability, 31(3), 1338–1376.
Pitt, M. K. and Shephard, N. (1999). Analytic convergence rates and
parameterization issues for the Gibbs sampler applied to state space
models. Journal of Time Series Analysis, 20(1), 63–85.
Rellier, G., Descombes, X., Zerubia, J., and Falzon, F. (2002). A
Gauss-Markov model for hyperspectral texture analysis of urban areas.
In Proceedings from the 16th International Conference on Pattern
Recognition (pp. I: 692–695).
Ribeiro Jr., P. J. and Diggle, P. J. (2001). geoR: A package for
geostatistical analysis. R-NEWS, 1(2), 15–18.
Ripley, B. D. and Sutherland, A. I. (1990). Finding spiral structures in
images of galaxies. Philosophical Transactions of the Royal Society of
London A, 332, 477–485.
Robert, C. P. (1995). Simulation of truncated normal variables. Statistics
and Computing, 5, 121–125.
Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods.
New York: Springer-Verlag.
Roberts, G. O., Gelman, A., and Gilks, W. R. (1997). Weak convergence
and optimal scaling of random walk Metropolis algorithms. The
Annals of Applied Probability, 7(1).
Roberts, G. O. and Sahu, S. K. (1997). Updating schemes, correlation
structure, blocking and parameterization for the Gibbs sampler.
Journal of the Royal Statistical Society, Series B, 59(2), 291–317.
Rosanov, Y. A. (1967). On Gaussian fields with given conditional
distributions. Theory of Probability and its Applications, XII(3), 381–
391.
Rue, H. (2001). Fast sampling of Gaussian Markov random fields.
Journal of the Royal Statistical Society, Series B, 63(2), 325–338.
Rue, H. and Follestad, T. (2002). GMRFLib: A C-library for fast and
exact simulation of Gaussian Markov random fields. Statistics Report
No. 1, Department of Mathematical Sciences, Norwegian University
of Science and Technology, Trondheim, Norway.

Rue, H. and Follestad, T. (2003). Gaussian Markov random field models
with applications in spatial statistics. Statistics Report No. 6, Depart-
ment of Mathematical Sciences, Norwegian University of Science and
Technology, Trondheim, Norway.
Rue, H. and Hurn, M. A. (1999). Bayesian object identification.
Biometrika, 86(3), 649–660.
Rue, H. and Husby, O. K. (1998). Identification of partly destroyed
objects using deformable templates. Statistics and Computing, 8(3),
221–228.
Rue, H. and Salvesen, Ø. (2000). Prediction and retrospective analysis
of soccer matches in a league. The Statistician, 49(3), 399–418.
Rue, H., Steinsland, I., and Erland, S. (2004). Approximating hidden
Gaussian Markov random fields. Journal of the Royal Statistical
Society, Series B, 66(4), 877–892.
Rue, H. and Tjelmeland, H. (2002). Fitting Gaussian Markov random
fields to Gaussian fields. Scandinavian Journal of Statistics, 29(1),
31–50.
Schmid, V. and Held, L. (2004). Bayesian extrapolation of space-time
trends in cancer registry data. Biometrics, 60(4), 1034–1042.
Searle, S. R. (1982). Matrix Algebra Useful for Statistics. Wiley Series
in Probability and Mathematical Statistics: Applied Probability and
Statistics. Chichester: John Wiley & Sons, Ltd.
Shephard, N. (1994). Partial non-Gaussian state space. Biometrika,
81(1), 115–131.
Shephard, N. and Pitt, M. K. (1997). Likelihood analysis of non-
Gaussian measurement time series. Biometrika, 84(3), 653–667.
Shepp, L. A. (1966). Radon-Nikodym derivatives of Gaussian measures.
The Annals of Mathematical Statistics, 37(2), 321–354.
Speed, T. P. and Kiiveri, H. T. (1986). Gaussian Markov distributions
over finite graphs. The Annals of Statistics, 14(1), 138–150.
Steinsland, I. (2003). Parallel sampling of GMRFs and geostatistical
GMRF models. Technical Report 7, Department of Mathematical
Sciences, Norwegian University of Science and Technology, Trondheim,
Norway.
Steinsland, I. and Rue, H. (2003). Overlapping block proposals for latent
Gaussian Markov random fields. Statistics Report No. 8, Department
of Mathematical Sciences, Norwegian University of Science and Tech-
nology, Trondheim, Norway.

Sun, D., Tsutakawa, R. K., and Speckman, P. L. (1999). Posterior distri-
bution of hierarchical models using CAR(1) distributions. Biometrika,
86(2), 341–350.
Tierney, L. (1994). Markov chains for exploring posterior distributions
(with discussion). The Annals of Statistics, 22(4), 1701–1762.
Tierney, L., Kass, R. E., and Kadane, J. B. (1989). Fully exponential
Laplace approximations to expectations and variances of nonpositive
functions. Journal of the American Statistical Association, 84(407),
710–716.
Toledo, S., Chen, D., and Rotkin, V. (2002). TAUCS. A library of sparse
linear solvers. Version 2.0. Manual, School of Computer Science, Tel-
Aviv University. http://www.tau.ac.il/~stoledo/taucs/.
Wahba, G. (1978). Improper priors, spline smoothing and the problem
of guarding against model errors in regression. Journal of the Royal
Statistical Society, Series B, 40(3), 364–372.
Wahba, G. (1990). Spline Models for Observational Data, volume 59
of CBMS-NSF Regional Conference Series in Applied Mathematics.
Philadelphia, PA: Society for Industrial and Applied Mathematics
(SIAM).
Wecker, W. E. and Ansley, C. F. (1983). The signal extraction approach
to nonlinear regression and spline smoothing. Journal of the American
Statistical Association, 78(381), 81–89.
Weir, I. S. and Pettitt, A. N. (1999). Spatial modelling for binary
data using a hidden conditional autoregressive Gaussian process: A
multivariate extension of the probit model. Statistics and Computing,
9(4), 77–86.
Weir, I. S. and Pettitt, A. N. (2000). Binary probability maps
using a hidden conditional autoregressive Gaussian process with an
application to Finnish common toad data. Journal of the Royal
Statistical Society, Series C, 49(4), 473–484.
Werner, L. (2004). Spatial inference for non-lattice data using Markov
random fields. Licentiate thesis, Centre for Mathematical Sciences,
Mathematical Statistics, Lund University.
West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic
Models, 2nd edition. Springer Series in Statistics. New York: Springer-
Verlag.
Whittaker, J. (1990). Graphical Models in Applied Multivariate Statis-
tics. Wiley Series in Probability and Mathematical Statistics. Chich-
ester: John Wiley & Sons, Ltd.
Whittle, P. (1954). On stationary processes in the plane. Biometrika,
41(3/4), 434–449.

Wikle, C. K., Berliner, L. M., and Cressie, N. A. C. (1998). Hierarchical
Bayesian space-time models. Environmental and Ecological Statistics,
5(2), 117–154.
Wilkinson, D. J. (2003). Discussion to “Non-centered parameterizations
for hierarchical models and data augmentation” by O. Papaspiliopou-
los, G. O. Roberts and M. Sköld. In Bayesian Statistics, 7 (pp. 323–
324). New York: Oxford Univ. Press.
Wilkinson, D. J. (2004). Parallel Bayesian computation. In E. J.
Kontoghiorghes (Ed.), Handbook of Parallel Computing and Statistics,
Statistics: Textbooks and Monographs (pp. xx–xx). New York: Marcel
Dekker. To appear.
Wilkinson, D. J. and Yeung, S. K. H. (2002). Conditional simulation from
highly structured Gaussian systems, with application to blocking-
MCMC for the Bayesian analysis of very large linear models. Statistics
and Computing, 12(3), 287–300.
Wilkinson, D. J. and Yeung, S. K. H. (2004). A sparse matrix approach to
Bayesian computation in large linear models. Computational Statistics
and Data Analysis, 44, 493–516.
Wong, E. (1969). Homogeneous Gauss-Markov random fields. The
Annals of Mathematical Statistics, 40, 1625–1634.
Wood, A. T. A. (1995). When is a truncated covariance function on
the line a covariance function on the circle? Statistics and Probability
Letters, 24(2), 157–163.
Wood, A. T. A. and Chan, G. (1994). Simulation of stationary Gaussian
processes in [0, 1]^d. Journal of Computational and Graphical Statistics,
3(4), 409–432.
Woods, J. W. (1972). Two-dimensional discrete Markovian fields. IEEE
Transactions on Information Theory, 18(3), 232–240.
