An Intuitive Bayesian Spatial Model For Disease Mapping That Accounts For Scaling
Abstract
In recent years, disease mapping studies have become a routine application within geographical
epidemiology and are typically analysed within a Bayesian hierarchical model formulation. A variety of
model formulations for the latent level have been proposed but all come with inherent issues. In the
classical BYM (Besag, York and Mollié) model, the spatially structured component cannot be seen
independently from the unstructured component. This makes prior definitions for the hyperparameters
of the two random effects challenging. There are alternative model formulations that address this
confounding; however, the issue of how to choose interpretable hyperpriors is still unsolved. Here,
we discuss a recently proposed parameterisation of the BYM model that leads to improved parameter
control as the hyperparameters can be seen independently from each other. Furthermore, the need for a
scaled spatial component is addressed, which facilitates assignment of interpretable hyperpriors and makes
these transferable between spatial applications with different graph structures. The hyperparameters
themselves are used to define flexible extensions of simple base models. Consequently, penalised
complexity priors for these parameters can be derived based on the information-theoretic distance
from the flexible model to the base model, giving priors with clear interpretation. We provide
implementation details for the new model formulation which preserve sparsity properties, and we
investigate systematically the model performance and compare it to existing parameterisations.
Through a simulation study, we show that the new model performs well, both showing good learning
abilities and good shrinkage behaviour. In terms of model choice criteria, the proposed model performs at
least as well as existing parameterisations, but only the new formulation offers parameters that are
interpretable and hyperpriors that have a clear meaning.
Keywords
Disease mapping, Bayesian hierarchical model, integrated nested Laplace approximations, Leroux model,
penalised complexity prior, scaling
1 Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway
2 Department of Mathematics and Statistics, UiT The Arctic University of Norway, Tromsø, Norway
3 Department of Mathematical Sciences, University of Bath, Bath, UK
Corresponding author:
Andrea Riebler, Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway.
Email: [email protected]
1146 Statistical Methods in Medical Research 25(4)
1 Introduction
In recent years, there has been much interest in spatial modelling and mapping of disease or
mortality rates. Due to inherent sampling variability, it is not recommended to inspect crude rates
directly; instead, one borrows strength over neighbouring regions to obtain more reliable region-specific
estimates. The state of the art is to use Bayesian hierarchical models, where the risk surface is
modelled using a set of spatial random effects, in addition to potentially available covariate
information.1 The random effects are intended to capture unobserved heterogeneity or spatial correlation
that cannot be explained by the available covariates.2 However, there has been much confusion
on the design of the spatial random effects. Besag et al.3 proposed an intrinsic autoregressive model,
often referred to as the CAR prior or Besag model, where the spatial effect of a particular region
depends on the effects of all neighbouring regions. They also proposed the commonly known BYM
model, where an additional unstructured spatial random effect is included to account for
independent region-specific noise. Consequently, various model modifications and alternative
approaches have been proposed, see for example Leroux et al.,4 Stern and Cressie5 and Dean
et al.6 The appropriateness and behaviour of different latent spatial models have been compared
in a full Bayesian context,2,7–9 an empirical Bayes setting10 or using penalised quasi likelihood.4 It is
accepted that the Besag model may lead to misleading results in the case where there is no spatial
correlation in the data.4,8 For this purpose, most alternative models propose to account for
unstructured variability in space.
By including both structured and unstructured components in the model, potential confounding
must be addressed,11 as otherwise it is not clear how to split the variability over the effects. This
problem has motivated development of reparameterised models, in which the precision parameters
of the two components are replaced by a common precision parameter and a mixing parameter,
which distributes the variability between the structured and unstructured components.4,6
However, existing approaches have two issues. First, the spatially structured component is not
scaled. This implies that the precision parameter does not represent the marginal precision but is
confounded with the mixing parameter. Thus, the effect of any prior assigned to the precision
parameter depends on the graph structure of the application. This has the additional effect that a
given prior is not transferable between different applications if the underlying graph changes.12
Second, the choice of hyperpriors for the random effects is not straightforward. Approaches to
design epidemiologically sensible hyperpriors include, among others, Wakefield8 and Bernardinelli
et al.11
Recently, Simpson et al.13 proposed a new BYM parameterisation that accounts for scaling and
provides an intuitive way to define priors by taking the model structure into account. This new
model provides a new way to look at the BYM model. The primary goal is not to optimise model
choice criteria, such as deviance information criterion (DIC) values, but to provide a sensible model
formulation where all parameters have a clear meaning. The model structure is similar to the Dean
model,6 with the crucial modification that the precision parameter is mapped to the marginal
standard deviation. This makes the parameters of the model interpretable and facilitates
assignment of meaningful hyperpriors. The framework of penalised complexity (PC) priors is
applied to formulate prior distributions for the hyperparameters. The spatial model is thereby
seen as a flexible extension of two simpler base models that it will shrink towards, if not
indicated otherwise by the data.13 The upper level base model assumes a constant risk surface,
while the lower level model assumes a varying risk surface over space without spatial
autocorrelation. In this paper, we investigate systematically the model behaviour under different
simulation scenarios. Furthermore, we compare this new model formulation to commonly used
model formulations in disease mapping. We point out differences, focusing on parameter
Riebler et al. 1147
interpretability and prior distribution assignment. For completeness we also compare commonly
used model choice criteria.
We have chosen to implement all models using integrated nested Laplace approximations
(INLA),14 available to the applied user via the R-package INLA (see www.r-inla.org). INLA is
straightforward to use for full Bayesian inference in disease mapping and avoids any Markov chain
Monte Carlo techniques.15,16
The plan for the paper is as follows. Section 2 gives an introduction to disease mapping and
motivates the use of Bayesian hierarchical models. Commonly used spatial models for the latent
level are reviewed in Section 3, before the need for scaling and, consequently, the new modified spatial
model are presented. Section 4 uses the PC-prior framework to design sensible hyperpriors for the
precision and mixing parameters. In Section 5, we investigate the properties and behaviour of the new
model in a simulation study, including comparisons to commonly used models. This section also
contains an application to insulin-dependent diabetes mellitus (IDDM) data in Sardinia. Discussion
and concluding remarks are given in Section 6.
2 Disease mapping
Disease mapping has a long history in epidemiology and one major goal is to investigate the
geographical distribution of disease burden.8 In the following, we assume that our region of
interest is divided into $n$ non-overlapping areas. Let $E_i$ denote the number of persons at risk in
area $i$ ($i = 1, \ldots, n$). These expected numbers are commonly calculated based on the size and
demographic characteristics of the population living in area $i$.2 Further, let $y_i$ denote the number
of cases or deaths in region $i$. When the disease is non-contagious and rare, it is usually reasonable to
assume that
$$
y_i \mid \theta_i \sim \mathrm{Poisson}(E_i \theta_i)
$$
Here, $\mu$ denotes the overall risk level, $\mathbf{z}_i^\top = (z_{i1}, \ldots, z_{ip})$ a set of $p$ covariates with corresponding
regression parameters $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^\top$ and $b_i$ a random effect. For $\mu$ and $\boldsymbol{\beta}$ we assign weakly
informative Gaussian distributions with mean zero and large variance. The random effects
$\mathbf{b} = (b_1, \ldots, b_n)^\top$ are used to account for extra-Poisson variation or spatial correlation due to latent
or unmeasured risk factors. In the following, we will review four models that are commonly used for $\mathbf{b}$.
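The quantities named above (an overall risk level, covariates with regression parameters, and an area-specific random effect) are the ingredients of a log-linear predictor for the relative risk. As a sketch only, assuming the standard disease-mapping form (the display is a reconstruction under that assumption, with $\theta_i$ taken to be the relative risk in area $i$, and is not quoted from the text):

```latex
\eta_i = \log(\theta_i) = \mu + \mathbf{z}_i^\top \boldsymbol{\beta} + b_i,
\qquad i = 1, \ldots, n .
```

Under this form, $\exp(b_i)$ acts as a multiplicative area-specific adjustment of the relative risk, which is why the later sections speak of the log relative risk scale.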
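where $\tau_b$ is a precision parameter and $\mathbf{b}_{-i} = (b_1, \ldots, b_{i-1}, b_{i+1}, \ldots, b_n)^\top$. The mean of $b_i$ equals the
mean of the effects over all neighbours, and the precision is proportional to the number of
neighbours. The joint distribution for $\mathbf{b}$ is
$$
\pi(\mathbf{b} \mid \tau_b) \propto \exp\left(-\frac{\tau_b}{2} \sum_{i \sim j} (b_i - b_j)^2\right) \propto \exp\left(-\frac{\tau_b}{2}\, \mathbf{b}^\top Q \mathbf{b}\right)
$$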
The model is intrinsic in the sense that Q is singular, i.e. it has a non-empty null-space V, see Rue
and Held (Section 3) for details.19 For any map, 1, denoting a vector of length n with only ones, is an
eigenvector of Q with eigenvalue 0. Hence the density is invariant to the addition of a constant. If
the spatial map has islands the definition of the null-space is more complex and depends on how
islands are defined, i.e. whether they form a new unconnected graph or are linked to the main land,
see Section 5.2.1 in Hodges.20 The rank deficiency of the Besag model is equal to the number of
connected subgraphs. The rank deficiency is equal to 1 if all regions are connected. Hence, the
density for the Besag model is
$$
\pi(\mathbf{b} \mid \tau_b) = K \tau_b^{(n-I)/2} \exp\left(-\frac{\tau_b}{2}\, \mathbf{b}^\top Q \mathbf{b}\right)
$$
where I denotes the number of connected subgraphs and K is a constant.20 To prevent confounding
with the intercept, sum-to-zero constraints are imposed on each connected subgraph.
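The properties of the Besag structure matrix stated in this section can be checked on a toy map. The following is a minimal sketch in plain Python; the five-region map with one island, the effect values and all function names are hypothetical and chosen only for illustration. It builds $Q$ from an edge list, confirms that $\mathbf{b}^\top Q \mathbf{b}$ equals the sum of squared differences over neighbouring pairs, and counts the connected subgraphs $I$ that determine the rank deficiency.

```python
def besag_structure(n, edges):
    """Structure matrix Q: Q[i][i] = number of neighbours of i, Q[i][j] = -1 if i ~ j."""
    q = [[0] * n for _ in range(n)]
    for i, j in edges:
        q[i][i] += 1
        q[j][j] += 1
        q[i][j] -= 1
        q[j][i] -= 1
    return q


def quadratic_form(q, b):
    """Compute b' Q b by direct summation."""
    n = len(b)
    return sum(b[i] * q[i][j] * b[j] for i in range(n) for j in range(n))


def pairwise_form(edges, b):
    """Sum of squared differences over neighbouring pairs, each pair counted once."""
    return sum((b[i] - b[j]) ** 2 for i, j in edges)


def n_components(n, edges):
    """Number of connected subgraphs, found with a simple union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(x) for x in range(n)})


# Hypothetical map: regions 0-1-2 form the mainland, regions 3-4 an island.
edges = [(0, 1), (1, 2), (3, 4)]
q = besag_structure(5, edges)
b = [0.5, -1.0, 2.0, 0.25, 1.5]
shifted = [x + 3.0 for x in b]  # add the same constant to every region

row_sums = [sum(row) for row in q]    # all zero: the vector of ones lies in the null space
num_subgraphs = n_components(5, edges)
```

Because every row of `q` sums to zero, adding a constant to all effects leaves the quadratic form unchanged, which is the invariance that the sum-to-zero constraints remove.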
where $\phi \in [0, 1]$ denotes a mixing parameter. The model reduces to pure overdispersion if $\phi = 0$, and
to the Besag model when $\phi = 1$.9,10 The conditional expectation of $b_i$, given all other random effects,
results as a weighted average of the zero-mean unstructured model and the mean value of the Besag
model (1). The conditional variance is the weighted average of $1/\tau_b$ and $1/(\tau_b n_i)$, where $n_i$ is the
number of neighbours of region $i$.
Equation (4) is a reparameterisation of the original BYM model, where $\tau_u^{-1} = \tau_b^{-1} \phi$ and
$\tau_v^{-1} = \tau_b^{-1} (1 - \phi)$.9 The additive decomposition of the variance is then on the log relative risk
scale. This is in contrast to the Leroux model (3), where the precision matrix of $\mathbf{b}$ resulted as a
weighted average of the precision matrices of the unstructured and structured spatial components.
As a consequence, the additive decomposition of variance in the Leroux model happens on the log
relative risk scale, conditional on the effects $b_j$ of the neighbours $j$ of region $i$.4
The generalised variance for two different graph structures is typically not equal, even when the
graphs have the same number of regions.
As an example, Figure 1 shows two administrative regions of Germany, Mittelfranken and
Arnsberg, that are both fully connected. Assume that we fit models to these two regions
separately, including a spatially structured effect in the models. As both regions have 12 districts,
we may be tempted to use identical priors for the precision $\tau_b$ in these two cases. However, the prior
will penalise the local deviation from a constant level differently. Specifically, if $\tau_b = 1$ in equation
(6), it follows that $\sigma^2_{\mathrm{GV}}(\mathbf{b}) \approx 0.30$ for Mittelfranken and $\sigma^2_{\mathrm{GV}}(\mathbf{b}) \approx 0.40$ for Arnsberg. If we fit a
similar model to all the 544 districts in Germany, we obtain $\sigma^2_{\mathrm{GV}}(\mathbf{b}) \approx 0.56$. These differences,
reflecting a characteristic level for the marginal variances, might be even more striking for other
applications. In order to unify the interpretation of a chosen prior for $\tau_b$ and make it transferable
between applications, the structured effect needs to be scaled so that $\sigma^2_{\mathrm{GV}}(\mathbf{b}) = 1/\tau_b$. This implies that
$\tau_b$ represents the precision of the (marginal) deviation from a constant level, independently of the
underlying graph.
This issue of scaling applies to all intrinsic GMRFs.12 In practice, the scaling might be costly, as it
involves inversion of a singular matrix. The generalised inverse can be used, but the relative
tolerance threshold to detect zero singular values will be crucial. Knowing the rank deficiency r
of the matrix, the r smallest eigenvalues could be removed. Intrinsic GMRFs have much in common
with GMRFs conditional on linear constraints, see Rue and Held19 (Section 3.2). Hence, if we know
the null space $V$ of $Q$, we can use sparse matrix algebra to compute $[Q^{-}]_{ii}$. The marginal variances
can be computed based on $\mathbf{b} \mid V^\top \mathbf{b} = 0$. First, $Q$ is made regular by adding a small term $\epsilon$ to the
diagonal and then $\mathrm{diag}\{\mathrm{Var}(\mathbf{b} \mid V^\top \mathbf{b} = 0)\}$ is computed according to Rue and Held19 (Section 3.2,
equation (3.17)). In this way, we take advantage of the sparse matrix structure of $Q$. Extracting the
variance components, we compute $\sigma^2_{\mathrm{GV}}$ and use this as a factor to scale $Q$. The R-package INLA,
available from www.r-inla.org, offers a function inla.scale.model which takes as argument any
singular precision matrix Q and the linear constraints spanning the null-space of Q. The scaled
precision matrix, where the geometric mean of the marginal variances is equal to one, is returned.
For the Besag model on a connected graph we have $V = \mathbf{1}$, so that we can scale the matrix in R using

    # Q: singular Besag precision matrix; n: number of regions
    Q = inla.scale.model(Q,
        constr = list(A = matrix(1, nrow = 1, ncol = n), e = 0))

Figure 1. Map of Germany divided into 544 districts. Marked in grey are the administrative districts Arnsberg in the
federal state Nordrhein-Westfalen and Mittelfranken in Bayern. Both areas consist of 12 districts.

We would like to note that scaling is also crucial if the goal is hypothesis testing involving an
intrinsic GMRF, as for example in Dean et al.,6 as otherwise the critical region would be naturally
dependent on the scale of the random effect. This also applies outside a Bayesian setting.
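To make the effect of scaling concrete, the sketch below computes the geometric mean of the marginal variances for two toy graphs with the same number of regions and then rescales one of them. It is not the sparse, regularised route described in this section and it does not call INLA: for a connected graph it uses the dense identity $Q^{-} = (Q + \mathbf{1}\mathbf{1}^\top/n)^{-1} - \mathbf{1}\mathbf{1}^\top/n$ as a shortcut, with $\tau_b = 1$. The two four-region graphs and all function names are hypothetical.

```python
import math


def laplacian(n, edges):
    """Besag structure matrix for an undirected graph given as an edge list."""
    q = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        q[i][i] += 1.0
        q[j][j] += 1.0
        q[i][j] -= 1.0
        q[j][i] -= 1.0
    return q


def invert(m):
    """Gauss-Jordan inverse with partial pivoting (dense; small matrices only)."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]


def marginal_variances(q):
    """Diagonal of the generalised inverse of Q; valid for a connected graph only."""
    n = len(q)
    shifted = [[q[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    inv = invert(shifted)
    return [inv[i][i] - 1.0 / n for i in range(n)]


def generalised_variance(q):
    """Geometric mean of the marginal variances."""
    v = marginal_variances(q)
    return math.exp(sum(math.log(x) for x in v) / len(v))


def scale(q):
    """Multiply Q by the generalised variance so that the scaled value equals one."""
    gv = generalised_variance(q)
    return [[gv * x for x in row] for row in q]


path = laplacian(4, [(0, 1), (1, 2), (2, 3)])            # regions in a row
cycle = laplacian(4, [(0, 1), (1, 2), (2, 3), (3, 0)])   # regions in a ring

gv_path = generalised_variance(path)
gv_cycle = generalised_variance(cycle)
gv_scaled = generalised_variance(scale(path))
```

The two graphs have the same number of regions but different generalised variances, so the same prior on the precision would imply different amounts of smoothing; after scaling, the value is one regardless of the graph.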
the precision matrix of the Besag model, scaled according to Section 3.2. This gives a modified
version of the random effect
$$
\mathbf{b} = \frac{1}{\sqrt{\tau_b}} \left( \sqrt{1 - \phi}\, \mathbf{v} + \sqrt{\phi}\, \mathbf{u}_\star \right) \qquad (7)
$$
Breslow et al.21 criticised the BYM model for obscuring the conditional mean and conditional
variance. Using the new parameterisation, $\mathrm{Var}(\mathbf{b} \mid \tau_b, \phi)$ has a clear and intuitive interpretation,
and interpretations in terms of the conditional distribution of $b_i$ given the effects $b_j$ of its neighbours are avoided.
Similarly to the Leroux and the Dean model, equation (7) emphasises the compromise between
pure overdispersion and spatially structured correlation, where $0 \le \phi \le 1$ measures the proportion
of the marginal variance explained by the structured effect. The model reduces to pure
overdispersion for $\phi = 0$ and to the Besag model, i.e. only spatially structured risk, when $\phi = 1$.
More importantly, using the standardised $Q_\star$ the marginal variances will be approximately equal to
$(1 - \phi)/\tau_b + \phi/\tau_b$. The approximation holds since $Q_\star^{-}$ by construction does not give the same marginal
variances for all regions, so that the standardisation only holds on the overall level of the generalised
variance $\sigma^2_{\mathrm{GV}}$. With the scaled structured effect $\mathbf{u}_\star$, the random effect $\mathbf{b}$ is also scaled and the prior
imposed on $\tau_b$ has the same interpretation, being transferable between graphs. Furthermore, the
hyperparameters $\tau_b$ and $\phi$ are now interpretable and no longer confounded. It should be noted
that in contrast to the Dean model, the Leroux model cannot be scaled since by construction the
scaling would depend on the value of $\phi$.
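The approximate marginal variance quoted above follows directly from equation (7). As a short worked derivation, assuming that the unstructured effect $\mathbf{v}$ has identity covariance, that the scaled structured effect $\mathbf{u}_\star$ has covariance $Q_\star^{-}$, and that the two are independent (these distributional assumptions are not spelled out in this excerpt):

```latex
\begin{aligned}
\mathrm{Var}(\mathbf{b} \mid \tau_b, \phi)
  &= \frac{1}{\tau_b}\Bigl( (1-\phi)\,\mathrm{Var}(\mathbf{v}) + \phi\,\mathrm{Var}(\mathbf{u}_\star) \Bigr)
   = \frac{1}{\tau_b}\Bigl( (1-\phi)\, I + \phi\, Q_\star^{-} \Bigr), \\
\mathrm{Var}(b_i \mid \tau_b, \phi)
  &= \frac{1-\phi}{\tau_b} + \frac{\phi}{\tau_b}\,[Q_\star^{-}]_{ii}
   \;\approx\; \frac{1-\phi}{\tau_b} + \frac{\phi}{\tau_b} = \frac{1}{\tau_b}.
\end{aligned}
```

The last step replaces each $[Q_\star^{-}]_{ii}$ by one, which is exact only for their geometric mean; this is why $1/\tau_b$ is a typical rather than an exact marginal variance.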
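As $\pi(\mathbf{w}) = \pi(\mathbf{w}_1 \mid \mathbf{w}_2)\, \pi(\mathbf{w}_2)$, it follows that $\mathbf{w}$ is normally distributed with mean $\mathbf{0}$ and precision matrix
$$
\begin{pmatrix}
\dfrac{\tau_b}{1-\phi}\, I & -\dfrac{\sqrt{\phi \tau_b}}{1-\phi}\, I \\[2ex]
-\dfrac{\sqrt{\phi \tau_b}}{1-\phi}\, I & Q_\star + \dfrac{\phi}{1-\phi}\, I
\end{pmatrix}
$$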
The marginal of w1 then has the correct distribution. Working with this parameterisation, sparsity is
warranted and the structured component is given directly by w2 , i.e. the second half of the vector w.
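The block structure of this precision matrix can be verified by expanding the factorisation. The sketch below assumes that $\mathbf{w}_1 \mid \mathbf{w}_2 \sim N\bigl(\sqrt{\phi/\tau_b}\, \mathbf{w}_2,\ \tfrac{1-\phi}{\tau_b} I\bigr)$ and $\mathbf{w}_2 \sim N(\mathbf{0}, Q_\star^{-})$; these two distributions are an assumption consistent with equation (7), since their definition is not part of this excerpt.

```latex
\begin{aligned}
-2 \log \pi(\mathbf{w})
 &= \frac{\tau_b}{1-\phi}\, \Bigl\lVert \mathbf{w}_1 - \sqrt{\phi/\tau_b}\; \mathbf{w}_2 \Bigr\rVert^2
    + \mathbf{w}_2^\top Q_\star \mathbf{w}_2 + \text{const} \\
 &= \frac{\tau_b}{1-\phi}\, \mathbf{w}_1^\top \mathbf{w}_1
    - \frac{2\sqrt{\phi \tau_b}}{1-\phi}\, \mathbf{w}_1^\top \mathbf{w}_2
    + \mathbf{w}_2^\top \Bigl( Q_\star + \frac{\phi}{1-\phi}\, I \Bigr) \mathbf{w}_2 + \text{const}.
\end{aligned}
```

Reading off the coefficients gives diagonal blocks $\tfrac{\tau_b}{1-\phi} I$ and $Q_\star + \tfrac{\phi}{1-\phi} I$ and off-diagonal blocks $-\tfrac{\sqrt{\phi\tau_b}}{1-\phi} I$. Every block is either diagonal or as sparse as $Q_\star$, which is the sparsity property the text refers to.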
KLD measures the information lost using the base model to approximate the more flexible model.
To facilitate interpretation, this divergence is transformed to a unidirectional measure of distance
defined by
$$
d(\tau_b) = \sqrt{2\, \mathrm{KLD}(\tau_b)}
$$
In our case, $\Sigma_{\mathrm{flex}}(\tau_b) = \mathrm{Var}(\mathbf{b} \mid \tau_b, \phi)$ as defined in equation (8), where $\phi$ is arbitrary and treated as
fixed, while the covariance matrix of the base model is $\Sigma_{\mathrm{base}} = 0$, reflecting infinite precision. Details
regarding the computation of equation (9) are presented in Simpson et al.13 and for completeness
provided in Appendix 2. On the distance scale the meaning of a prior is much clearer. The prior
should have mode at zero, i.e. at the base model, and decreasing density with increasing distance.
There are different priors that fulfil these properties, but differ in the way they decrease with
increasing distance. Since we are not in a variable selection setting, we follow the
recommendation of Simpson et al.13 and use the exponential distribution, which supports

Figure 2. Left panel: PC prior for $\tau_b$ (solid) with parameters $U = 1$ and $\alpha = 0.01$, and two gamma priors where for
both the shape parameter is equal to 1 and the rate parameter is either 0.01 (dashed, light grey) or 0.02 (dashed, dark
grey). Right panel: PC prior for the mixing parameter $\phi$ with $U = 0.5$ and $\alpha = 2/3$ (solid), and a uniform prior on
(0,1) (dashed). Both panels show the density, against $\tau$ on the left and against $\phi$ on the right.

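The left panel of Figure 2 can be reproduced numerically under one assumption that is not derived in this excerpt: that the PC prior for a precision takes the form given by Simpson et al., namely an exponential prior with rate $\lambda$ on the standard deviation $1/\sqrt{\tau_b}$, calibrated through $\mathrm{Prob}(1/\sqrt{\tau_b} > U) = \alpha$. The following sketch in plain Python uses hypothetical function names.

```python
import math


def pc_rate(u, alpha):
    """Rate lambda such that Prob(1/sqrt(tau) > u) = alpha (assumed calibration)."""
    return -math.log(alpha) / u


def pc_prec_density(tau, lam):
    """Density on the precision scale implied by an exponential prior on 1/sqrt(tau)."""
    return 0.5 * lam * tau ** -1.5 * math.exp(-lam / math.sqrt(tau))


def pc_prec_cdf(tau, lam):
    """Prob(precision <= tau) = Prob(standard deviation >= 1/sqrt(tau))."""
    return math.exp(-lam / math.sqrt(tau))


# Values from the caption of Figure 2 (left panel): U = 1, alpha = 0.01.
lam = pc_rate(u=1.0, alpha=0.01)
prob_sd_above_one = pc_prec_cdf(1.0, lam)

# Tabulate the density on a grid of precisions, e.g. for plotting.
grid = [float(t) for t in range(1, 301)]
density = [pc_prec_density(t, lam) for t in grid]
```

With these values only 1% of the prior mass corresponds to a marginal standard deviation above one, which is the kind of direct probability statement that makes the prior interpretable once the structured effect is scaled.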
By increasing the value of $\phi$, spatial dependency is gradually blended into the model and the model
component explaining most of the variance shifts from $\mathbf{v}$ to $\mathbf{u}_\star$.13
To derive the PC prior for $\phi$, we first compute the KLD of the base model from the flexible model,
see Appendix A2.2. The covariance matrices for the base and flexible models are $\Sigma_{\mathrm{base}} = I$ and
$\Sigma_{\mathrm{flex}}(\phi) = (1 - \phi) I + \phi Q_\star^{-}$, respectively. Analogously as for $\tau_b$, we assign an exponential prior
with parameter $\lambda$ to the distance scale $d(\phi) = \sqrt{2\, \mathrm{KLD}(\phi)}$. For simplicity, this prior is
transformed to give a prior on $\mathrm{logit}(\phi)$, as $\phi$ is defined on a bounded interval. Reasonable values
of $\lambda$ can be inferred in a similar way as in Section 4.1, using a probability statement
$\mathrm{Prob}(\phi < U) = \alpha$. In this way it is straightforward to specify either strong or weak prior
information. The parameter $\phi$ has a more direct interpretation than the precision parameter $\tau_b$,
representing a fraction of the total variance which can be attributed to spatial dependency structure.
A reasonable formulation might be $\mathrm{Prob}(\phi < 0.5) = 2/3$, which gives more density mass to values of
$\phi$ smaller than 0.5. This can be seen as a conservative choice, assuming that the unstructured random
effect accounts for more of the variability than the spatially structured effect.
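The distance on which the prior for the mixing parameter is placed can be evaluated from the eigenvalues of the scaled covariance matrix of the structured effect. The sketch below uses the standard Kullback-Leibler divergence between two zero-mean Gaussians, with an identity base covariance and a flexible covariance of the form $(1-\phi) I + \phi Q_\star^{-}$; the eigenvalues in the example are hypothetical and the function names are not taken from any package.

```python
import math


def kld_phi(phi, gammas):
    """KLD of N(0, I) from N(0, (1 - phi) I + phi * Qinv), given the eigenvalues of Qinv."""
    if not 0.0 <= phi < 1.0:
        raise ValueError("phi must lie in [0, 1)")
    total = 0.0
    for g in gammas:
        x = phi * (g - 1.0)            # eigenvalue of the flexible covariance minus one
        total += x - math.log(1.0 + x)
    return 0.5 * total


def distance_phi(phi, gammas):
    """Distance d(phi) = sqrt(2 KLD(phi)); max() guards against rounding below zero."""
    return math.sqrt(max(0.0, 2.0 * kld_phi(phi, gammas)))


# Hypothetical eigenvalues of a scaled generalised inverse (one zero for the null space).
gammas = [0.0, 0.4, 1.1, 2.5]
phis = [k / 20 for k in range(20)]     # 0.00, 0.05, ..., 0.95
distances = [distance_phi(p, gammas) for p in phis]
```

The distance is zero at $\phi = 0$, where the flexible model coincides with the unstructured base model, and grows as more of the variance is attributed to the structured effect, so an exponential prior on it shrinks towards pure overdispersion.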
In contrast to the precision parameter $\tau_b$, the resulting PC prior for $\phi$ is not available in closed
form. Within INLA, the function INLA:::inla.pc.bym.phi can be used to compute the prior for
$\phi$ (on logit-scale) for a specific Besag neighbourhood matrix Q. The scaling of the graph is
implemented by default. Thus, the prior can be obtained in tabulated form and also incorporated
in software packages other than INLA. Figure 2 (right panel) shows the prior for $\phi$ with $U = 0.5$ and
$\alpha = 2/3$, generated in R using this function.
For comparison, the figure also displays the commonly used uniform prior for $\phi$.
5 Applications
5.1 Does the modified BYM model shrink to the respective base models?
Various types of cancer data, observed in the 544 districts of Germany (Figure 1), have been widely
analysed in the literature, both to study the epidemiology22 and to test methodological
developments.23 Simpson et al.13 re-analyse a larynx cancer dataset, assigning almost the same
PC priors as shown in Figure 2 to the hyperparameters. The modified BYM model (7), here
referred to as the BYM2 model, is clearly seen to learn from the data, leading to a posterior
marginal for $\phi$ concentrated around 1.
5.1.1 Results
Here, we perform a simulation study to investigate whether the BYM2 model shrinks towards the
two different base models described in Section 4. Furthermore, we assess the behaviour under a
perfectly structured risk surface. The simulations are based on the neighbourhood structure of 366
districts of Sardinia, for which a real case study will be analysed in Section 5.2. Two hundred
datasets were simulated under nine different scenarios, using three different types of risk surfaces
and three different disease prevalences. The model we simulate from assumes that the log relative
Table 1. Mean values of the intercept and hyperparameters for the BYM2 model for varying disease
prevalence, using either a constant, a spatially varying unstructured or a spatially varying
structured risk surface. PC priors are used for the weighting parameter $\phi$ and for the precision
parameter $\tau_b$. Standard deviations are provided in parentheses. Results are based on 200 simulations for
each setting, where the true parameter values are provided in the table.

                 $\mu$               $1/\sqrt{\tau_b}$   $\phi$
Constant risk ($1/\sqrt{\tau_b} = 0$, $\phi = 0$):
  E = 15         3.78E-04 (0.01)     0.02 (0.02)         0.26 (0.05)
  E = 60         9.15E-04 (0.01)     0.01 (0.01)         0.26 (0.05)
  E = 200        4.36E-04 (0.004)    0.01 (0.01)         0.26 (0.04)
Area-specific independent risk ($1/\sqrt{\tau_b} = 0.5$, $\phi = 0$):
  E = 15         1.62E-03 (0.03)     0.50 (0.02)         0.05 (0.03)
  E = 60         1.94E-03 (0.03)     0.51 (0.03)         0.04 (0.02)
  E = 200        1.21E-03 (0.03)     0.50 (0.02)         0.04 (0.02)
Area-specific correlated risk ($1/\sqrt{\tau_b} = 0.5$, $\phi = 1$):
  E = 15         2.14E-03 (0.01)     0.47 (0.04)         0.90 (0.06)
  E = 60         9.11E-06 (0.01)     0.49 (0.05)         0.93 (0.04)
  E = 200        4.70E-05 (0.004)    0.49 (0.07)         0.93 (0.04)
risk, the obtained parameter estimates are seen to be robust to the prior choice for the mixing
parameter.
denoting the posterior mean fitted risk, the average DIC24 and the average proper logarithmic
scoring rule.25 For all three criteria, lower values imply better model properties. Note that the
parameter estimates of the BYM model are not included here as it is not parameterised in terms
of one precision and one mixing parameter. However, we will consider it in terms of model choice,
see figures in online supplementary material.
The estimated parameter values in Table 2 have to be interpreted with great care. Even though the
different models look comparable and the parameters have the same name, their interpretations are
not equivalent. A first issue is that we use different priors for the hyperparameters. Specifically, we
assume that $\phi \sim \mathrm{Unif}(0, 1)$ for the Leroux and Dean models, as this represents a popular non-
informative prior choice in the literature.8,9,26 Further, we assume a gamma(shape = 1, rate = 0.01)
distribution for the precision parameter in the iid model. For the spatially structured term, we
took the graph structure of Sardinia into account and used a gamma(shape = 1, rate = 0.02)11 prior.
Moreover, the different parameterisations of the Leroux and Dean models imply that the
parameters are not comparable. It is only when $\phi = 0$ or $\phi = 1$ that these models reduce to the same
model (assuming the same prior for the precision parameter).
Table 2. Mean values of the intercept and hyperparameters, root mean squared error (RMSE) for
the fitted risk, DIC and logarithmic score (LS) when using different spatial latent models. Standard
deviations are provided in parentheses. Results are based on 200 simulations where the true values are
$\mu = 0$, $1/\sqrt{\tau} = 0$, $\phi = 0$, $E = 60$. Be aware that parameter estimates might not be directly comparable
between models.

Model          $\mu$              $1/\sqrt{\tau}$   $\phi$         RMSE   DIC    LS
iid model      1.65E-03 (0.01)    0.05 (0.004)      0.00 (-)       0.11   2545   3.47
Besag          1.26E-03 (0.01)    0.07 (0.004)      1.00 (-)       0.12   2545   3.48
Leroux         1.51E-03 (0.01)    0.07 (0.01)       0.56 (0.08)    0.11   2545   3.48
Dean           1.69E-03 (0.01)    0.06 (0.004)      0.60 (0.12)    0.11   2547   3.48
BYM2 (unif)    8.39E-04 (0.01)    0.01 (0.01)       0.46 (0.05)    0.12   2540   3.47
BYM2 (PC)      9.15E-04 (0.01)    0.01 (0.01)       0.26 (0.05)    0.12   2540   3.47
The second issue is that none of the models are properly scaled, except the BYM2 model.
Specifically, the precision parameter in the Leroux and Dean models does not represent the marginal
precision. It is still confounded with $\phi$ and hence not comparable to the BYM2 model. The only
parameter estimates that are comparable are those from the iid model and the BYM2 model. The
Besag model can be scaled, and so can the Dean model, leading to the parameterisation of the BYM2
model. However, the Leroux model cannot be scaled as any scaling would depend on the mixing
parameter $\phi$. Hence, the precision parameter cannot be interpreted as marginal precision as it
depends on the underlying graph structure and is in addition confounded with the mixing parameter.
Even though the parameters of the models have different interpretations and priors are chosen
differently, the model choice criteria give quite similar average values for all the models, and only the
DIC values seem to slightly favour the two BYM2 models. This means that the BYM2 model performs at
least as well as the other models and that changing the prior for the mixing parameter leads to the same
results. This is also confirmed by inspecting the logarithmic scores and DIC values over all simulations
in terms of boxplots, see figures in online supplementary material. Results for these two
model choice criteria coincide well and indicate that the Besag model performs worse than the other
models when simulating an independent area-specific risk surface, where the iid model is naturally
but only marginally favoured. In contrast, the iid model performs worse when simulating a spatially
structured surface, where the Besag model is marginally favoured for low values of $E_i$. The other
models perform almost identically in both of these simulation settings. When simulating a constant-risk
surface, all models behave similarly for $E_i = 15$. For increasing numbers of expected cases, the
BYM2 model seems to be slightly beneficial. The figures only include the boxplots obtained by
the BYM2 model using a PC prior for $\phi$, as results using a uniform prior are visually identical.
Table 3. Posterior quantities for the intercept and hyperparameters under three different PC priors or
a uniform prior for the mixing parameter $\phi$ for IDDM counts in Sardinia.
fairly constant. The upper quantile and therefore the mean change more clearly. As in MacNab
et al.,28 the effect on the posterior risk estimates is negligible (Figure 4).
The table in the online supplementary material shows model choice criteria when using different
models for analysing IDDM counts in Sardinia. Furthermore, results using the four different priors
for the mixing parameter in the BYM2 model are shown. As in the simulation study there are no
clear differences between the models.
6 Discussion
Throughout this paper we have stressed the importance of scaling in Bayesian disease mapping.
Without proper scaling, hyperparameters have no clear meaning and may be falsely interpreted. To
be more specific, the precision parameter of the commonly used Leroux and Dean models tends to
be falsely interpreted as marginal precision controlling the deviation from a constant disease risk
over space. However, the parameter depends on the underlying graph structure and is confounded
with the mixing parameter if the structured spatial component is not appropriately scaled. In
addition, it is not clear how to choose a prior distribution for this parameter. First, due to the
lack of scaling, a fixed hyperprior for the precision parameter gives a different amount of smoothing if
the graph on which given disease counts are observed is changed. Second, commonly used
hyperprior distributions induce overfitting13 and do not allow the model to reduce to simpler models such
as a constant risk surface or uncorrelated noise over space.
Simpson et al.13 proposed a new modification of the commonly known BYM model, termed
BYM2 model, which addresses the aforementioned issues, and applied it to one case study. The
BYM2 model consists of one precision parameter and one mixing parameter. The precision
parameter represents the marginal precision and controls the variability explained by a spatial
effect. The mixing parameter distributes existing variability between an unstructured and a
structured component. Importantly, the structured component is scaled,12 which facilitates prior
assignment. PC priors, i.e. principle-based priors, are used to favour simpler models until a more
complex model is supported. Furthermore, these priors allow epidemiologically intuitive
specification of hyperparameters based on the relative risks.

Figure 3. Posterior mean of the relative risk for IDDM counts in Sardinia using the BYM2 model. The legend of the
map ranges from 0.79 to 1.31.

In this paper we have systematically assessed the behaviour of the BYM2 model formulation. By
means of a simulation study based on the neighbourhood structure of 366 regions in Sardinia, we have
shown that the model is able to shrink towards both a constant risk surface or a spatially unstructured
risk for different disease prevalences. This shows that the model does not overfit. When the disease risk
is spatially structured, the model shows good learning abilities. In contrast to other model
formulations, the posterior estimates of all hyperparameters can be directly and intuitively
interpreted. In terms of model comparison, we have found that the BYM2 model is slightly
favoured compared to the Leroux and Dean models in terms of DIC and logarithmic score in the case
of a constant risk. In the case of an unstructured risk, the model choice criteria are almost
indistinguishable. We think that the practical benefits in terms of interpretability and prior assignment
make the BYM2 model advantageous compared to existing models, and we recommend its use, particularly
since its performance on model choice criteria is at least as good as that of existing methods.

Figure 4. Comparison of the posterior mean of the relative risk under four different prior distributions for the
mixing parameter $\phi$ for IDDM counts in Sardinia. Recoverable axis labels: U = 0.5, alpha = 2/3 (all three panels)
and Unif(0,1).

All models were implemented using INLA,14 which provides efficient Bayesian inference without
the need of MCMC sampling. This facilitated the simulation study considerably and allowed us to
investigate different simulation settings. The user-friendly implementation is illustrated by the R-
code in Appendix 1.
MacNab et al.28 noted sensitivity in choosing a prior distribution for the mixing parameter. Here,
we investigated this issue empirically in both the simulation study and an application to IDDM in
Sardinia. However, more effort needs to be placed into a detailed prior sensitivity analysis following
the theoretical framework of Roos et al.29
The BYM2 model can be naturally combined with covariate information (see Simpson et al.13 for
an application) or integrated into a space–time context. However, it will require further work to
distribute the variance not only within the spatial components but over all model parameters in the
linear predictor. It should be noted that the BYM2 model is not only interesting within disease
mapping but also within other fields, such as genetics, where genes can be regarded as regions and
the neighbourhood structure represents which genes are linked.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or
publication of this article: Funding by the Research Council of Norway is gratefully acknowledged.
1162 Statistical Methods in Medical Research 25(4)
References
1. Lawson AB. Bayesian disease mapping: hierarchical modeling in spatial epidemiology. 2nd Edition, Boca Raton, FL: Chapman and Hall/CRC press, 2013.
2. Lee D. A comparison of conditional autoregressive models used in Bayesian disease mapping. Spatial Spatio-Temporal Epidemiol 2011; 2: 79–89.
3. Besag J, York J and Mollié A. Bayesian image restoration, with two applications in spatial statistics. Ann Inst Stat Math 1991; 43: 1–20.
4. Leroux BG, Lei X and Breslow N. Estimation of disease rates in small areas: a new mixed model for spatial dependence. In: Halloran ME and Berry D (eds) Statistical models in epidemiology, the environment, and clinical trials. New York: Springer, 2000, pp.179–191.
5. Stern HS and Cressie N. Posterior predictive model checks for disease mapping models. Stat Med 2000; 19: 2377–2397.
6. Dean CB, Ugarte MD and Militino AF. Detecting interaction between random region and fixed age effects in disease mapping. Biometrics 2001; 57: 197–202.
7. Best N, Richardson S and Thomson A. A comparison of Bayesian spatial models for disease mapping. Stat Methods Med Res 2005; 14: 35–59.
8. Wakefield J. Disease mapping and spatial regression with count data. Biostatistics 2007; 8: 158–183.
9. MacNab YC. On Gaussian Markov random fields and Bayesian disease mapping. Stat Methods Med Res 2011; 20: 49–68.
10. Lee DJ and Durbán M. Smooth-CAR mixed models for spatial count data. Comput Stat Data Anal 2009; 53: 2968–2979.
11. Bernardinelli L, Clayton D and Montomoli C. Bayesian estimates of disease maps: How important are priors? Stat Med 1995; 14: 2411–2431.
12. Sørbye SH and Rue H. Scaling intrinsic Gaussian Markov random field priors in spatial modelling. Spatial Stat 2014; 8: 39–51.
13. Simpson DP, Rue H, Martins TG, et al. Penalising model component complexity: a principled, practical approach to constructing priors. arXiv preprint arXiv:1403.4630v4, 2015, http://arxiv.org/pdf/1403.4630v4.pdf (accessed 17 July 2016).
14. Rue H, Martino S and Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc Ser B (Stat Methodol) 2009; 71: 319–392.
15. Schrödle B and Held L. A primer on disease mapping and ecological regression using INLA. Comput Stat 2011; 26: 241–258.
16. Schrödle B and Held L. Spatio-temporal disease mapping using INLA. Environmetrics 2011; 22: 725–734.
17. Bernardinelli L, Pascutto C, Best N, et al. Disease mapping with errors in covariates. Stat Med 1997; 16: 741–752.
18. Banerjee S, Carlin BP and Gelfand AE. Hierarchical modeling and analysis for spatial data. 2nd Edition, Boca Raton, FL: CRC Press, 2014.
19. Rue H and Held L. Gaussian Markov random fields: theory and applications. Boca Raton, FL: Chapman and Hall/CRC press, 2005.
20. Hodges JS. Richly parameterized linear models: additive, time series, and spatial models using random effects. Boca Raton, FL: CRC Press, 2014.
21. Breslow N, Leroux B and Platt R. Approximate hierarchical modelling of discrete data in epidemiology. Stat Methods Med Res 1998; 7: 49–62.
22. Natário I and Knorr-Held L. Non-parametric ecological regression and spatial variation. Biometr J 2003; 45: 670–688.
23. Knorr-Held L and Rue H. On block updating in Markov random field models for disease mapping. Scand J Stat 2002; 29: 597–614.
24. Spiegelhalter DJ, Best NG, Carlin BP, et al. Bayesian measures of model complexity and fit. J R Stat Soc Ser B (Stat Methodol) 2002; 64: 583–639.
25. Gneiting T and Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 2007; 102: 359–378.
26. Ugarte MD, Adin A, Goicoa T, et al. On fitting spatio-temporal disease mapping models using approximate Bayesian inference. Stat Methods Med Res 2014; 23: 507–530.
27. Held L. Simultaneous posterior probability statements from Monte Carlo output. J Comput Graph Stat 2004; 13: 20–35.
28. MacNab YC, Farrell PJ, Gustafson P, et al. Estimation in Bayesian disease mapping. Biometrics 2004; 60: 865–873.
29. Roos M, Martins TG, Held L, et al. Sensitivity analysis for Bayesian hierarchical models. Bayesian Anal 2015; 10: 321–349.
30. Erisman A and Tinney W. On computing certain elements of the inverse of a sparse matrix. Commun ACM 1975; 18: 177–179.
First we have to load the INLA package which can be installed by typing
in the R-terminal. In this paper, we used the INLA version 0.0-1445883880. In order to define the
model incorporating a spatially structured component, we need to provide the neighbourhood
structure for the regions we analyse. In INLA this is done by providing the path to the file which
stores this information. The graph file for Sardinia is named sardinia.graph and has the following
structure:
366
0 5 13 61 69 73 81
1 5 8 16 40 46 94
2 4 47 59 63 77
3 3 11 17 43
4 5 24 41 51 56 67
...
363 5 292 307 324 344 346
364 5 291 310 317 329 345
365 3 289 303 308
The first line provides the number of regions, here 366. The following 366 lines correspond to
each of the regions, where the first entry always specifies a unique region index. The second entry
specifies the number of neighbouring regions, and then the corresponding region indices for those
neighbours are listed. INLA transforms this information into the neighbourhood matrix Q with
entries as specified in equation (2).
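As a rough illustration of this file layout, the following base R sketch parses a small graph given in the same format and builds a structure matrix from it. The four-region graph is hypothetical, the region indices are assumed to start at 0 as in the Sardinia excerpt, and the matrix is assumed to follow the usual intrinsic CAR form (number of neighbours on the diagonal, -1 for neighbouring pairs, 0 otherwise); this is not INLA's own parser.

```r
# Hypothetical toy graph in the same layout as sardinia.graph:
# first line = number of regions, then "index n_neighbours neighbour indices".
graph_lines <- c("4",
                 "0 1 1",
                 "1 2 0 2",
                 "2 2 1 3",
                 "3 1 2")

n <- as.integer(graph_lines[1])
Q <- matrix(0, n, n)

for (line in graph_lines[-1]) {
  fields <- as.integer(strsplit(trimws(line), "\\s+")[[1]])
  region <- fields[1] + 1                 # shift 0-based index to R's 1-based
  n_nb   <- fields[2]                     # number of neighbours
  nbs    <- fields[seq_len(n_nb) + 2] + 1 # neighbour indices, also shifted
  Q[region, region] <- n_nb               # diagonal: number of neighbours
  Q[region, nbs]    <- -1                 # off-diagonal: -1 for each neighbour
}

print(Q)
```

Each row of the resulting matrix sums to zero, which reflects the rank deficiency of the intrinsic model.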
Lines 23–36 specify the model structure in terms of a formula object. Here, we have an intercept
and a latent Gaussian model given by equation (7). The f(.) function is used to specify a random
effect in INLA. The first argument is always the variable name to which it applies, here region. We
specify the use of the BYM2 model in line 24. Then, we provide the graph and specify that we want
to scale the model in line 26. This will scale the graph as described in Section 3.2. To avoid
confounding with the intercept, a sum-to-zero constraint is specified in line 27. The next lines
specify the hyperpriors for the two hyperparameters in the form of a list of length two. The first list
entry specifies the prior for the mixing parameter $\phi$, named phi in the INLA implementation, the
second entry the prior for the precision parameter $\tau_b$, named prec. Each entry is itself a list. Using the
argument prior, we specify the prior we would like to use. The parameters are set in the argument
param. They coincide with the values used in Section 5.2. Using the argument initial, a starting value
for the numerical optimisation of the INLA algorithm can be provided. This has to be given on the
log scale for the precision and the logit scale for $\phi$, as these are the scales INLA works on internally.
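The following sketch shows how starting values on the natural scales could be moved to the internal scales, and how the pieces described above might be assembled into a formula object. It is an illustration under assumptions, not the listing the text refers to: the starting values and the values in param are placeholders, and the variable names y and region are assumed. Building the formula does not evaluate f(), so the block runs in base R without INLA loaded.

```r
# Starting values on their natural scales (hypothetical, for illustration only)
phi_start  <- 0.5    # mixing parameter, in (0, 1)
prec_start <- 100    # precision, > 0

# The argument initial is given on the internal scales:
# logit for the mixing parameter, log for the precision.
initial_phi  <- qlogis(phi_start)   # log(phi / (1 - phi))
initial_prec <- log(prec_start)

# Back-transforming recovers the natural-scale values
phi_back  <- plogis(initial_phi)
prec_back <- exp(initial_prec)

# Formula object with a BYM2 random effect. The param values are placeholders,
# not necessarily those of Section 5.2.
bym2_formula <- y ~ 1 + f(
  region,
  model = "bym2",
  graph = "sardinia.graph",
  scale.model = TRUE,
  constr = TRUE,
  hyper = list(
    phi  = list(prior = "pc",      param = c(0.5, 2/3), initial = initial_phi),
    prec = list(prior = "pc.prec", param = c(1, 0.01),  initial = initial_prec)
  )
)
```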
After defining the model, we call the inla function, providing the formula, the data and the
likelihood. Further, we specify the expected number of counts and specify that we would like to get
posterior estimates not only for the components of $\eta_i$ but also for $\eta_i$ and $\theta_i$ (line 40).
Improved hyperparameter estimates can be obtained by calling inla.hyperpar afterwards. Here, dz
and diff.logdens refer to options in the numerical integration algorithm; see the help page.
Results can be inspected using summary(result) or plot(result). Posterior summaries for the fixed
effects and hyperparameters can subsequently be accessed via result$summary.fixed and
result$summary.hyperpar, respectively.
$$
\pi(\tau_b) = \frac{\lambda}{2}\, \tau_b^{-3/2} \exp\left(-\lambda\, \tau_b^{-1/2}\right)
$$
as prior for $\tau_b$.
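A small numerical check can make this density more concrete. The sketch below assumes the density has the form $(\lambda/2)\,\tau^{-3/2}\exp(-\lambda\tau^{-1/2})$, which corresponds to an exponential prior with rate $\lambda$ on the standard deviation $1/\sqrt{\tau}$, and assumes the calibration $\lambda = -\log(\alpha)/U$ obtained from $P(1/\sqrt{\tau} > U) = \alpha$ as in Simpson et al.13 The values of U and alpha are hypothetical.

```r
# PC prior density for a precision tau (assumed form, see lead-in)
pc_prec_density <- function(tau, lambda) {
  (lambda / 2) * tau^(-3 / 2) * exp(-lambda * tau^(-1 / 2))
}

# Hypothetical calibration: P(sd > U) = alpha
U      <- 1
alpha  <- 0.01
lambda <- -log(alpha) / U

# P(sd > U) equals P(tau < 1 / U^2); integrate the density over that range
tail_prob <- integrate(pc_prec_density, lower = 0, upper = 1 / U^2,
                       lambda = lambda)$value

print(c(target = alpha, numerical = tail_prob))
```

The numerically integrated tail probability should agree with the chosen alpha up to integration error, which is one way to check that a value of $\lambda$ encodes the intended prior statement.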
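$$
\begin{aligned}
\mathrm{KLD}\bigl(\mathcal{N}(0, \Sigma_{\text{flex}}(\phi)) \,\|\, \mathcal{N}(0, \Sigma_{\text{base}})\bigr)
&= \frac{1}{2}\Bigl[\operatorname{trace}\bigl(\Sigma_{\text{flex}}(\phi)\bigr) - n - \log\bigl|\Sigma_{\text{flex}}(\phi)\bigr|\Bigr] \\
&= \frac{n\phi}{2}\Bigl(\frac{1}{n}\operatorname{trace}\bigl(Q_\star^{-1}\bigr) - 1\Bigr) - \frac{1}{2}\log\bigl|(1-\phi)I + \phi Q_\star^{-1}\bigr|
\end{aligned}
$$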
The computation of the trace is quick when $Q_\star$ is sparse.30 The determinant can be computed as
$|(1-\phi)I + \phi Q_\star^{-1}| = \prod_{i=1}^{n} (1 - \phi + \phi\tilde{\gamma}_i)$, where $\gamma_i$ denote the
eigenvalues of $Q_\star$, and $\tilde{\gamma}_i = 1/\gamma_i$.13 As
the given precision matrix $Q_\star$ is singular with rank deficiency 1, one of the eigenvalues is zero. This
implies that $\tilde{\gamma}_i = 1/\gamma_i$ if $\gamma_i > 0$, or else $\tilde{\gamma}_i = 0$. We can avoid calculating eigenvalues if the dimension
is high; see Appendix A3 of Simpson et al.13 for details.
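For a small graph, the eigenvalue route can be sketched directly in base R. The example below is an illustration under assumptions: the four-region line graph is hypothetical, the structure matrix is scaled so that the geometric mean of the marginal variances of its generalised inverse equals one (the scaling of Sørbye and Rue12 is assumed), and a dense eigendecomposition is used, which would not be appropriate in high dimensions.

```r
# Hypothetical toy graph: four regions on a line (1-2-3-4)
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)
Q <- diag(rowSums(W)) - W      # unscaled structure matrix, rank deficiency 1

# Eigenvalues; inverse eigenvalues are set to 0 for the (numerically) zero one
eig       <- eigen(Q, symmetric = TRUE)
tol       <- 1e-10
inv_gamma <- ifelse(eig$values > tol, 1 / eig$values, 0)
Q_ginv    <- eig$vectors %*% diag(inv_gamma) %*% t(eig$vectors)

# Scaling constant: geometric mean of the marginal variances of Q_ginv.
# Multiplying Q by this constant divides the generalised inverse by it.
scale_const <- exp(mean(log(diag(Q_ginv))))
gamma_tilde <- inv_gamma / scale_const   # eigenvalues of the scaled generalised inverse

# KLD as a function of the mixing parameter phi, for phi in [0, 1)
kld <- function(phi) {
  n <- length(gamma_tilde)
  0.5 * (n * phi * (mean(gamma_tilde) - 1) -
           sum(log(1 - phi + phi * gamma_tilde)))
}

print(sapply(c(0, 0.25, 0.5, 0.75, 0.9), kld))
```

The divergence is zero at $\phi = 0$ and grows as $\phi$ moves towards 1, which matches the role of $\phi = 0$ as the base model.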