Biometrics - 2023 - Xu - Bayesian Model Selection for Generalized Linear Mixed Models

This document presents a Bayesian model selection approach for generalized linear mixed models (GLMMs), focusing on covariance structures for random effects commonly used in various fields. The authors propose a pseudo-likelihood method to approximate the integrated likelihood function, enabling simultaneous selection of covariates and random effects. Their approach, implemented in the R package GLMMselect, demonstrates improved performance over existing Bayesian methods through simulation studies and case studies.

Received: 18 October 2022 Accepted: 7 June 2023

DOI: 10.1111/biom.13896

BIOMETRIC METHODOLOGY

Bayesian model selection for generalized linear mixed models

Shuangshuang Xu, Marco A. R. Ferreira, Erica M. Porter, Christopher T. Franck

Department of Statistics, Virginia Tech, Blacksburg, Virginia, USA

Correspondence
Marco A. R. Ferreira, Department of Statistics, Virginia Tech, Blacksburg, VA 24060, USA.
Email: [email protected]

Funding information
National Science Foundation, Grant/Award Numbers: DMS 1853549, DMS 2054173

Abstract
We propose a Bayesian model selection approach for generalized linear mixed models (GLMMs). We consider covariance structures for the random effects that are widely used in areas such as longitudinal studies, genome-wide association studies, and spatial statistics. Since the random effects cannot be integrated out of GLMMs analytically, we approximate the integrated likelihood function using a pseudo-likelihood approach. Our Bayesian approach assumes a flat prior for the fixed effects and includes both approximate reference prior and half-Cauchy prior choices for the variances of random effects. Since the flat prior on the fixed effects is improper, we develop a fractional Bayes factor approach to obtain posterior probabilities of the several competing models. Simulation studies with Poisson GLMMs with spatial random effects and overdispersion random effects show that our approach performs favorably when compared to widely used competing Bayesian methods including deviance information criterion and Watanabe–Akaike information criterion. We illustrate the usefulness and flexibility of our approach with three case studies including a Poisson longitudinal model, a Poisson spatial model, and a logistic mixed model. Our proposed approach is implemented in the R package GLMMselect that is available on CRAN.

KEYWORDS
approximate reference prior, fractional Bayes factor, generalized linear mixed model, model
selection, pseudo-likelihood method

1 INTRODUCTION

Generalized linear mixed models (GLMMs) are widely used to model non-Gaussian data with dependent observations. This type of data is often found in many areas of application such as epidemiology (Meyer et al., 2017), meta-analysis of multiple clinical trials (Sauter & Held, 2015), survival analysis (Tawiah et al., 2020), and neuroimaging (Liu et al., 2016). Even though Bayesian estimation procedures for GLMMs are well established, there are just a handful of papers that address Bayesian model selection for GLMMs. Currently, most applied papers use the deviance information criterion (DIC) (Spiegelhalter et al., 2002) to perform Bayesian model selection for

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the
original work is properly cited.
© 2023 The Authors. Biometrics published by Wiley Periodicals LLC on behalf of International Biometric Society.

3266 wileyonlinelibrary.com/journal/biom Biometrics. 2023;79:3266–3278.


15410420, 2023, 4, Downloaded from https://onlinelibrary.wiley.com/doi/10.1111/biom.13896 by EBMG ACCESS - ETHIOPIA, Wiley Online Library on [12/04/2025]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
XU et al. 3267

GLMMs (Nouvellet et al., 2021; Tredennick et al., 2021). Even though the DIC is widely applicable, we show in a simulation study that the DIC has some undesirable behaviors when applied to GLMMs. To provide more reliable results, here we develop a novel Bayesian model selection approach for simultaneous selection of covariates and random effects for GLMMs.

Specifically, we focus on GLMMs where each random effect has a covariance matrix that is the product of an unknown variance component parameter and a known positive semi-definite symmetric matrix. The class of GLMMs we consider can be used for the analysis of spatial areal data (Banerjee et al., 2014; Clayton and Kaldor, 1987), genome-wide association studies (GWAS) (Williams et al., 2022), and longitudinal data (Breslow and Clayton, 1993; Xu et al., 2016). However, inference for GLMMs is difficult because the integrated likelihood function is not available in closed form. To deal with the issue of integration of random effects, we approximate the integrated likelihood function using a pseudo-likelihood approach (Wolfinger and O’Connell, 1993) that leads to a Gaussian likelihood approximation. We then assign a flat prior for the vector of regression coefficients and an approximate reference prior (Ferreira et al., 2021) for the variance components of the GLMMs, which is inspired by the reference prior proposed by Keefe et al. (2019) for Gaussian data. In addition, we also consider a half-Cauchy prior for the square root of variance components (Gelman, 2006; Polson & Scott, 2012). Because the prior on the vector of regression coefficients is improper, we develop a fractional Bayes factor (FBF) approach (O’Hagan, 1995). We note that Porter et al. (2023) have proposed FBF for Gaussian mixed models for the particular case of spatial areal data. In contrast, here we consider GLMMs. In addition, we consider not only spatial random effects but also many other types of random effects such as overdispersion random effects and longitudinal random effects. Because we use default priors combined with FBF, our proposed model selection approach is fully automatic, which obviates the need for subjective specification of hyperparameters and makes the method more accessible for practitioners. We call our two proposed model selection approaches the approximate reference method (ARM) and the half-Cauchy method (HCM).

To compare the performance of our methods ARM and HCM to the performance of the DIC, the Watanabe–Akaike information criterion (WAIC) (Watanabe, 2010), and marginal likelihood computed by Integrated Nested Laplace Approximation (INLA) under different parameter settings, we present a simulation study based on Poisson GLMMs with a spatial random effect and an overdispersion random effect. In this simulation study, we vary the sample size, coefficient of non-null covariates, level of spatial dependence, and overdispersion level. The simulation study shows that DIC and WAIC cannot reliably distinguish the random effect when there is another random effect. In contrast, our methods ARM and HCM perform well at detecting covariates and correct dependence structure. In particular, ARM and HCM always correctly detect the case of no random effects. Finally, while the performances of the DIC and WAIC do not improve much with large sample sizes, our proposed ARM and HCM have large improvement with increasing sample size. In addition, the simulation study shows that marginal likelihood computed by INLA has similar performance to our methods ARM and HCM when selecting covariates. However, marginal likelihood computed by INLA does not perform well when selecting random effects.

Apart from the DIC, WAIC, and marginal likelihood, there are not many other Bayesian model selection approaches for GLMMs. One such approach proposed by Cai and Dunson (2006) for simultaneously selecting fixed and random effects in GLMMs assumes that the subject-specific random effects have a covariance matrix with all its elements being free parameters to be estimated. As a consequence, the method proposed by Cai and Dunson (2006) is only applicable to problems with replications and cannot be readily applied to problems where the vector of observations is a realization from a structured multivariate distribution such as GWAS data and spatial areal data. In contrast, because we assume that each random effect has a covariance matrix that is the product of an unknown variance component parameter with a known positive semi-definite covariance matrix, our methods ARM and HCM can be applied to longitudinal data, GWAS data, and spatial areal data.

The remainder of this paper is organized as follows. Section 2 describes the GLMMs that we consider. Section 3 outlines how the pseudo-likelihood approach approximates GLMMs for non-Gaussian data by computing adjusted observations that are modeled using Gaussian LMMs. Section 4 introduces priors for model selection, the FBF approach for dealing with improper priors, and posterior computation. Section 5 presents the results of a simulation study. Section 6 illustrates our method with applications to two case studies. Section 7 concludes with a discussion and future directions.

The online supporting information contains details about the pseudo-likelihood method (Web Appendix A), additional tables for the case studies (Web Appendix B), one additional case study (Web Appendix C), several additional simulation studies (Web Appendix D), and additional figures (Web Appendix E).
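The structured covariance assumption described in this introduction can be made concrete for spatial areal data. Section 2 takes the known matrix as the Moore–Penrose inverse of D_w − W, where W is the adjacency matrix of the subregions. The Python sketch below builds W and the matrix D_w − W for a small regular square grid and checks two of its properties. It is only an illustration under stated assumptions: the function names are hypothetical, "adjacent" is assumed to mean cells that share an edge, and none of this is taken from the GLMMselect package.

```python
def grid_adjacency(rows, cols):
    """Adjacency matrix W for a rows x cols grid.

    Assumption: two cells are adjacent when they share an edge.
    """
    n = rows * cols
    W = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:      # neighbour to the right
                j = i + 1
                W[i][j] = W[j][i] = 1
            if r + 1 < rows:      # neighbour below
                j = i + cols
                W[i][j] = W[j][i] = 1
    return W


def icar_precision(W):
    """Return D_w - W, where D_w is diagonal and holds the row sums of W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0) - W[i][j] for j in range(n)]
            for i in range(n)]


# A 3 x 3 grid: 9 subregions, the centre cell (index 4) has 4 neighbours.
W = grid_adjacency(3, 3)
P = icar_precision(W)
row_sums = [sum(row) for row in P]
is_symmetric = all(P[i][j] == P[j][i] for i in range(9) for j in range(9))
```

Every row of D_w − W sums to zero, so the matrix is singular. This is consistent with the use of a Moore–Penrose inverse, rather than an ordinary inverse, to define the covariance matrix.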

2 GENERALIZED LINEAR MIXED MODELS

Consider a response vector $y = (y_1, y_2, \dots, y_n)^\top$ of $n$ observations. Let $X$ be an $n$ by $p$ matrix of explanatory variables and $\beta$ be the corresponding $p$-dimensional vector of fixed effects. Let $Z_j$ be an $n$ by $q_j$ design matrix and $\alpha_j$ be the corresponding $q_j$-dimensional vector of random effects, $j = 1, \dots, Q$. Let vectors $x_i$ and $z_{ij}$ be the $i$th rows of $X$ and $Z_j$, respectively. Conditional on linear predictors $\eta_1, \dots, \eta_n$, the observations $y_1, \dots, y_n$ are independent with probability density function belonging to the exponential family, that is, $f(y_i \mid \eta_i) = \exp[y_i \eta_i - B_i(\eta_i) + C_i(y_i)]$, $i = 1, \dots, n$, where the canonical parameter $\eta_i$ is modeled as a linear function of fixed and random effects as $\eta_i = x_i^\top \beta + \sum_j z_{ij}^\top \alpha_j$. Each observation $y_i$ has mean $\mu_i = B_i'(\eta_i)$ and variance $v_i = B_i''(\eta_i)$. In addition, we assume that each vector of random effects $\alpha_j$ has a multivariate normal distribution with mean vector $0$ and covariance matrix $\tau_j \Sigma_j$, where the variance component parameter $\tau_j$ is unknown and $\Sigma_j$ is a known symmetric positive semi-definite matrix. For example, if $\alpha$ is a vector of overdispersion random effects then the corresponding matrix $\Sigma$ is an identity matrix.

As another example, in the case of spatial areal data, we assume that $\alpha$ is a vector of spatial random effects that follows a sum-zero constrained Gaussian intrinsic conditional autoregressive model (Keefe et al., 2018, 2019), that is,

\[
\alpha \mid \tau \sim N(0, \tau \Sigma), \tag{1}
\]

where $\Sigma$ is a known positive semi-definite covariance matrix that depends on the neighborhood structure of the spatial subregions. Specifically, an adjacency matrix $W$ is defined such that if subregions $i$ and $j$ are adjacent, the entries in cells $(i, j)$ and $(j, i)$ are 1, otherwise 0. Let $D_w$ be a diagonal matrix with each diagonal element equal to the summation of the corresponding row of $W$. Then, the covariance matrix $\Sigma$ is the Moore–Penrose inverse of $D_w - W$ (Keefe et al., 2018, 2019). We note that computations for this model may be performed using the precision matrix. In addition, we note that the knowledge about the covariance matrix $\Sigma$ has allowed, for the case of Gaussian hierarchical models with ICAR random effects, the derivation of a reference prior for the parameters (Keefe et al., 2019), and formal Bayesian model selection (Porter et al., 2023).

3 PSEUDO LIKELIHOOD FUNCTION FOR GENERALIZED LINEAR MIXED MODELS

A key step in Bayesian model selection is to integrate out random effects from the likelihood function. However, while for LMMs the random effects can be integrated out analytically, for GLMMs that is not possible. To overcome this difficulty, here we use a pseudo-likelihood approach that approximates a GLMM for non-Gaussian data by computing adjusted observations that are modeled using an approximate Gaussian LMM.

Let $\alpha$ represent all random effects and $\tau$ represent all variance components. Then, the likelihood function with the relevant but intractable integral over random effects $\alpha$ is

\[
\begin{aligned}
L(\beta, \tau \mid y) &= \int p(y \mid \alpha, \beta)\, p(\alpha \mid \tau)\, d\alpha \\
&= \int \prod_{i=1}^{n} \left[ \exp\left\{ y_i \left( x_i^\top \beta + \sum_j z_{ij}^\top \alpha_j \right) - B_i\left( x_i^\top \beta + \sum_j z_{ij}^\top \alpha_j \right) + C_i(y_i) \right\} \right] \\
&\quad \times \prod_j \left[ (2\pi)^{-q_j/2}\, |\tau_j \Sigma_j|^{-1/2} \exp\left\{ -\frac{\alpha_j^\top \Sigma_j^{-} \alpha_j}{2\tau_j} \right\} \right] d\alpha.
\end{aligned} \tag{2}
\]

In Equation (2), the random effects $\alpha$ cannot be integrated out analytically. Our method approximates the integral in Equation (2) with a Gaussian LMM via a pseudo-likelihood approach. For a Gaussian LMM, the corresponding integral can be solved analytically, and then the likelihood function of parameters has an analytic expression.

The pseudo-likelihood approach was first proposed by Wolfinger and O’Connell (1993). The pseudo-likelihood approach is an iterative procedure that starts by writing the model as $y = \mu + \epsilon$, where $\mu = (\mu_1, \dots, \mu_n)^\top$ and $\epsilon$ is a vector of errors with $\mathrm{Cov}(\epsilon) = V = \mathrm{diag}(v_1, \dots, v_n)$. Let $\hat\alpha$, $\hat\beta$, $\hat\mu$, and $\hat V$ be the current estimates of $\alpha$, $\beta$, $\mu$, and $V$. Here, $\hat\beta$ is initialized at the estimate from a GLM fit. Now, approximate $\mu_i$ with a first-order Taylor expansion around $\alpha = \hat\alpha$ and $\beta = \hat\beta$. Rearrange all the terms in $y = \mu + \epsilon$ such that the terms that depend on $y$, $\hat\alpha$, $\hat\beta$, and $\hat\mu$ appear on the left side of the equation and the remaining terms appear on the right side of the equation. Multiply both

sides by $\hat V^{-1}$. As a result, the left side of the equation will have $y^\star = \hat V^{-1}(y - \hat\mu) + X\hat\beta + \sum_j Z_j \hat\alpha_j$. The vector $y^\star$ is known as the vector of pseudo-observations or the vector of adjusted observations. Equating $y^\star$ to the right side of the equation, we obtain the following model for the adjusted observations:

\[
y^\star \approx X\beta + \sum_j Z_j \alpha_j + \hat V^{-1}\epsilon, \qquad
\alpha_j \sim N(0, \tau_j \Sigma_j), \qquad
\epsilon \sim N(0, V). \tag{3}
\]

Thus, the pseudo-likelihood approach assumes that $\epsilon$ follows a Gaussian distribution with mean vector $0$ and covariance matrix $V$. Substituting $V$ with $\hat V$ in Equation (3), $y^\star$ can be approximately modeled with the LMM $y^\star \sim N\left(X\beta, \sum_j \tau_j Z_j \Sigma_j Z_j^\top + \hat V^{-1}\right)$. Therefore, we have the closed-form pseudo-likelihood function

\[
p(y^\star \mid \beta, \tau) = (2\pi)^{-n/2} \left| \sum_j \tau_j Z_j \Sigma_j Z_j^\top + \hat V^{-1} \right|^{-1/2}
\exp\left\{ -\frac{1}{2} (y^\star - X\beta)^\top \left( \sum_j \tau_j Z_j \Sigma_j Z_j^\top + \hat V^{-1} \right)^{-1} (y^\star - X\beta) \right\}. \tag{4}
\]

Further details about the pseudo-likelihood approach appear in Web Appendix A. To perform model selection, we first use the pseudo-likelihood function in Equation (4) in an iterative manner to estimate the parameters and to obtain adjusted observations $y^\star$. We then use these adjusted observations $y^\star$ rather than the original observations $y$ to perform model selection.

4 MODEL SELECTION

We perform model selection based on the pseudo-likelihood function given in Equation (4). Similar to Ten Eyck and Cavanaugh (2018), we use the same vector of adjusted observations to compare all candidate models’ posterior probabilities. Specifically, we compute the vector of adjusted observations using the full model with all candidate regressors and all candidate random effects. In addition, consider the model space $\mathcal{M} = \{M_c, c = 1, \dots, C\}$, with $C$ possible models. We assume that model $M_c$ has $K_c$ regressors, where $X_c$ is the corresponding matrix of explanatory variables and $\beta_c$ is the corresponding vector of coefficients. Further, model $M_c$ has $Q_c$ types of random effects. Let $\tau_c = (\tau_{c,1}, \dots, \tau_{c,Q_c})$ be the vector of variance components of the $Q_c$ types of random effects in the model $M_c$. The integrated likelihood based on the vector of adjusted observations $y^\star$ is

\[
p(y^\star \mid M_c) = \int\!\!\int p(y^\star \mid \beta_c, \tau_c)\, \pi(\beta_c, \tau_c \mid M_c)\, d\beta_c\, d\tau_c, \tag{5}
\]

where $\pi(\beta_c, \tau_c \mid M_c)$ is the prior distribution of $(\beta_c, \tau_c)$ conditional on model $M_c$. Let $\pi(M_c)$ be the prior probability of model $M_c$. Then, application of Bayes theorem yields posterior model probabilities $P(M_c \mid y^\star) = p(y^\star \mid M_c)\pi(M_c) / \sum_{r=1}^{C} p(y^\star \mid M_r)\pi(M_r) \propto p(y^\star \mid M_c)\pi(M_c)$.

In Section 4.1, we specify the priors for model parameters. In Section 4.2, we specify the priors on the model space. Section 4.3 discusses approximation of the integral in Equation (5). In Section 4.4, we propose an FBF approach (Porter et al., 2023) to perform model selection with improper priors.

4.1 Priors for model parameters

We consider the approximate reference prior proposed by Ferreira et al. (2021) in the context of LMMs for $\beta$ and the reciprocal of $\tau$, which is based on the reference prior proposed by Keefe et al. (2019). In what follows, we consider the implied reference prior for $\tau$ obtained by transformation of variables. For simple notation, let $M$ without subscript represent a general model, $\beta$ represent the corresponding vector of regressor coefficients, and $\tau$ represent the variance component. In the reference prior (Keefe et al., 2019), all the parameters are independent. The vector of regression coefficients $\beta$ is assigned a uniform prior on $\mathbb{R}^p$. In addition, as $\tau$ goes to infinity the reference prior $\pi(\tau)$ is proportional to $\tau^{-2}$. Further, as $\tau$ goes to 0 the reference prior is proportional to a constant. Based on the tail behavior of the reference prior for $\tau$, Ferreira et al. (2021) proposed the approximate reference prior $\pi(\tau) \propto (1 + \tau/a_\tau)^{-2}$, where $a_\tau$ is a hyperparameter. We set $a_\tau$ equal to 2. The choice of $a_\tau = 2$ is equivalent to the choice made by Ferreira et al. (2021) for Gaussian data. In addition, our simulation study shows that this choice also works well for GLMMs. Hence, for $\beta$ we use the flat prior $\pi(\beta \mid M) \propto 1$, and for $\tau$ we use the approximate reference prior

\[
\pi_1(\tau \mid M) = \frac{1}{2(\tau/2 + 1)^2}, \qquad \tau \ge 0. \tag{6}
\]

This approximate reference prior is related to the half-Cauchy prior $\pi(\tau) \propto \frac{1}{\tau^2 + 1}$, which has the same tail behavior. Gelman (2006) proposed a half-Cauchy prior, however, for the standard deviation of random effects in a two-level Gaussian model. Assuming a half-Cauchy prior for the

square root of the variance component parameter $\tau$ implies for $\tau$ the prior density $\pi_2(\tau) \propto \tau^{-1/2}(\tau + 1)^{-1}$ (Polson & Scott, 2012). Thus, $\pi_2(\tau) = O(\tau^{-1/2})$ for $\tau \to 0$ and $\pi_2(\tau) = O(\tau^{-3/2})$ for $\tau \to \infty$. Hence, the half-Cauchy prior for $\sqrt{\tau}$ has more mass near zero and more mass for large values of $\tau$ than the approximate reference prior for $\tau$ given in Equation (6). Here, we consider two variants of our pseudo-likelihood-based method: ARM, which uses the approximate reference prior given in Equation (6); and HCM, which uses the half-Cauchy prior for $\sqrt{\tau}$. We compare our methods ARM and HCM to the DIC and WAIC in the simulation studies presented in Section 5.

4.2 Priors on the model space

Let $K$ denote the number of candidate covariates and $Q$ denote the number of candidate random effects types. For example, in an application where we may have spatial random effects and/or overdispersion random effects, $Q = 2$. In addition, let $K_c$ denote the number of covariates in model $M_c$. For fixed effects, we use priors from Scott and Berger (2010), which automatically correct for multiplicity. Specifically, the prior probability for model $M_c$ with $K_c$ covariates is $P(M_c \text{ with } K_c \text{ covariates}) = 1/\left[(K + 1)\binom{K}{K_c}\right]$. With respect to random effects, there are $2^Q$ possibilities for inclusion and exclusion of random effects. Assuming that each random effect has 0.5 prior inclusion probability, the prior probability for model $M_c$ with $Q_c$ types of random effects is $P(M_c \text{ with } Q_c \text{ types of random effects}) = 1/2^Q$. Because usually in practice the number of candidate random effects types $Q$ is small, a discrete uniform prior for the inclusion of random effects is reasonable. Assuming a priori independence of inclusion of fixed effects and random effects, the prior probability for model $M_c$ is $P(M_c) = 1/\left[2^Q (K + 1)\binom{K}{K_c}\right]$.

4.3 Integrated likelihood methods

After the priors for parameters have been defined, the integrated likelihood given in Equation (5) based on the adjusted observations $y^\star$ becomes

\[
\begin{aligned}
p(y^\star \mid M_c) &= \int\!\!\int p(y^\star \mid \beta_c, \tau_c)\, \pi(\beta_c, \tau_c \mid M_c)\, d\beta_c\, d\tau_c \\
&\propto \int\!\!\int \exp\left\{ -\frac{1}{2} (y^\star - X_c \beta_c)^\top \left( \sum_{j=1}^{Q_c} \tau_{cj} Z_{cj} \Sigma_{cj} Z_{cj}^\top + \hat V^{-1} \right)^{-1} (y^\star - X_c \beta_c) \right\} \\
&\quad \times \left| \sum_{j=1}^{Q_c} \tau_{cj} Z_{cj} \Sigma_{cj} Z_{cj}^\top + \hat V^{-1} \right|^{-1/2} \pi(\tau_c)\, d\beta_c\, d\tau_c .
\end{aligned}
\]

The vector of regression coefficients $\beta_c$ can be integrated out analytically. After integrating out $\beta_c$, we can write the integrated likelihood as

\[
\begin{aligned}
p(y^\star \mid M_c) &= \int p(y^\star, \tau_c \mid M_c)\, d\tau_c \\
&\propto \int \exp\left[ \frac{1}{2}\, y^{\star\top} \left\{ H_c^{-1} X_c (X_c^\top H_c^{-1} X_c)^{-1} X_c^\top H_c^{-1} - H_c^{-1} \right\} y^\star \right]
\left| H_c^{-1} \right|^{1/2} \left| (X_c^\top H_c^{-1} X_c)^{-1} \right|^{1/2} \pi(\tau_c)\, d\tau_c,
\end{aligned} \tag{7}
\]

where $H_c = \sum_{j=1}^{Q_c} \tau_{cj} Z_{cj} \Sigma_{cj} Z_{cj}^\top + \hat V^{-1}$. Note that the vector of variance components $\tau_c$ cannot be integrated out analytically. To compute the integral in Equation (7), we first perform a logarithm transformation on $\tau_c$. Let $\delta_c = \log(\tau_c)$ be the vector obtained by applying the logarithm transformation to each element of $\tau_c$. Then, we integrate out $\delta_c$ using a Laplace approximation to obtain

\[
\int p(y^\star, \tau_c \mid M_c)\, d\tau_c = \int p(y^\star, \exp(\delta_c) \mid M_c) \prod_{j=1}^{Q_c} \exp(\delta_{cj})\, d\delta_c
\approx (2\pi)^{Q_c/2} \left| q''(\hat\delta_c) \right|^{-1/2} \exp\left\{ -q(\hat\delta_c) \right\}, \tag{8}
\]

where

\[
q(\delta_c) = -\frac{1}{2}\, y^{\star\top} \left[ H_c^{-1} X_c (X_c^\top H_c^{-1} X_c)^{-1} X_c^\top H_c^{-1} - H_c^{-1} \right] y^\star
- \frac{1}{2} \log\left| H_c^{-1} \right| - \frac{1}{2} \log\left| (X_c^\top H_c^{-1} X_c)^{-1} \right|
- \log \pi(\exp(\delta_c)) - \sum_{j=1}^{Q_c} \delta_{cj},
\]

$\hat\delta_c$ is the point that minimizes $q(\delta_c)$, and $q''(\delta_c)$ is the Hessian matrix. The last term of $q(\delta_c)$ is the negative log-Jacobian of the transformation $\tau_c = \exp(\delta_c)$, so that $\exp\{-q(\delta_c)\}$ equals the integrand in Equation (8).

4.4 Fractional Bayes factors

In order to obtain the posterior model probabilities of interest, we use an FBF approach. The FBF is a modification of the Bayes factor that allows for improper priors on parameters (O’Hagan, 1995).

To define the usual Bayes factor, let the baseline model $M_l$ be the model with the largest integrated likelihood in the model space. Then, the Bayes factor $\mathrm{BF}_{cl}$ of model $M_c$ versus the baseline model $M_l$ is defined as the ratio of their integrated likelihoods, $\mathrm{BF}_{cl} = p(y^\star \mid M_c)/p(y^\star \mid M_l)$. Hence, we can compute the posterior probability of model $M_c$ as proportional to its prior probability times its Bayes factor versus the baseline model, that is, $P(M_c \mid y^\star) \propto P(M_c)\, p(y^\star \mid M_c)/p(y^\star \mid M_l) \propto \mathrm{BF}_{cl}\, P(M_c)$.

Note that the prior on the regression coefficients $\pi(\beta_c \mid M_c) \propto d$ is improper, where $d$ is an arbitrary constant. Thus, the Bayes factor computed with the integrated likelihood from Equations (7) and (8) is only defined up to an unspecified constant of proportionality and cannot be used to compare models directly.
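As an illustration of the model-space prior of Section 4.2 and of the rule that posterior model probabilities are proportional to the Bayes factor against the baseline model times the prior model probability, the following Python sketch enumerates a model space, evaluates P(M_c) = 1/[2^Q (K + 1) C(K, K_c)], and normalizes the resulting weights. It is a minimal sketch under stated assumptions: the function names are hypothetical, it is not the GLMMselect implementation, and the log integrated likelihood values in the usage lines are made-up placeholders rather than results.

```python
from itertools import product
from math import comb, exp, log


def enumerate_models(K, Q):
    """List every model as (covariate inclusion flags, random-effect inclusion flags)."""
    return [(cov, re)
            for cov in product((0, 1), repeat=K)
            for re in product((0, 1), repeat=Q)]


def log_model_prior(cov, K, Q):
    """log P(M_c) = -log[2^Q * (K + 1) * C(K, K_c)], with K_c included covariates."""
    k_c = sum(cov)
    return -(Q * log(2.0) + log(K + 1) + log(comb(K, k_c)))


def posterior_model_probs(log_int_lik, log_prior):
    """Posterior probabilities proportional to BF_cl * P(M_c).

    The baseline M_l is the model with the largest integrated likelihood,
    so every Bayes factor is at most 1 and exp() cannot overflow.
    """
    baseline = max(log_int_lik)
    weights = [exp(l - baseline + lp) for l, lp in zip(log_int_lik, log_prior)]
    total = sum(weights)
    return [w / total for w in weights]


# K = 4 candidate covariates and Q = 2 random-effect types give 2^4 * 2^2 = 64 models.
K, Q = 4, 2
models = enumerate_models(K, Q)
log_priors = [log_model_prior(cov, K, Q) for cov, _ in models]
prior_total = sum(exp(lp) for lp in log_priors)  # the prior sums to 1 over the model space

# Hypothetical log integrated likelihoods (placeholders, NOT results from the paper):
# every model gets 0.0 except one model that is favoured by 5 log units.
log_int_lik = [0.0] * len(models)
favoured = models.index(((1, 1, 0, 0), (1, 0)))
log_int_lik[favoured] = 5.0
post = posterior_model_probs(log_int_lik, log_priors)
```

Working on the log scale and subtracting the baseline before exponentiating is a numerical convenience only. It leaves the normalized probabilities unchanged because the baseline term cancels in the ratio.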

To solve this problem, we use the FBF (O’Hagan, 1995) to approximate the Bayes factor. Porter et al. (2023) developed the FBF method for Gaussian hierarchical models with ICAR random effects. We use the FBF approach to train the improper prior so that we can compute a meaningful Bayes factor. By training the improper prior, we mean using Bayes theorem to combine the improper prior with a fraction of the likelihood to obtain a proper distribution (O’Hagan, 1995; Porter et al., 2023). We can then use this latter distribution as a trained prior to compute a meaningful Bayes factor. Specifically, here we train the prior with a fraction $b$ of the likelihood function. The trained prior density for model $M_c$ is obtained by Bayes theorem as

\[
\pi^b(\beta_c, \tau_c) = \frac{p^b(y^\star \mid \beta_c, \tau_c)\, \pi(\beta_c, \tau_c \mid M_c)}{\int p^b(y^\star \mid \beta_c, \tau_c)\, \pi(\beta_c, \tau_c \mid M_c)\, d\beta_c\, d\tau_c}.
\]

The integrated likelihood is then computed as an integral of the product of the likelihood function raised to $1 - b$ and the trained prior. Following O’Hagan (1995), the resulting integrated likelihood of model $M_c$, called the fractional integrated likelihood, is equal to

\[
\begin{aligned}
q_c(b, y^\star) &= \int p^{1-b}(y^\star \mid \beta_c, \tau_c)\, \pi^b(\beta_c, \tau_c)\, d\beta_c\, d\tau_c \\
&= \int p^{1-b}(y^\star \mid \beta_c, \tau_c)\, \frac{p^b(y^\star \mid \beta_c, \tau_c)\, \pi(\beta_c, \tau_c \mid M_c)}{\int p^b(y^\star \mid \beta_c, \tau_c)\, \pi(\beta_c, \tau_c \mid M_c)\, d\beta_c\, d\tau_c}\, d\beta_c\, d\tau_c \\
&= \frac{\int p(y^\star \mid \beta_c, \tau_c)\, \pi(\beta_c, \tau_c \mid M_c)\, d\beta_c\, d\tau_c}{\int p^b(y^\star \mid \beta_c, \tau_c)\, \pi(\beta_c, \tau_c \mid M_c)\, d\beta_c\, d\tau_c}.
\end{aligned} \tag{9}
\]

The size of the training fraction $b$ should be chosen carefully. When $b$ is too small, the denominator in Equation (9) may diverge. If $b$ is too large, a substantial part of the integrated likelihood is used to train the prior on the parameters, and then the remaining information in the integrated likelihood used to update the prior model probabilities will lead to less distinctive posterior model probabilities. Here, we consider a training fraction size equal to $b = m/n$, where $m$ is the equivalent training size. To guide the choice of $m$ in our considered GLMM context, we use the fact that for LMMs with the reference prior proposed by Keefe et al. (2019) the minimal value of $m$ that guarantees propriety of the fractional integrated likelihood is $p + 1$ (Porter et al., 2023). In particular, in all the GLMM applications we present in Section 6, the training fraction $b = (p + 1)/n$ yields well-defined Bayes factors.

Then, the FBF of model $M_c$ versus model $M_l$ is defined as $\mathrm{BF}^b_{cl} = q_c(b, y^\star)/q_l(b, y^\star)$. Next, we compute the posterior probability of model $M_c$ as proportional to the FBF, $\mathrm{BF}^b_{cl}$, times the prior probability of model $M_c$, that is, $P^b(M_c \mid y^\star) = \mathrm{BF}^b_{cl} \times P(M_c) / \left[ \sum_{k=1}^{C} \mathrm{BF}^b_{kl} \times P(M_k) \right]$.

5 SIMULATION STUDY

To investigate the performance of our proposed model selection methods ARM and HCM when compared to the widely used DIC, WAIC, and marginal likelihood computed by INLA, we perform a simulation study for different combinations of parameter settings. Here, we present results for Poisson GLMMs. In the Web Appendix C, we present results for Bernoulli GLMMs. For each combination of parameter settings, we generate 100 datasets. We simulate samples on regular square grids and consider three sample sizes, $n = 100$, 400, and 900. Each sample may have spatial dependence based on a first-order neighborhood structure modeled with a vector of spatial random effects $\alpha_1$ following the ICAR distribution given in Equation (1). For the variance component $\tau_1$ of the spatial random effects, we consider values 0, 0.03, 0.05, 0.1, or 1, where $\tau_1 = 0$ implies no spatial dependence. We also consider the possibility of overdispersion random effect $\alpha_2$ in the model. We set the variance component $\tau_2$ of the overdispersion random effect to 0, 0.05, 0.1, 0.5, or 1, where $\tau_2 = 0$ implies no overdispersion. We consider four candidate covariates $x_{1i}$, $x_{2i}$, $x_{3i}$, and $x_{4i}$ sampled from a standard normal distribution. We assume that $\beta = (\beta_0, \beta_1, \beta_2, 0, 0)^\top$, thus the last two covariates are not in the true model. Here, $\beta_0$ is the intercept, with values equal to 1, 2, or 4. We let $\beta_1 = \beta_2$ with values 0, 0.1, 0.2, 0.3, 0.5, or 1. When $\beta_1$ and $\beta_2$ are both equal to 0, there is no covariate in the true model. Conditionally independent Poisson observations $y_i$ are generated with the GLMM $y_i \mid \lambda_i \overset{\text{ind}}{\sim} \text{Poisson}(\lambda_i)$, $i = 1, \dots, n$, with $\log \lambda_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \alpha_{1i} + \alpha_{2i}$, spatial random effects $\alpha_1 \sim N(0, \tau_1 \Sigma)$, and overdispersion random effects $\alpha_2 \sim N(0, \tau_2 I)$.

For each parameter setting, there are $C = 64$ candidate models in total. Specifically, there are $2^4 = 16$ possible combinations of fixed effects. In addition, there are four possible combinations of random effects types, one with both spatial random effects and overdispersion random effects, one with only spatial random effects, one with only overdispersion random effects, and one without any random effects. We calculate posterior model probabilities for all 64 models, and we compute posterior inclusion probabilities for each of the four covariates, for the spatial random effect, and for the overdispersion random effect.

We compare our model selection methods ARM and HCM to the DIC, the WAIC, and marginal likelihood computed by the R package INLA (Rue et al., 2009). For the ARM and HCM, we decide to include a component in the selected model if the posterior inclusion probability of that component is larger than or equal to 0.5, that is, if that component is in the median probability model (Barbieri &

Berger, 2004). For the criteria computed by INLA, we select the model with the lowest DIC and WAIC values, and the highest marginal likelihood, respectively. For the three criteria computed by INLA, we consider the INLA default prior specification as well as our proposed approximate reference (AR) prior and half-Cauchy (HC) prior.

Because currently the most widely used criteria for Bayesian selection of GLMMs are the DIC and WAIC computed with INLA default priors, here we compare these criteria with our ARM and HCM. We present a comparison of our methods ARM and HCM to DIC and WAIC computed using our AR and HC priors in Section D4 of Web Appendix D. The conclusions are similar to those for DIC and WAIC computed with default INLA priors presented here. Figure 1 presents the probability of each competing method selecting the correct covariate structure as a function of the value of their regression coefficients 𝛽1 = 𝛽2. Here, there are spatial random effects with 𝜏1 = 0.05 and overdispersion random effects with 𝜏2 = 0.05. Three sample sizes are considered: 𝑛 = 100, 400, 900. Two values for the intercept are considered: 𝛽0 = 1 and 4. Figure 1 shows that the ARM and HCM perform much better than the DIC and the WAIC computed with INLA's default priors. For example, in the most challenging case considered with 𝑛 = 100 and 𝛽0 = 1, the ARM and HCM have a higher probability than the DIC and WAIC of selecting the correct covariate structure when their regression coefficients 𝛽1 and 𝛽2 are zero. In addition, as the value of 𝛽1 = 𝛽2 increases, the probability that the ARM and HCM correctly select the true non-null covariates 𝑥1 and 𝑥2 increases more quickly than that of the DIC and the WAIC. Finally, the probability that ARM and HCM correctly select the two non-null regressors gets much closer to 1 than those of the DIC and WAIC as the sample size increases and as the intercept value increases. As the sample size increases, the probability of ARM and HCM detecting covariates with small coefficients increases substantially. For example, the left panels of Figure 1 show that when the intercept is equal to 1, the probabilities of our proposed methods choosing the correct covariate structure when the coefficient is equal to 0.1 are about 10%, 60%, and 90% for sample sizes 100, 400, and 900, respectively.

Figure 2 investigates the impact of different magnitudes of the variance components on the probability of selecting the correct covariate structure as a function of the value of the regression coefficients 𝛽1 = 𝛽2. Figure 2a,b presents settings with small (𝜏1 = 0.01 and 𝜏2 = 0) and large (𝜏1 = 1 and 𝜏2 = 1) variance components, respectively. In both panels, the sample size is 𝑛 = 400 and the intercept is 𝛽0 = 1. In the small variance components setting, ARM and HCM perform comparably to the DIC and WAIC for small values of 𝛽1 = 𝛽2, but our methods ARM and HCM greatly outperform the DIC and WAIC for moderate to large values of 𝛽1 = 𝛽2. Meanwhile, in the more challenging large variance components setting, when 𝛽1 = 𝛽2 = 0, our ARM and HCM correctly select the model with no regressor for 100% of the simulated datasets. In contrast, when 𝛽1 = 𝛽2 = 0, the DIC and WAIC select the wrong covariate structure for 20% of the simulated datasets, respectively. Finally, as the magnitude of 𝛽1 = 𝛽2 increases, in comparison to the DIC and WAIC, ARM and HCM achieve much higher probabilities of selecting the correct model.

Figure 3 presents the probability of selecting the correct spatial random effects structure as a function of the value of the variance component for the spatial random effects. Results are shown for sample sizes 𝑛 = 100, 400, and 900, and variance of overdispersion random effects 𝜏2 = 0 and 0.1. Figure 3 shows that the DIC and WAIC have low probability of selecting the model with no spatial random effects when the true model does not have spatial random effects. In addition, this performance does not improve much as the sample size increases from 400 to 900. In contrast, our methods ARM and HCM have large probabilities of selecting the correct spatial random effects structure when the true model does not have spatial random effects, and have large probabilities of selecting spatial random effects when the variance component for the spatial random effects is large. Finally, the performance of ARM and HCM at correctly detecting spatial dependence greatly improves as the sample size increases.

The performance of ARM, HCM, DIC, and WAIC when selecting overdispersion random effects is similar to their performance when selecting spatial random effects. Web Figure S1 in the Supporting information presents the probability of selecting the correct overdispersion structure as a function of the value of the variance for overdispersion. Web Figure S1 shows that the DIC and WAIC have low probability of selecting a model with no overdispersion random effects even when overdispersion is not present in the true model, and this undesirable performance does not improve much when the sample size increases. In contrast, our methods ARM and HCM have large probabilities of selecting the correct overdispersion structure when overdispersion is not present in the true model, and have large probabilities of selecting overdispersion when the proportion of variance due to overdispersion is large. Finally, the performance of ARM and HCM at correctly detecting overdispersion greatly improves as the sample size increases.

Web Figures S12–S14 present a comparison of the performance of INLA marginal likelihood with our ARM and HCM methods. Web Figure S12 shows that INLA marginal likelihood with INLA's default priors is worse than our

FIGURE 1 Probability of selecting the correct covariate structure as a function of the value of the regression coefficient. Settings: 𝜏1 = 0.05, 𝜏2 = 0.05, 𝑛 = 100 (top row), 𝑛 = 400 (middle row), 𝑛 = 900 (bottom row), and 𝛽0 = 1 (left column), 𝛽0 = 4 (right column).

methods at selecting covariates when coefficients of covariates are small. INLA marginal likelihood, with either INLA's default prior or our proposed priors, is better than our methods ARM and HCM when the regression coefficient is large. For spatial random effects inclusion, Web Figure S13 shows that INLA marginal likelihood with any of the considered priors has difficulty detecting spatial random effects. For overdispersion random effects, Web Figure S14 shows that when there are no spatial random effects in the model, INLA marginal likelihood can correctly select overdispersion random effects. However, when there are spatial random effects in the model, marginal likelihood computed by INLA cannot correctly select overdispersion random effects. In summary, INLA marginal likelihood with our proposed priors works well for selection of regressors but does not work well for the selection of random effects. Meanwhile, our ARM and HCM methods work well in both cases.
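
The selection rule used throughout this simulation study (posterior model probabilities from fractional Bayes factors, then posterior inclusion probabilities thresholded at 0.5) can be sketched in Python as follows. This is an illustrative sketch only and is not the GLMMselect implementation: the component labels, the function names, the uniform prior over models, and the toy log-FBF values are all assumptions introduced here. In practice the log FBFs would be computed from the fractional integrated likelihoods.

```python
import itertools
import math

# Hypothetical labels: four covariates plus two random-effect components,
# giving 2^6 = 64 candidate models as in the simulation study.
COMPONENTS = ["x1", "x2", "x3", "x4", "spatial", "overdispersion"]


def posterior_model_probs(log_fbf, prior=None):
    """Normalize BF^b_{cl} * P(M_c) over all models, working on the log scale."""
    models = list(log_fbf)
    if prior is None:
        # Assumption: uniform prior over the model space.
        prior = {m: 1.0 / len(models) for m in models}
    log_w = {m: log_fbf[m] + math.log(prior[m]) for m in models}
    shift = max(log_w.values())  # subtract the maximum to avoid overflow
    w = {m: math.exp(v - shift) for m, v in log_w.items()}
    total = sum(w.values())
    return {m: v / total for m, v in w.items()}


def inclusion_probs(post, components):
    """Posterior inclusion probability: total mass of models containing a component."""
    return {c: sum(p for m, p in post.items() if c in m) for c in components}


def median_probability_model(incl, threshold=0.5):
    """Keep every component whose inclusion probability is at least the threshold."""
    return frozenset(c for c, p in incl.items() if p >= threshold)


# Enumerate all 64 models as sets of included components.
models = [
    frozenset(c for c, keep in zip(COMPONENTS, mask) if keep)
    for mask in itertools.product([False, True], repeat=len(COMPONENTS))
]

# Toy log FBFs (made up): each component that differs from a hypothetical
# "true" model costs 3 units on the log scale.
truth = frozenset({"x1", "x2", "spatial"})
log_fbf = {m: -3.0 * len(m ^ truth) for m in models}

post = posterior_model_probs(log_fbf)
incl = inclusion_probs(post, COMPONENTS)
mpm = median_probability_model(incl)
print(sorted(mpm))
```

Working on the log scale and subtracting the maximum before exponentiating is a design choice made here for numerical stability; it leaves the normalized probabilities unchanged because the common factor cancels in the ratio.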

FIGURE 2 Probability of selecting the correct covariate structure as a function of the value of the regression coefficient. Settings: (a) 𝜏1 = 0.01 and 𝜏2 = 0, and (b) 𝜏1 = 1 and 𝜏2 = 1, both with sample size 𝑛 = 400 and intercept value 𝛽0 = 1. (a) Weak dependence structure. (b) Strong dependence structure. Dependence structure can affect our method's performance for detecting covariates with small coefficients. However, the DIC and WAIC have difficulty detecting covariates even with large coefficients when spatial dependence is strong.

6 CASE STUDIES

6.1 Longitudinal epilepsy seizure data

We analyze a dataset on epilepsy seizures previously analyzed by Thall and Vail (1990), Breslow and Clayton (1993), and others. The data were collected in four biweekly visits of 59 epileptics during a clinical trial to evaluate the effectiveness of a drug to control seizures (Leppik et al., 1987). The response variable is the number of seizures 𝑦𝑖𝑗 for patient 𝑖 on visit 𝑗. The most general model we consider is $y_{ij} \mid \mu_{ij} \overset{\text{ind}}{\sim} \text{Poisson}(\mu_{ij})$, with $\log(\mu_{ij}) = \boldsymbol{x}_{ij}^{\top}\boldsymbol{\beta} + \alpha_{1i} + z_j \alpha_{2i} + \alpha_{3ij}$, $\boldsymbol{\alpha}_1 \sim N(\boldsymbol{0}, \tau_1 I_{59})$, $\boldsymbol{\alpha}_2 \sim N(\boldsymbol{0}, \tau_2 I_{59})$, and $\boldsymbol{\alpha}_3 \sim N(\boldsymbol{0}, \tau_3 I_{236})$, $i = 1, \ldots, 59$ and $j = 1, \ldots, 4$, where $\boldsymbol{x}_{ij}$ denotes a six-dimensional vector with a one for intercept and five covariates. The 59 subjects were randomly assigned to a new drug or a placebo. The first covariate is the treatment indicator (Trt), where Trt=1 indicates that the patient received the treatment and Trt=0 indicates that the patient received the placebo. The second covariate is the baseline level of seizures (Base), equal to the logarithm of the average number of epileptic seizures per two weeks recorded in the 8-week period before the treatment. The third covariate is the interaction term of Base and Trt. The fourth covariate is the logarithm of age (Age). The fifth covariate is the visit number (Visit), with the four visits coded as −3, −1, 1, and 3. Breslow and Clayton (1993) mentioned that preliminary analysis indicated that the counts were substantially lower during the fourth visit. Thus, they also define a binary variable V4, such that V4=1 indicates the fourth visit and V4=0 indicates the other visits. In the model above, $\boldsymbol{\beta}$ is the vector of regression coefficients, $\boldsymbol{\alpha}_1 = (\alpha_{11}, \ldots, \alpha_{1\,59})$ is the vector of patient-specific random effects, $\boldsymbol{\alpha}_2 = (\alpha_{21}, \ldots, \alpha_{2\,59})$ is the vector of patient-specific random effects for the slope of the variable Visit with $\boldsymbol{z} = (-0.3, -0.1, 0.1, 0.3)$, and $\boldsymbol{\alpha}_3 = (\alpha_{311}, \ldots, \alpha_{3\,59\,1}, \alpha_{312}, \ldots, \alpha_{3\,59\,2}, \ldots, \alpha_{314}, \ldots, \alpha_{3\,59\,4})$ is the vector of overdispersion random effects.

The covariates Trt, Base, Age, and Visit can be included in the model independently. However, the interaction term between Trt and Base is only included when both Trt and Base are in the model. Thus, there are 20 possible combinations of covariates. For the dependence structure, we follow the four cases considered by Breslow and Clayton (1993): no random effects in the model; only patient-specific random effects 𝛼1; 𝛼1 and overdispersion random effects 𝛼3; 𝛼1 and patient-specific random effects for the slope of the variable Visit 𝛼2. Finally, we assume that the vectors of random effects 𝛼1, 𝛼2, and 𝛼3 are independent. Therefore, the model space has 80 models, composed of 20 combinations of covariates and 4 possible settings of random effects.

TABLE 1 Epilepsy data: posterior inclusion probabilities of fixed and random effects.

                Variable      ARM    HCM
Fixed effect    Base          1      1
                Trt           0.14   0.04
                Trt × base    0      0
                Age           0.03   0.01
                V4            0.12   0.11
Random effect   𝛼1            1      1
                𝛼2            0      0
                𝛼3            1      1

Abbreviations: ARM, approximate reference method; HCM, half-Cauchy method.

Table 1 presents the posterior inclusion probabilities of the fixed and random effects. Both the ARM and the HCM

FIGURE 3 Probability of selecting the correct spatial random effects structure as a function of the value of variance component for spatial random effects. Settings: 𝛽0 = 2, 𝛽1 = 𝛽2 = 1, 𝑛 = 100 (top row), 𝑛 = 400 (middle row), 𝑛 = 900 (bottom row), and 𝜏2 = 0.1 (left column), 𝜏2 = 0 (right column). If the spatial variance proportion is zero then there is no vector of spatial random effects in the model, and the correct decision is to not select the vector of spatial random effects.

indicate that the baseline level of seizures (Base) should be included in the model. However, the posterior inclusion probabilities do not provide support for any of the other covariates. Further, both ARM and HCM strongly indicate that 𝛼2, the patient-specific random effects for the slope of the variable Visit, should not be included in the model. Finally, both ARM and HCM strongly indicate the need to include the patient-specific random effect 𝛼1 and overdispersion random effect 𝛼3.

Web Table S1 in the Supporting information presents a summary of the model selection results for the epilepsy data by comparing methods ARM, HCM, DIC, and WAIC. A check mark appears next to the effects (rows) selected by each method (column). In addition, Web Table S1 presents the selection of fixed effects and variance components based on estimates and standard errors reported by Breslow and Clayton (1993) for two models fitted with PQL, which we denote by PQL1 and PQL2. Web Table S2

presents estimates and standard errors for the parameters based on the full model. Model PQL1 includes random effects 𝛼1 and 𝛼2, while Model PQL2 includes random effects 𝛼1 and 𝛼3. Interestingly, while the original PQL method cannot choose between Model PQL1 and Model PQL2, our ARM and HCM clearly show that the data support exclusion of random effect 𝛼2 and inclusion of random effects 𝛼1 and 𝛼3. Further, the DIC and WAIC agree with the ARM and HCM and also choose random effects 𝛼1 and 𝛼3. Furthermore, in terms of fixed effects the DIC and WAIC are the least parsimonious, choosing Base, Trt, and Trt×Base, while PQL chooses Base and Trt. Finally, the ARM and HCM are the most parsimonious and choose only the Base covariate.

In addition to selecting more parsimonious models, our ARM and HCM provide more definitive support for the inclusion or exclusion of each effect in the form of Bayesian posterior probabilities. For example, the posterior inclusion probabilities of the patient-level random effects 𝛼1, overdispersion random effects 𝛼3, and the covariate Base are all equal to 1. Further, there is much less support for the covariate V4, which has posterior inclusion probability of 0.12 by the ARM and 0.11 by the HCM. Furthermore, both ARM and HCM provide posterior inclusion probability equal to zero for the interaction between Trt and Base. Finally, the simulation study presented in Section 5 shows that we can rely on the uncertainty quantification provided by the ARM and HCM.

6.2 Spatial lip cancer data

In this section, we present an analysis of the Scottish lip cancer dataset previously analyzed by Clayton and Kaldor (1987), Breslow and Clayton (1993), Ferreira and De Oliveira (2007), among many others. This dataset provides the number of male lip cancer cases in the 56 counties of Scotland during the period 1975–1980, as well as the percentage of the work force employed in agriculture, fishing, or forestry (AFF) as a covariate. The most general model we consider is $y_i \mid \mu_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i)$, $\log(\mu_i) = \log(n_i) + \boldsymbol{x}_i^{\top}\boldsymbol{\beta} + \alpha_{1i} + \alpha_{2i}$, $\boldsymbol{\alpha}_1 \sim N(\boldsymbol{0}, \tau_1 \Sigma)$, and $\boldsymbol{\alpha}_2 \sim N(\boldsymbol{0}, \tau_2 I_{56})$, $i = 1, \ldots, 56$, where $n_i$ is the expected number of lip cancer cases in the 𝑖th county, calculated based on the age distributions by counties. In this analysis, the $n_i$'s are assumed to be known constants. In addition, the vector $\boldsymbol{x}_i$ is a two-dimensional vector with one as the first element and AFF for the 𝑖th county as the second element. Further, 𝛼1 is a vector of spatial random effects following a sum-zero constrained Gaussian intrinsic conditional autoregressive model (Keefe et al., 2018) and modeled by Equation (1). Finally, 𝛼2 is a vector of overdispersion random effects.

There are two possible combinations for the fixed effects: with or without the covariate AFF. For the random effects, we follow the options considered by Breslow and Clayton (1993). When 𝛼1 and 𝛼2 are in the model at the same time, the PQL estimate of the overdispersion variance 𝜏2 is 0. Thus, we consider models with only three random effects combinations: spatial random effects 𝛼1, overdispersion random effects 𝛼2, and no random effects.

TABLE 2 Lip cancer data: posterior inclusion probabilities of fixed and random effects.

                Variable   ARM    HCM
Fixed effect    AFF        0.93   0.92
Random effect   𝛼1         1      1
                𝛼2         0      0

Abbreviations: AFF, agriculture, fishing, or forestry; ARM, approximate reference method; HCM, half-Cauchy method.

Table 2 presents the posterior inclusion probabilities for the fixed and random effects. Both the ARM and HCM select the model with the covariate AFF and spatial random effect 𝛼1. Web Table S3 in the Supporting information presents the DIC and WAIC for the six models considered. In contrast to the results of the ARM and HCM, DIC and WAIC select the model without the covariate AFF but with spatial random effect 𝛼1. Web Table S4 in the Supporting information summarizes model selection results for the ARM, HCM, DIC, WAIC, as well as the selection of model components based on PQL methods reported by Breslow and Clayton (1993) for two models: PQL1 includes 𝛼1 and PQL2 includes 𝛼2. Results from PQL for the AFF regressor agree with the results from the HCM and ARM. An advantage of the HCM and ARM over PQL is that they clearly indicate that the model should include a spatial random effect and not include overdispersion.

7 DISCUSSION

We have proposed a novel Bayesian method for model selection for GLMMs. Our approach is based on a pseudo-likelihood approximation of GLMMs by LMMs leading to a closed form solution for integrating out the random effects. We consider two priors for the model parameters. First, we use an approximate reference prior that is uniform for the fixed effects and has the tail behavior of the half-Cauchy prior for the variance parameters. Second, while keeping the improper flat prior for the fixed effects, we consider the half-Cauchy prior for the square root of the variance parameters (Gelman, 2006; Polson & Scott, 2012). Finally, to deal with the prior impropriety we have developed an FBF approach.

The simulation study has shown that our proposed methods ARM and HCM perform well for correctly

selecting both covariates and dependence structure. ARM and HCM assign high posterior inclusion probability to covariates with large coefficients and also high posterior inclusion probability to random effects with large variance components. In particular, ARM and HCM are better than DIC and WAIC at correctly selecting covariates. In cases where random effects have large variances, the ability of DIC and WAIC to correctly select covariates is tremendously reduced. In contrast, ARM and HCM do not suffer as badly when selecting covariates in the presence of random effects with large variances. In addition, DIC and WAIC have high false positive rates and often select null fixed and random effects. In contrast, ARM and HCM correctly assign low posterior inclusion probability to null covariates and to null random effects. We also compared our methods with marginal likelihood computed by INLA. Our results show that when we use INLA with our priors instead of the default INLA priors, the marginal likelihood computed by INLA and the marginal likelihood computed by our pseudo-likelihood approach work similarly for the selection of regression coefficients. However, the marginal likelihood computed by INLA does not work well for the selection of spatial random effects and overdispersion random effects. Therefore, it seems that our pseudo-likelihood approximation works better than the INLA approximation to the marginal likelihood for the selection of random effects.

We illustrate the application of our proposed methods ARM and HCM with three case studies. In the first case study, we consider epilepsy seizures as a type of longitudinal count data. ARM and HCM are more parsimonious, selecting the baseline covariate, patient-level random effects, and overdispersion random effects. DIC and WAIC select two more covariates: treatment and the interaction term between baseline and treatment. In the second case study, we study Scottish lip cancer data as a type of spatial count data. Our methods ARM and HCM select spatial dependence and covariate AFF, whereas DIC and WAIC select the model without covariate AFF but include spatial random effects. In the third case study, presented in Web Appendix C, we look at binary salamander mating data. For fixed effects, our methods ARM and HCM select WSF and WSF×WSM, whereas DIC and WAIC select all three covariates. For random effects, our two methods ARM and HCM give entirely different results from DIC and WAIC: while DIC and WAIC select the male random effect, our methods ARM and HCM select the female random effect. Given the results from the simulation study, we recommend the models selected by ARM and HCM.

There are many potential avenues for future research. One possible future research topic is the use of Bayesian model averaging for computing credible intervals for regression coefficients. This can be facilitated by the fact that our methods provide posterior probabilities for different models. Another promising research direction is the use of nonlocal priors for the fixed effects. Finally, another possible research topic is model selection for GLMMs when the number of possible regressors is much larger than the sample size. We are currently working on the latter two research topics and will report the results in a future paper.

ACKNOWLEDGMENTS
The work of Ferreira and Xu was supported in part by National Science Foundation awards DMS 1853549 and DMS 2054173. Computations for this paper have been performed on supercomputers of Advanced Research Computing at Virginia Tech. The authors would like to thank the associate editor and an anonymous referee for their insightful comments that helped substantially improve this paper.

DATA AVAILABILITY STATEMENT
The datasets analyzed in this paper are available in the R package mdhglm (Lee et al., 2018).

ORCID
Marco A. R. Ferreira https://orcid.org/0000-0002-4705-5661

REFERENCES
Banerjee, S., Carlin, B.P. & Gelfand, A.E. (2014) Hierarchical modeling and analysis for spatial data (2nd ed.). Boca Raton: Chapman & Hall.
Barbieri, M.M. & Berger, J.O. (2004) Optimal predictive model selection. The Annals of Statistics, 32(3), 870–897.
Breslow, N.E. & Clayton, D.G. (1993) Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421), 9–25.
Cai, B. & Dunson, D.B. (2006) Bayesian covariance selection in generalized linear mixed models. Biometrics, 62(2), 446–457.
Clayton, D. & Kaldor, J. (1987) Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics, 671–681.
Ferreira, M.A.R. & De Oliveira, V. (2007) Bayesian reference analysis for Gaussian Markov random fields. Journal of Multivariate Analysis, 98(4), 789–812.
Ferreira, M.A.R., Porter, E.M. & Franck, C.T. (2021) Fast and scalable computations for Gaussian hierarchical models with intrinsic conditional autoregressive spatial random effects. Computational Statistics and Data Analysis, 162, 107264.
Gelman, A. (2006) Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3), 515–534.
Keefe, M.J., Ferreira, M.A.R. & Franck, C.T. (2018) On the formal specification of sum-zero constrained intrinsic conditional autoregressive models. Spatial Statistics, 24, 54–65.

Keefe, M.J., Ferreira, M.A.R. & Franck, C.T. (2019) Objective Bayesian analysis for Gaussian hierarchical models with intrinsic conditional autoregressive priors. Bayesian Analysis, 14(1), 181–209.
Lee, Y., Molas, M. & Noh, M. (2018) The mdhglm package. https://CRAN.R-project.org/package=mdhglm.
Leppik, I.E., Dreifuss, F.E., Porter, R., Bowman, T., Santilli, N., Jacobs, M., Crosby, C., Cloyd, J., Stackman, J., Graves, N., Sutula, T., Welty, T., Vickery, J., Brundage, R., Gates, J., Gumnit, R.J. & Gutierrez, A. (1987) A controlled study of progabide in partial seizures: methodology and results. Neurology, 37(6), 963–963.
Liu, Z., Berrocal, V.J., Bartsch, A.J. & Johnson, T.D. (2016) Pre-surgical fMRI data analysis using a spatially adaptive conditionally autoregressive model. Bayesian Analysis, 11, 599–625.
Meyer, S., Held, L. & Höhle, M. (2017) Spatio-temporal analysis of epidemic phenomena using the R package surveillance. Journal of Statistical Software, 77, 1–55.
Nouvellet, P., Bhatia, S., Cori, A., Ainslie, K.E.C., Baguelin, M., Bhatt, S., Boonyasiri, A., Brazeau, N.F., Cattarino, L., Cooper, L.V., Coupland, H., Cucunuba, Z.M., Cuomo-Dannenburg, G., Dighe, A., Djaafara, B.A., Dorigatti, I., Eales, O.D., van Elsland, S.L., Nascimento, F.F., . . . Donnelly, C.A. (2021) Reduction in mobility and COVID-19 transmission. Nature Communications, 12(1), 1–9.
O'Hagan, A. (1995) Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 99–118.
Polson, N.G. & Scott, J.G. (2012) On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4), 887–902.
Porter, E.M., Franck, C.T. & Ferreira, M.A.R. (2023) Objective Bayesian model selection for spatial hierarchical models with intrinsic conditional autoregressive priors. Bayesian Analysis, 1–27, https://doi.org/10.1214/23-BA1375
Rue, H., Martino, S. & Chopin, N. (2009) Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2), 319–392.
Sauter, R. & Held, L. (2015) Network meta-analysis with integrated nested Laplace approximations. Biometrical Journal, 57(6), 1038–1050.
Scott, J.G. & Berger, J.O. (2010) Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 2587–2619.
Spiegelhalter, D.J., Best, N.G., Carlin, B.P. & Van Der Linde, A. (2002) Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4), 583–639.
Tawiah, R., McLachlan, G.J. & Ng, S.K. (2020) A bivariate joint frailty model with mixture framework for survival analysis of recurrent events with dependent censoring and cure fraction. Biometrics, 76(3), 753–766.
Ten Eyck, P. & Cavanaugh, J.E. (2018) An alternate approach to pseudo-likelihood model selection in the generalized linear mixed modeling framework. Sankhya B, 80(1), 98–122.
Thall, P.F. & Vail, S.C. (1990) Some covariance models for longitudinal count data with overdispersion. Biometrics, 46, 657–671.
Tredennick, A.T., Hooker, G., Ellner, S.P. & Adler, P.B. (2021) A practical guide to selecting models for exploration, inference, and prediction in ecology. Ecology, 102(6), e03336.
Watanabe, S. (2010) Asymptotic equivalence of Bayes cross-validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(12), 3571–3594.
Williams, J., Ferreira, M.A.R. & Ji, T. (2022) BICOSS: Bayesian iterative conditional stochastic search for GWAS. BMC Bioinformatics, 23, 475.
Wolfinger, R. & O'Connell, M. (1993) Generalized linear mixed models: a pseudo-likelihood approach. Journal of Statistical Computation and Simulation, 48(3–4), 233–243.
Xu, D., Chatterjee, A. & Daniels, M. (2016) A note on posterior predictive checks to assess model fit for incomplete data. Statistics in Medicine, 35(27), 5029–5039.
Xu, S., Ferreira, M.A.R., Porter, E.M. & Franck, C.T. (2023) The GLMMselect package. Available from: https://CRAN.R-project.org/package=GLMMselect [Accessed 20th April 2023].

SUPPORTING INFORMATION
Web Appendices, Tables, and Figures referenced in Sections 1, 3, 5, and 6 are available with this paper at the Biometrics website on Wiley Online Library. The R package GLMMselect available at https://CRAN.R-project.org/package=GLMMselect implements our ARM and HCM methods. In addition, the source code of the R package GLMMselect and a vignette html file that analyzes the lip cancer dataset are available with this paper at the Biometrics website on Wiley Online Library.

How to cite this article: Xu, S., Ferreira, M.A.R., Porter, E.M. & Franck, C.T. (2023) Bayesian model selection for generalized linear mixed models. Biometrics, 79, 3266–3278. https://doi.org/10.1111/biom.13896
