Model Selection and Model Averaging
Charles J. Geyer
October 13, 2021
1 License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).
2 R
The version of R used to make this document is 4.1.0.
3 Information Criteria
3.1 AIC
In the early 1970s Akaike proposed the first information criterion. Later many others were proposed, so Akaike’s is now called the Akaike information criterion (AIC) (https://en.wikipedia.org/wiki/Akaike_information_criterion).
Its formula is

$$\mathrm{AIC} = \mathrm{LRT} + 2 p$$

where LRT is the likelihood ratio test statistic comparing the model to the saturated model (in this context, the deviance) and $p$ is the number of parameters of the model.
All of this is a “large sample size” result based on the “usual” asymptotics
of maximum likelihood, which is not valid for all statistical models.
But it
is always valid for exponential family models in general and
categorical data analysis in particular (when sample size is “large”).
So the theory of hypothesis tests is the Wrong Thing (http://www.catb.org/jargon/html/W/Wrong-Thing.html) for comparing many models. And AIC
is the Right Thing (http://www.catb.org/jargon/html/R/Right-Thing.html) or at least a Right Thing.
3.3 BIC
In the late 1970s Schwarz proposed another information criterion, which is now usually called the Bayesian information criterion (BIC) (https://en.wikipedia.org/wiki/Bayesian_information_criterion). Its formula is

$$\mathrm{BIC} = \mathrm{LRT} + \log(n)\, p$$

where $n$ is the sample size. Since $\log(n) \ge 2$ for $n \ge 8$ (because $e^2 \approx 7.39$), BIC penalizes larger models more than AIC does, so BIC selects smaller models (or at least no larger) than AIC.
BIC gets its Bayesian name from the following approximation. If $g$ is a prior distribution on the set $\mathcal{M}$ of models under consideration, then for large $n$ the posterior probability of model $m$ satisfies

$$\Pr(m \mid \text{data}) \approx \frac{\exp\left(-\tfrac{1}{2}\, \mathrm{BIC}(m)\right) g(m)}{\sum_{m^* \in \mathcal{M}} \exp\left(-\tfrac{1}{2}\, \mathrm{BIC}(m^*)\right) g(m^*)} \qquad (3.1)$$
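To make (3.1) concrete, here is a minimal R sketch. The BIC values, the prior g, and the three-model setup are made up for illustration; they do not come from any of the examples below.

bic <- c(105.2, 107.8, 110.1)   # hypothetical BIC values for three models
g <- rep(1 / 3, 3)              # prior probabilities g(m), here flat
# subtract min(bic) before exponentiating to avoid underflow;
# the shift cancels in the normalization
w <- exp(- (bic - min(bic)) / 2) * g
w / sum(w)                      # approximate Pr(m | data) as in (3.1)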
BIC provides consistent model selection when the true unknown model
is among the models under consideration.
Assuming the true unknown model is among the models under consideration (and Bayesians have to assume this: a model not among those under consideration has prior probability zero, hence posterior probability zero), selecting the model with the smallest BIC will select the true unknown model with probability that goes to one as n goes to infinity. Of course, that does not mean BIC is guaranteed to select the correct model at any finite sample size.
In short, use BIC when the true unknown model is assumed to be among the models under consideration, but do not expect too much from it even then. True, BIC selects the correct model with probability going to one as sample size goes to infinity, but for any finite sample size this probability may be small. And AIC does not even guarantee that.
Also, model selection is just the Wrong Thing if you think like a Bayesian.
Bayesians apply Bayes’ rule, and that provides posterior probabilities,
which are approximated by (3.1).
So Bayesians should not select models, they should average over all models
according to posterior probability.
This is called Bayesian model averaging (BMA).
For a good introduction see the paper by Hoeting, et al. (1999, doi:10.1214/ss/1009212519 (https://doi.org/10.1214/ss/1009212519)). Unfortunately, that paper had a “printing malfunction” that caused a lot of minus signs and left parentheses to be omitted; a corrected version is available at http://www.stat.washington.edu/www/research/online/hoeting1999.pdf.
Suppose $W$ is some quantity of interest whose meaning is the same in all models. Combining (3.1) with a plug-in estimate within each model gives

$$E(W \mid \text{data}) \approx \frac{\sum_{m \in \mathcal{M}} E(W \mid m, \hat{\theta}_m) \exp\left(-\tfrac{1}{2}\, \mathrm{BIC}(m)\right) g(m)}{\sum_{m^* \in \mathcal{M}} \exp\left(-\tfrac{1}{2}\, \mathrm{BIC}(m^*)\right) g(m^*)} \qquad (5.1)$$
for large $n$. The logic is that for large sample sizes the posterior distribution of the within-model parameter $\theta_m$ is concentrated near the maximum likelihood estimate $\hat{\theta}_m$. So $E(W \mid m, \hat{\theta}_m)$ is a good approximation to $E(W \mid m)$.
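As a hedged illustration of (5.1), continuing the made-up numbers above: the BMA estimate is just a weighted average of the within-model plug-in estimates.

bic <- c(105.2, 107.8, 110.1)   # hypothetical BIC values
g <- rep(1 / 3, 3)              # flat prior on models
what <- c(2.3, 2.1, 2.6)        # hypothetical E(W | m, theta-hat) for each model
w <- exp(- (bic - min(bic)) / 2) * g
w <- w / sum(w)
sum(what * w)                   # approximate E(W | data) as in (5.1)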
Frequentists can also play the model averaging game. Even though they do not buy Bayesian arguments, they can see that model selection is a mug’s game. If model selection is very unlikely to pick the best model (which is always the case when there are many models and the sample size is not humongous), then averaging over good models is better.
There are many proposed ways to do frequentist model averaging (FMA) in the literature, but one simple way (and the only way we will cover in this course) is to replace BIC with AIC in (5.1), giving
$$\hat{w}_{\mathrm{FMA}} \approx \frac{\sum_{m \in \mathcal{M}} E(W \mid m, \hat{\theta}_m) \exp\left(-\tfrac{1}{2}\, \mathrm{AIC}(m)\right) g(m)}{\sum_{m^* \in \mathcal{M}} \exp\left(-\tfrac{1}{2}\, \mathrm{AIC}(m^*)\right) g(m^*)} \qquad (5.2)$$
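The computation is the same as for (5.1) with AIC in place of BIC. A sketch with made-up AIC values and a flat g:

aic <- c(98.7, 99.4, 103.0)        # hypothetical AIC values
what <- c(2.3, 2.1, 2.6)           # hypothetical E(W | m, theta-hat) for each model
w <- exp(- (aic - min(aic)) / 2)   # g(m) = 1 for all m
w <- w / sum(w)
sum(what * w)                      # the FMA estimate (5.2)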
6 Examples
We will redo some of the examples from other handouts that already used R function glmbb.

library(CatDataAnalysis)
library(glmbb)
data(table_9.3)
# the glmbb call that fit all the models is not shown here;
# it produced the object out used below
summary(out)
##
## Search was for all models with AIC no larger than min(AIC) + 10
##
sout <- summary(out)
sout$results
# reconstruct the weights from the criterion values; subtracting the
# minimum before exponentiating avoids underflow and cancels in the
# normalization
w <- exp(- (sout$results$criterion - min(sout$results$criterion)) / 2)
w <- w / sum(w)
all.equal(sout$results$weight, w)
## [1] TRUE
# the same search redone with criterion = "BIC" (call not shown) produced out.bic
summary(out.bic)
##
## Search was for all models with BIC no larger than min(BIC) + 10
##
We see that BIC is much more favorable to the smaller (more parsimonious) model than AIC is. BIC puts probability 0.97535 on the “best” model, but AIC puts only probability 0.69275 on this model. In this example they agree on the “best” model, but generally these two criteria do not agree.
rm(list = ls())
# data entry; only the first line of the count vector survives in this excerpt
count <- c(7287, 11587, 3246, 6134, 10381, 10969, 6123, 6693,
library(glmbb)
# the glmbb fit (a Poisson log-linear model; the call is not shown except
# for its last argument, family = "poisson") produced the object out.aic
summary(out.aic)
##
## Search was for all models with AIC no larger than min(AIC) + 10
##
Now we know how to interpret the weights. We see that no model gets
a majority of the weight. None is highly probable to be the best
model.
There are just too many models under consideration for that to
happen.
This, of course, assumes that in FMA we want to use a flat “prior”
(because frequentists don’t like priors), that is, we want g(m) = 1
for all m in
(5.2).
# the corresponding search with criterion = "BIC" (call not shown) produced out.bic
summary(out.bic)
##
## Search was for all models with BIC no larger than min(BIC) + 10
##
BIC does put the majority of the posterior probability on one model
(assuming flat prior on models).
BIC plus Occam’s window spreads the posterior probability over fewer models than AIC plus Occam’s window does.
It is not very obvious how to do FMA without refitting all the models, but all the model fits are stored in the R environment out.aic$envir.
ls(envir = out.aic$envir)
## [1] "min.crit"
## [2] "sha1.0b21bc716a36c9104619cad165cd8e1ee59eaa67"
## [3] "sha1.0bc71e279edad3fdfd08de9a0b9e0ae111052d6e"
## [4] "sha1.1178bbd4b48158d0044b3a1347e41003680ed4cb"
## [5] "sha1.1394de7910da632a1171a5370b549df8b89df21c"
## [6] "sha1.17fd74be284b4fb9f9500b30852ef16f55a5951d"
## [7] "sha1.2d0982bf246f88e37a97b49bdff7be4da5f4e1e2"
## [8] "sha1.34b3afacb24ed6a05f656993097cd2466e194c33"
## [9] "sha1.34b64a787d14cbac65556ecb290973c935a63910"
## [10] "sha1.364675bd9f91f1dd81e9b4f78723c7c893d02fb8"
## [11] "sha1.39061969f2e696a74694e6e643c158b68d571886"
## [12] "sha1.39cd3d0327e6002b8be8fad301dfbaafb2c24eaa"
## [13] "sha1.4855a747811e1cb6a94a5275898fb6497b0bebe4"
## [14] "sha1.4d6583bbf0d0862415f7bacafd6b00679c515c0c"
## [15] "sha1.527e582a9d862405ddf2c05b6f09c890212c3c0d"
## [16] "sha1.66b77d3f388eae1ce33b62b466e8f311bffa469a"
## [17] "sha1.66cfba8ba57104e87efbfc4ba75ca67551816543"
## [18] "sha1.6aa8694584f0ebadef108c9e6684b32fc9aef678"
## [19] "sha1.6f40df9c7712ee3618c90d56f68d7162b5d06e1a"
## [20] "sha1.724e48559c4e25c9dff6641a26069457726036c2"
## [21] "sha1.7ac914cd162e41bc044ba9fcdb9bddb00182abcf"
## [22] "sha1.7f9bf47ef1ae0aecfaa935614a3164c4cf1d385f"
## [23] "sha1.893a4536650ea9f4ec0f81782104f20e335738a9"
## [24] "sha1.8bca01f0d7b4ba675cb4e25bacfb13f06ce1bf29"
## [25] "sha1.932904ab5c59667c0edab04b0aff8e87d5eef820"
## [26] "sha1.a1f53c04447266ee85267edfcd62af50dedafa41"
## [27] "sha1.a449f95d9878915ee202f828977904a33819389b"
## [28] "sha1.a9eb2e4d8dbbfd65464afdc18c87c99ef16585aa"
## [29] "sha1.bd24727fa981859346a3b3ebf6fd8f1935d7f5c7"
## [30] "sha1.cba6840ac60dcc46de77bbf74a4c06e5e485ebb4"
## [31] "sha1.cbe1125be84ccb49411ef716ffe59218bb01436f"
## [32] "sha1.d21240256fc4cc009cb9b6ecd2ae465d2bc920b3"
## [33] "sha1.d9dd41b6107367ab849386daddf27f856bac8e82"
## [34] "sha1.db6cf35ff8bebb4c28e8baec502ae4e5c1365d35"
## [35] "sha1.e2f8395790cec12e6b92e36a338b2dbba7093932"
## [36] "sha1.f02e9c53c00058688f875f16b7024ce5d063e84c"
## [37] "sha1.f24e60e6005914cdeda0fdc45cabfcb700e2c567"
## [38] "sha1.f7cc3ef833144b019cbd8726f7865841afce6b64"
All of the R objects whose names begin with sha1. are the model fits.
The object min.crit is the same as out.aic$min.crit:
identical(out.aic$min.crit, out.aic$envir$min.crit)
## [1] TRUE
And, although this is not documented, we can look in the code for R function
summary.glmbb to see how to extract the criteria from these models.
min.crit <- out.aic$min.crit
e <- out.aic$envir
rm("min.crit", envir = e)
# the code extracting the vector criterion (one AIC value per stored fit)
# and the logical vector is.in.window (models in Occam's window) is not
# shown here
w <- criterion[is.in.window]
w <- w - min(w)   # subtract the minimum to avoid underflow in exp
w <- exp(- w / 2)
w <- w / sum(w)
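The FMA expected values are then the weighted average of the fitted.values components of the models in Occam’s window. A minimal sketch, assuming fits is the list of those fitted-value vectors, in the same order as w (its extraction from the environment is not shown above):

# multiply each model's fitted values by its weight and sum over models
expected.fma <- Reduce("+", Map("*", fits, w))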
# the data frame foo (construction not shown) combines the observed counts
# with the model-averaged expected values
print(foo, digits=5)
## injury gender location seat.belt observed expected.fma
And, if we redo the same calculation using BMA rather than FMA, we get
min.crit <- out.bic$min.crit
e <- out.bic$envir
rm("min.crit", envir = e)
# criterion and is.in.window are recomputed for out.bic the same way (not shown)
w <- criterion[is.in.window]
w <- w - min(w)   # avoid underflow, as before
w <- exp(- w / 2)
w <- w / sum(w)
print(foo, digits=5)
Even though the fitted.values components of the model fits are the same in FMA and BMA, the weighted averages are different, because the weights are different and because different numbers of models are given any weight at all in FMA and in BMA.
If one wanted to go on and calculate other things that are functions of these expected values, like the conditional odds ratios that Agresti computes, then those calculations should be based on the FMA or BMA values computed above.
If one is doing FMA, then standard errors for these FMA estimators can
be computed as described by Burnham and Anderson, Section 4.3.2
(https://www.amazon.com/Model-Selection-Multimodel-Inference-Information-Theoretic/dp/0387953647/). But we will not give examples of that
here.
rm(list = ls())
library(CatDataAnalysis)
data(table_8.1)
# lake and food are recoded as factors with meaningful level labels;
# the full calls are truncated in this excerpt
lake = factor(lake,
food = factor(food,
# the glmbb fit (call not shown) produced the object out.aic
summary(out.aic)
##
## Search was for all models with AIC no larger than min(AIC) + 10
##
6.3.2 BIC
And now we can also do BIC (again, the glmbb call with criterion = "BIC" is not shown).
summary(out.bic)
##
## Search was for all models with BIC no larger than min(BIC) + 10
##
This is quite a shock. It says that the empty model, the model that says food is not associated with lake, gender, or size or any interaction of them, is the best model (according to BIC). If you think like a Bayesian, perhaps there is nothing much going on in these data.