Model Selection and Model Averaging
Charles J. Geyer
October 13, 2021
1 License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).
2 R
The version of R used to make this document is 4.1.0.
3 Information Criteria
3.1 AIC
In the early 1970s Akaike proposed the first information criterion. Later many others were proposed, so Akaike’s is now called the Akaike information criterion (AIC) (https://en.wikipedia.org/wiki/Akaike_information_criterion).
Its formula is

$$\mathrm{AIC} = \mathrm{LRT} + 2 p$$

where LRT is the likelihood ratio test statistic comparing the model to the saturated model (in this context, the deviance) and $p$ is the number of parameters of the model.
All of this is a “large sample size” result based on the “usual” asymptotics
of maximum likelihood, which is not valid for all statistical models.
But it
is always valid for exponential family models in general and
categorical data analysis in particular (when sample size is “large”).
So the theory of hypothesis tests is the Wrong Thing (http://www.catb.org/jargon/html/W/Wrong-Thing.html) for comparing many models. And AIC
is the Right Thing (http://www.catb.org/jargon/html/R/Right-Thing.html) or at least a Right Thing.
3.3 BIC
In the late 1970s Schwarz proposed another information criterion, which is now usually called the Bayesian information criterion (BIC) (https://en.wikipedia.org/wiki/Bayesian_information_criterion). Its formula is

$$\mathrm{BIC} = \mathrm{LRT} + \log(n)\, p$$

where $n$ is the sample size. Since $\log(n) \ge 2$ for $n \ge 8$ (because $e^2 \approx 7.39$), BIC penalizes larger models more than AIC does, so BIC selects smaller models (or at least no larger) than AIC.
BIC gets its Bayesian name from the following approximation. If $g$ is a prior distribution on the set $\mathcal{M}$ of models under consideration, then for large $n$ the posterior probability of model $m$ satisfies

$$\Pr(m \mid \text{data}) \approx \frac{\exp\left(-\tfrac{1}{2}\, \mathrm{BIC}(m)\right) g(m)}{\sum_{m^* \in \mathcal{M}} \exp\left(-\tfrac{1}{2}\, \mathrm{BIC}(m^*)\right) g(m^*)} \qquad (3.1)$$
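To make (3.1) concrete, here is a minimal R sketch. The BIC values, the prior g, and the three-model setup are made up for illustration; they do not come from any of the examples below.

bic <- c(105.2, 107.8, 110.1)   # hypothetical BIC values for three models
g <- rep(1 / 3, 3)              # prior probabilities g(m), here flat
# subtract min(bic) before exponentiating to avoid underflow;
# the shift cancels in the normalization
w <- exp(- (bic - min(bic)) / 2) * g
w / sum(w)                      # approximate Pr(m | data) as in (3.1)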
BIC provides consistent model selection when the true unknown model
is among the models under consideration.
Assuming the true unknown model is among the models under consideration (and Bayesians have to assume this: a model not among those under consideration has prior probability zero, hence posterior probability zero), selecting the model with the smallest BIC will select the true unknown model with probability that goes to one as n goes to infinity. Of course, that does not mean BIC is guaranteed to select the correct model at any finite sample size.
In short, use BIC when the true unknown model is assumed to be among the models under consideration, but do not expect too much from it even then. True, BIC selects the correct model with probability going to one as sample size goes to infinity, but for any finite sample size this probability may be small. And AIC does not even guarantee that.
Also, model selection is just the Wrong Thing if you think like a Bayesian.
Bayesians apply Bayes’ rule, and that provides posterior probabilities,
which are approximated by (3.1).
So Bayesians should not select models, they should average over all models
according to posterior probability.
This is called Bayesian model averaging (BMA).
For a good introduction see the paper by Hoeting, et al. (1999, doi:10.1214/ss/1009212519 (https://doi.org/10.1214/ss/1009212519)). Unfortunately, that paper had a “printing malfunction” that caused a lot of minus signs and left parentheses to be omitted; a corrected version is available at http://www.stat.washington.edu/www/research/online/hoeting1999.pdf.
Suppose $W$ is some quantity of interest whose meaning is the same in all models. Combining (3.1) with a plug-in estimate within each model gives

$$E(W \mid \text{data}) \approx \frac{\sum_{m \in \mathcal{M}} E(W \mid m, \hat{\theta}_m) \exp\left(-\tfrac{1}{2}\, \mathrm{BIC}(m)\right) g(m)}{\sum_{m^* \in \mathcal{M}} \exp\left(-\tfrac{1}{2}\, \mathrm{BIC}(m^*)\right) g(m^*)} \qquad (5.1)$$
for large $n$. The logic is that for large sample sizes the posterior distribution of the within-model parameter $\theta_m$ is concentrated near the maximum likelihood estimate $\hat{\theta}_m$. So $E(W \mid m, \hat{\theta}_m)$ is a good approximation to $E(W \mid m)$.
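As a hedged illustration of (5.1), continuing the made-up numbers above: the BMA estimate is just a weighted average of the within-model plug-in estimates.

bic <- c(105.2, 107.8, 110.1)   # hypothetical BIC values
g <- rep(1 / 3, 3)              # flat prior on models
what <- c(2.3, 2.1, 2.6)        # hypothetical E(W | m, theta-hat) for each model
w <- exp(- (bic - min(bic)) / 2) * g
w <- w / sum(w)
sum(what * w)                   # approximate E(W | data) as in (5.1)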
Frequentists can also play the model averaging game. Even though they do not buy Bayesian arguments, they can see that model selection is a mug’s game. If model selection is very unlikely to pick the best model (which is always the case when there are many models and the sample size is not humongous), then averaging over good models is better.
There are many proposed ways to do frequentist model averaging (FMA) in the literature, but one simple way (and the only way we will cover in this course) is to replace BIC with AIC in (5.1), giving
$$\hat{w}_{\mathrm{FMA}} \approx \frac{\sum_{m \in \mathcal{M}} E(W \mid m, \hat{\theta}_m) \exp\left(-\tfrac{1}{2}\, \mathrm{AIC}(m)\right) g(m)}{\sum_{m^* \in \mathcal{M}} \exp\left(-\tfrac{1}{2}\, \mathrm{AIC}(m^*)\right) g(m^*)} \qquad (5.2)$$
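The computation is the same as for (5.1) with AIC in place of BIC. A sketch with made-up AIC values and a flat g:

aic <- c(98.7, 99.4, 103.0)        # hypothetical AIC values
what <- c(2.3, 2.1, 2.6)           # hypothetical E(W | m, theta-hat) for each model
w <- exp(- (aic - min(aic)) / 2)   # g(m) = 1 for all m
w <- w / sum(w)
sum(what * w)                      # the FMA estimate (5.2)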
6 Examples
We will redo some of the examples from other handouts that already used R function glmbb.

library(CatDataAnalysis)
library(glmbb)
data(table_9.3)
# the glmbb call that fit all the models is not shown here;
# it produced the object out used below
summary(out)
##
## Search was for all models with AIC no larger than min(AIC) + 10
##
sout <- summary(out)
sout$results
# reconstruct the weights from the criterion values; subtracting the
# minimum before exponentiating avoids underflow and cancels in the
# normalization
w <- exp(- (sout$results$criterion - min(sout$results$criterion)) / 2)
w <- w / sum(w)
all.equal(sout$results$weight, w)
## [1] TRUE
# the same search redone with criterion = "BIC" (call not shown) produced out.bic
summary(out.bic)
##
## Search was for all models with BIC no larger than min(BIC) + 10
##
We see that BIC is much more favorable to the smaller (more parsimonious) model than AIC is. BIC puts probability 0.97535 on the “best” model, but AIC puts only probability 0.69275 on this model. In this example they agree on the “best” model, but generally these two criteria do not agree.
rm(list = ls())
# data entry; only the first line of the count vector survives in this excerpt
count <- c(7287, 11587, 3246, 6134, 10381, 10969, 6123, 6693,
library(glmbb)
# the glmbb fit (a Poisson log-linear model; the call is not shown except
# for its last argument, family = "poisson") produced the object out.aic
summary(out.aic)
##
## Search was for all models with AIC no larger than min(AIC) + 10
##
Now we know how to interpret the weights. We see that no model gets
a majority of the weight. None is highly probable to be the best
model.
There are just too many models under consideration for that to
happen.
This, of course, assumes that in FMA we want to use a flat “prior”
(because frequentists don’t like priors), that is, we want g(m) = 1
for all m in
(5.2).
# the corresponding search with criterion = "BIC" (call not shown) produced out.bic
summary(out.bic)
##
## Search was for all models with BIC no larger than min(BIC) + 10
##
BIC does put the majority of the posterior probability on one model
(assuming flat prior on models).
BIC plus Occam’s window spreads the posterior probability over fewer models than AIC plus Occam’s window does.
It is not very obvious how to do FMA without refitting all the models, but all the model fits are stored in the R environment out.aic$envir.
ls(envir = out.aic$envir)
## [1] "min.crit"
## [2] "sha1.0b21bc716a36c9104619cad165cd8e1ee59eaa67"
## [3] "sha1.0bc71e279edad3fdfd08de9a0b9e0ae111052d6e"
## [4] "sha1.1178bbd4b48158d0044b3a1347e41003680ed4cb"
## [5] "sha1.1394de7910da632a1171a5370b549df8b89df21c"
## [6] "sha1.17fd74be284b4fb9f9500b30852ef16f55a5951d"
## [7] "sha1.2d0982bf246f88e37a97b49bdff7be4da5f4e1e2"
## [8] "sha1.34b3afacb24ed6a05f656993097cd2466e194c33"
## [9] "sha1.34b64a787d14cbac65556ecb290973c935a63910"
## [10] "sha1.364675bd9f91f1dd81e9b4f78723c7c893d02fb8"
## [11] "sha1.39061969f2e696a74694e6e643c158b68d571886"
## [12] "sha1.39cd3d0327e6002b8be8fad301dfbaafb2c24eaa"
## [13] "sha1.4855a747811e1cb6a94a5275898fb6497b0bebe4"
## [14] "sha1.4d6583bbf0d0862415f7bacafd6b00679c515c0c"
## [15] "sha1.527e582a9d862405ddf2c05b6f09c890212c3c0d"
## [16] "sha1.66b77d3f388eae1ce33b62b466e8f311bffa469a"
## [17] "sha1.66cfba8ba57104e87efbfc4ba75ca67551816543"
## [18] "sha1.6aa8694584f0ebadef108c9e6684b32fc9aef678"
## [19] "sha1.6f40df9c7712ee3618c90d56f68d7162b5d06e1a"
## [20] "sha1.724e48559c4e25c9dff6641a26069457726036c2"
## [21] "sha1.7ac914cd162e41bc044ba9fcdb9bddb00182abcf"
## [22] "sha1.7f9bf47ef1ae0aecfaa935614a3164c4cf1d385f"
## [23] "sha1.893a4536650ea9f4ec0f81782104f20e335738a9"
## [24] "sha1.8bca01f0d7b4ba675cb4e25bacfb13f06ce1bf29"
## [25] "sha1.932904ab5c59667c0edab04b0aff8e87d5eef820"
## [26] "sha1.a1f53c04447266ee85267edfcd62af50dedafa41"
## [27] "sha1.a449f95d9878915ee202f828977904a33819389b"
## [28] "sha1.a9eb2e4d8dbbfd65464afdc18c87c99ef16585aa"
## [29] "sha1.bd24727fa981859346a3b3ebf6fd8f1935d7f5c7"
## [30] "sha1.cba6840ac60dcc46de77bbf74a4c06e5e485ebb4"
## [31] "sha1.cbe1125be84ccb49411ef716ffe59218bb01436f"
## [32] "sha1.d21240256fc4cc009cb9b6ecd2ae465d2bc920b3"
## [33] "sha1.d9dd41b6107367ab849386daddf27f856bac8e82"
## [34] "sha1.db6cf35ff8bebb4c28e8baec502ae4e5c1365d35"
## [35] "sha1.e2f8395790cec12e6b92e36a338b2dbba7093932"
## [36] "sha1.f02e9c53c00058688f875f16b7024ce5d063e84c"
## [37] "sha1.f24e60e6005914cdeda0fdc45cabfcb700e2c567"
## [38] "sha1.f7cc3ef833144b019cbd8726f7865841afce6b64"
All of the R objects whose names begin with sha1. are the model fits.
The object min.crit is the same as out.aic$min.crit:
identical(out.aic$min.crit, out.aic$envir$min.crit)
## [1] TRUE
And, although this is not documented, we can look in the code for R function
summary.glmbb to see how to extract the criteria from these models.
min.crit <- out.aic$min.crit
e <- out.aic$envir
rm("min.crit", envir = e)
# the code extracting the vector criterion (one AIC value per stored fit)
# and the logical vector is.in.window (models in Occam's window) is not
# shown here
w <- criterion[is.in.window]
w <- w - min(w)   # subtract the minimum to avoid underflow in exp
w <- exp(- w / 2)
w <- w / sum(w)
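The FMA expected values are then the weighted average of the fitted.values components of the models in Occam’s window. A minimal sketch, assuming fits is the list of those fitted-value vectors, in the same order as w (its extraction from the environment is not shown above):

# multiply each model's fitted values by its weight and sum over models
expected.fma <- Reduce("+", Map("*", fits, w))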
# the data frame foo (construction not shown) combines the observed counts
# with the model-averaged expected values
print(foo, digits=5)
## injury gender location seat.belt observed expected.fma
And, if we redo the same calculation using BMA rather than FMA, we get
min.crit <- out.bic$min.crit
e <- out.bic$envir
rm("min.crit", envir = e)
# criterion and is.in.window are recomputed for out.bic the same way (not shown)
w <- criterion[is.in.window]
w <- w - min(w)   # avoid underflow, as before
w <- exp(- w / 2)
w <- w / sum(w)
print(foo, digits=5)
Even though the fitted.values components of the model fits are the same in FMA and BMA, the weighted averages are different, because the weights are different and because different numbers of models are given any weight at all in FMA and in BMA.
If one wanted to go on and calculate other things that are functions of these expected values, like the conditional odds ratios that Agresti computes, then those calculations should be based on the FMA or BMA values computed above.
If one is doing FMA, then standard errors for these FMA estimators can
be computed as described by Burnham and Anderson, Section 4.3.2
(https://www.amazon.com/Model-Selection-Multimodel-Inference-Information-Theoretic/dp/0387953647/). But we will not give examples of that
here.
rm(list = ls())
library(CatDataAnalysis)
data(table_8.1)
# lake and food are recoded as factors with meaningful level labels;
# the full calls are truncated in this excerpt
lake = factor(lake,
food = factor(food,
# the glmbb fit (call not shown) produced the object out.aic
summary(out.aic)
##
## Search was for all models with AIC no larger than min(AIC) + 10
##
6.3.2 BIC
And now we can also do BIC (again, the glmbb call with criterion = "BIC" is not shown).
summary(out.bic)
##
## Search was for all models with BIC no larger than min(BIC) + 10
##
This is quite a shock. It says that the empty model, the model that says food is not associated with lake, gender, or size or any interaction of them, is the best model (according to BIC). If you think like a Bayesian, perhaps there is nothing much going on in these data.