
16 Model Averaging

16.1 Framework
Let $g$ be a (non-parametric) object of interest, such as a conditional mean, variance, density, or distribution function. Let $\hat{g}_m$, $m = 1, \ldots, M$, be a discrete set of estimators. Most commonly, this set is the same as we might consider for the problem of model selection. In linear regression, typically the $\hat{g}_m$ correspond to different sets of regressors. We will sometimes call the $m$'th estimator the $m$'th "model".

Let $w_m$ be a set of weights for the $m$'th estimator. Let $w = (w_1, \ldots, w_M)$ be the vector of weights. Typically we will require
$$0 \le w_m \le 1, \qquad \sum_{m=1}^{M} w_m = 1.$$
The set of weights satisfying this condition is $\mathcal{H}_M$, the unit simplex in $\mathbb{R}^M$.


An averaging estimator is
$$\hat{g}(w) = \sum_{m=1}^{M} w_m \hat{g}_m.$$
It is commonly called a "model average estimator".

Selection estimators are the special case where we impose the restriction $w_m \in \{0, 1\}$.
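As a concrete illustration, here is a minimal Python sketch (my own, not from the notes; the function name and array layout are assumptions) of forming $\hat{g}(w)$ from stacked candidate fits:

```python
# A minimal sketch (my own illustration) of the averaging estimator: form the
# weighted combination of candidate fits, after checking that the weights lie
# in the unit simplex H_M.
import numpy as np

def average_fit(g_hats, w):
    """g_hats: (M, n) array of fitted values, one row per model; w: length-M weights."""
    g_hats = np.asarray(g_hats, dtype=float)
    w = np.asarray(w, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("w must lie in the unit simplex")
    return w @ g_hats  # the n-vector sum_m w_m * g_hat_m
```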

16.2 Model Weights


The most common method for weight specification is Bayesian Model Averaging (BMA). Assume that there are $M$ potential models and that one of the models is the true model. Specify prior probabilities that each of the potential models is the true model. For each model, specify a prior over the parameters. Then the posterior distribution is the weighted average of the individual models, where the weights are the Bayesian posterior probabilities that the given model is the true model, conditional on the data.

Given diffuse priors and equal model prior probabilities, the BMA weights are approximately
$$w_m = \frac{\exp\left(-\tfrac{1}{2}\mathrm{BIC}_m\right)}{\sum_{j=1}^{M} \exp\left(-\tfrac{1}{2}\mathrm{BIC}_j\right)}$$
where
$$\mathrm{BIC}_m = 2L_m + k_m \log(n),$$
$L_m$ is the negative log-likelihood, and $k_m$ is the number of parameters in model $m$. $\mathrm{BIC}_m$ is the Bayesian information criterion for model $m$. It is similar to AIC, but with the "2" replaced by $\log(n)$.
The BMA estimator has the nice interpretation as a Bayesian estimator. The downside is that it does not allow for misspecification. It is designed to search for the "true" model, not to select an estimator with low loss.

To remedy this situation, Burnham and Anderson have suggested replacing BIC with AIC, resulting in what has been called smoothed AIC (SAIC) or weighted AIC (WAIC). The weights are
$$w_m = \frac{\exp\left(-\tfrac{1}{2}\mathrm{AIC}_m\right)}{\sum_{j=1}^{M} \exp\left(-\tfrac{1}{2}\mathrm{AIC}_j\right)}$$
where
$$\mathrm{AIC}_m = 2L_m + 2k_m.$$
The suggestion goes back to Akaike, who suggested that these $w_m$ may be interpreted as model probabilities. It is convenient and simple to implement, and the idea can be applied quite broadly, in any context where AIC is defined.

In simulation studies, the SAIC estimator performs very well, in particular better than conventional AIC selection. However, to date I have seen no formal justification for the procedure, and it is unclear in what sense SAIC produces a good approximation.
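For illustration, the following Python sketch (my own; the function name and interface are not from the notes) computes the exponential information-criterion weights above from the maximized log-likelihoods and parameter counts, for either the BIC-based (approximate BMA) or the AIC-based (SAIC) choice. Subtracting the minimum criterion value before exponentiating leaves the weights unchanged and avoids numerical underflow.

```python
import numpy as np

def ic_weights(loglik, k, n, criterion="AIC"):
    """Exponential information-criterion weights.

    loglik : array of maximized log-likelihoods, one per model
    k      : array of parameter counts, one per model
    n      : sample size (used only for BIC)
    """
    loglik = np.asarray(loglik, dtype=float)
    k = np.asarray(k, dtype=float)
    if criterion == "AIC":
        ic = -2.0 * loglik + 2.0 * k        # AIC_m = 2*L_m + 2*k_m with L_m = -loglik
    elif criterion == "BIC":
        ic = -2.0 * loglik + k * np.log(n)  # BIC_m = 2*L_m + k_m*log(n)
    else:
        raise ValueError("criterion must be 'AIC' or 'BIC'")
    ic -= ic.min()                          # common shift; weights are unchanged
    w = np.exp(-0.5 * ic)
    return w / w.sum()
```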

16.3 Linear Regression


In the case of linear regression, let $X_m$ be the regressor matrix for the $m$'th estimator, a subset of the full list of regressors. Then the $m$'th estimator is
$$\hat{\beta}_m = \left(X_m'X_m\right)^{-1}X_m'y$$
$$\hat{g}_m = X_m\hat{\beta}_m = P_m y$$
where
$$P_m = X_m\left(X_m'X_m\right)^{-1}X_m'.$$

The averaging estimator is
$$\hat{g}(w) = \sum_{m=1}^{M} w_m \hat{g}_m = \sum_{m=1}^{M} w_m P_m y = P(w)y$$
where
$$P(w) = \sum_{m=1}^{M} w_m P_m.$$

Let $X$ be the matrix of all regressors. We can also write
$$\hat{g}(w) = \sum_{m=1}^{M} w_m X_m\left(X_m'X_m\right)^{-1}X_m'y = \sum_{m=1}^{M} w_m X_m\hat{\beta}_m = X\left(\sum_{m=1}^{M} w_m \begin{pmatrix}\hat{\beta}_m \\ 0\end{pmatrix}\right) = X\hat{\beta}(w)$$
where
$$\hat{\beta}(w) = \sum_{m=1}^{M} w_m \begin{pmatrix}\hat{\beta}_m \\ 0\end{pmatrix}$$
is the weighted average of the coefficient estimates, with each $\hat{\beta}_m$ filled out with zeros for the regressors excluded from model $m$. $\hat{\beta}(w)$ is the model average estimator for $\beta$. In linear regression, there is a direct correspondence between the average estimator for the conditional mean and the average estimator of the parameters, but this correspondence breaks down when the estimator is not linear in the parameters.
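To make the correspondence concrete, here is a hedged Python sketch (my own construction, assuming nested models in which $X_m$ consists of the first $k_m$ columns of $X$, so that the zero-padding is well defined):

```python
import numpy as np

def averaged_coefficients(X, y, k_list, w):
    """Model-average OLS coefficients for nested models.

    X      : (n, K) matrix of all regressors
    k_list : list of model sizes k_m; model m uses the first k_m columns of X
    w      : weights on the unit simplex, one per model
    """
    n, K = X.shape
    beta_bar = np.zeros(K)
    for k_m, w_m in zip(k_list, w):
        X_m = X[:, :k_m]
        beta_m, *_ = np.linalg.lstsq(X_m, y, rcond=None)  # OLS for model m
        padded = np.zeros(K)
        padded[:k_m] = beta_m                             # zeros for excluded regressors
        beta_bar += w_m * padded
    return beta_bar                                       # X @ beta_bar equals g_hat(w)
```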

16.4 Mallows Weight Selection


As pointed out above, in the linear regression setting $\hat{g}(w) = P(w)y$ is a linear estimator, so it falls in the class studied by Li (1987). His framework allows for estimators indexed by $w \in \mathcal{H}_M$. Under homoskedasticity, an optimal method for selection of $w$ is the Mallows criterion. As we discussed before, for estimators $\hat{g}(w) = P(w)y$, the Mallows criterion is
$$C(w) = \hat{e}(w)'\hat{e}(w) + 2\sigma^2\operatorname{tr}P(w)$$
where
$$\hat{e}(w) = y - \hat{g}(w)$$
is the residual.

In averaging linear regression,
$$\operatorname{tr}P(w) = \operatorname{tr}\left(\sum_{m=1}^{M} w_m P_m\right) = \sum_{m=1}^{M} w_m\operatorname{tr}P_m = \sum_{m=1}^{M} w_m k_m = w'K$$
where $k_m$ is the number of coefficients in the $m$'th model and $K = (k_1, \ldots, k_M)'$. The penalty is thus $2\sigma^2$ times $w'K$, the (weighted) average number of coefficients.
Also,
$$\hat{e}(w) = y - \hat{g}(w) = \sum_{m=1}^{M} w_m\left(y - \hat{g}_m\right) = \sum_{m=1}^{M} w_m\hat{e}_m = \hat{e}w$$
where $\hat{e}_m$ is the $n \times 1$ residual vector from the $m$'th model, and $\hat{e} = [\hat{e}_1, \ldots, \hat{e}_M]$ is the $n \times M$ matrix of residuals from all $M$ models.

We can then write the criterion as
$$C(w) = w'\hat{e}'\hat{e}w + 2\sigma^2 w'K.$$
This is quadratic in the vector $w$.


The Mallows selected weight vector minimizes the criterion $C(w)$ over $w \in \mathcal{H}_M$, the unit simplex:
$$\hat{w} = \operatorname{argmin}_{w \in \mathcal{H}_M} C(w).$$
This is a quadratic programming problem with inequality constraints, which is pre-programmed in Gauss and Matlab, so computation of $\hat{w}$ is a simple command.

The Mallows selected estimator is then
$$\hat{g} = \hat{g}(\hat{w}) = \sum_{m=1}^{M}\hat{w}_m\hat{g}_m.$$
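The following Python sketch (my own; it uses scipy's general-purpose SLSQP solver as a stand-in for the dedicated quadratic-programming routines in Gauss or Matlab) minimizes $C(w)$ over the unit simplex. The error-variance estimate sigma2 is an input; a common choice is to estimate it from the largest candidate model.

```python
import numpy as np
from scipy.optimize import minimize

def mallows_weights(e_hat, K_vec, sigma2):
    """Minimize C(w) = w' e_hat' e_hat w + 2*sigma2*w'K over the unit simplex.

    e_hat  : (n, M) matrix of residuals, one column per model
    K_vec  : length-M array of parameter counts k_m
    sigma2 : estimate of the error variance
    """
    M = e_hat.shape[1]
    A = e_hat.T @ e_hat                      # M x M matrix e_hat'e_hat
    K_vec = np.asarray(K_vec, dtype=float)

    def criterion(w):
        return w @ A @ w + 2.0 * sigma2 * (w @ K_vec)

    cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]  # weights sum to one
    bounds = [(0.0, 1.0)] * M                                # each weight in [0, 1]
    w0 = np.full(M, 1.0 / M)                                 # start from equal weights
    res = minimize(criterion, w0, method="SLSQP", bounds=bounds, constraints=cons)
    return res.x
```

A dedicated quadratic-programming routine would exploit the quadratic structure of $C(w)$ directly; the general-purpose solver above is simply a convenient stand-in.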


16.5 Weight Selection Optimality


As we discussed in the section on model selection, Li (1987) provided a set of sufficient conditions for the Mallows selected estimator to be optimal, in the sense that its squared error is asymptotically equivalent to the infeasible optimum. The key condition was
$$\sum_{w}\left(nR(w)\right)^{-s} \to 0. \qquad (1)$$
In Hansen (Econometrica, 2007), I show that this condition is satisfied if we restrict the set of weights to a discrete set.

Recall that $\mathcal{H}_M$ is the unit simplex in $\mathbb{R}^M$. Now restrict $w$ to the subset of $\mathcal{H}_M$ whose weights are elements of $\left\{\tfrac{1}{N}, \tfrac{2}{N}, \ldots, 1\right\}$ for some integer $N$. In that paper, I show that Li's condition (1) over this restricted set holds under similar conditions as for model selection, namely if the models are nested,
$$\xi_n = \inf_{w \in \mathcal{H}_M} nR(w) \to \infty,$$
and
$$E\left(|e_i|^{4(N+1)} \mid X_i\right) < \infty.$$

Thus model averaging is asymptotically optimal, in the sense that
$$\frac{L(\hat{w})}{\inf_{w \in \mathcal{H}_M} L(w)} \to_p 1$$
where, again,
$$L(w) = \frac{1}{n}\left(\hat{g}(w) - g\right)'\left(\hat{g}(w) - g\right).$$
The proof is similar to that for model selection in linear regression. The restriction of $w$ to a discrete set is necessary to directly apply Li's theorem, as the summation requires discreteness.
The discreteness was relaxed in a paper by Wan, Zhang, and Zou (2008, Least Squares Model Combining by Mallows Criterion, working paper). Rather than proving (1), they provided a more basic derivation, although using stronger conditions. Recall that the proof requires showing uniform convergence results of the form
$$\sup_{w \in \mathcal{H}_M}\frac{|e'b(w)|}{nR(w)} \to_p 0$$
where
$$b(w) = \sum_{m=1}^{M} w_m b_m, \qquad b_m = (I - P_m)g.$$
Here is their proof. First, using the triangle inequality, $nR(w) \ge \xi_n$, and $\sum_m w_m = 1$,
$$\sup_{w \in \mathcal{H}_M}\frac{|e'b(w)|}{nR(w)} \le \sup_{w \in \mathcal{H}_M}\sum_{m=1}^{M} w_m\frac{|e'b_m|}{\xi_n} \le \max_{1 \le m \le M}\frac{|e'b_m|}{\xi_n}.$$
Second, by Markov's and Whittle's inequalities, for any $\eta > 0$,
$$P\left(\max_{1 \le m \le M}\frac{|e'b_m|}{\xi_n} > \eta\right) \le \sum_{m=1}^{M} P\left(\frac{|e'b_m|}{\xi_n} > \eta\right) \le \sum_{m=1}^{M}\frac{E|e'b_m|^{2G}}{\eta^{2G}\xi_n^{2G}} \le K\sum_{m=1}^{M}\frac{\left(b_m'b_m\right)^{G}}{\eta^{2G}\xi_n^{2G}} \le K\sum_{m=1}^{M}\frac{\left(nR(w_m^0)\right)^{G}}{\eta^{2G}\xi_n^{2G}}$$
where $w_m^0$ is the weight vector with a 1 in the $m$'th place and zeros elsewhere. Equivalently, $nR(w_m^0)$ is the expected squared error from the $m$'th model. The final inequality uses the fact, from the analysis for model selection, that
$$nR(w_m^0) = b_m'b_m + \sigma^2 k_m \ge b_m'b_m.$$

Wan, Zhang, and Zou then assume
$$\frac{\sum_{m=1}^{M}\left(nR(w_m^0)\right)^{G}}{\xi_n^{2G}} \to 0.$$
This is stronger than the condition $\xi_n \to \infty$ from my paper, as it requires that $\sum_{m=1}^{M}\left(nR(w_m^0)\right)^{G}$ diverges more slowly than $\xi_n^{2G}$. They also do not directly assume that the models are nested.

16.6 Cross-Validation Selection


Hansen and Racine (Jackknife Model Averaging, working paper).

In this paper, we substitute cross-validation (CV) for the Mallows criterion. As a result, we do not require homoskedasticity.

For the $m$'th model, let $\tilde{e}_{mi}$ denote the leave-one-out (LOO) residual for the $i$'th observation, i.e.
$$\tilde{e}_{mi} = y_i - X_{im}'\left(X_{m(-i)}'X_{m(-i)}\right)^{-1}X_{m(-i)}'y_{(-i)}$$
where $X_{m(-i)}$ and $y_{(-i)}$ denote $X_m$ and $y$ with the $i$'th observation deleted, and let $\tilde{e}_m$ denote the $n \times 1$ vector of the $\tilde{e}_{mi}$. Then the LOO averaging residuals are
$$\tilde{e}_i(w) = \sum_{m=1}^{M} w_m\tilde{e}_{mi}$$
or, in vector form,
$$\tilde{e}(w) = \sum_{m=1}^{M} w_m\tilde{e}_m = \tilde{e}w$$
where $\tilde{e}$ is an $n \times M$ matrix whose $m$'th column is $\tilde{e}_m$. Then the sum of squared LOO residuals is
$$CV(w) = \tilde{e}(w)'\tilde{e}(w) = w'\tilde{e}'\tilde{e}w,$$
which is quadratic in $w$.
The CV (or jackknife) selected weight vector $\hat{w}$ minimizes the criterion $CV(w)$ over the unit simplex. As for Mallows selection, this is solved by quadratic programming. The JMA estimator is then $\hat{g}(\hat{w})$.
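As a computational note, the LOO residuals for OLS need not be obtained by refitting $n$ times: the standard identity $\tilde{e}_{mi} = \hat{e}_{mi}/(1 - h_{mii})$, where $h_{mii}$ is the $i$'th diagonal element of $P_m$, gives them directly. The sketch below (my own, again assuming nested models with $X_m$ the first $k_m$ columns of $X$) builds the $n \times M$ matrix $\tilde{e}$:

```python
import numpy as np

def loo_residual_matrix(X, y, k_list):
    """Leave-one-out residuals for nested OLS models, via the standard
    identity e_tilde_i = e_hat_i / (1 - h_ii), where h_ii is the leverage."""
    cols = []
    for k_m in k_list:
        X_m = X[:, :k_m]
        P_m = X_m @ np.linalg.solve(X_m.T @ X_m, X_m.T)  # projection matrix P_m
        e_hat = y - P_m @ y                              # in-sample residuals
        h = np.diag(P_m)                                 # leverage values h_ii
        cols.append(e_hat / (1.0 - h))
    return np.column_stack(cols)                         # n x M matrix of LOO residuals
```

The weight vector $\hat{w}$ then minimizes $w'\tilde{e}'\tilde{e}w$ over the simplex, for instance by reusing the earlier Mallows sketch with the penalty switched off (sigma2 = 0 and a zero K vector).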
In Hansen-Racine, we show that the CV estimator is asymptotically equivalent to the infeasible best weight vector, under the conditions
$$0 < \min_i E\left(e_i^2 \mid X_i\right) \le \max_i E\left(e_i^2 \mid X_i\right) < \infty,$$
$$E\left(|e_i|^{4(N+1)} \mid X_i\right) < \infty,$$
$$\xi_n = \inf_{w \in \mathcal{H}_M} nR(w) \to \infty,$$
$$\max_{1 \le m \le M}\ \max_{1 \le i \le n}\ X_{im}'\left(X_m'X_m\right)^{-1}X_{im} \to 0.$$

16.7 Many Unsolved Issues


Model averaging for other estimators: e.g. densities or conditional densities

IV, GMM, EL, ET

Standard errors?

Inference
