15 Model Averaging
15.1 Framework
Let $g$ be a (non-parametric) object of interest, such as a conditional mean, variance, density, or distribution function. Let $\hat{g}_m$, $m = 1, \ldots, M$, be a discrete set of estimators. Most commonly, this set is the same as we might consider for the problem of model selection. In linear regression, the $\hat{g}_m$ typically correspond to different sets of regressors. We will sometimes call the $m$'th estimator the $m$'th "model".
Let $w_m$ be the weight attached to the $m$'th estimator, and let $w = (w_1, \ldots, w_M)$ be the vector of weights. Typically we will require
$$0 \le w_m \le 1, \qquad \sum_{m=1}^{M} w_m = 1,$$
so that $w$ lies in the unit simplex $H_M \subset \mathbb{R}^M$.
A classical choice of weights, corresponding to Bayesian model averaging (BMA), is
$$w_m = \frac{\exp\left(-\frac{1}{2}\mathrm{BIC}_m\right)}{\sum_{j=1}^{M} \exp\left(-\frac{1}{2}\mathrm{BIC}_j\right)}$$
where
$$\mathrm{BIC}_m = 2L_m + k_m \log(n),$$
$L_m$ is the negative log-likelihood, and $k_m$ is the number of parameters in model $m$. $\mathrm{BIC}_m$ is the Bayesian information criterion for model $m$. It is similar to AIC, but with the "2" replaced by $\log(n)$.
The BMA estimator has a nice interpretation as a Bayesian estimator. The downside is that it does not allow for misspecification. It is designed to search for the "true" model, not to select an estimator with low loss.
To remedy this situation, Burnham and Anderson have suggested replacing BIC with AIC, resulting in what has been called smoothed AIC (SAIC) or weighted AIC (WAIC). The weights are
$$w_m = \frac{\exp\left(-\frac{1}{2}\mathrm{AIC}_m\right)}{\sum_{j=1}^{M} \exp\left(-\frac{1}{2}\mathrm{AIC}_j\right)}$$
where
$$\mathrm{AIC}_m = 2L_m + 2k_m.$$
The suggestion goes back to Akaike, who proposed that these $w_m$ may be interpreted as model probabilities. It is convenient and simple to implement. The idea can be applied quite broadly, in any context where AIC is defined.
In simulation studies, the SAIC estimator performs very well (in particular, better than conventional AIC selection). However, to date I have seen no formal justification for the procedure. It is unclear in what sense SAIC produces a good approximation.
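For computation, these exponential weighting schemes require only each model's maximized log-likelihood and parameter count. The sketch below (Python; the function name ic_weights and the three-model numbers are illustrative, not from the notes) forms the AIC or BIC weights above. Subtracting the minimum criterion value before exponentiating leaves the weights unchanged but avoids numerical underflow.

```python
import numpy as np

def ic_weights(loglik, k, n, criterion="AIC"):
    """Exponential model weights from an information criterion.

    loglik    : maximized log-likelihood of each model
    k         : number of parameters in each model
    n         : sample size (used only for BIC)
    criterion : "AIC" or "BIC"
    """
    loglik = np.asarray(loglik, dtype=float)
    k = np.asarray(k, dtype=float)
    if criterion == "AIC":
        ic = -2.0 * loglik + 2.0 * k          # AIC_m = 2 L_m + 2 k_m, with L_m = -loglik
    else:
        ic = -2.0 * loglik + k * np.log(n)    # BIC_m = 2 L_m + k_m log(n)
    # Subtracting the minimum does not change the weights but keeps exp() well scaled.
    delta = ic - ic.min()
    w = np.exp(-0.5 * delta)
    return w / w.sum()

# hypothetical example with three candidate models
print(ic_weights(loglik=[-520.3, -518.9, -518.1], k=[2, 4, 6], n=200, criterion="AIC"))
```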
In linear regression, the $m$'th estimator is least squares on the regressor matrix $X_m$:
$$\hat{\beta}_m = \left(X_m'X_m\right)^{-1}X_m'y$$
with fitted values
$$\hat{g}_m = X_m\hat{\beta}_m = P_m y$$
where
$$P_m = X_m\left(X_m'X_m\right)^{-1}X_m'.$$
The averaging estimator is then
$$\hat{g}(w) = \sum_{m=1}^{M} w_m\hat{g}_m = \sum_{m=1}^{M} w_m P_m y = P(w)\,y$$
where
$$P(w) = \sum_{m=1}^{M} w_m P_m.$$
Writing each $\hat{\beta}_m$ as a full-length coefficient vector (with zeros filled in for the regressors excluded from model $m$), we can also write
$$\hat{g}(w) = \sum_{m=1}^{M} w_m X_m\left(X_m'X_m\right)^{-1}X_m'y = \sum_{m=1}^{M} w_m X_m\hat{\beta}_m = X\left(\sum_{m=1}^{M} w_m\hat{\beta}_m\right) = X\hat{\beta}(w)$$
where
$$\hat{\beta}(w) = \sum_{m=1}^{M} w_m\hat{\beta}_m$$
is the average of the coefficient estimates. $\hat{\beta}(w)$ is the model average estimator for $\beta$. In linear regression, there is a direct correspondence between the average estimator for the conditional mean and the average estimator of the parameters, but this correspondence breaks down when the estimator is not linear in the parameters.
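As a concrete check of this correspondence, the sketch below (Python; the data-generating process, the nested design, and the weight vector are all hypothetical) builds $\hat{g}(w)$ both as $\sum_m w_m X_m\hat{\beta}_m$ and as $X\hat{\beta}(w)$ with zero-padded coefficients, and verifies that the two coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 4
X = rng.normal(size=(n, K))                      # full regressor matrix
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

model_cols = [1, 2, 4]                           # nested models: first k_m columns of X
w = np.array([0.2, 0.3, 0.5])                    # an example weight vector

beta_avg = np.zeros(K)                           # averaged coefficients, zero-padded
g_hat = np.zeros(n)                              # averaged fitted values
for wm, km in zip(w, model_cols):
    Xm = X[:, :km]
    bm = np.linalg.lstsq(Xm, y, rcond=None)[0]   # OLS estimate for model m
    beta_avg[:km] += wm * bm
    g_hat += wm * Xm @ bm

# the two representations of the averaging estimator coincide
assert np.allclose(g_hat, X @ beta_avg)
```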
The Mallows criterion for the averaging estimator is
$$C(w) = \hat{e}(w)'\hat{e}(w) + 2\sigma^2\,\mathrm{tr}\,P(w)$$
where
$$\hat{e}(w) = y - \hat{g}(w)$$
is the residual.
In averaging linear regressions,
$$\mathrm{tr}\,P(w) = \mathrm{tr}\left(\sum_{m=1}^{M} w_m P_m\right) = \sum_{m=1}^{M} w_m\,\mathrm{tr}\,P_m = \sum_{m=1}^{M} w_m k_m = w'K$$
where $k_m$ is the number of coefficients in the $m$'th model, and $K = (k_1, \ldots, k_M)'$. The penalty is therefore $2\sigma^2 w'K$, where $w'K$ is the (weighted) average number of coefficients.
Also
$$\hat{e}(w) = y - \hat{g}(w) = \sum_{m=1}^{M} w_m\left(y - \hat{g}_m\right) = \sum_{m=1}^{M} w_m\hat{e}_m = \hat{e}\,w$$
where $\hat{e}_m$ is the $n\times 1$ residual vector from the $m$'th model, and $\hat{e} = \left[\hat{e}_1, \ldots, \hat{e}_M\right]$ is the $n\times M$ matrix of residuals from all $M$ models.
We can then write the criterion as
$$C(w) = w'\hat{e}'\hat{e}\,w + 2\sigma^2 w'K.$$
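In code, the criterion is a single quadratic form. A minimal sketch, assuming the $n\times M$ residual matrix $\hat{e}$, the dimension vector $K$, and an estimate of $\sigma^2$ (for example, from the largest model) have already been computed; the function name is illustrative.

```python
import numpy as np

def mallows_criterion(w, ehat, K, sigma2):
    """C(w) = w' ehat' ehat w + 2 sigma^2 w'K.

    w      : (M,) weight vector on the unit simplex
    ehat   : (n, M) matrix whose m'th column is the residual vector of model m
    K      : (M,) vector of model dimensions k_m
    sigma2 : estimate of the error variance
    """
    w = np.asarray(w, dtype=float)
    K = np.asarray(K, dtype=float)
    return float(w @ (ehat.T @ ehat) @ w + 2.0 * sigma2 * (w @ K))
```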
Let $\hat{w}$ denote the weight vector which minimizes $C(w)$ over the unit simplex; since $C(w)$ is quadratic in $w$, this is a quadratic programming problem. The Mallows-selected averaging estimator is then
$$\hat{g} = \hat{g}(\hat{w}) = \sum_{m=1}^{M} \hat{w}_m\hat{g}_m.$$
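Because $C(w)$ is a quadratic function of $w$ and the unit simplex is defined by linear constraints, the weights can be computed by quadratic programming. A sketch using scipy's general-purpose SLSQP routine (a dedicated QP solver would serve equally well; the function name mma_weights and its arguments are illustrative, not from the original notes):

```python
import numpy as np
from scipy.optimize import minimize

def mma_weights(ehat, K, sigma2):
    """Minimize C(w) = w' ehat'ehat w + 2 sigma^2 w'K over the unit simplex."""
    M = ehat.shape[1]
    A = ehat.T @ ehat
    K = np.asarray(K, dtype=float)
    objective = lambda w: w @ A @ w + 2.0 * sigma2 * (w @ K)
    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]  # weights sum to one
    bounds = [(0.0, 1.0)] * M                                         # each weight in [0, 1]
    w0 = np.full(M, 1.0 / M)                                          # start from equal weights
    res = minimize(objective, w0, method="SLSQP", bounds=bounds, constraints=constraints)
    return res.x
```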
This is an asymptotically optimal procedure, in the sense described below. In Hansen (Econometrica, 2007), I show that the required condition (Li's condition (1)) is satisfied if we restrict the set of weights to a discrete set.
Recall that $H_M$ is the unit simplex in $\mathbb{R}^M$. Now restrict $w \in H_M(N) \subset H_M$, where the weights in $H_M(N)$ are elements of $\left\{0, \frac{1}{N}, \frac{2}{N}, \ldots, 1\right\}$ for some integer $N$. In that paper, I show that Li's condition (1) over $w \in H_M(N)$ holds under conditions similar to those for model selection, namely if the models are nested,
$$\xi_n = \inf_{w \in H_M(N)} nR(w) \to \infty,$$
and
$$E\left(e_i^{4(N+1)} \mid X_i\right) < \infty.$$
In that case,
$$\frac{L(\hat{w})}{\inf_{w \in H_M(N)} L(w)} \to_p 1,$$
where, again,
$$L(w) = \frac{1}{n}\left(\hat{g}(w) - g\right)'\left(\hat{g}(w) - g\right).$$
The proof is similar to that for model selection in linear regression. The restriction of w to a
discrete set is necessary to directly apply Li’s theorem, as the summation requires discreteness.
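To make the restricted weight set concrete, the sketch below enumerates $H_M(N)$ for small $M$ and $N$; the discrete weight-selection problem can then be solved by evaluating the criterion at each element. The brute-force enumeration is only meant to illustrate the set used in the theory; it is impractical beyond a handful of models.

```python
from itertools import product
import numpy as np

def discrete_simplex(M, N):
    """Enumerate H_M(N): weight vectors with entries in {0, 1/N, ..., 1} summing to one."""
    return [np.array(c) / N for c in product(range(N + 1), repeat=M) if sum(c) == N]

# e.g. M = 3 models and N = 4 yields 15 candidate weight vectors;
# the discrete weights are the element of this list minimizing the criterion C(w)
print(len(discrete_simplex(3, 4)))
```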
The discreteness was relaxed in a paper by Wan, Zhang, and Zou (2008, "Least Squares Model Combining by Mallows Criterion," working paper). Rather than proving (1), they provided a more basic derivation, although using stronger conditions. Recall that the proof requires showing uniform convergence results of the form
$$\sup_{w \in H_M}\frac{\left|e'b(w)\right|}{nR(w)} \to_p 0$$
where
$$b(w) = \sum_{m=1}^{M} w_m b_m, \qquad b_m = \left(I - P_m\right)g.$$
Then
$$\sup_{w \in H_M}\frac{\left|e'b(w)\right|}{nR(w)} \le \sup_{w \in H_M}\sum_{m=1}^{M} w_m\frac{\left|e'b_m\right|}{\xi_n} \le \max_{1\le m\le M}\frac{\left|e'b_m\right|}{\xi_n}$$
and, for any $\epsilon > 0$,
$$P\left(\max_{1\le m\le M}\frac{\left|e'b_m\right|}{\xi_n} > \epsilon\right) \le \sum_{m=1}^{M} P\left(\frac{\left|e'b_m\right|}{\xi_n} > \epsilon\right) \le \sum_{m=1}^{M}\frac{E\left|e'b_m\right|^{2G}}{\epsilon^{2G}\xi_n^{2G}} \le K\sum_{m=1}^{M}\frac{\left|b_m'b_m\right|^{G}}{\epsilon^{2G}\xi_n^{2G}} \le K\sum_{m=1}^{M}\frac{\left(nR(w_m^0)\right)^{G}}{\epsilon^{2G}\xi_n^{2G}}$$
where $w_m^0$ is the weight vector with a 1 in the $m$'th place and zeros elsewhere. Equivalently, $nR(w_m^0)$ is the expected squared error from the $m$'th model. The final inequality uses the fact, from the analysis for model selection, that
$$nR(w_m^0) = b_m'b_m + \sigma^2 k_m \ge b_m'b_m.$$
This is stronger than the condition from my paper, $\xi_n \to \infty$, as it requires that $\sum_{m=1}^{M}\left(nR(w_m^0)\right)^{G}$ diverge more slowly than $\xi_n^{2G}$. They also do not directly assume that the models are nested.
For the $m$'th model, let $\tilde{e}_i^m$ denote the leave-one-out (LOO) residual for the $i$'th observation,
$$\tilde{e}_i^m = y_i - X_{im}'\left(X_{m(-i)}'X_{m(-i)}\right)^{-1}X_{m(-i)}'y_{(-i)},$$
where $X_{m(-i)}$ and $y_{(-i)}$ denote $X_m$ and $y$ with the $i$'th observation deleted. Set
$$\tilde{e}_i(w) = \sum_{m=1}^{M} w_m\tilde{e}_i^m$$
or, in vector notation,
$$\tilde{e}(w) = \sum_{m=1}^{M} w_m\tilde{e}_m = \tilde{e}\,w,$$
where $\tilde{e}$ is the $n\times M$ matrix whose $m$'th column is $\tilde{e}_m$. The sum of squared LOO residuals is then
$$CV(w) = \tilde{e}(w)'\tilde{e}(w) = w'\tilde{e}'\tilde{e}\,w,$$
which is quadratic in $w$.
The CV (or jackknife) selected weight vector $\hat{w}$ minimizes the criterion $CV(w)$ over the unit simplex. As for Mallows selection, this is solved by quadratic programming. The jackknife model averaging (JMA) estimator is then $\hat{g}(\hat{w})$.
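A sketch of the jackknife computation for least-squares candidate models. Rather than refitting each model $n$ times, it uses the standard least-squares identity that the LOO residual equals the ordinary residual divided by one minus the leverage, $\tilde{e}_i^m = \hat{e}_i^m/(1 - h_{mi})$; the function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def loo_residual_matrix(y, X_list):
    """n x M matrix of leave-one-out residuals, one column per least-squares model."""
    cols = []
    for Xm in X_list:
        Pm = Xm @ np.linalg.solve(Xm.T @ Xm, Xm.T)   # projection matrix P_m
        ehat = y - Pm @ y                            # ordinary residuals
        h = np.diag(Pm)                              # leverage values h_mi
        cols.append(ehat / (1.0 - h))                # LOO residuals: e_mi / (1 - h_mi)
    return np.column_stack(cols)

def jma_weights(y, X_list):
    """Minimize CV(w) = w' etilde' etilde w over the unit simplex."""
    E = loo_residual_matrix(y, X_list)
    A = E.T @ E
    M = A.shape[0]
    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    res = minimize(lambda w: w @ A @ w, np.full(M, 1.0 / M),
                   method="SLSQP", bounds=[(0.0, 1.0)] * M, constraints=constraints)
    return res.x
```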
In Hansen-Racine, we show that the CV-selected estimator is asymptotically equivalent to the infeasible best weight vector under the conditions
$$\xi_n = \inf_{w \in H_M} nR(w) \to \infty$$
and
$$\max_{1\le m\le M}\;\max_{1\le i\le n}\; X_{im}'\left(X_m'X_m\right)^{-1}X_{im} \to 0.$$
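The second condition bounds the leverage values and can be checked directly in a given sample; a small sketch (illustrative function name):

```python
import numpy as np

def max_leverage(X_list):
    """max over models m and observations i of X_im' (X_m'X_m)^{-1} X_im."""
    worst = 0.0
    for Xm in X_list:
        # diagonal of the projection matrix P_m, computed without forming P_m in full
        h = np.sum(Xm * np.linalg.solve(Xm.T @ Xm, Xm.T).T, axis=1)
        worst = max(worst, float(h.max()))
    return worst
```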
Standard errors?
Inference