Evaluating the Predictive Accuracy of Volatility Models
Jose A. Lopez
ABSTRACT:
Statistical loss functions that generally lack economic content are commonly used for
evaluating financial volatility forecasts. In this paper, an evaluation framework based on loss
functions tailored to a user's economic interests is proposed. According to these interests, the
user specifies the economic events to be forecast, the criterion with which to evaluate these
forecasts, and the subsets of the forecasts of particular interest. The volatility forecasts from a
model are then transformed into probability forecasts of the relevant events and evaluated using
the specified criteria (i.e., a probability scoring rule and calibration tests). An empirical example
using exchange rate data illustrates the framework and confirms that the choice of loss function
directly affects the forecast evaluation results.
Acknowledgments: The views expressed here are those of the author and do not necessarily reflect those of the
Federal Reserve Bank of San Francisco or the Federal Reserve System. I thank Frank Diebold, Tim Bollerslev,
Peter Christoffersen, Roberto Mariano, Lee Ohanian, Esther Ruiz and Ken West for their comments as well as
seminar participants at the Derivatives Project Meeting of the National Bureau of Economic Research, the
Econometric Society Seventh World Congress, the University of Pennsylvania, Arizona State University and the
Federal Reserve Bank of New York.
I. Introduction
Although dynamics in the variance of financial time series were observed at least as early
as Mandelbrot (1963), efforts to empirically model these dynamics have only developed in the
last twenty years. GARCH models, pioneered by Engle (1982) and Bollerslev (1986), are the
volatility models most commonly used, although numerous alternatives exist. Specifically,
stochastic volatility models, which arose from the theoretical finance literature, are increasingly
employed in empirical research. For recent surveys of volatility models, see Bollerslev, Engle and Nelson (1994) as well as Diebold and Lopez (1995).
Volatility models and their forecasts are of interest to many types of economic agents.
For example, options traders require asset volatilities to price options, and central banks or
international investors forecasting exchange rates may require interval forecasts, which are
readily derived from volatility forecasts. Given the vast number of models available, such agents
must decide which forecasts to use as well as the evaluation criterion upon which to base that
decision. As suggested by Bollerslev et al. (1994), economic loss functions that explicitly
incorporate the costs faced by volatility forecast users provide the most meaningful forecast
evaluations. Several such loss functions based on specific economic questions have been
proposed in the literature. For example, West, Edison and Cho (1993) propose a loss function
based on the expected utility from an investment decision requiring volatility forecasts. Engle et
al. (1993) propose an economic loss function based on trading profits earned in a simulated
options market; see also Engle, Kane and Noh (1996) and Noh, Engle and Kane (1994).
However, since specific economic loss functions are frequently unavailable, volatility
forecast evaluation is typically conducted by minimizing a statistical loss function, such as mean
squared error (MSE). Numerous studies have used MSE to evaluate the performance of volatility
models; for example, see Taylor (1987); Akgiray (1989); Dimson and Marsh (1990); Pagan and
Schwert (1990); Lee (1991); West and Cho (1994); Kroner, Kneafsey and Claessens (1995);
Bollerslev and Ghysels (1996); and Brailsford and Faff (1996). Such forecast evaluations are
problematic since they almost exclusively use squared asset returns as a proxy for the latent
volatility process. As discussed by Andersen and Bollerslev (1997), squared returns are noisy
estimators of the actual variance dynamics and will thus limit the inference available regarding the accuracy of the volatility forecasts.
In this paper, an evaluation framework based on probability-related loss functions is proposed. Rather than selecting a simple statistical criterion as the appropriate evaluation measure, this framework permits the forecast user to tailor their choice of criterion to their particular economic interests and to apply it to probability forecasts generated from an underlying volatility model. Probability forecasts are the object of interest for a large, general class of loss functions based on decision-theoretic problems, such as deciding whether to take a specific action; see Granger and Pesaran (1996).
The user tailors the forecast evaluation by specifying three key elements of this
framework. The first element is the economic events of interest, such as a central bank's interest
in whether an exchange rate will leave a specified target zone. Volatility forecasts are readily
transformed into probability forecasts of such events using the underlying model's assumed (or estimated)
distribution for the innovation term. Once generated, the forecasts are evaluated using scoring
rules, which measure the accuracy of probability forecasts with respect to whether the forecasted
events occur. Forecast users can select the scoring rule best suited to their particular interests.
The most common scoring rule (and the one used in this paper) is the quadratic probability score,
which is the analog of MSE for probability forecasts and an explicit measure of forecast
accuracy. Furthermore, forecast users can employ calibration tests to evaluate subsets of the
probability forecasts that are of particular interest, such as forecasts that lie above a certain threshold. These calibration tests, as per Seillier-Moiseiwitsch and Dawid (1993), examine the degree of equivalence between an event's predicted
and observed frequencies of occurrence within subsets of the unit interval. The specification of
these three elements forms an evaluation framework that, although based on statistical loss
functions, can be tailored to provide economically meaningful inference for a wide range of
decision-theoretic problems.
The paper is structured as follows. In Section II, the loss functions previously used for
volatility forecast evaluation are reviewed. Section III describes the three elements of the
proposed evaluation framework that are specified by the user -- the events to be forecast, the
scoring rule used to evaluate the probability forecasts, and the subsets of the forecasts of
particular interest. Section IV presents an empirical example using daily exchange rates to
illustrate the evaluation procedure. Four different sets of probability forecasts are used to
evaluate the forecasts from nine volatility models. The example provides three insights into
volatility forecast evaluation. First, the evaluations are sensitive to the loss function chosen, and
thus forecast users should consider their choice carefully. Second, the disparity between in-sample and out-of-sample evaluation results clearly indicates that in-sample diagnostic tests must not be considered a sufficient analysis and should be complemented with out-of-sample forecast evaluation. Third, although the evaluations do not provide definitive answers on which forecasts minimize the relevant loss functions, the proposed forecast evaluation framework provides useful and economically meaningful information for reducing the
choice set. Section V concludes and suggests directions for further research.
II. Loss Functions for Volatility Forecast Evaluation

Financial volatility models are generally of the form

$$y_t = \mu_t + \epsilon_t, \qquad \epsilon_t = \sqrt{h_t}\, z_t, \qquad z_t \sim \text{i.i.d. } D(0,1),$$

where the conditional mean $\mu_t = E\left[ y_t \mid \Omega_{t-1} \right]$, $\Omega_{t-1}$ is the information set available at time t-1, $\epsilon_t$ is the innovation term, $h_t$ is its conditional variance and D(0,1) is a symmetric, standardized distribution; thus, $\epsilon_t \mid \Omega_{t-1} \sim D\left( 0, h_t \right)$. Variance dynamics are introduced into the model via the specification of $h_t$; for example, the GARCH(1,1) model sets $h_t = \omega + \alpha \epsilon_{t-1}^2 + \beta h_{t-1}$.
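For concreteness, the following minimal sketch simulates data from this general model with GARCH(1,1) variance dynamics; the parameter values are illustrative assumptions, not estimates from any series.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
omega, alpha, beta = 0.05, 0.10, 0.85          # GARCH(1,1) parameters (assumed)
mu = 0.0                                        # constant conditional mean (assumed)

y = np.empty(T)
h = np.empty(T)
h[0] = omega / (1.0 - alpha - beta)             # start at the unconditional variance
eps_prev = 0.0
for t in range(T):
    if t > 0:
        h[t] = omega + alpha * eps_prev ** 2 + beta * h[t - 1]
    z = rng.standard_normal()                   # z_t ~ i.i.d. N(0, 1)
    eps_prev = np.sqrt(h[t]) * z                # innovation eps_t = sqrt(h_t) * z_t
    y[t] = mu + eps_prev
```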
Generally, the parameters of the model are estimated over a specified in-sample period t= 1,..., T,
and volatility forecasts based on these estimates are generated over the out-of-sample period t=
T+1,...,T+T*. (Note that, in a convenient abuse of terminology, the in-sample, fitted conditional variances will also be referred to as forecasts.) Several in-sample specification tests, such as the conditional moment tests as per Wooldridge (1990) or the news impact curve proposed by Engle and Ng
(1993), are available for comparing volatility model specifications. However, out-of-sample
forecast accuracy provides an alternative and potentially more useful comparison; i.e., a model
that generates accurate forecasts must be providing a reasonable approximation of the underlying
data generating process. In the literature, volatility forecasts have been evaluated using economic
and statistical loss functions. Economic loss functions, such as the utility-based loss function
proposed by West et al. (1993) or the profit-based loss function proposed by Engle et al. (1993),
provide the most meaningful forecast evaluations since they directly incorporate the costs faced by the forecast user.
However, such specific economic loss functions are rarely available, and purely statistical
loss functions, such as the mean-squared error (MSE) criterion, are more commonly used to evaluate volatility forecasts. For one-step-ahead forecasts, MSE is the average squared difference between the actual conditional variance $h_{T+t}$ and the corresponding volatility forecast $\hat{h}_{T+t}$. However, since $h_{T+t}$ is not directly observable, the squared innovation $\epsilon_{T+t}^2$ is used as a proxy. Thus,

$$\text{MSE} = \frac{1}{T^*} \sum_{t=1}^{T^*} \left( \epsilon_{T+t}^2 - \hat{h}_{T+t} \right)^2,$$

where $\{\hat{h}_{T+t}\}_{t=1}^{T^*}$ are the one-step-ahead forecasts based on the in-sample parameter estimates and the relevant information sets. (Note that this analysis can be extended to k-step-ahead forecasts.)
Although MSE could in some instances be the appropriate loss function for evaluating conditional variance forecasts, the use of the $\epsilon_{T+t}^2$ proxy is problematic. Specifically, even though $\epsilon_{T+t}^2$ is an unbiased estimator of $h_{T+t}$, it is generally an imprecise or noisy estimator due to the high variability of its underlying distribution. For example, under conditional normality, $\epsilon_{T+t}^2 \mid \Omega_{T+t-1} \sim h_{T+t}\, \chi^2(1)$, so that $\Pr\left( \epsilon_{T+t}^2 \in \left[ \tfrac{1}{2} h_{T+t},\, \tfrac{3}{2} h_{T+t} \right] \right) \approx 0.25$; that is, even if one is willing to accept a proxy that is up to 50% different from $h_{T+t}$, $\epsilon_{T+t}^2$ would fulfill this condition only 25% of the time. (See Andersen and Bollerslev (1997) for a discussion
of alternative proxies based on high-frequency data.) A further difficulty with MSE evaluation in
a heteroskedastic environment, as noted by Bollerslev et al. (1994), is that the symmetric nature
of this loss function does not sufficiently penalize nonpositive variance forecasts.
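The imprecision of the squared-innovation proxy is easy to verify by simulation. The sketch below assumes conditional normality and a known conditional variance h, and measures how often the proxy lands within 50% of h; the fraction is roughly one quarter, as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
h = 1.0                                        # true conditional variance (assumed known)
eps = np.sqrt(h) * rng.standard_normal(n)      # innovations eps = sqrt(h) * z, z ~ N(0,1)
proxy = eps ** 2                               # squared-innovation proxy for h

# Fraction of draws for which the proxy lies within +/- 50% of the true variance.
within_band = np.mean((proxy >= 0.5 * h) & (proxy <= 1.5 * h))
print(f"Pr(proxy within 50% of h) ~ {within_band:.3f}")   # roughly 0.25
```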
Several alternative statistical loss functions are proposed in the literature, although the same criticisms cited above would apply to them; for example, mean absolute error (MAE),

$$\text{MAE} = \frac{1}{T^*} \sum_{t=1}^{T^*} \left| \epsilon_{T+t}^2 - \hat{h}_{T+t} \right|.$$

Two loss functions that penalize volatility forecasts asymmetrically are the logarithmic loss (LL) function employed by Pagan and Schwert (1990),

$$\text{LL} = \frac{1}{T^*} \sum_{t=1}^{T^*} \left[ \ln\left( \epsilon_{T+t}^2 \right) - \ln\left( \hat{h}_{T+t} \right) \right]^2,$$

which penalizes inaccurate variance forecasts more heavily when $\epsilon_{T+t}^2$ is low, and the heteroskedasticity-adjusted MSE (HMSE) used by Bollerslev and Ghysels (1996),

$$\text{HMSE} = \frac{1}{T^*} \sum_{t=1}^{T^*} \left[ \frac{\epsilon_{T+t}^2}{\hat{h}_{T+t}} - 1 \right]^2.$$

Bollerslev et al. (1994) further suggest the loss function implicit in the Gaussian quasi-maximum likelihood function,

$$\text{GMLE} = \frac{1}{T^*} \sum_{t=1}^{T^*} \left[ \ln\left( \hat{h}_{T+t} \right) + \frac{\epsilon_{T+t}^2}{\hat{h}_{T+t}} \right].$$
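For reference, the following sketch computes these statistical loss functions for a given set of out-of-sample volatility forecasts; the input arrays (squared innovations and variance forecasts) are assumed to be supplied by the user.

```python
import numpy as np

def volatility_loss(eps2, h_hat):
    """Statistical loss functions from Section II.

    eps2  : array of squared innovations (the volatility proxy)
    h_hat : array of one-step-ahead conditional variance forecasts
    """
    eps2, h_hat = np.asarray(eps2, dtype=float), np.asarray(h_hat, dtype=float)
    return {
        "MSE":  np.mean((eps2 - h_hat) ** 2),
        "MAE":  np.mean(np.abs(eps2 - h_hat)),
        "LL":   np.mean((np.log(eps2) - np.log(h_hat)) ** 2),
        "HMSE": np.mean((eps2 / h_hat - 1.0) ** 2),
        "GMLE": np.mean(np.log(h_hat) + eps2 / h_hat),
    }
```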
Although these functions may coincide with a specific forecast user's economic loss function, they are generally devoid of economic content. Hence, the economic inference available from evaluations based on these loss functions is limited.

III. The Proposed Forecast Evaluation Framework

In this paper, an evaluation framework based on transformed volatility forecasts, i.e., probability forecasts, is proposed. This framework is also based on statistical loss functions; however, analyzing probability forecasts generated from volatility forecasts permits the user to introduce economic considerations into the evaluation.
The proposed framework permits the user to tailor the forecast evaluation to their
particular interests in two stages. The first stage consists of transforming volatility forecasts into
probability forecasts of events of interest. (Note that this type of evaluation defines a volatility
model as a combination of both the variance dynamics specified by ht and the distribution of the
standardized residuals specified by D.) This focus on probability forecasts can be linked to a
large, general class of economic loss functions commonly used in decision-theoretic problems.
As discussed by Granger and Pesaran (1996), probability forecasts are the key input to loss
functions that gauge the economic value of taking an action, such as whether or not to purchase
an asset. In the second stage, the probability forecasts are evaluated using the statistical criteria
that most closely correspond to the user's interests. The first part of this section describes how probability forecasts are generated from volatility forecasts, and the remainder describes the statistical criteria used to evaluate these probability forecasts.
The distribution D can be assumed or estimated as per Engle and Gonzalez-Rivera (1991), and probability forecasts of specified events are generated by integrating over this distribution. The appropriate range of integration will depend on the economic event of interest; i.e., the user can tailor the forecast evaluation to the region of the distribution most relevant to their economic interests. This flexibility permits users with different (but unspecified) economic loss functions to introduce their particular interests into the forecast evaluation, even though they will all use the same statistical tools.
A complication arises for multi-step-ahead probability forecasts. As shown by Baillie and Bollerslev (1992), variance dynamics cause the distribution of multi-step-ahead forecast errors to be of an unknown form, even when D is specified. Several methods are available to address this difficulty; for example, Baillie and Bollerslev (1992) use the Cornish-Fisher expansion, and simulation-based methods, as in Geweke (1989) or Bollerslev and Mikkelsen (1996), can be used to generate the desired k-step-ahead probability forecasts. Since the forecast evaluation tools described below are appropriate for transformed volatility forecasts of all horizons, the following discussion will focus on one-step-ahead forecasts to simplify both the notation and the exposition.
The probability forecast notation is as follows. Volatility models are fit to the in-sample observations $\{y_t\}_{t=1}^{T}$, and $P_t$ is the in-sample probability forecast for time t based on the parameters estimated over the entire sample and the necessary elements of $\Omega_{t-1}$. Out-of-sample, $P_{T+t}$ for $t = 1,\dots,T^*$ is the one-step-ahead probability forecast conditional on the parameter estimates and $\Omega_{T+t-1}$. The subsequent discussion will focus on out-of-sample forecasts. The event space to be examined is created by partitioning the set of all possible outcomes into N mutually exclusive and collectively exhaustive subsets according to the forecast user's interests. If N=2, a binomial event is specified, and the evaluation is based on $P_{T+t}$ and $R_{T+t}$, an indicator variable equaling one if the event occurs at time T+t and zero otherwise.
In the following discussion and empirical exercise, two general types of binomial events,
innovation and level events, will be examined. An important category of volatility forecast users comprises those interested in the behavior of the innovation term $\epsilon_{T+t}$. This category includes, for example, spot traders or options traders structuring hedging strategies with respect to $\epsilon_{T+t}$. For such users, the economic event of interest is $\epsilon_{T+t} \in \left[ L_{\epsilon,T+t},\, U_{\epsilon,T+t} \right]$, and the associated probability forecast is

$$P_{T+t} = \Pr\left( L_{\epsilon,T+t} \le \epsilon_{T+t} \le U_{\epsilon,T+t} \right) = \Pr\left( \frac{L_{\epsilon,T+t}}{\sqrt{\hat{h}_{T+t}}} \le z_{T+t} \le \frac{U_{\epsilon,T+t}}{\sqrt{\hat{h}_{T+t}}} \right) = \int_{l_{\epsilon,T+t}}^{u_{\epsilon,T+t}} f\left( z_{T+t} \right) dz_{T+t},$$

where $z_{T+t}$ is the standardized innovation, $f\left( z_{T+t} \right)$ is the functional form of D(0,1) and $\left[ l_{\epsilon,T+t},\, u_{\epsilon,T+t} \right]$ is the standardized range of integration. Forecast users thus tailor the volatility forecast evaluation by specifying the event bounds $L_{\epsilon,T+t}$ and $U_{\epsilon,T+t}$.
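As an illustration, a volatility forecast can be converted into such a probability forecast with a few lines of code. The sketch below assumes a standard normal (or other scipy.stats) distribution for $z_{T+t}$ and user-supplied event bounds; the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def innovation_event_prob(h_hat, lower, upper, dist=stats.norm):
    """P_{T+t} = Pr(lower <= eps_{T+t} <= upper) implied by the variance
    forecast h_hat and an assumed standardized distribution for z_{T+t}."""
    h_hat = np.asarray(h_hat, dtype=float)
    return dist.cdf(upper / np.sqrt(h_hat)) - dist.cdf(lower / np.sqrt(h_hat))

# Example: probability that the innovation falls in [-0.5, 1.0] when the
# one-step-ahead variance forecast is 0.5 (illustrative numbers).
print(innovation_event_prob(0.5, -0.5, 1.0))
```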
The second type of economic event of interest is based on the behavior of the level term $y_{T+t}$. Such events would be of interest, for example, to a central bank forecasting whether an exchange rate will remain within a target zone. In such cases, the event of interest is $y_{T+t} \in \left[ L_{y,T+t},\, U_{y,T+t} \right]$, and the associated probability forecast is

$$P_{T+t} = \Pr\left( L_{y,T+t} \le y_{T+t} \le U_{y,T+t} \right) = \Pr\left( L_{y,T+t} - \hat{\mu}_{T+t} \le \epsilon_{T+t} \le U_{y,T+t} - \hat{\mu}_{T+t} \right)$$
$$= \Pr\left( \frac{L_{y,T+t} - \hat{\mu}_{T+t}}{\sqrt{\hat{h}_{T+t}}} \le z_{T+t} \le \frac{U_{y,T+t} - \hat{\mu}_{T+t}}{\sqrt{\hat{h}_{T+t}}} \right) = \int_{l_{y,T+t}}^{u_{y,T+t}} f\left( z_{T+t} \right) dz_{T+t},$$

where $\hat{\mu}_{T+t}$ is the forecasted conditional mean and $\left[ l_{y,T+t},\, u_{y,T+t} \right]$ is the standardized range of integration. For the central bank target zone example, $\left[ L_{y,T+t},\, U_{y,T+t} \right]$ would equal the constant interval [L,U] for all t. Granger, White and Kamstra (1989) examine interval forecasts based on ARCH-quantile estimators for $\{y_{T+t}\}_{t=1}^{T^*}$, in which case the endpoints $\left[ L_{y,T+t},\, U_{y,T+t} \right]$ are set to generate a constant $\alpha$% confidence interval around the corresponding conditional mean forecasts; see Christoffersen (1998) for further discussion of interval forecast evaluation.
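The corresponding calculation for a level event simply shifts the bounds by the conditional mean forecast before standardizing, as in the sketch below (again with assumed inputs and a hypothetical function name).

```python
import numpy as np
from scipy import stats

def level_event_prob(mu_hat, h_hat, lower, upper, dist=stats.norm):
    """P_{T+t} = Pr(lower <= y_{T+t} <= upper) implied by the conditional
    mean forecast mu_hat, the variance forecast h_hat and the assumed
    standardized distribution of z_{T+t}."""
    scale = np.sqrt(np.asarray(h_hat, dtype=float))
    return dist.cdf((upper - mu_hat) / scale) - dist.cdf((lower - mu_hat) / scale)

# Target-zone example with illustrative numbers: zone boundaries 5% below and
# above the last in-sample level observation y_T.
y_T = 40.0
print(level_event_prob(mu_hat=y_T + 0.01, h_hat=0.5,
                       lower=0.95 * y_T, upper=1.05 * y_T))
```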
Once the desired probability forecasts are generated, forecast users can proceed to
evaluate them using probability scoring rules and calibration tests. These two forecast evaluation
criteria, as well as the Diebold-Mariano (1995) tests of comparative predictive accuracy, are
described in the following subsections; see Diebold and Lopez (1996) for further discussion.
Probability scoring rules are primarily employed in the economics literature to evaluate probability forecasts of business cycle turning points; see, for example, Diebold and Rudebusch (1989), Zellner and Hong (1991) and Lahiri and Wang (1994). Scoring rules measure the statistical accuracy of the forecasted
probabilities. (Note that a scoring rule should be proper; i.e., it should not cause the forecaster
to modify their actual forecasts in order to improve their expected score.) Just as with the
statistical loss functions described in Section II, a variety of proper scoring rules are available;
see Murphy and Daan (1985) as well as Winkler (1993) for further discussion. However, a key
difference between these two classes of statistical loss functions with respect to volatility
forecasts is that scoring rules are based on comparing a model's forecasts to observable events, rather than to a noisy proxy for the unobservable variance.
In practice, the forecast user should select the proper scoring rule that addresses their
particular interests most directly. In the following discussion and empirical exercise, the
quadratic probability score (QPS), the most common scoring rule, is used. The QPS, originally
developed by Brier (1950), is the analog of MSE for probability forecasts and thus implies a quadratic loss function. It is defined as

$$\text{QPS} = \frac{1}{T^*} \sum_{t=1}^{T^*} 2 \left( P_{T+t} - R_{T+t} \right)^2,$$

which implies that QPS $\in$ [0,2] and has a negative orientation (i.e., smaller values indicate more
accurate forecasts). This type of statistical loss function would be most relevant to economic
agents facing decision-theoretic problems that explicitly value the accuracy of the probability
forecast inputs.
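Computed from the probability forecasts and the observed event indicators, the QPS is a one-line calculation; the sketch below assumes both series are already aligned.

```python
import numpy as np

def qps(p, r):
    """Quadratic probability score (Brier, 1950): mean of 2*(P - R)^2, where
    P are probability forecasts in [0,1] and R are 0/1 event outcomes.
    Smaller values indicate more accurate forecasts; QPS lies in [0, 2]."""
    p, r = np.asarray(p, dtype=float), np.asarray(r, dtype=float)
    return np.mean(2.0 * (p - r) ** 2)
```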
Scoring rules measure the accuracy of probability forecasts; hence, if the QPS for $\{P_{A,T+t}\}_{t=1}^{T^*}$ is closer to zero than the QPS for $\{P_{B,T+t}\}_{t=1}^{T^*}$, then the forecasts from model A are
more accurate than those from model B. However, whether this outcome is statistically
significant or an artifact of the sample at hand is unclear. Diebold and Mariano (1995) propose
several tests for determining whether the expected losses (under a general loss function) induced
by two sets of point forecasts are statistically different. These tests are readily generalized to
probability forecasts.
The null hypothesis under the loss function g is $E\left[ g\left( P_{A,T+t}, R_{T+t} \right) \right] = E\left[ g\left( P_{B,T+t}, R_{T+t} \right) \right]$ or, equivalently, $E\left[ d_{T+t} \right] \equiv E\left[ g\left( P_{A,T+t}, R_{T+t} \right) - g\left( P_{B,T+t}, R_{T+t} \right) \right] = 0$. For QPS, $d_{T+t} = 2\left( P_{A,T+t} - R_{T+t} \right)^2 - 2\left( P_{B,T+t} - R_{T+t} \right)^2$, where $P_{A,T+t}$ and $P_{B,T+t}$ are the probability forecasts from two volatility models. (Note that to test this hypothesis, $\{d_{T+t}\}_{t=1}^{T^*}$ must be covariance stationary, a condition determined empirically.)
Several statistics, both asymptotic and finite-sample, are proposed for testing this null hypothesis. The asymptotic test statistic used here is

$$S = \frac{\bar{d}}{\sqrt{2\pi \hat{f}_d(0) / T^*}} \ \stackrel{a}{\sim}\ N(0,1),$$

where $\bar{d}$ is the sample mean of $d_{T+t}$ and $\hat{f}_d(0)$ is a consistent estimate of the spectral density of $d_{T+t}$ at frequency zero.
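A minimal sketch of this test for QPS loss differentials is given below; it estimates the long-run variance $2\pi f_d(0)$ with a Bartlett kernel and an assumed truncation lag, so it should be read as an illustration rather than the exact procedure of Diebold and Mariano (1995).

```python
import numpy as np

def dm_test_qps(p_a, p_b, r, max_lag=None):
    """S statistic for equal QPS accuracy of two probability forecast series.
    Assumes the loss differential d_t is covariance stationary."""
    p_a, p_b, r = map(np.asarray, (p_a, p_b, r))
    d = 2 * (p_a - r) ** 2 - 2 * (p_b - r) ** 2      # QPS loss differential
    n = d.size
    d_bar = d.mean()
    if max_lag is None:
        max_lag = int(np.floor(n ** (1 / 3)))        # assumed truncation rule
    # Long-run variance (= 2*pi*f_d(0)) with Bartlett weights.
    lrv = np.mean((d - d_bar) ** 2)
    for k in range(1, max_lag + 1):
        gamma_k = np.mean((d[k:] - d_bar) * (d[:-k] - d_bar))
        lrv += 2 * (1 - k / (max_lag + 1)) * gamma_k
    return d_bar / np.sqrt(lrv / n)                  # asymptotically N(0, 1)
```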
Calibration tests examine the degree of equivalence between an event's observed and predicted frequencies of occurrence within subsets of the unit interval. For
example, a forecaster is providing perfectly calibrated rain forecasts if it rains on 10% of the days
for which a 10% chance of rain is forecast. A simple measure of calibration is the calibration
plot, which graphs the observed frequency of occurrence for the forecasts in subset j against $\bar{p}_j$, where $\bar{p}_j$ is the midpoint of one of the J mutually exclusive and collectively exhaustive subsets of the unit interval specified by the user. The degree to which this graph lies on the 45° line is a visual measure of calibration. Seillier-
Moiseiwitsch and Dawid (1993) present test statistics that formalize this analysis.
The null hypothesis of these tests is that the predicted frequency of occurrence for a
specified binomial event equals the observed frequency of occurrence and hence the forecasts are
well calibrated. The relevant test statistics are constructed by dividing the out-of-sample
probability forecasts into the J subsets of the unit interval relevant to the forecast user; for
example, if the deciles {[0, 0.1); [0.1, 0.2); ...; [0.9, 1]} are of interest as in Diebold and
Rudebusch (1989), then J = 10. Let $\bar{p}_j$ denote the midpoint of subset j, $T_j$ the number of forecasts in subset j, and $R_j$ the number of observed events paired with the forecasts in subset j. The test statistics are

$$Z_j = \frac{R_j - T_j \bar{p}_j}{\left[ T_j \bar{p}_j \left( 1 - \bar{p}_j \right) \right]^{1/2}} = \frac{R_j - e_j}{w_j^{1/2}}, \qquad j = 1,\dots,J,$$

$$Z_0 = \frac{R_0 - e_0}{w_0^{1/2}},$$

where $R_0 = \sum_{j=1}^{J} R_j$, $e_0 = \sum_{j=1}^{J} e_j$ and $w_0 = \sum_{j=1}^{J} w_j$. Each of these statistics is the square root of a $\chi^2$ goodness-of-fit statistic for a binomial event with $R_j$ observed outcomes and $e_j$ expected outcomes. Under the null hypothesis and weak conditions on the distribution of the probability
forecasts, these statistics are asymptotically normally distributed. If the null hypothesis for the
forecasts in subset j is rejected, then these forecasts do not adequately represent the observed
frequency of occurrence in that subset. The calibration tests and this flexible specification of the
J subsets of interest provide forecast users with an additional statistical tool with which to tailor the forecast evaluation to their particular interests.
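The following sketch implements calibration statistics of this form for decile subsets of the unit interval; it is an illustration under the definitions above, not the exact implementation of Seillier-Moiseiwitsch and Dawid (1993).

```python
import numpy as np

def calibration_z(p, r, bins=10):
    """Z_j and Z_0 calibration statistics: probability forecasts p are grouped
    into `bins` equal subsets of the unit interval, and observed event counts
    are compared with the counts expected under correct calibration."""
    p, r = np.asarray(p, dtype=float), np.asarray(r, dtype=float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    mid = (edges[:-1] + edges[1:]) / 2.0            # subset midpoints p_bar_j
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, bins - 1)
    z_j = np.full(bins, np.nan)
    R0 = e0 = w0 = 0.0
    for j in range(bins):
        in_j = idx == j
        Tj = in_j.sum()
        if Tj == 0:
            continue                                 # empty subset: no statistic
        Rj = r[in_j].sum()                           # observed events in subset j
        ej = Tj * mid[j]                             # expected events under H0
        wj = Tj * mid[j] * (1.0 - mid[j])
        z_j[j] = (Rj - ej) / np.sqrt(wj)
        R0, e0, w0 = R0 + Rj, e0 + ej, w0 + wj
    z0 = (R0 - e0) / np.sqrt(w0)                     # overall calibration statistic
    return z_j, z0
```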
In summary, a statistical framework for volatility forecast evaluation that permits the user
to tailor the evaluation to their economic interests is proposed. As discussed by Murphy (1973),
purely statistical and purely economic evaluations of forecasts represent the end points of a
continuum. By permitting the user to specify the event to be forecast, a probability scoring rule
and a partitioning of the unit interval for calibration tests, this framework permits evaluations
that incorporate elements of both types. In the following section, an empirical application
illustrating this framework for one-step-ahead volatility forecasts is presented. Evaluations of k-
step-ahead forecasts using this framework are reserved for future research.
IV. An Empirical Example Using Exchange Rate Data

In this section, the one-step-ahead volatility forecasts from nine models fitted to several foreign exchange rates are evaluated with respect to four different economic events of
interest. Within this empirical exercise, no one set of forecasts is found to minimize all of the
probability-based loss functions. In fact, not only does the minimizing set of forecasts vary
across exchange rates and events, it also varies across the in-sample and out-of-sample periods.
These results highlight two key features of volatility forecast evaluation (and forecast evaluation,
in general): the need for careful selection of the loss function employed and the usefulness of out-of-sample forecast evaluation.
The four exchange rates examined are the logged daily Canadian dollar (CD),
Deutschemark (DM) and Japanese yen (YN) spot exchange rates with respect to the U.S. dollar
and the U.S. dollar exchange rate with respect to the British pound (BP), as recorded by the
Federal Reserve Bank of New York, from 1980 to 1995. The in-sample period used for
parameter estimation is 1980-1993 (3516 observations), and the out-of-sample period is 1994-1995 (499 observations).
Given established unit root results for exchange rates (as per Baillie and Bollerslev (1989)
and others) and after standard unit root testing, the log exchange rates are modeled as I(1)
processes; that is,

$$\Delta y_t = \mu + \theta \epsilon_{t-1} + \epsilon_t, \qquad \epsilon_t \mid \Omega_{t-1} \sim D\left( 0, h_t \right),$$

where $y_t \equiv 100 \log\left( s_t \right)$ and $s_t$ is one of the spot exchange rates. The moving-average term is included due to the presence of a small, but significant, first-order serial correlation in $\Delta y_t$.
Table 1 presents the in-sample, least squares estimates of these conditional mean parameters,
higher-order moments of the in-sample and out-of-sample $\epsilon_t$ series and the portmanteau statistics for up to twentieth-order serial correlation in the $\epsilon_t$ and $\epsilon_t^2$ series. The kurtosis estimates and the
Q2(20) statistics indicate the presence of conditional heteroskedasticity in all four series.
To address the presence of variance dynamics in these series, three categories of volatility
models are used: simple models, GARCH models and stochastic volatility models. These
models are estimated over the in-sample period, and the parameter estimates are used to generate
volatility forecasts over the out-of-sample period. (The parameter estimates and relevant standard errors are reported in the Appendix.)
Three simple volatility models are examined. The first model is the Gaussian homoskedastic model (denoted homo.), which sets $h_t = \sigma^2$ for all t. Although, as shown in Table 2, the squared standardized residuals from this model still display significant serial correlation, it is included as a benchmark. As in West and Cho (1994), two autoregressive models of $\epsilon_t$ are estimated using
ordinary least squares and an assumed conditionally normal distribution. The autoregressive $\epsilon_t^2$ model assuming a lag-order of 10 specifies $h_t = \alpha_0 + \sum_{i=1}^{10} \alpha_i \epsilon_{t-i}^2$ and is denoted as AR10sq.
(Note that volatility forecasts from this model can be negative, and in such cases, the forecasts
are set to a small positive number.) The second autoregressive model (AR10ab), as developed by
Davidian and Carroll (1987) and Schwert (1989), is based on $\left| \epsilon_t \right|$ and is specified as

$$h_t^{1/2} = \sqrt{\pi/2}\, \left( \alpha_0 + \sum_{i=1}^{10} \alpha_i \left| \epsilon_{t-i} \right| \right).$$
From the second category, five models are specified and estimated via maximum
likelihood methods. The first three models share the same GARCH(1,1) specification (i.e., $h_t = \omega + \alpha \epsilon_{t-1}^2 + \beta h_{t-1}$), but use different distributional assumptions. Specifically, the three
assumed, symmetric forms of D are the normal distribution, the t-distribution and the generalized error distribution (GED), all of which are special cases of the generalized t-distribution
discussed in McDonald and Newey (1988). Both the t- and GED-distributions permit fatter tails
than the normal distribution, an empirical regularity often found in financial time series; see
Figure 1 for a comparison of these distributions based on the estimated shape parameters for the
BP series. These models are denoted as GARCH, GARCH-t and GARCH-g, respectively.
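Comparable GARCH(1,1) models under the three distributional assumptions can be estimated with standard software. The sketch below uses the Python arch package with a constant conditional mean (a simplification of the MA(1) mean used here) and a synthetic series standing in for the percentage returns; it is an illustration, not the author's estimation code.

```python
import numpy as np
from arch import arch_model   # pip install arch

rng = np.random.default_rng(0)
returns = 0.7 * rng.standard_normal(3516)      # placeholder for 100 * diff(log s_t)

for dist in ("normal", "t", "ged"):            # GARCH, GARCH-t, GARCH-g analogues
    am = arch_model(returns, mean="Constant", vol="GARCH", p=1, q=1, dist=dist)
    res = am.fit(disp="off")
    # One-step-ahead conditional variance forecast from the end of the sample.
    h_next = res.forecast(horizon=1).variance.iloc[-1, 0]
    print(dist, float(h_next))
```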
The fourth and fifth models are derived from the calibrated, exponential-smoothing
model currently used in the volatility forecasting methodology proposed in J.P. Morgan's Riskmetrics™ system; see J.P. Morgan (1996). The assumed variance dynamics are

$$h_t = (1 - \lambda) \sum_{i=1}^{\infty} \lambda^{i-1} \epsilon_{t-i}^2 = (1 - \lambda)\, \epsilon_{t-1}^2 + \lambda\, h_{t-1};$$

note that this calibrated model is equivalent to the IGARCH model of Engle and Bollerslev (1986) with the intercept parameter set to zero. The parameter $\lambda$ is calibrated according to the degree of persistence desired. For this exercise, the commonly-used value of $\lambda = 0.94$ is specified, and the
conditional mean parameters are those estimated using the homoskedastic model. The two
exponential-smoothing models used in this exercise differ only in their assumed distribution D;
the e.smooth-N model uses the standard normal, and the e.smooth-t model uses the t-distribution
with the same shape parameter as estimated for the GARCH-t model.
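The exponential-smoothing variance recursion itself is straightforward to implement, as in the sketch below; the starting value h0 is an assumption of this illustration (the sample variance by default).

```python
import numpy as np

def riskmetrics_variance(eps, lam=0.94, h0=None):
    """Exponentially smoothed conditional variance:
    h_t = (1 - lam) * eps_{t-1}**2 + lam * h_{t-1}."""
    eps = np.asarray(eps, dtype=float)
    h = np.empty(eps.size)
    h[0] = np.var(eps) if h0 is None else h0      # assumed initialization
    for t in range(1, eps.size):
        h[t] = (1.0 - lam) * eps[t - 1] ** 2 + lam * h[t - 1]
    return h
```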
The final model is the AR(1) stochastic volatility model (s.v.) used by Harvey, Ruiz and
Shephard (1994); that is, after removing the conditional mean based on least squares parameter
estimates,
$$\epsilon_t = \sigma \exp\left( w_t / 2 \right) z_t, \qquad z_t \sim N(0, 1),$$
$$w_t = \beta w_{t-1} + \eta_t, \qquad \eta_t \sim N\left( 0, \sigma_\eta^2 \right),$$

where $z_t$ and $\eta_t$ are independent. Thus, the innovation term is subject to two independent shocks. Estimation of the model is conducted using the Kalman filter on the measurement equation $\log \epsilon_t^2 = \log \sigma^2 + w_t + \log z_t^2$ and the transition equation $w_t = \beta w_{t-1} + \eta_t$. To use quasi-maximum likelihood methods, the measurement equation is rewritten as $\log \epsilon_t^2 = \omega + w_t + \kappa_t$, where $\omega \equiv \log \sigma^2 + E\left[ \log z_t^2 \right]$ and $\kappa_t$ is assumed to be distributed $N\left( 0, \pi^2 / 2 \right)$. (Note that alternative estimation techniques have been proposed; see Jacquier, Polson and Rossi, 1994.) The log-likelihood function is

$$\ln L\left( \theta; \epsilon_1, \dots, \epsilon_T \right) = -\frac{T}{2} \log(2\pi) - \frac{1}{2} \sum_{t=1}^{T} \log G_t - \frac{1}{2} \sum_{t=1}^{T} \frac{\left( \log \epsilon_t^2 - a_t \right)^2}{G_t},$$

where $a_t$ and $G_t$ are the Kalman filter estimates of $\log \epsilon_t^2$ (i.e., its one-step-ahead prediction) and its conditional variance, respectively, and $\theta$ denotes the parameter vector $\left( \omega, \beta, \sigma_\eta^2 \right)$.
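A compact sketch of the quasi-maximum likelihood calculation is given below: the Kalman filter is run on $\log \epsilon_t^2$ and the prediction-error decomposition yields the quasi log-likelihood. The parameter names and the initialization at the stationary distribution are assumptions of this illustration.

```python
import numpy as np

def sv_qml_loglik(params, log_eps2):
    """Quasi log-likelihood of the AR(1) stochastic volatility model from the
    Kalman filter applied to log eps_t^2 = omega + w_t + kappa_t,
    w_t = beta * w_{t-1} + eta_t (a sketch; requires |beta| < 1)."""
    omega, beta, sig2_eta = params
    var_kappa = np.pi ** 2 / 2.0                 # variance of log z_t^2, z ~ N(0,1)
    w, P = 0.0, sig2_eta / (1.0 - beta ** 2)     # stationary initialization
    loglik = 0.0
    for y in log_eps2:
        a = omega + w                            # one-step-ahead prediction of log eps^2
        G = P + var_kappa                        # its prediction variance
        v = y - a                                # prediction error
        loglik += -0.5 * (np.log(2 * np.pi) + np.log(G) + v ** 2 / G)
        K = P / G                                # Kalman gain
        w = beta * (w + K * v)                   # predicted state for next period
        P = beta ** 2 * P * (1.0 - K) + sig2_eta
    return loglik
```

The parameters can then be estimated by maximizing this function numerically (e.g., by applying scipy.optimize.minimize to its negative), subject to $|\beta| < 1$ and $\sigma_\eta^2 > 0$.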
One-step-ahead volatility forecasts are easily obtained from the simple and GARCH(1,1)
models, and the corresponding probability forecasts are derived via integration. One-step-ahead volatility forecasts from the s.v. model are obtained using the Kalman filter, but converting them into probability forecasts is more involved because the implied distribution of the innovation makes direct integration difficult and computationally intensive. To alleviate this problem, this distribution is empirically approximated by simulating 10,000 draws from it and creating 0.5% quantiles. The probability forecasts are then generated from this approximation to the distribution function of $z_t$. (The magnitude of this approximation error is small and can be made smaller by increasing the number of simulated draws.)
Standard diagnostic tests of the in-sample and out-of-sample standardized residuals are presented in Table 2. In-sample, almost all of the volatility models, except for the homo. and s.v.
models, consistently remove the serial correlation in the squared standardized residuals in all four
series. Out-of-sample, only the homo. model fails to consistently do so. These results indicate
that all of the volatility models, with the possible exception of the s.v. model, may be appropriate
specifications of the variance dynamics. In fact, as shown in Table 3 for the BP series, the in-
sample and out-of-sample volatility forecasts are highly correlated, especially within the GARCH
class of models. These results are similar for the other three series.
Before evaluating the generated probability forecasts, Tables 4 and 5 report the in-sample
and out-of-sample forecast evaluation results using the statistical loss functions from Section II.
The in-sample results clearly indicate that the loss function chosen directly impacts the forecast
evaluation results. MSE is minimized for all four series by the AR10sq model, as expected.
However, as indicated by the S test results, it does not do so at the 5% level of statistical
significance. The MAE and LL loss functions are minimized by the s.v. model significantly; i.e., most of the alternative models' forecasts fail the one-on-one S test at the 5% level. The results for HMSE indicate no clear pattern, and the GMLE loss function is minimized, but not significantly, by the GARCH model.
The out-of-sample forecast evaluation results vary considerably in comparison to the in-
sample results. No one set of forecasts minimizes a specific loss function across all four series,
and few of the forecasts that minimize a loss function do so significantly. (West and Cho (1994)
find qualitatively similar results for MSE using different statistical tests and various forecast
horizons.) Of the twenty forecast evaluations reported in Table 5, in only eight cases is the
minimizing set of forecasts the same across the in-sample and out-of-sample periods, and six of
these cases are the MAE and LL with respect to the s.v. model. These results indicate that
forecast evaluation based on such loss functions varies considerably; hence, the forecast user
should do so only after careful consideration of how these loss functions reflect their particular
economic interests.
In the next subsection, the four economic events used for evaluating the volatility
forecasts from the nine models are described. In order to simplify this sample evaluation, only
the event to be forecast is varied. The QPS is the probability scoring rule used throughout, and
the subsets of interest are the deciles [0, 0.1), [0.1, 0.2),...,[0.9, 1].
The first two events are based on the behavior of $\Delta y_t$, the one-day return. The first event is $\Delta y_t \in \left[ -0.5,\, 1 \right]$, which is a relatively high probability event, as shown in Table 6. (Note that, after adjusting for the specified conditional mean, this event is $\epsilon_t \in \left[ -0.5 - \hat{\mu}_t,\, 1 - \hat{\mu}_t \right]$.)
Such events are of interest, for example, to spot traders deciding how to structure their market
positions or to options traders holding out-of-the-money positions. The second event is related to
the value-at-risk estimates now commonly used in managing the market risk of trading
portfolios; see Lopez (1997) for further discussion. In such cases, traders generally wish to know
a specified lower-tail quantile of their portfolio returns distribution in order to manage their
market risk exposure. For this exercise, the lower five percent quantile of the empirical, in-sample distribution of $\Delta y_t$ is specified as the threshold loss (TL) of interest to the forecast users, and the event to be forecast is $\Delta y_t < \text{TL}$. As shown in Table 6, the observed in-sample frequency of occurrence of this event is constructed to be five percent, while the out-of-sample frequencies need not equal five percent.
The last two events focus on the behavior of the level term $y_t$. The third event is based on a $\delta$% band around the previous value of $y_t$; that is, $y_t \in \left[ \left( 1 - \delta/100 \right) y_{t-1},\, \left( 1 + \delta/100 \right) y_{t-1} \right]$. Such an event would be of interest, for example, to options traders with positions tied to the level of the underlying asset. The parameter $\delta$ is set at 2, except for the case of YN in which it is set to 0.2 (due to the high level of $y_t$). The last event is based on the idea of a central bank forecasting whether an exchange rate will remain within a specified target zone; i.e., the event of interest is $y_t \in \left[ L,\, U \right]$. In this evaluation exercise, L and U are set 5% below and above $y_T$, the last in-sample
observation. The observed frequencies of occurrence for event 3 are relatively high, while event
4 occurs only about 15 percent of the time, except for the YN series which is much higher. For
the purpose of illustration, Figure 2 plots the out-of-sample probability forecasts from the
GARCH-g model for the BP series for all four events of interest.
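For completeness, the sketch below constructs the 0/1 outcome series $R_t$ for the four events from a level series and its one-day returns; the threshold, band and target-zone inputs are assumed to be supplied as described above.

```python
import numpy as np

def event_indicators(y, dy, tl, zone, delta=2.0):
    """0/1 outcome series R_t for the four events of Section IV (a sketch).
    y: level series (100 * log spot rate); dy: one-day returns; tl: threshold
    loss (in-sample 5% quantile of dy); zone: (L, U) target-zone bounds."""
    y, dy = np.asarray(y, dtype=float), np.asarray(dy, dtype=float)
    r1 = (dy >= -0.5) & (dy <= 1.0)                   # event 1: dy in [-0.5, 1]
    r2 = dy < tl                                      # event 2: loss below TL
    prev = np.roll(y, 1)
    band = delta / 100.0
    lo = np.minimum((1 - band) * prev, (1 + band) * prev)
    hi = np.maximum((1 - band) * prev, (1 + band) * prev)
    r3 = (y >= lo) & (y <= hi)                        # event 3: within delta% of y_{t-1}
    r3[0] = False                                     # no previous observation at t = 0
    L, U = zone
    r4 = (y >= L) & (y <= U)                          # event 4: inside the target zone
    return r1.astype(int), r2.astype(int), r3.astype(int), r4.astype(int)
```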
4d. Forecast Evaluation Results
Table 7 presents the in-sample QPS results for the four exchange rates and the four
specified events. The minimum QPS value for each column in each panel is underlined. An
interesting result is that the GARCH-g model minimizes the QPS value in half of the 16 cases,
although the significance of this result is low given the poor S test results. That is, many of the models generate QPS values that are greater than that of the GARCH-g model, but not
significantly so. Overall, these results suggest that the GARCH-g model seems to characterize
well the variance dynamics and the distribution of zt for these series. The fitted GED
distributions fit these series well both in the center of the distribution, as per events 1 and 3, and in the tails, as per events 2 and 4.
Turning to the out-of-sample QPS results presented in Table 8, the results are less
conclusive. Overall, only three matches with the in-sample results occur. The GARCH-g model
generates the minimum QPS values in only four cases, and this result (as well as all other
minimum QPS values) is not generally significant at the 5% level. (Although not shown, sub-
sample results with the out-of-sample period divided in half provide similar results.) Thus, the
out-of-sample results indicate, much like the results in Table 5, that the chosen volatility models
provide forecasts that are practically indistinguishable under these event-based loss functions.
The calibration tests described in Section III provide an additional tool for evaluating the competing sets of out-of-sample volatility forecasts. The
calibration test results are presented in the companion columns of Table 8. Probability forecasts
that are well calibrated at the five percent significance level are denoted with a "+" symbol.
Clearly, in many cases, the calibration test reduces the number of models under consideration by
rejecting the probability forecasts that are not calibrated over the user-specified subsets of the
unit interval. (Once again, subsample results are consistent with the full-sample results.) For
example, for BP with respect to event 1, forecasts from the GARCH-g model generate the lowest
QPS value; the S test eliminates three competing models since they are significantly greater than
this value; and the calibration test eliminates three additional models. Hence, in this case,
although the forecasts from the GARCH-g, e.smooth-N and s.v. models are indistinguishable
using this statistical evaluation framework, the number of models to consider has been reduced
from nine to three. Although, as shown in Table 8, this evaluation procedure does not distinguish as sharply between competing volatility forecasts in all cases, it can be used to assist forecast users in reducing the number of models under consideration.
In conclusion, this sample forecast evaluation provides three clear results. First, the
different model rankings presented in Tables 4, 5, 7 and 8 clearly indicate that forecast
evaluations are directly affected by the loss function selected. In the absence of economic loss
functions, forecast users should carefully select the statistical loss functions that best match their
interests. Second, the disparity between the in-sample and out-of-sample results indicates that
tests of model fit and model specification should not be considered sufficient analysis. Out-of-
sample forecast evaluation provides additional, useful information for volatility model
specification and model selection. Third, the proposed framework for evaluating generated
probability forecasts can generally reduce the number of competing volatility forecasts under
consideration, a result that should be of particular interest to forecast users who must select from among the many available volatility models.
V. Conclusions
The forecast evaluations most meaningful to agents facing the issue of volatility model
selection are those conducted under economically meaningful loss functions. In the general
absence of such loss functions, researchers rely on statistical loss functions, such as MSE.
Although reasonable for point forecasts, MSE and related statistical loss functions are
problematic for volatility forecasts. In addition, since model rankings are highly dependent on
the loss function chosen and these loss functions are generally devoid of economic content, the economic inference that can be drawn from such evaluations is limited.
In this paper, a framework for volatility forecast evaluation based on statistical tools that
can be tailored to the economic interests of the forecast user is proposed. Specifically, the
forecasts from a volatility model are transformed (via integration or simulation) into probability
forecasts of events of interest. These probability forecasts are then evaluated using a proper
scoring rule selected by the user and calibration tests over the subsets of the unit interval that are of particular interest to the user.
The empirical exercise in Section IV clearly indicates that the loss function directly
influences the forecast evaluation results. Thus, the use of MSE as the standard loss function for
volatility forecast evaluation must be replaced with a more thoughtful selection. Yet, even under
the appropriate loss function, forecast evaluations may provide comparative results that are not
significantly different, as shown by the S test results. (In fact, such results for one-step-ahead
volatility forecasts may not be surprising in light of Nelson (1992) as well as Nelson and Foster
(1994), who find that short-term volatility forecasting is robust to a variety of misspecifications.)
An advantage of the proposed framework is that calibration tests provide further analysis of the
transformed volatility forecasts using criteria tailored to the user's interests.
This evaluation framework introduces a number of questions that require further analysis.
Most immediately, the properties of the transformed volatility forecasts must be more clearly
delineated. For example, further research is required to determine whether optimal forecasts
under the probability-based loss functions are available. Secondly, the properties of the QPS and
other scoring rules must be examined in light of West (1996); that is, uncertainty due to
parameter estimation could be incorporated into this analysis. Other avenues for research are
evaluating volatility forecasts from other models (i.e., different specifications for ht and D) and
for other financial time series under different loss functions. A systematic exploration of the
theoretical and empirical aspects of the proposed evaluation framework should permit these questions to be addressed.
References
Akgiray, V., 1989. Conditional Heteroskedasticity in Time Series of Stock Returns: Evidence
and Forecasts, Journal of Business, 62, 55-80.
Andersen, T.G. and Bollerslev, T., 1997. Answering the Critics: Yes, ARCH Models Do
Provide Good Volatility Forecasts, Working Paper #227, Department of Finance,
Kellogg Graduate School of Management, Northwestern University.
Baillie, R.T. and Bollerslev, T., 1989. The Message in Daily Exchange Rates: A Conditional
Variance Tale, Journal of Business and Economic Statistics, 7, 297-305.
Baillie, R.T. and Bollerslev, T., 1992. Prediction in Dynamic Models with Time-Dependent
Conditional Variances, Journal of Econometrics, 52, 91-113.
Bollerslev, T., Engle, R.F. and Nelson, D.B., 1994. ARCH Models, in Engle, R.F. and
McFadden, D., eds. The Handbook of Econometrics, Volume 4, 2959-3038. Amsterdam:
North-Holland.
Bollerslev, T. and Mikkelsen, H.O., 1996. Modeling and Pricing Long Memory in Stock
Market Volatility, Journal of Econometrics, 73, 151-184.
Brailsford, T.J. and Faff, R.W., 1996. An Evaluation of Volatility Forecasting Techniques,
Journal of Banking and Finance, 20, 419-438.
Christoffersen, P.F., 1998. Evaluating Interval Forecasts, International Economic Review, 39,
841-862.
Davidian, M. and Carroll, R.J., 1987. Variance Function Estimation, Journal of the American
Statistical Association, 82, 1079-1091.
Diebold, F.X. and Lopez, J.A., 1995. Modeling Volatility Dynamics, in Hoover, K., ed.,
Macroeconometrics: Developments, Tensions and Prospects, 427-466. Boston: Kluwer
Academic Press.
Diebold, F.X. and Lopez, J.A., 1996. Forecast Evaluation and Combination, in Maddala, G.S.
and Rao, C.R., eds., Handbook of Statistics, Volume 14: Statistical Methods in Finance,
241-268. Amsterdam: North-Holland.
Diebold, F.X. and Mariano, R., 1995. Comparing Predictive Accuracy, Journal of Business
and Economic Statistics, 13, 253-264.
Diebold, F.X. and Rudebusch, G.D., 1989. Scoring the Leading Indicators, Journal of
Business, 62, 369-391.
Dimson, E. and Marsh, P., 1990. Volatility Forecasting without Data-Snooping, Journal of
Banking and Finance, 14, 399-421.
Dunsmuir, W., 1979. A Central Limit Theorem for Parameter Estimation in Stationary Vector
Time Series and its Application to Models for a Signal Observed with Noise, The Annals
of Statistics, 7, 490-506.
Engle, R.F., 1993. Statistical Models for Financial Volatility, Financial Analysts Journal, 49,
72-78.
Engle, R.F. and Bollerslev, T., 1986. Modeling the Persistence of Conditional Variances,
Econometric Reviews, 5, 1-50.
Engle, R.F. and Gonzalez-Rivera, G., 1991. Semiparametric ARCH Models, Journal of
Business and Economic Statistics, 9, 345-359.
Engle, R.F., Hong, C.-H., Kane, A. and Noh, J., 1993. Arbitrage Valuation of Variance
Forecasts with Simulated Options, in Chance, D.M. and Tripp, R.R., eds., Advances in
Futures and Options Research, 393-415. Greenwich, CT: JIA Press.
Engle, R.F., Kane, A. and Noh, J., 1996. Index-Option Pricing with Stochastic Volatility and
the Value of Accurate Variance Forecasts, Review of Derivatives Research, 1, 139-158.
Engle, R.F. and Ng, V., 1993. Measuring and Testing the Impact of News on Volatility,
Journal of Finance, 48, 1749-1778.
Geweke, J., 1989. Exact Predictive Densities in Linear Models with ARCH Disturbances,
Journal of Econometrics, 44, 307-325.
Granger, C.W.J. and Pesaran, M.H., 1996. A Decision Theoretic Approach to Forecast
Evaluation, Working Paper #96-23, University of California, San Diego.
Granger, C.W.J., White, H. and Kamstra, M., 1989. Interval Forecasting: An Analysis Based
Upon ARCH-Quantile Estimators, Journal of Econometrics, 40, 87-96.
Harvey, A.C., Ruiz, E. and Shephard, N., 1994. Multivariate Stochastic Variance Models,
Review of Economic Studies, 61, 247-264.
Jacquier, E., Polson, N.G. and Rossi, P.E., 1994. Bayesian Analysis of Stochastic Volatility
Models, Journal of Business and Economic Statistics, 12, 371-389.
J.P. Morgan, 1996. J.P. Morgan/Reuters RiskmetricsTM - Technical Document. New York: J.P.
Morgan. (http://www.jpmorgan.com/RiskManagement/RiskMetrics/pubs.html).
Kroner, K.F., Kneafsey, K.P. and Claessens, S., 1995. Forecasting Volatility in Commodity
Markets, Journal of Forecasting, 14, 77-96.
Lahiri, K. and Wang, J.G., 1994. Predicting Cyclical Turning Points with Leading Index in a
Markov Switching Model, Journal of Forecasting, 13, 245-263.
Lee, K.Y., 1991. Are the GARCH Models Best in Out-of-Sample Performance?, Economic
Letters, 37, 305-308.
Lopez, J.A., 1997. Regulatory Evaluation of Value-at-Risk Models, Federal Reserve Bank of
New York Staff Report #33.
Mandelbrot, B., 1963. The Variation of Certain Speculative Prices, Journal of Business, 36,
394-419.
McDonald, J.B. and Newey, W.K., 1988. Partially Adaptive Estimation of Regression Models
via the Generalized T-Distribution, Econometric Theory, 4, 428-457.
Murphy, A.H., 1973. A New Vector Partition of the Probability Score, Journal of Applied
Meteorology, 12, 595-600.
Murphy, A.H. and Daan, H., 1985. Forecast Evaluation in Murphy, A.H. and Katz, R.W., eds.,
Probability, Statistics and Decision Making in the Atmospheric Sciences, 379-437.
Boulder, Colorado: Westview Press.
Nelson, D.B., 1992. Filtering and Forecasting with Misspecified ARCH Models: I, Journal of
Econometrics, 52, 61-90.
Nelson, D.B., and Foster, D.P., 1994. Asymptotic Filtering Theory for Univariate ARCH
Models, Econometrica, 62, 1-41.
Noh, J., Engle, R.F., and Kane, A., 1994. Forecasting Volatility and Option Prices of the S&P
500 Index, Journal of Derivatives, 2, 17-30.
Pagan, A.R. and Schwert, G.W., 1990. Alternative Models for Conditional Stock Volatility,
Journal of Econometrics , 45, 267-290.
Schwert, G.W., 1989. Why Does Stock Market Volatility Change over Time?, Journal of
Finance, 44, 1115-1154.
Seillier-Moiseiwitsch, F., and Dawid, A.P., 1993. On Testing the Validity of Sequential
Probability Forecasts, Journal of the American Statistical Association, 88, 355-359.
Taylor, S.J., 1987. Forecasting the Volatility of Currency Exchange Rates, International
Journal of Forecasting, 3, 159-170.
West, K.D., 1996. Asymptotic Inference about Predictive Ability, Econometrica, 64, 1067-
1084.
West, K.D. and Cho, D., 1994. The Predictive Accuracy of Several Models of Exchange Rate
Volatility, Journal of Econometrics, 69, 367-391.
West, K.D., Edison, H.J. and Cho, D., 1993. A Utility-Based Comparison of Some Models of
Exchange Rate Volatility, Journal of International Economics, 35, 23-45.
Zellner, A. and Hong, C., 1991. "Bayesian Methods for Forecasting Turning Points in Economic
Time Series: Sensitivity of Forecasts to Asymmetry of Loss Structures," in Lahiri, H. and
Moore, G., eds., Leading Economic Indicators: New Approaches and Forecasting
Records, 129-140. Cambridge, U.K.: Cambridge University Press.
Table 1. Conditional Mean Estimates and Residual Analysis for the First-Differenced
Exchange Rate Series
$\Delta y_t = \mu + \theta \epsilon_{t-1} + \epsilon_t$
$\epsilon_t \mid \Omega_{t-1} \sim N\left( 0, \sigma^2 \right)$
BP CD DM YN
Conditional Mean Parameters
0.0115 0.0004 0.0002 -0.0218
(0.0131) (0.0047) (0.0126) (0.0114)
In-sample Moments of $\epsilon_t$
variance 0.5170 0.0701 0.5203 0.4256
Out-of-sample Moments of $\epsilon_t$
variance 0.2377 0.0967 0.4493 0.5390
Note: The four exchange rate series are the logged daily spot rates from 1980 to 1995. The in-sample period is
1980-1993 (3516 observations), and the out-of-sample period is 1994-1995 (499 observations). The conditional
mean parameters of the in-sample, first-differenced series are estimated using least squares. Asymptotic standard
errors are listed in parentheses. The test for excess kurtosis is relative to the standard normal distribution; this
statistic is asymptotically distributed N(0,24/T). The portmanteau statistics for up to 20th-order serial correlation,
Q(20) and Q²(20), are for the $\epsilon_t$ and $\epsilon_t^2$ series, respectively. * indicates significance at the 5% level.
Table 2. Standard Diagnostic Tests of the In-Sample and Out-of-Sample
Standardized Residuals
                 In-Sample                          Out-of-Sample
             kurtosis   Q(20)    Q2(20)        kurtosis   Q(20)    Q2(20)
Panel B. CD
homo. 6.39 14.29 524.34* 5.96 25.47 37.96*
AR10sq 5.58 14.81 36.20* 5.52 22.95 20.82
AR10ab 5.83 14.52 21.85 6.00 23.89 20.27
GARCH 6.11 16.07 10.79 5.93 23.48 18.39
GARCH-t 6.30 17.25 13.58 5.89 23.73 17.69
GARCH-g 6.20 17.09 12.20 5.91 23.80 18.00
e.smooth 6.77 18.12 44.59* 5.79 22.12 16.38
s.v. 6.77 16.28 176.98* 6.16 21.85 21.53
Panel C. DM
homo. 4.76 28.08 393.50* 4.86 22.83 106.54*
AR10sq 4.21 30.79 34.85* 4.51 18.56 37.97*
AR10ab 4.21 31.08 24.57 4.59 17.83 33.22*
GARCH 4.12 35.65* 14.85 4.38 16.21 28.74
GARCH-t 4.13 36.23* 13.88 4.40 16.20 28.38
GARCH-g 4.12 37.08* 14.17 4.41 16.41 28.74
e.smooth 4.34 42.19* 14.41 4.55 15.90 24.63
s.v. 4.30 35.24 52.01* 4.43 16.38 35.64*
Panel D. YN
homo. 5.42 28.09 226.12* 6.32 23.93 45.19*
AR10sq 5.27 28.64 35.09* 6.35 22.67 21.83
AR10ab 5.93 29.48 39.37* 6.09 24.41 21.86
GARCH 5.51 29.84 33.00* 6.21 23.35 13.45
GARCH-t 5.87 33.34* 31.30 6.40 25.12 12.90
GARCH-g 5.75 35.97* 31.18 6.35 25.21 13.30
e.smooth 7.26 30.44 27.84 6.81 25.40 12.54
s.v. 6.84 34.75* 51.38* 5.99 29.15 29.49
Note: The first column in each panel presents the kurtosis of the in-sample and out-of-sample standardized
residuals. The columns labeled Q(20) and Q2(20) are the portmanteau statistics for these standardized residuals and
their squares. * denotes significance at the five percent level for a $\chi^2(20)$ distribution.
Table 3. Correlations between the Volatility Forecasts for BP
In-sample
AR10sq AR10ab GARCH GARCH-t GARCH-g e.smooth s.v.
AR10sq 1.00
Out-of-sample
AR10sq AR10ab GARCH GARCH-t GARCH-g e.smooth s.v.
AR10sq 1.00
Table 4. In-Sample Forecast Evaluation Results for Statistical Loss Functions
             MSE        MAE        LL         HMSE       GMLE
Panel B. CD
homo. 0.0264 0.0794* 8.5695* 5.3887 -1.6586
AR10sq 0.0248 0.0749* 7.9613* 4.2311 -1.7953
AR10ab 0.0249 0.0713* 7.3964* 5.9917 -1.8100
GARCH 0.0255 0.0770* 7.4381 5.1159 -1.8246
GARCH-t 0.0261 0.0790* 7.4158 5.1547 -1.8184
GARCH-g 0.0256 0.0769* 7.6094 5.3606 -1.8236
e.smooth 0.0254 0.0747 7.4049 7.3534 -1.7894
s.v. 0.0256 0.0696 7.1285 10.0064 -1.7339
Panel C. DM
homo. 1.0168 0.5706 7.4755* 3.7563 0.3464*
AR10sq 0.9750 0.5536* 7.1299* 3.0720 0.2683
AR10ab 0.9761 0.5330 6.7509 3.8699* 0.2644
GARCH 0.9772 0.5570* 7.0919* 3.0957* 0.2582
GARCH-t 0.9824 0.5694* 7.3830* 2.8496 0.2604
GARCH-g 0.9813 0.5574* 7.3371 3.1214* 0.2585
e.smooth 0.9839 0.5552* 6.8452 3.9607* 0.2807
s.v. 0.9861 0.5212 6.5459 4.7767* 0.2857*
Panel D. YN
homo. 0.8004 0.4852* 8.8365* 4.4207 0.1452*
AR10sq 0.7746 0.4756* 8.5198* 4.2530 0.0885
AR10ab 0.7823 0.4451* 7.9554* 7.1143* 0.1140
GARCH 0.7841 0.4763* 8.3220* 4.5366 0.0861
GARCH-t 0.7994 0.4741* 8.3996 5.4410 0.0923
GARCH-g 0.8020 0.4801* 8.4988 4.9428 0.0896
e.smooth 0.7892 0.4771* 8.2154* 8.0179 0.1279*
s.v. 0.8045 0.4228 7.4526 13.5137* 0.2282*
Note: The columns represent the values of the statistical loss functions from Section II for the models' in-sample,
fitted conditional variances. The minimum value within each column of each panel is underlined. * indicates that
this particular set of in-sample forecasts rejects the null hypothesis of equality with the minimum value using the S
test at the five percent significance level.
Table 5. Out-of-Sample Forecast Evaluation Results for Statistical Loss Functions
             MSE        MAE        LL         HMSE       GMLE
Panel B. CD
homo. 0.0470 0.0953 7.7033 9.5930 -1.2779
AR10sq 0.0465 0.1010 7.8258 7.3429 -1.3103
AR10ab 0.0469 0.0979 7.5533 11.2557 -1.2133
GARCH 0.0478 0.1123 8.1170 6.9315 -1.2686
GARCH-t 0.0482 0.1146 8.9795 6.1234 -1.2806
GARCH-g 0.0480 0.1125 8.2514 6.7614 -1.2715
e.smooth 0.0467 0.1085 8.3924 6.0445 -1.3083
s.v. 0.0478 0.0943 7.3589 15.9774 -1.0905
Panel C. DM
homo. 0.7853 0.5343 8.0663 2.9008 0.2130
AR10sq 0.7430 0.4950* 7.4763* 2.8007 0.1266
*
AR10ab 0.7470 0.4707 7.0276 3.9370 0.1347
GARCH 0.7402 0.4862* 7.2766 3.0493 0.1151
GARCH-t 0.7451 0.4981* 7.1669 2.8330 0.1198
GARCH-g 0.7452 0.4867* 7.0650 3.0989 0.1174
e.smooth 0.7453 0.4778 6.9420 4.3004 0.1378
s.v. 0.7527 0.4516 6.8011 4.9428 0.1581
Panel D. YN
homo. 1.5456 0.5595 8.2659 8.5368 0.4073
AR10sq 1.5322 0.5701* 8.2178* 7.5746 0.3488
AR10ab 1.5265 0.5373 7.7350* 9.6895 0.3726
GARCH 1.4995 0.5750* 7.8797 6.3216 0.2922
GARCH-t 1.5127 0.5806* 7.7722 6.8564 0.2938
GARCH-g 1.5192 0.5847* 7.8844 6.4474 0.2921
e.smooth 1.5034 0.6009 8.2284 7.7505 0.3138
s.v. 1.5571 0.5175 7.3176 14.9116 0.5133
Note: The columns represent the values of the statistical loss functions from Section II for the models' out-of-sample volatility forecasts. The minimum value within each column of each panel is underlined. * indicates that this particular set of out-of-sample forecasts rejects the null hypothesis of equality with the minimum value using the S test
at the five percent significance level.
Table 6. In-Sample and Out-of-Sample Observed Event Frequencies
             Event 1    Event 2    Event 3    Event 4
Panel B. CD
In-sample 96.8% 5.0% 89.0% 12.5%
Panel C. DM
In-sample 72.7% 5.0% 91.8% 13.2%
Panel D. YN
In-sample 77.4% 5.0% 89.4% a 36.5%
a. The parameter $\delta$ is set to 0.2, instead of 2 as for the other series, due to the high level of the Japanese yen.
Note: Event 1 refers to the innovation event $\Delta y_t \in [-0.5, 1]$ or, equivalently, $\epsilon_t \in [-0.5 - \hat{\mu}_t,\, 1 - \hat{\mu}_t]$. Event 2 refers to the innovation event $\Delta y_t < \text{TL}$ or, equivalently, $\epsilon_t < \text{TL} - \hat{\mu}_t$. TL is the lower five percent quantile of the in-sample, empirical distribution of $\Delta y_t$. Event 3 refers to the level event $y_t \in [(1 - \delta/100) y_{t-1},\, (1 + \delta/100) y_{t-1}]$, where $\delta$ is set at 2 for BP, CD and DM and at 0.2 for YN. Event 4 refers to the level event $y_t \in [L, U]$, where L and U are set 5% below and above the last in-sample observation $y_T$.
Table 7. In-Sample QPS Results
             Event 1    Event 2    Event 3    Event 4
Panel B. CD
homo. 0.0314 0.0476* 0.0982 0.0089
AR10sq 0.0302 0.0458 0.0910 0.0090
AR10ab 0.0303 0.0455 0.0902 0.0091
GARCH 0.0305 0.0460 0.0909 0.0090
GARCH-t 0.0324* 0.0478* 0.0958* 0.0090
GARCH-g 0.0304 0.0457 0.0899 0.0090
e.smooth-N 0.0305 0.0460 0.0917 0.0090
e.smooth-t 0.0315* 0.0471* 0.0944* 0.0089
s.v. 0.0307 0.0457 0.0915 0.0090
Panel C. DM
homo. 0.2014* 0.0473 0.0701 0.0212
AR10sq 0.1938 0.0465 0.0682 0.0210
AR10ab 0.1921 0.0467 0.0683 0.0209
GARCH 0.1926 0.0466 0.0686 0.0209
GARCH-t 0.1982* 0.0473* 0.0719* 0.0212
GARCH-g 0.1912 0.0465 0.0683 0.0208
e.smooth-N 0.1929 0.0468* 0.0692 0.0209
e.smooth-t 0.1962* 0.0473* 0.0715* 0.0211
s.v. 0.1927 0.0469 0.0689 0.0209
Panel D. YN
homo. 0.1788* 0.0473 0.0949* 0.0055
AR10sq 0.1739 0.0466 0.0932 0.0056
AR10ab 0.1711 0.0470 0.0933 0.0056
GARCH 0.1724 0.0468 0.0931 0.0056
GARCH-t 0.1762* 0.0469 0.0961* 0.0057
GARCH-g 0.1697 0.0467 0.0926 0.0055
e.smooth-N 0.1724 0.0470 0.0942 0.0057
e.smooth-t 0.1772* 0.0473 0.0971* 0.0058
*
s.v. 0.1720 0.0475 0.0947 0.0055
Note: The columns contain the QPS values for the models' in-sample probability forecasts for the four specified
events. The minimum QPS value within each column of each panel is underlined. * indicates that this particular set
of in-sample forecasts rejects the null hypothesis of equality with the minimum QPS value using the S test at the
five percent significance level; i.e., the models thus marked have QPS values significantly greater than the minimum
value.
Table 8. Out-of-Sample QPS Results
             Event 1  calib.   Event 2  calib.   Event 3  calib.   Event 4  calib.
Panel A. BP
homo. 0.1600* 0.0178 0.0919 0.0135
AR10sq 0.1392* 0.0163 0.0734 0.0121
AR10ab 0.1338 0.0161 0.0696 0.0115
GARCH 0.1330 0.0160 0.0682 0.0114
GARCH-t 0.1411* 0.0171 0.0757* 0.0119
GARCH-g 0.1312 + 0.0160 0.0677 0.0112
e.smooth-N 0.1318 + 0.0161 0.0672 + 0.0109
e.smooth-t 0.1338 0.0165 0.0672 + 0.0111
s.v. 0.1332 + 0.0159 0.0683 + 0.0113 +
Panel B. CD
homo. 0.0538 + 0.0615 + 0.0504 + 0.0290
AR10sq 0.0542 + 0.0625 + 0.0503 + 0.0278
AR10ab 0.0543 + 0.0624 + 0.0505 + 0.0283
GARCH 0.0551 + 0.0641 0.0518 + 0.0279
GARCH-t 0.0571 0.0666 0.0551* 0.0279
GARCH-g 0.0545 + 0.0632 + 0.0511 + 0.0279
e.smooth-N 0.0544 + 0.0634 0.0510 + 0.0280
e.smooth-t 0.0557 0.0652 0.0508 + 0.0280
s.v. 0.0540 + 0.0620 + 0.0508 + 0.0296 +
Panel C. DM
homo. 0.2005 0.0402 + 0.1509 0.0104
AR10sq 0.1939 + 0.0400 + 0.1429 + 0.0104
AR10ab 0.1928 + 0.0406 + 0.1428 + 0.0104
GARCH 0.1923 + 0.0402 + 0.1416 + 0.0105
GARCH-t 0.1961 0.0410 0.1462* 0.0105
GARCH-g 0.1918 + 0.0401 + 0.1417 + 0.0105
e.smooth-N 0.1929 + 0.0407 + 0.1424 + 0.0104
e.smooth-t 0.1942 + 0.0412 0.1424 + 0.0104
s.v. 0.1929 + 0.0408 + 0.1442* 0.0103 +
Panel D. YN
homo. 0.2008 + 0.0457 + 0.1295 + 0.0091
AR10sq 0.1993 + 0.0454 + 0.1272 + 0.0091
AR10ab 0.1977 + 0.0453 + 0.1274 + 0.0092
GARCH 0.1976 + 0.0451 + 0.1256 + 0.0092
GARCH-t 0.2007 0.0461 0.1304 0.0097
GARCH-g 0.1976 + 0.0451 + 0.1248 + 0.0090
e.smooth-N 0.2016 + 0.0459 0.1288 + 0.0093
e.smooth-t 0.2042 0.0470 0.1289 + 0.0094
s.v. 0.2012 0.0456 + 0.1327 0.0093
Note: The event columns contain the QPS values for the models' out-of-sample probability forecasts for the four
specified events. The minimum QPS value within each column of each panel is underlined. * indicates that this
particular set of out-of-sample forecasts rejects the null hypothesis of equality with the minimum QPS value using
the S test at the five percent significance level; i.e., the models thus marked have QPS values significantly greater
than the minimum value. The companion columns labeled calib. present the results from the Z0 calibration test
using the deciles of the unit interval (i.e., [0,0.1); [0.1,0.2); ...; [0.9,1]) as the subsets of interest. The + symbol
indicates that the forecasts do not reject the null hypothesis and are thus considered to be well calibrated over the
selected subsets.
Figure 1. Alternative Distributions of the BP Standardized Residuals
Figure 1a.
Figure 1b.
Note: The standard normal distribution is plotted with a solid line. The estimated t- and GED-distributions are
plotted with dotted and dashed lines, respectively. The estimated shape parameter (i.e., degrees of freedom) for the
t-distribution is 5.7, and the estimated shape parameter for the GED-distribution is 1.4.
Figure 2. Out-of-Sample Probability Forecasts from the GARCH-g Model for BP
Note: The out-of-sample probability forecasts from the GARCH-g model for events 1 through 4 are plotted with a
dot-dash line, a dotted line, a dashed line and a solid line, respectively.
APPENDIX: Volatility Model Parameter Estimates
Table A.1. Volatility Model Parameters for the Gaussian Homoskedastic Model
$\Delta y_t = \mu + \theta \epsilon_{t-1} + \epsilon_t$
$\epsilon_t \mid \Omega_{t-1} \sim N\left( 0, h_t \right)$
$h_t = \sigma^2$
Parameter BP CD DM YN
Note: These parameters values are maximum likelihood estimates. The standard errors in parentheses are the
robust standard errors of Bollerslev and Wooldridge (1992).* indicates significance at the five percent level.
Table A.2. Volatility Model Parameters for the AR10sq Model
$\Delta y_t = \mu + \theta \epsilon_{t-1} + \epsilon_t$
$\epsilon_t \mid \Omega_{t-1} \sim N\left( 0, h_t \right)$
$h_t = \alpha_0 + \sum_{i=1}^{10} \alpha_i \epsilon_{t-i}^2$
Parameter BP CD DM YN
0.0118 0.0035 0.0005 -0.0216
(0.0236) (0.0201) (0.0228) (0.0224)
Note: The conditional mean parameters are estimated using ordinary least squares. The conditional variance
parameters are estimated using ordinary least squares on the squared residuals. Standard errors are presented in
parentheses. * indicates significance at the five percent level.
Table A.3. Volatility Model Parameters for the AR10ab Model
$\Delta y_t = \mu + \theta \epsilon_{t-1} + \epsilon_t, \qquad \epsilon_t \mid \Omega_{t-1} \sim N\left( 0, h_t \right)$
$h_t^{1/2} = \sqrt{\pi/2}\, \left( \alpha_0 + \sum_{i=1}^{10} \alpha_i \left| \epsilon_{t-i} \right| \right)$
Parameter BP CD DM YN
0.0118 0.0035 0.0005 -0.0216
(0.0236) (0.0201) (0.0228) (0.0224)
Note: The conditional mean parameters are estimated using ordinary least squares. The conditional variance
parameters are estimated using ordinary least squares on the absolute value of the residuals. Standard errors are
presented in parentheses. * indicates significance at the ten percent level.
Table A.4. Volatility Model Parameters for the Gaussian GARCH(1,1) Model
$\Delta y_t = \mu + \theta \epsilon_{t-1} + \epsilon_t$
$\epsilon_t \mid \Omega_{t-1} \sim N\left( 0, h_t \right)$
$h_t = \omega + \alpha \epsilon_{t-1}^2 + \beta h_{t-1}$
Parameter BP CD DM YN
0.0067 0.0010 0.0060 -0.0135
(0.0111) (0.0037) (0.0111) (0.0104)
Note: These parameters values are maximum likelihood estimates. The standard errors in parentheses are the
robust standard errors of Bollerslev and Wooldridge (1992). * indicates significance at the five percent level.
Table A.5. Volatility Model Parameters for the GARCH(1,1)-t Model
$\Delta y_t = \mu + \theta \epsilon_{t-1} + \epsilon_t$
$\epsilon_t \mid \Omega_{t-1} \sim t\left( 0, h_t, \nu \right)$
$h_t = \omega + \alpha \epsilon_{t-1}^2 + \beta h_{t-1}$
Parameter BP CD DM YN
-0.0002 -0.0033 0.0127 0.0054
(0.0116) (0.0035) (0.0101) (0.0095)
Note: These parameters values are maximum likelihood estimates. Asymptotic standard errors, presented in
parentheses, are derived from the value of the numerical Hessian at the estimated parameter values. * represents
significance at the five percent level.
Table A.6. Volatility Model Parameters for the GARCH(1,1)-GED Model
$\Delta y_t = \mu + \theta \epsilon_{t-1} + \epsilon_t$
$\epsilon_t \mid \Omega_{t-1} \sim \text{GED}\left( 0, h_t, \nu \right)$
$h_t = \omega + \alpha \epsilon_{t-1}^2 + \beta h_{t-1}$
Parameter BP CD DM YN
0.0009 -0.0021 0.0145 0.0073
(0.0100) (0.0033) (0.0096) (0.0088)
Note: These parameters values are maximum likelihood estimates. Asymptotic standard errors, presented in
parentheses, are derived from the value of the numerical Hessian at the estimated parameter values. * represents
significance at the 5% level.
Table A.7. Volatility Model Parameters for the Stochastic Volatility Model
$\Delta y_t = \mu + \theta \epsilon_{t-1} + \epsilon_t$
$\epsilon_t = \sigma \exp\left( w_t / 2 \right) z_t, \qquad z_t \sim N(0, 1)$
$w_t = \beta w_{t-1} + \eta_t, \qquad \eta_t \sim N\left( 0, \sigma_\eta^2 \right)$
$z_t \perp \eta_t$
Measurement equation: $\log \epsilon_t^2 = \log \sigma^2 - 1.27 + w_t + \kappa_t$
Parameter BP CD DM YN
0.0118 0.0035 0.0005 -0.0216
(0.0236) (0.0201) (0.0228) (0.0224)
Note: The conditional mean parameters are estimated using ordinary least squares, and the standard errors are
presented in parentheses. * indicates significance at the 5% level. The conditional variance parameters are
estimated using the Kalman filter and quasi-maximum likelihood methods as proposed in Harvey et al. (1994).
These standard errors are estimated using the method proposed by Dunsmuir (1979).