Evaluating competing predictive distributions. An
out-of-sample forecast simulation study.
Bachelor’s Thesis in Statistics
Andreas C. Collett∗
January 7, 2015
Abstract†
This thesis aims to formulate a simple measurement that evaluates predic-
tive distributions of out-of-sample forecasts between two competing models.
Predictive distributions form a large part of today’s forecast models used for
policy making. The possibility to compare predictive distributions between
models is important for policy makers who make informed decisions based
on probabilities. We conduct simulation studies to estimate autoregressive
models and vector autoregressive models with Bayesian inference. The for-
mulated measurement uses out-of-sample forecasts and predictive distribu-
tions to evaluate the full forecast error probability distribution by forecast
horizon. We find the measurement to be accurate; it can be used to evaluate
single forecasts or to calibrate forecast models.
Keywords: Autoregressive, out-of-sample forecast, Bayesian inference, Gibbs
sampling, prior distribution, posterior distribution and predictive distribu-
tion.
Department of Statistics
Autumn Semester 2014
Course code SU-39434
∗ Correspondence to author: andreascollett@gmail.com.
† I am deeply grateful to my supervisor Professor Emeritus Daniel Thorburn for his commitment, time, notes and discussions.
Contents
1. Introduction
2. Related Literature
3.1 Bayesian Inference
3.2 Autoregression
3.3 Vector Autoregression
4. Evaluating the Predictive Distribution
5. Simulation Study
6. Hyperparameters
7.1 Results
7.2 Univariate Simulation Results
7.3 Multivariate Simulation Results
8. Conclusion
Appendix
A1. Gibbs Sampler
1. Introduction
This thesis aims to formulate a simple measurement that evaluates predic-
tive distributions of out-of-sample forecasts between two competing models.
Predictive distributions form a large part of today’s forecast models used for
policy making. The possibility to compare predictive distributions between
models is important for policy makers who make informed decisions based
on probabilities. Out-of-sample forecasts are used to mimic the situation
forecasters experience in real time, and are used by academics in forecast
methodology research and by practitioners to calibrate forecast models. By
combining predictive distributions and out-of-sample forecasts one can evaluate
the forecast error probability distribution. Earlier forecast evaluation
literature has tended to focus on point forecasts, either directly from a model
or via a certain value in the predictive distribution, rather than evaluating
the full predictive distribution. This results in a loss of information about the
uncertainty of the forecasts and the forecast model. The contribution of
this thesis is to formulate a simple measurement that uses this information
to evaluate forecasts at multiple horizons. There is recent research that
addresses this subject in various forms: Geweke and Amisano (2008, 2012),
Warne et al. (2013) and Bauwens et al. (2014). However, this literature is
small relative to the literature on evaluating point forecasts.
We generate data samples from univariate autoregressions (AR) and a
multivariate vector autoregression (VAR) with known true parameters. We use
Bayesian methods to fit AR and VAR models to the data and obtain
posterior inference and predictive distributions. A restrictive Gibbs sampler
is implemented to conduct the Bayesian inference. These methods allow us
to produce predictive distributions and to explore the theory and application
of Bayesian analysis through a simple example. The restrictive Gibbs
sampler is a popular method for obtaining posterior inference in time series
analysis. Therefore the thesis also gives an introduction to Bayesian analysis
and its application in time series.
The structure of the simulated data allows us to use simple statistical theory
to evaluate the posterior inference, summarized in Table 1, and the use of
out-of-sample forecasts allows us to evaluate which model produces the most
accurate predictive distribution. The diagonal in Table 1 represents the
situation when the simulated data is estimated with the correct model. The
lower left outcome occurs when the data is simulated from an AR model but
is estimated with a VAR model. In this case the VAR model will not suffer
from misspecification but will include irrelevant independent variables. This will
cause a (small) increase in the estimated variance, which in turn will lead
to a wider predictive distribution. The upper right outcome occurs when the
data is simulated from a VAR model but is estimated with an AR model.
In this case the AR model will be misspecified, i.e. the model suffers from
omitted variable bias. This will cause a (large) increase in the estimated
variance, which in turn will lead to a wider predictive distribution. The
formulated measurement accounts for both the size of the forecast error and the
probability that this forecast error would occur.
Table 1: Underlying Statistical Theory.

                    Simulation of Data
Model       Univariate                         Multivariate
AR(p)       Optimal                            Misspecification
VAR(p)      Irrelevant Independent Variables   Optimal
We formulate a measurement that uses out-of-sample forecasts and predictive
distributions to evaluate the full forecast error probability distribution by
forecast horizon. We are able to validate the accuracy of the measurement
against statistical theory, but we find that the autoregressive model and the
vector autoregressive model with the same lag length have difficulty producing
dissimilar predictive distributions. However, we are able to separate the
predictive distributions of the models by letting both be correctly specified
while allowing for a high degree of correlation between the error terms across
equations in the vector autoregressive model. From this we find that the
formulated measurement is able to measure the accuracy of the full forecast
error distribution. The measurement can be used as a forecast evaluation
technique for single forecasts or to calibrate forecast models.
The rest of the thesis is structured as follows. Section 2 presents the empirical
research closest to the research question. Section 3 describes the empirical
methodology. Section 4 describes the evaluation method for the predictive
distribution. Section 5 describes the simulation method. Section 6 accounts
for the selection of the hyperparameters in the priors. Section 7 discusses
the results and model comparisons.
2. Related Literature
We will present two articles that evaluate predictive distributions. The methods
used are technical and we will not describe them in depth. Instead
we will mention the methods and give a brief summary of the results.
Interested readers are referred to the articles under consideration.
Geweke and Amisano (2012) compare the forecast performance of, and construct
model combinations for, three models: the dynamic factor model, the
dynamic stochastic general equilibrium model and the vector autoregressive
model. They use several analytical techniques to evaluate the forecast
performance and to construct model combinations: pooling of predictive densities,
analysis of predictive variances, probability integral transform tests and
Bayesian model averaging. They find two improvements that increase forecast
accuracy substantially. The first improvement is to use the full Bayesian
predictive distribution instead of the posterior mode of the parameters. The
second improvement is to construct a model combination by equally
weighted pooling of the predictive densities from the three models, instead
of relying on the individual predictive distribution from each model. This
result is considerably better than when Bayesian model averaging is used for
the same purpose.
Bauwens et al. (2014) compare the forecast performance of two models that
allow for structural breaks against a wide range of alternative models which
do not allow for structural breaks. They evaluate forecast performance by
two metrics. First, they use root mean squared forecast errors (RMSE) to
evaluate point forecasts. The median of the predictive distribution is used
as point forecast. Second, they use the average of log predictive likelihoods
(APL), which is the predictive density evaluated at the observed outcome.
The APL is estimated by a nonparametric kernel smoother, using draws from
the predictive simulator. They find that no single model is consistently
better than the alternatives in the presence of structural breaks. One source
for this uncertainty about the forecast performance is that the two metrics
yield substantially different conclusions. They find that the structural break
models seem to dominate the non-structural break models in terms of RMSE,
but the opposite is often true in terms of APL.
3.1 Bayesian Inference
To describe Bayesian inference the simple linear regression model will be
examined. Consider the model

yt = Xtβ + εt  (1)

where yt is a T × 1 vector representing the dependent variable, Xt is a T × K
matrix and εt ~iid N(0, σ²), T is the number of observations and K is the
number of independent variables. Our purpose is to obtain estimates of the
K × 1 vector β and the scalar σ². These can be obtained by maximizing the
likelihood function

l(yt | β, σ²) = (2πσ²)^(−T/2) exp[−(yt − Xtβ)′(yt − Xtβ) / (2σ²)]  (2)

which yields the maximum likelihood estimates³ β̂MLE = (Xt′Xt)⁻¹(Xt′yt) and
σ̂²MLE = (yt − Xtβ̂MLE)′(yt − Xtβ̂MLE)/T. According to the likelihood principle the
likelihood function contains all the information in the data about the parameters
β and σ². This is where the difference between classical (or frequentist)
inference and Bayesian inference becomes apparent. Bayesian inference
incorporates prior beliefs about the parameters in the estimation process in
the form of probability distributions. This results in the joint posterior
distribution

p(β, σ² | yt) = l(yt | β, σ²)p(β, σ²) / p(yt) ∝ l(yt | β, σ²)p(β, σ²)  (3)

where p(β, σ²) is the prior distribution and p(yt) is the density of the data,
or marginal likelihood. The marginal likelihood, p(yt), does not depend on β
or σ² and can thus be considered a constant. This yields the unnormalized
joint posterior distribution, which is proportional to the likelihood function
times the prior distribution. However, Karlsson (2013) stresses the importance
of the marginal likelihood for model comparison. Several conclusions can
be drawn from the joint posterior distribution. First, the joint posterior
distribution represents the probability distribution of the parameters β and
σ² once the prior distribution has been updated with the information in
the observed data, yt. Second, if the prior distribution is vague (or flat)
it can be considered almost constant, which causes the estimates to
be similar to those of classical inference, i.e. the likelihood function will
determine the estimates. This also occurs when the information in the data
is rich, i.e. a large number of observations. There is a large literature on
Bayesian inference in macroeconomics, where the data tend to have a small
number of observations and the model requires a large number of parameters
to be estimated, for example the vector autoregressive model. Given the
joint posterior distribution, the marginal posterior distributions conditional
on the data, p(β | yt) and p(σ² | yt), can be obtained by integrating β and
σ² out of the joint posterior distribution, one at a time:

p(β | yt) = ∫ p(β, σ² | yt) dσ²  (4)

p(σ² | yt) = ∫ p(β, σ² | yt) dβ.  (5)

³ Note that β̂MLE is equal to the ordinary least squares estimator, β̂OLS, while σ̂²MLE is a biased estimate of the variance because it does not deduct the number of estimated parameters from the number of observations in the denominator, as is done in σ̂²OLS.
For the simple regression model specified there exist analytical (or closed-form)
results for the integrals. But for more complex models or particular
prior distributions there may not exist analytical results for the integrals.
Then numerical or simulation techniques, such as Markov chain Monte Carlo
(MCMC) simulation, are required to obtain estimates of β and σ². We will
impose the restriction of stability on our AR and VAR processes, i.e. we are
restricted to evaluate a certain range of the distributions of β. This will be
implemented by the Gibbs sampler (see Appendix A1), which will allow us
to sample from the range of this distribution. Even if there exist analytical
results and no restrictions are imposed, there are still situations where
simulation is suitable; this is the case in forecasting. Forecasts with a horizon
beyond the one-step-ahead horizon are nonlinear and can only be obtained
by simulation.
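To see why multi-step forecasts call for simulation, consider a short sketch (our own illustration, anticipating the AR(p) model of Section 3.2; function and variable names are ours). Each posterior draw of the parameters is iterated forward h steps, adding a fresh error draw at each step:

```python
import numpy as np

def predictive_draws(y_last, betas, sig2s, h, seed=0):
    """Iterate an AR(p) forward h steps for each posterior draw (beta, sigma^2)."""
    rng = np.random.default_rng(seed)
    p = betas.shape[1]
    out = np.empty(len(betas))
    for m, (b, s2) in enumerate(zip(betas, sig2s)):
        hist = list(y_last[-p:])  # most recent p observations, oldest first
        for _ in range(h):
            # b is (beta_1, ..., beta_p); pair beta_1 with the newest value
            hist.append(b @ hist[-1:-p - 1:-1] + rng.normal(0.0, np.sqrt(s2)))
        out[m] = hist[-1]
    return out
```

The h-step forecast is a nonlinear function of the parameters (products of the βi enter through the iterated substitution), which is why each draw is propagated forward individually rather than plugging in a point estimate.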
The key element in Bayesian inference is the prior belief of the researcher,
i.e. the prior distribution, p(β, σ²), in (3). The prior distribution allows
the researcher to address the uncertainty about the parameters before the
data have been taken into account; this is done by specifying a probability
distribution for each parameter. Prior distributions are classified into two
categories, noninformative and informative. Noninformative prior distributions
are used when the researcher does not have prior beliefs about the parameters,
when the prior beliefs exist with a third party, i.e. are not known,
and for scientific reports where differences in prior beliefs could have an
impact on the result. The noninformative prior distribution puts a uniform
distribution on the parameters, which forces the estimates to be determined
by the data while still reaping the benefits of Bayesian analysis. Informative
prior distributions are used when the researcher has prior beliefs about the
parameters and incorporates these beliefs into the prior distribution. This is
accomplished by assigning hyperparameters⁴ or by restricting the parameter
range. It is typically difficult to assess the prior belief in practice; therefore
it is essential that the joint posterior distribution is proper and to assess
the posterior inference with sensitivity analysis. According to Sun and Ni
(2004), there exist situations in which the posterior is improper even though
the full conditional distributions used for MCMC are all proper.
One reason for the use of Bayesian inference is that it produces predictive
distributions. This enables assessment of the probability of an outcome, which is
more relevant to policy decisions than evaluating a certain point forecast of a
model⁵. The essence of Bayesian inference is that the predictive distribution
accounts for the uncertainty about the future and that the joint posterior
distribution accounts for the uncertainty about the parameters:

p(yt+1:t+h) = ∫∫ f(yt+1:t+h | yt, β, σ²) p(β, σ² | yt) dβ dσ².  (6)

This results in a predictive distribution of forecasts p(yt+h) at each forecast
horizon h. The predictive distribution enables the researcher to address the
probability that a certain outcome will occur. This is useful in many ways:
the predictive distribution can be described by measures of central tendency,
and distressed scenarios can be assessed by evaluating quantiles of the predictive
distribution.
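As an illustration (our own sketch, not part of the thesis code), given a vector of draws from a predictive distribution, central tendency, interval forecasts and distressed-scenario probabilities can all be read off directly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predictive draws for y_{t+h}; in practice these would come
# from the Gibbs sampler described below.
draws = rng.normal(loc=1.5, scale=0.8, size=3000)

point_forecast = np.median(draws)           # a measure of central tendency
lo, hi = np.quantile(draws, [0.05, 0.95])   # 90% predictive interval
distress = np.mean(draws < 0.0)             # probability of a distressed outcome
```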
3.2 Autoregression
We will use Bayesian methods to estimate the parameters in an autoregressive
process, AR(p), where p is the number of lags of the univariate time
series yt:

yt = Σ_{i=1}^{p} βi yt−i + εt,  t = 1, ..., T  (7)

where εt ~iid N(0, σ²). This model is the same as in (1), where Xt is a T × p
matrix (without a constant) consisting of p lags of the time series yt. An
important difference between the models in (1) and (7) is that in the AR model
the dependent variable is not independently and identically distributed: yt
depends on past values of itself.

⁴ The hyperparameters represent the parameters in the prior distribution and are called hyperparameters to distinguish them from the parameters in the model.
⁵ There are methods in the frequentist view to generate an approximative predictive distribution, for example by bootstrapping. These distributions are, however, mostly tighter than the Bayesian predictive distribution because they do not take into account the uncertainty about the parameters. These methods will not be evaluated in this thesis.
The Normal-Gamma prior distribution is the conjugate prior for a normal
distribution: the posterior distribution and the prior distribution are in the
same family of distributions. In the Normal-Gamma prior, the parameter
vector, β, is normally distributed conditional on the variance, σ², and the
variance follows the (inverse) Gamma distribution. The prior mean is

p(β | σ²) ∼ N(β0, σ²H̄)  (8)

where β0 is a p × 1 vector representing the researcher's prior belief about the
parameter values. The prior variance-covariance matrix, H, is equal to
the researcher's prior belief of the variance, σ², times the p × p diagonal
matrix H̄ representing the researcher's uncertainty about the parameters.
Larger values on the diagonal of σ²H̄ result in larger variances around the
prior means. The prior variance is

p(σ²) ∼ iΓ(α/2, θ0/2)  (9)

where α represents the prior degrees of freedom and θ0 represents the prior
scale parameter. Holding the prior degrees of freedom fixed and letting the
prior scale increase results in an (inverse) Gamma distribution with an
increasing mean, i.e. the prior belief about the value of σ² increases. Holding
the prior scale fixed and letting the prior degrees of freedom increase results
in an (inverse) Gamma distribution that is more tightly centred around the
mean, i.e. the prior belief about σ² becomes tighter. This is illustrated in
Figure 1. Specifying the prior belief depends on several factors. Practitioners
set the prior beliefs to their own views, while researchers tend to set these to
the OLS estimates to let the data influence the estimates more than the prior
beliefs, which is viewed as a more coherent and accepted academic approach.
But it also depends on the number of observations and parameters: if there is
a large number of observations relative to the number of parameters, the
influence of the data will be stronger. If there is a small number of observations
relative to the number of parameters, known as overparametrization
in the literature, then one must specify a strong prior.
Figure 1: The (left) figure illustrates the effect on the (inverse) Gamma distribution
as the scale parameter θ0 takes the values {1, 2, 3, 4}, holding the degrees
of freedom constant at α = 1. The (right) figure illustrates the effect on the
(inverse) Gamma distribution as the degrees of freedom α takes the values {1, 2, 3, 4},
holding the scale parameter constant at θ0 = 1.
The conditional posterior distributions of β and σ² are

p(β | σ², yt) = N(M, V)  (10)

p(σ² | β, yt) = iΓ(τ1/2, θ1/2)  (11)

where

M = (H⁻¹ + (1/σ²)Xt′Xt)⁻¹ (H⁻¹β0 + (1/σ²)Xt′yt)  (12)

V = (H⁻¹ + (1/σ²)Xt′Xt)⁻¹  (13)

τ1 = α + T  (14)

θ1 = θ0 + (yt − Xtβ)′(yt − Xtβ)  (15)

and β0, H̄, α and θ0 are hyperparameters specified by the researcher. Note
that there exist analytical results for the Normal-Gamma prior distribution,
but we will use the Gibbs sampler to obtain parameter estimates and
predictive distributions. Table 2 describes the implemented restrictive Gibbs
sampler.
Table 2: Restrictive Gibbs Sampler for an AR(p) model.

To illustrate the Gibbs sampler we will examine the first sample, m = 1.
We start by sampling β^(1) from p(β | (σ²)^(0), yt) in (10). It is important
to note that (12) and (13) depend on σ², and the Gibbs sampler needs
an initial value for this parameter, denoted (σ²)^(0), which is specified
by the researcher. We set (σ²)^(0) to the OLS estimate σ̂². Having
obtained M, in (12), and V, in (13), we can sample β^(1) from
p(β | (σ²)^(0), yt) by

β̂^(1) = M + [r(V)^(1/2)]′

where r is a 1 × p vector of draws from the standard normal distribution
and (V)^(1/2) is the Cholesky decomposition of V. We impose the restriction
that β̂^(1) must come from a stable AR process, i.e. all the roots, z, of
the polynomial βp(z) = 1 − β1z − β2z² − ... − βpzᵖ must have a modulus
greater than one, |z| > 1. Once β̂^(1) is obtained we can sample (σ²)^(1) from
p(σ² | β^(1), yt), the inverse Gamma distribution in (11). Note that the
posterior degrees of freedom τ1, in (14), and the posterior scale parameter θ1,
in (15), require the researcher to specify α and θ0. A sample from the
inverse Gamma distribution is structured as

(σ̂²)^(1) = θ1 / (x0x0′)

where x0 is a 1 × τ1 vector of draws from the standard normal distribution.
A sample from the predictive distribution for forecast horizon h is
structured as

ŷ^(1)_{t+h} = Σ_{i=1}^{h−1} β̂^(1)_i ŷ^(1)_{t+h−i} + Σ_{i=h}^{p} β̂^(1)_i y_{t+h−i} + ε̂^(1)_{t+h}

where ε̂^(1)_{t+h} = r√((σ̂²)^(1)) and r is a single draw from the standard normal
distribution. This process is repeated for M iterations until we have obtained
β^(1), ..., β^(M), (σ²)^(1), ..., (σ²)^(M) and ŷ^(1)_{t+h}, ..., ŷ^(M)_{t+h}. The first 1, ..., B
iterations are discarded; thus β^(B+1), ..., β^(M), (σ²)^(B+1), ..., (σ²)^(M) and
ŷ^(B+1)_{t+h}, ..., ŷ^(M)_{t+h} are used for the empirical distributions. The iterations 1, ..., B
are known as burn-in iterations and are required for the Gibbs sampler
to converge. There is, however, no guarantee that the Gibbs sampler will
perform well or that it will converge. In this thesis we set M = 4,000
and B = 1,000 to ensure convergence.
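The steps in Table 2 can be sketched compactly in Python (our own illustrative implementation; function and variable names are ours, the prior covariance H is passed in fixed as in (12)–(13), and the stability restriction is checked via the companion-matrix eigenvalues rather than the polynomial roots):

```python
import numpy as np

def gibbs_ar(y, p, beta0, H, alpha, theta0, M=4000, B=1000, seed=0):
    """Restrictive Gibbs sampler for an AR(p) model, following Table 2."""
    rng = np.random.default_rng(seed)
    N = len(y)
    X = np.column_stack([y[p - i:N - i] for i in range(1, p + 1)])  # lags
    yv = y[p:]
    T = len(yv)
    Hinv = np.linalg.inv(H)
    # (sigma^2)^(0): OLS estimate, used as the starting value
    beta_ols = np.linalg.solve(X.T @ X, X.T @ yv)
    sigma2 = np.mean((yv - X @ beta_ols) ** 2)
    betas, sig2s = [], []
    for m in range(M):
        # Draw beta from N(M, V) in (10), restricted to a stable AR process
        V = np.linalg.inv(Hinv + (X.T @ X) / sigma2)
        Mn = V @ (Hinv @ beta0 + (X.T @ yv) / sigma2)
        while True:
            beta = Mn + np.linalg.cholesky(V) @ rng.standard_normal(p)
            comp = np.vstack([beta, np.eye(p)[:-1]])  # companion matrix
            if np.max(np.abs(np.linalg.eigvals(comp))) < 1.0:
                break  # stable draw accepted
        # Draw sigma^2 from iGamma(tau1/2, theta1/2) in (11)
        resid = yv - X @ beta
        tau1 = alpha + T
        theta1 = theta0 + resid @ resid
        x0 = rng.standard_normal(int(tau1))
        sigma2 = theta1 / (x0 @ x0)
        if m >= B:  # discard the burn-in iterations
            betas.append(beta)
            sig2s.append(sigma2)
    return np.array(betas), np.array(sig2s)
```

Rejection sampling inside the loop implements the restriction |z| > 1 on the roots: draws whose companion matrix has an eigenvalue of modulus one or more are simply discarded and redrawn, which is why computation time grows as the process approaches the stability boundary.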
3.3 Vector Autoregression
We will use Bayesian methods to estimate a vector autoregressive process,
VAR(p), where p is the number of lags of each time series. The VAR
model is a system-of-equations model that allows the endogenous variables
to affect each other simultaneously. Furthermore, the error terms can be
correlated across equations. A structural shock in one of the error terms
may cause a shock in all error terms, causing contemporaneous movement in
all endogenous variables. The VAR(p) with n endogenous variables, without
constants, deterministic or exogenous variables, is defined as

yt = Σ_{i=1}^{p} Bi yt−i + εt,  t = 1, ..., T  (16)

where yt is an n × 1 vector, Bi is an n × n matrix, εt is an n × 1 vector and
εt ~iid N(0, Σ). The endogenous variables in the VAR model are not iid;
they depend on past values of yt. We can express (16) in compact matrix form if
we define xt = (yt−1, ..., yt−p):

Yt = XtB + Et  (17)

where Yt and Et are T × n matrices, Xt = (x1, ..., xT) is a T × np matrix
and B = (B1, ..., Bp) is an np × n matrix. Note that the parameter matrix
can be stacked into an n²p × 1 vector by b = vec(B).
The Normal-Wishart prior distribution is the conjugate prior for a multivariate
normal distribution: the posterior distribution and the prior distribution
are in the same family of distributions. In the Normal-Wishart prior,
the parameter vector, b, is normally distributed conditional
on the variance-covariance matrix, Σ, and the variance-covariance matrix
follows the (inverse) Wishart distribution:

p(b | Σ) ∼ N(b0, Σ ⊗ H̄)  (18)

p(Σ) ∼ iW(S̄, α)  (19)

where ⊗ is the Kronecker product and b0 represents the researcher's prior
belief of the parameter values. We follow Kadiyala and Karlsson (1993, 1997)
in specifying the matrix H̄, the prior scale matrix S̄ and the prior degrees
of freedom α. The np × np diagonal matrix H̄ has diagonal elements equal to

λ0λ1 / (l^λ2 s²ᵢ)

where l refers to the lag length, s²ᵢ refers to the OLS estimate of the
variance from an AR(p) model and i refers to the endogenous variable in the
i-th equation. The prior scale is an n × n diagonal matrix with diagonal
elements equal to (α − n − 1)λ0⁻¹s²ᵢ, and the prior degrees of freedom satisfy

α = max{n + 2, n + 2h − T}  (20)

to ensure the existence of the prior variances of the regression parameters and
the posterior variances of the predictive distribution at forecast horizon h.
Following the guidelines of Kadiyala and Karlsson (1993, 1997) we only need
to specify the hyperparameters b0, λ0, λ1 and λ2. The interpretation of the
λ hyperparameters is as follows:

λ0 controls the overall tightness of the prior on the covariance matrix.
λ1 controls the tightness of the prior on the coefficients of the first lag.
λ2 controls the rate at which coefficients on longer lags are shrunk more tightly towards zero.

The prior variance-covariance matrix, H, is obtained by V(b) = (α −
n − 1)⁻¹S̄ ⊗ H̄. Due to the imposed Kronecker structure we are not able
to specify individual prior variances and standard deviations; instead we are
forced to treat all equations symmetrically.
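Under these definitions the prior pieces can be assembled mechanically. A sketch (our own helper, following the formulas as stated above; the lag-major ordering of the H̄ diagonal is our assumption):

```python
import numpy as np

def nw_prior(s2, n, p, lam0, lam1, lam2, h, T):
    """Prior pieces of the Normal-Wishart prior as specified above.
    s2: length-n array of AR(p) OLS variance estimates s_i^2."""
    s2 = np.asarray(s2, dtype=float)
    # H-bar: np x np diagonal with entries lam0*lam1 / (l^lam2 * s_i^2),
    # stacked lag by lag (l = 1, ..., p), variable by variable within each lag
    diag = np.concatenate([lam0 * lam1 / (l ** lam2 * s2) for l in range(1, p + 1)])
    Hbar = np.diag(diag)
    alpha = max(n + 2, n + 2 * h - T)             # prior degrees of freedom, (20)
    Sbar = np.diag((alpha - n - 1) / lam0 * s2)   # prior scale matrix
    return Hbar, Sbar, alpha
```

With this scale matrix the implied prior mean of Σ under iW(S̄, α) is diag(λ0⁻¹s²ᵢ), since E[Σ] = S̄/(α − n − 1).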
The conditional posterior distributions of b and Σ are

p(b | Σ, Yt) ∼ N(M, V)  (21)

p(Σ | b, Yt) ∼ iW(Σ̄, T + α)  (22)

where

M = (H⁻¹ + Σ⁻¹ ⊗ Xt′Xt)⁻¹ (H⁻¹b0 + (Σ⁻¹ ⊗ Xt′Xt)b̂)  (23)

V = (H⁻¹ + Σ⁻¹ ⊗ Xt′Xt)⁻¹  (24)

Σ̄ = S̄ + (Yt − XtB)′(Yt − XtB)  (25)

and b̂ is the OLS estimate of b. Kadiyala and Karlsson (1997) and Karlsson
(2013) provide the analytical results for this prior, but we will use the Gibbs
sampler to obtain parameter estimates and predictive distributions. The
restrictive Gibbs sampler implemented in Table 2 is essentially the same for the
Normal-Wishart prior: b̂ is sampled from (21) and Σ̂ is sampled from (22).
There is, however, one step in the Gibbs sampler for the VAR model that will
affect the predictive distributions. This is due to the fact that the predictive
distribution accounts for the uncertainty about the future, combined with the
fact that the VAR model allows the error terms to be correlated across the
equations. The Gibbs sampler will draw a sample m from the predictive
distribution at forecast horizon h:

ŷ^(m)_{t+h} = Σ_{i=1}^{h−1} B̂^(m)_i ŷ^(m)_{t+h−i} + Σ_{i=h}^{p} B̂^(m)_i y_{t+h−i} + ε̂^(m)_{t+h}

where ε̂^(m)_{t+h} = r[Σ̂^(m)]^(1/2) and r is a 1 × n vector of draws from the standard
normal distribution. The term [Σ̂^(m)]^(1/2) is the square root of the estimated
variance-covariance matrix obtained by the Cholesky decomposition, a
triangular matrix. Therefore, the order of the variables in the
VAR model is important. For example, in the bivariate VAR model the
first equation will have two elements of uncertainty added to the forecast at
each draw, while the second equation will only have one element of
uncertainty added to the forecast at each draw. Therefore, more uncertainty
will be added to the predictive distribution of y1 than to that of y2.
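As an illustration (our own sketch, with a hypothetical 2 × 2 covariance matrix), multiplying standard normal draws by a Cholesky factor reproduces the intended correlation structure across equations:

```python
import numpy as np

# Hypothetical variance-covariance matrix: variances 2 and 3, correlation 0.75
Sigma = np.array([[2.0, 0.75 * np.sqrt(6.0)],
                  [0.75 * np.sqrt(6.0), 3.0]])
L = np.linalg.cholesky(Sigma)  # triangular factor, Sigma = L @ L.T

rng = np.random.default_rng(0)
r = rng.standard_normal((100000, 2))  # rows of iid standard normal draws
eps = r @ L.T                         # correlated shocks, Cov(eps) ~ Sigma

emp_cov = np.cov(eps.T)
```

Because the factor is triangular, one variable's shock is built from a single standard normal draw while the other mixes both draws, which is why reordering the variables in the VAR redistributes the added uncertainty across equations.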
4. Evaluating the Predictive Distribution

To explain the measurement, consider a predictive distribution at a specific
forecast horizon, h. The problem is to structure a measurement that accounts
for two factors: (1) the accuracy of the forecasts⁶ and (2) the probabilities
that the forecasts occur.

The most common method to examine forecast accuracy is to use out-of-sample
forecasts. This allows the researcher to mimic a real-time situation.
The observed data is divided into two sets: data for parameter estimation and
actual values for forecast evaluation. The forecast error for each observation
in the predictive distribution at h is

ŷ^(m)_{t+h} − yᵃ_{t+h}

where ŷ^(m)_{t+h} is the m-th Gibbs sample from the predictive distribution at horizon h,
and yᵃ_{t+h} is the actual value corresponding to the forecast.

⁶ Note: when we use the terminology forecast, we refer to one element within the predictive distribution if nothing else is specified.
The most common method to visualize a distribution is the histogram, which
will serve as a tool to describe the concept of the measurement. The data is
classified into bins represented by rectangles, where the height of a rectangle
represents the number of data points within the interval of the bin. The
histogram can be normalized to represent the probability of each bin, with
the condition that the probabilities of the bins sum to one. In Figure 2 we
can see the forecast error probability distributions of two competing
models; it is clear that the left graph is more accurate than the right graph.
We can conclude that these graphs serve our purpose, i.e. we can determine
which graph is most accurate by examining the probabilities of the forecast
errors.
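The normalization just described can be sketched as follows (our own example, standing in for the histograms in Figure 2):

```python
import numpy as np

rng = np.random.default_rng(0)
errors = rng.normal(0.0, 1.0, size=5000)   # forecast errors from one model
counts, edges = np.histogram(errors, bins=30)
probs = counts / counts.sum()              # probability of each bin
# probs sums to one, so bar heights can be read directly as probabilities
```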
Figure 2: The (left) graph is the forecast error probability distribution of the
AR(2) model at h = 2 for univariate simulation of y2. The (right) graph is
the forecast error probability distribution of the VAR(2) model at h = 2 for
univariate simulation of y2. The red line indicates zero forecast error.
Examining each forecast error distribution is time-consuming if, for example,
a practitioner is calibrating a forecast model. Therefore we would like
to summarize the forecast error distribution by an expected value. This
will not, however, yield information about the accuracy of the forecast error
distribution that we intend to measure; instead it will contain information
about the bias of the forecast error distribution. To allow the same conclusion
to be drawn as for the graphs in Figure 2 we will examine the squared
forecast errors⁷. The expected value of the squared forecast error is

ē_{t+h} = Σ_{m=1}^{M−B} (M − B)⁻¹ (ŷ^(m)_{t+h} − yᵃ_{t+h})² = Σ_{m=1}^{M−B} pm (ŷ^(m)_{t+h} − yᵃ_{t+h})²  (26)

where ē_{t+h} is the expected value of the squared forecast error and M − B is
the number of stored samples in the Gibbs sampler. The probabilities sum
to one, Σ_{m=1}^{M−B} pm = 1. A tight forecast error distribution centred
around zero will produce a small expected value of the squared forecast
error, while a wide forecast error distribution not centred at zero, or
skewed away from zero, will produce a larger expected value of the squared
forecast error. The expected value of the squared forecast error is not
informative in itself; it is only informative when put in relative terms to a competing
model. We will use the notation ē^AR(p)_{t+h} to represent the expected value of the
squared forecast error at forecast horizon h for the AR(p) model and ē^VAR(p)_{t+h}
for the corresponding representation for the VAR(p) model.
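A minimal sketch of the measurement in (26) (our own illustration; `draws` stands in for the stored Gibbs samples ŷ^(m)_{t+h} and `actual` for yᵃ_{t+h}):

```python
import numpy as np

def expected_squared_forecast_error(draws, actual):
    """Equal-weight version of (26): mean of (yhat_m - y_actual)^2."""
    draws = np.asarray(draws, dtype=float)
    return np.mean((draws - actual) ** 2)

# A tight distribution centred on the actual value scores lower than a wide,
# biased one, which is what the measurement is designed to capture.
rng = np.random.default_rng(0)
tight = 1.0 + rng.normal(0.0, 0.5, size=3000)   # centred on actual = 1.0
wide = 1.0 + rng.normal(0.8, 2.0, size=3000)    # biased and dispersed

e_tight = expected_squared_forecast_error(tight, 1.0)
e_wide = expected_squared_forecast_error(wide, 1.0)
```

Comparing `e_tight` with `e_wide` mirrors comparing ē^AR(p)_{t+h} with ē^VAR(p)_{t+h}: only the relative magnitude carries information.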
5. Simulation Study

We generate the data for the variables y1 and y2 by (pseudo-)simulation from
univariate AR models and a multivariate VAR model. This will allow us to
obtain results corresponding to Table 1 (the simulation of data represents
the columns of Table 1). Both the univariate and the multivariate simulation
will create data for the variables y1 and y2 with TS = 200 observations⁸.

Univariate Simulation: The two time series y1 and y2 are simulated from
two AR(2) models. Both AR models are conditional on being stable, which is
fulfilled when the moduli of the eigenvalues of the companion matrix

[β1  β2]
[ 1   0]

are less than one. We impose a stricter condition: the moduli of the eigenvalues
must be less than 0.850. This is motivated by the implemented restrictive
Gibbs sampler, which requires large increases in computational time as the
moduli of the eigenvalues approach one. The series will be simulated by:

y1,t = 0.70y1,t−1 + 0.10y1,t−2 + ε1,t, ε1,t ∼ N(0, 2)  (27)
y2,t = 0.35y2,t−1 + 0.30y2,t−2 + ε2,t, ε2,t ∼ N(0, 3).  (28)

⁷ There are also other reasons to use the squared forecast errors: it penalizes outliers to a high degree and it has nice properties with respect to the normal distribution. Note, however, that our forecast error distributions are t-distributed because they are estimated from a model.
⁸ Note that y1 and y2 depend on past observations. We have specified start values for each simulation process. To mitigate the effect of start values we generated 250 observations of y1 and y2 and discarded the first 50 observations, resulting in TS = 200.
The parameters in (27) are chosen so that one of the eigenvalues has a
modulus close to the chosen criterion and so that β1 is large and β2 small.
The parameters in (28) are chosen so that β1 and β2 are closer to each other
than in (27), positive and not close to zero. The pairs of β1 and β2 in (27)
and (28) are not identical, and the variances of y1 and y2 are different. The
correlations between y1 and y2 should on average be zero, but due to the
randomness of simulation they will not be constant, which in turn will affect
the estimation. The left graph in Figure 3 shows 100 correlations between
y1 and y2; the average is -0.008 and the median is -0.008.
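The univariate design above can be sketched in a few lines. The function below is illustrative: it imposes the stability condition on the companion matrix (moduli below 0.850), simulates 250 observations, and discards the first 50 as described in footnote 8. The function name and random seeds are not from the thesis.

```python
import numpy as np

def simulate_ar2(beta1, beta2, sigma2, T=200, burn=50, seed=0):
    """Simulate a stable AR(2) process, discarding a burn-in period."""
    # Stability: the moduli of the companion-matrix eigenvalues must be < 0.850.
    companion = np.array([[beta1, beta2], [1.0, 0.0]])
    assert np.all(np.abs(np.linalg.eigvals(companion)) < 0.850)
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, np.sqrt(sigma2), T + burn)   # N(0, sigma2) errors
    y = np.zeros(T + burn)
    for t in range(2, T + burn):
        y[t] = beta1 * y[t - 1] + beta2 * y[t - 2] + eps[t]
    return y[burn:]                                    # keep the last T draws

# The two series in (27) and (28):
y1 = simulate_ar2(0.70, 0.10, sigma2=2.0, seed=0)
y2 = simulate_ar2(0.35, 0.30, sigma2=3.0, seed=1)
```

For the parameters in (27) the companion-matrix eigenvalues are approximately 0.822 and -0.122, and for (28) they are 0.750 and -0.400, so both stability checks pass.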
Multivariate Simulation: The two time series y1 and y2 are simulated
from a bivariate VAR(2) model. The VAR model is conditional on being
stable, which is fulfilled when the moduli of the eigenvalues of the companion
matrix

[ B1  B2 ]
[ I2   0 ]

are less than one. We again impose the stricter condition that the moduli
of the eigenvalues must be less than 0.850. The series will be simulated by
y1,t = 0.70y1,t−1 + 0.10y1,t−2 + ε1,t
y2,t = −0.25y1,t−1 + 0.35y2,t−1 + 0.50y1,t−2 + 0.30y2,t−2 + ε2,t
(29)
where the variance-covariance matrix is

Σ = [ 2        0.75√6 ]
    [ 0.75√6   3      ]                    (30)

where the error terms between the equations have a correlation of 0.750 and
the variances of y1 and y2 are different.
Three factors have determined the choice of the parameter matrix B. First,
the elements in B are chosen so that the correlation that arises from the
parameters is balanced, i.e. the correlation between y1 and y2 is approx-
imately 0.750. Second, the first equation, y1, in (29) should not depend
on the parameters from y2. The only relationship between the variables in
the first equation in (29) is the correlation between error terms across the
two equations. Both models are correctly specified, but the AR model dis-
cards the correlated error terms across equations, while the VAR model
includes irrelevant independent variables and adds uncertainty to the pre-
dictive distribution due to the correlated error terms across equations. Both
these effects will make the predictive distribution of the VAR model wider.
Third, the second equation, y2, in (29) should depend on the parameters
of y1. Estimating y2 with an AR model will therefore result in misspecification.
Due to the randomness of simulation the correlations between y1 and y2 will
not be constant, which in turn will affect the estimation. The right graph in
Figure 3 shows 100 correlations between y1 and y2; the average is 0.745 and
the median is 0.741.
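The multivariate design can be sketched analogously. This is a minimal sketch: the parameter matrices B1, B2 and Σ are taken from (29) and (30), the stability condition on the companion matrix mirrors the text, and the function name and seed are illustrative.

```python
import numpy as np

def simulate_var2(B1, B2, Sigma, T=200, burn=50, seed=0):
    """Simulate a stable bivariate VAR(2), discarding a burn-in period."""
    n = B1.shape[0]
    # Stability check on the companion matrix [[B1, B2], [I, 0]].
    companion = np.vstack([np.hstack([B1, B2]),
                           np.hstack([np.eye(n), np.zeros((n, n))])])
    assert np.all(np.abs(np.linalg.eigvals(companion)) < 0.850)
    rng = np.random.default_rng(seed)
    eps = rng.multivariate_normal(np.zeros(n), Sigma, T + burn)
    y = np.zeros((T + burn, n))
    for t in range(2, T + burn):
        y[t] = B1 @ y[t - 1] + B2 @ y[t - 2] + eps[t]
    return y[burn:]

B1 = np.array([[0.70, 0.00], [-0.25, 0.35]])   # lag-1 coefficients of (29)
B2 = np.array([[0.10, 0.00], [0.50, 0.30]])    # lag-2 coefficients of (29)
Sigma = np.array([[2.0, 0.75 * np.sqrt(6)],
                  [0.75 * np.sqrt(6), 3.0]])   # the matrix in (30)
y = simulate_var2(B1, B2, Sigma)               # columns are y1 and y2
```

Because the first row of B1 and B2 contains zeros for y2, the y1 equation depends on y2 only through the correlated errors, exactly as the design intends.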
Figure 3: The (left) graph is the histogram of correlations between y1 and y2
generated by 100 univariate simulations. The (right) graph is the histogram
of correlations between y1 and y2 generated by 100 multivariate simulations.
6. Hyperparameters
As mentioned in Section 3.3, Kadiyala and Karlsson (1993, 1997) suggest
a set of guidelines to standardize the restrictions on the parameters in the
Normal-Wishart prior distribution. This allows the researcher to only spec-
ify a number of hyperparameters. Informally, we can think of the inverse
Wishart distribution as a multivariate version of the inverse Gamma distri-
bution. This allows us to align the Normal-Gamma prior in the AR and the
Normal-Wishart prior in the VAR, by using the guidelines of Kadiyala and
Karlsson (1993, 1997).
We align the Normal-Gamma prior to the Normal-Wishart prior in three
steps. First, we specify the hyperparameters for the prior mean variance,
in (8), in the AR(2) model

H = σ²H̄ = [ λ1²   0      ]
           [ 0     λ1²λ2² ]

where we specify σ² = 1 and choose the diagonal of H̄ to have the same
variances for the lags as the matrix V(b) = (α − n − 1)⁻¹ S̄ ⊗ H̄ has for
own lags in the Normal-Wishart prior. Second, the prior degrees of freedom
α, in (14), will be determined by (20), which results in α = 3. Third, the
prior scale parameter θ0, in (15), will be determined by θ0 = (α − n − 1)λ0⁻¹σ̂²;
with α = 3 this simplifies to θ0 = λ0⁻¹σ̂², where σ̂² is the OLS estimate.
Now we turn to examine the guidelines of Kadiyala and Karlsson (1993,
1997) for the Normal-Wishart prior. The prior mean variance V(b) =
(α − n − 1)⁻¹ S̄ ⊗ H̄ and the prior scale matrix S̄, with the diagonal
(α − n − 1)λ0⁻¹si², i = 1, 2, both depend on the prior degrees of freedom α.
By (20) we determine that α = 4, so that the scale matrix S̄ is

S̄ = [ λ0⁻¹s1²   0       ]
    [ 0         λ0⁻¹s2² ]
and the prior mean variance V(b) = S̄ ⊗ H̄ is the 8 × 8 diagonal matrix
with diagonal

( λ1², (s1λ1/s2)², λ1²λ2², (s1λ1λ2/s2)², (s2λ1/s1)², λ1², (s2λ1λ2/s1)², λ1²λ2² ).

Notice that the first and third diagonal elements of this matrix are equal to
the diagonal elements of the prior mean variance in the Normal-Gamma prior.
This alignment between the Normal-Gamma prior and the Normal-Wishart
prior allows us to control the parameter restrictions of both priors by only
specifying the prior means, β and b0, in (8) and (18) for each prior and the
hyperparameters λ0, λ1, λ2 for both priors. For the Normal-Gamma prior
we will set the prior mean to β = (0, 0) and for the Normal-Wishart prior
we will set the prior mean to b0 = (0, 0, 0, 0, 0, 0, 0, 0)′. The hyperparameters
will be set to λ0 = λ1 = λ2 = 1.
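The alignment can be verified numerically. The sketch below is an illustration under the reconstruction used here: H̄ collects the scaled per-lag variances, S̄ the equation scales, and with α = 4 and n = 2 the factor (α − n − 1) equals one. The values of s1 and s2 are arbitrary; the own-lag diagonal elements of S̄ ⊗ H̄ reduce to λ1² and λ1²λ2² regardless, matching the Normal-Gamma prior.

```python
import numpy as np

lam0 = lam1 = lam2 = 1.0     # hyperparameters as set in the text
s1, s2 = 1.3, 1.8            # equation scale estimates (illustrative values)

# H-bar holds the scaled per-lag variances and S-bar the equation scales;
# with alpha = 4 and n = 2, (alpha - n - 1) = 1, so V(b) = S-bar kron H-bar.
H_bar = lam0 * np.diag([lam1**2 / s1**2, lam1**2 / s2**2,
                        lam1**2 * lam2**2 / s1**2, lam1**2 * lam2**2 / s2**2])
S_bar = np.diag([s1**2 / lam0, s2**2 / lam0])
V_b = np.kron(S_bar, H_bar)  # 8 x 8 prior mean variance

# The own-lag diagonal elements (positions 0, 2, 5, 7) reduce to the
# Normal-Gamma prior variances lam1^2 and lam1^2 * lam2^2: the scale
# factors s1 and s2 cancel in the Kronecker product.
own_lags = V_b[[0, 2, 5, 7], [0, 2, 5, 7]]
```

The cross-lag elements retain the ratios s1/s2 and s2/s1, which is what widens the prior on the cross-variable coefficients.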
7.1 Results
To validate the measurement we want to obtain results corresponding to those
in Table 1. We will attempt to verify these by examining posterior variances
and the difference in the measurement between two competing models by
forecast horizon.
But first we describe the kind of data we analyse in this section. In
Section 5, we generated the data for the univariate and multivariate models.
From this we construct out-of-sample forecasts: we produce forecasts for
the horizons h = 1, 2, ..., 10, leaving Tp = TS − h = 190 observations for
parameter estimation. Predictive distributions are estimated and the expected
value of the squared forecast error, i.e. the measurement, is calculated for
ten forecast horizons. This step of simulating data and calculating the mea-
surement is repeated one hundred times. Each model thus produces
measurement data in the form of two 100 × 10 matrices, where
the first matrix is for the univariate simulated data and the second is for the
multivariate simulated data.
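The simulate-and-measure loop can be organised as below. This is an illustrative skeleton only: `simulate` and `fit_and_measure` are hypothetical stand-ins for the data generation of Section 5 and the Bayesian estimation of the predictive distribution, and the per-horizon holdout here simply reserves the last h observations.

```python
import numpy as np

def measurement_matrix(simulate, fit_and_measure, n_sims=100, horizons=10):
    """Build the n_sims x horizons matrix of the measurement: one simulated
    data set per row, one forecast horizon per column."""
    out = np.zeros((n_sims, horizons))
    for i in range(n_sims):
        y = simulate(seed=i)                   # fresh simulated series
        for h in range(1, horizons + 1):
            train, target = y[:-h], y[-h]      # hold out the last h observations
            # expected squared forecast error under the predictive distribution
            out[i, h - 1] = fit_and_measure(train, target, h)
    return out
```

Running this once per model and per data-generating process yields the two 100 × 10 matrices described above.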
First, we will examine the posterior variance. The results are presented
in Tables 3 and 5; the coefficient is the expected value of the mean posterior
variance, resulting from one hundred simulations of the data. This is moti-
vated by two reasons: (1) to summarize the data, as each simulation of the
data yields a mean posterior variance and the corresponding ninety-five per-
cent probability interval of the posterior variance; (2) deviations from Table
1 should be caused by the random correlation in the data due to simulation,
and therefore the expected value of the mean posterior variance yields a more
robust conclusion.
Second, we will examine the difference in the measurement between two
competing models at each forecast horizon. The mean of this difference will
determine which predictive distribution is most accurate. We represent
this by the linear regression described in (1), where the dependent variable is
the difference in the measurement between the competing models by forecast
horizon and the independent variable is a constant:
(ē_t+h^AR(2))_i − (ē_t+h^VAR(2))_i = ϕ + εi,   i = 1, ..., 100   (31)
The coefficient of the constant, ϕ, represents the mean of the difference in
the measurement between the competing models by forecast horizon. This
enables the following hypothesis test:

H0 : ϕ = 0
HA : ϕ ≠ 0

If ϕ < 0 then the predictive distribution of the AR(2) is more accurate than
that of the VAR(2). If ϕ > 0 then the predictive distribution of the VAR(2)
is more accurate than that of the AR(2). If H0 is not rejected, then we cannot
conclude that one predictive distribution is more accurate than the other.
The estimation of this model is similar to the AR(p) model described in
Section 3.1 and the Gibbs sampler described in Table 2. We set the hyper-
parameters as follows: β0 = 0, σ² = 1, θ0 = σ̂² and α = 3 according to (20).
The results are presented in Tables 4 and 6.
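In the thesis the constant ϕ in (31) is estimated with the Gibbs sampler; as a simplified stand-in, the sketch below computes the mean difference per horizon with a normal-approximation ninety-five percent interval. It conveys the same decision rule: H0 is rejected at a horizon when the interval excludes zero, and a negative ϕ favours the AR model.

```python
import numpy as np

def phi_by_horizon(e_ar, e_var, z=1.96):
    """Mean measurement difference per horizon with an approximate 95%
    interval. e_ar and e_var are 100 x 10 matrices of the measurement
    (one row per simulated data set, one column per horizon)."""
    d = e_ar - e_var                            # negative -> AR more accurate
    phi = d.mean(axis=0)
    se = d.std(axis=0, ddof=1) / np.sqrt(d.shape[0])
    lower, upper = phi - z * se, phi + z * se
    # H0: phi = 0 is rejected at a horizon when the interval excludes zero.
    reject = (lower > 0) | (upper < 0)
    return phi, lower, upper, reject
```

The Bayesian probability intervals reported in Tables 4, 6 and 7 play the role of `lower` and `upper` here.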
7.2 Univariate Simulation Results
We start by examining the results of the estimated posterior variance and
covariance from the AR(2) and VAR(2) models for the univariate simulations
of the data. The first column in Table 1 shows the expected outcomes. The
AR model is correctly specified, while the VAR model includes irrelevant
independent variables, which is expected to increase the estimated posterior
variance.
The results are presented in Table 3. For the variable y1, we conclude that
the expected value of the mean posterior variance is equal to 2.006 for the
AR and 2.007 for the VAR. Therefore we conclude that the inclusion of irrele-
vant independent variables does increase the estimated posterior variance as
expected. However, this increase is smaller than we expected. For the vari-
able y2, we conclude that the expected value of the mean posterior variance
is 2.962 for the AR model and 2.965 for the VAR model. The increase in
estimated posterior variance is somewhat larger than for y1 but still smaller
than expected. The expected value of the mean posterior covariance in the
VAR model is 0.006, which is close to zero as we expect, since y1 and
y2 are simulated independently of each other. From the left graph in Figure
3, we conclude that the correlation between y1 and y2 is on average -0.008,
but ranges approximately from -0.300 to 0.350.
Overall, we conclude that the estimated posterior variances follow the ex-
pected theory in Table 1, but the increase in estimated posterior variance
due to the inclusion of irrelevant variables in the VAR model is smaller than
we expected.
Table 3: Expected Values of Mean Posterior Variance/Covariance.
Univariate Simulation of the Data
AR(2) VAR(2)
E[Coef.] E[95% PI] E[Coef.] E[95% PI]
V(y1) 2.006 [ 1.638 ; 2.455] 2.007 [ 1.639 ; 2.456]
COV(y1,y2) - - 0.006 [-0.348 ; 0.361]
V(y2) 2.962 [ 2.419 ; 3.620] 2.965 [ 2.421 ; 3.633]
E[.] is the expected value of 100 simulations of the data.
We now turn to examine the results of the linear regression (31) for the
measurement we have formulated. From our findings about the estimated
posterior variance we expect the forecast error probability distributions of
the AR model to be tighter and more centred around zero than those of the
VAR model, i.e. the measurement should be smaller for the AR model than
for the VAR model. The expected value of the estimated posterior covariance
between y1 and y2 was close to zero, implying only a minimal increase of
uncertainty in the predictive distribution of y1 caused by the Cholesky de-
composition. Therefore, we expect to reject H0 and find that the parameter
ϕ is negative.
The results are presented in Table 4. For the variable y1, we conclude that
we reject H0 at one of the ten forecast horizons. We find that ϕ is negative
at the second forecast horizon as expected, i.e. the ninety-five percent prob-
ability interval does not cover zero. On average at this forecast horizon,
the predictive distribution of the AR model is more accurate than that of
the VAR model. For all other forecast horizons H0 cannot be rejected. For
the variable y2, we conclude that we reject H0 at one of the ten forecast
horizons. We find that ϕ is negative at the first forecast horizon as expected.
On average at this forecast horizon, the predictive distribution of the AR
model is more accurate than that of the VAR model. For all other forecast
horizons H0 cannot be rejected.
Overall, we cannot conclude that the AR model produces more accurate pre-
dictive distributions than the VAR model for the univariate simulated data.
It seems that the small increase in estimated posterior variance caused by the
inclusion of irrelevant variables is not large enough to distinguish between
the two models' predictive distributions.
Table 4: Regression Results.
Univariate Simulation of the Data
y1 y2
ϕ 95% PI ϕ 95% PI
h=1 -0.111 [-0.377 ; 0.155] -0.906 [-1.768 ; -0.063]
h=2 -0.527 [-0.976 ; -0.060] 0.080 [-0.536 ; 0.698]
h=3 -0.235 [-0.615 ; 0.132] -0.559 [-1.330 ; 0.233]
h=4 -0.197 [-0.568 ; 0.166] 0.261 [-0.263 ; 0.789]
h=5 -0.085 [-0.535 ; 0.358] -0.448 [-1.268 ; 0.358]
h=6 -0.016 [-0.384 ; 0.347] -0.157 [-0.597 ; 0.292]
h=7 0.161 [-0.245 ; 0.562] 0.387 [-0.060 ; 0.833]
h=8 0.042 [-0.359 ; 0.449] 0.186 [-0.128 ; 0.501]
h=9 0.004 [-0.434 ; 0.446] 0.005 [-0.264 ; 0.273]
h=10 0.060 [-0.381 ; 0.488] 0.135 [-0.148 ; 0.413]
PI stands for probability interval.
7.3 Multivariate Simulation Results
We start by examining the results of the estimated posterior variance and
covariance from the AR(2) and VAR(2) models for the multivariate simulations
of the data. The second column in Table 1 shows the expected outcomes.
The VAR model is correctly specified, while the AR model is misspecified,
which is expected to increase the estimated posterior variance.
The results are presented in Table 5. For the variable y1, which is not de-
termined by y2, we conclude that the expected value of the mean posterior
variance is equal to 1.980 for the AR and 1.983 for the VAR. We conclude
that the inclusion of irrelevant independent variables causes an increase in the
estimated posterior variance, which is the same conclusion as in the univariate
simulation of the data. For the variable y2, we conclude that the results follow
the statistical theory well. The expected value of the mean posterior variance
is 3.291 for the AR model and 3.025 for the VAR model. The increase in
estimated posterior variance due to misspecification is large. The expected
value of the mean posterior covariance in the VAR model is 1.821, which is
close to the covariance in equation (30), 0.75 × √6 ≈ 1.837. The choice of
elements in the parameter matrix B has balanced the correlation between y1
and y2 to the same correlation specified for the error terms. From the right
graph in Figure 3, we conclude that the correlation between y1 and y2 is on
average 0.745, but ranges approximately from 0.550 to 0.850.
Overall, we conclude that the estimated posterior variances follow the ex-
pected theory in Table 1. The increase in the estimated posterior variance
due to misspecification is large, as expected. We reach the same conclusion
about the inclusion of irrelevant variables as in the univariate simulation of
the data.
Table 5: Expected Values of Mean Posterior Variance/Covariance.
Multivariate Simulation of the Data
AR(2) VAR(2)
E[Coef.] E[95% PI] E[Coef.] E[95% PI]
V(y1) 1.980 [ 1.617 ; 2.420] 1.983 [ 1.619 ; 2.429]
COV(y1,y2) - - 1.821 [ 1.423 ; 2.303]
V(y2) 3.291 [ 2.689 ; 4.027] 3.025 [ 2.469 ; 3.703]
E[.] is the expected value of 100 simulations of the data.
We now turn to examine the results of the linear regression (31) for the mea-
surement. We expect ϕ to be negative for y1: the AR model does not suffer
from misspecification and the expected value of the mean posterior covari-
ance is large. Therefore we expect the predictive distributions of the VAR
model to be wide for y1 due to the Cholesky decomposition. From the finding
about the estimated posterior variance for y2, we expect to reject H0 and
find that the parameter ϕ is positive.
The results are presented in Table 6. For the variable y1, we conclude that we
reject H0 at all forecast horizons except the first. We find that ϕ is negative
for the second to tenth forecast horizons, as we expected due to the Cholesky
decomposition. On average, the forecast accuracy of the AR relative to the
VAR increases over the forecast horizons. For the variable y2, we conclude
that we reject H0 for three out of ten forecast horizons. We find that ϕ is
negative for the eighth to tenth forecast horizons, which is opposite to what
we expected. We also conclude that the magnitude of the differences is large
for the ninth and tenth forecast horizons.
Table 6: Regression Results.
Multivariate Simulation of the Data
y1 y2
ϕ 95% PI ϕ 95% PI
h=1 -0.029 [-0.159 ; 0.109] 0.272 [-0.050 ; 0.604]
h=2 -0.251 [-0.467 ; -0.038] 0.050 [-0.257 ; 0.349]
h=3 -0.321 [-0.553 ; -0.087] 0.536 [ 0.145 ; 0.938]
h=4 -0.289 [-0.563 ; -0.030] 0.172 [-0.192 ; 0.539]
h=5 -0.219 [-0.470 ; 0.039] 0.334 [-0.104 ; 0.792]
h=6 -0.338 [-0.597 ; -0.068] -0.257 [-0.640 ; 0.125]
h=7 -0.308 [-0.614 ; -0.005] -0.211 [-0.614 ; 0.203]
h=8 -0.333 [-0.598 ; -0.062] -0.492 [-0.910 ; -0.066]
h=9 -0.483 [-0.755 ; -0.209] -0.701 [-1.107 ; -0.293]
h=10 -0.466 [-0.718 ; -0.210] -0.905 [-1.289 ; -0.523]
PI stands for probability interval.
Table 7 shows the same analysis as before, but this time we have changed the
order of the variables when estimating the VAR model. For the variable y1
we conclude that we reject H0 at the second, fourth, fifth and ninth forecast
horizons. We find that ϕ is negative for the second forecast horizon and
positive for the fourth, fifth and ninth forecast horizons. For the variable y2
we conclude that we reject H0 for five out of the ten forecast horizons. We
find that ϕ is negative at the first, sixth, eighth, ninth and tenth forecast
horizons, as expected due to the Cholesky decomposition. Both these results
show the strong effect of the Cholesky decomposition. The variable y2
handles the extra uncertainty added to the predictive distribution better
than y1. This is most likely due to the misspecification of the AR model for
the y2 variable.
Table 7: Changed Order Of Variables in VAR.
Multivariate Simulation of the Data
y2 y1
ϕ 95% PI ϕ 95% PI
h=1 -2.054 [-3.041 ; -1.001] -0.940 [-1.956 ; 0.022]
h=2 -0.165 [-0.946 ; 0.614] -1.506 [-2.507 ; -0.480]
h=3 -0.415 [-0.836 ; 0.018] 0.132 [-0.282 ; 0.549]
h=4 -0.310 [-0.698 ; 0.086] 0.295 [ 0.077 ; 0.514]
h=5 -0.307 [-1.020 ; 0.409] 0.599 [ 0.088 ; 1.119]
h=6 -0.680 [-1.267 ; -0.089] 0.150 [-0.387 ; 0.666]
h=7 -0.433 [-0.924 ; 0.055] 0.388 [-0.008 ; 0.775]
h=8 -0.569 [-0.966 ; -0.168] 0.158 [-0.150 ; 0.463]
h=9 -0.829 [-1.290 ; -0.367] 0.388 [ 0.049 ; 0.726]
h=10 -0.936 [-1.369 ; -0.512] 0.178 [-0.238 ; 0.579]
PI stands for probability interval.
Overall, we can conclude that the AR model produces more accurate predic-
tive distributions than the VAR model for the multivariate simulated data.
This was expected for y1, due to the Cholesky decomposition and the fact
that neither the AR nor the VAR model was misspecified for this variable.
For y2, however, it seems that the misspecification is too small to have an
effect on the predictive distribution. Instead the VAR is outperformed by
the AR at longer horizons.
Summing up our results: from both the univariate and the multivariate re-
sults we conclude that the estimated posterior variances follow the statistical
theory of Table 1, but the magnitude of these effects is smaller than ex-
pected. When analyzing the predictive distribution by the measurement for
the univariate simulation of the data, we find it difficult to verify the in-
crease in estimated posterior variance in the predictive distributions predicted
by the statistical theory. Only at two horizons are we able to distinguish that
the AR models outperform the VAR models. This difficulty arises because
the increase in the estimated posterior variance is too small to separate the
predictive distributions. From the multivariate simulation of the data we
conclude that there is a large effect of the Cholesky decomposition in the
first equation of the estimated VAR model. We find that the VAR
model produces inferior predictive distributions of y1 at all but the first
horizon. This result is not as strong for y2, where five of the horizons produce
inferior predictive distributions. We also conclude that this effect increases
with the forecast horizon.
We conclude that autoregressive and vector autoregressive models with
a lag length of two have difficulty producing dissimilar predictive distribu-
tions. This has made it difficult to assess the accuracy of the measurement,
but by examining the effect of the Cholesky decomposition in VAR models
we are able to validate the accuracy of the measurement.
8. Conclusion
We conduct simulation studies to formulate a measurement that evaluates
the forecast accuracy of predictive distributions. We use Bayesian methods
to estimate posterior distributions and predictive distributions for the autoregres-
sive model and the vector autoregressive model. By the use of out-of-sample
forecasts and predictive distributions we are able to evaluate the full distri-
bution of forecast errors. We are also able to validate the accuracy of the
measurement, especially by allowing for correlated error terms across equa-
tions in the vector autoregressive model.
We formulate a measurement that uses out-of-sample forecasts and predic-
tive distributions to evaluate the full forecast error probability distribution by
forecast horizon. The measurement can be used as a forecast evaluation
technique for single forecasts or to calibrate forecast models. Furthermore,
we recommend that practitioners use the measurement together with several
other forecast evaluation techniques.
For further research we recommend that the measurement be evaluated
with models that are not from the same family (to ensure differences in
predictive distributions), with models that treat conditional heteroskedasticity
differently, in case studies of outliers such as financial crises, and against a
wide range of forecast evaluation techniques.
Appendix
A1. Gibbs Sampler
To explain the intuition behind the Gibbs sampler we borrow a summary
put forward by Ciccarelli and Rebucci (2003), with the mathematical no-
tation adapted to Section 3.1.
In many applications the analytical integration of p(β, σ²|yt) may be diffi-
cult or even impossible to implement. This problem, however, can often
be solved by using numerical integration based on Monte Carlo simulation
methods.
One particular method used in the literature to solve estimation problems
similar to those discussed here is the Gibbs sampler. The Gibbs sampler
is a recursive Monte Carlo method which only requires that the full
conditional posterior distributions of the parameters of interest, p(β|σ², yt)
and p(σ²|β, yt), are known. The Gibbs sampler starts from an arbi-
trary value for β^(0) or (σ²)^(0), and samples alternately from the density of
each element of the parameter vector, conditional on the value of the other
element sampled in the previous iteration and on the data. Thus, the Gibbs
sampler samples recursively as follows:
β^(1) from p(β | (σ²)^(0), yt)
(σ²)^(1) from p(σ² | β^(1), yt)
β^(2) from p(β | (σ²)^(1), yt)
(σ²)^(2) from p(σ² | β^(2), yt)
...
β^(m) from p(β | (σ²)^(m−1), yt)
(σ²)^(m) from p(σ² | β^(m), yt)

and so on.
The vectors ϑ^(m) = (β^(m), (σ²)^(m)) form a Markov chain and, for a suffi-
ciently large number of iterations (say m ≥ M), can be regarded as draws
from the true joint posterior distribution. Given a large sample of draws
from this limiting distribution, any posterior moment or marginal density
of interest can then be estimated consistently by the corresponding
sample average.
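For a regression with a conjugate Normal prior on β and an inverse-Gamma prior on σ², the scheme above can be sketched as follows. The conditional posteriors used here are the standard conjugate forms; the prior notation (b0, α, θ0) mirrors the text, while the function name, defaults and remaining details are illustrative.

```python
import numpy as np

def gibbs_ar(y, X, M=2000, burn=500, b0=None, H0=None, alpha=3.0,
             theta0=1.0, seed=0):
    """Gibbs sampler alternating between p(beta | sigma2, y) and
    p(sigma2 | beta, y) for a regression y = X beta + eps."""
    rng = np.random.default_rng(seed)
    T, k = X.shape
    b0 = np.zeros(k) if b0 is None else b0        # prior mean of beta
    H0 = np.eye(k) if H0 is None else H0          # prior variance of beta
    H0_inv = np.linalg.inv(H0)
    sigma2 = 1.0                                  # arbitrary starting value
    draws_b, draws_s = [], []
    for m in range(M):
        # beta | sigma2, y  ~  Normal(b_post, V_post)
        V_post = np.linalg.inv(H0_inv + X.T @ X / sigma2)
        b_post = V_post @ (H0_inv @ b0 + X.T @ y / sigma2)
        beta = rng.multivariate_normal(b_post, V_post)
        # sigma2 | beta, y  ~  inverse-Gamma((alpha + T)/2, (theta0 + RSS)/2)
        resid = y - X @ beta
        sigma2 = ((theta0 + resid @ resid) / 2.0) / rng.gamma((alpha + T) / 2.0)
        if m >= burn:
            draws_b.append(beta)
            draws_s.append(sigma2)
    return np.array(draws_b), np.array(draws_s)
```

For an AR(p) model, X would contain the lagged values of y; discarding the first `burn` iterations removes the dependence on the starting value, and posterior moments are then sample averages over the retained draws.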
References

Luc Bauwens, Gary Koop, Dimitris Korobilis, and Jeroen VK Rombouts. The
contribution of structural break models to forecasting macroeconomic series.
Journal of Applied Econometrics, 2014.

Andrew P Blake and Haroon Mumtaz. Applied Bayesian econometrics for
central bankers. Number 4 in Technical Books. Centre for Central Banking
Studies, Bank of England, 2012. URL http://ideas.repec.org/b/ccb/tbooks/4.html.

Matteo Ciccarelli and Alessandro Rebucci. BVARs: A Survey of the Recent
Literature with an Application to the European Monetary System. Rivista
di Politica Economica, 93(5):47–112, September 2003. URL
http://ideas.repec.org/a/rpo/ripoec/v93y2003i5p47-112.html.

A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari, and D.B.
Rubin. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC
Texts in Statistical Science. Taylor & Francis, 2013. ISBN 9781439840955.
URL http://books.google.se/books?id=ZXL6AQAAQBAJ.

John Geweke and Gianni Amisano. Comparing and evaluating Bayesian
predictive distributions of asset returns. Working Paper Series 0969, Euro-
pean Central Bank, November 2008. URL
http://ideas.repec.org/p/ecb/ecbwps/20080969.html.

John Geweke and Gianni Amisano. Prediction using several macroeconomic
models, 2012.

K Rao Kadiyala and Sune Karlsson. Forecasting with generalized Bayesian
vector autoregressions. Journal of Forecasting, 12(3-4):365–378, 1993.

K Rao Kadiyala and Sune Karlsson. Numerical Methods for Estimation
and Inference in Bayesian VAR-Models. Journal of Applied Econometrics,
12(2):99–132, March–April 1997. URL
http://ideas.repec.org/a/jae/japmet/v12y1997i2p99-132.html.

Sune Karlsson. Chapter 15: Forecasting with Bayesian vector autoregression.
In Graham Elliott and Allan Timmermann, editors, Handbook of Economic
Forecasting, volume 2, Part B, pages 791–897. Elsevier, 2013. doi:
10.1016/B978-0-444-62731-5.00015-4. URL
http://www.sciencedirect.com/science/article/pii/B9780444627315000154.

Gary Koop and Dimitris Korobilis. Bayesian multivariate time series methods
for empirical macroeconomics. Now Publishers Inc, 2010.

Dongchu Sun and Shawn Ni. Bayesian analysis of vector-autoregressive models
with noninformative priors. Journal of Statistical Planning and Inference,
121(2):291–309, 2004.

Anders Warne, Günter Coenen, and Kai Christoffel. Predictive likelihood
comparisons with DSGE and DSGE-VAR models. Working Paper Series
1536, European Central Bank, April 2013. URL
http://ideas.repec.org/p/ecb/ecbwps/20131536.html.
International Journal of Mathematics and Statistics Invention (IJMSI) International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI)
inventionjournals
 
Edison S Statistics
Edison S StatisticsEdison S Statistics
Edison S Statistics
teresa_soto
 
A review of statistics
A review of statisticsA review of statistics
A review of statistics
edisonre
 
journal in research
journal in research journal in research
journal in research
rikaseorika
 
research journal
research journalresearch journal
research journal
rikaseorika
 
published in the journal
published in the journalpublished in the journal
published in the journal
rikaseorika
 
SLR Assumptions:Model Check Using SPSS
SLR Assumptions:Model Check Using SPSSSLR Assumptions:Model Check Using SPSS
SLR Assumptions:Model Check Using SPSS
Nermin Osman
 
STRUCTURAL EQUATION MODEL (SEM)
STRUCTURAL EQUATION MODEL (SEM)STRUCTURAL EQUATION MODEL (SEM)
STRUCTURAL EQUATION MODEL (SEM)
AJHSSR Journal
 
ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC DISTRIBUTION USING MAXIMUM LIKELIH...
ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC  DISTRIBUTION USING MAXIMUM LIKELIH...ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC  DISTRIBUTION USING MAXIMUM LIKELIH...
ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC DISTRIBUTION USING MAXIMUM LIKELIH...
BRNSS Publication Hub
 
Empirical Analysis of the Bias-Variance Tradeoff Across Machine Learning Models
Empirical Analysis of the Bias-Variance Tradeoff Across Machine Learning ModelsEmpirical Analysis of the Bias-Variance Tradeoff Across Machine Learning Models
Empirical Analysis of the Bias-Variance Tradeoff Across Machine Learning Models
mlaij
 
Empirical Analysis of the Bias-Variance Tradeoff Across Machine Learning Models
Empirical Analysis of the Bias-Variance Tradeoff Across Machine Learning ModelsEmpirical Analysis of the Bias-Variance Tradeoff Across Machine Learning Models
Empirical Analysis of the Bias-Variance Tradeoff Across Machine Learning Models
mlaij
 
Review Parameters Model Building & Interpretation and Model Tunin.docx
Review Parameters Model Building & Interpretation and Model Tunin.docxReview Parameters Model Building & Interpretation and Model Tunin.docx
Review Parameters Model Building & Interpretation and Model Tunin.docx
carlstromcurtis
 
ProjectWriteupforClass (3)
ProjectWriteupforClass (3)ProjectWriteupforClass (3)
ProjectWriteupforClass (3)
Jeff Lail
 
JSM2013,Proceedings,paper307699_79238,DSweitzer
JSM2013,Proceedings,paper307699_79238,DSweitzerJSM2013,Proceedings,paper307699_79238,DSweitzer
JSM2013,Proceedings,paper307699_79238,DSweitzer
Dennis Sweitzer
 
binary logistic assessment methods and strategies
binary logistic assessment methods and strategiesbinary logistic assessment methods and strategies
binary logistic assessment methods and strategies
mikaelgirum
 
International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI) International Journal of Mathematics and Statistics Invention (IJMSI)
International Journal of Mathematics and Statistics Invention (IJMSI)
inventionjournals
 
Edison S Statistics
Edison S StatisticsEdison S Statistics
Edison S Statistics
teresa_soto
 
A review of statistics
A review of statisticsA review of statistics
A review of statistics
edisonre
 
Ad

Evaluating competing predictive distributions

  • 1. Evaluating competing predictive distributions. An out-of-sample forecast simulation study. Bachelor's Thesis in Statistics. Andreas C. Collett∗ January 7, 2015. Abstract† This thesis aims to formulate a simple measurement that evaluates predictive distributions of out-of-sample forecasts between two competing models. Predictive distributions form a large part of today's forecast models used for policy making. The possibility to compare predictive distributions between models is important for policy makers who make informed decisions based on probabilities. We conduct simulation studies to estimate autoregressive models and vector autoregressive models with Bayesian inference. The formulated measurement uses out-of-sample forecasts and predictive distributions to evaluate the full forecast error probability distribution by forecast horizon. We find the measurement to be accurate; it can be used to evaluate single forecasts or to calibrate forecast models. Keywords: autoregressive, out-of-sample forecast, Bayesian inference, Gibbs sampling, prior distribution, posterior distribution and predictive distribution. Department of Statistics, autumn semester 2014, course code SU-39434. ∗ Correspondence to author: [email protected]. † I am deeply grateful to my supervisor Professor Emeritus Daniel Thorburn for his commitment, time, notes and discussions.
  • 2. Contents: 1. Introduction 2; 2. Related Literature 4; 3.1 Bayesian Inference 5; 3.2 Autoregression 7; 3.3 Vector Autoregression 11; 4. Evaluating the Predictive Distribution 13; 5. Simulation Study 15; 6. Hyperparameters 17; 7.1 Results 19; 7.2 Univariate Simulation Results 20; 7.3 Multivariate Simulation Results 22; 8. Conclusion 26; Appendix 27; A1. Gibbs Sampler 27.
  • 3. 1. Introduction. This thesis aims to formulate a simple measurement that evaluates predictive distributions of out-of-sample forecasts between two competing models. Predictive distributions form a large part of today's forecast models used for policy making. The possibility to compare predictive distributions between models is important for policy makers who make informed decisions based on probabilities. Out-of-sample forecasts are used to mimic the situation forecasters experience in real time, and are used by academics in forecast methodology research and by practitioners to calibrate forecast models. By combining predictive distributions and out-of-sample forecasts one can evaluate the forecast error probability distribution. Earlier forecast evaluation literature has tended to focus on point forecasts, either taken directly from a model or from a certain value in the predictive distribution, rather than evaluating the full predictive distribution. This results in a loss of information about the uncertainty of the forecasts and the forecast model. The contribution of this thesis is to formulate a simple measurement that uses this information to evaluate forecasts at multiple horizons. There is recent research that addresses this subject in various forms: Geweke and Amisano (2008, 2012), Warne et al. (2013) and Bauwens et al. (2014). However, this literature is small relative to the literature on evaluating point forecasts. We generate data samples from univariate autoregressions (AR) and a multivariate vector autoregression (VAR) with known true parameters. We use Bayesian methods to estimate AR and VAR models on these data to obtain posterior inference and predictive distributions. A restrictive Gibbs sampler is implemented to conduct the Bayesian inference. These methods allow us to produce predictive distributions and to explore the theory and application of Bayesian analysis through a simple example.
The restrictive Gibbs sampler is a popular method for obtaining posterior inference in time series analysis. The thesis therefore gives an introduction to Bayesian analysis and its application in time series. The structure of the simulated data allows us to use simple statistical theory to evaluate the posterior inference, summarized in Table 1, and the use of out-of-sample forecasts allows us to evaluate which model produces the most accurate predictive distribution. The diagonal of Table 1 represents the situation where the simulated data are estimated with the correct model. The lower left outcome occurs when the data are simulated from an AR model but estimated with a VAR model. In this case the VAR model will not suffer from misspecification but will include irrelevant independent variables. This will
  • 4. cause a (small) increase in the estimated variance, which in turn will lead to a wider predictive distribution. The upper right outcome occurs when the data are simulated from a VAR model but estimated with an AR model. In this case the AR model will be misspecified, i.e. the model suffers from omitted variable bias. This will cause a (large) increase in the estimated variance, which in turn will lead to a wider predictive distribution. The formulated measurement accounts for both the size of the forecast error and the probability that this forecast error would occur.
Table 1: Underlying Statistical Theory.
                 Simulation of Data
Model     Univariate                          Multivariate
AR(p)     Optimal                             Misspecification
VAR(p)    Irrelevant Independent Variables    Optimal
We formulate a measurement that uses out-of-sample forecasts and predictive distributions to evaluate the full forecast error probability distribution by forecast horizon. We are able to validate the accuracy of the measurement against statistical theory, but we find that the autoregressive model and the vector autoregressive model with the same lag length have difficulty producing dissimilar predictive distributions. However, we are able to separate the predictive distributions of the models by letting both be correctly specified but allowing for a high degree of correlation between the error terms across equations in the vector autoregressive model. From this we find that the formulated measurement is able to measure the accuracy of the full forecast error distribution. The measurement can be used as a forecast evaluation technique for single forecasts or to calibrate forecast models. The rest of the thesis is structured as follows. Section 2 presents the empirical research closest to the research question. Section 3 describes the empirical methodology. Section 4 describes the evaluation method of the predictive distribution.
Section 5 describes the simulation method. Section 6 accounts for the selection of the hyperparameters in the priors. Section 7 discusses the results and model comparisons.
  • 5. 2. Related Literature. We present two articles that evaluate predictive distributions. The methods used are technical and we will not go into depth describing them. Instead we mention the methods and give a brief summary of the results. Interested readers are referred to the articles under consideration. Geweke and Amisano (2012) compare the forecast performance of, and construct model combinations for, three models: the dynamic factor model, the dynamic stochastic general equilibrium model and the vector autoregressive model. They use several analytical techniques to evaluate forecast performance and to construct model combinations: pooling of predictive densities, analysis of predictive variances, probability integral transform tests and Bayesian model averaging. They find two improvements that increase forecast accuracy substantially. The first improvement is to use the full Bayesian predictive distribution instead of the posterior mode for the parameters. The second improvement is to construct the model combination by equally weighted pooling of the predictive densities from the three models, instead of relying on the individual predictive distribution from each model. This result is considerably better than when Bayesian model averaging is used for the same purpose. Bauwens et al. (2014) compare the forecast performance of two models that allow for structural breaks against a wide range of alternative models which do not. They evaluate forecast performance by two metrics. First, they use root mean squared forecast errors (RMSE) to evaluate point forecasts, with the median of the predictive distribution used as the point forecast. Second, they use the average of log predictive likelihoods (APL), which is the predictive density evaluated at the observed outcome. The APL is estimated by a nonparametric kernel smoother, using draws from the predictive simulator.
They find that no single model is consistently better than the alternatives in the presence of structural breaks. One source of this uncertainty about forecast performance is that the two metrics yield substantially different conclusions. They find that the structural break models seem to dominate the non-structural break models in terms of RMSE, but the opposite is often true in terms of APL.
  • 6. 3.1 Bayesian Inference. To describe Bayesian inference the simple linear regression model will be examined. Consider the model y_t = X_t β + ε_t (1), where y_t is a T × 1 vector representing the dependent variable, X_t is a T × K matrix, ε_t ~ iid N(0, σ²), T is the number of observations and K is the number of independent variables. Our purpose is to obtain estimates of the K × 1 vector β and the scalar σ². These can be obtained by maximizing the likelihood function l(y_t | β, σ²) = (2πσ²)^(−T/2) exp[−(y_t − X_t β)′(y_t − X_t β) / (2σ²)] (2), which yields the maximum likelihood estimates³ β̂_MLE = (X_t′X_t)⁻¹(X_t′y_t) and σ̂²_MLE = (y_t − X_t β̂_MLE)′(y_t − X_t β̂_MLE)/T. According to the likelihood principle the likelihood function contains all the information in the data about the parameters β and σ². This is where the difference between classical (or frequentist) inference and Bayesian inference becomes apparent. Bayesian inference incorporates prior beliefs about the parameters into the estimation process in the form of probability distributions. This results in the joint posterior distribution p(β, σ² | y_t) = l(y_t | β, σ²)p(β, σ²)/p(y_t) ∝ l(y_t | β, σ²)p(β, σ²) (3), where p(β, σ²) is the prior distribution and p(y_t) is the density of the data, or marginal likelihood. The marginal likelihood, p(y_t), does not depend on β or σ² and can thus be considered a constant. This yields the unnormalized joint posterior distribution, which is proportional to the likelihood function times the prior distribution. However, Karlsson (2013) stresses the importance of the marginal likelihood for model comparison. Several conclusions can be drawn from the joint posterior distribution. First, the joint posterior distribution represents the probability distribution of the parameters β and σ² when the prior distribution has been updated with the information in the observed data, y_t.
Second, if the prior distribution is vague (or flat) then it can be considered almost constant; this causes the estimates to ³ Note that β̂_MLE is equal to the ordinary least squares estimator, β̂_OLS, while σ̂²_MLE is a biased estimate of the variance because it does not deduct the number of estimated parameters from the number of observations in the denominator, as is done in σ̂²_OLS.
  • 7. be similar to those of classical inference, i.e. the likelihood function will determine the estimates. This also occurs when the information in the data is rich, i.e. a large number of observations. There is a large literature on Bayesian inference in macroeconomics, where the data tend to have a small number of observations and the model requires a large number of parameters to be estimated, for example the vector autoregressive model. Given the joint posterior distribution, the marginal posterior distributions conditional on the data, p(β | y_t) and p(σ² | y_t), can be obtained by integrating σ² and β out of the joint posterior distribution, one at a time: p(β | y_t) = ∫ p(β, σ² | y_t) dσ² (4) and p(σ² | y_t) = ∫ p(β, σ² | y_t) dβ (5). For the simple regression model specified there exist analytical (or closed form) results for the integrals. But for more complex models or particular prior distributions there may not exist analytical results for the integrals. Then numerical or simulation techniques are required to obtain estimates of β and σ², such as Markov Chain Monte Carlo (MCMC) simulation. We will impose the restriction of stability on our AR and VAR processes, i.e. we are restricted to evaluating a certain range of the distribution of β. This will be implemented by the Gibbs sampler (see Appendix A1), which will allow us to sample from this range of the distribution. Even if analytical results exist and no restrictions are imposed, there are still situations where simulation is suitable; this is the case in forecasting. Forecasts with a horizon beyond the one-step-ahead horizon are nonlinear and can only be obtained by simulation. The essential key in Bayesian inference is the prior belief of the researcher, i.e. the prior distribution, p(β, σ²), in (3).
The prior distribution allows the researcher to address the uncertainty about the parameters before the data have been taken into account; this is done by specifying a probability distribution for each parameter. Prior distributions are classified into two categories, noninformative and informative. Noninformative prior distributions are used when the researcher does not have prior beliefs about the parameters, when the prior beliefs exist with a third party, i.e. are not known, and for scientific reports where differences in prior beliefs could affect the result. The noninformative prior distribution puts a uniform distribution on the parameters, which forces the estimates to be determined by the data while still reaping the benefits of Bayesian analysis. Informative prior distributions are used when the researcher has prior beliefs about the
  • 8. parameters and incorporates these beliefs into the prior distribution. This is accomplished by assigning hyperparameters⁴ or by restricting the parameter range. It is typically difficult to assess the prior belief in practice; therefore it is essential that the joint posterior distribution is proper and that the posterior inference is assessed with sensitivity analysis. According to Sun and Ni (2004), there exist situations in which the posterior is improper even though the full conditional distributions used for MCMC are all proper. One reason for using Bayesian inference is that it produces predictive distributions. This enables assessment of the probability of an outcome, which is more coherent with policy decisions than evaluating a certain point forecast of a model⁵. The essence of Bayesian inference is that the predictive distribution accounts for the uncertainty about the future and that the joint posterior distribution accounts for the uncertainty about the parameters: p(y_{t+1:t+h}) = ∫∫ f(y_{t+1:t+h} | y_t, β, σ²) p(β, σ² | y_t) dβ dσ² (6). This results in a predictive distribution of forecasts p(y_{t+h}) at each forecast horizon h. The predictive distribution enables the researcher to address the probability that a certain outcome will occur. This is useful in many ways: the predictive distribution can be described by measures of central tendency and used to assess distressed scenarios by evaluating quantiles of the predictive distribution. 3.2 Autoregression. We will use Bayesian methods to estimate the parameters in an autoregressive process, AR(p), where p is the number of lags to use for the univariate time series y_t: y_t = Σ_{i=1}^{p} β_i y_{t−i} + ε_t, t = 1, ..., T (7), where ε_t ~ iid N(0, σ²). This model is the same as in (1), where X_t is a T × p matrix (without a constant) consisting of p lags of the time series y_t.
⁴ The hyperparameters represent the parameters in the prior distribution and are called hyperparameters to distinguish them from the parameters in the model. ⁵ There are methods in the frequentist view to generate an approximate predictive distribution, for example by bootstrapping. These distributions are, however, mostly tighter than the Bayesian predictive distribution because they do not take into account the uncertainty about the parameters. These methods will not be evaluated in this thesis.
  • 9. An important difference between the models in (1) and (7) is that in the AR model the dependent variable is not independently and identically distributed; y_t depends on past values of itself. The Normal-Gamma prior distribution is the conjugate prior for a normal distribution: the posterior distribution and the prior distribution are in the same family of distributions. In the Normal-Gamma prior distribution, the parameter vector, β, is normally distributed conditional on the variance, σ², and the variance follows the (inverse) Gamma distribution. The prior mean is β | σ² ~ N(β₀, σ²H̄) (8), where β₀ is a p × 1 vector representing the researcher's prior belief about the parameter values. The prior mean variance-covariance matrix, H, is equal to the researcher's prior belief about the variance, σ², times the p × p diagonal matrix H̄ representing the researcher's uncertainty about the parameters. Larger values on the diagonal of σ²H̄ result in larger variances around the prior means. The prior variance is σ² ~ iΓ(α/2, θ₀/2) (9), where α represents the prior degrees of freedom and θ₀ represents the prior scale parameter. Holding the prior degrees of freedom fixed and letting the prior scale increase results in an (inverse) Gamma distribution with an increasing mean, i.e. the prior belief about the value of σ² increases. Holding the prior scale fixed and letting the prior degrees of freedom increase results in an (inverse) Gamma distribution that is more tightly centred around the mean, i.e. the prior belief about σ² becomes tighter. This is illustrated in Figure 1. Specifying the prior belief depends on several factors. Practitioners set the prior beliefs to their own views, while researchers tend to set them to the OLS estimates to let the data influence the estimates more than the prior beliefs. This is viewed as the more coherent and accepted academic approach.
But it also depends on the number of observations and parameters: if there is a large number of observations relative to the number of parameters, the influence of the data will be stronger. If there is a small number of observations relative to the number of parameters, known as overparametrization in the literature, then one must specify a strong prior.
  • 10. Figure 1: The (left) figure illustrates the effect on the (inverse) Gamma distribution as the scale parameter θ₀ takes the values {1, 2, 3, 4}, holding the degrees of freedom constant at α = 1. The (right) figure illustrates the effect on the (inverse) Gamma distribution as the degrees of freedom α takes the values {1, 2, 3, 4}, holding the scale parameter constant at θ₀ = 1. The conditional posterior distributions of β and σ² are p(β | σ², y_t) = N(M, V) (10) and p(σ² | β, y_t) = iΓ(τ₁/2, θ₁/2) (11), where M = (H̄⁻¹ + (1/σ²)X_t′X_t)⁻¹(H̄⁻¹β₀ + (1/σ²)X_t′y_t) (12), V = (H̄⁻¹ + (1/σ²)X_t′X_t)⁻¹ (13), τ₁ = α + T (14), θ₁ = θ₀ + (y_t − X_tβ)′(y_t − X_tβ) (15), and β₀, H̄, α and θ₀ are hyperparameters specified by the researcher. Note that there exist analytical results for the Normal-Gamma prior distribution, but we will use the Gibbs sampler to obtain parameter estimates and predictive distributions. Table 2 describes the implemented restrictive Gibbs sampler.
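As a concrete sketch of the conditional posterior moments in (12)–(15), the following Python snippet computes M, V, τ₁ and θ₁ for a small regression. This is an illustrative reconstruction, not code from the thesis; the data and hyperparameter values are assumptions.

```python
import numpy as np

# Illustrative data and hyperparameters (assumptions, not from the thesis).
rng = np.random.default_rng(1)
T, p = 200, 2
X = rng.normal(size=(T, p))                 # T x p regressor matrix
y = X @ np.array([0.4, 0.3]) + rng.normal(size=T)

beta0 = np.zeros(p)                         # prior mean beta_0 in (8)
Hbar = 10.0 * np.eye(p)                     # prior uncertainty matrix H-bar in (8)
alpha, theta0 = 1.0, 1.0                    # prior degrees of freedom and scale in (9)
sigma2 = 1.0                                # conditioning value of sigma^2
beta = np.array([0.4, 0.3])                 # conditioning value of beta

Hinv = np.linalg.inv(Hbar)
V = np.linalg.inv(Hinv + X.T @ X / sigma2)             # eq. (13)
M = V @ (Hinv @ beta0 + X.T @ y / sigma2)              # eq. (12)
tau1 = alpha + T                                       # eq. (14)
theta1 = theta0 + (y - X @ beta) @ (y - X @ beta)      # eq. (15)
```

With a vague prior (large diagonal of H̄), M lies very close to the OLS estimate, consistent with the text's remark that a flat prior lets the likelihood determine the estimates.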
  • 11. Table 2: Restrictive Gibbs Sampler for an AR(p) model. To illustrate the Gibbs sampler we examine the first sample, m = 1. We start by sampling β⁽¹⁾ from p(β | (σ²)⁽⁰⁾, y_t) in (10). It is important to note that (12) and (13) depend on σ² and that the Gibbs sampler needs an initial value for this parameter, denoted (σ²)⁽⁰⁾, which is specified by the researcher. We set (σ²)⁽⁰⁾ to the OLS estimate σ̂². Having obtained M in (12) and V in (13), we can sample β⁽¹⁾ from p(β | (σ²)⁽⁰⁾, y_t) by β̂⁽¹⁾ = M + [r(V)^(1/2)]′, where r is a 1 × p vector of draws from the standard normal distribution and (V)^(1/2) is the Cholesky decomposition of V. We impose the restriction that β̂⁽¹⁾ must come from a stable AR process, i.e. all roots z of the polynomial β_p(z) = 1 − β₁z − β₂z² − ... − β_p z^p must have modulus greater than one, |z| > 1. Once β̂⁽¹⁾ is obtained we can sample (σ²)⁽¹⁾ from p(σ² | β⁽¹⁾, y_t), the inverse Gamma distribution in (11). Note that the posterior degrees of freedom τ₁ in (14) and the posterior scale parameter θ₁ in (15) require the researcher to specify α and θ₀. A sample from the inverse Gamma distribution is constructed as (σ̂²)⁽¹⁾ = θ₁/(x₀x₀′), where x₀ is a 1 × τ₁ vector of draws from the standard normal distribution. A sample from the predictive distribution at forecast horizon h is constructed as ŷ⁽¹⁾_{t+h} = Σ_{i=1}^{h−1} β̂⁽¹⁾_i ŷ⁽¹⁾_{t+h−i} + Σ_{i=h}^{p} β̂⁽¹⁾_i y_{t+h−i} + ε̂⁽¹⁾_{t+h}, where ε̂⁽¹⁾_{t+h} = r √((σ̂²)⁽¹⁾) and r is a single draw from the standard normal distribution. This process is repeated for M iterations until we have obtained β⁽¹⁾, ..., β⁽ᴹ⁾, (σ²)⁽¹⁾, ..., (σ²)⁽ᴹ⁾ and ŷ⁽¹⁾_{t+h}, ..., ŷ⁽ᴹ⁾_{t+h}. The first 1, ..., B iterations are discarded, so that β⁽ᴮ⁺¹⁾, ..., β⁽ᴹ⁾, (σ²)⁽ᴮ⁺¹⁾, ..., (σ²)⁽ᴹ⁾ and ŷ⁽ᴮ⁺¹⁾_{t+h}, ..., ŷ⁽ᴹ⁾_{t+h} are used for the empirical distributions. The iterations 1, ..., B are known as burn-in iterations and are required for the Gibbs sampler to converge.
There is, however, no guarantee that the Gibbs sampler will perform well or that it will converge. In this thesis we set M = 4,000 and B = 1,000 to ensure convergence.
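The steps of Table 2 can be sketched in Python as follows. This is an illustrative re-implementation under assumed defaults (β₀ = 0, H̄ = 10I, α = θ₀ = 1), not the author's code; the stability restriction is checked through the companion-matrix eigenvalues, which is equivalent to the polynomial-root condition in the text.

```python
import numpy as np

def restrictive_gibbs_ar(y, p=2, M=4000, B=1000, alpha=1.0, theta0=1.0,
                         h=4, seed=0):
    """Restrictive Gibbs sampler for an AR(p): draw beta from (10) subject
    to stability, sigma2 from (11), and simulate h-step predictive draws."""
    rng = np.random.default_rng(seed)
    T = len(y) - p
    # Row t of X holds lags 1..p of y for target y_t.
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    yt = y[p:]
    beta0, Hinv = np.zeros(p), np.linalg.inv(10.0 * np.eye(p))
    # Initialize sigma2 at the OLS residual variance, as in Table 2.
    sigma2 = float(np.var(yt - X @ np.linalg.lstsq(X, yt, rcond=None)[0]))
    betas, sig2s, preds = [], [], []
    for m in range(M):
        V = np.linalg.inv(Hinv + X.T @ X / sigma2)            # eq. (13)
        Mean = V @ (Hinv @ beta0 + X.T @ yt / sigma2)         # eq. (12)
        L = np.linalg.cholesky(V)
        while True:  # reject draws from unstable AR processes
            beta = Mean + L @ rng.standard_normal(p)
            comp = np.zeros((p, p))
            comp[0] = beta
            comp[1:, :-1] = np.eye(p - 1)
            if np.max(np.abs(np.linalg.eigvals(comp))) < 1.0:
                break
        theta1 = theta0 + (yt - X @ beta) @ (yt - X @ beta)   # eq. (15)
        x0 = rng.standard_normal(int(alpha + T))              # tau1 draws
        sigma2 = theta1 / (x0 @ x0)                           # inverse-Gamma draw
        path = list(y[-p:])                                   # predictive simulation
        for _ in range(h):
            path.append(sum(beta[i] * path[-1 - i] for i in range(p))
                        + rng.standard_normal() * np.sqrt(sigma2))
        if m >= B:                                            # keep post burn-in draws
            betas.append(beta), sig2s.append(sigma2), preds.append(path[p:])
    return np.array(betas), np.array(sig2s), np.array(preds)
```

A typical call would be `betas, s2, preds = restrictive_gibbs_ar(data, p=2, M=4000, B=1000)`, after which each column of `preds` is an empirical predictive distribution for one horizon.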
  • 12. 3.3 Vector Autoregression. We will use Bayesian methods to estimate a vector autoregressive process, VAR(p), where p is the number of lags to use for each time series. The VAR model is a system-of-equations model that allows the endogenous variables to simultaneously affect each other. Furthermore, the error terms can be correlated across equations. A structural shock in one of the error terms may cause a shock in all error terms, causing contemporaneous movement in all endogenous variables. The VAR(p) with n endogenous variables, without constants, deterministic or exogenous variables, is defined as y_t = Σ_{i=1}^{p} B_i y_{t−i} + ε_t, t = 1, ..., T (16), where y_t is an n × 1 vector, B_i is an n × n matrix, ε_t is an n × 1 vector and ε_t ~ iid N(0, Σ). The endogenous variables in the VAR model are not iid; they depend on past values of y_t. We can express (16) in compact matrix form if we define x_t = (y_{t−1}, ..., y_{t−p}): Y_t = X_tB + E_t (17), where Y_t and E_t are T × n matrices, X_t = (x_1, ..., x_T) is a T × np matrix and B = (B_1, ..., B_p) is an np × n matrix. Note that the parameter matrix can be stacked into an n(np) × 1 vector by b = vec(B). The Normal-Wishart prior distribution is the conjugate prior for a multivariate normal distribution: the posterior distribution and the prior distribution are in the same family of distributions. In the Normal-Wishart prior distribution, the parameter vector, b, is normally distributed conditional on the variance-covariance matrix, Σ, and the variance-covariance matrix follows the (inverse) Wishart distribution: p(b | Σ) ~ N(b₀, Σ ⊗ H̄) (18) and p(Σ) ~ iW(S̄, α) (19), where ⊗ is the Kronecker product and b₀ represents the researcher's prior belief about the parameter values. We follow Kadiyala and Karlsson (1993, 1997) in specifying the matrix H̄, the prior scale matrix S̄ and the prior degrees of freedom α. The np × np diagonal matrix H̄ has diagonal elements equal to λ₀λ₁ / (l^{λ₂} s²_i),
  • 13. where l refers to the lag length, s²_i refers to the OLS estimate of the variance from an AR(p) model, and i refers to the endogenous variable in the ith equation. The prior scale is an n × n diagonal matrix with diagonal equal to (α − n − 1)λ₀⁻¹s²_i, and the prior degrees of freedom satisfy α = max{n + 2, n + 2h − T} (20) to ensure existence of the prior variances of the regression parameters and the posterior variances of the predictive distribution at forecast horizon h. Following the guidelines of Kadiyala and Karlsson (1993, 1997) we only need to specify the hyperparameters b₀, λ₀, λ₁ and λ₂. The interpretation of the λ hyperparameters is as follows: λ₀ controls the overall tightness of the prior on the covariance matrix, λ₁ controls the tightness of the prior on the coefficients on the first lag, and λ₂ controls the degree to which coefficients on longer lags are shrunk more tightly towards zero. The prior mean variance-covariance matrix, H, is obtained by V(b) = (α − n − 1)⁻¹ S̄ ⊗ H̄; due to the imposed Kronecker structure we are not able to specify individual prior variances and standard deviations. Instead we are forced to treat all equations symmetrically. The conditional posterior distributions of b and Σ are p(b | Σ, Y_t) ~ N(M, V) (21) and p(Σ | b, Y_t) ~ iW(Σ̄, T + α) (22), where M = (H⁻¹ + Σ⁻¹ ⊗ X_t′X_t)⁻¹(H⁻¹b₀ + (Σ⁻¹ ⊗ X_t′X_t)b̂) (23), V = (H⁻¹ + Σ⁻¹ ⊗ X_t′X_t)⁻¹ (24), Σ̄ = S̄ + (Y_t − X_tB)′(Y_t − X_tB) (25), and b̂ is the OLS estimate of b. Kadiyala and Karlsson (1997) and Karlsson (2013) provide the analytical results for this prior, but we will use the Gibbs sampler to obtain parameter estimates and predictive distributions. The restrictive Gibbs sampler implemented in Table 2 is essentially the same for the Normal-Wishart prior: b̂ is sampled from (21) and Σ̂ is sampled from (22).
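To make the prior construction concrete, here is a small sketch of the H̄ diagonal, the prior scale diagonal and the prior degrees of freedom in (20). The function names, argument order and example values are assumptions for illustration, not code from the thesis.

```python
import numpy as np

def hbar_diagonal(s2, p, lam0, lam1, lam2):
    """Diagonal of the np x np matrix H-bar: lam0*lam1 / (l**lam2 * s_i^2)
    for lag l = 1..p, where s2 holds the AR(p) residual variances s_i^2,
    one per endogenous variable i = 1..n."""
    return np.array([lam0 * lam1 / (l ** lam2 * si2)
                     for l in range(1, p + 1) for si2 in s2])

def prior_scale_diagonal(s2, alpha, lam0):
    """Diagonal of the n x n prior scale S-bar: (alpha - n - 1) * s_i^2 / lam0."""
    n = len(s2)
    return (alpha - n - 1) * np.asarray(s2) / lam0

def prior_dof(n, h, T):
    """Prior degrees of freedom, eq. (20): alpha = max{n + 2, n + 2h - T}."""
    return max(n + 2, n + 2 * h - T)
```

Note how `hbar_diagonal` shrinks longer lags harder: for `lam2 = 1` the prior variance on lag l coefficients falls off as 1/l, matching the stated role of λ₂.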
  • 14. There is, however, one step in the Gibbs sampler for the VAR model that will affect the predictive distributions. This is due to the fact that the predictive distribution accounts for the uncertainty about the future, combined with the fact that the VAR model allows the error terms to be correlated across the equations. The Gibbs sampler will draw a sample m from the predictive distribution at forecast horizon h: ŷ⁽ᵐ⁾_{t+h} = Σ_{i=1}^{h−1} B̂⁽ᵐ⁾_i ŷ⁽ᵐ⁾_{t+h−i} + Σ_{i=h}^{p} B̂⁽ᵐ⁾_i y_{t+h−i} + ε̂⁽ᵐ⁾_{t+h}, where ε̂⁽ᵐ⁾_{t+h} = r[Σ̂⁽ᵐ⁾]^(1/2) and r is a 1 × n vector of draws from the standard normal distribution. The term [Σ̂⁽ᵐ⁾]^(1/2) is the square root of the estimated variance-covariance matrix obtained by the Cholesky decomposition, so that the upper triangle of [Σ̂⁽ᵐ⁾]^(1/2) consists of zeros. Therefore, the order of the variables in the VAR model is important. For example, in the bivariate VAR model the first equation will have two elements of uncertainty added to the forecast for each draw, while the second equation will have only one element of uncertainty added to the forecast at each draw. Therefore, there will be more uncertainty added to the predictive distribution of y₁ than of y₂. 4. Evaluating the Predictive Distribution. To explain the measurement, consider a predictive distribution at a specific forecast horizon, h. The problem is to structure a measurement that accounts for two factors: (1) the accuracy of the forecasts⁶ and (2) the probability that the forecasts occur. The most common method to examine forecast accuracy is to use out-of-sample forecasts. This allows the researcher to mimic a real-time situation. The observed data are divided into two sets: data for parameter estimation and actual values for forecast evaluation. The forecast error for each observation in the predictive distribution at h is ŷ⁽ᵐ⁾_{t+h} − yᵃ_{t+h}, where ŷ⁽ᵐ⁾_{t+h} is the mth Gibbs sample in the predictive distribution at horizon h and yᵃ_{t+h} is the actual value corresponding to the forecast.
6 Note: when we use the terminology forecast, we refer to one element within the predictive distribution unless otherwise specified.
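The h-step predictive draw described above can be sketched as an iteration that feeds earlier forecast steps back in and adds a correlated shock at every step. The parameter values below are hypothetical; the shock is formed as ε = L r with L the lower-triangular Cholesky factor of Σ̂, so how the n independent normal draws are attributed across equations depends on the variable ordering, which is the ordering effect discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# One retained draw (m) of the VAR(2) parameters (hypothetical values, n = 2).
B1 = np.array([[0.5, 0.0], [0.2, 0.3]])
B2 = np.array([[0.1, 0.0], [0.1, 0.2]])
Sigma = np.array([[2.0, 1.5], [1.5, 3.0]])
L = np.linalg.cholesky(Sigma)     # lower-triangular square root of Sigma

y_tm1, y_t = np.array([0.8, 0.2]), np.array([1.0, 0.5])  # last two observations

# Iterate the forecast forward h steps; lags beyond the horizon use actual
# data, lags inside the horizon use the previously drawn forecast steps.
h = 4
hist = [y_tm1, y_t]
for _ in range(h):
    r = rng.standard_normal(2)        # n independent standard normal draws
    eps = L @ r                       # correlated shock for this step
    hist.append(B1 @ hist[-1] + B2 @ hist[-2] + eps)

forecast_path = np.array(hist[2:])    # one h x n predictive-path draw
```

Repeating this for every retained Gibbs draw yields the predictive distribution at each horizon.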
The most common method to visualize a distribution is the histogram, which will serve as a tool to describe the concept of the measurement. The data is classified into bins represented by rectangles, where the height of each rectangle represents the number of data points within the interval of the bin. The histogram can be normalized to represent the probability of each bin, with the condition that the probabilities of all bins sum to one. In Figure 2 we can see the forecast error probability distributions of two competing models; it is clear that the left graph is more accurate than the right graph. We can conclude that these graphs serve our purpose, i.e. we can determine which graph is most accurate by examining the probabilities of the forecast error.

Figure 2: The (left) graph is the forecast error probability distribution of the AR(2) model at h = 2 for the univariate simulation of y2. The (right) graph is the forecast error probability distribution of the VAR(2) model at h = 2 for the univariate simulation of y2. The red line indicates zero forecast error.

Examining each forecast error distribution is time-consuming if, for example, a practitioner is calibrating a forecast model. Therefore we would like to summarize the forecast error distribution by its expected value. This will not, however, yield information about the accuracy of the forecast error distribution that we intend to measure; instead it will contain information about the bias of the forecast error distribution. To allow the same conclusion to be drawn as from the graphs in Figure 2, we will examine the squared
forecast errors7. The expected value of the squared forecast error is

ē_{t+h} = Σ_{m=1}^{M−B} (M − B)^{-1} (ŷ(m)_{t+h} − yᵃ_{t+h})² = Σ_{m=1}^{M−B} p_m (ŷ(m)_{t+h} − yᵃ_{t+h})²   (26)

where ē_{t+h} is the expected value of the squared forecast error and M − B is the number of stored samples in the Gibbs sampler. The probabilities sum to one, Σ_{m=1}^{M−B} p_m = 1. A tight forecast error distribution centred around zero will produce a small expected value of the squared forecast error, while a wide forecast error distribution not centred at zero, or skewed away from zero, will produce a larger one. The expected value of the squared forecast error is not informative in itself; it is only informative relative to a competing model. We will use the notation ē^{AR(p)}_{t+h} for the expected value of the squared forecast error at forecast horizon h for the AR(p) model and ē^{VAR(p)}_{t+h} for the corresponding quantity for the VAR(p) model.

5. Simulation Study

We generate the data for the variables y1 and y2 by (pseudo-)simulation from univariate AR models and a multivariate VAR model. This will allow us to obtain results corresponding to Table 1 (the simulations of data represent the columns of Table 1). Both the univariate and the multivariate simulation create data for the variables y1 and y2 with TS = 200 observations8.

Univariate Simulation: The two time series y1 and y2 are simulated from two AR(2) models. Both AR models are conditional on being stable, which is fulfilled when the moduli of the eigenvalues of the companion matrix

[ β1  β2 ]
[ 1   0  ]

are less than one. We impose a stricter condition: the moduli of the eigenvalues must be less than 0.850. This is motivated by the implemented restrictive

7 There are also other reasons to use the squared forecast errors: it penalizes outliers to a high degree and it has convenient properties under the normal distribution.
Note, however, that our forecast error distributions are t-distributed because they are estimated from a model.
8 Note that y1 and y2 depend on past observations. We have specified start values for each simulation process. To mitigate the effect of the start values we generated 250 observations of y1 and y2 and discarded the first 50, resulting in TS = 200.
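With equal weights p_m = 1/(M − B), the measurement in (26) reduces to the mean squared deviation of the stored predictive draws from the actual value. A minimal sketch, with simulated normal draws standing in for Gibbs output (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

y_actual = 1.0                                   # actual value at horizon h

# Stored predictive draws at horizon h: M - B = 1000 retained Gibbs samples.
y_draws_wide = rng.normal(1.2, 0.8, size=1000)   # wide and slightly biased
y_draws_tight = rng.normal(1.0, 0.2, size=1000)  # tight around the actual

# Expected squared forecast error, eq. (26), with p_m = 1/(M - B).
e_bar_wide = np.mean((y_draws_wide - y_actual) ** 2)
e_bar_tight = np.mean((y_draws_tight - y_actual) ** 2)

# The tighter, better-centred predictive distribution scores lower, so in a
# pairwise comparison it would be judged the more accurate one.
```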
Gibbs sampler, which requires large increases in computational time as the moduli of the eigenvalues approach one. The series are simulated by

y1,t = 0.70y1,t−1 + 0.10y1,t−2 + ε1,t,  ε1,t ∼ N(0, 2)   (27)
y2,t = 0.35y2,t−1 + 0.30y2,t−2 + ε2,t,  ε2,t ∼ N(0, 3).   (28)

The parameters in (27) are chosen so that one of the eigenvalues has a modulus close to the chosen criterion, with β1 large and β2 small. The parameters in (28) are chosen so that β1 and β2 are closer to each other than in (27); they should be positive and not close to zero. The pairs of β1 and β2 in (27) and (28) have been chosen not to be identical, and the variances of y1 and y2 are different. The correlation between y1 and y2 should on average be zero, but due to the randomness of the simulation the correlation between y1 and y2 will not be constant, which in turn will affect the estimation. The left graph in Figure 3 shows 100 correlations between y1 and y2; the average is -0.008 and the median is -0.008.

Multivariate Simulation: The two time series y1 and y2 are simulated from a bivariate VAR(2) model. The VAR model is conditional on being stable, which is fulfilled when the moduli of the eigenvalues of the companion matrix

[ B1  B2 ]
[ I2  0  ]

are less than one. We again impose the stricter condition that the moduli of the eigenvalues must be less than 0.850. The series are simulated by

y1,t = 0.70y1,t−1 + 0.10y1,t−2 + ε1,t
y2,t = −0.25y1,t−1 + 0.35y2,t−1 + 0.50y1,t−2 + 0.30y2,t−2 + ε2,t   (29)

where the variance-covariance matrix is

Σ = [ 2        0.75√6 ]
    [ 0.75√6   3      ]   (30)

so that the error terms between the equations have a correlation of 0.750 and the variances of y1 and y2 are different. Three factors have determined the choice of the parameter matrix B. First, the elements in B are chosen so that the correlation that arises from the parameters is balanced, i.e. the correlation between y1 and y2 is approximately 0.750.
Second, the first equation, y1, in (29) should not depend on the parameters of y2. The only relationship between the variables, in
the first equation in (29), is the correlation between the error terms across the two equations. Both models are correctly specified, but the AR model discards the correlated error terms across equations, while the VAR model includes irrelevant independent variables and adds uncertainty to the predictive distribution due to the correlated error terms across equations. Both these effects will make the predictive distribution of the VAR model wider. Third, the second equation, y2, in (29) should depend on the parameters of y1. Estimating y2 with an AR model will therefore result in misspecification. Due to the randomness of the simulation the correlation between y1 and y2 will not be constant, which in turn will affect the estimation. The right graph in Figure 3 shows 100 correlations between y1 and y2; the average is 0.745 and the median is 0.741.

Figure 3: The (left) graph is the histogram of correlations between y1 and y2 generated by 100 univariate simulations. The (right) graph is the histogram of correlations between y1 and y2 generated by 100 multivariate simulations.

6. Hyperparameters

As mentioned in Section 3.3, Kadiyala and Karlsson (1993, 1997) suggest a set of guidelines to standardize the restrictions on the parameters in the Normal-Wishart prior distribution. This allows the researcher to specify only a small number of hyperparameters. Informally, we can think of the inverse Wishart distribution as a multivariate version of the inverse Gamma distribution. This allows us to align the Normal-Gamma prior in the AR model and the Normal-Wishart prior in the VAR model, by using the guidelines of Kadiyala and
Karlsson (1993, 1997). We align the Normal-Gamma prior to the Normal-Wishart prior in three steps. First, we specify the hyperparameters for the prior mean variance, in (8), in the AR(2) model:

H = σ² H̄ = [ λ1²   0           ]
            [ 0     (λ1/2^λ2)²  ]

We specify σ² = 1 and the diagonal of H̄ to have the same variances for the lags as the matrix V(b) = (α − n − 1)^{-1} S̄ ⊗ H̄ has for own lags in the Normal-Wishart prior. Second, the prior degrees of freedom α, in (14), are determined by (20), which results in α = 3. Third, the prior scale parameter θ0, in (15), is determined by θ0 = (α − n − 1)λ0^{-1}σ̂²; with α = 3 this simplifies to θ0 = λ0^{-1}σ̂², where σ̂² is the OLS estimate.

Now we turn to the guidelines of Kadiyala and Karlsson (1993, 1997) for the Normal-Wishart prior. The prior mean variance V(b) = (α − n − 1)^{-1} S̄ ⊗ H̄ and the prior scale matrix S̄, with diagonal (α − n − 1)λ0^{-1}s_i², both depend on the prior degrees of freedom, α. By (20) we determine that α = 4, which results in the scale matrix

S̄ = [ λ0^{-1}s1²   0           ]
    [ 0            λ0^{-1}s2²  ]

and the prior mean variance V(b) = S̄ ⊗ H̄ is the 8 × 8 diagonal matrix with diagonal elements

λ1²,  (s1λ1/s2)²,  (λ1/2^λ2)²,  (s1λ1/(s2 2^λ2))²,  (s2λ1/s1)²,  λ1²,  (s2λ1/(s1 2^λ2))²,  (λ1/2^λ2)².

Notice that the first and third diagonal elements of this matrix are equal to the diagonal elements of the prior mean variance in the Normal-Gamma prior.
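The alignment above can be checked numerically by building H̄ and S̄ from the hyperparameters and forming the Kronecker product. The sketch below uses the reconstruction of the diagonal given above and hypothetical values for s1 and s2; the choice of H̄ is reverse-engineered so that diag(S̄ ⊗ H̄) reproduces the listed elements:

```python
import numpy as np

# Hypothetical hyperparameter values and OLS residual standard deviations.
lam0, lam1, lam2 = 1.0, 1.0, 1.0
s1, s2 = 1.4, 1.7
decay = 2.0 ** lam2                 # lag-2 shrinkage factor

# Prior scale S_bar: alpha = 4 and n = 2 give (alpha - n - 1) = 1.
S_bar = np.diag([s1 ** 2 / lam0, s2 ** 2 / lam0])

# H_bar chosen so that diag(S_bar kron H_bar) matches the 8 x 8 diagonal.
H_bar = lam0 * lam1 ** 2 * np.diag(
    [1 / s1 ** 2, 1 / s2 ** 2, 1 / (s1 * decay) ** 2, 1 / (s2 * decay) ** 2]
)

V_b = np.kron(S_bar, H_bar)
d = np.diag(V_b)
# d[0] and d[2] equal the Normal-Gamma diagonal (lam1^2, (lam1 / 2^lam2)^2).
```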
This alignment between the Normal-Gamma prior and the Normal-Wishart prior allows us to control the parameter restrictions of both priors by specifying only the prior means, β and b0, in (8) and (18) for each prior, and the hyperparameters λ0, λ1, λ2 for both priors. For the Normal-Gamma prior we set the prior mean to β = (0, 0)′ and for the Normal-Wishart prior we set the prior mean to b0 = (0, 0, 0, 0, 0, 0, 0, 0)′. The hyperparameters are set to λ0 = λ1 = λ2 = 1.

7.1 Results

To validate the measurement we want to obtain results corresponding to those in Table 1. We will attempt to verify these by examining posterior variances and the difference in the measurement between two competing models by forecast horizon. But first we describe the kind of data we analyse in this section. In Section 5 we generated the data for the univariate and multivariate models. From this we construct out-of-sample forecasts for the horizons h = 1, 2, ..., 10, leaving Tp = TS − h = 190 observations for parameter estimation. Predictive distributions are estimated and the expected value of the squared forecast error, i.e. the measurement, is calculated for ten forecast horizons. This step of simulating data and calculating the measurement is repeated one hundred times. As a result, each model produces measurement data in the form of two 100 × 10 matrices, where the first matrix is for the univariate simulated data and the second is for the multivariate simulated data.

First, we examine the posterior variance. The results are presented in Tables 3 and 5; the coefficient is the expected value of the mean posterior variance, resulting from one hundred simulations of the data.
This is motivated by two reasons: (1) to summarize the data, each simulation of the data yields a mean posterior variance and the corresponding ninety-five percent probability interval of the posterior variance; (2) deviations from Table 1 should be caused by the random correlation in the data due to simulation, so the expected value of the mean posterior variance yields a more robust conclusion. Second, we examine the difference in the measurement between two competing models at each forecast horizon. The mean of this difference determines which predictive distribution is most accurate. We represent
this by the linear regression described in (1), where the dependent variable is the difference in the measurement between the competing models by forecast horizon and the only regressor is a constant:

(ē^{AR(2)}_{t+h})_i − (ē^{VAR(2)}_{t+h})_i = ϕ + ε_i,  i = 1, ..., 100   (31)

The coefficient of the constant, ϕ, represents the mean of the difference in the measurement between the competing models by forecast horizon, which enables the following hypothesis test:

H0 : ϕ = 0
HA : ϕ ≠ 0

If ϕ < 0 then the predictive distribution of the AR(2) is more accurate than that of the VAR(2). If ϕ > 0 then the predictive distribution of the VAR(2) is more accurate than that of the AR(2). If H0 is not rejected, we cannot conclude that one predictive distribution is more accurate than the other. The estimation of this model is similar to the AR(p) model described in Section 3.1 and the Gibbs sampler described in Table 2. We set the hyperparameters as follows: β0 = 0, σ² = 1, θ0 = σ̂² and α = 3 according to (20). The results are presented in Tables 4 and 6.
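The thesis estimates (31) with the same Bayesian machinery as the AR(p) model. As a simple frequentist sketch of the same idea, one can compute the mean difference in the measurement over the 100 replications together with a 95% interval, and reject H0 when the interval does not cover zero (the simulated differences below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical differences e_AR - e_VAR over 100 simulations at one horizon;
# a negative mean favours the AR model's predictive distribution.
d = rng.normal(-0.5, 1.0, size=100)

phi_hat = d.mean()                       # estimate of phi in (31)
se = d.std(ddof=1) / np.sqrt(len(d))     # standard error of the mean
lo, hi = phi_hat - 1.96 * se, phi_hat + 1.96 * se

reject_h0 = (lo > 0.0) or (hi < 0.0)     # interval does not cover zero
```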
The increase in estimated posterior variance is somewhat larger than for y1 but still smaller than expected. The expected value of the mean posterior covariance in the VAR model is 0.006, which is close to zero, as we expect, since y1 and
y2 are simulated independently of each other. From the (left) graph in Figure 3, we conclude that the correlation between y1 and y2 is on average -0.008, but ranges approximately from -0.300 to 0.350. Overall, we conclude that the estimated posterior variances follow the expected theory in Table 1, but the increase in estimated posterior variance due to the inclusion of irrelevant variables in the VAR model is smaller than we expected.

Table 3: Expected Values of Mean Posterior Variance/Covariance.
Univariate Simulation of the Data

              AR(2)                         VAR(2)
              E[Coef.]  E[95% PI]           E[Coef.]  E[95% PI]
V(y1)         2.006     [ 1.638 ; 2.455]    2.007     [ 1.639 ; 2.456]
COV(y1,y2)    -         -                   0.006     [-0.348 ; 0.361]
V(y2)         2.962     [ 2.419 ; 3.620]    2.965     [ 2.421 ; 3.633]

E[.] is the expected value of 100 simulations of the data.

We now turn to the results for the linear regression (31) of the measurement we have formulated. From our findings about the estimated posterior variance we expect the forecast error probability distributions of the AR model to be tighter and more centred around zero than those of the VAR model, i.e. the measurement should be smaller for the AR model than for the VAR model. The expected value of the estimated posterior covariance between y1 and y2 was close to zero, implying a minimal increase of uncertainty in the predictive distribution of y1 caused by the Cholesky decomposition. Therefore, we expect to reject H0 and find that the parameter ϕ is negative.

The results are presented in Table 4. For the variable y1, we reject H0 at one of the ten forecast horizons. We find that ϕ is negative at the second forecast horizon as expected, i.e. the ninety-five percent probability interval does not cover zero. On average at this forecast horizon, the predictive distribution of the AR model is more accurate than that of the VAR model. For all other forecast horizons H0 cannot be rejected.
For the variable y2, we reject H0 at one of the ten forecast horizons. We find that ϕ is negative at the first forecast horizon as expected. On average at this forecast horizon, the predictive distribution of the AR model is more accurate than that of the VAR model. For all other forecast horizons H0 cannot be rejected.
Overall, we cannot conclude that the AR model produces more accurate predictive distributions than the VAR model for the univariate simulated data. It seems that the small increase in estimated posterior variance caused by the inclusion of irrelevant variables is not large enough to distinguish between the two models' predictive distributions.

Table 4: Regression Results. Univariate Simulation of the Data

         y1                            y2
         ϕ       95% PI                ϕ       95% PI
h=1      -0.111  [-0.377 ;  0.155]     -0.906  [-1.768 ; -0.063]
h=2      -0.527  [-0.976 ; -0.060]      0.080  [-0.536 ;  0.698]
h=3      -0.235  [-0.615 ;  0.132]     -0.559  [-1.330 ;  0.233]
h=4      -0.197  [-0.568 ;  0.166]      0.261  [-0.263 ;  0.789]
h=5      -0.085  [-0.535 ;  0.358]     -0.448  [-1.268 ;  0.358]
h=6      -0.016  [-0.384 ;  0.347]     -0.157  [-0.597 ;  0.292]
h=7       0.161  [-0.245 ;  0.562]      0.387  [-0.060 ;  0.833]
h=8       0.042  [-0.359 ;  0.449]      0.186  [-0.128 ;  0.501]
h=9       0.004  [-0.434 ;  0.446]      0.005  [-0.264 ;  0.273]
h=10      0.060  [-0.381 ;  0.488]      0.135  [-0.148 ;  0.413]

PI stands for probability interval.

7.3 Multivariate Simulation Results

We start by examining the results of the estimated posterior variance and covariance from the AR(2) and VAR(2) models for the multivariate simulations of the data. The second column in Table 1 shows the expected outcomes. The VAR model is correctly specified, while the AR model is misspecified, which is expected to increase the estimated posterior variance.

The results are presented in Table 5. For the variable y1, which is not determined by y2, we conclude that the expected value of the mean posterior variance is equal to 1.980 for the AR and 1.983 for the VAR. We conclude that the inclusion of irrelevant independent variables causes an increase in the estimated posterior variance, the same conclusion as in the univariate simulation of the data. For the variable y2, we conclude that the results follow the statistical theory well. The expected value of the mean posterior variance is 3.291 for the AR model and 3.025 for the VAR model.
The increase in estimated posterior variance due to misspecification is large. The expected value of the mean posterior covariance in the VAR model is 1.821, which is close to the covariance in equation (30), 0.75 × √6 ≈ 1.837. The choice of elements in the parameter matrix B has balanced the correlation between y1
and y2 to the same correlation specified for the error terms. From the (right) graph in Figure 3, we conclude that the correlation between y1 and y2 is on average 0.745, but ranges approximately from 0.550 to 0.850. Overall, we conclude that the estimated posterior variances follow the expected theory in Table 1. The increase in the estimated posterior variance due to misspecification is large, as expected. We reach the same conclusion on the inclusion of irrelevant variables as in the univariate simulation of the data.

Table 5: Expected Values of Mean Posterior Variance/Covariance.
Multivariate Simulation of the Data

              AR(2)                         VAR(2)
              E[Coef.]  E[95% PI]           E[Coef.]  E[95% PI]
V(y1)         1.980     [ 1.617 ; 2.420]    1.983     [ 1.619 ; 2.429]
COV(y1,y2)    -         -                   1.821     [ 1.423 ; 2.303]
V(y2)         3.291     [ 2.689 ; 4.027]    3.025     [ 2.469 ; 3.703]

E[.] is the expected value of 100 simulations of the data.

We now turn to the results for the linear regression (31) of the measurement. We expect ϕ to be negative for y1: the AR model does not suffer from misspecification and the expected value of the mean posterior covariance is large, so we expect the predictive distributions of the VAR model to be wide for y1 due to the Cholesky decomposition. From the findings about the estimated posterior variance for y2, we expect to reject H0 and find that the parameter ϕ is positive.

The results are presented in Table 6. For the variable y1, we reject H0 at all forecast horizons except the first. We find that ϕ is negative for the second to tenth forecast horizons, as we expected due to the Cholesky decomposition. On average, the forecast accuracy of the AR relative to the VAR increases over the forecast horizons. For the variable y2, we reject H0 for three out of ten forecast horizons. We find that ϕ is negative for the eighth to tenth forecast horizons, which is opposite to what we expected.
We also conclude that the magnitude of the differences is large for the ninth and tenth forecast horizons.
Table 6: Regression Results. Multivariate Simulation of the Data

         y1                            y2
         ϕ       95% PI                ϕ       95% PI
h=1      -0.029  [-0.159 ;  0.109]      0.272  [-0.050 ;  0.604]
h=2      -0.251  [-0.467 ; -0.038]      0.050  [-0.257 ;  0.349]
h=3      -0.321  [-0.553 ; -0.087]      0.536  [ 0.145 ;  0.938]
h=4      -0.289  [-0.563 ; -0.030]      0.172  [-0.192 ;  0.539]
h=5      -0.219  [-0.470 ;  0.039]      0.334  [-0.104 ;  0.792]
h=6      -0.338  [-0.597 ; -0.068]     -0.257  [-0.640 ;  0.125]
h=7      -0.308  [-0.614 ; -0.005]     -0.211  [-0.614 ;  0.203]
h=8      -0.333  [-0.598 ; -0.062]     -0.492  [-0.910 ; -0.066]
h=9      -0.483  [-0.755 ; -0.209]     -0.701  [-1.107 ; -0.293]
h=10     -0.466  [-0.718 ; -0.210]     -0.905  [-1.289 ; -0.523]

PI stands for probability interval.

Table 7 shows the same analysis as before, but this time we have changed the order of the variables when estimating the VAR model. For the variable y1 we reject H0 at the second, fourth, fifth and ninth forecast horizons. We find that ϕ is negative for the second forecast horizon and positive for the fourth, fifth and ninth forecast horizons. For the variable y2 we reject H0 for five out of the ten forecast horizons. We find that ϕ is negative at the first, sixth, eighth, ninth and tenth forecast horizons, as expected due to the Cholesky decomposition. Both these results show the strong effect of the Cholesky decomposition. The variable y2 handles the extra uncertainty added to the predictive distribution better than y1. This is most likely due to the misspecification of the AR model for the y2 variable.

Table 7: Changed Order Of Variables in VAR.
Multivariate Simulation of the Data

         y2                            y1
         ϕ       95% PI                ϕ       95% PI
h=1      -2.054  [-3.041 ; -1.001]     -0.940  [-1.956 ;  0.022]
h=2      -0.165  [-0.946 ;  0.614]     -1.506  [-2.507 ; -0.480]
h=3      -0.415  [-0.836 ;  0.018]      0.132  [-0.282 ;  0.549]
h=4      -0.310  [-0.698 ;  0.086]      0.295  [ 0.077 ;  0.514]
h=5      -0.307  [-1.020 ;  0.409]      0.599  [ 0.088 ;  1.119]
h=6      -0.680  [-1.267 ; -0.089]      0.150  [-0.387 ;  0.666]
h=7      -0.433  [-0.924 ;  0.055]      0.388  [-0.008 ;  0.775]
h=8      -0.569  [-0.966 ; -0.168]      0.158  [-0.150 ;  0.463]
h=9      -0.829  [-1.290 ; -0.367]      0.388  [ 0.049 ;  0.726]
h=10     -0.936  [-1.369 ; -0.512]      0.178  [-0.238 ;  0.579]

PI stands for probability interval.
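The ordering effect seen in Table 7 follows from the Cholesky factorization not being invariant to permutations of the variables: swapping y1 and y2 permutes Σ, and the first-ordered variable's shock carries its full variance while the later variable only carries its residual variance. A small sketch using the simulation covariance from (30):

```python
import numpy as np

# Error covariance from eq. (30).
sigma = np.array([[2.0, 0.75 * np.sqrt(6.0)],
                  [0.75 * np.sqrt(6.0), 3.0]])

# Permutation that swaps the order of y1 and y2.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
sigma_swapped = P @ sigma @ P.T

L_orig = np.linalg.cholesky(sigma)          # factor under ordering (y1, y2)
L_swap = np.linalg.cholesky(sigma_swapped)  # factor under ordering (y2, y1)

# The first diagonal element squared equals the full variance of whichever
# variable is ordered first: 2 for y1 in the original order, 3 for y2 swapped.
first_var_orig = L_orig[0, 0] ** 2
first_var_swap = L_swap[0, 0] ** 2
```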
Overall, we can conclude that the AR model produces more accurate predictive distributions than the VAR model for the multivariate simulated data. This was expected for y1, since neither the AR nor the VAR was misspecified and the Cholesky decomposition widens the VAR predictive distribution. For y2, however, it seems that the misspecification is too small to have an effect on the predictive distribution; instead the VAR is outperformed by the AR at longer horizons.

Summing up our results: from both the univariate and multivariate results we conclude that the estimated posterior variances follow the statistical theory of Table 1, but the magnitudes of these effects are smaller than expected. When analysing the predictive distributions by the measurement for the univariate simulation of the data, we find it difficult to verify the increase in estimated posterior variance in the predictive distributions that the statistical theory implies. Only at two horizons are we able to distinguish that the AR models outperform the VAR models. This difficulty arises because the increase in the estimated posterior variance is too small to separate the predictive distributions. From the multivariate simulation of the data we conclude that there is a large effect of the Cholesky decomposition on the first equation of the estimated VAR model. We find that the VAR model produces inferior predictive distributions of y1 for all horizons. This result is not as strong for y2, where five of the horizons produce inferior predictive distributions. We also conclude that this effect increases with the forecast horizon.

We conclude that autoregressive and vector autoregressive models with a lag length of two have difficulty producing dissimilar predictive distributions. This has made it difficult to assess the accuracy of the measurement, but by examining the effect of the Cholesky decomposition in VAR models we are able to validate the accuracy of the measurement.
8. Conclusion

We conduct simulation studies to formulate a measurement that evaluates the forecast accuracy of predictive distributions. We use Bayesian methods to obtain posterior inference and predictive distributions for the autoregressive model and the vector autoregressive model. Through out-of-sample forecasts and predictive distributions we are able to evaluate the full distribution of forecast errors. We are also able to validate the accuracy of the measurement, especially by allowing for correlated error terms across equations in the vector autoregressive model.

We formulate a measurement that uses out-of-sample forecasts and predictive distributions to evaluate the full forecast error probability distribution by forecast horizon. The measurement can be used as a forecast evaluation technique for single forecasts or to calibrate forecast models. Furthermore, we recommend that practitioners use the measurement alongside several other forecast evaluation techniques.

For further research we recommend that the measurement be evaluated with models that are not from the same family, to ensure differences in the predictive distributions; with models that treat conditional heteroskedasticity differently; in case studies of outliers such as financial crises; and against a wide range of forecast evaluation techniques.
Appendix

A1. Gibbs Sampler

To explain the intuition behind the Gibbs sampler we borrow a summary put forward by Ciccarelli and Rebucci (2003), with the mathematical notation adapted to Section 3.1. In many applications the analytical integration of p(β, σ² | y_t) may be difficult or even impossible to implement. This problem, however, can often be solved by using numerical integration based on Monte Carlo simulation methods. One particular method used in the literature to solve estimation problems similar to those discussed in the paper is the Gibbs sampler. The Gibbs sampler is a recursive Monte Carlo method which only requires that the full conditional posterior distributions of the parameters of interest, p(β | σ², y_t) and p(σ² | β, y_t), are known. The Gibbs sampler starts from an arbitrary value β(0) or (σ²)(0), and samples alternately from the density of each element of the parameter vector, conditional on the value of the other element sampled in the previous iteration and on the data. Thus, the Gibbs sampler samples recursively as follows:

β(1) from p(β | (σ²)(0), y_t)
(σ²)(1) from p(σ² | β(1), y_t)
β(2) from p(β | (σ²)(1), y_t)
(σ²)(2) from p(σ² | β(2), y_t)
...
β(m) from p(β | (σ²)(m−1), y_t)
(σ²)(m) from p(σ² | β(m), y_t)

and so on. The vectors ϑ(m) = (β(m), (σ²)(m)) form a Markov chain and, for a sufficiently large number of iterations (say m ≥ M), can be regarded as draws from the true joint posterior distribution. Given a large sample of draws from this limiting distribution, any posterior moment or marginal density of interest can then be estimated consistently with the corresponding sample average.
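The recursion above can be written in a few lines. The sketch below implements a two-block Gibbs sampler for a conjugate normal linear model, drawing β | σ² from a normal and σ² | β from an inverse Gamma; the prior values and data are hypothetical and the code is illustrative, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated regression data: y = X beta + eps.
T = 200
X = rng.normal(size=(T, 1))
beta_true, sigma2_true = 0.6, 1.5
y = X @ np.array([beta_true]) + rng.normal(0.0, np.sqrt(sigma2_true), size=T)

# Hypothetical conjugate prior: beta ~ N(b0, H), sigma2 ~ iG(a0/2, t0/2).
b0, H = np.zeros(1), np.eye(1)
a0, t0 = 3.0, 1.0

M_iter, burn = 3000, 500
beta_draw, sig2_draw = np.zeros(1), 1.0
betas, sig2s = [], []

for m in range(M_iter):
    # Block 1: beta | sigma2, y  ~  Normal.
    V = np.linalg.inv(np.linalg.inv(H) + (X.T @ X) / sig2_draw)
    Mn = V @ (np.linalg.inv(H) @ b0 + (X.T @ y) / sig2_draw)
    beta_draw = rng.multivariate_normal(Mn, V)
    # Block 2: sigma2 | beta, y  ~  inverse Gamma.
    resid = y - X @ beta_draw
    shape = (a0 + T) / 2.0
    scale = (t0 + resid @ resid) / 2.0
    sig2_draw = scale / rng.gamma(shape)   # inverse-Gamma draw via a Gamma draw
    if m >= burn:                          # keep draws after the burn-in
        betas.append(beta_draw[0])
        sig2s.append(sig2_draw)

beta_hat, sig2_hat = np.mean(betas), np.mean(sig2s)
```

With conjugate conditionals both blocks are exact draws, so the retained samples approximate the joint posterior and their averages approximate the posterior means.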
References

Luc Bauwens, Gary Koop, Dimitris Korobilis, and Jeroen V.K. Rombouts. The contribution of structural break models to forecasting macroeconomic series. Journal of Applied Econometrics, 2014.

Andrew P. Blake and Haroon Mumtaz. Applied Bayesian econometrics for central bankers. Number 4 in Technical Books. Centre for Central Banking Studies, Bank of England, 2012. URL http://ideas.repec.org/b/ccb/tbooks/4.html.

Matteo Ciccarelli and Alessandro Rebucci. BVARs: A Survey of the Recent Literature with an Application to the European Monetary System. Rivista di Politica Economica, 93(5):47-112, September 2003. URL http://ideas.repec.org/a/rpo/ripoec/v93y2003i5p47-112.html.

A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari, and D.B. Rubin. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2013. ISBN 9781439840955. URL http://books.google.se/books?id=ZXL6AQAAQBAJ.

John Geweke and Gianni Amisano. Comparing and evaluating Bayesian predictive distributions of asset returns. Working Paper Series 0969, European Central Bank, November 2008. URL http://ideas.repec.org/p/ecb/ecbwps/20080969.html.

John Geweke and Gianni Amisano. Prediction using several macroeconomic models, 2012.

K. Rao Kadiyala and Sune Karlsson. Forecasting with generalized Bayesian vector auto regressions. Journal of Forecasting, 12(3-4):365-378, 1993.

K. Rao Kadiyala and Sune Karlsson. Numerical Methods for Estimation and Inference in Bayesian VAR-Models. Journal of Applied Econometrics, 12(2):99-132, March-April 1997. URL http://ideas.repec.org/a/jae/japmet/v12y1997i2p99-132.html.

Sune Karlsson. Chapter 15 - Forecasting with Bayesian vector autoregression.
In Graham Elliott and Allan Timmermann, editors, Handbook of Economic Forecasting, volume 2, Part B, pages 791-897. Elsevier, 2013. doi: 10.1016/B978-0-444-62731-5.00015-4. URL http://www.sciencedirect.com/science/article/pii/B9780444627315000154.
Gary Koop and Dimitris Korobilis. Bayesian multivariate time series methods for empirical macroeconomics. Now Publishers Inc, 2010.

Dongchu Sun and Shawn Ni. Bayesian analysis of vector-autoregressive models with noninformative priors. Journal of Statistical Planning and Inference, 121(2):291-309, 2004.

Anders Warne, Günter Coenen, and Kai Christoffel. Predictive likelihood comparisons with DSGE and DSGE-VAR models. Working Paper Series 1536, European Central Bank, April 2013. URL http://ideas.repec.org/p/ecb/ecbwps/20131536.html.