Switching Models Workbook
Thomas A. Doan
Estima
Preface vi
2 Fluctuation Tests 8
2.1 Simple Fluctuation Test . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Fluctuation Test for GARCH . . . . . . . . . . . . . . . . . . . . . . 14
3 Parametric Tests 15
3.1 LM Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Full Coefficient Vector . . . . . . . . . . . . . . . . . . . . . 15
3.1.2 Outliers and Shifts . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Break Analysis for GMM . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 ARIMA Model with Outlier Handling . . . . . . . . . . . . . . . . 26
3.3 GARCH Model with Outlier Handling . . . . . . . . . . . . . . . . 27
4 TAR Models 29
4.1 Estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Arranged Autoregression Test . . . . . . . . . . . . . . . . . 31
4.2.2 Fixed Regressor Bootstrap . . . . . . . . . . . . . . . . . . . 32
4.3 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 Generalized Impulse Responses . . . . . . . . . . . . . . . . . . 34
4.1 TAR Model for Unemployment . . . . . . . . . . . . . . . . . . . . 37
4.2 TAR Model for Interest Rate Spread . . . . . . . . . . . . . . . . . 39
5 Threshold VAR/Cointegration 43
5.1 Threshold Error Correction . . . . . . . . . . . . . . . . . . . . 43
5.2 Threshold VAR . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Threshold Cointegration . . . . . . . . . . . . . . . . . . . . . 48
5.1 Threshold Error Correction Model . . . . . . . . . . . . . . . . . . 50
5.2 Threshold Error Correction Model: Forecasting . . . . . . . . . . . 52
5.3 Threshold VAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6 STAR Models 58
6.1 Testing for STAR . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1 LSTAR Model: Testing and Estimation . . . . . . . . . . . . . . . 62
6.2 LSTAR Model: Impulse Responses . . . . . . . . . . . . . . . . . . 63
7 Mixture Models 66
7.1 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . 68
7.2 EM Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.3 Bayesian MCMC . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.3.1 Label Switching . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.1 Mixture Model-Maximum Likelihood . . . . . . . . . . . . . . . . . 73
7.2 Mixture Model-EM . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.3 Mixture Model-MCMC . . . . . . . . . . . . . . . . . . . . . . . . . 75
11.1.4 EM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
11.1.5 MCMC (Gibbs Sampling) . . . . . . . . . . . . . . . . . . . . 135
11.1 Hamilton Model: ML Estimation . . . . . . . . . . . . . . . . . . . 138
11.2 Hamilton Model: EM Estimation . . . . . . . . . . . . . . . . . . . 139
11.3 Hamilton Model: MCMC Estimation . . . . . . . . . . . . . . . . . 140
Bibliography 226
Index 229
Preface
This workbook is based upon the content of the RATS e-course on Switching
Models and Structural Breaks, offered in fall of 2010. It covers a broad range
of topics for models with various types of breaks or regime shifts.
In some cases, models with breaks are used as diagnostics for models with
fixed coefficients. If the fixed coefficient model is adequate, we would expect to
reject a similar model that allows for breaks, either in the coefficients or in the
variances. For these uses, the model with the breaks isn't being put forward as
a model of reality, but simply as an alternative for testing purposes. Chapters 2
and 3 provide several examples of these, with Chapter 2 looking at fluctuation
tests and Chapter 3 examining parametric tests.
Increasingly, however, models with breaks are being put forward as a descrip-
tion of the process itself. There are two broad classes of such models: those
with observable regimes and those with hidden regimes. Models with observ-
able criteria for classifying regimes are covered in Chapters 4 (Threshold Au-
toregressions), 5 (Threshold VAR and Cointegration) and 6 (Smooth Threshold
Models). In all these models, there is a threshold trigger which causes a shift
of the process from one regime to another, typically when an observable se-
ries moves across an (unknown) boundary. There are often strong economic
arguments for such models, generally based upon frictions (such as transactions
costs) which must be overcome before an action is taken. Threshold models
are generally used as an alternative to fixed coefficient autoregressions and
VARs. As such, the response of the system to shocks is one of the more useful
ways to examine the behavior of the model. However, as the models are non-
linear, there is no longer a single impulse response function which adequately
summarizes this. Instead, we look at ways to compute two main alternatives:
the eventual forecast function, and the generalized impulse response function
(GIRF).
The remaining seven chapters cover models with hidden regimes, that is mod-
els where there is no observable criterion which determines to which regime
a data point belongs. Instead, we have a model which describes the behavior
of the observables in each regime, and a second model which describes the
(unconditional) probabilities of the regimes, which we combine using Bayes
rule to infer the posterior probability of the regimes. Chapter 7 starts off
with the simple case of time independence of the regimes, while the remain-
der use the (more realistic) assumption of Markov switching. The sequence
of chapters 8 to 11 looks at increasingly complex models based upon linear re-
gressions, from univariate, to systems, to VARs with complicated restrictions.
All of these demonstrate the three main methods for estimating these types of
models: maximum likelihood, EM and Bayesian MCMC.
The final two chapters look at Markov switching in models where exact likeli-
hoods can't be computed, requiring approximations to the likelihood. Chapter
12 examines state-space models with Markov switching, while Chapter 13 is
devoted to switching ARCH and GARCH models.
We use bold-faced Courier (for instance, DLM) for any use of RATS instruction
or procedure names within the main text, and non-bolded Courier (%SCALAR)
for any other pieces of code, such as function and variable names. For easy
reference, the full text of each example is included. The running examples are
also available as separate files.
Chapter 1
Estimation with Breaks at Known Locations
While in practice it isn't common to know precisely where breaks are, the basic
building block for finding an unknown break point is the analysis with a known
break, since we will generally need to estimate with different test values for the
break point. We divide this into static and dynamic models, since the two
have quite different properties.
Outliers
The simplest break in a model is a single-period outlier, say at t_0. The conventional way to deal with an outlier at a known location is to dummy out the data point, by adding a dummy variable for that point only. This is equivalent to running weighted least squares with the variance at t_0 being infinite. Since it's fairly rare that we're sure that a data point requires such extreme handling, we'll look at other, more flexible ways to handle this later in the course. We can create the required dummy with:
set dpoint = t==t0
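As a minimal sketch of how the dummy is then used (the series Y and X below are placeholders, not from the example files), the dummied-out regression is just the original regression with DPOINT added to the regressor list:

* Hypothetical regression: adding DPOINT effectively removes entry T0 from the fit
linreg y
# constant x dpoint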
Broken Trends
If X includes constant and trend, a broken trend function is frequently of in-
terest. This can take one of several forms, as shown in Figure 1.1.
If the break point is at the entry T0, the two building blocks for the three forms
of break are created with:
set dlevel = t>t0
set dtrend = %max(t-t0,0)
The crash model keeps the same trend rate, but has an immediate and per-
manent level shift (for variables like GDP, generally a shift down, hence the
name). It is obtained if you add the DLEVEL alone. The join model has a change
in the growth rate, but no immediate change in the level. You get it if you add
[Figure 1.1: Base, Crash, Break, and Joined trend functions]
DTREND alone. The break model has a change in both level and rate: in effect,
you have two completely different trend functions before and after. You get this
by adding both DLEVEL and DTREND.
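As a sketch of how these building blocks are used (Y here is a placeholder dependent series, not from the text), the full break model is just a regression on both dummies along with a constant and trend:

* Hypothetical example: full break model (level and growth rate both change).
* Drop DTREND for the crash model, or DLEVEL for the joined model.
set trend = t
linreg y
# constant trend dlevel dtrend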
The two are related by β(1) = β and β(2) = β + γ. Each of these has some uses for which it is most convenient.
If the residuals aren't serially correlated, (1.2) can be estimated by running separate regressions across the two subsamples. The following, for instance, is from Greene's 5th edition:
linreg loggpop 1960:1 1973:1
# constant logypop logpg logpnc logpuc trend
compute rsspre=%rss
linreg loggpop 1974:01 1995:01
# constant logypop logpg logpnc logpuc trend
compute rsspost=%rss
If you don't need the coefficients themselves, an even more convenient way to
do this is to use the SWEEP instruction (see page 5 for more). The calculations
from Greene can be done with
1 We'll use the notation I_{condition} as a 1-0 indicator for the condition described.
sweep(group=(t<=1973:1),series=resv)
# loggpop
# constant logypop logpg logpnc logpuc trend
Our rewrite of this uses the %EQNXVECTOR function to generate the full list of
dummied regressors (see page 6 for more). For a break at a specific time period
(called TIME), this is done with
compute k=%nreg
dec vect[series] dummies(k)
*
do i=1,k
set dummies(i) = %eqnxvector(baseeqn,t)(i)*(t>time)
end do i
The expression on the right side of the SET first extracts the time T vector
of variables, takes element I out of it, and multiplies that by the relational
T>TIME. The DUMMIES set of series is used in a regression with the original set
of variables, and an exclusion restriction is tested on the dummies, to do a HAC
test for a structural break:
linreg(noprint,lags=7,lwindow=newey) drpoj lower upper
# constant fdd{0 to 18} dummies
exclude(noprint)
# dummies
The shift coefficient, rather than shifting the mean of the process as it would in a static model, now shifts its trend rate.
There are two basic ways of incorporating shifts, which have been dubbed Additive Outliers (AO) and Innovational Outliers (IO), though they aren't always applied just to outlier handling. The simpler of the two to handle (at least in linear regression models) is the IO, in which shifts are done by adding the proper set of dummies to the regressor list. They are called innovational because they are equivalent to adjusting the mean of the u_t process. For instance, we could write

y_t = y_{t-1} + u_t    (1.3)

where u_t has mean μ for t ≤ t_0 and mean μ + γ for t > t_0. The effect of an IO is usually felt gradually, as the change to the shock process works into the y process itself.
The AO are more realistic, but are more complicated: they directly shift the mean of the y process itself. The drifting random walk y_t = y_{t-1} + μ + u_t will look something like the Base series in Figure 1.1 with added noise. How would we create the various types of interventions? The easiest way to look at this is to rewrite the process as y_t = y_0 + μt + z_t, where z_t = z_{t-1} + u_t. This breaks the series down into the systematic trend and the noise.
The crash model needs the trend part to take the form y_0 + μt + γ I_{t>t_0}. We could just go ahead and estimate the parameters using

y_t = y_0 + μt + γ I_{t>t_0} + z_t

treating z_t as the error process. That would give consistent, but highly inefficient, estimates of μ and γ, with a Durbin-Watson hovering near zero. Instead, we can difference to eliminate the unit root in z_t, producing

y_t - y_{t-1} = μ(t - (t-1)) + γ(I_{t>t_0} - I_{t-1>t_0}) + u_t

which reduces to

y_t - y_{t-1} = μ + γ I_{t=t_0+1} + u_t

so to get the permanent shift in the process, we need a one-shot change at the intervention point.²
The join model has a trend of

y_0 + μt + δ max(t - t_0, 0)

Differencing that produces μ + δ I_{t>t_0}. The full break model needs both intervention terms, so the trend is

y_0 + μt + γ I_{t>t_0} + δ max(t - t_0, 0)
2 Standard practice is for level shifts to apply after the quoted break date, hence the dummy being for t_0 + 1.
which differences to

μ + γ I_{t=t_0+1} + δ I_{t>t_0}

so we need both the spike dummy and the level shift dummy to handle both effects.
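Putting that last result into practice, a minimal sketch (with Y a placeholder series and a hypothetical break entry) of the IO regression in differences is:

* Hypothetical sketch: spike dummy at t0+1 for the level shift, level-shift
* dummy after t0 for the growth rate change
compute t0=1980:1
set dy    = y-y{1}
set spike = t==t0+1
set shift = t>t0
linreg dy
# constant spike shift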
Now suppose we have a stationary model (AR(1) for simplicity) where we want to incorporate a once-and-for-all change in the process mean. Let's use the same technique of splitting the representation into mean and noise models:

y_t = μ + γ I_{t>t_0} + z_t

where now z_t = φ z_{t-1} + u_t. If we quasi-difference the equation to eliminate the z_t, we get

y_t - φ y_{t-1} = μ(1 - φ) + γ(I_{t>t_0} - φ I_{t-1>t_0}) + u_t

The γ term is no longer as simple as it was when we were first differencing. It's γ(1 - φ) for t > t_0 + 1 (terms which are zero when φ = 1), and γ when t = t_0 + 1. There is no way to unravel these (plus the intercept) to give just two terms that we can use to estimate the model by (linear) least squares; in other words, unlike the IO, the AO interventions don't translate into a model that can be estimated by simple techniques.
With constants and simple polynomial trends for the deterministic parts, there is no (effective) difference between estimating a reduced form stationary AR(p) process using a LINREG, and doing the same thing with the mean + noise arrangement used by BOXJENK (estimated with conditional least squares): the AR parameters will match exactly, and the deterministic coefficients can map to each other. That works because the filtered versions of the polynomials in t are still just polynomials of the same order. Once you put in any intervention terms, this is no longer the case: the filtered intervention terms generally produce two (or more) terms that aren't already included. BOXJENK is designed to do the AO style of intervention, so if you need to estimate this type of intervention model, it's the instruction to use. We'll look more at BOXJENK in Chapter 3.
sweep(group=(t<=1973:1),series=resv)
# loggpop
# constant logypop logpg logpnc logpuc trend
%EQNXVECTOR function
%EQNXVECTOR(eqn,t) returns the VECTOR of explanatory variables for equa-
tion eqn at entry t. You'll see this used quite a bit in the various procedures used for break analysis since it's often necessary to look at an isolated data
point. An eqn of 0 can be used to mean the last regression run.
In this chapter, this is used in the following:
compute k=%nreg
dec vect[series] dummies(k)
*
do i=1,k
set dummies(i) = %eqnxvector(baseeqn,t)(i)*(t>time)
end do i
The expression on the right side of the SET first extracts the time T vector
of variables, takes element I out of it, and multiplies that by the relational
T>TIME. The DUMMIES set of series is used in a regression with the original set
of variables, and an exclusion restriction is tested on the dummies, to do a HAC
test for a structural break:
There are several related functions which can also be handy. In all cases, eqn is
either an equation name, or 0 for the last estimated (linear) regression. These
also evaluate at a specific entry T.
Chapter 2
Fluctuation Tests
The question is what function(s) of these will be useful for detecting instability in the process.
The K-S maximum gap is one possibility, which would likely work for the break in level, but wouldn't work well for the break in variance. The proposal from Nyblom (1989) is to use the sample equivalent of

∫_0^1 B(x)^2 dx    (2.3)

This should be quite good at detecting instability in the mean, since B(x)^2 will be unnaturally large due to the drift in B(x). It's not as useful for detecting breaks in the variance (in this form), since the higher and lower zones will tend to roughly cancel. A function of higher powers than 2 will be required for that.
The same basic idea can be applied to problems much broader than an i.i.d.
univariate process. Suppose that

Σ_{t=1}^{T} f(y_t | X_t, θ)    (2.4)

is the log likelihood for y given X and the parameters θ. Then, at the maximizing value θ̂, the sequence of derivatives

g_t ≡ ∂f(y_t | X_t, θ)/∂θ evaluated at θ̂    (2.5)

has to sum to zero over the sample for each component. Under standard
regularity conditions, and with the assumption that the underlying model is
stable, we have the result that
(1/√T) Σ_{t=1}^{T} g_t →_d N(0, I^{-1})    (2.6)

where I is 1/T times the information matrix. For purposes of deriving a Brownian motion, the sample g_t can be treated as if they were individually mean zero and variance I^{-1}. This is the principal result of the Nyblom paper. The individual components of the gradient can be tested as above, and you can also apply a joint test, as the whole vector of gradients becomes a multi-dimensional Brownian Bridge with the indicated covariance matrix. If we have an estimate V of I^{-1}, the fluctuation statistic for component i of the gradient is computed as

T^{-2} v_{ii}^{-1} Σ_{t=1}^{T} ( Σ_{s=1}^{t} g_{s,i} )^2    (2.7)
The trickiest part of working out these formulas often is getting the power of T correct. If we look at the univariate case, the Brownian Bridge construction is

B(t/T) = T^{-1/2} v_{ii}^{-1/2} Σ_{s=1}^{t} g_{s,i}    (2.9)
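A quick way to see where the powers of T come from (a small check using the definitions above, not part of the original text) is to plug (2.9) into the sample analogue of (2.3):

(1/T) Σ_{t=1}^{T} B(t/T)^2 = (1/T) Σ_{t=1}^{T} T^{-1} v_{ii}^{-1} ( Σ_{s=1}^{t} g_{s,i} )^2 = T^{-2} v_{ii}^{-1} Σ_{t=1}^{T} ( Σ_{s=1}^{t} g_{s,i} )^2

which is exactly (2.7).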
The results are in Table 2.1. This gives a joint test for all four coefficients, and
separate tests on each coefficient. It would be surprising for a GARCH model to
give us a problem with the coefficients of the variance model (here coefficients
2-4) simply because if there's a failure, it isn't likely to be systematic in the time direction. The one coefficient that seems to have a stability problem is the
process mean, which is much less surprising.
The @FLUX procedure uses the BHHH estimate for the information matrix, since
that can be computed from the gradients. The calculations themselves are
relatively simple:
1 This uses a saddlepoint approximation, which, in effect, expands the distribution function at infinity, so it's accurate in the tails, but not as accurate closer to 0.
Coefficient    Statistic       Signif
1              0.71267640      0.01
2              0.35762640      0.09
3              0.13834512      0.41
4              0.18261092      0.29
The CMOM instruction computes the cross product matrix of the gradients, which
will be the estimate for the sample information matrix. The calculation for
VINV gets the factors of T correct for computing the joint statistic (2.8). The vec-
tor S is used to hold the partial sums of the entire vector of gradients. Through
the time loop, %HSTATS has the individual component sums of squares for the
partial sums; that gets scaled at the end. (We can't use elements of VINV for those, since that's from inverting the joint matrix, and we need the inverse just of the diagonal elements.) %HJOINT takes care of the running sum for the joint
test.
Independently (and apparently unaware) of Nyblom, Hansen (1992) derived
similar statistics for partial sums of moments for a linear regression. As with
the gradients for the likelihood, we have (under certain regularity conditions),
the result that, for the linear regression y_t = X_t β + u_t,

(1/√T) Σ_{t=1}^{T} X_t' u_t →_d N(0, E(X' u^2 X))    (2.10)

The least squares solution for β forces the sum of the sample moments to zero:

Σ_{t=1}^{T} X_t' u_t = 0    (2.11)
Basically, you just use the same setup as a LINREG, but with @STABTEST as
the command instead. The results for that are in Table 2.2, which shows an
overwhelming rejection of stability.2
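The regression behind Table 2.2 isn't reproduced in this excerpt; as a generic sketch (with placeholder series names), the setup really is just the LINREG layout with the procedure name swapped in:

* Hypothetical example: Hansen stability test for a regression of Y on a
* constant and its first lag
@stabtest y
# constant y{1}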
Since extracting the mean is just regression on a constant, we can get a better
test for stability of the variance in the initial example. The results are in Table
2.3. As we would expect, that doesn't find instability in the mean (which was zero throughout), but overwhelmingly rejects stability in the variance. This is
part of example 2.1:
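The relevant lines from example 2.1 aren't reproduced here; the idea, with Y standing in for the simulated series, is just a constant-only regression, so the coefficient test checks the mean and the variance test checks the residual variance:

* Sketch: stability test with only a constant in the regression
@stabtest y
# constant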
2 which is corroborated by the other tests in the example
Chapter 3
Parametric Tests
A parametric test for a specific alternative can usually be done quite easily ei-
ther as a likelihood ratio test or a Wald test. However, if a search over possible
breaks is required, it could be quite time-consuming to set up and estimate a
possibly non-linear model for every possible break. Instead, looking at a se-
quence of LM tests is likely to be a better choice.
3.1 LM Tests
3.1.1 Full Coefficient Vector
We are testing for a break at T_0. Assume first that we're doing likelihood-based estimation. If we can write the log likelihood as:

l(β) = Σ_{t=1}^{T} f(β | Y_t, X_t)    (3.1)

then the log likelihood with coefficient vector β + γ after the break point is:

Σ_{t=1}^{T_0} f(β | Y_t, X_t) + Σ_{t=T_0+1}^{T} f(β + γ | Y_t, X_t)    (3.2)
For the simplest, and probably most important, case of linear least squares, we have u_t = Y_t - X_t β, which reduces the condition (3.5) to

Σ_{t=T_0+1}^{T} X_t' u_t = 0    (3.6)
where we're using the shorthand u_t = u(β | Y_t, X_t). As with the fluctuation test, the question is whether the moment conditions which hold (by definition) over the full sample also hold in the partial samples.
In sample, neither (3.3) nor (3.5) will be zero; the LM test is whether they are close to zero. If we use the general notation of g_t for the summands in (3.3) or (3.5), then we have the following under the null of no break with everything evaluated at β̂:
Σ_{t=1}^{T} g_t = 0    (3.7)

Σ_{t=T_0+1}^{T} g_t ≈ 0    (3.8)
In order to convert the partial sample sum into a test statistic, we need a vari-
ance for it. The obvious choice is the (properly scaled) covariance matrix that
we compute for the whole sample under the null. In computing an LM test, we
need to take into account the covariance between the derivatives with respect
to the original set of coefficients and the test set, that is, the two blocks in (3.8).
If we use the full sample covariance matrix, this gives us a very simple form
for this. If we assume that

(1/√T) Σ_{t=1}^{T} g_t →_d N(0, J)    (3.9)

then the covariance matrix of the full (top) and partial (bottom) sums is (approximately):

[ T·J          (T-T_0)·J ]     [ T        T-T_0 ]
[ (T-T_0)·J    (T-T_0)·J ]  =  [ T-T_0    T-T_0 ] ⊗ J    (3.10)
The inverse is

[  1/T_0    -1/T_0           ]
[ -1/T_0    T/(T_0(T-T_0))   ] ⊗ J^{-1}    (3.11)
The only part of this that matters is the bottom corner, since the full sums are zero at β̂, so the LM test statistic is:

LM = (T/(T_0(T-T_0))) g' J^{-1} g    (3.12)

with g the partial sum from (3.8). This is basically formula (4.4) in Andrews (1993).
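To see why only the bottom corner matters (a quick check based on the pieces above, not part of the original text), stack the full-sample sum, which is exactly zero at the estimates, on top of the partial sum g; the quadratic form with (3.11) then only picks up the bottom-right block:

[0' g'] ( [ 1/T_0  -1/T_0 ; -1/T_0  T/(T_0(T-T_0)) ] ⊗ J^{-1} ) [0 ; g] = (T/(T_0(T-T_0))) g' J^{-1} g

which is exactly (3.12).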
The full-sample LINREG gives the residuals (as %RESIDS), the inverse of the
full sample cross product matrix of the X as %XX and the sum of squared resid-
uals for the base model as %RSS. We need running sums of both X 0 X (DXX in
the code) and X 0 u (DXE), which are most easily done using %EQNXVECTOR to
extract the X vector at each entry, then adding the proper functions on to the
previous values. RSSDIFF is calculated as the difference between the sum of
squared residuals with and without the break dummies. If we wanted to do
a standard LM test, we could compute that as RSSDIFF/%SEESQ. What this² computes is an F times its numerator degrees of freedom, basically the LM test but with the sample variance computed under the alternative: the sum of squared residuals under the alternative is %RSS-RSSDIFF, and the degrees of freedom is the original degrees of freedom less the extra %NREG regressors added under the alternative. It's kept in the chi-squared form since the tabled critical values are done that way.
Note that this only computes the test statistic for the zone of entries between piStart and piEnd. It's standard procedure to limit the test only to a central set of entries excluding a certain percentage on each end: the test statistic can't even be computed for the first and last k entries, and is of fairly limited usefulness when only a small percentage of data points are either before or after the break. However, even though the test statistics are only computed for
a subset of entries, the accumulation of the subsample cross product and test
vector must start at the beginning of the sample.
The procedures @APBREAKTEST and @REGHBREAK generally do the same thing.
The principal difference is that @APBREAKTEST is invoked much like a LINREG, while @REGHBREAK is a post-processor: you run the LINREG you want first, then @REGHBREAK. The example file ONEBREAK.RPF uses @APBREAKTEST with
@apbreaktest(graph) drpoj 1950:1 2000:12
# constant fdd{0 to 18}
There are several other differences. One is that @APBREAKTEST computes ap-
proximate p-values using formulas from Hansen (1997). The original Andrews-
Ploberger paper has several pages of lookup tables for the critical values, which
depend upon the trimming percentage and number of coefficients; Hansen worked out a set of transformations which allowed the p-value to be approx-
imated using a chi-square. @REGHBREAK allows you to compute approximate
p-values by using the fixed regressor bootstrap (Hansen (2000)). The fixed re-
2 from the @REGHBREAK procedure
u_{T_0}^2 / ( σ̂^2 (1 - X_{T_0}(X'X)^{-1}X_{T_0}') )    (3.18)

This is similar to the rule-of-thumb test, except that it uses the studentized residuals u_t/(σ̂√h_tt), where h_tt = 1 - X_t(X'X)^{-1}X_t'. h_tt is between 0 and 1. X_t that are quite different from the average will produce a smaller value of this; such data points have more influence on the least squares fit, and this is taken into account in the LM test: a 2.0 ratio on a very influential data point can be quite high, since the least squares fit has already adjusted substantially to try to reduce that residual.
3 In some applications, the randomizing for the dependent variable multiplies the observed residual by a N(0,1).
4 This was proposed originally in Quandt (1960).
After you do a LINREG, you can generate the leverage statistics X_t(X'X)^{-1}X_t' and the studentized residuals with
prj(xvx=h)
set istudent = %resids/sqrt(%seesq*(1-h))
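A quick illustration (not from the original example; the entry used for T0 is a placeholder) of how these pieces give the single-break statistic (3.18), which is just the squared studentized residual at the candidate entry:

* Hypothetical check of (3.18) at one candidate entry
compute t0=1973:1
compute lmt0=%resids(t0)^2/(%seesq*(1-h(t0)))
disp "LM statistic at" %datelabel(t0) "=" lmt0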
ARIMA models
The X-11 seasonal adjustment algorithm is designed to decompose a series into
three main components: trend-cycle, seasonal and irregular. The seasonally
adjusted data is what you get if you take out the seasonal. That means that
the irregular is an important part of the final data product. If you have major
outliers due to strikes or weather, you can't just ignore them. However, they will contaminate the estimates of both the trend-cycle and seasonal if they're allowed to. The technique used by Census X-12, and by Statistics Canada's X-11-ARIMA before that, is to compute a separate pre-adjustment component that takes out the identifiably irregular parts, applies X-11 to the preliminarily adjusted data, then puts the extracted components back at the end.
The main focus in this preliminary adjustment is on two main types of addi-
tive outliers, as defined in Section 1.2. One (unfortunately) is known itself as
an additive outlier, which is a single period shift. The other is a permanent
level shift. In the standard set, there's also a temporary change, which is an impulse followed by a geometric decline back to zero, but that isn't used much.⁵ The shifts are analyzed in the context of an ARIMA model. Seasonal
ARIMA models can take a long time to estimate if they have any seasonal AR or
seasonal MA components, as those greatly increase the complexity of the likeli-
hood function. As a result, it really isn't feasible to do a search across possible outliers by likelihood ratio tests. Instead, the technique used is something of a stepwise technique:
This process is now built into the BOXJENK instruction. To employ it, add the
option OUTLIER to the BOXJENK. Example 3.2 applies this to the well-known
airline data with:
boxjenk(diffs=1,sdiffs=1,ma=1,sma=1,outliers=ao) airline
The main choices for OUTLIERS are AO (only the single data points), LS (for
level shift) and STANDARD, which is the combination of AO and LS. With STANDARD,
both the AO and LS are scanned at each stage, and the largest between them is
kept if sufficiently significant. The scan output from this is:
Forward Addition pass 1
Largest t-statistic is AO(1960:03)= -4.748 > 3.870 in abs value
The tests are reported as t-statistics, so the LM statistics will be the squares.
The 3.870 is the standard threshold value for this size data set; it may seem
rather high, but the X-11 algorithm has its own robustness calculations, so
only very significant effects need special treatment. As we can see, this adds
one outlier dummy at 1960:3, fails to find another, then keeps the one that was
added in the backwards pass. The output from the final BOXJENK (the output
from all the intermediate ones is suppressed) is in Table 3.1.
Box-Jenkins - Estimation by ML Gauss-Newton
Convergence in 11 Iterations. Final criterion was 0.0000049 <= 0.0000100
Dependent Variable AIRLINE
Monthly Data From 1950:02 To 1960:12
Usable Observations 131
Degrees of Freedom 128
Centered R 2 0.9913511
R-Bar 2 0.9912160
Uncentered R 2 0.9988735
Mean of Dependent Variable 295.63358779
Std Error of Dependent Variable 114.84472501
Standard Error of Estimate 10.76359799
Sum of Squared Residuals 14829.445342
Log Likelihood -495.7203
Durbin-Watson Statistic 2.0011
Q(32-2) 42.9961
Significance Level of Q 0.0586423
Note that both the forward and backwards passes use a sharp cutoff. As a result, it's possible for small changes to the data (one extra data point, minor
GARCH models
The mean model in a GARCH is often neglected in favor of the variance
model, but the very first assumption underlying a GARCH is that the residuals
are mean zero and serially uncorrelated. You can check serial correlation by
applying standard tests like Ljung-Box to the standardized residuals. But that
won't help if the basic mean model is off.
In deriving an LM test for outliers and level shifts, the variance can't be treated as a constant in taking the derivative with respect to mean parameters, since it's a recursively-defined function of the residual. Simple LM tests based upon (3.1) won't work because the likelihood at t is a function of the data for all previous values. The recursion required to get the derivative will be different for each type of GARCH variance model. For the simple univariate GARCH(1,1), we have

h_t = c + b h_{t-1} + a ε_{t-1}^2    (3.19)

so for a parameter β which is just in the mean model:

∂h_t/∂β = b ∂h_{t-1}/∂β + 2a ε_{t-1} ∂ε_{t-1}/∂β    (3.20)
For a one-shot outlier,

∂ε_t/∂β = { 1 if t = T_0 ; 0 otherwise }    (3.21)

and for a level shift,

∂ε_t/∂β = { 1 if t > T_0 ; 0 otherwise }    (3.22)
The procedure @GARCHOutlier in Example 3.3 does the same types of outlier
detection as we described for ARIMA models. The calculation requires that you
save the partial derivatives when the model is estimated:6
garch(p=1,q=1,reg,hseries=h,resids=eps,derives=dd) start end y
# constant previous
Once the gradient of the log likelihood with respect to the shift dummy is com-
puted (into the series GRAD), the LM statistic is computed using:
mcov(opgstat=lmstat(test))
# dd grad
6
The use of PREVIOUS allows you to feed in previously located shifts.
This uses a sample estimate for the covariance matrix of the first order conditions (basically, the BHHH estimate of the covariance matrix) and computes the LM statistic using that.
compute walds(splits)=%qform(inv(vpre+vpost),betapre-betapost)
compute lrs(splits)=%qform(wfull*(1.0/pi),m1pi)+$
%qform(wfull*(1.0/(1-pi)),m2pi)-(uzwzupre+uzwzupost)
compute lms(splits)=1.0/(pi*(1-pi))*%qform(lmv,m1pi)
end do splits
*
* Graph the test statistics
*
graph(key=upleft,footer="Figure 7.7 Structural Change Test Statistics") 3
# walds
# lrs
# lms
*
* Figure out the grand test statistics
*
disp "LM" @10 *.## %maxvalue(lms) %avg(lms) log(%avg(%exp(.5*lms)))
disp "Wald" @10 *.## %maxvalue(walds) %avg(walds) log(%avg(%exp(.5*walds)))
disp "LR" @10 *.## %maxvalue(lrs)
*
* Level shift. We leave out the first and last entries, since those
* are equivalent to additive outliers.
*
set lmstat gstart+1 gend-1 = 0.0
do test=gstart+1,gend-1
set deps gstart gend = (t>=test)
set(first=0.0) dh gstart gend = $
%beta(%nreg)*dh{1}+%beta(%nreg-1)*eps{1}*deps{1}
set grad gstart gend = -.5*dh/h+.5*(eps/h)^2*dh-(eps/h)*deps
mcov(opgstat=lmstat(test))
# dd grad
end do test
ext(noprint) lmstat gstart+1 gend-1
if graph {
graph(header="Level Shifts")
# lmstat gstart gend
}
if %maximum>beststat {
compute beststat=%maximum
compute bestbreak=%maxent
compute besttype =3
}
}
*
if outlier<>1 {
if besttype==2
disp "Maximum LM for Additive Outlier" *.### beststat $
"at" %datelabel(bestbreak)
else
disp "Maximum LM for Level Shift" *.### beststat $
"at" %datelabel(bestbreak)
}
end
***************************************************************************
open data garch.asc
all 1867
data(format=free,org=columns) / bp cd dm jy sf
set dlogdm = 100*log(dm/dm{1})
*
dec vect[series] outliers(0)
@GARCHOutlier(outlier=standard,graph) dlogdm / outliers
dim outliers(1)
set outliers(1) = (t>=1303)
@GARCHOutlier(outlier=standard,graph) dlogdm / outliers
Chapter 4
TAR Models
y_t = φ_{11} y_{t-1} + ... + φ_{1p} y_{t-p} + u_t   if z_{t-d} < c    (4.1)
y_t = φ_{21} y_{t-1} + ... + φ_{2q} y_{t-q} + u_t   if z_{t-d} ≥ c

A SETAR model (Self-Exciting TAR) is a special case where the threshold variable is y itself.
Well work with two data series. The first is the U.S. unemployment rate (Fig-
ure 4.1). This shows the type of asymmetric cyclical behavior that can be han-
dled by a threshold modelit goes up much more abruptly than it falls. The
series modeled will be the first difference of this.
[Figure 4.1: U.S. unemployment rate]
The other is the spread between U.S. short and long-run interest rates (Fig-
ure 4.2), where we will be using the data set from Enders and Granger (1998).
Unlike the unemployment rate series, which will be modeled using a full co-
efficient break, this will have a coefficient which is common to the regimes.
The full break is simpler to analyze, and most of the existing procedures are
designed for that case, so for the spread series, we'll need to use some special purpose programming.
[Figure 4.2: Spread between U.S. short- and long-term interest rates]
Note: you may need to be careful with creating a threshold series such as the difference in the unemployment rate. The U.S. unemployment rate is reported at just one decimal digit, and it's a relatively slow-moving series. As a result, almost all the values for the difference are one of -.2, -.1, 0, .1 or .2. However, in computer arithmetic, all .1's are not the same: due to machine rounding there can be several values which disagree in the 15th digit when you subtract. (Fortunately, zero differences will be correct.) Because the test in (4.1) has a sharp cutoff, it's possible for the optimal threshold to come out in the middle of (for instance) the .1's, since some will be (in computer arithmetic) larger than others. While this is unlikely (since the rounding errors are effectively random), you can protect against it by using the %ROUND function to force all the similar values to map to exactly the same result. For the unemployment rate (with one digit), this would be done with something like:
set dur = %round(unrate-unrate{1},1)
The interest rate spread is reported at two digits, and takes many more val-
ues, so while we include the rounding in our program, it is quite unlikely that
rounding error will cause a problem.
4.1 Estimation
The first question is how to pick p in (4.1). If there really is a strong break at an
unknown value, then standard methods for picking p based upon information
criteria won't work well, since the full data series doesn't follow an AR(p), and you can't easily just apply those to a subset of the data, like you could if the
break were based upon time and not a threshold value. The standard strategy
is to overfit to start, then prune out unnecessary lags. While (4.1) is written
with the same structure in each branch, in practice they can be different.
The second question is how to pick the threshold z and delay d. For a SETAR
model, z is y. Other variables might be suggested by theory, as, for instance,
in the Enders-Granger paper, where z is ∆y. The delay is obviously a discrete
parameter, so estimating it requires a search over the possible values. c is less
obviously discrete, but, in fact, the only values of c at which (4.1) changes are
observed values for z_{t-d}, so it's impossible to use variational methods to estimate c. Instead, that will also require a search. If both d and c are unknown, a nested search over d and, given a test value of d, over the values of z_{t-d} is
necessary.
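One common way to handle the search over d is to run the threshold test for each candidate delay and use the strongest result, as is done for the spread data in Chapter 5. A sketch for the differenced unemployment rate (the loop bounds are illustrative; the procedures search over c internally):

* Hypothetical sketch: loop over candidate delays d, re-creating the
* threshold series each time
do d=1,4
   set thresh = dur{d}
   @tsaytest(thresh=thresh,title="Tsay Test, delay="+d) dur
   # constant dur{1 to 4}
end do d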
4.2 Testing
The test which suggests itself is a likelihood ratio test or something similar
which compares the one-branch model with the best two-branch model. The
problem with that approach is that that test statistic will not have a standard
asymptotic distribution because of both the lack of differentiability with respect to c and the lack of identification of c if there is no difference between the branches. There is a relatively simple bootstrap procedure which can be used to approximate the p-value. We'll look at that in 4.2.2. The first testing procedure isn't quite as powerful, but is much quicker.
The Arranged Autoregression Test (Tsay (1989)) first runs recursive estimates
of the autoregression, with the data points added in the order of the test thresh-
old variable, rather than conventional time series order. Under the null of no
break, the residuals should be fairly similar to the least squares residuals,
so there should be no correlation between the recursive residuals and the re-
gressors, and Tsay shows that under the null, an exclusion test for the full
coefficient vector in a regression of the recursive residuals on the regressor
set is asymptotically a chi-square (or approximated by a small-sample F).1 If
there is a break, and you have the correct threshold variable, then the recur-
sive estimates will be expected to be quite different at the beginning of the
sample from those at the end, and there we would likely see some correlation
in regressing the recursive residuals on the regressors, which would show as a
significant exclusion test. In effect, this does the testing procedure in reverse
order: rather than do the standard regression first and then see if there is a
1 Note that this isn't the same as the conventional regression F statistic, which doesn't test the intercept. It's a test on the full vector, including the constant.
correlation between the residuals and regressors across subsamples, this does
the subsamples first, and sees if there is a correlation with the full sample.
This is quite simple to do with RATS because the RLS instruction allows you to provide an ORDER series which does the estimates in the given order, but
keeps the original alignment of the entries. This is already implemented in the
@TSAYTEST procedure, which, if you look at it, is rather short. @TSAYTEST can
be applied with any threshold series, not just a lag of the dependent variable.
You need to create the test threshold as a separate series, as we do in this
case of the differenced unemployment rate, with its first lag as the threshold
variable:
set thresh = dur{1}
@tsaytest(thresh=thresh) dur
# constant dur{1 to 4}
which produces:
TSAY Arranged Autoregression Test
F( 5 , 594 )= 2.54732 P= 0.02706
Hansen (1996) derives a simple bootstrap procedure for threshold and similar
models. Instead of bootstrapping an entire new sample of the data, it takes the
regressors as fixed and just draws the dependent variable, as a N(0,1). This is
for evaluating test statistics onlynot for the more general task of evaluating
the full distribution of the estimates.
This fixed regressor bootstrap is built into several RATS procedures.
@THRESHTEST is similar in form to @TSAYTESTthe application of it to the
change in the unemployment rate is:
@threshtest(thresh=thresh,nreps=500) dur
# constant dur{1 to 4}
Its also in the @TAR procedure, which is specifically designed for working with
TAR models. It tests all the lags as potential thresholds:
@tar(p=4,nreps=500) dur
The three variations of the test statistic are the maximum LM, geometric aver-
age of the LM statistics (across thresholds) and the arithmetic average.
These procedures won't work with the M-TAR (Momentum TAR) model of the
Enders-Granger paper since it has a common regressor, rather than having the
entire model switch. The following is borrowed from the TAR procedure:
set thresh = spread{1}
*
compute trim=.15
compute startl=%regstart(),endl=%regend()
set copy startl endl = thresh
set ix startl endl = t
*
order copy startl endl ix
set flatspot = (t<endl.and.copy==copy{-1})
compute nobs = %nobs
compute nb = startl+fix(trim*nobs),ne=startl+fix((1-trim)*nobs)
set flatspot startl nb = 1
The trimming prevents this from looking at the very extreme values of the
threshold, and the FLATSPOT series will keep data points with the same thresh-
old value together. This then does the identification of the best break value by
running the regressions with the break in one variable, with the common regressor ds{1}.
compute ssqmax=0.0
do time=nb,ne
compute retime=fix(ix(time))
if flatspot(time)
next
set s1_1 = %if(ds{1}>=0,thresh-thresh(retime),0.0)
set s1_2 = %if(ds{1}<0 ,thresh-thresh(retime),0.0)
linreg(noprint) ds
# s1_1 s1_2 ds{1}
if rssbase-%rss>ssqmax
compute ssqmax=rssbase-%rss,breakvalue=thresh(retime)
end do time
disp ssqmax
disp breakvalue
4.3 Forecasting
A TAR model is a self-contained dynamic model for a series. However, to this
point, there has been almost no difference between the analysis with a truly
exogenous threshold and a threshold formed by lags of the dependent variable.
Once we start to forecast, that's no longer the case: we need the link between the threshold and the dependent variable to be included in the model.
While the branches may very well be (and probably are) linear, the connec-
tion isn't, so we need to use a non-linear system of equations. For the unemployment series, the following estimates the two branches, given the identified
break at 0 on the first lag:
linreg(smpl=dur{1}<=0.0,frml=branch1) dur
# constant dur{1 to 4}
compute rss1=%rss,ndf1=%ndf
linreg(smpl=dur{1}>0.0,frml=branch2) dur
# constant dur{1 to 4}
compute rss2=%rss,ndf2=%ndf
compute seesq=(rss1+rss2)/(ndf1+ndf2)
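The two branch formulas then get combined into a single switching formula. The definition of TARFRML used in the GROUP below isn't reproduced in this excerpt; a sketch of what it could look like is:

* Sketch: pick branch 1 or branch 2 depending upon the lagged change
frml tarfrml dur = %if(dur{1}<=0.0,branch1(t),branch2(t))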
The following is for accumulating back the unemployment rate from the change,
and combines the non-linear equation plus the identity into a model:
frml(identity) urid unrate = unrate{1}+dur
group tarmodel tarfrml urid
Because of the non-linearity of the model, the mean square forecast can't be computed using the point values done by FORECAST, but needs the average across simulations.
set urfore 2010:10 2011:12 = 0.0
compute ndraws=5000
do draw=1,ndraws
simulate(model=tarmodel,from=2010:10,to=2011:12,results=sims)
set urfore 2010:10 2011:12 = urfore+sims(2)
end do draw
set urfore 2010:10 2011:12 = urfore/ndraws
whether unit shocks or some other pattern. Neither part of this is going to be
a proper practice with a TAR or similar model. First, the response, in general,
will be very sensitive to the initial conditions: if we're far from the threshold in either direction, any shock of reasonable size will be very unlikely to cause us to shift from one branch to the other, so the differential effect will be whatever the dynamics are of the branch on which we started. On the other hand, if we're near the threshold, a positive shock will likely put us into one branch, while a negative shock of equal size would put us into the other, and (depending upon the dynamics), we might see the branch switch after a few periods of response.
The generalized impulse response is computed by averaging the differential
behavior across typical shocks. This can be computed analytically for linear
models, but can only be done by simulation for non-linear models. This will
be similar to the forecast procedure, but the random draws must be controlled
better, so, instead of SIMULATE, you use FORECAST with PATHS.
This puts together the forecasting model for the M - TAR on the spread:
frml transfct = ds{1}>=0
set s1_1 = %if(ds{1}>=0,thresh-breakvalue,0.0)
set s1_2 = %if(ds{1}<0 ,thresh-breakvalue,0.0)
linreg(print) ds
# s1_1 s1_2 ds{1}
frml adjust ds = %beta(3)*ds{1}+$
%if(transfct,%beta(1)*(spread{1}-breakvalue),$
%beta(2)*(spread{1}-breakvalue))
frml(identity) spreadid spread = spread{1}+ds
group mtarmodel adjust spreadid
and this computes the GIRF based on 1974:3, where the spread is large and
negative:
compute stddev=sqrt(%seesq)
set girf 1974:4 1974:4+59 = 0.0
compute ndraws=5000
do draw=1,ndraws
set shocks = %ran(stddev)
forecast(paths,model=mtarmodel,from=1974:4,$
steps=60,results=basesims)
# shocks
compute ishock=%ran(stddev)
compute shocks(1974:4)=shocks(1974:4)+ishock
forecast(paths,model=mtarmodel,from=1974:4,$
steps=60,results=sims)
# shocks
set girf 1974:4 1974:4+59 = girf+(sims(2)-basesims(2))/ishock
end do draw
The first FORECAST will be identical to what you would get with SIMULATE.
The second adds into those shocks a random first-period shock. The difference between those simulations, scaled by the value of ISHOCK, will convert into the equivalent of a unit shock. This is for convenience, since there is no obvious standard size when the responses aren't linear. The GIRFs at two different sets of initial conditions are shown in Figures 4.3 and 4.4.
[Figures 4.3 and 4.4: GIRFs for the spread at the two starting points]
end do draw
set urfore 2010:10 2011:12 = urfore/ndraws
graph 2
# unrate 2009:1 2010:9
# urfore
*
* Average "impulse response"
* We need to simulate over the forecast range, but also need to control
* the initial shock. The results of this will depend upon the starting
* point.
*
compute stddev=sqrt(seesq)
set girf 2009:1 2009:1+35 = 0.0
compute ndraws=5000
do draw=1,ndraws
set shocks = %ran(stddev)
forecast(paths,model=tarmodel,from=2009:1,$
steps=36,results=basesims)
# shocks
compute ishock=%ran(stddev)
compute shocks(2009:1)=shocks(2009:1)+ishock
forecast(paths,model=tarmodel,from=2009:1,$
steps=36,results=sims)
# shocks
set girf 2009:1 2009:1+35 = girf+(sims(2)-basesims(2))/ishock
end do draw
*
set girf 2009:1 2009:1+35 = girf/ndraws
graph(number=0,footer="GIRF for unemployment at 2009:1")
# girf
*
compute stddev=sqrt(%seesq)
clear girf
set girf 1984:1 1984:1+35 = 0.0
compute ndraws=5000
do draw=1,ndraws
set shocks 1984:1 1984:1+35 = %ran(stddev)
forecast(paths,model=tarmodel,from=1984:1,$
steps=36,results=basesims)
# shocks
compute ishock=%ran(stddev)
compute shocks(1984:1)=shocks(1984:1)+ishock
forecast(paths,model=tarmodel,from=1984:1,$
steps=36,results=sims)
# shocks
set girf 1984:1 1984:1+35 = girf+(sims(2)-basesims(2))/ishock
end do draw
*
set girf 1984:1 1984:1+35 = girf/ndraws
graph(number=0,footer="GIRF for unemployment at 1984:1")
# girf
next
set s1_1 = %max(thresh-thresh(retime),0.0)
set s1_2 = %min(thresh-thresh(retime),0.0)
linreg(noprint) ds
# s1_1 s1_2 ds{1}
if rssbase-%rss>ssqmax
compute ssqmax=rssbase-%rss,breakvalue=thresh(retime)
end do time
*
* M-TAR model
*
compute ssqmax=0.0
do time=nb,ne
compute retime=fix(ix(time))
if flatspot(time)
next
set s1_1 = %if(ds{1}>=0,thresh-thresh(retime),0.0)
set s1_2 = %if(ds{1}<0 ,thresh-thresh(retime),0.0)
linreg(noprint) ds
# s1_1 s1_2 ds{1}
if rssbase-%rss>ssqmax
compute ssqmax=rssbase-%rss,breakvalue=thresh(retime)
end do time
disp ssqmax
disp breakvalue
*
* Further analysis of the M-TAR model
* Generate a self-contained model of the process, estimated at
* the least squares fit.
*
frml transfct = ds{1}>=0
set s1_1 = %if(ds{1}>=0,thresh-breakvalue,0.0)
set s1_2 = %if(ds{1}<0 ,thresh-breakvalue,0.0)
linreg(print) ds
# s1_1 s1_2 ds{1}
*
* We need to write this in terms of spread{1}, since the structural
* equation generates the difference for spread-spread{1}.
*
frml adjust ds = %beta(3)*ds{1}+$
%if(transfct,%beta(1)*(spread{1}-breakvalue),$
%beta(2)*(spread{1}-breakvalue))
frml(identity) spreadid spread = spread{1}+ds
*
group mtarmodel adjust spreadid
*
* Starting from end of sample, where spread is relatively high
*
set highspread 1994:2 1994:2+59 = 0.0
compute ndraws=5000
do draw=1,ndraws
simulate(model=mtarmodel,from=1994:2,steps=60,$
results=sims,cv=%seesq)
set highspread 1994:2 1994:2+59 = highspread+sims(2)
end do draw
set highspread 1994:2 1994:2+59 = highspread/ndraws
graph 2
# spread 1990:1 1994:1
# highspread
*
* Starting from 1974:3, where the spread is large negative.
*
set lowspread 1974:4 1974:4+59 = 0.0
compute ndraws=5000
do draw=1,ndraws
simulate(model=mtarmodel,from=1974:4,steps=60,$
results=sims,cv=%seesq)
set lowspread 1974:4 1974:4+59 = lowspread+sims(2)
end do draw
set lowspread 1974:4 1974:4+59 = lowspread/ndraws
graph 2
# spread 1970:1 1974:3
# lowspread
*
* Average "impulse response"
* We need to simulate over the forecast range, but also need to control
* the initial shock. The results of this will depend upon the starting
* point.
*
compute stddev=sqrt(%seesq)
set girf 1974:4 1974:4+59 = 0.0
compute ndraws=5000
do draw=1,ndraws
set shocks = %ran(stddev)
forecast(paths,model=mtarmodel,from=1974:4,$
steps=60,results=basesims)
# shocks
compute ishock=%ran(stddev)
compute shocks(1974:4)=shocks(1974:4)+ishock
forecast(paths,model=mtarmodel,from=1974:4,$
steps=60,results=sims)
# shocks
set girf 1974:4 1974:4+59 = girf+(sims(2)-basesims(2))/ishock
end do draw
*
set girf 1974:4 1974:4+59 = girf/ndraws
graph(number=0,footer="GIRF for Spread at 1974:3")
# girf
*
compute stddev=sqrt(%seesq)
clear girf
set girf 1994:2 1994:2+59 = 0.0
compute ndraws=5000
do draw=1,ndraws
set shocks 1994:2 1994:2+59 = %ran(stddev)
forecast(paths,model=mtarmodel,from=1994:2,$
steps=60,results=basesims)
# shocks
compute ishock=%ran(stddev)
compute shocks(1994:2)=shocks(1994:2)+ishock
forecast(paths,model=mtarmodel,from=1994:2,$
steps=60,results=sims)
# shocks
set girf 1994:2 1994:2+59 = girf+(sims(2)-basesims(2))/ishock
end do draw
*
set girf 1994:2 1994:2+59 = girf/ndraws
graph(number=0,footer="GIRF for Spread at 1994:1")
# girf
Chapter 5
Threshold VAR/Cointegration
All the test statistics are significant, but the two with d = 1 are by far the
strongest, so the remaining analysis uses the lagged spread as the threshold
variable.
1 The empirical application is only in the working paper version, not the journal article.
Under the presumed behavior, the spread should show stationary behavior in
the two tails and unit root behavior in the middle portion. To analyze this, the
authors do an arranged Dickey-Fuller test: a standard D-F regression, but with the data points added in threshold order. Computing this requires the
use of the RLS instruction, saving both the history of the coefficients and of the
standard errors of coefficients to compute the sequence of D-F t-tests:
set dspread = spread-spread{1}
set thresh = spread{1}
rls(order=thresh,cohistory=coh,sehistory=seh) dspread
# spread{1} constant dspread{1}
*
set tstats = coh(1)(t)/seh(1)(t)
This is done with the recursions in each direction. The clearer of the two is
with the threshold in increasing order (Figure 5.1). Once you have more than
a handful of data points in the left tail, the D-F tests very clearly reject unit
root behavior, a conclusion which reverses rather sharply beginning around a
threshold of 1, though where exactly this starts to turn isn't easy to read off the graph since the data in use at that point are overwhelmingly in the left tail: the observed values of the spread are mainly in the range of -.5 to .5.
[Figure 5.1: Recursive D-F t-statistics, arranged autoregression from low to high]
Based upon this graph, the authors did a (very) coarse grid search for the two
threshold values in a two-lag TAR on the spread using all combinations of a
lower threshold in {-.2, -.1, 0, .1, .2} and an upper threshold in {1.6, 1.7, 1.8, 1.9, 2.0},
getting the minimum sum of squares at -.2 and 1.6. A more complete grid
search can be done fairly easily and quickly at modern computer speeds. As
is typical of (hard) threshold models, the sum of squares function is flat for
thresholds between observed values, so the search needs only to look at those
[Figure 5.2: Log likelihood as function of left threshold given right threshold]
An alternative to doing the TAR model on the spread is to choose breaks using
the multivariate likelihood for the actual VECM model. This is also done easily
using the instruction SWEEP. The following does the base multivariate regres-
sion of the two differenced variables on 1, one lag of the differenced variables
and the lagged spread:
sweep
# dff ddr
# constant dff{1} ddr{1} spread{1}
compute loglbest=%logl
2 And since the right threshold of 1.45 isn't even necessarily the constrained maximizer for any given value of the left threshold, even more left threshold values are likely to be acceptable if we did the required number crunching to do a formal likelihood ratio test for each.
%LOGL is the log likelihood of the systems regression. With groupings, it does
separate systems estimates, aggregating the outer product of residuals to get
an overall likelihood:
sweep(group=(thresh>=tvalues(lindex))+(thresh>tvalues(uindex)))
# dff ddr
# constant dff{1} ddr{1} spread{1}
The estimation using this gives the (much) tighter range of .63 to 1.22 as the
center subsample.
In order to do any type of forecasting calculation with this type of model, we
need to define identities to connect the five series: the two rates, the spread
and the changes. And we need to define formulas for the two rates of change
which switch with the simulated values of the threshold variable.
With the lower and upper threshold values in the variables of the same name,
the following formula will evaluate to 1, 2 or 3 depending upon where the value
of spread{1} falls:
dec frml[int] switch
frml switch = 1+fix((spread{1}>=lower)+(spread{1}>upper))
Note that you must use spread{1} in this, not a fixed series generated from
spread that we can use during estimation: when we do simulations, spread is no longer just data.
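Before moving to simulation, a quick way to check how the estimated thresholds classify the historical sample (not part of the original example) is to evaluate the switching FRML over the data range:

* Regime (1, 2 or 3) at each historical entry
set regime = switch(t)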
The most flexible way to handle the switching functions for the changes is to de-
fine a RECT[FRML] with rows for equations and columns for partitions. Using
separate formulas for each case (rather than using the same equation chang-
ing the coefficients) makes it easier to use different forms. In this case, we are
using the same functional form for all three branches, but that might not be
the best strategy.
system(model=basevecm)
variables dff ddr
lags 1
det constant spread{1}
end(system)
*
dec rect[frml] tvecfrml(2,3)
do i=1,3
estimate(smpl=(switch(t)==i))
frml(equation=%modeleqn(basevecm,1)) tvecfrml(1,i)
frml(equation=%modeleqn(basevecm,2)) tvecfrml(2,i)
end do i
With the work already done above, the switching formulas are now quite sim-
ple:
frml dffeq dff = tvecfrml(1,switch(t))
frml ddreq ddr = tvecfrml(2,switch(t))
The full code for the joint analysis of the model is in example 5.2.
@gridseries(from=-.30,to=.05,n=300,pts=ngrid) rgrid
set aic 1 ngrid = 0.0
*
do i=1,ngrid
compute rtest=rgrid(i)
sweep(group=thresh<rtest,var=hetero)
# g3year g3month
# constant g3year{1 to p} g3month{1 to p}
compute aic(i)=-2.0*%logl+2.0*%nregsystem
end do i
scatter(footer=$
"AIC Values for Interest Rates Regression vs Break Point")
# rgrid aic
Note that this uses the VAR=HETERO option on SWEEP. That allows the covari-
ance matrix to be different between partitions and computes the likelihood
function on that basis. This can also be done with SWEEP with just a single
target variable.
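For instance, a sketch of the single-target version (using the same series as above, for a given trial break value RTEST) would look like:

* Hypothetical sketch: same grouping, but with just G3YEAR as the target
sweep(group=thresh<rtest,var=hetero)
# g3year
# constant g3year{1 to p} g3month{1 to p}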
The double break model can now (with faster computers) be done in a rea-
sonable amount of time using the empirical grid. As in the paper, this uses
a somewhat coarser grid, though it's a slightly different one than is described there.
@gridseries(from=-.30,to=.05,n=80,pts=ngrid) rgrid
*
compute bestaic=%na
do i=1,ngrid-19
do j=i+20,ngrid
sweep(group=(thresh<rgrid(i))+(thresh<rgrid(j)),var=homo)
# g3year g3month
# constant g3year{1 to p} g3month{1 to p}
compute thisaic=-2.0*%logl+2.0*%nregsystem
if .not.%valid(bestaic).or.thisaic<bestaic
compute bestaic=thisaic,bestlower=rgrid(i),bestupper=rgrid(j)
end do j
end do i
*
disp "Best double break is at" bestlower "and" $
bestupper "with AIC" bestaic
The two key parameters are β, the coefficient in the cointegrating relation, and γ, the breakpoint on the threshold variable. You can input either or search for either.
The procedure @HSLMTest tests for threshold cointegration given an input β.
This does fixed regressor bootstrapping for computing an approximate p-value.
5 The data from roughly the first 50 observations were accidentally read twice and appended to the proper data set.
cal(m) 1955:1
open data irates.xls
data(format=xls,org=columns) 1955:01 1990:12 fedfunds mdiscrt
*
set spread = fedfunds-mdiscrt
@dfunit(lags=12) fedfunds
@dfunit(lags=12) mdiscrt
@dfunit(lags=12) spread
*
* Pick lags
*
@arautolags(crit=bic) spread
*
* Tsay threshold tests with direct ordering
*
do d=1,4
set thresh = spread{d}
@tsaytest(thresh=thresh,$
title="Tsay Test-Direct Order, delay="+d) spread
# constant spread{1 2}
end do d
*
* And with reversed ordering
*
do d=1,4
set thresh = -spread{d}
@tsaytest(thresh=thresh,$
title="Tsay Test-Reverse Order, delay="+d) spread
# constant spread{1 2}
end do d
*
* Arranged D-F t-statistics
*
set dspread = spread-spread{1}
set thresh = spread{1}
rls(order=thresh,cohistory=coh,sehistory=seh) dspread
# spread{1} constant dspread{1}
*
set tstats = coh(1)(t)/seh(1)(t)
scatter(footer="Figure 1. Recursive D-F T-Statistics\\"+$
"Arranged Autoregression from Low to High")
# thresh tstats
*
* Same thing in reversed order
*
rls(order=-thresh,cohistory=coh,sehistory=seh) dspread
# spread{1} constant dspread{1}
*
spgraph
scatter(vgrid=yvalue,footer=$
"Log likelihood as function of left threshold given right threshold")
# testx testf
scatter(vgrid=yvalue)
# testx testf
grtext(y=yvalue,x=0.0,direction=45) ".05 Critical Point for Left Threshold"
spgraph(done)
*
* This will be 0, 1 or 2, depending upon the value of thresh
*
set group = (thresh>=tvalues(lindexbest))+(thresh>tvalues(uindexbest))
*
dofor i = 0 1 2
disp "**** Group " i "****"
linreg(smpl=(group==i)) spread
# constant spread{1 2}
summarize(noprint) %beta(2)+%beta(3)-1.0
disp "DF T-Stat" %cdstat
end do i
*
* Threshold error correction models
*
set dff = fedfunds-fedfunds{1}
set ddr = mdiscrt -mdiscrt{1}
dofor i = 0 1 2
disp "**** Group " i "****"
linreg(smpl=(group==i)) dff
# constant dff{1} ddr{1} spread{1}
linreg(smpl=(group==i)) ddr
# constant dff{1} ddr{1} spread{1}
end do i
cal(m) 1955:1
open data irates.xls
data(format=xls,org=columns) 1955:01 1990:12 fedfunds mdiscrt
*
set spread = fedfunds-mdiscrt
set thresh = spread{1}
linreg spread
# constant spread{1 2}
@UniqueValues(values=tvalues) thresh %regstart() %regend()
*
compute n=%rows(tvalues)
compute pi=.15
*
compute spacing=fix(pi*n)
*
* These are the bottom and top of the permitted index values for the
* lower index, and the top of the permitted values for the upper index.
*
compute lstart=spacing,lend=n+1-2*spacing
compute uend =n+1-spacing
*
set dff = fedfunds-fedfunds{1}
set ddr = mdiscrt -mdiscrt{1}
*
sweep
# dff ddr
# constant dff{1} ddr{1} spread{1}
compute loglbest=%logl
*
do lindex=lstart,lend
do uindex=lindex+spacing,uend
sweep(group=(thresh>=tvalues(lindex))+(thresh>tvalues(uindex)))
# dff ddr
# constant dff{1} ddr{1} spread{1}
if %logl>loglbest
compute lindexbest=lindex,uindexbest=uindex,loglbest=%logl
end do uindex
end do lindex
disp "Best Break Values" tvalues(lindexbest) "and" tvalues(uindexbest)
*
compute lower=tvalues(lindexbest),upper=tvalues(uindexbest)
dec frml[int] switch
frml switch = 1+fix((spread{1}>=lower)+(spread{1}>upper))
*
* Estimate the model at the best breaks to get the covariance matrix.
*
sweep(group=switch(t))
# dff ddr
# constant dff{1} ddr{1} spread{1}
compute tvecmsigma=%sigma
*
set dff = fedfunds-fedfunds{1}
set ddr = mdiscrt -mdiscrt{1}
*
system(model=basevecm)
variables dff ddr
lags 1
det constant spread{1}
end(system)
*
dec rect[frml] tvecfrml(2,3)
do i=1,3
estimate(smpl=(switch(t)==i))
frml(equation=%modeleqn(basevecm,1)) tvecfrml(1,i)
frml(equation=%modeleqn(basevecm,2)) tvecfrml(2,i)
end do i
*
dofor m0 = 50 100
set thresh = sspread{d}
*
rls(noprint,order=thresh,condition=m0) g3year / rr3year
# constant g3year{1 to p} g3month{1 to p}
rls(noprint,order=thresh,condition=m0) g3month / rr3month
# constant g3year{1 to p} g3month{1 to p}
*
* We need to exclude the conditioning observations, so we generate
* the series of ranks of the threshold variable over the
* estimation range.
*
order(ranks=rr) thresh %regstart() %regend()
*
linreg(noprint,smpl=rr>m0) rr3year / wr3year
# constant g3year{1 to p} g3month{1 to p}
linreg(noprint,smpl=rr>m0) rr3month / wr3month
# constant g3year{1 to p} g3month{1 to p}
*
ratio(mcorr=%nreg,degrees=k*%nreg,noprint)
# rr3year rr3month
# wr3year wr3month
disp "D=" d "m0=" m0 @16 "C(d)=" *.## %cdstat @28 "P-value" #.##### %signif
end dofor m0
end do d
*
* Evaluate the AIC across a grid of threshold settings
*
set thresh = sspread{1}
@gridseries(from=-.30,to=.05,n=300,pts=ngrid) rgrid
set aic 1 ngrid = 0.0
*
compute bestaic=%na
*
do i=1,ngrid
compute rtest=rgrid(i)
sweep(group=thresh<rtest,var=homo)
# g3year g3month
# constant g3year{1 to p} g3month{1 to p}
compute aic(i)=-2.0*%logl+2.0*%nregsystem
if i==1.or.aic(i)<bestaic
compute bestaic=aic(i),bestbreak=rtest
end do i
*
scatter(footer=$
"AIC Values for Interest Rates Regression vs Break Point")
# rgrid aic
disp "Best Break is" bestbreak "with AIC" bestaic
*
set thresh = sspread{1}
*
* This is a slightly different set of grids than used in the paper.
*
@gridseries(from=-.30,to=.05,n=80,pts=ngrid) rgrid
*
compute bestaic=%na
do i=1,ngrid-19
do j=i+20,ngrid
sweep(group=(thresh<rgrid(i))+(thresh<rgrid(j)),var=homo)
# g3year g3month
# constant g3year{1 to p} g3month{1 to p}
compute thisaic=-2.0*%logl+2.0*%nregsystem
if .not.%valid(bestaic).or.thisaic<bestaic
compute bestaic=thisaic,bestlower=rgrid(i),bestupper=rgrid(j)
end do j
end do i
*
disp "Best double break is at" bestlower "and" $
bestupper "with AIC" bestaic
Chapter 6
STAR Models
The threshold models considered in Chapters 4 and 5 have both had sharp
cutoffs between the branches. In many cases, this is unrealistic, and the lack of
continuity in the objective function causes other problems: you can't use any
asymptotic distribution theory for the estimates, and without changes, they
aren't appropriate for forecasting since it's not clear how to handle simulated
values that fall between the observed data values near the cutoff.
An alternative is the STAR model (Smooth Transition AutoRegression) and
more generally, the STR (Smooth Transition Regression). Instead of the sharp
cutoff, this uses a smooth function of a threshold variable. One way to write
this is:
$$y_t = X_t\phi^{(1)} + X_t\phi^{(2)}\, G(Z_{t-d}, \gamma, c) + u_t \qquad (6.1)$$
The transition function G is bounded between 0 and 1, and depends upon a
location parameter c and a scale parameter $\gamma$.1 The two standard transition
functions are the logistic (LSTAR) and the exponential (ESTAR). The formulas
for these are:
$$G(Z_{t-d}, \gamma, c) = \begin{cases} 1 - \left[1 + \exp\left(\gamma(Z_{t-d} - c)\right)\right]^{-1} & \text{for LSTAR} \\[4pt] 1 - \exp\left(-\gamma(Z_{t-d} - c)^2\right) & \text{for ESTAR} \end{cases} \qquad (6.2)$$
LSTAR is more similar to the models examined earlier, with values to the left
of c generally being in one branch (with coefficient vector $\phi^{(1)}$) and those to the
right of c having coefficient vector more like $\phi^{(1)} + \phi^{(2)}$. ESTAR treats the tails
symmetrically, with values near c having coefficients near $\phi^{(1)}$, while those farther away (in either direction) being close to $\phi^{(1)} + \phi^{(2)}$. ESTAR is often used
when there are seen to be costs of adjustment in either direction. An unre-
stricted three branch model (such as in Balke-Fomby in 5.1) could be done by
adding in a second LSTAR branch.
STAR models, at least theoretically, can be estimated using non-linear least
squares. This, however, requires a bit of finesse: under the default initial val-
ues of zero for all parameters used for NLLS, both the parameters in the tran-
sition function and the autoregressive coefficients that they control have zero
derivatives. As a result, if you do NLLS with the default METHOD=GAUSS, it can
never move the estimates away from zero. A better way to handle this is to
1
There are other, equivalent, ways of writing this. The form we use here lends itself more
easily to testing for STAR effects, since it's just least squares if $\phi^{(2)}$ is zero.
split the parameter set into the transition parameters and the autoregressive
parameters, and first estimate the autoregressions conditional on a pegged set
of values for the transition parameters. With c and $\gamma$ pegged, (6.1) is linear,
and so converges in one iteration. The likelihood function generally isn't particularly sensitive to the choice of $\gamma$, but it can be sensitive to the choice for
c, so you might want to experiment with several guesses for c before deciding
whether you're done with the estimation.
Although the likelihood is relatively insensitive to the value of $\gamma$, that's only
when it's in the proper range. As you can see from (6.2), $\gamma$ depends upon
the scale of the transition variable $Z_{t-d}$. A guess value of something like 1
or 2 times $\sigma_z^{-1}$ for an LSTAR and $\sigma_z^{-2}$ for an ESTAR is generally adequate. In
Terasvirta (1994), the LSTAR exponent is directly rescaled by (an approximate)
reciprocal standard deviation. If you do that, the $\gamma$ values will have some similarities from one application to the next.
For the LSTAR model, do not use the formula as written in (6.2): for large positive values of $Z_{t-d}$, the exp function will overflow, causing the entire function
to compute as a missing value. exp of a large negative value will underflow
(which you could get in the negative tail for an LSTAR and either tail for an ESTAR), but underflowed values are treated as zero, which gives the proper limit
behavior. To avoid the problem with the overflows, use the %LOGISTIC func-
tion, which does the same calculation, but computes in a different form which
avoids overflow.2
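As a concrete illustration of (6.2) (a Python sketch, not part of the workbook's RATS code), the two transition functions can be written with an overflow-safe logistic, which is the role %LOGISTIC plays in RATS:

import numpy as np
from scipy.special import expit

def g_lstar(z, gamma, c):
    # Algebraically equal to 1 - [1 + exp(gamma*(z - c))]**(-1), but expit
    # avoids the overflow problem discussed above.
    return expit(gamma * (z - c))

def g_estar(z, gamma, c):
    # The exponent is always <= 0, so this can only underflow (towards G = 1).
    return 1.0 - np.exp(-gamma * (z - c) ** 2)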
LSTAR models include the sharp cutoff models as a special case where $\gamma \rightarrow \infty$.
However, where a sharp cutoff is appropriate, you may see very bad behavior
for the non-linear estimates from LSTAR. For instance, the LSTAR doesn't work
well for the unemployment rate series studied in 4.1. The change in the unemployment rate takes only a small number of values, with almost all data points
being -.2, -.1, 0, .1 or .2. The estimates for $\gamma$ and c show as:
Variable Coeff Std Error T-Stat Signif
11. GAMMA 220.943152 286157411.494605 7.72104e-007 0.99999939
12. C 0.018804 24356.890066 7.72021e-007 0.99999939
The standard errors look (and are) nonsensical, but this is a result of the like-
lihood (or sum of squares) function being flat for a range of values. It's always
a good idea to graph the transition function against the threshold, particularly
when the results look odd. In this case, it gives us Figure 6.1. With the exception of a very tiny non-zero weight at zero, this is the same as the sharp
transition. Almost any value of c between 0 and 1 will give almost the identical
transition function, and so will almost any large value of $\gamma$.
2
$1/(1 + \exp(x)) = \exp(-x)/(\exp(-x) + 1)$, one form of which will have the safe negative
exponents.
[Figure 6.1: Transition function plotted against the change in the unemployment rate]
The Linearity test includes all the interaction terms through the 3rd power
of the transition variable, and serves as a general test for a STAR effect. H01,
H02 and H03 are tests on the single power individually, while H12 is a joint test
with the first and second powers only. For an LSTAR model, all of these should
be significant. For an ESTAR, because of symmetry, the 3rd power shouldn't
enter, so you should see H12 significant and H03 insignificant. For the ESTAR,
you would also likely reject on the Linearity line, but it won't have as much
power as H12, since it includes the 3rd power terms as well. A common recom-
mendation is to choose the delay d on the threshold based upon the test which
gives the strongest rejection.
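For reference, the statistics described above come from a Taylor-expansion auxiliary regression of the general form below (a sketch of the construction; the exact bookkeeping is handled by the testing procedure used in the example code that follows), with $z_{t-d}$ the transition variable and $X_t$ the regressors of the linear model:
$$y_t = X_t\beta_0 + (X_t z_{t-d})\beta_1 + (X_t z_{t-d}^2)\beta_2 + (X_t z_{t-d}^3)\beta_3 + e_t$$
The Linearity line tests $\beta_1 = \beta_2 = \beta_3 = 0$; H01, H02 and H03 test $\beta_1 = 0$, $\beta_2 = 0$ and $\beta_3 = 0$ respectively; and H12 tests $\beta_1 = \beta_2 = 0$.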
cal 1821
open data lynx.dat
data(org=cols) 1821:1 1934:1 lynx
set x = log(lynx)/log(10)
*
* This uses the restricted model already selected.
*
stats x
compute scalef=1.8
*
nonlin(parmset=starparms) gamma c
frml flstar = %logistic(scalef*gamma*(x{3}-c),1.0)
compute c=%mean,gamma=2.0
*
equation standard x
# x{1}
equation transit x
# x{2 3 4 10 11}
frml(equation=standard,vector=phi1) phi1f
frml(equation=transit ,vector=phi2) phi2f
frml star x = f=flstar,phi1f+f*phi2f
nonlin(parmset=regparms) phi1 phi2
nlls(parmset=regparms,frml=star) x
nlls(parmset=regparms+starparms,frml=star) x
*
compute rstart=12,rend=%regend()
*
* One-off GIRF
*
group starmodel star
*
compute istart=1925:1
compute nsteps=40
compute iend =istart+nsteps-1
*
compute stddev=sqrt(%seesq)
set girf istart iend = 0.0
compute ndraws=5000
do draw=1,ndraws
set shocks istart iend = %ran(stddev)
forecast(paths,model=starmodel,from=istart,to=iend,results=basesims)
# shocks
compute ishock=%ran(stddev)
compute shocks(istart)=shocks(istart)+ishock
forecast(paths,model=starmodel,from=istart,to=iend,results=sims)
# shocks
set girf istart iend = girf+(sims(1)-basesims(1))/ishock
end do draw
*
set girf istart iend = girf/ndraws
graph(number=0,footer="GIRF for Lynx at "+%datelabel(istart))
# girf istart iend
*
* GIRF with confidence bands
*
* Independence Metropolis-Hastings. Draw from multivariate t centered at
* the NLLS estimates. In order to do this most conveniently, we set up a
* VECTOR into which we can put the draws from the standardized
* multivariate t.
*
compute accept=0
compute ndraws=5000
compute nburn =1000
*
* Prior for variance. We use a flat prior on the coefficients.
*
compute s2prior=1.0/10.0
compute nuprior=5.0
*
compute allparms=regparms+starparms
compute fxx =%decomp(%seesq*%xx)
compute nuxx=10
*
* Since we're starting at %BETA, the kernel of the proposal density is 1.
*
compute bdraw=%beta
compute logqlast=0.0
*
dec series[vect] girfs
gset girfs 1 ndraws = %zeros(nsteps,1)
*
infobox(action=define,progress,lower=-nburn,upper=ndraws) $
"Independence MH"
do draw=-nburn,ndraws
*
* Draw residual precision conditional on current bdraw
*
compute %parmspoke(allparms,bdraw)
sstats rstart rend (x-star(t))^2>>rssbeta
compute rssplus=nuprior*s2prior+rssbeta
compute hdraw =%ranchisqr(nuprior+%nobs)/rssplus
*
* Independence chain MC
*
compute btest=%beta+%ranmvt(fxx,nuxx)
compute logqtest=%ranlogkernel()
compute %parmspoke(allparms,btest)
sstats rstart rend (x-star(t))^2>>rsstest
compute logptest=-.5*hdraw*rsstest
compute logplast=-.5*hdraw*rssbeta
compute alpha =exp(logptest-logplast+logqlast-logqtest)
if alpha>1.0.or.%uniform(0.0,1.0)<alpha {
compute bdraw=btest,accept=accept+1
compute logqlast=logqtest
}
infobox(current=draw) %strval(100.0*accept/(draw+nburn+1),"##.#")
if draw<=0
next
*
* Do draw for GIRF
*
compute stddev=sqrt(1.0/hdraw)
set shocks istart iend = %ran(stddev)
forecast(paths,model=starmodel,results=basesims,from=istart,to=iend)
# shocks
compute ishock=%ran(stddev)
compute shocks(istart)=shocks(istart)+ishock
forecast(paths,model=starmodel,results=sims,from=istart,to=iend)
# shocks
set girf istart iend = (sims(1)-basesims(1))/ishock
*
* Flatten this estimate of the GIRF to save as the <<draw>> element in the
* full history.
*
ewise girfs(draw)(i)=girf(i+istart-1)
end do draw
infobox(action=remove)
*
set median istart iend = 0.0
set lower istart iend = 0.0
set upper istart iend = 0.0
*
dec vect work(ndraws)
do time=istart,iend
ewise work(i)=girfs(i)(time+1-istart)
compute ff=%fractiles(work,||.16,.50,.84||)
compute lower(time)=ff(1)
compute upper(time)=ff(3)
compute median(time)=ff(2)
end do time
*
graph(number=0,footer="GIRF with 16-84% confidence band") 3
# median istart iend
# lower istart iend 2
# upper istart iend 2
Chapter 7
Mixture Models
Suppose that we have a data series y which can be in one of several possible
(unobservable) regimes.1 If the regimes are independent across observations,
we have a (simple) mixture model. These are quite a bit less complicated than
Markov mixture or Markov switching models (Chapter 8), where the regime
at one time period depends upon the regime at the previous one, but illus-
trate many of the problems that arise in working with the more difficult time-
dependent data. Simple mixture models are used mainly in cross-section data
to model unobservable heterogeneity, though they can also be used in an error
process to model outliers or other fat-tailed behavior.
To simplify the notation, we'll use just two regimes. We'll use $S_t$ to represent
the regime of the system at time t,2 and p will be the (unconditional) probability
that the system is in regime 1. There's no reason that p has to be fixed, and
generalizing it to be a function of exogenous variables isn't difficult.
If we write the likelihood under regime i as $f_{(i)}(y_t \mid X_t, \theta)$, where $X_t$ are exogenous and $\theta$ are parameters, then the log likelihood element for observation t
is
$$\log\left\{p\, f_{(1)}(y_t \mid X_t, \theta) + (1 - p)\, f_{(2)}(y_t \mid X_t, \theta)\right\} \qquad (7.1)$$
Each likelihood element is a probability-weighted average of the likelihoods
in the two regimes. This produces a sample likelihood which can show very
bad behavior, such as multiple peaks. In the most common case where the
two regimes have the same structure but with different parameter vectors, the
regimes become interchangeable. The labeling of the regimes isn't defined
by the model itself, so there are (in an n regime model) n! identical likelihood
modes, one for each permutation of the regimes. If the model is estimated by
maximum likelihood, you will end up at one of these, and you can usually (but
not always) control which you get by your choices of guess values. However, the
problem of label switching is a very serious issue with Bayesian estimation.
There are three main ways to estimate a model like this: conventional max-
imum likelihood (ML), expectation-maximization (EM), and Bayesian Markov
Chain Monte Carlo (MCMC). These have in common the need for the values for
1
We'll use regime rather than state for this to avoid conflict with the term state in the
state-space model.
2
We'll number these as 1 and 2 since that will generalize better to more than two regimes
than a 0-1 coding.
$f_{(i)}(y_t \mid X_t, \theta)$ across i for each observation. Our suggestion is that you create a
FUNCTION to do this calculation, which will make it easier to make changes.
The return value of the FUNCTION should be a VECTOR with size equal to the
number of regimes. Remember that these are the likelihoods, not the log likeli-
hoods. If it's more convenient doing the calculation in log form, make sure that
you exp the results before you return. As an example:
function RegimeF time
type vector RegimeF
type integer time
local integer i
dim RegimeF(2)
ewise RegimeF(i)=$
exp(%logdensity(sigsq,%eqnrvalue(xeq,time,phi(i))))
end
In addition to the problem with label switching, there are other pathologies
that may occur in the likelihood and which affect both ML and EM. One of the
simplest cases of a mixture model has the mean and variance different in each
regime. The log likelihood for observation t is
$$\log\left\{p\, f_N\!\left(x_t - \mu_1, \sigma_1^2\right) + (1 - p)\, f_N\!\left(x_t - \mu_2, \sigma_2^2\right)\right\} \qquad (7.2)$$
where $f_N(x, \sigma^2)$ is the normal density at x with variance $\sigma^2$. At $\mu_1 = x_1$ (or
any other data point), the likelihood can be made arbitrarily high by making
$\sigma_1^2$ very close to 0. The other data points will, of course, have a zero density for
the first term, but will have a non-zero value for the second, and so will have
a finite log likelihood value. The likelihood function has very high spikes
around values equal to the data points.
This is a pathology which occurs because of the combination of both the mean
and variance being free, and will not occur if either one is common to the
regimes. For instance, an outlier model would typically have all branches
with a zero mean and thus can't push the variance to zero at a single point.
This problem with the likelihood was originally studied in Kiefer and Wolfowitz
(1956) and further examined in Kiefer (1978). The latter paper shows that an
interior solution (with the variance bounded away from zero) has the usual de-
sirable properties for maximum likelihood. To estimate the model successfully,
we have to somehow eliminate any set of parameters with unnaturally small
variances.
In addition to the narrow range of values at which the likelihood is unbounded,
it is possible (and probably likely) that there will be multiple modes. With both
ML and EM, it's important to test alternative starting values.
We'll employ the three methods to estimate a model of fish sizes used in
Fruehwirth-Schnatter (2006). This has length measurements on 256 fish which
are assumed to have sizes which vary with an (unobservable) age. Models with
three and four categories are examined in most cases.
The function for the regime-dependent likelihoods (which will change slightly
from one application to another) is:
function RegimeF time
type vector RegimeF
type integer time
dim RegimeF(ncats)
ewise RegimeF(i)=exp(%logdensity(sigma(i)^2,fish(time)-mu(i)))
end
THETA is the vector of logistic indexes for the probabilities, which will map to
the working P vector. The mapping function (which will be the same for all
applications with time-invariant probabilities) is:
function %MixtureP theta
type vect theta %MixtureP
dim %MixtureP(%rows(theta)+1)
ewise %MixtureP(i)=%if(i<=%rows(theta),exp(theta(i)),1.0)
compute %MixtureP=%MixtureP/%sum(%MixtureP)
end
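In formula terms, the mapping implemented by %MixtureP (with n regimes and n − 1 free indexes) is
$$p_i = \frac{e^{\theta_i}}{1 + \sum_{j=1}^{n-1} e^{\theta_j}} \;\; (i = 1, \ldots, n-1), \qquad p_n = \frac{1}{1 + \sum_{j=1}^{n-1} e^{\theta_j}}$$
so the probabilities are automatically positive and sum to one regardless of the values in THETA.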
Given those building blocks, the log likelihood is very simple:
The START option transforms the free parameters in THETA to the working P
vector; the REJECT option prevents the smallest value of SIGMA from getting
too close to zero. Because the likelihood goes unbounded only in a very narrow
zone around the zero variance, the limit can be quite small; in this case, we
made it a very small fraction of the interquartile range.
stats(fractiles) fish
compute sigmalimit=.00001*(%fract75-%fract25)
7.2 EM Estimation
The EM algorithm (Appendix B) is able to simplify the calculations quite a bit
in these models. The augmenting parameters x are the regimes. In the M step,
we maximize over the simpler
$$\mathop{E}_{\{S_t \mid \theta_0\}} \log f(y_t, S_t \mid X_t, \theta) = \mathop{E}_{\{S_t \mid \theta_0\}}\left\{\log f(y_t \mid S_t, X_t, \theta) + \log f(S_t \mid X_t, \theta)\right\} \qquad (7.3)$$
For the typical case where the underlying models are linear regressions, the
two terms on the right use separate sets of parameters, and thus can be maxi-
mized separately. Maximizing the sum of the first just requires a probability-
weighted linear regression; while maximizing the second estimates p as the
average of the probabilities of the regimes. p is constructed to be in the proper
range, so we dont have to worry about that as we might with maximum likeli-
hood. The value of the E step is that it allows us to work with sums of the more
convenient log likelihoods rather than logs of the sums of the likelihoods as in
(7.1).
In Example 7.2, we use a SERIES[VECT] to hold the estimated probabilities
of the regimes using the previous parameter settings. For a two-regime model,
we could get by with just a single series with the probabilities of regime 1,
knowing that regime 2 would have one minus that as the probability. However,
the more general way of handling it is, in fact, simpler to write. The setup for
this is
dec series[vect] pt_t
gset pt_t gstart gend = %fill(ncats,1,1.0/ncats)
The second line just initializes all the elements; the values don't really matter
because the first step in EM is to compute the probabilities of these anyway.
The E step just uses Bayes rule to fill in those probabilities. This is computed
using the previous parameter settings for the means and variances and for the
unconditional probabilities of the branches:
gset pt_t gstart gend = f=RegimeF(t),(f.*p)/%dot(f,p)
The M step is, in this case, most conveniently done by using the SSTATS in-
struction to compute probability-weighted sums and sums of squares along
with the sum of the probabilities themselves. This requires a separate cal-
culation for each of the possible regimes. Because PT_T is a SERIES[VECT],
you first have to reference the time period (thus PT_T(T)) and then further
reference the element within that VECTOR, which is why you use PT_T(T)(I)
to get the probability of regime I at time T.
do i=1,ncats
sstats gstart gend pt_t(t)(i)>>sumw pt_t(t)(i)*fish>>sumwm $
pt_t(t)(i)*fish^2>>sumwmsq
compute p(i)=sumw/%nobs
compute mu(i)=sumwm/sumw
compute sigma(i)=%max(sqrt(sumwmsq/sumw-mu(i)^2),sigmalimit)
end do i
Note that, like ML, this also has a limit on how small the variance can be. The
EM iterations can blunder into one of the small variance spikes if nothing is
done to prevent it.
The full EM algorithm requires repeating those steps, so this is enclosed in a
loop. At the end of each step, the log likelihood is computed; this should in-
crease with each iteration, generally rather quickly at first, then slowly crawl-
ing up. In this case, we do 50 EM iterations to improve the guess values, then
switch to maximum likelihood. With straight ML, it takes 32 iterations of ML
after some simplex preliminary iterations; with the combination of EM plus
ML , it takes just 9 iterations of ML to finish convergence, and less than half the
execution time. The EM iterations also tend to be a bit more robust to guess
values, although if there are multiple modes (which is likely) there's no reason
that EM can't home in on one with a lower likelihood.
sstats gstart gend log(%dot(RegimeF(t),p))>>logl
disp "Iteration" emits logl p mu sigma
Next up is drawing the means given the variances and the regimes. In this
example, we use a flat prior on the means; we'll discuss that choice in section
7.3.1. Given the standard deviations just computed, the sum and observation
count are sufficient statistics for the mean, which are drawn as normals.
do i=1,ncats
sstats(smpl=(s==i)) gstart gend fish>>sum
compute mu(i)=sum/%nobs+%ran(sigma(i)/sqrt(%nobs))
end do i
For now, we'll skip over the next step in the example (relabeling) and take that
up in section 7.3.1. We'll next look at drawing the regimes given the other pa-
rameters. Bayes formula gives us the relative probabilities of the regime i at
time t as the product of the unconditional probability p(i) times the likelihood
of regime i at t. The %RANBRANCH function is designed precisely for drawing a
random index from a vector of relative probability weights.4 A single instruc-
tion does the job:
set s gstart gend = fxp=RegimeF(t).*p,%ranbranch(fxp)
Finally, we need to draw the unconditional probabilities given the regimes. For
two regimes, this can be done with a beta distribution (Appendix F.2), but for
3
See the first page in Appendix C for the derivation of this.
4
Note that you don't need to divide through by the sum of f times p; %RANBRANCH takes
care of the normalization.
more than that, we need the more general Dirichlet distribution (Appendix
F.7). For an n component distribution, this takes n input shapes and returns
an n vector with non-negative components summing to one. The counts of the
number of the regimes are combined with a weak Dirichlet prior (all compo-
nents are 4; the higher the values, the tighter the prior). An uninformative
prior for the Dirichlet would have input shapes that are all zeros. However,
this isnt recommended. It is possible for a sweep to generate no data points
in a particular regime. A non-zero prior makes sure that the unconditional
probability doesnt also collapse to zero.
do i=1,ncats
sstats(smpl=(s==i)) gstart gend 1>>shapes(i)
end do i
compute p=%randirichlet(shapes+priord)
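For comparison only (this is not part of the RATS example), the same draw in NumPy terms, with hypothetical counts for a three-regime case:

import numpy as np
counts = np.array([120.0, 90.0, 46.0])    # regime counts from the current sweep (made-up values)
priord = np.full(3, 4.0)                  # the weak Dirichlet prior described above
p = np.random.dirichlet(counts + priord)  # draw of the unconditional probabilities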
Because of the positioning of the relabeling step (after the draws for MU, SIGMA
and P, but before the draws for the regimes), there is no need to switch the
definitions of the regimes, since they will naturally follow the definitions of the
others.
This type of correction is quite simple in this case since the regression part
is just the single parameter, which can be ordered easily. With a more gen-
eral regression, it might be more difficult to define the interpretations of the
regimes.
*
do i=1,ncats
sstats(smpl=(s==i)) gstart gend (fish-mu(i))^2>>sumsqr
compute sigma(i)=sqrt((sumsqr+nusumsqr)/%ranchisqr(%nobs+nudf))
end do i
*
* Draw mus given sigmas and regimes.
*
do i=1,ncats
sstats(smpl=(s==i)) gstart gend fish>>sum
compute mu(i)=sum/%nobs+%ran(sigma(i)/sqrt(%nobs))
end do i
*
* Relabel if necessary
*
compute swaps=%index(mu)
*
* Relabel the mus
*
compute temp=mu
ewise mu(i)=temp(swaps(i))
*
* Relabel the sigmas
*
compute temp=sigma
ewise sigma(i)=temp(swaps(i))
*
* Relabel the probabilities
*
compute temp=p
ewise p(i)=temp(swaps(i))
*
* Draw the regimes, given p, the mus and the sigmas
*
set s gstart gend = fxp=RegimeF(t).*p,%ranbranch(fxp)
*
* Draw the probabilities
*
do i=1,ncats
sstats(smpl=(s==i)) gstart gend 1>>shapes(i)
end do i
compute p=%randirichlet(shapes+priord)
infobox(current=draw)
if draw>0 {
*
* Do the bookkeeping
*
do i=1,ncats
compute mus(i)(draw)=mu(i)
compute sigmas(i)(draw)=sigma(i)
end do i
}
end do draw
infobox(action=remove)
*
density(smoothing=1.5) mus(1) 1 ndraws xf ff
scatter(footer="Density Estimate for mu(1)",style=line)
# xf ff
density(smoothing=1.5) mus(2) 1 ndraws xf ff
scatter(footer="Density Estimate for mu(2)",style=line)
# xf ff
density(smoothing=1.5) mus(3) 1 ndraws xf ff
scatter(footer="Density Estimate for mu(3)",style=line)
# xf ff
density(smoothing=1.5) mus(4) 1 ndraws xf ff
scatter(footer="Density Estimate for mu(4)",style=line)
# xf ff
*
scatter(footer="Mean-Sigma combinations") 4
# mus(1) sigmas(1) 1 ndraws
# mus(2) sigmas(2) 1 ndraws
# mus(3) sigmas(3) 1 ndraws
# mus(4) sigmas(4) 1 ndraws
Chapter 8
Markov Switching: Introduction
With time series data, a model where the (unobservable) regimes are indepen-
dent across time will generally be unrealistic for the process itself, and will
typically be used only for modeling residuals. Instead, it makes more sense to
model the regime as a Markov Chain. In a Markov Chain, the probability that
the process is in a particular regime at time t depends only upon the probabilities of the regimes at time $t-1$, and not on earlier periods as well. This isn't as
restrictive as it seems, because it's possible to define a system of regimes at t
which includes not just $S_t$, but the tuple $(S_t, S_{t-1}, \ldots, S_{t-k})$, so the memory can
stretch back for k periods. This creates a feasible but more complicated chain,
with $M^{k+1}$ combinations in this augmented regime, where M is the number of
possibilities for each $S_t$.
As with the mixture models, the likelihood of the data (at t) given the regime is
written as $f_{(i)}(y_t \mid X_t, \theta)$. For now, we'll just assume that this can be computed1
and concentrate on the common calculations for all Markov Switching models
which satisfy this assumption.
in MCMC for estimating or drawing the regime-specific parameters for the mod-
els controlled by the switching are also the same as they are for the mixture
models. What we need that we have not seen before are the rules of inference
on the regime probabilities:
The support routines for this are in the file MSSETUP.SRC. You pull this in
with:
@MSSETUP(states=# of regimes,lags=# of lags)
The LAGS option is used when the likelihood depends upon (a finite number of)
lagged regimes. We'll talk about that later, but for most models we'll have (the
default) LAGS=0. @MSSETUP defines quite a few standard variables which
will be used in implementing these models. Among them are NSTATES and
NLAGS (number of regimes and number of lagged regimes needed), P and THETA
(transition matrix and logistic indexes for it), PT_T1, PT_T2 and PSMOOTH (se-
ries of probabilities of regimes at various points). The only one that might raise
a conflict with a variable in your own program would probably be P, so if you
run across an error about a redefinition of P, you probably need to rename your
original variable.
This invokes Bayes rule and is identical to the calculation in the simpler mix-
ture model, combining the predicted probabilities with the vector of likelihood
values across the regimes:
$$p_{t|t}(i) = \frac{p_{t|t-1}(i)\, f_{(i)}(y_t \mid X_t, \theta)}{\sum_{i=1}^{M} p_{t|t-1}(i)\, f_{(i)}(y_t \mid X_t, \theta)} \qquad (8.1)$$
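Schematically, the prediction and update steps together look like the following sketch (illustrative Python/NumPy, not RATS code; P is the column-stochastic transition matrix, p_prev the filtered probabilities at t − 1, and f the vector of regime likelihoods at t):

import numpy as np

def filter_step(P, p_prev, f):
    p_pred = P @ p_prev          # prediction: p(S_t = i | data through t-1)
    joint = p_pred * f           # numerator of (8.1)
    lik = joint.sum()            # (non-logged) likelihood of observation t
    return joint / lik, lik      # filtered probabilities and likelihood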
8.1.3 Smoothing
x is the regime at t
y is the regime at t + 1
I is the data through t
J is the data through T
The key assumption (A.1) holds because knowing the actual regime y at t + 1
provides better information about the regime at t than all the data from t +
1, . . . , T that are added in moving from I to J . To implement this, we need to
retain the series of predicted probabilities (which will give us f (y|I), when we
look at the predicted probabilities for t + 1) and filtered probabilities (f (x|I)).
At the end of the filtering step, we have the filtered probabilities for ST , which
are (by definition) the same as the smoothed probabilities. (A.2) is then applied
recursively, to generate the smoothed probability at $T-1$, then $T-2$, etc. For
a Markov Chain, the f (y|x, I) in (A.2) is just the probability transition matrix
from t to t + 1. Substituting in the definitions for the Markov Chain, we get
$$p_{t|T}(j) = p_{t|t}(j) \sum_{i=1}^{M} p_{ij}\, \frac{p_{t+1|T}(i)}{p_{t+1|t}(i)}$$
This assumes that PT_T and PT_T1 have been defined as described above, which
is what the %MSPROB function will do. The output from @MSSMOOTHED is the
SERIES[VECT] of smoothed probabilities.
To extract a series of a particular component from one of these SERIES[VECT],
use SET with something like
set p1 = psmooth(t)(1)
The most efficient way to draw a sample of regimes is using the Forward Filter-
Backwards Sampling (FFBS) algorithm described in Chib (1996). This is also
known as Multi-Move Sampling since the algorithm samples the entire history
at one time. The Forward Filter is exactly the same as used in the first step of
smoothing. Backwards Sampling can also be done using the result in Appendix
A: draw the regime at T from the filtered distribution. Then compute the distri-
bution from which to sample $T-1$ applying (A.2) with the $f(y|J)$ (which is, in
this case, $f(S_T \mid T)$) a unit vector at the sampled value for $S_T$. Walk backwards
through the data range to get the full sampled distribution.
The procedure on mssetup.src that does the sampling is @MSSAMPLE. This
has a form similar to the smoothing procedure. Its
@MSSAMPLE start end REGIME
The inputs are the same, while the output REGIME is a SERIES[INTEGER]
which takes the sampled values between 1 and M . MSSETUP includes a defini-
tion of a SERIES[INTEGER] called MSRegime which we use in most examples
that require this.
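As a sketch of the algorithm itself (illustrative Python/NumPy, not the @MSSAMPLE code; it assumes the conventions of this chapter, with a column-stochastic P, lik[t] the vector of regime likelihoods at t, and p0 the pre-sample probabilities):

import numpy as np

def ffbs(P, lik, p0):
    T, M = lik.shape
    rng = np.random.default_rng()
    filt = np.zeros((T, M))                  # p(S_t = i | data through t)
    p = p0
    for t in range(T):
        pred = P @ p                         # prediction step
        upd = pred * lik[t]                  # update step, as in (8.1)
        filt[t] = upd / upd.sum()
        p = filt[t]
    S = np.zeros(T, dtype=int)
    S[T - 1] = rng.choice(M, p=filt[T - 1])  # draw S_T from the filtered distribution
    for t in range(T - 2, -1, -1):
        w = filt[t] * P[S[t + 1], :]         # reweight by the transition into the sampled S_{t+1}
        S[t] = rng.choice(M, p=w / w.sum())
    return S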
Single-Move Sampling
To use Multi-Move Sampling, you need to be able to do the filtering (and
smoothing) steps. This won't always be possible. The assumption made in (8.1)
was that the likelihood at time t was a function only of the regime at t. The
filtering and smoothing calculations can be extended to situations where the
likelihood depends upon a fixed number of previous regimes by defining an augmented regime using the tuple $(S_t, S_{t-1}, \ldots, S_{t-k})$. However, a GARCH model has
an unobservable state variable (the lagged variance) which depends upon
the precise sequence of regimes that preceded it. The likelihood at t has $M^t$
branches, which rapidly becomes too large to enumerate. Similarly, most state-
space models have unobservable state variables which also depend upon the
entire sequence of earlier regimes.3
An alternative form of sampling which can be used in such cases is Single-Move
Sampling. This samples $S_t$ taking $S_1, \ldots, S_{t-1}, S_{t+1}, \ldots, S_T$ as given. The joint
likelihood of the full data set and the full set of regimes (conditional on all other
parameters in the model) can be written as
$$f(Y, S \mid \theta) = f(Y \mid S, \theta)\, f(S \mid \theta) \qquad (8.2)$$
Using the shorthand $S_{-t}$ for the sequence of regimes other than $S_t$, Bayes
rule gives us
$$p(S_t = i \mid Y, S_{-t}, \theta) \propto f(Y \mid S_{-t}, S_t = i, \theta)\, f(S_{-t}, S_t = i \mid \theta) \qquad (8.3)$$
Using the Markov property on the chain, the second factor on the right can be
written sequentially as:
$$f(S \mid \theta) = f(S_1 \mid \theta)\, f(S_2 \mid S_1, \theta) \cdots f(S_T \mid S_{T-1}, \theta)$$
In doing inference on $S_t$ alone, any factor in this which doesn't include $S_t$ will
cancel in doing the proportions in (8.3), so we're left with
$$f(S_t \mid S_{t-1}, \theta)\, f(S_{t+1} \mid S_t, \theta)$$
Other than $f(S_1 \mid \theta)$ (which is discussed on page 86), these will just be the
various transition probabilities between the (assumed fixed) regimes at times
other than t and the choice for $S_t$ that we're evaluating.
3
There are, however, state-space models for which states at t are known given the data
through t, and such models can be handled using the regime filtering and smoothing.
compute pstar=%mcergodic(p)
do time=gstart,gend
compute oldregime=MSRegime(time)
do i=1,nstates
if oldregime==i
compute logptest=logplast
else {
compute MSRegime(time)=i
sstats gstart gend msgarchlogl>>logptest
}
compute pleft =%if(time==gstart,pstar(i),p(i,MSRegime(time-1)))
compute pright=%if(time==gend ,1.0,p(MSRegime(time+1),i))
compute fps(i)=pleft*pright*exp(logptest-logplast)
compute logp(i)=logptest
end do i
compute MSRegime(time)=%ranbranch(fps)
compute logplast=logp(MSRegime(time))
end do time
This loops over TIME from the start to the end of the data range, drawing values
for MSREGIME(TIME). To reduce the calculation time, this doesn't compute the
log likelihood function value at the current setting since that will just be
the value carried over from the previous time period; the variable LOGPLAST
keeps the log likelihood at the current set of regimes. The only part of this
thats specific to an application is the calculation of the log likelihood (into
LOGPTEST) given a test set of values of MSREGIME, which is done here using
the SSTATS instruction to sum the MSGARCHLOGL formula across the data set.
Because the calculation needs (in the end) the likelihood itself (not the log
likelihood), it's important to be careful about over- or underflows when exping
the sample log likelihoods. Since relative probabilities are all that matter, the
LOGPLAST value is subtracted from all the LOGPTEST values before doing the
exp.4
The FPS and LOGP vectors need to be set up before the loop with:
dec vect fps logp
dim fps(nstates) logp(nstates)
FPS keeps the relative probabilities of the test regimes, and LOGP keeps the
log likelihoods for each so we don't have to recompute once we've chosen the
regime.
An alternative single-move sampler, shown next, draws a candidate regime using only the transition probabilities and then accepts or rejects it with a Metropolis step, so the likelihood only has to be re-evaluated for the single proposed change:
compute pstar=%mcergodic(p)
do time=gstart,gend
compute oldregime=MSRegime(time)
do i=1,nstates
compute pleft =%if(time==gstart,pstar(i),p(i,MSRegime(time-1)))
compute pright=%if(time==gend ,1.0 ,p(MSRegime(time+1),i))
compute qp(i)=pleft*pright
end do i
compute candidate=%ranbranch(qp)
if MSRegime(time)<>candidate {
compute MSRegime(time)=candidate
sstats gstart gend msgarchlogl>>logptest
compute alpha=exp(logptest-logplast)
if alpha>1.0.or.%uniform(0.0,1.0)<alpha
compute logplast=logptest
else
compute MSRegime(time)=oldregime
}
end do time
The first sentence in the description of the prediction step began "Assuming
that we have a vector of probabilities at $t-1$ given data through $t-1$." We have
not addressed what happens when t = 1. There are two statistically justifiable
ways to handle the $p_{0|0}$. One is to treat them as free parameters, adding to
the parameter set an M vector of non-negative values summing to one. This is
by far the simplest way to handle the pre-sample if you use the EM algorithm,
since it can compute this probability vector directly as part of the smoothing
process. The other is to set them to the ergodic probabilities, which are the
long-run average probabilities for the Markov Chain given the values for the
transition matrix. For a time-invariant Markov Chain, this ergodic probability
exists except in rare circumstances, such as an absorbing state (if, for instance,
there's a permanent break in the process). If you use ML, this is the most
convenient way to handle the pre-sample.7 The two methods aren't the same,
and give rise to slightly different log likelihoods (the estimated probability, of
necessity, giving the higher value).
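For reference (a standard Markov chain result rather than anything specific to RATS): with the column-stochastic transition matrix P used in this chapter, the ergodic probabilities are the normalized solution of
$$\pi = P\pi, \qquad \sum_{i=1}^{M} \pi_i = 1$$
which is what the %MCERGODIC function used in the sampling code earlier returns.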
7
Maximum likelihood tends to be slow enough without having to deal with the extra param-
eters.
8.2 Estimation
As with mixture models, there are three basic methods of estimation: maxi-
mum likelihood, EM and MCMC. And as with mixture models, ML is usually
the simplest to set up and EM is the quickest to execute. However, there are
several types of underlying models where (exact) maximum likelihood isn't feasible and even more where EM isn't. MCMC is the only method which works in
almost all cases.
For all types of models, you need to be able to compute a VECTOR of likelihood
elements for each regime at each time period. It's the inability to construct
this that makes ML and EM infeasible for certain types of models: while the
switching mechanism may be a Markov Chain with short memory, the con-
trolled process might depend upon the entire history of the regimes, which
rapidly becomes too large to enumerate. MCMC avoids the problem because it
always works with just one sample path at a time for the regimes rather than
a weighted average across all paths. There are other models for which EM is
not practical because the M step for the model parameters has no convenient
form.
Well look at a simple example to illustrate the three estimation methods. This
has zero mean and Markov Switching variances. It is taken from Kim and
Nelson (1999).8 The data are excess stock returns, monthly from 1926 to 1986.
This is proposed as an alternative to ARCH or GARCH models as a way to explain
clustering of large changes: there is serial correlation in the variance regime
through a Markov Switching process.
The full-sample mean is extracted from the data, so the working data are as-
sumed to be mean zero. Thus, the only parameters are the variances in the
branches and the parameters governing the switching process. The model is
fit with three brancheswell use the variable NSTATES in all examples for the
number of regimes. The variances will be in a VECTOR called SIGMAS. This will
make it easy to change the number of regimes.
For all estimation methods, we need a FUNCTION which returns a VECTOR of
likelihoods across regimes at a given time period. We'll use the following in
this example, which can take any number of variance regimes:
8
Their Application 3 in Section 4.6.
The @MSFilterInit procedure takes care of the specific setup that is needed
for the forward filtering. The %MSINIT function takes care of the required
initialization of a single filtering pass through the data (returning the pre-
sample probabilities) and the %MSProb function does the prediction and update
steps returning the (non-logged) likelihood.
This is the structure for direct estimation of the transition probabilities. We
can also choose the logistic parameterization. With three (or more) choices, the
logistic is generally the most reliable because the 1 to 3 and 3 to 1 probabilities
can often be effectively zero. We'll use this for the example in this section
since it has three regimes. Example 8.1 does maximum likelihood. We need
to create the THETA matrix of logistic indices and give it guess values. Each
column in THETA is normalized with a 0 index for the final element. The set of
guess values used will have the probability of staying somewhere near .8, with
probabilities of moving to a different regime declining as it gets more distant
from the current one:
input theta
4.0 0.0 -4.0
2.0 2.0 -2.0
We're trying to steer the estimates towards the labeling with 1 being the lowest
variance and 3 the highest. There's no guarantee that we'll be successful, but
this will probably work. You just dont want any of the guess values to be so
extreme that the probabilities are effectively zero for all of them at all data
points.
If you use the logistic parameterization, the MAXIMIZE instruction will have
a slightly different START option, because we need to transform the logistic
THETA into the probability matrix P. The START option will always have the
PSTAR=%MSINIT() calculation; however, the transformation always needs to
be done first. The %(...) enclosing the two parts of the START is needed
because without it a comma would be a separator for the instruction options.
maximize(start=%(p=%mslogisticp(theta),pstar=%msinit(p)),$
parmset=msparms+modelparms,$
method=bfgs,iters=400,pmethod=simplex,piters=5) markov * 1986:12
One thing to note is that the logistic mapping, while it makes estimation sim-
pler when a transition probability is near the boundary, does not fix the prob-
lem with the boundary itself. In order to get a true zero probability, you need an
index of $-\infty$ at the slot which needs to be zero, or of $+\infty$ on the others in the
column if the zero needs to be at the bottom of the column. That's apparent
in the output in Table 8.1, where the probability of moving from 1 to 3 is (effectively) zero; in order to represent that, the 1,1 and 1,2 elements need to be
quite large.
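In symbols, and following the normalization just described (the final element of each column of THETA pegged at zero), the mapping from the logistic indexes to the transition probabilities is
$$p_{ij} = \frac{e^{\theta_{ij}}}{\sum_{k=1}^{M} e^{\theta_{kj}}}, \qquad \theta_{Mj} = 0$$
so each column of P is a proper probability distribution over the destination regimes, but a probability can only reach exactly zero in the limit.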
8.2.3 EM
where the sum is over all possible histories of the regimes. Since
$$f(x, y \mid \theta) = f(y \mid x, \theta)\, f(x \mid \theta)$$
we can also usefully decompose (8.4) as
$$\sum_x \left(\log f(y \mid x, \theta)\right) p(x \mid y, \theta_0) + \sum_x \left(\log f(x \mid \theta)\right) p(x \mid y, \theta_0) \qquad (8.5)$$
The first term in this is relatively straightforward. Given x, $\log f(y \mid x, \theta)$ is just
the standard log likelihood for the model evaluated at that particular set of
regimes. The overall term is the probability-weighted average of the log likelihood, where the weights are the probabilities of the state histories evaluated at
$\theta_0$. In most cases, this will end up requiring a probability-weighted regression
of some form. Because the weighting probabilities are based upon $\theta_0$ rather
than the $\theta$ over which the M step optimizes, this term can be maximized separately from the second one: the first depends upon the regression parameters
in $\theta$, but not the transition parameters, while the second is the reverse.
Where there are no parameters shared across regimes, this part of the M step
can generally be done by looping over the regimes, using a standard estima-
tion instruction with the WEIGHT option, where the WEIGHT is the series of
smoothed probabilities for that regime. If there are shared parameters (for in-
stance, in a regression, a common variance), the calculation is quite a bit more
complicated.
In the second term in (8.5), using the standard trick of sequential conditioning
gives
$$f(x \mid \theta) = f(S_0 \mid \theta)\, f(S_1 \mid S_0, \theta) \cdots f(S_T \mid S_{T-1}, S_{T-2}, \ldots, S_0, \theta) \qquad (8.6)$$
By the Markov property, the conditional densities can be reduced to condition-
ing on just one lag. The second term can thus be rewritten as
$$\left(\log f(S_0 \mid \theta)\right) p(S_0 \mid y, \theta_0) + \sum_{t=1}^{T} \left(\log f(S_t \mid S_{t-1}, \theta)\right) p(S_t, S_{t-1} \mid y, \theta_0) \qquad (8.7)$$
The first term in (8.7) is quite inconvenient, and is generally ignored, with
the assumption that it will be negligible as a single term compared to the T
element sum that follows. That does, however, mean that EM won't converge
(exactly) to maximum likelihood, but will only be close.
The probability weights in the sum, $p(S_t, S_{t-1} \mid y, \theta_0)$, are smoothed probabilities
of the pair $(S_t, S_{t-1})$ computed at $\theta_0$. They have to be the smoothed estimates
because they are conditioned on the full y record. This requires a specialized
filtering and smoothing calculation operating on pairs of regimes, but it's the
same calculation for all underlying model types. We'll use the abbreviation:
$$p_{ij,t} = P\left(S_t = i,\; S_{t-1} = j \mid T\right)$$
Given those, the maximizer for the sum in (8.7) for a fixed transition matrix is
$$p_{ij} = \frac{\sum_{t=1}^{T} p_{ij,t}}{\sum_{i=1}^{M}\sum_{t=1}^{T} p_{ij,t}}$$
This includes all the functions from mssetup.src, plus additional ones for the
EM calculations in Markov Switching models. These will do almost everything
except the calculation of your model-specific likelihood functions, and the M
step for the model-specific parameters. In Example 8.2, we do 50 iterations of
the EM algorithm. Each iteration starts with:
@MSEMFilterInit
do time=gstart,gend
@MSEMFilterStep time RegimeF(time)
end do time
disp "Iteration" ### emits * "Log Likelihood" %logl
This executes the filter step on the pairs of current and lagged regimes.9 As
a side-effect, this combination computes the log likelihood into %LOGL. This
will increase from one iteration to the next, usually quickly at first, then more
slowly.
The above generates predicted and filtered versions of the probabilities of the
regime pairs. The next part computes the smoothed probabilities of the regime
pairs and their marginal to the current regime:
@MSEMSmooth
gset psmooth gstart gend = MSEMMarginal(MSEMpt_sm(t))
These two steps are basically the same for any model, other than the calcu-
lation of the REGIMEF function. The next part is where the model will mat-
ter. In the case of the switching variances, the M step for the variances does
probability-weighted averages of the squares of the data:
do i=1,nstates
sstats gstart gend psmooth(t)(i)*ew_excs^2>>wsumsq $
psmooth(t)(i)>>wts
compute sigmas(i)=wsumsq/wts
end do i
The final procedure from the support routines does the M step for the transi-
tions:
@MSEMDoPMatrix
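The formula given earlier for $p_{ij}$ makes this a simple normalization of expected transition counts. As an illustrative sketch (Python/NumPy, not the @MSEMDoPMatrix code itself; pair_probs is assumed to be a T x M x M array with pair_probs[t,i,j] equal to $p_{ij,t}$):

import numpy as np

def m_step_transitions(pair_probs):
    counts = pair_probs.sum(axis=0)                     # expected counts of j -> i transitions
    return counts / counts.sum(axis=0, keepdims=True)   # normalize each "from j" column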
MCMC treats the regimes as a separate set of parameters. There are three basic
steps in this:
1. Draw the regimes given the model parameters and the transition param-
eters.
2. Draw the model parameters given the regimes (transition parameters
generally don't enter).
3. Draw the transition parameters given the regimes (model parameters
generally don't enter).
The first step was already discussed above (Section 8.1.4). In practice, we may
need to reject any draws which produce too few entries in one of the regimes
to allow safe estimation of the model parameters for that regime. The second
will require a standard set of techniques; it's just that they will be applied to
9
This is needed for the inference on the transitions, and there's no advantage to doing a
separate filter/smooth operation just on the current regime, since those probabilities are just
marginals of the regime pair calculation.
subsamples determined by the draws for the regimes. The only thing (slightly)
new here will be the third step. This will be similar to the analogous step in
the mixture model except that the analysis has to be carried separately for
each value of the source regime. In other words, for each j, we look at the
probabilities of moving from j at t 1 to i at t. We count the number of cases
in each bin, combine it with a weak Dirichlet prior and draw column j in
the transition matrix from a Dirichlet distribution. Each column will generally
need its own settings for the prior, as standard practice is for a weak prior but
one which favors the process staying in its current regime. In our examples,
we will represent the prior as a VECT[VECT] with a separate vector for each
regime. For instance, in Example 8.3, we use
dec vect[vect] gprior(nstates)
compute gprior(1)=||8.0,1.0,1.0||
compute gprior(2)=||1.0,8.0,1.0||
compute gprior(3)=||1.0,1.0,8.0||
which gives the probability of staying in the current regime a prior mean of .8. The
obvious initial values for the P matrix are the means:
ewise p(i,j)=gprior(j)(i)/%sum(gprior(j))
The draw for the P matrix can be done by using the procedure @MSDrawP, which
takes the input gprior in the form we've just described, along with the sampled
values of MSRegime and produces a draw for P. This same line will appear in
the MCMC estimation in each of the chapters on Markov switching:
@MSDrawP(prior=gprior) gstart gend p
In addition, MCMC has the issue of label switching, which isn't shared by ML
and EM, both of which simply pick a particular set of labels. For models with
an obvious ordering, we can proceed as described in Section 7.3.1. However,
it's not always immediately clear how to order a model. In an AR model, you
might, for instance, need to compute the process means as a function of the
coefficients and order on that.
A final thing to note is that Markov Switching models can often have multiple
modes aside from simple label switches: you could have modes with switches
between high and low mean, between high and low variances, between normal
and outlier data points. ML and EM will generally find one (though not always
the one with the highest likelihood). MCMC may end up visiting several of
these, so simple sample statistics like the mean of a parameter across draws
might not be a good description. It's not a bad idea to also include estimated
densities.
Kim and Nelson estimate this model using Gibbs Sampling as their Application
1 in Section 9.2. We will do several things differently than is described there.
Most of the changes are designed to make the algorithm easier to modify for
a different number of regimes. First, K&N draw the transition probabilities
by using a sequence of draws from the beta, first drawing the probability of
staying vs moving, then dividing the probability of moving between the two
remaining regimes. This is similar to, but not the same as, drawing the full
column at one time using the Dirichlet (which is what well do), and is much
more complicated. Second, they draw the variances sequentially, starting with
the smallest, working towards the largest, enforcing the requirement that the
variances stay in increasing order by rejecting draws which put the variances
out of order. This will work adequately10 as long as all the regimes are well-
populated. However, if the regime at the end (in particular) has a fairly small
number of members, the draws for its variance in their scheme can be quite
erratic.
Instead of their sequential method, we'll use a (symmetrical) hierarchical prior
as described in Appendix C and switch labels to put the variances in the desired
order. The draws for the hierarchical prior take the following form: this draws
the common variance taking the ratios as given,11 then the regime variances
given SCOMMON.
sstats gstart gend ew_excs^2*scommon/sigmas(MSRegime(t))>>sumsqr
compute scommon=(sumsqr+nucommon*s2common)/$
%ranchisqr(%nobs+nucommon)
do i=1,nstates
sstats(smpl=MSRegime(t)==i) gstart gend ew_excs^2/scommon>>sumsqr
compute sigmas(i)=scommon*(sumsqr+nuprior(i))/$
%ranchisqr(%nobs+nuprior(i))
end do i
The second stage in this requires the degrees of freedom for the prior for each
component (the NUPRIOR vector), which is 4 for all regimes in our example.
The first stage allows for informative priors on the common variance (using
NUCOMMON and S2COMMON), but we're using the non-informative zero value for
NUCOMMON.
With the order in which the parameters are drawn (regimes, then variances,
then transition probabilities), the label switching needs to correct the sigmas
and the regimes, but doesn't have to fix the transitions since they get computed
afterwards:
10
Though it becomes more complicated as the number of regimes increases.
11
This never actually computes the ratios, but instead uses the implied value of
SIGMAS(i)/SCOMMON.
compute swaps=%index(sigmas)
*
* Relabel the sigmas
*
compute temp=sigmas
ewise sigmas(i)=temp(swaps(i))
*
* Relabel the regimes
*
gset MSRegime gstart gend = swaps(MSRegime(t))
The sections of the MCMC loop which draw the regimes and draw the transition
matrices are effectively the same for all models.
set p2 = psmooth(t)(2)
set p3 = psmooth(t)(3)
graph(style=stacked,maximum=1.0,picture="##.##",$
header="Smoothed Probabilities of Variance Regimes",key=below,$
klabels=||"Low Variance","Medium Variance","High Variance"||) 3
# p1
# p2
# p3
*
set variance = p1*sigmas(1)+p2*sigmas(2)+p3*sigmas(3)
graph(footer="Figure 4.10 Estimated variance of historical stock returns")
# variance
*
set stdu = ew_excs/sqrt(variance)
graph(footer="Figure 4.11a Plot of standardized stock returns")
# stdu
ewise p(i,j)=gprior(j)(i)/%sum(gprior(j))
*
* Initialize the sigmas
*
stats ew_excs
compute scommon =%variance
compute sigmas(1)=0.2*scommon
compute sigmas(2)=1.0*scommon
compute sigmas(3)=5.0*scommon
*
* Hierarchical prior for sigmas
* Uninformative prior on the common component.
*
compute nucommon=0.0
compute s2common=0.0
*
* Priors for the relative variances
*
dec vect nuprior(nstates)
dec vect s2prior(nstates)
compute nuprior=%fill(nstates,1,4.0)
compute s2prior=%fill(nstates,1,1.0)
*********************************************************************
*
* RegimeF returns a vector of likelihoods for the various regimes at
* <<time>>. The likelihoods differ in the regimes based upon the
* values of sigmas.
*
function RegimeF time
type vector RegimeF
type integer time
*
local integer i
*
dim RegimeF(nstates)
do i=1,nstates
compute RegimeF(i)=exp(%logdensity(sigmas(i),ew_excs(time)))
end do i
end
*********************************************************************
@MSFilterSetup
*
* Initialize the regimes
*
gset MSRegime gstart gend = 1
*
* This is a smaller number of draws than would be desired for a final
* result.
*
compute nburn=2000,ndraws=10000
*
* For convenience in saving the draws for the parameters, create a
* non-linear PARMSET with the parameters we want to save. We can use
* the %PARMSPEEK function to get a VECTOR with the values for saving
compute swaps=%index(sigmas)
*
* Relabel the sigmas
*
compute temp=sigmas
ewise sigmas(i)=temp(swaps(i))
*
* Relabel the regimes
*
gset MSRegime gstart gend = swaps(MSRegime(t))
******************************************************************
*
* Draw ps
*
@MSDrawP(prior=gprior) gstart gend p
infobox(current=draw)
*
* Once we're past the burn-in, save results.
*
if draw>0 {
******************************************************************
*
* Combine p and the sigma vector into a single vector and save
* it for each draw.
*
compute bgibbs(draw)=%parmspeek(mcmcparms)
*
* Update the sum of the occurrence of each regime, and the sum
* of the variance of each entry.
*
do i=1,nstates
set sstates(i) = sstates(i)+(MSRegime(t)==i)
end do i
set estvariance = estvariance+sigmas(MSRegime(t))
******************************************************************
}
end do draw
infobox(action=remove)
*
* Compute means and standard deviations from the Gibbs samples
*
@mcmcpostproc(ndraws=ndraws,mean=bmeans,stderrs=bstderrs) bgibbs
*
* Put together a report similar to table 9.2. Note that we're
* including the 3rd row in the saved statistics, so we have to skip
* over a few slots to get just the ones from the table.
*
report(action=define)
report(atrow=1,atcol=2,tocol=3,span) "Posterior"
report(atrow=2,atcol=2) "Mean" "SD"
report(atrow=3,atcol=1) "$p_{11}$" bmeans(1) bstderrs(1)
report(atrow=4,atcol=1) "$p_{12}$" bmeans(2) bstderrs(2)
report(atrow=5,atcol=1) "$p_{21}$" bmeans(4) bstderrs(4)
report(atrow=6,atcol=1) "$p_{22}$" bmeans(5) bstderrs(5)
In the example in Chapter 8, each regime had only one free parameter: its
variance. This makes it very easy to control the interpretation of the regimes.
It's still certainly possible to have more than one non-trivial local mode for the
likelihood, but each mode will place the regimes in an identifiable order.
Things are considerably more complicated with the subject of this chapter,
which is a Markov switching linear regression. Even in the simplest case where
the only explanatory variable is a constant, if the variance also switches with
the regime, then there are two parameters governing each regime. Unless either
the mean or the variance is rather similar across regimes, you won't know for
certain whether the regimes represent mainly a difference in variance, a difference
in means, or some combination of the two. In fact, in the example that
we are using in this chapter, the interpretation desired by the author (that the
regimes separate based upon the mean level of the process) doesn't seem to be
the one supported by the data; instead, it's the variance that separates them.
This difficulty arises because the regime can only be inferred from the estimates,
unlike a TAR or STAR model, where a computable condition determines
which regime holds. It's not difficult to estimate a fully-switching linear
regression; it just may be hard to make sense of the results.
9.1 Estimation
The ML and EM programs will be quite similar and relatively simple for all
such models, once you've read in the data and executed the procedure described
next to set up the model. The Bayesian estimates will naturally be somewhat
different from application to application because of the need for a model-specific
prior.
The support routines for Markov switching linear regressions are in the
file MSREGRESSION.SRC. You'll generally pull this in by executing the
@MSRegression procedure:
@MSRegression( options ) depvar
# list of regressors
The example that we'll use is based upon the application from section 4.5 in
Tsay (2010).1 This is a two-lag autoregression on the change in the unemployment
rate, estimated using two regimes. It is basically the same model (with
a slightly different data set) used as an example of TAR in Example 4.1. Unlike
the Hamilton model described in Chapter 22 of Hamilton (1994), this
allows the intercept, lag coefficients and variance all to switch, while Hamilton
heavily constrains the autoregression to allow only the mean to switch. The
Hamilton approach is much more complicated to implement, but makes for
easier interpretation by allowing only the one switching parameter per regime.
The setup code, which will be used in all three estimation methods, is:
@MSRegression(switch=ch,states=2) dx
# constant dx{1 2}
(The @MSRegInitial procedure sets the initial guess values, with the variance
guesses ordered from low to high as the regime number increases.) If you expect
the switch to be (for instance) more on a slope coefficient, you would have to use
some other method to set the initial guess values for the BETAS vectors.
compute gstart=1948:4,gend=1993:4
@MSRegInitial gstart gend
Since we're allowing free estimation of both the regression coefficients and the
variances, we need to avoid the problem with zero variance spikes. To that end,
we compute a lower limit on the variance that we can use for the REJECT option
on the MAXIMIZE. Here, we make that a small multiple of the smaller guess
value for the variance. Since the problem comes only when the variance is
(effectively) zero, any small, but finite, bound should work without affecting the
ability to locate the interior solution. The function %MSRegInitVariances is
needed at the start of each function evaluation to make sure the fixed vs. switching
variances are handled properly; as a side effect, it returns the minimum of
the variances.
compute sigmalimit=1.e-6*%minvalue(sigsqv)
*
frml logl = f=%MSRegFVec(t),fpt=%MSProb(t,f),log(fpt)
@MSFilterInit
maximize(start=%(sigmatest=%MSRegInitVariances(),pstar=%msinit()),$
parmset=regparms+msparms,$
reject=sigmatest<sigmalimit,$
method=bfgs,pmethod=simplex,piters=5) logl gstart gend
As you can see, this is very similar to the estimation code in the previous chapter.
In all three examples, we'll include a graph of the estimated (smoothed)
probabilities of regime 1. Each estimation technique will use a different method to
obtain this. With ML, we need to do the extra step of smoothing the filtered
estimates, since those aren't needed in the course of the estimation itself.
@MSSmoothed gstart gend psmooth
set p1smooth = psmooth(t)(1)
graph(footer="Smoothed Probabilities of Regime 1",max=1.0,min=0.0)
# p1smooth
Finally, this computes the conditional means across regimes by looking at the
implied process means of the AR(2) models:
do i=1,2
disp "Conditional Mean for Regime" i $
betas(i)(1)/(1-betas(i)(2)-betas(i)(3))
end do i
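As a quick numerical illustration of that formula (the numbers here are made up,
not from the example), the implied process mean of an AR(2) with intercept c and
lag coefficients b1 and b2 is c/(1 - b1 - b2):

def ar2_mean(c, b1, b2):
    # unconditional mean of y(t) = c + b1*y(t-1) + b2*y(t-2) + e(t)
    return c / (1.0 - b1 - b2)

print(ar2_mean(0.05, 0.3, 0.2))   # 0.05/(1 - 0.5) = 0.1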
9.1.5 MCMC (Gibbs Sampling)
A quick comparison of the two previous examples with Example 9.3 would
confirm that MCMC is quite a bit harder to set up for this case than ML or EM.
At least in comparison with EM, this is somewhat misleading, because there's
quite a bit of calculation in the @MSRegEMStep procedure; it's just that it takes
the same form for all switching linear regressions, while MCMC requires more
model-specific calculations.
2 This uses BHHH rather than BFGS because there will only be minor adjustments due to the slight difference in likelihoods between EM and ML. This won't give BFGS enough iterations to give a reliable covariance matrix estimate.
The great advantage of Gibbs sampling is that it can give you a better idea of
the shape of the likelihood. For instance, in this model, the intended interpretation
of the two regimes is that one is a low mean, and the other a high mean.
However, a careful look at the results from MCMC shows that it's more of a low
variance/high variance split.
As in Section 8.2.4, we need priors for the transition probabilities:
dec vect[vect] gprior(nstates)
compute gprior(1)=||8.0,2.0||
compute gprior(2)=||2.0,8.0||
dec vect tcounts(nstates)
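The draw for the transition probabilities is the standard conjugate one: count the
sampled transitions out of each regime, add the Dirichlet prior counts, and draw
each regime's transition probabilities from the resulting Dirichlet. A schematic
NumPy version (hypothetical names, using the convention that row i holds the
probabilities of moving out of regime i; @MSDrawP handles the actual draw with
its own conventions) is:

import numpy as np

def draw_transitions(regimes, gprior, rng=np.random.default_rng()):
    # regimes: 0-based regime labels over the sample (NumPy array);
    # gprior: (M, M) prior counts, gprior[i] for transitions out of regime i.
    M = gprior.shape[0]
    counts = np.zeros((M, M))
    for a, b in zip(regimes[:-1], regimes[1:]):
        counts[a, b] += 1.0
    return np.vstack([rng.dirichlet(gprior[i] + counts[i]) for i in range(M)])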
What we need now that we didn't before is a prior for the regression coefficients.
In order to keep the posterior symmetric in permutations of the regimes, we use
the same (Normal) prior for both. This is specified using a mean VECTOR and a
precision (inverse covariance) SYMMETRIC matrix. In this case, the mean is
zero on all coefficients; the precision is 0 on the constant (which means a flat
prior) and 4 (for a standard deviation of $1/\sqrt{4} = .5$) on the two lag coefficients.
This is a fairly loose prior for an autoregression where we would expect the
persistence to be fairly low, since it's on the first differences.
dec symm hprior
dec vect bprior
compute hprior=%diag(||0.0,4.0,4.0||)
compute bprior=%zeros(%nreg,1)
Inside the simulation loop, the first step is drawing the variances. The procedure
@MSRegResids computes the regime-specific residuals that are needed
for this. Given those, the draws for the common variance factor and the regime-specific
variances are basically the same as in the previous chapter:
sstats gstart gend resids^2*scommon/sigsqv(MSRegime(t))>>sumsqr
compute scommon=(sumsqr+nucommon*s2common)/$
%ranchisqr(%nobs+nucommon)
do i=1,nstates
sstats(smpl=MSRegime(t)==i) gstart gend resids^2/scommon>>sumsqr
compute sigsqv(i)=scommon*(sumsqr+nuprior(i))/$
%ranchisqr(%nobs+nuprior(i))
end do i
The draws for the coefficients are complicated a bit by this being an autoregression.
Conventionally, we don't want either AR polynomial to be explosive.
Unlike a SETAR model, where a regime can be explosive because the threshold
will cause the process to leave that regime once the value gets too large, the
Markov switching process doesn't react to the level of the dependent variable.
If we make the prior Normal as above, but truncated to the stable region of
the AR process, we can implement this by drawing from the posterior Normal
and rejecting a draw if it has unstable roots. %MSRegARIsUnstable takes a
VECTOR of coefficients for an autoregression and returns 1 if they have an
unstable (unit or explosive) root. Here, the first coefficient in each of the BETAS
vectors is for the constant, so we test the VECTOR made up of the second and
third. If you have a linear regression for which there are no stability issues
like this, you would just delete the IF and GOTO lines.
:redrawbeta
do i=1,nstates
cmom(smpl=(MSRegime==i),equation=MSRegEqn) gstart gend
compute betas(i)=%ranmvpostcmom($
%cmom,1.0/sigsqv(i),hprior,bprior)
if %MSRegARIsUnstable(%xsubvec(betas(i),2,3))
goto redrawbeta
end do i
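Conceptually, the rejection step amounts to the following NumPy sketch
(hypothetical names; in the program, %RANMVPOSTCMOM produces the actual
posterior draw from the cross-moment matrix): draw the coefficients from their
Normal posterior and redraw whenever the implied AR polynomial has a unit or
explosive root.

import numpy as np

def ar_is_unstable(phi):
    # Roots of z^p - phi1*z^(p-1) - ... - phip; the AR process is
    # stationary only if all of these roots are strictly inside the
    # unit circle.
    coeffs = np.concatenate(([1.0], -np.asarray(phi, dtype=float)))
    return np.any(np.abs(np.roots(coeffs)) >= 1.0)

def draw_stationary_beta(post_mean, post_cov, rng=np.random.default_rng()):
    # post_mean, post_cov: posterior mean and covariance of
    # (intercept, phi1, phi2); redraw until the AR part is stable.
    while True:
        beta = rng.multivariate_normal(post_mean, post_cov)
        if not ar_is_unstable(beta[1:]):
            return beta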
We now have draws for all the regression parameters. We want to re-label
so as to make the process means go from low to high. In other applications,
you might re-label based upon the intercept or one of the slope coefficients, or
possibly the variance. The means are functions of the regression coefficients:
do i=1,nstates
compute mu(i)=betas(i)(1)/(1-betas(i)(2)-betas(i)(3))
end do i
compute swaps=%index(mu)
We need to swap the coefficient vectors, the variances, and the rows and columns
of the transition matrix based upon the sorting index for the means.
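In compact form (a NumPy sketch with hypothetical names), that relabeling
permutes each parameter array with the sorting index for the means and permutes
both the rows and the columns of the transition matrix:

import numpy as np

def relabel_by_means(mu, betas, sigsq, P):
    # mu: process means; betas: (nstates, k) coefficient array;
    # sigsq: variances; P: (nstates, nstates) transition matrix.
    order = np.argsort(mu)        # new regime i = old regime order[i]
    return mu[order], betas[order], sigsq[order], P[np.ix_(order, order)]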
We now need to draw the regimes, which is relatively simple using
the @MSRegFilter procedure to do the filtering step, then the standard
@MSSAMPLE procedure to sample the regimes.
@MSRegFilter gstart gend
@MSSample gstart gend MSRegime
This also includes a check that the number of entries in each regime is at or
above a minimum level (here 5). While not strictly necessary3, this is a common
step.
3 We have an informative prior on the coefficients and on the variance, so the estimates can't collapse to a singular matrix or zero variance.
do i=1,nstates
sstats gstart gend MSRegime==i>>tcounts(i)
end do i
if %minvalue(tcounts)<5 {
disp "Draw" draw "Redrawing regimes with regime of size" $
%minvalue(tcounts)
goto redrawregimes
}
Note that the program includes messages whenever there's a redraw or swap,
which helps you see how smoothly the simulations are going. If you run this,
you will note that there are quite a few swaps to put the means back in order.
That's a sign that the regimes aren't particularly well described based upon
their means. Example 9.3 is actually only about half of the program file; the
full file SMS 9 3.RPF includes an alternative where the labels are switched
based upon the variance. That turns out to be much more successful from the
sampling standpoint, though it might not be as interesting a result.
*
* Used for relabeling
*
dec vect mu(nstates)
dec vect[vect] tempbetas(nstates)
*
compute nburn=2000,ndraws=5000
nonlin(parmset=mcmcparms) betas mu sigsqv p
*
* Bookkeeping arrays
*
dec series[vect] bgibbs
gset bgibbs 1 ndraws = %parmspeek(mcmcparms)
set regime1 gstart gend = 0.0
infobox(action=define,lower=-nburn,upper=ndraws,progress) $
"Gibbs Sampling"
do draw=-nburn,ndraws
*
* Draw sigmas given betas and regimes
*
@MSRegResids(regime=MSRegime) resids gstart gend
*
* Draw the common variance factor given the relative variances and
* regimes.
*
sstats gstart gend resids^2*scommon/sigsqv(MSRegime(t))>>sumsqr
compute scommon=(sumsqr+nucommon*s2common)/$
%ranchisqr(%nobs+nucommon)
*
* Draw the relative variances, given the common variances and the
* regimes.
*
do i=1,nstates
sstats(smpl=MSRegime(t)==i) gstart gend resids^2/scommon>>sumsqr
compute sigsqv(i)=scommon*(sumsqr+nuprior(i))/$
%ranchisqr(%nobs+nuprior(i))
end do i
*
* Draw betas given sigmas and regimes
*
:redrawbeta
do i=1,nstates
cmom(smpl=(MSRegime==i),equation=MSRegEqn) gstart gend
compute betas(i)=%ranmvpostcmom($
%cmom,1.0/sigsqv(i),hprior,bprior)
if %MSRegARIsUnstable(%xsubvec(betas(i),2,3))
goto redrawbeta
end do i
*
* Relabel if necessary. They are ordered based upon the process means.
*
do i=1,nstates
compute mu(i)=betas(i)(1)/(1-betas(i)(2)-betas(i)(3))
end do i
*
compute swaps=%index(mu)
if swaps(1)==2
disp "Draw" draw "Executing swap"
*
* Relabel the mus
*
compute tempmu=mu
ewise mu(i)=tempmu(swaps(i))
*
* Relabel the betas
*
ewise tempbetas(i)=betas(i)
ewise betas(i)=tempbetas(swaps(i))
*
* Relabel the sigmas
*
compute tempsigsq=sigsqv
ewise sigsqv(i)=tempsigsq(swaps(i))
*
* Relabel the transitions
*
compute tempp=p
ewise p(i,j)=tempp(swaps(i),swaps(j))
*
* Draw the regimes
*
@MSRegFilter gstart gend
:redrawregimes
@MSSample(counts=tcounts) gstart gend MSRegime
if %minvalue(tcounts)<5 {
disp "Draw" draw "Redrawing regimes with regime of size" $
%minvalue(tcounts)
goto redrawregimes
}
*
* Draw ps
*
@MSDrawP(prior=gprior) gstart gend p
infobox(current=draw)
if draw>0 {
*
* Do the bookkeeping
*
set regime1 gstart gend = regime1+(MSRegime==1)
compute bgibbs(draw)=%parmspeek(mcmcparms)
}
end do draw
infobox(action=remove)
*
@mcmcpostproc(ndraws=ndraws,mean=bmeans,stderrs=bstderrs) bgibbs
*
report(action=define)
report(atrow=1,atcol=1,fillby=cols) %parmslabels(mcmcparms)
report(atrow=1,atcol=2,fillby=cols) bmeans
report(atrow=1,atcol=3,fillby=cols) bstderrs
report(action=format,picture="*.###")
report(action=show)
*
set regime1 gstart gend = regime1/ndraws
graph(header="MCMC Probability of Low Mean Regime")
# regime1
Chapter 10
Markov Switching Multivariate Regressions
This generalizes the model in Chapter 9 to allow more than one dependent variable.
The number of free parameters greatly increases: in an n variable model,
there are n times as many regression coefficients, and the variance parameter(s)
are replaced by an $n \times n$ covariance matrix. Needless to say, this greatly increases
the difficulty with properly labeling the regimes, since you could, for instance,
have equation variances for different components going in different directions
between regimes.
10.1 Estimation
As in the previous chapter, ML and EM programs are relatively simple to set
up. However, ML rapidly becomes infeasibly slow as the model gets large.
The support routines for Markov switching linear systems regressions are in
the file MSSYSREGRESSION.SRC. You'll generally pull this in by executing the
@MSSysRegression procedure:
@MSSysRegression( options )
# list of dependent variables
# list of regressors
The example that we'll use is based upon the model in Ehrmann, Ellison, and
Valla (2003).1 This is a three variable, three lag vector autoregression on capacity
utilization, consumer prices and oil prices. Because all coefficients are
switching, this can be handled using the simpler MSSYSREGRESSION procedures
rather than the MSVARSETUP procedures that will be covered in Chapter 11.
The setup code, which will be used in all three estimation methods, is:
@mssysregression(states=2,switch=ch)
# logcutil logcpi logpoil
# constant logcutil{1 to 3} logcpi{1 to 3} logpoil{1 to 3}
If there were any fixed coefficients, we would also need to include GAMMASYS in
REGPARMS.
Because there are so many possible ways to jiggle the switching parameters to
distinguish the regimes, @MSSysRegInitial doesn't attempt to do anything
other than initialize the parameters based upon a multivariate regression,
with all the BETASYS matrices made equal to the same full-sample estimates.
In this example, we start with a higher variance for the oil price for regime 2
and lower variances for the macro variables with
compute gstart=1973:4,gend=2000:12
@MSSysRegInitial gstart gend
compute sigmav(2)=%diag(%xdiag(sigmav(1)).*||.25,.25,4.0||)
This takes the diagonal of the regime 1 covariance matrix and scales the elements
by .25, .25 and 4.0. Since there are likely to be several local modes, you'll probably need to
experiment a bit and see how sensitive the estimates are to the choice for the
guess values.
With a multivariate regression with switching covariance matrices, we not only
have to worry about likelihood spikes with (near) zero variances, but also
near-singular matrices with non-zero values. The appropriate protection against
this is based upon the log determinant of the covariance matrix: we can't allow
that to get too small for one regime relative to the others. The function
%MSSysRegInitVariances() returns the minimum log determinant of the
covariance matrices. A reasonable lower limit for that is something like 12
times the number of dependent variables less than the full-sample log determinant.
The following is the estimation code for maximum likelihood, including
the rejection test for small variances:
compute logdetlimit=%MSSysRegInitVariances()-12.0*3
*
frml logl = f=%MSSysRegFVec(t),fpt=%MSProb(t,f),log(fpt)
@MSFilterInit
maximize(start=$
%(logdet=%MSSysRegInitVariances(),pstar=%MSSysRegInit()),$
parmset=regparms+msparms,reject=logdet<logdetlimit,$
method=bfgs,iters=500,pmethod=simplex,piters=5) logl gstart gend
Note that this hangs up at a local mode, which actually doesn't show behavior
all that much different from the better mode found using EM.
The full program is Example 10.1.
10.1.5 MCMC (Gibbs Sampling)
We'll use the same priors for the transitions as in Example 9.3:
dec vect[vect] gprior(nstates)
compute gprior(1)=||8.0,2.0||
compute gprior(2)=||2.0,8.0||
dec vect tcounts(nstates)
We'll again use a hierarchical prior for the covariance matrices; the extension
to Wishart distributions is described on page 210.
dec vect nuprior(nstates)
compute nuprior=%fill(nstates,1,6.0)
*
dec vect[series] vresids(nvar)
dec vect[symm] uu(nstates)
In this example, we're using a flat prior on the regression coefficients. A
Minnesota-type prior might make more sense, but would be more complicated to
set up.
dec symm hprior
dec vect bprior
compute hprior=%zeros(nvar*nreg,nvar*nreg)
compute bprior=%zeros(nvar*nreg,1)
Inside the simulation loop, the first step is drawing the covariance matrices.
The procedure @MSSysRegResids computes the VECT[SERIES] of residuals,
using the appropriate coefficients in the currently sampled regimes. We then
need to compute the cross products of those residuals in the subsample for each
regime (into UU(I)), and the count of the subsample into TCOUNTS(I).
@MSSysRegResids(regime=MSRegime) vresids gstart gend
ewise uu(i)=%zeros(nvar,nvar)
ewise tcounts(i)=0.0
do time=gstart,gend
compute uu(MSRegime(time))=uu(MSRegime(time))+$
%outerxx(%xt(vresids,time))
compute tcounts(MSRegime(time))=tcounts(MSRegime(time))+1
end do time
The common matrix is then drawn using the following:
compute uucommon=%zeros(nvar,nvar)
do k=1,nstates
compute uucommon=uucommon+nuprior(k)*inv(sigmav(k))
end do k
compute sigma=%ranwishartf(%decomp(inv(uucommon)),%sum(nuprior))
Given the new SIGMA, the regime-specific covariance matrices are drawn by:
do k=1,nstates
compute sigmav(k)=%ranwisharti($
%decomp(inv(uu(k)+nuprior(k)*sigma)),tcounts(k)+nuprior(k))
end do k
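A schematic SciPy version of this hierarchical draw (hypothetical names; the
exact scale and degrees-of-freedom conventions of %RANWISHARTF and
%RANWISHARTI may differ in detail from SciPy's) is:

import numpy as np
from scipy.stats import wishart, invwishart

def draw_covariances(uu, tcounts, sigmav, nuprior):
    # uu[k]: cross product of the regime-k residuals; tcounts[k]: number of
    # observations in regime k; sigmav[k]: current regime covariance
    # matrices; nuprior[k]: prior degrees of freedom.
    # Common matrix, given the regime-specific ones.
    uucommon = sum(nu * np.linalg.inv(s) for nu, s in zip(nuprior, sigmav))
    sigma = wishart.rvs(df=sum(nuprior), scale=np.linalg.inv(uucommon))
    # Regime-specific matrices, given the common one.
    sigmav_new = [invwishart.rvs(df=t + nu, scale=u + nu * sigma)
                  for u, t, nu in zip(uu, tcounts, nuprior)]
    return sigma, sigmav_new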
We now need to draw the regimes and then the transition matrix. The only
difference between this and the univariate model is the very first line, which is
@MSSysRegFilter gstart gend
When only some of the coefficients switch (so the model is still linear), it's simpler,
both with EM and with MCMC, to break the optimization problem up into separate
parameter sets, treating the fixed and the switching coefficients separately. This
is fairly standard practice in MCMC, where we always need to block the parameters
for convenience. In the case of EM, it means that the M step doesn't maximize
the probability-weighted likelihood, but merely improves it. The use of a partial
rather than a full optimization is known as Generalized EM.
There are three blocks of parameters: the covariance matrices, the fixed
regression coefficients and the switching regression coefficients. The simplest
process is to do the regression coefficients first. Taking those as given, we can
compute the residuals and estimate the covariance matrices using straightforward
methods. The switching regression coefficients aren't especially hard
given the other two: we subtract off the contribution of the fixed regressors
from the dependent variables and treat those partial residuals as if they were
the dependent variables in a (now) fully switching model. It's the estimation of
the fixed regression coefficients that's the most complicated of the three. This
requires stacking and weighting the equations.
If you use EM and the MSSysRegEM... procedures, all of this is done automatically.
It isn't if you do MCMC, since you could be using a prior or applying the
rejection method to some of your draws. For drawing the fixed coefficients, you
need something like:
@MSSysRegFixedCMOM(regimes=MSRegime) gstart gend xxfixed
compute gammasys=%reshape(%ranmvpostcmom($
xxfixed,1.0,hpriorGamma,bpriorGamma),MSSysRegNFix,nvar)
The 1.0 in the second argument of %RANMVPOSTCMOM is there because the
covariance matrices have already been incorporated into the cross product
matrix.
The draws for the switching coefficients are similar to those used when the entire
coefficient vector switches. You just need to compute the residuals from the
fixed part (done with @MSSysRegFixResids) and use a separate procedure for
the switching part.
*
dec symm hprior
dec vect bprior
compute hprior=%zeros(nvar*nreg,nvar*nreg)
compute bprior=%zeros(nvar*nreg,1)
*
compute nburn=2000,ndraws=5000
nonlin(parmset=allparms) betasys sigmav p
*
* For relabeling
*
dec vect voil(nstates)
dec vect[rect] tempbeta(nstates)
dec vect[symm] tempsigmav(nstates)
dec rect tempp(nstates,nstates)
*
* Bookkeeping arrays
*
dec series[vect] bgibbs
gset bgibbs 1 ndraws = %parmspeek(allparms)
set regime1 gstart gend = 0.0
infobox(action=define,lower=-nburn,upper=ndraws,progress) $
"Gibbs Sampling"
do draw=-nburn,ndraws
*
* Draw sigmas given betas and regimes
* Compute the regime-specific residuals
*
@MSSysRegResids(regime=MSRegime) vresids gstart gend
*
* Compute the sums of squared residuals for each regime.
*
ewise uu(i)=%zeros(nvar,nvar)
ewise tcounts(i)=0.0
do time=gstart,gend
compute uu(MSRegime(time))=uu(MSRegime(time))+$
%outerxx(%xt(vresids,time))
compute tcounts(MSRegime(time))=tcounts(MSRegime(time))+1
end do time
compute uucommon=%zeros(nvar,nvar)
do k=1,nstates
compute uucommon=uucommon+nuprior(k)*inv(sigmav(k))
end do k
*
* Draw the common sigma given the regime-specific ones.
*
compute sigma=%ranwishartf(%decomp(inv(uucommon)),%sum(nuprior))
*
* Draw the regime-specific sigmas given the common one.
*
do k=1,nstates
compute sigmav(k)=%ranwisharti($
%decomp(inv(uu(k)+nuprior(k)*sigma)),tcounts(k)+nuprior(k))
end do k
*
* Draw betas given sigmas and regimes
*
do i=1,nstates
cmom(smpl=(MSRegime==i),model=MSSysRegModel) gstart gend
compute betasys(i)=%reshape(%ranmvkroncmom($
%cmom,inv(sigmav(i)),hprior,bprior),nreg,nvar)
end do i
*
* Relabel if necessary. They are ordered based upon the variance
* of oil.
*
ewise voil(i)=sigmav(i)(3,3)
compute swaps=%index(voil)
if swaps(1)==2
disp "Draw" draw "Executing swap"
*
* Relabel the betas
*
ewise tempbeta(i)=betasys(i)
ewise betasys(i)=tempbeta(swaps(i))
*
* Relabel the sigmas
*
ewise tempsigmav(i)=sigmav(i)
ewise sigmav(i)=tempsigmav(swaps(i))
*
* Relabel the transitions
*
compute tempp=p
ewise p(i,j)=tempp(swaps(i),swaps(j))
*
* Draw the regimes
*
@MSSysRegFilter gstart gend
:redrawregimes
@MSSample(counts=tcounts) gstart gend MSRegime
if %minvalue(tcounts)<5 {
disp "Draw" draw "Redrawing regimes with regime of size" $
%minvalue(tcounts)
goto redrawregimes
}
*
* Draw ps
*
@MSDrawP(prior=gprior) gstart gend p
infobox(current=draw)
if draw>0 {
*
* Do the bookkeeping
*
set regime1 gstart gend = regime1+(MSRegime==1)
compute bgibbs(draw)=%parmspeek(allparms)
}
end do draw
infobox(action=remove)
@mcmcpostproc(ndraws=ndraws,mean=bmeans,stderrs=bstderrs) bgibbs
*
report(action=define)
report(atrow=1,atcol=1,fillby=cols) %parmslabels(allparms)
report(atrow=1,atcol=2,fillby=cols) bmeans
report(atrow=1,atcol=3,fillby=cols) bstderrs
report(action=format,picture="*.###")
report(action=show)
*
set regime1 gstart gend = regime1/ndraws
graph(header="MCMC Probability of Low Variance Regime")
# regime1
Chapter 11
Markov Switching VARs
The complication in (11.2) is that the density function for $y_t$ depends upon the
current $\mu_t$ and on $p$ of its lags. Thus, if we have $M$ choices for each $\mu_t$, there
are $M^{p+1}$ branches in the likelihood, each with a distinct value. All three
estimation methods will need to take this into account. Note that estimation time
goes up at roughly that same $M^{p+1}$ factor, since the evaluation of the likelihoods
is a fairly big piece of the calculation.
The discussions in the previous chapters will handle all cases except the
switching means model, so that will be the focus of this chapter. You can use
the specialized set of MS-VAR procedures for switching intercept and coefficient
models as well, since they will be simpler to use when you have a VAR; it's just
that the description of what those are doing has been covered in the earlier
chapters.
11.1 Estimation
Most of the hard work is done inside the procedures described in Section 11.1.2.
Maximum likelihood and EM are both handled with programs very similar to what
we've seen with other types of regressions. MCMC is, however, definitely harder.
We're using $\mu_t$ to represent the value of the switching mean process at $t$, and
will use $\mu(s)$ to represent the value for this when $S_t = s$. So we need to estimate
the $\mu$, the lag coefficients $\phi$, and the variance(s) or covariance matrices.
Both EM and MCMC need to estimate the lag coefficients and the means
separately, each given the other. To estimate the means, we need to rearrange
(11.2) to
$$y_t - \phi_1 y_{t-1} - \ldots - \phi_p y_{t-p} = \mu_t - \phi_1 \mu_{t-1} - \ldots - \phi_p \mu_{t-p} + u_t \qquad (11.3)$$
Note that the left side of this is the same for all combinations of regimes. The
right side will be a linear combination of the $\mu(s)$, where the multipliers will
depend upon the precise combination of regimes $\{S_t, \ldots, S_{t-p}\}$. Since both $\mu(s)$
will, in general, appear on the right side, the $\mu$ have to be estimated jointly.1
Internally, the RATS procedures number the combinations of expanded regimes
as $1, \ldots, M^{p+1}$, with the longest lags varying most quickly; that is, with $M = 2$,
they will be in the order $\{1,\ldots,1,1\}$, $\{1,\ldots,1,2\}$, $\{1,\ldots,2,1\}$, $\{1,\ldots,2,2\}$, etc.
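A two-line sketch (hypothetical code, not part of the procedures) of that
enumeration, for M regimes and p lags:

from itertools import product

def expanded_regimes(M, p):
    # tuples (S_t, S_{t-1}, ..., S_{t-p}) with the longest lag varying fastest
    return list(product(range(1, M + 1), repeat=p + 1))

print(expanded_regimes(2, 1))   # [(1, 1), (1, 2), (2, 1), (2, 2)]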
We'll work with the original Hamilton model, which has switching means with
four lags and two regimes. That makes an expanded regime vector with 32
components, which makes it slower than a three or four variable VAR with the
simpler switching intercepts.
1 This isn't true with switching intercepts, where each one can be estimated separately.
@MSVARSetup( options )
# list of dependent variables
The option LAGS=# of VAR lags (with a default of 1) sets the number of lags.
This sets up an autoregression or VAR with each equation having a CONSTANT
and the lags of the dependent variables.
The options which control the switching are STATES=# of regimes, with
a default of 2, and SWITCH=[M]/I/MH/IH/C/CH, which determines what
switches: an M means switching means (fixed lags), an I means switching
intercepts (fixed lags), and C means intercept form with all coefficients switching.
An H suffix means that the variances (covariance matrices) switch as well.
Because it's designed (mainly) for vector autoregressions, the parameters that
are used are matrices, not scalars. These are:
In addition, there are the P and THETA forms for the transition probabilities.
Because the working set of regimes for an MS-VAR can take two different forms,
depending upon whether or not we use the switching mean variant, there is
a whole set of specialized procedures included. For instance, the filtering and
smoothing for the switching means model have to work with the expanded
regime. However, in the end, we only need the (marginal) probabilities of
the current regime. So the procedure @MSVARSmoothed does the calculation of
smoothed probabilities as required by the form of the model, but then marginalizes
if necessary to return only the probabilities for $S_t$.
As usual, the hard work is done by the procedures. Note that this is where
the earlier warnings about the (slow) speed of ML become most apparent. The
combination of several variables and relatively long lags creates a very large
parameter set with a substantial amount of calculation to evaluate the likelihood.
The setup for estimation is:
@msvarsetup(lags=4,switch=m)
# g
nonlin(parmset=msparms) p
nonlin(parmset=varparms) mu phi sigma
This is the combination of PARMSETs for this set of options or for SWITCH=I. If
you use SWITCH=C (or CH), replace PHI with PHIV. And if you use any of the H
choices, replace SIGMA with SIGMAV.
The next step sets the estimation range and gets a standard set of guess values.
compute gstart=1952:2,gend=1984:4
@msvarinitial gstart gend
11.1.4 EM
The E step here needs to smooth the expanded regime vector $\{S_t, \ldots, S_{t-p}\}$. A
generalized M step is used, with the values for $\phi$ and $\mu$ estimated separately
($\phi$ given the previous $\mu$, $\mu$ given the recalculated $\phi$).3 Even though each of these
is a type of linear regression, neither can be done using any standard regression.
The estimator for $\phi$ is a probability-weighted regression, but at each $t$, a
probability-weighted sum is required over all combinations of $\{S_t, \ldots, S_{t-p}\}$,
since each generates a different set of dependent and explanatory variables.
For a one-lag univariate model, the explicit formula is
$$\hat{\phi} = \frac{\sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{j=1}^{M} p_{ij,t}\left(y_{t-1}-\mu(j)\right)\left(y_t-\mu(i)\right)}{\sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{j=1}^{M} p_{ij,t}\left(y_{t-1}-\mu(j)\right)^2} \qquad (11.4)$$
The $\mu$ are estimated using the rearranged form (11.3). Again, the estimation
is a probability-weighted regression, where the sums are across both $t$ and
the combination of regimes. For the regime tuple $s_0, s_1, \ldots, s_p$, the explanatory
variable in the regression on $\mu(k)$ is
$$(s_0 == k) - \sum_{j=1}^{p} \phi_j\,(s_j == k)$$
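To make the weighting concrete, the following is a direct NumPy transcription of
(11.4) for the one-lag univariate case (hypothetical names; p[t, i, j] is the smoothed
probability of the pair S_t = i, S_{t-1} = j):

import numpy as np

def em_phi(y, mu, p):
    # y: observations y[0..T], with y[0] the pre-sample value;
    # mu: regime means; p: (T, M, M) array of smoothed pair probabilities.
    M = len(mu)
    num = den = 0.0
    for t in range(1, len(y)):
        for i in range(M):
            for j in range(M):
                num += p[t - 1, i, j] * (y[t - 1] - mu[j]) * (y[t] - mu[i])
                den += p[t - 1, i, j] * (y[t - 1] - mu[j]) ** 2
    return num / den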
This is similar to what we've seen before, but with a different set of procedures.
Because the E step needs to smooth over $p+1$ tuples of regimes rather than
just pairs, both the setup and EM step procedures need to be different. You'll
probably notice that the convergence of EM is slower (in terms of progress per
iteration) than it is with the simpler model. This is partly because we're doing
a generalized M step and thus not making as much progress at each step as we
would if we were doing a full M step. Also, the likelihoods are more dependent
upon the transition probabilities and not just the probabilities of the regimes
themselves, so the maximization problem changes more when P gets updated.
Again, we recommend that you do some BHHH iterations to polish the estimates
and get standard errors. Note, however, that with a VAR, particularly one with
switching coefficients, you could run out of data points to do BHHH. BHHH
uses the inverse of the cross product of the gradient vectors to estimate the
covariance matrix. But if the number of parameters exceeds the number of
data points, the matrix to be inverted is necessarily singular.
11.1.5 MCMC (Gibbs Sampling)
There are several complications in doing MCMC with a switching means VAR.
With a standard regression (univariate or systems) model, the samples for the
regression coefficients can be done using an off-the-shelf linear regression
simulator; it's just applied to the subsample formed by each regime in turn.
That no longer works with either the $\mu$ or the $\phi$, because the likelihoods
for each involve the regimes at the lags as well. Also, the $\mu$ should, ideally, be
estimated jointly, since that's the form of (11.3). This being Gibbs sampling, you
could do them sequentially, but that might tend to create a greater correlation
in the chain.
Sampling Variances
Sampling the variances (or covariance matrix) is no different from the previous
two chapters, once you have the residuals. Those can be computed using the
procedure MSVARResids. The MSVARSETUP procedure file defines a special
VECT[SERIES] for the residuals named MSVARU.
@MSVARResids(regime=MSRegime) MSVARU gstart gend
After the variances are drawn, %MSVARInitVariances() is called, which makes
sure all the variance information is in the proper locations for work in the
remaining steps.
Sampling Means
Given the other parameters, the likelihood for the means is based upon (11.3).
A cross-product matrix with the (variance-weighted) sample information for
this can be computed using:
@MSVARMuCMOM gstart gend xxem
The output from this (here called XXEM) takes the proper form for use with
the function %RANMVPOSTCMOM, where the data precision is 1.0, since the
variances have already been incorporated into the cross-product matrix. The
stacked vector of means (or intercepts; the same procedure will handle both)
is sampled with:
compute mustack=%ranmvpostcmom(xxem,1.0,hmu,bmu)
In our example, we'll use a symmetrical prior for the means and label-switch
based upon the sampled values. The mean (BMU) is a vector of zeros and the
precision (HMU) is a very loose diagonal matrix with .25s (meaning standard
deviation 2.0) on the diagonal.
MUSTACK is just a VECTOR with (# of regimes) × (# of variables) entries, which is
not the form used elsewhere. The procedure @MSVARMuUnstack takes it and
fills in the VECT[VECT] that is used for MU.
@MSVARMuUnstack mustack mu
Sampling Lag Coefficients
The cross-product matrix for the lag coefficients is computed with @MSVARPhiCMOM.
Note that XXEM will have quite different dimensions than it did when sampling
the means: the number of lag coefficients could be quite large. In this example,
we're using a mean (BPHI) vector of zeros with the precision (HPHI) as a diagonal
matrix with values of 4.0 (standard deviations .5). As with the means, the
output is a long VECTOR which needs to be re-packaged into the proper form;
the procedure @MSVARPhiUnstack does that.
@MSVARPhiCMOM gstart gend xxem
compute phistack=%ranmvpostcmom(xxem,1.0,hphi,bphi)
@MSVARPhiUnstack phistack phi
Sampling Regimes
As with EM, you need to forward filter using the expanded set of regimes, so the
MSFilterStep procedure used in the previous chapters won't work. Instead,
there's a separate procedure, @MSVARFilter, which does the forward filter on
whatever collection of regimes is required given the settings for the model.
Similarly, the backwards sampling requires a specialized routine as well, which
samples the expanded regimes: this is @MSVARSample. When you sample the
final period, you get $\{S_T, \ldots, S_{T-p}\}$, the next will get $\{S_{T-1}, \ldots, S_{T-p-1}\}$, backing
up until the start of the sample finally gives $\{S_1, S_0, \ldots, S_{-p+1}\}$, where the
pre-sample regimes $S_0$ to $S_{-p+1}$ are needed to compute the likelihood for the early
values of $t$.
Execution Time
The switching means model is again many times slower than the switching
intercepts model. Example 11.3 does a test number of draws (500 with 200 burn-in).
For a final production run, you would probably want to up those by a factor
of at least 10.
Results
Table 11.2 shows the mean and standard deviation from a run with 5000 saved
draws. This is similar to the ML estimates, but the means aren't quite as widely
separated and have higher standard errors. If we look at the graphs of the
densities of the simulated means (Figure 11.1), we can see that the two densities
skew towards each other. There's a gray zone of values around .5%
quarter-to-quarter growth which isn't clearly either expansion or recession.
This wouldn't be as obvious from looking at ML alone.
[Figure 11.1: Densities of the simulated means]
compute gstart=1952:2,gend=1984:4
@msvarinitial gstart gend
*
* Estimate the model by maximum likelihood.
*
frml msvarf = log(%MSVARProb(t))
maximize(parmset=varparms+msparms,$
start=(pstar=%MSVARInit()),$
reject=%MSVARInitTransition()==0.0,$
pmethod=simplex,piters=5,method=bfgs,iters=300) msvarf gstart gend
*
* Compute smoothed estimates of the regimes.
*
@msvarsmoothed gstart gend psmooth
set pcontract gstart gend = psmooth(t)(1)
*
* To create the shading marking the recessions, create a dummy series
* which is 1 when the recessq series is 1, and 0 otherwise. (recessq
* is 1 for NBER recessions and -1 for expansions).
*
set contract = recessq==1
*
spgraph(vfields=2)
graph(header="Quarterly Growth Rate of US GNP",shade=contract)
# g %regstart() %regend()
graph(style=polygon,shade=contract,$
header="Probability of Economy Being in Contraction")
# pcontract %regstart() %regend()
spgraph(done)
*
* Priors for the relative variances (if needed)
*
dec vect nuprior(nstates)
dec vect s2prior(nstates)
compute nuprior=%fill(nstates,1,4.0)
compute s2prior=%fill(nstates,1,1.0)
compute scommon =sigma(1,1)
*
* Prior for the means. Since they are estimated jointly, we need a
* joint prior. To keep symmetry, we use zero means. The precision
* will be somewhat data dependent---here we'll make it the very
* loose .25, which will mean a standard deviation of 2.
*
dec symm hmu
dec vect bmu
*
compute bmu=||0.0,0.0||
compute hmu=.25*%identity(nstates)
*
* Prior for the lag coefficients. 0 mean, standard error .5.
*
dec symm hphi
dec vect bphi
compute hphi=4*%identity(nlags)
compute bphi=%zeros(nlags,1)
*
* For final estimation, you would probably multiply these by 10,
* which would take 10 times as long.
*
compute nburn=200,ndraws=500
*
* For relabeling
*
dec vect muv(nstates)
dec rect tempp(nstates,nstates)
*
nonlin(parmset=allparms) mu phi sigma p
*
* Bookkeeping arrays
*
dec series[vect] bgibbs
gset bgibbs 1 ndraws = %parmspeek(allparms)
set regime1 gstart gend = 0.0
infobox(action=define,lower=-nburn,upper=ndraws,progress) $
"Gibbs Sampling"
do draw=-nburn,ndraws
*
* Draw sigmas given coefficients and regimes
* Compute the regime-specific residuals
*
@MSVARResids(regime=MSRegime) MSVARU gstart gend
if MSVARSigmaSwitch==0 {
sstats gstart gend MSVARU(1)^2>>sumsqr
compute sigma(1,1)=(sumsqr+nucommon*s2common)/$
%ranchisqr(%nobs+nucommon)
}
else {
*
* Draw the common variance factor given the relative variances
* and regimes.
*
sstats gstart gend MSVARU(1)^2*scommon/sigmav(MSRegime(t))(1,1)>>sumsqr
compute scommon=(sumsqr+nucommon*s2common)/$
%ranchisqr(%nobs+nucommon)
*
* Draw the relative variances, given the common variances and
* the regimes.
*
do s=1,nstates
sstats(smpl=MSRegime==s) gstart gend MSVARU(1)^2/scommon>>sumsqr
compute sigmav(s)(1,1)=scommon*(sumsqr+nuprior(s))/$
%ranchisqr(%nobs+nuprior(s))
end do s
}
compute %MSVARInitVariances()
*
* Draw mu given phi
*
@MSVARMuCMOM gstart gend xxem
compute mustack=%ranmvpostcmom(xxem,1.0,hmu,bmu)
@MSVARMuUnstack mustack mu
*
* Draw phi given mu
*
@MSVARPhiCMOM gstart gend xxem
compute phistack=%ranmvpostcmom(xxem,1.0,hphi,bphi)
@MSVARPhiUnstack phistack phi
*
* Relabel if necessary
*
ewise muv(i)=mu(i)(1)
compute swaps=%index(muv)
if swaps(1)==2
disp "Draw" draw "Executing swap"
*
* Relabel the mus
*
ewise mu(i)=muv(swaps(i))
*
* Relabel the transitions
*
compute tempp=p
ewise p(i,j)=tempp(swaps(i),swaps(j))
*
* Draw the regimes
*
@MSVARFilter gstart gend
:redrawregimes
@MSVARSample(counts=tcounts) gstart gend MSRegime
if %minvalue(tcounts)<5 {
disp "Draw" draw "Redrawing regimes with regime of size" $
%minvalue(tcounts)
goto redrawregimes
}
*
* Draw ps
*
@MSDrawP(prior=gprior) gstart gend p
infobox(current=draw)
if draw>0 {
*
* Do the bookkeeping
*
set regime1 gstart gend = regime1+(MSRegime==1)
compute bgibbs(draw)=%parmspeek(allparms)
}
end do draw
infobox(action=remove)
*
@mcmcpostproc(ndraws=ndraws,mean=bmeans,stderrs=bstderrs) bgibbs
*
report(action=define)
report(atrow=1,atcol=1,fillby=cols) %parmslabels(allparms)
report(atrow=1,atcol=2,fillby=cols) bmeans
report(atrow=1,atcol=3,fillby=cols) bstderrs
report(action=format,picture="*.###")
report(action=show)
*
@nbercycles(down=recess)
set regime1 gstart gend = regime1/ndraws
graph(header="MCMC Probability of Low Mean Regime",shading=recess)
# regime1
set mulowsample 1 ndraws = bgibbs(t)(1)
set muhighsample 1 ndraws = bgibbs(t)(2)
density(grid=automatic,maxgrid=100,smooth=1.5) mulowsample $
1 ndraws xmulow fmulow
density(grid=automatic,maxgrid=100,smooth=1.5) muhighsample $
1 ndraws xmuhigh fmuhigh
scatter(style=lines,header="MCMC Densities of Means") 2
# xmulow fmulow
# xmuhigh fmuhigh
Chapter 12
Markov Switching State-Space Models
For this model, exact maximum likelihood is possible, though very time-consuming.
If we unwind the definition for the trend, we get:
$$\tau_t = \tau_0 + \sum_{s=1}^{t}\delta(S_s) = \tau_0 + \delta(1)\sum_{s=1}^{t}(S_s = 1) + \delta(2)\sum_{s=1}^{t}(S_s = 2) \qquad (12.5)$$
Because the trend depends only upon the number of periods spent in each regime,
the likelihood at $t$ has $t$ branches, rather than $M^t$. Because the recent past matters
for the AR part, it turns out that you need to keep track of $M^{q+1}\,t$ possible
combinations.
The second example is a time-varying parameters regression with Markov
switching heteroscedastic errors. The regression is a money demand function
on the growth rate with
$$\Delta M_t = \beta_{0t} + \beta_{1t} i_{t-1} + \beta_{2t}\,\mathrm{inf}_{t-1} + \beta_{3t}\,\mathrm{surp}_{t-1} + \beta_{4t}\,\Delta M_{t-1} + e_t$$
where $M_t$ is log money, $i_t$ is the interest rate, $\mathrm{inf}_t$ is the inflation rate, and
$\mathrm{surp}_t$ is the detrended full-employment budget surplus. The $\beta$ are assumed to
follow independent random walks, so we have the state-space representation:
$$X_t \equiv \begin{bmatrix}\beta_{0t}\\ \beta_{1t}\\ \beta_{2t}\\ \beta_{3t}\\ \beta_{4t}\end{bmatrix} = \begin{bmatrix}1&0&0&0&0\\ 0&1&0&0&0\\ 0&0&1&0&0\\ 0&0&0&1&0\\ 0&0&0&0&1\end{bmatrix} \begin{bmatrix}\beta_{0,t-1}\\ \beta_{1,t-1}\\ \beta_{2,t-1}\\ \beta_{3,t-1}\\ \beta_{4,t-1}\end{bmatrix} + \begin{bmatrix}w_{0t}\\ w_{1t}\\ w_{2t}\\ w_{3t}\\ w_{4t}\end{bmatrix}$$
and the log of this is the value of the log likelihood element at $t$ for the Kim
filter. Bayes' rule gives us the updated probabilities of the $i, j$ combinations as:
$$p_t(i,j) = \frac{f(i,j)\,p(i,j)\,p_{t-1}(j)}{\sum_{i=1}^{M}\sum_{j=1}^{M} f(i,j)\,p(i,j)\,p_{t-1}(j)} \qquad (12.7)$$
The standard Kalman filter calculations will give us an updated mean and
covariance matrix for $X_t$ for each combination of $i, j$, which we'll call $X_{t|t}(i,j)$
and $\Sigma_{t|t}(i,j)$. We now need to collapse those to a summary just for the time $t$
regime $i$ by aggregating out the time $t-1$ regimes (the $j$). The mean is simple:
it's just the probability-weighted average:
$$X_{t|t}(i) = \frac{\sum_{j=1}^{M} p_t(i,j)\,X_{t|t}(i,j)}{\sum_{j=1}^{M} p_t(i,j)} \qquad (12.8)$$
The part of this calculation which will be specific to a model and form of switching
is the calculation of the $f(i,j)$, $X_{t|t}(i,j)$ and $\Sigma_{t|t}(i,j)$. These are all standard
Kalman filter calculations, but need to be adjusted to allow for whatever
switches, so the Kalman filtering steps have to be done manually.
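The collapsing arithmetic, both the mean in (12.8) and the matching covariance
step that shows up in the code below, can be written compactly; the following
NumPy sketch uses hypothetical names:

import numpy as np

def collapse(p, x, s):
    # p: (M, M) updated probabilities p_t(i, j)
    # x: (M, M, n) updated state means X_{t|t}(i, j)
    # s: (M, M, n, n) updated covariances Sigma_{t|t}(i, j)
    M, n = x.shape[0], x.shape[2]
    pstar = p.sum(axis=1)             # marginal probability of regime i
    xstar = np.zeros((M, n))
    sstar = np.zeros((M, n, n))
    for i in range(M):
        for j in range(M):
            xstar[i] += p[i, j] / pstar[i] * x[i, j]
        for j in range(M):
            d = xstar[i] - x[i, j]
            sstar[i] += p[i, j] / pstar[i] * (s[i, j] + np.outer(d, d))
    return pstar, xstar, sstar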
The examples both define a (model-specific) function called KimFilter which
does a single step in the Kim filter. Most of this is common to all such
applications. First, the following need to be defined:
dec vect[vect] xstar(nstates)
dec vect[symm] sstar(nstates)
XSTAR and SSTAR are the collapsed means and variances of the state vector,
indexed by the regime. They get replaced at the end of each step through
the filter. The VECTOR PSTAR is also used; that's defined as part of the standard
Markov procedures and keeps the filtered probability distribution for the
regimes, as it has before.
Within the function, XWORK and SWORK are the predicted means and covariance
matrices $X_{t|t}(i,j)$ and $\Sigma_{t|t}(i,j)$. FWORK is used for $f(i,j)$. The model-specific
calculations will be in the first double loop over I and J. This is for the Lam
model, in the first-differenced form. The line in upper case is the one that has
an adjustment for the switch.
*
* Collapse the SSM matrices down to one per regime.
*
compute xstates(time)=%zeros(ndlm,1)
do i=1,nstates
compute xstar(i)=%zeros(ndlm,1)
do j=1,nstates
compute xstar(i)=xstar(i)+xwork(i,j)*phat(i,j)/pstar(i)
end do j
*
* This is the overall best estimate of the filtered state.
*
compute xstates(time)=xstates(time)+xstar(i)*pstar(i)
compute sstar(i)=%zeros(ndlm,ndlm)
do j=1,nstates
compute sstar(i)=sstar(i)+phat(i,j)/pstar(i)*$
(swork(i,j)+%outerxx(xstar(i)-xwork(i,j)))
end do j
end do i
compute KimFilter=likely
end
It's important to note that the Kim filter is not a relatively innocuous
approximation: there are no results which show that it gives consistent estimates
for the parameters of the actual model, and how far off it is in practice
isn't really known. What would be expected is that it will bias the estimates in
favor of parameters for which the approximation works reasonably well, which
one would speculate would be models dominated by a persistent regime.
The following is the special setup code for using the Kim filter for the Lam
switching model. The number of states in the differenced form is just the number
of AR lags in the noise term. To use the differenced form of the model, this
needs to be at least two, since you need both $x_t$ and $x_{t-1}$ in the measurement
equation. If you actually wanted only one lag, you could set this up with two
states, and just leave the 1,2 element of the A matrix as zero. We'll call the
size of the state vector NDLM.
compute q=2
dec vect phi(q)
compute ndlm=q
The A matrix will have everything below the first row fixed as the standard
lag-shifting submatrix in this type of transition matrix. We'll set that up with
the first row as zeros (for now). Since the first row depends upon parameters
which we need to estimate, we'll need to patch that over as part of a function
evaluation.
We know the first two elements of C will be 1 and -1 and everything else will
be zeros. This is written to handle any lags above two automatically:
dec rect c(ndlm,1)
compute c=||1.0,-1.0||~%zeros(1,ndlm-2)
The F matrix will always just be a unit vector for the first element with the
proper size:
dec rect f(ndlm,1)
compute f=%unitv(ndlm,1)
dec symm sw(ndlm,ndlm)
compute sv=0.0
If this were a standard state-space model for use with the DLM instruction,
the SW matrix would be just $1 \times 1$, since there's just the one shock; DLM
automatically computes the rank-one covariance matrix $FWF'$ for the shock to
the states. Since we have to do the filtering calculations ourselves, we'll just
do the calculation of the full-size matrix one time at the start of each function
evaluation. There's no error in the measurement equation, so SV is 0.
The next step sets up the shifting part of the model, which is the vector of
regime-specific means for the change. For estimation purposes, we'll use guess
values which go from large to small, since that was the ordering used in the
original paper.
dec vect delta(nstates)
The final piece of this is a vector of initial (pre-sample) values for the $x$
component. This is one of several ways to handle the pre-sample states. If this were
a model in which all the state matrices and variances were time-invariant,
the most straightforward way to handle this would be to use the unconditional
mean and covariance matrix (done with the PRESAMPLE=ERGODIC option on
DLM). With switching models, by definition there are some input matrices which
aren't time-invariant, so there isn't an obvious choice for the pre-sample
distribution. Plus, we need a mean and covariance matrix for each regime. In the
calculations for this example, the pre-sample state is added to the parameter
set, with a single vector used for each regime, with a zero covariance matrix.
The hope is that the estimation won't be very sensitive to the method of handling
the pre-sample.
dec vect x0(ndlm)
compute x0=%zeros(ndlm,1)
The free parameters are divided into three parameter sets: one for the state-space
parameters (the means, the variance in the evolution of $x$ and the autoregressive
coefficients), one for the transition probabilities (here modeled in logistic form),
and finally the initial conditions.
There's quite a bit of initialization code required at the start of each function
evaluation. Some of this is standard for state-space models, some is standard
for Markov switching models. Both are packaged into a function called
DLMStart.
The standard state-space initialization code stuffs the current settings for PHI
into the top row of the A matrix, and expands the SW matrix from F and the
variance parameter:
compute %psubmat(a,1,1,tr(phi))
compute sw=f*tr(f)*sigsq
The standard Markov switching initialization has generally been done as part
of a START option on MAXIMIZE. This transforms the THETA into a transition
matrix, computes the stationary distribution for the regimes and then expands
the transition matrix to full size, which will make it easier to use.
compute p=%MSLogisticP(theta)
compute pstar=%mcergodic(p)
compute p=%mspexpand(p)
The one step that is new to this type of model is initializing the XSTAR and
SSTAR matrices. With the handling of the initial conditions described above,
this copies the X0 out of the parameter set into each component of XSTAR and
zeros out each component of SSTAR.
ewise xstar(i)=x0
ewise sstar(i)=%zeros(ndlm,ndlm)
This is similar to what we've seen before, as the details are hidden by the
KimFilter function. The full code is Example 12.1.
In a model without switching, the DLM instruction can handle this with either
the option PRESAMPLE=DIFFUSE or PRESAMPLE=ERGODIC.1 These use a specialized
double calculation for the first $k$ data points to keep separate track of the infinite
(due to unit roots) and finite parts of the estimates. With each data point from
1 to $k$, the rank of the infinite matrix goes down by one until it is zeroed out.
This calculation is based upon a formal limit as the initial variances on the
coefficients go to infinity.
While it would be possible to do the same thing in a switching context, it would
require doing the parallel calculations for each of the $M^2$ branches required for
the Kim filter, and extending the Kim reduction step to the partially infinite,
partially finite matrices. The simpler alternative is to approximate this by
using large finite variances (and zero means) for the pre-sample coefficients,
then leaving out of the working log likelihood at least the first $k$ data points.
You just need to be somewhat careful about how you choose the large values:
too large and you can have a loss of precision in the calculations, since you end
up subtracting two large numbers (several times) with the end result being a
small number.
It's also quite possible for the variance of the drift on any one coefficient (and
sometimes on all of them) to be optimally zero; that is, a fixed coefficient model
has a higher likelihood than any model that allows for drifting coefficients. In
general, a drifting coefficients model will produce smaller residuals, but at the
cost of a higher predictive variance, which gets penalized in the likelihood. It's
thus necessary to parameterize the variance in a way that keeps it in the range
$[0, \infty)$, such as by estimating it in standard deviation form.
The parameters to be estimated in the TVP model are the variances on the five
coefficient drifts, the variances in the two branches for the switching process
and the parameters in the transition matrix. The variances are all put into
standard deviation form and given guess values with:
dec vect sigmae(nstates)
compute sigmae(1)=.25*sqrt(%seesq),sigmae(2)=1.5*sqrt(%seesq)
compute sigmav=.01*%stderrs
*
nonlin(parmset=dlmparms) sigmae sigmav
The SIGMAE are the regression error standard deviations, one for each regime.
SIGMAV is a VECTOR with one value per regressor; they're initialized at .01
times the corresponding standard errors from a fixed coefficient regression.
For the state-space model, A is just the identity, and the C matrix changes
from data point to data point and is just the current set of regressors from the
equation of interest. The model-specific part of the KimFilter function is:
COMPUTE C=TR(%EQNXVECTOR(MDEQ,TIME))
1 They'll be equivalent here because all the roots are unit roots.
*
compute f1hat(time)=0.0,f2hat(time)=0.0
do i=1,nstates
do j=1,nstates
*
* Do the SSM predictive step. In this application A is the
* identity, so the calculations simplify quite a bit.
*
compute xwork(i,j)=xstar(j)
compute swork(i,j)=sstar(j)+sw
*
* Do the prediction error and variance for y under state i. The
* predictive variance is the only part of this that depends upon
* the regime. Compute the density function for the prediction
* error.
*
COMPUTE YERR=M1GR(TIME)-%DOT(C,XWORK(I,J))
COMPUTE VHAT=C*SWORK(I,J)*TR(C)+SIGMAE(I)^2
*
* Do the decomposition of vhat into its components and add
* probability-weighted values to the sums across (i,j)
*
compute f1hat(time)=f1hat(time)+%scalar(c*swork(i,j)*tr(c))*$
p(i,j)*pstar(j)
compute f2hat(time)=f2hat(time)+sigmae(i)^2*p(i,j)*pstar(j)
compute gain=swork(i,j)*tr(c)*inv(vhat)
compute fwork(i,j)=exp(%logdensity(vhat,yerr))
*
* Do the SSM update step
*
compute xwork(i,j)=xwork(i,j)+gain*yerr
compute swork(i,j)=swork(i,j)-gain*c*swork(i,j)
end do j
end do i
Again, the lines in upper case are the ones which are special to this model. The
first two (the COMPUTE C at the start and the COMPUTE YERR in the middle)
are just there to get the C and Y for this data point. It's the COMPUTE VHAT which
includes the switching component. In computing XWORK and SWORK, this also
takes advantage of the fact that A is the identity to simplify the calculation.
This also adds an extra calculation of F1HAT and F2HAT, which decompose the
predictive variance (the VHAT) into the part due to the uncertainty in the
coefficients (F1HAT) and the part due to the (switching) regression variance (F2HAT).
These are the probability-weighted averages across the $M^2$ combinations.
In the DLMStart function for this model, the only calculation needed for the
state-space model is to square the standard deviations on the drift variances.
(The regression variances are squared as part of the calculation above).
compute sw=%diag(sigmav.^2)
The Kim filter initialization is to zero out XSTAR and make SSTAR a diagonal
matrix with relatively large values (here 100). Whether 100 is a good choice
isn't clear without at least some experimentation.
ewise xstar(i)=%zeros(ndlm,1)
ewise sstar(i)=100.0*%identity(ndlm)
1. Sample the regimes given the states, switching and other parameters.
2. Sample the switching parameters given the regimes (nothing else should
matter).
3. Sample the states given the regimes, switching and other parameters.
4. Sample the other parameters given states and regimes.
The state-space setup used for estimation with the Kim filter can't be used
in a standard way for MCMC. Ordinarily, the measurement equation in
first-differenced form:
$$\Delta y_t = \delta(S_t) + \Delta x_t$$
would be rearranged to
$$\Delta y_t - \delta(S_t) = [1,\ -1]\,X_t$$
for the purposes of sampling the states ($X$) given the regimes and parameters
(number 3 on the list). And, in fact, that can be done to generate a series of
$\Delta x_t$. The problem is that this equation is the only place where $S_t$ appears, and
it has no error term. Given the sampled state series and the $\delta$, there's only
one possible set of $S_t$: the ones just used to create the $\Delta x$. Because $X$ and $S$ are
linked by an identity, we can't sample the $S$ treating the $X$ as given.
For this example, we'll go back to the original specification of the model with
$y$ as the observable rather than its difference. The differenced form loses the
connection to the level of $y$, so it can't really give an accurate estimate of the
trend series itself. With $q$ as the number of lags in the AR, the size of the state
vector is $q+1$, with the extra one being the trend variable, which will be in the
last position. The setup for the fixed parts of the state matrices is:
compute ndlm=q+1
*
dec rect a(ndlm,ndlm)
ewise a(i,j)=%if(i==ndlm,(i==j),(i==j+1))
*
dec rect c(ndlm,1)
compute c=%unitv(q,1)~~1.0
*
dec rect f(ndlm,1)
compute f=%unitv(ndlm,1)
We need a prior for the transition matrix, which will again be Dirichlet weakly
favoring staying in each state:
dec vect[vect] gprior(nstates)
dec vector tcounts(nstates) pdraw(nstates)
compute gprior(1)=||8.0,2.0||
compute gprior(2)=||2.0,8.0||
The initial values for the lag coefficients and the variance will come from an
OLS regression of a preliminary estimate of the cycle on its lags.
The prior for the lag coefficients will be very loose, with a 0 mean and .5 standard
deviation (precision is 4.0) on each:
dec vect bprior(%nreg)
dec symm hprior(%nreg,%nreg)
compute bprior=%zeros(%nreg,1)
compute hprior=%diag(%fill(%nreg,1,4.0))
We'll use an uninformative prior for the variance, since that will always be
estimated using the full sample:
compute s2prior=1.0
compute nuprior=0.0
compute %psubmat(a,1,1,tr(phi))
dlm(a=a,f=f,c=c,sw=sigsq,y=y,presample=ergodic,type=csim,$
z=delta(MSRegime(t))*%unitv(ndlm,ndlm)) xstart gend xstates
set x xstart gend = xstates(t)(1)
Given the generated x series, the φ (and variance) can be drawn using standard
Bayesian procedures for a least squares regression. We'll reject non-stationary
estimates, doing a redraw if we have an unstable root.
cmom(equation=areqn) gstart gend
:redraw
compute phi =%ranmvpostcmom(%cmom,1.0/sigsq,hprior,bprior)
compute %eqnsetcoeffs(areqn,phi)
compute cxroots=%polycxroots(%eqnlagpoly(areqn,x))
if %cabs(cxroots(%rows(cxroots)))<=1.00 {
disp "PHI draw rejected"
goto redraw
}
compute sumsqr=%rsscmom(%cmom,phi)
compute sigsq =(sumsqr+s2prior*nuprior)/%ranchisqr(%nobs+nuprior)
All that remains is to draw the δ. There are two possible approaches to this.
First, we can unwind τ_t as in (12.5). Other than τ_0 (for which we just produced
a simulated value), this is a linear function of the δ. The equation

y_t = τ_0 + δ_1 c_{1t} + δ_2 c_{2t} + x_t

(where c_{it} is the number of periods through t spent in regime i) is in the form
of a linear regression with serially correlated errors with a known form for the
covariance matrix: the φ and σ² are assumed known. We can filter the data and
sample the δ as if it were a least squares regression. There is one potential
problem with this for this particular model and data set: the second regime
(low drift) is likely to be quite sparse, so δ_2 won't be very well determined
and we might need a prior that is more informative than we would like in order
to keep the sampler working properly.
Instead, we're choosing to use (random walk) Metropolis within Gibbs. Our
proposal density will be the current value plus a Normal increment. After a
bit of experimenting, we came up with (independent) Normal increments with
standard deviation .10:
compute fdelta=||.10,0.0|0.0,.10||
compute %psubmat(a,1,1,tr(phi))
dlm(a=a,f=f,c=c,sw=sigsq,y=y,presample=ergodic,$
z=delta(MSRegime(t))*%unitv(ndlm,ndlm)) gstart gend
compute logplast=%logl
*
compute [vector] deltatest=delta+%ranmvnormal(fdelta)
dlm(a=a,f=f,c=c,sw=sigsq,y=y,presample=ergodic,$
z=deltatest(MSRegime(t))*%unitv(ndlm,ndlm)) gstart gend
compute logptest=%logl
compute alpha=exp(logptest-logplast)
if alpha>1.0.or.%uniform(0.0,1.0)<alpha
compute delta=deltatest,accept=accept+1
The results are surprisingly different from those from the Kim filter. The fil-
tered estimates of the probabilities of the high-mean regime from the Kim filter
are in Figure 12.1:
(Figure 12.1 plots the probabilities on a 0-1 scale over 1952-1982; a second probability plot covers 1951-1984.)
The Kim filter approximation also identifies the two modes, but much more
clearly favors the outlier mode; since the data set is largely classified as one
regime, the approximation will be more accurate than with the mode where
there are many data points in each regime.
This is Example 12.4. The sampler for the regimes is simpler here than in the
previous case. We can add the measurement errors and shocks to the regres-
sion coefficients to the parameter set and simulate them using DLM (given the
previous settings for the regimes). Taking the measurement errors as given,
the regimes can be sampled using a simple FFBS algorithm exactly as in Ex-
ample 8.3. While there is some correlation between the regimes and the mea-
surement errors (almost no Gibbs sampler will avoid some correlation among
blocks), this is nothing like the identity we had in the previous example, and
isn't as tight a relationship as we would have if the regime-switching controlled
the mean (rather than variance) of a process.3
The main problems come from it being a time-varying parameters regression.
The difficulty coming out of the diffuse initial conditions won't be the problem
here because the PRESAMPLE=DIFFUSE option on DLM can apply since there's
only one branch that needs to be evaluated. The possibility of the variance in
a drift being (optimally) zero remains. The variances will be drawn as inverse
chi-squareds, which will never be true zero. However, if you use conditional
simulation on a state-space model with a component variance being effectively
zero, the element of the disturbances that that variance controls will be forced
to be (near) zero as well, causing the next estimate of the variance to again be
3 The modal value of a Normal is still zero whether the variance is high or low.
near zero. Thus, the Gibbs sampler will have an absorbing state at zero for each
of the variances of the coefficient drifts, unless we use a non-zero informative
prior, which ensures that if the variance goes to zero it has a chance of being
sampled non-zero in a future sweep.
As we did with the Kim filter estimates for this model, we'll start with a linear
regression to get initial values. We'll start the variances on the drifts at values
that are probably too large to be reasonable.
compute sigmae(1)=.25*%seesq,sigmae(2)=2.5*%seesq
compute sigmav=%stderrs.^2
The switching variance for the equation will be handled as before with a hier-
archical prior, using a non-informative prior for the common variance scale:
compute nucommon=0.0
compute s2common=.5*%seesq
dec vect nuprior(nstates)
dec vect s2prior(nstates)
compute nuprior=%fill(nstates,1,4.0)
compute s2prior=%fill(nstates,1,1.0)
compute scommon =%seesq
The prior on the coefficient drift variances is (very) weakly informative, cen-
tered on a small multiple of the least squares variances.
dec vect nusw(ndlm)
dec vect s2sw(ndlm)
*
ewise nusw(i)=1.0
ewise s2sw(i)=(.01*%stderrs(i))^2
The Gibbs sampling loop starts by simulating the state-space model given the
settings for the variances and the current values of the regimes. This will give
us simulated coefficient drifts in WHAT, which is a SERIES of VECTORS, and
equation errors in VHAT, similarly a SERIES of VECTORS (size one in this case,
since there's only one observable). This also includes simulated values for the
state vector, which will here be the (time-varying) regression coefficients.
dlm(y=m1gr,c=%eqnxvector(mdeq,t),sv=sigmae(MSRegime(t)),$
sw=%diag(sigmav),presample=diffuse,type=csimulate,$
what=what,vhat=vhat) gstart gend xstates vstates
We then treat the WHAT and VHAT as given in drawing the variances. This does
the coefficient drift variances:
do i=1,ndlm
sstats gstart+ncond gend what(t)(i)^2>>sumsqr
compute sigmav(i)=(sumsqr+nusw(i)*s2sw(i))/$
%ranchisqr(%nobs+nusw(i))
end do i
and this does the (switching) equation variances using a hierarchical prior:
sstats gstart+ncond gend vhat(t)(1)^2/sigmae(MSRegime(t))>>sumsqr
compute scommon=(scommon*sumsqr+nucommon*s2common)/$
%ranchisqr(%nobs+nucommon)
do k=1,nstates
sstats(smpl=MSRegime(t)==k) gstart+ncond gend vhat(t)(1)^2>>sumsqr
compute sigmae(k)=(sumsqr+nuprior(k)*scommon)/$
%ranchisqr(%nobs+nuprior(k))
end do k
The regimes are then drawn using the Hamilton filter followed by backwards sampling:
compute pstar=%mcergodic(p)
do time=gstart,gend
compute pt_t1(time)=%mcstate(p,pstar)
compute pstar=%msupdate(RegimeF(time),pt_t1(time),fpt)
compute pt_t(time)=pstar
end do time
@%mssample p pt_t pt_t1 MSRegime
which requires the function RegimeF which returns the vector of likelihoods
given the simulated VHAT:
function RegimeF time
type integer time
type vector RegimeF
local integer i
*
dim RegimeF(nstates)
ewise RegimeF(i)=exp(%logdensity(sigmae(i),vhat(time)(1)))
end
Note that (at least with this data set), this requires a very large number of
draws to get the numerical standard errors on the coefficient variances down
to a reasonable level (compared to their means). The switching variance and
the transition probabilities are quite a bit more stable.
end do j
end do i
*
* Compute the updated probabilities of the regime combinations.
*
compute phat=phat/likely
compute pstar=%sumr(phat)
compute pt_t(time)=pstar
*
* Collapse the SSM matrices down to one per regime.
*
compute xstates(time)=%zeros(ndlm,1)
do i=1,nstates
compute xstar(i)=%zeros(ndlm,1)
do j=1,nstates
compute xstar(i)=xstar(i)+xwork(i,j)*phat(i,j)/pstar(i)
end do j
compute sstar(i)=%zeros(ndlm,ndlm)
do j=1,nstates
compute sstar(i)=sstar(i)+phat(i,j)/pstar(i)*$
(swork(i,j)+%outerxx(xstar(i)-xwork(i,j)))
end do j
*
* This is the overall best estimate of the filtered state.
*
compute xstates(time)=xstates(time)+xstar(i)*pstar(i)
end do i
compute KimFilter=likely
end
******************************************************************
*
* This is the start up code for each function evaluation
*
function DLMStart
*
local integer i
*
* Fill in the top row in the A matrix
*
compute %psubmat(a,1,1,tr(phi))
*
* Compute the full matrix for the transition variance
*
compute sw=f*tr(f)*sigsq
*
* Transform the theta to transition probabilities.
*
compute p=%MSLogisticP(theta)
compute pstar=%mcergodic(p)
compute p=%mspexpand(p)
*
* Initialize the KF state and variance. This is set up for estimating
* the pre-sample states.
*
ewise xstar(i)=x0
ewise sstar(i)=%zeros(ndlm,ndlm)
end
*********************************************************************
*
* Get guess values for the means from running an AR(2) on the growth
* rate. However, the guess values for the AR coefficients need to be
* input; the BOXJENK values aren't persistent enough because the mean
* model doesn't take switches into account.
*
boxjenk(ar=q,constant) g * 1984:4
compute delta(1)=%beta(1)+0.5,delta(2)=%beta(1)-1.5
compute phi=||1.2,-.3||
compute sigsq=%seesq
*
* Initialize theta from transitions common for Hamilton type models
*
compute theta=%msplogistic(||.9,.5||)
*
frml kimf = kf=KimFilter(t),log(kf)
*
* Do initializations
*
@MSFilterInit
gset xstates = %zeros(ndlm,1)
*
maximize(parmset=msparms+dlmparms+initparms,start=DLMStart(),$
reject=sigsq<=0.0,method=bfgs) kimf 1952:2 1984:4
*
set cycle 1952:2 1984:4 = xstates(t)(1)
set trend 1952:2 1984:4 = loggdp-cycle
*
set phigh 1952:2 1984:4 = pt_t(t)(1)
graph(footer="Figure 5.2a Filtered Probabilities of High-Growth Regime")
# phigh
graph(footer=$
"Figure 5.3 Real GNP and Markov-switching trend component") 2
# loggdp 1952:2 1984:4
# trend
graph(footer=$
"Figure 5.4 Cyclical component from Lams Generalized Hamilton model")
# cycle
* equation.
*
dec vect sigmae(nstates)
*
* Get guess values based upon the results of a linear regression. All
* standard deviations are scales of the corresponding variances in
* the least squares regression. (The drift ones being a much smaller
* multiple).
*
compute sigmae(1)=.25*sqrt(%seesq),sigmae(2)=1.5*sqrt(%seesq)
compute sigmav=.01*%stderrs
*
nonlin(parmset=dlmparms) sigmae sigmav
*********************************************************************
*
* These are for keeping track of the decomposition of the predictive
* variance for money growth.
*
set f1hat = 0.0
set f2hat = 0.0
*********************************************************************
*
* This does a single step of the Kim (approximate) filter
*
function KimFilter time
type integer time
*
local integer i j
local real yerr likely
local symm vhat
local rect gain
local rect phat(nstates,nstates)
local rect fwork(nstates,nstates)
*
local rect c
*
* Pull out the C matrix, which is time-varying (regressors at *time*)
*
compute c=tr(%eqnxvector(mdeq,time))
*
compute f1hat(time)=0.0,f2hat(time)=0.0
do i=1,nstates
do j=1,nstates
*
* Do the SSM predictive step. In this application A is the
* identity, so the calculations simplify quite a bit.
*
compute xwork(i,j)=xstar(j)
compute swork(i,j)=sstar(j)+sw
*
* Do the prediction error and variance for y under regime i.
* The predictive variance is the only part of this that
* depends upon the regime. Compute the density function for
* the prediction error.
*
compute yerr=m1gr(time)-%dot(c,xwork(i,j))
compute vhat=c*swork(i,j)*tr(c)+sigmae(i)^2
*
* Do the decomposition of vhat into its components and add
* probability-weighted values to the sums across (i,j)
*
compute f1hat(time)=f1hat(time)+$
%scalar(c*swork(i,j)*tr(c))*p(i,j)*pstar(j)
compute f2hat(time)=f2hat(time)+sigmae(i)^2*p(i,j)*pstar(j)
compute gain=swork(i,j)*tr(c)*inv(vhat)
compute fwork(i,j)=exp(%logdensity(vhat,yerr))
*
* Do the SSM update step
*
compute xwork(i,j)=xwork(i,j)+gain*yerr
compute swork(i,j)=swork(i,j)-gain*c*swork(i,j)
end do j
end do i
*
* Everything from here to the end of the function is the same for all
* such models.
*
* Compute the Hamilton filter likelihood
*
compute likely=0.0
do i=1,nstates
do j=1,nstates
compute phat(i,j)=p(i,j)*pstar(j)*fwork(i,j)
compute likely=likely+phat(i,j)
end do j
end do i
*
* Compute the updated probabilities of the regime combinations.
*
compute phat=phat/likely
compute pstar=%sumr(phat)
compute pt_t(time)=pstar
*
* Collapse the SSM matrices down to one per state
*
compute xstates(time)=%zeros(ndlm,1)
do i=1,nstates
compute xstar(i)=%zeros(ndlm,1)
do j=1,nstates
compute xstar(i)=xstar(i)+xwork(i,j)*phat(i,j)/pstar(i)
end do j
*
* This is the overall best estimate of the filtered state.
*
compute xstates(time)=xstates(time)+xstar(i)*pstar(i)
compute sstar(i)=%zeros(ndlm,ndlm)
do j=1,nstates
compute sstar(i)=sstar(i)+phat(i,j)/pstar(i)*$
(swork(i,j)+%outerxx(xstar(i)-xwork(i,j)))
end do j
end do i
compute KimFilter=likely
end
*********************************************************************
*
* This is the start up code for each function evaluation
*
function DLMStart
*
local integer i
*
* Compute the full matrix for the transition variance
*
compute sw=%diag(sigmav.^2)
*
* Transform the theta to transition probabilities.
*
compute p=%MSLogisticP(theta)
compute pstar=%mcergodic(p)
compute p=%mspexpand(p)
*
* Initialize the KF state and variance. This uses a large finite
* value for the variance, then conditions on the first 10
* observations.
*
ewise xstar(i)=%zeros(ndlm,1)
ewise sstar(i)=100.0*%identity(ndlm)
end
*********************************************************************
*
* Skip the first 10 data points in evaluating the likelihood. The
* textbook seems to have p00 and p11 mislabeled in Table 5.2.
*
frml kimf = kf=KimFilter(t),%if(t<1962:1,0.0,log(kf))
@MSFilterInit
gset xstates = %zeros(ndlm,1)
maximize(parmset=dlmparms+msparms,start=DLMStart(),method=bfgs) $
kimf 1959:3 1989:2
*
set totalvar = f1hat+f2hat
graph(style=bar,key=below,klabels=||"Total Variance","TVP","MRKF"||,$
footer="Figure 5.5 Decomposition of monetary growth uncertainty") 3
# totalvar 1962:1 *
# f1hat 1962:1 *
# f2hat 1962:1 *
*
set plow = pt_t(t)(1)
graph(header="Filtered Probability of Low-Variance Regime")
# plow
*
set binterest 1962:1 * = xstates(t)(2)
set binflation 1962:1 * = xstates(t)(3)
* independent)
*
dec vect bprior(%nreg)
dec symm hprior(%nreg,%nreg)
compute bprior=%zeros(%nreg,1)
compute hprior=%diag(%fill(%nreg,1,4.0))
*
* Prior for the variance
*
compute s2prior=1.0
compute nuprior=0.0
*
* Prior for transitions
*
dec vect[vect] gprior(nstates)
dec vector tcounts(nstates) pdraw(nstates)
compute gprior(1)=||8.0,2.0||
compute gprior(2)=||2.0,8.0||
*
* For draws for the deltas
*
compute fdelta=||.10,0.0|0.0,.10||
*
* Draw counts.
*
compute nburn=5000,ndraws=5000
*
* Initial values for p.
*
dim p(nstates,nstates)
ewise p(i,j)=gprior(j)(i)/%sum(gprior(j))
*
* For re-labeling
*
dec vect temp
dec rect ptemp
*
* For single-move sampling
*
dec vect fps logp
dim fps(nstates) logp(nstates)
*
* Initialize regimes based upon 1 for above average and 2 for below.
*
gset MSRegime = 1+fix(g<.2*delta(1)+.8*delta(2))
*
dec vect[series] count(nstates) fcount(nstates)
*
* For bookkeeping
*
set regime1 = 0.0
set trendsum = 0.0
set trendsumsq = 0.0
*
do draw=-nburn,ndraws
*
* Single-move sampling for the regimes.
*
* Evaluate the log likelihood at the current settings.
*
compute %psubmat(a,1,1,tr(phi))
dlm(a=a,f=f,c=c,sw=sigsq,y=y,presample=ergodic,$
z=delta(MSRegime(t))*%unitv(ndlm,ndlm)) gstart gend
compute logplast=%logl
compute pstar=%mcergodic(p)
do time=xstart,gend
compute oldregime=MSRegime(time)
do i=1,nstates
if oldregime==i
compute logptest=logplast
else {
compute MSRegime(time)=i
dlm(a=a,f=f,c=c,sw=sigsq,y=y,presample=ergodic,$
z=delta(MSRegime(t))*%unitv(ndlm,ndlm)) gstart gend
compute logptest=%logl
}
compute pleft =%if(time==xstart,pstar(i),p(i,MSRegime(time-1)))
compute pright=%if(time==gend ,1.0,p(MSRegime(time+1),i))
compute fps(i)=pleft*pright*exp(logptest-logplast)
compute logp(i)=logptest
end do i
compute MSRegime(time)=%ranbranch(fps)
compute logplast=logp(MSRegime(time))
end do time
*
* Draw ps
*
@MSDrawP(prior=gprior) gstart gend p
*
* Draw the state-space states given the current regimes and other
* parameters. First, fill in the top row in the A matrix using the
* current values of phi.
*
compute %psubmat(a,1,1,tr(phi))
dlm(a=a,f=f,c=c,sw=sigsq,y=y,presample=ergodic,type=csim,$
z=delta(MSRegime(t))*%unitv(ndlm,ndlm)) xstart gend xstates
set x xstart gend = xstates(t)(1)
*
* Draw the lag coefficients and variance using a linear
* regression of x on its lags. Reject clearly explosive roots.
*
cmom(equation=areqn) gstart gend
:redraw
compute phi =%ranmvpostcmom(%cmom,1.0/sigsq,hprior,bprior)
compute %eqnsetcoeffs(areqn,phi)
compute cxroots=%polycxroots(%eqnlagpoly(areqn,x))
if %cabs(cxroots(%size(cxroots)))<=1.00 {
disp "PHI draw rejected"
goto redraw
}
compute sumsqr=%rsscmom(%cmom,phi)
compute sigsq =(sumsqr+s2prior*nuprior)/%ranchisqr(%nobs+nuprior)
*
* Draw the deltas by random walk Metropolis-Hastings
*
compute %psubmat(a,1,1,tr(phi))
dlm(a=a,f=f,c=c,sw=sigsq,y=y,presample=ergodic,$
z=delta(MSRegime(t))*%unitv(ndlm,ndlm)) gstart gend
compute logplast=%logl
*
compute [vector] deltatest=delta+%ranmvnormal(fdelta)
dlm(a=a,f=f,c=c,sw=sigsq,y=y,presample=ergodic,$
z=deltatest(MSRegime(t))*%unitv(ndlm,ndlm)) gstart gend
compute logptest=%logl
compute alpha=exp(logptest-logplast)
if alpha>1.0.or.%uniform(0.0,1.0)<alpha
compute delta=deltatest,accept=accept+1
*
* Check to see if we need to switch labels
*
compute swaps=%index(-1.0*delta)
if swaps(1)<>1 {
disp "Need label switching"
compute temp=delta
ewise delta(i)=temp(swaps(i))
gset MSRegime = swaps(MSRegime(t))
compute ptemp=p
ewise p(i,j)=ptemp(swaps(i),swaps(j))
}
*
infobox(current=draw) %strval(100.0*accept/(draw+nburn+1),"##.#")
*
* If we're past the burn-in, do the bookkeeping
*
if draw>0 {
*
* Combine delta, sigsq and p into a single vector and save the
* draw.
*
compute bgibbs(draw)=%parmspeek(mcmcparms)
*
* Update the probability of being in regime 1
*
set regime1 = regime1+(MSRegime(t)==1)
*
* Update the estimate of the trend
*
set tau xstart gend = xstates(t)(ndlm)
set trendsum xstart gend = trendsum+tau
set trendsumsq xstart gend = trendsumsq+tau^2
}
end do draw
infobox(action=remove)
*
set trendsum = trendsum/ndraws
set trendsumsq = trendsumsq/ndraws-trendsum^2
set trendlower = trendsum-2.0*sqrt(trendsumsq)
set trendupper = trendsum+2.0*sqrt(trendsumsq)
*
graph(footer="Estimate of Trend with 2 s.d. bounds") 4
# y
# trendsum xstart gend
# trendlower xstart gend 3
# trendupper xstart gend 3
*
set regime1 = regime1/ndraws
graph(footer="Probability of being in high growth regime")
# regime1 xstart gend
@mcmcpostproc(ndraws=ndraws,mean=bmeans,stderrs=bstderrs) bgibbs
report(action=define)
report(atrow=1,atcol=2) "Mean" "S.D"
report(atrow=2,atcol=1) "delta(1)" bmeans(1) bstderrs(1)
report(atrow=3,atcol=1) "delta(2)" bmeans(2) bstderrs(2)
report(atrow=4,atcol=1) "phi(1)" bmeans(3) bstderrs(3)
report(atrow=5,atcol=1) "phi(2)" bmeans(4) bstderrs(4)
report(atrow=6,atcol=1) "sigmasq" bmeans(5) bstderrs(5)
report(atrow=7,atcol=1) "p" bmeans(6) bstderrs(6)
report(atrow=8,atcol=1) "q" bmeans(9) bstderrs(9)
report(action=format,picture="###.###")
report(action=show)
*
compute plabels=%parmslabels(mcmcparms)
do i=1,%rows(plabels)
set xstat 1 ndraws = bgibbs(t)(i)
density(smoothing=1.5) xstat 1 ndraws xx fx
scatter(style=line,footer=plabels(i))
# xx fx
end do i
*
nonlin(parmset=mcmcparms) sigmav sigmae p
dec series[vect] bgibbs
gset bgibbs 1 ndraws = %parmspeek(mcmcparms)
*
compute gstart=%regstart(),gend=%regend()
set regime1 gstart gend = 0.0
*
infobox(action=define,lower=-nburn,upper=ndraws,progress) $
"Gibbs Sampling"
do draw=-nburn,ndraws
*
* Draw coefficients and shocks jointly given the regimes.
*
dlm(y=m1gr,c=%eqnxvector(mdeq,t),sv=sigmae(MSRegime(t)),$
sw=%diag(sigmav),presample=diffuse,type=csimulate,$
what=what,vhat=vhat) gstart gend xstates vstates
*
* Draw sigmavs given the whats.
*
do i=1,ndlm
sstats gstart+ncond gend what(t)(i)^2>>sumsqr
compute sigmav(i)=(sumsqr+nusw(i)*s2sw(i))/$
%ranchisqr(%nobs+nusw(i))
end do i
*
* Draw the common component for sigmae
*
sstats gstart+ncond gend vhat(t)(1)^2/sigmae(MSRegime(t))>>sumsqr
compute scommon=(scommon*sumsqr+nucommon*s2common)/$
%ranchisqr(%nobs+nucommon)
*
* Draw the specific components for the sigmae
*
do k=1,nstates
sstats(smpl=MSRegime(t)==k) gstart+ncond gend vhat(t)(1)^2>>sumsqr
compute sigmae(k)=(sumsqr+nuprior(k)*scommon)/$
%ranchisqr(%nobs+nuprior(k))
end do k
*
* Relabel if necessary
*
ewise tempsigma(i)=sigmae(i)
compute swaps=%index(tempsigma)
if swaps(1)==2 {
disp "Draw" draw "Executing swap"
*
* Relabel the variances
*
ewise sigmae(i)=tempsigma(swaps(i))
*
* Relabel the transitions
*
compute tempp=p
ewise p(i,j)=tempp(swaps(i),swaps(j))
}
*
* Draw MSRegime using the vhats and variances. At this point,
* it's a simple variance-switch which can be done using FFBS.
*
@MSFilterInit
do time=gstart,gend
@MSFilterStep time RegimeF(time)
end do time
*
* Backwards sample
*
@MSSample gstart gend MSRegime
*
* Draw ps
*
@MSDrawP(prior=gprior) gstart gend p
infobox(current=draw)
if draw>0 {
*
* Do the bookkeeping
*
set regime1 gstart gend = regime1+(MSRegime==1)
compute bgibbs(draw)=%parmspeek(mcmcparms)
gset betahist gstart gend = betahist+xstates
}
end do draw
infobox(action=remove)
*
@mcmcpostproc(ndraws=ndraws,mean=bmeans,stderrs=bstderrs) bgibbs
*
report(action=define)
report(atrow=1,atcol=1,fillby=cols) %parmslabels(mcmcparms)
report(atrow=1,atcol=2,fillby=cols) bmeans
report(atrow=1,atcol=3,fillby=cols) bstderrs
report(action=format,picture="*.########")
report(action=show)
*
set regime1 = regime1/ndraws
graph(footer="Probability of low-variance regime")
# regime1
*
gset betahist gstart gend = betahist/ndraws
*
set binterest gstart gend = betahist(t)(2)
set binflation gstart gend = betahist(t)(3)
set bsurplus gstart gend = betahist(t)(4)
set blagged gstart gend = betahist(t)(5)
graph(header="Estimated Coefficient on Lagged Change in R")
# binterest
graph(header="Estimated Coefficient on Lagged Surplus")
# bsurplus
graph(header="Estimated Coefficient on Lagged Inflation")
# binflation
graph(header="Estimated Coefficient on Lagged M1 Growth")
# blagged
Chapter 13
Markov Switching ARCH and GARCH
The Cai model can be seen in the RATS SWARCH.RPF example. Because the
switch is in the constant in the ARCH process, the likelihood depends upon only
the current regime, making estimation particularly simple. We'll focus here on
the HS model.
HS analyze weekly value-weighted S&P 500 returns. Their mean model is a one-
lag autoregression, which is assumed to be fixed across regimes:

y_t = α + β y_{t-1} + u_t

E u_t² ≡ h_t = a_0 + Σ_{i=1}^{q} a_i u_{t-i}²
Cai's model replaces a_0 with a_0(S_t), with S_t Markov switching. With HS, what
switches is a variance inflation factor, which gives the formula:

E u_t² = g(S_t) ( a_0 + Σ_{i=1}^{q} a_i u_{t-i}² / g(S_{t-i}) )
Compared with Cai's model, this won't respond quite as dramatically to large
shocks, since this scales down the persistence of the lagged squared residuals
in the high-variance regimes. Whether one or the other tends to work better is
an open question. HS makes the likelihood dependent upon current and q lags
of the regimes, so it's a bit harder to set up and takes longer to estimate.
In addition, HS include an asymmetry term:

E u_t² = g(S_t) ( a_0 + ξ (u_{t-1}²/g(S_{t-1})) I(u_{t-1} < 0) + Σ_{i=1}^{q} a_i u_{t-i}²/g(S_{t-i}) )   (13.1)
and use conditionally Student-t errors in most of their models. In (13.1), there
is a need for a normalization, since the right side is a product with fully free
coefficients in both factors. HS choose to fix one of the g at 1. In our coding for
this, we'll instead normalize with a_0 = 1, which is simpler to implement since
it allows the g to be put into the parameter set as a complete VECTOR.
HS estimate quite a few models. What we'll show is a three-regime, two-lag
asymmetric model with t errors. The initial setup is done with:
compute q=2
@MSSetup(lags=q,states=3)
This sets up the expanded regime system required for handling the lagged
regimes. This defines the variable NEXPAND as the number of expanded
regimes, which will here be 3^(2+1) = 27.
The parameter set has three parts:
1. the mean model parameters (here, the intercept ALPHA and the AR coefficient BETA),
2. the ARCH model parameters, which include the ARCH lags (the VECTOR
named A), the variance inflations (the VECTOR named GV), plus, for this
model, the asymmetry parameter (XI) and degrees of freedom (NU),
3. the Markov switching parameters, which here will be the logistic indexes
THETA.
For the likelihood calculation, we need the value for each expanded regime,
so the loop inside the ARCHRegimeTF function runs over the range from 1 to
NEXPAND. The %MSLagState function is defined when you do the @MSSETUP
with a non-zero value for the LAGS option. %MSLagState(SX,LAG) maps the
expanded regime number SX and lag number LAG to the base regime repre-
sented by SX at that lag (with 0 for LAG meaning the current regime).
function ARCHRegimeTF time e
type vector ARCHRegimeTF
type real e
type integer time
*
local integer i k
local real h
*
dim ARCHRegimeTF(nexpand)
do i=1,nexpand
compute h=1.0
do k=1,q
compute h=h+a(k)*uu(time-k)/gv(%MSLagState(i,k))
end do k
compute h=h+xi*$
%if(u(time-1)<0.0,uu(time-1)/gv(%MSLagState(i,1)),0.0)
compute ARCHRegimeTF(i)=%if(h>0,$
exp(%logtdensity(h*gv(%MSLagState(i,0)),e,nu)),0.0)
end do i
end
13.1.1 Estimation by ML
We could use either a GARCH instruction or a LINREG to get guess values for
the parameters. Since it's simpler to get the information out, we'll use LINREG.
Because we need lagged residuals for the ARCH function, we'll shift the start
of estimation for the SWARCH model up q periods at the front end. As with any
ARCH or GARCH model done using MAXIMIZE, we need to keep series of squared
residuals and also (to handle the asymmetry) the residuals themselves so we
can compute the needed lags in the variance equation.
linreg y
# constant y{1}
compute gstart=%regstart()+q,gend=%regend()
*
set uu = %seesq
set u = %resids
compute alpha=%beta(1),beta=%beta(2)
frml uf = y-alpha-beta*y{1}
*
* For the non-switching parts of the ARCH parameters
*
compute a=%const(0.05),xi=0.0,nu=10.0
We initialize the GV array by using a fairly wide scale of the least squares
residual variance. Note that the actual model variance after a large outlier can
be quite a bit higher than the largest value of GV because of the ARCH terms.
compute gv=%seesq*||0.2,1.0,5.0||
Finally, we need guess values for the transition. Since we're using the logistic
indexes, we input the P matrix first, since it's more convenient, and invert it to
get THETA.
compute p=||.8,.1,.05|.15,.8,.15||
compute theta=%msplogistic(p)
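For reference, this is the log likelihood FRML (it is the same one that appears in the full listing later in the chapter):

frml logl = u(t)=uf(t),uu(t)=u(t)^2,f=ARCHRegimeTF(t,u(t)),$
   fpt=%MSProb(t,f),log(fpt)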
This evaluates the current residual using the UF formula, saves the residual
and its square, evaluates the likelihoods across the expanded set of regimes,
then uses the %MSProb function to update the regime probabilities and com-
pute the final likelihood. The log of the return from %MSProb is the desired log
likelihood.
Estimation is carried out with the standard:
@MSFilterInit
maximize(parmset=meanparms+archparms+msparms,$
start=%(p=%mslogisticp(theta),pstar=%MSInit()),$
method=bfgs,iters=400,pmethod=simplex,piters=5) logl gstart gend
Note that this takes a fair bit of time to estimate. The analogous Cai form
would take about 1/9 as long, since this has a 27-branch likelihood while Cai's
would just have 3.
The smoothed probabilities of the regimes are computed and graphed with:
@MSSmoothed gstart gend psmooth
set p1 = psmooth(t)(1)
set p2 = psmooth(t)(2)
set p3 = psmooth(t)(3)
graph(max=1.0,style=stacked,footer=$
"Probabilities of Regimes in Three Regime Student-t model") 3
# p1
# p2
# p3
We'll do a somewhat simpler model with just two regimes (rather than three)
and with conditionally Gaussian rather than t errors. The function to return
the VECTOR of likelihoods across expanded regimes for Gaussian errors is
almost identical to the one with t errors, requiring a change only to the
final line that computes the likelihood given the error and variance:
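A sketch of that changed final line (only the density function changes, from %logtdensity with NU to %logdensity) would be:

compute ARCHRegimeGaussF(i)=%if(h>0,$
   exp(%logdensity(h*gv(%MSLagState(i,0)),e)),0.0)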
There will remain two ARCH lags. Because of the way the HS model works,
the FFBS procedure will need to work with augmented regimes using current
and two lags. @MSSample adjusts the output regimes, sampling the full set of
augmented regimes, but copying out only the current regime into MSRegime.
@MSFilterInit
do time=gstart,gend
compute u(time)=uf(time),uu(time)=u(time)^2
compute ft=ARCHRegimeGaussF(time,u(time))
@MSFilterStep time ft
end do time
@MSSample gstart gend MSRegime
We need to sample the mean parameters, the ARCH lag parameters and the
variance inflation factors. It probably is best to do them as those three groups
rather than breaking them down even further.1

1 Tsay (2010) has an example (TSAYP663.RPF) which does Gibbs sampling on a switching
GARCH model. He uses single-move sampling and does each parameter separately, using the
very inefficient process of griddy Gibbs.

You might think that you could do the mean parameters as a relatively simple
weighted least squares sampler taking the variance process as given. Unfortunately,
that isn't a proper procedure: h is a deterministic function of the residuals, and
so can't be taken as given for estimating the parameters controlling the residuals.
Instead, both the mean parameters and the ARCH parameters require Metropolis within
Gibbs (Appendix D). For this, we need to be able to compute the full sample
log likelihood given a test set of parameters at the currently sampled values of
the regimes. This is similar to ARCHRegimeGaussF except it returns the log,
and doesn't have the loop over the expanded set of regimes, instead doing the
calculation using the values of MSRegime:
function MSARCHLogl time
type integer time
*
local integer k
local real hi
*
compute u(time)=uf(time)
compute uu(time)=u(time)^2
compute hi=1.0
do k=1,q
compute hi=hi+a(k)*uu(time-k)/gv(MSRegime(time-k))
end do k
compute hi=hi+xi*$
%if(u(time-1)<0.0,uu(time-1)/gv(MSRegime(time-1)),0.0)
compute MSARCHLogl=%if(hi>0,$
%logdensity(hi*gv(MSRegime(time)),u(time)),%na)
end
To get a distribution for doing Random Walk M-H, we run the maximum like-
lihood ARCH model. We take the ML estimates as the initial values for the
Gibbs sampler, and use the Cholesky factor of the proper segment of the ML co-
variance matrix as the factor for the covariance matrix for the random Normal
increment. This is the initialization code for drawing the mean parameters:
garch(q=2,regressors,asymmetric) %regstart()+q %regend() y
# constant y{1}
compute gstart=%regstart(),gend=%regend()
*
compute msb=%xsubvec(%beta,1,2)
compute far=%decomp(%xsubmat(%xx,1,2,1,2))
compute aaccept=0
and this is the corresponding code for drawing mean parameters inside the
loop:
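(This block is the one from the complete listing, Example 13.2.)

sstats gstart gend msarchlogl(t)>>logplast
compute blast=msb
compute msb =blast+%ranmvnormal(far)
sstats gstart gend msarchlogl(t)>>logptest
compute alpha=exp(logptest-logplast)
if alpha>1.0.or.%uniform(0.0,1.0)<alpha {
compute logplast=logptest
compute aaccept=aaccept+1
}
else
compute msb=blast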
First off, this computes the sample log likelihood given the current values for
the regimes and all the parameters. It then saves the current values for the
mean parameters (into BLAST) and draws a test set by adding a Normal in-
crement to the current values. The test set has to go into MSB since those are
the parameters used by MSARCHLogl. The sample log likelihood is then recom-
puted at the test set. The Metropolis condition is evaluated and if the draw is
acceptable, we keep it (and increment the count of accepted draws); if not, we
copy the previous set back into MSB. Note that the log likelihood at whichever
set of parameters we keep is in LOGPLAST; that way, we don't need to
recompute it when we do the procedure for drawing the ARCH parameters.
The setup code for drawing the ARCH parameters is:
compute a=%xsubvec(%beta,4,5),xi=%beta(6)
compute farch=%decomp(%xsubmat(%xx,4,6,4,6))
compute [vector] msg=a~~xi
compute gaccept=0
compute glast=msg
:redrawg
compute msg=glast+%ranmvnormal(farch)
compute a=%xsubvec(msg,1,q),xi=msg(q+1)
if %minvalue(a)<0.0
goto redrawg
sstats gstart gend msarchlogl(t)>>logptest
compute alpha=exp(logptest-logplast)
if alpha>1.0.or.%uniform(0.0,1.0)<alpha {
compute logplast=logptest
compute gaccept=gaccept+1
}
else
compute msg=glast
compute a=%xsubvec(msg,1,q),xi=msg(q+1)
The methods used for drawing the mean and ARCH parameters are quite com-
monly used in non-linear models, and are generally very effective. Sometimes
it's necessary to tweak the increment distribution by scaling up the covariance
matrix, or by switching to a fatter-tailed t rather than the Normal, but usually
not with such a small set of free parameters in each grouping.
The variance inflation factors are a bit trickier. They can't be treated like the
variances in Examples 8.3 and 9.3 because they're more firmly tied into the
rest of the model. And, because they are (of necessity) positive, a Normal or t
increment random walk isn't fully appropriate.2 Instead of a random walk,
we'll use mean one random multipliers, drawn as random chi-squares normalized
to unit means. This requires a relatively simple correction factor for the
log ratio of the probability of moving to the probability of reversing the move.
Unlike the other parameters, it isn't clear a priori what the appropriate spread
is for the increments. χ²(ν)/ν has mean 1 and variance 2/ν. A value like ν = 10
is almost certainly too small, as almost half of all moves will be on the order
of 25% up or down, which is likely to be too sizeable to be accepted on a regular
basis. After some experimenting (looking at the acceptance percentage),
we ended up with ν = 50, though almost anything in the 20-50 range will give
fairly similar performance. The set up keys off the constant in the variance
equation from the GARCH instruction, scaling up and down from that:
compute gv(1)=%beta(3)/2.0
compute gv(2)=%beta(3)*2.0
dec vect nuv(nstates) fiddle(nstates)
ewise nuv(i)=50.0
compute vaccept=0
compute glast=gv
ewise fiddle(i)=%ranchisqr(nuv(i))/nuv(i)
compute logqratio=-2.0*%sum(%log(fiddle))
compute gv=glast.*fiddle
sstats gstart gend msarchlogl(t)>>logptest
compute alpha=exp(logptest-logplast-logqratio)
if alpha>1.0.or.%uniform(0.0,1.0)<alpha {
compute logplast=logptest
compute vaccept=vaccept+1
}
else
compute gv=glast
This completes the sampler (other than the re-labeling code, which switches
regimes based upon the order of the GV). We need to check how our various
M-H draws worked, which is done outside the draw loop:
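(These lines come from the full listing, Example 13.2.)

disp "Acceptance Probabilities"
disp "Mean Parameters" ###.## 100.0*aaccept/ndraws
disp "ARCH Parameters" ###.## 100.0*gaccept/ndraws
disp "Variance Scales" ###.## 100.0*vaccept/ndraws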
You should always start with a very modest number of draws to check whether
your rates are reasonable. Make adjustments if they aren't, running a production
number of draws only when you're satisfied that you have the sampler set
up properly. Note that even though this is able to use Multi-Move sampling,
because of the size of the data set (1328 usable observations), it takes quite a
while even to do just the 200 burn-in plus 500 keeper draws that we've used in
the full example (Example 13.2).
The use of DFFilterSize allows for increasing the number of lags in the ex-
panded regimes beyond just one. As we're using this, this is just NSTATES.
The following function returns a VECTOR of variances across each combination
of current and lagged regimes. This is needed for computing the likelihood,
and is also needed for collapsing the variances, which is why we write it as
a separate function. Note that it uses hs(%MSLagState(i,1))(time-1) as
the lagged variance for expanded regime I.
function GARCHRegimeH time
type vector GARCHRegimeH
type integer time
*
local integer i
*
dim GARCHRegimeH(nexpand)
do i=1,nexpand
compute GARCHRegimeH(i)=1.0+msg(2)*hs(%MSLagState(i,1))(time-1)+$
msg(1)*uu(time-1)/gv(%MSLagState(i,1))+$
%if(u(time-1)<0,msg(3)*uu(time-1)/gv(%MSLagState(i,1)),0.0)
end do i
end
This returns the VECTOR of (not logged) likelihoods across the expanded
regimes. Most of the calculation is done by the function above that computes
the variances.
summing out the lagged regimes.3 The summarized variances are then saved
into the HS VECTOR of SERIES, and the current residual and square residual
are saved into the series U and UU.
The actual estimation is now relatively simple. We just need to make sure the
Markov switching filter calculations are properly initialized:
function DFStart
@MSFilterInit
end
*****************************************************************
frml LogDFFilter = DuekerFilter(t)
*
nonlin p msg msb gv
@MSFilterSetup
maximize(start=DFStart(),pmethod=simplex,piters=5,method=bfgs) $
logDFFilter gstart gend
3 PH has the probability-weighted variances, which are then reshaped into a matrix with
the current regime in the columns and the lagged regime in the rows. The sums down the
columns of those, divided by the corresponding sums down the columns of the probabilities
themselves, give the probability-weighted estimates for the variances.
function ARCHRegimeTF time e
type vector ARCHRegimeTF
type real e
type integer time
*
local integer i k
local real h
*
dim ARCHRegimeTF(nexpand)
do i=1,nexpand
compute h=1.0
do k=1,q
compute h=h+a(k)*uu(time-k)/gv(%MSLagState(i,k))
end do k
compute h=h+xi*$
%if(u(time-1)<0.0,uu(time-1)/gv(%MSLagState(i,1)),0.0)
compute ARCHRegimeTF(i)=%if(h>0,$
exp(%logtdensity(h*gv(%MSLagState(i,0)),e,nu)),0.0)
end do i
end
*********************************************************************
*
* Run a regression to use for guess values.
*
linreg y
# constant y{1}
compute gstart=%regstart()+q,gend=%regend()
*
set uu = %seesq
set u = %resids
compute alpha=%beta(1),beta=%beta(2)
frml uf = y-alpha-beta*y{1}
*
* For the non-switching parts of the ARCH parameters
*
compute a=%const(0.05),xi=0.0,nu=10.0
*
* As is typically the case with Markov switching models, there is no
* global identification of the regimes. By defining a fairly wide
* spread for gv, we'll hope that we'll stay in the zone of the
* likelihood where regime 1 is low variance and regime 2 is high.
*
compute gv=%seesq*||0.2,1.0,5.0||
*
* These are our guess values for the P matrix. We have to invert
* that to get the guess values for the logistic indexes.
*
compute p=||.8,.1,.05|.15,.8,.15||
compute theta=%msplogistic(p)
*
* We need to keep series of the residual (u) and squared residual
* (uu). Because the mean function is the same across regimes, we can
* just compute the residual and send it to ARCHRegimeTF, which
* computes the likelihoods for the different ARCH variances.
*
frml logl = u(t)=uf(t),uu(t)=u(t)^2,f=ARCHRegimeTF(t,u(t)),$
fpt=%MSProb(t,f),log(fpt)
*
@MSFilterInit
maximize(parmset=meanparms+archparms+msparms,$
start=%(p=%mslogisticp(theta),pstar=%MSInit()),$
method=bfgs,iters=400,pmethod=simplex,piters=5) logl gstart gend
*
@MSSmoothed gstart gend psmooth
set p1 = psmooth(t)(1)
set p2 = psmooth(t)(2)
set p3 = psmooth(t)(3)
graph(max=1.0,style=stacked,footer=$
"Probabilities of Regimes in Three Regime Student-t model") 3
# p1
# p2
# p3
compute MSARCHLogl=%if(hi>0,$
%logdensity(hi*gv(MSRegime(time)),u(time)),%na)
end
********************************************************************
*
* Do an ARCH model to get initial guess values and matrices for
* random walk M-H. Run this over the same range as we will use for
* the switching model.
*
garch(q=2,regressors,asymmetric) %regstart()+q %regend() y
# constant y{1}
compute gstart=%regstart(),gend=%regend()
*
compute msb=%xsubvec(%beta,1,2)
compute far=%decomp(%xsubmat(%xx,1,2,1,2))
compute aaccept=0
*
* The GARCH instruction with 2 lags and asymmetry includes asymmetry
* terms for each lag. We'll ignore the second in extracting the guess
* values and the covariance matrix.
*
compute a=%xsubvec(%beta,4,5),xi=%beta(6)
compute farch=%decomp(%xsubmat(%xx,4,6,4,6))
compute [vector] msg=a~~xi
compute gaccept=0
*
* Start with guess values for GV that scale the original constant
* down and up.
*
compute gv(1)=%beta(3)/2.0
compute gv(2)=%beta(3)*2.0
dec vect nuv(nstates) fiddle(nstates)
ewise nuv(i)=50.0
compute vaccept=0
*
* Do the standard setup for MS filtering
*
@MSFilterSetup
*
* Prior for transitions.
*
dec vect[vect] gprior(nstates)
dec vect tcounts(nstates)
compute gprior(1)=||8.0,2.0||
compute gprior(2)=||2.0,8.0||
*
dec rect p(nstates,nstates)
ewise p(i,j)=gprior(j)(i)/%sum(gprior(j))
*
gset MSRegime gstart gend = %ranbranch(%fill(nstates,1,1.0))
*
* This is a relatively small number of draws. If you have patience,
* you would want these to be more like 2000 and 5000.
*
compute nburn=200,ndraws=500
*
* For relabeling
*
compute tempvg=msg
compute tempvb=msb
compute tempp =p
*
* Used for bookkeeping
*
nonlin(parmset=allparms) gv msb msg p
dec series[vect] bgibbs
gset bgibbs 1 ndraws = %parmspeek(allparms)
set regime1 = 0.0
infobox(action=define,lower=-nburn,upper=ndraws,progress) $
"Gibbs Sampling"
do draw=-nburn,ndraws
compute swaps=%index(gv)
if swaps(1)<>1.or.swaps(2)<>2 {
disp "Draw" draw "Executing swap"
*
* Relabel the scale factors
*
compute tempvg=gv
ewise gv(i)=tempvg(swaps(i))
*
* Relabel the garch parameters
*
compute tempvg=msg
ewise msg(i)=tempvg(swaps(i))
*
* Relabel the mean parameters
*
compute tempvb=msb
ewise msb(i)=tempvb(swaps(i))
*
* Relabel the transitions
*
compute tempp=p
ewise p(i,j)=tempp(swaps(i),swaps(j))
}
compute a=%xsubvec(msg,1,q),xi=msg(q+1)
*
* Draw regimes by multi-move sampling
*
@MSFilterInit
do time=gstart,gend
compute u(time)=uf(time),uu(time)=u(time)^2
compute ft=ARCHRegimeGaussF(time,u(time))
@MSFilterStep time ft
end do time
@MSSample gstart gend MSRegime
*
* Draw ps
*
@MSDrawP(prior=gprior) gstart gend p
*
* Draw AR by Metropolis-Hastings
*
sstats gstart gend msarchlogl(t)>>logplast
compute blast=msb
compute msb =blast+%ranmvnormal(far)
sstats gstart gend msarchlogl(t)>>logptest
compute alpha=exp(logptest-logplast)
if alpha>1.0.or.%uniform(0.0,1.0)<alpha {
compute logplast=logptest
compute aaccept=aaccept+1
}
else
compute msb=blast
*
* Draw the base GARCH parameters by M-H. Reject any with any
* negative coefficients on the ARCH lags.
*
compute glast=msg
:redrawg
compute msg=glast+%ranmvnormal(farch)
compute a=%xsubvec(msg,1,q),xi=msg(q+1)
if %minvalue(a)<0.0
goto redrawg
sstats gstart gend msarchlogl(t)>>logptest
compute alpha=exp(logptest-logplast)
if alpha>1.0.or.%uniform(0.0,1.0)<alpha {
compute logplast=logptest
compute gaccept=gaccept+1
}
else
compute msg=glast
compute a=%xsubvec(msg,1,q),xi=msg(q+1)
*
* Draw the GARCH scale parameters. These can't be estimated like
* variances in regression models because they don't just scale the
* current variance, but also appear in lags in the ARCH
* formulation.
*
compute glast=gv
ewise fiddle(i)=%ranchisqr(nuv(i))/nuv(i)
compute logqratio=-2.0*%sum(%log(fiddle))
compute gv=glast.*fiddle
sstats gstart gend msarchlogl(t)>>logptest
compute alpha=exp(logptest-logplast-logqratio)
if alpha>1.0.or.%uniform(0.0,1.0)<alpha {
compute logplast=logptest
compute vaccept=vaccept+1
}
else
compute gv=glast
*
infobox(current=draw)
if draw>0 {
set regime1 gstart gend = regime1+(MSRegime==1)
compute bgibbs(draw)=%parmspeek(allparms)
}
end do draw
infobox(action=remove)
*
disp "Acceptance Probabilities"
disp "Mean Parameters" ###.## 100.0*aaccept/ndraws
disp "ARCH Parameters" ###.## 100.0*gaccept/ndraws
disp "Variance Scales" ###.## 100.0*vaccept/ndraws
*
@mcmcpostproc(ndraws=ndraws,mean=bmeans,stderrs=bstderrs) bgibbs
*
report(action=define)
report(atrow=1,atcol=1,fillby=cols) %parmslabels(allparms)
report(atrow=1,atcol=2,fillby=cols) bmeans
report(atrow=1,atcol=3,fillby=cols) bstderrs
report(action=format,picture="*.########")
report(action=show)
*
set regime1 = regime1/ndraws
graph(footer="Probability of low-variance regime")
# regime1
*
dim GARCHRegimeGaussF(nexpand)
compute hi = GARCHRegimeH(time)
do i=1,nexpand
compute GARCHRegimeGaussF(i)=%if(hi(i)>0,$
exp(%logdensity(hi(i)*gv(%MSLagState(i,0)),e)),0.0)
end do i
end
*********************************************************************
*
* Do a GARCH model to get initial guess values and matrices for
* random walk M-H.
*
linreg y
# constant y{1}
set u = %resids
set uu = %sigmasq
*
garch(p=1,q=1,regressors,asymmetric) / y
# constant y{1}
*
* To make this comparable with the ARCH(2) model, we'll drop two data
* points.
*
compute gstart=%regstart()+2,gend=%regend()
*
compute msb=%xsubvec(%beta,1,2)
*
* The "H" series is scaled down by g(S) relative to the standard use
* in GARCH models, so we scale H and HS down accordingly. These will
* be overwritten for all but the pre-sample values anyway.
*
compute msg=%xsubvec(%beta,4,6)
set h = %sigmasq/%beta(3)
do i=1,nstates
set hs(i) = %sigmasq/%beta(3)
end do i
*
* Start with guess values for GV that scale the original constant
* down and up.
*
compute gv(1)=%beta(3)/2.0
compute gv(2)=%beta(3)*2.0
*
compute p=||.8,.2||
*********************************************************************
*
* This does a single step of the Dueker (approximate) filter
*
function DuekerFilter time
type integer time
*
local real e
local vector fvec hvec ph
*
compute e=uf(time)
compute fvec=GARCHRegimeGaussF(time,e)
@MSFilterStep(logf=DuekerFilter) time fvec
compute hvec=GARCHRegimeH(time)
*
* Take probability weighted averages of the hs for each regime.
*
compute ph=pt_t(time).*hvec
compute phstar=%sumc(%reshape(ph,nstates,DFFilterSize))./$
%sumc(%reshape(pt_t(time),nstates,DFFilterSize))
*
* Push them out into the HS vector
*
compute %pt(hs,time,phstar)
*
* Save the residual and residual squared
*
compute u(time)=e
compute uu(time)=e^2
end
*****************************************************************
function DFStart
@MSFilterInit
end
*****************************************************************
frml LogDFFilter = DuekerFilter(t)
*
nonlin p msg msb gv
@MSFilterSetup
maximize(start=DFStart(),pmethod=simplex,piters=5,method=bfgs) $
logDFFilter gstart gend
@MSSmoothed gstart gend psmooth
set p1 = psmooth(t)(1)
set p2 = psmooth(t)(2)
graph(max=1.0,style=stacked,footer=$
"Probabilities of Regimes in Two Regime Gaussian model") 2
# p1
# p2
Appendix A
Proof
By Bayes' rule

f(x|y, I) = f(y|x, I) f(x|I) / f(y|I)   (A.3)

By (A.1), f(x|y, I) = f(x|y, J), so we have

f(x|y, J) = f(y|x, I) f(x|I) / f(y|I)   (A.4)
By standard results,
f(x|J) = ∫ f(x, y|J) dy = ∫ f(x|y, J) f(y|J) dy   (A.5)
Appendix B
The EM Algorithm
which can be quite difficult in the (non-trivial) case where x and y aren't inde-
pendent. The result in the DLR paper is that repeating the following process
will (under regularity conditions) eventually produce the maximizer of (B.1):
E Step: Let θ_0 be the parameter values at the end of the previous iteration.
Determine the functional form for

Q(θ|θ_0) = E_{x|y,θ_0} [ log f(y, x|θ) ]   (B.2)
Generalized EM
It turns out that it isn't necessary to fully maximize (B.2) during the M step
to get the overall result that EM will eventually reach a mode of the overall
optimization problem. In Generalized EM, the M step takes one (or more) it-
erations on a process which increases the value of (B.2). This is particularly
helpful in multivariate estimation, since actually maximizing (B.2) would then
almost always involve an iterated SUR estimate. With iterated SUR, most of
the improvement in the function value comes on the first iteration; since the
function itself will change with the next E step, extra calculations to achieve
convergence to a temporary maximum aren't especially useful.
1 This is less efficient computationally because it estimates the entire parameter vector
using a hill-climbing method, rather than directly optimizing each subset.
Appendix C
Hierarchical Priors
up being fairly small (at least in some Gibbs sweeps). When T is a number
under 10, there is a non-trivial chance that the sampled value will be quite
different (by a factor of two or more) from σ².
An alternative which can be used in a situation like this is a hierarchical prior.
The variances in the subsample are written as the product of a common scale
factor, defined across the full sample, and a relative scale. We'll write this as
r_i² σ² for subsample i. The common scale factor is drawn using the full sample
of data, and so can be done with a non-informative prior, while the relative
scale can use an informative prior with a scale of 1.0; that is, all the variances
are thought to be scales of an overall variance. With a sample split into two
regimes based upon variance, the likelihood is:
(r_1² σ²)^(-T_1/2) exp( -(1/(2 r_1² σ²)) Σ_{S_t=1} x_t² ) × (r_2² σ²)^(-T_2/2) exp( -(1/(2 r_2² σ²)) Σ_{S_t=2} x_t² )
and the hierarchical prior (assuming a non-informative one for the common
variance) is
(r_1²)^(-ν_1/2 - 1) exp( -ν_1/(2 r_1²) ) × (r_2²)^(-ν_2/2 - 1) exp( -ν_2/(2 r_2²) )
From the product of the prior and likelihood, we can isolate the r_1² factors as

(r_1²)^(-(T_1 + ν_1)/2 - 1) exp( -(1/(2 r_1²)) ( ν_1 + (1/σ²) Σ_{S_t=1} x_t² ) )   (C.6)
1. Draw the overall scale σ² using the sums of squares weighted by each
regime's relative variance (the bracketed term in the exponent of (C.7)),
with degrees of freedom equal to the total sample size.
2. Draw the relative variances using the sum of squares in each subsample
weighted by the overall variance, plus the prior degrees of freedom (the
term in the exponent of (C.6)), with degrees of freedom equal to the
subsample size plus the prior degrees of freedom.
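In scalar form, these two steps are exactly the draws used for the switching equation variances in Example 12.4 (Chapter 12); a sketch using that example's names (VHAT, SIGMAE, SCOMMON, NUCOMMON, S2COMMON, NUPRIOR, with NCOND conditioning observations) is:

* draw the common scale using the full sample, weighted by each regime's variance
sstats gstart+ncond gend vhat(t)(1)^2/sigmae(MSRegime(t))>>sumsqr
compute scommon=(scommon*sumsqr+nucommon*s2common)/$
   %ranchisqr(%nobs+nucommon)
* draw each regime's variance from its own subsample, with prior centered on scommon
do k=1,nstates
   sstats(smpl=MSRegime(t)==k) gstart+ncond gend vhat(t)(1)^2>>sumsqr
   compute sigmae(k)=(sumsqr+nuprior(k)*scommon)/$
      %ranchisqr(%nobs+nuprior(k))
end do k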
with T + ν degrees of freedom. Drawing the covariance matrix from this is done
with
compute sigmav(k)=%ranwisharti(%decomp(inv(uu(k)+nuprior(k)*sigma)),$
tcounts(k)+nuprior(k))
where uu(k) is the sum of the outer product of residuals for regime k,
tcounts(k) is number of observations in that regime, and nuprior(k) is the
prior degrees of freedom (which will almost always be the same across regimes).
The draw for the common Σ isn't quite as simple since we're now dealing with
matrices rather than scalars, and matrices, in general, don't allow wholesale
rearrangements of calculations. The simplest way to handle this is to draw the
common sigma conditional on the regime-specific ones by inverting the prior
from an inverse Wishart for the regime-specific covariance matrix to a Wishart
for the common one. The mean for this is the inverse of the sums of the in-
verses. The draw can be made with:
compute uucommon=%zeros(nvar,nvar)
do k=1,nstates
compute uucommon=uucommon+nuprior(k)*inv(sigmav(k))
end do k
compute sigma=%ranwishartf(%decomp(inv(uucommon)),%sum(nuprior))
Appendix D
Gibbs Sampling and Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) techniques allow for generation of draws
from distributions which are too complicated for direct analysis. The simplest
form of this is Gibbs sampling. This simulates draws from the density by means
of a correlated Markov Chain. Under the proper circumstances, estimates of
sample statistics generated from this converge to their true means under the
actual posterior density.
Gibbs Sampling
The idea behind Gibbs sampling is that we partition our set of parameters
into two or more groups: for illustration, we'll assume we have two, called θ_1
and θ_2. We want to generate draws from a joint density f(θ_1, θ_2), but don't
have any simple way to do that. We can always write the joint density as
f(θ_1|θ_2) f(θ_2) and as f(θ_2|θ_1) f(θ_1). In a very wide range of cases, these
conditional densities are tractable; in particular, with switching models, if we know
the regimes, the model generally takes a standard form. It's the unconditional
densities of the other block that are the problem.
The intuition behind Gibbs sampling is that if we draw θ_1 given θ_2, and then
θ_2 given θ_1, the pair should be closer to their unconditional distribution than
before. Each combination of draws is known as a sweep. Repeat sweeps often
enough and it should converge to give draws from the joint distribution. This
turns out to be true in most situations. Just discard enough at the beginning
(called the burn-in draws) so that the chain has had time to converge to the
unconditional distribution. Once you have a draw from f (1 , 2 ), you (of ne-
cessity) have a draw from the marginals f (1 ) and f (2 ), so now each sweep
will give you a new draw from the desired distribution. Theyre just not inde-
pendent draws. With enough forgetfulness, however, the sample averages of
the draws will converge to the true mean of the joint distribution.
Gibbs sampling works best when parameters which are (strongly) correlated
with each other are in the same block, otherwise it will be hard to move both
since the sampling procedure for each is tied to the other.
However, there are only a handful of distributions for which we can do that:
things like Normals, gammas, Dirichlets. In many cases, the desired density
is the likelihood for a complicated non-linear model which doesn't have such a
form.
Suppose we want to sample the random variable θ from f(θ), which takes a
form for which direct sampling is difficult.1 Instead, we sample θ* from a more
convenient density q(θ). Let θ^(i-1) be the value from the previous sweep. Compute

α = [ f(θ*) / f(θ^(i-1)) ] × [ q(θ^(i-1)) / q(θ*) ]   (D.1)

With probability α, we accept the new draw and make θ^(i) = θ*, otherwise we
stay with our previous value and make θ^(i) = θ^(i-1). Note that it's possible to
have α > 1, in which case we just accept the new draw.
The first ratio in (D.1) makes perfect sense. We want, as much as possible, to
have draws where the posterior density is high, and not where it's low. The
second counterweights (notice that it's the ratio in the opposite order) for the
probability of drawing a given value. Another way of looking at the ratio is

α = [ f(θ*) / q(θ*) ] / [ f(θ^(i-1)) / q(θ^(i-1)) ]

f/q is a measure of the relative desirability of a draw. The ones that really
give us a strong move signal are where the target density (f) is high and
the proposal density (q) is low; we may not see those again, so when we get a
chance we should move. Conversely, if f is low and q is high, we might as well
stay put; we may revisit that one at a later time.
What this describes is Independence Chain Metropolis, where the proposal den-
sity doesn't depend upon the last draw. Where a set of parameters is expected
to have a single mode, a good proposal density often can be constructed from
maximum likelihood estimates, using a Normal or multivariate t centered at
the ML estimates, with the covariance matrix some scale multiple (often just
1.0) of the estimate coming out of the maximization procedure.
If the actual density has a shape which isn't a good match for a Normal or
t, this is unlikely to work well. If there are points where f is high and q is
relatively low, it might take a very large number of draws before we find them;
once we get there we'll likely stay for a while. A more general procedure has
the proposal density depending upon the last value: q(θ | θ^(i-1)). The acceptance
criterion is now based upon:

α = [ f(θ*) / q(θ* | θ^(i-1)) ] / [ f(θ^(i-1)) / q(θ^(i-1) | θ*) ]   (D.2)
1 What we're describing here is Metropolis-Hastings. In all our applications, the densities
will be conditional on the other blocks of parameters, which is formally known as Metropolis
within Gibbs.
Note that the counterweighting is now based upon the ratio of the probability
of moving from θ^(i-1) to θ* to the probability of moving back. The calculation
simplifies greatly if the proposal is a mean zero Normal or t added to θ^(i-1).
Because of symmetry, q(θ* | θ^(i-1)) = q(θ^(i-1) | θ*), so the q's cancel, leaving just:

α = f(θ*) / f(θ^(i-1))
This is known as Random Walk Metropolis, which is probably the most common
choice for Metropolis within Gibbs. Unlike Independence Chain, when you
move to an isolated area where f is high, you can always move back by, in
effect, retracing your route.
Avoiding Overflows
The f that we've been using in this is either the sample likelihood or the pos-
terior (sample likelihood times prior). With many data sets, the sample log
likelihood is on the order of 100s or 1000s (positive or negative). If you try to
convert these large log likelihoods by taking the exp, you will likely overflow
(for large positive) or underflow (for large negative), in either case, losing the
actual value. Instead, you need to calculate the ratios by adding and subtract-
ing in log form, and taking the exp only at the end.
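A small numerical illustration (hypothetical values) of why this matters:

import numpy as np

log_f_new, log_f_old = -1490.0, -1500.0
print(np.exp(log_f_new), np.exp(log_f_old))    # both underflow to 0.0
print(np.exp(log_f_new - log_f_old))           # exp(10), about 22026, so alpha > 1 and we accept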
Diagnostics
There are two types of diagnostics: those on the behavior of the sampling methods for subsets of the parameters, and those on the behavior of the overall chain. When you use Independence Chain Metropolis, you would generally like the acceptance rate to be fairly high. In fact, if q = f (which means that you're actually doing a simple Gibbs draw), everything cancels and the acceptance probability is α = 1. Numbers in the range of 20-30% are generally fine, but acceptance rates well below 10% are often an indication that you have a bad choice for q: your proposal density isn't matching well with f. When you use Random Walk Metropolis, an acceptance probability near 100% isn't good. You will almost never get rates like that unless you're taking very small steps and thus not moving around the parameter space sufficiently. Numbers in the 20-40% range are usually considered to be ideal. You can often tweak either the variance or the degrees of freedom in the increment to move that up or down. Clearly, in either case, you need to count the number of times you move out of the total number of draws.
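The bookkeeping itself is trivial. As an illustration only (this is not a procedure from the text), one might monitor the acceptance rate during the burn-in of a Random Walk sampler and nudge the scale of the increment toward the 20-40% band, reusing the random_walk_mh_step sketch from above:

import numpy as np

def run_rw_chain(theta0, log_f, chol_increment, nburn, ndraws, rng):
    theta, scale, accepted, draws = theta0, 1.0, 0, []
    for sweep in range(nburn + ndraws):
        theta, moved = random_walk_mh_step(theta, log_f, scale * chol_increment, rng)
        accepted += moved
        if sweep < nburn and (sweep + 1) % 200 == 0:
            rate = accepted / (sweep + 1)
            if rate < 0.20:
                scale *= 0.8       # smaller steps raise the acceptance rate
            elif rate > 0.40:
                scale *= 1.25      # larger steps lower it
        if sweep >= nburn:
            draws.append(theta)
    return np.array(draws), accepted / (nburn + ndraws)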
If you're concerned about the overall behavior of the chain, you can run it several times and see if you get similar results. If you get decidedly different results, it's possible that the chain hasn't run long enough to converge, requiring a greater number of burn-in draws. The @MCMCPOSTPROC procedure also computes a CD measure which does a statistical comparison of the first part of the accepted draws with the last part. If the burn-in is sufficient, these should be asymptotically standard Normal statistics (there's one per parameter). If you get values that have absolute values far out in the tails for a Normal (such as 4 and above), that's a strong indication that you either have a general problem with the chain, or you just haven't done a long enough burn-in.
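The exact calculation in @MCMCPOSTPROC isn't reproduced here, but the idea of a Geweke-type CD measure can be sketched as follows (Python; the window fractions, the lag length and the Newey-West long-run variance estimate are illustrative choices, not taken from the procedure):

import numpy as np

def cd_statistic(draws, first=0.1, last=0.5, lags=20):
    """Geweke-style comparison of the early and late parts of one parameter's chain."""
    draws = np.asarray(draws, dtype=float)
    n = len(draws)
    a = draws[: int(first * n)]
    b = draws[-int(last * n):]

    def lrvar(x):
        # Newey-West estimate of the variance of the sample mean
        x = x - x.mean()
        s = np.mean(x * x)
        for k in range(1, min(lags, len(x) - 1) + 1):
            w = 1.0 - k / (lags + 1.0)
            s += 2.0 * w * np.mean(x[k:] * x[:-k])
        return s / len(x)

    # Approximately standard Normal when the chain has converged
    return (a.mean() - b.mean()) / np.sqrt(lrvar(a) + lrvar(b))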
Pathologies
Most properly designed chains will eventually converge, although it might take many sweeps to accomplish this. There are, however, some situations in which it won't work no matter how long it runs. If the Markov Chain has an absorbing state (or set of states), once the chain moves into this, it can't get out. With the types of models in this book, we have two possible absorbing states. First, if the unconditional probability of a branch is sampled as (near) zero, then the posterior probability of selecting that branch will also be near zero, so the regime sampling will give a zero count, which will again give a zero unconditional probability if the prior is uninformative. Having an even slightly informative Dirichlet prior will prevent the chain from getting trapped at a zero probability state, since even a zero count of sampled regimes will not produce a posterior which has most of its mass near zero. The other possible absorbing state is a regime with a zero variance in a linear regression or something similar.
Appendix E
Time-Varying Transition Probabilities
E.1 EM Algorithm
The only adjustment required to the M step is for the estimation of the parameters of the index functions. Unlike the case with fixed transition probabilities, it isn't possible to maximize these with a single calculation. Instead, it makes more sense to do generalized EM and just make a single iteration of the maximization algorithm. We need one step in the maximizer for:

    Σ(t=1 to T) Σ(i,j) ( log f(S_t = i | S_{t-1} = j, θ) ) p_ij,t                    (E.2)
where the p_ij,t are the smoothed probabilities of S_t = i and S_{t-1} = j at t. Because it appears in the denominator of (E.1), each θ_ij for a fixed j appears in the log f for all the other i. The derivative of (E.2) with respect to θ_ij is

    Σ(t=1 to T) Σ(k=1 to M) p_kj,t ( I{i = k} − p_ij,t(θ) ) Z_t
        = Σ(t=1 to T) [ p_ij,t (1 − p_ij,t(θ)) − (p_j,t − p_ij,t) p_ij,t(θ) ] Z_t    (E.3)
¹ This assumes that the same explanatory variables are used in all the transitions. That can be relaxed with a bit of extra work.
where the inner sum collapses because the terms with i ≠ k are identical other than the p_kj,t, which just add up to p_j,t − p_ij,t.² Zeroing this equalizes the predicted and actual (average) log odds ratios between moving to i or not moving to i from regime j.
The second derivative is also easily computed as:

    −Σ(t=1 to T) p_j,t (1 − p_ij,t(θ)) p_ij,t(θ) Z_t' Z_t                    (E.4)
² p_j,t is the empirical probability of having S_{t-1} = j. The p_ij,t sum to 1 across all i and j, so this downweights any observations that have a very low probability of S_{t-1} = j and thus tell us relatively little about j to i transitions.
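As an illustration (Python rather than RATS; the array names and the normalization of one destination regime per j are assumptions of the sketch, not taken from the text), one generalized-EM Newton step built from (E.3) and (E.4) might look like:

import numpy as np

def gem_step_tvtp(theta, Z, P):
    """One generalized-EM (Newton) step for the transition index parameters.

    theta : (M, M, k) array, theta[i, j] = coefficients for the j -> i transition
    Z     : (T, k) array of explanatory variables
    P     : (T, M, M) array of smoothed probabilities of S_t = i, S_{t-1} = j
    """
    M, _, k = theta.shape
    # Model transition probabilities p_ij,t(theta) from the logistic index functions
    index = np.einsum('ijk,tk->tij', theta, Z)                # theta_ij' Z_t
    expx = np.exp(index - index.max(axis=1, keepdims=True))   # guard against overflow
    p = expx / expx.sum(axis=1, keepdims=True)                 # normalize over destinations i
    pj = P.sum(axis=1)                                         # smoothed probability of S_{t-1} = j

    new_theta = theta.copy()
    for j in range(M):
        for i in range(1, M):   # destination 0 is treated as a normalized base case
            g = ((P[:, i, j] - pj[:, j] * p[:, i, j])[:, None] * Z).sum(axis=0)   # gradient, (E.3)
            w = pj[:, j] * p[:, i, j] * (1.0 - p[:, i, j])
            H = (Z * w[:, None]).T @ Z                                            # minus the Hessian, (E.4)
            new_theta[i, j] = theta[i, j] + np.linalg.solve(H, g)                 # one Newton step
    return new_theta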
Appendix F
Probability Distributions
Normal
Support: (−∞, ∞)
Mean: μ
Variance: σ²

Beta
Kernel: x^(a−1) (1 − x)^(b−1)

Gamma
Mean: ab
Variance: ab²
Main uses: Prior, exact and approximate posterior for the precision (reciprocal of variance) of residuals or other shocks in a model

Inverse gamma
Range: [0, ∞)
Mean: 1/(a(b−1)) if b > 1
Variance: 1/(a²(b−1)²(b−2)) if b > 2
Main uses: Prior, exact and approximate posterior for the variance of residuals or other shocks in a model

Bernoulli
Kernel: p^x (1 − p)^(1−x)
Range: {0, 1}
Mean: p
Variance: p(1 − p)

Multivariate Normal
Mean: μ
Variance: Σ or H^(−1) (H is the precision matrix)

Wishart
Mean: νA
Main uses: Prior, exact and approximate posterior for the precision matrix (inverse of covariance matrix) of residuals in a multivariate regression