
Time-Series Econometrics

A Concise Course

Francis X. Diebold
University of Pennsylvania

Edition 2019
Version 2019.01.14
Copyright
© 2013-2019, by Francis X. Diebold.
All rights reserved.

This work is freely available for your use, but be warned: it is highly preliminary, significantly
incomplete, and rapidly evolving. It is licensed under the Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International License. (Briefly: I retain copyright, but
you can use, copy and distribute non-commercially, so long as you give me attribution and do
not modify. To view a copy of the license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.)
In return I ask that you please cite the book whenever appropriate, as: “Diebold,
F.X. (2019), Time Series Econometrics, Department of Economics, University of Pennsylvania,
http://www.ssc.upenn.edu/~fdiebold/Textbooks.html.”
To Marc Nerlove,
who taught me time series,

and to my wonderful Ph.D. students,


his “grandstudents”
Brief Table of Contents

About the Author xv

About the Cover xvi

Guide to e-Features xvii

Acknowledgments xviii

Preface xxii

Chapter 1. The Wold Representation and its Approximation 1

Chapter 2. Spectral Analysis 23

Chapter 3. Markovian Structure, Linear Gaussian State Space, and Optimal (Kalman) Filtering 47

Chapter 4. Frequentist Time-Series Likelihood Evaluation, Optimization, and Inference 79

Chapter 5. Simulation Basics 90

Chapter 6. Bayesian Analysis by Simulation 96

Chapter 7. (Much) More Simulation 109

Chapter 8. Non-Stationarity: Integration, Cointegration and Long Memory 126

Chapter 9. Non-Linear Non-Gaussian State Space and Optimal Filtering 138

Chapter 10. Volatility Dynamics 145

Chapter 11. High Dimensionality 186

Appendices 187

Appendix A. A “Library” of Useful Books 188


Detailed Table of Contents

About the Author xv

About the Cover xvi

Guide to e-Features xvii

Acknowledgments xviii

Preface xxii

Chapter 1. The Wold Representation and its Approximation 1


1.1 Economic Time Series and Their Analysis 1
1.2 The Environment 1
1.3 White Noise 3
1.4 The Wold Decomposition and the General Linear Process 4
1.5 Approximating the Wold Representation 6
1.5.1 The MA(q) Process 6
1.5.2 The AR(p) Process 6
1.5.3 The ARMA(p, q) Process 6
1.6 Wiener-Kolmogorov-Wold Extraction and Prediction 6
1.6.1 Extraction 6
1.6.2 Prediction 6
1.7 Multivariate 7
1.7.1 The Environment 7
1.7.2 The Multivariate General Linear Process 8
1.7.3 Vector Autoregressions 9
1.8 Exercises, Problems and Complements 14
1.9 Notes 21
1.10 Exercises, Problems and Complements 22

Chapter 2. Spectral Analysis 23


2.1 The Many Uses of Spectral Analysis 23
2.2 The Spectrum and its Properties 23
2.3 Rational Spectra 26
2.4 Multivariate 27
2.5 Filter Analysis and Design 30
2.6 Estimating Spectra 34
2.6.1 Univariate 34
2.6.2 Multivariate 36
2.7 Approximate (asymptotic) frequency domain Gaussian likelihood 36
2.8 Exercises, Problems and Complements 37
2.9 Notes 46

Chapter 3. Markovian Structure, Linear Gaussian State Space, and Optimal (Kalman) Filtering 47
3.1 Markovian Structure 47
3.1.1 The Homogeneous Discrete-State Discrete-Time Markov Process 47
3.1.2 Multi-Step Transitions: Chapman-Kolmogorov 47
3.1.3 Lots of Definitions (and a Key Theorem) 48
3.1.4 A Simple Two-State Example 49
3.1.5 Constructing Markov Processes with Useful Steady-State Distributions 50
3.1.6 Variations and Extensions: Regime-Switching and More 51
3.1.7 Continuous-State Markov Processes 52
3.2 State Space Representations 53
3.2.1 The Basic Framework 53
3.2.2 ARMA Models 55
3.2.3 Linear Regression with Time-Varying Parameters and More 60
3.2.4 Dynamic Factor Models 62
3.2.5 Unobserved-Components Models 63
3.3 The Kalman Filter and Smoother 64
3.3.1 Statement(s) of the Kalman Filter 65
3.3.2 Derivation of the Kalman Filter 66
3.3.3 Calculating P0 69
3.3.4 Predicting yt 69
3.3.5 Steady State and the Innovations Representation 70
3.3.6 Kalman Smoothing 72
3.4 Exercises, Problems and Complements 72
3.5 Notes 78

Chapter 4. Frequentist Time-Series Likelihood Evaluation, Optimization, and Inference 79


4.1 Likelihood Evaluation: Prediction-Error Decomposition and the Kalman Filter 79
4.2 Gradient-Based Likelihood Maximization: Newton and Quasi-Newton Methods 80
4.2.1 The Generic Gradient-Based Algorithm 80
4.2.2 Newton Algorithm 81
4.2.3 Quasi-Newton Algorithms 82
4.2.4 “Line-Search” vs. “Trust Region” Methods: Levenberg-Marquardt 82
4.3 Gradient-Free Likelihood Maximization: EM 83
4.3.1 “Not-Quite-Right EM”
(But it Captures and Conveys the Intuition) 84
4.3.2 Precisely Right EM 84
4.4 Likelihood Inference 86
4.4.1 Under Correct Specification 86
4.4.2 Under Possible Misspecification 87
4.5 Exercises, Problems and Complements 89
4.6 Notes 89

Chapter 5. Simulation Basics 90


5.1 Generating U(0,1) Deviates 90
5.2 The Basics: c.d.f. Inversion, Box-Muller, Simple Accept-Reject 91
5.2.1 Inverse c.d.f. 91
5.2.2 Box-Muller 92
5.2.3 Simple Accept-Reject 93
5.3 Simulating Exact and Approximate Realizations of Time Series Processes 94
5.4 more 95
5.5 Notes 95

Chapter 6. Bayesian Analysis by Simulation 96



6.1 Bayesian Basics 96


6.2 Comparative Aspects of Bayesian and Frequentist Paradigms 96
6.3 Markov Chain Monte Carlo 98
6.3.1 Metropolis-Hastings Independence Chain 99
6.3.2 Metropolis-Hastings Random Walk Chain 99
6.3.3 More 99
6.3.4 Gibbs and Metropolis-Within-Gibbs 100
6.4 Conjugate Bayesian Analysis of Linear Regression 102
6.5 Gibbs for Sampling Marginal Posteriors 103
6.6 General State Space: Carter-Kohn Multi-Move Gibbs 104
6.7 Exercises, Problems and Complements 108
6.8 Notes 108

Chapter 7. (Much) More Simulation 109


7.1 Economic Theory by Simulation: “Calibration” 109
7.2 Econometric Theory by Simulation: Monte Carlo and Variance Reduction 109
7.2.1 Experimental Design 109
7.2.2 Simulation 110
7.2.3 Variance Reduction: Importance Sampling, Antithetics, Control Variates
and Common Random Numbers 112
7.2.4 Response Surfaces 116
7.3 Estimation by Simulation: GMM, SMM and Indirect Inference 117
7.3.1 GMM 117
7.3.2 Simulated Method of Moments (SMM) 118
7.3.3 Indirect Inference 119
7.4 Inference by Simulation: Bootstrap 119
7.4.1 i.i.d. Environments 119
7.4.2 Time-Series Environments 122
7.5 Optimization by Simulation 123
7.5.1 Local 123
7.5.2 Global 124
7.5.3 Is a Local Optimum Global? 125
7.6 Interval and Density Forecasting by Simulation 125
7.7 Exercises, Problems and Complements 125
7.8 Notes 125

Chapter 8. Non-Stationarity: Integration, Cointegration and Long Memory 126


8.1 Random Walks as the I(1) Building Block: The Beveridge-Nelson Decomposition 126
8.2 Stochastic vs. Deterministic Trend 127
8.3 Unit Root Distributions 128
8.4 Univariate and Multivariate Augmented Dickey-Fuller Representations 130
8.5 Spurious Regression 131
8.6 Cointegration, Error-Correction and Granger’s Representation Theorem 131
8.7 Fractional Integration and Long Memory 135
8.8 Exercises, Problems and Complements 136
8.9 Notes 137

Chapter 9. Non-Linear Non-Gaussian State Space and Optimal Filtering 138


9.1 Varieties of Non-Linear Non-Gaussian Models 138
9.2 Markov Chains to the Rescue (Again): The Particle Filter 138
9.3 Particle Filtering for Estimation: Doucet’s Theorem 138
9.4 Key Application I: Stochastic Volatility (Revisited) 138
9.5 Key Application II: Credit-Risk and the Default Option 138

9.6 Key Application III: Dynamic Stochastic General Equilibrium (DSGE) Macroeconomic Models 138
9.7 A Partial “Solution”: The Extended Kalman Filter 138

Chapter 10. Volatility Dynamics 145


10.1 Volatility and Financial Econometrics 145
10.2 GARCH 145
10.3 Stochastic Volatility 145
10.4 Observation-Driven vs. Parameter-Driven Processes 145
10.5 Exercises, Problems and Complements 185
10.6 Notes 185

Chapter 11. High Dimensionality 186


11.1 Exercises, Problems and Complements 186
11.2 Notes 186

Appendices 187

Appendix A. A “Library” of Useful Books 188


About the Author

Francis X. Diebold is Paul F. and Warren S. Miller Professor of Economics, and Professor
of Finance and Statistics, at the University of Pennsylvania, as well as Faculty Research As-
sociate at the National Bureau of Economic Research in Cambridge, Mass. He has published
widely in econometrics, forecasting, finance and macroeconomics, and he has served on the
editorial boards of numerous scholarly journals. He is an elected Fellow of the Econometric
Society, the American Statistical Association, and the International Institute of Forecasters;
the recipient of Sloan, Guggenheim, and Humboldt fellowships; and past President of the
Society for Financial Econometrics. Diebold lectures actively, worldwide, and has received
several prizes for outstanding teaching. He has held visiting appointments in Economics and
Finance at Princeton University, Cambridge University, the University of Chicago, the Lon-
don School of Economics, Johns Hopkins University, and New York University. His research
and teaching are firmly rooted in applications; he has served as an economist under Paul
Volcker and Alan Greenspan at the Board of Governors of the Federal Reserve System in
Washington DC, an Executive Director at Morgan Stanley Investment Management, Co-
Director of the Wharton Financial Institutions Center, and Chairman of the Federal Reserve
System’s Model Validation Council. All his degrees are from the University of Pennsylvania;
he received his B.S. from the Wharton School in 1981 and his economics Ph.D. in 1986.
He is married with three children and lives in suburban Philadelphia.
About the Cover

The colorful graphic is by Peter Mills and was obtained from Wikimedia Commons. As
noted there, it represents “the basins of attraction of the Gaspard-Rice scattering system
projected onto a double impact parameter” (whatever that means). I used it mainly because
I like it, but also because it’s reminiscent of a trending time series.

For details see http://commons.wikimedia.org/wiki/File%3AGR_Basins2.tiff. The
complete attribution is: By Peter Mills (Own work) [CC-BY-SA-3.0
(http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons.
Guide to e-Features

• Hyperlinks to internal items (table of contents, index, footnotes, etc.) appear in red.
• Hyperlinks to bibliographic references appear in green.

• Hyperlinks to the web appear in cyan.


• Hyperlinks to external files (e.g., video) appear in blue.
• Many images are clickable to reach related material.

• Additional related materials are at http://www.ssc.upenn.edu/~fdiebold, including
book updates, presentation slides, datasets, and code.
• Facebook group: Diebold Time Series Econometrics
• Related blog (No Hesitations): fxdiebold.blogspot.com
Acknowledgments

All media (images, audio, video, ...) were either produced by me or obtained from the
public domain repository at Wikimedia Commons.
List of Figures

1.1 The R Homepage 16


1.2 Resources for Economists Web Page 17

2.1 Granger’s Typical Spectral Shape of an Economic Variable 27


2.2 Gain of Differencing Filter 1 − L 31
2.3 Gain of Kuznets’ Filter 1 32
2.4 Gain of Kuznets’ Filter 2 32
2.5 Composite Gain of Kuznets’ two Filters 33

5.1 Ripley’s “Horror” Plots of pairs of (Ui+1 , Ui ) for Various Congruential


Generators Modulo 2048 (from Ripley, 1987) 91
5.2 Transforming from U(0,1) to f (from Davidson and MacKinnon, 1993) 92
5.3 Naive Accept-Reject Method 94

10.1 Time Series of Daily NYSE Returns 146


10.2 Correlogram of Daily NYSE Returns. 147
10.3 Histogram and Statistics for Daily NYSE Returns. 147
10.4 Time Series of Daily Squared NYSE Returns. 148
10.5 Correlogram of Daily Squared NYSE Returns. 148
10.6 True Exceedance Probabilities of Nominal 1% HS-VaR When Volatility
is Persistent. We simulate returns from a realistically-calibrated dynamic volatility
model, after which we compute 1-day 1% HS-VaR using a rolling window of 500 observations.
We plot the daily series of true conditional exceedance probabilities, which
we infer from the model. For visual reference we include a horizontal line at the desired
1% probability level. 151
10.7 GARCH(1,1) Estimation, Daily NYSE Returns. 157
10.8 Correlogram of Squared Standardized GARCH(1,1) Residuals, Daily NYSE
Returns. 158
10.9 Estimated Conditional Standard Deviation, Daily NYSE Returns. 158
10.10 Conditional Standard Deviation, History and Forecast, Daily NYSE Returns. 158
10.11 AR(1) Returns with Threshold t-GARCH(1,1)-in Mean. 159
10.12 S&P500 Daily Returns and Volatilities (Percent). The top panel shows daily
S&P500 returns, and the bottom panel shows daily S&P500 realized volatility. We
compute realized volatility as the square root of AvgRV , where AvgRV is the average
of five daily RVs each computed from 5-minute squared returns on a 1-minute grid of
S&P500 futures prices. 160

10.13 S&P500: QQ Plots for Realized Volatility and Log Realized Volatility.


The top panel plots the quantiles of daily realized volatility against the corresponding
normal quantiles. The bottom panel plots the quantiles of the natural logarithm of daily
realized volatility against the corresponding normal quantiles. We compute realized
volatility as the square root of AvgRV , where AvgRV is the average of five daily RVs
each computed from 5-minute squared returns on a 1-minute grid of S&P500 futures
prices. 161
10.14 S&P500: Sample Autocorrelations of Daily Realized Variance and Daily
Return. The top panel shows realized variance autocorrelations, and the bottom panel
shows return autocorrelations, for displacements from 1 through 250 days. Horizontal
lines denote 95% Bartlett bands. Realized variance is AvgRV , the average of five daily
RVs each computed from 5-minute squared returns on a 1-minute grid of S&P500
futures prices. 162
10.15 Time-Varying International Equity Correlations. The figure shows the estimated
equicorrelations from a DECO model for the aggregate equity index returns for
16 different developed markets from 1973 through 2009. 166
10.16 QQ Plot of S&P500 Returns. We show quantiles of daily S&P500 returns from
January 2, 1990 to December 31, 2010, against the corresponding quantiles from a
standard normal distribution. 168
10.17 QQ Plot of S&P500 Returns Standardized by NGARCH Volatilities. We
show quantiles of daily S&P500 returns standardized by the dynamic volatility from a
NGARCH model against the corresponding quantiles of a standard normal distribution.
The sample period is January 2, 1990 through December 31, 2010. The units on each
axis are standard deviations. 169
10.18 QQ Plot of S&P500 Returns Standardized by Realized Volatilities. We show
quantiles of daily S&P500 returns standardized by AvgRV against the corresponding
quantiles of a standard normal distribution. The sample period is January 2, 1990
through December 31, 2010. The units on each axis are standard deviations. 170
10.19 Average Threshold Correlations for Sixteen Developed Equity Markets.
The solid line shows the average empirical threshold correlation for GARCH residuals
across sixteen developed equity markets. The dashed line shows the threshold correla-
tions implied by a multivariate standard normal distribution with constant correlation.
The line with square markers shows the threshold correlations from a DECO model
estimated on the GARCH residuals from the 16 equity markets. The figure is based on
weekly returns from 1973 to 2009. 173
10.20 Simulated data, ρ = 0.5 180
10.21 Simulated data, ρ = 0.9 181
10.22 Simulated data 184
List of Tables

10.1 Stock Return Volatility During Recessions. Aggregate stock-return volatility


is quarterly realized standard deviation based on daily return data. Firm-level stock-
return volatility is the cross-sectional inter-quartile range of quarterly returns. 174
10.2 Real Growth Volatility During Recessions. Aggregate real-growth volatility is
quarterly conditional standard deviation. Firm-level real-growth volatility is the cross-
sectional inter-quartile range of quarterly real sales growth. 174
Preface

Time Series Econometrics (TSE ) provides a modern and concise master’s or Ph.D.-level
course in econometric time series. It can be covered realistically in one semester, and I
have used it successfully for many years with first-year Ph.D. students at the University of
Pennsylvania.
The elephant in the room is of course Hamilton’s Time Series Analysis. TSE complements
Hamilton in three key ways.
First, TSE is concise rather than exhaustive. (Nevertheless it maintains good breadth
of coverage, treating everything from the classic early framework of Wold, Wiener, and
Kolmogorov, through to cutting-edge Bayesian MCMC analysis of non-linear non-Gaussian
state space models with the particle filter.) Hamilton’s book can be used for more extensive
background reading for those topics that overlap.
Second and crucially, however, many of the topics in TSE and Hamilton do not overlap,
as TSE treats a variety of more recently-emphasized ideas. It stresses Markovian structure
throughout, from linear state space, to MCMC, to optimization, to non-linear state space
and particle filtering. Simulation and Bayes feature prominently, as do nonparametrics,
realized volatility, and more.
Finally, TSE is generally e-aware, with numerous hyperlinks to internal items, web sites,
code, research papers, books, databases, blogs, etc.

Francis X. Diebold
Philadelphia

Monday 14th January, 2019


Time Series Econometrics
Chapter One

The Wold Representation and its Approximation

1.1 ECONOMIC TIME SERIES AND THEIR ANALYSIS

Any series of observations ordered along a single dimension, such as time, may be thought
of as a time series. The emphasis in time series analysis is the study of dependence among
the observations at different points in time.1
Many economic and financial variables, such as prices, wages, sales, GDP and its com-
ponents, stock returns, interest rates and foreign exchange rates, are observed over time;
in addition to being interested in the interrelationships among such variables, we are also
concerned with relationships among the current and past values of one or more of them,
that is, relationships over time.
At its broadest level, time series analysis provides the language of stochastic dynamics.
Hence it’s the language of even pure dynamic economic theory, quite apart from
empirical analysis. It is, however, a great workhorse of empirical analysis, in “pre-theory”
mode (non-structurally “getting the facts straight” before theorizing, always a good idea),
in “post-theory” mode (structural estimation and inference), and in forecasting (whether
non-structural or structural).
Empirically, the analysis of economic time series is central to a wide range of applica-
tions, including business cycle measurement, financial risk management, policy analysis,
and forecasting. Special features of interest in economic time series include trends and non-
stationarity, seasonality, cycles and persistence, predictability (or lack thereof), structural
change, and nonlinearities such as volatility fluctuations and regime switching.

1.2 THE ENVIRONMENT

Time series Yt (doubly infinite)

Realization yt (again doubly infinite)

Sample path yt , t = 1, ..., T

1 Indeed what distinguishes time series analysis from general multivariate analysis is precisely the temporal
order imposed on the observations.



Strict Stationarity

Joint cdfs for sets of observations


depend only on displacement, not time.

Weak Stationarity

(Second-order stationarity, wide sense stationarity,


covariance stationarity, ...)

Eyt = µ, ∀t

γ(t, τ ) = E(yt − Eyt ) (yt+τ − Eyt+τ ) = γ(τ ), ∀t

0 < γ(0) < ∞

Autocovariance Function

(a) symmetric
γ(τ ) = γ(−τ ), ∀τ

(b) nonnegative definite


a'Σa ≥ 0, ∀a
where Toeplitz matrix Σ has ij -th element γ(i − j)

(c) bounded by the variance


γ(0) ≥ |γ(τ )|, ∀τ

Autocovariance Generating Function


g(z) = \sum_{\tau=-\infty}^{\infty} \gamma(\tau) z^{\tau}

Autocorrelation Function

\rho(\tau) = \frac{\gamma(\tau)}{\gamma(0)}
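
The definitions above have direct sample analogues. The following is a minimal sketch, not taken from the text: the function names and the five-point series are illustrative assumptions, and the estimator divides by T (rather than T − τ), a common convention that keeps the implied Toeplitz matrix nonnegative definite, in line with property (b).

```python
import numpy as np

def sample_autocovariance(y, max_lag):
    """Sample autocovariances gamma_hat(0), ..., gamma_hat(max_lag), dividing by T."""
    y = np.asarray(y, dtype=float)
    T = y.size
    d = y - y.mean()  # center around the sample mean
    return np.array([d[tau:] @ d[:T - tau] / T for tau in range(max_lag + 1)])

def sample_autocorrelation(y, max_lag):
    """Sample autocorrelations rho_hat(tau) = gamma_hat(tau) / gamma_hat(0)."""
    gamma = sample_autocovariance(y, max_lag)
    return gamma / gamma[0]

# Hypothetical toy series, used only to exercise the functions.
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
gamma_hat = sample_autocovariance(y, 2)
rho_hat = sample_autocorrelation(y, 2)

# Toeplitz matrix with ij-th element gamma_hat(|i - j|).
lags = np.abs(np.subtract.outer(np.arange(3), np.arange(3)))
Sigma_hat = gamma_hat[lags]
min_eig = np.linalg.eigvalsh(Sigma_hat).min()  # nonnegative if Sigma_hat is n.n.d.
```

Symmetry is built in because only |i − j| enters the Toeplitz matrix, and the bound γ(0) ≥ |γ(τ)| shows up as |rho_hat| ≤ 1.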

1.3 WHITE NOISE

White noise: ηt ∼ W N (µ, σ 2 ) (serially uncorrelated)

Zero-mean white noise: ηt ∼ W N (0, σ 2 )

Independent (strong) white noise: \eta_t \overset{iid}{\sim} (0, \sigma^2)

Gaussian white noise: \eta_t \overset{iid}{\sim} N(0, \sigma^2)

Unconditional Moment Structure of Strong White Noise

E(ηt ) = 0

var(ηt ) = σ 2

Conditional Moment Structure of Strong White Noise

E(ηt |Ωt−1 ) = 0

var(ηt |Ωt−1 ) = E[(ηt − E(ηt |Ωt−1 ))2 |Ωt−1 ] = σ 2

where

Ωt−1 = ηt−1 , ηt−2 , ...

Autocorrelation Structure of Strong White Noise

\gamma(\tau) = \begin{cases} \sigma^2, & \tau = 0 \\ 0, & \tau \ge 1 \end{cases}

\rho(\tau) = \begin{cases} 1, & \tau = 0 \\ 0, & \tau \ge 1 \end{cases}
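
A quick simulation can make the white noise moment structure concrete. This is only an illustrative sketch: the seed, the sample size T = 10,000, σ = 2, and the helper name `acf` are all assumptions chosen for the example, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, for reproducibility only
T, sigma = 10_000, 2.0                   # illustrative choices
eta = rng.normal(loc=0.0, scale=sigma, size=T)   # Gaussian white noise N(0, sigma^2)

def acf(x, max_lag):
    """Sample autocorrelations at lags 0, ..., max_lag."""
    d = x - x.mean()
    gamma0 = d @ d / x.size
    return np.array([(d[tau:] @ d[:x.size - tau]) / x.size / gamma0
                     for tau in range(max_lag + 1)])

rho_hat = acf(eta, 5)
band = 2.0 / np.sqrt(T)   # rough two-standard-error band under white noise

print("sample mean:", eta.mean())
print("sample variance:", eta.var())
print("autocorrelations at lags 1-5:", rho_hat[1:], "band: +/-", band)
```

The sample mean should be near 0, the sample variance near σ² = 4, and the autocorrelations at nonzero lags should be small relative to the band, though any single draw can stray outside it.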

An Aside on Treatment of the Mean

In theoretical work we assume a zero mean, µ = 0.

This reduces notational clutter and is without loss of generality.

(Think of yt as having been centered around its mean, µ,


and note that yt − µ has zero mean by construction.)

(In empirical work we allow explicitly for a non-zero mean,


either by centering the data around the sample mean
or by including an intercept.)

1.4 THE WOLD DECOMPOSITION AND THE GENERAL LINEAR PROCESS

Under regularity conditions,


every covariance-stationary process {yt } can be written as:

y_t = \sum_{i=0}^{\infty} b_i \varepsilon_{t-i}

where:

b0 = 1


\sum_{i=0}^{\infty} b_i^2 < \infty

εt = [yt − P (yt |yt−1 , yt−2 , ...)] ∼ W N (0, σ 2 )

The General Linear Process


y_t = B(L)\varepsilon_t = \sum_{i=0}^{\infty} b_i \varepsilon_{t-i}

εt ∼ W N (0, σ 2 )

b0 = 1


\sum_{i=0}^{\infty} b_i^2 < \infty

Unconditional Moment Structure of the General Linear Process

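E(y_t) = E\left( \sum_{i=0}^{\infty} b_i \varepsilon_{t-i} \right) = \sum_{i=0}^{\infty} b_i E\varepsilon_{t-i} = \sum_{i=0}^{\infty} b_i \cdot 0 = 0

var(y_t) = var\left( \sum_{i=0}^{\infty} b_i \varepsilon_{t-i} \right) = \sum_{i=0}^{\infty} b_i^2 \, var(\varepsilon_{t-i}) = \sigma^2 \sum_{i=0}^{\infty} b_i^2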

Conditional Moment Structure

E(yt |Ωt−1 ) = E(εt |Ωt−1 ) + b1 E(εt−1 |Ωt−1 ) + b2 E(εt−2 |Ωt−1 ) + ...

(Ωt−1 = εt−1 , εt−2 , ...)


= 0 + b_1 \varepsilon_{t-1} + b_2 \varepsilon_{t-2} + ... = \sum_{i=1}^{\infty} b_i \varepsilon_{t-i}

var(yt |Ωt−1 ) = E[(yt − E(yt |Ωt−1 ))2 |Ωt−1 ]

= E(ε2t |Ωt−1 ) = E(ε2t ) = σ 2

(These calculations assume strong WN innovations. Why?)

Autocovariance Structure

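\gamma(\tau) = E\left[ \left( \sum_{i=-\infty}^{\infty} b_i \varepsilon_{t-i} \right) \left( \sum_{h=-\infty}^{\infty} b_h \varepsilon_{t-\tau-h} \right) \right]

= \sigma^2 \sum_{i=-\infty}^{\infty} b_i b_{i-\tau}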

(where bi ≡ 0 if i < 0)

g(z) = σ 2 B(z) B(z −1 )
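
The autocovariance formula and the generating function can be checked numerically for a finite-order B(L). The sketch below is illustrative only: the function name and the MA(1) coefficients b = (1, 0.5) with σ² = 1 are assumptions, and B(L) is truncated to finite order so that the sums are finite.

```python
import numpy as np

def glp_autocovariance(b, sigma2, max_lag):
    """gamma(tau) = sigma2 * sum_i b_i b_{i-tau} for a finite coefficient vector b."""
    b = np.asarray(b, dtype=float)
    gamma = np.zeros(max_lag + 1)
    for tau in range(min(max_lag, b.size - 1) + 1):
        gamma[tau] = sigma2 * (b[tau:] @ b[:b.size - tau])
    return gamma

b = np.array([1.0, 0.5])   # b_0 = 1, b_1 = 0.5 (illustrative MA(1))
sigma2 = 1.0
gamma = glp_autocovariance(b, sigma2, 3)

# Coefficients of g(z) = sigma2 * B(z) * B(1/z), ordered from z^{-1} to z^{1}.
# The middle coefficient is gamma(0); the outer ones are gamma(1) = gamma(-1).
agf_coeffs = sigma2 * np.convolve(b, b[::-1])
```

Multiplying the polynomial B(z) by its reflection B(z⁻¹) is just a convolution of the coefficient sequence with its reverse, which is why the two calculations agree.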



1.5 APPROXIMATING THE WOLD REPRESENTATION

1.5.1 The MA(q) Process

(Obvious truncation)
Unconditional moment structure, conditional moment structure, autocovariance func-
tions, stationarity and invertibility conditions

1.5.2 The AR(p) Process

(Stochastic difference equation)


Unconditional moment structure, conditional moment structure, autocovariance func-
tions, stationarity and invertibility conditions

1.5.3 The ARMA(p, q) Process

Rational B(L), later rational spectrum, and links to state space.


Unconditional moment structure, conditional moment structure, autocovariance func-
tions, stationarity and invertibility conditions

1.6 WIENER-KOLMOGOROV-WOLD EXTRACTION AND PREDICTION

1.6.1 Extraction

1.6.2 Prediction

yt = εt + b1 εt−1 + ...

yT +h = εT +h + b1 εT +h−1 + ... + bh εT + bh+1 εT −1 + ...

Project on ΩT = {εT , εT −1 , ...} to get:

yT +h,T = bh εT + bh+1 εT −1 + ...

Note that the projection is on the infinite past

Prediction Error
e_{T+h,T} = y_{T+h} - y_{T+h,T} = \sum_{i=0}^{h-1} b_i \varepsilon_{T+h-i}

(An MA(h − 1) process!)

E(eT +h,T ) = 0

var(e_{T+h,T}) = \sigma^2 \sum_{i=0}^{h-1} b_i^2
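
To see how the prediction-error variance grows with the horizon, here is a small sketch. The Wold coefficients b_i = φ^i correspond to an AR(1) with φ = 0.5; that choice, σ² = 1, the truncation at 50 terms, and the function name are assumptions for illustration.

```python
import numpy as np

def prediction_error_variance(b, sigma2, h):
    """sigma2 * sum_{i=0}^{h-1} b_i^2, the h-step-ahead prediction-error variance."""
    b = np.asarray(b, dtype=float)
    return sigma2 * np.sum(b[:h] ** 2)

phi, sigma2 = 0.5, 1.0
b = phi ** np.arange(50)   # truncated Wold coefficients of an AR(1): b_i = phi^i

pev = [prediction_error_variance(b, sigma2, h) for h in (1, 2, 3)]
unconditional_var = sigma2 / (1.0 - phi ** 2)   # limit as h grows

print(pev, unconditional_var)
```

The variances rise monotonically in h and approach the unconditional variance of the process, reflecting the fact that long-horizon forecasts revert to the unconditional mean.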

Wold’s Chain Rule for Autoregressions

Consider an AR(1) process:


yt = φyt−1 + εt

History:
\{y_t\}_{t=1}^{T}

Immediately,
y_{T+1,T} = \phi y_T
y_{T+2,T} = \phi y_{T+1,T} = \phi^2 y_T
\vdots
y_{T+h,T} = \phi y_{T+h-1,T} = \phi^h y_T

Extension to AR(p) and AR(∞) is immediate.
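
The chain rule can be sketched for a general AR(p) as a simple recursion that feeds each forecast back in as if it were data. The function name and the numerical values below are illustrative assumptions; the coefficients are treated as known rather than estimated.

```python
import numpy as np

def ar_chain_forecast(history, phi, h):
    """Forecasts y_{T+1,T}, ..., y_{T+h,T} for a zero-mean AR(p) via the chain rule.

    history : observed values, most recent last (at least p of them)
    phi     : AR coefficients (phi_1, ..., phi_p), assumed known
    """
    phi = np.asarray(phi, dtype=float)
    p = phi.size
    path = list(np.asarray(history, dtype=float)[-p:])
    for _ in range(h):
        recent = np.array(path[-p:][::-1])   # most recent value first
        path.append(float(phi @ recent))     # forecasts replace unknown future values
    return np.array(path[p:])

# AR(1) with phi = 0.5 and y_T = 2: forecasts are phi^h * y_T.
f_ar1 = ar_chain_forecast([2.0], [0.5], 3)

# AR(2) with (phi_1, phi_2) = (0.5, 0.2), y_{T-1} = 1, y_T = 2.
f_ar2 = ar_chain_forecast([1.0, 2.0], [0.5, 0.2], 2)
```

For the AR(1) case the recursion reproduces φ^h y_T exactly, which is the closed form shown above.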

1.7 MULTIVARIATE

1.7.1 The Environment

(y_{1t}, y_{2t})' is covariance stationary if:

E(y1t ) = µ1 ∀ t
E(y2t ) = µ2 ∀ t

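\Gamma_{y_1 y_2}(t, \tau) = E\left[ \begin{pmatrix} y_{1t} - \mu_1 \\ y_{2t} - \mu_2 \end{pmatrix} (y_{1,t-\tau} - \mu_1, \; y_{2,t-\tau} - \mu_2) \right]

= \begin{pmatrix} \gamma_{11}(\tau) & \gamma_{12}(\tau) \\ \gamma_{21}(\tau) & \gamma_{22}(\tau) \end{pmatrix}, \quad \tau = 0, 1, 2, ...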

Cross Covariances and the Generating Function

γ12 (τ ) ≠ γ12 (−τ )



γ12 (τ ) = γ21 (−τ )

\Gamma_{y_1 y_2}(\tau) = \Gamma'_{y_1 y_2}(-\tau), \quad \tau = 0, 1, 2, ...


G_{y_1 y_2}(z) = \sum_{\tau=-\infty}^{\infty} \Gamma_{y_1 y_2}(\tau) z^{\tau}

Cross Correlations

R_{y_1 y_2}(\tau) = D_{y_1 y_2}^{-1} \, \Gamma_{y_1 y_2}(\tau) \, D_{y_1 y_2}^{-1}, \quad \tau = 0, 1, 2, ...

D = \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix}

1.7.2 The Multivariate General Linear Process

The Multivariate General Linear Process

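\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} = \begin{pmatrix} B_{11}(L) & B_{12}(L) \\ B_{21}(L) & B_{22}(L) \end{pmatrix} \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix}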

y_t = B(L)\varepsilon_t = (I + B_1 L + B_2 L^2 + ...)\varepsilon_t

E(\varepsilon_t \varepsilon_s') = \begin{cases} \Sigma & \text{if } t = s \\ 0 & \text{otherwise} \end{cases}


\sum_{i=0}^{\infty} \| B_i \|^2 < \infty

Autocovariance Structure


\Gamma_{y_1 y_2}(\tau) = \sum_{i=-\infty}^{\infty} B_i \Sigma B'_{i-\tau}

(where Bi ≡ 0 if i < 0)

G_y(z) = B(z) \, \Sigma \, B'(z^{-1})

Wiener-Kolmogorov Prediction

yt = εt + B1 εt−1 + B2 εt−2 + ...

yT +h = εT +h + B1 εT +h−1 + B2 εT +h−2 + ...

Project on ΩT = {εT , εT −1 , ...} to get:

yT +h,T = Bh εT + Bh+1 εT −1 + ...

Wiener-Kolmogorov Prediction Error

\varepsilon_{T+h,T} = y_{T+h} - y_{T+h,T} = \sum_{i=0}^{h-1} B_i \varepsilon_{T+h-i}

E[\varepsilon_{T+h,T}] = 0

E[\varepsilon_{T+h,T} \, \varepsilon'_{T+h,T}] = \sum_{i=0}^{h-1} B_i \Sigma B_i'

1.7.3 Vector Autoregressions

N -variable VAR of order p:

Φ(L)yt = εt

εt ∼ W N (0, Σ)

where:

Φ(L) = I − Φ1 L − ... − Φp Lp

– Simple estimation and analysis (OLS)


– Granger-Sims causality
– Getting the facts straight before theorizing; assessing the restrictions implied by
economic theory
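
To illustrate the "simple estimation (OLS)" point, the following sketch simulates a bivariate VAR(1) and recovers its coefficients by equation-by-equation least squares. All numerical values (Φ, Σ, T, the seed) are illustrative assumptions, and the model is zero-mean so no intercept is included.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only

# Assumed "true" parameters of a stationary bivariate VAR(1).
Phi = np.array([[0.5, 0.1],
                [0.0, 0.3]])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
T = 5000

# Simulate y_t = Phi y_{t-1} + eps_t, starting from zero.
eps = rng.multivariate_normal(np.zeros(2), Sigma, size=T)
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = Phi @ y[t - 1] + eps[t]

# OLS: regress y_t on y_{t-1}. With Y = X B, each column of B is one equation,
# so B is the transpose of Phi.
X, Y = y[:-1], y[1:]
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
Phi_hat = B.T

resid = Y - X @ B
Sigma_hat = resid.T @ resid / resid.shape[0]   # innovation covariance estimate
```

Because every equation has the same regressors, OLS applied to all equations at once gives the same coefficients as running each regression separately.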

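\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} = \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix} \begin{pmatrix} y_{1t-1} \\ y_{2t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix}

\begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} \sim WN\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right)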

– Two sources of cross-variable interaction.


What are they?

Understanding VAR’s: Bivariate Granger-Sims Causality

Is the history of yj useful for predicting yi ,


over and above the history of yi ?

– Granger non-causality tests: Simple exclusion restrictions

– In the simple 2-Variable VAR(1) example,


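\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} = \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix} \begin{pmatrix} y_{1t-1} \\ y_{2t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix},

y2 does not Granger cause y1 iff φ12 = 0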

– Natural extensions for N > 2


(always testing exclusion restrictions)
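
An exclusion restriction of this kind can be examined with a standard F statistic that compares restricted and unrestricted sums of squared residuals. The sketch below is one possible implementation under stated assumptions: one lag, an intercept in each regression, simulated data with φ12 = 0.4 and φ21 = 0, and a hypothetical function name. It reports only the statistic; comparing it to an F(1, T − 3) critical value is left to the reader.

```python
import numpy as np

def granger_f_stat(y_target, y_other):
    """F statistic for excluding lag 1 of y_other from a regression of
    y_target on a constant, its own lag, and the lag of y_other."""
    Y = y_target[1:]
    X_u = np.column_stack([np.ones(Y.size), y_target[:-1], y_other[:-1]])  # unrestricted
    X_r = X_u[:, :2]                                                       # restricted

    def ssr(X):
        beta = np.linalg.lstsq(X, Y, rcond=None)[0]
        return np.sum((Y - X @ beta) ** 2)

    ssr_u, ssr_r = ssr(X_u), ssr(X_r)
    q, dof = 1, Y.size - X_u.shape[1]   # one restriction; residual degrees of freedom
    return ((ssr_r - ssr_u) / q) / (ssr_u / dof)

# Simulated example: y2 Granger causes y1 (phi_12 = 0.4), but not the reverse.
rng = np.random.default_rng(1)
Phi = np.array([[0.5, 0.4],
                [0.0, 0.5]])
T = 2000
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = Phi @ y[t - 1] + rng.standard_normal(2)

f_2_to_1 = granger_f_stat(y[:, 0], y[:, 1])   # expected to be large
f_1_to_2 = granger_f_stat(y[:, 1], y[:, 0])   # expected to be small on average
print(f_2_to_1, f_1_to_2)
```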

Understanding VAR’s: MA Representation

Φ(L)yt = εt

yt = Φ−1 (L)εt = Θ(L)εt

where:

Θ(L) = I + Θ1 L + Θ2 L2 + ...

Long-Form MA Representation of 2-Variable VAR(1)

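\left[ \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix} L \right] \begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} = \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix}

\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} = \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} + \begin{pmatrix} \theta^1_{11} & \theta^1_{12} \\ \theta^1_{21} & \theta^1_{22} \end{pmatrix} \begin{pmatrix} \varepsilon_{1t-1} \\ \varepsilon_{2t-1} \end{pmatrix} + ...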
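
The moving-average matrices follow from matching coefficients in Φ(L)Θ(L) = I, which gives Θ_0 = I and Θ_i = Φ_1 Θ_{i−1} + ... + Φ_p Θ_{i−p} (dropping terms with negative index). A minimal sketch of that recursion, with an assumed function name and illustrative coefficient values:

```python
import numpy as np

def var_ma_coefficients(Phis, horizon):
    """Theta_0, ..., Theta_horizon for a VAR(p) with coefficient matrices Phis."""
    N = Phis[0].shape[0]
    Theta = [np.eye(N)]
    for i in range(1, horizon + 1):
        # Theta_i = sum_{j=1}^{min(i,p)} Phi_j Theta_{i-j}
        Theta.append(sum(Phis[j - 1] @ Theta[i - j]
                         for j in range(1, min(i, len(Phis)) + 1)))
    return np.array(Theta)

Phi1 = np.array([[0.5, 0.1],
                 [0.0, 0.3]])   # illustrative VAR(1) coefficients
Theta = var_ma_coefficients([Phi1], 4)
```

For a VAR(1) the recursion collapses to Θ_i = Φ^i, so Θ_1 in the long-form display above is simply Φ itself.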

Understanding VAR’s: Impulse Response Functions (IRF’s)

(I − Φ1 L − ... − Φp Lp )yt = εt

εt ∼ W N (0, Σ)

The impulse-response question:


How is yit dynamically affected by a shock to yjt (alone)?

(N × N matrix of IRF graphs (over steps ahead))

Problem:
Σ generally not diagonal, so how to shock j alone?

Understanding VAR’s: Variance Decompositions (VD’s)

(I − Φ1 L − ... − Φp Lp )yt = εt

εt ∼ W N (0, Σ)

The variance decomposition question:


How much of the h-step ahead (optimal) prediction-error variance of yi is due to shocks to
variable j?

(N × N matrix of VD graphs (over h) could be done.


Or pick an h and examine the N × N matrix of VD numbers.)

Problem:
Σ generally not diagonal, which makes things tricky, as the variance of a sum of
innovations is therefore not the sum of the variances.

Orthogonalizing VAR’s by Cholesky Factorization


(The Classic Identification Scheme)

Original:

(I − Φ1 L − ... − Φp Lp )yt = εt , εt ∼ W N (0, Σ)

Equivalently:

(I − Φ1 L − ... − Φp Lp )yt = P vt , vt ∼ W N (0, I)



where Σ = P P', for lower-triangular P


(Cholesky factorization)

Now we can shock j alone (for IRF’s)

Now we can proceed to calculate forecast-error variances


without worrying about covariance terms (for VD’s)
But there’s no free lunch. Why?

IRF’s and VD’s from the Orthogonalized VAR

IRF comes from the


orthogonalized moving-average representation:

yt = (I + Θ1 L + Θ2 L2 + ...) P vt

= (P + Θ1 P L + Θ2 P L2 + ...) vt

IRFij is {Pij , (Θ1 P )ij , (Θ2 P )ij , ...}

VDij comes similarly from the


orthogonalized moving-average representation.

Note how the contemporaneous IRF and VD for h = 1 are driven by the Cholesky
choice of P .
Other choices are possible.
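
The orthogonalized impulse responses can be sketched for a VAR(1), where Θ_h = Φ^h. The values of Φ and Σ and the function name are illustrative assumptions. `numpy.linalg.cholesky` returns the lower-triangular factor P with Σ = P P'.

```python
import numpy as np

Phi = np.array([[0.5, 0.1],
                [0.0, 0.3]])      # illustrative VAR(1) coefficients
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])    # illustrative innovation covariance

P = np.linalg.cholesky(Sigma)     # lower triangular, Sigma = P P'

def orthogonalized_irf(Phi, P, horizon):
    """Array of Theta_h P for h = 0, ..., horizon, with Theta_h = Phi^h (VAR(1))."""
    return np.array([np.linalg.matrix_power(Phi, h) @ P
                     for h in range(horizon + 1)])

irf = orthogonalized_irf(Phi, P, 10)
irf_12 = irf[:, 0, 1]   # response of variable 1 to an orthogonalized shock to variable 2
```

Here `irf_12[0]` is exactly zero because P is lower triangular: under this ordering, variable 1 cannot respond contemporaneously to the second shock. Reordering the variables changes P and hence the responses, which is one way to see that the Cholesky choice is not innocuous.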

Two-Variable IRF Example (IRF12 )

yt = P vt + Θ1 P vt−1 + Θ2 P vt−2 + ...

vt ∼ W N (0, I)

y_t = C_0 v_t + C_1 v_{t-1} + C_2 v_{t-2} + ... (Q: What is C^0_{12}?)

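\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} = \begin{pmatrix} c^0_{11} & c^0_{12} \\ c^0_{21} & c^0_{22} \end{pmatrix} \begin{pmatrix} v_{1t} \\ v_{2t} \end{pmatrix} + \begin{pmatrix} c^1_{11} & c^1_{12} \\ c^1_{21} & c^1_{22} \end{pmatrix} \begin{pmatrix} v_{1t-1} \\ v_{2t-1} \end{pmatrix} + ...

IRF_{12} = C^0_{12}, C^1_{12}, C^2_{12}, ...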



Two-Variable VD Example (VD12 (2))

[Figure: time series plots of the four variables R, P, Y and M, 1959-2001.]

\varepsilon_{t+2,t} = C_0 v_{t+2} + C_1 v_{t+1}

v_t \sim WN(0, I)

\begin{pmatrix} \varepsilon_{1,t+2,t} \\ \varepsilon_{2,t+2,t} \end{pmatrix} = \begin{pmatrix} c^0_{11} & c^0_{12} \\ c^0_{21} & c^0_{22} \end{pmatrix} \begin{pmatrix} v_{1t+2} \\ v_{2t+2} \end{pmatrix} + \begin{pmatrix} c^1_{11} & c^1_{12} \\ c^1_{21} & c^1_{22} \end{pmatrix} \begin{pmatrix} v_{1t+1} \\ v_{2t+1} \end{pmatrix}

\varepsilon_{1,t+2,t} = c^0_{11} v_{1t+2} + c^0_{12} v_{2t+2} + c^1_{11} v_{1t+1} + c^1_{12} v_{2t+1}

var(\varepsilon_{1,t+2,t}) = (c^0_{11})^2 + (c^0_{12})^2 + (c^1_{11})^2 + (c^1_{12})^2

Part coming from v_2: (c^0_{12})^2 + (c^1_{12})^2

VD_{12}(2) = \frac{(c^0_{12})^2 + (c^1_{12})^2}{(c^0_{11})^2 + (c^0_{12})^2 + (c^1_{11})^2 + (c^1_{12})^2}
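
The two-variable calculation generalizes directly: square the elements of C_0, ..., C_{h−1}, sum over horizons, and normalize each row. The sketch below assumes a VAR(1) with illustrative Φ and Σ, so that C_k = Φ^k P with P the Cholesky factor; the function name is hypothetical.

```python
import numpy as np

def variance_decomposition(C, h):
    """N x N matrix whose (i, j) element is the share of the h-step-ahead
    prediction-error variance of variable i due to orthogonalized shock j.

    C : array of shape (H, N, N) holding C_0, C_1, ..., with H >= h
    """
    contrib = np.sum(C[:h] ** 2, axis=0)              # sum over k < h of (c^k_ij)^2
    return contrib / contrib.sum(axis=1, keepdims=True)

Phi = np.array([[0.5, 0.1],
                [0.0, 0.3]])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
P = np.linalg.cholesky(Sigma)
C = np.array([np.linalg.matrix_power(Phi, k) @ P for k in range(5)])

VD2 = variance_decomposition(C, 2)
vd_12 = VD2[0, 1]   # VD_12(2)

# Direct evaluation of the displayed formula, for comparison.
num = C[0][0, 1] ** 2 + C[1][0, 1] ** 2
den = C[0][0, 0] ** 2 + C[0][0, 1] ** 2 + C[1][0, 0] ** 2 + C[1][0, 1] ** 2
```

Each row of the result sums to one, since the orthogonalized shocks have identity covariance and the variance of the sum is therefore the sum of the variances.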
Graphic: IRF Matrix for 4-Variable U.S. Macro VAR

[Figure: 4 × 4 grid of impulse-response plots. Rows: shock to Y, P, R, M. Columns: response of Y, P, R, M. Horizontal axes: steps ahead, 0 to 30.]

Figure 1.1: VAR evidence on US data

Orthogonalizing/Identifying VAR’s More Generally

“Structural VAR’s”

Structure:
A_0 y_t = A_1 y_{t-1} + ... + A_p y_{t-p} + v_t, \quad v_t \sim (0, D)

where D is diagonal.

Reduced form:
y_t = A_0^{-1} A_1 y_{t-1} + ... + A_0^{-1} A_p y_{t-p} + A_0^{-1} v_t
= \Phi_1 y_{t-1} + ... + \Phi_p y_{t-p} + e_t,
where e_t = A_0^{-1} v_t.

The structure can be identified from the reduced form
if (N² − N)/2 restrictions are imposed on A0 .
Cholesky restricts A0 to be lower triangular (“recursive structure”).

IRF:
y_t = (I + \Theta_1 L + \Theta_2 L^2 + ...) e_t
= (I + \Theta_1 L + \Theta_2 L^2 + ...) A_0^{-1} v_t
= (A_0^{-1} + \Theta_1 A_0^{-1} L + \Theta_2 A_0^{-1} L^2 + ...) v_t
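
The mapping from structure to reduced form, and the sense in which a recursive A0 corresponds to the Cholesky scheme, can be sketched numerically. Everything below is an illustrative assumption: a two-variable system with one lag, a lower-triangular A0 with unit diagonal, and made-up values for A1 and D.

```python
import numpy as np

# Assumed structural parameters (N = 2, p = 1).
A0 = np.array([[1.0, 0.0],
               [-0.5, 1.0]])     # lower triangular: "recursive structure"
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])
D = np.diag([1.0, 4.0])          # diagonal structural shock covariance

A0_inv = np.linalg.inv(A0)

# Reduced form: Phi_1 = A0^{-1} A1 and e_t = A0^{-1} v_t.
Phi1 = A0_inv @ A1
Sigma_e = A0_inv @ D @ A0_inv.T  # covariance of the reduced-form innovations

# With lower-triangular A0, the Cholesky factor of Sigma_e equals A0^{-1} D^{1/2}.
P = np.linalg.cholesky(Sigma_e)
P_structural = A0_inv @ np.sqrt(D)

N = A0.shape[0]
n_restrictions = (N ** 2 - N) // 2   # here: the single zero above the diagonal of A0
```

In this example the one required restriction is the zero in the upper-right corner of A0, and the Cholesky factor of the reduced-form covariance recovers the structural impact matrix up to the scaling by the shock standard deviations.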

1.8 EXERCISES, PROBLEMS AND COMPLEMENTS

1. The autocovariance function of the MA(1) process, revisited.


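In the text we wrote

\gamma(\tau) = E(y_t y_{t-\tau}) = E((\varepsilon_t + \theta\varepsilon_{t-1})(\varepsilon_{t-\tau} + \theta\varepsilon_{t-\tau-1})) = \begin{cases} \theta\sigma^2, & \tau = 1 \\ 0, & \text{otherwise.} \end{cases}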

Fill in the missing steps by evaluating explicitly the expectation

E((εt + θεt−1 )(εt−τ + θεt−τ −1 )).

2. Predicting AR processes.
Show the following.

(a) If yt is a covariance stationary AR(1) process, i.e., yt = αyt−1 + εt with |α| < 1,
then yt+h,t = α^h yt .
(b) If yt is AR(2),

yt = (α1 + α2 )yt−1 − α1 α2 yt−2 + εt ,

with | α1 |, | α2 | < 1, then

yt+1,t = (α1 + α2 )yt − α1 α2 yt−1 .

(c) In general, the result for an AR(p) process is

yt+k,t = ψ1 yt+k−1,t + ... + ψp yt+k−p,t ,



where yt−j,t = yt−j , for j = 0, 1, ..., at time t. Thus for pure autoregressions, the
MMSE prediction is a linear combination of only the p most recently observed
values.

3. Predicting MA processes.
If $y_t$ is MA(1),
\[
y_t = \varepsilon_t - \beta\varepsilon_{t-1},
\]
where $|\beta| < 1$, then
\[
y_{t+1,t} = -\beta \sum_{j=0}^{\infty} \beta^j y_{t-j},
\]
and $y_{t+k,t} = 0$ for all $k > 1$.

For moving-average processes more generally, predictions for a future period greater
than the order of the process are zero and those for a period less distant cannot be
expressed in terms of a finite number of past observed values.

4. Predicting the ARMA(1, 1) process.

If $y_t$ is ARMA(1, 1),
\[
y_t - \alpha y_{t-1} = \varepsilon_t - \beta\varepsilon_{t-1},
\]
with $|\alpha|, |\beta| < 1$, then
\[
y_{t+k,t} = \alpha^{k-1} (\alpha - \beta) \sum_{j=0}^{\infty} \beta^j y_{t-j}.
\]

5. Prediction-error dynamics.
Consider the general linear process with strong white noise innovations. Show that
both the conditional (with respect to the information set $\Omega_t = \{\varepsilon_t, \varepsilon_{t-1}, ...\}$) and
unconditional moments of the Wiener-Kolmogorov h-step-ahead prediction error are
identical.

6. Truncating the Wiener-Kolmogorov predictor.


Consider the sample path, $\{y_t\}_{t=1}^{T}$, where the data generating process is $y_t = B(L)\varepsilon_t$
and $B(L)$ is of infinite order. How would you modify the Wiener-Kolmogorov linear
least squares prediction formula to generate an operational 3-step-ahead forecast?
(Hint: truncate.) Is your suggested predictor linear least squares? Least squares within
the class of linear predictors using only T past observations?

7. Factor structure.

Consider the bivariate linearly indeterministic process,
\[
\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix}
=
\begin{pmatrix} B_{11}(L) & B_{12}(L) \\ B_{21}(L) & B_{22}(L) \end{pmatrix}
\begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix},
\]
under the usual assumptions. Suppose further that $B_{11}(L) = B_{21}(L) = 0$ and $\varepsilon_{1t} = \varepsilon_{2t} = \varepsilon_t$
(with variance $\sigma^2$). Discuss the nature of this system. Why might it be useful in economics?

[Figure 1.1: The R Homepage]

8. Software (and a Tiny bit of Hardware)


Let’s proceed from highest level to lowest level.
EViews is a good high-level environment for economic time-series analysis. It’s a modern
object-oriented environment with extensive time series, modeling and forecasting
capabilities. It implements almost all of the methods described in this book, and many
more.
EViews, however, can sometimes be something of a “black box.” Hence you’ll also
want to have available slightly lower-level (“mid-level”) environments in which you
can quickly program, evaluate and apply new tools and techniques. R is one very
powerful and popular such environment, with special strengths in modern statistical
methods and graphical data analysis. (Python and Julia are other interesting mid-level
environments.) R is available for free as part of a massive and
highly-successful open-source project. RStudio provides a fine R working environment,
and, like R, it’s free. A good R tutorial, first given on Coursera and then moved to
YouTube, is here. R-bloggers is a massive blog with all sorts of information about all
things R.

If you need real speed, such as for large simulations, you will likely need a low-level
environment like Fortran or C++. And in the limit (and on the hardware side), if
you need blazing-fast parallel computing for massive simulations etc., graphics cards
(graphical processing units, or GPU’s) provide stunning gains, as documented for
example in Aldrich et al. (2011). Actually the real limit is quantum computing, but
we’re not there yet.
For a compendium of econometric and statistical software, see the software links site,
maintained by Marius Ooms at the Econometrics Journal.

[Figure 1.2: Resources for Economists Web Page]

9. Data
Here we mention just a few key “must-know” sites. Resources for Economists, main-
tained by the American Economic Association, is a fine portal to almost anything of
interest to economists. It contains hundreds of links to data sources, journals, profes-
sional organizations, and so on. FRED (Federal Reserve Economic Data) is a tremen-
dously convenient source for economic data. The National Bureau of Economic Re-
search site has data on U.S. business cycles, and the Real-Time Data Research Center
at the Federal Reserve Bank of Philadelphia has real-time vintage macroeconomic
data. Quandl is an interesting newcomer with striking breadth of coverage; it seems
to have every time series in existence, and it has a nice R interface.

10. Markup
Markup languages effectively provide typesetting or “word processing.” HTML is the
most well-known example. Research papers and books are typically written in LaTeX.
MiKTeX is a good and popular flavor of LaTeX, and TeXworks is a good editor
designed for LaTeX. knitr is an R package, but it’s worth mentioning separately, as it
powerfully integrates R and LaTeX. (You can access everything in RStudio.)
Another markup language worth mentioning is Sphinx, which runs under Python. The
Stachurski-Sargent e-book Quantitative Economics, which features Python prominently,
is written in Sphinx.

11. Version Control


Git and GitHub are useful for open/collaborative development and version control. For
my sorts of small-group projects I find that Dropbox or equivalent keeps me adequately
synchronized, but for serious large-scale development, use of git or equivalent appears
crucial.

12. Estimating the Autocovariance Function

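\[
\hat{\gamma}(\tau) = \frac{1}{T} \sum_{t=1}^{T-|\tau|} x_t x_{t+|\tau|}, \qquad \tau = 0, \pm 1, ..., \pm (T-1)
\]
\[
\gamma^{*}(\tau) = \frac{1}{T-|\tau|} \sum_{t=1}^{T-|\tau|} x_t x_{t+|\tau|}, \qquad \tau = 0, \pm 1, ..., \pm (T-1)
\]

Perhaps surprisingly, $\hat{\gamma}(\tau)$ is better

Asymptotic Distribution of the Sample Autocorrelations

\[
\rho = (\rho(0), \rho(1), ..., \rho(r))'
\]
\[
\sqrt{T}\,(\hat{\rho} - \rho) \xrightarrow{d} N(0, \Sigma)
\]

Important special case (iid):
\[
\mathrm{asyvar}(\hat{\rho}(\tau)) = \frac{1}{T}, \quad \forall \tau
\]
\[
\mathrm{asycov}(\hat{\rho}(\tau), \hat{\rho}(\tau + v)) = 0
\]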

“Bartlett standard errors”
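A minimal sketch of the sample autocorrelations with approximate two-standard-error bands under the iid special case (standard error $1/\sqrt{T}$). The divisor-T autocovariance estimator is used. The series is demeaned first, which is an assumption added here for data with unknown mean; the function names are hypothetical.

```python
import numpy as np

def sample_autocov(x, max_lag):
    """Divisor-T sample autocovariances gamma_hat(0), ..., gamma_hat(max_lag)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                 # assumption: demean before estimating
    T = len(x)
    return np.array([np.sum(x[:T - k] * x[k:]) / T for k in range(max_lag + 1)])

def sample_autocorr(x, max_lag):
    """Sample autocorrelations rho_hat(0), ..., rho_hat(max_lag)."""
    g = sample_autocov(x, max_lag)
    return g / g[0]

rng = np.random.default_rng(0)
x = rng.standard_normal(500)         # simulated iid data, illustration only
rho_hat = sample_autocorr(x, 20)
band = 2.0 / np.sqrt(len(x))         # two Bartlett standard errors under iid
share_inside = np.mean(np.abs(rho_hat[1:]) < band)
print(band, share_inside)
```

For iid data most of the sample autocorrelations at nonzero lags should fall inside the band; many violations suggest serial correlation.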


13. Estimation of ARMA Approximations


Fitting: OLS, MLE, GMM. The simplest is OLS estimation of autoregressions.

14. Lag Order Selection


What not to do...

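\[
MSE = \frac{\sum_{t=1}^{T} e_t^2}{T}
\]
\[
R^2 = 1 - \frac{\sum_{t=1}^{T} e_t^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2}
    = 1 - \frac{MSE}{\frac{1}{T}\sum_{t=1}^{T} (y_t - \bar{y})^2}
\]

Still bad:
\[
s^2 = \frac{\sum_{t=1}^{T} e_t^2}{T-k}
    = \left(\frac{T}{T-k}\right) \frac{\sum_{t=1}^{T} e_t^2}{T}
\]
\[
\bar{R}^2 = 1 - \frac{\sum_{t=1}^{T} e_t^2 / (T-k)}{\sum_{t=1}^{T} (y_t - \bar{y})^2 / (T-1)}
          = 1 - \frac{s^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2 / (T-1)}
\]

Good:
\[
SIC = T^{\left(\frac{k}{T}\right)} \left(\frac{\sum_{t=1}^{T} e_t^2}{T}\right)
\]

More generally,
\[
SIC = \frac{-2 \ln L}{T} + \frac{K \ln T}{T}
\]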

Consistency (oracle property)
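The SIC above can be used to choose an autoregressive lag order. The sketch below fits AR(p) models by OLS on a common estimation sample and picks the order with the smallest SIC. The helper names, the inclusion of an intercept (counted in k), and the simulated AR(2) data are assumptions made for illustration.

```python
import numpy as np

def fit_ar_ols(y, p, p_max):
    """OLS residuals from an AR(p) with intercept, using observations p_max+1..T
    so that every candidate order is fit on the same sample."""
    y = np.asarray(y, dtype=float)
    target = y[p_max:]
    columns = [np.ones(len(target))]
    columns += [y[p_max - j:len(y) - j] for j in range(1, p + 1)]  # lag j of y
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta

def sic(resid, k):
    """SIC = T^(k/T) * (sum of squared residuals / T)."""
    T = len(resid)
    return T ** (k / T) * np.sum(resid ** 2) / T

def select_order(y, p_max):
    """Return the SIC-minimizing AR order in 0..p_max and all SIC values."""
    scores = [sic(fit_ar_ols(y, p, p_max), p + 1) for p in range(p_max + 1)]
    return int(np.argmin(scores)), scores

# Simulated AR(2) data with hypothetical coefficients, illustration only.
rng = np.random.default_rng(1)
T, burn = 2000, 100
e = rng.standard_normal(T + burn)
y = np.zeros(T + burn)
for t in range(2, T + burn):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + e[t]
y = y[burn:]

best_p, scores = select_order(y, p_max=4)
print(best_p)
```

Fitting all candidate orders on the same observations keeps the residual sums of squares comparable across models.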


15. Diagnostic Checking: Testing White Noise Residuals

\[
H_0: \rho(1) = \rho(2) = ... = \rho(m) = 0
\]

As $T \to \infty$ (Box-Pierce):
\[
Q_{BP} = T \sum_{\tau=1}^{m} \hat{\rho}^2(\tau) \sim \chi^2(m)
\]

Also as $T \to \infty$ (Ljung-Box):
\[
Q_{LB} = T(T+2) \sum_{\tau=1}^{m} \left(\frac{1}{T-\tau}\right) \hat{\rho}^2(\tau) \sim \chi^2(m)
\]

The $\chi^2(m)$ null distributions are for observed time series. For model residuals the null
distribution is $\chi^2(m-k)$, where k is the number of parameters fit.
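A short sketch of the two Q statistics, computed from the divisor-T sample autocorrelations. The function name `q_statistics` and the simulated AR(1) example are hypothetical. The 5% critical value 18.307 for $\chi^2(10)$ is hard-coded to keep the example free of dependencies beyond NumPy.

```python
import numpy as np

def q_statistics(x, m):
    """Return (Q_BP, Q_LB) based on the first m sample autocorrelations."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    T = len(x)
    gamma0 = np.sum(x * x) / T
    lags = np.arange(1, m + 1)
    rho = np.array([np.sum(x[:T - k] * x[k:]) / T / gamma0 for k in lags])
    q_bp = T * np.sum(rho ** 2)
    q_lb = T * (T + 2) * np.sum(rho ** 2 / (T - lags))
    return q_bp, q_lb

# Simulated AR(1) with coefficient 0.8, illustration only.
rng = np.random.default_rng(0)
shocks = rng.standard_normal(500)
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

q_bp_ar, q_lb_ar = q_statistics(ar1, m=10)
CRIT_95_CHI2_10 = 18.307               # 5% critical value of chi-square(10)
reject_ar = q_lb_ar > CRIT_95_CHI2_10  # white noise null rejected?
print(q_bp_ar, q_lb_ar, reject_ar)
```

When the statistic is applied to residuals from a fitted model, the degrees of freedom should be reduced by the number of fitted parameters, as noted above.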

16. Empirical GDP dynamics.

(a) Obtain the usual quarterly expenditure-side U.S. GDPE from FRB St. Louis,
1960.1-present.
(b) Leaving out the 12 most recent quarters of data, perform a full correlogram
analysis for GDPE logarithmic growth.
(c) Again leaving out the 12 most recent quarters of data, specify, estimate and defend
appropriate AR(p) and ARMA(p, q) models for GDPE logarithmic growth.
(d) Using your preferred AR(p) and ARMA(p, q) models for GDPE logarithmic
growth, generate a 12-quarter-ahead linear least-squares path forecast for the
“hold-out” sample. How do your AR(p) and ARMA(p, q) forecasts compare to
the realized values? Which appears more accurate?
(e) Obtain ADNSS GDPplus logarithmic growth from FRB Philadelphia, read about
it, and repeat everything above.
(f) Contrast the results for GDPE logarithmic growth and GDPplus logarithmic
growth.

17. Time-domain analysis of housing starts and completions.

(a) Obtain monthly U.S. housing starts and completions data from FRED at FRB
St. Louis, seasonally-adjusted, 1960.1-present. Your two series should be of equal
length.
(b) Using only observations {1, ..., T − 4}, perform a full correlogram analysis of starts
and completions. Discuss in detail.
(c) Using only observations {1, ..., T − 4}, specify and estimate appropriate univariate
ARMA(p, q) models for starts and completions, as well as an appropriate
VAR(p). Discuss in detail.
(d) Characterize the Granger-causal structure of your estimated VAR(p). Discuss in
detail.
(e) Characterize the impulse-response structure of your estimated VAR(p) using all
possible Cholesky orderings. Discuss in detail.
(f) Using your preferred ARMA(p, q) models and VAR(p) model, specified and
estimated using only observations {1, ..., T − 4}, generate linear least-squares path
forecasts for the four months of “hold out data,” {T − 3, T − 2, T − 1, T }. How
do your forecasts compare to the realized values? Discuss in detail.

1.9 NOTES

The study of time series of, for example, astronomical observations predates recorded history.
For references old and new, see the “library” of useful books in Appendix A. Early writers
on economic subjects occasionally made explicit reference to astronomy as the source of
their ideas. For example, Cournot stressed that, as in astronomy, it is necessary to recognize
secular variation that is independent of periodic variation. Similarly, Jevons made clear
his approach to the study of short-term fluctuations used the methods of astronomy and
meteorology. During the 19th century interest in, and analysis of, social and economic time
series evolved into a new field of study independent of developments in astronomy and
meteorology. Time-series analysis then flourished. Nerlove et al. (1979) provides a brief
history of the field’s early development.
Characterization of time series by means of autoregressive, moving average, or ARMA
models was suggested, more or less simultaneously, by the Russian statistician and economist
E. Slutsky and the British statistician G.U. Yule. The Slutsky-Yule framework was modern-
ized, extended, and made part of an innovative and operational modeling and forecasting
paradigm in a more recent classic, a 1970 book by Box and Jenkins. In fact, ARMA and
related models are often called “Box-Jenkins models.”
By 1930 Slutsky and Yule had shown that rich dynamics could be obtained by taking
weighted averages of random shocks. Wold’s celebrated 1937 decomposition established
the converse, decomposing covariance stationary series into weighted averages of random
shocks, and paved the way for subsequent path-breaking work by Wiener, Kolmogorov,
Kalman and others. The beautiful 1963 treatment by Wold’s student Whittle (1963),
updated and reprinted as Whittle (1983) with a masterful introduction by Tom Sargent,
remains widely-read. Much of macroeconomics is built on the Slutsky-Yule-Wold-Wiener-Kolmogorov
foundation. For a fascinating overview of parts of the history in its relation
to macroeconomics, see Davies and Mahon (2009), at
http://www.minneapolisfed.org/publications_papers/pub_display.cfm?id=4348.

1.10 EXERCISES, PROBLEMS AND COMPLEMENTS

1. Approaches and issues in economic time series analysis.


Consider the following point/counterpoint items. In each case, which do you think
would be more useful for analysis of economic time series? Why?

• Continuous / discrete
• linear / nonlinear
• deterministic / stochastic
• univariate / multivariate
• time domain / frequency domain
• conditional mean / conditional variance
• trend / seasonal / cycle / noise
• ordered in time / ordered in space
• stock / flow
• stationary / nonstationary
• aggregate / disaggregate
• Gaussian / non-Gaussian

2. Nobel prizes for work involving time series analysis.


Go to the economics Nobel Prize web site. Read about Economics Nobel Prize win-
ners Frisch, Tinbergen, Kuznets, Tobin, Klein, Modigliani, Friedman, Lucas, Engle,
Granger, Prescott, Sargent, Sims, Fama, Shiller, and Hansen. Each made extensive
contributions to, or extensive use of, time series analysis. Other econometricians and
empirical economists winning the Prize include Leontief, Heckman, McFadden, Koop-
mans, Stone, Modigliani, and Haavelmo.
Chapter Two

Spectral Analysis

2.1 THE MANY USES OF SPECTRAL ANALYSIS

We have reserved spectral analysis for last because it spans all of time-series econometrics,
so appreciating it requires some initial familiarity with much of the subject.
Spectral Analysis

• As with the acov function, “getting the facts straight”

• Trend and persistence (power near zero)

• Integration, co-integration and long memory

• Cycles (power at cyclical frequencies)

• Seasonality (power spikes at fundamental and harmonics)

• Filter analysis and design

• Maximum-likelihood estimation (including band spectral)

• Assessing agreement between models and data

• Robust (HAC) variance estimation

2.2 THE SPECTRUM AND ITS PROPERTIES

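Recall the General Linear Process
\[
y_t = B(L)\varepsilon_t = \sum_{i=0}^{\infty} b_i \varepsilon_{t-i}
\]

Autocovariance generating function:
\[
g(z) = \sum_{\tau=-\infty}^{\infty} \gamma(\tau)\, z^{\tau} = \sigma^2 B(z) B(z^{-1})
\]

$\gamma(\tau)$ and $g(z)$ are a z-transform pair

Spectrum
Evaluate $g(z)$ on the unit circle, $z = e^{-i\omega}$:
\[
\begin{aligned}
g(e^{-i\omega}) &= \sum_{\tau=-\infty}^{\infty} \gamma(\tau)\, e^{-i\omega\tau}, \qquad -\pi < \omega < \pi \\
&= \sigma^2 B(e^{i\omega})\, B(e^{-i\omega}) \\
&= \sigma^2\, |B(e^{i\omega})|^2
\end{aligned}
\]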

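Spectrum
Trigonometric form:
\[
\begin{aligned}
g(\omega) &= \sum_{\tau=-\infty}^{\infty} \gamma(\tau) e^{-i\omega\tau} \\
&= \gamma(0) + \sum_{\tau=1}^{\infty} \gamma(\tau) \left(e^{i\omega\tau} + e^{-i\omega\tau}\right) \\
&= \gamma(0) + 2 \sum_{\tau=1}^{\infty} \gamma(\tau) \cos(\omega\tau)
\end{aligned}
\]

Spectral Density Function
\[
f(\omega) = \frac{1}{2\pi}\, g(\omega)
\]
\[
\begin{aligned}
f(\omega) &= \frac{1}{2\pi} \sum_{\tau=-\infty}^{\infty} \gamma(\tau) e^{-i\omega\tau} \qquad (-\pi < \omega < \pi) \\
&= \frac{1}{2\pi}\gamma(0) + \frac{1}{\pi} \sum_{\tau=1}^{\infty} \gamma(\tau) \cos(\omega\tau) \\
&= \frac{\sigma^2}{2\pi} B(e^{i\omega})\, B(e^{-i\omega}) \\
&= \frac{\sigma^2}{2\pi}\, |B(e^{i\omega})|^2
\end{aligned}
\]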

Properties of Spectrum and Spectral Density

1. symmetric around ω = 0

2. real-valued

3. 2π-periodic

4. nonnegative

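A Fourier Transform Pair
\[
g(\omega) = \sum_{\tau=-\infty}^{\infty} \gamma(\tau) e^{-i\omega\tau}
\]
\[
\gamma(\tau) = \frac{1}{2\pi} \int_{-\pi}^{\pi} g(\omega) e^{i\omega\tau}\, d\omega
\]

A Variance Decomposition by Frequency
\[
\gamma(\tau) = \frac{1}{2\pi} \int_{-\pi}^{\pi} g(\omega) e^{i\omega\tau}\, d\omega
= \int_{-\pi}^{\pi} f(\omega) e^{i\omega\tau}\, d\omega
\]
Hence
\[
\gamma(0) = \int_{-\pi}^{\pi} f(\omega)\, d\omega
\]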

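Robust Variance Estimation
\[
\bar{x} = \frac{1}{T} \sum_{t=1}^{T} x_t
\]
\[
\mathrm{var}(\bar{x}) = \frac{1}{T^2} \sum_{s=1}^{T} \sum_{t=1}^{T} \gamma(t-s)
\]
(“Add row sums”)
\[
= \frac{1}{T} \sum_{\tau=-(T-1)}^{T-1} \left(1 - \frac{|\tau|}{T}\right) \gamma(\tau)
\]
(“Add diagonal sums,” using change of variable $\tau = t - s$)
Hence:
\[
\sqrt{T}\,(\bar{x} - \mu) \sim \left(0,\ \sum_{\tau=-(T-1)}^{T-1} \left(1 - \frac{|\tau|}{T}\right) \gamma(\tau)\right)
\]
\[
\sqrt{T}\,(\bar{x} - \mu) \xrightarrow{d} N(0, g_x(0))
\]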

2.3 RATIONAL SPECTRA

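White Noise Spectral Density
\[
y_t = \varepsilon_t, \qquad \varepsilon_t \sim WN(0, \sigma^2)
\]
\[
f(\omega) = \frac{\sigma^2}{2\pi} B(e^{i\omega})\, B(e^{-i\omega}) = \frac{\sigma^2}{2\pi}
\]

AR(1) Spectral Density
\[
y_t = \phi y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim WN(0, \sigma^2)
\]
\[
\begin{aligned}
f(\omega) &= \frac{\sigma^2}{2\pi} B(e^{i\omega})\, B(e^{-i\omega}) \\
&= \frac{\sigma^2}{2\pi} \frac{1}{(1 - \phi e^{i\omega})(1 - \phi e^{-i\omega})} \\
&= \frac{\sigma^2}{2\pi} \frac{1}{1 - 2\phi\cos(\omega) + \phi^2}
\end{aligned}
\]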
How does shape depend on φ? Where are the peaks?
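To explore how the shape depends on φ, the following sketch evaluates the AR(1) spectral density $\frac{\sigma^2}{2\pi}\frac{1}{1 - 2\phi\cos\omega + \phi^2}$ on a grid, locates its peak, and checks numerically that it integrates to the variance $\sigma^2/(1-\phi^2)$. The function names and the values φ = ±0.9 are illustrative choices.

```python
import numpy as np

def ar1_sdf(omega, phi, sigma2=1.0):
    """AR(1) spectral density on [-pi, pi]."""
    return sigma2 / (2 * np.pi) / (1 - 2 * phi * np.cos(omega) + phi ** 2)

def peak_and_variance(phi, n_grid=20001):
    """Return (frequency of the peak on the grid, integral of f over [-pi, pi])."""
    omega = np.linspace(-np.pi, np.pi, n_grid)
    f = ar1_sdf(omega, phi)
    # trapezoid rule written out by hand
    integral = np.sum((f[:-1] + f[1:]) / 2 * np.diff(omega))
    return omega[np.argmax(f)], integral

for phi in (0.9, -0.9):
    peak, variance = peak_and_variance(phi)
    print(phi, peak, variance, 1 / (1 - phi ** 2))
```

Positive φ concentrates power near frequency zero (persistence), while negative φ concentrates it near ±π (rapid alternation).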
ARMA(1, 1) Spectral Density
\[
(1 - \phi L) y_t = (1 - \theta L)\varepsilon_t
\]
\[
f(\omega) = \frac{\sigma^2}{2\pi}\, \frac{1 - 2\theta\cos(\omega) + \theta^2}{1 - 2\phi\cos(\omega) + \phi^2}
\]

“Rational spectral density”

[Figure 2.1: Granger’s Typical Spectral Shape of an Economic Variable]

Internal peaks? What will it take?

2.4 MULTIVARIATE

Multivariate Frequency Domain


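Covariance-generating function:
\[
G_{yx}(z) = \sum_{\tau=-\infty}^{\infty} \Gamma_{yx}(\tau)\, z^{\tau}
\]

Spectral density function:
\[
F_{yx}(\omega) = \frac{1}{2\pi}\, G_{yx}(e^{-i\omega})
= \frac{1}{2\pi} \sum_{\tau=-\infty}^{\infty} \Gamma_{yx}(\tau)\, e^{-i\omega\tau}, \qquad -\pi < \omega < \pi
\]
(Complex-valued)

Co-Spectrum and Quadrature Spectrum
\[
F_{yx}(\omega) = C_{yx}(\omega) + i\,Q_{yx}(\omega)
\]
\[
C_{yx}(\omega) = \frac{1}{2\pi} \sum_{\tau=-\infty}^{\infty} \Gamma_{yx}(\tau) \cos(\omega\tau)
\]
\[
Q_{yx}(\omega) = \frac{-1}{2\pi} \sum_{\tau=-\infty}^{\infty} \Gamma_{yx}(\tau) \sin(\omega\tau)
\]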

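Cross Spectrum
\[
f_{yx}(\omega) = ga_{yx}(\omega)\, \exp(i\, ph_{yx}(\omega)) \qquad \text{(generic cross spectrum)}
\]
\[
ga_{yx}(\omega) = \left[C_{yx}^2(\omega) + Q_{yx}^2(\omega)\right]^{\frac{1}{2}} \qquad \text{(gain)}
\]
\[
ph_{yx}(\omega) = \arctan\left(\frac{Q_{yx}(\omega)}{C_{yx}(\omega)}\right) \qquad \text{(phase)}
\]
(Phase shift in time units is $\frac{ph(\omega)}{\omega}$)
\[
coh_{yx}(\omega) = \frac{|f_{yx}(\omega)|^2}{f_{xx}(\omega)\, f_{yy}(\omega)} \qquad \text{(coherence)}
\]
Squared correlation decomposed by frequency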
Useful Spectral Results for Filter Design and Analysis

• (Effects of a linear filter)

If $y_t = B(L)x_t$, then:
– $f_{yy}(\omega) = |B(e^{-i\omega})|^2 f_{xx}(\omega)$
– $f_{yx}(\omega) = B(e^{-i\omega}) f_{xx}(\omega)$.

$B(e^{-i\omega})$ is the filter’s frequency response function.

• (Effects of a series of linear filters (follows trivially))

If $y_t = A(L)B(L)x_t$, then
– $f_{yy}(\omega) = |A(e^{-i\omega})|^2\, |B(e^{-i\omega})|^2 f_{xx}(\omega)$
– $f_{yx}(\omega) = A(e^{-i\omega}) B(e^{-i\omega}) f_{xx}(\omega)$.

• (Spectrum of an independent sum)

If $y = \sum_{i=1}^{N} x_i$, and the $x_i$ are independent, then
\[
f_y(\omega) = \sum_{i=1}^{N} f_{x_i}(\omega).
\]

Nuances... Note that
\[
B(e^{-i\omega}) = \frac{f_{yx}(\omega)}{f_{xx}(\omega)}
\implies B(e^{-i\omega}) = \frac{ga_{yx}(\omega)\, e^{i\, ph_{yx}(\omega)}}{f_{xx}(\omega)}
\]
Phases of $f_{yx}(\omega)$ and $B(e^{-i\omega})$ are the same.
Gains are closely related.


Example

yt = .5xt−1 + εt

εt ∼ W N (0, 1)

xt = .9xt−1 + ηt

ηt ∼ W N (0, 1)

Correlation Structure
Autocorrelation and cross-correlation functions are straightforward:
\[
\rho_x(\tau) = .9^{|\tau|}
\]
\[
\rho_y(\tau) \propto .9^{|\tau|}, \qquad \tau \neq 0
\]
\[
\rho_{yx}(\tau) \propto .9^{|\tau - 1|}
\]
(What is the qualitative shape of $\rho_{yx}(\tau)$?)


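Spectral Density of x
\[
x_t = \frac{1}{1 - .9L}\, \eta_t
\]
\[
\begin{aligned}
\implies f_{xx}(\omega) &= \frac{1}{2\pi}\, \frac{1}{1 - .9e^{-i\omega}}\, \frac{1}{1 - .9e^{i\omega}} \\
&= \frac{1}{2\pi}\, \frac{1}{1 - 2(.9)\cos(\omega) + (.9)^2} \\
&= \frac{1}{11.37 - 11.30\cos(\omega)}
\end{aligned}
\]
Shape?

Spectral Density of y
\[
y_t = 0.5 L x_t + \varepsilon_t
\]
\[
\begin{aligned}
\implies f_{yy}(\omega) &= |0.5 e^{-i\omega}|^2 f_{xx}(\omega) + \frac{1}{2\pi} \\
&= 0.25 f_{xx}(\omega) + \frac{1}{2\pi} \\
&= \frac{0.25}{11.37 - 11.30\cos(\omega)} + \frac{1}{2\pi}
\end{aligned}
\]
Shape?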
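Cross Spectrum
\[
B(L) = .5L, \qquad B(e^{-i\omega}) = 0.5 e^{-i\omega}
\]
\[
\begin{aligned}
f_{yx}(\omega) &= B(e^{-i\omega}) f_{xx}(\omega) \\
&= 0.5 e^{-i\omega} f_{xx}(\omega) \\
&= (0.5 f_{xx}(\omega))\, e^{-i\omega}
\end{aligned}
\]
\[
ga_{yx}(\omega) = 0.5 f_{xx}(\omega) = \frac{0.5}{11.37 - 11.30\cos(\omega)}
\]
\[
ph_{yx}(\omega) = -\omega
\]
(In time units, $ph_{yx}(\omega)/\omega = -1$, so y leads x by −1; that is, y lags x by one period.)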
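Coherence
\[
coh_{yx}(\omega) = \frac{|f_{yx}(\omega)|^2}{f_{xx}(\omega) f_{yy}(\omega)}
= \frac{.25 f_{xx}^2(\omega)}{f_{xx}(\omega) f_{yy}(\omega)}
= \frac{.25 f_{xx}(\omega)}{f_{yy}(\omega)}
\]
\[
= \frac{.25\, \frac{1}{2\pi}\, \frac{1}{1 - 2(.9)\cos(\omega) + .9^2}}{.25\, \frac{1}{2\pi}\, \frac{1}{1 - 2(.9)\cos(\omega) + .9^2} + \frac{1}{2\pi}}
= \frac{.25}{.25 + 1.81 - 1.8\cos(\omega)}
\]
\[
= \frac{1}{8.24 - 7.20\cos(\omega)}
\]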
Shape?
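The shapes asked about in this example can be inspected numerically. The sketch below builds the spectra of x and y, the cross spectrum, and from them the coherence and phase, directly from the filter relationships $f_{yy} = |B|^2 f_{xx} + f_{\varepsilon\varepsilon}$ and $f_{yx} = B f_{xx}$ with $B(e^{-i\omega}) = 0.5e^{-i\omega}$. The function names are hypothetical.

```python
import numpy as np

def f_xx(w):
    """Spectral density of x_t = .9 x_{t-1} + eta_t, var(eta) = 1."""
    return 1 / (2 * np.pi) / (1 - 2 * 0.9 * np.cos(w) + 0.9 ** 2)

def f_yy(w):
    """Spectral density of y_t = .5 x_{t-1} + eps_t, var(eps) = 1."""
    return 0.25 * f_xx(w) + 1 / (2 * np.pi)

def f_yx(w):
    """Cross spectrum: frequency response of .5L times f_xx."""
    return 0.5 * np.exp(-1j * w) * f_xx(w)

def coherence(w):
    return np.abs(f_yx(w)) ** 2 / (f_xx(w) * f_yy(w))

def phase(w):
    return np.angle(f_yx(w))

w = np.linspace(0.01, np.pi - 0.01, 300)
coh = coherence(w)
ph = phase(w)
print(coh[0], coh[-1])   # coherence at low vs. high frequency
print(ph[:3] / w[:3])    # phase shift in time units
```

Plotting `f_xx(w)`, `f_yy(w)` and `coh` against `w` answers the "Shape?" questions: all three are largest at low frequencies, since x is highly persistent and the noise in y is flat across frequencies.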

2.5 FILTER ANALYSIS AND DESIGN

Filter Analysis: A Trivial (but Important) High-Pass Filter

\[
y_t = x_t - x_{t-1}
\]
\[
\implies B(e^{-i\omega}) = 1 - e^{-i\omega}
\]

Hence the filter gain is:
\[
|B(e^{-i\omega})| = |1 - e^{-i\omega}| = \sqrt{2(1 - \cos(\omega))}
\]

How would the gain look for B(L) = 1 + L?
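A small sketch for computing the gain of any one-sided filter with finitely many coefficients, applied here to $1 - L$ and $1 + L$. The helper `gain` is a hypothetical name; it evaluates $|B(e^{-i\omega})|$ for $B(L) = \sum_j c_j L^j$.

```python
import numpy as np

def gain(coeffs, omega):
    """Gain |B(e^{-i omega})| of the filter B(L) = sum_j coeffs[j] L^j."""
    omega = np.asarray(omega, dtype=float)
    j = np.arange(len(coeffs))
    response = np.exp(-1j * np.outer(omega, j)) @ np.asarray(coeffs, dtype=complex)
    return np.abs(response)

omega = np.linspace(0, np.pi, 201)
g_diff = gain([1, -1], omega)   # first difference, 1 - L
g_sum = gain([1, 1], omega)     # two-period sum, 1 + L
print(g_diff[0], g_diff[-1])    # gain at frequency 0 and at frequency pi
print(g_sum[0], g_sum[-1])
```

The gain of $1 - L$ rises from 0 at ω = 0 to 2 at ω = π, so low frequencies are removed, while the gain of $1 + L$ is the mirror image and removes high frequencies.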


[Figure 2.2: Gain of Differencing Filter 1 − L]

Filter Analysis: Kuznets’ Infamous Filters

Low-frequency fluctuations in aggregate real output growth.
“Kuznets cycle”: 20-year period

Filter 1 (moving average):
\[
y_t = \frac{1}{5} \sum_{j=-2}^{2} x_{t-j}
\]
\[
\implies B_1(e^{-i\omega}) = \frac{1}{5} \sum_{j=-2}^{2} e^{-i\omega j} = \frac{\sin(5\omega/2)}{5\sin(\omega/2)}
\]
Hence the filter gain is:
\[
|B_1(e^{-i\omega})| = \left|\frac{\sin(5\omega/2)}{5\sin(\omega/2)}\right|
\]

[Figure 2.3: Gain of Kuznets’ Filter 1]

Kuznets’ Filters, Continued

Filter 2 (fancy difference):
\[
z_t = y_{t+5} - y_{t-5}
\]
\[
\implies B_2(e^{-i\omega}) = e^{i5\omega} - e^{-i5\omega} = 2i\sin(5\omega)
\]
Hence the filter gain is:
\[
|B_2(e^{-i\omega})| = |2\sin(5\omega)|
\]

[Figure 2.4: Gain of Kuznets’ Filter 2]

Kuznets’ Filters, Continued

Composite gain:
\[
|B_1(e^{-i\omega})\, B_2(e^{-i\omega})| = \left|\frac{\sin(5\omega/2)}{5\sin(\omega/2)}\right| \, |2\sin(5\omega)|
\]

[Figure 2.5: Composite Gain of Kuznets’ Two Filters]

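Filter Design: A Bandpass Filter
Canonical problem:
Find $B(L)$ s.t.
\[
f_y(\omega) =
\begin{cases}
f_x(\omega) & \text{on } [a, b] \cup [-b, -a] \\
0 & \text{otherwise,}
\end{cases}
\]
where
\[
y_t = B(L) x_t = \sum_{j=-\infty}^{\infty} b_j x_{t-j}
\]

Bandpass Filter, Continued

Recall
\[
f_y(\omega) = |B(e^{-i\omega})|^2 f_x(\omega).
\]
Hence we need:
\[
B(e^{-i\omega}) =
\begin{cases}
1 & \text{on } [a, b] \cup [-b, -a], \quad 0 < a < b < \pi \\
0 & \text{otherwise}
\end{cases}
\]

By Fourier series expansion (“inverse Fourier transform”):
\[
b_j = \frac{1}{2\pi} \int_{-\pi}^{\pi} B(e^{-i\omega})\, e^{i\omega j}\, d\omega
= \frac{1}{\pi} \left(\frac{\sin(jb) - \sin(ja)}{j}\right), \qquad \forall j \in \mathbb{Z},
\]
with $b_0 = (b - a)/\pi$ by continuity at $j = 0$.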

Bandpass Filter, Continued


Many interesting issues:

• What is the weighting pattern? Two-sided? Weights symmetric around 0?

• How “best” to make this filter feasible in practice? What does that mean? Simple
truncation?

• One sided version?

• Phase shift?

2.6 ESTIMATING SPECTRA

2.6.1 Univariate

Estimation of the Spectral Density Function


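Periodogram ordinate at frequency ω:
\[
I(\omega) = \frac{2}{T} \left| \sum_{t=1}^{T} y_t e^{-i\omega t} \right|^2
= \left( \sqrt{\frac{2}{T}} \sum_{t=1}^{T} y_t e^{-i\omega t} \right)
  \left( \sqrt{\frac{2}{T}} \sum_{t=1}^{T} y_t e^{i\omega t} \right),
\qquad -\pi \le \omega \le \pi
\]
Usually examine frequencies $\omega_j = \frac{2\pi j}{T}$, $j = 0, 1, 2, ..., \frac{T}{2}$

Sample Spectral Density
\[
\hat{f}(\omega) = \frac{1}{2\pi} \sum_{\tau=-(T-1)}^{T-1} \hat{\gamma}(\tau) e^{-i\omega\tau}
\]
\[
\begin{aligned}
\hat{f}(\omega) &= \frac{1}{2\pi T} \left| \sum_{t=1}^{T} y_t e^{-i\omega t} \right|^2 \\
&= \left( \frac{1}{\sqrt{2\pi T}} \sum_{t=1}^{T} y_t e^{-i\omega t} \right)
   \left( \frac{1}{\sqrt{2\pi T}} \sum_{t=1}^{T} y_t e^{i\omega t} \right) \\
&= \frac{1}{4\pi}\, I(\omega)
\end{aligned}
\]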

Properties of the Sample Spectral Density
(Throughout we use $\omega_j = \frac{2\pi j}{T}$, $j = 0, 1, ..., \frac{T}{2}$)

• Ordinates asymptotically unbiased

• Ordinates asymptotically uncorrelated

• But variance does not converge to 0
(degrees of freedom don’t accumulate)

• Hence inconsistent

For Gaussian series we have:
\[
\frac{2\hat{f}(\omega_j)}{f(\omega_j)} \xrightarrow{d} \chi^2_2,
\]
where the $\chi^2_2$ random variables are independent across frequencies
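Consistent (Lag Window) Spectral Estimation
\[
\hat{f}(\omega) = \frac{1}{2\pi} \sum_{\tau=-(T-1)}^{T-1} \hat{\gamma}(\tau) e^{-i\omega\tau}
= \frac{1}{2\pi}\hat{\gamma}(0) + \frac{2}{2\pi} \sum_{\tau=1}^{T-1} \hat{\gamma}(\tau) \cos(\omega\tau)
\]
\[
f^{*}(\omega) = \frac{1}{2\pi} \sum_{\tau=-(T-1)}^{T-1} \lambda(\tau)\, \hat{\gamma}(\tau) e^{-i\omega\tau}
\]

Common lag windows with truncation lag $M_T$:

$\lambda(\tau) = 1$, $|\tau| \le M_T$ and 0 otherwise (rectangular, or boxcar)
$\lambda(\tau) = 1 - \frac{|\tau|}{M_T}$, $|\tau| \le M_T$ and 0 otherwise
(triangular, or Bartlett, or Newey-West)
Consistency: $M_T \to \infty$ and $\frac{M_T}{T} \to 0$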
Truncation lag must increase “appropriately” with T
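A minimal sketch of a lag-window estimator with the rectangular and Bartlett windows. The function name, the demeaning step, and the fixed truncation lag M = 20 with T = 4000 are illustrative assumptions, not a recommended rule for choosing $M_T$.

```python
import numpy as np

def lag_window_sdf(y, omega, M, window="bartlett"):
    """Lag-window estimate f*(omega) with truncation lag M (requires M < T)."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                       # assumption: unknown mean, so demean
    T = len(y)
    taus = np.arange(1, M + 1)
    gamma0 = np.sum(y * y) / T
    gammas = np.array([np.sum(y[:T - k] * y[k:]) / T for k in taus])
    if window == "bartlett":
        lam = 1 - taus / M                 # triangular weights
    elif window == "rectangular":
        lam = np.ones(M)
    else:
        raise ValueError("window must be 'bartlett' or 'rectangular'")
    omega = np.atleast_1d(omega)
    cosines = np.cos(np.outer(omega, taus))
    return (gamma0 + 2 * cosines @ (lam * gammas)) / (2 * np.pi)

rng = np.random.default_rng(0)
y = rng.standard_normal(4000)              # white noise: true f = 1/(2*pi)
omega = np.array([0.0, np.pi / 2, np.pi])
f_star = lag_window_sdf(y, omega, M=20)
print(f_star, 1 / (2 * np.pi))
```

Evaluating the estimate at ω = 0 and multiplying by 2π gives an estimate of $g_x(0)$, the long-run variance that appears in the limiting distribution of the sample mean.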
Other (Closely-Related) Routes to Consistent Spectral Estimation

• Fit parametric approximating models (e.g., autoregressive) (“model-based estimation”)
– Let order increase appropriately with sample size

• Smooth the sample spectral density (“spectral window estimation”)
– Let window width decrease appropriately with sample size

2.6.2 Multivariate

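Spectral density matrix:
\[
F_{yx}(\omega) = \frac{1}{2\pi} \sum_{\tau=-\infty}^{\infty} \Gamma_{yx}(\tau) e^{-i\omega\tau}, \qquad -\pi < \omega < \pi
\]

Consistent (lag window) estimator:
\[
F^{*}_{yx}(\omega) = \frac{1}{2\pi} \sum_{\tau=-(T-1)}^{T-1} \lambda(\tau)\, \hat{\Gamma}_{yx}(\tau)\, e^{-i\omega\tau}, \qquad -\pi < \omega < \pi
\]

Different lag windows may be used for different elements of $F_{yx}(\omega)$
Or do model-based...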

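2.7 APPROXIMATE (ASYMPTOTIC) FREQUENCY DOMAIN GAUSSIAN LIKELIHOOD

Recall that for Gaussian series we have as $T \to \infty$:
\[
x_j = \frac{2\hat{f}(\omega_j)}{f(\omega_j; \theta)} \xrightarrow{d} \chi^2_2
\]
where $f(\omega_j; \theta)$ is the spectral density and the $\chi^2_2$ random variables are independent across
frequencies
\[
\omega_j = \frac{2\pi j}{T}, \qquad j = 0, 1, ..., \frac{T}{2}
\]
⇒ MGF of any one of the $x_j$’s is
\[
M_x(t) = \frac{1}{1 - 2t}
\]
Let
\[
y_j = \hat{f}(\omega_j) = \frac{f(\omega_j; \theta)\, x_j}{2}
\]
\[
\implies M_y(t) = M_x\left(\frac{f(\omega_j; \theta)}{2}\, t\right) = \frac{1}{1 - f(\omega_j; \theta)\, t}
\]
This is the MGF of exponential rv with parameter $1/f(\omega_j; \theta)$.
\[
\implies g(\hat{f}(\omega_j); \theta) = \frac{1}{f(\omega_j; \theta)}\, \exp\left(\frac{-\hat{f}(\omega_j)}{f(\omega_j; \theta)}\right)
\]
Univariate asymptotic Gaussian log likelihood:
\[
\ln L(\hat{f}; \theta) = -\sum_{j=0}^{T/2} \ln f(\omega_j; \theta) - \sum_{j=0}^{T/2} \frac{\hat{f}(\omega_j)}{f(\omega_j; \theta)}
\]

Multivariate asymptotic Gaussian log likelihood:
\[
\ln L(\hat{f}; \theta) = -\sum_{j=0}^{T/2} \ln |F(\omega_j; \theta)| - \mathrm{trace}\left( \sum_{j=0}^{T/2} F^{-1}(\omega_j; \theta)\, \hat{F}(\omega_j) \right)
\]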

2.8 EXERCISES, PROBLEMS AND COMPLEMENTS

1. Seasonality and Seasonal Adjustment

2. HAC Estimation

3. Applied spectrum estimation.


Pick an interesting and appropriate real time series. Compute and graph (when ap-
propriate) the sample mean, sample autocovariance, sample autocorrelation, sample
partial autocorrelation, and sample spectral density functions. Also compute and graph
the sample coherence and phase lead. Discuss in detail the methods you used. For the
sample spectra, try to discuss a variety of smoothing schemes. Try smoothing the
periodogram as well as smoothing the autocovariances, and also try autoregressive
spectral estimators.

4. Sample spectrum.
Generate samples of Gaussian white noise of sizes 32, 64, 128, 256, 512, 1024 and
2048, and for each compute and graph the sample spectral density function at the
usual frequencies. What do your graphs illustrate?

5. Lag windows and spectral windows.


Provide graphs of the rectangular, Bartlett, Tukey-Hamming and Parzen lag windows.
Derive and graph the corresponding spectral windows.

6. Bootstrapping sample autocorrelations.


Assuming normality, propose a “parametric bootstrap” method of assessing the finite-
sample distribution of the sample autocorrelations. How would you generalize this to
assess the sampling uncertainty associated with the entire autocorrelation function?
How might you dispense with the normality assumption?
Solution: Assume normality, and then take draws from the process by using a normal
random number generator in conjunction with the Cholesky factorization of the data
covariance matrix. This procedure can be used to estimate the sampling distribution
of the autocorrelations, taken one at a time. One will surely want to downweight
the long-lag autocorrelations before doing the Cholesky factorization, and let this
downweighting adapt to sample size. Assessing sampling uncertainty for the entire
autocorrelation function (e.g., finding a 95% confidence “tunnel”) appears harder,
due to the correlation between sample autocorrelations, but can perhaps be done
numerically. It appears very difficult to dispense with the normality assumption.

7. Bootstrapping sample spectra.


Assuming normality, propose a “parametric bootstrap” method of assessing the finite-
sample distribution of a consistent estimator of the spectral density function at various
selected frequencies. How would you generalize this to assess the sampling uncertainty
associated with the entire spectral density function?
Solution: At each bootstrap replication of the autocovariance bootstrap discussed
above, Fourier transform to get the corresponding spectral density function.

8. Bootstrapping spectra without normality.


Drop the normality assumption, and propose a “parametric bootstrap” method of
assessing the finite-sample distribution of (1) a consistent estimator of the spectral
density function at various selected frequencies, and (2) the sample autocorrelations.
Solution: Make use of the asymptotic distribution of the periodogram ordinates.

9. Sample coherence.
If a sample coherence is computed directly from the sample spectral density matrix
(without smoothing), it will be 1, by definition. Thus, it is important that the sample
spectrum and cross-spectrum be smoothed prior to construction of a coherence
estimator.
Solution:
\[
coh(\omega) = \frac{|f_{yx}(\omega)|^2}{f_x(\omega)\, f_y(\omega)}
\]
In unsmoothed sample spectral density analogs,
\[
\widehat{coh}(\omega) =
\frac{\left[\sum y_t e^{-i\omega t} \sum x_t e^{+i\omega t}\right] \left[\sum y_t e^{+i\omega t} \sum x_t e^{-i\omega t}\right]}
     {\left[\sum x_t e^{-i\omega t} \sum x_t e^{+i\omega t}\right] \left[\sum y_t e^{-i\omega t} \sum y_t e^{+i\omega t}\right]}
\equiv 1.
\]
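The identity can be seen numerically. The sketch below computes the unsmoothed sample coherence of two independent simulated series, which equals 1 at every Fourier frequency, and then a smoothed version based on a simple circular moving average across neighboring frequencies. The smoother, its half-width h = 8, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 256
x = rng.standard_normal(T)
y = rng.standard_normal(T)               # independent of x by construction

dx, dy = np.fft.fft(x), np.fft.fft(y)    # discrete Fourier transforms
f_xx = np.abs(dx) ** 2 / (2 * np.pi * T)
f_yy = np.abs(dy) ** 2 / (2 * np.pi * T)
f_yx = dy * np.conj(dx) / (2 * np.pi * T)

raw_coh = np.abs(f_yx) ** 2 / (f_xx * f_yy)   # identically 1 without smoothing

def smooth(a, h):
    """Circular moving average over 2h+1 neighboring Fourier frequencies."""
    kernel = np.ones(2 * h + 1) / (2 * h + 1)
    padded = np.concatenate([a[-h:], a, a[:h]])
    return np.convolve(padded, kernel, mode="valid")

h = 8
sm_coh = np.abs(smooth(f_yx, h)) ** 2 / (smooth(f_xx, h) * smooth(f_yy, h))
print(raw_coh.min(), raw_coh.max(), sm_coh.mean())
```

After smoothing, the estimated coherence of the two independent series is far below 1, which is the sensible answer.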

10. De-meaning.
Consider two forms of a covariance stationary time series: “raw” and de-meaned.
Contrast their sample spectral density functions at ordinates 2πj/T, j = 0, 1, ...,
T/2. What do you conclude? Now contrast their sample spectral density functions at
ordinates that are not of the form 2πj/T. Discuss.
Solution: Within the set 2πj/T, j = 0, 1, ..., T/2, only the sample spectral density at
frequency 0 is affected by de-meaning. However, de-meaning does affect the sample
spectral density function at all frequencies in [0, π] outside the set 2πj/T, j = 0, 1,
..., T/2. See Priestley (1980, p. 417). This result is important for the properties of
time- versus frequency-domain estimators of fractionally-integrated models. Note in
particular that
\[
I(\omega_j) \propto \frac{1}{T} \left| \sum y_t e^{i\omega_j t} \right|^2
\]
so that
\[
I(0) \propto \frac{1}{T} \left| \sum y_t \right|^2 \propto T\bar{y}^2,
\]
which approaches infinity with sample size so long as the mean is nonzero. Thus it
makes little sense to use I(0) in estimation, regardless of whether the data have been
demeaned.

11. Schuster’s periodogram.


The periodogram ordinates can be written as
\[
I(\omega_j) = \frac{2}{T} \left( \left[\sum_{t=1}^{T} y_t \cos \omega_j t\right]^2 + \left[\sum_{t=1}^{T} y_t \sin \omega_j t\right]^2 \right).
\]

Interpret this result.

12. Applied estimation.


Pick an interesting and appropriate real time series. Compute and graph (when appro-
priate) the sample mean, sample autocovariance, sample autocorrelation, sample par-
tial autocorrelation, and sample spectral density functions. Also compute and graph
the sample coherence and phase lead. Discuss in detail the methods you used. For
the sample spectra, try and discuss a variety of smoothing schemes. If you can, try
smoothing the periodogram as well as smoothing the autocovariances, and also try
autoregressive spectral estimators.

13. Periodogram and sample spectrum.


Prove that $I(\omega) = 4\pi \hat{f}(\omega)$.

14. Estimating the variance of the sample mean.


Recall the dependence of the variance of the sample mean of a serially correlated time
series (for which the serial correlation is of unknown form) on the spectral density
of the series evaluated at ω = 0. Building upon this result, propose an estimator of
the variance of the sample mean of such a time series. If you are very ambitious, you
might want to explore in a Monte Carlo experiment the sampling properties of your

estimator of the standard error vs. the standard estimator of the standard error, for
various population models (e.g., AR(1) for various values of ρ) and sample sizes. If
you are not feeling so ambitious, at least conjecture upon the outcome of such an
experiment.

15. Coherence.
a. Write out the formula for the coherence between two time series x and y.
b. What is the coherence between the filtered series, $(1 - b_1 L) x_t$ and $(1 - b_2 L) y_t$?
(Assume that $b_1 \neq b_2$.)
c. What happens if $b_1 = b_2$? Discuss.

16. Multivariate spectra.


Show that for the multivariate LRCSSP,

Fy (ω) = B(e−iω ) Σ B ∗ (e−iω )

where “*” denotes conjugate transpose.

17. Filter gains.


Compute, plot and discuss the squared gain functions associated with each of the following
filters.
(a) $B(L) = (1 - L)$
(b) $B(L) = (1 + L)$
(c) $B(L) = (1 - .5L^{12})^{-1}$
(d) $B(L) = (1 - .5L^{12})$.

Solution:
(a) $G^2 = |1 - e^{-i\omega}|^2$ is monotonically increasing on [0, π]. This is an example of a “high
pass” filter.
(b) $G^2 = |1 + e^{-i\omega}|^2$ is monotonically decreasing on [0, π]. This is an example of a “low
pass” filter.
(c) $G^2 = |1 - .5e^{-12i\omega}|^{-2}$ has peaks at the fundamental seasonal frequency and its
harmonics, as expected. Note that it corresponds to a seasonal autoregression.
(d) $G^2 = |1 - .5e^{-12i\omega}|^2$ has troughs at the fundamental seasonal frequency and its
harmonics, as expected, because it is the inverse of the seasonal filter in (c) above.
Thus, the seasonal process associated with the filter in (c) above would be appropriately
“seasonally adjusted” by the present filter, which is its inverse.

18. Filtering
(a) Consider the linear filter $B(L) = 1 + \theta L$. Suppose that $y_t = B(L)x_t$,
where $x_t \sim WN(0, \sigma^2)$. Compute $f_y(\omega)$.

(b) Given that the spectral density of white noise is $\sigma^2/2\pi$, discuss how the filtering
theorem may be used to determine the spectrum of any LRCSSP by viewing it as a
linear filter of white noise.

Solution:
(a)
\[
\begin{aligned}
f_y(\omega) &= |1 + \theta e^{-i\omega}|^2 f_x(\omega) \\
&= \frac{\sigma^2}{2\pi} (1 + \theta e^{-i\omega})(1 + \theta e^{i\omega}) \\
&= \frac{\sigma^2}{2\pi} (1 + \theta^2 + 2\theta\cos\omega),
\end{aligned}
\]
which is immediately recognized as the sdf of an MA(1) process.
(b) All of the LRCSSP’s that we have studied are obtained by applying linear filters
to white noise. Thus, the filtering theorem gives their sdf’s as
\[
\begin{aligned}
f(\omega) &= \frac{\sigma^2}{2\pi} |B(e^{-i\omega})|^2 \\
&= \frac{\sigma^2}{2\pi} B(e^{-i\omega})\, B(e^{i\omega}) \\
&= \frac{\sigma^2}{2\pi} B(z)\, B(z^{-1}),
\end{aligned}
\]
evaluated on $|z| = 1$, which matches our earlier result.

19. Zero spectra.


Suppose that a time series has a spectrum that is zero on an interval of positive mea-
sure. What do you infer?

Solution: The series must be deterministic, because one could design a filter such that
the filtered series has zero spectrum everywhere.

20. Period.
Period is P = 2π/ω and is expressed in time per cycle. Its reciprocal, 1/P, is frequency
in cycles per unit time. In engineering, time is often measured in seconds, so 1/P is in Hz.

21. Seasonal autoregression.


Consider the “seasonal” autoregression $(1 - \phi L^{12})\, y_t = \varepsilon_t$.
(a) Would such a structure be characteristic of monthly seasonal data or quarterly
seasonal data?
(b) Compute and plot the spectral density f(ω), for various values of φ. Does it have
any internal peaks on (0, π)? Discuss.
(c) The lowest-frequency internal peak occurs at the so-called fundamental seasonal
frequency. What is it? What is the corresponding period?
(d) The higher-frequency spectral peaks occur at the harmonics of the fundamental
seasonal frequency. What are they? What are the corresponding periods?
Solution:
(a) Monthly, because of the 12-period lag.
(b)
\[
f(\omega) = \frac{\sigma^2}{2\pi}\, \frac{1}{1 + \phi^2 - 2\phi\cos(12\omega)}
\]
For φ > 0, the sdf has peaks at ω = 0, π/6, 2π/6, ..., 5π/6, and π.
(c) The fundamental frequency is π/6, which corresponds to a period of 12 months.
(d) The harmonic frequencies are 2π/6, ..., 5π/6, and π, corresponding to periods of
6 months, 4 months, 3 months, 12/5 months and 2 months, respectively.
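The peak locations can be confirmed numerically. The sketch below evaluates the seasonal AR spectral density, taken here to be $\frac{\sigma^2}{2\pi}\,|1 - \phi e^{-12i\omega}|^{-2}$, on a grid over [0, π] and finds the interior local maxima. The value φ = 0.8, the grid size and the function name are illustrative assumptions.

```python
import numpy as np

def seasonal_ar_sdf(omega, phi, s=12, sigma2=1.0):
    """Spectral density of (1 - phi L^s) y_t = eps_t."""
    return sigma2 / (2 * np.pi) / (1 + phi ** 2 - 2 * phi * np.cos(s * omega))

omega = np.linspace(0, np.pi, 6001)      # grid contains every multiple of pi/6
f = seasonal_ar_sdf(omega, phi=0.8)

# interior local maxima: strictly larger than both neighbors
is_peak = (f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])
peaks = omega[1:-1][is_peak]
periods = 2 * np.pi / peaks              # in months for monthly data
print(peaks / np.pi)                     # peak frequencies as fractions of pi
print(periods)
```

Setting `s=4` gives the quarterly case, with a single interior peak at π/2.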

22. More seasonal autoregression.


Consider the “seasonal” autoregression
$(1 - \phi L^{4})\, y_t = \varepsilon_t$.
(a) Would such a structure be characteristic of monthly seasonal data or quarterly
seasonal data?
(b) Compute and plot the spectral density f(ω), for various values of φ. Does it have
any internal peaks on (0, π)? Discuss.
(c) The lowest-frequency internal peak occurs at the so-called fundamental seasonal
frequency. What is it? What is the corresponding period?
(d) The higher-frequency spectral peaks occur at the harmonics of the fundamental
seasonal frequency. What are they? What are the corresponding periods?
Solution: (a) Quarterly, because of the 4-period lag.
(b)
\[
f(\omega) = \frac{\sigma^2}{2\pi}\, \frac{1}{1 + \phi^2 - 2\phi\cos(4\omega)}
\]
For φ > 0, the sdf has peaks at ω = 0, π/2 and π.
(c) The fundamental frequency is π/2, which corresponds to a period of 4 quarters.
(d) The only harmonic is π, corresponding to a period of 2 quarters.

23. The long run.


Discuss and contrast the economic concept of “long run” and the statistical concept
of “low frequency.” Give examples showing when, if ever, the two may be validly and
fruitfully equated. Also give examples showing when, if ever, it would be inappropriate
to equate the two concepts.
Solution:
A potential divergence of the concepts occurs when economists think of the “long run”
as “in steady state,” which implies the absence of dynamics.

24. Variance of the sample mean.


The variance of the sample mean of a serially correlated time series is proportional to
the spectral density function at frequency zero.
Solution:
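Let
$$\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t .$$

Then
$$\mathrm{var}(\bar{x}) = \frac{1}{T^2}\sum_{s=1}^{T}\sum_{t=1}^{T}\gamma(t-s)
= \frac{1}{T}\sum_{\tau=-(T-1)}^{T-1}\left(1-\frac{|\tau|}{T}\right)\gamma(\tau),$$

where γ(τ) is the autocovariance function of x. So for large T, we have
$$\mathrm{var}(\bar{x}) \approx \frac{2\pi f_x(0)}{T}.$$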
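The large-T approximation can be checked numerically. The following is a minimal sketch, assuming an AR(1) process with hypothetical parameter values (phi = 0.5, sigma2 = 1, T = 2000), for which γ(τ) = σ²φ^|τ|/(1 − φ²) and 2πf(0) = σ²/(1 − φ)²; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical AR(1) example: x_t = phi * x_{t-1} + e_t, var(e_t) = sigma2
phi, sigma2, T = 0.5, 1.0, 2000

def gamma(tau):
    """Autocovariance function of the AR(1) process."""
    return sigma2 * phi ** np.abs(tau) / (1 - phi**2)

# Exact variance of the sample mean: (1/T) * sum_tau (1 - |tau|/T) * gamma(tau)
taus = np.arange(-(T - 1), T)
exact = np.sum((1 - np.abs(taus) / T) * gamma(taus)) / T

# Large-T approximation: 2*pi*f(0)/T, where 2*pi*f(0) = sigma2 / (1 - phi)^2
approx = sigma2 / (1 - phi) ** 2 / T

print(exact, approx)
```

The two numbers should agree closely for large T, and the gap should widen as phi approaches one.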
25. ARCH process spectrum.
Consider the ARCH(1) process $x_t$, where
$$x_t \mid x_{t-1} \sim N(0,\; 0.2 + 0.8\, x_{t-1}^2).$$

a. Compute and graph its spectral density function.

b. Compute and graph the spectral density function of $x_t^2$. Discuss.

Solution:
a. By the law of iterated expectations, we have
$$E(x_t) = E[E(x_t \mid x_{t-1})] = E(0) = 0$$
$$\gamma(0) = E(x_t^2) = E[E(x_t^2 \mid x_{t-1})] = E(0.2 + 0.8\,x_{t-1}^2) = 0.2 + 0.8\,\gamma(0)$$
$$\gamma(0) = \frac{0.2}{1 - 0.8} = 1$$
$$\gamma(\tau) = E(x_t x_{t-\tau}) = E[E(x_t x_{t-\tau} \mid x_{t-1}, x_{t-2}, \ldots)]
= E[x_{t-\tau}\, E(x_t \mid x_{t-1}, x_{t-2}, \ldots)] = E(x_{t-\tau}\cdot 0) = 0$$
for τ = 1, 2, ....
Therefore
$$f(\omega) = \frac{1}{2\pi}\sum_{\tau=-\infty}^{\infty}\gamma(\tau)e^{-i\omega\tau} = \frac{1}{2\pi}\gamma(0) = \frac{1}{2\pi}.$$

Because it is white noise, the spectrum will be flat.


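b. Because the kurtosis of normal random variables is 3, we have
$$E(x_t^4) = E[E(x_t^4 \mid x_{t-1}, x_{t-2}, \ldots)] = E[3(0.2 + 0.8\,x_{t-1}^2)^2]
= 3[0.04 + 0.32\,E(x_{t-1}^2) + 0.64\,E(x_{t-1}^4)].$$

A finite stationary solution of this recursion requires $3(0.8)^2 < 1$, which fails here
($3 \times 0.64 = 1.92$), so strictly speaking $E(x_t^4)$ is not finite and $x_t^2$ is not
covariance stationary. Proceeding heuristically with $E(x_t^4) = 3$ gives
$$\gamma_{x^2}(0) = E(x_t^4) - (E(x_t^2))^2 = 3 - 1 = 2.$$

Because $x_t^2$ follows an AR(1) process, it follows that
$$\gamma_{x^2}(\tau) = 0.8\,\gamma_{x^2}(\tau - 1)$$
for τ = 1, 2, .... We can write the s.d.f. as
$$f(\omega) = \frac{1}{2\pi}\left[\gamma_{x^2}(0) + 2\sum_{\tau=1}^{\infty}\gamma_{x^2}(\tau)\cos(\omega\tau)\right]
= \frac{\gamma_{x^2}(0)}{2\pi}\left(1 + 2\sum_{\tau=1}^{\infty}0.8^{\tau}\cos(\omega\tau)\right),$$
which looks like the s.d.f. of an AR(1) process.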

26. Computing spectra.


Compute, graph and discuss the spectral density functions of
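a. $y_t = 0.8\, y_{t-12} + \varepsilon_t$
b. $y_t = 0.8\, \varepsilon_{t-12} + \varepsilon_t$.
Solution:
a.
$$f(\omega) = \frac{\sigma^2}{2\pi}\left[(1 - 0.8e^{12i\omega})(1 - 0.8e^{-12i\omega})\right]^{-1}
= \frac{\sigma^2}{2\pi}\,(1 - 1.6\cos 12\omega + 0.64)^{-1}$$
b.
$$f(\omega) = \frac{\sigma^2}{2\pi}\left[(1 + 0.8e^{12i\omega})(1 + 0.8e^{-12i\omega})\right]
= \frac{\sigma^2}{2\pi}\,(1 + 1.6\cos 12\omega + 0.64)$$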
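The two spectra can be evaluated on a frequency grid to locate their peaks. The sketch below is illustrative only: the function names, the grid size, and the choice σ² = 1 are assumptions, and plotting is left out so that the block runs without a display.

```python
import numpy as np

def seasonal_ar_spectrum(omega, phi=0.8, s=12, sigma2=1.0):
    """Spectral density of (1 - phi L^s) y_t = eps_t."""
    return sigma2 / (2 * np.pi) / (1 + phi**2 - 2 * phi * np.cos(s * omega))

def seasonal_ma_spectrum(omega, theta=0.8, s=12, sigma2=1.0):
    """Spectral density of y_t = (1 + theta L^s) eps_t."""
    return sigma2 / (2 * np.pi) * (1 + theta**2 + 2 * theta * np.cos(s * omega))

# Grid on [0, pi] chosen so that the multiples of pi/6 are grid points
omega = np.linspace(0, np.pi, 1201)
f_ar = seasonal_ar_spectrum(omega)
f_ma = seasonal_ma_spectrum(omega)

# Interior local maxima of the AR spectrum (strictly above both neighbours)
is_peak = (f_ar[1:-1] > f_ar[:-2]) & (f_ar[1:-1] > f_ar[2:])
ar_peaks = omega[1:-1][is_peak]

print(ar_peaks / np.pi)  # peak locations as fractions of pi
```

Both spectra peak at the same seasonal frequencies; the AR peaks are much sharper because the factor appears in the denominator rather than the numerator.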

27. Regression in the frequency domain.


The asymptotic diagonalization theorem provides the key not only to approximate
(i.e., asymptotic) MLE of time-series models in the frequency domain, but also to
many other important techniques, such as Hannan efficient regression:

$$\hat{\beta}_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y$$

But asymptotically,

$$\Sigma = P'DP$$

so

$$\Sigma^{-1} = P'D^{-1}P$$

Thus asymptotically

$$\hat{\beta}_{GLS} = (X'P'D^{-1}PX)^{-1}X'P'D^{-1}PY$$

which is just WLS on Fourier transformed data.

28. Band spectral regression

2.9 NOTES

Harmonic analysis is one of the earliest methods of analyzing time series thought to exhibit
some form of periodicity. In this type of analysis, the time series, or some simple trans-
formation of it, is assumed to be the result of the superposition of sine and cosine waves
of different frequencies. However, since summing a finite number of such strictly periodic
functions always results in a perfectly periodic series, which is seldom observed in practice,
one usually allows for an additive stochastic component, sometimes called “noise.” Thus, an
observer must confront the problem of searching for “hidden periodicities” in the data, that
is, the unknown frequencies and amplitudes of sinusoidal fluctuations hidden amidst noise.
An early method for this purpose is periodogram analysis, initially used to analyse sunspot
data, and later to analyse economic time series.
Spectral analysis is a modernized version of periodogram analysis modified to take account
of the stochastic nature of the entire time series, not just the noise component. If it is assumed
that economic time series are fully stochastic, it follows that the older periodogram technique
is inappropriate and that considerable difficulties in the interpretation of the periodograms
of economic series may be encountered.
These notes draw in part on Diebold, Kilian and Nerlove, New Palgrave, ***.
Chapter Three

Markovian Structure, Linear Gaussian State Space, and Optimal

(Kalman) Filtering

3.1 MARKOVIAN STRUCTURE

3.1.1 The Homogeneous Discrete-State Discrete-Time Markov Process

$\{X_t\}$, t = 0, 1, 2, ...

Possible values (“states”) of $X_t$: 1, 2, 3, ...

First-order homogeneous Markov process:

$$\mathrm{Prob}(X_{t+1} = j \mid X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0)
= \mathrm{Prob}(X_{t+1} = j \mid X_t = i) = p_{ij}$$

1-step transition probabilities (rows index the state at time t, columns the state at time t + 1):

$$P \equiv \begin{pmatrix} p_{11} & p_{12} & \cdots \\ p_{21} & p_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$

$$p_{ij} \ge 0, \qquad \sum_{j=1}^{\infty} p_{ij} = 1$$

3.1.2 Multi-Step Transitions: Chapman-Kolmogorov

m-step transition probabilities:

$$p_{ij}^{(m)} = \mathrm{Prob}(X_{t+m} = j \mid X_t = i)$$

Let $P^{(m)} \equiv \left(p_{ij}^{(m)}\right)$.

Chapman-Kolmogorov theorem:

$$P^{(m+n)} = P^{(m)} P^{(n)}$$

Corollary: $P^{(m)} = P^m$

3.1.3 Lots of Definitions (and a Key Theorem)


State j is accessible from state i if $p_{ij}^{(n)} > 0$ for some n.

Two states i and j communicate (or are in the same class) if each is accessible from the
other. We write i ↔ j.

A Markov process is irreducible if there exists only one class (i.e., all states communicate).

State i has period d if $p_{ii}^{(n)} = 0$ for all n such that $n/d \notin \mathbb{Z}$, and d is the
greatest integer with that property. (That is, a return to state i can occur only in
multiples of d steps.) A state with period 1 is called an aperiodic state.

A Markov process all of whose states are aperiodic is called an aperiodic Markov process.
Still more definitions....
The first-transition probability is the probability that, starting in i, the first transition to
j occurs after n transitions:
$$f_{ij}^{(n)} = \mathrm{Prob}(X_n = j,\; X_k \ne j,\; k = 1, \ldots, n-1 \mid X_0 = i)$$
Denote the eventual transition probability from i to j by $f_{ij} = \sum_{n=1}^{\infty} f_{ij}^{(n)}$.

State j is recurrent if fjj = 1 and transient otherwise.

Denote the expected number of transitions needed to return to recurrent state j by
$\mu_{jj} = \sum_{n=1}^{\infty} n f_{jj}^{(n)}$.

A recurrent state j is positive recurrent if $\mu_{jj} < \infty$ and null recurrent if $\mu_{jj} = \infty$.


One final definition...
The row vector π is called the stationary distribution for P if:

πP = π.

The stationary distribution is also called the steady-state distribution.


Theorem: Consider an irreducible, aperiodic Markov process.

Then either:

(1) All states are transient or all states are null recurrent. In this case
$p_{ij}^{(n)} \to 0$ as $n \to \infty$ ∀ i, j, and there is no stationary distribution.

or

(2) All states are positive recurrent. In this case
$p_{ij}^{(n)} \to \pi_j$ as $n \to \infty$ ∀ i, j, and $\{\pi_j,\ j = 1, 2, 3, \ldots\}$ is the unique
stationary distribution. π is any row of $\lim_{n\to\infty} P^n$.

3.1.4 A Simple Two-State Example

Consider a Markov process with transition probability matrix:

$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

Call the states 1 and 2.

We will verify many of our claims, and we will calculate the steady-state distribution.

3.1.4.1 Valid Transition Probability Matrix

$$p_{ij} \ge 0 \;\; \forall i, j$$
$$\sum_{j=1}^{2} p_{1j} = 1, \qquad \sum_{j=1}^{2} p_{2j} = 1$$

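3.1.4.2 Chapman-Kolmogorov Theorem (for $P^{(2)}$)

$$P^{(2)} = P \cdot P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$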

3.1.4.3 Communication and Reducibility

Clearly, 1 ↔ 2, so P is irreducible.
3.1.4.4 Periodicity

State 1: d(1) = 2
State 2: d(2) = 2

3.1.4.5 First and Eventual Transition Probabilities

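$$f_{12}^{(1)} = 1, \quad f_{12}^{(n)} = 0 \;\; \forall\, n > 1 \;\Rightarrow\; f_{12} = 1$$
$$f_{21}^{(1)} = 1, \quad f_{21}^{(n)} = 0 \;\; \forall\, n > 1 \;\Rightarrow\; f_{21} = 1$$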

3.1.4.6 Recurrence

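Because $f_{21} = f_{12} = 1$, a process that leaves either state returns to it with probability
one (here $f_{11}^{(2)} = f_{22}^{(2)} = 1$, so $f_{11} = f_{22} = 1$), and both states 1 and 2 are recurrent.

Moreover,

$$\mu_{11} = \sum_{n=1}^{\infty} n f_{11}^{(n)} = 2 < \infty \quad (\text{and similarly } \mu_{22} = 2 < \infty)$$

Hence states 1 and 2 are positive recurrent.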

3.1.4.7 Stationary Probabilities

We will guess and verify.

Let $\pi_1 = .5$, $\pi_2 = .5$ and check $\pi P = \pi$:

$$(.5,\ .5)\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = (.5,\ .5).$$

Hence the stationary probabilities are 0.5 and 0.5.

Note that in this example we cannot get the stationary probabilities by taking $\lim_{n\to\infty} P^n$.
Why?
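The claims of this two-state example can be checked numerically. The following is a minimal NumPy sketch; the variable names are illustrative. It computes the two-step matrix, checks πP = π, and compares a high odd power of P with a high even power.

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
pi = np.array([0.5, 0.5])

# Chapman-Kolmogorov: two-step transition probabilities are P @ P
P2 = P @ P

# Stationarity: pi P = pi
stationary_ok = np.allclose(pi @ P, pi)

# High powers of P: compare an even power with an odd power
P_even = np.linalg.matrix_power(P, 100)
P_odd = np.linalg.matrix_power(P, 101)

print(P2, stationary_ok, P_even, P_odd, sep="\n")
```

Even powers reproduce the identity matrix and odd powers reproduce P, so the sequence of powers alternates rather than settling down, which is consistent with both states having period 2.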

3.1.5 Constructing Markov Processes with Useful Steady-State Distributions

In section 3.1.4 we considered an example of the form, “for a given Markov process, character-
ize its properties.” Interestingly, many important tools arise from the reverse consideration,
“For a given set of properties, find a Markov process with those properties.”
3.1.5.1 Markov Chain Monte Carlo

(e.g., Gibbs sampling)


We want to sample from $f(z) = f(z_1, z_2)$
Initialize (j = 0) using $z_2^{(0)}$
Gibbs iteration j = 1: Draw $z_1^{(1)}$ from $f(z_1 \mid z_2^{(0)})$, draw $z_2^{(1)}$ from $f(z_2 \mid z_1^{(1)})$
Repeat j = 2, 3, ....
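As a concrete illustration of the iteration, here is a minimal sketch that assumes the target is a standard bivariate normal with correlation rho, so that each conditional is N(rho × other, 1 − rho²). The target, the function name, and the burn-in length are assumptions chosen for illustration and are not taken from the text.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_draws, z2_init=0.0, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1 - rho**2)           # conditional standard deviation
    draws = np.empty((n_draws, 2))
    z2 = z2_init                       # initialize using z2^(0)
    for j in range(n_draws):
        z1 = rng.normal(rho * z2, sd)  # draw z1^(j) from f(z1 | z2^(j-1))
        z2 = rng.normal(rho * z1, sd)  # draw z2^(j) from f(z2 | z1^(j))
        draws[j] = z1, z2
    return draws

draws = gibbs_bivariate_normal(rho=0.6, n_draws=20000)
sample_corr = np.corrcoef(draws[1000:].T)[0, 1]  # discard a burn-in of 1000
print(sample_corr)
```

The sequence of draws is itself a Markov process whose stationary distribution is the target, which is why the sample correlation after burn-in should be near rho.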

3.1.5.2 Global Optimization

(e.g., simulated annealing)


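If $\theta_c \notin N(\theta^{(m)})$ then $P(\theta^{(m+1)} = \theta_c \mid \theta^{(m)}) = 0$
If $\theta_c \in N(\theta^{(m)})$ then $P(\theta^{(m+1)} = \theta_c \mid \theta^{(m)}) = \exp\left(\min[0,\ \Delta/T(m)]\right)$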

3.1.6 Variations and Extensions: Regime-Switching and More

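3.1.6.1 Markovian Regime Switching

$$P = \begin{pmatrix} p_{11} & 1 - p_{11} \\ 1 - p_{22} & p_{22} \end{pmatrix}$$

$$s_t \sim P$$

$$y_t = c_{s_t} + \phi_{s_t}\, y_{t-1} + \varepsilon_t$$

$$\varepsilon_t \sim iid\ N(0, \sigma^2_{s_t})$$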

“Markov switching,” or “hidden Markov,” model

Popular model for macroeconomic fundamentals

3.1.6.2 Heterogeneous Markov Processes

$$P_t = \begin{pmatrix} p_{11,t} & p_{12,t} & \cdots \\ p_{21,t} & p_{22,t} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}.$$

e.g., Regime switching with time-varying transition probabilities:

$$s_t \sim P_t$$

$$y_t = c_{s_t} + \phi_{s_t}\, y_{t-1} + \varepsilon_t$$

$$\varepsilon_t \sim iid\ N(0, \sigma^2_{s_t})$$

Business cycle duration dependence: pij,t = gij (t)


Credit migration over the cycle: pij,t = gij (cyclet )
General covariates: pij,t = gij (xt )

3.1.6.3 Semi-Markov Processes

We call semi-Markov a process with transitions governed by P, such that the state durations
(times between transitions) are themselves random variables. The process is not Markov,
because conditioning on not only the current state but also the time spent to date in that
state may be useful for predicting the future, but there is an embedded Markov process.

Key result: The stationary distribution depends only on P and the expected state dura-
tions. Other aspects of the duration distribution are irrelevant.

Links to Diebold-Rudebusch work on duration dependence: If welfare is affected only by
limiting probabilities, then the mean of the duration distribution is the only relevant aspect.
Other aspects, such as spread, existence of duration dependence, etc., are irrelevant.

3.1.6.4 Time-Reversible Processes

Theorem: If $\{X_t\}$ is a stationary Markov process with transition probabilities $p_{ij}$ and
stationary probabilities $\pi_i$, then the reversed process is also Markov with transition probabilities
$$p^*_{ij} = \frac{\pi_j}{\pi_i}\, p_{ji}.$$

In general, $p^*_{ij} \ne p_{ij}$. In the special situation $p^*_{ij} = p_{ij}$ (so that $\pi_i p_{ij} = \pi_j p_{ji}$), we say that
the process is time-reversible.

3.1.7 Continuous-State Markov Processes

3.1.7.1 Linear Gaussian State Space Systems

αt = T αt−1 + Rηt

yt = Zαt + εt

ηt ∼ N, εt ∼ N
3.1.7.2 Non-Linear, Non-Gaussian State Space Systems

αt = Q(αt−1 , ηt )

yt = G(αt , εt )

ηt ∼ Dη , εt ∼ Dε

Still Markovian!

3.2 STATE SPACE REPRESENTATIONS

3.2.1 The Basic Framework

Transition Equation

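$$\underset{m\times 1}{\alpha_t} = \underset{m\times m}{T}\;\underset{m\times 1}{\alpha_{t-1}} + \underset{m\times g}{R}\;\underset{g\times 1}{\eta_t}$$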

t = 1, 2, ..., T

Measurement Equation

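$$\underset{1\times 1}{y_t} = \underset{1\times m}{Z}\;\underset{m\times 1}{\alpha_t} + \underset{1\times L}{\Gamma}\;\underset{L\times 1}{w_t} + \underset{1\times 1}{\varepsilon_t}$$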

(This is for univariate y. We’ll do multivariate shortly.)

t = 1, 2, ..., T

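(Important) Details

$$\begin{pmatrix} \eta_t \\ \varepsilon_t \end{pmatrix} \sim WN\left(0,\ \mathrm{diag}\big(\underset{g\times g}{Q},\ \underset{1\times 1}{h}\big)\right)$$

$$E(\alpha_0 \eta_t') = 0_{m\times g}$$

$$E(\alpha_0 \varepsilon_t) = 0_{m\times 1}$$
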
All Together Now

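$$\underset{m\times 1}{\alpha_t} = \underset{m\times m}{T}\;\underset{m\times 1}{\alpha_{t-1}} + \underset{m\times g}{R}\;\underset{g\times 1}{\eta_t}$$

$$\underset{1\times 1}{y_t} = \underset{1\times m}{Z}\;\underset{m\times 1}{\alpha_t} + \underset{1\times L}{\Gamma}\;\underset{L\times 1}{w_t} + \underset{1\times 1}{\varepsilon_t}$$

$$\begin{pmatrix} \eta_t \\ \varepsilon_t \end{pmatrix} \sim WN\left(0,\ \mathrm{diag}\big(\underset{g\times g}{Q},\ \underset{1\times 1}{h}\big)\right)$$

$$E(\alpha_0 \varepsilon_t) = 0_{m\times 1} \qquad E(\alpha_0 \eta_t') = 0_{m\times g}$$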

State Space Representations Are Not Unique


Transform by the nonsingular matrix B.
The original system is:

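$$\underset{m\times 1}{\alpha_t} = \underset{m\times m}{T}\;\underset{m\times 1}{\alpha_{t-1}} + \underset{m\times g}{R}\;\underset{g\times 1}{\eta_t}$$

$$\underset{1\times 1}{y_t} = \underset{1\times m}{Z}\;\underset{m\times 1}{\alpha_t} + \underset{1\times L}{\Gamma}\;\underset{L\times 1}{w_t} + \underset{1\times 1}{\varepsilon_t}$$

Rewrite the system in two steps.

First, write it as:

$$\underset{m\times 1}{\alpha_t} = \underset{m\times m}{T}\;\underset{m\times m}{B^{-1}}\;\underset{m\times m}{B}\;\underset{m\times 1}{\alpha_{t-1}} + \underset{m\times g}{R}\;\underset{g\times 1}{\eta_t}$$

$$\underset{1\times 1}{y_t} = \underset{1\times m}{Z}\;\underset{m\times m}{B^{-1}}\;\underset{m\times m}{B}\;\underset{m\times 1}{\alpha_t} + \underset{1\times L}{\Gamma}\;\underset{L\times 1}{w_t} + \underset{1\times 1}{\varepsilon_t}$$

Second, premultiply the transition equation by B to yield:

$$\underset{m\times 1}{(B\alpha_t)} = \underset{m\times m}{(BTB^{-1})}\;\underset{m\times 1}{(B\alpha_{t-1})} + \underset{m\times g}{(BR)}\;\underset{g\times 1}{\eta_t}$$

$$\underset{1\times 1}{y_t} = \underset{1\times m}{(ZB^{-1})}\;\underset{m\times 1}{(B\alpha_t)} + \underset{1\times L}{\Gamma}\;\underset{L\times 1}{w_t} + \underset{1\times 1}{\varepsilon_t}$$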

(Equivalent State Space Representation)

3.2.2 ARMA Models

State Space Representation of an AR(1)

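$$y_t = \phi\, y_{t-1} + \eta_t$$

$$\eta_t \sim WN(0, \sigma_\eta^2)$$

Already in state space form!

$$\alpha_t = \phi\, \alpha_{t-1} + \eta_t$$

$$y_t = \alpha_t$$

($T = \phi$, $R = 1$, $Z = 1$, $\Gamma = 0$, $Q = \sigma_\eta^2$, $h = 0$)

MA(1)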

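$$y_t = \Theta(L)\varepsilon_t$$

$$\varepsilon_t \sim WN(0, \sigma^2)$$

where

$$\Theta(L) = 1 + \theta_1 L$$

MA(1) in State Space Form

$$y_t = \eta_t + \theta\, \eta_{t-1}$$

$$\eta_t \sim WN(0, \sigma_\eta^2)$$

$$\begin{pmatrix} \alpha_{1t} \\ \alpha_{2t} \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} \alpha_{1,t-1} \\ \alpha_{2,t-1} \end{pmatrix} + \begin{pmatrix} 1 \\ \theta \end{pmatrix}\eta_t$$

$$y_t = (1,\ 0)\, \alpha_t = \alpha_{1t}$$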

MA(1) in State Space Form

Why? Recursive substitution from the bottom up yields:

$$\alpha_t = \begin{pmatrix} y_t \\ \theta\eta_t \end{pmatrix}$$

MA(q)

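$$y_t = \Theta(L)\varepsilon_t$$

$$\varepsilon_t \sim WN(0, \sigma^2)$$

where

$$\Theta(L) = 1 + \theta_1 L + \ldots + \theta_q L^q$$

MA(q) in State Space Form

$$y_t = \eta_t + \theta_1 \eta_{t-1} + \ldots + \theta_q \eta_{t-q}$$

$$\eta_t \sim WN(0, \sigma_\eta^2)$$

$$\begin{pmatrix} \alpha_{1t} \\ \alpha_{2t} \\ \vdots \\ \alpha_{q+1,t} \end{pmatrix}
= \begin{pmatrix} 0_{q\times 1} & I_q \\ 0 & 0' \end{pmatrix}
\begin{pmatrix} \alpha_{1,t-1} \\ \alpha_{2,t-1} \\ \vdots \\ \alpha_{q+1,t-1} \end{pmatrix}
+ \begin{pmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_q \end{pmatrix}\eta_t$$

$$y_t = (1, 0, \ldots, 0)\, \alpha_t = \alpha_{1t}$$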

MA(q) in State Space Form

Recursive substitution from the bottom up yields:

$$\alpha_t \equiv \begin{pmatrix} \theta_q \eta_{t-q} + \ldots + \theta_1 \eta_{t-1} + \eta_t \\ \vdots \\ \theta_q \eta_{t-1} + \theta_{q-1}\eta_t \\ \theta_q \eta_t \end{pmatrix}
= \begin{pmatrix} y_t \\ \vdots \\ \theta_q \eta_{t-1} + \theta_{q-1}\eta_t \\ \theta_q \eta_t \end{pmatrix}$$

AR(p)

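$$\Phi(L) y_t = \varepsilon_t$$

$$\varepsilon_t \sim WN(0, \sigma^2)$$

where

$$\Phi(L) = (1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p)$$

AR(p) in State Space Form

$$y_t = \phi_1 y_{t-1} + \ldots + \phi_p y_{t-p} + \eta_t$$

$$\eta_t \sim WN(0, \sigma_\eta^2)$$

$$\alpha_t = \begin{pmatrix} \alpha_{1t} \\ \alpha_{2t} \\ \vdots \\ \alpha_{pt} \end{pmatrix}
= \begin{pmatrix} \phi_1 & \\ \phi_2 & I_{p-1} \\ \vdots & \\ \phi_p & 0' \end{pmatrix}
\begin{pmatrix} \alpha_{1,t-1} \\ \alpha_{2,t-1} \\ \vdots \\ \alpha_{p,t-1} \end{pmatrix}
+ \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}\eta_t$$

$$y_t = (1, 0, \ldots, 0)\, \alpha_t = \alpha_{1t}$$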

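AR(p) in State Space Form

Recursive substitution from the bottom up yields:

$$\alpha_t = \begin{pmatrix} \alpha_{1t} \\ \vdots \\ \alpha_{p-1,t} \\ \alpha_{pt} \end{pmatrix}
= \begin{pmatrix} \phi_1 \alpha_{1,t-1} + \ldots + \phi_p \alpha_{1,t-p} + \eta_t \\ \vdots \\ \phi_{p-1}\alpha_{1,t-1} + \phi_p \alpha_{1,t-2} \\ \phi_p \alpha_{1,t-1} \end{pmatrix}
= \begin{pmatrix} y_t \\ \vdots \\ \phi_{p-1} y_{t-1} + \phi_p y_{t-2} \\ \phi_p y_{t-1} \end{pmatrix}$$

ARMA(p,q)
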
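$$\Phi(L) y_t = \Theta(L)\varepsilon_t$$
$$\varepsilon_t \sim WN(0, \sigma^2)$$

where
$$\Phi(L) = (1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p)$$

$$\Theta(L) = 1 + \theta_1 L + \ldots + \theta_q L^q$$

ARMA(p,q) in State Space Form

$$y_t = \phi_1 y_{t-1} + \ldots + \phi_p y_{t-p} + \eta_t + \theta_1 \eta_{t-1} + \ldots + \theta_q \eta_{t-q}$$

$$\eta_t \sim WN(0, \sigma_\eta^2)$$

Let m = max(p, q + 1) and write as ARMA(m, m − 1):

$$(\phi_1, \phi_2, \ldots, \phi_m) = (\phi_1, \ldots, \phi_p, 0, \ldots, 0)$$

$$(\theta_1, \theta_2, \ldots, \theta_{m-1}) = (\theta_1, \ldots, \theta_q, 0, \ldots, 0)$$

ARMA(p,q) in State Space Form

$$\alpha_t = \begin{pmatrix} \phi_1 & \\ \phi_2 & I_{m-1} \\ \vdots & \\ \phi_m & 0' \end{pmatrix}\alpha_{t-1}
+ \begin{pmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{m-1} \end{pmatrix}\eta_t$$

$$y_t = (1, 0, \ldots, 0)\, \alpha_t$$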

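ARMA(p,q) in State Space Form

Recursive substitution from the bottom up yields:

$$\begin{pmatrix} \alpha_{1t} \\ \vdots \\ \alpha_{m-1,t} \\ \alpha_{mt} \end{pmatrix}
= \begin{pmatrix} \phi_1 \alpha_{1,t-1} + \ldots + \phi_p \alpha_{1,t-p} + \eta_t + \theta_1 \eta_{t-1} + \ldots + \theta_q \eta_{t-q} \\ \vdots \\ \phi_{m-1}\alpha_{1,t-1} + \alpha_{m,t-1} + \theta_{m-2}\eta_t \\ \phi_m \alpha_{1,t-1} + \theta_{m-1}\eta_t \end{pmatrix}
= \begin{pmatrix} y_t \\ \vdots \\ \phi_{m-1} y_{t-1} + \phi_m y_{t-2} + \theta_{m-1}\eta_{t-1} + \theta_{m-2}\eta_t \\ \phi_m y_{t-1} + \theta_{m-1}\eta_t \end{pmatrix}$$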
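One way to see that this state space form reproduces the ARMA recursion is to build the system matrices and feed the same shocks through both. The sketch below is illustrative only: the helper name `arma_state_space`, the parameter values, and the zero initial conditions are assumptions.

```python
import numpy as np

def arma_state_space(phi, theta):
    """Build T, R, Z for an ARMA(p,q) with m = max(p, q + 1) states."""
    p, q = len(phi), len(theta)
    m = max(p, q + 1)
    phi_pad = np.r_[phi, np.zeros(m - p)]            # (phi_1, ..., phi_p, 0, ..., 0)
    theta_pad = np.r_[theta, np.zeros(m - 1 - q)]    # (theta_1, ..., theta_q, 0, ..., 0)
    T = np.zeros((m, m))
    T[:, 0] = phi_pad                                # first column holds the AR coefficients
    T[:-1, 1:] = np.eye(m - 1)                       # identity block I_{m-1}
    R = np.r_[1.0, theta_pad]
    Z = np.r_[1.0, np.zeros(m - 1)]
    return T, R, Z

phi, theta = [0.5, -0.2], [0.4]                      # hypothetical ARMA(2,1)
T, R, Z = arma_state_space(phi, theta)

rng = np.random.default_rng(0)
n = 200
eta = rng.standard_normal(n)

# State space simulation starting from alpha_0 = 0
alpha = np.zeros(len(R))
y_ss = np.empty(n)
for t in range(n):
    alpha = T @ alpha + R * eta[t]
    y_ss[t] = Z @ alpha

# Direct ARMA recursion with zero pre-sample values
y = np.zeros(n)
for t in range(n):
    ar = sum(phi[i] * y[t - 1 - i] for i in range(len(phi)) if t - 1 - i >= 0)
    ma = sum(theta[j] * eta[t - 1 - j] for j in range(len(theta)) if t - 1 - j >= 0)
    y[t] = ar + eta[t] + ma

print(np.max(np.abs(y_ss - y)))
```

The two simulated paths should coincide up to floating-point error, since a zero initial state corresponds to zero pre-sample values of y and η.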

Multivariate State Space


(Same framework, N > 1 observables)

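$$\underset{m\times 1}{\alpha_t} = \underset{m\times m}{T}\;\underset{m\times 1}{\alpha_{t-1}} + \underset{m\times g}{R}\;\underset{g\times 1}{\eta_t}$$

$$\underset{N\times 1}{y_t} = \underset{N\times m}{Z}\;\underset{m\times 1}{\alpha_t} + \underset{N\times L}{\Gamma}\;\underset{L\times 1}{W_t} + \underset{N\times 1}{\varepsilon_t}$$

$$\begin{pmatrix} \eta_t \\ \varepsilon_t \end{pmatrix} \sim WN\left(0,\ \mathrm{diag}\big(\underset{g\times g}{Q},\ \underset{N\times N}{H}\big)\right)$$

$$E(\alpha_0 \eta_t') = 0_{m\times g} \qquad E(\alpha_0 \varepsilon_t') = 0_{m\times N}$$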

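N-Variable VAR(p)

$$\underset{N\times 1}{y_t} = \underset{N\times N}{\Phi_1}\, y_{t-1} + \ldots + \Phi_p\, y_{t-p} + \underset{N\times 1}{\eta_t}$$

$$\eta_t \sim WN(0, \Sigma)$$

State Space Representation

$$\underset{Np\times 1}{\begin{pmatrix} \alpha_{1t} \\ \alpha_{2t} \\ \vdots \\ \alpha_{pt} \end{pmatrix}}
= \underset{Np\times Np}{\begin{pmatrix} \Phi_1 & \\ \Phi_2 & I_{N(p-1)} \\ \vdots & \\ \Phi_p & 0' \end{pmatrix}}
\underset{Np\times 1}{\begin{pmatrix} \alpha_{1,t-1} \\ \alpha_{2,t-1} \\ \vdots \\ \alpha_{p,t-1} \end{pmatrix}}
+ \underset{Np\times N}{\begin{pmatrix} I_N \\ 0_{N\times N} \\ \vdots \\ 0_{N\times N} \end{pmatrix}}\eta_t$$

$$\underset{N\times 1}{y_t} = \underset{N\times Np}{(I_N, 0_N, \ldots, 0_N)}\;\underset{Np\times 1}{\alpha_t}$$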

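Multivariate ARMA(p,q)

$$\underset{N\times 1}{y_t} = \underset{N\times N}{\Phi_1}\, y_{t-1} + \ldots + \underset{N\times N}{\Phi_p}\, y_{t-p}
+ \eta_t + \underset{N\times N}{\Theta_1}\, \eta_{t-1} + \ldots + \underset{N\times N}{\Theta_q}\, \eta_{t-q}$$

$$\eta_t \sim WN(0, \Sigma)$$

Multivariate ARMA(p,q)

$$\underset{Nm\times 1}{\alpha_t} = \begin{pmatrix} \Phi_1 & \\ \Phi_2 & I_{N(m-1)} \\ \vdots & \\ \Phi_m & 0_{N\times N(m-1)} \end{pmatrix}\alpha_{t-1}
+ \begin{pmatrix} I \\ \Theta_1 \\ \vdots \\ \Theta_{m-1} \end{pmatrix}\eta_t$$

$$y_t = (I, 0, \ldots, 0)\, \alpha_t = \alpha_{1t}$$

where m = max(p, q + 1)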

3.2.3 Linear Regression with Time-Varying Parameters and More

Linear Regression Model, I


Transition: Irrelevant
Measurement:

$$y_t = \beta' x_t + \varepsilon_t$$

(Just a measurement equation with exogenous variables)

($T = 0$, $R = 0$, $Z = 0$, $\gamma = \beta'$, $W_t = x_t$, $H = \sigma_\varepsilon^2$)
Linear Regression Model, II
Transition:

αt = αt−1

Measurement:

$$y_t = x_t' \alpha_t + \varepsilon_t$$

($T = I$, $R = 0$, $Z_t = x_t'$, $\gamma = 0$, $H = \sigma_\varepsilon^2$)
Note the time-varying system matrix.
Linear Regression with ARMA(p,q) Disturbances

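$$y_t = \beta x_t + u_t$$

$$u_t = \phi_1 u_{t-1} + \ldots + \phi_p u_{t-p} + \eta_t + \theta_1 \eta_{t-1} + \ldots + \theta_q \eta_{t-q}$$

$$\alpha_t = \begin{pmatrix} \phi_1 & \\ \phi_2 & I_{m-1} \\ \vdots & \\ \phi_m & 0' \end{pmatrix}\alpha_{t-1}
+ \begin{pmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{m-1} \end{pmatrix}\eta_t$$

$$y_t = (1, 0, \ldots, 0)\, \alpha_t + \beta x_t$$

where m = max(p, q + 1)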

Linear Regression with Time-Varying Coefficients


Transition:

αt = φ αt−1 + ηt

Measurement:

$$y_t = x_t' \alpha_t + \varepsilon_t$$

($T = \phi$, $R = I$, $Q = \mathrm{cov}(\eta_t)$, $Z_t = x_t'$, $\gamma = 0$, $H = \sigma_\varepsilon^2$)


– Gradual evolution of tastes, technologies and institutions
– Lucas critique
– Stationary or non-stationary

3.2.3.1 Simultaneous Equations

N-Variable Dynamic SEM

Structure:

$$\phi_0\, y_t = \phi_1 y_{t-1} + \ldots + \phi_p y_{t-p} + P\eta_t$$

$$\eta_t \sim WN(0, I)$$

Reduced form:

$$y_t = \phi_0^{-1}\phi_1\, y_{t-1} + \ldots + \phi_0^{-1}\phi_p\, y_{t-p} + \phi_0^{-1}P\,\eta_t$$

Assume that the system is identified.


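SEM State Space Representation

$$\begin{pmatrix} \alpha_{1t} \\ \alpha_{2t} \\ \vdots \\ \alpha_{pt} \end{pmatrix}
= \begin{pmatrix} \phi_0^{-1}\phi_1 & \\ \phi_0^{-1}\phi_2 & I \\ \vdots & \\ \phi_0^{-1}\phi_p & 0' \end{pmatrix}
\begin{pmatrix} \alpha_{1,t-1} \\ \alpha_{2,t-1} \\ \vdots \\ \alpha_{p,t-1} \end{pmatrix}
+ \begin{pmatrix} \phi_0^{-1}P \\ 0 \\ \vdots \\ 0 \end{pmatrix}
\begin{pmatrix} \eta_{1t} \\ \vdots \\ \eta_{Nt} \end{pmatrix}$$

$$\underset{N\times 1}{y_t} = \underset{N\times Np}{(I_N, 0_N, \ldots, 0_N)}\;\underset{Np\times 1}{\alpha_t}$$
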

3.2.4 Dynamic Factor Models

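Dynamic Factor Model – Single AR(1) factor

(White noise idiosyncratic factors uncorrelated with each other
and uncorrelated with the factor at all leads and lags...)

$$\begin{pmatrix} y_{1t} \\ \vdots \\ y_{Nt} \end{pmatrix}
= \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_N \end{pmatrix}
+ \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_N \end{pmatrix} F_t
+ \begin{pmatrix} \varepsilon_{1t} \\ \vdots \\ \varepsilon_{Nt} \end{pmatrix}$$

$$F_t = \phi F_{t-1} + \eta_t$$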

Already in state-space form!


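Dynamic Factor Model – Single ARMA(p,q) Factor

$$\begin{pmatrix} y_{1t} \\ \vdots \\ y_{Nt} \end{pmatrix}
= \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_N \end{pmatrix}
+ \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_N \end{pmatrix} F_t
+ \begin{pmatrix} \varepsilon_{1t} \\ \vdots \\ \varepsilon_{Nt} \end{pmatrix}$$

$$\Phi(L)\, F_t = \Theta(L)\, \eta_t$$

Dynamic Factor Model – Single ARMA(p,q) Factor

State vector for F is state vector for system:

$$\alpha_t = \begin{pmatrix} \phi_1 & \\ \phi_2 & I_{m-1} \\ \vdots & \\ \phi_m & 0' \end{pmatrix}\alpha_{t-1}
+ \begin{pmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{m-1} \end{pmatrix}\eta_t$$

Dynamic Factor Model – Single ARMA(p,q) Factor

System measurement equation is then:

$$\begin{pmatrix} y_{1t} \\ \vdots \\ y_{Nt} \end{pmatrix}
= \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_N \end{pmatrix}
+ \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_N \end{pmatrix}(1, 0, \ldots, 0)\, \alpha_t
+ \begin{pmatrix} \varepsilon_{1t} \\ \vdots \\ \varepsilon_{Nt} \end{pmatrix}$$

$$= \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_N \end{pmatrix}
+ \begin{pmatrix} \lambda_1 & 0 & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ \lambda_N & 0 & \ldots & 0 \end{pmatrix}\alpha_t
+ \begin{pmatrix} \varepsilon_{1t} \\ \vdots \\ \varepsilon_{Nt} \end{pmatrix}$$
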

3.2.5 Unobserved-Components Models

– Separate components for separate features


– Signal extraction
– Trends and detrending
– Seasonality and seasonal adjustment
– Permanent-transitory decompositions
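
Cycle (“Signal”) + Noise Model

$$x_t = \phi\, x_{t-1} + \eta_t$$

$$y_t = x_t + \varepsilon_t$$

$$\begin{pmatrix} \varepsilon_t \\ \eta_t \end{pmatrix} \sim WN\left(0,\ \begin{pmatrix} \sigma_\varepsilon^2 & 0 \\ 0 & \sigma_\eta^2 \end{pmatrix}\right)$$

($\alpha_t = x_t$, $T = \phi$, $R = 1$, $Z = 1$, $\gamma = 0$, $Q = \sigma_\eta^2$, $H = \sigma_\varepsilon^2$)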


Cycle + Seasonal + Noise

yt = ct + st + εt

ct = φ ct−1 + ηct

st = γ st−4 + ηst

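Cycle + Seasonal + Noise

Transition equations for the cycle and seasonal:

$$\alpha_{ct} = \phi\, \alpha_{c,t-1} + \eta_{ct}$$

$$\alpha_{st} = \begin{pmatrix} 0 & \\ 0 & I_3 \\ 0 & \\ \gamma & 0' \end{pmatrix}\alpha_{s,t-1}
+ \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}\eta_{st}$$

Cycle + Seasonal + Noise

Stacking transition equations gives the grand transition equation:

$$\begin{pmatrix} \alpha_{st} \\ \alpha_{ct} \end{pmatrix}
= \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ \gamma & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & \phi \end{pmatrix}
\begin{pmatrix} \alpha_{s,t-1} \\ \alpha_{c,t-1} \end{pmatrix}
+ \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} \eta_{st} \\ \eta_{ct} \end{pmatrix}$$

Finally, the measurement equation is:

$$y_t = (1, 0, 0, 0, 1)\begin{pmatrix} \alpha_{st} \\ \alpha_{ct} \end{pmatrix} + \varepsilon_t$$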

Are there any Important/Useful Linear Models for which Linear State-Space
Representations do not Exist?

– There are none. Well, almost none.
(By “existence”, we mean a finite-dimensional state vector.)

– One exception: long-memory processes. Specified directly as AR(∞) and hence do not
have finite-order Markovian structure.

Simplest version (“pure long-memory process”):

$$(1 - L)^d\, y_t = \varepsilon_t, \quad d \in [-1/2,\ 1/2]$$

where
$$(1 - L)^d = 1 - dL + \frac{d(d-1)}{2!}L^2 - \frac{d(d-1)(d-2)}{3!}L^3 + \ldots$$
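The coefficients of this expansion satisfy a simple recursion, c_0 = 1 and c_k = c_{k−1}(k − 1 − d)/k, which follows from the binomial series. The sketch below uses that recursion; the function name and the value d = 0.4 are illustrative assumptions.

```python
import numpy as np

def frac_diff_coefs(d, n_terms):
    """Coefficients c_k in (1 - L)^d = sum_k c_k L^k, via c_k = c_{k-1} (k - 1 - d) / k."""
    coefs = np.empty(n_terms)
    coefs[0] = 1.0
    for k in range(1, n_terms):
        coefs[k] = coefs[k - 1] * (k - 1 - d) / k
    return coefs

d = 0.4
c = frac_diff_coefs(d, 200)
print(c[:4])       # compare with 1, -d, d(d-1)/2!, -d(d-1)(d-2)/3!
print(c[[10, 100]])  # coefficients die out slowly, not geometrically
```

The slow, hyperbolic decay of the coefficients is what rules out an exact finite-dimensional state vector for this process.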

3.3 THE KALMAN FILTER AND SMOOTHER

State Space Representation

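$$\underset{m\times 1}{\alpha_t} = \underset{m\times m}{T}\;\underset{m\times 1}{\alpha_{t-1}} + \underset{m\times g}{R}\;\underset{g\times 1}{\eta_t}$$

$$\underset{N\times 1}{y_t} = \underset{N\times m}{Z}\;\underset{m\times 1}{\alpha_t} + \underset{N\times L}{\gamma}\;\underset{L\times 1}{W_t} + \underset{N\times 1}{\varepsilon_t}$$

$$\begin{pmatrix} \eta_t \\ \varepsilon_t \end{pmatrix} \sim WN\left(0,\ \mathrm{diag}\big(\underset{g\times g}{Q},\ \underset{N\times N}{H}\big)\right)$$

$$E(\alpha_0 \eta_t') = 0_{m\times g}$$

$$E(\alpha_0 \varepsilon_t') = 0_{m\times N}$$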

The filtering “thought experiment”. Prediction and updating. “Online”, “ex ante”, using
only real-time available data ỹt
to extract αt and predict αt+1 .

3.3.1 Statement(s) of the Kalman Filter

I. Initial state estimate and MSE

$$a_0 = E(\alpha_0)$$

$$P_0 = E(\alpha_0 - a_0)(\alpha_0 - a_0)'$$

Statement of the Kalman Filter

II. Prediction Recursions

$$a_{t/t-1} = T a_{t-1}$$

$$P_{t/t-1} = T P_{t-1} T' + R Q R'$$

III. Updating Recursions

$$a_t = a_{t/t-1} + P_{t/t-1} Z' F_t^{-1} (y_t - Z a_{t/t-1} - \gamma W_t)$$

(where $F_t = Z P_{t/t-1} Z' + H$)

$$P_t = P_{t/t-1} - P_{t/t-1} Z' F_t^{-1} Z P_{t/t-1}$$

t = 1, ..., T
State-Space in Density Form (Assuming Normality)

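$$\alpha_t \mid \alpha_{t-1} \sim N(T\alpha_{t-1},\ RQR')$$

$$y_t \mid \alpha_t \sim N(Z\alpha_t,\ H)$$

Kalman Filter in Density Form (Assuming Normality)

Initialize at $a_0$, $P_0$
State prediction:
$\alpha_t \mid \tilde{y}_{t-1} \sim N(a_{t/t-1},\ P_{t/t-1})$
$a_{t/t-1} = T a_{t-1}$
$P_{t/t-1} = T P_{t-1} T' + RQR'$
Data prediction:
$y_t \mid \tilde{y}_{t-1} \sim N(Z a_{t/t-1},\ F_t)$
Update:
$\alpha_t \mid \tilde{y}_t \sim N(a_t,\ P_t)$
$a_t = a_{t/t-1} + K_t (y_t - Z a_{t/t-1})$
$P_t = P_{t/t-1} - K_t Z P_{t/t-1}$
where $\tilde{y}_t = \{y_1, \ldots, y_t\}$ and $K_t = P_{t/t-1} Z' F_t^{-1}$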
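The prediction and updating recursions translate almost line for line into code. The following is a minimal sketch, not a production implementation: it omits the exogenous term γW_t, uses an explicit matrix inverse for readability, and is exercised on a hypothetical cycle-plus-noise model with φ = 0.8 and unit variances. The function name `kalman_filter` and all parameter values are assumptions for illustration.

```python
import numpy as np

def kalman_filter(y, T, R, Z, Q, H, a0, P0):
    """Prediction and updating recursions (no exogenous W_t term).

    y has shape (n_obs, N); returns the filtered states a_t and the final P_t.
    """
    a, P = a0, P0
    filtered = []
    for yt in y:
        # Prediction step
        a_pred = T @ a
        P_pred = T @ P @ T.T + R @ Q @ R.T
        # One-step prediction error v_t and its covariance F_t
        v = yt - Z @ a_pred
        F = Z @ P_pred @ Z.T + H
        # Updating step with gain K_t = P_{t/t-1} Z' F_t^{-1}
        K = P_pred @ Z.T @ np.linalg.inv(F)
        a = a_pred + K @ v
        P = P_pred - K @ Z @ P_pred
        filtered.append(a)
    return np.array(filtered), P

# Hypothetical example: x_t = phi x_{t-1} + eta_t, y_t = x_t + eps_t
phi = 0.8
T = np.array([[phi]])
R, Z, Q, H = np.eye(1), np.eye(1), np.eye(1), np.eye(1)

rng = np.random.default_rng(0)
n = 500
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()
y = (x + rng.standard_normal(n)).reshape(-1, 1)

a0 = np.zeros(1)
P0 = np.array([[1.0 / (1 - phi**2)]])   # stationary variance of the state
a_filt, P_last = kalman_filter(y, T, R, Z, Q, H, a0, P0)

mse_filtered = np.mean((a_filt[:, 0] - x) ** 2)
mse_raw = np.mean((y[:, 0] - x) ** 2)
print(mse_filtered, mse_raw, P_last)
```

In this example the filtered state should track the unobserved x_t more closely than the raw observations do, and P_t settles quickly to a constant because its recursion does not involve the data.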

3.3.2 Derivation of the Kalman Filter

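Useful Result 1: Conditional Expectation is MMSE Extraction Under Normality

Suppose that
$$\begin{pmatrix} x \\ y \end{pmatrix} \sim N(\mu,\ \Sigma)$$
where x is unobserved and y is observed.
Then
$$E(x \mid y) = \underset{\hat{x}(y)}{\mathrm{argmin}} \int\!\!\int (x - \hat{x}(y))^2 f(x, y)\, dx\, dy$$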

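Useful Result 2: Properties of the Multivariate Normal

$$\begin{pmatrix} x \\ y \end{pmatrix} \sim N(\mu,\ \Sigma), \qquad \mu = (\mu_x,\ \mu_y)', \qquad
\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}$$

$$\Longrightarrow\; x \mid y \sim N(\mu_{x|y},\ \Sigma_{x|y})$$

$$\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$$

$$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$$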

Constructive Derivation of the Kalman Filter Under Normality

Let $E_t(\cdot) \equiv E(\cdot \mid \Omega_t)$, where $\Omega_t \equiv \{y_1, \ldots, y_t\}$.

Time 0 “update”:

$$a_0 = E_0(\alpha_0) = E(\alpha_0)$$

$$P_0 = \mathrm{var}_0(\alpha_0) = E[(\alpha_0 - a_0)(\alpha_0 - a_0)']$$

Derivation of the Kalman Filter, Continued...


Time 0 prediction
At time 1 we know that:

α1 = T α0 + Rη1

Now take expectations conditional on time-0 information:

E0 (α1 ) = T E0 (α0 ) + RE0 (η1 )

= T a0

= a1/0

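Derivation of the Kalman Filter, Continued...

$$\begin{aligned}
P_{1/0} &= E_0\left[(\alpha_1 - a_{1/0})(\alpha_1 - a_{1/0})'\right] \\
(\text{subst. } a_{1/0}) \quad &= E_0\left[(\alpha_1 - T a_0)(\alpha_1 - T a_0)'\right] \\
(\text{subst. } \alpha_1) \quad &= E_0\left[(T(\alpha_0 - a_0) + R\eta_1)(T(\alpha_0 - a_0) + R\eta_1)'\right] \\
&= T P_0 T' + R Q R'
\end{aligned}$$

(using $E(\alpha_0 \eta_t') = 0\ \forall t$)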

Derivation of the Kalman Filter, Continued...


Time 1 updating
We will derive the distribution of:
$$\begin{pmatrix} \alpha_1 \\ y_1 \end{pmatrix} \Big|\; \Omega_0$$

and then convert to

$$\alpha_1 \mid (\Omega_0 \cup y_1)$$

or

$$\alpha_1 \mid \Omega_1$$

Derivation of the Kalman Filter, Continued...


Means:

E0 (α1 ) = a1/0

E0 (y1 ) = Za1/0 + γW1

Derivation of the Kalman Filter, Continued...


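Variance-Covariance Matrix:

$$\mathrm{var}_0(\alpha_1) = E_0\left[(\alpha_1 - a_{1/0})(\alpha_1 - a_{1/0})'\right] = P_{1/0}$$

$$\begin{aligned}
\mathrm{var}_0(y_1) &= E_0\left[(y_1 - Z a_{1/0} - \gamma W_1)(y_1 - Z a_{1/0} - \gamma W_1)'\right] \\
&= E_0\left[(Z(\alpha_1 - a_{1/0}) + \varepsilon_1)(Z(\alpha_1 - a_{1/0}) + \varepsilon_1)'\right] \\
&= Z P_{1/0} Z' + H \quad (\text{using } \varepsilon \perp \eta)
\end{aligned}$$

$$\begin{aligned}
\mathrm{cov}_0(\alpha_1, y_1) &= E_0\left[(\alpha_1 - a_{1/0})(Z(\alpha_1 - a_{1/0}) + \varepsilon_1)'\right] \\
&= P_{1/0} Z' \quad (\text{using } \varepsilon \perp \eta)
\end{aligned}$$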

Derivation of the Kalman Filter, Continued...


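Hence:

$$\begin{pmatrix} \alpha_1 \\ y_1 \end{pmatrix} \Big|\; \Omega_0 \sim
N\left(\begin{pmatrix} a_{1/0} \\ Z a_{1/0} + \gamma W_1 \end{pmatrix},\
\begin{pmatrix} P_{1/0} & P_{1/0} Z' \\ Z P_{1/0} & Z P_{1/0} Z' + H \end{pmatrix}\right)$$

Now by Result 2, $\alpha_1 \mid \Omega_0 \cup y_1 \sim N(a_1, P_1)$

$$a_1 = a_{1/0} + P_{1/0} Z' F_1^{-1} (y_1 - Z a_{1/0} - \gamma W_1)$$

$$P_1 = P_{1/0} - P_{1/0} Z' F_1^{-1} Z P_{1/0}$$

($F_1 = Z P_{1/0} Z' + H$)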

Repeating yields the Kalman filter.


What Have We Done?
Under normality,
we proved that the Kalman filter delivers
MVU predictions and extractions.
Dropping normality,
one can also prove that the Kalman filter delivers
BLU predictions and extractions.

3.3.3 Calculating P0

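Treatment of Initial Covariance Matrix: $P_0 = \Gamma(0)$ (Covariance stationary case: All
eigenvalues of T inside |z| = 1)

$$\alpha_t = T\alpha_{t-1} + R\eta_t$$

$$\Longrightarrow\; \Gamma_\alpha(0) = E(T\alpha_{t-1} + R\eta_t)(T\alpha_{t-1} + R\eta_t)' = T\,\Gamma_\alpha(0)\,T' + RQR'$$

$$\Longrightarrow\; P_0 = T P_0 T' + RQR'$$

$$\Longrightarrow\; \mathrm{vec}(P_0) = \mathrm{vec}(T P_0 T') + \mathrm{vec}(RQR') = (T \otimes T)\,\mathrm{vec}(P_0) + \mathrm{vec}(RQR')$$

$$\Longrightarrow\; \mathrm{vec}(P_0) = (I - (T \otimes T))^{-1}\,\mathrm{vec}(RQR')$$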
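The vec formula can be applied directly with a Kronecker product. The sketch below assumes a hypothetical two-state system (the ARMA(2,1) matrices with φ = (0.5, −0.2), θ = 0.4, Q = 1); the function name is illustrative. Note that vec stacks columns, hence the Fortran-order reshapes.

```python
import numpy as np

def stationary_state_cov(T, R, Q):
    """Solve vec(P0) = (I - T kron T)^{-1} vec(R Q R') for P0."""
    m = T.shape[0]
    RQR = R @ Q @ R.T
    # vec() stacks columns, so flatten and rebuild in column-major (Fortran) order
    vec_P0 = np.linalg.solve(np.eye(m * m) - np.kron(T, T),
                             RQR.reshape(-1, order="F"))
    return vec_P0.reshape(m, m, order="F")

# Hypothetical stationary system: eigenvalues of T lie inside the unit circle
T = np.array([[0.5, 1.0],
              [-0.2, 0.0]])
R = np.array([[1.0],
              [0.4]])
Q = np.array([[1.0]])

P0 = stationary_state_cov(T, R, Q)
residual = P0 - (T @ P0 @ T.T + R @ Q @ R.T)
print(P0)
print(np.max(np.abs(residual)))
```

The residual of the fixed-point equation P0 = T P0 T' + RQR' should be zero up to rounding, and P0 should be symmetric.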

3.3.4 Predicting yt

Point prediction:

yt/t−1 = Zat/t−1 + γWt

Prediction error:

vt = yt − (Zat/t−1 + γWt )

Density Prediction of $y_t$

$$y_t \mid \Omega_{t-1} \sim N(y_{t/t-1},\ F_t)$$

or equivalently

$$v_t \mid \Omega_{t-1} \sim N(0,\ F_t)$$

$$E_{t-1} v_t = E_{t-1}[y_t - (Z a_{t/t-1} + \gamma W_t)] = E_{t-1}[Z(\alpha_t - a_{t/t-1}) + \varepsilon_t] = 0$$

$$E_{t-1} v_t v_t' = E_{t-1}[Z(\alpha_t - a_{t/t-1}) + \varepsilon_t][Z(\alpha_t - a_{t/t-1}) + \varepsilon_t]'
= Z P_{t/t-1} Z' + H \equiv F_t$$

Normality follows from linearity of all transformations.

3.3.4.1 Combining State Vector Prediction and Updating

(For notational convenience we now drop Wt )


(1) Prediction: $a_{t+1/t} = T a_t$
(2) Update: $a_t = a_{t/t-1} + P_{t/t-1} Z' F_t^{-1} (y_t - Z a_{t/t-1}) = a_{t/t-1} + K_t v_t$
where

$$K_t = P_{t/t-1} Z' F_t^{-1}$$

Substituting (2) into (1):

$$a_{t+1/t} = T a_{t/t-1} + T K_t v_t$$

3.3.5 Steady State and the Innovations Representation

Recall the “Two-Shock” State Space Representation

αt = T αt−1 + Rηt

yt = Zαt + εt

$$E(\eta_t \eta_t') = Q$$

$$E(\varepsilon_t \varepsilon_t') = H$$

(Nothing new)

3.3.5.1 Combining Covariance Matrix Prediction and Updating

(1) Prediction: $P_{t+1/t} = T P_t T' + RQR'$

(2) Update: $P_t = P_{t/t-1} - K_t Z P_{t/t-1}$

Substitute (2) into (1):

$$P_{t+1/t} = T P_{t/t-1} T' - T K_t Z P_{t/t-1} T' + RQR'$$

(Matrix Riccati equation)


Why Care About Combining Prediction and Updating?
It leads us to the notion of steady state of the Kalman filter...
...which is the bridge from the Wold representation
to the state space representation

3.3.5.2 “One-Shock” (“Prediction Error”) Representation

We have seen that

at+1|t = T at|t−1 + T Kt vt (transition)

Moreover, it is tautologically true that

yt = Z at|t−1 + (yt − Zat|t−1 )

= Z at|t−1 + vt (measurement)
Note that one-shock state space representation
has time-varying system matrices:

• “R matrix” in transition equation is T Kt

• Covariance matrix of vt is Ft

3.3.5.3 “Innovations” (Steady-State) Representation

$$a_{t+1|t} = T a_{t|t-1} + T \bar{K} v_t$$

$$y_t = Z a_{t|t-1} + v_t$$

where

$$\bar{K} = \bar{P} Z' \bar{F}^{-1}$$

$$E(v_t v_t') = \bar{F} = Z \bar{P} Z' + H$$

$\bar{P}$ solves the matrix Riccati equation


– Effectively Wold-Wiener-Kolmogorov prediction and extraction
– Prediction yt+1/t is now the projection of yt+1 on infinite past, and one-step prediction
errors vt are now the Wold-Wiener-Kolmogorov innovations
Remarks on the Steady State

1. Steady state will be approached if:

• underlying two-shock system is time invariant

• all eigenvalues of T are less than one in modulus

• $P_{1|0}$ is positive semidefinite

2. Because the recursions for $P_{t|t-1}$ and $K_t$ don’t depend on the data, but only on $P_0$,
we can calculate arbitrarily close approximations to $\bar{P}$ and $\bar{K}$ by letting the filter run.
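Letting the filter run amounts to iterating the matrix Riccati recursion on the prediction covariance until it stops changing. The sketch below is one simple way to do that, assuming a hypothetical scalar cycle-plus-noise system (T = 0.8, R = Z = Q = H = 1); the function name, tolerance, and iteration cap are illustrative choices.

```python
import numpy as np

def steady_state(T, R, Z, Q, H, P_init, tol=1e-12, max_iter=10_000):
    """Iterate P_{t+1|t} = T P T' - T K Z P T' + R Q R' to a fixed point."""
    P_pred = P_init
    for _ in range(max_iter):
        F = Z @ P_pred @ Z.T + H
        K = P_pred @ Z.T @ np.linalg.inv(F)
        P_next = T @ P_pred @ T.T - T @ K @ Z @ P_pred @ T.T + R @ Q @ R.T
        converged = np.max(np.abs(P_next - P_pred)) < tol
        P_pred = P_next
        if converged:
            break
    else:
        raise RuntimeError("Riccati iteration did not converge")
    # Steady-state prediction-error covariance and gain
    F_bar = Z @ P_pred @ Z.T + H
    K_bar = P_pred @ Z.T @ np.linalg.inv(F_bar)
    return P_pred, K_bar, F_bar

T = np.array([[0.8]])
R, Z, Q, H = np.eye(1), np.eye(1), np.eye(1), np.eye(1)
P_init = np.array([[1.0 / (1 - 0.8**2)]])   # start from the stationary state variance

P_bar, K_bar, F_bar = steady_state(T, R, Z, Q, H, P_init)
print(P_bar, K_bar, F_bar)
```

In this scalar case the fixed point can also be found by hand: it solves P² − 0.64P − 1 = 0, which gives a check on the iteration.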

3.3.6 Kalman Smoothing

The smoothing thought experiment. “Offline”, “ex post”, using all data ỹT
to extract αt and predict αt+1 .
1. (Kalman) filter forward through the sample, t = 1, ..., T
2. Smooth backward, t = T, (T − 1), (T − 2), ..., 1

Initialize: $a_{T,T} = a_T$, $P_{T,T} = P_T$

Then:

$$a_{t,T} = a_t + J_t (a_{t+1,T} - a_{t+1,t})$$

$$P_{t,T} = P_t + J_t (P_{t+1,T} - P_{t+1,t}) J_t'$$

where
$$J_t = P_t T' P_{t+1,t}^{-1}$$

3.4 EXERCISES, PROBLEMS AND COMPLEMENTS

1. Markov process theory.


Prove the following for Markov processes.

(a) Communication is an equivalence relation; i.e., i ↔ i (reflexive), i ↔ j ⇐⇒ j ↔ i
(symmetric), and i ↔ j, j ↔ k =⇒ i ↔ k (transitive).
(b) Any two classes are either disjoint or identical.
(c) Periodicity is a class property.
(d) Let $f_{ij}$ denote the probability of an eventual transition from i to j. Then
$f_{ij} = \sum_{n=1}^{\infty} f_{ij}^{(n)}$.

(e) The expected number of returns to a recurrent state is infinite, and the expected
number of returns to a transient state is finite. That is,

$$\text{State } j \text{ is recurrent} \iff \sum_{n=1}^{\infty} P_{jj}^{n} = \infty,$$
$$\text{State } j \text{ is transient} \iff \sum_{n=1}^{\infty} P_{jj}^{n} < \infty.$$

(f) Recurrence is a class property. That is, if i is recurrent and i ↔ j, then j is
recurrent.
(g) The expected number of transitions into a transient state is finite. That is,

$$j \text{ transient} \;\Longrightarrow\; \sum_{n=1}^{\infty} P_{ij}^{n} < \infty, \;\; \forall\, i.$$

(h) Positive and null recurrence are class properties.


(i) If the probability distribution of X0 (call it πj = P (X0 = j), j ≥ 1) is a stationary
distribution, then P (Xt = j) = πj , ∀t. Moreover, the process is stationary.
(j) A stationary Markov process is time-reversible iff each path from i to i has the
same probability as the reversed path, ∀i.

2. A Simple Markov Process.


Consider the Markov process:
$$P = \begin{pmatrix} .9 & .1 \\ .3 & .7 \end{pmatrix}$$

Verify and/or calculate:

(a) Validity of the transition probability matrix


(b) Chapman-Kolmogorov theorem
(c) Accessibility/communication
(d) Periodicity
(e) Transition times
(f) Recurrence/transience
(g) Stationarity
(h) Time reversibility

Solution:

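(a) Valid transition probability matrix:

$$P_{ij} \ge 0 \;\; \forall i, j$$
$$\sum_{j=1}^{2} P_{1j} = 1, \qquad \sum_{j=1}^{2} P_{2j} = 1$$

(b) Illustrating the Chapman-Kolmogorov Theorem:

$$P^3 = \begin{pmatrix} .9 & .1 \\ .3 & .7 \end{pmatrix}\begin{pmatrix} .9 & .1 \\ .3 & .7 \end{pmatrix}\begin{pmatrix} .9 & .1 \\ .3 & .7 \end{pmatrix}
= \begin{pmatrix} .804 & .196 \\ .588 & .412 \end{pmatrix}$$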

(c) Communication and reducibility:


1 ↔ 2, so the Markov process represented by P is irreducible.
(d) Periodicity:

State 1: d(1) = 1
State 2: d(2) = 1

Hence both states are aperiodic.


(e) First and eventual transition probabilities:
First:

$$f_{12}^{(1)} = .1, \quad f_{12}^{(2)} = .9 \times .1 = .09, \quad f_{12}^{(3)} = .9^2 \times .1 = .081, \quad f_{12}^{(4)} = .9^3 \times .1 = .0729, \;\; \ldots$$

$$f_{21}^{(1)} = .3, \quad f_{21}^{(2)} = .7 \times .3 = .21, \quad f_{21}^{(3)} = .7^2 \times .3 = .147, \quad f_{21}^{(4)} = .7^3 \times .3 = .1029, \;\; \ldots$$

Eventual:

$$f_{12} = \sum_{n=1}^{\infty} f_{12}^{(n)} = \frac{.1}{1 - .9} = 1$$

$$f_{21} = \sum_{n=1}^{\infty} f_{21}^{(n)} = \frac{.3}{1 - .7} = 1$$

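(f) Recurrence:
Because $f_{12} = f_{21} = 1$, a process that leaves either state eventually returns to it with
probability one, so $f_{11} = f_{22} = 1$ and both states 1 and 2 are recurrent. In addition,
$$\mu_{11} = \sum_{n=1}^{\infty} n f_{11}^{(n)} < \infty$$
$$\mu_{22} = \sum_{n=1}^{\infty} n f_{22}^{(n)} < \infty$$

States 1 and 2 are therefore positive recurrent and (given their aperiodicity
established earlier) ergodic.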
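(g) Stationary distribution
We can iterate on the P matrix to see that:
$$\lim_{n\to\infty} P^n = \begin{pmatrix} .75 & .25 \\ .75 & .25 \end{pmatrix}$$
Hence $\pi_1 = 0.75$ and $\pi_2 = 0.25$.
Alternatively, in the two-state case, we can solve analytically for the stationary
probabilities as follows.

$$\begin{pmatrix} \pi_1 & \pi_2 \end{pmatrix}\begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix} = \begin{pmatrix} \pi_1 & \pi_2 \end{pmatrix}$$

$$\Longrightarrow \quad \begin{aligned} \pi_1 P_{11} + \pi_2 P_{21} &= \pi_1 \quad (1) \\ \pi_1 P_{12} + \pi_2 P_{22} &= \pi_2 \quad (2) \end{aligned}$$

Using $\pi_2 = 1 - \pi_1$, we get from (1) that

$$\pi_1 P_{11} + (1 - \pi_1) P_{21} = \pi_1$$

$$\pi_1 = \frac{P_{21}}{1 - P_{11} + P_{21}}$$
$$\pi_2 = \frac{1 - P_{11}}{1 - P_{11} + P_{21}}$$
Thus,
$$\lim_{n\to\infty} P^n = \frac{1}{1 - P_{11} + P_{21}}\begin{pmatrix} P_{21} & 1 - P_{11} \\ P_{21} & 1 - P_{11} \end{pmatrix}$$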

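(h) Time reversibility:

$$P^*_{12} = \frac{\pi_2}{\pi_1} P_{21} = \frac{.25}{.75} \times .3 = .1 = P_{12}$$
$$P^*_{21} = \frac{\pi_1}{\pi_2} P_{12} = \frac{.75}{.25} \times .1 = .3 = P_{21}$$
We have a time-reversible process.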
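The numerical parts of this solution can be reproduced in a few lines. The following is a minimal NumPy sketch with illustrative variable names; it computes P³, approximates the limit of Pⁿ by a high power, evaluates the two-state closed form for π, and forms the reversed-chain transition probabilities.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# (b) Three-step transition probabilities
P3 = np.linalg.matrix_power(P, 3)

# (g) Stationary distribution: high power of P and the two-state closed form
P_limit = np.linalg.matrix_power(P, 200)
pi1 = P[1, 0] / (1 - P[0, 0] + P[1, 0])
pi = np.array([pi1, 1 - pi1])

# (h) Reversed-chain probabilities p*_ij = (pi_j / pi_i) p_ji
P_star = (pi[None, :] / pi[:, None]) * P.T

print(P3, P_limit, pi, P_star, sep="\n")
```

Time reversibility corresponds to P_star reproducing P itself.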

3. AR(p) in state-space form.


Find a state-space representation different from the one used in the text, in which the
state vector is $(y_1, \ldots, y_p)'$. This is precisely what one expects given the intuitive idea
of state: in an AR(p) all history relevant for future evolution is $(y_1, \ldots, y_p)'$.

4. ARMA(1,1) in state space form.

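$$y_t = \phi\, y_{t-1} + \eta_t + \theta\, \eta_{t-1}$$

$$\eta_t \sim WN(0, \sigma_\eta^2)$$

$$\begin{pmatrix} \alpha_{1t} \\ \alpha_{2t} \end{pmatrix} = \begin{pmatrix} \phi & 1 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} \alpha_{1,t-1} \\ \alpha_{2,t-1} \end{pmatrix} + \begin{pmatrix} 1 \\ \theta \end{pmatrix}\eta_t$$

$$y_t = (1,\ 0)\, \alpha_t = \alpha_{1t}$$

Recursive substitution from the bottom up yields:

$$\alpha_t = \begin{pmatrix} \phi\, y_{t-1} + \theta\, \eta_{t-1} + \eta_t \\ \theta\, \eta_t \end{pmatrix} = \begin{pmatrix} y_t \\ \theta\eta_t \end{pmatrix}$$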

5. Rational spectra and state space forms.


Can all linear discrete-time systems be written in state-space form? In particular, what
is the relationship, if any, between rational spectra and existence of a state-space form?
6. Impulse-responses and variance decompositions.


If given a model in state space form, how would you calculate the impulse-responses
and variance decompositions?

7. Identification in UCM’s.
Discuss the identifying assumption that UC innovations are orthogonal at all leads and
lags. What convenient mathematical properties does it entail for the observed sum of
the unobserved components? In what ways is it restrictive?
Solution: Orthogonality of component innovations implies that the spectrum of the
observed time series is simply the sum of the component spectra. Moreover, the or-
thogonality facilitates identification. The assumption is rather restrictive, however, in
that it entails no interaction between cyclical and secular economic fluctuations.

8. ARMA(1,1) “Reduced Form” of the Signal Plus Noise Model.


Consider the “structural” model:

xt = yt + ut

yt = αyt−1 + vt .

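Show that the reduced form for the observed series $x_t$ is ARMA(1,1):

\[ x_t = \alpha x_{t-1} + \varepsilon_t + \beta \varepsilon_{t-1} \]

and provide expressions for $\sigma_\varepsilon^2$ and $\beta$ in terms of the underlying parameters $\alpha$, $\sigma_v^2$ and
$\sigma_u^2$.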
Solution:
Box and Jenkins (1976) and Nerlove et al. (1979) show the ARMA result and give the
formula for $\beta$. That leaves $\sigma_\varepsilon^2$. We will compute var(x) first from the UCM and then
from the ARMA(1,1) reduced form, and equate them.
From the UCM:

\[ var(x) = \frac{\sigma_v^2}{1 - \alpha^2} + \sigma_u^2 \]

From the reduced form:

\[ var(x) = \frac{(1 + \beta^2 + 2\alpha\beta)}{1 - \alpha^2}\, \sigma_\varepsilon^2 \]

Equating yields

\[ \sigma_\varepsilon^2 = \frac{\sigma_v^2 + \sigma_u^2 (1 - \alpha^2)}{1 + \beta^2 + 2\alpha\beta} \]

9. Properties of optimal extractions.


In the case where the seasonal is modeled as independent of the nonseasonal and
the observed data is just the sum of the two, the following assertions are (perhaps
surprisingly) both true:
i) optimal extraction of the nonseasonal is identically equal to the observed data minus
optimal extraction of the seasonal (i.e., $y \equiv \hat{y}_s + \hat{y}_n \equiv y_s + y_n$), so it doesn’t matter whether
you estimate the nonseasonal by optimally extracting it directly or instead optimally
extract the seasonal and subtract; both methods yield the same answer;
ii) ŷs , the estimated seasonal, is less variable than ys , the true seasonal, and ŷn , the
estimated nonseasonal, is less variable than yn , the true nonseasonal. It is paradoxical
that, by (ii), both estimates are less variable than their true counterparts, yet, by (i),
they still add up to the same observed series as their true counterparts. The paradox is
explained by the fact that, unlike their true counterparts, the estimates ŷs and ŷn are
correlated (so the variance of their sum can be more than the sum of their variances).

3.5 NOTES
Chapter Four

Frequentist Time-Series Likelihood Evaluation, Optimization, and Inference

4.1 LIKELIHOOD EVALUATION: PREDICTION-ERROR DECOMPOSITION AND THE KALMAN FILTER

Brute-Force Direct Evaluation:

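\[ y \sim N(\mu, \Sigma(\theta)) \]

Example: AR(1)

\[ (y_t - \mu) = \phi (y_{t-1} - \mu) + \varepsilon_t \]

\[ \Sigma_{ij}(\phi) = \frac{\sigma^2}{1 - \phi^2}\, \phi^{|i-j|} \]

\[ L(y; \theta) = (2\pi)^{-T/2} |\Sigma(\theta)|^{-1/2} \exp\left( -\frac{1}{2} (y - \mu)' \Sigma^{-1}(\theta) (y - \mu) \right) \]

\[ \ln L(y; \theta) = const - \frac{1}{2} \ln |\Sigma(\theta)| - \frac{1}{2} (y - \mu)' \Sigma^{-1}(\theta) (y - \mu) \]

The $T \times T$ matrix $\Sigma(\theta)$ can be very hard to calculate (we need analytic formulas for the
autocovariances) and to invert (numerical instabilities and inaccuracies; slow even if possible).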
Prediction-error decomposition and the Kalman filter:
Schweppe’s prediction-error likelihood decomposition is:

\[ L(y_1, \ldots, y_T; \theta) = \prod_{t=1}^{T} L_t(y_t | y_{t-1}, \ldots, y_1; \theta) \]

or:

\[ \ln L(y_1, \ldots, y_T; \theta) = \sum_{t=1}^{T} \ln L_t(y_t | y_{t-1}, \ldots, y_1; \theta) \]

“Prediction-error decomposition”
In the univariate Gaussian case, the Schweppe decomposition is

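\[ \ln L = -\frac{T}{2} \ln 2\pi - \frac{1}{2} \sum_{t=1}^{T} \ln \sigma_t^2 - \frac{1}{2} \sum_{t=1}^{T} \frac{(y_t - \mu_t)^2}{\sigma_t^2} \]

\[ = -\frac{T}{2} \ln 2\pi - \frac{1}{2} \sum_{t=1}^{T} \ln F_t - \frac{1}{2} \sum_{t=1}^{T} \frac{v_t^2}{F_t} \]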

Kalman filter delivers vt and Ft !


No autocovariance calculation or matrix inversion!
In the N -variate Gaussian case, the Schweppe decomposition is
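\[ \ln L = -\frac{NT}{2} \ln 2\pi - \frac{1}{2} \sum_{t=1}^{T} \ln |\Sigma_t| - \frac{1}{2} \sum_{t=1}^{T} (y_t - \mu_t)' \Sigma_t^{-1} (y_t - \mu_t) \]

\[ = -\frac{NT}{2} \ln 2\pi - \frac{1}{2} \sum_{t=1}^{T} \ln |F_t| - \frac{1}{2} \sum_{t=1}^{T} v_t' F_t^{-1} v_t \]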

Kalman filter again delivers vt and Ft .


Only the small matrix Ft (N × N ) need be inverted.
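
To see the two evaluation routes agree, here is a minimal sketch for a scalar signal-plus-noise model, $y_t = \alpha_t + \varepsilon_t$, $\alpha_t = \phi \alpha_{t-1} + \eta_t$. The model choice, the stationary initialization, and the names `kalman_loglik` and `brute_force_loglik` are assumptions made for illustration. It computes the Gaussian log likelihood once from the Kalman-filter quantities $v_t$ and $F_t$ and once by brute force from the full $T \times T$ covariance matrix.

```python
import numpy as np

def kalman_loglik(y, phi, q, h):
    """Prediction-error decomposition: sum of N(v_t; 0, F_t) log densities."""
    a = 0.0                        # predicted state mean, a_{1|0}
    p = q / (1.0 - phi**2)         # predicted state variance, P_{1|0} (stationary start)
    loglik = 0.0
    for y_t in y:
        v = y_t - a                # prediction error v_t (Z = 1)
        f = p + h                  # prediction-error variance F_t
        loglik += -0.5 * (np.log(2 * np.pi) + np.log(f) + v**2 / f)
        k = p / f                  # Kalman gain
        a_filt = a + k * v         # state update
        p_filt = p - k * p
        a = phi * a_filt           # state prediction for t+1
        p = phi**2 * p_filt + q
    return loglik

def brute_force_loglik(y, phi, q, h):
    """Direct evaluation using the full T x T covariance matrix of y."""
    T = len(y)
    idx = np.arange(T)
    lags = np.abs(idx[:, None] - idx[None, :])
    sigma = q / (1.0 - phi**2) * phi**lags + h * np.eye(T)
    _, logdet = np.linalg.slogdet(sigma)
    quad = y @ np.linalg.solve(sigma, y)
    return -0.5 * (T * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
y = rng.standard_normal(50)        # any data vector works for the comparison
ll_kf = kalman_loglik(y, phi=0.8, q=1.0, h=0.5)
ll_bf = brute_force_loglik(y, phi=0.8, q=1.0, h=0.5)
```

The two values agree up to floating-point error, but the filter never forms or inverts the $T \times T$ matrix.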

4.2 GRADIENT-BASED LIKELIHOOD MAXIMIZATION: NEWTON AND QUASI-NEWTON METHODS

The key is to be able to evaluate lnL for a given parameter configuration


Then we can climb uphill to maximize lnL to get the MLE
Crude Search
Function lnL(θ) to be optimized w.r.t. θ,
θ ∈ Θ, a compact subset of Rk

• Deterministic search: Search k dimensions at r locations in each dimension.

• Randomized Search: Repeatedly sample from Θ, repeatedly evaluating lnL(θ)

– Absurdly slow (curse of dimensionality)

4.2.1 The Generic Gradient-Based Algorithm

So use the gradient for guidance.


“Gradient-Based” Iterative Algorithms (“Line-Search”)
Parameter vector at iteration m: θ(m) .

θ(m+1) = θ(m) + C (m) , where C (m) is the step.


Gradient algorithms: C (m) = −t(m) D(m) s(m)
t(m) is the step length, or step size (a positive number)
D(m) is a positive definite direction matrix
s(m) is the score (gradient) vector evaluated at θ(m)
General Algorithm

1. Specify θ(0)

2. Compute D(m) and s(m)

3. Determine step length t(m)


(Often, at each step, choose t(m) to optimize the objective function (“variable step
length”))

4. Compute θ(m+1)

5. If convergence criterion not met, go to 2.

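Convergence
Convergence Criteria
$\| s^{(m)} \|$ “small”
$\| \theta^{(m)} - \theta^{(m-1)} \|$ “small”
Convergence Rates
$p$ such that $\lim_{m \to \infty} \frac{\| \theta^{(m+1)} - \hat{\theta} \|}{\| \theta^{(m)} - \hat{\theta} \|^p} = O(1)$
Method of Steepest Descent
Use $D^{(m)} = I$, $t^{(m)} = 1$, $\forall\, m$.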
Properties:

1. May converge to a critical point other than a minimum


(of course)

2. Requires only first derivative of the objective function

3. Very slow to converge (p = 1)

4.2.2 Newton Algorithm

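Take $D^{(m)}$ as the inverse Hessian of $\ln L(\theta)$ at $\theta^{(m)}$:

\[
D^{(m)} = H^{-1(m)} =
\begin{pmatrix}
\frac{\partial^2 \ln L}{\partial \theta_1^2} \Big|_{\theta^{(m)}} & \cdots & \frac{\partial^2 \ln L}{\partial \theta_1 \partial \theta_k} \Big|_{\theta^{(m)}} \\
\vdots & & \vdots \\
\frac{\partial^2 \ln L}{\partial \theta_k \partial \theta_1} \Big|_{\theta^{(m)}} & \cdots & \frac{\partial^2 \ln L}{\partial \theta_k^2} \Big|_{\theta^{(m)}}
\end{pmatrix}^{-1}
\]

Also take $t^{(m)} = 1$

Then $\theta^{(m+1)} = \theta^{(m)} - H^{-1(m)} s^{(m)}$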
Derivation From Second-Order Taylor Expansion
Initial guess: θ(0)
$\ln L(\theta) \approx \ln L(\theta^{(0)}) + s^{(0)\prime} (\theta - \theta^{(0)}) + \frac{1}{2} (\theta - \theta^{(0)})' H^{(0)} (\theta - \theta^{(0)})$
F.O.C.:
s(0) + H (0) (θ∗ − θ(0) ) = 0
or
θ∗ = θ(0) − H −1(0) s(0)
Properties of Newton
lnL(θ) quadratic ⇒ full convergence in a single iteration
More generally, iterate to convergence:
θ(m+1) = θ(m) − H −1(m) s(m)
Faster than steepest descent (p = 2 vs. p = 1)
But there is a price:
Requires first and second derivatives of the objective function
Requires inverse Hessian at each iteration
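
The Newton update can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses a Poisson log likelihood in the log-rate parameter, $\ln L(\theta) = \sum_i (y_i \theta - e^\theta) + const$, chosen only because its score and Hessian are available in closed form. The data and the function name `newton_maximize` are hypothetical.

```python
import numpy as np

def newton_maximize(score, hessian, theta0, tol=1e-10, max_iter=50):
    """Iterate theta <- theta - H^{-1} s until the score norm is small."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for m in range(max_iter):
        s = score(theta)
        if np.linalg.norm(s) < tol:          # convergence criterion: ||s|| small
            break
        H = hessian(theta)
        theta = theta - np.linalg.solve(H, s)  # avoids forming H^{-1} explicitly
    return theta, m

# Illustrative count data; the MLE of the log rate is log(mean(y)).
y = np.array([2.0, 0.0, 3.0, 1.0, 4.0])
score = lambda th: np.array([y.sum() - len(y) * np.exp(th[0])])
hessian = lambda th: np.array([[-len(y) * np.exp(th[0])]])

theta_hat, n_iter = newton_maximize(score, hessian, theta0=[0.0])
```

Starting from $\theta^{(0)} = 0$, the iterates settle on $\ln \bar{y}$ within a handful of steps, consistent with the fast local convergence noted above.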

4.2.3 Quasi-Newton Algorithms

e.g., Davidon-Fletcher-Powell (DFP):


D(0) = I
 
D(m) = D(m−1) + f D(m−1) , δ (m) , φ(m) , m = 1, 2, 3, ...

δ (m) = θ(m) − θ(m−1)


φ(m) = s(m) − s(m−1)

• Second derivatives aren’t needed, but first derivatives are

• Approximation to Hessian is built up during iteration

• If lnL(θ) is quadratic, then D(k) = H, where k = dim(Θ)

• Intermediate convergence speed (p ≈ 1.6)

4.2.4 “Line-Search” vs. “Trust Region” Methods: Levenberg-Marquardt

An interesting duality...
Line search: First determine direction, then step
Trust region: First determine step, then direction
– Approximate the function locally in a trust region
containing all admissible steps, and then determine direction
Classic example: Levenberg-Marquardt

Related R packages:
trust (trust region optimization)
minpack.lm (R interface to Levenberg-Marquardt in MINPACK)

4.3 GRADIENT-FREE LIKELIHOOD MAXIMIZATION: EM

Recall Kalman smoothing


1. Kalman filter the state forward through the sample, t = 1, ..., T
2. Kalman smooth the state backward, t = T, (T − 1), (T − 2), ..., 1
Initialize:

aT,T = aT

PT,T = PT

Smooth:

\[ a_{t,T} = a_t + J_t (a_{t+1,T} - a_{t+1,t}) \]

\[ P_{t,T} = P_t + J_t (P_{t+1,T} - P_{t+1,t}) J_t' \]

where $J_t = P_t T' P_{t+1,t}^{-1}$

Another Related Smoothing Piece, Needed for EM


3. Get smoothed predictive covariance matrices as well:

\[ P_{(t,t-1),T} = E[(\alpha_t - a_{t,T})(\alpha_{t-1} - a_{t-1,T})'] \]

Initialize:

\[ P_{(T,T-1),T} = (I - K_T Z)\, T\, P_{T-1} \]

Then:

\[ P_{(t-1,t-2),T} = P_{t-1} J_{t-2}' + J_{t-1} (P_{(t,t-1),T} - T P_{t-1}) J_{t-2}' \]

where Kt is the Kalman gain.


The EM “Data-Augmentation Algorithm”
Think of $\{\alpha_t\}_{t=1}^{T}$ as data that are unfortunately missing in

\[ \alpha_t = T \alpha_{t-1} + \eta_t \]

\[ y_t = Z \alpha_t + \varepsilon_t \]

Incomplete Data Likelihood:

lnL(y; θ)

Complete Data Likelihood: (If only we had complete data!)

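\[ \ln L\left(y, \{\alpha_t\}_{t=0}^{T}; \theta\right) \]

Expected Complete Data Likelihood:

\[ \ln L^{(m)}(y; \theta) \approx E_\alpha \left[ \ln L\left(y, \{\alpha_t\}_{t=0}^{T}; \theta\right) \right] \]

EM iteratively constructs and maximizes the expected complete-data likelihood, which
(amazingly) has the same maximizer as the (relevant) incomplete-data likelihood.
The EM (Expectation/Maximization) Algorithm
1. E Step:
Construct $\ln L^{(m)}(y; \theta) \approx E_\alpha \left[ \ln L\left(y, \{\alpha_t\}_{t=0}^{T}; \theta\right) \right]$
2. M Step:
$\theta^{(m+1)} = \arg\max_\theta \ln L^{(m)}(y; \theta)$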

3. If convergence criterion not met, go to 1


But how to do the E and M steps?

4.3.1 “Not-Quite-Right EM”


(But it Captures and Conveys the Intuition)

1. E Step:
Approximate a “complete data” situation by replacing
{αt }Tt=0 with at,T from the Kalman smoother
2. M Step:
Estimate parameters by running regressions:
at,T → at−1,T
yt → at,T
3. If convergence criterion not met, go to 1

4.3.2 Precisely Right EM

4.3.2.1 Complete Data Likelihood



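The complete data are $\alpha_0, \{\alpha_t, y_t\}_{t=1}^{T}$.

Complete-Data Likelihood:

\[ f_\theta\left(y, \alpha_0, \{\alpha_t\}_{t=1}^{T}\right) = f_{a_0, P_0}(\alpha_0) \prod_{t=1}^{T} f_{T,Q}(\alpha_t | \alpha_{t-1}) \prod_{t=1}^{T} f_{Z,H}(y_t | \alpha_t) \]

Gaussian Complete-Data Log-Likelihood:

\[
\begin{aligned}
\ln L\left(y, \{\alpha_t\}_{t=0}^{T}; \theta\right) = const &- \frac{1}{2} \ln |P_0| - \frac{1}{2} (\alpha_0 - a_0)' P_0^{-1} (\alpha_0 - a_0) \\
&- \frac{T}{2} \ln |Q| - \frac{1}{2} \sum_{t=1}^{T} (\alpha_t - T \alpha_{t-1})' Q^{-1} (\alpha_t - T \alpha_{t-1}) \\
&- \frac{T}{2} \ln |H| - \frac{1}{2} \sum_{t=1}^{T} (y_t - Z \alpha_t)' H^{-1} (y_t - Z \alpha_t)
\end{aligned}
\]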

4.3.2.2 E Step
 
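Construct: $\ln L^{(m)}(y; \theta) \approx E_\alpha \left[ \ln L\left(y, \{\alpha_t\}_{t=0}^{T}; \theta\right) \right]$

\[
\begin{aligned}
E_\alpha \left[ \ln L\left(y, \{\alpha_t\}_{t=0}^{T}; \theta\right) \right] = const &- \frac{1}{2} \ln |P_0| - \frac{1}{2} E_\alpha \left[ (\alpha_0 - a_0)' P_0^{-1} (\alpha_0 - a_0) \right] \\
&- \frac{T}{2} \ln |Q| - \frac{1}{2} \sum_{t=1}^{T} E_\alpha \left[ (\alpha_t - T \alpha_{t-1})' Q^{-1} (\alpha_t - T \alpha_{t-1}) \right] \\
&- \frac{T}{2} \ln |H| - \frac{1}{2} \sum_{t=1}^{T} E_\alpha \left[ (y_t - Z \alpha_t)' H^{-1} (y_t - Z \alpha_t) \right]
\end{aligned}
\]

– Function of $\{a_{t,T}(\theta^{(m)}), P_{t,T}(\theta^{(m)}), \text{ and } P_{(t,t-1),T}(\theta^{(m)})\}_{t=1}^{T}$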


– So the E step is really just running the three smoothers

4.3.2.3 M Step Formulas



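Find: $\theta^{(m+1)} = \arg\max_\theta \ln L^{(m)}(y; \theta)$

\[ \hat{T} = \left( \sum_{t=1}^{T} E_\alpha \left[ \alpha_t \alpha_{t-1}' \right] \right) \left( \sum_{t=1}^{T} E_\alpha \left[ \alpha_{t-1} \alpha_{t-1}' \right] \right)^{-1} \]

\[ \hat{\eta}_t = (\alpha_t - \hat{T} \alpha_{t-1}) \]

\[ \hat{Q} = \frac{1}{T} \sum_{t=1}^{T} E_\alpha [\hat{\eta}_t \hat{\eta}_t'] \]

\[ \hat{Z} = \left( \sum_{t=1}^{T} y_t E_\alpha [\alpha_t]' \right) \left( \sum_{t=1}^{T} E_\alpha [\alpha_t \alpha_t'] \right)^{-1} \]

\[ \hat{\varepsilon}_t = (y_t - \hat{Z} \alpha_t) \]

\[ \hat{H} = \frac{1}{T} \sum_{t=1}^{T} E_\alpha [\hat{\varepsilon}_t \hat{\varepsilon}_t'] \]

where:

\[ E_\alpha [\alpha_t] = a_{t|T} \]

\[ E_\alpha [\alpha_t \alpha_t'] = a_{t|T} a_{t|T}' + P_{t|T} \]

\[ E_\alpha \left[ \alpha_t \alpha_{t-1}' \right] = a_{t|T} a_{t-1|T}' + P_{(t,t-1)|T} \]

Simply replacing $\alpha_t$ with $a_{t,T}$ won’t work because $E[\alpha_t \alpha_t' | \Omega_T] \neq a_{t,T} a_{t,T}'$.
Instead we have $E[\alpha_t \alpha_t' | \Omega_T] = E[\alpha_t | \Omega_T] E[\alpha_t | \Omega_T]' + Var(\alpha_t | \Omega_T) = a_{t,T} a_{t,T}' + P_{t,T}$.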

4.4 LIKELIHOOD INFERENCE

4.4.1 Under Correct Specification

\[ \ln L(\theta) = \sum_{t=1}^{T} \ln L_t(\theta) \]

• Always true in iid environments

• Also holds in time series, via prediction-error decomposition of ln L

Expected Fisher Information


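Score at true parameter $\theta_0$:

\[ s(\theta_0) = \frac{\partial \ln L(\theta)}{\partial \theta} \bigg|_{\theta_0} = \sum_{t=1}^{T} \frac{\partial \ln L_t(\theta)}{\partial \theta} \bigg|_{\theta_0} = \sum_{t=1}^{T} s_t(\theta_0) \]

Expected Fisher Information at true parameter $\theta_0$:

\[ I_{EX,H}(\theta_0) = -E \left( \frac{\partial^2 \ln L(\theta)}{\partial \theta\, \partial \theta'} \bigg|_{\theta_0} \right) \]

\[ = -E H(\theta_0) = -E \left( \sum_{t=1}^{T} \frac{\partial^2 \ln L_t(\theta)}{\partial \theta\, \partial \theta'} \bigg|_{\theta_0} \right) = \sum_{t=1}^{T} -E H_t(\theta_0) \]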

More on Expected Information Matrices


(1) We had already: $I_{EX,H}(\theta_0) = \sum_{t=1}^{T} -E H_t(\theta_0)$
“Expected information based on the Hessian”
(2) Can also form: $I_{EX,s}(\theta_0) = \sum_{t=1}^{T} E\, s_t(\theta_0) s_t(\theta_0)'$
“Expected information based on the score”
(In a moment we’ll see why this is of interest)
Distribution of the MLE Under Correct Specification
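Under correct specification and regularity conditions,

\[ \sqrt{T} (\hat{\theta}_{ML} - \theta_0) \xrightarrow{d} N(0, V_{EX}(\theta_0)) \quad (4.1) \]

where

\[ V_{EX}(\theta_0) = V_{EX,H}(\theta_0) = plim_{T \to \infty} \left( \frac{I_{EX,H}(\theta_0)}{T} \right)^{-1} \]

\[ = V_{EX,s}(\theta_0) = plim_{T \to \infty} \left( \frac{I_{EX,s}(\theta_0)}{T} \right)^{-1} \]

$\hat{\theta}_{ML}$ is consistent, asymptotically normal, and asymptotically efficient (Cramer-Rao lower bound met).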
Observed Information Matrices
(1) $I_{OB,H}(\theta_0) = \sum_{t=1}^{T} -H_t(\theta_0)$
(“Observed information based on the Hessian”)
(2) $I_{OB,s}(\theta_0) = \sum_{t=1}^{T} s_t(\theta_0) s_t(\theta_0)'$
(“Observed information based on the score”)
In a moment we’ll see why these are of interest.
Consistent Estimators of VEX (θ0 )

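\[ \hat{V}_{EX,H}(\theta_0) = \left( \frac{I_{EX,H}(\hat{\theta}_{ML})}{T} \right)^{-1}, \qquad \hat{V}_{EX,s}(\theta_0) = \left( \frac{I_{EX,s}(\hat{\theta}_{ML})}{T} \right)^{-1} \]

\[ \hat{V}_{OB,H}(\theta_0) = \left( \frac{I_{OB,H}(\hat{\theta}_{ML})}{T} \right)^{-1}, \qquad \hat{V}_{OB,s}(\theta_0) = \left( \frac{I_{OB,s}(\hat{\theta}_{ML})}{T} \right)^{-1} \]

Under correct specification,
$plim_{T \to \infty} \hat{V}_{EX,H}(\theta_0) = plim_{T \to \infty} \hat{V}_{EX,s}(\theta_0) = V_{EX}(\theta_0)$
$plim_{T \to \infty} \hat{V}_{OB,H}(\theta_0) = plim_{T \to \infty} \hat{V}_{OB,s}(\theta_0) = V_{EX}(\theta_0)$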

4.4.2 Under Possible Misspecification

4.4.2.1 Distributional Misspecification

Under possible distributional misspecification (but still assuming correct conditional mean
and variance function specifications),

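\[ \sqrt{T} (\hat{\theta}_{ML} - \theta_0) \xrightarrow{d} N(0, V_{EX}^{m}(\theta_0)) \quad (4.2) \]

where:

\[ V_{EX}^{m}(\theta_0) = V_{EX,H}(\theta_0)\, V_{EX,s}(\theta_0)^{-1}\, V_{EX,H}(\theta_0) \]

(recall that $V_{EX,H}$ and $V_{EX,s}$ are themselves inverses of the scaled information matrices)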
“Quasi MLE” or “Pseudo-MLE”


Under correct specification: (4.2) collapses to standard result (4.1)
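Under distributional misspecification:
$\hat{\theta}_{ML}$ is consistent, asymptotically normal, and asymptotically inefficient (CRLB not met), but we
can nevertheless do credible (if not fully efficient) inference.
Consistent estimators of $V_{EX}^{m}(\theta_0)$:

\[ \hat{V}_{EX}^{m}(\theta_0) = \left( \frac{I_{EX,H}(\hat{\theta}_{ML})}{T} \right)^{-1} \left( \frac{I_{EX,s}(\hat{\theta}_{ML})}{T} \right) \left( \frac{I_{EX,H}(\hat{\theta}_{ML})}{T} \right)^{-1} \]

\[ \hat{V}_{OB}^{m}(\theta_0) = \left( \frac{I_{OB,H}(\hat{\theta}_{ML})}{T} \right)^{-1} \left( \frac{I_{OB,s}(\hat{\theta}_{ML})}{T} \right) \left( \frac{I_{OB,H}(\hat{\theta}_{ML})}{T} \right)^{-1} \]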

“Sandwich Estimator”

4.4.2.2 General Misspecification

Under possible general misspecification,

\[ \sqrt{T} (\hat{\theta}_{ML} - \theta_*) \xrightarrow{d} N(0, V_{EX}^{m}(\theta_*)) \quad (4.3) \]

where:

\[ V_{EX}^{m}(\theta_*) = V_{EX,H}(\theta_*)\, V_{EX,s}(\theta_*)^{-1}\, V_{EX,H}(\theta_*) \]

Under correct specification: Collapses to standard result (4.1)


Under purely distributional misspecification: Collapses to result (4.2)
Under general misspecification:
θ̂M L consistent for KLIC-optimal pseudo-true value θ∗ ,
asymptotically normal, and we can do credible inference
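Consistent estimators of $V_{EX}^{m}(\theta_*)$:

\[ \hat{V}_{EX}^{m}(\theta_*) = \left( \frac{I_{EX,H}(\hat{\theta}_{ML})}{T} \right)^{-1} \left( \frac{I_{EX,s}(\hat{\theta}_{ML})}{T} \right) \left( \frac{I_{EX,H}(\hat{\theta}_{ML})}{T} \right)^{-1} \]

\[ \hat{V}_{OB}^{m}(\theta_*) = \left( \frac{I_{OB,H}(\hat{\theta}_{ML})}{T} \right)^{-1} \left( \frac{I_{OB,s}(\hat{\theta}_{ML})}{T} \right) \left( \frac{I_{OB,H}(\hat{\theta}_{ML})}{T} \right)^{-1} \]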

All told, observed information appears preferable to estimated expected information. It’s
clearly simpler and it performs well.
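
A toy calculation may help fix ideas about the observed-information sandwich. The sketch below is an assumption-laden illustration, not an example from the text: the mean $\mu$ is estimated with a Gaussian pseudo-likelihood that wrongly fixes the variance at 1, while the simulated data have variance 4. Then $s_t = y_t - \mu$ and $H_t = -1$, so the Hessian-based variance is 1 whereas the sandwich recovers the actual variance of the data.

```python
import numpy as np

rng = np.random.default_rng(1)
y = 1.0 + 2.0 * rng.standard_normal(5000)   # true variance is 4, not 1

mu_hat = y.mean()                  # pseudo-MLE of mu under the N(mu, 1) likelihood

scores = y - mu_hat                # s_t evaluated at the estimate
hessians = -np.ones_like(y)        # H_t = -1 for every t

info_H = -hessians.mean()          # I_OB,H / T
info_s = np.mean(scores**2)        # I_OB,s / T

V_hessian = 1.0 / info_H                                 # ignores misspecification
V_sandwich = (1.0 / info_H) * info_s * (1.0 / info_H)    # sandwich form
```

Here `V_hessian` equals 1, while `V_sandwich` equals the sample variance of `y` (close to 4), which is the correct asymptotic variance of $\sqrt{T}(\hat{\mu} - \mu)$ in this setup.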

4.5 EXERCISES, PROBLEMS AND COMPLEMENTS

1. State space model fitting by exact Gaussian pseudo-MLE using a prediction-error decomposition and the Kalman filter.
Read Aruoba et al. (2013), and fit the block-diagonal dynamic-factor model (3)-(4) by
exact Gaussian pseudo-MLE using a prediction-error decomposition and the Kalman
filter. Obtain both filtered and smoothed estimates of the series of states. Compare
them to each other, to the “raw” GDPE and GDPI series, and to the current vintage
of GDPplus from FRB Philadelphia.

2. Method of scoring
Slight variation on Newton:
Use $E(H^{(m)})^{-1}$ rather than $H^{-1(m)}$
(Expected rather than observed Hessian.)

3. Constrained optimization.

(a) Substitute the constraint directly and use Slutsky’s theorem.


– To keep a symmetric matrix nonnegative definite, write it as $P P'$
– To keep a parameter in [0, 1], write it as $p^2 / (1 + p^2)$.
(b) Barrier and penalty functions

4.6 NOTES
Chapter Five

Simulation Basics

5.1 GENERATING U(0,1) DEVIATES

Criteria for (Pseudo-) Random Number Generation

1. Statistically independent

2. Reproducible

3. Non-repeating

4. Quickly-generated

5. Minimal in memory requirements

The Canonical Problem: Uniform (0,1) Deviates


Congruential methods
a ≡ b (mod m)
“a is congruent to b modulo m”
“(a − b) is an integer multiple of m”
Examples:
255 ≡ 5 (mod 50)
255 ≡ 5 (mod 10)
123 ≡ 23 (mod 10)

Key recursion: xt = axt−1 (mod m)


To get xt , just divide axt−1 by m and keep the remainder
Multiplicative Congruential Method

xt = axt−1 (mod m), xt , a, m ∈ Z+

Example:

xt = 3xt−1 (mod 16), x0 = 1


Figure 5.1: Ripley’s “Horror” Plots of pairs of $(U_{i+1}, U_i)$ for Various Congruential Generators Modulo 2048 (from Ripley, 1987)

x0 = 1, x1 = 3, x2 = 9, x3 = 11, x4 = 1, x5 = 3, ...

Perfectly periodic, with a period of 4.
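
The short cycle is easy to reproduce. The following sketch implements the multiplicative recursion directly; the function name `multiplicative_congruential` is a hypothetical label.

```python
def multiplicative_congruential(a, m, x0, n):
    """Return [x_0, x_1, ..., x_n] with x_t = a * x_{t-1} (mod m)."""
    xs = [x0]
    for _ in range(n):
        xs.append((a * xs[-1]) % m)   # divide a*x_{t-1} by m, keep the remainder
    return xs

seq = multiplicative_congruential(a=3, m=16, x0=1, n=8)
```

With a = 3, m = 16 and x0 = 1, the list `seq` cycles through 1, 3, 9, 11 and then repeats.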


Generalize:

$x_t = (a x_{t-1} + c) \pmod{m}$, $(x_t, a, c, m \in Z^+)$

Remarks

1. $x_t \in [0, m - 1]$, $\forall t$. So take $x_t^* = \frac{x_t}{m}$, $\forall t$

2. Pseudo-random numbers are perfectly periodic.

3. The maximum period, m, can be attained using the mixed congruential generator if:

• c and m have no common divisor


• a = 1(mod p) ∀ prime factors p of m
• a = 1(mod 4) if m is a multiple of 4

4. m is usually determined by machine wordlength; e.g., $2^{64}$

5. Given U (0, 1) , U (α, β) is immediate

5.2 THE BASICS: C.D.F. INVERSION, BOX-MULLER, SIMPLE ACCEPT-REJECT

5.2.1 Inverse c.d.f.

Inverse cdf Method (“Inversion Methods”)


Desired density: f (x)

Figure 5.2: Transforming from U(0,1) to f (from Davidson and MacKinnon, 1993)

1. Find the analytical c.d.f., F (x), corresponding to f (x)

2. Generate T U(0, 1) deviates $\{r_1, ..., r_T\}$

3. Calculate $\{F^{-1}(r_1), ..., F^{-1}(r_T)\}$

Graphical Representation of Inverse cdf Method


Example: Inverse cdf Method for exp(β) Deviates
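$f(x) = \beta e^{-\beta x}$ where $\beta > 0$, $x \geq 0$

\[ \Rightarrow F(x) = \int_0^x \beta e^{-\beta t}\, dt = \left[ -e^{-\beta t} \right]_0^x = -e^{-\beta x} + 1 = 1 - e^{-\beta x} \]

Hence $e^{-\beta x} = 1 - F(x)$, so $x = \frac{\ln(1 - F(x))}{-\beta}$
Then insert a U(0, 1) deviate for F(x)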
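
The inversion formula translates directly into code. This is a minimal sketch using NumPy's uniform generator; the function name, the seed, and the choice $\beta = 2$ are illustrative assumptions.

```python
import numpy as np

def exponential_by_inversion(beta, size, rng):
    """Draw exp(beta) deviates via x = ln(1 - u) / (-beta), with u ~ U(0, 1)."""
    u = rng.uniform(0.0, 1.0, size)      # u lies in [0, 1), so 1 - u is in (0, 1]
    return np.log(1.0 - u) / (-beta)

rng = np.random.default_rng(42)
draws = exponential_by_inversion(beta=2.0, size=100_000, rng=rng)
```

The sample mean of `draws` should be close to $1/\beta = 0.5$, and all draws are nonnegative.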
Complications
Analytic inverse cdf not always available (e.g., N(0, 1) distribution).

• Approach 1: Evaluate the cdf numerically

• Approach 2: Use a different method


e.g., CLT approximation:
Take $\sum_{i=1}^{12} U_i(0, 1) - 6$ for N(0, 1)

5.2.2 Box-Muller

An Efficient Gaussian Approach: Box-Muller


Let x1 and x2 be i.i.d. U (0, 1), and consider

\[ y_1 = \sqrt{-2 \ln x_1}\, \cos(2\pi x_2) \]

\[ y_2 = \sqrt{-2 \ln x_1}\, \sin(2\pi x_2) \]
Find the distribution of y1 and y2 . We know that
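\[
f(y_1, y_2) = f(x_1, x_2) \left| \det
\begin{pmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2}
\end{pmatrix}
\right|
\]

Box-Muller (Continued)
Here we have $x_1 = e^{-\frac{1}{2}(y_1^2 + y_2^2)}$ and $x_2 = \frac{1}{2\pi} \arctan\left( \frac{y_2}{y_1} \right)$

\[
\text{Hence} \quad \left| \det
\begin{pmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2}
\end{pmatrix}
\right|
= \left( \frac{1}{\sqrt{2\pi}} e^{-y_1^2/2} \right) \left( \frac{1}{\sqrt{2\pi}} e^{-y_2^2/2} \right)
\]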

Bivariate density is the product of two N (0, 1) densities, so we have generated two inde-
pendent N (0, 1) deviates.
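
A minimal sketch of the Box-Muller transformation follows. The function name and seed are illustrative assumptions; the uniform for $x_1$ is shifted to (0, 1] only to avoid taking the log of zero.

```python
import numpy as np

def box_muller(n, rng):
    """Map pairs of U(0,1) deviates into pairs of independent N(0,1) deviates."""
    x1 = 1.0 - rng.uniform(size=n)        # in (0, 1], so log(x1) is finite
    x2 = rng.uniform(size=n)
    radius = np.sqrt(-2.0 * np.log(x1))
    y1 = radius * np.cos(2.0 * np.pi * x2)
    y2 = radius * np.sin(2.0 * np.pi * x2)
    return y1, y2

rng = np.random.default_rng(7)
y1, y2 = box_muller(200_000, rng)
```

Both outputs should have sample mean near 0 and variance near 1, with negligible sample correlation between them.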
Generating Deviates Derived from N(0,1)

$\chi_1^2 = [N(0, 1)]^2$

$\chi_d^2 = \sum_{i=1}^{d} [N_i(0, 1)]^2$, where the $N_i(0, 1)$ are independent
$N(\mu, \sigma^2) = \mu + \sigma N(0, 1)$
$t_d = N(0, 1) / \sqrt{\chi_d^2 / d}$, where $N(0, 1)$ and $\chi_d^2$ are independent
$F_{d_1, d_2} = \frac{\chi_{d_1}^2 / d_1}{\chi_{d_2}^2 / d_2}$, where $\chi_{d_1}^2$ and $\chi_{d_2}^2$ are independent
Multivariate Normal
N (0, I) (N -dimensional) – Just stack N N (0, 1)’s
N (µ, Σ) (N -dimensional)
Let $P P' = \Sigma$ (P is the Cholesky factor of $\Sigma$)
Let $X \sim N(0, I)$. Then $P X \sim N(0, \Sigma)$
To sample from $N(\mu, \Sigma)$, take $\mu + P X$
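
The Cholesky recipe can be sketched as follows, with an illustrative $\mu$ and $\Sigma$ (both are assumptions chosen for the example). `np.linalg.cholesky` returns the lower-triangular factor P with $P P' = \Sigma$.

```python
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

P = np.linalg.cholesky(Sigma)            # lower triangular, P @ P.T equals Sigma

rng = np.random.default_rng(3)
X = rng.standard_normal((2, 100_000))    # columns are N(0, I) draws
draws = mu[:, None] + P @ X              # columns are N(mu, Sigma) draws

sample_mean = draws.mean(axis=1)
sample_cov = np.cov(draws)
```

The sample mean and covariance of `draws` should be close to `mu` and `Sigma`.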

5.2.3 Simple Accept-Reject

Accept-Reject
(Naive but Revealing Example)
We want to sample x ∼ f (x)
Draw:

ν1 ∼ U (α, β)

ν2 ∼ U (0, h)

If the point $(\nu_1, \nu_2)$ lies under the density $f(x)$, that is, if $\nu_2 \leq f(\nu_1)$, then take $x = \nu_1$

Otherwise reject and repeat
Graphical Representation of Naive Accept-Reject

Figure 5.3: Naive Accept-Reject Method

Accept-Reject
General (Non-Naive) Case
We want to sample x ∼ f (x) but we only know how to sample x ∼ g(x).
Let M satisfy $\frac{f(x)}{g(x)} \leq M < \infty$, $\forall x$. Then:

1. Draw $x_0 \sim g(x)$
2. Take $x = x_0$ w.p. $\frac{f(x_0)}{g(x_0) M}$; else go to 1.

(Allows for “blanket” functions g(·) more efficient than the uniform)
Note that accept-reject requires that we be able to evaluate f (x) and g(x) for any x.
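
The general accept-reject steps can be sketched as below. The target $f(x) = 6x(1-x)$ on [0, 1] (a Beta(2,2) density), the uniform blanket $g$, and the bound $M = 1.5$ are assumptions chosen for illustration, as is the function name.

```python
import numpy as np

def accept_reject(f, g_sample, g_pdf, M, n, rng):
    """Draw n values from f using proposals from g, where f(x)/g(x) <= M."""
    out = []
    while len(out) < n:
        x0 = g_sample(rng)                           # step 1: draw x0 ~ g
        if rng.uniform() < f(x0) / (g_pdf(x0) * M):  # step 2: accept w.p. f/(g M)
            out.append(x0)
    return np.array(out)

f = lambda x: 6.0 * x * (1.0 - x)    # target density; its maximum is 1.5 at x = 0.5

rng = np.random.default_rng(11)
draws = accept_reject(f,
                      g_sample=lambda r: r.uniform(),   # g is U(0, 1)
                      g_pdf=lambda x: 1.0,
                      M=1.5, n=20_000, rng=rng)
```

The accepted draws should have mean near 0.5 and variance near 0.05, the Beta(2,2) moments. On average about $1/M$ of the proposals are accepted, which is why a tight blanket matters.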
Mixtures
On any draw i,

x ∼ fi (x), w.p. pi

where

0 ≤ pi ≤ 1, ∀ i

\[ \sum_{i=1}^{N} p_i = 1 \]

For example, all of the fi could be uniform, but with different location and scale.

5.3 SIMULATING EXACT AND APPROXIMATE REALIZATIONS OF TIME SERIES PROCESSES

Simulating Time Series Processes

VAR(1) simulation is key (state transition dynamics).

1. Nonparametric: Exact realization via Cholesky factorization of desired covariance matrix. One need only specify the autocovariances.

2. Parametric I: Exact realization via Cholesky factorization of covariance matrix corresponding to desired parametric model

3. Parametric II: Approximate realization via arbitrary startup value with early realizations discarded

4. Parametric III: Exact realization via drawing startup values from unconditional density

5.4 MORE

Slice Sampling
Copulas and Sampling From a General Joint Density

5.5 NOTES
Chapter Six

Bayesian Analysis by Simulation

6.1 BAYESIAN BASICS

6.2 COMPARATIVE ASPECTS OF BAYESIAN AND FREQUENTIST PARADIGMS

Overarching Paradigm (T → ∞)

\[ \sqrt{T} (\hat{\theta} - \theta) \sim N(0, \Sigma) \]

Shared by classical and Bayesian, but interpretations differ.


Classical: θ̂ random, θ fixed
Bayesian: θ̂ fixed, θ random
Classical: Characterize the distribution of the random data ($\hat{\theta}$) conditional on fixed “true”
$\theta$. Focus on the likelihood max ($\hat{\theta}_{ML}$) and likelihood curvature in an $\epsilon$-neighborhood of the
max.
Bayesian: Characterize the distribution of the random θ conditional on fixed “true” data
(θ̂). Examine the entire likelihood.
Bayesian Computational Mechanics
Data y ≡ {y1 , . . . , yT }
Bayes’ Theorem:
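\[ f(\theta / y) = \frac{f(y / \theta) f(\theta)}{f(y)} \]

or

\[ f(\theta / y) = c\, f(y / \theta) f(\theta) \]

where $c^{-1} = \int_\theta f(y / \theta) f(\theta)$

\[ f(\theta / y) \propto f(y / \theta) f(\theta) \]

\[ p(\theta / y) \propto L(\theta / y)\, g(\theta) \]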
posterior ∝ likelihood · prior
Classical Paradigm (T → ∞)

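\[ \sqrt{T} (\hat{\theta}_{ML} - \theta) \xrightarrow{d} N\left( 0, \left( \frac{I_{EX,H}(\theta)}{T} \right)^{-1} \right) \]

or more crudely

\[ \sqrt{T} (\hat{\theta}_{ML} - \theta) \sim N(0, \Sigma) \]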

(Enough said.)
Bayesian Paradigm (T → ∞)
(Note that as T → ∞, p(θ/y) ≈ L(θ/y),
so the likelihood below can be viewed as the posterior.)

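Expand $\ln L(\theta / y)$ around fixed $\hat{\theta}_{ML}$:

\[ \ln L(\theta / y) \approx \ln L(\hat{\theta}_{ML} / y) + S(\hat{\theta}_{ML} / y)' (\theta - \hat{\theta}_{ML}) - \frac{1}{2} (\theta - \hat{\theta}_{ML})' I_{OB,H}(\hat{\theta}_{ML} / y) (\theta - \hat{\theta}_{ML}) \]

But $S(\hat{\theta}_{ML} / y) \equiv 0$, so:

\[ \ln L(\theta / y) \approx \ln L(\hat{\theta}_{ML} / y) - \frac{1}{2} (\theta - \hat{\theta}_{ML})' I_{OB,H}(\hat{\theta}_{ML} / y) (\theta - \hat{\theta}_{ML}) \]

Exponentiating and neglecting the expansion remainder, we then have:

\[ L(\theta / y) \propto \exp\left( -\frac{1}{2} (\theta - \hat{\theta}_{ML})' I_{OB,H}(\hat{\theta}_{ML} / y) (\theta - \hat{\theta}_{ML}) \right) \]

or

\[ L(\theta / y) \propto N\left( \hat{\theta}_{ML},\, I_{OB,H}^{-1}(\hat{\theta}_{ML} / y) \right) \]

Or, a posteriori, $\sqrt{T} (\theta - \hat{\theta}_{ML}) \sim N(0, \Sigma)$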
Bayesian estimation and model comparison
Estimation:

Full posterior density


Highest posterior density intervals
Posterior mean, median, mode (depending on loss function)

Model comparison:
\[ \underbrace{\frac{p(M_i | y)}{p(M_j | y)}}_{\text{posterior odds}} = \underbrace{\frac{p(y | M_i)}{p(y | M_j)}}_{\text{Bayes factor}} \cdot \underbrace{\frac{p(M_i)}{p(M_j)}}_{\text{prior odds}} \]

The Bayes factor is a ratio of marginal likelihoods, or marginal data densities.


– For comparison, simply report the posterior odds
– For selection, invoke 0-1 loss, which implies selection of the model with highest posterior
probability.
– Hence select the model with highest marginal likelihood if prior odds are 1:1.
Understanding the Marginal Likelihood
As a penalized log likelihood:
As T → ∞, the log marginal likelihood is approximately the maximized log likelihood minus
$K \ln T / 2$. It’s the SIC!
As a predictive likelihood:

\[ P(y) = P(y_1, ..., y_T) = \prod_{t=1}^{T} P(y_t | y_{1:t-1}) \]

\[ \Longrightarrow \ln P(y) = \sum_{t=1}^{T} \ln P(y_t | y_{1:t-1}) \]

\[ = \sum_{t=1}^{T} \ln \int P(y_t | \theta, y_{1:t-1}) P(\theta | y_{1:t-1})\, d\theta \]

Bayesian model averaging:


Weight by posterior model probabilities:

P (yt+1 |y1:t ) = πit P (yt+1 |y1:t , Mi ) + πjt P (yt+1 |y1:t , Mj )

As T → ∞, the distinction between model averaging and selection vanishes, as one π goes
to 0 and the other goes to 1.
If one of the models is true, then both model selection and model averaging are consistent
for the true model. Otherwise they’re consistent for the X-optimal approximation to the
truth. Does X = KLIC?

6.3 MARKOV CHAIN MONTE CARLO

Metropolis-Hastings
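We want to draw S values of $\theta$ from $p(\theta)$. Initialize chain at $\theta^{(0)}$ and burn it in.
1. Draw $\theta^*$ from proposal density $q(\theta; \theta^{(s-1)})$
2. Calculate the acceptance probability $\alpha(\theta^{(s-1)}, \theta^*)$
3. Set

\[
\theta^{(s)} =
\begin{cases}
\theta^* & \text{w.p. } \alpha(\theta^{(s-1)}, \theta^*) \quad \text{“accept”} \\
\theta^{(s-1)} & \text{w.p. } 1 - \alpha(\theta^{(s-1)}, \theta^*) \quad \text{“reject”}
\end{cases}
\]

4. Repeat 1-3, s = 1, ..., S

The question, of course, is what to use for step 2.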

6.3.1 Metropolis-Hastings Independence Chain

Fixed proposal density:

q(θ; θ(s−1) ) = q ∗ (θ)

Acceptance probability:

\[ \alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^*)\; q^*(\theta = \theta^{(s-1)})}{p(\theta = \theta^{(s-1)})\; q^*(\theta = \theta^*)},\; 1 \right] \]

6.3.2 Metropolis-Hastings Random Walk Chain

Random walk proposals:

θ∗ = θ(s−1) + ε

Acceptance probability reduces to:

\[ \alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^*)}{p(\theta = \theta^{(s-1)})},\; 1 \right] \]
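
A random walk chain takes only a few lines. The sketch below is an illustration under stated assumptions: the target is a N(3, 1) density known only up to a constant, the proposal increment is Gaussian with a hypothetical scale of 2.4, and the burn-in length is arbitrary. Working with the log of the density kernel avoids numerical underflow in the ratio.

```python
import numpy as np

def rw_metropolis(log_kernel, theta0, n_draws, step, rng):
    """Random walk Metropolis using only an unnormalized log density."""
    theta = theta0
    draws = np.empty(n_draws)
    n_accept = 0
    for s in range(n_draws):
        proposal = theta + step * rng.standard_normal()   # theta* = theta + eps
        # log of min[p(theta*) / p(theta^(s-1)), 1]
        log_alpha = min(0.0, log_kernel(proposal) - log_kernel(theta))
        if np.log(1.0 - rng.uniform()) < log_alpha:       # accept w.p. alpha
            theta = proposal
            n_accept += 1
        draws[s] = theta                                  # rejected: repeat old value
    return draws, n_accept / n_draws

log_kernel = lambda th: -0.5 * (th - 3.0) ** 2   # N(3, 1) up to a constant

rng = np.random.default_rng(5)
draws, accept_rate = rw_metropolis(log_kernel, theta0=0.0,
                                   n_draws=60_000, step=2.4, rng=rng)
kept = draws[10_000:]                             # discard the burn-in draws
```

After burn-in, the retained draws should have mean near 3 and variance near 1, even though the normalizing constant was never supplied.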

6.3.3 More

Burn-in, Sampling, and Dependence


“total simulation” = “burn-in” + “sampling”
Questions:
How to assess convergence to steady state?
In the Markov chain case, why not do something like the following. Whenever time t
is a multiple of m, use a distribution-free non-parametric (randomization) test for equality
of distributions to test whether the unknown distribution $f_1$ of $x_t, ..., x_{t-(m/2)}$ equals
the unknown distribution $f_2$ of $x_{t-(m/2)-1}, ..., x_{t-m}$. If, for example, we pick m = 20,000,
then whenever time t is a multiple of 20,000 we would test equality of the distributions of
$x_t, ..., x_{t-10000}$ and $x_{t-10001}, ..., x_{t-20000}$. We declare arrival at the steady state when the
null is not rejected. Or something like that.
Of course the Markov chain is serially correlated, but who cares, as we’re only trying to
assess equality of unconditional distributions. That is, randomizations of $x_t, ..., x_{t-(m/2)}$ and
of $x_{t-(m/2)-1}, ..., x_{t-m}$ destroy the serial correlation, but so what?
How to handle dependence in the sampled chain?
Better to run one long chain or many shorter parallel chains?
A Useful Property of Accept-Reject Algorithms
(e.g., Metropolis)

Metropolis requires knowing the density of interest only up to a constant, because the
acceptance probability is governed by the RATIO p(θ = θ∗ )/p(θ = θ(s−1) ). This will turn
out to be important for Bayesian analysis.
Metropolis-Hastings (Discrete)
For desired π, we want to find P such that πP = π. It is sufficient to find P such that
πi Pij = πj Pji . Suppose we’ve arrived at zi . Use symmetric, irreducible transition matrix
Q = [Qij ] to generate proposals. That is, draw proposal zj using probabilities in ith row of
Q.
Move to zj w.p. αij , where:
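\[
\alpha_{ij} =
\begin{cases}
1, & \text{if } \frac{\pi_j}{\pi_i} \geq 1 \\
\frac{\pi_j}{\pi_i}, & \text{otherwise}
\end{cases}
\]

Equivalently, move to $z_j$ w.p. $\alpha_{ij}$, where:

\[ \alpha_{ij} = \min\left( \frac{\pi_j}{\pi_i},\; 1 \right) \]

Metropolis-Hastings, Continued...
This defines a Markov chain P with:

\[
P_{ij} =
\begin{cases}
\alpha_{ij} Q_{ij}, & \text{for } i \neq j \\
1 - \sum_{j \neq i} P_{ij}, & \text{for } i = j
\end{cases}
\]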

Iterate this chain to convergence and start sampling from π.


Blocking strategies: Yu and Meng (2010)
Blocking MH algorithms: Ed Herbst
Siddhartha Chib and Srikanth Ramamurthy. Tailored randomized block MCMC methods with application to DSGE models. Journal of Econometrics, 155(1):19-38, 2010.
Vasco Curdia and Ricardo Reis. Correlated disturbances and U.S. business cycles. 2009.
Nikolay Iskrev. Evaluating the information matrix in linearized DSGE models. Economics Letters, 99:607-610, 2008.
Robert Kohn, Paolo Giordani, and Ingvar Strid. Adaptive hybrid Metropolis-Hastings samplers for DSGE models. Working Paper, 2010.
G. O. Roberts and S.K. Sahu. Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society. Series B (Methodological), 59(2):291-317, 1997.

6.3.4 Gibbs and Metropolis-Within-Gibbs

Bivariate Gibbs Sampling


We want to sample from f (z) = f (z1 , z2 )
Initialize (j = 0) using $z_2^0$

Gibbs iteration j = 1:
a. Draw $z_1^1$ from $f(z_1 | z_2^0)$
b. Draw $z_2^1$ from $f(z_2 | z_1^1)$
Repeat j = 2, 3, ....
Theorem (Clifford-Hammersley): $\lim_{j \to \infty} f(z^j) = f(z)$
Useful if/when conditionals are known and easy to sample from, but joint and marginals
are not. (This happens a lot in Bayesian analysis.)
General Gibbs Sampling
We want to sample from $f(z) = f(z_1, z_2, ..., z_k)$
Initialize (j = 0) using $z_2^0, z_3^0, ..., z_k^0$
Gibbs iteration j = 1:
a. Draw $z_1^1$ from $f(z_1 | z_2^0, ..., z_k^0)$
b. Draw $z_2^1$ from $f(z_2 | z_1^1, z_3^0, ..., z_k^0)$
c. Draw $z_3^1$ from $f(z_3 | z_1^1, z_2^1, z_4^0, ..., z_k^0)$
...
k. Draw $z_k^1$ from $f(z_k | z_1^1, ..., z_{k-1}^1)$
Repeat j = 2, 3, ....
Again, $\lim_{j \to \infty} f(z^j) = f(z)$
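
The bivariate case can be sketched with a target whose conditionals are known exactly. The example assumes a standard bivariate normal with correlation $\rho$, for which $z_1 | z_2 \sim N(\rho z_2, 1 - \rho^2)$ and symmetrically for $z_2 | z_1$. The target, $\rho = 0.6$, the burn-in length, and the function name are all illustrative choices.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_draws, rng, z2_init=0.0):
    """Alternate draws from f(z1 | z2) and f(z2 | z1) for a bivariate normal."""
    cond_sd = np.sqrt(1.0 - rho**2)
    z = np.empty((n_draws, 2))
    z2 = z2_init                                         # initialize using z_2^0
    for j in range(n_draws):
        z1 = rho * z2 + cond_sd * rng.standard_normal()  # a. draw z1 | z2
        z2 = rho * z1 + cond_sd * rng.standard_normal()  # b. draw z2 | z1
        z[j] = (z1, z2)
    return z

rng = np.random.default_rng(9)
z = gibbs_bivariate_normal(rho=0.6, n_draws=50_000, rng=rng)[1_000:]
sample_corr = np.corrcoef(z.T)[0, 1]
```

After discarding early iterations, the sample correlation of the retained pairs should be close to 0.6, although only the two conditionals were ever sampled.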
Metropolis Within Gibbs
Gibbs breaks a big draw into lots of little (conditional) steps. If you’re lucky, those little
steps are simple.
If/when a Gibbs step is difficult, i.e., it’s not clear how to sample from the relevant
conditional, it can be done by Metropolis.
(”Metropolis within Gibbs”)
Metropolis is more general but also more tedious, so only use it when you must.
Composition
We may want (x1 , y1 ), ..., (xN , yN ) ∼ iid from f (x, y)
Or we may want y1 , ..., yN ∼ iid from f (y)
They may be hard to sample from directly.
But sometimes it’s easy to:
Draw x∗ ∼ f (x)
Draw y ∗ ∼ f (y|x∗ )
Then:
(x1 , y1 ), ..., (xN , yN ) ∼ iid f (x, y)
(y1 , ..., yN ) ∼ iid f (y)

6.4 CONJUGATE BAYESIAN ANALYSIS OF LINEAR REGRESSION

Bayes for Gaussian Regression with Conjugate Priors


y = Xβ + ε
ε ∼ iid N (0, σ 2 I)
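Standard results:

\[ \hat{\beta}_{ML} = (X'X)^{-1} X'y \]

\[ \hat{\sigma}_{ML}^2 = \frac{e'e}{T} \]

\[ \hat{\beta}_{ML} \sim N\left( \beta, \sigma^2 (X'X)^{-1} \right) \]

\[ \frac{T \hat{\sigma}_{ML}^2}{\sigma^2} \sim \chi_{T-K}^2 \]

Bayesian Inference for $\beta / \sigma^2$
Prior:
$\beta / \sigma^2 \sim N(\beta_0, \Sigma_0)$
$g(\beta / \sigma^2) \propto \exp\left( -\frac{1}{2} (\beta - \beta_0)' \Sigma_0^{-1} (\beta - \beta_0) \right)$
Likelihood:
$L(\beta / \sigma^2, y) \propto \exp\left( \frac{-1}{2\sigma^2} (y - X\beta)' (y - X\beta) \right)$

Posterior:
$p(\beta / \sigma^2, y) \propto \exp\left( -\frac{1}{2} (\beta - \beta_0)' \Sigma_0^{-1} (\beta - \beta_0) - \frac{1}{2\sigma^2} (y - X\beta)' (y - X\beta) \right)$
This is the kernel of a normal distribution (*Problem*):
$\beta / \sigma^2, y \sim N(\beta_1, \Sigma_1)$
where
$\beta_1 = \left( \Sigma_0^{-1} + \sigma^{-2} (X'X) \right)^{-1} \left( \Sigma_0^{-1} \beta_0 + \sigma^{-2} (X'X) \hat{\beta}_{ML} \right)$
$\Sigma_1 = \left( \Sigma_0^{-1} + \sigma^{-2} (X'X) \right)^{-1}$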
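Gamma and Inverse Gamma Refresher

\[ z_t \overset{iid}{\sim} N\left( 0, \frac{1}{\delta} \right), \quad x = \sum_{t=1}^{v} z_t^2 \;\Rightarrow\; x \sim \Gamma\left( x; \frac{v}{2}, \frac{\delta}{2} \right) \]

(Note $\delta = 1 \Rightarrow x \sim \chi_v^2$, so $\chi^2$ is a special case of $\Gamma$)

\[ \Gamma\left( x; \frac{v}{2}, \frac{\delta}{2} \right) \propto x^{\frac{v}{2} - 1} \exp\left( \frac{-x\delta}{2} \right) \]

\[ E(x) = \frac{v}{\delta} \]

\[ var(x) = \frac{2v}{\delta^2} \]

$x \sim \Gamma^{-1}\left( \frac{v}{2}, \frac{\delta}{2} \right)$ (“inverse gamma”) $\Leftrightarrow \frac{1}{x} \sim \Gamma\left( \frac{v}{2}, \frac{\delta}{2} \right)$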
Bayesian Inference for σ 2 /β
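Prior:
$\frac{1}{\sigma^2} / \beta \sim \Gamma\left( \frac{v_0}{2}, \frac{\delta_0}{2} \right)$
$g\left( \frac{1}{\sigma^2} / \beta \right) \propto \left( \frac{1}{\sigma^2} \right)^{\frac{v_0}{2} - 1} \exp\left( -\frac{\delta_0}{2\sigma^2} \right)$

(Independent of $\beta$, but write $\sigma^2 / \beta$ for completeness.)

Likelihood:
$L\left( \frac{1}{\sigma^2} / \beta, y \right) \propto \left( \sigma^2 \right)^{-T/2} \exp\left( -\frac{1}{2\sigma^2} (y - X\beta)' (y - X\beta) \right)$

(*Problem*: In contrast to $L(\beta / \sigma^2, y)$ earlier, we don’t absorb the $(\sigma^2)^{-T/2}$ term into the
constant of proportionality. Why?)
Hence (*Problem*):
$p\left( \frac{1}{\sigma^2} / \beta, y \right) \propto \left( \frac{1}{\sigma^2} \right)^{\frac{v_1}{2} - 1} \exp\left( \frac{-\delta_1}{2\sigma^2} \right)$
or $\frac{1}{\sigma^2} / \beta, y \sim \Gamma\left( \frac{v_1}{2}, \frac{\delta_1}{2} \right)$

$v_1 = v_0 + T$
$\delta_1 = \delta_0 + (y - X\beta)' (y - X\beta)$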
Bayesian Pros Thus Far

1. Feels sensible to focus on p(θ/y). Classical relative frequency in repeated samples


replaced with subjective degree of belief conditional on the single sample actually
obtained

2. Exact finite-sample full-density inference

Bayesian Cons Thus Far

1. From where does the prior come? How to elicit prior distributions?

2. How to do an “objective” analysis?


(e.g. what is an “uninformative” prior? Uniform?)
(Note, however, that priors can be desirable and helpful. See, for example, the cartoon
at http://fxdiebold.blogspot.com/2014/04/more-from-xkcdcom.html)

3. We still don’t have the marginal posteriors that we really want: p(β, σ 2 /y), p(β/y).
– Problematic in any event!

6.5 GIBBS FOR SAMPLING MARGINAL POSTERIORS

Markov Chain Monte Carlo Solves the Problem!
0. Initialize: $\sigma^2 = (\sigma^2)^{(0)}$

Gibbs sampler at generic iteration j:
j1. Draw $\beta^{(j)}$ from $p(\beta^{(j)} / (\sigma^2)^{(j-1)}, y)$ $\quad (N(\beta_1, \Sigma_1))$
j2. Draw $(\sigma^2)^{(j)}$ from $p(\sigma^2 / \beta^{(j)}, y)$ $\quad \left( \Gamma^{-1}\left( \frac{v_1}{2}, \frac{\delta_1}{2} \right) \right)$

Iterate to convergence to steady state, and then estimate posterior moments of interest
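
The two conditional draws can be sketched as follows. This is a minimal illustration under stated assumptions: the simulated data, the prior settings ($\beta_0 = 0$, $\Sigma_0 = 100 I$, $v_0 = 2$, $\delta_0 = 1$), the burn-in length, and the function name are all hypothetical. The gamma draw treats $\delta_1 / 2$ as a rate, so NumPy's scale argument is $2 / \delta_1$.

```python
import numpy as np

def gibbs_regression(y, X, beta0, Sigma0, v0, delta0, n_draws, rng):
    """Gibbs sampler alternating beta | sigma^2, y and sigma^2 | beta, y."""
    T, K = X.shape
    XtX, Xty = X.T @ X, X.T @ y          # note (X'X) beta_ML = X'y
    Sigma0_inv = np.linalg.inv(Sigma0)
    sigma2 = 1.0                          # step 0: initialize (sigma^2)^(0)
    betas = np.empty((n_draws, K))
    sigma2s = np.empty(n_draws)
    for j in range(n_draws):
        # j1: beta | sigma^2, y ~ N(beta1, Sigma1)
        Sigma1 = np.linalg.inv(Sigma0_inv + XtX / sigma2)
        beta1 = Sigma1 @ (Sigma0_inv @ beta0 + Xty / sigma2)
        beta = beta1 + np.linalg.cholesky(Sigma1) @ rng.standard_normal(K)
        # j2: 1/sigma^2 | beta, y ~ Gamma(v1/2, delta1/2), i.e. sigma^2 is inverse gamma
        resid = y - X @ beta
        v1 = v0 + T
        delta1 = delta0 + resid @ resid
        sigma2 = 1.0 / rng.gamma(shape=v1 / 2.0, scale=2.0 / delta1)
        betas[j], sigma2s[j] = beta, sigma2
    return betas, sigma2s

# Simulated data: y = 1 + 2 x + noise with standard deviation 0.5.
rng = np.random.default_rng(0)
T = 200
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.standard_normal(T)

betas, sigma2s = gibbs_regression(y, X, beta0=np.zeros(2), Sigma0=100.0 * np.eye(2),
                                  v0=2.0, delta0=1.0, n_draws=3000, rng=rng)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
post_mean_beta = betas[500:].mean(axis=0)      # discard burn-in draws
post_mean_sigma2 = sigma2s[500:].mean()
```

With such a diffuse prior the posterior mean of $\beta$ sits essentially on top of the OLS estimate, and the posterior mean of $\sigma^2$ is near the residual variance.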

6.6 GENERAL STATE SPACE: CARTER-KOHN MULTI-MOVE GIBBS

Bayesian Analysis of State-Space Models

αt = T αt−1 + Rηt

yt = Zαt + εt

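\[
\begin{pmatrix} \eta_t \\ \varepsilon_t \end{pmatrix}
\overset{iid}{\sim} N\left( 0, \begin{pmatrix} Q & 0 \\ 0 & H \end{pmatrix} \right)
\]

Let $\tilde{\alpha}_T = (\alpha_1', \ldots, \alpha_T')'$, $\theta = (T', R', Z', Q', H')'$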


The key: Treat α̃T as a parameter, along with system matrices θ
Recall the State-Space Model in Density Form

$\alpha_t | \alpha_{t-1} \sim N(T \alpha_{t-1}, R Q R')$

yt |αt ∼ N (Zαt , H)

Recall the Kalman Filter in Density Form


Initialize at $a_0$, $P_0$
State prediction:
$\alpha_t | \tilde{y}_{t-1} \sim N(a_{t/t-1}, P_{t/t-1})$
$a_{t/t-1} = T a_{t-1}$
$P_{t/t-1} = T P_{t-1} T' + R Q R'$
State update:
$\alpha_t | \tilde{y}_t \sim N(a_t, P_t)$
$a_t = a_{t/t-1} + K_t (y_t - Z a_{t/t-1})$
$P_t = P_{t/t-1} - K_t Z P_{t/t-1}$
Data prediction:
$y_t | \tilde{y}_{t-1} \sim N(Z a_{t/t-1}, F_t)$
where $\tilde{y}_t = (y_1', ..., y_t')'$
Carter-Kohn Multi-move Gibbs Sampler
Let $\tilde{y}_T = (y_1', \ldots, y_T')'$
0. Initialize θ(0)
Gibbs sampler at generic iteration j:
j1. Draw from posterior $\tilde{\alpha}_T^{(j)} / \theta^{(j-1)}, \tilde{y}_T$ (“hard”)
j2. Draw from posterior $\theta^{(j)} / \tilde{\alpha}_T^{(j)}, \tilde{y}_T$ (“easy”)

Iterate to convergence, and then estimate posterior moments of interest

Just two Gibbs draws: (1) $\tilde{\alpha}_T$ parameter, (2) $\theta$ parameter
Multimove Gibbs Sampler, Step 2 ($\theta^{(j)} | \tilde{\alpha}_T^{(j)}, \tilde{y}_T$) (“easy”)
Conditional upon draws $\tilde{\alpha}_T^{(j)}$, sampling $\theta^{(j)}$ becomes a multivariate regression problem.
We have already seen how to do univariate regression. We can easily extend to multivariate
regression. The Gibbs sampler continues to work.
Multivariate Regression

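\[ y_{it} = x_t' \beta_i + \varepsilon_{it} \]

\[ (\varepsilon_{1,t}, ..., \varepsilon_{N,t})' \overset{iid}{\sim} N(0, \Sigma) \]

\[ i = 1, ..., N \]

\[ t = 1, ..., T \]

or

\[ \underbrace{Y}_{T \times N} = \underbrace{X}_{T \times K}\, \underbrace{B}_{K \times N} + \underbrace{E}_{T \times N} \]

OLS is still $(X'X)^{-1} X'Y$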


Bayesian Multivariate Regression
(Precisely Parallels Bayesian Univariate Regression)
B|Σ conjugate prior is multivariate normal:

vec(B)|Σ ∼ N (B0 , Σ0 )

Σ|B conjugate prior is inverse Wishart:

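\[ p(\Sigma^{-1} | vec(B)) \propto |\Sigma^{-1}|^{\frac{n-p-1}{2}} \exp\left( -\frac{1}{2} tr(\Sigma^{-1} V^{-1}) \right) \]

Inverse Wishart refresher (multivariate inverse gamma):

\[ X \sim W^{-1}(n, V) \Leftrightarrow X^{-1} \sim W(n, V) \]

where

\[ W(X; n, V) \propto |X|^{\frac{n-p-1}{2}} \exp\left( -\frac{1}{2} tr(X V^{-1}) \right) \]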
Bayesian Inference for B|Σ
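Prior:

$$p(\mathrm{vec}(B) | \Sigma) \propto \exp\left(-\frac{1}{2} \mathrm{tr}\left[\mathrm{vec}(B - B_0)' V_0^{-1} \mathrm{vec}(B - B_0)\right]\right)$$

Likelihood:

$$p(Y, X | B, \Sigma) \propto \exp\left(-\frac{1}{2} \sum_{t=1}^{T} (Y_t - B'X_t)' \Sigma^{-1} (Y_t - B'X_t)\right)$$

$$\propto \exp\left(-\frac{1}{2} \mathrm{tr}\left[\Sigma^{-1} (Y - XB)'(Y - XB)\right]\right)$$

$$\propto \exp\left(-\frac{1}{2} \mathrm{vec}(B - \hat{B})' \left(\Sigma^{-1} \otimes X'X\right) \mathrm{vec}(B - \hat{B})\right)$$

Posterior:

$$p(\mathrm{vec}(B) | \Sigma, Y) \propto \exp\left(-\frac{1}{2} \left[\mathrm{vec}(B - \hat{B})' \left(\Sigma^{-1} \otimes X'X\right) \mathrm{vec}(B - \hat{B}) + \mathrm{vec}(B - B_0)' V_0^{-1} \mathrm{vec}(B - B_0)\right]\right)$$

This is the kernel of a multivariate normal distribution:

$$\mathrm{vec}(B) | \Sigma, Y \sim N(\mathrm{vec}(B_1), V_1)$$

$$\mathrm{vec}(B_1) = V_1 \left[(\Sigma^{-1} \otimes X'X) \mathrm{vec}(\hat{B}) + V_0^{-1} \mathrm{vec}(B_0)\right], \qquad V_1 = \left[\Sigma^{-1} \otimes X'X + V_0^{-1}\right]^{-1}$$

and $\hat{B} = (X'X)^{-1}(X'Y)$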

Bayesian Inference for Σ|B


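Prior:

$$p(\Sigma^{-1} | \mathrm{vec}(B)) \propto |\Sigma^{-1}|^{\frac{n-p-1}{2}} \exp\left(-\frac{1}{2} \mathrm{tr}(\Sigma^{-1} V^{-1})\right)$$

Likelihood:

$$p(Y, X | B, \Sigma) \propto |\Sigma|^{-\frac{T}{2}} \exp\left(-\frac{1}{2} \mathrm{tr}\left[\Sigma^{-1} (Y - XB)'(Y - XB)\right]\right)$$

Posterior:

$$p(\Sigma^{-1} | \mathrm{vec}(B), Y) \propto |\Sigma^{-1}|^{\frac{T+n-p-1}{2}} \exp\left(-\frac{1}{2} \mathrm{tr}\left[\Sigma^{-1} \left((Y - XB)'(Y - XB) + V^{-1}\right)\right]\right)$$

This is the kernel of a Wishart distribution:

$$\Sigma^{-1} | \mathrm{vec}(B), Y \sim W\left(T + n, \left((Y - XB)'(Y - XB) + V^{-1}\right)^{-1}\right)$$

So $\Sigma$ is distributed inverse Wishart, $W^{-1}\left(T + n, \left((Y - XB)'(Y - XB) + V^{-1}\right)^{-1}\right)$.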


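Multimove Gibbs Sampler, Step 1 ($\tilde{\alpha}_T^{(j)} | \theta^{(j-1)}, \tilde{y}_T$) (“hard”)
For notational simplicity we write $p(\tilde{\alpha}_T | \tilde{y}_T)$, suppressing the dependence on $\theta$.

$$p(\tilde{\alpha}_T | \tilde{y}_T) = p(\alpha_T | \tilde{y}_T) \, p(\tilde{\alpha}_{T-1} | \alpha_T, \tilde{y}_T)$$

$$= p(\alpha_T | \tilde{y}_T) \, p(\alpha_{T-1} | \alpha_T, \tilde{y}_T) \, p(\tilde{\alpha}_{T-2} | \alpha_{T-1}, \alpha_T, \tilde{y}_T)$$

$$= \ldots$$

$$= p(\alpha_T | \tilde{y}_T) \prod_{t=1}^{T-1} p(\alpha_t | \alpha_{t+1}, \tilde{y}_t)$$

(*Problem*: Fill in the missing steps subsumed under “. . . ”)
So, to draw from $p(\tilde{\alpha}_T | \tilde{y}_T)$, we need to be able to draw from $p(\alpha_T | \tilde{y}_T)$ and $p(\alpha_t | \alpha_{t+1}, \tilde{y}_t)$, $t = 1, \ldots, (T - 1)$
Multimove Gibbs sampler, Continued
The key is to work backward:
Draw from $p(\alpha_T | \tilde{y}_T)$,
then from $p(\alpha_{T-1} | \alpha_T, \tilde{y}_{T-1})$,
then from $p(\alpha_{T-2} | \alpha_{T-1}, \tilde{y}_{T-2})$,
etc.
Time T draw is easy:
$p(\alpha_T | \tilde{y}_T)$ is $N(a_{T,T}, P_{T,T})$
(where the Kalman filter delivers $a_{T,T}$ and $P_{T,T}$)
Earlier-time draws are harder:
How to get $p(\alpha_t | \alpha_{t+1}, \tilde{y}_t)$, $t = (T - 1), \ldots, 1$?
Multimove Gibbs sampler, Continued
It can be shown that (*Problem*):
$p(\alpha_t | \alpha_{t+1}, \tilde{y}_t)$, $t = (T - 1), \ldots, 1$, is $N(a_{t/t,\alpha_{t+1}}, P_{t/t,\alpha_{t+1}})$
where

$$a_{t/t,\alpha_{t+1}} = E(\alpha_t | \tilde{y}_t, \alpha_{t+1}) = E(\alpha_t | a_t, \alpha_{t+1}) = a_t + P_t T' (T P_t T' + RQR')^{-1} (\alpha_{t+1} - T a_t)$$

$$P_{t/t,\alpha_{t+1}} = cov(\alpha_t | \tilde{y}_t, \alpha_{t+1}) = cov(\alpha_t | a_t, \alpha_{t+1}) = P_t - P_t T' (T P_t T' + RQR')^{-1} T P_t$$

(With $R = I$ the inverted term is $T P_t T' + Q$.)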
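The forward-filter, backward-sample recipe above can be sketched for a scalar state. This is a minimal illustration, not the author's implementation: the function names, the scalar AR(1) state with noisy observation, and all parameter values are assumptions chosen for the example, and the Kalman gain and prediction-error variance are taken in their standard forms $K_t = P_{t/t-1} Z' F_t^{-1}$ and $F_t = Z P_{t/t-1} Z' + H$.

```python
import numpy as np

def kalman_filter(y, T, Z, RQR, H, a0, P0):
    """Scalar Kalman filter returning filtered means a_t and variances P_t."""
    n = len(y)
    a, P = np.empty(n), np.empty(n)
    a_prev, P_prev = a0, P0
    for t in range(n):
        a_pred = T * a_prev                      # a_{t/t-1} = T a_{t-1}
        P_pred = T * P_prev * T + RQR            # P_{t/t-1} = T P_{t-1} T' + RQR'
        F = Z * P_pred * Z + H                   # prediction-error variance F_t
        K = P_pred * Z / F                       # Kalman gain K_t
        a[t] = a_pred + K * (y[t] - Z * a_pred)  # state update
        P[t] = P_pred - K * Z * P_pred
        a_prev, P_prev = a[t], P[t]
    return a, P

def carter_kohn_draw(y, T, Z, RQR, H, a0, P0, rng):
    """Draw the whole state path: filter forward, then sample backward."""
    a, P = kalman_filter(y, T, Z, RQR, H, a0, P0)
    n = len(y)
    alpha = np.empty(n)
    alpha[n - 1] = rng.normal(a[n - 1], np.sqrt(P[n - 1]))  # time-T draw
    for t in range(n - 2, -1, -1):
        gain = P[t] * T / (T * P[t] * T + RQR)
        mean = a[t] + gain * (alpha[t + 1] - T * a[t])
        var = P[t] - gain * T * P[t]
        alpha[t] = rng.normal(mean, np.sqrt(var))
    return alpha

# Hypothetical scalar example: AR(1) state observed with noise.
rng = np.random.default_rng(0)
n_obs, T_coef, Z_coef, RQR_var, H_var = 200, 0.9, 1.0, 1.0, 0.5
state = np.zeros(n_obs)
for t in range(1, n_obs):
    state[t] = T_coef * state[t - 1] + rng.normal(0.0, np.sqrt(RQR_var))
y = Z_coef * state + rng.normal(0.0, np.sqrt(H_var), size=n_obs)

draw = carter_kohn_draw(y, T_coef, Z_coef, RQR_var, H_var, a0=0.0, P0=10.0, rng=rng)
```

In a full Gibbs sampler this draw would alternate with a draw of the system parameters conditional on the sampled state path.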
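Expanding $S(\hat{\theta}_{ML})$ around $\theta$ yields:
$$S(\hat{\theta}_{ML}) \approx S(\theta) + S'(\theta)(\hat{\theta}_{ML} - \theta) = S(\theta) + H(\theta)(\hat{\theta}_{ML} - \theta).$$
Noting that $S(\hat{\theta}_{ML}) \equiv 0$ and taking expectations yields:
$$0 \approx S(\theta) - I_{EX,H}(\theta)(\hat{\theta}_{ML} - \theta)$$
or
$$(\hat{\theta}_{ML} - \theta) \approx I_{EX,H}^{-1}(\theta) \, S(\theta).$$
Using $S(\theta) \overset{a}{\sim} N(0, I_{EX,H}(\theta))$ then implies:
$$(\hat{\theta}_{ML} - \theta) \overset{a}{\sim} N(0, I_{EX,H}^{-1}(\theta))$$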
or
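Case 3: $\beta$ and $\sigma^2$
Joint prior $g(\beta, \frac{1}{\sigma^2}) = g(\beta | \frac{1}{\sigma^2}) \, g(\frac{1}{\sigma^2})$
where $\beta | \frac{1}{\sigma^2} \sim N(\beta_0, \Sigma_0)$ and $\frac{1}{\sigma^2} \sim G(\frac{v_0}{2}, \frac{\delta_0}{2})$
HW: Show that the joint posterior,
$p(\beta, \frac{1}{\sigma^2} | y) = g(\beta, \frac{1}{\sigma^2}) \, L(\beta, \frac{1}{\sigma^2} | y)$
can be factored as $p(\beta | \frac{1}{\sigma^2}, y) \, p(\frac{1}{\sigma^2} | y)$
where $\beta | \frac{1}{\sigma^2}, y \sim N(\beta_1, \Sigma_1)$
and $\frac{1}{\sigma^2} | y \sim G(\frac{v_1}{2}, \frac{\delta_1}{2})$,
and derive expressions for $\beta_1$, $\Sigma_1$, $v_1$, $\delta_1$
in terms of $\beta_0$, $\Sigma_0$, $\delta_0$, $x$, and $y$.

Moreover, the key marginal posterior
$P(\beta | y) = \int_0^\infty p(\beta, \frac{1}{\sigma^2} | y) \, d\sigma^2$ is multivariate t.
Implement the Bayesian methods via Gibbs sampling.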

6.7 EXERCISES, PROBLEMS AND COMPLEMENTS

6.8 NOTES
Chapter Seven

(Much) More Simulation

7.1 ECONOMIC THEORY BY SIMULATION: “CALIBRATION”

7.2 ECONOMETRIC THEORY BY SIMULATION: MONTE CARLO AND VARIANCE REDUCTION

Monte Carlo
Key: Solve deterministic problems by simulating stochastic analogs, with the analytical
unknowns reformulated as parameters to be estimated.
Many important discoveries made by Monte Carlo.
Also, numerous mistakes avoided by Monte Carlo!
The pieces:
(I) Experimental Design
(II) Simulation (including variance reduction techniques)
(III) Analysis: Response surfaces (which also reduce variance)

7.2.1 Experimental Design

(I) Experimental Design

• Data-Generating Process (DGP), M (θ)

• Objective
• e.g., MSE of an estimator:

$$E[(\theta - \hat{\theta})^2] = g(\theta, T)$$

• e.g., Power function of a test:

π = g(θ, T )

Experimental Design, Continued



• Selection of (θ, T ) Configurations to Explore


a. Do we need a “full design”? In general many values of θ and T need be explored.
But if, e.g., g(θ, T ) = g1 (θ) + g2 (T ), then only explore θ values for a single T , and T
values for a single θ (i.e., there are no interactions).
b. Is there parameter invariance ($g(\theta, T)$ unchanging in $\theta$)? e.g., If $y = X\beta + \varepsilon$,
$\varepsilon \sim N(0, \sigma^2 \Omega(\alpha))$, then the exact finite-sample distributions of $\frac{\hat{\beta}_{MLE} - \beta}{\sigma}$ and $\frac{\hat{\sigma}^2_{MLE}}{\sigma^2}$
are invariant to true $\beta$, $\sigma^2$. So vary only $\alpha$, leaving $\beta$ and $\sigma$ alone (e.g., set to 0 and
1). Be careful not to implicitly assume invariance regarding unexplored aspects of the
design (e.g., structure of X variables above).

Experimental Design, Continued

• Number of Monte Carlo Repetitions (N )


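e.g., MC computation of test size
nominal size $\alpha_0$, true size $\alpha$, estimator $\hat{\alpha} = \frac{\#rej}{N} = \frac{\sum_{i=1}^{N} I(rej_i)}{N}$

Normal approximation:
$$\hat{\alpha} \overset{a}{\sim} N\left(\alpha, \frac{\alpha(1 - \alpha)}{N}\right)$$

$$P\left(\alpha \in \left[\hat{\alpha} \pm 1.96 \sqrt{\frac{\alpha(1 - \alpha)}{N}}\right]\right) = .95$$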

Suppose we want the 95% CI for α to be .01 in length.

Experimental Design, Continued


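Strategy 1 (Use $\alpha = \alpha_0$; not conservative enough if $\alpha > \alpha_0$):

$$2 \times 1.96 \sqrt{\frac{\alpha_0 (1 - \alpha_0)}{N}} = .01$$

If $\alpha_0 = .05$, $N = 7299$

Strategy 2 (Use $\alpha = \frac{1}{2} = \arg\max_\alpha [\alpha(1 - \alpha)]$; conservative):

$$2 \times 1.96 \sqrt{\frac{\frac{1}{2} \cdot \frac{1}{2}}{N}} = .01 \Rightarrow N = 38416$$

Strategy 3 (Use $\alpha = \hat{\alpha}$; the obvious strategy)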
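The repetition counts quoted for Strategies 1 and 2 follow from solving the interval-length equation for N. A small sketch of that arithmetic is below; the function name and its arguments are illustrative choices, not part of the text.

```python
def required_reps(alpha, ci_length, z=1.96):
    """N for which alpha_hat +/- z*sqrt(alpha(1-alpha)/N) has the given total length."""
    return alpha * (1.0 - alpha) * (2.0 * z / ci_length) ** 2

n_nominal = required_reps(0.05, 0.01)       # Strategy 1: plug in nominal size
n_conservative = required_reps(0.50, 0.01)  # Strategy 2: worst case alpha = 1/2
print(round(n_nominal), round(n_conservative))
```

Strategy 3 would call the same function with a pilot estimate of the true size in place of `alpha`.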

7.2.2 Simulation

(II) Simulation
Running example: Monte Carlo integration

Definite integral: $\theta = \int_0^1 m(x) \, dx$
Key insight:
$\theta = \int_0^1 m(x) \, dx = E(m(x))$, $x \sim U(0, 1)$
Notation:
$\theta = E[m(x)]$
$\sigma^2 = var(m(x))$
Direct Simulation:
Arbitrary Function, Uniform Density
Generate N U(0, 1) deviates $x_i$, $i = 1, \ldots, N$
Form the N deviates $m_i = m(x_i)$, $i = 1, \ldots, N$

$$\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} m_i$$

$$\sqrt{N} (\hat{\theta} - \theta) \overset{d}{\to} N(0, \sigma^2)$$
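A minimal sketch of direct simulation for the uniform case follows. The integrand $m(x) = e^x$, the seed, and the number of draws are assumptions made for illustration; the standard error uses the sample analog of $\sigma^2 / N$.

```python
import numpy as np

def mc_integrate(m, n_draws, rng):
    """Estimate the integral of m over [0, 1] by averaging m at U(0,1) draws."""
    x = rng.uniform(0.0, 1.0, size=n_draws)
    m_vals = m(x)
    theta_hat = m_vals.mean()
    std_err = m_vals.std(ddof=1) / np.sqrt(n_draws)  # sample analog of sigma/sqrt(N)
    return theta_hat, std_err

rng = np.random.default_rng(0)
theta_hat, std_err = mc_integrate(np.exp, 100_000, rng)
print(theta_hat, std_err)  # compare with the exact value e - 1
```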

Direct Simulation General Case:


Arbitrary Function, Arbitrary Density

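$$\theta = E(m(x)) = \int m(x) f(x) \, dx$$

– Indefinite integral, arbitrary function $m(\cdot)$, arbitrary density $f(x)$

Draw $x_i \sim f(\cdot)$, and then form $m_i = m(x_i)$,

$$\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} m_i$$

$$\sqrt{N} (\hat{\theta} - \theta) \overset{d}{\to} N(0, \sigma^2)$$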

Direct Simulation Leading Case


Mean Function, Arbitrary Density
(e.g., Posterior Mean)

$$\theta = E(x) = \int x f(x) \, dx$$

– Indefinite integral, x has arbitrary density $f(x)$

Draw $x_i \sim f(\cdot)$

$$\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

$$\sqrt{N} (\hat{\theta} - \theta) \overset{d}{\to} N(0, \sigma^2)$$

7.2.3 Variance Reduction: Importance Sampling, Antithetics, Control Variates and Common Random Numbers

Importance Sampling to Facilitate Sampling


Sampling from $f(\cdot)$ may be difficult. So change to:

$$\theta = \int x \frac{f(x)}{g(x)} g(x) \, dx$$

where the “importance sampling density” $g(\cdot)$ is easy to sample
Draw $x_i \sim g(\cdot)$, and then form $m_i = \frac{f(x_i)}{g(x_i)} x_i$, $i = 1, \ldots, N$

$$\hat{\theta}_* = \frac{1}{N} \sum_{i=1}^{N} m_i = \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{g(x_i)} x_i = \sum_{i=1}^{N} w_i x_i$$

$$\sqrt{N} (\hat{\theta}_* - \theta) \overset{d}{\to} N(0, \sigma_*^2)$$

– Avg of $f(x)$ draws replaced by weighted avg of $g(x)$ draws

– Weight $w_i$ reflects relative heights of $f(x_i)$ and $g(x_i)$
Importance Sampling
Consider the classic problem of calculating the mean E(y) of r.v. y with marginal density:

$$f(y) = \int f(y|x) f(x) \, dx.$$

The standard solution is to form:

$$\widehat{E}(y) = \frac{1}{N} \sum_{i=1}^{N} f(y|x_i)$$

where the $x_i$ are iid draws from $f(x)$.

But $f(x)$ might be hard to sample from! What to do?

Importance Sampling
Write

$$f(y) = \int f(y|x) \frac{f(x)}{g(x)} g(x) \, dx,$$

where the “importance sampler,” $g(x)$, is easy to sample from.
Take

$$\widehat{E}(y) = \sum_{i=1}^{N} \frac{\frac{f(x_i)}{g(x_i)}}{\sum_{j=1}^{N} \frac{f(x_j)}{g(x_j)}} f(y|x_i) = \sum_{i=1}^{N} w_i f(y|x_i).$$

So importance sampling replaces a simple average of f (y|xi ) based on initial draws from
f (x) with a weighted average of f (y|xi ) based on initial draws from g(x), where the weights
wi reflect the relative heights of f (xi ) and g(xi ).
Indirect Simulation
“Variance-Reduction Techniques”
(“Swindles”)
Importance Sampling to Achieve Variance Reduction
Again we use:
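$$\theta = \int x \frac{f(x)}{g(x)} g(x) \, dx,$$

and again we arrive at

$$\sqrt{N} (\hat{\theta}_* - \theta) \overset{d}{\to} N(0, \sigma_*^2)$$

If $g(x)$ is chosen judiciously, $\sigma_*^2 \ll \sigma^2$

Key: Pick $g(x)$ s.t. $\frac{x f(x)}{g(x)}$ has small variance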
Importance Sampling Example
Let x ∼ N (0, 1), and estimate the mean of I(x > 1.96):
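$$\theta = E(I(x > 1.96)) = P(x > 1.96) = \int \underbrace{I(x > 1.96)}_{m(x)} \underbrace{\phi(x)}_{f(x)} \, dx$$

$$\hat{\theta} = \sum_{i=1}^{N} \frac{I(x_i > 1.96)}{N} \quad \text{(with variance } \sigma^2 \text{)}$$

Use importance sampler:

$$g(x) = N(1.96, 1)$$

$$P(x > 1.96) = \int \frac{I(x > 1.96) \, \phi(x)}{g(x)} g(x) \, dx$$

$$\hat{\theta}_* = \frac{\sum_{i=1}^{N} \frac{I(x_i > 1.96) \, \phi(x_i)}{g(x_i)}}{N} \quad \text{(with variance } \sigma_*^2 \text{)}$$

$$\frac{\sigma_*^2}{\sigma^2} \approx 0.06$$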
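The tail-probability example can be simulated directly. The sketch below compares the direct estimator with the importance-sampling estimator that draws from $N(1.96, 1)$; the helper `normal_pdf`, the seed, and the number of draws are assumptions for illustration, and the variance ratio it reports is a simulation estimate of $\sigma_*^2 / \sigma^2$.

```python
import numpy as np

def normal_pdf(x, mean=0.0):
    """Density of N(mean, 1) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(0)
n_draws = 200_000
c = 1.96

# Direct simulation: draw from f = N(0, 1) and average the indicator.
x_direct = rng.standard_normal(n_draws)
direct_terms = (x_direct > c).astype(float)

# Importance sampling: draw from g = N(1.96, 1) and reweight by phi/g.
x_is = rng.normal(c, 1.0, size=n_draws)
is_terms = (x_is > c) * normal_pdf(x_is) / normal_pdf(x_is, mean=c)

theta_direct = direct_terms.mean()
theta_is = is_terms.mean()
variance_ratio = is_terms.var(ddof=1) / direct_terms.var(ddof=1)
print(theta_direct, theta_is, variance_ratio)
```

Centering the sampler at the threshold puts roughly half of the draws in the region that matters, which is why the reweighted terms vary so much less than the raw indicator.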
Antithetic Variates
We average negatively correlated unbiased estimators of θ (Unbiasedness maintained,
variance reduced)
The key: If x ∼ symmetric(µ, v), then xi ± µ are equally likely
e.g., if x ∼ U (0, 1), so too is (1 − x)
e.g., if x ∼ N (0, v), so too is −x
Consider for example the case of zero-mean symmetric f (x)
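$$\theta = \int m(x) f(x) \, dx$$

Direct: $\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} m_i$ ($\hat{\theta}$ is based on $x_i$, $i = 1, \ldots, N$)

Antithetic: $\hat{\theta}_* = \frac{1}{2} \hat{\theta}(x) + \frac{1}{2} \hat{\theta}(-x)$
($\hat{\theta}(x)$ is based on $x_i$, $i = 1, \ldots, N/2$, and
$\hat{\theta}(-x)$ is based on $-x_i$, $i = 1, \ldots, N/2$)
Antithetic Variates, Cont’d
More concisely,

$$\hat{\theta}_* = \frac{2}{N} \sum_{i=1}^{N/2} k_i(x_i)$$

where:

$$k_i = \frac{1}{2} m(x_i) + \frac{1}{2} m(-x_i)$$

$$\sqrt{N} (\hat{\theta}_* - \theta) \overset{d}{\to} N(0, \sigma_*^2)$$

$$\sigma_*^2 = \frac{1}{4} var(m(x)) + \frac{1}{4} var(m(-x)) + \frac{1}{2} \underbrace{cov(m(x), m(-x))}_{< 0 \text{ for } m \text{ monotone incr.}}$$

Often $\sigma_*^2 \ll \sigma^2$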
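A hedged illustration of antithetic variates, using the uniform case mentioned above (if $x \sim U(0,1)$ then so is $1 - x$) rather than the zero-mean symmetric case. The integrand $e^x$, the seed, and the draw counts are assumptions; both estimators use the same total number of function evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000  # total function evaluations for each estimator

# Direct: N independent U(0,1) draws.
x = rng.uniform(size=n_draws)
direct_terms = np.exp(x)

# Antithetic: N/2 draws, each averaged with its mirror image 1 - u.
u = rng.uniform(size=n_draws // 2)
k = 0.5 * np.exp(u) + 0.5 * np.exp(1.0 - u)

theta_direct = direct_terms.mean()
theta_anti = k.mean()
var_direct = direct_terms.var(ddof=1) / n_draws   # estimated variance of theta_direct
var_anti = k.var(ddof=1) / (n_draws // 2)         # estimated variance of theta_anti
print(theta_direct, theta_anti, var_direct / var_anti)
```

Because $e^x$ is monotone increasing, $e^{u}$ and $e^{1-u}$ are negatively correlated, which is the source of the variance reduction.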

$$\theta = \int m(x) f(x) \, dx = \int g(x) f(x) \, dx + \int [m(x) - g(x)] f(x) \, dx$$

Control function $g(x)$ simple enough to integrate analytically and flexible enough to absorb
most of the variation in $m(x)$.

We just find the mean of $m(x) - g(x)$, where $g(x)$ has known mean and is highly correlated
with $m(x)$.
Control Variates

$$\hat{\theta} = \int g(x) \, dx + \frac{1}{N} \sum_{i=1}^{N} [m(x_i) - g(x_i)]$$

$$\sqrt{N} (\hat{\theta} - \theta) \overset{d}{\to} N(0, \sigma_*^2)$$

If $g(x)$ is chosen judiciously, $\sigma_*^2 \ll \sigma^2$.

Related method (conditioning): Find the mean of E(z|w) rather than the mean of z. The
two are of course the same (the mean conditional mean is the unconditional mean), but
var(E[z|w]) ≤ var(z).
Control Variate Example

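$$\theta = \int_0^1 e^x \, dx$$

Control variate: $g(x) = 1 + 1.7x$

$$\Rightarrow \int_0^1 g(x) \, dx = \left[ x + \frac{1.7}{2} x^2 \right]_0^1 = 1.85$$

$$\hat{\theta}_{direct} = \frac{1}{N} \sum_{i=1}^{N} e^{x_i}$$

$$\hat{\theta}_{CV} = 1.85 + \frac{1}{N} \sum_{i=1}^{N} [e^{x_i} - (1 + 1.7 x_i)]$$

$$\frac{var(\hat{\theta}_{direct})}{var(\hat{\theta}_{CV})} \approx 78$$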
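The control variate example can be simulated as follows. This is a sketch under assumed settings (seed, number of draws); the variance ratio it prints is a simulation estimate that depends on those settings, so it should not be expected to reproduce the quoted figure exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000
x = rng.uniform(size=n_draws)

g_integral = 1.0 + 1.7 / 2.0  # integral of g(x) = 1 + 1.7x over [0, 1]

direct_terms = np.exp(x)
cv_terms = g_integral + (np.exp(x) - (1.0 + 1.7 * x))

theta_direct = direct_terms.mean()
theta_cv = cv_terms.mean()
variance_ratio = direct_terms.var(ddof=1) / cv_terms.var(ddof=1)
print(theta_direct, theta_cv, variance_ratio)
```

Both estimators use the same draws, so the comparison isolates the effect of subtracting the control function.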
Common Random Numbers
We have discussed estimation of a single integral:
$$\int_0^1 f_1(x) \, dx$$

But interest often centers on the difference (or ratio) of two integrals:

$$\int_0^1 f_1(x) \, dx - \int_0^1 f_2(x) \, dx$$

The key: Evaluate each integral using the same random numbers.
Common Random Numbers in Estimator Comparisons
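Two estimators $\hat{\theta}$, $\tilde{\theta}$; true parameter $\theta_0$
Compare MSEs: $E(\hat{\theta} - \theta_0)^2$, $E(\tilde{\theta} - \theta_0)^2$
Expected difference: $E\left[(\hat{\theta} - \theta_0)^2 - (\tilde{\theta} - \theta_0)^2\right]$
Estimate:

$$\frac{1}{N} \sum_{i=1}^{N} \left[(\hat{\theta}_i - \theta_0)^2 - (\tilde{\theta}_i - \theta_0)^2\right]$$

Variance of estimate:

$$\frac{1}{N} var\left((\hat{\theta} - \theta_0)^2\right) + \frac{1}{N} var\left((\tilde{\theta} - \theta_0)^2\right) - \frac{2}{N} cov\left((\hat{\theta} - \theta_0)^2, (\tilde{\theta} - \theta_0)^2\right)$$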
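The covariance term is what common random numbers exploit. The sketch below compares the sample mean and sample median as estimators of a normal location parameter, once with both estimators computed on the same simulated samples and once with independent samples. The choice of estimators, sample size, and seed are assumptions for illustration.

```python
import numpy as np

def squared_errors(samples, theta0=0.0):
    """Squared errors of the sample mean and sample median, one per row."""
    mean_se = (samples.mean(axis=1) - theta0) ** 2
    median_se = (np.median(samples, axis=1) - theta0) ** 2
    return mean_se, median_se

rng = np.random.default_rng(0)
n_reps, sample_size = 5_000, 50

# Common random numbers: both estimators see the same samples.
common = rng.standard_normal((n_reps, sample_size))
mean_se, median_se = squared_errors(common)
diff_common = mean_se - median_se

# Independent random numbers: the median is computed on fresh samples.
other = rng.standard_normal((n_reps, sample_size))
_, median_se_indep = squared_errors(other)
diff_indep = mean_se - median_se_indep

var_common = diff_common.var(ddof=1) / n_reps  # variance of the estimated MSE difference
var_indep = diff_indep.var(ddof=1) / n_reps
print(diff_common.mean(), var_common, var_indep)
```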

Extensions...

• Sequential importance sampling: Builds up improved proposal densities across draws

7.2.4 Response Surfaces

(III) Response surfaces

1. Direct Response Surfaces

2. Indirect Responses Surfaces:

• Clear and informative graphical presentation


• Variance reduction
• Imposition of known asymptotic results
(e.g., power → 1 as T → ∞)
• Imposition of known features of functional form
(e.g. power ∈ [0,1])

Example: Assessing Finite-Sample Test Size

α = P (s > s∗ |T, H0 true) = g(T )

(α is empirical size, s is test statistic, s∗ is asymptotic c.v.)


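$$\hat{\alpha} = \frac{rej}{N}$$

$$\hat{\alpha} \sim N\left(\alpha, \frac{\alpha(1 - \alpha)}{N}\right)$$

or

$$\hat{\alpha} = \alpha + \varepsilon = g(T) + \varepsilon$$

$$\varepsilon \sim N\left(0, \frac{g(T)(1 - g(T))}{N}\right)$$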
Note the heteroskedasticity: variance of ε changes with T .
Example: Assessing Finite-Sample Test Size
Enforce analytically known structure on α̂.
Common approach:

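$$\hat{\alpha} = \alpha_0 + T^{-\frac{1}{2}} \left( c_0 + \sum_{i=1}^{p} c_i T^{-\frac{i}{2}} \right) + \varepsilon$$

$\alpha_0$ is nominal size, which obtains as $T \to \infty$. Second term is the vanishing size distortion.
Response surface regression:

$$(\hat{\alpha} - \alpha_0) \to T^{-\frac{1}{2}}, T^{-1}, T^{-\frac{3}{2}}, \ldots$$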

Disturbance will be approximately normal but heteroskedastic.


So use GLS or robust standard errors.

7.3 ESTIMATION BY SIMULATION: GMM, SMM AND INDIRECT INFERENCE

7.3.1 GMM

k-dimensional economic model parameter θ

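$$\hat{\theta}_{GMM} = \arg\min_\theta \, d(\theta)' \Sigma d(\theta)$$

where

$$d(\theta) = \begin{pmatrix} m_1(\theta) - \hat{m}_1 \\ m_2(\theta) - \hat{m}_2 \\ \vdots \\ m_r(\theta) - \hat{m}_r \end{pmatrix}$$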

The mi (θ) are model moments and the m̂i are data moments.
MM: k = r and the mi (θ) calculated analytically

GMM: k < r and the mi (θ) calculated analytically

• Inefficient relative to MLE, but useful when likelihood is not available

7.3.2 Simulated Method of Moments (SMM)

(k ≤ r and the mi (θ) calculated by simulation )

• Model moments for GMM may also be unavailable (i.e., analytically intractable)

• SMM: if you can simulate, you can consistently estimate

– Simulation ability is a good test of model understanding


– If you can’t figure out how to simulate pseudo-data from a given probabilistic
model, then you don’t understand the model (or the model is ill-posed)
– Assembling everything: If you understand a model you can simulate it, and if you
can simulate it you can estimate it consistently. Eureka!
– No need to work out what might be very complex likelihoods even if they are in
principle ”available.”

• MLE efficiency lost may be a small price for SMM tractability gained.
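To make "if you can simulate, you can estimate" concrete, here is a minimal SMM sketch for an AR(1) coefficient. Everything specific is an assumption for illustration: the AR(1) model with unit-variance Gaussian shocks, the two matched moments (variance and first autocovariance, so $k = 1 < r = 2$), the identity weighting matrix, and the grid search standing in for a numerical optimizer. The simulated moments reuse one set of shocks across all candidate parameter values.

```python
import numpy as np

def simulate_ar1(phis, shocks):
    """One AR(1) path per value in phis, all driven by the same shocks."""
    paths = np.empty((len(shocks), len(phis)))
    prev = np.zeros(len(phis))
    for t, shock in enumerate(shocks):
        prev = phis * prev + shock
        paths[t] = prev
    return paths

def moments(paths):
    """Variance and first-order autocovariance of each column."""
    centered = paths - paths.mean(axis=0)
    variance = np.mean(centered ** 2, axis=0)
    autocov = np.mean(centered[1:] * centered[:-1], axis=0)
    return np.vstack([variance, autocov])

rng = np.random.default_rng(0)
phi_true = 0.6

# "Data" moments m_hat from a hypothetical observed sample.
data = simulate_ar1(np.array([phi_true]), rng.standard_normal(2_000))
m_hat = moments(data)[:, 0]

# Simulated model moments m(theta) on a grid, reusing one set of shocks.
grid = np.linspace(-0.9, 0.9, 181)
m_sim = moments(simulate_ar1(grid, rng.standard_normal(5_000)))

d = m_sim - m_hat[:, None]          # moment discrepancies, one column per grid point
objective = np.sum(d ** 2, axis=0)  # d' Sigma d with Sigma = I
phi_smm = grid[np.argmin(objective)]
print(phi_smm)
```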

SMM Under Misspecification


All econometric models are misspecified.
GMM/SMM has special appeal from that perspective.

• Under correct specification any consistent estimator (e.g., MLE or GMM/SMM) takes
you to the right place asymptotically, and MLE has the extra benefit of efficiency.

• Under misspecification, consistency becomes an issue, quite apart from the secondary
issue of efficiency. Best DGP approximation for one purpose may be very different
from best for another.

• GMM/SMM is appealing in such situations, because it forces thought regarding which


moments M = {m1 (θ), ..., mr (θ)} to match, and then by construction it is consistent
for the M -optimal approximation.

SMM Under Misspecification, Continued

• In contrast, pseudo-MLE ties your hands. Gaussian pseudo-MLE, for example, is con-
sistent for the KLIC-optimal approximation (1-step-ahead mean-squared prediction
error).

• The bottom line: under misspecification MLE may not be consistent for what you
want, whereas by construction GMM is consistent for what you want (once you decide
what you want).

7.3.3 Indirect Inference

k-dimensional economic model parameter θ


δ > k-dimensional auxiliary model parameter β

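$$\hat{\theta}_{IE} = \arg\min_\theta \, d(\theta)' \Sigma d(\theta)$$

where

$$d(\theta) = \begin{pmatrix} \hat{\beta}_1(\theta) - \hat{\beta}_1 \\ \hat{\beta}_2(\theta) - \hat{\beta}_2 \\ \vdots \\ \hat{\beta}_d(\theta) - \hat{\beta}_d \end{pmatrix}$$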

β̂i (θ) are est. params. of aux. model fit to simulated model data
β̂i are est. params. of aux. model fit to real data

– Consistent for true θ if economic model correctly specified


– Consistent for pseudo-true θ otherwise
– We introduced “Wald form”; also LR and LM forms
Ruge-Murcia (2010)

7.4 INFERENCE BY SIMULATION: BOOTSTRAP

7.4.1 i.i.d. Environments

Simplest (iid) Case


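$$\{x_t\}_{t=1}^{T} \overset{iid}{\sim} (\mu, \sigma^2)$$

100α percent confidence interval for µ:

$$I = \left[ \bar{x}_T - u_{(1+\alpha)/2} \frac{\sigma(x)}{\sqrt{T}}, \; \bar{x}_T - u_{(1-\alpha)/2} \frac{\sigma(x)}{\sqrt{T}} \right]$$

$$\bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t, \qquad \sigma^2(x) = E(x - \mu)^2$$

$u_\alpha$ solves $P\left( \frac{\bar{x}_T - \mu}{\sigma / \sqrt{T}} \le u_\alpha \right) = \alpha$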
T

Exact interval, regardless of the underlying distribution.


Operational Version

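$$I = \left[ \bar{x}_T - \hat{u}_{(1+\alpha)/2} \frac{\hat{\sigma}(x)}{\sqrt{T}}, \; \bar{x}_T - \hat{u}_{(1-\alpha)/2} \frac{\hat{\sigma}(x)}{\sqrt{T}} \right]$$

$$\hat{\sigma}^2(x) = \frac{1}{T - 1} \sum_{t=1}^{T} (x_t - \bar{x}_T)^2$$

$\hat{u}_\alpha$ solves $P\left( \frac{\bar{x}_T - \mu}{\hat{\sigma}(x) / \sqrt{T}} \le \hat{u}_\alpha \right) = \alpha$

Classic (Gaussian) example:

$$I = \bar{x}_T \pm t_{(1-\alpha)/2} \frac{\hat{\sigma}(x)}{\sqrt{T}}$$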
Bootstrap approach: No need to assume Gaussian data.
“Percentile Bootstrap”
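Root: $S = \frac{\bar{x}_T - \mu}{\sigma / \sqrt{T}}$

Root c.d.f.: $H(z) = P\left( \frac{\bar{x}_T - \mu}{\sigma / \sqrt{T}} \le z \right)$

1. Draw $\{x_t^{(j)}\}_{t=1}^{T}$ with replacement from $\{x_t\}_{t=1}^{T}$

2. Compute $\frac{\bar{x}_T^{(j)} - \bar{x}_T}{\hat{\sigma}(x) / \sqrt{T}}$

3. Repeat many times and build up the sampling distribution of $\frac{\bar{x}_T^{(j)} - \bar{x}_T}{\hat{\sigma}(x) / \sqrt{T}}$, which is an
approximation to the distribution of $\frac{\bar{x}_T - \mu}{\sigma / \sqrt{T}}$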

“Russian doll principle”


Percentile Bootstrap, Continued
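Bootstrap estimator of H(z):

$$\hat{H}(z) = P\left( \frac{\bar{x}_T^{(j)} - \bar{x}_T}{\hat{\sigma}(x) / \sqrt{T}} \le z \right)$$

Translates into bootstrap 100α percent CI:

$$\hat{I} = \left[ \bar{x}_T - \hat{u}_{(1+\alpha)/2} \frac{\hat{\sigma}(x)}{\sqrt{T}}, \; \bar{x}_T - \hat{u}_{(1-\alpha)/2} \frac{\hat{\sigma}(x)}{\sqrt{T}} \right]$$

where $P\left( \frac{\bar{x}_T^{(j)} - \bar{x}_T}{\hat{\sigma}(x) / \sqrt{T}} \le \hat{u}_\alpha \right) = \hat{H}(\hat{u}_\alpha) = \alpha$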

“Percentile-t” Bootstrap
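$$S = \frac{\bar{x}_T - \mu}{\hat{\sigma}(x) / \sqrt{T}}$$

$$H(z) = P\left( \frac{\bar{x}_T - \mu}{\hat{\sigma}(x) / \sqrt{T}} \le z \right)$$

$$\hat{H}(z) = P\left( \frac{\bar{x}_T^{(j)} - \bar{x}_T}{\hat{\sigma}(x^{(j)}) / \sqrt{T}} \le z \right)$$

$$\hat{I} = \left[ \bar{x}_T - \hat{u}_{(1+\alpha)/2} \frac{\hat{\sigma}(x)}{\sqrt{T}}, \; \bar{x}_T - \hat{u}_{(1-\alpha)/2} \frac{\hat{\sigma}(x)}{\sqrt{T}} \right]$$

$$P\left( \frac{\bar{x}_T^{(j)} - \bar{x}_T}{\hat{\sigma}(x^{(j)}) / \sqrt{T}} \le \hat{u}_\alpha \right) = \alpha$$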

Percentile-t Bootstrap, Continued


Key Insight:
Percentile: $\bar{x}_T^{(j)}$ changes across bootstrap replications
Percentile-t: both $\bar{x}_T^{(j)}$ and $\hat{\sigma}(x^{(j)})$ change across bootstrap replications
Effectively, the percentile method bootstraps the parameter, whereas the percentile-t bootstraps the t statistic
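A minimal sketch of the percentile-t interval for a mean follows. The function name, the skewed exponential data, the number of bootstrap replications, and the seed are assumptions for illustration. Each replication recomputes both the resampled mean and the resampled standard deviation, which is what distinguishes it from the plain percentile method.

```python
import numpy as np

def percentile_t_ci(x, level=0.95, n_boot=2000, rng=None):
    """Percentile-t bootstrap confidence interval for the mean of x."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(x)
    xbar = x.mean()
    se = x.std(ddof=1) / np.sqrt(T)
    idx = rng.integers(0, T, size=(n_boot, T))  # resample with replacement
    xb = x[idx]
    # Studentize each replication with its own standard deviation.
    roots = (xb.mean(axis=1) - xbar) / (xb.std(axis=1, ddof=1) / np.sqrt(T))
    u_hi = np.quantile(roots, (1.0 + level) / 2.0)
    u_lo = np.quantile(roots, (1.0 - level) / 2.0)
    return xbar - u_hi * se, xbar - u_lo * se

rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=50)  # hypothetical skewed sample
lower, upper = percentile_t_ci(x, rng=rng)
print(lower, x.mean(), upper)
```

With skewed data the interval is typically asymmetric around the sample mean, unlike the Gaussian interval.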
Key Bootstrap Property: Consistent Inference
Real-world root:
d
S → D (as T → ∞)

Bootstrap-world root:

d

S → D∗ (as T, N → ∞)

Bootstrap consistent (“valid,” “first-order valid”) if D = D∗ .



Holds under regularity conditions.


But Aren’t There Simpler ways to do Consistent Inference?
Of Course. BUT:

1. Bootstrap idea extends mechanically to much more complicated models

2. Bootstrap can deliver higher-order refinements


(e.g., percentile-t)

3. Monte Carlo indicates that bootstrap often does very well in finite samples (not un-
related to 2, but does not require 2)

4. Many variations and extensions of the basic bootstraps

7.4.2 Time-Series Environments

Stationary Time Series Case Before:


1. Use $S = \frac{\bar{x}_T - \mu}{\hat{\sigma}(x) / \sqrt{T}}$

2. Draw $\{x_t^{(j)}\}_{t=1}^{T}$ with replacement from $\{x_t\}_{t=1}^{T}$

Issues:

1. Inappropriate standardization of S for dynamic data. So replace $\hat{\sigma}(x)$ with $\sqrt{2\pi f_x^*(0)}$,
where $f_x^*(0)$ is a consistent estimator of the spectral density of x at frequency 0.
2. Inappropriate to draw $\{x_t^{(j)}\}_{t=1}^{T}$ with replacement for dynamic data. What to do?

Non-Parametric Time Series Bootstrap


(Overlapping Block Sampling)
Overlapping blocks of size b in the sample path:

$$\xi_t = (x_t, \ldots, x_{t+b-1}), \quad t = 1, \ldots, T - b + 1$$

Draw k blocks (where $T = kb$) from $\{\xi_t\}_{t=1}^{T-b+1}$:
$\xi_1^{(j)}, \ldots, \xi_k^{(j)}$
Concatenate: $(x_1^{(j)}, \ldots, x_T^{(j)}) = (\xi_1^{(j)} \ldots \xi_k^{(j)})$
Consistent if $b \to \infty$ as $T \to \infty$ with $b/T \to 0$
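The overlapping-block scheme can be sketched in a few lines. The function name and the toy series are assumptions for illustration, and the sketch requires $T = kb$ exactly, as in the text.

```python
import numpy as np

def overlapping_block_bootstrap(x, b, rng):
    """One bootstrap series built from k = T/b overlapping blocks of length b."""
    T = len(x)
    if T % b != 0:
        raise ValueError("this sketch assumes T = k * b")
    k = T // b
    starts = rng.integers(0, T - b + 1, size=k)  # 0-based start of each drawn block
    index = (starts[:, None] + np.arange(b)[None, :]).ravel()
    return x[index]

rng = np.random.default_rng(0)
x = np.arange(100.0)  # toy series whose values reveal which blocks were drawn
x_boot = overlapping_block_bootstrap(x, b=10, rng=rng)
print(x_boot[:20])
```

Within each block the original time ordering is preserved, so short-range dependence survives the resampling.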
AR(1) Parametric Time Series Bootstrap

$$x_t = c + \phi x_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim iid$$

1. Regress $x_t \to (c, x_{t-1})$ to get $\hat{c}$ and $\hat{\phi}$, and save residuals, $\{e_t\}_{t=1}^{T}$
2. Draw $\{\varepsilon_t^{(j)}\}_{t=1}^{T}$ with replacement from $\{e_t\}_{t=1}^{T}$
3. Draw $x_0^{(j)}$ from $\{x_t\}_{t=1}^{T}$
4. Generate $x_t^{(j)} = \hat{c} + \hat{\phi} x_{t-1}^{(j)} + \varepsilon_t^{(j)}$, $t = 1, \ldots, T$
5. Regress $x_t^{(j)} \to (c, x_{t-1}^{(j)})$ to get $\hat{c}^{(j)}$ and $\hat{\phi}^{(j)}$, associated t-statistics, etc.

6. Repeat $j = 1, \ldots, R$, and build up the distributions of interest
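The six steps above can be sketched as follows. The helper names, the simulated "observed" series, and the number of replications are assumptions for illustration; only the bootstrap distribution of $\hat{\phi}$ is collected here.

```python
import numpy as np

def fit_ar1(x):
    """OLS of x_t on (1, x_{t-1}); returns (c_hat, phi_hat, residuals)."""
    X = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    coef, *_ = np.linalg.lstsq(X, x[1:], rcond=None)
    resid = x[1:] - X @ coef
    return coef[0], coef[1], resid

def ar1_bootstrap(x, n_boot, rng):
    """Parametric AR(1) bootstrap distribution of phi_hat."""
    c_hat, phi_hat, resid = fit_ar1(x)                   # step 1
    T = len(x)
    phi_boot = np.empty(n_boot)
    for j in range(n_boot):                              # step 6
        eps = rng.choice(resid, size=T, replace=True)    # step 2
        path = np.empty(T + 1)
        path[0] = rng.choice(x)                          # step 3
        for t in range(T):                               # step 4
            path[t + 1] = c_hat + phi_hat * path[t] + eps[t]
        _, phi_boot[j], _ = fit_ar1(path)                # step 5
    return phi_hat, phi_boot

# Hypothetical observed series from an AR(1) with c = 1, phi = 0.5.
rng = np.random.default_rng(0)
x = np.empty(200)
x[0] = 2.0
for t in range(1, 200):
    x[t] = 1.0 + 0.5 * x[t - 1] + rng.standard_normal()

phi_hat, phi_boot = ar1_bootstrap(x, n_boot=500, rng=rng)
print(phi_hat, phi_boot.mean(), np.quantile(phi_boot, [0.025, 0.975]))
```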

General State-Space Parametric Time Series Bootstrap


Recall the prediction-error state space representation:

$$a_{t+1/t} = T a_{t/t-1} + T K_t v_t$$

$$y_t = Z a_{t/t-1} + v_t$$

1. Estimate system parameters $\theta$. (We will soon see how to do this.)
2. At the estimated parameter values $\hat{\theta}$, run the Kalman filter to get the corresponding 1-step-ahead
prediction errors $\hat{v}_t \sim (0, \hat{F}_t)$ and standardize them to $\hat{u}_t = \hat{\Omega}_t^{-1} \hat{v}_t \sim (0, I)$, where $\hat{\Omega}_t \hat{\Omega}_t' = \hat{F}_t$.
3. Draw $\{u_t^{(j)}\}_{t=1}^{T}$ with replacement from $\{\hat{u}_t\}_{t=1}^{T}$ and convert to $\{v_t^{(j)}\}_{t=1}^{T} = \{\hat{\Omega}_t u_t^{(j)}\}_{t=1}^{T}$.

4. Using the prediction-error draw $\{v_t^{(j)}\}_{t=1}^{T}$, simulate the model, obtaining $\{y_t^{(j)}\}_{t=1}^{T}$.

5. Estimate the model, obtaining $\hat{\theta}^{(j)}$ and related objects of interest.

6. Repeat $j = 1, \ldots, R$, simulating the distributions of interest.

Many Variations and Extensions...


• Stationary block bootstrap: Blocks of random (exponential) length
• Wild bootstrap: multiply bootstrap draws of shocks randomly by ±1 to enforce symmetry
• Subsampling

7.5 OPTIMIZATION BY SIMULATION

Markov chains yet again.

7.5.1 Local
Using MCMC for MLE (and Other Extremum Estimators)
Chernozhukov and Hong show how to compute extremum estimators as means of pseudo-posterior distributions,
which can be simulated by MCMC and estimated at the parametric rate $1/\sqrt{N}$, in contrast to the
much slower nonparametric rates achievable (by any method) by the standard posterior mode extremum
estimator.

7.5.2 Global
Summary of Local Optimization:

1. initial guess θ(0)


2. while stopping criteria not met do
3. select θ(c) ∈ N (θ(m) ) (Classically: use gradient)
4. if ∆ ≡ lnL(θ(c) ) − lnL(θ(m) ) > 0 then θ(m+1) = θ(c)
5. end while

Simulated Annealing
(Illustrated Here for a Discrete Parameter Space)
Framework:

1. A set Θ, and a real-valued function lnL (satisfying regularity conditions) defined on Θ. Let Θ∗ ⊂ Θ
be the set of global maxima of lnL
2. ∀θ(m) ∈ Θ, a set N (θ(m) ) ⊂ Θ − θ(m) , the set of neighbors of θ(m)
3. A nonincreasing function, T (m) : N → (0, ∞) (“the cooling schedule”), where T (m) is the “temper-
ature” at iteration m
4. An initial guess, θ(0) ∈ Θ

Simulated Annealing Algorithm

1. initial guess θ(0)


2. while stopping criteria not met do
3. select θ(c) ∈ N (θ(m) )
4. if ∆ > 0 or exp (∆/T (m)) > U (0, 1) then θ(m+1) = θ(c)
5. end while
Note the extremes:

T = 0 implies no randomization (like classical gradient-based)

T = ∞ implies complete randomization (like random search)
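A minimal sketch of the algorithm on a discrete parameter space follows. The objective function, the neighborhood structure (adjacent integers), the shifted logarithmic cooling schedule, and the tracking of the best point visited are all assumptions added for illustration; the acceptance rule is the one stated in step 4.

```python
import math
import random

def log_lik(theta):
    """Hypothetical multimodal objective on the integers 0, ..., 100."""
    return -((theta - 70) ** 2) / 500.0 + 0.5 * math.cos(theta)

def neighbors(theta):
    """Adjacent integers that remain inside the parameter space."""
    return [t for t in (theta - 1, theta + 1) if 0 <= t <= 100]

def temperature(m):
    """Logarithmic cooling schedule, shifted so that T(1) is finite."""
    return 1.0 / math.log(m + 1)

def simulated_annealing(theta0, n_iter, rng):
    theta, best = theta0, theta0
    for m in range(1, n_iter + 1):
        candidate = rng.choice(neighbors(theta))
        delta = log_lik(candidate) - log_lik(theta)
        # Accept uphill moves always, downhill moves with prob. exp(delta / T(m)).
        if delta > 0 or math.exp(delta / temperature(m)) > rng.random():
            theta = candidate
        if log_lik(theta) > log_lik(best):
            best = theta  # bookkeeping only; not part of the chain itself
    return theta, best

theta_final, theta_best = simulated_annealing(theta0=5, n_iter=20_000, rng=random.Random(0))
print(theta_final, theta_best, log_lik(theta_best))
```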

A (Heterogeneous) Markov Chain


If θ(c) ∈/ N (θ(m) ) then
P (θ (m+1) = θ(c) |θ(m) ) = 0
If θ (c) ∈ N (θ(m) ) then
P (θ (m+1) = θ(c) | θ(m) ) = exp (min[0, ∆/T (m)])
Convergence of a Global Optimizer
Definition. We say that the simulated annealing algorithm converges if
limm→∞ P [θ(m) ∈ Θ∗ ] = 1.
Definition: We say that θ(m) communicates with Θ∗ at depth d if there exists a path in Θ (with each
element of the path being a neighbor of the preceding element) that starts at θ(m) and ends at some element
of Θ∗ , such that the smallest value of lnL along the path is lnL(θ(m) ) − d.
Convergence of Simulated Annealing
Theorem: Let d∗ be the smallest number such that every θ(m) ∈ Θ communicates with Θ∗ at depth d∗ .
Then the simulated annealing algorithm converges if and only if, as $m \to \infty$,
$T(m) \to 0$
and
$\sum_{m} \exp(-d^* / T(m)) \to \infty$.

Problems: How to choose T, and moreover we don’t know $d^*$

Popular choice of cooling function: $T(m) = \frac{1}{\ln m}$
Regarding speed of convergence, little is known

7.5.3 Is a Local Optimum Global?


1. Try many startup values (sounds trivial but very important)
2. At the end of it all, use extreme value theory to assess the likelihood that the local optimum is global
(“Veall’s Method”)

θ ∈ Θ ⊂ Rk
lnL(θ) is continuous
lnL(θ∗ ) is the unique finite global max of lnL(θ), θ ∈ Θ
H(θ∗ ) exists and is nonsingular
lnL(θ̂) is a local max
Develop statistical inference for θ∗
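Draw $\{\theta_i\}_{i=1}^{N}$ uniformly from $\Theta$ and form $\{\ln L(\theta_i)\}_{i=1}^{N}$

$\ln L_1$ first order statistic, $\ln L_2$ second order statistic

$$P[\ln L(\theta^*) \in (\ln L_1, \ln L_\alpha)] = (1 - \alpha), \text{ as } N \to \infty$$
where
$$\ln L_\alpha = \ln L_1 + \frac{\ln L_1 - \ln L_2}{(1 - \alpha)^{-\frac{2}{k}} - 1}.$$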

7.6 INTERVAL AND DENSITY FORECASTING BY SIMULATION

7.7 EXERCISES, PROBLEMS AND COMPLEMENTS

1. Convex relaxation.
Our approaches to global optimization involved attacking a nasty objective function with methods
involving clever randomization. Alternatively, one can approximate the nasty objective with a friendly
(convex) objective, which hopefully has the same global optimum. This is called “convex relaxation,”
and when the two optima coincide we say that the relaxation is “tight.”

7.8 NOTES
Chapter Eight

Non-Stationarity: Integration, Cointegration and Long Memory

8.1 RANDOM WALKS AS THE I(1) BUILDING BLOCK: THE BEVERIDGE-NELSON DECOMPOSITION

Random Walks
Random walk:
yt = yt−1 + εt
εt ∼ W N (0, σ 2 )
Random walk with drift:
yt = δ + yt−1 + εt
εt ∼ W N (0, σ 2 )
Properties of the Random Walk

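$$y_t = y_0 + \sum_{i=1}^{t} \varepsilon_i$$
(shocks perfectly persistent)

$$E(y_t) = y_0$$
$$var(y_t) = t\sigma^2$$
$$\lim_{t \to \infty} var(y_t) = \infty$$
Properties of the Random Walk with Drift

$$y_t = t\delta + y_0 + \sum_{i=1}^{t} \varepsilon_i$$
(shocks again perfectly persistent)

$$E(y_t) = y_0 + t\delta$$
$$var(y_t) = t\sigma^2$$
$$\lim_{t \to \infty} var(y_t) = \infty$$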
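These moment properties are easy to check by simulation. The sketch below generates many random-walk paths with $y_0 = 0$, with and without drift; the number of paths, horizon, drift value, and Gaussian shocks are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps, sigma, delta = 20_000, 100, 1.0, 0.1

eps = rng.normal(0.0, sigma, size=(n_paths, n_steps))
y = np.cumsum(eps, axis=1)                        # driftless walk, y_0 = 0
y_drift = y + delta * np.arange(1, n_steps + 1)   # adds t * delta at each date t

# Cross-path moments at t = 25 and t = 100.
print(y[:, 24].var(ddof=1), y[:, -1].var(ddof=1))   # roughly 25 and 100
print(y_drift[:, -1].mean())                        # roughly 100 * delta
```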
The Random Walk as a Building Block
Generalization of random walk: ARIM A(p, 1, q)
Beveridge-Nelson decomposition:
yt ∼ ARIM A(p, 1, q) ⇒ yt = xt + zt
xt = random walk
zt = covariance stationary
– So shocks to ARIM A(p, 1, q) are persistent, but not perfectly so.

– This was univariate BN. We will later derive multivariate BN.


Forecasting a Random Walk with Drift

xt = b + xt−1 + εt
εt ∼ W N (0, σ 2 )
Optimal forecast:
xT +h,T = bh + xT
Forecast does not revert to trend
Forecasting a Linear Trend + Stationary AR(1)
xt = a + bt + yt

yt = φyt−1 + εt
εt ∼ W N (0, σ 2 )
Optimal forecast:
xT +h,T = a + b(T + h) + φh yT
Forecast reverts to trend

8.2 STOCHASTIC VS. DETERMINISTIC TREND

Some Language...
“Random walk with drift” vs. “stat. AR(1) around linear trend”
“unit root” vs. “stationary root”
“Difference stationary” vs. “trend stationary”
“Stochastic trend” vs. “deterministic trend”
“I(1)” vs. “I(0)”
Stochastic Trend vs. Deterministic Trend

8.3 UNIT ROOT DISTRIBUTIONS

Unit Root Distribution in the AR(1) Process

yt = yt−1 + εt

d
T (φ̂LS − 1) → DF

Superconsistent
Biased in finite samples (E φ̂ < φ ∀ φ ∈ (0, 1])
“Hurwicz bias” “Dickey-Fuller bias”
“Nelson-Kang spurious periodicity”
Bigger as T → 0 , as φ → 1 , and as intercept, trend included
Non-Gaussian (skewed left)
DF tabulated by Monte Carlo
Studentized Version

$$\hat{\tau} = \frac{\hat{\phi} - 1}{s \sqrt{\frac{1}{\sum_{t=2}^{T} y_{t-1}^2}}}$$

Not t in finite samples


Not N (0, 1) asymptotically
Again, tabulate by Monte Carlo
Nonzero Mean Under the Alternative

(yt − µ) = φ(yt−1 − µ) + εt

yt = α + φyt−1 + εt

where α = µ(1 − φ)
Random walk null vs. mean-reverting alternative
Studentized statistic τ̂µ
Deterministic Trend Under the Alternative

(yt − a − b t) = φ(yt−1 − a − b (t − 1)) + εt

yt = α + βt + φyt−1 + εt

where α = a(1 − φ) + bφ and β = b(1 − φ)



H0 : φ = 1 (unit root)
H1 : φ < 1 (stationary root)
“Random walk with drift” vs. “stat. AR(1) around linear trend”
“Difference stationary” vs. “trend stationary”
“Stochastic trend” vs. “deterministic trend”
“I(1)” vs. “I(0)”
Studentized statistic τ̂τ
Tabulating the Dickey-Fuller Distributions

1. Set T

2. Draw T N (0, 1) variates

3. Construct yt

4. Run three DF regressions (using common random numbers)

• τ̂ : yt = φyt−1 + et
• τ̂µ : yt = c + φyt−1 + et
• τ̂τ : yt = c + βt + φyt−1 + et

5. Repeat N times, yielding $\{\hat{\tau}^i, \hat{\tau}_\mu^i, \hat{\tau}_\tau^i\}_{i=1}^{N}$

6. Sort and compute fractiles

7. Fit response surfaces
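Steps 1 through 6 can be sketched for the no-constant statistic $\hat{\tau}$ alone. The function name, sample size, number of repetitions, and seed are assumptions for illustration; the other two regressions and the response-surface step are omitted.

```python
import numpy as np

def df_tau_no_constant(T, n_reps, rng):
    """Simulate the studentized Dickey-Fuller statistic for y_t = phi*y_{t-1} + e_t."""
    eps = rng.standard_normal((n_reps, T))            # step 2
    y = np.cumsum(eps, axis=1)                        # step 3: random walk under the null
    y_lag, y_cur = y[:, :-1], y[:, 1:]
    sum_sq = np.sum(y_lag ** 2, axis=1)
    phi_hat = np.sum(y_lag * y_cur, axis=1) / sum_sq  # step 4: OLS without constant
    resid = y_cur - phi_hat[:, None] * y_lag
    s2 = np.sum(resid ** 2, axis=1) / (T - 2)         # T - 1 observations, 1 regressor
    return (phi_hat - 1.0) / np.sqrt(s2 / sum_sq)

rng = np.random.default_rng(0)
tau = df_tau_no_constant(T=100, n_reps=20_000, rng=rng)  # step 5
fractiles = np.quantile(tau, [0.01, 0.05, 0.10])         # step 6
print(fractiles)
```

The simulated fractiles lie well to the left of the corresponding standard normal fractiles, reflecting the non-Gaussian distribution noted above.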

AR(p)

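$$y_t + \sum_{j=1}^{p} \phi_j y_{t-j} = \varepsilon_t$$

$$y_t = \rho_1 y_{t-1} + \sum_{j=2}^{p} \rho_j (y_{t-j+1} - y_{t-j}) + \varepsilon_t$$

where $p \ge 2$, $\rho_1 = -\sum_{j=1}^{p} \phi_j$, and $\rho_i = \sum_{j=i}^{p} \phi_j$, $i = 2, \ldots, p$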
Studentized statistic τ̂
Allowing for Nonzero Mean Under the Alternative

$$(y_t - \mu) + \sum_{j=1}^{p} \phi_j (y_{t-j} - \mu) = \varepsilon_t$$

$$y_t = \alpha + \rho_1 y_{t-1} + \sum_{j=2}^{p} \rho_j (y_{t-j+1} - y_{t-j}) + \varepsilon_t$$

where $\alpha = \mu \left(1 + \sum_{j=1}^{p} \phi_j\right)$
Studentized statistic τ̂µ
Allowing for Trend Under the Alternative

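$$(y_t - a - bt) + \sum_{j=1}^{p} \phi_j (y_{t-j} - a - b(t - j)) = \varepsilon_t$$

$$y_t = k_1 + k_2 t + \rho_1 y_{t-1} + \sum_{j=2}^{p} \rho_j (y_{t-j+1} - y_{t-j}) + \varepsilon_t$$

$$k_1 = a \left(1 + \sum_{i=1}^{p} \phi_i\right) - b \sum_{i=1}^{p} i \phi_i$$

$$k_2 = b \left(1 + \sum_{i=1}^{p} \phi_i\right)$$

Under the null hypothesis, $k_1 = -b \sum_{i=1}^{p} i \phi_i$ and $k_2 = 0$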
Studentized statistic τ̂τ

8.4 UNIVARIATE AND MULTIVARIATE AUGMENTED DICKEY-FULLER REPRESENTATIONS

General ARM A Representations


(“Augmented Dickey-Fuller” (ADF))
Use one of the representations:

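$$y_t = \rho_1 y_{t-1} + \sum_{j=2}^{k-1} \rho_j (y_{t-j+1} - y_{t-j}) + \varepsilon_t$$

$$y_t = \alpha + \rho_1 y_{t-1} + \sum_{j=2}^{k-1} \rho_j (y_{t-j+1} - y_{t-j}) + \varepsilon_t$$

$$y_t = k_1 + k_2 t + \rho_1 y_{t-1} + \sum_{j=2}^{k-1} \rho_j (y_{t-j+1} - y_{t-j}) + \varepsilon_t$$

Let $k \to \infty$ with $k/T \to 0$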


“Trick Form” of ADF

$$(y_t - y_{t-1}) = (\rho_1 - 1) y_{t-1} + \sum_{j=2}^{k-1} \rho_j (y_{t-j+1} - y_{t-j}) + \varepsilon_t$$

• Unit root corresponds to (ρ1 − 1) = 0

• Use standard automatically-computed t-stat


(which of course does not have the t-distribution)

8.5 SPURIOUS REGRESSION

Multivariate Problem: Spurious Time-Series Regressions


Regress a persistent variable on an unrelated persistent variable:

yt = β xt + εt

(Canonical case: y, x independent driftless random walks)


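$$R^2 \overset{d}{\to} RV \quad \text{(not zero)}$$

$$\frac{t}{\sqrt{T}} \overset{d}{\to} RV \quad \text{(t diverges)}$$

$$\hat{\beta} \overset{d}{\to} RV \quad \text{(} \hat{\beta} \text{ does not converge to zero)}$$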
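The canonical case is easy to reproduce by simulation. The sketch below regresses one driftless random walk on an independent one (no intercept, as in the regression above) and records how often the conventional t-test rejects $\beta = 0$ at the nominal 5% level. The function name, sample sizes, repetition count, and seed are assumptions for illustration.

```python
import numpy as np

def spurious_t_stats(T, n_reps, rng):
    """t-statistics on beta from regressing one random walk on an independent one."""
    x = np.cumsum(rng.standard_normal((n_reps, T)), axis=1)
    y = np.cumsum(rng.standard_normal((n_reps, T)), axis=1)
    sxx = np.sum(x * x, axis=1)
    beta_hat = np.sum(x * y, axis=1) / sxx
    resid = y - beta_hat[:, None] * x
    s2 = np.sum(resid ** 2, axis=1) / (T - 1)
    return beta_hat / np.sqrt(s2 / sxx)

rng = np.random.default_rng(0)
rate_50 = np.mean(np.abs(spurious_t_stats(50, 2_000, rng)) > 1.96)
rate_200 = np.mean(np.abs(spurious_t_stats(200, 2_000, rng)) > 1.96)
print(rate_50, rate_200)  # rejection frequencies rise with T instead of staying near 0.05
```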

When are I(1) Levels Regressions Not Spurious?


Answer: When the variables are cointegrated.

8.6 COINTEGRATION, ERROR-CORRECTION AND GRANGER’S REPRESENTATION THEOREM

Cointegration
Consider an N -dimensional variable x:
x ∼ CI (d, b) if

1. xi ∼ I(d), i = 1, . . . , N

2. ∃ 1 or more linear combinations zt = α0 xt s.t. zt ∼ I(d − b), b > 0

Leading Case
x ∼ CI(1, 1) if

(1) xi ∼ I(1), i = 1, . . . , N
(2) ∃ 1 or more linear combinations
zt = α0 xt s.t. zt ∼ I(0)
Example

xt = xt−1 + vt , vt ∼ W N

yt = xt−1 + εt , εt ∼ W N, εt ⊥ vt−τ , ∀t, τ

⇒ (yt − xt ) = εt − vt = I(0)

Cointegration and “Attractor Sets”


xt is N -dimensional but does not wander randomly in RN
α0 xt is attracted to an (N − R)-dimensional subspace of RN
N : space dimension
R: number of cointegrating relationships
Attractor dimension = N − R
(“number of underlying unit roots”)
(“number of common trends”)
Example
3-dimensional V AR(p), all variables I(1)
R = 0 ⇔ no cointegration ⇔ x wanders throughout R3
R = 1 ⇔ 1 cointegrating vector ⇔ x attracted to a 2-Dim hyperplane in R3 given by
α0 x = 0
R = 2 ⇔ 2 cointegrating vectors ⇔ x attracted to a 1-Dim hyperplane (line) in R3 given
by intersection of two 2-Dim hyperplanes, α10 x = 0 and α20 x = 0
R = 3 ⇔ 3 cointegrating vectors ⇔ x attracted to a 0-Dim hyperplane (point) in R3 given
by the intersection of three 2-Dim hyperplanes, α10 x = 0 , α20 x = 0 and α30 x = 0
(Covariance stationary around E(x))
Cointegration Motivation: Dynamic Factor Structure
Factor structure with I(1) factors
(N − R) I(1) factors driving N variables
e.g., single-factor model:

     
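$$\begin{pmatrix} y_{1t} \\ \vdots \\ y_{Nt} \end{pmatrix} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} f_t + \begin{pmatrix} \varepsilon_{1t} \\ \vdots \\ \varepsilon_{Nt} \end{pmatrix}$$

$$f_t = f_{t-1} + \eta_t$$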

R = (N − 1) cointegrating combs: (y2t − y1t ), ..., (yN t − y1t )


(N − R) = N − (N − 1) = 1 common trend
Cointegration Motivation: Optimal Forecasting
I(1) variables always co-integrated with their optimal forecasts
Example:
$x_t = x_{t-1} + \varepsilon_t$
$x_{t+h|t} = x_t$
$\Rightarrow x_{t+h} - x_{t+h|t} = \sum_{i=1}^{h} \varepsilon_{t+i}$

(finite MA, always covariance stationary)


Cointegration Motivation:
Long-Run Relation Augmented with Short-Run Dynamics
Simple AR Case (ECM):

∆yt = α ∆yt−1 + β ∆xt−1 − γ(yt−1 − δxt−1 ) + ut


= α ∆yt−1 + β ∆xt−1 − γzt−1 + ut

General AR Case (VECM):


A(L) ∆xt = −γzt−1 + ut
where:
A(L) = I − A1 L − ... − Ap Lp
zt = α0 xt
Multivariate ADF
Any V AR can be written as:

$$\Delta x_t = -\Pi x_{t-1} + \sum_{i=1}^{p-1} B_i \Delta x_{t-i} + u_t$$

Integration/Cointegration Status

• Rank(Π) = 0
0 cointegrating vectors, N underlying unit roots
(all variables appropriately specified in differences)

• Rank(Π) = N
N cointegrating vectors, 0 unit roots
(all variables appropriately specified in levels)

• Rank(Π) = R (0 < R < N )


R cointegrating vectors, N − R unit roots
New and important intermediate case
(not possible in univariate)

Granger Representation Theorem

xt ∼ V ECM ⇔ xt ∼ CI(1, 1)

V ECM ⇐ Cointegration
We can always write
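$$\Delta x_t = \sum_{i=1}^{p-1} B_i \Delta x_{t-i} - \Pi x_{t-1} + u_t$$
But under cointegration, rank($\Pi$) = R < N, so
$$\underbrace{\Pi}_{N \times N} = \underbrace{\gamma}_{N \times R} \, \underbrace{\alpha'}_{R \times N}$$
$$\Rightarrow \Delta x_t = \sum_{i=1}^{p-1} B_i \Delta x_{t-i} - \gamma \alpha' x_{t-1} + u_t = \sum_{i=1}^{p-1} B_i \Delta x_{t-i} - \gamma z_{t-1} + u_t$$
VECM ⇒ Cointegration

$$\Delta x_t = \sum_{i=1}^{p-1} B_i \Delta x_{t-i} - \gamma \alpha' x_{t-1} + u_t$$
Premultiply by $\alpha'$:

$$\alpha' \Delta x_t = \alpha' \sum_{i=1}^{p-1} B_i \Delta x_{t-i} - \underbrace{\alpha' \gamma}_{\text{full rank}} \alpha' x_{t-1} + \alpha' u_t$$
So equation balance requires that $\alpha' x_{t-1}$ be stationary.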
Stationary-Nonstationary Decomposition
 
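$$\underbrace{M'}_{(N \times N)} \underbrace{x}_{(N \times 1)} = \begin{pmatrix} \underbrace{\alpha'}_{(R \times N)} \\ \underbrace{\delta}_{(N - R) \times N} \end{pmatrix} x = \begin{pmatrix} \text{CI combs} \\ \text{com. trends} \end{pmatrix}$$

(Rows of $\delta$ ⊥ to columns of $\gamma$)
Intuition: Transforming the system by $\delta$ yields
$$\delta \Delta x_t = \delta \sum_{i=1}^{p-1} B_i \Delta x_{t-i} - \underbrace{\delta \gamma}_{0 \text{ by orthogonality}} \alpha' x_{t-1} + \delta u_t$$

So $\delta$ isolates that part of the VECM
that is appropriately specified as a VAR in differences.
Note that if we start with $M'x$, then the observed series is
$(M')^{-1} M' x$, so nonstationarity is spread throughout the system.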

Example

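$$x_{1t} = x_{1,t-1} + u_{1t}$$

$$x_{2t} = x_{1,t-1} + u_{2t}$$

Levels form:
$$\left[ \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix} L \right] \begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix} = \begin{pmatrix} u_{1t} \\ u_{2t} \end{pmatrix}$$
Dickey-Fuller form:
$$\begin{pmatrix} \Delta x_{1t} \\ \Delta x_{2t} \end{pmatrix} = -\begin{pmatrix} 0 & 0 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} x_{1,t-1} \\ x_{2,t-1} \end{pmatrix} + \begin{pmatrix} u_{1t} \\ u_{2t} \end{pmatrix}$$
Example, Continued

$$\Pi = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \begin{pmatrix} -1 & 1 \end{pmatrix} = \gamma \alpha'$$

$$M' = \begin{pmatrix} -1 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} \alpha' \\ \perp \gamma \end{pmatrix}$$

$$M' \begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix} = \begin{pmatrix} u_{2t} - u_{1t} \\ x_{1t} \end{pmatrix} = \begin{pmatrix} x_{2t} - x_{1t} \\ x_{1t} \end{pmatrix}$$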

8.7 FRACTIONAL INTEGRATION AND LONG MEMORY

Long Memory and Fractional Integration


“Integer-integrated” ARIMA(p, d, q), I(d):

$$(1 - L)^d \Phi(L) y_t = \Theta(L)\varepsilon_t, \quad d = 0, 1, \dots$$

Covariance stationary: ARIMA(p, 0, q)
Random walk: ARIMA(0, 1, 0), “pure unit root process”

“Fractionally-integrated” ARFIMA(p, d, q), I(d):

$$(1 - L)^d \Phi(L) y_t = \Theta(L)\varepsilon_t$$

Covariance stationary: $-\frac{1}{2} < d < \frac{1}{2}$ (focus on $0 < d < \frac{1}{2}$)
ARFIMA(0, d, 0): “pure fractionally-integrated process”

$$(1 - L)^d = 1 - dL + \frac{d(d-1)}{2!} L^2 - \frac{d(d-1)(d-2)}{3!} L^3 + \dots$$
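
The coefficients of the expansion of (1 − L)^d can be generated recursively. The following is a minimal sketch; the function name `frac_diff_weights` and the truncation length are illustrative choices, not from the text.

```python
import numpy as np

def frac_diff_weights(d, n_terms):
    """Coefficients pi_j in (1 - L)^d = sum_j pi_j L^j.

    Uses the recursion pi_0 = 1, pi_j = pi_{j-1} * (j - 1 - d) / j,
    which reproduces 1, -d, d(d-1)/2!, -d(d-1)(d-2)/3!, ...
    """
    w = np.empty(n_terms)
    w[0] = 1.0
    for j in range(1, n_terms):
        w[j] = w[j - 1] * (j - 1 - d) / j
    return w

w_frac = frac_diff_weights(0.4, 4)  # fractional d: weights decay slowly
w_unit = frac_diff_weights(1.0, 4)  # d = 1: the usual first difference 1 - L
```

For integer d the weights terminate after d + 1 terms; for fractional d they never do, which is the time-domain source of long memory.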

Long memory and fractional integration in the time domain, τ → ∞

• I(1): $\rho(\tau) \propto \text{const}$

• I(0): $\rho(\tau) \propto r^{\tau}$ ($0 < r < 1$)

• I(d): $\rho(\tau) \propto \tau^{2d-1}$ ($0 < d < 1/2$)

Long memory and fractional integration in the frequency domain, ω → 0

• I(1): $f(\omega) \propto \omega^{-2}$

• I(0): $f(\omega) \propto \text{const}$

• I(d): $f(\omega) \propto \omega^{-2d}$ ($0 < d < 1/2$)

Frequency-domain I(d) behavior implies that for low frequencies,

$$\ln f^*(\omega) = \beta_0 + \underbrace{\beta_1}_{-2d} \ln\omega + \varepsilon_t$$

“GPH estimator” of d: Regress $\ln f^*(\omega) \to \text{const}, \ln\omega$.
Take $\hat{d} = -\frac{1}{2}\hat\beta_1$.
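
The GPH regression above amounts to a single OLS fit. In the sketch below, the function name and the choices T = 4096 and m = 64 low frequencies are assumptions. To keep the example deterministic, it feeds the regression the exact low-frequency shape of an ARFIMA(0, d, 0) spectrum, |1 − e^{−iω}|^{−2d}, rather than a sample periodogram; in practice ln f* would be the log periodogram.

```python
import numpy as np

def gph_estimate(log_f, omega):
    """OLS of ln f*(omega) on a constant and ln(omega); returns d_hat = -beta1_hat / 2."""
    X = np.column_stack([np.ones_like(omega), np.log(omega)])
    beta, *_ = np.linalg.lstsq(X, log_f, rcond=None)
    return -0.5 * beta[1]

d_true = 0.3
T, m = 4096, 64                                   # illustrative sample size and bandwidth
omega = 2 * np.pi * np.arange(1, m + 1) / T       # low Fourier frequencies
# |1 - exp(-i w)| = 2 sin(w / 2), so ln f(w) = -2 d ln(2 sin(w / 2)) up to a constant
log_f = -2 * d_true * np.log(2 * np.sin(omega / 2))
d_hat = gph_estimate(log_f, omega)
```

Because 2 sin(ω/2) ≈ ω near zero, the fitted slope is very close to −2d, so `d_hat` comes out very close to `d_true`.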

8.8 EXERCISES, PROBLEMS AND COMPLEMENTS

1. Applied modeling.
Obtain a series of U.S. industrial production, monthly, 1947.01-present, not seasonally
adjusted. Discard the last 20 observations. Now graph the series. Examine its trend
and seasonal patterns in detail. Does a linear trend (fit to logs) seem appropriate? Do
monthly seasonal dummies seem appropriate? To the log of the series, fit an ARMA
model with linear trend and monthly seasonal dummies. Be sure to try a variety of
the techniques we have covered (sample autocorrelation and partial autocorrelation
functions, Bartlett standard errors, Box-Pierce test, standard error of the regression,
adjusted R2 , etc.) in your attempt at selecting an adequate model. Once a model has
been selected and estimated, use it to forecast the last 20 observations, and compare
your forecast to the actual realized values. Contrast the results of the approach with
those of the Box-Jenkins seasonal ARIMA approach, which involves taking first and
seasonal differences as necessary to induce covariance stationarity, and then fitting a
multiplicative seasonal ARMA model.

2. Aggregation.
Granger (1980) shows that aggregation of a very large number of stationary ARMA
time series results, under regularity conditions (generalized in Robinson, 1991), in a
fractionally-integrated process. Thus, aggregation of short-memory processes results
in a long-memory process. Discuss this result in light of theorems on aggregation of
ARMA processes. In particular, recall that aggregation of ARMA processes results in
new ARMA processes, generally of higher order than the components.

3. Over-differencing and UCMs.
The stochastic trend:

$$T_t = T_{t-1} + \beta_{t-1} + \eta_t$$

$$\beta_t = \beta_{t-1} + \zeta_t,$$

has two unit roots. Discuss as regards “overdifferencing” in economic time series.

8.9 NOTES
Chapter Nine

Non-Linear Non-Gaussian State Space and Optimal Filtering

9.1 VARIETIES OF NON-LINEAR NON-GAUSSIAN MODELS

9.2 MARKOV CHAINS TO THE RESCUE (AGAIN): THE PARTICLE FILTER

9.3 PARTICLE FILTERING FOR ESTIMATION: DOUCET’S THEOREM

9.4 KEY APPLICATION I: STOCHASTIC VOLATILITY (REVISITED)

9.5 KEY APPLICATION II: CREDIT-RISK AND THE DEFAULT OPTION

9.6 KEY APPLICATION III: DYNAMIC STOCHASTIC GENERAL EQUILIBRIUM (DSGE) MACROECONOMIC MODELS

9.7 A PARTIAL “SOLUTION”: THE EXTENDED KALMAN FILTER

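Familiar Linear / Gaussian State Space

$$\alpha_t = T\alpha_{t-1} + R\eta_t$$

$$y_t = Z\alpha_t + \varepsilon_t$$

$$\eta_t \sim N(0, Q), \quad \varepsilon_t \sim N(0, H)$$

Linear / Non-Gaussian

$$\alpha_t = T\alpha_{t-1} + R\eta_t$$

$$y_t = Z\alpha_t + \varepsilon_t$$

$$\eta_t \sim D^{\eta}, \quad \varepsilon_t \sim D^{\varepsilon}$$

Non-Linear / Gaussian

$$\alpha_t = Q(\alpha_{t-1}, \eta_t)$$

$$y_t = G(\alpha_t, \varepsilon_t)$$

$$\eta_t \sim N(0, Q), \quad \varepsilon_t \sim N(0, H)$$

Non-Linear / Gaussian II
(Linear / Gaussian with time-varying system matrices)

$$\alpha_t = T_t\alpha_{t-1} + R_t\eta_t$$

$$y_t = Z_t\alpha_t + \varepsilon_t$$

$$\eta_t \sim N^{\eta}, \quad \varepsilon_t \sim N^{\varepsilon}$$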

“Conditionally Gaussian”
White’s theorem
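Non-Linear / Non-Gaussian

$$\alpha_t = Q(\alpha_{t-1}, \eta_t)$$

$$y_t = G(\alpha_t, \varepsilon_t)$$

$$\eta_t \sim D^{\eta}, \quad \varepsilon_t \sim D^{\varepsilon}$$

(DSGE macroeconomic models are of this form)

Non-Linear / Non-Gaussian, Specialized

$$\alpha_t = Q(\alpha_{t-1}) + \eta_t$$

$$y_t = G(\alpha_t) + \varepsilon_t$$

$$\eta_t \sim D^{\eta}, \quad \varepsilon_t \sim D^{\varepsilon}$$

Non-Linear / Non-Gaussian, Generalized

$$\alpha_t = Q_t(\alpha_{t-1}, \eta_t)$$

$$y_t = G_t(\alpha_t, \varepsilon_t)$$

$$\eta_t \sim D_t^{\eta}, \quad \varepsilon_t \sim D_t^{\varepsilon}$$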

Asset value of the firm $V_t$:

$$V_t = \mu V_{t-1}\,\eta_t', \quad \eta_t' \sim \text{lognormal}$$

Firm issues liability D: Zero-coupon bond, matures at T, pays D

Equity value of the firm $S_t$:

$$S_t = \max(V_t - D, 0)$$

From the call option structure of $S_t$, Black-Scholes gives:

$$S_t = BS(V_t)\,\varepsilon_t', \quad \varepsilon_t' \sim \text{lognormal}$$

($\varepsilon_t'$ captures BS misspecification, etc.)

Credit Risk Model (Nonlinear / Gaussian Form)
Taking logs makes it Gaussian, but it’s intrinsically non-linear:

$$\ln V_t = \ln\mu + \ln V_{t-1} + \eta_t$$

$$\ln S_t = \ln BS(V_t) + \varepsilon_t$$

$$\eta_t \sim N, \quad \varepsilon_t \sim N$$
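Regime Switching Model (Nonlinear / Gaussian)

$$\begin{pmatrix} \alpha_{1t} \\ \alpha_{2t} \end{pmatrix} = \begin{pmatrix} \phi & 0 \\ 0 & \gamma \end{pmatrix} \begin{pmatrix} \alpha_{1,t-1} \\ \alpha_{2,t-1} \end{pmatrix} + \begin{pmatrix} \eta_{1t} \\ \eta_{2t} \end{pmatrix}$$

$$y_t = \mu_0 + \delta\,I(\alpha_{2t} > 0) + (1, 0) \begin{pmatrix} \alpha_{1t} \\ \alpha_{2t} \end{pmatrix}$$

$$\eta_{1t} \sim N^{\eta_1}, \quad \eta_{2t} \sim N^{\eta_2}, \quad \eta_{1t} \perp \eta_{2t}$$

Extensions to:
– Richer α1 dynamics (governing the observed y)
– Richer α2 dynamics (governing the latent regime)
– Richer ηt distribution (e.g., η2t asymmetric)
– More than two states
– Switching also on dynamic parameters, volatilities, etc.
– Multivariate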
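Stochastic Volatility Model (Nonlinear/Gaussian Form)

$$h_t = \omega + \beta h_{t-1} + \eta_t \quad \text{(transition)}$$

$$r_t = \sqrt{e^{h_t}}\,\varepsilon_t \quad \text{(measurement)}$$

$$\eta_t \sim N(0, \sigma_\eta^2), \quad \varepsilon_t \sim N(0, 1)$$

Stochastic Volatility Model (Linear/Non-Gaussian Form)

$$h_t = \omega + \beta h_{t-1} + \eta_t \quad \text{(transition)}$$

$$2\ln|r_t| = h_t + 2\ln|\varepsilon_t| \quad \text{(measurement)}$$

or

$$h_t = \omega + \beta h_{t-1} + \eta_t$$

$$y_t = h_t + u_t$$

$$\eta_t \sim N(0, \sigma_\eta^2), \quad u_t \sim D^u$$

– A “signal plus (non-Gaussian) noise”
components model for volatility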
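
The equivalence of the two forms can be seen by simulation. In this sketch the parameter values, the seed, and the initialization of h at its unconditional mean are illustrative assumptions. It simulates the nonlinear/Gaussian form and checks that 2 ln|r_t| = h_t + 2 ln|ε_t| holds observation by observation.

```python
import numpy as np

rng = np.random.default_rng(1)                      # arbitrary seed (assumption)
T, omega, beta, sigma_eta = 1000, -0.1, 0.95, 0.2   # illustrative parameter values

eta = sigma_eta * rng.standard_normal(T)
eps = rng.standard_normal(T)

# Transition: h_t = omega + beta * h_{t-1} + eta_t
h = np.zeros(T)
h[0] = omega / (1 - beta)                           # start at the unconditional mean
for t in range(1, T):
    h[t] = omega + beta * h[t - 1] + eta[t]

# Nonlinear/Gaussian measurement: r_t = sqrt(exp(h_t)) * eps_t
r = np.exp(h / 2) * eps

# Linear/non-Gaussian measurement: y_t = h_t + u_t
y = 2 * np.log(np.abs(r))
u = 2 * np.log(np.abs(eps))                         # log chi-square(1) noise, not Gaussian
```

The noise `u` is heavily left-skewed, which is why the linear form needs non-Gaussian filtering or a quasi-likelihood approximation.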
Realized and Integrated Volatility

$$IV_t = \phi\,IV_{t-1} + \eta_t$$

$$RV_t = IV_t + \varepsilon_t$$

$\varepsilon_t$ represents the fact that RV is based on less than an infinite sampling frequency.
Microstructure Noise Model
**Hasbrouck
(Non-linear / non-Gaussian)
A Distributional Statement of the Kalman Filter ****
Multivariate Stochastic Volatility with Factor Structure
***
Approaches to the General Filtering Problem
Kitagawa (1987), numerical integration (linear / non-Gaussian)
More recently, Monte Carlo integration
Extended Kalman Filter (Non-Linear / Gaussian)

$$\alpha_t = Q(\alpha_{t-1}, \eta_t)$$

$$y_t = G(\alpha_t, \varepsilon_t)$$

$$\eta_t \sim N, \quad \varepsilon_t \sim N$$

Take first-order Taylor expansions of:
Q around $a_{t-1}$
G around $a_{t,t-1}$
Use Kalman filter on the approximated system
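
A single extended Kalman filter step can be sketched as follows. This is a minimal illustration under simplifying assumptions that go beyond the text: a scalar state, additive noise (the "specialized" form α_t = Q(α_{t−1}) + η_t, y_t = G(α_t) + ε_t), and user-supplied derivatives `dQ` and `dG`. All names are hypothetical.

```python
def ekf_step(a_prev, P_prev, y, Q, G, dQ, dG, q_var, h_var):
    """One scalar EKF step for alpha_t = Q(alpha_{t-1}) + eta_t, y_t = G(alpha_t) + eps_t.

    q_var and h_var are the variances of eta_t and eps_t.
    """
    # Prediction: linearize Q around the previous filtered state a_{t-1}
    T_lin = dQ(a_prev)
    a_pred = Q(a_prev)
    P_pred = T_lin * P_prev * T_lin + q_var

    # Update: linearize G around the predicted state a_{t,t-1}
    Z_lin = dG(a_pred)
    v = y - G(a_pred)                       # one-step prediction error
    F = Z_lin * P_pred * Z_lin + h_var      # prediction error variance
    K = P_pred * Z_lin / F                  # Kalman gain
    a_filt = a_pred + K * v
    P_filt = (1 - K * Z_lin) * P_pred
    return a_filt, P_filt

# With linear Q and G the step reduces to the ordinary Kalman filter
a1, P1 = ekf_step(0.0, 1.0, 1.0,
                  Q=lambda a: 0.9 * a, G=lambda a: a,
                  dQ=lambda a: 0.9, dG=lambda a: 1.0,
                  q_var=1.0, h_var=1.0)
```

In the linear check, the predicted variance is 0.81 + 1 = 1.81 and the gain is 1.81/2.81, so both the filtered state and its variance equal 1.81/2.81.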
Unscented Kalman Filter (Non-Linear / Gaussian)
Bayes Analysis of SSMs: Carlin-Polson-Stoffer 1992 JASA
“single-move” Gibbs sampler
(Many parts of the Gibbs iteration: the parameter vector, and then each observation of
the state vector, period-by-period)
Multi-move Gibbs sampler can handle non-Gaussian (via mixtures of normals), but not
nonlinear.
Single-move can handle nonlinear and non-Gaussian.
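Expanding $S(\hat\theta_{ML})$ around θ yields:

$$S(\hat\theta_{ML}) \approx S(\theta) + S'(\theta)(\hat\theta_{ML} - \theta) = S(\theta) + H(\theta)(\hat\theta_{ML} - \theta).$$

Noting that $S(\hat\theta_{ML}) \equiv 0$ and taking expectations yields:

$$0 \approx S(\theta) - I_{EX,H}(\theta)(\hat\theta_{ML} - \theta)$$

or

$$(\hat\theta_{ML} - \theta) \approx I_{EX,H}^{-1}(\theta)\,S(\theta).$$

Using $S(\theta) \overset{a}{\sim} N(0, I_{EX,H}(\theta))$ then implies:

$$(\hat\theta_{ML} - \theta) \overset{a}{\sim} N(0, I_{EX,H}^{-1}(\theta))$$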
or
***
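Case 3: β and σ²
Joint prior $g(\beta, \frac{1}{\sigma^2}) = g(\beta \mid \frac{1}{\sigma^2})\,g(\frac{1}{\sigma^2})$
where $\beta \mid \frac{1}{\sigma^2} \sim N(\beta_0, \Sigma_0)$ and $\frac{1}{\sigma^2} \sim G(\frac{v_0}{2}, \frac{\delta_0}{2})$
HW: Show that the joint posterior,

$$p(\beta, \tfrac{1}{\sigma^2} \mid y) = g(\beta, \tfrac{1}{\sigma^2})\,L(\beta, \tfrac{1}{\sigma^2} \mid y)$$

can be factored as $p(\beta \mid \frac{1}{\sigma^2}, y)\,p(\frac{1}{\sigma^2} \mid y)$
where $\beta \mid \frac{1}{\sigma^2}, y \sim N(\beta_1, \Sigma_1)$
and $\frac{1}{\sigma^2} \mid y \sim G(\frac{v_1}{2}, \frac{\delta_1}{2})$,
and derive expressions for β1, Σ1, v1, δ1
in terms of β0, Σ0, δ0, x, and y.
Moreover, the key marginal posterior
$P(\beta \mid y) = \int_0^\infty p(\beta, \frac{1}{\sigma^2} \mid y)\,d\sigma^2$ is multivariate t.
Implement the Bayesian methods via Gibbs sampling.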
Linear Quadratic Business Cycle Model
Hansen and Sargent
Linear Gaussian state space system
Parameter-Driven vs. Observation-Driven Models
Parameter-driven: Time-varying parameters measurable w.r.t. latent variables
Observation-driven: Time-varying parameters measurable w.r.t. observable variables
Parameter-driven models are mathematically appealing but hard to estimate. Observation-
driven models are less mathematically appealing but easy to estimate. State-space models,
in general, are parameter-driven. Stochastic volatility models are parameter-driven, while
ARCH models are observation-driven.
—–
Regime Switching
We have emphasized dynamic linear models, which are tremendously important in practice.
They’re called linear because $y_t$ is a simple linear function of past $y$’s or past $\varepsilon$’s.
In some forecasting situations, however, good statistical characterization of dynamics may
require some notion of regime switching, as between “good” and “bad” states, which is a
type of nonlinear model.
Models incorporating regime switching have a long tradition in business-cycle analysis,
in which expansion is the good state, and contraction (recession) is the bad state. This
idea is also manifest in the great interest in the popular press, for example, in identifying
and forecasting turning points in economic activity. It is only within a regime-switching
framework that the concept of a turning point has intrinsic meaning; turning points are
naturally and immediately defined as the times separating expansions and contractions.
—————
Observable Regime Indicators
Threshold models are squarely in line with the regime-switching tradition. The follow-
ing threshold model, for example, has three regimes, two thresholds, and a d-period delay
regulating the switches:
$$y_t = \begin{cases} c^{(u)} + \phi^{(u)} y_{t-1} + \varepsilon_t^{(u)}, & \theta^{(u)} < y_{t-d} \\ c^{(m)} + \phi^{(m)} y_{t-1} + \varepsilon_t^{(m)}, & \theta^{(l)} < y_{t-d} < \theta^{(u)} \\ c^{(l)} + \phi^{(l)} y_{t-1} + \varepsilon_t^{(l)}, & \theta^{(l)} > y_{t-d} \end{cases}$$

The superscripts indicate “upper,” “middle,” and “lower” regimes, and the regime operative
at any time t depends on the observable past history of y – in particular, on the value of
$y_{t-d}$.
—————–
Latent Markovian Regimes
Although observable threshold models are of interest, models with latent states as opposed
to observed states may be more appropriate in many business, economic and financial con-
texts. In such a setup, time-series dynamics are governed by a finite-dimensional parameter
vector that switches (potentially each period) depending upon which of two unobservable
states is realized, with state transitions governed by a first-order Markov process. To make
matters concrete, let’s take a simple example. Let $\{s_t\}_{t=1}^T$ be the (latent) sample path of a
two-state first-order autoregressive process, taking just the two values 0 or 1, with transition
probability matrix given by

$$M = \begin{pmatrix} p_{00} & 1 - p_{00} \\ 1 - p_{11} & p_{11} \end{pmatrix}.$$

The ij-th element of M gives the probability of moving from state i (at time t − 1) to state
j (at time t). Note that there are only two free parameters, the staying probabilities, $p_{00}$
and $p_{11}$. Let $\{y_t\}_{t=1}^T$ be the sample path of an observed time series that depends on $\{s_t\}_{t=1}^T$
such that the density of $y_t$ conditional upon $\{s_t\}$ is

$$f(y_t \mid s_t; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( \frac{-(y_t - \mu_{s_t})^2}{2\sigma^2} \right).$$

Thus, yt is Gaussian white noise with a potentially switching mean. The two means around
which yt moves are of particular interest and may, for example, correspond to episodes of
differing growth rates (“booms” and “recessions”, “bull” and “bear” markets, etc.).
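
The switching-mean model is easy to simulate. The sketch below is illustrative only: the function name, the parameter values, the seed, and the choice to start in state 0 are assumptions. It also computes the ergodic state probabilities implied by the two staying probabilities.

```python
import numpy as np

def simulate_switching_mean(T, p00, p11, mu, sigma, rng):
    """Simulate s_t from a two-state Markov chain and y_t ~ N(mu[s_t], sigma^2)."""
    M = np.array([[p00, 1 - p00],
                  [1 - p11, p11]])          # row i: probabilities of moving from state i
    s = np.zeros(T, dtype=int)              # start in state 0 (assumption)
    for t in range(1, T):
        s[t] = rng.choice(2, p=M[s[t - 1]])
    y = mu[s] + sigma * rng.standard_normal(T)
    return s, y, M

rng = np.random.default_rng(2)              # arbitrary seed (assumption)
p00, p11 = 0.9, 0.8
s, y, M = simulate_switching_mean(200, p00, p11, np.array([1.0, -1.0]), 0.5, rng)

# Ergodic (unconditional) state probabilities, which satisfy pi = pi M
pi0 = (1 - p11) / (2 - p00 - p11)
pi = np.array([pi0, 1 - pi0])
```

With these values the chain spends about two thirds of the time in state 0 in the long run.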
Chapter Ten

Volatility Dynamics

10.1 VOLATILITY AND FINANCIAL ECONOMETRICS

10.2 GARCH

10.3 STOCHASTIC VOLATILITY

10.4 OBSERVATION-DRIVEN VS. PARAMETER-DRIVEN PROCESSES

Prologue: Reading
Much of what follows draws heavily upon:

• Andersen, T.G., Bollerslev, T., Christoffersen, P.F. and Diebold, F.X. (2012), “Financial
Risk Measurement for Financial Risk Management,” in G. Constantinides, M.
Harris and Rene Stulz (eds.), Handbook of the Economics of Finance, Elsevier.

• Andersen, T.G., Bollerslev, T. and Diebold, F.X. (2010), “Parametric and Nonparametric
Volatility Measurement,” in L.P. Hansen and Y. Ait-Sahalia (eds.), Handbook
of Financial Econometrics. Amsterdam: North-Holland, 67-138.

• Andersen, T.G., Bollerslev, T., Christoffersen, P.F., and Diebold, F.X. (2006), “Volatility
and Correlation Forecasting,” in G. Elliott, C.W.J. Granger, and A. Timmermann
(eds.), Handbook of Economic Forecasting. Amsterdam: North-Holland, 778-878.

Prologue

• Throughout: Desirability of conditional risk measurement

• Aggregation level
– Portfolio-level (aggregated, univariate): Risk measurement
– Asset-level (disaggregated, multivariate): Risk management

• Frequency of data observations
– Low-frequency vs. high-frequency data
– Parametric vs. nonparametric volatility measurement

• Object measured and modeled
– From conditional variances to conditional densities

• Dimensionality reduction in “big data” multivariate environments
– From ad hoc statistical restrictions to factor structure

Figure 10.1: Time Series of Daily NYSE Returns

What’s in the Data?


Returns
Key Fact 1: Returns are Approximately Serially Uncorrelated
Key Fact 2: Returns are not Gaussian
Key Fact 3: Returns are Conditionally Heteroskedastic I
Key Fact 3: Returns are Conditionally Heteroskedastic II

Why Care About Volatility Dynamics?


Everything Changes when Volatility is Dynamic

• Risk management

• Portfolio allocation

• Asset pricing

• Hedging

• Trading

Risk Management
Figure 10.2: Correlogram of Daily NYSE Returns.

Figure 10.3: Histogram and Statistics for Daily NYSE Returns.

Figure 10.4: Time Series of Daily Squared NYSE Returns.

Figure 10.5: Correlogram of Daily Squared NYSE Returns.

Individual asset returns:

$$r \sim (\mu, \Sigma)$$

Portfolio returns:

$$r_p = \lambda' r \sim (\lambda'\mu, \lambda'\Sigma\lambda)$$

If Σ varies, we need to track time-varying portfolio risk, $\lambda'\Sigma_t\lambda$

Portfolio Allocation
Optimal portfolio shares $w^*$ solve:

$$\min_w\; w'\Sigma w \quad \text{s.t.} \quad w'\mu = \mu_p$$

Importantly, $w^* = f(\Sigma)$
If Σ varies, we have $w_t^* = f(\Sigma_t)$
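
The dependence of w* on Σ can be made explicit. With only the target-return constraint stated above (no adding-up constraint), the Lagrangian first-order condition gives w* = μ_p Σ⁻¹μ / (μ′Σ⁻¹μ). The sketch below implements that closed form; the function name and the numerical inputs are illustrative assumptions.

```python
import numpy as np

def min_variance_weights(Sigma, mu, mu_p):
    """Solve min_w w' Sigma w subject to w' mu = mu_p (no adding-up constraint)."""
    x = np.linalg.solve(Sigma, mu)      # Sigma^{-1} mu
    return mu_p * x / (mu @ x)

Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])        # illustrative covariance matrix
mu = np.array([0.05, 0.08])             # illustrative expected returns
w_star = min_variance_weights(Sigma, mu, mu_p=0.06)

# If Sigma changes, so do the optimal shares
w_star_high_vol = min_variance_weights(Sigma * np.array([[4.0, 2.0], [2.0, 1.0]]), mu, mu_p=0.06)
```

Re-solving with a different covariance matrix changes the weights, which is the point of w_t* = f(Σ_t).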
Asset Pricing I: Sharpe Ratios
Standard Sharpe:

$$\frac{E(r_{it} - r_{ft})}{\sigma}$$

Conditional Sharpe:

$$\frac{E(r_{it} - r_{ft})}{\sigma_t}$$

Asset Pricing II: CAPM
Standard CAPM:

$$(r_{it} - r_{ft}) = \alpha + \beta\,(r_{mt} - r_{ft})$$

$$\beta = \frac{cov((r_{it} - r_{ft}), (r_{mt} - r_{ft}))}{var(r_{mt} - r_{ft})}$$

Conditional CAPM:

$$\beta_t = \frac{cov_t((r_{it} - r_{ft}), (r_{mt} - r_{ft}))}{var_t(r_{mt} - r_{ft})}$$
Asset Pricing III: Derivatives
Black-Scholes:

$$C = N(d_1)\,S - N(d_2)\,K e^{-r\tau}$$

$$d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)\tau}{\sigma\sqrt{\tau}}$$

$$d_2 = \frac{\ln(S/K) + (r - \sigma^2/2)\tau}{\sigma\sqrt{\tau}}$$

$$P_C = BS(\sigma, \ldots)$$

(Standard Black-Scholes options pricing)
Completely different when σ varies!
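
The Black-Scholes call formula above translates directly into code. This is a minimal sketch using only the standard library; the function names and the example inputs are illustrative.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal CDF N(x), written via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, r, sigma, tau):
    """C = N(d1) S - N(d2) K exp(-r tau), with constant volatility sigma."""
    d1 = (log(S / K) + (r + sigma**2 / 2) * tau) / (sigma * sqrt(tau))
    d2 = (log(S / K) + (r - sigma**2 / 2) * tau) / (sigma * sqrt(tau))
    return norm_cdf(d1) * S - norm_cdf(d2) * K * exp(-r * tau)

# The price is increasing in sigma, so the assumed volatility matters a great deal
c_low = black_scholes_call(100.0, 100.0, 0.05, 0.20, 1.0)
c_high = black_scholes_call(100.0, 100.0, 0.05, 0.40, 1.0)
```

The formula takes a single constant σ, which is exactly what breaks down when volatility is dynamic.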
Hedging

• Standard delta hedging

$$\Delta H_t = \delta\,\Delta S_t + u_t$$

$$\delta = \frac{cov(\Delta H_t, \Delta S_t)}{var(\Delta S_t)}$$

• Dynamic hedging

$$\Delta H_t = \delta_t\,\Delta S_t + u_t$$

$$\delta_t = \frac{cov_t(\Delta H_t, \Delta S_t)}{var_t(\Delta S_t)}$$

Trading

• Standard case: no way to trade on fixed volatility

• Time-varying volatility I: Options straddles, strangles, etc. Take position according to
whether $P_C \gtrless f(\sigma_{t+h,t}, \ldots)$
(indirect)

• Time-varying volatility II: Volatility swaps
Effectively futures contracts written on underlying
“realized volatility”
(direct)

Some Warm-Up
Unconditional Volatility Measures
Variance: $\sigma^2 = E(r_t - \mu)^2$ (or standard deviation: σ)
Mean Absolute Deviation: $MAD = E|r_t - \mu|$
Interquartile Range: $IQR = 75\% - 25\%$
p% Value at Risk ($VaR^p$): $x$ s.t. $P(r_t < x) = p$
Outlier probability: $P(|r_t - \mu| > 5\sigma)$ (for example)
Tail index: γ s.t. $P(r_t > r) = k\,r^{-\gamma}$
Kurtosis: $K = E(r - \mu)^4 / \sigma^4$

[Figure 10.6 plot: True Probability, % (vertical axis) against Day Number (horizontal axis, 0 to 1000).]

Figure 10.6: True Exceedance Probabilities of Nominal 1% HS-VaR When Volatility is
Persistent. We simulate returns from a realistically-calibrated dynamic volatility model, after which we
compute 1-day 1% HS-VaR using a rolling window of 500 observations. We plot the daily series of true
conditional exceedance probabilities, which we infer from the model. For visual reference we include a
horizontal line at the desired 1% probability level.
Dangers of a Largely Unconditional Perspective (HS-VaR)
Dangers of an Unconditional Perspective, Take II
The unconditional HS-VaR perspective encourages incorrect rules of thumb, like scaling
by $\sqrt{h}$ to convert 1-day vol into h-day vol.
Conditional VaR
Conditional VaR ($VaR^p_{T+1|T}$) solves:

$$p = P_T(r_{T+1} \le -VaR^p_{T+1|T}) = \int_{-\infty}^{-VaR^p_{T+1|T}} f_T(r_{T+1})\,dr_{T+1}$$

($f_T(r_{T+1})$ is density of $r_{T+1}$ conditional on time-T information)

But VaR of any Flavor has Issues

• VaR is silent regarding expected loss when VaR is exceeded
(fails to assess the entire distributional tail)

• VaR fails to capture beneficial effects of portfolio diversification

Conditionally expected shortfall:

$$ES^p_{T+1|T} = p^{-1} \int_0^p VaR^{\gamma}_{T+1|T}\,d\gamma$$

• ES assesses the entire distributional tail

• ES captures the beneficial effects of portfolio diversification


Exponential Smoothing and RiskMetrics

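$$\sigma_t^2 = \lambda\,\sigma_{t-1}^2 + (1 - \lambda)\,r_{t-1}^2$$

$$\sigma_t^2 = \sum_{j=0}^{\infty} \varphi_j\,r_{t-1-j}^2$$

$$\varphi_j = (1 - \lambda)\,\lambda^j$$

(Many initializations possible: $r_1^2$, sample variance, etc.)

$$\text{RM-VaR}^p_{T+1|T} = \sigma_{T+1}\,\Phi_p^{-1}$$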

• Random walk for variance

• Random walk plus noise model for squared returns

• Volatility forecast at any horizon is current smoothed value

• But flat volatility term structure is not realistic
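
The smoothing recursion and its back-substituted form can be compared directly. In the sketch below, the function name, λ = 0.94, the three returns, and the initial variance are illustrative assumptions; with a finite sample the weighted sum picks up an extra term in the initial value.

```python
import numpy as np

def ewma_variance(r, lam, sigma2_init):
    """sigma2[t] = lam * sigma2[t-1] + (1 - lam) * r[t-1]**2, with sigma2[0] = sigma2_init."""
    sigma2 = np.empty(len(r) + 1)
    sigma2[0] = sigma2_init
    for t in range(1, len(r) + 1):
        sigma2[t] = lam * sigma2[t - 1] + (1 - lam) * r[t - 1] ** 2
    return sigma2

r = np.array([0.01, -0.02, 0.015])     # illustrative daily returns
lam, s0 = 0.94, 1e-4
sigma2 = ewma_variance(r, lam, s0)

# Back substitution: geometric weights (1 - lam) * lam**j on lagged squared returns
t = len(r)
weights = (1 - lam) * lam ** np.arange(t)
by_sum = lam**t * s0 + np.sum(weights * r[::-1] ** 2)
```

The last element of `sigma2` is the one-step-ahead variance forecast, and it matches `by_sum`.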

Rigorous Modeling I
Conditional Univariate Volatility Dynamics from “Daily”
Data
Conditional Return Distributions
$f(r_t)$ vs. $f(r_t \mid \Omega_{t-1})$
Key 1: $E(r_t \mid \Omega_{t-1})$
Are returns conditional mean independent? Arguably yes.
Returns are (arguably) approximately serially uncorrelated, and (arguably) approximately
free of additional non-linear conditional mean dependence.
Conditional Return Distributions, Continued
Key 2: $var(r_t \mid \Omega_{t-1}) = E((r_t - \mu)^2 \mid \Omega_{t-1})$
Are returns conditional variance independent? No way!
Squared returns are serially correlated, often with very slow decay.
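The Standard Model
(Linearly Indeterministic Process with iid Innovations)

$$y_t = \sum_{i=0}^{\infty} b_i\,\varepsilon_{t-i}$$

$$\varepsilon \sim iid\,(0, \sigma_\varepsilon^2), \quad \sum_{i=0}^{\infty} b_i^2 < \infty, \quad b_0 = 1$$

Uncond. mean: $E(y_t) = 0$ (constant)
Uncond. variance: $E(y_t - E(y_t))^2 = \sigma_\varepsilon^2 \sum_{i=0}^{\infty} b_i^2$ (constant)
Cond. mean: $E(y_t \mid \Omega_{t-1}) = \sum_{i=1}^{\infty} b_i\,\varepsilon_{t-i}$ (varies)
Cond. variance: $E([y_t - E(y_t \mid \Omega_{t-1})]^2 \mid \Omega_{t-1}) = \sigma_\varepsilon^2$ (constant)

The Standard Model, Continued
k-Step-Ahead Least Squares Forecasting

$$E(y_{t+k} \mid \Omega_t) = \sum_{i=0}^{\infty} b_{k+i}\,\varepsilon_{t-i}$$

Associated prediction error:

$$y_{t+k} - E(y_{t+k} \mid \Omega_t) = \sum_{i=0}^{k-1} b_i\,\varepsilon_{t+k-i}$$

Conditional prediction error variance:

$$E([y_{t+k} - E(y_{t+k} \mid \Omega_t)]^2 \mid \Omega_t) = \sigma_\varepsilon^2 \sum_{i=0}^{k-1} b_i^2$$

Key: Depends only on k, not on $\Omega_t$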


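ARCH(1) Process

$$r_t \mid \Omega_{t-1} \sim N(0, h_t)$$

$$h_t = \omega + \alpha\,r_{t-1}^2$$

$$E(r_t) = 0$$

$$E(r_t - E(r_t))^2 = \frac{\omega}{(1 - \alpha)}$$

$$E(r_t \mid \Omega_{t-1}) = 0$$

$$E([r_t - E(r_t \mid \Omega_{t-1})]^2 \mid \Omega_{t-1}) = \omega + \alpha\,r_{t-1}^2$$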

GARCH(1,1) Process
“Generalized ARCH”

$$r_t \mid \Omega_{t-1} \sim N(0, h_t)$$

$$h_t = \omega + \alpha\,r_{t-1}^2 + \beta\,h_{t-1}$$

$$E(r_t) = 0$$

$$E(r_t - E(r_t))^2 = \frac{\omega}{(1 - \alpha - \beta)}$$

$$E(r_t \mid \Omega_{t-1}) = 0$$

$$E([r_t - E(r_t \mid \Omega_{t-1})]^2 \mid \Omega_{t-1}) = \omega + \alpha\,r_{t-1}^2 + \beta\,h_{t-1}$$

Conditionally-Gaussian GARCH-Based 1-Day VaR

$$\text{GARCH-VaR}^p_{T+1|T} \equiv \sigma_{T+1|T}\,\Phi_p^{-1}$$

– Consistent with fat tails of unconditional return distribution
– Can be extended to allow for fat-tailed conditional distribution
Unified Theoretical Framework

• Volatility dynamics (of course, by construction)

• Conditional symmetry translates into unconditional symmetry

• Volatility clustering produces unconditional leptokurtosis

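Tractable Empirical Framework

$$L(\theta; r_1, \ldots, r_T) \approx f(r_T \mid \Omega_{T-1}; \theta)\,f(r_{T-1} \mid \Omega_{T-2}; \theta) \cdots f(r_{p+1} \mid \Omega_p; \theta)$$

If the conditional densities are Gaussian,

$$f(r_t \mid \Omega_{t-1}; \theta) = \frac{1}{\sqrt{2\pi}}\,h_t(\theta)^{-1/2} \exp\left( -\frac{1}{2} \frac{r_t^2}{h_t(\theta)} \right)$$

$$\ln L(\theta; r_{p+1}, \ldots, r_T) \approx -\frac{T - p}{2} \ln(2\pi) - \frac{1}{2} \sum_{t=p+1}^{T} \ln h_t(\theta) - \frac{1}{2} \sum_{t=p+1}^{T} \frac{r_t^2}{h_t(\theta)}$$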
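
The Gaussian log likelihood above is straightforward to evaluate once the h_t recursion has been run. The sketch below does this for GARCH(1,1) with p = 1. The function name, the way the first conditional variance is initialized (`h_init`), and the numbers are assumptions for illustration; a real application would pass this function to a numerical optimizer.

```python
import numpy as np

def garch11_loglik(params, r, h_init):
    """Conditional Gaussian log likelihood for GARCH(1,1), conditioning on r[0] (p = 1)."""
    omega, alpha, beta = params
    T = len(r)
    h = np.empty(T)
    h[0] = h_init                                   # presample variance (assumption)
    for t in range(1, T):
        h[t] = omega + alpha * r[t - 1] ** 2 + beta * h[t - 1]
    # Sum over t = p+1, ..., T, which is index 1 onward here
    ll = -0.5 * ((T - 1) * np.log(2 * np.pi)
                 + np.sum(np.log(h[1:]))
                 + np.sum(r[1:] ** 2 / h[1:]))
    return ll, h

r = np.array([0.1, -0.2, 0.05])                     # illustrative returns
ll, h = garch11_loglik((0.01, 0.1, 0.8), r, h_init=0.02)
```

Here h[1] = 0.01 + 0.1(0.01) + 0.8(0.02) = 0.027 and h[2] = 0.01 + 0.1(0.04) + 0.8(0.027) = 0.0356.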

The Squared Return as a Noisy Volatility Proxy
Note that we can write:

$$r_t^2 = h_t + \nu_t$$

Thus $r_t^2$ is a noisy indicator of $h_t$
Various approaches handle the noise in various ways.

GARCH(1,1) and Exponential Smoothing
Exponential smoothing recursion:

$$\bar{r}_t^2 = \gamma\,r_t^2 + (1 - \gamma)\,\bar{r}_{t-1}^2$$

Back substitution yields:

$$\bar{r}_t^2 = \sum_{j=0}^{\infty} w_j\,r_{t-j}^2$$

where $w_j = \gamma\,(1 - \gamma)^j$
But in GARCH(1,1) we have:

$$h_t = \omega + \alpha\,r_{t-1}^2 + \beta\,h_{t-1}$$

$$h_t = \frac{\omega}{1 - \beta} + \alpha \sum_{j=1}^{\infty} \beta^{j-1}\,r_{t-j}^2$$
Variance Targeting
Sample unconditional variance:

$$\hat\sigma^2 = \frac{1}{T} \sum_{t=1}^{T} r_t^2$$

Implied unconditional GARCH(1,1) variance:

$$\sigma^2 = \frac{\omega}{1 - \alpha - \beta}$$

We can constrain $\sigma^2 = \hat\sigma^2$ by constraining:

$$\omega = (1 - \alpha - \beta)\,\hat\sigma^2$$

– Saves a degree of freedom and ensures reasonableness

ARMA Representation in Squares
$r_t^2$ has the ARMA(1,1) representation:

$$r_t^2 = \omega + (\alpha + \beta)\,r_{t-1}^2 - \beta\,\nu_{t-1} + \nu_t,$$

where $\nu_t = r_t^2 - h_t$.
Variations on the GARCH Theme

• Regression with GARCH Disturbances

• Incorporating Exogenous Variables

• Asymmetric Response and the Leverage Effect

• Fat-Tailed Conditional Densities

• Time-Varying Risk Premia

Regression with GARCH Disturbances

$$y_t = x_t'\beta + \varepsilon_t$$

$$\varepsilon_t \mid \Omega_{t-1} \sim N(0, h_t)$$

Incorporating Exogenous Variables

$$h_t = \omega + \alpha\,r_{t-1}^2 + \beta\,h_{t-1} + \gamma' z_t$$

γ is a parameter vector
z is a set of positive exogenous variables.
Asymmetric Response and the Leverage Effect I: TARCH
Standard GARCH: $h_t = \omega + \alpha\,r_{t-1}^2 + \beta\,h_{t-1}$
TARCH: $h_t = \omega + \alpha\,r_{t-1}^2 + \gamma\,r_{t-1}^2 D_{t-1} + \beta\,h_{t-1}$

$$D_t = \begin{cases} 1 & \text{if } r_t < 0 \\ 0 & \text{otherwise} \end{cases}$$

positive return (good news): α effect on volatility
negative return (bad news): α + γ effect on volatility
γ ≠ 0: Asymmetric news response
γ > 0: “Leverage effect”

Asymmetric Response II: E-GARCH

$$\ln(h_t) = \omega + \alpha \left| \frac{r_{t-1}}{h_{t-1}^{1/2}} \right| + \gamma\,\frac{r_{t-1}}{h_{t-1}^{1/2}} + \beta \ln(h_{t-1})$$

• Log specification ensures that the conditional variance is positive.

• Volatility driven by both size and sign of shocks

• Leverage effect when γ < 0

Fat-Tailed Conditional Densities: t-GARCH
If r is conditionally Gaussian, then $\frac{r_t}{\sqrt{h_t}} \sim N(0, 1)$
But often with high-frequency data, $\frac{r_t}{\sqrt{h_t}} \sim$ fat-tailed
So take:

$$r_t = h_t^{1/2} z_t$$

$$z_t \overset{iid}{\sim} \frac{t_d}{std(t_d)}$$

Figure 10.7: GARCH(1,1) Estimation, Daily NYSE Returns.

Time-Varying Risk Premia: GARCH-M
Standard GARCH regression model:

$$y_t = x_t'\beta + \varepsilon_t$$

$$\varepsilon_t \mid \Omega_{t-1} \sim N(0, h_t)$$

GARCH-M model is a special case:

$$y_t = x_t'\beta + \gamma h_t + \varepsilon_t$$

$$\varepsilon_t \mid \Omega_{t-1} \sim N(0, h_t)$$

A GARCH(1,1) Example
After Exploring Lots of Possible Extensions...

Rigorous Modeling II
Conditional Univariate Volatility Dynamics from High-Frequency Data

Figure 10.8: Correlogram of Squared Standardized GARCH(1,1) Residuals, Daily NYSE Returns.

Figure 10.9: Estimated Conditional Standard Deviation, Daily NYSE Returns.

Figure 10.10: Conditional Standard Deviation, History and Forecast, Daily NYSE Returns.

Dependent Variable: R
Method: ML - ARCH (Marquardt) - Student's t distribution
Date: 04/10/12 Time: 13:48
Sample (adjusted): 2 3461
Included observations: 3460 after adjustments
Convergence achieved after 19 iterations
Presample variance: backcast (parameter = 0.7)
GARCH = C(4) + C(5)*RESID(-1)^2 + C(6)*RESID(-1)^2*(RESID(-1)<0) + C(7)*GARCH(-1)

Variable                      Coefficient   Std. Error   z-Statistic   Prob.

@SQRT(GARCH)                  0.083360      0.053138     1.568753      0.1167
C                             1.28E-05      0.000372     0.034443      0.9725
R(-1)                         0.073763      0.017611     4.188535      0.0000

Variance Equation

C                             1.03E-06      2.23E-07     4.628790      0.0000
RESID(-1)^2                   0.014945      0.009765     1.530473      0.1259
RESID(-1)^2*(RESID(-1)<0)     0.094014      0.014945     6.290700      0.0000
GARCH(-1)                     0.922745      0.009129     101.0741      0.0000

T-DIST. DOF                   5.531579      0.478432     11.56188      0.0000

Figure 10.11: AR(1) Returns with Threshold t-GARCH(1,1)-in Mean.


[Figure 10.12 panels: “Daily Annualized Realized Volatility (%)” and “Daily close-to-close Returns (%)”, 1990 to 2010.]

Figure 10.12: S&P500 Daily Returns and Volatilities (Percent). The top panel shows daily S&P500
returns, and the bottom panel shows daily S&P500 realized volatility. We compute realized volatility as the
square root of AvgRV , where AvgRV is the average of five daily RVs each computed from 5-minute squared
returns on a 1-minute grid of S&P500 futures prices.

Intraday Data and Realized Volatility

$$dp(t) = \mu(t)\,dt + \sigma(t)\,dW(t)$$

$$RV_t(\Delta) \equiv \sum_{j=1}^{N(\Delta)} \left( p_{t-1+j\Delta} - p_{t-1+(j-1)\Delta} \right)^2$$

$$RV_t(\Delta) \to IV_t = \int_{t-1}^{t} \sigma^2(\tau)\,d\tau$$
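
Realized variance for one day is just the sum of squared intraday log-price increments. The sketch below uses a made-up four-point price path, and the function name is an illustrative choice.

```python
import numpy as np

def realized_variance(log_prices):
    """RV_t(Delta): sum of squared increments of the intraday log-price path for day t."""
    increments = np.diff(log_prices)
    return np.sum(increments ** 2)

# Hypothetical intraday log prices sampled on an equally spaced grid
p = np.array([0.00, 0.01, 0.00, 0.02])
rv = realized_variance(p)    # 0.01^2 + (-0.01)^2 + 0.02^2
```

Sampling on a finer grid brings RV closer to integrated variance in theory, but in practice microstructure noise limits how fine the grid can usefully be.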

Microstructure Noise
– State space signal extraction
– AvgRV
– Realized kernel
– Many others
RV is Persistent
RV is Reasonably Approximated as Log-Normal
RV is Long-Memory
Exact and Approximate Long Memory
Exact long memory:

$$(1 - L)^d RV_t = \beta_0 + \nu_t$$

[Figure 10.13 panels: “QQ plot of Daily RV-AVR”, “QQ plot of Daily Realized Volatility”, and “QQ plot of Daily log RV-AVR”; each plots Quantiles of Input Sample against Standard Normal Quantiles.]

Figure 10.13: S&P500: QQ Plots for Realized Volatility and Log Realized Volatility. The top
panel plots the quantiles of daily realized volatility against the corresponding normal quantiles. The bottom
panel plots the quantiles of the natural logarithm of daily realized volatility against the corresponding normal
quantiles. We compute realized volatility as the square root of AvgRV , where AvgRV is the average of five
daily RVs each computed from 5-minute squared returns on a 1-minute grid of S&P500 futures prices.

“Corsi model” (HAR):

$$RV_t = \beta_0 + \beta_1 RV_{t-1} + \beta_2 RV_{t-5:t-1} + \beta_3 RV_{t-21:t-1} + \nu_t$$

Even better:

$$\log RV_t = \beta_0 + \beta_1 \log RV_{t-1} + \beta_2 \log RV_{t-5:t-1} + \beta_3 \log RV_{t-21:t-1} + \nu_t$$

– Ensures positivity and promotes normality

RV-VaR

$$\text{RV-VaR}^p_{T+1|T} = \widehat{RV}_{T+1|T}\,\Phi_p^{-1}$$

GARCH-RV

$$\sigma_t^2 = \omega + \beta\,\sigma_{t-1}^2 + \gamma\,RV_{t-1}$$

• Fine for 1-step

• Multi-step requires “closing the system” with an RV equation

– “Realized GARCH”
– “HEAVY”

Separating Jumps

$$QV_t = IV_t + JV_t$$

where

$$JV_t = \sum_{j=1}^{J_t} J_{t,j}^2$$

[Figure 10.14 panels: “ACF of Daily RV-AVR” and “ACF of Daily Return”; each plots Autocorrelation against Lag Order (1 to 250).]

Figure 10.14: S&P500: Sample Autocorrelations of Daily Realized Variance and Daily Return.
The top panel shows realized variance autocorrelations, and the bottom panel shows return
autocorrelations, for displacements from 1 through 250 days. Horizontal lines denote 95% Bartlett bands. Realized
variance is AvgRV, the average of five daily RVs each computed from 5-minute squared returns on a 1-minute
grid of S&P500 futures prices.

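e.g., we might want to explore:

$$RV_t = \beta_0 + \beta_1 IV_{t-1} + \beta_2 IV_{t-5:t-1} + \beta_3 IV_{t-21:t-1} + \alpha_1 JV_{t-1} + \alpha_2 JV_{t-5:t-1} + \alpha_3 JV_{t-21:t-1} + \nu_t$$

But How to Separate Jumps?

• Truncation:

$$TV_t(\Delta) = \sum_{j=1}^{N(\Delta)} \Delta p_{t-1+j\Delta}^2\; I\left( |\Delta p_{t-1+j\Delta}| < T \right)$$

• Bi-Power Variation:

$$BPV_t(\Delta) = \frac{\pi}{2}\,\frac{N(\Delta)}{N(\Delta) - 1} \sum_{j=1}^{N(\Delta)-1} |\Delta p_{t-1+j\Delta}|\,|\Delta p_{t-1+(j+1)\Delta}|$$

• Minimum:

$$MinRV_t(\Delta) = \frac{\pi}{\pi - 2} \left( \frac{N(\Delta)}{N(\Delta) - 1} \right) \sum_{j=1}^{N(\Delta)-1} \min\left\{ |\Delta p_{t-1+j\Delta}|,\, |\Delta p_{t-1+(j+1)\Delta}| \right\}^2$$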
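
The bi-power and minimum estimators above both work on adjacent pairs of absolute intraday returns. The sketch below follows the two formulas term by term; the function names and the three-return example are illustrative assumptions.

```python
import numpy as np

def bipower_variation(dp):
    """BPV: (pi/2) * N/(N-1) * sum of products of adjacent absolute returns."""
    n = len(dp)
    a = np.abs(dp)
    return (np.pi / 2) * (n / (n - 1)) * np.sum(a[:-1] * a[1:])

def min_rv(dp):
    """MinRV: pi/(pi-2) * N/(N-1) * sum of squared minima of adjacent absolute returns."""
    n = len(dp)
    a = np.abs(dp)
    return (np.pi / (np.pi - 2)) * (n / (n - 1)) * np.sum(np.minimum(a[:-1], a[1:]) ** 2)

dp = np.array([0.01, -0.02, 0.01])     # hypothetical intraday returns for one day
bpv = bipower_variation(dp)
mrv = min_rv(dp)
```

Because a single large return is multiplied by (or replaced by the minimum with) a neighboring small one, an isolated jump contributes little to either measure.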

Rigorous Modeling III
Conditional Asset-Level (Multivariate) Volatility Dynamics from “Daily” Data


Multivariate
Univariate volatility models useful for portfolio-level risk measurement (VaR, ES, etc.)
But what about risk management questions:

• Portfolio risk change under a certain scenario involving price movements of a set of
assets or asset classes?

• Portfolio risk change if certain correlations increase suddenly?

• Portfolio risk change if I double my holdings of Intel?

• How do optimal portfolio shares change if the covariance matrix moves in a certain
way?

Similarly, what about almost any other question in asset pricing, hedging, trading? Almost
all involve correlation.
Basic Framework and Issues I
N × 1 return vector $R_t$
N × N covariance matrix $\Omega_t$

• $\frac{N(N+1)}{2}$ distinct elements

• Structure needed for pd or even psd

• Huge number of parameters even for moderate N

• And N may not be moderate!

Basic Framework and Issues II
Univariate:

$$r_t = \sigma_t z_t$$

$$z_t \sim i.i.d.(0, 1)$$

Multivariate:

$$R_t = \Omega_t^{1/2} Z_t$$

$$Z_t \sim i.i.d.(0, I)$$

where $\Omega_t^{1/2}$ is a “square-root” (e.g., Cholesky factor) of $\Omega_t$

Ad Hoc Exponential Smoothing (RM)

$$\Omega_t = \lambda\,\Omega_{t-1} + (1 - \lambda)\,R_{t-1} R_{t-1}'$$

• Assumes that the dynamics of all the variances and covariances are driven by a single
scalar parameter λ (identical smoothness)

• Guarantees that the smoothed covariance matrices are pd so long as $\Omega_0$ is pd

• Common strategy is to set $\Omega_0$ equal to the sample covariance matrix $\frac{1}{T} \sum_{t=1}^{T} R_t R_t'$
(which is pd if T > N)

• But covariance matrix forecasts inherit the implausible scaling properties of the univariate
RM forecasts and will in general be suboptimal

Multivariate GARCH(1,1)

$$vech(\Omega_t) = vech(C) + B\,vech(\Omega_{t-1}) + A\,vech(R_{t-1} R_{t-1}')$$

• vech operator converts the upper triangle of a symmetric matrix into a $\frac{1}{2}N(N+1) \times 1$
column vector

• A and B matrices are both of dimension $\frac{1}{2}N(N+1) \times \frac{1}{2}N(N+1)$

• Even in this “parsimonious” GARCH(1,1) there are $O(N^4)$ parameters
– More than 50 million parameters for N = 100!

Encouraging Parsimony: Diagonal GARCH(1,1)
Diagonal GARCH constrains A and B matrices to be diagonal.

$$vech(\Omega_t) = vech(C) + (I\beta)\,vech(\Omega_{t-1}) + (I\alpha)\,vech(R_{t-1} R_{t-1}')$$

– Still $O(N^2)$ parameters.

Encouraging Parsimony: Scalar GARCH(1,1)
Scalar GARCH constrains A and B matrices to be scalar:

$$vech(\Omega_t) = vech(C) + (I\beta)\,vech(\Omega_{t-1}) + (I\alpha)\,vech(R_{t-1} R_{t-1}')$$

– Mirrors RM, but with the important difference that the $\Omega_t$ forecasts now revert to
$\Omega = (1 - \alpha - \beta)^{-1} C$
– Fewer parameters than diagonal, but still $O(N^2)$
(because of C)
Encouraging Parsimony: Covariance Targeting
Recall variance targeting:

$$\hat\sigma^2 = \frac{1}{T} \sum_{t=1}^{T} r_t^2, \quad \sigma^2 = \frac{\omega}{1 - \alpha - \beta} \implies \text{take } \omega = (1 - \alpha - \beta)\,\hat\sigma^2$$

Covariance targeting is the obvious multivariate generalization:

$$vech(C) = (I - A - B)\,vech\left( \frac{1}{T} \sum_{t=1}^{T} R_t R_t' \right)$$

– Encourages both parsimony and reasonableness


Constant Conditional Correlation (CCC) Model
[Key is to recognize that correlation matrix
is the covariance matrix of standardized returns]
Two-step estimation:

• Estimate N appropriate univariate GARCH models

• Calculate standardized return vector, $\hat{e}_t = \hat{D}_t^{-1} R_t$

• Estimate correlation matrix Γ (assumed constant) as $\frac{1}{T} \sum_{t=1}^{T} \hat{e}_t \hat{e}_t'$

– Quite flexible as the N models can differ across returns

Dynamic Conditional Correlation (DCC) Model
Two-step estimation:

• Estimate N appropriate univariate GARCH models

• Calculate standardized return vector, $\hat{e}_t = \hat{D}_t^{-1} R_t$

• Estimate correlation matrix $\Gamma_t$ (assumed to have scalar GARCH(1,1)-style dynamics)
as follows:

$$vech(\Gamma_t) = vech(C) + (I\beta)\,vech(\Gamma_{t-1}) + (I\alpha)\,vech(e_{t-1} e_{t-1}')$$

– “Correlation targeting” is helpful

DECO

• Time-varying correlations assumed identical across all pairs of assets, which implies:

$$\Gamma_t = (1 - \rho_t)\,I + \rho_t\,J,$$

where J is an N × N matrix of ones

• Analytical inverse facilitates estimation:

$$\Gamma_t^{-1} = \frac{1}{(1 - \rho_t)} \left[ I - \frac{\rho_t}{1 + (N - 1)\rho_t}\,J \right]$$

• Assume GARCH(1,1)-style conditional correlation structure:

$$\rho_t = \omega_\rho + \alpha_\rho u_t + \beta_\rho \rho_{t-1}$$

• Updating rule is naturally given by the average conditional correlation of the standardized
returns,

$$u_t = \frac{2 \sum_{i=1}^{N} \sum_{j>i} e_{i,t}\,e_{j,t}}{N \sum_{i=1}^{N} e_{i,t}^2}$$

• Three parameters, $\omega_\rho$, $\alpha_\rho$ and $\beta_\rho$, to be estimated.

[Figure 10.15 plot: “DECO Correlation, 1973-2009”.]

Figure 10.15: Time-Varying International Equity Correlations. The figure shows the estimated
equicorrelations from a DECO model for the aggregate equity index returns for 16 different developed
markets from 1973 through 2009.
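
The analytical inverse of the equicorrelation matrix is easy to verify numerically. In the sketch below the function names and the values ρ = 0.3 and N = 5 are illustrative assumptions.

```python
import numpy as np

def deco_matrix(rho, N):
    """Equicorrelation matrix Gamma = (1 - rho) I + rho J."""
    return (1 - rho) * np.eye(N) + rho * np.ones((N, N))

def deco_inverse(rho, N):
    """Closed-form inverse: (1 / (1 - rho)) * [I - rho / (1 + (N - 1) rho) * J]."""
    I, J = np.eye(N), np.ones((N, N))
    return (I - rho / (1 + (N - 1) * rho) * J) / (1 - rho)

rho, N = 0.3, 5
Gamma = deco_matrix(rho, N)
Gamma_inv = deco_inverse(rho, N)
```

No numerical matrix inversion is needed at any N, which is what makes likelihood evaluation cheap in large cross sections.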

DECO Example
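Factor Structure

$$R_t = \lambda F_t + \nu_t$$

where

$$F_t = \Omega_{Ft}^{1/2} Z_t$$

$$Z_t \sim i.i.d.(0, I)$$

$$\nu_t \sim i.i.d.(0, \Omega_\nu)$$

$$\implies \Omega_t = \lambda\,\Omega_{Ft}\,\lambda' + \Omega_\nu$$

One-Factor Case with Everything Orthogonal

$$R_t = \lambda f_t + \nu_t$$

where

$$f_t = \sigma_{ft} z_t$$

$$z_t \sim i.i.d.(0, 1)$$

$$\nu_t \sim i.i.d.(0, \sigma_\nu^2)$$

$$\implies \Omega_t = \sigma_{ft}^2\,\lambda\lambda' + \Omega_\nu$$

$$\sigma_{it}^2 = \sigma_{ft}^2\,\lambda_i^2 + \sigma_{\nu i}^2$$

$$\sigma_{ijt}^2 = \sigma_{ft}^2\,\lambda_i \lambda_j$$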

Rigorous Modeling IV
Conditional Asset-Level (Multivariate) Volatility Dynamics from High-Frequency Data
Realized Covariance

$$dP(t) = M(t)\,dt + \Omega(t)^{1/2}\,dW(t)$$

$$RCov_t(\Delta) \equiv \sum_{j=1}^{N(\Delta)} R_{t-1+j\Delta,\Delta}\,R_{t-1+j\Delta,\Delta}'$$

$$RCov_t(\Delta) \to ICov_t = \int_{t-1}^{t} \Omega(\tau)\,d\tau$$

– p.d. so long as N (∆) > N ; else use regularization methods


Asynchronous Trading and the Epps Effect
– Epps effect biases covariance estimates downward
– Can overcome Epps by lowering sampling frequency to accommodate least-frequently-
traded asset, but that wastes data
– Opposite extreme: Calculate each pairwise realized covariance matrix using appropriate
sampling; then assemble and regularize
Regularization (Shrinkage)

[Figure: QQ plot. Vertical axis: Quantiles of Input Sample; horizontal axis: Standard Normal Quantiles]

Figure 10.16: QQ Plot of S&P500 Returns. We show quantiles of daily S&P500 returns from January
2, 1990 to December 31, 2010, against the corresponding quantiles from a standard normal distribution.

Ω̂St = κ RCovt (∆) + (1 − κ) Υt

– Υt is p.d. and 0 < κ < 1


– Υt = I (naive benchmark)
– Υt = Ω (unconditional covariance matrix)
– Υt = σf² λλ′ + Ων (one-factor market model)
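
A minimal sketch of the shrinkage step, assuming the naive identity target and an arbitrary κ = 0.5 (both illustrative choices, not recommendations). The intraday returns are made-up numbers chosen so that the realized covariance is singular; `shrink` is a hypothetical helper name.

```python
import numpy as np

def shrink(rcov, target, kappa):
    """Convex combination kappa * RCov + (1 - kappa) * target, with 0 < kappa < 1."""
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie strictly between 0 and 1")
    return kappa * np.asarray(rcov) + (1.0 - kappa) * np.asarray(target)

# Two intraday observations on three assets: RCov is singular by construction
r = np.array([[0.010, -0.020, 0.005],
              [0.000, 0.010, -0.010]])
rcov = r.T @ r

shrunk = shrink(rcov, np.eye(3), kappa=0.5)
min_eig_raw = np.linalg.eigvalsh(rcov).min()       # approximately zero
min_eig_shrunk = np.linalg.eigvalsh(shrunk).min()  # strictly positive
```

Because the target is p.d. and the realized covariance is positive semidefinite, the combination is p.d. for any κ strictly between 0 and 1.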
Multivariate GARCH-RV

vech (Ωt ) = vech (C) + B vech (Ωt−1 ) + A vech(Ω̂t−1 )

• Fine for 1-step

• Multi-step requires “closing the system” with an RV equation


– Noureldin et al. (2011), multivariate HEAVY

Rigorous Modeling V
Distributions
Modeling Entire Return Distributions:
Returns are not Unconditionally Gaussian
Modeling Entire Return Distributions:
Returns are Often not Conditionally Gaussian
Modeling Entire Return Distributions: Issues

• Gaussian QQ plots effectively show calibration of Gaussian VaR at different levels

• Gaussian unconditional VaR is terrible

• Gaussian conditional VaR is somewhat better but left tail remains bad

[Figure: QQ plot. Vertical axis: Quantiles of Input Sample; horizontal axis: Standard Normal Quantiles]

Figure 10.17: QQ Plot of S&P500 Returns Standardized by NGARCH Volatilities. We show
quantiles of daily S&P500 returns standardized by the dynamic volatility from an NGARCH model against
the corresponding quantiles of a standard normal distribution. The sample period is January 2, 1990 through
December 31, 2010. The units on each axis are standard deviations.

• Gaussian conditional expected shortfall, which integrates over the left tail, would be
terrible

• So we want more accurate assessment of things like \( VaR^{p}_{T+1|T} \) than those obtained
under Gaussian assumptions
– Doing so for all values of p ∈ [0, 1] requires estimating the entire conditional return
distribution
– More generally, best-practice risk measurement is about tracking the entire conditional
return distribution

Observation-Driven Density Forecasting


Using r = σ ε and GARCH
Assume:

rT+1 = σT+1|T εT+1

εT+1 ∼ iid(0, 1)

Multiply εT+1 draws by σT+1|T (fixed across draws, from a GARCH model) to build up
the conditional density of rT+1.

• εT+1 simulated from standard normal

• εT+1 simulated from standard t

• εT+1 simulated from kernel density fit to rT+1 / σT+1|T

• εT+1 simulated from any density that can be simulated
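
The simulation recipe above can be sketched as follows. All numbers are hypothetical: `sigma_next` stands in for a one-step GARCH volatility forecast, the t degrees of freedom are arbitrary, and `hist_std_resid` is a simulated placeholder for the standardized residuals one would obtain from a fitted model. The t draws are rescaled to unit variance so that every shock distribution satisfies ε ∼ (0, 1).

```python
import numpy as np

rng = np.random.default_rng(42)
sigma_next = 0.015   # hypothetical GARCH forecast of sigma_{T+1|T}
n_draws = 100_000
d = 5                # hypothetical degrees of freedom for the t shocks

# Shock draws under three alternatives
eps_normal = rng.standard_normal(n_draws)
eps_t = rng.standard_t(d, size=n_draws) * np.sqrt((d - 2) / d)  # unit variance
hist_std_resid = rng.standard_t(4, size=2000) * np.sqrt(2 / 4)  # placeholder data
eps_boot = rng.choice(hist_std_resid, size=n_draws, replace=True)

# Conditional return densities: sigma is fixed across draws
r_normal = sigma_next * eps_normal
r_t = sigma_next * eps_t
r_boot = sigma_next * eps_boot

# 1% value at risk read off each simulated density (reported as a positive loss)
var_normal = -np.quantile(r_normal, 0.01)
var_t = -np.quantile(r_t, 0.01)
var_boot = -np.quantile(r_boot, 0.01)
```

Any quantile or tail expectation of the simulated draws can be read off in the same way, which is what makes the full conditional density useful for risk measurement.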



[Figure: QQ plot. Vertical axis: Quantiles of Input Sample; horizontal axis: Standard Normal Quantiles]

Figure 10.18: QQ Plot of S&P500 Returns Standardized by Realized Volatilities. We show
quantiles of daily S&P500 returns standardized by AvgRV against the corresponding quantiles of a standard
normal distribution. The sample period is January 2, 1990 through December 31, 2010. The units on each
axis are standard deviations.

Parameter-Driven Density Forecasting


Using r = σ ε and SV
Assume:

rT +1 = σT +1 εT +1

εT +1 ∼ iid(0, 1)

Multiply εT+1 draws by σT+1 draws (from a simulated SV model) to build up the conditional
density of rT+1.
– Again, εT +1 simulated from any density deemed relevant
Modeling Entire Return Distributions:
Returns Standardized by RV are Approximately Gaussian
A Special Parameter-Driven Density Forecasting Approach
Using r = σ ε and RV
(Log-Normal / Normal Mixture)
Assume:

rT +1 = σT +1 εT +1

εT +1 ∼ iid(0, 1)

Multiply εT +1 draws from N (0, 1) by σT +1 draws (from a simulated RV model fit to log
realized standard deviation) to build up the conditional density of rT +1 .
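
A minimal sketch of the log-normal / normal mixture, assuming the RV model delivers a Gaussian one-step forecast for log realized standard deviation with mean `m` and standard deviation `s`. Both values are hypothetical placeholders. The mixture produces fat tails in returns even though the shocks are Gaussian.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 200_000

# Hypothetical forecast distribution for log realized standard deviation
m, s = np.log(0.01), 0.3

sigma = np.exp(rng.normal(m, s, size=n_draws))  # log-normal volatility draws
eps = rng.standard_normal(n_draws)              # N(0, 1) shock draws
r = sigma * eps                                 # draws from the return density

# Kurtosis of the mixture exceeds the Gaussian value of 3
kurtosis = np.mean(r ** 4) / np.mean(r ** 2) ** 2
```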

Pitfalls of the “r = σ ε” Approach


In the conditionally Gaussian case we can write with no loss of generality:

rT+1 = σT+1|T εT+1

εT+1 ∼ iidN (0, 1)

But in the conditionally non-Gaussian case there is potential loss of generality in writing:

rT+1 = σT+1|T εT+1

εT+1 ∼ iid(0, 1),

because there may be time variation in conditional moments other than σT+1|T, and using
εT+1 ∼ iid(0, 1) assumes that away.
Multivariate Return Distributions
– If reliable realized covariances are available, one could do a multivariate analog of the
earlier lognormal/normal mixture model. But the literature thus far has focused primarily
on conditional distributions for “daily” data.
Return version:

\[ Z_t = \Omega_t^{-1/2} R_t, \quad Z_t \sim i.i.d., \quad E_{t-1}(Z_t) = 0, \quad Var_{t-1}(Z_t) = I \]

Standardized return version (as in DCC):

\[ e_t = D_t^{-1} R_t, \quad E_{t-1}(e_t) = 0, \quad Var_{t-1}(e_t) = \Gamma_t \]

where Dt denotes the diagonal matrix of conditional standard deviations for each of the
assets, and Γt refers to the potentially time-varying conditional correlation matrix.
Leading Examples
Multivariate normal:

\[ f(e_t) = C(\Gamma_t) \exp\left( -\tfrac{1}{2}\, e_t' \Gamma_t^{-1} e_t \right) \]

Multivariate t:

\[ f(e_t) = C(d, \Gamma_t) \left( 1 + \frac{e_t' \Gamma_t^{-1} e_t}{d - 2} \right)^{-(d+N)/2} \]

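Multivariate asymmetric t:

\[
f(e_t) = \frac{ C(d, \dot\Gamma_t)\, K_{\frac{d+N}{2}}\!\left( \sqrt{ \left( d + (e_t - \dot\mu)' \dot\Gamma_t^{-1} (e_t - \dot\mu) \right) \xi' \dot\Gamma_t^{-1} \xi } \right) \exp\!\left( (e_t - \dot\mu)' \dot\Gamma_t^{-1} \xi \right) }{ \left( 1 + \frac{(e_t - \dot\mu)' \dot\Gamma_t^{-1} (e_t - \dot\mu)}{d} \right)^{\frac{d+N}{2}} \left( \sqrt{ \left( d + (e_t - \dot\mu)' \dot\Gamma_t^{-1} (e_t - \dot\mu) \right) \xi' \dot\Gamma_t^{-1} \xi } \right)^{-\frac{d+N}{2}} }
\]

– More flexible than symmetric t but requires estimation of N asymmetry parameters
simultaneously with the other parameters, which is challenging in high dimensions.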
Copula methods sometimes provide a simpler two-step approach.
Copula Methods
Sklar’s Theorem:

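\[ F(e) = G(F_1(e_1), \ldots, F_N(e_N)) \equiv G(u_1, \ldots, u_N) \equiv G(u) \]

\[ f(e) = \frac{\partial^N G(F_1(e_1), \ldots, F_N(e_N))}{\partial e_1 \cdots \partial e_N} = g(u) \times \prod_{i=1}^{N} f_i(e_i) \]

\[ \Longrightarrow \log L = \sum_{t=1}^{T} \log g(u_t) + \sum_{t=1}^{T} \sum_{i=1}^{N} \log f_i(e_{i,t}) \]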

Standard Copulas
Normal:

\[ g(u_t; \Gamma_t^*) = |\Gamma_t^*|^{-1/2} \exp\left\{ -\tfrac{1}{2}\, \Phi^{-1}(u_t)' \left( \Gamma_t^{*-1} - I \right) \Phi^{-1}(u_t) \right\} \]

where Φ−1 (ut ) refers to the N × 1 vector of standard inverse univariate normals, and the
correlation matrix Γ∗t pertains to the N × 1 vector e∗t with typical element,

e∗i,t = Φ−1 (ui,t ) = Φ−1 (Fi (ei,t )).

– Often does not allow for sufficient dependence between tail events.
– t copula
– Asymmetric t copula
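
The normal copula density can be evaluated directly from its formula. The sketch below is illustrative only: `gaussian_copula_logdensity` is a hypothetical helper, and the correlation matrix and probability-integral transforms in the usage lines are arbitrary. It returns the log of g(u; Γ*), which is the first term of the copula log likelihood from Sklar's theorem.

```python
import numpy as np
from statistics import NormalDist

def gaussian_copula_logdensity(u, corr):
    """log g(u; corr) = -0.5 log|corr| - 0.5 z' (corr^{-1} - I) z, z = Phi^{-1}(u)."""
    std_normal = NormalDist()
    z = np.array([std_normal.inv_cdf(ui) for ui in u])  # inverse normal CDF, elementwise
    corr = np.asarray(corr, dtype=float)
    _, logdet = np.linalg.slogdet(corr)
    quad = z @ (np.linalg.inv(corr) - np.eye(len(z))) @ z
    return -0.5 * logdet - 0.5 * quad

# With an identity correlation matrix the copula density is 1 (log density 0)
log_g_indep = gaussian_copula_logdensity([0.2, 0.7], np.eye(2))

# At the medians, z = 0 and only the determinant term remains
corr = np.array([[1.0, 0.5], [0.5, 1.0]])
log_g_median = gaussian_copula_logdensity([0.5, 0.5], corr)
```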
Asymmetric Tail Correlations
Multivariate Distribution Simulation (General Case)
Simulate using:

\[ R_t = \hat\Omega_t^{1/2} Z_t \]

Zt ∼ i.i.d.(0, I)

– Zt may be drawn from parametrically (Gaussian, t, ...) or nonparametrically fitted
distributions, or with replacement from the empirical distribution.

[Figure: 16 Developed Markets, 1973−2009. Threshold Correlation plotted against Standard Deviation; series: Empirical, Gaussian, DECO]

Figure 10.19: Average Threshold Correlations for Sixteen Developed Equity Markets. The
solid line shows the average empirical threshold correlation for GARCH residuals across sixteen developed
equity markets. The dashed line shows the threshold correlations implied by a multivariate standard normal
distribution with constant correlation. The line with square markers shows the threshold correlations from a
DECO model estimated on the GARCH residuals from the 16 equity markets. The figure is based on weekly
returns from 1973 to 2009.


Multivariate Distribution Simulation (Factor Case)
Simulate using:

\[ F_t = \hat\Omega_{F,t}^{1/2} Z_{F,t} \]

\[ R_t = \hat\lambda F_t + \nu_t \]

– ZF,t and νt may be drawn from parametrically or nonparametrically fitted distributions,
or with replacement from the empirical distribution.

Rigorous Modeling VI
Risk, Return and Macroeconomic Fundamentals
We Want to Understand the Financial / Real Connections
Statistical vs. “scientific” models
Returns ↔ Fundamentals
r↔f
Disconnect?
“excess volatility,” “disconnect,” “conundrum,” ...
µr , σr , σf , µf
Links are complex:
µr ↔ σr ↔ σf ↔ µf
Volatilities as intermediaries?
For Example...

                      Mean Recession         Standard    Sample
                      Volatility Increase    Error       Period
Aggregate Returns     43.5%                  3.8%        63Q1-09Q3
Firm-Level Returns    28.6%                  6.7%        69Q1-09Q2

Table 10.1: Stock Return Volatility During Recessions. Aggregate stock-return volatility is
quarterly realized standard deviation based on daily return data. Firm-level stock-return volatility is the
cross-sectional inter-quartile range of quarterly returns.

                      Mean Recession         Standard    Sample
                      Volatility Increase    Error       Period
Aggregate Growth      37.5%                  7.3%        62Q1-09Q2
Firm-Level Growth     23.1%                  3.5%        67Q1-08Q3

Table 10.2: Real Growth Volatility During Recessions. Aggregate real-growth volatility is quarterly
conditional standard deviation. Firm-level real-growth volatility is the cross-sectional inter-quartile range of
quarterly real sales growth.

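GARCH σr usually has no µf or σf :

\[ \sigma_{r,t}^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{r,t-1}^2 . \]

One might want to entertain something like:

\[ \sigma_{r,t}^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{r,t-1}^2 + \delta_1 \mu_{f,t-1} + \delta_2 \sigma_{f,t-1} . \]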
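
The augmented recursion can be sketched as a simple filter. This is only an illustration of the equation above: `garch_x_variance` is a hypothetical name, the inputs in the usage lines are made-up numbers, and nothing in the sketch enforces positivity of the variance when the fundamental terms are negative.

```python
def garch_x_variance(returns, mu_f, sigma_f, omega, alpha, beta,
                     delta1, delta2, sigma2_init):
    """Filter conditional variances with lagged fundamentals added to GARCH(1,1).

    Element t of the output uses returns[t-1], mu_f[t-1], and sigma_f[t-1].
    Positivity is not guaranteed if delta1 * mu_f is sufficiently negative.
    """
    sigma2 = [sigma2_init]
    for t in range(1, len(returns) + 1):
        next_var = (omega
                    + alpha * returns[t - 1] ** 2
                    + beta * sigma2[t - 1]
                    + delta1 * mu_f[t - 1]
                    + delta2 * sigma_f[t - 1])
        sigma2.append(next_var)
    return sigma2

# With delta1 = delta2 = 0 the recursion collapses to plain GARCH(1,1)
plain = garch_x_variance([0.01], [0.0], [0.0], 1e-5, 0.1, 0.8, 0.0, 0.0, 1e-4)
# A positive loading on fundamental volatility raises the return variance
with_f = garch_x_variance([0.01], [0.0], [2e-5], 1e-5, 0.1, 0.8, 0.0, 0.5, 1e-4)
```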

µf ↔ σr
Return Volatility is Higher in Recessions
Schwert’s (1989) “failure”: Very hard to link market risk to expected fundamentals
(leverage, corporate profitability, etc.).
Actually a great success:
Key observation of robustly higher return volatility in recessions!
– Earlier: Officer (1973)
– Later: Hamilton and Lin (1996), Bloom et al. (2009)
Extends to business cycle effects in credit spreads via the Merton model
µf ↔ σr , Continued
Bloom et al. (2009) Results
µf ↔ σf
Fundamental Volatility is Higher in Recessions
More Bloom, Floetotto and Jaimovich (2009) Results
σf ↔ σr
Return Vol is Positively Related to Fundamental Vol
Follows immediately from relationships already documented
Moreover, direct explorations provide direct evidence:
– Engle et al. (2006) time series
– Diebold and Yilmaz (2010) cross section
– Engle and Rangel (2008) panel
Can be extended to fundamental determinants of correlations (Engle and Rangel, 2011)
[Aside: Inflation and its Fundamental (U.S. Time Series)]

Weak inflation / money growth link


[Inflation and its Fundamental (Barro’s Cross Section)]
Strong inflation / money growth link
[Figure: scatter plot of ΔP (% per year) against ΔM currency (% per year)]

Back to σf ↔ σr : Cross-Section Evidence



Real Stock Return Volatility and Real PCE Growth Volatility, 1983-2002

Now Consider Relationships Involving the Equity Premium


?? µr ??
µr ↔ σr
“Risk-Return Tradeoffs” (or Lack Thereof)
Studied at least since Markowitz
ARCH-M characterization:

Rt = β0 + β1 Xt + β2 σt + εt

\[ \sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2 \]

– But subtleties emerge...


µr ↔ µf
Odd Fama-French (1989):

\[ r_{t+1} = \beta_0 + \beta_1 dp_t + \beta_2 term_t + \beta_3 def_t + \varepsilon_{t+1} \]

Less Odd Lettau-Ludvigson (2001):

\[ r_{t+1} = \beta_0 + \beta_1 dp_t + \beta_2 term_t + \beta_3 def_t + \beta_4 cay_t + \varepsilon_{t+1} \]

Natural Campbell-Diebold (2009):

\[ r_{t+1} = \beta_0 + \beta_1 dp_t + \beta_2 term_t + \beta_3 def_t + \beta_4 cay_t + \beta_5 g_t^e + \varepsilon_{t+1} \]
– Also Goetzmann et al. (2009) parallel cross-sectional analysis
Expected Business Conditions are Crucially Important!
µr ↔ σf
Bansal and Yaron (2004)
(and many others recently)

            (1)       (2)       (3)       (4)       (5)       (6)       (7)

g^e_t      -0.22       –         –       -0.21     -0.20       –       -0.20
           (0.08)      –         –       (0.09)    (0.09)      –       (0.10)

DP_t         –         –        0.25      0.17       –        0.19      0.12
                               (0.10)    (0.10)              (0.12)    (0.11)

DEF_t        –         –       -0.11     -0.01       –       -0.10      0.00
                               (0.07)    (0.09)              (0.08)    (0.09)

TERM_t       –         –        0.15      0.17       –        0.09      0.11
                               (0.07)    (0.07)              (0.09)    (0.09)

CAY_t        –        0.24       –         –        0.22      0.17      0.15
                     (0.07)                        (0.08)    (0.11)    (0.10)

So, Good News:


We’re Learning More and More About the Links
– We’ve Come a Long Way Since Markowitz:
µr ↔ σr
The Key Lesson
The business cycle is of central importance for both µr and σr
– Highlights the importance of high-frequency business cycle monitoring. We need to
interact high-frequency real activity with high-frequency financial market activity
e.g., Aruoba-Diebold-Scotti real-time framework at Federal Reserve Bank of Philadelphia
Conclusions

• Reliable risk measurement requires conditional models that allow for time-varying
volatility.

• Risk measurement may be done using univariate volatility models. Many important
recent developments.

• High-frequency return data contain a wealth of volatility information.

• Other tasks require multivariate models. Many important recent developments, espe-
cially for N large. Factor structure is often useful.

• The business cycle emerges as a key macroeconomic fundamental driving risk.

• New developments in high-frequency macro monitoring yield high-frequency real
activity data to match high-frequency financial market data.

****************************
Models for non-negative variables (from Minchul)
Introduction
Motivation: Why do we need dynamic models for positive values?

• Volatility: Time-varying conditional variances



• Duration: Intertrade duration, Unemployment spell

• Count: Defaults of U.S. corporations

Autoregressive Gamma processes (ARG)

• Gourieroux and Jasiak (2006)

• Monfort, Pegoraro, Renne and Roussellet (2014)

Alternative model

• ACD (autoregressive conditional duration) by Engle and Russell (1998)

• Its extension through Dynamic conditional score models

– Harvey (2013)
– Creal, Koopman, and Lucas (2013)

Autoregressive Gamma Processes


Autoregressive Gamma Processes (ARG): Definition
Definition: Yt follows the autoregressive gamma process if Yt conditional on Yt−1 follows
the non-central gamma distribution with

• degree of freedom parameter: δ

• non-centrality parameter: βYt−1

• scale parameter: c

Very exotic ... but we can guess

• Gamma distribution → presumably it takes positive values

• Conditional dynamics enter through the non-centrality parameter

ARG Processes: State space representation
If Yt follows ARG, then

Measurement:

Yt | Zt ∼ Gamma(δ + Zt , c)

Transition:

Zt | Yt−1 ∼ Poisson(βYt−1 )

• Yt takes positive real values

• Zt takes non-negative integer values

• Dynamics through Zt

Conditional moments
Measurement:

Yt | Zt ∼ Gamma(δ + Zt , c)

Transition:

Zt | Yt−1 ∼ Poisson(βYt−1 )

Conditional moments:
E(Yt | Yt−1 ) = ρYt−1 + cδ
V (Yt | Yt−1 ) = 2ρcYt−1 + c²δ
Corr(Yt , Yt−h ) = ρ^h
where ρ = βc > 0.

The process is stationary when ρ < 1.
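
The state space representation gives a direct way to simulate an ARG path: draw Zt from the Poisson transition, then Yt from the Gamma measurement. The sketch below assumes the shape-scale Gamma parameterization (mean equal to shape times c), which is consistent with the conditional mean above; `simulate_arg` and all parameter values are hypothetical. The sample mean and first-order autocorrelation can then be compared with cδ/(1 − ρ) and ρ.

```python
import numpy as np

def simulate_arg(n, delta, beta, c, y0, rng):
    """Simulate Y_1..Y_n with Z_t | Y_{t-1} ~ Poisson(beta * Y_{t-1})
    and Y_t | Z_t ~ Gamma(shape = delta + Z_t, scale = c)."""
    y = np.empty(n + 1)
    y[0] = y0
    for t in range(1, n + 1):
        z = rng.poisson(beta * y[t - 1])
        y[t] = rng.gamma(shape=delta + z, scale=c)
    return y[1:]

rng = np.random.default_rng(7)
delta, beta, c = 1.5, 0.5, 1.0   # illustrative values, so rho = beta * c = 0.5
rho = beta * c
y = simulate_arg(50_000, delta, beta, c, y0=3.0, rng=rng)

sample_mean = y.mean()                       # compare with c * delta / (1 - rho)
yc = y - sample_mean
acf1 = (yc[1:] @ yc[:-1]) / (yc @ yc)        # compare with rho
```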


Conditional over-dispersion
Conditional over-dispersion exists if and only if

V (Yt | Yt−1 ) > E(Yt | Yt−1 )²

When δ < 1,

• The stationary ARG process features marginal over-dispersion.

• The process may feature either conditional under- or over-dispersion, depending on
the value of Yt−1 .

Remark: The ACD (autoregressive conditional duration) model assumes path-independent
over-dispersion.

Figure 10.20: Simulated data, ρ = 0.5

Continuous time limit of ARG(1)
The stationary ARG process is a discretized version of the CIR process,

\[ dY_t = a(b - Y_t)\, dt + \sigma \sqrt{Y_t}\, dW_t \]

where

\[ a = -\log \rho, \qquad b = \frac{c\delta}{1-\rho}, \qquad \sigma^2 = \frac{-2 \log \rho}{1-\rho}\, c \]

• This process is non-negative almost surely.

– Originally model for interest rates.


– Also used for volatility dynamics.

Long memory
Case 1) Let ρ = 1. Then

• Yt is a stationary Markov process.

• An autocorrelation function with a hyperbolic rate of decay.

Case 2) Stochastic autoregressive coefficient. Let ρ ∼ π. Then

Corr(Yt , Yt−h |δ, c, π) = Eπ (ρh )

The autocorrelation function features hyperbolic decay when the distribution π assigns
sufficiently large probabilities to values close to one.
Figures 1

Figure 10.21: Simulated data, ρ = 0.9

Application in the original paper


Measurement:

Yt | Zt ∼ Gamma(δ + Zt , c)

Transition:

Zt | Yt−1 ∼ Poisson(βYt−1 )

• Yt : Interquote durations of the Dayton Mining stock traded on the Toronto Stock
Exchange in October 1998.

• Estimation based on QMLE

Extension
Creal (2013) considers the following non-linear state space model.

Measurement

yt ∼ p(yt | ht , xt ; θ)

where xt is an exogenous regressor.

Transition
ht ∼ Gamma(δ + zt , c)
zt ∼ Poisson(ρht−1 )

• When yt = ht , the process becomes ARG.

• Various applications fall into this form.


Example 1: Stochastic volatility models
Measurement

\[ y_t = \mu + x_t \beta + \sqrt{h_t}\, e_t, \qquad e_t \sim N(0, 1) \]

Transition
ht ∼ Gamma(δ + zt , c)
zt ∼ Poisson(ρht−1 )

Example 2: Stochastic duration and intensity models
Measurement

yt ∼ Gamma (α, ht exp(xt β))

Transition
ht ∼ Gamma(δ + zt , c)
zt ∼ Poisson(ρht−1 )

Example 3: Stochastic count models
Measurement

yt ∼ Poisson (ht exp(xt β))

Transition
ht ∼ Gamma(δ + zt , c)
zt ∼ Poisson(ρht−1 )

Recent extension: ARG-zero processes
Monfort, Pegoraro, Renne and Roussellet (2014) extend the ARG process to account for
zero-lower-bound spells.

If Yt follows ARG, then

Yt | Zt ∼ Gamma(δ + Zt , c)

Zt | Yt−1 ∼ Poisson(βYt−1 )

If Yt follows ARG-zero, then

Yt | Zt ∼ Gamma(Zt , c)

Zt | Yt−1 ∼ Poisson(α + βYt−1 )

Two modifications

• δ = 0: As δ → 0, Gamma(δ, c) converges to a Dirac delta function at zero.

• α is related to the probability of escaping from the zero lower bound.

Characterization
The probability density for ARG-zero is

\[ p(Y_t | Y_{t-1}; \alpha, \beta, c) = \sum_{z=1}^{\infty} g(Y_t, Y_{t-1}, \alpha, \beta, c, z)\, 1_{\{Y_t > 0\}} + \exp(-\alpha - \beta Y_{t-1})\, 1_{\{Y_t = 0\}} \]

• The second term is a consequence of δ → 0.

• If α = 0, Yt = 0 becomes an absorbing state.

Conditional moments

E[Yt | Yt−1 ] = αc + ρYt−1

and

V (Yt | Yt−1 ) = 2c²α + 2cρYt−1

where ρ = βc.
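
These two moments follow from the laws of total expectation and total variance applied to the Gamma-Poisson mixture. A short derivation, assuming the shape-scale parameterization in which Gamma(Zt, c) has mean cZt and variance c²Zt (an assumption, though it is the one consistent with the stated results):

```latex
\begin{aligned}
E[Y_t \mid Y_{t-1}]
  &= E\big[\, E[Y_t \mid Z_t] \,\big|\, Y_{t-1} \big]
   = c\, E[Z_t \mid Y_{t-1}]
   = c(\alpha + \beta Y_{t-1})
   = \alpha c + \rho Y_{t-1}, \\
V(Y_t \mid Y_{t-1})
  &= E\big[\, V(Y_t \mid Z_t) \,\big|\, Y_{t-1} \big]
   + V\big(\, E[Y_t \mid Z_t] \,\big|\, Y_{t-1} \big) \\
  &= c^2 (\alpha + \beta Y_{t-1}) + c^2 (\alpha + \beta Y_{t-1})
   = 2c^2 \alpha + 2c\rho Y_{t-1}.
\end{aligned}
```

The second line uses the fact that a Poisson variable has variance equal to its mean, and the last step substitutes ρ = βc.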
Figure: ARG-zero

Figure 10.22: Simulated data

ACD and DCS


Autoregressive conditional duration model (ACD)
Yt follows the autoregressive conditional duration model if

yt = µt et , E[et ] = 1
µt = w + αµt−1 + βyt−1

• Because of its multiplicative form, it is classified as a multiplicative error model (MEM).
• Conditional moments

E[yt | y1:t−1 ] = µt
V (yt | y1:t−1 ) = k0 µt²

• Conditional over-dispersion is path-independent

\[ \frac{V(y_t | y_{1:t-1})}{E[y_t | y_{1:t-1}]^2} = k_0 \]

Recall that the ARG process can have path-dependent over-dispersion.
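
A small simulation makes the path-independence concrete. The sketch assumes Gamma(κ, 1/κ) errors (shape κ, scale 1/κ), so that E[et] = 1 and k0 = 1/κ; this error distribution, the function name `simulate_acd`, and all parameter values are illustrative assumptions. The standardized durations yt/µt have the same variance regardless of the level of µt.

```python
import random
import statistics

def simulate_acd(n, w, alpha, beta, kappa, mu0, seed=0):
    """Simulate y_t = mu_t * e_t with mu_t = w + alpha * mu_{t-1} + beta * y_{t-1}
    and e_t ~ Gamma(shape=kappa, scale=1/kappa), which has mean 1."""
    rng = random.Random(seed)
    mu, y = [mu0], []
    for t in range(n):
        e = rng.gammavariate(kappa, 1.0 / kappa)
        y.append(mu[t] * e)
        mu.append(w + alpha * mu[t] + beta * y[t])
    return y, mu[:-1]

# Illustrative values; alpha + beta < 1 gives unconditional mean w / (1 - alpha - beta)
w, alpha, beta, kappa = 0.1, 0.7, 0.2, 2.0
y, mu = simulate_acd(100_000, w, alpha, beta, kappa, mu0=1.0)

std_durations = [yt / mt for yt, mt in zip(y, mu)]
dispersion = statistics.pvariance(std_durations)  # compare with k0 = 1 / kappa
mean_duration = statistics.fmean(y)               # compare with w / (1 - alpha - beta)
```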

Dynamic conditional score (DCS) model
The dynamic conditional score model (or generalized autoregressive score, GAS, model) is a general
class of observation-driven models.

• An observation-driven model is a time-varying parameter model in which the time-varying parameter
is a function of the history of observables. Examples: GARCH, ACD, ...
• The DCS (GAS) model encompasses GARCH, ACD, and other observation-driven models.

The idea is very simple and pragmatic:

• Give me a conditional likelihood and time-varying parameters, and I will give you a law of motion
for the time-varying parameters.

It is a convenient and general modelling strategy. I will describe it within the MEM class of models.
DCS Example: ACD
Recall

yt = µt et , E[et ] = 1
µt = w + αµt−1 + βyt−1

Instead, we apply the DCS principle: “Give me a conditional likelihood and time-varying parameters,
and I will give you a law of motion.” Suppose

yt = µt et , et ∼ Gamma(κ, 1/κ)

Then DCS specifies a law of motion for µt as follows:

µt = w + αµt−1 + βst−1

where (w, α, β) are additional parameters and st is a scaled score,

\[ s_t = \left( E_{t-1}\left[ \frac{\partial \log p(y_t | \mu_t, y_{1:t}; \kappa)}{\partial \mu_t} \frac{\partial \log p(y_t | \mu_t, y_{1:t}; \kappa)}{\partial \mu_t}' \right] \right)^{-1} \frac{\partial \log p(y_t | \mu_t, y_{1:t}; \kappa)}{\partial \mu_t} \]

In this case, it happens to be

µt = w + αµt−1 + βyt−1

which is ACD.
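
The step from the scaled score to the ACD recursion can be filled in as follows. This is a sketch under the assumption that Gamma(κ, 1/κ) denotes shape κ and scale 1/κ, so that yt given µt is Gamma with shape κ and scale µt/κ:

```latex
\begin{aligned}
\log p(y_t \mid \mu_t; \kappa)
  &= \kappa \log \kappa - \log \Gamma(\kappa) - \kappa \log \mu_t
     + (\kappa - 1) \log y_t - \frac{\kappa\, y_t}{\mu_t}, \\
\frac{\partial \log p}{\partial \mu_t}
  &= \frac{\kappa\, (y_t - \mu_t)}{\mu_t^2}, \qquad
E_{t-1}\!\left[ \left( \frac{\partial \log p}{\partial \mu_t} \right)^{2} \right]
   = \frac{\kappa^2}{\mu_t^4} \cdot \frac{\mu_t^2}{\kappa}
   = \frac{\kappa}{\mu_t^2}, \\
s_t &= \frac{\mu_t^2}{\kappa} \cdot \frac{\kappa\, (y_t - \mu_t)}{\mu_t^2}
     = y_t - \mu_t .
\end{aligned}
```

Substituting into the law of motion gives µt = w + (α − β)µt−1 + βyt−1, which has the ACD form displayed above once the coefficient on µt−1 is relabeled.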

However, the law of motion will differ with a different choice of distribution: Generalized
Gamma, Log-Logistic, Burr, Pareto, and many other distributions.

10.5 EXERCISES, PROBLEMS AND COMPLEMENTS

10.6 NOTES
Chapter Eleven

High Dimensionality

11.1 EXERCISES, PROBLEMS AND COMPLEMENTS

1. xxx

11.2 NOTES
Appendices

Appendix A

A “Library” of Useful Books

Ait-Sahalia, Y. and Hansen, L.P. eds. (2010), Handbook of Financial Econometrics. Ams-
terdam: North-Holland.

Ait-Sahalia, Y. and Jacod, J. (2014), High-Frequency Financial Econometrics, Princeton
University Press.

Beran, J., Feng, Y., Ghosh, S. and Kulik, R. (2013), Long-Memory Processes: Probabilistic
Properties and Statistical Methods, Springer.

Box, G.E.P. and Jenkins, G.M. (1970), Time Series Analysis, Forecasting and Control,
Prentice-Hall.

Davidson, R. and MacKinnon, J. (1993), Estimation and Inference in Econometrics, Oxford
University Press.

Diebold, F.X. (1998), Elements of Forecasting, South-Western.

Douc, R., Moulines, E. and Stoffer, D.S. (2014), Nonlinear Time Series: Theory, Methods,
and Applications with R Examples, Chapman and Hall.

Durbin, J. and Koopman, S.J. (2001), Time Series Analysis by State Space Methods, Oxford
University Press.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, Chapman and
Hall.

Elliott, G., Granger, C.W.J. and Timmermann, A., eds. (2006), Handbook of Economic
Forecasting, Volume 1, North-Holland.

Elliott, G., Granger, C.W.J. and Timmermann, A., eds. (2013), Handbook of Economic
Forecasting, Volume 2, North-Holland.

Engle, R.F. and McFadden, D., eds. (1995), Handbook of Econometrics, Volume 4, North-
Holland.

Geweke, J. (2010), Complete and Incomplete Econometric Models, Princeton University
Press.

Geweke, J., Koop, G. and van Dijk, H., eds. (2011), The Oxford Handbook of Bayesian
Econometrics, Oxford University Press.

Granger, C.W.J. and Newbold, P. (1977), Forecasting Economic Time Series, Academic
Press.

Granger, C.W.J. and Teräsvirta, T. (1996), Modeling Nonlinear Economic Relationships,
Oxford University Press.

Hall, P. (1992), The Bootstrap and Edgeworth Expansion, Springer Verlag.

Hammersley, J.M. and Handscomb, D.C. (1964), Monte Carlo Methods, Chapman and Hall.

Hansen, L.P. and Sargent, T.J. (2013), Recursive Models of Dynamic Linear Economies,
Princeton University Press.

Harvey, A.C. (1989), Forecasting, Structural Time Series Models and the Kalman Filter,
Cambridge University Press.

Harvey, A.C. (1993), Time Series Models, MIT Press.

Harvey, A.C. (2013), Dynamic Models for Volatility and Heavy Tails, Cambridge University
Press.

Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning:
Data Mining, Inference and Prediction, Springer-Verlag.

Herbst, E. and Schorfheide, F. (2015), Bayesian Estimation of DSGE Models, Manuscript.

Kim, C.-J. and Nelson, C.R. (1999), State-Space Models with Regime Switching, MIT Press.

Koop, G. (2004), Bayesian Econometrics, John Wiley.

Nerlove, M., Grether, D.M., Carvalho, J.L. (1979), Analysis of Economic Time Series: A
Synthesis, Academic Press.

Priestley, M. (1981), Spectral Analysis and Time Series, Academic Press.

Silverman, B.W. (1986), Density Estimation for Statistics and Data Analysis, Chapman and
Hall.

Whittle, P. (1963), Prediction and Regulation by Linear Least Squares Methods, University
of Minnesota Press.

Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics, John Wiley and
Sons.
Bibliography

Aldrich, E.M., J. Fernández-Villaverde, A.R. Gallant, and J.F. Rubio-Ramírez (2011),
“Tapping the Supercomputer Under Your Desk: Solving Dynamic Equilibrium Models with
Graphics Processors,” Journal of Economic Dynamics and Control, 35, 386–393.

Aruoba, S.B., F.X. Diebold, J. Nalewaik, F. Schorfheide, and D. Song (2013), “Improving
GDP Measurement: A Measurement Error Perspective,” Working Paper, University of
Maryland, Federal Reserve Board, and University of Pennsylvania.

Nerlove, M., D.M. Grether, and J.L. Carvalho (1979), Analysis of Economic Time Series:
A Synthesis. New York: Academic Press. Second Edition.

Ruge-Murcia, Francisco J. (2010), “Estimating Nonlinear DSGE Models by the Simulated
Method of Moments,” Manuscript, University of Montreal.

Yu, Yaming and Xiao-Li Meng (2010), “To Center or Not to Center: That is Not the Question
An Ancillarity-Sufficiency Interweaving Strategy (ASIS) for Boosting MCMC Efficiency,”
Manuscript, Harvard University.
