AE2015 Lecture Notes Ch4
Advanced Econometrics I
Chapter 4
Francisco Blasques
These lecture notes contain the material covered in the master course
Advanced Econometrics I. Further study material can be found in the
lecture slides and the many references cited throughout the text.
Contents
4 Nonlinear dynamic probability models: definitions and properties
4.1 Models and DGPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1.1 Probability spaces and random variables . . . . . . . . . . . .
4.1.2 What is a probability model? . . . . . . . . . . . . . . . . . .
4.1.3 What is a DGP? . . . . . . . . . . . . . . . . . . . . . . . . .
4.1.4 Correct Specification . . . . . . . . . . . . . . . . . . . . . . .
4.1.5 Generality of linear dynamic models (Wold's representation) .
4.1.6 Nonlinear dynamic models . . . . . . . . . . . . . . . . . . . .
4.2 Examples of nonlinear dynamic models . . . . . . . . . . . . . . . . .
4.2.1 Nonlinear autoregressions . . . . . . . . . . . . . . . . . . . .
4.2.2 Random coefficient autoregressions . . . . . . . . . . . . . . .
4.2.3 Time-varying parameter models: parameter-driven . . . . . .
4.2.4 Time-varying parameter models: observation-driven . . . . . .
4.2.5 Nonlinear dynamic models with exogenous variables . . . . . .
4.3 Stationarity, dependence, ergodicity and moments . . . . . . . . . . .
4.3.1 Strict stationarity and m-dependence . . . . . . . . . . . . . .
4.3.2 Strict stationarity and ergodicity (SE) . . . . . . . . . . . . .
4.3.3 Sufficient conditions for SE . . . . . . . . . . . . . . . . . . . .
4.3.4 Examples and Counter Examples . . . . . . . . . . . . . . . .
4.3.5 Bounded unconditional moments . . . . . . . . . . . . . . . .
4.3.6 Examples and Counter Examples . . . . . . . . . . . . . . . .
4.3.7 Notes for Time-varying Parameter Models . . . . . . . . . . .
4.3.8 Notes for Models with Exogenous Variables . . . . . . . . . .
4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4 Nonlinear dynamic probability models: definitions and properties
4.1 Models and DGPs
4.1.1 Probability spaces and random variables
In this section we venture into the world of set theory and measure theory. We will
not dig too deep into this world. Actually, we will barely scratch its surface. Our
objective is just to understand certain basic concepts.
As you may know, it was only in the late 19th and 20th centuries that mathematicians attempted to give proper foundations to mathematics. The famous work of
Gottlob Frege attempted to give proper definitions of numbers, functions and variables.
What is the number 2 after all? We know how to use the number 2. But what is it?
Frege set himself to answer these questions. Unfortunately, just a few days before publishing his monumental work, written over more than 10 years, Frege received a letter
from the great mathematician and philosopher Bertrand Russell pointing out that he
had found a small inconsistency in Frege's argument. This small problem turned
out to completely destroy Frege's work. As a result, Alfred Whitehead and Bertrand
Russell tried themselves to give foundations to mathematics and published their outstanding work of art Principia Mathematica in 3 volumes. In Volume 1 of Principia
Mathematica the authors take 379 pages just to arrive at the proof that 1 + 1 = 2. The
authors planned a fourth volume of Principia Mathematica but could not go further
on account of intellectual exhaustion. Finally, in 1931, Kurt Gödel's historical
Incompleteness Theorem showed once and for all that any such attempts to give proper
foundations to mathematics would always fail! In effect, he showed that any logical
axiomatic system (like mathematics) will always be incomplete (i.e. contain statements
that cannot be proved) or contradictory (i.e. contain statements that are both true and
false).
The world of set theory and measure theory is a fascinating one indeed! We will
now make use of set theory and measure theory to arrive at a precise definition of
probability model. First we will have to define what a probability space is. Next we
define the concept of random variable. Finally, we turn to probability model.
Banach and Tarski proved a theorem in 1924 showing that you can take a ball, cut it
into pieces and then re-arrange those pieces in such a manner as to obtain two balls
of the exact same size, with no parts missing! This unbelievable result warns us
about the mathematical problems that may occur if we are not careful enough when
measuring things! The use of the σ-algebra solves this problem.
Since the σ-algebra is important to avoid problems, we will often work with measurable spaces.
Definition 3 (Measurable space) A measurable space is just a pair (E, F) composed
of an event space E and a respective σ-algebra F.
Definition 4 (Probability measure) A probability measure P defined on a measurable
space (E, F) is a function P : F → [0, 1] satisfying:
(i) P(F) ≥ 0 for every F ∈ F.
(ii) P(E) = 1.
(iii) If {F_n}_{n∈N} is a collection of pairwise disjoint sets in F, then
P(∪_{n=1}^∞ F_n) = Σ_{n=1}^∞ P(F_n).
Now that we know what a probability space is (it is simply the triple (E, F, P) formed by an event space, a σ-algebra and a probability measure), we can finally define the concept
of random variable! As we shall see, a random variable is essentially a measurable
function (or measurable map).
Definition 5 (Measurable function) Given two measurable spaces (A, F_A) and (B, F_B),
a function f : A → B is said to be measurable if f is such that every element b ∈ F_B
satisfies f^{-1}(b) ∈ F_A; i.e. the inverse image of each element of F_B is in F_A.
Note that the inverse map f^{-1} in the definition above always exists! You may recall
from your introductory mathematics courses that some functions are not invertible.
However, that only means that the inverse map f^{-1} is not a function. The inverse
map always exists; it may just not be a function. Do you remember the properties of
a function?
Definition 6 (Function) Given two sets A and B, and a map f : A → B, we say that
f is a function if and only if each element a of A is associated with a unique element b
of B.
We are finally ready to give a proper definition of random variable. In this definition,
the random variable emerges as a real valued measurable function that maps each
event in E to a number in R. Depending on the event that occurs we obtain a
different number in R for our random variable!
Definition 7 (Random variable) Given a probability space (E, F, P) and the measurable space (R, F_R), a random variable x_t is a measurable map x_t : E → R that maps
elements of E to the real numbers R; i.e. the inverse map x_t^{-1} : F_R → F satisfies
x_t^{-1}(r) ∈ F for every element r ∈ F_R.²
This definition of random variable is actually quite intuitive! The requirement that
the function be measurable just means that we can assign probabilities to each value
of the random variable x_t. Indeed, we can define a probability measure P_R on F_R by
assigning to each element r in F_R the probability P_R(r) = P(x_t^{-1}(r)). In essence,
the random variable induces a probability measure P_R on F_R. We thus obtain a
probability space (R, F_R, P_R).
From the probability measure P_R we can now define the cumulative distribution
function F that you know so well as
F(a) = P_R(x_t ≤ a) for every a ∈ R.
This notion of random variable generalizes easily to random vectors and other
random elements such as random functions.
Definition 8 (Random vector) Given a probability space (E, F, P) and a measurable
space (R^n, F_{R^n}) with n ∈ N, an n-variate random vector x_t is a measurable map
x_t : E → R^n that maps elements of E to R^n.
Now that we have defined a random variable x_t as a function from the event space E to
the reals R, we can easily distinguish the random variable from its realizations. In
particular, note that while x_t is a random variable, x_t(e) ∈ R is the actual realization
of the random variable produced by the event e ∈ E. A different event e′ ∈ E may produce
a different realization x_t(e′) ≠ x_t(e).
The definition of random variable is also interesting for another reason: whether
something is a random variable or not depends on the σ-algebra that one is using!
As an example, consider the case where x_t is a normal random variable x_t ∼ N(0, σ²).
Is x_t² also a random variable? How about exp(x_t)? Or log(x_t)?
Below we define the Borel σ-algebra introduced by one of the pioneers of measure
theory and probability theory, the French mathematician Émile Borel.
² Please do not be fooled by the notation x_t^{-1} into thinking that x_t^{-1} maps from R to E, i.e. that
x_t^{-1} : R → E. It does not! First, since the inverse f^{-1} is not necessarily a function, it will
naturally map elements of R to subsets of E like those in F. Second, note that, given any function
g : A → B, we can always define a mapping g* from subsets A* of A to subsets of B satisfying
g*(A*) = {g(a), a ∈ A* ⊆ A}.
Definition 9 (Borel σ-algebra) Given a set A, the Borel σ-algebra B_A is the smallest
σ-algebra containing all open sets of A.³
Luckily, you do not have to feel intimidated by the definition above. In fact, you
don't even have to understand it! The only thing you have to know is its practical
consequence: all continuous functions are measurable under the Borel σ-algebra. Indeed, the Borel σ-algebra is famous because it ensures that any continuous function
f is measurable. As such, any continuous transformation f(x_t) of a random variable
x_t is also a random variable.
The fact that continuous functions are measurable under the Borel σ-algebra is
trivial once we give an appropriate definition of continuous function! This definition
uses the notion of topological space. Much like a measurable space, a topological space
is just a pair (A, T_A) where A is a set and T_A is a collection of subsets of A. This
collection, when constructed in a certain way, is called a topology.⁴ Most importantly,
the elements of T_A are precisely what we call the open sets.
Definition 10 (Continuous function) Let (A, T_A) and (B, T_B) be topological spaces.
A function f : A → B is said to be continuous if its inverse f^{-1} maps open sets to
open sets; i.e. if for every b ∈ T_B we have f^{-1}(b) ∈ T_A.
The definition above clarifies everything! If the Borel σ-algebra is made of open
sets, then every continuous function is measurable because f^{-1} maps elements of one
Borel σ-algebra to the other! In other words, given two measurable spaces (A, B_A)
and (B, B_B), a continuous function f : A → B will be measurable because for every
b ∈ B_B we will have f^{-1}(b) ∈ B_A.
Definition 11 (Random element) Given a probability space (E, F, P) and a measurable space (A, F_A), a random element a_t taking values in A is a measurable map
a_t : E → A that maps elements of E to A.
The following is an example of a random element that we will often use. In this example
C(R) denotes the set of all continuous functions defined on R. Each element of C(R)
is thus a continuous function.
Example 1 (Random continuous function) Given a probability space (E, F, P) and
the measurable space (C(R), F_{C(R)}), a random continuous function f taking values in
C(R) is a measurable map f : E → C(R) that maps elements of E to C(R).
Note that while f is a random element taking values in C(R), the realization f(e),
e ∈ E, is an actual continuous function f(e) ∈ C(R).
³ The collection of all open sets of a set A is known as the topology of A. Hence, you might come
across some references that define the Borel σ-algebra B_A as the σ-algebra generated by the topology
of A.
⁴ A collection T_A of subsets of A is called a topology on A when both the empty set and A are
elements of T_A, any union of elements of T_A is an element of T_A, and any intersection of finitely
many elements of T_A is an element of T_A. Do not memorize this! This is not important!
4.1.2 What is a probability model?
The notion of probability model is often an elusive one. What exactly is a model? To
answer this question, consider first the very simple Bernoulli model for coin tosses.
In particular, suppose that we want to model T tosses of a coin that may, or may not,
be fair. Then, it is reasonable to suppose that the T observed outcomes x_1, ..., x_T are
realizations of T Bernoulli random variables, x_t ∼ Bern(θ), with unknown probability
parameter θ ∈ [0, 1]. Note that for each θ we have defined a certain probability
distribution for the random vector (x_1, ..., x_T) taking values in R^T (where R^T = ∏_{t=1}^T R
denotes the Cartesian product of T copies of R). Indeed, our model consists
of a collection of probability distributions on R^T. For each θ you have a different
distribution! Since each distribution in this collection of distributions is identified by
the parameter θ ∈ [0, 1], we call this model a parametric model.
If you think carefully about it, this definition of model is the one you
have been using all along. Even if you did not realize it!
For example, in the Gaussian linear AR(1) model,
x_t = α + β x_{t-1} + ε_t,  ε_t ∼ N(0, σ²),  t ∈ Z,
each value of the parameter vector θ = (α, β, σ²) defines a unique distribution for the
time series {x_t}_{t∈Z}. Hence, this AR(1) model is also a collection of distributions.
Furthermore, it is a parametric model because each distribution is indexed by the
parameter vector θ = (α, β, σ²). In the case of a time-series model, since we have an
infinite sequence of random variables {x_t}_{t∈Z}, the model is actually a collection of
probability measures on the σ-algebra of R^∞ (the Cartesian product of infinitely
many copies of R).
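The idea that a parametric model is just a family of distributions indexed by θ can be made very concrete in code. The short sketch below (with purely illustrative parameter values) draws samples from the Gaussian AR(1) for two different values of θ = (α, β, σ²): each θ induces a different distribution for the data.

import numpy as np

def simulate_ar1(theta, T, seed=0):
    """Draw one realization (x_1, ..., x_T) from the Gaussian AR(1)
    x_t = alpha + beta * x_{t-1} + eps_t, eps_t ~ N(0, sigma2)."""
    alpha, beta, sigma2 = theta
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = alpha + beta * x[t - 1] + rng.normal(0.0, np.sqrt(sigma2))
    return x

# Each theta identifies a different member P_theta of the model; compare sample moments.
for theta in [(0.0, 0.5, 1.0), (1.0, 0.9, 0.5)]:
    x = simulate_ar1(theta, T=10_000)
    print(theta, "sample mean:", x.mean().round(2), "sample variance:", x.var().round(2))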
Definition 12 (Probability model) Given a measurable space (E, F) and a parameter
space Θ, a probability model is a collection P_Θ := {P_θ, θ ∈ Θ} of probability measures
defined on F.
Definition 13 (Time-series probability model) Given the measurable space (R^∞, F_{R^∞})
and a parameter space Θ, a probability model is a collection P_Θ := {P_θ, θ ∈ Θ} of
probability measures defined on F_{R^∞}.
Since a time series is an infinite random sequence {x_t}_{t∈Z}, a realization of a time series is an infinite sequence of points {x_t(e)}_{t∈Z}. Alternatively, you can see the time series as a random element taking values in R^∞, and a realization of the time series as
a point in R^∞. The term stochastic process is usually reserved for the continuous-time
series that are obtained by connecting the dots of the realized time series (see Figure 1). A stochastic process is typically seen as a random element x : E → C(R)
taking values in the space of continuous functions C(R), and every realization of
the stochastic process is then a specific continuous function, i.e. x(e) ∈ C(R) and
x(e, t) ∈ R for every (e, t) ∈ E × R. In this course we will only focus on random sequences!
Even when we refer to a stochastic process, we will have in mind a random sequence.
Figure 1: Realization of a random sequence.
At this point it may be useful to clarify the following. The Gaussian linear AR(1)
model above is a probability model because the innovations are random variables.
If the innovations are not recognized as being random variables, then it is not a
probability model! Of course, the distribution of the innovations could belong to
some family other than the normal. For example, we could have specified that the
innovations belong to the Student's t family, where each distribution t(λ) is indexed by
the degrees of freedom parameter λ. The innovations could be uniformly distributed,
ε_t ∼ u(a, b), with each distribution u(a, b) parameterized by the end points of the
distribution a and b. More generally, we may not even specify a parametric family
of distributions! For example, we may just state that the innovations {ε_t}_{t∈Z} are white
noise (i.e. have mean zero, are uncorrelated, and have constant variance) as in Section
2.2:
x_t = β x_{t-1} + ε_t,  {ε_t}_{t∈Z} ∼ WN(0, σ²),  t ∈ Z.
Now that we know what a probability model is, it is important that we ask ourselves the question: why? Why should we work with probability models? A complete
answer to this question would take us to the very foundations of econometrics itself! Below, you can find a very brief review of the history behind the probability
approach in econometrics. An excellent reference is Haavelmo's 1944 paper entitled
The Probability Approach in Econometrics. Haavelmo's paper is the longest paper
ever published in Econometrica and the only one to occupy an entire issue of the journal!
In 1936, the Dutch econometrician Jan Tinbergen published a book describing the
relations between economic variables using hundreds of dynamic regressions.⁵ In this
way, The Netherlands became the first country for which a macroeconometric model
was estimated. Before Tinbergen's model, econometrics was essentially descriptive. In
the words of Robert Solow, the work of Jan Tinbergen was a major force in the transformation of economics from a discursive discipline into a model-building discipline.
In 1939, however, the world-famous economist John Maynard Keynes published an
extremely insightful review in which he praised Jan Tinbergen personally, but criticized
his work.⁶ Keynes was essentially concerned with the fact that the model could not be
proved wrong by the data. Consider a regression of consumption C_t on output Y_t given
by C_t = α + β Y_t + ε_t. You may notice that any given observation (c_t, y_t) will always
fit this regression because the error term can account for anything! In other words,
if anyone complains that this model is not accurate because the observation (c_t, y_t) is
such that c_t ≠ α + β y_t, we can simply reply: Well, I told you there would be an error!
The fact that the model can never be rejected by the data is a real problem:
the model becomes totally useless! The solution to this problem was finally stated in
⁵ Jan Tinbergen was awarded the Nobel Prize for this outstanding contribution to economics.
⁶ John Maynard Keynes, the guardian of Isaac Newton's papers, was not only a brilliant economist,
he was a master in various other fields such as logic and mathematics. Bertrand Russell himself (the
famous mathematician and author of Principia Mathematica) wrote the following about Keynes:
Keynes's intellect was the sharpest and clearest that I have ever known. When I argued with him,
I felt that I took my life in my hands, and I seldom emerged without feeling something of a fool. In
fact, Keynes's intellect conquered even his life-long opponents like Friedrich Hayek, who described
Keynes as follows: He was the one really great man I ever knew, and for whom I had unbounded
admiration. The world will be a very much poorer place without him.
4.1.3
What is a DGP?
In econometrics we refer to the data generating process (DGP) as the place from
where the data comes! In other words, the DGP is the unknown mechanism that
generates the data. From an economic perspective, this mechanism is most likely
a very complex one, involving thousands (if not infinitely many) of agents, factors,
variables, decisions, etc. Luckily, whatever that mechanism might be in economic
terms, in mathematical terms at least it is a quite simple animal: the DGP is a given
probability measure. This probability measure is the end result of the immensely
complex workings of the economy.
Definition 16 (Data generating process) Given a measurable space (E, F), a data
generating process is a probability measure P0 defined on F.
In a time-series perspective, the DGP is a probability measure P0 that defines all the
interconnections and dependencies between the infinitely many random variables in
the random sequence {xt }tZ .
Definition 17 (Time-series data generating process) Given the measurable space
(R^∞, F_{R^∞}), a data generating process is a probability measure P_0 defined on F_{R^∞}.
4.1.4
Correct Specification
We say that a model is correctly specified when the DGP is an element of the model.
Otherwise the model is said to be misspecified.
Definition 18 (Correctly specified model) A model P_Θ := {P_θ, θ ∈ Θ} is said to
be correctly specified if the data generating process P_0 is an element of the model P_Θ;
i.e. if there exists θ_0 ∈ Θ such that P_0 = P_{θ_0}. When this parameter θ_0 exists, it
is called the true parameter.
In shorter notation, we will often say that the model is correctly specified if
∃ θ_0 ∈ Θ : P_{θ_0} = P_0,
which means that P_0 ∈ P_Θ. We often refer to a correctly specified model as a well
specified model.
Definition 19 (Mis-specified model) A model P_Θ := {P_θ, θ ∈ Θ} is said to be mis-specified (or incorrectly specified) if the data generating process P_0 is not an element
of the model P_Θ; i.e. if P_θ ≠ P_0 for every θ ∈ Θ.
Consider again the linear Gaussian AR(1) model
x_t = α + β x_{t-1} + ε_t,  ε_t ∼ N(0, σ²),  t ∈ Z.   (1)
4.1.5 Generality of linear dynamic models (Wold's representation)
Now that we know what a correctly specified model is, we can ask ourselves: how
general are linear dynamic models? Is there any chance that these models are
correctly specified? A first answer comes in the form of the much celebrated Wold
decomposition theorem, stated by Wold in 1938. We review Wold's theorem in this
section and discuss its strengths and limitations.
We recall first the definitions of weakly stationary process, white noise, linear
stochastic process and linear model.
Definition 20 (Weak/Covariance Stationarity) A time series {x_t} is said to be weakly
stationary, or covariance stationary, if the mean μ_t = E(x_t) and autocovariance function γ_t(h) = Cov(x_t, x_{t-h}) are constant in time: μ_t = μ and γ_t(h) = γ(h) for all (t, h).
Weak stationarity implies that the mean, variance and autocovariances do not
change over time. Unlike strict stationarity (see Definition 3 in Section 2.2), the
notion of weak stationarity requires the existence of moments of second order (mean,
variance and autocovariances) but allows all higher-order moments (e.g. skewness and
kurtosis) to be time-varying.
A simple example of a weakly stationary sequence is the so-called white noise.
Definition 21 (White Noise) A random sequence {x_t}_{t∈Z} is said to be a white noise
process, denoted {x_t}_{t∈Z} ∼ WN(0, σ²), if it is a sequence of uncorrelated random variables, Cov(x_t, x_{t-h}) = 0 for every t and every h ≠ 0, with zero mean E(x_t) = 0 for every t and
constant variance Var(x_t) = σ² for every t.
Note that the definition of white noise is silent about higher-order moments. Just
as in the definition of weak stationarity, the higher-order moments are allowed to vary
in time.
Finally, we turn to the definition of linear process, linear model, and Wold's theorem.
Definition 22 (Linear process) A time series {x_t}_{t∈Z} is a linear process if it can be written as
x_t = Σ_{j=-∞}^{∞} ψ_j z_{t-j},  t ∈ Z,
where {z_t}_{t∈Z} is a white noise sequence and the coefficients are absolutely summable, Σ_{j=-∞}^{∞} |ψ_j| < ∞.
Theorem (Wold's representation theorem) Every weakly stationary process {x_t}_{t∈Z} can be written as
x_t = Σ_{j=0}^{∞} ψ_j z_{t-j} + v_t,  t ∈ Z,
where,
a. ψ_0 = 1 and Σ_{j=0}^{∞} ψ_j² < ∞,
b. {z_t}_{t∈Z} is a white noise sequence,
c. {v_t}_{t∈Z} is a deterministic sequence.
A linear model defines processes of the form
x_t = Σ_{j=0}^{∞} ψ_j ε_{t-j}  with ψ_0 = 1 and Σ_{j=0}^{∞} |ψ_j| < ∞,
where {ε_t}_{t∈Z} is a white noise sequence.
For example, the AR(1) model admits an MA(∞) representation when |β| < 1,
x_t = β x_{t-1} + ε_t  ⟹  x_t = Σ_{j=0}^{∞} ψ_j ε_{t-j}  with ψ_j = β^j for all j.
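The MA(∞) representation is easy to verify numerically: truncating the sum at a large J reproduces, up to terms of order β^J, the path generated by the AR(1) recursion. A minimal sketch, with an illustrative value of β:

import numpy as np

beta, T, J = 0.7, 200, 100          # illustrative AR coefficient, sample size, truncation lag
rng = np.random.default_rng(1)
eps = rng.normal(size=T + J)        # innovations

# AR(1) recursion x_t = beta * x_{t-1} + eps_t, started J periods before the sample
x = np.zeros(T + J)
for t in range(1, T + J):
    x[t] = beta * x[t - 1] + eps[t]

# Truncated MA(inf) representation: x_t ~ sum_{j=0}^{J-1} beta^j * eps_{t-j}
psi = beta ** np.arange(J)
x_ma = np.array([psi @ eps[t - np.arange(J)] for t in range(J, T + J)])

print("max abs difference:", np.max(np.abs(x[J:] - x_ma)))   # tiny, of order beta^J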
At this point, it may seem that ARMA models can describe any weakly stationary
process. Indeed, Wold's representation theorem, together with the MA(∞) representation of ARMA processes, seems to imply that any weakly stationary process can
be written in ARMA form. This is not true! In particular, it is crucial to note the
following practical limitations of Wold's representation theorem and ARMA models:
1. In Wold's theorem, the stochastic component Σ_{j=0}^{∞} ψ_j z_{t-j} is not necessarily a
linear process. Hence, more general models may have to be considered:
As you can see in Definition 22, linear processes have coefficients {ψ_j} that are absolutely
summable, Σ |ψ_j| < ∞, whereas Wold's representation theorem features coefficients
{ψ_j} that are only square summable, Σ ψ_j² < ∞. Since absolute summability implies square
summability, it follows that linear models are too restrictive for Wold's representation
theorem.⁷
2. Even if Wold's theorem featured a linear stochastic component {z_t}, the representation would still involve infinitely many parameters {ψ_j} that cannot be
estimated from a finite sample of data:
ARMA models solve this problem by reducing the infinite MA coefficients {ψ_j} to just a
few AR and MA parameters. However, as a result, they can only define linear processes
with a restrictive sequence of parameters {ψ_j}. For example, we saw above that the
linear AR(1) can only define a sequence {ψ_j} with a decay of the type ψ_j = β^j for all j.
3. Even if the stochastic component of Wold's theorem were linear and well approximated by an ARMA model, the representation would still contain a deterministic term {v_t}_{t∈Z} which is unknown and potentially very complex:
⁷ It is easy to show that absolute summability implies square summability by noting that: if {ψ_j}
is absolutely summable, then there exists n ∈ N such that |ψ_j| < 1 for all j ≥ n, and hence, ψ_j² ≤ |ψ_j| for all j ≥ n. As
a result,
Σ_{j=0}^{∞} ψ_j² = Σ_{j=0}^{n-1} ψ_j² + Σ_{j=n}^{∞} ψ_j² ≤ Σ_{j=0}^{n-1} ψ_j² + Σ_{j=n}^{∞} |ψ_j| < ∞.
4. Finally, even if we could somehow ignore points 1, 2 and 3 above (i.e. if Wold's
stochastic component were linear, well approximated by an ARMA, and the deterministic component were simple and easy to estimate), we would still be faced
with the problem that the distribution of the white noise sequence is unknown
and possibly too complex to be tractable:
Wold's representation theorem leaves the joint distribution of the innovations unspecified.
It just tells us that it belongs to the class of distributions with zero mean, fixed variance and
zero autocovariances (white noise). The distribution of the error term may be extremely
complex since dependence in higher-order moments is allowed.
The above points might seem to cast an overly negative picture on the use of
linear models. However, this should not be interpreted in that way! The objective
was simply to enumerate carefully the practical limitations of working with ARMA
models and the extent to which it is reasonable to appeal to Wolds representation
theorem as a guide for linear models.
At the end of the day, the question of which model to use should always be left to
the data to answer. As we shall see, in many cases, ARMA models actually do well in
describing economic data! Model specification tests may show that the ARMA model
provides a good description of the data, and it may happen that nonlinear models
do not provide better results. In other cases, nonlinear models are clearly needed
(e.g. for modeling time-varying volatility in financial applications). In a considerable
number of cases there is at least room for improvement by adopting nonlinear models.
4.1.6 Nonlinear dynamic models
Nonlinear dynamic models belong to a much more general class of models than linear
ones and can generally nest linear models like the ARMA. As the name itself suggests,
a model is non-linear if it is not a linear model. One should be careful however not
to infer from this statement that a nonlinear model cannot contain linear models as
a special case.
Definition 25 (Nonlinear time-series model) A time-series model P_Θ := {P_θ, θ ∈ Θ} is said to be nonlinear if at least some measure P_θ ∈ P_Θ defines a stochastic
process that is not linear.⁸
⁸ Most commonly, a model is said to be nonlinear if it seems that for some θ ∈ Θ, the probability
measure P_θ defines a process that is not linear. The emphasis on the word seems is due to the fact
that we may just lack a proof of linearity; i.e. maybe no one has yet been able to prove that the process
defined by P_θ can actually be represented as a weighted infinite sum of a white noise sequence with
absolutely summable coefficients.
Consider, for example, the nonlinear model
x_t = β x_{t-1} / (1 + exp(γ + δ x_{t-1})) + ε_t,  t ∈ Z,
where β, γ and δ are unknown parameters and {ε_t}_{t∈Z} is a Gaussian iid sequence with
ε_t ∼ N(0, σ²) for every t. This model is a nonlinear model as long as the parameter
space allows for values (γ, δ) ≠ (0, 0). However, this nonlinear model nests the linear
model that is obtained by setting (γ, δ) = (0, 0) and |β| < 2.⁹
4.2 Examples of nonlinear dynamic models
4.2.1 Nonlinear autoregressions
The linear AR(1) model can be easily extended to a nonlinear setting by considering
the general nonlinear autoregressive (GNLAR) model,
x_t = f(x_{t-1}, ε_t; θ),  t ∈ Z,   (2)
where f is some parametric function and {ε_t}_{t∈Z} is an iid innovation sequence. An important special case is the functional-coefficient autoregression, in which the temporal dependence is driven by a coefficient function g,
x_t = g(x_{t-1}; θ) x_{t-1} + ε_t,  t ∈ Z.   (3)
⁹ Strictly speaking, even the famous Gaussian random walk discussed in Section 3.2 is not a linear
process. Indeed, since the random walk takes the form x_t = x_{t-1} + ε_t ⟹ x_t = Σ_{j=0}^{∞} ψ_j ε_{t-j} with
ψ_j = 1 for all j, the MA coefficients {ψ_j} are not absolutely summable.
Example: (STAR(1) model) A popular member of this class is the smooth transition autoregressive model of order one, STAR(1),
x_t = g(z_{t-1}; θ) x_{t-1} + ε_t,  t ∈ Z,  with  g(z_{t-1}; θ) := α + β / (1 + exp(γ + δ z_{t-1})),  t ∈ Z.
The STAR(1) model has many variants depending on the nature of the dependence
driver zt1 . When the driver {zt } is an exogenous process that is independent of
{xt }, then we call this model the exogenous STAR model. When the driver {zt } is
the lagged dependent variable {xt1 }, then we call this model the self-excited STAR
(SESTAR) model. Two important SESTAR examples are: (i) the logistic SESTAR
model, and (ii) the exponential SESTAR model.
Example: (Logistic SESTAR model) The logistic SESTAR presented in Section
3.1.2 is obtained when z_{t-1} = x_{t-1}, so that
g(z_{t-1}; θ) := α + β / (1 + exp(γ + δ x_{t-1})),  t ∈ Z.
The logistic SESTAR model allows us to model changes in the dependence of the
time series {x_t}_{t∈Z} that are related to past realizations of the time series itself.
For example, many macroeconomic variables feature higher temporal dependence in
recessions (i.e. when x_{t-1} < 0) than in expansions (i.e. when x_{t-1} > 0).
Figure 2: Plot of g(x_{t-1}; θ) for (α, β, γ, δ) = (0.5, 0.4, 0, 15). Temporal dependence is low when
x_{t-1} is positive.
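The effect of the transition function is easy to see in a short simulation sketch. The transition parameters below are the illustrative values of Figure 2, while the innovation standard deviation is an additional assumption made only for the example.

import numpy as np

alpha, beta, gamma, delta = 0.5, 0.4, 0.0, 15.0   # values from Figure 2 (illustrative)
sigma, T = 0.1, 5_000                              # assumed innovation scale and sample size
rng = np.random.default_rng(2)

def g(x_lag):
    # logistic transition function: higher dependence for negative x_lag
    return alpha + beta / (1.0 + np.exp(gamma + delta * x_lag))

x = np.zeros(T)
for t in range(1, T):
    x[t] = g(x[t - 1]) * x[t - 1] + rng.normal(0.0, sigma)

dep = g(x[:-1])
print("average g when x_{t-1} < 0:", dep[x[:-1] < 0].mean().round(3))
print("average g when x_{t-1} > 0:", dep[x[:-1] > 0].mean().round(3))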
Example: (Exponential SESTAR model) The exponential SESTAR is obtained when the driver enters through a squared deviation from a location c,
g(x_{t-1}; θ) := α + β / (1 + exp(γ + δ (x_{t-1} - c)²)),  t ∈ Z.
The exponential SESTAR model is useful, for example, in modeling the temporal
dependence of real exchange rates. Indeed, real exchange rates tend to have high
dependence (no mean reversion) when rates are close to 1, and lower dependence (mean-reverting behavior) when rates are far away from 1.
Figure 3: Real exchange rate of EU 15 vs Danish Kroner (left) and data simulated
from an exponential SESTAR model (right).
Figure 4: Plot of g(x_{t-1}; θ) for (α, β, γ, δ, c) = (0.1, 0.8, 0, 50, 1). Temporal dependence is high
when x_{t-1} is close to 1.
4.2.2 Random coefficient autoregressions
Another popular type of nonlinear dynamic model is the random coefficient autoregressive (RCAR) model
x_t = β_{t-1} x_{t-1} + ε_t,  t ∈ Z,
where both {β_t}_{t∈Z} and {ε_t}_{t∈Z} are exogenous iid sequences with a certain distribution. This simple model, first proposed by Quinn (1980), has several important
applications in finance and biology as it allows for a time-varying conditional mean
and variance. Suppose, for example, that
β_{t-1} ∼ N(b, σ_β²)  and  ε_t ∼ N(0, σ_ε²)  for every t ∈ Z.
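A short simulation sketch of the RCAR model under this Gaussian specification (with illustrative values for b, σ_β and σ_ε) shows how the random coefficient generates a time-varying conditional mean and variance.

import numpy as np

b, sigma_beta, sigma_eps, T = 0.4, 0.3, 1.0, 1_000   # illustrative parameter values
rng = np.random.default_rng(3)

beta = rng.normal(b, sigma_beta, size=T)   # random coefficients beta_{t-1}
eps = rng.normal(0.0, sigma_eps, size=T)   # innovations eps_t

x = np.zeros(T)
for t in range(1, T):
    # conditional on x_{t-1}: mean b*x_{t-1}, variance sigma_beta^2*x_{t-1}^2 + sigma_eps^2
    x[t] = beta[t - 1] * x[t - 1] + eps[t]

print("sample mean:", x.mean().round(3), " sample variance:", x.var().round(3))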
4.2.3 Time-varying parameter models: parameter-driven
Example: (Local-level model) A leading example of a parameter-driven model is the Gaussian local-level model, in which the observations fluctuate around a time-varying mean that evolves as an autoregressive process driven by its own innovations,
x_t = μ_t + ε_t,  {ε_t} ∼ NID(0, σ_ε²),
μ_t = ω + β μ_{t-1} + v_t,  {v_t} ∼ NID(0, σ_v²).
The crucial feature here is the fact that the time-varying parameter μ_t evolves in time
independently of {x_t}_{t∈Z}. This is what makes the model a parameter-driven model.
The local-level model has important applications in many areas of science, including
e.g. in macroeconometrics.
Example: (Parameter-driven volatility model) Similarly, a parameter-driven model for time-varying volatility is obtained by letting
x_t = σ_t ε_t,  {ε_t} ∼ NID(0, σ_ε²),
σ_t² = ω + β σ²_{t-1} + v_t,  {v_t} ∼ NID(0, σ_v²).
This model defines a time series {x_t}_{t∈Z} with time-varying volatility. Applications
in finance are typically related to modeling returns.
4.2.4 Time-varying parameter models: observation-driven
Figure 5: Filtered time-varying means using local level and robust local level models.
Example: (GARCH model) The generalized autoregressive conditional heteroskedasticity (GARCH) model is an observation-driven model for the time-varying
volatility in a sequence of returns {x_t} (i.e. the percent changes in the prices of
stocks). Mean-zero returns are modeled as
x_t = σ_t ε_t,  {ε_t}_{t∈Z} ∼ NID(0, 1),
σ_t² = ω + α x²_{t-1} + β σ²_{t-1}.
This model defines a time-series {xt }tZ with time-varying conditional volatility that
has many important applications in finance. Figure 6 below plots a path of S&P500
returns and a path simulated from a GARCH model.
Figure 6: Time series of S&P500 returns (left), and data simulated from a GARCH
model (right).
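A path like the simulated one in the right panel of Figure 6 can be generated with a few lines of code; the parameter values below are illustrative choices, not estimates.

import numpy as np

omega, alpha, beta, T = 1e-6, 0.08, 0.90, 3_500     # illustrative GARCH(1,1) parameters
rng = np.random.default_rng(4)

x = np.zeros(T)
sigma2 = np.full(T, omega / (1.0 - alpha - beta))   # start at the unconditional variance
for t in range(1, T):
    sigma2[t] = omega + alpha * x[t - 1] ** 2 + beta * sigma2[t - 1]
    x[t] = np.sqrt(sigma2[t]) * rng.normal()

print("sample std of simulated returns:", x.std().round(4))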
The GARCH model uses the lagged x²_{t-1} as a driver to update the volatility
parameter. While this choice of driver for the volatility is quite intuitive, it can
often be improved upon. In particular, in the presence of outliers produced by fat-tailed innovations, the GARCH filter tends to produce exaggerated updates of the
time-varying parameter σ_t². Robust alternatives are thus important.
Example: (Robust GARCH model) A robust alternative is obtained by letting the innovations be fat-tailed and by bounding the news impact of the lagged observation,
x_t = σ_t ε_t,  {ε_t} ∼ iid t(λ),
σ_t² = ω + α x²_{t-1} / (1 + x²_{t-1}) + β σ²_{t-1}.
The boundedness of the update is crucial for avoiding large explosions in the filtered volatility as a result of outliers in x_{t-1}. As we shall see in Chapter 5, recognizing
that the innovations are fat-tailed may be important for robustness since the ML estimator converges to a pseudo-true parameter that renders the volatility filter more
robust. Figures 7 and 8 compare the driver (also called the news impact curve) and
the filtered volatilities of the GARCH and robust GARCH models.
Figure 7: News impact curves of the GARCH and Robust GARCH models.
Example: (GARCH with leverage) A leverage effect can be introduced by replacing the driver x²_{t-1} with (x_{t-1} - γ σ_{t-1})², yielding the volatility update
σ_t² = ω + α (x_{t-1} - γ σ_{t-1})² + β σ²_{t-1}.
Notice how the leverage effect is obtained through the cross-term x_{t-1} σ_{t-1} in the
volatility driver
(x_{t-1} - γ σ_{t-1})² = x²_{t-1} + γ² σ²_{t-1} - 2γ x_{t-1} σ_{t-1}.
Figure 8: Filtered volatilities from S&P500 returns using the GARCH and Robust
GARCH models.
In particular, if both γ > 0 and α > 0, then large positive values of x_{t-1} produce less volatility
because -2γ x_{t-1} σ_{t-1} < 0.
Example: (Q-GARCH model) Similarly, the quadratic GARCH (QGARCH) introduced in Sentana (1995) proposes the following volatility update with a leverage
effect,
σ_t² = ω + α x²_{t-1} + π x_{t-1} + β σ²_{t-1}.
The leverage effect exists when π < 0 since negative values of x_{t-1} will generate more
volatility than positive values of x_{t-1}.
4.2.5 Nonlinear dynamic models with exogenous variables
Nonlinear dynamic models can also include exogenous explanatory variables. A general formulation is the nonlinear autoregressive distributed lag (NADL) model
x_t = f(x_{t-1}, z_t, ε_t; θ),  t ∈ Z.   (5)
This model is useful to describe time series {x_t} that exhibit nonlinear autoregressive
dynamics, or when z_t affects x_t nonlinearly.
Example: (NADL(1,0) with logistic contemporaneous link) One possible
formulation of the logistic NADL(1,0) model that features a nonlinear contemporaneous relation between x_t and z_t is given by
x_t = φ x_{t-1} + g(x_{t-1}; θ) z_t + ε_t,  t ∈ Z,
with
g(x_{t-1}; θ) = α + β / (1 + exp(γ + δ x_{t-1})),  t ∈ Z.
In economics, this NADL(1,0) model is useful to describe how fiscal policy affects real GDP nonlinearly over the business cycle. In particular, changes in net
government expenditure zt have larger impact on GDP xt during periods of economic
recession, when GDP xt1 is low and far from potential, compared to periods of
economic expansion, when GDP xt1 is high and close to its full employment potential. Indeed, if the economy is already operating at full capacity with labor and
capital fully employed in production, then the rise in aggregate demand that results
from increased Government net expenditure zt results in increased prices (inflation),
rather than increased output supply. In contrast, if the economy is operating far from
potential, then an increase in aggregate demand fostered by a rise in government expenditure zt leads to a rise in labor and capital employment, increased output, and
hence a growth in real GDP xt .
4.3 Stationarity, dependence, ergodicity and moments
4.3.1 Strict stationarity and m-dependence
As you may recall from Definition 3 of Section 2.2, a time series is strictly stationary if its finite-dimensional distributions (fidis) are invariant in time.
Definition 26 (Strict Stationarity) A time series {x_t}_{t∈Z} is said to be strictly stationary if the distribution of every finite sub-vector is invariant in time,
(x_{t_1}, ..., x_{t_k}) =_d (x_{t_1+h}, ..., x_{t_k+h})  for every (t_1, ..., t_k), every h and every k.
For example, an iid sequence {ε_t}_{t∈Z} ∼ NID(0, σ²) is strictly stationary. It is also m-dependent, where a sequence is called m-dependent when elements that are at least m periods apart are independent.
The notion of m-dependence is useful because it allows us to apply the following law of large numbers (LLN) and central limit theorem (CLT) for stationary m-dependent sequences.¹⁰
¹⁰ The statement that functions of independent random variables are themselves independent is
rather trivial once you appropriately define the notion of independent random variables in terms
of independence of sub σ-algebras; see Theorem 1 in Chapter 4 of David Pollard's (2002) A User's
Guide to Measure Theoretic Probability for further details. Here we will just take this for granted.
Theorem 2 (LLN for stationary m-dependent sequences) Let {x_t}_{t∈Z} be a strictly
stationary m-dependent sequence with E|x_t| < ∞. Then
(1/T) Σ_{t=1}^T x_t → E(x_t)  in probability as T → ∞.
Theorem 3 (CLT for stationary m-dependent sequences) Let {x_t}_{t∈Z} be a strictly
stationary m-dependent sequence with E(x_1) = μ and Var(x_1) = σ_x² < ∞. Then
√T ( (1/T) Σ_{t=1}^T x_t - μ ) → N(0, w²)  in distribution as T → ∞,
where
w² = σ_x² + 2 Σ_{k=1}^{m-1} Cov(x_1, x_{1+k}).
Unfortunately, the notion of m-dependence is too limited to hold even for the
simple AR(1) process with (β, σ²) ≠ (0, 0).
Proposition 2 Let {x_t}_{t∈Z} be generated by an AR(1) process with (β, σ²) ≠ (0, 0),
x_t = β x_{t-1} + ε_t,  {ε_t}_{t∈Z} ∼ NID(0, σ²),  t ∈ Z.
Then {x_t}_{t∈Z} is not m-dependent for any m ∈ N.
4.3.2 Strict stationarity and ergodicity (SE)
The property of a stationary sequence which ensures that the ensemble average and
the time average converge to the same point is known as ergodicity. The ensemble
average is the sample average obtained from N realizations x_t(e_1), x_t(e_2), ..., x_t(e_N)
of the same random variable x_t,
(1/N) Σ_{i=1}^N x_t(e_i)  for some given t ∈ Z, where each e_i is an event e_i ∈ E.
The time average is the average obtained from a single realization {x_t(e)}_{t∈Z} of the
time series {x_t}_{t∈Z},
(1/T) Σ_{t=1}^T x_t(e)  for some e ∈ E.
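The distinction between the two averages is easy to see in a small Monte Carlo sketch: for a stationary ergodic Gaussian AR(1) with illustrative parameters, the ensemble average across many realizations at a fixed date and the time average along a single long realization both approach the same value, E(x_t) = 0.

import numpy as np

beta, sigma, rng = 0.7, 1.0, np.random.default_rng(5)

def ar1_path(T):
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = beta * x[t - 1] + rng.normal(0.0, sigma)
    return x

# Ensemble average: N independent realizations of x_t at a fixed time t
N, t_fixed = 2_000, 200
ensemble = np.array([ar1_path(t_fixed + 1)[t_fixed] for _ in range(N)])
print("ensemble average:", ensemble.mean().round(3))

# Time average: one single realization observed over a long stretch of time
time_avg = ar1_path(50_000).mean()
print("time average   :", round(time_avg, 3))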
The formal measure-theoretic definition of ergodicity is a tricky one. Here we
provide an approximate definition.
Definition 28 (Ergodic sequence) A random sequence {zt }tZ is said to be ergodic
if and only if, over an infinite amount of time, every event occurs with probability 0
or 1.
The definition above is essentially stating that every event that can happen will
eventually happen. In a loose sense, it is not possible that no element of the sequence
{zt }tZ ever takes a certain value, if that value has a positive probability of occurring.
In essence, the sequence will eventually visit all corners of the space, i.e. it will take
any possible value that has positive probability of occurring.
Fortunately, you need not worry too much about the complications surrounding the concept of ergodicity, because here we will only be concerned with the practical
result of ergodicity: the LLNs and CLTs that it produces! In particular, you may
recall the following theorems from Sections 2.2.2 and 2.2.3.
Theorem 4 (Ergodic theorem: LLN for SE data) Let {z_t}_{t∈Z} be a strictly stationary
and ergodic random sequence with finite first moment E|z_1| < ∞. Then we have
(1/T) Σ_{t=1}^T z_t → E(z_t)  in probability as T → ∞.
4.3.3 Sufficient conditions for SE
At this point it should be obvious that we need tools for establishing the strict stationarity and ergodicity of random sequences {xt }tZ since this will open the door to
obtaining LLNs and CLTs.
In general we will be concerned with conditions for stationarity and ergodicity
that apply to time-series generated by stochastic Markov dynamical systems.
Definition 30 (Markov dynamical system) An n_x-variate time series {x_t}_{t∈Z} is said
to be generated by a random Markov dynamical system if and only if every x_t satisfies
x_{t+1} = φ(x_t, ε_t, θ)
for some parametric function φ : R^{n_x} × R^{n_ε} × Θ → R^{n_x} and some n_ε-variate random
innovation sequence {ε_t}_{t∈Z}.
In the definition above we say that the system is dynamic because every x_t depends
on its past values. Naturally, the system is stochastic because the innovations {ε_t}_{t∈Z}
make the generated time series a random process. Finally, the dynamical system is
said to be a Markov dynamical system because x_t is a function only of x_{t-1} and not
of further lags (e.g. x_{t-2}).
Andrey Markov was one of the most famous Russian mathematicians of all time. He
was a doctoral student of the great Pafnuty Chebyshev.¹¹ Entire fields of probability
and statistics (e.g. the field of Markov chains) are today named after Markov.
Example: (Linear and nonlinear AR(1)) The linear AR(1)
x_t = β x_{t-1} + ε_t
is an obvious example of a Markov dynamical system, with
φ(x_{t-1}, ε_t, θ) = β x_{t-1} + ε_t.
Clearly, any nonlinear AR(1) is also a Markov system as it fits Definition 30. Furthermore, it is important to note that dynamic systems with an arbitrary number
of lags can also be re-written as a vector Markov dynamical system by adopting a
companion form.
Example: (Linear and nonlinear AR(p)) The linear AR(p) model
x_t = β_1 x_{t-1} + ... + β_p x_{t-p} + ε_t
¹¹ Andrey Markov's brother, Vladimir Markov, was himself a great mathematician, also a doctoral student of Pafnuty Chebyshev. Together they wrote the famous Markov brothers' inequality.
Unfortunately, Vladimir Markov died very early, at the age of 25.
can be written in companion form as X_t = A X_{t-1} + E_t, where X_t = (x_t, x_{t-1}, ..., x_{t-p+1})′ stacks the current and lagged observations, E_t = (ε_t, 0, ..., 0)′, and A is the companion matrix whose first row contains the AR coefficients (β_1, β_2, ..., β_p), with ones on the first sub-diagonal and zeros elsewhere.
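A short sketch, with illustrative AR(3) coefficients, that builds the companion matrix and verifies the vector Markov recursion X_t = A X_{t-1} + E_t on a simulated path.

import numpy as np

beta = np.array([0.5, 0.2, -0.1])      # illustrative AR(3) coefficients beta_1, ..., beta_p
p = len(beta)

# Companion matrix: first row holds the AR coefficients, ones on the sub-diagonal
A = np.zeros((p, p))
A[0, :] = beta
A[1:, :-1] = np.eye(p - 1)

rng = np.random.default_rng(6)
T = 500
x = np.zeros(T)
for t in range(p, T):
    x_lags = x[t - p : t][::-1]               # (x_{t-1}, ..., x_{t-p})
    x[t] = beta @ x_lags + rng.normal()

# Check the companion recursion X_t = A X_{t-1} + E_t at some date t
t = 100
X_t = x[t - p + 1 : t + 1][::-1]              # (x_t, x_{t-1}, ..., x_{t-p+1})
X_lag = x[t - p : t][::-1]                    # (x_{t-1}, ..., x_{t-p})
E_t = np.zeros(p); E_t[0] = x[t] - beta @ X_lag
print(np.allclose(X_t, A @ X_lag + E_t))      # True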
Furthermore, we let the symbol → e.a.s. denote exponentially fast almost sure convergence.
In particular, we write z_t → 0 e.a.s. if and only if there exists c > 1 such that c^t |z_t| → 0 a.s.
Theorem 6 (Bougerol's Theorem) For some θ ∈ Θ, let {x_t(θ, x_1)}_{t∈N} be a random
sequence initialized at t = 1 with value x_1 ∈ X ⊆ R and generated by the Markov
dynamical system
x_{t+1} = φ(x_t, ε_t, θ),  t ∈ N,
with differentiable function φ : X × R^{n_ε} × Θ → X and elements x_t(θ, x_1) taking values
in X ⊆ R for every t ∈ N. Suppose further that the following conditions hold:
A1. {ε_t}_{t∈Z} is an exogenous n_ε-variate SE sequence,
A2. there exists x_1 ∈ X such that E log⁺ |φ(x_1, ε_t)| < ∞,
A3. the dynamical system is contracting on average, E log sup_{x∈X} |∂φ(x, ε_t)/∂x| < 0.
Then {x_t(θ, x_1)}_{t∈N} converges e.a.s. to a unique strictly stationary and ergodic (SE)
sequence {x_t(θ)}_{t∈Z}. In other words, |x_t(θ, x_1) - x_t(θ)| → 0 e.a.s.
4.3.4 Examples and Counter Examples
Example: (AR(1) model) Consider a time series {x_t(θ, x_1)}_{t∈N} generated by a Gaussian AR(1) initialized at some value x_1 ∈ R,
x_{t+1} = β x_t + ε_t,  t ∈ N,  {ε_t}_{t∈Z} ∼ NID(0, σ²).   (6)
It is easy to show that the Gaussian AR(1) with |β| < 1 generates asymptotically SE
time series by applying Bougerol's theorem.¹²
Proposition 3 (SE Gaussian AR) Let {x_t(θ, x_1)}_{t∈N} be generated by the Gaussian
AR(1) in (6) with θ = (β, σ²) satisfying |β| < 1 and 0 < σ² < ∞. Then {x_t(θ, x_1)}_{t∈N}
converges e.a.s. to a limit SE sequence {x_t(θ)}_{t∈Z} for any initialization x_1 ∈ R.
Proof. Condition A1 holds trivially since the innovations {ε_t}_{t∈Z} are iid (and hence also
SE).
Condition A2 holds because, for any given value x_1 ∈ R, we have
E log⁺ |φ(x_1, ε_t)| = E log⁺ |β x_1 + ε_t| ≤ E|β x_1 + ε_t| ≤ |β| |x_1| + E|ε_t| < ∞.
The first equality E log⁺ |φ(x_1, ε_t)| = E log⁺ |β x_1 + ε_t| holds by definition. The first inequality holds because log⁺(x) ≤ x for every x ≥ 0. The second inequality holds by the sub-additivity
of the absolute value, |a + b| ≤ |a| + |b| for every a and b, and the absolute homogeneity of the
absolute value, |ab| = |a||b| for every a and b. The terms |β| and |x_1| are bounded constants
in R and E|ε_t| < ∞ because ε_t is normally distributed with bounded variance σ² < ∞, and
hence has bounded moments of any order; i.e. E|ε_t|^n < ∞ for all n ≥ 0.
Condition A3 holds since ∂φ(x, ε_t)/∂x = β, and hence,
E log sup_{x∈X} |∂φ(x, ε_t)/∂x| < 0  ⟺  E log |β| < 0,
where the equivalence holds because the derivative does not depend on x and hence
we can drop the supremum. Finally, we obtain the desired result by noting that we can
drop the expectation, E log |β| < 0 ⟺ log |β| < 0, because log |β| is just a constant, and
that log |β| < 0 ⟺ |β| < 1.
¹² Note that {x_t}_{t∈N} initialized at some point x_1 ∈ R is always non-stationary! Suppose that
x_1 = 5. Well, then the mean of the process changes from E(x_1) = x_1 = 5 to E(x_2) = β x_1 = 5β.
Hence the process is non-stationary! In fact, even if x_1 = 0, we still have a changing variance, from
Var(x_1) = 0 to Var(x_2) = σ². Hence, the process is still non-stationary. There is really no x_1 ∈ R
that would make the process stationary.
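The e.a.s. convergence in Bougerol's theorem is easy to visualize numerically: two copies of the AR(1), started at very different initial values but fed the same innovations, merge at an exponential rate. A minimal sketch with illustrative parameter values:

import numpy as np

beta, sigma, T = 0.8, 1.0, 60
rng = np.random.default_rng(7)
eps = rng.normal(0.0, sigma, size=T)

x = np.zeros(T); x[0] = 100.0    # first initialization
y = np.zeros(T); y[0] = -50.0    # second initialization
for t in range(1, T):
    x[t] = beta * x[t - 1] + eps[t]
    y[t] = beta * y[t - 1] + eps[t]

# |x_t - y_t| = |beta|^t * |x_1 - y_1| -> 0 exponentially fast
for t in (0, 10, 20, 40):
    print(t, abs(x[t] - y[t]))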
Surely, you already knew the stationarity condition (|β| < 1) for the AR(1) model
from your introductory econometrics courses. Bougerol's theorem can however be
applied to many other models!
Example: (Random-coefficient AR(1) model) Consider, for example, the random-coefficient Gaussian AR(1) initialized at some value x_1 ∈ R,
x_{t+1} = β_t x_t + ε_t,  t ∈ N,  {β_t}_{t∈N} ∼ NID(b, σ_β²),  {ε_t}_{t∈N} ∼ NID(0, σ_ε²).   (7)
Note that we can apply Bougerol's Theorem by defining the innovation vector in
Theorem 6 as the pair (β_t, ε_t). As such, it is easy to show that this random-coefficient
model generates asymptotically SE time series if β_t is smaller than 1 on average; i.e. if
E|β_t| < 1. Note that the condition E|β_t| < 1 implicitly imposes restrictions on the
parameters b and σ_β² that define the distribution of β_t.
Proposition 4 (SE Gaussian RCAR) Let {x_t(θ, x_1)}_{t∈N} be generated by the Gaussian random-coefficient AR(1) in (7) with θ = (b, σ_β², σ_ε²) satisfying E|β_t| < 1 and
0 < σ_ε² < ∞. Then {x_t(θ, x_1)}_{t∈N} converges e.a.s. to a limit SE sequence {x_t(θ)}_{t∈Z}
for any initialization x_1 ∈ R.
Proof.
Condition A1 holds trivially since the innovations {β_t}_{t∈N} and {ε_t}_{t∈N} are iid.
Condition A2 holds because for any given value x_1 ∈ R we have
E log⁺ |φ(x_1, β_t, ε_t)| = E log⁺ |β_t x_1 + ε_t| ≤ E|β_t x_1 + ε_t| ≤ |x_1| E|β_t| + E|ε_t| < ∞.
The second inequality holds by the sub-additivity of the absolute value. The term |x_1| is a
bounded constant in R, E|β_t| < 1 < ∞, and E|ε_t| < ∞ because {ε_t} is NID with finite
variance and hence has bounded moments of any order.
Condition A3 holds since ∂φ(x, β_t, ε_t)/∂x = β_t, and hence,
E log sup_{x∈X} |∂φ(x, β_t, ε_t)/∂x| < 0  ⟺  E log |β_t| < 0,
which holds because, by Jensen's inequality, E log |β_t| ≤ log E|β_t| < log 1 = 0.
It is important to note that the condition E|β_t| < 1 used in Proposition 4 is unnecessarily restrictive. We could have assumed E log |β_t| < 0, which is weaker. In the two
examples above, the contraction condition was rather trivial because the derivative
∂φ(x, ε_t)/∂x did not depend on x. We turn now to an example where ∂φ(x, ε_t)/∂x
is a function of x. The logistic AR(1) model ensures that every x_t takes values in the
interval [0, 1] and is useful to model time series that refer to probabilities, intensities,
percentages, etc.
Example: (Logistic AR(1) model) Consider the logistic AR(1) initialized at some x_1 ∈ [0, 1],
x_{t+1} = 1 / (1 + exp(-β x_t + ε_t)),  t ∈ N,  {ε_t}_{t∈N} ∼ NID(0, σ²).   (8)
Here the derivative ∂φ(x, ε_t)/∂x = β exp(-β x + ε_t) / (1 + exp(-β x + ε_t))² depends on x, but since exp(z)/(1 + exp(z))² is bounded by 1/4, the contraction condition A3 of Bougerol's theorem is satisfied whenever |β| < 4.¹³
¹³ This follows from the fact that w/(1 + w)² is maximized at w = 1, and hence, the function
exp(z)/(1 + exp(z))² is maximized at z = 0.
Example: (GARCH model) Consider the GARCH(1,1) model initialized at some σ_1² ∈ R⁺,
x_t = σ_t ε_t,  {ε_t}_{t∈N} ∼ NID(0, 1),
σ²_{t+1} = ω + α x_t² + β σ_t²,  t ∈ N.
It is easy to show that {x_t(θ, σ_1²)}_{t∈N} converges e.a.s. to a limit SE sequence as long
as E log |α ε_t² + β| < 0. A simpler sufficient condition is thus that |α| < 1 - |β|.
Proposition 6 (SE GARCH) Let {x_t(θ, σ_1²)}_{t∈N} be generated by the GARCH(1,1)
above under some θ = (ω, α, β) with E log |α ε_t² + β| < 0. Then {x_t(θ, σ_1²)}_{t∈N}
converges e.a.s. to a limit SE sequence {x_t(θ)}_{t∈Z}.
Proof. We substitute first x_t in the recursion equation for σ_t² using the fact that x_t = σ_t ε_t
and obtain
σ²_{t+1} = ω + (α ε_t² + β) σ_t²,  t ∈ N.
Now we can analyze this recursion as a stochastic Markov dynamical system and show that
{σ_t²(θ, σ_1²)}_{t∈N} converges to a limit SE sequence.
Condition A1 holds trivially because the innovations {ε_t²}_{t∈N} are iid.
Condition A2 holds since for any given value σ_1² ∈ R⁺ we have
E log⁺ |φ(σ_1², ε_t)| = E log⁺ |ω + α σ_1² ε_t² + β σ_1²| ≤ |ω| + |α| |σ_1²| E|ε_t²| + |β| |σ_1²| < ∞,
because every real-valued parameter ω, α and β, and the real-valued initialization σ_1², are
naturally finite, and E|ε_t|² < ∞ because ε_t is standard Gaussian.
Condition A3 holds since ∂φ(σ², ε_t)/∂σ² = α ε_t² + β for every σ² ∈ R⁺, and hence
E log sup_{σ²∈R⁺} |∂φ(σ², ε_t)/∂σ²| = E log |α ε_t² + β| < 0.
We thus conclude that {σ_t²(θ, σ_1²)}_{t∈N} converges e.a.s. to a limit SE process {σ_t²(θ)}_{t∈Z}.
Finally, {x_t(θ, σ_1²)}_{t∈N} also converges to a limit SE process {x_t(θ)}_{t∈Z} since, by the product
e.a.s. convergence theorem (see below), we have
|σ_t(θ, σ_1²) - σ_t(θ)| |ε_t| → 0 a.s. as t → ∞.
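The contraction condition in Proposition 6 can be checked for candidate parameter values with a quick Monte Carlo average over the standard normal innovations. The parameter pairs below are purely illustrative; the code simply reports whether each satisfies E log|α ε_t² + β| < 0.

import numpy as np

rng = np.random.default_rng(8)
eps2 = rng.normal(size=1_000_000) ** 2      # draws of eps_t^2 with eps_t ~ N(0, 1)

for (alpha, beta) in [(0.08, 0.90), (0.20, 0.85)]:
    # Monte Carlo estimate of E log|alpha*eps_t^2 + beta| (condition A3)
    contraction = np.mean(np.log(np.abs(alpha * eps2 + beta)))
    print("alpha =", alpha, "beta =", beta,
          "E log|alpha*eps^2 + beta| =", round(contraction, 4),
          "(SE)" if contraction < 0 else "(condition fails)")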
Example: (A counter-example: quadratic AR(1)) Consider now the quadratic autoregressive model
x_{t+1} = β x_t² + ε_t,  t ∈ N,  {ε_t}_{t∈N} ∼ NID(0, σ²).
It is easy to see that this stochastic Markov dynamical system can only satisfy the
contraction condition on a degenerate parameter space with β = 0.
In particular, note that ∂φ(x, ε_t)/∂x = 2βx, and hence,
E log sup_{x∈X} |∂φ(x, ε_t)/∂x| = E log sup_{x∈X} |2βx|
is unbounded because the derivative can be made arbitrarily large by taking x to
infinity. Strictly speaking, the contraction condition is also not satisfied for β = 0
since log(0) is not defined. In any case, we know that for β = 0 the process {x_t} =
{ε_t} is a simple iid sequence.
4.3.5 Bounded unconditional moments
Every law of large numbers imposes the condition that the first unconditional moment exists, E|x_t| < ∞. Every central limit theorem imposes the condition that the
second unconditional moment exists, E|x_t|² < ∞. In the previous section we have
learned how to verify that a stochastic dynamical system generates time series that
are asymptotically SE. We now turn to the last element that we need to obtain laws
of large numbers and central limit theorems: ensuring that the unconditional moments are bounded. As in the previous section, we focus on time series generated by
stochastic Markov dynamical systems.
Theorem 9 (Power-n contraction) For some θ ∈ Θ, let {x_t(θ, x_1)}_{t∈N} be a random
sequence initialized at t = 1 with value x_1 ∈ X ⊆ R and generated by the Markov
dynamical system
x_{t+1} = φ(x_t, ε_t, θ),  t ∈ N,
with differentiable function φ : X × R^{n_ε} × Θ → X and elements x_t(θ, x_1) taking values
in X ⊆ R for every t ∈ N. Suppose further that the following conditions hold for
some n > 0:
A1. {ε_t}_{t∈Z} is an exogenous n_ε-variate SE sequence,
A2. there exists x_1 ∈ X such that E|φ(x_1, ε_t)|^n < ∞,
A3. the system is power-n contracting, E sup_{x∈X} |∂φ(x, ε_t)/∂x|^n < 1.
Then {x_t(θ, x_1)}_{t∈N} satisfies sup_t E|x_t(θ, x_1)|^n < ∞ and converges e.a.s. to a unique
strictly stationary and ergodic (SE) sequence {x_t(θ)}_{t∈Z} satisfying E|x_t(θ)|^n < ∞.
Note that, besides the bounded moments, Theorem 9 also obtains the e.a.s. convergence to an SE limit. This is simply due to the fact that conditions A1, A2 and A3
of Theorem 9 imply conditions A1, A2 and A3 of Theorem 6. Indeed, the conditions
of both theorems are very similar. In fact, condition A1 is the same, and conditions
A2 and A3 just impose n moments rather than the log moment.
The power-n contraction theorem ensures that a Markov dynamical system generates time-series with bounded unconditional moments. If conditions A1, A2 and A3
hold for n = 2, then the time-series has bounded unconditional variance. This means
that the time-series is weakly stationary!
Two important inequalities will be useful for verifying that the conditions of the
power-n contraction theorem hold. These are known as the c_n-inequality and the
generalized Hölder inequality.
Theorem 10 (c_n-inequality) For every n > 0, there exists a constant c > 0 such that
E|z_t + w_t|^n ≤ c E|z_t|^n + c E|w_t|^n.
Theorem 11 (Generalized Hölder inequality) For every p > 0 and q > 0, it holds
true that
(E|z_t w_t|^n)^{1/n} ≤ (E|z_t|^p)^{1/p} (E|w_t|^q)^{1/q}  with n = pq/(p + q).
Three important implications of the generalized Hölder inequality are:
1. If E|z_t|^p < ∞ and E|w_t|^q < ∞, then E|z_t w_t|^n < ∞ with n = pq/(p + q).
2. If E|z_t|^p < ∞ and |w_t| is bounded a.s., then E|z_t w_t|^n < ∞ with n = p.
3. If E|z_t|^{p+δ} < ∞ for some δ > 0 and E|w_t|^q < ∞ for every q > 0, then E|z_t w_t|^n < ∞
with n = p.
4.3.6 Examples and Counter Examples
Example: (AR(1) model) Consider a time series {x_t(θ, x_1)}_{t∈N} generated by a Gaussian AR(1) initialized at some value x_1 ∈ R,
x_{t+1} = β x_t + ε_t,  t ∈ N,  {ε_t}_{t∈Z} ∼ NID(0, σ²).   (9)
It is easy to show that the time series {x_t(θ, x_1)}_{t∈N} has bounded moments of any
order when |β| < 1 and σ² is finite.
Proposition 7 Let {x_t(θ, x_1)}_{t∈N} be generated by the Gaussian AR(1) in (9) with θ =
(β, σ²) satisfying |β| < 1 and 0 < σ² < ∞. Then {x_t(θ, x_1)}_{t∈N} and its SE limit
{x_t(θ)}_{t∈Z} satisfy sup_t E|x_t(θ, x_1)|^n < ∞ and E|x_t(θ)|^n < ∞ for any n > 0 and any
x_1 ∈ X.
Proof. Condition A1 holds trivially for every n > 0. Condition A2 holds for every n > 0 and
any x_1 ∈ R since, by the c_n-inequality, we have
E|φ(x_1, ε_t)|^n = E|β x_1 + ε_t|^n ≤ c |β|^n |x_1|^n + c E|ε_t|^n < ∞.
Condition A3 holds since ∂φ(x, ε_t)/∂x = β does not depend on x, and hence
E sup_{x∈X} |∂φ(x, ε_t)/∂x|^n = |β|^n < 1.
It is important to note that the fat-tailed AR(1) model with iid Student's t (TID)
innovations,
x_{t+1} = β x_t + ε_t,  t ∈ N,  {ε_t}_{t∈Z} ∼ TID(λ),
does not have bounded moments of every order: since E|ε_t|^n < ∞ only for n < λ, the
generated time series can have at most n < λ bounded moments.
Example: (GARCH model) Consider again the GARCH(1,1) model initialized at some σ_1² ∈ R⁺,
x_t = σ_t ε_t,  {ε_t}_{t∈N} ∼ NID(0, 1),
σ²_{t+1} = ω + α x_t² + β σ_t²,  t ∈ N.
Suppose that E|α ε_t² + β|^{n+δ} < 1 for some δ > 0. As before, we substitute x_t = σ_t ε_t in the recursion to obtain
σ²_{t+1} = ω + (α ε_t² + β) σ_t²,  t ∈ N,
and now we check that the conditions of the power-n contraction theorem hold for this
Markov system.
Condition A1 holds trivially because the innovations {ε_t²}_{t∈N} are iid.
Condition A2 holds for any m = n + δ and σ_1² ∈ R⁺ since we have
E|φ(σ_1², ε_t)|^m = E|ω + α σ_1² ε_t² + β σ_1²|^m ≤ c|ω|^m + c|α|^m |σ_1²|^m E|ε_t²|^m + c|β|^m |σ_1²|^m < ∞,
because ε_t is standard Gaussian and hence has bounded moments of any order.
Condition A3 holds because ∂φ(σ², ε_t)/∂σ² = α ε_t² + β, and hence,
E sup_{σ²∈R⁺} |∂φ(σ², ε_t)/∂σ²|^{n+δ} = E|α ε_t² + β|^{n+δ} < 1.
As a result, we conclude that sup_t E|σ_t²(θ, σ_1²)|^{n+δ} < ∞ and E|σ_t²(θ)|^{n+δ} < ∞.
Finally, we use the n + δ bounded moments of the volatility sequences {σ_t²(θ, σ_1²)}_{t∈N}
and {σ_t²(θ)}_{t∈Z} to conclude that the data sequences {x_t(θ, σ_1²)}_{t∈N} and {x_t(θ)}_{t∈Z} have n
bounded moments.
In particular, sup_t E|x_t(θ, σ_1²)|^n < ∞ and E|x_t(θ)|^n < ∞ hold for all
σ_1² ∈ R⁺ because E|ε_t|^q < ∞ for every q > 0, and hence, by the generalized Hölder inequality (implication 3 above), we have
sup_t E|x_t(θ, σ_1²)|^n = sup_t E|σ_t(θ, σ_1²) ε_t|^n < ∞  and  E|x_t(θ)|^n = E|σ_t(θ) ε_t|^n < ∞.
Figure 9: Pairs (α, β) below each frontier are pairs (α, β) that ensure SE or bounded
moments.
4.3.7 Notes for Time-varying Parameter Models
Take, for example, the observation-driven model for the time-varying mean,
x_t = μ_t + ε_t,  ε_t ∼ N(0, σ_ε²).
The filtered time-varying mean {μ_t(θ, μ_1)} initialized at some value μ_1 ∈ R takes the
data {x_t} as given. The filtered parameter {μ_t(θ, μ_1)} is updated according to the
updating equation
μ_{t+1} = ω + α(x_t - μ_t) + β μ_t.
If the model is correctly specified, then the true time-varying mean {μ_t(θ_0)} is supposed to be generated by the model under some θ_0. It is important to note that the
true parameter μ_t(θ_0) influences the data x_t because the data is generated according
to x_t = μ_t(θ_0) + ε_t. As a result, we cannot simply analyze the properties of {μ_t(θ_0)}
as if the data {x_t} were given. Instead, we must recognize that x_t = μ_t + ε_t and hence
substitute x_t in the updating equation to obtain
μ_{t+1} = ω + α ε_t + β μ_t.
When estimating time-varying parameter models, we always need to analyze the
properties of the filtered parameter {μ_t(θ, μ_1)}. After all, it is the filtered parameter,
not the true parameter, that enters the criterion function for the estimation of the parameters.
The true parameter is unobserved! For example, it is the filtered parameter that enters
the log likelihood function
L_T(x^T, θ) = (1/T) Σ_{t=2}^T [ -(1/2) log(2πσ_ε²) - (x_t - μ_t(θ, μ_1))² / (2σ_ε²) ]
that we use to obtain maximum likelihood estimates of the parameters of the model!
On the contrary, we only need to analyze the properties of the true sequence {μ_t(θ_0)}
if we want to establish properties of the data {x_t}.
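A sketch of the filter and the criterion it feeds: given data {x_t} and a candidate θ = (ω, α, β, σ_ε²), the filtered mean is computed by running the updating equation forward from an initialization μ_1, and it is this filtered sequence that enters the Gaussian log likelihood above. The function name, parameter values and initialization below are purely illustrative.

import numpy as np

def loglik_local_level(x, omega, alpha, beta, sigma2_eps, mu1=0.0):
    """Run the filter mu_{t+1} = omega + alpha*(x_t - mu_t) + beta*mu_t and
    return the average Gaussian log likelihood evaluated at the filtered mean."""
    T = len(x)
    mu = np.zeros(T)
    mu[0] = mu1
    for t in range(T - 1):
        mu[t + 1] = omega + alpha * (x[t] - mu[t]) + beta * mu[t]
    resid = x[1:] - mu[1:]
    ll = -0.5 * np.log(2 * np.pi * sigma2_eps) - resid**2 / (2 * sigma2_eps)
    return ll.mean()

# Example: simulate data from the model and compare the criterion at two parameter values
rng = np.random.default_rng(9)
T, theta0 = 1_000, (0.1, 0.3, 0.6, 1.0)            # illustrative "true" parameters
omega, alpha, beta, s2 = theta0
mu_true, x = np.zeros(T), np.zeros(T)
for t in range(T - 1):
    x[t] = mu_true[t] + rng.normal(0.0, np.sqrt(s2))
    mu_true[t + 1] = omega + alpha * (x[t] - mu_true[t]) + beta * mu_true[t]
x[-1] = mu_true[-1] + rng.normal(0.0, np.sqrt(s2))

print("log lik at true theta :", round(loglik_local_level(x, *theta0), 4))
print("log lik at wrong theta:", round(loglik_local_level(x, 0.0, 0.05, 0.2, 2.0), 4))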
Consider now the GARCH model for time-varying volatility,
x_t = σ_t ε_t,  ε_t ∼ N(0, 1).
The filtered time-varying volatility {σ_t²(θ, σ_1²)} takes the data {x_t} as given. The
filtered volatility {σ_t²(θ, σ_1²)} is then updated according to the updating equation
σ²_{t+1} = ω + α x_t² + β σ_t².
If the model is correctly specified, then the true time-varying volatility {σ_t²(θ_0)} is
supposed to be generated by the model under some θ_0. Again, it is important to
note that σ_t²(θ_0) influences the data x_t because the data is generated according to
x_t = σ_t(θ_0) ε_t. As a result, we cannot simply analyze the properties of {σ_t²(θ_0)} as
if the data {x_t} were given. Instead, we must recognize that x_t = σ_t ε_t and hence
substitute x_t in the updating equation to obtain
σ²_{t+1} = ω + (α ε_t² + β) σ_t².
Again, please keep in mind that when estimating the parameters of the GARCH
model, we always need to analyze the properties of the filtered volatility {σ_t²(θ, σ_1²)},
since the filtered volatility enters the log likelihood function
L_T(x^T, θ) = Σ_{t=2}^T [ -(1/2) log 2π - (1/2) log σ_t²(θ, σ_1²) - x_t² / (2 σ_t²(θ, σ_1²)) ].
On the contrary, we only need to analyze the properties of the true volatility {σ_t²(θ_0)}
if we want to establish the properties of the data {x_t}.
Stationarity and Moments
We have noted above that in time-varying parameter models, the filtered parameter
and the true parameter can have very different behavior. Above, we have learned
how to show that the true time-varying parameter is strictly stationary and ergodic
(SE) and has n bounded moments, and we used this to show that {x_t}_{t∈Z} is also SE
and has bounded moments. In particular, we applied Bougerol's Theorem and the
Power-n Theorem to the true time-varying parameter.
As we shall now see, we can also use Bougerol's Theorem and the Power-n Theorem
to establish the stochastic properties of the filtered time-varying parameter. In this
case, however, we take the properties of the data {x_t} as given, and look at the data
{x_t} as innovations.
Example: (Local-level model) In the case of the local-level model, we can show
that the true sequence {μ_t(θ_0)}_{t∈Z} is SE and has n bounded moments by applying
Bougerol's Theorem and the Power-n Theorem to the updating equation
μ_{t+1} = ω + α ε_t + β μ_t.
The conditions of Bougerol's Theorem which ensure that the true parameter {μ_t(θ_0)}_{t∈Z}
is SE are:
A1: {ε_t} is iid,
A2: E log⁺ |ω + α ε_t + β μ_1| < ∞,
A3: E log sup_μ |β| < 0  ⟺  |β| < 1.
The conditions of the Power-n Theorem which ensure that the true parameter {μ_t(θ_0)}_{t∈Z}
is SE and has n bounded moments E|μ_t(θ_0)|^n < ∞ are:
A1: {ε_t} is iid,
A2: E|ω + α ε_t + β μ_1|^n < ∞,
A3: E sup_μ |β|^n < 1  ⟺  |β| < 1.
On the other hand, if we instead want to ensure that the filtered parameter {μ_t(θ, μ_1)}_{t∈N}
initialized at some μ_1 ∈ R converges to a limit SE process with n bounded moments,
we should apply Bougerol's Theorem and the Power-n Theorem to the updating
equation
μ_{t+1} = ω + α(x_t - μ_t) + β μ_t.
Bougerol's conditions for a limit SE filtered parameter {μ_t(θ, μ_1)}_{t∈N} are then:
A1: {x_t} is SE,
A2: E log⁺ |ω + α(x_t - μ_1) + β μ_1| < ∞,
A3: E log sup_μ |β - α| < 0  ⟺  |β - α| < 1.
The conditions of the Power-n Theorem for a limit SE filtered parameter with n bounded
moments are the analogous moment versions, with A3 given by E sup_μ |β - α|^n < 1 ⟺ |β - α| < 1.
Note that the conditions used for the filtered parameter are very different from those
used in the true parameter case. For example, condition A1 for the filtered parameter
requires that we already know certain properties of the data (i.e. the SE nature of
{x_t}). The contraction conditions are also very different. In particular, |β| < 1 is
sufficient to show that the true parameter is SE and has bounded moments, whereas
|β - α| < 1 is required to ensure that the filtered parameter is SE and has bounded
moments.
Example: (GARCH model) In the case of the GARCH model, we can show that the
true volatility sequence {σ_t²(θ_0)}_{t∈Z} is SE and has n bounded moments by applying
Bougerol's Theorem and the Power-n Theorem to the updating equation
σ²_{t+1} = ω + (α ε_t² + β) σ_t².
Bougerol's conditions for the true volatility {σ_t²(θ_0)}_{t∈Z} to be SE are then given by:
A1: {ε_t} is iid,
A2: E log⁺ |ω + α σ_1² ε_t² + β σ_1²| < ∞,
A3: E log sup_{σ²} |α ε_t² + β| < 0  ⟺  E log |α ε_t² + β| < 0.
The conditions of the Power-n Theorem for the true volatility {σ_t²(θ_0)}_{t∈Z} to be SE
and have n bounded moments E|σ_t²(θ_0)|^n < ∞ are:
A1: {ε_t} is iid,
A2: E|ω + α σ_1² ε_t² + β σ_1²|^n < ∞,
A3: E sup_{σ²} |α ε_t² + β|^n < 1  ⟺  E|α ε_t² + β|^n < 1.
On the other hand, if we instead want to ensure that the filtered volatility $\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$,
initialized at some $\sigma_1^2 \in \mathbb{R}$, converges to a limit SE process with $n$ bounded moments,
we should apply Bougerol's Theorem and the Power-$n$ Theorem to the updating
equation

$\sigma_{t+1}^2 = \omega + \alpha x_t^2 + \beta \sigma_t^2 .$

The conditions of Bougerol's Theorem which ensure that the filtered volatility $\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ converges to an SE limit are:

A1: $\{x_t\}$ is SE
A2: $E \log^+ |\omega + \alpha x_t^2 + \beta \sigma_1^2| < \infty$
A3: $E \log \sup_{\sigma^2} |\beta| < 0$, i.e. $|\beta| < 1$.
The conditions of the Power-$n$ Theorem which ensure that the filtered volatility
$\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ converges to a limit SE process with $n$ bounded moments are:

A1: $\{x_t\}$ is SE
A2: $E|\omega + \alpha x_t^2 + \beta \sigma_1^2|^n < \infty$
A3: $E \sup_{\sigma^2} |\beta|^n < 1$, i.e. $|\beta| < 1$.
Note that the conditions used for the filtered volatility are different from those used
for the true volatility. For example, condition A1 for the filtered volatility requires that
we already know certain properties of the data (i.e. the SE nature of $\{x_t\}$). The
contraction conditions are also very different. In particular, $E \log |\alpha \varepsilon_t^2 + \beta| < 0$ is
sufficient to show that the true volatility is SE and has bounded moments, whereas
$|\beta| < 1$ is required to ensure that the filtered volatility is SE and has bounded
moments.
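The contraction condition $E \log |\alpha \varepsilon_t^2 + \beta| < 0$ for the true volatility is weaker than $\alpha + \beta < 1$ (by Jensen's inequality) and is easy to check by Monte Carlo. The following Python sketch, which assumes Gaussian innovations and uses hypothetical parameter values, estimates this expectation by simulation.

```python
import numpy as np

rng = np.random.default_rng(2)
eps2 = rng.standard_normal(1_000_000) ** 2          # eps_t^2 with eps_t ~ N(0, 1)

def lyapunov(alpha, beta):
    """Monte Carlo estimate of E log|alpha * eps_t^2 + beta|."""
    return np.log(np.abs(alpha * eps2 + beta)).mean()

print(lyapunov(0.10, 0.85))   # negative: true volatility SE (here also alpha + beta < 1)
print(lyapunov(0.30, 0.70))   # alpha + beta = 1, yet the expectation is still negative
print(lyapunov(0.50, 1.50))   # positive: the contraction condition fails
```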
4.3.8 Notes for Models with Exogenous Variables
Before we end this chapter, it is convenient to note that the theory covered in the
previous sections also applies to models with exogenous variables. Indeed, since both
Bougerol's Theorem and the Power-$n$ Theorem were not specific about the nature of
the innovation vector, our freedom here is considerable. Typically, the label
innovations applies only to sequences that exhibit weak or no temporal dependence,
such as white noise processes. Those two theorems, however, allow the innovation
vector to contain temporal dependence, as long as it is SE and exogenous in nature. As such,
there is a host of variables that can potentially fit that definition.
As an example, consider the following nonlinear autoregressive distributed lag
(NADL) model

$x_{t+1} = f(x_t, z_t, \varepsilon_t; \theta) \quad \forall\, t \in \mathbb{N} ,$

where $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is a sequence of iid innovations, and $\{z_t\}_{t\in\mathbb{Z}}$ denotes some exogenous
SE time series. Clearly, the theorems above can be applied directly by taking the
innovation vector to be $(\varepsilon_t, z_t)$ for every $t$. It is really that simple!
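As a small illustration, the following Python sketch simulates one such NADL recursion; the particular update function $f$, the AR(1) law generating the exogenous SE sequence $\{z_t\}$, and all parameter values are hypothetical choices made only to show how the iid innovation and the exogenous variable are stacked into one innovation vector.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000

# exogenous SE sequence {z_t}: a stationary Gaussian AR(1) (hypothetical choice)
z = np.empty(T)
z[0] = 0.0
for t in range(T - 1):
    z[t + 1] = 0.5 * z[t] + rng.standard_normal()

eps = rng.standard_normal(T)            # iid innovations {eps_t}
phi = np.column_stack([eps, z])         # stacked innovation vector (eps_t, z_t): SE and exogenous

def f(x, z_t, eps_t, theta):
    """A hypothetical NADL update x_{t+1} = f(x_t, z_t, eps_t; theta):
    a bounded nonlinear AR part plus a distributed-lag term."""
    a, b = theta
    return a * np.tanh(x) + b * z_t + eps_t   # |df/dx| <= a < 1: a contraction in x

x = np.empty(T)
x[0] = 0.0
for t in range(T - 1):
    x[t + 1] = f(x[t], phi[t, 1], phi[t, 0], theta=(0.8, 0.5))

print(x[:5])
```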
Below we treat explicitly the example of a random-coefficient ADL (RC-ADL)
model with Gaussian time-varying coefficients. From here, the reader should immediately
see how to proceed for other models.
Example: (Gaussian RC-ADL model) Consider the Gaussian random-coefficient ADL
model for $\{x_t\}_{t\in\mathbb{N}}$, initialized at some $x_1 \in \mathbb{R}$,

$x_{t+1} = \alpha_t x_t + \beta_t z_t + \varepsilon_t \quad \forall\, t \in \mathbb{N} , \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim NID(0, \sigma_\varepsilon^2) ,$

where $\{\alpha_t\}_{t\in\mathbb{Z}} \sim NID(a, \sigma_\alpha^2)$, $\{\beta_t\}_{t\in\mathbb{Z}} \sim NID(b, \sigma_\beta^2)$, and $\{z_t\}_{t\in\mathbb{Z}}$ is some exogenous
sequence.
In the model above, the innovation vector that appears in both Bougerol's Theorem
and the Power-$n$ Theorem effectively takes the form $(\alpha_t, \beta_t, z_t, \varepsilon_t)$. We recall
that, in this context, the exogeneity of this vector process (which we always
assume implicitly!) ensures that $\alpha_t$, $\beta_t$, $z_t$ and $\varepsilon_t$ are independent of each other at all
leads and lags, as well as independent of present and future values of $x_t$. As we shall
now see, this random-coefficient ADL model generates asymptotically SE time series
if the autoregressive coefficient is smaller than one in absolute value on average, i.e. if $E|\alpha_t| < 1$.
Proposition 9 (SE Gaussian RC-ADL) Let $\{z_t\}_{t\in\mathbb{Z}}$ be an exogenous sequence satisfying $E|z_t|^{1+\delta} < \infty$ for some $\delta > 0$, and let $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ be generated by the Gaussian
RC-ADL model above, parameterized by a vector $\theta = (a, b, \sigma_\alpha^2, \sigma_\beta^2, \sigma_\varepsilon^2)$ which ensures that $E|\alpha_t| < 1$, $0 < \sigma_\beta^2 < \infty$ and $0 < \sigma_\varepsilon^2 < \infty$. Then $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ converges
e.a.s. to a limit SE sequence $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ for any initialization $x_1 \in \mathbb{R}$.
Proof.
Condition A1 of Bougerol's Theorem holds trivially since the innovations $\{\alpha_t\}_{t\in\mathbb{N}}$, $\{\beta_t\}_{t\in\mathbb{N}}$
and $\{\varepsilon_t\}_{t\in\mathbb{N}}$ are iid, and $\{z_t\}_{t\in\mathbb{N}}$ is SE.
Condition A2 of Bougerol's Theorem holds because for any given value $x_1 \in \mathbb{R}$ we have

$E \log^+ |\alpha_t x_1 + \beta_t z_t + \varepsilon_t| \;\le\; E |\alpha_t x_1 + \beta_t z_t + \varepsilon_t| \;\le\; |x_1|\, E|\alpha_t| + E|\beta_t z_t| + E|\varepsilon_t| \;<\; \infty .$

The second inequality holds by the sub-additivity of the absolute value. The term $|x_1|$ is a
bounded constant in $\mathbb{R}$. $E|\alpha_t| < 1 < \infty$ holds by assumption. $E|\beta_t z_t| < \infty$ holds by the
generalized Hölder inequality because $\{\beta_t\}$ is NID with bounded variance (and hence has
bounded moments of any order, i.e. $E|\beta_t|^q < \infty$ for every $q > 0$) and $E|z_t|^{1+\delta} < \infty$ by assumption.
Finally, $E|\varepsilon_t| < \infty$ because $\{\varepsilon_t\}$ is NID with finite variance and hence has bounded moments
of any order.
Condition A3 of Bougerol's Theorem holds since the derivative of the update with respect to $x$ is $\alpha_t$, and hence,

$E \log \sup_{x \in \mathcal{X}} |\alpha_t| < 0 \iff E \log |\alpha_t| < 0 ,$

which indeed holds because, by Jensen's inequality, $E \log |\alpha_t| \le \log E|\alpha_t| < \log 1 = 0$.
Notice that the Power-$n$ Theorem could be applied to show that $\{x_t\}_{t\in\mathbb{N}}$ has
bounded moments of a certain order. This is left for the reader to explore as an
exercise at the end of this chapter. Note also that we could have explicitly stated a
DGP for $\{z_t\}_{t\in\mathbb{Z}}$. For example, $\{z_t\}_{t\in\mathbb{Z}}$ could be generated exogenously by a logistic
SESTAR model. Bougerol's Theorem and the Power-$n$ Theorem could then be applied
to show that $\{z_t\}_{t\in\mathbb{Z}}$ is SE and that it has the required bounded moments. The last
exercise of this chapter explores this possibility!
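The e.a.s. convergence stated in Proposition 9 can also be illustrated by simulation. The following Python sketch, which uses hypothetical parameter values chosen so that $E|\alpha_t| < 1$, generates the Gaussian RC-ADL recursion twice from two different initializations but with the same draws of $(\alpha_t, \beta_t, z_t, \varepsilon_t)$; the two paths collapse onto each other exponentially fast.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 300
a, b = 0.3, 1.0                          # hypothetical means of alpha_t and beta_t
s_alpha, s_beta, s_eps = 0.2, 0.5, 1.0   # hypothetical standard deviations (E|alpha_t| < 1 holds)

alpha = rng.normal(a, s_alpha, T)        # random coefficients and innovations, drawn once
beta  = rng.normal(b, s_beta, T)
eps   = rng.normal(0.0, s_eps, T)
z = np.empty(T)                          # exogenous SE sequence: a Gaussian AR(1) (hypothetical)
z[0] = 0.0
for t in range(T - 1):
    z[t + 1] = 0.6 * z[t] + rng.standard_normal()

def path(x1):
    """x_{t+1} = alpha_t * x_t + beta_t * z_t + eps_t, started at x_1."""
    x = np.empty(T)
    x[0] = x1
    for t in range(T - 1):
        x[t + 1] = alpha[t] * x[t] + beta[t] * z[t] + eps[t]
    return x

# the gap between two initializations equals (x1 - x1') * prod_{s<t} alpha_s,
# which vanishes e.a.s. because E log|alpha_t| <= log E|alpha_t| < 0
d = np.abs(path(-20.0) - path(20.0))
print(d[[0, 5, 20, 50]])
```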
4.4 Exercises
$\sigma_{t+1}^2 = \omega + \alpha x_t^2 + \beta \sigma_t^2 ,$

$x_t = \varepsilon_t + 0.2\, \varepsilon_{t-1} + 3\, \varepsilon_{t-2} , \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim TID(\nu)$

(b) Fat-tailed MA(2):15

$x_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} , \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim TID(\nu) .$

15 $TID(\nu)$ stands for Student-t independently distributed; i.e. elements of the sequence are independently identically distributed with Student's t distribution with $\nu$ degrees of freedom.
(d) ARMA(1,1):16

$x_t = 0.3\, x_{t-1} + \varepsilon_t + 2\, \varepsilon_{t-1} , \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim WN(0, \sigma^2) .$
(e) Random-coefficient Gaussian MA(1):

$x_t = \varepsilon_t + \alpha_t \varepsilon_{t-1} , \qquad \alpha_t \sim N(\mu_\alpha, \sigma_\alpha^2) , \quad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim NID(0, \sigma_\varepsilon^2) .$
$x_t = \alpha_t x_{t-1} + \varepsilon_t , \qquad \{\alpha_t\}_{t\in\mathbb{Z}} \sim TID(3) , \quad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim NID(0, \sigma^2) ,$

$1 + \exp(0.13 + 24\, z_{t-1}) , \qquad z_t = 0.85\, z_{t-1} + v_t \quad \forall\, t \in \mathbb{Z} , \quad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim NID(0, \sigma^2) ,$

$1 + \exp(2 + 1.4\, x_{t-1}) \quad \forall\, t \in \mathbb{Z} , \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim NID(0, \sigma^2) ,$

$1 + \exp\!\big(0.5 + 7.2\, (x_{t-1})^2\big) \quad \forall\, t \in \mathbb{Z} .$
16 $WN(0, \sigma^2)$ stands for white noise with mean zero and variance $\sigma^2$.
17 $UID(0, 1.5)$ stands for uniformly independently distributed; i.e. elements of the sequence are
independently identically distributed with uniform distribution between 0 and 1.5.
$\{\varepsilon_t\} \sim NID(0, 3) ,$
$\mu_t = 5 + 0.9\, \mu_{t-1} + v_t ,$
$x_t = \sigma_t \varepsilon_t ,$
$\sigma_t^2 = \omega + \beta\, \sigma_{t-1}^2 + v_t , \qquad \{v_t\} \sim TID(\nu) .$
$x_t = \sigma_t \varepsilon_t , \qquad \{\varepsilon_t\} \sim NID(0, 1) ,$
$\sigma_t^2 = \omega + \alpha\, x_{t-1}^2 + \beta\, \sigma_{t-1}^2 .$
$\{\varepsilon_t\} \sim TID(7) , \qquad \sigma_t^2 = 5 + 0.2 \tanh(x_t^2) + 0.4\, \sigma_{t-1}^2 .$
(q) NGARCH:

$x_t = \sigma_t \varepsilon_t , \qquad \{\varepsilon_t\} \sim NID(0, 1) ,$
$\sigma_t^2 = 0.5 + 0.1\, (x_{t-1} - \gamma\, \sigma_{t-1})^2 + 0.9\, \sigma_{t-1}^2 .$
(r) QGARCH:

$x_t = \sigma_t \varepsilon_t , \qquad \{\varepsilon_t\} \sim NID(0, 1) ,$
$\sigma_t^2 = 0.5 + 0.1\, x_{t-1}^2 + 0.2\, x_{t-1} + 0.3\, \sigma_{t-1}^2 .$
10. Which of the DGPs in the previous question generate time series $\{x_t\}$ that
satisfy a law of large numbers? Which DGPs generate time series $\{x_t\}$ that
satisfy a central limit theorem? Which DGPs generate a time series $\{x_t\}$ with
bounded fourth moment?
$x_t = \sum_{j=0}^{\infty} \psi_j\, z_{t-j} + v_t ,$

where,

a. $\psi_0 = 1$ and $\sum_{j=0}^{\infty} \psi_j^2 < \infty ,$
$\varepsilon_t \sim p_\varepsilon(\cdot) ,$

(b) NLAR(1)

$x_t = \phi(x_{t-1}, \varepsilon_t; \theta) , \qquad \varepsilon_t \sim p_\varepsilon(\cdot) ,$

(c) NLARMA(p,q)

$x_t = \phi(x_{t-1}, \ldots, x_{t-p}, \varepsilon_t, \ldots, \varepsilon_{t-q}; \theta) , \qquad \varepsilon_t \sim p_\varepsilon(\cdot) ,$

$\varepsilon_t \sim p_\varepsilon(\cdot) , \qquad \lambda_t = \phi(\lambda_{t-1}, v_t; \theta) , \quad v_t \sim p_v(\cdot) ,$

$\varepsilon_t \sim p_\varepsilon(\cdot) , \qquad \lambda_t = \phi(\lambda_{t-1}, x_{t-1}; \theta) .$
13. Find a model that satisfies Bougerol's contraction (condition A3) of Theorem
6 only on a degenerate set.
14. Find a model that never satisfies Bougerol's contraction (condition A3) of Theorem 6.
15. Find a model that does not satisfy the moment bound (condition A2) of Theorem 6.
16. Consider the RC-ADL model analyzed in Proposition 9. Give sufficient conditions for $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ to have two bounded moments.
17. Consider the following system

$x_{t+1} = \alpha_t x_t + \beta_t z_t + \varepsilon_t \quad \forall\, t \in \mathbb{N} ,$

where

$\alpha_{t+1} = 0.97\, \alpha_t + v_t \quad \forall\, t \in \mathbb{N} ,$
$z_t = 0.9\, z_{t-1} + w_t + 2.3\, w_{t-1} \quad \forall\, t \in \mathbb{Z} .$

Can you show that $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ is SE and has two bounded moments?