Applied Time Series Analysis
SS 2018
February 12, 2018
1 INTRODUCTION
1.1 PURPOSE
1.2 EXAMPLES
1.3 GOALS IN TIME SERIES ANALYSIS
2 MATHEMATICAL CONCEPTS
3 TIME SERIES IN R
4 DESCRIPTIVE ANALYSIS
5 STATIONARY TIME SERIES MODELS
6 SARIMA AND GARCH MODELS
7 TIME SERIES REGRESSION
8 FORECASTING
9 MULTIVARIATE TIME SERIES ANALYSIS
1 Introduction
1.1 Purpose
Time series data, i.e. records which are measured sequentially over time, are extremely common. They arise in virtually every field of application, for example:
Business
Sales figures, production numbers, customer frequencies, ...
Economics
Stock prices, exchange rates, interest rates, ...
Official Statistics
Census data, personal expenditures, road casualties, ...
Natural Sciences
Population sizes, sunspot activity, chemical process data, ...
Environmetrics
Precipitation, temperature or pollution recordings, ...
Once a good model is found and fitted to data, the analyst can use that model to
forecast future values and produce prediction intervals, or he can generate
simulations, for example to guide planning decisions. Moreover, fitted models are
used as a basis for statistical tests: they allow determining whether fluctuations in
monthly sales provide evidence of some underlying change, or whether they are
still within the range of usual random variation.
The dominant features of many time series are trend and seasonal variation. These can either be modeled deterministically by mathematical functions of time, or estimated using non-parametric smoothing approaches. Yet another key
feature of most time series is that adjacent observations tend to be correlated, i.e.
serially dependent. Much of the methodology in time series analysis is aimed at
explaining this correlation using appropriate statistical models.
While the theory on mathematically oriented time series analysis is vast and may
be studied without necessarily fitting any models to data, the focus of our course
will be applied and directed towards data analysis. We study some basic
properties of time series processes and models, but mostly focus on how to
visualize and describe time series data, on how to fit models to data correctly, on
how to generate forecasts, and on how to adequately draw conclusions from the
output that was produced.
1.2 Examples
> data(AirPassengers)
> AirPassengers
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
One of the most important steps in time series analysis is to visualize the data, i.e.
create a time series plot, where the air passenger bookings are plotted versus the
time of booking. For a time series object, this can be done very simply in R, using
the generic plot function:
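For instance (the axis label and title are our choices, matching the figure below):

> plot(AirPassengers, ylab="Pax", main="Passenger Bookings")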
The result is displayed below. There are a number of features in the
plot which are common to many time series. For example, it is apparent that the
number of passengers travelling on the airline is increasing with time. In general, a
systematic change in the mean level of a time series that does not appear to be
periodic is known as a trend. The simplest model for a trend is a linear increase or
decrease, an often adequate approximation. We will discuss how to estimate
trends, and how to decompose time series into trend and other components in
section 4.3.
The data also show a repeating pattern within each year, i.e. in summer, there are
always more passengers than in winter. This is known as a seasonal effect, or
seasonality. Please note that this term is applied more generally to any repeating
pattern over a fixed period, such as for example restaurant bookings on different
days of week.
[Figure: time series plot "Passenger Bookings" (y-axis: Pax)]
We can naturally attribute the increasing trend of the series to causes such as
rising prosperity, greater availability of aircraft, cheaper flights and increasing
population. The seasonal variation coincides strongly with vacation periods. For
this reason, we here consider both trend and seasonal variation as deterministic components.
> data(lynx)
> plot(lynx, ylab="# of Lynx Trapped", main="Lynx Trappings")
The plot below shows that the number of trapped lynx reaches high and low values about every 10 years, and some even larger figures about every 40 years. While biologists often approach such data with predator-prey models, we here focus on the analysis of the time signal only. This suggests that the prominent periodicity is to be interpreted as random rather than deterministic.
[Figure: time series plot "Lynx Trappings" (y-axis: # of Lynx Trapped)]
This leads us to the heart of time series analysis: while understanding and
modeling trend and seasonal variation is a very important aspect, much of the time
series methodology is aimed at stationary series, i.e. data which do not show
deterministic, but only random (cyclic) variation.
> data(lh)
> lh
Time Series:
Start = 1; End = 48; Frequency = 1
[1] 2.4 2.4 2.4 2.2 2.1 1.5 2.3 2.3 2.5 2.0 1.9 1.7 2.2 1.8
[15] 3.2 3.2 2.7 2.2 2.2 1.9 1.9 1.8 2.7 3.0 2.3 2.0 2.0 2.9
[29] 2.9 2.7 2.7 2.3 2.6 2.4 1.8 1.7 1.5 1.4 2.1 3.3 3.5 3.5
[43] 3.1 2.6 2.1 3.4 3.0 2.9
Again, the data themselves are of course needed to perform analyses, but provide
little overview. We can improve this by generating a time series plot:
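For instance (labels are our choice, matching the figure below):

> plot(lh, ylab="LH level", main="Luteinizing Hormone")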
[Figure: time series plot "Luteinizing Hormone" (y-axis: LH level, x-axis: Time)]
For this series, given the way the measurements were made (i.e. 10 minute
intervals), we can almost certainly exclude any deterministic seasonal pattern. But
is there any stochastic cyclic behavior? This question is more difficult to answer.
Normally, one resorts to the simpler question of analyzing the correlation of
subsequent records, called autocorrelations. The autocorrelation for lag 1 can be
visualized by producing a scatterplot of adjacent observations:
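A sketch of such a plot (the exact plotting arguments are an assumption):

> plot(lh[1:47], lh[2:48])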
[Figure: scatterplot of lh[2:48] versus lh[1:47]]
Its value of 0.58 is an estimate for the so-called autocorrelation coefficient at lag 1.
As we will see in section 4.4, the idea of considering lagged scatterplots and
computing Pearson correlation coefficients serves as a good proxy for a
mathematically more sound method. We also note that despite the positive
correlation of +0.58, the series seems to always have the possibility of “reverting to
the other side of the mean”, a property which is common to stationary series – an
issue that will be discussed in section 2.2.
> data(EuStockMarkets)
> EuStockMarkets
Time Series:
Start = c(1991, 130)
End = c(1998, 169)
Frequency = 260
DAX SMI CAC FTSE
1991.496 1628.75 1678.1 1772.8 2443.6
1991.500 1613.63 1688.5 1750.5 2460.2
1991.504 1606.51 1678.6 1718.0 2448.2
1991.508 1621.04 1684.1 1708.1 2470.4
1991.512 1618.16 1686.6 1723.1 2484.7
1991.515 1610.61 1671.6 1714.3 2466.8
Because subsetting from a multiple time series object results in a vector, but not a time series object, we need to regenerate a time series object that shares the arguments of the original. In the plot we clearly observe that the series has a trend, i.e. the mean is obviously non-constant over time. This is typical for financial time series. Such trends are nearly impossible to predict, and difficult to characterize mathematically. We will not embark on this, but analyze the so-called log-returns, i.e. the day-to-day changes after a log-transformation of the series:
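A sketch of the necessary steps (the variable names are assumptions, chosen to match the figure below):

> smi <- ts(EuStockMarkets[, "SMI"], start=start(EuStockMarkets), frequency=frequency(EuStockMarkets))
> lret.smi <- diff(log(smi))
> plot(lret.smi, main="SMI Log-Returns")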
[Figure: time series plot "SMI Log-Returns" (y-axis: lret.smi)]
However, it is visible that large changes, i.e. log-returns with high absolute values, tend to be followed by further large changes. This feature is known as volatility clustering, and financial service providers are trying their best to exploit this property to make profit. Again, you can convince yourself of the volatility clustering effect by taking the squared log-returns and analyzing their serial correlation, which is different from zero.
1.3.2 Modeling
The formulation of a stochastic model, as it is for example also done in regression,
can and does often lead to a deeper understanding of the series. The formulation
of a suitable model usually arises from a mixture between background knowledge
in the applied field, and insight from exploratory analysis. Once a suitable model is
found, a central issue remains, i.e. the estimation of the parameters, and
subsequent model diagnostics and evaluation.
1.3.3 Forecasting
An often-heard motivation for time series analysis is the prediction of future
observations in the series. This is an ambitious goal, because time series
forecasting relies on extrapolation, and is generally based on the assumption that
past and present characteristics of the series continue. It seems obvious that good
forecasting results require a very good comprehension of a series’ properties, be it
in a more descriptive sense, or in the sense of a fitted model.
2 Mathematical Concepts
For performing anything other than very basic exploratory time series analysis, even from a strongly applied perspective, it is necessary to introduce the mathematical notion of what a time series is, and to study some basic probabilistic properties, namely the moments and the concept of stationarity.
An observed time series, on the other hand, is seen as a realization of the random vector $X = (X_1, X_2, \ldots, X_n)$, and is denoted with small letters $x = (x_1, x_2, \ldots, x_n)$. It is important to note that in a multivariate sense, a time series is only one single realization of the $n$-dimensional random variable $X$, with its multivariate, $n$-dimensional distribution function $F_{1:n}$. As we all know, we cannot do statistics with just a single observation. As a way out of this situation, we need to impose some conditions on the joint distribution function $F_{1:n}$.
2.2 Stationarity
The aforementioned condition on the joint distribution $F_{1:n}$ will be formulated as the concept of stationarity. In colloquial language, stationarity means that the probabilistic character of the series must not change over time, i.e. that any section of the time series is "typical" for every other section of the same length. More mathematically, we require that for any indices $s, t$ and $k$, the observations $x_t, \ldots, x_{t+k}$ could just as easily have occurred at times $s, \ldots, s+k$. If that is practically not the case, then the series is hardly stationary.
Expectation $\mu = E[X_t]$,
Variance $\sigma^2 = Var(X_t)$,
Covariance $\gamma(h) = Cov(X_t, X_{t+h})$.
Remarks:
When we analyze time series data, we need to verify whether it might have
arisen from a stationary process or not. Be careful with the wording:
stationarity is always a property of the process, and never of the data.
However, we will not discuss any of these tests here, for a variety of reasons. First and foremost, they all focus on some very specific non-stationarity aspects, but do not test stationarity in a broad sense. While they may reasonably do their job in the narrow field they are aimed at, they have low power to detect general non-stationarity and in practice often fail to do so. Additionally, the theory and formalism of these tests are quite complex and thus beyond the scope of this course. In summary, these tests are to be seen more as a pastime for the mathematically interested than as a useful tool for the practitioner.
3 Time Series in R
3.1 Time Series Classes
In R, objects are organized in a large number of classes. These classes e.g. include vectors, data frames, model output, functions, and many more. Not surprisingly, there are also several classes for time series. We start by presenting ts, the basic class for regularly spaced time series. This class is comparably simple, as it can only represent time series with fixed-interval records, and only uses numeric time stamps, i.e. it enumerates the index set. However, it will be sufficient for most, if not all, of what we do in this course. Then, we also provide an outlook to more complicated concepts.
Example: We here consider a simple and short series that holds the number of
days per year with traffic holdups in front of the Gotthard road tunnel north
entrance in Switzerland. The data are available from the Federal Roads Office.
2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
88 76 112 109 91 98 139 150 168 149
The start of this series is in 2004. The time unit is years, and since we have just
one record per year, the frequency of this series is 1. This tells us that while there
may be a trend, there cannot be a seasonal effect, as the latter can only be
present in periodic series, i.e. series with frequency > 1. We now define a ts object in R.
> rawdat <- c(88, 76, 112, 109, 91, 98, 139, 150, 168, 149)
> ts.dat <- ts(rawdat, start=2004, freq=1)
> ts.dat
Time Series: Start = 2004, End = 2013
Frequency = 1
[1] 88 76 112 109 91 98 139 150 168 149
There are a number of simple but useful functions that extract basic information
from objects of class ts, see the following examples:
> start(ts.dat)
[1] 2004 1
> end(ts.dat)
[1] 2013 1
> frequency(ts.dat)
[1] 1
> deltat(ts.dat)
[1] 1
Another possibility is to obtain the measurement times from a time series object.
As class ts only enumerates the times, they are given as fractions. This can still
be very useful for specialized plots, etc.
> time(ts.dat)
Time Series:
Start = 2004
End = 2013
Frequency = 1
[1] 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
The next basic, but for practical purposes very useful function is window(). It is aimed at selecting a subset from a time series. Of course, regular R subsetting such as ts.dat[2:5] also works with the time series class. However, this results in a vector rather than a time series object, and is thus mostly of less use than the window() command.
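For instance, extracting the years 2006 to 2010:

> window(ts.dat, start=2006, end=2010)
Time Series:
Start = 2006
End = 2010
Frequency = 1
[1] 112 109  91  98 139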
While we here presented the most important basic methods/functions for class ts,
there is a wealth of further ones. This includes the plot() function, and many
more, e.g. for estimating trends, seasonal effects and dependency structure, for
fitting time series models and generating forecasts. We will present them in the
forthcoming chapters of this scriptum.
To conclude the previous example, we show the time series plot of the Gotthard road tunnel traffic holdup days, see below. Because there is only a limited number of observations, it is difficult to make statements regarding a possible trend and/or stochastic dependency.
[Figure: time series plot "Traffic Holdups" (y-axis: # of Days)]
Some further packages which contain classes and methods for time series include
xts, its, tseries, fts, timeSeries and tis. Additional information on their
content and philosophy can be found on CRAN.
As a general rule for date/time data in R, we suggest to use the simplest technique
possible. Thus, for date only data, as.Date() will mostly be the optimal choice. If
handling dates and times, but without time-zone information, is required, the
chron package is the choice. The POSIX classes are especially useful in the
relatively rare cases when time-zone manipulation is important.
Apart from the POSIXlt class, dates/times are internally stored as the number of days or seconds from some reference date. These dates/times thus generally have a numeric mode. The POSIXlt class, on the other hand, stores date/time values as a list of components (hour, min, sec, mon, etc.), making it easy to extract these parts. The current date is accessible by typing Sys.Date() in the console, which returns an object of class Date.
> as.Date("2012-02-14")
[1] "2012-02-14"
> as.Date("2012/02/07")
[1] "2012-02-07"
Code Value
%d Day of the month (decimal number)
%m Month (decimal number)
%b Month (character, abbreviated)
%B Month (character, full name)
%y Year (decimal, two digit)
%Y Year (decimal, four digit)
Internally, Date objects are stored as the number of days that have passed since the 1st of January 1970. Earlier dates receive negative numbers. By using the as.numeric() function, we can easily find out how many days have passed since the reference date. Back-conversion from a number of days to a date is also straightforward:
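For instance (we assume here that mydat holds the date from the example above):

> mydat <- as.Date("2012-02-14")
> as.numeric(mydat)
[1] 15384
> as.Date(15384, origin="1970-01-01")
[1] "2012-02-14"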
> weekdays(mydat)
[1] "Dienstag"
> months(mydat)
[1] "Februar"
> quarters(mydat)
[1] "Q1"
Furthermore, some very useful summary statistics can be generated from Date
objects: median, mean, min, max, range, ... are all available. We can even
subtract two dates, which results in a difftime object, i.e. the time difference in
days.
> min(dat)
[1] "2000-01-01"
> max(dat)
[1] "2007-08-09"
> mean(dat)
[1] "2003-12-15"
> median(dat)
[1] "2004-04-04"
> dat[3]-dat[1]
Time difference of 2777 days
The by argument of seq() proves to be very useful for date sequences. We can supply various units of time, and even place an integer in front of them. This allows creating a sequence of dates separated by, for example, two weeks:
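For instance (the start date is arbitrary):

> seq(as.Date("2000-01-01"), by="2 weeks", length.out=4)
[1] "2000-01-01" "2000-01-15" "2000-01-29" "2000-02-12"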
> library(chron)
> dat <- c("2007-06-09 16:43:20", "2007-08-29 07:22:40",
"2007-10-21 16:48:40", "2007-12-17 11:18:50")
> dts <- substr(dat, 1, 10)
> tme <- substr(dat, 12, 19)
> fmt <- c("y-m-d","h:m:s")
> cdt <- chron(dates=dts, time=tme, format=fmt)
> cdt
[1] (07-06-09 16:43:20) (07-08-29 07:22:40)
[3] (07-10-21 16:48:40) (07-12-17 11:18:50)
As before, we can again use the entire palette of summary statistic functions. Of
some special interest are time differences, which can now be obtained as either
fraction of days, or in weeks, hours, minutes, seconds, etc.:
> cdt[2]-cdt[1]
Time in days:
[1] 80.61065
> difftime(cdt[2], cdt[1], units="secs")
Time difference of 6964760 secs
As explained above, the POSIXct class also stores dates/times with respect to the
internal reference, whereas the POSIXlt class stores them as a list of
components (hour, min, sec, mon, etc.), making it easy to extract these parts.
The most common form for sharing time series data is certainly spreadsheets, or in particular, Microsoft Excel files. While library(RODBC) offers functionality to directly import data from Excel files, we discourage its use. First of all, this only works on Windows systems. More importantly, it is usually simpler, quicker and more flexible to export comma- or tab-separated text files from Excel, and import them via the ubiquitous read.table() function, respectively its tailored versions read.csv() (for comma separation) and read.delim() (for tab separation).
With the packages RODBC and RMySQL, R can also communicate with SQL databases, which is the method of choice for large scale problems. Furthermore, after loading library(foreign), it is also possible to read files from Stata, SPSS, Octave and SAS.
4 Descriptive Analysis
As always when working with data, i.e. “a pile of numbers”, it is important to gain
an overview. In time series analysis, this encompasses several aspects:
4.1 Visualization
Another issue is the correct aspect ratio for time series plots: if the time axis gets
too much compressed, it can become difficult to recognize the behavior of a
series. Thus, we recommend choosing the aspect ratio appropriately. However,
there are no hard and simple rules on how to do this. As a rule of thumb, use the "banking to 45 degrees" paradigm: increases and decreases in periodic series should not be displayed at angles much steeper or flatter than 45 degrees. For very long series, this can become difficult on either A4 paper or a computer screen. In this case, we recommend splitting up the series and displaying it in different frames.
For illustration, we here show an example, the monthly unemployment rate in the
US state of Maine, from January 1996 until August 2006. The data are available
from a text file on the web. We can read it directly into R, define the data as an
object of class ts and then do the time series plot:
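A sketch of these steps (the file and column names are assumptions; in the original, the file is read directly from the web):

> dat <- read.table("Maine.dat", header=TRUE)
> maine <- ts(dat$unemploy, start=c(1996, 1), freq=12)
> plot(maine, ylab="(%)", main="Unemployment in Maine")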
[Figure: time series plot "Unemployment in Maine" (y-axis: (%))]
Not surprisingly for monthly economic data, the series shows both a non-linear
trend and a seasonal pattern that increases with the level of the series. Since
unemployment rates are one of the main economic indicators used by
politicians/decision makers, this series poses a worthwhile forecasting problem.
All three series show a distinct seasonal pattern, along with a trend. It is also
instructive to know that the Australian population increased by a factor of 1.8
during the period where these three series were observed. As visible in the bit of
code above, plotting multiple series into different panels is straightforward. As a
general rule, using different frames for multiple series is the most recommended
means of visualization. However, sometimes it can be more instructive to have
them in the same frame. Of course, this requires that the series are either on the
same scale, or have been indexed, resp. standardized to be so. Then, we can
simply use plot(ind.tsd, plot.type="single"). When working with one
single panel, we recommend to use different colors for the series, which is easily
possible using a col=c("green3", "red3", "blue3") argument.
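A sketch of how such an indexed single-frame plot can be produced (we assume the three monthly series are stored in a data frame dat with columns choc, beer and elec, starting in 1958):

## index each series to 100 at its first observation
ind <- sweep(as.matrix(dat), 2, unlist(dat[1, ]), "/") * 100
ind.tsd <- ts(ind, start=1958, frequency=12)
clr <- c("green3", "red3", "blue3")
plot(ind.tsd, plot.type="single", col=clr, ylab="Index")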
## Legend
ltxt <- names(dat)
legend("topleft", lty=1, col=clr, legend=ltxt)
[Figure: indexed single-frame plot of the choc, beer and elec series (y-axis: Index)]
In the indexed single frame plot above, we can very well judge the relative
development of the series over time. Due to different scaling, this was nearly
impossible with the multiple frames on the previous page. We observe that
electricity production increased around 8x between 1958 and 1990, whereas for
chocolate the multiplier is around 4x, and for beer less than 2x. Also, the seasonal
variation is most pronounced for chocolate, followed by electricity and then beer.
4.2 Transformations
Time series data do not necessarily need to be analyzed in the form they were
provided to us. In many cases, it is much better, more efficient and instructive to
transform the data. We will here highlight several cases and discuss their impact
on the results.
> library(TSA)
> plot(milk, xlab="Year", ylab="pounds", main="Monthly …")
> abline(v=1994:2006, col="grey", lty=3)
> lines(milk, lwd=1.5)
4.2.3 Log-Transformation
Many popular time series models and estimators (i.e. the usual ones for mean,
variance and correlation) are based and most efficient in case of Gaussian
distribution and additive, linear relations. However, data may exhibit different
behavior. In such cases, we can often improve results by working with transformed values $g(x_1), \ldots, g(x_n)$ rather than the original data $x_1, \ldots, x_n$. The most popular and practically most relevant transformation is the logarithm.
[Figure: histogram (y-axis: Frequency) and normal quantile plot (y-axis: Sample Quantiles) of the lynx data]
The lynx data are positive, on a relative scale and strongly right-skewed. Hence, a
log-transformation proves beneficial. Implementing the transformation is easy in R:
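For instance:

> plot(log(lynx), ylab="log(# of Lynx Trapped)", main="Logged Lynx Trappings")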
The data now follow a more symmetrical pattern; the extreme upward spikes are all gone. A further consequence of the transformation is that model-based forecasts have to be back-transformed to the original scale by taking exp() of both the point and interval predictions. This ensures that no negative values appear, which is often highly desirable for time series that are positive.
4.3 Decomposition
The classical additive decomposition model is

$X_t = m_t + s_t + R_t$,

where $X_t$ is the time series process at time $t$, $m_t$ is the trend, $s_t$ is the seasonal effect, and $R_t$ is the remainder, i.e. a sequence of usually correlated random variables with mean zero. The goal is to find a decomposition such that $R_t$ is a stationary time series. Such a model might be suitable for all the monthly-data series we got acquainted with so far: air passenger bookings, unemployment in Maine and Australian production. However, closer inspection of all these series exhibits that the seasonal effect and the random variation increase as the trend increases. In such cases, a multiplicative decomposition model is better:

$X_t = m_t \cdot s_t \cdot R_t$
Empirical experience says that taking logarithms is beneficial for such data. Also, some basic math shows that this brings us back to the additive case: $\log(X_t) = \log(m_t) + \log(s_t) + \log(R_t) = m_t' + s_t' + R_t'$. For illustration, we carry out the log-transformation on the air passenger bookings:
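For instance:

> plot(log(AirPassengers), ylab="log(Pax)", main="Logged Passenger Bookings")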
The plot on the next page shows that indeed, seasonal effect and random
variation now seem to be independent of the level of the series. Thus, the
multiplicative model is much more appropriate than the additive one. However, a
further snag is that the seasonal effect seems to alter over time rather than being
constant. In earlier years, a prominent secondary peak is apparent. Over time, this
erodes away, but on the other hand, the summer peak seems to be ever rising.
The issue of how to deal with evolving seasonal effects will be addressed later in
chapter 4.3.4.
4.3.2 Differencing
A simple approach for removing deterministic trends and/or seasonal effects from
a time series is by taking differences. A practical interpretation of taking
differences is that by doing so, the changes in the data will be monitored, but no
longer the series itself. While this is conceptually simple and quick to implement,
the main disadvantage is that it does not result in explicit estimates of the trend component $m_t$, the seasonal component $s_t$ or the remainder $R_t$.
We will first turn our attention to series with an additive trend, but without seasonal
variation. By taking first-order differences with lag 1, and assuming a trend with little short-term change, i.e. $m_t \approx m_{t-1}$, we have:

$X_t = m_t + R_t$
$Y_t = X_t - X_{t-1} \approx R_t - R_{t-1}$
Note that even if $R_t$ is an uncorrelated (e.g. iid) series, the differenced series is serially correlated:

$Cov(Y_t, Y_{t-1}) = Cov(R_t - R_{t-1}, R_{t-1} - R_{t-2}) = -Cov(R_{t-1}, R_{t-1}) \neq 0$
We illustrate how differencing works by using a dataset that shows the traffic
development on Swiss roads. The data are available from the federal road office
(ASTRA) and show the indexed traffic amount from 1990-2010. We type in the
values and plot the original series:
There is a clear trend, which is close to linear, thus the simple approach should
work well here. Taking first-order differences with lag 1 shows the yearly changes
in the Swiss Traffic Index, which must now be a stationary series. In R, the job is
done with function diff().
> diff(SwissTraffic)
Time Series:
Start = 1991
End = 2010
Frequency = 1
[1] 2.7 1.5 0.4 2.1 0.2 0.7 2.3 2.1 2.3 3.1
[11] 0.9 2.6 2.8 0.4 0.5 1.0 2.3 -0.5 2.8 1.1
Please note that the time series of differences is now one observation shorter than the original series. The reason is that for the first year, 1990, there is no difference to
the previous year available. The differenced series now seems to have a constant
mean, i.e. the trend was successfully removed.
If a series is log-transformed before the differences are taken, we obtain the so-called log-returns, which approximate the relative changes:

$Y_t = \log(X_t) - \log(X_{t-1}) = \log\left(\frac{X_t}{X_{t-1}}\right) = \log\left(1 + \frac{X_t - X_{t-1}}{X_{t-1}}\right) \approx \frac{X_t - X_{t-1}}{X_{t-1}}$
The approximation of the log-return to the relative change is very good for small changes, and becomes a little less precise for larger values. For example, for a relative change of 0.00% we have $Y_t = 0.00\%$, for a 1.00% relative change we obtain $Y_t = 0.995\%$, and for 5.00% we get $Y_t = 4.88\%$. We conclude by summarizing that for any non-stationary series which also requires a log-transformation, the transformation is always carried out first, and then followed by the differencing!
We now introduce the backshift operator $B$, which is defined by $B(X_t) = X_{t-1}$.
Less mathematically, we can also say that applying B means “go back one step”,
or “increment the time series index t by -1”. The operation of taking first-order
differences at lag 1 as above can be written using the backshift operator:
$Y_t = (1 - B)X_t = X_t - X_{t-1}$
However, the main aim of the backshift operator is to deal with more complicated
forms of differencing, as will be explained below.
Higher-Order Differencing
We have seen that taking first-order differences is able to remove linear trends
from time series. What has differencing to offer for polynomial trends, i.e. quadratic
or cubic ones? We here demonstrate that it is possible to take higher order
differences to remove also these, for example, in the case of a quadratic trend.
$X_t = \beta_1 t + \beta_2 t^2 + R_t$, with $R_t$ stationary

$Y_t = (1-B)^2 X_t = (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) = R_t - 2R_{t-1} + R_{t-2} + 2\beta_2$
We see that the operator $(1-B)^2$ means that after taking "normal" differences, the resulting series is differenced "normally" once more. This is a discretized variant of taking the second derivative, and thus it is not surprising that it manages to remove a quadratic trend from the data. As we can see, $Y_t$ is an additive combination of the stationary $R_t$ terms, and thus itself stationary. Again, if $R_t$ was an independent process, that would clearly not hold for $Y_t$; thus taking higher-order differences (strongly!) alters the dependency structure.
For time series with monthly measurements, seasonal effects are very common.
Using an appropriate form of differencing, it is possible to remove these, as well as
potential trends. We take first-order differences with lag $p$:

$Y_t = (1 - B^p)X_t = X_t - X_{t-p}$,
Here, $p$ is the period of the seasonal effect, or in other words, the frequency of the series, i.e. the number of measurements per time unit. The series $Y_t$ is then made up of the changes compared to the previous period's value, e.g. the previous year's value. Also, from the definition, with the same argument as above, it is evident that not only the seasonal variation, but also a strictly linear trend will be removed.
Usually, trends are not exactly linear. We have seen that taking differences at lag 1 removes slowly evolving (non-linear) trends well, due to $m_t \approx m_{t-1}$. However, here the relevant quantities are $m_t$ and $m_{t-p}$, and especially if the period $p$ is long, some trend will usually remain in the data. Then, further action is required.
Example
We are illustrating seasonal differencing using the Mauna Loa atmospheric CO2
concentration data. This is a time series with monthly records from January 1959
to December 1997. It exhibits both a trend and a distinct seasonal pattern. We first
load the data and do a time series plot:
> data(co2)
> plot(co2, main="Mauna Loa CO2 Concentrations")
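To remove both the seasonal effect and the trend, we can difference at lag 12 and then once more at lag 1; a sketch of the two steps referred to below:

> d.co2 <- diff(co2, lag=12)
> dd.co2 <- diff(d.co2, lag=1)
> plot(dd.co2, main="Twice Differenced Mauna Loa Data")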
The next step would be to analyze the autocorrelation of the series below and fit an ARMA($p,q$) model. Due to the two differencing steps, such constructs are also named SARIMA models. They will be discussed in chapter 6.
We here again consider the Swiss Traffic data that were already exhibited before.
They show the indexed traffic development in Switzerland between 1990 and
2010. Linear filtering is available with function filter() in R. With the correct
settings, this function becomes a running mean estimator.
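The running mean over three consecutive observations, as used for the output below, can be obtained like this (a sketch):

> trend.est <- filter(SwissTraffic, filter=c(1, 1, 1)/3)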
> trend.est
Time Series: Start = 1990, End = 2010, Frequency = 1
[1] NA 102.3000 103.8333 105.1667 106.0667 107.0667
[7] 108.1333 109.8333 112.0667 114.5667 116.6667 118.8667
[13] 120.9667 122.9000 124.1333 124.7667 126.0333 126.9667
[19] 128.5000 129.6333 NA
In our example, we chose the trend estimate to be the mean over three
consecutive observations. This has the consequence that for both the first and the
last instance of the time series, no trend estimate is available. Also, it is apparent
that the Swiss Traffic series has a very strong trend signal, whereas the remaining
stochastic term is comparably small in magnitude. We can now compare the
estimated remainder term from the running mean trend estimation to the result
from differencing:
The blue line is the remainder estimate from the running mean approach, while the grey one resulted from differencing with lag 1. We observe that the latter has bigger variance; and while there are some similarities between the two series, there are also some prominent differences – please note that while both seem stationary, they are different.
We now turn our attention to time series that show both trend and seasonal effect.
The goal is to specify a filtering approach that allows trend estimation for periodic
data. We still base this on the running mean idea, but have to make sure that we
average over a full period. For monthly data, the formula is:
$\hat{m}_t = \frac{1}{12}\left(\frac{1}{2}X_{t-6} + X_{t-5} + \ldots + X_{t+5} + \frac{1}{2}X_{t+6}\right)$, for $t = 7, \ldots, n-6$
With R's function filter() and an appropriate choice of weights, we can compute this seasonal running mean. We illustrate it with the Mauna Loa CO2 data.
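A sketch of the computation, using the weights from the formula above:

> wghts <- c(0.5, rep(1, 11), 0.5)/12
> trend.est <- filter(co2, filter=wghts, sides=2)
> plot(co2, main="Mauna Loa CO2 Concentrations")
> lines(trend.est, col="red")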
We obtain a trend which fits well to the data. It is not a linear trend, rather it seems
to be slightly progressively increasing, and it has a few kinks, too.
We finish this section about trend estimation using linear filters by stating that
other smoothing approaches, e.g. running median estimation, the loess smoother
and many more are valid choices for trend estimation, too.
For fully decomposing periodic series such as the Mauna Loa data, we also need
to estimate the seasonal effect. This is done on the basis of the trend adjusted
data: simple averages over all observations from the same seasonal entity are
taken. The following formula shows the January effect estimation for the Mauna
Loa data, a monthly series which starts in January and has 39 years of data.
$\hat{s}_{Jan} = \hat{s}_1 = \hat{s}_{13} = \ldots = \frac{1}{39}\sum_{j=0}^{38}\left(x_{12j+1} - \hat{m}_{12j+1}\right)$
[Figure: estimated seasonal effects for the Mauna Loa data, plotted against the month]
In the plot above, we observe that during a period, the highest values are usually
observed in May, whereas the seasonal low is in October. The estimate for the
remainder at time t is simply obtained by subtracting estimated trend and
seasonality from the observed value.
$\hat{R}_t = x_t - \hat{m}_t - \hat{s}_t$
From the plot of the estimated remainder, it seems as if it still has some periodicity, and thus it is questionable whether it is stationary. The periodicity is due to the fact that the seasonal effect is not constant but slowly evolving over time. In the beginning, we tend to overestimate it for most months, whereas towards the end, we underestimate it. We will address the issue of how to visualize evolving seasonality below in section 4.3.4 about the STL decomposition. A further option for dealing with non-constant seasonality is given by the exponential smoothing approach, which is covered in chapter 8.
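Rather than carrying out the steps one by one, the entire additive decomposition can also be obtained with R's decompose() function; a minimal sketch:

> co2.dec <- decompose(co2)
> plot(co2.dec)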
Please note that decompose() only works with periodic series where at least two
full periods were observed; else it is not mathematically feasible to estimate trend
and seasonality from a series. The decompose() function also offers a neat plotting method that generates four frames with the series and the estimated trend, seasonality and remainder. Except for the different visualization, the results are exactly the same as what we had produced with our do-it-yourself approach.
We here again consider the Swiss Traffic dataset, for which the trend had already
been estimated above. Our goal is to re-estimate the trend with LOESS, a
smoothing procedure that is based on local, weighted regression. The aim of the
weighting scheme is to reduce potentially disturbing influence of outliers. Applying
the LOESS smoother with (the often optimal) default settings is straightforward:
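For example (a sketch; the fitted values serve as the trend estimate):

> times <- as.numeric(time(SwissTraffic))
> fit.loess <- loess(as.numeric(SwissTraffic) ~ times)
> plot(SwissTraffic, main="Swiss Traffic Index with LOESS Trend")
> lines(times, fitted(fit.loess), col="red")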
We observe that the estimated trend, in contrast to the running mean result, is now smooth and allows for interpolation within the observed time span. Also, the loess() algorithm returns trend estimates which extend to the boundaries of the dataset. In summary, we recommend always performing trend estimation with LOESS.
R's stl() procedure offers a decomposition of a periodic time series into trend, seasonality and remainder. All estimates are based on the LOESS smoother. While the output is (nearly) equivalent to what we had obtained above with decompose(), we recommend using only this procedure, because the results are more trustworthy. We refrain from giving the full details of the procedure here and only state that it works iteratively. The commands are as follows:
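Presumably along these lines, matching the arguments discussed below:

> co2.stl <- stl(co2, s.window="periodic")
> plot(co2.stl)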
[Figure: STL decomposition of the CO2 series (panels: seasonal, trend, remainder; x-axis: time)]
The graphical output is similar to the one from decompose(). The grey bars on the right hand side facilitate interpretation of the decomposition: they show the relative magnitude of the effects, i.e. they cover the same span on the y-scale in all of the frames. The two principal arguments of function stl() are t.window and s.window: t.window controls the amount of smoothing for the trend, and has a default value which often yields good results. The value used can be inferred with:
> co2.stl$win[2]
t
19
The result is the number of lags used as a window for trend extraction in LOESS.
Increasing it means the trend becomes smoother; lowering it makes the trend
rougher, but more adapted to the data. The second argument, s.window, controls
the smoothing for the seasonal effect. When set to "periodic" as above, the
seasonality is obtained as a constant value from simple (monthly) averaging.
Please note that there is no default value for the seasonal span, and the optimal
choice is left to the user upon visual inspection. An excellent means for doing so is
the monthplot() function which shows the seasonal effects that were estimated
by stl().
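A sketch of such a comparison; we assume here that the logged air passenger series shown in the following figures is decomposed with two different seasonal spans:

> ap.stl13 <- stl(log(AirPassengers), s.window=13)
> ap.stl5 <- stl(log(AirPassengers), s.window=5)
> monthplot(ap.stl13, main="Seasonal Effects, s.window=13")
> monthplot(ap.stl5, main="Seasonal Effects, s.window=5")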
The output, which can be found below, shows an appropriate amount of smoothing in the left panel, with s.window=13. On the right, with the smaller span s.window=5, we observe overfitting: the seasonal effects do not evolve in a smooth way, which means that this is not a good decomposition estimate.
[Figure: STL decomposition of the logged air passenger data (panels: data, seasonal, trend, remainder)]
[Figure: monthplots of the estimated seasonal effects with s.window=13 (left) and s.window=5 (right)]
A parametric, regression-based approach also allows for formal significance testing of the trend and the seasonal component, which is often very valuable in exploratory data analysis. Some prudence is required though, due to the potentially correlated residuals in the linear models, an issue which will only be thoroughly discussed in chapter 7. In a second example, we use parametric modeling in a more flexible way, with a smooth trend component and a dummy variable for the seasonal component, which yields results that are close to an STL decomposition or the smoothing approach implemented in R's decompose() procedure.
This example is an excerpt from a joint research project of the lecturer with the Environmental Protection Office of the Swiss canton of Lucerne (UWE Luzern). Part of this project included the analysis of Phosphate levels in the river Suhre, which is
an effluent of Lake Sempach. The time series with 36 monthly measurements over
a period of 3 years is displayed below.
The time series features a prominent seasonal effect and potentially a slight
downward trend. The aims in the project included a decomposition of the series,
as well as statements whether there are trends and seasonalities in the various
pollutants that were analyzed for a large number of rivers in the canton.
As the time series only has 36 observations and there seems to be a considerable amount of (weather, i.e. rainfall or drought induced) noise, using smoothing approaches or STL did not seem promising. These methods unavoidably spend many degrees of freedom, primarily due to simple averaging in the seasonal component. A way out is to set up a parametric decomposition model that is based on a linear trend in time plus a cyclic seasonal component and a remainder.
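A sketch of such a model fit (the object name phos for the Phosphate series and the exact formula are assumptions):

> tnum <- as.numeric(time(phos))
> fit.lm <- lm(phos ~ tnum + sin(2*pi*tnum) + cos(2*pi*tnum))
> summary(fit.lm)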
The coefficients and inference results can be seen from the summary output, but we have to be careful with their interpretation. The error term in the linear model is a stationary, but potentially serially correlated time series $R_t$. If correlation exists, the assumptions of the least squares algorithm are violated. Chapter 7 contains a full exposition of these topics; in short, the coefficients would be unbiased though somewhat inefficiently estimated, whereas the standard errors are biased, so that the derived p-values are not trustworthy. We plot the fit:
The red line shows the fitted values, i.e. the estimated average Phosphate levels.
In blue, the trend function has been added. We can also provide a full
decomposition plot as for example STL provides, but we have to construct it
ourselves.
[Figure: self-constructed decomposition plot of the Phosphate series (top panel: Time Series)]
Despite the fact that a simple, linear trend function and a cyclic sine/cosine seasonality were used, the remainder looks like a stationary series with mean zero. There is no apparent serial correlation among the remainder terms, hence in this situation, we can even rely on the inference results. Please note that while the chosen model is fully adequate for the present situation, being so simplistic and parsimonious is not the correct strategy for all datasets. There is no guarantee that a seasonal component is cyclic, with the air passenger bookings being a prominent counterexample.
We consider the Maine unemployment data from section 4.1.1. Our goal is to fit a
smooth trend, along with a seasonal effect that is obtained from averaging.
Sometimes, polynomial functions are used for modeling the trend function.
However, we recommend to stay away from high-order polynomials due to their
often very erratic behavior at the boundaries (cf. Runge’s Phenomenon), so that
anything beyond a quadratic trend should be avoided. The way out lies in using a
generalized additive model (GAM) with a flexible trend function. The seasonal
effect is included as a factor variable. In mathematical notation, the model is:
$X_t = f(t) + \alpha_{i(t)} + R_t$,

where $t = 1, \ldots, 128$ and $i(t) \in \{1, \ldots, 12\}$, i.e. $i(t)$ is a factor variable encoding the month the observation was made in, see the R code below. Two questions immediately pop up, namely how to determine the smooth trend function $f(\cdot)$, and how to fit the model as a whole. Both can conveniently be done using the gam() function from library(mgcv) in R. Please note that here, we model resp. decompose the logged Maine data, since their variation clearly increases with the level of the series.
> library(mgcv)
> tnum <- as.numeric(time(maine))
> mm <- rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
+ "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
> mm <- factor(rep(mm,11),levels=mm)[1:128]
> fit <- gam(log(maine) ~ s(tnum) + mm)
[Figure: estimated smooth trend component (over Time) and seasonal effects from the GAM fit]
As we can see from the estimated trend and seasonal components, a simple
model using a linear trend or a cyclic seasonal component would not have been
suitable here. Please also note that both the trend component and the seasonal
effect are centered to mean zero here.
Finally, we extract the remainder term. These are just the residuals from the GAM
model, which are readily available and very quickly plotted.
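For instance (resid() extracts the residuals, i.e. the remainder term):

> plot(resid(fit), main="Remainder")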
[Figure: plot of the GAM residuals (title: Remainder, y-axis: resid(fit), x-axis: Index)]
The plot strongly raises the question whether the remainder term can be seen as stationary. It seems as if the behavior over the first 50 observations is markedly different from that over the last two thirds of the series. Moreover, the later observations show a prominent periodicity with a period of roughly 20 observations. Hence, further investigation of these features would certainly be required. However, we conclude our exposition on parametric modeling for time series decomposition at this point.
4.4 Autocorrelation
An important feature of time series is their (potential) serial correlation. This
section aims at analyzing and visualizing these correlations. We first define the autocorrelation between two random variables $X_{t+k}$ and $X_t$:

$Cor(X_{t+k}, X_t) = \frac{Cov(X_{t+k}, X_t)}{\sqrt{Var(X_{t+k}) \cdot Var(X_t)}}$

This is a dimensionless measure for the linear association between the two random variables. Since for stationary series we require the moments to be non-changing over time, we can drop the index $t$ and write the autocorrelation as a function of the lag $k$:

$\rho(k) = Cor(X_{t+k}, X_t)$
The goals in the forthcoming sections are estimating these autocorrelations from
observed time series data, and to study the estimates’ properties. The latter will
prove useful whenever we try to interpret sample autocorrelations in practice.
The example we consider in this chapter is the wave tank data. The values are
wave heights in millimeters relative to still water level measured at the center of
the tank. The sampling interval is 0.1 seconds and there are 396 observations. For
better visualization, we here display the first 60 observations only:
[Figure: time series plot of the first 60 observations of the wave tank data (x-axis: Time)]
These data show some pronounced cyclic behavior. This does not come as a
surprise, as we all know from personal experience that waves do appear in cycles.
The series shows some very clear serial dependence, because the current value
is quite closely linked to the previous and following ones. But very clearly, it is also
a stationary series.
[Figure: lag 1 scatterplot of the wave tank data]
The association seems linear and is positive. The Pearson correlation coefficient turns out to be 0.47, thus moderately strong. How can we interpret this value from a practical viewpoint? The square of the correlation coefficient, $0.47^2 \approx 0.22$, is the proportion of variability explained by the linear association between $x_t$ and its respective predecessor. Here, $x_{t-1}$ explains roughly 22% of the variability observed in $x_t$.
We can of course extend the very same idea to higher lags. We here analyze the lagged scatterplot correlations for lags $k = 2, \ldots, 5$, see below. When computed, the estimated Pearson correlations turn out to be -0.27, -0.50, -0.39 and -0.22, respectively. The formula for computing them is:
$\hat{\rho}(k) = \frac{\sum_{t=1}^{n-k}(x_{t+k} - \bar{x}_{(k)})(x_t - \bar{x}_{(1)})}{\sqrt{\sum_{t=1}^{n-k}(x_{t+k} - \bar{x}_{(k)})^2} \cdot \sqrt{\sum_{t=1}^{n-k}(x_t - \bar{x}_{(1)})^2}}$,

where $\bar{x}_{(1)} = \frac{1}{n-k}\sum_{i=1}^{n-k} x_i$ and $\bar{x}_{(k)} = \frac{1}{n-k}\sum_{i=k+1}^{n} x_i$.
It is important to notice that while there are $n-1$ data pairs for computing $\rho(1)$, there are only $n-2$ for $\rho(2)$, and then fewer and fewer, i.e. $n-k$ pairs for $\rho(k)$. Thus, for the last autocorrelation coefficient which can be estimated, $\rho(n-2)$, only two data pairs are left. Of course, two points can always be interconnected by a straight line, and the correlation in this case is always $\pm 1$. Naturally, this is an estimation snag rather than perfect linear association between the two random variables. Intuitively, it is clear that because there are fewer and fewer data pairs at higher lags, the respective estimated correlations are less and less precise. Indeed, by digging deeper into mathematical statistics, one can prove that the variance of $\hat{\rho}(k)$ increases with $k$. This is undesirable, as it will lead to unstable results and spurious effects. The remedy is discussed in the next section.
[Figure: lagged scatterplots of the wave tank data for lags 2, 3, 4 and 5]
The preferred approach is the so-called plug-in estimate, which is based on the sample autocovariances:

$\hat{\rho}(k) = \frac{\hat{\gamma}(k)}{\hat{\gamma}(0)}$, for $k = 1, \ldots, n-1$,

where $\hat{\gamma}(k) = \frac{1}{n}\sum_{s=1}^{n-k}(x_{s+k} - \bar{x})(x_s - \bar{x})$, with $\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t$.

Note that here, $n$ is used as the denominator irrespective of the lag and thus the number of summands. This has the consequence that $\hat{\gamma}(0)$ is not an unbiased estimator for $\gamma(0) = \sigma_X^2$, but as we will see below, there are good reasons to do so. When plugging in the above terms, the estimate for the $k$-th autocorrelation coefficient turns out to be:

$\hat{\rho}(k) = \frac{\sum_{s=1}^{n-k}(x_{s+k} - \bar{x})(x_s - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2}$, for $k = 1, \ldots, n-1$.
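The estimates below can be obtained with R's acf() function (a sketch, assuming the wave tank series is stored in wave; the default lag.max of about 10*log10(n) yields lags 0 to 25 here):

> acf(wave, plot=FALSE)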
0 1 2 3 4 5 6 7
1.000 0.470 -0.263 -0.499 -0.379 -0.215 -0.038 0.178
8 9 10 11 12 13 14 15
0.269 0.130 -0.074 -0.079 0.029 0.070 0.063 -0.010
16 17 18 19 20 21 22 23
-0.102 -0.125 -0.109 -0.048 0.077 0.165 0.124 0.049
24 25
-0.005 -0.066
Next, we compare the autocorrelations from lagged scatterplot estimation vs. the
ones from the plug-in approach. These are displayed below. While for the first 50
lags, there is not much of a difference, the plug-in estimates are much more
damped for higher lags. As claimed above, the lagged scatterplot estimate shows
a value of 1 for lag 394, and some generally very erratic behavior in the few lags
before.
We finish this chapter by repeating that the bigger the lag, the fewer data pairs remain for estimating the autocorrelation coefficient. We discourage the use of the lagged scatterplot approach. While the preferred plug-in approach is biased due to the built-in damping mechanism, i.e. the estimates for high lags are shrunken towards zero, it can be shown that it has lower mean squared error. This is because it produces results with much less (random) variability. It can also be shown that the plug-in estimates are consistent, i.e. the bias disappears asymptotically.
[Figure: comparison of the lagged scatterplot and plug-in autocorrelation estimates]
Nevertheless, all our findings still suggest that it is a good idea to consider only the first portion of the estimated autocorrelations. A rule of thumb suggests that $10 \cdot \log_{10}(n)$ is a good threshold. For a series with 100 observations, the threshold becomes lag 20. A second rule operates with $n/4$ as the maximum lag to which the autocorrelations are shown.
4.4.3 Correlogram
Now, we know how to estimate the autocorrelation function (ACF) for any lag k .
Here, we introduce the correlogram, the standard means of visualization for the
ACF. We will then also study the properties of the ACF estimator. We employ R
and type (see below for the graphical output):
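For instance (the object name wave for the wave tank series is an assumption):

> acf(wave, ylim=c(-1,1), main="Correlogram of Wave Tank Data")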
It has become a widely accepted standard to use vertical spikes for displaying the
estimated autocorrelations. Also note that the ACF in R by default starts with lag 0,
at which it always takes the value 1. If one does not like the spike at lag 0, one can
alternatively use the Acf() function from library(forecast). For better
judgment, we also recommend setting the $y$-range to the interval $[-1, 1]$. Apart
from these technicalities, the ACF reflects the properties of the series. We also
observe a cyclic behavior with a period of 8, as it is apparent in the time series plot
of the original data. Moreover, the absolute value of the correlations attenuates
with increasing lag. Next, we will discuss the interpretation of the correlogram.
[Figure: correlogram of the wave tank data (x-axis: Lag)]
Confidence Bands
It is obvious that even for an iid series without any serial correlation, and thus $\rho(k) = 0$ for all $k$, the estimated autocorrelations $\hat{\rho}(k)$ will generally not be zero. Hopefully, they will be small, but the question is how much they can differ from zero just by chance. An answer is indicated by the confidence bands, i.e. the blue dashed lines in the plot above. The confidence bands are based on an asymptotic result: for long iid time series, it can be shown that the $\hat{\rho}(k)$ approximately follow a $N(0, 1/n)$ distribution. Thus, under the null hypothesis that a series is iid and hence $\rho(k) = 0$ for all $k$, the 95% acceptance region for the null is given by the interval $\pm 1.96/\sqrt{n}$. This leads us to the following statement that facilitates interpretation of the correlogram:

"for any stationary time series, sample autocorrelation coefficients $\hat{\rho}(k)$ that fall within the confidence band of $\pm 1.96/\sqrt{n}$ are considered to be different from 0 only by chance, while those outside the confidence band are considered to be truly different from 0."
On the other hand, the above statement means that even for iid series, we expect
5% of the estimated ACF coefficients to exceed the confidence bounds; these
correspond to type 1 errors in the statistical testing business. Please note again
that the indicated bounds are asymptotic and derived for iid series. The properties
of serially dependent finite length series are much harder to derive!
Ljung-Box Test
The Ljung-Box approach tests the null hypothesis that a number of autocorrelation
coefficients are simultaneously equal to zero. Or, more colloquially, it evaluates
whether there is any significant autocorrelation in a series. The test statistic is:
$Q(h) = n \cdot (n+2) \cdot \sum_{k=1}^{h}\frac{\hat{\rho}_k^2}{n-k}$
Here, $n$ is the length of the time series, $\hat{\rho}_k$ are the sample autocorrelation coefficients at lag $k$, and $h$ is the lag up to which the test is performed. It is typical to use $h = 1, 3, 5, 10$ or $20$. The test statistic asymptotically follows a $\chi^2$ distribution with $h$ degrees of freedom. As an example, we compute the test statistic and the respective p-value for the wave tank data with $h = 10$.
We observe that $Q(10) = 344.0155$, which is far in excess of what we would expect by chance on independent data. The critical value, i.e. the 95% quantile of the $\chi^2_{10}$ distribution, is at 18.3, and thus the p-value is close to (but not exactly) zero. There is also a
dedicated R function which can be used to perform Ljung-Box testing:
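A sketch of the call (again assuming the series is stored in wave):

> Box.test(wave, lag=10, type="Ljung-Box")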
The result is, of course, identical. Please be aware that the test is sometimes also
referred to as Box-Ljung test. Also R is not very consistent in its nomenclature.
However, the two are one and the same. Moreover, with a bit of experience the
results of the Ljung-Box test can usually be guessed quite well from the
correlogram by eyeballing.
Estimation of the ACF from an observed time series assumes that the underlying
process is stationary. Only then can we treat pairs of observations at lag $k$ as being probabilistically "equal" and compute sample covariance coefficients. Hence,
while stationarity is at the root of ACF estimation, we can of course still apply the
formulae given above to non-stationary series. The ACF then usually exhibits
some typical patterns. This can serve as a second check for non-stationarity, i.e.
helps to identify it, should it have gone unnoticed in the time series plot. We start
by showing the correlogram for the SMI daily closing values from section 1.2.4.
This series does not have seasonality, but a very clear trend.
We observe that the ACF decays very slowly. The reason is that if a time series features a trend, observations at consecutive time points will usually lie on the same side of the series' global mean $\bar{x}$. This is why, for small to moderate lags $k$, most of the terms

$(x_{s+k} - \bar{x})(x_s - \bar{x})$

are positive. For this reason, the sample autocorrelation coefficient will be positive as well, and is most often also close to 1. Thus, a very slowly decaying ACF is an indicator for non-stationarity, i.e. a trend which was not removed before the autocorrelations were estimated.
[Figure: correlogram of the SMI daily closing values (x-axis: Lag)]
Next, we show an example of a series that has no trend, but a strongly recurring
seasonal effect. We use R’s data(nottem), a time series containing monthly
average air temperatures at Nottingham Castle in England from 1920-1939. Time
series plot and correlogram are as follows:
[Figure: time series plot and correlogram of the nottem series (ACF x-axis: Lag)]
The ACF is cyclic, and owing to the recurring seasonality, the envelope again decays very slowly. Also note that for periodic series, R shows periods rather than lags on the x-axis – often a matter of confusion. We conclude that a hardly or only very slowly decaying periodicity in the correlogram is an indication of a seasonal effect which has not been removed. Finally, we also show the correlogram for the logged air passenger bookings. This series exhibits both an increasing trend and a seasonal effect. The result is as follows:
> data(AirPassengers)
> txt <- "Correlogram of Logged Air Passenger Bookings"
> acf(log(AirPassengers), lag.max=48, main=txt)
[Figure: correlogram of the logged air passenger bookings (x-axis: Lag)]
Here, the two effects described above are interspersed. We have a (here
dominating) slow decay in the general level of the ACF, plus some periodicity.
Again, this is an indication for a non-stationary series. It needs to be decomposed,
before the serial correlation in the stationary remainder term can be studied.
If a time series has an outlier, it will appear twice in any lagged scatterplot, and will thus potentially have a "double" negative influence on the $\hat{\rho}(k)$. As an example, we
consider variable temp from data frame beaver1, which can be found in R’s
data(beavers). This is the body temperature of a female beaver, measured by
telemetry in 10 minute intervals. We first visualize the data with a time series plot.
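A minimal sketch for accessing and plotting these data (constructing the object beaver this way is our assumption; the same object name is used in the confidence interval example further below):

> data(beavers)
> beaver <- ts(beaver1$temp)
> plot(beaver, ylab="temp", main="Beaver Body Temperature")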
[Figure: time series plot of the beaver body temperature data]
The two data points where the outlier is involved are highlighted. The Pearson
correlation coefficients with and without these two observations are 0.86 and 0.91.
Depending on the outlier's severity, the difference can be much bigger. The next
plot shows the entire correlogram for the beaver data, computed with (black) and
without (red) the outlier. Also here, the difference may seem small and rather
academic, but it could easily be severe if the outlier was just pronounced enough.
The question is, how do we handle missing values in time series? In principle, we
cannot just omit them without breaking the time structure. And breaking it means
going away from our paradigm of equally spaced points in time. A popular choice
is thus to replace the missing value, which can be done with various degrees of
sophistication.
[Figure: correlogram of the beaver data, computed with (black) and without (red) the outlier, lags 0 to 20]
The best strategy depends upon the case at hand. And in fact, there is a fourth
alternative: while R's acf() function by default does not allow for missing values,
it still offers the option to proceed without imputation. If the argument
na.action=na.pass is set, the covariances are computed from the complete cases,
and the correlogram is shown as usual. However, having missing values in the
series has the consequence that the estimates produced may well not be a valid
(i.e. positive definite) autocorrelation sequence, and may contain missing values.
From a practical viewpoint, these drawbacks can often be neglected, though. Many
other R functions for time series analysis also allow for the presence of missing
values if the arguments are set properly.
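A hedged sketch of this fourth alternative (the two artificially introduced missing values are our own example):

> beaver.na <- beaver
> beaver.na[c(20, 60)] <- NA
> acf(beaver.na, na.action=na.pass)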
One can show that the sum of all autocorrelation coefficients which can be
estimated from a time series realization, i.e. the sum over all $\hat{\rho}(k)$ for lags
$k = 1, \ldots, n-1$, adds up to -1/2. Or, written as a formula:

$$\sum_{k=1}^{n-1} \hat{\rho}(k) = -\frac{1}{2}$$
We omit the proof here. It is clear that the above condition will lead to quite severe
artifacts, especially when a time series process has only positive correlations. We
here show both the true, theoretical ACF of an AR(1) process with $\alpha_1 = 0.7$, which,
as we will see in section 5, has $\rho(k) > 0$ for all $k$, and the sample correlogram for
a realization of that process of length 200.
## True ACF
true.acf <- ARMAacf(ar=0.7, lag.max=200)
plot(0:200, true.acf, type="h", xlab="Lag", ylim=c(-1,1))
title("True ACF of an AR(1) Process with alpha=0.7")
abline(h=0, col="grey")
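The code for the sample correlogram is not shown above; a possible sketch (seed and title are our choice) is:

## Sample ACF from a realization of length 200
set.seed(25)
x.sim <- arima.sim(list(ar=0.7), n=200)
acf(x.sim, lag.max=199, ylim=c(-1,1), main="Sample ACF of an AR(1) Realization")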
What we observe is quite striking: only for the very first few lags does the sample
ACF match its theoretical counterpart. As soon as we are beyond lag k = 6, the
sample ACF turns negative. This is an artifact, because the sum of the estimated
autocorrelation coefficients needs to add up to -1/2. Some of these spurious,
negative correlation estimates are so big that they even exceed the confidence
bounds – an observation that has to be kept well in mind when analyzing and
interpreting correlograms.
Simulation Study
Last but not least, we will run a small simulation study that visualizes bias and
variance in the sample autocorrelation coefficients. We will again base this on the
simple AR(1) process with coefficient $\alpha_1 = 0.7$. For further discussion of the
process' properties, we refer to section 5. There, it will turn out that the k-th
autocorrelation coefficient of such a process takes the value $0.7^k$, as visualized
on the previous page.
For understanding the variability in $\hat{\rho}(1)$, $\hat{\rho}(2)$, $\hat{\rho}(5)$ and $\hat{\rho}(10)$, we simulate from
the aforementioned AR(1) process. We generate series of length n = 20, n = 50,
n = 100 and n = 200. We then obtain the correlogram, record the estimated
autocorrelation coefficients and repeat this process 1000 times. This serves as a
basis for displaying the variability in $\hat{\rho}(1)$, $\hat{\rho}(2)$, $\hat{\rho}(5)$ and $\hat{\rho}(10)$ with boxplots.
They can be found below.
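A hedged sketch of how such a simulation study could be implemented (all object names, the seed and the plot layout are ours):

set.seed(22)
n.vec <- c(20, 50, 100, 200)
lags  <- c(1, 2, 5, 10)
rho.hat <- array(NA, dim=c(1000, length(n.vec), length(lags)))
for (i in 1:1000) {
  for (j in 1:length(n.vec)) {
    ## simulate an AR(1) with alpha_1=0.7 and record the sample ACF at the chosen lags
    series <- arima.sim(list(ar=0.7), n=n.vec[j])
    rho.hat[i, j, ] <- acf(series, lag.max=10, plot=FALSE)$acf[lags+1]
  }
}
par(mfrow=c(2,2))
for (l in 1:4) {
  boxplot(rho.hat[, , l], names=paste("n=", n.vec, sep=""),
          ylim=c(-1,1), main=paste("Estimated ACF at lag", lags[l]))
}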
[Figure: boxplots of the estimated autocorrelation coefficients $\hat{\rho}(1)$, $\hat{\rho}(2)$, $\hat{\rho}(5)$ and $\hat{\rho}(10)$ for series of length n = 20, 50, 100 and 200]
We observe that for “short” series with less than 100 observations, estimating the
ACF is difficult: the $\hat{\rho}(k)$ are strongly biased, and there is huge variability. Only for
longer series does the consistency of the estimator “kick in” and yield estimates
which are reasonably precise. For lag k = 10, on the other hand, we observe less
bias, but the variability in the estimate remains large, even for “long” series.
For the sample mean of a stationary time series, $\hat{\mu} = \frac{1}{n}\sum_{t=1}^{n} X_t$, the variance can be derived as follows:

$$\mathrm{Var}(\hat{\mu}) = \frac{1}{n^2}\,\mathrm{Var}\Big(\sum_{t=1}^{n} X_t\Big) = \frac{1}{n^2}\,\mathrm{Cov}\Big(\sum_{t=1}^{n} X_t, \sum_{t=1}^{n} X_t\Big) = \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n}\mathrm{Cov}(X_s, X_t) = \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n}\gamma(t-s) = \frac{\gamma(0)}{n^2}\Big(n + 2\sum_{k=1}^{n-1}(n-k)\,\rho(k)\Big)$$
In reality, one often has to deal with time series that only feature positive
autocorrelation coefficients. In that case, $\mathrm{Var}(\hat{\mu})$ will be larger than for an iid
series. Hence, falsely assuming independence may lead to confidence intervals
which are too narrow, and thus to spuriously significant results.
In practice, the autocovariance and autocorrelation are replaced by their estimates, and the sum is truncated at the default maximum lag of the correlogram, which yields the approximate 95% confidence interval

$$\hat{\mu} \pm 1.96\,\sqrt{\frac{\hat{\gamma}(0)}{n^2}\Big(n + 2\sum_{k=1}^{10\log_{10}(n)}(n-k)\,\hat{\rho}(k)\Big)}$$
We illustrate the issue with the beaver body temperature series from above. The
mean and the (faulty) confidence interval under the iid assumption are simply
computed as:
> mean(beaver)
[1] 36.862
> mean(beaver)+c(-1.96,1.96)*sd(beaver)/sqrt(length(beaver))
[1] 36.827 36.898
When adjusting for the sequential correlation of the observations, the confidence
interval becomes around 2.7x longer, which can make a big difference!
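A hedged sketch of how the corrected interval can be computed from the formula above (our own code, using the beaver object from before):

n <- length(beaver)
## autocovariances up to the default maximum lag 10*log10(n)
gamma.hat <- acf(beaver, lag.max=floor(10*log10(n)), type="covariance", plot=FALSE)$acf
rho.hat   <- gamma.hat/gamma.hat[1]
k         <- 1:(length(rho.hat)-1)
se.adj    <- sqrt(gamma.hat[1]/n^2 * (n + 2*sum((n-k)*rho.hat[-1])))
mean(beaver) + c(-1.96, 1.96)*se.adj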
The partial autocorrelation $\pi(k)$ at lag k is defined as the correlation between $X_{t+k}$ and $X_t$, given all the intermediate observations:

$$\pi(k) = \mathrm{Cor}(X_{t+k}, X_t \mid X_{t+1} = x_{t+1}, \ldots, X_{t+k-1} = x_{t+k-1})$$
In other words, we can also say that the partial autocorrelation is the association
between $X_t$ and $X_{t+k}$ with the linear dependence on $X_{t+1}$ through $X_{t+k-1}$ removed.
Another instructive analogy can be drawn to linear regression. The autocorrelation
coefficient $\rho(k)$ measures the simple dependence between $X_t$ and $X_{t+k}$, whereas
the partial autocorrelation $\pi(k)$ measures the contribution to the multiple
dependence, with all intermediate instances $X_{t+1}, \ldots, X_{t+k-1}$ involved as
explanatory variables. There is a (theoretical) relation between the partial
autocorrelations $\pi(k)$ and the plain autocorrelations $\rho(1), \ldots, \rho(k)$, i.e. they can be
derived from each other, e.g.:

$$\pi(1) = \rho(1) \quad \mbox{and} \quad \pi(2) = \frac{\rho(2) - \rho(1)^2}{1 - \rho(1)^2}$$

The formulae for higher lags k exist, but become complicated rather quickly, so we
do without displaying them. However, another absolutely central property of the
partial autocorrelations is that the p-th coefficient of an AR(p) model, denoted as
$\alpha_p$, is equal to $\pi(p)$. While there is an in-depth discussion of AR(p)
models in section 5, we here briefly sketch the idea, because it makes the above
property seem rather logical. An autoregressive model of order p, i.e. an AR(p), is:

$$X_t = \alpha_1 X_{t-1} + \ldots + \alpha_p X_{t-p} + E_t,$$
[Figure: sample PACF of the wave tank data, lags up to 25]
We observe that $\hat{\pi}(1)$ and $\hat{\pi}(2)$ are large, around 0.5 and 0.6 in absolute value.
Some further PACF coefficients up to lag 10 seem significantly different from zero,
but are smaller. From what we see here, we could try to describe the wave tank
data with an AR(2) model. The next section will explain why.
> library(forecast)
> tsdisplay(wave, points=FALSE)
[Figure: tsdisplay() output for the wave series: time series plot with ACF and PACF, lags up to 25]
5 Stationary Time Series Models
Before we show a realization of a White Noise process, we state that the term
“White Noise” was coined in an article on heat radiation published in Nature in
April 1922. There, it was used to refer to time series that contained all frequencies
in equal proportions, analogous to white light. It is possible to show that iid
sequences of random variables do contain all spectral frequencies in equal
proportions, and hence the name.
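A minimal sketch for generating and displaying such a realization (seed and length are our choice):

set.seed(23)
wn <- ts(rnorm(200, mean=0, sd=1))
plot(wn, ylab="", main="Gaussian White Noise")
par(mfrow=c(1,2))
acf(wn, ylim=c(-1,1), main="ACF")
pacf(wn, ylim=c(-1,1), main="PACF")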
[Figure: ACF (left) and PACF (right) of a simulated Gaussian White Noise series, lags up to 20]
White Noise series are important, because they usually arise as residual series
when fitting time series models. The correlogram generally provides enough
evidence for attributing the White Noise property to a series, provided the series is
of reasonable length – our studies in section 4.4 suggest that 100 or 200
observations are enough. Please also note that while there is not much structure
in Gaussian White Noise, it still has a parameter: the variance $\sigma_W^2$.
$$X_t = \mu_t + E_t.$$

Hereby, $\mu_t$ is the conditional mean of the series, i.e. $\mu_t = E[X_t \mid X_{t-1}, X_{t-2}, \ldots]$,
and $E_t$ is a disturbance term. For all models in section 5, the disturbance term is
assumed to be a White Noise innovation.
In words, the conditional mean is a function of past instances of the series as well
as past innovations. We will see that usually, a selection of the involved terms is
made, and that the function f () is a linear combination of the arguments.
$$X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + \ldots + \alpha_p X_{t-p} + E_t.$$
Hereby, the disturbance term Et comes from a White Noise process, i.e. is iid.
Moreover, we require that it is an innovation, i.e. that it is stochastically
independent of $X_{t-1}, X_{t-2}, \ldots$. The term innovation is illustrative, because (under
suitable conditions) it has the power to drive the series into a new direction,
meaning that it is strong enough to overplay the dependence of the series on its
own past. An alternative notation for AR(p) models is possible with the backshift
operator:

$$\Phi(B)X_t = (1 - \alpha_1 B - \alpha_2 B^2 - \ldots - \alpha_p B^p)\,X_t = E_t$$
When is an AR(p) stationary? Not always, but under some mild conditions. First
of all, we study the unconditional expectation of an AR(p) process $X_t$ which we
assume to be stationary, hence $E[X_t] = \mu$ for all $t$. When we take expectations on
both sides of the model equation, we have:

$$\mu = E[X_t] = E[\alpha_1 X_{t-1} + \ldots + \alpha_p X_{t-p} + E_t] = \alpha_1\mu + \ldots + \alpha_p\mu, \quad \mbox{hence} \quad \mu(1 - \alpha_1 - \ldots - \alpha_p) = 0 \;\Rightarrow\; \mu = 0.$$

Thus, any stationary AR(p) process has a global mean of zero. But please be
aware of the fact that the conditional mean is time dependent and generally
different from zero.
In practice, a process with a non-zero mean is accommodated by working with the shifted series $Y_t = m + X_t$.
Next, we consider the variance of a stationary AR(1). Plugging in the model equation yields

$$\sigma_X^2 = \mathrm{Var}(X_t) = \mathrm{Var}(\alpha_1 X_{t-1} + E_t) = \alpha_1^2\,\sigma_X^2 + \sigma_E^2, \quad \mbox{hence} \quad \sigma_X^2 = \frac{\sigma_E^2}{1 - \alpha_1^2}.$$
From this we derive that an AR(1) can only be stationary if $|\alpha_1| < 1$. That limitation
means that the dependence on the series' past must not be too strong, so that the
memory fades out. If $|\alpha_1| \geq 1$, the process diverges. The general condition for
AR(p) models is (as mentioned above) more difficult to derive. We require that all
(complex) roots of the characteristic polynomial $1 - \alpha_1 z - \ldots - \alpha_p z^p$ lie outside
the unit circle, i.e. exceed 1 in absolute value. In R, this can conveniently be
checked with the polyroot() function:
> abs(polyroot(c(1,-0.4,0.2,-0.3)))
[1] 1.405467 1.540030 1.540030
We next simulate from the AR(1) process

$$X_t = 0.8\,X_{t-1} + E_t.$$

So far, we had only required that $E_t$ is a White Noise innovation, but had not
specified a distribution. We use the Gaussian here and set $x_1 = E_1$ as the starting value.
> set.seed(24)
> E <- rnorm(200, 0, 1)
> x <- numeric()
> x[1] <- E[1]
> for(i in 2:200) x[i] <- 0.8*x[i-1] + E[i]
> plot(ts(x), main= "AR(1) with...")
We observe some cycles with exclusively positive and others with only negative
values. That is not surprising: if the series takes a large value, then the next one is
determined as 0.8 times that large value plus the innovation. Thus, it is more likely
that the following value has the same sign as its predecessor. On the other hand,
the innovation is powerful enough so that jumping to the other side of the global
mean is always an option. Given that behavior, it is evident that the autocorrelation
at lag 1 is positive. We can compute it explicitly from the model equation:

$$\mathrm{Cov}(X_{t+1}, X_t) = \mathrm{Cov}(\alpha_1 X_t + E_{t+1}, X_t) = \alpha_1\,\mathrm{Var}(X_t)$$

Thus we have $\rho(1) = 0.8$ here, or in general $\rho(1) = \alpha_1$. The correlation for higher
lags can be determined similarly by repeated plug-in of the model equation. It is:

$$\rho(k) = \alpha_1^k.$$
The series shows an alternating behavior: the next value is more likely to lie on the
opposite side of the global mean zero, but there are exceptions when the
innovation takes a large value. The autocorrelation still follows $\rho(k) = \alpha_1^k$. It also
alternates between positive and negative values, with an exponentially decaying
envelope for $|\rho(k)|$.
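A hedged sketch of how such an alternating series can be generated (the coefficient value of -0.8 is our assumption, mirroring the previous example):

> set.seed(24)
> E <- rnorm(200, 0, 1)
> x <- numeric()
> x[1] <- E[1]
> for(i in 2:200) x[i] <- -0.8*x[i-1] + E[i]
> plot(ts(x), main="AR(1) with alpha_1 = -0.8")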
What is now the (theoretical) autocorrelation in this AR(3)? We apply the standard
trick of plugging in the model equation. This yields:

$$\rho(k) = \gamma(0)^{-1}\,\mathrm{Cov}(X_{t+k}, X_t) = \gamma(0)^{-1}\,\mathrm{Cov}(\alpha_1 X_{t+k-1} + \ldots + \alpha_p X_{t+k-p} + E_{t+k},\, X_t) = \alpha_1\rho(k-1) + \ldots + \alpha_p\rho(k-p)$$
[Figure: theoretical ACF of the AR(3) process, lags 0 to 20]
We observe that the theoretical correlogram shows a more complex structure than
what could be achieved with an AR(1) . Nevertheless, one can still find an
exponentially decaying envelope for the magnitude of the autocorrelations $\rho(k)$.
That is a property which is common to all AR ( p ) models.
From the above, we can conclude that the autocorrelations are generally non-zero
for all lags, even though in the underlying model, $X_t$ only depends on the $p$
previous values $X_{t-1}, \ldots, X_{t-p}$. In section 4.5 we learned that the partial
autocorrelation at lag k illustrates the dependence between $X_t$ and $X_{t+k}$ once the
linear dependence on the intermediate terms has been taken into account. It is
evident by definition that for any AR(p) process, we have $\pi(k) = 0$ for all $k > p$.
This can and will serve as a useful indicator for deciding on the model order p
when fitting real-world data. In this section, we focus on the PACF of the above AR(3).
[Figure: theoretical PACF of the AR(3) process, lags 0 to 20]
As claimed previously, we indeed observe $\pi(1) = \rho(1) = 0.343$ and $\pi(3) = \alpha_3 = 0.3$.
All partial autocorrelations from $\pi(4)$ on are exactly zero.
5.3.2 Fitting
Fitting an AR ( p ) model to data involves three main steps. First, the model and its
order need to be identified. Second, the model parameters need to be estimated
and third, the quality of the fitted model needs to be verified by residual analysis.
Model Identification
The model identification step first requires verifying that the data show properties
which make it plausible that they were generated from an AR ( p ) process. In
particular, the time series we are aiming to model needs to be stationary, show an
ACF with an approximately exponentially decaying envelope and a PACF with a
recognizable cut-off at some lag p smaller than about 5-10. If any of these three
properties is strongly violated, it is unlikely that an AR ( p ) will yield a satisfactory
fit, and there might be models which are better suited for the problem at hand.
The choice of the model order p then relies on the analysis of the sample PACF.
Following the paradigm of parameter parsimony, we would first try the simplest
model that seems plausible. This means choosing the smallest p at which we
suspect a cut-off, i.e. the smallest after which none, or only few and weakly
significant partial autocorrelations follow. We illustrate the concept with the logged
Lynx data that were already discussed in section 1.2.2. We need to generate both
ACF and PACF, which can be found on the next page.
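A sketch of the commands that could produce these two plots:

> par(mfrow=c(1,2))
> acf(log(lynx), ylim=c(-1,1))
> pacf(log(lynx), ylim=c(-1,1))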
[Figure: ACF (left) and PACF (right) of the logged lynx data, lags up to 20]
There is no reason to doubt the stationarity of the Lynx series. Moreover, the ACF
shows a cyclic behavior that has an exponentially decaying envelope. Now does
the PACF show a cut-off? That is a bit less clear, and several orders p
(2, 4, 7, 11) come into question. In summary, however, we conjecture that there are
no strong objections against fitting an AR(p). The choice of the order is debatable,
but the parsimony paradigm tells us that we should try the smallest candidate
first, and that is p = 2.
Parameter Estimation
Observed time series are rarely centered and thus, it is usually inappropriate to fit
a pure AR(p) process. In fact, all R routines for fitting autoregressive models by
default assume the shifted process $Y_t = m + X_t$. Hence, we have a regression-type
equation with observations:

$$y_t = m + \alpha_1(y_{t-1} - m) + \ldots + \alpha_p(y_{t-p} - m) + E_t \quad \mbox{for } t = p+1, \ldots, n.$$
The goal here is to estimate the parameters $m, \alpha_1, \ldots, \alpha_p$ such that the data are
fitted well. There are several concepts of what fitting well means. These include
ordinary least squares estimation (OLS), Burg's algorithm (Burg), the Yule-Walker
approach (YW) and maximum likelihood estimation (MLE). Already at this point we
note that while the four methods differ in their fundamentals, they are
asymptotically equivalent (under some mild assumptions) and yield results that
mostly only differ slightly in practice. Still, it is worthwhile to study all the concepts.
OLS
The OLS approach is based on the notion that, after centering, the above equation
defines a multiple linear regression problem without intercept. The goodness-of-fit
criterion is $\sum (x_t - \hat{x}_t)^2$, resp. $\sum (y_t - \hat{y}_t)^2$; the two quantities are equal. The first
step with this approach is to center the data, which amounts to subtracting the
global mean:
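A hedged sketch of this hand-crafted OLS fit (the variable names resp and pred1/pred2 follow the summary output below; the exact code is ours):

> llc   <- log(lynx) - mean(log(lynx))
> resp  <- llc[3:114]
> pred1 <- llc[2:113]
> pred2 <- llc[1:112]
> summary(lm(resp ~ pred1 + pred2 - 1))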
Coefficients:
Estimate Std. Error t value Pr(>|t|)
pred1 1.38435 0.06359 21.77 <2e-16 ***
pred2 -0.74793 0.06364 -11.75 <2e-16 ***
---
Residual standard error: 0.528 on 110 degrees of freedom
Multiple R-squared: 0.8341, Adjusted R-squared: 0.8311
F-statistic: 276.5 on 2 and 110 DF, p-value: < 2.2e-16
We can extract $\hat{m} = 6.686$, $\hat{\alpha}_1 = 1.384$, $\hat{\alpha}_2 = -0.748$ and $\hat{\sigma}_E = 0.528$. But while this is
an instructive way of estimating AR(p) models, it is a bit cumbersome and time
consuming. Not surprisingly, there are procedures in R that are dedicated to fitting
such models. We here display the use of function ar.ols(). To replicate the
hand-produced result, we type:
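A sketch of the corresponding call (argument values as described in the text below):

> fit.ar.ols <- ar.ols(log(lynx), aic=FALSE, intercept=FALSE, order.max=2)
> fit.ar.ols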
Coefficients:
1 2
1.3844 -0.7479
Note that for producing this result, we need to avoid AIC-based model selection by
setting aic=FALSE. The shift m is automatically estimated, and thus we need to
exclude an intercept term in the regression model using intercept=FALSE. We
observe that the estimated AR coefficients $\hat{\alpha}_1, \hat{\alpha}_2$ take exactly the same values
as with the hand-crafted procedure above. The estimated shift $\hat{m}$ can be extracted via
> fit.ar.ols$x.mean
[1] 6.685933
and corresponds to the global mean of the series. Finally, the estimate for the
innovation variance requires some prudence. The lm() summary output yields
$\hat{\sigma}_E = 0.528$, which is based on the residual degrees of freedom; dividing the
residual sum of squares by the number of observations used, i.e. 112, instead yields:
> sum(na.omit(fit.ar.ols$resid)^2)/112
[1] 0.2737594
Burg’s Algorithm
While the OLS approach works, its downside is the asymmetry: the first p terms
are never evaluated as responses. That is cured by Burg’s Algorithm, an
alternative approach for estimating AR ( p ) models. It is based on the notion that
any AR ( p ) process is also an AR ( p ) if the time is run in reverse order. Under this
property, minimizing the forward and backward 1-step squared prediction errors
makes sense:
$$\sum_{t=p+1}^{n}\left[\Big(X_t - \sum_{k=1}^{p}\alpha_k X_{t-k}\Big)^2 + \Big(X_{t-p} - \sum_{k=1}^{p}\alpha_k X_{t-p+k}\Big)^2\right]$$
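The corresponding R call (reconstructed from the output shown below) is:

> f.ar.burg <- ar.burg(log(lynx), aic=FALSE, order.max=2)
> f.ar.burg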
Call:
ar.burg.default(x = log(lynx), aic = FALSE, order.max = 2)
Coefficients:
1 2
1.3831 -0.7461
> f.ar.burg$x.mean
[1] 6.685933
> sum(na.omit(f.ar.burg$resid)^2)/112
[1] 0.2737614
There are a few interesting points which require commenting. First and foremost,
Burg’s algorithm also uses the arithmetic mean to estimate the global mean m̂ .
The fitting procedure is then done on the centered observations xt . On a side
remark, note that assuming centered observations is possible. If argument
demean=FALSE is set, the global mean is assumed to be zero and not estimated.
The two coefficients $\hat{\alpha}_1, \hat{\alpha}_2$ take slightly different values than with OLS
estimation. While the difference between the two methods is often practically
negligible, it is nowadays generally accepted that the Burg solution is better for
finite samples. Asymptotically, the two methods are equivalent. Finally, we
observe that the ar.burg() output specifies $\hat{\sigma}_E^2 = 0.2707$. This is different from
the residual-based estimate of 0.27376 obtained above. The explanation is that for
Burg's algorithm, the innovation variance is estimated from the Durbin-Levinson
updates; see the R help file for further reference.
Yule-Walker Equations
A third option for estimating AR(p) models is to plug the sample ACF into the
Yule-Walker equations. In section 5.3.1 we had learned that there is a $p \times p$
linear equation system $\rho(k) = \alpha_1\rho(k-1) + \ldots + \alpha_p\rho(k-p)$ for $k = 1, \ldots, p$. Hence we
can and will explicitly determine $\hat{\rho}(0), \ldots, \hat{\rho}(p)$ and then solve the linear equation
system for the coefficients $\alpha_1, \ldots, \alpha_p$. The procedure is implemented in the R
function ar.yw().
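A sketch of the corresponding call:

> fit.ar.yw <- ar.yw(log(lynx), aic=FALSE, order.max=2)
> fit.ar.yw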
Coefficients:
1 2
1.3504 -0.7200
Again, the two coefficients $\hat{\alpha}_1, \hat{\alpha}_2$ take slightly different values than with the two
methods before. Mostly, this difference is practically negligible, and Yule-Walker is
asymptotically equivalent to OLS and Burg.
Nevertheless, for finite samples, the estimates from the Yule-Walker method are
often worse in the sense that their (Gaussian) likelihood is lower. Thus, we
recommend preferring Burg’s algorithm. We conclude this section by noting that
the Yule-Walker method also involves estimating the global mean m with the
arithmetic mean as the first step. The innovation variance is estimated from the
fitted coefficients and the autocovariance of the series and thus again takes a
different value than before.
The MLE is based on determining the model coefficients such that the likelihood
given the data is maximized, i.e. the density function takes its maximal value under
the present observations. This requires assuming a distribution for the AR ( p )
process, which comes quite naturally if one assumes that for the innovations, we
have $E_t \sim N(0, \sigma_E^2)$, i.e. they are iid Gaussian random variables. With some theory
(which we omit), one can then show that an AR(p) process $X_1, \ldots, X_n$ is a random
vector with a multivariate Gaussian distribution.
$$L(\alpha, m, \sigma_E^2) \propto \exp\Big(-\sum_{t=1}^{n}(x_t - \hat{x}_t)^2\Big)$$
The details are quite complex and several constants are part of the equation, too.
But we here note that the MLE derived from the Gaussian distribution is based on
minimizing the sum of squared errors and thus equivalent to the OLS approach.
Due to the simultaneous estimation of model parameters and innovation variance,
a recursive algorithm is required. There is an implementation in R:
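The call (reconstructed from the output below) is:

> f.ar.mle <- arima(log(lynx), order=c(2,0,0))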
> f.ar.mle
Call: arima(x = log(lynx), order = c(2, 0, 0))
Coefficients:
ar1 ar2 intercept
1.3776 -0.7399 6.6863
s.e. 0.0614 0.0612 0.1349
We observe estimates which are again slightly different from the ones computed
previously. Again, those differences are mostly neglectable for practical data
analysis. What is known from theory is that the MLE is (under mild assumptions)
asymptotically normal with minimum variance among all asymptotically normal
estimators. Note that the MLE based on Gaussian distribution still works
reasonably well if that assumption is not met, as long as we do not have strongly
skewed data (apply a transformation in that case) or extreme outliers.
Practical Aspects
On the other hand, ar.ols(), ar.yw() and ar.burg() do not provide standard
errors, but allow for convenient determination of the model order p with the AIC
statistic. While we still recommend investigating the suitable order by analyzing
ACF and PACF, following the parsimony paradigm and inspecting residual plots,
using AIC as a second opinion is worthwhile. It works as follows:
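A hedged sketch of this AIC-based order determination (the object name fit.ar.aic is taken from the axis label of the figure below; the choice of ar.burg() as the fitting routine is our assumption):

fit.ar.aic <- ar.burg(log(lynx), aic=TRUE, order.max=20)
plot(0:fit.ar.aic$order.max, fit.ar.aic$aic, type="h",
     xlab="0:fit.ar.aic$order.max", ylab="AIC")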
[Figure: AIC values of AR(p) models with p = 0, ..., 20 fitted to the logged lynx data]
We observe that already p = 2 yields a good AIC value. Then there is little further
improvement until p = 11, and a just slightly lower value is found at p = 12. Hence,
we will evaluate p = 2 and p = 11 as two competing models with some further
tools in the next section.
The output is displayed on the next page. While some differences are visible, it is
not easy to judge from the fitted values which of the two models is preferable. A
better focus on the quality of the fit is obtained when the residuals and their
dependence are inspected with time series plots as well as ACF/PACF
correlograms. The graphical output is again displayed on the next page. We
observe that the AR(2) residuals are not iid. Hence they do not form a White
Noise process, and thus we conclude that the AR(11) model yields a better
description of the logged lynx data.
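A hedged sketch of how this residual comparison can be generated (whether the original fits were obtained with ar.burg() is our assumption; the object names fit.ar02 and fit.ar11 reappear in the QQ-plot code below):

fit.ar02 <- ar.burg(log(lynx), aic=FALSE, order.max=2)
fit.ar11 <- ar.burg(log(lynx), aic=FALSE, order.max=11)
par(mfrow=c(2,2))
acf(na.omit(fit.ar02$resid), ylim=c(-1,1), main="AR(2)")
acf(na.omit(fit.ar11$resid), ylim=c(-1,1), main="AR(11)")
pacf(na.omit(fit.ar02$resid), ylim=c(-1,1), main="AR(2)")
pacf(na.omit(fit.ar11$resid), ylim=c(-1,1), main="AR(11)")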
[Figure: ACF (top row) and PACF (bottom row) of the residuals from the AR(2) (left) and AR(11) (right) fits]
Because our estimation routines to some extent rely on the Gaussian distribution,
it is always worthwhile to generate a Normal QQ-Plot for verifying this. Here, we
obtain:
> par(mfrow=c(1,2))
> qqnorm(as.numeric(fit.ar02$resid))
> qqline(as.numeric(fit.ar02$resid))
> qqnorm(as.numeric(fit.ar11$resid))
> qqline(as.numeric(fit.ar11$resid))
[Figure: Normal QQ-plots of the AR(2) (left) and AR(11) (right) residuals]
Neither of the two plots (which are very similar) indicates any problems. If the
residuals' distribution looks non-normal, a log-transformation might be worth
considering, as it often improves the situation.
If there are competing models and none of the other criteria dictates which one to
use, another option is to generate realizations from the fitted process using R's
function arima.sim(). It usually pays off to generate multiple realizations from
each fitted model. By eyeballing, one then tries to judge which model yields data
that resemble the true observations best. We here do the following:
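A hedged sketch of such a simulation (the seed is ours; the coefficients and the mean are taken from the fitted object fit.ar11 defined above):

set.seed(25)
sim.ar11 <- arima.sim(list(ar=fit.ar11$ar), n=114) + fit.ar11$x.mean
plot(sim.ar11, main="Simulated Series from the Fitted AR(11)")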
A realization from the fitted AR(11) can be seen on the next page. In summary, the
simulations from this bigger model look more realistic than the ones from the
AR(2). The clearest answer about which model is preferable here comes from
the ACF/PACF correlograms, though. We conclude this section about model fitting
by saying that the logged lynx data are best modeled with the AR(11).
[Figure: simulated realization from the fitted AR(11) model]
A moving average process of order q, denoted MA(q), is a linear combination of the current and the q most recent innovation terms:

$$X_t = E_t + \beta_1 E_{t-1} + \ldots + \beta_q E_{t-q},$$

or, written with the backshift operator, $X_t = (1 + \beta_1 B + \ldots + \beta_q B^q)\,E_t = \Theta(B)E_t$.
Firstly, they have been applied successfully in many fields, particularly in
econometrics. Time series such as economic indicators are affected by a variety of
random events such as strikes, government decisions, referendums, shortages of
key commodities, et cetera. Such events will not only have an immediate effect on
the indicator, but may also affect its value (to a lesser extent) in several of the
consecutive periods. Thus, it is plausible that moving average processes appear in
practice. Moreover, some of their theoretical properties are in a nice way
complementary to the ones of autoregressive processes. This will become clear if
we study the moments and stationarity of the MA process.
A first, straightforward but very important result is that any MA( q ) process X t , as
a linear combination of innovation terms, has zero mean and constant variance:
$$E[X_t] = 0 \mbox{ for all } t, \quad \mbox{and} \quad \mathrm{Var}(X_t) = \sigma_E^2\Big(1 + \sum_{j=1}^{q}\beta_j^2\Big) = \mbox{const}$$
If a non-zero mean is required in practice, one can simply work with the shifted
process $Y_t = m + X_t$. Hence, the zero mean property does not affect the possible
field of practical application. Now, if we could additionally show that the
autocovariance in MA processes is independent of the time t, we would already
have proven their stationarity.
This is indeed the case. We start by considering an MA(1) with $X_t = E_t + \beta_1 E_{t-1}$.
For any lag k exceeding the order q = 1, we use the same trick of plugging in the
model equation and directly obtain a perhaps somewhat surprising result:

$$\rho(1) = \frac{\gamma(1)}{\gamma(0)} = \frac{\beta_1}{1 + \beta_1^2}, \quad \mbox{and} \quad \rho(k) = 0 \mbox{ for all } k > q = 1.$$
From this we conclude that $|\rho(1)| \leq 0.5$, no matter what the choice for $\beta_1$ is. Thus,
if in practice we observe a series where the first-order autocorrelation coefficient
clearly exceeds this value in absolute terms, we have counterevidence against an
MA(1) process.
Furthermore, we have shown that any MA(1) has zero mean, constant variance
and an ACF that only depends on the lag k; hence it is stationary. Note that the
stationarity does (in contrast to AR processes) not depend on the choice of the
parameter $\beta_1$. The stationarity property can be generalized to MA(q) processes.
Using some calculations and $\beta_0 = 1$, we obtain:
$$\rho(k) = \begin{cases} \displaystyle\sum_{j=0}^{q-k}\beta_j\beta_{j+k} \Big/ \sum_{j=0}^{q}\beta_j^2 & \mbox{for } k = 1, \ldots, q \\ 0 & \mbox{for } k > q \end{cases}$$
Example of a MA(1)
> set.seed(21)
> ts.ma1 <- arima.sim(list(ma=0.7), n=500)
>
> plot(ts.ma1, ylab="", ylim=c(-4,4))
> title("Simulation from a MA(1) Process")
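The estimated and true ACF/PACF shown below could be produced along the following lines (the object names acf.true and pacf.true are taken from the axis labels of the figure; the rest is our sketch):

par(mfrow=c(2,2))
acf(ts.ma1, ylim=c(-1,1), main="Estimated ACF")
pacf(ts.ma1, ylim=c(-1,1), main="Estimated PACF")
acf.true  <- ARMAacf(ma=0.7, lag.max=20)
pacf.true <- ARMAacf(ma=0.7, lag.max=20, pacf=TRUE)
plot(0:20, acf.true, type="h", xlab="lag", ylim=c(-1,1)); abline(h=0, col="grey")
plot(1:20, pacf.true, type="h", xlab="lag", ylim=c(-1,1)); abline(h=0, col="grey")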
[Figure: estimated ACF and PACF of the simulated MA(1) series (top row), and the true ACF and PACF of the process (bottom row)]
We observe that the estimates are pretty accurate: the ACF has a clear cut-off,
whereas the PACF seems to feature some alternating behavior with an
exponential decay in absolute value. This behavior is typical: the PACF of any
MA( q ) process shows an exponential decay, while the ACF has a cut-off. In this
respect, MA( q ) processes are in full contrast to the AR ( p ) ’s, i.e. the appearance
of ACF and PACF is swapped.
Invertibility
It is easy to show that the first autocorrelation coefficient $\rho(1)$ of an MA(1) process
can be written in standard form, or also as follows:

$$\rho(1) = \frac{\beta_1}{1 + \beta_1^2} = \frac{1/\beta_1}{1 + (1/\beta_1)^2}$$
Apparently, any MA(1) process with coefficient $\beta_1$ has exactly the same ACF as
the one with $1/\beta_1$. Thus, the two processes $X_t = E_t + 0.5\,E_{t-1}$ and $U_t = E_t + 2\,E_{t-1}$
have the same dependency structure. Or in other words: given some ACF, we
cannot identify the generating MA process uniquely. This problem of ambiguity
leads to the concept of invertibility. Now, if we express the processes $X_t$ and $U_t$ in
terms of $X_{t-1}, X_{t-2}, \ldots$ resp. $U_{t-1}, U_{t-2}, \ldots$, we find by successive substitution:
$$E_t = X_t - \beta_1 X_{t-1} + \beta_1^2 X_{t-2} - \ldots$$
$$E_t = U_t - (1/\beta_1)\,U_{t-1} + (1/\beta_1^2)\,U_{t-2} - \ldots$$
Hence, if we rewrite the MA(1) as an AR(∞), only one of the two representations
will converge. That is the one where $|\beta_1| < 1$, and the corresponding process will
be called invertible. It is important to know that invertibility of MA processes is
central when it comes to fitting them to data, because parameter estimation is
based on rewriting them in the form of an AR(∞).
5.4.2 Fitting
The process of fitting MA( q ) models to data is more difficult than for AR ( p ) , as
there are no (efficient) explicit estimators and numerical optimization is mandatory.
Perhaps the simplest idea for estimating the parameters is to exploit the relation
between the model parameters and the autocorrelation coefficients, i.e.:
$$\rho(k) = \begin{cases} \displaystyle\sum_{j=0}^{q-k}\beta_j\beta_{j+k} \Big/ \sum_{j=0}^{q}\beta_j^2 & \mbox{for } k = 1, \ldots, q \\ 0 & \mbox{for } k > q \end{cases}$$
Hence, in the case of an MA(1), we would determine $\hat{\beta}_1$ by plugging $\hat{\rho}(1)$ into the
equation $\rho(1) = \beta_1/(1 + \beta_1^2)$. This can be seen as an analogue to the Yule-Walker
approach in AR modelling. Unfortunately, the plug-in idea yields an inefficient
estimator and is not a viable option for practical work.
Another appealing idea would be to use some (adapted) least squares procedure
for determining the parameters. A fundamental requirement for doing so is that we
can express the sum of squared residuals $\sum E_t^2$ in terms of the observations
$X_1, \ldots, X_n$ and the parameters $\beta_1, \ldots, \beta_q$ only, and do not have to rely on the
unobservable $E_1, \ldots, E_n$ directly. This is (up to the choice of some initial values)
possible for all invertible MA(q) processes. For simplicity, we restrict our
illustration to the MA(1) case, where we can replace any innovation term $E_t$ by:

$$E_t = X_t - \beta_1 E_{t-1}$$
Maximum-Likelihood Estimation
As can be seen from the R help file, the Conditional Sum of Squares method is
only secondary to method="CSS-ML" in the R function arima(). This means
that it is preferable to use CSS only to obtain a first estimate of the coefficients
1 ,..., q . They are then used as starting values for a Maximum-Likelihood
estimation, which is based on the assumption of Gaussian innovations $E_t$. It is
pretty obvious that $X_t = E_t + \beta_1 E_{t-1} + \ldots + \beta_q E_{t-q}$, as a linear combination of
normally distributed random variables, follows a Gaussian distribution too. By
taking the covariance terms into account, we obtain a multivariate Gaussian for
the time series vector:
The benefit of MLE is that (under mild and in practice usually fulfilled conditions)
certain optimality conditions are guaranteed. It is well known that the estimates are
asymptotically normal with minimum variance among all asymptotically normal
estimators. Additionally, it is pretty easy to derive standard errors for the
estimates, which further facilitates their interpretation. And even though MLE is
based on assuming Gaussian innovations, it still produces reasonable results if the
deviations from that model are not too strong. Be especially wary in case of
extremely skewed data or massive outliers. In such cases, applying a log-
transformation before the modelling/estimation starts is a wise idea.
[Figure: time series plot, ACF and PACF of the differenced attbond series]
The series seems to originate from a stationary process. There are no very clear
cycles visible, hence it is hard to say anything about correlation and dependency,
and it is impossible to identify the stochastic process behind the generation of
these data from a time series plot alone. Using the ACF and PACF as a visual aid,
we observe a pretty clear cut-off in the ACF at lag 1, which lets us assume that an
MA(1) might be suitable. That opinion is underlined by the fact that the PACF drops
off to small values quickly, i.e. we can attribute some exponential decay to it for
lags 1 and 2.
Our next goal is now to fit the MA(1) to the data. As explained above, the simplest
solution would be to determine $\hat{\rho}(1) = -0.266$ and derive $\hat{\beta}_1$ from $\rho(1) = \beta_1/(1 + \beta_1^2)$.
This yields two solutions, namely $\hat{\beta}_1 = -0.28807$ and $\hat{\beta}_1 = -3.47132$. Only one of
these (the former) defines an invertible MA(1), hence we would stick to that
solution. A better alternative is to use the CSS approach for parameter estimation.
The code for doing so is as follows:
Call:
arima(x = diff(attbond), order = c(0, 0, 1), method = "CSS")
Coefficients:
ma1 intercept
-0.2877 -0.0246
s.e. 0.0671 0.0426
Even more elegant and theoretically sound is the MLE. We can also perform this
in R using function arima(). It yields a very similar but not identical result:
Call:
arima(x = diff(attbond), order = c(0, 0, 1))
Coefficients:
ma1 intercept
-0.2865 -0.0247
s.e. 0.0671 0.0426
Please note that the application of the three estimation procedures here was just
for illustrative purposes, and to show the (slight) differences that manifest
themselves when different estimators are employed. In any practical work, you can
easily restrict yourself to the application of the arima() procedure using the
default fitting by method="CSS-ML".
For verifying the quality of the fit, a residual analysis is mandatory. The residuals
of the MA(1) are estimates of the innovations Et . The model can be seen as
adequate if the residuals reflect the main properties of the innovations. Namely,
they should be stationary and free of any dependency, as well as approximately
Gaussian. We can verify this by producing a time series plot of the residuals, along
with their ACF and PACF, and a Normal QQ-Plot. Sometimes, it is also instructive
to plot the fitted values into the original data, or to simulate from the fitted process,
since this further helps verifying that the fit is good.
> par(mfrow=c(2,2))
> fit <- arima(diff(attbond), order=c(0,0,1))
> plot(resid(fit))
> qqnorm(resid(fit)); qqline(resid(fit))
> acf(resid(fit)); pacf(resid(fit))
[Figure: time series plot, Normal QQ-plot, ACF and PACF of the MA(1) residuals]
As before, we assume that $E_t$ is causal and White Noise, i.e. an innovation with
mean $E[E_t] = 0$ and finite variance $\mathrm{Var}(E_t) = \sigma_E^2$. It is much more convenient to
use the characteristic polynomials $\Phi(\cdot)$ for the AR part and $\Theta(\cdot)$ for the MA part,
because this allows for a very compact notation:

$$\Phi(B)X_t = \Theta(B)E_t.$$

If a non-zero mean is needed, we again use the shifted process $Y_t = m + X_t$, where $X_t$ is an ARMA(p, q).
On the next page, we exhibit the (theoretical) ACF and PACF. It is typical that
neither the ACF nor the PACF cut-off strictly at a certain lag. Instead, they both
show some infinite behavior, i.e. an exponential decay in the magnitude of the
coefficients. However, superimposed on that is a sudden drop-off in both ACF and
PACF. In our example, it is after lag 1 in the ACF, as induced by the moving
average order q = 1. In the PACF, the drop-off happens after lag 2, which is the
logical consequence of the autoregressive order of p = 2. The general behavior of
the ACF and PACF is summarized in the table below.

Model        ACF                            PACF
AR(p)        exponential decay              cut-off at lag p
MA(q)        cut-off at lag q               exponential decay
ARMA(p,q)    decay, drop-off after lag q    decay, drop-off after lag p
[Figure: theoretical ACF (left) and PACF (right) of the ARMA(2,1) example process, lags up to 20]
5.5.2 Fitting
The above properties of ACF and PACF can be exploited for choosing the type
and order of a time series model. In particular, if neither the ACF nor the PACF
shows a pronounced cut-off, where after some low lag (i.e. p or q smaller than
about 10) all the following correlations are not significantly different from zero, then
it is usually wise to choose an ARMA(p, q). For determining the order (p, q), we search for the
superimposed cut-off in the ACF (for q ), respectively PACF (for p ). The drop-off
is not always easy to identify in practice. In “difficult” situations, it has also proven
beneficial to support the choice of the right order with the AIC criterion. We could
for example perform a grid search on all possible ARMA( p, q ) models which do not
use more than 5 parameters, i.e. $p + q \leq 5$. This can readily be done in R by
programming a for() loop or by using auto.arima() in library(forecast).
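A hedged sketch of such a for() loop based grid search (illustrated here on the logged lynx data; in practice, substitute the series at hand):

aic.tab <- matrix(NA, nrow=6, ncol=6, dimnames=list(paste("p =", 0:5), paste("q =", 0:5)))
for (p in 0:5) {
  for (q in 0:5) {
    if (p+q > 5) next
    ## some orders may fail to converge; try() keeps the loop running
    fit <- try(arima(log(lynx), order=c(p, 0, q)), silent=TRUE)
    if (!inherits(fit, "try-error")) aic.tab[p+1, q+1] <- AIC(fit)
  }
}
aic.tab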
It is very important to know that ARMA models are parsimonious, i.e. they usually
do not require high orders p and q. In particular, they work well with far fewer
parameters than pure AR or MA models would. Or in other words: often it is
possible to fit high-order AR(p)'s or MA(q)'s instead of a low-order ARMA(p, q).
That property does not come as a surprise: as we conjectured above, any
stationary and invertible ARMA can be represented in the form of an AR(∞) or an
MA(∞). However, this is not a good idea in practice: estimating parameters “costs
money”, i.e. leads to less precise estimates. Thus, a low-order ARMA(p, q) is
always to be preferred over a high-order AR(p) or MA(q).
For estimating the coefficients, we are again confronted with the fact that there are
no explicit estimators available. This is due to the MA component in the model
which involves innovation terms that are unobservable. By rearranging terms in
the model equation, we can again represent any ARMA(p, q) in a form where it
only depends on the observed $X_t$, the coefficients $\alpha_1, \ldots, \alpha_p; \beta_1, \ldots, \beta_q$ and some
pre-sample innovations $E_t$ with $t < 1$. If these latter terms are all set to zero, we can
determine the optimal set of model coefficients by minimizing the sum of squared
residuals (i.e. innovations) with a numerical method. This is the CSS approach that
was already mentioned in 5.4.2 and is implemented in function arima() when
method="CSS". By default however, these CSS estimates are only used as
starting values for a MLE. If Gaussian innovations are assumed, then the joint
distribution of any ARMA(p, q) process vector $X = (X_1, \ldots, X_n)$ has a multivariate
normal distribution.
> tsdisplay(sunspotarea)
[Figure: tsdisplay() output of log(sunspotarea): time series plot, ACF and PACF]
The series, much like the lynx data, shows some strong periodic behavior.
However, these periods are seen to be stochastic, hence the series as a whole is
considered to be stationary. The ACF has a slow decay and the PACF cuts off
after lag 10, suggesting an AR(10). We complement this with an exhaustive
AIC-based search over all ARMA(p, q) with orders up to 10.
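A hedged sketch of such a call (that the logged series is modelled, as the display above suggests, and the exact argument values are our assumptions):

library(forecast)
fit <- auto.arima(log(sunspotarea), max.p=10, max.q=10, max.order=20,
                  stationary=TRUE, ic="aic", stepwise=FALSE)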
Note that we need to set the information criterion argument ic="aic". Moreover,
if it is computationally feasible, we recommend setting stepwise=FALSE, because
otherwise a non-exhaustive, stepwise search strategy will be employed which may not
result in the AIC-optimal model. As it turns out, an ARMA(2,3) yields the lowest
AIC value, i.e. is better in this respect than an AR (10) and spends fewer
parameters, too. To verify whether the model fits adequately, we need to run a
residual analysis.
> tsdisplay(resid(fit))
[Figure: tsdisplay() output of the ARMA(2,3) residuals: ACF and PACF]
As we can observe, the time series of residuals is not White Noise, since there are
several ACF and PACF coefficients that exceed the confidence bands. Hence, the
AIC-selected model clearly underfits these data. If an AR (10) is used in place of
the ARMA(2,3), the residuals feature the desired White Noise property. As a
consequence, we would reject the ARMA(2,3) here despite its better AIC value,
and we note that blindly trusting automatic model selection procedures may well
lead to models that fit poorly.
6 SARIMA and GARCH Models
Time series from financial or economic background often show serial correlation in
the conditional variance, i.e. are conditionally heteroskedastic. This means that
they exhibit periods of high and low volatility. Understanding the behavior of such
series pays off, and the usual approach is to set up autoregressive models for the
conditional variance. These are the famous ARCH models, which we will discuss
along with their generalized variant, the GARCH class.
$$Y_t = (1 - B)^d X_t, \qquad \Phi(B)\,Y_t = \Theta(B)\,E_t, \qquad \mbox{i.e.} \quad \Phi(B)(1 - B)^d X_t = \Theta(B)\,E_t$$
Such series do appear in practice, as our example of the monthly prices for a
barrel of crude oil (in US$) from January 1986 to January 2006 shows. To stabilize
the variance, we decide to log-transform the data, and model these.
> library(TSA)
> data(oil.price)
> lop <- log(oil.price)
> plot(lop, ylab="log(Price)")
> title("Logged Monthly Price for a Crude Oil Barrel")
The series does not exhibit any apparent seasonality, but there is a clear trend, so
it is non-stationary. We try first-order (i.e. d = 1) differencing, and then check
whether the result is stationary.
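A minimal sketch of this step (the object name dlop reappears in the code below):

> dlop <- diff(lop)
> plot(dlop, ylab="Differences", main="Differences of Logged Oil Prices")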
The trend was successfully removed by taking differences. ACF and PACF show
that the result is serially correlated. There may be a drop-off in the ACF at lag 1,
and in the PACF at either lag 1 or 2, suggesting an ARIMA(1,1,1) or an
ARIMA(2,1,1) for the logged oil prices. However, the best performing model is,
somewhat surprisingly, an ARIMA(1,1,2), for which we show the results.
> par(mfrow=c(1,2))
> acf(dlop, main="ACF", ylim=c(-1,1), lag.max=24)
> pacf(dlop, main="PACF", ylim=c(-1,1), lag.max=24)
[Figure: ACF (left) and PACF (right) of the differenced logged oil prices, lags up to 2 years]
The fitting can be done with the arima() procedure that (by default) estimates the
coefficients using Maximum Likelihood with starting values obtained from the
Conditional Sum of Squares method. We can either let the procedure do the
differencing:
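A sketch of that call (order values as named in the text):

> fit <- arima(lop, order=c(1,1,2))
> fit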
Coefficients:
ar1 ma1 ma2
0.8429 -0.5730 -0.3104
s.e. 0.1548 0.1594 0.0675
Or, we can use the differenced series dlop as input and fit an ARMA(1,2).
However, we need to tell R not to include an intercept – it is not needed once the
trend was removed by taking differences, and within an ARIMA fit a constant would
result in a so-called (sometimes useful) drift term, see chapter 8.2.1. The command is:
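A sketch of that command (the object name fit2 is ours):

> fit2 <- arima(dlop, order=c(1,0,2), include.mean=FALSE)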
The output from this is exactly the same as above, although it is generally better to
use the first approach and fit a true ARIMA model. The next step is to perform
residual analysis – if the model is appropriate, they must look like White Noise.
This is (data not shown here) more or less the case. For decisions on the correct
model order, also the AIC statistics can provide valuable information.
2) Analyze ACF and PACF of the differenced series. If the stylized facts of
an ARMA process are present, decide for the orders p and q .
3) Fit the model using the arima() procedure. This can be done on the
original series by setting d accordingly, or on the differences, by setting
d = 0 and the argument include.mean=FALSE.
4) Analyze the residuals; these must look like White Noise. If several
competing models are appropriate, use AIC to decide for the winner.
The fitted ARIMA model can then be used to generate forecasts including
prediction intervals. This will, however, only be discussed in section 8.
We here discuss another example where the tree ring widths of a Douglas fir are
considered over a very long period lasting from 1107 to 1964. Modeling these data
is non-trivial, since there remains a lot of ambiguity on how to approach them. The
time series plot, enhanced with ACF and PACF (see next page) raises the
important question whether the data generating process was stationary or not. The
ACF shows a relatively slow decay and the local mean of the series seems to
persist on higher/lower levels for longer periods of time. On the other hand, what
we observe in this series would clearly still fit within the envelope of what can be
produced by a stationary time series process. But then, the differenced data (see
next page) look clearly “more stationary”. Hence, both options, a pure ARMA( p, q )
and an ARIMA( p,1, q ) remain open. We here lay some focus on the aspects that
are involved in the decision process.
First and foremost, the model chosen needs to fit with the series, the ACF and
PACF that we observe. As mentioned above, this here leaves both the option for a
stationary and integrated model. Next, we can of course try both approaches and
compare the insample fit via the AIC. If function auto.arima() is employed, we
can even set the scope such that the search includes both stationary and
integrated models in one step.
[Figure: time series plot of the tree ring series with ACF and PACF, lags up to 30]
Let us here first focus on modelling the original, non-differenced data. The analysis
of ACF/PACF above suggests using an ARMA(2,0) or an ARMA(1,1) as
parsimonious models. The residuals of these look similar, but the latter has the
lower AIC value. Another option is to run an AIC-based search using the
auto.arima() function.
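A hedged sketch of such a call (the series name treering.widths is our placeholder for the Douglas fir data; the argument values follow the description below):

> fit <- auto.arima(treering.widths, stationary=TRUE, stepwise=FALSE)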
The settings in the search are such that only stationary models (including an
intercept) are allowed. The maximum order for p and q equals 5, but also
$p + q \leq 5$ due to the max.order=5 default setting in the function. If larger models
are desired, then this needs to be adjusted accordingly. Moreover, for a full search
over all possible models, we need to set stepwise=FALSE. The result is an
ARMA(4,1), which is not exactly what ACF/PACF suggest, but its residuals look
slightly more like White Noise. We now inspect the differenced data.
[Figure: ACF (left) and PACF (right) of the differenced tree ring series, lags up to 30]
There is a clear cut-off in the ACF at lag 1. In the PACF, there is some decay,
perhaps with an additional cut-off at lag 1. Hence, the most plausible simple
integrated models include the ARIMA(0,1,1) and the ARIMA(1,1,1) . The former
cannot capture the dependencies in a reasonable way, the residuals are still
correlated and violate the White Noise assumption. The ARIMA(1,1,1) is much
better in this regard. However, its AIC value is slightly worse than the one of the
ARIMA(1,0,1) considered previously. We again employ auto.arima() for a
non-stepwise grid search over all ARIMA(p,1,q) with $p, q \leq 5$ and $p + q \leq 5$. Since
we want to avoid a drift term and directly work on the differenced data, we have to
set allowmean=FALSE.
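A hedged sketch of that search (again with our placeholder name for the series):

> fit <- auto.arima(diff(treering.widths), d=0, allowmean=FALSE, stepwise=FALSE)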
> fit
ARIMA(3,0,1) with zero mean
Coefficients:
ar1 ar2 ar3 ma1
0.4244 0.0778 0.0061 -0.9206
s.e. 0.0381 0.0387 0.0368 0.0251
sigma^2 estimated as 1060: log likelihood=-4201.06
AIC=8412.23 AICc=8412.3 BIC=8436
With a stationary ARMA( p, q ) , we would here focus more on the long-term aspects
of the series, i.e. the climatic changes that happen over decades or even
centuries. If the data are differenced, we consider the changes in growth from year
to year. This puts the focus on the bio-chemical aspect, while climate change is
ruled out. Not surprisingly, the autocorrelation among the differenced data is
strongly negative at lag 1. This means that a big positive change in growth is more
likely to be followed by a negative change in growth, and vice versa. This model
thus focuses more on the recovery of the tree after strong resp. weak growth in
one year versus the next; it would not primarily be the climate which is modelled,
but rather the bio-chemical processes within the tree. Thus, it is also a
matter of the applied research question which of the two models is more suited.
The seasonally differenced series

$$Y_t = X_t - X_{t-12} = (1 - B^{12})\,X_t$$

usually has the seasonality removed. However, it is quite often the case that the
result does not yet have a constant global mean, and thus some further differencing
at lag 1 is required to achieve stationarity:

$$Z_t = Y_t - Y_{t-1} = (1 - B)(1 - B^{12})\,X_t$$
We illustrate this using the Australian beer production series that we had already
considered in section 4. It has monthly data that range from January 1958 to
December 1990. Again, a log-transformation to stabilize the variance is indicated.
On the next page, we display the original series X t , the seasonally differenced
series Yt and finally the seasonal-trend differenced series Zt .
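A hedged sketch of the differencing steps (the name of the raw production series, beer, is our assumption; d.d12.lbeer reappears in the correlogram code below):

> lbeer <- log(beer)
> d12.lbeer <- diff(lbeer, lag=12)     ## seasonally differenced series Y_t
> d.d12.lbeer <- diff(d12.lbeer)       ## additional differencing at lag 1, i.e. Z_t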
[Figure: time series plots of the logged beer production $X_t$, the seasonally differenced series $Y_t$ and the seasonal-trend differenced series $Z_t$]
While the two series $X_t$ and $Y_t$ are non-stationary, the last one, $Z_t$, may be,
although it is a bit debatable whether the assumption of constant variation is
violated or not. We proceed by analyzing ACF and PACF of series $Z_t$.
> par(mfrow=c(1,2))
> acf(d.d12.lbeer, ylim=c(-1,1))
> pacf(d.d12.lbeer, ylim=c(-1,1), main="PACF")
[Figure: ACF (left) and PACF (right) of the seasonal-trend differenced logged beer production $Z_t$]
There is very clear evidence that series $Z_t$ is serially dependent, and we could try
an ARMA(p, q) to model this dependence. As for the choice of the order, this is not
simple on the basis of the above correlograms. They suggest that high values for
p and q are required, and model fitting with subsequent residual analysis and
AIC inspection confirms this: p = 14 and q = 11 yield a good result.
It is (not so much in the above, but generally when analyzing data of this type)
quite striking that the ACF and PACF coefficients have large values at multiples of
the period s . This is very typical behavior for seasonally differenced series, in fact
it originates from the evolution of resp. changes in the seasonality over the years.
A simple model accounting for this is the so-called airline model:
$$Z_t = (1 + \beta_1 B)(1 + \Theta_1 B^{12})\,E_t = (1 + \beta_1 B + \Theta_1 B^{12} + \beta_1\Theta_1 B^{13})\,E_t = E_t + \beta_1 E_{t-1} + \Theta_1 E_{t-12} + \beta_1\Theta_1 E_{t-13}$$
This is an MA(13) model, where many of the coefficients are equal to 0. Because it
was made up of an MA(1) with B as the operator in the characteristic polynomial,
and another one with $B^s$ as the operator, we call this a $SARIMA(0,1,1)(0,1,1)_{12}$.
This idea can be generalized: we fit AR and MA parts with both $B$ and $B^s$ as
operators in the characteristic polynomials, which again results in a high-order
ARMA model for $Z_t$.
In general, we fit a model of the form

$$\Phi(B)\,\Phi_S(B^s)\,Z_t = \Theta(B)\,\Theta_S(B^s)\,E_t,$$

where $\Phi_S(\cdot)$ and $\Theta_S(\cdot)$ are the seasonal AR and MA polynomials in $B^s$.
Fortunately, it turns out that usually d = D = 1 is enough. As for the model orders
p, q, P, Q, the choice can be made on the basis of ACF and PACF, by searching
for cut-offs. Mostly, these are far from evident, and thus an often applied
alternative is to consider all models with $p, q, P, Q \leq 2$ and to do an AIC-based
grid search; the function auto.arima() may be very handy for this task.
For our example, the $SARIMA(2,1,2)(2,1,2)_{12}$ has the lowest AIC value and also
shows satisfactory residuals, although it seems to perform slightly less well than the
$SARIMA(14,1,11)(0,1,0)_{12}$. The R command for the former is:
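A hedged sketch of that command (assuming the logged series lbeer from above):

> fit <- arima(lbeer, order=c(2,1,2), seasonal=c(2,1,2))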
As it was mentioned in the introduction to this section, one of the main advantages
of ARIMA and SARIMA models is that they allow for quick and convenient
forecasting. While this will be discussed in depth later in section 8, we here
provide a first example to show the potential.
From the logged beer production data, the last 2 years were omitted before the
SARIMA model was fitted to the (shortened) series. On the basis of this model, a
2-year-forecast was computed, which is displayed by the red line in the plot above.
The original data are shown as a solid (insample, 1958-1988) line, respectively as
a dotted (out-of-sample, 1989-1990) line. We see that the forecast is reasonably
accurate.
To facilitate the fitting of SARIMA models, we finish this chapter by providing some
guidelines:
2) Do a time series plot of the output of the above step. Decide whether it is
stationary, or whether additional differencing at lag 1 is required to remove
a potential trend. If not, then d = 0, and proceed. If yes, d = 1 is enough for
most series.
3) From the output of step 2, i.e. series Zt , generate ACF and PACF plots to
study the dependency structure. Look for coefficients/cut-offs at low lags
that indicate the direct, short-term dependency and determine orders p
and q . Then, inspect coefficients/cut-offs at multiples of the period s , which
imply seasonal dependency and determine P and Q .
4) Fit the model using procedure arima(). In contrast to ARIMA fitting, this is
now exclusively done on the original series, with setting the two arguments
order=c(p,d,q) and seasonal=c(P,D,Q) accordingly.
5) Check the adequacy of the fitted model by residual analysis. The residuals
must look like White Noise. If there is still ambiguity in the model order, AIC
analysis can serve to come to a final decision.
Next, we turn our attention to series that have neither trend nor seasonality, but
show serial dependence in the conditional variance.
However, that matter is different with the SMI log-returns: here, there are periods
of increased volatility, and thus the conditional variance of the series is serially
correlated, a phenomenon that is called conditional heteroskedasticity. This is not
a violation of the stationarity assumption, but some special treatment for this type
of series is required. Furthermore, the ACF of such series typically does not differ
significantly from White Noise. Still, the data are not iid, which can be shown with
the ACF of the squared observations. With the plots on the next page, we illustrate
the presence of these stylized facts for the SMI log-returns:
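A hedged sketch of how these four panels could be produced (assuming the log-returns are stored in lret.smi, the name that appears in the figure):

par(mfrow=c(2,2))
plot(lret.smi, ylab="lret.smi", main="SMI Log-Returns")
qqnorm(as.numeric(lret.smi)); qqline(as.numeric(lret.smi))
acf(lret.smi, main="ACF of Log-Returns")
acf(lret.smi^2, main="ACF of Squared Log-Returns")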
[Figure: time series plot and Normal QQ-plot of the SMI log-returns (top row), ACF of the log-returns and ACF of the squared log-returns (bottom row)]
$$X_t = \mu_t + E_t,$$

The simplest and most intuitive way of doing this is to use an autoregressive model
for the variance process. Thus, a series $E_t$ is first-order autoregressive conditionally
heteroskedastic, denoted as ARCH(1), if:

$$E_t = W_t\sqrt{\alpha_0 + \alpha_1 E_{t-1}^2}.$$
Here, $W_t$ is a White Noise process with mean zero and unit variance. The two
parameters $\alpha_0, \alpha_1$ are the model coefficients. An ARCH(1) process shows
volatility, as can easily be derived:

$$\mathrm{Var}(E_t) = E[E_t^2] = E[W_t^2]\,E[\alpha_0 + \alpha_1 E_{t-1}^2] = E[\alpha_0 + \alpha_1 E_{t-1}^2] = \alpha_0 + \alpha_1\,\mathrm{Var}(E_{t-1})$$
[Figure: ACF (left) and PACF (right) of the squared SMI log-returns, lags up to 30]
In our case, the analysis of ACF and PACF of the squared log-returns suggests
that the variance may be well described by an AR(2) process. This is not what we
had discussed, but an extension exists. An ARCH(p) process is defined by:

$$E_t = W_t\sqrt{\alpha_0 + \sum_{i=1}^{p}\alpha_i E_{t-i}^2}$$
Fitting in R can be done using procedure garch(). This is a more flexible tool,
which also allows for fitting GARCH processes, as discussed below. The
command in our case is as follows:
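A hedged sketch of that command (garch() is from library(tseries); in its order argument, the second entry is the ARCH order):

> library(tseries)
> fit.arch <- garch(lret.smi, order=c(0,2))
> summary(fit.arch)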
For verifying appropriate fit of the ARCH (2) , we need to check the residuals of the
fitted model. This includes inspecting ACF and PACF for both the “normal” and the
squared residuals. We here do without showing plots, but the ARCH (2) is OK.
A GARCH process generalizes the ARCH idea by letting the conditional variance $H_t$
also depend on its own past values, i.e. $E_t = W_t\sqrt{H_t}$ with
$$H_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i E_{t-i}^2 + \sum_{j=1}^{p} \beta_j H_{t-j}$$
7 Time Series Regression
The regression coefficients $\beta_1, \ldots, \beta_p$ are usually estimated with the least squares
algorithm, for which an error term with zero expectation, constant variance and no
correlation is assumed. However, if response and predictors are time series with
autocorrelation, the last condition often turns out to be violated, though this is not
always the case.
Now, if we are facing a (time series) regression problem with correlated errors, the
estimates $\hat{\beta}_j$ remain unbiased, but the least squares algorithm is no
longer efficient. Or in other words: estimators with higher precision exist. Even
more problematic are the standard errors of the regression coefficients $\hat{\beta}_j$: they
are often grossly wrong in case of correlated errors. As they are routinely
underestimated, inference on the predictors often yields spurious significance, i.e.
one is prone to drawing false conclusions from the analysis.
Thus, there is a need for more general linear regression procedures that can deal
with serially correlated errors, and fortunately, they exist. We will here discuss the
simple, iterative Cochrane-Orcutt procedure and the Generalized Least Squares
method, which marks a theoretically sound approach to regression with correlated
errors. But first, we present some time series regression problems to illustrate
what we are dealing with.
In climate change studies, time series of global temperature values are analyzed.
The scale of measurement is anomalies, i.e. the difference between the monthly
global mean temperature and the overall mean between 1961 and 1990. The
data can be obtained at http://www.cru.uea.ac.uk/cru/data. For illustrative
purposes, we here restrict ourselves to the period from 1971 to 2005, which
corresponds to a series of 420 records. For a time series plot, see the next page.
There is a clear trend which seems to be linear. Despite being measured monthly,
the data show no evident seasonality. This is not overly surprising, since we are
considering a global mean, i.e. the season should not make for a big difference.
But on the other hand, because the landmass is not uniformly distributed over both
halves of the globe, it could still be present. It is natural to try a season-trend-
decomposition for this series. We will employ a parametric model featuring a linear
trend plus a seasonal factor.
$$Y_t = \beta_0 + \beta_1 \cdot t + \beta_2 \cdot 1_{[\text{month}=\text{"Feb"}]} + \ldots + \beta_{12} \cdot 1_{[\text{month}=\text{"Dec"}]} + E_t,$$
where $t = 1, \ldots, 420$ and $1_{[\text{month}=\text{"Feb"}]}$ is a dummy variable that takes the value 1 if an
observation is from the month of February, and zero otherwise. Clearly, this is a time series
regression model. The response $Y_t$ is the global temperature anomaly, and even
the predictors, i.e. the time and the dummies, can be seen as time series, albeit
simple ones.
As we have seen previously, the goal with such parametric decomposition models
is to obtain a stationary remainder term $E_t$. But stationary does not necessarily
mean White Noise, and in practice it often turns out that $E_t$ shows some serial
correlation. Thus, if the regression coefficients are obtained from the least squares
algorithm, we are apparently operating under a violated assumption.
In this second example, we consider a time series that is stationary, and where the
regression aims at understanding the series, rather than decomposing it into some
deterministic and random components. We examine the dependence of a
photochemical pollutant (morning maximal value) on the two meteorological
variables wind and temperature. The series, which consist of 30 observations
taken on consecutive days, come from the Los Angeles basin. They are not
publicly available, but can be obtained from the lecturer upon request.
[Figure: time series plots of the oxidant, wind and temperature series.]
There is no counterevidence to stationarity for all three series. What could be the
goal here? Well, we aim for enhancing the understanding of how the pollution
depends on the meteorology, i.e. what influence wind and temperature have on
the oxidant values. We can naturally formulate the relation with a linear regression
model:
$$Y_t = \beta_0 + \beta_1 x_{t,1} + \beta_2 x_{t,2} + E_t.$$
In this model, the response $Y_t$ is the oxidant, and as predictors we have $x_{t,1}$, the wind,
and $x_{t,2}$, the temperature. For the index, we have $t = 1, \ldots, 30$, and obviously, this is
a time series regression model.
For gaining some more insight into these data, it is also instructive to visualize them
using a pairs plot, as shown on the next page. There, a strong, positive linear
association between oxidant and temperature is visible, whereas the association
between oxidant and wind is negative.
[Figure: pairs plot of Oxidant, Wind and Temp.]
For achieving our practical goals with this dataset, we require precise and
unbiased estimates of the regression coefficients $\beta_1$ and $\beta_2$. Moreover, we would
like to make statements about the significance of the predictors, and thus we
require sound standard errors for the estimates. However, also with these
data, it is well conceivable that the error term $E_t$ is serially correlated. Thus,
again, we will require a procedure that can account for this.
The two examples have shown that time series regression models appear
when decomposing series, but can also be important when we try to understand
the relation between response and predictors with measurements that were taken
sequentially. Generally, with the model
$$Y_t = \beta_0 + \beta_1 x_{t,1} + \ldots + \beta_p x_{t,p} + E_t,$$
we assume that the influence of the series $x_{t,1}, \ldots, x_{t,p}$ on the response $Y_t$ is
simultaneous. Nevertheless, lagged variables are also allowed, i.e. we can also
use terms such as $x_{t-k,j}$ with $k > 0$ as predictors. While this generalization can be
easily built into our model, one quickly obtains models with many unknown
parameters. Thus, when exploring the dependence of a response series on lags of
some predictor series, there are better approaches than regression. In particular,
these are the cross correlations and the transfer function models, which will be
exhibited in later chapters of this script.
In fact, there are not many restrictions for the time series regression model. As we
have seen, it is perfectly valid to have non-stationary series as either the response
or as predictors. However, it is crucial that there is no feedback from Yt to the xtj .
Additionally, the error Et must be independent of the explanatory variables, but it
may exhibit serial correlation.
Our goal is the decomposition of the global temperature series into a linear trend
plus some seasonal factor. First and foremost, we prepare the data:
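A minimal sketch of one way to do this; the object name temp.global is an assumption, while the column names match the output further below:
> ## temp.global: assumed monthly ts object holding the anomalies, 1971-2005
> dat <- data.frame(temp = as.numeric(temp.global))
> dat$time <- as.numeric(time(temp.global))
> dat$season <- factor(cycle(temp.global), labels = month.abb)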
The regression model is then estimated with R's function lm(). The summary()
function returns estimates and standard errors, plus the results from some hypothesis
tests. It is important to notice that all of these results are in question should the
errors turn out to be correlated.
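A sketch of the fit that produces output of the kind shown below:
> fit.lm <- lm(temp ~ time + season, data = dat)   # OLS fit of the decomposition model
> summary(fit.lm)                                  # estimates, standard errors, tests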
Call:
lm(formula = temp ~ time + season, data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.36554 -0.07972 -0.00235 0.07497 0.43348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.603e+01 1.211e+00 -29.757 <2e-16 ***
time 1.822e-02 6.089e-04 29.927 <2e-16 ***
seasonFeb 6.539e-03 3.013e-02 0.217 0.8283
seasonMar -1.004e-02 3.013e-02 -0.333 0.7392
seasonApr -1.473e-02 3.013e-02 -0.489 0.6252
seasonMay -3.433e-02 3.013e-02 -1.139 0.2552
As the next step, we need to perform some residual diagnostics. The plot()
function, applied to a regression fit, serves as a check for zero expectation,
constant variation and normality of the errors, and can give hints on potentially
problematic leverage points.
> par(mfrow=c(2,2))
> plot(fit.lm, pch=20)
[Figure: the four standard residual diagnostic plots for fit.lm (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
Except for some slightly long-tailed errors, which do not require any action,
the residual plots look fine. What has not yet been verified is whether there is any
serial correlation among the residuals. If we wish to see a time series plot of the
residuals, the following commands are useful:
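A minimal sketch of such commands (not the script's original code):
> plot(ts(resid(fit.lm), start = c(1971, 1), frequency = 12), ylab = "Residuals")
> abline(h = 0, col = "grey")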
It is fairly obvious from the time series plot that the residuals are correlated.
However, our main tools for describing the dependency structure are the ACF and
PACF plots. These are as follows:
> par(mfrow=c(1,2))
> acf(resid(fit.lm), main="ACF of Residuals")
> pacf(resid(fit.lm), main="PACF of Residuals")
[Figure: ACF and PACF of the residuals from the OLS fit.]
The ACF shows a rather slow exponential decay, whereas the PACF shows a
clear cut-off at lag 2. With these stylized facts, it might well be that an AR (2)
model is a good description for the dependency among the residuals. We verify
this:
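A sketch of how this can be done; the object name fit.ar2 matches the one used below, but the exact call is an assumption:
> fit.ar2 <- ar.burg(resid(fit.lm))   # Burg's algorithm with AIC-based order selection
> fit.ar2$order                       # order p = 2 is selected here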
When using Burg’s algorithm for parameter estimation and doing model selection
by AIC, order 2 also turns out to be optimal. For verifying an adequate fit, we
visualize the residuals from the AR (2) model. These need to look like White Noise.
[Figure: time series plot, ACF and PACF of the residuals from the AR(2) fit.]
There is no contradiction to the White Noise hypothesis for the residuals from the
AR(2) model. Thus, we can summarize as follows: the regression model that was
used for decomposing the global temperature series into a linear trend and a
seasonal factor exhibits correlated errors that seem to originate from an AR(2)
process. Theory tells us that the point estimates for the regression coefficients are
still unbiased, but they are no longer efficient. Moreover, the standard errors for
these coefficients can be grossly wrong. Thus, we need to be careful with the
regression summary displayed above. And since our goal is inferring the
significance of trend and seasonality, we need to come up with a better-suited method.
Now, we are dealing with the air pollution data. Again, we begin our regression
analysis using the standard assumption of uncorrelated errors. Thus, we start out
by applying the lm() function and printing the summary().
Call:
lm(formula = Oxidant ~ Wind + Temp, data = dat)
Residuals:
Min 1Q Median 3Q Max
-6.3939 -1.8608 0.5826 1.9461 4.9661
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.20334 11.11810 -0.468 0.644
Wind -0.42706 0.08645 -4.940 3.58e-05 ***
Temp 0.52035 0.10813 4.812 5.05e-05 ***
---
We will do without showing the 4 standard diagnostic plots, and here only report
that they do not show any model violations. Because we are performing a time
series regression, we also need to check for potential serial correlation of the
errors. As before, this is done on the basis of a time series plot, ACF and PACF:
[Figure: time series plot, ACF and PACF of the residuals from the air pollution regression.]
Also in this example, the time series of the residuals exhibits serial dependence.
Because the ACF shows an exponential decay and the PACF cuts off at lag 1, we
hypothesize that an AR(1) model is a good description. While the AIC criterion
suggests an order of $p = 14$, the residuals of an AR(1) already show the behavior of
White Noise. Additionally, using an AR(14) would spend too many degrees of
freedom for a series with only 30 observations.
Thus, we can summarize that also in our second example with the air pollution
data, we feature a time series regression that has correlated errors. Again, we
must not communicate the above regression summary and for sound inference,
we require more sophisticated models.
A formal test for autocorrelation in the errors of a regression model is the Durbin-Watson test. Its test statistic is
$$\hat{D} = \frac{\sum_{t=2}^{n}(r_t - r_{t-1})^2}{\sum_{t=1}^{n} r_t^2},$$
where $r_t = y_t - \hat{y}_t$ is the residual from the regression model, observed at time $t$.
There is an approximate relationship between the test statistic $\hat{D}$ and the
autocorrelation coefficient at lag 1:
$$\hat{D} \approx 2(1 - \hat{\rho}(1))$$
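In R, the test is available as dwtest() in package lmtest; a minimal sketch (the two fitted OLS objects are both referred to as fit.lm below):
> library(lmtest)
> dwtest(fit.lm)   # Durbin-Watson test on the residuals of the OLS fit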
> dwtest(fit.lm)
data: fit.lm
DW = 0.5785, p-value < 2.2e-16
alt. hypothesis: true autocorrelation is greater than 0
> dwtest(fit.lm)
data: fit.lm
DW = 1.0619, p-value = 0.001675
alt. hypothesis: true autocorrelation is greater than 0
Thus, the null hypothesis is rejected in both examples, and we come to the same
conclusion ("errors are correlated") as with our visual inspection. It is very
important to note that this need not always happen: in cases where the errors follow an
AR(p) process with $p > 1$ and where $|\rho(1)|$ is small, the null hypothesis will not be
rejected, despite the fact that the errors are correlated.
The fundamental trick on which in fact all time series regression methods are
based will be presented here. Assuming AR(1) errors, i.e. $E_t = \alpha E_{t-1} + U_t$ with iid $U_t$,
we make the transformation:
$$Y_t' = Y_t - \alpha Y_{t-1}$$
Next, we plug in the model equation and rearrange the terms. Finally, we build on
the fundamental property that $E_t - \alpha E_{t-1} = U_t$. The result is:
$$Y_t' = \beta_0(1-\alpha) + \beta_1 x_{t,1}' + \ldots + \beta_p x_{t,p}' + U_t, \quad \text{with } x_{t,j}' = x_{t,j} - \alpha x_{t-1,j}.$$
Obviously, this is a time series regression problem where the error term $U_t$ is iid.
Also note that both the response and the predictors have undergone a
transformation. The coefficients $\beta_1, \ldots, \beta_p$, however, are identical in both the original
and the modified regression equation; only the intercept changes to $\beta_0^* = \beta_0(1-\alpha)$.
For implementing this approach in practice, we require knowledge of the AR(1)
parameter $\alpha$. Usually, it is not known in advance. A simple idea to overcome this
and solve the time series regression problem for the air pollution data is as follows:
1) Fit an OLS regression on the original, untransformed data and determine the residuals $r_t$.
2) Fit an AR(1) model to these residuals and obtain an estimate $\hat{\alpha}$ of the autoregressive parameter.
3) Determine the prime variables $Y'$, $x'$ and derive $\hat{\beta}_0^*, \hat{\beta}_1, \ldots, \hat{\beta}_p$ by OLS.
This procedure is known as the Cochrane-Orcutt iterative method. Please note that
the estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ and their standard errors from the OLS regression in
step 3) are sound and valid. But while the Cochrane-Orcutt procedure has its
historical importance and is very nice for illustration, it lacks a direct R
implementation and, being an iterative procedure, also a certain mathematical
closedness and quality. The obvious improvement is to solve the prime regression
problem by simultaneous Maximum Likelihood estimation of all parameters:
$$\beta_0, \ldots, \beta_p; \; \alpha; \; \sigma_U^2$$
Generalized Least Squares
In matrix notation, the regression model is $y = X\beta + E$, where the error vector $E$ has
covariance matrix $Var(E) = \sigma^2\Sigma$. Writing $\Sigma = SS^T$ and multiplying the model
equation by $S^{-1}$ yields
$$S^{-1}y = S^{-1}X\beta + S^{-1}E, \quad \text{i.e.} \quad y' = X'\beta + E',$$
a regression problem with uncorrelated errors $E'$.
With some algebra, it is easy to show that the estimated regression coefficients for
the generalized least squares approach turn out to be:
$$\hat{\beta} = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y$$
This is what is known as the generalized least squares estimator. Moreover, the
covariance matrix of the coefficient vector is:
$$Var(\hat{\beta}) = (X^T \Sigma^{-1} X)^{-1}\sigma^2$$
This covariance matrix then also contains standard errors in which the correlation
of the errors has been accounted for, and with which sound inference is possible.
However, while this all lines up neatly, we of course require knowledge of the
error covariance matrix $\Sigma$, which is generally unknown in practice. What we can
do is estimate it from the data, for which two approaches exist.
Cochrane-Orcutt Method
As we have seen above, this method is iterative: it starts with an ordinary least
squares (OLS) regression, from which the residuals are determined. To these
residuals, we can then fit an appropriate ARMA(p,q) model with estimated
coefficients $\hat{\alpha}_1, \ldots, \hat{\alpha}_p$ and $\hat{\beta}_1, \ldots, \hat{\beta}_q$. On the basis of the estimated
ARMA model coefficients, an estimate of the error covariance matrix $\Sigma$ can be
derived. We denote it by $\hat{\Sigma}$ and plug it into the formulae presented above. This
yields adjusted regression coefficients and correct standard errors for these
regression problems. As mentioned above, the iterative approach is secondary to
a simultaneous MLE. Thus, we refrain from further performing Cochrane-Orcutt on
our examples.
Every GLS regression analysis starts by fitting an OLS and analyzing the residuals,
as we have done above. Remember that the only model violation we found was
correlated residuals that were well described with an AR(2) model. Please note
that for performing GLS, we need to provide a dependency structure for the errors.
Here, this is the AR(2) model; in general, it is an appropriate ARMA(p,q). The
syntax and output are as follows:
> library(nlme)
> corStruct <- corARMA(form=~time, p=2)
> fit.gls <- gls(temp~time+season, data=dat, corr=corStruct)
> fit.gls
Generalized least squares fit by REML
Model: temp ~ time + season
Data: dat
Log-restricted-likelihood: 366.3946
Coefficients:
(Intercept) time seasonFeb seasonMar
-39.896981987 0.020175528 0.008313205 -0.006487876
seasonApr seasonMay seasonJun seasonJul
-0.009403242 -0.027232895 -0.017405404 -0.015977913
seasonAug seasonSep seasonOct seasonNov
-0.011664708 -0.024637218 -0.036152584 -0.048582236
seasonDec
-0.025326174
The result reports the regression and the AR coefficients. Using the summary()
function, even more output with all the standard errors can be generated. We omit
this here and instead focus on the necessary residual analysis for the GLS model.
We can extract the residuals using the usual resid() command. Important: these
residuals must not look like White Noise, but rather like realizations from an ARMA(p,q)
with orders p and q as provided in the corStruct object – which in our case is an AR(2).
> par(mfrow=c(1,2))
> acf(resid(fit.gls), main="ACF of GLS-Residuals")
> pacf(resid(fit.gls), main="PACF of GLS-Residuals")
[Figure: ACF and PACF of the GLS residuals.]
The plots look similar to the ACF/PACF plots of the OLS residuals. This is often
the case in practice; only in more complex situations can there be a bigger
discrepancy. And because we observe an exponential decay in the ACF and a
clear cut-off at lag 2 in the PACF, we conjecture that the GLS residuals meet the
properties of the correlation structure we hypothesized, i.e. of an AR(2) model.
Thus, we can now use the GLS fit for drawing inference. We first compare the
OLS and GLS point estimate for the trend, along with its confidence interval:
> coef(fit.lm)["time"]
time
0.01822374
> confint(fit.lm, "time")
2.5 % 97.5 %
time 0.01702668 0.0194208
> coef(fit.gls)["time"]
time
0.02017553
> confint(fit.gls, "time")
2.5 % 97.5 %
time 0.01562994 0.02472112
We obtain an estimated temperature increase of 0.0182 degrees per year with OLS, and
of 0.0202 with GLS. While this may seem academic, the difference between the
point estimates is around 10%, and theory tells us that the GLS result is the more
reliable one. Moreover, the length of the confidence interval is 0.0024 with OLS,
and 0.0091, thus almost four times as large, with GLS. Hence, without accounting for
the dependency among the errors, the precision of the trend estimate is
overestimated by far. Nevertheless, the confidence interval obtained from the GLS
regression does not contain the value 0 either, and thus the null hypothesis of no global
warming trend is rejected (by a clear margin).
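The outputs below compare the significance of the seasonal factor under OLS and GLS. A sketch of commands that produce output of this type; the use of drop1() with an F-test for the OLS fit is an assumption based on the output format:
> drop1(fit.lm, test = "F")   # F-tests for dropping time or season from the OLS fit
> anova(fit.gls)              # approximate F-tests for the GLS fit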
Model:
temp ~ time + season
Df Sum of Sq RSS AIC F value Pr(F)
<none> 6.4654 -1727.0
time 1 14.2274 20.6928 -1240.4 895.6210 <2e-16 ***
season 11 0.1744 6.6398 -1737.8 0.9982 0.4472
> anova(fit.gls)
Denom. DF: 407
numDF F-value p-value
(Intercept) 1 78.40801 <.0001
time 1 76.48005 <.0001
season 11 0.64371 0.7912
As for the trend, the qualitative result is identical for OLS and GLS. The seasonal effect is
non-significant, with p-values of 0.45 (OLS) and 0.79 (GLS). Again, the latter is the
value we should rely on. We conclude that there is no seasonality in global
warming – but there is a trend.
For finishing the air pollution example, we also perform a GLS fit here. We
identified an AR (1) as the correct dependency structure for the errors. Thus, we
define it accordingly:
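A minimal sketch of the corresponding definitions; the use of corAR1() and a formula object named model are assumptions guided by the output below:
> library(nlme)
> model <- formula(Oxidant ~ Wind + Temp)
> fit.gls <- gls(model, data = dat, correlation = corAR1())   # AR(1) error structure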
> fit.gls
Generalized least squares fit by REML
Model: model
Data: dat
Log-restricted-likelihood: -72.00127
Coefficients:
(Intercept) Wind Temp
-3.7007024 -0.4074519 0.4901431
Again, we have to check if the GLS residuals show the stylized facts of an AR (1) :
[Figure: ACF and PACF of the GLS residuals for the air pollution example.]
This is the case, and thus we can draw inference from the GLS results. The
confidence intervals of the regression coefficients are:
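A sketch of how these can be obtained and compared:
> confint(fit.lm)    # OLS confidence intervals
> confint(fit.gls)   # GLS confidence intervals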
The differences between the point estimates and the confidence intervals are not very
pronounced here. This has to do with the fact that the correlation among the errors is
weaker than in the global temperature example. But we emphasize again that the
GLS results are the ones to be relied on, and that the magnitude of the difference
between OLS and GLS can be tremendous.
Simulation Study
We consider the artificial predictor $x_t = t/50$ and the response
$$y_t = x_t + 2\cdot(x_t)^2 + E_t,$$
where $E_t$ is from an AR(1) process with $\alpha_1 = 0.65$. The innovations are Gaussian
with $\sigma = 0.1$. Regression coefficients and the true standard deviations of the
estimators are known in this extraordinary situation. However, we generate 100
realizations with series length $n = 50$, perform OLS and GLS regression on each,
and record both the point estimates and the standard errors.
[Figure: boxplots of the point estimates (left) and the estimated standard errors (right) for the OLS and GLS estimators.]
The simulation outcome is displayed by the boxplots in the figure above. While the
point estimator for $\beta_1$ in the left panel is unbiased for both OLS and GLS, we
observe that the standard error for $\hat{\beta}_1$ is very poor when the error correlation is not
accounted for. We emphasize again that OLS regression with correlated errors is
prone to yielding spuriously significant predictors and thus false conclusions.
Hence, it is absolutely key to inspect for possible autocorrelation in the regression
residuals and to apply the gls() procedure if necessary.
However, while gls() can cure the problem of autocorrelation in the error term, it
does not solve the issue from the root. Sometimes, even this is possible. In the
next subsection, we conclude the chapter about time series regression by showing
how correlated errors can originate, and what practice has to offer for deeper
understanding of the problem.
Ski Sales
[Figure: ski sales data with the fitted regression line.]
The coefficient of determination is rather large, i.e. $R^2 = 0.801$, and the linear fit
seems adequate, i.e. a straight line seems to correctly describe the systematic
relation between sales and PDI. However, the model diagnostic plots (see the next
page) show some rather special behavior: there are hardly any "small"
residuals (in absolute value). Or more precisely, the data points almost lie on two
lines around the regression line, with almost no points near or on the line itself.
[Figure: the four standard residual diagnostic plots for the ski sales regression.]
As the next step, we analyze the correlation of the residuals and perform a Durbin-
Watson test. The result is as follows:
> dwtest(fit)
data: fit
DW = 1.9684, p-value = 0.3933
alt. hypothesis: true autocorrelation is greater than 0
[Figure: ACF and PACF of the residuals from the ski sales regression.]
While the Durbin-Watson test does not reject the null hypothesis, the residuals
seem very strongly correlated. The ACF exhibits some decay that may still qualify
as exponential, and the PACF has a clear cut-off at lag 2. Thus, an AR(2) model
could be appropriate. And because it is an AR(2) where $\alpha_1$ and $\rho(1)$ are very
small, the Durbin-Watson test fails to detect the dependence in the residuals. The
time series plot of the residuals is as follows:
[Figure: time series plot of the residuals from the ski sales regression.]
While we could now account for the error correlation with a GLS, it is always better
to identify the reason behind the dependence. I admit this is suggestive here, but
as mentioned in the introduction of this example, these are quarterly data and we
might have forgotten to include the seasonality. It is not surprising that ski sales
are much higher in fall and winter and thus, we introduce a factor variable which
takes the value 0 in spring and summer, and 1 else.
[Figure: scatter plot of sales versus PDI, with observations labeled by the seasonal dummy (0 = spring/summer, 1 = fall/winter).]
Introducing the seasonal factor variable amounts to fitting two parallel regression
lines for the winter and summer terms. Eyeballing already suggests that the fit
is good. This is confirmed when we visualize the diagnostic plots:
[Figure: the four standard residual diagnostic plots for the ski sales regression including the seasonal dummy.]
The unwanted structure is now gone, as is the correlation among the errors:
[Figure: ACF and PACF of the residuals from the extended ski sales regression.]
Apparently, the addition of the season as an additional predictor has removed the
dependence in the errors. Rather than using GLS, a sophisticated estimation
procedure, we have found a simple model extension that describes the data well
and is certainly easier to interpret (especially when it comes to prediction) than a
model that is built on correlated errors.
We conclude by saying that using GLS for modeling dependent errors should only
take place if care has been taken that no important and/or obvious predictors are
missing in the model.
8 Forecasting
One of the principal goals with time series analysis is to produce predictions which
show the future evolution of the data. This is what it is: an extrapolation in the time
domain. And as we all know, extrapolation is always (at least slightly) problematic
and can lead to false conclusions. Of course, this is no different with time series
forecasting.
The saying goes that the task we are faced with can be compared to driving a car by
looking into the rear-view mirror only. While this may work well on a wide
motorway that runs mostly straight and has only a few gentle bends, things get
more complicated as soon as there are some sharp and unexpected bends in the
road. Then, we would need to drive very slowly to stay on track. This all translates
directly to time series analysis. For series where the signal is much stronger than
the noise, accurate forecasting is possible. However, for noisy series, there is a
great deal of uncertainty in the predictions, and they are at best reliable for a very
short horizon.
From the above, one might conclude that the principal source of uncertainty is
inherent in the process, i.e. comes from the innovations. However, in practice, this
is usually different, and several other factors can threaten the reliability of any
forecasting procedure. In particular:
We need to be certain that the data generating process does not change
over time, i.e. continues in the future as it was observed in the past.
Keeping these general warnings in mind, we will now present several approaches
to time series forecasting. First, we deal with stationary processes and present,
how AR, MA and ARMA processes can be predicted. These principles can be
extended to the case of ARIMA and SARIMA models, such that forecasting series
with either trend and/or seasonality is also possible.
$$\hat{X}_{n+k,1:n} = E[X_{n+k} \mid X_1, \ldots, X_n]$$
In evaluating this term, we use the fact that the best forecast of all future
innovation terms $E_t$, $t > n$, is simply zero. We will be more specific in the following
subsections. Besides providing a point forecast with the conditional expectation, it
is in practice equally important to produce an interval forecast that makes a
statement about its precision.
For an AR(1) process
$$X_t = \alpha_1 X_{t-1} + E_t,$$
we can easily forecast the next instance of the time series from the observed
value of the previous one, as long as it is available. In particular:
$$\hat{X}_{n+1,1:n} = \alpha_1 x_n.$$
For the $k$-step forecast with $k > 1$, we need to repeatedly plug in the model
equation and use the fact that $E[E_{n+k} \mid X_1, \ldots, X_n] = 0$ for all $k > 0$. The result is
$$\hat{X}_{n+k,1:n} = \alpha_1^k \cdot x_n.$$
Apparently, for any stationary AR(1) process, the $k$-step forecast beyond the end
of a series depends on the last observation $x_n$ only and goes to zero exponentially
quickly. For practical implementation with real data, we would just plug in the
estimated model parameter $\hat{\alpha}_1$ and can so produce a forecast for an arbitrary horizon.
As always, a prediction is much more useful in practice if one knows how precise it
is. Under the assumption of Gaussian innovations, a 95% prediction interval can
be derived from the conditional variance $Var(X_{n+k} \mid X_1, \ldots, X_n)$. For the special case
of $k = 1$ we obtain:
$$\alpha_1 x_n \pm 1.96 \cdot \sigma_E.$$
Again, for practical implementation, we need to plug in $\hat{\alpha}_1$ and $\hat{\sigma}_E$. However, the
interval does not account for the uncertainty that arises from plugging in these
estimates, so the coverage of the interval will be smaller than 95%. By how much
this is the case largely depends on the quality of the estimates, i.e. the series
length $n$. For a $k$-step forecast, the theoretical 95% prediction interval is:
$$\alpha_1^k x_n \pm 1.96 \cdot \sigma_E \sqrt{1 + \sum_{j=1}^{k-1} \alpha_1^{2j}}.$$
Simulation Study
As we have argued above, the 95% prediction interval does not account for the
uncertainty in the parameter estimates, the choice of the model or the continuity of
the data generating process. We run a small simulation study for pointing out the
effect of plugging in the parameter estimates. It consists of generating a realization
of length $(n+1)$ from an AR(1) process with $\alpha_1 = 0.5$ and Gaussian innovations
with $\sigma_E = 1$. From the first $n$ data points, an AR(1) model, respectively the
parameters $\hat{\alpha}_1, \hat{\sigma}_E$, are estimated by MLE and the point forecast along with the
prediction interval is determined. Finally, it is checked whether the next instance
of the time series falls within the interval, from which an empirical coverage level
can be determined. This was done for $n = 20$, $n = 50$, $n = 100$ and $n = 200$.
Practical Example
We now apply the R functions that implement the above theory on the Beaver data
from section 4.4.3. An AR (1) is appropriate, and for estimating the coefficients, we
omit the last 14 observations from the data. These will be predicted, and the true
values will be used for verifying the prediction. The R commands for fitting the
model on the training data and producing the 14-step prediction are as follows:
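A minimal sketch; the name of the beaver temperature series (btemp) and the indexing of the training window are assumptions:
> btrain <- window(btemp, start = 1, end = 100)   # first 100 observations
> fit <- arima(btrain, order = c(1, 0, 0))        # AR(1) including a global mean
> fcast <- predict(fit, n.ahead = 14)             # 14-step forecast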
The forecast object is a list that has two components, pred and se, which
contain the point predictions and the predictions’ standard errors, respectively. We
now turn our attention to how the forecast is visualized:
[Figure: Beaver data with the 14-step forecast and its 95% prediction interval.]
One more issue requires some attention here: for the Beaver data, a pure AR(1)
process is not appropriate, because the global series mean is clearly different from
zero. The way out is to de-mean the series, then fit the model and produce
forecasts, and finally re-add the global mean. R does all this automatically. We
conclude by summarizing what we observe in the example: the forecast is based
on the last observed value $x_{100} = 36.76$, and from there approaches the global
series mean $\hat{\mu} = 36.86$ exponentially quickly. Because the estimated coefficient
$\hat{\alpha}_1 = 0.87$ is relatively close to one, the convergence to the global mean
takes some time.
For an AR(p) process, the 1-step forecast is
$$\hat{X}_{n+1,1:n} = \alpha_1 x_n + \alpha_2 x_{n-1} + \ldots + \alpha_p x_{n-p+1}.$$
The question is, what do we do for longer forecasting horizons? There, the
forecast is again based on the linear combination of the $p$ past instances. For the
ones with an index between 1 and $n$, the observed value $x_t$ is used. Else, if the
index exceeds $n$, we just plug in the forecasted values $\hat{x}_{t,1:n}$. Thus, the general
formula is:
$$\hat{X}_{n+k,1:n} = \sum_{i=1}^{p} \alpha_i \cdot E[X_{n+k-i} \mid X_1, \ldots, X_n].$$
Practical Example
We consider the lynx data for which we had identified an AR(11) as a suitable
model. Again, we use the first 100 observations for fitting the model and lay aside
the last 14, which are in turn used for verifying the result. Also, we do without
showing the R code, because it is identical to the one from the previous example.
We observe that the forecast tracks the general behavior of the series well, though
the level of the series is somewhat underestimated. This is, however, not due to
an "error" of ours; it is just that the actual values were higher than the past observations
suggested. We finish this section with some remarks:
The prediction intervals only account for the uncertainty caused by the
innovation variance, but not for the one caused by model misspecification
or by plugging in estimated parameters. Thus, in practice, a true 95% interval
would most likely be wider than shown above.
An MA(1) process is given by
$$X_t = E_t + \beta_1 E_{t-1}.$$
For horizons $k \geq 2$, both innovation terms in $X_{n+k} = E_{n+k} + \beta_1 E_{n+k-1}$ lie in the
future and are best predicted by zero. The best MA(1) forecast for horizons 2 and up is
thus zero. Remember that we require $E_t$ to be an innovation, and thus independent
of previous instances $X_s$, $s < t$, of the time series process. Next, we address the
1-step forecast. This is more problematic, because the above derivation leads to:
$$\hat{X}_{n+1,1:n} = \ldots = \beta_1 \cdot E[E_n \mid X_1, \ldots, X_n] \neq 0 \;\;\text{(generally)}$$
The 1-step forecast is generally different from zero. The term $E[E_n \mid X_1, \ldots, X_n]$ is
difficult to determine. Using some mathematical trickery, we can at least propose
an approximate value. The trick is to move the point of reference into the infinite
past, i.e. to condition on all previous instances of the MA(1) process. We denote
$$e_n := E[E_n \mid X_{-\infty}^{n}].$$
As this model equation contains past innovations, we face the same problems as
in section 8.1.3 when trying to derive the forecast for horizons $k \leq q$. These can be
mitigated if we again condition on the infinite past of the process.
$$\hat{X}_{n+1,1:n} = E[X_{n+1} \mid X_{-\infty}^{n}]$$
$$= \sum_{i=1}^{p} \alpha_i E[X_{n+1-i} \mid X_{-\infty}^{n}] + E[E_{n+1} \mid X_{-\infty}^{n}] + \sum_{j=1}^{q} \beta_j E[E_{n+1-j} \mid X_{-\infty}^{n}]$$
$$= \sum_{i=1}^{p} \alpha_i x_{n+1-i} + \sum_{j=1}^{q} \beta_j E[E_{n+1-j} \mid X_{-\infty}^{n}]$$
If we are aiming for k -step forecasting, we can use a recursive prediction scheme:
$$\hat{X}_{n+k,1:n} = \sum_{i=1}^{p} \alpha_i E[X_{n+k-i} \mid X_{-\infty}^{n}] + \sum_{j=1}^{q} \beta_j E[E_{n+k-j} \mid X_{-\infty}^{n}],$$
where
$$E[X_t \mid X_{-\infty}^{n}] = \begin{cases} x_t, & \text{if } t \leq n \\ \hat{X}_{t,1:n}, & \text{if } t > n \end{cases}
\qquad\text{and}\qquad
E[E_t \mid X_{-\infty}^{n}] = \begin{cases} e_t, & \text{if } 0 < t \leq n \\ 0, & \text{if } t > n \end{cases}$$
The terms $e_t$ are then determined as outlined above in section 8.1.3, and for the
model parameters, we plug in the estimates. This allows us to generate
any forecast from an ARMA(p,q) model that we wish. The procedure is also
known as the Box-Jenkins procedure, after the two researchers who first introduced it.
Next, we illustrate this with a practical example, though in R, things are quite
unspectacular: it is again the predict() procedure that is applied to a fit from
arima(); the Box-Jenkins scheme runs in the background.
Practical Example
We here consider the Douglas Fir data, which show the width of a tree's annual rings
over the period from 1107 to 1964. We choose to model the data without taking
differences first. The auto.arima() solution with the lowest AIC value turned out
to be an ARMA(4,1), which will be used for generating the forecasts. For illustrative
purposes, we choose to put the last 64 observations of the series aside so that we
can verify our predictions. Then, the model is fitted and the Box-Jenkins forecasts
are obtained. The result, including a 95% prediction interval, is shown below. The
R code used for producing the results follows thereafter.
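A sketch, assuming the ring width series is stored in a ts object named dfir:
> dtrain <- window(dfir, end = time(dfir)[length(dfir) - 64])   # hold out the last 64 values
> fit <- arima(dtrain, order = c(4, 0, 1))                      # ARMA(4,1)
> fcast <- predict(fit, n.ahead = 64)                           # Box-Jenkins forecasts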
We observe that the forecast approaches the global mean of the series very
quickly, in fact in an exponential decay. However, because there is an AR part in
the model, all forecasts will be different from the global mean (but only slightly so
for larger horizons). Then, since there is also an MA term in the model, all time
series observations down to the first one from 1107 have some influence on the
forecast. Again, the ARMA model combines the properties of pure AR and MA
processes. Regarding the quality of the forecast, we notice that it does not really
tell us much about the true evolution of the series. Furthermore, the prediction
intervals seem rather narrow. As it turns out, 12 out of 64 predictions (18.75%)
violate the 95% prediction interval. Is it bad luck or a problem with the model? We
will reconsider this in the discussion about ARIMA forecasting.
As we can see, the k -step forecast for the original data is the cumulative sum of
all forecasted terms of the differenced data. The formulae for the prediction
intervals in an ARIMA(p,1,q) forecast are difficult to derive and are beyond the
scope of this script. All we say at this moment is that the width of the prediction
interval does not converge, as it does for an ARMA(p,q), but grows indefinitely with
increasing forecasting horizon $k$. We illustrate this with a forecast for the Douglas
Fir data using the non-stationary ARIMA(1,1,1) model (red) and compare it to what
we had obtained when a stationary ARMA(4,1) was used (blue).
In this particular example, the ARMA(4,1) forecast is more accurate, and even the
empirical coverage of its prediction interval is closer to the 95% that are required
by construction. However, this is an observation on one single dataset; it
would be plain wrong to conclude that ARIMA forecasts are generally less
accurate than the ones obtained from stationary models.
However, it is crucial to understand what ARIMA forecasts can and cannot do.
Despite the fact that ARIMA models are meant for non-stationary time series,
the forecast will converge to a constant if $d = 1$. So in case of a series with a
deterministic, linear trend, a default ARIMA forecast will fail miserably, see the
example below. To be fair, however, we need to point out that default ARIMA
processes feature a unit root and are non-stationary, but they are not compatible with a
deterministic linear trend.
Trend
We assume a smooth trend for which we recommend linear extrapolation.
Seasonal Effect
We extrapolate the seasonal effect according to the last observed period.
Stationary Remainder
We fit an ARMA( p, q ) and determine the forecast as discussed above.
We illustrate the procedure on the Maine unemployment data. We will work with
the log-transformed data, for which an STL decomposition assuming a constant
seasonal effect was performed.
[Figure: STL decomposition of the logged Maine unemployment series into seasonal, trend and remainder.]
Fit a least squares regression line to the past trend values. The window
on which this fit happens is chosen such that it has the same length as the
forecasting horizon. In our particular example, where we want to forecast
the upcoming two years of the series, these are the last 24 data points. Or
in other words: for the trend forecast, we use the last observation as an
anchor point and predict with the average slope from the last two years.
Please note that the so-produced trend forecast is a recommendation, but not
necessarily the best solution. If some expert knowledge from the application field
suggests another trend extrapolation, then it may well be used. It is, however,
important to clearly declare how the trend forecast was determined. The following
code does the job, see the next page for the result:
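A sketch of the idea, assuming the STL decomposition is stored in an object named fit.stl:
> trend <- fit.stl$time.series[, "trend"]
> tt <- as.numeric(time(trend)); n <- length(trend)
> slope <- coef(lm(trend[(n-23):n] ~ tt[(n-23):n]))[2]   # average slope over the last 24 months
> t.fore <- trend[n] + slope * (1:24) / frequency(trend) # anchor at the last trend value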
The blue line in the light grey window indicates the average trend over the last two
years of observations. For the trend extrapolation, we use the last observed trend
value as the anchor, and then continue with the determined slope. Please note
that generally (though not in the stl() context), function loess() in R allows for
extrapolation if argument surface="direct" is set. However, according to the
author's experience, such trend extrapolations are often extreme and perform
worse than the linear extrapolation suggested here. As we have now solved
the issue with the trend, we are left with forecasting the seasonal effect and the
remainder term. For the former, things are trivial, as we assume that it stays as it
was during the last observed period. The R code for producing the forecast of the seasonal component is:
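A sketch under the same naming assumption:
> seas <- fit.stl$time.series[, "seasonal"]
> s.fore <- rep(seas[(length(seas)-11):length(seas)], times = 2)   # repeat the last period twice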
Hence, we only need to take care of the stationary remainder. Generally, this is a
stationary series that will be described with an ARMA(p,q), for which the
forecasting method has already been presented in section 8.1. Here, we
focus on the particular case at hand, where a simple solution is to recognize an
exponential decay in the ACF and a cut-off at lag 4 in the PACF, so that an AR(4)
model (here without using a global mean!) will be fitted. Residual analysis (not
shown here) indicates that the White Noise assumption for the estimated
innovation terms is justified in this case. Using the predict() command, we then
produce a 24-step forecast from the AR(4) for the stationary remainder.
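A sketch of this step, again with the assumed object names:
> rem <- fit.stl$time.series[, "remainder"]
> fit.rem <- arima(rem, order = c(4, 0, 0), include.mean = FALSE)   # AR(4) without global mean
> r.fore <- predict(fit.rem, n.ahead = 24)$pred                     # 24-step remainder forecast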
[Figure: ACF and PACF of the stationary remainder series.]
The final task is then to couple the forecasts for all three parts (trend, seasonal
component and remainder) to produce a 2-year-forecast for the original series with
the logged unemployment figures from the state of Maine. This is based on a
simple addition of the three components.
$$X_t = \mu_t + E_t,$$
$$a_t = \alpha x_t + (1-\alpha) a_{t-1}, \quad \text{with } 0 < \alpha < 1.$$
We can rewrite the weighted average equation in two further ways, which yields
insight into how exponential smoothing works. Firstly, we can write the level at
time $t$ as the sum of $a_{t-1}$ and the 1-step forecasting error and obtain the update
formula:
$$a_t = \alpha(x_t - a_{t-1}) + a_{t-1}$$
Secondly, by repeated substitution, we obtain
$$a_t = \alpha x_t + \alpha(1-\alpha)x_{t-1} + \alpha(1-\alpha)^2 x_{t-2} + \ldots$$
When written in this form, we see that the level $a_t$ is a linear combination of the
current and all past observations, with more weight given to recent observations.
The restriction $0 < \alpha < 1$ ensures that the weights $\alpha(1-\alpha)^i$ become smaller as $i$
increases. In fact, they decay exponentially and form a geometric series.
When the sum over these terms is taken to infinity, the result is 1. In practice, the
infinite sum is not feasible, but it can be avoided by specifying $a_1 = x_1$.
For any given smoothing parameter $\alpha$, the update formula plus the choice of
$a_1 = x_1$ as a starting value can be used to determine the level $a_t$ for all times
$t = 2, 3, \ldots$. The 1-step prediction errors $e_t$ are given by:
$$e_t = x_t - \hat{X}_{t,1:t-1} = x_t - a_{t-1}.$$
There is some mathematical theory that examines the quality of the SS1PE-minimizing
$\alpha$. Not surprisingly, this depends very much on the true, underlying
process. In practice, however, this value is reasonable and allows for good
predictions.
Practical Example
We here consider a time series that shows the number of complaint letters that
were submitted to a motoring organization over the four years 1996-1999. At the
beginning of year 2000, the organization wishes to estimate the current level of
complaints and investigate whether there was any trend in the past. We import the
data and do a time series plot:
The series is rather short, and there is no clear evidence for a deterministic trend
and/or seasonality. Thus, it seems sensible to use exponential smoothing here.
The algorithm that was described above is implemented in R’s HoltWinters()
procedure. Please note that HoltWinters() can do more than plain exponential
smoothing, and thus we have to set arguments beta=FALSE and gamma=FALSE.
If we do not specify a value for the smoothing parameter with argument alpha,
it will be estimated using the SS1PE criterion.
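A minimal sketch, assuming the complaints series is stored in a ts object named cmpl:
> fit <- HoltWinters(cmpl, beta = FALSE, gamma = FALSE)   # plain exponential smoothing
> fit        # shows the SS1PE-optimal alpha and the final level
> plot(fit)  # observed series together with the fitted (smoothed) values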
[Figure: Holt-Winters filtering of the complaints series, showing observed and fitted values.]
The output shows that the level in December 1999, i.e. $a_{48}$, is estimated as
17.70. The optimal value for $\alpha$ according to the SS1PE criterion is 0.143, and the
sum of squared prediction errors is 2502. Any other value for $\alpha$ would yield a
worse result, so we proceed and display the result visually.
$$a_t = \alpha(x_t - s_{t-p}) + (1-\alpha)(a_{t-1} + b_{t-1})$$
$$b_t = \beta(a_t - a_{t-1}) + (1-\beta)\, b_{t-1}$$
$$s_t = \gamma(x_t - a_t) + (1-\gamma)\, s_{t-p}$$
In the above equations, $a_t$ is again the level at time $t$, $b_t$ is called the slope and $s_t$
is the seasonal effect. There are three smoothing parameters $\alpha, \beta, \gamma$, which are
aimed at level, slope and season, respectively. The explanation of the equations is as follows:
The first updating equation for the level takes a weighted average of the
most recent observation, with the existing estimate of the previous period's
seasonal effect subtracted, and the 1-step level forecast at $t-1$, which
is given by level plus slope.
If nothing else is known, the typical choice for the smoothing parameters is
$\alpha = \beta = \gamma = 0.2$. Moreover, starting values for the updating equations are required.
Mostly, one chooses $a_1 = x_1$, the slope $b_1 = 0$, and the seasonal effects $s_1, \ldots, s_p$ are
either also set to zero or to the mean over the observations of the particular season.
When applying the R function HoltWinters(), the starting values are obtained
from the decompose() procedure, and it is possible to estimate the smoothing
parameters through SS1PE minimization. The most interesting aspect, though, are the
predictions: the $k$-step forecasting equation for $X_{n+k}$ at time $n$ is
$$\hat{X}_{n+k,1:n} = a_n + k\, b_n + s_{n+k-p},$$
i.e. the current level with linear trend extrapolation plus the appropriate seasonal
effect term. The following practical example nicely illustrates the method.
Practical Example
We here discuss the series of monthly sales (in thousands of litres) of Australian
white wine from January 1980 to July 1995. This series features a deterministic
trend; the most striking feature is the sharp increase in the mid-80s, followed by
a reduction to a distinctly lower level again. The magnitudes of both the seasonal
effect and the errors seem to increase with the level of the series and are
thus multiplicative rather than additive. We will cure this with a log-transformation of
the series, even though a multiplicative formulation of the Holt-Winters
algorithm exists, too.
> fit
Call: HoltWinters(x = log(aww))
Smoothing parameters:
alpha: 0.4148028
beta : 0
gamma: 0.4741967
Coefficients:
a    5.62591329
b    0.01148402
s1  -0.01230437
s2   0.01344762
s3   0.06000025
s4   0.20894897
s5   0.45515787
s6  -0.37315236
s7  -0.09709593
s8  -0.25718994
s9  -0.17107682
s10 -0.29304652
s11 -0.26986816
s12 -0.01984965
The coefficient values (at time n ) are also the ones which are used for forecasting
from that series with the formula given above. We produce a prediction up until the
end of 1998, which is a 29-step forecast. The R commands are:
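A minimal sketch based on the fit object shown above:
> pred <- predict(fit, n.ahead = 29)   # forecast until the end of 1998
> plot(fit, pred)                      # Holt-Winters fit together with the forecast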
[Figure: Holt-Winters fit and forecast for the logged Australian white wine sales, and the evolution of level, trend and seasonal component over time.]
The last plot above shows how level, trend and seasonality evolved
over time. However, since we are usually more interested in the prediction on the
original scale, i.e. in litres rather than log-litres of wine, we simply re-exponentiate the
values. Please note that the result is an estimate of the median rather than the
mean of the series. There are methods for correction, but the difference is usually
small.
Also, we note that the (in-sample) 1-step prediction error is equal to 50.04, which is
quite a reduction when compared to the series' standard deviation of 121.4.
Thus, the Holt-Winters fit has substantial explanatory power. Of course, it would
now be interesting to test the accuracy of the predictions. We recommend that
you, as an exercise, put aside the last 24 observations of the Australian white wine
data and run a forecasting evaluation where all the methods (SARIMA,
decomposition approaches, Holt-Winters) compete against each other.
9 Multivariate Time Series Analysis
Generally speaking, the goal of this section is to describe and understand the
(inter)dependency between two series. We introduce the basic concepts of cross
correlation and transfer function models, warn of arising difficulties in interpretation
and show how these can be mitigated.
We here consider one single borehole only; it is located near the famous Hörnli hut
at the base of the Matterhorn near Zermatt (CH), at 3295m above sea level. The air
temperature was recorded on the nearby Platthorn, at 3345m of elevation and
9.2km distance from the borehole. Data are available from the beginning of July 2006
to the end of September 2006. After the middle of the observation period, there is
a stretch of 23 days during which the ground was covered by snow, highlighted in
grey color in the time series plots on the next page.
Because the snow insulates the ground, we do not expect the soil to follow the air
temperature during that period. Hence, we set all values during that period equal
to NA. The time series plots, and especially the indexed plot where both series are
shown, clearly indicate that the soil reacts to the air temperature with a delay of a
few days. We now aim for analyzing this relationship on a more quantitative basis,
for which the methods of multivariate time series analysis will be employed.
[Figure: time series plots of the air and soil temperatures (in °C), with the snow cover period highlighted in grey, together with their ACFs.]
The ACF exhibits a slow decay, especially for the soil temperature. Thus, we
decide to perform lag 1 differencing before analyzing the series. This has another
advantage: we are then exploring how changes in the air temperature are
associated with changes in the soil temperature and if so, what the time delay is.
These results are easier to interpret than a direct analysis of air and soil
temperatures. Next, we display the differenced series with their ACF and PACF.
The observations during the snow cover period are now omitted.
[Figure: differenced air temperature series with its ACF and PACF.]
The differenced air temperature series seems stationary, but is clearly not iid.
There seems to be some strong negative correlation at lag 4. This may indicate
the properties of the meteorological weather patterns at that time of year in that
part of Switzerland. We now perform the same analysis for the changes in the soil
temperature.
[Figure: differenced soil temperature series with its ACF and PACF.]
In the course of our discussion of multivariate time series analysis, we will require
some ARMA(p,q) models fitted to the changes in air and soil temperature. For the
former series, model choice is not simple, as in both ACF and PACF the
coefficient at lag 4 sticks out. A grid search shows that an AR(5) model yields the
best AIC value, and also, the residuals from this model look as desired, i.e. it
seems plausible that they are White Noise.
For the changes in the soil temperature, model identification is easier. ACF and
PACF suggest either an MA(1), an ARMA(2,1) or an AR(2). Among these three
models, the MA(1) shows both the lowest AIC value and the "best looking"
residuals. Furthermore, it is the most parsimonious choice, and hence we go with it.
The cross covariances between the two processes $X_1$ and $X_2$ are given by:
$$\gamma_{12}(k) = Cov(X_{1,t+k}, X_{2,t}), \qquad \gamma_{21}(k) = Cov(X_{2,t+k}, X_{1,t}).$$
Note that owing to the stationarity of the two series, the cross covariances $\gamma_{12}(k)$
and $\gamma_{21}(k)$ do not depend on the time $t$. Moreover, there is some obvious
symmetry in the cross covariance:
$$\gamma_{12}(-k) = \gamma_{21}(k).$$
Thus, for practical purposes, it suffices to consider $\gamma_{12}(k)$ for positive and negative
values of $k$. Note that we will preferably work with correlations rather than
covariances, because they are scale-free and thus easier to interpret. We can
obtain the cross correlations by standardizing the cross covariances:
$$\rho_{12}(k) = \frac{\gamma_{12}(k)}{\sqrt{\gamma_{11}(0)\gamma_{22}(0)}}, \qquad \rho_{21}(k) = \frac{\gamma_{21}(k)}{\sqrt{\gamma_{11}(0)\gamma_{22}(0)}}.$$
Not surprisingly, we also have symmetry here, i.e. $\rho_{12}(-k) = \rho_{21}(k)$. Additionally,
the cross correlations are limited to the interval between -1 and +1, i.e. $|\rho_{12}(k)| \leq 1$.
As for the interpretation, $\rho_{12}(k)$ measures the linear association between two
values of $X_1$ and $X_2$, if the value of the first time series is $k$ steps ahead.
Concerning the estimation of cross covariances and cross correlations, we apply the
usual sample estimators:
$$\hat{\gamma}_{12}(k) = \frac{1}{n}\sum_t (x_{1,t+k} - \bar{x}_1)(x_{2,t} - \bar{x}_2) \quad\text{and}\quad \hat{\gamma}_{21}(k) = \frac{1}{n}\sum_t (x_{2,t+k} - \bar{x}_2)(x_{1,t} - \bar{x}_1),$$
where the summation index $t$ runs from 1 to $n-k$ for $k \geq 0$, and from $1-k$ to $n$ for
$k < 0$. With $\bar{x}_1$ and $\bar{x}_2$ we denote the mean values of $x_{1,t}$ and $x_{2,t}$,
respectively. We define the estimates of the cross correlations as
$$\hat{\rho}_{12}(k) = \frac{\hat{\gamma}_{12}(k)}{\sqrt{\hat{\gamma}_{11}(0)\hat{\gamma}_{22}(0)}}, \qquad \hat{\rho}_{21}(k) = \frac{\hat{\gamma}_{21}(k)}{\sqrt{\hat{\gamma}_{11}(0)\hat{\gamma}_{22}(0)}}.$$
The plot of $\hat{\rho}_{12}(k)$ against $k$ is called the cross-correlogram. Note that it must be
inspected for both positive and negative $k$. In R, the job is done by the acf()
function, applied to a multiple time series object.
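A sketch of such a call for our example; the object names for the two differenced series are assumptions, and na.pass is needed because of the missing values during the snow cover period:
> both <- ts.union(d.air = diff(air.temp), d.soil = diff(soil.temp))
> acf(both, na.action = na.pass)   # 2x2 panel with ACFs and cross correlations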
[Figure: sample auto and cross correlations of the differenced air and soil temperature series.]
The top left panel shows the ACF of the differenced air temperature, while the bottom
right one holds the pure autocorrelations of the differenced soil temperature. The
two off-diagonal panels contain estimates of the cross correlations: the top right
panel shows $\hat{\rho}_{12}(k)$ for positive values of $k$, and thus how changes in the air
temperature depend on changes in the soil temperature.
Note that we do not expect any significant correlation coefficients here, because
the ground temperature has hardly any influence on the future air temperature at
all. Conversely, the bottom left panel shows $\hat{\rho}_{12}(k)$ for negative values of $k$, and
thus how the changes in the soil temperature depend on changes in the air
temperature. Here, we expect to see significant correlation. However, the confidence
bounds drawn in these plots cannot be trusted in this situation.
The reason for these problems is that the variances and covariances of the $\hat{\rho}_{12}(k)$
are very complicated functions of $\rho_{11}(j)$, $\rho_{22}(j)$ and $\rho_{12}(j)$ for all $j$. For illustrative
purposes, we will treat some special cases explicitly.
Case 1: No cross correlations beyond lag m
In the case where the cross correlation $\rho_{12}(j) = 0$ for $|j| > m$, we have for $|k| > m$:
$$Var(\hat{\rho}_{12}(k)) \approx \frac{1}{n}\sum_{j=-\infty}^{\infty}\{\rho_{11}(j)\rho_{22}(j) + \rho_{12}(j+k)\,\rho_{12}(j-k)\}.$$
Thus, the variance of the estimated cross correlation coefficients goes to zero for
$n \rightarrow \infty$, but for a deeper understanding with finite sample size, we
must know all true auto and cross-correlations, which is of course impossible in
practice.
Case 2: No cross correlations at all
If the two processes $X_1$ and $X_2$ are independent, i.e. $\rho_{12}(j) = 0$ for all $j$, then the
variance of the cross correlation estimator simplifies to:
$$Var(\hat{\rho}_{12}(k)) \approx \frac{1}{n}\sum_{j=-\infty}^{\infty}\rho_{11}(j)\rho_{22}(j).$$
If, for example, $X_1$ and $X_2$ are two independent AR(1) processes with parameters
$\alpha_1$ and $\alpha_2$, then $\rho_{11}(j) = \alpha_1^{|j|}$, $\rho_{22}(j) = \alpha_2^{|j|}$ and $\rho_{12}(j) = 0$. For the variance of $\hat{\rho}_{12}(k)$
we have, because the autocorrelations form a geometric series:
$$Var(\hat{\rho}_{12}(k)) \approx \frac{1}{n}\sum_{j=-\infty}^{\infty}(\alpha_1\alpha_2)^{|j|} = \frac{1}{n}\cdot\frac{1+\alpha_1\alpha_2}{1-\alpha_1\alpha_2}.$$
For $\alpha_1 \to 1$ and $\alpha_2 \to 1$ this expression goes to $\infty$, i.e. the estimator $\hat{\rho}_{12}(k)$ can, for
a finite time series, differ greatly from the true value 0. We would like to illustrate
this with two simulated AR(1) processes with $\alpha_1 = \alpha_2 = 0.9$. According to theory, all
cross correlations are 0. However, as we can see in the figure on the next page,
the estimated cross correlations differ greatly from 0, even though the length of the
simulated series is 200. In fact, $2\sqrt{Var(\hat{\rho}_{12}(k))} \approx 0.44$, i.e. the 95% confidence
interval is $\pm 0.44$. Thus, even with an estimated cross correlation of 0.4, the
null hypothesis "the true cross correlation is equal to 0" cannot be rejected.
Case 3: No cross correlations for all lags and one series uncorrelated
Only now, in this special case, does the variance of the cross correlation estimator
simplify significantly. In particular, if $X_1$ is a White Noise process which is
independent of $X_2$, we have, for large $n$ and small $k$:
$$Var(\hat{\rho}_{12}(k)) \approx \frac{1}{n}.$$
Thus, in this special case, the rule of thumb $\pm 2/\sqrt{n}$ yields a valid approximation to
a 95% confidence interval for the cross correlations and can help to decide
whether they are significantly or just randomly different from zero.
[Figure: auto and cross correlations of the two simulated, independent AR(1) series X1 and X2.]
In most practical examples, however, the data will be auto- and also cross
correlated. Thus, the question arises whether it is at all possible to do something
here. Fortunately, the answer is yes: with the method of prewhitening, described in
the next section, we do obtain a theoretically sound and practically useful cross
correlation analysis.
9.3 Prewhitening
The idea behind prewhitening is to transform one of the two series such that it is
uncorrelated, i.e. a White Noise series, which also explains the name of the
approach. Formally, we assume that the two stationary processes X1 and X2 can
be transformed as follows:
U t ai X 1,t i
i 0
Vt bi X 2,t i
i 0
Thus, we are after coefficients $a_i$ and $b_i$ such that an infinite linear combination of
past terms leads to White Noise. We know from previous theory that such a
representation exists for all stationary and invertible ARMA($p,q$) processes; it is
the AR($\infty$) representation. For the cross-correlations between $U_t$ and $V_t$ and
between $X_{1,t}$ and $X_{2,t}$, the following relation holds:
$$\rho_{UV}(k) = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} a_i b_j\, \rho_{X_1X_2}(k+i-j).$$
From this relation it is evident that for two independent processes $X_1$ and $X_2$, where all cross
correlation coefficients $\rho_{X_1X_2}(k) = 0$, also all $\rho_{UV}(k) = 0$. Additionally, the converse
is also true, i.e. it follows from "$U_t$ and $V_t$ uncorrelated" that the original processes
$X_1$ and $X_2$ are uncorrelated, too. Since $U_t$ and $V_t$ are White Noise processes, we
are in the above explained case 3, and thus the confidence bounds in the cross
correlograms are valid. Hence, any cross correlation analysis on "real" time series
starts with representing them in terms of $U_t$ and $V_t$.
For our example with the two simulated AR(1) processes, we can estimate the AR
model coefficients with the Burg method and plug them in for prewhitening the
series. Note that this amounts to considering the residuals from the two fitted models!
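A minimal sketch of how this can be done in R; the simulation settings mirror the example above, while the object names are ours:

> ## Two independent AR(1) series with alpha = 0.9, as in the illustration
> set.seed(22)
> x1 <- arima.sim(n=200, model=list(ar=0.9))
> x2 <- arima.sim(n=200, model=list(ar=0.9))
> ## Prewhitening: fit AR(1) with Burg's method and keep the residuals
> fit1 <- ar.burg(x1, order.max=1, aic=FALSE)
> fit2 <- ar.burg(x2, order.max=1, aic=FALSE)
> u <- fit1$resid
> v <- fit2$resid
> ## Auto- and cross-correlations of the prewhitened series
> acf(ts.union(U=u, V=v), na.action=na.pass)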
[Figure: auto- and cross-correlograms of the prewhitened series, with panels U, U & V, V & U and V.]
The figure above shows both the auto and cross correlations of the
prewhitened series. We emphasize again that we here consider the residuals from
the AR(1) models that were fitted to the series $X_1$ and $X_2$. As expected, there
are no significant autocorrelations, and there is just one cross correlation
coefficient that exceeds the 95% confidence bounds. We can attribute
this to random variation.
Because $U_t$ and $V_t$ are uncorrelated, the theory suggests that $X_1$ and $X_2$
do not show any linear dependence either. Owing to how we set up the simulation,
we know this to be true, and we take the result as evidence that the prewhitening
approach works in practice.
For verifying whether there is any cross correlation between the changes in air and
soil temperatures, we have to perform prewhitening for both differenced
series. Previously, we had identified an AR(5) model for the changes in air temperature
and an MA(1) model for the changes in soil temperature. We can now
just take their residuals and perform a cross correlation analysis:
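A minimal sketch of how this could look in R; the object names d.air and d.soil for the two differenced series are assumptions, as the data are not part of this excerpt:

> ## Hypothetical objects: d.air and d.soil are the differenced air and soil series
> fit.air  <- arima(d.air,  order=c(5,0,0), include.mean=FALSE)
> fit.soil <- arima(d.soil, order=c(0,0,1), include.mean=FALSE)
> ## Cross correlation analysis on the residuals of the two fitted models
> acf(ts.union(air=resid(fit.air), soil=resid(fit.soil)))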
[Figure: auto- and cross-correlograms of the residuals from the AR(5) and MA(1) models fitted to the differenced air and soil temperature series.]
The bottom left panel shows some significant cross correlations: a change in the
air temperature seems to induce a change in the soil temperature with a lag of 1 or 2 days.
Transfer function models are a possible way to capture the dependency
between two time series. We must assume that the first series influences the
second, but the second does not influence the first. Furthermore, the influence
occurs only simultaneously or in the future, but not on past values. Both
assumptions are met in our example. The transfer function model is:

$$X_{2,t} = \mu_2 + \sum_{j=0}^{\infty} \nu_j (X_{1,t-j} - \mu_1) + E_t.$$
We call $X_1$ the input and correspondingly, $X_2$ is named the output. For the error
terms $E_t$ we require zero expectation and independence from the input
series, in particular:

$$E[E_t] = 0 \quad \text{and} \quad \mathrm{Cov}(E_t, X_{1,s}) = 0 \quad \text{for all } t, s.$$
However, the errors $E_t$ are usually autocorrelated. Note that this model is very
similar to the time series regression model. However, here we have infinitely many
unknown coefficients $\nu_j$, i.e. we do not know (a priori) on which lags to regress the
input for obtaining the output. For the following theory, we assume (w.l.o.g.) that
$\mu_1 = \mu_2 = 0$, i.e. the two series were adjusted for their means. In this case the cross
covariances $\gamma_{21}(k)$ are given by:
$$\gamma_{21}(k) = \mathrm{Cov}(X_{2,t+k}, X_{1,t}) = \mathrm{Cov}\Big(\sum_{j=0}^{\infty} \nu_j X_{1,t+k-j},\, X_{1,t}\Big) = \sum_{j=0}^{\infty} \nu_j\, \gamma_{11}(k-j).$$
In cases where the transfer function model has a finite number of coefficients $\nu_j$
only, i.e. $\nu_j = 0$ for $j > K$, the above formula turns into a linear system of
$K+1$ equations that we could theoretically solve for the unknowns $\nu_j$, $j = 0, \ldots, K$.
If we replaced the theoretical $\gamma_{11}$ and $\gamma_{21}$ by the empirical covariances $\hat{\gamma}_{11}$ and $\hat{\gamma}_{21}$,
this would yield estimates $\hat{\nu}_j$. However, this method is statistically inefficient and
the choice of $K$ proves to be difficult in practice. We again resort to a special
case for which the relation between the cross covariances and the transfer function model
coefficients simplifies drastically.
If the input series $X_1$ is White Noise, only the term with $j = k$ remains in the sum above, and we obtain:

$$\nu_k = \frac{\gamma_{21}(k)}{\gamma_{11}(0)} = \rho_{21}(k) \cdot \sqrt{\frac{\gamma_{22}(0)}{\gamma_{11}(0)}}, \quad \text{for } k \ge 0.$$
In practice, one therefore prewhitens the input series with a fitted AR model, and the very same transformation must also be
applied to the output series. Namely, we have to filter the output with the model
coefficients from the input series. In our permafrost example, the AR(5) model fitted to the changes in air temperature is:
$$X_{1,t} = 0.296 \cdot X_{1,t-1} + 0.242 \cdot X_{1,t-2} + 0.119 \cdot X_{1,t-3} + 0.497 \cdot X_{1,t-4} + 0.216 \cdot X_{1,t-5} + D_t,$$
where $D_t$ is the innovation, i.e. a White Noise process, for which we estimate the
variance to be $\hat{\sigma}_D^2 = 2.392$. We now solve this equation for $D_t$ and get:

$$D_t = X_{1,t} - 0.296 \cdot X_{1,t-1} - 0.242 \cdot X_{1,t-2} - 0.119 \cdot X_{1,t-3} - 0.497 \cdot X_{1,t-4} - 0.216 \cdot X_{1,t-5}.$$
We now apply this same transformation, i.e. the characteristic polynomial of the
AR(5), also to the output series $X_2$ and the transfer function model errors $E_t$, which defines:

$$Z_t = X_{2,t} - 0.296 \cdot X_{2,t-1} - \ldots - 0.216 \cdot X_{2,t-5} \quad \text{and} \quad U_t = E_t - 0.296 \cdot E_{t-1} - \ldots - 0.216 \cdot E_{t-5}.$$
We can now equivalently write the transfer function model with the new processes
$D_t$, $Z_t$ and $U_t$. It takes the form:

$$Z_t = \sum_{j=0}^{\infty} \nu_j D_{t-j} + U_t,$$
where the coefficients $\nu_j$ are identical to those in the previous formulation of the
model. The advantage of this latest formulation, however, is that the input series
$D_t$ is now White Noise, such that the above special case applies, and the transfer
function model coefficients can be obtained by a straightforward computation from
the cross correlations:
$$\hat{\nu}_k = \frac{\hat{\gamma}_{21}(k)}{\hat{\sigma}_D^2} = \hat{\rho}_{21}(k) \cdot \frac{\hat{\sigma}_Z}{\hat{\sigma}_D}, \quad \text{for } k \ge 0,$$
where $\hat{\gamma}_{21}$ and $\hat{\rho}_{21}$ denote the empirical cross covariances and cross correlations
of $D_t$ and $Z_t$. However, keep in mind that $Z_t$ and $U_t$ are generally correlated.
Thus, the outlined method is not a statistically efficient estimator either. While
efficient approaches exist, we will not discuss them in this course and scriptum.
Nevertheless, for practical application the outlined procedure usually yields reliable
results. We conclude this section by showing the results for the permafrost
example: the transfer function model coefficients are based on the
cross correlation between the AR(5) residuals of the air temperature changes and the soil temperature
changes that had been filtered with these AR(5) coefficients.
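Continuing with the hypothetical objects from above, such a filtering and cross correlation analysis could be sketched as follows; the AR(5) coefficients are taken from the fitted model rather than hard-coded:

> ## D_t: AR(5) residuals of the air changes; Z_t: correspondingly filtered soil changes
> dd <- resid(fit.air)
> zz <- filter(d.soil, filter=c(1, -coef(fit.air)[1:5]),
>              method="convolution", sides=1)
> acf(ts.union(D=dd, Z=zz), na.action=na.pass)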
[Figure: auto- and cross-correlograms of D_t and Z_t for the permafrost example.]
Again, in all panels except for the bottom left one, the correlation coefficients are mostly
zero, or differ from that value only insignificantly and by chance. This is
different in the bottom left panel: here, we have substantial cross correlation at
lags 1 and 2. These values are proportional to the transfer function model
coefficients, which we can extract as follows:
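The original extraction code is not reproduced here; the following sketch shows how the coefficients could be obtained from the relation above, using the hypothetical objects dd and zz:

> ## Transfer function coefficients nu_k = rho_21(k) * sigma_Z / sigma_D, k >= 0
> cc  <- ccf(as.numeric(zz), as.numeric(dd), na.action=na.pass, plot=FALSE)
> rho <- cc$acf[cc$lag >= 0]
> nu  <- rho * sd(zz, na.rm=TRUE) / sd(dd, na.rm=TRUE)
> round(nu[1:5], 3)   # nu_k for k = 0,...,4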
Thus, the soil temperature in the permafrost boreholes reacts to air temperature
changes with a delay of 1-2 days. An analysis of further boreholes has suggested
that the delay depends on the type of terrain in which the measurements were
made. Fastest response times are found for a very coarse-blocky rock glacier site,
whereas slower response times are revealed for blocky scree slopes with smaller
grain sizes.
10 Spectral Analysis
During this course, we have encountered several time series which show periodic
behavior. Prominent examples include the number of shot lynx in the Mackenzie
River district in Canada, as well as the wave tank data from section 4.4. In these
series, the periodicity is not deterministic, but stochastic. An important goal is to
understand the cycles at which highs and lows in the data appear.
A time series process has a continuous spectrum, or spectral density. This is based
on the notion that any stationary process can be obtained from an infinite linear
combination of harmonic oscillations. For ARMA models e.g., it is not just a few
particular frequencies that make up the variation in the data, but all of them contribute
to a smaller or larger extent. Hence, any time series process can be characterized
by its spectrum. A further goal of this section lies in using the periodogram of an
observed time series as an estimator for the spectrum of the data generating
process.
The basic building block for spectral analysis is the harmonic oscillation

$$y(t) = a \cdot \cos(2\pi\nu t - \phi).$$

Here, we call $a$ the amplitude, $\nu$ the frequency and $\phi$ the phase. Apparently,
the function $y(t)$ is periodic, and the period is $T = 1/\nu$. It is common to write the
above harmonic oscillation in a different form, i.e.:

$$y(t) = \alpha \cdot \cos(2\pi\nu t) + \beta \cdot \sin(2\pi\nu t),$$
where in fact $\alpha = a \cdot \cos(\phi)$ and $\beta = a \cdot \sin(\phi)$. The advantage of this latter form is that
if we want to fit a harmonic oscillation with fixed frequency to data, which means
estimating amplitude and phase, we face a linear problem instead of a non-linear
one, as was the case with the previous formulation. The time $t$ can be either
continuous or discrete. In the context of our analysis of discrete time series, only
the latter will be relevant.
For a time series with $n$ observations, one considers a regression of $x_t$ on cosine
and sine terms at the frequencies $\nu_k = k/n$. These are called the Fourier frequencies.
Using some mathematics, one can prove that this regression problem has an
orthogonal design. Thus, the estimated coefficients $\hat{\alpha}_k, \hat{\beta}_k$ are uncorrelated and
(for $k \ne 0$) have variance $2\sigma_E^2/n$. Because we are also spending $n$ parameters
for the $n$ observations, the frequency decomposition model fits perfectly, i.e. all
residuals are zero. Another very important result is that the sum of squared residuals
$\sum_{i=1}^{n} r_i^2$ increases by $\frac{n}{2}\big(\hat{\alpha}_k^2 + \hat{\beta}_k^2\big)$
if the frequency $\nu_k$ is omitted from the model. We can use this property to gauge
the prominence of a particular frequency in the decomposition model, and exactly
that is the aim of the periodogram which will be discussed below.
The periodogram is defined as

$$I_n(\nu_k) = \frac{n}{4}\big(\hat{\alpha}_k^2 + \hat{\beta}_k^2\big) = \frac{1}{n}\left(\sum_{t=1}^{n} x_t \cos(2\pi\nu_k t)\right)^2 + \frac{1}{n}\left(\sum_{t=1}^{n} x_t \sin(2\pi\nu_k t)\right)^2.$$
The result is then plotted versus the frequencies $\nu_k$, and this is known as the raw
periodogram. In R, we can use the convenient function spec.pgram(). We
illustrate its use with the lynx and the wave tank data:
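For instance, the plots can be produced along the following lines; the wave tank data are not part of R, so the object name wave.ts is an assumption:

> ## Time series plots and raw periodograms
> par(mfrow=c(2,2))
> plot(log(lynx), ylab="log(lynx)", main="Time Series Plot of log(lynx)")
> plot(wave.ts, ylab="Height", main="Time Series Plot of Wave Tank Data")
> spec.pgram(log(lynx), log="no", main="Raw Periodogram of log(lynx)")
> spec.pgram(window(wave.ts, end=120), log="no", main="Raw Periodogram of Wave Tank Data")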
[Figure: time series plots of log(lynx) and of the wave tank data, together with their raw periodograms.]
The periodogram of the logged lynx data is easy to read: the most prominent
frequencies in this series with 114 observations are the ones near 0.1; more
exactly, these are $\nu_{11} = 11/114 \approx 0.096$ and $\nu_{12} = 12/114 \approx 0.105$. The period of these
frequencies is $1/\nu_k$ and thus $114/11 \approx 10.36$ and $114/12 = 9.50$. This suggests that
the series shows a peak at around every 10th observation, which is clearly the case
in practice. We can also say that the highs/lows appear between 11 and 12 times
in the series. This, too, can easily be verified in the time series plot.
For the wave tank data, we here consider the first 120 observations only. The
periodogram is not as clean as for the logged lynx data, but we will attempt an
interpretation, too. The most prominent peaks are at $k = 12, 17$ and $30$. Thus, we
have a superposition of cycles which last about 10, 7 and 4 observations, respectively. The
verification is left to the reader.
10.1.4 Leakage
While some basic inspections of the periodogram can and sometimes do already
provide valuable insight, there are a few issues which need to be taken care of.
The first one which is discussed here is the phenomenon called leakage. It
appears if there is no Fourier frequency that corresponds to the true periodicity in
the data. Usually, the periodogram then shows higher values in the vicinity of the
true frequency. The following simulation example is enlightening:
$$X_t = \cos\Big(\frac{2\pi \cdot 13\,t}{140}\Big) + 0.8 \cdot \cos\Big(\frac{2\pi \cdot 20\,t}{140}\Big), \quad \text{for } t = 0, \ldots, 139.$$
[Figure: periodogram of the simulated series with n = 140; the two true frequencies 13/140 and 20/140 appear as isolated peaks.]
Now if we shorten this very same series by 16 data points so that 124
observations remain, the true frequencies 20 / 140 and 13 / 140 do no longer
appear in the decomposition model, i.e. are not Fourier frequencies anymore. The
periodogram now shows leakage:
[Figure: periodogram of the shortened series with n = 124, showing leakage around the true frequencies.]
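The leakage effect can be reproduced with a few lines of R; this is a sketch where tapering and detrending are switched off to isolate the effect:

> ## Periodogram of the full series (no leakage) vs. the shortened one (leakage)
> tt <- 0:139
> xx.full  <- cos(2*pi*13*tt/140) + 0.8*cos(2*pi*20*tt/140)
> xx.short <- xx.full[1:124]
> par(mfrow=c(1,2))
> spec.pgram(xx.full,  log="no", taper=0, detrend=FALSE, main="n = 140")
> spec.pgram(xx.short, log="no", taper=0, detrend=FALSE, main="n = 124")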
If not all of the true frequencies in the data generating model are Fourier
frequencies, then the $\hat{\alpha}_k, \hat{\beta}_k$ from the decomposition model are only approximations
to the true contribution of a particular frequency to the variation in the series. This
may seem somewhat worrying, but then on the other hand the spectrum of an
ARMA time series process is a continuous function involving infinitely many
frequencies, so we may require some smoothing anyway if it is to be estimated
from the periodogram.
In a White Noise series, the spectrum is constant over all frequencies. Such a
series is “totally random”, i.e. there are no periodicities or in other words, all of
them appear at the same magnitude. That property also coined the term White
Noise, as white light also consists of all frequencies (i.e. wave lengths) at equal
amounts. In contrast, the spectrum of an AR(1) is non-trivial, but already a quite
simple function: it is monotone in the frequency, with its maximum at frequency 0
for positive $\alpha_1$ and at frequency 0.5 for negative $\alpha_1$.
[Figure: Spectrum of AR(1)-Processes]
It will be interesting to see how the spectrum of an AR(2) process behaves. It can
be shown that for having $m$ maxima in the spectral density function, an AR model
order of $2m$ is required. That property can be used to derive the order of an
AR($p$) from a spectral estimate, see below. Next, we simulate from an AR(2) with
$n = 100$ and coefficients $\alpha_1 = 0.9$ and $\alpha_2 = -0.8$. This allows us to generate both the
true and estimated ACF, as well as the periodogram and the true spectrum.
> set.seed(21)
> AR2.sim <- arima.sim(n=100, model=list(ar=c(0.9,-0.8)))
> plot(AR2.sim, main="…")
[Figure: time series plot of the simulated AR(2) series AR2.sim.]
[Figure: estimated ACF and raw periodogram of AR2.sim (bandwidth = 0.00289), together with the true ACF and spectrum of the AR(2) process.]
A first option for improving the raw periodogram is to smooth it with a running mean over $2L+1$ neighboring Fourier frequencies:

$$\hat{f}(\nu_j) = \frac{1}{2L+1}\sum_{k=-L}^{L} I_n(\nu_{j+k}) \quad \text{for all } j.$$
A refined version is the weighted running mean:

$$\hat{f}(\nu_j) = \sum_{k=-L}^{L} w_k\, I_n(\nu_{j+k}),$$
where the weights are symmetric ($w_{-k} = w_k$), sum up to one and are decaying
($w_0 \ge w_1 \ge \ldots \ge w_L$). This usually leads to a smoother estimate
and makes the choice of the bandwidth somewhat easier, at the price of having to
choose the weights. The weighted running mean is unbiased and consistent as
well. Please note that the function spec.pgram() in R uses a weighted running
mean if the argument spans=2L+1 is set. It is based on $w_k = 1/(2L)$ for $|k| < L$ and
$w_k = 1/(4L)$ for $|k| = L$. This is called a modified Daniell smoother.
> par(mfrow=c(1,2))
> spec.pgram(AR2.sim, spans= 13, log="no", main="…")
> spec.pgram(AR2.sim, spans= c(5,5), log="no", main="…")
[Figure: smoothed periodograms of AR2.sim; left: spans=13 (bandwidth = 0.035), right: spans=c(5,5) (bandwidth = 0.0176).]
We here illustrate the use of smoothing in practice. The left frame contains a
Daniell smoother with $L = 6$. The right frame first uses a Daniell smoother with
$L = 2$, and then smoothes the output once more with the same setting. The double
smoother usually works better in practice than the single smoother: it allows for
choosing a smaller $L$ but still achieves a smooth result. The benefit of using a small
$L$ lies in the fact that sharper peaks are produced. This resembles the true
spectra of ARMA processes better, as they usually do not have very wide peaks.
Further improvement of the spectral estimate can be achieved by so-called
tapering, which will be explained below.
10.2.2 Tapering
Tapering is a mathematical manipulation sometimes performed that has the goal
of further improving the spectral density estimation. In spectral analysis, a time
series is regarded as a finite sample of an infinitely long series, and the objective
is to infer the properties of the infinitely long series. If the observed time series is
viewed as repeating itself an infinite number of times, the sample can be
considered as resulting from applying a data window to the infinite series. The
data window is a series of weights equal to 1 for the n observations of the time
series, and zero elsewhere. This data window is rectangular in appearance. The
effect of the rectangular data window on spectral estimation is to distort the
estimated spectrum of the unknown infinite-length series by introducing leakage,
as we had already shown above. The objective of tapering lies in reducing
leakage. Tapering consists of altering the ends of the time series so that they taper
gradually down to zero.
The default in the R function spec.pgram() is a taper applied to 10% of the
data at the start and the end of the series. We strongly recommend keeping this
default value, unless you have good reasons to deviate. We here illustrate the effect of
tapering on the end result by comparing the above estimate from a tapered Daniell
smoother to a non-tapered one. As it turns out, the effect here is relatively small, but
this does not change our recommendation to always taper the spectral estimate.
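A sketch of how this comparison can be produced, using the double Daniell smoother from above:

> par(mfrow=c(1,2))
> spec.pgram(AR2.sim, spans=c(5,5), log="no", taper=0.1, main="Tapered")
> spec.pgram(AR2.sim, spans=c(5,5), log="no", taper=0,   main="Non-Tapered")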
[Figure: tapered versus non-tapered spectral estimate for AR2.sim.]
Please note that the R function spec.pgram() does not only use a tapered
Daniell smoother, but also estimates and removes a linear trend before the
spectrum is estimated. This is done to avoid a peak at low frequencies,
which would spread out when the smoothing approaches are applied. If one is sure
that neither a trend nor a global mean is present in the series for which the
spectrum is to be estimated, that functionality can also be switched off, see
help(spec.pgram).
[Figure: AR(2)-based spectral estimate for the series AR2.sim, shown on a logarithmic scale.]
An alternative to smoothing the periodogram is the parametric approach, where an
AR model is fitted to the series and the spectrum of the fitted model serves as the
estimate. If applied to the simulated AR(2) series, we observe that it recovers the true
spectrum much better than the periodogram based approaches before. While it
always has the benefit of producing a smooth spectral estimate, it also enjoys some
advantage over the other methods here: the true data generating
process is an AR, and we know the true order $p = 2$. In practice, of course,
there can be a mismatch in the choice of the model order, or the data generating
process is not autoregressive at all. In these cases, the periodogram based
approaches do not necessarily perform worse.
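In R, such a parametric, AR-based spectral estimate can be obtained with spec.ar(); fixing the order to p = 2 is our choice here, as is the Burg method:

> ## Parametric spectral estimate based on a fitted AR model
> spec.ar(AR2.sim, order=2, method="burg", main="AR(2) spectrum")
> spec.ar(AR2.sim, main="AR spectrum, order chosen by AIC")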
We now turn to the practical application of this technique. The raw periodogram
for the logged lynx data was shown above in section 10.1.3. We now utilize a
Daniell smoother with tapering to produce a reasonable spectral estimate. Please
note that it is customary to show the spectrum on a logarithmic scale, i.e. in dB.
This is different from how the spectrum was presented so far in this chapter, but it is
the default in the R function spec.pgram() and is superior, as it allows for better
identification of secondary, lower peaks in the spectrum. Moreover, the spectral
estimate also features a 95% confidence interval (given in blue at the top right of
the plot) that helps to verify whether peaks in the spectrum are significant and
need to be regarded. In the case of the logged lynx data, it is a close call whether
the secondary peaks are negligible or not.
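A sketch of how such an estimate can be generated; the particular spans value is a plausible choice of ours, not necessarily the one used for the figure below:

> spec.pgram(log(lynx), spans=c(3,3), taper=0.1, log="dB",
>            main="Smoothed Periodogram")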
[Figure: smoothed periodogram of log(lynx), spectrum in dB.]
If we now compare this spectral estimate to the ones that are obtained after fitting
an AR (2) and an AR (11) model, we notice that the latter is much closer to it. We
take this as further evidence that an AR (11) is better suited for this dataset than
the simpler AR (2) . Using different methodology, earlier analyses yielded the very
same result.
[Figure: spectral estimates based on the fitted AR(2) and AR(11) models for log(lynx), in dB.]
We conclude this section on spectral analysis with a real world example that deals
with fault detection on electric motors, see the next pages.
[Figure: spectral analysis for the fault detection example on electric motors; frequency axis from 44 to 56 Hz.]
11 State Space Models
We will here first introduce the general formulation of state space models, and
then illustrate with a number of examples. The first two concern AR processes with
additional observation noise, then there are two regressions with time-varying
coefficients, and finally we consider a linear growth model. The conditional means
and variances of the state vectors are usually estimated with the Kalman Filter.
Because this is a mathematically rather complex topic, and not in the primary focus
of the applied user, this scriptum only provides the coarse rationale for it.
State Equation:
$$X_t = G_t X_{t-1} + W_t, \quad \text{where } W_t \sim N(0, w_t)$$
Observation Equation:
$$Y_t = F_t X_t + V_t, \quad \text{where } V_t \sim N(0, v_t)$$
Note that in this general formulation, all matrices can be time varying, but in most
of our examples, they will be constant. Also, the nomenclature is different
depending on the source, but we here adopt the notation of R.
A first, simple special case is the situation where the process of interest is observed with additional noise:

$$Y_t = X_t + V_t, \quad \text{where } V_t \sim N(0, \sigma_V^2).$$

Thus, the realizations of the process of interest, $X_t$, are latent, i.e. hidden under
some random noise. We will now discuss how this issue can be solved in practice.
Example: AR(1)
We assume that the process of interest is an AR(1), i.e.

$$X_t = \alpha_1 X_{t-1} + W_t.$$

This is the state equation; here, $G_t$ is time-constant and equal to $\alpha_1$. In practice, however, we do not observe $X_t$ directly, but only a noisy version of it:

$$Y_t = X_t + V_t, \quad \text{where } V_t \sim N(0, \sigma_V^2).$$
This is the observation equation; note that $F_t$ is also time-constant and equal to
the $1 \times 1$ identity matrix. We here assume that the errors $V_t$ are iid, and also
independent of $X_s$ and $W_s$ for all $t$ and $s$. It is important to note that $W_t$ is the
process innovation that impacts future instances $X_{t+k}$. In contrast, $V_t$ is pure
measurement noise with no influence on the future of the process $X_t$. For illustration,
we consider a simulation example. We use $\alpha_1 = 0.7$, the innovation variance $\sigma_W^2$ is
0.1 and the measurement error variance $\sigma_V^2$ is 0.5. The length of the simulated
series is 100 observations. Below, we show a number of plots. They
include a time series plot with both series $X_t$ and $Y_t$, and the individual plots of $X_t$
with its ACF/PACF, and of Yt with ACF/PACF. We clearly observe that the
appearance of the two processes is very different. While X t looks like an
autoregressive process, and has ACF and PACF showing the stylized facts very
prominently, Yt almost appears to be White Noise. We here know that this is not the case.
[Figure: time series plot showing the state X_t and the observed series Y_t.]
We here emphasize that the state space formulation allowed us to write a model
comprised of a true signal plus additional noise. However, when we face an observed
series of this type, this alone does not get us any further: we need some means to separate
the two components. Kalman filtering, which is implemented in the R package sspir,
allows doing so: it recovers the (expected values of the) states $X_t$ by a
recursive algorithm. For details see below.
[Figure: the state series X_t and the observed series Y_t, each with their ACF and PACF.]
Kalman filtering in R requires specifying the state space model first. We need to
supply the argument y, which stands for the observed time series data; it has to
come in the form of a matrix. Moreover, we have to specify the matrices $F_t$, $G_t$, as well
as the covariance structures $v_t$, $w_t$. In our case, these are all simple $1 \times 1$ matrices.
Finally, we have to provide m0, the starting value of the initial state, and C0, the
variance of the initial state.
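For the AR(1) plus noise example, the specification could look as follows. This is a sketch only, written in analogy to the regression example further below; y.obs is assumed to hold the simulated observations Y_t, and the true model parameters are plugged in:

> ## State space formulation for the AR(1) plus noise example (sketch)
> library(sspir)
> y.mat <- matrix(y.obs, ncol=1)
> x.mat <- matrix(1, nrow=length(y.obs), ncol=1)    # dummy, not used by Fmat
> ssf   <- SS(y=y.mat, x=x.mat,
>             Fmat=function(tt,x,phi) return(matrix(1)),    # F_t = 1
>             Gmat=function(tt,x,phi) return(matrix(0.7)),  # G_t = alpha_1
>             Wmat=function(tt,x,phi) return(matrix(0.1)),  # innovation variance
>             Vmat=function(tt,x,phi) return(matrix(0.5)),  # measurement error variance
>             m0=matrix(0), C0=matrix(10))
> fit <- kfilter(ssf)
> ## fit$m holds the filtered estimates of the states X_t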
[Figure: state X_t, observed series Y_t and the Kalman filter output.]
We can then employ the Kalman filter to recover the original signal X t . It was
added as the blue line in the above plot. While it is not 100% accurate, it still does
a very good job of filtering the noise out. However, note that with this simulation
example, we have some advantage over the real-life situation. Here, we could
specify the correct state space formulation. In practice, we might have problems to
identify appropriate values for Gt (the true AR (1) parameter) and the variances in
vt , wt . On the other hand, in reality the precision of many measurement devices is
more or less known, and thus some educated guess is possible.
Example: AR(2)
We consider an AR(2) process $X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + W_t$. By stacking the current and the previous value into the state vector, we obtain the state equation

$$\begin{pmatrix} X_t \\ X_{t-1} \end{pmatrix} = \begin{pmatrix} \alpha_1 & \alpha_2 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} X_{t-1} \\ X_{t-2} \end{pmatrix} + \begin{pmatrix} W_t \\ 0 \end{pmatrix}$$

and the observation equation

$$Y_t = (1 \;\; 0) \begin{pmatrix} X_t \\ X_{t-1} \end{pmatrix} + V_t.$$

Once the equations are set up, it is straightforward to derive the matrices:

$$G_t = G = \begin{pmatrix} \alpha_1 & \alpha_2 \\ 1 & 0 \end{pmatrix}, \quad F_t = F = (1 \;\; 0), \quad w_t = w = \begin{pmatrix} \sigma_W^2 & 0 \\ 0 & 0 \end{pmatrix}, \quad v_t = v = \sigma_V^2.$$
Similar to the example above, we could now simulate from an AR (2) process, add
some artificial measurement noise and then try to uncover the signal using the
Kalman filter. This is left as an exercise.
A further application of state space models is regression with time-varying coefficients. As an example, suppose that the sales $S_t$ of a product depend on its price $P_t$ and on a general level $L_t$:

$$S_t = L_t + \beta_t P_t + V_t.$$

This is a linear regression model with the price as the predictor, and the general level
as the intercept. The assumption is that their influence varies over time, but
generally only in small increments. We can use the following notation:

$$L_t = L_{t-1} + \Delta L_t, \qquad \beta_t = \beta_{t-1} + \Delta\beta_t.$$

In this model, we assume that $V_t$, $\Delta L_t$ and $\Delta\beta_t$ are random deviations with mean
zero that are independent over time. While we assume independence of $\Delta L_t$ and
$\Delta\beta_t$, we could also allow for correlation among the two. The relative magnitudes of
these perturbations are accounted for by the variances $v_t$ and $w_t$
of the state space formulation. Note that if we set $w_t = 0$, then we are in the case
of plain OLS regression with constant parameters. Hence, we can also formulate
any regression models in state space form. Here, we have:
$$Y_t = S_t, \quad X_t = \begin{pmatrix} L_t \\ \beta_t \end{pmatrix}, \quad W_t = \begin{pmatrix} \Delta L_t \\ \Delta\beta_t \end{pmatrix}, \quad F_t = \begin{pmatrix} 1 \\ P_t \end{pmatrix}, \quad G = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
Because we do not have any data for this sales example, we again rely on a
simulation. Evidently, this has the advantage that we can evaluate the Kalman
filter output versus the truth. Thus, we let
$$y_t = a + b\,x_t + z_t, \quad \text{with } x_t = 2 + t/10.$$

We simulate 30 data points for $t = 1, \ldots, 30$ and assume errors which are standard
normally distributed, i.e. $z_t \sim N(0,1)$. The regression coefficients are $a = 4$ and
$b = 2$ for $t = 1, \ldots, 15$, and $a = 5$ and $b = -1$ for $t = 16, \ldots, 30$. We will fit a straight line
with time-varying coefficients, as this is the model that matches what we had found
for the sales example above.
> ## Simulation
> set.seed(1)
> x1 <- 1:30
> x1 <- x1/10+2
> aa <- c(rep(4,15), rep( 5,15))
> bb <- c(rep(2,15), rep(-1,15))
> nn <- length(x1)
> y1 <- aa+bb*x1+rnorm(nn)
> x0 <- rep(1,nn)
> xx <- cbind(x0,x1)
> x.mat <- matrix(xx, nrow=nn, ncol=2)
> y.mat <- matrix(y1, nrow=nn, ncol=1)
>
> ## State Space Formulation
> ssf <- SS(y=y.mat, x=x.mat,
> Fmat=function(tt,x,phi)
> return(matrix(c(x[tt,1],x[tt,2]),2,1)),
> Gmat=function(tt,x,phi) return(diag(2)),
> Wmat=function(tt,x,phi) return(0.1*diag(2)),
> Vmat=function(tt,x,phi) return(matrix(1)),
> m0=matrix(c(5,3),1,2),
> C0=10*diag(2))
>
> ## Kalman-Filtering
> fit <- kfilter(ssf)
> par(mfrow=c(1,2))
> plot(fit$m[,1], type="l", xlab="Time", ylab="")
> title("Kalman Filtered Intercept")
> plot(fit$m[,2], type="l", xlab="Time", ylab="")
> title("Kalman Filtered Slope")
[Figure: Kalman filtered intercept (left) and slope (right), plotted versus time.]
The plots show the Kalman filter output for intercept and slope. The estimates pick
up the true values very quickly, even after the change in the regime. It is worth
noting that in this example, we had a very clear signal with relatively little noise,
and we favored recovering the truth by specifying the state space formulation with
the true error variances that are generally unknown in practice.
Example: Batmobile
The data comprise quarterly figures of traffic accidents and the fuel
consumption in the Albuquerque area as a proxy of the driven mileage. The first
29 quarters are the control period, and observations 30 to 52 were recorded during
the experimental (batmobile) period. We would naturally assume a time series
regression model for the number of accidents, e.g. with the fuel consumption and
quarterly dummy variables as predictors:

$$Y_t = \beta_0 + \beta_1 x_t + \beta_2\,Q2_t + \beta_3\,Q3_t + \beta_4\,Q4_t + E_t.$$
The accidents depend on the mileage driven and there is a seasonal effect. In the
above model, the intercept $\beta_0$ is assumed to be constant. In the light of the BAT
program, we might well replace it by $L_t$, i.e. some general level of accidents
that is time-dependent. Let us first perform the regression and check the residuals.
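A sketch of this first step; the data frame bat and the variable names acc (accident counts), fuel (fuel consumption) and quarter (seasonal factor) are assumptions of ours:

> ## OLS regression with constant coefficients, plus a residual check
> fit.lm <- lm(acc ~ fuel + quarter, data=bat)
> plot(ts(resid(fit.lm)), ylab="Residuals")
> abline(v=29.5, lty=3)   # start of the experimental (batmobile) period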
The time series plot of the residuals shows very clear evidence of serial
dependence. In contrast to what the regression model with constant coefficients
suggests, the level of accidents seems to rise in the control period, then drops
markedly after the BAT program was introduced. The conclusion is that our above
regression model is not adequate and needs to be enhanced. However, just
adding an indicator variable that codes for the times before and after the
introduction of the BAT program will not solve the issue. It is evident that the level
of the residuals is not constant before the program started, and it does not
suddenly drop to a constant lower level thereafter.
[Figure: time series plot of the residuals from the regression fit with constant coefficients.]
The alternative is to formulate a state space model and estimate it with the Kalman
filter. We (conceptually) assume that all regression parameters are time
dependent, and rewrite the model as:

$$Y_t = L_t + \beta_{1t}\,x_t + \beta_{2t}\,Q2_t + \beta_{3t}\,Q3_t + \beta_{4t}\,Q4_t + E_t.$$
Our main interest lies in the estimation of the modified intercept term Lt , which we
now call the level. We expect it to drop after the introduction of the BAT program,
but let’s see if this materializes. The state vector X t we are using contains the
regression coefficients, and the state equation which describes their evolution over
time is as follows:
$$X_t = G X_{t-1} + W_t, \quad \text{where } X_t = \begin{pmatrix} L_t \\ \beta_{1t} \\ \beta_{2t} \\ \beta_{3t} \\ \beta_{4t} \end{pmatrix}, \; G = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \; w_t = \begin{pmatrix} 10 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
As we can see, we only allow for some small, random perturbations in the level $L_t$,
but not in the other regression coefficients. The observation equation then
describes the regression problem, i.e.

$$Y_t = (1,\; x_t,\; Q2_t,\; Q3_t,\; Q4_t)\, X_t + V_t.$$

The variance of the noise term, $\sigma_V^2$, is set to 600, which is the error variance in the
canonical regression fit. Also the starting values for Kalman filtering, as well as the
variances of these initial states, are taken from there. Hence, the code for the state
space formulation, as well as for Kalman filtering, is as follows:
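The original code is not reproduced here; the following sketch illustrates how the formulation could look, reusing the hypothetical regression fit from above. The design matrix, the value 10 for the level variance and the use of the regression fit for the starting values follow the description above, while the remaining details are assumptions:

> ## State space formulation and Kalman filtering for the batmobile example (sketch)
> xx <- model.matrix(fit.lm)                 # columns: intercept, fuel, Q2, Q3, Q4
> W  <- diag(c(10, 0, 0, 0, 0))              # only the level L_t may evolve over time
> ssf <- SS(y=matrix(bat$acc, ncol=1), x=xx,
>           Fmat=function(tt,x,phi) return(matrix(x[tt,], ncol=1)),
>           Gmat=function(tt,x,phi) return(diag(5)),
>           Wmat=function(tt,x,phi) return(W),
>           Vmat=function(tt,x,phi) return(matrix(600)),
>           m0=matrix(coef(fit.lm), 1, 5),
>           C0=diag(diag(vcov(fit.lm))))
> fit.kal <- kfilter(ssf)
> plot(fit.kal$m[,1], type="l", xlab="Time", ylab="Level")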
The filtered output is in object m in the output list. We can extract the estimates for
the mean of the state vector at time t , which we display versus time.
[Figure: Kalman filter output, including fit.kal$m[,1] (the level), plotted versus time.]
[Figure: Kalman filtered regression coefficients, among them the coefficient for Q3, plotted versus time.]
The estimates for these coefficients slightly wiggle around over time. However,
they do not seem to change systematically, as we had previously guessed.