A Bayesian Kriged Kalman Model for Short-term Forecasting of Air Pollution Levels

Appl. Statist. (2005) 54, Part 1, pp. 223–244

A Bayesian kriged Kalman model for short-term forecasting of air pollution levels

Sujit K. Sahu
University of Southampton, UK

and Kanti V. Mardia
University of Leeds, UK

[Received April 2003. Final revision January 2004]

Summary. Short-term forecasts of air pollution levels in big cities are now reported in newspapers
and other media outlets. Studies indicate that even short-term exposure to high levels
of an air pollutant called atmospheric particulate matter can lead to long-term health effects.
Data are typically observed at fixed monitoring stations throughout a study region of interest
at different time points. Statistical spatiotemporal models are appropriate for modelling these
data. We consider short-term forecasting of these spatiotemporal processes by using a Bayesian
kriged Kalman filtering model. The spatial prediction surface of the model is built by using
the well-known method of kriging for optimum spatial prediction and the temporal effects are
analysed by using the models underlying the Kalman filtering method. The full Bayesian model
is implemented by using Markov chain Monte Carlo techniques which enable us to obtain the
optimal Bayesian forecasts in time and space. A new cross-validation method based on the
Mahalanobis distance between the forecasts and observed data is also developed to assess
the forecasting performance of the model implemented.
Keywords: Bending energy; Gibbs sampler; Kalman filter; Kriging; Markov chain Monte Carlo
methods; Spatial temporal modelling; State space model

1. Introduction
In recent years there has been a tremendous growth in the statistical models and techniques
to analyse spatiotemporal data such as air pollution data. Spatiotemporal data arise in many
other contexts, e.g. disease mapping and economic monitoring of real-estate prices. Often the
primary interests in analysing such data are to smooth and predict time evolution of some
response variables over a certain spatial domain.
Cressie (1994) and Goodall and Mardia (1994) obtained models for spatiotemporal data.
Mardia et al. (1998) introduced a combined approach which they call kriged Kalman filter
modelling. Recent references within this broad framework include Sansó and Guenni (1999,
2000), Stroud et al. (2001), Kyriakidis and Journel (1999), Wikle and Cressie (1999), Wikle
et al. (1998), Brown et al. (2000), Allcroft and Glasbey (2003) and Kent and Mardia (2002).
Kent and Mardia (2002) provided a unified approach to spatiotemporal modelling through
the use of drift and/or correlation in space and/or time to accommodate spatial continuity. For
drift functions, they have emphasized the use of so-called principal kriging functions, and for
correlations they have discussed the use of a first-order Markov structure in time combined with
spatial blurring. Here we adopt one of their strategies but in a full Bayesian framework.

Address for correspondence: Sujit K. Sahu, School of Mathematics, Southampton Statistical Sciences Research
Institute, University of Southampton, Highfield, Southampton, SO17 1BJ, UK.
E-mail: [email protected]

© 2005 Royal Statistical Society 0035–9254/05/54223

We work here with a process which is continuous in space and discrete in time. The underlying
spatial drift is modelled by the principal kriging functions and the time component at observed
sites is modelled by a vector random-walk process. The dynamic random-walk process models
stochastic trend and the resulting Bayesian analysis essentially leads to Kalman filtering, which
is a computational method to analyse dynamic time series data; see for example Mardia et al.
(1998). In addition, the models proposed are presented in a hierarchical framework following
Wikle et al. (1998). This allows the inclusion of a ‘nugget’ term in the spatial part of the model.
The model is fitted and used for forecasting in a unified computational framework by using Markov
chain Monte Carlo (MCMC) methods. The MCMC methods replace the task of Kalman
filtering by using a random-walk model in time.
The plan of the remainder of the paper is as follows. In Section 2 we describe the data set
that is used in this study. Section 3 describes the hierarchical Bayesian kriged Kalman filter
model. Important computational details are discussed in Section 4. In Section 5 we return to
the analysis of the data set that is described in Section 2. The paper ends with a discussion. The
data that are analysed in the paper can be obtained from
http://www.blackwellpublishing.com/rss

2. New York City air pollution data


This paper is motivated by the need to develop coherent Bayesian computational methodology
implementing flexible hierarchical models for short-term forecasting of spatiotemporal
processes. In environmental monitoring and prediction problems it is often desired to predict
the dependent variable, e.g. pollution level and rainfall, for 5 days or at most a week in
advance.
The Environmental Protection Agency in the USA monitor atmospheric particulate matter
that is less than 2.5 µm in size known as PM2.5. This PM2.5 measure is one of six primary air
pollutants and is a mixture of fine particles and gaseous compounds such as sulphur dioxide
(SO2) and nitrogen oxides (NOx). Interest in analysing fine particles such as PM2.5 comes from
the fact that as those particles are less than 2.5 µm in diameter they are sufficiently small to
enter the lungs and can cause various health problems. Short-term forecasting of PM2.5 levels
is the focus of this paper.
The data set that we analyse here is the PM2.5 concentration data that were observed at 15
monitoring stations in the city of New York during the first 9 months of 2002. The data are
observed once in every 3 days and during the first 9 months there were 91 equally spaced days.
Out of these 1365 (= 15 × 91) data points 126 were missing observations which we take to be
missing completely at random.
Let $z(s_i, t)$ denote the observed PM2.5 concentration level at site $s_i$ and at time $t$ where $i = 1, \ldots, n$ and $t = 1, \ldots, T$. Here we have $n = 15$ and $T = 91$.
Fig. 1 shows the locations of the sites numbered 1–15. The first three monitoring sites are in
the Bronx area of the city, sites 4–6 are in Brooklyn, sites 7–10 are in Manhattan, sites 11–13
are in Queens and lastly sites 14 and 15 are in Staten Island. These five boroughs constitute the
city of New York.
There are considerable spatiotemporal variations in these data. Fig. 2 provides the sitewise box
plots of the data. The plot shows that sites 7 and 8 in the Manhattan area are more polluted than
others. The concentration levels at sites 4–6 in Brooklyn are similar. However, the variations at
sites 14 and 15 are not similar, although they are on the same island. More discussion regarding
Fig. 2 is given later.

Fig. 1. 15 monitoring sites in New York City

Fig. 2. Box plot of the data at 15 sites (horizontal axis, sites 1–15; vertical axis, PM2.5 levels)

We formally investigate the spatial variation by using an empirical variogram of the data.
We first remove the temporal trends by taking the first differences for each time series from
the 15 sites, i.e. we obtain $w(s_i, t) = z(s_i, t+1) - z(s_i, t)$ for $t = 1, \ldots, T-1$ and $i = 1, \ldots, n$. The
time series plots of the difference data (not shown) confirmed that there were no more temporal
effects, but there were a few outliers. The variation in the resulting data $w(s_i, t)$ (without the
outliers) can be expected to have arisen from variation due to space.
To understand the behaviour of an isotropic and stationary process $W(s, t)$ we use the variogram defined by
$$2\gamma(d) = E[\{W(s_1, t) - W(s_2, t)\}^2]$$
where $d$ is the distance between the spatial locations $s_1$ and $s_2$. Traditionally variograms are
calculated by grouping the possible values of $d$ into bins, and by computing one value by taking
the sample average of $\{w(s_1, t) - w(s_2, t)\}^2$ values for which the distance $d$ between $s_1$ and $s_2$ lies
within a given bin. Here, we adopt a slightly different procedure. The estimate of $\gamma(d)$ for an
observed distance of $d$ is given by
$$\hat{\gamma}(d) = \frac{1}{2(T-1)} \sum_{t=1}^{T-1} \{w(s_1, t) - w(s_2, t)\}^2,$$
assuming that there are no missing observations. We remove the missing observations from the
above sum and adjust the denominator accordingly.
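As an illustration only (this code is not part of the original analysis), the estimate for one pair of sites can be sketched in Python. The function name `pair_variogram`, the use of NaN to mark missing observations and the reading of "adjust the denominator" as replacing $T-1$ by the number of available squared differences are all assumptions of this sketch.

```python
import numpy as np

def pair_variogram(z_i, z_j):
    """Estimate gamma-hat for one pair of sites from their raw series.

    z_i, z_j: 1-D arrays of length T; NaN marks a missing observation (assumed coding).
    """
    w_i = np.diff(z_i)                 # first differences w(s_i, t), t = 1..T-1
    w_j = np.diff(z_j)
    sq = (w_i - w_j) ** 2              # NaN wherever either difference is unavailable
    ok = ~np.isnan(sq)
    m = int(ok.sum())                  # number of usable terms in the sum
    if m == 0:
        return float("nan")
    return float(sq[ok].sum() / (2.0 * m))
```

Calling this function for every pair of sites and plotting the results against the pairwise distances gives a variogram cloud of the kind summarized in Fig. 3.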
We use the geodetic distance between two locations with given latitudes $\theta_1$ and $\theta_2$ and longitudes
$\phi_1$ and $\phi_2$ (converted to radians). The geodetic distance $d$ is the distance at the surface of
the Earth considered as a sphere of radius $R = 6371$ km. We use the formula
$$d = 6371 \cos^{-1}(B) \quad \text{(km)},$$
where
$$B = \sin(\theta_1) \sin(\theta_2) + \cos(\theta_1) \cos(\theta_2) \cos(\phi_1 - \phi_2).$$
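The distance formula translates directly into code. The following is a minimal sketch; the function name and the convention of passing coordinates in degrees are assumptions, and the clamp on $B$ is added here only to guard against rounding slightly outside $[-1, 1]$.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical Earth radius used in the text

def geodetic_distance_km(lat1_deg, lon1_deg, lat2_deg, lon2_deg):
    """Great-circle distance d = R * arccos(B) between two points given in degrees."""
    theta1, theta2 = math.radians(lat1_deg), math.radians(lat2_deg)
    phi1, phi2 = math.radians(lon1_deg), math.radians(lon2_deg)
    b = (math.sin(theta1) * math.sin(theta2)
         + math.cos(theta1) * math.cos(theta2) * math.cos(phi1 - phi2))
    b = max(-1.0, min(1.0, b))         # keep acos defined under floating-point rounding
    return EARTH_RADIUS_KM * math.acos(b)
```

For sites that are very close together the arccosine form loses some numerical precision; this is unlikely to matter at the city-wide distances considered here.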
Fig. 3 provides a plot of the estimated variogram $\hat{\gamma}(d)$ against $d$. Site 14 has been omitted
from this plot because it contained two outlying extreme observations on June 10th and 13th,
and these high observations distorted the general linear trend that is seen in the variogram plot.
We shall return to this issue later in Section 5.2.
The variogram plot (Fig. 3) shows strong linear spatial variations. The full curve in Fig. 3 is
the empirical LOESS fit (S-PLUS function loess) to the estimated variogram. The variogram
plot does not show a clear finite range and a finite sill. However, a finite range and a finite sill
can be seen if the five extreme variogram values for distance values above 30 were ignored. The
dotted curve in the plot is the LOESS fit to the variogram after removing these five extreme
values. The underlying theoretical variogram corresponding to the dotted curve does indicate
the presence of a finite sill and a finite range.

Fig. 3. Variogram of the differenced data after removing site 14 (horizontal axis, distance; vertical axis, variogram): full curve, empirical LOESS fit; dotted curve, empirical LOESS fit after removing five extreme points corresponding to distance values more than 30

Note, however, that the plotted variograms are to be treated as exploratory tools where the
main objective is to show spatial variations in the data. These exploratory and empirical variograms
should not be confused with the Matérn family (Matérn, 1986) of covariance functions
that are assumed in Section 3.1 for the latent variables which appear in a lower level hierarchy
of model building. The latent variables there are not the same as the differenced data points $w(s, t)$
here. See Section 5.2 for more discussion regarding this.
The box plots in Fig. 2 also indicate that there is a very large observed value at each site. Further
investigation (Fig. 4) shows that the large observation at each site was for July 7th which
was the first day of monitoring after the July 4th firework celebrations. These large observations
are also seen to be positively skewed (see the box plot of the data for July 7th plotted in Fig. 5).
In Fig. 5 the box plot of the data for July 4th is also presented for comparisons. This plot shows
negative skewness for the PM2.5 concentration data on July 4th. Perhaps this is to be expected
for pollution data on a regular day since high levels of concentration can only be expected to
occur at a few sites. In any case such differences in observed variations will affect the spatial
predictions; see Section 5.2 where we report the spatial predictions for both July 4th and July 7th.
These very large observations make the data non-stationary in time and will cause problems
in modelling using traditional regression-based methods. The short-term forecasting models
that we propose here are non-stationary and are seen to be adequate for the entire data; see
Section 5. Moreover, our modelling approach here does not require explicit modelling of the
large observations (e.g. using an indicator covariate for the days with large observations).
Some exploratory linear models fitted to both the raw data sets and their transformations
suggested that it is better to model the square-root transformation of the data which encouraged
normality. Smith et al. (2003) also reported similar findings. Henceforth, we model the
square root of the data. However, we make the predictions on the original scale for ease of
communicating to practitioners.
The Environmental Protection Agency use mostly linear regression models to forecast the
PM2.5 levels. Models based on classification and regression trees are also used sometimes; see
for example Dye et al. (2002). Some explanatory variables, e.g. precipitation, temperature, wind
speed and holidays, are used in their models. However, there are several limitations in their
approach. The main drawbacks arise because regression models cannot be used satisfactorily
for data which are correlated in space and time. The explanatory variables can be used in our
analysis as well perhaps to enhance model fitting, but we do not include those here because some
of the explanatory variables are themselves to be predicted first to obtain forecasts of PM2.5.
Smith et al. (2003) analysed PM2.5 data for North Carolina, South Carolina and Georgia
by using specific models for spatial and temporal effects. They used weekly dummy variables
to model the time effect and incorporated a spatial trend model using thin plate splines. See
for example Mardia and Goodall (1993) for more on thin plate splines. Moreover, they have
included covariates, e.g. land use, in their model to discriminate between concentration levels
in the vast area that is covered by the three states.

3. The kriged Kalman filter model


The general model that we propose here is for spatiotemporal data recorded at $n$ sites $s_i$, $i = 1, \ldots, n$, over a period of $T$ equally spaced time points. Let $Z_t = (Z(s_1, t), \ldots, Z(s_n, t))'$ denote
the $n$-dimensional observation vector at time point $t$, $t = 1, \ldots, T$.

Fig. 4. Raw data for (a) July 4th and (b) July 7th. Map labels in extraction order (site positions are not recoverable): (a) 22.5, 24.1, 23.8, 23.8, 23, 24, 24.2, 22.6, 32.5, 23, 22.7, 24.2, 28, 22.7; (b) 80, 78.2, 79, 77.6, 79.1, 79.4, 81.6, 76.1, 85.8, 81.3, 80.7, 86.4, 84.5, 82.5

Often, the first step in modelling spatiotemporal data is to assume a hierarchical model
$$Z_t = Y_t + \varepsilon_t \tag{1}$$
where $Y_t = (Y(s_1, t), \ldots, Y(s_n, t))'$ is an unobserved but scientifically meaningful process (signal)
and $\varepsilon_t$ is a white noise process. Thus we assume that the components of $\varepsilon_t$ are independent
and identically distributed normal random variables with mean 0 and unknown variance $\sigma_\varepsilon^2$. In
geostatistics, these error terms are often known as a nugget effect. A certain specific correlation
structure for $\varepsilon_t$ can also be considered. However, we assume specific structures in the next level of
model hierarchy. The prior distribution for $\tau_\varepsilon^2 = 1/\sigma_\varepsilon^2$ is assumed to be the gamma distribution
with shape parameter $a$ and rate parameter $b$. We assume that $a = b = 0.001$ so that the gamma
distribution has mean 1 and variance 1000. The resulting prior distribution has the desirable
property that it is proper but diffuse.

Fig. 5. Box plots of the data for (a) July 4th and (b) July 7th

The space–time process $Y_t$ is thought to be the sum of parametric systematic components $\theta_t$
and an isotropic time homogeneous spatial process denoted by $\gamma_t$. Thus we assume that
$$Y_t = \theta_t + \gamma_t \tag{2}$$
where the error term $\gamma_t$ is assumed to be zero mean Gaussian with covariance matrix $\Sigma_\gamma$ which
has elements
$$\sigma(s_i, s_j) = \mathrm{cov}\{Y(s_i, t), Y(s_j, t)\} \tag{3}$$
for $i, j = 1, \ldots, n$. The quantity $\sigma(s_i, s_j)$ is the covariance function of the spatial process to be
specified later. The components of $\theta_t$ are unspecified as well and will be discussed in the following
subsections.
The modelling hierarchies (1) and (2) are used when it is desired to predict the smooth process
$Y(s, t)$ rather than the observed noisy process $Z(s, t)$; see for example Wikle and Cressie (1999).
They also pointed out that it is not desirable to coalesce the two equations into
$$Z_t = \theta_t + \gamma_t + \varepsilon_t. \tag{4}$$
This equation also defines an inefficient and often unidentifiable parameterization; see for
example Gelfand et al. (1995) for other examples.

3.1. Models for the spatial covariance


We assume that the covariance function belongs to the Matérn family (Matérn, 1986)
$$\sigma(s_i, s_j) = \sigma_\gamma^2 \, \frac{1}{2^{\kappa-1}\Gamma(\kappa)} \, (\lambda d_{ij})^{\kappa} K_\kappa(\lambda d_{ij}), \qquad \lambda > 0, \; \kappa \geq 1, \tag{5}$$
where $d_{ij}$ is the geodetic distance between sites $s_i$ and $s_j$, and $K_\kappa(\cdot)$ is the modified Bessel function of
the second kind and of order $\kappa$; see for example Berger et al. (2001). For our illustration we take
$\kappa = 1$ and consider several values for $\lambda$. We choose the particular $\lambda$ by using a predictive model
choice criterion. We estimate $\sigma_\gamma^2$ by using MCMC methods. There are many possible parametric
and semiparametric models for covariance of isotropic spatial processes; see for example Ecker
and Gelfand (1997) where a Bayesian model choice study has been presented.
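For readers who wish to experiment, a Matérn covariance of this form can be evaluated with SciPy's `scipy.special.kv` (modified Bessel function of the second kind) and `scipy.special.gamma`. This is a sketch, not the authors' code: the function name `matern_cov` and its default arguments are assumptions, and the value at distance zero is set to the limiting value $\sigma_\gamma^2$.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(d, sigma2=1.0, lam=1.0, kappa=1.0):
    """Matern covariance at distance(s) d, with variance sigma2, decay lam, order kappa."""
    x = lam * np.asarray(d, dtype=float)
    x_safe = np.where(x > 0, x, 1.0)   # placeholder argument where d = 0 (kv diverges at 0)
    val = sigma2 * x_safe**kappa * kv(kappa, x_safe) / (2.0**(kappa - 1.0) * gamma(kappa))
    return np.where(x > 0, val, sigma2)  # limit of the formula as d -> 0 is sigma2
```

Applied to an $n \times n$ matrix of pairwise distances $d_{ij}$, the function returns the matrix $\Sigma_\gamma$ elementwise.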
In our Bayesian set-up, a prior distribution for $\sigma_\gamma^2$ must be specified. Here we assume that
$\tau_\gamma^2 = 1/\sigma_\gamma^2$ follows the gamma prior distribution with parameters $a$ and $b$. We take $a = b = 0.001$
so that the gamma distribution has mean 1 and variance 1000. Our choice avoids the default
improper prior distribution, namely
$$\pi(\sigma_\gamma^2) = 1/\sigma_\gamma^2, \qquad \sigma_\gamma^2 > 0,$$
because this may lead to improper posterior distributions which would be difficult to verify in
practice; see for example Berger et al. (2001) and Gelfand and Sahu (1999).

3.2. Principal kriging functions


The systematic component $\theta_t$ is assumed to evolve as a stochastic time-varying linear combination
of some optimal spatial functions. These are taken to be the principal kriging functions
following Kent and Mardia (2002) and Mardia et al. (1998). Given a certain known covariance
function, the best linear unbiased prediction of the spatial process is called ‘kriging’. The principal
kriging functions are used as the optimal spatial functions on which the dynamic temporal
effects take place. Thus the first term of $\theta_t$ is given by
$$H\alpha_t = \left( \sum_{j=1}^{p} h_{s_1 j}\,\alpha_{tj}, \; \ldots, \; \sum_{j=1}^{p} h_{s_n j}\,\alpha_{tj} \right)'$$
where the matrix $H$ is $n \times p$ with $ij$th element $h_{s_i j}$, for $i = 1, \ldots, n$, $j = 1, \ldots, p$, and $\alpha_t = (\alpha_{t1}, \ldots, \alpha_{tp})'$. The choice of $p$ is discussed at the end of this section. The columns of $H$ are
determined by principal fields in kriging space and $\alpha_t$ is a temporal state vector which varies in
time.
The matrix $H$ quantifies the spatial component in the model; when multiplied by the dynamic
time component $\alpha_t$, it provides a time-varying linear combination of the spatial regression surface
that is described by the columns of $H$. The columns consist of two sets of spatial trend
fields. The first set of $q$ columns corresponds to the constant, linear and quadratic functions of
co-ordinate dimensions, say. For example, if $q = 3$ and the spatial dimension is $d = 2$ the first column can be chosen to
be 1 corresponding to the constant trend field and the entries in the other two columns can be
taken as the X- and Y-co-ordinates of the locations where data have been observed. This $n \times q$
matrix is denoted by $F$ in the following discussion.
The remaining $p - q$ fields are chosen as the spatial directions relative to an assumed covariance
structure. The directions are obtained as follows. Assume, for developing the principal
functions, that the data are collected for only one time point; thus the suffix $t$ is suppressed
in the following discussion. Let $\Sigma_\gamma$ and $F'\Sigma_\gamma^{-1}F$ be non-singular matrices. Assume that $Z = (Z(s_1), \ldots, Z(s_n))'$ follows the multivariate normal distribution with mean and variance given by
$$E(Z) = F\mu, \qquad \mathrm{cov}(Z) = \Sigma_\gamma,$$
which is a simplified version of the full model that is considered in this paper. Under this model
(and a flat prior on the parameter $\mu$) the predictive mean for a site $s$ is
$$E\{Z(s) \mid \Sigma_\gamma, z\} = f(s)'Az + \sigma(s)'Bz \tag{6}$$
where $f(s)$ is the ($q \times 1$)-vector of the trend field at the site $s$; $z$ is the realization and $\sigma(s) = (\sigma(s, s_1), \ldots, \sigma(s, s_n))'$;
$$A = (F'\Sigma_\gamma^{-1}F)^{-1}F'\Sigma_\gamma^{-1}, \qquad B = \Sigma_\gamma^{-1} - \Sigma_\gamma^{-1}FA.$$
Methods are available for singular $\Sigma_\gamma$ which are required for thin plate splines; see Kent and
Mardia (1994). If the site $s$ coincides with any particular $s_i$, $i = 1, \ldots, n$, then it is easy to see
that the above predictive mean reduces to $z(s)$ as expected.
The matrix $B$ is known as the bending energy matrix; see for example Bookstein (1989) who
motivated its use from the study of thin plate splines. Consider the spectral decomposition of $B$,
$$B = UEU', \qquad Bu_i = e_i u_i,$$
where $U = (u_1, \ldots, u_n)$ and $E = \mathrm{diag}(e_1, \ldots, e_n)$, and assume without loss of generality that the
eigenvalues are in non-decreasing order, $e_1 = \ldots = e_q = 0 < e_{q+1} \leq \ldots \leq e_n$. It is easy to verify
that $B$ satisfies $BF = 0$. Thus the columns of $F$ can be thought of as the eigenvectors that are
associated with the null eigenvalues, $e_1, \ldots, e_q$.
Any observation vector $z$ can be represented as a linear combination of the eigenvectors $u_i$
since the latter set forms a basis. Indeed, suppose that $z = \sum_{i=1}^{n} c_i u_i$ for suitable constants $c_i$. Now
the predictive mean (6) reduces to
$$f(s)'A \sum_{i=1}^{n} c_i u_i + \sum_{i=q+1}^{n} c_i e_i \sigma(s)'u_i.$$
Thus the predictive mean is a linear combination of the $q$ trend fields $f_1(s), \ldots, f_q(s)$ and the
$n - q$ principal kriging functions $e_i \sigma(s)'u_i$. These functions span the space of all kriging solutions
with observations at the $n$ given sites, the specified trend fields and the covariogram. We shall
use the terms principal kriging functions and principal fields interchangeably henceforth.
The smaller eigenvalues of $B$ are associated with large scale spatial variation (global features)
and the larger eigenvalues describe local spatial variation. This can be inferred from the fact that
the global trend fields that are described by the columns of $F$ are the eigenvectors corresponding
to the zero eigenvalues of $B$. See also Kent and Mardia (2002) and Mardia et al. (1998) for more
details in this regard. In practice, for model reduction, we may choose to work with $p - q < n - q$
principal functions. Thus, when the values at the observed sites are to be predicted, we choose
the $p - q$ columns of $H$ to be $e_i \Sigma_\gamma u_i$, $i = q + 1, \ldots, p$. Hence the matrix $H$ is taken as
$$H = (F, \; e_{q+1}\Sigma_\gamma u_{q+1}, \; \ldots, \; e_p \Sigma_\gamma u_p). \tag{7}$$
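The construction of $A$, $B$ and $H$ can be sketched numerically as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `principal_kriging_basis` is hypothetical, the eight sites are simulated, and an exponential covariance on Euclidean distance stands in for the Matérn covariance on geodetic distance.

```python
import numpy as np

def principal_kriging_basis(F, Sigma, p):
    """Return A, B, the eigenvalues e of B and the n x p matrix H of equation (7)."""
    n, q = F.shape
    Sinv = np.linalg.inv(Sigma)
    A = np.linalg.solve(F.T @ Sinv @ F, F.T @ Sinv)   # (F' S^-1 F)^-1 F' S^-1
    B = Sinv - Sinv @ F @ A                           # bending energy matrix
    B = (B + B.T) / 2.0                               # remove round-off asymmetry
    e, U = np.linalg.eigh(B)                          # eigenvalues in ascending order
    # Columns q+1..p of H: e_i * Sigma * u_i (the first q eigenvalues are numerically zero).
    fields = [e[i] * (Sigma @ U[:, i]) for i in range(q, p)]
    H = np.column_stack([F] + fields)
    return A, B, e, H

# Toy illustration: 8 hypothetical sites, constant and linear trend fields (q = 3), p = 5.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 10.0, size=(8, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma = np.exp(-0.3 * dist)                           # stand-in covariance (assumption)
F = np.column_stack([np.ones(8), coords])
A, B, e, H = principal_kriging_basis(F, Sigma, p=5)
print(np.allclose(B @ F, 0.0, atol=1e-8))             # numerical check that BF = 0
```

Because `numpy.linalg.eigh` returns eigenvalues in ascending order, the $q$ numerically zero eigenvalues come first, which matches the ordering assumed in the text.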

In what follows we shall illustrate the choice of $p$ and $q$ in particular examples, including
the case $p = q$ for which no principal kriging functions are taken in the model. Models
with only polynomials (without the principal fields) are often used in the literature; see for
example the spatiotemporal model that was adopted by Sansó and Guenni (1999). Principal
kriging functions have some advantages over only polynomial-type trend functions, which
is the case for $p = q$: they grow less quickly than polynomials outside the domain of the
data.

3.3. Dynamic temporal trend models


Motivated by our example, here we concentrate on smoothing and short-term forecasting in the
temporal domain. A standard procedure in such cases is to adopt a random-walk state space
type of formulation for temporal components; see for example Stroud et al. (2001) and Gelfand
et al. (2004). We thus assume that
$$\alpha_t = \alpha_{t-1} + \eta_t, \tag{8}$$
where the $p$-dimensional error term $\eta_t$ is assumed to be normally distributed with mean 0 and
covariance matrix $\Sigma_\eta$. To complete the modelling hierarchies we suppose that $\alpha_0 \sim N(0, C_\alpha I)$
with a large value of $C_\alpha$. Here $I$ denotes the identity matrix of appropriate order. See West
and Harrison (1997) for more on dynamic time series models.
We assume that $Q_\eta = \Sigma_\eta^{-1}$ has the Wishart prior distribution, i.e.
$$Q_\eta \sim W_p(2a_\eta, 2b_\eta)$$
where $2a_\eta$ is the assumed prior degrees of freedom (greater than or equal to $p$) and $b_\eta$ is a
known positive definite matrix, to be specified later. We say that $X$ has the Wishart distribution
$W_p(m, R)$ if its density is proportional to
$$|R|^{m/2} \, |x|^{(m-p-1)/2} \exp\{-\tfrac{1}{2}\,\mathrm{tr}(Rx)\}$$
if $x$ is a $p \times p$ positive definite matrix; see for example Mardia et al. (1979), page 85. (Here
$\mathrm{tr}(A)$ is the trace of a matrix $A$.) To obtain diffuse but proper prior distributions we choose
$a_\eta = p/2$. This assumption makes the prior distributions worth the same number of observations
as the corresponding dimensions and is often used in a multivariate Bayesian modelling
framework. The matrix $2b_\eta$ is chosen to be 0.01 times the identity matrix. This again comes
from the requirement of assuming diffuse prior distributions.
An alternative to the assumption of stochastic trend is to consider deterministic polynomial
trend models. For example, we can assume that $\alpha_t = (1, t, t^2, \ldots, t^{p-1})'$. This polynomial
trend model is not as flexible as the stochastic trend model (8). Hence we do not consider the
polynomial trend model at all, and we always work with the stochastic trend model (8).
A referee has commented that from a fluid dynamics perspective this random-walk model
cannot be fully justified for atmospheric systems. In fact, Mardia et al. (1998) have taken the
state equation (8) of the form
$$\alpha_t = P\alpha_{t-1} + \eta_t$$
with unknown transition matrix $P$. There are some identifiability problems with this approach
as discussed by Kent and Mardia (2002) for a general $P$ and a general covariance matrix for
$\eta_t$. They showed that it is sufficient to assume that the largest eigenvalue of $P$ is less than 1 in
absolute value and the matrix $H$ is of full rank.
In our Bayesian set-up the identifiability problems can be resolved by assuming proper prior
distributions for both $P$ and $\alpha_t$. However, we model with the choice $P = I$ which is motivated
by the need to develop models for short-term forecasting. Moreover, this choice avoids insurmountable
problems in MCMC convergence (which we have encountered) arising from the
weak identifiability of the parameters under sufficiently diffuse prior distributions.
4. Computations
4.1. The joint posterior distribution
To obtain the joint posterior distribution we recall that
$$Z(s_i, t) \mid Y(s_i, t) \sim N\{Y(s_i, t), \sigma_\varepsilon^2\}, \qquad i = 1, \ldots, n, \; t = 1, \ldots, T,$$
where
$$Y_t = (Y(s_1, t), Y(s_2, t), \ldots, Y(s_n, t))' \sim N(\theta_t, \Sigma_\gamma), \qquad t = 1, \ldots, T,$$
independently. Further, we have assumed that
$$\theta(s, t) = \sum_{j=1}^{p} h_{sj}\,\alpha_{tj}, \tag{9}$$
and $\alpha_t \sim N(\alpha_{t-1}, \Sigma_\eta)$ for $t = 1, \ldots, T$ and $\alpha_0 \sim N(0, C_\alpha I)$.

Let $\xi$ denote the following exhaustive set of parameters:
(a) the error precision parameters, $\tau_\gamma^2 = 1/\sigma_\gamma^2$ and $\tau_\varepsilon^2 = 1/\sigma_\varepsilon^2$,
(b) the latent process $Y_t$, $t = 1, \ldots, T$,
(c) the dynamic parameters, $\alpha_t$, $t = 1, \ldots, T$, and their precision matrix $Q_\eta = \Sigma_\eta^{-1}$, and
(d) the missing data, $Z^*(s, t)$ for all $s$ and $t$ for which $Z(s, t)$ is missing.
The log-likelihood function for the hierarchical model is given, up to an additive constant, by
$$\log\{f(z_1, \ldots, z_T \mid \xi)\} = \frac{nT}{2}\log(\tau_\varepsilon^2) - \frac{\tau_\varepsilon^2}{2}\sum_{t=1}^{T}(z_t - y_t)'(z_t - y_t) - \frac{T}{2}\log|\Sigma_\gamma| - \frac{1}{2}\sum_{t=1}^{T}(y_t - \theta_t)'\Sigma_\gamma^{-1}(y_t - \theta_t).$$
The joint posterior density is obtained, up to a normalizing constant, as the product of the
above likelihood function and the prior distributions for the parameters in the model, i.e.
$$\pi(\xi \mid z_1, \ldots, z_T) \propto f(z_1, \ldots, z_T \mid \xi)\,\pi(\xi) \tag{10}$$
where $\pi(\xi)$ denotes the prior distribution that is assumed for the parameters in $\xi$ except for the
missing data $Z^*(s, t)$.

4.2. The full conditional distributions


We derive the full conditional distributions that are needed for Gibbs sampling under both
the above models; see for example Carter and Kohn (1994) for similar calculations in state
space models. The full conditional distribution of $\tau_\varepsilon^2$ is the gamma distribution with parameters
$a + Tn/2$ and
$$b + \frac{1}{2}\sum_{t=1}^{T}(z_t - y_t)'(z_t - y_t).$$
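A single Gibbs draw from this gamma full conditional might look as follows. The function name and array layout are assumptions of this sketch; `z` is taken to be complete, i.e. any missing values have already been imputed at the current iteration. NumPy's gamma sampler is parameterized by shape and scale, so the rate is inverted.

```python
import numpy as np

def draw_tau_eps2(z, y, a=0.001, b=0.001, rng=None):
    """Draw tau_eps^2 from Gamma(shape = a + Tn/2, rate = b + SS/2).

    z, y: arrays of shape (T, n) holding the data and the current latent process.
    """
    rng = np.random.default_rng() if rng is None else rng
    resid = z - y
    shape = a + resid.size / 2.0                   # resid.size equals T * n
    rate = b + 0.5 * np.sum(resid ** 2)
    return rng.gamma(shape, 1.0 / rate)            # NumPy uses scale = 1 / rate
```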

The full conditional distribution of $y_t$ is the multivariate normal distribution $N(V\mu_t, V)$ where
$$V^{-1} = \tau_\varepsilon^2 I + \Sigma_\gamma^{-1}, \qquad \mu_t = \tau_\varepsilon^2 z_t + \Sigma_\gamma^{-1}\theta_t.$$
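The corresponding update of the latent vector $y_t$ can be sketched as below. The function names are hypothetical, and explicit matrix inverses are used for readability rather than efficiency.

```python
import numpy as np

def y_full_conditional(z_t, theta_t, tau_eps2, Sigma_gamma):
    """Mean V mu_t and covariance V of the full conditional distribution of y_t."""
    n = z_t.shape[0]
    Sinv = np.linalg.inv(Sigma_gamma)
    V = np.linalg.inv(tau_eps2 * np.eye(n) + Sinv)
    V = (V + V.T) / 2.0                            # enforce symmetry against round-off
    mu_t = tau_eps2 * z_t + Sinv @ theta_t
    return V @ mu_t, V

def draw_y_t(z_t, theta_t, tau_eps2, Sigma_gamma, rng):
    """One Gibbs draw of y_t."""
    mean, V = y_full_conditional(z_t, theta_t, tau_eps2, Sigma_gamma)
    return rng.multivariate_normal(mean, V)
```

When $\tau_\varepsilon^2$ is large (small nugget variance) the conditional mean is pulled towards the data $z_t$; when it is small the mean is pulled towards $\theta_t$.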
The full conditional distribution of $\tau_\gamma^2$ is the gamma distribution with parameters $a + Tn/2$
and
$$b + \frac{1}{2}\sum_{t=1}^{T}(y_t - \theta_t)'V^{-1}(y_t - \theta_t),$$
where $V_{ij} = (\lambda d_{ij})K_\kappa(\lambda d_{ij})$ for our choice $\kappa = 1$ (here $V$ denotes the spatial correlation matrix, not the conditional covariance matrix of $y_t$ above). This conjugate distribution is obtained by using the facts
(a) $\Sigma_\gamma = \sigma_\gamma^2 V$ where $V$ is free of $\sigma_\gamma^2$ and
(b) $H$ is invariant with respect to (i.e. free of) $\sigma_\gamma^2$.
The second claim is proved as follows. Note that the matrices $A$ and $F$ are free of $\sigma_\gamma^2$; $B = \tau_\gamma^2(V^{-1} - V^{-1}FA)$. The eigenvalues of $B$ will be $\tau_\gamma^2$ multiples of the eigenvalues of $V^{-1} - V^{-1}FA$
(which is free of $\sigma_\gamma^2$). The multiplier $\tau_\gamma^2$ cancels out when forming $H$ because of the premultiplication
by $\Sigma_\gamma$.
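The cancellation can be written out explicitly. In the sketch below, $\tilde{B}$ and $\tilde{e}_i$ are shorthand introduced here for illustration (they do not appear in the original text), denoting $V^{-1} - V^{-1}FA$ and its eigenvalues:

$$
\begin{aligned}
A &= (F'\Sigma_\gamma^{-1}F)^{-1}F'\Sigma_\gamma^{-1} = (\tau_\gamma^2 F'V^{-1}F)^{-1}\,\tau_\gamma^2 F'V^{-1} = (F'V^{-1}F)^{-1}F'V^{-1},\\
B &= \tau_\gamma^2 \tilde{B}, \qquad \tilde{B}u_i = \tilde{e}_i u_i \;\Longrightarrow\; Bu_i = \tau_\gamma^2 \tilde{e}_i u_i, \quad\text{so } e_i = \tau_\gamma^2 \tilde{e}_i,\\
e_i \Sigma_\gamma u_i &= (\tau_\gamma^2 \tilde{e}_i)(\sigma_\gamma^2 V)u_i = \tilde{e}_i V u_i .
\end{aligned}
$$

Each column of $H$ in equation (7) is therefore either a column of $F$ or a vector $\tilde{e}_i V u_i$, neither of which involves $\sigma_\gamma^2$.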
The full conditional distribution of $\alpha_t$ is $N(V_t\mu_t, V_t)$ where
$$
\begin{aligned}
V_t^{-1} &= I/C_\alpha + Q_\eta, & \mu_t &= Q_\eta \alpha_{t+1}, & &\text{when } t = 0,\\
V_t^{-1} &= H'\Sigma_\gamma^{-1}H + 2Q_\eta, & \mu_t &= H'\Sigma_\gamma^{-1}y_t + Q_\eta(\alpha_{t-1} + \alpha_{t+1}), & &\text{when } 0 < t < T,\\
V_t^{-1} &= H'\Sigma_\gamma^{-1}H + Q_\eta, & \mu_t &= H'\Sigma_\gamma^{-1}y_t + Q_\eta \alpha_{t-1}, & &\text{when } t = T.
\end{aligned}
$$
Block updating of all the $\alpha_t$, $t = 1, \ldots, T$, can also be considered. However, this will mean storage
and inversion of ($p \times T$)-dimensional matrices. Although the matrices will be structured
band diagonal matrices, additional programming effort will be required to implement the block
updating methods. Componentwise updating, as implemented here, will work fine when the
states are not highly correlated.
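The three cases of the componentwise update can be collected in one function. This is a sketch under assumed conventions, not the authors' code: `alpha` is a $(T+1) \times p$ array holding $\alpha_0, \ldots, \alpha_T$, and `y` is a $(T+1) \times n$ array whose row 0 is an unused placeholder so that row $t$ holds $y_t$.

```python
import numpy as np

def alpha_full_conditional(t, alpha, y, H, Sigma_inv, Q, C_alpha):
    """Mean V_t mu_t and covariance V_t of the full conditional of alpha_t."""
    T = alpha.shape[0] - 1
    p = Q.shape[0]
    if t == 0:
        prec = np.eye(p) / C_alpha + Q
        mu = Q @ alpha[1]
    else:
        HtS = H.T @ Sigma_inv
        if t < T:
            prec = HtS @ H + 2.0 * Q
            mu = HtS @ y[t] + Q @ (alpha[t - 1] + alpha[t + 1])
        else:                                      # t == T: no successor state
            prec = HtS @ H + Q
            mu = HtS @ y[t] + Q @ alpha[t - 1]
    V = np.linalg.inv(prec)
    V = (V + V.T) / 2.0                            # enforce symmetry against round-off
    return V @ mu, V

def gibbs_sweep_alpha(alpha, y, H, Sigma_inv, Q, C_alpha, rng):
    """Update alpha_0, ..., alpha_T in turn, each from its full conditional (in place)."""
    for t in range(alpha.shape[0]):
        mean, V = alpha_full_conditional(t, alpha, y, H, Sigma_inv, Q, C_alpha)
        alpha[t] = rng.multivariate_normal(mean, V)
    return alpha
```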
Missing data, denoted by $Z^*(s, t)$, are sampled at each MCMC iteration by using the full
conditional distribution $N\{Y(s, t), \sigma_\varepsilon^2\}$.

4.3. Forecasting
The posterior predictive distributions are used to make step ahead predictions (forecasts). The
one-step-ahead forecast distribution is given by
$$\pi(z_{T+1} \mid z_1, \ldots, z_T) = \int \pi(z_{T+1} \mid \xi)\,\pi(\xi \mid z_1, \ldots, z_T)\,d\xi, \tag{11}$$
where the likelihood term $\pi(z_{T+1} \mid \xi)$ is obtained from the hierarchical model (1). The mean
$E(Z_{T+1} \mid z_1, \ldots, z_T)$ under density (11) provides the optimal one-step ahead forecast under a
squared error loss function. To approximate $E(Z_{T+1} \mid z_1, \ldots, z_T)$ we draw samples $Z_{T+1}^{(j)}$ from
$\pi(z_{T+1} \mid \xi^{(j)})$, where $\xi^{(j)}$ is the $j$th MCMC draw from the posterior distribution, and form the sample average. Other interesting summary measures, e.g. the 95%
predictive intervals, are obtained by appropriately using the samples $z_{T+1}^{(j)}$; see for example
Gelfand (1996).
Suppose that we are not only interested in one-step-ahead predictions but also in $L$-step-ahead
predictions where $L > 1$ is a positive integer. We obtain the predictive distribution (11), but here
the dynamic parameters, e.g. the $\alpha_t$, are first sampled from their distributions specified by the
model; see equation (8). Using these forward values of the parameters we sample $Z_{T+L}^{(j)}$ from
the likelihood. These last samples are then averaged to obtain the estimated forecasts.
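The forward simulation described above can be sketched for a single posterior draw as follows. The function name `forecast_path` and its argument layout are assumptions; the inputs are the values of $\alpha_T$, $\Sigma_\eta$, $\sigma_\varepsilon^2$ and $\Sigma_\gamma$ at one MCMC iteration. The back-transformation from the square-root scale to the original scale is omitted.

```python
import numpy as np

def forecast_path(alpha_T, Sigma_eta, sigma2_eps, H, Sigma_gamma, L, rng):
    """Simulate z_{T+1}, ..., z_{T+L} at the n monitored sites for one posterior draw."""
    n = H.shape[0]
    alpha = np.array(alpha_T, dtype=float)
    path = np.empty((L, n))
    for step in range(L):
        alpha = rng.multivariate_normal(alpha, Sigma_eta)        # random walk, equation (8)
        y = rng.multivariate_normal(H @ alpha, Sigma_gamma)      # latent process, equation (2)
        path[step] = y + rng.normal(0.0, np.sqrt(sigma2_eps), size=n)  # nugget, equation (1)
    return path
```

Repeating the call once per retained MCMC iteration and averaging the resulting paths elementwise gives the Monte Carlo estimate of the $L$-step-ahead forecasts; empirical quantiles of the same draws give predictive intervals.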
Throughout the paper we assume that the mean and variance of the L-step-ahead forecast
distribution exist. This assumption is very reasonable in our set-up since we are primarily interested
in making short-term forecasts. We can use other summary measures, e.g. the median if
the means are not finite. Moreover, in such situations MCMC samples drawn from the forecast
distribution may drift to infinite values, thereby giving an early indication of problems. This
may happen if the model is a very poor fit to the data. Some further checks on model validity
should be performed before finally abandoning the current models in favour of new ones.
The predictive distribution (11) is used to obtain simultaneous forecasts for all the monitored
sites at any future time point t > T . Suppose that it is desired to predict the response at some
unmonitored sites at any given time point t where t can be less than or equal to T . The meth-
odology for obtaining the predictive distribution at one particular unmonitored site is given
below; the extension for more than one site is straightforward and obvious.
To predict at an unmonitored site, $s$ say, we use a predictive distribution like equation (11)
with the following modifications to account for the spatial correlations between the responses
at site $s$ and at the monitored sites $s_1, s_2, \ldots, s_n$.
We first obtain the spatial covariance matrix $\Sigma_\gamma^*$ of order $n + 1$ by using the assumed
covariogram (3), i.e.

$$\Sigma_\gamma^* = \begin{pmatrix} \Sigma_\gamma & \Sigma_{12}(s) \\ \Sigma_{12}'(s) & \sigma(s, s) \end{pmatrix},$$

where $\Sigma_{12}(s)$ is the n-dimensional vector with elements $\sigma(s_i, s)$, $i = 1, \ldots, n$. On the basis
of the $n + 1$ spatial locations $s_1, s_2, \ldots, s_n$ and $s$ we derive the $(n + 1) \times p$ matrix $H^*$ by
using equation (7) where we replace $\Sigma_\gamma$ by $\Sigma_\gamma^*$. Let us partition the matrix $H^*$ as

$$H^* = \begin{pmatrix} H_1^* \\ H_2^* \end{pmatrix}$$

where $H_1^*$ is $n \times p$ and $H_2^*$ is $1 \times p$. We now have that

$$\begin{pmatrix} Y_t \\ Y(s, t) \end{pmatrix} \sim N(H^* \alpha_t, \Sigma_\gamma^*)$$

by using the model assumption (2). From this multivariate normal distribution we obtain that

$$Y(s, t) \mid \xi \sim N\{H_2^* \alpha_t + \Sigma_{12}'(s) \Sigma_\gamma^{-1} (Y_t - H_1^* \alpha_t),\; \sigma(s, s) - \Sigma_{12}'(s) \Sigma_\gamma^{-1} \Sigma_{12}(s)\} \tag{12}$$

by using standard methods. Now using model assumption (1) we have that

$$Z(s, t) \mid \xi \sim N\{Y(s, t), \sigma_\epsilon^2\},$$

where $Y(s, t)$ follows distribution (12) conditionally on $\xi$. Now the predictive distribution at site
$s$ is given by

$$\pi\{z(s, t) \mid z_1, \ldots, z_T\} = \int \pi\{z(s, t) \mid \xi\}\, \pi(\xi \mid z_1, \ldots, z_T)\, d\xi. \tag{13}$$
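
The conditional normal calculation behind distribution (12) can be illustrated with a small
numerical sketch. Everything specific in it is an assumption made for illustration: the three
sites on a line, the exponential covariance function standing in for covariogram (3), the value
of $\sigma_\epsilon^2$ and the mean vectors standing in for $H_1^* \alpha_t$ and $H_2^* \alpha_t$
are all hypothetical.

```python
import numpy as np


def conditional_y(sigma_gamma, sigma12, sigma_ss, y_t, mean_monitored, mean_new):
    """Mean and variance of Y(s, t) given Y_t, following the form of distribution (12).

    sigma_gamma    : (n, n) covariance matrix of the monitored sites
    sigma12        : (n,) covariances sigma(s_i, s) with the new site
    sigma_ss       : scalar variance sigma(s, s) at the new site
    y_t            : (n,) smooth process at the monitored sites
    mean_monitored : (n,) vector standing in for H1* alpha_t
    mean_new       : scalar standing in for H2* alpha_t
    """
    # Solve Sigma_gamma w = Sigma_12 instead of forming the inverse explicitly.
    weights = np.linalg.solve(sigma_gamma, sigma12)
    cond_mean = mean_new + weights @ (y_t - mean_monitored)
    cond_var = sigma_ss - weights @ sigma12
    return cond_mean, cond_var


def cov(d):
    """Hypothetical exponential covariance, NOT the paper's covariogram (3)."""
    return np.exp(-0.5 * np.abs(d))


coords = np.array([0.0, 1.0, 2.0])        # three monitored sites on a line
s_new = 1.5                               # unmonitored site
sigma_gamma = cov(coords[:, None] - coords[None, :])
sigma12 = cov(coords - s_new)
sigma_ss = cov(0.0)
y_t = np.array([2.8, 3.1, 3.4])
mean_monitored = np.full(3, 3.0)

m, v = conditional_y(sigma_gamma, sigma12, sigma_ss, y_t, mean_monitored, 3.0)

# One draw of Z(s, t): first Y(s, t) from (12), then add the nugget variance.
rng = np.random.default_rng(0)
sigma2_eps = 0.035                        # hypothetical value
y_draw = rng.normal(m, np.sqrt(v))
z_draw = rng.normal(y_draw, np.sqrt(sigma2_eps))
print(f"conditional mean = {m:.3f}, conditional variance = {v:.3f}")
```

Repeating the last two draws once per MCMC iteration, with the parameters set to the current
posterior draw, gives samples from a predictive distribution of the form (13).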

To forecast the smooth process $Y_t$ at an unmonitored site $s$, we use the conditional
distribution of $Y(s, t)$ detailed in distribution (12). The average of samples drawn from
this conditional distribution is the estimated forecast of the smooth process $Y_t$ at site $s$.
If new data become available we can rerun the entire MCMC implementation and predict future
observations. However, there are also many approximation methods based on importance
sampling that can be used instead; see for example Irwin et al. (2002) for details.

4.4. Assessing the forecasts


Many graphical diagnostic methods are used to perform diagnostic checking and model vali-
dation; see for example Mardia et al. (1998). Several validation statistics are also available; see
for example Carroll and Cressie (1996). They make use of the three statistics
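
$$\mathrm{CR}_1(s_j) = \frac{(1/L) \sum_{t=T+1}^{T+L} \{Z(s_j, t) - \hat{Z}(s_j, t)\}}{\left\{ (1/L) \sum_{t=T+1}^{T+L} \hat{\sigma}_Z^2(s_j, t) \right\}^{1/2}},$$

$$\mathrm{CR}_2(s_j) = \left[ \frac{(1/L) \sum_{t=T+1}^{T+L} \{Z(s_j, t) - \hat{Z}(s_j, t)\}^2}{(1/L) \sum_{t=T+1}^{T+L} \hat{\sigma}_Z^2(s_j, t)} \right]^{1/2},$$

$$\mathrm{CR}_3(s_j) = \left[ (1/L) \sum_{t=T+1}^{T+L} \{Z(s_j, t) - \hat{Z}(s_j, t)\}^2 \right]^{1/2},$$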

where $\hat{Z}(s_j, t)$ is the prediction of $Z(s_j, t)$ and $\hat{\sigma}_Z^2(s_j, t)$ is the mean-square
prediction error. Then it is recommended that summary statistics be used to compare the models;
for example one may find the means of the above three statistics. When forecasts are accurate,
the means of $\mathrm{CR}_1(s_j)$ and $\mathrm{CR}_2(s_j)$ should be close to 0 and 1 respectively; the mean of
$\mathrm{CR}_3(s_j)$ provides a ‘goodness of prediction’ and it is expected to be small when predicted
values are close to the true values.
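
As a concrete illustration of the three statistics for a single site, the following sketch
computes them directly from their definitions. The function name `cr_statistics` and the toy
numbers are hypothetical and are not taken from the paper or from Carroll and Cressie (1996).

```python
import math


def cr_statistics(z_obs, z_hat, mspe):
    """CR1, CR2 and CR3 for one site over the L validation time points.

    z_obs : observed values Z(s_j, t), t = T+1, ..., T+L
    z_hat : forecasts of those values
    mspe  : mean-square prediction errors for the same time points
    """
    L = len(z_obs)
    errors = [z - zh for z, zh in zip(z_obs, z_hat)]
    mean_mspe = sum(mspe) / L
    mean_sq_err = sum(e * e for e in errors) / L
    cr1 = (sum(errors) / L) / math.sqrt(mean_mspe)   # standardised bias
    cr2 = math.sqrt(mean_sq_err / mean_mspe)         # actual versus claimed error
    cr3 = math.sqrt(mean_sq_err)                     # root-mean-square error
    return cr1, cr2, cr3


# Toy data for one site and L = 4 forecast times (illustrative values only).
z_obs = [3.2, 2.9, 3.5, 3.1]
z_hat = [3.0, 3.0, 3.4, 3.3]
mspe = [0.04, 0.04, 0.05, 0.05]

cr1, cr2, cr3 = cr_statistics(z_obs, z_hat, mspe)
print(f"CR1 = {cr1:.3f}, CR2 = {cr2:.3f}, CR3 = {cr3:.3f}")
```

In this toy example the forecast errors cancel, so CR1 is essentially zero, whereas CR2 is below 1,
which would suggest that the stated prediction errors are somewhat larger than the errors
actually realised.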
The forecasts $\hat{Z}(s_j, t)$, for $t = T + 1, \ldots, T + L$, depend on one another and this fact is
ignored when summary statistics are formed from the time-averaged statistics $\mathrm{CR}_1(s_j)$,
$\mathrm{CR}_2(s_j)$ and $\mathrm{CR}_3(s_j)$. To overcome this we adopt the weighted distance between the
forecasts and the actual observations. Let

$$V = \begin{pmatrix} Z_{T+1} \\ \vdots \\ Z_{T+L} \end{pmatrix}$$

denote the set of observations for which we seek validation. Note that we have observed data
$Z_1, \ldots, Z_{T+L}$ but we have used only $Z_1, \ldots, Z_T$ to fit the model and to obtain the validation
forecast for $V$. Let $v_{\mathrm{obs}}$ denote the observed data.
Using the implemented MCMC algorithm we draw $V^{(j)}$, $j = 1, \ldots, E$ (where $E$ is a large
positive integer), samples from the forecast distribution $\pi(v \mid z_1, \ldots, z_T)$. The first paragraph in
Section 4.3 details how to draw these samples. Now

$$\bar{V} = \frac{1}{E} \sum_{j=1}^{E} V^{(j)}, \qquad \hat{\Sigma} = \frac{1}{E - 1} \sum_{j=1}^{E} (V^{(j)} - \bar{V})(V^{(j)} - \bar{V})'$$

unbiasedly estimate the mean vector and the covariance matrix of the forecast distribution
$\pi(v \mid z_1, \ldots, z_T)$ respectively. The ergodicity properties of the MCMC simulation algorithms
guarantee that these estimates converge to the true mean and covariance matrix of the forecast
distribution when $E$ is large.
Under suitable regularity conditions which guarantee asymptotic normality and for small
values of $L$, the predictive distribution $\pi(v \mid z_1, \ldots, z_T)$ can be approximated by the
$nL$-dimensional normal distribution with mean $\bar{V}$ and covariance matrix $\hat{\Sigma}$. Using well-known
properties of the multivariate normal distribution, we have

$$D^2 = (V - \bar{V})' \hat{\Sigma}^{-1} (V - \bar{V}) \sim \chi^2_{nL}, \quad \text{approximately.} \tag{14}$$

The approximation arises because $V$ is only approximately multivariate normal for small values
of $L$ for short-term forecasting. A numerical justification for this approximation is provided in
Section 5.3.
Our proposed validation statistic is the observed value of $D^2$, given by

$$D^2_{\mathrm{obs}} = (v_{\mathrm{obs}} - \bar{V})' \hat{\Sigma}^{-1} (v_{\mathrm{obs}} - \bar{V}). \tag{15}$$

Clearly, $D^2_{\mathrm{obs}}$ will increase if there are large discrepancies between the forecast based on the
model, $\bar{V}$, and the observed data, $v_{\mathrm{obs}}$. Thus $D^2_{\mathrm{obs}}$ can be referred to the theoretical values
of the $\chi^2$-distribution with $nL$ degrees of freedom. Note also that $D^2_{\mathrm{obs}}$ is the Mahalanobis
distance when the distributions of $V_{\mathrm{obs}}$ and $\bar{V}$ have the common covariance matrix $\hat{\Sigma}$.
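
The validation statistic can be computed from a matrix of forecast draws in a few lines. The
sketch below is an assumption-laden illustration rather than the authors' implementation: the
forecast draws are simulated from a hypothetical multivariate normal distribution in place of
real MCMC output, and the function name `mahalanobis_d2` is invented for the example.

```python
import numpy as np


def mahalanobis_d2(forecast_draws, v_obs):
    """D2_obs = (v_obs - Vbar)' Sigma_hat^{-1} (v_obs - Vbar), as in equation (15).

    forecast_draws : (E, nL) array, one row per draw V^(j) from the forecast distribution
    v_obs          : (nL,) vector of held-out observations
    """
    v_bar = forecast_draws.mean(axis=0)
    sigma_hat = np.cov(forecast_draws, rowvar=False)   # uses the divisor E - 1
    diff = v_obs - v_bar
    # Solve the linear system rather than inverting Sigma_hat explicitly.
    return float(diff @ np.linalg.solve(sigma_hat, diff))


# Hypothetical set-up: n = 3 sites, L = 2 steps, E = 10000 forecast draws.
rng = np.random.default_rng(42)
n_sites, n_steps, E = 3, 2, 10_000
dim = n_sites * n_steps
true_mean = np.full(dim, 3.0)
true_cov = 0.05 * (np.eye(dim) + 0.5)      # arbitrary positive definite matrix
forecast_draws = rng.multivariate_normal(true_mean, true_cov, size=E)
v_obs = rng.multivariate_normal(true_mean, true_cov)

d2_obs = mahalanobis_d2(forecast_draws, v_obs)
print(f"D2_obs = {d2_obs:.2f} on nL = {dim} degrees of freedom")
```

A value of `d2_obs` far above the degrees of freedom $nL$, which is the mean of the reference
$\chi^2$-distribution, would point to a discrepancy between forecasts and observations.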

Table 1. Values of the predictive model choice criterion for various values of p and λ

p     λ = 0.3    λ = 0.4    λ = 0.5
4     125.3      125.8      128.8
5     120.3      116.9      117.1
6     117.4      112.3      114.4
7     100.8       96.7       97.8
8     109.7      103.5      104.2

Fig. 6. (a) Marginal posterior means and 95% credible intervals of $\alpha_{t1}$ and (b) mean observed
time series (vertical axis, site means; horizontal axis, days): the time unit is 3 days

5. The New York City data example
5.1. Model choice
We return to the example that was discussed in Section 2. We first choose the parameters p and
λ by using the following well-known predictive model choice criterion (see for example Laud
and Ibrahim (1995)):

$$\mathrm{PMCC} = \sum \left( [Z(s, t)_{\mathrm{obs}} - E\{Z(s, t)_{\mathrm{rep}}\}]^2 + \mathrm{var}\{Z(s, t)_{\mathrm{rep}}\} \right),$$

where the summation is taken over all the $nT$ observations except for the missing observations
and $Z(s, t)_{\mathrm{rep}}$ is a future observation corresponding to $Z(s, t)$ under the model assumed. The
estimated values of PMCC are reported in Table 1. The model with p = 7 and λ = 0.4 is seen to
be the best model and henceforth we work with this model. Table 1 also shows that the model
choice criterion is not greatly sensitive to the choice of λ among the values that are considered.
We have also computed the model choice criterion for λ = 0.2 and λ = 0.1. For those values
the criterion values were higher than the values corresponding to each value of p reported in
Table 1.
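
The criterion is straightforward to estimate from replicated draws. The following sketch assumes
that, for every observation, a list of draws of $Z(s, t)_{\mathrm{rep}}$ is available from the MCMC
output; the function name `pmcc`, the use of `None` to mark a missing observation and the toy
numbers are all hypothetical choices made for the illustration.

```python
import statistics


def pmcc(z_obs, z_rep_draws):
    """Predictive model choice criterion: squared bias plus predictive variance,
    summed over the non-missing observations.

    z_obs       : observed values, with None marking a missing observation
    z_rep_draws : for each observation, a list of draws of the replicate Z_rep
    """
    total = 0.0
    for obs, draws in zip(z_obs, z_rep_draws):
        if obs is None:          # missing observations are left out of the sum
            continue
        goodness_of_fit = (obs - statistics.fmean(draws)) ** 2
        penalty = statistics.variance(draws)
        total += goodness_of_fit + penalty
    return total


# Toy example with three observations, one of which is missing.
z_obs = [3.0, None, 2.5]
z_rep_draws = [
    [2.8, 3.0, 3.2],
    [2.9, 3.1, 3.0],
    [2.4, 2.6, 2.8],
]
value = pmcc(z_obs, z_rep_draws)
print(f"PMCC = {value:.3f}")
```

Computing this quantity for each candidate pair (p, λ) and selecting the smallest value mirrors
the comparison reported in Table 1.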
The chosen value of λ = 0.4 corresponds to an approximate range of 10 miles in spatial
dependence since the covariogram decays to 0.05 for λ = 0.4 and d = 10. The choice of p = 7 is
seen to be about half the maximum number of principal fields possible. We shall further examine
the choice by monitoring the components of α for this optimal model.

Fig. 7. Marginal posterior means and 95% credible intervals of $\alpha_{ti}$ for (a) i = 2, (b) i = 3,
(c) i = 4, (d) i = 5, (e) i = 6 and (f) i = 7: the horizontal line at 0.0 has been superimposed to
see the significance of the states; the time unit is 3 days

5.2. Analysis
The estimates of $\sigma_\epsilon^2$ and $\sigma_\gamma^2$ under the model chosen are 0.0356 and 0.0172 respectively. The
standard deviations are estimated to be 0.0032 and 0.0046 respectively. The MCMC chains for
these two parameters were monitored to detect possible problems in convergence. However, no
such problems were found in the current implementation.
We plot the MCMC estimates of $\alpha_{t1}$ for all values of t along with the 95% credible intervals
in Fig. 6. Since the first column of the matrix H is a unit vector, $\alpha_{t1}$ will estimate the mean of
the time series that is observed at the different sites. To see this we plot the mean time series
that is obtained by averaging the response from all the sites in Fig. 6(b). As expected the plots
in Figs 6(a) and 6(b) look virtually the same. This justifies our previous claim that model (8)
captures the main temporal structures in the data.
The plots of the remaining six components of $\alpha_t$ along with their 95% credible intervals
appear in Fig. 7. In Fig. 7 we have also plotted a horizontal line at zero to see the significance
of the $\alpha_{ti}$, $i = 2, \ldots, 7$, for the entire range of t. The second and third components $\alpha_{t2}$ and
$\alpha_{t3}$ are seen to be significant for all values of t. The remaining four components are significant
at different times but are not significant for all values of t. The two components $\alpha_{t5}$ and $\alpha_{t6}$
are significant for only a few values of t. Fig. 7 also shows that none of the seven components
of $\alpha_t$ can be removed to obtain a more parsimonious model as all the components are significant
at least for some values of t.

Fig. 8. Time series plots of the residuals from (a) site 1, (b) site 2, (c) site 3, (d) site 4, (e) site 5,
(f) site 6, (g) site 7, (h) site 8, (i) site 9, (j) site 10, (k) site 11, (l) site 12, (m) site 13,
(n) site 14 and (o) site 15: the time unit is 3 days

The time series plots of the raw residuals, the differences between the observed and the fitted,
are given in Fig. 8. As expected, the residual plots do not show any spatial or temporal patterns.
The plot for site 14, however, shows high residual values for June 10th and 13th. As mentioned
previously in Section 2 these two observations are outliers and consequently the fitted model
shows some lack of fit for these two observations. We have also examined the variogram of the
fitted values as was done for the data in Fig. 3 and this looked very similar to Fig. 3. This is
expected since the model provides a very good fit to the data, as suggested by the above residual
plots.

Fig. 9. Model predicted maps for (a) July 4th and (b) July 7th: these predictions should be
compared with the observed data plotted in Fig. 4

We now return to the peculiarity of the data as plotted in Fig. 4. We spatially predict the
level of the response on 625 locations on land for July 4th and 7th. Note that these are spatial
predictions and are not temporal forecasts. Moreover, no cross-validation is done here. We use
all the data for model fitting and then we predict at the new locations using the Bayesian
predictive distribution (13). Note that we require the matrix $H^*$ to obtain this predictive
distribution. Here we first obtain the matrix $H^*$ (640 × 7) for all the 640 sites (15 monitoring
sites and 625 locations for predictions) and then use $H_1^*$ (15 × 7) for model fitting and use
$H_2^*$ (625 × 7) for prediction.

Fig. 10. Standard deviation of the predicted maps for (a) July 4th and (b) July 7th

The two spatial prediction surfaces each with 640 predictions (at the 15 monitored and 625
unmonitored sites) are linearly interpolated and then plotted in Fig. 9. The plot for July 4th
shows two hot spots, one each in Manhattan and in Staten Island. These two hot spots also
remain on July 7th, but more hot spots emerge on July 7th possibly because of the after-effect
of the July 4th firework celebrations. This reinforces the fact that there are different spatial
patterns at different locations and at different time points. A comparison between these and the
data plots in Fig. 4 shows that there is very good agreement between the model predictions and
the observed data.
The standard deviations of the predictions are plotted in Fig. 10. The standard deviations are
smaller for the locations which are near the observed sites. This is as expected, since a good
predictor should predict more accurately at sites which are close to the observation sites than
at sites which are far away.
Why is the prediction map for July 4th much lighter than that for July 7th? This is
explained by the two different types of variation in the data for the two days; see Fig. 5. The data
for July 4th have a long left-hand tail whereas the data for July 7th have a long right-hand tail.
The small data values in the long left-hand tail have influenced the predicted surface for July
4th to be lighter in colour, and the large data values in the long right-hand tail have influenced
the surface for July 7th to be darker.

5.3. Cross-validation
We examine the cross-validation statistic $D^2$ that was proposed in Section 4.4. A referee has
expressed concern regarding the asymptotic normal approximation and hence the asymptotic
$\chi^2$-approximation for $D^2$ that is given in expression (14). We address the concern as follows. We
only consider cross-validation for one and two time steps in advance since the main motivation
here is short-term forecasting of spatiotemporal processes. For the one-step-ahead predictions
$D^2$ will be approximately $\chi^2$ distributed with 15 degrees of freedom and for the two-step-ahead
predictions $D^2$ will have 30 degrees of freedom approximately.
We estimate $\bar{V}$ and $\hat{\Sigma}$ by using 10 000 MCMC samples from the predictive distributions of the
one- and two-step-ahead predictions. Subsequently, we draw 1000 independent random samples,
$V^{(j)}$, $j = 1, \ldots, 1000$, from the corresponding predictive distributions and form the statistic
$D^2$ in each case. Note that the samples are not drawn from the approximate multivariate normal
distribution.
The histogram of the 1000 $D^2$-values and the density of the theoretical $\chi^2$-distributions are
plotted in Fig. 11. Figs 11(a) and 11(b) show that the data histogram in each case is a very good
approximation for the corresponding theoretical $\chi^2$-distribution. Moreover, to see the goodness
of fit we run the Kolmogorov–Smirnov goodness-of-fit test using the 1000 simulated values.
The p-value of the test is 0.85 for the one-step-ahead prediction and 0.21 for the two-step-ahead
predictions. These high p-values indicate that the distributions of the observed $D^2$-values can
be taken to be the corresponding theoretical $\chi^2$-distributions as claimed in distribution (14).
Now we evaluate the forecasting performance of the model by using $D^2_{\mathrm{obs}}$ as given in
equation (15). Using the current model the $D^2_{\mathrm{obs}}$-values are 17.7 with 15 degrees of freedom for the
one-step-ahead forecasts and 37.9 with 30 degrees of freedom for the two-step-ahead forecasts.
These values clearly indicate that the model is forecasting the data well.
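
One way to see why these values are unremarkable is to compare them with the upper 5% points
of the reference $\chi^2$-distributions. The sketch below does this with the Wilson-Hilferty
approximation to the $\chi^2$ quantile, chosen here only because it needs nothing beyond the
Python standard library; it is not part of the paper's methodology, and the function name
`chi2_quantile_wh` is invented for the example.

```python
from statistics import NormalDist


def chi2_quantile_wh(df, prob):
    """Wilson-Hilferty approximation to the chi-squared quantile with df degrees of freedom."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


# Reported D2_obs values paired with their degrees of freedom (nL = 15 and 30).
for d2_obs, df in [(17.7, 15), (37.9, 30)]:
    q95 = chi2_quantile_wh(df, 0.95)
    print(f"df = {df}: D2_obs = {d2_obs}, approximate 95% point = {q95:.1f}, "
          f"exceeds = {d2_obs > q95}")
```

Both reported values fall below the approximate 95% points, so neither would be flagged as
unusual at the 5% level under the $\chi^2$ reference distribution.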

Fig. 11. $\chi^2$-approximation of $D^2$ for (a) the one-step-ahead forecasts (p-value = 0.85) and
(b) the two-step-ahead forecasts (p-value = 0.21): the p-values are those of the
Kolmogorov–Smirnov goodness-of-fit test

6. Discussion
We have proposed a Bayesian model for analysing spatiotemporal data. The model has been
implemented in a full Bayesian set-up using MCMC sampling. We have implemented the models
in a simulation example (documented in an unpublished technical report version of the current
paper by the same authors) which validated our MCMC code. However, for brevity we do not
present the example here.
The principal kriging functions that were used in the model that we proposed are basis
functions which are optimal for spatial predictions alone. The comparative models using poly-
nomial-type regressors do not use these optimal functions and hence may provide less accurate
forecasts especially for extrapolation.
We have applied our model to the air pollution data, and using new cross-validation methods
we have shown that the model is adequate for short-term forecasting. Our use of Bayesian
predictive densities for spatial predictions makes our method optimal in the sense of Wikle and
Cressie (1999). The models proposed work even when the number of sites is moderately large,
although as expected the computations become more intensive as the number of sites increases.
The well-known advantages of the fully implemented MCMC methods, however, justify their
use for small to moderate data sets.

Acknowledgements
The authors thank John Kent and Richard Smith for helpful discussions. They thank David
Holland of the US Environmental Protection Agency for providing the data; they also thank the
Joint Editor, an Associate Editor and two referees for many helpful comments and suggestions.

References
Allcroft, D. J. and Glasbey, C. A. (2003) A latent Gaussian Markov random-field model for spatiotemporal
rainfall disaggregation. Appl. Statist., 52, 487–498.
Berger, J. O., de Oliveira, V. and Sansó, B. (2001) Objective Bayesian analysis of spatially correlated data. J. Am.
Statist. Ass., 96, 1361–1374.
Bookstein, F. L. (1989) Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans.
Pattn Anal. Mach. Intell., 11, 567–585.
Brown, P. E., Diggle, P. J., Lord, M. E. and Young, P. C. (2001) Space–time calibration of radar rainfall data.
Appl. Statist., 50, 221–241.
Carroll, S. S. and Cressie, N. (1996) A comparison of geostatistical methodologies used to estimate snow water
equivalent. Wat. Resour. Bull., 32, 267–278.
Carter, C. and Kohn, R. (1994) On Gibbs sampling for state space models. Biometrika, 81, 541–553.
Cressie, N. (1994) Comment on “An approach to statistical spatial-temporal modeling of meteorological fields”
by M. S. Handcock and J. R. Wallis. J. Am. Statist. Ass., 89, 379–382.
Dye, T., Miller, D. and MacDonald, C. (2002) Summary of PM2.5 forecasting program development and
operations for Salt Lake City, Utah during winter 2002. Technical Report. Sonoma Technology, Petaluma.
Ecker, M. D. and Gelfand, A. E. (1997) Bayesian variogram modeling for an isotropic spatial process. J. Agric.
Biol. Environ. Statist., 2, 607–617.
Gelfand, A. E. (1996) Model determination using sampling based methods. In Markov Chain Monte Carlo in
Practice (eds W. R. Gilks, S. Richardson and D. J. Spiegelhalter), pp. 145–161. London: Chapman and Hall.
Gelfand, A. E., Banerjee, S. and Gamerman, D. (2004) Spatial process modelling for univariate and multivariate
dynamic spatial data. Environmetrics, to be published.
Gelfand, A. E. and Sahu, S. K. (1999) Identifiability, improper priors, and Gibbs sampling for generalized linear
models. J. Am. Statist. Ass., 94, 247–253.
Gelfand, A. E., Sahu, S. K. and Carlin, B. P. (1995) Efficient parametrization for normal linear mixed models.
Biometrika, 82, 479–488.
Goodall, C. and Mardia, K. V. (1994) Challenges in multivariate spatio-temporal modeling. In Proc. 17th Int.
Biometric Conf., Hamilton, Aug. 8th–12th, pp. 1–17. Hamilton: McMaster University Press.
Irwin, M. E., Cressie, N. and Johannesson, G. (2002) Spatial-temporal non-linear filtering based on hierarchical
statistical models (with discussion). Test, 11, 249–302.
Kent, J. T. and Mardia, K. V. (1994) The link between Kriging and thin-plate splines. In Probability, Statistics
and Optimisation (ed. F. P. Kelly), pp. 324–329. New York: Wiley.
Kent, J. T. and Mardia, K. V. (2002) Modelling strategies for spatial-temporal data. In Spatial Cluster Modelling
(eds A. Lawson and D. Denison), pp. 214–226. London: Chapman and Hall.
Kyriakidis, P. C. and Journel, A. G. (1999) Geostatistical space-time models: a review. Math. Geol., 31, 651–684.
Laud, P. W. and Ibrahim, J. G. (1995) Predictive model selection. J. R. Statist. Soc. B, 57, 247–262.
Mardia, K. V. and Goodall, C. (1993) Spatial-temporal analysis of multivariate environmental monitoring data.
In Multivariate Environmental Statistics (eds G. P. Patil and C. R. Rao), pp. 347–386. Amsterdam: Elsevier.
Mardia, K. V., Goodall, C., Redfern, E. J. and Alonso, F. J. (1998) The Kriged Kalman filter (with discussion).
Test, 7, 217–252.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis. London: Academic Press.
Matérn, B. (1986) Spatial Variation. Berlin: Springer.
Sansó, B. and Guenni, L. (1999) Venezuelan rainfall data analysed by using a Bayesian space–time model. Appl.
Statist., 48, 345–362.
Sansó, B. and Guenni, L. (2000) A nonstationary multisite model for rainfall. J. Am. Statist. Ass., 95, 1089–1100.
Smith, R. L., Kolenikov, S. and Cox, L. H. (2003) Spatio-temporal modelling of PM2.5 data with missing values.
J. Geophys. Res. Atmos., 108.
Stroud, J. R., Müller, P. and Sansó, B. (2001) Dynamic models for spatiotemporal data. J. R. Statist. Soc. B, 63,
673–689.
West, M. and Harrison, J. (1997) Bayesian Forecasting and Dynamic Models. New York: Springer.
Wikle, C. K., Berliner, L. M. and Cressie, N. (1998) Hierarchical Bayesian space-time models. Environ. Ecol.
Statist., 5, 117–154.
Wikle, C. K. and Cressie, N. (1999) A dimension-reduced approach to space-time Kalman filtering. Biometrika,
86, 815–829.
