DIFFERENCE FILTERS FOR COVARIANCE MATRICES

MICHAEL L. STEIN, JIE CHEN, AND MIHAI ANITESCU
Abstract. In many statistical applications one must solve linear systems corresponding to large, dense, and possibly irregularly structured covariance matrices. These matrices are often ill-conditioned; for example, the condition number increases at least linearly with respect to the size of the matrix when observations of a random process are obtained from a fixed domain. This paper discusses a preconditioning technique based on a differencing approach such that the preconditioned covariance matrix has a bounded condition number independent of the size of the matrix for some important process classes. When used in large scale simulations of random processes, significant improvement is observed for solving these linear systems with an iterative method.

Key words. Condition number, preconditioner, stochastic process, random field, spectral analysis, fixed-domain asymptotics

AMS subject classifications. 65F35, 60G25, 62M15
1. Introduction. A problem that arises in many statistical applications is the solution of linear systems of equations for large positive definite covariance matrices (see, e.g., [15]). An underlying challenge for solving such linear systems is that covariance matrices are often dense and ill-conditioned. Specifically, if one considers taking an increasing number of observations of some random process in a fixed and bounded domain, then one often finds the condition number grows without bound at some polynomial rate in the number of observations. This asymptotic approach in which an increasing number of observations is taken in a fixed region is called fixed-domain asymptotics. It is used extensively in spatial statistics [15] and is being increasingly used in time series, especially in finance, where high frequency data is now ubiquitous [2]. Preconditioned iterative methods are usually the practical choice for solving linear systems with these covariance matrices, in which the matrix-vector multiplications and the choice of a preconditioner are two crucial factors that affect the computational efficiency. Whereas the former problem has been extensively explored, for example, by using the fast multipole method [9, 3, 6], the latter has not yet received satisfactory answers. Some designs of preconditioners have been proposed (see, e.g., [7, 10]); however, their behavior has rarely been studied theoretically. This paper proves that for processes whose spectral densities decay at certain specific rates at high frequencies, the preconditioned covariance matrices have a bounded condition number. The preconditioners use filters based on simple differencing operations, which have long been used to prewhiten (make the covariance matrix closer to a multiple of the identity) regularly observed time series. However, the utility of such filters for irregularly observed time series and spatial data is not as well recognized. These cases are the focus of this work.
Consider a stationary real-valued random process $Z(x)$ with covariance function $k(x)$ and spectral density $f(\omega)$, which are mutually related by the Fourier transform
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439.
Emails: (jiechen, anitescu)@mcs.anl.gov. Work of these authors was supported by the U.S.
Department of Energy, through Contract No. DE-AC02-06CH11357.
and the inverse transform:

    k(x) = \int_{\mathbb{R}^d} f(\omega) \exp(i\omega^T x)\, d\omega.

In particular, for any finite set of locations $x_j$ and real coefficients $a_j$,

    \sum_{j,l} a_j a_l\, k(x_j - x_l) = \int_{\mathbb{R}^d} f(\omega) \Big| \sum_j a_j \exp(i\omega^T x_j) \Big|^2 d\omega,    (1.1)

which is obviously nonnegative, as it must be since it equals $\mathrm{var}\big( \sum_j a_j Z(x_j) \big)$. The existence of a spectral density implies that $k$ is continuous.
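As a concrete illustration, the identity (1.1) can be checked numerically. The following Python sketch uses the exponential covariance $k(x) = e^{-|x|}$ in $d = 1$, whose spectral density under the convention above is $f(\omega) = 1/(\pi(1+\omega^2))$; the locations, coefficients, and quadrature grid are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check of (1.1) in d = 1 for the exponential covariance k(x) = exp(-|x|),
# whose spectral density under k(x) = int f(w) exp(iwx) dw is f(w) = 1/(pi*(1 + w^2)).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 8))          # observation locations
a = rng.standard_normal(8)                     # arbitrary real coefficients

K = np.exp(-np.abs(x[:, None] - x[None, :]))   # K(j, l) = k(x_j - x_l)
lhs = a @ K @ a                                # sum_{j,l} a_j a_l k(x_j - x_l)

w = np.linspace(-500.0, 500.0, 200001)         # truncated frequency grid
f = 1.0 / (np.pi * (1.0 + w ** 2))
phase = np.exp(1j * np.outer(w, x)) @ a        # sum_j a_j exp(i w x_j) for each w
rhs = np.sum(f * np.abs(phase) ** 2) * (w[1] - w[0])

print(lhs, rhs)   # the two values agree up to truncation and discretization error
```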
In some statistical applications, a family of parameterized covariance functions is chosen, and the task is to estimate the parameters and to uncover the underlying covariance function that presumably generates the given observed data. Let $\theta$ be the vector of parameters. We expand the notation and denote the covariance function by $k(x; \theta)$. Similarly, we use $K(\theta)$ to denote the covariance matrix parameterized by $\theta$. We assume that observations $y_j = Z(x_j)$ come from a stationary random field that is Gaussian with zero mean.^1

^1 The case of nonzero mean that is linear in a vector of unknown parameters can be handled with little additional effort by using maximum likelihood or restricted maximum likelihood [15].

The maximum likelihood estimation method [13] estimates the parameter by finding the maximizer of the log-likelihood function
    \mathcal{L}(\theta) = -\frac{1}{2} y^T K(\theta)^{-1} y - \frac{1}{2} \log(\det(K(\theta))) - \frac{m}{2} \log 2\pi,
where the vector $y$ contains the $m$ observations $y_j$. A maximizer $\hat\theta$ is called a maximum likelihood estimate of $\theta$. The optimization can be performed by solving (assuming there is a unique solution) the score equation
    -y^T K(\theta)^{-1} \frac{\partial K(\theta)}{\partial \theta_i} K(\theta)^{-1} y + \mathrm{tr}\Big( K(\theta)^{-1} \frac{\partial K(\theta)}{\partial \theta_i} \Big) = 0, \quad \forall i,    (1.2)
where the left-hand side is nothing but the partial derivative of $-2\mathcal{L}(\theta)$ with respect to $\theta_i$. Because of the difficulty of evaluating the trace for a large matrix, Anitescu et al. [1] exploited
the Hutchinson estimator of the matrix trace and proposed solving the sample average approximation of the score equation instead:

    F(\theta) := -y^T K(\theta)^{-1} \frac{\partial K(\theta)}{\partial \theta_i} K(\theta)^{-1} y + \frac{1}{N} \sum_{j=1}^{N} u_j^T \Big( K(\theta)^{-1} \frac{\partial K(\theta)}{\partial \theta_i} \Big) u_j = 0, \quad \forall i,    (1.3)
where the sample vectors $u_j$ have independent Rademacher variables as entries. As the number $N$ of sample vectors tends to infinity, the solution $\hat\theta_N$ of (1.3) converges to $\hat\theta$ in distribution:

    (V_N/N)^{-1/2} (\hat\theta_N - \hat\theta) \xrightarrow{\;D\;} \text{standard normal},    (1.4)

where $V_N$ is some positive definite matrix dependent on the Jacobian and the variance of $F(\theta)$. This error needs to be distinguished from the error in $\hat\theta$ itself as an estimate of $\theta$. Roughly speaking, this convergence result indicates that the $i$th estimated parameter in $\hat\theta_N$ differs from the $i$th entry of $\hat\theta$ by a simulation error of order $N^{-1/2}$ in probability.
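To make (1.3) concrete, the following Python sketch evaluates the sample-average-approximated score for a one-parameter exponential covariance $k(x;\theta)=\exp(-|x|/\theta)$. The model, the locations, and the number of probe vectors are illustrative assumptions, not the setup used later in the paper.

```python
import numpy as np

def saa_score(theta, x, y, N=100, seed=None):
    """Sample-average-approximated score (1.3) for k(x; theta) = exp(-|x|/theta)."""
    rng = np.random.default_rng(seed)
    D = np.abs(x[:, None] - x[None, :])
    K = np.exp(-D / theta)
    dK = K * D / theta ** 2                         # d/dtheta of exp(-D/theta), elementwise
    Kinv_y = np.linalg.solve(K, y)
    term1 = -Kinv_y @ dK @ Kinv_y                   # -y^T K^{-1} (dK/dtheta) K^{-1} y
    U = rng.choice([-1.0, 1.0], size=(len(x), N))   # Rademacher probe vectors u_j
    term2 = np.mean(np.sum(U * np.linalg.solve(K, dK @ U), axis=0))  # Hutchinson trace estimate
    return term1 + term2

x = np.sort(np.random.default_rng(1).uniform(0.0, 100.0, 500))
K_true = np.exp(-np.abs(x[:, None] - x[None, :]) / 7.0)
y = np.random.default_rng(2).multivariate_normal(np.zeros(len(x)), K_true)
print(saa_score(7.0, x, y, seed=3))   # fluctuates around zero when theta matches the simulated data
```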
Evaluating either the log-likelihood or the score equations requires solving linear systems with the covariance matrix $K$, and under fixed-domain asymptotics these systems become severely ill-conditioned. Indeed, since the observations lie in a fixed and bounded domain, as $m \to \infty$ there are pairs of observation locations $y_m$ and $z_m$ whose distance tends to 0. By the continuity of $k$, $\mathrm{var}\big( \tfrac{1}{\sqrt 2} Z(y_m) - \tfrac{1}{\sqrt 2} Z(z_m) \big) \to 0$ as $m \to \infty$, so that the minimum eigenvalue of $K$ also tends to 0 as $m \to \infty$. To get a lower bound on the maximum eigenvalue, we note that there exists $r > 0$ such that $k(x) > \tfrac12 k(0)$ for all $|x| \le r$. Assume that the observation domain has a finite diameter, so that it can be covered by a finite number of balls of diameter $r$, and call this number $B$. Then for any $m$, one of these balls must contain at least $m' \ge m/B$ observations. The sum of these observations divided by $\sqrt{m'}$ has variance at least $\tfrac{m'}{2} k(0) \ge \tfrac{m}{2B} k(0)$, so the maximum eigenvalue of $K$ grows at least linearly with $m$. Thus, the ratio of the maximum to the minimum eigenvalue of $K$, and hence its condition number, grows faster than linearly in $m$. How much faster clearly depends on the smoothness of $Z$, but we will not pursue this topic further here.
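A quick numerical experiment (a sketch with an arbitrary covariance choice, here the exponential covariance on the fixed interval $[0, 1]$) illustrates this growth:

```python
import numpy as np

# Condition number of K under fixed-domain asymptotics: equally spaced
# observations of an exponential-covariance process on the fixed interval [0, 1].
for m in [50, 100, 200, 400, 800]:
    x = np.linspace(0.0, 1.0, m)
    K = np.exp(-np.abs(x[:, None] - x[None, :]))
    print(m, np.linalg.cond(K))   # grows quickly with m even though the domain is fixed
```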
In what follows, we consider a filtering technique that essentially preconditions $K$ such that the new system has a condition number that does not grow with the size of $K$ for some distinguished process classes. Strictly speaking, the filtering operation, though linear, is not equal to a preconditioner in the standard sense, since it reduces the size of the matrix by a small number. Thus, we also consider augmenting the filter to obtain a full-rank linear transformation that serves as a real preconditioner. However, as long as the rank of the filtering matrix is close to $m$, maximum likelihood estimation of $\theta$ based on the filtered observations should generally be nearly as statistically effective as maximum likelihood based on the full data. In particular, maximum likelihood estimates are invariant under full rank transformations of the data.

The theoretical results on bounded condition numbers rely heavily on the properties of the spectral density $f$. For example, the results in one dimension require either that the process behaves not too differently than does Brownian motion or integrated Brownian motion, at least at high frequencies. Although the restrictions on
$f$ are strong, they do include some models frequently used for continuous time series and in spatial statistics. As noted earlier, the theory is developed based on fixed-domain asymptotics; and, without loss of generality, we assume that this domain is the box $[0, T]^d$. As the observations become denser, for continuous $k$ the correlations of neighboring observations tend to 1, resulting in matrices $K$ that are nearly singular. However, the proposed difference filters can precondition $K$ so that the resulting matrix has a bounded condition number independent of the number of observations. Section 4 gives several numerical examples demonstrating the effectiveness of this preconditioning approach.
2. Filter for one-dimensional case. Let the process $Z(x)$ be observed at locations

    0 \le x_0 < x_1 < \cdots < x_n \le T,

and suppose the spectral density $f$ satisfies

    f(\omega)\,\omega^2 \text{ bounded away from } 0 \text{ and } \infty \text{ as } \omega \to \infty.    (2.1)

The spectral density of Brownian motion is proportional to $\omega^{-2}$, so (2.1) says that $Z$ is not too different from Brownian motion in terms of its high frequency behavior.
Define the process filtered by differencing and scaling as

    Y^{(1)}_j = [Z(x_j) - Z(x_{j-1})]/\sqrt{d_j}, \quad j = 1, \ldots, n,    (2.2)

where $d_j = x_j - x_{j-1}$. Let $K^{(1)}$ denote the covariance matrix of the $Y^{(1)}_j$'s:

    K^{(1)}(j, l) = \mathrm{cov}\big( Y^{(1)}_j, Y^{(1)}_l \big).

For $Z$ Brownian motion, $K^{(1)}$ is a multiple of the identity matrix, and (2.1) is sufficient to show the condition number of $K^{(1)}$ is bounded by a finite value independent of the number of observations.
Theorem 2.1. Suppose $Z$ is a stationary process on $\mathbb{R}$ with spectral density $f$ satisfying (2.1). Then there exists a constant $C$ depending only on $T$ and $f$ that bounds the condition number of $K^{(1)}$ for all $n$.
If we let $L^{(1)}$ be a bidiagonal matrix with nonzero entries

    L^{(1)}(j, j-1) = -1/\sqrt{d_j} \quad \text{and} \quad L^{(1)}(j, j) = 1/\sqrt{d_j},

it is not hard to see that $K$ and $K^{(1)}$ are related by

    K^{(1)} = L^{(1)} K L^{(1)T}.

Note that $L^{(1)}$ is rectangular, since the row index ranges from 1 to $n$ and the column index ranges from 0 to $n$. It entails a special property that each row sums to zero:

    a^T L^{(1)} \mathbf{1} = 0    (2.3)

for any vector $a$, where $\mathbf{1}$ denotes the vector of all 1s. It will be clear later that (2.3)
is key to the proof of the theorem. For now we note that if $\alpha = L^{(1)T} a$, then

    a^T K^{(1)} a = \alpha^T K \alpha = \mathrm{var}\Big( \sum_j \alpha_j Z(x_j) \Big) \quad \text{with } \sum_j \alpha_j = 0.    (2.4)
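The following Python sketch builds $L^{(1)}$ for arbitrary ordered locations and checks the relation $K^{(1)} = L^{(1)} K L^{(1)T}$ together with the zero-row-sum property (2.3); the covariance used (Matérn with $\nu = 1/2$, i.e., exponential) and the locations are illustrative choices.

```python
import numpy as np

def first_order_filter(x):
    """Rows j = 1..n of L^(1):  Y_j = (Z(x_j) - Z(x_{j-1})) / sqrt(d_j)."""
    n = len(x) - 1
    d = np.diff(x)                         # d_j = x_j - x_{j-1}
    L = np.zeros((n, n + 1))
    rows = np.arange(n)
    L[rows, rows] = -1.0 / np.sqrt(d)      # column j-1
    L[rows, rows + 1] = 1.0 / np.sqrt(d)   # column j
    return L

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 100.0, 300))
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 7.0)   # exponential (Matern nu = 1/2) covariance
L1 = first_order_filter(x)
K1 = L1 @ K @ L1.T

print(np.allclose(L1.sum(axis=1), 0.0))              # each row of L^(1) sums to zero, cf. (2.3)
print(np.linalg.cond(K), np.linalg.cond(K1))         # cond(K^(1)) stays moderate while cond(K) is large
```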
Strictly speaking, $L^{(1)T} L^{(1)}$ is not a preconditioner, since $L^{(1)}$ has more columns than rows, even though the transformed matrix $K^{(1)}$ has a desirable condition property. A real preconditioner can be obtained by augmenting $L^{(1)}$. To this end, we define, in addition to (2.2),

    Y^{(1)}_0 = Z(x_0),    (2.5)

and let $\widetilde K^{(1)}$ denote the covariance matrix of all the $Y^{(1)}_j$'s, including $Y^{(1)}_0$. Then we have

    \widetilde K^{(1)} = \widetilde L^{(1)} K \widetilde L^{(1)T},

where $\widetilde L^{(1)}$ is obtained by adding to $L^{(1)}$ the 0th row, with 0th entry equal to 1 and other entries 0. Clearly, $\widetilde L^{(1)}$ is nonsingular. Thus, $\widetilde L^{(1)T} \widetilde L^{(1)}$ preconditions the matrix $K$:

Corollary 2.2. Suppose $Z$ is a stationary process on $\mathbb{R}$ with spectral density $f$ satisfying (2.1). Then there exists a constant $C$ depending only on $T$ and $f$ that bounds the condition number of $\widetilde K^{(1)}$ for all $n$.
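Continuing the previous sketch, augmenting $L^{(1)}$ with the extra row for $Y^{(1)}_0 = Z(x_0)$ gives a nonsingular $\widetilde L^{(1)}$, so $\widetilde L^{(1)T}\widetilde L^{(1)}$ can serve as an explicit preconditioner for $K$. This is again only a sketch; `first_order_filter`, `x`, and `K` are the assumed names from the previous snippet.

```python
import numpy as np

def augmented_first_order_filter(x):
    """Prepend the row picking out Z(x_0) to L^(1), giving the nonsingular tilde-L^(1)."""
    L = first_order_filter(x)                  # helper defined in the previous sketch
    e0 = np.zeros((1, len(x)))
    e0[0, 0] = 1.0
    return np.vstack([e0, L])

Lt = augmented_first_order_filter(x)           # x, K as in the previous sketch
print(np.linalg.matrix_rank(Lt) == len(x))     # tilde-L^(1) is square and nonsingular
print(np.linalg.cond(Lt @ K @ Lt.T))           # bounded condition number, cf. Corollary 2.2
```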
We next consider the case where the spectral density $f$ satisfies

    f(\omega)\,\omega^4 \text{ bounded away from } 0 \text{ and } \infty \text{ as } \omega \to \infty.    (2.6)

Integrated Brownian motion, a process whose first derivative is Brownian motion, has spectral density proportional to $\omega^{-4}$. Thus (2.6) says $Z$ behaves somewhat like integrated Brownian motion at high frequencies. In this case, the appropriate preconditioner uses second order differences. Define
    Y^{(2)}_j = \frac{ [Z(x_{j+1}) - Z(x_j)]/d_{j+1} - [Z(x_j) - Z(x_{j-1})]/d_j }{ 2\sqrt{d_{j+1} + d_j} }, \quad j = 1, \ldots, n-1,    (2.7)

and denote by $K^{(2)}$ the covariance matrix of the $Y^{(2)}_j$'s, $j = 1, \ldots, n-1$, namely,

    K^{(2)}(j, l) = \mathrm{cov}\big( Y^{(2)}_j, Y^{(2)}_l \big).

Then for $Z$ integrated Brownian motion, $K^{(2)}$ is a tridiagonal matrix with bounded condition number (see §2.3). This result allows us to show the condition number of $K^{(2)}$ is bounded by a finite value independent of $n$ whenever $f$ satisfies (2.6).

Theorem 2.3. Suppose $Z$ is a stationary process on $\mathbb{R}$ with spectral density $f$ satisfying (2.6). Then there exists a constant $C$ depending only on $T$ and $f$ that bounds the condition number of $K^{(2)}$ for all $n$.
If we let $L^{(2)}$ be the tridiagonal matrix with nonzero entries

    L^{(2)}(j, j-1) = 1/\big( 2 d_j \sqrt{d_j + d_{j+1}} \big),
    L^{(2)}(j, j+1) = 1/\big( 2 d_{j+1} \sqrt{d_j + d_{j+1}} \big),
    L^{(2)}(j, j) = -L^{(2)}(j, j-1) - L^{(2)}(j, j+1),

for $j = 1, \ldots, n-1$, and let $K^{(2)}$ be the covariance matrix of the $Y^{(2)}_j$'s, then $K$ and $K^{(2)}$ are related by

    K^{(2)} = L^{(2)} K L^{(2)T}.
Similar to (2.3), the matrix $L^{(2)}$ has the property that for any vector $a$,

    a^T L^{(2)} \mathbf{x}_0 = 0, \qquad a^T L^{(2)} \mathbf{x}_1 = 0,    (2.8)

where $\mathbf{x}_0 = \mathbf{1}$, the vector of all 1s, and $\mathbf{x}_1$ has entries $(\mathbf{x}_1)_j = x_j$. In other words, if we let $\alpha = L^{(2)T} a$, then

    a^T K^{(2)} a = \alpha^T K \alpha = \mathrm{var}\Big( \sum_j \alpha_j Z(x_j) \Big), \quad \text{with } \sum_j \alpha_j = 0 \text{ and } \sum_j \alpha_j x_j = 0.
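A sketch of the second-order filter for irregular locations, checking that each row of $L^{(2)}$ annihilates constants and linear functions of the locations as in (2.8); the locations are again an arbitrary illustrative choice.

```python
import numpy as np

def second_order_filter(x):
    """Rows j = 1..n-1 of L^(2) for ordered locations x_0 < ... < x_n, cf. (2.7)."""
    n = len(x) - 1
    d = np.diff(x)                                     # d_j = x_j - x_{j-1}, j = 1..n
    L = np.zeros((n - 1, n + 1))
    for j in range(1, n):
        s = 2.0 * np.sqrt(d[j - 1] + d[j])             # 2*sqrt(d_j + d_{j+1})
        L[j - 1, j - 1] = 1.0 / (d[j - 1] * s)         # coefficient of Z(x_{j-1})
        L[j - 1, j + 1] = 1.0 / (d[j] * s)             # coefficient of Z(x_{j+1})
        L[j - 1, j] = -L[j - 1, j - 1] - L[j - 1, j + 1]
    return L

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 100.0, 200))
L2 = second_order_filter(x)
print(np.allclose(L2 @ np.ones(len(x)), 0.0))          # rows annihilate constants
print(np.allclose(L2 @ x, 0.0))                        # rows annihilate linear functions, cf. (2.8)
```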
To yield a preconditioner for $K$ in the strict sense, in addition to (2.7), we define

    Y^{(2)}_0 = Z(x_0) + Z(x_n), \quad \text{and} \quad Y^{(2)}_n = [Z(x_n) - Z(x_0)]/(x_n - x_0).

Accordingly, we augment the matrix $L^{(2)}$ to $\widetilde L^{(2)}$ with

    \widetilde L^{(2)}(0, l) = \begin{cases} 1, & l = 0 \\ 1, & l = n \\ 0, & \text{otherwise,} \end{cases}
    \qquad
    \widetilde L^{(2)}(n, l) = \begin{cases} -1/(x_n - x_0), & l = 0 \\ 1/(x_n - x_0), & l = n \\ 0, & \text{otherwise,} \end{cases}

and use $\widetilde K^{(2)}$ to denote the covariance matrix of the $Y^{(2)}_j$'s, including $Y^{(2)}_0$ and $Y^{(2)}_n$.
Then, we obtain

    \widetilde K^{(2)} = \widetilde L^{(2)} K \widetilde L^{(2)T}.

One can easily verify that $\widetilde L^{(2)}$ is nonsingular. Thus, $\widetilde L^{(2)T} \widetilde L^{(2)}$ becomes a preconditioner for $K$:

Corollary 2.4. Suppose $Z$ is a stationary process on $\mathbb{R}$ with spectral density $f$ satisfying (2.6). Then there exists a constant $C$ depending only on $T$ and $f$ that bounds the condition number of $\widetilde K^{(2)}$ for all $n$.
We expect that versions of the theorems and corollaries hold whenever, for some positive integer $\tau$, $f(\omega)\,\omega^{2\tau}$ is bounded away from 0 and $\infty$ as $\omega \to \infty$. However, the given proofs rely on detailed calculations on the covariance matrices and do not easily extend to larger $\tau$. Nevertheless, we find it interesting and somewhat surprising that no restriction is needed on the spacing of the observation locations, especially for $\tau = 2$. These results perhaps give some hope that similar results for irregularly spaced observations might hold in more than one dimension.

The rest of this section gives proofs of the above results. The proofs make substantial use of results concerning equivalence of Gaussian measures [11]. In contrast, the results for the high-dimensional case (presented in §3) are proved without recourse to equivalence of Gaussian measures.
2.1. Intrinsic random function and equivalence of Gaussian measures. We first provide some preliminaries. For a random process $Z$ (not necessarily stationary) on $\mathbb{R}$ and a nonnegative integer $p$, a random variable of the form $\sum_{j=1}^{n} \lambda_j Z(x_j)$ for which $\sum_{j=1}^{n} \lambda_j x_j^r = 0$ for all nonnegative integers $r \le p$ is called an authorized linear combination of order $p$, or ALC-$p$ [5]. If, for every ALC-$p$ $\sum_{j=1}^{n} \lambda_j Z(x_j)$, the process $Y(x) = \sum_{j=1}^{n} \lambda_j Z(x + x_j)$ is stationary, then $Z$ is called an intrinsic random function of order $p$, or IRF-$p$ [5].
Similar to stationary processes, intrinsic random functions have spectral measures, although they may not be integrable in a neighborhood of the origin. We still use $g(\omega)$ to denote the spectral density with respect to the Lebesgue measure. Corresponding to these spectral measures are what are known as generalized covariance functions. Specifically, for any IRF-$p$, there exists a generalized covariance function $G(x)$ such that for any ALC-$p$ $\sum_{j=1}^{n} \lambda_j Z(x_j)$,

    \mathrm{var}\Big( \sum_{j=1}^{n} \lambda_j Z(x_j) \Big) = \sum_{j,l=1}^{n} \lambda_j \lambda_l G(x_j - x_l).
Although a generalized covariance function $G$ cannot be written as the Fourier transform of a positive finite measure, it is related to the spectral density $g$ by

    \sum_{j,l=1}^{n} \lambda_j \lambda_l G(x_j - x_l) = \int_{-\infty}^{+\infty} g(\omega) \Big| \sum_{j=1}^{n} \lambda_j \exp(i\omega x_j) \Big|^2 d\omega

for any ALC-$p$ $\sum_{j=1}^{n} \lambda_j Z(x_j)$.
Brownian motion is an example of an IRF-0 and integrated Brownian motion an example of an IRF-1. Defining $g_r(\omega) = |\omega|^{-r}$, Brownian motion has a spectral density proportional to $g_2$ with generalized covariance function $-c|x|$ for some $c > 0$. Note that if one sets $Z(0) = 0$, then $\mathrm{cov}\{Z(x), Z(s)\} = \min\{x, s\}$ for $x, s \ge 0$. Integrated Brownian motion has a spectral density proportional to $g_4$ with generalized covariance function $c|x|^3$ for some $c > 0$.
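For Brownian motion the first-order filter whitens the observations exactly; a quick Python check using $\mathrm{cov}\{Z(x), Z(s)\} = \min\{x, s\}$ (and the `first_order_filter` helper sketched earlier) is as follows.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0.0, 1.0, 100))
K_bm = np.minimum.outer(x, x)                  # cov{Z(x_j), Z(x_l)} = min(x_j, x_l), with Z(0) = 0
L1 = first_order_filter(x)                     # helper from the earlier sketch
print(np.allclose(L1 @ K_bm @ L1.T, np.eye(len(x) - 1)))   # K^(1) is the identity for Brownian motion
```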
We will need to use some results from Stein [17] on equivalence of Gaussian measures. Let $L_T$ be the vector space of random variables generated by $Z(x)$ for $x \in [0, T]$ and $L_{T,p}$ the subspace of $L_T$ containing all ALC-$p$'s in $L_T$, so that $L_T \supset L_{T,0} \supset L_{T,1} \supset \cdots$. Let $P_{T,p}(f)$ and $P_T(f)$ be the Gaussian measures for $L_{T,p}$ and $L_T$, respectively, when $Z$ has mean 0 and spectral density $f$. For measures $P$ and $Q$ on the same measurable space, write $P \equiv Q$ to indicate that the measures are equivalent (mutually absolutely continuous). Since $L_T \supset L_{T,p}$, for two spectral densities $f$ and $g$, $P_T(f) \equiv P_T(g)$ implies that $P_{T,p}(f) \equiv P_{T,p}(g)$ for all $p \ge 0$.
2.2. Proof of Theorem 2.1. Let $K(h)$ denote the covariance matrix $K$ associated with a spectral density $h$, and similarly for $K^{(1)}(h)$, $\widetilde K^{(1)}(h)$, $K^{(2)}(h)$, and $\widetilde K^{(2)}(h)$. The main idea of the proof is to upper and lower bound the bilinear form $a^T K^{(1)}(f) a$ for $f$ satisfying (2.1) by constants times $a^T K^{(1)}(g_2) a$. Then, since $K^{(1)}(g_2)$ has a condition number 1 independent of $n$, it immediately follows that $K^{(1)}(f)$ has a bounded condition number, also independent of $n$.
Let $f_0(\omega) = (1 + \omega^2)^{-1}$ and

    f_R(\omega) = \begin{cases} f(\omega), & |\omega| \le R \\ f_0(\omega), & |\omega| > R \end{cases}

for some $R$. By (2.1), there exist $R$ and $0 < C_0 < C_1 < \infty$ such that $C_0 f_R(\omega) \le f(\omega) \le C_1 f_R(\omega)$ for all $\omega$. Then by (1.1) and (2.4), for any real vector $a$,

    C_0\, a^T K^{(1)}(f_R) a \le a^T K^{(1)}(f) a \le C_1\, a^T K^{(1)}(f_R) a.    (2.9)
By the definition of $f_0$, we have $P_{T,0}(f_0) \equiv P_{T,0}(g_2)$ [17, Theorem 1]. Since $f_R = f_0$ for $|\omega| > R$, by Ibragimov and Rozanov [11, Theorem 17 of Chapter III], we
have $P_T(f_R) \equiv P_T(f_0)$; thus $P_{T,0}(f_R) \equiv P_{T,0}(f_0)$. Therefore, by the transitivity of equivalence, we obtain that $P_{T,0}(f_R) \equiv P_{T,0}(g_2)$. From basic properties of equivalent Gaussian measures (see [11, (2.6) on page 76]), there exist constants $0 < C_2 < C_3 < \infty$ such that for any ALC-0 $\sum_{j=0}^{n} \lambda_j Z(x_j)$ with $0 \le x_j \le T$ for all $j$,

    C_2\, \mathrm{var}_{g_2}\Big( \sum_{j=0}^{n} \lambda_j Z(x_j) \Big) \le \mathrm{var}_{f_R}\Big( \sum_{j=0}^{n} \lambda_j Z(x_j) \Big) \le C_3\, \mathrm{var}_{g_2}\Big( \sum_{j=0}^{n} \lambda_j Z(x_j) \Big),

where $\mathrm{var}_f$, for example, indicates that variances are computed under the spectral density $f$. Then by (2.4) we obtain
    C_2\, a^T K^{(1)}(g_2) a \le a^T K^{(1)}(f_R) a \le C_3\, a^T K^{(1)}(g_2) a.    (2.10)
Combining (2.9) and (2.10), we have

    C_0 C_2\, a^T K^{(1)}(g_2) a \le a^T K^{(1)}(f) a \le C_1 C_3\, a^T K^{(1)}(g_2) a,

and thus the condition number of $K^{(1)}(f)$ is bounded above by $C_1 C_3 / (C_0 C_2)$.
2.3. Proof of Theorem 2.3. Following an argument similar to the preceding proof, the bilinear form $a^T K^{(2)}(f) a$ for $f$ satisfying (2.6) can be upper and lower bounded by constants times $a^T K^{(2)}(g_4) a$. Then it suffices to prove that $K^{(2)}(g_4)$ has a bounded condition number, and thus the theorem holds.
To estimate the condition number of $K^{(2)}(g_4)$, first note the fact that for any two ALC-1's $\sum_j \lambda_j Z(x_j)$ and $\sum_j \mu_j Z(x_j)$,

    \sum_{j,l} \lambda_j \mu_l (x_j - x_l)^3 = 0.    (2.11)
Based on the generalized covariance function of $g_4$, $c|x|^3$, we have

    (j, l)\text{-entry of } K^{(2)}(g_4) = \mathrm{cov}\big( Y^{(2)}_j, Y^{(2)}_l \big)
    = \mathrm{cov}\Big( \sum_{j'=-1}^{+1} L^{(2)}(j, j+j') Z(x_{j+j'}), \; \sum_{l'=-1}^{+1} L^{(2)}(l, l+l') Z(x_{l+l'}) \Big)
    = c \sum_{j'=-1}^{+1} \sum_{l'=-1}^{+1} L^{(2)}(j, j+j')\, L^{(2)}(l, l+l')\, |x_{j+j'} - x_{l+l'}|^3.
Since for any $j$, $Y^{(2)}_j$ is an ALC-1, by using (2.11) one can calculate that

    (j, l)\text{-entry of } K^{(2)}(g_4) = \begin{cases} c, & l = j \\ c\,d_{j+1} / \big( 2\sqrt{d_{j+1}+d_j}\,\sqrt{d_{j+2}+d_{j+1}} \big), & l = j + 1 \\ 0, & |l - j| > 1, \end{cases}

which means that $K^{(2)}(g_4)$ is a tridiagonal matrix with a constant diagonal $c$.
To simplify notation, let $C(j, l)$ denote the $(j, l)$-entry of $K^{(2)}(g_4)$. We have

    |C(j-1, j)| + |C(j, j+1)| = \frac{c\,d_j}{2\sqrt{d_j + d_{j-1}}\,\sqrt{d_{j+1} + d_j}} + \frac{c\,d_{j+1}}{2\sqrt{d_{j+1} + d_j}\,\sqrt{d_{j+2} + d_{j+1}}}
    \le \frac{c\sqrt{d_j}}{2\sqrt{d_{j+1} + d_j}} + \frac{c\sqrt{d_{j+1}}}{2\sqrt{d_{j+1} + d_j}} \le \frac{c}{\sqrt 2}.
For any vector $a$,

    a^T K^{(2)}(g_4) a = \sum_{j,l=1}^{n-1} a_j a_l\, C(j, l) \ge c \sum_{j=1}^{n-1} a_j^2 - 2 \sum_{j=1}^{n-2} |a_j a_{j+1} C(j, j+1)|,

but

    2 \sum_{j=1}^{n-2} |a_j a_{j+1} C(j, j+1)| \le \sum_{j=1}^{n-2} (a_j^2 + a_{j+1}^2) |C(j, j+1)| \le \sum_{j=1}^{n-1} a_j^2 \big( |C(j-1, j)| + |C(j, j+1)| \big) \le \frac{c}{\sqrt 2} \sum_{j=1}^{n-1} a_j^2.

Therefore,

    a^T K^{(2)}(g_4) a \ge c\,(1 - 1/\sqrt 2)\, \|a\|^2.    (2.12)
Similarly, we have $a^T K^{(2)}(g_4) a \le c\,(1 + 1/\sqrt 2)\,\|a\|^2$. Thus the condition number of $K^{(2)}(g_4)$ is at most $(1 + 1/\sqrt 2)/(1 - 1/\sqrt 2) = 3 + 2\sqrt 2$.
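The bound can be confirmed numerically: building the tridiagonal matrix with the entries computed above (taking $c = 1$) for random spacings gives a condition number below $3 + 2\sqrt 2 \approx 5.83$. The spacing distribution in this sketch is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0.0, 100.0, 500))
d = np.diff(x)                                              # d_1, ..., d_n
n = len(d)
C = np.eye(n - 1)                                           # diagonal entries equal c = 1
off = d[1:n-1] / (2.0 * np.sqrt(d[1:n-1] + d[0:n-2]) * np.sqrt(d[2:n] + d[1:n-1]))
C[np.arange(n - 2), np.arange(1, n - 1)] = off              # C(j, j+1)
C[np.arange(1, n - 1), np.arange(n - 2)] = off              # symmetric counterpart
print(np.linalg.cond(C), 3 + 2 * np.sqrt(2))                # condition number stays below 3 + 2*sqrt(2)
```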
2.4. Proof of Corollaries 2.2 and 2.4. The proof of Corollary 2.2 is similar to but simpler than the proof of Corollary 2.4 and is omitted. The main idea of proving Corollary 2.4 is to consider the covariance function

    B(x) = \begin{cases} \tfrac{32}{3} - 4x^2 + |x|^3, & |x| \le 2 \\ \tfrac{1}{3}(4 - |x|)^3, & 2 < |x| \le 4 \\ 0, & |x| > 4 \end{cases}

and the covariance function $E(x) = 3e^{-|x|}(1 + |x|)$. The function $B$ has a spectral density $h(\omega)$ proportional to $\sin^4(\omega)/\omega^4$ (3.836.5 in [8] with $n = 4$ shows that $B$ is a valid covariance function), and $E$ has a spectral density $\phi(\omega)$ proportional to $(1 + \omega^2)^{-2}$.
Using ideas similar to those in the proof of Theorem 2.1, we define

    \phi_R(\omega) = \begin{cases} f(\omega), & |\omega| \le R \\ \phi(\omega), & |\omega| > R \end{cases}

for some $R$. Then by (2.6), there exist $R$ and $0 < C_0 < C_1 < \infty$ such that for any real vector $a$,

    C_0\, a^T \widetilde K^{(2)}(\phi_R) a \le a^T \widetilde K^{(2)}(f) a \le C_1\, a^T \widetilde K^{(2)}(\phi_R) a.    (2.13)
Furthermore, according to the results in [11, Theorem 17 of Chapter III], when $T \le 2$, $P_T(h) \equiv P_T(\phi) \equiv P_T(\phi_R)$, which leads to

    C_2\, a^T \widetilde K^{(2)}(h) a \le a^T \widetilde K^{(2)}(\phi_R) a \le C_3\, a^T \widetilde K^{(2)}(h) a    (2.14)

for some $0 < C_2 < C_3 < \infty$.
Combining (2.13) and (2.14), it remains to prove that $\widetilde K^{(2)}(h)$ has a bounded condition number. We compute $\widetilde K^{(2)}(h)$ entry by entry:

    (0, 0)\text{-entry} = \tfrac{128}{3} - 8D^2 + 2D^3
    (0, j)\text{-entry} = \big( -4 + \tfrac{3}{2} D \big) \sqrt{d_{j+1} + d_j} \quad \text{for } j = 1, \ldots, n-1
    (0, n)\text{-entry} = 0
    (j, 0)\text{-entry} = (0, j)\text{-entry} \quad \text{for } j = 1, \ldots, n-1
    (j, l)\text{-entry} = (j, l)\text{-entry of } K^{(2)}(g_4)/c \quad \text{for } j, l = 1, \ldots, n-1
    (j, n)\text{-entry} = (n, j)\text{-entry} \quad \text{for } j = 1, \ldots, n-1
    (n, 0)\text{-entry} = 0
    (n, j)\text{-entry} = -\frac{\sqrt{d_{j+1} + d_j}}{D} \Big( x_{j-1} + x_j + x_{j+1} - \tfrac{3}{2} x_0 - \tfrac{3}{2} x_n \Big) \quad \text{for } j = 1, \ldots, n-1
    (n, n)\text{-entry} = 8 - 2D,

where $D = x_n - x_0$, and recall that $c$ is the coefficient in the generalized covariance function corresponding to $g_4$.
4
. To simplify notation, let H(j, l) denote the (j, l)-entry
of
K
(2)
(h
). Then we have
a
T
K
(2)
(h
)a = a
2
0
H(0, 0) +a
2
n
H(n, n) + 2a
0
n1
j=1
a
j
H(0, j) + 2a
n
n1
j=1
a
j
H(n, j)
+ a
T
K
(2)
(g
4
) a/c, (2.15)
where a is the vector a with a
0
and a
n
removed. For every > 0, using [2xy[ x
2
+y
2
and the Cauchy-Schwartz inequality, we have
    \Big| 2 a_0 \sum_{j=1}^{n-1} a_j H(0, j) \Big| \le \beta^2 a_0^2 + \frac{1}{\beta^2} \sum_{j=1}^{n-1} a_j^2 \sum_{j=1}^{n-1} H(0, j)^2 \le \beta^2 a_0^2 + \frac{D(8 - 3D)^2}{2\beta^2} \sum_{j=1}^{n-1} a_j^2.    (2.16)
Similarly, for every $\gamma > 0$, using $| x_{j-1} + x_j + x_{j+1} - \tfrac{3}{2} x_0 - \tfrac{3}{2} x_n | \le 3D$, we have

    \Big| 2 a_n \sum_{j=1}^{n-1} a_j H(n, j) \Big| \le \gamma^2 a_n^2 + \frac{1}{\gamma^2} \sum_{j=1}^{n-1} a_j^2 \sum_{j=1}^{n-1} H(n, j)^2 \le \gamma^2 a_n^2 + \frac{18D}{\gamma^2} \sum_{j=1}^{n-1} a_j^2.    (2.17)
Furthermore, by (2.12),

    \bar a^T K^{(2)}(g_4)\, \bar a / c \ge (1 - 1/\sqrt 2)\, \|\bar a\|^2.    (2.18)
Applying (2.16), (2.17), and (2.18) to (2.15), together with $D \le T \le 2$, we obtain

    a^T \widetilde K^{(2)}(h) a \ge \Big( \tfrac{128}{3} - 8D^2 + 2D^3 - \beta^2 \Big) a_0^2 + \big( 8 - 2D - \gamma^2 \big) a_n^2 + \Big( 1 - \tfrac{1}{\sqrt 2} - \tfrac{D(8-3D)^2}{2\beta^2} - \tfrac{18D}{\gamma^2} \Big) \sum_{j=1}^{n-1} a_j^2
    \ge \Big( \tfrac{128}{3} - 8T^2 + 2T^3 - \beta^2 \Big) a_0^2 + \big( 8 - 2T - \gamma^2 \big) a_n^2 + \Big( 1 - \tfrac{1}{\sqrt 2} - \tfrac{T(8-3T)^2}{2\beta^2} - \tfrac{18T}{\gamma^2} \Big) \sum_{j=1}^{n-1} a_j^2.
Choosing $\beta^2$ and $\gamma^2$ as suitable functions of $T$ alone makes all three coefficients above positive; the choice made here gives a coefficient of $\tfrac{178359}{232000} - \tfrac{1}{\sqrt 2} \approx 0.06$ for $\sum_{j=1}^{n-1} a_j^2$. Hence the minimum eigenvalue of $\widetilde K^{(2)}(h)$ is bounded below by a positive constant depending only on $T$. A similar and simpler argument bounds the maximum eigenvalue from above, so $\widetilde K^{(2)}(h)$ has a bounded condition number, and Corollary 2.4 follows.

3. Filter for the high-dimensional case. Suppose now that the stationary random field $Z(x)$, $x \in \mathbb{R}^d$, is observed at the regular grid points $\delta j$ for $j \in \{0, 1, \ldots, n\}^d$ in the domain $[0, T]^d$, with spacing $\delta = T/n$. Define the discrete Laplace operator $\Delta$ by

    \Delta Z(\delta j) = \sum_{p=1}^{d} Z(\delta j - \delta e_p) - 2 Z(\delta j) + Z(\delta j + \delta e_p),

where $e_p$ denotes the unit vector along the $p$th coordinate. When the operator is applied $\tau$ times, we denote

    Y^{[\tau]}_j = \Delta^\tau Z(\delta j).^2
^2 Sometimes, boldface letters denote a vector with equal entries (such as $\mathbf{n}$ meaning a vector of all $n$'s). In context, this notation is self-explanatory and is not to be confused with the notation for a general vector. Other examples in this paper include $\mathbf{1}$ and $\boldsymbol{\tau}$.
Note that this notation is in parallel to the ones in (2.2) and (2.7), with $[\tau]$ meaning the number of applications of the Laplace operator (instead of the order of the difference), and the index $j$ being a vector (instead of a scalar). In addition, we use $K^{[\tau]}$ to denote the covariance matrix of $Y^{[\tau]}_j$, $\boldsymbol{\tau} \le j \le \mathbf{n} - \boldsymbol{\tau}$:

    K^{[\tau]}(j, l) = \mathrm{cov}\big( Y^{[\tau]}_j, Y^{[\tau]}_l \big).
We have the following result.

Theorem 3.1. Suppose $Z$ is a stationary random field on $\mathbb{R}^d$ with spectral density $f$ satisfying

    f(\omega) \asymp (1 + \|\omega\|)^{-\beta},    (3.1)

where $\beta = 4\tau$ for some positive integer $\tau$. Then there exists a constant $C$ depending only on $T$ and $f$ that bounds the condition number of $K^{[\tau]}$ for all $n$.

Recall that for $a(\omega), b(\omega) \ge 0$, the relationship $a(\omega) \asymp b(\omega)$ indicates that there exist $C_1, C_2 > 0$ such that $C_1 a(\omega) \le b(\omega) \le C_2 a(\omega)$ for all $\omega$.
It is not hard to verify that $K^{[\tau]}$ and $K$ are related by $K^{[\tau]} = L^{[\tau]} K L^{[\tau]T}$, where $L^{[\tau]} = L_{n+2-2\tau} \cdots L_{n-2} L_n$ and $L_s$ is an $(s-1)^d \times (s+1)^d$ matrix with entries

    L_s(j, l) = \begin{cases} -2d, & l = j \\ 1, & l = j \pm e_p, \; p = 1, \ldots, d \\ 0, & \text{otherwise,} \end{cases}

for $\mathbf{1} \le j \le (s-1)\mathbf{1}$. One may also want to have a nonsingular $\widetilde L^{[\tau]}$ such that the condition number of $\widetilde L^{[\tau]} K \widetilde L^{[\tau]T}$ is bounded. However, we cannot prove that such an augmentation yields matrices with bounded condition number, although numerical results in §5 suggest that such a result may be achievable. Stein [16] applied the iterated Laplacian to gridded observations in $d$ dimensions to improve approximations to the likelihood based on the spatial periodogram and similarly made no effort to recover the information lost by using a less than full rank transformation. It is worth noting that processes with spectral densities of the form (3.1) observed on a grid bear some resemblance to Markov random fields [14], which provide an alternative way to model spatial data observed at discrete locations.
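For gridded observations the filter matrix $L^{[1]}$ can be assembled from Kronecker products of one-dimensional second-difference and restriction matrices. The following Python sketch applies the discrete Laplacian once to a small two-dimensional grid and compares condition numbers; the grid size, scale, and Matérn $\nu = 1$ covariance are illustrative choices.

```python
import numpy as np
from scipy.special import kv

def laplacian_filter_2d(n):
    """One application of the discrete Laplacian on an (n+1) x (n+1) grid,
    returning the (n-1)^2 x (n+1)^2 matrix L^[1] (interior rows only)."""
    D = np.zeros((n - 1, n + 1))                 # 1-D second difference stencil [1, -2, 1]
    for i in range(n - 1):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    R = np.zeros((n - 1, n + 1))                 # 1-D restriction to interior points
    R[np.arange(n - 1), np.arange(1, n)] = 1.0
    return np.kron(D, R) + np.kron(R, D)         # second difference in x plus second difference in y

n, T, ell = 16, 100.0, 7.0
t = np.linspace(0.0, T, n + 1)
X, Y = np.meshgrid(t, t, indexing="ij")
pts = np.column_stack([X.ravel(), Y.ravel()])
r = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
z = np.sqrt(2.0) * r / ell                       # Matern nu = 1 covariance: z * K_1(z), k(0) = 1
K = np.ones_like(z)
K[z > 0] = z[z > 0] * kv(1, z[z > 0])
L = laplacian_filter_2d(n)
print(np.linalg.cond(K), np.linalg.cond(L @ K @ L.T))   # filtered matrix is far better conditioned
```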
3.1. Proof of Theorem 3.1. First note that if one restricts to observations on the grid $\delta j$ for $j \in \mathbb{Z}^d$, the covariance function $k$ can be written as an integral over $[-\pi, \pi]^d$:

    k(\delta j) = \int_{\mathbb{R}^d} f(\omega) \exp(i\omega^T (\delta j))\, d\omega = \int_{[-\pi,\pi]^d} f_\delta(\omega) \exp(i\omega^T j)\, d\omega,

where

    f_\delta(\omega) = \delta^{-d} \sum_{l \in \mathbb{Z}^d} f\big( \delta^{-1}(\omega + 2\pi l) \big).    (3.2)
Denote by $k^{[\tau]}$ the covariance function such that $k^{[\tau]}(\delta(j - l)) = K^{[\tau]}(j, l)$. Then according to the definition of the operator $\Delta$, we have $k^{[0]} = k$ and the recurrence

    k^{[\tau+1]}(\delta j) = \sum_{p,q=1}^{d} \Big[ k^{[\tau]}(\delta j + \delta(e_p + e_q)) - 2 k^{[\tau]}(\delta j + \delta e_p) + k^{[\tau]}(\delta j + \delta(e_p - e_q))
    \qquad - 2 k^{[\tau]}(\delta j + \delta e_q) + 4 k^{[\tau]}(\delta j) - 2 k^{[\tau]}(\delta j - \delta e_q)
    \qquad + k^{[\tau]}(\delta j - \delta(e_p - e_q)) - 2 k^{[\tau]}(\delta j - \delta e_p) + k^{[\tau]}(\delta j - \delta(e_p + e_q)) \Big].
If we let

    k^{[\tau]}(\delta j) = \int_{[-\pi,\pi]^d} f^{[\tau]}_\delta(\omega) \exp(i\omega^T j)\, d\omega,

then the above recurrence for $k^{[\tau]}$ translates to

    f^{[\tau]}_\delta(\omega) = \Big( \sum_{p=1}^{d} 4 \sin^2\big( \tfrac{\omega_p}{2} \big) \Big)^{2\tau} f_\delta(\omega),    (3.3)
and for any real vector $a$, we have

    a^T K^{[\tau]} a = \sum_{\boldsymbol\tau \le j, l \le \mathbf{n}-\boldsymbol\tau} a_j a_l\, k^{[\tau]}(\delta(j - l)) = \int_{[-\pi,\pi]^d} f^{[\tau]}_\delta(\omega) \Big| \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j \exp(i\omega^T j) \Big|^2 d\omega.

Therefore, to prove that $K^{[\tau]}$ has a bounded condition number, we need to bound the expression for $a^T K^{[\tau]} a$ given in the above equality.
According to the assumption on $f$ in (3.1), combining (3.2) and (3.3), we have

    \delta^{d-\beta} f^{[\tau]}_\delta(\omega) \asymp \Big( \sum_{p=1}^{d} 4 \sin^2\big( \tfrac{\omega_p}{2} \big) \Big)^{2\tau} \sum_{l \in \mathbb{Z}^d} \big( \delta + \|\omega + 2\pi l\| \big)^{-\beta} =: h_\delta(\omega).

Therefore, there exist $0 < C_0 \le C_1 < \infty$ independent of $\delta$ and $a$, such that

    C_0\, H_\delta(a) \le \delta^{d-\beta}\, a^T K^{[\tau]} a \le C_1\, H_\delta(a),    (3.4)

where

    H_\delta(a) = \int_{[-\pi,\pi]^d} h_\delta(\omega) \Big| \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j \exp(i\omega^T j) \Big|^2 d\omega.
We proceed to bound the function $H_\delta(a)$. For any $\omega \ne 0$, $h_\delta(\omega)$ is continuous in $\delta$, and $h_\delta$ converges to $h_0$ pointwise except at the origin. Since $h_\delta > h_{\delta'}$ when $\delta < \delta'$, we have that $h_\delta$ is bounded above by $h_0$ for all $\delta$. Moreover, by the continuity of $h_0$ on $[-\pi, \pi]^d$, $h_0$ has a maximum $C_2$. Therefore, $h_\delta(\omega) \le C_2$ for all $\omega$ and $\delta$, and thus

    H_\delta(a) \le C_2 \int_{[-\pi,\pi]^d} \Big| \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j \exp(i\omega^T j) \Big|^2 d\omega = C_2 (2\pi)^d \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j^2.    (3.5)
Now we need a lower bound for $H_\delta(a)$. Keeping only the $l = 0$ term and using $4\sin^2(\omega_p/2) \ge \mathrm{sinc}^2(1/2)\,\omega_p^2$ for $|\omega_p| \le \pi$, we have

    h_\delta(\omega) \ge \mathrm{sinc}^{4\tau}(1/2)\, \|\omega\|^{4\tau} (\delta + \|\omega\|)^{-\beta}.

Therefore, for any $0 < \kappa \le \pi/\delta$,

    H_\delta(a) \ge \mathrm{sinc}^{4\tau}(1/2) \int_{[-\pi,\pi]^d} \Big( \frac{\|\omega\|}{\delta + \|\omega\|} \Big)^{4\tau} \Big| \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j \exp(i\omega^T j) \Big|^2 d\omega
    \ge \mathrm{sinc}^{4\tau}(1/2) \int_{[-\pi,\pi]^d \setminus B_{\kappa\delta}} \Big( \frac{\|\omega\|}{\delta + \|\omega\|} \Big)^{4\tau} \Big| \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j \exp(i\omega^T j) \Big|^2 d\omega
    \ge \mathrm{sinc}^{4\tau}(1/2) \Big( \frac{\kappa}{1 + \kappa} \Big)^{4\tau} \int_{[-\pi,\pi]^d \setminus B_{\kappa\delta}} \Big| \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j \exp(i\omega^T j) \Big|^2 d\omega,    (3.6)

where $B_{\kappa\delta}$ denotes the ball of radius $\kappa\delta$ centered at the origin.
To obtain a lower bound on this last integral, note that

    \int_{[-\pi,\pi]^d} \Big| \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j \exp(i\omega^T j) \Big|^2 d\omega = (2\pi)^d \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j^2

and

    \int_{B_{\kappa\delta}} \Big| \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j \exp(i\omega^T j) \Big|^2 d\omega \le \Big( \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} |a_j| \Big)^2 \int_{B_{\kappa\delta}} d\omega
    \le (n + 1 - 2\tau)^d \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j^2 \, (\kappa\delta)^d V_d \le (\kappa T)^d V_d \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j^2,

where $V_d$ is the volume of the $d$-dimensional unit ball, which is always less than $2^d$.
Applying these results to (3.6),

    H_\delta(a) \ge \mathrm{sinc}^{4\tau}(1/2) \Big( \frac{\kappa}{1 + \kappa} \Big)^{4\tau} \big[ (2\pi)^d - (\kappa T)^d V_d \big] \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j^2.

Since this bound holds for any $0 < \kappa \le \pi/\delta$, we specifically let $\kappa = 1/T$. Then

    H_\delta(a) \ge C_3 \sum_{\boldsymbol\tau \le j \le \mathbf{n}-\boldsymbol\tau} a_j^2    (3.7)

with

    C_3 = \frac{ \mathrm{sinc}^{4\tau}(1/2)\, \big[ (2\pi)^d - V_d \big] }{ (1 + T)^{4\tau} },

which is independent of $\delta$.
Combining (3.4), (3.5), and (3.7), we have

    C_0 C_3\, \|a\|^2 \le \delta^{d-\beta}\, a^T K^{[\tau]} a \le C_1 C_2 (2\pi)^d\, \|a\|^2,

which means that the condition number of $K^{[\tau]}$ is bounded by $(2\pi)^d C_1 C_2 / (C_0 C_3)$.
4. Numerical experiments. A class of popularly used covariance functions that are flexible in reflecting the local behavior of spatially varying data is the Matérn covariance model [15, 13]:

    k(x) = \frac{2^{1-\nu}}{\Gamma(\nu)} \Big( \frac{\sqrt{2\nu}\,\|x\|}{\ell} \Big)^{\nu} \mathcal{K}_\nu\Big( \frac{\sqrt{2\nu}\,\|x\|}{\ell} \Big),

where $\Gamma$ is the Gamma function, $\mathcal{K}_\nu$ is the modified Bessel function of the second kind of order $\nu$, $\nu > 0$ is a smoothness parameter, and $\ell > 0$ is a scale parameter. The corresponding spectral density is proportional to

    \Big( \frac{2\nu}{\ell^2} + \|\omega\|^2 \Big)^{-(\nu + d/2)},

which is dimension dependent. It is clear that with some choices of $\nu$, $f$ satisfies the requirements of the theorems in this paper. For example, when $d = 1$, the Matérn model with $\nu = 1/2$ corresponds to Theorem 2.1 and Corollary 2.2, whereas $\nu = 3/2$ corresponds to Theorem 2.3 and Corollary 2.4. Also, when $d = 2$, the Matérn model with $\nu = 1$ corresponds to Theorem 3.1 with $\beta = 4$, meaning that the Laplace operator needs to be applied once ($\tau = 1$). Whittle [18] argued that the choice of $\nu = 1$ is particularly natural for processes in $\mathbb{R}^2$, in large part because the process is a solution to a stochastic version of the Laplace equation driven by white noise.
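A short Python sketch of the Matérn covariance and (up to a constant) its spectral density; the normalization of the spectral density is omitted since only its decay rate matters for the theorems above.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(r, nu, ell):
    """Matern covariance at distance r >= 0 with smoothness nu and scale ell, k(0) = 1."""
    r = np.asarray(r, dtype=float)
    z = np.sqrt(2.0 * nu) * r / ell
    out = np.ones_like(z)
    m = z > 0
    out[m] = (2.0 ** (1.0 - nu) / gamma(nu)) * z[m] ** nu * kv(nu, z[m])
    return out

def matern_spec(w_norm, nu, ell, d):
    """Matern spectral density up to a constant: (2*nu/ell^2 + |w|^2)^-(nu + d/2)."""
    return (2.0 * nu / ell ** 2 + np.asarray(w_norm) ** 2) ** (-(nu + d / 2.0))

# nu = 1/2 and nu = 3/2 in d = 1 correspond to Theorems 2.1 and 2.3;
# nu = 1 in d = 2 corresponds to Theorem 3.1 with one application of the Laplacian.
print(matern_cov(np.array([0.0, 1.0, 5.0]), nu=0.5, ell=7.0))
print(matern_spec(np.array([1.0, 10.0, 100.0]), nu=1.0, ell=7.0, d=2))
```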
For the above three examples, we plot in Figure 4.1 the curves of the condition numbers for both $K$ and the filtered versions of $K$, as the size $m$ of the matrix varies. The plots were obtained by fixing the domain $T = 100$ and the scale parameter $\ell = 7$. For the one-dimensional cases, observation locations were randomly generated according to the uniform distribution on $[0, T]$. The plots clearly show that the condition number of $K$ grows very fast with the size of the matrix. With an appropriate filter applied, on the other hand, the condition number of the filtered covariance matrix stays more or less the same, a phenomenon consistent with the theoretical results.
The good condition property of the filtered covariance matrix is exploited in the block preconditioned conjugate gradient (block PCG) solver. The block version of PCG is used instead of the single vector version because in some applications, such as the one presented in §1, the linear system has multiple right-hand sides. We remark that the convergence rate of block PCG depends not on the condition number, but on a modified condition number of the linear system [12]. Let $\lambda_j$, sorted increasingly, be the eigenvalues of the linear system. With $s$ right-hand sides, the modified condition number is $\lambda_m/\lambda_s$. Nevertheless, a bounded condition number implies a bounded modified condition number, which is desirable for block PCG. Figure 4.2 shows the results of an experiment where the observation locations were on a 128 x 128 regular grid and $s = 100$ random right-hand sides were used. Note that since $K$ and $K^{[1]}$ are BTTB (block Toeplitz with Toeplitz blocks), they can be further preconditioned by using a BCCB (block circulant with circulant blocks) preconditioner [4]. Comparing the convergence history for $K$, $K$ preconditioned with a BCCB preconditioner, $K^{[1]}$, and $K^{[1]}$ preconditioned with a BCCB preconditioner, we see that the last case clearly yields the fastest convergence.
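A minimal single-right-hand-side sketch of how the filter enters a preconditioned conjugate gradient solve with SciPy: $M = \widetilde L^{(1)T} \widetilde L^{(1)}$ is supplied as an approximate inverse of $K$, with `augmented_first_order_filter` the helper from the sketch in §2 and the covariance an illustrative choice. The paper's experiments use block PCG and BCCB preconditioners, which are not reproduced here.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0.0, 100.0, 500))
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 7.0)        # exponential covariance, ill-conditioned
b = rng.standard_normal(len(x))

Lt = augmented_first_order_filter(x)                       # nonsingular tilde-L^(1) from the earlier sketch
M = LinearOperator(K.shape, matvec=lambda r: Lt.T @ (Lt @ r))   # approximate inverse of K

def count_iters(M_prec=None):
    n_it = 0
    def cb(xk):
        nonlocal n_it
        n_it += 1
    cg(K, b, M=M_prec, callback=cb)
    return n_it

print(count_iters(), count_iters(M))                       # far fewer CG iterations with the filter preconditioner
```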
Fig. 4.1. Condition numbers of K, unfiltered and filtered, as the matrix size varies: (a) d = 1, nu = 1/2, first order difference filter (K, K^(1), tilde K^(1)); (b) d = 1, nu = 3/2, second order difference filter (K, K^(2), tilde K^(2)); (c) d = 2, nu = 1, Laplace filter applied once (K, K^[1]).

Next, we demonstrate the usefulness of the bounded condition number results in the maximum likelihood problem mentioned in §1. The simulation process without any filtering is as follows. We first generated observations $y = Z(x)$ for a Gaussian
random field in $\mathbb{R}^2$ with the covariance rule

    k(x; \theta) = \sqrt{2}\, r_{x;\theta}\, \mathcal{K}_1\big( \sqrt{2}\, r_{x;\theta} \big), \qquad r_{x;\theta} = \sqrt{ \frac{x_1^2}{\theta_1^2} + \frac{x_2^2}{\theta_2^2} },

where $\theta = (\theta_1, \theta_2)$; this is the Matérn model with $\nu = 1$ and anisotropic scaling. For different grid sizes $n$ (matrix size $m = n^2$), the computational times were recorded, and the accuracy of the estimates $\hat\theta_N$, compared with the exact maximum likelihood estimates $\hat\theta$ (in terms of the confidence intervals derived from (1.4)), was examined.
We have noted that the condition number of $K$ grows faster than linearly in $m$. Therefore, we instead solved a nonlinear system other than (1.3) to obtain the estimate $\hat\theta_N$. We applied the Laplace operator to the sample vector $y$ once and obtained a vector $y^{[1]}$. Then we solved the nonlinear system

    -(y^{[1]})^T (K^{[1]})^{-1} \frac{\partial K^{[1]}}{\partial \theta_i} (K^{[1]})^{-1} (y^{[1]}) + \frac{1}{N} \sum_{j=1}^{N} u_j^T \Big( (K^{[1]})^{-1} \frac{\partial K^{[1]}}{\partial \theta_i} \Big) u_j = 0, \quad \forall i,    (4.1)

where the $u_j$'s are as in (1.3).

Fig. 4.2. Convergence history of block PCG: residual versus iteration for K, K with a BCCB preconditioner, K^[1], and K^[1] with a BCCB preconditioner.

This approach is equivalent to estimating the parameter
from the sample vector $y^{[1]}$ with covariance $K^{[1]}$. The matrix $K^{[1]}$ is guaranteed to have a bounded condition number for all $m$ according to Theorem 3.1.
The simulation was performed on a Linux desktop with 16 cores at 2.66 GHz and 32 GB of memory. The nonlinear equation (4.1) was solved by using the Matlab command fsolve, which by default uses the trust-region dogleg algorithm. Results are shown in Figure 4.3. As we would expect, as the number $m$ of observations increases, the estimates $\hat\theta_N$ tend to become closer to the true $\theta$ that generated the simulation data. Furthermore, despite the fact that $N = 100$ is fixed as $m$ increases, the confidence intervals for $\hat\theta_N$ become increasingly narrow as $m$ increases, which suggests that it may not be necessary to let $N$ increase with $m$ to insure that the simulation error $\hat\theta_N - \hat\theta$ is small compared to the statistical error $\hat\theta - \theta$. Finally, as expected, the running time of the simulation scales roughly as $O(m)$, which shows promising practicality for running simulations on much larger grids than 1024 x 1024.
Fig. 4.3. Simulation results of the maximum likelihood problem: (a) estimated parameters with confidence intervals versus matrix dimension m; (b) running time versus matrix dimension m (64x64 grid: 2.56 mins, 7 function evaluations; 128x128 grid: 6.62 mins, 7; 256x256 grid: 1.1 hours, 8; 512x512 grid: 2.74 hours, 8; 1024x1024 grid: 11.7 hours, 8).
5. Further numerical exploration. This section describes additional numerical experiments. First we consider trying to reduce the condition number of our matrices by rescaling them to be correlation matrices. Specifically, for a covariance matrix $K$, the corresponding correlation matrix is given by

    C = \mathrm{diag}(K)^{-1/2}\, K\, \mathrm{diag}(K)^{-1/2}.

Although $C$ is not guaranteed to have a smaller condition number than $K$, in practice it often will. For observations on a regular grid and a spatially invariant filter, which is the case in §3, all diagonal elements of $K$ are equal, so there is no point in rescaling. For irregular observations, rescaling does make a difference. For all of the settings considered in §2, the ratio of the biggest to the smallest diagonal elements of all of the covariance matrices considered is bounded. It follows that all of the theoretical results in that section on bounded condition numbers apply to the corresponding correlation matrices.
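The rescaling itself is a one-line operation; a minimal Python sketch, for any symmetric positive definite matrix `Kt`, is:

```python
import numpy as np

def to_correlation(Kt):
    """Rescale a covariance matrix to the corresponding correlation matrix:
    C = diag(Kt)^(-1/2) Kt diag(Kt)^(-1/2)."""
    s = 1.0 / np.sqrt(np.diag(Kt))
    return Kt * np.outer(s, s)
```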
Fig. 5.1. Condition numbers of covariance matrices and correlation matrices: (a) d = 1, nu = 3/2, second order difference filter (K^(2), C^(2), tilde K^(2), tilde C^(2)); (b) d = 1, nu = 1, two filters (K, K^(1), K^(2)); (c) d = 1, nu = 2, two filters (K, K^(1), K^(2)); (d) d = 2, nu = 1, augmented Laplace filter applied once (K, tilde K^[1], tilde C^[1]).
Figure 4.1(b) shows that the filtered covariance matrices $\widetilde K^{(2)}$ have much larger condition numbers than does $K^{(2)}$. This result is perhaps caused by the full rank transformation $\widetilde L^{(2)}$ that makes the $(0, 0)$ and $(n, n)$ entries of $\widetilde K^{(2)}$ significantly different from the rest of the diagonal. For the same setting, Figure 5.1(a) shows that diagonal rescaling yields much improved results: the correlation matrix $\widetilde C^{(2)}$ has a condition number much smaller than that of $\widetilde K^{(2)}$ and close to that of $K^{(2)}$.
Theorems 2.1 and 2.3 indicate the possibility of reducing the condition number of the covariance matrix for spectral densities with a tail similar to $|\omega|^{-p}$ for even $p$ by applying an appropriate difference filter. A natural question is whether the difference filter can also be applied to spectral densities whose tails are similar to $|\omega|$ raised to some negative odd power. Figures 5.1(b) and 5.1(c) show the filtering results for $|\omega|^{-3}$ and $|\omega|^{-5}$, respectively. In both plots, neither the first nor the second order difference filter resulted in a bounded condition number, but the condition number of the filtered matrix is greatly reduced. This encouraging result indicates that the filtering operation may be useful for a wide range of densities (e.g., all Matérn models) that behave like $|\omega|^{-p}$ at high frequencies, whether or not $p$ is an even integer.
For processes in $d > 1$ dimensions, our result (Theorem 3.1) requires a transformation $L^{[\tau]}$ that reduces the dimension of the covariance matrix by $O(n^{d-1})$. One may want to have a full rank transformation or some transformation that reduces the dimension of the matrix by at most $O(1)$. We tested one such transformation here for an $\mathbb{R}^2$ example, which reduces the dimension by four. The transformation $\widetilde L^{[1]}$ is defined as follows. When $j$ is not on the boundary, namely $\mathbf{1} \le j \le (n-1)\mathbf{1}$,

    \widetilde L^{[1]}(j, l) = \begin{cases} 4, & l = j \\ -2, & l = j \pm e_p, \; p = 1, 2 \\ 1, & l = j \pm e_1 \pm e_2 \\ 0, & \text{otherwise.} \end{cases}

When $j$ is on the boundary but not at a corner, the definition of $\widetilde L^{[1]}(j, l)$ is exactly the same as above, but only for legitimate $l$; that is, components of $l$ cannot be smaller than 0 or larger than $n$. The corner locations are ignored. The condition numbers of the filtered covariance matrix $\widetilde K^{[1]} = \widetilde L^{[1]} K \widetilde L^{[1]T}$ and those of the corresponding correlation matrix $\widetilde C^{[1]}$ are plotted in Figure 5.1(d), for the same covariance function used in Figure 4.1(c). Indeed, the diagonal entries of $\widetilde K^{[1]}$ corresponding to the boundary locations are not too different from those not on the boundary; therefore, it is not surprising that the condition numbers for $\widetilde K^{[1]}$ and $\widetilde C^{[1]}$ look similar. It is plausible that the condition number of $\widetilde K^{[1]}$ is bounded independent of the size of the grid.
6. Conclusions. We have shown that for stationary processes with certain spectral densities, a first or second order difference filter can precondition the covariance matrix of irregularly spaced observations in one dimension, and the discrete Laplace operator (possibly applied more than once) can precondition the covariance matrix of regularly spaced observations in higher dimensions. Even when the observations are located within a fixed domain, the resulting filtered covariance matrix has a bounded condition number independent of the number of observations. This result is particularly useful for large scale simulations that require solving linear systems with the covariance matrix using an iterative method. It remains to investigate whether the results for higher dimensions can be generalized to observation locations that are irregularly spaced.
REFERENCES
[1] M. Anitescu, J. Chen, and L. Wang, A matrix-free approach for solving the Gaussian process maximum likelihood problem, Tech. Rep. ANL/MCS-P1857-0311, Argonne National Laboratory, 2011.
[2] O. E. Barndorff-Nielsen and N. Shephard, Econometric analysis of realized covariation: High frequency based covariance, regression, and correlation in financial economics, Econometrica, 72 (2004), pp. 885–925.
[3] J. Barnes and P. Hut, A hierarchical O(N log N) force-calculation algorithm, Nature, 324 (1986), pp. 446–449.
[4] R. H.-F. Chan and X.-Q. Jin, An Introduction to Iterative Toeplitz Solvers, SIAM, 2007.
[5] J. Chilès and P. Delfiner, Geostatistics: Modeling Spatial Uncertainty, Wiley, New York, 1999.
[6] Z. Duan and R. Krasny, An adaptive treecode for computing nonbonded potential energy in classical molecular systems, J. Comput. Chem., 23 (2001), pp. 1549–1571.
[7] A. C. Faul, G. Goodsell, and M. J. D. Powell, A Krylov subspace algorithm for multiquadric interpolation in many dimensions, IMA Journal of Numerical Analysis, 25 (2005), pp. 1–24.
[8] I. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Products, Academic Press, Orlando, seventh ed., 2007.
[9] L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, J. Comput. Phys., 73 (1987), pp. 325–348.
[10] N. A. Gumerov and R. Duraiswami, Fast radial basis function interpolation via preconditioned Krylov iteration, SIAM J. Sci. Comput., 29 (2007), pp. 1876–1899.
[11] I. A. Ibragimov and Y. A. Rozanov, Gaussian Random Processes, Springer-Verlag, New York, 1978.
[12] D. P. O'Leary, The block conjugate gradient algorithm and related methods, Linear Algebra Appl., 29 (1980), pp. 293–322.
[13] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, Massachusetts, 2006.
[14] H. Rue and L. Held, Gaussian Markov Random Fields: Theory and Applications, Chapman & Hall/CRC, Boca Raton, FL, 2005.
[15] M. Stein, Interpolation of Spatial Data: Some Theory for Kriging, Springer, New York, 1999.
[16] M. L. Stein, Fixed domain asymptotics for spatial periodograms, Journal of the American Statistical Association, 90 (1995), pp. 1277–1288.
[17] M. L. Stein, Equivalence of Gaussian measures for some nonstationary random fields, Journal of Statistical Planning and Inference, 123 (2004), pp. 1–11.
[18] P. Whittle, On stationary processes in the plane, Biometrika, 41 (1954), pp. 434–449.
The submitted manuscript has been created by the University of
Chicago as Operator of Argonne National Laboratory (Argonne)
under Contract No. DE-AC02-06CH11357 with the U.S. Depart-
ment of Energy. The U.S. Government retains for itself, and others
acting on its behalf, a paid-up, nonexclusive, irrevocable world-
wide license in said article to reproduce, prepare derivative works,
distribute copies to the public, and perform publicly and display
publicly, by or on behalf of the Government.