Spatial Correlation New
Spatial Correlation New
1. Definition
Spatial autocorrelation measures the correlation of a variable with itself through space.
Spatial autocorrelation can be positive or negative. Positive spatial autocorrelation
occurs when similar values occur near one another. Negative spatial autocorrelation
occurs when dissimilar values occur near one another.
2. Weight Matrix
To assess spatial autocorrelation, one first needs to define what is meant by two
observations being close together, i.e., a distance measure must be determined. These
distances are presented in weight matrix, which defines the relationships between
locations where measurements were made. If data are collected at n locations, then the
weight matrix will be n n with zeroes on the diagonal.
Other weight matrices are possible. The weight matrix is often row-standardized, i.e., all
the weights in a row sum to one. Note that the actual values in the weight matrix are up
to the researcher.
A. Moran's I
Moran's I (Moran 1950) tests for global spatial autocorrelation for continuous data.
It is based on cross-products of the deviations from the mean and is calculated for n
observations on a variable x at locations i , j as:
1
n
w ( x x )( x
i j
ij i j x)
I ,
S0 (x x )i
i
2
where x is the mean of the x variable, wij are the elements of the weight matrix, and
S0 is the sum of the elements of the weight matrix: S0 wij .
i j
where
B. Geary's C
n 1
w ( x x )
i j
ij i j
2
C .
2 S0 (x x ) i
i
2
Geary’s C ranges from 0 (maximal positive autocorrelation) to a positive value for high
negative autocorrelation. Its expectation is 1 in the absence of autocorrelation and
regardless of the specified weight matrix (Sokal & Oden 1978). If the value of Geary’s
2
C is less than 1, it indicates positive spatial autocorrelation. The variance is:
D. SAS example
There is no straightforward way to calculate Moran’s I and Geary’s C using SAS. One
source of code using proc iml can be found at the website of E.B. Moser at the
Department of Experimental Statistics, Louisiana State University
(http://www.stat.lsu.edu/faculty/moser/spatial/spatial.html).
The weight matrix should therefore be 16 × 16. We assume the matrix is symmetric (the
relationship between locations i and j is the same as the relationship between locations j
and i) and specify all nonzero relationships in the data step below. I and J are the indices
for pairs of locations, and WT is the weight for those pairs.
DATA WEIGHTS;
INPUT I J WT @@;
LIST;
CARDS;
1 6 1 2 6 1 2 7 1 3 6 1 3 7 1 4 7 1 5 6 1 5 10 1 6 9 1 7 8 1 7
12 1 8 11 1 9 10 1 10 13 1 10 14 1 10 15 1 11 14 1 11 15 1 11 16 1 11 12 1
;
3
START LOADDATA;
USE RESPONSE VAR {X};
READ ALL;
CLOSE RESPONSE;
N=NROW(X);
W=J(N,N,0); /* INITIALIZE WEIGHT MATRIX TO ZEROS */
USE WEIGHTS VAR {I J WT};
READ ALL;
/* PLACE WEIGHTS INTO THE WEIGHT MATRIX */
DO K=1 TO NROW(I);
W[I[K],J[K]]=WT[K];
W[J[K],I[K]]=WT[K]; /* ASSUMING SYMMETRY IN WEIGHTS */
END;
/* DROP THE I J AND WT MATRICES */
FREE I J K WT;
PRINT / 'INITIAL DATA ' ,, 'RESPONSE',X,, 'WEIGHTS', W[FORMAT=3.0];
/* $$$ THE FORMAT MAY NEED TO BE CHANGED IF $$$ */
/* $$$ FRACTIONAL WEIGHTS ARE TO BE USED. $$$ */
FINISH;
START STATS;
XBAR=X[+]/N;
Z=X-REPEAT(XBAR,N,1);
ZSQ=Z`*Z;
XVAR=ZSQ/(N-1);
VMRATIO=XVAR/XBAR;
Z2=Z#Z; Z4=Z2`*Z2;
B2=N*Z4/ZSQ**2; /* KURTOSIS */
FREE Z2 Z4;
SUMW=0; S1=0;
DO K=1 TO N;
DO J=1 TO N;
IF K <> J THEN
DO;
SUMW=SUMW + W[K,J];
S1=S1 + (W[K,J]+W[J,K])**2;
END;
END;
END;
S1=S1/2;
S2=W[,+]+W[+,]`;
S2=S2`*S2;
SUMW2=SUMW*SUMW;
TEMP=N||XBAR||XVAR||VMRATIO||B2;
RNAME={" "};
CNAME={" N" " XBAR" "VARIANCE" "V/M RATIO" "KURTOSIS"};
PRINT / 'SUMMARY STATISTICS ' TEMP[ROWNAME=RNAME COLNAME=CNAME];
FREE XBAR XVAR VMRATIO K J RNAME CNAME TEMP;
FINISH;
START AUTOCORR;
I=0; C=0;
DO K=1 TO N;
DO J=1 TO N;
IF K <> J THEN
DO;
I = I + W[K,J] * Z[K]*Z[J];
C = C + W[K,J] * (X[K]-X[J])**2;
END;
END;
END;
I=(N/SUMW)*(I/ZSQ);
C=((N-1)/(2*SUMW))*(C/ZSQ);
FREE K J;
FINISH;
START AUTOSTAT;
RNAME={" "}; CNAME={"I " "C "};
TEMP=I||C;
4
PRINT ,, 'SPATIAL AUTOCORRELATION STATISTICS'
TEMP[ROWNAME=RNAME COLNAME=CNAME];
FREE RNAME CNAME TEMP;
FINISH; /* AUTOSTAT */
START TESTOFI;
EXPECT=1/(1.0-N);
RNAME={" "}; CNAME={" E(I) " "OBSERVED"};
TEMP=EXPECT||I;
PRINT "TEST OF I", "EXPECTED AND OBSERVED VALUES"
TEMP[ROWNAME=RNAME COLNAME=CNAME];
VAR=((N**2*S1-N*S2+3*SUMW2)/(SUMW2*N**2-1))-EXPECT**2;
ZZ=(I-EXPECT)/SQRT(VAR);
P=1-PROBNORM(ABS(ZZ));
CNAME={"VARIANCE" " Z " " P>|Z| "};
TEMP=VAR||ZZ||P;
PRINT 'TEST BASED ON NORMALITY:'
TEMP[ROWNAME=RNAME COLNAME=CNAME];
VAR=(N*((N**2-3*N+3)*S1-N*S2+3*SUMW2)-
B2*((N**2-N)*S1-2*N*S2+6*SUMW2))/
((N-1)*(N-2)*(N-3)*SUMW2) - EXPECT**2;
ZZ=(I-EXPECT)/SQRT(VAR);
P=1-PROBNORM(ABS(ZZ));
TEMP=VAR||ZZ||P;
PRINT 'TEST BASED ON RANDOMIZATION:'
TEMP[ROWNAME=RNAME COLNAME=CNAME];
FREE EXPECT VAR ZZ P RNAME CNAME TEMP;
FINISH;
START TESTOFC;
EXPECT=1.0;
RNAME={" "}; CNAME={" E(C) " "OBSERVED"};
TEMP=EXPECT||C;
PRINT "TEST OF C", "EXPECTED AND OBSERVED VALUES"
TEMP[ROWNAME=RNAME COLNAME=CNAME];
VAR=((2*S1+S2)*(N-1)-4*SUMW2)/(2*(N+1)*SUMW2);
ZZ=(C-EXPECT)/SQRT(VAR);
P=1-PROBNORM(ABS(ZZ));
CNAME={"VARIANCE" " Z " " P>|Z| "};
TEMP=VAR||ZZ||P;
PRINT 'TEST BASED ON NORMALITY:'
TEMP[ROWNAME=RNAME COLNAME=CNAME];
/* ZZ AND P BELOW ARE TEMPORARIES */
ZZ=(N-1)*S1*(N**2-3*N+3-(N-1)*B2)-(N-1)/4*S2;
P=(N**2+3*N-6-(N**2-N+2)*B2)+SUMW2*(N**2-3-(N-1)**2*B2);
VARC=ZZ*P/(N*(N-2)*(N-3)*SUMW2);
ZZ=(C-EXPECT)/SQRT(VAR);
P=1-PROBNORM(ABS(ZZ));
TEMP=VAR||ZZ||P;
PRINT 'TEST BASED ON RANDOMIZATION:'
TEMP[ROWNAME=RNAME COLNAME=CNAME];
FREE EXPECT VAR ZZ P RNAME CNAME TEMP;
FINISH;
5
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
1 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0
0 1 1 1 0 0 0 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 1 0 0 0 1 1 1 0
0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1
0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
VARIANCE Z P>|Z|
TEST BASED ON NORMALITY: 0.0360244 -4.7292651 1.12667E-6
VARIANCE Z P>|Z|
TEST BASED ON RANDOMIZATION: 0.0368859 -4.6737112 1.47903E-6
VARIANCE Z P>|Z|
TEST BASED ON NORMALITY: 0.07647059 5.22250725 8.82583E-8
VARIANCE Z P>|Z|
TEST BASED ON RANDOMIZATION: 0.07647059 5.22250725 8.82583E-8
Note the p-values calculated by “TESTOFI” and “TESTOFC” are one-sided. There may
be a cause for concern in “TESTOFC”, as it calculated the same variance for the two tests
under normality and randomization assumptions.
4. Semivariogram
A. Definition
1
( h)
2n( h) (i , j ): hij h
( xi x j ) 2
where (h) is the semivariance for distance class h , n(h) is the total number of pairs
of values at distance h , and hij is the distance between locations i and j .
6
It is unlikely that any actual pair of locations would exactly have the distance of h . It is
common to consider a range of distances, [h h, h h] , to group pairs for a single
term in the expression for (h) and to modify n(h) accordingly.
7
B. Theoretical Semivariogram Models
There are many types of theoretical semivariogram models. Some commonly used
models are given below:
(1) Spherical model: this is only valid for 1-3 dimensions. The equation is:
0, | h | 0
3 | h | 1 | h | 3
(h) c0 c1 , 0 | h | a
2a 2 a
c0 c1 , | h | a
where c0 , c1 , and a 0 .
(2) Exponential model: this is valid for all dimensions with formula:
0, | h | 0
( h)
3|h|/ a
,
c0 c1 1 e | h | 0
where c0 , c1 , and a 0 .
(3) Gaussian model: this is valid for all dimensions with formula:
0, | h | 0
( h)
3(|h|/ a )
2
c0 c1 1 e
, | h | 0
where c0 , c1 , and a 0 .
(4) Power model: this is valid for all dimensions with formula:
0, | h | 0
( h)
c0 b | h | , | h | 0
8
model but reaches the sill more quickly. The Gaussian is useful when the data have very
high spatial correlation between two close points. It has an S-shape. The power model
does not reach a sill and the shape depends on the parameter. It is appropriate when the
data show a long-range correlation. If the semivariogram does not seem to follow any
of the standard structures, it is possible to combine structures to obtain something with
characteristics of more than one standard structure. This is called a “nested structure”.
Spherical Exponential
1.4
1.4
1.2
1.2
gamma
gamma
1.0
1.0
0.8
0.8
0.6
0.6
0 2 4 6 8 10 0 2 4 6 8 10
|h| |h|
Gaussian Power
25
1.4
20
1.2
gamma
gamma
15
1.0
0.8
10
0.6
0 2 4 6 8 10 0 2 4 6 8 10
|h| |h|
In SAS, proc variogram computes the sample semivariogram, from which a suitable
theoretical semivariogram can be found visually. The goal here is often to predict
values of the measured variable at unsampled locations.
The following example is taken from the SAS Version 8 online help
(http://v8doc.sas.com/sashtml/).
9
Data were simulated to represent coal seam thickness measurements taken over an
approximately square area:
data thick;
input east north thick @@;
datalines;
0.7 59.6 34.1 2.1 82.7 42.2 4.7 75.1 39.5
4.8 52.8 34.3 5.9 67.1 37.0 6.0 35.7 35.9
6.4 33.7 36.4 7.0 46.7 34.6 8.2 40.1 35.4
13.3 0.6 44.7 13.3 68.2 37.8 13.4 31.3 37.8
17.8 6.9 43.9 20.1 66.3 37.7 22.7 87.6 42.8
23.0 93.9 43.6 24.3 73.0 39.3 24.8 15.1 42.3
24.8 26.3 39.7 26.4 58.0 36.9 26.9 65.0 37.8
27.7 83.3 41.8 27.9 90.8 43.3 29.1 47.9 36.7
29.5 89.4 43.0 30.1 6.1 43.6 30.8 12.1 42.8
32.7 40.2 37.5 34.8 8.1 43.3 35.3 32.0 38.8
37.0 70.3 39.2 38.2 77.9 40.7 38.9 23.3 40.5
39.4 82.5 41.4 43.0 4.7 43.3 43.7 7.6 43.1
46.4 84.1 41.5 46.7 10.6 42.6 49.9 22.1 40.7
51.0 88.8 42.0 52.8 68.9 39.3 52.9 32.7 39.2
55.5 92.9 42.2 56.0 1.6 42.7 60.6 75.2 40.1
62.1 26.6 40.1 63.0 12.7 41.8 69.0 75.6 40.1
70.5 83.7 40.9 70.9 11.0 41.7 71.5 29.5 39.8
78.1 45.5 38.7 78.2 9.1 41.7 78.4 20.0 40.8
80.5 55.9 38.7 81.1 51.0 38.6 83.8 7.9 41.6
84.5 11.0 41.5 85.2 67.3 39.4 85.5 73.0 39.8
86.7 70.4 39.6 87.2 55.7 38.8 88.1 0.0 41.6
88.4 12.1 41.3 88.4 99.6 41.2 88.8 82.9 40.5
88.9 6.2 41.5 90.6 7.0 41.5 90.7 49.6 38.9
91.5 55.4 39.0 92.9 46.8 39.1 93.4 70.9 39.7
94.8 71.5 39.7 96.2 84.3 40.3 98.2 58.2 39.5
;
A scatterplot of the sampled locations is below. Note that if these data are intended for
prediction, the sampled locations should be fairly evenly distributed across the area of
interest.
The next step is to create a “surface plot” of the measured variable (thickness). Any
apparent trend must be removed before estimating the semivariogram model.
proc g3d data=thick;
title 'Surface Plot of Coal Seam Thickness';
scatter east*north=thick / xticknum=5 yticknum=5
grid zmin=20 zmax=65;
label east = 'East'
north = 'North'
thick = 'Thickness'
;
10
run;
axis1 minor=none;
axis2 minor=none label=(angle=90 rotate=0);
title 'Distribution of Pairwise Distances';
proc gchart data=outd;
vbar mdpt / type=sum sumvar=count discrete frame
cframe=ligr gaxis=axis1 raxis=axis2 nolegend;
run;
11
In general, it is desirable to have a small value for LAGDISTANCE. However, SAS
says a rule of thumb used in computing sample semivariograms is to use at least 30
point pairs in computing a single value of the empirical or experimental semivariogram.
The code below uses NHCLASSES=20, which leads to enough points in all but lag 0.
The length of the lag 1 interval (rounded to the nearest integer) will be used as the
LAGDISTANCE.
proc variogram data=thick outdistance=outd;
compute nhc=20 novariogram;
coordinates xc=east yc=north;
var thick;
run;
axis1 minor=none;
axis2 minor=none label=(angle=90 rotate=0);
title 'Distribution of Pairwise Distances';
proc gchart data=outd;
vbar mdpt / type=sum sumvar=count discrete frame
cframe=ligr gaxis=axis1 raxis=axis2 nolegend;
run;
12
title 'Standard and Robust Semivariogram for Coal Seam
Thickness Data';
proc gplot data=outv2;
plot vari*distance=type / frame cframe=ligr vaxis=axis2
haxis=axis1;
symbol1 i=join l=1 c=blue /* v=star */;
symbol2 i=join l=1 c=yellow /* v=square */;
axis1 minor=none
label=(c=black 'Lag Distance') /* offset=(3,3) */;
axis2 order=(0 to 9 by 1) minor=none
label=(angle=90 rotate=0 c=black 'Variogram')
/* offset=(3,3) */;
run;
The increasingly rapid rise from the origin suggests that a Gaussian model may be
appropriate (note this is based on visual inspection).
Trial and error reveals that a scale of c0 = 7.5 and a range of a0 = 30 fit reasonably for
both the robust and standard semivariogram:
data outv3; set outv;
c0=7.5; a0=30;
vari = c0*(1-exp(-distance*distance/(a0*a0)));
type = 'Gaussian'; output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
13
Note that the fit is good, particularly at smaller lag distances. This suggests that the
choice of a Gaussian form is adequate. The Gaussian form with the parameter values
chosen above can be used in PROC KRIGE2D to produce a contour plot of the kriging
estimates and the associated standard errors.
References:
Cliff, AD and Ord, JK (1975). The choice of a test for spatial autocorrelation. In J. C.
Davies and M. J. McCullagh (eds) Display and Analysis of Spatial Data, John Wiley and
Sons, London, 54-77
Cliff, A. D. and Ord, J. K. 1981 Spatial processes - models and applications. (London:
Pion).
Geary, R. (1954). The contiguity ratio and statistical mapping. The Incorporated
Statistician 5: pp115-45
Moran, P.A.P. (1950). Notes on continuous stochastic phenomena, Biometrika 37, pp17-
23.
Sokal, R.R. and Oden, N.L. (1978). Spatial autocorrelation in biology. 1. Methodology.
Biological Journal of the Linnean Society, 10: 199-228.
14