Introduction To Geostatistics For Site Characterization and Safety Assessment
SAND2013-4769C
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed
Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Acknowledgements
Topics
Background
Many geological media are produced by processes that are
very complex and occur on scales ranging from microscopic
to 100s of km
Sampling of subsurface media is extremely limited relative
to the volume of material of interest, because of economic
and practical limitations
The challenge for geoscientists and engineers is to
characterize and model geological media adequately for
predictive analysis and decision making
Characterization and prediction are necessarily uncertain
because exact description of the system is impossible
The goal is to make full use of available geological
information and to understand the limitations of our
knowledge and the related uncertainty
Background
Fortunately, geological media have spatial structure that can
be characterized with subjective and quantitative knowledge
Geological knowledge and interpretation are important
sources of information on spatial structure and continuity,
particularly for large-scale features
Material properties also tend to have spatial correlation
related to the continuity of processes that operated during
formation of the geological media
Intuitively, the values for a particular property tend to be
more similar for locations that are closer together than for
locations that are more widely spaced
This characteristic of geological media forms the basis for
the field of geostatistics
Geostatistics - Overview
The primary objective of geostatistics is
the characterization of spatial or
temporal systems that are
incompletely known
Classical univariate statistics considers
only the population of values for a
particular variable
Geostatistics is an extension of
bivariate statistics that uses the
sampling location (in space or time) of
every measurement
Geostatistical analysis is only
meaningful if the measurements show
some spatial (or temporal) correlation
Geostatistics - Background
Geostatistics is a very broad field; this workshop provides
only a brief introduction to the topic
Geostatistics has developed over a long time frame,
starting with theoretical developments in the 1950s and
expanding significantly in the era of digital computation
Original applications included estimation methods applied
to calculation of ore reserves in mining
More recent applications have been more focused on
simulation methods, as applied to reservoir modeling in
petroleum engineering applications
Geostatistical estimation methods are generally used for
interpolation of measurements, but simulation methods
can be used in extrapolation of parameters (with caution!)
Geostatistics - Overview
Geostatistics provides a set of tools for modeling spatial
distributions of parameters based on the available data and
the two-point spatial covariance
[Figures: histogram; variogram (γ versus distance)]
Geostatistics
Geostatistics Applications
Three-dimensional model of fracture
permeability at the JNC MIU site, Japan
Geostatistics – Approach
Geostatistics – Example Variogram
Geostatistics - Estimation
[Figure: concentration (ppm) versus distance (m)]
Geostatistics - Simulation
[Figure: concentration (ppm) versus distance (m)]
Geostatistics - Simulation Example
Sample data locations and 20 realizations of Th-232 activity
[Figure: maps of Th-232 activity levels; scale from 0.0 to 4.0]
Geostatistics – Groundwater Flow
One realization of
non-uniform flow through
heterogeneous material
Transport is convective, not
dispersive
(Large Peclet numbers)
Geostatistics – Groundwater Flow
Three Realizations conditioned to same 96 boreholes
Geostatistics – Realizations
Geostatistics – Additional Sampling
Where to locate additional boreholes? (3 approaches)
• Traditional technique: reduce estimation error or kriging variance by putting
boreholes in unsampled locations
• Decision-based technique: target areas of maximum uncertainty defined by
probability mapping
• New idea: consider K to be a stochastic input parameter to the transport
model and use sensitivity analysis
Geostatistics – Summary
Uncertainty is due to limited sampling of a spatially
heterogeneous variable
Spatial uncertainty creates uncertainty in performance
assessment results
Geostatistical simulation provides a technique for examining and
quantifying the amount of uncertainty.
Estimates of uncertainty can be propagated through
performance assessment models
Relationship between PA uncertainty and spatial uncertainty can
be used to guide site characterization
Geostatistics
Geostatistics – Exercises Objectives
Geostatistics – PA Decision Framework
[Flowchart boxes: Goals (URL, Safety Assessment); Conceptual Model, Parameter Development; Decision Point; Uncertainty Analysis; Data Worth Analysis; How Much Data to Collect?; Stop]
Geostatistics – PA Decision Framework
Sedimentary layer
River
Aquifer
Rock mass
Geostatistics – Exercise Preview
Outline
Introduction
Exploratory data analysis
Visualization
Univariate statistics
Data correlations
Clustering, transformations, and trends
Spatial correlation analysis
Experimental variograms
Correlation anisotropy
Variogram models
Introduction
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is everything you do to understand
your data. It includes both objective and subjective analyses.
Topics
- mapping the data
- histogram techniques
- probability-plotting techniques
- correlations among multivariate data
- data transformations
EDA - Mapping the Data
EDA - Data Posting
Data Posting
EDA – Contour Mapping
to steep gradients in contour map and may indicate bad data
Different contouring algorithms give different maps. Which one is best?
[Figure: contour map with row and column axes]
EDA – Indicator Plots
EDA – Histograms
Simple Histograms
- The histogram approximates the probability density function (PDF)
- The cumulative histogram approximates the cumulative distribution function (CDF)
- Check for outliers (and their cause)
- Multimodality: evidence for multiple processes
- Clustering of data or preferential sampling
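The histogram and cumulative frequency curve listed above can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_and_cdf` and the lognormal sample are made up here and are not part of the workshop data or software.

```python
import numpy as np

def histogram_and_cdf(values, bins=10):
    """Return bin edges, relative frequencies, and cumulative frequencies."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    freq = counts / counts.sum()   # relative frequency per bin (histogram)
    cdf = np.cumsum(freq)          # cumulative frequency at each right bin edge
    return edges, freq, cdf

# Made-up, positively skewed sample standing in for real measurements
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=200)
edges, freq, cdf = histogram_and_cdf(sample, bins=20)
```

Plotting `freq` against the bin centers gives the histogram, and plotting `cdf` against `edges[1:]` gives the cumulative curve; outliers and multimodality are then checked by eye.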
EDA – PDF and CDF plots
[Figure: PDF (frequency versus variable value) and CDF (cumulative frequency versus variable value)]
EDA – PDF and CDF plots
EDA – Sample Clustering and Plots
[Figure: map of lead concentrations (color scale 0 to 500) showing clustered sample locations, with histograms of the clustered and declustered data]

Statistic        Clustered Lead Data   Declustered Lead Data
Number of data   140                   140
Mean             406.9022              231.5361
Std. dev.        600.6014              414.9137
Coef. of var.    1.4760                1.7920
Maximum          4972.3999             4972.3999
Upper quartile   573.8152              229.9890
Median           152.9998              88.5706
Lower quartile   48.4950               36.6273
Minimum          0.0700                0.0700
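The slides do not say how the declustered statistics were computed. One common approach is cell declustering, sketched below under that assumption: each sample is weighted by the inverse of the number of samples sharing its grid cell. The function name, coordinates, and values are all hypothetical.

```python
import numpy as np

def cell_decluster_weights(x, y, cell_size):
    """Weights proportional to 1 / (number of samples in the same square cell)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ix = np.floor((x - x.min()) / cell_size).astype(int)
    iy = np.floor((y - y.min()) / cell_size).astype(int)
    cell_id = ix * (iy.max() + 1) + iy          # one integer id per occupied cell
    _, inverse, counts = np.unique(cell_id, return_inverse=True, return_counts=True)
    weights = 1.0 / counts[inverse]
    return weights / weights.sum()              # normalize so weights sum to 1

# Hypothetical data: four clustered high values and three isolated low values
x = np.array([1.0, 2.0, 3.0, 4.0, 15.0, 25.0, 35.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 15.0, 25.0, 35.0])
values = np.array([100.0, 100.0, 100.0, 100.0, 10.0, 10.0, 10.0])

weights = cell_decluster_weights(x, y, cell_size=10.0)
naive_mean = values.mean()
declustered_mean = float(np.sum(weights * values))
```

As in the lead example, preferential sampling of high values inflates the naive mean, and the declustered mean comes out lower. The result depends on the chosen cell size, which is itself a judgment call.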
EDA – Probability Plotting
[Figure: probability plots of cumulative probability (0.01 to 99.99) versus variable value, including a normal probability plot]
EDA – Normal Probability Plot
Example probability plot from workshop exercises – using
normal score transform value for the Gaussian probability axis
EDA – Correlation of Multivariate Data
[Figure: maps of nitrate (NO3) and DCPA concentration, with a scatterplot of DCPA versus NO3]
EDA – Correlation of Multivariate Data
[Figure: cumulative frequency curves, one on a 0.0 to 0.6 axis and one on a -3 to 3 axis]
What is a trend?
• Geostatistics assumes second-order stationarity
• “Deterministic Geologic Processes”
- Trend analysis and modeling must make geologic sense
• Removing a trend - Analysis of residuals
[Figure: data value versus distance, two panels]
EDA – Summary
[Figure: correlation, r, versus separation, h (curves labeled h1 and h2)]
The greater the distance between points, the less correlated the
values.
Since h is a vector, direction matters. Separation and differences
may be different in different directions.
Scatterplot Example
[Figure: scatterplots of Z(x+h) versus Z(x) for h = 5, 10, 15, and 20, with separation measured in distance or time]
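The idea behind these scatterplots, that correlation between Z(x) and Z(x+h) drops as h grows, can be checked numerically. The sketch below assumes a regularly spaced 1-D series; the series is a synthetic moving average of random noise, and `lag_correlation` is a name chosen for illustration.

```python
import numpy as np

def lag_correlation(z, lag):
    """Correlation coefficient between Z(x) and Z(x+h) for an integer lag h >= 1."""
    z = np.asarray(z, dtype=float)
    head = z[:-lag]   # values Z(x)
    tail = z[lag:]    # values Z(x+h)
    return float(np.corrcoef(head, tail)[0, 1])

# Synthetic correlated series: moving average of white noise over 20 samples
rng = np.random.default_rng(1)
noise = rng.normal(size=20000)
series = np.convolve(noise, np.ones(20) / 20.0, mode="valid")

correlations = {h: lag_correlation(series, h) for h in (5, 10, 15, 20)}
```

For this synthetic series the correlation falls steadily with lag, which is the behavior the four scatterplot panels illustrate.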
Analyzing Spatial Correlation
• In geostatistics we tend to look at the opposite of correlation,
which is variability.
• At very close distances variability is low, and as the separation
distance increases, so does variability.
[Figure: variability versus separation distance, with the nugget, sill, and range labeled]
Variogram
The variogram is a measure of variability as
a function of separation distance h.
Nugget: error or variability at separations smaller than the sample distance.
Sill: variability level at which the variogram value becomes constant.
[Figure: variogram versus separation distance, with the nugget and sill labeled]
Variogram Equation
$$\gamma(h) = \frac{1}{2\,n(h)} \sum_{i=1}^{n(h)} \bigl( z_i(x) - z_i(x+h) \bigr)^2$$
This gives a value for variability at the given h, and the value is a
point on the experimental variogram. Repeat for each value of h.
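The experimental variogram calculation described above can be sketched as follows. This is a minimal omnidirectional version, assuming lag bins of the form [k·lag_size, (k+1)·lag_size) and no direction or bandwidth tolerance; the function name and the three-point data set are illustrative only.

```python
import numpy as np

def experimental_variogram(coords, values, lag_size, n_lags):
    """Omnidirectional experimental variogram.

    Returns the mean pair distance, semivariance, and pair count for each
    lag bin [k*lag_size, (k+1)*lag_size). Empty bins are reported as NaN.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)            # every pair once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = (values[i] - values[j]) ** 2
    lag_index = np.floor(dist / lag_size).astype(int)

    lags = np.full(n_lags, np.nan)
    gamma = np.full(n_lags, np.nan)
    npairs = np.zeros(n_lags, dtype=int)
    for k in range(n_lags):
        in_bin = lag_index == k
        npairs[k] = in_bin.sum()
        if npairs[k] > 0:
            lags[k] = dist[in_bin].mean()
            gamma[k] = 0.5 * sqdiff[in_bin].mean()      # 1/(2 n(h)) * sum of squared differences
    return lags, gamma, npairs

# Tiny illustrative data set: three points on a line
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
values = np.array([1.0, 2.0, 4.0])
lags, gamma, npairs = experimental_variogram(coords, values, lag_size=1.0, n_lags=3)
```

Checking `npairs` before interpreting `gamma` is worthwhile, since a point computed from only a few pairs is unreliable.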
Variogram – Covariance Relationship
C(h) = Sill - γ(h)
Covariance is the complement of the variogram: it equals the sill minus the semivariance.
[Figure: semivariance and covariance versus distance (meters)]
[Figure: gamma versus separation distance]
Variogram – Search Neighborhood
[Figure: search neighborhood]
Variogram – Search Neighborhood
Use geological knowledge of genetic processes to customize search
along a preferred orientation. Orient search along this direction, the
search direction.
[Figure: search geometry showing the search direction angle, half-angle, 1st lag, and X and Y directions]
Variogram Models
Spherical model (Range = 100.0, Sill = 1.0):
$$\gamma(h) = C\left[1.5\,\frac{h}{a} - 0.5\left(\frac{h}{a}\right)^{3}\right], \quad h < a$$
Where C = sill value
a = range
[Figure: gamma versus distance]
Variogram Models
Gaussian model (Range = 100.0, Sill = 1.0):
$$\gamma(h) = C\left[1 - e^{-(3h)^{2}/a^{2}}\right]$$
Where C = sill value
[Figure: gamma versus distance]
Variogram Models
Exponential model (Range = 100.0, Sill = 1.0):
$$\gamma(h) = C\left[1 - e^{-3h/a}\right]$$
Where C = sill value
[Figure: gamma versus distance]
Exponential and Gaussian models approach the sill asymptotically, so they have a
practical range where the model effectively reaches the sill, but a
different definition of the range is often used.
Spherical Model Parameters
Nest   Range   C      Model
3      67.0    0.40   Exponential
2      45.0    0.21   Spherical
1      8.0     0.10   Spherical
Nugget 0
[Figure: nested variogram model, gamma versus distance (feet)]
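The spherical, exponential, and Gaussian models, and a nested combination of them, can be written as short Python functions. This is a sketch that follows the formulas as they appear on the slides (with the spherical model held at the sill beyond the range); the function names are illustrative, and the nested example uses the sills and ranges from the model-parameter table above as best they can be read.

```python
import numpy as np

def spherical(h, sill, a):
    """Spherical model: rises to the sill at h = a and stays there."""
    r = np.minimum(np.asarray(h, dtype=float) / a, 1.0)
    return sill * (1.5 * r - 0.5 * r**3)

def exponential(h, sill, a):
    """Exponential model with practical range a (form used on the slide)."""
    return sill * (1.0 - np.exp(-3.0 * np.asarray(h, dtype=float) / a))

def gaussian(h, sill, a):
    """Gaussian model in the (3h)^2 / a^2 form shown on the slide."""
    return sill * (1.0 - np.exp(-(3.0 * np.asarray(h, dtype=float)) ** 2 / a**2))

def nested(h, nugget, structures):
    """Sum of a nugget (applied for h > 0) and a list of (model, sill, range)."""
    h = np.asarray(h, dtype=float)
    total = np.where(h > 0, nugget, 0.0)
    for model, sill, a in structures:
        total = total + model(h, sill, a)
    return total

# Nested structures as read from the table: (model, C, range)
structures = [(spherical, 0.10, 8.0), (spherical, 0.21, 45.0), (exponential, 0.40, 67.0)]
h = np.linspace(0.0, 200.0, 41)
gamma_model = nested(h, nugget=0.0, structures=structures)
```

Because a sum of valid variogram models is itself a valid model, the nested curve can be fit to an experimental variogram one structure at a time, starting from the shortest range.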
Anisotropy
• Variograms that show variation as a function of search direction
are anisotropic
• Anisotropy in the variable requires fine tuning of search
neighborhood
[Figure: variogram map with distance (feet) on both axes, from -20000 to 20000]
Nugget Effect Variogram
[Figure: property value versus distance along a sampling transect of random, uncorrelated data]
Resulting “nugget effect” variogram
No spatial correlation
Hole Effect Variogram
[Figure: periodic data (porosity versus depth in meters) and the resulting hole effect variogram (gamma versus distance in meters)]
Trend Effect Variogram
[Figure: daily close (dollars) versus Julian date, and the resulting trend effect variogram (gamma versus time in years)]
Spatial Correlation Summary
Outline
Software
Exercise data files
Exploratory data analysis exercise tasks
Software
SGeMS (Stanford Geostatistical Modeling Software)
Windows-based software with graphical user interface (GUI)
Open source software from Stanford University
Based on the original GSLIB suite of DOS-based software
Available for download from website
http://sgems.sourceforge.net/
Two GSLIB software codes – NSCORE and LOCMAP
DOS-based codes run using parameter input files
Available for download from website
http://scrf.stanford.edu/resources.software.gslib.php
GSview software
Used for viewing PostScript output files generated by GSLIB codes
Available for download from website
http://pages.cs.wisc.edu/~ghost/
Exercise Data Files
Four data sets are provided for use in the geostatistics
exercises
2-D data set 1
2-D data set 2
2-D data set 2 - exhaustive
3-D data set 3
Data sets were randomly extracted from a hypothetical
synthetic geological system
Data are given for rock porosity and permeability
Data sets differ in spatial correlation structure in ways
that the students will explore and discover
Data File Format
SGeMS can read data from the GSLIB format. Exercise
data files are provided in this format.
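For readers who want to inspect the exercise files outside SGeMS, the GSLIB (Geo-EAS style) layout is simple: a title line, a line giving the number of variables, one line per variable name, then whitespace-separated numeric rows. The reader below is a minimal sketch; `read_gslib`, the file contents, and the column names are hypothetical and are not taken from the workshop data files.

```python
import io
import numpy as np

def read_gslib(stream):
    """Read a GSLIB-format table from an open text stream."""
    title = stream.readline().strip()
    n_vars = int(stream.readline().split()[0])          # number of columns
    names = [stream.readline().strip() for _ in range(n_vars)]
    data = np.loadtxt(stream, ndmin=2)                  # remaining lines are numeric rows
    return title, names, data

# Hypothetical file contents, for illustration only
example = io.StringIO(
    "Hypothetical porosity sample\n"
    "4\n"
    "X\n"
    "Y\n"
    "Z\n"
    "porosity\n"
    "100.0 200.0 0.0 0.15\n"
    "350.0 420.0 0.0 0.18\n"
)
title, names, data = read_gslib(example)
```

For a file on disk, the same function can be called as `read_gslib(open(path))`, where `path` is whatever file name the exercise uses.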
SGeMS Software Introduction
The main screen of SGeMS contains the algorithm window, objects window,
and the visualization window
SGeMS Software Introduction
The first task is to open a data file by selecting “Load Objects” from the
“Objects” pulldown menu
Navigate to the data file, open it, select object type “point set”, go to next
screen, enter a Pointset name in the dialog box, and confirm that x, y, and z
columns are correctly identified
SGeMS Software Introduction
Now that the data set is loaded, various actions can be taken, including data
visualization (shown below), data analysis tasks, and algorithms.
Practice manipulating the visualization, adding a colorbar, setting the color scale, and
exporting an image of the plot (use the camera icon)
Note that SGeMS does not manage screen real estate very well and windows may
need to be resized to show what you want to see
GSLIB Software - LOCMAP
Use the LOCMAP DOS program to generate
a 2D plot of the sample data
The LOCMAP program is executed with a
parameter control file named locmap.par
Parameters for LOCMAP
*********************
START OF PARAMETERS:
data_set_1_2D.dat \file with data
1 2 4 \ columns for X, Y, variable
-1.0e21 1.0e21 \ trimming limits
locmap.ps \file for PostScript output
0.0 1000. \xmn,xmx
0.0 1000. \ymn,ymx
0 \0=data values, 1=cross validation
0 \0=arithmetic, 1=log scaling
1 \0=gray scale, 1=color scale
0 \0=no labels, 1=label each location
0.12 0.20 0.005 \gray/color scale: min, max, increm
0.5 \label size: 0.1(sml)-1(reg)-10(big)
Porosity Data Map - 2D Data \Title
Parameters for NSCORE
*********************
START OF PARAMETERS:
data_set_3_3D.dat \file with data
4 0 \ columns for variable and weight
-1.0e21 1.0e21 \ trimming limits
0 \1=transform according to specified ref. dist.
unknown.out \ file with reference dist.
1 2 \ columns for variable and weight
nscore.out \file for output
nscore.trn \file for output transformation table
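The normal score transform performed by NSCORE can be approximated in Python to see what it does. This sketch is not the GSLIB implementation: it assumes equal weights, breaks ties by input order, and uses the plotting position (rank + 0.5) / n, all of which are choices made here for illustration.

```python
import numpy as np
from statistics import NormalDist

def normal_score_transform(values):
    """Map data to standard normal scores by rank (equal weights assumed)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)                 # rank 0 = smallest value
    p = (ranks + 0.5) / n                       # cumulative probability inside (0, 1)
    standard_normal = NormalDist()
    return np.array([standard_normal.inv_cdf(pi) for pi in p])

# Made-up porosity values, for illustration only
porosity = np.array([0.12, 0.20, 0.15, 0.18, 0.14])
scores = normal_score_transform(porosity)
```

The transform preserves the ordering of the data while forcing the histogram of the scores toward a standard normal shape, which is what Gaussian simulation methods require.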
SGeMS – Variogram Analysis
Load the data sets that were used in the exploratory data analysis exercise.
Start the variogram analysis by choosing the “Variogram” option under the “Data
Analysis” pulldown menu.
Choose the data set to analyze from the “Grid Name” pulldown menu and choose the
parameter of interest from both the “Head Property” and “Tail Property” menus.
SGeMS – Variogram Analysis
Enter the number of lags, lag separation, lag tolerance, azimuth, dip, directional
tolerance, and bandwidth to control the construction of the experimental variogram.
Note that multiple variogram plots can be generated simultaneously by increasing
“Number of directions”.
Search direction is defined by azimuth and dip, as explained in the figures to the right.
SGeMS – Variogram Analysis
The experimental variogram is displayed in the next window.
The number of data pairs used to calculate each plotted value is displayed
by right-clicking the mouse.
SGeMS – Variogram Analysis
Fit the experimental variogram with a model created with the parameters entered in
the panel to the right of the plot.
Try different variogram types and increase the number of structures to create
variograms using linear combinations of models.
SGeMS – Variogram Analysis
Save or write down the parameters for the variogram model that you have fit to the
experimental variogram for later use in the estimation and simulation exercises.
Load the other data sets (2-D data set 2, 2-D data set 2–exhaustive, and 3-D data set
3) and use directional variograms to examine anisotropy in the horizontal and
vertical directions.
SGeMS – Variogram Analysis
Fit a model to the experimental variograms created in different directions to evaluate
the anisotropy in the spatial correlation.
Save or write down the parameters for the variogram models that you have fit to the
directional variograms for later use in the estimation and simulation exercises.
Outline
Point Estimation
Estimation Techniques
Example data-driven point estimation techniques (require
some data to exist already) that interpolate to the
surrounding locations:
• Nearest Neighbor Polygons
(aka Thiessen or Voronoi polygons)
• Local Mean (using surrounding data)
• Inverse Distance Squared
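The three techniques listed above can be compared side by side with a short sketch. The sample coordinates, the unknown location, and the search radius are invented for illustration; the six porosity values are borrowed from the workshop's estimation example. Function names are hypothetical.

```python
import numpy as np

def nearest_neighbor(coords, values, x0):
    """Value of the closest sample (equivalent to the polygon containing x0)."""
    d = np.linalg.norm(coords - x0, axis=1)
    return float(values[np.argmin(d)])

def local_mean(coords, values, x0, radius):
    """Unweighted mean of all samples within the search radius."""
    d = np.linalg.norm(coords - x0, axis=1)
    inside = d <= radius
    if not inside.any():
        raise ValueError("no samples inside the search radius")
    return float(values[inside].mean())

def inverse_distance_squared(coords, values, x0):
    """Weighted mean with weights 1/d^2; returns the sample value at d = 0."""
    d = np.linalg.norm(coords - x0, axis=1)
    if np.any(d == 0.0):
        return float(values[np.argmin(d)])
    w = 1.0 / d**2
    return float(np.sum(w * values) / np.sum(w))

# Invented coordinates paired with the six porosity values from the example
coords = np.array([[60.0, 450.0], [250.0, 390.0], [420.0, 330.0],
                   [120.0, 210.0], [330.0, 130.0], [200.0, 30.0]])
values = np.array([0.200, 0.215, 0.203, 0.261, 0.174, 0.241])
x0 = np.array([180.0, 140.0])

estimates = {
    "nearest neighbor": nearest_neighbor(coords, values, x0),
    "local mean": local_mean(coords, values, x0, radius=200.0),
    "inverse distance squared": inverse_distance_squared(coords, values, x0),
}
```

The three estimates generally differ, which is the point of the comparison: each method encodes a different assumption about how influence decays with distance.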
Estimation Example
•Porosity measured at 6 points.
•Make estimate of porosity at unknown point, x0.
[Figure: map (X and Y from 0 to 500) of six porosity samples (0.200, 0.215, 0.203, 0.261, 0.174, 0.241) and the unknown point]
Nearest Neighbor Polygons
• Construct polygons around the samples that divide the space
into regions
• Everywhere inside of the polygon is closer to the sample point
enclosed by that polygon than to any other sample point
Advantages:
•Simple, fast, exact interpolator (at a point where the value is
known, it returns that exact value)
Disadvantages:
•Discontinuities at polygon boundaries
•If data are sparse and somewhat unevenly spaced, the global
estimation is dominated by the sparsely located points
Nearest Neighbor Polygons
- Connect each sample point to the neighboring sample points to create a series of triangles
- Draw a perpendicular bisector through each line.
[Figure: two maps (X and Y from 0 to 500) of the six porosity samples and the unknown point, showing the triangles and the perpendicular bisectors]
[Figure: porosity maps, including synthetic “reality” porosity]
Local Mean Estimation
Advantages:
• Simple, fast, few large errors (near the edges of the domain)
Disadvantages:
• Not an exact interpolator (the average of the few surrounding data
points won’t necessarily return the exact value for a known point)
• Definition of "surrounding data" may be difficult
• It has a smoothing effect on the data values. Any extreme values,
high or low, will get smoothed out as they are averaged in with the
surrounding values
[Figure: porosity maps, including synthetic “reality” porosity]
Local Mean Estimation
• Should the more distant points be included in the average? If so, should they be given less weight?
[Figure: map of porosity samples near the unknown point, with the close point labeled]
[Figure: porosity maps, including synthetic “reality” porosity]
Evaluating Estimation Methods
small spread.
[Figure: scatterplots of estimate versus true value]
Evaluating Estimation Methods
Heteroscedastic: The variance changes as a function of the value. So as the values increase, the quality of the fit about the 45 degree line deteriorates.
Conditional bias: A small subsection appears to be optimal, but the low values tend to be overestimated and high values tend to be underestimated.
[Figure: scatterplots of estimate versus true value showing the heteroscedastic and conditional bias cases]
Precision and Accuracy
[Figure: precise versus imprecise estimates, with the accurate case labeled]
• Realize that the mean is stationary, so both the estimate and the
actual value have the same mean.
• If the average error is set to zero, then:
$$\sum_{i=1}^{n} \lambda_i = 1.0$$
Kriging – Lagrange Parameter
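$$\sum_{i=1}^{n} \lambda_i = 1 \quad\Rightarrow\quad \sum_{i=1}^{n} \lambda_i - 1 = 0 \quad\Rightarrow\quad 2\mu\left(\sum_{i=1}^{n} \lambda_i - 1\right) = 0$$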
Kriging – Summary
Covariances in D act like inverse distance weights (two points close
together have a high covariance, as the distance increases the
covariance approaches zero).
•However, the weight as a function of distance is not limited to simple
powers, it can fit with more complex variogram models.
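A small ordinary kriging solve makes the role of the covariances and the Lagrange parameter concrete. This is a sketch under stated assumptions: a spherical covariance with an invented sill and range, invented sample coordinates paired with the six porosity values from the example, and the sign convention in which the Lagrange parameter μ is added to each covariance equation. It is not the SGeMS or GSLIB implementation.

```python
import numpy as np

def spherical_cov(h, sill, a):
    """Covariance C(h) = sill - gamma(h) for a spherical variogram."""
    r = np.minimum(np.asarray(h, dtype=float) / a, 1.0)
    return sill * (1.0 - (1.5 * r - 0.5 * r**3))

def ordinary_kriging(coords, values, x0, sill, a):
    """Solve the ordinary kriging system for one location x0."""
    n = len(values)
    h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    K = np.ones((n + 1, n + 1))
    K[:n, :n] = spherical_cov(h, sill, a)       # sample-to-sample covariances
    K[n, n] = 0.0                               # unbiasedness row and column
    D = np.ones(n + 1)
    D[:n] = spherical_cov(np.linalg.norm(coords - x0, axis=1), sill, a)
    solution = np.linalg.solve(K, D)
    weights, mu = solution[:n], solution[n]
    estimate = float(weights @ values)
    variance = float(sill - (weights @ D[:n] + mu))
    return estimate, variance, weights

# Invented coordinates; porosity values from the six-point example
coords = np.array([[60.0, 450.0], [250.0, 390.0], [420.0, 330.0],
                   [120.0, 210.0], [330.0, 130.0], [200.0, 30.0]])
values = np.array([0.200, 0.215, 0.203, 0.261, 0.174, 0.241])
x0 = np.array([180.0, 140.0])

estimate, variance, weights = ordinary_kriging(coords, values, x0, sill=1.0, a=300.0)
```

The weights sum to one by construction, and estimating at a sample location returns that sample's value with zero kriging variance, so kriging is an exact interpolator. Note that the kriging variance depends only on the data configuration and the covariance model, not on the data values.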
•Porosity measured at 6 points.
•Make estimate of porosity at unknown point, x0.
[Figure: map (X and Y from 0 to 500) of six porosity samples (0.200, 0.215, 0.203, 0.261, 0.174, 0.241) and the unknown point]
Kriging – Results
$$\sigma_e^2 = \frac{1}{n} \sum_{i=1}^{n} \left(e_i - \bar{e}\right)^2$$
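$$\hat{\sigma}_e^2 = \sigma_{Data}^2 - \sum_{i=1}^{n} \lambda_i C_{i0} - \mu$$
in matrix form:
$$\hat{\sigma}_e^2 = \sigma_{Data}^2 - \boldsymbol{\lambda}^{T} \mathbf{D}$$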
[Figure: two panels of concentration (ppm) versus distance (m)]
Estimation versus Simulation
Example from workshop exercise data set
Ordinary Kriging (Estimation) Sequential Gaussian Simulation
Simulation Example
Synthetic “Reality”
Transport Example
Uncertainty in spatial distribution of hydraulic properties leads to
uncertainty in transport results
[Figure: realizations are passed through a transfer function (ground water flow and transport model) to produce a frequency distribution of concentration]
General Types of Simulation
[Figure: two maps with Easting (feet) and Northing (feet) axes]
Indicator Simulation Example
Kriging versus Simulation
Sample data and kriging estimate
Two example realizations from simulation
Kriging versus Simulation
[Figure: variograms versus distance comparing the model with the kriged field (Krig EW, Krig NS) and with two simulations (Sim1 EW, Sim1 NS, Sim2 EW, Sim2 NS)]
Kriging versus Simulation
Histogram of raw data
Histogram of kriged data
Histogram of Realization 1 (simulation)
Histogram of Realization 2 (simulation)
Kriging versus Simulation
data
above and below the maximum
defined cdf [0,1] at each
[Figure: two panels of concentration (ppm)]
Nugget = 0.00
Nugget = 0.40
Simulation – Variogram Models