Spatial Sampling with R

Dick J. Brus
Second edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk.
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
DOI: 10.1201/9781003264873
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Contents

Preface

1 Introduction
1.1 Basic sampling concepts
1.1.1 Population parameters
1.1.2 Descriptive statistics vs. inference about a population
1.1.3 Random sampling vs. probability sampling
1.2 Design-based vs. model-based approach
1.3 Populations used in sampling experiments
1.3.1 Soil organic matter in Voorst, the Netherlands
1.3.2 Poppy fields in Kandahar, Afghanistan
1.3.3 Aboveground biomass in Eastern Amazonia, Brazil
1.3.4 Annual mean air temperature in Iberia

Bibliography

Index
Preface
Since the start of The R Series of Chapman & Hall/CRC in 2011, numerous
books have been published on the statistical analysis and modelling of data
using R. To date, no book has been published in this series on how these data
can best be collected. From my point of view this was an omission, as scientific research often starts with data collection. If data collection is part of the project, it is wise to start thinking about it right at the start of the project, rather than after the data have been collected, so that a well-founded decision can be made on how many data are needed and on the type of sampling design.
My experience as a statistical consultant is that many researchers pay insufficient attention to the method of data collection. Too many researchers only start thinking about it once the data are there. Often, I had to conclude that the way the data were collected by fellow researchers was suboptimal, or even unsuitable
for their aim. I hope that this new book may help researchers, practitioners,
and students to implement proper sampling designs, tailored to their problems
at hand, so that valuable data are collected that can be used to answer the
research questions.
Over the past decades, numerous wall-to-wall data sets have been collected
by remote sensing devices such as satellites and drones. These remote sensing
images are valuable sources of information on the natural environment and
resources. The question arises of how useful it still is, in this era of big data, to collect data in the field at a restricted number of sampling locations. Do we really need these data to estimate a population mean or total, for instance of the aboveground biomass or of carbon stocks in the soil, or to map these study variables? In many cases the answer is that it is indeed still useful
to collect sample data on the study variable, because the remote sensing images
provide only proxies of the study variable. The variables derived from the
remote sensing images can be related to the study variable, but we still need
ground truth data of the study variable to model this relation. By combining the wall-to-wall data of covariates and the sample data of the ground truth, we
can increase the accuracy of the survey result compared to using only one of
these data sources.
The handbook Sampling for Natural Resource Monitoring (SNRM) (de Gruijter
et al., 2006) presents an overview of sampling strategies for the survey of natural
resources at a given point in time, as well as for how these resources can be
monitored through repeated surveys. The book presented here can be seen as a successor to SNRM.
Acknowledgments
In 2006 our handbook Sampling for Natural Resource Monitoring (SNRM) was
published (de Gruijter et al., 2006). Soon after this milestone, Jaap de Gruijter
retired from Wageningen University and Research (WUR). I am now in the
final stage of my career at WUR. For a couple of years I had been thinking of
a revision of our handbook, to repair errors and to include new developments
in sampling design. Then I realised that to increase the impact of SNRM, it
might be a better idea to write a new book, showing by means of computer
code how the sampling designs can be implemented, and how the sample data
can be used in statistical inference.
A nice result of the publication of SNRM was that I was asked to give sampling
courses at many places in the world: China, Ethiopia, Uzbekistan, Australia,
and various countries in the European Union. I have very pleasant memories of
these courses, and they made my life as a scientist very joyful. For these courses
I wrote numerous scripts with computer code, using the popular programming
language R (R Core Team, 2021). My naïve idea was that all I had to do was
bundle these R scripts into an Rmarkdown document (Xie et al., 2020), and
add some text explaining the theory and the R code. As usual, it proved to
be much more work than expected, but I am very happy that I was able to
finish the job just before my retirement.
I could not have written this book without the help of many fellow researchers.
First, I am very grateful for the support I received from the authors of various
packages used in this book: Thomas Lumley for his support with package
survey, Yves Tillé and Alina Gabriela Matei with package sampling, Anton
Grafström with package BalancedSampling, Giulio Barcaroli and Marco
Ballin with package SamplingStrata, Andreas Hill and Alex Massey with
1 https://github.com/DickBrus/SpatialSamplingwithR/tree/master
1 Introduction
study area, often leading to more precise estimates of the population mean or
total as compared to sampling designs resulting in spatial clusters of units.
Two types of populations can be distinguished: discrete and continuous pop-
ulations. Discrete populations consist of discrete natural objects, think of
trees, agricultural fields, lakes, etc. These objects are referred to as popula-
tion units. The total number of population units in a discrete population
is finite. A finite spatial population of discrete units can be denoted by $\mathcal{U} = \{u(\mathbf{s}_1), u(\mathbf{s}_2), \ldots, u(\mathbf{s}_N)\}$, with $u(\mathbf{s}_k)$ the unit located at $\mathbf{s}_k$, where $\mathbf{s}$ is a vector with spatial coordinates. The population units naturally serve as the elementary sampling units. In this book the spatial populations are two-dimensional, so a vector $\mathbf{s}$ has two coordinates, Easting and Northing.
Other populations may, for the purpose of sampling, be considered as a physical
continuum, e.g., the soil in a region, the water in a lake, the crop on a field.
If interest lies in crop properties per areal unit of the field, the population is
continuous. However, if interest lies in properties per plant, the population
is discrete and finite.
FIGURE 1.1: Three sample supports: points, squares, and circles. With
disjoint squares, the population is finite. With points, and squares or circles
that are allowed to overlap, the population is infinite.
element of the population”. So, in the case just mentioned the response design
is the sampling design and the estimator for the mean of the soil property of
the 25 m squares.
Ideally, the sample support is constant, but in some situations a varying sample
support cannot be avoided. Think, for instance, of square sampling units in an
irregularly shaped study area. Near the border of the study area, there are
squares that cross the border. The part of a square that falls outside the study
area is not observed. So, the support of the observations of squares crossing
the border is smaller than that of the observations of squares in the interior of
the study area. See also Section 3.4.
To sample a finite spatial population, the population units are listed in a data
frame. This data frame contains the spatial coordinates of the population
units and other information needed for selecting sampling units according to a
specific design. Think, for instance, of the labels of more or less homogeneous
subpopulations (used as strata in stratified random sampling, see Chapter
4) and the labels of clusters of population units, for instance, all units in a
polygon of a map (used in cluster random sampling, see Chapter 6). Besides,
if we have information about covariates possibly related to the study variable,
which we would like to use in selecting the population units, these covariates
are added to the list. The list used for selecting sampling units is referred to
as the sampling frame.
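A sketch of such a sampling frame, assuming the Voorst data of package sswr (introduced in Section 1.3); columns s1 and s2 are the spatial coordinates, z is the study variable, and stratum is a subpopulation label:

library(sswr)
# each row of the sampling frame is one population unit
head(grdVoorst)
# total number of population units
nrow(grdVoorst)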
If the elementary sampling units are disjoint square grid cells (sample support
is a square), the population is finite and the grid cells can be selected through
selection of their centres (or any other point that uniquely identifies a grid
cell) listed in the sampling frame.
In this book, continuous populations are also sampled using a list as a sampling
frame. The infinite population is discretised by the cells of a fine discretisation
grid. The grid cells are listed in the sampling frame by the spatial coordinates of
the centres of the grid cells. So, the infinite population is represented by a finite
list of points. The advantage of this is that existing R packages for sampling
of finite populations can also be used for sampling infinite populations.
If the elementary sampling units are points (sample support is a point), the
population is infinite. In this case, sampling of points can be implemented
by a two-step approach. In the first step, cells of the discretisation grid are
selected with or without replacement, and in the second step one or more
points are selected within the selected grid cells. Figure 1.2 is an illustration of
this two-step approach for simple random sampling of points from a discretised
infinite population. Ten grid cells are selected by simple random sampling
with replacement. Every time a grid cell is selected, one point is randomly
selected from that grid cell. Note that a grid cell can be selected more than
once, so that more than one point will be selected from that grid cell. Note
also that we may select a point that falls outside the boundary of the study
area. This is actually the case with one grid cell in Figure 1.2. The points
outside the study area are discarded and replaced by a randomly selected new
point inside the study area. Finally, note that near the boundary there are
small areas not covered by a grid cell, so that no points can be selected in
these areas. It is important that the discretisation grid is fine enough to keep
the discretisation error so small that it can be ignored. The alternative is to
extend the discretisation grid beyond the boundaries of the study area so that
the full study area is covered by grid cells.
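A minimal sketch of this two-step selection in R, assuming the discretisation grid grdVoorst of package sswr, with square grid cells of 25 m × 25 m:

library(sswr)
set.seed(123)
n <- 10
# step 1: select n cells of the discretisation grid by simple random
# sampling with replacement
units <- sample(nrow(grdVoorst), size = n, replace = TRUE)
mysample <- grdVoorst[units, ]
# step 2: select one point fully randomly within each selected grid cell,
# by shifting the cell centre by at most half the side length of a cell
cellsize <- 25
mysample$s1 <- jitter(mysample$s1, amount = cellsize / 2)
mysample$s2 <- jitter(mysample$s2, amount = cellsize / 2)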
The sample data are used to estimate characteristics of the whole population,
e.g., the population mean or total; some quantile, e.g., the median or the 90th
percentile; or even the entire cumulative frequency distribution.
A finite population total is defined as

$$t(z) = \sum_{k \in \mathcal{U}} z_k = \sum_{k=1}^{N} z_k \;, \tag{1.1}$$

with $N$ the number of population units and $z_k$ the value of the study variable for population unit $k$. A finite population mean is defined as a finite population total divided by $N$.
An infinite population total is defined as an integral of the study variable over the study area:

$$t(z) = \int_{\mathbf{s} \in \mathcal{A}} z(\mathbf{s}) \, \mathrm{d}\mathbf{s} \;. \tag{1.2}$$

A population proportion is defined as the population mean of a 0/1 indicator $y$ with value 1 if a given condition is satisfied, and 0 otherwise:

$$p = \frac{\sum_{k=1}^{N} y_k}{N} \;. \tag{1.3}$$
A cumulative distribution function (CDF) is defined as

$$F(z) = \sum_{z' \leq z} p(z') \;, \tag{1.4}$$

with $p(z')$ the proportion of population units whose value for the study variable equals $z'$. A population quantile is defined as

$$q_p = F^{-1}(p) \;, \tag{1.5}$$

where $p$ is a number between 0 and 1 (e.g., 0.5 for the median, 0.9 for the 90th percentile), and $F^{-1}(p)$ is the smallest value of the study variable $z$ satisfying $F(z) \geq p$.
In surveys of spatial populations, the aim can also be to make a map of the
population.
When we observe only a (small) part of the population, we are uncertain about
the population parameter estimates and the map of the population. By using
statistical methods, we can quantify how uncertain we are about these results.
In decision making it can be important to take this uncertainty into account.
An example is a survey of water quality. In Europe the concentration levels of
nutrients are regulated in the European Water Framework Directive. To test
whether the mean concentration of a nutrient complies with its standard, it is
important to account for the uncertainty in the estimated mean. When the
estimated mean is just below the standard, there is still a large probability
that the population mean exceeds the standard. This example shows that it is
important to distinguish computing descriptive statistics from characterising
the population using the sample data. For instance, we can compute the sample
mean (average of the sample data) without error, but if we use this sample
mean as an estimate of the population mean, there is certainly an error in this
estimate.
Many sampling methods are available. At the highest level, one may distinguish
random from non-random sampling methods. In random sampling, a subset of
population units is randomly selected from the population, using a (pseudo)
random number generator. In non-random sampling, no such random number generator is used.

1.2 Design-based vs. model-based approach
When the aim is to map the study variable, a model-based approach is the
most natural option. This implies that for this aim probability sampling is not
necessarily required. In principle, both approaches are suitable for estimating
(sub)population parameters. The more subpopulations are distinguished, the
more attractive a model-based approach becomes. Model-based estimates of
the subpopulation means or totals are potentially more accurate (depending on
how good the model is) than model-free design-based estimates. On the other
hand, an advantage of design-based estimation is that an objective assessment
of the uncertainty of the estimated mean or total is warranted, and that the
coverage of confidence intervals is (almost) correct.
A probability sample can also be used in model-based inference. This flexibility
can be attractive when we have a dual aim, mapping as well as estimation of
parameters of (sub)populations. When units are not selected by probability
sampling, model-free design-based estimation is impossible, and model-based
prediction is the only option.
1.3 Populations used in sampling experiments

For two of the study areas, Voorst and Kandahar, the exhaustive data sets are obtained through simulation, i.e., by drawing numbers from a probability distribution. Sample data from these two study areas are used
to calibrate a statistical model. This model is subsequently used to simulate
values of the study variable for all population units. Voorst actually is an
infinite population of points. However, this study area is discretised by the
cells of a fine grid, and the study variable, the soil organic matter (SOM)
concentration, is simulated for all centres of the grid cells. Kandahar is a finite
population consisting of 965 squares of size 5 km × 5 km. The study variable
is the area cultivated with poppy. Eastern Amazonia is a map in raster format,
with a resolution of 1 km × 1 km. The study variable is the aboveground
biomass as derived from remote sensing images. The aboveground biomass
value of a raster cell is treated as the average biomass of that raster cell. The
data set Iberian Peninsula is a time series of four maps in raster format with a
resolution of 30 arc sec. The study variable is the annual mean air temperature
at two metres above the earth's surface in °C.
The exhaustive data sets are used in the first part of this book on probability
sampling for estimating population parameters. By taking the population as
the reality, we know the population parameters. Also, for any randomly selected
sample from this population, the study variable values for the selected sampling
units are known, so that we can estimate the population parameters from
this sample. An estimated population parameter can then be compared with
the population parameter. The difference between these two is the sampling
error in the estimated population parameter. This opens up the possibility
of repeating the random selection of samples with a given sampling design a
large number of times, estimating the population parameter for every sample,
so that a frequency distribution of the estimated population parameter is
obtained. Ideally, the mean of this frequency distribution, referred to as the
sampling distribution, is equal to the population parameter (mean sampling
error equals zero), and the variance of the estimated population parameters is
small. Another advantage is that sampling designs can be compared on the
basis of the sampling distribution, for instance the sampling distributions of
the estimator of the population mean with stratified random sampling and
simple random sampling, to evaluate whether the stratification leads to more
accurate estimates of the population mean.
Furthermore, various data sets are used with data for a sample of population
units only. These data sets are described at places where they are first used.
All data sets are available by installing the R package sswr. This package
can be installed from GitHub with function install_github of package remotes
(Csárdi et al., 2021).
library(remotes)
install_github("DickBrus/sswr")
The package can then be loaded. You can see the contents of the package
and of the data files by typing a question mark, followed by the name of the
package or a data file.
library(sswr)
?sswr
?grdVoorst
1.3.1 Soil organic matter in Voorst, the Netherlands

The study area of Voorst is located in the eastern part of the Netherlands. The
size of the study area is 6 km × 1 km. At 132 points, samples of the topsoil
were collected by graduate students of Wageningen University, which were then
analysed for SOM concentrations (in g kg-1 dry soil) in a laboratory. The map
is created by conditional geostatistical simulation of natural logs of the SOM
concentration on a 25 m × 25 m grid, followed by backtransformation, using
a linear mixed model with spatially correlated residuals and combinations of
soil type and land use as a qualitative predictor (factor). Figure 1.3 shows the
simulated map of the SOM concentration.
The frequency distribution of the simulated values at all 7,528 grid cells shows
that the SOM concentration is skewed to the right (Figure 1.4).
Summary statistics are:
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.75 49.45 68.15 81.13 100.63 394.45
The ancillary information consists of a map of soil classes and a land use
map, which are combined to five soil-land use combinations (Figure 1.5). The
first letter in the labels for the combinations stands for the soil type: B for
beekeerdgrond (sandy wetland soil with gleyic properties), E for enkeerdgrond
(sandy soil with thick anthropogenic humic topsoil), P for podzols (sandy soil
with eluviated horizon below the topsoil), R for river clay soil, and X for other
sandy soils. The second letter is for land use: A for agriculture (grassland,
arable land) and F for forest.
1.3.2 Poppy fields in Kandahar, Afghanistan

The study variable is the poppy area in hectares per square. The frequency distribution of the simulated poppy area per square shows very strong positive skew (Figure 1.7). For 375 squares, the simulated poppy area was smaller than 1 hectare (ha).
1.3.3 Aboveground biomass in Eastern Amazonia, Brazil

This data set consists of data on the live woody aboveground biomass (AGB)
in megatons per ha (Baccini et al., 2012). A rectangular area of 1,642 km × 928
km in Eastern Amazonia, Brazil, was selected from this data set. The data were
aggregated to a map with a resolution of 1 km × 1 km. Besides, a stack of five
ecologically relevant covariates of the same spatial extent was prepared, being
long term mean of MODIS short-wave infrared radiation (SWIR2), primary
production in kg C per m2 (Terra_PP), average precipitation in the driest month, and two further covariates.

1.3.4 Annual mean air temperature in Iberia

The space-time designs of Chapter 15 are illustrated with the annual mean
air temperature at two metres above the earth's surface (TAS) in °C, in Iberia (Spain and Portugal, islands excluded) for 2004, 2009, 2014, and 2019 (Figure 1.10). These data are part of the data set CHELSA (Karger et al., 2017). The raster files are latitude-longitude grids with a resolution of 30 arc sec. The data are projected using the Lambert azimuthal equal area (laea) projection. The resolution of the resulting laea raster file is about 780 m × 780 m.

1 https://chelsa-climate.org/wp-admin/download-page/CHELSA_tech_specification_V2.pdf
FIGURE 1.10: Annual mean air temperature in Iberia for 2004, 2009, 2014,
and 2019.
Part I

Probability sampling for estimating population parameters
2 Introduction to probability sampling

The inclusion probability of a population unit $k$ equals

$$\pi_k = \sum_{\mathcal{S} \ni k} p(\mathcal{S}) \;, \tag{2.1}$$

where $\mathcal{S} \ni k$ indicates that the sum is over all samples that contain unit $k$, and $p(\mathcal{S})$ is the selection probability of sample $\mathcal{S}$. $p(\cdot)$ is called the sampling design. It is a function that assigns a probability to every possible sample (subset of population units) that can be selected with a given sample selection scheme (sampling algorithm). For instance, consider the following sample selection scheme from a finite population of $N$ units:

1. Select with equal probability $1/N$ a first unit from the population.
2. Select with equal probability a second unit from the remaining $N - 1$ units.
3. Repeat this until an $n$th unit is selected with equal probability from the $N - (n - 1)$ units.

With this scheme, every sample of $n$ distinct units has the same selection probability, and all other subsets of population units have zero probability. There are $\binom{N-1}{n-1}$ samples of size $n$ in which unit $k$ is included. The inclusion probability of each unit $k$ therefore is $\binom{N-1}{n-1} / \binom{N}{n} = \frac{n}{N}$. The sampling design plays a key role in the design-based approach, as it determines the sampling distribution of statistics computed from a sample, such as the estimator of the population mean, see Section 2.1. The number of selected population units is referred to as the sample size.
In sampling with replacement, each unit can be selected more than once. In
this case, the sample size refers to the number of draws, not to the number
of unique population units in the sample.
The first five sampling designs are basic sampling designs. Implementation of
these designs is rather straightforward, as well as the associated estimation
of the population mean, total, or proportion, and their sampling variance.
The final three sampling designs are more advanced. Appropriate use of these
designs requires more knowledge of sampling theory and statistics, such as
linear regression.
all units. It is highly questionable whether this also holds for arbitrary and
haphazard sampling. In arbitrary and haphazard sampling, the sampling units
are not selected by a probability mechanism. So, the selection probabilities of
the sampling units and of combinations of sampling units are unknown. Design-
based estimation is therefore impossible, because it requires the inclusion
probabilities of the population units as determined by the sampling design.
The only option for statistical analysis using arbitrarily or haphazardly selected
samples is model-based inference, i.e., a model of the spatial variation must be
assumed.
Exercises
The population total can be estimated by

$$\hat{t}(z) = \sum_{k \in \mathcal{S}} w_k z_k \;, \tag{2.2}$$

with $\mathcal{S}$ the sample, $z_k$ the observed study variable for unit $k$, and $w_k$ the design weight attached to unit $k$:

$$w_k = \frac{1}{\pi_k} \;, \tag{2.3}$$

with $\pi_k$ the inclusion probability of unit $k$. The estimator of Equation (2.2) is referred to as the Horvitz-Thompson estimator or $\pi$ estimator. The $z_k/\pi_k$-values are referred to as the $\pi$-expanded values. The $z$-value of unit $k$ in the sample is multiplied by the reciprocal of the inclusion probability of that unit, and the sample sum of these $\pi$-expanded values is used as an estimator of the population total. The inclusion probabilities are determined by the type of sampling design and the sample size.
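In R the $\pi$ estimator is a one-liner; a minimal sketch, assuming a data frame mysample with the observations in column z and the inclusion probabilities in column pi:

# pi estimator of the population total: sample sum of the pi-expanded values
tz <- sum(mysample$z / mysample$pi)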
For infinite populations discretised by a finite set of points, the same estimator
can be used.
For infinite populations, the population total can be estimated by multiplying the estimated population mean by the area of the population $A$:

$$\hat{t}(z) = A \, \hat{\bar{z}} \;. \tag{2.5}$$
The 𝜋 estimator can be worked out for the different types of sampling design
listed above by inserting the inclusion probabilities as determined by the
sampling design. For simple random sampling, this leads to the unweighted
sample mean (see Chapter 3), and for stratified simple random sampling the 𝜋
estimator is equal to the weighted sum of the sample means per stratum, with
weights equal to the relative size of the strata (see Chapter 4).
For sampling with replacement, the population total can be estimated by the pwr estimator:

$$\hat{t}_{\text{pwr}}(z) = \frac{1}{n} \sum_{k \in \mathcal{S}} \frac{z_k}{p_k} \;, \tag{2.6}$$
with $p_k$ the draw-by-draw selection probability of population unit $k$. For instance, in simple random sampling with replacement the draw-by-draw selection probability $p_k$ of each unit is $1/N$. If we select only one unit $k$, the population total can be estimated by the observation of that unit divided by $p_k$: $\hat{t}(z) = z_k/p_k = N z_k$. If we repeat this $n$ times, this results in $n$ estimated population totals. The pwr estimator is the average of these $n$ elementary estimates. If a unit occurs multiple times in the sample $\mathcal{S}$, this unit provides multiple elementary estimates of the population total.
In cluster random sampling with replacement, the population total can be estimated by

$$\hat{t}_{\text{pwr}}(z) = \frac{1}{n} \sum_{j \in \mathcal{S}} \frac{t_j(z)}{p_j} \;, \tag{2.7}$$

with $t_j(z)$ the total of the cluster selected in the $j$th draw. If not all population units of a selected cluster are observed, but only a sample of population units from a cluster, as in two-stage cluster random sampling, the cluster totals $t_j(z)$ are replaced by the estimated cluster totals $\hat{t}_j(z)$.
Exercises
3 Simple random sampling

Simple random sampling is the most basic form of probability sampling. There are two subtypes:

1. simple random sampling with replacement, in which a population unit may be selected more than once; and
2. simple random sampling without replacement, in which a population unit can be selected at most once.
[1] 21 7 5 16 58 76 44 100 71 84
1 A tibble is a data frame of class tbl_df of package tibble (Müller and Wickham, 2021).
Hereafter, I will use the terms tibble and data frame interchangeably. A traditional data
frame is referred to as a data.frame.
n <- 40
N <- nrow(grdVoorst)
set.seed(314)
units <- sample(N, size = n, replace = FALSE)
mysample <- grdVoorst[units, ]
mysample
# A tibble: 40 x 4
s1 s2 z stratum
<dbl> <dbl> <dbl> <chr>
1 206992 464506. 23.5 EA
2 202567 464606. 321. XF
3 205092 464530. 124. XF
4 203367 464556. 53.6 EA
5 205592 465180. 38.4 PA
6 201842 464956. 159. XF
7 201667 464930. 139. XF
8 204317 465306. 59.4 PA
9 203042 464406. 90.5 BA
10 204567 464530. 48.1 PA
# ... with 30 more rows
The result of function sample is a vector with the centres of the selected cells
of the discretisation grid, referred to as discretisation points. The order of the
elements of the vector is the order in which these are selected. Restricting
the sampling points to the discretisation points can be avoided as follows. A
simple random sample of points is selected in two stages. First, a grid cell is selected n times by simple random sampling with replacement. Second, every
time a grid cell is selected, one point is selected fully randomly from that
grid cell. This selection procedure accounts for the infinite number of points
in the population. In the code chunk below, the second step of this selection
procedure is implemented with function jitter. It adds random noise to the
spatial coordinates of the centres of the selected grid cells, by drawing from a
continuous uniform distribution unif(−𝑐, 𝑐), with 𝑐 half the side length of the
square grid cells. With this selection procedure we respect that the population
actually is infinite.
set.seed(314)
units <- sample(N, size = n, replace = TRUE)
mysample <- grdVoorst[units, ]
cellsize <- 25
mysample$s1 <- jitter(mysample$s1, amount = cellsize / 2)
mysample$s2 <- jitter(mysample$s2, amount = cellsize / 2)
mysample
# A tibble: 40 x 4
s1 s2 z stratum
<dbl> <dbl> <dbl> <chr>
1 206986. 464493. 23.5 EA
2 202574. 464609. 321. XF
3 205095. 464527. 124. XF
4 203369. 464556. 53.6 EA
5 205598. 465181. 38.4 PA
6 201836. 464965. 159. XF
7 201665. 464941. 139. XF
8 204319. 465310. 59.4 PA
9 203052. 464402. 90.5 BA
10 204564. 464529. 48.1 PA
# ... with 30 more rows
Variable stratum is not used in this chapter but in the next chapter. The selected
sample is shown in Figure 3.1.
Dropouts
In practice, it may happen that inspection in the field shows that a selected
sampling unit does not belong to the target population or cannot be observed
for whatever reason (e.g., no permission). For instance, in a soil survey the
sampling unit may happen to fall on a road or in a built-up area. What to do
with these dropouts? Shifting this unit to a nearby unit may lead to a biased
estimator of the population mean, i.e., a systematic error in the estimated
population mean. Besides, knowledge of the inclusion probabilities is lost. This
can be avoided by discarding these units and replacing them by sampling
units from a back-up list, selected in the same way, i.e., by the same type of
sampling design. The order of sampling units in this list must be the order in
which they are selected. In summary, do not replace a deleted sampling unit
by the nearest sampling unit from the back-up list, but by the first unit, not
yet selected, from the back-up list.
this it follows that the probability that unit $k$ is included in the sample is $\binom{N-1}{n-1} / \binom{N}{n} = \frac{n}{N}$ (Lohr, 1999). Substituting this in the general $\pi$ estimator for the total (Equation (2.2)) gives for simple random sampling without replacement (from finite populations)

$$\hat{t}(z) = \frac{N}{n} \sum_{k \in \mathcal{S}} z_k = N \bar{z}_{\mathcal{S}} \;, \tag{3.1}$$

with $\bar{z}_{\mathcal{S}}$ the unweighted sample mean. So, for simple random sampling without replacement the $\pi$ estimator of the population mean is the unweighted sample mean:

$$\hat{\bar{z}} = \bar{z}_{\mathcal{S}} = \frac{1}{n} \sum_{k \in \mathcal{S}} z_k \;. \tag{3.2}$$
For simple random sampling with replacement, the population total can be estimated by the pwr estimator (Equation (2.6)):

$$\hat{t}(z) = \frac{1}{n} \sum_{k \in \mathcal{S}} \frac{z_k}{p_k} \;, \tag{3.3}$$

where $n$ is the number of draws (sample size) and $p_k$ is the draw-by-draw selection probability of unit $k$. With simple random sampling $p_k = 1/N$, $k = 1, \ldots, N$. Inserting this in the pwr estimator yields

$$\hat{t}(z) = \frac{N}{n} \sum_{k \in \mathcal{S}} z_k \;, \tag{3.4}$$

which is equal to the $\pi$ estimator of the population total for simple random sampling without replacement.
Alternatively, the population total can be estimated by the $\pi$ estimator. With simple random sampling with replacement the inclusion probability of each unit $k$ equals $1 - \left(1 - \frac{1}{N}\right)^n$, which is smaller than the inclusion probability with simple random sampling without replacement of size $n$ (Särndal et al., 1992). Inserting these inclusion probabilities in the general $\pi$ estimator of the population total (Equation (2.2)), where the sample $\mathcal{S}$ is reduced to the unique units in the sample, yields the $\pi$ estimator of the total for simple random sampling with replacement.
With simple random sampling of infinite populations, the $\pi$ estimator of the population mean equals the sample mean. Multiplying this estimator by the area of the region of interest $A$ yields the $\pi$ estimator of the population total:

$$\hat{t}(z) = \frac{A}{n} \sum_{k \in \mathcal{S}} z_k \;. \tag{3.5}$$
As explained above, selected sampling units that do not belong to the target
population must be replaced by a unit from a back-up list if we want to
observe the intended number of units. The question then is how to estimate
the population total and mean. We cannot use the 𝜋 estimator of Equation
(3.1) to estimate the population total, because we do not know the population
size 𝑁. The population size can be estimated by
$$\hat{N} = \frac{n - d}{n} N^* \;, \tag{3.6}$$

with $d$ the number of dropouts and $N^*$ the supposed population size, i.e., the number of units in the sampling frame used to select the sample. This yields the inclusion probability

$$\pi_k = \frac{n}{\hat{N}} = \frac{n^2}{(n - d) N^*} \;. \tag{3.7}$$

Inserting this in the $\pi$ estimator of the population total yields

$$\hat{t}(z) = \frac{(n - d) N^*}{n^2} \sum_{k \in \mathcal{S}} z_k = \frac{(n - d) N^*}{n} \bar{z}_{\mathcal{S}} = \hat{N} \bar{z}_{\mathcal{S}} \;. \tag{3.8}$$

The population mean can be estimated by dividing this estimated total by the estimated population size:

$$\hat{\bar{z}} = \frac{\hat{t}(z)}{\hat{N}} = \bar{z}_{\mathcal{S}} \;. \tag{3.9}$$
This estimator is a so-called ratio estimator: both the numerator and denom-
inator are estimators of totals. See Section 10.2 for more information about
this estimator.
The simple random sample of size 40 selected above is used to estimate the total
mass of soil organic matter (SOM) in the population. First, the population
mean is estimated.
mz <- mean(mysample$z)
The estimated mean SOM concentration is 93.3 g kg-1 . Simply multiplying the
estimated mean by the area 𝐴 to obtain an estimate of the population total is
not very useful, as the dimension of the total then is in g kg-1 m2 . To estimate
the total mass of SOM in the soil layer 0 − 30 cm, first the soil volume in m3
is computed by the total number of grid cells, 𝑁, multiplied by the size of the
grid cells and by the thickness of the soil layer. The total is then estimated by
the product of this volume, the bulk density of soil (1,500 kg m-3 ), and the
estimated population mean (g kg-1 ). This is multiplied by 10-6 to obtain the
total mass of SOM in Mg (1 Mg is 1,000 kg).
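A minimal sketch of this computation, under the stated assumptions (25 m × 25 m grid cells, a 0-30 cm soil layer, and a constant bulk density of 1,500 kg m-3):

# soil volume in m3: number of grid cells times cell area times layer thickness
cellsize <- 25
thickness <- 0.3
volume <- N * cellsize^2 * thickness
bulkdensity <- 1500
# total mass of SOM in Mg; mz is the estimated mean SOM concentration (g kg-1)
tz <- volume * bulkdensity * mz * 1e-6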
Note that a constant bulk density is used. Ideally, this bulk density is
also measured at the sampling points, by collecting soil aliquots of a
constant volume. The measured SOM concentration and bulk density can
then be used to compute the volumetric SOM concentration in kg m-3 at
the sampling points. The estimated population mean of this volumetric
SOM concentration can then be multiplied by the total volume of soil in
the study area, to get an estimate of the total mass of SOM in the study area.
The simulated population is now sampled 10,000 times to see how sampling
affects the estimates. For each sample, the population mean is estimated by
the sample mean. Figure 3.2 shows the approximated sampling distribution
of the 𝜋 estimator of the mean SOM concentration. Note that the sampling
distribution is nearly symmetric, whereas the frequency distribution of the
SOM concentrations in the population is far from symmetric, see Figure 1.4.
The increased symmetry is due to the averaging of 40 numbers.
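The sampling experiment can be coded along these lines (a sketch; the exact settings used for the figure are assumptions):

# repeat simple random sampling without replacement 10,000 times
number_of_samples <- 10000
mz_repeated <- numeric(length = number_of_samples)
for (i in 1:number_of_samples) {
  units <- sample(N, size = n, replace = FALSE)
  mz_repeated[i] <- mean(grdVoorst$z[units])
}
# mean and variance of the approximated sampling distribution
mean(mz_repeated)
var(mz_repeated)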
If we would repeat the sampling an infinite number of times and make the
width of the bins in the histogram infinitely small, then we obtain, after scaling
so that the sum of the area under the curve equals 1, the sampling distribution
of the estimator of the population mean. Important summary statistics of this
sampling distribution are the expectation (mean) and the variance.
When the expectation equals the population mean, there is no systematic error.
The estimator is then said to be design-unbiased. In Chapter 21 another type
of unbiasedness is introduced, model-unbiasedness. The difference between
design-unbiasedness and model-unbiasedness is explained in Chapter 26. In the sampling experiment described above, the mean of the 10,000 estimated population means equals 81.1 g kg-1, so the difference
with the true population mean equals -0.03 g kg-1 . The variance of the 10,000
estimated population means equals 55.8 (g kg-1 )2 . The square root of this
variance, referred to as the standard error, equals 7.47 g kg-1 . Note that the
standard error has the same units as the study variable, g kg-1 , whereas the
units of the variance are the squared units of the study variable.
In some cases one is interested in the proportion of the population (study area)
satisfying a given condition. Think, for instance, of the proportion of trees in
a forest infected by some disease, the proportion of an area or areal fraction,
in which a soil pollutant exceeds some critical threshold, or the proportion
of an area where habitat conditions are suitable for some endangered species.
Recall that a population proportion is defined as the population mean of a 0/1 indicator $y$ with value 1 if the condition is satisfied, and 0 otherwise (Subsection 1.1.1). For simple random sampling, this population proportion can be estimated by the same formula as for the mean (Equation (3.2)):

$$\hat{p} = \frac{1}{n} \sum_{k \in \mathcal{S}} y_k \;. \tag{3.10}$$
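As a quick sketch, the proportion of the population in which the SOM concentration exceeds a threshold of 100 g kg-1 (a hypothetical threshold, chosen for illustration only) can be estimated from the simple random sample as follows:

# 0/1 indicator: does the observed SOM concentration exceed the threshold?
y <- as.numeric(mysample$z > 100)
p_hat <- mean(y)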
ggplot(data = mysample) +
  stat_ecdf(mapping = aes(x = z)) +  # assumed completion: empirical CDF of the sample data
  scale_x_continuous(name = "SOM") +
  scale_y_continuous(name = "F")
Note the pipe operator %>% of package magrittr (Bache and Wickham, 2020)
forwarding the result of function quantile to function round.
Package QuantileNPCI (Hutson et al., 2019) can be used to compute a
non-parametric confidence interval estimate of a quantile, using fractional
order statistics (Hutson, 1999). Parameter q specifies the proportion.
library(QuantileNPCI)
res <- quantCI(mysample$z, q = 0.5, alpha = 0.05, method = "exact")
The estimated median equals 66.2 g kg-1 , the lower bound of the 95% confidence
interval equals 54.0 g kg-1 , and the upper bound equals 98.1 g kg-1 .
Exercises
3.2 Sampling variance of estimator of population parameters

For simple random sampling, the sampling variance of the estimator of the population mean equals

$$V(\hat{\bar{z}}) = \frac{S^2(z)}{n} \;, \tag{3.11}$$

with $S^2(z)$ the population variance, also referred to as the spatial variance. For finite populations, this population variance is defined as (Lohr, 1999)

$$S^2(z) = \frac{1}{N - 1} \sum_{k=1}^{N} (z_k - \bar{z})^2 \;, \tag{3.12}$$

and for infinite populations as

$$S^2(z) = \frac{1}{A} \int_{\mathbf{s} \in \mathcal{A}} (z(\mathbf{s}) - \bar{z})^2 \, \mathrm{d}\mathbf{s} \;, \tag{3.13}$$

with $z(\mathbf{s})$ the value of the study variable $z$ at a point with two-dimensional coordinates $\mathbf{s} = (s_1, s_2)$, $A$ the area of the study area, and $\mathcal{A}$ the study area.
In practice, we select only one sample, i.e., we do not repeat the sampling
many times. Still it is possible to estimate the variance of the estimator of
the population mean if we would repeat the sampling. In other words, we can
estimate the sampling variance of the estimator of the population mean from
a single sample. We do so by estimating the population variance from the
sample, and this estimate can then be used to estimate the sampling variance
of the estimator of the population mean. For simple random sampling with
replacement from finite populations, the sampling variance of the estimator of
the population mean can be estimated by
$$\hat{V}(\hat{\bar{z}}) = \frac{\hat{S}^2(z)}{n} = \frac{1}{n(n - 1)} \sum_{k \in \mathcal{S}} (z_k - \bar{z}_{\mathcal{S}})^2 \;, \tag{3.14}$$

with $\hat{S}^2(z)$ the estimated population variance. With simple random sampling, the sample variance, i.e., the variance of the sample data, is an unbiased estimator of the population variance. The variance estimator of Equation (3.14) can also be used for infinite populations. For simple random sampling without replacement from finite populations, the sampling variance of the estimator of the population mean can be estimated by

$$\hat{V}(\hat{\bar{z}}) = \left(1 - \frac{n}{N}\right) \frac{\hat{S}^2(z)}{n} \;. \tag{3.15}$$

The term $1 - \frac{n}{N}$ is referred to as the finite population correction (fpc).
In the sampling experiment described above, the average of the 10,000 estimated
sampling variances equals 55.7 (g kg-1 )2 . The true sampling variance equals
55.4 (g kg-1 )2 . So, the difference is very small, indicating that the estimator of
the sampling variance, Equation (3.15), is design-unbiased.
The sampling variance of the estimator of the total of a finite population can
be estimated by multiplying the estimated variance of the estimator of the
population mean by 𝑁2 . For simple random sampling without replacement
this estimator thus equals
$$\hat{V}(\hat{t}(z)) = N^2 \left(1 - \frac{n}{N}\right) \frac{\hat{S}^2(z)}{n} \;. \tag{3.16}$$
For simple random sampling of infinite populations, the sampling variance of
the estimator of the total can be estimated by
$$\hat{V}(\hat{t}(z)) = A^2 \frac{\hat{S}^2(z)}{n} \;. \tag{3.17}$$
The sampling variance of the estimator of a proportion 𝑝̂ for simple random
sampling without replacement of a finite population can be estimated by
$$\hat{V}(\hat{p}) = \left(1 - \frac{n}{N}\right) \frac{\hat{p}(1 - \hat{p})}{n - 1} \;. \tag{3.18}$$
The numerator in this estimator is an estimate of the population variance of
the indicator. Note that this estimated population variance is divided by 𝑛 − 1,
and not by 𝑛 as in the estimator of the population mean (Lohr, 1999).
Estimation of the standard error of the estimated population mean in R is
very straightforward. To estimate the standard error of the estimated total in
Mg, the standard error of the estimated population mean must be multiplied
by a constant equal to the product of the soil volume, the bulk density, and
10-6; see the second code chunk in Section 3.1.
The estimated standard error of the estimated total equals 20,334 Mg. This
standard error does not account for spatial variation of bulk density.
Although there is no advantage in using package survey (Lumley, 2021) to
compute the 𝜋 estimator and its standard error for this simple sampling
design, I illustrate how this works. For more complex designs and alternative
estimators, estimation of the population mean and its standard error with
functions defined in this package is very convenient, as will be shown in the
following chapters.
3.2 Sampling variance of estimator of population parameters 39
First, the sampling design that is used to select the sampling units is specified
with function svydesign. The first argument specifies the sampling units. In
this case, the centres of the discretisation grid cells are used as sampling
units, which is indicated by the formula id = ~ 1. In Chapter 6 clusters of
population units are used as sampling units, and in Chapter 7 both clusters
and individual units are used as sampling units. Argument probs specifies the
inclusion probabilities of the sampling units. Alternatively, we may specify the
weights with argument weights, which are in this case equal to the inverse of
the inclusion probabilities. Variable pi is a column in tibble mysample, which is
indicated with the tilde in probs = ~ pi.
The population mean is then estimated with function svymean. The first argu-
ment is a formula specifying the study variable. Argument design specifies the
sampling design.
library(survey)
mysample$pi <- n / N
design_si <- svydesign(id = ~ 1, probs = ~ pi, data = mysample)
svymean(~ z, design = design_si)
mean SE
z 93.303 9.6041
mysample$N <- N
design_si <- svydesign(id = ~ 1, probs = ~ pi, fpc = ~ N, data = mysample)
svymean(~ z, design_si)
mean SE
z 93.303 9.5786
The estimated standard error is smaller now due to the finite population
correction, see Equation (3.15).
Population totals can be estimated with function svytotal, quantiles with
function svyquantile, and ratios of population totals with svyratio, to mention
a few functions that will be used in following chapters.
$z
quantile ci.2.5 ci.97.5 se
0.5 65.56675 53.67764 99.93484 11.43457
0.9 164.36975 153.86258 320.74887 41.25353
40 3 Simple random sampling
attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"
Exercises
A 90% confidence interval estimate of the population mean is given by

$$\hat{\bar{z}} \pm u_{0.10/2} \cdot \sqrt{V(\hat{\bar{z}})} \;, \tag{3.19}$$

where $u_{0.10/2}$ is the 0.95 quantile of the standard normal distribution, i.e., the value of $u$ having a tail area of 0.05 to its right. Note that in this equation the sampling variance of the estimator of the population mean $V(\hat{\bar{z}})$ is used. In practice, this variance is unknown, because the population variance is unknown, and must be estimated from the sample (Equations (3.14) and (3.15)). To account for the unknown sampling variance, the standard normal distribution is replaced by Student's t distribution (hereafter shortly referred to as the t distribution), which has thicker tails than the standard normal distribution. This leads to the following bounds of the $100(1 - \alpha)$% confidence interval estimate of the mean:

$$\hat{\bar{z}} \pm t_{\alpha/2}^{(n-1)} \cdot \sqrt{\hat{V}(\hat{\bar{z}})} \;, \tag{3.20}$$

with $t_{\alpha/2}^{(n-1)}$ the $(1 - \alpha/2)$ quantile of the t distribution with $n - 1$ degrees of freedom.
More easily, we can use method confint of package survey to compute the
confidence interval.
2.5 % 97.5 %
z 73.92817 112.6771
Figure 3.4 shows the 90% confidence interval estimates of the mean SOM
concentration for the first 100 simple random samples drawn above. Note that
both the location and the length of the intervals differ between samples. For
each sample, I determined whether this interval covers the population mean.
Out of the 10,000 samples, 1,132 samples do not cover the population mean,
i.e., close to the specified 10%. So, a 90% confidence interval is a random
interval that contains in the long run the population mean 90% of the time.
library(DescTools)
n <- 50
k <- 5
print(p.est <- BinomCI(k, n, conf.level = 0.95, method = "clopper-pearson"))
[1] 0.025
The upper bound of the confidence interval is the proportion at which the probability of 5 or fewer successes equals 0.025, and the lower bound is the proportion at which the probability of 5 or more successes is also equal to 0.025. Note that to compute this upper tail probability, we must assign $k - 1 = 4$ to argument q, because with argument lower.tail = FALSE function pbinom computes the probability of $X > x$, not of $X \geq x$.
[1] 0.025
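The two tail probabilities printed above can be reproduced with function pbinom; a minimal sketch, assuming the Clopper-Pearson bounds are in columns lwr.ci and upr.ci of the matrix returned by BinomCI:

# probability of k or fewer successes at the upper bound
pbinom(q = k, size = n, prob = p.est[, "upr.ci"])
# probability of k or more successes at the lower bound (upper tail)
pbinom(q = k - 1, size = n, prob = p.est[, "lwr.ci"], lower.tail = FALSE)
# both are approximately 0.025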
For large sample sizes and for proportions close to 0.5, the confidence interval
can be computed with a normal distribution as an approximation to the
binomial distribution, using Equation (3.18) for the variance estimator of the
estimator of a proportion:
$$\hat{p} \pm u_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n - 1}} \;. \tag{3.21}$$
3.4 Simple random sampling of circular plots

Sampling from a finite set of fixed circles is simple, but as we will see this
requires an assumption about the distribution of the study variable in the
population. In this implementation, the sampling units consist of a finite set of
slightly overlapping or non-overlapping fixed circular plots (Figure 3.5). The
circles can be constructed as follows. A grid with squares is superimposed
on the study area, so that it fully covers the study area. These squares are
then substituted by circles with an area equal to the area of the squares, or
by non-overlapping tangent circles inscribed in the squares. The radius of the
partly overlapping circles equals √𝑎/𝜋, with√ 𝑎 the area of the squares, the
radius of the non-overlapping circles equals 𝑎/2. In both implementations,
the infinite population is replaced by a finite population of circles that does
not fully tessellate the study area. When using the partly overlapping circles as
sampling units we may avoid overlap by selecting a systematic sample (Chapter
5) of circular plots. The population total can then be estimated by Equation
(3.1), substituting $A/a$ for $N$, and where $z_k$ is the total of the $k$th circle (sum of observations of all population units in the $k$th circle). However, no unbiased estimator of the sampling variance is then available.
FIGURE 3.5: Simple random sample of ten circular plots from a square
discretised by a finite set of partly overlapping or non-overlapping circular
plots.
The population total can be estimated by

$$\hat{t}(z) = \frac{A}{a} \hat{\bar{z}} \;, \tag{3.22}$$

with $\hat{\bar{z}}$ the estimated mean of the finite population. The variance can be estimated by the variance of the estimator of the mean of the finite population, multiplied by the square of $A/a$. However, we still need to assume that the mean of the finite population is equal to the mean of the infinite population.
This assumption can be avoided by sampling from an infinite set of floating
circles.
circular plots. Besides, when a plot is selected near the border of the study area,
a part of the plot is outside the study area. This part is ignored in estimating
the population mean or total. To select the centres, the study area must be
extended by a zone with a width equal to the radius of the circular plots. This
is illustrated in Figure 3.6, showing a square study area of 100 m × 100 m.
To select ten circular plots with a radius of 5 m from this square, ten points
are selected by simple random sampling, using function runif, with -5 as lower
limit and 105 as upper limit of the uniform distribution.
set.seed(129)
s1 <- runif(10, min = -5, max = 105)
s2 <- runif(10, min = -5, max = 105)
Two points are selected outside the study area, in the extended zone. For both
points, a small part of the circular plot is inside the square. To determine the
study variable for these two sampling units, only the part of the plot inside
the square is observed. In other words, these two observations have a smaller
support than the observations of the other eight plots, see Chapter 1.
FIGURE 3.6: Simple random sample of ten floating circular plots from a
square.
In the upper left corner, two sampling units are selected that largely overlap.
The intersection of the two circular plots is used twice, to determine the study
variable of both sampling units.
Given the observations of the selected circular plots, the population total can be estimated by (De Vries, 1986)

$$\hat{t}(z) = \frac{A}{a} \frac{1}{n} \sum_{k \in \mathcal{S}} z_k \;, \tag{3.23}$$

with $a$ the area of the circle and $z_k$ the observed total of sampling unit $k$ (circle). The same estimate of the total is obtained if we divide the observations by $a$ to obtain a mean per sampling unit:

$$\hat{t}(z) = A \frac{1}{n} \sum_{k \in \mathcal{S}} \frac{z_k}{a} \;. \tag{3.24}$$

The sampling variance of the estimator of the population total can be estimated by

$$\hat{V}(\hat{t}(z)) = \left(\frac{A}{a}\right)^2 \frac{\hat{S}^2(z)}{n} \;, \tag{3.25}$$

with $\hat{S}^2(z)$ the estimated population variance of the totals per population unit (circle).
Exercises
4 Stratified simple random sampling

BA EA PA RA XF
13 8 9 4 7
The sum of the stratum sample sizes is 41; we want 40, so we reduce the largest
stratum sample size by 1.
The stratified simple random sample is selected with function strata of package
sampling (Tillé and Matei, 2021). Argument size specifies the stratum sample
sizes.
The stratum sample sizes must be in the order the strata are encountered in
tibble grdVoorst, which is determined first with function unique.
Within the strata, the grid cells are selected by simple random sampling with
replacement (method = "srswr"), so that in principle more than one point can
be selected within a grid cell, see Chapter 3 for a motivation of this. Function
getdata extracts the observations of the selected units from the sampling
frame, as well as the spatial coordinates and the stratum of these units. The
coordinates of the centres of the selected grid cells are jittered by an amount
equal to half the side of the grid cells. In the next code chunk, this is done
with function mutate of package dplyr (Wickham et al., 2021) which is part of
package tidyverse (Wickham et al., 2019). We have seen the pipe operator %>%
of package magrittr (Bache and Wickham, 2020) before in Subsection 3.1.2.
If you are not familiar with tidyverse I recommend reading the excellent book
R for Data Science (Wickham and Grolemund, 2017).
library(sampling)
library(dplyr)
ord <- unique(grdVoorst$stratum)
set.seed(314)
units <- sampling::strata(
grdVoorst, stratanames = "stratum", size = n_h[ord], method = "srswr")
mysample <- getdata(grdVoorst, units) %>%
mutate(s1 = s1 %>% jitter(amount = 25 / 2),
s2 = s2 %>% jitter(amount = 25 / 2))
FIGURE 4.1: Stratified simple random sample of size 40 from Voorst. Strata
are combinations of soil class and land use.
4.1 Estimation of population parameters
With stratified simple random sampling, the population mean is estimated by

$$\hat{\bar{z}} = \sum_{h=1}^{H} w_h \hat{\bar{z}}_h \;, \tag{4.1}$$

with $H$ the number of strata, $w_h = N_h/N$ the relative size (weight) of stratum $h$, and $\hat{\bar{z}}_h$ the estimated mean of stratum $h$:

$$\hat{\bar{z}}_h = \frac{1}{n_h} \sum_{k \in \mathcal{S}_h} z_k \;, \tag{4.2}$$

with $n_h$ the sample size in stratum $h$ and $\mathcal{S}_h$ the sample selected from stratum $h$. This estimator is equal to the $\pi$ estimator of the population mean:

$$\hat{\bar{z}} = \frac{1}{N} \sum_{h=1}^{H} \sum_{k \in \mathcal{S}_h} \frac{z_k}{\pi_k} = \frac{1}{N} \sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{k \in \mathcal{S}_h} z_k = \sum_{h=1}^{H} w_h \hat{\bar{z}}_h \;. \tag{4.3}$$
The sampling fractions are usually slightly different, even with proportional
allocation (Section 4.3), because 𝑛ℎ /𝑁ℎ cannot be made exactly equal for
all strata. Sample sizes necessarily are integers, so 𝑛ℎ /𝑁ℎ must be rounded
to integers.
The sampling variance of the estimator of the population mean is estimated by

$$\hat{V}(\hat{\bar{z}}) = \sum_{h=1}^{H} w_h^2 \hat{V}(\hat{\bar{z}}_h) \;, \tag{4.4}$$
TABLE 4.1: Size (Nh), sample size (nh), estimated mean (Mean), estimated
variance (Variance), and estimated standard error of estimator of mean (se) of
the five strata in Voorst.
with $\hat{V}(\hat{\bar{z}}_h)$ the estimated sampling variance of $\hat{\bar{z}}_h$:

$$\hat{V}(\hat{\bar{z}}_h) = \left(1 - \frac{n_h}{N_h}\right) \frac{\hat{S}_h^2(z)}{n_h} \;, \tag{4.5}$$

with $\hat{S}_h^2(z)$ the estimated variance of $z$ within stratum $h$:

$$\hat{S}_h^2(z) = \frac{1}{n_h - 1} \sum_{k \in \mathcal{S}_h} \left(z_k - \hat{\bar{z}}_h\right)^2 \;. \tag{4.6}$$
Table 4.1 shows per stratum the estimated mean, variance, and sampling
variance of the estimated mean of the SOM concentration. We can see large
differences in the within-stratum variances. For the stratified sample of Figure
4.1, the estimated population mean equals 86.3 g kg-1 , and the estimated
standard error of this estimator equals 5.8 g kg-1 .
The population mean can also be estimated directly using the basic 𝜋 estimator
(Equation (2.4)). The inclusion probabilities are included in data.frame mysample,
obtained with function getdata (see code chunk above), as variable Prob.
head(mysample)
The population total is estimated first, and by dividing this estimated total by
the total number of population units 𝑁 an estimate of the population mean is
obtained.
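A sketch of this computation, assuming the inclusion probabilities are in column Prob of mysample:

tz <- sum(mysample$z / mysample$Prob)
print(tz / N)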
[1] 86.53333
The two estimates of the population mean are not exactly equal. This is due to
rounding errors in the inclusion probabilities. This can be shown by computing
the sum of the inclusion probabilities over all population units. This sum
should be equal to the sample size 𝑛 = 40, but as we can see below, this sum
is slightly smaller.
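A sketch of this check, assuming vectors N_h and n_h with the stratum sizes and stratum sample sizes in matching order (with replacement, the inclusion probability of a unit in stratum h is assumed to equal 1 - (1 - 1/N_h)^n_h):

pi_h <- 1 - (1 - 1 / N_h)^n_h
print(sum(N_h * pi_h))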
[1] 39.90711
Now suppose we ignore that the sample data come from a stratified sampling
design and we use the (unweighted) sample mean as an estimate of the
population mean.
print(mean(mysample$z))
[1] 86.11247
The sample mean slightly differs from the proper estimate of the population mean. The sample mean is a biased estimator, but the bias is small. The
bias is only small because the stratum sample sizes are about proportional to
the sizes of the strata, so that the inclusion probabilities (sampling intensities)
are about equal for all strata: 0.0050494, 0.0055344, 0.0052509, 0.006056,
0.005189. The probabilities are not exactly equal because the stratum sample
sizes are necessarily rounded to integers and because we reduced the largest
sample size by one unit. The bias would have been substantially larger if an
equal number of units would have been selected from each stratum, leading
to much larger differences in the inclusion probabilities among the strata.
Sampling intensity in stratum BA, for instance, then would be much smaller
compared to the other strata, and so would be the inclusion probabilities of the
units in this stratum as compared to the other strata. Stratum BA then would
be underrepresented in the sample. This is not a problem as long as we account
for the difference in inclusion probabilities of the units in the estimation of the
population mean. The estimated mean of stratum BA then gets the largest
weight, equal to the inverse of the inclusion probability. If we do not account
for these differences in inclusion probabilities, the estimator of the mean will
be seriously biased.
The next code chunk shows how the population mean and its standard error
can be estimated with package survey (Lumley, 2021). Note that the stratum
weights 𝑁ℎ /𝑛ℎ must be passed to function svydesign using argument weight.
These are first attached to data.frame mysample by creating a look-up table lut,
which is then merged with function merge to data.frame mysample.
library(survey)
labels <- sort(unique(mysample$stratum))
lut <- data.frame(stratum = labels, weight = N_h / n_h)
mysample <- merge(x = mysample, y = lut)
design_stsi <- svydesign(
id = ~ 1, strata = ~ stratum, weight = ~ weight, data = mysample)
svymean(~ z, design_stsi)
mean SE
z 86.334 5.8167
Figure 4.2 shows the estimated CDF, estimated from the stratified simple
random sample of 40 units from Voorst (Figure 4.1).
$z
quantile ci.2.5 ci.97.5 se
0.5 69.56081 65.70434 84.03993 4.515916
0.8 117.73877 102.75359 161.88611 14.563887
attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"
56 4 Stratified simple random sampling
more sampling units than with stratified simple random sampling to obtain
an estimate of the same precision.
The stratification effect can be computed from the population variance 𝑆2 (𝑧)
(Equation (3.12)) and the variances within the strata 𝑆2ℎ (𝑧). In the sampling
experiment, these variances are known without error because we know the
𝑧-values for all units in the population. In practice, we only know the 𝑧-values
for the sampled units. However, a design-unbiased estimator of the population
variance is (de Gruijter et al., 2006)
$$\hat{S}^2(z) = \widehat{z^2} - \left(\hat{\bar{z}}\right)^2 + \hat{V}(\hat{\bar{z}}) \;, \tag{4.7}$$

where $\widehat{z^2}$ denotes the estimated population mean of the squared study variable ($z^2$), obtained in the same way as $\hat{\bar{z}}$ (Equation (4.1)) but using squared values, and $\hat{V}(\hat{\bar{z}})$ denotes the estimated variance of the estimator of the population mean (Equation (4.4)).
The estimated population variance is then divided by the sum of the stratum sample sizes to get an estimate of the sampling variance of the estimator of the mean with simple random sampling of an equal number of units:
$$\widehat{V}(\hat{\bar{z}}_{\mathrm{SI}}) = \frac{\widehat{S^2}(z)}{\sum_{h=1}^{H} n_h} \;. \qquad (4.8)$$
The population variance can also be estimated from the $\pi$-expanded sample values:
$$\widehat{S^2}(z) = \frac{N-1}{N}\,\frac{n}{n-1}\,\frac{1}{N-1} \sum_{k \in \mathcal{S}} \frac{(z_k - \hat{\bar{z}})^2}{\pi_k} \;. \qquad (4.9)$$
This estimator can be computed with function s2 of package surveyplanning.
library(surveyplanning)
S2z <- s2(mysample$z, w = mysample$weight)
The design effect is the ratio of the variance of the estimator of the mean with the sampling design under study to the variance with simple random sampling of the same (expected) number of units; it is the reciprocal of the stratification effect. For the stratified simple random sample of Figure 4.1, the design effect can be estimated as follows. Function SE extracts the estimated standard error of the estimator of the mean from the output of function svymean. The extracted standard error is then squared to obtain an estimate of the sampling variance of the estimator of the population mean with stratified simple random sampling. Finally, this variance is divided by the variance with simple random sampling of an equal number of units.
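A minimal sketch of this computation (assuming S2z as computed above; deff = "replace" lets svymean report the design effect directly):
v_mz_STSI <- SE(svymean(~ z, design_stsi))^2  # variance of estimator with stratified sampling
v_mz_SI <- S2z / nrow(mysample)               # variance with simple random sampling, Eq. (4.8)
v_mz_STSI / v_mz_SI                           # estimated design effect
svymean(~ z, design_stsi, deff = "replace")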
z
z 0.6903965
mean SE DEff
z 86.3340 5.8167 0.6904
So, when using package survey, estimation of the population variance is not needed to estimate the design effect. I only added this to make clear how the design effect is computed with the functions in package survey. In the following chapters I will skip the estimation of the population variance.
The design effect as estimated from the stratified sample is smaller than 1, showing that stratified simple random sampling is more efficient than simple random sampling. The reciprocal of the estimated design effect (1.448) is somewhat larger than the stratification effect as computed in the sampling experiment, but this is an estimate of the design effect from one stratified sample only. The estimated population variance varies among stratified samples, and so does the estimated design effect.
Stratified simple random sampling with proportional allocation (Section 4.3) is more precise than simple random sampling when the sum of squares of the stratum means, SSB, is larger than the sum of squares within the strata, SSW (Lohr, 1999):
$$SSB = \sum_{h=1}^{H} N_h (\bar{z}_h - \bar{z})^2 \;, \qquad (4.11)$$
and SSW the sum over the strata of the weighted variances within strata
(weights equal to 1 − 𝑁ℎ /𝑁):
$$SSW = \sum_{h=1}^{H} \left(1 - \frac{N_h}{N}\right) S^2_h \;. \qquad (4.12)$$
In other words, the smaller the differences in the stratum means and the larger
the variances within the strata, the smaller the stratification effect will be.
Figure 4.4 shows boxplots of the SOM concentration per stratum (soil-land use combination). The stratum means are 83.0, 49.0, 68.8, 92.7, and 122.3 g kg-1. The stratum variances are 1799.2, 238.4, 1652.9, 1905.4, and 2942.8 (g kg-1)2.
The large stratum variances explain the modest gain in precision realised by
stratified simple random sampling compared to simple random sampling in
this case.
FIGURE 4.4: Boxplots of the SOM concentration (g kg-1 ) for the five strata
(soil-land use combinations) in Voorst.
The number of degrees of freedom of the t distribution used in the confidence interval can be approximated with Satterthwaite's method:
$$df \approx \frac{\left( \sum_{h=1}^{H} w_h^2\, \widehat{S}^2_h(z)/n_h \right)^2}{\sum_{h=1}^{H} w_h^4 \left( \widehat{S}^2_h(z)/n_h \right)^2 \frac{1}{n_h - 1}} \;. \qquad (4.14)$$
2.5 % 97.5 %
z 74.52542 98.14252
With proportional allocation, the stratum sample sizes are proportional to the sizes of the strata:
$$n_h = n \cdot \frac{N_h}{\sum_{h=1}^{H} N_h} \;, \qquad (4.15)$$
with $N_h$ the total number of population units (size) of stratum $h$. With infinite populations $N_h$ is replaced by the area $A_h$. The sample sizes computed with this equation are rounded to the nearest integers.
If we have prior information on the variance of the study variable within the
strata, then it makes sense to account for differences in variance. Heterogeneous
strata should receive more sampling units than homogeneous strata, leading
to Neyman allocation:
$$n_h = n \cdot \frac{N_h\, S_h(z)}{\sum_{h=1}^{H} N_h\, S_h(z)} \;, \qquad (4.16)$$
with $S_h(z)$ the standard deviation (square root of the variance) of the study variable $z$ in stratum $h$.
Finally, the costs of sampling may differ among the strata. It can be relatively
expensive to sample nearly inaccessible strata, and we do not want to sample
many units there. This leads to optimal allocation:
$$n_h = n \cdot \frac{N_h\, S_h(z)/\sqrt{c_h}}{\sum_{h=1}^{H} N_h\, S_h(z)/\sqrt{c_h}} \;, \qquad (4.17)$$
with $c_h$ the costs per sampling unit in stratum $h$. Optimal means that given the
total costs this allocation type leads to minimum sampling variance, assuming
a linear costs model:
$$C = c_0 + \sum_{h=1}^{H} n_h\, c_h \;, \qquad (4.18)$$
with 𝑐0 overhead costs. So, the more variable a stratum and the lower the
costs, the more units will be selected from this stratum.
These optimal sample sizes can be computed with function optsize of package
surveyplanning.
[1] 14 3 9 4 10
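The allocation can also be computed directly from Equation (4.17). A minimal sketch, using the stratum variances reported with Figure 4.4 and assuming N_h is the vector of stratum sizes used before; the costs in c_h below are hypothetical (with equal costs the allocation reduces to Neyman allocation):
S_h <- sqrt(c(1799.2, 238.4, 1652.9, 1905.4, 2942.8))  # stratum standard deviations
c_h <- rep(1, 5)   # hypothetical costs per sampling unit
n <- 40
n_h <- n * (N_h * S_h / sqrt(c_h)) / sum(N_h * S_h / sqrt(c_h))
round(n_h)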
Table 4.2 shows the proportional and optimal sample sizes for the five strata of the study area Voorst, for a total sample size of 40. Stratum XF is the second-smallest stratum and therefore receives only seven sampling units with proportional allocation. However, the standard deviation in this stratum is the largest, and as a consequence with optimal allocation the sample size in this stratum is increased by three points, at the cost of stratum EA, which is relatively homogeneous.
Figure 4.5 shows the standard error of the 𝜋 estimator of the mean SOM con-
centration as a function of the total sample size, for simple random sampling
and for stratified simple random sampling with proportional and Neyman allo-
cation. A small extra gain in precision can be achieved using Neyman allocation
instead of proportional allocation. However, in practice Neyman allocation is often not achievable, because we do not know the standard deviations of the study variable within the strata. If a quantitative covariate $x$ is used for stratification (see Sections 4.4 and 13.2), the standard deviations $S_h(z)$ are approximated by $S_h(x)$, resulting in approximately optimal stratum sample sizes. The gain in precision compared to proportional allocation is then partly or entirely lost.
Optimal allocation and Neyman allocation assume univariate stratification, i.e.,
the stratified simple random sample is used to estimate the mean of a single
study variable. If we have multiple study variables, optimal allocation becomes
more complicated. In Bethel allocation, the total sampling costs, assuming a
linear costs model (Equation (4.18)), are minimised given a constraint on the
precision of the estimated mean for each study variable (Bethel, 1989), see
Section 4.8. Bethel allocation can be computed with function bethel of package
SamplingStrata (Barcaroli et al., 2020).
Exercises
FIGURE 4.5: Standard error of the 𝜋 estimator of the mean SOM concentra-
tion (g kg-1 ) as a function of the total sample size, for simple random sampling
(SI) and for stratified simple random sampling with proportional (STSI(prop))
and Neyman allocation (STSI(Neyman)) for Voorst.
4. Prove that the sum of the inclusion probabilities over all population units with stratified simple random sampling equals the sample size $n$.
4. Divide the cumulative sum of the last bin by the number of strata,
multiply this value by 1, 2, … , 𝐻 − 1, with H the number of strata,
and select the boundaries of the histogram bins closest to these
values.
The output of function strata.cumrootf contains, amongst others, a numeric vector with the stratum bounds (bh) and a factor with the stratum levels of the grid cells (stratumID). Finally, note that the values of the stratification variable must be positive. The minimum elevation is -5 m, so I added the absolute value of this minimum to the elevation.
library(stratification)
grdXuancheng <- grdXuancheng %>%
arrange(dem) %>%
mutate(dem_new = dem + abs(min(dem)))
crfstrata <- strata.cumrootf(
x = grdXuancheng$dem_new, n = 100, Ls = 5, nclass = 500)
bh <- crfstrata$bh
grdXuancheng$crfstrata <- crfstrata$stratumID
Exercises
library(sp)
gridded(grdXuancheng) <- ~ s1 + s2
subgrd <- spsample(
grdXuancheng, type = "regular", cellsize = 400, offset = c(0.5, 0.5))
subgrd <- data.frame(coordinates(subgrd), over(subgrd, grdXuancheng))
Five clusters are computed with k-means, using as clustering variables the five covariates mentioned above. The scales of these covariates differ strongly, and for this reason the covariates must be scaled before being used in clustering.
k-means algorithm is a deterministic algorithm, i.e., the same initial clustering
will end in the same final, optimised clustering. This final clustering can be
suboptimal, and therefore it is recommended to repeat the clustering as many
times as feasible, with different initial clusterings. Argument nstart is the
number of initial clusterings. The best clustering, i.e., the one with the smallest
within-cluster sum-of-squares, is kept.
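A minimal sketch of this clustering; the covariate names in covars are hypothetical placeholders for the five covariates:
covars <- c("cov1", "cov2", "cov3", "cov4", "cov5")  # hypothetical covariate names
X <- scale(subgrd[, covars])  # scale the covariates to zero mean and unit variance
set.seed(314)
myclusters <- kmeans(X, centers = 5, iter.max = 100, nstart = 100)
subgrd$cluster <- myclusters$cluster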
Figure 4.8 shows the five clusters obtained by k-means clustering of the raster
cells. These clusters can be used as strata in random sampling.
FIGURE 4.8: Five clusters obtained by k-means clustering of the raster cells
of Xuancheng, using five scaled covariates in clustering.
The sizes of the clusters used as strata differ strongly (Table 4.3). This table also shows the means of the unscaled covariates used in clustering.
Categorical variables can be accommodated in clustering using the technique
proposed by Huang (1998), implemented in package clustMixType (Szepan-
nek, 2018).
In the situation that we already have some data of the study variable, an alternative solution is to calibrate a model for the study variable.
TABLE 4.3: Size (Nh) and means of clustering variables of the five strata of Xuancheng obtained with k-means clustering of raster cells.
If the total number of grid cells divided by the number of strata is an integer,
the stratum sizes are exactly equal; otherwise, the difference is one grid cell.
Walvoort et al. (2010) describe the k-means algorithms implemented in this
package in detail. Argument object of function stratify specifies a spatial ob-
ject of the population units. In the R code below grdVoorst is converted to a
SpatialPixelsDataFrame with function gridded of the package sp. The spatial ob-
ject can also be of class SpatialPolygons. In that case, either argument nGridCells
or argument cellSize must be set, so that the vector map in object can be
discretised by a finite number of grid cells. Argument nTry specifies the number of initial stratifications in the k-means clustering, and therefore is comparable with argument nstart of function kmeans described above.
library(spcosa)
library(sp)
set.seed(314)
gridded(subgrd) <- ~ x1 + x2
mygeostrata <- stratify(
object = subgrd, nStrata = 50, nTry = 1, equalArea = TRUE)
set.seed(314)
mysample <- spcosa::spsample(mygeostrata, n = 2)
Figure 4.9 shows fifty compact geostrata of equal size for Xuancheng with the selected sampling points. Note that the sampling points are reasonably well spread throughout the study area¹.
Once the observations are done, the population mean can be estimated with
function estimate. For Xuancheng I simulated data from a normal distribution,
just to illustrate estimation with function estimate. Various statistics can be
estimated, among which the population mean (spatial mean), the standard
error, and the CDF. The CDF is estimated by transforming the data into
indicators (Subsection 3.1.2).
library(spcosa)
mysample <- spcosa::spsample(mygeostrata, n = 2)
mydata <- data.frame(z = rnorm(100, mean = 10, sd = 2))
mean <- estimate("spatial mean", mygeostrata, mysample, data = mydata)
se <- estimate("standard error", mygeostrata, mysample, data = mydata)
cdf <- estimate("scdf", mygeostrata, mysample, data = mydata)
The estimated population mean equals 9.8 with an estimated standard error
of 0.2.
¹ The compact geostrata and the sample are plotted with package ggplot2.
FIGURE 4.9: Compact geostrata of equal size for Xuancheng and stratified
simple random sample of two points per stratum.
Exercises
8. The geostrata in Figure 4.9 have equal size (area), which can
be enforced by argument equalArea = TRUE. Why are equal sizes
attractive? Work out the estimator of the population mean for
strata of equal size.
10. Laboratory costs for measuring the study variable can be saved by
bulking the soil aliquots (composite sampling). There are two options:
bulking all soil aliquots from the same stratum (bulking within strata)
or bulking by selecting one aliquot from each stratum (bulking across
strata). In spcosa bulking across strata is implemented. Write an
R script to construct 20 compact geographical strata for study area
Voorst. Use argument equalArea = TRUE. Select four points per stratum
using argument type = "composite", and convert the resulting object
to SpatialPoints. Extract the 𝑧-values in grdVoorst at the selected
sampling points using function over. Add a variable to the resulting
data frame indicating the composite (points 1 to 4 are from the first
stratum, points 5 to 8 from the second stratum, etc.), and estimate
the means for the four composites using function tapply. Finally,
estimate the population mean and its standard error.
• Can the sampling variance of the estimator of the mean be
estimated for bulking within the strata?
• The alternative to analysing the concentration of four composite
samples obtained by bulking across strata is to analyse all 20
× 4 aliquots separately. The strata have equal size, so the
inclusion probabilities are equal. As a consequence, the sample
mean is an unbiased estimator of the population mean. Is the
precision of this estimated population mean equal to that of
the estimated population mean with composite sampling? If
not, is it smaller or larger, and why?
• If you use argument equalArea = FALSE in combination with ar-
gument type = "composite", you get an error message. Why does
this combination of arguments not work?
The predicted log concentrations of the two heavy metals are used as stratifi-
cation variables in designing a new sample for design-based estimation of the
population means of Cd and Zn. For the log of Cd, there are negative predicted
concentrations (Figure 4.10). This leads to an error when running function
optimStrata. The minimum predicted log Cd concentration is -1.7, so I added
2 to the predictions. A variable indicating the domains of interest is added
to the data frame. The value of this variable is 1 for all grid cells, so that a
sample is designed for estimating the mean of the entire population. As a first
step, function buildFrameDF is used to create a data frame that can be handled
by function optimStrata. Argument X specifies the stratification variables, and
argument Y the study variables. In our case, the stratification variables and
the study variables are the same. This is typical for the situation where the
stratification variables are obtained by mapping the study variables.
library(SamplingStrata)
df <- data.frame(cd = lcd_kriged$var1.pred + 2,
zn = lzn_kriged$var1.pred,
dom = 1,
id = seq_len(nrow(lcd_kriged)))
frame <- buildFrameDF(
df = df, id = "id",
X = c("cd", "zn"), Y = c("cd", "zn"),
domainvalue = "dom")
Next, a data frame with the precision requirements for the estimated means is
created. The precision requirement is given as a coefficient of variation, i.e.,
the standard error of the estimated population mean, divided by the estimated
mean. The study variables as specified in Y are used to compute the estimated
means and the standard errors for a given stratification and allocation.
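A minimal sketch of such a data frame, assuming a required coefficient of variation of 0.02 for both study variables (consistent with the expected CVs reported further on):
cv <- as.data.frame(list(DOM = "DOM1", CV1 = 0.02, CV2 = 0.02, domainvalue = 1))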
set.seed(314)
res <- optimStrata(
method = "continuous", errors = cv, framesamp = frame, nStrata = 5,
iter = 50, pops = 20, showPlot = FALSE)
Column Population contains the sizes of the strata, i.e., the number of grid cells.
The total sample size equals 26. The sample size per stratum is computed with
Bethel allocation, see Section 4.3. The last four columns contain the lower and
upper bounds of the orthogonal intervals.
Figure 4.11 shows a 2D-plot of the bivariate strata. The strata can be plotted as a series of nested rectangles. All population units in the smallest rectangle belong to stratum 1; all units in the second-smallest rectangle that are not in the smallest rectangle belong to stratum 2, etc. If we have more than two stratification variables, the strata can no longer be visualised in a single 2D-plot.
FIGURE 4.11: 2D-plot of optimised bivariate strata of the study area Meuse.
It may happen that, after optimisation of the stratum bounds, some of the resulting strata contain no units. If a stratification with a smaller number of strata requires fewer sampling units, so that the sampling costs are lower while the precision requirements are still met, then that stratification is retained as the optimal one.
expected_CV(res$aggr_strata)
FIGURE 4.12: Map of optimised bivariate strata of the study area Meuse.
cv(Y1) cv(Y2)
DOM1 0.02 0.0087401
5
Systematic random sampling

A simple way of drawing probability samples whose units are spread uniformly over the study area is systematic random sampling (SY), which for a two-dimensional spatial population entails the selection of a regular grid randomly placed on the area. A systematic sample can be selected with function spsample of package sp with argument type = "regular" (Bivand et al., 2013). Argument offset is not used, so that the grid is randomly placed on the study area. This is illustrated with Voorst. First, data.frame grdVoorst is converted to a SpatialPixelsDataFrame with function gridded.
library(sp)
gridded(grdVoorst) <- ~ s1 + s2
n <- 40
set.seed(777)
mySYsample <- spsample(x = grdVoorst, n = n, type = "regular") %>%
as("data.frame")
Figure 5.1 shows the randomly selected systematic sample. The shape of the
grid is square, and the orientation is East-West (E-W), North-South (N-S).
There is no strict need for random selection of the orientation of the grid.
Random placement of the grid on the study area suffices for design-based
estimation.
Argument n in function spsample is used to set the sample size. Note that this
is the expected sample size, i.e., on average over repeated sampling the sample
size is 40. In Figure 5.1 the number of selected sampling points equals 38. Given
the expected sample size, the spacing of the square grid can be computed with
$\sqrt{A/n}$, with $A$ the area of the study area. This area $A$ can be computed as
the total number of cells of the discretisation grid multiplied by the area of a
grid cell. Note that the area of the study area is smaller than the number of
grid cells in the horizontal direction multiplied by the number of grid cells in
the vertical direction multiplied by the grid cell area, as we have non-availables
(built-up areas, roads, etc.).
cell_size <- 25
A <- nrow(grdVoorst) * cell_size^2
(spacing <- sqrt(A / n))
[1] 342.965
dy <- 1000 / 3
dx <- A / (n * dy)
mySYsample_rect <- spsample(
x = grdVoorst, cellsize = c(dx, dy), type = "regular")
The E-W spacing is somewhat larger than the N-S spacing: 352.875 m vs. 333.333 m. The variation in sample size with the random rectangular grid is much smaller than that with the square grid: the sample size now ranges from 33 to 46, whereas with the square grid it ranges from 20 to 48.
summary(sampleSizes)
Triangular grids are expected to yield the most precise estimates of the population mean given the expected sample size (Matérn, 1986). Given the spacing of a triangular grid, the expected sample size can be computed as the area $A$ of the study area divided by the area of the hexagonal grid cells with the sampling points at their centres. The area of a hexagon equals $6\sqrt{3}/4\; r^2$, with $r$ the radius of the circle circumscribing the hexagon (distance from centre to a corner of the hexagon). So, by choosing a radius of $\sqrt{A/((6\sqrt{3}/4)\,n)}$ the expected sample size equals $n$. The distance between neighbouring points of the triangular grid in the E-W direction, $dx$, then equals $r\sqrt{3}$. The N-S distance equals $\sqrt{3}/2\; dx$.
The following code can be used for random selection of triangular grids.
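A minimal sketch, assuming that argument type = "hexagonal" of function spsample selects points on a triangular grid (the nodes of a hexagonal lattice):
set.seed(314)
mySYsample_tri <- spsample(x = grdVoorst, n = n, type = "hexagonal") %>%
  as("data.frame")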
Figure 5.3 shows a triangular grid, selected randomly from Voorst with an
expected sample size of 40. The selected triangular grid has 42 points.
With systematic random sampling, the inclusion probability equals $\pi_k = E[n]/N$ for all units, with $E[n]$ the expected sample size. The population total can thus be estimated by
$$\hat{t}(z) = \sum_{k \in \mathcal{S}} \frac{z_k}{\pi_k} = N \sum_{k \in \mathcal{S}} \frac{z_k}{E[n]} \;, \qquad (5.1)$$
and the population mean by
$$\hat{\bar{z}} = \sum_{k \in \mathcal{S}} \frac{z_k}{E[n]} \;. \qquad (5.2)$$
In this $\pi$ estimator of the population mean, the sample sum of the observations is not divided by the number of selected units $n$, but by the expected number of units $E[n]$.
The population size can be estimated by
$$\widehat{N} = \sum_{k \in \mathcal{S}} \frac{1}{\pi_k} = n\, \frac{N}{E[n]} \;. \qquad (5.3)$$
Dividing the estimator of the population total by this estimator of the population size yields the ratio estimator of the population mean:
$$\hat{\bar{z}}_{\mathrm{ratio}} = \frac{\hat{t}(z)}{\widehat{N}} = \frac{1}{n} \sum_{k \in \mathcal{S}} z_k \;. \qquad (5.4)$$
So, the ratio estimator of the population mean is equal to the unweighted sample mean. In general, the variance of this ratio estimator is smaller than that of the $\pi$ estimator. On the other hand, the $\pi$ estimator is design-unbiased, whereas the ratio estimator is not, although its bias can be negligibly small. Only in the very special case where the sample size with systematic random sampling is fixed are the two estimators equivalent.
Recall that for Voorst we have exhaustive knowledge of the study variable 𝑧:
values of the soil organic matter (SOM) concentration were simulated for all
grid cells. To determine the 𝑧-values at the selected sampling points, an overlay
of the systematic random sample and the SpatialPixelsDataFrame is made, using
function over of package sp.
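A minimal sketch of the overlay and the two estimates (assumptions: the coordinates of mySYsample are in columns x1 and x2, and E[n] = 40):
mycoords <- mySYsample[, c("x1", "x2")]
mySYsample_sp <- SpatialPoints(coords = mycoords, proj4string = CRS(proj4string(grdVoorst)))
mySYsample$z <- over(mySYsample_sp, grdVoorst)$z
mz_pi <- sum(mySYsample$z) / 40  # pi estimator: divide by the expected sample size
mz_ratio <- mean(mySYsample$z)   # ratio estimator: divide by the realised sample size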
Using the systematic random sample of Figure 5.1, the 𝜋 estimated mean SOM
concentration equals 69.8 g kg-1 , the ratio estimate equals 73.5 g kg-1 . The
ratio estimate is larger than the 𝜋 estimate, because the size of the selected
sample is two units smaller (38) than the expected sample size (40).
The clustering is repeated 100 times (ntry = 100). The clustering with the
smallest mean of the squared distances of the sampling units to their cluster
centres (mean squared shortest distance, MSSD) is selected.
Figure 5.4 shows the clustering of the systematic random sample of Figure 5.1.
The two or three sampling units of a cluster are treated as a simple random
sample from a stratum, and the variance estimator for stratified random
sampling is used. The weights are computed as $w_h = n_h/n$. With $n$ even, the stratum weight is $1/H$ for all strata. For more details on variance estimation with stratified simple random sampling, refer to Section 4.1.
$$\widehat{V}(\hat{\bar{z}}) = \frac{\sum_{g=1}^{G} d^2_g}{n^2} \;, \qquad (5.6)$$
with $d^2_g$ the squared difference of group $g$, and $G$ the total number of groups.
To approximate the variance with Matérn’s method, a function is defined.
Before using this function the data frame with the sample data must be
extended with two variables: an index 𝑖 for the column number and an index 𝑗
for the row number of the square grid.
[1] 41.63163
The boxplots of the estimated means indicate that systematic random sam-
pling in combination with the ratio estimator is more precise than simple
random sampling. The variance of the 10,000 ratio estimates equals 49.0 (g
kg-1 )2 , whereas for simple random sampling this variance equals 55.4 (g kg-1 )2 .
Systematic random sampling in combination with the 𝜋 estimator performs
very poorly: the variance equals 142.6 (g kg-1 )2 . This can be explained by the
strong variation in sample size (Figure 5.2), which is not accounted for in the
𝜋 estimator.
The mean of the 10,000 ratio estimates is 81.2 g kg-1 , which is about equal to
the population mean 81.1 g kg-1 , showing that in this case the design-bias of
the ratio estimator is negligibly small indeed.
The variance of the 10,000 ratio estimates of the population mean with the
triangular grid and an expected sample size of 40 equals 46.9 (g kg-1 )2 . Treating
the triangular grid as a simple random sample strongly overestimates the
variance: the average approximated variance equals 60.1 (g kg-1 )2 . The stratified
simple random sample approximation performs much better in this case: the
average of the 10,000 approximated variances equals 46.8 (g kg-1 )2 . Matérn’s
method cannot be used to approximate the variance with a triangular grid.
Brus and Saby (2016) compared various variance approximations for systematic
random sampling, among which model-based prediction of the variance, using
a semivariogram that is estimated from the systematic sample, see Chapter 13.
Exercises
2. Do you like this solution? What about the variance of the estimator
of the mean, obtained by selecting two systematic random samples
of half the expected size, as compared with the variance of the
estimator of the mean, obtained with a single systematic random
sample? Hint: plot the two random square grids. What do you think
of the spatial coverage of the two samples?
6
Cluster random sampling
An invalid way of selecting a cluster of, say, four units is by randomly selecting a starting unit. The remaining three units of the cluster are then selected E of this starting unit, and units outside the study area are ignored. With this selection method, the set of selected units is not independent of the starting unit, and therefore this selection method is invalid.
Note that the size, i.e., the number of units, of a cluster need not be constant.
With the proper selection method described above, the selection probability of
a cluster is proportional to its size. With irregularly shaped study areas, the
size of the clusters can vary strongly. The size of the clusters can be controlled
by subdividing the study area into blocks, for instance, stripes perpendicular
to the direction of the transects, or square blocks in case the clusters are grids.
In this case, the remaining units are identified by extending the transect or
grid to the boundary of the block. With irregularly shaped areas, blocking will
not entirely eliminate the variation in cluster sizes.
Cluster random sampling is illustrated with the selection of E-W oriented
transects in Voorst. In order to delimit the length of the transects, the study
area is split into six 1 km × 1 km zones. In this case, the zones have an equal
size, but this is not needed. Note that these zones do not serve as strata. When
used as strata, from each zone, one or more clusters would be selected, see
Section 6.4.
In the code chunk below, function findInterval of the base package is used to
determine for all discretisation points in which zone they fall.
cell_size <- 25
w <- 1000 #width of zones
grdVoorst <- grdVoorst %>%
mutate(zone = s1 %>% findInterval(min(s1) + 1:5 * w + 0.5 * cell_size))
In total, there are 960 clusters in the population. Figure 6.1 shows the frequency
distribution of the size of the clusters.
Clusters are selected with probabilities proportional to their size and with replacement (ppswr). So, the sizes of all clusters must be known, which explains why all clusters must be enumerated. Selection of clusters by ppswr can be done by simple random sampling with replacement of elementary units (centres of grid cells) and identifying the clusters to which these units belong. Finally, all units of the selected clusters are included in the sample. Below, a function for selecting clusters by ppswr is defined. Note variable cldraw, which has value 1 for all units selected in the first draw, value 2 for all units selected in the second draw, etc. This variable is needed in estimating the population mean, as explained in Section 6.1.
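A minimal sketch of such a function (an assumption: the book's own cl_ppswr may differ in details; variable cluster in sframe is assumed to identify the cluster of each unit):
cl_ppswr <- function(sframe, n) {
  # select n elementary units by simple random sampling with replacement
  units <- sample(nrow(sframe), size = n, replace = TRUE)
  mysamples <- NULL
  for (i in seq_len(n)) {
    # all units of the cluster to which the unit of draw i belongs
    mysample <- sframe[sframe$cluster == sframe$cluster[units[i]], ]
    mysample$cldraw <- i
    mysamples <- rbind(mysamples, mysample)
  }
  mysamples
}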
n <- 6
set.seed(314)
mysample <- cl_ppswr(sframe = grdVoorst, n = n)
As our population actually is infinite, the centres of the selected grid cells are
jittered to a random point within the selected grid cells. Note that the same
noise is added to all units of a given cluster.
Figure 6.2 shows the selected sample. Note that in this case the second west-
most zone has two transects (clusters), whereas one zone has none, showing
that the zones are not used as strata. The total number of selected points equals
50. Similar to systematic random sampling, with cluster random sampling the
total sample size is random, so that we do not have perfect control of the total
sample size. This is because in this case the size, i.e., the number of points, of
the clusters is not constant but varies.
The output data frame of function cl_ppswr has a variable named start. This is an indicator with value 1 if this point of the cluster was selected first, and 0 otherwise. When in the field it appears that the first selected point of a cluster does not belong to the target population, all other points of that cluster are discarded as well. This is to keep the selection probabilities of the clusters exactly proportional to their size. Column cldraw is needed in estimation because clusters are selected with replacement. In case a cluster is selected more than once, multiple means of that cluster are used in estimation, see the next section.
The population total can be estimated by the pwr estimator:
$$\hat{t}(z) = \frac{1}{n} \sum_{j \in \mathcal{S}} \frac{t_j(z)}{p_j} \;, \qquad (6.1)$$
with $n$ the number of cluster draws, $p_j$ the draw-by-draw selection probability of cluster $j$, and $t_j(z)$ the total of cluster $j$:
$$t_j(z) = \sum_{k=1}^{M_j} z_{kj} \;, \qquad (6.2)$$
with $M_j$ the size (number of units) of cluster $j$ and $z_{kj}$ the study variable value of unit $k$ in cluster $j$.
The draw-by-draw selection probability of a cluster equals
$$p_j = \frac{M_j}{M} \;, \qquad (6.3)$$
with $M$ the total number of population units (for Voorst, $M$ equals 7,528). Inserting this in Equation (6.1) yields
$$\hat{t}(z) = \frac{M}{n} \sum_{j \in \mathcal{S}} \frac{t_j(z)}{M_j} = \frac{M}{n} \sum_{j \in \mathcal{S}} \bar{z}_j \;, \qquad (6.4)$$
with $\bar{z}_j$ the mean of cluster $j$. Note that if a cluster is selected more than once, multiple means of that cluster are used in the estimator.
Dividing this estimator by the total number of population units, $M$, yields the estimator of the population mean:
$$\hat{\bar{\bar{z}}} = \frac{1}{n} \sum_{j \in \mathcal{S}} \bar{z}_j \;. \qquad (6.5)$$
Note the two bars in $\hat{\bar{\bar{z}}}$, indicating that the observations are averaged twice.
For an infinite population of points discretised by the centres of a finite number of grid cells, $z_{kj}$ in Equation (6.2) is the study variable value at a randomly selected point within the grid cell multiplied by the area of the grid cell. The estimated population total thus obtained is equal to the estimated population mean (Equation (6.5)) multiplied by the area of the study area.
The sampling variance of the estimator of the mean with ppswr sampling of clusters is equal to (Cochran (1977), equation (9A.6))
$$V(\hat{\bar{\bar{z}}}) = \frac{1}{n} \sum_{j=1}^{N} \frac{M_j}{M} (\bar{z}_j - \bar{z})^2 \;, \qquad (6.6)$$
with $N$ the total number of clusters (for Voorst, $N = 960$), $\bar{z}_j$ the mean of cluster $j$, and $\bar{z}$ the population mean. Note that $M_j/M$ is the selection probability of cluster $j$.
This sampling variance can be estimated by (Cochran (1977), equation (9A.22))
$$\widehat{V}(\hat{\bar{\bar{z}}}) = \frac{\widehat{S^2}(\bar{z})}{n} \;, \qquad (6.7)$$
where $\widehat{S^2}(\bar{z})$ is the estimated variance of the cluster means (the between-cluster variance):
$$\widehat{S^2}(\bar{z}) = \frac{1}{n-1} \sum_{j \in \mathcal{S}} (\bar{z}_j - \hat{\bar{\bar{z}}})^2 \;. \qquad (6.8)$$
In R the population mean and the sampling variance of the estimator of the population mean can be estimated as follows.
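A minimal sketch of these computations (assuming the data frame mysample returned by cl_ppswr):
mz_cl <- tapply(mysample$z, INDEX = mysample$cldraw, FUN = mean)  # cluster means per draw
mz <- mean(mz_cl)              # estimated population mean, Eq. (6.5)
se_mz <- sqrt(var(mz_cl) / n)  # estimated standard error, Eqs. (6.7) and (6.8)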
The estimated mean equals 87.1 g kg-1 , and the estimated standard error
equals 17.4 g kg-1 . Note that the size of the clusters does not appear in these
formulas. This simplicity is due to the fact that the clusters are selected with
probabilities proportional to size. The effect of the cluster size on the variance
is implicitly accounted for. To understand this, consider that larger clusters
result in smaller variance among their means.
The same estimates are obtained with functions svydesign and svymean of package survey (Lumley, 2021). Argument weights specifies the weights of the sampled clusters, equal to $M/(M_j\, n)$ (Equation (6.4)).
library(survey)
M <- nrow(grdVoorst)
mysample$weights <- M / (M_cl[mysample$cluster] * n)
design_cluster <- svydesign(id = ~ cldraw, weights = ~ weights, data = mysample)
svymean(~ z, design_cluster, deff = "replace")
mean SE DEff
z 87.077 17.428 4.0767
The design effect DEff as estimated from the selected cluster sample is consid-
erably larger than 1. About 4 times more sampling points are needed with
cluster random sampling compared to simple random sampling to estimate
the population mean with the same precision.
A confidence interval estimate of the population mean can be computed with
method confint. The number of degrees of freedom equals the number of cluster
draws minus 1.
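A minimal sketch, assuming function degf returns the number of cluster draws minus 1 for this design:
confint(svymean(~ z, design_cluster), df = degf(design_cluster))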
2.5 % 97.5 %
z 52.91908 121.2347
Figure 6.3 shows the approximated sampling distribution of the pwr estimator
of the mean soil organic matter (SOM) concentration with cluster random
sampling and of the 𝜋 estimator with simple random sampling, obtained by
repeating the random sampling with each design and estimation 10,000 times.
The size of the simple random samples is equal to the expected sample size of
the cluster random sampling design (rounded to nearest integer).
The variance of the 10,000 estimated population means with cluster random
sampling equals 126.2 (g kg-1 )2 . This is considerably larger than with simple
random sampling: 44.8 (g kg-1 )2 . The large variance is caused by the strong
spatial clustering of points. This may save travel time in large study areas,
but in Voorst the saved travel time will be very limited, and therefore cluster
random sampling in Voorst is not a good idea. The average of the estimated
variances with cluster random sampling equals 125.9 (g kg-1 )2 . The difference
with the variance of the 10,000 estimated means is small because the estimator
of the variance, Equation (6.7), is unbiased. Figure 6.4 shows the approximated
sampling distribution of the sample size. The expected sample size can be
computed as follows:
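A minimal sketch (assuming M_cl is the vector with the sizes of all 960 clusters, computed earlier):
p_cl <- M_cl / M                      # draw-by-draw selection probabilities of the clusters
(m_expected <- n * sum(p_cl * M_cl))  # expected total number of selected points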
[1] 49.16844
So, the unequal draw-by-draw selection probabilities of the clusters are ac-
counted for in computing the expected sample size.
Exercises
With ppswr sampling, a cluster can be selected more than once; the probability of this happening is small only if the ratio $n/N$ is small, with $N$ being the total number of clusters and $n$ the sample size, i.e., the number of cluster draws. If a cluster is selected more than once, there is less information about the population mean in this sample than in a sample with all clusters different. Selection of clusters with probabilities proportional to size without replacement (ppswor) is not straightforward¹.
Many algorithms have been developed for ppswor sampling, see Tillé (2006) for
an overview, and quite a few of them are implemented in package sampling
(Tillé and Matei, 2021). In the next code chunk, function UPpivotal is used
to select a cluster random sample with ppswor. For an explanation of this
algorithm, see Subsection 8.2.2.
library(sampling)
n <- 6
pi <- n * M_cl / M
set.seed(314)
eps <- 1e-6
sampleind <- UPpivotal(pik = pi, eps = eps)
clusters <- sort(unique(grdVoorst$cluster))
clusters_sampled <- clusters[sampleind == 1]
mysample <- grdVoorst[grdVoorst$cluster %in% clusters_sampled, ]
¹ The problem is the computation of the joint inclusion probabilities of pairs of points.
mean SE
z 96.83 13.454
mean SE
z 96.83 13.436
With simple random sampling of clusters, the population total can be estimated by
$$\hat{t}(z) = \frac{N}{n} \sum_{j \in \mathcal{S}} t_j(z) \;. \qquad (6.9)$$
The $\pi$ estimator of the population mean is this estimated total divided by the population size:
$$\hat{\bar{z}}_{\pi} = \frac{\hat{t}(z)}{M} \;. \qquad (6.10)$$
Alternatively, we may estimate the population mean by dividing the estimate
of the population total by the estimated population size:
$$\widehat{M} = \sum_{j \in \mathcal{S}} \frac{M_j}{\pi_j} = \frac{N}{n} \sum_{j \in \mathcal{S}} M_j \;. \qquad (6.11)$$
The ratio estimator of the population mean then equals
$$\hat{\bar{z}}_{\mathrm{ratio}} = \frac{\hat{t}(z)}{\widehat{M}} \;. \qquad (6.12)$$
The 𝜋 estimator and the ratio estimator are equal when the clusters are selected
with probabilities proportional to size. This is because the estimated population
size is equal to the true population size.
[1] 7528
However, when clusters of different size are selected with equal probabilities,
the two estimators are different. This is shown below. Six clusters are selected
by simple random sampling without replacement.
set.seed(314)
clusters <- sort(unique(grdVoorst$cluster))
units_cl <- sample(length(clusters), size = n, replace = FALSE)
clusters_sampled <- clusters[units_cl]
mysample <- grdVoorst[grdVoorst$cluster %in% clusters_sampled, ]
The 𝜋 estimate and the ratio estimate of the population mean are computed
for the selected sample.
N <- length(clusters)
mysample$pi <- n / N
tz_HT <- sum(mysample$z / mysample$pi)
mz_HT <- tz_HT / M
M_HT <- sum(1 / mysample$pi)
mz_ratio <- tz_HT / M_HT
The 𝜋 estimate equals 68.750 g kg-1 , and the ratio estimate equals 70.319 g
kg-1 . The 𝜋 estimate of the population mean can also be computed by first
computing totals of clusters, see Equations (6.9) and (6.10).
[1] 68.74994
The sampling variance of the $\pi$ estimator of the population total can be estimated by
$$\widehat{V}(\hat{t}(z)) = N^2 \left(1 - \frac{n}{N}\right) \frac{\widehat{S^2}(t(z))}{n} \;, \qquad (6.13)$$
with $\widehat{S^2}(t(z))$ the estimated variance of the cluster totals, and dividing this variance by the squared number of population units:
$$\widehat{V}(\hat{\bar{z}}) = \frac{1}{M^2} \widehat{V}(\hat{t}(z)) \;. \qquad (6.14)$$
fpc <- 1 - n / N
v_tz <- N^2 * fpc * var(tz_cluster) / n
se_mz_HT <- sqrt(v_tz / M^2)
For the variance of the ratio estimator, residuals of the cluster totals are computed:
$$e_j = t_j(z) - \hat{b}\, M_j \;, \qquad (6.15)$$
with $\hat{b}$ the ratio of the estimated population mean of the cluster totals to the estimated population mean of the cluster sizes:
$$\hat{b} = \frac{\frac{1}{n} \sum_{j \in \mathcal{S}} t_j(z)}{\frac{1}{n} \sum_{j \in \mathcal{S}} M_j} \;. \qquad (6.16)$$
The variance of the ratio estimator of the population mean can be estimated
by
$$\widehat{V}(\hat{\bar{z}}_{\mathrm{ratio}}) = \left(1 - \frac{n}{N}\right) \frac{1}{\left(\frac{1}{n} \sum_{j \in \mathcal{S}} M_j\right)^2} \frac{\widehat{S}^2_e}{n} \;, \qquad (6.17)$$
with $\widehat{S}^2_e$ the estimated variance of the residuals.
[1] 12.39371
The ratio estimate can also be computed with function svymean of package
survey, which also provides an estimate of the standard error of the estimated
mean.
mean SE
z 70.319 12.394
FIGURE 6.5: Stratified cluster random sample from Voorst, with three strata. From each stratum, a cluster is selected two times by ppswr.
The population mean is estimated by first estimating the stratum means using
Equation (6.5) at the level of the strata, followed by computing the weighted
average of the estimated stratum means using Equation (4.3). The variance of
the estimator of the population mean is estimated in the same way, by first
estimating the variance of the estimator of the stratum means using Equations
(6.7) and (6.8) at the level of the strata, followed by computing the weighted
average of the estimated variances of the estimated stratum means (Equation
(4.4)).
The estimated mean equals 82.8 g kg-1 , and the estimated standard error equals
4.7 g kg-1 . The same estimates are obtained with function svymean. Weights
for the clusters are computed as before, but now at the level of the strata.
Note argument nest = TRUE, which means that the clusters are nested within
the strata.
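A minimal sketch (assumptions: mysample contains a stratum variable, M_h is a vector with the stratum sizes in number of points, and two clusters are drawn per stratum):
mysample$weights <- M_h[mysample$stratum] / (M_cl[mysample$cluster] * 2)
design_strcluster <- svydesign(id = ~ cldraw, strata = ~ stratum,
  weights = ~ weights, data = mysample, nest = TRUE)
svymean(~ z, design_strcluster)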
mean SE
z 82.796 4.6737
Exercises
7
Two-stage cluster random sampling
In the first stage, an SSU is selected in order to select a PSU. This may seem unnecessarily complicated. The reason for this is that this procedure automatically adjusts for the size of the PSUs (number of SSUs within a PSU), i.e., a PSU is selected with probability proportional to its size. In the second stage, a pre-determined number of SSUs, $m_j$, is selected every time PSU $j$ is selected.
Note that the SSU selected in the first step of the two algorithms primarily
serves to identify the PSU, but these SSUs can also be used as selected SSUs.
The selection of a two-stage cluster random sample is illustrated again with
Voorst. Twenty-four 0.5 km squares are constructed that serve as PSUs.
Due to built-up areas, roads, etc., the PSUs in Voorst have unequal size, i.e.,
the number of SSUs (points, in our case) within the PSUs varies among the
PSUs.
cell_size <- 25
w <- 500 #width of zones
grdVoorst <- grdVoorst %>%
mutate(zone_s1 = s1 %>% findInterval(min(s1) + 1:11 * w + 0.5 * cell_size),
zone_s2 = s2 %>% findInterval(min(s2) + w + 0.5 * cell_size),
psu = str_c(zone_s1, zone_s2, sep = "_"))
Note that both the PSUs and the SSUs are selected with replacement. If a
grid cell centre is selected, one point is selected fully randomly from that grid
cell. This is done by shifting the centre of the grid cell to a random point
within the selected grid cell with function jitter, see code chunk hereafter. In
every grid cell, there is an infinite number of points, so we must select the
grid cell centres with replacement. If a grid cell is selected more than once,
more than one point is selected from the associated grid cell. Column psudraw
in the output data frame of function twostage is needed in estimation because
PSUs are selected with replacement. In case a PSU is selected more than once,
multiple estimates of the mean of that PSU are used in estimation, see next
section.
In the next code chunk, function twostage is used to select four times a PSU ($n = 4$), with probabilities proportional to size and with replacement (ppswr). The second stage sample size equals 10 for all PSUs ($m_j = 10,\; j = 1, \dots, N$). These SSUs are selected by simple random sampling.
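A minimal sketch of such a twostage function (an assumption: the book's own version may differ in details):
twostage <- function(sframe, psu, n, m) {
  # stage 1: select n SSUs with replacement to identify the PSUs (pps selection)
  units <- sample(nrow(sframe), size = n, replace = TRUE)
  mysamples <- NULL
  for (i in seq_len(n)) {
    psu_i <- sframe[[psu]][units[i]]
    # stage 2: select m SSUs from the selected PSU by simple random sampling with replacement
    ssunits <- sample(which(sframe[[psu]] == psu_i), size = m, replace = TRUE)
    mysample <- sframe[ssunits, ]
    mysample$psudraw <- i
    mysamples <- rbind(mysamples, mysample)
  }
  mysamples
}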
n <- 4
m <- 10
cell_size <- 25
set.seed(314)
mysample <- grdVoorst %>%
twostage(psu = "psu", n = n, m = m) %>%
mutate(s1 = s1 %>% jitter(amount = cell_size / 2),
s2 = s2 %>% jitter(amount = cell_size / 2))
FIGURE 7.1: Two-stage cluster random sample from Voorst. PSUs are 0.5
km squares, built-up areas, roads, etc. excluded. Four times a PSU is selected
by ppswr. Each time a PSU is selected, ten SSUs (points) are selected from
that PSU by simple random sampling.
The population total can be estimated by the pwr estimator
$$\hat{t}(z) = \frac{M}{n} \sum_{j \in \mathcal{S}} \frac{\hat{t}_j(z)}{M_j} = \frac{M}{n} \sum_{j \in \mathcal{S}} \hat{\bar{z}}_j \;, \qquad (7.1)$$
where $n$ is the number of PSU selections and $M_j$ is the total number of SSUs in PSU $j$. This shows that the mean of cluster $j$, $\bar{z}_j$, is replaced by the estimated mean of PSU $j$, $\hat{\bar{z}}_j$. Dividing this estimator by the total number of population units $M$ gives the pwr estimator of the population mean:
$$\hat{\bar{\bar{z}}} = \frac{1}{n} \sum_{j \in \mathcal{S}} \hat{\bar{z}}_j \;, \qquad (7.2)$$
with $\hat{\bar{z}}_j$ the estimated mean of PSU $j$. With simple random sampling of SSUs, this mean can be estimated by the sample mean of this PSU. Note the two bars in $\hat{\bar{\bar{z}}}$, indicating that the population mean is estimated as the mean of estimated PSU means. When $m_j$ is equal for all PSUs, the sampling design is self-weighting, i.e., the average of $z$ over all selected SSUs is an unbiased estimator of the population mean.
For an infinite population of points, the population total is estimated by
multiplying the estimated population mean (Equation (7.2)) by the area of
the study area.
The sampling variance of the estimator of the mean with two-stage cluster random sampling, with PSUs selected with probabilities proportional to size with replacement, equals
$$V(\hat{\bar{\bar{z}}}) = \frac{S^2_{\mathrm{b}}}{n} + \frac{S^2_{\mathrm{w}}}{n\, m} \;, \qquad (7.3)$$
with
$$S^2_{\mathrm{b}} = \sum_{j=1}^{N} p_j (\bar{z}_j - \bar{z})^2 \qquad (7.4)$$
and
$$S^2_{\mathrm{w}} = \sum_{j=1}^{N} p_j S^2_j \;, \qquad (7.5)$$
with $N$ the total number of PSUs in the population, $p_j = M_j/M$ the draw-by-draw selection probability of PSU $j$, $\bar{z}_j$ the mean of PSU $j$, $\bar{z}$ the population mean of $z$, and $S^2_j$ the variance of $z$ within PSU $j$:
$$S^2_j = \frac{1}{M_j} \sum_{k=1}^{M_j} (z_{kj} - \bar{z}_j)^2 \;. \qquad (7.6)$$
The first term of Equation (7.3) is equal to the variance of Equation (6.6). This
variance component accounts for the variance of the true PSU means within
the population. The second variance component quantifies our additional
uncertainty about the population mean, as we do not observe all SSUs of
the selected PSUs, but only a subset (sample) of these units.
The sampling variance of the estimator of the population mean can simply be estimated by¹
$$\widehat{V}(\hat{\bar{\bar{z}}}) = \frac{\widehat{S^2}(\hat{\bar{z}})}{n} \;, \qquad (7.7)$$
with $\widehat{S^2}(\hat{\bar{z}})$ the estimated variance of the estimated PSU means:
$$\widehat{S^2}(\hat{\bar{z}}) = \frac{1}{n-1} \sum_{j \in \mathcal{S}} (\hat{\bar{z}}_j - \hat{\bar{\bar{z}}})^2 \;, \qquad (7.8)$$
¹ Equation (11.33) in Cochran (1977) is the variance estimator for the estimator of the population total. In Exercise 5 you are asked to derive the variance estimator for the estimator of the population mean from this variance estimator.
with $\hat{\bar{z}}_j$ the estimated mean of PSU $j$ and $\hat{\bar{\bar{z}}}$ the estimated population mean (Equation (7.2)).
Neither the sizes of the PSUs, $M_j$, nor the secondary sample sizes $m_j$ occur in Equations (7.7) and (7.8). This simplicity is due to the fact that the PSUs are selected with replacement and with probabilities proportional to their size. The effect of the secondary sample sizes on the variance is implicitly accounted for. To understand this, note that the larger $m_j$, the less variable $\hat{\bar{z}}_j$, and the smaller its contribution to the variance.
Let us assume a linear model for the total costs: $C = c_0 + c_1 n + c_2 n m$, with $c_0$ the fixed costs, $c_1$ the costs per PSU, and $c_2$ the costs per SSU. We want to minimise the total costs, under the constraint that the variance of the estimator of the population mean may not exceed $V_{\max}$. The total costs can then be minimised by selecting (de Gruijter et al., 2006)
$$n = \frac{1}{V_{\max}} \left( S_{\mathrm{w}} S_{\mathrm{b}} \sqrt{\frac{c_2}{c_1}} + S^2_{\mathrm{b}} \right) \qquad (7.9)$$
PSUs and
$$m = \frac{S_{\mathrm{w}}}{S_{\mathrm{b}}} \sqrt{\frac{c_1}{c_2}} \qquad (7.10)$$
SSUs per PSU. Conversely, given a fixed budget $C_{\max}$, the variance of the estimator of the mean is minimised by selecting
$$n = \frac{C_{\max}\, S_{\mathrm{b}}}{S_{\mathrm{w}} \sqrt{c_1 c_2} + S_{\mathrm{b}}\, c_1} \qquad (7.11)$$
PSUs, and $m$ as above.
In R the population mean and the sampling variance of the estimator of the
mean can be estimated as follows.
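A minimal sketch (assuming the data frame mysample returned by twostage):
mz_psu <- tapply(mysample$z, INDEX = mysample$psudraw, FUN = mean)  # estimated PSU means
mz <- mean(mz_psu)              # pwr estimate of the population mean, Eq. (7.2)
se_mz <- sqrt(var(mz_psu) / n)  # estimated standard error, Eqs. (7.7) and (7.8)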
The estimated mean equals 48.6 g kg-1 , and the estimated standard error
equals 0.0 g kg-1 . The sampling design is self-weighting, and so the estimated
mean is equal to the sample mean.
print(mean(mysample$z))
[1] 48.55792
The same estimate is obtained with functions svydesign and svymean of package
survey (Lumley, 2021). The estimator of the population total can be written
as a weighted sum of the observations with all weights equal to 𝑀/(𝑛 𝑚).
These weights are passed to function svydesign with argument weight.
library(survey)
M <- nrow(grdVoorst)
mysample$weights <- M / (n * m)
design_2stage <- svydesign(
id = ~ psudraw + ssunits, weight = ~ weights, data = mysample)
svymean(~ z, design_2stage, deff = "replace")
mean SE DEff
z 48.558 0.000 0
2.5 % 97.5 %
z 48.55792 48.55792
Figure 7.2 shows the approximated sampling distribution of the pwr estimator
of the mean soil organic matter (SOM) concentration with two-stage cluster
random sampling and of the 𝜋 estimator with simple random sampling from
Voorst, obtained by repeating the random sampling with each design and
estimation 10,000 times. For simple random sampling the sample size is equal
to 𝑛 × 𝑚.
The variance of the 10,000 means with two-stage cluster random sampling
equals 179.6 (g kg-1 )2 . This is considerably larger than with simple random
sampling: 56.3 (g kg-1 )2 . The average of the estimated variances with two-stage
cluster random sampling equals 182.5 (g kg-1 )2 .
Optimal sample sizes for two-stage cluster random sampling (ppswr in first
stage, simple random sampling without replacement in second stage) can be
computed with function clusOpt2 of R package PracTools (Valliant et al.
(2021), Valliant et al. (2018)). This function requires as input various variance
measures, which can be computed with function BW2stagePPS, in case the study
variable is known for the whole population or estimated from a sample with
function BW2stagePPSe. This is left as an exercise (Exercise 5).
Exercises
The sampling variance of the pwr estimator of the population total with two-stage cluster random sampling equals
$$V(\hat{t}(z)) = \frac{1}{n} \sum_{j=1}^{N} p_j \left( \frac{t_j(z)}{p_j} - t(z) \right)^2 + \frac{1}{n} \sum_{j=1}^{N} \frac{M^2_j (1 - f_{2j}) S^2_j}{m_j\, p_j} \;, \qquad (7.12)$$
with $\hat{t}(z)$ and $t(z)$ the estimated and the true population total of $z$, respectively, $t_j(z)$ the total of PSU $j$, and $p_j = M_j/M$. Use $m_j = m,\; j = 1, \dots, N$, and $f_{2j} = 0$, i.e., sampling from an infinite population, or sampling of SSUs within PSUs by simple random sampling with replacement from a finite population. Derive the variance of the estimator for the population mean, Equation (7.3), from Equation (7.12).
library(sampling)
M_psu <- tapply(grdVoorst$z, INDEX = grdVoorst$psu, FUN = length)
n <- 6
pi <- n * M_psu / M
set.seed(314)
sampleind <- UPpivotal(pik = pi, eps = 1e-6)
psus <- sort(unique(grdVoorst$psu))
sampledpsus <- psus[sampleind == 1]
mysample_stage1 <- grdVoorst[grdVoorst$psu %in% sampledpsus, ]
units <- sampling::strata(mysample_stage1, stratanames = "psu",
size = rep(m, n), method = "srswor")
mysample <- getdata(mysample_stage1, units)
mysample$ssunits <- units$ID_unit
mysample$pi <- n * m / M
print(mean_HT <- sum(mysample$z / mysample$pi) / M)
[1] 100.039
The population mean can be estimated with function svymean of package survey.
To estimate the variance, a simple solution is to treat the two-stage cluster
random sample as a pps sample with replacement, so that variance can be
estimated with Equation (7.7). With small sampling fractions of PSUs, the
overestimation of the variance is negligible. With larger sampling fractions,
Brewer’s method is recommended, see Berger (2004) (option 2).
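A minimal sketch of this with-replacement approximation (assuming the sample selected in the previous code chunk, with inclusion probabilities in variable pi):
mysample$weights <- 1 / mysample$pi
design_2stage_wr <- svydesign(id = ~ psu, weights = ~ weights, data = mysample)
svymean(~ z, design_2stage_wr)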
mean SE
z 100.04 19.883
With simple random sampling without replacement of PSUs, the population total can be estimated by
$$\hat{t}(z) = \sum_{j=1}^{n} \frac{\hat{t}_j(z)}{\pi_j} = \frac{N}{n} \sum_{j=1}^{n} \hat{t}_j(z) \;, \qquad (7.13)$$
with $\hat{t}_j(z)$ an estimator of the total of PSU $j$. The population mean can be estimated by dividing this estimator by the population size $M$.
Alternatively, we may estimate the population mean by dividing the estimate
of the population total by the estimated population size. The population size
can be estimated by the 𝜋 estimator, see Equation (6.11). The 𝜋 estimator
and the ratio estimator are equal when the PSUs are selected by ppswr, but
not so when the PSUs of different size are selected with equal probabilities.
This is shown below. First, a sample is selected by selecting both PSUs and
SSUs by simple random sampling without replacement.
library(sampling)
set.seed(314)
psus <- sort(unique(grdVoorst$psu))
ids_psu <- sample(length(psus), size = n, replace = FALSE)
sampledpsus <- psus[ids_psu]
mysample_stage1 <- grdVoorst[grdVoorst$psu %in% sampledpsus, ]
units <- sampling::strata(mysample_stage1, stratanames = "psu",
size = rep(m, n), method = "srswor")
mysample <- getdata(mysample_stage1, units)
mysample$ssunits <- units$ID_unit
The population mean is estimated by the 𝜋 estimator and the ratio estimator.
N <- length(unique(grdVoorst$psu))
M_psu <- tapply(grdVoorst$z, INDEX = grdVoorst$psu, FUN = length)
pi_psu <- n / N
pi_ssu <- m / M_psu[mysample$psu]
est <- mysample %>%
mutate(pi = pi_psu * pi_ssu,
z_piexpanded = z / pi) %>%
summarise(tz_HT = sum(z_piexpanded),
mz_HT = tz_HT / M,
M_HT = sum(1 / pi),
mz_ratio = tz_HT / M_HT)
The 𝜋 estimate equals 79.0 g kg-1 , and the ratio estimate equals 79.8 g kg-1 . The
𝜋 estimate of the population mean can also be computed by first estimating
totals of PSUs, see Equation (7.13).
[1] 78.99646
The sampling variance of the $\pi$ estimator of the population total can be estimated by
$$\widehat{V}(\hat{t}(z)) = N^2 \left(1 - \frac{n}{N}\right) \frac{\widehat{S^2}(\hat{t}_j(z))}{n} \;, \qquad (7.14)$$
with $\widehat{S^2}(\hat{t}_j(z))$ the estimated variance of the estimated PSU totals, and dividing this variance by the squared number of population units:
$$\widehat{V}(\hat{\bar{z}}) = \frac{1}{M^2} \widehat{V}(\hat{t}(z)) \;, \qquad (7.15)$$
as shown in the code chunk below (the final line computes the standard error).
fpc <- 1 - n / N
v_tz <- N^2 * fpc * var(tz_psu) / n
(se_mz_HT <- sqrt(v_tz / M^2))
[1] 9.467406
The ratio estimator of the population mean and its standard error can be
computed with function svymean of package survey.
mysample$fpc1 <- N
mysample$fpc2 <- M_psu[mysample$psu]
design_2stage <- svydesign(
id = ~ psu + ssunits, fpc = ~ fpc1 + fpc2, data = mysample)
svymean(~ z, design_2stage)
mean SE
z 79.845 7.7341
The estimated standard error of the ratio estimator is slightly smaller than
the standard error of the 𝜋 estimator.
FIGURE 7.3: Stratified two-stage random sample from Voorst. Strata are groups of eight PSUs (0.5 km squares) within 2 km × 1 km blocks. From each stratum, a PSU is selected two times by ppswr, and six SSUs (points) are selected per PSU draw by simple random sampling.
mysample$s1 <- jitter(mysample$s1, amount = cell_size / 2)
mysample$s2 <- jitter(mysample$s2, amount = cell_size / 2)
The population mean can be estimated in much the same way as with stratified
cluster random sampling. With function svymean this is an easy task.
mean SE
z 73.654 0
8
Sampling with probabilities proportional to size
In simple random sampling, the inclusion probabilities are equal for all popu-
lation units. The advantage of this is simple and straightforward statistical
inference. With equal inclusion probabilities the unweighted sample mean is
an unbiased estimator of the spatial mean, i.e., the sampling design is self-
weighting. However, in some situations equal probability sampling is not very
efficient, i.e., given the sample size the precision of the estimated mean or total
will be relatively low. An example is the following. In order to estimate the
total area of a given crop in a country, a raster of square cells of, for instance,
1 km × 1 km is constructed and projected on the country. The square cells
are the population units, and these units serve as the sampling units. Note
that near the country border cells cross the border. Some of them may contain
only a few hectares of the target population, the country under study. We do
not want to select many of these squares with only a few hectares of the study
area, as intuitively it is clear that this will result in a low precision of the
estimated crop area. In such a situation it can be more efficient to select units with probabilities proportional to the area of the target population within the squares, so that small units near the border have a smaller probability of being selected than interior units. Actually, the sampling units are not the square cells, but the pieces of land obtained by overlaying the cells with the GIS map of the country under study. As a consequence, the sampling units have unequal size, and these units are selected with probabilities proportional to their size (pps).
In Chapters 6 and 7 pps sampling was already used to select clusters (primary
sampling units) of population units. In this chapter the individual population
units (elementary sampling units) are selected with probabilities proportional
to size.
If we have a GIS map of land use categories such as agriculture, built-up areas,
water bodies, forests, etc., we may use this file to further adapt the selection
probabilities. The crop will be grown in agricultural areas only, so we expect
small crop areas in cells largely covered by non-agricultural land. As a size
measure in computing the selection probabilities, we may use the agricultural
area, as represented in the GIS map, in the country under study within the
cells. Note that size now has a different meaning. It does not refer to the area
of the sampling units anymore, but to an ancillary variable that we expect to
be related to the study variable, i.e., the crop area. When the crop area per
cell is proportional to the agricultural area per cell, then the precision of the
estimated total area of the crop can be increased by selecting the cells with
probabilities proportional to the agricultural area.
In this example the sampling units have an area. However, sampling with
probabilities proportional to size is not restricted to areal sampling units, but
can also be used for selecting points. If we have a map of an ancillary variable
that is expected to be positively related to the study variable, this ancillary
variable can be used as a size measure. For instance, in areas where soil organic
matter shows a positive relation with elevation, it can be efficient to select
sampling points with a selection probability proportional to this environmental
variable. The ancillary variable must be strictly positive for all points.
Sampling units can be selected with probabilities proportional to their size
(pps) with or without replacement. This distinction is immaterial for infinite
populations, as in sampling points from an area. pps sampling with replacement
(ppswr) is much easier to implement than pps sampling without replacement
(ppswor). The problem with ppswor is that after each draw the selected unit is removed from the sampling frame, so that the sum of the size variable over all remaining units changes and, as a result, so do the draw-by-draw selection probabilities of the remaining units.
pps sampling is illustrated with the simulated map of poppy area per 5 km
square in the province of Kandahar (Figure 1.6). The first six rows of the data
frame are shown below. Variable poppy is the study variable, variable agri is
the agricultural area within the 5 km squares, used as a size variable.
grdKandahar
# A tibble: 965 x 4
s1 s2 poppy agri
<dbl> <dbl> <dbl> <dbl>
1 809232. 3407627. 0.905 65.7
2 814232. 3412627. 0.00453 15.6
3 794232. 3417627. 11.3 17.6
4 809232. 3417627. 0.110 14.0
5 814232. 3417627. 0.0344 22.2
6 819232. 3417627. 0.143 13.3
7 794232. 3422627. 3.66 34.1
8 799232. 3422627. 3.66 6.12
9 809232. 3422627. 0.688 10.6
10 814232. 3422627. 4.79 130.
# ... with 955 more rows
With ppswr, in each draw a unit is selected with probability equal to its share of the population total of the size variable. The selected unit is then replaced, and these steps are repeated $n$ times. Note that with this sampling design population units can be selected more than once, especially with large sampling fractions $n/N$.
The population total can be estimated by the pwr estimator:
$$\hat{t}(z) = \frac{1}{n} \sum_{k \in \mathcal{S}} \frac{z_k}{p_k} \;, \qquad (8.1)$$
where 𝑛 is the sample size (number of draws). The population mean can be
estimated by the estimated population total divided by the population size
𝑁. With independent draws, the sampling variance of the estimator of the
population total can be estimated by
$$\widehat{V}(\hat{t}(z)) = \frac{1}{n(n-1)} \sum_{k \in \mathcal{S}} \left( \frac{z_k}{p_k} - \hat{t}(z) \right)^2 \;. \qquad (8.2)$$
The sampling variance of the estimator of the mean can be estimated by the variance of the estimator of the total divided by $N^2$.
As a first step, I check whether the size variable is strictly positive in our
case study of Kandahar. The minimum equals 0.307 m2 , so this is the case. If
there are values equal to or smaller than 0, these values must be replaced by a
small number, so that all units have a positive probability of being selected.
Then the draw-by-draw selection probabilities are computed, and the sample
is selected using function sample.
To select the units, computing the selection probabilities is not strictly needed.
Exactly the same units are selected when the agricultural area within the
units (variable agri in the data frame) is used in argument prob of sample.
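The following is a minimal sketch of how this selection might look; the sample
size of 40 draws and the seed are assumptions, so the output below need not be
reproduced exactly.

# Sketch of ppswr selection with function sample
n <- 40
p <- grdKandahar$agri / sum(grdKandahar$agri)  # draw-by-draw selection probabilities
set.seed(314)
units <- sample(nrow(grdKandahar), size = n, replace = TRUE, prob = p)
mysample <- grdKandahar[units, ]
mysample$p <- p[units]
# show units selected more than once
df <- as.data.frame(table(units))
df[df$Freq > 1, ]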
units Freq
9 278 2
13 334 2
14 336 2
24 439 2
Figure 8.1 shows the selected sampling units, plotted on a map of the
agricultural area within the units, which is used as a size variable.
The next code chunk shows how the population total of the poppy area can be
estimated, using Equation (8.1), as well as the standard error of the estimator
of the population total (square root of estimator of Equation (8.2)). As a
first step, the observations are inflated, or expanded, through division of the
observations by the selection probabilities of the corresponding units.
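A minimal sketch of these computations, assuming mysample and its selection
probabilities p from above:

# pwr estimator (Equation (8.1)) and its standard error (Equation (8.2))
z_expanded <- mysample$poppy / mysample$p   # inflated observations
tz <- mean(z_expanded)                      # estimated population total
se_tz <- sqrt(var(z_expanded) / n)          # standard error of the estimator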
The estimated total equals 65,735 ha, with a standard error of 12,944 ha. The
same estimates are obtained with package survey (Lumley, 2021).
library(survey)
mysample$weight <- 1 / (mysample$p * n)
design_ppswr <- svydesign(id = ~ 1, data = mysample, weights = ~ weight)
svytotal(~ poppy, design_ppswr)
total SE
poppy 65735 12944
In ppswr sampling, a sampling unit can be selected more than once, especially
with large sampling fractions 𝑛/𝑁. This may decrease the sampling efficiency.
With large sampling fractions, the alternative is pps sampling without replace-
ment (ppswor), see next section.
The estimators of Equations (8.1) and (8.2) can also be used for infinite
populations. For infinite populations, the probability that a unit is selected
more than once is zero.
Exercises
8.2 Probability-proportional-to-size sampling without replacement

Many algorithms are available for ppswor sampling; see Tillé (2006) for an
overview. A simple, straightforward method is systematic ppswor sampling. Two
subtypes can be distinguished: systematic ppswor sampling with fixed frame
order and systematic ppswor sampling with random frame order (Rosén, 1997).
Given some order of the units, the cumulative sum of the inclusion probabilities
is computed. Each population unit is then associated with an interval of
cumulative inclusion probabilities. The larger the inclusion probability of
a unit, the wider the interval. Then a random number from the uniform
distribution is drawn, which serves as the start of a one-dimensional systematic
sample of size 𝑛 with an interval of 1. Finally, the units are determined for
which the systematic random values are in the interval of cumulative inclusion
probabilities, see Figure 8.2 for ten population units and a sample size of four.
The units selected are 2, 5, 7, and 9. Note that the sum of the interval lengths
equals the sample size. Further note that a unit cannot be selected more than
once because the inclusion probabilities are < 1 and the sampling interval
equals 1.
library(sampling)
set.seed(314)
N <- 10
n <- 4
x <- rnorm(N, mean = 20, sd = 5)
pi <- inclusionprobabilities(x, n)
print(data.frame(id = seq_len(N), x, pi))
id x pi
1 1 13.55882 0.3027383
2 2 23.63731 0.5277684
3 3 15.83538 0.3535687
4 4 16.48162 0.3679978
5 5 20.63624 0.4607613
6 6 18.32529 0.4091630
7 7 16.50655 0.3685545
8 8 20.06336 0.4479702
9 9 22.94495 0.5123095
10 10 11.15957 0.2491684
[1] 2 5 7 9
FIGURE 8.2: Systematic random sample along a line with unequal inclusion
probabilities.
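The selection itself takes only a few lines of code. The following is a minimal
sketch of systematic ppswor sampling with fixed frame order, under the
assumption that the inclusion probabilities pi computed above sum to n; with a
different random number stream, the selected units need not match Figure 8.2
exactly.

# Systematic sampling along the line of cumulative inclusion probabilities
cumpi <- cumsum(pi)      # upper bounds of the unit intervals
start <- runif(1)        # random start in (0, 1)
s <- start + 0:(n - 1)   # systematic sample with an interval of 1
units <- findInterval(s, c(0, cumpi), left.open = TRUE)
sort(units)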
Sampling efficiency can be increased by ordering the units by the size variable
(Figure 8.3). With this design, the third, fourth, fifth, and second units in the
original frame are selected, with sizes 15.8, 16.5, 20.6, and 23.6, respectively.
Ordering the units by size leads to a large within-sample and a small between-
sample variance of the size variable 𝑥. If the study variable is proportional to
the size variable, this results in a smaller sampling variance of the estimator of
the mean of the study variable. A drawback of systematic ppswor sampling
with fixed order is that no unbiased estimator of the sampling variance exists.
FIGURE 8.3: Systematic random sample along a line with unequal inclusion
probabilities. Units are ordered by size.
A small simulation study is done next to see how much gain in precision can be
achieved by ordering the units by size. A size variable 𝑥 and a study variable 𝑧
are simulated by drawing 1,000 values from a bivariate normal distribution with
a correlation coefficient of 0.8. Function mvrnorm of package MASS (Venables
and Ripley, 2002) is used for the simulation.
130 8 Sampling with probabilities proportional to size
library(MASS)
rho <- 0.8
mu1 <- 10; sd1 <- 2
mu2 <- 15; sd2 <- 4
mu <- c(mu1, mu2)
sigma <- matrix(
data = c(sd1^2, rep(sd1 * sd2 * rho, 2), sd2^2),
nrow = 2, ncol = 2)
N <- 1000
set.seed(314)
dat <- as.data.frame(mvrnorm(N, mu = mu, Sigma = sigma))
names(dat) <- c("z", "x")
head(dat)
z x
1 9.462930 9.149784
2 12.605847 17.306046
3 7.892686 11.979986
4 7.945021 12.567608
5 11.004325 15.165744
6 10.369943 13.258177
Twenty units are selected by systematic ppswor sampling, once with the units
in random order and once with the units ordered by size. This is repeated
10,000 times.
The standard deviation of the 10,000 estimated means with systematic ppswor
sampling with random order is 0.336, and when ordered by size 0.321. So,
a small gain in precision is achieved through ordering the units by size. For
comparison, I also computed the standard error for simple random sampling
without replacement (SI) of the same size. The standard error with this basic
sampling design is 0.424.
A ppswor sample can also be selected with the pivotal method, in which pairs
of inclusion probabilities are repeatedly updated:

1. Select randomly two units k and l with 0 < π_k < 1 and 0 < π_l < 1.
2. If π_k + π_l < 1, update the probabilities by

$$(\pi'_k, \pi'_l) = \begin{cases} (0,\; \pi_k + \pi_l) & \text{with probability } \frac{\pi_l}{\pi_k + \pi_l} \\ (\pi_k + \pi_l,\; 0) & \text{with probability } \frac{\pi_k}{\pi_k + \pi_l} \end{cases} \;. \tag{8.3}$$

If π_k + π_l ≥ 1, update the probabilities by

$$(\pi'_k, \pi'_l) = \begin{cases} (1,\; \pi_k + \pi_l - 1) & \text{with probability } \frac{1 - \pi_l}{2 - \pi_k - \pi_l} \\ (\pi_k + \pi_l - 1,\; 1) & \text{with probability } \frac{1 - \pi_k}{2 - \pi_k - \pi_l} \end{cases} \;. \tag{8.4}$$

3. Replace (π_k, π_l) by (π'_k, π'_l), and repeat the first two steps until
each population unit is either selected (inclusion probability equals
1) or not selected (inclusion probability equals 0).
In words, when the sum of the inclusion probabilities is smaller than 1, the
updated inclusion probability of one of the units will become 0, which means
that this unit will not be sampled. The inclusion probability of the other unit
will become the sum of the two inclusion probabilities, which means that the
probability increases that this unit will be selected in one of the subsequent
iterations. The probability that a unit is excluded from the sample is
proportional to the inclusion probability of the other unit: the larger the
inclusion probability of the other unit, the larger the probability that the
first unit will not be selected.
When the sum of the inclusion probabilities of the two units is larger than or
equal to 1, then one of the units is selected (updated inclusion probability is
one), while the inclusion probability of the other unit is lowered by 1 minus
the inclusion probability of the selected unit. The probability of being selected
is proportional to the complement of the inclusion probability of the other
unit. After the inclusion probability of a unit has been updated to either 0 or
1, this unit cannot be selected anymore in the next iteration.
With this ppswor design, the population total can be estimated by the 𝜋
estimator, Equation (2.2). The 𝜋 estimator of the mean is simply obtained by
dividing the estimator for the total by the population size 𝑁.
The inclusion probabilities 𝜋u� used in the 𝜋 estimator are not the final
probabilities obtained with the local pivotal method, which are either 0 or 1,
but the initial inclusion probabilities.
library(sampling)
n <- 40
size <- ifelse(grdKandahar$agri < 1E-12, 0.1, grdKandahar$agri)
pi <- inclusionprobabilities(size, n)
set.seed(314)
sampleind <- UPrandompivotal(pik = pi)
mysample <- data.frame(grdKandahar[sampleind == 1, ], pi = pi[sampleind == 1])
nrow(mysample)
[1] 39
As can be seen, not 40 but only 39 units are selected. The reason is that function
UPrandompivotal uses a very small number that can be set with argument eps. If
the updated inclusion probability of a unit is larger than the complement of
this small number eps, the unit is treated as being selected. The default value
of eps is 10−6 . If we replace sampleind == 1 by sampleind > 1 - eps, 40 units are
selected.
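A sketch of this fix:

# Treat units with updated probability > 1 - eps as selected
eps <- 1e-6
units <- sampleind > 1 - eps
mysample <- data.frame(grdKandahar[units, ], pi = pi[units])
nrow(mysample)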
[1] 40
The total poppy area can now be estimated from the ppswor sample with the π
estimator and, as an alternative, with the Hájek estimator.
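A minimal sketch of both estimators; the Hájek estimator divides the π
estimator of the total by the π estimator of the population size and
multiplies by N:

# pi estimator and Hajek estimator of the total poppy area
tz_HT <- sum(mysample$poppy / mysample$pi)
N <- nrow(grdKandahar)
tz_Hajek <- N * tz_HT / sum(1 / mysample$pi)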
The total poppy area as estimated with the 𝜋 estimator equals 88,501 ha. The
Hájek estimator results in a much smaller estimated total: 62,169 ha.
The 𝜋 estimate can also be computed with function svytotal of package survey,
which also provides an approximate estimate of the standard error. Various
methods are implemented in function svydesign for approximating the standard
error. These methods differ in the way the pairwise inclusion probabilities are
approximated from the unitwise inclusion probabilities. These approximated
pairwise inclusion probabilities are then used in the 𝜋 variance estimator or the
Yates-Grundy variance estimator. In the next code chunks, Brewer’s method is
used, see option 2 of Brewer’s method in Berger (2004), as well as Hartley-Rao’s
method for approximating the variance.
library(survey)
design_ppsworbrewer <- svydesign(
id = ~ 1, data = mysample, pps = "brewer", fpc = ~ pi)
svytotal(~ poppy, design_ppsworbrewer)
total SE
poppy 88501 14046
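The Hartley-Rao approximation can presumably be obtained along the same
lines; a sketch:

# Hartley-Rao approximation of the variance
design_ppsworhr <- svydesign(
id = ~ 1, data = mysample, pps = HR(), fpc = ~ pi)
svytotal(~ poppy, design_ppsworhr)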
total SE
poppy 88501 14900
library(samplingVarEst)
se_tz_Hajek <- sqrt(VE.Hajek.Total.NHT(mysample$poppy, mysample$pi))
pikl <- Pkl.Hajek.s(mysample$pi)
se_tz_HT <- sqrt(VE.HT.Total.NHT(mysample$poppy, mysample$pi, pikl))
se_tz_SYG <- sqrt(VE.SYG.Total.NHT(mysample$poppy, mysample$pi, pikl))
The three approximated standard errors are 14,045, 14,068, and 14,017 ha.
The differences are small relative to the estimated total.
Figure 8.4 shows the approximated sampling distribution of estimators of
the total poppy area with ppswor sampling and simple random sampling
without replacement of size 40, obtained by repeating the random sampling
with each design and estimation 10,000 times. With the ppswor samples, the
total poppy area is estimated by the 𝜋 estimator and the Hájek estimator. For
each ppswor sample, the variance of the 𝜋 estimator is approximated by the
Hájek-Rosén variance estimator (using function VE.Hajek.Total.NHT of package
samplingVarEst).
Exercises
9

Balanced and well-spread sampling

In this chapter two related but fundamentally different sampling designs are
described and illustrated: balanced sampling and well-spread sampling. Their
similarity and difference are briefly outlined below, but hopefully will
become clearer in the following sections.
Roughly speaking, for a balanced sample the sample means of covariates are
equal to the population means of these covariates. When the covariates are
linearly related to the study variable, this may yield a more precise estimate
of the population mean or total of the study variable.
A well-spread sample is a sample with a large range of values for the covariates,
from small to large values, but also including intermediate values. In more
technical terms: the sampling units are well-spread along the axes spanned
by the covariates. If the spatial coordinates are used as covariates (spreading
variables), this results in samples that are well-spread in geographical space.
Such samples are commonly referred to as spatially balanced samples, which is
somewhat confusing, as the geographical spreading is not implemented through
balancing on the geographical coordinates. On the other hand, the averages of
the spatial coordinates of a sample well-spread in geographical space will be
close to the population means of the coordinates. Therefore, the sample will
be approximately balanced on the spatial coordinates (Grafström and Schelin,
2014). The reverse is not true: with balanced sampling, the spreading of the
sampling units in the space spanned by the balancing variables can be poor. A
sample with all values of a covariate used in balancing near the population
mean of that variable has a poor spreading along the covariate axis, but can
still be perfectly balanced.
Let me illustrate balanced sampling with a small simulation study. The
simulated population shown in Figure 9.1 has a linear trend from West to
East as well as a trend from South to North. Due to the West-East trend,
the simulated study variable 𝑧 is correlated with the covariate Easting and,
due to the South-North trend, also with the covariate Northing. To estimate
the population mean of the simulated study variable, intuitively it is attractive
to select a sample with an average of the Easting coordinate that is equal to
the population mean of Easting (which is 10). Figure 9.1 (subfigure on the
left) shows such a sample of size four; we say that the sample is ‘balanced’ on
the covariate Easting. The sample in the subfigure on the right is balanced on
Easting as well as on Northing.
FIGURE 9.1: Sample balanced on Easting (E) and on Easting and Northing
(E and N).
Simple random sampling is not a balanced sampling design, because for many
simple random samples the sample mean of the balancing variable 𝑥 is not
equal to the population mean of 𝑥. Only the expectation of the sample mean
of 𝑥, i.e., the mean of the sample means obtained by selecting an infinite
number of simple random samples, equals the population mean of 𝑥.
Figure 9.2 shows for 1,000 simple random samples the squared error of the
estimated population mean of the study variable 𝑧 against the difference
between the sample mean of balancing variable Easting and the population
mean of Easting. Clearly, the larger the absolute value of the difference, the
larger on average the squared error. So, to obtain a precise and accurate
estimate of the population mean of z, it is better to select samples with a
difference close to 0.
Using only Easting as a balancing variable reduces the sampling variance of the
estimator of the mean substantially. Using Easting and Northing as balancing
variables further reduces the sampling variance. See Table 9.1.
TABLE 9.1: Sampling variance of the π estimator of the mean for simple
random sampling (SI) and balanced sampling of four units.

Until now we have assumed that the inclusion probabilities of the population
units are equal, but this is not a requirement for balanced sampling designs.
A sample is balanced on a covariate x when the π-expanded sample sum of the
covariate equals its population total:

$$\sum_{k \in \mathcal{S}} \frac{x_k}{\pi_k} = \sum_{k=1}^{N} x_k \;. \tag{9.1}$$

The rationale of balanced sampling can be seen from the regression estimator
of the population total:

$$\hat{t}_{\text{regr}}(z) = \hat{t}_{\pi}(z) + \hat{b}\left(t(x) - \hat{t}_{\pi}(x)\right) \;, \tag{9.2}$$

with t̂_π(z) and t̂_π(x) the π estimators of the population total of the study
variable z and the covariate x, respectively, t(x) the population total of the
covariate, and b̂ the estimated slope parameter (see hereafter). With a
perfectly balanced sample, the second term in the regression estimator, which
adjusts the π estimator, equals zero.
Balanced samples can be selected with the cube algorithm of Deville and Tillé
(2004). The population total and mean can be estimated by the 𝜋 estimator.
The approximated variance of the 𝜋 estimator of the population mean can be
estimated by (Deville and Tillé (2005), Grafström and Tillé (2013))
$$\widehat{V}(\hat{\bar{z}}) = \frac{1}{N^2} \frac{n}{n-p} \sum_{k \in \mathcal{S}} c_k \left( \frac{e_k}{\pi_k} \right)^2 \;, \tag{9.3}$$

with p the number of balancing variables, c_k a weight for unit k (see
hereafter), and e_k the residual of unit k, given by

$$e_k = z_k - \mathbf{x}_k^{\mathrm{T}} \hat{\mathbf{b}} \;, \tag{9.4}$$
with x_k a vector of length p with the balancing variables for unit k, and b̂
the estimated population regression coefficients, given by

$$\hat{\mathbf{b}} = \left( \sum_{k \in \mathcal{S}} c_k \frac{\mathbf{x}_k \mathbf{x}_k^{\mathrm{T}}}{\pi_k^2} \right)^{-1} \sum_{k \in \mathcal{S}} c_k \frac{\mathbf{x}_k z_k}{\pi_k^2} \;. \tag{9.5}$$
Working this out for balanced sampling without replacement with equal
inclusion probabilities, π_k = n/N, k = 1, …, N, yields

$$\widehat{V}(\hat{\bar{z}}) = \frac{1}{n(n-p)} \sum_{k \in \mathcal{S}} c_k e_k^2 \;. \tag{9.6}$$
Deville and Tillé (2005) give several formulas for computing the weights 𝑐u� ,
one of which is 𝑐u� = (1 − 𝜋u� ).
Balanced sampling is now illustrated with aboveground biomass (AGB) data
of Eastern Amazonia, see Figure 1.8. Log-transformed short-wave infrared
radiation (lnSWIR2) is used as a balancing variable. The samplecube function of
the sampling package (Tillé and Matei, 2021) implements the cube algorithm.
Argument X of this function specifies the matrix of ancillary variables on which
the sample must be balanced. The first column of this matrix is filled with
ones, so that the sample size is fixed. To speed up the computations, a 5 km
× 5 km subgrid of grdAmazonia is used.
Equal inclusion probabilities are used, i.e., for all population units the inclusion
probability equals 𝑛/𝑁.
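A minimal sketch of the selection; the sample size of 100 and the seed are
assumptions, and for brevity the sketch selects from grdAmazonia directly,
whereas the text uses a 5 km × 5 km subgrid:

# Select a balanced sample with the cube algorithm
library(sampling)
N <- nrow(grdAmazonia)
n <- 100
pi <- rep(n / N, times = N)
X <- cbind(rep(1, times = N), grdAmazonia$lnSWIR2)  # column of ones fixes n
set.seed(314)
sampleind <- samplecube(X = X, pik = pi, comment = FALSE)
units <- which(sampleind == 1)
mysample <- grdAmazonia[units, ]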
mz <- mean(mysample$AGB)
The next code chunk shows how the estimated variance of the 𝜋 estimator of
the population mean can be computed.
pi <- rep(n / N, n)
c <- (1 - pi)
b <- estimate_b(z = mysample$AGB / pi, X = X[units, ] / pi, c = c)
zpred <- X %*% b
e <- mysample$AGB - zpred[units]
v_tz <- n / (n - ncol(X)) * sum(c * (e / pi)^2)
v_mz <- v_tz / N^2
Figure 9.3 shows the selected balanced sample. Note the spatial clustering
of some units. The estimated population mean (as estimated by the sample
mean) of AGB equals 224.5 10⁹ kg ha⁻¹. The population mean of AGB equals
225.3 10⁹ kg ha⁻¹. The standard error of the estimated mean equals 6.1 10⁹
kg ha⁻¹.
Figure 9.4 shows the approximated sampling distribution of the π estimator of
the mean AGB with balanced sampling and simple random sampling, obtained
by repeating the random sampling with both designs and estimation 1,000
times.
The variance of the 1,000 estimates of the population mean of the study
variable AGB equals 28.8 (10⁹ kg ha⁻¹)². The gain in precision compared
to simple random sampling equals 2.984 (design effect is 0.335), so with
simple random sampling about three times more sampling units are needed
to estimate the population mean with the same precision. The mean of the
1,000 estimated variances equals 26.4 (10⁹ kg ha⁻¹)², indicating that the
approximated variance estimator somewhat underestimates the true variance
in this case. The population mean of the balancing variable lnSWIR2 equals
6.414. The sample mean of lnSWIR2 varies a bit among the samples. Figure 9.5
shows the approximated sampling distribution of the sample mean of lnSWIR2.
In other words, many samples are not perfectly balanced on lnSWIR2. This is
not exceptional; in most cases perfect balance is impossible.
Exercises
In much the same way as we controlled the sample size n in the previous
subsection by balancing the sample on the known total number of population
units N, we can balance a sample on the known numbers of units in
subpopulations. A sample balanced on the sizes of subpopulations is a
stratified random sample.
Figure 9.6 shows four subpopulations or strata. These four strata can be used
in balanced sampling by constructing the following design matrix 𝐗 with as
many columns as there are strata and as many rows as there are population
units:
$$\mathbf{X} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 1 \end{bmatrix} \;. \tag{9.7}$$
The first four rows refer to the four leftmost bottom row population units in
Figure 9.6. These units belong to class A, which explains why the first column
for these units contains ones. The other three columns for these rows contain
all zeroes. The fifth and sixth units belong to stratum B, so the second
column for these rows contains ones, and so on. The final row is the
upper-right sampling unit in stratum D, so the first three columns contain
zeroes, and the fourth column is filled with a one. The sum of the indicators
in a column is the total number of population units in that stratum.
In the next code chunk, the inclusion probabilities are computed as
π_hk = n_h/N_h, k = 1, …, N_h, with n_h = 5 for all four strata. The stratum
sample sizes are equal, but the numbers of population units differ among the
strata, so the inclusion probabilities also differ among the strata.
A B C D
0.06250000 0.12500000 0.03125000 0.04166667
The inclusion probabilities are added to tibble mypop with the 400 population
units, using a look-up table lut and function left_join. The ten leftmost units
on the bottom row of Figure 9.6 are shown below. Variables s1 and s2 are the
spatial coordinates of the centres of the units.
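A sketch of this step:

# Add the inclusion probabilities to the population tibble via a look-up table
library(dplyr)
lut <- data.frame(stratum = names(pi_h), pi = as.numeric(pi_h))
mypop <- left_join(mypop, lut, by = "stratum")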
# A tibble: 400 x 4
stratum s1 s2 pi
<fct> <dbl> <dbl> <dbl>
1 A 0.5 0.5 0.0625
2 A 1.5 0.5 0.0625
3 A 2.5 0.5 0.0625
4 A 3.5 0.5 0.0625
5 B 4.5 0.5 0.125
6 B 5.5 0.5 0.125
7 C 6.5 0.5 0.0312
8 C 7.5 0.5 0.0312
9 C 8.5 0.5 0.0312
10 C 9.5 0.5 0.0312
# ... with 390 more rows
For balanced sampling with unequal inclusion probabilities, the ones in the
design matrix are replaced by the inclusion probabilities of the units:

$$\mathbf{X} = \begin{bmatrix} 0.0625 & 0 & 0 & 0 \\ 0.0625 & 0 & 0 & 0 \\ 0.0625 & 0 & 0 & 0 \\ 0.0625 & 0 & 0 & 0 \\ 0 & 0.125 & 0 & 0 \\ 0 & 0.125 & 0 & 0 \\ 0 & 0 & 0.03125 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0.04167 \end{bmatrix} \;. \tag{9.8}$$

Balancing the sample on these columns fixes the stratum sample sizes: for each
stratum, the π-expanded sample sum of the column equals the number of selected
units in that stratum, which must equal the column total N_h π_h = n_h.
In the above example all units in a stratum have the same inclusion probability,
yielding a stratified simple random sample. We may also use variable inclusion
probabilities, for instance proportional to a size measure of the units, yielding
a stratified ppswor random sample (Section 8.2).
The advantage of selecting a stratified random sample by balancing the sample
on a categorical variable becomes clear in case we have multiple classifications
that we would like to use in stratification, and we cannot afford to use all
cross-classifications as strata. This is the topic of the next subsection.
Falorsi and Righi (2008) describe how a multiway stratified sample can be
selected as a balanced sample. Multiway stratification is of interest when one
has multiple stratification variables, each stratification variable leading to
several strata, so that the total number of cross-classification strata becomes
so large that the stratum sample sizes are strongly disproportional to the
stratum sizes, or the number of strata even exceeds the total sample size.
For instance, suppose we have three maps
with 4, 3, and 5 map units. Further, suppose that all combinations of map
units are non-empty, so that we have 4 × 3 × 5 = 60 combinations. We may
not like to use all combinations (cross-classifications) as strata. The alternative
is then to use the 4 + 3 + 5 = 12 map units as strata.
The units of an individual map used for stratification are referred to as
marginal strata. The sample sizes of the marginal strata can be controlled
using a design matrix with as many columns as there are marginal strata. Each
row k = 1, …, N in the design matrix X has as many non-zero values as we have
maps, in the entries corresponding to the cross-classification map unit of
population unit k, and zeroes in the remaining entries. The non-zero value is
the inclusion probability of that unit. Each column of the design matrix has
non-zero values in the entries corresponding to the population units in that
marginal stratum and zeroes in all other entries.
Two-way stratified random sampling is illustrated with a simulated population
of 400 units (Figure 9.7). Figure 9.8 shows two classifications of the population
units. Classification A consists of four classes (map units), classification B
of three classes. Instead of using 4 × 3 = 12 cross-classifications as strata in
random sampling, only 4 + 3 = 7 marginal strata are used in two-way stratified
random sampling.
As a first step, the inclusion probabilities are added to data.frame mypop with
the spatial coordinates and simulated values. To keep it simple, I computed
inclusion probabilities equal to 2 divided by the number of population units in
a cross-classification stratum. Note that this does not imply that a sample is
selected with two units per cross-classification stratum. As we will see later, it
is possible that in some cross-classification strata no units are selected at all,
while in other cross-classification strata more than two units are selected. In
multiway stratified sampling, the marginal stratum sample sizes are controlled.
The inclusion probabilities should result in six selected units for all four units
of map A and eight selected units for all three units of map B.
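A sketch of this step, assuming the population data.frame mypop has factor
columns A and B with the map units:

# Inclusion probabilities: two units per cross-classification stratum
library(dplyr)
mypop <- mypop %>%
group_by(A, B) %>%
mutate(pih = 2 / n()) %>%
ungroup()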
The next step is to create the design matrix. Two submatrices are computed, one
per stratification. The two submatrices are joined columnwise, using function
cbind. The columns are multiplied by the vector with inclusion probabilities.
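A sketch of this step:

# Indicator submatrices for the marginal strata of maps A and B
XA <- model.matrix(~ A - 1, data = mypop)
XB <- model.matrix(~ B - 1, data = mypop)
X <- cbind(XA, XB) * mypop$pih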
Matrix 𝐗 can be reduced by one column if in the first column the inclusion
probabilities of all population units are inserted. This first column contains
no zeroes. Balancing on this variable implies that the total sample size is
controlled. Now there is no need anymore to control the sample sizes of all
marginal strata. It is sufficient to control the sample sizes of three marginal
strata of map A (A2, A3, and A4) and two marginal strata of map B (B2 and
B3). Given the total sample size, the sample sizes of map units A1 and B1
then cannot be chosen freely anymore.
This reduced design matrix is not strictly needed for selecting a multiway
stratified sample, but it must be used in estimation. If in estimation, as many
balancing variables are used as we have marginal strata, the matrix with the
sum of squares of the balancing variables (first sum in Equation (9.5)) cannot
be inverted (the matrix is singular), and as a consequence the population
regression coefficients cannot be estimated.
Finally, the two-way stratified random sample is selected with function
samplecube of package sampling.
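A sketch, with the seed as an assumption:

# Select the two-way stratified sample with the cube algorithm
set.seed(314)
sampleind <- samplecube(X = X, pik = mypop$pih, comment = FALSE)
units <- which(sampleind == 1)
mysample <- mypop[units, ]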
addmargins(table(mysample$A, mysample$B))
B1 B2 B3 Sum
A1 2 0 4 6
A2 2 3 1 6
A3 3 1 2 6
A4 1 4 1 6
Sum 8 8 8 24
N <- nrow(mypop)
print(mean <- sum(mysample$z / mysample$pih) / N)
[1] 8.688435
c <- (1 - mysample$pih)
b <- estimate_b(
z = mysample$z / mysample$pih, X = X[units, ] / mysample$pih, c = c)
zpred <- X %*% b
e <- mysample$z - zpred[units]
n <- nrow(mysample)
v_tz <- n / (n - ncol(X)) * sum(c * (e / mysample$pih)^2)
print(v_mz <- v_tz / N^2)
[1] 0.1723688
Function lpm1 of package BalancedSampling implements the local pivotal
method. The faster function lpm2 of the same package searches only a
subset of the population in search for nearest neighbours and is thus not as
good. Another function, lpm2_kdtree of package SamplingBigData (Lisic and
Grafström, 2018), is developed for big data sets.
Inclusion probabilities are computed with function inclusionprobabilities of
package sampling. A matrix 𝐗 must be defined with the values of the spread-
ing variables of the population units. Figure 9.9 shows a ppswor sample of 40
units selected from the sampling frame of Kandahar, using the spatial coordi-
nates of the population units as spreading variables. Inclusion probabilities
are proportional to the agricultural area within the population units. The
geographical spreading is improved compared with the sample shown in Figure
8.1.
library(BalancedSampling)
library(sampling)
n <- 40
pi <- inclusionprobabilities(grdKandahar$agri, n)
X <- cbind(grdKandahar$s1, grdKandahar$s2)
set.seed(314)
units <- lpm1(pi, X)
myLPMsample <- grdKandahar[units, ]
The total poppy area can be estimated with the 𝜋 estimator (Equation (2.2)).
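A minimal sketch, using the inclusion probabilities of the selected units:

# pi estimator of the total poppy area from the LPM sample
tz_HT <- sum(myLPMsample$poppy / pi[units])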
The estimated total poppy area equals 62,232 ha. The sampling variance of
the estimator of the population total with the local pivotal method can be
estimated by (Grafström and Schelin, 2014)
$$\widehat{V}\!\left(\hat{t}(z)\right) = \frac{1}{2} \sum_{k \in \mathcal{S}} \left( \frac{z_k}{\pi_k} - \frac{z_{k_j}}{\pi_{k_j}} \right)^2 \;, \tag{9.9}$$
with 𝑘u� the nearest neighbour of unit 𝑘 in the sample. This variance estimator
is for the case where we have only one nearest neighbour.
Function vsb of package BalancedSampling is an implementation of a more
general variance estimator that accounts for more than one nearest neighbour
(equation (6) in Grafström and Schelin (2014)). We expect a somewhat smaller
variance compared to pps sampling, so we may use the variance of the pwr
estimator (Equation (8.2)) as a conservative variance estimator.
The standard error obtained with function vsb equals 12,850 ha, the standard
error of the pwr estimator equals 14,094 ha.
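A sketch of both computations; the argument order of vsb (inclusion
probabilities, values, and coordinates of the sampled units) and the
approximation p = pi/n of the draw-by-draw probabilities are assumptions:

# Variance estimation for the LPM sample
se_tz_vsb <- sqrt(vsb(pi[units], myLPMsample$poppy, X[units, ]))
# Conservative alternative: pwr variance (Equation (8.2)), with
# approximate draw-by-draw probabilities p = pi / n
p <- pi[units] / n
z_expanded <- myLPMsample$poppy / p
se_tz_pwr <- sqrt(var(z_expanded) / n)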
As explained above, the LPM design can also be used to select a probability
sample well-spread in the space spanned by one or more quantitative covariates.
Matrix 𝐗 then should contain the values of the scaled (standardised) covariates
instead of the spatial coordinates.
Exercises
FIGURE 9.10: Numbering of grid cells and subcells for GRTS sampling.
The next step is to place the 16 subcells on a line in a random order. The
randomisation is done hierarchically. First, the four grid cells at the highest
level are randomised. In our example, the randomised order is 1, 2, 3, 0 (Figure
9.11). Next, within each grid cell, the order of the subcells is randomised. This
is done independently for the grid cells. In our example, for grid cell 1 the
randomised order of the subcells is 2, 1, 3, 0 (Figure 9.11). Note that the empty
subcells (0,0) and (3,3) are removed from the line.
set.seed(314)
ord <- sample(4, 4)
myfinpop_rand <- NULL
for (i in ord) {
units <- which(myfinpop$partit1 == i)
units_rand <- sample(units, size = length(units))
myfinpop_rand <- rbind(myfinpop_rand, myfinpop[units_rand, ])
}
FIGURE 9.11: Systematic random sample along a line with equal inclusion
probabilities.
Figure 9.12 shows a systematic random sample along a line with unequal
inclusion probabilities. The inclusion probabilities are proportional to a size
variable, with values 1, 2, 3, or 4. The selected population units are the units
in subcells 10, 20, 31, 01, and 02.
FIGURE 9.12: Systematic random sample along a line with inclusion
probabilities proportional to size.

GRTS samples can be selected with function grts of package spsurvey
(Dumelle et al., 2021). The next code chunk shows the selection of a GRTS
sample of 40 units from Kandahar. Tibble grdKandahar is first converted to an
sf object with function st_as_sf of package sf (Pebesma, 2018). The data set
uses a UTM projection (zone 41N) with WGS84 datum. This projection is
passed to function st_as_sf with argument crs (crs = 32641). Argument sframe
of function grts specifies the sampling frame. The sample size is passed to
function grts with argument n_base. Argument seltype is set to "proportional"
to select units with probabilities proportional to an ancillary variable,
which is passed to function grts with argument aux_var.
library(spsurvey)
library(sf)
sframe_sf <- st_as_sf(grdKandahar, coords = c("s1", "s2"), crs = 32641)
set.seed(314)
res <- grts(
sframe = sframe_sf, n_base = 40, seltype = "proportional", aux_var = "agri")
myGRTSsample <- res$sites_base
The estimated total poppy area is 109,809 ha, and the estimated standard
error is 24,327 ha.
The alternative is to estimate the total poppy area by the 𝜋 estimator. Function
vsb of package BalancedSampling can be used to estimate the standard error
of the 𝜋 estimator.
The estimated total is 71,634 ha, and the estimated standard error is 12,593
ha.
9.3 Balanced sampling with spreading

Balanced sampling can be combined with spreading in geographical space,
using the local cube algorithm implemented in function lcube of package
BalancedSampling. The next code chunk shows the selection of a sample of
100 units from Eastern Amazonia, balanced on lnSWIR2 and well-spread in
geographical space.

library(BalancedSampling)
N <- nrow(grdAmazonia)
n <- 100
Xbal <- cbind(rep(1, times = N), grdAmazonia$lnSWIR2)
Xspread <- cbind(grdAmazonia$x1, grdAmazonia$x2)
pi <- rep(n / N, times = N)
set.seed(314)
units <- lcube(Xbal = Xbal, Xspread = Xspread, prob = pi)
mysample <- grdAmazonia[units, ]
The selected sample is shown in Figure 9.13. Comparing this sample with the
balanced sample of Figure 9.3 shows that the geographical spreading of the
sample is improved, although there are still some close points. I used equal
inclusion probabilities, so the π estimate of the mean is equal to the sample
mean, which is 225.6 10⁹ kg ha⁻¹.
The variance of the π estimator of the mean can be estimated by (equation (7)
of Grafström and Tillé (2013))

$$\widehat{V}(\hat{\bar{z}}) = \frac{n}{n-p} \frac{p}{p+1} \sum_{k \in \mathcal{S}} (1 - \pi_k) \left( \frac{e_k}{\pi_k} - \bar{e}_k \right)^2 \;, \tag{9.10}$$

with p the number of balancing variables, e_k the regression model residual of
unit k (Equation (9.4)), and ē_k the local mean of the residuals of this unit,
computed by

$$\bar{e}_k = \frac{\sum_{j=1}^{p+1} (1 - \pi_j) \frac{e_j}{\pi_j}}{\sum_{j=1}^{p+1} (1 - \pi_j)} \;. \tag{9.11}$$
library(spsurvey)
pi <- rep(n / N, n)
c <- (1 - pi)
b <- estimate_b(z = mysample$AGB / pi, X = Xbal[units, ] / pi, c = c)
zpred <- Xbal %*% b
e <- mysample$AGB - zpred[units]
weights <- localmean_weight(x = mysample$x1, y = mysample$x2, prb = pi, nbh = 3)
v_mz <- localmean_var(z = e / pi, weight_1st = weights) / N^2
The estimated standard error is 2.8 10⁹ kg ha⁻¹, which is considerably smaller
than the estimated standard error of the balanced sample without geographical
spreading.
10
Model-assisted estimation
The model-assisted approach starts from a model of the study variable,

$$Z_k = \mu(\mathbf{x}_k) + \epsilon_k \;, \tag{10.1}$$

with μ(x_k) the model-mean for population unit k, which is a function of the
covariate values of that unit collected in the vector
x_k = (1, x_{1,k}, …, x_{p,k})^T, and ε_k a random variable with zero mean.
Note that I use uppercase Z to distinguish the random variable Z_k of unit k
from one realisation of this random variable for unit k in the population of
interest, z_k. The model-mean μ(x_k) can be estimated by m̂(x_k), leading to
the generalised difference estimator:

$$\hat{\bar{z}}_{\text{dif}} = \frac{1}{N} \sum_{k=1}^{N} \hat{m}(\mathbf{x}_k) + \frac{1}{N} \sum_{k \in \mathcal{S}} \frac{z_k - \hat{m}(\mathbf{x}_k)}{\pi_k} \;, \tag{10.2}$$
with 𝜋u� the inclusion probability of unit 𝑘. The first term is the population
mean of model predictions of the study variable, and the second term is the 𝜋
estimator of the population mean of the residuals.
A wide variety of model-assisted estimators have been developed and tested
over the past decades. They differ in the working model used to obtain the
estimates 𝑚(𝐱
̂ u� ) in Equation (10.2). The best known class of model-assisted
estimators is the generalised regression estimator that uses a linear model in
prediction (Särndal et al., 1992). Alternative model-assisted estimators are
the estimators using machine learning techniques for prediction. In the era
of big data with a vastly increasing number of exhaustive data sets and a
rapid development of machine learning techniques, these estimators have great
potential for spatial sample survey.
$$Z_k = \mathbf{x}_k^{\mathrm{T}} \boldsymbol{\beta} + \epsilon_k \;, \tag{10.3}$$

with ε_k uncorrelated residuals with zero mean and variance σ²(ε_k). Note
that the variance of the residuals σ²(ε_k) need not be constant but may differ
among the population units. If {z_k, x_{1,k}, …, x_{p,k}} were observed for all
units in the population, the population regression coefficients could be
computed by

$$\mathbf{b} = \left( \sum_{k=1}^{N} \frac{\mathbf{x}_k \mathbf{x}_k^{\mathrm{T}}}{\sigma^2(\epsilon_k)} \right)^{-1} \sum_{k=1}^{N} \frac{\mathbf{x}_k z_k}{\sigma^2(\epsilon_k)} \;, \tag{10.4}$$
with x_k the vector (1, x_{1,k}, …, x_{p,k})^T and σ²(ε_k) the variance of the
residual of unit k. Similar to the distinction between model-mean and
population mean (see Chapter 26), here the model regression coefficients β are
distinguished from the population regression coefficients b. The means m(x_k)
would then be computed by

$$m(\mathbf{x}_k) = \mathbf{x}_k^{\mathrm{T}} \mathbf{b} \;. \tag{10.5}$$
In practice, only a sample is observed, and the population regression
coefficients are estimated from the sample data by

$$\hat{\mathbf{b}} = \left( \sum_{k \in \mathcal{S}} \frac{\mathbf{x}_k \mathbf{x}_k^{\mathrm{T}}}{\sigma^2(\epsilon_k)\, \pi_k} \right)^{-1} \sum_{k \in \mathcal{S}} \frac{\mathbf{x}_k z_k}{\sigma^2(\epsilon_k)\, \pi_k} \;, \tag{10.6}$$

so that the estimated means equal

$$\hat{m}(\mathbf{x}_k) = \mathbf{x}_k^{\mathrm{T}} \hat{\mathbf{b}} \;. \tag{10.7}$$

Inserting these estimated means in the generalised difference estimator
(Equation (10.2)) yields the generalised regression estimator:

$$\hat{\bar{z}}_{\text{regr}} = \frac{1}{N} \sum_{k=1}^{N} \mathbf{x}_k^{\mathrm{T}} \hat{\mathbf{b}} + \frac{1}{N} \sum_{k \in \mathcal{S}} \frac{z_k - \mathbf{x}_k^{\mathrm{T}} \hat{\mathbf{b}}}{\pi_k} \;, \tag{10.8}$$

which can be rewritten as

$$\hat{\bar{z}}_{\text{regr}} = \hat{\bar{z}}_{\pi} + \sum_{j=1}^{p} \hat{b}_j \left( \bar{x}_j - \hat{\bar{x}}_{j,\pi} \right) \;, \tag{10.9}$$

with z̄̂_π the π estimator of the population mean of the study variable, x̄_j
the population mean of covariate j, and x̄̂_{j,π} the π estimator of the
population mean of covariate j.
The working model of the simple and the multiple regression estimator is
the homoscedastic linear regression model. The only difference with the het-
eroscedastic model is that the variance of the residuals is assumed constant:
σ²(ε_k) = σ²(ε), k = 1, …, N.
In the simple linear regression model, the mean is a linear function of a
single covariate: μ(x_k) = α + β x_k. The simple linear regression model leads
to the simple regression estimator. With simple random sampling, this
estimator for the population mean is

$$\hat{\bar{z}}_{\text{regr}} = \bar{z}_{\mathcal{S}} + \hat{b} \left( \bar{x} - \bar{x}_{\mathcal{S}} \right) \;, \tag{10.10}$$

where z̄_S and x̄_S are the sample means of the study variable and the
covariate, respectively, x̄ is the population mean of the covariate, and b̂ is
the estimated slope coefficient:

$$\hat{b} = \frac{\sum_{k \in \mathcal{S}} (x_k - \bar{x}_{\mathcal{S}})(z_k - \bar{z}_{\mathcal{S}})}{\sum_{k \in \mathcal{S}} (x_k - \bar{x}_{\mathcal{S}})^2} \;. \tag{10.11}$$
The rationale of the regression estimator is that when the estimated mean of
the covariate is, for instance, smaller than the population mean of the covariate,
then with a positive correlation between study variable and covariate, also
the estimated mean of the study variable is expected to be smaller than the
population mean of the study variable. The difference between the population
mean and the estimated mean of the covariate can be used to improve the 𝜋
estimate of the mean of 𝑧 (which is for simple random sampling equal to the
sample mean 𝑧u�̄ ), by adding a term proportional to the difference between the
estimated mean and the population mean of the covariate. As a scaling factor,
the estimated slope of the fitted regression line is used.
The sampling variance of this regression estimator can be estimated by first
computing the regression residuals e_k = z_k − ẑ_k, k = 1, …, n, at the
sampling units, with ẑ_k = â + b̂ x_k the predicted value for unit k. Note
that I use symbol ε (Equation (10.3)) for the residuals from the model with
the model regression coefficients β, whereas for the residuals from the model
with the estimated population regression coefficients b̂ I use symbol e. To
compute the residuals e, also an estimate of the intercept a is needed. With
simple random sampling, this intercept can be estimated by

$$\hat{a} = \bar{z}_{\mathcal{S}} - \hat{b}\, \bar{x}_{\mathcal{S}} \;. \tag{10.12}$$

The variance of the regression estimator can then be estimated by

$$\widehat{V}(\hat{\bar{z}}_{\text{regr}}) = \left( 1 - \frac{n}{N} \right) \frac{\widehat{S}^2(e)}{n} \;, \tag{10.13}$$

with Ŝ²(e) the estimated population variance of the regression residuals:

$$\widehat{S}^2(e) = \frac{1}{n-1} \sum_{k \in \mathcal{S}} e_k^2 \;. \tag{10.14}$$
For simple random sampling with replacement from finite populations and
simple random sampling of infinite populations, the finite population correction
factor 1 − 𝑛/𝑁 must be dropped, see Chapter 3.
In the multiple linear regression model, the mean is a linear function of
multiple covariates. This model leads to the multiple regression estimator.
With simple random sampling, the population regression coefficients of this
estimator can be estimated by

$$\hat{\mathbf{b}} = \left( \sum_{k \in \mathcal{S}} \mathbf{x}_k \mathbf{x}_k^{\mathrm{T}} \right)^{-1} \sum_{k \in \mathcal{S}} \mathbf{x}_k z_k \;. \tag{10.15}$$
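A sketch of this computation for the Eastern Amazonia case; the sample size
of 100 and the seed are assumptions, so the exact values below need not be
reproduced:

# Simple random sample and design-based estimate of b (Equation (10.15))
n <- 100
set.seed(314)
units <- sample(nrow(grdAmazonia), size = n, replace = FALSE)
mysample <- grdAmazonia[units, ]
X <- cbind(1, mysample$lnSWIR2)
b <- solve(crossprod(X), crossprod(X, mysample$AGB))
t(b)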
[,1] [,2]
[1,] 1751.636 -237.1379
The same estimates are obtained by ordinary least squares (OLS) fitting of
the model with function lm.
(Intercept) lnSWIR2
1751.6363 -237.1379
The simple random sample is used to estimate the population mean of the
study variable AGB by the simple regression estimator and to approximate the
sampling variance of the regression estimator. The residuals of the fitted model
can be extracted with function residuals because in this case the OLS estimates
of the regression coefficients are equal to the design-based estimates. With
unequal inclusion probabilities, the residuals must be computed by predicting
the study variable for the selected units, using the design-based estimates
of the regression coefficients, and subtracting the observations of the study
variable.
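A sketch of these steps, assuming equal inclusion probabilities so that the
OLS residuals can be used:

# Residuals and their estimated population variance
e <- residuals(lm(AGB ~ lnSWIR2, data = mysample))
S2e <- sum(e^2) / (n - 1)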
N <- nrow(grdAmazonia)
se_mz_regr <- sqrt((1 - n / N) * S2e / n)
The difference 𝛿(𝑥) between the population mean of the covariate lnSWIR2
(6.415) and its estimated mean (6.347) equals 0.068. We may expect the
difference between the unknown population mean of the study variable AGB
and its sample mean (246.510) to be equal to 𝛿(𝑥), multiplied by the estimated
slope of the line, which equals -237.1. The result, -16.1039, is added to the
simple random sample estimate, so that the ultimate regression estimate is
adjusted downward to 230.4 10⁹ kg ha⁻¹.
The estimated approximate standard error of the regression estimator equals
4.458 10⁹ kg ha⁻¹. The approximated variance is a simplification of a more
complicated approximation derived from writing the regression estimator of the
population total as a weighted sum of the π-expanded observations (Särndal
et al. (1992), equation (6.5.9)):

$$\hat{\bar{z}}_{\text{regr}} = \frac{1}{N} \sum_{k \in \mathcal{S}} g_k \frac{z_k}{\pi_k} \;, \tag{10.16}$$

with g_k the weight for unit k. For simple random sampling, the weights are
given by Särndal et al. (1992), equation (6.5.12). The sample mean of the
weights equals 1:
mean(g)
[1] 1
Also, the sample mean of the product of the weights and the covariate x equals
the population mean of the covariate:
[1] TRUE
With these weights, the alternative variance approximation is

$$\widehat{V}(\hat{\bar{z}}_{\text{regr}}) = \left( 1 - \frac{n}{N} \right) \frac{\sum_{k \in \mathcal{S}} g_k^2 e_k^2}{n(n-1)} \;. \tag{10.18}$$
Comparing this with Equation (10.13) shows that in the first approximation
we assumed that all weights are equal to 1.
The alternative approximate standard error is computed in the next code
chunk.
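A sketch, assuming the calibration weights g and the residuals e computed
above:

# Alternative variance approximation with the g-weights (Equation (10.18))
S2ge <- sum(g^2 * e^2) / (n - 1)
print(se_mz_regr_g <- sqrt((1 - n / N) * S2ge / n))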
[1] 4.546553
The regression estimator and its standard error can be computed with package
survey (Lumley, 2021). After specifying the sampling design with function
svydesign, function calibrate is used to calibrate the sample on the known
population totals N and t(x) = Σ_{k=1}^{N} x_k, with x_k the value of covariate
lnSWIR2 of unit k.
library(survey)
mysample$fpc <- N
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ fpc)
populationtotals <- c(N, sum(grdAmazonia$lnSWIR2))
mysample_cal <- calibrate(design_si, formula = ~ lnSWIR2,
population = populationtotals, calfun = "linear")
g <- weights(mysample_cal)
all.equal(sum(g), N)
[1] TRUE
The sample sum of the product of the weights and the covariate equals the
population total of the covariate.
[1] TRUE
Finally, the population mean can be estimated with function svymean. This is
simply the sample sum of the product of the weights and the study variable
AGB, divided by 𝑁.
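The call, with the calibrated design object from above, is presumably:

svymean(~ AGB, mysample_cal)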
mean SE
AGB 230.41 4.5466
The standard error is computed with Equation (10.18). Figure 10.2 shows
the sampling distribution of the simple regression estimator along with the
distribution of the 𝜋 estimator, obtained by repeating simple random sampling
of 100 units and estimation 10,000 times.
The average of the 10,000 regression estimates equals 224.9 10⁹ kg ha⁻¹. The
population mean of the study variable AGB equals 225.0 10⁹ kg ha⁻¹, so the
estimated bias of the regression estimator equals -0.1 10⁹ kg ha⁻¹, which is
negligibly small relative to the estimated population mean. The variance of
the 10,000 regression estimates equals 26.70 (10⁹ kg ha⁻¹)², and the average
of the 10,000 estimated approximate variances using Equation (10.18) equals
26.86 (10⁹ kg ha⁻¹)². The gain in precision due to the regression estimator,
quantified by the ratio of the variance of the π estimator to the variance of
the regression estimator, equals 3.192.
For simple random sampling, the ratio of the variances of the simple regression
estimator and the 𝜋 estimator is independent of the sample size and equals
1 − 𝑟2 , with 𝑟 the correlation coefficient of the study variable and the covariate
(Särndal et al. (1992), p. 274).
Using multiple covariates in the regression estimator is straightforward with
function calibrate of package survey. As a first step, the best model is selected
with function regsubsets of package leaps (Lumley, 2020).
library(leaps)
n <- 100
set.seed(321)
mysample <- grdAmazonia %>%
dplyr::select(AGB, lnSWIR2, Terra_PP, Prec_dm, Elevation, Clay) %>%
slice_sample(n = n)
models <- regsubsets(AGB ~ ., data = mysample, nvmax = 4)
res_sum <- summary(models)
res_sum$outmat
The best model with one predictor is the model with lnSWIR2, the best
model with two predictors is the one with lnSWIR2 and Terra_PP, etc. Of
these models, the third model, i.e., the model with lnSWIR2, Terra_PP, and
Elevation, is the best when using adjusted 𝑅2 as a selection criterion.
which.max(res_sum$adjr2)
[1] 3
The standard error of the estimated mean AGB is somewhat reduced by adding
the covariates Terra_PP and Elevation to the regression estimator.
totals <- c(N, sum(grdAmazonia$lnSWIR2),
sum(grdAmazonia$Terra_PP), sum(grdAmazonia$Elevation))
mysample_cal <- calibrate(design_si, formula = ~ lnSWIR2 + Terra_PP + Elevation,
population = totals, calfun = "linear")
svymean(~ AGB, mysample_cal)
mean SE
AGB 230.54 4.2224
library(mase)
covars <- c("lnSWIR2", "Terra_PP", "Elevation")
res <- greg(y = mysample$AGB, xsample = mysample[covars],
xpop = grdAmazonia[covars], pi = rep(n / N, n),
var_est = TRUE, var_method = "LinHTSRS", model = "linear")
res$pop_mean
[1] 230.5407
The multiple regression estimate is equal to the estimate obtained with function
calibrate of package survey. The estimated standard error equals
sqrt(res$pop_mean_var)
[,1]
[1,] 4.207809
which is slightly smaller than the standard error computed with package
survey. The standard error obtained with function greg is computed by
ignoring the g-weights (McConville et al., 2020). In an exercise, the two
approximate standard errors are compared in a sampling experiment.
All five covariates are used in prediction, but the coefficients associated with
these predictors are small except for lnSWIR2.
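A sketch of the elastic net regression estimator, assuming function
gregElasticNet of package mase with arguments mirroring those of greg; the
argument values are assumptions:

# Elastic net regression estimator with all five covariates
covars <- c("lnSWIR2", "Terra_PP", "Prec_dm", "Elevation", "Clay")
res <- gregElasticNet(y = mysample$AGB, xsample = mysample[covars],
xpop = grdAmazonia[covars], pi = rep(n / N, n),
var_est = TRUE, var_method = "LinHTSRS", model = "linear")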
As shown below, the estimated standard error is considerably larger than the
standard error obtained with lnSWIR2, Terra_PP, and Elevation as predictors.
In this case, the elastic net regression estimator does not work as well as the
multiple regression estimator using the best subset of the covariates.
sqrt(res$pop_mean_var)
s1
s1 6.207637
Exercises
With stratified simple random sampling there are two regression estimators:
separate and combined. In the first estimator, the regression estimator for
simple random sampling is applied at the level of the strata. This implies that
for each stratum separately a vector with population regression coefficients 𝐛ℎ
is estimated. The regression estimates of the stratum means are then combined
by computing the weighted average, using the relative sizes of the strata as
weights:

$$\hat{\bar{z}}_{\text{sregr}} = \sum_{h=1}^{H} w_h\, \hat{\bar{z}}_{\text{regr},h} \;, \tag{10.19}$$

with

$$\hat{\bar{z}}_{\text{regr},h} = \bar{z}_{\mathcal{S}h} + \hat{b}_h \left( \bar{x}_h - \bar{x}_{\mathcal{S}h} \right) \;, \tag{10.20}$$

with z̄_Sh and x̄_Sh the stratum sample means of the study variable and the
covariate, respectively, x̄_h the mean of the covariate in stratum h, and b̂_h
the estimated slope coefficient for stratum h.
The variance of this separate regression estimator of the population mean can
be estimated by first estimating the variances of the regression estimators of
the stratum means using Equation (10.13), and then combining these variances
using Equation (4.4).
The separate regression estimator is illustrated with Eastern Amazonia. Biomes
are used as strata. There are four biomes, the levels of which are given short
names using function levels.
[1] "Mangroves"
[2] "Tropical & Subtropical Dry Broadleaf Forests"
[3] "Tropical & Subtropical Grasslands, Savannas & Shrublands"
[4] "Tropical & Subtropical Moist Broadleaf Forests"
Moist forest is by far the largest stratum; it covers 92% of the area. Mangrove,
Forest_dry, and Grassland cover 0.4, 2.3, and 5.5% of the area, respectively. A
stratified simple random sample of size 100 is selected using function strata
of package sampling, see Chapter 4. I chose five units as a minimum sample
size. Note that the stratum sample sizes are not proportional to the stratum
sizes.
library(sampling)
N_h <- table(grdAmazonia$Biome)
n_h <- c(5, 5, 5, 85)
set.seed(314)
units <- sampling::strata(grdAmazonia, stratanames = "Biome",
size = n_h[unique(grdAmazonia$Biome)], method = "srswor")
mysample <- getdata(grdAmazonia, units)
As a first step in estimation, for each stratum the mean of the covariate over
all units in a stratum (population mean per stratum) and the sample means
of the study variable and the covariate are computed.
The next step is to estimate the regression coefficients (intercept and slope)
per stratum. This is done in a for-loop. The estimated slope coefficient is
used to compute the regression estimator per stratum. The residuals are
extracted to approximate the variance of the regression estimator per stratum,
as sketched below.
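A sketch of the full procedure (Equations (10.19) and (10.20)); the
intermediate object names are assumptions:

# Separate regression estimator of the mean AGB and its standard error
biomes <- sort(unique(as.character(grdAmazonia$Biome)))
w_h <- as.numeric(table(grdAmazonia$Biome)[biomes]) / nrow(grdAmazonia)
mz_h <- v_mz_h <- numeric(length(biomes))
for (i in seq_along(biomes)) {
sam_h <- subset(mysample, Biome == biomes[i])
pop_h <- subset(grdAmazonia, Biome == biomes[i])
lm_h <- lm(AGB ~ lnSWIR2, data = sam_h)
mz_h[i] <- mean(sam_h$AGB) +
coef(lm_h)[2] * (mean(pop_h$lnSWIR2) - mean(sam_h$lnSWIR2))
e_h <- residuals(lm_h)
v_mz_h[i] <- (1 - nrow(sam_h) / nrow(pop_h)) * var(e_h) / nrow(sam_h)
}
print(mz_sregr <- sum(w_h * mz_h))
print(se_mz_sregr <- sqrt(sum(w_h^2 * v_mz_h)))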
[1] 223.9426
[1] 5.077558
The separate regression estimator can also be computed with package survey,
by calibrating on an ANCOVA model. Which population totals must be passed
to function calibrate with this ANCOVA model can be determined by printing
the design matrix that is used to fit the ANCOVA model; only the first two
rows are printed.
With this model formulation, the first population total is the total number
of population units. The second, third, and fourth population totals are the
number of population units in stratum levels 2, 3, and 4. The fifth population
total is the population total of covariate lnSWIR2 and the sixth, seventh, and
eighth population totals are the totals of covariate lnSWIR2 in stratum levels
2, 3, and 4.
mean SE
AGB 223.94 5.8686
With this formula, the population totals are the number of population units
in stratum levels 1, 2, 3, and 4, as well as the population totals of covariate
lnSWIR2 of the strata.
mean SE
AGB 223.94 5.8686

In the combined regression estimator, one regression coefficient b̂ is
estimated from the data of all strata:

$$\hat{\bar{z}}_{\text{cregr}} = \hat{\bar{z}}_{\pi} + \hat{b} \left( \bar{x} - \hat{\bar{x}}_{\pi} \right) \;, \tag{10.21}$$

with z̄̂_π and x̄̂_π the π estimators of the population means of the study
variable and the covariate, and

$$\hat{b} = \frac{\sum_{h=1}^{H} w_h^2\, \widehat{S}^2_h(z,x)/n_h}{\sum_{h=1}^{H} w_h^2\, \widehat{S}^2_h(x)/n_h} \;, \tag{10.22}$$

with Ŝ²_h(z,x) the estimated covariance of the study variable and the covariate
in stratum h and Ŝ²_h(x) the estimated variance of the covariate in stratum h.
Estimator (10.22) is for infinite populations and for stratified simple random
sampling with replacement of finite populations. For sampling without
replacement from finite populations, finite population corrections 1 − n_h/N_h
must be added to the numerator and denominator of b̂ (Cochran (1977), p. 202).
2. Estimate for each stratum the variance of the estimator of the mean
of the residuals: V̂(ē̂_h) = Ŝ²_h(e)/n_h, with Ŝ²_h(e) the estimated
variance of the residuals in stratum h.
The next code chunks show the estimation procedure. First, the population
means of the study variable AGB and of the covariate lnSWIR2 are estimated
by the 𝜋 estimator, see Chapter 4.
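A sketch of the design-based estimate of the regression coefficients
(Equation (10.6) with constant residual variance), assuming the inclusion
probabilities are in column Prob added by function getdata:

# Design-based estimate of b from the stratified sample
X <- cbind(1, mysample$lnSWIR2)
b <- solve(crossprod(X / sqrt(mysample$Prob)),
t(X) %*% (mysample$AGB / mysample$Prob))
t(b)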
[,1] [,2]
[1,] 1678.268 -226.6772
Note that the same estimates are obtained by model-based estimation using
weighted least squares, based on the assumption that the variances σ²(ε_k) are
proportional to the inclusion probabilities (which is a weird assumption).
(Intercept) lnSWIR2
1678.2684 -226.6772
[1] 224.1433
[1] 5.122518
mean SE
AGB 224.14 5.8707
Function calibrate computes the regression estimate and its standard error
with the calibrated weights g_k (Särndal et al. (1992), equation (6.5.12)).
This explains the difference between the two standard errors.

10.2 Ratio estimator

In some cases it is reasonable to assume that the fitted line passes through
the origin. The population total can then be estimated by the ratio estimator:

$$\hat{t}_{\text{ratio}}(z) = \frac{\hat{t}_{\pi}(z)}{\hat{t}_{\pi}(x)}\, t(x) = \hat{b}\, t(x) \;, \tag{10.23}$$

with t̂_π(z) and t̂_π(x) the π estimators of the total of the study variable
(poppy area) and the ancillary variable (agricultural area), respectively, and
t(x) the total of the ancillary variable, which must be known.
The working model of the ratio estimator is a heteroscedastic model without
intercept:

$$Z_k = \beta\, x_k + \epsilon_k \;, \quad \sigma^2(\epsilon_k) = \sigma^2 x_k \;, \tag{10.24}$$

with β the slope of the line and σ² a constant (the variance of the residual
for x_k = 1). The residual variance is thus assumed proportional to the
covariate x.
The ratio estimator was applied before to estimate the population mean or
population total from a systematic random sample (Chapter 5), a one-stage
and two-stage cluster random sample (Sections 6.3 and 7.3), and a ppswor
sample (Section 8.2). By taking x_k = 1, k = 1, …, N, t̂_π(x) in Equation
(10.23) is equal to N̂, and t(x) is equal to N. For (two-stage) cluster random
sampling, M is used for the total number of population units (N is the total
number of clusters or primary sampling units in the population), and therefore
t̂_π(x) = M̂ and t(x) = M. This yields the ratio estimators of the population
total appropriate for these sampling designs.

Equation (10.23) is a general estimator that can be used for any probability
sampling design, not only for simple random sampling. For simple random
sampling, the coefficient b is estimated by the ratio of the sample means of z
and x.
For simple random sampling, the sampling variance of the ratio estimator of
the population total can be approximated by

$$\widehat{V}\!\left(\hat{t}_{\text{ratio}}(z)\right) = N^2\, \frac{\widehat{S}^2(e)}{n} \;, \tag{10.25}$$

with Ŝ²(e) the estimated variance of the residuals e_k = z_k − b̂ x_k:

$$\widehat{S}^2(e) = \frac{1}{n-1} \sum_{k \in \mathcal{S}} e_k^2 \;. \tag{10.26}$$
n <- 50
N <- nrow(grdKandahar)
units <- sample(N, size = n, replace = FALSE)
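A sketch of the estimation, assuming units are the selected rows:

# Ratio estimate (Equation (10.23)) and its standard error
# (Equations (10.25) and (10.26))
mysample <- grdKandahar[units, ]
b <- mean(mysample$poppy) / mean(mysample$agri)
tx_pop <- sum(grdKandahar$agri)
print(tz_ratio <- b * tx_pop)
e <- mysample$poppy - b * mysample$agri
S2e <- sum(e^2) / (n - 1)
print(se_tz_ratio <- sqrt(N^2 * S2e / n))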
[1] 55009.69
[1] 18847.31

An alternative variance approximation makes use of the g-weight, which for the
ratio estimator equals

$$g = \frac{t(x)}{\hat{t}_{\pi}(x)} \;, \tag{10.27}$$

with t(x) the population total of the covariate and t̂_π(x) the π estimate of
the population total of the covariate.
pi <- n / N
tx_HT <- sum(mysample$agri / pi)
g <- tx_pop / tx_HT
S2ge <- sum(g^2 * e^2) / (n - 1)
print(se_tz_ratio <- sqrt(N^2 * (1 - n / N) * S2ge / n))
[1] 17149.62
The ratio estimate and the estimated standard error of the ratio estimator can
be computed with package survey as follows.
mysample$N <- N
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ N)
b <- svyratio(~ poppy, ~ agri, design = design_si)
predict(b, total = tx_pop)
$total
agri
poppy 55009.69
$se
agri
poppy 17149.62
Figure 10.3 shows the sampling distribution of the ratio estimator and the
𝜋 estimator, obtained by repeating simple random sampling of size 50 and
estimation 10,000 times. The average of the 10,000 ratio estimates of the total
poppy area equals 62,512 ha. The population total of poppy equals 63,038
ha, so the estimated bias of the ratio estimator equals -526 ha. The boxplots
in Figure 10.3 show that the ratio estimator has less extreme outliers. The
standard deviation of the 10,000 ratio estimates equals 24,177 ha. The gain in
precision due to the ratio estimator, quantified by the ratio of the variance of
the 𝜋 estimator to the variance of the ratio estimator, equals 1.502.
Exercises
With stratified simple random sampling, there are, similar to the regression
estimator, two options for estimating a population parameter: either estimate
the ratios separately for the strata or estimate a combined ratio. The separate
ratio estimator of the population total is

$$\hat{t}_{\text{sratio}}(z) = \sum_{h=1}^{H} \hat{t}_{\text{ratio},h}(z) \;, \tag{10.28}$$

with

$$\hat{t}_{\text{ratio},h}(z) = \frac{\hat{t}_{\pi,h}(z)}{\hat{t}_{\pi,h}(x)}\, t_h(x) \;, \tag{10.29}$$

in which t̂_{π,h}(z) and t̂_{π,h}(x) are the π estimators of the population
total of the study variable and the covariate for stratum h, respectively, and
t_h(x) is the population total of the covariate in stratum h.

The combined ratio estimator is

$$\hat{t}_{\text{cratio}}(z) = \frac{\sum_{h=1}^{H} \hat{t}_{\pi,h}(z)}{\sum_{h=1}^{H} \hat{t}_{\pi,h}(x)}\, t(x) \;. \tag{10.30}$$
The code chunks below show how the combined and separate ratio estimates
can be computed with package survey. First, two equal-sized strata are
computed using the median of the covariate agri as a stratum bound. Stratum
sample sizes are computed, and a stratified simple random sample without
replacement is selected.
The stratum sizes N_h are added to mysample, function svydesign specifies the
sampling design, function svyratio estimates the population ratio and its
variance, and finally function predict estimates the population total.
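A sketch of these steps, assuming the stratified sample mysample has columns
stratum and N_h:

# Combined ratio estimator with package survey
design_str <- svydesign(
id = ~ 1, strata = ~ stratum, data = mysample, fpc = ~ N_h)
common <- svyratio(~ poppy, ~ agri, design = design_str)
predict(common, total = tx_pop)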
$total
agri
poppy 28389.02
$se
agri
poppy 8845.847
total SE
poppy 28389 8845.8
Computing the separate ratio estimator goes along the same lines. Function
svyratio with argument separate = TRUE estimates the ratio and its variance for
each stratum separately. To predict the population total, the stratum totals of
the covariate must be passed to function predict using argument total.
$total
agri
poppy 28331.32
$se
agri
poppy 8882.492
In the poststratified estimator, the population mean is estimated by a
weighted average of estimated group (poststratum) means:

$$\hat{\bar{z}}_{\text{pos}} = \sum_{g=1}^{G} w_g\, \frac{\hat{t}_g(z)}{\widehat{N}_g} = \sum_{g=1}^{G} w_g\, \frac{\sum_{k \in \mathcal{S}_g} \frac{z_k}{\pi_k}}{\sum_{k \in \mathcal{S}_g} \frac{1}{\pi_k}} \;, \tag{10.31}$$

where S_g is the sample from group g, w_g = N_g/N is the relative size of group
g, t̂_g(z) is the estimated total of the study variable for group g, N̂_g is
the estimator of the size of group g, and π_k is the inclusion probability of
unit k. The estimated group means are weighted by their relative sizes w_g,
which are assumed to be known. In spite of this, the group means are estimated
by dividing the estimated group totals by their estimated size, N̂_g, because
this ratio estimator is more precise than the π estimator of the group mean.
The poststratified estimator is the natural estimator for the one-way ANOVA
model

$$Z_k = \mu_g + \epsilon_k \;, \quad \sigma^2(\epsilon_k) = \sigma^2_g \;, \tag{10.32}$$

with μ_g the mean for group (subpopulation) g = 1, …, G and σ²_g the variance
of the study variable of group g.
For simple random sampling, the poststratified estimator reduces to

$$\hat{\bar{z}}_{\text{pos}} = \sum_{g=1}^{G} w_g\, \bar{z}_{\mathcal{S}_g} \;, \tag{10.33}$$
where z̄_{S_g} is the sample mean of group g. If for all groups we have at
least two sampling units, n_g ≥ 2, the variance of this poststratified
estimator of the mean can be estimated by

$$\widehat{V}(\hat{\bar{z}}_{\text{pos}} \mid \mathbf{n}_g) = \sum_{g=1}^{G} w_g^2\, \frac{\widehat{S}^2_g}{n_g} \;, \tag{10.34}$$

with Ŝ²_g the estimated variance of the study variable in group g:

$$\widehat{S}^2_g = \frac{1}{n_g - 1} \sum_{k \in \mathcal{S}_g} (z_k - \bar{z}_{\mathcal{S}_g})^2 \;. \tag{10.35}$$
library(forcats)
grdVoorst$poststratum <- fct_collapse(
grdVoorst$stratum, SA = c("BA", "EA", "PA"))
print(N_g <- tapply(grdVoorst$z, INDEX = grdVoorst$poststratum, FUN = length))
SA RA XF
5523 659 1346
One hundred points are selected by simple random sampling with replacement.
The expected sample sizes per group are proportional to the size of the groups,
𝐸(𝑛u� /𝑛) = 𝑁u� /𝑁, but for a single sample the sample proportions may deviate
considerably from the population proportions.
n <- 100
N <- nrow(grdVoorst)
set.seed(314)
units <- sample(N, size = n, replace = TRUE)
mysample <- grdVoorst[units, ]
n_g <- tapply(mysample$z, INDEX = mysample$poststratum, FUN = length)
print(n_g)
SA RA XF
71 6 23
The population mean is estimated by first computing the sample means per
group, followed by computing the weighted average of the sample means, using
the relative sizes of the groups as weights.
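A sketch of these computations (Equations (10.33) and (10.34)):

# Poststratified estimate of the mean and its standard error
mz_g <- tapply(mysample$z, INDEX = mysample$poststratum, FUN = mean)
w_g <- N_g / N
print(mz_pos <- sum(w_g * mz_g))
S2z_g <- tapply(mysample$z, INDEX = mysample$poststratum, FUN = var)
print(se_mz_pos <- sqrt(sum(w_g^2 * S2z_g / n_g)))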
[1] 85.11039
[1] 4.49969
Note that this variance estimator can only be computed with at least two units
per group. For this reason, I recommend using a limited number of groups,
especially for small sample sizes.
Function postStratify of package survey can be used to compute the post-
stratified estimator and its standard error.
mysample$weights <- N / n
design_si <- svydesign(id = ~ 1, weights = ~ weights, data = mysample)
pop <- data.frame(poststratum = c("SA", "RA", "XF"), Freq = N_g)
mysample_pst <- postStratify(
  design = design_si, strata = ~ poststratum, population = pop)
svymean(~ z, mysample_pst)
mean SE
z 85.11 4.3942
Lohr (1999) warns about data snooping. By defining groups after analysing
the data, arbitrarily small sampling variances of the estimated mean can be
obtained.
The model-calibration estimator of the population mean is
$$\hat{\bar{z}}_{\mathrm{MC}} = \hat{\bar{z}}_\pi + \hat{a}\left(1 - \frac{1}{N}\sum_{k \in \mathcal{S}} \frac{1}{\pi_k}\right) + \hat{b}\left(\frac{1}{N}\sum_{k=1}^{N} \hat{m}(\mathbf{x}_k) - \frac{1}{N}\sum_{k \in \mathcal{S}} \frac{\hat{m}(\mathbf{x}_k)}{\pi_k}\right) \;, \qquad (10.36)$$
with $\hat{\bar{z}}_\pi$ the $\pi$ estimator of the population mean of the study variable, $\hat{\bar{m}}_\pi$ the
$\pi$ estimator of the population mean of the predicted values, and $\hat{a}$ an intercept
estimated by
$$\hat{a} = (1 - \hat{b})\left(\frac{1}{N}\sum_{k \in \mathcal{S}} \frac{z_k}{\pi_k}\right) \;. \qquad (10.38)$$
The second term in Equation (10.36) cancels for all sampling designs for which
the sum of the design weights, i.e., the sum of the reciprocals of the inclusion
probabilities, equals the population size: $\sum_{k \in \mathcal{S}} 1/\pi_k = N$. Only for some
unequal probability sampling designs may this not be the case.
The alternative is to plug the fitted values $\hat{m}(\mathbf{x}_k)$ into the generalised difference
estimator, Equation (10.2). If we drop the second term, the model-calibration
estimator can be rewritten as
$$\hat{\bar{z}}_{\mathrm{MC}} = \frac{1}{N}\sum_{k=1}^{N} \hat{b}\,\hat{m}(\mathbf{x}_k) + \frac{1}{N}\sum_{k \in \mathcal{S}} \frac{z_k - \hat{b}\,\hat{m}(\mathbf{x}_k)}{\pi_k} \;. \qquad (10.39)$$
For simple random sampling, this estimator reduces to
$$\hat{\bar{z}}_{\mathrm{MC}} = \frac{1}{n}\sum_{k \in \mathcal{S}} z_k + \hat{b}_{\mathrm{SI}}\left(\frac{1}{N}\sum_{k=1}^{N} \hat{m}(\mathbf{x}_k) - \frac{1}{n}\sum_{k \in \mathcal{S}} \hat{m}(\mathbf{x}_k)\right) \;, \qquad (10.40)$$
with
$$\hat{b}_{\mathrm{SI}} = \frac{\sum_{k \in \mathcal{S}} \{\hat{m}(\mathbf{x}_k) - \bar{m}_{\mathcal{S}}\}\{z_k - \bar{z}_{\mathcal{S}}\}}{\sum_{k \in \mathcal{S}} \{\hat{m}(\mathbf{x}_k) - \bar{m}_{\mathcal{S}}\}^2} \;, \qquad (10.41)$$
with $\bar{m}_{\mathcal{S}}$ the sample mean of the predicted values.
An estimator of the variance of the model-assisted calibration estimator is
$$\hat{V}(\hat{\bar{z}}_{\mathrm{MC}}) = \hat{V}(\hat{\bar{e}}_\pi) \;, \qquad (10.42)$$
with $\hat{\bar{e}}_\pi$ the $\pi$ estimator of the population mean of the residuals $e$. For sampling
designs with fixed sample size, these residuals are equal to $e_k = z_k - \hat{b}\,\hat{m}(\mathbf{x}_k)$.
For simple random sampling with replacement from finite populations, and for
simple random sampling of infinite populations, this variance can be estimated by
$$\hat{V}(\hat{\bar{z}}_{\mathrm{MC}}) = \frac{\widehat{S^2}(e)}{n} \;, \qquad (10.43)$$
with $\widehat{S^2}(e)$ the estimated population variance of the residuals. Similarly, the
variance of the generalised difference estimator can be estimated by
$$\hat{V}(\hat{\bar{z}}_{\mathrm{dif}}) = \hat{V}(\hat{\bar{d}}_\pi) \;, \qquad (10.44)$$
with $\hat{\bar{d}}_\pi$ the $\pi$ estimator of the population mean of the differences $d_k = z_k - \hat{m}(\mathbf{x}_k)$.
The data of Eastern Amazonia are used to illustrate model-assisted estimation
of AGB, using five environmental covariates in predicting AGB. First a
regression tree is used for prediction, then a random forest. For an introduction
to regression trees and random forest modelling, see this blog1. In that blog the
study variable is a categorical variable, whereas in our example the study
variable is quantitative and continuous. This is not essential; the only difference
is the measure for quantifying how good a split is. With a quantitative study
variable, this is quantified by the following sum of squares:
$$SS = \sum_{g=1}^{2} \sum_{k \in \mathcal{S}_g} (z_{kg} - \bar{z}_{\mathcal{S}_g})^2 \;, \qquad (10.45)$$
with $\bar{z}_{\mathcal{S}_g}$ the mean of the study variable of the units in group (node) $g$ of the split.
N <- nrow(grdAmazonia)
n <- 100
set.seed(314)
units <- sample(N, size = n, replace = FALSE)
covs <- c("SWIR2", "Terra_PP", "Prec_dm", "Elevation", "Clay")
mysample <- grdAmazonia[units, c("AGB", covs)]
1 https://victorzhou.com/blog/intro-to-random-forests/
Package rpms (Toth, 2021) is used to build a regression tree for AGB, using
all five covariates as predictors. Note that I now use the original untransformed
SWIR2 as a predictor. Transforming predictors so that the relation with the
study variable becomes linear is not needed when fitting a non-linear model
such as a regression tree. Figure 10.4 shows the fitted tree.
library(rpms)
tree <- rpms(
rp_equ = AGB ~ SWIR2 + Terra_PP + Prec_dm + Elevation + Clay,
data = as.data.frame(mysample), pval = 0.05)
FIGURE 10.4: Regression tree for AGB (10⁹ kg ha⁻¹) calibrated on a simple
random sample of size 100 from Eastern Amazonia.
The regression tree is used to predict AGB for all population units.
[1] 226.7433
Its standard error is estimated by the square root of the estimated variance of
the estimator of the mean of the differences.
[1] 3.212222
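A minimal sketch of these computations (the chunk is hidden; rpms provides a
predict method for fitted trees, and n and grdAmazonia are as defined above):

pred_pop <- predict(tree, newdata = as.data.frame(grdAmazonia))
pred_sam <- predict(tree, newdata = as.data.frame(mysample))
d <- mysample$AGB - pred_sam         # differences for the sample units
mean(pred_pop) + mean(d)             # generalised difference estimate
sqrt(var(d) / n)                     # its standard error

The same estimate and standard error are returned by function gregTree of
package mase.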
library(mase)
pi <- rep(n / N, n)
res <- gregTree(
mysample$AGB, xsample = mysample[, covs],
xpop = grdAmazonia[, covs], pi = pi,
var_est = TRUE, var_method = "LinHTSRS")
res$pop_mean
[1] 226.7433
sqrt(res$pop_mean_var)
[1] 3.212222
The variance of the estimator of the mean can also be estimated by bootstrap-
ping the sample (Lohr (1999), section 9.3.3).
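A sketch of the hidden re-run with the bootstrap variance estimator (argument
B, the number of bootstrap replicates, is an assumption):

res <- gregTree(
  mysample$AGB, xsample = mysample[, covs],
  xpop = grdAmazonia[, covs], pi = pi,
  var_est = TRUE, var_method = "bootstrapSRS", B = 100)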
sqrt(res$pop_mean_var)
[,1]
[1,] 4.076
The standard error obtained by the bootstrap is considerably larger than the
previous standard error based on a Taylor linearisation of the estimator of the
mean. As we will see hereafter, the Taylor linearisation seriously underestimates
the true standard error.
The simple random sampling of 100 units and the model-assisted estimation
are repeated 500 times, using a regression tree for prediction. The variance is
estimated by Taylor linearisation (var_method = "LinHTSRS") and by bootstrapping
(var_method = "bootstrapSRS") using 100 bootstrap samples.
The variance of the 500 estimated population means of AGB is 20.8 (10⁹
kg ha⁻¹)². Estimation of the variance through Taylor linearisation strongly
underestimates the variance: the average of the 500 estimated variances equals
14.5 (10⁹ kg ha⁻¹)². On the contrary, the bootstrap variance estimator overesti-
mates the variance: the average of the 500 estimated variances equals 23.4 (10⁹
kg ha⁻¹)². I prefer to overestimate my uncertainty about the mean instead of
underestimating it.
The package ranger (Wright and Ziegler, 2017) is used to fit a random forest
(RF) model for AGB using the five environmental covariates as predictors
and the simple random sample of size 100 selected in the previous subsection.
Function importance shows the importance of the covariates, here measured as
the total decrease in node impurity from splitting on each covariate. All five
covariates contribute, SWIR2 by far most strongly.
library(ranger)
set.seed(314)
forest.sample <- ranger(
AGB ~ ., data = mysample, num.trees = 1000, importance = "impurity")
importance(forest.sample)
Out-of-bag predictions for the selected units are saved in element predictions
of the output object of function ranger. The fitted model is also used to predict
AGB at all units (raster cells), using function predict. The AGB values of the
sample units can thus be predicted in two ways:
1. using all trees of the forest (the predictions obtained with function
predict); or
2. using only the trees calibrated on bootstrap samples that do not
include the sampling unit used as a prediction unit. These out-of-bag
predictions are stored in element predictions of the output object of
function ranger.
The next code chunk shows how the model-calibration estimate can be com-
puted with the AGB data of the simple random sample and the RF predictions
of AGB. First, all trees are used.
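A sketch of this computation, a direct implementation of Equations (10.40),
(10.41), and (10.43) (object names are assumptions):

m_pop <- predict(forest.sample, data = as.data.frame(grdAmazonia[, covs]))$predictions
m_sam <- predict(forest.sample, data = mysample)$predictions
mz_s <- mean(mysample$AGB); mm_s <- mean(m_sam)
b <- sum((m_sam - mm_s) * (mysample$AGB - mz_s)) /
  sum((m_sam - mm_s)^2)                     # Equation (10.41)
mz_s + b * (mean(m_pop) - mm_s)             # Equation (10.40)
e <- mysample$AGB - b * m_sam
var(e) / n                                  # Equation (10.43)
# out-of-bag variant: replace m_sam by forest.sample$predictions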
The two calibration estimates are about equal: 226.8 10⁹ kg ha⁻¹ using sample
predictions obtained with function predict, and 227.3 10⁹ kg ha⁻¹ with the
out-of-bag sample predictions. However, their estimated variances differ
strongly: 2.35 (10⁹ kg ha⁻¹)² and 12.46 (10⁹ kg ha⁻¹)², respectively.
In the next code chunk, the generalised difference estimate (Equation (10.2))
is computed. Similar to the model-calibration estimate, this difference estimate
is computed from predictions based on all trees and from the out-of-bag
predictions.
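A sketch, reusing m_pop and m_sam from the sketch above:

d <- mysample$AGB - m_sam
mean(m_pop) + mean(d)     # generalised difference estimate, Equation (10.2)
var(d) / n                # estimated variance
# out-of-bag variant: d <- mysample$AGB - forest.sample$predictions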
For the difference estimator the results are very similar. The two difference
estimates are 226.6 10⁹ kg ha⁻¹ and 227.0 10⁹ kg ha⁻¹, and their estimated
variances are 2.70 (10⁹ kg ha⁻¹)² and 12.80 (10⁹ kg ha⁻¹)², respectively. The
model-calibration estimate and the generalised difference estimate are nearly
equal.
The sampling and estimation are repeated: 1,000 simple random samples with-
out replacement of size 100 are selected. Each sample is used to calibrate a RF,
each forest consisting of 1,000 trees. This results in 2 × 1,000 model-calibration
estimates and 2 × 1,000 generalised difference estimates of the population mean,
one series for each type of prediction. For each estimator, the relative bias is
computed:
$$bias = \frac{\frac{1}{1000}\sum_{i=1}^{1000} \hat{\bar{z}}_i - \bar{z}}{\bar{z}} \;. \qquad (10.46)$$
Besides, for each estimator the variance of the 1,000 estimates is computed,
which can be compared with the mean of the 1,000 estimated variances. The
mean of the estimated variances is used to compute the variance to mean
squared error ratio:
$$R = \frac{\frac{1}{1000}\sum_{i=1}^{1000} \hat{V}(\hat{\bar{z}}_i)}{MSE} \;, \qquad (10.47)$$
with
$$MSE = \frac{1}{1000}\sum_{i=1}^{1000} (\hat{\bar{z}}_i - \bar{z})^2 \;. \qquad (10.48)$$
TABLE 10.1: Summary statistics of the model-calibration estimator of the
mean AGB, for sample sizes 50, 100, and 250.

                            50       100      250
Relative bias            -0.0014  -0.0006  -0.0012
Experimental variance    33.9075  15.5255   5.6111
Mean variance estimates  36.1512  14.9750   5.0467
Variance to MSE ratio     1.0642   0.9642   0.8882
Relative efficiency       5.0066   5.4755   5.9866
An out-of-bag prediction uses only the trees whose bootstrap sample does not
contain the unit to be predicted; a bootstrap sample excludes a given unit with
probability of about e⁻¹ ≈ 0.368, so with 1,000 trees these predictions are the
average of, on average, 368 tree predictions. This explains why the out-of-bag
prediction errors are larger than the prediction errors obtained with function
predict. In other words, the variances of the out-of-bag differences $d$ and of the
out-of-bag residuals $e$ are larger than those obtained when predicting with all
trees.
Hereafter, I report the results obtained with the out-of-bag samples only.
The relative bias is negligibly small for all sample sizes (Table 10.1). For
𝑛 = 250 and 100 the average of the 1,000 estimated variances is smaller than
the variance of the 1,000 estimated means, whereas for 𝑛 = 50 the average of
the 1,000 variance estimates is larger than the experimental variance of the
estimated means. The variance to MSE ratio is smaller than 1 for 𝑛 = 250 and
100, but larger than 1 for 𝑛 = 50. The model-calibration estimator is much
more accurate than the 𝜋 estimator for all three sample sizes, as shown by the
high relative efficiencies. The relative efficiency increases with the sample size.
The summary statistics for the performance of the generalised difference
estimator are very similar to those of the model-calibration estimator (Table
10.2).
TABLE 10.2: Summary statistics of the generalised difference estimator of the
mean AGB, for sample sizes 50, 100, and 250.

                            50       100      250
Relative bias            -0.0008  -0.0006  -0.0013
Experimental variance    34.5188  16.2339   5.6891
Mean variance estimates  38.4629  15.4568   5.1125
Variance to MSE ratio     1.1144   0.9520   0.8867
Relative efficiency       4.9275   5.2379   5.8997
Kim and Wang (2019) also describe an alternative approach that does not
require a probability sample at all. In this approach, the big data sample is
subsampled to correct the selection bias in the big data sample. The subsample
is selected by inverse sampling, using data on an ancillary variable 𝑥, either from
a census or a probability sample. The subsample is selected with conditional
inclusion probabilities equal to the subsample size multiplied by an importance
weight (Kim and Wang (2019), equation 4).
11
Two-phase random sampling
The regression and ratio estimators of Chapter 10 require that the means
of the ancillary variables are known. If these are unknown, but the ancillary
variables can be measured cheaply, one may decide to estimate the population
means of the ancillary variables from a large sample. The study variable is
measured in a random subsample of this large sample only. This technique
is known in the sampling literature as two-phase random sampling or double
sampling. Another application of two-phase sampling is two-phase sampling
for stratification. Stratified random sampling (Chapter 4) requires a map with
the strata. The poststratified estimator of Subsection 10.2.2 requires that
the sizes of the strata are known. With two-phase sampling for stratification,
neither a map of the strata nor knowledge of the stratum sizes is required.
Note that the term ‘phase’ does not refer to a period of time; all data can be
collected in one sampling campaign. Let me also explain the difference with
two-stage cluster sampling (Chapter 7). In two-stage cluster random sampling,
we have two types of sampling units, clusters of population units and individual
population units. In two-phase sampling, we have one type of sampling unit
only, the objects of a discrete population or the elementary sampling units of
a continuous population (Section 1.1).
In two-phase sampling for regression and two-phase sampling for stratification,
the two phases have the same aim, i.e., to estimate the population mean of
the study variable. The observations of the covariate(s) and/or strata in the
first phase are merely done to increase the precision of the estimated mean of
the study variable. Another application of two-phase sampling is subsampling
an existing probability sample designed for a different aim. So, in this case the
study variable observed in the second-phase sample may not be related to the
variables observed in the first-phase sample.
An example is LUCAS-Topsoil (Ballabio et al., 2019) which is a subsample of
approximately 22,000 units sampled from a much larger sample, the LUCAS
sample, designed for estimating totals of land use and land cover classes across
the European Union. It was not feasible to observe the soil properties at all
sites of the LUCAS sample, and for this reason a subsample was selected.
Regrettably, this subsample is not a probability sample from the LUCAS
sample: the inclusion probabilities are either zero or unknown. Design-based
or model-assisted estimation of means of soil properties for domains of interest
is not feasible. The only option is model-based prediction.
The population total can be estimated by
$$\hat{t}(z) = \sum_{k \in \mathcal{S}_2} \frac{z_k}{\pi_{1k}\,\pi_{k|\mathcal{S}_1}} = \sum_{k \in \mathcal{S}_2} \frac{z_k}{\pi^*_k} \;, \qquad (11.1)$$
with $\pi_{1k}$ the probability that unit $k$ is selected in the first phase, and $\pi_{k|\mathcal{S}_1}$
the probability that unit $k$ is selected in the second phase, given the first-phase
sample $\mathcal{S}_1$. This general $\pi$ estimator for two-phase sampling, referred to as
the $\pi^*$ estimator by Särndal et al. (1992), can be used for any combination of
probability sampling designs in the first and second phase.
To derive the variance, it is convenient to write the total estimation error as
the sum of two errors:
$$\hat{t}(z) - t(z) = \left(\sum_{k \in \mathcal{S}_1} \frac{z_k}{\pi_{1k}} - t(z)\right) + \left(\sum_{k \in \mathcal{S}_2} \frac{z_k}{\pi^*_k} - \sum_{k \in \mathcal{S}_1} \frac{z_k}{\pi_{1k}}\right) = e_1 + e_2 \;. \qquad (11.2)$$
The first error 𝑒1 is the error in the estimated population total, as estimated
by the usual 𝜋 estimator using the study variable values for the units in
the first-phase sample. This estimator cannot be computed in practice, as the
study variable values are only known for a subset of the units in the first-phase
sample. The second error 𝑒2 is the difference between the 𝜋∗ estimator using the
study variable values for the units in the subsample only, and the 𝜋 estimator
using the study variable values for all units in the first-phase sample.
The variance of the $\pi^*$ estimator can be decomposed into the variances of these
two errors as follows:
$$V_{\mathcal{S}_1,\mathcal{S}_2}(\hat{t}) = V_{\mathcal{S}_1}\{E_{\mathcal{S}_2}(\hat{t}\,|\,\mathcal{S}_1)\} + E_{\mathcal{S}_1}\{V_{\mathcal{S}_2}(\hat{t}\,|\,\mathcal{S}_1)\} = V_{\mathcal{S}_1}(e_1) + E_{\mathcal{S}_1}\{V_{\mathcal{S}_2}(e_2\,|\,\mathcal{S}_1)\} \;, \qquad (11.3)$$
with $V_{\mathcal{S}_1}$ and $E_{\mathcal{S}_1}$ the variance and expectation of the estimator of the
population total over repeated sampling with the design of the first phase,
respectively, and $V_{\mathcal{S}_2}$ and $E_{\mathcal{S}_2}$ the variance and expectation of the estimator
of the population total over repeated sampling with the design of the second
phase, respectively. The population mean can be estimated by the estimated
total divided by the population size $N$.
11.1 Two-phase random sampling for stratification
This sampling design is applied, for instance, to monitor land use and land
cover in the European Union by the LUCAS monitoring network mentioned
above. In the first phase, a systematic random sample is selected, consisting
of the nodes of a square sampling grid with a spacing of 2 km. Land use and
land cover (LULC) are then determined at the selected grid nodes, using
orthophotographs, satellite imagery, and fieldwork. The idea is that this
procedure results in a more accurate classification of LULC at the selected
units than by overlaying the grid nodes with an existing LULC map such
as the Corine Land Cover map. The site-specific determinations of LULC
classes are then used to select a stratified random subsample (second-phase
sample). In 2018 the monitoring network was redesigned (Ballin et al., 2018).
Two-phase sampling for stratification is now illustrated with study area Voorst.
A map with five combinations of soil type and land use is available for this
study area. These combinations were used as strata in Chapter 4, and the
stratum sizes were used in the poststratified estimator of Subsection 10.2.2.
Here, we consider the situation that we do not have this map and that we do
not know the sizes of these strata either. In the first phase, a simple random
sample of size 100 is selected. In the field, the soil-land use combination is
determined for the selected points, see Figure 11.1. This time we assume that
the field determinations are equal to the classes as shown on the map.
n1 <- 100
set.seed(123)
N <- nrow(grdVoorst)
units <- sample(N, size = n1, replace = FALSE)
mysample <- grdVoorst[units, ]
library(sampling)
n2 <- 40
n1_h <- tapply(mysample$z, INDEX = mysample$stratum, FUN = length)
n2_h <- round(n1_h / n1 * n2, 0)
units <- sampling::strata(mysample, stratanames = "stratum",
size = n2_h[unique(mysample$stratum)], method = "srswor")
mysubsample <- getdata(mysample, units)
table(mysubsample$stratum)
BA EA PA RA XF
15 8 6 4 7
With simple random sampling in the first phase and stratified simple random
sampling in the second phase, the population mean can be estimated by
$$\hat{\bar{z}} = \sum_{h=1}^{H_{\mathcal{S}_1}} \frac{n_{1h}}{n_1}\, \bar{z}_{\mathcal{S}_{2h}} \;, \qquad (11.4)$$
where $H_{\mathcal{S}_1}$ is the number of strata used for stratification of the first-phase
sample, $n_{1h}$ is the number of units in the first-phase sample that form stratum
$h$ in the second phase, $n_1$ is the total number of units of the first-phase sample,
and $\bar{z}_{\mathcal{S}_{2h}}$ is the mean of the subsample from stratum $h$.
The estimated population mean equals 85.6 g kg⁻¹. The sampling variance over
repeated sampling with both designs can be approximated by (Särndal et al.
(1992), equation at the bottom of p. 353)
$$\hat{V}(\hat{\bar{z}}) = \sum_{h=1}^{H_{\mathcal{S}_1}} \left(\frac{n_{1h}}{n_1}\right)^2 \frac{\widehat{S}^2_{\mathcal{S}_{2h}}}{n_{2h}} + \frac{1}{n_1} \sum_{h=1}^{H_{\mathcal{S}_1}} \frac{n_{1h}}{n_1} \left(\bar{z}_{\mathcal{S}_{2h}} - \hat{\bar{z}}\right)^2 \;, \qquad (11.5)$$
with $\widehat{S}^2_{\mathcal{S}_{2h}}$ the variance of $z$ in the subsample from stratum $h$.
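Equations (11.4) and (11.5) can also be evaluated directly; a sketch using
n1_h and n2_h from the code above (both are ordered by stratum):

mz_h <- tapply(mysubsample$z, INDEX = mysubsample$stratum, FUN = mean)
S2_h <- tapply(mysubsample$z, INDEX = mysubsample$stratum, FUN = var)
w_h <- n1_h / n1
print(mz_2ph <- sum(w_h * mz_h))                 # Equation (11.4)
v_mz_2ph <- sum(w_h^2 * S2_h / n2_h) +
  sum(w_h * (mz_h - mz_2ph)^2) / n1              # Equation (11.5)
sqrt(v_mz_2ph)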
library(survey)
lut <- data.frame(stratum = sort(unique(mysample$stratum)), fpc2 = n1_h)
mysample <- mysample %>%
mutate(ind = FALSE,
fpc1 = N) %>%
left_join(lut, by = "stratum")
mysample$ind[units$ID_unit] <- TRUE
design_2phase <- survey::twophase(
id = list(~ 1, ~ 1), strata = list(NULL, ~ stratum),
data = mysample, subset = ~ ind, fpc = list(~ fpc1, ~ fpc2))
svymean(~ z, design_2phase)
mean SE
z 85.606 7.033
As shown in the next code chunk, the standard error is computed with the
original variance estimator, without approximation (Särndal et al. (1992),
equation (9.4.14)).
[1] 7.032964
Next, a first-phase sample is selected, and the population mean of lnSWIR2 is
estimated from this first-phase sample. In doing so, the effect of not knowing
the population mean of the covariate on the variance of the regression estimator
becomes apparent.
In the next code chunk, a first-phase sample of 250 units, the dots in the plot,
is selected by simple random sampling without replacement. In the second
phase a subsample of 100 units, the triangles in the plot, is selected from the
250 units by simple random sampling without replacement. At all 250 units of
the first-phase sample the covariate lnSWIR2 is measured, whereas AGB is
measured at the 100 subsample units only.
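A sketch of the sample selection (the seed and object names are assumptions):

N <- nrow(grdAmazonia)
n1 <- 250; n2 <- 100
set.seed(314)
units_1 <- sample(N, size = n1, replace = FALSE)
mysample <- grdAmazonia[units_1, ]
mysample$lnSWIR2 <- log(mysample$SWIR2)
units_2 <- sample(n1, size = n2, replace = FALSE)
mysubsample <- mysample[units_2, ]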
The AGB data of the subsample can be used to estimate the regression coefficient $b$. The true
population mean of the ancillary variable, $\bar{x}$ in Equation (10.10), is now unknown.
This true mean is replaced by the mean as estimated from the relatively
large first-phase sample, $\bar{x}_{\mathcal{S}_1}$. The estimated mean of the covariate, $\bar{x}_{\mathcal{S}}$ in
Equation (10.10), is estimated from the subsample, $\bar{x}_{\mathcal{S}_2}$. This leads to the
following estimator:
$$\hat{\bar{z}} = \bar{z}_{\mathcal{S}_2} + \hat{b}\left(\bar{x}_{\mathcal{S}_1} - \bar{x}_{\mathcal{S}_2}\right) \;, \qquad (11.6)$$
where $\bar{z}_{\mathcal{S}_2}$ is the subsample mean of the study variable, and $\bar{x}_{\mathcal{S}_1}$ and $\bar{x}_{\mathcal{S}_2}$ are
the means of the covariate in the first-phase sample and the subsample (i.e.,
the second-phase sample), respectively.
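A sketch of this estimator; lm_subsample is reused in the variance computation
further down:

lm_subsample <- lm(AGB ~ lnSWIR2, data = mysubsample)
b <- coef(lm_subsample)[2]
mx_S1 <- mean(mysample$lnSWIR2)      # covariate mean in first-phase sample
mx_S2 <- mean(mysubsample$lnSWIR2)   # covariate mean in subsample
print(mz_reg2ph <- mean(mysubsample$AGB) + b * (mx_S1 - mx_S2))  # Equation (11.6)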
The sampling variance is larger than that of the regression estimator with
known mean of 𝑥. The variance can be decomposed into two components. The
first component is equal to the sampling variance of the 𝜋 estimator of the
mean of 𝑧 with the sampling design of the first phase (in this case, simple
random sampling without replacement), supposing that the study variable is
observed on all units of the first-phase sample. The second component is equal
to the sampling variance of the regression estimator of the mean of 𝑧 in the
first-phase sample, with the design of the second-phase sample (again simple
random sampling without replacement in this case):
$$\hat{V}(\hat{\bar{z}}) = \left(1 - \frac{n_1}{N}\right) \frac{\widehat{S}^2(z)}{n_1} + \left(1 - \frac{n_2}{n_1}\right) \frac{\widehat{S}^2(e)}{n_2} \;, \qquad (11.7)$$
with $\widehat{S}^2(e)$ the variance of the regression residuals as estimated from the
subsample:
$$\widehat{S}^2(e) = \frac{1}{n_2 - 1} \sum_{k \in \mathcal{S}_2} e^2_k \;. \qquad (11.8)$$
The ratios $(1 - n_1/N)$ and $(1 - n_2/n_1)$ in Equation (11.7) are finite popula-
tion corrections (fpcs). These fpcs account for the reduced variance due to
sampling the finite population and subsampling the first-phase sample without
replacement.
The estimated population mean equals 228.1 10⁹ kg ha⁻¹. The standard error
can be approximated as follows.
e <- residuals(lm_subsample)
S2e <- sum(e^2) / (n2 - 1)
S2z <- var(mysubsample$AGB)
N <- nrow(grdAmazonia)
se_mz_reg2ph <- sqrt((1 - n1 / N) * S2z / n1 + (1 - n2 / n1) * S2e / n2)
mean SE
AGB 228.07 7.2638
Exercises
$$n_{\mathrm{SI}} = \frac{n_{\mathrm{SIR}}}{1 + \frac{n_{\mathrm{SIR}}}{N}} \;, \qquad (12.1)$$
with $n_{\mathrm{SIR}}$ the required sample size for simple random sampling with replace-
ment.
For the population mean as the parameter of interest and a constraint $se_{\max}$
on the standard error of the estimator, the required sample size is
$$n = \left(\frac{S^*(z)}{se_{\max}}\right)^2 \;, \qquad (12.2)$$
with $S^*(z)$ a prior estimate of the population standard deviation. The required
sample size $n$ should be rounded to the nearest integer greater than the
right-hand side of Equation (12.2). This also applies to the following equations.
For the population proportion (areal fraction) as the parameter of interest,
the required sample size can be computed by (see Equation (3.18))
$$n = \left(\frac{\sqrt{p^*(1-p^*)}}{se_{\max}}\right)^2 + 1 \;. \qquad (12.3)$$
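A sketch of these two computations with illustrative values (the prior estimates
and the maximum standard errors are assumptions):

S_star <- 25; se_max <- 5
ceiling((S_star / se_max)^2)                                # Equation (12.2)
p_star <- 0.2; se_max <- 0.05
ceiling((sqrt(p_star * (1 - p_star)) / se_max)^2 + 1)       # Equation (12.3)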
To determine the required sample size for estimating the population propor-
tion, we need a prior estimate of the population parameter of interest itself,
whereas for the population mean a prior estimate is needed of the population
standard deviation. The parameter of which a prior estimate is needed for
sample size determination is referred to as the design parameter.
Alternatively, we may require that the relative standard error, i.e., the standard
error of the estimator divided by the population mean, may not exceed a given
limit 𝑟𝑠𝑒max . In this case the required sample size can be computed by
$$n = \left(\frac{cv^*}{rse_{\max}}\right)^2 \;, \qquad (12.4)$$
with $cv^*$ a prior estimate of the population coefficient of variation. For the
population proportion, the required sample size becomes
$$n = \left(\frac{\sqrt{p^*(1-p^*)}}{rse_{\max}\,p^*}\right)^2 + 1 = \frac{1-p^*}{rse^2_{\max}\,p^*} + 1 \;. \qquad (12.5)$$
12.2 Length of confidence interval
The requirement that the length of the 100(1−α)% confidence interval of the
population mean may not exceed $l_{\max}$ can be written as
$$2\, t_{\alpha/2, n-1}\, \frac{S(z)}{\sqrt{n}} \leq l_{\max} \;. \qquad (12.6)$$
Approximating the quantile of the t distribution by the standard normal
quantile $u_{\alpha/2}$ gives the required sample size
$$n = \left(u_{\alpha/2}\, \frac{S^*(z)}{l_{\max}/2}\right)^2 \;. \qquad (12.7)$$
Alternatively, we may require that, with large probability, the relative error of
the estimated mean does not exceed $r_{\max}$:
$$P\left(\frac{|\hat{\bar{z}} - \bar{z}|}{\bar{z}} \leq r_{\max}\right) \geq 1 - \alpha \;. \qquad (12.9)$$
Noting that the absolute error equals $r_{\max}\,\bar{z}$ and inserting this in Equation
(12.7) gives
$$n = \left(u_{\alpha/2}\, \frac{cv^*}{r_{\max}}\right)^2 \;. \qquad (12.10)$$
cv <- 0.5
rmax <- 0.1
u <- qnorm(p = 1 - 0.05 / 2, mean = 0, sd = 1)
n <- ceiling((u * cv / rmax)^2)
The required sample size is 97. The same result is obtained with function
nContMoe of package PracTools (Valliant et al. (2021), Valliant et al. (2018)).
library(PracTools)
print(ceiling(nContMoe(moe.sw = 2, e = rmax, alpha = 0.05, CVpop = cv)))
[1] 97
For the population proportion, the required sample size for a confidence interval
with maximum length $l_{\max}$ is
$$n = \left(u_{\alpha/2}\, \frac{\sqrt{p^*(1-p^*)}}{l_{\max}/2}\right)^2 + 1 \;. \qquad (12.11)$$
library(binomSamSize)
p_prior <- 0.2
n_prop_wald <- ciss.wald(p0 = p_prior, d = 0.1, alpha = 0.05)
n_prop_agrcll <- ciss.agresticoull(p0 = p_prior, d = 0.1, alpha = 0.05)
n_prop_wilson <- ciss.wilson(p0 = p_prior, d = 0.1, alpha = 0.05)
The required sample sizes are 62, 58, and 60 for the Wald, Agresti-Coull,
and Wilson approximations of the binomial proportion confidence interval,
respectively. The required sample size with function ciss.wald is one unit
smaller than that computed with Equation (12.11), as shown in the code chunk
below.
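A sketch of that chunk:

u <- qnorm(1 - 0.05 / 2)
ceiling((u * sqrt(p_prior * (1 - p_prior)) / 0.1)^2 + 1)   # Equation (12.11)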
[1] 63
$$n = \frac{S^2(z)}{\Delta^2} \left(u_{\alpha/2} + u_\beta\right)^2 \;, \qquad (12.12)$$
with $\Delta$ the smallest relevant difference of the population mean from the test
value, $\alpha$ the tolerable probability of a type I error, i.e., the probability of
rejecting the null hypothesis when the population mean is equal to the test
value, $\beta$ the tolerable probability of a type II error, i.e., the probability of
not rejecting the null hypothesis when the population mean is not equal to
the test value, $u_{\alpha/2}$ as before, and $u_\beta$ the $(1-\beta)$ quantile of the standard
normal distribution. The quantity $1-\beta$ is the power of a test: the probability
of correctly rejecting the null hypothesis. For a one-sided test, $u_{\alpha/2}$ must be
replaced by $u_\alpha$.
In the next code chunk, the sample size required for a given target power is
computed with the standard normal distribution (Equation (12.12)), as well as
with the t distribution using function pwr.t.test of package pwr (Champely,
2020)1 . This requires some iterative algorithm, as the degrees of freedom of
the t distribution are a function of the sample size. The required sample size
is computed for a one-sample test and a one-sided alternative hypothesis.
library(pwr)
sd <- 4; delta <- 1; alpha <- 0.05; beta <- 0.2
n_norm <- (sd / delta)^2 * (qnorm(1 - alpha) + qnorm(1 - beta))^2
n_t <- pwr.t.test(
  d = delta / sd, sig.level = alpha, power = 1 - beta,
  type = "one.sample", alternative = "greater")$n
1 The same result is obtained with function power.t.test of the stats package.
In this example, the required sample size computed with the t distribution is
two units larger than that obtained with the standard normal distribution:
101 vs. 99. Package pwr has various functions for computing the power of a
test given the sample size, or conversely, the sample size for a given power, such
as for the two-independent-samples t test, the binomial test (for one proportion),
the test for two proportions, etc.
As can be seen in the R code, as a first step for each total sample size the
smallest number of successes 𝑘min is computed at which the null hypothesis
is rejected. Then the binomial probability is computed of 𝑘min + 1 or more
successes for a probability of success equal to 𝑝test + Δ. Note that there is no
need to add 1 to k_min as with argument lower.tail = FALSE the value specified
by argument q is not included.
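A sketch of this computation (test proportion 0.2 and significance level 0.10
as in Figure 12.1; the smallest relevant difference of 0.1 is an assumption):

p_test <- 0.2; delta <- 0.1; alpha <- 0.10
n_seq <- 10:150
power <- sapply(n_seq, function(n) {
  # smallest number of successes k_min with P(K > k_min | p_test) <= alpha
  k_min <- qbinom(p = alpha, size = n, prob = p_test, lower.tail = FALSE)
  # power: probability of more than k_min successes under p_test + delta
  pbinom(q = k_min, size = n, prob = p_test + delta, lower.tail = FALSE)
})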
Figure 12.1 shows that the power does not increase monotonically with the
sample size. The graph shows a saw-toothed behaviour. This is caused by
the stepwise increase of the critical number of successes (k_min) with the total
sample size.
The required sample size can be computed in two ways. The first option is to
compute the smallest sample size for which the power is larger than or equal
to the required power 1 − 𝛽. The alternative is to compute the smallest sample
size for which the power is larger than or equal to 1 − 𝛽 for all sample sizes
larger than this.
FIGURE 12.1: Power of right-tail binomial test (test proportion: 0.2; signifi-
cance level: 0.10).
The smallest sample size at which the desired level of 0.8 is reached is 88.
However, as can be seen in Figure 12.1, for sample sizes 89, 90, 93, 94, and 97,
the power drops below the desired level of 0.80. The smallest sample size at
which the power stays above the level of 0.8 is 98.
Alternatively, we may use function pwr.p.test of package pwr. This is an
approximation, using an arcsine transformation of proportions. The first step
is to compute Cohen's $h$, which is a measure of the distance between two
proportions: $h = 2\arcsin(\sqrt{p_1}) - 2\arcsin(\sqrt{p_2})$. This can be done with
function ES.h. The value of $h$ must be positive, which is achieved when the
proportion specified by argument p1 is larger than the proportion specified by
argument p2.
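A sketch (p1 = 0.3, i.e., the test proportion plus the smallest relevant
difference assumed above, and p2 = 0.2):

h <- ES.h(p1 = 0.3, p2 = 0.2)
pwr.p.test(h = h, sig.level = 0.10, power = 0.80, alternative = "greater")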
The approximated sample size equals 84, which is somewhat smaller than the
required sample sizes computed above.
Exercises
2. Do the same for a single value for the half-length of the confidence
interval of 0.2 and a range of values for the prior proportion 𝑝∗ =
(0.01, 0.02, … , 0.49). Explain what you see. Why is it not needed to
compute the required sample size for prior proportions > 0.5?
The design effect can also be quantified by the ratio of two standard errors.
Then there is no need to take the square root of the design effect, as done in
Equation (12.14), to compute the required sample size for a more complex
design, given a constraint on the standard error or the half-length of a
confidence interval.
$$f(\theta|\mathbf{z}) = \frac{f(\theta)\, f(\mathbf{z}|\theta)}{f(\mathbf{z})} \;, \qquad (12.15)$$
with $f(\theta|\mathbf{z})$ the posterior distribution, i.e., the probability density of the
parameter given the sample data, $f(\theta)$ our prior belief in the parameter
specified by a probability distribution (prior distribution), $f(\mathbf{z}|\theta)$ the likelihood
of the data, and $f(\mathbf{z})$ the probability distribution of the data.
The predictive distribution of the data is obtained by integrating the likelihood
over the prior:
$$f(\mathbf{z}|n) = \int_\Theta f(\mathbf{z}|\theta, n)\, f(\theta)\, \mathrm{d}\theta \;, \qquad (12.16)$$
with $\Theta$ the parameter space for $\theta$ containing all possible values of $\theta$. This
predictive distribution is also named the preposterior distribution, stressing
that the data are not yet accounted for in the distribution.
Even if $\theta$ were fixed, we would not have only one vector $\mathbf{z}$ with $n$ data values
but a probability distribution, from which we can simulate possible data vectors,
referred to as the data space $\mathcal{Z}$. In case of a binomial probability and sample
size $n$, the data space $\mathcal{Z}$ (in the form of the number of observed successes given
sample size $n$) can be written as the set $\{0, 1, \dots, n\}$, i.e., one vector of length
$n$ with all "failures", $n$ vectors of length $n$ with one success, $\binom{n}{2}$ vectors with
two successes, etc. Each data vector is associated with a probability density
(for continuous data) or probability mass (for discrete data). As a consequence,
we do not have only one posterior distribution function 𝑓(𝜃|𝐳), but as many
as we have data vectors in the data space. For each posterior distribution
function the coverage of the HPD interval of a given length can be computed,
or reversely, the length of the HPD interval for a given coverage. This leads
to various criteria for computing the required sample size, among which are
the average length criterion (ALC), the average coverage criterion (ACC), and
the worst outcome criterion (WOC) (Joseph et al. (1995), Joseph and Bélisle
(1997)).
For a fixed posterior HPD interval coverage of 100(1−α)%, the smallest sample
size $n$ is determined such that
$$\int_{\mathcal{Z}} l(\mathbf{z}, n)\, f(\mathbf{z}|n)\, \mathrm{d}\mathbf{z} \leq l_{\max} \;, \qquad (12.17)$$
where $f(\mathbf{z}|n)$ is the predictive distribution of the data (Equation (12.16)) and
$l(\mathbf{z}, n)$ is the length of the 100(1−α)% HPD interval for data $\mathbf{z}$ and sample
size $n$, obtained by solving
$$\int_{v}^{v + l(\mathbf{z},n)} f(\theta|\mathbf{z}, n)\, \mathrm{d}\theta = 1 - \alpha \;, \qquad (12.18)$$
for $l(\mathbf{z}, n)$, for each possible data set $\mathbf{z} \in \mathcal{Z}$. $f(\theta|\mathbf{z}, n)$ is the posterior density
of the population parameter of interest given the data $\mathbf{z}$ and sample size $n$.
ALC ensures that the average length of 100(1−α)% posterior HPD intervals,
weighted by $f(\mathbf{z}|n)$, is at most $l_{\max}$.
For a fixed posterior HPD interval of length $l_{\max}$, the smallest sample size $n$
is determined such that
$$\int_{\mathcal{Z}} \left\{\int_{v}^{v+l_{\max}} f(\theta|\mathbf{z}, n)\, \mathrm{d}\theta\right\} f(\mathbf{z}|n)\, \mathrm{d}\mathbf{z} \geq 1 - \alpha \;. \qquad (12.19)$$
ACC ensures that the average coverage of HPD intervals of length 𝑙max is
at least 1 − 𝛼. The integral inside the curly brackets is the integral of the
posterior density of the population parameter of interest over the HPD interval
(𝑣, 𝑣 + 𝑙max ), given a data vector 𝐳 of size 𝑛. The mean of this integrated
posterior density of the parameter of interest 𝜃 is obtained by multiplying the
integrated density with the predictive probability of the data and integrating
over all possible data sets in 𝒵.
Neither ALC nor ACC guarantee that for a particular data set 𝐳 the criterion
is met, as both are defined as averages over all possible data sets in 𝒵. A more
conservative sample size can be computed by requiring that for all data sets
𝒵 both criteria are met. Joseph and Bélisle (1997) modified this criterion by
restricting the data sets to a subset 𝒲 of most likely data sets. The criterion
thus obtained is referred to as the modified worst outcome criterion, or for
short, the worst outcome criterion (WOC). So, the criterion is
$$\inf_{\mathbf{z} \in \mathcal{W}} \left\{\int_{v}^{v + l_{\max}} f(\theta|\mathbf{z}, n)\, \mathrm{d}\theta\right\} \geq 1 - \alpha \;. \qquad (12.20)$$
The smallest sample size satisfying this condition is used as the sample size.
For instance, if the 95% most likely data sets are chosen as subspace 𝒲, WOC
guarantees that there is 95% assurance that the length of the 100(1 − 𝛼)%
posterior HPD intervals will be at most 𝑙max . The fraction of most likely data
sets in subspace 𝒲 is referred to as the worst level.
Besides the fully Bayesian approach, Joseph and Bélisle (1997) describe a
mixed Bayesian-likelihood approach for determining the sample size. In the
mixed Bayesian-likelihood approach of sample size determination, the prior
distribution of the parameter or parameters is only used to derive the predictive
distribution of the data (Equation (12.16)), not the posterior distributions of
the parameter of interest for each data vector. For analysis of the posterior
distribution, an uninformative prior is therefore used. This mixed approach
is of interest when, after the data have been collected, we prefer to estimate
the population mean from these data only, using the frequentist approach
described in previous sections.
An example of a situation where the mixed Bayesian-likelihood approach can
be attractive is the following. Suppose some data of the study variable from
the population of interest are already available, but we would like to collect
more data so that we will be more confident about the (current) population
mean once these new data are collected. The legacy data are used to construct
a prior distribution. We have doubts about the quality of the legacy data
because they were collected a long time ago and the study variable might
have changed in the meantime. In that case, the mixed Bayesian-likelihood
approach can be a good option – we are willing to use the legacy data to plan
the sampling, but not to make statements about the current population.
No closed formula for computing the required sample size exists for this ap-
proach, because the posterior density function $f(\theta|\mathbf{z}, n)$ is not a well-defined
distribution as before. However, the required sample size can still be approxi-
mated by simulation.
The three criteria (ALC, ACC, and WOC) described above are now used to
compute the required sample size for estimating the population mean, assuming
that the data come from a normal distribution. As we are uncertain about
the population standard deviation 𝜎 (𝑆∗ (𝑧) in Equation (12.7) is only a prior
point estimate of 𝜎), a prior distribution is assigned to this parameter. It
is convenient to assign a gamma distribution as a prior distribution to the
reciprocal of the population variance, referred to as the precision parameter
$\lambda = 1/\sigma^2$. More precisely, a prior bivariate normal-gamma distribution is
assigned to the population mean and the precision parameter³. With this prior
distribution, the posterior distribution of the population mean is fully defined,
i.e., both the type of distribution and its parameters are known. The prior
distribution is so-called conjugate with the normal distribution.
³ This is equivalent to assigning a normal-inverse-gamma distribution to the population mean and the population variance.
The gamma distribution has two parameters: 𝑎 and 𝑏. Figure 12.2 shows the
gamma distribution for 𝑎 = 5 and 𝑏 = 100.
FIGURE 12.2: Prior gamma distribution for the precision parameter for a
shape parameter 𝑎 = 5 and a scale parameter 1/𝑏 = 1/100.
The mean of the precision parameter $\lambda$ is given by $a/b$ and its standard
deviation by $\sqrt{a/b^2}$.
The normal-gamma prior is used to compute the predictive distribution for the
data. For ACC the required sample size can then be computed with (Adcock,
1988)
$$n = \frac{4b}{a\, l^2_{\max}}\, t^2_{2a;\alpha/2} - n_0 \;, \qquad (12.21)$$
with $t^2_{2a;\alpha/2}$ the squared $(1-\alpha/2)$ quantile of the (usual, i.e., neither shifted
nor scaled) t distribution with $2a$ degrees of freedom and $n_0$ the number
of prior points. The prior sample size $n_0$ is only relevant if we have prior
information about the population mean and an informative prior is used for
this population mean. If we have no information about the population mean,
a non-informative prior is used and $n_0$ equals 0. Note that as $a/b$ is the prior
mean of the inverse of the population variance, Equation (12.21) is similar
to Equation (12.7). The only difference is that a quantile from the standard
normal distribution is replaced by a quantile from a t distribution with $2a$
degrees of freedom.
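For illustration, Equation (12.21) can be evaluated directly, with $a$ and $b$ as
in Figure 12.2 and $l_{\max} = 2$ as in the code chunk further down:

a <- 5; b <- 100; lmax <- 2; n0 <- 0
ceiling(4 * b / (a * lmax^2) * qt(1 - 0.05 / 2, df = 2 * a)^2 - n0)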
No closed-form formula exists for computing the smallest 𝑛 satisfying ALC,
but the solution can be found by a bisectional search algorithm (Joseph and
Bélisle, 1997).
Package SampleSizeMeans (Joseph and Bélisle, 2012) is used to compute
Bayesian required sample sizes, using both criteria, ACC and ALC, for the fully
Bayesian and the mixed Bayesian-likelihood approach. The gamma distribution
plotted in Figure 12.2 is used as a prior distribution for the precision parameter
𝜆. As a reference, also the frequentist required sample size is computed.
library(SampleSizeMeans)
lmax <- 2
n_freq <- mu.freq(len = lmax, lambda = a / b, level = 0.95)
n_alc <- mu.alc(len = lmax, alpha = a, beta = b, n0 = 0, level = 0.95)
n_alcmbl <- mu.mblalc(len = lmax, alpha = a, beta = b, level = 0.95)
n_acc <- mu.acc(len = lmax, alpha = a, beta = b, n0 = 0, level = 0.95)
n_accmbl <- mu.mblacc(len = lmax, alpha = a, beta = b, level = 0.95)
n_woc <- mu.modwoc(
len = lmax, alpha = a, beta = b, n0 = 0, level = 0.95, worst.level = 0.95)
n_wocmbl <- mu.mblmodwoc(
len = lmax, alpha = a, beta = b, level = 0.95, worst.level = 0.95)
Table 12.1 shows that all six required sample sizes are larger than the frequentist
required sample size. This makes sense, as the frequentist approach does not
account for uncertainty in the population variance parameter. The mixed
approach leads to slightly larger required sample sizes than the fully Bayesian
approach. This is because in the mixed approach the prior distribution of the
precision parameter is not used. Apparently, we do not lose much information
by ignoring this prior. With WOC the required sample sizes are about twice
the sample sizes obtained with the other two criteria, but this depends of
course on the size of the subspace $\mathcal{W}$. If, for instance, the 80% most likely data
sets are chosen as subspace $\mathcal{W}$, the required sample sizes are much smaller.
The required sample sizes with this criterion are 124 and 128 using the fully
Bayesian and the mixed Bayesian-likelihood approach, respectively.

TABLE 12.1: Required sample sizes for estimating a normal mean, computed
with three criteria for the fully Bayesian and the mixed Bayesian-likelihood
(MBL) approach.
The same criteria can be used to estimate the proportion of a population or,
in case of an infinite population, the areal fraction, satisfying some condition
(Joseph et al., 1995). With simple random sampling, this boils down to esti-
mating the probability-of-success parameter 𝑝 of a binomial distribution. In
this case the space of possible outcomes 𝒵 is the number of successes, which is
discrete: 𝒵 = {0, 1, … , 𝑛} with 𝑛 the sample size.
The conjugate prior for the binomial likelihood is the beta distribution:
$$f(p) = \frac{1}{B(c,d)}\, p^{c-1} (1-p)^{d-1} \;, \qquad (12.22)$$
where $B(c,d)$ is the beta function. The two parameters $c$ and $d$ correspond
to the number of 'successes' and 'failures' in the problem context. The larger
these numbers, the more prior information, and the more sharply defined
the probability distribution. The plot below shows this distribution for $c = 0.6$
and $d = 2.4$.
The mean of the binomial proportion equals $c/(c+d)$ and its standard deviation
$\sqrt{cd/\{(c+d+1)(c+d)^2\}}$.
The likelihood of $z$ successes in $n$ trials is binomial,
$$f(z|p, n) = \binom{n}{z}\, p^z (1-p)^{n-z} \;, \qquad (12.23)$$
and for a given number of successes $z$ out of $n$ trials the posterior distribution
of $p$ equals
$$f(p|z, n, c, d) = \frac{1}{B(z+c,\, n-z+d)}\, p^{z+c-1} (1-p)^{n-z+d-1} \;. \qquad (12.24)$$
For the binomial parameter, criterion ALC (Equation (12.17)) can be written
as
$$\sum_{z=0}^{n} l(z, n)\, f(z, n) \leq l_{\max} \;. \qquad (12.25)$$
To compute the smallest $n$ satisfying this condition, for each value of $z$ and
each $n$, $l(z, n)$ must be computed so that
$$\int_{v}^{v + l(z,n)} f(p|z, n, c, d)\, \mathrm{d}p = 1 - \alpha \;, \qquad (12.26)$$
with 𝑣 the lower bound of the HPD credible set given the sample size and the
observed number of successes 𝑧.
For the binomial parameter, criterion ACC (Equation (12.19)) can be written
as
$$\sum_{z=0}^{n} \Pr\{p \in (v, v + l_{\max})\}\, f(z, n) \geq 1 - \alpha \;, \qquad (12.27)$$
with
$$\Pr\{p \in (v, v + l_{\max})\} \propto \int_{v}^{v + l_{\max}} p^z (1-p)^{n-z} f(p)\, \mathrm{d}p \;. \qquad (12.28)$$
FIGURE 12.3: Prior beta distribution for the binomial proportion for a beta
function 𝐵(0.6, 2.4).
The required sample sizes can be computed with package SampleSizeBinomial
(Joseph et al., 2018). This package is used to compute the required sample sizes
using the beta distribution shown in Figure 12.3 as a prior for the population
proportion. Note that argument len of the various functions of package Sam-
pleSizeBinomial specifies the total length of the confidence interval, not half
the length as passed to function ciss.wald using argument d.
library(SampleSizeBinomial)
c <- 0.6; d <- 2.4    # prior parameters of Figure 12.3
n_alc <- prop.alc(
  len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE)$n
n_alcmbl <- prop.mblalc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE)$n
n_acc <- prop.acc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE)$n
n_accmbl <- prop.mblacc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE)$n
n_woc <- prop.modwoc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE,
worst.level = 0.80)$n
n_wocmbl <- prop.mblmodwoc(
len = 0.2, alpha = c, beta = d, level = 0.95, exact = TRUE,
worst.level = 0.80)$n
library(binomSamSize)
n_freq <- ciss.wald(p0 = c / (c + d), d = 0.1, alpha = 0.05)
The study variable is modelled as the sum of a mean and a zero-mean residual:
$$Z(\mathbf{s}) = \mu(\mathbf{s}) + \epsilon(\mathbf{s}) \;, \qquad (13.1)$$
with $Z(\mathbf{s})$ the study variable at location $\mathbf{s}$, $\mu(\mathbf{s})$ the mean at location $\mathbf{s}$, $\epsilon(\mathbf{s})$
the residual at location $\mathbf{s}$, and $C(\mathbf{h})$ the covariance of the residuals at two
locations separated by vector $\mathbf{h} = \mathbf{s} - \mathbf{s}'$. The residuals are assumed to have a
normal distribution with zero mean and a constant variance $\sigma^2$ ($\mathcal{N}(0, \sigma^2)$).
The model of the spatial variation has several parameters. In case of a model in
which the mean is a linear combination of covariates, these are the regression
coefficients associated with the covariates and the parameters of a semivari-
ogram describing the spatial dependence of the residuals. A semivariogram is
a model for half the expectation of the squared difference of the study variable
or the residuals of a model at two locations, referred to as the semivariance,
as a function of the length (and direction) of the vector separating the two
locations (Chapter 21).
Using the model to predict the sampling variance of a design-based estimator of
a population mean requires prior knowledge of the semivariogram. When data
from the study area of interest are available, these data can be used to choose
a semivariogram model and to estimate the parameters of the model. If no
such data are available, we must make a best guess, based on data collected in
other areas. In all cases I recommend keeping the model as simple as possible.
The model-expectation of the design-variance of the $\pi$ estimator equals
$$E_\xi\{V_p(\hat{\bar{z}})\} = \bar{\gamma} - E_p\left(\boldsymbol{\lambda}^{\mathrm{T}} \boldsymbol{\Gamma}_{\mathcal{S}}\, \boldsymbol{\lambda}\right) \;, \qquad (13.2)$$
where $E_\xi(\cdot)$ is the statistical expectation over realisations from the model
$\xi$, $E_p(\cdot)$ is the statistical expectation over repeated sampling with sampling
design $p$, $V_p(\hat{\bar{z}})$ is the variance of the $\pi$ estimator of the population mean over
repeated sampling with sampling design $p$, $\bar{\gamma}$ is the mean semivariance of the
random variable at two randomly selected locations in the study area, $\boldsymbol{\lambda}$ is the
vector of design-based weights of the units of a sample selected with design $p$,
and $\boldsymbol{\Gamma}_{\mathcal{S}}$ is the matrix of semivariances between the units of a sample $\mathcal{S}$ selected
with design $p$.
For simple random sampling, the model-expectation of the sampling variance
equals
$$E_\xi\{V_{\mathrm{SI}}(\hat{\bar{z}})\} = \bar{\gamma}/n \;. \qquad (13.4)$$
For systematic random sampling, i.e., sampling on a randomly placed grid, the
variance can be predicted by
$$E_\xi\{V_{\mathrm{SY}}(\hat{\bar{z}})\} = \bar{\gamma} - E_{\mathrm{SY}}(\bar{\gamma}_{\mathrm{SY}}) \;, \qquad (13.6)$$
with $\bar{\gamma}_{\mathrm{SY}}$ the mean semivariance between the points of a systematic random
sample.
library(gstat)
coordinates(sampleLeest) <- ~ s1 + s2
vg <- variogram(N ~ 1, data = sampleLeest)
vgm_MoM <- fit.variogram(
vg, model = vgm(model = "Sph", psill = 2000, range = 20))
The few data lead to a very noisy sample semivariogram. For the moment, I
ignore my uncertainty about the semivariogram parameters; Subsection 13.1.3
will show how we can account for our uncertainty about the semivariogram
parameters in model-based prediction of the sampling variance. A spherical
semivariogram model without nugget is fitted to the sample semivariogram,
i.e., the intercept is 0. The fitted range of the model is 45 m, and the fitted sill
equals 966 (kg ha⁻¹)². The fitted semivariogram is used to predict the sampling
variance for three sampling designs: simple random sampling, stratified simple
random sampling, and systematic random sampling. The costs for these three
design types will be about equal, as the study area is small, so that the
access time of the sampling points selected with the three designs is about
equal. The sample size of the evaluated sampling designs is 25 points. As the
number of points in a systematic random sample varies among the samples,
this sampling design has an expected sample size of 25 points.
For simple random sampling, we must compute the mean semivariance within
the field (Equation (13.4)). As shown in the next code chunk, the mean
semivariance is approximated by discretising the field by a square grid of 2,000
points, computing the 2,000 × 2,000 matrix with distances between all pairs
of discretisation nodes, transforming this distance matrix into a matrix of
semivariances, and averaging these semivariances.
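A sketch (mygrid, the matrix with the coordinates of the discretisation nodes,
is created in a chunk not shown; the sample size of 25 is taken from the text):

n <- 25
H <- as.matrix(dist(mygrid))                   # distances between all pairs of nodes
G <- variogramLine(vgm_MoM, dist_vector = H)   # matrix of semivariances
m_semivar_field <- mean(G)
Exi_V_SI <- m_semivar_field / n                # Equation (13.4)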
The strata of the stratified simple random sampling design are compact geo-
graphical strata of equal size (Section 4.6). The number of geostrata is equal
to the sample size, 25 points, so that we have one point per stratum. With
this design, the sampling points are reasonably well spread over the field, but
not as good as with systematic random sampling. To predict the sampling
variance, we must compute the mean semivariances within the geostrata, see
Equation (13.5). Note that the stratum weights are constant as the strata have
equal size, 𝑤ℎ = 1/𝑛, and that 𝑛ℎ = 1. Therefore, Equation (13.5) reduces to
$$E_\xi\{V_{\mathrm{STSI}}(\hat{\bar{z}})\} = \frac{1}{n^2} \sum_{h=1}^{n} \bar{\gamma}_h \;. \qquad (13.7)$$
The next code chunk shows the computation of the mean semivariance per
stratum and the model-based prediction of the sampling variance of the
estimator of the mean. The matrix with the coordinates is first converted to a
tibble with function as_tibble.
library(spcosa)
mygrid <- mygrid %>%
as_tibble() %>%
setNames(c("x1", "x2"))
gridded(mygrid) <- ~ x1 + x2
mygeostrata <- stratify(mygrid, nStrata = n, equalArea = TRUE, nTry = 10) %>%
as("data.frame")
m_semivar_geostrata <- numeric(length = n)
for (i in 1:n) {
ids <- which(mygeostrata$stratumId == (i - 1))
mysubgrd <- mygeostrata[ids, ]
H_geostratum <- as.matrix(dist(mysubgrd[, c(2, 3)]))
G_geostratum <- variogramLine(vgm_MoM, dist_vector = H_geostratum)
m_semivar_geostrata[i] <- mean(G_geostratum)
}
Exi_V_STSI <- sum(m_semivar_geostrata) / n^2
The model-based prediction of the sampling variance with this design equals
13.5 (kg ha⁻¹)², which is much smaller than with simple random sampling.
The large stratification effect can be explained by the assumed strong spatial
structure of NO3 -N in the agricultural field and the improved geographical
spreading of the sampling points, see Figure 13.1.
FIGURE 13.1: Sample semivariogram and fitted spherical model for NO3 -N
in field Leest. The numbers refer to point-pairs used in computing semivari-
ances.
set.seed(314)
m_semivar_SY <- numeric(length = 100)
for (i in 1:100) {
mySYsample <- spsample(x = mygrid, n = n, type = "regular") %>%
as("data.frame")
H_SY <- as.matrix(dist(mySYsample))
G_SY <- variogramLine(vgm_MoM, dist_vector = H_SY)
m_semivar_SY[i] <- mean(G_SY)
}
Exi_V_SY <- m_semivar_field - mean(m_semivar_SY)
If the soil aliquots collected at the points of the stratified random sample
are bulked into a composite, as is usually done in soil testing of agricultural
fields, the procedure for predicting the variance of the estimator of the mean
is slightly different. Only the composite sample is analysed in a laboratory on
NO3 -N, not the individual soil aliquots. This implies that the contribution of
the measurement error to the total uncertainty about the population mean
is larger. To predict the sampling variance in this situation, we need the
semivariogram of errorless measurements of NO3 -N, i.e., of the true NO3 -N
contents of soil aliquots collected at points. The sill of this semivariogram will
be smaller than the sill of the semivariogram of measured NO3 -N data. A
simple option is to subtract an estimate of the measurement error variance
from the semivariogram of measured NO3 -N data that contain a measurement
error. So, the measurement error variance is subtracted from the nugget. This
yields the semivariogram of the true NO3-N contents, which is then used to
predict the sampling variance.
Exercises
3. Do the same for systematic random sampling. Note that for this
sampling design, no such formula is available. Predict for a series
of expected sample sizes, 𝑛 = 5, 6, … , 40, the sampling variance of
the estimator of the mean, using Equation (13.6). Approximate
$E_{\mathrm{SY}}(\bar{\gamma}_{\mathrm{SY}})$ from ten repeated selections. Compute the length of
the confidence interval from the predicted sampling variances, and
plot the interval length against the sample size. Finally, determine
the required sample size for a maximum length of 20. What is the
design effect for an expected sample size of 34 points (the required
sample size for simple random sampling)? See Equation (12.13). Also
compute the design effect for expected sample sizes of 5, 6, … , 40.
Explain why the design effect is not constant.
2. Use the model to simulate values of the study variable for all
sampling points.
3. Estimate for each sample the population mean, using the design-
based estimator of the population mean for sampling design 𝑝. This
results in 𝑆 estimated population means.
This approach is illustrated with the western part of the Amhara region in
Ethiopia (hereafter referred to as West-Amhara) where a large sample is
available with organic matter data in the topsoil (SOM) in decagram per kg
dry soil (dag kg-1 ; 1 decagram = 10 gram). The soil samples are collected
along roads (see Figure 17.5). It is a convenience sample, not a probability
sample, so these sample data cannot be used in design-based or model-assisted
estimation of the mean or total soil carbon stock in the study area. However,
the data can be used to model the spatial variation of the SOM concentration,
and this geostatistical model can then be used to design a probability sample
for design-based estimation of the total mass of SOM. Apart from the point
data of the SOM concentration, maps of covariates are available, such as
a digital elevation model and remote sensing reflectance data. In the next
code chunk, four covariates are selected to model the mean of the SOM
concentration: elevation (dem), average near infrared reflectance (rfl-NIR),
average red reflectance (rfl-red), and average land surface temperature (lst). I
assume a normal distribution for the residuals of the linear model. The model
parameters are estimated by restricted maximum likelihood (REML), using
package geoR (Ribeiro Jr et al., 2020), see Subsection 21.5.2 for details on
REML estimation of a geostatistical model. As a first step, the projected
coordinates of the sampling points are changed from m into km using function
mutate. With coordinates in m, function likfit could not find an optimal
estimate of the range.
library(geoR)
sampleAmhara <- sampleAmhara %>%
mutate(s1 = s1 / 1000, s2 = s2 / 1000)
dGeoR <- as.geodata(obj = sampleAmhara, header = TRUE,
coords.col = c("s1", "s2"), data.col = "SOM",
covar.col = c("dem", "rfl_NIR", "rfl_red", "lst"))
vgm_REML <- likfit(geodata = dGeoR,
trend = ~ dem + rfl_NIR + rfl_red + lst,
cov.model = "spherical", ini.cov.pars = c(1, 50), nugget = 0.2,
lik.method = "REML", messages = FALSE)
The fitted model of the spatial variation of the SOM concentration is used to
compare systematic random sampling and two-stage cluster random sampling
at equal variances of the estimator of the mean.
The mean of the 100 conditional variances equals 0.015 (dag kg⁻¹)². This is
a Monte Carlo approximation of the model-based prediction of the sampling
variance of the ratio estimator of the mean for systematic random sampling
with an expected sample size of 50.
set.seed(314)
res <- kmeans(
grdAmhara[, c("s1", "s2")], iter.max = 1000, centers = 100, nstart = 100)
mypsus <- res$cluster
psusize <- as.numeric(table(mypsus))
summary(psusize)
In the next code chunks, I assume that the PSUs are selected with probabilities
proportional to their size and with replacement (ppswr sampling), see Chapter
7. In Section 7.1 formulas are presented for computing the optimal number
of PSU draws and SSU draws per PSU draw. The optimal sample sizes are a
function of the pooled variance of PSU means, $S^2_{\mathrm{b}}$, and the pooled variance
of secondary units (points) within the PSUs, $S^2_{\mathrm{w}}$. In the current subsection,
these variance components are predicted with the geostatistical model.
As a first step, a large number of maps are simulated.
For each simulated field, the means of the PSUs and the variances of the
simulated values within the PSUs are computed using function tapply in
function apply.
Next, for each simulated field, the pooled variance of PSU means and the
pooled variance within PSUs are computed, and finally these pooled variances
are averaged over all simulated fields. The averages are approximations of
the model-expectations of the pooled between-unit and within-unit variances,
$E_\xi[S^2_{\mathrm{b}}]$ and $E_\xi[S^2_{\mathrm{w}}]$, as sketched below.
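A sketch of these steps (simAmhara, a matrix with one column of simulated
values per field, is an assumed name; mypsus and psusize come from the kmeans
chunk above; the plain size-weighted pooling is also an assumption):

mz_psu <- apply(simAmhara, MARGIN = 2, FUN = function(z)
  tapply(z, INDEX = mypsus, FUN = mean))
S2w_psu <- apply(simAmhara, MARGIN = 2, FUN = function(z)
  tapply(z, INDEX = mypsus, FUN = var))
S2b <- apply(mz_psu, MARGIN = 2, FUN = var)    # variance of PSU means, per field
S2w <- apply(S2w_psu, MARGIN = 2, FUN = weighted.mean, w = psusize)
Exi_S2b <- mean(S2b)    # model-expectation of the between-PSU variance
Exi_S2w <- mean(S2w)    # model-expectation of the within-PSU variance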
The optimal sample sizes are computed for a simple linear costs model: $C =
c_0 + c_1 n + c_2 n m$, with $c_0$ the fixed costs, $c_1$ the access costs per PSU, including
the access costs of the SSUs (points) within a given PSU, and $c_2$ the observation
costs per SSU. In the next code chunk, I use $c_1 = 2$ and $c_2 = 1$. For the optimal
sample sizes only the ratio of $c_1$ and $c_2$ is important, not their absolute values.
Given values for 𝑐1 and 𝑐2 , the optimal number of PSU draws 𝑛 and the
optimal number of SSU draws per PSU draw 𝑚 are computed, required for a
sampling variance of the estimator of the mean equal to the sampling variance
with systematic random sampling of 50 points, see Equations (7.9) and (7.10).
c1 <- 2; c2 <- 1
nopt <- 1 / Exi_vmz_SY * (sqrt(Exi_S2w * Exi_S2b) * sqrt(c2 / c1) + Exi_S2b)
mopt <- sqrt(Exi_S2w / Exi_S2b) * sqrt(c1 / c2)
The optimal number of PSU draws is 26, and the optimal number of points
per PSU draw equals 5. The total number of sampling points is 26 × 5 =
130. This is much larger than the sample size of 50 obtained with systematic
random sampling. The total observation costs therefore are substantially larger.
However, the access time can be substantially smaller due to the spatial
clustering of sampling points. To answer the question of whether the costs
saved by this reduced access time outweigh the extra costs of observation, the
model for the access costs and observation costs must be further developed.
The NO3-N data of field Melle were collected at the nodes of a square grid
with a spacing of about 4.5 m. As a first step, I check whether we can safely
assume that the data come from a normal distribution.
The Q-Q plot (Figure 13.2) shows that a normal distribution is not very likely:
there are too many large values, i.e., the distribution is skewed to the right.
Also the p-value of the Shapiro-Wilk test shows that we should reject the
null hypothesis of a normal distribution for the data: $p = 0.0028$. I therefore
proceed with the natural log of NO3-N, in short lnN.
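A sketch (sampleMelle and its column N are assumed object names for the
Melle data):

shapiro.test(sampleMelle$N)             # Shapiro-Wilk test of normality
sampleMelle$lnN <- log(sampleMelle$N)   # natural log transform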
The parameters that are estimated are the reciprocal of the sill, $\lambda$; the ratio
of spatial dependence, $\xi$, defined as the partial sill divided by the sill; and the
distance parameter $\phi$. This parameterisation of the semivariogram is chosen
because, in the Bayesian approach hereafter, prior distributions are assigned to
these parameters.
library(mvtnorm)
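# The log-likelihood below assumes objects defined in chunks not shown here:
# model (the semivariogram model type), D (matrix of distances between the
# sampling points), X (design matrix of the trend), and z (the lnN data).
# Object vgML, used further down, is assumed to hold maximum-likelihood
# estimates obtained by maximising ll, e.g. with optim.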
ll <- function(thetas) {
  #back-transform the parameters to sill, partial sill, and nugget
  sill <- 1 / thetas[1]
  psill <- thetas[2] * sill
  nugget <- sill - psill
  vgmodel <- vgm(
    model = model, psill = psill, range = thetas[3], nugget = nugget)
  #covariance matrix of the data, from the matrix D with distances
  C <- variogramLine(vgmodel, dist_vector = D, covariance = TRUE)
  #generalised least squares estimate of the regression coefficients
  XCX <- crossprod(X, solve(C, X))
  XCz <- crossprod(X, solve(C, z))
  betaGLS <- solve(XCX, XCz)
  mu <- as.numeric(X %*% betaGLS)
  #log-likelihood of the data
  logLik <- dmvnorm(x = z, mean = mu, sigma = C, log = TRUE)
  logLik
}
library(BayesianTools)
priors <- createUniformPrior(lower = c(1e-6, 0, 1e-6),
upper = c(1000, 1, 150))
bestML <- c(vgML$par[1], vgML$par[2], vgML$par[3])
setup <- createBayesianSetup(likelihood = ll, prior = priors,
best = bestML, names = c("lambda", "xi", "phi"))
set.seed(314)
res <- runMCMC(setup, sampler = "DEzs")
MCMCsample <- getSample(res, start = 1000, numSamples = 1000) %>% data.frame()
Figure 13.3 shows several semivariograms, sampled by MCMC from the poste-
rior distribution of the estimated semivariogram parameters.
The evaluated sampling design is the same as used in Subsection 13.1.1 for
field Leest: stratified simple random sampling, using compact geographical
strata of equal size, a total sample size of 25 points, and one point per stratum.
The next step is to simulate with each of the sampled semivariograms a
large number of maps of lnN. This is done by sequential Gaussian simulation,
conditional on the available data. The simulated values are backtransformed.
Each simulated map is then used to compute the variance of the simulated values within the geostrata, $S^2_h$. These stratum variances are used to compute the sampling variance of the estimator of the mean. Plugging $w_h = 1/n$ (all strata have equal size) into Equation (4.4) and using $n_h = 1$ in Equation (4.5) yields

$$\hat{V}(\hat{\bar{z}}) = \frac{1}{n^2}\sum_{h=1}^{n} S^2_h\,. \tag{13.8}$$

FIGURE 13.3: Semivariograms of the natural log of NO3-N for field Melle, obtained by MCMC sampling from the posterior distribution of the estimated semivariogram parameters.
In the code chunk below, I use the first 100 sampled semivariograms to simulate
with each semivariogram 100 maps.
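A rough sketch of what such a chunk might look like; the names dataMelle and mygrid (a discretisation grid with the geostratum of each node in column stratum) and the exponential model type are assumptions:

library(gstat)
V <- matrix(nrow = 100, ncol = 100)
for (i in 1:100) {
  sill <- 1 / MCMCsample$lambda[i]
  vgmodel <- vgm(model = "Exp", psill = MCMCsample$xi[i] * sill,
    range = MCMCsample$phi[i], nugget = (1 - MCMCsample$xi[i]) * sill)
  #conditional sequential Gaussian simulation of 100 maps of lnN
  simfields <- krige(lnN ~ 1, locations = dataMelle, newdata = mygrid,
    model = vgmodel, nsim = 100, nmax = 20, debug.level = 0)
  for (j in 1:100) {
    z <- exp(simfields[[j]])  #backtransform to NO3-N
    S2h <- tapply(z, INDEX = mygrid$stratum, FUN = var)
    V[i, j] <- sum(S2h) / length(S2h)^2  #Equation (13.8)
  }
}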
Figure 13.4 shows 16 maps simulated with the first four semivariograms. The
four maps in a row (a to d) are simulated with the same semivariogram. All maps
show that the simulated data have positive skew, which is in agreement with
the prior data. The data obtained by simulating from a lognormal distribution
are always strictly positive. This is not guaranteed when simulating from a
normal distribution.
The sampling variances of the estimated mean of NO3 -N obtained with these
16 maps are shown below.
a b c d
1 1.364 0.831 0.878 0.847
2 1.379 1.151 0.991 1.162
3 0.669 0.594 0.522 0.530
4 0.932 1.949 0.878 0.739
The sampling variance shows quite strong variation among the maps. The
frequency distribution of Figure 13.5 shows our uncertainty about the sampling
variance, due to uncertainty about the semivariogram, and about the spatial
distribution of NO3 -N within the agricultural field given the semivariogram
and the available data from that field.
As a model-based prediction of the sampling variance, we can take the mean or the median of the sampling variances over all 100 × 100 simulated maps, which are equal to 0.728 (dag kg-1) and 0.666 (dag kg-1), respectively. If we want to be on the safe side, we can take a high quantile of this distribution, e.g., the P90, as the predicted sampling variance, which equals 1.100 (dag kg-1).
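These summary statistics can be computed directly from the matrix V of the sketch above:

mean(V)
median(V)
quantile(V, probs = 0.90)  #P90 of the distribution of sampling variances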
I used the 30 available NO3 -N data as conditioning data in geostatistical
simulation. Unconditional simulation is recommended if we cannot rely on the
quality of the legacy data, for instance due to a temporal change in lnN since
the time the legacy data have been observed. For NO3 -N this might well be
the case. Although the effect of the 30 observations on the simulated fields and on the uncertainty distribution of the sampling variance will be very small, one may still prefer unconditional simulation. With unconditional simulation, we must assign the model-mean $\mu$ to argument beta of function krige. The model-mean can be estimated by generalised least squares, see function ll above.
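One way to run the unconditional simulation is through function gstat of package gstat; the coordinate names s1 and s2 and the object mu_GLS, holding the generalised least squares estimate of the model-mean, are assumptions:

g <- gstat(formula = lnN ~ 1, locations = ~ s1 + s2, dummy = TRUE,
  beta = mu_GLS, model = vgmodel, nmax = 20)
simfields <- predict(g, newdata = mygrid, nsim = 100)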
FIGURE 13.4: Maps of NO3 -N of field Melle simulated with four semivari-
ograms (rows). Each semivariogram is used to simulate four maps (columns
a-d).
13.2 Model-based optimisation of spatial strata

With stratified simple random sampling, the sampling variance of the estimator of the mean of the study variable $z$ equals

$$V(\hat{\bar{z}}) = \sum_{h=1}^{H} w^2_h \frac{S^2_h(z)}{n_h}\,. \tag{13.9}$$
Plugging the stratum sample sizes under optimal allocation (Equation (4.17)) into Equation (13.9) yields

$$V(\hat{\bar{z}}) = \frac{1}{n}\left(\sum_{h=1}^{H} w_h S_h(z)\sqrt{c_h}\right)\left(\sum_{h=1}^{H}\frac{w_h S_h(z)}{\sqrt{c_h}}\right)\,, \tag{13.10}$$

with $c_h$ the mean costs per sampling unit in stratum $h$.
So, given the total sample size 𝑛, the variance of the estimator of the mean is
minimal when the criterion
$$O = \sum_{h=1}^{H} w_h S_h(z)\sqrt{c_h}\;\sum_{h=1}^{H}\frac{w_h S_h(z)}{\sqrt{c_h}} \tag{13.11}$$
is minimised.
Assuming that the costs are equal for all population units, so that the mean
costs are the same for all strata, the minimisation criterion reduces to
$$O = \left(\sum_{h=1}^{H} w_h S_h(z)\right)^2\,. \tag{13.12}$$
In practice, we do not know the values of the study variable $z$. de Gruijter et al. (2015) consider the situation where we have predictions of the study variable from a linear regression model: $\hat{z} = z + \epsilon$, with $\epsilon$ the prediction error. This implies that we do not know the population standard deviations within the strata, $S_h(z)$ of Equation (13.10). What we do have are the stratum standard deviations of the predictions of $z$: $S_h(\hat{z})$. With many statistical models, such as kriging and regression, we also have estimates of the variances of the prediction errors, which can be used to predict the stratum variances of the study variable:

$$E_\xi[S^2_h(z)] = \frac{1}{N^2_h}\sum_{i=1}^{N_h-1}\sum_{j=i+1}^{N_h} E_\xi[d^2_{ij}]\,, \tag{13.14}$$
with $d^2_{ij} = (z_i - z_j)^2$ the squared difference of the study variable values at two nodes of a discretisation grid. The model-expectations of the squared differences are equal to

$$E_\xi[d^2_{ij}] = (\hat{z}_i - \hat{z}_j)^2 + S^2(\epsilon_i) + S^2(\epsilon_j) - 2S^2(\epsilon_i, \epsilon_j)\,, \tag{13.15}$$
with $S^2(\epsilon_i)$ the variance of the prediction error at node $i$ and $S^2(\epsilon_i, \epsilon_j)$ the covariance of the prediction errors at nodes $i$ and $j$. The authors then argue that for smoothers, such as kriging and regression, the first term must be divided by the squared correlation coefficient $R^2$:

$$E_\xi[d^2_{ij}] = \frac{(\hat{z}_i - \hat{z}_j)^2}{R^2} + S^2(\epsilon_i) + S^2(\epsilon_j) - 2S^2(\epsilon_i, \epsilon_j)\,. \tag{13.16}$$
The predicted stratum standard deviations are approximated by the square root
of Equation (13.16). Plugging these model-based predictions of the stratum
standard deviations into the minimisation criterion, Equation (13.12), yields
$$E_\xi[O] = \frac{1}{N}\sum_{h=1}^{H}\left(\sum_{i=1}^{N_h-1}\sum_{j=i+1}^{N_h}\left[\frac{(\hat{z}_i - \hat{z}_j)^2}{R^2} + S^2(\epsilon_i) + S^2(\epsilon_j) - 2S^2(\epsilon_i, \epsilon_j)\right]\right)^{1/2}\,. \tag{13.17}$$
This approach is illustrated with the Xuancheng case study. I fit a simple linear regression model for the SOM concentration, using the elevation of the surface (dem) as a predictor. Function lm of the stats package is used to fit the simple linear regression model.
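A minimal sketch, assuming the sample data are in data frame sampleXuancheng:

lm_SOM <- lm(SOM ~ dem, data = sampleXuancheng)
res <- residuals(lm_SOM); fit <- fitted(lm_SOM)
plot(fit, res)            #scatter plot of residuals against fitted values
qqnorm(res); qqline(res)  #Q-Q plot of the residuals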
In fitting a linear regression model, we assume that the relation is linear, that the residual variance is constant (independent of the fitted value), and that the residuals have a normal distribution. The first two assumptions are checked with a scatter plot of the residuals against the fitted values, the third with a Q-Q plot of the residuals (Figure 13.6).
FIGURE 13.6: Scatter plot of residuals against fitted value and Q-Q plot of
residuals, for a simple linear regression model of the SOM concentration in
Xuancheng, using elevation as a predictor.
The scatter plot shows that the first assumption is realistic: there is no trend in the residuals, which at all fitted values are scattered around the horizontal line. However, the second and third assumptions are clearly violated: the residual variance increases with the fitted value, and the distribution of the residuals has positive skew, i.e., a long upper tail. Possibly these problems can be solved by fitting a model for the natural log of the SOM concentration.
The variance of the residuals is more constant (Figure 13.7), and the Q-Q plot
is improved, although we now have too many strong negative residuals for
a normal distribution. I proceed with the model for natural-log transformed
SOM (lnSOM). The fitted linear regression model is used to predict lnSOM at
the nodes of a 200 m × 200 m discretisation grid.
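A sketch of this step; grdXuancheng as the name of the discretisation grid is an assumption:

lm_lnSOM <- lm(log(SOM) ~ dem, data = sampleXuancheng)
pred <- predict(lm_lnSOM, newdata = grdXuancheng, se.fit = TRUE)
grdXuancheng$lnSOMpred <- pred$fit
#variance of the prediction error: variance of the estimated mean
#plus the residual variance
grdXuancheng$varpred <- pred$se.fit^2 + pred$residual.scale^2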
FIGURE 13.7: Scatter plot of residuals against fitted value and Q-Q plot of
residuals, for a simple linear regression model of the natural log of the SOM
concentration in Xuancheng, using elevation as a predictor.
The predictions and their standard errors are shown in Figure 13.8.
Let us check now whether the spatial structure of the study variable lnSOM is
fully captured by the mean, modelled as a linear function of elevation. This
can be checked by estimating the semivariogram of the model residuals. If
the semivariogram of the residuals is pure nugget (the semivariance does not
increase with distance), then we can assume that the prediction errors are
independent. In that case, we do not need to account for a covariance of
the prediction errors in optimisation of the spatial strata. However, if the
semivariogram does show spatial structure, we must account for a covariance
of the prediction errors. Figure 13.9 shows the sample semivariogram of the
residuals computed with function variogram of package gstat.
library(gstat)
sampleXuancheng <- sampleXuancheng %>%
mutate(s1 = s1 / 1000, s2 = s2 / 1000)
coordinates(sampleXuancheng) <- ~ s1 + s2
vg <- variogram(lnSOM ~ dem, data = sampleXuancheng)
The sample semivariogram does not show much spatial structure, but the first two points in the semivariogram have somewhat smaller values. This indicates that the residuals at two points separated by less than about 5 km are not independent, whereas at larger separation distances they are.

TABLE 13.1: Estimated regression coefficients (intercept and slope for dem) and parameters of an exponential semivariogram for the natural log of the SOM concentration (g kg^-1) in Xuancheng.
This spatial dependence of the residuals can be modelled, e.g., by an exponential function. The exponential semivariogram has three parameters: the nugget variance $c_0$, the partial sill $c_1$, and the distance parameter $\phi$. The total number
of model parameters now is five: two regression coefficients (intercept and slope
for elevation) and three semivariogram parameters. All five parameters can best
be estimated by restricted maximum likelihood, see Subsection 21.5.2. Table
13.1 shows the estimated regression coefficients and semivariogram parameters.
Up to a distance of about three times the estimated distance parameter 𝜙,
which is about 8 km, the residuals are spatially correlated; beyond that distance,
they are hardly correlated anymore.
We conclude that the errors in the regression model predictions are not inde-
pendent, although the correlation will be weak in this case, and that we must
account for this correlation in optimising the spatial strata.
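A sketch of the REML estimation with function likfit of package geoR; the initial values of the covariance parameters are assumptions:

library(geoR)
dGeoR <- as.geodata(as.data.frame(sampleXuancheng),
  coords.col = c("s1", "s2"), data.col = "lnSOM", covar.col = "dem")
vgm_REML <- likfit(geodata = dGeoR, trend = ~ dem,
  cov.model = "exponential", ini.cov.pars = c(0.1, 3), nugget = 0.1,
  lik.method = "REML", messages = FALSE)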
The discretisation grid with predicted lnSOM consists of 115,526 nodes. These
are too many for function optimStrata. The grid is therefore thinned to a grid
with a spacing of 800 m × 800 m, resulting in 7,257 nodes.
The first step in optimisation of spatial strata with package SamplingStrata
is to build the sampling frame with function buildFrameSpatial. Argument X
specifies the stratification variables, and argument Y specifies the study variables.
In our case, we have only one stratification variable and one study variable,
and these are the same variable. Argument variance specifies the variance of
the prediction error of the study variable. Variable dom is an identifier of the
domain of interest of which we want to estimate the mean or total. I assign
the value 1 to all population units, see code chunk below, which implies that
the stratification is optimised for the entire population. If we have multiple
domains of interest, the stratification is optimised for each domain separately.
Finally, as a preparatory step we must specify how precise the estimated
mean should be. This precision must be specified in terms of the coefficient of
variation (cv), i.e., the standard error of the estimated mean divided by the
mean. I use a cv of 0.005. In case of multiple domains of interest and multiple
study variables, a cv must be specified per domain and per study variable.
This precision requirement is used to compute the sample size for Neyman allocation (Equation (4.16)). The optimal stratification itself is independent of the precision requirement.
library(SamplingStrata)
subgrd$id <- seq_len(nrow(subgrd))
subgrd$dom <- rep(1, nrow(subgrd))
frame <- buildFrameSpatial(df = subgrd, id = "id", X = c("lnSOMpred"),
Y = c("lnSOMpred"), variance = c("varpred"), lon = "x1", lat = "x2",
domainvalue = "dom")
cv <- as.data.frame(list(DOM = "DOM1", CV1 = 0.005, domainvalue = 1))
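The stratification is then optimised with function optimStrata. A sketch of the call; the number of strata and the settings passed to the spatial method are assumptions:

set.seed(314)
solution <- optimStrata(method = "spatial", errors = cv, framesamp = frame,
  nStrata = 5, fitting = 1, range = 10, kappa = 1, showPlot = FALSE)
strata <- solution$aggr_strata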
The coefficient of variation realised by the optimised stratification:

[1] 0.005033349

This is indeed very close to the required value, as is confirmed by function expected_CV:

expected_CV(strata)

      cv(Y1)
DOM1   0.005
Figure 13.10 shows the optimised strata. I used the stratum bounds in data.frame smr_strata to compute the stratum for all raster cells of the original 200 m × 200 m grid.
FIGURE 13.10: Model-based optimal strata for estimating the mean of the
natural log of the SOM concentration in Xuancheng.
14
Sampling for estimating parameters of domains

14.1 Direct estimator for large domains

For a large domain, the mean can be estimated by the direct estimator

$$\hat{\bar{z}}_d = \frac{1}{N_d}\sum_{k\in\mathcal{S}_d}\frac{z_{dk}}{\pi_{dk}}\,, \tag{14.1}$$

where $N_d$ is the size of the domain, $z_{dk}$ is the value for unit $k$ of domain $d$, and $\pi_{dk}$ is the inclusion probability of this point.
When the domain is not used as a (marginal) stratum, so that the sample size of the domain is random, the mean of the domain can best be estimated by the ratio estimator:

$$\hat{\bar{z}}_{\mathrm{ratio},d} = \frac{\hat{t}_d(z)}{\hat{N}_d} = \frac{\sum_{k\in\mathcal{S}_d}\dfrac{z_{dk}}{\pi_{dk}}}{\sum_{k\in\mathcal{S}_d}\dfrac{1}{\pi_{dk}}}\,. \tag{14.2}$$
The ratio estimator can also be used when the size of the domain is unknown. An example is estimating the mean for soil classes as observed in the field, not as depicted on a soil map. A soil map is impure, i.e., the map units contain patches of soil classes that differ from the soil class indicated on the map, so the area of a given true soil class is not known.
For simple random sampling without replacement, 𝜋u�u� = 𝑛/𝑁. Inserting this
in Equation (14.2) gives
$$\hat{\bar{z}}_{\mathrm{ratio},d} = \frac{1}{n_d}\sum_{k\in\mathcal{S}_d} z_{dk}\,. \tag{14.3}$$
The mean of the domain is simply estimated by the mean of the 𝑧-values
observed in the domain, i.e., the sample mean in domain 𝑑. The variance of
this estimator can be estimated by
$$\hat{V}(\hat{\bar{z}}_{\mathrm{ratio},d}) = \frac{1}{\hat{a}^2_d}\cdot\frac{1}{n(n-1)}\sum_{k\in\mathcal{S}_d}\left(z_{dk} - \bar{z}_{\mathcal{S}_d}\right)^2\,, \tag{14.4}$$

where $\bar{z}_{\mathcal{S}_d}$ is the sample mean in domain $d$, and $\hat{a}_d$ is the estimated relative size of domain $d$:

$$\hat{a}_d = \frac{n_d}{n}\,. \tag{14.5}$$
Refer to Section 8.2.2 of de Gruijter et al. (2006) for the ratio estimator and its standard error with stratified simple random sampling in case the domains cut across the strata, and with other sampling designs.
The ratio estimator and its standard error can be computed with function svyby
of package survey (Lumley, 2021). This is illustrated with Eastern Amazonia.
We wish to estimate the mean aboveground biomass (AGB) of the 16 ecoregions
from a simple random sample of 200 units.
library(survey)
set.seed(314)
n <- 200
mysample <- grdAmazonia %>%
mutate(N = n()) %>%
slice_sample(n = n)
design_si <- svydesign(id = ~ 1, data = mysample, fpc = ~ N)
res <- svyby(~ AGB, by = ~ Ecoregion, design = design_si, FUN = svymean)
The ratio estimates of the mean AGB are shown in Table 14.1. Two ecoregions are missing in the table: no units were selected from these ecoregions, so a direct estimate is not available. Three ecoregions have an estimated standard error of 0.0. These ecoregions contain only one sampling unit, so the standard error cannot be estimated.
TABLE 14.1: Ratio estimates and estimated standard errors of the ratio estimator of the mean AGB (10^9 kg ha^-1) of ecoregions in Eastern Amazonia, with simple random sampling without replacement of size 200.
Ecoregion AGB se
Cerrado 99.7 15.6
Guianan highland moist forests 296.0 0.0
Guianan lowland moist forests 263.0 0.0
Gurupa varzea 80.0 0.0
Madeira-Tapajos moist forests 286.0 14.0
Marajo varzea 116.6 27.2
Maranhao Babassu forests 90.0 9.6
Mato Grosso tropical dry forests 177.0 90.7
Monte Alegre varzea 189.5 89.0
Purus-Madeira moist forests 145.5 47.1
Tapajos-Xingu moist forests 288.6 8.9
Tocantins/Pindare moist forests 176.3 17.9
Uatuma-Trombetas moist forests 274.7 6.7
Xingu-Tocantins-Araguaia moist forests 223.1 16.6
se: estimated standard error. The standard errors reported as 0.0 are in fact not available.
TABLE 14.2: Standard deviations of 1,000 π estimates (HT) and 1,000 ratio estimates (Ratio) of the mean AGB (10^9 kg ha^-1) of ecoregions in Eastern Amazonia, with simple random sampling without replacement of size 200.
Ecoregion HT Ratio n
Amazon-Orinoco-Southern Caribbean mangroves 136.9 36.9 0.87
Cerrado 38.6 18.2 8.16
Guianan highland moist forests 271.4 29.5 0.71
Guianan lowland moist forests 160.0 17.9 2.51
Guianan savanna 101.5 67.1 2.86
Gurupa varzea 103.7 57.8 0.99
Madeira-Tapajos moist forests 55.9 14.2 22.96
Marajo varzea 48.6 24.7 10.80
Maranhao Babassu forests 59.3 25.7 4.61
Mato Grosso tropical dry forests 72.7 41.1 2.98
Monte Alegre varzea 120.2 63.3 2.27
Purus-Madeira moist forests 288.0 55.5 0.45
Tapajos-Xingu moist forests 45.5 12.0 31.51
Tocantins/Pindare moist forests 31.5 15.3 28.39
Uatuma-Trombetas moist forests 33.4 8.1 52.37
Xingu-Tocantins-Araguaia moist forests 42.4 17.2 27.56
n: expected sample size.
The simple random sampling is repeated 1,000 times, and every sample is used
to estimate the mean AGB of the ecoregions both with the 𝜋 estimator and
the ratio estimator. As can be seen in Table 14.2, the standard deviation of the ratio estimates is much smaller than that of the π estimates. The reason is that the number of sampling units in an ecoregion varies among samples, i.e., the sample size of an ecoregion is random. When many units are selected from an ecoregion, the estimated total of that ecoregion is large. The estimated mean as obtained with the π estimator then is large too, because the estimated total is divided by the fixed size (total number of population units, $N_d$) of the ecoregion. In the ratio estimator, however, the size of an ecoregion is estimated from the same sample, even though we know its size, see Equation (14.2). With many units selected from an ecoregion, the estimated size of that ecoregion, $\hat{N}_d$, is also large. By dividing the large estimated total by the large estimated size, a more stable estimate of the mean of the domain is obtained.
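A rough sketch of this sampling experiment, simplified to base R (the loop below is an assumption, not the original code):

N <- nrow(grdAmazonia); n <- 200
Nd <- table(grdAmazonia$Ecoregion)
mz_HT <- mz_ratio <- NULL
set.seed(314)
for (i in 1:1000) {
  mysam <- grdAmazonia[sample(N, size = n), ]
  eco <- factor(mysam$Ecoregion, levels = names(Nd))
  td <- tapply(mysam$AGB, INDEX = eco, FUN = sum) * N / n  #pi estimator of total
  nd <- as.numeric(table(eco))
  mz_HT <- rbind(mz_HT, td / as.numeric(Nd))  #divide by the true size
  mz_ratio <- rbind(mz_ratio, td / (nd * N / n))  #divide by the estimated size
}
apply(mz_HT, MARGIN = 2, FUN = sd, na.rm = TRUE)
apply(mz_ratio, MARGIN = 2, FUN = sd, na.rm = TRUE)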
For quite a few ecoregions, the standard deviations are very large, especially of the π estimator. These are the ecoregions with very small expected sample sizes. With simple random sampling, the expected sample size of an ecoregion can simply be computed as $E[n_d] = n N_d/N$. In the following section, alternative estimators are described for these ecoregions with small expected sample sizes.

14.2 Model-assisted estimators for small domains
The mean of a domain can be estimated by dividing the regression estimator of the total of that domain by the estimated domain size:

$$\hat{\bar{z}}_{\mathrm{ratio},d} = \frac{\hat{t}_{\mathrm{regr},d}(z)}{\hat{N}_d}\,. \tag{14.6}$$
For a large domain with a reasonable sample size, the regression estimate can be
computed from the data of that domain (Chapter 10). For small domains, also
the data from outside these domains can be used to estimate the population
regression coefficients. This is explained in Subsection 14.2.1.
In the regression estimator, the potential bias due to the globally estimated regression coefficients can be eliminated by adding the π estimator of the mean of the regression residuals to the mean of the predictions in the domain (compare with Equation (10.8)) (Mandallaz, 2007; Mandallaz et al., 2013):

$$\hat{\bar{z}}_{\mathrm{regr},d} = \frac{1}{N_d}\sum_{k=1}^{N_d}\mathbf{x}^{\mathrm{T}}_{dk}\hat{\mathbf{b}} + \frac{1}{N_d}\sum_{k\in\mathcal{S}_d}\frac{e_{dk}}{\pi_{dk}} = \bar{\mathbf{x}}^{\mathrm{T}}_d\hat{\mathbf{b}} + \frac{1}{N_d}\sum_{k\in\mathcal{S}_d}\frac{e_{dk}}{\pi_{dk}}\,, \tag{14.7}$$
with $\mathbf{x}_{dk}$ the vector with covariate values for unit $k$ in domain $d$, $\hat{\mathbf{b}}$ the vector with globally estimated regression coefficients, $e_{dk}$ the residual for unit $k$ in domain $d$, $\pi_{dk}$ the inclusion probability of that unit, and $\bar{\mathbf{x}}_d$ the mean of the covariates in domain $d$. Alternatively, the mean of the residuals in a domain is estimated by the ratio estimator:
$$\hat{\bar{z}}_{\mathrm{regr},d} = \bar{\mathbf{x}}^{\mathrm{T}}_d\hat{\mathbf{b}} + \frac{1}{\hat{N}_d}\sum_{k\in\mathcal{S}_d}\frac{e_{dk}}{\pi_{dk}}\,, \tag{14.8}$$
with $\hat{N}_d$ the estimated size of domain $d$, see Equation (14.2). The regression coefficients can be estimated by Equation (10.15). With simple random sampling, the second term in Equation (14.8) is equal to the sample mean of the residuals, so that the estimator reduces to

$$\hat{\bar{z}}_{\mathrm{regr},d} = \bar{\mathbf{x}}^{\mathrm{T}}_d\hat{\mathbf{b}} + \bar{e}_{\mathcal{S}_d}\,. \tag{14.9}$$
The variance of this estimator can be estimated by

$$\hat{V}(\hat{\bar{z}}_{\mathrm{regr},d}) = \bar{\mathbf{x}}^{\mathrm{T}}_d\hat{\mathbf{C}}(\hat{\mathbf{b}})\bar{\mathbf{x}}_d + \hat{V}(\hat{\bar{e}}_d)\,, \tag{14.10}$$

with $\hat{\mathbf{C}}(\hat{\mathbf{b}})$ the matrix with estimated sampling variances and covariances of the regression coefficients. The first variance component is the contribution due to uncertainty about the regression coefficients; the second component accounts for the uncertainty about the mean of the residuals in the domain. For simple random sampling, the sampling variance of the π estimator of the mean of the residuals in a domain can be estimated by the sample variance of the residuals in that domain divided by the sample size $n_d$. This variance estimator is presented in Hill et al. (2021). If the domain is not used as a stratum and the domain mean of the residuals is estimated by the ratio estimator, the second variance component can be estimated by

$$\hat{V}(\hat{\bar{e}}_{\mathrm{ratio},d}) = \left(\frac{n}{n_d}\right)^2\cdot\frac{1}{n(n-1)}\sum_{k\in\mathcal{S}_d}\left(e_{dk} - \bar{e}_{\mathcal{S}_d}\right)^2\,. \tag{14.11}$$
With simple random sampling, the sampling variances and covariances of the
estimated regression coefficients can be estimated by (equation 2 in Hill et al.
(2021))
$$\hat{\mathbf{C}}(\hat{\mathbf{b}}) = \left(\frac{1}{n}\sum_{k\in\mathcal{S}}\mathbf{x}_k\mathbf{x}^{\mathrm{T}}_k\right)^{-1}\left(\frac{1}{n^2}\sum_{k\in\mathcal{S}} e^2_k\,\mathbf{x}_k\mathbf{x}^{\mathrm{T}}_k\right)\left(\frac{1}{n}\sum_{k\in\mathcal{S}}\mathbf{x}_k\mathbf{x}^{\mathrm{T}}_k\right)^{-1}\,. \tag{14.12}$$
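A minimal sketch of this sandwich estimator, assuming a model matrix X (one row per sampling unit) and a vector e with the residuals:

n <- nrow(X)
A_inv <- solve(crossprod(X) / n)  #inverse of 1/n times the sum of x x^T
M <- crossprod(X * e) / n^2       #1/n^2 times the sum of e^2 x x^T
C_b <- A_inv %*% M %*% A_inv      #Equation (14.12)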
library(forestinventory)
n <- 200
set.seed(314)
units <- sample(nrow(grdAmazonia), size = n, replace = FALSE)
grdAmazonia <- grdAmazonia %>%
mutate(lnSWIR2 = log(SWIR2),
id = row_number(),
ind = as.integer(id %in% units))
grdAmazonia$AGB[grdAmazonia$ind == 0L] <- NA
mx_eco_pop <- tapply(
grdAmazonia$lnSWIR2, INDEX = grdAmazonia$Ecoregion, FUN = mean)
mX_eco_pop <- data.frame(
Intercept = rep(1, length(mx_eco_pop)), lnSWIR2 = mx_eco_pop)
ecos_in_sam <- unique(mysample$Ecoregion)
res <- forestinventory::twophase(AGB ~ lnSWIR2,
data = as.data.frame(grdAmazonia),
phase_id = list(phase.col = "ind", terrgrid.id = 1),
small_area = list(sa.col = "Ecoregion",
areas = sort(ecos_in_sam),
unbiased = TRUE),
psmall = TRUE, exhaustive = mX_eco_pop)
regr <- res$estimation
An alternative is to save the selected units in a data frame and to pass this data frame to function twophase with argument data. The results are identical, because the true means of the covariate $x$, specified with argument exhaustive, contain all required information at the population level.
For two ecoregions, no regression estimate of the mean AGB is obtained (Table 14.3), as no units were selected from these domains. The estimated variance of the estimated domain mean is in column g_var. In the variance estimate in column ext_var, the first variance component of Equation (14.10) is ignored. Note that for the ecoregions with a sample size of one unit (the sample size per domain is in column n2G), no estimate of the variance is available, because the variance of the estimated mean of the residuals cannot be estimated from one unit.
Figure 14.1 shows the regression estimates plotted against the ratio estimates.
The intercept of the line, fitted with ordinary least squares (OLS), is larger
than 0, and the slope is smaller than 1. Using the regression model predictions
in the estimation of the means leads to some smoothing.
I quantified the gain in precision of the estimated mean AGB due to the use
of the regression model by the variance of the ratio estimator divided by the
variance of the regression estimator (Table 14.4). For ratios larger than 1,
there is a gain in precision. Both variances are estimated from 1,000 repeated
ratio and regression estimates obtained with simple random sampling without replacement of size 200.

TABLE 14.3: Regression estimates of the mean AGB (10^9 kg ha^-1) of ecoregions in Eastern Amazonia, for a simple random sample without replacement of size 200, using lnSWIR2 as a predictor.

For all but two small ecoregions, there is a gain. For
quite a few ecoregions, the gain is quite large. These are the ecoregions where
the globally fitted regression model explains a large part of the spatial variation
of AGB.
For small domains from which no units are selected, the mean can still be estimated by the synthetic estimator, also referred to as the synthetic regression estimator, obtained by dropping the second term in Equation (14.7):

$$\hat{\bar{z}}_{\mathrm{syn},d} = \bar{\mathbf{x}}^{\mathrm{T}}_d\hat{\mathbf{b}}\,. \tag{14.13}$$

The variance of this estimator can be estimated by

$$\hat{V}(\hat{\bar{z}}_{\mathrm{syn},d}) = \bar{\mathbf{x}}^{\mathrm{T}}_d\hat{\mathbf{C}}(\hat{\mathbf{b}})\bar{\mathbf{x}}_d\,. \tag{14.14}$$
This is equal to the first variance component of Equation (14.10). The synthetic
estimate can be computed with function twophase, with argument psmall = FALSE
and element unbiased = FALSE in the list small_area.
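A sketch of this call, with the remaining arguments as in the earlier code chunk:

res_syn <- forestinventory::twophase(AGB ~ lnSWIR2,
  data = as.data.frame(grdAmazonia),
  phase_id = list(phase.col = "ind", terrgrid.id = 1),
  small_area = list(sa.col = "Ecoregion", areas = sort(ecos_in_sam),
    unbiased = FALSE),
  psmall = FALSE, exhaustive = mX_eco_pop)
syn <- res_syn$estimation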
FIGURE 14.1: Scatter plot of the ratio and the regression estimates of the mean AGB (10^9 kg ha^-1) of ecoregions in Eastern Amazonia, for a simple random sample without replacement of size 200. In the regression estimate, lnSWIR2 is used as a predictor. The line is fitted by ordinary least squares.
For all ecoregions, also the unsampled ones, a synthetic estimate of the mean
AGB is obtained (Table 14.5). For the sampled ecoregions, the synthetic
estimate differs from the regression estimate. This difference can be quite large
for ecoregions with a small sample size. Averaged over all sampled ecoregions,
the difference, computed as synthetic estimate minus regression estimate,
equals 14.9 · 10^9 kg ha^-1. The variance of the regression estimator is always
much larger than the variance of the synthetic estimator. The difference is
the variance of the estimator of the domain mean of the residuals. However,
recall that the regression estimator is design-unbiased, whereas the synthetic estimator is not. A fairer comparison is therefore on the basis of the root mean squared error (RMSE) (Table 14.6). For the regression estimator, the RMSE equals its standard error and is therefore not shown in that table.

TABLE 14.4: Gain in precision due to the regression estimator, quantified as the variance of the ratio estimator divided by the variance of the regression estimator, per ecoregion.

Ecoregion Gain
Amazon-Orinoco-Southern Caribbean mangroves 0.57
Cerrado 1.63
Guianan highland moist forests 1.01
Guianan lowland moist forests 1.26
Guianan savanna 6.70
Gurupa varzea 0.65
Madeira-Tapajos moist forests 1.44
Marajo varzea 1.87
Maranhao Babassu forests 1.70
Mato Grosso tropical dry forests 2.81
Monte Alegre varzea 1.21
Purus-Madeira moist forests 1.74
Tapajos-Xingu moist forests 2.73
Tocantins/Pindare moist forests 2.56
Uatuma-Trombetas moist forests 2.10
Xingu-Tocantins-Araguaia moist forests 4.06
In the synthetic estimator and the regression estimator, both quantitative
covariates and categorical variables can be used. If one or more categorical
variables are included in the estimator, the variable names in the data frame
with the true means of the ancillary variables per domain, specified with
argument exhaustive, must correspond to the column names of the design
matrix that is generated with function lm, see Subsection 10.1.3.
TABLE 14.5: Synthetic estimates of the mean AGB (10^9 kg ha^-1) of ecoregions in Eastern Amazonia, for a simple random sample without replacement of size 200, using lnSWIR2 as a predictor.
14.3 Model-based prediction

The difference between model-assisted estimation and model-based prediction of domain means is explained in Chapter 26. The models used in this section are linear mixed models. In a linear mixed model, the mean of the study variable is modelled as a linear combination of covariates, similar to a linear regression model. The difference with a linear regression model is that the residuals are not assumed independent: the dependence of the residuals is modelled too. Two types of linear mixed model are described: a random intercept model and a geostatistical model.
A basic linear mixed model that can be used for model-based prediction of
means of small domains is the random intercept model:
$$\begin{aligned} Z_{dk} &= \mathbf{x}^{\mathrm{T}}_{dk}\boldsymbol{\beta} + v_d + \epsilon_{dk}\\ v_d &\sim \mathcal{N}(0, \sigma^2_v)\\ \epsilon_{dk} &\sim \mathcal{N}(0, \sigma^2_\epsilon)\,. \end{aligned} \tag{14.15}$$
Two random variables are now involved, both with a normal distribution with mean zero: $v_d$, a random intercept at the domain level with variance $\sigma^2_v$, and the residuals $\epsilon_{dk}$ at the unit level with variance $\sigma^2_\epsilon$. The variance $\sigma^2_v$ can be interpreted as the variance of the domain means around the fixed part of the model.
TABLE 14.6: Estimated standard error (se), bias, and root mean squared
error (RMSE) of the regression estimator (reg) and the synthetic estimator
(syn) of the mean AGB of ecoregions in Eastern Amazonia. The regression
estimator is design-unbiased, so the RMSE of the regression estimator is equal
to its standard error.
The mean of domain $d$ can be predicted by

$$\hat{\bar{z}}_{\mathrm{mb},d} = \bar{\mathbf{x}}^{\mathrm{T}}_d\hat{\boldsymbol{\beta}} + \hat{v}_d\,, \tag{14.16}$$

with $\hat{\boldsymbol{\beta}}$ the best linear unbiased estimate (BLUE) of the regression coefficients and $\hat{v}_d$ the best linear unbiased prediction (BLUP) of the random intercept $v_d$ for domain $d$. The model-based predictor can also be written as
$$\hat{\bar{z}}_{\mathrm{mb},d} = \bar{\mathbf{x}}^{\mathrm{T}}_d\hat{\boldsymbol{\beta}} + \lambda_d\left(\frac{1}{n_d}\sum_{k\in\mathcal{S}_d}\epsilon_{dk}\right)\,, \tag{14.17}$$

with $\lambda_d$ a weight for the second term that corrects for the bias of the synthetic estimator. This weight is computed by
$$\lambda_d = \frac{\hat{\sigma}^2_v}{\hat{\sigma}^2_v + \hat{\sigma}^2_\epsilon/n_d}\,. \tag{14.18}$$
This equation shows that the larger the estimated residual variance $\hat{\sigma}^2_\epsilon$, the smaller the weight for the bias correction term, and the larger the sample size $n_d$, the larger the weight. Comparing Equations (14.16) and (14.17) shows that the random intercept of a domain is predicted by the sample mean of the residuals of that domain, multiplied by the weight of Equation (14.18).
The means of the small domains can be computed with function eblup.mse.f.wrap
of package JoSAE (Breidenbach, 2018). It requires as input a linear mixed
model generated with function lme of package nlme (Pinheiro et al., 2021). The
simple random sample of size 200 selected before is used to fit the linear mixed
model, with lnSWIR2 as a fixed effect, i.e., the effect of lnSWIR2 on the mean
AGB. The random effect is added by assigning another formula to argument
random. The formula ~ 1 | Ecoregion means that the intercept is treated as a
random variable and that it varies among the ecoregions. This linear mixed
model is referred to as a random intercept model: the intercepts are allowed to
differ among the small domains, whereas the effect of the covariate, lnSWIR2 in our case, is equal for all domains.
Tibble grdAmazonia is converted to a data.frame to avoid problems with function
eblup.mse.f.wrap hereafter. A simple random sample with replacement of size
200 is selected.
library(nlme)
library(JoSAE)
lmm_AGB <- lme(fixed = AGB ~ lnSWIR2, data = mysample, random = ~ 1 | Ecoregion)
The fixed effects of the linear mixed model can be extracted with function
fixed.effects.
The fixed effects of the linear mixed model differ somewhat from the fixed
effects in the simple linear regression model (fixed_lm):
fixed_lm fixed_lmm
(Intercept) 1778.1959 1667.9759
lnSWIR2 -241.1567 -225.6561
random.effects(lmm_AGB)
(Intercept)
Cerrado 21.439891
Guianan highland moist forests 6.397816
Guianan lowland moist forests 5.547995
Gurupa varzea -52.985839
Madeira-Tapajos moist forests 27.479921
Marajo varzea -50.587786
Maranhao Babassu forests -1.702322
Mato Grosso tropical dry forests 18.812207
Monte Alegre varzea -12.201148
Purus-Madeira moist forests -9.320683
Tapajos-Xingu moist forests 28.760508
Tocantins/Pindare moist forests -8.940962
Uatuma-Trombetas moist forests 16.165017
Xingu-Tocantins-Araguaia moist forests 11.135385
The random intercepts are added to the fixed intercept; the coefficient of
lnSWIR2 is the same for all ecoregions:
coef(lmm_AGB)
(Intercept) lnSWIR2
Cerrado 1689.416 -225.6561
Guianan highland moist forests 1674.374 -225.6561
Guianan lowland moist forests 1673.524 -225.6561
Gurupa varzea 1614.990 -225.6561
Madeira-Tapajos moist forests 1695.456 -225.6561
Marajo varzea 1617.388 -225.6561
Maranhao Babassu forests 1666.274 -225.6561
Mato Grosso tropical dry forests 1686.788 -225.6561
Monte Alegre varzea 1655.775 -225.6561
Purus-Madeira moist forests 1658.655 -225.6561
Tapajos-Xingu moist forests 1696.736 -225.6561
Tocantins/Pindare moist forests 1659.035 -225.6561
Uatuma-Trombetas moist forests 1684.141 -225.6561
Xingu-Tocantins-Araguaia moist forests 1679.111 -225.6561
The fitted model can now be used to predict the means of the ecoregions as
follows. As a first step, a data frame must be defined, with the size and the
population mean of the covariate lnSWIR2 per domain. This data frame is
passed to function eblup.mse.f.wrap with argument domain.data. This function
computes the model-based prediction, as well as the regression estimator
(Equation (14.7)) and the synthetic estimator (Equation (14.13)) and their
variances. The model-based predictor is the variable EBLUP in the output data
frame. For the model-based predictor, two standard errors are computed, see
Breidenbach and Astrup (2012) for details.
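A sketch of these steps; the construction of the domain data frame is an assumption based on the description above:

Nd <- as.numeric(table(grdAmazonia$Ecoregion))
mx_d <- tapply(grdAmazonia$lnSWIR2, INDEX = grdAmazonia$Ecoregion, FUN = mean)
domain_df <- data.frame(Ecoregion = names(mx_d), N = Nd, lnSWIR2 = mx_d)
res_eblup <- eblup.mse.f.wrap(domain.data = domain_df, lme.obj = lmm_AGB)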
Table 14.7 shows the model-based predictions and the estimated standard
errors of the mean AGB of the ecoregions, obtained with the random intercept
model.
Note that with this model no predictions of the mean AGB are obtained for the unsampled ecoregions. This is because the random intercept $v_d$ cannot be predicted in the absence of data, see Equations (14.16) and (14.17).
In a geostatistical model, there is only one random variable, the residual of the
model-mean, not two random variables as in the random intercept model. See
Equation (21.2) for a geostatistical model with a constant mean and Equation
(21.16) for a model with a mean that is a linear combination of covariates.
In a geostatistical model, the covariance of the residuals of the mean at two
locations is modelled as a function of the distance (and direction) of the points.
Instead of the covariance, often the semivariance is modelled, i.e., half the
variance of the difference of the residuals at two locations, see Chapter 21 for
details.
The simple random sample of size 200 selected before is used to estimate the
regression coefficients for the mean, an intercept, and a slope coefficient for
lnSWIR2, and besides the parameters of a spherical semivariogram model
for the residuals of the mean. The two regression coefficients and the three
semivariogram parameters are estimated by restricted maximum likelihood
(REML), see Subsection 21.5.2. This estimation procedure is also used in
function lme to fit the random intercept model. Here, function likfit of package
geoR (Ribeiro Jr et al., 2020) is used to estimate the model parameters. First,
a geoR object must be generated with function as.geodata.
library(geoR)
dGeoR <- as.geodata(mysample, header = TRUE,
coords.col = c("x1", "x2"), data.col = "AGB", covar.col = "lnSWIR2")
vgm_REML <- likfit(geodata = dGeoR, trend = ~ lnSWIR2,
cov.model = "spherical",
ini.cov.pars = c(600, 600), nugget = 1500,
lik.method = "REML", messages = FALSE)
The estimated intercept and slope are 1,744 and -236.5, respectively. The estimated semivariogram parameters are 1,623 (10^9 kg ha^-1)^2, 700 (10^9 kg ha^-1)^2, and 652 km for the nugget, partial sill, and range, respectively. These
model parameters are used to predict AGB for all units in the population,
using function krige of package gstat (Pebesma, 2004). The REML estimates
of the semivariogram parameters are passed to function vgm with arguments
nugget, psill, and range. The coordinates of the sample are shifted to a random
point within a 1 km × 1 km grid cell. This is done to avoid the coincidence
of a sampling point and a prediction point, which leads to an error message
when predicting AGB at the nodes of the grid.
library(gstat)
mysample$x1 <- jitter(mysample$x1, amount = 0.5)
mysample$x2 <- jitter(mysample$x2, amount = 0.5)
coordinates(mysample) <- ~ x1 + x2
vgm_REML_gstat <- vgm(model = "Sph",
nugget = vgm_REML$nugget, psill = vgm_REML$sigmasq, range = vgm_REML$phi)
coordinates(grdAmazonia) <- ~ x1 + x2
predictions <- krige(
formula = AGB ~ lnSWIR2,
locations = mysample,
newdata = grdAmazonia,
model = vgm_REML_gstat,
debug.level = 0) %>% as("data.frame")
Besides a prediction (variable var1.pred), for every population unit the variance
of the prediction error is computed (var1.var). The unitwise predictions can be
averaged across all units of an ecoregion to obtain a model-based prediction of
the mean of that ecoregion.
Similar to the synthetic estimator, for all ecoregions an estimate of the mean
AGB is obtained, also for the unsampled ecoregions (Table 14.8). The model-
based prediction is strongly correlated with the synthetic estimate (Figure
14.2).
The most striking difference is the standard error. The standard errors of the
synthetic estimator range from 3.7 to 7.1 (Table 14.6), whereas the standard
errors of the geostatistical predictions range from 6.2 to 28.1. However, these
two standard errors are fundamentally different and should not be compared.
The standard error of the synthetic estimator is a sampling standard error, i.e.,
it quantifies the variation of the estimated mean of an ecoregion over repeated
random sampling with the sampling design, in this case simple random sampling
of 200 units. The model-based standard error is not a sampling standard error
but a model standard error, which expresses our uncertainty about the means
of the domains due to our imperfect knowledge of the spatial variation of
AGB. Given the observations of AGB at the selected sample, the map with the
covariate lnSWIR2, and the estimated semivariogram model parameters, we
are uncertain about the exact value of AGB at unsampled units. No samples are
considered other than the one actually selected. For the fundamental difference
between design-based, model-assisted, and model-based estimates of means,
refer to Section 1.2 and Chapter 26.
TABLE 14.8: Model-based predictions, obtained with the geostatistical model, of the mean AGB (10^9 kg ha^-1) of ecoregions in Eastern Amazonia.

Ecoregion AGB se
Amazon-Orinoco-Southern Caribbean mangroves 191.1 22.4
Cerrado 96.8 12.1
Guianan highland moist forests 299.1 28.1
Guianan lowland moist forests 279.3 16.4
Guianan savanna 166.1 10.2
Gurupa varzea 201.2 15.6
Madeira-Tapajos moist forests 287.0 8.6
Marajo varzea 194.6 11.3
Maranhao Babassu forests 121.6 12.0
Mato Grosso tropical dry forests 121.4 11.7
Monte Alegre varzea 241.4 10.8
Purus-Madeira moist forests 229.0 24.3
Tapajos-Xingu moist forests 273.8 8.2
Tocantins/Pindare moist forests 163.0 6.8
Uatuma-Trombetas moist forests 265.9 6.4
Xingu-Tocantins-Araguaia moist forests 220.0 6.2
se: standard error of predicted mean.
It makes more sense to compare the two model-based predictions, the random
intercept model predictions and the geostatistical predictions, and their stan-
dard errors. Figure 14.3 shows that the two model-based predictions are very
similar.
For four ecoregions, the standard errors of the geostatistical model predictions
are much smaller than those of the random intercept model predictions (Figure
14.4). These are ecoregions with small sample sizes.
FIGURE 14.2: Scatter plot of the model-based prediction and the synthetic estimate of the mean AGB (10^9 kg ha^-1) of ecoregions in Eastern Amazonia. The solid line is the 1:1 line.
14.4 Supplemental probability sampling of small domains

An original probability sample covering the entire population can be supplemented by an additional probability sample from the small domains. There are then two estimation approaches. In the first approach, the two samples are combined into one sample, which is used to estimate the population mean or total. In the second approach, it is not the samples that are combined, but the two estimates obtained from the separate samples. In this section, only the first approach is illustrated, with a simple situation in which the two samples are easily combined. Refer to Grafström et al. (2019) for a more general approach of how multiple probability samples can be combined.
Suppose that the original sample is a simple random sample from the entire
study area. A supplemental sample is selected from small domains, i.e., domains
that have few selected units only. For a given small domain, the first sample
is supplemented by selecting a simple random sample from the units not yet
selected in the first sample. The size of the supplemental sample of a domain
depends on the number of units of that domain in the first sample. The first
sample is supplemented so that the total sample size of that domain is fixed.
In this case, the combined sample of a domain is a simple random sample from
that domain, so that the usual estimators for simple random sampling can be
used to estimate the domain mean or total and its standard error.
This sampling strategy is illustrated with Eastern Amazonia. A simple random
sample without replacement of 400 units is selected.
The selected units are removed from the sampling frame. For each of the three small biomes, Mangrove, Forest_dry, and Grassland, the size of the supplemental sample is computed so that the total sample size becomes 40. The supplemental sample is selected by stratified simple random sampling without replacement, using the small biomes as strata (Chapter 4).
The two samples are merged, and the means of the domains are estimated by
the sample means.
This sampling approach and estimation are repeated 10,000 times, i.e., a
simple random sample without replacement of size 400 is selected 10,000 times
from Eastern Amazonia, and the samples from the three small domains are
supplemented so that the total sample sizes in these domains become 40. In
two out of the 10,000 samples, the size of the first sample in one of the domains
exceeded 40 units. These two samples are discarded. Ideally, these samples are
not discarded, but their sizes in the small domains are reduced to 40 units,
which are then used to estimate the means of the domains.
For all three small domains, the average of the 10,000 estimated means of AGB
is about equal to the true mean (Table 14.9). Also the mean of the 10,000
estimated standard errors is very close to the standard deviation of the 10,000
estimated means. The coverage rates of 95, 90, and 80% confidence intervals
are about equal to the nominal coverage rates.
This simple approach is feasible because at the domain level the two merged
samples are a simple random sample. This approach is also applicable when the
first sample is a stratified simple random sample from the entire population,
and the supplemental sample is a stratified simple random sample from a small
domain using as strata the intersections of the strata used in the first phase
and that domain.
15
Repeated sample surveys for monitoring
population parameters
The previous chapters are all about sampling to estimate population parameters
at a given time. The survey is done in a relatively short period of time, so
that we can safely assume that the study variable has not changed during
that period. This chapter is about repeating the sample survey two or more
times, to estimate, for instance, a temporal change in a population parameter.
Sampling locations are selected by probability sampling, by any design type.
In most cases, sampling times are not selected randomly, but purposively. For
instance, to monitor the carbon stock in the soil of a country, we may decide
to repeat the survey after five years, in the same season of the year as the first
survey.
In a serially alternating (SA) design with a period of three, for instance, samples are selected independently from each other for the first three surveys, and these samples are revisited in subsequent surveys.
Two other compromise designs are a supplemented panel (SP) design and a
rotating panel (RP) design. In an SP design only a subset of the sampling
locations of the first survey is revisited in the subsequent surveys. These are
the permanent sampling locations observed in all subsequent surveys. The
permanent sampling locations are supplemented by samples that are selected
independently from the samples in the previous surveys. In Figure 15.1, half
of the sampling locations (ten locations) are permanent (panel a), i.e., revisited in all surveys, but the proportion of permanent sampling locations can be smaller or larger and, if prior information on the variation in space and time is available, can even be optimised for estimating the current mean. Also in
an RP design, sampling units of the previous survey are partially replaced by
new units. The difference with an SP design is that there are no permanent
sampling units, i.e., no units observed in all surveys. All sampling units are
sequentially rotated out and in again at the subsequent sampling times.
In Figure 15.1 the shape and colour of the symbols represent a panel. A panel
is a group of sampling locations that is observed in the same surveys. In the SS
design, there is only one panel. All locations are observed in all surveys, so all
locations are in the same panel. In the IS design, there are as many panels as
there are surveys. In the SA design with a period of two, the number of panels
equals the number of surveys divided by two. In these three space-time designs
(SS, IS, and SA), all sampling locations of a given survey are in the same panel.
This is not the case in the SP and RP designs. In Figure 15.1 in each survey,
two panels are observed. In the SP sample, there is one panel of permanent
sampling locations (pure panel part of sample) and another panel of swarming
sampling locations observed in one survey only. In the RP sample of Figure
15.1, the sampling locations are observed in two consecutive surveys; however,
this number can be increased. For instance, in an ‘in-for-three’ rotational
sample (McLaren and Steel, 2001) the sampling locations stay in the sample
for three consecutive surveys. The number of panels per sampling time is then
three. Also, similar to an SA design, in an IS, SP, and RP design we may decide
after several surveys to stop selecting new sampling locations and to revisit
existing locations. The concept of panels is needed hereafter in estimating
space-time population parameters.
For finite populations, the change of the spatial mean between two sampling times $t_a$ and $t_b$ is defined as

$$\bar{d}_{ab} = \frac{1}{N}\left(\sum_{k=1}^{N} z_k(t_b) - \sum_{k=1}^{N} z_k(t_a)\right) = \frac{1}{N}\sum_{k=1}^{N} d_{abk}\,, \tag{15.1}$$

with $d_{abk}$ the change of the study variable in the period between time $t_a$ and time $t_b$ for unit $k$. For infinite populations, the sums are replaced by integrals:

$$\bar{d}_{ab} = \frac{1}{A}\left(\int_{\mathbf{s}\in\mathcal{A}} z(\mathbf{s}, t_b)\,\mathrm{d}\mathbf{s} - \int_{\mathbf{s}\in\mathcal{A}} z(\mathbf{s}, t_a)\,\mathrm{d}\mathbf{s}\right) = \frac{1}{A}\int_{\mathbf{s}\in\mathcal{A}} d_{ab}(\mathbf{s})\,\mathrm{d}\mathbf{s}\,, \tag{15.2}$$

with $d_{ab}(\mathbf{s})$ the change of the study variable in the period between time $t_a$ and $t_b$ at location $\mathbf{s}$.
With more than two surveys, an interesting population parameter is the average
change per time unit of the mean, referred to as the temporal trend of the
spatial mean. It is defined as a linear combination of the spatial means at the
sampling times (Breidt and Fuller, 1999):
$$b = \sum_{j=1}^{R} w_j \bar{z}_j\,, \tag{15.3}$$

with $R$ the number of sampling times, $\bar{z}_j$ the spatial mean at time $t_j$, and weights $w_j$ equal to

$$w_j = \frac{t_j - \bar{t}}{\sum_{j=1}^{R}(t_j - \bar{t})^2}\,. \tag{15.4}$$

Another space-time parameter of interest is the space-time mean. For discrete time, it is the average of the spatial means at the $R$ sampling times:

$$\bar{z}_{\mathcal{ST}} = \frac{1}{R}\sum_{j=1}^{R}\bar{z}_j\,. \tag{15.5}$$

For continuous time, the space-time mean is defined as

$$\bar{z}_{\mathcal{ST}} = \frac{1}{T}\int_{t\in\mathcal{T}}\bar{z}_t\,\mathrm{d}t\,, \tag{15.6}$$

with $T$ the length of the monitoring period and $\bar{z}_t$ the spatial mean at time $t$.
The elementary estimates of the spatial means, computed from the individual panels, can be written as a linear model:

$$\hat{\mathbf{z}} = \mathbf{X}\mathbf{z} + \mathbf{e}\,, \tag{15.7}$$

with $\hat{\mathbf{z}}$ the vector with the $P$ elementary estimates, $\mathbf{z}$ the vector of true spatial means $\bar{z}(t_1), \dots, \bar{z}(t_R)$ at the $R$ sampling times, $\mathbf{X}$ the $(P \times R)$ design matrix with zeroes and ones that selects the appropriate spatial means, and $\mathbf{e}$ the vector with sampling errors. For the SP design of Figure 15.1, the design matrix equals

$$\mathbf{X} = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{bmatrix}\,. \tag{15.8}$$
The first four rows of this matrix are associated with the elementary estimates
of the spatial means at the four sampling times, estimated from panel a, the
panel with permanent sampling locations. Hereafter, this panel is referred
to as the static-synchronous subsample. The remaining rows correspond to
the elementary estimates from the other four panels, the swarming locations,
hereafter referred to as the independent-synchronous subsamples. For the RP
design of Figure 15.1, the design matrix equals
$$\mathbf{X} = \begin{bmatrix} 1 & 0 & 0 & 0\\ 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 1 \end{bmatrix}\,. \tag{15.9}$$
The first two rows correspond to the two elementary estimates of the mean
at time 𝑡1 from panels a and b, the third and fourth rows correspond to the
elementary estimate at time 𝑡2 from panels b and c, respectively, etc.
The minimum variance linear unbiased estimator (MVLUE) of the spatial
means at the different times is the design-based generalised least squares (GLS)
estimator (Binder and Hidiroglou, 1988):
$$\hat{\mathbf{z}}_{\mathrm{GLS}} = (\mathbf{X}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{C}^{-1}\hat{\mathbf{z}}\,, \tag{15.10}$$

with $\mathbf{C}$ the variance-covariance matrix of the elementary estimates.
To define matrix $\mathbf{C}$ for the SP design in Figure 15.1, let $\hat{\bar{z}}_{pj}$ denote the estimated mean at time $t_j$, $j = 1, 2, 3, 4$, in panel $p$, $p \in (a, b, c, d, e)$, with panel $a$ the panel with the permanent sampling locations. Only elementary estimates computed from the same panel are correlated, so for the SP design

$$\mathbf{C} = \begin{bmatrix} V_{1a} & C_{1,2} & C_{1,3} & C_{1,4} & 0 & 0 & 0 & 0\\ C_{2,1} & V_{2a} & C_{2,3} & C_{2,4} & 0 & 0 & 0 & 0\\ C_{3,1} & C_{3,2} & V_{3a} & C_{3,4} & 0 & 0 & 0 & 0\\ C_{4,1} & C_{4,2} & C_{4,3} & V_{4a} & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & V_{1b} & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & V_{2c} & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & V_{3d} & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & V_{4e} \end{bmatrix}\,. \tag{15.11}$$

For the RP design of Figure 15.1, the variance-covariance matrix equals
$$\mathbf{C} = \begin{bmatrix} V_{1a} & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & V_{1b} & C_{1,2} & 0 & 0 & 0 & 0 & 0\\ 0 & C_{2,1} & V_{2b} & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & V_{2c} & C_{2,3} & 0 & 0 & 0\\ 0 & 0 & 0 & C_{3,2} & V_{3c} & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & V_{3d} & C_{3,4} & 0\\ 0 & 0 & 0 & 0 & 0 & C_{4,3} & V_{4d} & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & V_{4e} \end{bmatrix}\,. \tag{15.12}$$
Only the elementary estimates of the same panel are correlated, for instance
the elementary estimates of the spatial means at times 𝑡1 and 𝑡2 , estimated
from panel b.
For the SA design of Figure 15.1, the variance-covariance matrix equals (there is only one elementary estimate per time)

$$\mathbf{C} = \begin{bmatrix} V_1 & 0 & C_{1,3} & 0\\ 0 & V_2 & 0 & C_{2,4}\\ C_{3,1} & 0 & V_3 & 0\\ 0 & C_{4,2} & 0 & V_4 \end{bmatrix}\,. \tag{15.13}$$
The covariance of two elementary estimates computed from the same panel $p$ can be estimated by

$$\hat{C}_{jj'} = \frac{\hat{S}^2_{jj'}}{m}\,, \tag{15.14}$$

with $m$ the number of sampling locations in the panel and $\hat{S}^2_{jj'}$ the estimated covariance of the study variable values at times $t_j$ and $t_{j'}$:

$$\hat{S}^2_{jj'} = \frac{1}{m-1}\sum_{k=1}^{m}\left(z_{pjk} - \hat{\bar{z}}_{pj}\right)\left(z_{pj'k} - \hat{\bar{z}}_{pj'}\right)\,, \tag{15.15}$$

with $z_{pjk}$ the study variable of unit $k$ in panel $p$ at time $t_j$ and $\hat{\bar{z}}_{pj}$ the spatial mean at time $t_j$ as estimated from panel $p$.
The variances and covariances of the GLS estimators of the spatial means at
the 𝑅 sampling times can be estimated by
$$\widehat{\mathrm{Cov}}(\hat{\mathbf{z}}_{\mathrm{GLS}}) = (\mathbf{X}^{\mathrm{T}}\hat{\mathbf{C}}^{-1}\mathbf{X})^{-1}\,. \tag{15.16}$$
Given the design-based GLS estimates of the spatial means at the different
times, it is an easy job to compute the estimated change of the mean between
two surveys, the estimated temporal trend of the mean, and the estimated
space-time mean.
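A minimal sketch of Equations (15.10) and (15.16); the objects mz_elem (the vector with the elementary estimates), X, and C are assumed to be defined:

XC <- crossprod(X, solve(C))               #X^T C^{-1}
mz_GLS <- solve(XC %*% X, XC %*% mz_elem)  #Equation (15.10)
Cov_GLS <- solve(XC %*% X)                 #Equation (15.16)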
As explained above, with the SS, IS, and SA designs, the design matrix $\mathbf{X}$ is the identity matrix of size $R$. From this it follows that for these space-time designs $\hat{\mathbf{z}}_{\mathrm{GLS}} = \hat{\mathbf{z}}$, see Equation (15.10), and $\mathrm{Cov}(\hat{\mathbf{z}}_{\mathrm{GLS}}) = \mathrm{Cov}(\hat{\mathbf{z}}) = \mathbf{C}$. In words, the GLS estimator of the spatial mean at a given time equals the usual π estimator of the mean at that time, and the variance-covariance matrix of the GLS estimators of the spatial means equals the variance-covariance matrix of the π estimators.
With SP and RP sampling in space-time, there is partial overlap between
the samples at the different times, and so the samples at the previous times
can be used to increase the precision of the estimated mean at the last time (the current mean). The estimated current mean is simply the last element of $\hat{\mathbf{z}}_{\mathrm{GLS}}$. The estimated variance of the estimator of the current mean is the element in the final row and final column of the variance-covariance matrix of the estimated spatial means (Equation (15.16)). These two space-time designs with
estimated spatial means (Equation (15.16)). These two space-time designs with
partial replacement of sampling locations may yield a more precise estimate of
the current mean than the other three space-time designs.
The change of the spatial mean between two sampling times can simply be
estimated by subtracting the estimated spatial means at these two times. With
the SS, IS, and SA designs, these means are estimated by the 𝜋 estimators.
With the SP and RP designs, the two spatial means are estimated by the GLS
estimators. The variance of the estimator of the change can be estimated by
the sum of the estimated variances of the spatial mean estimators at the two
sampling times, minus two times the estimated covariance of the two estimators.
The covariance is maximal when all sampling locations are revisited, leading
to the most precise estimate of the change of the spatial mean. With SP and
RP sampling, the covariance of the two spatial mean estimators is smaller, and
so the variance of the change estimator is larger.
The temporal trend of the mean is estimated by the weighted average of the (GLS) estimated means at $t_1, \dots, t_R$, with weights equal to Equation (15.4):

$$\hat{b} = \sum_{j=1}^{R} w_j\,\hat{\bar{z}}_{\mathrm{GLS},j}\,. \tag{15.17}$$

The variance of this estimator can be estimated by

$$\hat{V}(\hat{b}) = \mathbf{w}^{\mathrm{T}}\hat{\mathbf{C}}(\hat{\mathbf{z}}_{\mathrm{GLS}})\mathbf{w}\,, \tag{15.18}$$

with $\mathbf{w}$ the vector with the weights.
Brus and de Gruijter (2011) compared the space-time designs for estimating the temporal trend of the spatial means, under a first-order autoregressive time-series model of the spatial means. The SS design performed best when the correlation is strong, say > 0.8; remarkably, for weak positive correlation this design performed relatively poorly. Which design is best thus depends, among other things, on the strength of the correlation and on the number of sampling times. A safe choice is an SA design.
space-time universe (Chapter 7). The primary sampling units (PSUs) are the
spatial sections of that universe (horizontal lines in Figure 15.2), the secondary
sampling units (SSUs) are the sampling locations. The space-time mean can
therefore be estimated by Equation (7.2), and its variance by Equation (7.7).
For simple random sampling, both in space and in time, and a linear costs model $C = c_0 + c_1 n + c_2 nm$, Equations (7.9) and (7.10) can be used to optimise the number of sampling times (PSUs, $n$) and the number of sampling locations (SSUs, $m$) per time. The pooled within-unit variance, $S^2_{\mathrm{w}}$, is in this case the time-averaged spatial variance of the study variable at a given time, and the between-unit variance, $S^2_{\mathrm{b}}$, is the variance of the spatial means over time; see the sketch below.
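A sketch of this optimisation, reusing the formulas of the earlier code chunk; the variance components, costs, and target variance V are assumed values:

S2w <- 2.0; S2b <- 0.5  #assumed pooled within- and between-unit variances
c1 <- 10; c2 <- 1       #assumed access and observation costs
V <- 0.05               #assumed target variance of the mean estimator
nopt <- 1 / V * (sqrt(S2w * S2b) * sqrt(c2 / c1) + S2b)  #Equation (7.9)
mopt <- sqrt(S2w / S2b) * sqrt(c1 / c2)                  #Equation (7.10)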
Estimation of the space-time mean from an SS sample in which both locations
and times are selected by probability sampling is the same as for an IS sample.
However, estimation of the variance of the space-time mean estimator is more
complicated. For the variance of the estimator of the space-time mean with
an SS space-time design and simple random sampling of both locations and
times, see equation (15.8) in de Gruijter et al. (2006). Due to the two-fold
alignment of the sampling units, no unbiased estimator of the variance is
available. The variance estimator of two-stage cluster sampling can be used to
approximate the variance, but this variance estimator does not account for a
possible temporal correlation of the estimated spatial means, resulting in an
underestimation of the variance.
For an application of an IS design, to estimate the space-time mean of nutrients
(nitrogen and phosphorous) in surface waters, see Brus and Knotters (2008) and
Knotters and Brus (2010). In both applications, sampling times are selected
by stratified simple random sampling, with periods of two months as strata. In
Brus and Knotters (2008), sampling locations are selected by stratified simple
random sampling as well.
The spatial means at the four times can be estimated by the sample means
(Equation (3.2)), the variances of the mean estimators by Equation (3.14).
The covariances of the estimators of the means at two different times can
be estimated by Equation (15.14) with 𝑚 equal to the sample size 𝑛. The
estimated current mean (the spatial mean at the fourth survey) and the es-
timated standard error of the estimator of the current mean can be simply
extracted from the vector with estimated means and from the matrix with
estimated variances and covariances.
The change of the spatial mean from the first to the fourth survey can simply
be estimated by subtracting the estimated spatial mean of the first survey from
the estimated mean of the fourth survey. The standard error of this estimator
can be estimated by the sum of the estimated variances of the two spatial
mean estimators, minus two times the estimated covariance of the two mean
estimators, and finally taking the square root.
The same estimates are obtained by defining a weight vector with values -1 and
1 for the first and last element, respectively, and 0 for the other two elements.
w <- c(-1, 0, 0, 1)
d_mz_SS <- t(w) %*% mz
se_d_mz_SS <- sqrt(t(w) %*% C %*% w)
The temporal trend of the spatial means can be estimated much in the same
way, but using a different vector with weights (Equation (15.4)).
The estimated temporal trend equals 0.0403∘ C 𝑦−1 , and the estimated standard
error equals 0.0024794∘ C 𝑦−1 . Using t = 1:4 yields the estimated average change
in annual mean temperature per five years.
t <- 1:4
w <- (t - mean(t)) / sum((t - mean(t))^2)
print(mz_trend_SS <- t(w) %*% mz)
[,1]
[1,] 0.2016378
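The standard error of the trend estimator is obtained with the same weight
vector; note that this is the standard error of the trend per five years, which
divided by five gives the value per year quoted above.
se_mz_trend_SS <- sqrt(t(w) %*% C %*% w)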
Using a constant weight vector with values 1/4 yields the estimated space-time
mean and the standard error of the space-time mean estimator.
w <- rep(1 / 4, 4)
mz_st_SS <- t(w) %*% mz
se_mz_st_SS <- sqrt(t(w) %*% C %*% w)
The units of the finite population are selected by simple random sampling
with replacement. As a consequence, there can be partial overlap between
the samples at the different times, i.e., some units are observed at multiple
times. However, this overlap is by chance; it is not coordinated as in an SP
and RP design. By selecting the units with replacement, the estimators of
the spatial means are independent (covariance of estimators equals zero). For
infinite populations such as points in an area, there will be no overlap, so
that the covariance of the estimators equals zero.
The spatial means are estimated with the 𝜋 estimator as there is no partial
overlap of the spatial samples. All covariances of the mean estimators are
zero. Estimation of the space-time parameters is done as before with the SS
space-time sample. Four data frames of 𝑛 rows are first made with the data
observed in a specific panel. The variables of these four data frames are joined
into a single data frame. The spatial means at the four times are then estimated
by the sample means, computed with function apply.
For the SA design, two data frames of 𝑛 rows are first made with the data
observed in a specific panel. The variables of these two data frames are joined
into a single data frame. The spatial means at the four times are then estimated
by the sample means, computed with function apply.
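A minimal sketch of this step, using the name panel_ab that also appears in
the code below:
# spatial means at the four times: column means of the joined data frame
mz <- apply(panel_ab, MARGIN = 2, FUN = mean)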
To compute the matrix with estimated variances and covariances of the spa-
tial mean estimators, first the full matrix is computed. Then the estimated
covariances of spatial means of consecutive surveys are replaced by zeroes
as the samples of these consecutive surveys are selected independently from
each other, so that the two estimators are independent. Estimation of the
space-time parameters is done as before with the SS space-time sample.
C <- var(panel_ab) / n
odd <- c(1, 3)
C[row(C) %in% odd & !(col(C) %in% odd)] <- 0
C[!(row(C) %in% odd) & col(C) %in% odd] <- 0
#current mean
mz_cur_SA <- mz[4]
se_mz_cur_SA <- sqrt(C[4, 4])
#change of mean from time 1 to time 4
w <- c(-1, 0, 0, 1)
d_mz_SA <- t(w) %*% mz
se_d_mz_SA <- sqrt(t(w) %*% C %*% w)
#trend of mean
w <- (t - mean(t)) / sum((t - mean(t))^2)
mz_trend_SA <- t(w) %*% mz
se_mz_trend_SA <- sqrt(t(w) %*% C %*% w)
#space-time mean
mz_st_SA <- mean(mz)
w <- rep(1 / 4, 4)
se_mz_st_SA <- sqrt(t(w) %*% C %*% w)
With SP sampling and four sampling times we have five panels: one panel
with fixed sampling locations and four panels with swarming locations (Figure
15.1). Each panel consists of 𝑛/2 sampling locations, so in total 5𝑛/2 = 250
locations are selected. A variable indicating the panel is added to the data
frame with the selected sampling locations. Figure 15.6 shows the selected SP
sample.
With SP sampling, the spatial means are estimated by the design-based GLS
estimator (Equation (15.10)). As a first step, the eight elementary estimates
(two per sampling time) are computed and collected in a vector. Note the
order of the elementary estimates: first the estimated spatial means of 2004,
2009, 2014, and 2019, estimated from the panel with fixed sampling locations,
then the spatial means estimated from the panels with swarming locations.
The design matrix 𝐗 corresponding to this order of elementary estimates is
constructed.
Ordering the elementary estimates by sampling time is also fine, but then
the design matrix 𝐗 should be adapted to this order.
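A minimal sketch of this design matrix, under the stated order of the elementary
estimates (first the four estimates from the panel with fixed locations, then
the four from the panels with swarming locations):
# each block of four elementary estimates targets the same four spatial
# means, so X stacks two 4 x 4 identity matrices
X <- rbind(diag(4), diag(4))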
The variances and covariances of the elementary estimates of the panel with
fixed locations (SS subsample) are estimated, as well as the variances of the
elementary estimates of the panels with swarming locations (IS subsample).
The two variance-covariance matrices and a 4 × 4 submatrix with all zeroes
are combined into a single matrix, see matrix (15.11).
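The following is a minimal sketch of the design-based GLS estimator of
Equation (15.10); the names mz_elem (the vector with the eight elementary
estimates) and C8 (the assembled 8 x 8 variance-covariance matrix) are
hypothetical, whereas X and XCXinv match the code below.
C8inv <- solve(C8)
# variance-covariance matrix of the GLS estimators of the spatial means
XCXinv <- solve(t(X) %*% C8inv %*% X)
# design-based GLS estimates of the four spatial means
mz_GLS <- XCXinv %*% t(X) %*% C8inv %*% mz_elem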
#current mean
mz_cur_SP <- mz_GLS[4]
se_mz_cur_SP <- sqrt(XCXinv[4, 4])
#change of mean
w <- c(-1, 0, 0, 1)
d_mz_SP <- t(w) %*% mz_GLS
se_d_mz_SP <- sqrt(t(w) %*% XCXinv %*% w)
#trend of mean
w <- (t - mean(t)) / sum((t - mean(t))ˆ2)
mz_trend_SP <- t(w) %*% mz_GLS
se_mz_trend_SP <- sqrt(t(w) %*% XCXinv %*% w)
#space-time mean
w <- rep(1 / 4, 4)
mz_st_SP <- t(w) %*% mz_GLS
se_mz_st_SP <- sqrt(t(w) %*% XCXinv %*% w)
Similar to the SP design, with an in-for-two RP design and four sampling times
we have five panels, each consisting of 𝑛/2 sampling locations, so that in total
5𝑛/2 = 250 locations are selected. Figure 15.7 shows the selected space-time
sample.
Two elementary estimates per sampling time are computed and collected in a
vector. Note that now the elementary estimates are ordered by sampling time.
The design matrix corresponding to this order is constructed.
The design-based GLS estimates of the spatial means for the four sampling
times are computed as before, followed by computing the estimated space-time
parameters.
The random sampling with the five space-time designs and estimation of the
four space-time parameters is repeated 10,000 times. The standard deviations
of the 10,000 estimates of a space-time parameter are shown in Table 15.1.
Note that to determine the standard errors of the estimators of the space-
time parameters for the SS, IS, and SA designs, a sampling experiment is
not really needed. These can be computed without error because we have
exhaustive knowledge of the study variable at all sampling times. However, for
the SP and RP designs the variances of the GLS estimators can only be
approximated, so for these designs the sampling experiment remains useful.
The spatial patterns of the annual mean temperature are strongly correlated
over time. With the SS, IS, and SA designs, the current mean is estimated from
a simple random sample of size 𝑛 at the last sampling time, which explains that
the standard errors of the current mean estimator are equal for these three
space-time designs.
The strong correlation also explains that the estimated change of the spatial
mean from 2004 to 2019 with the SS design is much more precise than with
the IS and SA designs. With the IS and SA designs, the sample of 2019 is
selected independently from the sample of 2004, so that the covariance of
the two spatial mean estimators is zero. With the SS design this covariance
is subtracted two times from the sum of the variances of the spatial mean
estimators. The standard error of the estimator of the change of the spatial
mean from 2009 to 2019 with the SA design (not shown in Table 15.1) is
much smaller than that of the change from 2004 to 2019, because the spatial
means at these two times are estimated from the same sample, so that we
profit from the strong positive correlation. The standard error of the change
estimator with the SP design is slightly larger than the standard error with
the SS design, because with this design there is only partial overlap, so that
we profit less from the correlation. The standard error with the RP design
is larger than that of the SP design, because with the RP design there is no
overlap of the samples of 2004 and 2019 (Figure 15.1). Despite the absence of
overlap, the standard error is still considerably smaller than those with the IS
and SA designs because the spatial means of 2004 and 2019 are estimated by
the GLS estimator that uses the data of all years, so that we still profit from
the correlation.
Estimation of the temporal trend of the spatial mean is most precise with the
SS design, closely followed by the SP design, and least precise with the IS
design. This is in agreement with the results of Brus and de Gruijter (2011)
and Brus and de Gruijter (2013). By contrast, estimation of the space-time
mean is most precise with the IS design and least precise with the SS design.
With strong persistence of the spatial patterns, as in our case, it is not efficient
to observe the same sampling locations at all times when interest is in the
space-time mean. In our case with very strong correlation, the larger the total
number of sampling locations over all sampling times, the smaller the standard
error of the space-time mean estimator. The total number of sampling locations
in this case study are 𝑛 with SS, 4𝑛 with IS, 2𝑛 with SA, and 5𝑛/2 with SP
and RP.
In the case study of the previous section, stratified simple random sampling
using climate zones is most likely more efficient than simple random sampling.
To select a space-time sample with stratified simple random sampling as a
spatial design, the selection procedures described above for the five basic types
of space-time design are applied at the level of the strata. Estimation of the
space-time parameters goes along the same lines, using the 𝜋 estimator of the
spatial mean and the estimator of the standard error presented in Chapter 4.
With the SP and RP designs, the covariance of the elementary estimators of
the spatial means at two sampling times using the data of the same panel, can
be estimated by
\[
\hat{C}_{jk} = \sum_{h=1}^{H} w_h^2 \, \frac{\hat{S}^2_{jkh}}{m_h} , \qquad (15.19)
\]
with $\hat{S}^2_{jkh}$ the estimated covariance of the study variable at times $t_j$ and
$t_k$ in stratum $h$, $w_h$ the relative size of stratum $h$, and $m_h$ the number of
sampling locations in the panel in stratum $h$.
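As a small illustration, Equation (15.19) translates directly into R; all names
are hypothetical, with one element per stratum.
# hypothetical inputs: stratum weights w_h, estimated within-stratum
# covariances S2_jkh, and panel sizes m_h
C_jk <- sum(w_h^2 * S2_jkh / m_h)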
Interesting new developments are presented by Wang and Zhu (2019) and Zhao
and Grafström (2020).
Exercises
1. Compute for the annual mean temperature data of Iberia the true
standard error of the estimator of (i) the spatial mean in 2019; (ii)
the change of the spatial mean from 2004 to 2019; (iii) the temporal
trend of the spatial mean (average change per five years in the period
from 2004 to 2019); and (iv) the space-time mean, for all five space-
time designs and simple random sampling with replacement of 100
units per time. Use for the designs SP and RP the true covariances
of the elementary estimates in the GLS estimators of the spatial
means.
• Compare the standard errors with the standard deviations in
Table 15.1. Explain why for the designs SP and RP the true
standard errors of the estimators of all space-time parameters
are slightly smaller than the standard deviations in Table 15.1.
2. The change of the spatial mean can also be estimated as the difference
of the two 𝜋 estimators. For simple random sampling of 𝑛 units per
time with partial overlap of the two samples, the variance of this
estimator of the change of the mean equals
\[
V(\hat{\bar{d}}_{jk}) = \frac{S^2_j}{n} + \frac{S^2_k}{n} - 2\,\frac{m\,S^2_{jk}}{n^2} , \qquad (15.20)
\]
with $S^2_j$ and $S^2_k$ the spatial variance at time $t_j$ and $t_k$, respectively,
$S^2_{jk}$ the spatial covariance of the study variable at times $t_j$ and
$t_k$, $n$ the size of the simple random samples, and $m$ the number of
units observed at both times. Compute for Iberia the standard error
of the estimator of the change of the mean with Equation (15.20),
using the true spatial variances and covariances for 2004 and 2019,
for simple random samples of size 100 (𝑛 = 100) and a matching
proportion of 0.5 (𝑚 = 50). Compare with the true standard error
of the estimator of the change of the mean using the GLS estimators
computed in the previous exercise. Explain that the standard error
of the 𝜋 estimators is larger than the standard error with the GLS
estimators.
Part II
16
Introduction to sampling for mapping
Consider the simple linear regression model
\[
Z_k = \beta_0 + \beta_1 x_k + \epsilon_k ,
\]
with $Z_k$ the study variable of unit $k$, $\beta_0$ and $\beta_1$ regression coefficients, $x_k$ a
covariate for unit $k$ used as a predictor, and $\epsilon_k$ the error (residual) at unit $k$,
normally distributed with mean zero and a constant variance $\sigma^2$. The errors
are assumed independent, so that $\mathrm{Cov}(\epsilon_k, \epsilon_j) = 0$ for all $k \neq j$. Figure 16.1
shows a simple random sample without replacement and the sample optimised
for mapping with a simple linear regression model. Both samples are plotted
on a map of the covariate 𝑥.
The optimal sample for mapping with a simple linear regression model contains
the units with the smallest and the largest values of the covariate 𝑥. The
optimal sample shows strong spatial clustering. Spatial clustering is not avoided
because in a simple linear regression model we assume that the residuals are
not spatially correlated. In Chapter 23 I will show that when the residuals
are spatially correlated, spatial clustering of sampling units is avoided. The
standard errors of both regression coefficients are considerably smaller for the
FIGURE 16.1: Simple random sample and optimal sample for mapping with
a simple linear regression model, plotted on a map of the covariate.
optimal sample (Table 16.1). The joint uncertainty about the two regression
coefficients, quantified by the determinant of the variance-covariance matrix
of the regression coefficient estimators, is also much smaller for the optimal
sample. When we are less uncertain about the regression coefficients, we are also
less uncertain about the regression model predictions of the study variable 𝑧
at points where we have observations of the covariate 𝑥 only. We can conclude
that for mapping with a simple linear regression model, in this example simple
random sampling is not a good option.
Of course, this simple example would only be applicable if we have evidence of
a linear relation between study variable 𝑧 and covariate 𝑥, and in addition if
we are willing to rely on the assumption that the residuals are not spatially
correlated.
16.2 Sampling for simultaneously mapping and estimating means
Biome Forest_moist is by far the largest biome with a sample size of 459
points.
In the next code chunk, a balanced sample is selected with equal inclusion
probabilities, using both the categorical variable biome and the continuous
variable lnSWIR2 as balancing variables (Subsection 9.1.3). The geographical
coordinates are used as spreading variables.
library(BalancedSampling)
grdAmazonia$lnSWIR2 <- log(grdAmazonia$SWIR2)
pi <- n_h / N_h
stratalabels <- levels(grdAmazonia$Biome)
lut <- data.frame(Biome = stratalabels, pi = as.numeric(pi))
grdAmazonia <- merge(x = grdAmazonia, y = lut)
Xbal <- model.matrix(~ Biome - 1, data = grdAmazonia) %>%
cbind(grdAmazonia$lnSWIR2)
Xspread <- cbind(grdAmazonia$x1, grdAmazonia$x2)
set.seed(314)
units <- lcube(Xbal = Xbal, Xspread = Xspread, prob = grdAmazonia$pi)
mysample <- grdAmazonia[units, ]
FIGURE 16.2: Balanced sample of size 500 from Eastern Amazonia, bal-
anced on biome and lnSWIR2, with geographical spreading. Equal inclusion
probabilities are used.
I think this is a suitable sample, both for mapping AGB across the entire
study area, for instance by kriging with an external drift (Section 21.3), and
for estimating the mean AGB of the four biomes. For biome Forest_moist,
the population mean can be estimated from the data of this biome only, using
the 𝜋 estimator, as the sample size of this biome is very large (Section 9.3).
For the other three biomes, we may prefer model-assisted estimation for small
domains as described in Section 14.2.
In this example I used one quantitative covariate, lnSWIR2, for balancing
the sample. If we have a legacy sample that can be used to fit a linear or
non-linear model, for instance a random forest using multiple covariates and
factors as predictors (Chapter 10), then this model can be used to predict the
study variable for all population units, so that we can use the predictions of
the study variable to balance the sample, see Section 10.3.
16.3 Broad overview of sampling designs for mapping
Square and triangular grids are examples of geometric sampling designs; the
sampling units show a regular, geometric spatial pattern. In other geometric
sampling designs the spatial pattern is not perfectly regular. Yet these are
classified as geometric sampling designs when the samples are obtained by min-
imising some geometric criterion, i.e., a criterion defined in terms of distances
between the sampling units and the nodes of a fine prediction grid discretising
the study area (Section 17.2 and Chapter 18).
In model-based sampling designs, the samples are obtained by minimising a
criterion that is defined in terms of variances of prediction errors. An example
is the mean kriging variance criterion, i.e., the average of the kriging variances
over all nodes of the prediction grid. Model-based sampling therefore requires
prior knowledge of the model of spatial variation. Such a model must be
specified and justified. Once this model is given, the sample can be optimised.
In Chapter 22 I will show how a spatial model can be used to optimise the
spacing of a square grid given a requirement on the accuracy of the map. The
grid spacing determines the number of sampling units, so this optimisation
boils down to determining the required sample size. In Chapter 23 I will show
how a sample of a given size can be further optimised through optimisation of
the spatial coordinates of the sampling units.
In Chapter 1 the design-based and model-based approaches for sampling and
statistical inference were introduced. Note that a model-based approach does
not necessarily imply model-based sampling. The adjective ‘model-based’ refers
to the model-based inference, not to the selection of the units. In a model-based
approach sampling units can be, but need not be, selected by model-based
sampling. If they are, then both in selecting the units and in mapping a
statistical model is used. In most cases, the two models differ: once the sample
data are collected, these are used to update the postulated model used for
designing the sample. The updated model is then used in mapping.
Besides geometric and model-based sampling designs for a spatial survey, a
third category can be distinguished: sampling designs that are adaptations
of experimental designs. An adaptation is necessary because in contrast to
experiments, in observational studies one is not free to choose combinations
of levels of different factors. For instance, when two covariates are strongly
positively correlated, it may happen that there are no units with a relatively
large value for one covariate and a relatively small value for the other covariate.
In a full factorial design, all combinations of factor levels are observed. For
instance, suppose we have only two factors, e.g., application rates for N
and P in an agricultural experiment, and four levels for each factor. To
account for possible non-linear effects, a good option is to have multiple plots
for all 4 × 4 combinations. With $k$ factors and $l$ levels per factor, the total
number of observations is $l^k$. With numerous factors and/or numerous levels
per factor, this becomes unfeasible in practice. Alternative designs have been
developed that need fewer observations
but still provide detailed information about how the study variable responds
to changes in the factor levels. Examples are Latin hypercube samples and
response surface designs. The survey sampling analogues of these experimental
designs are described in Chapters 19 and 20.
17
Regular grid and spatial coverage sampling
This chapter describes and illustrates two sampling designs by which the
sampling locations are evenly spread throughout the study area: regular grid
sampling and spatial coverage sampling. In a final section, the spatial coverage
sampling design is used to fill in the empty spaces of an existing sample.
library(sp)
gridded(grdVoorst) <- ~ s1 + s2
mysample <- spsample(
x = grdVoorst, type = "regular", cellsize = c(200, 200),
offset = c(0.5, 0.5)) %>% as("data.frame")
FIGURE 17.1: Non-random square grid sample with a grid spacing of 200
m from Voorst.
The number of grid points in this example equals 115. Nodes of the square grid
in parts of the area not belonging to the population of interest, such as built-up
areas and roads, are discarded by spsample (these nodes are not included in the
sampling frame file grdVoorst). As a consequence, there are some undersampled
areas, for instance in the middle of the study area where two roads cross. If
we use the square grid in spatial interpolation, e.g., by ordinary kriging, we
are more uncertain about the predictions in these undersampled areas than in
areas where the grid is complete. The next section will show how this local
undersampling can be avoided.
Exercises
• Count for each selected grid the number of selected grid points,
and save this in a numeric. Compute summary statistics of the
sample size, and plot a histogram.
• Select a square grid of exactly 100 points.
17.2 Spatial coverage sampling
In spatial coverage sampling, the sampling locations are selected such that the
mean squared shortest distance (MSSD) of the nodes of a fine discretisation
grid to their nearest sampling point is minimised:
\[
MSSD = \frac{1}{N} \sum_{k=1}^{N} \min_{j} \left( D^2_{kj} \right) , \qquad (17.1)
\]
where $N$ is the total number of nodes of the discretisation grid and $D_{kj}$ is the
distance between the $k$th grid node and the $j$th sampling point. This distance
measure can be minimised by the k-means algorithm, which is a numerical,
iterative procedure. Figure 17.2 illustrates the selection of a spatial coverage
sample of four points from a square. In this simple example the optimal spatial
coverage sample is known, being the centres of the four subsquares of equal
size. A simple random sample of four points serves as the initial solution. Each
raster cell is then assigned to the closest sampling point. This is the initial
clustering. In the next iteration, the centres of the initial clusters are computed.
Next, the raster cells are reassigned to the closest new centres. This continues
until there is no change anymore. In this case only nine iterations are needed,
where an iteration consists of computing the clusters by assigning the raster
cells to the nearest centre (sampling unit), followed by computing the centres
of these clusters. Figure 17.2 shows the first, second, and ninth iterations.
The same algorithm was used in Chapter 4 to construct compact geographical
strata (briefly referred to as geostrata) for stratified random sampling. The
clusters serve as strata. In stratified random sampling, one or more sampling
units are selected randomly from each geostratum. However, for mapping
purposes probability sampling is not required, so the random selection of a unit
FIGURE 17.2: First, second, and ninth iterations of the k-means algorithm
to select a spatial coverage sample of four points from a square. Iterations
are in rows from top to bottom. In the left column of subfigures, the clusters
are computed by assigning the raster cells to the nearest centre. In the right
column of subfigures, the centres of the clusters are computed.
within each stratum is not needed. With random selection, the spatial coverage
is suboptimal. Here, the centres of the final clusters (geostrata) are used as
sampling points. This improves the spatial coverage compared to stratified
random sampling.
In probability sampling, we may want to have strata of equal area (clusters of
equal size) so that the sampling design becomes self-weighting. For mapping,
equally sized clusters are not recommended, as this may lead to samples with
suboptimal spatial coverage.
In Figure 17.2 the clusters are of equal size, but this is an artefact. Equally
sized clusters are not guaranteed by the illustrated k-means algorithm. Clus-
tering the raster cells of a square into four clusters is a very special case. In
other cases, the clusters computed with the k-means algorithm described
above might have unequal sizes. In package spcosa, a different k-means
algorithm is also implemented, one that uses swaps of grid cells between
clusters to enforce compact clusters of equal size.
library(spcosa)
n <- 115
set.seed(314)
gridded(grdVoorst) <- ~ s1 + s2
mystrata <- spcosa::stratify(
grdVoorst, nStrata = n, equalArea = FALSE, nTry = 10)
mysample <- spsample(mystrata) %>% as("data.frame")
If the clusters need not be of equal size, we may also use function kmeans of
the stats package, using the spatial coordinates as clustering variables. This
requires less computing time, especially with large data sets.
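A minimal sketch of this alternative, assuming grdVoorst is a data frame with
coordinates s1 and s2; the result mysample_kmeans is used in the code below.
set.seed(314)
myclusters <- kmeans(
  grdVoorst[, c("s1", "s2")], centers = 115, iter.max = 10000, nstart = 10)
# the cluster centres serve as sampling points
mysample_kmeans <- as.data.frame(myclusters$centers)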
When function kmeans is used to compute the spatial coverage sample, there
is no guarantee that the computed centres of the clusters, used as sampling
points, are inside the study area. In Figure 17.4 there are eight such centres.
This problem can easily be solved by selecting points inside the study area
closest to the centres that are outside the study area. Function rdist of package
fields is used to compute a matrix with distances between the centres outside
the study area and the nodes of the discretisation grid. Then function apply is
used with argument FUN = which.min to compute the discretisation nodes closest
to the centres outside the study area. A similar procedure is implemented
in function spsample of package spcosa when the centres of the clusters are
selected as sampling points (so, when argument n of function spsample is not
used).
library(fields)
gridded(grdVoorst) <- ~ s1 + s2
coordinates(mysample_kmeans) <- ~ s1 + s2
res <- over(mysample_kmeans, grdVoorst)
inside <- as.factor(!is.na(res$z))
units_out <- which(inside == FALSE)
grdVoorst <- as_tibble(grdVoorst)
mysample_kmeans <- as_tibble(mysample_kmeans)
D <- fields::rdist(x1 = mysample_kmeans[units_out, ],
x2 = grdVoorst[, c("s1", "s2")])
units_close <- apply(D, MARGIN = 1, FUN = which.min)
mysample_kmeans[units_out, ] <- grdVoorst[units_close, c("s1", "s2")]
Exercises
5. Consider the case of six strata. The strata are not of equal size. If the
soil samples are bulked into a composite sample, the measurement
on this single sample is a biased estimator of the plot mean. How
can this bias be avoided?
17.3 Spatial infill sampling
The legacy data are not ideal for mapping the SOM concentration
throughout West-Amhara. Clearly, it is desirable to collect additional data in
the off-road parts of the study area, with the exception of the northeastern part
where we have already quite a few data that are not near the main roads. The
legacy data are passed to function stratify of package spcosa with argument
priorPoints. The object assigned to this argument must be of class SpatialPoints
or SpatialPointsDataFrame. This optional argument fixes these points as cluster
centres. A spatial infill sample of 100 points is selected, taking into account
these fixed points.
gridded(grdAmhara) <- ~ s1 + s2
n <- 100
ntot <- n + nrow(sampleAmhara)
coordinates(sampleAmhara) <- ~ s1 + s2
proj4string(sampleAmhara) <- NA_character_
set.seed(314)
mystrata <- spcosa::stratify(grdAmhara, nStrata = ntot,
priorPoints = sampleAmhara, nTry = 10)
In the output object of spsample, both the prior and the new sampling points
are included. The new points can be obtained as follows:
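A sketch of one way to do this, assuming the coordinates of the prior points in
the output match those of the legacy sample exactly (this is an illustration,
not necessarily the code used in the book):
mysample_all <- as(spsample(mystrata), "data.frame")
legacy <- as.data.frame(sampleAmhara)
# keep the sampling points that do not coincide with a legacy point
isnew <- !(paste(mysample_all$s1, mysample_all$s2) %in%
             paste(legacy$s1, legacy$s2))
mysample_new <- mysample_all[isnew, ]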
Exercises
18
Covariate space coverage sampling
Regular grid sampling and spatial coverage sampling are pure spatial sampling
designs. Covariates possibly related to the study variable are not accounted for
in selecting sampling units. This can be suboptimal when the study variable is
related to covariates of which maps are available, think for instance of remote
sensing imagery or digital elevation models related to soil properties. Maps of
these covariates can be used in mapping the study variable by, for instance, a
multiple linear regression model or a random forest. This chapter describes
a simple, straightforward method for selecting sampling units on the basis of
the covariate values of the raster cells.
The simplest option for covariate space coverage (CSC) sampling is to cluster
the raster cells by the k-means clustering algorithm in covariate space. Similar
to spatial coverage sampling (Section 17.2) the mean squared shortest distance
(MSSD) is minimised, but now the distance is not measured in geographical
space but in a 𝑝-dimensional space spanned by the 𝑝 covariates. Think of this
space as a multidimensional scatter plot with the covariates along the axes.
The covariates are centred and scaled so that their means become zero and
standard deviations become one. This is needed because, contrary to the spatial
coordinates used as clustering variables in spatial coverage sampling, the ranges
of the covariates in the population can differ greatly. In the clustering of the
raster cells, the mean squared shortest scaled distance (MSSSD) is minimised.
The name ‘scaled distance’ can be confusing. The distances are not scaled, but
rather they are computed in a space spanned by the scaled covariates.
In the next code chunk, a CSC sample of 20 units is selected from Eastern
Amazonia. All five quantitative covariates, SWIR2, Terra_PP, Prec_dm,
Elevation, and Clay, are used as covariates. To select 20 units, 20 clusters are
constructed using function kmeans of the stats package (R Core Team, 2021).
The number of clusters is passed to function kmeans with argument centers.
Note that the number of clusters is not based, as would be usual in cluster
analysis, on the assumed number of subregions with a high density of units
in the multivariate distribution, but rather on the number of sampling units.
The k-means clustering algorithm is a deterministic algorithm, i.e., the final
optimised clustering is fully determined by the initial clustering. This final
clustering can be suboptimal, i.e., the minimised MSSSD value is somewhat
larger than the global minimum. Therefore, the clustering should be repeated
many times, every time starting with a different random initial clustering.
The number of repeats is specified with argument nstart. The best solution is
automatically kept. To speed up the computations, a 5 km × 5 km subgrid of
grdAmazonia is used.
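The clustering step itself is not shown above; a minimal sketch (the value of
nstart is an assumption):
covs <- c("SWIR2", "Terra_PP", "Prec_dm", "Elevation", "Clay")
n <- 20
set.seed(314)
myclusters <- kmeans(
  scale(grdAmazonia[, covs]), centers = n, iter.max = 10000, nstart = 100)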
Raster cells with the shortest scaled Euclidean distance in covariate space
to the centres of the clusters are selected as the sampling units. To this end,
first a matrix with the distances of all the raster cells to the cluster centres
is computed with function rdist of package fields (Nychka et al., 2021). The
raster cells closest to the centres are computed with function apply, using
argument FUN = which.min.
library(fields)
covs_s <- scale(grdAmazonia[, covs])
D <- rdist(x1 = myclusters$centers, x2 = covs_s)
units <- apply(D, MARGIN = 1, FUN = which.min)
myCSCsample <- grdAmazonia[units, ]
Figure 18.1 shows the clustering of the raster cells and the raster cells closest
in covariate space to the centres that are used as the selected sample. In Figure
18.2 the selected sample is plotted in biplots of some pairs of covariates. In the
biplots, some sampling units are clearly clustered. However, this is misleading,
as actually we must look in five-dimensional space to see whether the units
are clustered. Two units with a large separation distance in a five-dimensional
space can look quite close when projected on a two-dimensional plane.
The next code chunk shows how the MSSSD of the selected sample can be
computed.
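A sketch of this computation, using the population means and standard
deviations for scaling (see the note that follows):
mns <- colMeans(grdAmazonia[, covs])
sds <- apply(grdAmazonia[, covs], MARGIN = 2, FUN = sd)
covs_s <- scale(grdAmazonia[, covs], center = mns, scale = sds)
csc_s <- scale(myCSCsample[, covs], center = mns, scale = sds)
# shortest scaled distance of every raster cell to the CSC sample
D <- rdist(x1 = csc_s, x2 = covs_s)
dmin <- apply(D, MARGIN = 2, FUN = min)
MSSSD <- mean(dmin^2)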
Note that to centre and scale the covariate values in the CSC sample, the
population means and the population standard deviations are used, as passed
to function scale with arguments center and scale. If these means and standard
deviations are unspecified, the sample means and the sample standard
deviations are used instead.
The final clustering obtained with k-means depends on the initial clustering,
which is random. With the k-means++ algorithm, the initial centres are chosen
as follows:
1. Select one raster cell at random; the covariate values of this cell
serve as the first centre.
2. For each raster cell, compute the squared distance $d^2_{kc}$ in covariate
space to the closest centre selected so far.
3. Choose one new raster cell at random as a new sampling unit with
probabilities proportional to $d^2_{kc}$ and add the selected raster cell to
the set of selected cells.
4. Repeat steps 2 and 3 until the required number of centres is selected.
5. Now that the initial centres have been selected, proceed using stan-
dard k-means.
library(LICORS)
myclusters <- kmeanspp(
scale(grdAmazonia[, covs]), k = n, iter.max = 10000, nstart = 30)
Due to the improved initial centres, the risk of ending in a local minimum is
reduced. The k-means++ algorithm is of interest for small sample sizes. For
large sample sizes, the extra time needed for computing the initial centres can
become substantial and may not outweigh the larger number of starts that can
be afforded with the usual k-means algorithm for the same computing time.
The function is used to select an infill sample of 15 units from Eastern Amazonia.
A legacy sample of five units is randomly selected.
set.seed(314)
units <- sample(nrow(grdAmazonia), 5)
fixed <- data.frame(units, scale(grdAmazonia[, covs])[units, ])
Figures 18.3 and 18.4 show the selected sample plotted on a map of the clusters
and in biplots of covariates, respectively.
FIGURE 18.3: Covariate space infill sample of 15 units from Eastern Ama-
zonia, obtained with k-means clustering and five fixed cluster centres, plotted
on a map of the clusters. The dots represent the fixed centres (legacy sample),
the triangles the infill sample.
FIGURE 18.4: Covariate space infill sample of Figure 18.3 plotted in biplots
of covariates, coloured by cluster. The dots represent the fixed centres (legacy
sample), the triangles the infill sample.
FIGURE 18.5: Scatter plot of the minimisation criterion MSSSD and the root
mean squared error (RMSE) of RF predictions of AGB in Eastern Amazonia
for covariate space coverage (CSC) sampling and simple random (SI) sampling,
and three sample sizes.
Exercises
19
Conditioned Latin hypercube sampling
This chapter and Chapter 20 on response surface sampling are about exper-
imental designs that have been adapted for spatial surveys. Adaptation is
necessary because, in contrast to experiments, in observational studies one is
not free to choose any possible combination of levels of different factors. When
two covariates are strongly positively correlated, it may happen that there
are no population units with a relatively large value for one covariate and a
relatively small value for the other covariate. By contrast, in experimental
research it is possible to select any combination of factor levels.
In a full factorial design, all combinations of factor levels are observed. With $k$
factors and $l$ levels per factor, the total number of observations is $l^k$. With
numerous factors and/or numerous levels per factor, observing $l^k$ experimental
units becomes unfeasible in practice. Alternative experimental designs have
been developed that need fewer observations but still provide detailed infor-
mation about how the study variable responds to changes in the factor levels.
This chapter will describe and illustrate the survey sampling analogue of Latin
hypercube sampling. Response surface sampling follows in the next chapter.
Latin hypercube sampling is used in designing industrial processes, agricultural
experiments, and computer experiments, with numerous covariates and/or
factors of which we want to study the effect on the output (McKay et al.,
1979). A much cheaper alternative to a full factorial design is an experiment
with, for all covariates, exactly one observation per level. So, in the agricultural
experiment described in Chapter 16 with the application rates of N and P as
factors and four levels for each factor, this would entail four observations only,
distributed in a square in such a way that there is exactly one observation in
every row and every column, see Figure 19.1. This is referred to as a Latin
square. The
generalisation of a Latin square to a higher number of dimensions is a Latin
hypercube (LH).
Minasny and McBratney (2006) adapted LH sampling for observational studies;
this adaptation is referred to as conditioned Latin hypercube (cLH) sampling.
For each covariate, a series of intervals (marginal strata) is defined. The number
of marginal strata per covariate is equal to the sample size, so that the total
number of marginal strata equals $pn$, with $p$ the number of covariates and $n$
the sample size. The bounds of the marginal strata are chosen such that the
numbers of raster cells in these marginal strata are equal. This is achieved by
using the quantiles of the covariate values at evenly spaced cumulative
probabilities as the stratum bounds.
FIGURE 19.1: Latin square for agricultural experiment with four application
rates of N and P.
A cLH sample is selected by minimising a criterion that is the sum of three
components:
1. O1: the sum over all marginal strata of the absolute deviations of the
marginal stratum sample size from the targeted sample size (equal
to 1);
2. O2: the sum over all classes of categorical covariates of the abso-
lute deviations of the sample proportion of a given class from the
population proportion of that class; and
3. O3: the sum over all entries of the correlation matrix of the absolute
deviation of the correlation in the sample from the correlation in
the population.
With cLH sampling, the marginal distributions of the covariates in the sample
are close to those distributions in the population. This can be advantageous for
mapping methods that do not rely on linear relations, for instance machine
learning techniques like classification and regression trees (CART) and random
forests (RF). In addition, criterion O3 ensures that the correlations between
predictors in the sample resemble those in the population.
cLH samples can be selected with function clhs of package clhs (Roudier,
2021). With this package, the criterion is minimised by simulated annealing,
see Section 23.1 for an explanation of this optimisation method. Arguments
iter, temp, tdecrease, and length.cycle of function clhs are control parameters of
the simulated annealing algorithm. In the next code chunk, I use default values
for these arguments. With argument weights, the weights of the components of
the minimisation criterion can be set. The default weights are equal to 1.
Argument cost is for cost-constrained cLH sampling (Roudier et al., 2012), and
argument eta can be used to control the sampling intensities of the marginal
strata (Minasny and McBratney, 2010). This argument is of interest if we
would like to oversample the marginal strata near the edge of the multivariate
distribution.
cLH sampling is illustrated with the five covariates of Eastern Amazonia that
were used before in covariate space coverage sampling (Chapter 18).
library(clhs)
covs <- c("SWIR2", "Terra_PP", "Prec_dm", "Elevation", "Clay")
set.seed(314)
res <- clhs(
grdAmazonia[, covs], size = 20, iter = 50000, temp = 1, tdecrease = 0.95,
length.cycle = 10, progress = FALSE, simple = FALSE)
mysample_CLH <- grdAmazonia[res$index_samples, ]
Figure 19.2 shows the selected sample in a map of SWIR2. In Figure 19.3
the sample is plotted in a biplot of Prec_dm against SWIR2. Each black
dot in the biplot represents one grid cell in the population. The vertical and
horizontal lines in the biplot are at the bounds of the marginal strata of SWIR2
and Prec_dm, respectively. The number of grid cells between two consecutive
vertical lines is constant, as well as the number of grid cells between two
consecutive horizontal lines, i.e., the marginal strata have equal sizes. The
intervals are the narrowest where the density of grid cells in the plot is highest.
Ideally, in each column and row, there is exactly one sampling unit (red dot).
Figure 19.4 shows the sample sizes for all 100 marginal strata. The next code
chunk shows how the marginal stratum sample sizes are computed.
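A sketch of this computation, assuming the marginal stratum bounds are the
sample quantiles at evenly spaced cumulative probabilities, as described above:
n <- 20
counts <- sapply(covs, FUN = function(v) {
  # bounds of the marginal strata
  bounds <- quantile(grdAmazonia[[v]], probs = seq(0, 1, length.out = n + 1))
  bounds[1] <- -Inf; bounds[n + 1] <- Inf
  # number of sampling units in each of the n marginal strata
  as.numeric(table(cut(mysample_CLH[[v]], breaks = bounds)))
})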
For all marginal strata with one sampling unit, the contribution to component
O1 of the minimisation criterion is 0. For marginal strata with zero or two
sampling units, the contribution is 1, for marginal strata with three sampling
units the contribution equals 2, etc. In Figure 19.4 there are four marginal
strata with zero units and four marginal strata with two units. Component O1
therefore equals 8 in this case.
Figure 19.5 shows the trace of the objective function, i.e., the values of the
minimisation criterion during the optimisation. The trace plot indicates that
50,000 iterations are sufficient. I do not expect that the criterion can be
reduced anymore. The final value of the minimisation criterion is extracted
with function tail using argument n = 1.
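Assuming the trace of the objective function is stored in element obj of the
output object:
tail(res$obj, n = 1)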
[1] 9.51994
In the next code chunk, the minimised value of the criterion is computed “by
hand”.
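A sketch of this computation, reusing the marginal stratum counts computed
above (component O2 is zero here, as no categorical covariates are used):
# O1: absolute deviations from the target of one unit per marginal stratum
O1 <- sum(abs(counts - 1))
# O3: absolute differences between population and sample correlations
rpop <- cor(grdAmazonia[, covs])
rsam <- cor(mysample_CLH[, covs])
O3 <- sum(abs(rpop - rsam))
print(O1 + O3)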
[1] 9.51994
Exercises
FIGURE 19.4: Sample sizes of marginal strata for the conditioned Latin
hypercube sample of size 20 from Eastern Amazonia.
set.seed(314)
units <- sample(nrow(grdAmazonia), 10, replace = FALSE)
res <- clhs(grdAmazonia[, covs], size = 30, must.include = units,
tdecrease = 0.95, iter = 50000, progress = FALSE, simple = FALSE)
mysample_CLHI <- grdAmazonia[res$index_samples, ]
mysample_CLHI$free <- as.factor(rep(c(1, 0), c(20, 10)))
Figure 19.6 shows the selected cLHI sample in a map of SWIR2. In Figure 19.7
the sample is plotted in a biplot of SWIR2 against Prec_dm. The marginal
strata already covered by the legacy sample are mostly avoided by the additional
sample.
The predictions are used to estimate three map quality indices: the population
mean error (ME), the population root mean squared error (RMSE), and the
population Nash-Sutcliffe model efficiency coefficient (MEC), see Chapter 25.
Figure 19.8 shows the results as boxplots, each based on 500 estimates. For
𝑛 = 25 and 100, cLH sampling performs best in terms of RMSE and MEC,
whereas for 𝑛 = 50 CSC sampling performs best. For 𝑛 = 25 and 50, the
boxplots of cLH and SI show quite a few outliers with large values of RMSE,
resulting in small values of MEC. For CSC, these map quality indices are more
stable. Remarkably, for 𝑛 = 100 SI sampling performs about as well as CSC
and cLH sampling.
In Figure 19.9 the RMSE is plotted against the minimised criterion (O1 + O3)
for the cLH and the SI samples. For all three sample sizes, there is a weak
positive correlation of the minimisation criterion and the RMSE: for 𝑛 = 25,
50, and 100 this correlation is 0.369, 0.290, and 0.140, respectively. On average,
cLH performs slightly better than SI for 𝑛 = 25 (Table 19.1). The gain in
accuracy decreases with the sample size. For 𝑛 = 100, the two designs perform
about equally. Especially for 𝑛 = 25 and 50, the distribution of RMSE with SI
has a long right tail. For these small sample sizes, the risk of selecting an SI
sample leading to a poor map with large RMSE is much larger than with cLH
sampling.
FIGURE 19.9: Scatter plot of the minimisation criterion (O1 + O3) and the
RMSE of RF predictions of AGB in Eastern Amazonia for conditioned Latin
hypercube (cLH) sampling and simple random (SI) sampling, and three sample
sizes.
These results are somewhat different from the results of Wadoux et al. (2019)
and Ma et al. (2020). In these case studies, cLH sampling appeared to be
an inefficient design for selecting a calibration sample that is subsequently
used for mapping. Wadoux et al. (2019) compared cLH, CSC, spatial coverage
sampling (Section 17.2), and SI for mapping soil organic carbon in France with
a RF model. The latter two sampling designs do not exploit the covariates
in selecting the calibration units. Sample sizes were 100, 200, 500, and 1,000.
cLH performed worse (larger RMSE) than CSC and not significantly better
than SI for all sample sizes.
Ma et al. (2020) compared cLH, CSC, and SI for mapping soil classes by
various models, among which a RF model, in a study area in Germany. Sample
sizes were 20, 30, 40, 50, 75, and 100 points. They found no relation between
the minimisation criterion of cLH and the overall accuracy of the map with
predicted soil classes. Models calibrated on CSC samples performed better
on average, i.e., on average the overall accuracy of the maps obtained by
calibrating the models on these CSC samples was higher. cLH was hardly
better than SI.
20
Spatial response surface sampling
This design has been applied, among others, for mapping soil salinity (ECe),
using electromagnetic (EM) induction measurements and surface array con-
ductivity measurements as predictors in multiple linear regression models.
For applications, see Corwin and Lesch (2005), Lesch (2005), Fitzgerald et al.
(2006), Corwin et al. (2010), and Fitzgerald (2010).
Spatial response surface sampling is illustrated with the EM measurements
(mS m⁻¹) of the apparent electrical conductivity on the 80 ha Cotton Research
Farm in Uzbekistan. The EM measurements in vertical dipole mode, with
transmitter at 1 m and 0.5 m from the receiver, are on transects covering the
Cotton Research Farm (Figure 20.1). As a first step, the natural log of the two
EM measurements, denoted by lnEM, are interpolated by ordinary kriging to
a fine grid (Figure 20.2). These ordinary kriging predictions of lnEM are used
as covariates in response surface sampling. The two covariates are strongly
correlated, 𝑟 = 0.73, as expected since they are interpolations of measurements
of the same variable but of different overlapping layers.
Function prcomp of the stats package (R Core Team, 2021) is used to compute
the principal component scores for all units in the population (grid cells). The
two covariates are centred and scaled, i.e., standardised principal components
are computed.
The means of the two principal component scores are 0; however, their standard
deviations are not equal to 1 but are 1.330 and 0.480. Therefore, the principal
component scores are divided by these standard deviations, so that they have
the same weight in the following steps.
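A sketch of these steps; the name grdCRF and the column names of the two
interpolated lnEM covariates are hypothetical:
pc <- prcomp(grdCRF[, c("lnEM100cm", "lnEM50cm")], center = TRUE, scale. = TRUE)
sds <- apply(pc$x, MARGIN = 2, FUN = sd)
# divide the principal component scores by their standard deviations
pc_scaled <- sweep(pc$x, MARGIN = 2, STATS = sds, FUN = "/")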
Function ccd of package rsm (Lenth, 2009) is now used to generate a central
composite response surface design (CCRSD). Argument basis specifies the
number of factors, which is two in our case. Argument n0 is the number of
centre points, and argument alpha determines the position of the star points
(explained hereafter).
library(rsm)
set.seed(314)
print(ccdesign <- ccd(basis = 2, n0 = 1, alpha = "rotatable"))
Data are stored in coded form using these coding formulas ...
x1 ~ x1.as.is
x2 ~ x2.as.is
The experiment consists of two blocks, each of five experimental units. Block 1,
the so-called cube block, consists of one centre point and four cube points. In
the experimental unit represented by the centre point, both factors have levels
in the centre of the experimental range. In the experimental units represented
by the cube points, the level of each factor is either -1 or +1 unit in the
design space. Block 2, referred to as the star block, consists of one centre point
and four star points. With alpha = "rotatable" the star points are on the circle
circumscribing the square (Figure 20.3).
To adapt this design for an observational study, we drop one of the centre
points (0,0).
The coordinates of the CCRSD points are multiplied by a factor so that a large
proportion 𝑝 of the bivariate standardised principal component scores of the
population units is covered by the circle that passes through the design points
(Figure 20.3). The factor is computed as a sample quantile of the empirical
distribution of the distances of the points in the scatter to the centre. For 𝑝, I
chose 0.7.
70%
1.472547
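A sketch of this computation, assuming pc_scaled holds the scaled principal
component scores:
# distances of the scaled PC scores to the centre of the scatter
d0 <- sqrt(pc_scaled[, 1]^2 + pc_scaled[, 2]^2)
# scaling factor for the design points: the 70% quantile of the distances
fct <- quantile(d0, probs = 0.7)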
The next step is to select for each design point several candidate sampling
points. For each of the nine design points, eight points are selected that are
closest to that design point. This results in 9 × 8 candidate sampling points.
Figure 20.4 shows the nine clusters of candidate sampling points around
the design points. Note that the location of the candidate sampling points
associated with the design points with coordinates (0,-2.13), (1.51,-1.51), and
(2.13,0) are all far inside the circle that passes through the design points. So,
for the optimised sample, there will be three points with principal component
scores that considerably differ from the ideal values according to the CCRSD
design.
Figure 20.5 shows that in geographical space for most design points there are
multiple spatial clusters of candidate units. For instance, for design point nine,
there are three clusters of candidate sampling units. Therefore, there is scope
to optimise the sample computationally.
As a first step, an initial subsample from the candidate sampling units is
selected by stratified simple random sampling, using the levels of factor dpnt
as strata. Function strata of package sampling is used for stratified random
sampling (Tillé and Matei, 2021).
FIGURE 20.4: Clusters of points (red points) around the design points
(triangles) of a CCRSD (two covariates), serving as candidate sampling points.
library(sampling)
set.seed(314)
units_stsi <- sampling::strata(
candi_all, stratanames = "dpnt", size = rep(1, 9))
mysample0 <- getdata(candi_all, units_stsi) %>%
dplyr::select(-(ID_unit:Stratum))
The locations of the nine sampling units are now optimised by minimising a
criterion that is a function of the distance between the nine sampling points.
Two minimisation criteria are implemented, a geometric criterion and a model-
based criterion.
In the geometric criterion (as proposed by Lesch (2005)) for each sampling
point the log of the shortest distance to the other points is computed. The
minimisation criterion is the negative of the sample mean of these distances.
The model-based minimisation criterion is the average correlation of the
sampling points. This criterion requires as input the parameters of a residual
correlogram (see Section 21.3). I assume an exponential correlogram without
nugget, so that the only parameter to be chosen is the distance parameter 𝜙
(Equation (21.13)). Three times 𝜙 is referred to as the effective range of the
exponential correlogram. The correlation of the random variables at two points
separated by this distance is 0.05.
A penalty term is added to the geometric or the model-based minimisation
criterion, equal to the average distance of the sampling points to the associated
design points, multiplied by a weight. With weights > 0, sampling points close
to the design points are preferred over more distant points.
In the next code chunk, a function is defined for computing the minimisation
criterion. Given a chosen value for 𝜙, the 9 × 9 distance matrix of the sampling
points can be converted into a correlation matrix, using function variogramLine
of package gstat (Pebesma, 2004). Argument weight is an optional argument
with default value 0.
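The full function body is not reproduced here. The following is a minimal
sketch under these assumptions: the sampling points carry coordinates s1 and
s2 and scaled principal component scores PC1 and PC2, the design points dpnt
carry coordinates x1 and x2, and the exponential correlogram without nugget
is evaluated directly as exp(-h/phi).
library(fields)
getCriterion <- function(mysample, dpnt, phi = NULL, weight = 0) {
  # geographical distances between all pairs of sampling points
  D <- fields::rdist(x1 = as.matrix(mysample[, c("s1", "s2")]))
  if (is.null(phi)) {
    # geometric criterion: negative mean of the log shortest distances
    diag(D) <- Inf
    criterion_cur <- -mean(log(apply(D, MARGIN = 1, FUN = min)))
  } else {
    # model-based criterion: mean correlation of the sampling points
    R <- exp(-D / phi)
    criterion_cur <- mean(R[upper.tri(R)])
  }
  if (weight > 0) {
    # penalty: weighted mean distance, in the space of the scaled
    # principal component scores, to the associated design points
    dPC <- sqrt((mysample$PC1 - dpnt$x1)^2 + (mysample$PC2 - dpnt$x2)^2)
    criterion_cur <- criterion_cur + weight * mean(dPC)
  }
  return(criterion_cur)
}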
Function getCriterion is used to compute the geometric criterion for the initial
sample.
The initial value of the geometric criterion is -4.829. In the next code chunk,
the initial value for the model-based criterion is computed for an effective
range of 150 m.
It does not make sense to choose an effective range smaller than the size of the
grid cells, which is 25 m in our case. For such short ranges, the correlation
matrix is for any sample nearly a matrix with zeroes (apart from its diagonal).
If the effective range is smaller than the smallest distance between two points
in a cluster, the mean correlation is equal for all samples, so that the criterion
cannot discriminate between samples.
phi <- 50
criterion_mb <- getCriterion(mysample = mysample0, dpnt = ccd_df, phi = phi)
set.seed(314)
# remaining control arguments of anneal are not reproduced here
mySRSsample <- anneal(
  mysample = mysample0, candidates = candi_all, dpnt = ccd_df, phi = 50)
Figure 20.6 shows the optimised CCRSD samples plotted in the space spanned
by the two principal components, obtained with the geometric and the model-
based criterion, plotted together with the design points. The two optimised
samples are very similar.
Figure 20.7 shows the two optimised CCRSD samples plotted in geographical
space on the first standardised principal component scores.
FIGURE 20.7: CCRSD sample from the Cotton Research Farm, optimised
with the geometric and the model-based criterion, plotted on a map of the
first standardised principal component (PC1).
20.1 Increasing the sample size
For most design points, the candidate sampling units are not in one spatial
cluster, so in this case this solution may work properly. I increased the number
of candidate sampling units per design point to 16, so that there is a larger
choice in the optimisation of the sampling pattern.
set.seed(314)
units_stsi <- sampling::strata(
candi_all, stratanames = "dpnt", size = rep(2, 9))
mysample0 <- getdata(candi_all, units_stsi) %>%
dplyr::select(-(ID_unit:Stratum))
The data frame with the design points must be doubled. Note that the order
of the design points must be equal to the order in the stratified subsample.
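A one-line sketch of this doubling, assuming the subsample is ordered by design
point so that each design point must occur twice in succession:
# repeat every design point twice, matching the order of the subsample
ccd_df2 <- ccd_df[rep(1:nrow(ccd_df), each = 2), ]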
Figures 20.8 and 20.9 show the optimised CCRSD sample of 18 points in
geographical and principal component space, respectively, obtained with the
model-based criterion, an effective range of 150 m, and zero weight for the
penalty term. Sampling points are not spatially clustered, so I do not expect
violation of the assumption of independent residuals. In principal component
space, all points are pretty close to the design points, except for the four design
points in the lower right corner, where no candidate units near these design
points are available.
FIGURE 20.8: CCRSD sample with two points per design point, from the
Cotton Research Farm, plotted on a map of the first standardised principal
component (PC1).
FIGURE 20.9: CCRSD sample (triangles) with two points per design point
(dots), optimised with model-based criterion, plotted in the space spanned by
the two standardised principal components.
20.2 Stratified spatial response surface sampling
The spatial strata are not used for fitting separate regression models. All
data are used to fit one (second-order) polynomial regression model.
Figure 20.10 shows two subareas used as strata in stratified response surface
sampling of the Cotton Research Farm.
FIGURE 20.10: Two subareas of the Cotton Research Farm used as strata
in stratified CCRSD sampling.
The candidate sampling units are selected in a double for-loop. The outer loop
is over the strata, the inner loop over the design points. Note that variable dpnt
continues to increase by 1 after the inner loop over the nine design points in
subarea 1 is completed, so that variable dpnt (used as a stratification variable
in subsampling the sample of candidate sampling points) now has values
1, 2, … , 18. An equal number of candidate sampling points per design point
in both strata (eight points) is selected by sorting the points of a stratum by
the distance to a design point using function order. Figure 20.11 shows the
candidate sampling points for stratified CCRSD sampling.
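A sketch of this double loop; the names candi (the candidate points with a
stratum indicator and scaled principal component scores PC1 and PC2) and
ccdpts (the matrix with the scaled design point coordinates) are hypothetical:
candi_all <- NULL
dp <- 0
for (h in 1:2) {
  candi_h <- candi[candi$stratum == h, ]
  for (i in 1:nrow(ccdpts)) {
    dp <- dp + 1
    # distance in scaled principal component space to design point i
    d <- sqrt((candi_h$PC1 - ccdpts[i, 1])^2 + (candi_h$PC2 - ccdpts[i, 2])^2)
    # keep the eight closest points as candidates for this design point
    sel <- candi_h[order(d)[1:8], ]
    sel$dpnt <- dp
    candi_all <- rbind(candi_all, sel)
  }
}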
set.seed(314)
units_stsi <- sampling::strata(
candi_all, stratanames = "dpnt", size = rep(1, 18))
mysample0 <- getdata(candi_all, units_stsi) %>%
dplyr::select(-(ID_unit:Stratum))
ccd_df2 <- rbind(ccd_df, ccd_df)
Figures 20.12 and 20.13 show the optimised sample of 18 points in geographical
and principal component space, obtained with the model-based criterion with
an effective range of 150 m. The pattern in the principal component space is
worse compared to the pattern in Figure 20.9. In stratum 1, the distance to
the star point at the top and the upper left and upper right cube points is very
large. In this stratum no population units are present that are close to these
design points.
FIGURE 20.12: Stratified CCRSD samples from the Cotton Research Farm,
optimised with the model-based criterion, obtained without (weight = 0) and
with penalty (weight = 5) for a large average distance to design points.
20.3 Mapping
Once the data are collected, the study variable is mapped by fitting a multiple
linear regression model using the two covariates, in our case the two interpolated
lnEM covariates. The regression coefficients are estimated by ordinary least
squares:
\[
\hat{\boldsymbol{\beta}} = \left( \mathbf{X}^{\mathrm{T}}\mathbf{X} \right)^{-1} \mathbf{X}^{\mathrm{T}} \mathbf{z} ,
\]
with $\mathbf{X}$ the $(n \times (p+1))$ matrix with covariate values and ones in the first
column ($n$ is the sample size, and $p$ is the number of covariates) and $\mathbf{z}$ the
$n$-vector with observations of the study variable.
Although the principal component scores are used to select the sampling
locations, there is no need to use these scores as predictors in the linear
regression model. When all principal components derived from the covariates
are used as predictors, the predicted values and standard errors obtained
with the model using the principal components as predictors are equal to
those obtained with the model using the covariates as predictors.
The variance of the prediction error at a new location $\mathbf{s}_0$ is estimated by
\[
\hat{V}(\hat{Z}(\mathbf{s}_0)) = \hat{\sigma}^2_{\epsilon} \left( 1 + \mathbf{x}_0^{\mathrm{T}} \left( \mathbf{X}^{\mathrm{T}}\mathbf{X} \right)^{-1} \mathbf{x}_0 \right) , \qquad (20.3)
\]
The assumption underlying Equations (20.2) and (20.3) is that the model
residuals are independent. We assume that all the spatial structure of the
study variable is explained by the covariates. Even the residuals at two locations
close to each other are assumed to be uncorrelated. A drawback of the spatial
response surface design is that it is hard or even impossible to check this
assumption, as the sampling locations are spread throughout the study area.
If the residuals are not independent, the covariance of the residuals can
be accounted for by generalised least squares estimation of the regression
coefficients (Equation (21.24)). The study variable can then be mapped by
kriging with an external drift (Section 21.3). However, this requires an estimate
of the semivariogram of the residuals (Section 21.5).
21
Introduction to kriging
Consider the following model of the study variable:
\[
Z(\mathbf{s}) = \mu(\mathbf{s}) + \epsilon(\mathbf{s}) , \qquad \mathrm{Cov}\left( \epsilon(\mathbf{s}), \epsilon(\mathbf{s}') \right) = C(\mathbf{h}) ,
\]
with $Z(\mathbf{s})$ the study variable at location $\mathbf{s}$, $\mu(\mathbf{s})$ the mean at location $\mathbf{s}$, $\epsilon(\mathbf{s})$
the residual (difference between study variable $z$ and mean $\mu(\mathbf{s})$) at location $\mathbf{s}$,
and $C(\mathbf{h})$ the covariance of the residuals at two locations separated by vector
$\mathbf{h} = \mathbf{s} - \mathbf{s}'$.
In ordinary kriging (OK) it is assumed that the mean of the study variable is
constant, i.e., the same everywhere (Webster and Oliver, 2007):
\[
Z(\mathbf{s}) = \mu + \epsilon(\mathbf{s}) . \qquad (21.2)
\]
The study variable at a prediction location $\mathbf{s}_0$ is predicted as a weighted
average of the $n$ observations:
\[
\hat{Z}_{\mathrm{OK}}(\mathbf{s}_0) = \sum_{i=1}^{n} \lambda_i Z(\mathbf{s}_i) , \qquad (21.3)
\]
where $Z(\mathbf{s}_i)$ is the study variable at the $i$th sampling location and $\lambda_i$ is the weight attached to this location. The weights should be related to the correlation of the study variable at the sampling location and the prediction location. Note that as the mean is assumed constant (Equation (21.2)), the correlation of the study variable $Z$ is equal to the correlation of the residual $\epsilon$. Roughly speaking, the stronger this correlation, the larger the weight must be. If we have a model for this correlation, then we can use this model to find the optimal weights. Further, if two sampling locations are very close, the weight attached to these two locations should not be twice the weight attached to a single, isolated sampling location at the same distance from the prediction location. This explains why, in computing the kriging weights, besides the covariances of the $n$ pairs of prediction location and sampling location, also the covariances of the $n(n-1)/2$ pairs that can be formed with the $n$ sampling locations are used; see Isaaks and Srivastava (1989) for a nice intuitive explanation.
For OK, the optimal weights, i.e., the weights that lead to the model-unbiased¹ predictor with minimum error variance (best linear unbiased predictor), can be found by solving the following $n + 1$ equations:

¹ Model-unbiasedness is explained in Chapter 26.

$$\begin{aligned}
\sum_{j=1}^{n} \lambda_j C(\mathbf{s}_1, \mathbf{s}_j) + \nu &= C(\mathbf{s}_1, \mathbf{s}_0) \\
\sum_{j=1}^{n} \lambda_j C(\mathbf{s}_2, \mathbf{s}_j) + \nu &= C(\mathbf{s}_2, \mathbf{s}_0) \\
&\ \vdots \\
\sum_{j=1}^{n} \lambda_j C(\mathbf{s}_n, \mathbf{s}_j) + \nu &= C(\mathbf{s}_n, \mathbf{s}_0) \\
\sum_{j=1}^{n} \lambda_j &= 1
\end{aligned} \qquad (21.4)$$
where $C(\mathbf{s}_i, \mathbf{s}_j)$ is the covariance of the $i$th and $j$th sampling location, $C(\mathbf{s}_i, \mathbf{s}_0)$ is the covariance of the $i$th sampling location and the prediction location $\mathbf{s}_0$, and $\nu$ is an extra parameter to be estimated, referred to as the Lagrange multiplier. This Lagrange multiplier must be included in the set of equations because the error variance is minimised under the constraint that the kriging weights sum to 1, see the final line in Equation (21.4). This constraint ensures that the OK-predictor is model-unbiased. It is convenient to write this system of equations in matrix form:
$$\begin{bmatrix} \mathbf{C} & \mathbf{1} \\ \mathbf{1}^{\mathrm{T}} & 0 \end{bmatrix} \begin{bmatrix} \boldsymbol{\lambda} \\ \nu \end{bmatrix} = \begin{bmatrix} \mathbf{c}_0 \\ 1 \end{bmatrix} \,. \qquad (21.6)$$
The kriging weights 𝜆 and the Lagrange multiplier 𝜈 can then be computed
by premultiplying both sides of Equation (21.6) with the inverse of the first
matrix of this equation:
$$\begin{bmatrix} \boldsymbol{\lambda} \\ \nu \end{bmatrix} = \begin{bmatrix} \mathbf{C} & \mathbf{1} \\ \mathbf{1}^{\mathrm{T}} & 0 \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{c}_0 \\ 1 \end{bmatrix} \,. \qquad (21.7)$$
The variance of the OK prediction error equals

$$V_{\text{OK}}(\hat{Z}(\mathbf{s}_0)) = \sigma^2 - \boldsymbol{\lambda}^{\mathrm{T}}\mathbf{c}_0 - \nu \,, \qquad (21.8)$$

with $\sigma^2$ the a priori variance, see Equation (21.2), and $\mathbf{c}_0$ the vector with covariances between the sampling locations and the prediction location. This equation shows that
the OK variance is not a function of the data at the sampling locations. Given
a covariance function, it is fully determined by the spatial pattern of the
sampling locations and the prediction location. It is this property of kriging
that makes it possible to optimise the grid spacing (Chapter 22) and, as we will
see in Chapter 23, to optimise the spatial pattern of the sampling locations,
given a requirement on the kriging variance. If the kriging variance were a
function of the data at the sampling locations, optimisation would be much
more complicated.
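The computations of Equations (21.6) to (21.8) take only a few lines of R. The next code chunk is a minimal sketch with hypothetical values for the covariances and the observations; these values are not data used elsewhere in this chapter.

# minimal sketch of the OK computations; sigma2, C, c0, and z are
# hypothetical example values
sigma2 <- 100                                      # a priori variance C(0)
C <- matrix(c(sigma2, 60, 60, sigma2), nrow = 2)   # covariances between two sampling locations
c0 <- c(80, 40)                                    # covariances with the prediction location
z <- c(12, 10)                                     # observations at the sampling locations
n <- length(z)
A <- rbind(cbind(C, 1), c(rep(1, n), 0))           # extended matrix of Equation (21.6)
b <- c(c0, 1)
sol <- solve(A, b)                                 # Equation (21.7)
lambda <- sol[1:n]; nu <- sol[n + 1]
zpred_OK <- sum(lambda * z)                        # OK prediction, Equation (21.3)
v_OK <- sigma2 - sum(lambda * c0) - nu             # OK variance, Equation (21.8)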
In practice, not the covariance function but a semivariogram is used in kriging. A semivariogram $\gamma(\mathbf{h})$ is a model of the dissimilarity of the study variable at two locations, as a function of the vector $\mathbf{h}$ separating the two locations. The dissimilarity is quantified by half the variance of the difference of the study variable $Z$ at two locations. Under the assumption that the expectation of $Z$ is constant throughout the study area (stationarity in the mean), half the variance of the difference is equal to half the expectation of the squared difference:

$$\gamma(\mathbf{h}) = \frac{1}{2} V\!\left[Z(\mathbf{s}) - Z(\mathbf{s}+\mathbf{h})\right] = \frac{1}{2} E\!\left[\left(Z(\mathbf{s}) - Z(\mathbf{s}+\mathbf{h})\right)^2\right] \,. \qquad (21.9)$$

FIGURE 21.1: Spherical covariance function (red line and dot) and semivariogram (black line and dot).
Rewriting the kriging equations in terms of semivariances, the OK variance becomes

$$V_{\text{OK}}(\hat{Z}(\mathbf{s}_0)) = \boldsymbol{\lambda}^{\mathrm{T}}\boldsymbol{\gamma}_0 + \nu \,, \qquad (21.11)$$

with $\boldsymbol{\gamma}_0$ the vector with semivariances between the sampling locations and the prediction location.
Computing the kriging predictor requires a model for the covariance (or semivariance) as a function of the vector separating two locations. Often, the covariance is modelled as a function of the length of the separation vector only, so as a function of the Euclidean distance between two locations. We then assume isotropy: given a separation distance between two locations, the covariance is the same in all directions. Only authorised functions are allowed for modelling the semivariance, ensuring that the variance of any linear combination of random variables, like the kriging predictor, is never negative. Commonly used functions are an exponential and a spherical model.
The spherical semivariogram model has three parameters:

1. nugget ($c_0$): where the semivariogram touches the y-axis (in Figure 21.1: 25);
2. partial sill ($c_1$): the difference between the maximum semivariance and the nugget (in Figure 21.1: 75); and
3. range ($\phi$): the distance at which the semivariance reaches its maximum (in Figure 21.1: 250 m).

Its formula is

$$\gamma(h) = \begin{cases} 0 & \text{if } h = 0 \\ c_0 + c_1 \left[ \dfrac{3}{2}\dfrac{h}{\phi} - \dfrac{1}{2}\left(\dfrac{h}{\phi}\right)^{3} \right] & \text{if } 0 < h \le \phi \\ c_0 + c_1 & \text{if } h > \phi \end{cases} \,. \qquad (21.12)$$

The sum of the nugget and the partial sill is referred to as the sill (or sill variance or a priori variance).
An exponential semivariogram model also has three parameters. Its formula is

$$\gamma(h) = \begin{cases} 0 & \text{if } h = 0 \\ c_0 + c_1 \left[ 1 - \exp(-h/\phi) \right] & \text{if } h > 0 \end{cases} \,. \qquad (21.13)$$

With an exponential model the sill is reached only asymptotically; at three times the distance parameter $\phi$, the semivariance is at 95% of the sill. Three times the distance parameter is therefore referred to as the effective or practical range.
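Both models are easily programmed. The next code chunk is a minimal sketch; the function names are mine.

# spherical and exponential semivariogram models (function names are mine)
gamma_sph <- function(h, c0, c1, phi) {
  ifelse(h == 0, 0,
    ifelse(h <= phi, c0 + c1 * (3 * h / (2 * phi) - 0.5 * (h / phi)^3), c0 + c1))
}
gamma_exp <- function(h, c0, c1, phi) {
  ifelse(h == 0, 0, c0 + c1 * (1 - exp(-h / phi)))
}
# semivariances for the spherical model of Figure 21.1
gamma_sph(h = c(0, 100, 250, 400), c0 = 25, c1 = 75, phi = 250)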
In following chapters, I also use a correlogram, which is a scaled covariance function, such that the sill of the correlogram equals 1:

$$\rho(\mathbf{h}) = \frac{C(\mathbf{h})}{\sigma^2} \,. \qquad (21.14)$$
To illustrate that the OK variance is independent of the values of the study
variable at the sampling locations, I simulated a spatial population of 50 ×
50 units. For each unit a value of the study variable is simulated, using the
semivariogram of Figure 21.1. This is repeated ten times, resulting in ten maps
of 2,500 units. Figure 21.2 shows two of the ten simulated maps. Note that
the two maps clearly show spatial structure, i.e., there are patches of similar
values.
The simulated maps are sampled on a centred square grid with a spacing of 100 distance units, resulting in a sample of 100 units. The samples are used one-by-one to predict the study variable at one prediction location (see Figure 21.2), using again the semivariogram of Figure 21.1. The semivariogram is passed to function vgm of package gstat (Pebesma, 2004). Usually, this semivariogram is estimated from a sample, see Chapter 24, but here we assume that it is known. Function krige of package gstat is used for kriging. Argument formula specifies the dependent (study variable) and independent variables (covariates). The formula z ~ 1 means that we do not have covariates (we assume that the model-mean is a constant) and that predictions are made by OK (or by simple kriging, in which the model-mean is known and specified with argument beta, as we will see in Section 21.3).
library(sp)
library(gstat)
vgmodel <- vgm(model = "Sph", nugget = 25, psill = 75, range = 250)
gridded(mypop) <- ~ s1 + s2
mysample <- spsample(x = mypop, type = "regular", cellsize = c(100, 100),
offset = c(0.5, 0.5))
zsim_sample <- over(mysample, mypop)
coordinates(s_0) <- ~ s1 + s2
zpred_OK <- v_zpred_OK <- NULL
for (i in seq_len(ncol(Z))) {
mysample$z <- zsim_sample[, i]
predictions <- krige(
formula = z ~ 1,
locations = mysample,
newdata = s_0,
model = vgmodel,
debug.level = 0)
zpred_OK[i] <- predictions$var1.pred
v_zpred_OK[i] <- predictions$var1.var
}
As can be seen in Table 21.1, unlike the predicted value, the OK variance
produced from the different simulations is constant.
21.2 Block-kriging
In the previous section, the support of the prediction units is equal to that of the sampling units. So, if the observations are made at points (point support), the predictions are also at points, and if means of small blocks are observed, the predictions are means of blocks of the same size and shape. There is no change of support. In some cases we may prefer predictions at a larger support than that of the observations. For instance, we may prefer predictions of the average concentration of some soil property for blocks of 5 m × 5 m instead of predictions at points, simply because of practical relevance.
The mean of the study variable over a block $\mathcal{B}_0$ can be predicted by ordinary block-kriging (OBK). The variance of the OBK prediction error equals

$$V_{\text{OBK}}(\hat{Z}(\mathcal{B}_0)) = \boldsymbol{\lambda}^{\mathrm{T}}\bar{\boldsymbol{\gamma}}(\mathcal{B}_0) + \nu - \bar{\gamma}(\mathcal{B}_0, \mathcal{B}_0) \,, \qquad (21.15)$$

with $\bar{\boldsymbol{\gamma}}(\mathcal{B}_0)$ the vector with mean semivariances between the sampling points and the prediction block and $\bar{\gamma}(\mathcal{B}_0, \mathcal{B}_0)$ the mean semivariance within the prediction block.
prediction block. Comparing this with Equation (21.11) shows that the block-
kriging variance is smaller than the point-kriging variance by an amount
approximately equal to the mean semivariance within a prediction block. Recall
from Chapter 13 that the mean semivariance within a block is a model-based
prediction of the variance within a block (Equation (13.3)).
21.3 Kriging with an external drift

In kriging with an external drift (KED), the model for the study variable is

$$Z(\mathbf{s}) = \sum_{k=0}^{p} \beta_k x_k(\mathbf{s}) + \epsilon(\mathbf{s}) \,, \quad \epsilon(\mathbf{s}) \sim \mathcal{N}(0, \sigma^2) \,, \quad \mathrm{Cov}(\epsilon(\mathbf{s}), \epsilon(\mathbf{s}')) = C(\mathbf{h}) \,, \qquad (21.16)$$

with $x_k(\mathbf{s})$ the value of the $k$th covariate at location $\mathbf{s}$ ($x_0 = 1$ for all locations), $p$ the number of covariates, and $C(\mathbf{h})$ the covariance of the residuals at two locations separated by vector $\mathbf{h} = \mathbf{s} - \mathbf{s}'$.
locations separated by vector 𝐡 = 𝐬 − 𝐬′ . The constant mean 𝜇 in Equation
(21.2) is replaced by a linear combination of covariates and, as a consequence,
the mean is not constant anymore but varies in space.
With KED, the study variable at a prediction location $\mathbf{s}_0$ is predicted by

$$\hat{Z}_{\text{KED}}(\mathbf{s}_0) = \sum_{k=0}^{p} \hat{\beta}_k x_k(\mathbf{s}_0) + \sum_{i=1}^{n} \lambda_i \left\{ Z(\mathbf{s}_i) - \sum_{k=0}^{p} \hat{\beta}_k x_k(\mathbf{s}_i) \right\} \,, \qquad (21.17)$$

with $\hat{\beta}_k$ the estimated regression coefficient associated with covariate $x_k$.
The first component of this predictor is the estimated model-mean at the
new location based on the covariate values at this location and the estimated
regression coefficients. The second component is a weighted sum of the residuals
at the sampling locations.
The optimal kriging weights $\lambda_i$, $i = 1, \dots, n$, are obtained in a similar way as in OK. The difference is that additional constraints on the weights are needed to ensure unbiased predictions. Not only must the weights sum to 1, but for all $p$ covariates the weighted sum of the covariate values at the sampling locations must equal the covariate value at the prediction location: $\sum_{i=1}^{n} \lambda_i x_k(\mathbf{s}_i) = x_k(\mathbf{s}_0)$ for all $k = 1, \dots, p$. This leads to a system of $n + p + 1$ equations:

$$\begin{bmatrix} \mathbf{C} & \mathbf{X} \\ \mathbf{X}^{\mathrm{T}} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \boldsymbol{\lambda} \\ \boldsymbol{\nu} \end{bmatrix} = \begin{bmatrix} \mathbf{c}_0 \\ \mathbf{x}_0 \end{bmatrix} \,, \qquad (21.18)$$
with

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix} \,. \qquad (21.19)$$

The variance of the KED prediction error equals

$$V_{\text{KED}}(\hat{Z}(\mathbf{s}_0)) = \sigma^2 - \boldsymbol{\lambda}^{\mathrm{T}}\mathbf{c}_0 - \boldsymbol{\nu}^{\mathrm{T}}\mathbf{x}_0 \,. \qquad (21.20)$$
The prediction error variance with KED can also be written as the sum of the variance of the predictor of the mean and the variance of the error in the interpolated residuals (Christensen, 1991):

$$V_{\text{KED}}(\hat{Z}(\mathbf{s}_0)) = \sigma^2 - \mathbf{c}_0^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{c}_0 + (\mathbf{x}_0 - \mathbf{X}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{c}_0)^{\mathrm{T}} (\mathbf{X}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{X})^{-1} (\mathbf{x}_0 - \mathbf{X}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{c}_0) \,. \qquad (21.21)$$
The first two terms constitute the interpolation error variance, the third term
the variance of the predictor of the mean.
To illustrate that the kriging variance with KED depends on the values of the
covariate at the sampling locations and the prediction location, values of a
covariate 𝑥 and of a correlated study variable 𝑧 are simulated for the 50 × 50
units of a spatial population (Figure 21.3). First, a field with covariate values
is simulated with a model-mean of 10. Next, a field with residuals is simulated.
The field of the study variable is then obtained by multiplying the simulated
field with covariate values by two (𝛽1 = 2), adding a constant of 10 (𝛽0 = 10),
and finally adding the simulated field with residuals.
As before, a centred square grid with a spacing of 100 distance units is selected.
The simulated values of the study variable 𝑧 and covariate 𝑥 are used to predict
𝑧 at a prediction location 𝐬0 by kriging with an external drift (red cell in
Figure 21.3). Although at the prediction location we have only one simulated
value of covariate 𝑥, a series of covariate values is used to predict 𝑧 at that
location: 𝑥0 = 0, 2, 4, … , 20. In practice, we have of course only one value of
the covariate at a fixed location, but this is for illustration purposes only. Note
that we have only one data set with ‘observations’ of 𝑥 and 𝑧 at the sampling
locations (square grid).
FIGURE 21.3: Maps with simulated values of covariate 𝑥 and study variable
𝑧, the centred square grid of sampling units, and the prediction unit (red cell
with coordinates (590,670)).
# the opening lines of this chunk are reconstructed from the text below;
# the covariate values x0 = 0, 2, ..., 20 are attached one-by-one
x0 <- seq(from = 0, to = 20, by = 2)
v_zpred_KED <- NULL
for (i in seq_len(length(x0))) {
  s_0$x <- x0[i]
  predictions <- krige(
    formula = z ~ x,
    locations = mysample,
    newdata = s_0,
    model = vgm_resi,
    debug.level = 0)
  v_zpred_KED[i] <- predictions$var1.var
}
Note the formula z ~ x in the code chunk above, indicating that there is now
an independent variable (covariate). The covariate values are attached to the
file with the prediction location one-by-one in a for-loop. Also note that for
KED we need the semivariogram of the residuals, not of the study variable
itself. The residual semivariogram used in prediction is the same as the one
used in simulating the fields: a spherical model without nugget, with a sill of
5, and a range of 100 distance units.
To assess the contribution of the uncertainty about the model-mean 𝜇(𝐬), I also
predict the values assuming that the model-mean is known. In other words, I
assume that the two regression coefficients 𝛽0 (intercept) and 𝛽1 (slope) are
known. This type of kriging is referred to as simple kriging (SK). With SK,
the constraints explained above are removed, so that there are no Lagrange
multipliers involved. Argument beta is used to specify the known regression
coefficients. I use the same values as used in the simulation.
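A minimal sketch of the SK computation, reusing the objects above; the regression coefficients passed to argument beta are the intercept and slope used in the simulation:

v_zpred_SK <- NULL
for (i in seq_len(length(x0))) {
  s_0$x <- x0[i]
  predictions <- krige(
    formula = z ~ x,
    locations = mysample,
    newdata = s_0,
    model = vgm_resi,
    beta = c(10, 2),  # known intercept and slope used in the simulation
    debug.level = 0)
  v_zpred_SK[i] <- predictions$var1.var
}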
Figure 21.4 shows that, contrary to the SK variance, the kriging variance with
KED is not constant but depends on the covariate value at the prediction
location. It is smallest near the mean of the covariate values at the sampling
locations, which is 10.0. The more extreme the covariate value at the prediction
location, the larger the kriging variance with KED. This is analogous to the
variance of predictions with a linear regression model.
21.4 Estimating the semivariogram

21.4.1 Method-of-moments
set.seed(123)
units <- sample(nrow(mypop), size = 150)
mysample <- mypop[units, ]
coordinates(mysample) <- ~ s1 + s2
vg <- variogram(z ~ 1, data = mysample)
head(vg[, c(1, 2, 3)])
np dist gamma
1 35 24.02379 38.73456
2 74 49.02384 42.57199
3 179 78.62401 62.13025
4 213 110.05088 72.41387
The next step is to fit a model. This can be done with function fit.variogram of
the gstat package. Many models can be fitted with this function (type vgm()
to see all models). I chose a spherical model. Function fit.variogram requires
initial values of the semivariogram parameters. From the sample semivariogram,
my eyeball estimates are 25 for the nugget, 250 for the range, and 75 for the
partial sill. These are passed to function vgm. Figure 21.5 shows the sample semivariogram along with the fitted spherical model².
print(vgm_MoM)
Function fit.variogram has several options for weighted least squares optimisa-
tion, see ?fit.variogram for details. Also note that this non-linear fit may not
converge to a solution, especially if the starting values passed to vgm are not
near their optimal values.
² The figure is plotted with package ggplot2.
Further, this method depends on the choice of cutoff and distance intervals.
We hope that modifying these does not change the fitted model too much, but
this is not always the case, especially with smaller data sets.
21.4.2 Maximum likelihood

In contrast to the MoM, with the maximum likelihood (ML) method the data are not paired into couples and binned into a sample semivariogram. Instead, the semivariogram model is estimated in one step. To apply this method, one typically assumes that (possibly after transformation) the $n$ sample data come from a multivariate normal distribution. If we have one observation from a normal distribution, the probability density of that observation is given by
$$f(z \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}\left(\frac{z-\mu}{\sigma}\right)^2 \right\} \,, \qquad (21.22)$$
with 𝜇 the mean and 𝜎2 the variance. With multiple independent observations,
each of them coming from a normal distribution, the joint probability density
is given by the product of the probability densities per observation. However,
if the data are not independent, we must account for the covariances and the
joint probability density can be computed by
$$f(\mathbf{z} \mid \boldsymbol{\mu}, \boldsymbol{\theta}) = (2\pi)^{-\frac{n}{2}} \, |\mathbf{C}|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} (\mathbf{z}-\boldsymbol{\mu})^{\mathrm{T}} \mathbf{C}^{-1} (\mathbf{z}-\boldsymbol{\mu}) \right\} \,, \qquad (21.23)$$
where 𝐳 is the vector with the 𝑛 sample data, 𝜇 is the vector with means, 𝜃
is the vector with parameters of the covariance function, and 𝐂 is the 𝑛 × 𝑛
matrix with variances and covariances of the sample data. If the probability
density of Equation (21.23) is regarded as a function of 𝜇 and 𝜃 with the data
𝐳 fixed, this equation defines the likelihood.
ML estimates of the semivariogram can be obtained with function likfit of
package geoR (Ribeiro Jr et al., 2020). First, a geoR object must be made
specifying which columns of the data frame contain the spatial coordinates
and the study variable.
library(geoR)
mysample <- as(mysample, "data.frame")
dGeoR <- as.geodata(obj = mysample, header = TRUE,
coords.col = c("s1", "s2"), data.col = "z")
The model parameters can then be estimated with function likfit. Argument
trend = "cte" means that we assume that the mean is constant throughout the
study area.
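A minimal sketch of the call; the initial values are the eyeball estimates of Subsection 21.4.1, and the remaining argument values are assumptions, as the full call is not shown here.

vgm_ML <- likfit(geodata = dGeoR, trend = "cte",
  cov.model = "spherical", ini.cov.pars = c(75, 250), nugget = 25,
  lik.method = "ML", messages = FALSE)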
TABLE 21.2: MoM and ML estimates of the parameters of the spherical semivariogram.

Parameter      MoM     ML
nugget         26.3    19.5
partial sill   68.4    83.8
range          227.8   217.8
Table 21.2 shows the ML estimates together with the MoM estimates. As can be seen, the estimates differ substantially, especially in the division of the a priori variance (sill) into partial sill and nugget. In general, I prefer the ML estimates because the arbitrary choice of distance intervals for computing a sample semivariogram is avoided. Also, for a given sample size, ML estimates of the parameters are more precise. On the other hand, in ML estimation we need to assume that the data are normally distributed.

21.5 Estimating the residual semivariogram
set.seed(314)
units <- sample(nrow(mypop), size = 150)
mysample <- mypop[units, ]
vg_resi <- variogram(z ~ x, data = mysample)
model_eye <- vgm(model = "Sph", psill = 10, range = 150, nugget = 0)
vgmresi_MoM <- fit.variogram(vg_resi, model = model_eye)
With correlated residuals, the regression coefficients can be estimated by generalised least squares (GLS):

$$\hat{\boldsymbol{\beta}}_{\text{GLS}} = (\mathbf{X}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{X})^{-1} \mathbf{X}^{\mathrm{T}}\mathbf{C}^{-1}\mathbf{z} \,. \qquad (21.24)$$
The next code chunk shows how the GLS estimates of the regression coefficients
can be computed. Function spDists of package sp is used to compute the matrix
with distances between the sampling locations, and function variogramLine of
package gstat is used to transform the distance matrix into a covariance
matrix.
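The iteration below needs starting values. The next code chunk is a minimal sketch of the set-up it assumes; mysample is assumed to be a Spatial object with study variable z and covariate x, and ordinary least squares estimates are used as starting values.

# set-up assumed by the iteration below
z <- mysample$z
X <- cbind(1, mysample$x)                        # design matrix with intercept
D <- spDists(mysample)                           # distances between sampling locations
betaGLS <- solve(crossprod(X), crossprod(X, z))  # OLS estimates as starting values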
repeat {
  betaGLS.cur <- betaGLS
  mu <- X %*% betaGLS
  mysample$e <- as.numeric(z - mu)  # residuals given current estimates
  vg_resi <- variogram(e ~ 1, data = mysample)
  vgmresi_MoM <- fit.variogram(vg_resi, model = model_eye)
  C <- variogramLine(vgmresi_MoM, dist_vector = D, covariance = TRUE)
  XCX <- crossprod(X, solve(C, X))
  XCz <- crossprod(X, solve(C, z))
  betaGLS <- solve(XCX, XCz)        # updated GLS estimates, Equation (21.24)
  if (sum(abs(betaGLS - betaGLS.cur)) < 0.0001) {
    break
  }
}
In residual maximum likelihood (REML) estimation, the data are projected on a space orthogonal to the columns of $\mathbf{X}$, using the projection matrix $\mathbf{P} = \mathbf{I} - \mathbf{X}(\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}$, with $\mathbf{I}$ the $n \times n$ identity matrix (matrix with ones on the diagonal and zeroes in all off-diagonal elements). Premultiplying both sides of the KED model (Equation (21.16)) with $\mathbf{P}$ gives (Webster and Oliver, 2007)

$$\mathbf{P}\mathbf{z}(\mathbf{s}) = \mathbf{y}(\mathbf{s}) = \mathbf{P}\mathbf{X}\boldsymbol{\beta} + \mathbf{P}\boldsymbol{\epsilon}(\mathbf{s}) = \mathbf{P}\boldsymbol{\epsilon}(\mathbf{s}) \,, \qquad (21.26)$$

as $\mathbf{P}\mathbf{X} = \mathbf{0}$. The natural log of the residual likelihood can then be computed as given by Lark and Webster (2006).
Table 21.3 shows that REML yields a smaller estimated (partial) sill and a larger estimated range than iterative MoM. Of the two regression coefficients, especially the estimated intercept differs considerably between the two estimation methods.
This is necessarily a rather short introduction to kriging. Refer to Isaaks and Srivastava (1989) for an introduction to geostatistics, to Goovaerts (1997) for an exposé of the many versions of kriging, and to Webster and Oliver (2007) for an elaborate explanation of kriging. A nice educational tool for getting a feeling for ordinary kriging is E{Z}-Kriging³.
3 https://wiki.52north.org/AI_GEOSTATS/SWEZKriging
22
Model-based optimisation of the grid spacing
This is the first chapter on model-based sampling. In Section 17.2 and Chapter 18 a geometric criterion is minimised, i.e., a criterion defined in terms of distances, either in geographic space (Section 17.2) or in covariate space (Chapter 18). In model-based sampling, the minimisation criterion is a function of the variance of the prediction errors.
This chapter on model-based sampling is about optimisation of the spacing of
a square grid, i.e., the distance between neighbouring points in the grid. The
grid spacing is derived from a requirement on the accuracy of the map. Here
and in following chapters, I assume that the map is constructed by kriging;
see Chapter 21 for an introduction. As we have seen in Chapter 21, a kriging
prediction of the study variable at an unobserved location is accompanied by
a variance of the prediction error, referred to as the kriging variance. The map
accuracy requirement is a population parameter of this kriging variance, e.g.,
the population mean of the kriging variance.
Often such data are lacking and a best guess of the semivariogram must be made, for instance using data for the same study variable from other, similar areas.
There is no simple equation relating the grid spacing to the kriging variance. What can be done is to calculate the mean OK variance for a range of grid spacings, plot the mean OK variances against the grid spacings, and use this plot inversely to determine the tolerable grid spacing, given a constraint on the mean OK variance.
In the next code chunks, this procedure is used to compute the tolerable spacing of a square grid for mapping soil organic matter (SOM) in West-Amhara. The legacy data of the SOM concentration (dag kg⁻¹), used before to design a spatial infill sample (Section 17.3), are used here to estimate a semivariogram. A sample semivariogram is estimated by the method-of-moments (MoM), and a spherical model is fitted using functions of package gstat (Pebesma, 2004). The values for the partial sill, range, and nugget, passed to function fit.variogram with argument model, are guesses from an eyeball examination of the sample semivariogram obtained with function variogram, see Figure 22.1. The ultimate estimates of the semivariogram parameters differ from these eyeball estimates. First, the projected coordinates of the sampling points are changed from m into km using function mutate.
library(gstat)
grdAmhara <- grdAmhara %>%
mutate(s1 = s1 / 1000, s2 = s2 / 1000)
sampleAmhara <- sampleAmhara %>%
mutate(s1 = s1 / 1000, s2 = s2 / 1000)
coordinates(sampleAmhara) <- ~ s1 + s2
vg <- variogram(SOM ~ 1, data = sampleAmhara)
model_eye <- vgm(model = "Sph", psill = 0.6, range = 40, nugget = 0.6)
vgm_MoM <- fit.variogram(vg, model = model_eye)
library(geoR)
sampleAmhara <- as_tibble(sampleAmhara)
dGeoR <- as.geodata(
  obj = sampleAmhara, header = TRUE, coords.col = c("s1", "s2"),
  data.col = "SOM")
# the arguments after trend are reconstructed; the initial values are the
# eyeball estimates passed to vgm above
vgm_ML <- likfit(geodata = dGeoR, trend = "cte",
  cov.model = "spherical", ini.cov.pars = c(0.6, 40), nugget = 0.6,
  lik.method = "ML", messages = FALSE)
FIGURE 22.1: Sample semivariogram and fitted spherical model of the SOM
concentration in West-Amhara, estimated from the legacy data.
TABLE 22.1: MoM and ML estimates of the parameters of the spherical semivariogram of the SOM concentration in West-Amhara.

Parameter      MoM     ML
nugget         0.62    0.56
partial sill   0.56    0.68
range (km)     45.40   36.90
Table 22.1 shows the ML estimates of the parameters of the spherical semivari-
ogram, together with the MoM estimates. Either could be used in the following
steps.
To check whether the size of the simple random sample of evaluation points
is sufficiently large, we may estimate the standard error of the estimator of
the MKV, see Chapter 3, substituting the kriging variances at the evaluation
points for the study variable values.
set.seed(314)
mysample <- grdAmhara %>%
slice_sample(n = 5000, replace = TRUE) %>%
mutate(s1 = s1 %>% jitter(amount = 0.5),
s2 = s2 %>% jitter(amount = 0.5))
The R code below shows the next steps. Given a spacing, a square grid with a
fixed starting point is selected with function spsample, using argument offset.
A dummy variable is added to the data frame, having value 1 at all grid points,
but any other value is also fine. The predicted value at all evaluation points
equals 1. However, we are not interested in the predicted value but in the
kriging variance only, and we have seen in Chapter 21 that the kriging variance
is independent of the observations of the study variable. The ML estimates of
the semivariogram are used in function vgm to define a semivariogram model
of class variogramModel that can be handled by function krige. For each grid
spacing the population mean, median, and P90 of the kriging variance are
estimated from the evaluation sample. The estimated median and P90 can be
computed with function quantile.
coordinates(mysample) <- ~ s1 + s2
gridded(grdAmhara) <- ~ s1 + s2
spacing <- seq(from = 2, to = 12, by = 0.5)  # evaluated grid spacings in km (assumed values)
MKV_OK <- P50KV_OK <- P90KV_OK <- samplesize <-
  numeric(length = length(spacing))
vgm_ML_gstat <- vgm(model = "Sph", nugget = vgm_ML$nugget,
  psill = vgm_ML$sigmasq, range = vgm_ML$phi)
for (i in seq_len(length(spacing))) {
mygrid <- spsample(x = grdAmhara, cellsize = spacing[i],
type = "regular", offset = c(0.5, 0.5))
mygrid$dummy <- rep(1, length(mygrid))
samplesize[i] <- nrow(mygrid)
predictions <- krige(
formula = dummy ~ 1,
locations = mygrid,
newdata = mysample,
model = vgm_ML_gstat,
nmax = 100,
debug.level = 0)
MKV_OK[i] <- mean(predictions$var1.var)
P50KV_OK[i] <- quantile(predictions$var1.var, probs = 0.5)
P90KV_OK[i] <- quantile(predictions$var1.var, probs = 0.9)
}
dfKV_OK <- data.frame(spacing, samplesize, MKV_OK, P50KV_OK, P90KV_OK)
The estimated mean and quantiles of the kriging variance are plotted against
the grid spacing (Figure 22.2).
The tolerable grid spacing for the three quality indices can be computed with
function approx of the base package, as shown below for the median kriging
variance.
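A minimal sketch for the median kriging variance, with a target value of 0.8 (dag kg⁻¹)²:

# tolerable grid spacing by inverse linear interpolation of the curve of Figure 22.2
approx(x = dfKV_OK$P50KV_OK, y = dfKV_OK$spacing, xout = 0.8)$y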
For a mean kriging variance of 0.8 (dag kg⁻¹)² the tolerable grid spacing is 8.6 km. For the median kriging variance it is 9.2 km, which is somewhat larger, leading to a smaller sample size. The smaller tolerable grid spacing for the mean can be explained by the right-skewed distribution of the kriging variance, so that the mean kriging variance is larger than the median. For the P90 of the kriging variance, the tolerable grid spacing is much smaller, 6.8 km, leading to a much larger sample size.
FIGURE 22.2: Mean, median (P50), and 0.90 quantile (P90) of the ordinary
kriging variance of predictions of the SOM concentration in West-Amhara, as
a function of the spacing of a square grid.
Exercises
Figure 22.3 shows that the mean, P50, and P90 of the block-kriging predic-
tions are substantially smaller than those of the point-kriging predictions
(Figure 22.2). This can be explained by the large nugget of the semivari-
ogram (Table 22.1). The side length of a prediction block (100 m) is much
smaller than the range of the semivariogram (36.9 km), so that in this case
the mean semivariance within a prediction block is about equal to the nugget.
Roughly speaking, for a given grid spacing the mean point-kriging variance
is reduced by an amount about equal to this mean semivariance to yield the
mean block-kriging variance for this spacing (Section 21.2). Recall that the
mean semivariance within a block is a model-based prediction of the variance
within a block (Subsection 13.1.1, Equation (13.3)).
FIGURE 22.3: Mean, median (P50), and 0.90 quantile (P90) of the ordinary
block-kriging variance of predictions of the mean SOM concentration of blocks
of 100 m × 100 m, in West-Amhara, as a function of the spacing of a square
grid.
library(geoR)
dGeoR <- as.geodata(obj = sampleAmhara, header = TRUE,
  coords.col = c("s1", "s2"), data.col = "SOM",
  covar.col = c("dem", "rfl_NIR", "rfl_red", "lst"))
# the arguments after trend are reconstructed; the initial values are
# assumed eyeball estimates
vgm_REML <- likfit(geodata = dGeoR, trend = ~ dem + rfl_NIR + rfl_red + lst,
  cov.model = "spherical", ini.cov.pars = c(0.6, 40), nugget = 0.6,
  lik.method = "REML", messages = FALSE)
TABLE 22.2: ML estimates of the parameters of the spherical semivariogram of SOM, and REML estimates of the parameters of the spherical residual semivariogram.

Parameter      ML      REML
nugget         0.56    0.36
partial sill   0.68    0.44
range (km)     36.91   5.24
The total sill (partial sill + nugget) of the residual semivariogram, estimated
by REML, equals 0.80, which is considerably smaller than that of the ML
semivariogram of SOM (Table 22.2). A considerable part of the variance of
SOM is explained by the covariates. Besides, note the much smaller range
of the residual semivariogram. The smaller sill and range of the residual
semivariogram show that the spatial structure of SOM is largely captured by
the covariates. The residuals of the model-mean, which is a linear combination
of the covariates, no longer show much spatial structure.
The mean kriging variance as obtained with KED is used as the evaluation
criterion. With KED, the kriging variance is also a function of the values of
the covariates at the sampling locations and the prediction location (Section
21.3). Compared with the procedure above for OK, in the code chunk below,
a slightly different procedure is used. The square grid of a given spacing is
randomly placed on the area (option offset in function spsample is not used),
and this is repeated ten times.
R <- 10
MKV_KED <- matrix(nrow = length(spacing), ncol = R)
vgm_REML_gstat <- vgm(model = "Sph", nugget = vgm_REML$nugget,
psill = vgm_REML$sigmasq, range = vgm_REML$phi)
set.seed(314)
for (i in seq_len(length(spacing))) {
for (j in 1:R) {
mygrid <- spsample(x = grdAmhara, cellsize = spacing[i], type = "regular")
mygrid$dummy <- rep(1, length(mygrid))
mygrd <- data.frame(over(mygrid, grdAmhara), mygrid)
coordinates(mygrd) <- ~ x1 + x2
predictions <- krige(
formula = dummy ~ dem + rfl_NIR + rfl_red + lst,
locations = mygrd,
newdata = mysample,
model = vgm_REML_gstat,
nmax = 100,
debug.level = 0)
MKV_KED[i, j] <- mean(predictions$var1.var)
}
}
dfKV_KED <- data.frame(spacing, MKV_KED)
Figure 22.4 shows the mean kriging variances, obtained with OK and KED, as a function of the grid spacing. Interestingly, for grid spacings smaller than about 9 km, the mean kriging variance with KED is larger than with OK. In this case, only for larger grid spacings does KED outperform OK in terms of the mean kriging variance. Only for mean kriging variances larger than about 0.82 (dag kg⁻¹)² can we afford a larger grid spacing (smaller sample size) with KED than with OK. In other words, only with large spacings (small sample sizes) do we profit from modelling the mean as a linear function of covariates.

The tolerable grid spacing for a mean kriging variance of 0.8 (dag kg⁻¹)², using KED, equals 7.9 km.
Exercises
22.5 Bayesian approach

Uncertainty about the semivariogram parameters can be accounted for by a Bayesian approach, in which the posterior distribution of the parameters is obtained with Bayes' rule:

$$f(\boldsymbol{\theta} \mid \mathbf{z}) = \frac{f(\boldsymbol{\theta}) \, f(\mathbf{z} \mid \boldsymbol{\theta})}{f(\mathbf{z})} \,, \qquad (22.1)$$

with $f(\boldsymbol{\theta} \mid \mathbf{z})$ the posterior distribution function, i.e., the probability density function of the semivariogram parameters given the sample data, $f(\boldsymbol{\theta})$ our prior belief in the parameters specified by a probability density function, $f(\mathbf{z} \mid \boldsymbol{\theta})$ the likelihood of the data, and $f(\mathbf{z})$ the probability density function of the data. This probability density function $f(\mathbf{z})$ is hard to obtain.
Problems with analytical derivation of the posterior distribution are avoided
by selecting a large sample of units (vectors with semivariogram parameters)
from the posterior distribution through Markov chain Monte Carlo (MCMC)
sampling, see Subsection 13.1.3.
In a Bayesian approach, we must define the likelihood function of the data, see
Subsection 13.1.3. I assume that the SOM concentration data in West-Amhara
have a multivariate normal distribution, and that the spatial covariance of the
data can be modelled by a spherical model, see Subsection 21.4.2. The likelihood
is a function of the semivariogram parameters. Given a vector of semivariogram
parameters, the variance-covariance matrix of the data is computed from the
matrix with geographic distances between the sampling points. Inputs of the
loglikelihood function ll are the matrix with distances between the sampling
points, the design matrix X, and the vector with observations of the study
variable z, see Subsection 13.1.3.
D <- as.matrix(dist(sampleAmhara[,c("s1","s2")]))
X <- matrix(1, nrow(sampleAmhara), 1)
z <- sampleAmhara$SOM
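The definition of the log-likelihood function ll is not shown in this excerpt. The next code chunk is a sketch of a log-likelihood of the type described above, assuming that the parameter vector contains the nugget, partial sill, and range of a spherical semivariogram, and that the model-mean is profiled out by generalised least squares.

# sketch of the log-likelihood; the parameterisation is an assumption
ll <- function(thetas) {
  vgmodel <- vgm(model = "Sph", nugget = thetas[1], psill = thetas[2],
    range = thetas[3])
  C <- variogramLine(vgmodel, dist_vector = D, covariance = TRUE)
  XC <- solve(C, X)
  betaGLS <- solve(crossprod(X, XC), crossprod(XC, z))
  e <- z - as.numeric(X %*% betaGLS)
  logdetC <- as.numeric(determinant(C, logarithm = TRUE)$modulus)
  -0.5 * (logdetC + as.numeric(crossprod(e, solve(C, e))))
}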
library(BayesianTools)
# the bounds of the uniform priors for (nugget, partial sill, range) are
# assumptions; they are not shown in this excerpt
priors <- createUniformPrior(lower = c(0, 0, 0), upper = c(2, 2, 100))
setup <- createBayesianSetup(likelihood = ll, prior = priors)
set.seed(314)
res <- runMCMC(setup, sampler = "DEzs")
mcmcsample <- getSample(res, start = 1000, numSamples = 1000) %>%
  data.frame()
Table 22.3 shows the first ten units of the MCMC sample from the posterior
distribution of the semivariogram parameters.
The units of the MCMC sample (vectors with semivariogram parameters) are
used one-by-one to compute the average of the kriging variances at the simple
random sample of evaluation points.
For each unit in the MCMC sample, the tolerable grid spacing is computed for
a target MKV of 0.8. Figure 22.5 shows that for most sampled semivariograms
(MCMC sample units) the tolerable grid spacing equals 8 km, which roughly
corresponds with the tolerable grid spacing derived above for OK. For 165
sampled semivariograms, the tolerable grid spacing exceeds 12 km. However,
this grid spacing leads to a sample size that is too small for estimating the
semivariogram and kriging.
Finally, for each grid spacing, the proportion of MCMC samples with a MKV
smaller than or equal to the target MKV of 0.8 is computed. Figure 22.6 shows,
for instance, that if the MKV is required not to exceed a target MKV of 0.8
with a probability of 80%, the tolerable grid spacing is 6.6 km. With a grid
spacing of 8.6 km, as determined before, the probability that the MKV exceeds
0.8 is 54%.
TABLE 22.3: First ten units of a MCMC sample from the posterior distribu-
tion of the parameters of a spherical semivariogram for the SOM concentration
in West-Amhara.
23
Model-based optimisation of the sampling pattern

In spatial simulated annealing, an initial sample is iteratively improved. In each iteration, one location of the current sample is randomly selected, and this location is shifted to a random location within the neighbourhood of the selected location.
The minimisation criterion is computed for the proposed sample and compared with that of the current sample. If the criterion of the proposed sample is smaller, the proposed sample is accepted. If the criterion is larger, the proposed sample is accepted with a probability equal to

$$P = e^{-\Delta/T} \,, \qquad (23.1)$$

with $\Delta$ the increase of the minimisation criterion and $T$ a positive parameter referred to as the temperature.
The name of this parameter shows the link with annealing in metallurgy.
Annealing is a heat treatment of a material above its recrystallisation tem-
perature. Simulated annealing mimics the gradual cooling of metal alloys,
resulting in an optimum or near-optimum structure of the atoms in the alloy.
The larger the value of 𝑇, the larger the probability that a proposed sample
with a given increase of the criterion is accepted (Figure 23.1). The temperature
𝑇 is stepwise decreased during the optimisation: 𝑇u�+1 = 𝛼𝑇u� . In Figure 23.1
𝛼 equals 0.9. The effect of decreasing the temperature is that the acceptance
probability of worse samples decreases during the optimisation and approaches
0 towards the end of the optimisation. Note that the temperature remains
constant during a number of iterations, referred to as the chain length. In
Figure 23.1 this chain length equals 100 iterations. Finally, a stopping criterion
is required. Various stopping criteria are possible; one option is to set the
maximum number of chains with no improvement. 𝑇, 𝛼, the chain length, and
the stopping criterion are annealing schedule parameters that must be chosen
by the user.
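The acceptance rule of Equation (23.1) takes only a few lines of R. The next code chunk is a minimal sketch; the function and argument names are mine.

accept <- function(crit_proposed, crit_current, temperature) {
  delta <- crit_proposed - crit_current
  if (delta <= 0) return(TRUE)           # better samples are always accepted
  runif(1) < exp(-delta / temperature)   # worse samples accepted with probability P
}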
library(gstat)
res_nls <- nls(semivar ~ nugget + psill * (1 - exp(-h / range)),
start = list(nugget = 0.1, psill = 0.4, range = 200), weights = somnp)
vgm_lnECe <- vgm(model = "Exp", nugget = coef(res_nls)[1],
psill = coef(res_nls)[2], range = coef(res_nls)[3])
The estimated semivariogram parameters are shown in Table 23.1. The nugget-
to-sill ratio is about 1/4, and the effective range is about 575 m (three times
the distance parameter of an exponential model).
The coordinates of the sampling points are optimised with function optimMKV of
package spsann (Samuel-Rosa, 2019)1 . First, the candidate sampling points
are specified by the nodes of a grid discretising the population. As explained
hereafter, this does not necessarily imply that the population is treated as a
finite population. Next, the parameters of the annealing schedule are set. Note
that both the initial acceptance rate and the initial temperature are set, which
may seem weird as the acceptance rate is a function of the temperature, see
Equation (23.1). The optimisation stops when an initial temperature is chosen
leading to an acceptance rate outside the interval specified with argument
initial.acceptance. If the acceptance rate is smaller than the lower bound of the
interval, a larger value for the initial temperature must be chosen; if the rate
is larger than the upper bound, a smaller initial temperature must be chosen.
¹ At the moment of writing this book, package spsann is not available on CRAN. You can install spsann and its dependency pedometrics with remotes::install_github("samuel-rosa/spsann") and remotes::install_github("samuel-rosa/pedometrics").
library(spsann)
candi <- grdCRF[, c("x", "y")]
schedule <- scheduleSPSANN(
initial.acceptance = c(0.8,0.95),
initial.temperature = 0.004, temperature.decrease = 0.95,
chains = 500, chain.length = 2, stopping = 10, cellsize = 25)
set.seed(314)
res <- optimMKV(
points = 50, candi = candi,
vgm = vgm_lnECe, eqn = z ~ 1,
schedule = schedule, nmax = 20,
plotit = FALSE, track = TRUE)
mysample <- res$points
trace <- res$objective$energy
The spatial pattern of the sample in Figure 23.3 and the trace of the MKV in
Figure 23.4 suggest that we are close to the global optimum.
For comparison I also computed a spatial coverage sample of the same size. The spatial patterns of the two samples are quite similar (Figure 23.3). The MKV of the spatial coverage sample equals 0.2633 (dS m⁻¹)², whereas for the model-based sample the MKV equals 0.2642 (dS m⁻¹)². So, no gain in precision is achieved by the model-based optimisation of the sampling pattern compared to spatial coverage sampling. With cellsize = 0 the minimised MKV is slightly smaller: 0.2578 (dS m⁻¹)². This outcome is in agreement with the results reported by Brus et al. (2007).
Instead of the mean OK variance (MOKV), we may prefer to use some quantile of the cumulative distribution function of the OK variance as a minimisation criterion. For instance, if we use the 0.90 quantile as a criterion, we search for the sampling locations for which the 90th percentile (P90) of the OK variance is minimal. This can be done with function optimUSER of package spsann, which minimises a user-defined objective function. The next code chunk shows how this objective function can be defined and minimised.
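A minimal sketch, assuming that spsann passes the coordinates of the current sample in columns x and y of argument points, and that a SpatialPoints object esample with evaluation points is available; the function name objP90 is mine.

objP90 <- function(points, esample, vgmodel) {
  mysample <- data.frame(x = points[, "x"], y = points[, "y"], dummy = 1)
  coordinates(mysample) <- ~ x + y
  predictions <- krige(dummy ~ 1, locations = mysample, newdata = esample,
    model = vgmodel, nmax = 20, debug.level = 0)
  as.numeric(quantile(predictions$var1.var, probs = 0.9))  # P90 of OK variance
}
set.seed(314)
res <- optimUSER(
  points = 50, fun = objP90, candi = candi,
  esample = esample, vgmodel = vgm_lnECe,
  schedule = schedule, plotit = FALSE, track = FALSE)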
Exercises
To solve this problem, I jittered the coordinates of the sampling points by a small amount.
library(geoR)
sampleCRF$lnEM100 <- log(sampleCRF$EMv1m)
sampleCRF$x <- jitter(sampleCRF$x, amount = 0.001)
sampleCRF$y <- jitter(sampleCRF$y, amount = 0.001)
dGeoR <- as.geodata(obj = sampleCRF, header = TRUE,
coords.col = c("x", "y"), data.col = "lnECe", covar.col = "lnEM100")
vgm_REML <- likfit(geodata = dGeoR, trend = ~ lnEM100,
cov.model = "exponential", ini.cov.pars = c(0.1, 200),
nugget = 0.1, lik.method = "REML", messages = FALSE)
# convert the REML estimates to a gstat variogram model (reconstructed line,
# following the pattern used for OK above)
vgm_REML_gstat <- vgm(model = "Exp", nugget = vgm_REML$nugget,
  psill = vgm_REML$sigmasq, range = vgm_REML$phi)
set.seed(314)
res <- optimMKV(
  points = 50, candi = candi, covars = grdCRF,
  vgm = vgm_REML_gstat, eqn = z ~ lnEM100cm,
  schedule = schedule, nmax = 20,
  plotit = FALSE, track = FALSE)
FIGURE 23.5: Optimised sampling pattern for KED of lnECe at the Cotton
Research Farm, using lnEM100cm as a covariate.
The spatial pattern of the optimised sample can be explained by noting that the variance of the KED prediction error can be decomposed into two components: the variance of the interpolated residuals and the variance of the estimator of the model-mean, see Section 21.3. The contribution of the first variance component is minimised through geographical spreading, that of the second component by selecting locations with covariate values near the minimum and maximum.
A sample with only covariate values close to the minimum and maximum is not desirable if we do not want to rely on the assumption of a linear relation between the study variable and the covariates. To identify a non-linear relation, locations with intermediate covariate values are needed. Optimisation using a semivariogram with clear spatial structure leads to geographical spreading of the sampling units, so that most likely locations with intermediate covariate values are selected as well.
When one or more covariates are used in optimisation of the sampling pattern
but not used in KED once the data are collected, the sample is suboptimal for
the model used in prediction. Inversely, ignoring a covariate in optimisation
of the sampling pattern while using this covariate as a predictor also leads to
suboptimal samples. The selection of covariates to be used in sampling design
therefore should be done with care. Besides, as we will see in the next exercise,
the nugget of the residual semivariogram has a strong effect on the optimised
sampling pattern, stressing the importance of a reliable prior estimate of this
semivariogram parameter.
Exercises
library(sp)
coordinates(sampleAmhara) <- ~ s1 + s2
legacy <- remove.duplicates(sampleAmhara, zero = 1, remove.second = TRUE)
pnts <- list(fixed = coordinates(legacy), free = 100)
candi <- grdAmhara[, c("s1", "s2")]
names(candi) <- c("x", "y")
The number of points used in kriging can be passed to function optimMKV with
argument nmax.
set.seed(314)
vgm_ML_gstat <- vgm(model = "Sph", psill = vgm_ML$sigmasq,
range = vgm_ML$phi, nugget = vgm_ML$nugget)
res <- optimMKV(
points = pnts, candi = candi,
vgm = vgm_ML_gstat, eqn = z ~ 1,
nmax = 20, schedule = schedule, track = FALSE)
infillSample <- res$points %>%
filter(free == 1)
Figure 23.7 shows a model-based infill sample of 100 points for OK of the soil organic matter (SOM) concentration (dag kg⁻¹) throughout West-Amhara. Comparison of the model-based infill sample with the spatial infill sample of Figure 17.5 shows that in a wider zone on both sides of the roads no new sampling points are selected. This can be explained by the large range, 36.9 km, of the semivariogram.
23.5 Model-based infill sampling for kriging with an external drift

For mapping by KED, locations with extreme values of the covariates are preferably selected. In Section 22.4 the legacy data were used to estimate the residual semivariogram by REML, see Table 22.2. In the next code chunk, the estimated parameters of the residual semivariogram are used to optimise the spatial pattern of an infill sample of 100 points for mapping the SOM concentration throughout West-Amhara by KED, using elevation (dem), NIR-reflectance (rfl_NIR), red-reflectance (rfl_red), and land surface temperature (lst) as predictors for the model-mean.
Figure 23.8 shows the optimised sample. Again the legacy points are avoided,
but the infill sampling of the undersampled areas is less uniform compared to
Figure 23.7. Spreading in geographical space is less important than with OK
because the residual semivariogram has a much smaller range (Table 22.2).
Spreading in covariate space does not play any role with OK, whereas with
KED selecting locations with extreme values for the covariates is important to
minimise the uncertainty about the estimated model-mean.
The MKV of the optimised sample equals 0.878 (dag kg⁻¹)², which is somewhat larger than the sill (sum of nugget and partial sill) of the residual semivariogram (Table 22.2). The range of this semivariogram is very small, so that, ignoring the uncertainty about the model-mean, the kriging variance at nearly all locations in the study area equals the sill. Because we are in addition uncertain about the model-mean, the MKV can be larger than the sill.
FIGURE 23.8: Model-based infill sample for KED of the SOM concentration
throughout West-Amhara, plotted on a map of one of the covariates. Legacy
units have free-value 0; infill units have free-value 1.
24
Sampling for estimating the semivariogram
The multiplier of the separation distances, which is four in this example, should not be too small; as a rule of thumb, use three or larger.
There are two versions of nested sampling. In the first stage of the first version,
several main stations are selected in a way that they cover the study area
well, for instance by spatial coverage sampling. In the second stage, each of
the main stations is used as a starting point to select one point at a distance
equal to the largest chosen separation distance (512 m in the example) in a
random direction from the main station. This doubles the sample size. In the
third stage, all points selected in the previous stages (main stations of stage
1 plus the points of stage 2) are used as starting points to select one point
at a distance equal to the second largest separation distance (128 m), and so
on. All points selected in the various stages are included in the nested sample.
The code chunk below shows the function for random selection of one point
at distance ℎ from a starting point. Note the while loop which continues until
a point is found that is inside the area. This is checked with function over of
package sp.
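The implementation below is a sketch; the function name and interface are mine, and area is assumed to be a SpatialPixelsDataFrame with the same (or no) coordinate reference system as the points.

SelectPoint <- function(start, h, area) {
  inArea <- FALSE
  while (!inArea) {
    angle <- runif(n = 1, min = 0, max = 2 * pi)       # random direction
    pnt <- start + h * c(sin(angle), cos(angle))       # point at distance h
    pntsp <- SpatialPoints(matrix(pnt, nrow = 1))
    inArea <- !is.na(over(pntsp, area)[1, 1])          # check: inside the area?
  }
  pnt
}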
The first stage of the second version is equal to that of the first version. However, in the second stage each of the main stations serves as a starting point for randomly selecting a pair of points with a separation distance equal to the largest chosen separation distance. The main station is halfway between the two points of the selected pair. In the third stage, each of the substations is used to select in the same way a pair of points separated by the second largest chosen distance, and so on. Only the points selected in the final stage are used as sampling points. The R code below shows the function for random selection of two points separated by ℎ distance units, with the starting point halfway between the pair of points. The while loop continues until both points of a pair are inside the area.
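Again a sketch (names are mine), now returning both points of the pair, with the starting point halfway between them:

SelectPair <- function(start, h, area) {
  inArea <- FALSE
  while (!inArea) {
    angle <- runif(n = 1, min = 0, max = 2 * pi)
    step <- 0.5 * h * c(sin(angle), cos(angle))
    pnts <- rbind(start + step, start - step)          # two points, h apart
    inArea <- all(!is.na(over(SpatialPoints(pnts), area)[, 1]))
  }
  pnts
}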
The R code below shows the selection of a nested sample from Hunter Valley using both versions. Only one main station is selected. In total 16 points are selected in four stages. The separation distances are 2,000, 1,000, 500, and 250 m. Sixteen points is not enough for estimating the semivariogram and a multiplier of two is rather small, but this example is for illustrative purposes only.
Note that the separation distances are in descending order. The largest separation distance should not be chosen too large because, when the main station is somewhere in the middle of the study area, it may then happen with the first version that no pair can be found with that separation distance. A similar problem may occur with the second version when in subsequent stages a station is selected near the border of the study area. A copy of grdHunterValley is made because both the original data frame and a gridded version of it are needed.
library(sp)
grid <- grdHunterValley
gridded(grdHunterValley) <- ~ s1 + s2
lags <- c(2000, 1000, 500, 250)
set.seed(614)
unit <- sample(nrow(grid), 1)
mainstation <- grid[unit, c("s1", "s2")]
The R code for the second version is presented in the next code chunk.
Figure 24.1 shows the two selected nested samples. For the sample selected
with the second version, also the stations that served as starting points for the
selection of the point-pairs are plotted.
The samples of Figure 24.1 are examples of balanced nested samples. The
number of point-pairs separated by a given distance doubles with every stage.
As a consequence, the estimated semivariances for the smallest separation
distance are much more precise than for the largest distance. We are most
uncertain about the estimated semivariances for the largest separation distances.
If in the first stage only one point-pair separated by the largest distance is selected, then we have only one degree of freedom for estimating the variance component associated with this stage. It is more efficient to select more than one main station, say about ten, and to select fewer points in the final stages. For instance, with the second version we may decide to select a point-pair at only half the number of stations selected in the one-but-last stage. The nested sample then becomes unbalanced.

FIGURE 24.1: Balanced nested samples from Hunter Valley, selected with the two versions of nested sampling. In the subfigure of the second version the selected sampling points (symbol x) are plotted together with the selected stations (halfway between the two points of a pair).
The model for nested sampling with four stages is a hierarchical analysis of variance (ANOVA) model with random effects:

$$Z_{ijkl} = \mu + A_i + B_{ij} + C_{ijk} + \epsilon_{ijkl} \,, \qquad (24.1)$$

with $\mu$ the mean, $A_i$ the effect of the $i$th first stage station, $B_{ij}$ the effect of the $j$th second stage station within the $i$th first stage station, and so on. $A_i$, $B_{ij}$, $C_{ijk}$, and $\epsilon_{ijkl}$ are random quantities (random effects), all with zero mean and variances $\sigma_1^2$, $\sigma_2^2$, $\sigma_3^2$, and $\sigma_4^2$, respectively.
For balanced designs, the variance components can be estimated by the MoM
from a hierarchical ANOVA. The first step is to assign factors to the sampling
points that indicate the grouping of the sampling points in the various stages.
The number of factors needed is the number of stages minus 1. All factors have
two levels. Figures 24.2 and 24.3 show the levels of the three factors. The levels
of the first factor show the strongest spatial clustering, those of the second
factor the one-but-strongest, and so on.
FIGURE 24.2: The levels of the three factors assigned to the sampling points
of the balanced nested sample selected with the first version.
The R code below shows the construction of the three factors for the second
version of nested sampling.
FIGURE 24.3: The levels of the three factors assigned to the sampling points
of the balanced nested sample selected with the second version.
library(nlme)
# a sketch of the call; the data frame and factor names are assumptions,
# as the full call is not shown in this excerpt
lmodel <- lme(z ~ 1, data = mysample_nested,
  random = ~ 1 | factor1 / factor2 / factor3)
Exercises
In sampling with independent point-pairs (IPP), the first point of a pair is randomly selected from the study area. Then the second point is randomly selected from the circle with the first point at its centre and a radius equal to the chosen separation distance. If this second point is outside the study area, both points are discarded. This is repeated until we have the required point-pairs for this separation distance. The next code chunk is an implementation of this selection procedure.
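The implementation below is a sketch; the function name SIpairs matches its use further on, but the body and the names of the output columns are assumptions, and area is assumed to be a SpatialPixelsDataFrame.

SIpairs <- function(h, n, area) {
  pairs <- NULL
  while (is.null(pairs) || nrow(pairs) < n) {
    unit <- sample(length(area), 1)
    pnt1 <- as.numeric(coordinates(area)[unit, ])      # first point of the pair
    angle <- runif(n = 1, min = 0, max = 2 * pi)
    pnt2 <- pnt1 + h * c(sin(angle), cos(angle))       # second point on the circle
    inArea <- !is.na(over(SpatialPoints(matrix(pnt2, nrow = 1)), area)[1, 1])
    if (inArea) {                                      # otherwise discard the pair
      pairs <- rbind(pairs, c(pnt1, pnt2))
    }
  }
  data.frame(s1a = pairs[, 1], s2a = pairs[, 2],
             s1b = pairs[, 3], s2b = pairs[, 4])
}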
IPP sampling is illustrated with the compound topographic index (cti, which is
the same as topographic wetness index) data of Hunter Valley. Five separation
distances are chosen, collected in numeric h, and for each distance 𝑛 = 100
point-pairs are selected by simple random sampling.
library(sp)
h <- c(50, 100, 200, 500, 1000)
n <- 100
set.seed(123)
allpairs <- NULL
for (i in seq_len(length(h))) {
pairs <- SIpairs(h = h[i], n = n, area = grdHunterValley)
allpairs <- rbind(allpairs, pairs, make.row.names = FALSE)
}
The data.frame allpairs has four variables: the spatial coordinates of the first
and of the second point of a pair. An overlay is made of the selected points
with the SpatialPixelsDataFrame, and the cti values are extracted.
The semivariances for the chosen separation distances are estimated as well as
the variance of these estimated semivariances.
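A sketch of these computations, assuming a data frame mysample with, for each pair, the separation distance h and the extracted cti values z1 and z2:

gammah <- vgammah <- numeric(length(h))
for (i in seq_len(length(h))) {
  units <- which(mysample$h == h[i])
  d2 <- (mysample$z1[units] - mysample$z2[units])^2   # squared differences
  gammah[i] <- mean(d2, na.rm = TRUE) / 2             # MoM estimate of semivariance
  vgammah[i] <- var(d2, na.rm = TRUE) / (4 * n)       # variance of estimated semivariance
}
sample_vg <- data.frame(h, gammah, vgammah)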
Figure 24.4 shows the sample semivariogram and the fitted model.
# the opening lines of this bootstrap loop are reconstructed; the number of
# bootstrap replicates (500) is an assumption, as it is not shown here
allpars <- NULL
for (b in 1:500) {
  gammah <- vgammah <- numeric(length(h))
  for (i in seq_len(length(h))) {
    units <- which(mysample$h == h[i])
    mysam_btsp <- mysample[units, ] %>%
      slice_sample(n = n, replace = TRUE)
    gammah[i] <- mean((mysam_btsp$z1 - mysam_btsp$z2)^2, na.rm = TRUE) / 2
    vgammah[i] <- var((mysam_btsp$z1 - mysam_btsp$z2)^2, na.rm = TRUE) / (n * 4)
  }
  sample_vg <- data.frame(h, gammah, vgammah)
  tryCatch({
    fittedvariogram <- nls(gammah ~ SphNug(h, range, psill, nugget),
      data = sample_vg, start = list(psill = 4, range = 200, nugget = 1),
      weights = 1 / vgammah, algorithm = "port", lower = c(0, 0, 0))
    pars <- coef(fittedvariogram)
    allpars <- rbind(allpars, pars)}, error = function(e) {})
}
# compute the variance-covariance matrix of the estimated parameters
signif(var(allpars), 3)
Note the large variance for the range parameter (the standard deviation is
258 m) as well as the negative covariance of the nugget and the partial sill
parameter (the Pearson correlation coefficient is -0.72). Histograms of the three
estimated semivariogram parameters are shown in Figure 24.5.
Marcelli et al. (2019) show how a probability sample of points (instead of pairs
of points) can be used in design-based estimation of the semivariogram. From
the 𝑛 randomly selected points all 𝑛(𝑛−1)/2 point-pairs are constructed. The
second-order inclusion probabilities of these point-pairs are used to estimate
the mean semivariance for separation distance classes. This sampling strategy
makes better use of the data and is therefore potentially more efficient than
IPP sampling.
Exercises
Müller and Zimmerman (1999) as well as Bogaert and Russo (1999) proposed as a minimisation criterion the determinant of the variance-covariance matrix of the semivariogram parameters, estimated by generalised least squares fitting of the MoM sample semivariogram. For instance, if we have two semivariogram parameters, θ₁ and θ₂, the determinant of the 2 × 2 variance-covariance matrix equals the product of the variances of the two estimated parameters minus the square of their covariance. If the two estimated parameters are correlated, the determinant is smaller than if they are uncorrelated, in which case the covariance term is zero. The determinant is a measure of our joint uncertainty about the semivariogram parameters.
Zhu and Stein (2005) proposed as a minimisation criterion the log of the
determinant of the inverse Fisher information matrix in ML estimation of
the semivariogram, hereafter shortly denoted by logdet. The Fisher informa-
tion about a semivariogram parameter is a function of the likelihood of the
semivariogram parameter; the likelihood of a semivariogram parameter is the
probability of the data as a function of the semivariogram parameter. The log
of this likelihood can be plotted against values of the parameter. The flatter the
log-likelihood surface, the less information is in the data about the parameter.
The flatness of the surface can be measured by the first derivative of the log-likelihood with respect to the semivariogram parameter. Strong negative or positive derivative values indicate a steep surface. The Fisher information for a model parameter is defined as the expectation of the square of the first derivative of the log-likelihood with respect to that parameter, see Ly et al. (2017) for a nice tutorial on this subject. The more information we have about a semivariogram parameter, the less uncertain we are about that parameter. This explains why the inverse of the Fisher information can be used as a measure of uncertainty. The inverse Fisher information matrix contains the variances and covariances of the estimated semivariogram parameters.
The code chunks hereafter show how logdet can be computed. They make use of the result of Kitanidis (1987), who showed that each element of the Fisher information matrix $\mathbf{I}(\boldsymbol{\theta})$ can be obtained with (see also Lark (2002))

$$[\mathbf{I}(\boldsymbol{\theta})]_{ij} = \frac{1}{2} \, \mathrm{Tr}\!\left[ \mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \theta_i} \, \mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \theta_j} \right] \,, \qquad (24.2)$$

with $\mathbf{A}$ the correlation matrix of the sample data and $\mathrm{Tr}[\cdot]$ the trace of a matrix.
library(sp)
library(gstat)
set.seed(314)
mysample0 <- grdHunterValley %>%
slice_sample(n = 50)
coordinates(mysample0) <- ~ s1 + s2
D <- spDists(mysample0)
xi <- 0.8; phi <- 200
thetas <- c(xi, phi)
vgmodel <- vgm(model = "Exp", psill = thetas[1],
range = thetas[2], nugget = 1 - thetas[1])
A <- variogramLine(vgmodel, dist_vector = D, covariance = TRUE)
In the next step, the semivariogram parameters are slightly changed one-by-
one. The changes, referred to as perturbations, are a small fraction of the
preliminary semivariogram parameter values. The perturbed semivariogram
parameters are used to compute the perturbed correlation matrices (pA) and
the partial derivatives of the correlation matrix (dA) for each perturbation.
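The next code chunk is a sketch of this perturbation step and of the resulting Fisher information matrix of Equation (24.2); the perturbation fraction of 1% is an example value.

delta <- 0.01
pA <- dA <- list()
for (k in seq_len(length(thetas))) {
  pthetas <- thetas
  pthetas[k] <- thetas[k] * (1 + delta)                # perturb parameter k
  pvgmodel <- vgm(model = "Exp", psill = pthetas[1],
    range = pthetas[2], nugget = 1 - pthetas[1])
  pA[[k]] <- variogramLine(pvgmodel, dist_vector = D, covariance = TRUE)
  dA[[k]] <- (pA[[k]] - A) / (thetas[k] * delta)       # finite-difference derivative
}
Ainv <- solve(A)
I_F <- matrix(0, nrow = length(thetas), ncol = length(thetas))
for (k in seq_len(length(thetas))) {
  for (l in seq_len(length(thetas))) {
    I_F[k, l] <- 0.5 * sum(diag(Ainv %*% dA[[k]] %*% Ainv %*% dA[[l]]))  # Equation (24.2)
  }
}
logdet_SI <- log(det(solve(I_F)))                      # criterion for this sample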
gridded(grdHunterValley) <- ~ s1 + s2
candi <- spsample(grdHunterValley, type = "regular",
cellsize = c(50, 50), offset = c(0.5, 0.5))
candi <- as.data.frame(candi)
names(candi) <- c("x", "y")
schedule <- scheduleSPSANN(
initial.acceptance = c(0.8, 0.95),
initial.temperature = 0.15, temperature.decrease = 0.9,
chains = 300, chain.length = 10, stopping = 10,
x.min = 0, y.min = 0, cellsize = 50)
set.seed(314)
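# The call to optimUSER is not shown in this excerpt. The call below is a
# sketch; it assumes an objective function logdet() implemented along the
# lines of the computations above
res <- optimUSER(
  points = 50, fun = logdet, candi = candi,
  schedule = schedule, plotit = FALSE, track = FALSE)
mysample <- res$points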
Figure 24.6 shows the optimised sampling pattern of 50 points. The logdet of
the optimised sample equals 3.548, which is 43% of the value of the simple
random sample used above to illustrate the computations. The optimised
sample consists of two clusters. There are quite a few point-pairs with nearly
coinciding points.
library(spcosa)
gridded(grdHunterValley) <- ~ s1 + s2
set.seed(314)
# spatial coverage sample, used for prediction at the evaluation points
mystrata <- stratify(grdHunterValley, nStrata = 100, equalArea = FALSE, nTry = 10)
mysample_SC <- as(spsample(mystrata), "SpatialPoints")
# square grid of 200 evaluation points
mysample_eval <- spsample(
  x = grdHunterValley, n = 200, type = "regular", offset = c(0.5, 0.5))
The following code chunks show how VKV at the evaluation points is computed.
First, the correlation matrix of the spatial coverage sample (A) is computed,
as well as the correlation matrix of the spatial coverage sample and the
evaluation points (A0). Correlation matrix A is extended with a column and a
row with ones, see Equation (21.5).
D <- spDists(mysample_SC)
vgmodel <- vgm(model = "Exp", psill = thetas[1], range = thetas[2],
  nugget = 1 - thetas[1])
A <- variogramLine(vgmodel, dist_vector = D, covariance = TRUE)
nobs <- length(mysample_SC)
# extend A with a row and a column of ones (kriging system, Equation (21.5))
B <- matrix(data = 1, nrow = nobs + 1, ncol = nobs + 1)
B[1:nobs, 1:nobs] <- A
B[nobs + 1, nobs + 1] <- 0
# correlations between the evaluation points and the sampling points
D0 <- spDists(x = mysample_eval, y = mysample_SC)
A0 <- variogramLine(vgmodel, dist_vector = D0, covariance = TRUE)
b <- cbind(A0, 1)
Next, the semivariogram parameters are perturbed one-by-one, and the per-
turbed correlation matrices pA and pA0 are computed.
Next, the kriging variance and the perturbed kriging variances are computed,
and the partial derivatives of the kriging variance with respect to the semivari-
ogram parameters are approximated. See Equations (21.7) and (21.8) for how
the kriging weights l and the kriging variance var are computed.
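These chunks are not reproduced here. A sketch of the computations, assuming lists pB and pb with the perturbed versions of B and b per semivariogram parameter (hypothetical names), and the perturbation fraction used before:

L <- solve(B, t(b))           # kriging weights plus Lagrange multiplier
var <- 1 - colSums(t(b) * L)  # kriging variance, Equation (21.8), unit sill
dvar <- list()
for (i in seq_along(thetas)) {
  pL <- solve(pB[[i]], t(pb[[i]]))
  pvar <- 1 - colSums(t(pb[[i]]) * pL)
  # finite-difference partial derivative of the kriging variance
  dvar[[i]] <- (pvar - var) / (perturbation * thetas[i])
}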
Finally, the partial derivatives of the kriging variance are used to approximate
VKV at the 200 evaluation points (Equation (24.3)). For this, the variances and
covariances of the estimated semivariogram parameters are needed, estimated
by the inverse of the Fisher information matrix.
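A sketch of this final step, assuming the Fisher information matrix I from the earlier sketch and the list dvar with the partial derivatives of the kriging variance:

invI <- solve(I)  # variance-covariance matrix of the estimated parameters
VKV <- 0
for (i in seq_along(dvar)) {
  for (j in seq_along(dvar)) {
    # Equation (24.3), accumulated over all pairs of parameters
    VKV <- VKV + invI[i, j] * dvar[[i]] * dvar[[j]]
  }
}
MVKV0 <- mean(VKV)  # mean VKV over the evaluation points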
For the simple random sample, the square root of MVKV equals 0.223. The
mean kriging variance (MKV) at these points equals 0.787, so the uncer-
tainty about the kriging variance is substantial. Hereafter, we will see how
much MVKV can be reduced by optimising the sampling pattern with spatial
simulated annealing.
As for logdet, the sample with the minimum value of MVKV can be searched for
with spsann function optimUSER. The objective function MVKV is defined in
package sswr. Argument points specifies the size of the sample for
semivariogram estimation, argument psample specifies the sample used for
prediction at the evaluation points (after the second round of sampling), and
argument esample specifies the sample with the evaluation points for
estimating MVKV. The optimisation requires substantial computing time: with
200 evaluation points and the annealing schedule specified above, the
computing time was 46.25 minutes (processor AMD Ryzen 5, 16 GB RAM).
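A sketch of the call, using the argument names described above; the extra arguments passed on to the objective function (model, thetas, perturbation) are assumptions about the interface of sswr:

set.seed(314)
res <- optimUSER(
  points = 50, fun = MVKV,
  psample = mysample_SC, esample = mysample_eval,
  model = "Exp", thetas = thetas, perturbation = 0.01,
  candi = candi, schedule = schedule, track = TRUE)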
Figure 24.6 shows the optimised sample. The minimised value of MVKV is 29%
of the value of the simple random sample used to illustrate the computations.
The optimised sample points are clustered in an ellipse.
Both minimisation criteria, logdet and MVKV, are a function of the semivariogram
parameters $\theta$, showing that the problem is circular. Using a preliminary
estimate of the semivariogram parameters, $\hat{\theta}$, leads to a locally
optimal design at $\hat{\theta}$. For this reason, Bogaert and Russo (1999) and Zhu and Stein (2005)
proposed a Bayesian approach, in which a multivariate prior distribution for
the semivariogram parameters is postulated. The expected value over this
distribution of the criterion is minimised. Lark (2002) computed the average
of VKV over a number of semivariograms.
Both methods for sample optimisation rely, among other things, on the assumption
that the mean and the variance are constant throughout the area. Under this
assumption, it is no problem that the sampling units are spatially clustered:
we assume that the semivariogram estimated from data collected in a small
portion of the study area is representative of the whole study area. If
we do not feel comfortable with this assumption, spreading the sampling units
throughout the study area by the sampling methods described in the next two
sections can be a good option.
Exercises
with $V_\mathrm{OK}(\mathbf{s}_0)$ the ordinary kriging variance, see Equation (21.8), and $\mathrm{E}[\tau^2(\mathbf{s}_0)]$
the expectation of the additional variance component due to uncertainty about
the semivariogram parameters estimated by ML. The additional variance is given
by Equation (24.5), with $\partial\boldsymbol{\lambda}/\partial\theta_j$ the vector of partial derivatives of the
kriging weights with respect to the $j$th semivariogram parameter. Comparing
Equations (24.5) and (24.3)
shows that the two variances differ. VKV quantifies our uncertainty about the
estimated kriging variance, whereas E[𝜏2 ] quantifies our uncertainty about the
kriging prediction due to uncertainty about the semivariogram parameters.
I use the mean of the AKV over the nodes of a prediction grid (evaluation
grid) as a minimisation criterion (MAKV). The same criterion can also be
used in situations where we have maps of covariates that we want to use in
prediction. In that case, the aim is to design a single sample that is used both
for estimation of the residual semivariogram and for prediction by kriging with
an external drift. The ordinary kriging variance $V_\mathrm{OK}(\mathbf{s}_0)$ in Equation (24.4)
is then replaced by the prediction error variance of kriging with an external
drift, $V_\mathrm{KED}(\mathbf{s}_0)$, see Equation (21.20).
Zhu and Stein (2006) proposed as a minimisation criterion a linear combination
of AKV (Equation (24.4)) and VKV (Equation (24.3)), referred to as the
estimation adjusted criterion (EAC):
$$EAC(\mathbf{s}_0) = AKV(\mathbf{s}_0) + \frac{1}{2\,V_\mathrm{OK}(\mathbf{s}_0)}\,VKV(\mathbf{s}_0) . \tag{24.6}$$
Again, the mean of the EAC values (MEAC) over the nodes of a prediction
grid (evaluation grid) is used as a minimisation criterion.
Computing time for optimisation of the coordinates of a large sample, say, > 50
points, can become prohibitively long. To reduce computing time, Zhu and Stein
(2006) proposed a two-step approach. In the first step, for a fixed proportion
𝑝 ∈ (0, 1) the locations of (1 − 𝑝) 𝑛 points are optimised for prediction with
given parameters, for instance by minimising MKV. This ‘prediction sample’
is supplemented with 𝑝 𝑛 points, so that the two combined samples of size 𝑛
minimise logdet or MVKV. This is repeated for different values of 𝑝. In the
second step, MEAC is computed for the combined samples of size 𝑛, and the
proportion and the associated sample with minimum MEAC are selected.
A simplification of this two-step approach is to select in the first step a square
grid or a spatial coverage sample (Section 17.2), and to supplement this
sample by a fixed number of points whose coordinates are optimised by spatial
simulated annealing (SSA), using either MAKV or MEAC computed from both
samples (grid sample or spatial coverage sample plus supplemental sample)
as a minimisation criterion. In SSA the grid or spatial coverage sample is
fixed, i.e., the locations are not further optimised. Lark and Marchant (2018)
studied how a spatial coverage sample can best be supplemented by close
point-pairs for this purpose.
library(spcosa)
set.seed(314)
# spatial coverage sample of 90 points
mystrata <- stratify(grdHunterValley, nStrata = 90, equalArea = FALSE, nTry = 10)
mysample_SC <- as(spsample(mystrata), "SpatialPoints")
# initial supplemental sample of 10 points, selected by simple random sampling
nsup <- 10
units <- sample(nrow(grdHunterValley), nsup)
mysample_sup0 <- as(grdHunterValley[units, ], "SpatialPoints")
# square grid of 200 evaluation points
mysample_eval <- spsample(
  x = grdHunterValley, n = 200, type = "regular", offset = c(0.5, 0.5))
The next step is to compute the inverse of the Fisher information matrix, given
a preliminary semivariogram model; this inverse is used as the
variance-covariance matrix of the estimated semivariogram parameters. Contrary
to Section 24.3, all sampling locations are now used to compute this matrix.
The locations of the spatial coverage sample and the supplemental sample are
merged into one SpatialPoints object.
To learn how the Fisher information matrix is computed, refer to the code
chunks in Section 24.3. The inverse of this matrix can be computed with
function solve.
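A minimal sketch, assuming the Fisher information matrix I is computed from the merged sample exactly as in the sketch of Section 24.3:

# merge the spatial coverage sample and the supplemental sample
mysample_all <- rbind(mysample_SC, mysample_sup0)
# I: Fisher information matrix computed from mysample_all (Section 24.3)
invI <- solve(I)  # variance-covariance matrix of the estimated parameters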
In the next code chunk, for each evaluation point the kriging weights (L), the
kriging variance (var), the perturbed kriging weights (pL), and the perturbed
kriging variances (pvar) are computed. In the final lines, the partial derivatives
of the kriging weights (dL) and the kriging variances (dvar) with respect to
the semivariogram parameters are computed. The partial derivatives of the
kriging variances with respect to the semivariogram parameters are needed for
computing VKV, see Equation (24.3), which in turn is needed for computing
criterion EAC, see Equation (24.6).
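That chunk is not reproduced in full here. A sketch, assuming B and b are now built from mysample_all and the evaluation points, and that pB and pb are lists with their perturbed counterparts per parameter (hypothetical names):

L <- solve(B, t(b))           # kriging weights plus Lagrange multiplier
var <- 1 - colSums(t(b) * L)  # kriging variance per evaluation point
dL <- list(); dvar <- list()
for (i in seq_along(thetas)) {
  pL <- solve(pB[[i]], t(pb[[i]]))
  pvar <- 1 - colSums(t(pb[[i]]) * pL)
  # finite-difference partial derivatives of weights and variances
  dL[[i]] <- (pL - L) / (perturbation * thetas[i])
  dvar[[i]] <- (pvar - var) / (perturbation * thetas[i])
}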
In the next code chunk, the expected variance due to uncertainty about the
semivariogram parameters (Equation (24.5)) is computed.
The AKVs are computed by adding the kriging variances and the extra variances
due to semivariogram uncertainty (Equation (24.4)). The VKV values and
the EAC values are computed. Both the AKV and the EAC differ among the
evaluation points. As a summary, the mean of the two variables is computed.
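The chunk computing Equation (24.5) is not reproduced above. A sketch of the AKV step, assuming a vector tau2 (hypothetical name) holding the values of Equation (24.5) at the evaluation points; the chunk below then computes VKV and EAC:

augmentedvar <- var + tau2   # AKV per evaluation point, Equation (24.4)
MAKV0 <- mean(augmentedvar)  # mean AKV over the evaluation points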
VKV <- 0
for (i in seq_along(dvar)) {
  for (j in seq_along(dvar)) {
    # accumulate Equation (24.3) over all pairs of semivariogram parameters
    VKV <- VKV + invI[i, j] * dvar[[i]] * dvar[[j]]
  }
}
EAC <- augmentedvar + (VKV / (2 * var))  # Equation (24.6)
MEAC0 <- mean(EAC)
Figure 24.7 shows for Hunter Valley a spatial coverage sample of 90 points,
supplemented by 10 points optimised by SSA, using MAKV and MEAC as
minimisation criteria.
The frequency distribution of the shortest distance to the spatial coverage
sample is shown in Figure 24.8. With both criteria there are several
supplemental points at very short distance from a point of the spatial coverage
sample. The remaining points are at large distances from the spatial coverage
sample points. The average distance between neighbouring spatial coverage
sampling points equals 381 m.
MAKV of the optimised sample equals 0.795, which is 94% of the MAKV of the
initial sample. MEAC of the optimised sample equals 0.808, which is 91% of the
MEAC of the initial sample. The reduction of these two criteria through the
optimisation is much smaller than for logdet and MVKV in Section 24.3. This
can be explained by the small number of sampling units whose locations are
optimised: only 10 points are optimised, while 90 are fixed. In Section 24.3
the locations of all points were optimised.
Exercises
h <- 20  # separation distance (m) of the supplemental points
m <- 10  # number of supplemental points
set.seed(314)
# subsample the spatial coverage sample
units <- sample(length(mysample_SC), m, replace = FALSE)
mySCsubsample <- mysample_SC[units, ]
# shifts of the m points; the completion below, shifting each point over
# distance h in a uniformly random direction, is an assumed reconstruction
dxy <- matrix(nrow = m, ncol = 2)
angle <- runif(m, min = 0, max = 2 * pi)
dxy[, 1] <- h * cos(angle)
dxy[, 2] <- h * sin(angle)
mysample_sup <- coordinates(mySCsubsample) + dxy
MAKV of this sample equals 0.808, and MEAC equals 0.816. For MAKV, 25%
of the maximal reduction is realised by this practical solution; for MEAC, this
is 10%.
25 Sampling for validation of maps
In the previous chapters of Part II, various methods are described for selecting
sampling units with the aim of mapping the study variable. Once the map has been
made, we would like to know how good it is. It should come as no surprise
that the value of the study variable at a randomly selected location as shown
on the map differs from the value at that location in reality. This difference is
a prediction error. The question is how large this error is on average, and how
variable it is. This chapter describes and illustrates with a real-world case study
how to select sampling units at which we will confront the predictions with
the true values, and how to estimate map quality indices from the prediction
errors of these sampling units.
If the map has been made with a statistical model, then the predictors are
typically model-unbiased and the variance of the prediction errors can be
computed from the model. Think, for instance, of kriging which also yields
a map of the kriging variance. In Chapters 22 and 23 I showed how this
kriging variance can be used to optimise the grid spacing (sample size) and
the sampling pattern for mapping, respectively. So, if we have a map of these
variances, why do we still need to collect new data for estimating the map
quality?
The problem is that the kriging variances rely on the validity of the assumptions
made in modelling the spatial variation of the study variable. Do we assume
a constant mean, or a mean that is a linear combination of some covariates?
In the latter case, which covariates are assumed to be related to the study
variable? Or should we model the mean with a non-linear function as in a
random forest model? How certain are we about the semivariogram model
type (spherical, exponential, etc.), and how good are our estimates of the
semivariogram parameters? If one or more of the modelling assumptions are
violated, the variances of the prediction errors as computed with the model
may become biased. For this reason, the quality of the map is preferably
determined through independent validation, i.e., by comparing predictions
with observations not used in mapping, followed by design-based estimation of
the map quality indices. This process is often referred to as validation, or
perhaps better statistical validation, a subset of the more comprehensive term
map quality evaluation, which includes the concept of fitness-for-use.
$$ME = \frac{1}{N}\sum_{k=1}^{N} (\hat{z}_k - z_k) \tag{25.1}$$
$$MAE = \frac{1}{N}\sum_{k=1}^{N} |\hat{z}_k - z_k| \tag{25.2}$$
$$MSE = \frac{1}{N}\sum_{k=1}^{N} (\hat{z}_k - z_k)^2 , \tag{25.3}$$
with $N$ the total number of units (e.g., raster cells) in the population, $\hat{z}_k$
the predicted value for unit $k$, $z_k$ the true value of that unit, and $|\cdot|$ the
absolute value operator. For infinite populations, the sum must be replaced by
an integral over all locations in the mapped area and divided by the size of the
area. The ME quantifies the systematic error and ideally equals 0. It can be
positive (in case of overprediction) and negative (in case of underprediction).
Positive and negative errors cancel out and, as a consequence, the ME does
not quantify the magnitude of the prediction errors. The MAE and MSE do
quantify the magnitude of the errors, they are non-negative. Often, the square
root of MSE is taken, denoted by RMSE, which is in the same units as the
study variable and is therefore more intelligible. The RMSE is strongly affected
by outliers, i.e., large prediction errors, due to the squaring of the errors, and
for this reason I recommend estimating both MAE and RMSE.
Two other important map quality indices are the population coefficient of
determination ($R^2$) and the Nash-Sutcliffe model efficiency coefficient (MEC).
$R^2$ is defined as the square of the Pearson correlation coefficient $r$ of the
study variable and the predictions of the study variable, given by
$$r = \frac{\frac{1}{N}\sum_{k=1}^{N}(z_k - \bar{z})(\hat{z}_k - \bar{\hat{z}})}{S(z)\,S(\hat{z})} = \frac{S^2(z,\hat{z})}{S(z)\,S(\hat{z})} , \tag{25.4}$$
with $\bar{z}$ the population mean of the study variable, $\bar{\hat{z}}$ the population mean of the
predictions, $S^2(z,\hat{z})$ the population covariance of the study variable and the
predictions of $z$, $S(z)$ the population standard deviation of the study variable,
and $S(\hat{z})$ the population standard deviation of the predictions. Note that $R^2$
is unaffected by bias and therefore should not be used in isolation, but should
always be accompanied by ME.
MEC is defined as (Janssen and Heuberger, 1995)
$$MEC = 1 - \frac{\sum_{k=1}^{N}(\hat{z}_k - z_k)^2}{\sum_{k=1}^{N}(z_k - \bar{z})^2} = 1 - \frac{MSE}{S^2(z)} , \tag{25.5}$$
with $S^2(z)$ the population variance of the study variable. MEC quantifies the
improvement made by the model over using the mean of the observations as a
predictor. An MEC value of 1 indicates a perfect match between the observed
and the predicted values of the study variable, whereas a value of 0 indicates
that the mean of the observations is as good a predictor as the model. A
negative value occurs when the mean of the observations is a better predictor
than the model, i.e., when the residual variance is larger than the variance of
the measurements.
For categorical maps, a commonly used map quality index is the overall purity,
which is defined as the proportion of units that is correctly classified (mapped):
$$P = \frac{1}{N}\sum_{k=1}^{N} y_k , \tag{25.6}$$
with $y_k$ an indicator for unit $k$, having value 1 if the predicted class equals the
true class, and 0 otherwise:
$$y_k = \begin{cases} 1 & \text{if } \hat{c}_k = c_k \\ 0 & \text{otherwise}, \end{cases} \tag{25.7}$$
with $c_k$ and $\hat{c}_k$ the true and the predicted class of unit $k$, respectively. For
infinite populations the purity is the fraction of the area that is correctly
classified (mapped).
The population ME, MSE, $R^2$, MEC, and purity can also be defined for
subpopulations. For categorical maps, natural subpopulations are the classes
depicted in the map, the map units. In that case, for infinite populations the
purity of map unit 𝑢 is defined as the fraction of the area of map unit 𝑢 that
is correctly mapped as 𝑢.
A different subpopulation is the part of the population that is in reality class
$u$ (but possibly not mapped as $u$). We are interested in the fraction of the area
covered by this subpopulation that is correctly mapped as $u$. This is referred
to as the class representation of class $u$, for which I hereafter use the symbol
$R_u$. The $\pi$ estimator of the population ME is
$$\widehat{ME} = \frac{1}{N}\sum_{k \in \mathcal{S}} \frac{e_k}{\pi_k} , \tag{25.8}$$
with $e_k = \hat{z}_k - z_k$ the prediction error for unit $k$, $\mathcal{S}$ the sample, and $\pi_k$ the
inclusion probability of unit $k$. By taking the absolute value of the prediction
errors $e_k$ in Equation (25.8) or by squaring them, the $\pi$ estimators for the
MAE and MSE are obtained, respectively. By replacing $e_k$ by the indicator $y_k$
of Equation (25.7), the $\pi$ estimator for the overall purity is obtained.
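As a minimal illustration of these $\pi$ estimators, with hypothetical vectors e (prediction errors at the sample points) and pik (inclusion probabilities), and N the population size:

me_hat  <- sum(e / pik) / N       # pi estimator of ME, Equation (25.8)
mae_hat <- sum(abs(e) / pik) / N  # pi estimator of MAE
mse_hat <- sum(e^2 / pik) / N     # pi estimator of MSE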
With simple random sampling, the square of the sample correlation coefficient,
i.e., the correlation of the study variable and the predictions of the study
variable in the sample, is an unbiased estimator of $R^2$. See Särndal et al. (1992,
pp. 486-491) for how to estimate $R^2$ for other sampling designs.
The population MEC can be estimated by
$$\widehat{MEC} = 1 - \frac{\widehat{MSE}}{\widehat{S^2}(z)} . \tag{25.9}$$
For simple random sampling the sample variance, i.e., the variance of the
observations of 𝑧 in the sample, is an unbiased estimator of the population
variance 𝑆2 (𝑧). For other sampling designs, this population variance can be
estimated by Equation (4.9).
The class representation of class $u$ can be estimated by
$$\hat{R}_u = \frac{\sum_{k \in \mathcal{S}} y_k/\pi_k}{\sum_{k \in \mathcal{S}} x_k/\pi_k} , \tag{25.10}$$
with
$$y_k = \begin{cases} 1 & \text{if } \hat{c}_k = c_k = u \\ 0 & \text{otherwise}, \end{cases} \tag{25.11}$$
$$x_k = \begin{cases} 1 & \text{if } c_k = u \\ 0 & \text{otherwise}. \end{cases} \tag{25.12}$$
This estimator is also recommended for estimating other map quality indices
from a sample with a sample size that is not fixed but varies among samples
selected with the sampling design. This is the case, for instance, when estimating
the mean (absolute or squared) error or the purity of a given map unit from a
simple random sample. The number of selected sampling units within the map
unit is uncontrolled and varies among the simple random samples. In this case,
we can estimate the mean error or the purity of a map unit 𝑢 by dividing the
estimated population total by either the known size (number of raster cells,
area) of map unit 𝑢 or by the estimated size. Interestingly, in general using the
estimated size in the denominator, instead of the known size, yields a more
precise estimator (Särndal et al., 1992). See also Section 14.1.
Two methods are used in mapping: kriging with an external drift (KED) and
random forest prediction (RF). For mapping with RF, seven covariates are
used: planar curvature, profile curvature, slope, temperature, precipitation,
topographic wetness index, and elevation. For mapping with KED only the
two most important covariates in the RF model are used: precipitation and
elevation.
The two maps that are to be validated are shown in Figure 25.1. Note that
non-soil areas (built-up, water, roads) are not predicted. The maps are quite
similar. The most striking difference between the maps is the smaller range of
the RF predictions: they range from 9.8 to 61.5, whereas the KED predictions
range from 5.3 to 90.5.
The two maps are evaluated by statistical validation with a stratified simple
random sample of 62 units (points). The strata are the eight units of a geological
map (Figure 25.2).
To estimate the population MSE of the two maps, first the squared prediction
errors are computed. The name of the measured study variable at the validation
sample in data.frame sample_test is SOM_A_hori. Four new variables are added to
sample_test using function mutate, by computing the prediction errors for KED
and RF and squaring these errors.
FIGURE 25.2: Stratified simple random sample for validation of the two
maps of the SOM concentration in Xuancheng.
sample_test <- sample_test %>%
  mutate(
    eKED = SOM_A_hori - SOM_KED,
    eRF = SOM_A_hori - SOM_RF,
    e2KED = (SOM_A_hori - SOM_KED)^2,
    e2RF = (SOM_A_hori - SOM_RF)^2)
These four new variables now are our study variables of which we would like
to estimate the population means. The population means can be estimated
as explained in Chapter 4. First, the stratum sizes and stratum weights are
computed, i.e., the number and relative number of raster cells per stratum
(Figure 25.2).
Next, the stratum means of the prediction errors obtained with KED and RF are
estimated by the sample means, and the population means of the errors are
estimated by the weighted means of the estimated stratum means.
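That chunk is not printed here. A sketch of the stratified estimation, assuming a hypothetical data frame strata_Xuancheng with a column w_h of stratum weights:

est <- sample_test %>%
  group_by(stratum) %>%
  # stratum means of the (squared) prediction errors
  summarise(across(c(eKED, eRF, e2KED, e2RF), mean)) %>%
  left_join(strata_Xuancheng, by = "stratum") %>%
  # population means: weighted means of the estimated stratum means
  summarise(across(c(eKED, eRF, e2KED, e2RF), ~ sum(.x * w_h)))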
The estimated MSE of the KED map equals 89.3 (g kg$^{-1}$)$^2$, that of the RF
map 93.8 (g kg$^{-1}$)$^2$.
Exercises
1. Are you certain that the population MSE of the KED map is smaller
than the population MSE of the RF map?
The sample sizes per stratum are:

# A tibble: 8 x 2
stratum n
<int> <int>
1 1 5
2 2 1
3 3 8
4 4 10
5 5 2
6 6 23
7 7 9
8 8 4
The collapsed strata can be used to estimate the standard errors of the
estimators of the population MSEs. As a first step, the weights and the sample
sizes of the collapsed strata are computed.
TABLE 25.1: Estimated population mean error (ME) and population mean
squared error (MSE) of KED and RF map, and their standard errors.
The sampling variance of the estimator of the mean of the (squared) prediction
error can be estimated by Equation (4.4). The estimated ME and MSE and
their estimated standard errors are shown in Table 25.1.
Exercises
To estimate the MEC, we must first estimate the population variance of the
study variable from the stratified simple random sample (the denominator in
Equation (25.9)). First, the sizes and the sample sizes of the collapsed strata
must be added to sample_test. Then the population variance is estimated with
function s2 of package surveyplanning (Subsection 4.1.2).
library(surveyplanning)
library(purrr)  # for flatten_dbl
s2z <- sample_test %>%
  left_join(strata_clp_Xuancheng, by = "stratum_clp") %>%
  summarise(s2z = s2(SOM_A_hori, w = N_hc / n_hc)) %>%
  flatten_dbl
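The MEC estimates then follow from Equation (25.9); mse_KED and mse_RF are hypothetical names for the stratified MSE estimates computed before:

mec_KED <- 1 - mse_KED / s2z
mec_RF <- 1 - mse_RF / s2z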
The estimated MEC for KED equals 0.016 and for RF -0.034, showing that the
two models used in mapping are no better than using the estimated mean SOM
concentration as a predictor. This is quite a disappointing result.
The outcomes of the test statistics are 0.690 and 0.309 for KED and RF,
respectively, with p-values of 0.493 and 0.759. So, we do not have enough
evidence for systematic errors, either with KED or with RF mapping.
Now we test whether the two population MSEs differ significantly. This can
be done by a paired t-test. The first step in a paired t-test is to compute
pairwise differences of squared prediction errors, and then we can proceed as
in a one-sample t-test.
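A minimal sketch of this test, ignoring the stratification for simplicity (the chapter's computation accounts for the stratified design):

d2 <- sample_test$e2KED - sample_test$e2RF  # pairwise differences
t.test(d2, mu = 0)                          # one-sample t-test on differences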
The outcome of the test statistic is -0.438, with a p-value of 0.663, so we clearly
do not have enough evidence that the population MSEs obtained with the two
mapping methods are different.
26 Design-based, model-based, and model-assisted approach for sampling and inference
Section 1.2 already mentioned the design-based and the model-based approach
for sampling and statistical inference. In this chapter, the fundamental differ-
ences between these two approaches are explained in more detail. Several
misconceptions about the design-based approach for sampling and statistical
inference, based on classical sampling theory, seem to be quite persistent.
These misconceptions are the result of confusion about basic statistical con-
cepts such as independence, expectation, and bias and variance of estimators
or predictors. These concepts have a different meaning in the design-based
and the model-based approach. Besides, a population mean is still often con-
fused with a model-mean, and a population variance with a model-variance,
leading to invalid formulas for the sampling variance of an estimator of the
population mean. The fundamental differences between these two approaches
are illustrated with simulations, so that hopefully a better understanding
of this subject is obtained. Besides, the difference between model-dependent
inference (as used in the model-based approach) and model-assisted inference
is explained. This chapter has been published as part of a journal paper, see
Brus (2021).
Most students vote for answer 1, the other students vote for answer 3, nearly
no one votes for answer 2. Then I explain that you cannot say which answer
is correct, simply because for correlation we need two series of data, not just
two numbers. The question then is how to generate two series of data. We
need some random process for this. This random process differs between the
design-based and the model-based approach.
In the design-based approach, the random process is the random selection of
sampling units, whereas in the model-based approach randomness is introduced
via the statistical model of the spatial variation (Table 1.1). So, the design-
based approach requires probability sampling, i.e., random sampling, using
a random number generator, in such a way that all population units have a
positive probability of being included in the sample and that these inclusion
probabilities are known for at least the selected population units (Särndal
et al., 1992). A probability sampling design can be used to generate an infinite
number of samples in theory, although in practical applications only one is
selected.
The spatial variation model used in the model-based approach contains two
terms, one for the mean (deterministic part) and one for the error with a
specified probability distribution. For instance, Equation (21.2) in Chapter 21
describes the model used in ordinary kriging. This model can be used to simulate
an infinite number of spatial populations. All these populations together are
referred to as a superpopulation (Särndal et al. (1992), Lohr (1999)). Depending
on the model of spatial variation, the simulated populations may show spatial
structure because the mean is a function of covariates, as in kriging with
an external drift, and/or because the errors are spatially autocorrelated. A
superpopulation is a construct; the populations do not exist in the real world.
The populations are similar, but not identical. For instance, the mean differs
among the populations. The expectation of the population mean, i.e., the
average over all possible simulated populations, equals the superpopulation
mean, commonly referred to as the model-mean, parameter 𝜇 in Equation
(21.2). The variance also differs among the populations. Contrary to the mean,
the average of the population variance over all populations generally is not
equal to the model-variance, parameter 𝜎2 in Equation (21.2), but smaller.
I will come back to this later. The differences between the simulated spatial
populations illustrate our uncertainty about the spatial variation of the study
variable in the population that is sampled or will be sampled.
In the design-based approach, only one population is considered, the one sam-
pled, but the statistical inference is based on all samples that can be generated
by a probability sampling. The top row of Figure 26.1 shows five simple random
samples of size ten. The population is the same in all plots. Proponents of
the design-based approach do not like to consider other populations than the
one sampled. Their challenge is to characterise this one population from a
probability sample.
FIGURE 26.1: Random process considered in the design-based (top row) and
the model-based approach (bottom row). The design-based approach considers
only the sampled population, but all samples that can be generated by the
sampling design. The model-based approach considers only the selected sample,
but all populations that can be generated by the model.
As stressed by de Gruijter and ter Braak (1990) and Brus and de Gruijter (1997),
both approaches have their strengths and weaknesses. Broadly speaking, the
design-based approach is the most appropriate if interest is in the population
mean (total, proportion) or the population means (totals, proportions) of a
restricted number of subpopulations (subareas). The model-based approach
is the most appropriate if our aim is to map the study variable. Further, the
strength of the design-based approach is the strict validity of the estimates.
Validity means that an objective assessment of the uncertainty of the estimator
is warranted and that the coverage of confidence intervals is (almost) correct,
provided that the sample is large enough to assume an approximately normal
distribution of the estimator and design-unbiasedness of the variance estimator.
Wang et al. (2010), for instance, proposed the following formula for the variance
of the estimator of the mean of a spatial population:
$$V(\hat{\bar{z}}) = \frac{\sigma^2 - \overline{\mathrm{Cov}}(z_i, z_j)}{n} , \tag{26.1}$$
with $V(\hat{\bar{z}})$ the variance of the estimator of the regional mean (mean of the spatial
population), $\sigma^2$ the population variance, $n$ the sample size, and $\overline{\mathrm{Cov}}(z_i, z_j)$ the
average autocovariance between all pairs of individuals $(i, j)$ in the population
(sampled and unsampled). So, according to this formula, ignoring the mean
covariance within the population leads to an overestimation of the variance of
the estimator of the mean. In Section 26.4 I will make clear that this formula
is incorrect and that the classical formula is still valid, also for populations
showing spatial structure or continuity.
Remarkably, in other publications we can read that the classical formula for the
variance of the estimator of the population mean with simple random sampling
underestimates the true variance for populations showing spatial structure, see
for instance Griffith (2005) and Plant (2012). The reasoning is that due to
the spatial structure, there is less information in the sample data about the
population mean. In Section 26.4 I explain that this is also a misconception.
Do not get confused by these publications and stick to the classical formulas
which you can find in standard textbooks on sampling theory, such as Cochran
(1977) and Lohr (1999), as well as in Chapter 3 of this book.
The concept of independence of random variables is illustrated with a simulation.
The top row of Figure 26.2 shows five simple random samples of size two. The
two points are repeatedly selected from the same population (showing clear
spatial structure), so this top row represents the design-based approach. The
bottom row shows two points, not selected randomly and independently, but at
a fixed distance of 10 m. These two points are placed on different populations
generated by the model described above, so the bottom row represents the
model-based approach.
The values measured at the two points are plotted against each other in a
scatter plot, not for just five simple random samples or five populations, but
for 1,000 samples and 1,000 populations (Figure 26.3). As we can see, in the
design-based approach the two values are uncorrelated, whereas in the
model-based approach they are clearly correlated.
FIGURE 26.4: Two series (a and b) of simple random samples of ten points
(top) and two series (a and b) of systematic random samples of, on average,
ten points (bottom). The samples of series a and b are selected independently
from each other.
Many points are selected at locations with a large value, and few points at
locations with a small value. The sample data are used in ordinary kriging
(Figure 26.6).
The prediction errors are computed by subtracting the kriged map from the
simulated population.
Figure 26.7 shows a histogram of the prediction errors. The population mean
error equals 0.482, not 0. You may have expected a positive systematic error
because of the overrepresentation of locations with large values, but on the other
hand, kriging predictions are best linear unbiased predictions (BLUP), so from
that point of view, this systematic error might be unexpected. BLUP means that
at individual locations the ordinary kriging predictions are unbiased. However,
apparently this does not guarantee that the average of the prediction errors,
averaged over all population units, equals 0. The reason is that unbiasedness
is defined here over all realisations (populations) of the statistical model of
spatial variation. So, the U in BLUP stands for model-unbiasedness. For other
model realisations, sampled at the same points, we may have much smaller
values, leading to a negative mean error of that population. On average, over
all populations, the error at any point will be 0 and consequently also the
average over all populations of the mean error.
This experiment shows that model-unbiasedness does not protect us against
selection bias, i.e., bias due to preferential sampling.
If the random variables are identically and independently distributed, the
model-variance of the estimator of the model-mean equals
$$V(\hat{\mu}) = \frac{\sigma^2}{n} , \tag{26.2}$$
with 𝜎2 the model-variance of the random variable (see Equation (21.2)). The
variance presented in Equation (26.2) necessarily is a model-variance as it
quantifies our uncertainty about the model-mean, which only exists in the
model-based approach. If the random variables are not model-independent,
the model-variance of the sample mean can be computed by (de Gruijter et al.,
2006)
$$V(\hat{\mu}) = \frac{\sigma^2}{n}\,\{1 + (n-1)\bar{\rho}\} , \tag{26.3}$$
with $\bar{\rho}$ the mean correlation within the sample (the average of the correlations
of all pairs of sampling points). The term inside the curly brackets is larger
than 1, unless $\bar{\rho}$ equals 0. So, the variance of the estimator of the model-mean
with dependent data is larger than when data are independent. The number of
independent observations that is equivalent to a spatially autocorrelated data
set’s sample size 𝑛, referred to as the effective sample size, can be computed
by (de Gruijter et al., 2006)
$$n_\mathrm{eff} = \frac{n}{1 + (n-1)\bar{\rho}} . \tag{26.4}$$
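For example, with a sample of 50 points and a mean correlation of 0.1:

n <- 50
rho_bar <- 0.1
n_eff <- n / (1 + (n - 1) * rho_bar)  # about 8.5 equivalent independent observations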
The design-variance of the estimator of the population mean for simple random
sampling without replacement is computed with
$$V(\hat{\bar{z}}) = \left(1 - \frac{n}{N}\right)\frac{S^2}{n} , \tag{26.5}$$
with $N$ the total number of population units ($N = 100$). This is done for a
range of sample sizes: $n = 10, 11, \ldots, 100$. Note that for $n < 100$ the
model-variance of the sample mean for a given $n$ differs between samples. For samples
showing strong spatial clustering, the mean correlation is relatively large, and
consequently the model-variance is relatively large (see Equation (26.3)). There
is less information in these samples about the model-mean than in samples
without spatial clustering of the points.
In Figure 26.11 we can see that the population mean shows considerable
variation. The variance of 10,000 simulated population means equals 0.0513,
which is nearly equal to the value of 0.0509 for the model-variance computed
with Equation (26.3).
In observational research, I cannot think of situations in which interest is
in estimation of the mean of a superpopulation model. This in contrast to
experimental research. In experimental research, we are interested in the effects
of treatments; think for instance of the effects of different types of soil tillage
on the soil carbon stock. These treatment effects are quantified by different
model-means. Also, in time-series analysis of data collected in observational
studies, we might be more interested in the model-mean than in the mean over
a bounded period of time.
Now let us return to Equation (26.1). What is wrong with this variance
estimator? Where Griffith (2005) confused the population mean and the model-
mean, Wang et al. (2010) confused the population variance with the sill (a
priori variance) of the random process that has generated the population
(Webster and Oliver, 2007). The parameter $\sigma^2$ in their formula is defined as
the population variance, and with that definition the variance estimator is
clearly wrong.
However, if we define $\sigma^2$ in this formula as the sill, the formula makes more
sense, but even then, the equation is not fully correct. The variance computed
with this equation is not the design-variance of the average of a simple random
sample selected from the sampled population, but the expectation of this design-
variance over all realisations of the model. So, it is a model-based prediction of
the design-variance of the estimator of the population mean, estimated from a
simple random sample, see Chapter 13. For the population actually sampled,
the design-variance is either smaller or larger than this expectation. Figure
26.11 shows that there is considerable variation in the population variance
among the 10,000 populations simulated with the model. Consequently, for
an individual population, the variance of the estimator of the population
mean, estimated from a simple random sample, can differ considerably from the
model-expectation of this variance. Do not use Equation (26.1) for estimating
the design-variance of the estimator of the population mean, but simply use
Equation (26.5) (for simple random sampling with replacement and simple
random sampling of infinite populations the term (1 − 𝑛/𝑁) can be dropped).
Equation (26.1) is only relevant for comparing simple random sampling under
a variety of models of spatial variation (Ripley (1981), Domburg et al. (1994)).
(Chapter 9). At the inference stage, the covariate maps can be used in a model-
assisted approach, using, for instance, a linear regression model to increase the
precision of the design-based estimator (Chapter 10, Section 26.6).
If no covariate maps are available, we may anticipate the presence of spatial
structure by spreading the sampling units throughout the study area. This
spreading can be done in many ways, for instance by systematic random sam-
pling (Chapter 5), compact geographical stratification (Section 4.6), well-spread
sampling in geographical space with the local pivotal method (LPM) (Subsec-
tion 9.2.1), and generalised random-tessellation stratified (GRTS) sampling
(Subsection 9.2.2). At the inference stage, again a model-assisted approach can
be advantageous, using the spatial coordinates in a regression model.
TABLE 26.3: Estimated relative bias of the regression estimator and ratio
estimator, standard deviation of 5,000 regression/ratio estimates, and average
of 5,000 estimated standard errors of the regression/ratio estimator.
R scripts of the answers to the exercises are available in the Exercises folder
at the github repository of this book.
5. The larger the population size 𝑁, the smaller the difference between
the sampling variances of the estimator of the mean for simple
random sampling with replacement and simple random sampling
without replacement (given a sample size 𝑛).
6. The true sampling variance of the estimator of the mean for simple
random sampling from an infinite population can be computed with
the population variance divided by the sample size: $V(\hat{\bar{z}}) = S^2(z)/n$.
8. See SI.R1 . The 90% confidence interval is less wide than the 95%
interval, because a larger proportion of samples is allowed not to
cover the population mean. The estimated standard error of the
estimated total underestimates the true standard error, because a
constant bulk density is used. In reality this bulk density also varies.
1 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/SI.R
2 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/STSI1.R
3 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/STSI2.R
9. See STSIgeostrata.R5 .
• Collapsing the geostrata on the basis of the measurements of
the study variable is not a proper way, as it will lead to a biased
estimator of the sampling variance of the estimator of the mean.
The estimated stratum variances $\hat{S}^2(z)$ will be small, and so
the estimated sampling variance will underestimate the true
sampling variance.
• I propose to group neighbouring geostrata, i.e., geostrata that
are close to each other.
• The sampling variance estimator is not unbiased. The sampling
variance is slightly overestimated, because we assume that the
two (or three) points within a collapsed stratum are selected
by simple random sampling, whereas they are selected by
stratified random sampling (a collapsed stratum consists of
two or three geostrata), and so there is less spatial clustering
compared to simple random sampling.
4 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/STSIcumrootf.R
5 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/STSIgeostrata.R
2. As can be seen in the plot, the spatial coverage of the study area by
the two systematic random samples can be quite poor. So, I expect
that the variance of the estimator of the mean using the data of two
systematic random samples of half the expected size is larger than
the variance of the estimator of the mean based on the data of a
single systematic random sample.
6 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/STSIgeostrata_composite.R
7 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/SY.R
8 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/Cluster.R
2. With ten PSU draws and four SSUs per PSU draw (10 × 4), the
expected standard error of the estimator of the population mean
is smaller than with four PSU draws and ten SSUs per PSU draw
(4 × 10), because spatial clustering of the sampling points is less
strong.
3. See TwoStage.R10.
4. See TwoStage.R11.
5. See TwoStage.R12.
9 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/TwoStage.R
10 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/TwoStage.R
11 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/TwoStage.R
12 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/TwoStage.R
2. No, this field should not be included in the poppy area of that
sampling unit, because it is located outside the target area.
3. Yes, this field must be included in the poppy area of that sampling
unit, as it is located inside the target area. The target area is the
territory of Kandahar, regardless of how an area inside this territory
is depicted on the map, as agricultural land or otherwise.
Model-assisted estimation
1. See RegressionEstimator.R15 . The approximate standard error esti-
mator that uses the 𝑔-weights (computed with functions calibrate
and svymean of package survey) has a larger mean (7.194) than
the approximated standard error (7.130) computed with Equation
(10.13).
20 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/RequiredSampleSize_CIprop.R
sample size is symmetric. For instance, the required sample size for $p^* = 0.7$
is equal to the required sample size for $p^* = 0.3$.
2. See first part of MBRequiredSampleSize_SIandSY.R22.
22 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/MBRequiredSampleSize_SIandSY.R
23 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/SE_STparameters.R
4. See SpatialCoverageCircularPlot.R26. See Figure A.4.
6. See SpatialInfill.R27.
24 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/SE_ChangeofMean_HT.R
25 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/SquareGrid.R
26 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/SpatialCoverageCircularPlot.R
27 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/SpatialInfill.R
FIGURE A.4: Spatial coverage samples of five and six points in a circular
plot.
FIGURE A.5: Covariate space coverage sample from Hunter Valley, using
cti, ndvi, and elevation as clustering variables, plotted on a map of cti.
28 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/CovariateSpaceCoverageSample.R
2. See cLHS_Square.R30 .
• Spatial coverage is improved by using the spatial coordinates
as covariates, but it is not optimal in terms of MSSD.
• It may happen that not all marginal strata of 𝑠1 and 𝑠2 are
sampled. Even when all these marginal strata are sampled, this
does not guarantee a perfect spatial coverage.
• With set.seed(314) and default values for the arguments of
function clhs, there is one unsampled marginal stratum and one
marginal stratum with two sampling locations. So, component
O1 equals 2. The minimised value (2.62) is slightly larger due
to the contribution of O3 to the criterion.
points vary, so that also the variance of the estimator of the mean,
which contributes to the kriging variance, differs among grid samples.
FIGURE A.10: Effect of the nugget (no nugget, large nugget, pure nugget)
on the optimised sampling pattern of 16 points for KED, using Easting as a
covariate for the mean.
40 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/MBSample_SSA_MVKV.R
41 https://github.com/DickBrus/SpatialSamplingwithR/tree/master/Exercises/MBSample_SSA_MEAC.R
2. The standard errors of the estimated MEs are large when related
to the estimated MEs, so my guess is that we do not have enough
evidence against the hypothesis that there is no systematic error.
Bibliography
Barcaroli, G., Ballin, M., Odendaal, H., Pagliuca, D., Willighagen, E., and
Zardetto, D. (2020). SamplingStrata: Optimal Stratification of Sampling
Frames for Multipurpose Sampling Surveys. R package version 1.5-1.
Barnes, R. J. (1988). Bounding the required sample size for geologic site
characterization. Mathematical Geology, 20:477–490.
Berger, Y. G. (2004). A simple variance estimator for unequal probability
sampling without replacement. Canadian Journal of Applied Statistics,
31(3):305–315.
Bethel, J. (1989). Sample allocation in multivariate surveys. Survey Methodol-
ogy, 15(1):47–57.
Binder, D. A. and Hidiroglou, M. A. (1988). Sampling in time. In Krishnaiah,
P. R. and Rao, C. R., editors, Handbook of Statistics, volume 6, pages
187–211. North-Holland, Amsterdam.
Bivand, R. S., Pebesma, E., and Gómez-Rubio, V. (2013). Applied Spatial
Data Analysis with R. Springer, New York, second edition.
Bogaert, P. and Russo, D. (1999). Optimal spatial sampling design for the
estimation of the variogram based on a least squares approach. Water
Resources Research, 35(4):1275–1289.
Breidaks, J., Liberts, M., and Jukams, J. (2020). surveyplanning: Survey
planning tools. R package version 4.0.
Breidenbach, J. (2018). JoSAE: Unit-Level and Area-Level Small Area Esti-
mation. R package version 0.3.0.
Breidenbach, J. and Astrup, R. (2012). Small area estimation of forest attributes
in the Norwegian National Forest Inventory. European Journal of Forest
Research, 131:1255–1267.
Breidt, F. J. and Fuller, W. A. (1999). Design of supplemented panel surveys
with application to the National Resources Inventory. Journal of Agricultural,
Biological, and Environmental Statistics, 4(4):391–403.
Breidt, F. J. and Opsomer, J. D. (2017). Model-assisted survey estimation
with modern prediction techniques. Statistical Science, 32(2):190–205.
Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation
for a binomial proportion - Comment - Rejoinder. Statistical Science,
16(2):101–133.
Brus, D. J. (2019). Sampling for digital soil mapping: A tutorial supported by
R scripts. Geoderma, 338:464–480.
Brus, D. J. (2021). Statistical approaches for spatial sample survey: Persistent
misconceptions and new developments. European Journal of Soil Science,
72(2):686–703.
Ly, A., Marsman, M., Verhagen, J., Grasman, R. P. P. P., and Wagenmakers,
E. J. (2017). A tutorial on Fisher information. Journal of Mathematical
Psychology, 80:40–55.
Ma, T., Brus, D. J., Zhu, A.-X., Zhang, L., and Scholten, T. (2020). Compar-
ison of conditioned Latin hypercube and feature space coverage sampling
for predicting soil classes using simulation from soil maps. Geoderma,
370:114366.
Mandallaz, D. (2007). Sampling Techniques for Forest Inventories. Chapman
& Hall/CRC, Boca Raton.
Mandallaz, D., Breschan, J., and Hill, A. (2013). New regression estimators in
forest inventories with two-phase sampling and partially exhaustive informa-
tion: A design-based Monte Carlo approach with applications to small-area
estimation. Canadian Journal of Forest Research, 43(11):1023–1031.
Marcelli, A., Corona, P., and Fattorini, L. (2019). Design-based estimation of
mark variograms in forest ecosystem surveys. Spatial Statistics, 30:27–38.
Marchant, B. P. and Lark, R. M. (2007). Optimized sample schemes for
geostatistical surveys. Mathematical Geology, 39(1):113–134.
Matérn, B. (1947). Methods of estimating the accuracy of line and sample
plot surveys. Meddelanden från Statens Skogsforskningsinstitut, 36(1).
Matérn, B. (1986). Spatial Variation, volume 36 of Lecture Notes in Statistics.
Springer-Verlag, Berlin, second edition.
McConville, K., Tang, B., Zhu, G., Li, S., Chueng, S., and Toth, D. (2021).
mase: Model-Assisted Survey Estimation. R package version 0.1.3.
McConville, K. S., Moisen, G. G., and Frescino, T. S. (2020). A tutorial
on model-assisted estimation with application to forest inventory. Forests,
11(2):244.
McKay, M. D., Beckman, R. J., and Conover, W. J. (1979). A comparison
of three methods for selecting values of input variables in the analysis of
output from a computer code. Technometrics, 21(2):239–245.
McLaren, C. H. and Steel, D. G. (2001). Rotation patterns and trend estimation
for repeated surveys using rotation group estimates. Statistica Neerlandica,
55(2):221–238.
Minasny, B. and McBratney, A. B. (2006). A conditioned Latin hypercube
method for sampling in the presence of ancillary information. Computers &
Geosciences, 32(9):1378–1388.
Minasny, B. and McBratney, A. B. (2010). Conditioned Latin hypercube
sampling for calibrating soil sensor data to soil properties. In Viscarra
Rossel, R. A., McBratney, A. B., and Minasny, B., editors, Proximal Soil
Sensing, pages 111–119. Springer, Dordrecht.
Rao, J. N. K. (2003). Small Area Estimation. John Wiley & Sons, Hoboken.
Ribeiro Jr, P. J., Diggle, P. J., Schlather, M., Bivand, R., and Ripley, B. (2020).
geoR: Analysis of Geostatistical Data. R package version 1.8-1.
Ripley, B. D. (1981). Spatial Statistics. John Wiley & Sons, New York.
Robertson, B. L., Brown, J. A., McDonald, T., and Jaksons, P. (2013).
BAS: Balanced Acceptance Sampling of Natural Resources. Biometrics,
69(3):776–784.
Rosén, B. (1997). On sampling with probability proportional to size. Journal
of Statistical Planning and Inference, 62(2):159–191.
Roudier, P. (2021). clhs: a R package for conditioned Latin hypercube sampling.
R package version 0.9.0.
Roudier, P., Hewitt, A. E., and Beaudette, D. E. (2012). A conditioned Latin
hypercube sampling algorithm incorporating operational constraints. In
Minasny, B., Malone, B. P., and McBratney, A. B., editors, Digital Soil
Assessments and Beyond. Proceedings of the 5th Global Workshop on Digital
Soil Mapping, pages 227–232. CRC Press/Balkema, Leiden.
Samuel-Rosa, A. (2019). spsann: Optimization of Sample Configurations using
Spatial Simulated Annealing. R package version 2.2.0.
Särndal, C. E., Swensson, B., and Wretman, J. (1992). Model Assisted Survey
Sampling. Springer, New York.
Schloerke, B., Cook, D., Larmarange, J., Briatte, F., Marbach, M., Thoen,
E., Elberg, A., and Crowley, J. (2021). GGally: Extension to ’ggplot2’. R
package version 2.1.2.
Schoch, T. (2014). rsae: Robust Small Area Estimation. R package version
0.1-5.
Signorell, A. (2021). DescTools: Tools for Descriptive Statistics. R package
version 0.99.43.
Stehman, S. V. (1999). Basic probability sampling designs for thematic
map accuracy assessment. International Journal of Remote Sensing,
20(12):2423–2441.
Stehman, S. V., Fonte, C. C., Foody, G. M., and See, L. (2018). Using volun-
teered geographic information (VGI) in design-based statistical inference for
area estimation and accuracy assessment of land cover. Remote Sensing of
Environment, 212:47–59.
Stevens, D. L. and Olsen, A. R. (2004). Spatially balanced sampling of natural
resources. Journal of the American Statistical Association, 99(465):262–278.
Szepannek, G. (2018). clustMixType: User-friendly clustering of mixed-type
data in R. The R Journal, 10(2):200–208.
Xie, Y., Dervieux, C., and Riederer, E. (2020). R Markdown Cookbook. Chap-
man & Hall/CRC, Boca Raton, first edition.
Zhao, X. and Grafström, A. (2020). A sample coordination method to monitor
totals of environmental variables. Environmetrics, 31(6).
Zhu, Z. and Stein, M. L. (2005). Spatial sampling design for parameter
estimation of the covariance function. Journal of Statistical Planning and
Inference, 134(2):583–603.
Zhu, Z. and Stein, M. L. (2006). Spatial sampling design for prediction with
estimated parameters. Journal of Agricultural, Biological, and Environmental
Statistics, 11(1):24–44.
Zhu, Z. and Zhang, H. (2006). Spatial sampling under the infill asymptotic
framework. Environmetrics, 17:323–337.
Zimmerman, D. L. (2006). Optimal network design for spatial prediction,
covariance parameter estimation, and empirical prediction. Environmetrics,
17(6):635–652.