Bootstrap Methods With Applications in R

Preface
Efron’s introduction of the classical bootstrap in 1979 was the starting point of an
immense and lasting research activity. Accompanied and supported by the growth
of personal computers’ processing power, these methods are now an established
approach in applied statistics. Their appealing simplicity makes them easy to use
in the many fields of science where statistics is applied.
The intention of this manuscript is to discuss the bootstrap concept in the context
of statistical testing, with a focus on its use in and support of statistical modeling.
Furthermore, we would like to address different reader preferences with the content.
Specifically, we have two types of readers in mind. On the one hand, users of
statistics who have a solid basic knowledge of probability theory and who would
like a goal-oriented, quickly applicable solution to their problem. On the other
hand, this book is also intended for readers who are more interested in the
theoretical background of a problem solution and who have advanced knowledge of
probability theory and mathematical statistics.
In most cases, we start a topic with some introductory examples, basic mathe-
matical considerations, and simple implementations of the corresponding algorithm.
A reader who is mainly interested in applying a particular approach may stop after
such a section and apply the discussed procedures and implementations to the
problem in mind. This introductory part to a topic is mainly addressed to the first
type of reader. It can also be used just to motivate bootstrap approximations and to
apply them in simulation studies on a computer. The second part of a topic covers
the mathematical framework and further background material. This part is mainly
written for those readers who have a strong background in probability theory and
mathematical statistics.
Throughout all chapters, computational procedures are provided in R. R is a
powerful statistical computing environment, which is freely available and can be
downloaded from the R-Project website at www.r-project.org. We focus only on a
few but very popular packages from the so-called tidyverse, mainly ggplot2 and
dplyr. This will hopefully help readers who are not familiar with R to understand the
implementations more easily: first, because these packages make the source code quite
intuitive to read, and second, because their popularity means that a lot of helpful information
can be found on the Internet. Moreover, the repository of additional R-packages
created by the R-community is immense, also with respect to non-statistical tasks,
which makes it well worth learning and working with R. The
R-programs considered in the text are made available on the website https://www.springer.com/gp/book/9783030734794.
The first three chapters provide introductory material and are mainly intended for
readers who have never come into contact with bootstrapping. Chapter 1 gives a
short introduction to the bootstrap idea and some notes on R. In Chap. 2, we
summarize some results about the generation of random numbers. Chapter 3 lists
some well-known results of the classical bootstrap method.
In Chap. 4, we discuss the first basic statistical tests using the bootstrap method.
Chapters 5 and 6 cover bootstrap applications in the context of linear and gener-
alized linear regression. The focus is on goodness-of-fit tests, which can be used to
detect contradictions between the data and the fitted model. We discuss the work of
Stute on marked empirical processes and transfer parts of his results into the
bootstrap context in order to approximate p-values for the individual
goodness-of-fit tests. Some of the results here are new, at least to the best of our
knowledge. Although the mathematics behind these applications is quite complex,
we consider these tests as useful tools in the context of statistical modeling and
learning. Some of the subsections focus exactly on this modeling topic.
In the appendix, some additional aspects of R with respect to bootstrap appli-
cations are illustrated. In the first part of this appendix, some applications of the
“boot” R-package of Brian Ripley, which can be obtained from the R-project’s
website, are demonstrated. The second part describes the “simTool” R-package of
Marsel Scheer, which was written to simplify the implementation of simulation
studies, such as bootstrap replications, in R. This package also supports
parallel execution. Finally, the usage of our “bootGOF” R-package is
illustrated, which provides a tool to perform goodness-of-fit tests for (linear) models
as discussed in Chap. 6.
The first three chapters of this manuscript were written during the time when the
first author was employed as a Research Assistant at the Chair for Mathematical
Stochastics of the Justus Liebig University in Gießen. They were prepared for a
summer course at the Department of Mathematical Sciences at the University of
Wisconsin-Milwaukee, which the first author taught in 1988 (and several times later
on) after completing his doctorate.
Special thanks must be given to Prof. Dr. Winfried Stute, who supervised the
first author in Giessen. Professor Stute realized the importance of the bootstrap
method at a very early stage and inspired and promoted interest in it among the first
author. In addition, Prof. Stute together with Prof. Gilbert Walter from the
Department of Mathematical Science of the University of Wisconsin-Milwaukee
initiated a cooperation between the two departments, which ultimately formed the
basis for the long-lasting collaboration between the first author and his colleagues
from the statistics group in Milwaukee.
Financially, this long-term cooperation was later on supported by the Department
of Medical Engineering and Technomathematics of the Fachhochschule Aachen and
by the Department of Mathematical Sciences of the University of Wisconsin-
Milwaukee, and we would like to thank Profs. Karen Brucks, Allen Bell, Thomas
O’Bryan, and Richard Stockbridge for their kind assistance.
Finally, the first author would like to thank his colleagues from the statistics
group in Milwaukee, Jay Beder, Vytaras Brazauskas, and especially Jugal Ghorai,
and, from Fachhochschule Aachen, Martin Reißel for their helpful discussions and
support.
Also, the second author gives special thanks to Prof. Dr. Josef G. Steinebach
from the Department of Mathematics of the University of Cologne for his excellent
lectures in Statistics and Probability Theory.
We are both grateful to Dr. Andreas Kleefeld, who kindly provided us with
many comments and corrections to a preliminary version of the book.
Contents
1 Introduction ..... 1
   1.1 Basic Idea of the Bootstrap ..... 1
   1.2 The R-Project for Statistical Computing ..... 5
   1.3 Usage of R in This Book ..... 5
       1.3.1 Further Non-Statistical R-Packages ..... 6
   References ..... 7
2 Generating Random Numbers ..... 9
   2.1 Distributions in the R-Package Stats ..... 9
   2.2 Uniform df. on the Unit Interval ..... 10
   2.3 The Quantile Transformation ..... 11
   2.4 The Normal Distribution ..... 15
   2.5 Method of Rejection ..... 16
   2.6 Generation of Random Vectors ..... 19
   2.7 Exercises ..... 20
   References ..... 20
3 The Classical Bootstrap ..... 21
   3.1 An Introductory Example ..... 21
   3.2 Basic Mathematical Background of the Classical Bootstrap ..... 27
   3.3 Discussion of the Asymptotic Accuracy of the Classical Bootstrap ..... 32
   3.4 Empirical Process and the Classical Bootstrap ..... 34
   3.5 Mathematical Framework of Mallow’s Metric ..... 36
   3.6 Exercises ..... 44
   References ..... 45
4 Bootstrap-Based Tests ..... 47
   4.1 Introduction ..... 47
   4.2 The One-Sample Test ..... 49
   4.3 Two-Sample Tests ..... 53
Notations

A := B          A is defined by B
A ⇔ B           A and B are equivalent
Bⁿ              Borel σ-algebra on ℝⁿ
C[0, 1]         Space of continuous, real-valued functions on the unit interval
D[0, 1]         Skorokhod space on the unit interval
E(X)            Expectation of the random variable X
E∗ₙ(X∗)         Expectation of the bootstrap random variable X∗
EXP(a)          Exponential distribution with parameter a > 0
Fₙ              Empirical distribution function
I{x∈A}          Indicator function
I{A}(x)         Indicator function
Iₚ              Identity matrix of size p × p
⟨·, ·⟩          Inner product of a Hilbert space
a ∧ b           Minimum of a and b
N(μ, σ²)        Normal distribution with expectation μ and variance σ²
P∗ₙ             Probability measure corresponding to bootstrap rvs. based on n original observations
P               Probability measure corresponding to the wild bootstrap
Rₙ              Basic marked empirical process (BMEP)
R¹ₙ             Marked empirical process with estimated parameters propagating in a fixed direction (EMEP)
R̄¹ₙ             Marked empirical process with estimated parameters propagating in an estimated direction (EMEPE)
UNI(a, b)       Uniform distribution on the interval [a, b]
UNI             Standard uniform distribution, i.e., UNI(0, 1)
VAR(X)          Variance of the random variable X
VAR∗ₙ(X∗)       Variance of the bootstrap random variable X∗
WEIB(a, b)      Weibull distribution with parameters a and b
X ∼ F           Random variable X is distributed according to F
Chapter 1
Introduction
In this introduction, we discuss the basic idea of the bootstrap procedure using a
simple example. Furthermore, the statistical software R and its use in the context of
this manuscript are briefly covered. Readers who are familiar with this material can
skip this chapter.
A short summary of the contents of this manuscript can be found in the Preface
and is not listed here again.
1.1 Basic Idea of the Bootstrap

Typical statistical methods, such as constructing a confidence interval for the expected
value of a random variable or determining critical values for a hypothesis test, require
knowledge of the underlying distribution. However, this distribution is usually only
partially known at most. The statistical method we use to perform the task depends
on our knowledge of the underlying distribution.
Let us be more precise and assume that

    X₁, …, Xₙ ∼ F

are i.i.d., with sample mean

    X̄ₙ := (1/n) ∑ᵢ₌₁ⁿ Xᵢ,

where

    sₙ² := (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xᵢ − X̄ₙ)²
is the unbiased estimator of σ 2 = VAR(X ), that is, the variance of X . Note that we
write P F here to indicate that F is the data generating df.
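As a small sketch (the variable names are ours, not the book's), the two estimators above can be computed in base R; note that R's built-in var() already uses the unbiased n − 1 denominator:

```r
# A minimal sketch: the sample mean and the unbiased variance estimator in base R.
set.seed(1)
x <- rnorm(50, mean = 2, sd = 3)      # X_1, ..., X_n ~ F (normal, for illustration)
n <- length(x)

xbar <- sum(x) / n                    # sample mean
s2   <- sum((x - xbar)^2) / (n - 1)   # unbiased variance estimator

# base R's mean() and var() implement exactly these formulas
all.equal(xbar, mean(x))
all.equal(s2, var(x))
```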
If we know that F comes from the class of normal distributions, then the df. under
(1.1) is that of a tₙ₋₁-distribution, i.e., a Student’s t distribution with n − 1 degrees
of freedom. Using the known quantiles of the tₙ₋₁-distribution, exact confidence
intervals can be determined. For example, an exact 90% confidence interval for μ_F is
given by
    ( X̄ₙ − sₙ q₀.₉₅ / √n ,  X̄ₙ + sₙ q₀.₉₅ / √n ),        (1.2)
where Φ denotes the standard normal df. Based on the CLT, we can now construct
an asymptotic confidence interval. For example, the 90% confidence interval under
(1.2) has the same structure when we construct it using the CLT. However, q0.95 now
is the 95% quantile of the standard normal distribution. The interval constructed in
this way is no longer an exact confidence interval. It can only be guaranteed that the
confidence level of 90% is reached with n → ∞. It should also be noted that for q0.95
the 95% quantile of the tn−1 − distribution can also be chosen, because for n → ∞,
the tn−1 − df. converges to the standard normal df.
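The two intervals discussed above are easy to compare numerically. The following sketch (names are ours) computes the exact t-based interval (1.2) and its CLT-based counterpart; since t-quantiles exceed the corresponding normal quantiles, the exact interval is slightly wider:

```r
# Sketch: exact (t-based) vs. asymptotic (normal-based) 90% confidence interval
set.seed(1)
x <- rnorm(30, mean = 2, sd = 3)
n <- length(x); xbar <- mean(x); s <- sd(x)

q_t <- qt(0.95, df = n - 1)   # 95% quantile of the t_{n-1} distribution (exact case)
q_z <- qnorm(0.95)            # 95% quantile of the standard normal (CLT case)

ci_exact <- c(xbar - s * q_t / sqrt(n), xbar + s * q_t / sqrt(n))
ci_clt   <- c(xbar - s * q_z / sqrt(n), xbar + s * q_z / sqrt(n))

# t.test() returns the same exact interval
as.numeric(t.test(x, conf.level = 0.90)$conf.int)
```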
So far we have concentrated exclusively on the studentized mean. Let us generalize
this to a statistic of the type
Tn (F) = Tn (X 1 , . . . , X n ; F),
where X₁, …, Xₙ ∼ F are i.i.d. Then the question arises how to approximate the df.

    P_F( Tₙ(F) ≤ x ),  x ∈ ℝ,        (1.4)
if F is unknown. This is where Efron’s bootstrap enters the game. The basic idea of
the bootstrap method is the assumption that the df. of Tₙ remains about the same when the
data generating distribution F is replaced by another data generating distribution F̂
which is close to F and which is known to us. If we can find such a df. F̂, then the df. of Tₙ(F̂)
may also be an approximation of Eq. (1.4). We call this df. for the moment a bootstrap
approximation of the df. given under Eq. (1.4). However, this approach only makes
sense if we can guarantee that
    sup_{x∈ℝ} | P_F( Tₙ(F) ≤ x ) − P_F̂( Tₙ(F̂) ≤ x ) | ⟶ 0,  for n → ∞.        (1.6)
Now let us return to the construction of a 90% confidence interval for μ_F based on the
bootstrap approximation. For this, we take the studentized mean for Tₙ and assume
that we have a data generating df. F̂ that satisfies (1.6). Since F̂ is known, we can
now, at least theoretically, calculate the 5% and 95% quantiles of the df.

    P_F̂( √n ( X̄ₙ − μ_F̂ ) / sₙ ≤ x ),
(a) Construct m i.i.d. (bootstrap) samples, independent of one another, of the type

        X∗₁;₁  …  X∗₁;ₙ
          ⋮    ⋱    ⋮
        X∗ₘ;₁  …  X∗ₘ;ₙ

(b) Apply Tₙ to each of these samples to obtain T∗₁;ₙ, …, T∗ₘ;ₙ.
(c) Since the T∗₁;ₙ, …, T∗ₘ;ₙ are i.i.d., the Glivenko-Cantelli theorem (GC) guarantees

    sup_{x∈ℝ} | P_F̂( Tₙ(F̂) ≤ x ) − (1/m) ∑ₖ₌₁ᵐ I{T∗ₖ;ₙ ≤ x} | ⟶ 0,  for m → ∞,        (1.8)

where I{x∈A} ≡ I{A}(x) denotes the indicator function of the set A, that is,

    I{x∈A} = 1 if x ∈ A, and 0 otherwise.
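The steps above can be sketched in R for the studentized mean, taking the empirical df. for F̂ (the classical bootstrap introduced below); all names are ours, not the book's:

```r
# Monte Carlo sketch of the resampling scheme for the studentized mean
set.seed(1)
x <- rexp(40)                    # original sample X_1, ..., X_n
n <- length(x); m <- 2000        # m bootstrap replications

# steps (a)/(b): draw m bootstrap samples from F_n and studentize each one,
# centering at the mean of F_n, i.e., mean(x)
t_star <- replicate(m, {
  xs <- sample(x, size = n, replace = TRUE)
  sqrt(n) * (mean(xs) - mean(x)) / sd(xs)
})

# step (c): the empirical df of T*_{1;n}, ..., T*_{m;n} approximates the
# bootstrap df; its 5% and 95% quantiles yield a 90% confidence interval
q <- quantile(t_star, c(0.05, 0.95))
ci_boot <- c(mean(x) - q[[2]] * sd(x) / sqrt(n),
             mean(x) - q[[1]] * sd(x) / sqrt(n))
ci_boot
```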
    Fₙ(x) := (1/n) ∑ᵢ₌₁ⁿ I{Xᵢ ≤ x},  x ∈ ℝ,        (1.9)

is a good choice for F̂ since, by the Glivenko-Cantelli theorem, we get with probability
one (w.p.1)

    sup_{x∈ℝ} | Fₙ(x) − F(x) | ⟶ 0,  as n → ∞.
If we choose Fn for F̂ then we are talking about the classical bootstrap which was
historically the first to be studied.
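In R, the choice F̂ = Fₙ is particularly convenient: drawing from Fₙ amounts to resampling the observed data with replacement. A minimal sketch (the data values are made up for illustration):

```r
# Drawing from F_n is just resampling the data with replacement.
set.seed(1)
x <- c(2.1, 3.5, 1.7, 4.2, 2.9)

Fn <- ecdf(x)     # the empirical df (1.9) as an R function
Fn(3)             # proportion of observations <= 3, here 3/5

x_star <- sample(x, size = length(x), replace = TRUE)  # one classical bootstrap sample
all(x_star %in% x)                                     # only observed values recur
```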
1.2 The R-Project for Statistical Computing
The programming language R, see R Core Team (2019), is a widely used open-
source software tool for data analysis and graphics which runs on the commonly
used operating systems. It can be downloaded from the R-project’s website at www.r-
project.org. The R Development Core Team also offers some documentation on this
website:
• R installation and administration,
• An introduction to R,
• The R language definition,
• R data import/export, and
• The R reference index.
In addition to this material, there is a large and rapidly growing number of textbooks
available covering the R programming language and the applications of R in
different fields of data analysis, for instance, Beginning R or Advanced R.
Besides the R software, one also should install an editor or an integrated develop-
ment environment (IDE) to work with R conveniently. Several open-source products
are available on the web, like
• RStudio, see RStudio Team (2020), at www.rstudio.org;
• RKWard, at http://rkward.sourceforge.net;
• Tinn-R, at http://www.sciviews.org/Tinn-R; and
• Eclipse based StatET, at http://www.walware.de/goto/statet.
Throughout the book we implement, for instance, different resampling schemes and
simulation studies in R. Our implementations are free from any checking of function
arguments. We provide R-code that focuses solely on an understandable implemen-
tation of a certain algorithm. Therefore, there is plenty of room to improve the imple-
mentations. Some of these improvements will be discussed within the exercises.
R is organized in packages. A new installation of R comes with some pre-installed
packages, and it is the packages provided by the R-community that make this programming
language really powerful. More than 15,000 packages were available as of February 2020,
and the number is still growing. Especially for people starting with R, however, this
abundance can also be a problem. The CRAN Task View at
https://cran.r-project.org/web/views summarizes certain packages
within categories like “Graphics”, “MachineLearning”, or “Survival”. We decided
to use only a handful of packages that are directly related to the main objective
of this book, like the boot-package for bootstrapping, or (in the opinion of the
authors) are too important and helpful to be ignored, like ggplot2, dplyr, and
tidyr. In addition, we have often used the simTool package from Marsel Scheer
to carry out simulations. This package is explained in the appendix. Furthermore,
we decided to use the pipe operator, i.e., %>%. There are a few critical voices about
this operator, but the authors, like most R users, find it very comfortable to work
with. People familiar with Unix systems will recognize the concept
and probably appreciate it. A small example will demonstrate how the pipe operator
works. Suppose we want to apply a function A to the object x and the result of this
operation should be processed further by the function B. Without the pipe operator
one could use
B(A(x))
# or
tmp = A(x)
B(tmp)
Especially with longer chains of functions using pipes may help to obtain R-code
that is easier to understand.
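For completeness, a small hypothetical illustration (ours, not from the book) of the same chain written with the pipe; %>% is re-exported by dplyr (it originates from the magrittr package):

```r
# Nested call vs. pipe: both compute sum(sqrt(x))
library(dplyr)

x <- c(4, 9, 16)
sum(sqrt(x))            # nested: B(A(x))
x %>% sqrt() %>% sum()  # piped: read left to right, same result
```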
There are a lot of packages that are worth a look. Again, the CRAN Task View may
be a good starting point. The following list focuses on writing reports, developing
R-packages, and increasing the speed of R-code itself. This list is by far not exhaustive:
• knitr for writing reports (this book was written with knitr);
• readxl for the import of excel files;
• testthat for creating automated unit tests. It is also helpful for checking func-
tion arguments;
• covr for assessing test coverage of the unit tests;
• devtools for creating/writing packages;
• data.table for amazingly fast aggregation, joins, and various manipulations of
large datasets;
• roxygen2 for creating help pages within packages;
• Rcpp for a simple integration of C++ into R;
• profvis, a profiling tool that assesses at which line of code R spends its time;
• checkpoint, renv for package dependency.
Of course, further packages for importing datasets, connecting to databases, cre-
ating interactive graphs and user interfaces, and so on exist. Again, the packages
provided by the R-community make this programming language really powerful.