Bootstrap Methods With Applications in R

This document discusses the bootstrap method in statistics, focusing on its applications in R for statistical testing and modeling. It caters to both novice and advanced readers by providing introductory examples and deeper mathematical frameworks. The book includes practical implementations in R, particularly using popular packages from the tidyverse, and covers various statistical tests and regression analysis using bootstrap techniques.


To our families:
Renate and Jan
Natalie, Nikolas and Alexander
for their support and patience
Preface

Efron’s introduction of the classical bootstrap in 1979 was the starting point of an
immense and lasting research activity. Accompanied and supported by the growing
computing power of PCs, these methods are now an established approach in applied
statistics. Their appealing simplicity makes them easy to use in the many fields of
science where statistics is applied.
This manuscript discusses the bootstrap concept in the context of statistical
testing, with a focus on its use in statistical modeling. Furthermore, we would like
to address different reader preferences. Specifically, we have two types of readers
in mind: on the one hand, users of statistics who have a solid basic knowledge of
probability theory and who would like a goal-oriented, quick solution to their
problem; on the other hand, readers who are more interested in the theoretical
background of a problem solution and who have advanced knowledge of
probability theory and mathematical statistics.
In most cases, we start a topic with some introductory examples, basic mathematical
considerations, and simple implementations of the corresponding algorithm.
A reader who is mainly interested in applying a particular approach may stop after
such a section and apply the discussed procedures and implementations to the
problem in mind. This introductory part to a topic is mainly addressed to the first
type of reader. It can also be used just to motivate bootstrap approximations and to
apply them in simulation studies on a computer. The second part of a topic covers
the mathematical framework and further background material. This part is mainly
written for those readers who have a strong background in probability theory and
mathematical statistics.
Throughout all chapters, computational procedures are provided in R. R is a
powerful statistical computing environment, which is freely available and can be
downloaded from the R-Project website at www.r-project.org. We focus only on a
few but very popular packages from the so-called tidyverse, mainly ggplot2 and
dplyr. This hopefully helps readers who are not familiar with R to understand the
implementations more easily: first, because these packages make the source code quite
intuitive to read, and second, because their popularity means that a lot of helpful
information can be found on the Internet. Moreover, the repository of additional
R-packages created by the R-community is immense, also with respect to non-statistical
tasks, which makes it well worth learning and working with R. The R-programs
considered in the text are made available on the website
https://www.springer.com/gp/book/9783030734794.
The first three chapters provide introductory material and are mainly intended for
readers who have never come into contact with bootstrapping. Chapter 1 gives a
short introduction to the bootstrap idea and some notes on R. In Chap. 2, we
summarize some results about the generation of random numbers. Chapter 3 lists
some well-known results on the classical bootstrap method.
In Chap. 4, we discuss the first basic statistical tests using the bootstrap method.
Chapters 5 and 6 cover bootstrap applications in the context of linear and
generalized linear regression. The focus is on goodness-of-fit tests, which can be used to
detect contradictions between the data and the fitted model. We discuss the work of
Stute on marked empirical processes and transfer parts of his results into the
bootstrap context in order to approximate p-values for the individual
goodness-of-fit tests. Some of the results here are new, at least to the best of our
knowledge. Although the mathematics behind these applications is quite complex,
we consider these tests as useful tools in the context of statistical modeling and
learning. Some of the subsections focus exactly on this modeling topic.
In the appendix, some additional aspects of R with respect to bootstrap
applications are illustrated. In the first part of this appendix, some applications of the
“boot” R-package of Brian Ripley, which can be obtained from the R-project’s
website, are demonstrated. The second part describes the “simTool” R-package of
Marsel Scheer, which was written to simplify the implementation of simulation
studies such as bootstrap replications in R. This package also supports parallel
execution. Finally, the usage of our “bootGOF” R-package is
illustrated, which provides a tool to perform goodness-of-fit tests for (linear) models
as discussed in Chap. 6.

Jülich, Germany Gerhard Dikta


January 2021 Marsel Scheer
Acknowledgements

The first three chapters of this manuscript were written during the time when the
first author was employed as a Research Assistant at the Chair for Mathematical
Stochastics of the Justus Liebig University in Gießen. They were prepared for a
summer course at the Department of Mathematical Sciences at the University of
Wisconsin-Milwaukee, which the first author taught in 1988 (and several times later
on) after completing his doctorate.
Special thanks must be given to Prof. Dr. Winfried Stute, who supervised the
first author in Gießen. Professor Stute realized the importance of the bootstrap
method at a very early stage and inspired and promoted the first author’s interest
in it. In addition, Prof. Stute, together with Prof. Gilbert Walter from the
Department of Mathematical Sciences of the University of Wisconsin-Milwaukee
initiated a cooperation between the two departments, which ultimately formed the
basis for the long-lasting collaboration between the first author and his colleagues
from the statistics group in Milwaukee.
Financially, this long-term cooperation was later on supported by the Department
of Medical Engineering and Technomathematics of the Fachhochschule Aachen and
by the Department of Mathematical Sciences of the University of Wisconsin-
Milwaukee, and we would like to thank Profs. Karen Brucks, Allen Bell, Thomas
O’Bryan, and Richard Stockbridge for their kind assistance.
Finally, the first author would like to thank his colleagues from the statistics
group in Milwaukee, Jay Beder, Vytaras Brazauskas, and especially Jugal Ghorai,
and, from Fachhochschule Aachen, Martin Reißel for their helpful discussions and
support.
Also, the second author gives special thanks to Prof. Dr. Josef G. Steinebach
from the Department of Mathematics of the University of Cologne for his excellent
lectures in Statistics and Probability Theory.
We are both grateful to Dr. Andreas Kleefeld, who kindly provided us with
many comments and corrections to a preliminary version of the book.

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Basic Idea of the Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The R-Project for Statistical Computing . . . . . . . . . . . . . . . . . . . . 5
1.3 Usage of R in This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Further Non-Statistical R-Packages . . . . . . . . . . . . . . . . . . 6
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Generating Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Distributions in the R-Package Stats . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Uniform df. on the Unit Interval . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 The Quantile Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 The Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Method of Rejection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Generation of Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 The Classical Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 An Introductory Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Basic Mathematical Background of the Classical Bootstrap . . . . . . 27
3.3 Discussion of the Asymptotic Accuracy of the Classical
Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Empirical Process and the Classical Bootstrap . . . . . . . . . . . . . . . 34
3.5 Mathematical Framework of Mallow’s Metric . . . . . . . . . . . . . . . 36
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Bootstrap-Based Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 The One-Sample Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Two-Sample Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


4.4 Goodness-of-Fit (GOF) Test . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Mathematical Framework of the GOF Test . . . . . . . . . . . . . . . . . 65
4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5 Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 Homoscedastic Linear Regression under Fixed Design . . . . . . . . . 74
5.1.1 Model-Based Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.2 LSE Asymptotic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.1.3 LSE Bootstrap Asymptotic . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 Linear Correlation Model and the Bootstrap . . . . . . . . . . . . . . . . . 90
5.2.1 Classical Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2.2 Wild Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.3 Mathematical Framework of LSE . . . . . . . . . . . . . . . . . . . 99
5.2.4 Mathematical Framework of Classical
Bootstrapped LSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2.5 Mathematical Framework of Wild Bootstrapped LSE . . . . . 104
5.3 Generalized Linear Model (Parametric) . . . . . . . . . . . . . . . . . . . . 106
5.3.1 Mathematical Framework of MLE . . . . . . . . . . . . . . . . . . 121
5.3.2 Mathematical Framework of Bootstrap MLE . . . . . . . . . . . 133
5.4 Semi-parametric Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.4.1 Mathematical Framework of LSE . . . . . . . . . . . . . . . . . . . 147
5.4.2 Mathematical Framework of Wild Bootstrap LSE . . . . . . . 153
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6 Goodness-of-Fit Test for Generalized Linear Models . . . . . . . . . . . . 165
6.1 MEP in the Parametric Modeling Context . . . . . . . . . . . . . . . . . . 167
6.1.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.1.2 Bike Sharing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.1.3 Artificial Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.2 MEP in the Semi-parametric Modeling Context . . . . . . . . . . . . . . 187
6.2.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.2.2 Artificial Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.3 Comparison of the GOF Tests under the Parametric
and Semi-parametric Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.4 Mathematical Framework: Marked Empirical Processes . . . . . . . . 197
6.4.1 The Basic MEP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.4.2 The MEP with Estimated Model Parameters
Propagating in a Fixed Direction . . . . . . . . . . . . . . . . . . . 203
6.4.3 The MEP with Estimated Model Parameters
Propagating in an Estimated Direction . . . . . . . . . . . . . . . 207

6.5 Mathematical Framework: Bootstrap of Marked Empirical
Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.5.1 Bootstrap of the BMEP . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.5.2 Bootstrap of the EMEP . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

Appendix A: boot Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231


Appendix B: simTool Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Appendix C: bootGOF Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Appendix D: Session Info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Abbreviations

a.e. Almost everywhere


a.s. Almost sure
BMEP Basic marked empirical process
CLT Central limit theorem
CvM Cramér-von Mises
df. Distribution function
edf. Empirical distribution function of an i.i.d. sample
EMEP Estimated marked empirical process
EMEPE Estimated marked empirical process in estimated direction
GA General assumptions
GC Glivenko-Cantelli theorem
GLM Generalized linear model
GOF Goodness-of-fit
i.i.d. Independent and identically distributed
KS Kolmogorov-Smirnov
MEP Marked empirical process
MLE Maximum likelihood estimate
pdf. Probability density function
PRNG Pseudo-random number generators
qf. Quantile function
rv. Random variable
RSS Resampling scheme
SLLN Strong law of large numbers
W.l.o.g. Without loss of generality
WLLN Weak law of large numbers
w.p.1 With probability one


Notations

A := B          A is defined by B
A ≡ B           A and B are equivalent
B^n             Borel σ-algebra on R^n
C[0,1]          Space of continuous, real-valued functions on the unit interval
D[0,1]          Skorokhod space on the unit interval
E(X)            Expectation of the random variable X
E_n(X*)         Expectation of the bootstrap random variable X*
EXP(a)          Exponential distribution with parameter a > 0
F_n             Empirical distribution function
I_{x∈A}         Indicator function
I_A(x)          Indicator function
I_p             Identity matrix of size p × p
⟨·, ·⟩          Inner product of a Hilbert space
a ∧ b           Minimum of a and b
N(μ, σ²)        Normal distribution with expectation μ and variance σ²
P_n             Probability measure corresponding to bootstrap rvs. based on n
                original observations
P               Probability measure corresponding to the wild bootstrap
R_n             Basic marked empirical process (BMEP)
R¹_n            Marked empirical process with estimated parameters propagating in
                a fixed direction (EMEP)
R̄¹_n            Marked empirical process with estimated parameters propagating in
                an estimated direction (EMEPE)
UNI(a, b)       Uniform distribution on the interval [a, b]
UNI             Standard uniform distribution, i.e., UNI(0, 1)
VAR(X)          Variance of the random variable X
VAR_n(X*)       Variance of the bootstrap random variable X*
WEIB(a, b)      Weibull distribution with parameters a and b
X ∼ F           Random variable X is distributed according to F
Chapter 1
Introduction

In this introduction, we discuss the basic idea of the bootstrap procedure using a
simple example. Furthermore, the statistical software R and its use in the context of
this manuscript are briefly covered. Readers who are familiar with this material can
skip this chapter.
A short summary of the contents of this manuscript can be found in the Preface
and is not listed here again.

1.1 Basic Idea of the Bootstrap

Typical statistical methods, such as constructing a confidence interval for the expected
value of a random variable or determining critical values for a hypothesis test, require
knowledge of the underlying distribution. However, this distribution is usually at
most partially known. The statistical method we use to perform the task depends
on our knowledge of the underlying distribution.
Let us be more precise and assume that

X_1, \ldots, X_n \sim F

is a sequence of independent and identically distributed (i.i.d.) random variables with


common distribution function (df.) F. Consider the statistic

Electronic supplementary material The online version of this chapter
(https://doi.org/10.1007/978-3-030-73480-0_1) contains supplementary material,
which is available to authorized users.

© Springer Nature Switzerland AG 2021
G. Dikta and M. Scheer, Bootstrap Methods,
https://doi.org/10.1007/978-3-030-73480-0_1

\bar{X}_n := \frac{1}{n} \sum_{i=1}^{n} X_i

to estimate the parameter μ F = E(X ), that is, the expectation of X .


To construct a confidence interval for μ F or to perform a hypothesis test on μ F ,
we consider the df. of the studentized version of X̄ n , that is,
P_F\left( \sqrt{n}\,(\bar{X}_n - \mu_F)/s_n \le x \right), \quad x \in \mathbb{R}, \qquad (1.1)

where
s_n^2 := \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2

is the unbiased estimator of σ 2 = VAR(X ), that is, the variance of X . Note that we
write P F here to indicate that F is the data generating df.
If we know that F comes from the class of normal distributions, then the df. under
(1.1) belongs to a tn−1 −distribution, i.e., a Student’s t distribution with n − 1 degrees
of freedom. Using the known quantiles of the t_{n-1} distribution, an exact confidence
interval can be determined. For example, an exact 90% confidence interval for μ_F is
given by
\left[ \bar{X}_n - \frac{s_n q_{0.95}}{\sqrt{n}},\; \bar{X}_n + \frac{s_n q_{0.95}}{\sqrt{n}} \right], \qquad (1.2)

where q0.95 is the 95% quantile of the tn−1 distribution.
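As a quick illustration, the exact t-based interval (1.2) can be computed in a few lines of R. The data are simulated here purely for demonstration, so the sample size and the distribution parameters are our own choices:

```r
# Exact 90% confidence interval for mu_F under normality, cf. (1.2).
set.seed(123)                         # for reproducibility
x <- rnorm(25, mean = 5, sd = 2)      # illustrative sample, n = 25
n <- length(x)
q_95 <- qt(0.95, df = n - 1)          # 95% quantile of the t_{n-1} df.
ci <- mean(x) + c(-1, 1) * sd(x) * q_95 / sqrt(n)
ci
```

Note that `sd(x)` uses the factor 1/(n - 1) and therefore computes exactly the s_n defined above.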


But in most situations we are not able to specify a parametric distribution class
for F. In such a case, we have to look for a suitable approximation for (1.1). If it is
ensured that E(X 2 ) < ∞, the central limit theorem (CLT) guarantees that
\sup_{x \in \mathbb{R}} \left| P_F\left( \sqrt{n}\,(\bar{X}_n - \mu_F)/s_n \le x \right) - \Phi(x) \right| \longrightarrow 0, \quad \text{for } n \to \infty, \qquad (1.3)

where Φ denotes the standard normal df. Based on the CLT, we can now construct
an asymptotic confidence interval. For example, the 90% confidence interval under
(1.2) has the same structure when we construct it using the CLT. However, q0.95 now
is the 95% quantile of the standard normal distribution. The interval constructed in
this way is no longer an exact confidence interval. It can only be guaranteed that the
confidence level of 90% is reached with n → ∞. It should also be noted that for q0.95
the 95% quantile of the tn−1 − distribution can also be chosen, because for n → ∞,
the tn−1 − df. converges to the standard normal df.
So far we have concentrated exclusively on the studentized mean. Let us generalize
this to a statistic of the type

T_n(F) = T_n(X_1, \ldots, X_n; F),

where X_1, \ldots, X_n \sim F are i.i.d. Then the question arises how to approximate the
df.

P_F\left( T_n(F) \le x \right), \quad x \in \mathbb{R} \qquad (1.4)

if F is unknown. This is where Efron’s bootstrap enters the game. The basic idea of
the bootstrap method is the assumption that the df. of Tn is about the same when the
data generating distribution F is replaced by another data generating distribution F̂
which is close to F and which is known to us. If we can find such a df. F̂,

P_{\hat F}\left( T_n(\hat F) \le x \right), \quad x \in \mathbb{R} \qquad (1.5)

may also be an approximation of Eq. (1.4). We call this df. for the moment a bootstrap
approximation of the df. given under Eq. (1.4). However, this approach only makes
sense if we can guarantee that
    

\sup_{x \in \mathbb{R}} \left| P_F\left( T_n(F) \le x \right) - P_{\hat F}\left( T_n(\hat F) \le x \right) \right| \longrightarrow 0, \quad \text{for } n \to \infty. \qquad (1.6)

Now let us go back to construct a 90% confidence interval for μ F based on the
bootstrap approximation. For this, we take the studentized mean for Tn and assume
that we have a data generating df. F̂ that satisfies (1.6). Since F̂ is known, we can
now, at least theoretically, calculate the 5% and 95% quantiles of the df.
P_{\hat F}\left( \sqrt{n}\,(\bar{X}_n - \mu_{\hat F})/s_n \le x \right),

which we denote by qn,0.05 and qn,0.95 , respectively, to derive


\left[ \bar{X}_n - \frac{s_n q_{n,0.95}}{\sqrt{n}},\; \bar{X}_n - \frac{s_n q_{n,0.05}}{\sqrt{n}} \right], \qquad (1.7)

an asymptotic 90% confidence interval for μ F .


If we want to use such a bootstrap approach, we have
(A) to choose the data generating df. F̂ such that the bootstrap approximation (1.6)
holds,
(B) to calculate the df. of Tn , where the sample is generated under F̂.
Certainly (A) is the more demanding part, in particular, the proof of the approximation
(1.6). Fortunately, a lot of work has been done on this in the last decades. Also, the
calculation of the df. under (B) may turn out to be very complex. However, this is of
minor importance, because the bootstrap df. in Eq. (1.6) can be approximated very
well by a Monte Carlo approach. It is precisely this opportunity to perform a Monte
Carlo approximation, together with the rapid development of powerful PCs that has
led to the great success of the bootstrap approach.
To demonstrate such a Monte Carlo approximation for the df. of Eq. (1.5), we
proceed as follows:

(a) Construct m i.i.d. (bootstrap) samples, independent of one another, of the type

X^*_{1;1} \;\ldots\; X^*_{1;n}
\vdots \qquad\quad \vdots
X^*_{m;1} \;\ldots\; X^*_{m;n}

with common df. F̂.


(b) Calculate for each sample k ∈ {1, 2, . . . , m}
T^*_{k;n} := T_n(X^*_{k;1}, \ldots, X^*_{k;n}; \hat F)

to obtain T^*_{1;n}, \ldots, T^*_{m;n}.
∗ ∗
(c) Since the T^*_{1;n}, \ldots, T^*_{m;n} are i.i.d., the Glivenko-Cantelli theorem (GC) guarantees

\sup_{x \in \mathbb{R}} \left| P_{\hat F}\left( T_n(\hat F) \le x \right) - \frac{1}{m} \sum_{k=1}^{m} I_{\{T^*_{k;n} \le x\}} \right| \longrightarrow 0, \quad \text{for } m \to \infty, \qquad (1.8)

where I{x∈A} ≡ I{A} (x) denotes the indicator function of the set A, that is,

I_{\{x \in A\}} = \begin{cases} 1, & x \in A \\ 0, & x \notin A. \end{cases}

The choice of an appropriate F̂ depends on the underlying problem, as we will


see in the following chapters. In the context of this introduction, Fn , the empirical
df. (edf.) of the sample X 1 , . . . , X n , defined by

F_n(x) := \frac{1}{n} \sum_{i=1}^{n} I_{\{X_i \le x\}}, \quad x \in \mathbb{R}, \qquad (1.9)

is a good choice for \hat F since, by the Glivenko-Cantelli theorem, we get with
probability one (w.p.1)

\sup_{x \in \mathbb{R}} \left| F_n(x) - F(x) \right| \longrightarrow 0 \quad \text{as } n \to \infty.
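This convergence can also be observed numerically. The following sketch uses simulated standard uniform data (our own choice for F, purely for illustration) and computes the sup-distance between F_n, here obtained via R's built-in ecdf(), and the true df.:

```r
# Sup-distance between the edf. F_n of (1.9) and the true df. F,
# with F taken as the standard uniform df. for illustration.
set.seed(42)
x  <- runif(1000)
Fn <- ecdf(x)                          # the edf. F_n as a step function
xs <- sort(x)
n  <- length(xs)
# the sup of |F_n - F| is attained at the jump points of F_n
sup_dist <- max(abs(Fn(xs) - punif(xs)),
                abs(Fn(xs) - 1/n - punif(xs)))
sup_dist
```

For n = 1000 the resulting distance is already quite small, and by the Glivenko-Cantelli theorem it tends to 0 w.p.1 as n grows.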

If we choose F_n for \hat F, then we are talking about the classical bootstrap, which
was historically the first to be studied.
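For the studentized mean, the whole resampling scheme (a)-(c) with the classical bootstrap choice \hat F = F_n boils down to a few lines of R. The sample below is simulated and m = 2000 is an arbitrary choice; note that μ_{F̂} equals \bar X_n when \hat F = F_n:

```r
# Monte Carlo approximation of the bootstrap df. and the
# bootstrap confidence interval (1.7), classical bootstrap.
set.seed(321)
x <- rexp(30)                   # illustrative sample; F unknown in practice
n <- length(x)
m <- 2000                       # number of bootstrap samples

# steps (a) and (b): resample from F_n and studentize
t_star <- replicate(m, {
  xs <- sample(x, size = n, replace = TRUE)
  sqrt(n) * (mean(xs) - mean(x)) / sd(xs)
})

# step (c): the bootstrap quantiles q_{n,0.05} and q_{n,0.95}
q <- quantile(t_star, probs = c(0.05, 0.95))

# asymptotic 90% confidence interval for mu_F, cf. (1.7)
ci <- c(mean(x) - sd(x) * q[2] / sqrt(n),
        mean(x) - sd(x) * q[1] / sqrt(n))
unname(ci)
```

Note the reversed quantiles in the interval, exactly as in (1.7): the upper bootstrap quantile determines the lower confidence bound and vice versa.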

1.2 The R-Project for Statistical Computing

The programming language R, see R Core Team (2019), is a widely used
open-source software tool for data analysis and graphics which runs on the commonly
used operating systems. It can be downloaded from the R-project’s website at
www.r-project.org. The R Development Core Team also offers some documentation
on this website:
• R installation and administration,
• An introduction to R,
• The R language definition,
• R data import/export, and
• The R reference index.
In addition to this material, there is a large and rapidly growing number of textbooks
available covering the R programming language and the applications of R in
different fields of data analysis, for instance, Beginning R or Advanced R.
Besides the R software, one should also install an editor or an integrated development
environment (IDE) to work with R conveniently. Several open-source products
are available on the web, like
• RStudio, see RStudio Team (2020), at www.rstudio.org;
• RKWard, at http://rkward.sourceforge.net;
• Tinn-R, at http://www.sciviews.org/Tinn-R; and
• Eclipse-based StatET, at http://www.walware.de/goto/statet.

1.3 Usage of R in This Book

Throughout the book we implement, for instance, different resampling schemes and
simulation studies in R. Our implementations are free from any checking of function
arguments. We provide R-code that focuses solely on an understandable
implementation of a certain algorithm. Therefore, there is plenty of room to improve the
implementations. Some of these improvements will be discussed within the exercises.
R is organized in packages. A new installation of R comes with some pre-installed
packages, and the packages provided by the R-community make this programming
language really powerful. More than 15,000 packages were available as of February
2020, and the number is still growing. Especially for people starting with R, however,
this abundance can also be a problem. The CRAN Task View at
https://cran.r-project.org/web/views summarizes certain packages
within categories like “Graphics”, “MachineLearning”, or “Survival”. We decided
to use only a handful of packages that are directly related to the main objective
of this book, like the boot-package for bootstrapping, or (in the opinion of the
authors) are too important and helpful to be ignored, like ggplot2, dplyr, and
tidyr. In addition, we have often used the simTool package from Marsel Scheer
to carry out simulations. This package is explained in the appendix. Furthermore,

we decided to use the pipe operator, i.e., %>%. There are a few critical voices about
this operator, but the authors, like most R users, find it very comfortable to work
with. People familiar with Unix systems will recognize the concept
and probably appreciate it. A small example will demonstrate how the pipe operator
works. Suppose we want to apply a function A to the object x and the result of this
operation should be processed further by the function B. Without the pipe operator
one could use
B(A(x))
# or
tmp = A(x)
B(tmp)

With the pipe operator this becomes


A(x) %>%
B
# or
x %>%
A %>%
B

Especially with longer chains of functions, using pipes may help to obtain R-code
that is easier to understand.
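A slightly more realistic chain, using dplyr (one of the packages mentioned above) on R's built-in mtcars data, might look as follows; the grouping variable and the summary statistic are our own choices for this sketch:

```r
# Group the built-in mtcars data by the number of cylinders and
# compute the mean fuel consumption per group, using a pipe chain.
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```

Without pipes, the same computation would read summarise(group_by(mtcars, cyl), mean_mpg = mean(mpg)), which must be parsed from the inside out.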

1.3.1 Further Non-Statistical R-Packages

There are a lot of packages that are worth looking at. Again, the CRAN Task View may
be a good starting point. The following list focuses on writing reports, developing
R-packages, and increasing the speed of R-code itself. This list is by no means exhaustive:
• knitr for writing reports (this book was written with knitr);
• readxl for the import of Excel files;
• testthat for creating automated unit tests. It is also helpful for checking function
arguments;
• covr for assessing the test coverage of the unit tests;
• devtools for creating/writing packages;
• data.table for amazingly fast aggregation, joins, and various manipulations of
large datasets;
• roxygen2 for creating help pages within packages;
• Rcpp for a simple integration of C++ into R;
• profvis, a profiling tool that assesses at which line of code R spends its time;
• checkpoint and renv for package dependency management.
Of course, further packages exist for importing datasets, connecting to databases,
creating interactive graphs and user interfaces, and so on. Again, the packages
provided by the R-community make this programming language really powerful.
