Basic Elements of Computational Statistics


Preface

The R programming language is becoming the lingua franca of computational
statistics. It is the usual statistical software platform used by statisticians,
economists, engineers and scientists, both in corporations and in academia.
Established international companies use R in their data analysis. R has gained its
popularity for two reasons. First, it is an OS-independent, free, open-source
program which is popularised and improved by hundreds of volunteers all over the
world. A plethora of packages are available for many scientific disciplines. Second,
ordinary analysts can do complicated analyses without deep computer programming
knowledge. This book on the basic elements of computational statistics presents the
tools and concepts of univariate and multivariate data analyses with a strong focus
on applications and implementations. The aim of this book is to present data
analysis in a way that is understandable for non-mathematicians and practitioners
who are confronted by statistical data analysis. All practical examples may be
recalculated and modified by the reader: all data sets and programmes (Quantlets)
used in the book are downloadable from the publisher’s home page of this book
(www.quantlet.de). The text contains a wide variety of exercises and covers the
basic mathematical, statistical and programming problems.
The first chapter introduces the reader to the basics of the R language, taking into
account that only minimal prior experience in programming is required. Starting
with the development history and R environments under different operating systems,
the book discusses the syntax. We start the description of the syntax with the
classical ‘Hello World!!!’ program. The use of R as an advanced calculator, data
types, loops, if-then conditions, the construction of one’s own functions and classes
are the topics covered in this chapter. As in statistical analysis one deals with
data, special attention is paid to working with vectors and matrices.
The second part deals with the numerical techniques which one needs during the
analysis. A short excursion into matrix algebra will be helpful in understanding
multivariate techniques presented in later sections. Different methods of
numerical integration, differentiation and root finding help the reader to get inside
the core of the R system.
Chapter 3 highlights set theory, combinatoric rules, plus some of the main
discrete distributions: binomial, multinomial, hypergeometric and Poisson.
Different characteristics, cumulative distribution functions and density functions
of the continuous distributions: uniform, normal, t, χ², F, exponential and Cauchy
will be explained in detail in Chapter 4.
The next chapter is devoted to univariate statistical analysis and basic smoothing
techniques. The histogram, kernel density estimator, graphical representation of the
data, confidence intervals, different simple tests as well as tests that need more
computations, like the Wilcoxon, Kruskal–Wallis, sign tests, are the topics of
Chapter 5.
The sixth chapter deals with multivariate distributions: their definition,
characteristics and application of general multivariate distributions, multinormal
distributions, as well as classes of copulas. Further, Chapter 7 discusses linear and
nonlinear relationships via regression models.
Chapter 8 partially extends the problems solved in Chapter 5, but also considers
more sophisticated topics, such as multidimensional scaling, principal component,
factor, discriminant and cluster analysis. These techniques are difficult to apply
without computational power, so they are of special interest in this book.
Theoretical models need to be calibrated in practice. If there is no data available,
then Monte Carlo simulation techniques are necessary parts of each study. Chapter
9 starts from simple sampling techniques from the uniform distribution. These are
further extended to simulation methods from other univariate distributions. We also
discuss simulation from multivariate distributions, especially copulae.
Chapter 10 describes more advanced graphical techniques, with special attention
to three-dimensional graphics and interactive programmes using packages lattice,
rgl and rpanel.
This book is designed for the advanced undergraduate and first-year graduate
student as well as for the inexperienced data analyst who would like a tour of the
various statistical tools in a data analysis workshop. The experienced reader with a
good knowledge of statistics and programming will certainly skip some sections
of the univariate models, but hopefully enjoy the various mathematical roots of the
multivariate techniques. A graduate student might think that the first section on
descriptive techniques is well known to him from his training in introductory
statistics. The programming, mathematical and the applied parts of the book will
certainly introduce him to the rich realm of statistical data analysis modules.
A book of this kind would not have been possible without the help of many
friends, colleagues and students. For many suggestions, corrections and technical
support, we would like to thank Aymeric Bouley, Xiaofeng Cao, Johanna Simone
Eckel, Philipp Gschöpf, Gunawan Gunawan, Johannes Haupt, Uri Yakobi Keller,
Polina Marchenko, Félix Revert, Alexander Ristig, Benjamin Samulowski, Martin
Schelisch, Christoph Schult, Noa Tamir, Anastasija Tetereva, Tatjana
Tissen-Diabaté, Ivan Vasylchenko and Yafei Xu. We thank Alice Blanck and
Veronika Rosteck from Springer Verlag for continuous support and valuable
suggestions on the style of writing and the content covered. Special thanks go to the
anonymous proofreaders who checked not only the language but also the statistical,
programming and mathematical content of the book. All errors are our own.

Berlin, Germany      Wolfgang Karl Härdle
Dresden, Germany     Ostap Okhrin
Augsburg, Germany    Yarema Okhrin
April 2017
Contents

1 The Basics of R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 R on Your Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 History of the R Language . . . . . . . . . . . . . . . . . . . . 1
1.2.2 Installing and Updating R. . . . . . . . . . . . . . . . . . . . . 2
1.2.3 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 First Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 “Hello World !!!” . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Getting Help. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Working Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Basics of the R Language. . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 R as a Calculator. . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.2 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.3 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.4 Data Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.5 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4.6 Programming in R . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.7 Date Types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.4.8 Reading and Writing Data from and to Files. . . . . . . . 30
2 Numerical Techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.1 Matrix Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.1.1 Characteristics of Matrices . . . . . . . . . . . . . . . . . . . . 34
2.1.2 Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.1.3 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . 41
2.1.4 Spectral Decomposition . . . . . . . . . . . . . . . . . . . . . . 43
2.1.5 Norm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.2.1 Integration of Functions of One Variable . . . . . . . . . . 46
2.2.2 Integration of Functions of Several Variables . . . . . . . 50
2.3 Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.3.1 Analytical Differentiation . . . . . . . . . . . . . . . . . . . . . 54
2.3.2 Numerical Differentiation . . . . . . . . . . . . . . . . . . . . . 56
2.3.3 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . 59
2.4 Root Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.4.1 Solving Systems of Linear Equations. . . . . . . . . . . . . 62
2.4.2 Solving Systems of Nonlinear Equations . . . . . . . . . . 64
2.4.3 Maximisation and Minimisation of Functions . . . . . . . 66
3 Combinatorics and Discrete Distributions . . . . . . . . . . . . . . . . . . 77
3.1 Set Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1.1 Creating Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1.2 Basics of Set Theory . . . . . . . . . . . . . . . . . . . . . . . . 78
3.1.3 Base Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.1.4 Sets Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.1.5 Generalised Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.2 Probabilistic Experiments with Finite Sample Spaces . . . . . . . . 85
3.2.1 R Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.2.2 Sample Space and Sampling from Urns . . . . . . . . . . . 87
3.2.3 Sampling Procedure. . . . . . . . . . . . . . . . . . . . . . . . . 91
3.2.4 Random Variables. . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.3 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.1 Bernoulli Random Variables. . . . . . . . . . . . . . . . . . . 94
3.3.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . 95
3.3.3 Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.4 Multinomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.5 Hypergeometric Distribution. . . . . . . . . . . . . . . . . . . . . . . . . 101
3.6 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.6.1 Summation of Poisson Distributed Random Variables . . . . . . . . 106
4 Univariate Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.1 Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.1.1 Properties of Continuous Distributions . . . . . . . . . . . . 110
4.2 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.3 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4 Distributions Related to the Normal Distribution . . . . . . . . . . . 114
4.4.1 χ² Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.4.2 Student’s t-distribution. . . . . . . . . . . . . . . . . . . . . . . 117
4.4.3 F-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.5 Other Univariate Distributions . . . . . . . . . . . . . . . . . . . . . . . 121
4.5.1 Exponential Distribution. . . . . . . . . . . . . . . . . . . . . . 121
4.5.2 Stable Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.5.3 Cauchy Distribution. . . . . . . . . . . . . . . . . . . . . . . . . 127
5 Univariate Statistical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.1 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.1.1 Graphical Data Representation . . . . . . . . . . . . . . . . . 130
5.1.2 Empirical (Cumulative) Distribution Function . . . . . . . 132
5.1.3 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.1.4 Kernel Density Estimation . . . . . . . . . . . . . . . . . . . . 135
5.1.5 Location Parameters . . . . . . . . . . . . . . . . . . . . . . . . 137
5.1.6 Dispersion Parameters . . . . . . . . . . . . . . . . . . . . . . . 140
5.1.7 Higher Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.1.8 Box-Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.2 Confidence Intervals and Hypothesis Testing . . . . . . . . . . . . . 146
5.2.1 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . 146
5.2.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.3 Goodness-of-Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
5.3.1 General Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.3.2 Tests for Normality . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.3.3 Wilcoxon Signed Rank Test and Mann–Whitney U Test . . . . . . . . 167
5.3.4 Kruskal–Wallis Test . . . . . . . . . . . . . . . . . . . . . . . . 169
6 Multivariate Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.1 The Distribution Function and the Density Function of a Random Vector . . . . 171
6.1.1 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.2 The Multinormal Distribution . . . . . . . . . . . . . . . . . . . . . . . . 178
6.2.1 Sampling Distributions and Limit Theorems . . . . . . . . 182
6.3 Copulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.3.1 Copula Families . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.3.2 Archimedean Copulae . . . . . . . . . . . . . . . . . . . . . . . 189
6.3.3 Hierarchical Archimedean Copulae . . . . . . . . . . . . . . 191
6.3.4 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.1 Idea of Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.2.1 Model Selection Criteria . . . . . . . . . . . . . . . . . . . . . 200
7.2.2 Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . 201
7.3 Nonparametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 205
7.3.1 General Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.3.2 Kernel Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.3.3 k-Nearest Neighbours (k-NN) . . . . . . . . . . . . . . . . . . 209
7.3.4 Splines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
7.3.5 LOESS or Local Regression . . . . . . . . . . . . . . . . . . . 213
8 Multivariate Statistical Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.1 Principal Components Analysis. . . . . . . . . . . . . . . . . . . . . . . 219
8.2 Factor Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
8.2.1 Maximum Likelihood Factor Analysis . . . . . . . . . . . . 225
8.3 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
8.3.1 Proximity of Objects . . . . . . . . . . . . . . . . . . . . . . . . 230
8.3.2 Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . 231
8.4 Multidimensional Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
8.4.1 Metric Multidimensional Scaling . . . . . . . . . . . . . . . . 235
8.4.2 Non-metric Multidimensional Scaling . . . . . . . . . . . . 236
8.5 Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
9 Random Numbers in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.1 Generating Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . 243
9.1.1 Pseudorandom Number Generators . . . . . . . . . . . . . . 244
9.1.2 Uniformly Distributed Pseudorandom Numbers. . . . . . 248
9.1.3 Uniformly Distributed True Random Numbers . . . . . . 249
9.2 Generating Random Variables . . . . . . . . . . . . . . . . . . . . . . . 250
9.2.1 General Principles for Random Variable Generation . . . . . . . . 251
9.2.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 253
9.2.3 Random Variable Generation for Continuous Distributions . . . . . 253
9.2.4 Random Variable Generation for Discrete Distributions . . . . . . 259
9.2.5 Random Variable Generation for Multivariate Distributions . . . . 261
9.3 Tests for Randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
9.3.1 Birthday Spacings . . . . . . . . . . . . . . . . . . . . . . . . . . 266
9.3.2 k-Distribution Test . . . . . . . . . . . . . . . . . . . . . . . . . 266
10 Advanced Graphical Techniques in R . . . . . . . . . . . . . . . . . . . . . 269
10.1 Package lattice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
10.1.1 Getting Started with lattice . . . . . . . . . . . . . . . . . 270
10.1.2 formula Argument . . . . . . . . . . . . . . . . . . . . . . . . 270
10.1.3 panel Argument and Appearance Settings . . . . . . . . 272
10.1.4 Conditional and Grouped Plots . . . . . . . . . . . . . . . . . 273
10.1.5 Concept of shingle . . . . . . . . . . . . . . . . . . . . . . . 275
10.1.6 Time Series Plots . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10.1.7 Three- and Four-Dimensional Plots . . . . . . . . . . . . . . 279
10.2 Package rgl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
10.2.1 Getting Started with rgl . . . . . . . . . . . . . . . . . . . . . 281
10.2.2 Shape Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
10.2.3 Export and Animation Functions . . . . . . . . . . . . . . . . 287
10.3 Package rpanel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
10.3.1 Getting Started with rpanel . . . . . . . . . . . . . . . . . . 289
10.3.2 Application Functions in rpanel. . . . . . . . . . . . . . . 293

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Notation

Basics
$X, Y$                              Random variables or vectors
$X_1, X_2, \ldots, X_p$             Random variables
$X = (X_1, \ldots, X_p)^\top$       Random vector
$X \sim \cdot$                      $X$ has distribution $\cdot$
$C, D$                              Matrices
$A, B, \mathcal{X}, \mathcal{Y}$    Data matrices
$\Sigma$                            Covariance matrix
$1_n$                               Vector of ones $(\underbrace{1, \ldots, 1}_{n \text{ times}})^\top$
$0_n$                               Vector of zeros $(\underbrace{0, \ldots, 0}_{n \text{ times}})^\top$
$I_n$                               Identity matrix
$I(\cdot)$                          Indicator function
$\lceil \ldots \rceil$              Ceiling function
$\lfloor \ldots \rfloor$            Floor function
$i$                                 Imaginary unit, $i^2 = -1$
$\Rightarrow$                       Implication
$\Leftrightarrow$                   Equivalence
$\approx$                           Approximately equal
iff                                 if and only if, equivalence
i.i.d.                              Independent and identically distributed
rv                                  Random variable
$\mathbb{R}^n$                      $n$-dimensional space of real numbers
$\delta_{ik}$                       The Kronecker delta, that is 1 if $i = k$ and 0 otherwise
$\mathcal{P}_n$                     $\mathcal{P}_n = \{ t \in C[a, b] \mid t(x) = \sum_{i=0}^{n} a_i x^i,\ a_i \in \mathbb{R} \}$
$f(x) \in O\{g(x)\}$                There is $k > 0$ such that for all sufficiently large
                                    values of $x$, $f(x)$ is at most $k\,g(x)$ in absolute value
$\operatorname{med}(x)$             The median value of the sample $x$

Samples
$x, y$                                      Observations of $X$ and $Y$
$x_1, \ldots, x_n = \{x_i\}_{i=1}^{n}$      Sample of $n$ observations of $X$
$\mathcal{X} = \{x_{ij}\}_{i=1,\ldots,n;\, j=1,\ldots,p}$
                                            ($n \times p$) data matrix of observations of $X_1, \ldots, X_p$
                                            or of $X = (X_1, \ldots, X_p)^\top$
$x_{(1)}, \ldots, x_{(n)}$                  The order statistic of $x_1, \ldots, x_n$
$H$                                         Centering matrix, $H = I_n - n^{-1} 1_n 1_n^\top$
$\bar{x}$                                   The sample mean
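
The centering matrix can be checked numerically. The following R lines are a minimal sketch, not taken from the book: the 4 × 2 data matrix `X` is made up for illustration, and the check simply confirms that multiplying by H subtracts the column means.

```r
# Hypothetical (n x p) data matrix with n = 4 observations of p = 2 variables
X <- matrix(c(1, 2, 3, 6,
              2, 4, 6, 8), nrow = 4, ncol = 2)
n <- nrow(X)

# Centering matrix H = I_n - n^{-1} 1_n 1_n^T
ones <- rep(1, n)
H <- diag(n) - (1 / n) * ones %*% t(ones)

# Multiplying by H subtracts the column means from each column of X
X_centered <- H %*% X
colMeans(X_centered)                                # numerically zero
all.equal(X_centered, sweep(X, 2, colMeans(X)))     # compares with direct centering
```

The comparison with `sweep()` is only there to show that both routes give the same centred data; in practice `scale(X, scale = FALSE)` would usually be preferred over forming H explicitly.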

Densities and Distribution Functions


$f(x)$                                          Density of $X$
$f(x, y)$                                       Joint density of $X$ and $Y$
$f_X(x), f_Y(y)$                                Marginal densities of $X$ and $Y$
$f_{X_1}(x_1), \ldots, f_{X_p}(x_p)$            Marginal densities of $X_1, \ldots, X_p$
$\hat{f}_h(x)$                                  Histogram or kernel estimator of $f(x)$
$F(x)$                                          Distribution function of $X$
$F(x, y)$                                       Joint distribution function of $X$ and $Y$
$F_X(x), F_Y(y)$                                Marginal distribution functions of $X$ and $Y$
$F_{X_1}(x_1), \ldots, F_{X_d}(x_d)$            Marginal distribution functions of $X_1, \ldots, X_d$
$\varphi_X(t)$                                  Characteristic function of $X$
$m_k$                                           $k$-th moment of $X$
$\hat{F}(x)$                                    Empirical cumulative distribution function (ecdf)
pdf                                             Probability density function

Empirical Moments
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
        Average of $X$ sampled by $\{x_i\}_{i=1,\ldots,n}$
$s^2_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
        Empirical covariance of random variables $X$ and $Y$
        sampled by $\{x_i\}_{i=1,\ldots,n}$ and $\{y_i\}_{i=1,\ldots,n}$
$s^2_{XX} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
        Empirical variance of random variable $X$ sampled
        by $\{x_i\}_{i=1,\ldots,n}$
$r_{XY} = \frac{s^2_{XY}}{\sqrt{s^2_{XX}\, s^2_{YY}}}$
        Empirical correlation of $X$ and $Y$
$\hat{\Sigma} = \{s_{X_i X_j}\}$
        Empirical covariance matrix of a sample or observations
        of $X_1, \ldots, X_p$ or of the random vector
        $X = (X_1, \ldots, X_p)^\top$
$R = \{r_{X_i X_j}\}$
        Empirical correlation matrix of a sample or observations
        of $X_1, \ldots, X_p$ or of the random vector
        $X = (X_1, \ldots, X_p)^\top$
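
The empirical moments listed here map directly onto a few lines of R. The sketch below is an illustration under assumed data: the two samples `x` and `y` are invented, and the variable names are chosen only to mirror the notation. It computes each quantity from its definition and compares the results with the built-in functions `mean()`, `cov()`, `var()` and `cor()`.

```r
# Hypothetical samples of X and Y (values chosen only for illustration)
x <- c(2, 4, 4, 5, 7)
y <- c(1, 3, 2, 5, 6)
n <- length(x)

# Sample means
x_bar <- sum(x) / n
y_bar <- sum(y) / n

# Empirical covariance and variances, with the 1 / (n - 1) factor
s_xy <- sum((x - x_bar) * (y - y_bar)) / (n - 1)
s_xx <- sum((x - x_bar)^2) / (n - 1)
s_yy <- sum((y - y_bar)^2) / (n - 1)

# Empirical correlation
r_xy <- s_xy / sqrt(s_xx * s_yy)

# The built-in functions use the same definitions
all.equal(c(x_bar, s_xy, s_xx, r_xy),
          c(mean(x), cov(x, y), var(x), cor(x, y)))
```

For several variables at once, `cov()` and `cor()` applied to a data matrix return the empirical covariance and correlation matrices.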

Distributions
$\varphi(x)$                        Density of the standard normal distribution
$\Phi(x)$                           Cumulative distribution function of the standard
                                    normal distribution
$N(0, 1)$                           Standard normal or Gaussian distribution
$N(\mu, \sigma^2)$                  Normal distribution with mean $\mu$ and variance $\sigma^2$
$N_d(\mu, \Sigma)$                  $d$-dimensional normal distribution with mean $\mu$ and
                                    covariance matrix $\Sigma$
$\xrightarrow{\mathcal{L}}$         Convergence in distribution
$\xrightarrow{a.s.}$                Almost sure convergence
$\overset{a}{\sim}$                 Asymptotic distribution
$U(a, b)$                           Uniform distribution on $(a, b)$
CLT                                 Central Limit Theorem
$\chi^2_p$                          $\chi^2$ distribution with $p$ degrees of freedom
$\chi^2_{1-\alpha;p}$               $1 - \alpha$ quantile of the $\chi^2$ distribution with $p$ degrees
                                    of freedom
$t_n$                               $t$-distribution with $n$ degrees of freedom
$t_{1-\alpha/2;n}$                  $1 - \alpha/2$ quantile of the $t$-distribution with $n$ degrees
                                    of freedom
$F_{n,m}$                           $F$-distribution with $n$ and $m$ degrees of freedom
$F_{1-\alpha;n,m}$                  $1 - \alpha$ quantile of the $F$-distribution with $n$ and $m$
                                    degrees of freedom
$B(n, p)$                           Binomial distribution
$H(x; n, M, N)$                     Hypergeometric distribution
$\operatorname{Pois}(\lambda_i)$    Poisson distribution with parameter $\lambda_i$
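
In R, the density, distribution function and quantile notation above corresponds to the `d*`, `p*` and `q*` families of functions in the `stats` package. A brief sketch follows; the level `alpha` and all degrees of freedom are picked arbitrarily for illustration.

```r
alpha <- 0.05

# Standard normal: density, cdf and the 1 - alpha/2 quantile
dnorm(0)                       # density at 0
pnorm(0)                       # cdf at 0, equals 0.5
q_norm <- qnorm(1 - alpha / 2) # roughly 1.96

# 1 - alpha quantile of the chi-squared distribution with p = 3 degrees of freedom
q_chisq <- qchisq(1 - alpha, df = 3)

# 1 - alpha/2 quantile of the t-distribution with n = 10 degrees of freedom
q_t <- qt(1 - alpha / 2, df = 10)

# 1 - alpha quantile of the F-distribution with n = 2 and m = 10 degrees of freedom
q_f <- qf(1 - alpha, df1 = 2, df2 = 10)

# Each quantile is the inverse of the corresponding distribution function
pchisq(q_chisq, df = 3)        # gives back 1 - alpha
```

The same naming pattern applies to the discrete distributions, for example `dbinom()`, `dhyper()` and `dpois()` for the binomial, hypergeometric and Poisson probability functions.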

Mathematical Abbreviations
$\operatorname{tr}(A)$      Trace of matrix $A$
$\operatorname{diag}(A)$    Diagonal of matrix $A$
$\operatorname{rank}(A)$    Rank of matrix $A$
$\det(A)$                   Determinant of matrix $A$
id                          Identity function on a vector space $V$
$C[a, b]$                   The set of all continuously differentiable functions on
                            the interval $[a, b]$
Chapter 1
The Basics of R

Don’t think—use the computer.

— G. Dyke

1.1 Introduction

The R software package is a powerful and flexible tool for statistical analysis which
is used by practitioners and researchers alike. A basic understanding of R allows
applying a wide variety of statistical methods to actual data and presenting the results
clearly and understandably. This chapter provides help in setting up the programme
and gives a brief introduction to its basics.
R is open-source software with a list of available add-on packages that provide
additional functionalities. This chapter begins with detailed instructions on how to
install it on the computer and explains all the procedures needed to customise it to
the user’s needs.
In the next step, it will guide you through the use of the basic commands and the
structure of the R language. The goal is to give an idea of the syntax so as to be able
to perform simple calculations as well as structure data and gain an understanding
of the data types. Lastly, the chapter discusses methods of reading data and saving
datasets and results.

1.2 R on Your Computer

1.2.1 History of the R Language

R is a complete programming language and software environment for statistical
computing and graphical representation. R is closely related to S, the statistical
programming language of Bell Laboratories developed by Becker and Chambers in 1984.
It is actually an implementation of S with lexical scoping semantics inspired by
Scheme, which started in 1992 and with the first results published by the developers
Ihaka and Gentleman (1996) of the University of Auckland, NZ, for teaching purposes.
Its name, R, is taken from the first names of the authors.
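
Lexical scoping means that a function looks up its free variables in the environment where it was defined, not where it is called. The following short sketch is not from the book, and the function names are made up; it shows the effect with a simple counter.

```r
# make_counter() returns a function that remembers its own 'count'
make_counter <- function() {
  count <- 0                 # lives in the environment of this call
  function() {
    count <<- count + 1      # updates 'count' in the defining environment
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()

counter_a()   # 1
counter_a()   # 2
counter_b()   # 1, because counter_b has its own 'count'
```

Each call to `make_counter()` creates a fresh environment, so the two counters do not interfere with each other.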
As part of the GNU Project, the source code of R has been freely available under
the GNU General Public License since 1995. This decision contributed to spreading
the software within the community of statisticians using open-source operating systems
(OS). It is now a multi-platform statistical package widely known by people from
many scientific fields such as mathematics, medicine and biology.
R enables its users to handle and store data, perform calculations on many types
of variables, statistically analyse information under different aspects, create graphics
and execute programmes. Its functionalities can be expanded by importing packages
and including code written in C, C++ or Fortran. It is freely available on the Internet
using the CRAN mirrors (Comprehensive R Archive Network at http://cran.r-project.org/).
Since this chapter deals with installation issues and the basics of the R language,
the reader familiar with the basics may skip it.
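
As a rough sketch of how add-on functionality is brought into a session, the lines below attach a package that ships with R, call a function through its namespace, and load an add-on package only if it is installed. The choice of `lattice` is an assumption made for illustration, and the installation call is commented out because it needs an Internet connection.

```r
# Attach a package that ships with every R installation
library(stats)

# Call a function via its namespace without attaching the package
first_letters <- utils::head(letters, 3)
first_letters

# Install an add-on package from a CRAN mirror (requires Internet access)
# install.packages("lattice")

# Load the add-on package only if it is available on this machine
if (requireNamespace("lattice", quietly = TRUE)) {
  library(lattice)
}
```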
There exist several books about R, discussing specific topics in statistics and
econometrics (biostatistics, etc.) or comparing R with other software, for example
Stata. Typical users of Stata may be interested in Muenchen and Hilbe (2010). If
the research topic requires Bayesian econometrics and MCMC techniques, Albert
(2009) might be helpful. Two additional books on R, by Gaetan and Guyon (2009)
and Cowpertwait and Metcalfe (2009), may support the development of R skills,
depending on the application.

1.2.2 Installing and Updating R

Installing
As mentioned before, R is a free software package, which can be downloaded legally
from the Internet page http://cran.r-project.org/bin.
Since R is a cross-platform software package, installing R on different operating
systems will be explained. A full installation guide for all systems is available at
http://cran.r-project.org/doc/manuals/R-admin.html.
Precompiled binary distributions
There are several ways of setting up R on a computer. On the one hand, for many
operating systems, precompiled binary files are available. And on the other hand, for
those who use other operating systems, it is possible to compile the programme from
the source code.
• Installing R under Unix
Precompiled binary files are available for the Debian, RedHat, SuSe and Ubuntu
Unix distributions. They can be found on the CRAN website at
http://cran.r-project.org/bin/linux.
• Installing R under Windows
The binary version of R for Windows is located at
http://cran.r-project.org/bin/windows.
If an account with Administrator privileges is used, R can be installed in the
Program Files path and all the optional registry entries are automatically set. Oth-
erwise, there is only the possibility of installing R in the user files path. Recent
versions of Windows ask for confirmation to proceed with installing a programme
from an ‘unidentified publisher’. The installation can be customised, but the default
is suitable for most users.
For further information, it is suggested to visit
http://cran.r-project.org/bin/windows/base/rw-FAQ.html.
• Installing R under Mac
The current version of R for Mac OS is located at
http://cran.r-project.org/bin/macosx/.
The installation package corresponding to the specific version of the Mac OS must
be chosen and downloaded. During the installation, the Installer will guide the user
through the necessary steps. Note that this will require the password or login of
an account with administrator privileges. The installation can be customised, but
the default is suitable for most users. After the installation, R can be started from
the application menu.
For further information, it is suggested to visit
http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html.

Updating
The best way to upgrade R is to uninstall the previous version of R, then install
the new version and copy the old installed packages to the library folder of
the new installation. The command update.packages(checkBuilt = TRUE,
ask = FALSE) will then update the packages for the new installation. Afterwards, any
remaining data from the old installation can be deleted. Old versions of the software
may be kept due to the parallel structure of the folders of the different installations.
In cases where the user has a personal library, the contents must be copied into an
update folder before running the update of the packages.
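
Before and after an upgrade it can help to check where R looks for packages and what is installed. The lines below are a minimal sketch of such a check; the variable name `pkgs` is arbitrary, and the update call from the text is left commented out because it needs an Internet connection and changes the installation.

```r
# Version of the running R installation
R.version.string

# Library folders searched by this installation
.libPaths()

# Names of all packages currently installed
pkgs <- rownames(installed.packages())
head(pkgs)

# After copying the old packages into the new library folder,
# rebuild and update them (requires Internet access):
# update.packages(checkBuilt = TRUE, ask = FALSE)
```

Comparing the output of `.libPaths()` between the old and the new installation shows which folder the old packages need to be copied into.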

1.2.3 Packages

A package is a file, which may be composed of R scripts (for example functions)
or dynamic link libraries (DLL) written in other languages, such as C or
