Basic Elements of Computational Statistics
Statistics and Computing
Series editor
W.K. Härdle, Humboldt-Universität zu Berlin, Berlin, Germany
Statistics and Computing (SC) includes monographs and advanced texts on
statistical computing and statistical packages.
Wolfgang Karl Härdle, Ostap Okhrin, Yarema Okhrin

Wolfgang Karl Härdle
CASE – Center for Applied Statistics and Economics
School of Business and Economics
Humboldt-Universität zu Berlin
Berlin, Germany

Yarema Okhrin
Chair of Statistics, Faculty of Business and Economics
University of Augsburg
Augsburg, Germany

Ostap Okhrin
Econometrics and Statistics, esp. in Transportation
Institut für Wirtschaft und Verkehr, Fakultät Verkehrswissenschaften "Friedrich List"
Technische Universität Dresden
Dresden, Sachsen, Germany
Preface
Chapter 3 highlights set theory, combinatorial rules and some of the main discrete distributions: binomial, multinomial, hypergeometric and Poisson. Different characteristics, cumulative distribution functions and density functions of the continuous distributions, namely the uniform, normal, t, $\chi^2$, F, exponential and Cauchy distributions, will be explained in detail in Chapter 4.
The next chapter is devoted to univariate statistical analysis and basic smoothing techniques. The histogram, kernel density estimator, graphical representation of the data, confidence intervals, and simple tests as well as computationally more demanding tests, like the Wilcoxon, Kruskal–Wallis and sign tests, are the topics of Chapter 5.
The sixth chapter deals with multivariate distributions: the definition, characteristics and application of general multivariate distributions, multinormal distributions, as well as classes of copulas. Further, Chapter 7 discusses linear and nonlinear relationships via regression models.
Chapter 8 partially extends the problems solved in Chapter 5, but also considers
more sophisticated topics, such as multidimensional scaling, principal component,
factor, discriminant and cluster analysis. These techniques are difficult to apply
without computational power, so they are of special interest in this book.
Theoretical models need to be calibrated in practice. If there is no data available,
then Monte Carlo simulation techniques are necessary parts of each study. Chapter
9 starts from simple sampling techniques from the uniform distribution. These are
further extended to simulation methods from other univariate distributions. We also
discuss simulation from multivariate distributions, especially copulae.
Chapter 10 describes more advanced graphical techniques, with special attention
to three-dimensional graphics and interactive programmes using packages lattice,
rgl and rpanel.
This book is designed for the advanced undergraduate and first-year graduate
student as well as for the inexperienced data analyst who would like a tour of the
various statistical tools in a data analysis workshop. The experienced reader with a
good knowledge of statistics and programming will certainly skip some sections
of the univariate models, but hopefully enjoy the various mathematical roots of the
multivariate techniques. A graduate student might think that the first section on
description techniques is well known to him from his training in introductory
statistics. The programming, mathematical and the applied parts of the book will
certainly introduce him into the rich realm of statistical data analysis modules.
A book of this kind would not have been possible without the help of many
friends, colleagues and students. For many suggestions, corrections and technical
support, we would like to thank Aymeric Bouley, Xiaofeng Cao, Johanna Simone
Eckel, Philipp Gschöpf, Gunawan Gunawan, Johannes Haupt, Uri Yakobi Keller,
Polina Marchenko, Félix Revert, Alexander Ristig, Benjamin Samulowski, Martin
Schelisch, Christoph Schult, Noa Tamir, Anastasija Tetereva, Tatjana
Tissen-Diabaté, Ivan Vasylchenko and Yafei Xu. We thank Alice Blanck and
Veronika Rosteck from Springer Verlag for continuous support and valuable
suggestions on the style of writing and the content covered. Special thanks go to the
anonymous proofreaders who checked not only the language but also the statistical,
programming and mathematical content of the book. All errors are our own.
Contents

1 The Basics of R
  1.1 Introduction
  1.2 R on Your Computer
    1.2.1 History of the R Language
    1.2.2 Installing and Updating R
    1.2.3 Packages
  1.3 First Steps
    1.3.1 "Hello World !!!"
    1.3.2 Getting Help
    1.3.3 Working Space
  1.4 Basics of the R Language
    1.4.1 R as a Calculator
    1.4.2 Variables
    1.4.3 Arrays
    1.4.4 Data Frames
    1.4.5 Lists
    1.4.6 Programming in R
    1.4.7 Date Types
    1.4.8 Reading and Writing Data from and to Files
2 Numerical Techniques
  2.1 Matrix Algebra
    2.1.1 Characteristics of Matrices
    2.1.2 Matrix Operations
    2.1.3 Eigenvalues and Eigenvectors
    2.1.4 Spectral Decomposition
    2.1.5 Norm
  2.2 Numerical Integration
    2.2.1 Integration of Functions of One Variable
    2.2.2 Integration of Functions of Several Variables
  2.3 Differentiation
    2.3.1 Analytical Differentiation
    2.3.2 Numerical Differentiation
    2.3.3 Automatic Differentiation
  2.4 Root Finding
    2.4.1 Solving Systems of Linear Equations
    2.4.2 Solving Systems of Nonlinear Equations
    2.4.3 Maximisation and Minimisation of Functions
3 Combinatorics and Discrete Distributions
  3.1 Set Theory
    3.1.1 Creating Sets
    3.1.2 Basics of Set Theory
    3.1.3 Base Package
    3.1.4 Sets Package
    3.1.5 Generalised Sets
  3.2 Probabilistic Experiments with Finite Sample Spaces
    3.2.1 R Functionality
    3.2.2 Sample Space and Sampling from Urns
    3.2.3 Sampling Procedure
    3.2.4 Random Variables
  3.3 Binomial Distribution
    3.3.1 Bernoulli Random Variables
    3.3.2 Binomial Distribution
    3.3.3 Properties
  3.4 Multinomial Distribution
  3.5 Hypergeometric Distribution
  3.6 Poisson Distribution
    3.6.1 Summation of Poisson Distributed Random Variables
4 Univariate Distributions
  4.1 Continuous Distributions
    4.1.1 Properties of Continuous Distributions
  4.2 Uniform Distribution
  4.3 Normal Distribution
  4.4 Distributions Related to the Normal Distribution
    4.4.1 $\chi^2$ Distribution
    4.4.2 Student's t-distribution
    4.4.3 F-distribution
  4.5 Other Univariate Distributions
    4.5.1 Exponential Distribution
    4.5.2 Stable Distributions
    4.5.3 Cauchy Distribution
Bibliography
Index
Symbols and Notations

Basics

$X, Y$                          Random variables or vectors
$X_1, X_2, \ldots, X_p$         Random variables
$X = (X_1, \ldots, X_p)^\top$   Random vector
$X \sim \cdot$                  $X$ has distribution $\cdot$
$\Gamma, \Delta$                Matrices
$A, B, \mathcal{X}, \mathcal{Y}$   Data matrices
$\Sigma$                        Covariance matrix
$1_n$                           Vector of ones $(1, \ldots, 1)^\top$ ($n$ times)
$0_n$                           Vector of zeros $(0, \ldots, 0)^\top$ ($n$ times)
$I_n$                           Identity matrix
$I(\cdot)$                      Indicator function
$\lceil \cdot \rceil$           Ceiling function
$\lfloor \cdot \rfloor$         Floor function
$i$                             Imaginary unit, $i^2 = -1$
$\Rightarrow$                   Implication
$\Leftrightarrow$               Equivalence
$\approx$                       Approximately equal
iff                             if and only if, equivalence
i.i.d.                          Independent and identically distributed
rv                              Random variable
$\mathbb{R}^n$                  $n$-dimensional space of real numbers
Samples

$x, y$                                      Observations of $X$ and $Y$
$x_1, \ldots, x_n = \{x_i\}_{i=1}^{n}$      Sample of $n$ observations of $X$
$\mathcal{X} = \{x_{ij}\}_{i=1,\ldots,n;\ j=1,\ldots,p}$   $(n \times p)$ data matrix of observations of $X_1, \ldots, X_p$ or of $X = (X_1, \ldots, X_p)^\top$
$x_{(1)}, \ldots, x_{(n)}$                  The order statistics of $x_1, \ldots, x_n$
$\mathcal{H}$                               Centering matrix, $\mathcal{H} = I_n - n^{-1} 1_n 1_n^\top$
$\bar{x}$                                   The sample mean
Empirical Moments

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$                                   Average of $X$ sampled by $\{x_i\}_{i=1,\ldots,n}$
$s^2_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$     Empirical covariance of random variables $X$ and $Y$ sampled by $\{x_i\}_{i=1,\ldots,n}$ and $\{y_i\}_{i=1,\ldots,n}$
$s^2_{XX} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$                  Empirical variance of random variable $X$ sampled by $\{x_i\}_{i=1,\ldots,n}$
$r_{XY} = \frac{s^2_{XY}}{\sqrt{s^2_{XX}\, s^2_{YY}}}$                      Empirical correlation of $X$ and $Y$
$\hat{\Sigma} = \{s_{X_i X_j}\}$                                            Empirical covariance matrix of a sample or observations of $X_1, \ldots, X_p$ or of the random vector $X = (X_1, \ldots, X_p)^\top$
$\mathcal{R} = \{r_{X_i X_j}\}$                                             Empirical correlation matrix of a sample or observations of $X_1, \ldots, X_p$ or of the random vector $X = (X_1, \ldots, X_p)^\top$
Distributions

$\varphi(x)$                     Density of the standard normal distribution
$\Phi(x)$                        Cumulative distribution function of the standard normal distribution
$N(0, 1)$                        Standard normal or Gaussian distribution
$N(\mu, \sigma^2)$               Normal distribution with mean $\mu$ and variance $\sigma^2$
$N_d(\mu, \Sigma)$               $d$-dimensional normal distribution with mean $\mu$ and covariance matrix $\Sigma$
$\xrightarrow{\;L\;}$            Convergence in distribution
$\xrightarrow{\;a.s.\;}$         Almost sure convergence
$\stackrel{a}{\sim}$             Asymptotic distribution
$U(a, b)$                        Uniform distribution on $(a, b)$
CLT                              Central Limit Theorem
$\chi^2_p$                       $\chi^2$ distribution with $p$ degrees of freedom
$\chi^2_{1-\alpha;\,p}$          $1-\alpha$ quantile of the $\chi^2$ distribution with $p$ degrees of freedom
$t_n$                            $t$-distribution with $n$ degrees of freedom
$t_{1-\alpha/2;\,n}$             $1-\alpha/2$ quantile of the $t$-distribution with $n$ d.f.
$F_{n,m}$                        $F$-distribution with $n$ and $m$ degrees of freedom
$F_{1-\alpha;\,n,m}$             $1-\alpha$ quantile of the $F$-distribution with $n$ and $m$ degrees of freedom
$B(n, p)$                        Binomial distribution
$H(x; n, M, N)$                  Hypergeometric distribution
$\mathrm{Pois}(\lambda_i)$       Poisson distribution with parameter $\lambda_i$
Mathematical Abbreviations

$\mathrm{tr}(A)$                 Trace of matrix $A$
$\mathrm{diag}(A)$               Diagonal of matrix $A$
$\mathrm{rank}(A)$               Rank of matrix $A$
$\det(A)$                        Determinant of matrix $A$
id                               Identity function on a vector space $V$
$C[a, b]$                        The set of all continuous differentiable functions on the interval $[a, b]$
Chapter 1
The Basics of R
1.1 Introduction
The R software package is a powerful and flexible tool for statistical analysis which
is used by practitioners and researchers alike. A basic understanding of R allows
applying a wide variety of statistical methods to actual data and presenting the results
clearly and understandably. This chapter provides help in setting up the programme
and gives a brief introduction to its basics.
R is open-source software with a list of available add-on packages that provide additional functionalities. This chapter begins with detailed instructions on how to
install it on the computer and explains all the procedures needed to customise it to
the user’s needs.
In the next step, it will guide you through the use of the basic commands and the
structure of the R language. The goal is to give an idea of the syntax so as to be able
to perform simple calculations as well as structure data and gain an understanding
of the data types. Lastly, the chapter discusses methods of reading data and saving
datasets and results.
Installing
As mentioned before, R is a free software package, which can be downloaded legally
from the Internet page http://cran.r-project.org/bin.
Since R is a cross-platform software package, installing R on different operating
systems will be explained. A full installation guide for all systems is available at
http://cran.r-project.org/doc/manuals/R-admin.html.
Precompiled binary distributions
There are several ways of setting up R on a computer. On the one hand, precompiled binary files are available for many operating systems. On the other hand, for those who use other operating systems, it is possible to compile the programme from the source code.
Updating
The best way to upgrade R is to uninstall the previous version of R, then install
the new version and copy the old installed packages to the library folder of
the new installation. Command update.packages(checkBuilt = TRUE,
ask = FALSE) will update the packages for the new installation. Afterwards, any
remaining data from the old installation can be deleted. Old versions of the software
may be kept due to the parallel structure of the folders of the different installations.
In cases where the user has a personal library, the contents must be copied into an
update folder before running the update of the packages.
1.2.3 Packages
A package is a collection of R functions, data sets and compiled code, possibly written in languages such as C or Fortran, that gives access to more functions or data sets for the current session. Some packages are ready for use after the basic installation, others have to be downloaded and then installed when needed. On all operating systems, the function install.packages() can be used to download and install a package automatically through an available internet connection. The command install.packages() may ask the user to decide whether packages should be compiled from source when the source version is newer than the available binary. When installing packages manually, there are slight differences between operating systems.
Unix
Gzipped tar packages can be installed using the UNIX console by
R CMD INSTALL /your_path/your_package.tar.gz
Windows
In the R GUI, one uses the menu Packages.
• With an available internet connection, new packages can be downloaded and
installed directly by clicking the Install Packages button. In this case, it is proposed
to choose the CRAN mirror nearest to the user’s location, and select the package
to be installed.
• If the .zip file is already available on the computer, the package can be installed
through Install Packages from Zip files.
Mac OS
There is a recommended Package Manager in the R.APP GUI. It is possible to
install packages from the shell, but we suggest having a look at the FAQ on the
CRAN website first.
All systems
Once a package is installed, it should be loaded in a session when needed. This ensures
that the software has all the additional functions and datasets from this package in
memory. This can be done through the commands
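> library(packagename)    # "packagename" stands for the package to be loaded
> require(packagename)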
If the requested package is not installed, the function library() gives an error,
while require() is designed for use inside of other functions and only returns
FALSE and gives a warning.
The package will also be loaded as the second item in the system search path.
Packages can also be loaded automatically if the corresponding code line is included
in the .Rprofile file.
The function detach() can also be used to remove any R object from the search
path. This alternative usage will be shown later in this chapter.
After this first impression of what R is and how it works, the next steps are to see
how it is used and to get used to it. In general, users should be aware of the case
sensitivity of the R language.
It is also convenient to know that previously executed commands can be selected
by the ‘up arrow’ on the keyboard. This is particularly useful for correcting typos and
mistakes in commands that caused an error, or to re-run commands with different
parameters.
As a first example, we will write some code that gives the output ‘Hello World !!!’
and a plot, see Fig. 1.1. There is no need to understand all the lines of the code now.
> install.packages("rworldmap")
> require(rworldmap)
> data("countryExData", envir = environment())
> mapCountryData(joinCountryData2Map(countryExData),
+ nameColumnToPlot = "EPI_regions",
+ catMethod = "categorical",
+ mapTitle = "Hello World!!!",
+ colourPalette = "rainbow",
+ missingCountryCol = "lightgrey",
+ addLegend = FALSE)
Once R has been installed and/or updated, it is useful to have a way to get help. To
open the primary interface to the help system, one uses
> help()
There are two ways of getting help for a particular function or command:
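> help(functionname)    # "functionname" stands for the function of interest
> ?functionname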
which returns the help file that comes with the specific package. If these proposals of help are not satisfying, one can try
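> help.search("function name")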
to see all help subjects containing “function name”. Finally, under Windows and
Mac OS, under the Help menu are several PDF manuals which provide thorough and
detailed information. The same help can be reached with the function
> help.start()
The current directory, where all pictures and tables are saved and from which all data
is read by default, is known as the working directory. It can be found by getwd()
and can be changed by setwd().
> getwd()                       # get the working directory
> setwd("your/own/path")        # set the working directory
Each of the following functions returns a vector of character strings providing the
names of the objects already defined in the current session.
> ls()
> objects()
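Objects that are no longer needed can be deleted with the function rm(), e.g.
> rm(var1, var2)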
The above line, for example, will remove the variables var1 and var2 from the
working space. The next example will remove all objects defined in the current
session and can be used to completely clean the whole working space.
> rm(list = ls())
The code below erases all variables, including the system ones beginning with a dot.
Be cautious when using this command! Note that it has the same effect as the menu
entry Remove all objects under Windows or Clear Workspace under Mac OS.
> rm(list = ls(all.names = TRUE))
However, we should always make sure that all previously defined variables are deleted
or redefined when running new code, in order to be sure that there is no information
left from the previous run of the programme which could affect the results. Therefore,
it is recommended to have a line rm(list = ls(all.names = TRUE)) in
the beginning of each programme.
One saves the workspace as an .RData file in the current working directory using the function save.image(), and saves the history of the commands in an .Rhistory file with savehistory(). Saving the workspace means keeping all defined variables in memory, so that the next time R is in use, there is no need to define them again. If the history is saved, the variables will NOT be saved, whereas the commands defining them will be. So once the history is loaded, everything that was in the console has to be executed again, which can take a while for time-consuming calculations. The previously saved workspace and the history can be loaded with
> load(".RData")
> loadhistory()
The function apropos("word") returns a vector of the functions, variables, etc., containing the argument word, as does find("word"), but with a different user interface. Without
going into details, the best way to set one’s own search parameters is to consult the
help concerning these functions.
Furthermore, a recommended and very convenient way of writing programmes
is to split them into different modules, which might contain a list of definitions or
functions, in order not to mess up the main file. They are executed by the function
> source("my_module.r")
To write the output in a separate .txt file instead of the screen, one uses sink().
This file appears in the working directory and shows the full output of the session.
> sink("my_output.txt")
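To direct the output back to the console afterwards, sink() is called again without an argument.
> sink()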
This section contains information on how R can be used for all basic mathematical and programming needs.
1.4.1 R as a Calculator
R may be seen as a powerful calculator which allows dealing with a lot of mathematical functions. Classical fundamental operations as presented in Tables 1.1, 1.2 and 1.3 are, of course, available in R.
In contrast to the classical calculator, R allows assigning one or more values to a
variable.
> a = pi + 0.5; a # create variable a; print a
[1] 3.641593
floor() (respectively ceiling()) returns the largest (smallest) integer that is smaller (larger) than the value of the given variable a, and trunc() truncates the decimal part of a real-valued variable to obtain an integer. The function round() rounds a real-valued variable scientifically to an integer, unless the argument digits is supplied, in which case it scientifically rounds the given real number to that many decimal places. Scientific rounding of a real number rounds it to the closest value, except in the case where the digit after the predefined decimal place is exactly 5; in this case, the closest even value is returned. The function factorial(), which for an integer a returns f(a) = a! = 1 · 2 · ... · a, works with real-valued arguments as well, by using the Gamma function
$$\Gamma(x) = \int_{0}^{\infty} t^{x-1} \exp(-t)\, dt,$$
implemented by gamma(x+1) in R.
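As a brief illustration (not part of the original listing), these rules can be checked directly in the console:
> floor(2.7); ceiling(2.3); trunc(2.7)
[1] 2
[1] 3
[1] 2
> round(2.5); round(3.5)   # scientific rounding towards the even digit
[1] 2
[1] 4
> factorial(4); gamma(5)   # both equal 4! = 24
[1] 24
[1] 24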
1.4.2 Variables
Assigning variables
There are different ways to assign variables to symbols.
> a = pi + 0.5; a # assign (pi + 0.5) to a
[1] 3.641593
> b = a; b # assign the value of a to b
[1] 3.641593
> d = e = 2^(1 / 2); d # assign 2^(1 / 2) to e
> # and the value of e to d
[1] 1.414214
> e
[1] 1.414214
> f <- d; f # assign the value of d to f
[1] 1.414214
> d -> g; g # assign the value of d to g
[1] 1.414214
Be careful with using ‘=’ for assigning, because the known argument, which defines
the other, must be placed on the right side of the equals sign. The arrow assignment
allows the following kind of constructions:
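The original example is not reproduced here; one construction of this kind would be
> x <- 2 -> y   # assigns 2 to both x and y in a single line
> c(x, y)
[1] 2 2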
These constructions should not be used extensively due to their evident lack of clarity.
Note that variable names are case sensitive and must not begin with a digit or a period
followed by a digit. Furthermore, names should not begin with a dot as this is common
only for system variables. It is often convenient to choose names that contain the
type of the specific variable, e.g. for the variable iNumber, the ‘i’ at the beginning
indicates that the variable is of the type integer.
> iNumber = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
It is also useful to add the prefix ‘L’ to all local variables. For example, despite the
fact that pi is a constant, one can reassign a different value to it. In order to avoid
confusion, it would be convenient in this case to call the reassigned variable Lpi.
We do not always follow this suggestion, in order to keep the listings as short as possible.
It is also possible to define functions in a similar fashion.
> Stirling = function(n){sqrt(2 * pi * n) * (n / exp(1))^n}
This function implements $\sqrt{2\pi n}\,(n/e)^{n}$, the Stirling approximation to $n!$, for which
$$\frac{n!}{\sqrt{2\pi n}\,(n/e)^{n}} \to 1, \quad \text{as } n \to \infty.$$
A description of the variable types matrix and list is given in Sects. 1.4.3 and 1.4.5
in more detail. To show, or print, the content of a, one uses the function print().
> print(a)
Furthermore, one can check if an R object is finite, infinite, unknown, or of any other
type. The function is.finite(argument) returns a Boolean object (a vector or
matrix, if the input is a vector or a matrix) indicating whether the values are finite
or not. This test is also available to test types, such as is.integer(x), to test
whether x is an integer or not, etc.
> x = c(Inf, NaN, 4)
> is.finite(x) # check if finite
[1] FALSE FALSE TRUE
> is.nan(x) # check if NaN (operation not valid)
[1] FALSE TRUE FALSE
> is.double(x) # check if type double
[1] TRUE
> is.character(x) # check if type character
[1] FALSE
1.4.3 Arrays
The indexing of a vector starts from 1. Addressing an element that does not exist, e.g. v[0] or an index beyond the length of the vector, does not raise an error; instead, the empty vector numeric(0) or the value NA is returned, respectively. A numerical vector may be integer if it contains only integers, numeric if it contains only real numbers, and complex if it contains complex numbers. The length of a vector object v is found through
> v = c(1.000000, 3.141593, 1.414214)
> length(v) # length of vector v
[1] 3
Be careful with this function, keeping in mind that it always returns one value, even
for multi-dimensional arrays, so one should know the nature of the objects one is
dealing with.
One easily applies the same transformation to all elements of a vector. One can
calculate, for example, the elementwise inverse with the command ˆ (-1). This is
still the case for other objects, such as arrays.
> v = c(1.000000, 3.141593, 1.414214)
> d = v + 3; d
[1] 4.000000 6.141593 4.414214
> v^(-1)
[1] 1.0000000 0.3183099 0.7071068
> v * v^(-1)
[1] 1 1 1
There are a lot of other ways to construct vectors. The function array(x, y)
creates an array of dimension y filled with the value x only. The function seq(x,
y, by = z) gives a sequence of numbers from x to y in steps of z. Alternatively,
the required length can be specified by option length.out.
> c(1, 2, 3)
[1] 1 2 3
> 1:3
[1] 1 2 3
> array(1:3, 6)
[1] 1 2 3 1 2 3
> seq(1, 3)
[1] 1 2 3
> seq(1, 3, by = 2)
[1] 1 3
> seq(1, 4, length.out = 5)
[1] 1.00 1.75 2.50 3.25 4.00
One can also use the rep() function to create a vector in which some values are
repeated.
> v = c(1.000000, 3.141593, 1.414214)
> rep(v, 2) # the vector twice
[1] 1.00 3.14 1.41 1.00 3.14 1.41
> rep(v, c(2, 0, 1)) # 1st value twice, no 2nd value
> # 3rd value once
[1] 1.00 1.00 1.41
> rep(v, each = 2) # each value twice
[1] 1.00 1.00 3.14 3.14 1.41 1.41
With the second command of the above code, R creates a vector in which the first
value of v should appear two times, the second zero times, and the third only once.
Note that if the second argument is not an integer, R takes the rounded value. In the
last call, each element is repeated twice, proceeding element per element.
The names of the months, their abbreviations and all letters of the alpha-
bet are stored in predefined vectors. The months can be addressed in the vector
month.name[]. For their abbreviations, use month.abb[]. Letters are stored
in letters[] and capital letters in LETTERS[].
Note that if one element in a vector is of type character, then all elements in the
vector are converted to character, since a vector can only contain objects of one
type.
To keep only some specific values of a vector, one can use different methods
of conditional selection. The first is to use logical operators for vectors in R: “!”
is the logical NOT, “&” is the logical AND and "|" is the logical OR. Using these
commands, it is possible to perform a conditional selection of vector elements. The
elements for which the conditions are TRUE can then, for example, be saved in
another vector.
> v = c(1.000000, 3.141593, 1.414214)
> v > 0 # element greater 0
[1] TRUE TRUE TRUE
> (v != 1) & (v > 0) # element not equal to 1 and greater 0
[1] FALSE TRUE TRUE
In the last example, the first value is bigger than zero, but equal to one, so FALSE
is returned. This method may be a little bit confusing for beginners, but it is very
useful for working with multi-dimensional arrays.
Multiple selection of elements of a vector may be done using another vector of
indices as arguments in the square brackets.
> v = c(1.000000, 3.141593, 1.414214)
> v[c(1, 3)] # 1st and 3rd element
[1] 1.000000 1.414214
> w = v[(v != 1) & (v > 0)]; w # save the specified elements in w
[1] 3.141593 1.414214
To eliminate specific elements in a vector, the same procedure is used as for selection,
but a minus sign indicates the elements which should be removed.
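For example (continuing with the vector v from above):
> v = c(1.000000, 3.141593, 1.414214)
> v[-2]   # remove the 2nd element
[1] 1.000000 1.414214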
For a one-dimensional vector, one uses the function which(), which returns the index or indices of specific elements.
> v = c(1.000000, 3.141593, 1.414214)
> which(v == pi) # indices of elements that fulfill the condition
[1] 2
There are different functions for working with vectors. Extremal values are found through the functions min() and max(), which return the minimal and maximal values of a vector, respectively.
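For example:
> v = c(1.000000, 3.141593, 1.414214)
> min(v); max(v)
[1] 1
[1] 3.141593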
However, this can be done simultaneously by the function range, which returns a
vector consisting of the two extreme values.
> v = c(1.000000, 3.141593, 1.414214)
> range(v) # min and max value
[1] 1.000000 3.141593
Joining the function which() with min or max, one gets the function which.min
or which.max that returns the index of the smallest or largest element of the
vector, respectively, and is equivalent to which(x == max(x)) and which
(x == min(x)).
Quite often, the elements of a vector have to be sorted before one can proceed
with further transformations. The simplest function for this purpose is sort().
> x = c(4, 2, 5, 7, 1, 9, 0, 3)
> sort(x) # values in increasing order
[1] 0 1 2 3 4 5 7 9
Being a function, it does not modify the original vector x. To get the coordinates of
the elements that are in the sorted vector, we use the function rank().
> x = c(4, 2, 5, 7, 1, 9, 0, 3)
> rank(x) # rank of elements in increasing order
[1] 5 3 6 7 2 8 1 4
In this example, the first value of the result is ‘5’. This means that the first element in
the original vector x[1] = 4 is in the fifth place in the ordered vector. The inverse
function to rank() is order(), which states the position of the element of the
sorted vector in the original vector, e.g. the smallest element in x is the seventh, the
second smallest is the fifth, etc.
> x = c(4, 2, 5, 7, 1, 9, 0, 3)
> order(x) # positions of sorted elements in the original vector
[1] 7 5 2 8 1 3 4 6
Replacing specific values in a vector is done with the function replace(). This
function replaces the elements of x that are specified by the second argument by the
values given in the third argument.
> v = 1:10; v
[1] 1 2 3 4 5 6 7 8 9 10
> replace(v, v < 3, 12) # replace all els. smaller than 3 by 12
[1] 12 12 3 4 5 6 7 8 9 10
> replace(v, 6, 12) # replace the 6th element by 12
[1] 1 2 3 4 5 12 7 8 9 10
The second argument is a vector of indices for the elements to be replaced by the
values. In the second line, all numbers smaller than 3 are to be replaced by 12, while
in the last line, the element with index 6 is replaced by 12. Note again that functions
do not change the original vectors, so that the last output does not show 1 and 2
replaced by 12 after the second command.
There are also a few more functions for vectors which are of further interest. The
function rev() returns the elements in reversed order, and sum() gives the sum
of all the elements in the vector.
> x = c(4, 2, 5, 7, 1, 9, 0, 3)
> rev(x) # reverse the order of x
[1] 3 0 9 1 7 5 2 4
> sum(x) # sum all elements of x
[1] 31
In algebra and statistics, matrices are fundamental objects, which allow summarising
a large amount of data in a simple format. In R, matrices are only allowed to have
one data type for their entries, which is their main difference from data frames, see
Sect. 1.4.4.
Creating a matrix
There are many possible ways to create a matrix, as shown in the example below.
The function matrix() constructs matrices with specified dimensions.
> matrix(0, 2, 5) # zeros, 2x5
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
In the third matrix in the above example, the argument byrow = TRUE indicates that the filling must be done by rows, which is not the case for the second matrix, which was filled by columns (column-major storage). The function as.vector(matrix) converts a matrix into a vector; if the matrix has more than one row or column, the function concatenates its columns into a vector. One can also construct diagonal matrices using diag(), see Sect. 2.1.1.
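Since the second and third matrices of that listing are not reproduced above, the following small sketch illustrates the byrow argument and as.vector():
> m = matrix(1:6, 2, 3, byrow = TRUE); m # filled by rows
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
> as.vector(m) # concatenates the columns
[1] 1 4 2 5 3 6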
Another way to transform a given vector into a matrix with specified dimensions
is the function dim(). The function t() is used to transpose matrices.
> m = 1:6
> dim(m) = c(2, 3); m
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> t(m) # transpose m
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
Coupling vectors using the functions cbind() (column bind) and rbind() (row
bind) joins vectors column-wise or row-wise into a matrix.
> x = 1:6
> y = LETTERS[1:6]
> rbind(x, y) # bind vectors row-wise
[,1] [,2] [,3] [,4] [,5] [,6]
x "1" "2" "3" "4" "5" "6"
y "A" "B" "C" "D" "E" "F"
The functions col and row return the column and row indices of all elements of
the argument, respectively.
The procedure to extract an element or submatrix uses a syntax similar to the syntax
for vectors. In order to extract a particular element, one uses m[row index,
column index]. As a reminder, in the example below, 10 is the second element
of the fifth column, in accordance with the standard mathematical convention.
> k = matrix(1:10, 2, 5); k # create a matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
One can combine this with row() and col() to construct a useful tool for the
conditional selection of matrix elements. For example, extracting the diagonal of a
matrix can be done with the following code.
> m = matrix(1:6, ncol = 3)
> m[row(m) == col(m)] # select elements [1, 1]; [2, 2]; etc.
[1] 1 4
The same result is obtained by using the function diag(m). To better understand
the process, note that the command row(m) == col(m) creates just the Boolean
matrix below and all elements with value TRUE are subsequently selected.
> row(m) == col(m) # condition (row index = column index)
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE
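The listing described in the next paragraph is not reproduced above; the commands were presumably of the following kind, shown here for a hypothetical matrix y of assumed dimension (2 x 5):
> y = matrix(1:10, 2, 5)
> y[2, ]       # second row
> y[, 2]       # second column
> y[2]         # second element, counting down the columns
> y[1:2, 2:3]  # a range of rows and columns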
The first command selects the second row of y. The second command selects the
2nd column. The third line considers the matrix as a succession of column vectors
and gives, according to this construction, the second element. The last call selects a
range of rows and columns.
Many functions can take matrices as an argument, such as sum() or prod(), which calculate the sum or the product of all elements in the matrix, respectively. The functions colSums() and rowSums() can be used to calculate the column-wise or row-wise sums. All classical binary operators are applied element by element. This means that, for example, x * y returns the elementwise (Hadamard) product, not the classical matrix product discussed in Sect. 2.1 on matrix algebra.
One can assign names to the rows and columns of a matrix using the function
dimnames(). Alternatively, the column and row names can be assigned separately
by colnames() and rownames(), respectively.
> A = matrix(1:20, ncol = 5, nrow = 4)
> dimnames(A) = list(letters[4:7], letters[5:9]) # name dimensions
> A
e f g h i
d 1 5 9 13 17
e 2 6 10 14 18
f 3 7 11 15 19
g 4 8 12 16 20
> A[2, 2]
[1] 6
> A["e", "f"] # address the element by its dimension names
[1] 6
This leads directly to another very useful format in R: the data frame.
A data frame is a very useful object, because of the possibility of collecting data
of different types (numeric, logical, factor, character, etc.). Note, however, that
all elements must have the same length. The function data.frame() creates a new data frame object. It accepts several arguments; for example, the row names can be specified directly with data.frame(..., row.names = c(), ...). A further possibility for creating a data frame is to convert a matrix with the as.data.frame(matrix name) function.
Basic manipulations
Consider the following example, which constructs a data frame.
> cities = c("Berlin", "New York", "Paris", "Tokyo")
> area = c(892, 1214, 105, 2188)
> population = c(3.4, 8.1, 2.1, 12.9)
> continent = factor(c("Europe", "North America", "Europe", "Asia"))
> myframe = data.frame(cities, area, population, continent)
> is.data.frame(myframe) # check if object is a dataframe
[1] TRUE
> rownames(myframe) = c("Berlin", "New York", "Paris", "Tokyo")
Note that if we defined the above data frame as a matrix, then all elements would be
converted to type character, since matrices can only store one data type.
data.frame() automatically calls the function factor() to convert all char-
acter vectors to factors, as it does for the Continent column above, because
these variables are assumed to be indicators for a subdivision of the data set. To
perform data analysis (e.g. principal component analysis or cluster analysis, see
Chap. 8), numerical expressions of character variables are needed. It is therefore
often useful to assign ordered numeric values to character variables, in order to
perform statistical modelling, set the correct number of degrees of freedom, and
customise graphics. These variables are treated in R as factors. As an example, a
new variable is constructed, which will be added to the data frame “myframe”.
Three position categories are set, according to the proximity of each city to the sea:
Coastal (‘0’), Middle (‘1’) and Inland (‘2’). These categories follow a certain
order, with Middle being in between the others, which needs to be conveyed to R.
> e = c(2, 0, 2, 0) # code info. in e
> f = factor(e, level = 0:2) # create factor f
> levels(f) = c("Coastal", "Middle", "Inland"); f # with 3 levels
[1] Inland Coastal Inland Coastal
Levels: Coastal Middle Inland
> class(f)
[1] "factor"
> as.numeric(f)
[1] 3 1 3 1
The variable f is now a factor, and levels are defined by the function levels()
in the 3rd line in decreasing order of the proximity to the sea. When sorting the
variable, R will now follow the order of the levels. If the position values were simply
coded as string, i.e. Coastal, Middle and Inland, any sorting would be done
alphabetically. The first level would be Coastal, but the second Inland, which
does not follow the inherited order of the category.
The function as.numeric() extracts the numerical coding of the levels and
the indexation begins now with 1.
> myframe = data.frame(myframe, f)
> colnames(myframe)[5] = "Prox.Sea" # name 5th column
> myframe
City Area Pop. Continent Prox.Sea
Berlin Berlin 892 3.4 Europe Inland
New York New York 1214 8.1 North America Coastal
Paris Paris 105 2.1 Europe Inland
Tokyo Tokyo 2188 12.9 Asia Coastal
The column names for columns 1 to 4 are the ones that were assigned before, since
myframe is used in the call of data.frame(). Note that one should not use names
with spaces, e.g. Sea.Env. instead of Sea. Env. To add columns or rows to a
data frame, one can use the same functions as for matrices, or the procedure described
below.
> myframe = cbind(myframe, "Language.Spoken"=
+ c("German", "English", "French", "Japanese"))
> myframe
City Area Pop. Continent Prox.Sea Language.Spoken
Berlin Berlin 892 3.4 Europe Inland German
New York New York 1214 8.1 North America Coastal English
Paris Paris 105 2.1 Europe Inland French
Tokyo Tokyo 2188 12.9 Asia Coastal Japanese
There are several ways of addressing one particular column by its name: myframe$Pop., myframe[, 3], myframe[, "Pop."], myframe["Pop."]. All these commands except the last return a numeric vector. The last command returns a data frame.
> myframe$Pop. # select only population column
[1] 3.4 8.1 2.1 12.9
> myframe["Pop."] # population column as dataframe
Pop.
Berlin 3.4
New York 8.1
Paris 2.1
Tokyo 12.9
> myframe[3] == myframe["Pop."]
Pop.
Berlin TRUE
New York TRUE
Paris TRUE
Tokyo TRUE
The output of the above code is a data frame and, therefore, cannot be indexed like a vector. One uses the $ notation, similar to addressing fields of objects in the C++ programming language.
> myframe[2, 3] # select 3rd entry of 2nd row
[1] 8.1
> myframe[2, ] # select 2nd row
City Area Pop. Continent Prox.Sea Language.Spoken
New York New York 1214 8.1 North America Coastal English
Long names for data frames and the contained variables should be avoided, because
the source code becomes very messy if several of them are called. This can be solved
by the function attach(). Attached data frames will be set to the search path and
the included variables can be called directly. Any R object can be attached. To remove
it from the search path, one uses the function detach().
> rm(area) # remove var. "area" to avoid confusion
> attach(myframe) # attach dataframe "myframe"
> Area # specify column Area in attached frame
[1] 892 1214 105 2188
> detach(myframe)
If two-word names are used, it is advised to label the data frame or variable with
a block name, so that the two words in the name are connected with a dot or an
underline, e.g. Language.Spoken. This avoids having to put names in quotes.
One of the easiest ways to edit a data frame or a matrix is through interactive
tables, called by the edit function. Note that the edit() function does not allow
changing the original data frame.
> edit(myframe)
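The listing referred to in the following paragraph is not reproduced above; the two commands were presumably of the following form (using the column names defined earlier):
> myframe[myframe$Language.Spoken == "French" | myframe$Pop. > 10, ]
> myframe[, c(1, 4, 6)]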
The first command of the last listing selects the cities in which French is spoken as well as the cities with more than 10 million inhabitants. The second command selects only
the first, fourth and sixth variables for display. As explained above, the individual
data, as well as rows and columns, can be addressed using the square brackets. If no
variable is selected, i.e. [,], all information about the observations is kept.
The following functions are also helpful for conditional selections from data
frames. The function subset(), which performs conditional selection from a data
frame, is frequently used when only a subset of the data is used for the analysis.
> subset(myframe, Area > 1000)
City Area Pop. Continent Prox.Sea Language.Spoken
New York New York 1214 8.1 North America Coastal English
Tokyo Tokyo 2188 12.9 Asia Coastal Japanese
Another way to extract data according to the values is based on addressing specific
variables. In the next example, the interest is in the cumulative area of cities that are
not inland.
> Area.Seasiders = myframe$Area[myframe$Prox.Sea == "Middle"
+ | myframe$Prox.Sea == "Coastal"]
> Area.Seasiders
[1] 1214 2188
> sum(Area.Seasiders)
[1] 3402
The important technique of sorting the data frame is illustrated below. Remember that order() returns the positions of the sorted elements within the original vector.
The optional argument partial specifies the columns for subsequent ordering, if
necessary. It is used to order groups of data according to one column and order the
values in each group according to another column.
> myframe[order(myframe$Pop., partial = myframe$Area), ]
City Area Pop. Continent Prox.Sea Language.Spoken
Paris Paris 105 2.1 Europe Inland French
Berlin Berlin 892 3.4 Europe Inland German
New York New York 1214 8.1 North America Coastal English
Tokyo Tokyo 2188 12.9 Asia Coastal Japanese
1.4.5 Lists
Lists are very flexible objects which, unlike matrices and data frames, may contain
variables of different types and lengths.
The simplest way to construct a list is by using the function list(). In the
following example, a string, a vector and a function are joined into one variable.
> a = c(2, 7)
> b = "Hello"
> d = list(example = Stirling, a, end = b)
> d
$example
function(x){
sqrt(2 * pi * x) * (x / exp(1))^x
}
[[2]]
[1] 2 7
$end
[1] "Hello"
To address the elements of a list object, one again uses ‘$’, the same syntax as for a
data frame.
> d$end
[1] "Hello"
A list can be flattened into a structure whose components all have length 1 using unlist(). In this example, the element [[2]] of list d is split into two elements, each of length 1.
> unlist(d) # transform to list with elements of length 1
$example
function(x){
sqrt(2 * pi * x) * (x / exp(1))^x
}
[[2]]
[1] 2
[[3]]
[1] 7
$end
[1] "Hello"
One of the possible ways of converting objects is to use the function split(). This
returns a list of the split objects with separations according to the defined criteria.
> split(myframe, myframe$Continent)
$Asia
City Area Pop. Continent Prox.Sea Language.Spoken
Tokyo Tokyo 2188 12.9 Asia Coastal Japanese
$Europe
City Area Pop. Continent Prox.Sea Language.Spoken
Berlin Berlin 892 3.4 Europe Inland German
Paris Paris 105 2.1 Europe Inland French
$‘North America‘
City Area Pop. Continent Prox.Sea Language.Spoken
New York New York 1214 8.1 North America Coastal English
In the above example, the data frame myframe is split into elements according to
its column Continent and transformed into a list.
1.4.6 Programming in R
Functions
R has many programming capabilities, and allows creating powerful routines with
functions, loops, conditions, packages and objects. As in the Stirling example,
args() is used to receive a list of possible arguments for a specific function.
> args(data.frame) # list possible arguments and default values
function(..., row.names = NULL, check.rows = FALSE, check.names
= TRUE, stringsAsFactors = default.stringsAsFactors())
NULL
This command provides a list of all arguments that can be used in the function,
including the default settings for the optional ones, which have the form optional
argument = setting value.
Below a simple function is presented, which returns the list {a · sin(x), a · cos(x)}.
The arguments a and x are defined in round brackets. We can define functions with
optional arguments that have default values, in this example, a = 1.
> myfun = function(x, a = 1){ # define function
+ r1 = a * sin(x)
+ r2 = a * cos(x)
+ list(r1, r2)
+ }
> myfun(pi / 2) # apply to pi / 2, a = default
[[1]]
[1] 1
[[2]]
[1] 6.123234e-17
Note that if no return(result) statement is given at the end of the function body, then the value of the last evaluated expression is returned.
Loops and conditions
The family of these operators is a powerful and useful tool. However, in order to
perform well, they should be used wisely. Let us start with the ‘if ’ condition.
> x = 1
> if(x == 2){print("x == 2")}
> if(x == 2){print("x == 2")}else{print("x != 2")}
[1] "x != 2"
Furthermore, for and while are very useful functions for creating loops, but
are best avoided in case of large sample sizes and extensive computations, since
they work very slowly. The difference between the functions is that for applies the
computation for a defined range of integers and while carries out the computation
until a certain condition is fulfilled. One may also use repeat, which will repeat the
specified code until it reaches the command break. One must be careful to include
a break rule or the loop will repeat infinitely.
> x = numeric(1)
> # for i from 1 to 10, the i-th element of x takes value i
> for(i in 1:10) x[i] = i
> x
[1] 1 2 3 4 5 6 7 8 9 10
> # as long as i < 21, set i-th element equal to i and increase i by 1
> i = 1
> while(i < 21){
+ x[i] = i
+ i = i + 1
+ }
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> # remove the first element of x, stop when x has length 1
> repeat{
+ x = x[-1]
+ if(length(x) == 1) break
+ }
> x
[1] 20
The functions lapply() and sapply() are of a more general form: the given function is applied to each element of the list or vector object, with lapply() returning a list and sapply() returning a numeric vector or a matrix, if appropriate. So if the class of the object is matrix or numeric, the sapply function is preferred, but this function takes longer,
as it applies lapply and converts the result afterwards. If a more general object,
e.g. a list of objects, is used, the lapply function is more appropriate.
> A = matrix(1:24, 12, 2, byrow = TRUE)
> # apply function sin() to every element, return numeric vector
> sapply(A[1, ], sin)
[1] 0.8414710 0.9092974
> class(sapply(A[1:4, ], sin))
[1] "numeric"
> lapply(A[1, ], sin) # the same, but returned as a list
[[1]]
[1] 0.841471

[[2]]
[1] 0.9092974
> class(lapply(A[1:4, ], sin))
[1] "list"
There is one more useful function, called tapply(), which applies a defined func-
tion to each cell of a ragged array. The latter is made from non-empty groups of
values given by a unique combination of the levels of certain factors. Simply speak-
ing, tapply() is used to break an array or vector into different subgroups before
applying the function to each subgroup. In the example below, a matrix A is exam-
ined, which could contain the observations 1–12 of individuals from group 1, 2 or 3.
Our intention is to calculate the mean for each group separately.
> g = c(rep(1, 4), rep(2, 4), rep(3, 4)) # create vector "group ID"
> A = cbind(1:12, g) # observations and group ID
> tapply(A[, 1], A[, 2], mean) # apply function per group
1 2 3
2.5 6.5 10.5
Finally, the switch() function may be seen as the highlight of R's built-in programming functions. The function switch(i, expression1, expression2, ...) chooses the i-th expression in the given expression arguments. It works with numbers, but also with character chains to specify the expressions. This can be used to simplify code, e.g. by defining a function that can be called to perform different computations.
> rootsquare = function(x, type){ # define function for ^2 or ^(0.5)
+ switch (type, square = x * x, root = sqrt(x))
+ }
> rootsquare(10, "square") # apply "square" to argument 10
[1] 100
> rootsquare(10, 1) # first is equivalent to "square"
[1] 100
> rootsquare(10, "root") # apply "root" to argument 10
[1] 3.162278
> rootsquare(10, 2)
[1] 3.162278
> rootsquare(10, "ROOT") # apply "ROOT" (not defined)
[1] NULL
Here the function rnorm(x) is used, which simulates from the normal distribution,
see Sect. 4.3. Note that the hardwired rnorm() is faster than the for loop.
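The comparison referred to here is not reproduced above; a sketch of such a timing comparison (with an arbitrarily chosen sample size) could be:
> n = 100000
> x = numeric(n)
> system.time(for(i in 1:n) x[i] <- rnorm(1))   # explicit loop
> system.time(x <- rnorm(n))                    # vectorised call, much faster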
R provides full access to current date and time values through the functions
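> Sys.Date()   # current date
> Sys.time()   # current date and time
> date()       # current date and time as a character string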
The function as.Date() is used to format data from another source to dates that
R can work with. When reading in dates, it is important to specify the date format,
i.e. the order and delimiter. A list of date formats that can be converted by R via the
appropriate conversion specifications can be found in the help for strptime. One
can also change the format of the dates in R.
> dates = c("23.05.1984", "2001/01/01", "May 3, 1256")
> # read dates specifying correct format
> dates1 = as.Date(dates[1], "%d.%m.%Y"); dates1
[1] "1984-05-23"
> dates2 = as.Date(dates[2], "%Y/%m/%d"); dates2
[1] "2001-01-01"
> dates3 = as.Date(dates[3], "%B %d,%Y"); dates3
[1] "1256-05-03"
> dates.a = c(dates1, dates2, dates3)
> format(dates.a, "%m.%Y") # delimiter "." and month/year only
[1] "05.1984" "01.2001" "05.1256"
Note that the function as.Date is not only applicable to character strings, factors and logical NA, but also to objects of types POSIXlt and POSIXct. The last two classes represent calendar dates and times: POSIXct stores a date-time as the numeric number of seconds since the beginning of 1970 (UTC), whereas POSIXlt is a list of vectors including seconds, minutes, hours, etc.
The functions months(), weekdays() and quarters() give the month, week-
day and quarter of the specified date, respectively.
> dates.a = as.Date(c("1984/05/23", "2001/01/01", "1256/05/03"))
> months(dates.a)
[1] "May" "January" "May"
> weekdays(dates.a)
[1] "Wednesday" "Saturday" "Wednesday"
> quarters(dates.a)
[1] "Q2" "Q1" "Q2"
For statisticians, software must be able to easily handle data without restrictions on its format, whether it is 'human readable' (such as .csv or .txt), in binary format (SPSS, STATA, Minitab, S-PLUS, SAS (export libs)) or from relational databases.
Writing data
There are some useful functions for writing data, e.g. the standard write.table().
Its often used options include col.names and row.names, which specify whether
row or column names are written to the data file, as well as sep, which specifies the
separator to be used between values.
> write.table(myframe, "mydata.txt")
> write.table(Orange, "example.txt",
+ col.names = FALSE, row.names = FALSE)
> write.table(Orange, "example2.txt", sep="\t")
The first command creates the file mydata.txt in the working directory of the data
frame myframe from Sect. 1.4.4, the second specifies that the names for columns and
rows are not defined, and the last one asks for tab separation between cells.
The functions write.csv() and write.csv2() are both used to create
Excel-compatible files. They differ from write.table() only in the default def-
inition of the decimal separator, where write.csv() uses ‘.’ as a decimal separator
and ‘,’ as the separator between columns in the data. Function write.csv2() uses
‘,’ as decimal separator and ‘;’ as column separator.
> write.csv(data, "file name")  # decimal ".", column separator ","
> write.csv2(data, "file name") # decimal ",", column separator ";"
Reading data
R supplies many built-in data sets. They can be found through the function data(),
or, less efficiently, through objects(package:datasets). In any case, we
can load the pre-built data sets library using library("datasets"), where the
quotation marks are optional. Many packages bring their own data, so for many exam-
ples in this book, packages will be loaded in order to work with their included data
sets. To check whether a data set is in a package, data(package = "package
name") is used.
Moreover, we can import .txt files, with or without header.
> data = read.table("mydata.txt", header = TRUE)
The function head() returns the first few rows of an object, which can be used to
check whether there is a header. To do this, it is important to know that the first row
is generally a header when it has one column less than the second row.
In some cases, the separation between the columns will not follow any of the
standard formats. We can then use option sep to manually specify the column
separator.
> data = read.table("file name", sep = "\t")
Here we specified manually that there is a tab character between the variables. With-
out a correctly specified separator, R may read all the lines as a single expression.
Missing values are represented by NA in R, but different programmes and authors
use other symbols, which can be defined in the function. Suppose, for example, that
the NA values were denoted by ‘missing’ by the creator of the dataset.
> data = read.table("file name", na.strings = "missing")
To import data in .csv (comma-separated list) format, e.g. from Microsoft Excel, the
functions read.csv() or read.csv2() are used. They differ from each other
in the same way as the functions write.csv() or write.csv2() discussed
above.
To import or even write data in the formats of statistic software packages such as
STATA or SPSS, the package foreign provides a number of additional functions.
These functions are named read. plus the data file extension, e.g. read.dta()
for STATA data.
To read data in the most general way from any file, the function scan("file name")
is used. This function is more universal than read.table(), but not as simple to
handle. It can be used to read columnar data or read data into a list.
It is possible to have the user choose interactively between several options by
using the function menu(). This function shows a list of options from which the
user can choose by entering the value or its index number, and returns the position of the
chosen entry in the list. With the option graphics = TRUE, the list is shown in a separate window.
> menu(c("abc", "def"), title = "Enter value")
Enter value
1: abc
2: def
Selection: def
[1] 2
> menu(c("abc", "def"), graphics = TRUE, title = "Enter value")
Chapter 2
Numerical Techniques
With more and more practical problems of applied mathematics appearing in different
disciplines, such as chemistry, biology, geology, management and economics, to men-
tion just a few, the demand for numerical computation has considerably increased.
These problems frequently have no analytical solution or the exact result is time-
consuming to derive. To solve these problems, numerical techniques are used to
approximate the result. This chapter introduces matrix algebra, numerical integra-
tion, differentiation and root finding.
2.1 Matrix Algebra
Matrices with one column and n rows are column vectors, and matrices with one row
and p columns are row vectors. The following R code produces (3 × 3) matrices A
and B with the numbers from 1 to 9 and from 0 to −8, respectively. The matrices are
filled by rows if byrow = TRUE.
> # set matrices A and B
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> B = matrix(0:-8, nrow = 3, ncol = 3, byrow = FALSE); B
[,1] [,2] [,3]
[1,] 0 -3 -6
[2,] -1 -4 -7
[3,] -2 -5 -8
There are several special matrices that are frequently encountered in practical and
theoretical work. Diagonal matrices are special matrices where all off-diagonal elements are equal to 0, that is, A(n × p) is a diagonal matrix if a_{ij} = 0 for all i ≠ j.
The function diag() creates diagonal matrices (square or rectangular) or extracts
the main diagonal of a matrix in R.
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> diag(x = A) # extract diagonal
[1] 1 5 9
> diag(3) # identity matrix
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
> diag(2, 3) # 2 on diag, 3x3
[,1] [,2] [,3]
[1,] 2 0 0
[2,] 0 2 0
[3,] 0 0 2
> diag(c(1, 5, 9, 13), nrow = 3, ncol = 4) # 3x4
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 5 0 0
[3,] 0 0 9 0
> diag(2, 3, 4) # 3x4, 2 on diagonal
[,1] [,2] [,3] [,4]
[1,] 2 0 0 0
[2,] 0 2 0 0
[3,] 0 0 2 0
As seen from the listing above, the argument x of diag() can be a matrix, a vector,
or a scalar. In the first case, the function diag() extracts the diagonal elements of
the existing matrix, and in the remaining two cases, it creates a diagonal matrix with
a given diagonal or of given size.
Rank
The rank of A is denoted by rank(A) and is the maximum number of linearly independent rows or columns. Linear independence of a set of h rows a_j means that \sum_{j=1}^{h} c_j a_j = 0_p if and only if c_j = 0 for all j. If the rank is equal to the number of rows or columns, the matrix is called a full-rank matrix. In R the rank can be
calculated using the function qr() (which does the so-called QR decomposition)
with the object field rank.
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> qr(A)$rank # rank of matrix A
[1] 2
The matrix A is not of full rank, because the second column can be represented as a
linear combination of the first and third columns:
\begin{pmatrix} 2 \\ 5 \\ 8 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 1 + 3 \\ 4 + 6 \\ 7 + 9 \end{pmatrix}   (2.1)

This shows that the general condition for linear independence is violated for the specific matrix A. The coefficients are c₁ = c₃ = 1/2 and c₂ = −1, and are thus different from zero.
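This dependence can also be checked numerically, since the corresponding linear combination of the columns of A gives the zero vector:
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> 0.5 * A[, 1] - A[, 2] + 0.5 * A[, 3]   # c1 = c3 = 1/2, c2 = -1
[1] 0 0 0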
Trace
The trace of a matrix tr(A) is the sum of its diagonal elements:

tr(A) = \sum_{i=1}^{\min(n, p)} a_{ii}.
The trace of a scalar just equals the scalar itself. One obtains the trace in R by
combining the functions diag() and sum():
> A = matrix(1:12, nrow = 4, ncol = 3); A
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> sum(diag(A)) # trace
[1] 18
The function diag() extracts the diagonal elements of a matrix, which are then
summed by the function sum().
Determinant
The determinant of a square matrix A(p × p) is defined formally as a signed sum over all permutations of products of its elements; for the (2 × 2) case it reduces to
det(A_{(2×2)}) = a₁₁ a₂₂ − a₁₂ a₂₁.
The determinant is often useful for checking whether matrices are singular or regular.
If the determinant is equal to 0, then the matrix is singular. Singular matrices can not
be inverted, which limits some computations. In R the determinant is computed by
the function det():
> A = matrix(1:9, nrow = 3, ncol = 3); A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> det(A) # determinant
[1] 0
Thus, A is singular.
Transpose
A matrix A(n × p) has a transpose A^\top_{(p × n)}, which is obtained by reordering the elements of the original matrix. Formally, the transpose of A(n × p) is

A^\top_{(p × n)} = (a_{ij})^\top = (a_{ji}).

The resulting matrix has p rows and n columns. One has that

(A^\top)^\top = A,   (AB)^\top = B^\top A^\top.
In R, the transpose is obtained with the function t():
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> t(A)                                   # transpose of A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
When creating a matrix with the constructor matrix(), its transpose can also be obtained by setting the argument byrow to FALSE.
A good overview of special matrices and vectors is provided by Table 2.1 in Härdle
and Simar (2015), Chap. 2. The same notations are used in this book.
Conjugate transpose
Every matrix A(n × p) has a conjugate transpose A^C_{(p × n)}. The elements of A can be complex numbers. If a matrix entry a_{ij} = α + βi is a complex number with real numbers α, β and imaginary unit i² = −1, then its conjugate is a^C_{ij} = α − βi. The same holds in the other direction: if a_{ij} = α − βi, the conjugate is a^C_{ij} = α + βi. Therefore the conjugate transpose is

A^C = \begin{pmatrix} a^C_{11} & a^C_{21} & \dots & a^C_{n1} \\ a^C_{12} & a^C_{22} & \dots & a^C_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ a^C_{1p} & a^C_{2p} & \dots & a^C_{np} \end{pmatrix}.   (2.3)
The function Conj() yields the conjugates of the elements. One can combine the functions Conj() and t() to get the conjugate transpose of a matrix. For

A = \begin{pmatrix} 1 + 0.5i & 1 & 1 \\ 1 & 1 & 1 − 0.5i \end{pmatrix},

the conjugate transpose is computed in R as follows:
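A minimal sketch of this computation (output omitted) is:
> A = matrix(c(1 + 0.5i, 1, 1, 1, 1, 1 - 0.5i),
+   nrow = 2, ncol = 3, byrow = TRUE)    # complex matrix A
> Conj(t(A))                             # conjugate transpose of A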
For a matrix with only real values, the conjugate transpose A^C is equal to the ordinary transpose A^\top.
Basic operations
For matrices A(n × p) and B(n × p) of the same dimensions, matrix addition and sub-
traction work elementwise as follows:
A + B = (ai j + bi j ),
A − B = (ai j − bi j ).
R reports an error if one tries to add or subtract matrices with different dimensions.
The elementary operations, including addition, subtraction, multiplication and divi-
sion can also be used with a scalar and a matrix in R, and are applied to each entry
of the matrix. An example is the modulo operation
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> A %% 2 # modulo operation
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 0 1 0
[3,] 1 0 1
In R, one uses the operator %*% between two objects for matrix multiplication. The
objects have to be of class vector or matrix.
> A = matrix(3:11, nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 3 4 5
[2,] 6 7 8
[3,] 9 10 11
> B = matrix(-3:-11, nrow = 3, ncol = 3, byrow = TRUE); B
[,1] [,2] [,3]
[1,] -3 -4 -5
[2,] -6 -7 -8
[3,] -9 -10 -11
> A %*% B # matrix multiplication
[,1] [,2] [,3]
[1,] -78 -90 -102
[2,] -132 -153 -174
[3,] -186 -216 -246
A^{-1} = \frac{W}{\det(A)},

where the elements w_{ji} of W are the cofactors of A. To compute the cofactor w_{ji}, one deletes column j and row i of A, then computes the determinant of that reduced matrix, and multiplies it by 1 if j + i is even or by −1 if it is odd. This computation is only feasible for small matrices.
Using the above definition, one can determine the inverse of a square matrix by
solving the system of linear equations (see Sect. 2.4.1) in (2.4) by employing the
function solve(A, b). In R this function can be used to solve a general system
of linear equations Ax = b. If one does not specify the right side b of the system of
equations, the solve() function computes the inverse of the square matrix A. The following code computes the inverse of the square matrix

A = \begin{pmatrix} 1 & 2 & 5 \\ 3 & 9 & 2 \\ 2 & 2 & 2 \end{pmatrix}.
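A sketch of this computation is:
> A = matrix(c(1, 2, 5, 3, 9, 2, 2, 2, 2),
+   nrow = 3, ncol = 3, byrow = TRUE)    # square matrix A
> solve(A)                               # inverse of A, since b is omitted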
Generalised inverse
In practice, we are often confronted with singular matrices, whose determinant is
equal to zero. In this situation, the inverse can be given by a generalised inverse A−
satisfying
AA− A = A. (2.5)
There are sometimes several A− which satisfy (2.5). The Moore–Penrose generalised
inverse (hereafter, just ‘generalised inverse’) is the most common type and was
developed by Moore (1920) and Penrose (1955). It is used to compute the ‘best fit’
solution to a system of linear equations that does not have a unique solution. Another
approach is to find the minimum (Euclidean) norm (see Sect. 2.1.5) solution to a
system of linear equations with several solutions. The Moore–Penrose generalised
inverse is defined and unique for all matrices with real or complex entries. It can be
computed using the singular value decomposition, see Press (1992).
In R the generalised inverse of a matrix defined in (2.5) can be computed with the function ginv() from the MASS package. With ginv() one obtains the generalised inverse of the matrix A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, which is equal to A^- = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}.
> require(MASS)
> A = matrix(c(1, 0, 0, 0),
+ ncol = 2, nrow = 2); A # matrix from (2.6)
[,1] [,2]
[1,] 1 0
[2,] 0 0
> ginv(A) # generalised inverse
[,1] [,2]
[1,] 1 0
[2,] 0 0
The ginv() function can also be used for non-square matrices, like A = \begin{pmatrix} 1 & 2 & 3 \\ 11 & 12 & 13 \end{pmatrix}.
> require(MASS)
> A = matrix(c(1, 2, 3, 11, 12, 13),
+ nrow = 2, ncol = 3, byrow = TRUE); A # non-square matrix
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 11 12 13
> A.ginv = ginv(A); A.ginv               # generalised inverse
[,1] [,2]
[1,] -0.63333333 0.13333333
[2,] -0.03333333 0.03333333
[3,] 0.56666667 -0.06666667
One can check directly that this solution A⁻ fulfills the condition AA⁻A = A:
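> A %*% A.ginv %*% A                     # reproduces A up to numerical error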
For a given basis of a vector space, a matrix A(p × p) can represent a linear function from a
p-dimensional vector space to itself. If this function is applied to a nonzero vector
and maps that vector to a multiple of itself, that vector is called an eigenvector γ and
the multiple is called the corresponding eigenvalue λ. Formally this can be written
as
Aγ = λγ.
In order to obtain the eigenvalues of A, one has to solve for the roots λ of the polynomial det(D) = 0, where D = A − λI. For a three-dimensional matrix this looks like

det(D) = c₀ + c₁λ + c₂λ² + c₃λ³,
c₀ = det(A),
c₁ = −\sum_{1 \le i < j \le 3} (a_{ii} a_{jj} − a_{ij} a_{ji}),
c₂ = \sum_{1 \le i \le 3} a_{ii},
c₃ = −1.
Let γ₂ = (1, 0, 0)^\top be the second column of the eigenvector matrix P = (γ₁, γ₂, γ₃). Then it can be seen that Aγ₂ = 2γ₂. With the diagonal matrix Λ containing the eigenvalues, the spectral decomposition of A is

A = P Λ P^{-1}.   (2.7)
In R, one can use the function eigen() to compute eigenvalues and eigenvectors.
The eigenvalues are in the field named values and are sorted in decreasing order
(see the example above). Using the output of the function eigen(), the linear
independence of the eigenvectors can be checked for the above example by computing
the rank of the matrix P:
> A = matrix(c(2, 0, 1, 0, 3, 1, 0, 6, 2), # matrix A
+ nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 2 0 1
[2,] 0 3 1
[3,] 0 6 2
> Eigen = eigen(A) # eigenvectors and -values
> P = eigen(A)$vectors # eigenvector matrix
> L = diag(eigen(A)$values) # eigenvalue matrix
> qr(P)$rank # rank of P
[1] 3
> P %*% L %*% solve(P) # spectral decomposition
[,1] [,2] [,3]
[1,] 2 4.440892e-16 1
[2,] 0 3.000000e+00 1
[3,] 0 6.000000e+00 2
From this computation, it can be seen that P has full rank. The diagonal matrix can
be obtained by extracting the eigenvalues from the output of the function eigen().
It is possible to decompose the matrix A by (2.7) in R. The difference between A
and the result from the spectral decomposition in R is negligibly small.
2.1.5 Norm
There are two types of frequently used norms: the vector norm and the matrix norm.
The vector norm, which appears frequently in matrix algebra and numerical compu-
tation, will be introduced first. An extension of the vector norm is the matrix norm.
Definition 2.2 Let V be a vector space and b be a scalar, both lying either in Rⁿ or Cⁿ. Consider the vectors x, y ∈ V. Then a norm is a mapping ‖·‖ : V → R₀⁺ with the following properties:
1. ‖bx‖ = |b| ‖x‖,
2. ‖x + y‖ ≤ ‖x‖ + ‖y‖,
3. ‖x‖ ≥ 0, where ‖x‖ = 0 if and only if x = 0.
Let x = (x₁, \dots, x_n)^\top ∈ Rⁿ, k ≥ 1 and k ∈ R. Then a general norm is the L_k norm, which can be represented as follows:

‖x‖_k = \left( \sum_{i=1}^{n} |x_i|^k \right)^{1/k}.
There are several special norms, depending on the value of k, some are listed below.
Manhattan norm:  ‖x‖₁ = \sum_{i=1}^{n} |x_i|,   (2.8)
Euclidean norm:  ‖x‖₂ = \left( \sum_{i=1}^{n} |x_i|^2 \right)^{1/2} = \sqrt{x^\top x},   (2.9)
infinity norm:  ‖x‖_∞ = \max_{i=1,\dots,n} |x_i|,   (2.10)
Frobenius norm:  ‖x‖_F = \sqrt{x^\top x}.   (2.11)
The most frequently used norms are the Manhattan and Euclidean norms. For vector
norms, the Euclidean and Frobenius norm coincide. The infinity norm selects the
maximum absolute value of the elements of x and the maximum norm just the
maximum value.
In R the function norm() can return the norms from (2.8) to (2.11). The argument
type specifies which norm is returned.
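For instance, for the vector x = (1, −2, 3)⊤, stored as a one-column matrix so that all types of norm() apply:
> x = matrix(c(1, -2, 3))                # column vector x
> norm(x, type = c("O"))                 # Manhattan norm: |1| + |-2| + |3|
[1] 6
> norm(x, type = c("I"))                 # infinity norm: max |x_i|
[1] 3
> norm(x, type = c("F"))                 # Euclidean/Frobenius norm: sqrt(14)
[1] 3.741657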
1. ‖aA‖ = |a| ‖A‖,
2. ‖A + B‖ ≤ ‖A‖ + ‖B‖,
3. ‖A‖ ≥ 0, where ‖A‖ = 0 if and only if A = 0.
In R, the function norm() can be applied to vectors and matrices in the same fashion.
The one norm, the infinity norm, the Frobenius norm, the maximum norm and the
spectral norm for matrices are represented by
one norm:  ‖A‖₁ = \max_{1 \le j \le p} \sum_{i=1}^{n} |a_{ij}|,
spectral/Euclidean norm:  ‖A‖₂ = \sqrt{λ_{\max}(A^C A)},
infinity norm:  ‖A‖_∞ = \max_{1 \le i \le n} \sum_{j=1}^{p} |a_{ij}|,
Frobenius norm:  ‖A‖_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{p} |a_{ij}|^2},
where A^C is the conjugate transpose of A. The next code shows how to compute these norms with the function norm() for the matrix A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}.
> A = matrix(c(1, 2, 3, 4),
+ ncol = 2, nrow = 2, byrow = TRUE) # matrix A
> norm(A, type = c("O")) # one norm
[1] 6 # maximum of column sums
> norm(A, type = c("2")) # Euclidean norm
[1] 5.464986
> norm(A, type = c("I")) # infinity norm
[1] 7 # maximum of row sums
> norm(A, type = c("F")) # Frobenius norm
[1] 5.477226
Note that the Frobenius norm returns the square root of the trace of the product of the matrix with its conjugate transpose, i.e. √tr(A^C A). The spectral norm or Euclidean norm returns the square root of the maximum eigenvalue of A^C A.
2.2 Numerical Integration
Not every function f ∈ C[a, b] has an indefinite integral with an analytical repre-
sentation. Therefore, it is not always possible to analytically compute the area under
a curve. An important example is
\int \exp(−x^2)\, dx.   (2.12)
To construct a polynomial that satisfies this condition, the following basis polynomials are used:

L_k(x) = \prod_{i=0, i \neq k}^{n} \frac{x − x_i}{x_k − x_i}.

This leads to the so-called Lagrange polynomial, which satisfies the condition in (2.13) (assuming 0⁰ = 1):

p_n(x) = \sum_{k=0}^{n} f(x_k) L_k(x).   (2.14)
b
Let I ( f ) = a f (x)d x be the exact integration operator applied to a function
f ∈ C[a, b]. Then define In ( f ) as the approximation of I ( f ) using (2.14) as an
approximation for f :
2.2 Numerical Integration 47
b
In ( f ) = pn (x) d x,
a
By construction, (2.15) is exact for every f ∈ Pn . Suppose the nodes xk are equidis-
tant in [a, b], i.e., xk = a + kh, where h = (b − a)n −1 . Then (2.15) is the (closed)
Newton–Cotes rule. The weights αk can be explicitly computed up to n = 7. Start-
ing from n = 8, negative weights occur and the Newton–Cotes rule can no longer be
applied. The trapezoidal rule is an example of the Newton–Cotes rule.
Example 2.2 For n = 1 and I(f) = \int_a^b f(x)\, dx, the nodes are given as follows: x₀ = a, x₁ = b. The weights can be computed explicitly by transforming the integral using two substitutions:

α_k = \frac{1}{b − a} \int_a^b L_k(x)\, dx = \int_0^1 \prod_{i=0, i \neq k}^{n} \frac{t − t_i}{t_k − t_i}\, dt = \frac{1}{n} \int_0^n \prod_{i=0, i \neq k}^{n} \frac{s − i}{k − i}\, ds.
In R, the trapezoidal rule is implemented within the package caTools. There the
function trapz(x, y) is used with a sorted vector x that contains the x-axis values
and a vector y with the corresponding y-axis values. This function uses a summed
version of the trapezoidal rule, where [a, b] is split into n equidistant intervals. For
all k = {1, . . . , n}, the integral Ik ( f ) is computed according to the trapezoidal rule:
this is the so-called extended trapezoidal rule.
I_k(f) = \frac{b − a}{2n} \left[ f\{a + (k − 1) n^{-1}(b − a)\} + f\{a + k n^{-1}(b − a)\} \right],

I(f) ≈ \sum_{k=1}^{n} I_k(f) = \sum_{k=1}^{n} \frac{b − a}{2n} \left[ f\{a + (k − 1) n^{-1}(b − a)\} + f\{a + k n^{-1}(b − a)\} \right].
For example, consider the integral of the cosine function on [− π2 , π2 ] and split the
interval into 10 subintervals, where the trapezoidal rule is applied:
> require(caTools)
> x = (-5:5) * (pi / 2) / 5 # set subintervals
> intcos = trapz(x, cos(x)); intcos # integration
[1] 1.983524
> abs(intcos - 2) # absolute error
[1] 0.01647646
A total of 2n equations are used to find the nodes x1 , . . . , xn and the coefficients
α1 , . . . , αn .
Consider the special case with two nodes (x1 , x2 ) and two weights (α1 , α2 ). The
particular polynomial p(x) is of order 2 · n − 1 = 3, where the number of nodes
is n. The integral is approximated by α1 f (x1 ) + α2 f (x2 ) and it is assumed that
f (x) = p(x). Therefore the following two equations can be derived:
b − a = α₁ + α₂;
(b² − a²)/2 = α₁ x₁ + α₂ x₂.
For simplicity, in most cases the interval [−1, 1] is considered. It is possible to extend
these results to the more general interval [a, b]. To apply the results for [−1, 1] to
the interval [a, b], one uses
\int_a^b f(x)\, dx = \frac{b − a}{2} \int_{-1}^{1} f\left( \frac{b − a}{2}\, x + \frac{a + b}{2} \right) dx.
For the special case w(x) = 1 and the interval [−1, 1], the procedure is called Gauss–Legendre quadrature. The nodes are the roots of the Legendre polynomials

P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n} \{(x^2 − 1)^n\}.

The weights α_k can be calculated by

α_k = \frac{2}{(1 − x_k^2)\{P'_n(x_k)\}^2}.
In the following example, we illustrate the process of numerical integration using the
function integrate(). One can specify the following arguments: f (the integrand), lower and upper (the integration limits), subdivisions (the number of subintervals), and rel.tol as well as abs.tol for the requested relative and absolute accuracy. Consider again the cosine function on the interval [−π/2, π/2].
> require(stats)
> integrate(cos, # integrand
+ lower = -pi / 2, # lower integration limit
+ upper = pi / 2) # upper integration limit
2 with absolute error < 2.2e-14
The output of the integrate() function delivers the computed value of the definite
integral and an upper bound on the absolute error. In this example, the absolute error
is smaller than 2.2 · 10−14 . Therefore, the integrate() function is much more
accurate for the cosine function than the trapz() function used in a previous
example.
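The function integrate() also handles infinite limits, so the integral (2.12) over the whole real line can be approximated directly; its exact value is √π ≈ 1.772454.
> integrate(function(x){exp(-x^2)},      # integrand from (2.12)
+   lower = -Inf,                        # lower integration limit
+   upper = Inf)                         # upper integration limit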
Analytical computation of this integral yields 1/12. The surface z = x 2 y 3 for the
interval [0, 1]2 is depicted in Fig. 2.1. For the computation of multiple integrals, the
R package R2Cuba is used, which is introduced in Hahn (2013). It includes four
different algorithms for multivariate integration, where the function cuhre uses the
adaptive method.
> require(R2Cuba)
> integrand = function(arg){ # construct the integrand
+ x = arg[1] # function argument x
+ y = arg[2] # function argument y
+ (x^2) * (y^3) # function
+ }
> cuhre(integrand, # adaptive method
+ ncomp = 1, # number of components
+ ndim = 2, # dimension of the integral
+ lower = rep(0, 2), # lower bound of interval
+ upper = rep(1, 2), # upper bound of interval
+ rel.tol = 1e-3, # relative tolerance level
+ abs.tol = 1e-12, # absolute tolerance level
+ flags = list(verbose = 0)) # controls output
The output shows that the adaptive algorithm carried out two iteration steps. Only two
subregions have been used for the computation, which is stated by the output value
nregions. The output value neval states that the number of evaluations is 195. To
make a statement about the reliability of the process, consider the probability
value. A probability of 0 for the χ2 distribution (see Sect. 4.4.1) means that the null
hypothesis can be rejected. The null hypothesis states that the absolute error estimate
is not a reliable estimate of the true integration error. The approximation of the integral I is 0.08333, which is close to the result of the analytical computation, 1/12. For a more detailed discussion of the output, refer to Hahn (2013).
Example 2.4 Evaluate the integral with three variables
\int_0^1 \int_0^1 \int_0^1 \sin(x) \log(1 + 2y) \exp(3z)\, dx\, dy\, dz,   (2.18)
> require(R2Cuba)
> integrand = function(arg){ # construct the integrand
+ x = arg[1] # function argument x
+ y = arg[2] # function argument y
+ z = arg[3] # function argument z
+ sin(x) * log(1 + 2 * y) * exp(3 * z) # function
+ }
> cuhre(integrand, # adaptive method
+ ncomp = 1, # number of components
+ ndim = 3, # dimension of the integral
+ lower = rep(0, 3), # lower bound of interval
+ upper = rep(1, 3), # upper bound of interval
+ rel.tol = 1e-3, # relative tolerance level
+ abs.tol = 1e-12, # absolute tolerance level
+ flags = list(verbose = 0)) # controls output
integral: 1.894854 (+-4.1e-07)
nregions: 2; number of evaluations: 381; probability: 0.04784836
For the function of three variables (2.18), an analytical computation yields the value:
\int_0^1 \sin(x)\, dx \int_0^1 \log(1 + 2y)\, dy \int_0^1 \exp(3z)\, dz
= \{1 − \cos(1)\} \cdot \frac{1}{2}\left[3\{\log(3) − 1\} + 1\right] \cdot \frac{1}{3}\{\exp(3) − 1\} = 1.89485.
The value provided by the adaptive method is very close to the exact value.
Monte Carlo method
For a multiple integral I of the function of p variables f (x1 , . . . , x p ) with lower
bounds a1 , . . . , a p and upper bounds b1 , . . . , b p , the integral is given by
I(f) = \int_{a_1}^{b_1} \dots \int_{a_p}^{b_p} f(x_1, \dots, x_p)\, dx_1 \dots dx_p = \int \dots \int_D f(x)\, dx,
where x stands for a vector (x1 , . . . , x p ) and D for the integration region. Let X be
a random vector (see Chap. 6), with each component X j of X uniformly distributed
(Sect. 4.2) in [a j , b j ]. Then the algorithm of Monte Carlo multiple integration can be
described as follows. In the first step, n points of dimension p are randomly drawn
from the region D, such that
(x_{11}, \dots, x_{1p}), \dots, (x_{n1}, \dots, x_{np}).

In the second step, the p-dimensional volume is estimated by V = \prod_{j=1}^{p} (b_j − a_j) and the integrand f is evaluated for all n points. In the third step, the integral I can be estimated using a sample moment function,

I(f) ≈ \hat{I}(f) = n^{-1} V \sum_{i=1}^{n} f(x_{i1}, \dots, x_{ip}).
The Monte Carlo method is applied to example (2.17) via the function vegas.
> require(R2Cuba)
> integrand = function(arg){ # construct the integrand
+ x = arg[1] # function argument x
+ y = arg[2] # function argument y
+ (x^2) * (y^3) # function
+ }
> vegas(integrand, # Monte Carlo method
+ ncomp = 1, # number of components
+ ndim = 2, # dimension of the integral
+ lower = rep(0, 2), # lower integration bound
+ upper = rep(1, 2), # upper integration bound
+ rel.tol = 1e-3, # relative tolerance level
+ abs.tol = 1e-12, # absolute tolerance level
+ flags = list(verbose = 0)) # controls output
integral: 0.08329357 (+-7.5e-05)
number of evaluations: 17500; probability: 0.1201993
The outputs of the functions vegas and cuhre are almost identical. Additional
output information can be obtained by setting the argument verbose to one. Then
the output shows that the Monte Carlo algorithm executed 7 iterations and 17 500
evaluations of the integrand. The approximation of the integral I is 0.0832, which is close to the exact value 1/12. For the function (2.18) the Monte Carlo algorithm looks
as follows:
> require(R2Cuba)
> integrand = function(arg){ # construct the integrand
+ x = arg[1] # function argument x
+ y = arg[2] # function argument y
+ z = arg[3] # function argument z
+ sin(x) * log(1 + 2 * y) * exp(3 * z) # function
+ }
> vegas(integrand, # Monte Carlo method
+ ncomp = 1, # number of components
+ ndim = 3, # dimension of the integral
+ lower = rep(0, 3), # lower integration bound
+ upper = rep(1, 3), # upper integration bound
+ rel.tol = 1e-3, # relative tolerance level
+ abs.tol = 1e-12, # absolute tolerance level
+ flags = list(verbose = 0)) # controls output
The performance of the adaptive method is again superior to that of the Monte Carlo
method, which gives 1.894488 as the value of the integral.
2.3 Differentiation
The function D() returns an object of type call (see help(call) for further information), so one can recursively compute higher-order derivatives. For example, consider the second derivative of 3x³ + x².
> f = expression(3 * x^3 + x^2)          # function as an expression
> D(D(f, "x"), "x")                      # second derivative
3 * (3 * (2 * x)) + 2
A small helper function can repeatedly replace the expression by its first derivative until the requested order is reduced to one; the third derivative of 3x³ + x² can then be computed with it, as sketched below.
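A minimal sketch of such a recursive helper (the name DD is only illustrative) is:
> DD = function(expr, name, order = 1){  # higher-order derivatives via D()
+   if(order == 1) D(expr, name)
+   else DD(D(expr, name), name, order - 1)
+ }
> DD(f, "x", 3)                          # third derivative of 3x^3 + x^2, i.e. 18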
The gradient of a function can also be computed using the function D().
Definition 2.5 Let f : Rⁿ → R be a differentiable function and x = (x₁, \dots, x_n)^\top ∈ Rⁿ. Then the vector

∇f(x) := \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x) \right)^\top

is called the gradient of f at x.
Now consider the function f : R2 → R that maps x1 and x2 coordinates to the square
of their Euclidean norm.
> f = expression(x^2 + y^2) # function
> grad = c(D(f,"x"), D(f,"y")) # gradient vector
> grad
[[1]]
2 * x
[[2]]
2 * y
If it is necessary to have the gradient as a function that can be evaluated, the func-
tion deriv(f, name, function.arg = NULL, hessian = FALSE)
should be used. The function argument f is the function (as an object of mode
expression) and the argument name identifies the vector with respect to which
the derivative will be computed. Furthermore, the argument function.arg specifies the parameters of the returned function, and hessian indicates whether the Hessian matrix should be computed as well. If the option hessian is set to TRUE, the Hessian matrix at a point (x, y) can be retrieved through the call attr(eucld(2, 2), "hessian").
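A minimal sketch of how such a function eucld can be constructed with deriv() is:
> eucld = deriv(expression(x^2 + y^2),   # squared Euclidean norm
+   namevec = c("x", "y"),               # differentiate w.r.t. x and y
+   function.arg = c("x", "y"),          # return a function of x and y
+   hessian = TRUE)                      # also compute the Hessian
> attr(eucld(2, 2), "hessian")           # Hessian matrix at the point (2, 2)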
f(x + h) = f(x) + h f'(x) + \frac{h^2}{2!} f''(x) + \frac{h^3}{3!} f'''(x) + O(h^4).   (2.20)

The representation in (2.20) is valid only if the fourth derivative of f exists and is bounded on [x, x + h]. If the Taylor expansion is truncated after the linear term, then (2.20) can be solved for f'(x):

f'(x) = \frac{f(x + h) − f(x)}{h} + O(h).   (2.21)

Therefore an approximation for the derivative at point x could be

f'(x) ≈ \frac{f(x + h) − f(x)}{h}.   (2.22)
Another, more accurate method uses the Richardson (1911) extrapolation. Redefine the expression in (2.22) as g(h) = \{f(x + h) − f(x)\}/h. Then (2.21) can be written as the expansion (2.23), and the same expansion evaluated at step size h/2 gives (2.24). Now (2.23) can be subtracted from two times (2.24), so that the term involving k₁ is eliminated:

f'(x) = 2 g(h/2) − g(h) + k_2 \left( \frac{h^2}{2} − h^2 \right) + k_3 \left( \frac{h^3}{4} − h^3 \right) + \cdots .
This process can be continued to obtain formulae of higher order. In R, the pack-
age numDeriv provides some functions that use these methods to differentiate
a function numerically. For example, the function grad() calculates a numerical
approximation to the gradient of func at the point x. The argument method can be
“simple” or “Richardson”. If the method argument is simple, a formula as
in (2.22) is applied. Then only the element eps of method.args is used (equiv-
alent to the above h in (2.22)). The method “Richardson”
uses the Richardson
extrapolation. Consider the function f(x_1, x_2, x_3) = \sqrt{x_1^2 + x_2^2 + x_3^2}, which has the gradient

∇f(x) = \left( \frac{x_1}{\sqrt{x_1^2 + x_2^2 + x_3^2}}, \frac{x_2}{\sqrt{x_1^2 + x_2^2 + x_3^2}}, \frac{x_3}{\sqrt{x_1^2 + x_2^2 + x_3^2}} \right)^\top.
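For example, the gradient at the point (1, 2, 3)⊤ can be approximated numerically as follows; the exact value is x/√14:
> require(numDeriv)
> f = function(x){sqrt(sum(x^2))}        # Euclidean norm
> grad(f, x = c(1, 2, 3))                # numerical gradient, approx. x / sqrt(14)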
It could also be interesting to compute numerically the Jacobian or the Hessian matrix
of a function F : Rn → Rm .
In R, the function jacobian(func,x,...) can be used to compute the
Jacobian matrix of a function func at a point x. As with the function grad(), the
function jacobian() uses the Richardson extrapolation by default. Consider the
following example, where the Jacobian matrix of f (x) = {sin(x1 + x2 ), cos(x1 +
x2 )} at the point (0, 2π) is computed:
> require(numDeriv)
> f1 = function(x){c(sin(sum(x)), cos(sum(x)))}
> jacobian(f1, x = c(0, 2 * pi))
[,1] [,2]
[1,] 1 1
[2,] 0 0
From the definition of the Euclidean norm, it would make sense for f to have a
minimum at (0, 0, 0). The above information can be used to check whether f has a
local minimum at (0, 0, 0). In order to check this, two conditions have to be fulfilled.
The gradient at (0, 0, 0) has to be the zero vector and the Hessian matrix should be
positive definite (see Canuto and Tabacco 2010 for further information on the calcu-
lation of local extreme values using the Hessian matrix). The second condition can
be restated by using the fact that a positive definite matrix has only positive eigenval-
ues. Therefore, the second condition can be checked by computing the eigenvalues
of the above Hessian matrix and the first condition can be checked using the grad()
function.
> f = function(x){sqrt(sum(x^2))}
> grad(f, x = c(0, 0, 0)) # gradient at the
[1] 0 0 0 # optimum point
> hessm = hessian(f, x = c(0, 0, 0)) # Hessian matrix
> eigen(hessm)$values # eigenvalues
[1] 251364.0 251364.0 80531.3
This output shows that the gradient at (0, 0, 0) is the zero vector and the eigenvalues
are all positive. Therefore, as expected, the point (0, 0, 0) is a local minimum of f .
If the number of variables becomes large, then the expression will use a tremendous
amount of memory and have a very tedious representation.
In automatic differentiation, all arguments of the function are redefined as dual numbers, x_i + x_i ε, where ε has the property that ε² = 0. The change in x_i is x_i ε, for all i. Therefore, automatic differentiation for this function looks like

f(x_1 + x_1 ε, \dots, x_{10} + x_{10} ε) = \prod_{i=1}^{10} x_i + ε \left( x_1 \prod_{i=2}^{10} x_i + \dots + x_j \prod_{i \neq j} x_i + \dots + x_{10} \prod_{i=1}^{9} x_i \right).
f'(x) ≈ \frac{f(x + h) − f(x)}{h},

or

f'(x) ≈ \frac{f(x + h) − f(x − h)}{2h}.
It is obvious that the accuracy of this type of differentiation is related to the choice
of h. If h is small, then the method of divided differences has errors introduced by
rounding off the floating point numbers. If h is large, then the formula disobeys
the essence of this method, which assumes that h tends to zero. Also, the method
of divided differences introduces truncation errors by neglecting the terms of order
O(h 2 ), something which does not happen in automatic differentiation.
Automatic differentiation has two operation modes: forward and reverse. For
forward mode, the algorithm starts by evaluating the derivatives of every elementary function of f, beginning with the function arguments themselves, at the given points. In each intermediate
step, the derivatives are combined to reproduce the derivatives of more complicated
functions. The last step merely assembles the evaluations from the results of the
computations already performed, employing the chain rule. For example, we use the
forward mode to evaluate the derivative of f (x) = (x + x 2 )3 : the pseudocode can
be summarised as
function(y, y’)=f’(x, x’)
s1 = x * x;
s1’ = 2 * x * x’;
s2 = x + s1;
s2’ = x’ + s1’;
y = s2 * s2 * s2;
y’ = 3 * s2 * s2 * s2’
end
where a prime denotes the derivative, i.e. f' = ∂f/∂x. Therefore, let us evaluate the derivative
of f(x) = (x + x²)³ at the point x = 2 with the forward mode.
s₁ = x · x = 2 · 2 = 4,
s₁' = 2 · x · x' = 2 · 2 · 1 = 4,
s₂ = x + s₁ = 2 + 4 = 6,
s₂' = x' + s₁' = 1 + 4 = 5,
y = s₂ · s₂ · s₂ = 6 · 6 · 6 = 216,
y' = 3 · s₂ · s₂ · s₂' = 3 · 6 · 6 · 5 = 540.
For reverse mode, the programme performs the computation in the reverse direction.
We need to set v̄ = dy/dv, then ȳ = dy/dy = 1. For the same example as before,
where the derivative at x = 2 is evaluated, it looks as
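In sketch form, with v̄ = dy/dv for each intermediate variable v of the forward pass above:
ȳ = dy/dy = 1,
s̄₂ = ȳ · 3 · s₂ · s₂ = 1 · 3 · 6 · 6 = 108,
s̄₁ = s̄₂ · ∂s₂/∂s₁ = 108 · 1 = 108,
x̄ = s̄₂ · ∂s₂/∂x + s̄₁ · ∂s₁/∂x = 108 · 1 + 108 · 2 · 2 = 540.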
Two examples are implemented in R using the package radx developed by Annamalai (2010). This package is not available on CRAN and is therefore installed from GitHub via
> require(devtools)
> # install_github("radx","quantumelixir") # installs from GitHub
> require(radx) # not provided by CRAN
> f = function(x) {(x^2 + x)^3} # function
> radxeval(f, # automatic differ.
+ point = 2, # point at which to eval.
+ d = 1) # order of differ.
[,1]
[1,] 540
The computation above shows that the value of the first derivative of the function (2.25) at x = 2 is equal to 540.
Example 2.6 Evaluate the first and second derivatives of the vector function
f 1 (x, y) = 1 − 3y + sin(3π y) − x,
f 2 (x, y) = y − sin(3πx)/2,
at (x = 3, y = 5).
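A sketch of a purely numerical solution with the numDeriv functions introduced above (instead of the radx-based approach) is:
> require(numDeriv)
> f = function(x){                       # vector function (f1, f2)
+   c(1 - 3 * x[2] + sin(3 * pi * x[2]) - x[1],
+     x[2] - sin(3 * pi * x[1]) / 2)
+ }
> jacobian(f, x = c(3, 5))               # first derivatives at (3, 5)
> hessian(function(x) f(x)[1], c(3, 5))  # second derivatives of f1
> hessian(function(x) f(x)[2], c(3, 5))  # second derivatives of f2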
Let K denote either the set of real numbers, or the set of complex numbers. Sup-
pose ai j , bi ∈ K with i = 1, . . . , n and j = 1, . . . , p. Then the following system of
equations is called a system of linear equations:
a_{11} x_1 + \dots + a_{1p} x_p = b_1
    ⋮
a_{n1} x_1 + \dots + a_{np} x_p = b_n

In matrix notation, with A = (a_{ij}), x = (x_1, \dots, x_p)^\top and b = (b_1, \dots, b_n)^\top, this system reads

Ax = b.   (2.26)
Let A^e_{(n × (p+1))} be the extended matrix, i.e. the matrix whose last column is the vector
of constants b, and otherwise is the same as A. Then (2.26) can be solved if and only
if the rank of A is the same as the rank of Ae . In this case b can be represented by a
linear combination of the columns of A. If (2.26) can be solved and the rank of A
equals n = p, then there exists a unique solution. Otherwise (2.26) might have no
solution or infinitely many solutions, see Greub (1975).
The Gaussian algorithm, which transforms the system of equations by elementary
transformations to upper triangular form, is frequently applied. The solution can be
computed by back-substitution. The Gaussian algorithm decomposes A into the
matrices L and U, the so-called LU decomposition (see Braun and Murdoch (2007)
for further details). L is a lower triangular matrix and U is an upper triangular matrix
with the following form:
L = \begin{pmatrix} 1 & 0 & \dots & 0 \\ l_{21} & 1 & \ddots & \vdots \\ \vdots & & \ddots & 0 \\ l_{n1} & l_{n2} & \dots & 1 \end{pmatrix}, \qquad U = \begin{pmatrix} u_{11} & u_{12} & \dots & u_{1n} \\ 0 & u_{22} & \dots & u_{2n} \\ \vdots & & \ddots & \vdots \\ 0 & \dots & 0 & u_{nn} \end{pmatrix}.
Ax = LU x = b. (2.27)
Now the system in (2.26) can be solved in two steps. First define U x = y and solve
Ly = b for y by forward substitution. Then solve U x = y for x by back-substitution.
In R, the function solve(A,b) uses the LU decomposition to solve a system of
linear equations with the matrix A and the right side b. Another method that can be
used in R to solve a system of linear equations is the QR decomposition, where the
matrix A is decomposed into the product of an orthogonal matrix Q and an upper
triangular matrix R. One uses the function qr.solve() to compute the solution
of a system of linear equations using the QR decomposition. In contrast to the LU
decomposition, this method can be applied even if A is not a square matrix. The next
example shows how to solve a system of linear equations in R using solve().
Example 2.7 Solve the following system of linear equations in R with the Gaussian
algorithm and back-substitution,
Ax = b,
A = \begin{pmatrix} 2 & -1/2 & -1/2 & 0 \\ -1/2 & 0 & 2 & -1/2 \\ -1/2 & 2 & 0 & -1/2 \\ 0 & -1/2 & -1/2 & 2 \end{pmatrix}, \qquad b = (0, 3, 3, 0)^\top,

A^e = \begin{pmatrix} 2 & -1/2 & -1/2 & 0 & 0 \\ -1/2 & 0 & 2 & -1/2 & 3 \\ -1/2 & 2 & 0 & -1/2 & 3 \\ 0 & -1/2 & -1/2 & 2 & 0 \end{pmatrix}.
The above system of linear equations is solved first by hand and then the example
is computed in R for verification. This system of linear equations is not difficult to
solve with the Gaussian algorithm. First, one finds the upper triangular matrix
U^e = \begin{pmatrix} 2 & -1/2 & -1/2 & 0 & 0 \\ 0 & 15/8 & -1/8 & -1/2 & 3 \\ 0 & 0 & 28/15 & -8/15 & 16/5 \\ 0 & 0 & 0 & 12/7 & 12/7 \end{pmatrix}.
Second, one uses back-substitution to obtain the final result, that (x1 , x2 , x3 , x4 ) =
(1, 2, 2, 1) . Then the solution of this system of linear equations in R is presented.
Two parameters are required: the coefficient matrix A and the vector of constraints b.
> A = matrix( # coefficient matrix
+ c( 2, -1/2, -1/2, 0,
+ -1/2, 0, 2, -1/2,
+ -1/2, 2, 0, -1/2,
+ 0, -1/2, -1/2, 2),
+ nrow = 4, ncol = 4, byrow = TRUE)
> b = c(0, 3, 3, 0) # vector of constraints
> solve(A, b) # solve the system Ax = b
[1] 1 2 2 1
The manually found solution for the system coincides with the solution found in R.
There are many different numerical methods for solving systems of nonlinear equa-
tions. In general, one distinguishes between gradient and non-gradient methods. In
the following, the Newton method, or the Newton–Raphson method, is presented. To
get a better illustration of the idea behind the Newton method, consider a continuously differentiable function F : R → R, where one tries to find x* with F(x*) = 0 and ∂F(x)/∂x |_{x = x*} ≠ 0. Start by choosing a starting value x₀ ∈ R and define the tangent line

p(x) = F(x_0) + \frac{\partial F(x)}{\partial x}\Big|_{x = x_0} (x − x_0).   (2.28)

Then the tangent line p(x) is a good approximation to F in a sufficiently small neighbourhood of x₀. If ∂F(x)/∂x |_{x = x_0} ≠ 0, the root x₁ of p in (2.28) can be computed as follows:

x_1 = x_0 − \frac{F(x_0)}{\partial F(x)/\partial x |_{x = x_0}}.
With the new value x1 , the rule can be applied again. This procedure can be applied
iteratively and under certain theoretical conditions the solution should converge to
the actual root. Figure 2.2 demonstrates the Newton method for f (x) = x 2 − 4 with
the starting value x0 = 6.
Figure 2.2 was produced using the function newton.method(f, init,
...) from the package animation, where f is the function of interest and init is
the starting value for the iteration process. The function provides an illustration of the
iterations in Newton’s method (see help(newton.method) for further details).
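A call reproducing Fig. 2.2 could look as follows (a sketch; FUN is the objective function and init the starting value):
> require(animation)
> newton.method(FUN = function(x){x^2 - 4},  # function from Fig. 2.2
+   init = 6)                                # starting value x0 = 6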
The function uniroot() searches in an interval for a root of a function and returns
Fig. 2.2 Illustration of the iteration steps of Newton’s method to find the root of f (x) = x 2 − 4
with x0 = 6. BCS_Newton
only one root, even if several roots exist within the interval. At the boundaries of the
interval, the sign of the value of the function must change.
> f = function(x){ # objective function
+ -x^4 - cos(x) + 9 * x^2 - x - 5
+ }
> uniroot(f,
+ interval = c(0, 2))$root # root in [0, 2]
[1] 0.8913574
> uniroot(f,
+ interval = c(-3, 2))$root # root in [-3, 2]
[1] -2.980569
> uniroot(f,
+ interval = c(0, 3))$root # f(0) and f(3) negative
Error in uniroot(f, c(0, 3)) :
Values of f() at the boundaries have same sign
Definition 2.9 If the domain M is a metric space, then f is said to have a local maximum at x_opt if there exists some ε > 0 such that f(x_opt) ≥ f(x) for all x in M within a distance ε of x_opt. Analogously, the function has a local minimum at x_opt if f(x_opt) ≤ f(x) for all x in M within a distance ε of x_opt.
Maxima and minima are not always unique. Consider the function sin(x), which
has global maxima f (xmax ) = 1 and global minima f (xmin ) = −1 for every xmax =
(0.5 + 2k)π and xmin = (−0.5 + 2k)π for k ∈ Z.
Example 2.8 The following function possesses several local maxima, local minima,
global maxima and global minima.
Fig. 2.3 3D plot of the function (2.29) with maxima and minima depicted by points.
BCS_Multimodal
> require(stats)
> f = function(x){-(x - 3)^2 + 10} # function
> optimize(f, # objective function
+ interval = c(-10, 10), # interval
+ tol = 0.0001, # level of the tolerance
+ maximum = TRUE) # to find maximum
$maximum
[1] 3
$objective
[1] 10
The argument tol defines the convergence criterion for the results. The function reaches its global maximum at x_opt = 3, which is easily derived by solving the first-order condition −2x_opt + 6 = 0 for x_opt and computing the value f(x_opt). For a maximum at x_opt, one should have ∂²f(x)/∂x² < 0, and here ∂²f(x)/∂x² |_{x = x_opt} = −2. Therefore x_opt = 3, which is verified in R with the code from above.
Nelder–Mead method
This method was proposed in Nelder and Mead (1965) and is applied frequently in
multivariate unconstrained optimisation problems. It is a direct method, where the
computation does not use gradients. The main idea of the Nelder–Mead method is
briefly explained below and a graph for a two-dimensional input case is shown in
Fig. 2.4.
1. Choose x₁, x₂, x₃ such that f(x₁) < f(x₂) < f(x₃) and set the tolerances ε_x and/or ε_f.
2. Stop if ‖x_i − x_j‖ < ε_x and/or |f(x_i) − f(x_j)| < ε_f, for i ≠ j, i, j ∈ {1, 2, 3}, and set x_min = x₁.
3. Else, compute z = ½(x₁ + x₂) and d = 2z − x₃.
   If f(x₁) < f(d) < f(x₂) ⇒ x₃ = d.
   If f(d) ≤ f(x₁), compute k = 2d − z.
      If f(k) < f(x₁) ⇒ x₃ = k.
      Else, x₃ = d.
   If f(x₃) > f(d) ≥ f(x₂) ⇒ x₃ = d.
   Else, compute t = [t | f(t) = min{f(t₁), f(t₂)}], where t₁ = ½(x₃ + z) and t₂ = ½(d + z).
      If f(t) < f(x₃) ⇒ x₃ = t.
      Else, x₃ = s = ½(x₁ + x₃) and x₂ = z.
4. Return to step 2.
In general, the Nelder–Mead algorithm works with more than three initial guesses.
The starting values xi are allowed to be vectors. In the iteration procedure one tries
to improve the initial guesses step by step. The worst guess x3 will be replaced by
better values until the convergence criterion for the values f of the function or the
arguments x of the function is met. Next we will give an example of how to use the
Nelder–Mead method to find extrema of a function in R (Fig. 2.5).
Fig. 2.4 Algorithm graph for the Nelder–Mead method. The variables x1 , x2 and x3 are the search
region at the specific iteration step. All other variables, d, k, s, t1 and t2 , are possible updates for
one xi
Example 2.10 The function to be minimized is the Rosenbrock function, which has
an analytic solution with global minimum at (1, 1) and a global minimum value
f (1, 1) = 0.
> require(neldermead)
> f = function(x){
+ 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2 # Rosenbrock function
+ }
> fNM = fminsearch(fun = f,
+ x0 = c(-1.2, 1), # starting point
+ verbose = FALSE)
> neldermead.get(fNM, key ="xopt") # optimal x-values
[,1]
[1,] 1.000022
[2,] 1.000042
> neldermead.get(fNM, key ="fopt") # optimal function value
[1] 8.177661e-10
The computation above shows that the numerical solution obtained by the Nelder–Mead method is close to the analytical solution of the Rosenbrock function (2.31). The errors of the numerical solution are negligibly small.
BFGS method
This frequently used method for multivariate optimisation problems was proposed
independently in Broyden (1970), Fletcher (1970), Goldfarb (1970) and Shanno
(1970). BFGS stands for the first letters of each author, in alphabetical order. The
main idea of this method originated from Newton’s method, where the second-order
Taylor expansion for a twice differentiable function f : Rn → R at x = xi ∈ Rn is
employed, such that
f(x) = f(x_i) + ∇f(x_i)^\top q + \frac{1}{2} q^\top H(x_i) q,
Fig. 2.5 Plot for the Rosenbrock function with its minimum depicted by a point.
BCS_Rosenbrock
where q = x − x_i, ∇f(x_i) is the gradient of f at the point x_i, and H(x_i) is the Hessian matrix. Employing the first-order condition, one obtains

q = x − x_i = −H^{-1}(x_i) ∇f(x_i),

which leads to the iteration

x_{i+1} = x_i − H^{-1}(x_i) ∇f(x_i).
The recursion will converge quadratically to the optimum. The problem is that New-
ton’s method requires the computation of the exact Hessian at each iteration, which
is computationally expensive. Therefore, the BFGS method overcomes this disad-
vantage with an approximation of the Hessian’s inverse obtained from the following
optimisation problem,
The weighted Frobenius norm, denoted by · W , and the matrix W are, respectively,
Example 2.11 Here, the BFGS method is used to minimise the Rosenbrock function
(2.31) using optimx package (see Nash and Varadhan 2011).
> require(optimx)
> f = function(x){100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2}
> fBFGS = optimx(fn = f, # objective function
+ par = c(-1.2, 1), # starting point
+ method ="BFGS") # optimisation method
> print(data.frame(fBFGS$p1, fBFGS$p2, fBFGS$value))
fBFGS.p1 fBFGS.p2 fBFGS.value
1 0.9998044 0.9996084 3.827383e-08 # minimum
The BFGS method computes the minimum value of the function (2.31) to be 3.83e-08 at the point (0.9998, 0.9996). The outputs fevals = 127 and gevals = 38 give the number of calls of the objective function and of the gradient, respectively. These values are close to the exact solution x_opt = (1, 1)^\top and f(x_opt) = 0.
Conjugate gradient method
The conjugate gradient method was proposed in Hestenes and Stiefel (1952) and is
widely used for solving symmetric positive definite linear systems. A multivariate
unconstrained optimisation problem can also be solved with the conjugate gradient method. The main idea behind this method
is to use iterations to approach the optimum of the linear system.
α_i = \frac{r_i^\top r_i}{p_i^\top A p_i},
x_{i+1} = x_i + α_i p_i,
r_{i+1} = r_i − α_i A p_i,
β_i = \frac{r_{i+1}^\top r_{i+1}}{r_i^\top r_i},
p_{i+1} = r_{i+1} + β_i p_i.
> require(optimx)
> f = function(x){100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2}
> fCG = optimx(fn = f, # objective function
+ par = c(1.2, 1), # initial guess (x_0)
+ control = list(reltol = 10^-7), # relative tolerance
+ method ="CG") # method of optimisation
> print(data.frame(fCG$p1, fCG$p2, fCG$value)) # minimum
fCG.p1 fCG.p2 fCG.value
1 1.030077 1.061209 0.0009036108
For the Rosenbrock function, the conjugate gradient method delivers the largest errors compared to the Nelder–Mead and BFGS methods. All numerical methods applied to optimise a function find the true solution only approximately. The examples above show how the choice of method can influence the accuracy of the result. It is worth mentioning that in the latter case we changed the initial guess, as the function failed with the starting value used for the BFGS method.
Constrained optimisation
Constrained optimisation problems can be categorised into two classes in terms of the linearity of the objective function and the constraints. A linear programming
(LP) problem has a linear objective function and linear constraints, otherwise it is a
nonlinear programming problem (NLP).
LP is a method to find the solution to an optimisation problem with a linear objec-
tive function, under constraints in the form of linear equalities and linear inequalities.
It has a feasible region defined by a convex polyhedron, which is a set made by the
intersection of finitely many half-spaces. These represent linear inequalities. The
objective of linear programming is to find a point in the polyhedron where the objec-
tive function reaches a minimum or maximum value. A representative LP can be
expressed as follows:
arg max_x  a^\top x,
subject to:  Cx ≤ b,
             x ≥ 0.
For the example in (2.33), the function from the package Rglpk (see Theussl 2013)
is used to compute the solution in R.
> require(Rglpk)
> Rglpk_solve_LP(obj = c(2, 4), # objective function
+ mat = matrix(c(3, 4), nrow = 1), # constrains coefficients
+ dir ="<=", # type of constrains
+ rhs = 60, # constrains vector
+ max = TRUE) # to maximise
$optimum # maximum
[1] 60
$status # no errors
[1] 0
The maximum value of the function (2.33) is 60 and occurs at the point (0, 15).
Fig. 2.6 Plot for the linear programming problem with the constraint hyperplane depicted by the
grid and the optimum by a point. BCS_LP
> require(stats)
> f = function(x){
+ sqrt(5 * x[1]) + sqrt(3 * x[2]) # objective function
+ }
> A = matrix(c(-3, -5), nrow = 1,
+ ncol = 2, byrow = TRUE) # coefficients matrix
> b = c(-10) # vector of constraints
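One possible way to solve (2.34) in R is constrOptim() from the stats package. The following lines are only a sketch (the solver behind the output discussed below may differ; constrOptim() reports the number of function calls in answer$counts):
> answer = constrOptim(theta = c(1, 1),  # feasible starting point
+   f = f, grad = NULL,                  # objective function, no gradient
+   ui = A, ci = b,                      # constraints: ui %*% x - ci >= 0
+   control = list(fnscale = -1))        # maximise instead of minimise
> answer$par                             # approx. (2.4511, 0.5294)
> answer$value                           # approx. 4.7610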
Fig. 2.7 Plot for the objective function with its constraint from (2.34) and the optimum depicted
by the point. BCS_NLP
The computation above shows that the maximum value of the function (2.34) is 4.7610 and occurs at the point (2.4511, 0.5294). A value of answer$function equal to 170 means that the objective function has been called 170 times.
Chapter 3
Combinatorics and Discrete Distributions
— Stéphane Mallarmé
In the second half of the nineteenth century, the German mathematician Georg Cantor
developed the greater part of today’s set theory. At the turn of the nineteenth and
twentieth centuries, Ernst Zermelo, Bertrand Russell, Cesare Burali-Forti and others
found contradictions in the nonrestrictive set formation: For every property there
is a unique set of individuals, which have that property, see Johnson (1972). This
so-called ‘naïve comprehension principle’ produced inconsistencies, illustrated by
the famous Russell paradox, and was therefore untenable. Ernst Zermelo in 1908
gave an axiomatic system which precisely described the existence of certain sets
and the formation of sets from other sets. This Zermelo–Fraenkel set theory is still
the most common axiomatic system for set theory. There are 9 axioms, amongst
others, that deal with set equality, regularity, pairing sets, infinity, and power sets.
Since these axioms are very theoretical, we refer the interested reader to Jech (2003).
Later, further axioms were added in order to be able to universally interpret all
mathematical objects or constructs, making set theory a fundamental discipline of
mathematics. It also plays a major role for computational statistics since it mostly
uses basic functions, which constitute set theoretical relations.
Most of the basic R objects containing several elements, such as an array, a matrix,
or a data frame, are sets.
After the creation of a set, the next step is to manipulate the set in useful ways. One
possible goal could be selecting a specific subset. A subset of a set M is another set
M1 whose elements a are also elements of M, i.e. a ∈ M1 implies a ∈ M. There
are several other relations besides the subset relation. The basic set operations are
union, intersection, difference, test for equality, and the operation ‘is-an-element-of’.
Table 3.1 contains definitions and the corresponding tools from the packages base
and sets discussed below. In order to use the functions provided by the package
sets, objects have to be defined as sets. All functions contained in base R can be
applied to vectors or matrices. One can use the relations from Table 3.1 to state the
following equations and properties, which are generally valid in set theory.
1. A ∪ ∅ = A, A ∩ ∅ = ∅;
2. A ∪ Ω = Ω, A ∩ Ω = A;
3. A ∪ Aᶜ = Ω, A ∩ Aᶜ = ∅;
4. (Ac )c = A;
5. Commutative property: A ∪ B = B ∪ A, A ∩ B = B ∩ A;
6. Associative property: (A ∪ B) ∪ C = A ∪ (B ∪ C), (A ∩ B) ∩ C = A ∩ (B ∩ C);
Table 3.1 Set relations and operations with the corresponding functions in the base and sets packages (notation, definition, base, sets):
• x ∈ A (x is an element of A): base is.element(x, A) or x %in% A; sets x %e% A
• x ∉ A (x is not an element of A): base !(x %in% A); sets !(x %e% A)
• A ⊆ B (each element of A is an element of B): base A %in% B; sets set_is_subset(A, B)
• A = B (A ⊇ B and A ⊆ B): base setequal(A, B); sets set_is_equal(A, B)
• ∅ (the empty set, {}): base x = c(); sets set()
• Ω (the universe): base ls()
• A ∪ B (union: {x : x ∈ A or x ∈ B}): base union(A, B); sets set_union(A, B) or A | B
• A ∩ B (intersection: {x : x ∈ A and x ∈ B}; if A ∩ B = ∅, then A and B are disjoint): base intersect(A, B); sets set_intersection(A, B) or A & B
• A \ B (set difference: {x : x ∈ A and x ∉ B}): base setdiff(A, B); sets A − B
• A △ B (symmetric difference: (A \ B) ∪ (B \ A), i.e. {x : either x ∈ A or x ∈ B}): sets set_symdiff(A, B) or A %D% B
• Aᶜ (the complement of a set A: Ω \ A): sets set_complement(A, Ω)
• P(A) (power set: the set of all subsets of A): sets set_power(A) or 2^A
7. Distributive property: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) , A ∩ (B ∪ C) =
(A ∩ B) ∪ (A ∩ C);
8. De Morgan’s Law: (A ∪ B)c = Ac ∩ B c and (A ∩ B)c = Ac ∪ B c ,
or, more generally, (∪i Ai )c = ∩i Aic and (∩i Ai )c = ∪i Aic .
The base package provides functions to perform most set operations, as shown in
the second column of Table 3.1. The results are given as an output vector or list. Note
that R is able to compare numeric and character elements. The output will be
given as a character vector, as in line 3 below.
> set1 = c(1, 2)                         # numeric vector
> set2 = c("1", "2", 3)                  # vector with strings
> setequal(set1, set2)                   # sets are not equal
[1] FALSE
> is.element(set2, c(2, 1))              # 1, 2 are elements of 2nd set
[1] TRUE TRUE FALSE
> intersect(set1, set2)                  # different element types
[1] "1" "2"
As there is no specific function in the base package for the symmetric difference, it can be obtained by combining the base functions union() and setdiff():
> A = 1:4                                # {1, 2, 3, 4}
> B = -3:3                               # {-3, -2, -1, 0, 1, 2, 3}
> union(setdiff(A, B), setdiff(B, A))    # symmetric difference set
[1] 4 -3 -2 -1 0
The symmetric difference set is the union of the difference sets. In the example above,
A and B have 1, 2, and 3 as their common elements. All other elements belong to the
symmetric difference set.
When working with basic R objects like lists, vectors, arrays, or data-frames,
using functions from the base package is appropriate. These functions, for example,
union(), intersect(), setdiff() and setequal(), apply as.vector
to the arguments. Applying operations on different types of sets, like a list and a
vector in the following example, does not necessarily lead to a problem.
> setlist = list(3, 4)                   # set of type list
> setvec1 = c(5, 6, 8, 20)               # set of type vector
> intersect(setlist, setvec1)            # no common elements
numeric(0)
> setvec2 = c("blue", "red", 3)          # set of type vector
> intersect(setlist, setvec2)            # common elements
[1] "3"
In the following example, the objects A and B are combined in the data frame AcB.
The union of a data frame AcB and another object M returns a list of all elements.
> AcB = data.frame(A = 1:3, B = 5:7)
> M = list(10, 15, 10)
> union(AcB, M)                          # union returns a list for data frames
[[1]]
[1] 1 2 3
[[2]]
[1] 5 6 7
[[3]]
[1] 10
[[4]]
[1] 15
> intersect(AcB, M)
list()
> DcE1 = data.frame(D = c(1, 3, 2), E = c(5, 6, 7))
> intersect(AcB, DcE1)                   # should return both D and E
  E
1 5
2 6
3 7
> DcE2 = data.frame(D = c(1, 2, 3), E = c(5, 6, 7))
> intersect(AcB, DcE2)                   # should return both D and E
  E
1 5
2 6
3 7
Using vectors as sets has some drawbacks when working with data frames, as shown
for the intersections above. In the base package, the intersection of two data frames
with a common element returns the empty set if the elements are ordered or defined
differently, therefore the elements c(1, 2, 3, 4) and c(1, 3, 2, 4) as
well as c(1, 2, 3, 4) and 1:4 are treated as different sets. When using the
sets function set(), the order becomes unimportant.
The package sets was specifically created by David Meyer and others for appli-
cations concerning set theory. This package provides basic operations for ordinary
sets and also for generalizations like fuzzy sets and multisets. The objects created
with functions from this package, e.g. by using the function set(), can be viewed
as real set objects, in contrast to vectors or lists, for example. This is visible in the
output, since sets are denoted by curly brackets.
A data frame can be viewed as a nested set and should be created with several
set() commands. Note that these functions in R require the sets package.
> require(sets)
> A = set(1, 2, 3)                       # set A
> B = as.set(c(5, 6, 7))                 # set B
> set(A, B)                              # set AcB from above
{{1, 2, 3}, {5, 6, 7}}
The as.set() function is used above to convert an array object into a set object. For
objects of the class set, it is recommended to use the methods from the same package,
like set_union and set_intersection or, more simply, the symbols & and |.
In the following, some of these functions, as presented in Table 3.1, are used on two
simple sets.
> A = set(1, 2, 3)                       # set A
> B = set(5, 6, 7, "5")                  # set B
> B                                      # ordered and distinct
{"5", 5, 6, 7}
> A | B                                  # union set
{"5", 1, 2, 3, 5, 6, 7}
> A & B                                  # intersection set
{}
> A - B                                  # set difference
{1, 2, 3}
> A %D% B                                # symmetric difference
{"5", 1, 2, 3, 5, 6, 7}
> summary(A %D% B)                       # summary of the symmetric difference
A set with 7 elements.
> set_is_empty(A)                        # check for empty set
[1] FALSE
Besides the functions in Table 3.1, the basic predicate functions ==, ! =, <, <=,
defined for equality and subset, can be used intuitively for the set objects. For vectors
or lists, however, these functions are executed element by element, so the objects
must have the same length.
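For example (a small sketch):
> A = set(1, 2); B = set(1, 2, 3)
> A <= B                                 # A is a subset of B
[1] TRUE
> A == B                                 # test for equality
[1] FALSE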
The set_similarity function computes the fraction of the number of elements in the intersection of two sets over the number of elements in the union.
In computational statistics, one often needs to work with sets and compute such
properties as the mean and median, see Sect. 5.1.5. Such statistics can be calculated
for set objects similarly to other R objects. Applying the functions sum(), mean()
and median() to a set, R will try to convert the set to a numeric vector, e.g. 5
defined as a character is converted to a numeric 5 in the example below.
> A = set(1, 2, 3); B = set(5, 6, 7, "5")
> A + B                                  # union of A and B
{"5", 1, 2, 3, 5, 6, 7}
> sum(c("5", 1, 2, 3, 5, 6, 7))
Error in sum(c("5", 1, 2, 3, 5, 6, 7)):
  invalid 'type' (character) of argument
> sum(A + B)                             # sum of union set A and B
[1] 29                                   # "5" becomes numeric
> A * B                                  # Cartesian product
{(1, 5), (1, 6), (1, 7), (1, "5"), (2, 5), (2, 6), (2, 7), (2, "5"),
 (3, 5), (3, 6), (3, 7), (3, "5")}
Furthermore, in the sets package, the calculation of the closure and reduction of
sets is implemented by means of the function closure().
> D = set(set(1), set(2), set(3)); D
{{1}, {2}, {3}}
> closure(D)                             # set of all subsets
{{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}
In contrast to ordinary sets, which have their elements in a sorted and distinct form, a generalised set keeps every element, even if there are redundant elements, but still in a sorted way. Generalised sets allow keeping
more information or characteristics of a set and include two special cases: fuzzy
sets and multisets. Every generalised set can be created using the gset() function
and all methods in this regard begin with the prefix gset_. Before constructing a
generalised set, it is important to think about its characteristics, like the membership
of an element, which differ for fuzzy sets and multisets.
Membership is described by a function f that maps each element of a set A to a
membership number:
• For ordinary sets, each element is either in the set or not, i.e. f : A → {0, 1};
• For fuzzy sets, the membership function maps into the unit interval, f : A →
[0, 1];
• For multisets, f : A → N.
Multisets allow each element to appear more than once, so that in statistics, multisets
occur as frequency tables. Since in base R there is no support for multisets, the
sets package is a good solution. In the example below, the multiset ms1 contains four elements, and each distinct element has a certain membership value. The absolute cardinality of a set, i.e. the number of elements in it, can be obtained by the function gset_cardinality().
> require(sets)
> ms1 = gset(c("red", rep("blue", 3)))  # multiset
> ms1                                   # repeated elements retained
{"blue", "blue", "blue", "red"}
> gset_cardinality(ms1)                 # number of elements
[1] 4                                   # cardinality of ms1
> fs1 = gset(c(1, 2, 3),                # fuzzy set
+            membership = c(0.2, 0.6, 0.9))
> fs1
{1 [0.2], 2 [0.6], 3 [0.9]}
> plot(fs1)                             # left plot in Fig. 3.1
> B = c("x", "y", "z", "z", "z", "x")   # create multiset from R object
> table(B)
B
x y z
2 1 3
> ms2 = as.gset(B); ms2                 # converts vector to a multiset
{"x" [2], "y" [1], "z" [3]}
> gset_cardinality(ms2)                 # cardinality of ms2
[1] 6
> ms3 = gset(c("x", "y", "z"),          # create multiset via gset
+            membership = c(2, 1, 3)); ms3
{"x" [2], "y" [1], "z" [3]}
> gset_cardinality(ms3)
[1] 6
> plot(ms3, col = "lightblue")          # right plot in Fig. 3.1
By employing the repeat function rep(x, times) with times = 2, the mem-
bership is doubled.
> ms4 = rep ( ms3, times = 2)
> ms4
{ " x " [4] , " y " [2] , " z " [6]}
> gset_cardinality( ms4 )
[1] 12
The function set_combn(set, length) from the sets package creates a set
with subsets of the specified length: it consists of all combinations of the elements
in the specified set (Fig. 3.1).
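For illustration, a brief sketch of set_combn() applied to a three-element set; the result should be the set of all two-element subsets:
> A = set(1, 2, 3)
> set_combn(A, 2)                     # all subsets of size 2
{{1, 2}, {1, 3}, {2, 3}}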
Sometimes a function has to be applied to all pairwise combinations of the elements of two sets. The function set_outer(set1, set2, operation) applies a binary operator, like sum or product, to all pairs of elements of the two sets and returns a matrix of dimension length(set1) times length(set2), as illustrated below. outer() can also be used for vectors and matrices in R.
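As a sketch of set_outer(), assuming two small numeric sets (not the sets A and B with the character element used above), the matrix of pairwise sums should look roughly as follows:
> A = set(1, 2, 3); B = set(5, 6, 7)
> set_outer(A, B, "+")                # apply "+" to every pair of elements
  5 6  7
1 6 7  8
2 7 8  9
3 8 9 10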
Fig. 3.1 R plot of a fuzzy set (left) and a multiset (right). BCS_FuzzyMultiSets
Users of base R can get wrong or confusing results when applying basic set
operations like union and intersection. Indexable structures, like lists and vectors,
are interpreted as sets. For set theoretical applications, this imitation has not been
sufficiently elaborated: basic operations such as the Cartesian product and power
set are missing. The base package in R performs a type conversion via match(),
which might in some cases lead to wrong results. In most cases it makes no difference
whether one uses a = 2 or a = 2L, where the latter defines a directly as an integer
by the suffix L. But to save memory in computationally intensive code, it is useful to define a directly as having integer type.
> y = (1:100) * 1                     # option 1 to define vector y
> typeof(y)
[1] "double"
> object.size(y)                      # memory used by this object
840 bytes
> yL = (1:100) * 1L                   # option 2 to define vector y
> typeof(yL)
[1] "integer"
> object.size(yL)                     # memory used by this object
440 bytes
If one tries to check one's code for constants not defined as integers, the match() function will not distinguish between 1 and 1L.
The sets package avoids such steps by the use of set classes for ordinary, general,
and customised sets, as presented in Fig. 3.2. Customised sets are an extension of
generalized sets and are implemented in R via the function cset. With the help of
customisable sets, one is able to define how elements in the sets are matched through
the argument matchfun.
> setA = set(as.numeric(1))           # set with numeric 1
> 1L %e% setA                         # 1L is not an element of A
[1] FALSE
> csetA = cset(as.numeric(1),         # cset with match function
+              matchfun = match)
> 1L %e% csetA                        # 1L is now an element of A
[1] TRUE
The basic R function match considers the integer one 1L to be the same as the
numeric 1. With the help of customisable sets, users of R are able to specify which
elements are considered to be the same. This is very useful for data management.
When working with data that is subject to random variation, the theory behind this
probabilistic situation becomes important. There are two types of experiments: deter-
ministic and random. We will focus here on nondeterministic processes with a finite
number of possible outcomes.
A random trial (or experiment) yields one of the distinct outcomes that altogether form the sample or event space Ω. All possible outcomes constitute the universal event. Subsets of Ω are called events, e.g. Ω itself is the universal event. Examples of experiments include rolling a die with the sample space Ω = {{1}, {2}, {3}, {4}, {5}, {6}}; another is tossing a coin with only two possible outcomes: heads (H) and tails (T).
A combination of several rolls of a die or tosses of a coin leads to more possible results, such as tossing a coin twice, with the sample space Ω = {{H, H}, {H, T}, {T, H}, {T, T}}. Generally, the combination of several different experiments yields
a sample space with all possible combinations of the single events. If, for instance,
one needs two coins to fall on the same side, then the favored event is a set of two
elements: {H, H } and {T, T }.
The prob package, which will be used in the following, has been developed
by G. Jay Kerns specifically for probabilistic experiments. It provides methods for
elementary probability calculation on finite sample spaces, including counting tools,
defining probability spaces discussed later, performing set algebra, and calculating
probabilities.
The situation of tossing a coin twice is considered in the following code, for which
the package prob is needed. The functions used will be explained shortly.
> require(prob)
> ev = tosscoin(2)                    # sample space for 2 coin tosses
> probspace(ev)                       # probabilities for events
  toss1 toss2 probs
1     H     H  0.25
2     T     H  0.25
3     H     T  0.25
4     T     T  0.25
The interesting information is how likely an event is. Each event has a probability assigned, and this probability is included as the last column of the R output in the example above. The values quantify our chances of observing the corresponding outcome when tossing a coin twice.
Comparable to the set theory in Sect. 3.1, one can apply operations like union or
intersection to events. The event probability follows the axioms of probability, which
are shortly summarised in the following.
• P(·) is a probability function that assigns to each event A in the sample space a
real number P(A), which lies between zero and one. P(A) is the probability that
the event A occurs. The probability of the whole sample space is equal to one,
which means that it occurs with certainty.
• P(A ∪ B) = P(A) + P(B) if A and B are disjoint. In general,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
P(Ω) = 1 and P(∅) = 0.
The probability of the complementary event and that of the difference between two
sets are given by
• P(Ac ) = 1 − P(A);
• P(A \ B) = P(A) − P(A ∩ B).
3.2.1 R Functionality
The following functions, which generate common and elementary experiments, can
be used to set up a sample space.
• urnsamples(x, size, replace = FALSE, ordered = FALSE, …),
• tosscoin(ncoins, makespace = FALSE),
• rolldie(ndies, nsides = 6, makespace = FALSE),
• cards(jokers = FALSE, makespace = FALSE),
• roulette(european = FALSE, makespace = FALSE).
If the argument makespace is set TRUE, the resulting data frame has an additional
column showing the (equal) probability of each single event. In the simplest case,
the probability of an event can be computed as the relative frequency. Some methods
for working with probabilities and random samples from the prob and the base
packages are the following.
• probspace(outcomes, probs) forms a probability space,
• prob(prspace, event = NULL) gives the probability of an event as its relative
frequency,
• factorial(n) is the mathematical operation n! for a non-negative integer n,
• choose(n, k) gives the binomial coefficient (n choose k) = n! / {k!(n − k)!}.
> require(prob)
> ev = urnsamples(c("bus", "car", "bike", "train"),
+                 size    = 2,
+                 ordered = TRUE)
> probspace(ev)                       # probability space
      X1    X2      probs
1    bus   car 0.08333333
2    car   bus 0.08333333
3    bus  bike 0.08333333
4   bike   bus 0.08333333
5    bus train 0.08333333
6  train   bus 0.08333333
7    car  bike 0.08333333
8   bike   car 0.08333333
9    car train 0.08333333
10 train   car 0.08333333
11  bike train 0.08333333
12 train  bike 0.08333333
> Prob(probspace(ev), X2 == "bike")   # 3 of 12 cases = 1/4
[1] 0.25
> factorial(3)                        # 3 * 2 * 1
[1] 6
> choose(n = 10, k = 2)               # 10! / (2! * 8!) = 10 * 9 / 2
[1] 45
In R, the sample spaces can be represented by data frames or lists and may contain
empirical or simulated data. Random samples, including sampling from urns, can
be drawn from a set with the R base method sample(). The sample size can be
Table 3.2 Number of all possible samples of size k from a set of n objects. The sampling method is specified by replacement and order
                        Ordered              Unordered
With replacement        n^k                  (n + k − 1)! / {k!(n − 1)!}
Without replacement     n! / (n − k)!        (n choose k) = n! / {k!(n − k)!}
chosen as the second argument in the function and the type of sampling can be either
with or without replacement:
• sampling with replacement:
sample(x, size = n, replace = TRUE, prob = NULL),
• sampling without replacement:
sample(x, n).
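For illustration, a minimal sketch of both calls (the output is random and therefore not shown):
> x = 1:5
> sample(x, size = 3, replace = TRUE)  # sampling with replacement
> sample(x, 3)                         # sampling without replacement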
In general, there are four types of sampling, regarding replacement and order, which
are briefly presented in the following. The calculation rules for the number of possible
draws for a sample depend on the assumptions about the particular situation. All four
cases are outlined in Table 3.2.
In R the function nsamp is able to calculate the possible numbers of samples
drawn from an urn. The following code shows how all four cases from Table 3.2 are
applied when n = 10 and k = 2.
> r e q u i r e ( prob )
> nsamp(10 , 2 , replace = TRUE, ordered = TRUE ) # 10^2
[1] 100
> nsamp(10 , 2 , replace = TRUE, ordered = FALSE ) # 11 ! / (2 ! * 9 ! )
[1] 55
> nsamp(10 , 2 , replace = F A L S E , ordered= TRUE ) # 10 ! / 8 !
[1] 90
> nsamp(10 , 2 , replace = F A L S E , ordered = FALSE ) # 10 ! / (2 ! * 8 ! )
[1] 45
Ordered Sample
For several applications, the order of k experimental outcomes is decisive. Consider,
for example, the random selection of natural numbers. For the random selection of
a telephone number, both the replacement and the order of the digits are important.
The method urnsamples() from the prob package yields all possible samples
according to the sampling method. Consider the next example, where three elements
are taken from an urn of eight elements. Sampling with replacement is conducted
first, followed by sampling without replacement for comparison. Clearly the number
of samples is smaller if we do not replace the elements. This number can also be
computed with the counting tool nsamp() introduced in the last section.
> require(prob)
> urn1 = urnsamples(x = 1:3,          # all elements
+                   size    = 2,      # number of selected elements
+                   replace = TRUE,   # with replacement
+                   ordered = TRUE)   # ordered
> urn1                                # all possible draws
  X1 X2
1  1  1
2  2  1
3  3  1
4  1  2
5  2  2
6  3  2
7  1  3
8  2  3
9  3  3
> dim(urn1)                           # dimension of the matrix
[1] 9 2
> urn2 = urnsamples(x = 1:3,
+                   size    = 2,
+                   replace = FALSE,  # without replacement
+                   ordered = TRUE)   # ordered
> dim(urn2)                           # dimension of the matrix
[1] 6 2
Unordered Sample
In the simple case of drawing balls from an urn, the order in which the balls are
drawn is rarely relevant. For a lottery, for example, it is only relevant whether a
certain number is included in the winning sample or not. When conducting a survey
or selecting participants, the order of the selection is generally irrelevant. Having
created the sample space, a sample can be drawn, which leaves the question about
the replacement. The researcher has to decide what fits best in this situation.
Note that in an unordered sample without replacement, the number of possible
samples is given by the binomial coefficient. Using the formula from Table 3.2, the
sample size can be checked and the probability of drawing a certain sample can be
calculated.
> require(prob)
> urn3 = urnsamples(x = 1:3,
+                   size    = 2,
+                   replace = TRUE,   # with replacement
+                   ordered = FALSE)  # not ordered
> dim(urn3)                           # dimensions of the matrix
[1] 6 2
> urn4 = urnsamples(x = 1:3,
+                   size    = 2,
+                   replace = FALSE,  # without replacement
+                   ordered = FALSE)  # not ordered
> urn4                                # all possible draws
  X1 X2
1  1  2
2  1  3
3  2  3
> probspace(urn4)                     # probability space
  X1 X2     probs
1  1  2 0.3333333
2  1  3 0.3333333
3  2  3 0.3333333
The probability of obtaining a certain pair of values is one over the number of
possible pairs. For the case without replacement and ignoring order, each sample has
the probability 1/3 ≈ 0.3333. This number together with all 3 possible samples is
given when applying the method probspace() to urn4.
Beside these simple experiments, it is also useful to know that the number of
subsets of a set of n elements is 2n . Furthermore, there are n! possible ways of
choosing all n elements and rearranging them. This is the same thing as the number
of permutations of n elements. In case the sample size is the same as the number
of elements and replace = FALSE, the sampling can be seen as a random per-
mutation. If the sample space consists of all combinations of a number of factors,
the function expand.grid() from the base package can be used to generate a
data frame containing all combinations of these factors. The example below shows
all combinations of two variables specifying colour and number.
> expand.grid(colour = c("red", "blue", "yellow"), nr = 1:2)
  colour nr
1    red  1
2   blue  1
3 yellow  1
4    red  2
5   blue  2
6 yellow  2
There are several ways to sample from a population. It matters for the number of
possible samples whether one arranges the elements or selects a subset from the
population. All different possibilities are illustrated in Fig. 3.3.
Fig. 3.3 The possible sample numbers for an urn model with n elements: arrangement of n different elements (n!), arrangement with r groups of g_j identical elements (n! / ∏_{j=1}^{r} g_j!), and selection of k out of n elements (ordered without replacement n!/(n − k)!, ordered with replacement n^k, unordered without replacement (n choose k), unordered with replacement ((n + k − 1) choose k)). BCS_SamplesDiagram
The examples above are very specific and restricted to a particular sample space.
Now we can address sampling from a more general perspective. Again, some ran-
dom selection mechanism is involved: the theory behind this is called probabilistic
sampling. Specific types of sampling are: simple random sampling, the equal prob-
ability selection method, probability-proportional-to-size, and systematic sampling.
Details can be found in Babbie (2013). In real applications, each member of a pop-
ulation can have different characteristics, i.e. the population is heterogeneous, and
one needs a sample large enough to study the characteristics of the whole population.
The idea is to find a sample which describes the population well. Yet, there is always
a risk of biased samples if the sampling method is not adequate, that is to say, if the
set of selected members is not representative of the population.
In the following example, it is assumed that a population consists of women and
men in a ratio of 1 : 1. In order to test this assumption about the ratio, a sample is
drawn.
> # set.seed(18)                      # set the seed, see Chap. 9
> popul = data.frame(
+   gender = rep(c("f", "m"), each = 500),
+   grade  = sample(1:10, 1000, replace = TRUE))
> head(popul)                         # first 6 rows
  gender grade
1      f     9
2      f     8
3      f    10
4      f     1
5      f     1
6      f     6
> table(popul[, 1])                   # true proportion
  f   m
500 500
> table(sample(popul[, 1], 10))       # draw sample of 10
f m
3 7
In this example, a simple random sample was drawn, which was too small to capture
the true ratio. For more sophisticated sampling methods in R, the package sampling
can be used. It contains methods for stratified sampling, which divides the population
into subgroups (strata) and samples from each of them. The corresponding R function is strata(). Its argu-
ment, stratanames, specifies the variable that is used to identify the subgroups.
> require(sampling)
> strata(data, stratanames = NULL, size,
+        method = c("srswor", "srswr", "poisson", "systematic"),
+        pik, description = FALSE)
The two methods, srswor and srswr, denote simple random sampling without
and with replacement, respectively. In the example below, a sample of six persons
each is taken from the female and male students without replacement.
The function getdata() extracts data from a dataset according to a vector of
selected units or a sample data frame. Here, we use the sample data frame created
by the function strata() to extract the grades for the sample students from our
dataset.
A simple tool of analysis is the function aggregate(), which is used to calcu-
late summary statistics for subsets of data. It is applied below to calculate the mean
of the grades in the sample for each gender. Note that the subsets need to be given
as a list.
> require(sampling)
> st = strata(popul,
+             stratanames = "gender",  # take 6 samples of each gender
+             size        = c(6, 6),
+             method      = "srswor")
> dataX = getdata(popul, m = st)       # extract the sample
> dataX
    grade gender ID_unit  Prob Stratum
98      8      f      98 0.012       1
114     5      f     114 0.012       1
288     1      f     288 0.012       1
392     5      f     392 0.012       1
411     7      f     411 0.012       1
421     6      f     421 0.012       1
532     2      m     532 0.012       2
619     9      m     619 0.012       2
667     7      m     667 0.012       2
771     3      m     771 0.012       2
952     1      m     952 0.012       2
968     3      m     968 0.012       2
> aggregate(dataX$grade,               # mean grade by gender
+           by  = list(dataX$gender),
+           FUN = mean)
  Group.1        x
1       f 5.333333
2       m 4.166667
To test whether these results support our expectations of equal grades for each gender,
we would need some functions for statistical testing discussed in Sect. 5.2.2. Applying
a t-test, we would find that the results are indeed supportive.
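For instance, one could run the two-sample t-test from base R directly on the extracted sample; this is only a sketch, since the resulting p-value depends on the randomly generated grades:
> t.test(grade ~ gender, data = dataX) # compare mean grades of f and m
A large p-value would indicate that the data do not contradict the hypothesis of equal mean grades.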
Definition 3.1 A real valued random variable (rv) X on the probability space (Ω, F, P) is a real valued function X(ω) defined on Ω, such that for every Borel
subset B of the real numbers
{ω : X (ω) ∈ B} ∈ F.
The probability function P assigns a probability to each event, for a detailed discus-
sion, see Ash (2008).
For the probabilistic experiment of tossing a fair coin, Ω = {H, T}, the rv X is defined as follows:
X = 1, if head H shows up,
X = 0, if tail T shows up.
There are two types of rvs: discrete and continuous. This distinction is very important
for their analysis.
Definition 3.2 An rv X is said to be discrete if the possible distinct values x j of X
are either countably infinite or finite.
The distribution of a discrete rv is described by its probability mass function f (x j )
and the cumulative distribution function F(x j ):
Definition 3.3 The probability mass function (pdf) of a discrete rv X is a function that returns the probability that the rv X is exactly equal to some value:
f(x_j) = P(X = x_j).
Definition 3.4 The cumulative distribution function (cdf) is defined for ordinally scaled variables (variables with a natural order) and returns the probability that the rv X is smaller than or equal to some value:
F(x_j) = P(X ≤ x_j).
The outcomes of tossing a fair coin can be mapped by a discrete rv with finitely many distinct values. Randomly selecting a person and counting his or her descendants can be described by a discrete rv with countably infinitely many distinct values.
An rv X has an expectation E X and a variance Var X (also called the first moment
and the second central moment of X , respectively). The definition of these moments
differs for discrete and continuous rvs.
Definition 3.5 Let X be a discrete rv with distinct values {x1 , . . . , xk } and probability
function P(X = x j ) ∈ [0, 1] for j ∈ {1, . . . , k}. Then the expectation (expected
value) of X is defined to be
E X = Σ_{j=1}^{k} x_j P(X = x_j).   (3.1)
For infinitely many possible outcomes, the finite sum becomes an infinite sum. The
expectation is not defined for every rv. An example for continuous rvs is the Cauchy
distribution, introduced in Sect. 4.5.2.
Definition 3.6 Let X be a discrete rv with distinct values {x1 , . . . , xk } and probability
function P(X = x j ) ∈ [0, 1] for j ∈ {1, . . . , k}. Then the variance of X is defined
to be
Var X = E(X − E X)² = Σ_{j=1}^{k} (x_j − E X)² P(X = x_j).   (3.2)
As for the expectation, the variance is not defined for every rv. The variance measures
the expected dispersion of an rv around its expected value. Deterministic variables
have a variance equal to zero.
Definition 3.7 An rv X is said to be continuous if the possible distinct values x j are
uncountably infinite.
For a continuous rv, the probability density function (pdf) describes its distribution (see Definition 4.2). Randomly selecting a person and measuring his or her weight is a typical example of a probabilistic experiment which can be described by a continuous rv.
In the following, the most prominent discrete rvs and their probability mass func-
tions are introduced. Continuous rvs and their properties are covered in Chap. 4.
One of the basic probability distributions is the binomial. Examples of this distrib-
ution can be observed in daily life: whether we are tossing a coin to obtain heads
or tails, or trying to score a goal in a football game, we are dealing with a binomial
distribution.
For a single trial with success probability p, we have
P(X = 0) = 1 − p,
P(X = 1) = p,
and the rv X is said to have a Bernoulli distribution. The expected value and variance
of a Bernoulli rv are E X = p and Var X = p(1 − p).
To derive these results, just apply (3.1) and (3.2). The expectation is then derived
as follows:
E X = P(X = 0) · 0 + P(X = 1) · 1 = (1 − p) · 0 + p = p.
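The variance follows in the same way from (3.2):
Var X = E(X − E X)² = (0 − p)² (1 − p) + (1 − p)² p = p(1 − p).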
Example 3.1 Consider a box containing two red marbles and eight blue marbles.
Let X = 1 if the drawn marble is red and 0 otherwise. The probability of randomly
selecting one red marble and the expectation of X at one try is E X = P(X = 1) =
1/5 = 0.2. The variance of X is Var X = 1/5(1 − 1/5) = 4/25.
Table 3.3 Sample space and X for tossing a coin three times
Outcome HHH HHT HTH THH HTT THT TTH TTT
Value of X 3 2 2 2 1 1 1 0
P(X = 3) = |{HHH}| / |Ω| = 1/8;   P(X = 1) = |{HTT, THT, TTH}| / |Ω| = 3/8.
The same results can be computed from the binomial mass function below by
setting n = 3 and p = 1/2.
Definition 3.8 The binomial distribution is the distribution of an rv X for which
P(X = x) = (n choose x) p^x (1 − p)^{n−x},   x ∈ {0, 1, 2, . . . , n},   (3.3)
where n is the number of all trials, x is the number of successful outcomes, p is the probability of success, and (n choose x) is the number of possibilities of n outcomes leading to x successes and n − x failures.
The binomial distribution can be used to define the probability of obtaining exactly x
successes in a sequence of n independent trials. We will denote binomial distributions
by B(x; n, p). For the example above, the rv X follows the binomial distribution
B(x; 3, 0.5), or X ∼ B(3, 0.5). The expectation of a binomial rv is E X = μ = np,
which is the expected number of successes x in n trials. The variance is Var X =
σ 2 = np(1 − p).
Example 3.2 Continuing with marbles, we randomly draw ten marbles one at a time, putting each marble back before drawing again. What is the probability
of drawing exactly two red marbles?
Here, the number of draws n = 10 and we define getting a red marble as a success
with p = 0.2 and x = 2. Hence X ∼ B(10, 0.2) and
P(X = x) = (10 choose x) 0.2^x (1 − 0.2)^{10−x},   x = 0, 1, 2, . . . , 10.
Furthermore, one can use dbinom() to calculate the probability of each outcome
(Fig. 3.4).
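For instance, the probability of exactly two red marbles asked for in Example 3.2 is
> dbinom(2, size = 10, prob = 0.2)    # P(X = 2) for X ~ B(10, 0.2)
[1] 0.3019899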
Fig. 3.4 Binomial distribution with number of trials n = 10 and probability of success p = 0.2. BCS_Binhist
The cdf of the binomial distribution is implemented in R by pbinom().
Example 3.3 Continuing Example 3.2, consider the probability of drawing two or fewer red marbles. Let n = 10, p = 0.2 and x = 2, then
P(X ≤ 2) = F(2) = Σ_{x=0}^{2} (10 choose x) 0.2^x 0.8^{10−x} ≈ 0.678.
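This value can be verified with pbinom(), the binomial cdf:
> pbinom(2, size = 10, prob = 0.2)    # P(X <= 2) = F(2)
[1] 0.6777995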
Fig. 3.5 Binomial cumulative distribution function with n = 10, p = 0.2 and p = 0.6. BCS_Bincdf
The probability that three or four red marbles are drawn is given by (Fig. 3.5).
> pbinom(4, size = 10, prob = 0.2) - pbinom(2, size = 10, prob = 0.2)
[1] 0.289407                          # F(4) - F(2)
3.3.3 Properties
For large n, the standardised binomial rv converges in law to the standard normal distribution:
Z = (X − np) / √{np(1 − p)} →^L N(0, 1).
Fig. 3.6 Probability mass function of B(n, p) for different n and p. BCS_Binpdf
The correction for continuity requires adding or subtracting 0.5 from the values of the discrete binomial rv. Furthermore, the binomial distribution can approach other distributions in
the limit. If n → ∞ and p → 0 with finite np, the limit of the binomial distribution
is the Poisson distribution, see Sect. 3.6. A hypergeometric distribution can also be
obtained from the binomial distribution under certain conditions, see Sect. 3.5.
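As a brief numerical sketch of the normal approximation with continuity correction, consider X ∼ B(100, 0.5); both of the following values are approximately 0.86:
> n = 100; p = 0.5
> pbinom(55, size = n, prob = p)                        # exact P(X <= 55)
> pnorm(55.5, mean = n * p, sd = sqrt(n * p * (1 - p))) # normal approximation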
Fig. 3.7 Binomial distribution versus normal distribution. BCS_Binnorm
If several dice are rolled together each time, what is the probability of getting only a certain number for all dice?
Definition 3.9 Suppose a random experiment is independently repeated n times, so
that it returns each time one of the fixed k possible outcomes with the probabilities
p₁, p₂, . . . , p_k. The multinomial distribution then arises as the distribution of the vector of rvs X = (X₁, X₂, . . . , X_k), where each X_i denotes the number of occurrences of the ith outcome, for which
P(X₁ = x₁, X₂ = x₂, . . . , X_k = x_k) = {n! / (x₁! x₂! · · · x_k!)} p₁^{x₁} p₂^{x₂} · · · p_k^{x_k},   (3.4)
where p₁, p₂, . . . , p_k > 0, Σ_{i=1}^{k} p_i = 1, Σ_{i=1}^{k} x_i = n, and the x_i are nonnegative.
When k = 3, the corresponding distribution is called the trinomial distribution.
For k = 2, we get the binomial distribution discussed above. The example below
illustrates how to use the formula to calculate the probability in the multinomial case.
Example 3.4 Suppose we had a box with two red, three green, and five blue mar-
bles. We randomly draw three marbles with replacement. What is the probability of
drawing one marble of each colour?
Here the realizations of the rv are x₁ = 1, x₂ = 1, x₃ = 1, and the corresponding probabilities are p₁ = 0.2, p₂ = 0.3, p₃ = 0.5. Therefore, according to (3.4), the desired probability is
P(X₁ = 1, X₂ = 1, X₃ = 1) = {3! / (1! 1! 1!)} · 0.2¹ · 0.3¹ · 0.5¹ = 0.18.
In R, this can be calculated as follows.
> dmultinom(x    = c(1, 1, 1),        # values of the multinomial rvs
+           prob = c(0.2, 0.3, 0.5))  # success probabilities
[1] 0.18
In the typical ‘6 from 49’ lottery, 6 numbers from 1 to 49 are chosen without replace-
ment. Every time one number is drawn, the chances of the remaining numbers to
be chosen will change. This is an example of a hypergeometric experiment, which
satisfies the following requirements:
1. a sample is randomly selected without replacement from a population;
2. each element of the population is from one of two different groups which can
also be defined as success and failure.
Because the sample is drawn without replacement, the trials in the hypergeometric
experiment are not independent and the probability of success keeps changing from draw to draw. This differs from the binomial and multinomial distributions.
Definition 3.10 An rv X from a hypergeometric experiment follows the hypergeo-
metric distribution H (n, M, N ), which has the probability function
P(X = x) = {(M choose x) (N − M choose n − x)} / (N choose n),   x = 0, 1, . . . , min{M, n},   (3.5)
where N is the size of the population, n is the size of the sample, M is the number
of successes in the population and x is the number of successes in the sample.
In (3.5), the probability of exactly x successes in n trials of a hypergeometric exper-
iment is given. The following example illustrates this distribution.
Example 3.5 Having a box with 20 marbles, including 10 red and 10 blue marbles,
we randomly select 6 marbles without replacement. The probability of getting two red
marbles can be calculated using the H (6, 10, 20) distribution. Here the experiment
consists of 6 trials, so n = 6, and there are 20 marbles in the box, so N = 20. Now,
M = 10, since there are 10 red marbles inside, of which two should be selected, so
x = 2. Then
P(X = 2) = {(10 choose 2) (20 − 10 choose 6 − 2)} / (20 choose 6) = 0.2438.
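In R, this probability is given by dhyper(), where m denotes the number of success elements, n the number of failure elements and k the sample size:
> dhyper(2, m = 10, n = 10, k = 6)    # P(X = 2) for H(6, 10, 20)
[1] 0.243808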
Fig. 3.8 Probability functions of the hypergeometric (lines) versus binomial distribution (dots): H(6, 10, 20) vs. B(6, 0.5) (left) and H(6, 10, 200) vs. B(6, 0.05) (right). BCS_Binhyper
Example 3.6 In Example 3.5, if there are 500 marbles inside the box including 10 red
marbles, what is the probability of drawing two red marbles out of 6 draws without
replacement?
X ∼ H(6, 10, 500),
P(X = 2) = {(10 choose 2) (500 − 10 choose 6 − 2)} / (500 choose 6) = 0.00507.
Approximating this by a binomial distribution with p = M/N = 10/500 = 0.02 gives
X ∼ B(6, 0.02),
P(X = x) = (6 choose x) 0.02^x (1 − 0.02)^{6−x},
P(X = 2) ≈ 0.00553,
which is very close to the exact hypergeometric value.
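A quick comparison of the exact probability and its binomial approximation in R (note that dhyper() expects the number of non-successes, here 490):
> dhyper(2, m = 10, n = 490, k = 6)   # exact, approx. 0.00507
> dbinom(2, size = 6, prob = 0.02)    # approximation, approx. 0.00553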
The Poisson distribution can be obtained as a limit of the binomial distribution. Recall that
P(X = x) = {n! / (x!(n − x)!)} p^x (1 − p)^{n−x},   x = 0, 1, 2, . . . , n.
Let λ = np, then
P(X = x) = {n! / (x!(n − x)!)} (λ/n)^x (1 − λ/n)^{n−x}   (3.6)
         = {n! / (x!(n − x)!)} · {λ^x / n^x} · {(1 − λ/n)^n / (1 − λ/n)^x}
         = {n! / (n^x (n − x)!)} · {λ^x / x!} · {(1 − λ/n)^n / (1 − λ/n)^x}.
For n → ∞ and fixed x,
n! / {n^x (n − x)!} = {n(n − 1) · · · (n − x + 1)} / n^x ≈ 1,
(1 − λ/n)^n ≈ exp(−λ),
(1 − λ/n)^x ≈ 1.
Fig. 3.9 Poisson distribution versus the Bernoulli distribution (dots). BCS_Binpois
At the same time, the total number of trials should be very large. As a rule of thumb, if p ≤ 0.1, n ≥ 50 and np ≤ 5, the approximation is sufficiently close. The resulting Poisson distribution has the probability mass function
P(X = x) = exp(−λ) · λ^x / x!,   x = 0, 1, 2, . . . and λ > 0.   (3.7)
For example, for λ = 1, the probability of exactly two events is
P(X = 2) = exp(−1) · 1² / 2! ≈ 0.184.
The parameter λ is also called the intensity, which is motivated by the fact that λ
describes the expected number of events within a given interval.
Example 3.8 The Prussian horsekick fatality dataset from Ladislaus von Bortkiewicz (Quine and Seneta 1987) gives us the number of soldiers killed by horsekick in 10 cavalry corps over 20 years, i.e. 200 corps-years with, on average, λ = 0.61 deaths per corps-year. Under a Poisson model,
X ∼ Pois(0.61),
P(X = 1) = exp(−0.61) · 0.61¹ / 1! = 0.33144,
while under the corresponding binomial model,
X ∼ B(200, 0.00305),
P(X = 1) = (200 choose 1) · 0.00305¹ · (1 − 0.00305)^{200−1} = 0.33215.
> n      = 200
> lambda = 0.61
> p      = lambda / n
> dbinom(x = 1, size = n, prob = p)   # binomial pdf
[1] 0.3321483
> dpois(x = 1, lambda = lambda)       # Poisson pdf
[1] 0.331444
For independent rvs X_i ∼ Pois(λ_i), i = 1, . . . , n, it holds that Σ_{i=1}^{n} X_i ∼ Pois(λ₁ + λ₂ + . . . + λ_n).
This feature of the Poisson distribution is very useful, since it allows us to combine
different Poisson experiments by summing the rates. Furthermore, for two independent Poisson rvs X and Y, the conditional distribution (see Sect. 6.1) of Y given X + Y is binomial with the probability parameter λ_Y / (λ_X + λ_Y), see Bolger and Harkness (1965) for more details (Fig. 3.11).
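A small simulation sketch of this additivity: summing draws from Pois(2) and Pois(3) should behave like Pois(5), so both the sample mean and the sample variance of the sums should be close to 5.
> set.seed(1)
> x = rpois(10000, lambda = 2) + rpois(10000, lambda = 3)
> mean(x)                             # close to 5
> var(x)                              # close to 5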
Fig. 3.11 Probability mass functions of the Poisson distribution for different λ (λ = 0.5, λ = 2.5, λ = 5 and λ = 25). BCS_Poispdf
For the Poisson distribution, h(x) = 1/x!, g(θ) = λ^x, η(θ) = −λ and t(x) = 1. Other
popular distributions, such as the normal, exponential, gamma, χ 2 and Bernoulli,
belong to the exponential family and are discussed in Chap. 4, except for the last.
This condition can be extended to multidimensional problems. Furthermore, if we
standardise a Poisson rv X , the limiting distribution of this standardised variable
follows a standard normal distribution:
(X − λ) / √λ →^L N(0, 1),   as λ → ∞.
Chapter 4
Univariate Distributions
In this chapter, the theory of discrete random variables from Chap. 3 is extended to
continuous random variables. At first, we give an introduction to the basic definitions
and properties of continuous distributions in general. Then we elaborate on the normal
distribution and its key role in statistics. Finally, we exposit in detail several other
key distributions, such as the exponential and χ2 distributions.
Continuous random variables (see Definition 3.7) can take on an uncountably infinite
number of possible values, unlike discrete random variables, which take on either a
finite or a countably infinite set of values. These random variables are characterised
by a distribution function and a density function.
Definition 4.3 Let X be a continuous rv with a density function f X (x). Then the
expectation of X is defined as
E X = ∫_{−∞}^{∞} x f_X(x) dx.   (4.1)
The expectation exists if (4.1) is absolutely convergent. It describes the location (or
centre of gravity) of the distribution.
and the standard deviation is σ_X = √(Var X). The variance, defined in (4.2) as Var X = ∫_{−∞}^{∞} (x − E X)² f_X(x) dx, describes the variability of the variable and exists if the integral in (4.2) is absolutely convergent.
Other useful characteristics of a distribution are its skewness and excess kurtosis.
The skewness of a probability distribution is defined as the extent to which it deviates
from symmetry. One says that a distribution has negative skewness if the left tail is
longer than the right tail of the distribution, so that there are more values on the right
side of the mean, and vice versa for positive skewness.
Unlike discrete distributions, where the characteristic function is also the moment-
generating function, the moment-generating function for continuous distributions is
defined as the characteristic function evaluated at −it. The argument of the distrib-
ution is t, which might live in real or complex space.
Definition 4.9 For any univariate distribution F, and for 0 < p < 1, the quantity
F −1 ( p) = inf{x : F(x) ≥ p}
is called the theoretical pth quantile or fractile of F, usually denoted by ξ_p, and F^{−1} is called the quantile function.
In particular, ξ_{1/2} is called the theoretical median of F. The quantile function is nondecreasing and left-continuous and satisfies the following inequalities:
(i) F^{−1}{F(x)} ≤ x, −∞ < x < ∞,
(ii) F{F^{−1}(t)} ≥ t, 0 < t < 1,
(iii) F(x) ≥ t if and only if x ≥ F^{−1}(t).
For the uniform distribution on the interval [a, b], the moments are
E X = (a + b)/2,   Var X = (b − a)²/12,   S(X) = 0,   K(X) = −6/5.   (4.3)
The cf is given through
φ_X(t, a, b) = (e^{itb} − e^{ita}) / {it(b − a)}.
The corresponding R functions are dunif(), punif(), qunif() and runif(), which are for the pdf, the cdf, the quantile function and for generating random uniformly distributed samples, respectively. The function dunif also has the argument log, which allows for computation of the log density, useful in likelihood estimation.
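A brief sketch for the uniform distribution on [0, 2]:
> dunif(0.5, min = 0, max = 2)        # pdf: 1 / (b - a)
[1] 0.5
> punif(0.5, min = 0, max = 2)        # cdf: (x - a) / (b - a)
[1] 0.25
> qunif(0.25, min = 0, max = 2)       # quantile function
[1] 0.5
> runif(3, min = 0, max = 2)          # 3 random draws, output varies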
Fig. 4.1 pdf (left) and cdf (right) of the normal distribution (for μ = 0 and σ 2 = 1, σ 2 = 3, σ 2 = 6,
respectively). BCS_NormPdfCdf
Distribution function
The cdf of X ∼ N(μ, σ 2 ) is
Φ(x, μ, σ²) = ∫_{−∞}^{x} (2πσ²)^{−1/2} exp{−(u − μ)² / (2σ²)} du.
Another useful property of the family of normal distributions is that it is closed under
linear transformations. Thus a linear combination of two independent normal rvs,
X₁ ∼ N(μ₁, σ₁²) and X₂ ∼ N(μ₂, σ₂²), is also normally distributed:
aX₁ + bX₂ ∼ N(aμ₁ + bμ₂, a²σ₁² + b²σ₂²).
This property of the normal distribution is actually the direct consequence of a far
more general property of the family of distributions called stable distributions, see
Sect. 4.5.2, as shown in Härdle and Simar (2015).
In order to work with this distribution in R, there is a list of standard implemented
functions: dnorm(x, mean, sd), for the pdf (if argument log = TRUE then
log density); pnorm(q, mean, sd), for the cdf; qnorm(p, mean, sd), for
the quantile function; and rnorm(n, mean, sd) for generating random nor-
mally distributed samples. Their parameters are x, a vector of quantiles, p, a vector
of probabilities, and n, the number of observations. Additional parameters are mean
and sd for the vectors of means and standard deviation, which, if not specified, are
set to the standard normal values by default.
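For instance:
> dnorm(0)                            # standard normal density at 0
[1] 0.3989423
> pnorm(1.96)                         # cdf at 1.96
[1] 0.9750021
> qnorm(0.975)                        # 97.5% quantile
[1] 1.959964
> rnorm(2, mean = 1, sd = 2)          # two draws from N(1, 4), output varies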
The central role of the normal distribution in statistics becomes evident when we look at
other important distributions constructed from the normal one.
While the normal distribution is frequently applied to describe the underlying
distribution of a statistical experiment, asymptotic test statistics (see Sect. 5.2.2) are
often based on a transformation of a (non-) normal rv. To get a better understanding of
these tests, it will be helpful to study the χ2 , t- and F-distributions, and their relations
with the normal one. Skew or leptokurtic distributions, such as the exponential, stable
and Cauchy distributions, are commonly required for modelling extreme events or
an rv defined on positive support, and therefore will be discussed subsequently.
4.4.1 χ2 Distribution
The cdf of the χ2 distribution with n degrees of freedom is
F(z, n) = γ_{z/2}(n/2) / Γ(n/2),
where γ_z is the incomplete Gamma function: γ_z(α) = ∫₀^z t^{α−1} exp(−t) dt.
In order to work with this distribution in R, there is a list of standard implemented functions: dchisq(x, df), pchisq(q, df), qchisq(p, df) and rchisq(n, df), which are for the pdf, the cdf, the quantile function and for generating random χ2-distributed samples, respectively. Same as for other distributions, if log = TRUE
in dchisq function, then log density is computed, which is useful for maximum
likelihood estimation. Similar to the functions for the t (see Sect. 4.4.2) and F (see
Sect. 4.4.3) distributions, all the functions also have the parameter ncp which is the
non-negative parameter of non-centrality, where this rv is constructed from Gaussian
rvs with non-zero expectations.
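A short sketch with n = 5 degrees of freedom:
> qchisq(0.95, df = 5)                # 95% quantile
[1] 11.0705
> dchisq(1, df = 5)                   # density at 1
[1] 0.08065691
> rchisq(3, df = 5)                   # 3 random draws, output varies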
Fig. 4.2 pdf (left) and cdf (right) of χ2 distribution (degrees of freedom n = 5, n = 10, n = 15,
n = 25, respectively). BCS_ChiPdfCdf
Figure 4.2 illustrates the different shapes of the χ2 distribution’s cdf and pdf,
for different degrees of freedom n. In general, the χ2 pdf is bell-shaped and shifts
to the right-hand side for greater numbers of degrees of freedom, becoming more
symmetric.
There are two special cases, namely n = 1 and n = 2. In the first case, the vertical
axis is an asymptote and the distribution is not defined at 0. In the second case, the
curve steadily decreases from the value 0.5 (Fig. 4.3).
Properties of the χ2 distribution
A distinctive feature of χ2 is that it is positive, due to the fact that it represents a sum
of squared values.
The expectation, variance, skewness and excess kurtosis coefficients are
E X = n,   Var X = 2n,   S(X) = 2√(2/n),   K(X) = 12/n.
Fig. 4.4 Asymptotic normality of χ2 distribution (left panel n = 10; right panel n = 150).
BCS_ChiNormApprox
One can observe in Fig. 4.4 that the χ2 distribution (coloured in blue) approaches the normal distribution for large numbers of degrees of freedom.
4.4.2 Student's t-distribution
The t-distribution is used, for instance, to compare two sample means. It is also used to construct confidence intervals for population means and in linear regression analysis.
Z = X / √(Y/n) ∼ t_{n−1}.
The non-central case is obtained from
Z = (X + μ) / √(Y/n).
Density function
The pdf of the t-distribution with n degrees of freedom is
f(z, n) = Γ{(n + 1)/2} / [√(πn) Γ(n/2) {1 + z²/n}^{(n+1)/2}].
Distribution function
The cdf of the t-distribution is
F(z) = ∫_{−∞}^{z} f(t, n) dt = B(z; n/2, n/2) / B(n/2, n/2),
where B(n/2, n/2) is the Beta function B(x, y) = ∫₀¹ t^{x−1} (1 − t)^{y−1} dt and B(z; n/2, n/2) is the incomplete Beta function B(z; a, b) = ∫₀^z t^{a−1} (1 − t)^{b−1} dt.
Similar to other distributions, the R functions for the t-distribution are dt(), pt(), qt() and rt(), for computing the pdf, cdf, quantile function and generating random numbers. Same
as for other distributions, if log = TRUE in dt function, then log density is
computed, which is useful for maximum likelihood estimation. Also similar to the
functions for the χ2 and F (see Sect. 4.4.3) distributions, all the above-mentioned
functions have the non-centrality parameter ncp.
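For instance, with n = 5 degrees of freedom:
> qt(0.975, df = 5)                   # 97.5% quantile
[1] 2.570582
> dt(0, df = 5)                       # density at zero
> rt(3, df = 5)                       # 3 random draws, output varies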
Figure 4.5 shows the standard normal distribution (black bold line) and several
different t-distributions with different degrees of freedom.
Fig. 4.5 Density function of Student's t-distribution and corresponding cumulative distribution functions (n = 1, n = 2, n = 5, bold line: N(0, 1)). BCS_tPdfCdf
4.4.3 F-distribution
Definition 4.15 The rv Z has the Fisher–Snedecor distribution (F-distribution) with n and m degrees of freedom if
Z = {χ²(n)/n} / {χ²(m)/m} ∼ F_{n,m},
where χ²(n) and χ²(m) are independent χ2-distributed rvs with n and m degrees of freedom.
Distribution function
The cdf is
F(z) = 2 n^{(n−2)/2} (z/m)^{n/2} F_h{(n + m)/2, n/2; 1 + n/2; −nz/m} / B(n/2, m/2)   for z ≥ 0,
where F_h denotes the hypergeometric function and B(·, ·) the Beta function.
In R, the functions df(), pf(), qf() and rf() are available for computing the pdf, cdf, quantile function and generating random numbers. Here
parameters df1 and df2 are the two degrees of freedom parameters. Same as for
other distributions, if log = TRUE in df function, then log density is computed,
which is useful for maximum likelihood estimation. Also similar to the functions for
the χ2 and t-distribution, all the above-mentioned functions have the non-centrality
parameter ncp.
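For instance, with n = 5 and m = 10 degrees of freedom:
> qf(0.95, df1 = 5, df2 = 10)         # 95% quantile, approx. 3.33
> pf(1, df1 = 5, df2 = 10)            # cdf at 1
> rf(3, df1 = 5, df2 = 10)            # 3 random draws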
Distribution parameters
The expectation of the F-distribution is defined for m > 2 and its variance for m > 4:
E Z = m / (m − 2),   Var Z = 2m²(n + m − 2) / {n(m − 2)²(m − 4)}.
Looking at Fig. 4.6, one can distinguish three characteristic shapes of the pdf
curve, depending on the parameters n and m:
• for n = 1, the curve monotonically decreases for all values of m with the vertical
axis as an asymptote;
• for n = 2, the curve again decreases for all m, but intersects the vertical axis at the
point 1;
• for n ≥ 3, the curve has an asymmetrical bell shape for all m, gradually shifting
to the right-hand side for larger numbers of degrees of freedom.
Example 4.1 Let us assume that over the time interval [0, T ], the online service
of a food delivery company receives x orders. At some point, the managers of this
business became curious as to the probabilities of the amounts of orders over time.
In general, the number of orders can be described by a Poisson distribution, see Sect. 3.6, where λ is the expected number of occurrences during a given time period. If during one hour the online service receives on average λ = 35 orders, then the probability of receiving exactly 30 orders within any given hour is p = 35³⁰ e^{−35} / 30! ≈ 0.049.
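This value can be checked quickly in R:
> dpois(30, lambda = 35)              # P(X = 30), approx. 0.049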
However, when we need to model the distribution of time intervals between orders,
or events, the exponential distribution comes in handy.
Density function
The pdf of the exponential distribution is defined as
f(z, λ) = λ e^{−λz} for z ≥ 0, and f(z, λ) = 0 for z < 0,
where λ is a rate parameter, such that the expected time interval between two events is 1/λ. The rate parameter gives the expected number of events in a unit time interval, whereas its reciprocal gives the expected time interval between two events. One writes X ∼ E(λ).
Distribution function
The expression for the cdf looks relatively similar to that of the pdf:
F(x, λ) = 1 − e^{−λx} for x ≥ 0, and F(x, λ) = 0 for x < 0.
In general, the greater the λ is, the steeper are the curves of the exponential density
and distribution functions (Fig. 4.7).
The main R functions for the exponential distribution are dexp(), pexp(), qexp() and rexp(), for computing the pdf, cdf, quantile function and generating random numbers, respectively. Same
as for other distributions, if log = TRUE in dexp function, then log density is
computed, which is useful for maximum likelihood estimation.
Example 4.2 University beverage vending machines have a lifetime of X , which is
exponentially distributed with λ = 0.3 defective machines per year:
f(x, 0.3) = 0.3 e^{−0.3x} for x ≥ 0, and f(x, 0.3) = 0 for x < 0.
We would like to find the probability that this vending machine will function more
than 1.7 years.
Fig. 4.7 Pdf and cdf of the exponential distribution (λ = 0.3, λ = 0.5, λ = 1 and λ = 3).
BCS_ExpPdfCdf
Thus the probability that the machine functions for more than 1.7 years is P(X > 1.7) = exp(−0.3 · 1.7) ≈ 0.60, i.e. a breakdown occurs within 1.7 years with probability approximately 40%.
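The same computation in R:
> 1 - pexp(1.7, rate = 0.3)           # P(X > 1.7) = exp(-0.51)
[1] 0.6004956
> pexp(1.7, rate = 0.3)               # P(X <= 1.7)
[1] 0.3995044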
Properties of the exponential distribution
The exponential distribution has the following expectation and variance:
E X = 1/λ,   Var X = 1/λ².
The mode (see Definition 5.6) is 0 and the median (see Definition 4.9) is
ξ_{1/2} = log 2 / λ.
The exponential distribution has skewness and excess kurtosis coefficients indepen-
dent of λ, unlike some of the distributions we have seen so far. They are
S(X ) = 2, K (X ) = 6.
An important property of the exponential distribution is its lack of memory:
P(X ≤ t + q | X > t) = P(X ≤ q).
The conditional probability of the next event’s occurring by time t + q given that the
last event was at time t is equal to the unconditional probability of the next event’s
occurring at time q without any previous information.
As mentioned in Sect. 4.3, the stable distributions are a family of distributions which
are closed under linear transformations.
Definition 4.16 A distribution function is said to be stable if for any two independent
rvs Z 1 and Z 2 following this distribution, and any two positive constants a and b, we
have
aZ₁ + bZ₂ = cZ + d,
for some positive constant c and some constant d, where the rv Z follows the same distribution.
The characteristic function of a stable distribution is given through
log φ_Z(t, α, β, σ_S, μ) = −σ_S^α |t|^α {1 − iβ sign(t) tan(πα/2)} + iμt,   for α ≠ 1,
log φ_Z(t, α, β, σ_S, μ) = −σ_S |t| {1 + iβ sign(t) (2/π) log|t|} + iμt,   for α = 1,
where i² = −1. Note that the σ_S used here is not the usual Gaussian scale σ, but the value σ_S = σ/√2.
In R the stable distributions can be implemented by
dstable(z, alpha, beta, gamma, delta, pm) and
pstable(z, alpha, beta, gamma, delta, pm),
which require the stabledist package. We can easily work with a stable distri-
bution of interest that depends on the parameters α, β, σ, μ and the parameter pm,
which refers to the parameterization type. The functions qstable and rstable
with the same parameters let us use quantiles and generate samples.
An interesting toolbox is implemented with the command stableSlider()
(of the fBasics package). It provides a good illustration of the pdf and cdf functions
of different stable distributions. One can change the parameters to see how the shape
of the functions reacts to the changed values, see Fig. 4.9. There exist three special
cases of stable distributions that have closed form formulas for their pdf and cdf:
Fig. 4.8 Stable distribution functions and their density functions given different combinations of α and β (top: α ∈ {0.6, 1, 1.5, 2}, β = 0; bottom: α = 1, β ∈ {0, −0.8, 0.8}; in all cases σ = 1 and μ = 0). BCS_StablePdfCdf
Normal distribution:  f(z) = (2πσ²)^{−1/2} exp{−(z − μ)² / (2σ²)},
Cauchy distribution:  f(z) = σ / {π(z − μ)² + πσ²},   (4.4)
Lévy distribution:    f(z) = √{c/(2π)} · exp{−c / (2(z − μ))} / (z − μ)^{3/2}.
With the help of the following short code we plot the pdf for those special cases.
These can be built using the dstable function from package stabledist with
the appropriate parameters α, β, σ and μ (Fig. 4.10).
> require(stabledist)
> z = seq(-6, 6, length = 300)
> s.norm = dstable(z,                  # values of the density
+                  alpha = 2,          # tail
+                  beta  = 0,          # skewness
+                  gamma = 1,          # scale
+                  delta = 0,          # location
+                  pm    = 1)          # type of parametrization
> s.cauchy = dstable(z,                # values of the density
+                    alpha = 1,        # tail
+                    beta  = 0,        # skewness
+                    gamma = 1,        # scale
+                    delta = 0,        # location
+                    pm    = 0)        # type of parametrization
> s.levy = dstable(z,                  # values of the density
+                  alpha = 0.5,        # tail
+                  beta  = 0.9999,     # skewness
+                  gamma = 1,          # scale
+                  delta = 0,          # location
+                  pm    = 0)          # type of parametrization
> plot(z, s.norm,                      # plot normal
+      col = "red", type = "l", ylim = c(0, 0.5))
> lines(z, s.cauchy,                   # plot Cauchy
+       col = "green")
> lines(z, s.levy,                     # plot Levy
+       col = "blue")
In all cases, σ = 1 and μ = 0. The cdf functions can be plotted analogously using
the procedure pstable.
Example 4.3 Consider an isotropic source emitting particles to the plane L. The
angle θ of each emitted particle is uniformly distributed. Each particle hits the plane
at some distance x from the point 0 (Fig. 4.11). By definition, the distance rv X
follows a Cauchy distribution.
Density function
The pdf of the Cauchy distribution is defined as in (4.4) where μ ∈ R is a location
parameter, i.e. it defines the position of the peak of the distribution, and σ > 0 is
a scale parameter specifying one-half the width of the probability density function
at one-half its maximum height. For μ = 0 and σ = 1, the distribution is called a
standard Cauchy distribution (Fig. 4.12).
Fig. 4.12 Cauchy distribution functions and corresponding density functions (μ = −2, σ = 1;
μ = 0, σ = 1; μ = 2, σ = 1; μ = 0, σ = 1.5; μ = 0, σ = 2). BCS_CauchyPdfCdf
Distribution function
The Cauchy cdf is
F(z; μ, σ) = (1/π) arctan{(z − μ)/σ} + 1/2.
In R, computing the pdf, cdf and quantile function of the Cauchy distribution, as well as generating random numbers from it, can be done using the commands dcauchy(), pcauchy(), qcauchy() and rcauchy(), or by using the dstable() family of functions from the stabledist package with α = 1 and β = 0.
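For instance, for the standard Cauchy distribution:
> dcauchy(0)                           # density at 0: 1 / pi
[1] 0.3183099
> pcauchy(1)                           # cdf at 1: arctan(1) / pi + 1 / 2
[1] 0.75
> qcauchy(0.75)                        # 75% quantile
[1] 1
> rcauchy(3)                           # 3 random draws, output varies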
Chapter 5
Univariate Statistical Analysis
This chapter presents basic statistical methods used in describing and analysing
univariate data in R. It covers the topics of descriptive and inferential statistics of
univariate data, which are mostly treated in introductory courses in Statistics.
Among other useful statistical tools, we discuss simple techniques of explorative
data analysis, such as the Bar Diagram, Bar Plot, Pie Chart, Histogram, kernel density
estimator, the ecdf, and parameters of location and dispersion. We also demonstrate
how they are easily implemented in R. Further in this chapter, we discuss different tests for location, dispersion and distribution.
The function table() returns all possible observed values of the data along
with their absolute frequencies. These can be used further to compute the relative
frequencies by dividing by n.
Let us consider the dataset chickwts, a data frame with 71 observations of 2
variables, weight, a numeric variable for the weight of the chicken, and feed, a
factor for the type of feed. In order to select only the observed values of feed, one
considers the field chickwts$feed. By using table(chickwts$feed), we
get one line, stating the possible chicken feed, i.e. each possible observational value,
and the absolute frequency of each type in the line below.
> table(chickwts$feed)                # absolute frequencies
    casein horsebean   linseed  meatmeal   soybean sunflower
        12        10        12        11        14        12
Fig. 5.1 Bar diagram of the absolute frequencies n(a j ) (left) and bar plot of the relative frequencies
h(a j ) (right) of chickwts$feed. BCS_BarGraphs
The result of the first plot command is shown in the left panel of Fig. 5.1.
Bar plot
Unlike in the Bar Diagram, each observation is plotted using bars in the Bar Plot.
If the endpoints of the bars are connected, one obtains a frequency polygon. It is in
particular useful to illustrate the behaviour (variation) of time ordered data.
> n = length(chickwts$feed) # sample size
> barplot(table(chickwts$feed)) # absolute frequency
> barplot(table(chickwts$feed) / n) # relative frequency
The result of the second barplot command is shown in the right panel of Fig. 5.1.
Pie chart
In a Pie Chart, each observation has its own sector with an angle (or a square for
a square Pie Chart) proportional to its frequency. The angle can be obtained from
α(ai ) = h(ai ) · 360◦ . The disadvantage of this approach is that the human eye cannot
precisely distinguish differences between angles (or areas). Instead, it recognises
much better differences in lengths, which is the reason why the Bar Plot and Bar
Diagram are better tools than the Pie Chart. In Fig. 5.2, each group seems to have the
same area in the Pie Chart, though the frequencies differ slightly from each other, as
is evident from the Bar Plot in Fig. 5.1.
> pie (table(chickwts$feed))
F̂(x) = n^{−1} Σ_{i=1}^{n} I(X_i ≤ x) →^{a.s.} F(x).   (5.1)
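In R, the ecdf of a sample is computed with the base function ecdf(), which returns a step function; a brief sketch using the chickwts weights:
> Fhat = ecdf(chickwts$weight)         # empirical cdf as a function
> Fhat(258)                            # share of chickens weighing at most 258
> plot(Fhat)                           # step plot of the ecdf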
5.1.3 Histogram
For x in class K_i, the histogram density estimator is given by fˆ(x) = h(K_i) / |K_i|, where |K_i| denotes the length of the class K_i and h(K_i) its relative frequency, which
is calculated as the ratio of the number of observations falling into class K i to the
sample size n.
We write fˆ(x) with a hat because it is a sample-based estimator of the true density
function f (x), which describes the relative likelihood of the underlying variable to
take on any given value. fˆ(x) is a consistent estimator of f (x), since for every value
134 5 Univariate Statistical Analysis
Fig. 5.4 Histograms of nhtemp with the number of classes calculated using default method (left)
and by manually setting to intervals of length 0.5 (right). BCS_hist1, BCS_hist2
of x, fˆ(x) converges almost surely to f (x) when n goes to infinity (Strong Law of
Large Numbers, Serfling 1980).
Now, consider nhtemp, a sample of size n = 60 containing the mean annual
temperature in degrees Fahrenheit in New Haven, Connecticut, from 1912 to 1971.
The histograms in Fig. 5.4 are produced using the function hist(). By default,
without specifying the arguments for hist(), R produces a histogram with the
absolute frequencies of the classes on the y-axis. Thus, to obtain a histogram accord-
ing to our definition, one needs to set freq = FALSE. The number of classes s is
calculated by default using Sturges’ formula s = log2 n + 1. The brackets denote
the ceiling function used to round up to the next integer (see Sect. 1.4.1) to avoid
fractions of classes. Note that this formula performs poorly for n < 30. To spec-
ify the intervals manually, one can fill the argument breaks with a vector giving
the breakpoints between the histogram cells, or simply the desired number of cells.
In the following example, breaks = seq(47, 55, 0.5) means that the his-
togram should range from 47 to 55 with a break every 0.5 step, i.e. K 1 = [47, 47.5),
K 2 = [47.5, 48), ….
> hist(nhtemp, freq = FALSE)
> hist(nhtemp, freq = FALSE, breaks = seq(47, 55, 0.5))
Figure 5.4 displays histograms with different bin sizes. A better reflection of the
data is achieved by using more bins. But, as the number of bins increases, the his-
togram becomes less smooth. Finding the right level of smoothness is an important
task in nonparametric estimation, and more information can be found in Härdle et al.
(2004).
The histogram is a density estimator with a relatively low rate of convergence to the
true density. A simple idea to improve the rate of convergence is to use a function
that weights the observations in the vicinity of the point where we want to estimate
the density, depending on how far away each such observation is from that point.
Therefore, the estimated density is defined as
\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right),
where K((x − x_i)/h) is the kernel, a symmetric, nonnegative, real-valued integrable
function. Furthermore, the kernel should have the following properties:
\int_{-\infty}^{\infty} u\, K(u)\, du = 0, \qquad \int_{-\infty}^{\infty} K(u)\, du = 1.
These criteria define a pdf and it is straightforward to use different density functions
as a kernel. This is the basic idea of kernel smoothing. The foundations in this
area were laid in Rosenblatt (1956) and Parzen (1962). Some examples for different
weight functions are given in Fig. 5.5 and Table 5.1.
Deriving a formal expression for the kernel density estimator is fairly intuitive.
The weights for the observations depend mainly on the distance to the estimated
point. The main idea behind the histogram to estimate the pdf is
\hat{f}_h(x) \approx \frac{\hat{F}(x + h) - \hat{F}(x - h)}{2h},   (5.3)
where F̂ is the ecdf. If h is small, the approximation method works well, producing
smaller bin widths and a smaller bias. Rearranging (5.3) yields
\hat{f}_h(x) = \frac{1}{2nh} \sum_{i=1}^{n} \mathbf{I}(x + h \ge x_i > x - h).

The weight given to each observation can thus be written as

K(x - x_i) = \frac{1}{h}\, \mathbf{I}(x + h \ge x_i > x - h)\, w(x_i),
(Fig. 5.5: the kernel weight functions of Table 5.1, plotted on [−2, 2].)
Table 5.1 (fragment) Kernel functions K(u):
Quartic:   K(u) = (15/16)(1 − u²)² I(|u| ≤ 1)
Gaussian:  K(u) = (2π)^{−1/2} exp(−u²/2)
With such a kernel, the estimator can be written compactly as

\hat{f}_h(x) = \frac{1}{h} \sum_{i=1}^{n} K(x - x_i).
(Fig. 5.6: kernel density estimate of nhtemp; the R output reports N = 60 and bandwidth = 0.3924. BCS_Kernel_nhTemp)
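The picture can be reproduced with the default kernel density estimator in R; the exact plotting options of the quantlet are not shown in the text, so the following is only a minimal sketch:

> plot(density(nhtemp))   # Gaussian kernel, bandwidth chosen by bw.nrd0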
To find the optimal bandwidth h for a kernel estimator, a similar problem has to
be solved as for the optimal binwidth. In practice one can use Silverman’s rule of
thumb:
h^* = 1.06 \cdot \hat{\sigma} \cdot n^{-1/5}.
It is only a rule of thumb, because this h ∗ is only the optimal bandwidth under normal-
ity. But this bandwidth will be close to the optimal bandwidth for other distributions.
The optimal bandwidth depends on the kernel and the true density, see Härdle et al.
(2004).
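For the nhtemp data, Silverman's rule of thumb can be computed directly; this is a sketch rather than the book's quantlet:

> n     = length(nhtemp)                # sample size
> h.opt = 1.06 * sd(nhtemp) * n^(-1/5)  # Silverman's rule of thumb h*
> h.opt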
“Where are the data centered?” “How are the data scattered around the centre?” “Are
the data symmetric or skewed?” These questions are often raised when it comes to
a simple description of sample data. Location parameters describe the centre of a
distribution through a numerical value. They can be quantified in different ways and
visualised particularly well by boxplots.
Arithmetic mean
The term arithmetic mean characterises the average position of the realisations on the
variable axis. It is a good location measure for data from a symmetric distribution.
Definition 5.2 The sample (arithmetic) mean for a sample of n values, x1 , x2 , ..., xn
is defined by
\bar{x} = n^{-1} \sum_{i=1}^{n} x_i.   (5.4)
Applying the notions of absolute and relative frequencies, this formula can then be
rewritten as
\bar{x} = n^{-1} \sum_{j=1}^{k} a_j\, n(a_j) = \sum_{j=1}^{k} a_j\, h(a_j).
By the Strong Law of Large Numbers, the sample mean converges almost surely to the population mean:

n^{-1} \sum_{i=1}^{n} X_i \xrightarrow{a.s.} \mu \quad \text{when } n \to \infty.
α-trimmed mean
The arithmetic mean is very often used as a location parameter, although it is not very
robust, since its value is sensitive to the presence of outliers. In order to eliminate the
outliers, one can trim the data by dropping a fraction α ∈ [0 , 0.5) of the smallest and
largest observations before calculating the arithmetic mean. This type of arithmetic
mean, called the α-trimmed mean, is more robust to outliers. However, there is no
unified recommendation regarding the choice of α. In order to define the trimmed
mean, we need to define order statistics first.
Definition 5.3 Let x(1) ≤ x(2) ≤ . . . ≤ x(n) be the sorted realizations of the rv X .
The term x(i) , i = 1, . . . , n is called the i th order statistic, and in particular, x(1) is
called the sample minimum and x(n) the sample maximum.
Definition 5.4 The α-trimmed mean is the arithmetic mean computed after trimming
the fraction α of the smallest and largest observations of X , given by
\bar{x}_{\alpha} = \frac{1}{n - 2\lfloor n\alpha \rfloor} \sum_{i=\lfloor n\alpha \rfloor + 1}^{n - \lfloor n\alpha \rfloor} x_{(i)}
with α ∈ [0, 0.5), where ⌊a⌋ is the floor function, returning the largest integer not
greater than a, see Sect. 1.4.1.
The argument trim is used in the function mean to compute the α-trimmed mean.
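For example (a small illustration, not taken from the original listing):

> mean(nhtemp)               # arithmetic mean
> mean(nhtemp, trim = 0.05)  # 5%-trimmed mean: drops the 5% smallest
>                            # and the 5% largest observations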
Quantiles
Another type of location parameter is the quantile. Quantiles are very robust, i.e. not
influenced by outliers, since they are determined by the rank of the observations and
they are estimates of the theoretical quantiles, see Definition 4.9.
Definition 5.5 The p-quantile x̃_p of a sample is x_{(⌈np⌉)} if np is not an integer, and {x_{(np)} + x_{(np+1)}}/2 otherwise, where ⌈a⌉ is the ceiling function, returning the smallest integer not less than a,
see Sect. 1.4.1. The sample quartiles are a special case of quantiles: the lower quar-
tile Q 1 = x̃0.25 , the median Q 2 = x̃0.5 = med, and upper quartile Q 3 = x̃0.75 .
These three quartile values Q 1 ≤ Q 2 ≤ Q 3 divide the sorted observations into four
segments, each of which contains roughly 25% of the observations in the sample.
To calculate the p-quantiles x̃ p of the sample nhtemp, one uses quantile().
This function allows up to 9 different methods of computing the quantile, all of them
converge asymptotically, as the sample size tends to infinity, to the true theoretical
quantiles (type = 2 is the method discussed here). Leaving the argument probs
blank, R returns by default an array containing the 0, 0.25, 0.5, 0.75 and 1 quantiles,
which are the sample minimum x (1) , the lower quartile Q 1 , the median Q 2 (or med),
the upper quartile Q 3 , and the sample maximum x(n) . The median can be also found
using median().
> quantile(nhtemp, probs = 0.2) # 20% quantile
20%
50.2
> median(nhtemp) # median
[1] 51.2
> quantile(nhtemp, probs = c(0.2, 0.5)) # 20% and 50% quantiles
20% 50%
50.2 51.2
> quantile(nhtemp) # all quartiles
0% 25% 50% 75% 100%
47.90 50.575 51.20 51.90 54.60
Mode
The mode is the most frequently occurring observation in a data set (also called the
most fashionable observation). Together with the mean and median, one can use it as
an indicator of the skewness of the data. In general, the mode is not equal to either the
mean or the median, and the difference can be huge if the data are strongly skewed.
Definition 5.6 The mode x_mod is the value a_j with the largest absolute frequency, i.e. n(x_mod) ≥ n(a_j) for all j = 1, . . . , k.
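The listing computing the mode is reconstructed here from the description that follows:

> as.numeric(names(sort(table(nhtemp), decreasing = TRUE))[1])
[1] 50.9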
These nested functions are better understood from the inside out. The function
table() creates a frequency table for the observations in the dataset nhtemp,
calculating the frequency for every single value. sort() with the argument
decreasing = TRUE sorts the frequency table in decreasing order, so that the
element with the highest frequency, i.e. the mode, appears first. Its name, the unique
value for which the frequency was calculated, is then extracted by the function
names()[1], where [1] restricts the output of names() to the first position of
the vector. Lastly, as the result is a string, here ‘50.9’, it needs to be converted into a
number by as.numeric().
If it is desired to have the usual location parameters, such as the median, mean
and some quantiles at once, one can use the command summary().
> summary(nhtemp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
47.90 50.58 51.20 51.16 51.90 54.60
Total range
Definition 5.7 The total range is the difference between the sample maximum and
the sample minimum, i.e.

R = x_{(n)} - x_{(1)}.
Since the total range depends only on two observations, it is very sensitive to outliers
and is thus a very weak dispersion parameter.
The function range() returns an array containing two values, namely the sam-
ple minimum and maximum. diff() calculates the difference between values by
subtracting each value in a vector from the subsequent value. To obtain the total
range, one simply calculates the first difference of the array given by the function
range() using diff().
> range(nhtemp) # sample min and sample max
[1] 47.9 54.6
> totalrange = diff(range(nhtemp)) # difference between max and min
> totalrange
[1] 6.7
Interquartile range
Definition 5.8 The interquartile range (IQR) of a sample is the difference between
the upper quartile x̃_{0.75} and the lower quartile x̃_{0.25}, i.e.

IQR = x̃_{0.75} − x̃_{0.25}.
It is also called the midspread or middle fifty, since roughly fifty percent of the
observations are found within this range. The IQR is a robust statistic and is therefore
preferred to the total range.
The IQR can also be computed directly with the function IQR(). Alternatively, to find the upper
and lower quartiles, one uses the function quantile() with probs = c(0.25,
0.75), meaning that R should return the 0.25-quantile and the 0.75-quantile. In this
example, the function diff() computes the IQR, i.e. the difference between the
lower and upper quantiles.
> LUQ = quantile(nhtemp, probs = c(0.25, 0.75)); LUQ
25% 75%
50.575 51.900
> IQR = diff(LUQ); IQR
75%
1.325
Variance
The variance is one of the most widely used measures of dispersion. The variance is
sensitive to outliers and is only reasonable for symmetric data.
Definition 5.9 The sample variance of n values x_1, x_2, . . . , x_n is

\tilde{s}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.   (5.5)
The unbiased variance estimator also called the empirical variance for a sample of n
values x1 , x2 , . . . , xn is the sum of the squared deviations from their mean x̄ divided
by n − 1, i.e.
s^2 = \frac{n}{n-1}\, \tilde{s}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.   (5.6)
Definition 5.10 The sample standard deviation s̃ and the estimator for the popula-
tion standard deviation based on the unbiased variance estimator are calculated from
(5.5) and (5.6): \tilde{s} = \sqrt{\tilde{s}^2}, \quad s = \sqrt{s^2}.
The R functions var() and sd() compute estimates for the variance and standard
deviation using the formulas for the unbiased estimators s 2 and s.
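A small sketch illustrating both estimators for nhtemp (not part of the original listings):

> n = length(nhtemp)
> var(nhtemp)                 # unbiased estimator s^2, Eq. (5.6)
> sd(nhtemp)                  # s = sqrt(s^2)
> (n - 1) / n * var(nhtemp)   # biased estimator, Eq. (5.5)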
A more robust dispersion measure is the median absolute deviation. It is robust since the median is less sensitive to outliers
and the distances are not squared, effectively reducing the weight of outliers.
Definition 5.11 The median absolute deviation (MAD) is the median of the absolute
deviations from the median: MAD = med_i |x_i − x̃_{0.5}|.
The function mad() returns by default the MAD according to Definition 5.11.
However, if it is desired to compute the median of the absolute deviation from some
other values, one simply includes the argument center. Below is an example of
deviations both from the median and from the mean, using measurements of
the annual flow of the river Nile at Aswan between 1871 and 1970 (discharge in
10^8 m^3).
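The listing itself is not reproduced in the text; a minimal sketch is the following (note that mad() scales by the constant 1.4826 by default, which can be switched off with constant = 1):

> mad(Nile, constant = 1)                       # MAD: deviations from the median
> mad(Nile, center = mean(Nile), constant = 1)  # deviations from the mean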
d_1 = \frac{1}{n} \sum_{i=1}^{n} |x_i - \tilde{x}_{0.5}| \quad \text{or} \quad d_2 = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|.
The sample estimate of the skewness S(X ), see Definition 4.5, is given through
\hat{S} = \frac{n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left\{(n-1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\right\}^{3/2}}.
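A direct implementation of this estimator in R (a sketch, using the nhtemp data):

> x = nhtemp
> skew.hat = mean((x - mean(x))^3) / var(x)^(3 / 2)  # 1/n in the numerator, 1/(n-1) in the denominator
> skew.hat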
5.1.8 Box-Plot
The box-plot (or box-whisker plot) is a diagram which describes the distribution of a
given data set. It summarises the location and dispersion measures discussed previ-
ously. The box-plot gives a quick glimpse of the observations’ range and empirical
distribution.
This box-plot of the dataset Nile visualizes the skewness of the data very well,
see Fig. 5.7. Since the median, shown by the middle line, is not in the centre of the
box, the data are not symmetric and the results for calculating the median absolute
deviation from the median or from the mean differ, as we have just shown in the code
above.
Let us now analyse the dataset nhtemp using the command boxplot(). The
output is given in Fig. 5.8.
> boxplot(nhtemp)
(Fig. 5.7: box-plot of the Nile data, annotated with the median, the quartiles and the fences.)
Fig. 5.8 Box-plot of nhtemp, annotated with the median med(x), the upper quartile x̃_{0.75}, the upper fence and the highest value within the upper fence. BCS_Boxplot
Approximately fifty percent of the observations are contained in the box. The
upper edge is the 0.75-quantile and the lower edge is the 0.25-quantile. The distance
between these two edges is the interquartile range (IQR). The median is indicated
by the horizontal line between the two edges. If its distance from the upper edge is
not equal to its distance from the lower edge, then the data are skewed.
The vertical lines extending outside the box are called whiskers. In the absence
of outliers, the ends of the whiskers indicate the sample maximum and minimum.
Otherwise, the ends of the whiskers lie at the highest value that is still within the
upper fence (x̃_{0.75} + 1.5 · IQR) and the lowest value that is still within the lower fence
(x̃_{0.25} − 1.5 · IQR). The factor 1.5, used by default, can be modified by setting the
argument range appropriately. The (suspected) outliers are denoted by the points
outside the two fences. For R not to draw the outliers, we set the argument outline
= FALSE.
Another way of producing a box-plot is using the package lattice (more details
in Chap. 10). The function used here is bwplot(). Consider again the dataset
nhtemp. Since nhtemp is a time series object, it is converted to a vector using
as.vector() in order for bwplot to work for the data nhtemp.
> require(lattice)
> bwplot(as.vector(nhtemp))
5.2 Confidence Intervals and Hypothesis Testing
When estimating a population parameter θ (e.g. the population mean μ or the variance
σ 2 ), it is important to have some clue about the precision of the estimation. The
precision in this context is the probability that the estimate θ̂ is wrong by less than a
given amount. It can be calculated using a random sample of size n drawn from the
population. For most cases, like θ = μ, the sample size n must be large enough so
that θ̂ can be assumed to be normally distributed (Central Limit Theorem 6.5).
The standard error of the sample mean measures the accuracy of the estimation
of the mean, and the confidence interval quantifies how close the sample mean is
expected to be to the population mean. Furthermore, it is naturally desirable to have a
confidence interval as short as possible, something which is induced by large samples.
Definition 5.12 The confidence interval (CI) for the parameter θ of a continuous rv
is a range of feasible values for an unknown θ together with a confidence coefficient
(1 − α) conveying one’s confidence that the interval actually covers the true θ.
Formally, it is written as
P(θ ∈ CI) = 1 − α.
Let Z ∼ N(0, 1) and let z_{1−α/2} denote the (1 − α/2)-quantile of the standard normal distribution. Then

P(-z_{1-\alpha/2} \le Z \le z_{1-\alpha/2}) = 1 - \alpha.
With Z = \sqrt{n}\,\frac{\bar{X} - \mu}{\sigma}, this implies

P\left(\bar{X} - z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.
Thus, for a fixed α ∈ [0, 1], the confidence coefficient is (1−α) and the corresponding
100 · (1 − α)%-confidence interval for the population mean μ, assuming that the
population is normally distributed and σ is known (Fig. 5.9), is given by
(Fig. 5.9: two-sided confidence interval; the central area 1 − α lies between −z_{1−α/2} = z_{α/2} and z_{1−α/2}, with probability α/2 in each tail.)
\left[\bar{x} - z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\; ; \; \bar{x} + z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\right].
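The corresponding R listing for known σ is not shown above; a sketch, taking the sample standard deviation as the "known" σ for illustration, is:

> smean  = mean(nhtemp)                       # sample mean
> sigma1 = sd(nhtemp)                         # sigma, treated as known here
> n      = length(nhtemp)                     # sample size
> alpha  = 0.1                                # significance level
> z1     = qnorm(1 - alpha / 2)               # standard normal quantile
> CI1    = c(smean - z1 * sigma1 / sqrt(n),   # confidence interval
+            smean + z1 * sigma1 / sqrt(n))
> CI1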
Sometimes only an upper limit or a lower limit for μ is desired, but not both. These
are called one-sided (one-tailed) confidence limits. For example, a toy is considered
to be harmful to children if it contains an amount of mercury that exceeds a certain
value. A European buyer wants a guarantee from a European company that their
products comply with European safety laws. The transaction may then take place
if the 99%-confidence upper limit does not exceed the desired maximum. In the
same contract, one does not want too many failures in the shipped good, e.g. the
95%-confidence lower limit should not exceed the desired minimum (Fig. 5.10).
Taking Z = \sqrt{n}\,\frac{\bar{X} - \mu}{\sigma} \sim N(0, 1), it follows that

P\left(\mu \ge \bar{X} - z_{1-\alpha} \cdot \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.

Thus, the 100 · (1 − α)%-confidence lower limit is \bar{x} - z_{1-\alpha}\, \frac{\sigma}{\sqrt{n}} and the upper limit is given
by \bar{x} + z_{1-\alpha}\, \frac{\sigma}{\sqrt{n}}.
Fig. 5.10 Two types of one-sided confidence intervals: lower limit (left) and upper limit (right)
confidence interval. BCS_Conf1Sidedleft, BCS_Conf1sidedright
When σ is unknown, it is replaced by its estimator S, and the quantiles of the t-distribution with ν degrees of freedom are used, i.e. for V ∼ t_ν,

P(-t_{1-\alpha/2,\nu} \le V \le t_{1-\alpha/2,\nu}) = 1 - \alpha.
Now, assuming that the population is normally distributed and the sample size is n, it follows that

V = \sqrt{n}\,\frac{\bar{X} - \mu}{S} = \sqrt{n}\,\frac{n^{-1}\sum_{i=1}^{n} X_i - \mu}{\sqrt{(n-1)^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}},

thus

P\left(\bar{X} - t_{1-\alpha/2,\,n-1} \cdot \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{1-\alpha/2,\,n-1} \cdot \frac{S}{\sqrt{n}}\right) = 1 - \alpha.
Definition 5.13 The 100 · (1 − α)%-confidence interval for the population mean μ
when the population is normally distributed and σ is unknown is defined by
\left[\bar{x} - t_{1-\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}\; ; \; \bar{x} + t_{1-\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}\right].
When calculating the confidence interval for σ unknown, the R code from above
changes only a little. To find t1−α/2,n−1 , we use the function qt(), see again Sect. 4.4.
> smean = mean(nhtemp)
> sigma2 = sd(nhtemp) # estimate sd from data
> n = length(nhtemp) # sample size
> alpha = 0.1 # significance level (90% CI)
> z2 = qt(1 - alpha / 2, n - 1) # the t-distr. quantiles
> CI2 = c(smean - z2 * sigma2 / sqrt(n), # confidence interval
+ smean + z2 * sigma2 / sqrt(n))
> CI2
[1] 50.88696 51.43304
Note that this confidence interval is slightly larger than the one calculated for σ
known. In other words, having to estimate the variance introduces more uncertainty
into our estimation.
A statistical test confronts a null hypothesis H_0 about the unknown parameter θ with an alternative hypothesis H_1, for example

H_0: θ = θ_0 vs H_1: θ ≠ θ_0.
It is important to note that the hypotheses are mutually exclusive. The test above is
called a two-sided or two-tailed test, since the alternative hypothesis H1 does not
make any reference to the sign of the difference θ − θ0 . Therefore, the interest here
lies only in the absolute values of θ − θ0 .
However, sometimes the investigator notes only deviations from the null hypoth-
esis H0 in one direction and ignores deviations in other directions. The investigators
could, for example, be certain that if θ is not less than or equal to θ0 , then θ must be
greater than θ0 or vice versa. Formally:
H0 : θ ≤ θ0 vs H1 : θ > θ0 ,
or
H0 : θ ≥ θ0 vs H1 : θ < θ0 .
Each time when conducting a statistical test, one faces two types of risk:
Type I error is the error of rejecting a null hypothesis when it is in fact true.
Type II error is the error of failing to reject a null hypothesis when it is actually not
true.
These two types of risks are treated in different ways: it is always desired to have
the probability of type I error, denoted by α, be as small as possible. On the other
hand, since it is ideal that a test of significance rejects a null hypothesis when it is
false, it is desired to have the probability of type II error, denoted by β, as small as
possible too.
However, the null hypothesis H_0 cannot be rejected as soon as the sample deviates from it. Deviations may
occur even if H_0 is true, for example through an unfavourable sample. Only when
the deviation exceeds a certain critical value is it said to be statistically significant,
or simply significant, and therefore one rejects the null hypothesis H_0.
The question is now: how to decide whether or not the deviation is significant? We
fail to reject the null hypothesis H0 as long as the estimated confidence interval for μ
of the same sample contains the hypothetical value μ0 (this constitutes a connection
between hypothesis testing and confidence intervals).
The Critical Region The critical region (or the region of rejection) is the set of values
of x̄ that cause the rejection of the null hypothesis H0 . It can be determined with the
distribution of the sample mean x̄. The probability of a type I error (also called the
significance level of a test), i.e. the probability of rejecting the null hypothesis H_0
although it is true, should be at most α:

P(\bar{X} \in \text{critical region} \mid H_0) \le \alpha.
The construction of the critical region depends on the type of test one conducts
(two-sided or one-sided).
Two-Sided Tests Recall the hypotheses
H_0: μ = μ_0 vs H_1: μ ≠ μ_0.
In a test for the mean, a natural criterion for judging whether the observations favour
H0 or H1 is the size of the deviation of the sample mean x̄ from the hypothetical value
μ_0, i.e. (x̄ − μ_0). Under the null hypothesis H_0, (X̄ − μ_0) ∼ N(0, σ²/n). Under the
alternative hypothesis H_1, (X̄ − μ_0) ∼ N(μ_1 − μ_0, σ²/n), where μ_1 − μ_0 ≠ 0.
Large values of the test criterion (x̄ − μ0 ) can cause the rejection of H0 in favour
of H_1. However, there is no exact answer as to how large these values should be. The larger
the value of (x̄ − μ_0) required to reject H_0, the smaller the probability of a type I
error α, but the higher the probability of a type II error β. Larger samples minimise
both errors, but may be difficult to obtain and therefore inefficient. In practice, the
critical value is determined so that the probability of type I errors α is 0.05. This
is called a test at the 5% level. Sometimes a level of 1% is chosen when incorrect
rejection of the null hypothesis H0 is considered as a serious mistake.
1. Under H_0, standardise the test criterion:

Z = \sqrt{n}\,\frac{\bar{X} - \mu_0}{\sigma} \sim N(0, 1).
Recall that in a two-sided test, only the absolute value |z| is relevant, since there
is no reference to the sign of (x̄ − μ_0) in the alternative hypothesis H_1.
2. Reject the null hypothesis H_0: μ = μ_0 if z, as a realization of Z, fulfills

z \in (-\infty, -z_{1-\alpha/2}) \cup (z_{1-\alpha/2}, +\infty),

that is to say |z| > z_{1-\alpha/2}. The value z_{1-\alpha/2} is determined in such a way that

\Phi(z_{1-\alpha/2}) = 1 - \frac{\alpha}{2},

where \Phi is the cdf of N(0, 1). It is important to note that the critical region of a
two-sided test is symmetric, with a probability of α/2 on each side. Thus one rejects
the null hypothesis H_0: μ = μ_0 if

\bar{x} \in \text{critical region} \equiv \left(-\infty,\ \mu_0 - z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\right) \cup \left(\mu_0 + z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}},\ +\infty\right).
It is easy to see the connection between the two-sided hypothesis testing μ = μ0 and
the 100 · (1 − α)%-confidence interval for μ. According to the rule of hypothesis
rejection, H0 : μ = μ0 fails to be rejected by a two-sided test when
\bar{x} \in \left[\mu_0 - z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}},\ \mu_0 + z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\right],

which is equivalent to

\mu_0 \in \left[\bar{x} - z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\right],

i.e. H_0 fails to be rejected exactly when μ_0 is covered by the 100 · (1 − α)%-confidence interval for μ.
One-Sided Tests Unlike in a two-sided test, the critical region of a one-sided test is
not symmetric. Thus, the hypotheses do not concern a single discrete value, but are
expressed as
H0 : μ ≤ μ0 vs H1 : μ > μ0 .
It is often needed when a new treatment (e.g. scholarships for disadvantaged students)
is of no interest unless it is superior to the standard treatment (no scholarships). Thus,
the null hypothesis can be expressed as H0 : the average grade of disadvantaged
students does not increase after they receive scholarships. The rule is to reject H0
when
z = \sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma} > z_{1-\alpha}.
The critical region is the interval (z 1−α , +∞). The area under the bell curve ϕ(·)
within this interval is α, too.
Inversely, when the investigator is interested to know whether or not the population
mean is smaller than a certain value μ0 , the hypotheses are
H0 : μ ≥ μ0 vs H1 : μ < μ0 .
The next step is to convert the test criterion under H_0 into a standard normal
variable.
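The numerical example referred to below is not reproduced in the text; the computation can be sketched as follows for the nhtemp data, with a purely hypothetical μ0 and σ:

> mu0    = 52                                        # hypothesised mean (assumption)
> sigma0 = 1.2                                       # "known" sigma (assumption)
> n      = length(nhtemp)
> z      = sqrt(n) * (mean(nhtemp) - mu0) / sigma0   # test criterion
> c(two.sided = qnorm(0.975), one.sided = qnorm(0.95))  # critical values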
Depending on the type of test, one uses different decision rules for rejecting or
not rejecting the null hypothesis H0 :
1. for the two-sided test, one rejects H0 at α = 0.05 since |z| > z 0.975 = 1.96,
2. for the left-sided test, one rejects H0 at α = 0.05 since z < z 0.05 = −1.64,
3. for the right-sided test, one cannot reject H0 at α = 0.05 since z ≯ z 0.95 = 1.64.
When σ is unknown, it is replaced by the sample standard deviation s. One rejects H_0: μ = μ_0 if

t = \sqrt{n}\,\frac{\bar{x} - \mu_0}{s} \in \text{critical region} \equiv (-\infty, -t_{n-1,\,1-\alpha/2}) \cup (t_{n-1,\,1-\alpha/2}, +\infty),

that is to say |t| > t_{n-1,\,1-\alpha/2}. The value t_{n-1,\,1-\alpha/2} is determined so that

P(T \le t_{n-1,\,1-\alpha/2}) = 1 - \frac{\alpha}{2}, \quad \text{where } T \sim t_{n-1}.
One-Sided Tests The rules for hypothesis testing in this case are analoguous to those
used when σ is known.
H0 is rejected if
t = \sqrt{n}\,\frac{\bar{x} - \mu_0}{s} > t_{n-1,\,1-\alpha} \quad (\text{respectively } t < -t_{n-1,\,1-\alpha}).
Hypothesis Testing Using p-Values The critical regions depend on the distribution
of the test statistics and on the probability of a type I error α. This makes manual
testing inconvenient, since for every new value of α the critical region has to be
recomputed. To overcome this problem most of the tests in R can be performed using
the concept of the p-value.
Definition 5.14 The p-value is the probability of obtaining a test criterion at least
as large as the observed one, assuming that the null hypothesis is true.
For continuous distributions of the test criterion, the p-value can be determined as
the value α* for which the test criterion coincides exactly with the boundary of the
critical region. For the one-sided test for the mean with H_0: μ ≤ μ_0, this implies
\sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma} = z_{1-\alpha^*}.
Solving for α* leads to

\text{p-value} = 1 - \Phi\left(\sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma}\right).
Analogously, for the one-sided test with H_0: μ ≥ μ_0,

\text{p-value} = \Phi\left(\sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma}\right).
In the case of the two-sided test, it can be shown using a similar logic that

\text{p-value} = 2 - 2\,\Phi\left(\sqrt{n}\,\frac{|\bar{x} - \mu_0|}{\sigma}\right).
In the case of unknown σ, the cdf Φ is replaced with the cdf of the t-distribution with
n − 1 degrees of freedom. For more complicated tests and distributions, the p-value
should be determined individually.
The decision regarding the rejection of the null hypothesis is made using a simple
scheme:
• the null hypothesis is rejected if the p-value is smaller than the prespecified sig-
nificance level α;
• the null hypothesis is not rejected if the p-value is equal to or larger than α.
This decision rule is independent of the type of the test and the distribution of the
test criterion, allowing for quick testing with different levels of α.
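Continuing the sketch above, the p-values of Definition 5.14 can be computed by hand (again with hypothetical μ0 and σ):

> n      = length(nhtemp)
> mu0    = 52; sigma0 = 1.2                        # hypothetical values (assumptions)
> z      = sqrt(n) * (mean(nhtemp) - mu0) / sigma0
> 2 - 2 * pnorm(abs(z))   # two-sided p-value
> 1 - pnorm(z)            # p-value for H0: mu <= mu0
> pnorm(z)                # p-value for H0: mu >= mu0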
Using t.test() Hypothesis testing of the mean with unknown variance via Student's
t test is available in R through the function t.test(). In the next example, the following hypotheses are tested for the nhtemp data:
H_0: μ = 50 vs H_1: μ ≠ 50.
Under the assumption that the standard deviation σ is unknown, one uses t.test().
> t.test(x = nhtemp,
+ alternative = "two.sided", # two-sided test
+ mu = 50, # for mu = 50
+ conf.level = 0.95) # at level 0.95
data: nhtemp
t = 7.0996, df = 59, p-value = 1.835e-09
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
50.83306 51.48694
sample estimates:
mean of x
51.16
As can be seen from the listing above, beside the test statistics, the function
t.test() returns the confidence intervals and the sample estimates. The hypothesis
testing above leads to the rejection of the null hypothesis H_0: μ = 50 at the 95%-
confidence level. Obviously, in this two-sided test, the hypothetical value μ_0 = 50
lies outside the 95%-confidence interval. The absolute value of Student's t statistic
is greater than the critical value t_{59,\,1-0.05/2}, since the p-value is much smaller than
α = 5%.
To conduct a one-sided test, the argument alternative must be changed into
less or greater.
> t.test(x = nhtemp,
+ alternative = "less", # one-sided test
+ mu = 50, # for mu < 50
+ conf.level = 0.95) # at level 0.95
data: nhtemp
t = 7.0996, df = 59, p-value = 1
alternative hypothesis: true mean is less than 50
95 percent confidence interval:
-Inf 51.43304
sample estimates:
mean of x
51.16
Thus, for the hypothetical value μ_0 = 50 and the alternative "less", the p-value is close
to one, so the null hypothesis H_0: μ ≥ μ_0 cannot be rejected at the 5% significance level.
Testing σ 2 of a normal population
It is also interesting to see whether the population variance has a certain value of σ02 .
The question is then: how to construct confidence intervals for σ 2 from the estimator
s 2 ? How to test hypotheses about the value of σ 2 ? With a little modification of s 2 ,
one can answer these questions by looking at a rv that follows a χ2ν distribution.
If X 1 , ..., X n are i.i.d. random normal variables, then using Definition 4.13 of the
χ2 distribution we obtain
\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}, \quad \text{with } S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.
Confidence Intervals for σ² Let Y ∼ χ²_ν. Now, choose χ²_{ν,α/2} and χ²_{ν,1−α/2} such that P(χ²_{ν,α/2} < Y < χ²_{ν,1−α/2}) = 1 − α.
Since Y = \frac{\nu S^2}{\sigma^2}, it is easy to show that

P\left(\chi^2_{\nu,\,\alpha/2} < \frac{\nu S^2}{\sigma^2} < \chi^2_{\nu,\,1-\alpha/2}\right) = 1 - \alpha,
which is equivalent to

P\left(\frac{\nu S^2}{\chi^2_{\nu,\,1-\alpha/2}} < \sigma^2 < \frac{\nu S^2}{\chi^2_{\nu,\,\alpha/2}}\right) = 1 - \alpha.
This is the general formula for a two-sided 100 · (1 − α)%-confidence limit for σ 2 .
The number of degrees of freedom is ν = n − 1 if s 2 is computed from a sample of
size n.
Testing for σ 2 This situation occurs, for example, when a theoretical value of σ 2 is
to be tested or when the sample data are being compared to a population whose σ 2
is known. If the hypotheses are H_0: σ² = σ_0² vs H_1: σ² ≠ σ_0², then reject H_0 if
y = \frac{\nu s^2}{\sigma_0^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma_0^2} > \chi^2_{\nu,\,1-\alpha/2} \quad \text{or} \quad y < \chi^2_{\nu,\,\alpha/2}.
The rejection rule of the null hypothesis H0 using this proxy variable is analogous
to the test of a mean when σ is known.
Test for equal means μ1 = μ2 of two independent samples
In comparative studies, one is interested in the differences between effects rather than
the effects themselves. For instance, it is not the absolute level of sugar concentration
in blood reported for two types of diabetes medication that is of interest, but rather
the difference between the levels of sugar concentration. One of many aspects of
comparative studies is comparing the means of two different populations.
Consider two samples {xi,1 }i∈{1,...,n 1 } and {x j,2 } j∈{1,...,n2 } , independently drawn
from N(μ1 , σ12 ) and N(μ2 , σ22 ) respectively. The two-sample test for the mean is as
follows:
H_0: μ_1 − μ_2 = δ_0 vs H_1: μ_1 − μ_2 ≠ δ_0.
Definition 5.15 Under the assumption of independent rvs, the variance of the dif-
ference between the sample means is defined as
\sigma^2_{\bar{x}_1 - \bar{x}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.
Furthermore, this variance can be estimated and used to construct the test statistics
later on. The estimation of σx̄21 −x̄2 depends on the assumptions about σ1 and σ2 .
1. When both populations have the same variance σ² = σ_1² = σ_2², then σ² is esti-
mated by the unbiased pooled estimator s²_pooled:

s^2_{\text{pooled}} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.   (5.7)
Thus, the sample estimate s²_{x̄_1−x̄_2} of the population variance σ²_{x̄_1−x̄_2} is

s^2_{\bar{x}_1 - \bar{x}_2} = s^2_{\text{pooled}} \left(\frac{1}{n_1} + \frac{1}{n_2}\right).
2. When the populations have different variances σ_1² ≠ σ_2², then σ²_{x̄_1−x̄_2} is estimated
by the following unbiased estimator:

s^2_{\bar{x}_1 - \bar{x}_2} = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}.   (5.8)
Whether the first case applies can be investigated by the function var.test(),
which uses the F-distribution introduced in Sect. 4.4.3. Consider in this example
sleep, a data frame with 20 observations on 2 variables: the amount of extra sleep
after taking a drug (extra) and the control group (group).
> # test for equal variances
> var.test(sleep$extra, sleep$group,
+ ratio = 1, # hypothesized ratio of variances
+ alternative = "two.sided", # two-sided test
+ conf.level = 0.95) # at level 0.95
The null hypothesis of equal variances of the groups is rejected. This result will
be useful when testing for equal means.
Testing the Hypothesis μ_1 = μ_2 when σ² = σ_1² = σ_2²
Under this assumption, use (5.7) for the estimator s_{x̄_1−x̄_2} as follows:

s_{\bar{x}_1 - \bar{x}_2} = \sqrt{s^2_{\text{pooled}} \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.
The rejection rule for the hypotheses H_0: μ_1 = μ_2 vs H_1: μ_1 ≠ μ_2
is
|t| = \frac{|\bar{x}_1 - \bar{x}_2|}{s_{\bar{x}_1 - \bar{x}_2}} > t_{n_1+n_2-2,\ 1-\alpha/2}.
For the corresponding one-sided hypotheses, reject H_0 if

t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}} > t_{n_1+n_2-2,\ 1-\alpha} \quad \text{or} \quad t < -t_{n_1+n_2-2,\ 1-\alpha}, \text{ respectively}.
Testing the Hypothesis μ_1 = μ_2 when σ_1² ≠ σ_2²
Under this assumption, use (5.8) for s_{x̄_1−x̄_2} with approximate degrees of freedom ν. The rejection rule for the hypotheses H_0: μ_1 = μ_2 vs H_1: μ_1 ≠ μ_2 is

|t| = \frac{|\bar{x}_1 - \bar{x}_2|}{s_{\bar{x}_1 - \bar{x}_2}} > t_{\nu,\ 1-\alpha/2}.
For the corresponding one-sided hypotheses, reject H_0 if

t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}} > t_{\nu,\ 1-\alpha} \quad \text{or} \quad t < -t_{\nu,\ 1-\alpha}, \text{ respectively}.
Using oneway.test() Testing for equal means is done using the function
oneway.test. The assumption about the variances can be specified in the argu-
ment var.equal. Consider again the dataframe sleep. Suppose we want to test
whether the mean of the hours of sleep in the first group is equal to that of the second
group (H0 : μ1 = μ2 ).
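The call in the original listing is not shown; a minimal sketch is:

> oneway.test(extra ~ group, data = sleep,
+             var.equal = TRUE)    # pooled variance
> oneway.test(extra ~ group, data = sleep,
+             var.equal = FALSE)   # Welch approximation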
The test in R relies on the squared test criterion T 2 , which follows an F-distribution
with 1 and ν degrees of freedom. Using the p-value approach, one cannot reject the
null hypothesis H0 of equality of means of both groups at the 5%-level, since the
p-value > 0.05. This applies in both cases, whether the variances are equal or not.
However at the 10%-level, one rejects H0 since the p-value < 0.1.
5.3 Goodness-of-Fit Tests

Kolmogorov–Smirnov test
The Kolmogorov–Smirnov test compares a distribution function F with a hypothesised cdf G via the supremum distance D = sup_x |F(x) − G(x)|, testing

H_0: F = G vs. H_1: F ≠ G.
This test is commonly used to compare the ecdf to the assumed parametric one. The
following example shows whether the standardised log-returns r̃_{t,DAX} = (r_{t,DAX} − r̄_{DAX})/s_{r_{DAX}}
of the DAX index follow a t-distribution with k degrees of freedom, where r_{t,DAX} =
log(P_{t+1,DAX}/P_{t,DAX}) are the log-returns and P_{t,DAX} is the price at time t. Standardised log-returns have
zero mean and unit standard deviation.
H_0: r̃_{DAX} ∼ t_k vs. H_1: r̃_{DAX} ≁ t_k.
The number of degrees of freedom k is found via maximum likelihood estimation
(see Sect. 6.3.4 for estimation of copulae).
The test statistic follows the Kolmogorov distribution, which was originally tabu-
lated in Smirnov (1939). This distribution is independent of the assumed continuous
univariate distribution under the null hypothesis.
> require(stats)
> dax = EuStockMarkets[, 1] # DAX index
> r.dax = diff(log(dax)) # log-returns
> r.dax_st = scale(r.dax) # standardisation
> l = function(k, x){ # log-likelihood
+ -sum(dt(x, df = k, log = TRUE))
+ }
> k_ML = optimize(f = l, # optimize l
+ interval = c(0, 30), # range of k
+ x = r.dax_st)$minimum # retrieve optimal k
> k_ML
[1] 11.56088
> ks.test(x = r.dax_st, # test for t-dist.
+ y = "pt", # t distribution function
+ df = k_ML) # estimated df
data: r.dax_st
D = 0.063173, p-value = 7.194e-07
alternative hypothesis: two-sided
H0 can be rejected for any significance level larger than the p-value. To test against
other distributions, the parameter y should be set equal to the corresponding string
variable of the cdf, like pnorm, pgamma, pcauchy etc.
To test whether the DAX and the FTSE log-returns follow the same distribution,
one runs the following code in R.
> r.dax = diff(log(EuStockMarkets[, 1]))
> ftse = EuStockMarkets[, 4] # FTSE index
> r.ftse = diff(log(ftse)) # log-returns
> r.ftse_st = scale(r.ftse) # standardisation
> ks.test(r.dax, r.ftse) # test with raw
# log-returns
Two-sample Kolmogorov--Smirnov test
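The corresponding test with standardised log-returns, discussed below, can be run analogously (a sketch):

> ks.test(scale(r.dax), r.ftse_st)   # test with standardised log-returns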
H0 can be rejected for non-standardised log-returns, indicating that the DAX and the
FTSE log-returns do not follow the same distribution. After standardisation of the
log-returns, one can not reject H0 . Figure 5.11 illustrates these test results. The non-
standardised log-returns have different means and standard deviations. Therefore
the rejection of H0 in the first test is due to different first and second moments. This
example shows that the scaling of the variables can influence the test results.
The Kolmogorov-Smirnov test belongs to the group of exact tests, which are more
reliable in smaller samples than asymptotic tests.
Fig. 5.11 Empirical cumulative distribution functions for DAX log-returns and FTSE log-returns.
BCS_EdfsDAXFTSE
The Cramér–von Mises and Anderson–Darling tests are based on the weighted squared deviation
n \int \{F(x) - G(x)\}^2\, w(x)\, dG(x), where F(x) is replaced by the ecdf of the sample and G(x) is the cdf for the distribution
of X under H_0. The Anderson–Darling test uses different weights than the
Cramér–von Mises test.
It is necessary to use ordered statistics to conduct both tests. Let {x(1) , . . . , x(n) } be
ordered random realizations of the rv X with EX = μ and VarX = σ 2 . Furthermore,
one has to standardise these ordered realizations, z_{(i)} = (x_{(i)} − μ)/σ. In the following it is
assumed that μ = x̄ and σ = s.
For the Cramér–von Mises test, w(x) = 1 and the test statistic is
CM = \frac{1}{12n} + \sum_{i=1}^{n} \left\{\frac{2i-1}{2n} - G(z_{(i)})\right\}^2,
where n denotes the sample size. The distribution of the test statistic under H0 can be
computed in R with pCvM and qCvM according to Csörgő and Faraway (1996). In the
following code, the Cramér–von Mises test is used to check whether the standardised DAX
log-returns follow the t-distribution with the same number of degrees of freedom as
found in the previous subsection. In R the ordering of the realizations is done
automatically, but it is necessary to standardise them.
> require(goftest)
> r.dax_st = scale(diff(log(EuStockMarkets[, 1])))
> cvm.test(r.dax_st, # Cramer von Mises test
+ null = "pt", # to test for t distr
+ df = 11.56088) # degrees of freedom
data: r.dax_st
omega2 = 3.0274, p-value = 6.457e-08
Again, the null hypothesis that r̃ D AX comes from a rv that follows a t11.56088 -
distribution can be rejected.
The Anderson–Darling test sets w(x) = [G(x){1 − G(x)}]−1 , which leads to the
test statistic
A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\log\{G(z_{(i)})\} + \log\{1 - G(z_{(n+1-i)})\}\right].
The distribution of the Anderson–Darling test statistic can be obtained by pAD and
qAD. These functions are based on Marsaglia and Marsaglia (2004).
The following code uses the Anderson–Darling test to test whether the standard-
ised DAX index log-returns follow a t11.56088 -distribution.
> require(goftest)
> r.dax_st = scale(diff(log(EuStockMarkets[, 1])))
> ad.test(r.dax_st, null = "pt", df = 11.56088)
# Anderson-Darling test
Anderson-Darling test of goodness-of-fit
Null hypothesis: Student’s t distribution
with parameter df = 11.56088
data: r.dax_st
An = 17.715, p-value = 3.228e-07
The null hypothesis can be rejected for a significance level close to zero. Therefore the
standardised log-returns of the DAX-index do not follow the t11.56088 -distribution. All
three tests reject the null hypothesis of t-distributed log-returns of the DAX-index.
Shapiro–Wilk test
The Shapiro–Wilk test was developed in Shapiro and Wilk (1965). The specific form
of H0 leads to desirable efficiency properties. Especially in small samples, the power
of this test is superior to other nonparametric tests, see Razali and Wah (2011).
The rv X is tested as to whether it follows a normal distribution.
H_0: X ∼ N vs. H_1: X ≁ N.
where s̃ is the sample standard deviation and σ is the theoretically expected standard
deviation, which is calculated from the ordered statistics x(i) as follows:
c = \frac{\tau^{\top} V^{-1}}{\sqrt{\tau^{\top} V^{-1} V^{-1} \tau}}, \qquad \sigma = \sum_{i=1}^{n} c_i\, x_{(i)},   (5.10)
Under H0 , the theoretical values τ(i) for x(i) depend only on the sample size and the
position i. Under H0 , the theoretical expected variance σ 2 should be close to the
sample variance σ̂². The test statistic W is bounded by

\frac{n\, c_1^2}{n-1} \le W \le 1.
If the test statistic W is close to one, H0 can not be rejected. In this case the sample
can be regarded as a realization of a normal rv. For low values of W it is likely that
the null hypothesis is wrong and can be rejected. The distribution of the test statistic
is tabulated in Shapiro and Wilk (1965).
The next example tests whether or not the DAX log-returns follow a normal
distribution. The function shapiro.test is implemented in R as follows.
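The call in the original listing is not reproduced; a minimal sketch (the output below refers to r.dax, the DAX log-returns) is:

> r.dax = diff(log(EuStockMarkets[, 1]))  # DAX log-returns
> shapiro.test(r.dax)                     # test for normality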
data: r.dax
W = 0.9538, p-value < 2.2e-16
The null hypothesis that r.dax follows a normal distribution can clearly be rejected.
> random = rnorm(1000) + 5 # for this sample H0 is not rejected
> shapiro.test(random)
data: random
W = 0.9987, p-value = 0.7089
Jarque–Bera test
An alternative test for normality is the Jarque–Bera test. The hypotheses are the same
as for the Shapiro–Wilk test. The Jarque–Bera test considers the third and fourth
moments of the distribution. Here Ŝ is the sample skewness and K̂ the sample
excess kurtosis, see Sect. 5.1.7. Then

JB = \frac{n}{6}\left(\hat{S}^2 + \frac{\hat{K}^2}{4}\right).
This test uses the results of Chap. 4 for the moments of the normal distribution, for
which skewness and excess kurtosis should both be zero. Two parameters are esti-
mated to compute the test statistic, therefore the statistic follows the χ²_2 distribution.
There is an implementation for the Jarque–Bera test in R, which requires the
package tseries.
> require(tseries)
> r.dax = diff(log(EuStockMarkets[, 1]))
> jarque.bera.test(r.dax) # by default H0: X ~ N(mu, sigma^2)
Jarque-Bera Test
data: r.dax
X-squared = 3149.641, df = 2, p-value < 2.2e-16
The p-values provided by R for daily DAX log-returns are identically small for
the two tests, but this is not true for the truly normal sample. In general, inference
with both tests might lead to different conclusions.
Most of these tests are also provided by the package fBasics and func-
tions ksnormTest, shapiroTest, jarqueberaTest, jbTest, adTest
and cvmTest.
The Kolmogorov–Smirnov test assumes an interval or ratio scale for the variable of
interest. Wilcoxon (1945) developed two tests that also work for ordinal data: the
Wilcoxon signed rank and rank sum tests. The latter is also known as the Mann–
Whitney U test.
The Wilcoxon signed rank test is an asymptotic test for the median x̃0.5 of the
sample {x1 , . . . , xn }, see Sect. 5.1.5.
The hypotheses are H_0: x̃_{0.5} = a vs. H_1: x̃_{0.5} ≠ a, where a is an assumed value. For two samples {x_{1,1}, . . . , x_{1,n_1}} and {x_{2,1}, . . . , x_{2,n_2}}
with sample sizes n_1 and n_2, the hypotheses are that both samples come from the same population (H_0) against the alternative that they do not (H_1).
The algorithm of the Wilcoxon signed rank test for two samples can be written as
follows:
1. Randomly draw n s = min(n 1 , n 2 ) observations from the larger sample;
2. Calculate si = sign(x1,i − x2,i ) and di = |x1,i − x2,i | for the paired samples;
3. Compute the ranks R_i of the d_i, in ascending order starting from 1;
4. Then the test statistic is W = \left|\sum_{i=1}^{n_s} s_i R_i\right|.
The test statistic has the asymptotic distribution

W \xrightarrow[n_s \to \infty]{L} N(0.5, \sigma_W^2),   (5.11)

with \sigma_W^2 = n_s(n_s + 1)(2n_s + 1)/6.
Thus the Wilcoxon signed rank test checks whether two samples come from the same
population, in which case the mean of the weighted sign() operator is 0.5, just as for
a fair coin toss. If the statistic is close to 0.5, positive and negative differences are
equally likely. The second set can also be a constant vector a · 1_n if one wants to test against a
specific constant a.
Note the test statistic follows the normal distribution only asymptotically. How-
ever, ranks are ordinal and not metric, therefore the assumption of a normal distrib-
ution is not appropriate in finite and small samples. It is necessary to correct the test
statistic for continuity, which is done by default in R.
Consider as an example the popularity of American presidents in the past. For
this we use the dataset presidents and denote the sample by {x1 , . . . , xn }, which
contains quarterly approval ratings in percentages for the President of the United
States from the first quarter of 1945 to the last quarter of 1974. To verify that the
median rating is at least 50, the following hypotheses are tested: H_0: x̃_{0.5} ≤ 50 vs. H_1: x̃_{0.5} > 50.
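The original listing is not shown; a minimal call (the continuity correction is applied by default) is:

> wilcox.test(presidents, mu = 50,
+             alternative = "greater")   # H1: median rating above 50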
The p-value turns out to be close to zero and we can reject the hypothesis that the
true value of the approval ratings is at most 50.
Unlike the signed rank test, the Mann–Whitney U test can also be used for non-
paired data. Let F_1 and F_2 denote the distributions of two variables; the hypotheses are
H_0: F_1 = F_2 vs. H_1: F_1 ≠ F_2.
The core idea of the test is that half the maximal possible sum of ranks is deducted
from the actual sum of ranks. If both samples are from the same distribution, this
statistic should be close to n_1 n_2 / 2.
Now one may ask the question whether President Nixon’s popularity was signif-
icantly lower than that of his predecessors. The dataset is split into two parts: one
containing the realizations from the previous presidents and another set for President
Nixon:
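The split and the test are not shown in the text; a sketch, assuming the split at the first quarter of 1969 when Nixon took office, is:

> nixon  = window(presidents, start = c(1969, 1))  # Nixon's quarters
> others = window(presidents, end = c(1968, 4))    # his predecessors
> wilcox.test(others, nixon,
+             alternative = "greater")             # Mann-Whitney U test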
The hypothesis that the medians are equal can clearly be rejected, because the
obtained p-value is smaller than 5%. The difference between the sample medians is
too big for the distributions to be considered equal. Nixon, with a median approval
rating of 49%, was significantly less popular than other presidents, with 61%.
The tests discussed above considered two samples. They can not be used to check for
the equality of more than two samples. Consider a test that rejects the null hypothesis
of pairwise equal distributions for three variables X , Y and Z if at least for one pair
a two-sample test rejects the equality of distributions at the significance level α.
It would be wrong to assume that the joint significance level of this procedure is α = 0.05:
under the null hypothesis, the probability that all three pairwise comparisons favour
equality (between X and Y, Y and Z, and X and Z) is the probability of not rejecting three
times, (1 − 0.05)³ ≈ 0.86, so the probability of at least one false rejection is
1 − (1 − 0.05)³ ≈ 0.14, which means an overall α of 0.14 and not 0.05.
Kruskal and Wallis (1952) developed an extension of the Mann–Whitney U test
that solves this problem. The null hypothesis is rejected if at least one sample dis-
tribution has a different mean than the other distributions. Let l = {1, . . . , m} be an
index of the considered samples and a be a constant,
where n = \sum_{l=1}^{m} n_l, R̄_l = \sum_{j=1}^{n_l} R_{jl}/n_l denotes the average of all ranks allocated
within sample l, R̄ = \sum_{j=1}^{n_l}\sum_{l=1}^{m} R_{jl}/n is the overall average of all ranks in all
samples, and R_{jl} is the rank of observation j of sample l within the pooled sample.
The test statistic follows the χ²_k distribution, where k = m − 1. In the following
example we compare again the popularity of the ruling president for every single
decade from the first quarter of 1945 to the last quarter of 1974. We define a variable
of starting points for each group. A Kruskal–Wallis test for the null hypothesis that
the popularity did not change significantly is executed by
> decades = c(rep(1, length.out = 20), # group indicator for decades
+ rep(2:3, each = 40),
+ rep(4, length.out = 20))
> kruskal.test(presidents, decades) # Kruskal-Wallis test
Over the decades, the popularity of the presidents varies significantly. This means
that the null hypothesis of equal locations of the distributions R̄ = R̄l , for all l can
be rejected at a significance level close to zero.
Chapter 6
Multivariate Distributions
The preceding chapters discussed the behaviour of a single rv. This chapter introduces
the basic tools of statistics and probability theory for multivariate analysis, where
the relations between d rvs are considered. At first we present the basic tools of
probability theory used to describe a multivariate rv, including the marginal and
conditional distributions and the concept of independence.
The normal distribution plays a central role in statistics because it can be viewed
as an approximation and limit of many other distributions. The basic justification for
this relies on the central limit theorem. This is done in the framework of sampling
theory, together with the main properties of the multinormal distribution.
However, a multinormal approximation can be misleading for data which is not
symmetric or has heavy tails. The need for a more flexible dependence structure and
arbitrary marginal distributions has led to the wide use of copulae for modelling and
estimating multivariate distributions.
The order of differentiation of the cdf is irrelevant, i.e.

\frac{\partial^d}{\partial x_1 \cdots \partial x_d} F(x) = \frac{\partial^d}{\partial x_{i_1} \cdots \partial x_{i_d}} F(x),

for any permutation (i_1, . . . , i_d) of (1, . . . , d).
If we partition (X_1, . . . , X_d) as X^*_k = (X_{i_1}, . . . , X_{i_k}) ∈ R^k and X^*_{-k} = (X_{i_{k+1}}, . . . , X_{i_d}) ∈ R^{d−k}, then the function defined by

F_{X_k}(x_{i_1}, . . . , x_{i_k}) = F(x_{i_1}, . . . , x_{i_k}, \infty, . . . , \infty)

is called the k-dimensional marginal cdf; it is equal to F evaluated at (x_{i_1}, . . . , x_{i_k})
with x^*_{-k} set to infinity. For continuous variables, the marginal pdf can be computed
from the joint density by "integrating out" the irrelevant variables:

f_{X_k}(x_{i_1}, . . . , x_{i_k}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, . . . , x_d)\, dx_{i_{k+1}} \cdots dx_{i_d}.
The conditional pdf of X_2 given X_1 = x_1 is

f(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)}.
A random sample of size n from X is arranged in an (n × d) data matrix X = (x_{ij}), where x_{ij} is the ith realisation of the jth element of the random vector X. The idea
of statistical inference for a given random sample is to analyse the properties of the
population variable X . This is typically done by analysing some characteristics of
its distribution.
6.1.1 Moments
Expectation
E (αX + βY ) = α E X + β E Y, (6.1)
E(AX ) = A E X.
For independent X and Y, E(X Y) = E X E Y.
rowMeans and colMeans calculate the averages by rows and columns of the
matrix, or data frame respectively.
> rowMeans(women.m)                  # averages by rows
[1]  86.5  88.0  90.0  92.0  94.0  96.0  98.0
[8] 100.0 102.5 104.5 107.0 109.5 112.0 115.0 118.0
> colMeans(women.m)                  # averages by columns
  height   weight
 65.0000 136.7333
Covariance matrix
The matrix

\Sigma = E(X - \mu)(X - \mu)^{\top}

is the (theoretical) covariance matrix, also called the centred second moment. It is
positive semi-definite, i.e. Σ ≥ 0, with elements Σ = (σ_{X_i X_j}). The off-diagonal
elements are σ_{X_i X_j} = Cov(X_i, X_j) and the diagonal elements are σ_{X_i X_i} = Var(X_i),
i, j = 1, . . . , d, where

Cov(X_i, X_j) = E(X_i X_j) − μ_i μ_j, \qquad Var\, X_i = E X_i^2 − μ_i^2.
Writing X ∼ (μ, Σ) means that X is a random vector with mean vector μ and covariance matrix Σ. The variance of a linear transformation of the variables satisfies

Var(AX) = A\, Var(X)\, A^{\top} = \sum_{i,j} a_i a_j^{\top} \sigma_{X_i X_j},
Var(AX + b) = A\, Var(X)\, A^{\top},   (6.2)

and the covariance matrix of two random vectors X and Y is

Cov(X, Y) = E(X Y^{\top}) − \mu_X \mu_Y^{\top} = E(X Y^{\top}) − E X\, E Y^{\top}.
The empirical counterparts, computed from the data matrix X, are the biased estimator

\tilde{\Sigma} = n^{-1} X^{\top}X - \bar{x}\bar{x}^{\top},   (6.3)

and the unbiased estimator

\hat{\Sigma} = \frac{1}{n-1} X^{\top}X - \frac{n}{n-1}\bar{x}\bar{x}^{\top}.   (6.4)

Equation (6.4) can be written equivalently in scalar form or based on the centering
matrix H = I_n − n^{-1} 1_n 1_n^{\top}:

\hat{\Sigma} = (n-1)^{-1}\left(X^{\top}X - n^{-1} X^{\top} 1_n 1_n^{\top} X\right) = (n-1)^{-1} X^{\top} H X.   (6.5)
These formulas are implemented directly in R. The function cov returns the empirical
covariance matrix of the given sample matrix X . Its argument could be of type
data.frame, matrix, or consist of two vectors of the same size. The following
code presents possible calculations of Σ̂:
> women.m = as.matrix(women)
> n = dim(women.m)[1]; n
[1] 15
> meanw = colMeans(women)
> cov1 =                                           # using (6.4)
+   (t(women.m) %*% women.m - n * meanw %*% t(meanw)) / (n - 1)
> H    = diag(1, n) - 1 / n * rep(1, n) %*% t(rep(1, n))
> cov2 = t(women.m) %*% H %*% women.m / (n - 1)    # using (6.5)
> cov3 = cov(women)                                # for data.frame
> cov4 = cov(women.m)                              # for matrix
As expected, all the matrices cov1, cov2, cov3 and cov4 return the same result.
       height   weight
height     20  69.0000
weight     69 240.2095
The internal function cov is twice as fast as manual methods with or without a
predetermined centred matrix, and independent of sample size.
If the arguments of cov are two vectors x and y, then the result is their covariance.
> cov(women$height, women$weight)
[1] 69
The (theoretical) correlation between X_i and X_j is defined as

\rho_{X_i, X_j} = \frac{Cov(X_i, X_j)}{\sqrt{Var\, X_i\, Var\, X_j}} = \frac{\sigma_{X_i, X_j}}{\sigma_{X_i} \sigma_{X_j}}.
The calculation of the sample correlation in R is done by the function cor,
which is similar to cov. The covariance matrix Σ̂ may be converted into a correlation
matrix using the function cov2cor.
> cor(women)
> cov2cor(cov(women))
          height    weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
The linear correlation is sensitive to outliers, and is invariant only under strictly
increasing linear transformations. Alternative rank correlation coefficients are Kendall's τ and Spearman's ρ_S, the latter defined as

\rho_S = \frac{Cov\{F_1(X_1), F_2(X_2)\}}{\sqrt{Var\{F_1(X_1)\}\, Var\{F_2(X_2)\}}}.
Both rank-based correlation coefficients are invariant under strictly increasing trans-
formations and measure the ‘average dependence’ between X 1 and X 2 . The empirical
τ̂ and ρ̂ S are calculated by
\hat{\tau} = \frac{4}{n(n-1)}\, P_n - 1,   (6.6)

\hat{\rho}_S = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2\, \sum_{i=1}^{n} (S_i - \bar{S})^2}},   (6.7)
where P_n is the number of concordant pairs, i.e., the number of pairs (x_{1k}, x_{2k}) and
(x_{1m}, x_{2m}) of points in the sample for which (x_{1k} − x_{1m})(x_{2k} − x_{2m}) > 0.
The R_i and S_i in (6.7) are the positions of the observations in the list of all observations
sorted by size (their statistical ranks). These two correlation coefficients are implemented
by the function cor, using the parameter method. In the following listing we apply the
cor function to the dataset cars, which contains the speed of cars and the distances
taken to stop (the data were recorded in the 1920s).
> cor(cars$speed, cars$dist)
[1] 0.8068949
> cor(cars$speed, cars$dist, method = "kendall")
[1] 0.6689901
> cor(cars$speed, cars$dist, method = "spearman")
[1] 0.8303568
Fig. 6.1 Linear fit for the linearly correlated data with an outlier (left) and almost perfectly depen-
dent monotone transformed data (right). BCS_CopulaInvarOutlier
For the linearly correlated data with an outlier in the left panel of Fig. 6.1, the rank-based
ρ_S and τ are equal to 0.98 and 0.88, respectively. The same is observed on the right side of
Fig. 6.1: the nonlinear but monotone transformation of the almost perfectly dependent
data results in ρ = 0.892, but ρ_S = 0.996 and τ = 0.956.
6.2 The Multinormal Distribution

The multivariate normal distribution is one of the most widely used multivariate
distributions. A random vector X is said to be normally distributed with mean μ and
covariance Σ > 0, or X ∼ N_d(μ, Σ), if it has the following density function:
f(x) = |2\pi\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right\}.   (6.8)
As with the univariate normal distribution, the multinormal distribution does not
have an explicit form for its cdf, i.e. \Phi(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_d} f(u)\, du.
The multivariate t-distribution is closely related to the multinormal distribution.
If Z ∼ N_d(0, I_d) and Y² ∼ χ²_m are independent rvs, a t-distributed rv T with m
degrees of freedom can be defined by

T = \sqrt{m}\, \frac{Z}{Y}.
Moreover, the multivariate t-distribution belongs to the family of d-dimensional
spherical distributions, see Fang and Zhang (1990).
R offers several independent packages for the multinormal and multivariate t
distributions, namely fMultivar by Wuertz et al. (2009b), mvtnorm by Genz
and Bretz (2009) and Genz et al. (2012), and mnormt by Genz and Azzalini (2012).
From (6.8) and Fig. 6.2, one sees that the density of the multinormal distribution is
constant on ellipsoids of the form

(x - \mu)^{\top} \Sigma^{-1} (x - \mu) = a^2.   (6.9)

The half-lengths of the axes of the contour ellipsoid are \sqrt{a^2 \lambda_i}, where λ_i are the
eigenvalues of Σ. If Σ is a diagonal matrix, the rectangle circumscribing the contour
ellipse has sides of length 2aσ_i and is thus naturally proportional to the standard
deviations of X_i (i = 1, 2).
The distribution of the quadratic form in (6.9) is given in the next theorem.
Simulation techniques, see Chap. 9, are implemented in several packages. The fol-
lowing code demonstrates the simplest case, d = 2 using mvtnorm.
> lsigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)
> lmu    = c(0, 0)
> # set.seed(2^11 - 1)                     # set the seed, see Chapter 9
> rmvnorm(5, mean = lmu, sigma = lsigma)   # sample 5 observations
           [,1]       [,2]
[1,] -0.1508041 -0.7374920
[2,]  1.0681719  0.3484549
[3,] -0.3611665 -0.3819258
[4,]  0.8141042  0.5995360
[5,]  1.4857324  1.4820885
Fig. 6.4 Sample X from N3 , with ρ12 = 0.2, ρ13 = 0.5, ρ23 = 0.8 and n = 1500. Plots for
estimated marginal distributions of X 1,2,3 are in the upper triangle and contour ellipsoids for the
bivariate normal densities in the lower triangle of the matrix. BCS_NormalCopulaContour
The package mvtnorm is the only package with a method for calculating the
quantiles of the multinormal distribution, namely qmvnorm. Table 6.1 lists and
compares the methods from all these packages (Fig. 6.4). If μ = 0 and Σ = I_d, then the distribution is called the standard multinormal distribution.
A random vector with arbitrary mean μ and covariance Σ can be obtained from Y ∼ N_d(0, I_d) by the transformation

X = \Sigma^{1/2} Y + \mu.
Using (6.1) and (6.2), we can verify that E(X) = μ and Var(X) = Σ. The following theorem is useful because it presents the distribution of a linearly transformed
variable.
Theorem 6.3 Let X ∼ N_d(μ, Σ), A (p × p) and b ∈ R^p, where A is non-singular.
Then Y = AX + b is again a p-variate normal, i.e.

Y ∼ N_p(A\mu + b,\ A \Sigma A^{\top}).
Statistical inference often requires more than just the mean or the variance of a
statistic. We need the sampling distribution of a statistic to derive confidence inter-
vals and define rejection regions in hypothesis testing for a given significance level.
Theorem 6.4 gives the distribution of the sample mean for a multinormal population.
Remark 6.1 One may wonder how large n should be in practice to provide reasonable
approximations. There is no definite answer to this question. It depends mainly on
the problem at hand, i.e. the shape of the distribution and the dimension of X i . If the
X_i are normally distributed, the normality of the sample mean x̄ holds even for
n = 1. However, in most situations, the approximation is valid in one-dimensional
problems for n > 50.
6.3 Copulae
This section describes modelling and measuring the dependency between d rvs using
copulae.
Definition 6.1 A d-dimensional copula is a distribution function on [0, 1]d such that
all marginal distributions are uniform on [0,1].
Copulae gained their popularity due to their applications in finance. Sklar (1959)
gave a basic theorem on copulae.
Theorem 6.6 (Sklar (1959)) Let F be a multivariate distribution function with margins F_1, . . . , F_d. Then there exists a copula C such that

F(x_1, . . . , x_d) = C\{F_1(x_1), . . . , F_d(x_d)\}, \quad x_1, . . . , x_d \in R.
where the Fi−1 are the inverse marginal distribution functions, also referred to as
quantile functions. The copula density and the density of the multivariate distribution
with respect to the copula are
∂ d C(u)
c(u) = , u ∈ [0, 1]d , (6.13)
∂u 1 , . . . , ∂u d
d
f (x1 , . . . , xd ) = c{F1 (x1 ), . . . , Fd (xd )} f i (xi ), x1 , . . . , xd ∈ R. (6.14)
i=1
In the multivariate case, the copula function is invariant under monotone transfor-
mations.
The estimation and calculation of probability distributions and goodness-of-fit
tests are implemented in several R packages: copula see Yan (2007), Hofert and
Maechler (2011) and Kojadinovic and Yan (2010), fCopulae see Wuertz et al.
(2009a), fgac see Gonzalez-Lopez (2009), gumbel see Caillat et al. (2008), HAC
see Okhrin and Ristig (2012), gofCopula see Trimborn et al. (2015) and sbgcop
see Hoff (2010). All of these packages have comparative advantages and disadvan-
tages.
The package sbgcop estimates the parameters of a Gaussian copula by treat-
ing the univariate marginal distributions as nuisance parameters. It also provides a
semiparametric imputation procedure for missing multivariate data.
A separate package, gumbel, provides functions only for the Gumbel–Hougaard
copula. The HAC package focuses on the estimation, simulation and visualisation of
Hierarchical Archimedean Copulae (HAC), which are discussed in Sect. 6.3.3. The
fCopulae package was developed for learning purposes. We recommend using this
package for a better understanding of copulae. Almost all the methods in this package,
like density, simulation, generator function, etc. can be interactively visualised by
changing their parameters with a slider. As fCopulae is for learning purposes,
only the bivariate case is treated, in order to ease the visualisation. The copula
package tries to cover all possible copula fields. It allows the simulation and fitting
of different copula models as well as their testing in high dimensions. As far as
we know, this is the only package that deals not only with the copulae, but with
the multivariate distributions based on copulae. In contrast to most of the other
packages, in copula one has to create an object from the classes copula or
mvdc (multivariate distributions constructed from copulae). These classes contain
information about the dimension, dependency parameter and margins, in the case of
the mvdc class. For example, in the following listing, we construct an object that
describes a bivariate Gaussian copula with correlation parameter ρ = 0.75, with
N(0, 2), and E(2) margins.
Using other methods, we can simulate, estimate, calculate and plot the distribution’s
density function. For copula modelling, we concentrate in this section on the two
packages copula and fCopulae.
The simplest copula is the independence (product) copula,

Π(u_1, …, u_d) = ∏_{i=1}^d u_i.

Two other extremes, representing perfect negative and positive dependencies, are the
lower and upper Fréchet–Hoeffding bounds,

W(u_1, …, u_d) = max( ∑_{i=1}^d u_i + 1 − d, 0 ),
M(u_1, …, u_d) = min(u_1, …, u_d), u_1, …, u_d ∈ [0, 1].
An arbitrary copula C(u 1 , . . . , u d ) lies between the upper and lower Fréchet–
Hoeffding bounds
W (u 1 , . . . , u d ) ≤ C(u 1 , . . . , u d ) ≤ M(u 1 , . . . , u d ).
As far as we know, the upper and lower Fréchet–Hoeffding bounds are not imple-
mented in any package. The reason might be that the lower Fréchet–Hoeffding bound
is not a copula function for d > 2. Using objects of the class indepCopula, one can
model the product copula using the functions dCopula, pCopula, or rCopula,
described later in this section.
Elliptical copulae
Due to the popularity of the Gaussian and t-distributions in financial applications,
the elliptical copulae have an important role. The construction of this type of copula
is based on Theorem 6.6 and its implication (6.12). The Gaussian copula and its
copula density are given by

C_Σ^{Ga}(u_1, …, u_d) = Φ_Σ{Φ^{-1}(u_1), …, Φ^{-1}(u_d)},
c_Σ^{Ga}(u_1, …, u_d) = f_Σ{Φ^{-1}(u_1), …, Φ^{-1}(u_d)} / ∏_{i=1}^d φ{Φ^{-1}(u_i)},

where Φ is the distribution function of N(0, 1), Φ^{-1} is the functional inverse of Φ,
and Φ_Σ (with density f_Σ) is the d-dimensional normal distribution function with zero
mean and correlation matrix Σ. The variances of the variables are determined by the
marginal distributions.
In the bivariate case, the t-copula and its density are given by

C_t(u_1, u_2, ν, δ) = ∫_{−∞}^{t_ν^{-1}(u_1)} ∫_{−∞}^{t_ν^{-1}(u_2)} Γ{(ν + 2)/2} / [Γ(ν/2) πν √(1 − δ²)]
 × {1 + (x_1² − 2δx_1x_2 + x_2²) / {(1 − δ²)ν}}^{−(ν+2)/2} dx_1 dx_2,

c_t(u_1, u_2, ν, δ) = f_{ν,δ}{t_ν^{-1}(u_1), t_ν^{-1}(u_2)} / [f_ν{t_ν^{-1}(u_1)} f_ν{t_ν^{-1}(u_2)}], u_1, u_2 ∈ [0, 1],

where t_ν^{-1} is the quantile function of the univariate t_ν distribution, f_ν its density and
f_{ν,δ} the density of the bivariate t-distribution with ν degrees of freedom and correlation
parameter δ.
As the package copula offers the widest range of tools for copula modelling, we pay
special attention to it. As mentioned above, one should first create
a copula object using normalCopula, tCopula, or ellipCopula.
> norm.cop = normalCopula(            # Gaussian copula
+   param   = c(0.5, 0.6, 0.7),       # cor matrix
+   dim     = 3,                      # 3 dimensional
+   dispstr = "un")                   # unstructured cor matrix
> t.cop = tCopula(                    # t copula
+   param   = c(0.5, 0.3),            # with params c(0.5, 0.3)
+   dim     = 3,                      # 3 dimensional
+   df      = 2,                      # number of degrees of freedom
+   dispstr = "toep")                 # Toeplitz structure of cor matr.
> norm.cop1 = ellipCopula(            # elliptical family
+   family  = "normal",               # Gaussian copula
+   param   = c(0.5, 0.6, 0.7),
+   dim     = 3, dispstr = "un")      # same as norm.cop
The parameter dispstr specifies the type of the symmetric positive definite matrix
characterising the elliptical copula. It can take the values ex for exchangeable, ar1
for AR(1), toep for Toeplitz, and un for unstructured. With these objects, one can
use the general functions rCopula, dCopula or pCopula for the simulation or
calculation of the density or distribution functions.
> norm.cop = normalCopula(param = c(0.5, 0.6, 0.7), dim = 3,
+                         dispstr = "un")
> # set.seed(2^11 - 1)                # set the seed, see Chapter 9
> rCopula(n = 3,                      # simulate 3 obs. from a Gaussian cop.
+         copula = norm.cop)
          [,1]      [,2]      [,3]
[1,] 0.6320016 0.3708774 0.7920201
[2,] 0.4009492 0.3540828 0.3455327
[3,] 0.8624083 0.8103213 0.9115499
> dCopula(u = c(0.2, 0.5, 0.1),       # evaluate the copula density
+         copula = norm.cop)
[1] 1.103629
> pCopula(u = c(0.2, 0.5, 0.1),       # evaluate the 3D t-copula
+         copula = t.cop)
[1] 0.04190934
Plotting the results of these functions is possible for d = 2 using the standard plot,
persp and contour methods. The following code demonstrates how to use these
methods, and the results are displayed in Fig. 6.5.
> norm.2d.cop = normalCopula(param = 0.7, dim = 2)
> # construct a 2D Gaussian copula
> plot(rCopula(1000, norm.2d.cop))    # scatterplot
> persp(norm.2d.cop, pCopula)         # 3D copula plot
> contour(norm.2d.cop, pCopula)       # copula contour curves
> persp(norm.2d.cop, dCopula)         # 3D plot of the copula density
> contour(norm.2d.cop, dCopula)       # contour curves of the density
Using the mvdc object on the base of the copula objects, one can create a mul-
tivariate distribution based on a copula by specifying the parameters of the marginal
distributions. The mvdc objects can be plotted with contour, persp or plot as
well.
Fig. 6.5 Gaussian copula. Note From top to bottom: scatterplot, distribution and density function
for the Gaussian copula with ρ = 0.7. BCS_NormalCopula
Using (6.12), one can derive the copula function for an arbitrary elliptical distrib-
ution. The problem is, however, that such copulae depend on the inverse distribution
functions and these are rarely available in an explicit form, see Sect. 6.2. Therefore,
the next class of copulae and their generalisations provide an important flexible and
rich family of alternatives to the elliptical copulae.
The first popular Archimedean copula is the so-called Frank copula, which is the
only elliptically contoured Archimedean copula (different from the elliptical
family) for d = 2, i.e. the only one satisfying the radial symmetry condition

C(u_1, u_2) = u_1 + u_2 − 1 + C(1 − u_1, 1 − u_2).
The Gumbel copula is frequently used in financial applications. Its generator and
copula functions are

φ_θ(t) = exp(−t^{1/θ}), θ ≥ 1,
C_θ(u_1, …, u_d) = exp[ −{ ∑_{j=1}^d (−log u_j)^θ }^{1/θ} ].
Consider a bivariate distribution based on the Gumbel copula with univariate extreme
value marginal distributions. Genest and Rivest (1989) showed that this distribution
is the only bivariate extreme value distribution based on an Archimedean copula.
Moreover, all distributions based on Archimedean copulae belong to its domain of
attraction under common regularity conditions. Unlike the elliptical copulae, the
Gumbel copula leads to asymmetric contour diagrams. It shows stronger linkages
between positive values. However, it also shows more variability and more mass in
the negative tail. For θ > 1, this copula allows the generation of a dependence in
the upper tail. For θ = 1, the Gumbel copula reduces to the product copula and for
θ → ∞, we obtain the Fréchet–Hoeffding upper bound.
As mentioned above, apart from the packages copula and fCopulae
(type = 4), the package gumbel, specially designed for this copula family, allows
only exponential or gamma marginal distributions.
The Clayton copula, in contrast to the Gumbel copula, has more mass in the lower
tail and less in the upper. The generator and copula function are
φ_θ(t) = (t + 1)^{−1/θ}, θ > 0,
C_θ(u_1, …, u_d) = ( ∑_{j=1}^d u_j^{−θ} − d + 1 )^{−1/θ}.
The Clayton copula is one of the few copulae whose density has a simple explicit form
for any dimension:

c_θ(u_1, …, u_d) = ∏_{j=1}^d [{1 + (j − 1)θ} u_j^{−(θ+1)}] ( ∑_{j=1}^d u_j^{−θ} − d + 1 )^{−(θ^{-1}+d)}.
As the parameter θ tends to infinity, the dependence becomes maximal, and as θ tends
to zero, we have independence. As θ goes to −1 in the bivariate case, the distribution
tends to the lower Fréchet bound. The level plots of the two-dimensional respective
densities are given in Fig. 6.6.
The nesting is admissible for φ_{d−i}^{-1} ∘ φ_{d−j} ∈ L*, i < j, where L* denotes the class of
functions that guarantees that the resulting construction is again a proper copula.
The HAC defines the whole dependency structure in a recursive way. At the lowest
level, the dependency between the first two variables is modelled by a copula function
with the generator φ_1, i.e. z_1 = C(u_1, u_2) = φ_1{φ_1^{-1}(u_1) + φ_1^{-1}(u_2)}. At the second
level, another copula function is used to model the dependency between z 1 and u 3 ,
etc. Note that the generators φi can come from the same family, differing only in
their parameters. But, to introduce more flexibility, they can also come from different
families of generators. As an alternative to the fully nested model, we can consider
copula functions with arbitrarily chosen combinations at each copula level. Okhrin
et al. (2013) provide several methods for determining the structure of the HAC from
the data.
Fig. 6.6 Contour diagrams for (from top to bottom) the Gumbel, Clayton and Frank copu-
lae with parameter 2 and Normal (left column) and t6 distributed (right column) margins.
BCS_ArchimedeanContour
The HAC package provides intuitive techniques for estimating and visualising
HAC. In accordance with the naming in the other packages, the functions dHAC,
pHAC compute the values of the pdf and cdf, and rHAC generates random vectors.
Figure 6.7 presents the scatterplot of the three-dimensional HAC-based distribution.
On the sides of the cube, one sees the shaded bivariate marginal distributions, which
differ completely from each other.
6.3.4 Estimation
Let α = (α_1, …, α_d) denote the parameters of the marginal distributions and θ the
copula parameter. The likelihood function of the copula-based multivariate distribution is

L(α, θ; x_1, …, x_n) = ∏_{i=1}^n f(x_{1i}, …, x_{di}; α_1, …, α_d, θ).
According to (6.14), the density f can be decomposed into the copula density c
and the product of the marginal densities, so that the log-likelihood function can be
written as
ℓ(α, θ; x_1, …, x_n) = ∑_{i=1}^n log c{F_1(x_{1i}; α_1), …, F_d(x_{di}; α_d); θ} + ∑_{i=1}^n ∑_{j=1}^d log f_j(x_{ji}; α_j).
In the IFM (inference for margins) method, the marginal parameters α_j are estimated
first from the univariate likelihoods. In the second step, the pseudo log-likelihood

ℓ(θ; α̂_1, …, α̂_d) = ∑_{i=1}^n log c{F_1(x_{1i}; α̂_1), …, F_d(x_{di}; α̂_d); θ},
is maximised over θ to get the dependence parameter estimate θ̂. A detailed discussion
of this method is to be found in Joe and Xu (1996). Note that this procedure does
not lead to efficient estimators, but, as argued by Joe (1997), the loss in efficiency
should be modest. The advantage of the inference for margins procedure lies in the
dramatic reduction of the computational complexity, as the estimation of the margins
is disentangled from the estimation of the copula. As a consequence, all R packages
use the separate estimation of the copula and its margins and, therefore, they focus
only on the optimisation of the copula parameter(s).
In the CML (canonical maximum likelihood) method, the univariate marginal
distributions are estimated through some non-parametric method F̂ as described
in Sect. 5.1.2. The asymptotic properties of the multistage estimators of θ do not
depend explicitly on the type of the non-parametric estimator, but on its convergence
properties. For the estimation of the copula, one should normalise the empirical cdf
not by n but by n + 1:
F̂_j(x) = (n + 1)^{-1} ∑_{i=1}^n I(x_{ji} ≤ x).
The copula parameter is then estimated by

θ̂_CML = argmax_θ ∑_{i=1}^n log c{F̂_1(x_{1i}), …, F̂_d(x_{di}); θ}.
Notice that the first step of the IFM and CML methods estimates the marginal dis-
tributions. After the estimation of the marginal distributions, a pseudosample {u i }
of observations transformed to the unit d-cube is obtained and used for the copula
estimation. As in the IFM, the semiparametric estimator θ̂ is asymptotically normal
under suitable regularity conditions.
In the two-dimensional case d = 2, one often uses the generalised method of
moments, since there is a one to one relation between the bivariate copula parameter
and Kendall’s τ or Spearman’s ρ S . For example, for Gumbel copulae, τ = 1 − 1θ ,
and for Gaussian copulae, τ = π2 arcsin ρ. One estimates this measure as (6.7) or
(6.6) and subsequently converts it to θ.
Estimation of the different copula models is implemented in a variety of packages,
such as copula, fCopulae, gumbel and HAC. The gumbel package implements
all methods of estimation for the Gumbel copula. The package fCopulae deals
only with copula functions with uniform margins, and the estimation is provided
through maximum likelihood. It estimates the parameters for all the copula families
in Nelsen (2006). The package copula is of the highest interest, since almost all
methods for the estimation of multivariate copula-based distributions,
or just the copula function, are implemented in it. To estimate a parametric copula
C, one uses the fitCopula function, which among other parameters needs the
parameter method, indicating the method that should be used in the estimation. The
parameter method can be either ml (maximum likelihood), mpl (maximum pseudo-
likelihood), itau (inversion of Kendall’s tau), or irho (inversion of Spearman’s
rho). The default method is mpl. In the following listing, we present the estimation
of the Gumbel copula parameter using different methods.
Similarly, using the fitMvdc method, one can estimate the whole multivariate
copula-based distribution together with its margins.
Chapter 7
Regression Models
— Albert Einstein
Regression models aim to find the most likely values of a dependent variable Y for
a set of possible values {xi }, i = 1, . . . , n of the explanatory variable X
yi = g(xi ) + εi , εi ∼ Fε ,
Nonparametric methods are particularly useful in fields like quantitative finance, where
the underlying distribution is in fact unknown. However, as fewer assumptions can be
exploited, this flexibility comes with the need for more data. A detailed introduction to nonparametric techniques
with the need for more data. A detailed introduction to nonparametric techniques
can be found in Härdle et al. (2004).
Y = β0 + β1 X 1 + · · · + β p X p + ε.
The variable ε is called the error term and represents all factors other than X_1, …, X_p
that affect Y. Let y = {y_i}_{i=1}^n be the vector of response variables and X =
{x_ij}_{i=1,…,n; j=1,…,p} a data matrix of p explanatory variables. In many cases, a
constant is included through x_i1 = 1 for all i in this matrix. The resulting data matrix
is denoted by X = {x_ij}_{i=1,…,n; j=1,…,p+1}. The aim is to find a good linear approxi-
mation of y using linear combinations of covariates
y = X β + ε,
where ε is the vector of errors. To estimate β, the following least squares optimisation
problem has to be solved:

β̂ = argmin_β (y − Xβ)⊤(y − Xβ),

whose explicit solution is

β̂ = (X⊤X)^{-1} X⊤y. (7.2)
This estimator is called the (ordinary) least squares (OLS) estimator. Under the
standard conditions that the errors have zero mean, E(ε_i) = 0, constant variance,
Var(ε_i) = σ², and are uncorrelated, Cov(ε_i, ε_j) = 0 for i ≠ j, and that the design
matrix X has full column rank, the OLS estimator β̂ is, by the Gauss–Markov theorem,
the best linear unbiased estimator (BLUE).
If these conditions are fulfilled, then OLS has the smallest variance in the class of
all linear unbiased estimators, with E(β̂) = β and Var(β̂) = σ²(X⊤X)^{-1}.
Additional assumptions are required to develop further inference about β̂. Under
a normality assumption, ε ∼ N(0, σ²I_n), the estimator β̂ has a normal distribution,
i.e. β̂ ∼ N{β, σ²(X⊤X)^{-1}}. In practice, the error variance σ² is often unknown, but
can be estimated by

σ̂² = {n − (p + 1)}^{-1} (y − ŷ)⊤(y − ŷ),

so that the estimated variance of β̂_j is σ̂²(β̂_j) = σ̂² (X⊤X)^{-1}_{jj},
where (X⊤X)^{-1}_{jj} is the j-th diagonal element of the matrix (X⊤X)^{-1}. It can be shown that β̂_j
and σ̂²(β̂_j) are statistically independent.
The distributional property of β̂ is used to form tests and build confidence intervals
for the vector of parameters β. Testing the hypothesis H_0: β_j = 0 is analogous to
the one-dimensional t-test, with the test statistic given by

t = β̂_j / σ̂(β̂_j). (7.3)
Under H0 , the test statistic (7.3) follows the tn−( p+1) distribution. For further reading
we refer to Greene (2003), Härdle and Simar (2015) and Wasserman (2004).
Additionally, one can test whether all independent variables jointly have no effect on the
dependent variable:

H_0: β_1 = … = β_p = 0 vs. H_1: β_k ≠ 0 for at least one k = 1, …, p.
In order to decide on the rejection of the null hypothesis, the residual sum of squares
RSS serves as a measure. According to the restrictions, under the null hypothesis the
test compares the RSS of the reduced model SS(r educed), in which the variables
listed in H0 are dropped, with the RSS of an unrestricted model SS( f ull), in which all
variables are included. In general, SS(r educed) is greater than or equal to SS( f ull)
because the OLS estimation of the restricted model uses fewer parameters. The ques-
tion is whether the increase of the RSS in moving from the unrestricted model to the
restricted model is large enough to ensure the rejection of the null hypothesis. There-
fore, the F-statistic, which addresses the difference between SS(reduced) and
SS(full), is used:

F = [{SS(reduced) − SS(full)} / {df(r) − df(f)}] / {SS(full) / df(f)}. (7.4)
Under the null hypothesis, the statistic (7.4) follows the F_{df(r)−df(f), df(f)} distribution
(see Sect. 4.4.3), where df(f) and df(r) denote the numbers of degrees of free-
dom under the unrestricted and the restricted model (df(f) = n − p − 1 and
df(r) = n − 1). Based on this F-distribution, the critical region can be chosen in
order to reject or not reject the null hypothesis.
Even in the case of well-fitted models, it is not an easy task to select the best model
from a set of alternatives. Usually, one looks at the coefficient of determination R 2 or
adjusted R 2 . These values measure the ‘goodness of fit’ of the regression equation.
They represent the percentage of the total variation of the data explained by the fitted
linear model. Consequently, higher values indicate a ‘better fit’ of the model, while
low values may indicate a poor model. R² is given by

R² = 1 − ‖y − ŷ‖² / ‖y − ȳ‖², (7.5)

with R² ∈ [0, 1]. It is important to know that R² always increases with the number
of explanatory variables added to the model, even if they are irrelevant.
The adjusted R² is a modification of (7.5) which takes the number of explana-
tory variables used into account, and is given by

R²_adj = 1 − (1 − R²)(n − 1) / {n − (p + 1)}. (7.6)
Note that R²_adj can be negative. However, the coefficients of determination (7.5) and
(7.6) are not always the best criteria for choosing the model. Other popular criteria
to choose the regression model are Mallows' C_p, the Akaike Information Criterion
(AIC) and the Bayesian Information Criterion (BIC).
Mallows' C_p is a model selection criterion which uses the residual sum of squares,
but penalises for the number of unknown parameters, like R²_adj. It is given by

C_p = ‖y − ŷ‖² / σ̂² − n + 2(p + 1),

where σ̂² is the error variance estimate obtained from the full model.
The AIC is defined as

AIC = n log σ̂² + 2(p + 1),

where p is the number of parameters and σ̂² is the estimate of the error variance that
maximises the likelihood function L. The second term is a penalty, as in Mallows' C_p.
The last information criterion discussed here is the BIC, defined as

BIC = n log σ̂² + log(n)(p + 1).
There is no rule of thumb determining which criterion to use. In small samples, all
criteria give similar results. Since the BIC penalty log(n) exceeds the AIC penalty of 2
for n ≥ 8, BIC will have a tendency to select more parsimonious models.
Stepwise regression builds the model from a set of candidate predictor variables
by entering and removing predictors in a stepwise manner. One can perform for-
ward or backward stepwise selection using the step function or stepAIC from
the MASS package. Both functions perform stepwise model selection by exact AIC.
The stepAIC function is preferable, because it is applicable to more model types,
e.g. nonlinear regression models, apart from the linear model while providing the
same options. Forward selection starts by choosing the independent variable which
explains the most variation in the dependent variable. It then chooses the variable
which explains most of the remaining residual variation and recalculates the regres-
sion coefficients. The algorithm continues until no further variable significantly
explains the remaining residual variation. Another similar selection algorithm is
backward selection, which starts with all variables and excludes the most insignifi-
cant variables one at a time, until only significant variables remain. A combination
of the two algorithms performs forward selection, while dropping variables which
are no longer significant after the introduction of a new variable.
The stepAIC() function requires a number of arguments. The argument k is
a multiple of the number of degrees of freedom used for the penalty. If k = 2 the
original AIC is applied, k = log(n) is equivalent to BIC. The direction of the
stepwise regression can be chosen as well, setting it to forward, backward or both.
If trace = 1, it will return every model it goes over as well as the coefficients of
the final model. scope gives the range of models searched, while lower and
upper specify the smallest and the largest model formula the stepwise procedure
may consider.
The nutritional database on US cereals introduced in Venables and Ripley (1999)
provides a good illustration of MLR. The UScereal data frame from package MASS
is from the 1993 ASA Statistical Graphics Exposition. The data have been normalised
to a portion size of one American cup and, among others, contain information on: mfr
(manufacturer, represented by its first initial), calories (number of calories per
portion), protein (grams of protein per portion), fat (grams of fat per portion),
carbo (grams of complex carbohydrates per portion) and sugars (grams of sugars per
portion). The analysis is restricted to the dependence of calories on protein, fat, carbo and sugars.
The resulting fitted model is an object of a class lm, for which the function
summary() shows the conventional regression table. The part Call states the
applied model.
> summary(fit) # show results
Call:
lm(formula = calories ~ protein + fat + carbo + sugars,
data = UScereal)
The next part of the output provides the minimum, maximum and empirical quartiles
of the residuals.
Residuals: # output ctnd.
Min 1Q Median 3Q Max
-20.177 -4.693 1.419 4.940 24.758
The last part shows the estimated β̂ with the corresponding standard errors, t-statistics
and associated p-values. The measures of goodness of fit, R 2 and adjusted R 2 as
discussed in Sect. 7.2.1, and the results of an F-test are given as well. The F-test
tests the null hypothesis that all regression coefficients (excluding the constant) are
simultaneously equal to 0.
Coefficients: # output ctnd.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.7698 3.5127 -5.343 1.49e-06 ***
protein 4.0506 0.5438 7.449 4.28e-10 ***
fat 8.8589 0.7973 11.111 3.41e-16 ***
carbo 4.9247 0.1587 31.040 < 2e-16 ***
sugars 4.2107 0.2116 19.898 < 2e-16 ***
---
Signif. codes: 0"***"0.001"**"0.01"*"0.05"."
In the given example, all four variables are statistically significant, i.e. all p-values
are smaller than 0.05, which is commonly used as threshold. The coefficients can
be assessed via the command fit$coef. The interpretation of the coefficients
7.2 Linear Regression 203
is quite intuitive. One may know from a high school chemistry course that 1 g of
carbohydrates or proteins contains 4 calories and 1 g of fat gives 9 calories. In
order to investigate the model closely, four diagnostic plots are constructed. The
layout command is used to put all graphs in one figure.
> layout(matrix(1:4, 2, 2)) # plot 4 graphics in 1 figure
> plot(fit) # depict 4 diagnostic plots
The upper left plot in Fig. 7.1 shows the residual errors plotted against their fitted
values. The residuals should be randomly distributed around the horizontal axis.
There should be no distinct trend in the distribution of points. A nonparametric
regression of the residuals is added to the plot, which should, in an ideal model, be
close to the horizontal line y = 0. Unfortunately, this is not the case in the example,
possibly due to outliers (Grape-Nuts, Quaker Oat Squares and Bran Chex). Potential
outliers are always named in the diagnostic plots.
Fig. 7.1 Diagnostic plots for the fitted linear model (labelled observations include Grape-Nuts, Bran Chex and All-Bran)
The upper right plot in Fig. 7.1 presents the scale-location, also called spread-
location. It shows the square root of the standardised residuals as a function of the
fitted values and is a useful tool to verify if the standard deviation of the residuals
is constant as a function of the dependent variable. In this example, the standard
deviation increases with the number of calories.
The lower left plot in Fig. 7.1 is a Q-Q plot. In a Q-Q plot, quantiles of a theoretical
distribution (Gaussian in this case) are plotted against empirical quantiles. If the data
points lie close to the diagonal line there is no reason to doubt the assumed distribution
of the errors (e.g. Gaussian).
Finally, the lower right plot shows each point’s leverage, which is a measure of its
importance in determining the regression result. The leverage of the i-th observation
is the i-th diagonal element of the matrix X(X⊤X)^{-1}X⊤. It always takes values
between 0 and 1 and shows the influence of the given observation on the overall
modelling results and particularly on the size of the residual. Superimposed on the
plot are contour lines for Cook’s distance, which is another measure of the importance
of each observation to the regression, showing the change in the predicted values
when that observation is excluded from the dataset. Smaller distances mean that this
observation has little effect on the regression. Distances larger than 1 are suspicious
and suggest the presence of possible outliers or a poor model. For more details, we
refer to Cook and Weisberg (1982). In the given regression model, some possible
outliers are observed. It makes sense to have a closer look at these observations and
either exclude them or experiment with other model specifications.
As mentioned above, adjusted R 2 is a widely used measure of the goodness of
fit. In the given example, the model seems to explain the variability in the data quite
well: the R²_adj is 0.9798. However, a similar goodness of fit might be obtained using a
smaller set of regressors. An investigation of this question using stepwise regression
procedure shows that no regressor can be removed from the model.
> require(MASS)
> stepAIC(fit, direction ="both") # stepwise regression using AIC
Start: AIC = 288.43
calories ~ protein + fat + carbo + sugars
Call:
lm(formula = calories ~ protein + fat + carbo + sugars,
data = UScereal)
Coefficients:
(Intercept) protein fat carbo sugars
-18.770 4.051 8.859 4.925 4.211
Fig. 7.2 All-subsets regression: models ranked by BIC over the predictors (Intercept), protein, sugars, carbo and fat
A possible drawback of stepwise regression is that once the variable is included
(excluded) in the model, it remains there (or is eliminated) for all remaining steps.
Thus, it is a good idea to perform stepwise regression in both directions in order to
look at all the possible combinations of explanatory variables. It is possible to perform
an all subsets regression using the function regsubsets from the package leaps.
By plotting the regsubsets object, one obtains a matrix plot in which models with a
larger BIC are highlighted in a darker colour (Fig. 7.2).
> require("leaps")
> sset = regsubsets(calories ~ protein + fat + carbo + sugars,
+ data = UScereal, nbest = 3) # fit lm to all subsets
> plot(sset)
The general idea of regression analysis is to find a reasonable relation between two
variables X and Y. For n realisations {(x_i, y_i)}_{i=1}^n, the relation can be modelled by

y_i = g(x_i) + ε_i,
where X is our explanatory variable, Y is the explained variable, and ε is the noise.
A parametric estimation would suggest g(xi ) = g(xi , θ), therefore estimating g
would result in estimating θ and using ĝ(xi ) = g(xi , θ̂). In contrast, nonparametric
regression allows g to have any shape, in the sense that g need not belong to a set of
defined functions. Nonparametric regression provides a powerful tool by allowing
wide flexibility for g. It avoids a biased regression and might be a good starting point
for further modelling. Many nonparametric estimators can be written as a weighted
average of the observations,

ĝ(x) = ∑_{i=1}^n w_i(x) y_i, (7.8)

with weights w_i(x) depending on the point x. Such a local average can equivalently
be obtained by solving the locally weighted least squares problem

min_{θ∈R} ∑_{i=1}^n w_i(x)(θ − y_i)²,
which is solved for θ. This means that finding a local average is the same as finding a
locally weighted least squares estimate. For more details on the distinction between
local polynomial fitting and kernel smoothing, see Müller (1987).
A method similar to the histogram divides the set of observations into bins of size
h and computes the mean within each bin by
Fig. 7.3 Regression of daily DAX log-returns by daily FTSE log-returns
ĝ(x) = ∑_{i=1}^n I{|x − x_i| < h/2} y_i / ∑_{i=1}^n I{|x − x_i| < h/2}. (7.9)
The function ksmooth() returns the fitted values of the DAX log-returns and the
respective FTSE log-returns. This estimator is a special case of a wider family of
estimators, which is the topic of the following section.
As with the density estimation, we can bring into play the kernel functions for the
weights we need in (7.8). Recall the density estimator
f̂(x) = (1/(nh)) ∑_{i=1}^n K{(x − x_i)/h}.
We then have the general estimator ĝ(x) related to the bandwidth h and to a kernel K,

ĝ(x) = ∑_{i=1}^n K{(x − x_i)/h} y_i / ∑_{i=1}^n K{(x − x_i)/h}. (7.10)

With the rescaled kernel K_h(u) = h^{-1} K(u/h), this can equivalently be written as

ĝ(x) = n^{-1} ∑_{i=1}^n K_h(x − x_i) y_i / f̂(x).

These estimators are of the form of (7.8), with weights equal to w_i(x) = n^{-1} K_h(x − x_i) / f̂(x)
and w_i(x) = K_h(x − x_i) / ∑_{j=1}^n K_h(x − x_j), respectively.
In the following, some regressions with different kernels and different bandwidths
are computed.
> r.dax = diff(log(EuStockMarkets[, 1])) # daily DAX log returns
> r.ftse = diff(log(EuStockMarkets[, 4])) # daily FTSE log returns
> n = length(r.dax) # sample size
> h = c(0.1, n^-1, n^-0.5) # bandwidths
> Color = c("red3","green3","blue3") # vector for colors
> # kernel regression with uniform kernel
> r.dax.un = list(h1 = NA, h2 = NA, h3 = NA) # list for results
> for(i in 1:3){
+ r.dax.un[[i]] = ksmooth(x = r.ftse, # independent variable
+ y = r.dax, # dependent variable
+ kernel ="box", # use uniform kernel
+ bandwidth = h[i]) # h = 0.1, n^-1, n^-0.5
+ }
> plot(x = r.ftse, y = r.dax) # scatterplot for data
> for(i in 1:3){
+ lines(r.dax.un[[i]], col = Color[i]) # regression curves
+ }
> # kernel regression with normal kernel
> r.dax.no = list(h1 = NA, h2 = NA, h3 = NA) # list for results
> for(i in 1:3){
+ r.dax.no[[i]] = ksmooth(x = r.ftse, # independent variable
+ y = r.dax, # dependent variable
+ kernel ="normal", # use normal kernel
+ bandwidth = h[i]) # h = 0.1, n^-1, n^-0.5
+ }
> plot(x = r.ftse, y = r.dax) # scatterplot for data
> for(i in 1:3){
+ lines(r.dax.no[[i]], col = Color[i]) # regression curves
+ }
As previously noted in Sect. 5.1.4, the choice of the bandwidth h is crucial for the
degree of smoothing, see Fig. 7.4. In Fig. 7.5 the Gaussian kernel K (x) = ϕ(x) is
used with the same bandwidths. The kernel determines the shape of the estimator ĝ,
which is illustrated in Figs. 7.4 and 7.5.
Fig. 7.4 Kernel regression of daily DAX log-returns by daily FTSE log-returns, using the uniform kernel, with bandwidths h = 0.1, h = n^{−1} and h = n^{−1/2}. BCS_UniformKernel
Fig. 7.5 Kernel regression of daily DAX log-returns by daily FTSE log-returns, using the Gaussian kernel, with bandwidths h = 0.1, h = n^{−1} and h = n^{−1/2}. BCS_GaussianKernel
A further simple approach partitions the domain of X into subsets and computes a
local average for each different subset. In other words, if {S_i} is a family of disjoint
subsets of the domain S of X and S = ∪_{i∈I} S_i, with I = {1, …, p}, then one uses

ĝ(x) = |{j : x_j ∈ S_i}|^{-1} ∑_{j: x_j ∈ S_i} y_j if x ∈ S_i for some i, and ĝ(x) = 0 otherwise. (7.11)
For instance, if one picks Si (h) = {x, |x − xi | < h/2}, then this is simply the uni-
form kernel regressor with bandwidth h. An alternative is to choose Si such that
the k-nearest observations xi to x, in terms of the Euclidean distance, are selected.
This avoids the regressor’s being equal to 0, and has an intuitive foundation. Since
the estimator is computed with the k-nearest points, it is less sensitive to outliers in
the dataset. However, there can be a lack of accuracy when k is large compared to
n (the sample size). The estimator will give the same weight to neighbours that are
close and far away. This problem is less severe with a larger number of observations,
or in the case of the ‘fixed design’ problem, where xi is selected by the user. In
the case of a small number of observations, one can also compensate for this lack
of consistency by combining this method with a kernel regression. R includes an
implementation of the k-NN algorithm for dependent variables Y . The function in R
is knn() from package class.
Consider again the DAX log-returns from the EuStockMarkets dataset. The
probability of having positive DAX log-returns conditional on the FTSE, CAC and
SMI log-returns, is computed in the following.
> require(class)
> k = 20 # neighbours
> data = diff(log(EuStockMarkets)) # log-returns
> size = (dim(data)[1] - 9):dim(data)[1] # last ten obs.
> train = data[-size, -1] # training set
> test = data[size, -1] # testing set
> cl = factor(ifelse(data[-size, 1] < 0, # returns as factor
+ "decrease","increase"))
> tcl = factor(ifelse(data[size, 1] < 0, # true classification
+ "decrease","increase"))
> pcl = knn(train, test, cl, k, prob = TRUE) # predicted returns
> pcl
[1] decrease decrease decrease decrease
increase decrease decrease increase
decrease increase
attr(,"prob")
[1] 0.95 0.90 0.95 0.90 1.00 1.00 0.95 1.00 0.90 1.00
Levels: decrease increase
> tcl == pcl # validation
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The predicted classifications for the DAX log-returns fit perfectly the actual classi-
fications. All predicted probabilities are at least 0.90.
This simple call to the k-NN estimator allows only for one set of yi . However,
manually coding a k-NN function does not require much depth in the reasoning if
we keep it simple. One may write the following code:
This code is used to produce Fig. 7.6. The function argument xis specifies the vector
of regressors for the vector of dependent variables yis. The parameter k determines
the number of neighbours with which to build the local average for the dependent
variable. The argument x is a vector, which defines the interval for the regression
analysis.
To achieve the best fit, one has to find the optimal k, similar to a kernel regression.
It is not possible to establish a theoretical expression for the optimal value of k, since
it depends greatly on the sample.
Fig. 7.6 k-NN regression of daily DAX log-returns by daily FTSE log-returns. The plot shows the fitted values for k = 1, k = 10 and k = 250. BCS_kNN
7.3.4 Splines
min_{g ∈ C²} S_λ{g(x)} = min_{g ∈ C²} [ ∑_{i=1}^n {y_i − g(x_i)}² + λ ∫ {d²g(z)/dz²}² dz ]. (7.12)
The spline optimisation problem (7.12) minimises the sum of the squared residuals
and a penalty term. In most applications, the penalty term involves the second derivative
of the estimator with respect to x, which reflects the smoothness of the function. The
parameter λ determines the importance of the penalty term for the estimator. One
can rewrite (7.12) in matrix notation, since the minimum in (7.12) is achieved
by a piecewise cubic polynomial for g(x). Writing g(x) = {g(x_1), …, g(x_n)}⊤ for the
vector of function values and letting h(x) denote the vector containing the second
derivatives of the cubic polynomials p_i(x) at the observations, the penalty term can be
rewritten as

∫ {∂²g(x)/∂x²}² dx = g(x)⊤ K g(x),

where the matrix K (n × n) has entries k_{i,j} = ∫ {d²p_i(z)/dz²}{d²p_j(z)/dz²} dz.
Therefore, the resulting estimator can be shown to be a weighted sum of y,

ĝ = (I_n + λK)^{-1} y.
See Härdle et al. (2004) for a more detailed description. R provides in the pack-
age stats cubic splines for second derivative penalty terms using function
smooth.spline.
Fig. 7.7 Spline regression of daily DAX log-returns by daily FTSE log-returns. Regression results are depicted for λ = 2, λ = 1 and λ = 0.2. BCS_Splines
This listing creates Fig. 7.7 for the regression of DAX log-returns and FTSE log-
returns. The function arguments x and y are the observations of the independent and
dependent variables, respectively. Instead of using two separate vectors, a matrix can
be used. The argument spar defines λ through λ ∝ 256^{3·spar − 1}, therefore the greater the
spar, the greater the λ. One can also apply weights to the observations of x through
the variable w, which must have the same length as x. To find a good value for λ,
set cv = TRUE for the ordinary cross-validation method and cv = FALSE for a
generalised cross-validation method. Of course we could impose further restrictions
on g, e.g. a penalty on its third derivative, or on any other type of norm.
Consider again the model

y_i = g(x_i) + ε_i.

Locally around a point x, the regression function is approximated by a polynomial of
degree k,

y_i = ∑_{j=0}^k a_j (x_i − x)^j + ε_i,

where the observations are weighted according to z_i = (x_i − x)/h, and h is half the
width of the interval around x. Therefore the
weight attached to an observation is small if z is large and vice versa. This method
has the same particularities as the k-NN method in the way that it can extrapolate
the data. However, after the number of neighbours has been defined, through h, the
coefficients a j are estimated by the least squares approach.
This is both an advantage and a drawback, as the regression does not need any
regularity conditions (for instance, compared to the spline method, which needs ĝ to
be twice differentiable), but it provides less intuition in the interpretation of the final
curve (Fig. 7.8). The main function to use in R is loess(). First, one has to specify
Fig. 7.8 LOESS regression of daily DAX log-returns by daily FTSE log-returns with degree one. The used LOESS parameters are α = 0.9, α = 0.3 and α = 0.05. BCS_LOESS
the two variables to regress in the syntax of linear regressions. Then the user needs to
select the degree of the polynomial, and the span parameter, which represents the
proportion of points (or neighbours) to use. By default, R uses the tri-cube weighting
function w(z), which the user can change. Nevertheless, the weight should satisfy
some properties, stated in Cleveland (1979).
The following code produces a plot for a LOESS regression of DAX log-returns
on FTSE log-returns.
> r.dax  = diff(log(EuStockMarkets[, 1]))    # daily DAX log-returns
> r.ftse = diff(log(EuStockMarkets[, 4]))    # daily FTSE log-returns
> loess1 = loess(r.dax ~ r.ftse,             # LOESS regression
+                degree = 1,                 # degree of polynomial
+                span   = 0.9)$fit           # proportion of neighbours
> loess2 = loess(r.dax ~ r.ftse, degree = 1, span = 0.01)$fit
> loess3 = loess(r.dax ~ r.ftse, degree = 1, span = 0.3)$fit
> l1 = loess1[order(r.ftse)]                 # order fitted values as FTSE
> l2 = loess2[order(r.ftse)]
> l3 = loess3[order(r.ftse)]
> plot(x = r.ftse, y = r.dax)                # scatterplot of the data
> lines(sort(r.ftse), l1, col = "red")       # regression curves
> lines(sort(r.ftse), l2, col = "green")
> lines(sort(r.ftse), l3, col = "blue")
This section and the previous sections introduced methods to model an unknown
relation between two variables X and Y . Each method depends greatly on the smooth-
ing parameter, which has different optimal values for different regression methods.
As discussed in Sect. 5.1.4, the optimal bandwidth for a normal kernel is given by
h_opt = 1.06 σ̂ n^{−1/5}. The choice of the kernel becomes of minor importance as the
number of observations increases. The optimal parameters for other methods are
found via cross-validation algorithms.
On top of this, other complications can appear. Problems such as predicting from a
low number of observations or the presence of outliers within the dataset can make the
regression results less accurate. The following example illustrates, using simulated
data, how different nonparametric regressions perform.
Example 7.1 Consider two rvs X and Y generated from a known nonlinear model;
the true regression curve is depicted in Fig. 7.9.
One can use the rule of thumb for the kernel regression’s bandwidth:
> kernel.reg.example = function(new.x){
+ ksmooth(x = Xis, y = Yis,
+ kernel ="normal",
+ bandwidth = 1.06 * n^(-1 / 5),
+ xpoints = new.x)$y
+ }
However, for the k-NN regression and the spline regression, the smoothing parameter
is selected by a cross-validation algorithm. Below is a simple line for the spline
regression
> spline.reg.example = smooth.spline(x = Xis, y = Yis)
1
n
k C V = argmink M S E(k) = { ŷi (k) − yi }2 .
n i=1
Therefore the value of k minimising MSE(k) is used for the regression analysis.
In the following, the leave-one-out cross-validation procedure is applied, where just
one observation is dropped at a time: each observation in turn is excluded from the
sample, the k-NN regression is fitted on the remaining observations, and the squared
error for the dropped observation is computed for a specific k. The squared error is
computed by the following code (Fig. 7.9).
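A minimal sketch of such a squared-error helper, assuming the knn.reg function from the FNN package (the comment 'knn.reg required' in the code further below points to it; the exact original definition is assumed):
> SEkNN = function(k, x, y, p){                 # squared error for dropped obs. p
+   fit = FNN::knn.reg(train = matrix(x[-p]),   # fit k-NN without observation p
+                      test  = matrix(x[p]),    # predict at the dropped point
+                      y = y[-p], k = k)
+   (fit$pred - y[p])^2                         # squared prediction error
+ }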
Fig. 7.9 Simulated data and the true regression curve, f(x) plotted against x. BCS_RegressionCurve
Fig. 7.10 MSE for k-NN regression using the Leave-one-out cross-validation method.
BCS_LeaveOneOut
Fig. 7.11 Nonparametric regressions for simulated data. The regression results for kernel, kNN
and spline are plotted. BCS_NonparametricRegressions
Now one needs to compute, for each k, the mean of the squared errors for every
single point (xi , yi ), and select the k that minimises the M S E(k).
> listk = matrix(0, n - 1, n) # object for ks and es
> for (k in 1:(n - 1)) # loop for possible ks
+ for (p in 1:n){ # possible dropped obs.
+ listk[k, p] = SEkNN(k, Xis, Yis, p) # knn.reg required
+ }
> MSEkNN = (n)^(-1) * rowSums(listk) # Mean squared error
> which.min(MSEkNN) # cross validated k
[1] 3
The code above shows that k C V is equal to 3 for this dataset. Figure 7.10 plots the
M S E(k) depending on k for the k-NN regression. After the optimal parameters have
been selected for each of the regressions, it is interesting to compare the results of
these different methods to the true regression curve depicted in Fig. 7.9. Figure 7.11
shows that the regression curves are very similar. This actually tends to be true when
n is large. At the same time, all three regressions perform poorly at the boundaries of
the support of x. The standard normally distributed rv X has 95% of its realisations
in the interval [−1.96, 1.96]. Therefore the regression is likely to have both a large
bias and a large variance outside of these bounds.
Chapter 8
Multivariate Statistical Analysis
— Marie Curie
There are several equivalent ways of deriving the principal components mathe-
matically. The simplest way is to find the projections of the original p-dimensional
vectors onto a subspace of dimension q. These projections should have the following
properties. The first principal component is the direction in the original variable space
along which the projection has the largest variance. The second principal component
is the direction which maximises the variance among all directions orthogonal to
the first principal component, and so on. Thus, the i-th component is the variance-
maximising direction orthogonal to the previous i − 1 components. For an original
dataset of dimension p, there are p principal components.
The principal component (PC) transformation of an rv X with E(X) = μ and
Var(X) = Σ = ΓΛΓ⊤ is defined as
Y = Γ⊤(X − μ),
where Γ is the matrix of eigenvectors of the covariance matrix Σ and Λ is the diagonal
matrix of the corresponding eigenvalues, see Sect. 2.1 for details. The principal
component properties are given in the following theorem.
Theorem 8.1 For a given X ∼ (μ, Σ), let Y = Γ⊤(X − μ) be the principal component
transformation. Then
E(Y_j) = 0, j = 1, . . . , p;
Var(Y_j) = λ_j, j = 1, . . . , p;
Cov(Y_i, Y_j) = 0, i ≠ j;
Var(Y_1) ≥ Var(Y_2) ≥ . . . ≥ Var(Y_p) > 0.
In practice, the expectation μ and covariance matrix Σ are replaced by their estimators
x̄ and S, respectively. If S = GLG⊤ is the spectral decomposition of S, then the
principal components are obtained by Y = (X − 1_n x̄⊤)G. Note that with the centring
matrix H = I_n − n^{−1} 1_n 1_n⊤ and H 1_n x̄⊤ = 0, the empirical covariance matrix of the
principal components can be written as
S_Y = n^{−1} Y⊤ H Y = L,
length, width, etc.). First, the required package is loaded and the data set is saved as
a data frame without the indicator stating whether banknotes are genuine.
> data(banknote, package = "mclust")   # load the data
> mydata = banknote[, -1]              # remove the first column
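The fitted object fit used in what follows can be obtained with princomp; a minimal sketch, with the exact call assumed:
> fit = princomp(mydata)               # fit PCA on the banknote measurements
> summary(fit)                         # standard deviations of the components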
The output includes the standard deviations of each component, i.e. the square root
of the covariance matrix’s eigenvalues. A measure of how well the first q PCs explain
the total variance is given by the cumulative relative proportion of variance
ψ_q = Σ_{j=1}^{q} λ_j / Σ_{j=1}^{p} λ_j .
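With the princomp object fit from above, ψ_q can be computed directly from the component standard deviations (a short sketch):
> psi = cumsum(fit$sdev^2) / sum(fit$sdev^2)   # cumulative relative proportion of variance
> psi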
The loadings matrix, given by the matrix Γ, contains the multiplicative weights of each
standardised variable in the component score. In practice, one considers its estimate,
the matrix G. Small loadings values are replaced by a space in order to highlight the
pattern of loadings.
> print(fit$loadings, digits = 3)      # prints loadings
Loadings :
         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Length   -0.326  0.562  0.753
Left      0.112 -0.259  0.455 -0.347 -0.767
Right     0.139 -0.345  0.415 -0.535  0.632
Bottom    0.768 -0.563 -0.218 -0.186
Top       0.202  0.659 -0.557 -0.451  0.102
Diagonal -0.579 -0.489 -0.592 -0.258
Scatter plots in Fig. 8.1 of the components clearly illustrate how principal component
analysis simplifies multivariate techniques.
> layout(matrix(1:4, 2, 2))
> group = factor(banknote[, 1])              # group as factor
> plot(fit$scores[, 1:2], col = group)       # plot 1 vs 2 factor
> plot(fit$scores[, c(1, 3)], col = group)   # plot 1 vs 3 factor
> plot(fit$scores[, 2:3], col = group)       # plot 2 vs 3 factor
In practice, a question which often arises is how to choose the number of components.
Commonly, one retains just those components that explain some specified percentage
Fig. 8.1 Principal components and scree plot for Swiss Banknote dataset. BCS_PCAvar
of the total variation of the original variables. Values between 70 and 90% are usually
suggested, although smaller values might be appropriate as p or the sample size
increases.
> plot(cumsum(fit$sdev^2 / sum(fit$sdev^2)))   # cumulative variance
A graphical representation of the PCs' ability to explain the variation in the data is
given in Fig. 8.1. The bottom-right plot, called the scree plot, depicts the relative
cumulative proportion of the explained variance as given by ψ_q above. The figure
implies that the use of the first and the second principal components is sufficient to
identify the genuine banknotes.
Another way to choose the optimal number of principal components is to exclude
the principal components with eigenvalues less than the average, see Everitt (2005).
The covariance between the PC vector Y and the original variables X is important
for the interpretation of the PCs. It is calculated as
Cov(X, Y) = Σ Γ = Γ Λ.
The correlations of the original variables X i with the first two PCs are given in the
first two columns of the table in the previous code. The third column shows the
cumulative percentage of the variance of each variable explained by the first two
principal components Y_1 and Y_2, i.e. Σ_{j=1}^{2} r²_{X_i Y_j}.
The results are displayed visually in a correlation plot, where the correlations r_{X_i Y_1}
are plotted against r_{X_i Y_2} in Fig. 8.2 (left). When the variables lie near the periphery of the circle,
they are well explained by the first two PCs. The plot confirms that the percentage
of the variance of X 1 explained by the first two PCs is relatively small.
> # coordinates for the surrounding circle
> ucircle = cbind(cos((0:360) / 180 * pi), sin((0:360) / 180 * pi))
> plot(ucircle, type = "l", lty = "solid")   # plot circle
> abline(h = 0.0, v = 0.0)                   # plot orthogonal lines
> label = paste("X", 1:6, sep = "")
> text(cor(mydata, fit$scores), label)       # plot scores in text
Fig. 8.2 The correlation of the original variable with the PCs (left) and normalised PCs (right).
BCS_PCAbiplot, BCS_NPCAbiplot
For the normalised PCA (NPCA), the analysis is based on the standardised data
X_S = H X D^{−1/2}, where H is the centring matrix and D = diag(s_{X_1 X_1}, . . . , s_{X_p X_p}).
The scree plot for the normalised model would be different and the variables can be
observed to lie closer to the periphery of the circle in Fig. 8.2 (right).
Factor analysis is widely used in behavioural sciences. Scientists are often interested
not in the observed variables, but in unobserved factors. For example, sociologists
record people’s occupation, education, home ownership, etc., on the assumption that
these variables reflect their unobservable ‘social class’. Exploratory factor analysis
investigates the relationship between the manifest variables and the factors. Note that the
number of variables should be much smaller than the number of observations. The
factor model can be written as
X = QF + μ, (8.1)
where Q is the (p × k) matrix of factor loadings and F the vector of k common factors.
Allowing additionally for a vector of specific (unique) factors U, the model becomes
X = QF + μ + U ,
Under the model assumptions, the covariance matrix of X decomposes as
Σ = QQ⊤ + Ψ,    (8.2)
and the covariance between X and the factors F is
Σ_XF = E{(QF + U)F⊤} = Q.
The correlation is
P_XF = D^{−1/2} Q,
where D = diag(σ_{X_1 X_1}, . . . , σ_{X_p X_p}).
The analysis based on the normalised variables is performed by using R = QQ⊤ + Ψ
and the loadings can be interpreted directly. Factor analysis is scale invariant.
However, the loadings are unique only up to multiplication by an orthogonal matrix.
This leads to potential difficulties with estimation, but facilitates the interpretation of
the factors. Multiplication by an orthogonal matrix is called rotation of the factors.
The most widely used rotation is the varimax rotation which maximises the sum of
the variances of the squared loadings within each column. For more details on factor
analysis see Rencher (2002).
In practice, Q and Ψ have to be estimated from S = Q̂Q̂⊤ + Ψ̂. The degrees of freedom
of the model are d = ½(p − k)² − ½(p + k). An exact solution exists only when
d = 0, otherwise an approximation must be used. Assuming a normal distribution
of the factors, maximum likelihood estimators can be computed as discussed below.
Other methods to find Q̂ are principal factors and principal component analysis, see
Härdle and Simar (2015) for details.
This subsection takes a closer look at the maximum likelihood factor analysis, one
possible fitting procedure in factor analysis. It is generally recommended if it can
be assumed that the factor scores are independent across factors and individuals and
normally distributed, i.e. F ∼ N(0, I). Under this assumption, X ∼ N(0, QQ⊤ + Ψ),
and Q̂ and Ψ̂ are found by maximising the log-likelihood
L = −(np/2) log 2π − (n/2) log |QQ⊤ + Ψ| − (n/2) tr{(QQ⊤ + Ψ)^{−1} V} .
Figure 8.3 suggests to start the analysis with three factors, because the increase in
explained variance becomes very small for more than three factors. While including a
third factor increases the explained variance by about 6 percentage points, including
a fourth factor offers an increase of less than one additional percentage point.
> require(stats)
Fig. 8.3 Cumulative explained variance against the number of principal components.
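The decathlon data used below can be loaded, for example, from the FactoMineR package (an assumed source; its decathlon dataset carries exactly these variable names):
> data(decathlon, package = "FactoMineR")   # load the decathlon data (assumed source)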
> mydata = decathlon[, 1:10]
> fit = factanal(mydata,               # fit the model
+                factors = 3,          # number of factors
+                rotation = "none")    # no rotation performed
> fit                                  # print the results
Call :
factanal(x = mydata, factors = 3, rotation = "none")
Uniquenesses :
       100m  Long.jump   Shot.put  High.jump       400m
      0.411      0.396      0.106      0.697      0.264
110m.hurdle     Discus Pole.vault   Javeline      1500m
      0.491      0.534      0.907      0.785      0.005
The output states the computed uniquenesses and a matrix of loadings with one
column for each factor. Uniqueness gives the proportion of variance of the vari-
able not associated with the factors. It is defined as 1 − communality (namely
1 − Σ_{l=1}^{k} q²_{jl} , j = 1, . . . , p), where communality is the variance of that variable as
determined by the common factors. Note that the greater the uniqueness of a variable,
the lower the relevance of the variable in the factor model, since the factors capture
less of the variance of the variable.
Loadings :                             # output from fit, cont.
            Factor1 Factor2 Factor3
100m         -0.573  0.507
Long.jump     0.455 -0.629
Shot.put      0.881  0.322  0.121
High.jump     0.547
400m         -0.432  0.617  0.412
110m.hurdle  -0.493  0.514
Discus        0.621  0.114  0.261
Pole.vault   -0.174  0.246
Javeline      0.356  0.239 -0.178
1500m         0.997
The factor loadings give the correlation between the factors and the observed vari-
ables. They can be used to interpret the factors based on the variables they capture.
As stated above, the factor loadings are unique only up to multiplication by an orthogonal
matrix, and this multiplication, called rotation of the factor loadings matrix, can
greatly facilitate interpretation.
The most common rotation method, varimax, aims at maximising the variance
of the squared loadings of a factor on all the variables. Note that the assumption
of orthogonality of the factors is required. There are also a number of ‘oblique’ rotations
available, which allow the factors to correlate.
> mydata = decathlon[, 1:10]
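The varimax output that follows is presumably produced by a call of this form (a sketch; the object name fit2 is assumed):
> fit2 = factanal(mydata,                 # factor model
+                 factors = 3,            # number of factors
+                 rotation = "varimax")   # varimax rotation
> fit2                                    # print the results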
Call :
factanal(x = mydata, factors = 3, rotation = "varimax")
Uniquenesses :
       100m  Long.jump   Shot.put  High.jump       400m
      0.411      0.396      0.106      0.697      0.264
110m.hurdle     Discus Pole.vault   Javeline      1500m
      0.491      0.534      0.907      0.785      0.005
Loadings :
            Factor1 Factor2 Factor3
100m          0.699 -0.264 -0.178
Long.jump    -0.764  0.107
Shot.put     -0.128  0.934
High.jump    -0.232  0.497
400m          0.815  0.263
110m.hurdle   0.685 -0.183
Discus       -0.152  0.618  0.248
Pole.vault   -0.124  0.278
Javeline      0.412 -0.214
1500m         0.190  0.976
For the first factor, the largest loadings are for 100 m, 400 m, 110 m hurdle run, and
long jump. This factor can be interpreted as the ‘sprinting performance’. The loadings
for the second factor present a counter-intuitive throwing-jumping combination: the
highest loadings are for the three throwing events (discus throwing, javeline and shot
put) and for the high jump event. For the third factor, the largest loading is for 1500 m
running. The first and third factors can be interpreted straightforwardly as ‘sprinting
abilities’ and ‘endurance’, respectively. The meaning of the second factor is not evident.
Note that the model fit has room for improvement, because the value 0.54 for
Cumulative Var in the third line signifies that only 54% of the variation in the
data is explained by three factors. Including a fourth factor is easily done by changing
the code to factors = 4.
> mydata = decathlon[, 1:10]
> fit3 = factanal(mydata,                 # factor model
+                 factors = 4,            # 4 factors
+                 rotation = "varimax")   # varimax rotation
> fit3                                    # print the results
Call :
factanal(x = mydata, factors = 4, rotation = "varimax")
Uniquenesses :
       100m  Long.jump   Shot.put  High.jump       400m
      0.409      0.386      0.005      0.680      0.270
110m.hurdle     Discus Pole.vault   Javeline      1500m
      0.464      0.492      0.005      0.800      0.005
Loadings :
            Factor1 Factor2 Factor3 Factor4
100m          0.720 -0.245 -0.112
Long.jump    -0.770  0.131
Shot.put     -0.144  0.976  0.103  0.103
High.jump    -0.259  0.480 -0.152
400m          0.770  0.363
110m.hurdle   0.712 -0.157
Discus       -0.220  0.585  0.297 -0.170
Pole.vault   -0.102  0.117  0.983
Javeline      0.403 -0.191
1500m         0.984  0.143
This improves the result, with the new model explaining 65% of variation. The
interpretation of the first and second factor remains the same, but the 1500 m run and
pole vaulting are now captured by factors 3 and 4, respectively.
Thus, there is no unique or ‘best’ solution in factor analysis. Using the maximum
likelihood method allows one to test the goodness of fit of the factor model. The test examines
if the model fits significantly worse than a model in which the variables correlate
freely. p-values higher than 0.05 indicate a good fit, since the null hypothesis of a
good fit cannot be rejected.
In this case, the p-value is 0.457. The null hypothesis that 3 factors are sufficient
cannot be rejected, suggesting a good fit of the model.
Cluster analysis techniques are used to search for clusters or groups in a priori
unclassified multivariate data. The main goal is to obtain clusters of objects which
are similar to one another and different from objects in other clusters. Many methods
of cluster analysis have been developed, since most studies allow for a variety of
techniques.
The starting point for cluster analysis is an (n × p) data matrix X containing measurements
of p variables on n objects. The proximity among objects is described by a matrix D which
contains measures of similarity or dissimilarity among the n objects. The elements
can be either distance or proximity measures. The nature of the observations plays
an important role in the choice of the measure. Nominal values lead, in general, to
proximity values, whereas metric values lead to distance matrices. To measure the
similarity of objects with binary structure, one defines
a_{ij,1} = Σ_{k=1}^{p} I(x_{ik} = x_{jk} = 1),    a_{ij,2} = Σ_{k=1}^{p} I(x_{ik} = 0, x_{jk} = 1),
a_{ij,3} = Σ_{k=1}^{p} I(x_{ik} = 1, x_{jk} = 0),    a_{ij,4} = Σ_{k=1}^{p} I(x_{ik} = x_{jk} = 0).
The similarity between objects i and j is then defined as
d_{ij} = (a_{ij,1} + δ a_{ij,4}) / {a_{ij,1} + δ a_{ij,4} + λ (a_{ij,2} + a_{ij,3})},
where δ and λ are weighting factors. Table 8.1 shows two similarity measures for
given weighting factors. To measure the distance between continuous variables, one
uses L_r-norms (see Sect. 2.1.5):
d_{ij} = ||x_i − x_j||_r = ( Σ_{k=1}^{p} |x_{ik} − x_{jk}|^r )^{1/r} ,    (8.3)
where xik denotes the value of the k-th variable of object i. The class of distances in
(8.3) measures the dissimilarity using different weights for varying r . The L 1 -norm,
for example, gives less weight to outliers than the L 2 -norm (the Euclidean norm).
An underlying assumption in applying L_r-norms (see Sect. 2.1.5) is that the variables
are measured on the same scale. Otherwise, a standardisation is required, corresponding
to a more general L_2 or Euclidean norm with a matrix A, where A > 0:
d²_{ij} = Σ_{k=1}^{p} (x_{ik} − x_{jk})² / s_{X_k X_k} .
Here each component has the same weight and the distance does not depend on any
particular measurement units.
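As a quick illustration on toy data (hypothetical values), rescaling each variable by its standard deviation before calling dist() yields exactly this unit-free distance:
> X  = matrix(rnorm(20), 10, 2)             # toy data
> s  = apply(X, 2, sd)                      # per-variable standard deviations
> d1 = dist(sweep(X, 2, s, "/"))            # Euclidean distance on rescaled data
> d2 = dist(scale(X))                       # centring does not change the distances
> all.equal(c(d1), c(d2))
[1] TRUE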
In practice, the data is often mixed, i.e. contains both binary and continuous vari-
ables. One way to solve this is to recode the data to normalised similarity by assigning
each attribute level to a separate binary variable. However, this approach often does
not adequately capture the size of distance that can be captured by continuous vari-
ables and leads to a large increase in a4 .
The second way is to calculate a generalised similarity measure, e.g. the commonly
used Gower similarity coefficient. It calculates an average over similarities and is
defined as
D_{ij} = Σ_{k=1}^{v} δ_{ijk} d_{ijk} / Σ_{k=1}^{v} δ_{ijk} ,
where d_{ijk} is the similarity contribution of the k-th variable for objects i and j, and
δ_{ijk} indicates whether the two objects can be compared on that variable.
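In R, Gower dissimilarities for such mixed data can be computed, for instance, with daisy() from the cluster package (a short sketch using its bundled flower dataset):
> require(cluster)
> data(flower)                                # mixed binary/nominal/numeric data
> d.gower = daisy(flower, metric = "gower")   # Gower dissimilarities
> round(as.matrix(d.gower)[1:3, 1:3], 2)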
There are two types of hierarchical clustering algorithms, agglomerative and split-
ting. The first starts with the finest possible partition. The second starts with the
coarsest possible partition, i.e. one cluster containing all the observations. The draw-
back of these methods is that the clusters are not adjusted, e.g. objects assigned to
a cluster cannot be removed in further steps. Since all agglomerative hierarchical
techniques ultimately reduce the data to a single cluster containing all the individu-
als, the investigator seeking the solution with the best-fitting number of clusters will
need to decide which division to choose. The problem of deciding on the ‘correct’
number of clusters will be taken up later.
The hierarchical agglomerative clustering algorithm works as follows:
1. Find the nearest pair of distinct clusters, say ci and c j , merge them into ck and
decrease the number of clusters by one;
2. If the number of clusters equals one, end the algorithm, else return to step 1.
For this purpose, the distance between two groups, or between an individual and a group,
must be calculated. Different definitions of this distance lead to different clustering
algorithms. Widely used measures are the single, complete and average linkage distances
as well as the centroid and Ward distances,
where n_A and n_B are the number of objects in the two groups. The Ward clustering
algorithm does not merge the groups with the smallest distance. Instead, it joins groups
that do not increase a given measure of heterogeneity ‘too much’. The resulting
groups are as homogeneous as possible. Let us study an example of the Gross National
Product (GNP) per capita and the percentage of the population working in agriculture
for each country belonging to the European Union in 1993. The data can be loaded
from the cluster package. First the data should be checked for missing values and
standardised. As mentioned above, the matrix of distances should be calculated first.
Euclidean distance is used in this case. This matrix is then used to obtain clusters
using the complete linkage algorithm (other algorithms are available as parameters
of function hclust).
> require(cluster)                          # package for CA
> data(agriculture, package = "cluster")    # load the data
> mydata = scale(agriculture)               # normalise data
> d = dist(mydata,                          # calculate distances
+          method = "euclidean")            # Euclidean
> print(d, digits = 2)                      # show distances
       B   DK    D   GR    E    F  IRL    I    L   NL    P
DK  1.02
D   0.40 0.63
GR  3.74 4.03 3.88
E   1.68 2.15 1.87 2.08
F   0.55 0.71 0.43 3.48 1.50
IRL 2.12 2.46 2.27 1.62 0.49 1.87
I   0.90 1.04 0.88 3.03 1.11 0.46 1.43
L   0.86 0.35 0.46 4.21 2.25 0.75 2.61 1.18
NL  0.26 1.01 0.48 3.49 1.44 0.39 1.87 0.65 0.94
P   2.92 3.27 3.08 0.84 1.24 2.68 0.82 2.25 3.43 2.67
UK  0.57 1.56 0.97 3.49 1.43 0.96 1.92 1.10 1.42 0.58 2.66
> fit = hclust(d, method = "complete")      # fit the model
Fig. 8.4 Dendrogram for the agriculture data (complete linkage; vertical axis: height). BCS_CAComplete
The specific partition of the data can now be selected from the dendrogram, see
Fig. 8.4. ‘Cutting’ off the dendrogram at some height will give a partition with a
particular number of groups. One of the methods to choose the number of clusters
is to examine the size of the height changes in the dendrogram: a large jump indicates
a large loss of homogeneity if the clusters are joined as suggested by the current step
of the tree. Function rect.hclust draws a dendrogram with red borders around the
clusters, facilitating the interpretation, see Fig. 8.4, where the height axis shows the
value of the criterion associated with the clustering method. The value k specifies the
desired number of groups. We do not discuss the
choice of the number of clusters in detail. A popular method is the scree plot or elbow
criterion, which is easy to implement and visualise.
> plot(fit)                                 # plot the solution
> groups = cutree(fit, k = 5)               # define clusters
> rect.hclust(fit, k = 5, border = "red")   # draw boxes
A popular partitioning alternative is the k-means algorithm, which seeks the partition
S = {S_1, . . . , S_k} of the observations into k clusters that solves
arg min_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ||x − μ_i||² ,
where μ_i is the mean of the points in S_i.
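In R this optimisation is carried out by kmeans(); a short sketch on the scaled agriculture data from the code above (the number of clusters is chosen purely for illustration):
> set.seed(1)
> km = kmeans(mydata, centers = 2)   # k-means with 2 clusters
> km$cluster                         # cluster membership of each country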
For example, a consumer may state that he prefers Coca-Cola to Pepsi-Cola, but cannot say how much. This kind of data is used very
often in psychology and market research.
Assume that a data matrix X is given. Metric MDS begins with an n × n distance
matrix D_X, which contains the distances between the given objects. Note that this matrix
is symmetric, with d^X_{ii} = 0 and d^X_{ij} > 0, which naturally follows from the definition
of distance in any metric space. Given such a matrix, MDS attempts to find n data
points y_1, . . . , y_n, constituting the new data matrix Y in p-dimensional space, such
that D_X is similar to D_Y. In particular, metric MDS minimises
min_Y Σ_{i=1}^{n} Σ_{j=1}^{n} (d^X_{ij} − d^Y_{ij})² ,    (8.4)
where d^X_{ij} = ||x_i − x_j|| and d^Y_{ij} = ||y_i − y_j||. For the Euclidean distance, (8.4) can
be reduced to
min_Y Σ_{i=1}^{n} Σ_{j=1}^{n} (x_i⊤ x_j − y_i⊤ y_j)² .
Fig. 8.6 The MDS of the American car subsample, plotted in the (y1, y2) plane. BCS_MDS
Function cmdscale() performs MDS and takes as arguments the distance matrix
and the dimension of the space in which the data will be represented. Including the
option eig = TRUE additionally returns the eigenvalues of the solution. They are
used to compute goodness-of-fit criteria which help decide on the dimension of the projection
space p. In this case, both criteria give satisfactory results, i.e. values greater than
0.8, for any number of dimensions. Given this result, it is convenient to set p equal
to 2 and plot the MDS map in a simple diagram.
> fit = cmdscale(d, eig = TRUE, k = 2)   # fit MDS model
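The goodness-of-fit values mentioned above can be read off the fitted object (a sketch, using the fit object from the call above):
> fit$GOF                                      # two goodness-of-fit criteria
> sum(abs(fit$eig[1:2])) / sum(abs(fit$eig))   # share of the first two eigenvalues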
Unfortunately, classical metric MDS cannot always be used and other methods of
scaling might be more suitable. This section is concerned with non-metric multidi-
mensional scaling, which can be applied if there is a considerable number of negative
eigenvalues and classical scaling of the proximity matrix may be inadvisable. Non-
metric scaling is also used in the case of ordinal data, e.g. when comparing a range
of colours. Customers might be able to specify that one was ‘brighter’ than another,
without being able to attach any quantitative value to the extent the colours differ.
Non-metric MDS uses only a rank order of the proximities to produce a spatial
representation of them. Thus, the solution is invariant under monotonic transfor-
mations of the proximities. One such method was originally suggested by Shepard
(1962) and Kruskal (1964). Beginning from arbitrary coordinates in a p-dimensional
space, e.g. calculated by metric MDS, the distances are used to estimate disparity
between the objects using monotonic regression. The aim is to represent the fitted
distances d^Y_{ij} as d^Y_{ij} = d̂^X_{ij} + ε_{ij}, where the estimated disparities d̂^X_{ij} are monotonic
with the observed proximities and, subject to this constraint, resemble the d^Y_{ij} as
closely as possible. For a given set of disparities, the required coordinates can be
found by minimising some function of the squared differences between the observed
proximities and the derived disparities, generally known as stress. The procedure is
iterated until some convergence criterion is satisfied. The number of dimensions is
chosen by comparing stress values or other criteria, for example R 2 .
Non-metric MDS can be applied using R as well. The next example uses the
voting data of the package HSAUR2. This dataset represents the voting results of
15 congressmen from New Jersey on 19 environmental bills.
To perform non-metric MDS, load the data, compute the distance matrix and run
function isoMDS with the default two-dimensional solution.
> require(MASS)
> data(voting, package = "HSAUR2")               # load the data
> fit = isoMDS(voting)                           # fit MDS
> plot(fit$points, type = "n")                   # plot the model
> abline(v = 0, lty = "dotted")                  # x = 0 line
> abline(h = 0, lty = "dotted")                  # y = 0 line
> text(fit$points, labels = rownames(voting))    # add text
Fig. 8.7 Two-dimensional non-metric MDS solution for the voting data (axes y1 and y2).
Figure 8.7 shows the output of the above procedure. It is clear that the Democratic
congressmen have voted differently from the Republicans. A possible further conclu-
sion is that the Republicans have not shown as much solidarity as their Democratic
colleagues. More examples on MDS are given in Everitt and Hothorn (2011).
One of the most applied tools in multivariate data analysis is classification. Discrim-
inant analysis is concerned with deriving rules for the allocation of observations to
sets of a priori defined classes in some optimal way. It requires two samples—the
training sample, for which group membership is known with certainty a priori, and
the test sample, for which group membership is unknown.
The theory of discriminant analysis states that one needs to know the class pos-
teriors P(G | X ), where G is a given class and X contains other characteristics
of an object. Suppose f k (x) is the class-conditional density and let πk be the prior
probability of class k. A simple application of Bayes' theorem gives
P(G = k | X = x) = f_k(x) π_k / Σ_{j=1}^{K} f_j(x) π_j .
It is easy to see that, in terms of the ability to classify, knowing the densities f_k(x) is
almost equivalent to knowing the posterior probabilities P(G = k | X = x).
Linear discriminant analysis (LDA) arises in the special case when each class
density is a multivariate Gaussian and classes have a common covariance matrix
k = , ∀k. The purpose of LDA is to find the linear combination of individual
variables which gives the greatest separation between the groups. To discriminate
between two classes k and l, a decision rule can be constructed as the log ratio
log {P(G = k | X = x) / P(G = l | X = x)}
    = log(π_k / π_l) − ½ (μ_k + μ_l)⊤ Σ^{−1} (μ_k − μ_l) + x⊤ Σ^{−1} (μ_k − μ_l),    (8.5)
where Σ, μ_k and μ_l are in most cases unknown and have to be estimated from the
training data set. Equation (8.5) is linear in x.
The decision rule can also be expressed as a set of K linear discriminant functions
δ_k(x) = x⊤ Σ^{−1} μ_k − ½ μ_k⊤ Σ^{−1} μ_k + log π_k .
An observation is assigned to the class with the highest value in the respective
discriminant function. The parameter space R^p is divided by hyperplanes into regions
that are classified as classes 1, 2, . . . , K .
In cases where the classes do not have a common covariance matrix, the decision
boundaries between each pair of classes are described by a quadratic function. The
corresponding quadratic discriminant functions are defined as
δ_k(x) = −½ log |Σ_k| − ½ (x − μ_k)⊤ Σ_k^{−1} (x − μ_k) + log π_k .
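In R, such quadratic rules are available through qda() from the MASS package, used analogously to lda() below; a toy sketch on the built-in iris data (purely illustrative, not part of the example that follows):
> require(MASS)
> fit.q = qda(Species ~ Sepal.Length + Sepal.Width, data = iris)   # toy QDA fit
> head(predict(fit.q)$class)                                       # predicted classes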
For more details on discriminant analysis, see Hastie et al. (2009).
To illustrate the method and its implementation in R, the datasets spanish and
spanishMeta are used. They contain information about the relative frequencies
of the 120 most frequent tag trigrams (combination of three letters) in 15 texts con-
tributed by three Spanish authors (Cela, Mendoza and Vargas Llosa). The aim of the
analysis is to construct a classification rule which allows automatic assignment of a
text by an ‘unknown author’ to Cela, Mendoza or Vargas Llosa. In this dataset the
number of variables, i.e. the different tag trigrams, is much larger than the number
of observations. In addition, some of the variables are highly correlated. Practically,
this means that much of the information conveyed by the variables is redundant. We
can therefore perform a principal component analysis before constructing the discriminant
function, in order to reduce the dimension without much loss of information.
> require(MASS)
> data(spanish, package = "languageR")     # load the data
> mydata = t(spanish)                      # transpose
> pca = prcomp(mydata,                     # fit PCA model
+              center = TRUE,              # centre values
+              scale = TRUE)               # and rescale
> datalda = pca$x
> datalda = datalda[order(rownames(datalda)), ]   # sort by rownames
> data(spanishMeta, package = "languageR")        # load the meta data
> mydata = cbind(datalda[, 1:2], spanishMeta$Author)
> colnames(mydata) = c("PC1", "PC2", "Author")
> mydata = as.data.frame(mydata)
> mydata$Author = as.factor(mydata$Author)
Before performing LDA, the dataset is randomly divided into a training dataset and
a test dataset. The training dataset is used to construct a discrimination rule, which
is subsequently applied to the test dataset in order to test its precision. Performing
the precision test on the same data for which the classification rule was constructed
is bad practice, since the results are not reliable, e.g. biased towards overfitting.
Alternatively, a wide range of resampling techniques such as bootstrap and cross-
validation can be used. These methods are described in Hastie et al. (2009).
> # set.seed(123)                      # set seed, see Chap. 9
> n = nrow(mydata); n                  # total number of observations
[1] 15
> nt = floor(0.6 * n); nt              # set training set size
[1] 9
> indices = sample(1:n, size = nt)     # sample
> mydata.train = mydata[indices, ]     # define the training set
> mydata.test  = mydata[-indices, ]    # define the test set
To perform LDA in R, one can use the lda function from the package MASS. The output
contains the prior probabilities for each group, the group means, the coefficients of the
linear discriminant functions and the proportion of the trace, i.e. which proportion
of variance is explained by each discriminant function.
> fit = lda(Author ~ PC1 + PC2,        # fit LDA model
+           data = mydata.train)
> fit
Call :
lda(Author ~ PC1 + PC2, data = mydata.train)
Prior probabilities of groups :
   1    2    3
0.33 0.22 0.44
Group means :
   PC1   PC2
1 -3.3 -4.50
2  4.3  4.27
3  1.7 -0.83
Coefficients of linear discriminants :
      LD1   LD2
PC1 -0.26 -0.14
PC2 -0.22  0.12
Proportion of trace :
   LD1    LD2
0.9929 0.0071
Having used this classification rule, one can easily depict discrimination borders for
the groups to see how distinguishable the three classes are, see Fig. 8.8. The borders
between two classes are obtained by the difference between the corresponding dis-
criminant functions. For this purpose use function partimat from package klaR.
Fig. 8.8 Classification regions of the LDA rule in the (PC1, PC2) plane.
> require(klaR)
> partimat(Author ~ PC1 + PC2,    # multiple figure array
+          data = mydata.test,    # for the test dataset
+          method = "lda",        # using LDA
+          main = "")             # no title
Predicted classes and posterior probabilities can be obtained using the function
predict(). The table below shows the probability of each element falling into
each of the classes.
> pred.class = predict(fit, mydata.test)$class      # predicted class
> pred.class
[1] 1 1 3 2 3 3
Levels : 1 2 3
> post.prop = predict(fit, mydata.test)$posterior   # posterior prob.
> post.prop
              1    2    3
X14459gll  0.75 0.00 0.24
X14460gll  0.97 0.00 0.03
X14464gll  0.00 0.75 0.25
X14466gll  0.01 0.28 0.72
X14467gll  0.02 0.25 0.73
X14474gll  0.12 0.07 0.81
To check whether the discrimination rule works well, the percentage of correctly
classified observations is calculated. In this case, the percentage of error is 33%,
which can be interpreted as high or low depending on the application. The calculations
of the prediction error are shown below.
> pr.table = table(mydata.test$Author, pred.class)   # pred. vs true
> pr.table
   pred.class
    1 2 3
  1 2 0 0
  2 0 1 2
  3 0 0 1
> pred.correct = diag(prop.table(pr.table, 1))
> pred.correct
   1    2    3
1.00 0.33 1.00                     # prediction in %
Random number generation has many applications in economic, statistical, and finan-
cial problems. With the advantage of high speed and cheap computation, new sta-
tistical methods using random number generation have been developed. Important
examples are the bootstrap based procedures. When referring to a random number
generator of any statistical software package, the phrase ‘random number’ is mislead-
ing, as all random number generators are based on specific mathematical algorithms.
Thus, the computer generates deterministic and therefore pseudorandom numbers,
which are called ‘random’ for simplicity. In this context, the standard uniform dis-
tribution plays a key role, because its random numbers can be transformed so as
to obtain pseudo-samples from any other distribution. True random numbers can be
obtained by sampling and processing a source of natural entropy such as atmospheric
noise, radioactive decay, etc.
The main purpose of this chapter is to provide some computational algorithms
that generate random numbers.
In simulations, huge amounts of random numbers are used, thus sampling speed is crucial. Typically,
the algorithms are periodic, which means that the sequence repeats itself in the long
run. While periodicity is hardly ever a desirable characteristic, modern algorithms
have such long periods that it can be ignored for most practical purposes.
The initial state of the generator is s0 and it evolves according to the recurrence
sn = T (sn−1 ), for n = 1, 2, 3, . . .. At step n, the generator creates un = G(sn ) as
output. For n ≥ 0, the un are the random numbers produced by the generator. Due
to the fact that S is finite, the sequence of states sn is eventually periodic. So the
generator must eventually reach a previously seen state, which means si = sj for
some 0 ≤ i < j. This implies that sj+n = si+n and therefore uj+n = ui+n for all n ≥ 0.
The length of the period is the smallest integer p such that sp+n = sn for all n ≥ r for
some integer r ≥ 0. The smallest r with this property is called transient. For r = 0,
the sequence is called purely periodic. Note that the length of the period cannot
exceed the maximal number of possible states |S|. Thus a good generator has p very
close to |S|. Otherwise, this would result in a waste of computer memory.
Modular arithmetic is often used to cope with the issue of generating a sequence of
apparently random numbers on computer systems, which are completely predictable.
The basic relation of modular arithmetic is called equivalence modulo m, where m is
an integer. The modulo operation finds the remainder of the division of one number
by another, e.g. 7 mod 3 = 1. As stated in Sect. 1.4.1, the modulo operator in R is
%%.
In the following, we present two pseudorandom number generators, which illus-
trate the main ideas behind such algorithms.
Linear congruential generator
The Linear Congruential Generator (LCG) is one of the first developed and best-
known pseudorandom number generator algorithms. It is fast and can be easily imple-
mented.
The LCG produces a sequence of pseudorandom integers via the recurrence
x_{i+1} = (a · x_i + c) mod m, i = 0, 1, 2, . . . ,
where m, a, c and x_0 are the modulus, multiplier, increment and seed value, respectively.
To obtain numbers with the desired properties discussed in Sect. 9.1, one has to
transform the generated integers into [0, 1] with
G(x_i) = x_i / m = U_i ,   for i = 0, 1, 2, . . . .
The selection of values for a, c, m and x0 drastically affects the statistical proper-
ties and the cycle length of the generated sequence of integers. The full cycle length
is m if and only if
1. c ≠ 0,
2. c and m are relatively prime, i.e. their greatest common divisor is 1,
3. a − 1 is divisible by all prime factors of m,
4. and if m is divisible by 4, a − 1 also has to be divisible by 4.
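As an illustration, a minimal LCG sketch in R; the parameter set m = 2^32, a = 1664525, c = 1013904223 is one well-known choice satisfying the conditions above and is used here purely as an example:
> lcg = function(n, m = 2^32, a = 1664525, c = 1013904223, seed = 1) {
+   x = numeric(n + 1)
+   x[1] = seed                          # seed value x_0
+   for (i in 1:n) {
+     x[i + 1] = (a * x[i] + c) %% m     # linear congruential recurrence
+   }
+   x[-1] / m                            # transform to [0, 1]
+ }
> lcg(3)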
In addition, Marsaglia (1968) has shown that these points, when plotted in n-
dimensional space, will lie on at most m^{1/n} hyperplanes, see Fig. 9.1. This is illustrated
by a famous example of badly chosen starting values, namely in RANDU, a random
number generator developed by IBM. This algorithm was first introduced in the early
1960s and became widespread soon after.
Fig. 9.1 Nine plots of random numbers xk+2 versus xk+1 versus xk generated by RANDU visualised:
in a three dimensional space all points fall in 15 hyperplanes. BCS_RANDU
The chosen modulus, m = 2^31, is not a prime, and the multiplier a = 2^16 + 3 = 65539
was chosen primarily because of the simplicity of its binary representation, not for the
goodness of the resulting sequence of integers. Consequently, RANDU does not have
full cycle length and has some clearly non-random characteristics.
To demonstrate the inferiority of these values, consider the following calculation,
where mod 2^31 has been omitted from each term:
x_{k+2} = (2^16 + 3) x_{k+1} = (2^16 + 3)² x_k = (2^32 + 6 · 2^16 + 9) x_k = 6 (2^16 + 3) x_k − 9 x_k = 6 x_{k+1} − 9 x_k .
Hence every term is a simple linear combination of the two preceding terms, which explains
why the triplets in Fig. 9.1 fall on so few hyperplanes.
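A quick sketch reproducing RANDU (multiplicative LCG with a = 65539, c = 0, m = 2^31); plotting consecutive triplets of its output reveals the hyperplane structure of Fig. 9.1:
> randu = function(n, seed = 1) {
+   x = numeric(n + 1)
+   x[1] = seed
+   for (i in 1:n) x[i + 1] = (65539 * x[i]) %% 2^31   # RANDU recurrence
+   x[-1] / 2^31                                       # rescale to [0, 1]
+ }
> u = randu(3000)
> triplets = cbind(u[1:2998], u[2:2999], u[3:3000])    # (x_k, x_k+1, x_k+2)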
Unlike the LCG, each new value depends on two earlier values, x_i = (x_{i−j} + x_{i−k}) mod m
with lags j > k, so the seed is not a single value. It is rather a sequence of (at least) j
integers, of which one integer should be odd. The statistical properties of the resulting
sequence of numbers rely heavily on this seed.
The value of the modulus m does not by itself limit the period of the generator, as it
does in the case of an LCG. The maximum cycle length for m = 2^M is (2^j − 1) · 2^{M−1}
if and only if the trinomial x^j + x^k + 1 is primitive over the integers mod 2.
Using the notation LFG(j, k, p) to indicate the lags and the power-of-two modulus 2^p,
a commonly used version of this algorithm is LFG(17, 5, 31). The cycle length of
this version is 2^47.
> LFG = function(j, k, p, n) {
+ seed = runif(j, 0, 2^p) # generate the seed
+ for(i in 1:n) {
+ seed[j + i] = (seed[i] + seed[j + i - k]) %% 2^p
+ }
+ seed[(j + 1):length(seed)] / max(seed) # standardise to [0, 1]
+ }
> LFG(17, 5, 31, 4) # generate 4 random numbers
[1] 0.3102951 0.9048108 0.4415016 1.0000000
The basic problem with this generator is that there exist three-point correlations
between x_{i−k}, x_{i−j} and x_i, given by the construction of the generator itself, but typically
these correlations are very small.
Mersenne twister
The Mersenne twister is a pseudorandom number generator developed by Matsumoto
and Nishimura (1998). Due to its good properties, it is still widely used today.
Its name is derived from the fact that the period length is a Mersenne prime, i.e. a
prime which is one less than a power of two: M_p = 2^p − 1.
This section presents the most common version of this algorithm, also called
MT 19937.
Definition 9.5 (The Mersenne Twister ‘MT19937’) The sequence of numbers gener-
ated by the MT 19937 is uniformly distributed on [0, 1]. To save computation time, the
generator works internally with binary numbers. The main equation of the generator
is given by
x_{k+n} = x_{k+m} + x_{k+1} ( 0 0 ; 0 I_r ) A + x_k ( I_{w−r} 0 ; 0 0 ) A ,   k = 0, 1, . . . ,
where the blocks are written row by row and the (w × w) matrix A has the companion form
      ⎡ 0     1     0    · · ·  0   ⎤
      ⎢ 0     0     1    · · ·  0   ⎥
A =   ⎢ ⋮     ⋮     ⋮     ⋱      ⋮   ⎥
      ⎢ 0     0     0    · · ·  1   ⎥
      ⎣ a_31  a_30  a_29 · · ·  a_0 ⎦ .
As a result, each vector xi is a binary number with 32 digits. Afterwards, the resulting
vector xk+n is rescaled. x0 = 4357 is chosen to be the most appropriate seed.
To explain the recursion above, one can think of it as a concatenation and a shift:
a new vector is generated by the first 13 entries of xk and the last 19 entries of xk+1 .
The shift results from the multiplication by A, which is in some way disturbed by
the addition of a0 , a1 , . . .. The result is added to xk+m .
The resulting properties of the generated sequence of numbers are extremely
good. The period length of 2^19937 − 1 (≈ 4.3 · 10^6001) is astronomically high and
sufficient for nearly every purpose today. It is k-distributed to 32-bit accuracy for
every 1 ≤ k ≤ 623 (see Sect. 9.3.2). In addition, it passes numerous tests for statistical
randomness.
.Random.seed is an integer vector, containing the seed for random number gen-
eration. Due to the fact that all implemented generators use this seed, it is strongly
recommended not to alter this vector!
One can define a specific starting value with the function set.seed(). This is
a great way to ensure that simulation results are reproducible by using the same seed
value, as shown in the following.
> set.seed(2) # fix the seed
> x1 = runif(5)
[1] 0.1848823 0.7023740 0.5733263 0.1680519 0.9438393
> x2 = runif(5)
[1] 0.9434750 0.1291590 0.8334488 0.4680185 0.5499837
> set.seed(2) # use the same seed value as for x1
> x3 = runif(5)
[1] 0.1848823 0.7023740 0.5733263 0.1680519 0.9438393
> x1 == x2 # comparison of the generated sequences
[1] FALSE FALSE FALSE FALSE FALSE
> x1 == x3
[1] TRUE TRUE TRUE TRUE TRUE
set.seed() uses its single integer argument to automatically set as many seeds as
required for the pseudorandom number generator. This is considered a simple way of
getting quite different seeds by specifying small integer arguments, and also a way
of getting valid seed sets for the more complicated methods.
> require(random)
> x = randomNumbers(n = 1000, min = 1, max = 100, col = 1) / 100
> head(as.vector(x))
[1] 0.75 0.08 0.02 0.94 0.43 0.78
In this chapter, the three main principles for rv generation, the inverse transform
method, acceptance–rejection method, and the composition method, will be dis-
cussed. For simplicity, it is assumed that a pseudorandom number generator producing
a sequence of independent U(0, 1) variables is available, as discussed in Sect. 9.1.1.
The inverse transform method
From the property of the quantile function given in Definition 4.11, which states
that for U ∼ U(0, 1), the rv X = F^{−1}(U) has cdf F, i.e. X ∼ F, one can create rvs
very efficiently whenever F^{−1} can be calculated. This method is called the inverse
transform method.
Recall, that the inverse transform method has been shown to work even in the case
of discontinuities in F(x). As a result, the generated X will satisfy P(X ≤ x) = F(x),
so that X has the required distribution.
The acceptance–rejection method
Suppose the inverse of F is unknown or numerically hard to calculate and one wants
to sample from a distribution with pdf f (x). Under the following two assumptions,
the acceptance–rejection method can be used:
1. There is another function g(x) that dominates f (x) in the sense that g(x) ≥ f (x) ∀x.
2. It is possible to generate uniform values between 0 and g(x). These values will
be either above or below f (x).
It is intuitively clear that X has the desired distribution because the density of X is
proportional to the height of f (Fig. 9.2).
The dominating function g(x) should be chosen in an efficient way, so that the
area between f (x) and g(x) is small, to keep the proportion of rejected points small.
Additionally, it should be easy to generate uniformly distributed points under g(x).
Fig. 9.2 Illustration of the acceptance–rejection method: the acceptance region lies below f(x), the rejection region between f(x) and g(x). BCS_ARM
The average number of points (X, Y ) needed to produce one accepted X is called the
trials ratio, which is always greater than or equal to unity. The closer the trials ratio
is to unity, the more efficient is the generator. To present a handy way of constructing
a suitable g(x), consider a density h(x) for which an easy way of generating variables
already exists, and define g(x) = K · h(x). It can be shown that if X is a variable with
density h(x) and U is uniformly distributed on (0, 1) and independent of X, then the
points (X, Y) = {X, K · U · h(X)} are uniformly distributed under the
assured. Therefore the trials ratio is exactly K.
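A minimal sketch of the acceptance–rejection idea for the Beta(2, 2) density f(x) = 6x(1 − x) on [0, 1], with h(x) = 1 (the uniform density) and envelope g(x) = K · h(x), K = 1.5, so the trials ratio is 1.5 (the example and its constants are illustrative, not from the text):
> ar.beta = function(n) {
+   out = numeric(0)
+   while (length(out) < n) {
+     x = runif(1)                                 # candidate from h(x) = 1
+     y = 1.5 * runif(1)                           # uniform point under g(x) = 1.5
+     if (y <= 6 * x * (1 - x)) out = c(out, x)    # accept if the point lies below f(x)
+   }
+   out
+ }
> hist(ar.beta(1000), freq = FALSE)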
The composition method
Suppose a given density f can be written as a weighted sum of n densities
f(x) = Σ_{i=1}^{n} p_i · f_i(x),
where the weights p_i satisfy the two conditions p_i > 0 and Σ_{i=1}^{n} p_i = 1. In such a
framework, the density f is said to be a compound density. This method can be used
to split the range of X into different intervals, so that sampling from each interval
facilitates the overall process.
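A minimal sketch of the composition method for an illustrative two-component normal mixture f(x) = 0.3 ϕ(x; −2, 1) + 0.7 ϕ(x; 2, 1) (weights and parameters chosen only for the example):
> rmix = function(n) {
+   comp  = sample(1:2, n, replace = TRUE, prob = c(0.3, 0.7))   # choose a component with prob. p_i
+   means = c(-2, 2)
+   rnorm(n, mean = means[comp], sd = 1)                         # draw from the chosen component
+ }
> hist(rmix(10000), breaks = 50, freq = FALSE)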
For several distributions, R provides predefined functions for generating rvs. Most
of these functions will be discussed later in this chapter, to give a general overview
of this field. The syntax in this area follows a straight structure. All commands are
compositions of d, p, q, r (which stand for the density, distribution function, quantile
function, and rvs), plus the name of the desired distribution, as discussed in Chaps. 4
and 6. Thus
> rexp(1)
will give an rv from the exponential distribution with parameters set by default.
This section discusses the generation of rvs for several continuous distributions.
Starting with three famous algorithms for the normal distribution, several ways of
generating rvs for the exponential, gamma, and beta distribution will be presented.
The normal distribution
As already mentioned briefly in Sect. 4.3 rnorm() produces n rvs for the normal
distribution with mean equal to 0 and standard deviation equal to 1 by default.
> rnorm(n, mean = 0, sd = 1)
A famous method developed by Box and Muller (1958) was the earliest method for
generating normal rvs and, thanks to its simplicity, it was used for a long time. The
algorithm is provided in the following definition.
Definition 9.7 (The Box–Muller Method) Let U_1 and U_2 be independent rvs following
the uniform distribution U(0, 1). Consider the rvs
X_1 = √(−2 log U_1) cos(2π U_2),
X_2 = √(−2 log U_1) sin(2π U_2).
Then X_1 and X_2 are independent and standard normally distributed, i.e. (X_1, X_2)⊤ ∼
N(0, I_2). Considering Σ^{1/2} (X_1, X_2)⊤ instead, one obtains dependent rvs, see Fig. 6.4.
Unfortunately, this algorithm is rather slow due to the fact that for each number a
square root, log, and a trigonometric function have to be computed.
Neave (1973) has shown that the Box–Muller method shows a large discrepancy
between observed and expected frequencies in the tails of the normal distribution
when U1 and U2 are generated with a congruential generator. This effect became
known as the Neave effect and is a result of the dependence of the pairs generated by
a congruential generator, such as RANDU. This problem can be avoided by using
two different sources for U1 and U2 , as shown in the following R code.
> boxmuller = function(n){
+ if(n %% 2 == 0){a = n / 2}else{a = n / 2 + 1}
+ x1 = x2 = 1:a
+ for (i in 1:a) {
+ u1 = runif(1) # generate two
+ u2 = runif(1) # uniform rvs
+ x1[i] = sqrt(-2 * log(u1)) * cos(2 * pi * u2) # transformation
+ x2[i] = sqrt(-2 * log(u1)) * sin(2 * pi * u2)
+ }
+ c(x1, x2) # print results
+ }
> boxmuller(4)
[1] 2.527755 -1.548469 -0.794818 -1.777311
> polarmethod = function(n){
+   if(n %% 2 == 0){a = n / 2}else{a = n / 2 + 1}
+   x1 = x2 = 1:a
+   i = 1
+   while(i <= a){
+     u1 = 2 * runif(1) - 1; u2 = 2 * runif(1) - 1   # two uniform rvs on [-1, 1]
+     w = u1^2 + u2^2
+     if(w < 1){                          # accept points inside the unit circle
+       z = sqrt(-2 * log(w) / w)         # transformation
+       x1[i] = u1 * z
+       x2[i] = u2 * z
+       i = i + 1                         # proceed counter
+     }
+   }
+   c(x1, x2)                             # print results
+ }
> polarmethod(8)
[1] 0.41867423 0.90550395 -0.07986714 1.17828848
[5] 0.65455600 -0.71171498 -0.05401868 0.90767865
The first part produces a point (u_1, u_2), which is an observation from an rv uniformly
distributed on [−1, 1]². If w is smaller than 1, this point is located inside the unit
circle. Then u_1/√w is equivalent to the sine and u_2/√w to the cosine of a random
direction (angle). Moreover, the angle is independent of w, which is an observation
of a uniformly distributed rv. This method is a good example of the
acceptance–rejection method for the normal distribution.
Definition 9.9 (Ratio of Uniforms) Generate u_1 from U(0, b) and u_2 from U(c, d), and
set x = u_2/u_1, with b = sup_x {h(x)}^{1/2}, c = −sup_x x{h(x)}^{1/2} and d = sup_x x{h(x)}^{1/2}.
If u_1² ≤ h(u_2/u_1), deliver x;
otherwise, repeat the algorithm;
where h(·) is some density function.
For the normal distribution with the non-normalised density h(x) = exp(−x²/2), the
algorithm can be stated as follows.
Generate u_1 from U(0, 1) and u_2 from U(−√(2/e), √(2/e)), where e is the base
of the natural logarithm. Let x = u_2/u_1 and z = x².
If z ≤ 5 − {4 exp(1/4)} u_1, deliver x (quick accept);
if z > {4 exp(−1/4)}/u_1 − 3, repeat the algorithm (quick reject);
if z ≤ −4 log u_1, deliver x;
otherwise, repeat the algorithm.
The inequality can then be stated in terms of the variable x. To avoid repeated com-
putation of log u1 , the inner and outer bounds defined by the following inequalities
on log u are calculated.
These two inequalities arise from the fact that the tangent line, taken at the point d,
lies above the concave log function:
Taking y = u1 and d = 1/c leads to the lower bound, using y = 1/u1 and d = c yields
the upper bound. Note that the area of the inner bound is largest when c = exp(1/4)
and note that the constant 4 · exp(1/4) = 5.1361 is computed and stored in advance
to avoid evaluating the same transcendental function a second time.
The exponential distribution
The first algorithm provided in this section is interesting, because it uses only arith-
metic operations.
Definition 9.10 (Neumann’s Algorithm) Generate a number of random observations
u1 , u2 , . . . from a uniform distribution as long as their values consecutively decrease,
i.e. until un+1 > un . If n is even, return x = un , otherwise repeat the procedure.
Despite its simplicity, this method should not be used for generating rvs, because
too many uniformly distributed random numbers are needed to generate one expo-
nentially distributed rv. Neumann’s rather inefficient algorithm can be improved
by applying the result of Pyke (1965)'s theorem. It states that n{U_(k+1) − U_(k)} =
nS_k ∼ E(k), where U_(1) ≤ U_(2) ≤ . . . ≤ U_(n) is an ordered series of standard uniformly
distributed rvs, with S_k = U_(k+1) − U_(k) and S_n = U_(n).
One important advantage in computing rvs from the exponential distribution is the
fact that the exponential distribution has a closed form expression for the inverse of its
cumulative distribution function. Given a random number generator, some numbers
X must be selected, which need to obey an exponential distribution. The following
definition states the selection procedure.
Definition 9.11 (Inverse cdf Method for the Exponential Distribution) First, generate
a variable u from U(0, 1), then calculate x = −λ^{−1} log(1 − u).
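A sketch of the corresponding generator, as it is used in the timing comparison below (the function name invexp and its exact form are assumed):
> invexp = function(n, lambda = 1) {
+   u = runif(n)                 # uniform rvs
+   -log(1 - u) / lambda         # inverse cdf of the exponential distribution
+ }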
rexp() uses the algorithm by Ahrens and Dieter (1972), which is faster than the
inverse method presented above.
> ptm = proc.time()
> invexp(10000, 1)
> proc.time() - ptm
User System elapsed
0.39 0.00 0.66
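In R, the built-in generator
> rgamma(n, shape, rate = 1, scale = 1/rate)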
generates n rvs with shape parameter b, default rate of 1, and scale parameter 1/rate.
For b ≥ 1, a specific algorithm by Ahrens and Dieter (1982b) is used, but for 0 <
b < 1, the following, different, algorithm by Ahrens and Dieter (1974) is used.
3. Generate u_2 from U(0, 1) and set y = w^{1/b}. If u_2 ≤ exp(−y), return x = ay, else
go to 1.
4. Generate u_2 from U(0, 1) and set y = −log[{(e + b)/e − w}/b]. If u_2 ≤ y^{b−1}, return
x = ay, else go to 1.
This algorithm was introduced by Atkinson and Pearce (1976) and is simple and
short. It is also efficient for b < 5, as the trials ratio increases only from unity at b = 1
to 2.38 at b = 5. Note that for greater values of b, Algorithm 9.13 by Cheng is
more efficient.
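Similarly, the built-in generator
> rbeta(n, shape1, shape2, ncp = 0)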
generates n rvs of the beta distribution with the two shape parameters p and q and a
default non-centrality parameter of 0. rbeta() is based on the following algorithm
by Cheng (1978).
This method has a bounded trials ratio of less than 4/e ≈ 1.47.
The general methods of Sect. 9.2.3 are in principle available for constructing discrete
variable generators. However, the special characteristics of discrete variables imply
certain modifications.
The binomial distribution
In R, rvs from the binomial distribution, see 3.6, can be generated via
> rbinom(n, size, prob)
where n is the number of observations, size the number of trials, and prob the prob-
ability of success.
Definition 9.16
1. Set x = 0.
2. Generate u from U(0, 1).
3. If u ≤ p, set y = 1, else y = 0.
4. Set x = x + y.
5. Repeat n times from step 2, then return x.
This algorithm uses the fact that x is an observation from binomially distributed rv
with n and p, i.e. the sum of n independent Bernoulli variables with parameter p.
Note that the generation time increases linearly with n.
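A direct R sketch of Definition 9.16, generating a binomial rv as the sum of n independent Bernoulli(p) trials (names are illustrative):
> rbinom.sum = function(size, prob) {
+   x = 0
+   for (i in 1:size) {
+     u = runif(1)               # step 2: uniform rv
+     x = x + (u <= prob)        # steps 3-4: add one Bernoulli outcome
+   }
+   x                            # step 5: return x after n repetitions
+ }
> rbinom.sum(10, 0.3)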
As pointed out earlier, the fastest binomial generators for fixed parameters n and
p are obtained via table methods. On the downside for these methods, the memory
requirements and the setup time for new values of n and p are proportional to n, which
is a major drawback. More useful is a simple inversion without a table resulting in
a short algorithm and a shorter setup time. The execution time is proportional to
n · min(p, 1 − p). Therefore, rejection algorithms were proposed because they are
on the whole both fast and well suited for changing the values of n and p, as typically
required in simulation.
The implemented algorithm for rbinom() is based on a version by
Kachitvichyanukul and Schmeiser (1988). The algorithm generates binomial vari-
ables via an acceptance/rejection based on the function
f(x) = [⌊np + p⌋! · (n − ⌊np + p⌋)!] / [⌊x + 0.5⌋! · (n − ⌊x + 0.5⌋)!] · {p/(1 − p)}^{⌊x+0.5⌋ − ⌊np+p⌋}   for −0.5 ≤ x ≤ n + 0.5.
The resulting algorithm dominates other algorithms with constant memory require-
ments when n · min(p, 1 − p) ≥ 10 in terms of execution time. Only for n · min(p, 1 − p) < 10 is the inverse transformation algorithm faster. An implementation of the inverse transformation is presented below.
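A sketch of such an inverse-transform binomial generator (the function name and exact form are ours) uses the recurrence P(X = x + 1) = P(X = x) (n − x) p / {(x + 1)(1 − p)}:
> rbinom.inv = function(nn, n, p){          # assumes 0 < p < 1
+   sapply(1:nn, function(i){
+     u   = runif(1)
+     x   = 0
+     f   = (1 - p)^n                       # P(X = 0)
+     cdf = f
+     while(u > cdf && x < n){
+       f   = f * (n - x) / (x + 1) * p / (1 - p)   # P(X = x + 1)
+       x   = x + 1
+       cdf = cdf + f
+     }
+     x
+   })
+ }
> rbinom.inv(10, 20, 0.3)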
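The algorithm discussed next, for the Poisson distribution, is presumably the classical product method, which multiplies standard uniforms until the product falls below L = exp(−λ); a sketch (not the original listing) is:
> rpois.prod = function(n, lambda){
+   L = exp(-lambda)                        # the only constant to be evaluated
+   sapply(1:n, function(i){
+     k = 0
+     prodU = 1
+     repeat{
+       prodU = prodU * runif(1)            # multiply uniforms
+       if(prodU <= L) return(k)            # count before the product drops below L
+       k = k + 1
+     }
+   })
+ }
> rpois.prod(10, 4)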
The advantages of this algorithm are that only one constant L = exp(−λ) has to be
evaluated and that it requires only a minimum amount of storage space. However, the
time to generate an rv increases rapidly with λ. The following method can be used for large λ, such as λ > 25; it is based on the fact that the distribution of (X − λ)/√λ converges in law to N(0, 1) as λ → ∞. Bear in mind that this is an asymptotic result.
> poisson.as = function(n, lambda = 1){
+   a = lambda^0.5                          # standard deviation sqrt(lambda)
+   sapply(1:n,                             # rounded normal approximation, truncated at 0
+          function(x){max(0, trunc(0.5 + lambda + a * rnorm(1)))})
+ }
> poisson.as(10, 30)
[1] 31 27 31 35 33 34 34 26 27 31
Then the vector X can be built up one component at a time, where each component is obtained by sampling from a univariate distribution and recursively calculating each X_1, X_2, . . . , X_d.
For this method, it is necessary to know all the conditional densities. Therefore, its
usefulness depends heavily on the availability of the conditional distributions and,
of course, on the difficulty of sampling from them.
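As a simple illustration of this conditional approach, a bivariate normal vector with correlation ρ can be generated from X1 ∼ N(0, 1) and X2 | X1 = x1 ∼ N(ρ x1, 1 − ρ²); the following sketch (the function name is an assumption) uses this decomposition:
> rbvnorm.cond = function(n, rho = 0.5){
+   x1 = rnorm(n)                                  # X1 ~ N(0, 1)
+   x2 = rnorm(n, mean = rho * x1,                 # X2 | X1 ~ N(rho * x1, 1 - rho^2)
+              sd = sqrt(1 - rho^2))
+   cbind(x1, x2)
+ }
> cor(rbvnorm.cond(10000, rho = 0.7))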
The transformation method
If the conditional distributions of X are difficult to derive, then perhaps a more con-
venient transformation can be found. The key element for this method is to represent
X as a function of other, usually independent, univariate rvs. An example of this
method is the Box–Muller method (see Definition 9.7), which uses two independent
uniform variables and converts them into two independent normal variables.
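A compact R sketch of the Box–Muller transformation is given below; it converts pairs of independent uniforms into pairs of independent standard normals:
> box.muller = function(n){
+   m  = ceiling(n / 2)
+   u1 = runif(m)
+   u2 = runif(m)
+   r  = sqrt(-2 * log(u1))                        # radius
+   z  = c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))
+   z[1:n]                                         # n standard normal rvs
+ }
> qqnorm(box.muller(10000))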
Even though the transformation method has wide applicability, it is not always
trivial to find a transformation with which to generate a multivariate distribution of
a given X. The following guidelines by Johnson (1987) have proven helpful.
1. Beginning with the functional form fX (x), one could apply invertible transforma-
tions to the components of X in order to find a recognizable distribution.
2. Consider transformations of X that simplify arguments of transcendental functions
in the density fX (x).
This method works faster than the conditional inverse technique. The drawback is
that the distribution M can be determined explicitly only for a few generator functions
φ, for example the Frank, Gumbel and Clayton families.
A simple implementation in R makes use of the package copula, which was
discussed in detail in Sect. 6.3. The command rMvdc() draws n random numbers from a multivariate distribution specified by a copula and its margins (an mvdc object).
> # specification of the Clayton copula with uniform marginals
> require(copula)
> uniclayMVD = mvdc(claytonCopula(0.79),
+ margins = c("unif", "unif"),
+ paramMargins = list(list(min = 0, max = 1),
+ list(min = 0, max = 1)))
> # 10000 random number draw from the Clayton copula
> rMvdc(uniclayMVD, n = 10000)
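The right panel of Fig. 9.3 uses the same copula with standard normal marginals; a sketch of the corresponding specification (the object name normclayMVD is ours) is:
> # the same Clayton copula with standard normal marginals
> normclayMVD = mvdc(claytonCopula(0.79),
+                    margins      = c("norm", "norm"),
+                    paramMargins = list(list(mean = 0, sd = 1),
+                                        list(mean = 0, sd = 1)))
> rMvdc(n = 10000, mvdc = normclayMVD)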
Figure 9.3 shows 10,000 random numbers drawn from a Clayton copula.
Fig. 9.3 10,000 realizations of an rv with uniform marginals in [0, 1] (left) and with standard
normal marginals (right) with the dependence structure in both cases given by a Clayton copula
with θ = 0.79. BCS_claytonMC
9.3 Tests for Randomness
The first tests for random numbers in history were published by Kendall and Smith
(1938). They were built on statistical tools, such as the Pearson chi-square test, which
were developed in order to distinguish whether or not experimental phenomena
matched up with their theoretical probabilities.
Kendall and Smith's original four tests were hypothesis tests, testing the null hypothesis that each number in a given random sequence had an equal chance of occurring, and that various other patterns in the data should also be distributed equiprobably. The four tests are:
• The frequency test is a very basic test which checks whether there are roughly the same number of 0's, 1's, 2's, 3's, etc.; a short example follows after this list.
• The serial test does the same for sequences of two digits at a time (00, 01, 02, etc.),
comparing their observed frequencies with their hypothetical predictions based on
equal distribution.
• The poker test is used to test for certain sequences of five numbers at a time (00000,
00001, 00011, etc.), based on hands in the game poker.
• The gap test looks at the distances between zeroes (00 would be a distance of 0,
030 would be a distance of 1, 02250 would be a distance of 3, etc.).
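As announced above, the frequency test can be carried out with a χ² goodness-of-fit test; the sketch below applies it to digits produced by R's own generator (the sample size is arbitrary):
> x   = sample(0:9, 10000, replace = TRUE)      # digits to be tested
> obs = table(factor(x, levels = 0:9))          # observed digit frequencies
> chisq.test(obs, p = rep(0.1, 10))             # equal probabilities under H0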
Nevertheless, the following sections present two less intuitive and harder-to-pass tests, in contrast to the more natural approaches above.
The birthday spacing test is one of a series of tests called Diehard tests, which
were developed by Marsaglia (1995) and published on a CD. Consider the following
situation. If m birthdays are randomly chosen from a year of n days (usually 365) and
sorted, the number of duplicate values among the spacings between those ordered birthdays will be asymptotically Poisson distributed with parameter λ = m³/(4n).
Theory provides little guidance on the speed of the approach to the limiting form,
but extensive simulation with a variety of random number generators provides values
of m and n for which the limiting Poisson distribution seems satisfactory. Among
these are m = 1024 birthdays for a year of length n = 2²⁴, which gives λ = 16.
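A small simulation illustrates this setting; counting duplicate spacings via duplicated() is one of several equivalent conventions:
> birthday.dups = function(m = 1024, n = 2^24){
+   b = sort(sample.int(n, m, replace = TRUE))   # m random birthdays
+   sum(duplicated(diff(b)))                     # duplicates among the spacings
+ }
> set.seed(42)
> mean(replicate(200, birthday.dups()))          # should be close to lambda = 16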
where trunc_v(x) denotes the number formed by the leading v bits of x, i.e. trunc_2(0.23917) = 0.23, and each of the 2^(kv) possible combinations of bits occurs the same number of times in a period, except for the all-zero combination, which occurs less often by one instance.
To test for k-distribution to n-bit accuracy, at least 2^(kn) measurements are needed. Thus, this property is generally shown theoretically without performing the actual measurements. Nevertheless, it is possible to test for small k. In the case of k = 2, each pair of the sequence {U_i, U_{i+1}}, i = 1, . . . , 2n − 1, refers to certain points of the unit square, where U_i ∼ U(0, 1). Decomposing the unit square into n² subsquares and counting the number of points in the subsquares allows using a χ²-test, since the number of observed and expected points in each cell can be compared. This example can be extended to larger k.
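A sketch of such a test for k = 2, using non-overlapping pairs for simplicity and m² subsquares, is given below:
> k2.test = function(N = 10000, m = 10){
+   u   = runif(2 * N)
+   cx  = cut(u[seq(1, 2 * N, by = 2)], breaks = (0:m) / m,
+             include.lowest = TRUE)             # column of the subsquare
+   cy  = cut(u[seq(2, 2 * N, by = 2)], breaks = (0:m) / m,
+             include.lowest = TRUE)             # row of the subsquare
+   obs = as.vector(table(cx, cy))               # observed counts per cell
+   chisq.test(obs, p = rep(1 / m^2, m^2))       # uniform cell probabilities under H0
+ }
> k2.test()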
The name Trellis comes from the trellis-like rectangular array of panels, similar to a garden trellis. By means of Trellis graphics it is possible to study the dependence of a response variable on more than two explanatory variables. Multipanel conditioning is used for displaying multiple plots on one page with shared coordinate scales, aspect ratios and labels. This feature is especially useful for plotting multivariate and panel data, and is not provided by the standard R graphics system. The design goal of the Trellis system is the optimisation of the available output area; therefore, Trellis graphics provide default settings that produce superior plots in comparison to their traditional counterparts.
The lattice package is based on the grid graphics system, which is a low-level graphics system, see Sarkar (2010). grid does not provide high-level functions to create complete plots, but creates a basis for developing high-level functions and facilitates the manipulation of graphical output in lattice. Since lattice consists of grid calls, it is possible to add grid output to lattice output and vice versa, see R Development Core Team (2012). Knowledge of the grid package is beneficial for customising plots in lattice. Nevertheless, lattice is a self-contained graphics system, providing functions to produce complete plots, to control the appearance of the plots and to open and close devices.
A short description of the package functions and relevant examples of lattice graphical output are given in the following.
The lattice package contains functions, objects and datasets. Most of the functions implemented in lattice have counterparts in the traditional R graphics environment. The complete list is given in Table 10.1.
Each of the listed high-level functions creates a particular type of display by
default. Although the functions produce different output, they share many common
features, i.e. several common arguments affect the resulting displays in similar ways. These arguments are extensively documented in the help pages for xyplot(). The most important of them are the formula argument, describing the variables, and the panel argument, specifying the plotting function. These will be explained in more detail in the following subsections.
In order to avoid mistakes in the use of the formula argument, it should be kept
in mind that the syntax of the formula in lattice differs from that of formula
used in the lm() linear model function, see Chap. 8.
The variable on the left side of '∼' is a dependent variable, while the independent variable(s) is (are) placed on the right side. For graphs of a single variable, only one independent variable needs to be specified, as in the first row of Table 10.2. In order to define multiple dependent or independent variables, the sign '+' is
placed between them. In case of multiple dependent variables, the formula would
be assigned as y1 + y2 ∼ x, so that the variables y1 and y2 are plotted against the
variable x. In fact, y1 ∼ x and y2 ∼ x will be superposed in each panel. In a similar
way, one can set multiple independent or both independent and dependent variables
simultaneously as is implied in the code of Fig. 10.1 later in this chapter.
To produce conditional plots, the conditioning variable should also be specified in the formula argument, standing after the '|' symbol.
[Fig. 10.1: xyplot of Sepal.Length + Sepal.Width against Petal.Width + Petal.Length from the iris data, with one panel per Species (setosa, versicolor, virginica)]
When multiple conditioning variables z1 and z2 are specified, lattice produces, for each combination of the levels of z1 and z2, a plot of y against x, as depicted in Fig. 10.5. The notation is y ∼ x | z1 + z2.
The definition of the formula argument is the initial step in the multilevel devel-
opment process of lattice graphical output. The values used in the formula
are contained in the argument data, specifying the data frame.
As mentioned above, the default settings of lattice plots are optimised in comparison with traditional R plots. The panel function is a function that uses a subset of the arguments to create a display. All lattice plotting functions have a default panel
function, the name of which is built up from the prefix panel and the name of
the function. For instance, the default panel function for the bwplot() function
is panel.bwplot(). However, apart from superior default settings, lattice
offers lots of flexibility due to its highly customisable panel functions.
There are two perspectives from which the lattice graph should be observed.
First, the function call, e.g. histogram(), sets up the external components of the
display, such as the scale rectangle, axis labels, etc. Second, the panel function creates
everything placed into the plotting region of the graph, such as plotting symbols.
The panel function is called from the general display function by the panel
argument. Therefore, for the default settings, both function calls are identical.
> histogram(~ x, data = dataset)
> histogram(~ x, data = dataset, panel = panel.histogram)
There are different arguments that could be treated under the panel function. In
order to temporarily change the default settings of, for instance, the plotting symbols,
one can rewrite the new value into the panel function inside the general function
call.
> xyplot(y ~ x,
+        data  = dataset,
+        panel = function(x, y){panel.xyplot(x, y, pch = 20)})
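The permanent alternative is to define a named panel function once and pass it by name; a sketch (the name my.panel and its body are assumptions, not reproduced from the call above) is:
> my.panel = function(x, y, ...){
+   panel.xyplot(x, y, pch = 20, ...)   # always use plotting symbol 20
+ }
> xyplot(y ~ x, data = dataset, panel = my.panel)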
Now, by choosing the my.panel function, one would always use the plotting symbol pch = 20.
In a similar way, different attributes (e.g. cex, font, lty, lwd, etc.) can be altered for a specific function either temporarily or permanently.
lattice offers conditional and grouped plots to work with and display multivariate
data. In order to obtain a conditional plot, at least one variable should be defined as
conditioning.
One gets different visual representations of the dataset, depending on whether the same variable is used for conditioning or for grouping.
For the dataset iris, the variable Species is used for conditioning, as shown in the R code below, which corresponds to Fig. 10.1.
> xyplot ( Sepal.Length + Sepal.Width ~
+ Petal.Length + Petal.Width | Species,
+ data = iris )
Figure 10.1 contains three panels, standing for three types of Species. Each panel
contains four combinations of iris characteristics.
Another alternative for displaying multivariate data is the groups argument.
This splits the data according to the grouping variable. For the sake of comparability, Fig. 10.2 shows four panels, each one illustrating a combination of two variables, with the Species types denoted by different colours.
[Fig. 10.2: grouped xyplot of the iris data with four panels (Sepal.Width * Petal.Width, Sepal.Width * Petal.Length, Sepal.Length * Petal.Width, Sepal.Length * Petal.Length), with Species indicated by colour]
The use of a conditioning or grouping variable requires including a key legend in the graph. The argument auto.key draws the legend, and its component columns defines the number of columns into which the legend is split.
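A call of roughly the following form (a sketch, not necessarily the exact code behind Fig. 10.2) produces the grouped display with a legend:
> xyplot(Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width,
+        data     = iris,
+        outer    = TRUE,                    # one panel per variable combination
+        groups   = Species,                 # grouping instead of conditioning
+        auto.key = list(columns = 3))       # legend split into three columns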
In this particular example, it is not very important how one employs the Species variable, since both outputs are qualitatively equivalent.
There are datasets where it is preferable to produce grouped plots rather than con-
ditional plots. The following example of a density plot from the dataset chickwts
confirms this.
> densityplot(~ weight | feed,        # set conditioning variable
+             data = chickwts,
+             plot.points = FALSE)    # mask points
The resulting output is shown in Fig. 10.3. Since we employed the conditioning
variable feed, which has six categories, Fig. 10.3 produces six panels with density
plots.
Alternatively, the variable feed can be used as a grouping variable. Figure 10.4
creates one single panel with six superposed kernel density lines and enables a direct
comparison between the different groups.
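A call of roughly the following form (a sketch, not necessarily the original listing) produces such a grouped density plot:
> densityplot(~ weight,
+             data        = chickwts,
+             groups      = feed,            # grouping instead of conditioning
+             plot.points = FALSE,
+             auto.key    = list(columns = 3))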
[Fig. 10.3: conditional density plots of weight for the six feed groups of the chickwts data]
Moreover, a black and white colour scheme was applied to both Fig. 10.3 and Fig. 10.4; in lattice, colours are replaced by different line and symbol types when the following code is applied.
> lattice.options(default.theme =
+     # set the default lattice colour scheme to black/white
+     modifyList(standard.theme(color = FALSE),
+     # set strips background to transparent
+     list(strip.background = list(col = "transparent"))))
[Fig. 10.4: grouped density plot of weight with six superposed kernel density curves, one per feed group]
variables in the form of factors. It consists of a numeric vector and possibly overlapping intervals.
To convert a continuous variable into a shingle object means to split it into
(possibly overlapping) intervals (levels). In order to do this, one uses the shingle()
function, whereas the function equal.count() is used when splitting into equal
length intervals is required. The number argument defines the number of intervals,
whereas the overlap argument assigns the fraction of points to be shared by the
consecutive intervals. The endpoints of the intervals are chosen in such a way that
the counts of points in the intervals are as equal as possible. shingle returns the
list of intervals of the numeric variable.
In the following R code, the continuous variables temperature and wind are split into three and four equal non-overlapping intervals, respectively, and can then be treated as usual factor variables. The new factor variables Temperature and Wind are used as the conditioning variables.
> Temperature = equal.count(environmental$temperature,
+                           number  = 3,   # split into 3 equal intervals
+                           overlap = 0)   # no overlapping
> Wind = equal.count(environmental$wind,
+                    number  = 4,
+                    overlap = 0)
> xyplot(ozone ~ radiation | Temperature * Wind,
+        data     = environmental,
+        as.table = TRUE)                  # panel layout from top to bottom
Figure 10.5 depicts the simultaneous use of these two conditioning variables.
[Fig. 10.5: ozone against solar radiation in Langley (1 LANG = 41,868 Joule/m²), conditioned on the shingles Temperature and Wind]
Temperature now contains three levels and Wind has four levels, so that a rectangular array of 12 panels is created, depicting the ozone variable against the radiation variable for each combination of the conditioning variables.
The argument par.strip.text controls the text on each strip, with the main components cex, col, font, etc. By default, lattice displays the panels from bottom to top and left to right. By setting the argument as.table = TRUE, the panels are displayed from top to bottom.
Of course more than two conditioning variables are also possible, but the increas-
ing level of complexity of the graphical output should be kept in mind.
[Fig. 10.6: time series plot of the Nile data]
The ability to draw multiple panels in one plot is particularly useful for time series
data. lattice enables cut-and-stack time series plots. The argument cut specifies the number of intervals into which the time series should be split, so that changes over a time period can be studied more precisely.
The code for the simple time series plot in Fig. 10.6 is
> xyplot(Nile)
One can customise the plot by varying the arguments aspect, cut and strip,
where the last is responsible for the colour scheme of the strips.
> xyplot(Nile, aspect = "xy",
+        cut = list(number  = 3,                   # split into three panels
+                   overlap = 0.1),                # 10 per cent overlap
+        strip = strip.custom(bg = "yellow",       # strip background
+                             fg = "lightblue"))   # strip foreground
Figure 10.7 shows three panels, corresponding to the number of intervals. Such a cut-and-stack display makes it much easier for the user to study the series in detail.
An object of class ts could also be a multivariate series, so that multiple time
series can be displayed in parallel in the same graph. For instance, by setting the
superpose argument to be TRUE, all series will be overlaid in one panel. When
the screens argument is specified, the series will be plotted into a predefined panel.
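For the multivariate series EuStockMarkets, for instance, the two arguments could be used as follows (a sketch; the assignment of the series to screens is arbitrary):
> xyplot(EuStockMarkets, superpose = TRUE)   # all four series in one panel
> xyplot(EuStockMarkets,
+        screens = c(1, 1, 2, 2))            # two predefined panels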
[Fig. 10.7: cut-and-stack plot of the Nile series. Fig. 10.8: display of the barley data, showing yield by variety and site for 1931 and 1932]
The result of the listing is shown in Fig. 10.8, which presents the four-dimensional plot
of the lattice data set barley. The explanatory variable yield is illustrated
by means of the sizes and colours of the boxes. The higher the value of yield, the
larger and lighter are the rectangles.
The arguments of interest are cuts, which specify the number of levels (the
colour gradient) into which the range of a dependent variable is to be divided and
region, which is a logical variable that defines whether the regions between the
contour lines should be filled. Since region = TRUE, col.regions defines
the colour gradient of the dependent variable. The shrink argument scales the
rectangles proportionally to the dependent variable, and the between argument specifies the space between the panels along the x and/or y axis.
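A call of roughly the following form (a sketch; the argument values are assumptions and do not reproduce the original listing) combines these arguments:
> levelplot(yield ~ variety * site | year,
+           data        = barley,
+           cuts        = 10,                     # number of colour levels
+           region      = TRUE,                   # fill the regions
+           col.regions = gray.colors(100),       # colour gradient
+           shrink      = c(0.3, 1),              # scale rectangles by yield
+           between     = list(x = 0.5, y = 0.5)) # space between panels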
The settings for the layout and appearance of the lattice plots facilitate an
enhanced comprehension of the data. Multipanel conditioning is a central feature
delivered by lattice, which enables data visualisation on multiple panels simul-
taneously, displaying different subsets of the data. Although lattice provides this
1. Device management functions include six functions, which control the RGL win-
dow device. These functions are used to open/close the device, to return the
number of the active devices, to activate the device and to shut down the rgl
device system or to re-initialise rgl.
2. Scene management functions enable stepwise removal of certain objects, such as
shapes, lights, bounding boxes and background, from the 3D scene.
3. Export functions are used to save snapshots or screenshots in order to export them
to other file formats.
4. Environment functions are set to alter the environment properties of the scene,
e.g. to modify the viewpoint, background, axis labelling, bounding box, or to add
a light source to the 3D scene.
We next demonstrate the shape functions in the rgl package combined with certain
environment and object properties.
The shape functions are an important part of the RGL library, since they enable the plotting of both primitive shapes, such as points, lines, linestrips, triangles and quads, and high-level shapes, such as spheres and different surfaces, see Figs. 10.9, 10.10, 10.11, 10.12, 10.13 and 10.14.
RGL adds further shapes to the already opened device by default. To avoid this,
one can create a new device window with the calls rgl.open() or open3d().
The shape functions in rgl are briefly described in the list below.
1. 3D points are drawn by the function rgl.points(x, y, z,...), see
Fig. 10.9.
2. 3D lines can be depicted with the function rgl.lines(x, y, z,...), see
Fig. 10.10. The nodes of the line are defined by the vectors x, y, z, each of length
two.
3. 3D linestrips are constructed with the function rgl.linestrips(x,y,
z,...). The nodes of the linestrips are, as in rgl.lines(x, y, z,...),
defined by the vectors x, y, z, each of length two. In the output, each next line
strip starts at the point where the previous one ends, see Fig. 10.11.
4. 3D triangles are created with the function rgl.triangles(x, y, z,...),
see Fig. 10.12. The vectors x, y and z, each of length three, specify the coordinates
of the triangle.
5. 3D quads can be drawn with the function rgl.quads(x, y, z,...), see
Fig. 10.13. The vectors x, y and z, each of length four, specify the coordinates of
the quad.
6. 3D spheres are not primitive, but they can be easily created with the function
rgl.spheres(x, y, z, r,...). This function plots spheres with centres
defined by x, y, z and radius r. In order to create multiple spheres, one can define x, y, z, r as vectors of length n, see Fig. 10.14 and the short example after this list.
7. 3D surfaces can be drawn by means of the generic rgl.surface(x,...)
function. This is defined by a matrix specifying the height of the nodes and two
vectors defining the coordinates.
Each of the shape functions can be produced with higher level functions from the
r3d interface.
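A small example of the sphere function from item 6 of the list above (the coordinates and radii are arbitrary) is:
> require(rgl)
> open3d()                                   # open a new device window
> rgl.spheres(x = rnorm(10), y = rnorm(10), z = rnorm(10),
+             r = runif(10, 0.1, 0.3),       # one radius per sphere
+             col = "steelblue")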
Alternatively, 3D surfaces can be constructed with the persp3d(x,...),
surface3d() or terrain3d() functions. As an example of a 3D surface, the
hyperbolic paraboloid
x²/a² − z²/b² = y,
is produced by surface3d() and displayed in Fig. 10.15.
> require(rgl)
> x = z = -9:9
> f = function(x, z){(x^2 - z^2) / 10}
> y = outer(x, z, f)                       # square matrix, x rows, z columns
> open3d()
> surface3d(x, z, y,                       # plot 3D surface
+           back  = "lines",               # back side as grid
+           col   = rainbow(1000),
+           alpha = 0.9)                   # transparency level
> bbox3d(back = "lines", front = "lines")  # 3D bounding box
In this example of a 3D surface (see Fig. 10.15), a side dependent rendering effect
was implemented. This option gives the possibility of drawing the ’front’ and ’back’
sides of an object differently. By default, the solid mode is applied, which can be
changed either to lines, points or cull (hidden). In Fig. 10.15, the front side is drawn
with a solid colour, whereas the back side appears to be a grid. This creates a better
illusion of 3D space. The bounding box is added to the scene with the function
bbox3d().
There are many options that can be used in order to make the 3D object look more realistic. The lighting condition of the shape is described by the totality of light objects. There are three types of lighting: the specular component determines the light on top of an object, the ambient component determines the lighting of the surrounding area, and the diffuse component specifies the colour component that scatters the light equally in all directions. The light parameters specify the intensity of the light, whereas theta and phi are polar coordinates defining the position of the light.
The following examples of 3D spheres (Figs. 10.16, 10.17, 10.18 and 10.19) depict
the effects of ambient and specular material on a sphere. Furthermore, the argument
smooth creates the effect of internal smoothing and determines the type of shading
applied to the spheres. When smooth is TRUE, Gouraud shading is used, otherwise
flat shading. In Fig. 10.17, the rgl.clear() function is used in order to customise
the lighting scene of the display by deleting the lighting from the scene.
Exporting results from the rgl package differs from exporting classical graphical
outputs. For this reason, we will explain some of the main commands in this section.
To save the screenshot to a file in PostScript or in other vector graphics formats,
the function rgl.postscript() is used. There are also other supported formats,
such as eps, tex, pdf, svg, pgf. The drawText argument is a logical,
defining whether to draw text or not.
rgl.postscript("filename.eps", fmt = "eps", drawText = FALSE)
Alternatively, it is also possible to export the rgl content into bitmap png format
with the function rgl.snapshot().
rgl.snapshot("filename.png", fmt = "png", top = TRUE)
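The scene can also be rotated automatically; a sketch of such an animation call (the rotation axis and speed are assumptions) is:
> play3d(spin3d(axis = c(1, 1, 0),   # rotate around the x- and y-axes
+               rpm  = 10),
+        duration = 5)               # a five second demonstration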
By applying this code to the 3D surface example, one obtains a five second demon-
stration of the plot, rotated around the x- and y-axes.
Another alternative for manipulating the plot, rather than rotation and zoom,
is provided by the function select3d(). This enables the user to select three-
dimensional regions in a scene. This function can be used to pick out one part of the
data, not influencing the whole dataset.
> if (interactive()){                          # interactive navigation is allowed
+   x = rnorm(5)                               # generate pseudo-random normal vector
+   y = z = x
+   r = runif(5)
+   open3d()                                   # open new device
+   spheres3d(x, y, z, r, col = "red3")        # red spheres
+   k = select3d()                             # select the rectangle area
+   keep = k(x, y, z)                          # keep selected area unchanged
+   rgl.pop()                                  # clear shapes
+   spheres3d(x[keep], y[keep], z[keep], r[keep],
+             col = "blue3")                   # redraw the selected area in blue
+   spheres3d(x[!keep], y[!keep], z[!keep], r[!keep],
+             col = "red3")                    # redraw the non-selected area in red
+ }
The rpanel package employs different graphical user interface (GUI) controls to
enable an immediate communication with the graphical output and provide dynamic
graphics. Such an animation of graphs is made possible by single function calls that add controls, such as sliders or buttons, for the parameters. If the state of a control is altered, the corresponding response function is executed and the associated graphical display is updated.
rpanel is built on the tcltk package created by Dalgaard (2001). The tcltk
package contains various options for interactive control offered by the Tcl/Tk sys-
tem, whereas rpanel includes only a limited number of useful tools that enable
the creation of control widgets through single function calls. rpanel offers the
possibility of redrawing the entire plot, and interactively changing the values of the
parameters set by the relevant controls, something which is not possible with the
object-oriented graphics created, for instance, in Java, see Bowman et al. (2007). In
order to be able to use the rpanel package, one should load the tcltk package
first.
rpanel displays graphs in a standard R graphics window and creates a separate
panel with control parameters. To avoid the necessity of operating with multiple
panels, one can use the tkrplot package of Tierney (2005) to integrate the plot
into the control panel, see Bowman et al. (2007).
The rpanel package consists of control functions and application functions. The
control functions are mainly used to build simple GUI controls for R functions.
The most useful GUI controls are listed in Table 10.3.
It is worth mentioning that several controls can be used simultaneously, as shown
in the next example, where both rp.doublebutton() and rp.slider() are
applied to the same panel object. First, one defines the function which is called when
an item is chosen, then one fills it with the values of the observed variable. Next, one
draws the panel and places the proper function in it. The rp.control() function
appears to be the central control function implemented in rpanel, since it is called
every time a new panel window is drawn, defining where the rpanel widgets can be
placed. Eventually, rp.slider() and rp.doublebutton() are used in order
to control a numeric variable by increasing or decreasing it with a slider or button
widget.
The following code demonstrates the usage of both functions on dataset trees:
> require(rpanel)
> attach(trees)                                # make the variable Height visible
> r = diff(range(Height))                      # define the range of the variable
> density.draw = function(panel){              # draw density function
+   plot(density(panel$y, panel$sp))
+   panel
+ }
> # define panel window arguments
> density.panel = rp.control(title = "density estimation",
+                            y  = Height,      # data argument
+                            sp = r / 8)       # smoothing parameter
> # add a slider to the panel window
> rp.slider(density.panel, sp,
+           from = r / 40, to = r / 2,         # lower and upper limits
+           action = density.draw,
+           main = "Bandwidth")                # name of the widget
> # add a widget with "+" and "-" buttons
> rp.doublebutton(density.panel, sp,
+                 step   = 0.03,
+                 log    = TRUE,               # step is multiplicative
+                 range  = c(r / 50, NA),      # lower and upper limits
+                 action = density.draw)
The first argument of rp.slider() identifies the panel object to which the slider
should be added. The second argument gives the name of the component of the
created panel object that is subsequently controlled by the slider. The from and to
arguments define the start and end points of the range of the slider. The action
argument gives the name of the function which will be called when the slider position
is changed. The last argument adds a label to the slider (see Fig. 10.21).
The rp.doublebutton() function is used to change the value of a particular panel component by small steps when a more accurate adjustment of parameters is needed. Most of the arguments used by this function are the same as for the
Fig. 10.21 Slider and double button for the control of density estimate.
BCS_ControlDensityEstimate
rp.slider(). The range argument serves the same purpose as the from and
to arguments defining the limits for the variable.
Another feature enabled in rpanel is the possibility of interactively choosing
between several types of plots to be applied to the same data set. It is also fea-
sible to adjust different parameters within the chosen plot. This can be tested with
rp.listbox(). This function adds a listbox of alternative commands to the panel.
When an item is pressed, the corresponding graphics display will occur. The argu-
ments of the function are the same as with the previous control functions. One can
follow the setting of this function in the code below.
> data.plotfn = function(panel){               # define plot function
+   if (panel$plot.type == "histogram")        # choose histogram
+     hist(panel$y)                            # then plot histogram
+   else if (panel$plot.type == "boxplot")     # choose boxplot
+     boxplot(panel$y)                         # then plot boxplot
+   panel
+ }
> panel = rp.control(y = Height)               # new panel
> rp.listbox(panel, plot.type,                 # list with 2 options
+            c("histogram", "boxplot"),
+            action = data.plotfn,
+            title = "Plot type")              # name of the widget
Fig. 10.22 Listbox control function with histogram and boxplot as alternative plots.
BCS_HistogramBoxplotOption
The difference between this type of plot and the default plotting in rpanel can be
observed in Fig. 10.23.
The rpanel package also includes several useful built-in application functions.
These simplify the dynamic plotting of several processes, such as the analysis of
covariance, regression, plotting of 3D plots, fitting a normal distribution, etc. A list
of selected application functions is given in Table 10.4.
The rp.regression() function plots a response variable against one or two
covariates and automatically creates an rpanel with control widgets.
The arguments of the function are mostly relevant in the case of one covariate. The use of panel.plot makes sense for two-dimensional plots and activates the tkrplot function in order to merge the control and output panels into one window. One should be aware that three-dimensional graphics cannot be placed inside the panel. The code demonstrating two-dimensional regression with rp.regression() is presented below.
> if (interactive()){                          # interactive navigation is allowed
+   data(longley)
+   attach(longley)                            # components are temporarily visible
+   # univariate regression
+   rp.regression(GNP, Unemployed,
+                 line.showing = TRUE,         # show regression line
+                 panel.plot   = FALSE)        # plot outside the control panel
+ }
A regression line will appear in the plot if the argument line.showing is set to TRUE. If the regression line is drawn, then one can interactively change its intercept and slope, see Fig. 10.24.
If the function has two covariates, the rp.regression() plot is generated with
the help of the rgl package, through the function rp.plot3d(), see Fig. 10.25.
In fact, one advanced interactive display will be created, which extends even the
features of the rgl 3D interactive scatterplot. The created plot is rotatable and
a zoom function is included. Additionally, one can set the panel argument to be
TRUE in order to create a control panel allowing interactive control of the fitted linear
models with one or two covariates. Double buttons are also available for stepwise
control of the rotation degrees of theta and phi.
> if (interactive()){                          # interactive navigation is allowed
+   data(longley)
+   attach(longley)                            # components are temporarily visible
+   # multivariate regression
+   rp.regression(cbind(GNP, Armed.Forces), Unemployed,
+                 panel = TRUE)                # a control panel is created
+ }
The function rp.normal() plots a histogram of data samples and allows a normal
density curve to be added to the display. Furthermore, the fitted normal distribution
with mean and standard deviation of the data sample can also be plotted. Double-
buttons are built-in as well, and enable interactive control of the mean and standard
deviation.
> if (interactive()){                          # interactive navigation is allowed
+   y = Height                                 # data argument
+   # plot histogram with density curve
+   rp.normal(y, panel.plot = TRUE)
+ }
Adler, D., Nenadic, O., & Zucchini, W. (2003). RGL: A R-library for 3D visualization with OpenGL,
Technical report, University of Goettingen.
Ahrens, J. H., & Dieter, U. (1972). Computer methods for sampling from the exponential and normal distributions.
Communications of the ACM, 15(10), 873–882.
Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, poisson
and binomial distribution. Computing, 12, 223–246.
Ahrens, J. H., & Dieter, U. (1982a). Computer generation of Poisson deviates from modified normal
distributions. ACM Transactions on Mathematical Software, 8, 163–179.
Ahrens, J. H., & Dieter, U. (1982b). Generating gamma variates by a modified rejection technique.
Communications of the ACM, 25, 47–54.
Albert, J. (2009). Bayesian Computation with R, Use R! (2nd ed.). New York: Springer.
Annamalai, C. (2010). Package “radx”. https://github.com/quantumelixir/radx.
Ash, R. B. (2008). Basic Probability Theory, Dover Books on Mathematics (1st ed.). New York:
Dover Publications Inc.
Atkinson, A. C., & Pearce, M. (1976). The computer generation of beta, gamma and normal random
variables. Journal of the Royal Statistical Society, 139, 431–461.
Babbie, E. (2013). The Practice of Social Research. Boston: Cengage Learning.
Banks, J. (1998). Handbook of Simulation. Norcross: Engineering and Management Press.
Becker, R. A., Cleveland, W. S., & Shyu, M.-J. (1996). The visual design and control of trellis
display. Journal of Computational and Graphical Statistics, 5, 123–155.
Bolger, E. M., & Harkness, W. L. (1965). Characterizations of some distributions by conditional
moments. The Annals of Mathematical Statistics, 36, 703–705.
Bowman, A., Crawford, E., Alexander, G., & Bowman, R. W. (2007). rpanel: Simple interactive
controls for R functions using the tcltk package. Journal of Statistical Software, 17, 1–18.
Box, G. E. P., & Muller, M. E. (1958). A note on the generation of random normal deviates. Annals
of Mathematical Statistics, 29, 610–611.
Braun, W., & Murdoch, D. (2007). A First Course in Statistical Programming with R. Cambridge:
Cambridge University Press.
Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion.
Statistical Science, 16, 101–117.
Broyden, C. G. (1970). The convergence of a class of double-rank minimization algorithms. Journal
of the Institute of Mathematics and Its Applications, 6, 76–90.
Caillat, A.-L., Dutang, C., Larrieu, V., & NGuyen, T. (2008). Gumbel: package for Gumbel copula.
R package version 1. 01.
Canuto, C., & Tabacco, A. (2010). Mathematical Analysis II. Universitext Series. Milan: Springer.
Cheng, R. C. H. (1977). The generation of gamma variables with non-integral shape parameter.
Journal of the Royal Statistical Society, 26(1), 71–75.
Cheng, R. C. H. (1978). Generating beta variates with nonintegral shape parameters. Communica-
tions of the ACM, 21, 317–322.
Clayton, D. G. (1978). A model for association in bivariate life tables and its application in epi-
demiological studies of familiar tendency in chronic disease incidence. Biometrika, 65, 141–151.
Cleveland, W. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of
the American Statistical Association, 74, 829–836.
Cook, R. D., & Weisberg, S. (1982). Residuals and Influence in Regression. New York: Chapman
and Hall.
Cowpertwait, P. S., & Metcalfe, A. (2009). Introductory Time Series with R. New York: Springer.
Csörgő, S., & Faraway, J. (1996). The exact and asymptotic distributions of Cramér–von Mises
statistics. Journal of the Royal Statistical Society Series B, 58, 221–234.
Dalgaard, P. (2001). The r-tcl/tk interface. Proceedings of DSC, 1, 2.
Demarta, S., & McNeil, A. J. (2004). The t-copula and related copulas. International Statistical
Review, 73(1), 111–129.
Everitt, B. (2005). An R and S-PLUS Companion to Multivariate Analysis. London: Springer.
Everitt, B., & Hothorn, T. (2011). An Introduction to Applied Multivariate Analysis with R. New
York: Springer.
Everitt, B., Landau, S., Leese, M., & Stahl, D. (2009). Cluster Analysis. Chichester: Wiley.
Fang, K. & Zhang, Y. (1990). Generalized multivariate analysis, Science Press and Springer.
Fishman, G. (1976). Sampling from the gamma distribution on a computer. Communications of the
ACM, 19(7), 407–409.
Fletcher, R. (1970). A new approach to variable metric algorithms. Computer Journal, 13, 317–322.
Frank, M. J. (1979). On the simultaneous associativity of f (x, y) and x + y − f (x, y). Aequationes
Mathematicae, 19, 194–226.
Frees, E., & Valdez, E. (1998). Understanding relationships using copulas. North American Actu-
arial Journal, 2, 1–125.
Gaetan, C., & Guyon, X. (2009). Spatial Statistics and Modeling. New York: Springer.
Genest, C., & Rivest, L.-P. (1989). A characterization of Gumbel family of extreme value distribu-
tions. Statistics and Probability Letters, 8, 207–211.
Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Compu-
tational and Graphical Statistics, 1, 141–150.
Genz, A. (1993). Comparison of methods for the computation of multivariate normal probabilities.
Computing Science and Statistics, 25, 400–405.
Genz, A. & Azzalini, A. (2012). mnormt: The multivariate normal and t distributions. R package
version 1.4-5. http://CRAN.R-project.org/package=mnormt.
Genz, A., & Bretz, F. (2009). Computation of Multivariate Normal and t Probabilities. Lecture Notes in Statistics. Heidelberg: Springer.
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F. & Hothorn, T. (2012). mvtnorm:
Multivariate Normal and t Distributions. R package version 0.9-9993. http://CRAN.R-project.
org/package=mvtnorm.
Goldfarb, D. (1970). A family of variable metric updates derived by variational means. Mathematics
of Computation, 24, 23–26.
Gonzalez-Lopez, V. A. (2009). fgac: Generalized Archimedean Copula. R package version 0.6-1.
http://CRAN.R-project.org/package=fgac.
Greene, W. (2003). Econometric Analysis. Upper Saddle River: Pearson Education.
Greub, W. (1975). Linear Algebra. Graduate Texts in Mathematics. New York: Springer.
Gumbel, E. J. (1960). Distributions des valeurs extrêmes en plusieurs dimensions. Publications de
l'Institut de Statistique de l'Université de Paris, 9, 171–173.
Hahn, T. (2013). R2Cuba: Multidimensional numerical integration. http://cran.r-project.org/web/
packages/R2Cuba/R2Cuba.pdf.
Härdle, W. K., & Vogt, A. (2014). Ladislaus von Bortkiewicz: Statistician, economist and a European intellectual. International Statistical Review, 83(1), 17–35.
Härdle, W., Müller, M., Sperlich, S., & Werwatz, A. (2004). Nonparametric and Semiparametric
Models. Springer Series in Statistics. New York: Springer.
Härdle, W., & Simar, L. (2015). Applied Multivariate Statistical Analysis (4th ed.). New York:
Springer.
Hastie, T., Tibshirani, R., & Friedman, F. (2009). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. New York: Springer.
Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.
Journal of Research of the National Bureau of Standards, 49, 409–436.
Hofert, M., & Maechler, M. (2011). Nested Archimedean copulas meet R: The nacopula package.
Journal of Statistical Software, 39(9), 1–20.
Hoff, P. (2010). sbgcop: Semiparametric Bayesian Gaussian copula estimation and imputation. R
package version 0.975. http://CRAN.R-project.org/package=sbgcop.
Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of
Computational and Graphical Statistics, 5(3), 299–314.
Jarle Berntsen, T. E., & Genz, A. (1991). An adaptive algorithm for the approximate calculation of
multiple integrals. ACM Transactions on Mathematical Software, 17, 437–451.
Jech, T. J. (2003). Set Theory, Springer Monographs in Mathematics (3rd ed., the third millennium edition, revised and expanded). Berlin: Springer-Verlag.
Joe, H. (1997). Multivariate Models and Dependence Concepts. London: Chapman and Hall.
Joe, H., & Xu, J. J. (1996). The estimation method of inference functions for margins for multivariate
models, Technical Report 166. Department of Statistics: University of British Columbia.
Johnson, M. E. (1987). Multivariate Statistical Simulation. New York: Wiley.
Johnson, P. (1972). A History of Set Theory. Prindle, Weber & Schmidt Complementary Series in Mathematics. Boston: Prindle, Weber & Schmidt.
Kachitvichyanukul, V., & Schmeiser, B. W. (1988). Binomial random variate generation. Commu-
nications of the ACM, 31, 216–222.
Kendall, M. G., & Smith, B. B. (1938). Randomness and random sampling numbers. Journal of the
Royal Statistical Society, 101(1), 147–166.
Kiefer, J. (1953). Sequential minimax search for a maximum. Proceedings of the American Math-
ematical Society, 4(3), 502–506.
Knuth, D. E. (1969). The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Reading: Addison-Wesley.
Kojadinovic, I., & Yan, J. (2010). Modeling multivariate distributions with continuous margins
using the copula r package. Journal of Statistical Software, 34(9), 1–20.
Kruskal, J. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrica, 29,
115–129.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of
the American Statistical Association, 47, 583–621.
Marsaglia, G. (1964). Generating a variable from the tail of the normal distribution. Technometrics,
6, 101–102.
Marsaglia, G. (1968). Random numbers fall mainly in the planes. Proceedings of the National
Academy of Sciences of the United States of America, 61(1), 25–28.
Marsaglia, G. (1995). Diehard Battery of Tests of Randomness, Florida State University.
Marsaglia, G., & Marsaglia, J. (2004). Evaluating the anderson-darling distribution. Journal of
Statistical Software, 9(2), 1–5.
Marshall, A. W., & Olkin, J. (1988). Families of multivariate distributions. Journal of the American
Statistical Association, 83, 834–841.
Martin, A. D., Quinn, K. M., & Park, J. H. (2011). Mcmcpack: Markov chain monte carlo in R.
Journal of Statistical Software, 42(9), 1–21.
Samorodnitsky, G., & Taqqu, M. S. (1994). Stable Non-Gaussian Random Processes. New York:
Chapman & Hall.
Sarkar, D. (2010). Lattice: Multivariate Data Visualization with R. New York: Springer.
Scherer, M., & Mai, J.-K. (2012). Simulating Copulas: Stochastic Models, Sampling Algorithms,
and Applications. Series in Quantitative Finance. Singapore: World Scientific Pub Co Inc.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley Series in Proba-
bility and Statistics. New York: Wiley.
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathe-
matics of Computation, 24, 647–656.
Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples).
Biometrika, 52, 591–611.
Shepard, R. (1962). The analysis of proximities: multidimensional scaling with unknown distance
function. Psychometrica, 27, 125–139.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8, 229–231.
Smirnov, N. (1939). On the estimation of the discrepancy between empirical curves of distribution
for two independent samples. Bulletin Mathématique de l’Université de Moscou, 2, 2.
Stroud, A. H. (1971). Approximate Calculation of Multiple Integrals. New Jersey: Prentice Hall.
Theussl, S. (2013). Package “rglpk”. http://cran.r-project.org/web/packages/Rglpk/Rglpk.pdf.
Trimborn, S., Okhrin, O., Zhang, S., & Zhou, M. Q. ( 2015). gofCopula: Goodness-of-Fit Tests for
Copulae. R package version 0.2-5. http://CRAN.R-project.org/package=gofCopula.
van Dooren, P., & de Ridder, L. (1976). An adaptive algorithm for numerical integration over an
n-dimensional cube. Journal of Computational and Applied Mathematics, 2, 207–217.
Venables, W. N., & Ripley, B. D. (1999). Modern Applied Statistics with S-PLUS. New York:
Springer.
von Bortkewitsch, L. (1898). Das Gesetz der kleinen Zahlen, Leipzig.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York:
Springer.
Watson, G. (1964). Smooth regression analysis. Sankyah. Ser. A, 26, 359–372.
Weron, R. (2001). Levy-stable distributions revisited: Tail index >2 does not exclude the levy-stable
regime. International Journal of Modern Physics C, 12, 209–223.
Whelan, N. (2004). Sampling from Archimedean copulas. Quantitative Finance, 4, 339–352.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1, 80–83.
Wuertz, D., many others and see the SOURCE file ( 2009a). fCopulae: Rmetrics - Dependence Struc-
tures with Copulas. R package version 2110.78. http://CRAN.R-project.org/package=fCopulae.
Wuertz, D., many others and see the SOURCE file ( 2009b). fMultivar: Multivariate Market Analy-
sis. R package version 2100.76. http://CRAN.R-project.org/package=fMultivar.
Yan, J. (2007). Enjoy the joy of copulas: With a package copula. Journal of Statistical Software,
21(4), 1–21.
Index
A Column-major storage, 18
Absolute frequency, 130 Comparison relations, 8
Akaike Information Criterion(AIC), 200 Concentration ellipsoid, 179
α-trimmed Mean, 138 Concordant, 177
Apropos(), 7 Conditional sampling, 261
Archimedean copulae, 189 Confidence intervals, 146
Args(), 26 Contour ellipse, 179
Arithmetic mean, 138 Copula, 171, 183, 263
Array, 13, 14 copula density, 184
Assign, 10 copula estimation, 193
Attach, 22 copula families, 185
Auckland, 2 hierarchical archimedean copulae, 191
Correlation, 176
Covariance matrix, 174
B CRAN, 2
Bar Diagram, 130 Critical region, 150
Bar Plot, 131 Cross-platform software, 2
Basic functions, 8 Cumsum, cumprod, cummin, cummax, 17
Bayesian Information Criterion (BIC), 200 Cumulative distribution function (cdf), 109,
Bernoulli distribution, 94 171
Bernoulli experiment, 94
Best linear unbiased estimator (BLUE), 198
Binomial distribution, 94, 95 D
Box-plot, 144 Density function, 109, 171
Discriminant analysis, 238
Dispersion parameters, 140
C Distance, 230
C(), 13 Distribution, 171
Canonical maximum likelihood, 195 Cauchy distribution, 127
Cauchy distribution, 179 multinormal distribution, 178
Ceiling, 8 multivariate normal distribution, 178
Central Limit Theorem (CLT), 183 multivariate t-distribution, 178
Central limit theorem (CLT), 183
Chi-squared distribution, 115
Class(), 11 E
Clayton copula, 190 Elliptical copulae, 186
Cluster analysis, 229 Euclidian norm, 230
Example(), 6 package, 6
Excess kurtosis, 111 Histogram, 133
Expectation, 110, 173 Hypergeometric distribution, 101
Exponential distribution, 121
Ex_Stirling, 24
I
Indexing
F negative, 15
Factor analysis, 224 Inf, 12, 13
Factorial, 8 Inference for margins, 194
F-distribution, 119 Installing, 2
Find(), 7 Integer division, 8
Floor, 8 Interquartile range, 141
Frank copula, 189 Inverse transform method, 251
Fréchet–Hoeffdings, 185 Is.finite(), 13
Function Is.nan(), 13
as.character(), 12
as.data.frame(), 12
K
as.double(), 12
Kendall, 177
as.integer(), 12
Kurtosis, 111
as.list(), 12
as.matrix(), 12
cbind(), 18 L
diag(), 18 Law of Large Numbers, 138
dim(), 18 Length(), 13
dimnames(), 20 Leptokurtic distributions, 110
head, 31 Letters[], 14
k-means(), 234 Library(), 5
matrix(), 18 Limit theorems, 182
names(), 31 Linear congruential generator, 244
rbind(), 18 Loadhistory(), 7
str(), 31 Loadings matrix, 221
t(), 18 Load(.Rdata), 7
update.packages(), 3 Logical relations
Fundamental operations, 8 vectors, 15
Ls(), 7
G
Gamma distribution, 257 M
Generalised set, 82 Mac, 3
Generator of the copula, 189 Mahalanobis transformation, 182, 224
Gentleman, 2 Mallows’ C p , 200
GNU General Public License, 2 Marginal cdf, 172
GNU Project, 2 Marginal probability, 172
Goodness of fit, 200, 265 Maximum likelihood factor analysis, 225
Gumbel copula, 190 Mean(), 17
Gumbel–Hougaard copula, 184 Median absolute deviation (MAD), 143, 142
Mode, 140
Modulo division, 8
H Moments, 173, 174
HAC package, 193 Month.abb[], 14
Help, 5 Multidimensional Scaling, 234
help(), 5 Multinomial distribution, 99
help.search(), 6 Multinormal, 171