
Statistics and Computing

Wolfgang Karl Härdle


Ostap Okhrin
Yarema Okhrin

Basic Elements
of Computational
Statistics

QUANTLETS
Statistics and Computing

Series editor
W.K. Härdle, Humboldt-Universität zu Berlin, Berlin, Germany
Statistics and Computing (SC) includes monographs and advanced texts on
statistical computing and statistical packages.

More information about this series at http://www.springer.com/series/3022


Wolfgang Karl Härdle · Ostap Okhrin · Yarema Okhrin

Basic Elements
of Computational Statistics
Wolfgang Karl Härdle
CASE – Center for Applied Statistics and Economics, School of Business and Economics
Humboldt-Universität zu Berlin
Berlin, Germany

Yarema Okhrin
Chair of Statistics, Faculty of Business and Economics
University of Augsburg
Augsburg, Germany

Ostap Okhrin
Econometrics and Statistics, esp. in Transportation, Institut für Wirtschaft und Verkehr,
Fakultät Verkehrswissenschaften “Friedrich List”
Technische Universität Dresden
Dresden, Sachsen, Germany

ISSN 1431-8784 ISSN 2197-1706 (electronic)


Statistics and Computing
ISBN 978-3-319-55335-1 ISBN 978-3-319-55336-8 (eBook)
DOI 10.1007/978-3-319-55336-8
Library of Congress Control Number: 2017943193

Mathematics Subject Classification (2010): 62-XX, 62G07, 62G08, 62H15, 62Jxx

© Springer International Publishing AG 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Martin, Sophie, and Iryna, who are my
grace and inspiration
Ostap
To Katharine, Justine, Danylo, and Irena for
making my life colorful
Yarema
To our parents
Ostap and Yarema
To my family
Wolfgang
Think and read, and to your neighbours’ gifts
pay heed, yet do not thus neglect your own.
—Taras Shevchenko
(translated by C. H. Andrusyshen and W. Kirkconnel)
Preface

The R programming language is becoming the lingua franca of computational


statistics. It is the usual statistical software platform used by statisticians, economists,
engineers and scientists both in corporations and in academia. Established
international companies use R in their data analysis. R has gained its popularity for
two reasons. First, it is an OS independent free open-source program which is
popularised and improved by hundreds of volunteers all over the world. A plethora
of packages are available for many scientific disciplines. Second, common analysts
can do complicated analyses without deep computer programming knowledge. This
book on the basic elements of computational statistics presents the tools and concepts
of univariate and multivariate data analyses with a strong focus on applications
and implementations. The aim of this book is to present data analysis in a way
that is understandable for non-mathematicians and practitioners who are confronted
by statistical data analysis. All practical examples may be recalculated and modified
by the reader: all data sets and programmes (Quantlets) used in the book are
downloadable from the publisher’s home page of this book (www.quantlet.de). The
text contains a wide variety of exercises and covers the basic mathematical, statistical
and programming problems.
The first chapter introduces the reader to the basics of the R language, taking into
account that only minimal prior experience in programming is required. Starting
with the development history and the R environments under different operating systems,
the book discusses the syntax. We start the description of the syntax with the
classical ‘Hello World!!!’ program. The use of R as an advanced calculator, data
types, loops, if-then conditions, the construction of one's own functions and classes
are the topics covered in this chapter. Since statistical analysis deals with data, special
attention is paid to working with vectors and matrices.
The second part deals with the numerical techniques which one needs during the
analysis. A short excursion into matrix algebra will be helpful in understanding
the multivariate techniques provided in the later chapters. Different methods of
numerical integration, differentiation and root finding help the reader to get to
the core of the R system.


Chapter 3 highlights set theory, combinatorial rules, and some of the main
discrete distributions: binomial, multinomial, hypergeometric and Poisson.
Different characteristics, cumulative distribution functions and density functions
of the continuous distributions (uniform, normal, t, χ², F, exponential and Cauchy)
will be explained in detail in Chapter 4.
The next chapter is devoted to univariate statistical analysis and basic smoothing
techniques. The histogram, kernel density estimator, graphical representation of the
data, confidence intervals, different simple tests as well as tests that need more
computations, like the Wilcoxon, Kruskal–Wallis, sign tests, are the topics of
Chapter 5.
The sixth chapter deals with multivariate distributions: the definition, characteristics
and applications of general multivariate distributions and of the multinormal
distribution, as well as classes of copulae. Further, Chapter 7 discusses linear and
nonlinear relationships via regression models.
Chapter 8 partially extends the problems solved in Chapter 5, but also considers
more sophisticated topics, such as multidimensional scaling, principal component,
factor, discriminant and cluster analysis. These techniques are difficult to apply
without computational power, so they are of special interest in this book.
Theoretical models need to be calibrated in practice. If no data are available,
then Monte Carlo simulation techniques are a necessary part of each study. Chapter
9 starts with simple techniques for sampling from the uniform distribution. These are
further extended to simulation methods for other univariate distributions. We also
discuss simulation from multivariate distributions, especially copulae.
Chapter 10 describes more advanced graphical techniques, with special attention
to three-dimensional graphics and interactive programmes using packages lattice,
rgl and rpanel.
This book is designed for the advanced undergraduate and first-year graduate
student as well as for the inexperienced data analyst who would like a tour of the
various statistical tools in a data analysis workshop. The experienced reader with a
good knowledge of statistics and programming will certainly skip some sections
of the univariate models, but hopefully enjoy the various mathematical roots of the
multivariate techniques. A graduate student might think that the first section on
description techniques is already well known to him from his training in introductory
statistics. The programming, mathematical and applied parts of the book will
certainly introduce him to the rich realm of statistical data analysis modules.
A book of this kind would not have been possible without the help of many
friends, colleagues and students. For many suggestions, corrections and technical
support, we would like to thank Aymeric Bouley, Xiaofeng Cao, Johanna Simone
Eckel, Philipp Gschöpf, Gunawan Gunawan, Johannes Haupt, Uri Yakobi Keller,
Polina Marchenko, Félix Revert, Alexander Ristig, Benjamin Samulowski, Martin
Schelisch, Christoph Schult, Noa Tamir, Anastasija Tetereva, Tatjana
Tissen-Diabaté, Ivan Vasylchenko and Yafei Xu. We thank Alice Blanck and
Veronika Rosteck from Springer Verlag for continuous support and valuable

suggestions on the style of writing and the content covered. Special thanks go to the
anonymous proofreaders who checked not only the language but also the statistical,
programming and mathematical content of the book. All errors are our own.

Berlin, Germany Wolfgang Karl Härdle


Dresden, Germany Ostap Okhrin
Augsburg, Germany Yarema Okhrin
April 2017
Contents

1 The Basics of R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 R on Your Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 History of the R Language . . . . . . . . . . . . . . . . . . . . 1
1.2.2 Installing and Updating R. . . . . . . . . . . . . . . . . . . . . 2
1.2.3 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 First Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 “Hello World !!!” . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Getting Help. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Working Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Basics of the R Language. . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 R as a Calculator. . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.2 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.3 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.4 Data Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.5 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4.6 Programming in R . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.7 Date Types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.4.8 Reading and Writing Data from and to Files. . . . . . . . 30
2 Numerical Techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.1 Matrix Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.1.1 Characteristics of Matrices . . . . . . . . . . . . . . . . . . . . 34
2.1.2 Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.1.3 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . 41
2.1.4 Spectral Decomposition . . . . . . . . . . . . . . . . . . . . . . 43
2.1.5 Norm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.2.1 Integration of Functions of One Variable . . . . . . . . . . 46
2.2.2 Integration of Functions of Several Variables . . . . . . . 50


2.3 Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.3.1 Analytical Differentiation . . . . . . . . . . . . . . . . . . . . . 54
2.3.2 Numerical Differentiation . . . . . . . . . . . . . . . . . . . . . 56
2.3.3 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . 59
2.4 Root Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.4.1 Solving Systems of Linear Equations. . . . . . . . . . . . . 62
2.4.2 Solving Systems of Nonlinear Equations . . . . . . . . . . 64
2.4.3 Maximisation and Minimisation of Functions . . . . . . . 66
3 Combinatorics and Discrete Distributions . . . . . . . . . . . . . . . . . . 77
3.1 Set Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1.1 Creating Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1.2 Basics of Set Theory . . . . . . . . . . . . . . . . . . . . . . . . 78
3.1.3 Base Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.1.4 Sets Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.1.5 Generalised Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.2 Probabilistic Experiments with Finite Sample Spaces . . . . . . . . 85
3.2.1 R Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.2.2 Sample Space and Sampling from Urns . . . . . . . . . . . 87
3.2.3 Sampling Procedure. . . . . . . . . . . . . . . . . . . . . . . . . 91
3.2.4 Random Variables. . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.3 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.1 Bernoulli Random Variables . . . . . . . . . . . . . . . . . . . 94
3.3.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . 95
3.3.3 Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.4 Multinomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.5 Hypergeometric Distribution. . . . . . . . . . . . . . . . . . . . . . . . . 101
3.6 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.6.1 Summation of Poisson Distributed Random Variables . . . . . . . . 106
4 Univariate Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.1 Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.1.1 Properties of Continuous Distributions . . . . . . . . . . . . 110
4.2 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.3 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4 Distributions Related to the Normal Distribution . . . . . . . . . . . 114
4.4.1 v2 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.4.2 Student’s t-distribution. . . . . . . . . . . . . . . . . . . . . . . 117
4.4.3 F-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.5 Other Univariate Distributions . . . . . . . . . . . . . . . . . . . . . . . 121
4.5.1 Exponential Distribution. . . . . . . . . . . . . . . . . . . . . . 121
4.5.2 Stable Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.5.3 Cauchy Distribution. . . . . . . . . . . . . . . . . . . . . . . . . 127

5 Univariate Statistical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 129


5.1 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.1.1 Graphical Data Representation . . . . . . . . . . . . . . . . . 130
5.1.2 Empirical (Cumulative) Distribution Function . . . . . . . 132
5.1.3 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.1.4 Kernel Density Estimation . . . . . . . . . . . . . . . . . . . . 135
5.1.5 Location Parameters . . . . . . . . . . . . . . . . . . . . . . . . 137
5.1.6 Dispersion Parameters . . . . . . . . . . . . . . . . . . . . . . . 140
5.1.7 Higher Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.1.8 Box-Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.2 Confidence Intervals and Hypothesis Testing . . . . . . . . . . . . . 146
5.2.1 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . 146
5.2.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.3 Goodness-of-Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
5.3.1 General Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.3.2 Tests for Normality . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.3.3 Wilcoxon Signed Rank Test and Mann–Whitney U Test . . . . . . . . 167
5.3.4 Kruskal–Wallis Test . . . . . . . . . . . . . . . . . . . . . . . . 169
6 Multivariate Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.1 The Distribution Function and the Density Function of a Random Vector . . 171
6.1.1 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.2 The Multinormal Distribution . . . . . . . . . . . . . . . . . . . . . . . . 178
6.2.1 Sampling Distributions and Limit Theorems . . . . . . . . 182
6.3 Copulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.3.1 Copula Families . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.3.2 Archimedean Copulae . . . . . . . . . . . . . . . . . . . . . . . 189
6.3.3 Hierarchical Archimedean Copulae . . . . . . . . . . . . . . 191
6.3.4 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.1 Idea of Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.2.1 Model Selection Criteria . . . . . . . . . . . . . . . . . . . . . 200
7.2.2 Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . 201
7.3 Nonparametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 205
7.3.1 General Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.3.2 Kernel Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.3.3 k-Nearest Neighbours (k-NN) . . . . . . . . . . . . . . . . . . 209
7.3.4 Splines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
7.3.5 LOESS or Local Regression . . . . . . . . . . . . . . . . . . . 213

8 Multivariate Statistical Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . 219


8.1 Principal Components Analysis. . . . . . . . . . . . . . . . . . . . . . . 219
8.2 Factor Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
8.2.1 Maximum Likelihood Factor Analysis . . . . . . . . . . . . 225
8.3 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
8.3.1 Proximity of Objects . . . . . . . . . . . . . . . . . . . . . . . . 230
8.3.2 Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . 231
8.4 Multidimensional Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
8.4.1 Metric Multidimensional Scaling . . . . . . . . . . . . . . . . 235
8.4.2 Non-metric Multidimensional Scaling . . . . . . . . . . . . 236
8.5 Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
9 Random Numbers in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.1 Generating Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . 243
9.1.1 Pseudorandom Number Generators . . . . . . . . . . . . . . 244
9.1.2 Uniformly Distributed Pseudorandom Numbers. . . . . . 248
9.1.3 Uniformly Distributed True Random Numbers . . . . . . 249
9.2 Generating Random Variables . . . . . . . . . . . . . . . . . . . . . . . 250
9.2.1 General Principles for Random Variable Generation . . . . . . . . 251
9.2.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 253
9.2.3 Random Variable Generation for Continuous Distributions . . . . . 253
9.2.4 Random Variable Generation for Discrete Distributions . . . . . . 259
9.2.5 Random Variable Generation for Multivariate Distributions . . . . 261
9.3 Tests for Randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
9.3.1 Birthday Spacings . . . . . . . . . . . . . . . . . . . . . . . . . . 266
9.3.2 k-Distribution Test . . . . . . . . . . . . . . . . . . . . . . . . . 266
10 Advanced Graphical Techniques in R . . . . . . . . . . . . . . . . . . . . . 269
10.1 Package lattice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
10.1.1 Getting Started with lattice . . . . . . . . . . . . . . . . . 270
10.1.2 formula Argument . . . . . . . . . . . . . . . . . . . . . . . . 270
10.1.3 panel Argument and Appearance Settings . . . . . . . . 272
10.1.4 Conditional and Grouped Plots . . . . . . . . . . . . . . . . . 273
10.1.5 Concept of shingle . . . . . . . . . . . . . . . . . . . . . . . 275
10.1.6 Time Series Plots . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10.1.7 Three- and Four-Dimensional Plots . . . . . . . . . . . . . . 279
10.2 Package rgl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
10.2.1 Getting Started with rgl . . . . . . . . . . . . . . . . . . . . . 281
10.2.2 Shape Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
10.2.3 Export and Animation Functions . . . . . . . . . . . . . . . . 287

10.3 Package rpanel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289


10.3.1 Getting Started with rpanel . . . . . . . . . . . . . . . . . . 289
10.3.2 Application Functions in rpanel. . . . . . . . . . . . . . . 293

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Symbols and Notations

Basics
X, Y                          Random variables or vectors
X_1, X_2, ..., X_p            Random variables
X = (X_1, ..., X_p)^T         Random vector
X ∼ ·                         X has distribution ·
Γ, Δ                          Matrices
A, B, X, Y                    Data matrices
Σ                             Covariance matrix
1_n                           Vector of ones (1, ..., 1)^T (n times)
0_n                           Vector of zeros (0, ..., 0)^T (n times)
I_n                           Identity matrix
I(·)                          Indicator function
⌈·⌉                           Ceiling function
⌊·⌋                           Floor function
i                             Imaginary unit, i² = −1
⇒                             Implication
⇔                             Equivalence
≈                             Approximately equal
iff                           if and only if, equivalence
i.i.d.                        Independent and identically distributed
rv                            Random variable
R^n                           n-dimensional space of real numbers
δ_ik                          The Kronecker delta, that is 1 if i = k and 0 otherwise
P_n                           P_n = {t ∈ C[a, b] | t(x) = Σ_{i=0}^{n} a_i x^i, a_i ∈ R}
f(x) ∈ O{g(x)}                There is k > 0 such that, for all sufficiently large values of x, f(x) is at most k·g(x) in absolute value
med(x)                        The median value of the sample x

Samples
x, y                                  Observations of X and Y
x_1, ..., x_n = {x_i}_{i=1}^n         Sample of n observations of X
X = {x_ij}_{i=1,...,n; j=1,...,p}     (n × p) data matrix of observations of X_1, ..., X_p or of X = (X_1, ..., X_p)^T
x_(1), ..., x_(n)                     The order statistic of x_1, ..., x_n
H                                     Centering matrix, H = I_n − n^{−1} 1_n 1_n^T
x̄                                     The sample mean

Densities and Distribution Functions


f(x)                              Density of X
f(x, y)                           Joint density of X and Y
f_X(x), f_Y(y)                    Marginal densities of X and Y
f_{X_1}(x_1), ..., f_{X_p}(x_p)   Marginal densities of X_1, ..., X_p
f̂_h(x)                            Histogram or kernel estimator of f(x)
F(x)                              Distribution function of X
F(x, y)                           Joint distribution function of X and Y
F_X(x), F_Y(y)                    Marginal distribution functions of X and Y
F_{X_1}(x_1), ..., F_{X_d}(x_d)   Marginal distribution functions of X_1, ..., X_d
φ_X(t)                            Characteristic function of X
m_k                               k-th moment of X
F̂(x)                              Empirical cumulative distribution function (ecdf)
pdf                               Probability density function

Empirical Moments
x̄ = (1/n) Σ_{i=1}^{n} x_i                             Average of X sampled by {x_i}_{i=1,...,n}
s²_{XY} = 1/(n−1) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)      Empirical covariance of random variables X and Y sampled by {x_i}_{i=1,...,n} and {y_i}_{i=1,...,n}
s²_{XX} = 1/(n−1) Σ_{i=1}^{n} (x_i − x̄)²              Empirical variance of random variable X sampled by {x_i}_{i=1,...,n}
r_{XY} = s²_{XY} / (s²_{XX} s²_{YY})^{1/2}            Empirical correlation of X and Y
Σ̂ = {s_{X_i X_j}}                                     Empirical covariance matrix of a sample or observations of X_1, ..., X_p or of the random vector X = (X_1, ..., X_p)^T
R = {r_{X_i X_j}}                                     Empirical correlation matrix of a sample or observations of X_1, ..., X_p or of the random vector X = (X_1, ..., X_p)^T

Distributions
φ(x)                 Density of the standard normal distribution
Φ(x)                 Cumulative distribution function of the standard normal distribution
N(0, 1)              Standard normal or Gaussian distribution
N(μ, σ²)             Normal distribution with mean μ and variance σ²
N_d(μ, Σ)            d-dimensional normal distribution with mean μ and covariance matrix Σ
→^L                  Convergence in distribution
→^{a.s.}             Almost sure convergence
∼^a                  Asymptotic distribution
U(a, b)              Uniform distribution on (a, b)
CLT                  Central Limit Theorem
χ²_p                 χ² distribution with p degrees of freedom
χ²_{1−α; p}          1 − α quantile of the χ² distribution with p degrees of freedom
t_n                  t-distribution with n degrees of freedom
t_{1−α/2; n}         1 − α/2 quantile of the t-distribution with n d.f.
F_{n,m}              F-distribution with n and m degrees of freedom
F_{1−α; n,m}         1 − α quantile of the F-distribution with n and m degrees of freedom
B(n, p)              Binomial distribution
H(x; n, M, N)        Hypergeometric distribution
Pois(λ_i)            Poisson distribution with parameter λ_i

Mathematical Abbreviations
tr(A)        Trace of matrix A
diag(A)      Diagonal of matrix A
rank(A)      Rank of matrix A
det(A)       Determinant of matrix A
id           Identity function on a vector space V
C[a, b]      The set of all continuously differentiable functions on the interval [a, b]
Chapter 1
The Basics of R

Don’t think—use the computer.

— G. Dyke

1.1 Introduction

The R software package is a powerful and flexible tool for statistical analysis which
is used by practitioners and researchers alike. A basic understanding of R allows
applying a wide variety of statistical methods to actual data and presenting the results
clearly and understandably. This chapter provides help in setting up the programme
and gives a brief introduction to its basics.
R is open-source software with a long list of available add-on packages that provide
additional functionality. This chapter begins with detailed instructions on how to
install it on the computer and explains all the procedures needed to customise it to
the user’s needs.
In the next step, it will guide you through the use of the basic commands and the
structure of the R language. The goal is to give an idea of the syntax so as to be able
to perform simple calculations as well as structure data and gain an understanding
of the data types. Lastly, the chapter discusses methods of reading data and saving
datasets and results.

1.2 R on Your Computer

1.2.1 History of the R Language

R is a complete programming language and software environment for statistical computing
and graphical representation. R is closely related to S, the statistical programming
language of Bell Laboratories developed by Becker and Chambers in 1984. It is


actually an implementation of S with lexical scoping semantics inspired by Scheme.
Its development started in 1992, with the first results published by the developers Ihaka
and Gentleman (1996) of the University of Auckland, NZ, for teaching purposes. Its
name, R, is taken from the first names of the two authors.
As part of the GNU Project, the source code of R has been freely available under
the GNU General Public License since 1995. This decision contributed to spreading
the software within the community of statisticians using free-code operating systems
(OS). It is now a multi-platform statistical package widely known by people from
many scientific fields such as mathematics, medicine and biology.
R enables its users to handle and store data, perform calculations on many types
of variables, statistically analyse information under different aspects, create graphics
and execute programmes. Its functionalities can be expanded by importing packages
and including code written in C, C++ or Fortran. It is freely available on the Internet
using the CRAN mirrors (Comprehensive R Archive Network at http://cran.r-project.org/).
Since this chapter deals with installation issues and the basics of the R language,
the reader familiar with the basics may skip it.
There exist several books about R, discussing specific topics in statistics and
econometrics (biostatistics, etc.) or comparing R with other software, for example
Stata. Typical users of Stata may be interested in Muenchen and Hilbe (2010). If
the research topic requires Bayesian econometrics and MCMC techniques, Albert
(2009) might be helpful. Two additional books on R, by Gaetan and Guyon (2009)
and Cowpertwait and Metcalfe (2009), may support the development of R skills,
depending on the application.

1.2.2 Installing and Updating R

Installing
As mentioned before, R is a free software package, which can be downloaded legally
from the Internet page http://cran.r-project.org/bin.
Since R is a cross-platform software package, installing R on different operating
systems will be explained. A full installation guide for all systems is available at
http://cran.r-project.org/doc/manuals/R-admin.html.
Precompiled binary distributions
There are several ways of setting up R on a computer. On the one hand, for many
operating systems, precompiled binary files are available. And on the other hand, for
those who use other operating systems, it is possible to compile the programme from
the source code.

• Installing R under Unix


Precompiled binary files are available for the Debian, RedHat, SuSe and Ubuntu
Unix distributions. They can be found on the CRAN website at http://cran.r-project.org/bin/linux.
• Installing R under Windows
The binary version of R for Windows is located at http://cran.r-project.org/bin/windows.
If an account with Administrator privileges is used, R can be installed in the
Program Files path and all the optional registry entries are automatically set. Otherwise,
R can only be installed in the user files path. Recent
versions of Windows ask for confirmation to proceed with installing a programme
from an ‘unidentified publisher’. The installation can be customised, but the default
is suitable for most users.
For further information, it is suggested to visit http://cran.r-project.org/bin/windows/base/rw-FAQ.html.
• Installing R under Mac
The current version of R for Mac OS is located at http://cran.r-project.org/bin/macosx/.
The installation package corresponding to the specific version of the Mac OS must
be chosen and downloaded. During the installation, the Installer will guide the user
through the necessary steps. Note that this will require the password or login of
an account with administrator privileges. The installation can be customised, but
the default is suitable for most users. After the installation, R can be started from
the application menu.
For further information, it is suggested to visit http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html.

Updating
The best way to upgrade R is to uninstall the previous version of R, then install
the new version and copy the old installed packages to the library folder of
the new installation. Command update.packages(checkBuilt = TRUE,
ask = FALSE) will update the packages for the new installation. Afterwards, any
remaining data from the old installation can be deleted. Old versions of the software
may be kept due to the parallel structure of the folders of the different installations.
In cases where the user has a personal library, the contents must be copied into an
update folder before running the update of the packages.
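A minimal sketch of this update step, assuming the new R version has already been installed and the old packages have been copied into its library folder:

> # rebuild the copied packages for the new R installation
> update.packages(checkBuilt = TRUE, ask = FALSE)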

1.2.3 Packages

A package is a file, which may be composed of R scripts (for example functions)
or dynamic link libraries (DLL) written in other languages, such as C or
Fortran, that gives access to more functions or data sets for the current session.
Some packages are ready for use after the basic installation, others have to be
downloaded and then installed when needed. On all operating systems, the function
install.packages() can be used to download and install a package automatically
through an available internet connection. install.packages() may ask the user
to decide whether packages should be compiled and installed from source if the source
version is newer than the binary one. When installing packages manually, there are
slight differences between operating systems.
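For example, with a working internet connection, a single package can be installed from within R as follows; the package name rworldmap is only an illustration (it is used again in Sect. 1.3.1).

> install.packages("rworldmap")   # download and install from a CRAN mirror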
Unix
Gzipped tar packages can be installed using the UNIX console by
R CMD INSTALL /your_path/your_package.tar.gz

Windows
In the R GUI, one uses the menu Packages.
• With an available internet connection, new packages can be downloaded and
installed directly by clicking the Install Packages button. In this case, the user is
asked to choose the CRAN mirror nearest to his or her location and to select the
package to be installed.
• If the .zip file is already available on the computer, the package can be installed
through Install Packages from Zip files.

Mac OS
There is a recommended Package Manager in the R.APP GUI. It is possible to
install packages from the shell, but we suggest having a look at the FAQ on the
CRAN website first.
All systems
Once a package is installed, it should be loaded in a session when needed. This ensures
that the software has all the additional functions and datasets from this package in
memory. This can be done through the commands

> library(package) or > require(package)

If the requested package is not installed, the function library() gives an error,
while require() is designed for use inside of other functions and only returns
FALSE and gives a warning.
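This difference can be checked with a package that is not installed; the package name below is fictitious.

> ok = require("somePackage", character.only = TRUE)  # fictitious package name
> ok                          # FALSE, and a warning was printed
[1] FALSE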
The package will also be loaded as the second item in the system search path.
Packages can also be loaded automatically if the corresponding code line is included
in the .Rprofile file.
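As an illustration, such a line in the .Rprofile file could simply read as follows; the chosen package is only an example.

library(rworldmap)   # attached automatically at the start of every session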

To see the installed libraries, the function library() is called without arguments.
> library()

To detach or unload a loaded package one uses.


> detach("package:name", unload = TRUE)

The function detach() can also be used to remove any R object from the search
path. This alternative usage will be shown later in this chapter.

1.3 First Steps

After this first impression of what R is and how it works, the next steps are to see
how it is used and to get used to it. In general, users should be aware of the case
sensitivity of the R language.
It is also convenient to know that previously executed commands can be selected
by the ‘up arrow’ on the keyboard. This is particularly useful for correcting typos and
mistakes in commands that caused an error, or to re-run commands with different
parameters.

1.3.1 “Hello World !!!”

As a first example, we will write some code that gives the output ‘Hello World !!!’
and a plot, see Fig. 1.1. There is no need to understand all the lines of the code now.
> install.packages("rworldmap")
> require(rworldmap)
> data("countryExData", envir = environment())
> mapCountryData(joinCountryData2Map(countryExData),
+ nameColumnToPlot = "EPI_regions",
+ catMethod = "categorical",
+ mapTitle = "Hello World!!!",
+ colourPalette = "rainbow",
+ missingCountryCol = "lightgrey",
+ addLegend = FALSE)

Fig. 1.1 “Hello World!!!” example in R.  BCS_HelloWorld

1.3.2 Getting Help

Once R has been installed and/or updated, it is useful to have a way to get help. To
open the primary interface to the help system, one uses

> help()

There are two ways of getting help for a particular function or command:

> help(function) and > ? function

To find help about packages, one uses


> help(package = package)

which returns the help file that comes with the specific package. If the different help
proposals are not satisfactory, one can try

> help.search("function name") > ?? "function name"

to see all help subjects containing “function name”. Finally, under Windows and
Mac OS, under the Help menu are several PDF manuals which provide thorough and
detailed information. The same help can be reached with the function
> help.start()

An HTML version without pictures can be found on the CRAN website.


Examples for a particular topic can be found using
> example(topic)

1.3.3 Working Space

The current directory, where all pictures and tables are saved and from which all data
is read by default, is known as the working directory. It can be found by getwd()
and can be changed by setwd().
> getwd() > setwd("your/own/path")

Each of the following functions returns a vector of character strings providing the
names of the objects already defined in the current session.
> ls() > objects()

To clear the R environment, objects can be deleted by


> rm(var1, var2)

The above line, for example, will remove the variables var1 and var2 from the
working space. The next example will remove all objects defined in the current
session and can be used to completely clean the whole working space.
> rm(list = ls())

The code below erases all variables, including the system ones beginning with a dot.
Be cautious when using this command! Note that it has the same effect as the menu
entry Remove all objects under Windows or Clear Workspace under Mac OS.
> rm(list = ls(all.names = TRUE))

However, we should always make sure that all previously defined variables are deleted
or redefined when running new code, so that no information is left over from a previous
run of the programme which could affect the results. Therefore, it is recommended to
have the line rm(list = ls(all.names = TRUE)) at the beginning of each programme.
One saves the workspace as a .Rdata file in the specified working directory using
the function save.image(), and saves the history of the commands in the .Rhistory
format with savehistory(). Saving the workspace means keeping all defined
variables, so that the next time R is in use, there is no need to define
them again. If the history is saved, the variables will NOT be saved, whereas the
commands defining them will be. So once the history is loaded, everything that was
in the console has to be executed again, which can take a while for time-consuming
calculations.
> load(".Rdata") > loadhistory()
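The corresponding saving calls accept explicit file names, so that several sessions can be kept apart; the file names below are arbitrary.

> save.image("mysession.RData")          # save all defined variables
> savehistory("mysession.Rhistory")      # save the command history
> load("mysession.RData")                # restore the variables later
> loadhistory("mysession.Rhistory")      # restore the command history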

The function apropos(‘word’) returns a vector of the functions, variables, etc., whose
names contain the string word, as does find(‘word’), but with a different user interface.
Without going into details, the best way to set one’s own search parameters is to consult
the help concerning these functions.
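For instance, searching for objects whose names contain the string ‘med’ might look as follows; the exact output depends on the packages currently loaded.

> apropos("med")    # e.g. "median", "medpolish", "runmed", ...
> find("median")    # the package in which median is defined, "package:stats"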
Furthermore, a recommended and very convenient way of writing programmes
is to split them into different modules, which might contain a list of definitions or
functions, in order not to mess up the main file. They are executed by the function
> source("my_module.r")

To write the output in a separate .txt file instead of the screen, one uses sink().
This file appears in the working directory and shows the full output of the session.
> sink("my_output.txt")

To place that output back on the screen, one uses


> sink()

The simplest ways to quit R are

> quit() or > q()

1.4 Basics of the R Language

This section contains information on how R can serve as a useful tool for all basic
mathematical and programming needs.

1.4.1 R as a Calculator

R may be seen as a powerful calculator which provides a large number of mathematical
functions. The classical fundamental operations presented in Tables 1.1, 1.2
and 1.3 are, of course, available in R.
In contrast to the classical calculator, R allows assigning one or more values to a
variable.
> a = pi + 0.5; a # create variable a; print a
[1] 3.641593

Table 1.1 Fundamental operations


Function name Example Result
Addition 1 + 2 3
Subtraction 1 - 2 -1
Multiplication 1 * 2 2
Division 1 / 2 0.5
Raising to a power 3ˆ2 9
Integer division 5 %/% 2 2
Modulo division 5 %% 2 1

Table 1.2 Basic functions


Function name Example Result
Square root sqrt(2) 1.414214
Sine sin(pi) 1.224606e-16
Cosine cos(pi) -1
Tangent tan(pi/4) 1
Arcsine asin(pi/4) 0.903339
Arccosine acos(0) 1.570796
Arctangent atan(1) 0.785398
Arctan(y/x) atan2(1, 2) 0.463647
Hyperbolic sine sinh(1) 1.175201
Hyperbolic cosine cosh(0) 1
Hyperbolic tangent tanh(pi) 0.9962721
Exponential exp(1) 2.718282
Logarithm log(1) 0

Table 1.3 Comparison relations


Meaning Example Result
Smaller 5 < 5 FALSE
Smaller or equal 3 <= 4 TRUE
Bigger 7 > 2 TRUE
Bigger or equal 5 >= 5 TRUE
Unequal 2 != 1 TRUE
Logical equal pi == acos(-1) TRUE

Transformations of numbers are implemented by the following functions.


> floor(a) > trunc(a)
[1] 3 [1] 3
> floor(-a) > trunc(-a)
[1] -4 [1] -3
> ceiling(a) > round(a)
[1] 4 [1] 4
> ceiling(-a) > round(-a)
[1] -3 [1] -4
> round(a, digits = 2) > factorial(a)
[1] 3.64 [1] 14.19451

floor() (ceiling()) returns the largest (smallest) integer that is smaller (larger)
than the value of the given variable a, while trunc() truncates the decimal part of a
real-valued variable to obtain an integer variable. The function round() rounds
a real-valued variable scientifically to an integer, unless the argument digits is
supplied, specifying the number of decimal places, in which case it scientifically rounds
the given real number to that many decimal places. Scientific rounding of a real
number rounds it to the closest integer, except for the case where the digit after
the predefined decimal place is exactly 5; in this case the closest even integer is
returned. The function factorial(), which for an integer a returns f(a) = a! =
1 · 2 · ... · a, works with real-valued arguments as well, by using the Gamma function

Γ(x) = ∫₀^∞ t^(x−1) exp(−t) dt,

implemented by gamma(x+1) in R.
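The following short session illustrates the scientific rounding rule and the Gamma-based factorial; the printed values are those that a current R version should return.

> round(0.5)        # halfway cases are rounded to the nearest even integer
[1] 0
> round(1.5)
[1] 2
> round(2.5)
[1] 2
> factorial(3.5)    # computed as gamma(3.5 + 1)
[1] 11.63173
> gamma(4.5)
[1] 11.63173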

1.4.2 Variables

Assigning variables
There are different ways to assign variables to symbols.
> a = pi + 0.5; a # assign (pi + 0.5) to a
[1] 3.641593
> b = a; b # assign the value of a to b
[1] 3.641593
> d = e = 2^(1 / 2); d # assign 2^(1 / 2) to e
> # and the value of e to d
[1] 1.414214
> e
[1] 1.414214
> f <- d; f # assign the value of d to f
[1] 1.414214
> d -> g; g # assign the value of d to g
[1] 1.414214

Be careful when using ‘=’ for assignment: the known value, which defines the variable,
must be placed on the right-hand side of the equals sign. The arrow assignment
allows the following kind of construction:

> h <- 4 -> j # assign the value 4 to the variables h and j

These constructions should not be used extensively due to their evident lack of clarity.
Note that variable names are case sensitive and must not begin with a digit or a period
followed by a digit. Furthermore, names should not begin with a dot as this is common
only for system variables. It is often convenient to choose names that contain the
type of the specific variable, e.g. for the variable iNumber, the ‘i’ at the beginning
indicates that the variable is of the type integer.
> iNumber = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

It is also useful to add the prefix ‘L’ to all local variables. For example, despite the
fact that pi is a constant, one can reassign a different value to it. In order to avoid
confusion, it would be convenient in this case to call the reassigned variable Lpi.
We do not always follow this suggestion, in order to keep the listings as short
as possible.
It is also possible to define functions in a similar fashion.
> Stirling = function(n){sqrt(2 * pi * n) * (n / exp(1))^n}

This is the mathematical function known as Stirling’s formula or Stirling’s approximation
for the factorial, which has the property

n! / {√(2πn) · (n/e)^n} → 1,  as n → ∞.
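A quick check of the quality of the approximation (the printed values are rounded):

> Stirling(10)                    # approximately 3598696
[1] 3598696
> factorial(10)
[1] 3628800
> factorial(10) / Stirling(10)    # close to 1
[1] 1.008365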

For multi-output functions, see Sect. 1.4.6.


Working with variables
There are different types of variables in R. A brief summary is given in Table 1.4.
Obviously each variable requires its own storage space, so during computationally
intensive calculations one should pay attention to the choice of the variables. The
types numeric and double are identical. Both are vectors of a specific length and store
real valued elements. A variable of the type logical contains only the values TRUE
and FALSE.
The command returning the type, i.e. the storage mode, of an object is typeof();
the possible values are listed in the structure TypeTable. Alternatively, the function
class() can be used, which in turn returns the class of the object and is often used
in the object-oriented style of programming.

> typeof(object name) > class(object name)

A character variable consists of elements within quotation marks.



Table 1.4 Variable types


Variable type Example of memory needed Result
Numeric object.size(numeric(2)) 56 bytes
Logical object.size(logical(2)) 48 bytes
Character object.size(character(2)) 104 bytes
Integer object.size(integer(2)) 48 bytes
Double object.size(double(2)) 56 bytes
Matrix object.size(matrix(0, ncol = 2)) 216 bytes
List object.size(list(2)) 96 bytes

> a = character(length = 2); a


[1] "" ""
> a = c(exp(1), "exp(1)")
> class(a)
[1] "character"
> a
[1] "2.71828182845905" "exp(1)"

A description of the variable types matrix and list is given in Sects. 1.4.3 and 1.4.5
in more detail. To show, or print, the content of a, one uses the function print().
> print(a)

‘Unknown values’ as a result of missing values is a common problem in statistics. To


handle this situation, R uses the value NA. Every operation with NA will give an NA
result as well. Note that NA is different from the value NaN, which is the abbreviation
for ‘not a number’. If R returns NaN, the underlying operation is not valid, which
obviously has to be distinguished from the case in which the data is not available.
A further output that the reader should be aware of is Inf, denoting infinity. It is
difficult to work with such results, but R provides tools which modify the class of
results like NaN, and there are functions to help in transforming variables from one
type into another. The functions as.character, as.double, as.integer,
as.list, as.matrix, as.data.frame, etc., coerce their arguments to the
function specific type. It is therefore possible to transform results like NaN or Inf
to a preferable type, which is often done in programming with R.
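A few coercions illustrate this; note that coercing Inf or NaN to an integer yields NA together with a warning.

> as.numeric("3.14")    # character to double
[1] 3.14
> as.integer(3.9)       # the decimal part is truncated, not rounded
[1] 3
> as.character(Inf)
[1] "Inf"
> as.integer(Inf)       # NA, since Inf does not fit into an integer
[1] NA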
In the example below, the function paste() automatically converts the argu-
ments to strings, by using the function as.character(), and concatenates them
into a single string. The option sep specifies the string character that is used to
separate them.
> paste(NA, 1, "hop", sep = "@") # concatenate objects, separator @
[1] "NA@1@hop"
> typeof(paste(NA, 1, "hop", sep = "@"))
[1] "character"

Furthermore, one can check if an R object is finite, infinite, unknown, or of any other
type. The function is.finite(argument) returns a Boolean object (a vector or
matrix, if the input is a vector or a matrix) indicating whether the values are finite
or not. This test is also available to test types, such as is.integer(x), to test
whether x is an integer or not, etc.
> x = c(Inf, NaN, 4)
> is.finite(x) # check if finite
[1] FALSE FALSE TRUE
> is.nan(x) # check if NaN (operation not valid)
[1] FALSE TRUE FALSE
> is.double(x) # check if type double
[1] TRUE
> is.character(x) # check if type character
[1] FALSE

1.4.3 Arrays

A vector is a one-dimensional array of fixed length, which can contain objects of


one type only. There are only three categories of vectors: numerical, character and
logical.
Basic manipulations

Joining the values 1, π, √2 in a vector is done easily by the concatenate function, c.
> v = c(1, pi, sqrt(2)); v # concatenate
[1] 1.000000 3.141593 1.414214

The ith element of the vector can be addressed using v[i].


> v = c(1.000000, 3.141593, 1.414214)
> v[2] # 2nd element of v
[1] 3.141593

The indexing of a vector starts from 1. If an element is addressed that does not exist,
e.g. v[0] or v[4] in this example, no error is raised; instead numeric(0) or NA,
respectively, is returned. A numerical vector may be
integer if it contains only integers, numeric if it contains only real numbers, and
complex if it contains complex numbers. The length of a vector object v is found
through
> v = c(1.000000, 3.141593, 1.414214)
> length(v) # length of vector v
[1] 3

Be careful with this function, keeping in mind that it always returns one value, even
for multi-dimensional arrays, so one should know the nature of the objects one is
dealing with.
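For instance, for a matrix (introduced later in this section), length() counts all elements, whereas dim() returns the dimensions.

> m = matrix(0, nrow = 2, ncol = 3)
> length(m)             # total number of elements
[1] 6
> dim(m)                # number of rows and columns
[1] 2 3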

One easily applies the same transformation to all elements of a vector. One can
calculate, for example, the elementwise inverse with the command ^(-1). The same
holds for other objects, such as arrays.
> v = c(1.000000, 3.141593, 1.414214)
> d = v + 3; d
[1] 4.000000 6.141593 4.414214
> v^(-1)
[1] 1.0000000 0.3183099 0.7071068
> v * v^(-1)
[1] 1 1 1

There are a lot of other ways to construct vectors. The function array(x, y)
creates an array of dimension y filled with the value x only. The function seq(x,
y, by = z) gives a sequence of numbers from x to y in steps of z. Alternatively,
the required length can be specified by option length.out.
> c(1, 2, 3)
[1] 1 2 3
> 1:3
[1] 1 2 3
> array(1:3, 6)
[1] 1 2 3 1 2 3
> seq(1, 3)
[1] 1 2 3
> seq(1, 3, by = 2)
[1] 1 3
> seq(1, 4, length.out = 5)
[1] 1.00 1.75 2.50 3.25 4.00

One can also use the rep() function to create a vector in which some values are
repeated.
> v = c(1.000000, 3.141593, 1.414214)
> rep(v, 2) # the vector twice
[1] 1.00 3.14 1.41 1.00 3.14 1.41
> rep(v, c(2, 0, 1)) # 1st value twice, no 2nd value
> # 3rd value once
[1] 1.00 1.00 1.41
> rep(v, each = 2) # each value twice
[1] 1.00 1.00 3.14 3.14 1.41 1.41

With the second command of the above code, R creates a vector in which the first
value of v should appear two times, the second zero times, and the third only once.
Note that if the second argument is not an integer, R takes the rounded value. In the
last call, each element is repeated twice, proceeding element by element.
The names of the months, their abbreviations and all letters of the alphabet
are stored in predefined vectors. The months can be addressed in the vector
month.name[]. For their abbreviations, use month.abb[]. Letters are stored
in letters[] and capital letters in LETTERS[].

> s = c(2, month.abb[2], FALSE, LETTERS[6]); s


[1] "2" "Feb" "FALSE" "F"
> class(s)
[1] "character"

Note that if one element in a vector is of type character, then all elements in the
vector are converted to character, since a vector can only contain objects of one
type.
To keep only some specific values of a vector, one can use different methods
of conditional selection. The first is to use logical operators for vectors in R: “!”
is the logical NOT, “&” is the logical AND and "|" is the logical OR. Using these
commands, it is possible to perform a conditional selection of vector elements. The
elements for which the conditions are TRUE can then, for example, be saved in
another vector.
> v = c(1.000000, 3.141593, 1.414214)
> v > 0 # element greater 0
[1] TRUE TRUE TRUE
> (v != 1) & (v > 0) # element not equal to 1 and greater 0
[1] FALSE TRUE TRUE

In the last example, the first value is bigger than zero, but equal to one, so FALSE
is returned. This method may be a little bit confusing for beginners, but it is very
useful for working with multi-dimensional arrays.
Multiple selection of elements of a vector may be done using another vector of
indices as arguments in the square brackets.
> v = c(1.000000, 3.141593, 1.414214)
> v[c(1, 3)] # 1st and 3rd element
[1] 1.000000 1.414214
> w = v[(v != 1) & (v > 0)]; w # save the specified elements in w
[1] 3.141593 1.414214

To eliminate specific elements in a vector, the same procedure is used as for selection,
but a minus sign indicates the elements which should be removed.

> v = c(1.0000, 3.1416, 1.4142) >


> v[-1] # exclude first > v[-c(1, 3)] # excl. 1st and 3rd
[1] 3.1416 1.4142 [1] 3.141593

For a one-dimensional vector, the function which() returns the index or indices of
the elements that fulfil a specified condition.
> v = c(1.000000, 3.141593, 1.414214)
> which(v == pi) # indices of elements that fulfill the condition
[1] 2

There are different functions for working with vectors. Extremal values are found
through the functions min and max, which return the minimal and maximal values

of the vector, respectively.


> v = c(1.0000, 3.1416, 1.4142) >
> min(v) > max(v)
[1] 1.000000 [1] 3.141593

However, this can be done simultaneously by the function range, which returns a
vector consisting of the two extreme values.
> v = c(1.000000, 3.141593, 1.414214)
> range(v) # min and max value
[1] 1.000000 3.141593

Joining the function which() with min or max, one gets the function which.min
or which.max that returns the index of the smallest or largest element of the
vector, respectively, and is equivalent to which(x == max(x)) and which
(x == min(x)).
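For the vector v used above, these functions give the following results.

> v = c(1.000000, 3.141593, 1.414214)
> which.max(v)          # index of the largest element
[1] 2
> which.min(v)          # index of the smallest element
[1] 1
> which(v == max(v))    # the same as which.max(v)
[1] 2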
Quite often, the elements of a vector have to be sorted before one can proceed
with further transformations. The simplest function for this purpose is sort().
> x = c(4, 2, 5, 7, 1, 9, 0, 3)
> sort(x) # values in increasing order
[1] 0 1 2 3 4 5 7 9

Being a function, it does not modify the original vector x. To get the positions that
the elements of x occupy in the sorted vector, we use the function rank().
> x = c(4, 2, 5, 7, 1, 9, 0, 3)
> rank(x) # rank of elements in increasing order
[1] 5 3 6 7 2 8 1 4

In this example, the first value of the result is ‘5’. This means that the first element in
the original vector x[1] = 4 is in the fifth place in the ordered vector. The inverse
function to rank() is order(), which states the position of the element of the
sorted vector in the original vector, e.g. the smallest element in x is the seventh, the
second smallest is the fifth, etc.
> x = c(4, 2, 5, 7, 1, 9, 0, 3)
> order(x) # positions of sorted elements in the original vector
[1] 7 5 2 8 1 3 4 6

Replacing specific values in a vector is done with the function replace(). This
function replaces the elements of x that are specified by the second argument by the
values given in the third argument.

Table 1.5 Cumulative functions


Meaning Implementation Result
Sum cumsum(1:10) 1 3 6 10 15 21 28 36
45 55
Product cumprod(1:5) 1 2 6 24 120
Minimum cummin(c(3:1,2:0,4:2)) 3 2 1 1 1 0 0 0 0
Maximum cummax(c(3:1,2:0,4:2)) 3 3 3 3 3 3 4 4 4

> v = 1:10; v
[1] 1 2 3 4 5 6 7 8 9 10
> replace(v, v < 3, 12) # replace all els. smaller than 3 by 12
[1] 12 12 3 4 5 6 7 8 9 10
> replace(v, 6, 12) # replace the 6th element by 12
[1] 1 2 3 4 5 12 7 8 9 10

The second argument is a vector of indices for the elements to be replaced by the
values. In the second line, all numbers smaller than 3 are to be replaced by 12, while
in the last line, the element with index 6 is replaced by 12. Note again that functions
do not change the original vectors, so that the last output does not show 1 and 2
replaced by 12 after the second command.
There are also a few more functions for vectors which are of further interest. The
function rev() returns the elements in reversed order, and sum() gives the sum
of all the elements in the vector.
> x = c(4, 2, 5, 7, 1, 9, 0, 3)
> rev(x) # reverse the order of x
[1] 3 0 9 1 7 5 2 4
> sum(x) # sum all elements of x
[1] 31

More sophisticated cumulative operations on the elements of vectors are given in
Table 1.5.
Vectors can also be considered as sets, and for this purpose there exist binary set
operators, such as a %in% b, which indicates for each element of a whether it also
occurs in b. More advanced functions for working with sets are discussed in Chap. 3.
> a = 1:3 # 1 2 3
> b = 2:6 # 2 3 4 5 6
> a %in% b # FALSE TRUE TRUE
> b %in% a # TRUE TRUE FALSE FALSE FALSE
> a = c("A","B") # "A" "B"
> b = LETTERS[2:6] # "B" "C" "D" "E" "F"
> a %in% b # FALSE TRUE
> b %in% a # TRUE FALSE FALSE FALSE FALSE

In algebra and statistics, matrices are fundamental objects, which allow summarising
a large amount of data in a simple format. In R, matrices are only allowed to have
one data type for their entries, which is their main difference from data frames, see
Sect. 1.4.4.

Creating a matrix
There are many possible ways to create a matrix, as shown in the example below.
The function matrix() constructs matrices with specified dimensions.
> matrix(0, 2, 5) # zeros, 2x5
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0

> matrix(1:12, nrow = 3)


[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12

> matrix(1:6, nrow = 2, byrow = TRUE)


[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6

In the third matrix in the example above, the argument byrow = TRUE indicates that the filling is done by rows, which is not the case for the second matrix, which was filled by columns (column-major storage). The function as.vector(matrix) converts a matrix into a vector; if the matrix has more than one row or column, the columns are concatenated into a single vector. One can also construct diagonal matrices using diag(), see Sect. 2.1.1.
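A short illustration of as.vector(), using a matrix filled by columns:
> m = matrix(1:6, nrow = 2)   # filled column by column
> as.vector(m)                # columns concatenated into one vector
[1] 1 2 3 4 5 6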
Another way to transform a given vector into a matrix with specified dimensions
is the function dim(). The function t() is used to transpose matrices.
> m = 1:6
> dim(m) = c(2, 3); m         # vector becomes a 2x3 matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> t(m)                        # transpose m
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

Coupling vectors using the functions cbind() (column bind) and rbind() (row
bind) joins vectors column-wise or row-wise into a matrix.
> x = 1:6
> y = LETTERS[1:6]
> rbind(x, y) # bind vectors row-wise
[,1] [,2] [,3] [,4] [,5] [,6]
x "1" "2" "3" "4" "5" "6"
y "A" "B" "C" "D" "E" "F"

The functions col and row return the column and row indices of all elements of
the argument, respectively.

> m = matrix(1:6, ncol = 3)
> col(m)                      # column indices
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3
> row(m)                      # row indices
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2

The procedure to extract an element or submatrix uses a syntax similar to the syntax
for vectors. In order to extract a particular element, one uses m[row index,
column index]. As a reminder, in the example below, 10 is the second element
of the fifth column, in accordance with the standard mathematical convention.
> k = matrix(1:10, 2, 5); k # create a matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10

> k [2, 5] # select element in row 2, column 5


[1] 10

One can combine this with row() and col() to construct a useful tool for the
conditional selection of matrix elements. For example, extracting the diagonal of a
matrix can be done with the following code.
> m = matrix(1:6, ncol = 3)
> m[row(m) == col(m)] # select elements [1, 1]; [2, 2]; etc.
[1] 1 4

The same result is obtained by using the function diag(m). To better understand
the process, note that the command row(m) == col(m) creates just the Boolean
matrix below and all elements with value TRUE are subsequently selected.
> row(m) == col(m) # condition (row index = column index)
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE

This syntax can also be used to select whole rows or columns.


> y = matrix (1:16, ncol = 4, nrow = 4); y
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
> y[2, ] # second row
[1] 2 6 10 14
> y[, 2] # second column
[1] 5 6 7 8
> y[2] # second element (column-wise)
[1] 2
> y[1:2, 3:4] # several rows and columns
[,1] [,2]
[1,] 9 13
[2,] 10 14

The first command selects the second row of y. The second command selects the
2nd column. The third line considers the matrix as a succession of column vectors
and gives, according to this construction, the second element. The last call selects a
range of rows and columns.
Many functions can take matrices as an argument, such as sum() or prod(), which calculate the sum or product of all elements in the matrix, respectively. The functions colSums() and rowSums() can be used to calculate the column-wise or row-wise sums. All classical binary operators are applied element-by-element. This means that, for example, x * y returns the elementwise (Hadamard) product, not the classical matrix product discussed in Sect. 2.1 on matrix algebra.
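A small sketch contrasting the two products (the matrix product operator %*% is discussed in Sect. 2.1):
> x = matrix(1:4, 2, 2)
> y = diag(2)                 # 2x2 identity matrix
> x * y                       # elementwise (Hadamard) product
     [,1] [,2]
[1,]    1    0
[2,]    0    4
> x %*% y                     # matrix product
     [,1] [,2]
[1,]    1    3
[2,]    2    4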
One can assign names to the rows and columns of a matrix using the function
dimnames(). Alternatively, the column and row names can be assigned separately
by colnames() and rownames(), respectively.
> A = matrix(1:20, ncol = 5, nrow = 4)
> dimnames(A) = list(letters[4:7], letters[5:9]) # name dimensions
> A
e f g h i
d 1 5 9 13 17
e 2 6 10 14 18
f 3 7 11 15 19
g 4 8 12 16 20
> A[2, 2]
[1] 6
> A["b", "b"]
[1] 6

This leads directly to another very useful format in R: the data frame.

1.4.4 Data Frames

A data frame is a very useful object, because of the possibility of collecting data
of different types (numeric, logical, factor, character, etc.). Note, however, that
all elements must have the same length. The function data.frame() creates
a new data frame object. It contains several arguments, as the column names can
be directly specified with data.frame(..., row.names = c(), ...).
A further possibility for creating a data frame is to convert it from a matrix with the
as.data.frame(matrix name) function.
Basic manipulations
Consider the following example, which constructs a data frame.
> cities = c("Berlin", "New York", "Paris", "Tokyo")
> area = c(892, 1214, 105, 2188)
> population = c(3.4, 8.1, 2.1, 12.9)
> continent = factor(c("Europe", "North America", "Europe", "Asia"))
> myframe = data.frame(cities, area, population, continent)
> is.data.frame(myframe) # check if object is a dataframe
[1] TRUE
> rownames(myframe) = c("Berlin", "New York", "Paris", "Tokyo")

> colnames(myframe) = c("City", "Area", "Pop.", "Continent")


> myframe
City Area Pop. Continent
Berlin Berlin 892 3.4 Europe
New York New York 1214 8.1 North America
Paris Paris 105 2.1 Europe
Tokyo Tokyo 2188 12.9 Asia

Note that if we defined the above data frame as a matrix, then all elements would be
converted to type character, since matrices can only store one data type.
data.frame() automatically calls the function factor() to convert all char-
acter vectors to factors, as it does for the Continent column above, because
these variables are assumed to be indicators for a subdivision of the data set. To
perform data analysis (e.g. principal component analysis or cluster analysis, see
Chap. 8), numerical expressions of character variables are needed. It is therefore
often useful to assign ordered numeric values to character variables, in order to
perform statistical modelling, set the correct number of degrees of freedom, and
customise graphics. These variables are treated in R as factors. As an example, a
new variable is constructed, which will be added to the data frame “myframe”.
Three position categories are set, according to the proximity of each city to the sea:
Coastal (‘0’), Middle (‘1’) and Inland (‘2’). These categories follow a certain
order, with Middle being in between the others, which needs to be conveyed to R.
> e = c(2, 0, 2, 0) # code info. in e
> f = factor(e, level = 0:2) # create factor f
> levels(f) = c("Coastal", "Middle", "Inland"); f # with 3 levels
[1] Inland Coastal Inland Coastal
Levels: Coastal Middle Inland
> class(f)
[1] "factor"
> as.numeric(f)
[1] 3 1 3 1

The variable f is now a factor, and levels are defined by the function levels()
in the 3rd line in decreasing order of the proximity to the sea. When sorting the
variable, R will now follow the order of the levels. If the position values were simply
coded as string, i.e. Coastal, Middle and Inland, any sorting would be done
alphabetically. The first level would be Coastal, but the second Inland, which
does not follow the inherited order of the category.
The function as.numeric() extracts the numerical coding of the levels and
the indexation begins now with 1.
> myframe = data.frame(myframe, f)
> colnames(myframe)[5] = "Prox.Sea" # name 5th column
> myframe
City Area Pop. Continent Prox.Sea
Berlin Berlin 892 3.4 Europe Inland
New York New York 1214 8.1 North America Coastal
Paris Paris 105 2.1 Europe Inland
Tokyo Tokyo 2188 12.9 Asia Coastal

The column names for columns 1 to 4 are the ones that were assigned before, since
myframe is used in the call of data.frame(). Note that one should not use names
with spaces, e.g. Sea.Env. instead of Sea. Env. To add columns or rows to a
data frame, one can use the same functions as for matrices, or the procedure described
below.
> myframe = cbind(myframe, "Language.Spoken"=
+ c("German", "English", "French", "Japanese"))
> myframe
City Area Pop. Continent Prox.Sea Language.Spoken
Berlin Berlin 892 3.4 Europe Inland German
New York New York 1214 8.1 North America Coastal English
Paris Paris 105 2.1 Europe Inland French
Tokyo Tokyo 2188 12.9 Asia Coastal Japanese

There are several ways of addressing one particular column by its name:
myframe$Pop., myframe[, 3], myframe[, "Pop."], myframe
["Pop."]. All these commands except the last return a numeric vector. The last
command returns a data frame.
> myframe$Pop. # select only population column
[1] 3.4 8.1 2.1 12.9
> myframe["Pop."] # population column as dataframe
Pop.
Berlin 3.4
New York 8.1
Paris 2.1
Tokyo 12.9
> myframe[3] == myframe["Pop."]
Pop.
Berlin TRUE
New York TRUE
Paris TRUE
Tokyo TRUE

The output of the above code is a data frame and, therefore, can not be indexed
like a vector. One uses $ notation similar to addressing fields of objects in the C++
programming language.
> myframe[2, 3] # select 3rd entry of 2nd row
[1] 8.1
> myframe[2, ] # select 2nd row
City Area Pop. Continent Prox.Sea Language.Spoken
New York New York 1214 8.1 North America Coastal English

Long names for data frames and the contained variables should be avoided, because
the source code becomes very messy if several of them are called. This can be solved
by the function attach(). Attached data frames will be set to the search path and
the included variables can be called directly. Any R object can be attached. To remove
it from the search path, one uses the function detach().
> rm(area) # remove var. "area" to avoid confusion
> attach(myframe) # attach dataframe "myframe"
> Area # specify column Area in attached frame
[1] 892 1214 105 2188
> detach(myframe)

If two-word names are used, it is advised to label the data frame or variable with
a block name, so that the two words in the name are connected with a dot or an
underline, e.g. Language.Spoken. This avoids having to put names in quotes.
One of the easiest ways to edit a data frame or a matrix is through interactive
tables, called by the edit function. Note that the edit() function does not allow
changing the original data frame.
> edit(myframe)

If the modifications are to be saved, the function fix() is employed. It opens a table like edit(), but the changes in the data are stored.
> fix(myframe)

A data frame as a database


Furthermore, R provides the possibility of selecting subsets of a data frame by using
the logical operators discussed in Sects. 1.4.1 and 1.4.3: <, >, <=, >=, ==, !=, &, | and !.
> myframe[(myframe$Language.Spoken == "French")|
+ (myframe$Pop. > 10), -1]
Area Pop. Continent Prox.Sea Language.Spoken
Paris 105 2.1 Europe Inland French
Tokyo 2188 12.9 Asia Coastal Japanese
> myframe[, -c(1, 2, 3, 5)] # select all except specified columns
Continent Language.Spoken
Berlin Europe German
New York North America English
Paris Europe French
Tokyo Asia Japanese

The first command of the last listing selects the cities in which French is spoken as well as the cities with more than 10 million inhabitants; the column index -1 drops the first column (City) from the display. The second command keeps only the fourth and sixth variables, Continent and Language.Spoken. As explained above, the individual data, as well as rows and columns, can be addressed using the square brackets. If no variable is selected, i.e. [,], all information about the observations is kept.
The following functions are also helpful for conditional selections from data
frames. The function subset(), which performs conditional selection from a data
frame, is frequently used when only a subset of the data is used for the analysis.
> subset(myframe, Area > 1000)
City Area Pop. Continent Prox.Sea Language.Spoken
New York New York 1214 8.1 North America Coastal English
Tokyo Tokyo 2188 12.9 Asia Coastal Japanese

A conditional transformation of the data frame, adding a new variable which is a function of others, is done by using the function transform(). As an example, a new variable Density is added to our data frame.

> transform(myframe[, -c(1, 4, 5)], Density = Pop. * 10^6 / Area)


Area Pop. Language.Spoken Density
Berlin 892 3.4 German 3811.659
New York 1214 8.1 English 6672.158
Paris 105 2.1 French 20000.000
Tokyo 2188 12.9 Japanese 5895.795

Another way to extract data according to the values is based on addressing specific
variables. In the next example, the interest is in the cumulative area of cities that are
not inland.
> Area.Seasiders = myframe$Area[myframe$Prox.Sea == "Middle"
+ | myframe$Prox.Sea == "Coastal"]
> Area.Seasiders
[1] 1214 2188
> sum(Area.Seasiders)
[1] 3402

The important technique of sorting a data frame is illustrated below. Remember that order() returns the positions which the sorted elements occupy in the original vector.
The optional argument partial specifies the columns for subsequent ordering, if
necessary. It is used to order groups of data according to one column and order the
values in each group according to another column.
> myframe[order(myframe$Pop., partial = myframe$Area), ]
City Area Pop. Continent Prox.Sea Language.Spoken
Paris Paris 105 2.1 Europe Inland French
Berlin Berlin 892 3.4 Europe Inland German
New York New York 1214 8.1 North America Coastal English
Tokyo Tokyo 2188 12.9 Asia Coastal Japanese

1.4.5 Lists

Lists are very flexible objects which, unlike matrices and data frames, may contain
variables of different types and lengths.
The simplest way to construct a list is by using the function list(). In the
following example, a string, a vector and a function are joined into one variable.
> a = c(2, 7)
> b = "Hello"
> d = list(example = Stirling, a, end = b)
> d
$example
function(x){
sqrt(2 * pi * x) * (x / exp(1))^x
}

[[2]]
[1] 2 7

$end
[1] "Hello"

Another way to join these into a list is through a vector construction.


> z = c(Stirling, a)
> typeof(z)
[1] "list"

To address the elements of a list object, one again uses ‘$’, the same syntax as for a
data frame.
> d$end
[1] "Hello"

A list can be flattened with unlist(), so that every component has length 1. In this example, the element [[2]] of list d is split into two elements, each of length 1. (If all components were atomic, unlist() would return a vector; here the result stays a list because d contains a function.)
> unlist(d) # transform to list with elements of length 1
$example
function(x){
sqrt(2 * pi * x) * (x / exp(1))^x
}

[[2]]
[1] 2

[[3]]
[1] 7

$end
[1] "Hello"

Another way of obtaining a list is the function split(). It divides an object into groups according to a defined criterion and returns these groups as a list.
> split(myframe, myframe$Continent)
$Asia
City Area Pop. Continent Prox.Sea Language.Spoken
Tokyo Tokyo 2188 12.9 Asia Coastal Japanese

$Europe
City Area Pop. Continent Prox.Sea Language.Spoken
Berlin Berlin 892 3.4 Europe Inland German
Paris Paris 105 2.1 Europe Inland French

$‘North America‘
City Area Pop. Continent Prox.Sea Language.Spoken
New York New York 1214 8.1 North America Coastal English

In the above example, the data frame myframe is split into elements according to
its column Continent and transformed into a list.

1.4.6 Programming in R

Functions
R has many programming capabilities, and allows creating powerful routines with
functions, loops, conditions, packages and objects. As in the Stirling example,
args() is used to receive a list of possible arguments for a specific function.
> args(data.frame) # list possible arguments and default values
function(..., row.names = NULL, check.rows = FALSE, check.names
= TRUE, stringsAsFactors = default.stringsAsFactors())
NULL

This command provides a list of all arguments that can be used in the function,
including the default settings for the optional ones, which have the form optional
argument = setting value.
Below a simple function is presented, which returns the list {a · sin(x), a · cos(x)}.
The arguments a and x are defined in round brackets. We can define functions with
optional arguments that have default values, in this example, a = 1.
> myfun = function(x, a = 1){ # define function
+ r1 = a * sin(x)
+ r2 = a * cos(x)
+ list(r1, r2)
+ }
> myfun(pi / 2) # apply to pi / 2, a = default
[[1]]
[1] 1

[[2]]
[1] 6.123234e-17

Note that if no return(result) operator is given at the end of the function body,
then the last created object will be returned.
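A minimal sketch of this behaviour, where the last evaluated object is returned implicitly:
> square = function(x){       # no explicit return()
+   x^2                       # the last object is returned
+ }
> square(3)
[1] 9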
Loops and conditions
The family of these operators is a powerful and useful tool. However, in order to
perform well, they should be used wisely. Let us start with the ‘if ’ condition.
> x = 1
> if(x == 2){print("x == 2")}
> if(x == 2){print("x == 2")}else{print("x != 2")}
[1] "x != 2"

The first programme is only an if condition, whereas the second is extended by the else command, which provides an alternative command in case the condition is not fulfilled. More advanced, but more unusual in syntax, is the function ifelse(boolean check, if-case, else-case). It is used mainly in advanced frameworks, to simplify code in which several if-else constructions are embedded within each other.
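A small sketch of ifelse(), which evaluates the condition elementwise on a vector:
> x = c(-2, 0, 3)
> ifelse(x >= 0, "non-negative", "negative")
[1] "negative"     "non-negative" "non-negative"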

Furthermore, for and while are very useful functions for creating loops, but
are best avoided in case of large sample sizes and extensive computations, since
they work very slowly. The difference between the functions is that for applies the
computation for a defined range of integers and while carries out the computation
until a certain condition is fulfilled. One may also use repeat, which will repeat the
specified code until it reaches the command break. One must be careful to include
a break rule or the loop will repeat infinitely.
> x = numeric(1)
> # for i from 1 to 10, the i-th element of x takes value i
> for(i in 1:10) x[i] = i
> x
[1] 1 2 3 4 5 6 7 8 9 10
> # as long as i < 21, set i-th element equal to i and increase i by 1
> i = 1
> while(i < 21){
+ x[i] = i
+ i = i + 1
+ }
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> # remove the first element of x, stop when x has length 1
> repeat{
+ x = x[-1]
+ if(length(x) == 1) break
+ }
> x
[1] 20

As an alternative to loops, R provides the functions apply(), sapply(), lapply() and, for the multivariate case, mapply(). In many cases, using these functions will improve the computational time significantly compared to the above loop functions. The function apply() applies a function to the rows or to the columns of a matrix. The first argument specifies the matrix, the second argument determines whether the function is applied to rows (1) or to columns (2), and the third argument specifies the function.
> A = matrix(1:24, 12, 2, byrow = TRUE)
> apply(A, 2, mean) # apply mean to columns separately
[1] 12 13
> apply(A, 2, sum) # column-wise sum
[1] 144 156
> apply(A, 1, mean) # row-wise mean
[1] 1.5 3.5 5.5 7.5 9.5 11.5 13.5 15.5 17.5 19.5 21.5 23.5
> apply(A, 1, sum) # row-wise sum
[1] 3 7 11 15 19 23 27 31 35 39 43 47

The functions lapply() and sapply() are of a more general form, since the function is applied to each element of the object: lapply() always returns a list, whereas sapply() simplifies the result to a numeric vector or a matrix, if appropriate. So if the class of the object is matrix or numeric, the sapply function is often preferred, but it takes longer, as it applies lapply and converts the result afterwards. If a more general object, e.g. a list of arbitrary objects, is used, the lapply function is more appropriate.
> A = matrix(1:24, 12, 2, byrow = TRUE)
> # apply function sin() to every element, return numeric vector
> sapply(A[1, ], sin)
[1] 0.8414710 0.9092974
> class(sapply(A[1:4, ], sin))
[1] "numeric"

> # apply function sin() to every element, return list


> lapply(A[1, ], sin)
[[1]]
[1] 0.841471

[[2]]
[1] 0.9092974
> class(lapply(A[1:4, ], sin))
[1] "list"

There is one more useful function, called tapply(), which applies a defined func-
tion to each cell of a ragged array. The latter is made from non-empty groups of
values given by a unique combination of the levels of certain factors. Simply speak-
ing, tapply() is used to break an array or vector into different subgroups before
applying the function to each subgroup. In the example below, a matrix A is exam-
ined, which could contain the observations 1–12 of individuals from group 1, 2 or 3.
Our intention is to calculate the mean for each group separately.
> g = c(rep(1, 4), rep(2, 4), rep(3, 4)) # create vector "group ID"
> A = cbind(1:12, g) # observations and group ID
> tapply(A[, 1], A[, 2], mean) # apply function per group
1 2 3
2.5 6.5 10.5

Finally, the switch() function may be seen as the highlight of R’s built-in program-
ming functions. The function switch(i, expression1, expression2,...)
chooses the i-th expression in the given expression arguments. It works with num-
bers, but also with character chains to specify the expressions. This can be used to
simplify code, e.g. by defining a function that can be called to perform different
computations.
> rootsquare = function(x, type){ # define function for ^2 or ^(0.5)
+ switch (type, square = x * x, root = sqrt(x))
+ }
> rootsquare(10, "square") # apply "square" to argument 10
[1] 100
> rootsquare(10, 1) # first is equivalent to "square"
[1] 100
> rootsquare(10, "root") # apply "root" to argument 10
[1] 3.162278
> rootsquare(10, 2)
[1] 3.162278
> rootsquare(10, "ROOT") # apply "ROOT" (not defined)
[1] NULL

It is sometimes useful to compare the efficiency of two different commands in terms of the time they need to be computed, which can be done by the function system.time(). This function returns three values: user is the CPU time charged for the execution of the user instructions of R, i.e. the processing of the functions. The system time is the CPU time used by the system on behalf of the calling process. Together they give the total amount of CPU time. The elapsed time is the total real time passed for the user. Since CPU processes can run simultaneously, there is no clear relation between total CPU time and elapsed time.
> x = c(1:500000)
> system.time(for(i in 1:200000) {x[i] = rnorm(1)})
user system elapsed
0.892 0.033 0.925
> system.time(x <- rnorm(200000))
user system elapsed
0.017 0.000 0.017

Here the function rnorm() is used, which simulates from the normal distribution, see Sect. 4.3. Note that the single vectorised call to rnorm() is much faster than filling the vector inside the for loop.

1.4.7 Date Types

R provides full access to the current date and time values through the functions Sys.time() and date():

> Sys.time()
> date()

The function as.Date() is used to format data from another source to dates that
R can work with. When reading in dates, it is important to specify the date format,
i.e. the order and delimiter. A list of date formats that can be converted by R via the
appropriate conversion specifications can be found in the help for strptime. One
can also change the format of the dates in R.
> dates = c("23.05.1984", "2001/01/01", "May 3, 1256")
> # read dates specifying correct format
> dates1 = as.Date(dates[1], "%d.%m.%Y"); dates1
[1] "1984-05-23"
> dates2 = as.Date(dates[2], "%Y/%m/%d"); dates2
[1] "2001-01-01"
> dates3 = as.Date(dates[3], "%B %d,%Y"); dates3
[1] "1256-05-03"
> dates.a = c(dates1, dates2, dates3)
> format(dates.a, "%m.%Y") # delimiter "." and month/year only
[1] "05.1984" "01.2001" "05.1256"

Note that the function as.Date() is not only applicable to character strings, factors and logical NA, but also to objects of the classes POSIXlt and POSIXct. These two classes represent calendar dates and times: POSIXct stores a time as the number of seconds since the beginning of 1970 (UTC) in a numeric vector, while POSIXlt stores a list of vectors containing seconds, minutes, hours, etc.
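A small sketch of the two classes (the printed values depend on the local time zone):
> tct = as.POSIXct("1984-05-23 12:00:00")   # stored as seconds since 1970
> class(tct)
[1] "POSIXct" "POSIXt"
> tlt = as.POSIXlt(tct)                     # list of date-time components
> tlt$year + 1900                           # years are counted from 1900
[1] 1984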

The difference between two dates is calculated by difftime().


> dates.a = as.Date(c("1984/05/23", "2001/01/01", "1256/05/03"))
> difftime(Sys.time(), dates.a)
Time differences in days
[1] 11033.473 5332.473 276949.473

The functions months(), weekdays() and quarters() give the month, week-
day and quarter of the specified date, respectively.
> dates.a = as.Date(c("1984/05/23", "2001/01/01", "1256/05/03"))
> months(dates.a)
[1] "May" "January" "May"
> weekdays(dates.a)
[1] "Wednesday" "Saturday" "Wednesday"
> quarters(dates.a)
[1] "Q2" "Q1" "Q2"

1.4.8 Reading and Writing Data from and to Files

For statisticians, software must be able to easily handle data without restrictions on
its format, whether it is ‘human readable’ (such as .csv, .txt), in binary format
(SPSS, STATA, Minitab, S-PLUS, SAS (export libs)) or from relational databases.
Writing data
There are some useful functions for writing data, e.g. the standard write.table().
Its often used options include col.names and row.names, which specify whether
row or column names are written to the data file, as well as sep, which specifies the
separator to be used between values.
> write.table(myframe, "mydata.txt")
> write.table(Orange, "example.txt",
+ col.names = FALSE, row.names = FALSE)
> write.table(Orange, "example2.txt", sep="\t")

The first command creates the file mydata.txt in the working directory of the data
frame myframe from Sect. 1.4.4, the second specifies that the names for columns and
rows are not defined, and the last one asks for tab separation between cells.
The functions write.csv() and write.csv2() are both used to create
Excel-compatible files. They differ from write.table() only in the default def-
inition of the decimal separator, where write.csv() uses ‘.’ as a decimal separator
and ‘,’ as the separator between columns in the data. Function write.csv2() uses
‘,’ as decimal separator and ‘;’ as column separator.
> write.csv(data, "file name") # decimal ".", column separator ","
> write.csv2(data, "file name") # decimal ",", column separator ";"

Reading data
R supplies many built-in data sets. They can be found through the function data(),
or, less efficiently, through objects(package:datasets). In any case, we

can load the pre-built data sets library using library("datasets"), where the
quotation marks are optional. Many packages bring their own data, so for many exam-
ples in this book, packages will be loaded in order to work with their included data
sets. To check whether a data set is in a package, data(package = "package
name") is used.
Moreover, we can import .txt files, with or without header.
> data = read.table("mydata.txt", header = TRUE)

The default option is header = FALSE, indicating that there is no description in the first row of each column. Information about the data frame is requested by the functions names(), which states the column names, and str(), which displays the structure of the data.
> names(data)
[1] "City" "Area" "Pop." "Continent"
[5] "Prox.Sea" "Language.Spoken"
> str(data)
’data.frame’: 4 obs. of 6 variables:
$ City : Factor w/ 4 levels "Berlin","New York",..:1 2 3 4
$ Area : int 892 1214 105 2188
$ Pop. : num 3.4 8.1 2.1 12.9
$ Continent : Factor w/ 3 levels "Asia","Europe",..:2 3 2 1
$ Prox.Sea : Factor w/ 2 levels "Coastal","Inland":2 1 2 1
$ Language.Spoken: Factor w/ 4 levels "English","French",..:3 1 2 4

The function head() returns the first few rows of an object, which can be used to
check whether there is a header. To do this, it is important to know that the first row
is generally a header when it has one column less than the second row.
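For instance, applied to the data frame read above, head() displays its first rows:
> head(data, 2)               # returns the rows for Berlin and New York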
In some cases, the separation between the columns will not follow any of the
standard formats. We can then use option sep to manually specify the column
separator.
> data = read.table("file name", sep = "\t")

Here we specified manually that there is a tab character between the variables. With-
out a correctly specified separator, R may read all the lines as a single expression.
Missing values are represented by NA in R, but different programmes and authors
use other symbols, which can be defined in the function. Suppose, for example, that
the NA values were denoted by ‘missing’ by the creator of the dataset.
> data = read.table("file name", na.strings = "missing")

To import data in .csv (comma-separated list) format, e.g. from Microsoft Excel, the
functions read.csv() or read.csv2() are used. They differ from each other
in the same way as the functions write.csv() or write.csv2() discussed
above.
To import or even write data in the formats of statistic software packages such as
STATA or SPSS, the package foreign provides a number of additional functions.
These functions are named read. plus the data file extension, e.g. read.dta()
for STATA data.

To read data in the most general way from any file, the function scan("file name")
is used. This function is more universal than read.table(), but not as simple to
handle. It can be used to read columnar data or read data into a list.
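A minimal sketch of scan(), here reading whitespace-separated numbers from an inline text string rather than a file:
> scan(text = "1 2 3 4 5")    # returns a numeric vector
Read 5 items
[1] 1 2 3 4 5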
It is possible to have the user choose interactively between several options by using the function menu(). This function shows a list of options, from which the user can choose by entering the value or its index number, and it returns the position of the selected item in the list. With the option graphics = TRUE, the list is shown in a separate window.
> menu(c("abc", "def"), title = "Enter value")
Enter value

1: abc
2: def

Selection: def
[1] 2
> menu(c("abc", "def"), graphics = TRUE, title = "Enter value")
Chapter 2
Numerical Techniques

The general who wins a battle makes many calculations in the temple before an attack.
— Sun Tzu, The Art of War

With more and more practical problems of applied mathematics appearing in different
disciplines, such as chemistry, biology, geology, management and economics, to men-
tion just a few, the demand for numerical computation has considerably increased.
These problems frequently have no analytical solution or the exact result is time-
consuming to derive. To solve these problems, numerical techniques are used to
approximate the result. This chapter introduces matrix algebra, numerical integra-
tion, differentiation and root finding.

2.1 Matrix Algebra

Matrix algebra is a fundamental concept for applying and understanding numerical methods; therefore the beginning of this section introduces the basic characteristics of matrices, their operations and their implementation in R. Thereafter, further operations are presented: the inverse (both of non-singular and of singular matrices), vector and matrix norms, the calculation of eigenvalues and eigenvectors, and different types of matrix decompositions. Some theorems and accompanying examples computed in R are also provided.


2.1.1 Characteristics of Matrices

A matrix $A_{(n \times p)}$ is a system of numbers with $n$ rows and $p$ columns:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & \cdots & \cdots & a_{np} \end{pmatrix} = (a_{ij}).$$

Matrices with one column and n rows are column vectors, and matrices with one row
and p columns are row vectors. The following R code produces (3 × 3) matrices A
and B with the numbers from 1 to 9 and from 0 to −8, respectively. The matrices are
filled by rows if byrow = TRUE.
> # set matrices A and B
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> B = matrix(0:-8, nrow = 3, ncol = 3, byrow = FALSE); B
[,1] [,2] [,3]
[1,] 0 -3 -6
[2,] -1 -4 -7
[3,] -2 -5 -8

There are several special matrices that are frequently encountered in practical and theoretical work. Diagonal matrices are special matrices where all off-diagonal elements are equal to 0, that is, $A_{(n \times p)}$ is a diagonal matrix if $a_{ij} = 0$ for all $i \neq j$. The function diag() creates diagonal matrices (square or rectangular) or extracts the main diagonal of a matrix in R.
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> diag(x = A) # extract diagonal
[1] 1 5 9
> diag(3) # identity matrix
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
> diag(2, 3) # 2 on diag, 3x3
[,1] [,2] [,3]
[1,] 2 0 0
[2,] 0 2 0
[3,] 0 0 2
> diag(c(1, 5, 9, 13), nrow = 3, ncol = 4) # 3x4
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 5 0 0
[3,] 0 0 9 0
> diag(2, 3, 4) # 3x4, 2 on diagonal
[,1] [,2] [,3] [,4]
[1,] 2 0 0 0
[2,] 0 2 0 0

[3,] 0 0 2 0

As seen from the listing above, the argument x of diag() can be a matrix, a vector,
or a scalar. In the first case, the function diag() extracts the diagonal elements of
the existing matrix, and in the remaining two cases, it creates a diagonal matrix with
a given diagonal or of given size.
Rank
The rank of $A$, denoted by $\operatorname{rank}(A)$, is the maximum number of linearly independent rows or columns. Linear independence of a set of $h$ rows $a_j$ means that $\sum_{j=1}^{h} c_j a_j = 0_p$ if and only if $c_j = 0$ for all $j$. If the rank is equal to the number of rows or columns, the matrix is called a full-rank matrix. In R, the rank can be calculated using the function qr() (which performs the so-called QR decomposition) and its output field rank.
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> qr(A)$rank # rank of matrix A
[1] 2

The matrix A is not of full rank, because the second column can be represented as a linear combination of the first and third columns:

$$\begin{pmatrix} 2 \\ 5 \\ 8 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 1 + 3 \\ 4 + 6 \\ 7 + 9 \end{pmatrix}. \qquad (2.1)$$

This shows that the general condition for linear independence is violated for this specific matrix A. The coefficients are $c_1 = c_3 = \frac{1}{2}$ and $c_2 = -1$, and are thus different from zero.
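A quick numerical check of this linear dependence in R:
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> 0.5 * (A[, 1] + A[, 3]) - A[, 2]   # c1 = c3 = 0.5, c2 = -1
[1] 0 0 0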
Trace
The trace of a matrix, $\operatorname{tr}(A)$, is the sum of its diagonal elements:

$$\operatorname{tr}(A) = \sum_{i=1}^{\min(n,p)} a_{ii}.$$

The trace of a scalar just equals the scalar itself. One obtains the trace in R by
combining the functions diag() and sum():
> A = matrix(1:12, nrow = 4, ncol = 3); A
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> sum(diag(A)) # trace
[1] 18

The function diag() extracts the diagonal elements of a matrix, which are then
summed by the function sum().
Determinant
The formal definition of the determinant of a square matrix $A_{(p \times p)}$ is

$$\det(A) = \sum (-1)^{\phi_p(\tau_1, \ldots, \tau_p)}\, a_{1\tau_1} \cdots a_{p\tau_p}, \qquad (2.2)$$

where the sum runs over all permutations $(\tau_1, \ldots, \tau_p)$ of $(1, \ldots, p)$, $\phi_p(\tau_1, \ldots, \tau_p) = n_1 + \cdots + n_p$, and $n_k$ represents the number of integers in the subsequence $\tau_{k+1}, \ldots, \tau_p$ that are smaller than $\tau_k$. For a square matrix $A_{(2 \times 2)}$ of dimension two, (2.2) reduces to

$$\det(A_{(2 \times 2)}) = a_{11} a_{22} - a_{12} a_{21}.$$

The determinant is often useful for checking whether matrices are singular or regular.
If the determinant is equal to 0, then the matrix is singular. Singular matrices can not
be inverted, which limits some computations. In R the determinant is computed by
the function det():
> A = matrix(1:9, nrow = 3, ncol = 3); A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> det(A) # determinant
[1] 0

Thus, A is singular.
Transpose
A matrix $A_{(n \times p)}$ has a transpose $A^{\top}_{(p \times n)}$, which is obtained by reordering the elements of the original matrix. Formally, the transpose of $A_{(n \times p)}$ is

$$A^{\top}_{(p \times n)} = (a_{ij})^{\top} = (a_{ji}).$$

The resulting matrix has $p$ rows and $n$ columns. One has that

$$(A^{\top})^{\top} = A, \qquad (AB)^{\top} = B^{\top} A^{\top}.$$

R provides the function t() which returns the transpose of a matrix:


> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> t(A) # transpose
[,1] [,2] [,3]
[1,] 1 4 7

[2,] 2 5 8
[3,] 3 6 9
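A short check of the rule $(AB)^{\top} = B^{\top} A^{\top}$ in R:
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> B = diag(1:3)                        # diagonal matrix with 1, 2, 3
> all(t(A %*% B) == t(B) %*% t(A))     # transpose of a product
[1] TRUE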

When creating a matrix with the constructor matrix(), the same transpose can be obtained by setting the argument byrow to FALSE instead (swapping the row and column dimensions if the matrix is not square).
A good overview of special matrices and vectors is provided by Table 2.1 in Härdle
and Simar (2015), Chap. 2. The same notations are used in this book.
Conjugate transpose
Every matrix $A_{(n \times p)}$ has a conjugate transpose $A^{C}_{(p \times n)}$. The elements of $A$ can be complex numbers. If a matrix entry $a_{ij} = \alpha + \beta i$ is a complex number with real numbers $\alpha, \beta$ and imaginary unit $i^2 = -1$, then its conjugate is $a^{C}_{ij} = \alpha - \beta i$. The same holds in the other direction: if $a_{ij} = \alpha - \beta i$, the conjugate is $a^{C}_{ij} = \alpha + \beta i$. Therefore the conjugate transpose is

$$A^{C} = \begin{pmatrix} a^{C}_{11} & a^{C}_{21} & \cdots & a^{C}_{n1} \\ a^{C}_{12} & a^{C}_{22} & \cdots & a^{C}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ a^{C}_{1p} & \cdots & \cdots & a^{C}_{np} \end{pmatrix}. \qquad (2.3)$$

The function Conj() yields the conjugates of the elements. One can combine the functions Conj() and t() to get the conjugate transpose of a matrix. For

$$A = \begin{pmatrix} 1 + 0.5i & 1 & 1 \\ 1 & 1 & 1 - 0.5i \end{pmatrix},$$

the conjugate transpose is computed in R as follows:

> a = c(1 + 0.5i, 1, 1, 1, 1, 1 - 0.5i) # matrix entries


> A = matrix(a, nrow = 2, ncol = 3, byrow = TRUE) # complex matrix
> A
[,1] [,2] [,3]
[1,] 1+0.5i 1+0i 1+0.0i
[2,] 1+0.0i 1+0i 1-0.5i
> AC = Conj(t(A)) # conjugate
> AC # transpose
[,1] [,2]
[1,] 1-0.5i 1+0.0i
[2,] 1+0.0i 1+0.0i
[3,] 1+0.0i 1+0.5i

For a matrix with only real values, the conjugate transpose $A^{C}$ is equal to the ordinary transpose $A^{\top}$.

2.1.2 Matrix Operations

There are four fundamental operations in arithmetic: addition, subtraction, multiplication and division. In matrix algebra, there exist analogous operations: matrix addition, matrix subtraction, matrix multiplication and ‘division’.

Basic operations
For matrices $A_{(n \times p)}$ and $B_{(n \times p)}$ of the same dimensions, matrix addition and subtraction work elementwise as follows:

$$A + B = (a_{ij} + b_{ij}), \qquad A - B = (a_{ij} - b_{ij}).$$

These operations can be applied in R as shown below.


> A = matrix(3:11, nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 3 4 5
[2,] 6 7 8
[3,] 9 10 11
> B = matrix(-3:-11, nrow = 3, ncol = 3, byrow = TRUE); B
[,1] [,2] [,3]
[1,] -3 -4 -5
[2,] -6 -7 -8
[3,] -9 -10 -11
> A + B
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0

R reports an error if one tries to add or subtract matrices with different dimensions.
The elementary operations, including addition, subtraction, multiplication and divi-
sion can also be used with a scalar and a matrix in R, and are applied to each entry
of the matrix. An example is the modulo operation
> A = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> A %% 2 # modulo operation
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 0 1 0
[3,] 1 0 1

If one uses the elementary operators, including addition +, subtraction -, multiplication * or division /, between two matrices in R, they are all interpreted as elementwise operations. Matrix multiplication returns the matrix product of the matrices $A_{(n \times p)}$ and $B_{(p \times m)}$, which is

$$A_{(n \times p)} \cdot B_{(p \times m)} = C_{(n \times m)} = \begin{pmatrix} \sum_{i=1}^{p} a_{1i} b_{i1} & \sum_{i=1}^{p} a_{1i} b_{i2} & \cdots & \sum_{i=1}^{p} a_{1i} b_{im} \\ \sum_{i=1}^{p} a_{2i} b_{i1} & \sum_{i=1}^{p} a_{2i} b_{i2} & \cdots & \sum_{i=1}^{p} a_{2i} b_{im} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^{p} a_{ni} b_{i1} & \cdots & \cdots & \sum_{i=1}^{p} a_{ni} b_{im} \end{pmatrix}.$$

In R, one uses the operator %*% between two objects for matrix multiplication. The
objects have to be of class vector or matrix.
> A = matrix(3:11, nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 3 4 5
[2,] 6 7 8
[3,] 9 10 11
> B = matrix(-3:-11, nrow = 3, ncol = 3, byrow = TRUE); B
[,1] [,2] [,3]
[1,] -3 -4 -5
[2,] -6 -7 -8
[3,] -9 -10 -11
> A %*% B # matrix multiplication
[,1] [,2] [,3]
[1,] -78 -90 -102
[2,] -132 -153 -174
[3,] -186 -216 -246

The number of columns of A has to equal the number of rows of B.


Inverse
The division operation for square matrices is done by inverting a matrix. The inverse $A^{-1}$ of a square matrix $A_{(p \times p)}$ exists if $\det(A) \neq 0$:

$$A^{-1} A = A A^{-1} = I_p. \qquad (2.4)$$

The inverse of $A = (a_{ij})$ can be calculated by

$$A^{-1} = \frac{W}{\det(A)},$$

where $W = (w_{ij})$ is the adjoint matrix of $A$. The elements of $W$ are

$$(w_{ji}) = (-1)^{i+j} \det \begin{pmatrix} a_{11} & \cdots & a_{1(j-1)} & a_{1(j+1)} & \cdots & a_{1p} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ a_{(i-1)1} & \cdots & a_{(i-1)(j-1)} & a_{(i-1)(j+1)} & \cdots & a_{(i-1)p} \\ a_{(i+1)1} & \cdots & a_{(i+1)(j-1)} & a_{(i+1)(j+1)} & \cdots & a_{(i+1)p} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ a_{p1} & \cdots & a_{p(j-1)} & a_{p(j+1)} & \cdots & a_{pp} \end{pmatrix},$$

which are the cofactors of A. To compute the cofactors w j i , one deletes column
j and row i of A, then computes the determinant for that reduced matrix, and then
multiplies by 1 if j + i is even or by −1 if it is odd. This computation is only feasible
for small matrices.
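As a small illustration of the cofactor formula (a sketch for the 2x2 case only, not how the built-in routines work internally):
> A = matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
> W = matrix(c(A[2, 2], -A[1, 2], -A[2, 1], A[1, 1]), 2, 2, byrow = TRUE)
> W / det(A)                  # inverse via adjoint matrix and determinant
     [,1] [,2]
[1,] -2.0  1.0
[2,]  1.5 -0.5
> solve(A)                    # the same result with solve()
     [,1] [,2]
[1,] -2.0  1.0
[2,]  1.5 -0.5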
Using the above definition, one can determine the inverse of a square matrix by
solving the system of linear equations (see Sect. 2.4.1) in (2.4) by employing the
function solve(A, b). In R this function can be used to solve a general system
of linear equations Ax = b. If one does not specify the right side b of the system of
equations, the solve() function computes the inverse of the square matrix A. The
125
following code computes the inverse of the square matrix A = 3 9 2 .
222

> A = matrix(c(1, 2, 5, 3, 9, 2, 2, 2, 2), # all elements


+ nrow = 3, ncol = 3, byrow = TRUE); A # matrix dimensions
[,1] [,2] [,3]
[1,] 1 2 5
[2,] 3 9 2
[3,] 2 2 2
> solve(A) # inverse of A
[,1] [,2] [,3]
[1,] -0.28 -0.12 0.82
[2,] 0.04 0.16 -0.26
[3,] 0.24 -0.04 -0.06
> A %*% solve(A) # check (2.4)
[,1] [,2] [,3]
[1,] 1.000000e+00 5.551115e-17 -1.110223e-16
[2,] -5.551115e-17 1.000000e+00 8.326673e-17
[3,] -5.551115e-17 1.804112e-16 1.000000e+00

For diagonal matrices, the inverse is $A^{-1} = (a_{ii}^{-1})$ if $a_{ii} \neq 0$ for all $i$.

Generalised inverse
In practice, we are often confronted with singular matrices, whose determinant is equal to zero. In this situation, the inverse can be given by a generalised inverse $A^{-}$ satisfying

$$A A^{-} A = A. \qquad (2.5)$$

Consider the singular matrix $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$. Its generalised inverse $A^{-}$ must satisfy (2.5),

$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} A^{-} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}. \qquad (2.6)$$

There are sometimes several A− which satisfy (2.5). The Moore–Penrose generalised
inverse (hereafter, just ‘generalised inverse’) is the most common type and was
developed by Moore (1920) and Penrose (1955). It is used to compute the ‘best fit’
solution to a system of linear equations that does not have a unique solution. Another
approach is to find the minimum (Euclidean) norm (see Sect. 2.1.5) solution to a
system of linear equations with several solutions. The Moore–Penrose generalised

inverse is defined and unique for all matrices with real or complex entries. It can be
computed using the singular value decomposition, see Press (1992).
In R, the generalised inverse of a matrix as defined in (2.5) can be computed with the function ginv() from the MASS package. With ginv() one obtains the generalised inverse of the matrix $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$, which is equal to $A^{-} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$.
> require(MASS)
> A = matrix(c(1, 0, 0, 0),
+ ncol = 2, nrow = 2); A # matrix from (2.6)
[,1] [,2]
[1,] 1 0
[2,] 0 0
> ginv(A) # generalised inverse
[,1] [,2]
[1,] 1 0
[2,] 0 0

The ginv() function can also be used for non-square matrices, like $A = \begin{pmatrix} 1 & 2 & 3 \\ 11 & 12 & 13 \end{pmatrix}$.
> require(MASS)
> A = matrix(c(1, 2, 3, 11, 12, 13),
+ nrow = 2, ncol = 3, byrow = TRUE); A # non-square matrix
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 11 12 13
> A.ginv = ginv(A); A.ginv # generalised inverse
[,1] [,2]
[1,] -0.63333333 0.13333333
[2,] -0.03333333 0.03333333
[3,] 0.56666667 -0.06666667

The condition (2.5) can be verified in R by the following code:


> A %*% A.ginv %*% A # first condition
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 11 12 13

This code shows that the solution for A− fulfills the condition AA− A = A.
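Since the Moore–Penrose inverse can be computed via the singular value decomposition, here is a minimal sketch using svd() for the non-square matrix A and the object A.ginv from the listings above (with a simple hard-coded tolerance; ginv() handles this more carefully):
> s = svd(A)                                # A = U D V'
> d.inv = ifelse(s$d > 1e-10, 1 / s$d, 0)   # invert the nonzero singular values
> A.pinv = s$v %*% diag(d.inv) %*% t(s$u)   # V D^- U'
> round(A.pinv - A.ginv, 10)                # agrees with ginv()
     [,1] [,2]
[1,]    0    0
[2,]    0    0
[3,]    0    0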

2.1.3 Eigenvalues and Eigenvectors

For a given basis of a vector space, a matrix A( p× p) can represent a linear function of
a p-dimensional vector space to itself. If this function is applied to a nonzero vector
and maps that vector to a multiple of itself, that vector is called an eigenvector γ and
the multiple is called the corresponding eigenvalue λ. Formally this can be written
as
Aγ = λγ.

In Chap. 8, the theory of eigenvectors, eigenvalues and spectral decomposition, presented in Sect. 2.1.4, becomes important in order to understand principal components analysis, factor analysis and other methods of dimension reduction. Definition 2.1 provides a formal description of an eigenvalue and the corresponding eigenvector.
Definition 2.1 Let V be any real vector space and L : V → V be a linear transfor-
mation. Then a nonzero vector γ ∈ V is called an eigenvector if, and only if, there
exists a scalar λ ∈ R such that L(γ) = Aγ = λγ. Therefore

λ is eigenvalue of A ⇔ det(A − λI) = 0.

If we consider $\lambda$ as an unknown variable, then $\det(A - \lambda I)$ is a polynomial of degree $p$ in $\lambda$, called the characteristic polynomial of $A$. Its coefficients can be computed from the matrix entries of $A$. Its roots $\lambda_1, \ldots, \lambda_p$, which might be complex, are the eigenvalues of $A$. The eigenvalue matrix $\Lambda$ is a diagonal matrix with elements $\lambda_1, \ldots, \lambda_p$. The vectors $\gamma_1, \ldots, \gamma_p$ are the eigenvectors that correspond to the eigenvalues $\lambda_1, \ldots, \lambda_p$, and the eigenvector matrix $P$ has columns $\gamma_1, \ldots, \gamma_p$. In the following example, the computation of the eigenvalues of a matrix of dimension 3 is shown.
Example 2.1 Consider the matrix $A = \begin{pmatrix} 2 & 0 & 1 \\ 0 & 3 & 1 \\ 0 & 6 & 2 \end{pmatrix}$ and the matrix $D = A - \lambda I$.

In order to obtain the eigenvalues of $A$, one has to solve for the roots $\lambda$ of the polynomial $\det(D) = 0$. For a three-dimensional matrix this looks like

$$\det(D) = c_0 + c_1 \lambda + c_2 \lambda^2 + c_3 \lambda^3,$$
$$c_0 = \det(A), \qquad c_1 = -\sum_{1 \le i < j \le 3} \left( a_{ii} a_{jj} - a_{ij} a_{ji} \right), \qquad c_2 = \sum_{1 \le i \le 3} a_{ii}, \qquad c_3 = -1.$$

Thus, the characteristic polynomial of $A$ is $-\lambda^3 + 7\lambda^2 - 10\lambda$. In this case, $A$ is singular: the intercept of the polynomial is equal to zero. Therefore, one eigenvalue is 0 and the other two are 2 and 5.
In R the eigenvalues and eigenvectors of a matrix A can be calculated using the
function eigen().
> A = matrix(c(2, 0, 1, 0, 3, 1, 0, 6, 2), # matrix A
+ nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 2 0 1
[2,] 0 3 1
[3,] 0 6 2
> Eigen = eigen(A) # eigenvectors and -values

> Eigen$values # eigenvalues


[1] 5 2 0
> Eigen$vectors # eigenvector matrix
[,1] [,2] [,3]
[1,] 0.2857143 1 -0.4285714
[2,] 0.4285714 0 -0.2857143
[3,] 0.8571429 0 0.8571429

Let γ2 = (1, 0, 0) be the second column of the eigenvector matrix P = (γ1 , γ2 , γ3 ).
Then it can be seen that
Aγ2 = 2γ2 .

This means that γ2 is the eigenvector corresponding to the eigenvalue λ2 = 2 of A.


After the eigenvalues of a matrix are found, it is easy to compute its trace or determinant. The sum of all its eigenvalues is its trace: $\operatorname{tr}(A) = \sum_{i=1}^{p} \lambda_i$. The product of its eigenvalues is its determinant: $\det(A) = \prod_{i=1}^{p} \lambda_i$.
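A quick check of these two identities for the matrix A from Example 2.1:
> A = matrix(c(2, 0, 1, 0, 3, 1, 0, 6, 2), nrow = 3, ncol = 3, byrow = TRUE)
> sum(eigen(A)$values)        # equals the trace
[1] 7
> sum(diag(A))
[1] 7
> prod(eigen(A)$values)       # equals the determinant
[1] 0
> det(A)
[1] 0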

2.1.4 Spectral Decomposition

The spectral decomposition of a matrix is a representation of that matrix in terms of its eigenvalues and eigenvectors.

Theorem 2.1 Let $A_{(p \times p)}$ be a matrix with real entries and let $\Lambda$ be its eigenvalue matrix and $P$ the corresponding eigenvector matrix. Then

$$A = P \Lambda P^{-1}. \qquad (2.7)$$

In R, one can use the function eigen() to compute eigenvalues and eigenvectors.
The eigenvalues are in the field named values and are sorted in decreasing order
(see the example above). Using the output of the function eigen(), the linear
independence of the eigenvectors can be checked for the above example by computing
the rank of the matrix P:
> A = matrix(c(2, 0, 1, 0, 3, 1, 0, 6, 2), # matrix A
+ nrow = 3, ncol = 3, byrow = TRUE); A
[,1] [,2] [,3]
[1,] 2 0 1
[2,] 0 3 1
[3,] 0 6 2
> Eigen = eigen(A) # eigenvectors and -values
> P = eigen(A)$vectors # eigenvector matrix
> L = diag(eigen(A)$values) # eigenvalue matrix
> qr(P)$rank # rank of P
[1] 3
> P %*% L %*% solve(P) # spectral decomposition
[,1] [,2] [,3]
[1,] 2 4.440892e-16 1
[2,] 0 3.000000e+00 1
[3,] 0 6.000000e+00 2

From this computation, it can be seen that P has full rank. The diagonal matrix can
be obtained by extracting the eigenvalues from the output of the function eigen().
It is possible to decompose the matrix A by (2.7) in R. The difference between A
and the result from the spectral decomposition in R is negligibly small.

2.1.5 Norm

There are two types of frequently used norms: the vector norm and the matrix norm.
The vector norm, which appears frequently in matrix algebra and numerical compu-
tation, will be introduced first. An extension of the vector norm is the matrix norm.

Definition 2.2 Let $V$ be a vector space, lying either in $\mathbb{R}^n$ or $\mathbb{C}^n$, and let $b$ be a scalar. Consider the vectors $x, y \in V$. Then a norm is a mapping $\|\cdot\| : V \to \mathbb{R}_0^{+}$ with the following properties:
1. $\|bx\| = |b| \, \|x\|$,
2. $\|x + y\| \le \|x\| + \|y\|$,
3. $\|x\| \ge 0$, where $\|x\| = 0$ if and only if $x = 0$.

Let $x = (x_1, \ldots, x_n)^{\top} \in \mathbb{R}^n$, $k \ge 1$ and $k \in \mathbb{R}$. Then a general norm is the $L_k$ norm, which can be represented as follows:

$$\|x\|_k = \left( \sum_{i=1}^{n} |x_i|^k \right)^{1/k}.$$

There are several special norms, depending on the value of $k$; some are listed below:

Manhattan norm: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$,  (2.8)
Euclidean norm: $\|x\|_2 = \left( \sum_{i=1}^{n} |x_i|^2 \right)^{1/2} = \sqrt{x^{\top} x}$,  (2.9)
infinity norm: $\|x\|_{\infty} = \max\{|x_i|, \; i = 1, \ldots, n\}$,  (2.10)
Frobenius norm: $\|x\|_F = \sqrt{x^{\top} x}$.  (2.11)

The most frequently used norms are the Manhattan and Euclidean norms. For vector norms, the Euclidean and Frobenius norms coincide. The infinity norm selects the maximum absolute value of the elements of $x$, and the maximum norm just the maximum value.
In R the function norm() can return the norms from (2.8) to (2.11). The argument
type specifies which norm is returned.

> x = matrix(c(2, 1, 2), nrow = 3, ncol = 1) # vector x


> norm(x, type = c("O")) # Manhattan norm
[1] 5
> norm(x, type = c("2")) # Euclidean norm
[1] 3
> norm(x, type = c("I")) # infinity norm
[1] 2
> norm(x, type = c("F")) # Frobenius norm
[1] 3

The object x has to be of class matrix in R to compute all norms.


Definition 2.3 Let $U^{n \times p}$ be the set of $(n \times p)$ matrices and $a$ be a scalar, which are either real or complex. $U^{n \times p}$ is a vector space equipped with matrix addition and scalar multiplication. Let $A, B \in U^{n \times p}$. Then a matrix norm is a mapping $\|\cdot\| : U^{n \times p} \to \mathbb{R}_0^{+}$ with the following properties:

1. $\|aA\| = |a| \, \|A\|$,
2. $\|A + B\| \le \|A\| + \|B\|$,
3. $\|A\| \ge 0$, where $\|A\| = 0$ if and only if $A = 0$.
In R, the function norm() can be applied to vectors and matrices in the same fashion. The one norm, the infinity norm, the Frobenius norm, the maximum norm and the spectral norm for matrices are given by

one norm: $\|A\|_1 = \max_{1 \le j \le p} \sum_{i=1}^{n} |a_{ij}|$,
spectral/Euclidean norm: $\|A\|_2 = \sqrt{\lambda_{\max}(A^{C} A)}$,
infinity norm: $\|A\|_{\infty} = \max_{1 \le i \le n} \sum_{j=1}^{p} |a_{ij}|$,
Frobenius norm: $\|A\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{p} |a_{ij}|^2}$,
maximum norm: $\|A\|_M = \max_{i,j} |a_{ij}|$,

where $A^{C}$ is the conjugate transpose of $A$. The next code shows how to compute the first four of these norms with the function norm() for the matrix $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$.
> A = matrix(c(1, 2, 3, 4),
+ ncol = 2, nrow = 2, byrow = TRUE) # matrix A
> norm(A, type = c("O")) # one norm
[1] 6 # maximum of column sums
> norm(A, type = c("2")) # Euclidean norm
[1] 5.464986
> norm(A, type = c("I")) # infinity norm
[1] 7 # maximum of row sums
> norm(A, type = c("F")) # Frobenius norm
[1] 5.477226

Note that the Frobenius norm equals the square root of the trace of the product of the matrix with its conjugate transpose, $\sqrt{\operatorname{tr}(A^{C} A)}$. The spectral or Euclidean norm is the square root of the maximum eigenvalue of $A^{C} A$.
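A short numerical check of this characterisation of the spectral norm:
> A = matrix(c(1, 2, 3, 4), ncol = 2, nrow = 2, byrow = TRUE)
> sqrt(max(eigen(t(A) %*% A)$values))   # square root of the largest eigenvalue
[1] 5.464986
> norm(A, type = c("2"))                # spectral norm
[1] 5.464986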

2.2 Numerical Integration

This section discusses numerical methods in R for integrating a function. Some integrals cannot be computed analytically, and numerical methods should be used.

2.2.1 Integration of Functions of One Variable

Not every function $f \in C[a, b]$ has an indefinite integral with an analytical representation. Therefore, it is not always possible to analytically compute the area under a curve. An important example is

$$\int \exp(-x^2)\, dx. \qquad (2.12)$$

There exists no analytical, closed-form representation of (2.12). Therefore, the corresponding definite integral has to be computed numerically using numerical integration, also called ‘quadrature’. The basic idea behind numerical integration lies in approximating the function by a polynomial and subsequently integrating it, using, for example, the Newton–Cotes rule.
Newton–Cotes rule
If a function $f \in C[a, b]$ and nodes $a = x_0 < x_1 < \cdots < x_n = b$ are given, one looks for a polynomial $p_n(x) = \sum_{j=0}^{n} c_j x^{j} \in \mathcal{P}_n$, where $\mathcal{P}_n$ denotes the space of polynomials of degree at most $n$, satisfying the condition

$$p_n(x_k) = f(x_k), \quad \text{for all } k \in \{0, \ldots, n\}. \qquad (2.13)$$

To construct a polynomial that satisfies this condition, the following basis polynomials are used:

$$L_k(x) = \prod_{i=0,\, i \neq k}^{n} \frac{x - x_i}{x_k - x_i}.$$

This leads to the so-called Lagrange polynomial, which satisfies the condition in (2.13) (assuming $0^0 = 1$):

$$p_n(x) = \sum_{k=0}^{n} f(x_k) L_k(x). \qquad (2.14)$$

Let $I(f) = \int_a^b f(x)\, dx$ be the exact integration operator applied to a function $f \in C[a, b]$. Then define $I_n(f)$ as the approximation of $I(f)$ using (2.14) as an approximation for $f$:

$$I_n(f) = \int_a^b p_n(x)\, dx,$$

which can be restated using weights for the different values of $f$:

$$I_n(f) = (b - a) \sum_{k=0}^{n} f(x_k)\, \alpha_k, \quad \text{with weights} \quad \alpha_k = \frac{1}{b - a} \int_a^b L_k(x)\, dx. \qquad (2.15)$$

By construction, (2.15) is exact for every $f \in \mathcal{P}_n$. Suppose the nodes $x_k$ are equidistant in $[a, b]$, i.e. $x_k = a + kh$, where $h = (b - a) n^{-1}$. Then (2.15) is the (closed) Newton–Cotes rule. The weights $\alpha_k$ can be explicitly computed up to $n = 7$. Starting from $n = 8$, negative weights occur and the Newton–Cotes rule can no longer be applied. The trapezoidal rule is an example of the Newton–Cotes rule.

Example 2.2 For $n = 1$ and $I(f) = \int_a^b f(x)\, dx$, the nodes are given as follows: $x_0 = a$, $x_1 = b$. The weights can be computed explicitly by transforming the integral using two substitutions:

$$\alpha_k = \frac{1}{b - a} \int_a^b L_k(x)\, dx = \int_0^1 \prod_{i=0,\, i \neq k}^{n} \frac{t - t_i}{t_k - t_i}\, dt = \frac{1}{n} \int_0^n \prod_{i=0,\, i \neq k}^{n} \frac{s - i}{k - i}\, ds.$$

Then the weights for $n = 1$ are $\alpha_0 = \frac{1}{2}$ and $\alpha_1 = \frac{1}{2}$. So the Newton–Cotes rule $I_1(f)$ is given by the formula for the area of a trapezoid:

$$I_1(f) = (b - a)\, \frac{f(a) + f(b)}{2}.$$

In R, the trapezoidal rule is implemented within the package caTools. There, the function trapz(x, y) is used with a sorted vector x that contains the x-axis values and a vector y with the corresponding y-axis values. This function uses a summed version of the trapezoidal rule, where $[a, b]$ is split into $n$ equidistant intervals. For all $k \in \{1, \ldots, n\}$, the integral $I_k(f)$ is computed according to the trapezoidal rule; this is the so-called extended trapezoidal rule:

$$I_k(f) = \frac{b - a}{2n} \left[ f\{a + (k - 1) n^{-1} (b - a)\} + f\{a + k n^{-1} (b - a)\} \right].$$

Therefore, the whole integral $I(f)$ is approximated by

$$I(f) \approx \sum_{k=1}^{n} I_k(f) = \sum_{k=1}^{n} \frac{b - a}{2n} \left[ f\{a + (k - 1) n^{-1} (b - a)\} + f\{a + k n^{-1} (b - a)\} \right].$$

For example, consider the integral of the cosine function on $[-\pi/2, \pi/2]$ and split the interval into 10 subintervals, where the trapezoidal rule is applied:
> require(caTools)
> x = (-5:5) * (pi / 2) / 5 # set subintervals
> intcos = trapz(x, cos(x)); intcos # integration
[1] 1.983524
> abs(intcos - 2) # absolute error
[1] 0.01647646

The integral of the cosine function on $[-\pi/2, \pi/2]$ is exactly 2, so the absolute error is almost 0.02. It can be shown that the error of the trapezoidal rule on a single interval is of order $O(h^3 \|f''\|_{\infty})$, which involves the generally unknown second derivative of the function. If a Newton–Cotes rule with more nodes is used, the integrand will be approximated by a polynomial of higher order. Therefore, the error could diminish if the integrand is very smooth, so that it can be approximated well by a polynomial.
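A quick empirical check of this error behaviour: with a finer grid, the trapezoidal approximation improves markedly.
> x = (-50:50) * (pi / 2) / 50          # 100 subintervals instead of 10
> abs(trapz(x, cos(x)) - 2)             # the absolute error shrinks
[1] 0.0001644961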
Gaussian quadrature
In R, the function integrate() uses an integration method that is based on Gaussian quadrature (the exact method is called the Gauss–Kronrod quadrature, see Press (1992) for further details). The Gaussian method uses non-predetermined nodes $x_1, \ldots, x_n$ to approximate the integral, so that polynomials of higher order can be integrated more precisely than using the Newton–Cotes rule. For $n$ nodes, it uses a polynomial $p(x) = \sum_{j=1}^{2n} c_j x^{j-1}$ of order $2n - 1$.
Definition 2.4 A method of numerical integration for a function $f : [a, b] \to \mathbb{R}$ with the formula

$$I(f) = \int_a^b f(x)\, dx \approx \int_a^b w(x)\, p(x)\, dx \approx I_n^G(f) = \sum_{k=1}^{n} f(x_k)\, \alpha_k,$$

and $n$ nodes $x_1, \ldots, x_n$ is called Gaussian quadrature if an arbitrary weighting function $w(x)$ and a polynomial $p(x)$ of degree $2n - 1$ exactly approximate $f(x)$, such that $f(x) = w(x)\, p(x)$.
Consider the simplest case, where $w(x) = 1$. Then the method of undetermined coefficients leads to the following nonlinear system of equations:

$$\frac{b^j - a^j}{j} = \sum_{k=1}^{n} \alpha_k x_k^{j-1}, \quad \text{for } j = 1, \ldots, 2n.$$

A total of $2n$ equations is used to find the nodes $x_1, \ldots, x_n$ and the coefficients $\alpha_1, \ldots, \alpha_n$.
Consider the special case with two nodes (x1 , x2 ) and two weights (α1 , α2 ). The
particular polynomial p(x) is of order 2 · n − 1 = 3, where the number of nodes
is n. The integral is approximated by α1 f (x1 ) + α2 f (x2 ) and it is assumed that
f (x) = p(x). Therefore the following two equations can be derived:

I_2^G(f) = c_1(\alpha_1 + \alpha_2) + c_2(\alpha_1 x_1 + \alpha_2 x_2) + c_3(\alpha_1 x_1^2 + \alpha_2 x_2^2) + c_4(\alpha_1 x_1^3 + \alpha_2 x_2^3),

I_2^G(f) = c_1(b - a) + c_2 \frac{b^2 - a^2}{2} + c_3 \frac{b^3 - a^3}{3} + c_4 \frac{b^4 - a^4}{4}.

The coefficients of each c_j in the two expressions are set equal to each other, because the c_j are arbitrary. The system of four nonlinear equations is

b - a = \alpha_1 + \alpha_2;
1/2 \cdot (b^2 - a^2) = \alpha_1 x_1 + \alpha_2 x_2;
1/3 \cdot (b^3 - a^3) = \alpha_1 x_1^2 + \alpha_2 x_2^2;
1/4 \cdot (b^4 - a^4) = \alpha_1 x_1^3 + \alpha_2 x_2^3.

For simplicity, in most cases the interval [−1, 1] is considered. It is possible to extend these results to the more general interval [a, b]. To apply the results for [−1, 1] to the interval [a, b], one uses

\int_a^b f(x)\,dx = \frac{b-a}{2} \int_{-1}^{1} f\left(\frac{b-a}{2}\,x + \frac{a+b}{2}\right) dx.

For the special case w(x) = 1 and the interval [−1, 1], the procedure is called Gauss–Legendre quadrature. The nodes are the roots of the Legendre polynomials

P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n}\{(x^2 - 1)^n\}.

The weights \alpha_k can be calculated by

\alpha_k = \frac{2}{(1 - x_k^2)\{P_n'(x_k)\}^2}.
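To make the construction concrete, the following minimal sketch applies the two-node Gauss–Legendre rule to the cosine example: the nodes ±1/√3 are the roots of P_2, both weights equal 1 on [−1, 1], and the interval transformation above maps the nodes to [a, b].
> gauss2 = function(f, a, b){                  # two-node Gauss-Legendre rule
+   x = c(-1, 1) / sqrt(3)                     # roots of P_2 on [-1, 1]
+   w = c(1, 1)                                # corresponding weights
+   (b - a) / 2 * sum(w * f((b - a) / 2 * x + (a + b) / 2))
+ }
> gauss2(cos, -pi / 2, pi / 2)                 # approx. 1.9358 with only 2 evaluations (exact value: 2)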

In the following example, we illustrate the process of numerical integration using the
function integrate(). One can specify the following arguments: f (integrand),
a (lower limit) and b (upper limit), subdivisions (number of subintervals) and
arguments rel.tol, as well as abs.tol for the relative and absolute accuracy
requested. Consider again the cosine function on the interval [−π/2, π/2].
> require(stats)
> integrate(cos, # integrand
+ lower = -pi / 2, # lower integration limit
+ upper = pi / 2) # upper integration limit
2 with absolute error < 2.2e-14

The output of the integrate() function delivers the computed value of the definite
integral and an upper bound on the absolute error. In this example, the absolute error
is smaller than 2.2 · 10−14 . Therefore, the integrate() function is much more
accurate for the cosine function than the trapz() function used in a previous
example.

2.2.2 Integration of Functions of Several Variables

Repeated quadrature method


Similar to numerical integration in the context of one variable, in the case of more variables, an integration can be expressed as follows:

\int_{a_1}^{b_1} \cdots \int_{a_p}^{b_p} f(x_1, . . . , x_p)\,dx_1 . . . dx_p \approx \sum_{i_1=1}^{n} \cdots \sum_{i_p=1}^{n} W_{i_1} \cdots W_{i_p} f(x_{i_1}, . . . , x_{i_p}), \quad (2.16)

where D_j = [a_j, b_j], j ∈ {1, . . . , p}, is the integration region in the j-th dimension, (x_{i_1}, . . . , x_{i_p}) is a p-dimensional node with i_j ∈ {1, . . . , n}, and W_{i_j} is the coefficient used as the weight. The problem with the repeated quadrature is that for (2.16) one needs to evaluate n^p terms, which may lead to computational difficulties.
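As an illustration of such a product rule, and of the n^p growth of the effort (n = 20 nodes per dimension already mean n² = 400 function evaluations for p = 2), the one-dimensional trapz() rule can be applied dimension by dimension. This is a minimal sketch for the integrand x²y³ on [0, 1]², whose exact integral is 1/12.
> require(caTools)
> n = 20
> x = seq(0, 1, length.out = n)                     # nodes in each dimension
> y = seq(0, 1, length.out = n)
> fxy = outer(x, y, function(x, y){x^2 * y^3})      # n x n function evaluations
> gy = apply(fxy, 1, function(row){trapz(y, row)})  # integrate over y for each x
> trapz(x, gy)                                      # approx. 0.0837; exact value 1/12 = 0.0833...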
Adaptive method
The adaptive method in the context of multiple integrals divides the integration region D ⊂ R^p into subregions S_j ⊂ R^p. For each subregion S_j, specific rules are applied to approximate the integral. To improve the approximation, consider the error E_j for each subregion. If the overall error \sum_j E_j is smaller than a predefined tolerance
level, the algorithm stops. But if this condition is not met, the highest error is selected
and the corresponding region is split into two subregions. Then the rules are applied
again to each subregion until the tolerance level is met. For a more detailed description
of this algorithm, see van Dooren and de Ridder (1976) and Stroud (1971).
Jarle Berntsen and Genz (1991) improved the reliability of the algorithm, including
a strategy for the selection of subregions, error estimation and parallelisation of the
computation.
Example 2.3 Integrate the function of two variables

\int_0^1 \int_0^1 x^2 y^3\,dx\,dy. \quad (2.17)

Analytical computation of this integral yields 1/12. The surface z = x 2 y 3 for the
interval [0, 1]2 is depicted in Fig. 2.1. For the computation of multiple integrals, the
R package R2Cuba is used, which is introduced in Hahn (2013). It includes four
different algorithms for multivariate integration, where the function cuhre uses the
adaptive method.
> require(R2Cuba)
> integrand = function(arg){ # construct the integrand
+ x = arg[1] # function argument x
+ y = arg[2] # function argument y
+ (x^2) * (y^3) # function
+ }
> cuhre(integrand, # adaptive method
+ ncomp = 1, # number of components
+ ndim = 2, # dimension of the integral
+ lower = rep(0, 2), # lower integration bound
+ upper = rep(1, 2), # upper integration bound
+ rel.tol = 1e-3, # relative tolerance level
+ abs.tol = 1e-12, # absolute tolerance level
+ flags = list(verbose = 1)) # controls output
Iteration 1: 65 integrand evaluations so far
[1] 0.0833333 +- 1.15907e-015 chisq 0 (0 df)
Iteration 2: 195 integrand evaluations so far
[1] 0.0833333 +- 1.04612e-015 chisq 0.10352e-05 (1 df)
integral: 0.08333333 (+-1.3e-15)
nregions: 2; number of evaluations: 195; probability: 0.00623341

The output shows that the adaptive algorithm carried out two iteration steps. Only two
subregions have been used for the computation, which is stated by the output value
nregions. The output value neval states that the number of evaluations is 195. To
make a statement about the reliability of the process, consider the probability value. A probability of 0 for the χ² distribution (see Sect. 4.4.1) means that the null hypothesis can be rejected. The null hypothesis states that the absolute error estimate is not a reliable estimate of the true integration error. The approximation of the integral I is 0.08333, which is close to the result of the analytical computation, 1/12. For a more detailed discussion of the output, refer to Hahn (2013).

Fig. 2.1 Plot of the multivariate function (2.17). BCS_Integrand
Example 2.4 Evaluate the integral with three variables

\int_0^1 \int_0^1 \int_0^1 \sin(x) \log(1 + 2y) \exp(3z)\,dx\,dy\,dz, \quad (2.18)

> require(R2Cuba)
> integrand = function(arg){ # construct the integrand
+ x = arg[1] # function argument x
+ y = arg[2] # function argument y
+ z = arg[3] # function argument z
+ sin(x) * log(1 + 2 * y) * exp(3 * z) # function
+ }
> cuhre(integrand, # adaptive method
+ ncomp = 1, # number of components
+ ndim = 3, # dimension of the integral
+ lower = rep(0, 3), # lower bound of interval
+ upper = rep(1, 3), # upper bound of interval
+ rel.tol = 1e-3, # relative tolerance level
+ abs.tol = 1e-12, # absolute tolerance level
+ flags = list(verbose = 0)) # controls output
integral: 1.894854 (+-4.1e-07)
nregions: 2; number of evaluations: 381; probability: 0.04784836

For the function of three variables (2.18), an analytical computation yields the value:

\int_0^1 \sin(x)\,dx \int_0^1 \log(1 + 2y)\,dy \int_0^1 \exp(3z)\,dz = \{1 - \cos(1)\} \cdot \frac{1}{2}\,[3\{\log(3) - 1\} + 1] \cdot \frac{1}{3}\,\{\exp(3) - 1\} = 1.89485.
The value provided by the adaptive method is very close to the exact value.
Monte Carlo method
For a multiple integral I of the function of p variables f(x_1, . . . , x_p) with lower bounds a_1, . . . , a_p and upper bounds b_1, . . . , b_p, the integral is given by

I(f) = \int_{a_1}^{b_1} \cdots \int_{a_p}^{b_p} f(x_1, . . . , x_p)\,dx_1 . . . dx_p = \int \cdots \int_D f(x)\,dx,

where x stands for a vector (x_1, . . . , x_p) and D for the integration region. Let X be a random vector (see Chap. 6), with each component X_j of X uniformly distributed (Sect. 4.2) in [a_j, b_j]. Then the algorithm of Monte Carlo multiple integration can be described as follows. In the first step, n points of dimension p are randomly drawn from the region D, such that

(x_{11}, . . . , x_{1p}), . . . , (x_{n1}, . . . , x_{np}).

In the second step, the p-dimensional volume is estimated by V = \prod_{j=1}^{p}(b_j - a_j) and the integrand f is evaluated for all n points. In the third step, the integral I can be estimated using a sample moment function,

I(f) \approx \hat{I}(f) = n^{-1} V \sum_{i=1}^{n} f(x_{i1}, . . . , x_{ip}).

The absolute error can be approximated as follows:

\varepsilon = |I(f) - \hat{I}(f)| \approx n^{-1/2} \sqrt{V I(f^2) - I^2(f)}.
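Before turning to the package implementation, the three steps can also be carried out directly; the following is a minimal sketch for the integral (2.17).
> set.seed(1)
> n = 100000
> x = runif(n); y = runif(n)               # step 1: n random points in D = [0, 1]^2
> V = 1                                    # step 2: volume of [0, 1]^2
> V * mean(x^2 * y^3)                      # step 3: sample moment estimator, close to 1/12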

The Monte Carlo method is applied to example (2.17) via the function vegas.
> require(R2Cuba)
> integrand = function(arg){ # construct the integrand
+ x = arg[1] # function argument x
+ y = arg[2] # function argument y
+ (x^2) * (y^3) # function
+ }
> vegas(integrand, # Monte Carlo method
+ ncomp = 1, # number of components
+ ndim = 2, # dimension of the integral
+ lower = rep(0, 2), # lower integration bound
+ upper = rep(1, 2), # upper integration bound
+ rel.tol = 1e-3, # relative tolerance level
+ abs.tol = 1e-12, # absolute tolerance level
+ flags = list(verbose = 0)) # controls output
integral: 0.08329357 (+-7.5e-05)
number of evaluations: 17500; probability: 0.1201993

The outputs of the functions vegas and cuhre are almost identical. Additional
output information can be obtained by setting the argument verbose to one. Then
the output shows that the Monte Carlo algorithm executed 7 iterations and 17 500
evaluations of the integrand. The approximation of the integral I is 0.0832, which is close to the exact value 1/12. For the function (2.18), the Monte Carlo algorithm looks
as follows:
> require(R2Cuba)
> integrand = function(arg){ # construct the integrand
+ x = arg[1] # function argument x
+ y = arg[2] # function argument y
+ z = arg[3] # function argument z
+ sin(x) * log(1 + 2 * y) * exp(3 * z) # function
+ }
> vegas(integrand, # Monte Carlo method
+ ncomp = 1, # number of components
+ ndim = 3, # dimension of the integral
+ lower = rep(0, 3), # lower integration bound
+ upper = rep(1, 3), # upper integration bound
+ rel.tol = 1e-3, # relative tolerance level
+ abs.tol = 1e-12, # absolute tolerance level
+ flags = list(verbose = 0)) # controls output
integral: 1.894488 (+-0.0016)
number of evaluations: 13500; probability: 0.1099108

The performance of the adaptive method is again superior to that of the Monte Carlo
method, which gives 1.894488 as the value of the integral.

2.3 Differentiation

The analytical computation of derivatives may be impossible if the function is only


given indirectly (for example by an algorithm) and can be evaluated only point-wise.
Therefore, it is necessary to use numerical methods. Before presenting some numer-
ical methods for differentiation, it will first be shown how to compute the derivative analytically in R.

2.3.1 Analytical Differentiation

To calculate the derivative of a one variable function in R, the function D(expr,


name) is used. For expr the function is inserted (as an object of mode
expression) and name identifies the variable with respect to which the derivative
will be computed. Consider the following example:
> f = expression(3 * x^3 + x^2)
> D(f,"x")
3 * (3 * x^2) + 2 * x

The function D() returns an argument of type call (see help(call) for further
information) and one can therefore recursively compute higher order derivatives. For
example, consider the second derivative of 3x 3 + x 2 .
> D(D(f,"x"),"x")
3 * (3 * (2 * x)) + 2

To compute higher order derivatives, it can be useful to define a recursive function.


> DD = function(expr, name, order = 1){
+ if(order < 1) stop("’order’ must be >= 1")# warning message
+ if(order == 1) D(expr, name) # first derivative
+ else DD(D(expr, name), name, order - 1) # recursive call with order - 1
+ }

This function replaces the initial function with its first derivative until the argument
order is reduced to one. Then the third derivative for 3x 3 + x 2 can be computed with
this function.
> DD(f,"x", order = 3)
3 * (3 * 2)

The gradient of a function can also be computed using the function D().
Definition 2.5 Let f : R^n → R be a differentiable function and x = (x_1, . . . , x_n)^⊤ ∈ R^n. Then the vector

\nabla f(x) \stackrel{def}{=} \left(\frac{\partial f}{\partial x_1}(x), . . . , \frac{\partial f}{\partial x_n}(x)\right)^⊤

is called the gradient of f at point x.


If f maps its arguments to a multidimensional space, then we consider the Jacobian
matrix as a generalisation of the gradient.
Definition 2.6 Let F : R^n → R^m be a differentiable function with F = (f_1, . . . , f_m)^⊤ and the coordinates x = (x_1, . . . , x_n)^⊤ ∈ R^n. Then the Jacobian matrix at a point x ∈ R^n is defined as follows:

J_F(x) \stackrel{def}{=} \left(\frac{\partial f_i(x)}{\partial x_j}\right)_{i=1,...,m;\; j=1,...,n}.

If one is interested in the second derivatives of a function, the Hessian matrix is


important.
Definition 2.7 Let f : R^n → R be twice continuously differentiable. Then the Hessian matrix of f at a point x is defined as follows:

H_f(x) \stackrel{def}{=} \left(\frac{\partial^2 f}{\partial x_i \partial x_j}(x)\right)_{i,j=1,...,n}. \quad (2.19)

Now consider the function f : R2 → R that maps x1 and x2 coordinates to the square
of their Euclidean norm.
> f = expression(x^2 + y^2) # function
> grad = c(D(f,"x"), D(f,"y")) # gradient vector
> grad
[[1]]
2 * x

[[2]]
2 * y

If it is necessary to have the gradient as a function that can be evaluated, the func-
tion deriv(f, name, function.arg = NULL, hessian = FALSE)
should be used. The function argument f is the function (as an object of mode
expression) and the argument name identifies the vector with respect to which
the derivative will be computed. Furthermore, the argument function.arg specifies the parameters of the returned function and hessian indicates whether the second derivatives should be calculated. When function.arg is not specified,


the return value of deriv() is an expression and not a function. As an example
of the use of the function deriv(), consider the above function f.
> eucld2 = deriv(f,
+          name = c("x","y"),          # variable names
+          function.arg = c("x","y"))  # arguments for a function return
> eucld2(2, 2)
[1] 8
attr(,"gradient")
x y
[1,] 4 4

The function eucld2(x, y) delivers the value of f at (x, y) and, as an attribute


(see help(attr)), the gradient of f evaluated at (x, y). If only the evaluated
gradient at (x, y) should be returned, the function attr(x, which) should
be used, where x is an object and which a non-empty character string specifying
which attribute is to be accessed.
> attr(eucld2(2, 2),"gradient")
x y
[1,] 4 4

If the option hessian is set to TRUE, the Hessian matrix at a point (x, y) can be retrieved through the call attr(eucld2(2, 2),"hessian").
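A minimal illustration with the function f from above (the object name eucld2h is only chosen here for the example):
> eucld2h = deriv(f, name = c("x","y"),
+                 function.arg = c("x","y"), hessian = TRUE)
> attr(eucld2h(2, 2),"hessian")   # second derivatives of x^2 + y^2: 2 on the diagonal, 0 otherwise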

2.3.2 Numerical Differentiation

To develop numerical methods for determining the derivatives of a function at a point


x, one uses the Taylor expansion

f(x + h) = f(x) + h f'(x) + \frac{h^2}{2!} f''(x) + \frac{h^3}{3!} f'''(x) + O(h^4). \quad (2.20)

The representation in (2.20) is valid only if the fourth derivative of f exists and is bounded on [x, x + h]. If the Taylor expansion is truncated after the linear term, then (2.20) can be solved for f'(x):

f'(x) = \frac{f(x + h) - f(x)}{h} + O(h). \quad (2.21)

Therefore an approximation for the derivative at point x could be

f'(x) \approx \frac{f(x + h) - f(x)}{h}. \quad (2.22)
Another, more accurate, method uses the Richardson (1911) extrapolation. Redefine the expression in (2.22) as g(h) = \{f(x + h) - f(x)\}/h. Then (2.21) can be written as

f'(x) = g(h) + k_1 h + k_2 h^2 + k_3 h^3 + \cdots, \quad (2.23)

where k_1, k_2, k_3, . . . represent constant terms involving the derivatives at point x. Since Taylor's theorem holds for all positive h, one can replace h by h/2:

f'(x) = g\left(\frac{h}{2}\right) + k_1 \frac{h}{2} + k_2 \frac{h^2}{4} + k_3 \frac{h^3}{8} + \cdots. \quad (2.24)

Now (2.23) can be subtracted from (2.24) multiplied by two. Then the term involving k_1 is eliminated:

f'(x) = 2g\left(\frac{h}{2}\right) - g(h) + k_2\left(\frac{h^2}{2} - h^2\right) + k_3\left(\frac{h^3}{4} - h^3\right) + \cdots.

Therefore f'(x) can be rewritten as follows:

f'(x) = 2g\left(\frac{h}{2}\right) - g(h) + O(h^2).
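A minimal numerical illustration of the two formulas, here for f = exp at x = 1 with h = 0.1; both results should be compared with the exact derivative exp(1) ≈ 2.7183.
> g = function(f, x, h){(f(x + h) - f(x)) / h}      # forward difference (2.22)
> h = 0.1
> g(exp, 1, h)                                      # error of order O(h), about 2.8588
> 2 * g(exp, 1, h / 2) - g(exp, 1, h)               # one Richardson step, error O(h^2), about 2.7159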

This process can be continued to obtain formulae of higher order. In R, the pack-
age numDeriv provides some functions that use these methods to differentiate
a function numerically. For example, the function grad() calculates a numerical
approximation to the gradient of func at the point x. The argument method can be
“simple” or “Richardson”. If the method argument is simple, a formula as
in (2.22) is applied. Then only the element eps of method.args is used (equivalent to the above h in (2.22)). The method “Richardson” uses the Richardson extrapolation. Consider the function f(x_1, x_2, x_3) = \sqrt{x_1^2 + x_2^2 + x_3^2}, which has the gradient

\nabla f(x) = \left(\frac{x_1}{\sqrt{x_1^2 + x_2^2 + x_3^2}}, \frac{x_2}{\sqrt{x_1^2 + x_2^2 + x_3^2}}, \frac{x_3}{\sqrt{x_1^2 + x_2^2 + x_3^2}}\right).

The gradient of f represents the normalised coordinates of a vector with respect to


the Euclidean norm. The evaluation of the gradient using grad(), e.g. at the point
(1,0,0), would be calculated by
> require(numDeriv)
> f = function(x){sqrt(sum(x^2))}
> grad(f,
+ x = c(1,0,0), # point at which to compute the gradient
+ method ="Richardson") # method to use for the approximation
[1] 1 0 0
> grad(f,
+ x = c(1,0,0), # point at which to compute the gradient
+ method ="simple") # method to use for the approximation
[1] 1e+00 5e-05 5e-05

It could also be interesting to compute numerically the Jacobian or the Hessian matrix
of a function F : Rn → Rm .
In R, the function jacobian(func,x,...) can be used to compute the
Jacobian matrix of a function func at a point x. As with the function grad(), the
function jacobian() uses the Richardson extrapolation by default. Consider the
following example, where the Jacobian matrix of f (x) = {sin(x1 + x2 ), cos(x1 +
x2 )} at the point (0, 2π) is computed:
> require(numDeriv)
> f1 = function(x){c(sin(sum(x)), cos(sum(x)))}
> jacobian(f1, x = c(0, 2 * pi))
[,1] [,2]
[1,] 1 1
[2,] 0 0

The Hessian matrix is symmetric and can be computed in R with


hessian(func, x, ...). For example, consider f(x_1, x_2, x_3) = \sqrt{x_1^2 + x_2^2 + x_3^2} as above, which maps the coordinates of a vector to their Euclidean norm. The fol-
lowing computation provides the Hessian matrix at the point (0, 0, 0).
> f = function(x){sqrt(sum(x^2))}
> hessian(f, c(0, 0, 0))
[,1] [,2] [,3]
[1,] 194419.75 -56944.23 -56944.23
[2,] -56944.23 194419.75 -56944.23
[3,] -56944.23 -56944.23 194419.75

From the definition of the Euclidean norm, it would make sense for f to have a
minimum at (0, 0, 0). The above information can be used to check whether f has a
local minimum at (0, 0, 0). In order to check this, two conditions have to be fulfilled.
The gradient at (0, 0, 0) has to be the zero vector and the Hessian matrix should be
positive definite (see Canuto and Tabacco 2010 for further information on the calcu-
lation of local extreme values using the Hessian matrix). The second condition can
be restated by using the fact that a positive definite matrix has only positive eigenval-
ues. Therefore, the second condition can be checked by computing the eigenvalues
of the above Hessian matrix and the first condition can be checked using the grad()
function.
> f = function(x){sqrt(sum(x^2))}
> grad(f, x = c(0, 0, 0)) # gradient at the
[1] 0 0 0 # optimum point
> hessm = hessian(f, x = c(0, 0, 0)) # Hessian matrix
> eigen(hessm)$values # eigenvalues
[1] 251364.0 251364.0 80531.3

This output shows that the gradient at (0, 0, 0) is the zero vector and the eigenvalues
are all positive. Therefore, as expected, the point (0, 0, 0) is a local minimum of f .

2.3.3 Automatic Differentiation

For a function f : Rn → Rm , automatic differentiation, which is also called algo-


rithmic differentiation or computational differentiation, is a technique employed to
evaluate derivatives based on the chain rule. As the derivatives of the elementary
functions, such as exp, log, sin, cos, etc., are already known, the derivative of f can
be automatically assembled from these known elementary partial derivatives by
employing the chain rule.
Automatic differentiation is different from two other methods of differentiation,
symbolic differentiation and numerical differentiation.
The main difference between automatic differentiation and symbolic differen-
tiation is that the latter focuses on the symbolic expression of formulae and the
former concentrates on evaluation. The disadvantages of symbolic differentiation are that it takes up too much memory for the computation and generates unnecessary expressions associated with the computation. Consider, for example, the symbolic differentiation of

f(x) = \prod_{i=1}^{10} x_i = x_1 \cdot x_2 \cdots x_{10}.

The corresponding gradient in symbolic style is

\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, . . . , \frac{\partial f}{\partial x_{10}}\right)
            = (x_2 \cdot x_3 \cdots x_{10},\; x_1 \cdot x_3 \cdots x_{10},\; . . . ,\; x_1 \cdot x_2 \cdots x_9).

If the number of variables becomes large, then the expression will use a tremendous
amount of memory and have a very tedious representation.
In automatic differentiation, all arguments of the function are redefined as dual numbers, x_i + x_i'\varepsilon, where \varepsilon has the property that \varepsilon^2 = 0. The change in x_i is x_i'\varepsilon, for all i. Therefore, automatic differentiation for this function looks like

f(x_1 + x_1'\varepsilon, . . . , x_{10} + x_{10}'\varepsilon) = \prod_{i=1}^{10} x_i + \varepsilon\left(x_1' \prod_{i=2}^{10} x_i + \cdots + x_j' \prod_{i \neq j} x_i + \cdots + x_{10}' \prod_{i=1}^{9} x_i\right).

Automatic differentiation is more accurate than numerical differentiation. Numerical differentiation (or divided differences) uses

f'(x) \approx \frac{f(x + h) - f(x)}{h},

or

f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}.

It is obvious that the accuracy of this type of differentiation is related to the choice
of h. If h is small, then the method of divided differences has errors introduced by
rounding off the floating point numbers. If h is large, then the formula disobeys
the essence of this method, which assumes that h tends to zero. Also, the method
of divided differences introduces truncation errors by neglecting the terms of order
O(h 2 ), something which does not happen in automatic differentiation.
Automatic differentiation has two operation modes: forward and reverse. For
forward mode, the algorithm starts by evaluating the derivatives of every elementary function (the function arguments themselves) of f at the given points. In each intermediate
step, the derivatives are combined to reproduce the derivatives of more complicated
functions. The last step merely assembles the evaluations from the results of the
computations already performed, employing the chain rule. For example, we use the
forward mode to evaluate the derivative of f(x) = (x + x^2)^3; the pseudocode can
be summarised as
function(y, y’)=f’(x, x’)
s1 = x * x;
s1’ = 2 * x * x’;
s2 = x + s1;
s2’ = x’ + s1’;
y = s2 * s2 * s2;
y’ = 3 * s2 * s2 * s2’
end

where f' represents the derivative, i.e. \partial f/\partial x. Therefore, let us evaluate the derivative of f(x) = (x + x^2)^3 at the point x = 2 with the forward mode:

s_1  = x \cdot x = 2 \cdot 2 = 4,
s_1' = 2 \cdot x \cdot x' = 2 \cdot 2 \cdot 1 = 4,
s_2  = x + s_1 = 2 + 4 = 6,
s_2' = x' + s_1' = 1 + 4 = 5,
y    = s_2 \cdot s_2 \cdot s_2 = 6 \cdot 6 \cdot 6 = 216,
y'   = 3 \cdot s_2 \cdot s_2 \cdot s_2' = 3 \cdot 6 \cdot 6 \cdot 5 = 540.
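The same forward-mode bookkeeping can be mimicked in a few lines of plain R by carrying (value, derivative) pairs; this is only a toy sketch, and the helper functions dual, d_add and d_mul are not part of any package.
> dual  = function(v, d){c(value = unname(v), deriv = unname(d))}
> d_add = function(a, b){dual(a[1] + b[1], a[2] + b[2])}
> d_mul = function(a, b){dual(a[1] * b[1], a[1] * b[2] + a[2] * b[1])}
> x  = dual(2, 1)                      # seed: x = 2 and dx/dx = 1
> s1 = d_mul(x, x)                     # x^2
> s2 = d_add(x, s1)                    # x + x^2
> d_mul(s2, d_mul(s2, s2))             # (x + x^2)^3: value 216, derivative 540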

For reverse mode, the programme performs the computation in the reverse direction.
We need to set v̄ = dy/dv, then ȳ = dy/dy = 1. For the same example as before,
where the derivative at x = 2 is evaluated, it looks as follows:

s̄_2 = 3 \cdot s_2^2 = 3 \cdot 36 = 108,
s̄_1 = 3 \cdot s_2^2 = 3 \cdot 36 = 108,
x̄ = s̄_2 + 4 s̄_1 = 108 + 4 \cdot 108 = 540.

Two examples are implemented in R using the package radx developed by Annamalai (2010). This package is not available on CRAN; it is therefore installed via the function install_github() (provided by the package devtools) from the GitHub repository radx by quantumelixir.

Example 2.5 Evaluate the first-order derivative of

f(x) = (x + x^2)^3, \quad for x = 2. \quad (2.25)

> require(devtools)
> # install_github("radx","quantumelixir") # installs from GitHub
> require(radx) # not provided by CRAN
> f = function(x) {(x^2 + x)^3} # function
> radxeval(f, # automatic differ.
+ point = 2, # point at which to eval.
+ d = 1) # order of differ.
[,1]
[1,] 540

The computation above illustrates that the value of the first derivative of the function
(2.25) at x = 2 is equal to 540.
Example 2.6 Evaluate the first and second derivatives of the vector function

f 1 (x, y) = 1 − 3y + sin(3π y) − x,
f 2 (x, y) = y − sin(3πx)/2,

at (x = 3, y = 5).

> f = function(x, y){ # multidimensional function


+ c(1 - 3 * y + sin(3 * pi * y) - x,
+ y - sin(3 * pi * x) / 2)
+ }
> radxeval(f,
+ point = c(3, 5), # point at which to evaluate
+ d = 1) # 1st order of differentiation
[,1] [,2]
[1,] -1.00000 4.712389
[2,] -12.42478 1.000000
> radxeval(f,
+ point = c(3, 5), # point at which to evaluate
+ d = 2) # 2nd order of differentiation
[,1] [,2]
[1,] 0.00000e+00 4.894984e-14
[2,] 0.00000e+00 0.000000e+00
[3,] -4.78741e-13 0.000000e+00

2.4 Root Finding

A root is the solution to a system of equations or an optimisation problem. In both


cases, one tries to find values for the arguments of the function such that the value
of the function is zero. In the case of an optimisation problem, this is done for the
first derivative of the objective function.

2.4.1 Solving Systems of Linear Equations

Let K denote either the set of real numbers or the set of complex numbers. Suppose a_{ij}, b_i ∈ K with i = 1, . . . , n and j = 1, . . . , p. Then the following system of equations is called a system of linear equations:

a_{11} x_1 + . . . + a_{1p} x_p = b_1
        ⋮                        ⋮
a_{n1} x_1 + . . . + a_{np} x_p = b_n

For a matrix A = (a_{ij}) ∈ K^{n×p} and two vectors x = (x_1, . . . , x_p)^⊤ and b = (b_1, . . . , b_n)^⊤, the system of linear equations can be rewritten in matrix form:

Ax = b. \quad (2.26)

Let A_e ∈ K^{n×(p+1)} be the extended matrix, i.e. the matrix whose last column is the vector
of constants b, and otherwise is the same as A. Then (2.26) can be solved if and only
if the rank of A is the same as the rank of Ae . In this case b can be represented by a
linear combination of the columns of A. If (2.26) can be solved and the rank of A
equals n = p, then there exists a unique solution. Otherwise (2.26) might have no
solution or infinitely many solutions, see Greub (1975).
The Gaussian algorithm, which transforms the system of equations by elementary
transformations to upper triangular form, is frequently applied. The solution can be
computed by back-substitution. The Gaussian algorithm decomposes A into the
matrices L and U, the so-called LU decomposition (see Braun and Murdoch (2007)
for further details). L is a lower triangular matrix and U is an upper triangular matrix
with the following form:
L = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ l_{21} & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ l_{n1} & l_{n2} & \cdots & 1 \end{pmatrix}, \qquad U = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & u_{nn} \end{pmatrix}.

Then (2.26) can be rewritten as

Ax = LUx = b. \quad (2.27)

Now the system in (2.26) can be solved in two steps. First define U x = y and solve
Ly = b for y by forward substitution. Then solve U x = y for x by back-substitution.
In R, the function solve(A,b) uses the LU decomposition to solve a system of
linear equations with the matrix A and the right side b. Another method that can be
used in R to solve a system of linear equations is the QR decomposition, where the
matrix A is decomposed into the product of an orthogonal matrix Q and an upper
triangular matrix R. One uses the function qr.solve() to compute the solution
of a system of linear equations using the QR decomposition. In contrast to the LU
decomposition, this method can be applied even if A is not a square matrix. The next
example shows how to solve a system of linear equations in R using solve().
Example 2.7 Solve the following system of linear equations in R with the Gaussian
algorithm and back-substitution,

Ax = b,

A = \begin{pmatrix} 2 & -\frac{1}{2} & -\frac{1}{2} & 0 \\ -\frac{1}{2} & 0 & 2 & -\frac{1}{2} \\ -\frac{1}{2} & 2 & 0 & -\frac{1}{2} \\ 0 & -\frac{1}{2} & -\frac{1}{2} & 2 \end{pmatrix}, \qquad b = (0, 3, 3, 0)^⊤,

A_e = \begin{pmatrix} 2 & -\frac{1}{2} & -\frac{1}{2} & 0 & 0 \\ -\frac{1}{2} & 0 & 2 & -\frac{1}{2} & 3 \\ -\frac{1}{2} & 2 & 0 & -\frac{1}{2} & 3 \\ 0 & -\frac{1}{2} & -\frac{1}{2} & 2 & 0 \end{pmatrix}.

The system of linear equations above is solved first by hand and then the example is computed in R for verification. This system is not difficult to solve with the Gaussian algorithm. First, one finds the upper triangular (extended) matrix

U_e = \begin{pmatrix} 2 & -\frac{1}{2} & -\frac{1}{2} & 0 & 0 \\ 0 & \frac{15}{8} & -\frac{1}{8} & -\frac{1}{2} & 3 \\ 0 & 0 & \frac{28}{15} & -\frac{8}{15} & \frac{16}{5} \\ 0 & 0 & 0 & \frac{12}{7} & \frac{12}{7} \end{pmatrix}.

Second, one uses back-substitution to obtain the final result, (x_1, x_2, x_3, x_4)^⊤ = (1, 2, 2, 1)^⊤. Then the solution of this system of linear equations in R is presented.
Two parameters are required: the coefficient matrix A and the vector of constraints b.
> A = matrix( # coefficient matrix
+ c( 2, -1/2, -1/2, 0,
+ -1/2, 0, 2, -1/2,
+ -1/2, 2, 0, -1/2,
+ 0, -1/2, -1/2, 2),
+ ncol = 4, nrow = 4, byrow = TRUE)
> b = c(0, 3, 3, 0) # vector of constants
> solve(A, b)
[1] 1 2 2 1 # x1, x2, x3, x4

The manually found solution for the system coincides with the solution found in R.
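For comparison, the QR decomposition mentioned above gives the same result for this system (reusing the objects A and b just defined):
> qr.solve(A, b)      # QR decomposition, also returns 1 2 2 1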

2.4.2 Solving Systems of Nonlinear Equations

A system of nonlinear equations is represented by a function F = (f_1, . . . , f_n)^⊤ : R^n → R^n. Any nonlinear system has the general extended form

f_1(x_1, . . . , x_n) = 0,
        ⋮
f_n(x_1, . . . , x_n) = 0.

There are many different numerical methods for solving systems of nonlinear equa-
tions. In general, one distinguishes between gradient and non-gradient methods. In
the following, the Newton method, or the Newton–Raphson method, is presented. To
get a better illustration of the idea behind the Newton method, consider a continuously differentiable function F : R → R, where one tries to find x^* with F(x^*) = 0 and \left.\frac{\partial F(x)}{\partial x}\right|_{x=x^*} \neq 0. Start by choosing a starting value x_0 ∈ R and define the tangent line

p(x) = F(x_0) + \left.\frac{\partial F(x)}{\partial x}\right|_{x=x_0} (x - x_0). \quad (2.28)

Then the tangent line p(x) is a good approximation to F in a sufficiently small neighbourhood of x_0. If \left.\frac{\partial F(x)}{\partial x}\right|_{x=x_0} \neq 0, the root x_1 of p in (2.28) can be computed as follows:

x_1 = x_0 - \frac{F(x_0)}{\left.\frac{\partial F(x)}{\partial x}\right|_{x=x_0}}.

With the new value x1 , the rule can be applied again. This procedure can be applied
iteratively and under certain theoretical conditions the solution should converge to
the actual root. Figure 2.2 demonstrates the Newton method for f (x) = x 2 − 4 with
the starting value x0 = 6.
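A bare-bones version of this iteration, as a minimal sketch for f(x) = x^2 − 4 with the derivative supplied by hand:
> f  = function(x){x^2 - 4}
> df = function(x){2 * x}                 # derivative of f
> x  = 6                                  # starting value
> for (i in 1:10){x = x - f(x) / df(x)}   # Newton step, repeated
> x                                       # converges to the root 2
[1] 2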
Figure 2.2 was computed using the function newton.method(f, init,
...) from the package animation, where f is the function of interest and init is
the starting value for the iteration process. The function provides an illustration of the
iterations in Newton’s method (see help(newton.method) for further details).
The function uniroot() searches in an interval for a root of a function and returns
only one root, even if several roots exist within the interval. At the boundaries of the interval, the sign of the value of the function must change.

Fig. 2.2 Illustration of the iteration steps of Newton's method to find the root of f(x) = x^2 − 4 with x_0 = 6. BCS_Newton
> f = function(x){ # objective function
+ -x^4 - cos(x) + 9 * x^2 - x - 5
+ }
> uniroot(f,
+ interval = c(0, 2))$root # root in [0, 2]
[1] 0.8913574
> uniroot(f,
+ interval = c(-3, 2))$root # root in [-3, 2]
[1] -2.980569
> uniroot(f,
+ interval = c(0, 3))$root # f(0) and f(3) negative
Error in uniroot(f, c(0, 3)) :
Values of f() at the boundaries have same sign

For a real or complex polynomial of the form p(x) = z_1 + z_2 x + . . . + z_n x^{n−1}, the function polyroot(z), with z being the vector of coefficients in increasing order, computes its roots. The algorithm does not guarantee it will find all the roots of the polynomial.
> z = c(0.2567, 0.1570, 0.0821, -0.3357, 1) # coefficients
> round(polyroot(z), digits = 2) # complex roots
[1] 0.59+0.60i -0.42+0.44i -0.42-0.44i 0.59-0.60i

2.4.3 Maximisation and Minimisation of Functions

The maximisation and minimisation of functions, or optimisation problems, contain two components: an objective function f(x) and constraints g(x). Optimisation problems can be classified into two categories, according to the existence of constraints. If there are constraints affiliated with the objective function, then it is a constrained optimisation problem; otherwise, it is an unconstrained optimisation
problem. This section introduces six different optimisation techniques. The first four
are the golden ratio search method, the Nelder–Mead method, the BFGS method, and
the conjugate gradient method for unconstrained optimisation. The two other opti-
misation techniques, linear programming (LP) and nonlinear programming (NLP),
are used for constrained optimisation problems.
First, one needs to define the concepts of local and global extrema, which will be
frequently used later on.
Definition 2.8 A real function f defined on a domain M has a global maximum at x_opt if f(x_opt) ≥ f(x) for all x in M. Then f(x_opt) is called the maximum value of the function. Analogously, the function has a global minimum at x_opt if f(x_opt) ≤ f(x) for all x in M. Then f(x_opt) is called the minimum value of the function.

Definition 2.9 If the domain M is a metric space, then f is said to have a local maximum at x_opt if there exists some ε > 0 such that f(x_opt) ≥ f(x) for all x in M within a distance of ε from x_opt. Analogously, the function has a local minimum at x_opt if f(x_opt) ≤ f(x) for all x in M within a distance of ε from x_opt.

Maxima and minima are not always unique. Consider the function sin(x), which
has global maxima f (xmax ) = 1 and global minima f (xmin ) = −1 for every xmax =
(0.5 + 2k)π and xmin = (−0.5 + 2k)π for k ∈ Z.
Example 2.8 The following function possesses several local maxima, local minima,
global maxima and global minima.

f (x, y) = 0.03 sin(x) sin(y) − 0.05 sin(2x) sin(y)


+ 0.01 sin(x) sin(2y) + 0.09 sin(2x) sin(2y). (2.29)

The function is plotted in Fig. 2.3 with highlighted extrema.


Golden ratio search method
The golden ratio section search method was proposed in Kiefer (1953).
This method is frequently employed for solving optimisation problems with one-
dimensional uni-modal objective functions, and belongs to the group of non-gradient
methods. A very common algorithm for this method is the following one:

1. Define an upper bound x_U, a lower bound x_L with x_U > x_L, and a tolerance ε.
2. Set r = (√5 + 1)/2 and d = (x_U − x_L)/r.
3. Choose x_1 and x_2 ∈ [x_L, x_U]. Set x_1 = x_L + d and x_2 = x_U − d.
4. Stop if |f(x_U) − f(x_L)| < ε or |x_U − x_L| < ε. If f(x_1) < f(x_2) ⇒ x_min = x_1, otherwise x_min = x_2.
5. Update x_U or x_L if |f(x_U) − f(x_L)| ≥ ε or |x_U − x_L| ≥ ε. If f(x_1) > f(x_2) ⇒ x_U = x_2, otherwise x_L = x_1.
6. Return to step 2.

Fig. 2.3 3D plot of the function (2.29) with maxima and minima depicted by points. BCS_Multimodal
The algorithm first defines an initial search interval d. The length of the interval depends on the difference between the upper and lower bounds x_U − x_L and the ‘Golden Ratio’ r = (1 + √5)/2. The points x_1 and x_2 will decrease the length of the search interval. If the tolerance level ε for the search criterion (|f(x_U) − f(x_L)| or |x_U − x_L|) is satisfied, the process stops. But if the search criterion is still greater than or equal to the tolerance level, the bounds are updated and the algorithm starts again.
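A compact R implementation of a golden section search for a minimum is sketched below; it follows the same idea, although the placement of the interior points is written slightly more compactly than in the enumeration above. Maximising f amounts to minimising −f, so the sketch is applied to the negative of (2.30).
> golden = function(f, xL, xU, eps = 1e-6){
+   r = (sqrt(5) + 1) / 2                         # the golden ratio
+   while (abs(xU - xL) >= eps){
+     x1 = xU - (xU - xL) / r                     # interior points
+     x2 = xL + (xU - xL) / r
+     if (f(x1) < f(x2)){xU = x2} else {xL = x1}  # keep the bracket with the smaller value
+   }
+   (xL + xU) / 2
+ }
> golden(function(x){(x - 3)^2 - 10}, -10, 10)    # minimum of -f from (2.30), very close to 3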
Example 2.9 Apply the golden ratio search method to find the maximum of

f(x) = −(x − 3)^2 + 10. \quad (2.30)



> require(stats)
> f = function(x){-(x - 3)^2 + 10} # function
> optimize(f, # objective function
+ interval = c(-10, 10), # interval
+ tol = 0.0001, # level of the tolerance
+ maximum = TRUE) # to find maximum
$maximum
[1] 3

$objective
[1] 10

The argument tol defines the convergence criterion for the results. The function reaches its global maximum at x_opt = 3, which is easily derived by solving the first-order condition −2x_opt + 6 = 0 for x_opt and computing the value f(x_opt). For a maximum at x_opt, one should have \left.\frac{\partial^2 f(x)}{\partial x^2}\right|_{x=x_opt} < 0; here \frac{\partial^2 f(x)}{\partial x^2} = −2. Therefore x_opt = 3, which is verified in R with the code from above.
Nelder–Mead method
This method was proposed in Nelder and Mead (1965) and is applied frequently in
multivariate unconstrained optimisation problems. It is a direct method, where the
computation does not use gradients. The main idea of the Nelder–Mead method is
briefly explained below and a graph for a two-dimensional input case is shown in
Fig. 2.4.
1. Choose x_1, x_2, x_3 such that f(x_1) < f(x_2) < f(x_3) and set ε_x and/or ε_f.
2. Stop if ‖x_i − x_j‖ < ε_x and/or |f(x_i) − f(x_j)| < ε_f, for i ≠ j, i, j ∈ {1, 2, 3}, and set x_min = x_1.
3. Else, compute z = ½(x_1 + x_2) and d = 2z − x_3.
   If f(x_1) < f(d) < f(x_2) ⇒ x_3 = d.
   If f(d) ≤ f(x_1), compute k = 2d − z.
      If f(k) < f(x_1) ⇒ x_3 = k.
      Else, x_3 = d.
   If f(x_3) > f(d) ≥ f(x_2) ⇒ x_3 = d.
   Else, compute t = [t | f(t) = min{f(t_1), f(t_2)}], where t_1 = ½(x_3 + z) and t_2 = ½(d + z).
      If f(t) < f(x_3) ⇒ x_3 = t.
      Else, x_3 = s = ½(x_1 + x_3) and x_2 = z.
4. Return to step 2.
In general, the Nelder–Mead algorithm works with more than three initial guesses.
The starting values xi are allowed to be vectors. In the iteration procedure one tries
to improve the initial guesses step by step. The worst guess x3 will be replaced by
better values until the convergence criterion for the values f of the function or the
arguments x of the function is met. Next we will give an example of how to use the
Nelder–Mead method to find extrema of a function in R (Fig. 2.5).

Fig. 2.4 Algorithm graph for the Nelder–Mead method. The variables x1 , x2 and x3 are the search
region at the specific iteration step. All other variables, d, k, s, t1 and t2 , are possible updates for
one xi

Example 2.10 The function to be minimized is the Rosenbrock function, which has
an analytic solution with global minimum at (1, 1) and a global minimum value
f (1, 1) = 0.

f(x_1, x_2) = 100(x_2 − x_1^2)^2 + (1 − x_1)^2. \quad (2.31)

> require(neldermead)
> f = function(x){
+ 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2 # Rosenbrock function
+ }
> fNM = fminsearch(fun = f,
+ x0 = c(-1.2, 1), # starting point
+ verbose = FALSE)
> neldermead.get(fNM, key ="xopt") # optimal x-values
[,1]
[1,] 1.000022
[2,] 1.000042
> neldermead.get(fNM, key ="fopt") # optimal function value
[1] 8.177661e-10

The computation above illustrates that the numerical solution by the Nelder–Mead
method is close to the analytical solution for the Rosenbrock function (2.31). The
errors of the numerical solution are negligibly small.
BFGS method
This frequently used method for multivariate optimisation problems was proposed
independently in Broyden (1970), Fletcher (1970), Goldfarb (1970) and Shanno
(1970). BFGS stands for the first letters of each author, in alphabetical order. The
main idea of this method originated from Newton’s method, where the second-order
Taylor expansion for a twice differentiable function f : Rn → R at x = xi ∈ Rn is
employed, such that

1
f (x) = f (xi ) + ∇ f  (xi )q + q  H (xi )q,
2
Fig. 2.5 Plot for the Rosenbrock function with its minimum depicted by a point. BCS_Rosenbrock

where q = x − x_i, ∇f(x_i) is the gradient of f at the point x_i, and H(x_i) is the Hessian matrix. Employing the first-order condition, one obtains

∇ f (x) = ∇ f (xi ) + H (xi )q = 0,

hence, if H (xi ) is invertible, then

q = x − x_i = −H^{−1}(x_i) \nabla f(x_i),
x_{i+1} = x_i − H^{−1}(x_i) \nabla f(x_i).

The recursion will converge quadratically to the optimum. The problem is that New-
ton’s method requires the computation of the exact Hessian at each iteration, which
is computationally expensive. Therefore, the BFGS method overcomes this disad-
vantage with an approximation of the Hessian’s inverse obtained from the following
optimisation problem,

H(x_i) = arg min_{H} \|H^{−1} − H^{−1}(x_{i−1})\|_W, \quad (2.32)
subject to: H^{−1} = (H^{−1})^⊤,
            H^{−1}(\nabla f_i − \nabla f_{i−1}) = x_i − x_{i−1}.

The weighted Frobenius norm, denoted by \|\cdot\|_W, and the matrix W are, respectively,

\|H^{−1} − H^{−1}(x_{i−1})\|_W = \|W^{1/2}\{H^{−1} − H^{−1}(x_{i−1})\}W^{1/2}\|_F,
W(\nabla f_i − \nabla f_{i−1}) = x_i − x_{i−1}.

Equation (2.32) has a unique solution such that

H^{−1}(x_i) = M_1 H^{−1}(x_{i−1}) M_2 + (x_i − x_{i−1})\,\gamma_{i−1}\,(x_i − x_{i−1})^⊤,
M_1 = I − \gamma_{i−1}(x_i − x_{i−1})(\nabla f_i − \nabla f_{i−1})^⊤,
M_2 = I − \gamma_{i−1}(\nabla f_i − \nabla f_{i−1})(x_i − x_{i−1})^⊤,
\gamma_i = \{(\nabla f_i − \nabla f_{i−1})^⊤(x_i − x_{i−1})\}^{−1}.

Example 2.11 Here, the BFGS method is used to minimise the Rosenbrock function
(2.31) using the optimx package (see Nash and Varadhan 2011).

> require(optimx)
> f = function(x){100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2}
> fBFGS = optimx(fn = f, # objective function
+ par = c(-1.2, 1), # starting point
+ method ="BFGS") # optimisation method
> print(data.frame(fBFGS$p1, fBFGS$p2, fBFGS$value))
fBFGS.p1 fBFGS.p2 fBFGS.value
1 0.9998044 0.9996084 3.827383e-08 # minimum

The BFGS method computes the minimum value of the function (2.31) to be 3.83e−08 at the minimum point (0.99, 0.99). The outputs fevals = 127 and gevals = 38 show the number of calls of the objective function and of the gradient, respectively. The computed point and value are close to the exact solution x_opt = (1, 1)^⊤ and f(x_opt) = 0.
Conjugate gradient method
The conjugate gradient method was proposed in Hestenes and Stiefel (1952) and is
widely used for solving symmetric positive definite linear systems. A multivariate
unconstrained optimisation problem, like

Ax = b, A ∈ Rn×n , A = A , and A positive definite,

can be solved with the Conjugate Gradient Method. The main idea behind this method
is to use iterations to approach the optimum of the linear system.

1. Set x_0 and ε, then compute p_0 = r_0 = b − Ax_0.
2. Stop if ‖r_i‖ < ε and set x_opt = x_{i+1}.
3. Else, compute:

\alpha_i = \frac{r_i^⊤ r_i}{p_i^⊤ A p_i},
x_{i+1} = x_i + \alpha_i p_i,
r_{i+1} = r_i − \alpha_i A p_i,
\beta_i = \frac{r_{i+1}^⊤ r_{i+1}}{r_i^⊤ r_i},
p_{i+1} = r_{i+1} + \beta_i p_i.

4. Update xi , ri and pi . Increment i.


5. Return to step 2.
At first, the initial guess x0 determines the residual r0 and the initially used basis
vector p0 . The algorithm tries to reduce the residual ri at each step to get to the optimal
solution. At the optimum, 0 = b − Ax_i. The tolerance level ε and the final residual
should be close to zero. The parameters αi and βi are improvement parameters for
the next iteration. The parameter αi directly determines the size of the improvement
for the residual and indirectly influences the conjugate vector pi+1 used in the next
iteration. For βi , the opposite is true. The final result depends on both parameters.
Example 2.12 To illustrate the Conjugate Gradient method, let us again consider the
Rosenbrock function (2.31).

> require(optimx)
> f = function(x){100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2}
> fCG = optimx(fn = f, # objective function
+ par = c(1.2, 1), # initial guess (x_0)
+ control = list(reltol = 10^-7), # relative tolerance
+ method ="CG") # method of optimisation
> print(data.frame(fCG$p1, fCG$p2, fCG$value)) # minimum
fCG.p1 fCG.p2 fCG.value
1 1.030077 1.061209 0.0009036108

For the Rosenbrock function, the Conjugate Gradient method delivers the biggest
errors, compared to the Nelder–Mead and BFGS methods. All numerical meth-
ods which are applied to optimize a function will only approximately find the true
solution. The examples above show how the choice of method might influence the
accuracy of the result. It is worth mentioning that in the latter case we changed the initial guess, as the function failed with the same starting value that we used for the BFGS method.
Constrained optimisation
Constrained optimisation problems can be categorised into two classes in terms of the linearity of the objective function and the constraints. A linear programming (LP) problem has a linear objective function and linear constraints; otherwise it is a nonlinear programming (NLP) problem.
LP is a method to find the solution to an optimisation problem with a linear objec-
tive function, under constraints in the form of linear equalities and linear inequalities.
It has a feasible region defined by a convex polyhedron, which is a set made by the
intersection of finitely many half-spaces. These represent linear inequalities. The
objective of linear programming is to find a point in the polyhedron where the objec-
tive function reaches a minimum or maximum value. A representative LP can be
expressed as follows:

arg max_{x} a^⊤ x,
subject to: Cx ≤ b,
            x ≥ 0,

where x ∈ R^n is a vector of variables to be identified, a and b are vectors of known coefficients and C is a known matrix of the coefficients in the constraints. The expression a^⊤ x is the objective function. The inequalities Cx ≤ b and x ≥ 0 are the constraints, under which the objective function will be optimised.
NLP has an analogous definition as that of the LP problem. The differences
between NLP and LP are that the objective function or the constraints in an NLP can
be nonlinear functions. The following example is an LP problem (Fig. 2.6).
Example 2.13 Solve the following linear programming optimisation problem with R.

arg max_{x_1, x_2} 2x_1 + 4x_2, \quad (2.33)
subject to: 3x_1 + 4x_2 ≤ 60,
            x_1 ≥ 0,
            x_2 ≥ 0.

For the example in (2.33), the function Rglpk_solve_LP() from the package Rglpk (see Theussl 2013) is used to compute the solution in R.
> require(Rglpk)
> Rglpk_solve_LP(obj = c(2, 4), # objective function
+ mat = matrix(c(3, 4), nrow = 1), # constrains coefficients
+ dir ="<=", # type of constrains
+ rhs = 60, # constrains vector
+ max = TRUE) # to maximise
$optimum # maximum
[1] 60

$solution # point of maximum


[1] 0 15

$status # no errors
[1] 0

The maximum value of the function (2.33) is 60 and occurs at the point (0, 15).
Fig. 2.6 Plot for the linear programming problem with the constraint hyperplane depicted by the grid and the optimum by a point. BCS_LP

Example 2.14 Next, consider a constrained nonlinear optimisation problem, which


can be solved in R using constrOptim(). Solve the following nonlinear optimi-
sation problem with R (Fig. 2.7).
 
arg max_{x_1, x_2} \sqrt{5x_1} + \sqrt{3x_2}, \quad (2.34)
subject to: 3x_1 + 5x_2 ≤ 10,
            x_1 ≥ 0,
            x_2 ≥ 0.

> require(stats)
> f = function(x){
+ sqrt(5 * x[1]) + sqrt(3 * x[2]) # objective function
+ }
> A = matrix(c(-3, -5), nrow = 1,
+ ncol = 2, byrow = TRUE) # coefficients matrix
> b = c(-10) # vector of constraints
Fig. 2.7 Plot for the objective function with its constraint from (2.34) and the optimum depicted by the point. BCS_NLP

> answer = constrOptim(f = f, # objective function


+ theta = c(1, 1), # initial guess
+ grad = NULL, # no gradient provided
+ ui = A,
+ ci = b, # vector of constrains
+ control = list(fnscale = -1)) # to maximise
> c(answer$par, answer$value) # optimum
[1] 2.4510595 0.5293643 4.7609523

The computation above illustrates that the maximum value of the function (2.34)
is 4.7610, and occurs at the point (2.4511, 0.5294). answer$function equal to
170 means that the objective function has been called 170 times.
Chapter 3
Combinatorics and Discrete Distributions

The roll of the dice will never abolish chance.

— Stéphane Mallarmé

In the second half of the nineteenth century, the German mathematician Georg Cantor
developed the greater part of today’s set theory. At the turn of the nineteenth and
twentieth centuries, Ernst Zermelo, Bertrand Russell, Cesare Burali-Forti and others
found contradictions in the nonrestrictive set formation: For every property there
is a unique set of individuals, which have that property, see Johnson (1972). This
so called ‘naïve comprehension principle’ produced inconsistencies, illustrated by
the famous Russell paradox, and was therefore untenable. Ernst Zermelo in 1908
gave an axiomatic system which precisely described the existence of certain sets
and the formation of sets from other sets. This Zermelo–Fraenkel set theory is still
the most common axiomatic system for set theory. There are 9 axioms, amongst
others, that deal with set equality, regularity, pairing sets, infinity, and power sets.
Since these axioms are very theoretical, we refer the interested reader to Jech (2003).
Later, further axioms were added in order to be able to universally interpret all
mathematical objects or constructs, making set theory a fundamental discipline of
mathematics. It also plays a major role in computational statistics, since the latter mostly uses basic functions which constitute set-theoretical relations.

3.1 Set Theory

3.1.1 Creating Sets

In mathematics, the most famous sets are


• N: the set of natural numbers, i.e. {1, 2, 3, 4 . . . }.


• Z: the set of integer numbers, i.e. {· · · − 3, −2, −1, 0, 1, 2, 3 . . . }.
• Q: the set of rational numbers. An example of a finite set of rational numbers is, e.g., C = {−3.5, 0.01, 19/20} ⊂ Q.
• R: the set of real numbers. An example of a finite set of real numbers is, e.g., D = {1, 2.5, −3, √2} ⊂ R.
• C: the set of complex numbers. An example of a finite set of complex numbers is, e.g., E = {2 + 3i, 1 − i} ⊂ C, with i² = −1.
For each set there is a cardinal number which stands for the magnitude or number of
elements in the set and allows describing infinite sets. For the set of natural numbers,
for example, its cardinal number is ℵ0 (‘Aleph null’). In R, of course, one cannot
create infinite sets due to its limited storage capacity.
The sets above are named by a fixed character or letter, whereas other sets in the
literature are labelled arbitrarily with a Latin or Greek letter, for example, a, b, N ,
M.
Using R, a set can be defined by enumerating its elements or by stating its form,
as in the following.
> A = c(1, 2, 3, 4); A
[1] 1 2 3 4
> B = seq(from = -3, to = 3, by = 1); B
[1] -3 -2 -1  0  1  2  3

Most of the basic R objects containing several elements, such as an array, a matrix,
or a data frame, are sets.

3.1.2 Basics of Set Theory

After the creation of a set, the next step is to manipulate the set in useful ways. One
possible goal could be selecting a specific subset. A subset of a set M is another set
M1 whose elements a are also elements of M, i.e. a ∈ M1 implies a ∈ M. There
are several other relations besides the subset relation. The basic set operations are
union, intersection, difference, test for equality, and the operation ‘is-an-element-of’.
Table 3.1 contains definitions and the corresponding tools from the packages base
and sets discussed below. In order to use the functions provided by the package
sets, objects have to be defined as sets. All functions contained in base R can be
applied to vectors or matrices. One can use the relations from Table 3.1 to state the
following equations and properties, which are generally valid in set theory.
1. A ∪ ∅ = A, A ∩ ∅ = ∅;
2. A ∪ Ω = Ω, A ∩ Ω = A;
3. A ∪ A^c = Ω, A ∩ A^c = ∅;
4. (A^c)^c = A;
5. Commutative property: A ∪ B = B ∪ A, A ∩ B = B ∩ A;
6. Associative property: (A ∪ B) ∪ C = A ∪ (B ∪ C), (A ∩ B) ∩ C = A ∩ (B ∩ C);

Table 3.1 Definitions, relations and operations on sets


Notation    Definition                                          R base                sets package
x ∈ A       x is an element of A                                %in%, is.element()    set_contains_element(A, x), (x %e% A)
x ∉ A       x is not an element of A                            !(x %in% A)           !(x %e% A)
A ⊆ B       Each element of A is an element of B                A %in% B              set_is_subset(A, B)
A = B       A ⊇ B and A ⊆ B                                     setequal(A, B)        set_is_equal(A, B)
∅           The empty set, {}                                   x = c()               set()
Ω           The Universe                                        ls()
A ∪ B       Union: {x | x ∈ A or x ∈ B}                         union(A, B)           set_union(A, B), A | B
A ∩ B       Intersection: {x | x ∈ A and x ∈ B};                intersect(A, B)       set_intersection(A, B), A & B
            if A ∩ B = ∅ then A and B are disjoint
A \ B       Set difference: {x | x ∈ A and x ∉ B}               setdiff(A, B)         A − B
A △ B       Symmetric difference: (A \ B) ∪ (B \ A)                                   set_symdiff(), %D%
            = {x | either x ∈ A or x ∈ B}
A^c         The complement of a set A: Ω \ A                                          set_complement(A, Ω)
P(A)        Power set: the set of all subsets of A                                    set_power(A), 2^A

7. Distributive property: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) , A ∩ (B ∪ C) =
(A ∩ B) ∪ (A ∩ C);
8. De Morgan’s Law: (A ∪ B)c = Ac ∩ B c and (A ∩ B)c = Ac ∪ B c ,
or, more generally, (∪i Ai )c = ∩i Aic and (∩i Ai )c = ∪i Aic .
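These identities are easy to check numerically with the base R functions listed in Table 3.1; a small sketch of De Morgan's law, where the universe is represented by the vector 1:10:
> Omega = 1:10                             # a finite stand-in for the universe
> A = 1:4; B = 3:7
> comp = function(S){setdiff(Omega, S)}    # complement relative to Omega
> setequal(comp(union(A, B)), intersect(comp(A), comp(B)))   # first law
[1] TRUE
> setequal(comp(intersect(A, B)), union(comp(A), comp(B)))   # second law
[1] TRUE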

3.1.3 Base Package

The base package provides functions to perform most set operations, as shown in
the second column of Table 3.1. The results are given as an output vector or list. Note
that R is able to compare numeric and character elements. The output will be
given as a character vector, as in line 3 below.
> set1 = c(1, 2)                      # numeric vector
> set2 = c("1", "2", 3)               # vector with strings
> setequal(set1, set2)                # sets are not equal
[1] FALSE
> is.element(set2, c(2, 1))           # 1, 2 are elements of 2nd set
[1]  TRUE  TRUE FALSE
> intersect(set1, set2)               # different element types
[1] "1" "2"

As there is no specific function in the base package for the symmetric difference, it can be obtained by combining the base functions union() and setdiff():
> A = 1:4                                  # {1, 2, 3, 4}
> B = -3:3                                 # {-3, -2, -1, 0, 1, 2, 3}
> union(setdiff(A, B), setdiff(B, A))      # symmetric difference set
[1]  4 -3 -2 -1  0

The symmetric difference set is the union of the difference sets. In the example above,
A and B have 1, 2, and 3 as their common elements. All other elements belong to the
symmetric difference set.
When working with basic R objects like lists, vectors, arrays, or data-frames,
using functions from the base package is appropriate. These functions, for example,
union(), intersect(), setdiff() and setequal(), apply as.vector
to the arguments. Applying operations on different types of sets, like a list and a
vector in the following example, does not necessarily lead to a problem.
> setlist = list(3, 4)                     # set of type list
> setvec1 = c(5, 6, 8, 20)                 # set of type vector
> intersect(setlist, setvec1)              # no common elements
numeric(0)
> setvec2 = c("blue", "red", 3)            # set of type vector
> intersect(setlist, setvec2)              # common elements
[1] "3"

In the following example, the objects A and B are combined in the data frame AcB.
The union of a data frame AcB and another object M returns a list of all elements.
> AcB = data.frame(A = 1:3, B = 5:7)
> M = list(10, 15, 10)
> union(AcB, M)                            # union returns a list for data frames
[[1]]
[1] 1 2 3

[[2]]
[1] 5 6 7

[[3]]
[1] 10

[[4]]
[1] 15

> intersect(AcB, M)
list()
> DcE1 = data.frame(D = c(1, 3, 2), E = c(5, 6, 7))
> intersect(AcB, DcE1)                     # should return both D and E
  E
1 5
2 6
3 7
> DcE2 = data.frame(D = c(1, 2, 3), E = c(5, 6, 7))
> intersect(AcB, DcE2)                     # should return both D and E
  E
1 5
2 6
3 7

Using vectors as sets has some drawbacks when working with data frames, as shown
for the intersections above. In the base package, the intersection of two data frames
with a common element returns the empty set if the elements are ordered or defined
differently, therefore the elements c(1, 2, 3, 4) and c(1, 3, 2, 4) as
well as c(1, 2, 3, 4) and 1:4 are treated as different sets. When using the
sets function set(), the order becomes unimportant.

3.1.4 Sets Package

The package sets was specifically created by David Meyer and others for appli-
cations concerning set theory. This package provides basic operations for ordinary
sets and also for generalizations like fuzzy sets and multisets. The objects created
with functions from this package, e.g. by using the function set(), can be viewed
as real set objects, in contrast to vectors or lists, for example. This is visible in the
output, since sets are denoted by curly brackets.
A data frame can be viewed as a nested set and should be created with several
set() commands. Note that these functions in R require the sets package.
> require(sets)
> A = set(1, 2, 3)                         # set A
> B = as.set(c(5, 6, 7))                   # set B
> set(A, B)                                # set AcB from above
{{1, 2, 3}, {5, 6, 7}}

The as.set() function is used above to convert an array object into a set object. For
objects of the class set, it is recommended to use the methods from the same package,
like set_union and set_intersection or, more simply, the symbols & and |.
In the following, some of these functions, as presented in Table 3.1, are used on two
simple sets.
> A = set(1, 2, 3)                         # set A
> B = set(5, 6, 7, "5")                    # set B
> B                                        # ordered and distinct
{"5", 5, 6, 7}
> A | B                                    # union set
{"5", 1, 2, 3, 5, 6, 7}
> A & B                                    # intersection set
{}
> A - B                                    # set difference
{1, 2, 3}
> A %D% B                                  # symmetric difference
{"5", 1, 2, 3, 5, 6, 7}
> summary(A %D% B)                         # summary of the symmetric difference
A set with 7 elements.
> set_is_empty(A)                          # check for empty set
[1] FALSE

Besides the functions in Table 3.1, the basic predicate functions ==, !=, <, <=,
defined for equality and subset, can be used intuitively for the set objects. For vectors
or lists, however, these functions are executed element by element, so the objects
must have the same length.
> A = set (1 , 2 , 3); B = set (5 , 6 , 7 , " 5 " )
> D = set (4 , 5 , 6 , 7 , " 5 " ) # set D
> B <= D # c h e c k if B is a s u b s e t of D
[1] TRUE
> B != A # check if B is u n e q u a l to A
[1] TRUE
> set_similarity( B, D ) # f r a c t i o n of e l e m e n t s in i n t e r s e c t i o n
[1] 0 .8

The set_similarity() function computes the ratio of the number of elements in the
intersection of two sets to the number of elements in their union.
In computational statistics, one often needs to work with sets and compute such
properties as the mean and median, see Sect. 5.1.5. Such statistics can be calculated
for set objects similarly to other R objects. Applying the functions sum(), mean()
and median() to a set, R will try to convert the set to a numeric vector, e.g. 5
defined as a character is converted to a numeric 5 in the example below.
> A = set (1 , 2 , 3); B = set (5 , 6 , 7 , " 5 " )
> A + B # u n i o n of A and B
{ " 5 " , 1 , 2 , 3 , 5 , 6 , 7}
> sum ( c ( " 5 " , 1 , 2 , 3 , 5 , 6 , 7))
E r r o r in sum ( c ( " 5 " , 1 , 2 , 3 , 5 , 6 , 7)):
i n v a l i d ’ type ’ ( c h a r a c t e r ) of a r g u m e n t
> sum ( A + B ) # sum of u n i o n set A and B
[1] 29 # "5" b e c o m e s n u m e r i c
> A * B # Cartesian product
{(1 , 5) , (1 , 6) , (1 , 7) , (1 , " 5 " ) , (2 , 5) , (2 , 6) , (2 , 7) , (2 , " 5 " ) ,
(3 , 5) , (3 , 6) , (3 , 7) , (3 , " 5 " )}

Furthermore, in the sets package, the calculation of the closure and reduction of
sets is implemented by means of the function closure().
> D = set ( set (1) , set (2) , set (3)); D
{{1} , {2} , {3}}
> closure( D ) # set of all s u b s e t s
{{1} , {2} , {3} , {1 , 2} , {1 , 3} , {2 , 3} , {1 , 2 , 3}}

3.1.5 Generalised Sets

In contrast to ordinary sets, which keep their elements in a sorted and distinct form,
a generalised set keeps every element, even if there are redundant elements, but still
in a sorted way. Generalised sets allow keeping
more information or characteristics of a set and include two special cases: fuzzy
sets and multisets. Every generalised set can be created using the gset() function
and all methods in this regard begin with the prefix gset_. Before constructing a
generalised set, it is important to think about its characteristics, like the membership
of an element, which differ for fuzzy sets and multisets.
Membership is described by a function f that maps each element of a set A to a
membership number:
• For ordinary sets, each element is either in the set or not, i.e. f : A → {0, 1};
• For fuzzy sets, the membership function maps into the unit interval, f : A →
[0, 1];
• For multisets, f : A → N.

Multisets allow each element to appear more than once, so that in statistics, multisets
occur as frequency tables. Since in base R there is no support for multisets, the
sets package is a good solution. In the example below, the multiset ms1 has four
elements and each distinct element has a certain membership value. The absolute
cardinality of a set can be obtained by the function gset_cardinality(), i.e.
the number of elements in a set.
> r e q u i r e ( sets )
> ms1 = gset ( c ( " red " , rep ( " blue " , 3))) # multiset
> ms1 # repeated elements retained
{ " blue " , " blue " , " blue " , " red " }
> gset_cardinality( ms1 ) # n u m b e r of e l e m e n t s
[1] 4 # c a r d i n a l i t y of ms1
> fs1 = gset ( c (1 , 2 , 3) , # fuzzy set
+ m e m b e r s h i p = c (0 .2, 0 .6, 0 .9 ));
> fs1
{1 [0 .2 ] , 2 [0 .6 ] , 3 [0 .9 ]}
> plot ( fs1 ) # left plot in Fig. \ ,3.1
> B = c("x", "y", "z", "z", "z", "x") # c r e a t e m u l t i s e t from R object
> table ( B )
B
x y z
2 1 3
> ms2 = a s . g s e t ( B ); ms2 # c o n v e r t s v e c t o r to the set
{ " x " [2] , " y " [1] , " z " [3]}
> gset_cardinality( ms2 ) # c a r d i n a l i t y of m u l t i 2
[1] 6
> ms3 = gset ( c ( ’x ’ , ’y ’ , ’z ’ ) , # c r e a t e m u l t i s e t via gset
+ m e m b e r s h i p = c (2 , 1 , 3)); ms3
{ " x " [2] , " y " [1] , " z " [3]}
> gset_cardinality( ms3 )
[1] 6
> plot ( ms3, col = ’ l i g h t b l u e ’ ) # right plot in Fig. \ ,3.1

By employing the repeat function rep(x, times) with times = 2, the mem-
bership is doubled.
> ms4 = rep ( ms3, times = 2)
> ms4
{ " x " [4] , " y " [2] , " z " [6]}
> gset_cardinality( ms4 )
[1] 12

The function set_combn(set, length) from the sets package creates a set
with subsets of the specified length: it consists of all combinations of the elements
in the specified set (Fig. 3.1).
To apply a function to all pairwise combinations of the elements of two sets, the
function set_outer(set1, set2, operation) can be used: it applies a binary operator,
like the sum or the product, to all pairs of elements of set1 and set2 and returns a
matrix of dimension length(set1) times length(set2). The base function outer() can
be used in the same way for vectors and matrices in R.
Fig. 3.1 R plot of a fuzzy set (left) and a multiset (right). BCS_FuzzyMultiSets

> set_combn( set (2 , 4 , 6 , 8 , 10) , # all s u b s e t s
+ l e n g t h = 2) # of l e n g t h 2
{{2 , 4} , {2 , 6} , {2 , 8} , {2 , 10} , {4 , 6} , {4 , 8} , {4 , 10} , {6 , 8} ,
{6 , 10} , {8 , 10}}
> set _ outer ( set (3 , 4) ,
+ set (10 , 50 , 100) , " * " ) # outer p r o d u c t with sets
10 50 100
3 30 150 300
4 40 200 400
> outer ( c (3 , 4) , c (10 , 50 , 100) , " * " ) # o u t e r p r o d u c t with ve c t o r s
[ ,1 ] [ ,2 ] [ ,3 ]
[1 , ] 30 150 300
[2 , ] 40 200 400

Users of base R can get wrong or confusing results when applying basic set
operations like union and intersection. Indexable structures, like lists and vectors,
are interpreted as sets. For set theoretical applications, this imitation has not been
sufficiently elaborated: basic operations such as the Cartesian product and power
set are missing. The base package in R performs a type conversion via match(),
which might in some cases lead to wrong results. In most cases it makes no difference
whether one uses a = 2 or a = 2L, where the latter defines a directly as an integer
by the suffix L. But to save memory in computationally intensive code, it is useful
to define a directly as having integer type.
> y = (1:100) * 1 # o p t i o n 1 to d e f i n e v e c t o r y
> typeof (y)
[1] " d o u b l e "
> object.size( y ) # m e m o r y u s e d by t h i s o b j e c t
840 bytes
> yL = ( 1 : 1 0 0 ) * 1 L # o p t i o n 2 to d e f i n e v e c t o r y
> t y p e o f ( yL )
[1] " i n t e g e r "
> object.size( yL ) # m e m o r y u s e d by t h i s o b j e c t
440 bytes
Fig. 3.2 Inheritance of object classes ‘set’, ‘gset’ and ‘cset’ from the sets package. All
operation names can be combined with the corresponding prefix set_, gset_ and cset_.
[Diagram: nested boxes — customisable set: cset ⊃ generalised set: gset ⊃ set]

If one checks one's code for constants not defined as integers, note that the match()
function will not distinguish between 1 and 1L.
The sets package avoids such pitfalls by using set classes for ordinary, generalised,
and customised sets, as presented in Fig. 3.2. Customised sets are an extension of
generalized sets and are implemented in R via the function cset. With the help of
customisable sets, one is able to define how elements in the sets are matched through
the argument matchfun.
> setA = set ( a s . n u m e r i c (1)) # set with n u m e r i c 1
> 1 L %e% setA # 1 L is not an e l e m e n t of A
[1] FALSE
> csetA = cset ( a s . n u m e r i c (1) , # cset with match f u n c t i o n
+ m a t c h f u n = match )
> 1 L %e% csetA # 1 L is now an e l e m e n t of A
[1] TRUE

The basic R function match considers the integer one 1L to be the same as the
numeric 1. With the help of customisable sets, users of R are able to specify which
elements are considered to be the same. This is very useful for data management.

3.2 Probabilistic Experiments with Finite Sample Spaces

When working with data that is subject to random variation, the theory behind this
probabilistic situation becomes important. There are two types of experiments: deter-
ministic and random. We will focus here on nondeterministic processes with a finite
number of possible outcomes.
A random trial (or experiment) yields one of the distinct outcomes that alto-
gether form the sample or event space Ω. All possible outcomes constitute the
universal event. Subsets of Ω are called events, e.g. Ω itself is the universal
event. Examples of experiments include rolling a die with the sample space Ω =
{{1}, {2}, {3}, {4}, {5}, {6}}, another is tossing a coin with only two possible out-
comes: heads (H ) and tails (T ).
A combination of several rolls of a die or tosses of a coin leads to more possible
results, such as tossing a coin twice, with the sample space Ω = {{H, H }, {H, T },
{T, H }, {T, T }}. Generally, the combination of several different experiments yields
a sample space with all possible combinations of the single events. If, for instance,
one needs two coins to fall on the same side, then the favored event is a set of two
elements: {H, H } and {T, T }.
The prob package, which will be used in the following, has been developed
by G. Jay Kerns specifically for probabilistic experiments. It provides methods for
elementary probability calculation on finite sample spaces, including counting tools,
defining probability spaces discussed later, performing set algebra, and calculating
probabilities.
The situation of tossing a coin twice is considered in the following code, for which
the package prob is needed. The functions used will be explained shortly.
> r e q u i r e (prob)
> ev = t o s s c o i n (2) # s a m p l e s p a c e for 2 coin ~ t o s s e s
> p r o b s p a c e ( ev ) # p r o b a b i l i t i e s for e v e n t s
toss1 toss2 probs
1 H H 0 .25
2 T H 0 .25
3 H T 0 .25
4 T T 0 .25

The interesting information is how likely an event is. Each event has a probability
assigned and this probability is included as the last column of the R output in the
example above. The values quantify our chances of observing the corresponding
outcome when tossing a coin twice.
Comparable to the set theory in Sect. 3.1, one can apply operations like union or
intersection to events. The event probability follows the axioms of probability, which
are shortly summarised in the following.
• P(·) is a probability function that assigns to each event A in the sample space a
real number P(A), which lies between zero and one. P(A) is the probability that
the event A occurs. The probability of the whole sample space is equal to one,
which means that it occurs with certainty.
• P(A ∪ B) = P(A) + P(B) if A and B are disjoint. In general,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
P(Ω) = 1 and P(∅) = 0.
The probability of the complementary event and that of the difference between two
sets are given by
• P(Ac ) = 1 − P(A);
• P(A \ B) = P(A) − P(A ∩ B).
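As a minimal sketch (not from the original text), the addition rule and the complement
rule can be checked numerically with the prob package; the events A and B used here
are chosen purely for illustration.
> require(prob)
> S = tosscoin(2, makespace = TRUE)           # probability space for two tosses
> A = subset(S, toss1 == "H")                 # event: first toss shows heads
> B = subset(S, toss2 == "H")                 # event: second toss shows heads
> Prob(union(A, B))                           # P(A or B)
> Prob(A) + Prob(B) - Prob(intersect(A, B))   # the same value by the addition rule
> 1 - Prob(A)                                 # P(complement of A)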
3.2.1 R Functionality

The following functions, which generate common and elementary experiments, can
be used to set up a sample space.
• urnsamples(x, size, replace = FALSE, ordered = FALSE, …),
• tosscoin(ncoins, makespace = FALSE),
• rolldie(ndies, nsides = 6, makespace = FALSE),
• cards(jokers = FALSE, makespace = FALSE),
• roulette(european = FALSE, makespace = FALSE).

If the argument makespace is set TRUE, the resulting data frame has an additional
column showing the (equal) probability of each single event. In the simplest case,
the probability of an event can be computed as the relative frequency. Some methods
for working with probabilities and random samples from the prob and the base
packages are the following.
• probspace(outcomes, probs) forms a probability space,
• prob(prspace, event = NULL) gives the probability of an event as its relative
frequency,
• factorial(n) is the mathematical operation n! for a non-negative integer n,
• choose(n, k) gives the binomial coefficient \binom{n}{k} = n!/{k!(n − k)!}.

> r e q u i r e ( prob )
> ev = urnsamples( c ( " bus " , " car " , " bike " , " train " ) ,
+ size = 2,
+ ordered = TRUE )
> probspace( ev ) # probability space
X1 X2 probs
1 bus car 0 . 0 8 3 3 3 3 3 3
2 car bus 0 . 0 8 3 3 3 3 3 3
3 bus bike 0 . 0 8 3 3 3 3 3 3
4 bike bus 0 . 0 8 3 3 3 3 3 3
5 bus t r a i n 0 . 0 8 3 3 3 3 3 3
6 train bus 0 . 0 8 3 3 3 3 3 3
7 car bike 0 . 0 8 3 3 3 3 3 3
8 bike car 0 . 0 8 3 3 3 3 3 3
9 car t r a i n 0 . 0 8 3 3 3 3 3 3
10 t r a i n car 0 . 0 8 3 3 3 3 3 3
11 bike train 0 . 0 8 3 3 3 3 3 3
12 train bike 0 . 0 8 3 3 3 3 3 3
> Prob( p r o b s p a c e ( ev ) , X2 == " bike " ) # 3 of 12 c a s e s = 1 / 4
[1] 0 .25
> f a c t o r i a l (3) # 3 * 2 * 1
[1] 6
> c h o o s e ( n = 10 , k = 2) # 10 ! / (2 ! * 8 ! ) = 10 * 9 / 2
[1] 45

3.2.2 Sample Space and Sampling from Urns

In R, the sample spaces can be represented by data frames or lists and may contain
empirical or simulated data. Random samples, including sampling from urns, can
be drawn from a set with the R base method sample(). The sample size can be
Table 3.2 Number of all possible samples of size k from a set of n objects. The sampling method
is specified by replacement and order
                     Ordered           Unordered
With replacement     n^k               (n + k − 1)!/{k!(n − 1)!}
Without replacement  n!/(n − k)!       \binom{n}{k} = n!/{k!(n − k)!}

chosen as the second argument in the function and the type of sampling can be either
with or without replacement:
• sampling with replacement:
sample(x, size = n, replace = TRUE, prob = NULL),
• sampling without replacement:
sample(x, n).

In general, there are four types of sampling, regarding replacement and order, which
are briefly presented in the following. The calculation rules for the number of possible
draws for a sample depend on the assumptions about the particular situation. All four
cases are outlined in Table 3.2.
In R the function nsamp is able to calculate the possible numbers of samples
drawn from an urn. The following code shows how all four cases from Table 3.2 are
applied when n = 10 and k = 2.
> r e q u i r e ( prob )
> nsamp(10 , 2 , replace = TRUE, ordered = TRUE ) # 10^2
[1] 100
> nsamp(10 , 2 , replace = TRUE, ordered = FALSE ) # 11 ! / (2 ! * 9 ! )
[1] 55
> nsamp(10 , 2 , replace = F A L S E , ordered= TRUE ) # 10 ! / 8 !
[1] 90
> nsamp(10 , 2 , replace = F A L S E , ordered = FALSE ) # 10 ! / (2 ! * 8 ! )
[1] 45

Ordered Sample
For several applications, the order of k experimental outcomes is decisive. Consider,
for example, the random selection of natural numbers. For the random selection of
a telephone number, both the replacement and the order of the digits are important.
The method urnsamples() from the prob package yields all possible samples
according to the sampling method. Consider the next example, where three elements
are taken from an urn of eight elements. Sampling with replacement is conducted
first, followed by sampling without replacement for comparison. Clearly the number
of samples is smaller if we do not replace the elements. This number can also be
computed with the counting tool nsamp() introduced in the last section.
> r e q u i r e ( prob )
> urn1 = urnsamples( x = 1:3 , # all e l e m e n t s
+ size = 2, # num of s e l e c t e d e l e m e n t s
+ replace = TRUE, # with r e p l a c e m e n t
+ ordered = TRUE ) # ordered
> urn1 # all p o s s i b l e draws
X1 X2
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
> dim ( urn1 ) # d i m e n s i o n of the m a t r i x
[1] 9 2
> urn2 = urnsamples( x = 1:3 ,
+ size = 2,
+ replace = F A L S E , # without replacement
+ ordered = TRUE ) # ordered
> dim ( urn2 ) # d i m e n s i o n of the m a t r i x
[1] 6 2

Unordered Sample
In the simple case of drawing balls from an urn, the order in which the balls are
drawn is rarely relevant. For a lottery, for example, it is only relevant whether a
certain number is included in the winning sample or not. When conducting a survey
or selecting participants, the order of the selection is generally irrelevant. Having
created the sample space, a sample can be drawn, which leaves the question about
the replacement. The researcher has to decide what fits best in this situation.
Note that in an unordered sample without replacement, the number of possible
samples is given by the binomial coefficient. Using the formula from Table 3.2, the
sample size can be checked and the probability of drawing a certain sample can be
calculated.
> r e q u i r e ( prob )
> urn3 = urnsamples( x = 1:3 ,
+ size = 2,
+ replace = TRUE, # with r e p l a c e m e n t
+ ordered = FALSE ) # not o r d e r e d
> dim ( urn3 ) # d i m e n s i o n s of the m a t r i x
[1] 6 2
> urn4 = urnsamples( x = 1:3 ,
+ size = 2,
+ replace = F A L S E , # without replacement
+ ordered = FALSE ) # not o r d e r e d
> urn4 # all p o s s i b l e draws
X1 X2
1 1 2
2 1 3
3 2 3
> probspace( urn4 ) # p r o b a b i l i t y space
X1 X2 probs
1 1 2 0 .3333333
2 1 3 0 .3333333
3 2 3 0 .3333333
The probability of obtaining a certain pair of values is one over the number of
possible pairs. For the case without replacement and ignoring order, each sample has
the probability 1/3 ≈ 0.3333. This number together with all 3 possible samples is
given when applying the method probspace() to urn4.
Beside these simple experiments, it is also useful to know that the number of
subsets of a set of n elements is 2n . Furthermore, there are n! possible ways of
choosing all n elements and rearranging them. This is the same thing as the number
of permutations of n elements. In case the sample size is the same as the number
of elements and replace = FALSE, the sampling can be seen as a random per-
mutation. If the sample space consists of all combinations of a number of factors,
the function expand.grid() from the base package can be used to generate a
data frame containing all combinations of these factors. The example below shows
all combinations of two variables specifying colour and number.
> e x p a n d . g r i d ( c o l o u r = c ( " red " , " blue " , " y e l l o w " ) , nr = 1:2)
c o l o u r nr
1 red 1
2 blue 1
3 yellow 1
4 red 2
5 blue 2
6 yellow 2

There are several ways to sample from a population. It matters for the number of
possible samples whether one arranges the elements or selects a subset from the
population. All different possibilities are illustrated in Fig. 3.3.
[Diagram for Fig. 3.3: Combinatorics branches into Arrangement (order is important) and
Selection (order is arbitrary). Arrangement covers the permutation (n! for different elements,
n!/(g_1! · · · g_r!) for r groups of identical elements) and the variation (n!/(n − k)! without,
n^k with replacement); Selection covers the combination (\binom{n}{k} without,
\binom{n+k−1}{k} with replacement).]
Fig. 3.3 The possible sample numbers for an urn model with n elements. For samples with assem-
bled elements with r groups for identical elements g j or k out of n selected different elements.
BCS_SamplesDiagram
3.2.3 Sampling Procedure

The examples above are very specific and restricted to a particular sample space.
Now we can address sampling from a more general perspective. Again, some ran-
dom selection mechanism is involved: the theory behind this is called probabilistic
sampling. Specific types of sampling are: simple random sampling, the equal prob-
ability selection method, probability-proportional-to-size, and systematic sampling.
Details can be found in Babbie (2013). In real applications, each member of a pop-
ulation can have different characteristics, i.e. the population is heterogeneous, and
one needs a sample large enough to study the characteristics of the whole population.
The idea is to find a sample which describes the population well. Yet, there is always
a risk of biased samples if the sampling method is not adequate, that is to say, if the
set of selected members is not representative of the population.
In the following example, it is assumed that a population consists of women and
men in a ratio of 1 : 1. In order to test this assumption about the ratio, a sample is
drawn.
> # s e t . s e e d (18) # set the seed, see Chap. \ ,9
> popul = d a t a . f r a m e (
+ g e n d e r = rep ( c ( " f " , " m " ) , each = 500) ,
+ grade = s a m p l e (1:10 , 1000 , replace = TRUE ))
> head( popul ) # first 6 r o w s of m a t r i x
gender grade
1 f 9
2 f 8
3 f 10
4 f 1
5 f 1
6 f 6
> table ( popul [ , 1]) # true p r o p o r t i o n
f m
500 500
> table ( s a m p l e ( popul [ , 1] , 10)) # d r a w s a m p l e of 10
f m
3 7

In this example, a simple random sample was drawn, which was too small to capture
the true ratio. For more sophisticated sampling methods in R, the package sampling
can be used. It contains methods for stratified sampling, which divides the population
into subgroups and samples from each of them. The corresponding R function is strata(). Its argu-
ment, stratanames, specifies the variable that is used to identify the subgroups.
> require ( sampling )
> s t r a t a ( data, s t r a t a n a m e s = NULL, size,
+ m e t h o d = c ( " s r s w o r " , " srswr " , " p o i s s o n " , " s y s t e m a t i c " ) ,
+ pik, description = FALSE )

The two methods, srswor and srswr, denote simple random sampling without
and with replacement, respectively. In the example below, a sample of six persons
each is taken from the female and male students without replacement.
The function getdata() extracts data from a dataset according to a vector of
selected units or a sample data frame. Here, we use the sample data frame created
by the function strata() to extract the grades for the sample students from our
dataset.
A simple tool of analysis is the function aggregate(), which is used to calcu-
late summary statistics for subsets of data. It is applied below to calculate the mean
of the grades in the sample for each gender. Note, that the subsets need to be given
as a list.
> require ( sampling )
> st = strata( p o p u l ,
+ stratanames = " g e n d e r " , # take 6 s a m p l e s of e a c h g e n d e r
+ size = c (6 , 6) ,
+ method = " srswor ")
> dataX = getdata( p o p u l , m = st ) # e x t r a c t the s a m p l e
> dataX
g r a d e g e n d e r ID _ unit Prob S t r a t u m
98 8 f 98 0 .012 1
114 5 f 114 0 .012 1
288 1 f 288 0 .012 1
392 5 f 392 0 .012 1
411 7 f 411 0 .012 1
421 6 f 421 0 .012 1
532 2 m 532 0 .012 2
619 9 m 619 0 .012 2
667 7 m 667 0 .012 2
771 3 m 771 0 .012 2
952 1 m 952 0 .012 2
968 3 m 968 0 .012 2
> aggregate( dataX$grade,               # mean grade by gender
+            by = list( dataX$gender ),
+            FUN = mean )
  Group.1        x
1       f 5.333333
2       m 4.166667

To test whether these results support our expectations of equal grades for each gender,
we would need some functions for statistical testing discussed in Sect. 5.2.2. Applying
a t-test, we would find that the results are indeed supportive.
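A minimal sketch of such a test, reusing the sample dataX from above (the exact output
depends on the random sample drawn):
> t.test(grade ~ gender, data = dataX)   # H0: equal mean grades for f and m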

3.2.4 Random Variables

The outcomes of a probabilistic experiment might be described by a random variable
X. All possible events ω are elements of Ω, the event space. These outcomes of a prob-
abilistic experiment are associated with distinct values x j of X , for j = {1, . . . , k}.

Definition 3.1 A real valued random variable (rv) X on the probability space
(Ω, F, P), is a real valued function X(ω) defined on Ω, such that for every Borel
subset B of the real numbers

{ω : X (ω) ∈ B} ∈ F.
The probability function P assigns a probability to each event, for a detailed discus-
sion, see Ash (2008).
For the probabilistic experiment of tossing a fair coin, Ω = {H, T}, the rv X is
defined as X = 1 if head H shows up, and X = 0 if tail T shows up.

There are two types of rvs: discrete and continuous. This distinction is very important
for their analysis.
Definition 3.2 An rv X is said to be discrete if the possible distinct values x j of X
are either countably infinite or finite.
The distribution of a discrete rv is described by its probability mass function f (x j )
and the cumulative distribution function F(x j ):
Definition 3.3 The probability mass function (pmf) of a discrete rv X is a function
that returns the probability that the rv X is exactly equal to some value,

f(x_j) = P(X = x_j).

Definition 3.4 The cumulative distribution function (cdf) is defined for ordinally
scaled variables (variables with a natural order) and returns the probability that the
rv X is smaller than or equal to some value:

F(x j ) = P(X ≤ x j ).

The outcomes of tossing a fair coin can be mapped by a discrete rv with finite distinct
values. Randomly drawing a person and recording his or her number of descendants
can be described by a discrete rv with countably infinite distinct values.
An rv X has an expectation E X and a variance Var X (also called the first moment
and the second central moment of X , respectively). The definition of these moments
differs for discrete and continuous rvs.
Definition 3.5 Let X be a discrete rv with distinct values {x1 , . . . , xk } and probability
function P(X = x j ) ∈ [0, 1] for j ∈ {1, . . . , k}. Then the expectation (expected
value) of X is defined to be


E X = \sum_{j=1}^{k} x_j P(X = x_j).    (3.1)

For infinitely many possible outcomes, the finite sum becomes an infinite sum. The
expectation is not defined for every rv. An example for continuous rvs is the Cauchy
distribution, introduced in Sect. 4.5.2.
Definition 3.6 Let X be a discrete rv with distinct values {x1 , . . . , xk } and probability
function P(X = x j ) ∈ [0, 1] for j ∈ {1, . . . , k}. Then the variance of X is defined
to be
Var X = E(X − E X)^2 = \sum_{j=1}^{k} (x_j − E X)^2 P(X = x_j).    (3.2)

As for the expectation, the variance is not defined for every rv. The variance measures
the expected dispersion of an rv around its expected value. Deterministic variables
have a variance equal to zero.
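As a small illustration (not part of the original text), (3.1) and (3.2) can be evaluated
directly in R, here for a fair six-sided die:
> x    = 1:6                           # distinct values x_j
> px   = rep(1 / 6, 6)                 # probabilities P(X = x_j)
> EX   = sum(x * px); EX               # expectation, 3.5
> VarX = sum((x - EX)^2 * px); VarX    # variance, 35/12 = 2.9167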
Definition 3.7 An rv X is said to be continuous if the possible distinct values x j are
uncountably infinite.
For a continuous rv, the probability density function (pdf) describes its distribution
(see Definition 4.2). Randomly selecting a person and measuring his or her weight is a
typical example of a probabilistic experiment which can be described by a continuous rv.
In the following, the most prominent discrete rvs and their probability mass func-
tions are introduced. Continuous rvs and their properties are covered in Chap. 4.

3.3 Binomial Distribution

One of the basic probability distributions is the binomial. Examples of this distrib-
ution can be observed in daily life: whether we are tossing a coin to obtain heads
or tails, or trying to score a goal in a football game, we are dealing with a binomial
distribution.

3.3.1 Bernoulli Random Variables

A Bernoulli experiment is a random experiment with two outcomes: success or
failure. Let p denote the probability of the success of each trial and let the rv X be
equal to 1 if the outcome is a success and 0 if a failure. Then the probability mass
function of X is

P(X = 0) = 1 − p,
P(X = 1) = p

and the rv X is said to have a Bernoulli distribution. The expected value and variance
of a Bernoulli rv are E X = p and Var X = p(1 − p).
To derive these results, just apply (3.1) and (3.2). The expectation is then derived
as follows:

E X = P(X = 0) · 0 + P(X = 1) · 1 = (1 − p) · 0 + p = p.

For the variance, the derivation looks as follows:

Var X = P(X = 0)(0 − p)^2 + P(X = 1)(1 − p)^2
      = (1 − p) p^2 + p (1 − p)^2
      = p^2 − p^3 + p − 2p^2 + p^3
      = p(1 − p).

Example 3.1 Consider a box containing two red marbles and eight blue marbles.
Let X = 1 if the drawn marble is red and 0 otherwise. The probability of randomly
selecting a red marble in a single draw, and hence the expectation of X, is E X = P(X = 1) =
1/5 = 0.2. The variance of X is Var X = 1/5 (1 − 1/5) = 4/25.
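A minimal simulation sketch for Example 3.1 (not from the text): a Bernoulli rv is a
binomial rv with size = 1, so rbinom() can be used to draw marbles; the seed is arbitrary.
> set.seed(42)                                     # arbitrary seed, for reproducibility
> draws = rbinom(n = 10000, size = 1, prob = 0.2)  # 1 = red marble, 0 = blue marble
> mean(draws)                                      # close to E X = 0.2
> var(draws)                                       # close to Var X = 4/25 = 0.16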

3.3.2 Binomial Distribution

A sequence of Bernoulli experiments performed n times is called a binomial exper-
iment and satisfies the following requirements:
1. there are only two possible outcomes for each trial;
2. all the trials are independent of each other;
3. the probability of each outcome remains constant;
4. the number of trials is fixed.
The following example illustrates a binomial experiment. When tossing a fair coin
three times, eight different possible results may occur, each with equal probability
1/8. Let X be the rv denoting the number of heads obtained in these three tosses. The
sample space Ω and possible values for X are listed in Table 3.3. One is often inter-
ested in the total number of successes and the corresponding probabilities, instead
of the outcomes themselves or the order in which they occur. For the above case, it
follows that
P(X = 0) = |{TTT}| / |Ω| = 1/8;     P(X = 2) = |{HHT, HTH, THH}| / |Ω| = 3/8;

Table 3.3 Sample space and X for tossing a coin three times
Outcome HHH HHT HTH THH HTT THT TTH TTT
Value of X 3 2 2 2 1 1 1 0
P(X = 3) = |{HHH}| / |Ω| = 1/8;     P(X = 1) = |{HTT, THT, TTH}| / |Ω| = 3/8.

The same results can be computed from the binomial mass function below by
setting n = 3 and p = 1/2.
Definition 3.8 The binomial distribution is the distribution of an rv X for which
 
P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x},   x ∈ {0, 1, 2, . . . , n},    (3.3)

where n is the number of all trials, x is the number of successful outcomes, p is the
probability of success, and \binom{n}{x} is the number of possibilities of n outcomes leading to x
successes and n − x failures.
The binomial distribution can be used to define the probability of obtaining exactly x
successes in a sequence of n independent trials. We will denote binomial distributions
by B(x; n, p). For the example above, the rv X follows the binomial distribution
B(x; 3, 0.5), or X ∼ B(3, 0.5). The expectation of a binomial rv is E X = μ = np,
which is the expected number of successes x in n trials. The variance is Var X =
σ 2 = np(1 − p).
Example 3.2 Continuing with marbles, we randomly draw ten marbles one at a
time, while putting it back each time before drawing again. What is the probability
of drawing exactly two red marbles?
Here, the number of draws n = 10 and we define getting a red marble as a success
with p = 0.2 and x = 2. Hence X ∼ B(10, 0.2) and

P(X = x) = \binom{10}{x} 0.2^x (1 − 0.2)^{10−x},   x = 0, 1, 2, . . . , 10,

and μ = 2 , σ 2 = 1.6. For x = 2 we obtain P(X = 2) = 0.30199.


In R, the command dbinom() creates a probability mass function (3.3). The desired
probability of the example above is obtained as follows.
> dbinom (x = 2, # p r o b a b i l i t y mass f u n c t i o n at x = 2
+ size = 10 , # n u m b e r of t r i a l s
+ prob = 0 .2 ) # p r o b a b i l i t y of s u c c e s s
[1] 0 . 3 0 1 9 9

Furthermore, one can use dbinom() to calculate the probability of each outcome
(Fig. 3.4).
Fig. 3.4 Probability mass function of the binomial distribution with number of trials n = 10 and probability of success p = 0.2. BCS_Binhist

> d b i n o m ( x = 0:10 , # p r o b a b i l i t y at 0 , 1 , ..., 10


+ size = 10 , # n u m b e r of t r i a l s
+ prob = 0 .2 ) # p r o b a b i l i t y of s u c c e s s
[1] 0 . 1 0 7 3 7 4 0 . 2 6 8 4 3 5 0 . 3 0 1 9 8 9 0 . 2 0 1 3 2 6 0 . 0 8 8 0 8 0
[6] 0 . 0 2 6 4 2 4 0 . 0 0 5 5 0 5 0 . 0 0 0 7 8 6 0 . 0 0 0 0 7 4 0 . 0 0 0 0 0 4
[11] 0 . 0 0 0 0 0 0

The Cumulative Distribution Function


The cdf of a binomial distribution for discrete variables is defined as

F_X(x) = P(X ≤ x) = \sum_{i=0}^{x} \binom{n}{i} p^i (1 − p)^{n−i}.

It is implemented in R by pbinom().

Example 3.3 Continuing Example 3.2, consider the probability of drawing two or
less red marbles. Let n = 10, p = 0.2 and x = 2, then

F_X(2) = P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)
       = \sum_{i=0}^{2} \binom{10}{i} 0.2^i (1 − 0.2)^{10−i} = 0.6777995.

> pbinom( q = 2, size = 10, prob = 0.2 )
[1] 0.6777995

Equivalently, the result can be obtained using dbinom().


> sum ( d b i n o m ( x = 0:2 , size = 10 , prob = 0 .2 ))
[1] 0 . 6 7 7 7 9 9 5
Fig. 3.5 The binomial cumulative distribution function with n = 10, p = 0.2 and p = 0.6. BCS_Bincdf

The probability that three or four red marbles are drawn is F(4) − F(2), see Fig. 3.5.
> pbinom(4, size = 10, prob = 0.2) - pbinom(2, size = 10, prob = 0.2)
[1] 0.289407                             # F(4) - F(2)
The quantile function F^{−1} is implemented by qbinom(), which returns the smallest
value x such that F_X(x) ≥ p, e.g. for p = 0.95:
> qbinom( p = 0.95, size = 10, prob = 0.2 )
[1] 4

3.3.3 Properties

A binomial distribution is symmetric if the probability of each trial is p = 0.5. Given
two independent binomial rvs X ∼ B(n, p) and Y ∼ B(m, p), the sum of the two rvs also follows
a binomial distribution X + Y ∼ B(n + m, p) with expectation (m + n) p. Intuitively,
n independent Bernoulli experiments and another m independent Bernoulli experi-
ments again follow a binomial distribution: that of (n + m) independent Bernoulli
experiments. The De Moivre–Laplace CLT ensures that the binomial rv converges
in distribution to the normal distribution (see Sect. 4.3), i.e.

Z = \frac{X − np}{\sqrt{np(1 − p)}} \xrightarrow{L} N(0, 1)

as n → ∞. Different textbooks recommend various rules for values of n and p that
will make the normal distribution (see Sect. 4.3) a good approximation for the bino-
mial distribution (see Figs. 3.6 and 3.7). Values fulfilling np > 5 and n(1 − p) > 5
already produce satisfying results (Brown et al. (2001)). Since the binomial distribu-
tion is a discrete distribution and the normal distribution is continuous, the correction
Fig. 3.6 Probability mass function of B(n, p) for different n and p: n = 5, 10, 100 (rows) and p = 0.1, 0.5, 0.9 (columns). BCS_Binpdf

for continuity requires adding or subtracting 0.5 from the values of the discrete bino-
mial rv. Furthermore, the binomial distribution can approach other distributions in
the limit. If n → ∞ and p → 0 with finite np, the limit of the binomial distribution
is the Poisson distribution, see Sect. 3.6. A hypergeometric distribution can also be
obtained from the binomial distribution under certain conditions, see Sect. 3.5.
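A short sketch of the correction for continuity (the values are chosen only for
illustration, matching Fig. 3.7): for X ∼ B(100, 0.5), the probability P(X ≤ 45) is
approximated by the normal cdf evaluated at 45 + 0.5.
> n = 100; p = 0.5
> pbinom(45, size = n, prob = p)                             # exact binomial cdf
> pnorm(45 + 0.5, mean = n * p, sd = sqrt(n * p * (1 - p)))  # with continuity correction
> pnorm(45,       mean = n * p, sd = sqrt(n * p * (1 - p)))  # without correction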

3.4 Multinomial Distribution

A binomial experiment always generates two possible outcomes (success or failure)
at each trial. When tossing a die, the probability of getting the number six is 1/6,
and there is a 5/6 chance of rolling any other number. But if we toss several dice
Fig. 3.7 Probability of the binomial distribution versus the normal distribution: B(100, 0.5) vs. N(50, 25). BCS_Binnorm

together each time, what is the probability of getting only a certain number for all
dice?
Definition 3.9 Suppose a random experiment is independently repeated n times, so
that it returns each time one of the fixed k possible outcomes with the probabilities
p1 , p2 , . . . , pk . An example of the multinomial distribution arises as the distribution
of a vector of rvs X = (X_1, X_2, . . . , X_k), where each X_i denotes the number of
occurrences of outcome i and for which

P(X_1 = x_1, X_2 = x_2, . . . , X_k = x_k) = \frac{n!}{x_1! x_2! \cdots x_k!} p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k},    (3.4)

where p_1, p_2, . . . , p_k > 0, \sum_{i=1}^{k} p_i = 1, \sum_{i=1}^{k} x_i = n, and each x_i is a nonnegative integer.
When k = 3, the corresponding distribution is called the trinomial distribution.
For k = 2, we get the binomial distribution discussed above. The example below
illustrates how to use the formula to calculate the probability in the multinomial case.

Example 3.4 Suppose we had a box with two red, three green, and five blue mar-
bles. We randomly draw three marbles with replacement. What is the probability of
drawing one marble of each colour?
Here the realizations of the rv are x1 = 1, x2 = 1, x3 = 1, and the corresponding
probabilities are p_1 = 0.2, p_2 = 0.3, p_3 = 0.5. Therefore, according to (3.4), the
desired probability is P(X_1 = 1, X_2 = 1, X_3 = 1) = \frac{3!}{1!\,1!\,1!} 0.2^1 0.3^1 0.5^1 = 0.18.
In R, this can be calculated as follows.
> dmultinom( x = c(1, 1, 1),             # values of the multinomial rvs
+            prob = c(0.2, 0.3, 0.5))    # success probabilities
[1] 0.18

Each specific rv X i , for i = 1, 2, . . . , k, follows a binomial distribution, thus the
expectation and the variance are E X i = npi and Var X i = npi (1 − pi ), respectively.
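A minimal simulation sketch (not from the text) for Example 3.4: rmultinom() draws
from the multinomial distribution, and the empirical means approximate E X_i = n p_i;
the seed is arbitrary.
> set.seed(1)                                   # arbitrary seed
> p = c(0.2, 0.3, 0.5)                          # red, green, blue
> X = rmultinom(n = 10000, size = 3, prob = p)  # each column is one experiment
> rowMeans(X)                                   # close to 3 * p = 0.6, 0.9, 1.5
> mean(colSums(X == 1) == 3)                    # share of draws with one marble of
>                                               # each colour, close to 0.18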
3.5 Hypergeometric Distribution

In the typical ‘6 from 49’ lottery, 6 numbers from 1 to 49 are chosen without replace-
ment. Every time one number is drawn, the chances of the remaining numbers to
be chosen will change. This is an example of a hypergeometric experiment, which
satisfies the following requirements:
1. a sample is randomly selected without replacement from a population;
2. each element of the population is from one of two different groups which can
also be defined as success and failure.
Because the sample is drawn without replacement, the trials in the hypergeometric
experiment are not independent and the probability of success keeps changing from
draw to draw. This differs from the binomial and multinomial distributions.
Definition 3.10 An rv X from a hypergeometric experiment follows the hypergeo-
metric distribution H (n, M, N ), which has the probability function
P(X = x) = \frac{\binom{M}{x} \binom{N−M}{n−x}}{\binom{N}{n}},   x = 0, 1, . . . , min{M, n},    (3.5)

where N is the size of the population, n is the size of the sample, M is the number
of successes in the population and x is the number of successes in the sample.
In (3.5), the probability of exactly x successes in n trials of a hypergeometric exper-
iment is given. The following example illustrates this distribution.
Example 3.5 Having a box with 20 marbles, including 10 red and 10 blue marbles,
we randomly select 6 marbles without replacement. The probability of getting two red
marbles can be calculated using the H (6, 10, 20) distribution. Here the experiment
consists of 6 trials, so n = 6, and there are 20 marbles in the box, so N = 20. Now,
M = 10, since there are 10 red marbles inside, of which two should be selected, so
x = 2. Then

P(X = 2) = \frac{\binom{10}{2} \binom{20−10}{6−2}}{\binom{20}{6}} = 0.2438.

This example has a straightforward solution in R.


> dhyper( x = 2,    # number of successes in sample
+         m = 10,   # number of successes in population
+         n = 10,   # number of failures in population
+         k = 6)    # sample size
[1] 0.243808

The binomial distribution is a limiting form of the hypergeometric distribution, which
pops up when the population size is very large compared to the sample size. In
this case, it is possible to ignore the ‘no replacement’ problem and approximate a
hypergeometric distribution by the binomial distribution.
Fig. 3.8 Probability functions of the hypergeometric (lines) versus binomial distribution (dots): H(6, 10, 20) vs. B(6, 0.5) (left) and H(6, 10, 200) vs. B(6, 0.05) (right). BCS_Binhyper

Example 3.6 In Example 3.5, if there are 500 marbles inside the box including 10 red
marbles, what is the probability of drawing two red marbles out of 6 draws without
replacement?

X ∼ H(6, 10, 500),
P(X = 2) = \frac{\binom{10}{2} \binom{500−10}{6−2}}{\binom{500}{6}} = 0.00507.

Or it can be approximated by using a binomial distribution with p = M/N = 0.02:

X ∼ B(6, 0.02),
P(X = x) = \binom{6}{x} 0.02^x (1 − 0.02)^{6−x},
P(X = 2) ≈ 0.00553,

which somewhat overestimates the actual value.


The distributions are compared in Fig. 3.8, where the points denote the binomial
probability and the vertical lines show the hypergeometric probability. The expec-
tation and variance of the rv X are E X = n M/N and Var X = n M(N − M)(N −
n)/{N^2 (N − 1)}, respectively.
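The numbers in Example 3.6 can be verified directly in R (a small sketch; note that in
dhyper() the argument n denotes the number of failures in the population, here 500 − 10 = 490):
> dhyper(x = 2, m = 10, n = 490, k = 6)   # exact hypergeometric, about 0.00507
> dbinom(x = 2, size = 6, prob = 0.02)    # binomial approximation, about 0.00553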
3.6 Poisson Distribution

Another important distribution is the Poisson distribution, discovered and published
by Simeon Denis Poisson in 1837, see Poisson (1837). Later, in 1898, Ladislaus von
Bortkiewicz made a practical application and showed that events with low frequency
in a large population follow a Poisson distribution even when the probabilities of the
events vary, see von Bortkewitsch (1898), Härdle and Vogt (2014).
The Poisson distribution is closely related to the binomial distribution. The exam-
ple of tossing a die has already been introduced for the binomial distribution, but if
we simultaneously toss 100 or 1000 unusual dice with 100 sides each, how can we
calculate the probability of a certain number of dice showing the same face? First,
we derive the Poisson distribution by taking an approximate limit of the binomial
distribution.
Suppose that X is a B(n, p) variable, then the probability mass function is

P(X = x) = \frac{n!}{x!(n − x)!} p^x (1 − p)^{n−x},   x = 0, 1, 2, . . . , n.

Let λ = np, then

P(X = x) = \frac{n!}{x!(n − x)!} (λ/n)^x (1 − λ/n)^{n−x}    (3.6)
         = \frac{n!}{x!(n − x)!} \frac{λ^x}{n^x} \frac{(1 − λ/n)^n}{(1 − λ/n)^x}
         = \frac{n!}{n^x (n − x)!} \frac{λ^x}{x!} \frac{(1 − λ/n)^n}{(1 − λ/n)^x}.

If n is large and p is small, then

n!/{n^x (n − x)!} = n(n − 1) · · · (n − x + 1)/n^x ≈ 1,
(1 − λ/n)^n ≈ exp(−λ),
(1 − λ/n)^x ≈ 1.

Hence (3.6) becomes


P(X = x) ≈ exp(−λ) · \frac{λ^x}{x!}.
Eventually, the limiting distribution of the binomial is the Poisson distribution. The
comparison is plotted in Fig. 3.9, where the points denote the binomial probability
and the vertical lines show the Poisson probability. All the trials are independent and
the probability of ‘success’ is low and equal from trial to trial, while at the same
Fig. 3.9 Probabilities of the Poisson (lines) and the binomial distribution (dots): Pois(10) vs. B(100, 0.1). BCS_Binpois

time the total number of trials should be very large. As a rule of thumb, if p ≤ 0.1,
n ≥ 50 and np ≤ 5, the approximation is sufficiently close.

Definition 3.11 An rv X follows a Poisson distribution with parameter λ, denoted
as Pois(λ), if

P(X = x) = exp(−λ) · \frac{λ^x}{x!},   x = 0, 1, 2, . . . ,  λ > 0.    (3.7)

The expectation and the variance of a Poisson rv X are E X = λ and Var X = λ. Note
that this does not hold for the binomial distribution.
Example 3.7 If a typist makes on average one typographical error per page, what is
the probability of exactly two errors in a one-page text?
Assume that the number of typographical errors per page is an rv X ∼ B(n, p).
Here n is the number of words per page and p the probability of a word’s containing
a typographical error. It is plausible to assume that n is large and p small, which is
sufficient for X to follow the Poisson distribution with λ = E X = 1. Using (3.7),
the probability of two typographical errors on the same page P(X = 2) is

P(X = 2) = exp(−1) · \frac{1^2}{2!} = 0.184.

> dpois ( x = 2 , l a m b d a = 1) # pdf


[1] 0 . 1 8 3 9 3 9 7

The parameter λ is also called the intensity, which is motivated by the fact that λ
describes the expected number of events within a given interval.
Example 3.8 The Prussian horsekick fatality dataset from Ladislaus von Bortkiewicz
(Quine and Seneta 1987) gives us the number of soldiers killed by horsekick in 10
Fig. 3.10 The original Prussian horsekick dataset

similar corps of the Prussian military over a period of 20 years. Observations on
accidents over a total of 200 corps-years are available, as well as λ = 0.61 estimated
from the 200 observations. What is the probability that there was exactly one fatal
horsekick over 20 years in a corps? Here we will use both the binomial distribution and
the Poisson distribution to calculate the probability, in order to compare the results and
see their approximate equality. Since n = 200 and λ = 0.61, p = λ/n = 0.00305
and the probability can be calculated as follows (the original data are shown in Fig. 3.10).

X ∼ Pois(0.61),
P(X = 1) = exp(−0.61) · \frac{0.61^1}{1!} = 0.33144,

X ∼ B(200, 0.00305),
P(X = 1) = \binom{200}{1} 0.00305^1 (1 − 0.00305)^{200−1} = 0.33215.

> n = 200
> l a m b d a = 0 .61
> p = lambda / n
> d b i n o m ( x = 1 , size = n, prob = p ) # b i n o m i a l pdf
[1] 0 . 3 3 2 1 4 8 3
> dpois ( x = 1 , l a m b d a = l a m b d a ) # P o i s s o n pdf
[1] 0 . 3 3 1 4 4 4
3.6.1 Summation of Poisson Distributed Random Variables

Let X i ∼ Pois(λi ) be independent and Poisson distributed rvs with parameters
λ1 , λ2 , . . . , λn . Then their sum is an rv also following the Poisson distribution with
λ = λ_1 + λ_2 + . . . + λ_n:

X_i ∼ Pois(λ_i), i = 1, . . . , n,   implies   \sum_{i=1}^{n} X_i ∼ Pois(λ_1 + λ_2 + . . . + λ_n).

This feature of the Poisson distribution is very useful, since it allows us to combine
different Poisson experiments by summing the rates. Furthermore, for two independent Poisson
rvs X and Y, the conditional distribution (see Sect. 6.1) of Y given X + Y is binomial with the
probability parameter λ_Y/(λ_X + λ_Y), see Bolger and Harkness (1965) for more details
(Fig. 3.11).

If X ∼ Pois(λ X ), Y ∼ Pois(λY ), then (X + Y ) ∼ Pois(λ X + λY )

and Y |(X + Y ) ∼ B{x + y, λY /(λ X + λY )}.
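A minimal simulation sketch (not from the text): the sum of two independent Poisson rvs
with rates 0.5 and 2.5 behaves like a Pois(3) rv; the seed is arbitrary.
> set.seed(7)                                  # arbitrary seed
> x = rpois(10000, lambda = 0.5) + rpois(10000, lambda = 2.5)
> mean(x); var(x)                              # both close to λ = 3
> dpois(0:3, lambda = 3)                       # theoretical probabilities
> table(x)[as.character(0:3)] / 10000          # empirical relative frequencies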

Fig. 3.11 Probability mass functions of the Poisson distribution for different λ (λ = 0.5, 2.5, 5 and 25). BCS_Poispdf
The Poisson distribution belongs to the exponential family of distributions. An rv
X follows a distribution belonging to the exponential family if its probability mass
function with a single parameter θ has the form

P(X = x) = h(x)g(θ ) exp{η(θ)t (x)}.

For the Poisson distribution with θ = λ, h(x) = 1/x!, g(λ) = exp(−λ), η(λ) = log λ and t(x) = x. Other
popular distributions, such as the normal, exponential, gamma, χ 2 and Bernoulli,
belong to the exponential family and are discussed in Chap. 4, except for the last.
This condition can be extended to multidimensional problems. Furthermore, if we
standardise a Poisson rv X , the limiting distribution of this standardised variable
follows a standard normal distribution.
(X − λ)/\sqrt{λ} \xrightarrow{L} N(0, 1),   as λ → ∞.
Chapter 4
Univariate Distributions

Everybody believes in the exponential law of errors [i.e., the
normal distribution]: the experimenters, because they think it
can be proved by mathematics; and the mathematicians,
because they believe it has been established by observation.

— Henri Poincaré, “Calcul des Probabilités”

In this chapter, the theory of discrete random variables from Chap. 3 is extended to
continuous random variables. At first, we give an introduction to the basic definitions
and properties of continuous distributions in general. Then we elaborate on the normal
distribution and its key role in statistics. Finally, we exposit in detail several other
key distributions, such as the exponential and χ2 distributions.

4.1 Continuous Distributions

Continuous random variables (see Definition 3.7) can take on an uncountably infinite
number of possible values, unlike discrete random variables, which take on either a
finite or a countably infinite set of values. These random variables are characterised
by a distribution function and a density function.

Definition 4.1 Let X be a continuous random variable (rv). The mapping FX : R →
[0, 1], defined by FX (x) = P(X ≤ x), is called the cumulative distribution function
(cdf) of the rv X .

Definition 4.2 A mapping f_X : R → R+ is called the probability density function
(pdf) of an rv X if f_X(x) = ∂F_X(x)/∂x exists for all x and \int_{−∞}^{∞} f_X(x) dx exists and takes
on the value one.

The cdf of an rv X can now be written as

F_X(x) = \int_{−∞}^{x} f_X(u) du.

F_X(b) = \int_{−∞}^{b} f_X(x) dx is the probability that the rv X is less than a given value b,
and F_X(b) − F_X(a) = \int_{a}^{b} f_X(x) dx is the probability that it lies in (a, b).

4.1.1 Properties of Continuous Distributions

As in Chap. 3, the elementary properties of the probability distribution of a continuous
rv can be described by an expectation and a variance. The only difference from
discrete rvs is in the use of integrals, rather than sums, over the possible values of
the rv.

Definition 4.3 Let X be a continuous rv with a density function f X (x). Then the
expectation of X is defined as
E X = \int_{−∞}^{∞} x f_X(x) dx.    (4.1)

The expectation exists if (4.1) is absolutely convergent. It describes the location (or
centre of gravity) of the distribution.

Definition 4.4 If the expectation of X exists, the variance is defined as


Var X = \int_{−∞}^{∞} (x − E X)^2 f_X(x) dx,    (4.2)


and the standard deviation is σ_X = \sqrt{Var X}.

The variance describes the variability of the variable and exists if the integral in (4.2)
is absolutely convergent.
Other useful characteristics of a distribution are its skewness and excess kurtosis.
The skewness of a probability distribution is defined as the extent to which it deviates
from symmetry. One says that a distribution has negative skewness if the left tail is
longer than the right tail of the distribution, so that there are more values on the right
side of the mean, and vice versa for positive skewness.

Definition 4.5 The skewness of an rv X is defined as


 
S(X) = E{(X − E X)^3} / σ_X^3.

The kurtosis is a measure of the peakedness of a probability distribution. The excess
kurtosis is used to compare the kurtosis of a pdf with the kurtosis of the normal
distribution, which equals 3. Distributions with negative or positive excess kurtosis
are called platykurtic distributions and leptokurtic distributions, respectively.
4.1 Continuous Distributions 111

Definition 4.6 The excess kurtosis of an rv X is defined as


 
K(X) = E{(X − E X)^4} / σ_X^4 − 3.

A complete and unique characterization of the distribution of an rv X is given by its
characteristic function (cf).
Definition 4.7 For an rv X ∈ R with a pdf f (x), the cf is defined as
φ_X(t) = E{exp(itX)} = \int_{−∞}^{∞} exp(itx) f_X(x) dx.

Unlike discrete distributions, where the characteristic function is also the moment-
generating function, the moment-generating function for continuous distributions is
defined as the characteristic function evaluated at −it. The argument of the distrib-
ution is t, which might live in real or complex space.

Definition 4.8 For X ∈ R with a pdf f (x), the moment-generating function is
defined as

M_X(t) = φ_X(−it) = E{exp(tX)} = \int_{−∞}^{∞} exp(tx) f_X(x) dx.

If the cf is absolutely integrable, then the cdf F_X is absolutely continuous and the
pdf of X is also given by

f_X(x) = \frac{1}{2π} \int_{−∞}^{∞} exp(−itx) φ_X(t) dt.

Definition 4.9 For any univariate distribution F, and for 0 < p < 1, the quantity

F −1 ( p) = inf{x : F(x) ≥ p}

is called the theoretical pth quantile or fractile of F, usually denoted as ξ p and the
F −1 is called the quantile function.

In particular, ξ_{1/2} is called the theoretical median of F. The quantile function is
nondecreasing and left-continuous and satisfies the following inequalities:
(i) F^{−1}{F(x)} ≤ x, −∞ < x < ∞,
(ii) F{F^{−1}(t)} ≥ t, 0 < t < 1,
(iii) F(x) ≥ t if and only if x ≥ F^{−1}(t).
4.2 Uniform Distribution

The continuous uniform distribution, or rectangular distribution, is a family of symmetric
probability distributions such that for each member of the family, all intervals of the
same length on the distribution’s support are equally probable. The support is defined
by the two parameters, a and b, which are its minimum and maximum values. The
distribution is often abbreviated U (a, b).
Definition 4.10 An rv U with the pdf
f(x, a, b) = 1/(b − a)   for a ≤ x ≤ b,   and 0 else,

is said to be uniformly distributed between a and b, and is written as U ∼ U (a, b).


An arbitrary rv X ∼ F can be converted to a uniform distribution via the probability
integral transform:
Definition 4.11 (Probability Integral Transform) Suppose U ∼ U(0, 1). Then the
rv X = F^{−1}(U) has the cdf F, i.e. X ∼ F, and, for continuous F, F(X) ∼ U(0, 1).
This method can be extended to the discrete case and works even in case of disconti-
nuities in F(x). It is also very often used in the simulation from various distributions,
see Chap. 9.
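A minimal sketch of the probability integral transform (not from the text), using the
exponential distribution (see Chap. 4) as the target; the seed is arbitrary.
> set.seed(123)                      # arbitrary seed
> u = runif(1000)                    # U ~ U(0, 1)
> x = qexp(u, rate = 2)              # X = F^{-1}(U) follows Exp(2)
> ks.test(x, "pexp", 2)$p.value      # large p-value: consistent with the Exp(2) cdf
> hist(pexp(x, rate = 2))            # F(X) is again approximately uniform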
Distribution function and properties of the uniform distribution
The cdf of X ∼ U (a, b) is
F(x, a, b) = 0                 for x < a,
           = (x − a)/(b − a)   for a ≤ x ≤ b,
           = 1                 for x > b.

The expectation, variance, skewness and excess kurtosis coefficients are

E X = (a + b)/2,   Var X = (b − a)^2/12,   S(X) = 0,   K(X) = −6/5.    (4.3)
The cf is given by

φ_X(t, a, b) = {exp(itb) − exp(ita)} / {it(b − a)}.

In order to work with this distribution in R, there is a list of standard implemented
functions:

dunif(x, min, max), punif(q, min, max),
qunif(p, min, max), runif(n, min, max),
4.2 Uniform Distribution 113

which are for the pdf, the cdf, the quantile function and for generating random
uniformly distributed samples, respectively. The function dunif() also has the argument
log, which allows computation of the log density, useful in likelihood estimation.

4.3 Normal Distribution

The normal distribution is considered the most prominent distribution in statistics.
It is a continuous probability distribution that has a bell-shaped probability density
function, also known as the Gaussian function. The normal distribution arises from
the central limit theorem, which states, under weak conditions, that the sum of a
large number of rvs drawn from the same distribution is distributed approximately
normally, irrespective of the form of the original distribution. In addition, the normal
distribution can be manipulated analytically, enabling one to derive a large number
of results in explicit form. Due to these two aspects, the normal distribution is used
extensively in theory and practice.

Definition 4.12 An rv X with the Gaussian pdf


 
ϕ(x, μ, σ^2) = (2πσ^2)^{−1/2} exp{−(x − μ)^2/(2σ^2)}

is said to be normally distributed with E X = μ and Var X = σ 2 and is written as
X ∼ N(μ, σ 2 ). If μ = 0 and σ 2 = 1, then ϕ is called the standard normal distribution,
and we abbreviate ϕ(x, 0, 1) as ϕ(x).

An arbitrary normal rv X can be converted to a standard normal distribution,
or standardised, by a transformation. The standard normal rv Z is defined as Z =
(X − μ)/σ (Fig. 4.1).

Fig. 4.1 pdf (left) and cdf (right) of the normal distribution (for μ = 0 and σ 2 = 1, σ 2 = 3, σ 2 = 6,
respectively). BCS_NormPdfCdf
Distribution function
The cdf of X ∼ N(μ, σ 2 ) is
Φ(x, μ, σ^2) = \int_{−∞}^{x} (2πσ^2)^{−1/2} exp{−(u − μ)^2/(2σ^2)} du.

Properties of the normal distribution


The cf of the normal distribution is

φ_X(t, μ, σ^2) = exp(iμt − σ^2 t^2/2).

The moment-generating function of X ∼ N(μ, σ 2 ) is

M_X(t) = M(t, μ, σ^2) = exp(μt + σ^2 t^2/2).

Another useful property of the family of normal distributions is that it is closed under
linear transformations. Thus a linear combination of two independent normal rvs,
X_1 ∼ N(μ_1, σ_1^2) and X_2 ∼ N(μ_2, σ_2^2), is also normally distributed:

a X_1 + b X_2 + c ∼ N(aμ_1 + bμ_2 + c, a^2 σ_1^2 + b^2 σ_2^2).

This property of the normal distribution is actually the direct consequence of a far
more general property of the family of distributions called stable distributions, see
Sect. 4.5.2, as shown in Härdle and Simar (2015).
In order to work with this distribution in R, there is a list of standard implemented
functions: dnorm(x, mean, sd), for the pdf (if argument log = TRUE then
log density); pnorm(q, mean, sd), for the cdf; qnorm(p, mean, sd), for
the quantile function; and rnorm(n, mean, sd) for generating random nor-
mally distributed samples. Their parameters are x, a vector of quantiles, p, a vector
of probabilities, and n, the number of observations. Additional parameters are mean
and sd for the vectors of means and standard deviation, which, if not specified, are
set to the standard normal values by default.
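The standardisation Z = (X − μ)/σ introduced above can be verified numerically with these functions; the following is a minimal sketch (the values μ = 2, σ = 3 and x = 3.5 are arbitrary example choices):
> mu = 2; sigma = 3; x = 3.5                  # assumed example values
> all.equal(pnorm(x, mean = mu, sd = sigma),  # cdf of N(mu, sigma^2) at x
+           pnorm((x - mu) / sigma))          # standard normal cdf at (x - mu)/sigma
[1] TRUE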

4.4 Distributions Related to the Normal Distribution

The central role of the normal distribution in statistics becomes evident when we look at other important distributions constructed from it.
While the normal distribution is frequently applied to describe the underlying
distribution of a statistical experiment, asymptotic test statistics (see Sect. 5.2.2) are
often based on a transformation of a (non-) normal rv. To get a better understanding of
these tests, it will be helpful to study the χ2 , t- and F-distributions, and their relations
with the normal one. Skew or leptokurtic distributions, such as the exponential, stable

and Cauchy distributions, are commonly required for modelling extreme events or
an rv defined on positive support, and therefore will be discussed subsequently.

4.4.1 χ2 Distribution

In statistics, the χ2 distribution describes the sum of the squares of independent


standard normal rvs.
Definition 4.13 If Z_i ∼ N(0, 1), i = 1, ..., n, are independent, then the rv X given by

X = Σ_{i=1}^{n} Z_i² ∼ χ²_n

is χ² distributed with n degrees of freedom.


This distribution is of particular interest since it describes the distribution of a sample
variance (see Sect. 5.1.6) and is used further in tests (see Sect. 5.2.2).
Density function
The pdf of the χ2 distribution is

f(z, n) = 2^{−n/2} z^{n/2−1} exp(−z/2) / Γ(n/2),

where Γ(·) is the Gamma function Γ(z) = ∫_0^∞ t^{z−1} exp(−t) dt.
Distribution function
The cdf of the χ² distribution is

F(z, n) = γ(n/2, z/2) / Γ(n/2),

where γ(α, x) is the incomplete Gamma function γ(α, x) = ∫_0^x t^{α−1} exp(−t) dt.
In order to work with this distribution in R, there is a list of standard implemented
functions:

dchisq(x, df), pchisq(q, df), qchisq(p, df), rchisq(n, df),

which are for the pdf, the cdf, the quantile function and for generating random χ2 -
distributed samples, respectively. Same as for other distributions, if log = TRUE
in dchisq function, then log density is computed, which is useful for maximum
likelihood estimation. Similar to the functions for the t (see Sect. 4.4.2) and F (see
Sect. 4.4.3) distributions, all the functions also have the parameter ncp which is the
non-negative parameter of non-centrality, where this rv is constructed from Gaussian
rvs with non-zero expectations.
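Since χ² quantiles are what one typically needs for the tests of Sect. 5.2.2, here is a minimal usage sketch (the probability level 0.95 and df = 10 are arbitrary example choices):
> q95 = qchisq(p = 0.95, df = 10)  # 0.95-quantile, approximately 18.31
> pchisq(q = q95, df = 10)         # the cdf recovers the probability level
[1] 0.95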


Fig. 4.2 pdf (left) and cdf (right) of χ2 distribution (degrees of freedom n = 5, n = 10, n = 15,
n = 25, respectively). BCS_ChiPdfCdf

Fig. 4.3 pdf of the χ² distribution (n = 1 and n = 2). BCS_ChiPdf

Figure 4.2 illustrates the different shapes of the χ2 distribution’s cdf and pdf,
for different degrees of freedom n. In general, the χ2 pdf is bell-shaped and shifts
to the right-hand side for greater numbers of degrees of freedom, becoming more
symmetric.

There are two special cases, namely n = 1 and n = 2. In the first case, the vertical
axis is an asymptote and the distribution is not defined at 0. In the second case, the
curve steadily decreases from the value 0.5 (Fig. 4.3).
Properties of the χ2 distribution
A distinctive feature of χ2 is that it is positive, due to the fact that it represents a sum
of squared values.
The expectation, variance, skewness and excess kurtosis coefficients are

E X = n,  Var X = 2n,  S(X) = 2√(2/n),  K(X) = 12/n.


Fig. 4.4 Asymptotic normality of χ2 distribution (left panel n = 10; right panel n = 150).
BCS_ChiNormApprox

The pdf of a χ²-distributed rv X reaches its maximum at z = n − 2, provided that the number of degrees of freedom is n ≥ 2. Interestingly, if X is χ²-distributed with a large number n of degrees of freedom, then (in an asymptotic sense)

(i) X →^L N(n, 2n),
(ii) √(2X) →^L N(√(2n − 1), 1).

In order to check the asymptotic property in (i), we generate χ² samples together with the corresponding normal samples:
> # samples for n = 1
> x1 = rchisq(n = 500000,
+ df = 1) # chi-square distr. with 1 df
> norm1 = rnorm(n = 500000, # normal distr.
+ mean = 1, # with expectation = 1
+ sd = sqrt(2 * 1)) # and variance = 2
> # samples for n = 2
> x2 = rchisq(n = 500000,
+ df = 2) # chi-square distr. with 2 df
> norm2 = rnorm(n = 500000, # normal distr.
+ mean = 2, # with expectation = 2
+ sd = sqrt(2 * 2)) # and variance = 4

One can observe in Fig. 4.4 that the χ² distribution (coloured in blue) approaches the corresponding normal distribution for large numbers of degrees of freedom.

4.4.2 Student’s t-distribution

A combination of the normal and χ2 distributions is represented by the t-distribution.


It gained importance because it is widely used in statistical tests, particularly in
Student’s t-test for estimating the statistical significance of the difference between

two sample means. It is also used to construct confidence intervals for population
means and linear regression analysis.

Definition 4.14 Let X ∼ N(0, 1) and Y ∼ χ²_n be independent rvs. Then the rv

Z = X / √(Y/n) ∼ t_n

is t-distributed with n degrees of freedom.

The noncentral t-distribution is a generalised version of Student's distribution:

Z = (X + μ) / √(Y/n),

where μ is a non-centrality parameter and X ∼ N(0, 1) and Y ∼ χ²_n are independent.


Density function
The pdf of the t-distribution is

f(z, n) = Γ{(n + 1)/2} / [√(πn) Γ(n/2) (1 + z²/n)^{(n+1)/2}].

Distribution function
The cdf of the t-distribution is

F(z) = ∫_{−∞}^{z} f(t, n) dt = B(z; n/2, n/2) / B(n/2, n/2),

where B(x, y) = ∫_0^1 t^{x−1} (1 − t)^{y−1} dt is the Beta function and B(z; a, b) = ∫_0^z t^{a−1} (1 − t)^{b−1} dt is the incomplete Beta function.
Similar to other distributions, the R functions for t-distribution are

dt(x, df), pt(q, df), qt(p, df), rt(n, df),

for computing the pdf, cdf, quantile function and generating random numbers. Same
as for other distributions, if log = TRUE in dt function, then log density is
computed, which is useful for maximum likelihood estimation. Also similar to the
functions for the χ2 and F (see Sect. 4.4.3) distributions, all the above-mentioned
functions have the non-centrality parameter ncp.
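A minimal usage sketch of these functions (n = 10 degrees of freedom and the probability level 0.975 are arbitrary example choices):
> qt(0.975, df = 10)               # 0.975-quantile, approximately 2.23
> pt(qt(0.975, df = 10), df = 10)  # the cdf recovers the probability level
[1] 0.975
> qt(0.025, df = 10)               # equals -qt(0.975, df = 10) by symmetry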
Figure 4.5 shows the standard normal distribution (black bold line) and several
different t-distributions with different degrees of freedom.


Fig. 4.5 Density function of Student's t-distribution and corresponding cumulative distribution functions (n = 1, n = 2, n = 5; bold line: N(0, 1)). BCS_tPdfCdf

Properties of Student’s t-distribution


For n > 2 degrees of freedom, the expectation and variance of Student's t-distribution are

E Z = 0,  Var Z = n/(n − 2),

otherwise they do not exist. The skewness and excess kurtosis are

S(Z) = 0,  K(Z) = 6/(n − 4),  for n > 4.

The quantiles of a t-distributed rv Z are denoted by t_p and, due to symmetry, t_p = −t_{1−p}. Thus, for a two-sided test with significance level α, the critical values satisfy

P(|Z| > t_{1−α/2}) = α.

4.4.3 F-distribution

Definition 4.15 The rv Z has the Fisher–Snedecor distribution (F-distribution) with n and m degrees of freedom if

Z = {χ²(n)/n} / {χ²(m)/m} ∼ F_{n,m},

where χ²(n) ∼ χ²_n and χ²(m) ∼ χ²_m are independent rvs.



This kind of distribution is directly used in analysis of variance problems (F-test),


see Sect. 5.2.2.
Density function
The pdf of an F-distributed rv contains the degrees of freedom n and m:

f(z, n, m) = (n/m)^{n/2} Γ{(n + m)/2} / {Γ(n/2) Γ(m/2)} · z^{n/2−1} / (1 + nz/m)^{(n+m)/2}.

Distribution function
The cdf is

F(z) = 2 n^{(n−2)/2} (z/m)^{n/2} F_h{(n + m)/2, n/2; 1 + n/2; −nz/m} / B(n/2, m/2)  for z ≥ 0,

where F_h is the hypergeometric function.


The procedures in R dedicated to this distribution require the parameters n and m
as well:

df(x, df1, df2), pf(q, df1, df2),


qf(p, df1, df2), rf(n, df1, df2),

for computing the pdf, cdf, quantile function and generating random numbers. Here
parameters df1 and df2 are the two degrees of freedom parameters. Same as for
other distributions, if log = TRUE in df function, then log density is computed,
which is useful for maximum likelihood estimation. Also similar to the functions for
the χ2 and t-distribution, all the above-mentioned functions have the non-centrality
parameter ncp.
Distribution parameters
The expectation of the F-distribution is defined for m > 2 and the variance for m > 4:

E Z = m/(m − 2),  Var Z = 2m²(n + m − 2) / {n(m − 2)²(m − 4)}.

The skewness coefficient is

S(Z) = (2n + m − 2)√{8(m − 4)} / [(m − 6)√{n(n + m − 2)}]  for m > 6.
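These moments can be checked by simulation; the following minimal sketch compares empirical and theoretical values (the degrees of freedom n = 5, m = 10 and the sample size are arbitrary example choices):
> n = 5; m = 10                     # assumed example degrees of freedom
> z = rf(100000, df1 = n, df2 = m)  # F-distributed random sample
> mean(z)                           # close to m/(m - 2) = 1.25
> m / (m - 2)                       # theoretical expectation
> var(z)                            # close to the theoretical variance
> 2 * m^2 * (n + m - 2) / (n * (m - 2)^2 * (m - 4))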

Looking at Fig. 4.6, one can distinguish three characteristic shapes of the pdf
curve, depending on the parameters n and m:
• for n = 1, the curve monotonically decreases for all values of m with the vertical
axis as an asymptote;


Fig. 4.6 Density and cumulative distribution of the F-distribution (n = 1, m = 1; n = 2, m = 6; n = 3, m = 10 and n = 50, m = 50, respectively). BCS_FPdfCdf

• for n = 2, the curve again decreases for all m, but intersects the vertical axis at the
point 1;
• for n ≥ 3, the curve has an asymmetrical bell shape for all m, gradually shifting
to the right-hand side for larger numbers of degrees of freedom.

4.5 Other Univariate Distributions

4.5.1 Exponential Distribution

Example 4.1 Let us assume that over the time interval [0, T], the online service of a food delivery company receives x orders. At some point, the managers of this business became curious about the probability of receiving a given number of orders over time. In general, the number of orders can be described by a Poisson distribution, see Definition 3.7, where λ is the expected number of occurrences during a given time period. If during one hour the online service receives on average λ = 35 orders, then the probability of receiving exactly 30 orders within any given hour is p = 35^{30} e^{−35}/30! ≈ 0.049.

However, when we need to model the distribution of time intervals between orders,
or events, the exponential distribution comes in handy.
Density function
The pdf of the exponential distribution is defined as

f(z, λ) = λe^{−λz} for z ≥ 0, and 0 for z < 0,

where λ is the rate parameter. The rate parameter gives the expected number of events per time interval, whereas its reciprocal 1/λ gives the expected time between two events. One writes X ∼ E(λ).
Distribution function
The expression for the cdf looks relatively similar to that of the pdf:

F(x, λ) = 1 − e^{−λx} for x ≥ 0, and 0 for x < 0.

In general, the greater the λ is, the steeper are the curves of the exponential density
and distribution functions (Fig. 4.7).
The main R functions for the exponential distribution are

dexp(x, rate), pexp(q, rate),


qexp(p, rate), rexp(n, rate),

for computing the pdf, cdf, quantile function and generating random numbers. Same
as for other distributions, if log = TRUE in dexp function, then log density is
computed, which is useful for maximum likelihood estimation.
Example 4.2 University beverage vending machines have a lifetime X which is exponentially distributed with rate λ = 0.3 defective machines per year:

f(x, 0.3) = 0.3 e^{−0.3x} for x ≥ 0, and 0 for x < 0.

We would like to find the probability that such a vending machine functions for more than 1.7 years.

Fig. 4.7 Pdf and cdf of the exponential distribution (λ = 0.3, λ = 0.5, λ = 1 and λ = 3).
BCS_ExpPdfCdf

The reliability after time x is given by

P(X > x) = e−0.3x .

> 1 - pexp(q = 1.7, rate = 0.3)


[1] 0.6004956

Thus the probability that the machine functions for more than 1.7 years, i.e. that no breakdown occurs within this period, is approximately 60%.
Properties of the exponential distribution
The exponential distribution has the following expectation and variance:

E X = 1/λ, Var X = 1/λ2 .

The mode (see Definition 5.6) is 0 and the median (see Definition 4.9) is

ξ_{1/2} = log 2 / λ.
The exponential distribution has skewness and excess kurtosis coefficients indepen-
dent of λ, unlike some of the distributions we have seen so far. They are

S(X ) = 2, K (X ) = 6.

Another interesting property of the exponential distribution is that it is memoryless, something which can be said of only two distributions: the exponential and the geometric. It follows that the expected time until the next event is constant and does not depend on the time elapsed since the occurrence of the previous event. Formally,

P(X ≤ t + q | X > t) = P(X ≤ q).

The conditional probability of the next event occurring by time t + q, given that the last event was at time t, is equal to the unconditional probability of the next event occurring by time q, without any previous information.
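A minimal numerical sketch of the memoryless property using pexp (the values λ = 0.3, t = 2 and q = 1.7 are arbitrary example choices):
> lambda = 0.3; t = 2; q = 1.7            # assumed example values
> lhs = (pexp(t + q, rate = lambda) - pexp(t, rate = lambda)) /
+       (1 - pexp(t, rate = lambda))      # P(X <= t + q | X > t)
> rhs = pexp(q, rate = lambda)            # P(X <= q)
> all.equal(lhs, rhs)
[1] TRUE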

4.5.2 Stable Distributions

As mentioned in Sect. 4.3, the stable distributions are a family of distributions which
are closed under linear transformations.

Definition 4.16 A distribution function is said to be stable if for any two independent
rvs Z 1 and Z 2 following this distribution, and any two positive constants a and b, we
have
124 4 Univariate Distributions

a Z 1 + bZ 2 = cZ + d,

where c is a positive constant, d is a constant, and Z is an rv with the same distribution


as Z 1 and Z 2 . Here, c and d depend on a and b.
If this property holds with d = 0, then such a distribution is said to be strictly
stable.

In order to completely define stable distributions, we need four parameters: the


index of stability α ∈ (0, 2] (determines the thickness of the tails), the skewness
parameter β ∈ [−1, 1] (determining the asymmetry), the location parameter μ ∈ R
and a scale parameter σ > 0.
Figure 4.8 shows the effect of the parameters α and β on the shape of the density
curves. Greater values of α make the peak less pointed and the tails fatter. For β > 0,
the distribution is skewed to the right (the right tail is fatter) and for β < 0, it is
skewed to the left. When β = 0, the distribution is symmetric around the peak.
The pdf and cdf of stable distributions
Generally, the pdf and the cdf of stable distributions cannot be written down analyti-
cally (except for three special cases). However, stable distributions can be described
by their cf φ Z .
A conventional parameterization of the cf of a stable rv Z with parameters α, β, σ_S and μ is given by Samorodnitsky and Taqqu (1994), Weron (2001):

log φ_Z(t, α, β, σ_S, μ) = −σ_S^α |t|^α {1 − iβ sign(t) tan(πα/2)} + iμt  for α ≠ 1,
log φ_Z(t, α, β, σ_S, μ) = −σ_S |t| {1 + iβ sign(t) (2/π) log |t|} + iμt  for α = 1,

where i² = −1. Note that the σ_S used here is not the usual Gaussian scale σ, but the value σ_S = σ/√2.
In R the stable distributions can be implemented by

dstable(z, alpha, beta, gamma, delta, pm)

and
pstable(z, alpha, beta, gamma, delta, pm),

which require the stabledist package. We can easily work with a stable distri-
bution of interest that depends on the parameters α, β, σ, μ and the parameter pm,
which refers to the parameterization type. The functions qstable and rstable
with the same parameters let us use quantiles and generate samples.
An interesting toolbox is implemented with the command stableSlider()
(of the fBasics package). It provides a good illustration of the pdf and cdf functions
of different stable distributions. One can change the parameters to see how the shape
of the functions reacts to the changed values, see Fig. 4.9. There exist three special
cases of stable distributions that have closed form formulas for their pdf and cdf:

Fig. 4.8 Stable distribution functions and their density functions for different combinations of α and β: top panels α ∈ (0.6, 1, 1.5, 2) with β = 0; bottom panels α = 1 with β ∈ (0, −0.8, 0.8) (in all cases σ = 1 and μ = 0). BCS_StablePdfCdf
Normal Distribution   f(z) = (2πσ²)^{−1/2} exp{−(z − μ)²/(2σ²)},

Cauchy Distribution   f(z) = σ / {π(z − μ)² + πσ²},    (4.4)

Lévy Distribution     f(z) = √{c/(2π)} exp{−c/(2(z − μ))} / (z − μ)^{3/2}.

With the help of the following short code we plot the pdf for those special cases.
These can be built using the dstable function from package stabledist with
the appropriate parameters α, β, σ and μ (Fig. 4.10).
Fig. 4.9 Screenshot of the fBasics package's stableSlider() toolbox

> require(stabledist)
> z = seq(-6, 6, length = 300)
> s.norm = dstable(z, # values of the density
+ alpha = 2, # tail
+ beta = 0, # skewness
+ gamma = 1, # scale
+ delta = 0, # location
+ pm = 1) # type of parametrization
> s.cauchy = dstable(z, # values of the density
+ alpha = 1, # tail
+ beta = 0, # skewness
+ gamma = 1, # scale
+ delta = 0, # location
+ pm = 0) # type of parametrization
> s.levy = dstable(z, # values of the density
+ alpha = 0.5, # tail
+ beta = 0.9999, # skewness
+ gamma = 1, # scale
+ delta = 0, # location
+ pm = 0) # type of parametrization
> plot(z, s.norm, # plot normal
+ col = "red", type = "l", ylim = c(0, 0.5))
> lines(z, s.cauchy, # plot Cauchy
+ col = "green")
> lines(z, s.levy, # plot Levy
+ col = "blue")

In all cases, σ = 1 and μ = 0. The cdf functions can be plotted analogously using
the procedure pstable.


Fig. 4.10 Special cases of stable distributions (Gaussian: α = 2, β = 0; Cauchy: α = 1, β = 0; Lévy: α = 0.5, β = 1). BCS_StablePdfCdfSpecial

4.5.3 Cauchy Distribution

The Cauchy–Lorentz distribution is a continuous probability distribution which is


stable and known for having an undefined mean and an infinite variance. It is impor-
tant in physics since it is the solution of differential equations describing forced
resonance and describes the shape of spectral lines.

Example 4.3 Consider an isotropic source emitting particles to the plane L. The
angle θ of each emitted particle is uniformly distributed. Each particle hits the plane
at some distance x from the point 0 (Fig. 4.11). By definition, the distance rv X follows a Cauchy distribution.

Density function
The pdf of the Cauchy distribution is defined as in (4.4) where μ ∈ R is a location
parameter, i.e. it defines the position of the peak of the distribution, and σ > 0 is
a scale parameter specifying one-half the width of the probability density function
at one-half its maximum height. For μ = 0 and σ = 1, the distribution is called a
standard Cauchy distribution (Fig. 4.12).

Fig. 4.11 Illustration of the generation of a Cauchy-distributed rv (particles are emitted from an isotropic source with uniformly distributed angles and hit the plane L, so that the distance x is Cauchy distributed)


Fig. 4.12 Cauchy distribution functions and corresponding density functions (μ = −2, σ = 1;
μ = 0, σ = 1; μ = 2, σ = 1; μ = 0, σ = 1.5; μ = 0, σ = 2). BCS_CauchyPdfCdf

Distribution function
The Cauchy cdf is

F(z; μ, σ) = (1/π) arctan{(z − μ)/σ} + 1/2.

In R, the pdf, cdf and quantile function of the Cauchy distribution can be computed, and random numbers generated, using the commands

dcauchy(x, mu, sigma), pcauchy(q, mu, sigma),

qcauchy(p, mu, sigma), rcauchy(n, mu, sigma),

or by using the

dstable, pstable, qstable, rstable

functions with parameters α = 1 and β = 0.


Properties of the Cauchy distribution
As mentioned before, the Cauchy distribution has a non-finite expectation. As a
consequence, its variance, skewness, kurtosis and other higher order moments do
not exist. Its mode (see Definition 5.6) and the median (see Definition 4.9) are both
defined and equal to μ.
Cauchy rvs can be simply thought of as a ratio of two N(0, 1) rvs: if X ∼ N(0, 1)
and Y ∼ N(0, 1) are independent rvs, then X/Y follows the standard Cauchy distri-
bution.
The Cauchy distribution is an infinitely divisible distribution, which means that for
any positive n, there exist n independent and identically distributed rvs X n1 , . . . , X nn
whose sum has the Cauchy distribution.
It is worth mentioning that the standard Cauchy distribution coincides with Stu-
dent’s t-distribution with one degree of freedom.
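Both statements can be illustrated numerically; a minimal sketch (the evaluation point, the sample size and the quantile levels are arbitrary example choices):
> dt(0.5, df = 1)                           # t-density with 1 df ...
> dcauchy(0.5)                              # ... equals the standard Cauchy density
> x = rnorm(100000); y = rnorm(100000)
> r = x / y                                 # ratio of two independent N(0, 1) rvs
> quantile(r, probs = c(0.25, 0.5, 0.75))   # roughly -1, 0, 1
> qcauchy(c(0.25, 0.5, 0.75))               # theoretical Cauchy quartiles: -1, 0, 1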
Chapter 5
Univariate Statistical Analysis

It is a capital mistake to theorise before one has data. Insensibly


one begins to twist facts to suit theories, instead of theories to
suit facts.
— Sir Arthur Conan Doyle

This chapter presents basic statistical methods used in describing and analysing
univariate data in R. It covers the topics of descriptive and inferential statistics of
univariate data, which are mostly treated in introductory courses in Statistics.
Among other useful statistical tools, we discuss simple techniques of explorative
data analysis, such as the Bar Diagram, Bar Plot, Pie Chart, Histogram, kernel density
estimator, the ecdf, and parameters of location and dispersion. We also demonstrate
how they are easily implemented in R. Further in this chapter, we discuss different tests for location, dispersion and distribution.

5.1 Descriptive Statistics

Let us consider an rv X following a distribution F_θ, X ∼ F_θ. To obtain a sample of n observations, one first constructs n copies of the rv X, i.e. sample rvs X_1, ..., X_n ∼ F_θ which follow the same distribution F_θ as the original variable X. All X_1, ..., X_n are often assumed to be independent. Subsequently, we draw one observation x_i from every sample rv X_i; this results in a random sample x_1, ..., x_n. Note that these are not rvs but numbers and can be used to estimate the unknown parameters θ (for the normal distribution these are μ and σ). Thus, the functions based on the sample discussed later in this chapter may be random as well as non-random. For example, the sample mean (see Sect. 5.1.5) x̄ = n^{−1} Σ_{i=1}^{n} x_i is non-random and defines the centre of the cloud of observations. On the other hand, X̄ = n^{−1} Σ_{i=1}^{n} X_i is an rv, which can be characterized by a distribution function and has the property E X̄ = E X.


Having realizations {x_i}, i ∈ {1, ..., n}, of an rv X, with {a_j}, j ∈ {1, ..., k}, denoting the distinct values of X observed in the sample, we can define the following two types of frequencies.

Definition 5.1 The absolute frequency of a_j, denoted by n(a_j), is the number of occurrences of a_j in the sample {x_i}. The relative frequency of a_j, denoted by h(a_j), is the ratio of the absolute frequency of a_j and the sample size n: h(a_j) = n(a_j)/n.

Clearly Σ_{j=1}^{k} h(a_j) = 1.

The function table() returns all possible observed values of the data along
with their absolute frequencies. These can be used further to compute the relative
frequencies by dividing by n.
Let us consider the dataset chickwts, a data frame with 71 observations of 2
variables, weight, a numeric variable for the weight of the chicken, and feed, a
factor for the type of feed. In order to select only the observed values of feed, one
considers the field chickwts$feed. By using table(chickwts$feed), we
get one line, stating the possible chicken feed, i.e. each possible observational value,
and the absolute frequency of each type in the line below.
> table(chickwts$feed) # absolute frequencies

casein horsebean linseed meatmeal soybean sunflower


12 10 12 11 14 12
> n = length(chickwts$feed); n # sample size
[1] 71
> table(chickwts$feed) / n # relative frequency

casein horsebean linseed meatmeal soybean sunflower


0.1690141 0.1408451 0.1690141 0.1549296 0.1971831 0.1690141
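As an alternative minimal sketch, the relative frequencies can also be obtained with the base function prop.table(), which divides a frequency table by its total:
> prop.table(table(chickwts$feed))  # same relative frequencies as above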

5.1.1 Graphical Data Representation

There are several methods of visualising the frequencies graphically. Depending on


which type of data is at our disposal, the adequate approach needs to be chosen
carefully: Bar Plot, Bar Diagram or Pie Chart for qualitative or discrete variables,
and the histogram for continuous (or quasi-continuous) variables.
Since the variable feed in dataset chickwts has only 6 distinct observed values
of feed, the frequencies can be conveniently shown using a Bar Plot, a Bar Diagram
or a Pie Chart.
Bar diagram
In a Bar Diagram, each observation is plotted using sticks. The y-axis indicates,
depending on the specification, the absolute or relative frequencies.

Fig. 5.1 Bar diagram of the absolute frequencies n(a j ) (left) and bar plot of the relative frequencies
h(a j ) (right) of chickwts$feed. BCS_BarGraphs

> n = length(chickwts$feed) # sample size


> plot(table(chickwts$feed)) # absolute frequency
> plot(table(chickwts$feed) / n) # relative frequency

The result of the first plot command is shown in the left panel of Fig. 5.1.
Bar plot
Unlike in the Bar Diagram, each observation is plotted using bars in the Bar Plot.
If the endpoints of the bars are connected, one obtains a frequency polygon. It is in
particular useful to illustrate the behaviour (variation) of time ordered data.
> n = length(chickwts$feed) # sample size
> barplot(table(chickwts$feed)) # absolute frequency
> barplot(table(chickwts$feed) / n) # relative frequency

The result of the second barplot command is shown in the right panel of Fig. 5.1.
Pie chart
In a Pie Chart, each observation has its own sector with an angle (or a square for
a square Pie Chart) proportional to its frequency. The angle can be obtained from
α(ai ) = h(ai ) · 360◦ . The disadvantage of this approach is that the human eye cannot
precisely distinguish differences between angles (or areas). Instead, it recognises
much better differences in lengths, which is the reason why the Bar Plot and Bar
Diagram are better tools than the Pie Chart. In Fig. 5.2, each group seems to have the
same area in the Pie Chart, though the frequencies differ slightly from each other, as
is evident from the Bar Plot in Fig. 5.1.
> pie(table(chickwts$feed))

Fig. 5.2 Pie chart of the data chickwts$feed. BCS_pie

5.1.2 Empirical (Cumulative) Distribution Function

The empirical cumulative distribution function (ecdf) is denoted by F̂(x) and


describes the relative number of observations which are less than or equal to x
in the sample. We write F̂(x) with a hat because it is an estimator of the true cdf
F(x) = P(X ≤ x) (see Definition 3.4 for discrete rvs and Definition 4.1 for con-
tinuous rvs), as for every value of x, F̂(x) converges almost surely to F(x) when n
goes to infinity (Strong Law of Large Numbers, Serfling 1980), i.e.


F̂(x) = n^{−1} Σ_{i=1}^{n} I(X_i ≤ x) −→^{a.s.} F(x).    (5.1)

The ecdf is a non-decreasing step function, i.e. it is a function which is constant


except for jumps at a discrete set of points. The points where the jumps occur are
the realisations of the rv and thus illustrate a general property of the cumulative
distribution function: right-continuity with existing left limits, which holds for
all F̂. This step function is given by ecdf(), which returns a function of class
ecdf, which can be plotted using plot().
Take for example the dataset Formaldehyde containing 6 observations on two
variables: carbohydrate (car) and optical density (optden). The ecdf for the subset
Formaldehyde$car can be calculated as follows, with the result being shown in
Fig. 5.3.
> ecdf (Formaldehyde$car) # ecdf
Empirical CDF
Call: ecdf(Formaldehyde$car)
x[1:6] = 0.1, 0.3, 0.5, ..., 0.7, 0.9
> plot(ecdf(Formaldehyde$car)) # plot of ecdf
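Since ecdf() returns a function, it can also be evaluated directly at any point; for instance, the value F̂(0.5) highlighted in Fig. 5.3 is obtained by
> Fn = ecdf(Formaldehyde$car)  # the ecdf as an R function
> Fn(0.5)                      # share of observations <= 0.5
[1] 0.5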


Fig. 5.3 ecdf of Formaldehyde$car. BCS_ecdf

5.1.3 Histogram

The histogram is a common way of visualising the data frequencies of continuous


(real valued) variables. In a histogram, bars are erected on distinct intervals, which
constitute the so-called classes. The y-axis represents the relative frequency of the
classes, so that the total area of the bars is equal to one. While the width of the bars is
given by the chosen intervals, the height of each bar is equal to the empirical density
of the corresponding interval.
If {K_i}, i = 1, ..., s, is a set of s disjoint classes, the histogram or empirical density function f̂(x) is defined by

f̂(x) = (relative frequency of the class containing x) / (length of the class containing x) = h(K_i)/|K_i|  for x ∈ K_i,    (5.2)

where |K i | denotes the length of the class K i and h(K i ) its relative frequency, which
is calculated as the ratio of the number of observations falling into class K i to the
sample size n.
We write fˆ(x) with a hat because it is a sample-based estimator of the true density
function f (x), which describes the relative likelihood of the underlying variable to
take on any given value. fˆ(x) is a consistent estimator of f (x), since for every value


Fig. 5.4 Histograms of nhtemp with the number of classes calculated using default method (left)
and by manually setting to intervals of length 0.5 (right). BCS_hist1, BCS_hist2

of x, fˆ(x) converges almost surely to f (x) when n goes to infinity (Strong Law of
Large Numbers, Serfling 1980).
Now, consider nhtemp, a sample of size n = 60 containing the mean annual
temperature in degrees Fahrenheit in New Haven, Connecticut, from 1912 to 1971.
The histograms in Fig. 5.4 are produced using the function hist(). By default,
without specifying the arguments for hist(), R produces a histogram with the
absolute frequencies of the classes on the y-axis. Thus, to obtain a histogram accord-
ing to our definition, one needs to set freq = FALSE. The number of classes s is
calculated by default using Sturges' formula s = ⌈log₂ n⌉ + 1. The brackets ⌈·⌉ denote the ceiling function used to round up to the next integer (see Sect. 1.4.1) to avoid
fractions of classes. Note that this formula performs poorly for n < 30. To spec-
ify the intervals manually, one can fill the argument breaks with a vector giving
the breakpoints between the histogram cells, or simply the desired number of cells.
In the following example, breaks = seq(47, 55, 0.5) means that the his-
togram should range from 47 to 55 with a break every 0.5 step, i.e. K 1 = [47, 47.5),
K 2 = [47.5, 48), ….
> hist(nhtemp, freq = FALSE)
> hist(nhtemp, freq = FALSE, breaks = seq(47, 55, 0.5))
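A minimal sketch of the default choice of the number of classes for this sample of size n = 60 (nclass.Sturges() is the base R implementation of the same rule):
> n = length(nhtemp)
> ceiling(log2(n)) + 1    # Sturges' formula by hand
[1] 7
> nclass.Sturges(nhtemp)  # built-in Sturges rule
[1] 7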

Figure 5.4 displays histograms with different bin sizes. A better reflection of the
data is achieved by using more bins. But, as the number of bins increases, the his-
togram becomes less smooth. Finding the right level of smoothness is an important
task in nonparametric estimation, and more information can be found in Härdle et al.
(2004).

5.1.4 Kernel Density Estimation

The histogram is a density estimator with a relatively low rate of convergence to the
true density. A simple idea to improve the rate of convergence is to use a function
that weights the observations in the vicinity of the point where we want to estimate
the density, depending on how far away each such observation is from that point.
Therefore, the estimated density is defined as

f̂_h(x) = (1/n) Σ_{i=1}^{n} K_h(x − x_i) = {1/(nh)} Σ_{i=1}^{n} K{(x − x_i)/h},

where K{(x − x_i)/h} is the kernel, which is a symmetric nonnegative real valued integrable function. Furthermore, the kernel should have the following properties:

∫_{−∞}^{∞} u K(u) du = 0,
∫_{−∞}^{∞} K(u) du = 1.

These criteria define a pdf and it is straightforward to use different density functions
as a kernel. This is the basic idea of kernel smoothing. The foundations in this
area were laid in Rosenblatt (1956) and Parzen (1962). Some examples for different
weight functions are given in Fig. 5.5 and Table 5.1.
Deriving a formal expression for the kernel density estimator is fairly intuitive.
The weights for the observations depend mainly on the distance to the estimated
point. The main idea behind the histogram to estimate the pdf is

f̂_h(x) ≈ {F̂(x + h) − F̂(x − h)} / (2h),    (5.3)

where F̂ is the ecdf. If h is small, the approximation method works well, producing smaller bin widths and a smaller bias. Rearranging (5.3) yields

f̂_h(x) = {1/(2nh)} Σ_{i=1}^{n} I(x + h ≥ x_i > x − h).

Different weights for observations in the vicinity of x can be achieved by simply


multiplying I{x + h ≥ xi > x − h} with the desired weight w. The kernel for the
histogram is defined as

K(x − x_i) = (1/h) I(x + h ≥ x_i > x − h) w(x_i),

Fig. 5.5 Popular kernel functions: uniform, triangular, Epanechnikov and quartic. BCS_PopularKernels

Table 5.1 Popular kernels and their weighting functions

Kernel          Weighting function
Epanechnikov    {3/(4√5)} (1 − u²/5) I(|u| ≤ √5)
Triangular      (1 − |u|) I(|u| ≤ 1)
Uniform         (1/2) I(|u| ≤ 1)
Quartic         (15/16) (1 − u²)² I(|u| ≤ 1)
Gaussian        (2π)^{−1/2} exp(−u²/2)

where w(x_i) is the weight for observation x_i, and depends on the distance of x_i from x. The sum of the weights must be Σ_{i=1}^{n} w(x_i) = 1 and |w(x_i)| ≤ 1 for all i. The density estimator for the pdf is then

f̂_h(x) = (1/h) Σ_{i=1}^{n} K(x − x_i).

K (x − xi ) is the uniform kernel weighting function and smooths the histogram.



Fig. 5.6 Nonparametric density estimation of the temperature at New Haven (n = 60, bandwidth = 0.3924). BCS_Kernel_nhTemp

In R the simplest command to execute a kernel density estimation is density().


One can implement all of the kernels mentioned above via the function argument
kernel. Returning to the temperature dataset let us estimate the kernel density for
the temperature in New Haven, Connecticut, using the Epanechnikov kernel. The
density is then estimated and plotted via (see Fig. 5.6)
> plot(density(nhtemp, kernel ="epanechnikov"))

To find the optimal bandwidth h for a kernel estimator, a similar problem has to
be solved as for the optimal binwidth. In practice one can use Silverman’s rule of
thumb:

h* = 1.06 · σ̂ · n^{−1/5}.

It is only a rule of thumb, because this h ∗ is only the optimal bandwidth under normal-
ity. But this bandwidth will be close to the optimal bandwidth for other distributions.
The optimal bandwidth depends on the kernel and the true density, see Härdle et al.
(2004).
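A minimal sketch of this rule of thumb for the nhtemp data, taking σ̂ = sd(nhtemp) as one possible scale estimate; the resulting value can be passed to density() via its bw argument:
> n = length(nhtemp)
> h = 1.06 * sd(nhtemp) * n^(-1/5)  # Silverman's rule of thumb
> plot(density(nhtemp, kernel = "epanechnikov", bw = h))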

5.1.5 Location Parameters

“Where are the data centered?” “How are the data scattered around the centre?” “Are
the data symmetric or skewed?” These questions are often raised when it comes to
a simple description of sample data. Location parameters describe the centre of a
distribution through a numerical value. They can be quantified in different ways and
visualised particularly well by boxplots.

Arithmetic mean
The term arithmetic mean characterises the average position of the realisations on the
variable axis. It is a good location measure for data from a symmetric distribution.
Definition 5.2 The sample (arithmetic) mean for a sample of n values x_1, x_2, ..., x_n is defined by

x̄ = n^{−1} Σ_{i=1}^{n} x_i.    (5.4)

Applying the notions of absolute and relative frequencies, this formula can be rewritten as

x̄ = n^{−1} Σ_{j=1}^{k} a_j n(a_j) = Σ_{j=1}^{k} a_j h(a_j).

According to the Law of Large Numbers, if {X_i}_{i=1}^{n} denote n i.i.d. rvs with the same finite expected value E(X_i) = μ, then their sample mean converges almost surely to μ (also called the population mean), i.e.

n^{−1} Σ_{i=1}^{n} X_i −→^{a.s.} μ  when n → ∞.

The arithmetic mean is calculated by mean().


> mean(nhtemp) # average temperature in New Haven
[1] 51.16

α-trimmed mean
The arithmetic mean is very often used as a location parameter, although it is not very
robust, since its value is sensitive to the presence of outliers. In order to eliminate the
outliers, one can trim the data by dropping a fraction α ∈ [0 , 0.5) of the smallest and
largest observations before calculating the arithmetic mean. This type of arithmetic
mean, called the α-trimmed mean, is more robust to outliers. However, there is no
unified recommendation regarding the choice of α. In order to define the trimmed
mean, we need to define order statistics first.
Definition 5.3 Let x(1) ≤ x(2) ≤ . . . ≤ x(n) be the sorted realizations of the rv X .
The term x(i) , i = 1, . . . , n is called the i th order statistic, and in particular, x(1) is
called the sample minimum and x(n) the sample maximum.
Definition 5.4 The α-trimmed mean is the arithmetic mean computed after trimming
the fraction α of the smallest and largest observations of X, given by

x̄_α = {1/(n − 2⌊nα⌋)} Σ_{i=⌊nα⌋+1}^{n−⌊nα⌋} x_{(i)}

with α ∈ [0, 0.5), where ⌊a⌋ is the floor function, returning the largest integer not greater than a, see Sect. 1.4.1.

The argument trim is used in the function mean to compute the α-trimmed mean.

> mean(nhtemp, trim = 0.2)
[1] 51.22222
> mean(nhtemp, trim = 0.4)
[1] 51.225

Quantiles
Another type of location parameter is the quantile. Quantiles are very robust, i.e. not
influenced by outliers, since they are determined by the rank of the observations and
they are estimates of the theoretical quantiles, see Definition 4.9.

Definition 5.5 The p-quantile x̃_p, where 0 ≤ p ≤ 1, is a value such that at least 100·p% of the observations are less than or equal to x̃_p and at least 100·(1 − p)% are greater than or equal to x̃_p. The number of observations which are less than or equal to x̃_p is then equal to ⌈np⌉.

x̃_p = x_(⌈np⌉) if np ∉ Z,  and  x̃_p = {x_(np) + x_(np+1)}/2 if np ∈ Z,  for p ∈ [0, 1].

Here ⌈a⌉ is the ceiling function, returning the smallest integer not less than a,
see Sect. 1.4.1. The sample quartiles are a special case of quantiles: the lower quar-
tile Q 1 = x̃0.25 , the median Q 2 = x̃0.5 = med, and upper quartile Q 3 = x̃0.75 .
These three quartile values Q 1 ≤ Q 2 ≤ Q 3 divide the sorted observations into four
segments, each of which contains roughly 25% of the observations in the sample.
To calculate the p-quantiles x̃ p of the sample nhtemp, one uses quantile().
This function allows up to 9 different methods of computing the quantile, all of them
converge asymptotically, as the sample size tends to infinity, to the true theoretical
quantiles (type = 2 is the method discussed here). Leaving the argument probs
blank, R returns by default an array containing the 0, 0.25, 0.5, 0.75 and 1 quantiles,
which are the sample minimum x (1) , the lower quartile Q 1 , the median Q 2 (or med),
the upper quartile Q 3 , and the sample maximum x(n) . The median can be also found
using median().
> quantile(nhtemp, probs = 0.2) # 20% quantile
20%
50.2
> median(nhtemp) # median
[1] 51.2
> quantile(nhtemp, probs = c(0.2, 0.5)) # 20% and 50% quantiles
20% 50%
50.2 51.2
> quantile(nhtemp) # all quartiles
0% 25% 50% 75% 100%
47.90 50.575 51.20 51.90 54.60
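A minimal sketch of Definition 5.5 applied by hand, compared with the type = 2 method of quantile() (here p = 0.2, so np = 12 is an integer for the n = 60 observations of nhtemp):
> x = sort(nhtemp)                         # order statistics
> (x[12] + x[13]) / 2                      # (x_(np) + x_(np+1))/2 for np = 12
> quantile(nhtemp, probs = 0.2, type = 2)  # should give the same value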

Mode
The mode is the most frequently occurring observation in a data set (also called the
most fashionable observation). Together with the mean and median, one can use it as
an indicator of the skewness of the data. In general, the mode is not equal to either the
mean or the median, and the difference can be huge if the data are strongly skewed.
Definition 5.6 The mode is defined by

x_mod = a_j, with n(x_mod) ≥ n(a_i), ∀i ∈ {1, ..., k},

where n(x) is the absolute frequency of x.


Note that the mode is not uniquely defined if several observations have the same
maximal absolute frequency. There is no function in R that directly finds the sam-
ple mode of given data. The sample mode can be calculated using the command
names(sort(table(x), decreasing = TRUE))[1]), where x is a vec-
tor. Let us consider again the dataset nhtemp.
> as.numeric(names(sort(table(nhtemp), decreasing = TRUE))[1])
[1] 50.9

These nested functions are better understood from the inside out. The function
table() creates a frequency table for the observations in the dataset nhtemp,
calculating the frequency for every single value. sort() with the argument
decreasing = TRUE sorts the frequency table in decreasing order, so that the
element with the highest frequency, i.e. the mode, appears first. Its name, the unique
value for which the frequency was calculated, is then extracted by the function
names()[1], where [1] restricts the output of names() to the first position of
the vector. Lastly, as the result is a string, here ‘50.9’, it needs to be converted into a
number by as.numeric().
If it is desired to have the usual location parameters, such as the median, mean
and some quantiles at once, one can use the command summary().
> summary(nhtemp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
47.90 50.58 51.20 51.16 51.90 54.60

In general, summary() also produces summaries of the results of model fit-


ting functions. Depending on the class of the first argument, particular methods
are employed for this function. For example, the functions summary.lm() and
summary.glm() are particular methods used to summarise the results produced
by lm and glm, see Sect. 7.2.

5.1.6 Dispersion Parameters

One characteristic of a set of observations described by the measures of location is


the typical or central value. However, it is also interesting to know how dispersed

the observations are. Evaluating measures of dispersion in addition to measures of


location provides a more complete description of the data.
Total range

Definition 5.7 The total range is the difference between the sample maximum and
the sample minimum, i.e.

TotalRange = x(n) − x(1) .

Since the total range depends only on two observations, it is very sensitive to outliers
and is thus a very weak dispersion parameter.
The function range() returns an array containing two values, namely the sam-
ple minimum and maximum. diff() calculates the difference between values by
subtracting each value in a vector from the subsequent value. To obtain the total
range, one simply calculates the first difference of the array given by the function
range() using diff().
> range(nhtemp) # sample min and sample max
[1] 47.9 54.6
> totalrange = diff(range(nhtemp)) # difference between max and min
> totalrange
[1] 6.7

Interquartile range

Definition 5.8 The interquartile range (IQR) of a sample is the difference between
the upper quartile x̃0.75 and the lower quartile x̃0.25 , i.e.

IQR = x̃0.75 − x̃0.25 .

It is also called the midspread or middle fifty, since roughly fifty percent of the
observations are found within this range. The IQR is a robust statistic and is therefore
preferred to the total range.
The IQR can be obtained directly with the function IQR(). Alternatively, to find the upper and lower quartiles, one uses the function quantile() with probs = c(0.25, 0.75), meaning that R should return the 0.25-quantile and the 0.75-quantile. In this example, the function diff() then computes the IQR, i.e. the difference between the lower and upper quartiles.
> LUQ = quantile(nhtemp, probs = c(0.25, 0.75)); LUQ
25% 75%
50.575 51.900
> IQR = diff(LUQ); IQR
75%
1.325
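A quick check with the built-in function IQR(), which uses the same default quantile method as quantile():
> IQR(nhtemp)  # equals the difference computed above
[1] 1.325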

Variance
The variance is one of the most widely used measures of dispersion. The variance is
sensitive to outliers and is only reasonable for symmetric data.

Definition 5.9 The sample variance for a sample of n values x_1, x_2, ..., x_n is the average of the squared deviations from the sample mean x̄:

s̃² = (1/n) Σ_{i=1}^{n} (x_i − x̄)².    (5.5)

The unbiased variance estimator, also called the empirical variance, for a sample of n values x_1, x_2, ..., x_n is the sum of the squared deviations from their mean x̄ divided by n − 1, i.e.

s² = {n/(n − 1)} s̃² = {1/(n − 1)} Σ_{i=1}^{n} (x_i − x̄)².    (5.6)

Having copies X_1, ..., X_n of an rv X ∼ (μ, σ²), it can be shown that

E(S²) = E[{1/(n − 1)} Σ_{i=1}^{n} (X_i − X̄)²] = σ²,

which means that S² is an unbiased estimator of Var(X) = σ².


Standard deviation
As the variance is, due to the squares, not on the same scale as the original data, it is
useful to introduce a normalised dispersion measure.

Definition 5.10 The sample standard deviation s̃ and the estimator for the popula-
tion standard deviation based on the unbiased variance estimator are calculated from
(5.5) and (5.6):

s̃ = √(s̃²),  s = √(s²).

The R functions var() and sd() compute estimates for the variance and standard
deviation using the formulas for the unbiased estimators s 2 and s.

> var(nhtemp)
[1] 1.601763
> sd(nhtemp)
[1] 1.265608
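A minimal sketch confirming that var() uses the n − 1 denominator of (5.6):
> n = length(nhtemp)
> all.equal(var(nhtemp), sum((nhtemp - mean(nhtemp))^2) / (n - 1))
[1] TRUE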

Median absolute deviation


When computing the standard deviation, the distances to the mean are squared,
thereby assigning more weight to large deviations. The standard deviation is thus very
sensitive to outliers. Alternatively, one can use a more robust measure of dispersion,

the median absolute deviation. It is robust since the median is less sensitive to outliers
and the distances are not squared, effectively reducing the weight of outliers.
Definition 5.11 The median absolute deviation (MAD) is the median of the absolute
deviations from the median:

MAD = med |x_i − x̃_{0.5}|, ∀i ∈ {1, ..., n}.

The function mad() returns by default the MAD according to Definition 5.11.
However, if it is desired to compute the median of the absolute deviation from some
other values, one simply includes the argument center. Below is an example of
deviations both from the median and from the mean, using measurements of the annual flow of the river Nile at Ashwan between 1871 and 1970 (discharge in 10⁸ m³).

> mad(Nile)
[1] 179.3946
> mad(Nile, center = mean(Nile))
[1] 178.6533

Besides the dispersion measures discussed above, an alternative robust approach


would be to measure the average absolute distance of each realisation from the median
or the mean, i.e.

d₁ = (1/n) Σ_{i=1}^{n} |x_i − x̃_{0.5}|  or  d₂ = (1/n) Σ_{i=1}^{n} |x_i − x̄|.

5.1.7 Higher Moments

The sample estimate of the skewness S(X), see Definition 4.5, is given through

Ŝ = {n^{−1} Σ_{i=1}^{n} (x_i − x̄)³} / [{(n − 1)^{−1} Σ_{i=1}^{n} (x_i − x̄)²}^{3/2}],

and the sample excess kurtosis, see Definition 4.6, is provided by

K̂ = {n^{−1} Σ_{i=1}^{n} (x_i − x̄)⁴} / [{n^{−1} Σ_{i=1}^{n} (x_i − x̄)²}²] − 3,

which is implemented in R in package moments in functions skewness and


kurtosis respectively.

> require(moments)
> skewness(Nile)
[1] 0.3223697
> kurtosis(Nile) - 3
[1] -0.3049068

5.1.8 Box-Plot

The box-plot (or box-whisker plot) is a diagram which describes the distribution of a
given data set. It summarises the location and dispersion measures discussed previ-
ously. The box-plot gives a quick glimpse of the observations’ range and empirical
distribution.
This box-plot of the dataset Nile visualizes the skewness of the data very well,
see Fig. 5.7. Since the median, shown by the middle line, is not in the centre of the
box, the data are not symmetric and the results for calculating the median absolute
deviation from the median or from the mean differ, as we have just shown in the code
above.
Let us now analyse the dataset nhtemp using the command boxplot(). The
output is given in Fig. 5.8.
> boxplot(nhtemp)

Fig. 5.7 Box-plot of the data Nile, with annotations for the upper fence, the upper quartile x̃_{0.75}, the median, the lower quartile x̃_{0.25} and the lower fence. BCS_Boxplot2

Fig. 5.8 Box-plot of the data nhtemp, with annotations for the highest value within the upper fence, the upper quartile x̃_{0.75}, the median, the lower quartile x̃_{0.25} and the lowest value within the lower fence. BCS_Boxplot

Approximately fifty percent of the observations are contained in the box. The
upper edge is the 0.75-quantile and the lower edge is the 0.25-quantile. The distance
between these two edges is the interquartile range (IQR). The median is indicated
by the horizontal line between the two edges. If its distance from the upper edge is
not equal to its distance from the lower edge, then the data are skewed.
The vertical lines extending outside the box are called whiskers. In the absence
of outliers, the ends of the whiskers indicate the sample maximum and minimum.
Otherwise, the ends of the whiskers lie at the highest value that is still within the
upper fence (x̃_{0.75} + 1.5·IQR) and the lowest value that is still within the lower fence (x̃_{0.25} − 1.5·IQR). The factor 1.5, used by default, can be modified by setting the
argument range appropriately. The (suspected) outliers are denoted by the points
outside the two fences. For R not to draw the outliers, we set the argument outline
= FALSE.
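The fences can also be computed by hand; a minimal sketch for nhtemp is given below. Note that boxplot() internally works with hinges, which may differ slightly from the quartiles for some sample sizes; boxplot.stats() returns the statistics (whisker ends, hinges and median) that are actually drawn:
> q = quantile(nhtemp, probs = c(0.25, 0.75))
> iqr = diff(q)                # interquartile range
> q[2] + 1.5 * iqr             # upper fence
> q[1] - 1.5 * iqr             # lower fence
> boxplot.stats(nhtemp)$stats  # whisker ends, hinges and median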
Another way of producing a box-plot is using the package lattice (more details
in Chap. 10). The function used here is bwplot(). Consider again the dataset
nhtemp. Since nhtemp is a time series object, it is converted to a vector using
as.vector() in order for bwplot to work for the data nhtemp.
> require(lattice)
> bwplot(as.vector(nhtemp))

5.2 Confidence Intervals and Hypothesis Testing

5.2.1 Confidence Intervals

When estimating a population parameter θ (e.g. the population mean μ or the variance
σ 2 ), it is important to have some clue about the precision of the estimation. The
precision in this context is the probability that the estimate θ̂ is wrong by less than a
given amount. It can be calculated using a random sample of size n drawn from the
population. For most cases, like θ = μ, the sample size n must be large enough so
that θ̂ can be assumed to be normally distributed (Central Limit Theorem 6.5).
The standard error of the sample mean measures the accuracy of the estimation
of the mean, and the confidence interval quantifies how close the sample mean is
expected to be to the population mean. Furthermore, it is naturally desirable to have a
confidence interval as short as possible, something which is induced by large samples.

Definition 5.12 The confidence interval (CI) for the parameter θ of a continuous rv
is a range of feasible values for an unknown θ together with a confidence coefficient
(1 − α) conveying one’s confidence that the interval actually covers the true θ.
Formally, it is written as

P(θ ∈ CI) = 1 − α.

Confidence intervals for μ when σ is known


Since the value of σ is unknown in most cases, calculating confidence intervals when
σ is known is not the typical situation. However, it can sometimes be known from past
data in repeated surveys, from surveys on similar populations, or from theoretical
considerations.
Consider a normal population with an unknown mean μ and known standard
deviation σ. Let X_i ∼ N(μ, σ²) for i = 1, ..., n be the sample rvs. Then X̄ = n^{−1} Σ_{i=1}^{n} X_i ∼ N(μ, σ²/n) and √n (X̄ − μ)/σ ∼ N(0, 1).
Now, let Z ∼ N(0, 1) and let z_{1−α/2} be such that

P(−z_{1−α/2} ≤ Z ≤ z_{1−α/2}) = 1 − α.

With Z = √n (X̄ − μ)/σ, this implies

P{X̄ − z_{1−α/2} · σ/√n ≤ μ ≤ X̄ + z_{1−α/2} · σ/√n} = 1 − α.

Thus, for a fixed α ∈ [0, 1], the confidence coefficient is (1−α) and the corresponding
100 · (1 − α)%-confidence interval for the population mean μ, assuming that the
population is normally distributed and σ is known (Fig. 5.9), is given by

Fig. 5.9 The 100·(1 − α)%-confidence interval for Z is the interval between z_{α/2} and z_{1−α/2}. The area under the normal density within this interval is equal to 1 − α. BCS_Conf2sided


[x̄ − z_{1−α/2} · σ/√n ;  x̄ + z_{1−α/2} · σ/√n].

Note that z_{1−α/2} is the (1 − α/2)-quantile of the standard normal distribution, thus −z_{1−α/2} = z_{α/2}. To find the lower and upper limits (z_{α/2} and z_{1−α/2}), one uses the quantile function for the normal distribution, qnorm(), see Sect. 4.4.
These simple confidence intervals are best computed manually in R. For the exam-
ple below, we again use the data nhtemp and assume that the standard deviation σ
of 1.25 is known, for example.
> smean = mean(nhtemp) # sample mean
> n = length(nhtemp) # sample size
> alpha = 0.1 # confidence level
> sigma = 1.25 # sd known
> z = qnorm(1 - alpha / 2) # norm. distr. quantiles
> CI = c(smean - z * sigma / sqrt(n), # confidence interval
+ smean + z * sigma / sqrt(n))
> CI
[1] 50.89456 51.42544

Sometimes only an upper limit or a lower limit for μ is desired, but not both. These
are called one-sided (one-tailed) confidence limits. For example, a toy is considered
to be harmful to children if it contains an amount of mercury that exceeds a certain
value. A European buyer wants a guarantee from a European company that their
products comply with European safety laws. The transaction may then take place
if the 99%-confidence upper limit does not exceed the desired maximum. In the
same contract, one does not want too many failures in the shipped good, e.g. the
95%-confidence lower limit should not exceed the desired minimum (Fig. 5.10).

Taking Z = √n (X̄ − μ)/σ ∼ N(0, 1), it follows that

P{μ ≥ X̄ − z_{1−α} · σ/√n} = 1 − α.

Thus, the 100·(1 − α)%-confidence lower limit is x̄ − z_{1−α} · σ/√n and the corresponding upper limit is x̄ + z_{1−α} · σ/√n.
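A minimal sketch of a one-sided lower confidence limit for the nhtemp example above (reusing the values σ = 1.25 and α = 0.1 assumed there):
> smean = mean(nhtemp); n = length(nhtemp)
> alpha = 0.1; sigma = 1.25
> smean - qnorm(1 - alpha) * sigma / sqrt(n)  # one-sided lower limit for mu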


Fig. 5.10 Two types of one-sided confidence intervals: lower limit (left) and upper limit (right)
confidence interval. BCS_Conf1Sidedleft, BCS_Conf1sidedright

Confidence intervals for μ when σ is unknown


Since the population’s σ is generally unknown, the construction of confidence inter-
vals for μ will usually be based on Student’s t-distribution (see Sect. 4.4.2). Define
V ∼ t_ν (meaning that V is t-distributed with ν degrees of freedom) by

P(−t_{1−α/2,ν} ≤ V ≤ t_{1−α/2,ν}) = 1 − α.

Now, assuming that the population is normally distributed and the sample size is n, it follows that

V = √n (X̄ − μ)/S = √n (n^{−1} Σ_{i=1}^{n} X_i − μ) / √[(n − 1)^{−1} Σ_{i=1}^{n} (X_i − X̄)²] ∼ t_{n−1},

thus

P{X̄ − t_{1−α/2,n−1} · S/√n ≤ μ ≤ X̄ + t_{1−α/2,n−1} · S/√n} = 1 − α.

Definition 5.13 The 100 · (1 − α)%-confidence interval for the population mean μ
when the population is normally distributed and σ is unknown is defined by

[x̄ − t_{1−α/2,n−1} · s/√n ;  x̄ + t_{1−α/2,n−1} · s/√n].

When calculating the confidence interval for σ unknown, the R code from above
changes only a little. To find t1−α/2,n−1 , we use the function qt(), see again Sect. 4.4.
> smean = mean(nhtemp)
> sigma2 = sd(nhtemp) # estimate sd from data
> n = length(nhtemp) # sample size
> alpha = 0.1 # confidence level
> z2 = qt(1 - alpha / 2, n - 1) # the t-distr. quantiles
> CI2 = c(smean - z2 * sigma2 / sqrt(n), # confidence interval
+ smean + z2 * sigma2 / sqrt(n))
> CI2
[1] 50.88696 51.43304

Note that this confidence interval is slightly larger than the one calculated for σ
known. In other words, having to estimate the variance introduces more uncertainty
into our estimation.
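As a cross-check, the same interval is returned by the built-in one-sample t-test (a minimal sketch with conf.level = 1 − α = 0.9):
> t.test(nhtemp, conf.level = 0.9)$conf.int  # should match CI2 above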

5.2.2 Hypothesis Testing

A hypothesis test, or test of significance, is a widely used tool in statistical analysis.


Based on the information gained from a sample, one attempts to make a decision
about a hypothesis. The hypotheses in this case are assumptions about the parameters
of the distribution followed by the rv in the population (e.g. the mean μ, proportion
p, variance σ 2 , standard deviation σ, etc.). Similarly to parameter estimation, where
no estimates are exactly equal to the parameters, it can never be concluded with
certainty whether a hypothesis is wrong or correct. Therefore, one can only reject or
fail to reject a hypothesis.
When conducting a test, two hypotheses are made, namely the null hypothesis
H0 , and an alternative hypothesis H1 (or often also Ha ). Suppose a null hypothesis
about a parameter θ is made: one desires to know whether θ is equal to a certain
value, which is denoted by θ0 . Then the alternative hypothesis H1 is that θ is not
equal to θ0 :

H0 : θ = θ0 vs H1 : θ ≠ θ0 .

It is important to note that the hypotheses are mutually exclusive. The test above is
called a two-sided or two-tailed test, since the alternative hypothesis H1 does not
make any reference to the sign of the difference θ − θ0 . Therefore, the interest here
lies only in the absolute values of θ − θ0 .
However, sometimes the investigator notes only deviations from the null hypoth-
esis H0 in one direction and ignores deviations in other directions. The investigators
could, for example, be certain that if θ is not less than or equal to θ0 , then θ must be
greater than θ0 or vice versa. Formally:

H0 : θ ≤ θ0 vs H1 : θ > θ0 ,
or
H0 : θ ≥ θ0 vs H1 : θ < θ0 .

Each time when conducting a statistical test, one faces two types of risk:

Type I error is the error of rejecting a null hypothesis when it is in fact true.
Type II error is the error of failing to reject a null hypothesis when it is actually not
true.
These two types of risks are treated in different ways: it is always desired to have
the probability of type I error, denoted by α, be as small as possible. On the other
hand, since it is ideal that a test of significance rejects a null hypothesis when it is
false, it is desired to have the probability of type II error, denoted by β, as small as
possible too.

Test for the mean of a normal population (σ known)


Suppose X ∼ N(μ, σ 2 ). For simplicity, it is assumed that σ is known. A random
sample {xi } of size n is drawn from X . The null hypothesis of the two-sided test
for the mean is then defined as follows: H0 : μ = μ0 . Using the information in
the sample, we have to make a decision about the null hypothesis H0 : should it be
rejected or not? In most cases, one looks at the deviation of the sample mean x̄ from
the hypothetical value μ0 :
|x̄ − μ0 | > 0.

However the null hypothesis H0 can not be immediately rejected. Deviations may
occur even if H0 is true, for example through an unfavorable sampling. Only when
it exceeds a certain critical value is the deviation said to be statistically significant,
or simply significant, and therefore one rejects the null hypothesis H0 .
The question is now: how to decide whether or not the deviation is significant? We
fail to reject the null hypothesis H0 as long as the estimated confidence interval for μ
of the same sample contains the hypothetical value μ0 (this constitutes a connection
between hypothesis testing and confidence intervals).
The Critical Region The critical region (or the region of rejection) is the set of values
of x̄ that cause the rejection of the null hypothesis H0 . It can be determined with the
distribution of the sample mean x̄. The probability of a type I error (also called the
significance level of a test), i.e. the probability of rejecting the null hypothesis H0
although it is true, should be at most α:

P(x̄ ∈ critical region | H0 ) = α.

The construction of the critical region depends on the type of test one conducts
(two-sided or one-sided).
Two-Sided Tests Recall the hypotheses

H0 : μ = μ0 vs H1 : μ ≠ μ0 .

In a test for the mean, a natural criterion for judging whether the observations favour
H0 or H1 is the size of the deviation of the sample mean x̄ from the hypothetical value
μ0 , i.e. (x̄ − μ0 ). Under the null hypothesis H0 , ( X̄ − μ0 ) ∼ N(0, σ²/n). Under the alternative hypothesis H1 , ( X̄ − μ0 ) ∼ N(μ1 − μ0 , σ²/n), where μ1 − μ0 ≠ 0.
Large values of the test criterion (x̄ − μ0 ) can cause the rejection of H0 in favour
of H1 . However there is no exact answer how large these values should be. The larger
is the value of (x̄ − μ0 ) required to reject H0 , the smaller is the probability of a type I
error α, but the higher is the probability of a type II error β. Larger samples minimise
both errors, but may be difficult to obtain and therefore inefficient. In practice, the
critical value is determined so that the probability of type I errors α is 0.05. This
is called a test at the 5% level. Sometimes a level of 1% is chosen when incorrect
rejection of the null hypothesis H0 is considered as a serious mistake.

The method of testing the hypothesis μ = μ0 is as follows:


1. Transform $(\bar{X} - \mu_0)$ into a standard normal variable under $H_0$:
$$Z = \sqrt{n}\,\frac{\bar{X} - \mu_0}{\sigma} \sim \mathrm{N}(0, 1).$$
Recall that in a two-sided test only the absolute value |Z| is relevant, since there is no reference to the sign of (x̄ − μ0 ) in the alternative hypothesis H1 .
2. Reject the null hypothesis $H_0: \mu = \mu_0$ if $z$, as a realization of $Z$, fulfills
$$z \in \text{critical region} = (-\infty,\, -z_{1-\frac{\alpha}{2}}) \cup (z_{1-\frac{\alpha}{2}},\, +\infty),$$
that is to say $|z| > z_{1-\frac{\alpha}{2}}$. The value $z_{1-\frac{\alpha}{2}}$ is determined in such a way that
$$\Phi(z_{1-\frac{\alpha}{2}}) = 1 - \frac{\alpha}{2},$$
where $\Phi$ is the cdf of $\mathrm{N}(0, 1)$. It is important to note that the critical region of a two-sided test is symmetric, with a probability of $\frac{\alpha}{2}$ on each side. Thus one rejects the null hypothesis $H_0: \mu = \mu_0$ if
$$\bar{x} \in \text{critical region} \equiv \left(-\infty,\; \mu_0 - z_{1-\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}}\right) \cup \left(\mu_0 + z_{1-\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}},\; +\infty\right).$$

It is easy to see the connection between the two-sided test of μ = μ0 and the 100 · (1 − α)%-confidence interval for μ. According to the rejection rule, H0 : μ = μ0 fails to be rejected by a two-sided test when
$$\bar{x} \in \left(\mu_0 - z_{1-\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}},\; \mu_0 + z_{1-\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}}\right).$$
Rearranging this condition, one obtains
$$\mu_0 \in \left(\bar{x} - z_{1-\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}},\; \bar{x} + z_{1-\frac{\alpha}{2}} \cdot \frac{\sigma}{\sqrt{n}}\right),$$
the two-sided 100 · (1 − α)%-confidence interval for μ. It contains exactly those hypothetical values μ0 for which a two-sided test based on the sample mean x̄ fails to reject H0.
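This equivalence is easy to check numerically. The following sketch again uses the nhtemp data and assumes, as in the example above, σ = 1.25 and α = 0.1; the hypothetical value μ0 is an arbitrary choice for illustration. Both logical expressions at the end give the same decision.
> smean = mean(nhtemp); n = length(nhtemp)
> sigma = 1.25; alpha = 0.1                # assumed values for illustration
> z     = qnorm(1 - alpha / 2)
> CI    = c(smean - z * sigma / sqrt(n),   # two-sided confidence interval
+           smean + z * sigma / sqrt(n))
> mu0   = 51                               # hypothetical mean (arbitrary)
> zstat = sqrt(n) * (smean - mu0) / sigma  # test statistic
> abs(zstat) > z                           # TRUE iff H0 is rejected
> (mu0 < CI[1]) | (mu0 > CI[2])            # identical decision via the CI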

One-Sided Tests Unlike in a two-sided test, the critical region of a one-sided test is
not symmetric. Thus, the hypotheses do not concern a single discrete value, but are
expressed as
H0 : μ ≤ μ0 vs H1 : μ > μ0 .

It is often needed when a new treatment (e.g. scholarships for disadvantaged students)
is of no interest unless it is superior to the standard treatment (no scholarships). Thus,
the null hypothesis can be expressed as H0 : the average grade of disadvantaged
students does not increase after they receive scholarships. The rule is to reject H0
when
$$z = \sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma} > z_{1-\alpha}.$$
The critical region is the interval $(z_{1-\alpha}, +\infty)$; the area under the bell curve $\varphi(\cdot)$ over this interval is $\alpha$.
Inversely, when the investigator is interested to know whether or not the population
mean is smaller than a certain value μ0 , the hypotheses are

H0 : μ ≥ μ0 vs H1 : μ < μ0 .

The rule to reject $H_0$ is
$$z = \sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma} < z_{\alpha}.$$
The critical region is the interval $(-\infty, z_{\alpha})$; again, the area under the bell curve $\varphi(\cdot)$ over this interval is $\alpha$.
Manual Hypothesis Testing Unfortunately, there is no function in R that does
hypothesis testing under the assumption that σ is known. But one can easily compute
the statistics manually.
For the following example, consider nottem, a sample of size n = 240 contain-
ing average monthly air temperatures at Nottingham Castle in degrees Fahrenheit for
the period 1920–1939, which is then converted to Celsius. Supposing the standard
deviation in the population σ = 5, one desires to test whether the average monthly
temperatures in degrees Celsius in Nottingham is equal to μ = 15.5 using the sample
mean x̄ = 9.46. The hypotheses are written as follows:

H0 : μ = 15.5 vs H1 : μ ≠ 15.5 (for two-sided test),


H0 : μ ≥ 15.5 vs H1 : μ < 15.5 (for left-sided test),
H0 : μ ≤ 15.5 vs H1 : μ > 15.5 (for right-sided test).

The next step is to convert the test criterion into a standard normal variable under H0.

> nottemc = (nottem - 32) * 5 / 9 # Fahrenheit to Celsius


> n = length(nottemc); n # sample size
[1] 240
> z = (mean(nottemc) - 15.5) / (5 / sqrt(n))
> z # test statistics
[1] -18.69432

Depending on the type of test, one uses different decision rules for rejecting or
not rejecting the null hypothesis H0 :
1. for the two-sided test, one rejects H0 at α = 0.05 since |z| > z 0.975 = 1.96,
2. for the left-sided test, one rejects H0 at α = 0.05 since z < z 0.05 = −1.64,
3. for the right-sided test, one cannot reject H0 at α = 0.05 since z ≯ z 0.95 = 1.64.
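The same decisions follow directly from the standard normal quantiles; a short sketch continuing the listing above, where z = -18.69 is the computed test statistic:
> alpha = 0.05
> abs(z) > qnorm(1 - alpha / 2)   # two-sided test:   TRUE, reject H0
> z < qnorm(alpha)                # left-sided test:  TRUE, reject H0
> z > qnorm(1 - alpha)            # right-sided test: FALSE, do not reject H0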

Test for the mean of a normal population (σ unknown)


Just as in the construction of confidence intervals, Student’s t distribution can be
used in hypothesis testing under certain circumstances, namely when the popula-
tion’s standard deviation σ is unknown and therefore needs to be estimated. An
indispensable assumption for small samples is that the rv is approximately normally
distributed. If this is the case, then the test statistic follows Student’s t distribution.
In large samples, this assumption is no longer obligatory since the sample means
are approximately normally distributed by the Central Limit Theorem, see Serfling
(1980). The decision rules for the hypothesis testing are similar to those of the test
for the mean when σ is known, except that the standard normal distribution N(0, 1)
is replaced by the Student’s t-distribution with n − 1 degrees of freedom, where n is
the sample size.
Two-Sided Tests The rule for two-sided tests is to reject the null hypothesis
H0 : μ = μ0 if

$$t = \sqrt{n}\,\frac{\bar{x} - \mu_0}{s} \in \text{critical region} \equiv (-\infty,\, -t_{n-1,\,1-\frac{\alpha}{2}}) \cup (t_{n-1,\,1-\frac{\alpha}{2}},\, +\infty),$$
that is to say $|t| > t_{n-1,\,1-\frac{\alpha}{2}}$. The value $t_{n-1,\,1-\frac{\alpha}{2}}$ is determined so that
$$P(T \leq t_{n-1,\,1-\frac{\alpha}{2}}) = 1 - \frac{\alpha}{2}, \quad \text{where } T \sim t_{n-1}.$$

One-Sided Tests The rules for hypothesis testing in this case are analogous to those used when σ is known.

For the hypotheses

H0 : μ ≤ (≥) μ0 vs H1 : μ > (<) μ0 ,

H0 is rejected if

$$t = \sqrt{n}\,\frac{\bar{x} - \mu_0}{s} > t_{n-1,\,1-\alpha} \quad \left(<\, -t_{n-1,\,1-\alpha}\right).$$

Hypothesis Testing Using p-Values The critical regions depend on the distribution
of the test statistics and on the probability of a type I error α. This makes manual
testing inconvenient, since for every new value of α the critical region has to be
recomputed. To overcome this problem most of the tests in R can be performed using
the concept of the p-value.
Definition 5.14 The p-value is the probability of obtaining a test criterion at least
as large as the observed one, assuming that the null hypothesis is true.
For continuous distributions of the test criterion, the p-value can be determined as that value $\alpha^*$ for which the test criterion coincides with the boundary of the critical region. For the one-sided test for the mean with $H_0: \mu \leq \mu_0$, this implies
$$\sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma} = z_{1-\alpha^*}.$$
Solving for $\alpha^*$ leads to
$$\text{p-value} = 1 - \Phi\left(\sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma}\right).$$
Similarly, for the one-sided test with $H_0: \mu \geq \mu_0$, we obtain
$$\text{p-value} = \Phi\left(\sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma}\right).$$
In the case of the two-sided test, it can be shown using similar logic that
$$\text{p-value} = 2 - 2\,\Phi\left(\sqrt{n}\,\left|\frac{\bar{x} - \mu_0}{\sigma}\right|\right).$$

In the case of unknown σ, the cdf Φ is replaced with the cdf of the t-distribution with n − 1 degrees of freedom. For more complicated tests and distributions, the p-value
should be determined individually.
The decision regarding the rejection of the null hypothesis is made using a simple
scheme:
• the null hypothesis is rejected if the p-value is smaller than the prespecified sig-
nificance level α;
• the null hypothesis is not rejected if the p-value is equal to or larger than α.
This decision rule is independent of the type of the test and the distribution of the
test criterion, allowing for quick testing with different levels of α.
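For the nottem example with known σ = 5, the three p-values can be computed directly from Φ; a short sketch:
> z = (mean(nottemc) - 15.5) / (5 / sqrt(length(nottemc)))  # test statistic
> 2 - 2 * pnorm(abs(z))      # two-sided p-value
> pnorm(z)                   # left-sided p-value  (H0: mu >= mu0)
> 1 - pnorm(z)               # right-sided p-value (H0: mu <= mu0)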
Using t.test() Hypothesis testing of the mean with unknown variance involving
Student’s t test can be used in R through the function t.test(). For the next

example, consider the dataset nhtemp, a sample of size n = 60 containing the


mean annual temperature x in degrees Fahrenheit in New Haven, Connecticut, with
a sample mean of x̄ = 51.16. Suppose one wants to test whether the population mean
μ is equal to the hypothetical value μ0 , say 50. This is a two-sided test of the mean:

H0 : μ = 50 vs H1 : μ ≠ 50.

Under the assumption that the standard deviation σ is unknown, one uses t.test().
> t.test(x = nhtemp,
+ alternative = "two.sided", # two-sided test
+ mu = 50, # for mu = 50
+ conf.level = 0.95) # at level 0.95

One Sample t-test

data: nhtemp
t = 7.0996, df = 59, p-value = 1.835e-09
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
50.83306 51.48694
sample estimates:
mean of x
51.16

As can be seen from the listing above, besides the test statistic, the function t.test() returns the confidence interval and the sample estimate. The test above leads to the rejection of the null hypothesis H0 : μ = 50 at the 95%-confidence level. Obviously, in this two-sided test the hypothetical value μ0 = 50 lies outside the 95%-confidence interval. The absolute value of Student's t statistic is greater than the critical value $t_{59,\,1-\frac{0.05}{2}}$, since the p-value is much smaller than α = 5%.
To conduct a one-sided test, the argument alternative must be changed into
less or greater.
> t.test(x = nhtemp,
+ alternative = "less", # one-sided test
+ mu = 50, # for mu < 50
+ conf.level = 0.95) # at level 0.95

One Sample t-test

data: nhtemp
t = 7.0996, df = 59, p-value = 1
alternative hypothesis: true mean is less than 50
95 percent confidence interval:
-Inf 51.43304
sample estimates:
mean of x
51.16

Thus, with the hypothetical value μ0 = 50 and the alternative H1 : μ < μ0, the p-value of 1 means that H0 : μ ≥ μ0 cannot be rejected at the 5% significance level; the data give no evidence that the mean temperature is below 50.
Testing σ 2 of a normal population
It is also interesting to see whether the population variance has a certain value of σ02 .
The question is then: how to construct confidence intervals for σ 2 from the estimator
s 2 ? How to test hypotheses about the value of σ 2 ? With a little modification of s 2 ,
one can answer these questions by looking at a rv that follows a χ2ν distribution.
If X 1 , ..., X n are i.i.d. random normal variables, then using Definition 4.13 of the
χ2 distribution we obtain

$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}, \quad \text{with } S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$

Confidence Intervals for σ 2 Let Y ∼ χ2ν . Now, choose χ2ν,1−α/2 such that

P(Y ≥ χ2ν,1−α/2 ) = α/2.

and χ2ν,α/2 such that

P(Y ≤ χ2ν,α/2 ) = α/2.

Since $Y = \nu S^2/\sigma^2$, it is easy to show that
$$P\left(\chi^2_{\nu,\frac{\alpha}{2}} < \frac{\nu S^2}{\sigma^2} < \chi^2_{\nu,1-\frac{\alpha}{2}}\right) = 1 - \alpha,$$
which is equivalent to
$$P\left(\frac{\nu S^2}{\chi^2_{\nu,1-\frac{\alpha}{2}}} < \sigma^2 < \frac{\nu S^2}{\chi^2_{\nu,\frac{\alpha}{2}}}\right) = 1 - \alpha.$$

This is the general formula for a two-sided 100 · (1 − α)%-confidence limit for σ 2 .
The number of degrees of freedom is ν = n − 1 if s 2 is computed from a sample of
size n.
Testing for σ 2 This situation occurs, for example, when a theoretical value of σ 2 is
to be tested or when the sample data are being compared to a population whose σ 2
is known. If the hypotheses are

H0 : σ 2 ≤ (≥) σ02 vs H1 : σ 2 > (<) σ02 ,

then reject H0 if
$$y = \frac{\nu s^2}{\sigma_0^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma_0^2} > \chi^2_{\nu,\,1-\alpha} \quad \left(<\, \chi^2_{\nu,\,\alpha}\right).$$

For a two-sided test, reject the null hypothesis $H_0$ if
$$y < \chi^2_{\nu,\,\frac{\alpha}{2}} \quad \text{or} \quad y > \chi^2_{\nu,\,1-\frac{\alpha}{2}}.$$
Note that for values $\nu > 100$, an approximation can be made by
$$Z = \sqrt{2Y} - \sqrt{2\nu - 1} \sim \mathrm{N}(0, 1).$$

The rejection rule of the null hypothesis H0 using this proxy variable is analogous
to the test of a mean when σ is known.
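Base R has no ready-made function for this variance test, but the confidence interval and the test statistic are easily computed by hand; a sketch for the nhtemp data, where the hypothetical value σ0² = 2 is an arbitrary choice for illustration:
> x      = nhtemp
> nu     = length(x) - 1                   # degrees of freedom
> s2     = var(x)                          # sample variance
> alpha  = 0.05
> c(nu * s2 / qchisq(1 - alpha / 2, nu),   # two-sided CI for sigma^2
+   nu * s2 / qchisq(alpha / 2, nu))
> sigma02 = 2                              # hypothetical variance (assumed)
> y       = nu * s2 / sigma02              # test statistic
> y > qchisq(1 - alpha, nu)                # reject H0: sigma^2 <= sigma0^2 ?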
Test for equal means μ1 = μ2 of two independent samples
In comparative studies, one is interested in the differences between effects rather than
the effects themselves. For instance, it is not the absolute level of sugar concentration
in blood reported for two types of diabetes medication that is of interest, but rather
the difference between the levels of sugar concentration. One of many aspects of
comparative studies is comparing the means of two different populations.
Consider two samples {xi,1 }i∈{1,...,n 1 } and {x j,2 } j∈{1,...,n2 } , independently drawn
from N(μ1 , σ12 ) and N(μ2 , σ22 ) respectively. The two-sample test for the mean is as
follows:

H0 : μ1 − μ2 = δ0 vs H1 : μ1 − μ2 ≠ δ0 .

There are two cases to distinguish:


1. both populations have the same standard deviation (σ1 = σ2 );
2. the two populations have different standard deviations (σ1 ≠ σ2).
Before testing the hypotheses, it is important to acquire information about the variance of the difference of the sample means σ²_{x̄1−x̄2}.

Definition 5.15 Under the assumption of independent rvs, the variance of the dif-
ference between the sample means is defined as

$$\sigma^2_{\bar{x}_1 - \bar{x}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$

Furthermore, this variance can be estimated and used to construct the test statistics
later on. The estimation of σx̄21 −x̄2 depends on the assumptions about σ1 and σ2 .
1. When both populations have the same variance σ² = σ1² = σ2², then σ² is estimated by the unbiased pooled estimator s²_pooled:
$$s^2_{\text{pooled}} = \frac{\text{pooled sum of squares}}{\text{pooled degrees of freedom}} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}. \quad (5.7)$$

Thus, the sample estimate $s^2_{\bar{x}_1 - \bar{x}_2}$ of the population variance $\sigma^2_{\bar{x}_1 - \bar{x}_2}$ is
$$s^2_{\bar{x}_1 - \bar{x}_2} = s^2_{\text{pooled}}\left(\frac{1}{n_1} + \frac{1}{n_2}\right).$$
2. When the populations have different variances σ1² ≠ σ2², then $\sigma^2_{\bar{x}_1 - \bar{x}_2}$ is estimated by the following unbiased estimator:
$$s^2_{\bar{x}_1 - \bar{x}_2} = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}. \quad (5.8)$$

Whether the first case applies can be investigated by the function var.test(),
which uses the F-distribution introduced in Sect. 4.4.3. Consider in this example
sleep, a data frame with 20 observations on 2 variables: the amount of extra sleep
after taking a drug (extra) and the control group (group).
> # test for equal variances
> var.test(sleep$extra, sleep$group,
+ ratio = 1, # hypothesized ratio of variances
+ alternative = "two.sided", # two-sided test
+ conf.level = 0.95) # at level 0.95

F test to compare two variances

data: sleep$extra and sleep$group


F = 15.4736, num df = 19, denom df = 19, p-value = 1.617e-07
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
6.124639 39.093291
sample estimates:
ratio of variances
15.4736

The null hypothesis of equal variances of the groups is rejected. This result will
be useful when testing for equal means.
Testing the Hypothesis μ1 = μ2 when σ² = σ1² = σ2²
Under this assumption, use (5.7) for the estimator $s_{\bar{x}_1 - \bar{x}_2}$ as follows:
$$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{s^2_{\text{pooled}}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.$$

Hence, the rejection rule for H0 in the two-sided test

H0 : μ1 = μ2 vs H1 : μ1 ≠ μ2 ,
is
$$|t| = \left|\frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}\right| > t_{n_1+n_2-2,\,1-\frac{\alpha}{2}}.$$

If the hypotheses are

H0 : μ1 ≤ (≥) μ2 vs H1 : μ1 > (<) μ2 ,

then reject $H_0$ if
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}} > t_{n_1+n_2-2,\,1-\alpha} \quad \left(<\, -t_{n_1+n_2-2,\,1-\alpha}\right).$$

Testing the Hypothesis μ1 = μ2 when σ1² ≠ σ2²
Under this assumption, use (5.8) and the estimator $s_{\bar{x}_1 - \bar{x}_2}$ as follows:
$$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}.$$

Hence, the rejection rule for $H_0$ in the two-sided test
$$H_0: \mu_1 = \mu_2 \quad \text{vs} \quad H_1: \mu_1 \neq \mu_2,$$
is $|t| = \left|\frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}\right| > t_{\nu,\,1-\frac{\alpha}{2}}$.
1 2
If the hypotheses are

H0 : μ1 ≤ (≥) μ2 vs H1 : μ1 > (<) μ2 ,

then reject $H_0$ if
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}} > t_{\nu,\,1-\alpha} \quad \left(<\, -t_{\nu,\,1-\alpha}\right).$$
The degrees of freedom $\nu$ are computed as
$$\nu = \left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2 \left\{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}\right\}^{-1}.$$
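A minimal sketch of the two-sample statistic under unequal variances for the sleep data, splitting extra by group. The squared statistic t² and the Welch degrees of freedom should reproduce the F and denominator df values reported by oneway.test() below.
> x1 = sleep$extra[sleep$group == 1]           # first drug group
> x2 = sleep$extra[sleep$group == 2]           # second drug group
> n1 = length(x1); n2 = length(x2)
> s.diff = sqrt(var(x1) / n1 + var(x2) / n2)   # estimated sd of the mean difference
> t  = (mean(x1) - mean(x2)) / s.diff          # test statistic
> nu = (var(x1) / n1 + var(x2) / n2)^2 /       # Welch degrees of freedom
+      ((var(x1) / n1)^2 / (n1 - 1) + (var(x2) / n2)^2 / (n2 - 1))
> c(t^2, nu)                                   # compare with oneway.test() output
> 2 * pt(-abs(t), nu)                          # two-sided p-value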

Using oneway.test() Testing for equal means is done using the function
oneway.test. The assumption about the variances can be specified in the argu-
ment var.equal. Consider again the dataframe sleep. Suppose we want to test
whether the mean of the hours of sleep in the first group is equal to that of the second
group (H0 : μ1 = μ2 ).

> # assuming not equal variances


> oneway.test(extra ~ group, data = sleep, var.equal = FALSE)

One-way analysis of means (not assuming equal variances)

data: extra and group


F = 3.4626, num df = 1.000, denom df = 17.776, p-value = 0.0794

The test in R relies on the squared test criterion T 2 , which follows an F-distribution
with 1 and ν degrees of freedom. Using the p-value approach, one cannot reject the
null hypothesis H0 of equality of means of both groups at the 5%-level, since the
p-value > 0.05. This applies in both cases, whether the variances are equal or not.
However at the 10%-level, one rejects H0 since the p-value < 0.1.

5.3 Goodness-of-Fit Tests

In the following, let F and G be two continuous distributions and {x1 , . . . , xn } be


a random sample from an rv X with unknown distribution. A common question is what distribution X has. Another frequently asked question is whether the observations {x1 , . . . , xn } and {y1 , . . . , ym } are realizations of rvs X and Y with the same distribution, F = G. Table 5.2 gives an overview of the tests introduced below.

Table 5.2 Conducting nonparametric tests in R

  Test                  Samples  Scale requirement        R syntax          Null hypothesis
  Kolmogorov–Smirnov    ≤ 2      Interval                 ks.test()         F = G
  Anderson–Darling      ≤ 2      Interval                 ad.test()         F = G
  Cramér–von Mises      ≤ 2      Interval                 cvm.test()        F = G
  Shapiro–Wilk          ≤ 2      Interval                 shapiro.test()    X ∼ N
  Wilcoxon signed rank  ≤ 2      Ordinal and paired       wilcox.test()     x̃0.5 = c
  Mann–Whitney U        ≤ 2      Ordinal and non-paired   wilcox.test()     F1(x) = F2(x)
  Kruskal–Wallis        any      Ordinal                  kruskal.test()    F1(x) = · · · = Fm(x)

5.3.1 General Tests

The Kolmogorov–Smirnov, Anderson–Darling and Cramér–von Mises tests measure


the distance between two distributions. They check whether a sample
is the realization of a rv that follows some prespecified but arbitrary distribution.
If the distance is large enough, the distributions are regarded as different. In R,
the package stats implements the Kolmogorov–Smirnov test and the package
goftest implements the Anderson–Darling and Cramér–von Mises tests.
Kolmogorov–Smirnov test
The Kolmogorov–Smirnov test determines whether two distributions F and G are
significantly different:

H0 : F = G vs. H1 : F ≠ G.

The underlying idea of this test is to measure the so-called Kolmogorov–Smirnov


distance between two distributions, under the restriction that both functions are con-
tinuous. The measure for the distance is the supremum of the absolute difference
between the two distribution functions:

$$KS = \sup_{x} |F(x) - G(x)|. \quad (5.9)$$

This test is commonly used to compare the ecdf to an assumed parametric cdf. The following example checks whether the standardised log-returns $\tilde{r}_{t,DAX} = (r_{t,DAX} - \bar{r}_{DAX})/s_{r_{DAX}}$ of the DAX index follow a t-distribution with $k$ degrees of freedom, where $r_{t,DAX} = \log(P_{t+1,DAX}/P_{t,DAX})$ are the log-returns and $P_{t,DAX}$ the price at time $t$. Standardised log-returns have zero mean and unit standard deviation.

$$H_0: F_{\tilde{r}_{DAX}} = F_{t_k} \quad \text{vs.} \quad H_1: F_{\tilde{r}_{DAX}} \neq F_{t_k}.$$

Using the alternative notation we write

H0 : r̃_DAX ∼ tk vs. H1 : r̃_DAX ≁ tk .

The number of degrees of freedom k is found via maximum likelihood estimation (see Sect. 6.3.4 for the estimation of copulae).
The test statistic follows the Kolmogorov distribution, which was originally tabu-
lated in Smirnov (1939). This distribution is independent of the assumed continuous
univariate distribution under the null hypothesis.
> require(stats)
> dax = EuStockMarkets[, 1] # DAX index
> r.dax = diff(log(dax)) # log-returns
> r.dax_st = scale(r.dax) # standardisation
> l = function(k, x){ # log-likelihood
+ -sum(dt(x, df = k, log = TRUE))
+ }
> k_ML = optimize(f = l, # optimize l
+ interval = c(0, 30), # range of k
+ x = r.dax_st)$minimum # retrieve optimal k
> k_ML
[1] 11.56088
> ks.test(x = r.dax_st, # test for t-dist.
+ y = "pt", # t distribution function
+ df = k_ML) # estimated df

One-sample Kolmogorov-Smirnov test

data: r.dax_st
D = 0.063173, p-value = 7.194e-07
alternative hypothesis: two-sided

H0 can be rejected for any significance level larger than the p-value. To test against
other distributions, the parameter y should be set equal to the corresponding string
variable of the cdf, like pnorm, pgamma, pcauchy etc.
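To illustrate, a test of the standardised DAX log-returns against the standard normal distribution is obtained by setting y = "pnorm"; given the heavy tails documented by the normality tests in Sect. 5.3.2, a rejection is to be expected here as well.
> r.dax_st = scale(diff(log(EuStockMarkets[, 1])))  # standardised log-returns
> ks.test(x = r.dax_st, y = "pnorm")                # test against N(0, 1)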
To test whether the DAX and the FTSE log-returns follow the same distribution,
one runs the following code in R.
> r.dax = diff(log(EuStockMarkets[, 1]))
> ftse = EuStockMarkets[, 4] # FTSE index
> r.ftse = diff(log(ftse)) # log-returns
> r.ftse_st = scale(r.ftse) # standardisation
> ks.test(r.dax, r.ftse) # test with raw
# log-returns
Two-sample Kolmogorov-Smirnov test

data: r.dax and r.ftse


D = 0.053792, p-value = 0.009223
alternative hypothesis: two-sided
> r.dax_st = scale(r.dax) # standardisation
> ks.test(r.dax_st, r.ftse_st) # test with standardised
# log-returns
Two-sample Kolmogorov-Smirnov test

data: r.dax_st and r.ftse_st


D = 0.034965, p-value = 0.2058
alternative hypothesis: two-sided

H0 can be rejected for non-standardised log-returns, indicating that the DAX and the
FTSE log-returns do not follow the same distribution. After standardisation of the
log-returns, one can not reject H0 . Figure 5.11 illustrates these test results. The non-
standardised log-returns have different means and standard deviations. Therefore
the rejection of H0 in the first test is due to different first and second moments. This
example shows that the scaling of the variables can influence the test results.
The Kolmogorov-Smirnov test belongs to the group of exact tests, which are more
reliable in smaller samples than asymptotic tests.
Fig. 5.11 Empirical cumulative distribution functions for DAX and FTSE log-returns (left: raw log-returns, right: standardised log-returns). BCS_EdfsDAXFTSE

Cramér–von Mises and Anderson–Darling test


Alternatives to the Kolmogorov-Smirnov test are the Cramér–von Mises and
Anderson–Darling tests. Instead of the largest distance in (5.9), they use the weighted
squared distance between the two distributions as the loss function:
$$\int_{-\infty}^{\infty} \{F(x) - G(x)\}^2\, w(x)\, dG(x).$$

F(x) is replaced by the ecdf of the sample and G(x) is the cdf for the distribution
of X under H0 . The Anderson–Darling test uses different weights than does the
Cramér–von Mises test.
It is necessary to use order statistics to conduct both tests. Let $\{x_{(1)}, \ldots, x_{(n)}\}$ be the ordered realizations of the rv $X$ with $\mathrm{E}X = \mu$ and $\mathrm{Var}\,X = \sigma^2$. Furthermore, one has to standardise these ordered realizations, $z_{(i)} = (x_{(i)} - \mu)/\sigma$. In the following it is assumed that $\mu = \bar{x}$ and $\sigma = s$.
For the Cramér–von Mises test, $w(x) = 1$ and the test statistic is
$$CM = \frac{1}{12n} + \sum_{i=1}^{n}\left\{\frac{2i-1}{2n} - G(z_{(i)})\right\}^2,$$

where n denotes the sample size. The distribution of the test statistic under H0 can be
computed in R with pCvM and qCvM according to Csorgo and Farraway (1996). In the
following code, the Cramér–von Mises test is used to check whether the standardised DAX log-returns follow the t-distribution with the same number of degrees of freedom as found in the previous subsection. In R the ordering of the realizations is done automatically, but it is necessary to standardise them.

> require(goftest)
> r.dax_st = scale(diff(log(EuStockMarkets[, 1])))
> cvm.test(r.dax_st, # Cramer von Mises test
+ null = "pt", # to test for t distr
+ df = 11.56088) # degrees of freedom

Cramer-von Mises test of goodness-of-fit


Null hypothesis: Student’s t distribution
with parameter df = 11.56088

data: r.dax_st
omega2 = 3.0274, p-value = 6.457e-08

Again, the null hypothesis that r̃ D AX comes from a rv that follows a t11.56088 -
distribution can be rejected.
The Anderson–Darling test sets $w(x) = [G(x)\{1 - G(x)\}]^{-1}$, which leads to the test statistic
$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i - 1)\left[\log\{G(z_{(i)})\} + \log\{1 - G(z_{(n+1-i)})\}\right].$$

The distribution of the Anderson–Darling test statistic can be obtained by pAD and
qAD. These functions are based on Marsaglia and Marsaglia (2004).
The following code uses the Anderson–Darling test to test whether the standard-
ised DAX index log-returns follow a t11.56088 -distribution.
> require(goftest)
> r.dax_st = scale(diff(log(EuStockMarkets[, 1])))
> ad.test(r.dax_st, null = "pt", df = 11.56088)
# Anderson-Darling test
Anderson-Darling test of goodness-of-fit
Null hypothesis: Student’s t distribution
with parameter df = 11.56088

data: r.dax_st
An = 17.715, p-value = 3.228e-07

The null hypothesis can be rejected for a significance level close to zero. Therefore the
standardised log-returns of the DAX-index do not follow the t11.56088 -distribution. All
three tests reject the null hypothesis of t-distributed log-returns of the DAX-index.
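The Cramér–von Mises statistic itself can be reproduced directly from its definition; a sketch for the DAX example, whose result should agree with the omega2 value reported by cvm.test() above:
> r.dax_st = scale(diff(log(EuStockMarkets[, 1])))   # standardised log-returns
> z  = sort(as.numeric(r.dax_st))                    # ordered realizations
> n  = length(z)
> G  = pt(z, df = 11.56088)                          # cdf under H0
> CM = 1 / (12 * n) + sum(((2 * (1:n) - 1) / (2 * n) - G)^2)
> CM                                                 # compare with omega2 above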

5.3.2 Tests for Normality

To verify whether a sample is generated by a normal distribution, the Shapiro–Wilk


and the Jarque–Bera tests have been constructed. These tests are introduced in the
following.

Shapiro–Wilk test
The Shapiro–Wilk test was developed in Shapiro and Wilk (1965). The specific form
of H0 leads to desirable efficiency properties. Especially in small samples, the power
of this test is superior to other nonparametric tests, see Razali and Wah (2011).
The rv X is tested as to whether it follows a normal distribution.

H0 : X ∼ N vs. H1 : X  N.

The test statistic $W$ is calculated by dividing the theoretically expected variance under normality by the realized sample variance:
$$W = \frac{\sigma^2}{(n-1)\tilde{s}^2} = \frac{\left(\sum_{i=1}^{n} c_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_{(i)} - \bar{x})^2},$$
where $\tilde{s}$ is the sample standard deviation and $\sigma$ is the theoretically expected standard deviation, which is calculated from the order statistics $x_{(i)}$ as follows:
$$c = \frac{\tau^{\top} V^{-1}}{\sqrt{\tau^{\top} V^{-1} V^{-1} \tau}},$$
$$\sigma = \sum_{i=1}^{n} c_i x_{(i)}, \quad (5.10)$$

where τ is the vector of theoretically expected ordered values of x under normality


and V the covariance matrix of τ .
The order statistics $x_{(i)}$ under $H_0$ follow the normal distribution, and the probability of having an observation smaller than $x_{(i)}$ is $\Phi(x_{(i)})$. Values for $\tau$ are obtained under $H_0$ by
$$\tau_{(i)} = \Phi^{-1}\left(\frac{i - 3/8}{n + 1/4}\right).$$

Under H0 , the theoretical values τ(i) for x(i) depend only on the sample size and the
position i. Under H0 , the theoretical expected variance σ 2 should be close to the
sample variance σ̂ 2 . The test statistic W is bounded by

$$\frac{n c_1^2}{n - 1} \leq W \leq 1.$$

If the test statistic W is close to one, H0 can not be rejected. In this case the sample
can be regarded as a realization of a normal rv. For low values of W it is likely that
the null hypothesis is wrong and can be rejected. The distribution of the test statistic
is tabulated in Shapiro and Wilk (1965).
The next example tests whether or not the DAX log-returns follow a normal
distribution. The function shapiro.test is implemented in R as follows.

> r.dax = diff(log(EuStockMarkets[, 1]))


> shapiro.test(r.dax) # by default H0: X ~ N(mu, sigma^2)

Shapiro-Wilk normality test

data: r.dax
W = 0.9538, p-value < 2.2e-16

The null hypothesis that r.dax follows a normal distribution can clearly be rejected.
> random = rnorm(1000) + 5 # for this sample H0 is not rejected
> shapiro.test(random)

Shapiro-Wilk normality test

data: random
W = 0.9987, p-value = 0.7089

Jarque–Bera test
An alternative test for normality is the Jarque–Bera test. The hypotheses are the same
as for the Shapiro–Wilk test. The Jarque–Bera test considers the third and fourth
moments of the distribution. Here $\hat{S}$ is the sample skewness and $\hat{K}$ the sample excess kurtosis, see Sect. 5.1.7. Then
$$JB = \frac{n}{6}\left(\hat{S}^2 + \frac{\hat{K}^2}{4}\right).$$

This test uses the results of Chap. 4 for the moments of the normal distribution for
which skewness and excess kurtosis should be both zero. Two parameters are esti-
mated to compute the test statistic; therefore the statistic follows the χ²₂ distribution.
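The statistic is easy to compute directly from the sample moments; a sketch, in which skewness and excess kurtosis are computed by hand (small numerical differences from the jarque.bera.test() output below may arise from different moment conventions):
> r.dax = diff(log(EuStockMarkets[, 1]))     # DAX log-returns
> n     = length(r.dax)
> z     = (r.dax - mean(r.dax)) / sd(r.dax)  # standardised returns
> S     = mean(z^3)                          # sample skewness
> K     = mean(z^4) - 3                      # sample excess kurtosis
> JB    = n / 6 * (S^2 + K^2 / 4)            # Jarque-Bera statistic
> 1 - pchisq(JB, df = 2)                     # asymptotic p-value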
There is an implementation for the Jarque–Bera test in R, which requires the
package tseries.
> require(tseries)
> r.dax = diff(log(EuStockMarkets[, 1]))
> jarque.bera.test(r.dax) # by default H0: X ~ N(mu, sigma^2)

Jarque-Bera Test

data: r.dax
X-squared = 3149.641, df = 2, p-value < 2.2e-16

The p-values provided by R for daily DAX log-returns are identically small for
the two tests, but this is not true for the truly normal sample. In general, inference
with both tests might lead to different conclusions.
Most of these tests are also provided by the package fBasics and func-
tions ksnormTest, shapiroTest, jarqueberaTest, jbTest, adTest
and cvmTest.

5.3.3 Wilcoxon Signed Rank Test and Mann–Whitney U Test

The Kolmogorov–Smirnov test assumes an interval or ratio scale for the variable of
interest. Wilcoxon (1945) developed two tests that also work for ordinal data: the
Wilcoxon signed rank and rank sum tests. The latter is also known as the Mann–
Whitney U test.
The Wilcoxon signed rank test is an asymptotic test for the median x̃0.5 of the
sample {x1 , . . . , xn }, see Sect. 5.1.5.

H0 : x̃0.5 = a vs. H1 : x̃0.5 ≠ a,

where a is an assumed value. For two samples {x1,1 , . . . , x1,n 1 } and {x2,1 , . . . , x2,n 2 }
with sample sizes n 1 and n 2 , the hypotheses are

H0 : x̃1,0.5 = x̃2,0.5 vs. H1 : x̃1,0.5 ≠ x̃2,0.5 .

The algorithm of the Wilcoxon signed rank test for two samples can be written as
follows:
1. Randomly draw $n_s = \min(n_1, n_2)$ observations from the larger sample;
2. Calculate $s_i = \mathrm{sign}(x_{1,i} - x_{2,i})$ and $d_i = |x_{1,i} - x_{2,i}|$ for the paired samples;
3. Compute the ranks $R_i$ of the $d_i$, ascending from 1;
4. The test statistic is $W = \left|\sum_{i=1}^{n_s} s_i R_i\right|$.
The test statistic has the asymptotic distribution
$$W \xrightarrow[n_s \to \infty]{\mathcal{L}} \mathrm{N}(0.5, \sigma_W^2), \quad (5.11)$$
with $\sigma_W^2 = \frac{n_s(n_s + 1)(2n_s + 1)}{6}$.

Thus the Wilcoxon signed rank test checks whether two samples come from the same
population, in which case the mean of the weighted sign() operator is 0.5, just as for
a fair coin toss. If the statistic is close to 0.5, positive and negative differences are
equally likely. The second sample can also be the constant vector $a\mathbf{1}_n$ if one wants to test against a specific constant $a$.
Note the test statistic follows the normal distribution only asymptotically. How-
ever, ranks are ordinal and not metric, therefore the assumption of a normal distrib-
ution is not appropriate in finite and small samples. It is necessary to correct the test
statistic for continuity, which is done by default in R.
Consider as an example the popularity of American presidents in the past. For
this we use the dataset presidents and denote the sample by {x1 , . . . , xn }, which
contains quarterly approval ratings in percentages for the President of the United
States from the first quarter of 1945 to the last quarter of 1974. To verify that the
median ranking is at least 50, the following hypotheses are tested:

$$H_0: \tilde{x}_{0.5} \leq a \quad \text{vs.} \quad H_1: \tilde{x}_{0.5} > a,$$
with $a = 50$. Here an artificial sample $y = a\mathbf{1}_n = (50, \ldots, 50)^{\top}$ with the same number of elements as the number of observations $n$ in the dataset presidents is created. The test in R therefore compares the actual dataset with an artificial dataset whose elements are all equal to $a$. The samples for the Wilcoxon signed rank test must be paired in R, which means they have the same number of elements. If $H_0$ is rejected, the presidents have a simple majority behind them.
the presidents have a simple majority behind them.
> s = 50 # maximum median under H0
> y = rep(s, length(presidents)) # vector of constants
> wilcox.test(presidents, y, # test for presidents
+ alternative = "greater", # specifies H1
+ paired = TRUE) # signed rank test

Wilcoxon signed rank test with continuity correction

data: presidents and y


V = 4613.5, p-value = 3.298e-05
alternative hypothesis: true location shift is greater than 0

The p-value turns out to be close to zero and we can reject the hypothesis that the
true value of the approval ratings is at most 50.
Unlike the signed rank test, the Mann–Whitney U test can also be used for non-paired data. Let $F_1$ and $F_2$ denote the distributions of the two variables; the hypotheses are
$$H_0: F_1(x) = F_2(x) \quad \text{vs.} \quad H_1: F_1(x) = F_2(x - a) \quad \forall x \in \mathbb{R},$$
for some shift $a \neq 0$.
The Wilcoxon rank sum test is then calculated as follows:

1. Merge the two samples into one sample and rank the observations;
2. Let $R_j$ be the sum of all ranks in the combined sample for observations of sample $j \in \{1, 2\}$;
3. $U_j = n_1 n_2 + \frac{1}{2} n_j (n_j + 1) - R_j$, where $n_j$ is the size of sample $j \in \{1, 2\}$ and $R_1 + R_2 = \frac{1}{2} n(n + 1)$ with $n = n_1 + n_2$;
4. $U = \min(U_1, U_2)$;
5. $U \xrightarrow[n \to \infty]{\mathcal{L}} \mathrm{N}\left(n_1 n_2 / 2,\; n_1 n_2 (n_1 + n_2 + 1)/12\right)$.
n→∞

The core idea of the test is that half the maximal possible sum of ranks is deducted from the actual sum of ranks. If both samples are from the same distribution, this statistic should be close to $n_1 n_2 / 2$.
Now one may ask the question whether President Nixon’s popularity was signif-
icantly lower than that of his predecessors. The dataset is split into two parts: one
containing the realizations from the previous presidents and another set for President
Nixon:

> Other = presidents[1:96] # sample for other presidents


> Nixon = presidents[97:118] # sample for Nixon
> wilcox.test(Nixon, Other) # test for popularity

Wilcoxon rank sum test with continuity correction

data: Nixon and Other


W = 647.5, p-value = 0.03868
alternative hypothesis: true location shift is not equal to 0

The hypothesis that the medians are equal can clearly be rejected, because the
obtained p-value is smaller than 5%. The difference between the sample medians is
too big for the distributions to be considered equal. Nixon, with a median approval
rating of 49%, was significantly less popular than other presidents, with 61%.
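The steps of the algorithm are easy to reproduce by hand; a sketch for the Nixon example, where missing approval ratings are dropped first. The normal approximation at the end ignores ties and the continuity correction, so its p-value differs slightly from the one reported above.
> x1 = na.omit(Nixon); x2 = na.omit(Other)    # drop missing approval ratings
> n1 = length(x1); n2 = length(x2)
> R  = rank(c(x1, x2))                        # ranks in the pooled sample
> R1 = sum(R[1:n1]); R2 = sum(R) - R1         # rank sums per sample
> U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1       # U statistics
> U2 = n1 * n2 + n2 * (n2 + 1) / 2 - R2
> U  = min(U1, U2); U
> 2 * pnorm((U - n1 * n2 / 2) /               # asymptotic two-sided p-value,
+           sqrt(n1 * n2 * (n1 + n2 + 1) / 12))  # ignoring ties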

5.3.4 Kruskal–Wallis Test

The tests discussed above considered two samples. They can not be used to check for
the equality of more than two samples. Consider a test that rejects the null hypothesis
of pairwise equal distributions for three variables X , Y and Z if at least for one pair
a two-sample test rejects the equality of distributions at the significance level α.
It would be wrong to assume that the overall significance level of this procedure is α = 0.05: if all three null hypotheses are true, the probability of at least one false rejection is 1 − (1 − 0.05)³ ≈ 1 − 0.86 = 0.14, i.e. an effective α of 0.14 rather than 0.05.
Kruskal and Wallis (1952) developed an extension of the Mann–Whitney U test
that solves this problem. The null hypothesis is rejected if at least one sample distribution has a different location than the other distributions. Let $l \in \{1, \ldots, m\}$ index the considered samples and $a \neq 0$ be a constant:
$$H_0: F_1(x) = \cdots = F_m(x) \quad \text{vs.} \quad H_1: F_1(x) = \cdots = F_l(x + a) = \cdots = F_m(x), \quad \forall x \in \mathbb{R}.$$

The test statistic is defined as
$$K = (n - 1)\,\frac{\sum_{l=1}^{m} n_l (\bar{R}_l - \bar{R})^2}{\sum_{l=1}^{m}\sum_{j=1}^{n_l} (R_{jl} - \bar{R})^2},$$
where $n = \sum_{l=1}^{m} n_l$, $\bar{R}_l = \sum_{j=1}^{n_l} R_{jl}/n_l$ denotes the average of all ranks allocated within sample $l$, $\bar{R} = \sum_{l=1}^{m}\sum_{j=1}^{n_l} R_{jl}/n$ is the overall average of all ranks in all samples, and $R_{jl}$ is the rank in the pooled sample of observation $j$ in sample $l$.
The test statistic follows the χ2k distribution, where k = m − 1. In the following
example we compare again the popularity of the ruling president for every single
decade from the first quarter of 1945 to the last quarter of 1974. We define a variable

of starting points for each group. A Kruskal–Wallis test for the null hypothesis that
the popularity did not change significantly is executed by
> decades = c(rep(1, length.out = 20), # group indicator for decades
+ rep(2:3, each = 40),
+ rep(4, length.out = 20))
> kruskal.test(presidents, decades) # Kruskal-Wallis test

Kruskal-Wallis rank sum test

data: presidents and decades


Kruskal-Wallis chi-squared = 12.2607, df = 3, p-value = 0.006541

Over the decades, the popularity of the presidents varies significantly. This means
that the null hypothesis of equal locations of the distributions R̄ = R̄l , for all l can
be rejected at a significance level close to zero.
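The statistic itself can be reproduced from the ranks; a sketch continuing the example, dropping missing ratings as kruskal.test() does internally:
> ok = !is.na(presidents)                     # drop missing ratings
> x  = as.numeric(presidents)[ok]
> g  = decades[ok]                            # group labels from above
> R  = rank(x); n = length(x)                 # ranks in the pooled sample
> Rl = tapply(R, g, mean)                     # average rank per decade
> nl = tapply(R, g, length)                   # group sizes
> K  = (n - 1) * sum(nl * (Rl - mean(R))^2) / sum((R - mean(R))^2)
> K                                           # compare with the statistic above
> 1 - pchisq(K, df = length(nl) - 1)          # p-value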
Chapter 6
Multivariate Distributions

Though this be madness, yet there is method in’t.

— William Shakespeare, Hamlet

The preceding chapters discussed the behaviour of a single rv. This chapter introduces
the basic tools of statistics and probability theory for multivariate analysis, where
the relations between d rvs are considered. At first we present the basic tools of
probability theory used to describe a multivariate rv, including the marginal and
conditional distributions and the concept of independence.
The normal distribution plays a central role in statistics because it can be viewed
as an approximation and limit of many other distributions. The basic justification for
this relies on the central limit theorem. This is done in the framework of sampling
theory, together with the main properties of the multinormal distribution.
However, a multinormal approximation can be misleading for data which is not
symmetric or has heavy tails. The need for a more flexible dependence structure and
arbitrary marginal distributions has led to the wide use of copulae for modelling and
estimating multivariate distributions.

6.1 The Distribution Function and the Density Function


of a Random Vector

For X = (X 1 , X 2 , . . . , X d ) a random vector, the cdf is defined as

F(x) = P(X ≤ x) = P(X 1 ≤ x1 , . . . , X d ≤ xd ).

This function describes the joint behaviour of components of the vector. If X is


discrete, then there exists a joint density function p(·) given by


p(x) = P(X = x) = P(X 1 = x1 , . . . , X d = xd ).

Assume that $X_1, \ldots, X_d$ are continuous rvs satisfying
$$\frac{\partial^d}{\partial x_1 \cdots \partial x_d} F(x) = \frac{\partial^d}{\partial x_{i_1} \cdots \partial x_{i_d}} F(x)$$
for all permutations $i_1, \ldots, i_d$ of $1, \ldots, d$. Then the joint density function is given by
$$f(x) = \frac{\partial^d F(x)}{\partial x} = \frac{\partial^d F(x_1, \ldots, x_d)}{\partial x_1 \cdots \partial x_d}.$$

Note that, as in the one-dimensional case,
$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(u)\, du_1 \ldots du_d = 1.$$
If the distribution function is differentiable, then
$$F(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_d} f(u_1, \ldots, u_d)\, du_d \ldots du_1.$$

If we partition $(X_1, \ldots, X_d)^{\top}$ into $X_k^* = (X_{i_1}, \ldots, X_{i_k})^{\top} \in \mathbb{R}^k$ and $X_{-k}^* = (X_{i_{k+1}}, \ldots, X_{i_d})^{\top} \in \mathbb{R}^{d-k}$, then the function defined by
$$F_{X_k^*}(x_{i_1}, \ldots, x_{i_k}) = P(X_{i_1} \leq x_{i_1}, \ldots, X_{i_k} \leq x_{i_k})$$
is called the $k$-dimensional marginal cdf; it is equal to $F$ evaluated at $(x_{i_1}, \ldots, x_{i_k})$ with the components in $x_{-k}$ set to infinity. For continuous variables, the marginal pdf can be computed from the joint density by "integrating out" the irrelevant variables:
$$f_{X_k^*}(x_{i_1}, \ldots, x_{i_k}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_d)\, dx_{i_{k+1}} \ldots dx_{i_d}.$$
For discrete $X$, the marginal probability is given by
$$p_{X_k^*}(x_{i_1}, \ldots, x_{i_k}) = \sum_{x_{i_{k+1}}, \ldots, x_{i_d}} p(x_1, \ldots, x_d).$$

Generally speaking, the following theory works in any dimension $d \geq 2$, but in some cases, for simplicity, we restrict ourselves to the two-dimensional case $X = (X_1, X_2)^{\top}$. Let us consider the conditional pdf of $X_2$ given $X_1 = x_1$:
$$f(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)}.$$

Two rvs X 1 and X 2 are said to be independent if f (x1 , x2 ) = f X 1 (x1 ) f X 2 (x2 ),


which implies f (x1 | x2 ) = f X 1 (x1 ) and f (x2 | x1 ) = f X 2 (x2 ). Independence
can be interpreted as follows: knowing X 2 = x2 does not change the probability
assessments on X 1 , and vice versa.
In multivariate statistics, we observe the values of a multivariate rv X and obtain
a sample X = {xi }i=1n
. Under random sampling, these observations are considered
to be realisations of a sequence of i.i.d. rvs X 1 , . . . , X n , where each X i has the same
distribution as the parent or population rv X .
$$\mathcal{X} = \{x_i\}_{i=1}^{n} = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nd} \end{pmatrix},$$

where xi j is the ith realisation of the jth element of the random vector X . The idea
of statistical inference for a given random sample is to analyse the properties of the
population variable X . This is typically done by analysing some characteristics of
its distribution.

6.1.1 Moments

Expectation

The first-order moment of X , often called the expectation, is given by


$$\mathrm{E}X = \begin{pmatrix} \mathrm{E}X_1 \\ \vdots \\ \mathrm{E}X_d \end{pmatrix} = \int x f(x)\, dx = \begin{pmatrix} \int x_1 f(x)\, dx \\ \vdots \\ \int x_d f(x)\, dx \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_d \end{pmatrix} = \mu.$$

Accordingly, the expectation of a matrix of random elements has to be understood


component by component, see Definition 4.3. The operation of forming expectations
is linear

E (αX + βY ) = α E X + β E Y, (6.1)

with α, β ∈ R and X = (X 1 , . . . , X d ) and Y = (Y1 , . . . , Yd ) being random


vectors of the same dimension. If A(q × d) is a matrix of real numbers, we have

E(AX ) = A E X.

When X and Y are independent,

E(X Y  ) = E X E Y  .

Statistics describing the center of gravity of the $n$ observations in $\mathbb{R}^d$ are given by the vector $\bar{x}$ of the mean values $\bar{x}_j$ (see Sect. 5.1.5) for $j = 1, \ldots, d$, which can be obtained by
$$\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_d \end{pmatrix} = \frac{1}{n}\,\mathcal{X}^{\top} 1_n.$$

It is easy to show that x is an unbiased estimator of the expectation E X .


R offers several functions with which to calculate the sample mean. The function
mean is applicable to vector, matrix and data.frame types. If the sample is represented by a matrix A, mean(A) returns the average over all elements of the matrix. Below we apply this function to the dataset women, which contains the
> women.m = as.matrix(women)        # convert data.frame to matrix
> mean(women.m)                     # mean of all matrix elements
[1] 100.8667

rowMeans and colMeans calculate the averages by rows and columns of the matrix, or data frame, respectively.
> rowMeans(women.m)                 # averages by rows
 [1]  86.5  88.0  90.0  92.0  94.0  96.0  98.0
 [8] 100.0 102.5 104.5 107.0 109.5 112.0 115.0 118.0
> colMeans(women.m)                 # averages by columns
  height   weight
 65.0000 136.7333

Covariance matrix

The matrix
$$\mathrm{Var}\,X = \Sigma = \mathrm{E}(X - \mu)(X - \mu)^{\top} = \mathrm{E}(X X^{\top}) - \mu\mu^{\top}$$
is the (theoretical) covariance matrix, also called the centred second moment. It is positive semi-definite, i.e. $\Sigma \geq 0$, with elements $\Sigma = (\sigma_{X_i X_j})$. The off-diagonal elements are $\sigma_{X_i X_j} = \mathrm{Cov}(X_i, X_j)$ and the diagonal elements are $\sigma_{X_i X_i} = \mathrm{Var}(X_i)$, $i, j = 1, \ldots, d$, where
$$\mathrm{Cov}(X_i, X_j) = \mathrm{E}(X_i X_j) - \mu_i \mu_j, \qquad \mathrm{Var}\,X_i = \mathrm{E}X_i^2 - \mu_i^2.$$

Writing $X \sim (\mu, \Sigma)$ means that $X$ is a random vector with mean vector $\mu$ and covariance matrix $\Sigma$. The variance of a linear transformation of the variables satisfies
$$\mathrm{Var}(a^{\top}X) = a^{\top}\mathrm{Var}(X)\,a = \sum_{i,j} a_i a_j \sigma_{X_i X_j},$$
$$\mathrm{Var}(AX + b) = A\,\mathrm{Var}(X)\,A^{\top}, \quad (6.2)$$
where $a$ is a $d$-dimensional vector, $A$ a $(q \times d)$ matrix and $b$ a $q$-dimensional vector. For $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}^q$, the $(d \times q)$ covariance matrix of $X \sim (\mu_X, \Sigma_{XX})$ and $Y \sim (\mu_Y, \Sigma_{YY})$ is
$$\Sigma_{XY} = \mathrm{Cov}(X, Y) = \mathrm{E}(X - \mu_X)(Y - \mu_Y)^{\top}.$$

This result is obtained from
$$\mathrm{Cov}(X, Y) = \mathrm{E}(X Y^{\top}) - \mu_X \mu_Y^{\top} = \mathrm{E}(X Y^{\top}) - \mathrm{E}X\,\mathrm{E}Y^{\top}.$$
It follows that if $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$. $\mathrm{E}(XX^{\top})$ provides the second non-central moment of $X$:
$$\mathrm{E}(XX^{\top}) = \{\mathrm{E}(X_i X_j)\}, \quad i, j = 1, \ldots, d.$$
Linear functions of random vectors have the following covariance properties:
$$\mathrm{Cov}(X + Y, Z) = \mathrm{Cov}(X, Z) + \mathrm{Cov}(Y, Z);$$
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(Y, X) + \mathrm{Var}(Y);$$
$$\mathrm{Cov}(AX, BY) = A\,\mathrm{Cov}(X, Y)\,B^{\top}.$$
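These rules are easy to check by simulation. The following sketch compares the empirical covariance of a linear transformation AX + b with A Σ̂ A⊤; the matrix A, the vector b and the seed are arbitrary choices for illustration.
> set.seed(1)                           # arbitrary seed for reproducibility
> X = matrix(rnorm(2000), ncol = 2)     # 1000 draws of a 2-dimensional rv
> A = matrix(c(1, 2, 0, 3), ncol = 2)   # arbitrary transformation matrix
> b = c(1, -1)
> Y = t(A %*% t(X) + b)                 # transformed sample AX + b
> cov(Y)                                # empirical Var(AX + b)
> A %*% cov(X) %*% t(A)                 # A Var(X) A', should be close to cov(Y)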

The sample covariance matrix is used to estimate the second-order moment:
$$\tilde{\Sigma} = n^{-1}\mathcal{X}^{\top}\mathcal{X} - \bar{x}\bar{x}^{\top}. \quad (6.3)$$
For small sample sizes, $\tilde{\Sigma}$ is biased. An unbiased estimator of the second moment is
$$\hat{\Sigma} = \frac{1}{n-1}\mathcal{X}^{\top}\mathcal{X} - \frac{n}{n-1}\bar{x}\bar{x}^{\top}. \quad (6.4)$$
Equation (6.4) can be written equivalently in scalar form or, based on the centering matrix $H = I_n - n^{-1}1_n 1_n^{\top}$, as
$$\hat{\Sigma} = (n-1)^{-1}(\mathcal{X}^{\top}\mathcal{X} - n^{-1}\mathcal{X}^{\top}1_n 1_n^{\top}\mathcal{X}) = (n-1)^{-1}\mathcal{X}^{\top}H\mathcal{X}. \quad (6.5)$$

These formulas are implemented directly in R. The function cov returns the empirical
covariance matrix of the given sample matrix X . Its argument could be of type
data.frame, matrix, or consist of two vectors of the same size. The following code presents possible calculations of $\hat{\Sigma}$:
> women.m = as.matrix(women)
> n = dim(women.m)[1]; n
[1] 15
> meanw = colMeans(women)
> cov1  = (t(women.m) %*% women.m                  # using (6.4)
+          - n * meanw %*% t(meanw)) / (n - 1)
> H     = diag(1, n) - 1 / n * rep(1, n) %*% t(rep(1, n))
> cov2  = t(women.m) %*% H %*% women.m / (n - 1)   # using (6.5)
> cov3  = cov(women)                               # for data.frame
> cov4  = cov(women.m)                             # for matrix

As expected, all the matrices cov1, cov2, cov3 and cov4 return the same result.
        height   weight
height      20  69.0000
weight      69 240.2095

The internal function cov is twice as fast as the manual methods, with or without a predetermined centring matrix, and independent of the sample size. If the arguments of cov are two vectors x and y, then the result is their covariance.
> cov(women$height, women$weight)
[1] 69

The correlation $\rho_{X_i,X_j}$ between two rvs $X_i$ and $X_j$ is given by
$$\rho_{X_i,X_j} = \frac{\mathrm{Cov}(X_i, X_j)}{\sqrt{\mathrm{Var}\,X_i\,\mathrm{Var}\,X_j}} = \frac{\sigma_{X_i,X_j}}{\sigma_{X_i}\sigma_{X_j}}.$$

Thus, the correlation matrix is $P_X = \{\rho_{X_i,X_j}\}$. Similarly to the covariance matrix, for rvs $X$ and $Y$,
$$\mathrm{Cor}(X, Y) = (\mathrm{Var}\,X)^{-1/2}\,\mathrm{Cov}(X, Y)\,(\mathrm{Var}\,Y)^{-1/2}.$$
The sample correlation is given by
$$\hat{\rho}_{X_i,X_j} = \frac{n\sum_{m=1}^{n} x_{im}x_{jm} - \sum_{m=1}^{n} x_{im}\sum_{m=1}^{n} x_{jm}}{\sqrt{\left\{n\sum_{m=1}^{n} x_{im}^2 - \left(\sum_{m=1}^{n} x_{im}\right)^2\right\}\left\{n\sum_{m=1}^{n} x_{jm}^2 - \left(\sum_{m=1}^{n} x_{jm}\right)^2\right\}}}.$$

The calculation of the sample correlation in R is done by the function cor, which is similar to cov. The covariance matrix $\hat{\Sigma}$ may be converted into a correlation matrix using the function cov2cor.
> cor(women)
> cov2cor(cov(women))
          height    weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000

The linear correlation is sensitive to outliers and is invariant only under strictly increasing linear transformations. Alternative rank correlation coefficients, which are less sensitive to outliers, are Kendall's $\tau$ and Spearman's $\rho_S$. If $F$ is a continuous bivariate cumulative distribution function and $(X_1, X_2)$, $(X_1', X_2')$ are independent random pairs with the same distribution $F$, then Kendall's $\tau$ is
$$\tau = P\{(X_1 - X_1')(X_2 - X_2') > 0\} - P\{(X_1 - X_1')(X_2 - X_2') < 0\}.$$

Assuming that the marginal distributions of $X_1$ and $X_2$ are given by $F_1$ and $F_2$, respectively, Spearman's $\rho_S$ is defined as
$$\rho_S = \frac{\mathrm{Cov}\{F_1(X_1), F_2(X_2)\}}{\sqrt{\mathrm{Var}\{F_1(X_1)\}\,\mathrm{Var}\{F_2(X_2)\}}}.$$

Both rank-based correlation coefficients are invariant under strictly increasing transformations and measure the 'average dependence' between $X_1$ and $X_2$. The empirical $\hat{\tau}$ and $\hat{\rho}_S$ are calculated by
$$\hat{\tau} = \frac{4}{n(n-1)}\,P_n - 1, \quad (6.6)$$
$$\hat{\rho}_S = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \sum_{i=1}^{n} (S_i - \bar{S})^2}}, \quad (6.7)$$
where $P_n$ is the number of concordant pairs, i.e., the number of pairs $(x_{1k}, x_{2k})$ and $(x_{1m}, x_{2m})$ of points in the sample for which
$$x_{1k} < x_{1m} \text{ and } x_{2k} < x_{2m} \quad \text{or} \quad x_{1k} > x_{1m} \text{ and } x_{2k} > x_{2m}.$$
The $R_i$ and $S_i$ in (6.7) represent the position of observation $i$ in the list of all observations sorted by size (the statistical rank). These two correlation coefficients are implemented in the function cor via the parameter method. In the following listing we apply the cor function to the dataset cars, which contains the speed of cars and the distances taken to stop (the data were recorded in the 1920s).
> cor(cars$speed, cars$dist)                       # Pearson (default)
[1] 0.8068949
> cor(cars$speed, cars$dist, method = "kendall")   # Kendall's tau
[1] 0.6689901
> cor(cars$speed, cars$dist, method = "spearman")  # Spearman's rho
[1] 0.8303568

The robustness to outliers and the invariance under monotone increasing transformations are illustrated in Fig. 6.1. The outlier in the left panel of the figure makes the linear Pearson correlation coefficient ρ = 0.66, while Spearman's ρ_S and Kendall's τ are equal to 0.98 and 0.88, respectively. The same is observed in the right panel of Fig. 6.1: the nonlinear but monotone transformation of the almost perfectly dependent data results in ρ = 0.892, but ρ_S = 0.996 and τ = 0.956.

Fig. 6.1 Linear fit for the linearly correlated data with an outlier (left) and almost perfectly dependent monotone transformed data (right). BCS_CopulaInvarOutlier

6.2 The Multinormal Distribution

The multivariate normal distribution is one of the most widely used multivariate distributions. A random vector $X$ is said to be normally distributed with mean $\mu$ and covariance $\Sigma > 0$, or $X \sim \mathrm{N}_d(\mu, \Sigma)$, if it has the density function
$$f(x) = |2\pi\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(x - \mu)^{\top}\Sigma^{-1}(x - \mu)\right\}. \quad (6.8)$$
As with the univariate normal distribution, the multinormal distribution does not have an explicit form for its cdf, i.e. $\Phi(x) = \int_{-\infty}^{x_1}\cdots\int_{-\infty}^{x_d} f(u)\,du$.
The multivariate t-distribution is closely related to the multinormal distribution. If $Z \sim \mathrm{N}_d(0, I_d)$ and $Y^2 \sim \chi^2_m$ are independent rvs, a t-distributed random vector $T$ with $m$ degrees of freedom can be defined by
$$T = \sqrt{m}\,\frac{Z}{Y}.$$
Moreover, the multivariate t-distribution belongs to the family of $d$-dimensional spherical distributions, see Fang and Zhang (1990).
R offers several independent packages for the multinormal and multivariate t
distributions, namely fMultivar by Wuertz et al. (2009b), mvtnorm by Genz
and Bretz (2009) and Genz et al. (2012), and mnormt by Genz and Azzalini (2012).

The package fMultivar was specifically developed for bivariate distributions, including the multinormal, the t distribution and the Cauchy distribution. It also allows for non-parametric density estimation, see Sect. 5.1.4. The packages mvtnorm and mnormt are for multinormal and multivariate t distributions of dimensions d ≥ 2. All three packages contain methods for evaluating the density and distribution functions at a given point and for a given mean and covariance or correlation matrix. The calculation of the density is unproblematic because of its explicit form and is easily done by the following listing with $\Sigma = (\sigma_{ij})$, $\sigma_{12} = \sigma_{21} = 0.7$, $\sigma_{11} = \sigma_{22} = 1$, $\mu = (0, 0)^{\top}$ and the point at which the density is calculated $x = (0.3, 0.4)^{\top}$.
> require(mvtnorm)
> lsigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)
> lsigma                                   # covariance matrix
     [,1] [,2]
[1,]  1.0  0.7
[2,]  0.7  1.0
> lmu = c(0, 0)                            # mean vector
> x   = c(0.3, 0.4)                        # value at which to evaluate
> dmvnorm(x, mean = lmu, sigma = lsigma)   # density at point x
[1] 0.2056464

From (6.8) and Fig. 6.2, one sees that the density of the multinormal distribution is constant on ellipsoids of the form
$$(x - \mu)^{\top}\Sigma^{-1}(x - \mu) = a^2. \quad (6.9)$$
The half-lengths of the axes of the contour ellipsoid are $\sqrt{a^2\lambda_i}$, where $\lambda_i$ are the eigenvalues of $\Sigma$. If $\Sigma$ is a diagonal matrix, the rectangle circumscribing the contour ellipse has sides of length $2a\sigma_i$ and is thus naturally proportional to the standard deviations of $X_i$ $(i = 1, 2)$.
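The axes of such a contour ellipse can be computed from the spectral decomposition of Σ; a small sketch for the covariance matrix used in the listing above and the arbitrary choice a = 1:
> lsigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)   # covariance matrix from above
> e      = eigen(lsigma)                         # spectral decomposition
> a      = 1                                     # level of the contour (arbitrary)
> sqrt(a^2 * e$values)                           # half-lengths of the axes
> e$vectors                                      # directions of the axes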
The distribution of the quadratic form in (6.9) is given in the next theorem.

Fig. 6.2 Density of N2 , with ρ12 = 0.7. BCS_BinormalDensity



Fig. 6.3 cdf of N2 , with ρ12 = 0.7. BCS_NormalProbability

Theorem 6.1 If $X \sim \mathrm{N}_d(\mu, \Sigma)$, then the rv $U = (X - \mu)^{\top}\Sigma^{-1}(X - \mu)$ has a $\chi^2_d$ distribution.
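Theorem 6.1 can be illustrated by simulation; the following sketch compares the empirical distribution of U with the χ²₂ distribution for the bivariate example above (the sample size of 10 000 is an arbitrary choice).
> require(mvtnorm)
> lsigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)
> lmu    = c(0, 0)
> X      = rmvnorm(10000, mean = lmu, sigma = lsigma)   # simulated sample
> U      = mahalanobis(X, center = lmu, cov = lsigma)   # (x - mu)' Sigma^-1 (x - mu)
> ks.test(U, "pchisq", df = 2)                          # compare with chi^2_2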
As mentioned above, calculating the cdf is more problematic. The method
pmvnorm from the mvtnorm package implements two algorithms to evaluate the
normal distribution. The first is based on the randomised Quasi Monte Carlo proce-
dure by Genz (1992) and (1993), which is applicable to singular and non-singular
covariance structures of dimension d ≤ 1000. The second algorithm by Miwa et al.
(2003) is only applicable for small dimensions, d ≤ 20 and non-singular covariance
matrices. The package mnormt uses only the latter method. As a result of the evalu-
ation of the distribution function, pmvnorm also returns the estimated absolute error
(Fig. 6.3).
> lsigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)
> lmu = c(0, 0)
> x   = c(0.3, 0.4)
> pmvnorm(x, mean = lmu, sigma = lsigma)
[1] 0.2437731
attr(,"error")
[1] 1e-15
attr(,"msg")
[1] "Normal Completion"

Simulation techniques, see Chap. 9, are implemented in several packages. The fol-
lowing code demonstrates the simplest case, d = 2 using mvtnorm.
> lsigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)
> lmu = c(0, 0)
> # set.seed(2^11 - 1)                     # set the seed, see Chapter 9
> rmvnorm(5, mean = lmu, sigma = lsigma)   # sample 5 observations
           [,1]       [,2]
[1,] -0.1508041 -0.7374920
[2,]  1.0681719  0.3484549
[3,] -0.3611665 -0.3819258
[4,]  0.8141042  0.5995360
[5,]  1.4857324  1.4820885

Table 6.1 Multinormal distribution in R

                      fMultivar (d = 2)   mvtnorm (d ≥ 2)            mnormt (d ≥ 2)
cdf (probability)     pnorm2d             pmvnorm (GenzBretz,        pmnorm
                                          Miwa, TVPACK)
pdf (density)         dnorm2d             dmvnorm                    dmnorm
simulation            rnorm2d             rmvnorm                    rmnorm
quantiles             n.a.                qmvnorm                    n.a.

Fig. 6.4 Sample X from N3 , with ρ12 = 0.2, ρ13 = 0.5, ρ23 = 0.8 and n = 1500. Plots for
estimated marginal distributions of X 1,2,3 are in the upper triangle and contour ellipsoids for the
bivariate normal densities in the lower triangle of the matrix. BCS_NormalCopulaContour

The package mvtnorm is the only one of the three with a method for calculating quantiles of the multinormal distribution, namely qmvnorm. Table 6.1 lists and compares the methods from all these packages (Fig. 6.4). If μ = 0 and Σ = Id, then

X is said to be standard normally distributed. Using linear transformations, one can


shift and scale the standard normal vector to form non-standard normal vectors.

Theorem 6.2 (Mahalanobis transformation) Let X ∼ Nd(μ, Σ) and Y = Σ^{−1/2}(X − μ). Then

Y ∼ Nd (0, Id ),

i.e., the elements Y j ∈ R are independent, one-dimensional N(0, 1) variables.

Note that the Mahalanobis transformation in Eq. (6.2) yields an rv Y = (Y1 , . . . , Yd )


composed of independent one-dimensional Yj ∼ N1(0, 1). We can create Nd(μ, Σ) variables on the basis of Nd(0, Id) variables using the inverse linear transformation

X = Σ^{1/2} Y + μ.

Using (6.1) and (6.2), we can verify that E(X) = μ and Var(X) = Σ. The following theorem is useful because it presents the distribution of a linearly transformed variable.
Theorem 6.3 Let X ∼ Nd(μ, Σ), A a non-singular (d × d) matrix, and b ∈ R^d. Then Y = AX + b is again a d-variate normal, i.e.

Y ∼ Nd(Aμ + b, AΣA⊤).   (6.10)
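The inverse Mahalanobis transformation above can be illustrated with a short sketch (our own, base R only): the square root Σ^{1/2} is obtained from the spectral decomposition of Σ, and the sample moments of X = Σ^{1/2}Y + μ should then be close to μ and Σ.
> # Sketch: create N2(mu, Sigma) draws from N2(0, I) via X = Sigma^{1/2} Y + mu
> lsigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)
> lmu    = c(1, -1)
> e      = eigen(lsigma)                              # spectral decomposition
> sqrtS  = e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
> y      = matrix(rnorm(2 * 5000), ncol = 2)          # N(0, I_2) sample
> x      = sweep(y %*% sqrtS, 2, lmu, "+")            # transformed sample
> colMeans(x)                                         # close to mu
> cov(x)                                              # close to Sigma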

6.2.1 Sampling Distributions and Limit Theorems

Statistical inference often requires more than just the mean or the variance of a
statistic. We need the sampling distribution of a statistic to derive confidence inter-
vals and define rejection regions in hypothesis testing for a given significance level.
Theorem 6.4 gives the distribution of the sample mean for a multinormal population.

Theorem 6.4 Let X1, . . . , Xn be i.i.d. with Xi ∼ Nd(μ, Σ). Then X̄ ∼ Nd(μ, n^{−1}Σ).
In multivariate statistics, the sampling distributions are often more difficult to derive
than in the previous theorem. They might be so complicated in closed form that limit
theorem approximations must be used. The approximations are only valid when the
sample size is large enough. In spite of this restriction, these approximations make
complicated situations simpler. The central limit theorem shows that even if the parent
distribution is not normal, the mean X has, for large sample sizes, an approximately
normal distribution.

Theorem 6.5 (Central Limit Theorem (CLT)) Let X1, . . . , Xn be i.i.d. with Xi ∼ (μ, Σ). Then the distribution of √n(X̄ − μ) is asymptotically Nd(0, Σ), i.e.,

√n(X̄ − μ) →^L Nd(0, Σ) as n → ∞.

This asymptotic normality of the distribution is often used to construct confidence


intervals for the unknown parameters. If the covariance matrix Σ is unknown, we replace it with a consistent estimator Σ̂.

Corollary 6.1 If Σ̂ is a consistent estimator for Σ, then the CLT still holds, namely

√n Σ̂^{−1/2}(X̄ − μ) →^L Nd(0, Id) as n → ∞.

Remark 6.1 One may wonder how large n should be in practice to provide reasonable
approximations. There is no definite answer to this question. It depends mainly on
the problem at hand, i.e. the shape of the distribution and the dimension of Xi. If the Xi are normally distributed, the normality of the sample mean x̄ holds exactly, even for n = 1. However, in most situations, the approximation is valid in one-dimensional problems for n > 50.
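As a small illustration of the CLT (our own sketch), one can simulate standardised sample means of non-normal data, here i.i.d. Exp(1) variables with μ = σ = 1, and compare them with standard normal quantiles.
> # Sketch: CLT for means of Exp(1) samples (d = 1)
> n    = 100                                    # sample size
> m    = 5000                                   # Monte Carlo replications
> xbar = replicate(m, mean(rexp(n, rate = 1)))  # sample means
> z    = sqrt(n) * (xbar - 1)                   # standardised (mu = sd = 1)
> qqnorm(z); qqline(z)                          # compare with N(0, 1)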

6.3 Copulae

This section describes modelling and measuring the dependency between d rvs using
copulae.
Definition 6.1 A d-dimensional copula is a distribution function on [0, 1]d such that
all marginal distributions are uniform on [0,1].
Copulae gained their popularity due to their applications in finance. Sklar (1959)
gave a basic theorem on copulae.
Theorem 6.6 (Sklar (1959)) Let F be a multivariate distribution function with mar-
gins F1 , . . . , Fd . Then there exists a copula C such that

F(x1 , . . . , xd ) = C{F1 (x1 ), . . . , Fd (xd )}, x1 , . . . , xd ∈ R. (6.11)

If the Fi are continuous for i = 1, . . . , d, then C is unique. Otherwise, C is uniquely


determined on F1 (R) × · · · × Fd (R).
Conversely, if C is a copula and F1 , . . . , Fd are univariate distribution functions,
then the function F defined as above is a multivariate distribution function with
margins F1 , . . . , Fd .
We can determine the copula of an arbitrary continuous multivariate distribution
from the transformation

C(u 1 , . . . , u d ) = F{F1−1 (u 1 ), . . . , Fd−1 (u d )}, u 1 , . . . , u d ∈ [0, 1], (6.12)

where the Fi−1 are the inverse marginal distribution functions, also referred to as
quantile functions. The copula density and the density of the multivariate distribution with respect to the copula are

c(u) = ∂^d C(u) / (∂u_1 · · · ∂u_d),   u ∈ [0, 1]^d,   (6.13)

f(x_1, . . . , x_d) = c{F_1(x_1), . . . , F_d(x_d)} ∏_{i=1}^{d} f_i(x_i),   x_1, . . . , x_d ∈ R.   (6.14)

In the multivariate case, the copula function is invariant under monotone transfor-
mations.
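Relation (6.14) can be checked numerically. The following sketch (our own, based on the mvtnorm package used above and the copula package introduced below) compares the right-hand side of (6.14) for a Gaussian copula with standard normal margins to the bivariate normal density.
> # Sketch: numerical check of (6.14) for a Gaussian copula with N(0,1) margins
> require(mvtnorm); require(copula)
> rho = 0.7
> x   = c(0.3, 0.4)                                   # evaluation point
> u   = pnorm(x)                                      # F1(x1), F2(x2)
> rhs = dCopula(u, normalCopula(rho, dim = 2)) * prod(dnorm(x))
> lhs = dmvnorm(x, sigma = matrix(c(1, rho, rho, 1), 2))
> c(lhs, rhs)                                         # both sides coincide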
The estimation and calculation of probability distributions and goodness-of-fit
tests are implemented in several R packages: copula see Yan (2007), Hofert and
Maechler (2011) and Kojadinovic and Yan (2010), fCopulae see Wuertz et al.
(2009a), fgac see Gonzalez-Lopez (2009), gumbel see Caillat et al. (2008), HAC
see Okhrin and Ristig (2012), gofCopula see Trimborn et al. (2015) and sbgcop
see Hoff (2010). All of these packages have comparative advantages and disadvan-
tages.
The package sbgcop estimates the parameters of a Gaussian copula by treat-
ing the univariate marginal distributions as nuisance parameters. It also provides a
semiparametric imputation procedure for missing multivariate data.
A separate package, gumbel, provides functions only for the Gumbel–Hougaard
copula. The HAC package focuses on the estimation, simulation and visualisation of
Hierarchical Archimedean Copulae (HAC), which are discussed in Sect. 6.3.3. The
fCopulae package was developed for learning purposes. We recommend using this
package for a better understanding of copulae. Almost all the methods in this package,
like density, simulation, generator function, etc. can be interactively visualised by
changing their parameters with a slider. As fCopulae is for learning purposes,
only the bivariate case is treated, in order to ease the visualisation. The copula
package tries to cover all possible copula fields. It allows the simulation and fitting
of different copula models as well as their testing in high dimensions. As far as
we know, this is the only package that deals not only with the copulae, but with
the multivariate distributions based on copulae. In contrast to most of the other
packages, in copula one has to create an object from the classes copula or
mvdc (multivariate distributions constructed from copulae). These classes contain
information about the dimension, dependency parameter and margins, in the case of
the mvdc class. For example, in the following listing, we construct an object that
describes a bivariate Gaussian copula with correlation parameter ρ = 0.75, with
N(0, 2), and E(2) margins.

> n.copula = normalCopula(0.75, dim = 2)
> mvdc.gauss.n.e = mvdc(n.copula,
+   margins = c("norm", "exp"),            # margins of the distribution
+   paramMargins = list(                   # parameters of the margins
+     list(mean = 0, sd = 2),              # normal with mu = 0 and sd = 2
+     list(rate = 2)))                     # exponential with rate = 2

Using other methods, we can simulate, estimate, calculate and plot the distribution’s
density function. For copula modelling, we concentrate in this section on the two
packages copula and fCopulae.

6.3.1 Copula Families

We propose three copulae classifications: simple, elliptical and Archimedean; how-


ever, there is a zoo of copulae not belonging to these three families.
Simple copulae
Independence and perfect positive or negative dependence properties are often of
great interest. If the rvs X 1 , . . . , X d are stochastically independent, the structure of
their relations is given by the product (independence) copula, defined as

Π(u_1, . . . , u_d) = ∏_{i=1}^{d} u_i.

Two other extremes, representing perfect negative and positive dependencies, are the lower and upper Fréchet–Hoeffding bounds,

W(u_1, . . . , u_d) = max(∑_{i=1}^{d} u_i + 1 − d, 0),
M(u_1, . . . , u_d) = min(u_1, . . . , u_d),   u_1, . . . , u_d ∈ [0, 1].

An arbitrary copula C(u 1 , . . . , u d ) lies between the upper and lower Fréchet–
Hoeffding bounds

W (u 1 , . . . , u d ) ≤ C(u 1 , . . . , u d ) ≤ M(u 1 , . . . , u d ).

As far as we know, the upper and lower Fréchet–Hoeffding bounds are not implemented in any package. The reason might be that the lower Fréchet–Hoeffding bound
is not a copula function for d > 2. Using objects of the class indepCopula, one can
model the product copula using the functions dCopula, pCopula, or rCopula,
described later in this section.
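Since the bounds themselves are not provided by these packages, the following sketch (our own) evaluates Π, W and M directly from their definitions and illustrates the ordering W ≤ Π ≤ M.
> # Sketch: product copula and Frechet-Hoeffding bounds evaluated by hand
> Pi.cop = function(u) prod(u)                          # independence copula
> W.cop  = function(u) max(sum(u) + 1 - length(u), 0)   # lower bound
> M.cop  = function(u) min(u)                           # upper bound
> u = c(0.2, 0.5, 0.1)
> c(W = W.cop(u), Pi = Pi.cop(u), M = M.cop(u))         # W <= Pi <= M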

Elliptical copulae
Due to the popularity of the Gaussian and t-distributions in financial applications,
the elliptical copulae have an important role. The construction of this type of copula
is based on Theorem 6.6 and its implication (6.12). The Gaussian copula and its
copula density are given by

C_N(u_1, . . . , u_d, Σ) = Φ_Σ{Φ^{−1}(u_1), . . . , Φ^{−1}(u_d)},

c_N(u_1, . . . , u_d, Σ) = |Σ|^{−1/2} exp[ −{Φ^{−1}(u_1), . . . , Φ^{−1}(u_d)} (Σ^{−1} − I_d) {Φ^{−1}(u_1), . . . , Φ^{−1}(u_d)}^⊤ / 2 ],

for all u_1, . . . , u_d ∈ [0, 1],

where Φ is the distribution function of N(0, 1), Φ^{−1} is the functional inverse of Φ, and Φ_Σ is the d-dimensional normal distribution function with zero mean and correlation matrix Σ. The variances of the variables are determined by the marginal distributions.
In the bivariate case, the t-copula and its density are given by

C_t(u_1, u_2, ν, δ) = ∫_{−∞}^{t_ν^{−1}(u_1)} ∫_{−∞}^{t_ν^{−1}(u_2)} Γ{(ν + 2)/2} / {Γ(ν/2) πν √(1 − δ²)} × {1 + (x_1² − 2δx_1x_2 + x_2²) / ((1 − δ²)ν)}^{−(ν+2)/2} dx_1 dx_2,

c_t(u_1, u_2, ν, δ) = f_{ν,δ}{t_ν^{−1}(u_1), t_ν^{−1}(u_2)} / [ f_ν{t_ν^{−1}(u_1)} f_ν{t_ν^{−1}(u_2)} ],   u_1, u_2, δ ∈ [0, 1],

where δ denotes the correlation coefficient, ν is the number of degrees of freedom, f_{ν,δ} and f_ν are the joint and marginal t-densities, respectively, and t_ν^{−1} denotes the quantile function of the t_ν distribution. An in-depth analysis of the t-copula is available in Demarta and McNeil (2004).
available in Demarta and McNeil (2004).
The Gaussian or t-copulae can be easily modelled using multivariate normal or
t-distribution modelling. Nevertheless, in these packages (see Table 6.2), these copu-
lae are already implemented. As the copula package has the most advanced copula

Table 6.2 Elliptical copulae in R

          fCopulae (d ≥ 2)                        copula (d ≥ 2)
cdf       pellipticalCopula(type = ’name’)        pCopula(copula)
pdf       dellipticalCopula(type = ’name’)        dCopula(copula)
simul.    rellipticalCopula(type = ’name’)        rCopula(copula)

Note In the fCopulae argument, type can take the values type = {norm, cauchy, t}. For the copula package, the argument copula should be an object of the classes ellipCopula, normalCopula or tCopula.

modelling, we pay special attention to it. As mentioned above, one should first create
a copula object using normalCopula, tCopula, or ellipCopula.
> norm.cop = normalCopula(                 # Gaussian copula
+   param = c(0.5, 0.6, 0.7),              # correlation parameters
+   dim = 3,                               # 3 dimensional
+   dispstr = "un")                        # unstructured cor matrix
> t.cop = tCopula(                         # t copula
+   param = c(0.5, 0.3),                   # with params c(0.5, 0.3)
+   dim = 3,                               # 3 dimensional
+   df = 2,                                # number of degrees of freedom
+   dispstr = "toep")                      # Toeplitz structure of cor matrix
> norm.cop1 = ellipCopula(                 # elliptical family
+   family = "normal",                     # Gaussian copula
+   param = c(0.5, 0.6, 0.7),
+   dim = 3, dispstr = "un")               # same as norm.cop

The parameter dispstr specifies the type of the symmetric positive definite matrix
characterising the elliptical copula. It can take the values ex for exchangeable, ar1
for AR(1), toep for Toeplitz, and un for unstructured. With these objects, one can
use the general functions rCopula, dCopula or pCopula for the simulation or
calculation of the density or distribution functions.
> norm.cop = normalCopula(param = c(0.5, 0.6, 0.7), dim = 3,
+   dispstr = "un")
> # set.seed(2^11 - 1)                     # set the seed, see Chapter 9
> rCopula(n = 3,                           # simulate 3 obs. from a Gaussian cop.
+   copula = norm.cop)
          [,1]      [,2]      [,3]
[1,] 0.6320016 0.3708774 0.7920201
[2,] 0.4009492 0.3540828 0.3455327
[3,] 0.8624083 0.8103213 0.9115499
> dCopula(u = c(0.2, 0.5, 0.1),            # evaluate the copula density
+   copula = norm.cop)
[1] 1.103629
> pCopula(u = c(0.2, 0.5, 0.1),            # evaluate the 3D t-copula
+   copula = t.cop)
[1] 0.04190934

Plotting the results of these functions is possible for d = 2 using the standard plot, persp and contour methods. The following code demonstrates how to use these methods, and the results are displayed in Fig. 6.5.
> norm.2d.cop = normalCopula(param = 0.7, dim = 2)
> # construct a 2D Gaussian copula
> plot(rCopula(1000, norm.2d.cop))         # scatterplot
> persp(norm.2d.cop, pCopula)              # 3D copula plot
> contour(norm.2d.cop, pCopula)            # copula contour curves
> persp(norm.2d.cop, dCopula)              # 3D plot of the copula density
> contour(norm.2d.cop, dCopula)            # contour curves of the density

Using the mvdc object on the base of the copula objects, one can create a mul-
tivariate distribution based on a copula by specifying the parameters of the marginal
distributions. The mvdc objects can be plotted with contour, persp or plot as
well.

Fig. 6.5 Gaussian copula. Note From top to bottom: scatterplot, distribution and density function
for the Gaussian copula with ρ = 0.7. BCS_NormalCopula

> n.copula = normalCopula(0.75, dim = 2)
> mvdc.gauss.n.e = mvdc(n.copula, margins = c("norm", "exp"),
+   paramMargins = list(list(mean = 0, sd = 2), list(rate = 2)))
> plot(rMvdc(1000, mvdc.gauss.n.e))
> contour(mvdc.gauss.n.e, dMvdc)
> persp(mvdc.gauss.n.e, pMvdc)

Using (6.12), one can derive the copula function for an arbitrary elliptical distribution. The problem is, however, that such copulae depend on the inverse distribution
functions and these are rarely available in an explicit form, see Sect. 6.2. Therefore,
the next class of copulae and their generalisations provide an important flexible and
rich family of alternatives to the elliptical copulae.

6.3.2 Archimedean Copulae

As opposed to elliptical copulae, Archimedean copulae are not constructed using


Theorem 6.6. Instead, they are related to the Laplace transforms of univariate dis-
tribution functions as follows. Let L denote the class of Laplace transforms, which consists of strictly decreasing differentiable functions, see Joe (1997), i.e.

L = {φ : [0, ∞) → [0, 1] | φ(0) = 1, φ(∞) = 0; (−1)^j φ^{(j)} ≥ 0; j = 1, . . . , ∞}.

The function C : [0, 1]d → [0, 1] defined as

C(u 1 , . . . , u d ) = φ{φ−1 (u 1 ) + · · · + φ−1 (u d )}, u 1 , . . . , u d ∈ [0, 1]

is a d-dimensional Archimedean copula, where φ ∈ L is called the generator of the


copula. It is straightforward to show that C(u 1 , . . . , u d ) satisfies the conditions of
Definition 6.1. Some d-dimensional Archimedean copulae are presented below. The
use of these copulae in R is similar to that of the elliptical copulae, so we mention
only their specific features. The fCopulae package implements the whole list of
Archimedean copulae from Nelsen (2006), thus all methods need a parameter type
that takes values between 1 and 22.

Frank (1979) copula, −∞ < θ < ∞, θ ≠ 0.

The first popular Archimedean copula is the so-called Frank copula, which for d = 2 is the only elliptically contoured Archimedean copula (although it does not belong to the elliptical family), satisfying

C(u 1 , u 2 ) = u 1 + u 2 − 1 + C(1 − u 1 , 1 − u 2 ), u 1 , u 2 ∈ [0, 1].



Its generator and copula functions are

φ(x, θ) = −θ^{−1} log{1 − (1 − e^{−θ}) e^{−x}},   −∞ < θ < ∞, θ ≠ 0, x ∈ [0, ∞),

C_θ(u_1, . . . , u_d) = −(1/θ) log[ 1 + ∏_{j=1}^{d} {exp(−θu_j) − 1} / {exp(−θ) − 1}^{d−1} ].
The dependence is maximised when θ tends to infinity, and independence is achieved in the limit θ → 0. This family is implemented in the packages copula and fCopulae under type = 5.

Gumbel (1960) copula, 1 ≤ θ < ∞.

The Gumbel copula is frequently used in financial applications. Its generator and
copula functions are

φ(x, θ) = exp(−x^{1/θ}),   1 ≤ θ < ∞, x ∈ [0, ∞),

C_θ(u_1, . . . , u_d) = exp[ −{∑_{j=1}^{d} (−log u_j)^θ}^{1/θ} ].

Consider a bivariate distribution based on the Gumbel copula with univariate extreme
value marginal distributions. Genest and Rivest (1989) showed that this distribution
is the only bivariate extreme value distribution based on an Archimedean copula.
Moreover, all distributions based on Archimedean copulae belong to its domain of
attraction under common regularity conditions. Unlike the elliptical copulae, the
Gumbel copula leads to asymmetric contour diagrams. It shows stronger linkages
between positive values. However, it also shows more variability and more mass in
the negative tail. For θ > 1, this copula allows the generation of a dependence in
the upper tail. For θ = 1, the Gumbel copula reduces to the product copula and for
θ → ∞, we obtain the Fréchet–Hoeffding upper bound.
As mentioned above, apart from the packages copula and fCopulae
(type = 4), the package gumbel, specially designed for this copula family, allows
only exponential or gamma marginal distributions.

Clayton (1978) copula, −1/(d − 1) ≤ θ < ∞, θ ≠ 0.

The Clayton copula, in contrast to the Gumbel copula, has more mass in the lower
tail and less in the upper. The generator and copula function are

φ(x, θ) = (θx + 1)^{−1/θ},   −1/(d − 1) ≤ θ < ∞, θ ≠ 0, x ∈ [0, ∞),

C_θ(u_1, . . . , u_d) = (∑_{j=1}^{d} u_j^{−θ} − d + 1)^{−1/θ}.

The Clayton copula is one of the few copulae whose density has a simple explicit form for any dimension:

c_θ(u_1, . . . , u_d) = ∏_{j=1}^{d} [{1 + (j − 1)θ} u_j^{−(θ+1)}] (∑_{j=1}^{d} u_j^{−θ} − d + 1)^{−(θ^{−1}+d)}.

As the parameter θ tends to infinity, the dependence becomes maximal, and as θ tends
to zero, we have independence. As θ goes to −1 in the bivariate case, the distribution
tends to the lower Fréchet bound. The level plots of the two-dimensional respective
densities are given in Fig. 6.6.
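As a brief illustration (our own sketch using the copula package), the three Archimedean families above can be constructed and evaluated in the same way as the elliptical copulae; the parameter value 2 matches the contour plots in Fig. 6.6.
> # Sketch: Frank, Gumbel and Clayton copulae with parameter 2 (d = 2)
> require(copula)
> f.cop = frankCopula(param = 2, dim = 2)       # Frank copula
> g.cop = gumbelCopula(param = 2, dim = 2)      # Gumbel copula
> c.cop = claytonCopula(param = 2, dim = 2)     # Clayton copula
> u = c(0.3, 0.6)                               # evaluation point
> sapply(list(frank = f.cop, gumbel = g.cop, clayton = c.cop),
+        function(cop) dCopula(u, cop))         # copula densities at u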

6.3.3 Hierarchical Archimedean Copulae

A recently developed flexible method is provided by HAC. The special, so-called


fully nested, case of the copula function is

C(u_1, . . . , u_d) = φ_{d−1}( φ_{d−1}^{−1} ∘ φ_{d−2}[ . . . [φ_2^{−1} ∘ φ_1{φ_1^{−1}(u_1) + φ_1^{−1}(u_2)} + φ_2^{−1}(u_3)] + · · · + φ_{d−2}^{−1}(u_{d−1}) ] + φ_{d−1}^{−1}(u_d) )
 = φ_{d−1}[ φ_{d−1}^{−1} ∘ C({φ_1, . . . , φ_{d−2}})(u_1, . . . , u_{d−1}) + φ_{d−1}^{−1}(u_d) ]

for φ_{d−i}^{−1} ∘ φ_{d−j} ∈ L*, i < j, where

L* = {ω : [0, ∞) → [0, ∞) | ω(0) = 0, ω(∞) = ∞; (−1)^{j−1} ω^{(j)} ≥ 0; j = 1, . . . , ∞}.

The HAC defines the whole dependency structure in a recursive way. At the lowest
level, the dependency between the first two variables is modelled by a copula function
with the generator φ_1, i.e. z_1 = C(u_1, u_2) = φ_1{φ_1^{−1}(u_1) + φ_1^{−1}(u_2)}. At the second
level, another copula function is used to model the dependency between z 1 and u 3 ,
etc. Note that the generators φi can come from the same family, differing only in
their parameters. But, to introduce more flexibility, they can also come from different
families of generators. As an alternative to the fully nested model, we can consider
copula functions with arbitrarily chosen combinations at each copula level. Okhrin
et al. (2013) provide several methods for determining the structure of the HAC from
the data.

Fig. 6.6 Contour diagrams for (from top to bottom) the Gumbel, Clayton and Frank copu-
lae with parameter 2 and Normal (left column) and t6 distributed (right column) margins.
BCS_ArchimedeanContour

Fig. 6.7 Scatterplot of the HAC-based three-variate distribution. BCS_HAC

The HAC package provides intuitive techniques for estimating and visualising
HAC. In accordance with the naming in the other packages, the functions dHAC,
pHAC compute the values of the pdf and cdf, and rHAC generates random vectors.
Figure 6.7 presents the scatterplot of the three-dimensional HAC-based distribution.

F(x_1, x_2, x_3) = C_Gumbel[ C_Gumbel{Φ(x_1), t_2(x_2); θ_1 = 2}, Φ(x_3); θ_2 = 10 ].   (6.15)

On the sides of the cube, one sees the shaded bivariate marginal distributions, which clearly differ from each other.

6.3.4 Estimation

Estimating a copula-based multivariate distribution involves both the estimation of


the copula parameters θ and the estimation of the margins F j (·, α j ), j = 1, . . . , d,
i.e. all the parameters from the copula and from the margins could be estimated
in one step. If we are only interested in the dependency structure, the estimator of
θ should be independent of any parametric models for the margins, therefore we
distinguish between a parametric and a non-parametric specification of the margins.

In practical applications, however, we are interested in a complete distribution model


and, therefore, parametric models for the margins are preferred.
Three commonly used estimation procedures are considered in the following.
Let X be a d-dimensional rv with density f and parametric univariate marginal
distributions F j (x j ; α j ), j = 1, . . . , d. Suppose that the copula to be estimated
belongs to a given parametric family C = {Cθ , θ ∈ }. For a sample of observations
{xi }i=1
n
, xi = (x1i , . . . , xdi ) and a vector of parameters α = (α1 , . . . , αd , θ) ∈
Rd+1 , the likelihood function is given by


L(α; x_1, . . . , x_n) = ∏_{i=1}^{n} f(x_{1i}, . . . , x_{di}; α_1, . . . , α_d, θ).

According to (6.14), the density f can be decomposed into the copula density c
and the product of the marginal densities, so that the log-likelihood function can be
written as


ℓ(α; x_1, . . . , x_n) = ∑_{i=1}^{n} log c{F_1(x_{1i}; α_1), . . . , F_d(x_{di}; α_d); θ} + ∑_{i=1}^{n} ∑_{j=1}^{d} log f_j(x_{ji}; α_j).

The vector of parameters α = (α1 , . . . , αd , θ) contains d parameters α j from the


marginals and the copula parameter θ. All these parameters can be estimated in one step through full maximum likelihood estimation

α̃_FML = (α̃_1, . . . , α̃_d, θ̃)⊤ = arg max_α ℓ(α).

Following the standard theory on maximum likelihood estimation, the estimates


are efficient and asymptotically normal. However, optimising ℓ with respect to all parameters simultaneously is often computationally demanding.
In the IFM (inference for margins) method, the parameters α j from the marginal
distributions are estimated in the first step and used to estimate the dependence
parameter θ in the second step. The pseudo log-likelihood function


ℓ(θ, α̂_1, . . . , α̂_d) = ∑_{i=1}^{n} log c{F_1(x_{1i}; α̂_1), . . . , F_d(x_{di}; α̂_d); θ},

is maximised over θ to get the dependence parameter estimate θ̂. A detailed discussion
of this method is to be found in Joe and Xu (1996). Note that this procedure does
not lead to efficient estimators, but, as argued by Joe (1997), the loss in efficiency
should be modest. The advantage of the inference for margins procedure lies in the
dramatic reduction of the computational complexity, as the estimation of the margins
is disentangled from the estimation of the copula. As a consequence, all R packages

use the separate estimation of the copula and its margins and, therefore, they focus
only on the optimisation of the copula parameter(s).
In the CML (canonical maximum likelihood) method, the univariate marginal
distributions are estimated through some non-parametric method F̂ as described
in Sect. 5.1.2. The asymptotic properties of the multistage estimators of θ do not
depend explicitly on the type of the non-parametric estimator, but on its convergence
properties. For the estimation of the copula, one should normalise the empirical cdf
not by n but by n + 1:

F̂_j(x) = 1/(n + 1) ∑_{i=1}^{n} I(x_{ji} ≤ x).

The copula parameter estimator θ̂_CML is given by

θ̂_CML = arg max_θ ∑_{i=1}^{n} log c{F̂_1(x_{1i}), . . . , F̂_d(x_{di}); θ}.

Notice that the first step of the IFM and CML methods estimates the marginal distributions. After the estimation of the marginal distributions, a pseudosample {u_i} of observations transformed to the unit d-cube is obtained and used for the copula estimation. As in the IFM, the semiparametric estimator θ̂ is asymptotically normal under suitable regularity conditions.
In the two-dimensional case d = 2, one often uses the generalised method of moments, since there is a one-to-one relation between the bivariate copula parameter and Kendall's τ or Spearman's ρ_S. For example, for Gumbel copulae, τ = 1 − 1/θ, and for Gaussian copulae, τ = (2/π) arcsin ρ. One estimates this measure as in (6.7) or (6.6) and subsequently converts it to θ.
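A small sketch (our own) of this moment-based approach: estimate Kendall's τ from a bivariate sample and invert the Gumbel and Gaussian relations stated above.
> # Sketch: method-of-moments estimation via Kendall's tau (d = 2)
> require(copula)
> x   = rCopula(500, gumbelCopula(3, dim = 2))     # sample from a Gumbel copula
> tau = cor(x[, 1], x[, 2], method = "kendall")    # empirical Kendall's tau
> 1 / (1 - tau)                                    # Gumbel: theta = 1/(1 - tau)
> sin(pi * tau / 2)                                # Gaussian: rho = sin(pi tau / 2)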
Estimation of the different copula models is implemented in a variety of packages,
such as copula, fCopulae, gumbel and HAC. The gumbel package implements
all methods of estimation for the Gumbel copula. The package fCopulae deals
only with copula functions with uniform margins, and the estimation is provided
through maximum likelihood. It estimates the parameters for all the copula families
in Nelsen (2006). The package copula is of the highest interest, since almost all
estimation methods for the estimation of multivariate copula-based distributions,
or just the copula function, are implemented in it. To estimate a parametric copula
C, one uses the fitCopula function, which among other parameters needs the
parameter method, indicating the method that should be used in the estimation. The
parameter method can be either ml (maximum likelihood), mpl (maximum pseudo-
likelihood), itau (inversion of Kendall’s tau), or irho (inversion of Spearman’s
rho). The default method is mpl. In the following listing, we present the estimation
of the Gumbel copula parameter using different methods.

> # set.seed(11)                                 # set seed, see Chapter 9
> Gc = gumbelCopula(3, dim = 2)                  # Gumbel copula
> n  = 200                                       # sample size
> x  = rCopula(n, copula = Gc)                   # true observations
> u  = apply(x, 2, rank) / (n + 1)               # pseudo-observations
> fitCopula(Gc, u, method = "itau")@estimate     # inverting Kendall's tau
[1] 2.57772
> fitCopula(Gc, u, method = "irho")@estimate     # inverting Spearman's rho
[1] 2.600561
> fitCopula(Gc, u, method = "mpl")@estimate      # maximum pseudo-likelihood
[1] 2.657331
> fitCopula(Gc, x, method = "ml")@estimate       # maximum likelihood
[1] 2.657331

Similarly, using the fitMvdc method, one can estimate the whole multivariate
copula-based distribution together with its margins.
Chapter 7
Regression Models

Everything must be made as simple as possible. But not simpler.

— Albert Einstein

Regression models are extremely important in describing relationships between


variables. Linear regression is a simple, but powerful tool in investigating linear
dependencies. It relies, however, on strict distributional assumptions. Nonparamet-
ric regression models are widely used, because fewer assumptions about the data at
hand are necessary. At the beginning of every empirical analysis, it is better to look
at the data without assumptions about the family of distributions. Nonparametric
techniques allow describing the observations and finding suitable models.

7.1 Idea of Regression

Regression models aim to find the most likely values of a dependent variable Y for
a set of possible values {xi }, i = 1, . . . , n of the explanatory variable X

yi = g(xi ) + εi , εi ∼ Fε ,

where g(x) = E(Y |X = x) is an arbitrary function which is included in the model


with the intention of capturing the mean of the process that corresponds to x. The
natural aim is to keep the values of the εi as small as possible.
Parametric models assume that the dependence of Y on X can be fully explained
by a finite set of parameters and that Fε has a prespecified form with parameters to
be estimated.
Nonparametric methods do not assume any form: neither for g nor for Fε , which
makes them more flexible than the parametric methods. The fact that nonparametric
techniques can be applied where parametric ones are inappropriate prevents the
nonparametric user from employing a wrong method. These methods are particularly

useful in fields like quantitative finance, where the underlying distribution is in fact
unknown. However, as fewer assumptions can be exploited, this flexibility comes
with the need for more data. A detailed introduction to nonparametric techniques
can be found in Härdle et al. (2004).

7.2 Linear Regression

The simplest relationship between quantitative variables, an explained variable Y and


explanatory variables X 1 , . . . , X p (also called regressors) is the linear one, namely
the function g(·) is linear. It is of interest to study ‘how Y varies with changes in
X 1 , . . . , X p ’, or precisely, the causal (ceteris paribus) relationship between variables.
The model, which is assumed to hold in the population, is written as

Y = β0 + β1 X 1 + · · · + β p X p + ε.

The variable ε is called the error term and represents all factors other than X_1, . . . , X_p that affect Y. Let y = {y_i}_{i=1}^{n} be a vector of the response variables and X =
{xi j }i=1,...,n; j=1,..., p be a data matrix of p explanatory variables. In many cases, a
constant is included through xi1 = 1 for all i in this matrix. The resulting data matrix
is denoted by X = {xi j }i=1,...,n; j=1,..., p+1 . The aim is to find a good linear approxi-
mation of y using linear combinations of covariates

y = X β + ε,

where ε is the vector of errors. To estimate β, the following least squares optimisation
has to be solved:

β̂ = argmin_{β∈Θ} ‖y − Xβ‖² = argmin_{β∈Θ} (y − Xβ)⊤(y − Xβ) = argmin_{β∈Θ} ε⊤ε,   (7.1)

where Θ denotes the parameter space.


Applying first-order conditions, it can be shown that the solution of (7.1) is

β̂ = (X⊤X)^{−1} X⊤y.   (7.2)

This estimator is called the (ordinary) least squares (OLS) estimator. Under the
following conditions, the OLS estimator β̂ is by the Gauss–Markov theorem the best
linear unbiased estimator (BLUE).

1. The model is linear in the parameter vector β;


2. E(ε) = 0;
3. the covariance matrix of the errors ε is diagonal and given by Σ = σ²In;
4. X is a deterministic matrix with full column rank.

If these conditions are fulfilled, then OLS has the smallest variance in the class of all linear unbiased estimators, with E(β̂) = β and Var(β̂) = σ²(X⊤X)^{−1}.
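As a quick sketch (our own, with simulated data rather than an example from the text), the closed form (7.2) can be computed directly and compared with the output of lm().
> # Sketch: OLS estimator (7.2) computed by hand and compared with lm()
> n  = 100
> x1 = rnorm(n); x2 = runif(n)
> y  = 1 + 2 * x1 - 0.5 * x2 + rnorm(n)             # simulated linear model
> X  = cbind(1, x1, x2)                             # design matrix with intercept
> beta.hat = solve(t(X) %*% X, t(X) %*% y)          # (X'X)^{-1} X'y
> cbind(manual = beta.hat, lm = coef(lm(y ~ x1 + x2)))  # both agree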
Additional assumptions are required to develop further inference about β̂. Under a normality assumption, ε ∼ N(0, σ²In), the estimator β̂ has a normal distribution, i.e. β̂ ∼ N{β, σ²(X⊤X)^{−1}}. In practice, the error variance σ² is often unknown, but can be estimated by

σ̂² = (y − ŷ)⊤(y − ŷ) / {n − (p + 1)},

where ŷ are the fitted values of y given by ŷ = Xβ̂ and (p + 1) is the number of


explanatory variables including the intercept.
Given the structure of the matrix Var(β̂), the standard deviation of a single estimator β̂_j can be written as

σ̂²(β̂_j) = [∑_{i=1}^{n} ε̂_i² / {n − (p + 1)}] (X⊤X)^{−1}_{jj} = [ε̂⊤ε̂ / {n − (p + 1)}] (X⊤X)^{−1}_{jj} = σ̂² (X⊤X)^{−1}_{jj},

where (X⊤X)^{−1}_{jj} is the j-th diagonal element of the matrix (X⊤X)^{−1}. It can be shown that β̂_j and σ̂²(β̂_j) are statistically independent.
The distributional property of β̂ is used to form tests and build confidence intervals
for the vector of parameters β. Testing the hypothesis H0 : β j = 0 is analogous to
the one-dimensional t-test with the test statistics given by

t = β̂_j / σ̂(β̂_j).   (7.3)

Under H0 , the test statistic (7.3) follows the tn−( p+1) distribution. For further reading
we refer to Greene (2003), Härdle and Simar (2015) and Wasserman (2004).
Additionally one can test whether all independent variables have no effect on the
dependent variable

H0 : β_1 = · · · = β_p = 0   vs   H1 : β_k ≠ 0 for at least one k = 1, . . . , p.

In order to decide on the rejection of the null hypothesis, the residual sum of squares
RSS serves as a measure. According to the restrictions, under the null hypothesis the
test compares the RSS of the reduced model SS(r educed), in which the variables
listed in H0 are dropped, with the RSS of an unrestricted model SS( f ull), in which all
variables are included. In general, SS(r educed) is greater than or equal to SS( f ull)
because the OLS estimation of the restricted model uses fewer parameters. The ques-
tion is whether the increase of the RSS in moving from the unrestricted model to the
restricted model is large enough to ensure the rejection of the null hypothesis. There-
fore, the F-statistic is used, which addresses the difference between SS(r educed)
and SS( f ull):

F = [{SS(reduced) − SS(full)} / {df(r) − df(f)}] / {SS(full) / df(f)}.   (7.4)

Under the null hypothesis, the statistic (7.4) follows the F_{df(r)−df(f), df(f)} distribution (see Sect. 4.4.3), where df(f) and df(r) denote the numbers of degrees of freedom under the unrestricted and the restricted model (df(f) = n − p − 1 and df(r) = n − 1). Based on this F-distribution, the critical region can be chosen in order to reject or not reject the null hypothesis.
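In R, this comparison of a restricted and an unrestricted model is conveniently carried out with anova(). The sketch below (our own) anticipates the UScereal example of Sect. 7.2.2 and tests H0: β_carbo = β_sugars = 0.
> # Sketch: F-test (7.4) by comparing restricted and full models
> data("UScereal", package = "MASS")
> fit.full    = lm(calories ~ protein + fat + carbo + sugars, data = UScereal)
> fit.reduced = lm(calories ~ protein + fat, data = UScereal)  # restricted model
> anova(fit.reduced, fit.full)      # F statistic and p-value for the restriction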

7.2.1 Model Selection Criteria

Even in the case of well-fitted models, it is not an easy task to select the best model
from a set of alternatives. Usually, one looks at the coefficient of determination R 2 or
adjusted R 2 . These values measure the ‘goodness of fit’ of the regression equation.
They represent the percentage of the total variation of the data explained by the fitted
linear model. Consequently, higher values indicate a ‘better fit’ of the model, while
low values may indicate a poor model. R 2 is given by

R² = 1 − ‖y − ŷ‖² / ‖y − ȳ‖²,   (7.5)

with R² ∈ [0, 1]. It is important to know that R² always increases with the number of explanatory variables added to the model, even if they are irrelevant.
The adjusted R 2 is a modification of (7.5), which considers the number of explana-
tory variables used, and is given by

R²_adj = 1 − (1 − R²) (n − 1) / {n − (p + 1) − 1}.   (7.6)
Note that R²_adj can be negative. However, the coefficients of determination (7.5) and
(7.6) are not always the best criteria for choosing the model. Other popular criteria
to choose the regression model are Mallows’ C p , the Akaike Information Criterion
(AIC) and the Bayesian Information Criterion (BIC).
Mallows’ C p is a model selection criterion which uses the residual sum of squares,
but penalises for the number of unknown parameters, like R²_adj. It is given by

C_p = ‖y − ŷ‖² / ‖y − ȳ‖² − n + 2(p + 1).

AIC uses the maximum log-likelihood and is defined as

AIC = n log σ̂² + 2(p + 1),

where p is the number of parameters and σ̂² is the estimate of the variance that maximises the likelihood function L. The second term is a penalty, as in Mallows' C_p.
The last information criterion discussed here is the BIC, defined as

BIC = n log σ̂² + log(n)(p + 1).

There is no rule of thumb determining which criterion to use. In small samples, all criteria give similar results. Since the BIC penalty log(n) exceeds the AIC penalty 2 for n ≥ 8, BIC tends to select more parsimonious models.
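The built-in functions AIC() and BIC() evaluate these criteria for a fitted lm object. The sketch below (our own) applies them to the UScereal regression of Sect. 7.2.2; note that R's AIC() is based on the full Gaussian log-likelihood and therefore differs from n log σ̂² + 2(p + 1) by terms that do not depend on the chosen regressors.
> # Sketch: information criteria for a fitted linear model
> data("UScereal", package = "MASS")
> fit = lm(calories ~ protein + fat + carbo + sugars, data = UScereal)
> AIC(fit)                                          # Akaike Information Criterion
> BIC(fit)                                          # Bayesian Information Criterion
> n = nrow(UScereal)
> n * log(sum(residuals(fit)^2) / n) + 2 * (4 + 1)  # n log sigma.hat^2 + 2(p+1), sigma.hat^2 = RSS/n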

7.2.2 Stepwise Regression

Stepwise regression builds the model from a set of candidate predictor variables
by entering and removing predictors in a stepwise manner. One can perform for-
ward or backward stepwise selection using the step function or stepAIC from
the MASS package. Both functions perform stepwise model selection by exact AIC.
The stepAIC function is preferable, because it is applicable to more model types,
e.g. nonlinear regression models, apart from the linear model while providing the
same options. Forward selection starts by choosing the independent variable which
explains the most variation in the dependent variable. It then chooses the variable
which explains most of the remaining residual variation and recalculates the regres-
sion coefficients. The algorithm continues until no further variable significantly
explains the remaining residual variation. Another similar selection algorithm is
backward selection, which starts with all variables and excludes the most insignifi-
cant variables one at a time, until only significant variables remain. A combination
of the two algorithms performs forward selection, while dropping variables which
are no longer significant after the introduction of a new variable.
The stepAIC() function requires a number of arguments. The argument k is
a multiple of the number of degrees of freedom used for the penalty. If k = 2 the
original AIC is applied, k = log(n) is equivalent to BIC. The direction of the
stepwise regression can be chosen as well, setting it to forward, backward or both.
If trace = 1, it will return every model it goes over as well as the coefficients of
the final model. scope gives the range of the included predictors, while lower and
upper specify the minimal and maximal number of models the stepwise procedure
may go over.
The nutritional database on US cereals introduced in Venables and Ripley (1999)
provides a good illustration of MLR. The UScereal data frame from package MASS
is from the 1993 ASA Statistical Graphics Exposition. The data have been normalised to a portion size of one American cup and, among others, contain information on: mfr (Manufacturer, represented by its first initial), calories (number of calories per
portion), protein (grams of protein per portion), fat (grams of fat per portion),
carbo (grams of complex carbohydrates per portion) and sugars (grams of sugars per
portion). The analysis is restricted to the dependence between calories and protein,

fat, carbohydrates and sugars, which can be expressed by the function

CALORIES = β1 · INTERCEPT + β2 · PROTEIN + β3 · FAT + β4 · CARBO + β5 · SUGARS + ε.

First, it is necessary to load the package which contains the dataset.


> data("UScereal", package ="MASS")

Function lm() estimates the coefficients of a multiple linear regression, calculates


the standard errors of the estimators and tests the regression coefficients’ significance.
> fit = lm(calories ~ protein + fat + carbo + sugars,
+ data = UScereal) # fit the regression model

The resulting fitted model is an object of a class lm, for which the function
summary() shows the conventional regression table. The part Call states the
applied model.
> summary(fit) # show results
Call:
lm(formula = calories ~ protein + fat + carbo + sugars,
data = UScereal)

The next part of the output provides the minimum, maximum and empirical quartiles
of the residuals.
Residuals: # output ctnd.
Min 1Q Median 3Q Max
-20.177 -4.693 1.419 4.940 24.758

The last part shows the estimated β̂ with the corresponding standard errors, t-statistics
and associated p-values. The measures of goodness of fit, R 2 and adjusted R 2 as
discussed in Sect. 7.2.1, and the results of an F-test are given as well. The F-test
tests the null hypothesis that all regression coefficients (excluding the constant) are
simultaneously equal to 0.
Coefficients: # output ctnd.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.7698 3.5127 -5.343 1.49e-06 ***
protein 4.0506 0.5438 7.449 4.28e-10 ***
fat 8.8589 0.7973 11.111 3.41e-16 ***
carbo 4.9247 0.1587 31.040 < 2e-16 ***
sugars 4.2107 0.2116 19.898 < 2e-16 ***
---
Signif. codes: 0"***"0.001"**"0.01"*"0.05"."

Residual standard error: 8.862 on 60 degrees of freedom


Multiple R-squared: 0.9811, Adjusted R-squared: 0.9798
F-statistic: 778.6 on 4 and 60 DF, p-value: < 2.2e-16

In the given example, all four variables are statistically significant, i.e. all p-values
are smaller than 0.05, which is commonly used as threshold. The coefficients can
be assessed via the command fit$coef. The interpretation of the coefficients

is quite intuitive. One may know from a high school chemistry course that 1 g of
carbohydrates or proteins contains 4 calories and 1 g of fat gives nine calories. In
order to investigate the model closely, four diagnostic plots are constructed. The
layout command is used to put all graphs in one figure.
> layout(matrix(1:4, 2, 2)) # plot 4 graphics in 1 figure
> plot(fit) # depict 4 diagnostic plots

The upper left plot in Fig. 7.1 shows the residual errors plotted against their fitted
values. The residuals should be randomly distributed around the horizontal axis.
There should be no distinct trend in the distribution of points. A nonparametric
regression of the residuals is added to the plot, which should, in an ideal model, be
close to the horizontal line y = 0. Unfortunately, this is not the case in the example,
possibly due to outliers (Grape-Nuts, Quaker Oat Squares and Bran Chex). Potential
outliers are always named in the diagnostic plots.

Fig. 7.1 Diagnostic plots for multiple linear regression. BCS_MLRdiagnostic



The upper right plot in Fig. 7.1 presents the scale-location, also called spread-
location. It shows the square root of the standardised residuals as a function of the
fitted values and is a useful tool to verify if the standard deviation of the residuals
is constant as a function of the dependent variable. In this example, the standard
deviation increases with the number of calories.
The lower left plot in Fig. 7.1 is a Q-Q plot. In a Q-Q plot, quantiles of a theoretical
distribution (Gaussian in this case) are plotted against empirical quantiles. If the data
points lie close to the diagonal line there is no reason to doubt the assumed distribution
of the errors (e.g. Gaussian).
Finally, the lower right plot shows each point’s leverage, which is a measure of its
importance in determining the regression result. The leverage of the i-th observation
is the i-th diagonal element of the matrix X (X  X )−1 X  . It always takes values
between 0 and 1 and shows the influence of the given observation on the overall
modelling results and particularly on the size of the residual. Superimposed on the
plot are contour lines for Cook’s distance, which is another measure of the importance
of each observation to the regression, showing the change in the predicted values
when that observation is excluded from the dataset. Smaller distances mean that this
observation has little effect on the regression. Distances larger than 1 are suspicious
and suggest the presence of possible outliers or a poor model. For more details, we
refer to Cook and Weisberg (1982). In the given regression model, some possible
outliers are observed. It makes sense to have a closer look at these observations and
either exclude them or experiment with other model specifications.
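Leverages and Cook's distances can also be inspected numerically with the base R functions hatvalues() and cooks.distance(); the short sketch below (our own) lists the most influential observations of the fitted model.
> # Sketch: leverages and Cook's distances for the fitted model
> lev = hatvalues(fit)                        # diagonal of X (X'X)^{-1} X'
> cd  = cooks.distance(fit)                   # Cook's distances
> head(sort(lev, decreasing = TRUE), 3)       # highest-leverage observations
> head(sort(cd,  decreasing = TRUE), 3)       # largest Cook's distances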
As mentioned above, adjusted R 2 is a widely used measure of the goodness of
fit. In the given example, the model seems to explain the variability in the data quite
2
well, the Rad j is 0.9798. However, a similar goodness of fit might be obtained using a
smaller set of regressors. An investigation of this question using stepwise regression
procedure shows that no regressor can be removed from the model.
> require(MASS)
> stepAIC(fit, direction ="both") # stepwise regression using AIC
Start: AIC = 288.43
calories ~ protein + fat + carbo + sugars

Df Sum of Sq RSS AIC


<none> 4712 288.43
- protein 1 4358 9070 328.99
- fat 1 9694 14406 359.07
- sugars 1 31092 35804 418.24
- carbo 1 75664 80376 470.81

Call:
lm(formula = calories ~ protein + fat + carbo + sugars,
data = UScereal)

Coefficients:
(Intercept) protein fat carbo sugars
-18.770 4.051 8.859 4.925 4.211

Fig. 7.2 BIC for all subsets regression. BCS_MLRleaps
A possible drawback of stepwise regression is that once the variable is included
(excluded) in the model, it remains there (or is eliminated) for all remaining steps.
Thus, it is a good idea to perform stepwise regression in both directions in order to
look at all the possible combinations of explanatory variables. It is possible to perform an all-subsets regression using the function regsubsets from the package leaps. Plotting the regsubsets object produces a matrix in which models with larger BIC are highlighted in a darker colour (Fig. 7.2).
> require("leaps")
> sset = regsubsets(calories ~ protein + fat + carbo + sugars,
+ data = UScereal, nbest = 3) # fit lm to all subsets
> plot(sset)

7.3 Nonparametric Regression

The general idea of regression analysis is to find a reasonable relation between two
variables X and Y. For n realisations {(x_i, y_i)}_{i=1}^{n}, the relation can be modelled by

yi = g(xi ) + εi , i = 1, ..., n, (7.7)

where X is our explanatory variable, Y is the explained variable, and ε is the noise.
A parametric estimation would suggest g(xi ) = g(xi , θ), therefore estimating g
would result in estimating θ and using ĝ(xi ) = g(xi , θ̂). In contrast, nonparametric
regression allows g to have any shape, in the sense that g need not belong to a set of
defined functions. Nonparametric regression provides a powerful tool by allowing
wide flexibility for g. It avoids a biased regression, and might be a good starting point

for a parametric regression if no ‘a priori’ shape of g is known. It is also a reliable


way to spot outliers within the regression.
There are two cases to distinguish, depending on the assumptions. Assuming
that the observations are independent and identically distributed, we can write the
regression function as g(x) = E(Y |X = x). This is the so-called random design
problem.
The second problem, the fixed design problem, is in some ways better for the
statistician, as it provides a more accurate estimator of the regression function. In
this setup, X is assumed to be deterministic. We keep the hypothesis that the {εi } are
i.i.d.

7.3.1 General Form

For (7.7), the following must hold:


1. E(ε) = 0 and Var(ε) = σ².
2. g(x) ≈ yi , for x ≈ xi .
3. If x is ‘far’ from x j , we want (x j , y j ) to not interact with g(x).
Therefore one has to build a function as a finite sum of g_i, where g_i is related to the point (x_i, y_i). Without loss of generality,

ĝ(x) = ∑_{i=1}^{n} g_i(x),   with g_i(x) ≈ y_i if x is close to x_i, and g_i(x) ≈ 0 otherwise.

This can be written as a weighted sum or local average


ĝ(x) = ∑_{i=1}^{n} w_i(x) y_i,   (7.8)

with the weight wi (x) for each xi .


This form is quite appealing, because it is similar to the solution of the least
squares minimization problem


min_{θ∈R} ∑_{i=1}^{n} w_i(x)(θ − y_i)²,

which is solved for θ. This means that finding a local average is the same as finding a
locally weighted least squares estimate. For more details on the distinction between
local polynomial fitting and kernel smoothing, see Müller (1987).
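A tiny numerical sketch (our own) of this equivalence: the minimiser of the weighted least squares criterion coincides with the weighted local average ∑ w_i(x) y_i / ∑ w_i(x).
> # Sketch: locally weighted least squares equals the weighted local average
> y = c(1.2, 0.8, 2.1, 1.7)                       # responses near a point x
> w = c(0.1, 0.4, 0.3, 0.2)                       # kernel weights w_i(x)
> optimize(function(theta) sum(w * (theta - y)^2),
+          interval = range(y))$minimum           # numerical minimiser
> weighted.mean(y, w)                             # closed-form local average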
A method similar to the histogram divides the set of observations into bins of size
h and computes the mean within each bin by

ĝ(x) = ∑_{i=1}^{n} I{|x − x_i| < h/2} y_i / ∑_{i=1}^{n} I{|x − x_i| < h/2}.   (7.9)

Fig. 7.3 Nonparametric regression of daily DAX log-returns by daily FTSE log-returns, with a uniform kernel and h = 0.001. BCS_KernelSmoother

This is implemented in R via the function ksmooth() with the parameter kernel = 'box'. As an example (Fig. 7.3), we regress the daily log-returns of the DAX index on the daily log-returns of the FTSE index with h = 0.001 (for how the returns are computed, see Sect. 5.3.1).
> r.dax = diff(log(EuStockMarkets[, 1])) # daily DAX log returns
> r.ftse = diff(log(EuStockMarkets[, 4])) # daily FTSE log returns
> r.daxhat = ksmooth(x = r.ftse, y = r.dax, # kernel regression
+ kernel ="box", # type of kernel
+ bandwidth = 0.001) # bandwidth
> plot(r.ftse, r.dax) # plot observations
> lines(r.daxhat, col ="red3") # plot kernel regression

The function ksmooth() returns the fitted values of the DAX log-returns and the
respective FTSE log-returns. This estimator is a special case of a wider family of
estimators, which is the topic of the following section.

7.3.2 Kernel Regression

As with the density estimation, we can bring into play the kernel functions for the
weights we need in (7.8). Recall the density estimator

f̂(x) = 1/(nh) ∑_{i=1}^{n} K{(x − x_i)/h}.

We then have the general estimator ĝ(x), related to the bandwidth h and to a kernel K,

ĝ(x) = ∑_{i=1}^{n} K{(x − x_i)/h} y_i / ∑_{i=1}^{n} K{(x − x_i)/h}.   (7.10)

This generalisation of the kernel regression estimator is attributed to Nadaraya (1964) and Watson (1964), and is the so-called Nadaraya–Watson estimator. If one knows the actual density f of X, then (7.10) reduces to

ĝ(x) = ∑_{i=1}^{n} K{(x − x_i)/h} y_i / {n h f(x)}.

These estimators are of the form of (7.8), with weights equal to w_i(x) = K{(x − x_i)/h} / {n h f(x)} and w_i(x) = K{(x − x_i)/h} / ∑_{j=1}^{n} K{(x − x_j)/h}, respectively.
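The Nadaraya–Watson estimator is easy to code directly. The sketch below (our own, not from the text) implements (7.10) with a Gaussian kernel; it can be compared with ksmooth(..., kernel = 'normal'), bearing in mind that ksmooth rescales its bandwidth internally.
> # Sketch: Nadaraya-Watson estimator (7.10) with a Gaussian kernel
> nw = function(x, X, Y, h) {
+   sapply(x, function(x0) {
+     w = dnorm((x0 - X) / h)          # kernel weights K{(x0 - X_i)/h}
+     sum(w * Y) / sum(w)              # weighted local average
+   })
+ }
> r.dax  = diff(log(EuStockMarkets[, 1]))     # daily DAX log-returns
> r.ftse = diff(log(EuStockMarkets[, 4]))     # daily FTSE log-returns
> grid   = seq(min(r.ftse), max(r.ftse), length.out = 100)
> plot(r.ftse, r.dax)
> lines(grid, nw(grid, r.ftse, r.dax, h = 0.005), col = "red3")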

In the following, some regressions with different kernels and different bandwidths
are computed.
> r.dax = diff(log(EuStockMarkets[, 1])) # daily DAX log returns
> r.ftse = diff(log(EuStockMarkets[, 4])) # daily FTSE log returns
> n = length(r.dax) # sample size
> h = c(0.1, n^-1, n^-0.5) # bandwidths
> Color = c("red3","green3","blue3") # vector for colors
> # kernel regression with uniform kernel
> r.dax.un = list(h1 = NA, h2 = NA, h3 = NA) # list for results
> for(i in 1:3){
+ r.dax.un[[i]] = ksmooth(x = r.ftse, # independent variable
+ y = r.dax, # dependent variable
+ kernel ="box", # use uniform kernel
+ bandwidth = h[i]) # h = 0.1, n^-1, n^-0.5
+ }
> plot(x = r.ftse, y = r.dax) # scatterplot for data
> for(i in 1:3){
+ lines(r.dax.un[[i]], col = Color[i]) # regression curves
+ }
> # kernel regression with normal kernel
> r.dax.no = list(h1 = NA, h2 = NA, h3 = NA) # list for results
> for(i in 1:3){
+ r.dax.no[[i]] = ksmooth(x = r.ftse, # independent variable
+ y = r.dax, # dependent variable
+ kernel ="normal", # use normal kernel
+ bandwidth = h[i]) # h = 0.1, n^-1, n^-0.5
+ }
> plot(x = r.ftse, y = r.dax) # scatterplot for data
> for(i in 1:3){
+ lines(r.dax.no[[i]], col = Color[i]) # regression curves
+ }

As previously noted in Sect. 5.1.4, the choice of the bandwidth h is crucial for the
degree of smoothing, see Fig. 7.4. In Fig. 7.5 the Gaussian kernel K (x) = ϕ(x) is
used with the same bandwidths. The kernel determines the shape of the estimator ĝ,
which is illustrated in Figs. 7.4 and 7.5.

Fig. 7.4 Nonparametric regression of daily DAX log-returns by daily FTSE log-returns, using uniform kernel. The plot shows the density estimates for different bandwidths: h = 0.1, h = n^{-1} and h = n^{-1/2}. BCS_UniformKernel
[Scatter plot of the data with the three regression curves; x-axis: FTSE log−returns, y-axis: DAX log−returns.]

Fig. 7.5 Nonparametric regression of daily DAX log-returns by daily FTSE log-returns, using Gaussian kernel. The plot shows the density estimates for different bandwidths: h = 0.1, h = n^{-1} and h = n^{-1/2}. BCS_GaussianKernel
[Scatter plot of the data with the three regression curves; x-axis: FTSE log−returns, y-axis: DAX log−returns.]

7.3.3 k-Nearest Neighbours (k-NN)

The idea of nonparametric regression is to build a local average of {yi } to obtain a


reasonable value for g(x), where x is close to some {xi }. Instead of using a kernel
regression, one could build subsets for the domain of X around x and construct the

local average for each different subset. In other words, if {Si } is a family of disjoint
subsets of the domain S of X and S = ∪i∈I Si , with I = {1, . . . , p}, then one uses

$$\hat{g}(x) = \begin{cases} \dfrac{1}{|\{j : x_j \in S_i\}|} \displaystyle\sum_{j :\, x_j \in S_i} y_j, & \text{if } \exists\, i,\ x \in S_i, \\ 0, & \text{else.} \end{cases} \qquad (7.11)$$

For instance, if one picks Si (h) = {x, |x − xi | < h/2}, then this is simply the uni-
form kernel regressor with bandwidth h. An alternative is to choose Si such that
the k-nearest observations xi to x, in terms of the Euclidean distance, are selected.
This avoids the regressor’s being equal to 0, and has an intuitive foundation. Since
the estimator is computed with the k-nearest points, it is less sensitive to outliers in
the dataset. However, there can be a lack of accuracy when k is large compared to
n (the sample size). The estimator will give the same weight to neighbours that are
close and far away. This problem is less severe with a larger number of observations,
or in the case of the ‘fixed design’ problem, where xi is selected by the user. In
the case of a small number of observations, one can also compensate for this lack
of consistency by combining this method with a kernel regression. R includes an
implementation of the k-NN algorithm for dependent variables Y . The function in R
is knn() from package class.
Consider again the DAX log-returns from the EuStockMarkets dataset. The
probability of having positive DAX log-returns conditional on the FTSE, CAC and
SMI log-returns, is computed in the following.
> require(class)
> k = 20 # neighbours
> data = diff(log(EuStockMarkets)) # log-returns
> size = (dim(data)[1] - 9):dim(data)[1] # last ten obs.
> train = data[-size, -1] # training set
> test = data[size, -1] # testing set
> cl = factor(ifelse(data[-size, 1] < 0, # returns as factor
+ "decrease","increase"))
> tcl = factor(ifelse(data[size, 1] < 0, # true classification
+ "decrease","increase"))
> pcl = knn(train, test, cl, k, prob = TRUE) # predicted returns
> pcl
[1] decrease decrease decrease decrease
increase decrease decrease increase
decrease increase
attr(,"prob")
[1] 0.95 0.90 0.95 0.90 1.00 1.00 0.95 1.00 0.90 1.00
Levels: decrease increase
> tcl == pcl # validation
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

The predicted classifications for the DAX log-returns fit the actual classifications perfectly. All predicted probabilities are at least 0.90.
This simple call to the k-NN estimator allows only for one set of y_i. However, a simple k-NN regression function is easy to code manually. One may write the following code:

> r.dax = diff(log(EuStockMarkets[,1])) # log-returns of DAX


> r.ftse = diff(log(EuStockMarkets[,4])) # log returns of FTSE
> knn.reg = function(x, xis, yis, k){ # function for neighbours
+ knn.reg = rep(0, times = length(x)) # empty object
+ for (i in 1:length(x)){ # loop over length
+ distances = order(abs(x[i] - xis)) # order distances
+ knn.reg[i] = mean(yis[distances][1:k]) # mean over neighbours
+ }
+ knn.reg
+ }
> # functions of regressions for different amount of neighbours
> knn.reg.k1 = function(x)knn.reg(x, r.ftse, r.dax, 10)
> knn.reg.k2 = function(x)knn.reg(x, r.ftse, r.dax, 250)
> knn.reg.k3 = function(x)knn.reg(x, r.ftse, r.dax, 1)
> plot(r.ftse, r.dax) # plot scatterplot
> # plot regressions on the given interval c(-0.06,0.06)
> plot(knn.reg.k1, add = TRUE, col="red", xlim = c(-0.06,0.06))
> plot(knn.reg.k2, add = TRUE, col="green", xlim = c(-0.06,0.06))
> plot(knn.reg.k3, add = TRUE, col="blue", xlim = c(-0.06,0.06))

This code is used to produce Fig. 7.6. The function argument xis specifies the vector
of regressors for the vector of dependent variables yis. The parameter k determines
the number of neighbours with which to build the local average for the dependent
variable. The argument x is a vector, which defines the interval for the regression
analysis.
To achieve the best fit, one has to find the optimal k, similar to a kernel regression.
It is not possible to establish a theoretical expression for the optimal value of k, since
it depends greatly on the sample.
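As a rough alternative to a full cross-validation, k can also be chosen on a random hold-out set. The following minimal sketch reuses the knn.reg function from the listing above; the seed, the hold-out size of 200 observations and the grid of candidate values are arbitrary choices made for illustration.
> set.seed(1)                                       # for reproducibility
> idx = sample(length(r.dax), 200)                  # hold-out indices
> ks  = 1:50                                        # candidate values of k
> mse = sapply(ks, function(k)                      # hold-out MSE for each k
+   mean((knn.reg(r.ftse[idx], r.ftse[-idx], r.dax[-idx], k) - r.dax[idx])^2))
> ks[which.min(mse)]                                # k with smallest hold-out MSE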

Fig. 7.6 Nonparametric regression of daily DAX log-returns by daily FTSE log-returns, using k-Nearest Neighbours. The plot shows fitted values for k = 1, k = 10 and k = 250. BCS_kNN
[Scatter plot of the data with the three regression curves; x-axis: FTSE log−returns, y-axis: DAX log−returns.]

7.3.4 Splines

Another very famous nonparametric regression technique is the so-called smoothing


spline. Here the method is different from those of the two previous nonparametric
techniques. The spline regression is a general band-pass filter and is similar to the
Hodrick–Prescott filter. This method does not look at how close the data is to the
given point. Instead it imposes restrictions on the smoothness of the form of the
regression. The smoothing is controlled by the parameter λ, which plays a similar
role as the bandwidth h and the number of neighbours k.
The spline regression estimator ĝ(x) is obtained by solving the following optimization problem:

$$\min_{g(x),\, g \in C^2} S_\lambda\{g(x)\} = \min_{g(x),\, g \in C^2} \left[ \sum_{i=1}^{n} \{y_i - g(x_i)\}^2 + \lambda \int \left\{\frac{d^2 g(z)}{dz^2}\right\}^2 dz \right]. \qquad (7.12)$$

The spline optimization problem (7.12) minimises the sum of the squared residuals
and a penalty term. In most applications the penalty term is the second derivative
of the estimator with respect to x, which reflects the smoothness of a function. The
parameter λ determines the importance of the penalty term for the estimator. One
can rewrite (7.12) in matrix notation since in fact the minimum in (7.12) is achieved
by a piecewise cubic polynomial for g(x):

$$S_\lambda\{g(x)\} = \{y - g(x)\}^\top \{y - g(x)\} + \lambda\, g(x)^\top K\, g(x). \qquad (7.13)$$


It is possible to rewrite the sum $\sum_{i=1}^{n}\{y_i - g(x_i)\}^2$ as the inner product of the vector y − g(x), where y = {y_1, …, y_n}^⊤ and the vector g(x) = {g(x_1), …, g(x_n)}^⊤. Then $g(z) = \sum_{i=1}^{n} g(x_i)\, p_i(z)$, where $p_i(z) = \sum_{k=0}^{3} a_{k,i} z^k$. The second derivative of g(z) is

$$\frac{\partial^2 g(z)}{\partial z^2} = \sum_{i=1}^{n} g(x_i)\, \frac{\partial^2 p_i(z)}{\partial z^2} = g(x)^\top h(x),$$

where $h(x) = \left\{\left.\frac{\partial^2 p(x)}{\partial x^2}\right|_{x = x_1}, \ldots, \left.\frac{\partial^2 p(x)}{\partial x^2}\right|_{x = x_n}\right\}^\top$ is a vector containing the second derivatives of each of the cubic polynomials p_i(x). The penalty term can be rewritten as follows:

$$\int \left\{\frac{\partial^2 g(x)}{\partial x^2}\right\}^2 dx = g(x)^\top h(x) h(x)^\top g(x) = g(x)^\top K\, g(x),$$

where the matrix $K_{n \times n}$ has entries $k_{i,j} = \int \frac{d^2 p_i(z)}{dz^2}\, \frac{d^2 p_j(z)}{dz^2}\, dz$.
Therefore the following estimator is proved to be a weighted sum of y:

$$\hat{g}(x) = \operatorname*{argmin}_{g(x)} S_\lambda\{g(x)\} = (I + \lambda K)^{-1} y. \qquad (7.14)$$

See Härdle et al. (2004) for a more detailed description. R provides cubic smoothing splines with a second-derivative penalty term in the package stats via the function smooth.spline.
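The closed form (7.14) can be illustrated on simulated data. The sketch below replaces the exact cubic-spline penalty matrix K by a simple second-difference penalty D^⊤D; this is only a stand-in that assumes equidistant design points, and all names and values are chosen for illustration.
> n      = 50                                   # sample size
> x      = seq(0, 1, length = n)                # equidistant design points
> y      = sin(2 * pi * x) + rnorm(n, sd = 0.3) # noisy observations
> D      = diff(diag(n), differences = 2)       # second-difference operator
> K      = t(D) %*% D                           # penalty matrix (stand-in for K)
> lambda = 5                                    # smoothing parameter
> ghat   = solve(diag(n) + lambda * K, y)       # ghat = (I + lambda K)^{-1} y
> plot(x, y); lines(x, ghat, col = "red3")      # data and smoothed curve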

Fig. 7.7 Nonparametric regression of daily DAX log-returns by daily FTSE log-returns, using spline regression. Regression results are depicted for λ = 2, λ = 1 and λ = 0.2. BCS_Splines
[Scatter plot of the data with the three regression curves; x-axis: FTSE log−returns, y-axis: DAX log−returns.]

> r.dax = diff(log(EuStockMarkets[, 1])) # daily DAX log returns


> r.ftse = diff(log(EuStockMarkets[, 4])) # daily FTSE log returns
> # spline regressions
> sp1 = smooth.spline(x = r.ftse, y = r.dax, spar = 0.2)
> sp2 = smooth.spline(x = r.ftse, y = r.dax, spar = 1)
> sp3 = smooth.spline(x = r.ftse, y = r.dax, spar = 2)
> plot(r.ftse, r.dax)       # plot scatterplot
> lines(sp1, col = "red")   # plot regression line for spar = 0.2
> lines(sp2, col = "green") # plot regression line for spar = 1
> lines(sp3, col = "blue")  # plot regression line for spar = 2

This listing creates Fig. 7.7 for the regression of DAX log-returns and FTSE log-
returns. The function arguments x and y are the observations of the independent and
dependent variables, respectively. Instead of using two separate vectors, a matrix can
be used. The argument spar defines λ through λ = c^{3·spar−1}, therefore the greater the spar, the greater the λ. One can also apply weights to the observations of x through
the variable w, which must have the same length as x. To find a good value for λ,
set cv = TRUE for the ordinary cross-validation method and cv = FALSE for a
generalised cross-validation method. Of course we could impose further restrictions
on g, e.g. a penalty on its third derivative, or on any other type of norm.
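If one prefers not to fix spar by hand, smooth.spline can select the smoothing level itself; the following minimal sketch applies generalised cross-validation to the same data.
> sp.gcv = smooth.spline(x = r.ftse, y = r.dax, cv = FALSE)  # generalised CV
> sp.gcv$spar                                                # selected spar ...
> sp.gcv$lambda                                              # ... and implied lambda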

7.3.5 LOESS or Local Regression

Another widespread method is the local regression, so-called LOESS or LOWESS.


It is an improvement of the previous k-NN method, which aggregates the selection
of neighbours of x within the {xi }, adds a weighting to the sum of the yi , and uses
local polynomials for the fitting. For the univariate case, the model is

yi = g(xi ) + εi .

The dependence between Y and X with samples {y1 , . . . , yn } and {x1 , . . . , xn } is


approximated through a polynomial of order k evaluated at a focal point x. Therefore
the polynomial regression function can be written as


$$y_i = \sum_{j=0}^{k} a_j (x_i - x)^j + \varepsilon_i.$$

In most applications, the observations are weighted according to their distance to x


through the tri-cube weighting function

$$w(z) = \begin{cases} (1 - |z|^3)^3, & \text{if } |z| < 1, \\ 0, & \text{if } |z| \ge 1, \end{cases}$$

where z i = (xi − x)/ h and h is half the width of the interval around x. Therefore the
weight attached to an observation is small if z is large and vice versa. This method
has the same particularities as the k-NN method in the way that it can extrapolate
the data. However, after the number of neighbours has been defined, through h, the
coefficients a j are estimated by the least squares approach.
This is both an advantage and a drawback, as the regression does not need any
regularity conditions (for instance, compared to the spline method, which needs ĝ to
be twice differentiable), but it provides less intuition in the interpretation of the final
curve (Fig. 7.8). The main function to use in R is loess(). First, one has to specify

Fig. 7.8 Nonparametric regression of daily DAX log-returns by daily FTSE log-returns, using LOESS regression with degree one. The used LOESS parameters are: α = 0.9, α = 0.3 and α = 0.05. BCS_LOESS
[Scatter plot of the data with the three regression curves; x-axis: FTSE log−returns, y-axis: DAX log−returns.]

the two variables to regress in the syntax of linear regressions. Then the user needs to
select the degree of the polynomial, and the span parameter, which represents the
proportion of points (or neighbours) to use. By default, R uses the tri-cube weighting
function w(z), which the user can change. Nevertheless, the weight should satisfy
some properties, stated in Cleveland (1979).
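For illustration, the default tri-cube weighting function w(z) can be coded in one line; the function name tricube below is our own choice.
> tricube = function(z) ifelse(abs(z) < 1, (1 - abs(z)^3)^3, 0)  # tri-cube weights
> tricube(c(-1.2, -0.5, 0, 0.5, 1.2))                            # weights decrease with |z|
[1] 0.0000000 0.6699219 1.0000000 0.6699219 0.0000000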
The following code produces a plot for a LOESS regression of DAX log-returns
on FTSE log-returns.
> r.dax = diff(log(EuStockMarkets[, 1]))   # daily DAX log returns
> r.ftse = diff(log(EuStockMarkets[, 4]))  # daily FTSE log returns
> loess1 = loess(r.dax ~ r.ftse,           # LOESS regression
+                degree = 1,               # degree of polynomial
+                span = 0.9)$fit           # proportion of neighbours
> loess2 = loess(r.dax ~ r.ftse, degree = 1, span = 0.01)$fit
> loess3 = loess(r.dax ~ r.ftse, degree = 1, span = 0.3)$fit
> l1 = loess1[order(r.ftse)]               # order fits as FTSE
> l2 = loess2[order(r.ftse)]
> l3 = loess3[order(r.ftse)]
> plot(r.ftse, r.dax)                      # plot observations
> lines(sort(r.ftse), l1, col = "red")     # plot regression curves
> lines(sort(r.ftse), l2, col = "green")
> lines(sort(r.ftse), l3, col = "blue")

This section and the previous sections introduced methods to model an unknown
relation between two variables X and Y . Each method depends greatly on the smooth-
ing parameter, which has different optimal values for different regression methods.
As discussed in Sect. 5.1.4, the optimal bandwidth for a normal kernel is given by
h_{opt} = 1.06 n^{-1/5} σ̂. The choice of the kernel becomes of minor importance as the
number of observations increases. The optimal parameters for other methods are
found via cross-validation algorithms.
On top of this, other complications can appear. Problems such as predicting from a
low number of observations or the presence of outliers within the dataset can make the
regression results less accurate. The following example illustrates, using simulated
data, how different nonparametric regressions perform.
Example 7.1 Consider two rvs X and Y that are generated from the model

Y = g(X) + ε, g(x) = sin(2πx) − x², X ∼ N(0, 1) and ε ∼ N(0, 1).

A small sample with n = 50 is simulated for this relation in R.


> # set.seed(3) # set seed, see Chap. 9
> n = 50 # sample size
> Xis = rnorm(n) # random x
> Epsilon = rnorm(n) # random noise
> RegressionCurve = function(x) sin(2 * pi * x) - x^2
> Yis = RegressionCurve(Xis) + Epsilon

One can use the rule of thumb for the kernel regression’s bandwidth:
> kernel.reg.example = function(new.x){
+ ksmooth(x = Xis, y = Yis,
+ kernel ="normal",
+ bandwidth = 1.06 * n^(-1 / 5),
+ xpoints = new.x)$y
+ }

However, for the k-NN regression and the spline regression, the smoothing parameter
is selected by a cross-validation algorithm. Below is a simple line for the spline
regression
> spline.reg.example = smooth.spline(x = Xis, y = Yis)

One can check that spline.reg.example$spar gives the smoothing parame-


ter from a cross-validation algorithm.
A cross-validation algorithm has to be implemented to find the k-NN parameter
k_{CV} optimal for the dataset at hand. In this example, the cross-validated parameter is

$$k_{CV} = \operatorname*{argmin}_{k} MSE(k) = \operatorname*{argmin}_{k} \frac{1}{n} \sum_{i=1}^{n} \{\hat{y}_i(k) - y_i\}^2.$$

Therefore the value of k minimising MSE(k) is used for the regression analysis. In the following, the leave-one-out cross-validation procedure is applied, where just one observation is dropped. Each observation will be excluded from the sample to perform the k-NN regression. Afterwards the squared error for each observation is
computed for a specific k. The squared error is computed by the following code
(Fig. 7.9).

Fig. 7.9 Simulations along the regression curve g(x) = sin(2πx) − x² displayed by points. BCS_RegressionCurve
[Scatter plot of the simulated observations; x-axis: x, y-axis: f(x).]

[Plot: mean squared error against k, k = 0, …, 50; y-axis: Mean Squared Error, x-axis: k.]
Fig. 7.10 MSE for k-NN regression using the Leave-one-out cross-validation method. BCS_LeaveOneOut

[Scatter plot of the simulated data with the fitted curves; x-axis: x, y-axis: f(x).]
Fig. 7.11 Nonparametric regressions for simulated data. The regression results for kernel, kNN and spline are plotted. BCS_NonparametricRegressions

> SEkNN = function(k, x, y, p){ # squared error function


+ yhat = knn.reg(x[p], x[-p], y[-p], k) # predicted y value
+ (yhat - y[p])^2 # squared error
+ }

Now one needs to compute, for each k, the mean of the squared errors for every
single point (xi , yi ), and select the k that minimises the M S E(k).
> listk = matrix(0, n - 1, n) # object for ks and es
> for (k in 1:(n - 1)) # loop for possible ks
+ for (p in 1:n){ # possible dropped obs.
+ listk[k, p] = SEkNN(k, Xis, Yis, p) # knn.reg required
+ }
> MSEkNN = (n)^(-1) * rowSums(listk) # Mean squared error
> which.min(MSEkNN) # cross validated k
[1] 3

The code above shows that k C V is equal to 3 for this dataset. Figure 7.10 plots the
M S E(k) depending on k for the k-NN regression. After the optimal parameters have
been selected for each of the regressions, it is interesting to compare the results of
these different methods to the true regression curve depicted in Fig. 7.9. Figure 7.11
shows that the regression curves are very similar. This actually tends to be true when
n is large. At the same time, all three regressions perform poorly at the boundaries of
the support of x. The standard normally distributed rv X has 95% of its realisations
in the interval [−1.96, 1.96]. Therefore the regression is likely to have both a large
bias and a large variance outside of these bounds.
Chapter 8
Multivariate Statistical Analysis

Nothing in life is to be feared, it is only to be understood. Now is


the time to understand more, so that we may fear less.

— Marie Curie

Multivariate procedures are at present widely used in finance, marketing, medicine


and many other fields of theoretical and empirical research. This chapter introduces
the basic tools of multivariate statistics in R.
The first part of the chapter deals with principal component analysis (PCA) and
factor analysis (FA), which are methods for dimension reduction. Presented next is
cluster analysis, also called data segmentation. It is a process of grouping objects
into subsets, or ‘clusters’, such that objects within each cluster are more closely
related to each other than to objects assigned to different clusters. Afterwards, mul-
tidimensional scaling is introduced. It is a statistical technique used in information
visualisation to explore similarities and dissimilarities within data. Finally, discrim-
inant analysis is discussed, which is a basic tool for linear classification.

8.1 Principal Components Analysis

One of the challenges of multivariate analysis, as discussed in the previous section,


is the curse of dimensionality. The term refers to the problem that the number of variables might be high relative to the number of observations. This can be partly
resolved by use of stepwise variable selection. But a high correlation between the
original variables would lead to estimation and inference problems caused by near
multicollinearity. This motivates principal component analysis (PCA), a multivariate
technique whose central aim is to reduce the dimension of the dataset. This transfor-
mation leads to a new set of variables, which are linear combinations of the original
variables, called principal components.


There are several equivalent ways of deriving the principal components mathe-
matically. The simplest way is to find the projections of the original p-dimensional
vectors onto a subspace of dimension q. These projections should have the follow-
ing property. The first component is the direction of the original variable space,
along which the projection has the largest variance. The second principal component
is the direction which maximises the variance among all directions orthogonal to
the first principal component, and so on. Thus, the i-th component is the variance-
maximising direction orthogonal to the previous i − 1 components. For an original
dataset of dimension p, there are p principal components.
The principal component (PC) transformation of rv X with E(X) = μ and Var(X) = Σ = ΓΛΓ^⊤ is defined as

$$Y = \Gamma^\top (X - \mu),$$

where Γ is the matrix of eigenvectors of the covariance matrix Σ and Λ is the diagonal matrix of the corresponding eigenvalues, see Sect. 2.1 for details. The principal component properties are given in the following theorem.
Theorem 8.1 For a given X ∼ (μ, Σ), let Y = Γ^⊤(X − μ) be the principal component transformation. Then

E(Y_j) = 0, j = 1, …, p;
Var(Y_j) = λ_j, j = 1, …, p;
Cov(Y_i, Y_j) = 0, i ≠ j;
Var(Y_1) ≥ Var(Y_2) ≥ … ≥ Var(Y_p) > 0.

In practice, the expectation μ and covariance matrix Σ are replaced by their estimators x̄ and S respectively. If S = GLG^⊤ is the spectral decomposition of S, then the principal components are obtained by Y = (X − 1_n x̄^⊤)G. Note that with the centring matrix H = I − n^{-1} 1_n 1_n^⊤ and H 1_n x̄^⊤ = 0, the empirical covariance matrix of the principal components can be written as

$$S_y = n^{-1} Y^\top H Y = L,$$

where L = diag(l1 , . . . , l p ) is the matrix of eigenvalues of S. Details on principal


components theory and their properties are given in Härdle and Simar (2015).
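Before turning to the built-in routines, the PC transformation can be reproduced directly from the spectral decomposition of S. The following is a minimal sketch, where X stands for an arbitrary (n × p) numeric data matrix.
> Xc  = scale(X, center = TRUE, scale = FALSE)  # centred data (X - 1_n x')
> S   = cov(X)                                  # empirical covariance matrix
> eig = eigen(S)                                # spectral decomposition S = GLG'
> Y   = Xc %*% eig$vectors                      # principal components Y = (X - 1_n x')G
> round(diag(cov(Y)) - eig$values, 12)          # variances of Y equal the eigenvalues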
It is important to clarify the meaning of the components. In some cases, the
components actually measure real variables, while in others they just reflect patterns
of the variance-covariance matrix.
Let us illustrate PCA with an example on genuine and counterfeit banknotes,
provided by Riedwyl (1997) and included in the package mclust. The data set
contains seven variables and 200 observations. The first variable is coded 0/1 and
states whether the banknote is genuine or not. For this analysis, only the last 6
variables are used. These are quantitative characteristics of swiss banknotes (e.g.

length, width, etc.). First, the required package is loaded and the data set is saved as
a data frame without the indicator stating whether banknotes are genuine.
> data(banknote, package = "mclust")  # load the data
> mydata = banknote[, -1]             # remove the first column

To perform the principal component analysis, there exists a built-in function


princomp.
> fit = princomp(mydata)  # compute PCA
> summary(fit)            # print results w.o. loadings
Importance of components:
                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Standard deviation       1.73   0.96  0.492  0.440  0.291 0.1880
Proportion of Variance   0.67   0.21  0.054  0.043  0.019 0.0079
Cumulative Proportion    0.67   0.88  0.930  0.973  0.992 1.0000

The output includes the standard deviations of each component, i.e. the square root
of the covariance matrix’s eigenvalues. A measure of how well the first q PCs explain
the total variance is given by the cumulative relative proportion of variance
$$\psi_q = \frac{\sum_{j=1}^{q} \lambda_j}{\sum_{j=1}^{p} \lambda_j}.$$

The loadings matrix, given by the matrix Γ, gives the multiplicative weights of each
standardised variable in the component score. In practice, one considers the matrix
G. Small loadings values are replaced by a space in order to highlight the pattern of
loadings.
> print(fit$loadings, digits = 3)  # prints loadings

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Length -0.326 0.562 0.753
Left 0.112 -0.259 0.455 -0.347 -0.767
Right 0.139 -0.345 0.415 -0.535 0.632
Bottom 0.768 -0.563 -0.218 -0.186
Top 0.202 0.659 -0.557 -0.451 0.102
Diagonal -0.579 -0.489 -0.592 -0.258

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.00 1.00 1.00 1.00 1.00 1.00
Proportion Var 0.17 0.17 0.17 0.17 0.17 0.17
Cumulative Var 0.17 0.33 0.50 0.67 0.83 1.00

Scatter plots in Fig. 8.1 of the components clearly illustrate how principal component
analysis simplifies multivariate techniques.
> layout(matrix(1:4, 2, 2))
> group = factor(banknote[, 1])             # group as factor
> plot(fit$scores[, 1:2], col = group)      # plot 1 vs 2 factor
> plot(fit$scores[, c(1, 3)], col = group)  # plot 1 vs 3 factor
> plot(fit$scores[, 2:3], col = group)      # plot 2 vs 3 factor

In practice, a question which often arises is how to choose the number of components.
Commonly, one retains just those components that explain some specified percentage

[Fig. 8.1 panels: PC1 vs. PC2, PC2 vs. PC3, PC1 vs. PC3 scatter plots and a scree plot of cumulative percentage variance against the number of components.]
Fig. 8.1 Principal components and scree plot for Swiss Banknote dataset. BCS_PCAvar

of the total variation of the original variables. Values between 70 and 90% are usually
suggested, although smaller values might be appropriate as p or the sample size
increase.
> plot(cumsum(fit$sdev^2 / sum(fit$sdev^2)))  # cumulative variance

A graphical representation of the PCs’ ability to explain the variation in the data is
given in Fig. 8.1. The plot down on the right, called scree plot, depicts the relative
cumulative proportion of the explained variance as given by ψq above. The figure
implies that the use of the first and the second principal components is sufficient to
identify the genuine banknotes.
Another way to choose the optimal number of principal components is to exclude
the principal components with eigenvalues less than the average, see Everitt (2005).
The covariance between the PC vector Y and the original variables X is important
for the interpretation of the PCs. It is calculated as

$$\operatorname{Cov}(X, Y) = \Gamma \Lambda.$$

Hence, the correlation ρ_{X_i Y_j} between variable X_i and the PC Y_j is

$$\rho_{X_i Y_j} = \frac{\gamma_{ij} \lambda_j}{(\sigma_{X_i X_i} \lambda_j)^{1/2}} = \gamma_{ij} \left(\frac{\lambda_j}{\sigma_{X_i X_i}}\right)^{1/2},$$

where γ_{ij} is the eigenvector corresponding to the eigenvalue λ_j.


In practice, all variances, eigenvectors and eigenvalues are replaced by their esti-
p
mators to calculate the empirical correlation r X i Y j . Note that j=1 r X2 i Y j = 1.
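Since the principal components span the full variable space, this identity can be checked numerically for the banknote fit with a single line:
> rowSums(cor(mydata, fit$scores)^2)  # each row sums to one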
> corr = cor(mydata, fit$scores)  # correlation of PC and variables
> cev  = cbind(corr[, 1:2],       # cumulative
+              corr[, 1]^2 + corr[, 2]^2)
> print(cev, digits = 2)          # correlations and communalities
         Comp.1 Comp.2
Length    -0.20  0.028 0.041
Left       0.54  0.191 0.326
Right      0.60  0.159 0.381
Bottom     0.92 -0.377 0.991
Top        0.44  0.794 0.820
Diagonal  -0.87 -0.410 0.925

The correlations of the original variables X_i with the first two PCs are given in the first two columns of the table in the previous code. The third column shows the cumulative percentage of the variance of each variable explained by the first two principal components Y_1 and Y_2, i.e. $\sum_{j=1}^{2} r_{X_i Y_j}^2$.
The results are displayed visually in a correlation plot, where r_{X_i Y_1} is plotted against r_{X_i Y_2} in Fig. 8.2 (left). When the variables lie near the periphery of the circle, they are well explained by the first two PCs. The plot confirms that the percentage of the variance of X_1 explained by the first two PCs is relatively small.
> # coordinates for the surrounding circle
> ucircle = cbind(cos((0:360) / 180 * pi), sin((0:360) / 180 * pi))
> plot(ucircle, type = "l", lty = "solid")  # plot circle
> abline(h = 0.0, v = 0.0)                  # plot orthogonal lines
> label = paste("X", 1:6, sep = "")
> text(cor(mydata, fit$scores), label)      # plot scores in text
[Two correlation-circle plots with the variables labelled X1–X6; axes: First PC / PC1 and Second PC / PC2.]
Fig. 8.2 The correlation of the original variable with the PCs (left) and normalised PCs (right). BCS_PCAbiplot, BCS_NPCAbiplot

The PC technique is sensitive to scale changes. If a variable is multiplied by a scalar,


different eigenvalues and eigenvectors are obtained. This is because the eigenvalue
decomposition is performed on the empirical covariance matrix S and not on the
empirical correlation matrix R. For this reason, in certain situations, variables should
be standardised by the Mahalanobis transformation

$$X_\ast = H X D^{-1/2},$$

where D = diag(s_{X_1 X_1}, …, s_{X_p X_p}) with covariances s_{X_1 X_1}, …, s_{X_p X_p}. Due to this standardisation, the means of the transformed variables x̄_{∗1} = … = x̄_{∗p} = 0 and the new variance-covariance matrix is S_∗ = R, see Theorem 6.2. The PCs obtained
from the standardised matrix are usually called normalised principal components.
The transformation is done to avoid heterogeneity in the variables with respect to
their covariances, which can be the case when variables are measured on heteroge-
neous scales (e.g. years, kilograms, dollars). To perform the normalised principal
component analysis, the former R code is changed to
> princomp(mydata, cor = TRUE)  # calculate normalised PCs

The scree plot for the normalised model would be different and the variables can be
observed to lie closer to the periphery of the circle in Fig. 8.2 (right).
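That the normalised PCs are indeed based on the correlation matrix R can be verified directly; a small check on the banknote data:
> fit.n = princomp(mydata, cor = TRUE)  # normalised PCA
> round(fit.n$sdev^2, 3)                # variances of the normalised PCs ...
> round(eigen(cor(mydata))$values, 3)   # ... equal the eigenvalues of R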

8.2 Factor Analysis

Factor analysis is widely used in behavioural sciences. Scientists are often interested
not in the observed variables, but in unobserved factors. For example, sociologists
record people’s occupation, education, home ownership, etc., on the assumption that
these factors reflect their unobservable ‘social class’. Explanatory factor analysis
investigates the relationship between manifested variables and factors. Note that the
number of variables should be much smaller than the number of observations. The
factor model can be written as

X = QF + μ, (8.1)

where X is a p × 1 vector of observable random variables, μ is the mean vector of X ,


F is a k × 1-dimensional vector of (unobservable) factors and Q is a p × k matrix
of loadings. It is also assumed that E(F) = 0 and Var(F) = I.
In practice, factors are usually split into specific factors and common factors,
which are highly informative and common to all of the components of X . In other
words, the factor model explains the variation of X by a small number of latent
factors F, which are common to the p components of X , and an individual factor
U , which allows for component-specific variation. Thus, a generalisation of (8.1) is
given by

X = QF + μ + U ,

where U is a ( p × 1) vector of the specific factors, which are assumed to be random.


Additionally, it is assumed that Cov(U, F) = 0 and Cov(U_j, U_k) = 0, for j ≠ k ∈ {1, …, p}. The covariance matrix of X can then be written as

$$\Sigma = QQ^\top + \psi, \qquad (8.2)$$

where QQ^⊤ is called the communality and ψ is the specific variance.


To interpret a specific factor F j , its correlation with the original variables is
computed. The covariance between X and F is given by

$$\Sigma_{XF} = \operatorname{E}\{(QF + U)F^\top\} = Q.$$

The correlation is

$$P_{XF} = D^{-1/2} Q,$$

where D = diag(σ_{X_1 X_1}, …, σ_{X_p X_p}).
The analysis based on the normalised variables is performed by using R = QQ^⊤ + ψ and the loadings can be interpreted directly. Factor analysis is scale invariant.
However, the loadings are unique only up to multiplication by an orthogonal matrix.
This leads to potential difficulties with estimation, but facilitates the interpretation of
the factors. Multiplication by an orthogonal matrix is called rotation of the factors.
The most widely used rotation is the varimax rotation which maximises the sum of
the variances of the squared loadings within each column. For more details on factor
analysis see Rencher (2002).
In practice, Q and ψ have to be estimated by S = Q̂Q̂^⊤ + ψ̂. The number of estimated parameters is d = ½(p − k)² − ½(p + k). An exact solution exists only when
d = 0, otherwise an approximation must be used. Assuming a normal distribution
of the factors, maximum likelihood estimators can be computed as discussed below.
Other methods to find Q̂ are principal factors and principal component analysis, see
Härdle and Simar (2015) for details.

8.2.1 Maximum Likelihood Factor Analysis

This subsection takes a closer look at the maximum likelihood factor analysis, one
possible fitting procedure in factor analysis. It is generally recommended if it can
be assumed that the factor scores are independent across factors and individuals and
normally distributed, i.e. F ∼ N(0, 1). Under this assumption, it is necessary that
X ∼ N(0, ψ + w^⊤w) and ψ̂ and ŵ are found by maximising the log-likelihood

$$L = -\frac{np}{2}\log 2\pi - \frac{n}{2}\log\left|\psi + w^\top w\right| - \frac{n}{2}\operatorname{tr}\left\{(\psi + w^\top w)^{-1} V\right\}.$$

Maximum likelihood factor analysis is implemented in R in the function


factanal(). Its arguments are the number of factors and the factor rotation
method. It will be applied to a famous example, the analysis of the performance of
decathlon athletes, see Everitt and Hothorn (2011) and Park and Zatsiorsky (2011).
Studies of this type are motivated by the desire to determine the overall success in
a decathlon and to help instructors and athletes with the design of optimal training
programs. They therefore need to consider the inter-event similarity and possible
transfer of training results. The data is part of the package FactoMineR and con-
sists of 41 rows and 13 columns. The first ten columns correspond to the performance
of the athletes in the 10 events of the decathlon. Columns 11 and 12 correspond to the
rank and the points obtained respectively. The last column is a categorical variable
corresponding to the sporting event (2004 Olympic Games or 2004 Decastar).
In order to choose a reasonable number of factors, cumulative explained variance
is plotted against number of principal components beforehand.
> require(stats)
> data("decathlon", package = "FactoMineR")
> mydata = decathlon[, 1:10]  # choose relevant variables
> fit = princomp(mydata)      # perform PCA
> # plot cum. percentage of var. explained by number of components
> plot(cumsum(fit$sdev^2 / sum(fit$sdev^2)),
+      xlab = "Number of principal components",
+      ylab = "Cumulative percentage variance")

Figure 8.3 suggests starting the analysis with three factors, because the increase in
explained variance becomes very small for more than three factors. While including a
third factor increases the explained variance by about 6 percentage points, including
a fourth factor offers an increase of less than one additional percentage point.
> require(stats)

Fig. 8.3 The correlation of the original variable with the NPCs. BCS_FAsummary
[Plot: cumulative percentage variance against the number of principal components.]

> mydata = decathlon[, 1:10]
> fit = factanal(mydata,             # fit the model
+                factors  = 3,       # number of factors
+                rotation = "none")  # no rotation performed
> fit                                # print the results

Call:
factanal(x = mydata, factors = 3, rotation = "none")

Uniquenesses:
100m Long.jump Shot.put High.jump 400m
0.411 0.396 0.106 0.697 0.264
110m.hurdle Discus Pole.vault Javeline 1500m
0.491 0.534 0.907 0.785 0.005

The output states the computed uniquenesses and a matrix of loadings with one
column for each factor. Uniqueness gives the proportion of variance of the vari-
able not associated with the factors. It is defined as 1 − communality (namely $1 - \sum_{l=1}^{k} q_{jl}^2$, j = 1, …, k), where communality is the variance of that variable as
determined by the common factors. Note that the greater the uniqueness of a variable,
the lower the relevance of the variable in the factor model, since the factors capture
less of the variance of the variable.
Loadings:                         # output from fit, cont.
Factor1 Factor2 Factor3
100m -0.573 0.507
Long.jump 0.455 -0.629
Shot.put 0.881 0.322 0.121
High.jump 0.547
400m -0.432 0.617 0.412
110m.hurdle -0.493 0.514
Discus 0.621 0.114 0.261
Pole.vault -0.174 0.246
Javeline 0.356 0.239 -0.178
1500m 0.997

Factor1 Factor2 Factor3
SS loadings 2.554 1.504 1.347
Proportion Var 0.255 0.150 0.135
Cumulative Var 0.255 0.406 0.541

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 17.97 on 18 degrees of freedom.
The p-value is 0.457
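The relation between uniqueness and communality stated above can be verified from the fitted object; a minimal check using fit from the listing:
> communality = rowSums(fit$loadings[, 1:3]^2)  # sum of squared loadings per variable
> round(1 - communality, 3)                     # reproduces the printed uniquenesses
> round(fit$uniquenesses, 3)                    # stored directly in the fit object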

The factor loadings give the correlation between the factors and the observed vari-
ables. They can be used to interpret the factors based on the variables they capture.
As stated above, the factor loadings are unique only up to multiplication by an orthogonal matrix, and this multiplication, called rotation of the factor loadings matrix, can greatly facilitate interpretation.
The most common rotation method, varimax, aims at maximising the variance
of the squared loadings of a factor on all the variables. Note that the assumption
of orthogonality of the factors is required. There is a number of ‘oblique’ rotations
available, which allow the factors to correlate.
> mydata = decathlon[, 1:10]

> fit2 = factanal(mydata,               # factor analysis
+                 factors  = 3,         # 3 factors
+                 rotation = "varimax") # varimax rotation
> fit2                                  # print the results

Call:
factanal(x = mydata, factors = 3, rotation = "varimax")

Uniquenesses:
100m Long.jump Shot.put High.jump 400m
0.411 0.396 0.106 0.697 0.264
110m.hurdle Discus Pole.vault Javeline 1500m
0.491 0.534 0.907 0.785 0.005

Loadings:
Factor1 Factor2 Factor3
100m 0.699 -0.264 -0.178
Long.jump -0.764 0.107
Shot.put -0.128 0.934
High.jump -0.232 0.497
400m 0.815 0.263
110m.hurdle 0.685 -0.183
Discus -0.152 0.618 0.248
Pole.vault -0.124 0.278
Javeline 0.412 -0.214
1500m 0.190 0.976

Factor1 Factor2 Factor3
SS loadings 2.350 1.791 1.264
Proportion Var 0.235 0.179 0.126
Cumulative Var 0.235 0.414 0.541

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 17.97 on 18 degrees of freedom.
The p-value is 0.457

For the first factor, the largest loadings are for 100 m, 400 m, 110 m hurdle run, and
long jump. This factor can be interpreted as the ‘sprinting performance’. The loadings
for the second factor present a counter-intuitive throwing-jumping combination: the
highest loadings are for the three throwing events (discus throwing, javeline and shot
put) and for the high jump event. For the third factor, the largest loading is for 1500 m
running. The first and second factors can be interpreted straightforwardly as ‘sprinting
abilities’ and ‘endurance’. The meaning of the last factor is not evident.
Note that the model fit has room for improvement, because the value 0.54 for
Cumulative Var in the third line signifies that only 54% of the variation in the
data is explained by three factors. Including a fourth factor is easily done by changing
the code to factors = 4.
> mydata = decathlon[, 1:10]
> fit3 = factanal(mydata,               # factor model
+                 factors  = 4,         # 4 factors
+                 rotation = "varimax") # varimax rotation
> fit3                                  # print the results

Call:
factanal(x = mydata, factors = 4, rotation = "varimax")

Uniquenesses:
100m Long.jump Shot.put High.jump 400m
0.409 0.386 0.005 0.680 0.270
110m.hurdle Discus Pole.vault Javeline 1500m
0.464 0.492 0.005 0.800 0.005

Loadings:
Factor1 Factor2 Factor3 Factor4
100m 0.720 -0.245 -0.112
Long.jump -0.770 0.131
Shot.put -0.144 0.976 0.103 0.103
High.jump -0.259 0.480 -0.152
400m 0.770 0.363
110m.hurdle 0.712 -0.157
Discus -0.220 0.585 0.297 -0.170
Pole.vault -0.102 0.117 0.983
Javeline 0.403 -0.191
1500m 0.984 0.143

Factor1 Factor2 Factor3 Factor4
SS loadings 2.363 1.785 1.263 1.074
Proportion Var 0.236 0.178 0.126 0.107
Cumulative Var 0.236 0.415 0.541 0.648

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 9.2 on 11 degrees of freedom.
The p-value is 0.603

This improves the result, with the new model explaining 65% of variation. The
interpretation of the first and second factor remains the same, but the 1500 m run and
pole vaulting are now captured by factors 3 and 4, respectively.
Thus, there is no unique or ‘best’ solution in factor analysis. Using the maximum likelihood method makes it possible to test the goodness of fit of the factor model. The test examines
if the model fits significantly worse than a model in which the variables correlate
freely. p-values higher than 0.05 indicate a good fit, since the null hypothesis of a
good fit cannot be rejected.
In this case, the p-value is 0.457. The null hypothesis that 3 factors are sufficient
cannot be rejected, suggesting a good fit of the model.

8.3 Cluster Analysis

Cluster analysis techniques are used to search for clusters or groups in a priori
unclassified multivariate data. The main goal is to obtain clusters of objects which
are similar to one another and different from objects in other clusters. Many methods
of cluster analysis have been developed, since most studies allow for a variety of
techniques.

8.3.1 Proximity of Objects

The starting point for cluster analysis is a (n × p) data matrix X with n measure-
ments of p objects. The proximity among objects is described by a matrix D which
contains measures of similarity or dissimilarity among the n objects. The elements
can be either distance or proximity measures. The nature of the observations plays
an important role in the choice of the measure. Nominal values lead, in general, to
proximity values, whereas metric values lead to distance matrices. To measure the
similarity of objects with binary structure, one defines


$$a_{ij,1} = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 1), \qquad a_{ij,2} = \sum_{k=1}^{p} I(x_{ik} = 0,\, x_{jk} = 1),$$
$$a_{ij,3} = \sum_{k=1}^{p} I(x_{ik} = 1,\, x_{jk} = 0), \qquad a_{ij,4} = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 0).$$

The following proximity measure is used in practice:

$$d_{ij} = \frac{a_{ij,1} + \delta a_{ij,4}}{a_{ij,1} + \delta a_{ij,4} + \lambda (a_{ij,2} + a_{ij,3})},$$

where δ and λ are weighting factors. Table 8.1 shows two similarity measures for
given weighting factors. To measure the distance between continuous variables, one
uses L r -norms (see Sect. 2.1.5):
$$d_{ij} = \|x_i - x_j\|_r = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^r \right)^{1/r}, \qquad (8.3)$$

where xik denotes the value of the k-th variable of object i. The class of distances in
(8.3) measures the dissimilarity using different weights for varying r . The L 1 -norm,
for example, gives less weight to outliers than the L 2 -norm (the Euclidean norm).
An underlying assumption in applying L_r-norms (see Sect. 2.1.5) is that the variables are measured on the same scale. Otherwise, a standardisation is required, corresponding to a more general L_2 or Euclidean norm with a matrix A, where A > 0:

$$d_{ij}^2 = \|x_i - x_j\|_A^2 = (x_i - x_j)^\top A (x_i - x_j).$$

Table 8.1 Common similarity coefficients

Name      δ   λ   Definition
Jaccard   0   1   a_1 / (a_1 + a_2 + a_3)
Tanimoto  1   2   (a_1 + a_4) / {a_1 + 2(a_2 + a_3) + a_4}

L_2-norms are given by A = I_p, but if a standardisation is desired, then the weight matrix A = diag(s_{X_1 X_1}^{-1}, …, s_{X_p X_p}^{-1}) is suitable. Recall that s_{X_k X_k} is the empirical variance of the k-th component, hence

$$d_{ij}^2 = \sum_{k=1}^{p} \frac{(x_{ik} - x_{jk})^2}{s_{X_k X_k}}.$$

Here each component has the same weight and the distance does not depend on any
particular measurement units.
In practice, the data is often mixed, i.e. contains both binary and continuous vari-
ables. One way to solve this is to recode the data to normalised similarity by assigning
each attribute level to a separate binary variable. However, this approach often does
not adequately capture the size of distance that can be captured by continuous vari-
ables and leads to a large increase in a4 .
The second way is to calculate a generalised similarity measure, e.g. the commonly
used Gower similarity coefficient. It calculates an average over similarities and is defined as

$$D_{ij} = \frac{\sum_{k=1}^{v} d_{ijk} \cdot \delta_{ijk}}{\sum_{k=1}^{v} \delta_{ijk}},$$

where D_{ij} is the Gower proximity, δ_{ijk} is 0 if x_{ik} = x_{jk} = 0 and 1 else, and d_{ijk} is the similarity between objects i and j of attribute k. d_{ijk} is defined differently for different types of variables:
• For binary dichotomous variables, d_{ijk} = 1 if x_{ik} = x_{jk} = 1, similar to a_1 above.
• For binary qualitative variables, d_{ijk} = 1 if x_{ik} = x_{jk} = 1 or x_{ik} = x_{jk} = 0, similar to a_1 and a_2 above.
• For continuous variables, d_{ijk} = 1 − |x_{ik} − x_{jk}| / R_k, where R_k is the range of attribute k in the sample (or population).
Note that the Gower proximity coefficient reduces to the Jaccard similarity index if
all attributes are binary. If all attributes are qualitative then it reduces to the Simple
Matching coefficient. When all variables are quantitative (interval) then the coeffi-
cient is the range-normalised City-block metric.
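In R, the Gower coefficient is available, for example, through the function daisy() from the package cluster. The following minimal sketch uses a small hypothetical data frame with one continuous and one binary attribute.
> require(cluster)
> mix.data = data.frame(income = c(20, 35, 50),       # continuous attribute
+                       urban  = factor(c(1, 0, 1)))  # binary attribute as factor
> daisy(mix.data, metric = "gower")                   # Gower dissimilarities (1 - similarity)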

8.3.2 Clustering Algorithms

There are two types of hierarchical clustering algorithms, agglomerative and split-
ting. The first starts with the finest possible partition. The second starts with the
coarsest possible partition, i.e. one cluster containing all the observations. The draw-
back of these methods is that the clusters are not adjusted, e.g. objects assigned to
a cluster cannot be removed in further steps. Since all agglomerative hierarchical
techniques ultimately reduce the data to a single cluster containing all the individu-

als, the investigator seeking the solution with the best-fitting number of clusters will
need to decide which division to choose. The problem of deciding on the ‘correct’
number of clusters will be taken up later.
The hierarchical agglomerative clustering algorithm works as follows:
1. Find the nearest pair of distinct clusters, say ci and c j , merge them into ck and
decrease the number of clusters by one;
2. If the number of clusters equals one, end the algorithm, else return to step 1.
For this purpose, the distance between two groups or an individual and a group
must be calculated. Different definitions of distances lead to different clustering
algorithms. Widely used measures of distance are

Single linkage:    $d_{AB} = \min_{i \in A,\, j \in B} d_{ij}$,

Complete linkage:  $d_{AB} = \max_{i \in A,\, j \in B} d_{ij}$,

Average linkage:   $d_{AB} = \dfrac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij}$,

where n_A and n_B are the number of objects in the two groups. The Ward clustering algorithm does not join the groups with the smallest distance. Instead, it joins groups
that do not increase a given measure of heterogeneity ‘too much’. The resulting
groups are as homogeneous as possible. Let us study an example of the Gross National
Product (GNP) per capita and the percentage of the population working in agriculture
for each country belonging to the European Union in 1993. The data can be loaded
from the cluster package. First the data should be checked for missing values and
standardised. As mentioned above, the matrix of distances should be calculated first.
Euclidean distance is used in this case. This matrix is then used to obtain clusters
using the complete linkage algorithm (other algorithms are available as parameters
of function hclust).
> require(cluster)                        # package for CA
> data(agriculture, package = "cluster")  # load the data
> mydata = scale(agriculture)             # normalise data
> d = dist(mydata,                        # calculate distances
+          method = "euclidean")          # Euclidean
> print(d, digits = 2)                    # show distances
        B   DK    D   GR    E    F  IRL    I    L   NL    P
DK   1.02
D    0.40 0.63
GR   3.74 4.03 3.88
E    1.68 2.15 1.87 2.08
F    0.55 0.71 0.43 3.48 1.50
IRL  2.12 2.46 2.27 1.62 0.49 1.87
I    0.90 1.04 0.88 3.03 1.11 0.46 1.43
L    0.86 0.35 0.46 4.21 2.25 0.75 2.61 1.18
NL   0.26 1.01 0.48 3.49 1.44 0.39 1.87 0.65 0.94
P    2.92 3.27 3.08 0.84 1.24 2.68 0.82 2.25 3.43 2.67
UK   0.57 1.56 0.97 3.49 1.43 0.96 1.92 1.10 1.42 0.58 2.66
> fit = hclust(d, method = "complete")    # fit the model
Fig. 8.4 Agglomerative hierarchical clustering (complete linkage dendrogram) for the agriculture data; the vertical axis shows the height at which clusters are merged. BCS_CAComplete
The specific partition of the data can now be selected from the dendrogram, see
Fig. 8.4. ‘Cutting’ the dendrogram off at some height gives a partition with a
particular number of groups. One method to choose the number of clusters
is to examine the size of the height changes in the dendrogram: a large jump
indicates a large loss in homogeneity of the clusters if they are joined
as suggested by the current step of the tree. Function rect.hclust draws a
dendrogram with red borders around the clusters, facilitating the interpretation, see Fig. 8.4,
where the height axis shows the value of the criterion associated with the clustering
method. The value k specifies the desired number of groups. We do not discuss the
choice of the number of clusters in detail. A popular method is the scree plot or elbow
criterion, which is easy to implement and visualise.
> plot(fit)                               # plot the solution
> groups = cutree(fit, k = 5)             # define clusters
> rect.hclust(fit, k = 5, border = "red") # draw boxes

The importance of the method choice is illustrated in Fig. 8.5.


> par(mfrow = c(1, 3))                    # 3 plots in 1 figure
> plot(hclust(d, method = "single"),  main = "Single linkage")
> plot(hclust(d, method = "ward.D"),  main = "Ward")
> plot(hclust(d, method = "average"), main = "Average linkage")

If the number of clusters is predetermined by the application, k-means clustering can
be used to partition the observations into a pre-specified number k of clusters. The
clusters $S_1, S_2, \ldots, S_k$ are chosen so as to minimise the squared Euclidean distance between
the points and the mean within each cluster, also called the within-cluster sum of
squares. This can be expressed as

$$ \arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2, $$
Fig. 8.5 Comparison of different clustering algorithms: single linkage, Ward and average linkage dendrograms for the agriculture data (vertical axes: Euclidean distance). BCS_CAmethods

where $\mu_i$ is the centroid of $S_i$. This is implemented in the function kmeans(), in which
the option centers specifies the number of clusters.
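As a brief, hedged illustration, kmeans() can be applied to the standardised agriculture data used above; the choice of three clusters here is purely illustrative and not a recommendation for this dataset.
> set.seed(1)                             # k-means uses random starting centroids
> km = kmeans(mydata, centers = 3)        # partition into 3 clusters
> km$cluster                              # cluster membership of each country
> km$tot.withinss                         # total within-cluster sum of squares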
Cluster analysis is a large area of multidimensional statistical analysis and has been
covered only very briefly in this section. For a detailed discussion of this technique
see Everitt et al. (2009).

8.4 Multidimensional Scaling

Multidimensional scaling (MDS) is an exploratory technique used to visualise proximities
in a low-dimensional space. Interpretation of the dimensions can lead to an
understanding of the processes underlying the perceived nearness of objects. Furthermore,
it is possible to incorporate individual or group differences in the solution.
The basic data representation in a standard MDS is a dissimilarity matrix that shows
the distance between every possible pair of objects. Using the spectral decomposition
of this matrix, the desired projections of the data to the lower dimension are found.
The goal of MDS is to faithfully represent these distances in a space of the lowest
possible dimension.
The variety of methods that have been proposed largely differ in how agreement
between fitted distances and observed proximities is assessed. In this section, classical
metric MDS and non-metric MDS are considered. Metric MDS is applied when
measurements are numerical and the distance between the objects can be calculated.
In non-metric MDS, proximity measurements are ordinal, e.g. when a subject prefers
Coca-Cola to Pepsi-Cola, but cannot say by how much. This kind of data is used very
often in psychology and market research.

8.4.1 Metric Multidimensional Scaling

Assume that a data matrix X is given. The metric MDS begins with an $n \times n$ distance
matrix $D^X$, which contains the distances between the given objects. Note that this matrix
is symmetric, with $d_{ii}^X = 0$ and $d_{ij}^X > 0$, which naturally follows from the definition
of distance in any metric space. Given such a matrix, MDS attempts to find n data
points $y_1, \ldots, y_n$ constituting the new data matrix Y in p-dimensional space, such
that $D^X$ is similar to $D^Y$. In particular, metric MDS minimises

$$ \min_{Y} \sum_{i=1}^{n} \sum_{j=1}^{n} (d_{ij}^X - d_{ij}^Y)^2, \qquad (8.4) $$

where $d_{ij}^X = \|x_i - x_j\|$ and $d_{ij}^Y = \|y_i - y_j\|$. For the Euclidean distance, (8.4) can
be reduced to

$$ \min_{Y} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i^\top x_j - y_i^\top y_j)^2. $$

In practice, the question of how to choose the dimension p of the data projection arises
often. If one defines $P = \sum_{i=1}^{p} |\lambda_i| / \sum_{i=1}^{n} |\lambda_i|$ and $P' = \sum_{i=1}^{p} \lambda_i^2 / \sum_{i=1}^{n} \lambda_i^2$, then a
p which gives $P > 0.8$ or $P' > 0.8$ suggests a reasonable fit. The $\lambda_i$'s are the first p
eigenvalues of the data matrix X. These values do not necessarily agree. Usually p is
chosen such that the conditions hold for P and P' simultaneously. Another criterion
is to choose p such that the sum of the largest positive eigenvalues is approximately
equal to the sum of all eigenvalues. A third criterion proposes to accept as genuinely
positive only those eigenvalues whose magnitude substantially exceeds that of the
largest negative eigenvalue.
MDS is widely used in the psychological sciences and in marketing research. The
Cars93 data frame (package MASS) has 93 rows and 27 variables, which can be
displayed by the command View(Cars93). Only models with rear drive train and
with less than 18 mpg (miles per gallon) in the city are analysed, and only numerical
characteristics are taken into account. First, the data is prepared for analysis.
> data(Cars93, package = "MASS")          # load the data
> rownames(Cars93) = Cars93[, ncol(Cars93)]
> mydata = Cars93[which(Cars93$DriveTrain == "Rear"    # choose
+                       & Cars93$MPG.city <= 18),      # cars and
+                 c(5, 7:8, 11:15, 17:19, 20:25)]      # variables
> mydata = na.omit(mydata)                # exclude missing
> d = dist(mydata, method = "euclidean")  # distance matrix
Fig. 8.6 Configuration of the MDS of the American car subsample (axes y1 and y2, labelled by car model). BCS_MDS

Function cmdscale() performs MDS and uses as arguments the distance matrix
and the dimension of the space in which the data will be represented. Including the
option eig = TRUE additionally provides the eigenvalues of the data. They are
used to calculate P and P' in order to decide on the dimension p of the projection
space. In this case, both criteria give satisfactory results, i.e. values greater than
0.8, for any number of dimensions. Given this result, it is convenient to set p equal
to 2 to plot the MDS map in a simple diagram.
> fit = cmdscale(d, eig = TRUE, k = 2)    # fit mds model
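One possible way to compute P and P' from the eigenvalues returned by cmdscale() is sketched below; this is only an illustrative implementation of the criteria described above, with the absolute values guarding against small negative eigenvalues that can occur numerically.
> lambda = fit$eig                        # eigenvalues from cmdscale
> p = 2                                   # candidate projection dimension
> P  = sum(abs(lambda[1:p])) / sum(abs(lambda))  # first criterion
> P2 = sum(lambda[1:p]^2)    / sum(lambda^2)     # second criterion
> c(P, P2)                                # both should exceed 0.8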

If it is necessary to depict the results in a two-dimensional space, the following R


commands can be used. The results are displayed in Fig. 8.6. It is evident that the car
models form clusters. On the right side, one can see expensive luxury cars. On the
left part of the plot, one finds affordable cars.
> plot(fit$points, type = "n")            # set the plotting frame
> abline(v = 0, lty = "dotted")           # y1 = 0 line
> abline(h = 0, lty = "dotted")           # y2 = 0 line
> text(fit$points, labels = rownames(mydata))  # add text

8.4.2 Non-metric Multidimensional Scaling

Unfortunately, classical metric MDS cannot always be used and other methods of
scaling might be more suitable. This section is concerned with non-metric multidimensional
scaling, which can be applied if there is a considerable number of negative
eigenvalues, so that classical scaling of the proximity matrix may be inadvisable. Non-metric
scaling is also used in the case of ordinal data, e.g. when comparing a range
of colours. Customers might be able to specify that one was ‘brighter’ than another,
without being able to attach any quantitative value to the extent the colours differ.
Non-metric MDS uses only a rank order of the proximities to produce a spatial
representation of them. Thus, the solution is invariant under monotonic transfor-
mations of the proximities. One such method was originally suggested by Shepard
(1962) and Kruskal (1964). Beginning from arbitrary coordinates in a p-dimensional
space, e.g. calculated by metric MDS, the distances are used to estimate disparity
between the objects using monotonic regression. The aim is to represent the fitted
distances diYj as diYj = d̂iXj + εi j , where the estimated disparities d̂iXj are monotonic
with the observed proximities and, subject to this constraint, resemble the diYj as
closely as possible. For a given set of disparities, the required coordinates can be
found by minimising some function of the squared differences between the observed
proximities and the derived disparities, generally known as stress. The procedure is
iterated until some convergence criterion is satisfied. The number of dimensions is
chosen by comparing stress values or other criteria, for example $R^2$.
Non-metric MDS can be applied using R as well. The next example uses the
voting data of the package HSAUR2. This dataset represents the voting results of
15 congressmen from New Jersey on 19 environmental bills.
To perform non-metric MDS, load the data, compute the distance matrix and run
function isoMDS with the default two-dimensional solution.
> require(MASS)
> data(voting, package = "HSAUR2")        # load the data
> fit = isoMDS(voting)                    # fit MDS
> plot(fit$points, type = "n")            # plot the model
> abline(v = 0, lty = "dotted")           # y = 0 line
> abline(h = 0, lty = "dotted")           # x = 0 line
> text(fit$points, labels = rownames(voting))  # add text

Fig. 8.7 Configuration of the MDS of the voting data (axes y1 and y2, labelled by congressman and party). BCS_isoMDS

Figure 8.7 shows the output of the above procedure. It is clear that the Democratic
congressmen have voted differently from the Republicans. A possible further conclusion
is that the Republicans have not shown as much solidarity as their Democratic
colleagues. More examples on MDS are given in Everitt and Hothorn (2011).

8.5 Discriminant Analysis

One of the most applied tools in multivariate data analysis is classification. Discriminant
analysis is concerned with deriving rules for the allocation of observations to
sets of a priori defined classes in some optimal way. It requires two samples: the
training sample, for which group membership is known with certainty a priori, and
the test sample, for which group membership is unknown.
The theory of discriminant analysis states that one needs to know the class posteriors
$P(G \mid X)$, where G is a given class and X contains other characteristics
of an object. Suppose $f_k(x)$ is the class-conditional density and let $\pi_k$ be the prior
probability of class k. A simple application of Bayes' theorem gives

$$ P(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{j=1}^{K} f_j(x)\,\pi_j}. $$

It is easy to see that, in terms of the ability to classify, having the $f_k(x)$ is almost equivalent
to having the quantity $P(G = k \mid X = x)$.
Linear discriminant analysis (LDA) arises in the special case when each class
density is a multivariate Gaussian and the classes have a common covariance matrix
$\Sigma_k = \Sigma$, $\forall k$. The purpose of LDA is to find the linear combination of the individual
variables which gives the greatest separation between the groups. To discriminate
between two classes k and l, a decision rule can be constructed as the log ratio

$$ \log \frac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l) + x^\top \Sigma^{-1} (\mu_k - \mu_l), \qquad (8.5) $$

where $\Sigma$, $\mu_k$ and $\mu_l$ are in most cases unknown and have to be estimated from the
training data set. Equation (8.5) is linear in x.
The decision rule can also be expressed as a set of K linear discriminant functions

$$ \delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k. $$

An observation is assigned to the class with the highest value of the respective
discriminant function. The parameter space $\mathbb{R}^p$ is divided by hyperplanes into regions
that are classified as classes $1, 2, \ldots, K$.
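To make the linear discriminant functions concrete, the following hedged sketch evaluates $\delta_k(x)$ directly for two artificial Gaussian samples with a pooled covariance estimate; all data, names and parameter values here are simulated for illustration only and do not reproduce any analysis in this book.
> set.seed(1)
> n1 = 50; n2 = 50                        # two artificial groups
> x1 = matrix(rnorm(2 * n1), ncol = 2)    # group 1: N(0, I)
> x2 = matrix(rnorm(2 * n2), ncol = 2) + 2  # group 2: shifted mean
> mu1 = colMeans(x1); mu2 = colMeans(x2)  # estimated group means
> S  = ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)  # pooled covariance
> Si = solve(S)                           # inverse covariance
> pi1 = n1 / (n1 + n2); pi2 = n2 / (n1 + n2)  # prior probabilities
> delta = function(x, mu, pik)            # linear discriminant function
+   as.numeric(x %*% Si %*% mu - 0.5 * t(mu) %*% Si %*% mu + log(pik))
> xnew = c(1, 1)                          # point to classify
> which.max(c(delta(xnew, mu1, pi1), delta(xnew, mu2, pi2)))  # assigned class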
In cases where the classes do not have a common covariance matrix, the decision
boundaries between each pair of classes are described by a quadratic function. The
corresponding quadratic discriminant functions are defined as

$$ \delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k. $$
For more details on discriminant analysis, see Hastie et al. (2009).
To illustrate the method and its implementation in R, the datasets spanish and
spanishMeta are used. They contain information about the relative frequencies
of the 120 most frequent tag trigrams (combinations of three tags) in 15 texts contributed
by three Spanish authors (Cela, Mendoza and Vargas Llosa). The aim of the
analysis is to construct a classification rule which allows automatic assignment of a
text by an ‘unknown author’ to Cela, Mendoza or Vargas Llosa. In this dataset the
number of variables, i.e. the different tag trigrams, is much larger than the number
of observations. In addition, some of the variables are highly correlated. Practically,
this means that much of the information conveyed by the variables is redundant. We
therefore can perform a principal component analysis before constructing the discriminant
function, in order to reduce dimensions without much loss of information.
> require(MASS)
> data(spanish, package = "languageR")    # load the data
> mydata = t(spanish)                     # transpose
> pca = prcomp(mydata,                    # fit PCA model
+              center = TRUE,             # center values
+              scale = TRUE)              # and rescale
> datalda = pca$x
> datalda = datalda[order(rownames(datalda)), ]  # sort by rownames
> data(spanishMeta, package = "languageR")       # load data
> mydata = cbind(datalda[, 1:2], spanishMeta$Author)
> colnames(mydata) = c("PC1", "PC2", "Author")
> mydata = as.data.frame(mydata)
> mydata$Author = as.factor(mydata$Author)

Before performing LDA, the dataset is randomly divided into a training dataset and
a test dataset. The training dataset is used to construct a discrimination rule, which
is subsequently applied to the test dataset in order to test its precision. Performing
the precision test on the same data for which the classification rule was constructed
is bad practice, since the results are not reliable, e.g. biased towards overfitting.
Alternatively, a wide range of resampling techniques such as bootstrap and cross-
validation can be used. These methods are described in Hastie et al. (2009).
> # set.seed(123)                         # set seed, see Chap. 9
> n = nrow(mydata); n                     # total number of observations
[1] 15
> nt = floor(0.6 * n); nt                 # set training set size
[1] 9
> indices = sample(1:n, size = nt)        # sample
> mydata.train = mydata[indices, ]        # define the training set
> mydata.test = mydata[-indices, ]        # define the test set

To perform LDA in R, one can use the lda function in the package MASS. The output
contains the prior probabilities for each group, the group means, the coefficients of the
linear discriminant functions and the proportion of the trace, i.e. which proportion
of variance is explained by each discriminant function.
> fit = lda(Author ~ PC1 + PC2,           # fit LDA model
+           data = mydata.train)
> fit
Call:
lda(Author ~ PC1 + PC2, data = mydata.train)

Prior probabilities of groups:
   1    2    3
0.33 0.22 0.44

Group means:
   PC1   PC2
1 -3.3 -4.50
2  4.3  4.27
3  1.7 -0.83

Coefficients of linear discriminants:
      LD1   LD2
PC1 -0.26 -0.14
PC2 -0.22  0.12

Proportion of trace:
   LD1    LD2
0.9929 0.0071

Having constructed this classification rule, one can easily depict the discrimination borders for
the groups to see how distinguishable the three classes are, see Fig. 8.8. The borders
between two classes are obtained from the difference between the corresponding discriminant
functions. For this purpose, use the function partimat from the package klaR.

Fig. 8.8 LDA partition plot for the Spanish author data (PC1 versus PC2; apparent error rate: 0). BCS_LDA

> require(klaR)
> partimat(Author ~ PC1 + PC2,            # multiple figure array
+          data = mydata.test,            # for dataset
+          method = "lda",                # using LDA
+          main = "")                     # no title

Predicted classes and posterior probabilities can be obtained by using the function
predict(). The table below shows the probability of each element falling into
each of the classes.
> pred.class = predict(fit, mydata.test)$class      # predicted class
> pred.class
[1] 1 1 3 2 3 3
Levels: 1 2 3
> post.prop = predict(fit, mydata.test)$posterior   # posterior prob.
> post.prop
              1    2    3
X14459gll  0.75 0.00 0.24
X14460gll  0.97 0.00 0.03
X14464gll  0.00 0.75 0.25
X14466gll  0.01 0.28 0.72
X14467gll  0.02 0.25 0.73
X14474gll  0.12 0.07 0.81

To check whether the discrimination rule works well, the percentage of correctly
classified observations is calculated. In this case, the percentage of error is 33%,
which can be interpreted as high or low depending on the application. The calculations
of the prediction error are shown below.
> pr.table = table(mydata.test$Author, pred.class)  # pred. vs true
> pr.table
   pred.class
    1 2 3
  1 2 0 0
  2 0 1 2
  3 0 0 1
> pred.correct = diag(prop.table(pr.table, 1))
> pred.correct
   1    2    3
1.00 0.33 1.00                            # prediction in %

> 1 - sum(diag(prop.table(pr.table)))     # prediction error

[1] 0.3333333
Chapter 9
Random Numbers in R

Anyone who considers arithmetical methods of producing


random digits is, of course, in a state of sin.

— John von Neumann

Random number generation has many applications in economic, statistical, and financial
problems. With the advantage of high-speed and cheap computation, new statistical
methods using random number generation have been developed. Important
examples are the bootstrap-based procedures. When referring to a random number
generator of any statistical software package, the phrase ‘random number’ is misleading,
as all random number generators are based on specific mathematical algorithms.
Thus, the computer generates deterministic and therefore pseudorandom numbers,
which are called ‘random’ for simplicity. In this context, the standard uniform distribution
plays a key role, because its random numbers can be transformed so as
to obtain pseudo-samples from any other distribution. True random numbers can be
obtained by sampling and processing a source of natural entropy such as atmospheric
noise, radioactive decay, etc.
The main purpose of this chapter is to provide some computational algorithms
that generate random numbers.

9.1 Generating Random Numbers

A sequence of numbers generated by an algorithm is entirely determined by the


starting value of the algorithm, often called the seed or key. While the determinism
of the random numbers generated might be considered a drawback, it is also an
important property in simulation and modeling, due to the ability to repeat the process
using the same seed value. In addition, these algorithms are more efficient, as they can
produce many numbers in a short time. In simulation and especially in cryptography,


huge amounts of random numbers are used, thus sampling speed is crucial. Typically,
the algorithms are periodic, which means that the sequence repeats itself in the long
run. While periodicity is hardly ever a desirable characteristic, modern algorithms
have such long periods that they can be ignored for most practical purposes.

Definition 9.1 (Pseudorandom Number Generator) A pseudorandom number generator
is a structure $(S, s_0, T, U, G)$, where S is a finite set of states, $s_0$ is the
initial state, also called the ‘seed’ or ‘key’, $T: S \to S$ is a transformation function,
U is a finite set of output symbols, and $G: S \to U$ is the output function.

The initial state of the generator is $s_0$ and it evolves according to the recurrence
$s_n = T(s_{n-1})$, for $n = 1, 2, 3, \ldots$. At step n, the generator creates $u_n = G(s_n)$ as
output. For $n \ge 0$, the $u_n$ are the random numbers produced by the generator. Due
to the fact that S is finite, the sequence of states $s_n$ is eventually periodic. So the
generator must eventually reach a previously seen state, which means $s_i = s_j$ for
some $0 \le i < j$. This implies that $s_{j+n} = s_{i+n}$ and therefore $u_{j+n} = u_{i+n}$ for all $n \ge 0$.
The length of the period is the smallest integer p such that $s_{p+n} = s_n$ for all $n \ge r$ for
some integer $r \ge 0$. The smallest r with this property is called the transient. For $r = 0$,
the sequence is called purely periodic. Note that the length of the period cannot
exceed the maximal number of possible states |S|. Thus a good generator has p very
close to |S|. Otherwise, this would result in a waste of computer memory.

9.1.1 Pseudorandom Number Generators

Modular arithmetic is often used to cope with the issue of generating a sequence of
apparently random numbers on computer systems, which are completely predictable.
The basic relation of modular arithmetic is called equivalence modulo m, where m is
an integer. The modulo operation finds the remainder of the division of one number
by another, e.g. 7 mod 3 = 1. As stated in Sect. 1.4.1, the modulo operator in R is
%%.
In the following, we present two pseudorandom number generators, which illus-
trate the main ideas behind such algorithms.
Linear congruential generator
The Linear Congruential Generator (LCG) is one of the first developed and best-
known pseudorandom number generator algorithms. It is fast and can be easily imple-
mented.

Definition 9.2 (Linear Congruential Generator) The LCG is a recursive algorithm
of the form

$$ T(x_i) = (a x_{i-1} + c) \bmod m, \quad \text{with } 0 \le x_i < m \text{ for } i = 0, 1, 2, \ldots, $$
$$ \text{and with } m > 0,\ 0 < a < m,\ 0 \le c < m,\ 0 \le x_0 < m, $$

where m, a, c and $x_0$ are the modulus, multiplier, increment and seed value, respectively.

To obtain numbers with the desired properties discussed in Sect. 9.1, one has to
transform the generated integers into [0, 1] with

$$ G(x_i) = \frac{x_i}{m} = U_i, \quad \text{for } i = 0, 1, 2, \ldots. $$
The selection of values for a, c, m and $x_0$ drastically affects the statistical properties
and the cycle length of the generated sequence of integers. The full cycle length
is m if and only if
1. $c \neq 0$,
2. c and m are relatively prime, i.e. their greatest common divisor is 1,
3. $a - 1$ is divisible by all prime factors of m,
4. and, if m is divisible by 4, $a - 1$ also has to be divisible by 4.
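A minimal LCG sketch in R is given below; the constants a = 1664525, c = 1013904223 and m = 2^32 are a widely used full-period choice satisfying the conditions above and are used here only for illustration (the function name is arbitrary).
> LCG = function(n, seed = 1, a = 1664525, c = 1013904223, m = 2^32){
+   x = numeric(n)
+   for(i in 1:n){
+     seed = (a * seed + c) %% m          # linear congruential step
+     x[i] = seed / m                     # normalise to [0, 1)
+   }
+   x                                     # a * seed stays below 2^53, so doubles remain exact
+ }
> LCG(4)                                  # four pseudorandom numbers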
In addition, Marsaglia (1968) has shown that n-tuples of consecutively generated
numbers, when plotted as points in n-dimensional space, will lie on at most $m^{1/n}$ hyperplanes,
see Fig. 9.1. This is illustrated by a famous example of badly chosen starting values,
namely RANDU, a random number generator developed by IBM. This algorithm was
first introduced in the early 1960s and became widespread soon after.

Definition 9.3 (RANDU—The IBM Random Number Generator) RANDU is a Linear
Congruential Generator defined by the recursion

$$ T(x_i) = (2^{16} + 3)\, x_{i-1} \bmod 2^{31}, \quad \text{with } 0 \le x_i \le m \text{ and } i = 0, 1, 2, \ldots. $$

The corresponding R code is


> RANDU = function(n, seed = 1){
+ x = NULL # predefine constants
+ a = 2^16 + 3
+ m = 2^31
+ for(i in 1:n){
+ seed = (a * seed) %% m
+ x[i] = seed / m # normalise the values to [0, 1]
+ }
+ x
+ }
> RANDU(4)
[1] 3.051898e-05 1.831097e-04 8.239872e-04 3.295936e-03

Fig. 9.1 Nine plots of random numbers $x_{k+2}$ versus $x_{k+1}$ versus $x_k$ generated by RANDU, visualised:
in a three-dimensional space all points fall in 15 hyperplanes. BCS_RANDU

The chosen modulus, $m = 2^{31}$, is not a prime and the multiplier was chosen primarily
because of the simplicity of its binary representation, not for the goodness of the
resulting sequence of integers. Consequently, RANDU does not have full cycle length
and has some clearly non-random characteristics.
To demonstrate the inferiority of these values, consider the following calculation,
where mod $2^{31}$ has been omitted from each term.

$$ x_{k+2} = (2^{16} + 3)x_{k+1} = (2^{16} + 3)^2 x_k = (2^{32} + 6 \cdot 2^{16} + 9)x_k = \{6 \cdot (2^{16} + 3) - 9\}x_k = 6x_{k+1} - 9x_k $$

The linear dependency between $x_{k+2}$, $x_{k+1}$ and $x_k$ is obvious. According to
Marsaglia's Theorem, all points fall in 15 hyperplanes in a three-dimensional space, as
illustrated in Fig. 9.1. Today, many results depending on computations with RANDU
from the '70s are seen as suspect.

Lagged Fibonacci generator

Another idea is to use the Fibonacci numbers on moduli like $x_i \bmod m$, where
$x_i = x_{i-1} + x_{i-2}$ with $x_0 = 0$ and $x_1 = 1$. Unfortunately, this sequence does not have
satisfactory randomness properties. One solution is to combine terms at greater distances,
in other words, to add a lag between the summands. This is the main idea of
the Lagged Fibonacci Generator (LFG).

Definition 9.4 (The Lagged Fibonacci Generator) A Lagged Fibonacci Generator
is a recursive algorithm defined as

$$ T(x_i) = (x_{i-j} + x_{i-k}) \bmod m, $$

with $0 \le x_i \le m$, $i, j, k = 0, 1, 2, \ldots$ and $0 < k < j < i$.

Unlike the LCG, the seed is not a single value. It is rather a sequence of (at least) j
integers, of which one integer should be odd. The statistical properties of the resulting
sequence of numbers rely heavily on this seed.
The value of the modulus m does not by itself limit the period of the generator, as it
does in the case of an LCG. The maximum cycle length for $m = 2^M$ is $(2^j - 1) \cdot 2^{M-1}$
if and only if the trinomial $x^j + x^k + 1$ is primitive over the integers mod 2.
Using the notation LFG(j, k, p) to indicate the lags j, k and the modulus $2^p$,
a commonly used version of this algorithm is LFG(17, 5, 31). The cycle length of
this version is $2^{47}$.
> LFG = function(j, k, p, n) {
+ seed = runif(j, 0, 2^p) # generate the seed
+ for(i in 1:n) {
+ seed[j + i] = (seed[i] + seed[j + i - k]) %% 2^p
+ }
+ seed[(j + 1):length(seed)] / max(seed) # standardise to [0, 1]
+ }
> LFG(17, 5, 31, 4) # generate 4 random numbers
[1] 0.3102951 0.9048108 0.4415016 1.0000000

The basic problem with this generator is that there exist three-point correlations
between $x_{i-k}$, $x_{i-j}$ and $x_i$, given by the construction of the generator itself, but typically
these correlations are very small.
Mersenne twister
The Mersenne twister is a pseudorandom number generator developed by Matsumoto
and Nishimura (1998). Due to its good properties, it is still widely used today.
Its name is derived from the fact that the period length is a Mersenne prime, i.e. a
prime which is one less than a power of two: $M_p = 2^p - 1$.
This section presents the most common version of this algorithm, also called
MT 19937.

Definition 9.5 (The Mersenne Twister ‘MT19937’) The sequence of numbers generated
by the MT 19937 is uniformly distributed on [0, 1]. To save computation time, the
generator works internally with binary numbers. The main equation of the generator
is given by

$$ x_{k+n} = x_{k+m} + x_{k+1} \begin{pmatrix} 0 & 0 \\ 0 & I_r \end{pmatrix} A + x_k \begin{pmatrix} I_{w-r} & 0 \\ 0 & 0 \end{pmatrix} A, \qquad k = 0, 1, \ldots, $$

where $x_i$ is a 32-dimensional row vector, $I_r$ a 19 × 19 identity matrix, $I_{w-r}$ a 13 × 13
identity matrix, n = 351, m = 175 and

$$ A = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ a_{31} & a_{30} & a_{29} & \cdots & a_0 \end{pmatrix}. $$

As a result, each vector $x_i$ is a binary number with 32 digits. Afterwards, the resulting
vector $x_{k+n}$ is rescaled. $x_0 = 4357$ is chosen to be the most appropriate seed.

To explain the recursion above, one can think of it as a concatenation and a shift:
a new vector is generated by the first 13 entries of $x_k$ and the last 19 entries of $x_{k+1}$.
The shift results from the multiplication by A, which is in some way disturbed by
the addition of $a_0, a_1, \ldots$. The result is added to $x_{k+m}$.
The resulting properties of the generated sequence of numbers are extremely
good. The period length of $2^{19937} - 1$ ($\approx 4.3 \cdot 10^{6001}$) is astronomically high and
sufficient for nearly every purpose today. It is k-distributed to 32-bit accuracy for
every $1 \le k \le 623$ (see Sect. 9.3.2). In addition, it passes numerous tests for statistical
randomness.

9.1.2 Uniformly Distributed Pseudorandom Numbers

To generate a sequence of uniformly distributed pseudorandom numbers on


(min, max) in R, the command runif() is used, see Sect. 4.2.
> runif(5, 0, 1) # runif(number of observations, min, max)
[1] 0.9388026 0.6177511 0.1474307 0.1756104 0.3917517

The underlying algorithm of runif() is the Mersenne twister, discussed in


Sect. 9.1.1. runif() will not generate either of the extreme values, unless max =
min or max − min is small compared to min, and in particular not for the default
arguments:

> min(runif(100000))
[1] 8.260133e-06
> max(runif(100000))
[1] 0.9999601

As already mentioned, the sequences generated by runif() are the result of a


pseudorandom generator, which therefore rely on a seed. R uses the predefined seed
by default.
> .Random.seed[1:5]
[1] 403 83 -1212313168 -168900013 -1327450767

.Random.seed is an integer vector, containing the seed for random number gen-
eration. Due to the fact that all implemented generators use this seed, it is strongly
recommended not to alter this vector!
One can define a specific starting value with the function set.seed(). This is
a great way to ensure that simulation results are reproducible by using the same seed
value, as shown in the following.
> set.seed(2) # fix the seed
> x1 = runif(5)
[1] 0.1848823 0.7023740 0.5733263 0.1680519 0.9438393
> x2 = runif(5)
[1] 0.9434750 0.1291590 0.8334488 0.4680185 0.5499837
> set.seed(2) # use the same seed value as for x1
> x3 = runif(5)
[1] 0.1848823 0.7023740 0.5733263 0.1680519 0.9438393
> x1 == x2 # comparison of the generated sequences
[1] FALSE FALSE FALSE FALSE FALSE
> x1 == x3
[1] TRUE TRUE TRUE TRUE TRUE

set.seed() uses its single integer argument to automatically set as many seeds as
required for the pseudorandom number generator. This is considered a simple way of
getting quite different seeds by specifying small integer arguments, and also a way
of getting valid seed sets for the more complicated methods.

9.1.3 Uniformly Distributed True Random Numbers

In contrast to pseudorandom generators, the R package random provides users with


a source of true randomness that comes from www.random.org. Since 1998, the
site has been offering true random numbers generated from an atmospheric noise
sample via a radio tuned to an unused broadcast frequency combined with a skew
correction originally due to John von Neumann. This method might be better suited
for some purposes than pseudorandom number generators, but its speed of obtaining
random numbers is generally relatively low. Using the package or website and its
database is therefore a little more time consuming.

> require(random)
> x = randomNumbers(n = 1000, min = 1, max = 100, col = 1) / 100
> head (as.vector(x))
[,1] [,2] [,3] [,4] [,5] [,6]
V1 0.75 0.08 0.02 0.94 0.43 0.78

Obviously, the specification of set.seed() is of no use in this context, as the


sequences of random numbers are always truly random and not reproducible.
Due to the slow generation of these numbers, one reasonable method is to use
these random numbers to generate a seed for further algorithms.

9.2 Generating Random Variables

In contrast to random number generation, rv generation always refers to the generation
of variables whose probability distribution is different from that of the uniform
distribution on (0, 1). The basic problem is therefore to generate an rv X whose
distribution function is assumed to be known.
Rv generators invariably use a random number generator as their starting point, which yields
a uniformly distributed variable on (0, 1) (see Sect. 9.1.1). The technique is to manipulate
or transform one or more such uniform rvs in an elegant and efficient way to
obtain a variable with the desired distribution.
As with all numerical techniques, there is more than one method available to
generate variables for the desired distribution. Four factors should be considered
when selecting an appropriate generator:
As with all numerical techniques, there is more than one method available to
generate variables for the desired distribution. Four factors should be considered
when selecting an appropriate generator:

1. Exactness refers to the distribution of the variables produced by the generator.


A generator is said to be exact if the distribution of the variables generated has
the exact form of the desired distribution. In some situations where the accurate
distribution is not critical, methods producing an approximate distribution may
be acceptable.
2. Speed refers to the computing time required to generate a variable. There are
two contributions to the overall time: the setup time to create the constants or
calculating tables, and the variable generation time. The importance of these two
contributions depends on the application. If a sequence of rvs, all obeying the same
distribution, is needed, then the time needed to set up the tables and constants
only counts once, because the same values can be used for every variable of the
sequence. If each variable has a different distribution, the setup time is just as
important as the variable generation time.
3. Space refers to the computer memory requirements of the generator. Some algo-
rithms make use of extensive tables which can become significantly costly if
different tables need to be held in memory simultaneously.
4. Simplicity refers to the elementariness of the algorithm.

9.2.1 General Principles for Random Variable Generation

In this chapter, the three main principles for rv generation, the inverse transform
method, the acceptance–rejection method, and the composition method, will be discussed.
For simplicity, it is assumed that a pseudorandom number generator producing
a sequence of independent U(0, 1) variables, as discussed in Sect. 9.1.1, is available.
The inverse transform method
From the property of the quantile function given in Definition 4.11, which states
that for $U \sim U(0, 1)$, the rv $X = F^{-1}(U)$ has cdf F, i.e. $X \sim F$, one can create rvs
very efficiently whenever $F^{-1}$ can be calculated. This method is called the inverse
transform method.
Recall that the inverse transform method has been shown to work even in the case
of discontinuities in F(x). As a result, the generated X will satisfy $P(X \le x) = F(x)$,
so that X has the required distribution.
The acceptance–rejection method
Suppose the inverse of F is unknown or numerically hard to calculate and one wants
to sample from a distribution with pdf f (x). Under the following two assumptions,
the acceptance–rejection method can be used:
1. There is another function g(x) that dominates f (x) in the sense that g(x) ≥ f (x) ∀x.
2. It is possible to generate uniform values between 0 and g(x). These values will
be either above or below f (x).

Definition 9.6 (The Acceptance–Rejection Method)

1. Generate $U_1 \sim U\{\text{supp}(f)\}$, where f(x) is the pdf of F and supp(f) is the support
of f.
2. Generate $U_2 \sim U[0, g(U_1)]$.
3. If $U_2 < f(U_1)$, return $U_1$ (the x-coordinate) as the generated X value, otherwise
repeat the procedure.

It is intuitively clear that X has the desired distribution because the density of X is
proportional to the height of f (Fig. 9.2).
The dominating function g(x) should be chosen in an efficient way, so that the
area between f (x) and g(x) is small, to keep the proportion of rejected points small.
Additionally, it should be easy to generate uniformly distributed points under g(x).
Fig. 9.2 The Acceptance–Rejection method: the area between the densities f(x) and g(x) is the rejection region, the area under f(x) the acceptance region. BCS_ARM

The average number of points (X, Y ) needed to produce one accepted X is called the
trials ratio, which is always greater than or equal to unity. The closer the trials ratio
is to unity, the more efficient is the generator. To present a handy way of constructing
a suitable g(x), consider a density h(x) for which an easy way of
generating variables already exists, and define g(x) = K · h(x). It can be shown that
if X is a variable with density h(x) and U is uniformly distributed on (0, 1) and independent
of X, then the points (X, Y) = {X, K · U · h(X)} are uniformly distributed under the
graph of g(x). In this case, K has to be chosen in such a way that g(x) ≥ f(x) is
assured. Therefore the trials ratio is exactly K.
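As a hedged illustration of this construction, the sketch below samples from the Beta(2, 2) density f(x) = 6x(1 − x) on (0, 1), using the uniform density h(x) = 1 and K = 1.5 = max f(x), so that the trials ratio is 1.5; the function name and target density are chosen only for illustration.
> f = function(x) 6 * x * (1 - x)         # Beta(2, 2) density
> K = 1.5                                 # K = max f(x), with h(x) = 1 on (0, 1)
> ar.beta = function(n){
+   x = numeric(n)
+   i = 1
+   while(i <= n){
+     u1 = runif(1)                       # candidate from supp(f)
+     u2 = runif(1, 0, K)                 # uniform point below g(x) = K
+     if(u2 < f(u1)){                     # accept if the point lies below f
+       x[i] = u1
+       i = i + 1
+     }
+   }
+   x
+ }
> ar.beta(5)                              # five Beta(2, 2) variables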
The composition method
Suppose a given density f can be written as a weighted sum of n densities

$$ f(x) = \sum_{i=1}^{n} p_i \cdot f_i(x), $$

where the weights $p_i$ satisfy the two conditions $p_i > 0$ and $\sum_{i=1}^{n} p_i = 1$. In such a
framework, the density f is said to be a compound density. This method can be used
to split the range of X into different intervals, so that sampling from each interval
facilitates the overall process.
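A hedged sketch of the composition idea for a two-component normal mixture with weights p1 = 0.3 and p2 = 0.7 is given below; the component means 0 and 4 and the function name are illustrative assumptions only.
> rmixture = function(n, p = c(0.3, 0.7), mu = c(0, 4)){
+   comp = sample(1:2, size = n, replace = TRUE, prob = p)  # pick component i with prob p_i
+   rnorm(n, mean = mu[comp], sd = 1)     # then sample from the chosen component density
+ }
> x = rmixture(1000)
> hist(x, breaks = 40)                    # bimodal histogram of the compound density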

9.2.2 Random Variables

For several distributions, R provides predefined functions for generating rvs. Most
of these functions will be discussed later in this chapter, to give a general overview
of this field. The syntax in this area follows a straight structure. All commands are
compositions of d, p, q, r (which stand for the density, distribution function, quantile
function, and rvs), plus the name of the desired distribution, as discussed in Chaps. 4
and 6. Thus
> rbinom()

will give an rv from the binomial distribution, and


> rexp()

will give an rv from the exponential distribution with parameters set by default.

9.2.3 Random Variable Generation for Continuous


Distributions

This section discusses the generation of rvs for several continuous distributions.
Starting with three famous algorithms for the normal distribution, several ways of
generating rvs for the exponential, gamma, and beta distribution will be presented.
The normal distribution
As already mentioned briefly in Sect. 4.3 rnorm() produces n rvs for the normal
distribution with mean equal to 0 and standard deviation equal to 1 by default.
> rnorm(n, mean = 0, sd = 1)

> # generate 4 observations from a N(0, 1)


> rnorm(4)
[1] -0.3936441 -0.1939292 0.1383921 0.4417582
> # change the default algorithm
> RNGkind(normal.kind = "Box-Muller")
> # generate 4 observations from a N(0, 1)
> rnorm(4)
[1] -0.1367969 1.3994082 -0.3936441 -0.1939292

A famous method developed by Box and Muller (1958) was the earliest method for
generating normal rvs and, thanks to its simplicity, it was used for a long time. The
algorithm is provided in the following definition.

Definition 9.7 (The Box–Muller Method) Let $U_1$ and $U_2$ be independent rvs obeying
the uniform distribution U(0, 1). Consider the rvs

$$ X_1 = \sqrt{-2\log(U_1)}\,\cos(2\pi U_2), \qquad X_2 = \sqrt{-2\log(U_1)}\,\sin(2\pi U_2). $$

Then $X_1$ and $X_2$ are independent and standard normally distributed, i.e. $(X_1, X_2)^\top \sim N(0, I_2)$. Considering $\Sigma^{1/2}(X_1, X_2)^\top$, we have dependent rvs, see Fig. 6.4.

Unfortunately, this algorithm is rather slow due to the fact that for each number a
square root, log, and a trigonometric function have to be computed.
Neave (1973) has shown that the Box–Muller method shows a large discrepancy
between observed and expected frequencies in the tails of the normal distribution
when U1 and U2 are generated with a congruential generator. This effect became
known as the Neave effect and is a result of the dependence of the pairs generated by
a congruential generator, such as RANDU. This problem can be avoided by using
two different sources for U1 and U2 , as shown in the following R code.
> boxmuller = function(n){
+ if(n %% 2 == 0){a = n / 2}else{a = n / 2 + 1}
+ x1 = x2 = 1:a
+ for (i in 1:a) {
+ u1 = runif(1) # generate two
+ u2 = runif(1) # uniform rvs
+ x1[i] = sqrt(-2 * log(u1)) * cos(2 * pi * u2) # transformation
+ x2[i] = sqrt(-2 * log(u1)) * sin(2 * pi * u2)
+ }
+ c(x1, x2) # print results
+ }
> boxmuller(4)
[1] 2.527755 -1.548469 -0.794818 -1.777311

Marsaglia (1964) mentioned a slightly different version of the Box–Muller


method, in which the trigonometric function was replaced to reduce the compu-
tation time. It is known as the polar method and can be seen as an accelerated version
of the algorithm in Definition 9.7.

Definition 9.8 (The Polar Method) Generate two independent observations $u_1, u_2$
from a uniform distribution on (−1, 1) and set $w = u_1^2 + u_2^2$. If $w > 1$, repeat these
steps, otherwise set $z = \sqrt{(-2\log w)/w}$ and define $x_1 = u_1 z$ and $x_2 = u_2 z$. Then
$(x_1, x_2)$ should be an observation from $N(0, I_2)$.

> polarmethod = function(n){


+ if(n %% 2 == 0){a = n / 2}else{a = n / 2 + 1}
+ x1 = x2 = 1:a # create output variables X and Y
+ i = 1 # set counter
+ while(i <= a) {
+ u1 = runif(1, -1, 1) # generate two uniform random numbers
+ u2 = runif(1, -1, 1)
+ w = u1^2 + u2^2
+ if (w <= 1) {
+ z = sqrt((-2 * log(w)) / w)
+ x1[i] = u1 * z
+ x2[i] = u2 * z
+ i = i + 1 # precede counter
+ }
+ }
+ c(x1, x2) # print results
+ }
> polarmethod(8)
[1] 0.41867423 0.90550395 -0.07986714 1.17828848
[5] 0.65455600 -0.71171498 -0.05401868 0.90767865

The first part produces a point $(u_1, u_2)$ which is an observation from an rv uniformly
distributed on $[-1, 1]^2$. If w is smaller than 1, this point is located inside the unit
circle. Then $u_1/\sqrt{w}$ is equivalent to the sine and $u_2/\sqrt{w}$ to the cosine of a random
direction (angle). Moreover, the angle is independent of w, which is an observation
of an rv that follows a uniform distribution. This method is a good example of the
acceptance–rejection method for the normal distribution.

Definition 9.9 (Ratio of Uniforms) Generate $u_1$ from U(0, b), $u_2$ from U(c, d) and set
$x = u_2/u_1$, with $b = \sup_x \{h(x)\}^{1/2}$, $c = -\sup_x x\{h(x)\}^{1/2}$ and $d = \sup_x x\{h(x)\}^{1/2}$,
where h(·) is some density function.
If $u_1^2 \le h(u_2/u_1)$, deliver x; otherwise, repeat the algorithm.

For the normal distribution with the non-normalised density $h(x) = \exp(-x^2/2)$, the
algorithm can be stated as follows. Generate $u_1$ from U(0, 1) and $u_2$ from
$U(-\sqrt{2/e}, \sqrt{2/e})$, where e is the base of the natural logarithm. Let $x = u_2/u_1$ and $z = x^2$. If
• $z \le 5 - \{4\exp(1/4)\}\,u_1$, deliver x (Quick accept);
• $z > \{4\exp(-1/4)\}/u_1 - 3$, repeat the algorithm (Quick reject);
• $z \le -4\log u_1$, deliver x;
• otherwise, repeat the algorithm.

Given these conditions, we can define an acceptance region

$$ C_h = \{(u_1, u_2) : 0 \le u_1 \le h^{1/2}(u_2/u_1)\} = \{(u_1, u_2) : u_1^2 \le \exp\{-u_2^2/(2u_1^2)\}\} = \{(u_1, u_2) : (u_2/u_1)^2 \le -4\log u_1\}. $$

The inequality can then be stated in terms of the variable x. To avoid repeated computation
of $\log u_1$, the inner and outer bounds defined by the following inequalities
on $\log u_1$ are calculated:

$$ (4 + 4\log c) - 4cu_1 \le -4\log u_1 \le 4/(cu_1) - (4 - 4\log c). $$
256 9 Random Numbers in R

These two inequalities arise from the fact that the tangent line, taken at the point d,
lies above the concave log function:

$$ \log y \le y/d + (\log d - 1). $$

Taking $y = u_1$ and $d = 1/c$ leads to the lower bound; using $y = 1/u_1$ and $d = c$ yields
the upper bound. Note that the area within the inner bound is largest when $c = \exp(1/4)$,
and that the constant $4 \cdot \exp(1/4) = 5.1361$ is computed and stored in advance
to avoid recomputing it for every candidate point.
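A hedged R sketch of the basic ratio-of-uniforms step for the standard normal is given below; it uses only the exact acceptance region C_h and omits the quick accept/reject bounds, which merely speed up the test. The function name is arbitrary.
> rou.norm = function(n){
+   x = numeric(n)
+   i = 1
+   while(i <= n){
+     u1 = runif(1)                                       # U(0, 1)
+     u2 = runif(1, -sqrt(2 / exp(1)), sqrt(2 / exp(1)))  # U(-sqrt(2/e), sqrt(2/e))
+     if((u2 / u1)^2 <= -4 * log(u1)){                    # acceptance region C_h
+       x[i] = u2 / u1                                    # deliver x = u2 / u1
+       i = i + 1
+     }
+   }
+   x
+ }
> rou.norm(4)                             # four standard normal variables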
The exponential distribution
The first algorithm provided in this section is interesting, because it uses only arith-
metic operations.
Definition 9.10 (Neumann's Algorithm) Generate a number of random observations
$u_1, u_2, \ldots$ from a uniform distribution as long as their values consecutively decrease,
i.e. until $u_{n+1} > u_n$. If n is even, return $x = u_n$, otherwise repeat the procedure.

> neumannmeth = function(n){


+ x = 1:n # erase used variables
+ dummy = 0
+ i = 1
+ while (i <= n){
+ us = runif(1) # generate two uniform random numbers
+ ug = runif(1)
+ l = 2 # set the even-counter
+ while (us < ug) {
+ dummy = us # save smaller random number
+ us = ug # overwrite smaller number
+ ug = runif(1) # generate new uniform
+ l = l + 1 # precede counter
+ }
+ # if l is even, save dummy in x and precede counter
+ if ((l %% 2 == 0) && (dummy != 0)){
+ x[i] = dummy
+ i = i + 1
+ dummy = 0
+ }
+ }
+ x
+ }
> neumannmeth(10)
[1] 0.59064976 0.62533392 0.61146766 0.92082615 0.66926057
[6] 0.06609662 0.09082758 0.76548550 0.55991772 0.60223848

Despite its simplicity, this method should not be used for generating rvs, because
too many uniformly distributed random numbers are needed to generate one exponentially
distributed rv. Neumann's rather inefficient algorithm can be improved
by applying the result of the theorem of Pyke (1965). It states that
$n\{U_{(k+1)} - U_{(k)}\} = nS_k \sim E(k)$, where $U_{(1)} \le U_{(2)} \le \ldots \le U_{(n)}$ is an ordered series of standard uniformly
distributed rvs with $S_k = U_{(k+1)} - U_{(k)}$ and $S_n = U_{(n)}$.
One important advantage in computing rvs from the exponential distribution is the
fact that the exponential distribution has a closed-form expression for the inverse of its
cumulative distribution function, $F^{-1}(u) = -\log(1 - u)/\lambda$. Given a random number generator, some numbers
X must be selected, which need to obey an exponential distribution. The following
definition states the selection procedure.
Definition 9.11 (Inverse cdf Method for the Exponential Distribution) First, generate
a variable u from U(0, 1), then calculate $x = -\log(1 - u)/\lambda$.

> invexp = function(n, lambda){
+   sapply(1:n, function(x) {-log(1 - runif(1)) / lambda})  # inverse cdf transform
+ }
> invexp(9, 2)

rexp() uses the algorithm by Ahrens and Dieter (1972), which is faster than the
inverse method presented above.
> ptm = proc.time()
> invexp(10000, 1)
> proc.time() - ptm
User System elapsed
0.39 0.00 0.66

> ptm = proc.time()


> rexp(10000)
> proc.time() - ptm
User System elapsed
0.04 0.00 0.04

The gamma distribution


The command
> rgamma(n, shape, rate = 1, scale = 1 / rate)

generates n rvs with shape parameter b, default rate of 1, and scale parameter 1/rate.
For b ≥ 1, a specific algorithm by Ahrens and Dieter (1982b) is used, but for 0 <
b < 1, the following, different, algorithm by Ahrens and Dieter (1974) is used.

Definition 9.12 (The Acceptance–rejection Method of Ahrens and Dieter 1982b)

1. Generate $u_1$ from U(0, 1) and set $w = u_1 \cdot (e + b)/e$, where e is the base of the
natural logarithm.
2. If $w < 1$, go to 3, else to 4.
3. Generate $u_2$ from U(0, 1) and set $y = w^{1/b}$. If $u_2 \le \exp(-y)$, return $x = ay$, else
go to 1.
4. Generate $u_2$ from U(0, 1) and set $y = -\log[\{(e + b)/e - w\}/b]$. If $u_2 \le y^{b-1}$, return
$x = ay$, else go to 1.

Definition 9.13 (The Acceptance–rejection Method of Cheng 1977)

1. Generate $u_1$ and $u_2$ from U(0, 1).
2. Set $v = (2b - 1)^{-1/2}\log\{u_1/(1 - u_1)\}$, $y = b\exp(v)$, $z = u_1^2 u_2$,
and $w = b - \log(4) + \{b + (2b - 1)^{1/2}\}v - y$.
3. If $w + 1 + \log(4.5) - 4.5z \ge 0$ or $w \ge \log(z)$ holds, then return $x = ay$, else go
to 1.

Note that the trials ratio improves from $4/e \approx 1.47$ to $(4/\pi)^{1/2} \approx 1.13$ as $b \to \infty$.
The setup time is rather short, as only four constants have to be computed in advance.

Definition 9.14 (The Acceptance–rejection Method of Fishman 1976)


1. Generate u1 and u2 from U(0, 1).
2. Set v1 = − log(u1 ) and v2 = − log(u2 ).
3. If v2 > (b − 1){v1 − log(v1 ) − 1}, return x = av1 , else go to 1.

This algorithm was introduced by Atkinson and Pearce (1976) and is simple and
short. It is also efficient for b < 5, as the trials ratio increases only from unity at b = 1
to 2.38 at b = 5. Note that for greater values of b, Algorithm 9.13 by Cheng is
more efficient.

The beta distribution


The function
> rbeta(n, shape1, shape2, ncp = 0)

generates n rvs of the beta distribution with the two shape parameters p and q and a
default non-centrality parameter of 0. rbeta() is based on the following algorithm
by Cheng (1978).

Definition 9.15 (The Acceptance-rejection Method of Cheng 1978)

1. Generate $u_1$ and $u_2$ from U(0, 1).
2. Set $v = \{(p + q - 2)/(2pq - p - q)\}^{1/2}\,\log\{u_1/(1 - u_1)\}$ and $w = p\exp(v)$.
3. If $(p + q)\log\{(p + q)/(q + w)\} + [\,p + \{(p + q - 2)/(2pq - p - q)\}^{-1/2}\,]\,v - \log(4)
< \log(u_1^2 u_2)$, go to 1.
4. Else return $x = w/(q + w)$.

This method has a bounded trials ratio of less than 4/e ≈ 1.47.

9.2.4 Random Variable Generation for Discrete Distributions

The general methods of Sect. 9.2.3 are in principle available for constructing discrete
variable generators. However, the special characteristics of discrete variables imply
certain modifications.
The binomial distribution
In R, rvs from the binomial distribution, see 3.6, can be generated via
> rbinom(n, size, prob)

where n is the number of observations, size the number of trials, and prob the prob-
ability of success.

Definition 9.16
1. Set x = 0.
2. Generate u from U(0, 1).
3. If u ≤ p, set y = 1, else y = 0.
4. Set x = x + y.
5. Repeat n times from step 2, then return x.

This algorithm uses the fact that x is an observation from a binomially distributed rv
with parameters n and p, i.e. the sum of n independent Bernoulli variables with parameter p.
Note that the generation time increases linearly with n.
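A direct, hedged implementation of this Bernoulli-sum idea could look as follows (simple, but slow for large n); the function name is arbitrary and not part of base R.
> binsum = function(num, n, p){           # num draws of a B(n, p) variable
+   sapply(1:num, function(i) sum(runif(n) <= p))  # sum of n Bernoulli(p) indicators
+ }
> binsum(5, 10, 0.5)                      # five binomial variables with n = 10, p = 0.5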
As pointed out earlier, the fastest binomial generators for fixed parameters n and
p are obtained via table methods. On the downside for these methods, the memory
requirements and the setup time for new values of n and p are proportional to n, which
is a major drawback. More useful is a simple inversion without a table resulting in
a short algorithm and a shorter setup time. The execution time is proportional to
n · min(p, 1 − p). Therefore, rejection algorithms were proposed because they are
on the whole both fast and well suited for changing the values of n and p, as typically
required in simulation.
The implemented algorithm for rbinom() is based on a version by
Kachitvichyanukul and Schmeiser (1988). The algorithm generates binomial variables
via an acceptance/rejection based on the function

$$ f(x) = \frac{\lfloor np + p\rfloor!\,(n - \lfloor np + p\rfloor)!}{\lfloor x + 0.5\rfloor!\,(n - \lfloor x + 0.5\rfloor)!}\left(\frac{p}{1 - p}\right)^{\lfloor x + 0.5\rfloor - \lfloor np + p\rfloor} \quad \text{for } -0.5 \le x \le n + 0.5. $$

The resulting algorithm dominates other algorithms with constant memory require-
ments when n · min(p, 1 − p) ≥ 10 in terms of execution times. Only for n ·
min(p, 1 − p) ≤ 10 is the inverse transformation algorithm faster. An implemen-
tation of the inverse transformation is presented below.

> bininv = function(num, n, p){


+ x = 1:n
+ for(i in 1:n){
+ q = 1 - p # setup constants
+ s = p / q
+ a = (n + 1) * s
+ r = q ^ n
+ y = 0
+ u = runif(1) # generate uniform variable
+ while(u > r){ # check condition
+ u = u - r
+ y = y + 1
+ r = ((a / y) - s) * r
+ }
+ x[i] = y
+ }
+ x
+ }
> bininv(5, 10, 0.5)
[1] 3 5 7 6 5

The Poisson distribution


Rvs from the Poisson distribution can be generated via
> rpois(n, lambda)

where n is the number of observations and lambda the vector of (non-negative)


means. The implemented algorithm was first mentioned by Ahrens and Dieter
(1982a). The following algorithm was developed by Knuth (1969) and is a sim-
ple way of generating random Poisson distributed variables by counting the number
of events that occur in a time period t.
Definition 9.17 (Knuth’s Algorithm)

1. Set $p = 1$, $k = 0$ and let u be an observation from the uniform distribution.

2. If $\exp(-\lambda) < p$, set $k = k + 1$ and $p = p \cdot u$ with a new uniform observation u, and repeat this step; else return $k - 1$.
2. If exp(−λ) < p, set k = k + 1 and p = p · u, else print out k − 1.

> rpoisson = function(lambda = 1){


+ L = exp(-lambda)
+ k = 0
+ p = 1
+ while(p > L){
+ k = k + 1
+ p = p * runif(1, 0, 1)
+ }
+ k - 1
+ }

The advantages of this algorithm are that only one constant L = exp(−λ) has to be
evaluated and that it requires only a minimum amount of storage space. However, the
time to generate an rv increases rapidly with λ. The following method can be used for
large λ, such as λ > 25, and is based on the fact that $\lambda^{-1/2}(X - \lambda) \stackrel{\mathcal{L}}{\longrightarrow} N(0, 1)$.
Bear in mind that this is an asymptotic result.
> poisson.as = function(n, lambda = 1){
+ a = lambda^0.5
+ sapply(1:n,
+ function(x){max(0, trunc(0.5 + lambda + a * rnorm(1)))})
+ }
> poisson.as(10, 30)
[1] 31 27 31 35 33 34 34 26 27 31

A comparison of the computation times for both algorithms with λ = 45 and n =


10000 is shown below.
> pois = numeric(10000)                   # preallocate the result vector
> system.time(for (i in 1:10000) {pois[i] = rpoisson(45)})
   user  system elapsed
  95.39   13.07 3931.30

> system.time(poisson.as(10000, 45))
   user  system elapsed
  91.74   13.07 3927.62

9.2.5 Random Variable Generation for Multivariate


Distributions

Variable generation is generally much more complicated for multivariate distribu-


tions than for univariate ones, excepting those multivariate distributions with inde-
pendent components. The added complications arise from the dependencies between
the components of the random vector, which must be dealt with in multivariate
distributions. One general approach to creating such a dependency structure is the
conditional sampling method.
Conditional sampling
The beauty of the conditional sampling approach is that it reduces the problem of
generating a p-dimensional random vector into a series of p univariate generation
tasks.

Definition 9.18 (Conditional Sampling) Let $X = (X_1, X_2, \ldots, X_d)^\top$ be a random
vector with joint distribution function $F(x_1, x_2, \ldots, x_d)$. Suppose the conditional
distribution of $X_j$, given that $X_i = x_i$ for $i = 1, 2, \ldots, j - 1$, is known for each j.
Then the vector X can be built up one component at a time, where each component
is obtained by sampling from a univariate distribution, recursively calculating
$X_1, X_2, \ldots, X_d$.

For this method, it is necessary to know all the conditional densities. Therefore, its
usefulness depends heavily on the availability of the conditional distributions and,
of course, on the difficulty of sampling from them.
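A hedged sketch of conditional sampling for a bivariate normal with correlation ρ is given below, using the well-known conditional distributions X1 ~ N(0, 1) and X2 | X1 = x1 ~ N(ρx1, 1 − ρ²); the value ρ = 0.7 and the function name are illustrative assumptions.
> rbivnorm = function(n, rho = 0.7){
+   x1 = rnorm(n)                         # sample the marginal of X1
+   x2 = rnorm(n, mean = rho * x1, sd = sqrt(1 - rho^2))  # sample X2 given X1
+   cbind(x1, x2)
+ }
> x = rbivnorm(1000)
> cor(x)                                  # off-diagonal entries close to rho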
The transformation method
If the conditional distributions of X are difficult to derive, then perhaps a more con-
venient transformation can be found. The key element for this method is to represent
X as a function of other, usually independent, univariate rvs. An example of this
method is the Box–Muller method (see Definition 9.7), which uses two independent
uniform variables and converts them into two independent normal variables.
Even though the transformation method has wide applicability, it is not always
trivial to find a transformation with which to generate a multivariate distribution of
a given X. The following guidelines by Johnson (1987) have proven helpful.
1. Beginning with the functional form fX (x), one could apply invertible transforma-
tions to the components of X in order to find a recognizable distribution.
2. Consider transformations of X that simplify arguments of transcendental functions
in the density fX (x).
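As a sketch of the transformation idea (the Dirichlet example is an illustration and not from the text), a d-dimensional Dirichlet vector can be represented as a function of independent gamma rvs:

> # Transformation sketch: a Dirichlet(alpha) vector from independent gammas,
> # X_i = G_i / (G_1 + ... + G_d)
> rdirichlet.tr = function(n, alpha = c(2, 3, 5)){
+   d = length(alpha)
+   G = matrix(rgamma(n * d, shape = rep(alpha, each = n)), n, d)
+   G / rowSums(G)
+ }
> colMeans(rdirichlet.tr(10000))      # close to alpha / sum(alpha)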

The rejection method for multivariate distributions


Obviously, the rejection method presented in Sect. 9.2.1 is not tied to a particular
dimension. But even though the theory carries over straightforwardly from the uni-
variate to the multivariate case, there exist some significant practical difficulties. As
stated by Johnson (1987), the main problem is to find a dominating function gX (x)
for fX (x) if the dependence among the components of X is strong. Intuitively, one
could choose a density for gX (x) which corresponds to the independent components
with the same marginal distributions as X. But in most cases, as the dependencies in
X increase, the extent to which gX (x) approximates fX (x) decreases. Therefore the
trials ratio approaches infinity. In general, the design of an efficient rejection method
is more difficult than in the univariate case.
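A minimal sketch of the multivariate rejection idea, for the toy density f(x, y) = x + y on the unit square with the independent uniform density as dominating function (so the constant is M = 2); the density and sample size are illustration values only.

> # Rejection sketch: f(x, y) = x + y on the unit square, dominated by the
> # independent uniform density g(x, y) = 1 with constant M = 2
> rxy = function(n){
+   out = matrix(NA, n, 2)
+   i   = 0
+   while (i < n){
+     prop = runif(2)                    # proposal from g
+     if (runif(1) <= sum(prop) / 2){    # accept with probability f / (M g)
+       i        = i + 1
+       out[i, ] = prop
+     }
+   }
+   out
+ }
> colMeans(rxy(10000))                   # both means close to 7 / 12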
If the standard rejection method for multivariate distributions is inapplicable due
to the above computational difficulties, a random vector from high-dimensional dis-
tributions can be generated by Markov Chain Monte Carlo (MCMC) techniques.
The MCMC algorithms, like the Metropolis–Hastings algorithm or the Gibbs sam-
pler, have therefore become standard tools for Bayesian econometricians. Note, as a
critical remark, that the generated sample is a Markov chain and that even random
vectors of small samples are not necessarily iid. For a detailed review, we refer to
Albert (2009) and to Martin et al. (2011) on using the MCMCpack package.
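To illustrate the MCMC idea itself (independently of the MCMCpack interface), a random-walk Metropolis sampler for a bivariate normal target with correlation 0.8 might be sketched as follows; the proposal standard deviation and burn-in length are illustration values.

> # Random-walk Metropolis sketch for a bivariate normal target (rho = 0.8)
> logtarget = function(x, rho = 0.8){    # unnormalised log-density
+   -(x[1]^2 - 2 * rho * x[1] * x[2] + x[2]^2) / (2 * (1 - rho^2))
+ }
> metropolis = function(n, sd.prop = 0.5){
+   x = matrix(0, n, 2)
+   for (i in 2:n){
+     cand = x[i - 1, ] + rnorm(2, 0, sd.prop)        # proposal step
+     if (log(runif(1)) < logtarget(cand) - logtarget(x[i - 1, ]))
+       x[i, ] = cand                                 # accept
+     else
+       x[i, ] = x[i - 1, ]                           # reject, keep old state
+   }
+   x
+ }
> chain = metropolis(20000)
> cor(chain[-(1:2000), ])[1, 2]          # roughly 0.8 after burn-in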

The composition method for multivariate distributions


Like the rejection method, the composition method is not tied to a specific space like
R1 . A method for obtaining dependence from independence is the following.
Definition 9.19 Define a random vector X = (X1 , . . . , Xd ) as (SY1 , . . . , SYd ),
where the Yi are iid rvs and S is a random scale. In such a framework, the distribution
of X is a scale mixture. The resulting density fX (x) of X is
f_X(x) = \mathrm{E}\left\{ \prod_{i=1}^{d} \frac{1}{S}\, f_Y\!\left( \frac{x_i}{S} \right) \right\}.
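A minimal sketch of such a scale mixture: taking Y_i iid N(0, 1) and the random scale S = sqrt(ν/W) with W ∼ χ²_ν yields a spherical d-dimensional t distribution with ν degrees of freedom; the values of d and ν below are illustrative.

> # Scale-mixture sketch: iid N(0, 1) components Y and the random scale
> # S = sqrt(nu / W), W ~ chi-squared(nu), give a spherical d-dimensional t
> rmvt.mix = function(n, d = 2, nu = 5){
+   S = sqrt(nu / rchisq(n, df = nu))    # one random scale per draw
+   Y = matrix(rnorm(n * d), n, d)       # iid N(0, 1) components
+   S * Y                                # row i is the i-th t vector
+ }
> x = rmvt.mix(10000, d = 2, nu = 5)
> apply(x, 2, var)                       # close to nu / (nu - 2) = 5 / 3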

Simulating from copula-based distributions


There are numerous methods of simulating from copula-based distributions, see
Frees and Valdez (1998), Whelan (2004), Marshall and Olkin (1988), McNeil (2008),
Scherer and Mai (2012). The conditional inverse method is a general approach aimed
at simulating rvs from an arbitrary multivariate distribution. Here we sketch this
method with an example of simulating from copulae. We use conditional sampling
to generate rvs U = (u1, . . . , ud) recursively from a sample v1, . . . , vd of
uniformly distributed rvs V1, . . . , Vd ∼ U(0, 1) and the conditional distributions.
We set u1 = v1 . The rest of the variables are calculated using the recursion ui =
Ci−1 (vi |u1 , . . . , ui−1 ) for i = 2, . . . , d, where Ci = C(u1 , . . . , ui , 1, . . . , 1) and the
conditional distribution of Ui is given by

C_i(u_i \mid u_1, \ldots, u_{i-1}) = \mathrm{P}(U_i \le u_i \mid U_1 = u_1, \ldots, U_{i-1} = u_{i-1})
= \frac{\partial^{i-1} C_i(u_1, \ldots, u_i)}{\partial u_1 \cdots \partial u_{i-1}} \Big/ \frac{\partial^{i-1} C_{i-1}(u_1, \ldots, u_{i-1})}{\partial u_1 \cdots \partial u_{i-1}}.

The method is numerically expensive, since it depends on higher order derivatives


of C and the inverse of the conditional distribution function.
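For the bivariate Clayton copula the conditional distribution can be inverted analytically, so the conditional inverse method reduces to a few lines; the parameter θ = 2 below is an illustration value.

> # Conditional inverse sketch for a bivariate Clayton copula, theta = 2
> rclayton2d = function(n, theta = 2){
+   u1 = runif(n)                                  # u1 = v1
+   v2 = runif(n)
+   # solve C(u2 | u1) = v2 analytically for the Clayton family
+   u2 = (u1^(-theta) * (v2^(-theta / (1 + theta)) - 1) + 1)^(-1 / theta)
+   cbind(u1, u2)
+ }
> u = rclayton2d(10000)
> cor(u, method = "kendall")[1, 2]       # close to theta / (theta + 2) = 0.5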
Simulating from archimedean copulae
The idea of the Marshall–Olkin method is based on the fact that the Archimedean
copulae are derived from Laplace transforms. Let M be the univariate cdf of a positive
rv (so that M(0) = 0) and let φ be the Laplace transform of M, i.e.,
\phi(s) = \int_0^{\infty} \exp\{-sw\}\, dM(w), \quad s \ge 0.

For any univariate distribution function F, a unique distribution G exists, given


by
F(x) = \int_0^{\infty} G^{\alpha}(x)\, dM(\alpha) = \phi\{-\log G(x)\}.

Considering d different univariate distributions F1 , . . . , Fd , we obtain


C(u_1, \ldots, u_d) = \int_0^{\infty} \prod_{i=1}^{d} G_i^{\alpha}(u_i)\, dM(\alpha) = \phi\left\{ \sum_{i=1}^{d} \phi^{-1}\{F_i(u_i)\} \right\},

which is a multivariate distribution function.


One proceeds with the following three steps to make a draw from a distribution
described by an Archimedean copula:
1. Generate an observation u from M;
2. Generate iid observations (v1, . . . , vd) from the uniform U(0, 1) distribution;
3. The generated vector is computed by x_j = G_j^{−1}(v_j^{1/u}).

This method works faster than the conditional inverse technique. The drawback is
that the distribution M can be determined explicitly only for a few generator functions
φ, for example the Frank, Gumbel and Clayton families.
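A hand-coded sketch of these three steps for the Clayton copula, whose generator φ(s) = (1 + s)^{−1/θ} is the Laplace transform of a Gamma(1/θ, 1) distribution; for uniform marginals, step 3 simplifies to u_j = φ(−log(v_j)/w), and θ = 2 is again an illustration value.

> # Marshall-Olkin sketch for the Clayton copula: phi(s) = (1 + s)^(-1/theta)
> # is the Laplace transform of the Gamma(1/theta, 1) mixing distribution M
> rclaytonMO = function(n, d = 2, theta = 2){
+   w = rgamma(n, shape = 1 / theta, rate = 1)     # step 1: draw from M
+   v = matrix(runif(n * d), n, d)                 # step 2: iid uniforms
+   (1 - log(v) / w)^(-1 / theta)                  # step 3: u_j = phi(-log(v_j)/w)
+ }
> u = rclaytonMO(10000, d = 2, theta = 2)
> cor(u, method = "kendall")[1, 2]                 # again close to 0.5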
A simple implementation in R makes use of the package copula, which was
discussed in detail in Sect. 6.3. The command rMvdc() draws n random numbers
from a specified copula.
> # specification of the Clayton copula with uniform marginals
> require(copula)
> uniclayMVD = mvdc(claytonCopula(0.79),
+ margins = c("unif", "unif"),
+ paramMargins = list(list(min = 0, max = 1),
+ list(min = 0, max = 1)))
> # 10000 random number draw from the Clayton copula
> rMvdc(uniclayMVD, n = 10000)

Figure 9.3 shows 10,000 random numbers drawn from a Clayton copula.


Fig. 9.3 10,000 realizations of an rv with uniform marginals in [0, 1] (left) and with standard
normal marginals (right) with the dependence structure in both cases given by a Clayton copula
with θ = 0.79. BCS_claytonMC

9.3 Tests for Randomness

The first tests for random numbers in history were published by Kendall and Smith
(1938). They were built on statistical tools, such as the Pearson chi-square test, which
were developed in order to distinguish whether or not experimental phenomena
matched up with their theoretical probabilities.
Kendall and Smith’s original four tests were hypothesis tests, which tested
the null hypothesis that each number in a given random sequence had an equal chance
of occurring, and that various other patterns in the data should also be distributed
equiprobably. The four tests are:

• The frequency test is a very basic test which checks whether there are roughly the
same number of 0’s, 1’s, 2’s, 3’s, etc.
• The serial test does the same for sequences of two digits at a time (00, 01, 02, etc.),
comparing their observed frequencies with their hypothetical predictions based on
equal distribution.
• The poker test is used to test for certain sequences of five numbers at a time (00000,
00001, 00011, etc.), based on hands in the game poker.
• The gap test looks at the distances between zeroes (00 would be a distance of 0,
030 would be a distance of 1, 02250 would be a distance of 3, etc.).
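The first two of these tests can be sketched with a Pearson chi-square test; the digit sequence below is only a stand-in for generator output, and non-overlapping pairs are used for the serial test (both choices are illustrative).

> # Frequency and serial test sketch for a digit sequence
> digits = sample(0:9, 10000, replace = TRUE)      # stand-in for generator output
> # frequency test: are the ten digits equally frequent?
> chisq.test(table(digits))
> # serial test on non-overlapping pairs: are the 100 two-digit
> # combinations equally frequent?
> pairs = paste(digits[seq(1, 9999, 2)], digits[seq(2, 10000, 2)])
> chisq.test(table(factor(pairs, levels = paste(rep(0:9, each = 10), 0:9))))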

In general, it is very hard to check if some sequence is truly random—just consider


a single number, say 7. The reason is that if the random number generator is good,
each and every possible sequence of values is equally likely to appear, as Kendall
and Smith stated in 1938.
To illustrate this fact, consider a coin toss experiment with 3 throws. If the coin is
fair, the resulting sequences of heads and tails are all equally likely, i.e., P(H, T , T ) =
P(H, H, T ) = P(T , T , T ), and so on. Therefore, even a fair coin can generate a
sequence of only heads or tails. And even worse: that sequence appears with the
same probability as any other sequence, which may appear much more random! This
means that a good random number generator will also produce sequences that look
non-random to the human eye and which fail any statistical tests that we might
apply to it. Therefore, it is impossible to prove that a given sequence of numbers is
random.
How to proceed, if it is impossible to conclusively prove randomness? We can
follow a pragmatic approach by taking many sequences of random numbers from
a given generator and testing each one of them. One can expect that some of the
sequences will fail some of the tests. As the sequences pass more of the tests, the
confidence in the randomness of the numbers increases and with it the confidence in
the generator. However, if many sequences fail the tests, we should be suspicious.
This is also the way one would intuitively test a coin to see if it is fair: throw it
many times, and if too many sequences of the same value come up, one should be
suspicious. The problem with randomness still tends to be the same: you can never
be sure.

Nevertheless, in the following sections, two of the less intuitive and harder-to-pass
tests will be presented, in contrast to the more natural approaches above.

9.3.1 Birthday Spacings

The birthday spacing test is one of a series of tests called Diehard tests, which
were developed by Marsaglia (1995) and published on a CD. Consider the following
situation. If m birthdays are randomly chosen from a year of n days (usually 365) and
sorted, the number of duplicate values among the spacings between those ordered
birthdays will be asymptotically Poisson distributed with parameter λ = m^3/(4n).
Theory provides little guidance on the speed of the approach to the limiting form,
but extensive simulation with a variety of random number generators provides values
of m and n for which the limiting Poisson distribution seems satisfactory. Among
these are m = 1024 birthdays for a year of length n = 2^24 with λ = 16.
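A small simulation sketch of this test with m = 1024 and n = 2^24 (the number of replications is arbitrary); a chi-square test against the Poisson(16) reference would complete the test.

> # Birthday-spacings sketch: for m = 1024 and n = 2^24 the number of
> # duplicated spacings should be approximately Poisson with lambda = 16
> bday.dups = function(m = 1024, n = 2^24){
+   b = sort(sample.int(n, m, replace = TRUE))     # m random birthdays
+   sum(duplicated(diff(b)))                       # duplicates among the spacings
+ }
> counts = replicate(500, bday.dups())
> mean(counts)                                     # should be close to 16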

9.3.2 k-Distribution Test

A more strenuous theoretical requirement for a generator is to be “k-distributed”.


A sequence is 1-distributed if every number it generates occurs equally often, 2-
distributed if every pair of numbers occurs equally often, and so on. In a fair coin-flip
context, 1-distribution would mean that heads and tails occurred equally often, and
2-distribution would mean that all results of two tosses occurred equally often.

Definition 9.20 A sequence x_i of integers of period P is said to be k-distributed to
v-bit accuracy if, among the kv-bit vectors

{trunc_v(x_i), trunc_v(x_{i+1}), . . . , trunc_v(x_{i+k−1})} for 0 ≤ i < P,

where trunc_v(x) denotes the number formed by the leading v bits of x, e.g.
trunc_2(0.23917) = 0.23, each of the 2^{kv} possible combinations of bits occurs the
same number of times in a period, except for the all-zero combination, which occurs
less often by one instance.

To test for k-distribution to n-bit accuracy, at least 2^{kn} measurements are needed.
Thus, this property is generally shown theoretically without performing the actual
measurements. Nevertheless, it is possible to test for small k. In the case of k = 2,
each pair of the sequence {U_i, U_{i+1}}, i = 1, . . . , 2n − 1, refers to a certain point of the
unit square, where U_i ∼ U(0, 1). Decomposing the unit square into n^2 subsquares
and counting the number of points in the subsquares allows using a χ²-test for
independence, since the number of observed and expected points in each cell can be
compared. This example can be extended to larger k.
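A minimal sketch of this k = 2 test, using non-overlapping pairs and a 10 × 10 grid of subsquares (both choices are for illustration only):

> # k = 2 sketch: non-overlapping pairs (U_i, U_{i+1}) in a 10 x 10 grid of
> # subsquares, compared with the uniform expectation by a chi-square test
> u    = runif(20000)
> i    = seq(1, length(u) - 1, by = 2)             # non-overlapping pairs
> cell = paste(ceiling(10 * u[i]), ceiling(10 * u[i + 1]))
> obs  = table(factor(cell, levels = paste(rep(1:10, each = 10), 1:10)))
> chisq.test(obs)                                  # H0: uniform over the 100 cells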

For large k and n, an alternative is to count the number of missing k-outcomes in a


long string produced by the generator. The resulting count should be approximately
normally distributed with a certain mean and variance, which must be determined
by theory and simulation.
Finally, the most common random number generators generally cannot claim any
better than a 1-distribution.
Chapter 10
Advanced Graphical Techniques in R

All children are artists. The problem is how to remain an artist


once he grows up.

—Pablo Picasso

Data visualisation is an important part of data analysis with R. The standard


R environment has various graphical facilities for drawing different types of statistical
plots. However, the base R graphical system has several shortcomings. For example,
basic R provides no capabilities for interactive plotting, including rotation and zooming
of an existing plot. Moreover, dynamic graphics, e.g. the on-the-fly addition of
information and adjustment of parameters in the plot, are not embedded either.
For this reason, these important components are discussed within the add-on
packages rgl and rpanel. Apart from this, the lattice package considerably
extends the functionality of R by implementing multipanel plots, which are very
useful for more precise multivariate data analysis.
In this chapter, we discuss these three important add-on packages for advanced
data visualisation and provide relevant examples.

10.1 Package lattice

The lattice add-on package is an implementation of Trellis graphics in R.


Trellis graphics, originally implemented in S and S-Plus at AT&T Bell Labo-
ratories, is a data visualisation framework developed by Becker, Cleveland and Shyu.
It provides powerful visualisation tools, see Becker et al. (1996). Here we only dis-
cuss the most important features of the lattice package, whereas considerably
more detailed description is offered by the developer of the lattice system, Sarkar
(2010).


The name Trellis comes from the trellis-like rectangular array of panels sim-
ilar to a garden trellis. By means of the Trellis graphics it is possible to study
the dependence of a response variable on more than two explanatory variables. Mul-
tipanel conditioning is used for displaying multiple plots in one page with shared
coordinate scales, aspect ratios and labels. This feature is especially useful for plotting
multivariate and panel data, and is not provided by the standard R graphics system.
The design goal of the Trellis system is the optimisation of the available output
area, therefore Trellis graphics provide default settings that produce superior
plots in comparison to its traditional counterparts.
The lattice package is based on the grid graphics system, which is a low-
level graphics system, see Sarkar (2010). grid does not provide high-level functions
to create complete plots, but creates a basis for developing high-level functions as well
as facilitates the manipulation of graphical output in lattice. Since lattice
consists of grid calls, it is possible to add grid output to lattice output and vice versa,
see R Development Core Team (2012). The knowledge of the grid package would
be beneficial for customising the plots in lattice. Nevertheless lattice is a
self-contained graphics system, enabling one to produce complete plots, functions
for controlling the appearance of the plots and functions for opening and closing
devices.
The short description of the package functions and relevant examples of the
lattice graphical output will be given in the following.

10.1.1 Getting Started with lattice

The lattice package contains functions, objects and datasets. Most of the func-
tions implemented in lattice are already available in the traditional R graphics
environment. The complete list is given in Table 10.1.
Each of the listed high-level functions creates a particular type of display by
default. Although the functions produce different output, they share many common
features, i.e. that several common arguments affect the resulting displays in similar
ways. These arguments are extensively documented in the help pages for xyplot().
The most important of them are the formula argument, describing the variables,
and the panel argument, specifying the plotting function. These will be explained
in more details in the following subsections.

10.1.2 formula Argument

The Trellis formula argument has a central role in lattice, since it is


deployed in order to define statistical models. Different high-level generic func-
tions employ several different types of notations. The most often used notations are
presented in Table 10.2.

Table 10.1 High-level functions in lattice


lattice functions Default display Traditional functions
barchart() Barplot barplot()
histogram() Histogram hist()
densityplot() Conditional kernel density plot –
dotplot() Dotplot dotchart()
bwplot() Comparative Box-and-Whisker plot boxplot()
stripplot() Stripchart –
qqmath() Theoretical quantile plot qqplot()
qq() Two-sample quantile plot qqplot()
dotplot() Cleveland dot plot dotchart()
xyplot() Scatter plot plot()
contourplot() Contour plot of surfaces contour()
cloud() 3D Scatter plot –
levelplot() Level plot of surfaces image()
parallel() Parallel coordinates plot parcoord()
splom() Scatter plot matrix pairs()
wireframe() 3D Perspective plot of surfaces persp()

Table 10.2 Trellis formula notations


Notation Explanation Example of the function
~ x Plots a single variable x bwplot(), histogram(), qqmath()
y ~ x Plots variable y against variable x xyplot(), qq()
y ~ x * z Plots three variables levelplot(), cloud(), wireframe()
y ~ x | z Plots y against x for each level of z xyplot()

In order to avoid mistakes in the use of the formula argument, it should be kept
in mind that the syntax of the formula in lattice differs from that of formula
used in the lm() linear model function, see Chap. 8.
The variable on the left side of “ ∼” is a dependent variable, while the independent
variable(s) is (are) placed on the right side. For graphs of a single variable, only one
independent variable needs to be specified in the first row of Table 10.2.
In order to define multiple dependent or independent variables, the sign '+' is
placed between them. In case of multiple dependent variables, the formula would
be assigned as y1 + y2 ∼ x, so that the variables y1 and y2 are plotted against the
variable x. In fact, y1 ∼ x and y2 ∼ x will be superposed in each panel. In a similar
way, one can set multiple independent or both independent and dependent variables
simultaneously as is implied in the code of Fig. 10.1 later in this chapter.
To produce conditional plots, the conditioning variable should be also specified
in the formula argument, standing after the '|' symbol. When multiple conditioning


Fig. 10.1 Conditional plots. BCS_ConditionalGrouped

z variables are specified, then for each level combination of z1 and z2, lattice
produces several plots of y against x, as depicted in Fig. 10.5. The notation is y ∼
x | z1 + z2 .
The definition of the formula argument is the initial step in the multilevel devel-
opment process of lattice graphical output. The values used in the formula
are contained in the argument data, specifying the data frame.

10.1.3 panel Argument and Appearance Settings

As mentioned above, the default settings of lattice plots are optimised in comparison
to the traditional R plots. The panel function is a function that uses a subset of the argu-
ments to create a display. All lattice plotting functions have a default panel
function, the name of which is built up from the prefix panel and the name of
the function. For instance, the default panel function for the bwplot() function
is panel.bwplot(). However, apart from superior default settings, lattice
offers lots of flexibility due to its highly customisable panel functions.
There are two perspectives from which the lattice graph should be observed.
First, the function call, e.g. histogram(), sets up the external components of the

display, such as scale rectangle, axis labels, etc. Second, the panel function creates
everything placed into the plotting region of the graph, such as plotting symbols.
The panel function is called from the general display function by the panel
argument. Therefore, for the default settings, both function calls are identical.
> histogram(~ x, data = dataset)
> histogram(~ x, data = dataset, panel = panel.histogram)

There are different arguments that could be treated under the panel function. In
order to temporarily change the default settings of, for instance, the plotting symbols,
one can rewrite the new value into the panel function inside the general function
call.
> xyplot(y ~ x,
+        data  = dataset,
+        panel = function(x, y){panel.xyplot(x, y, pch = 20)})

Alternatively, when it is desired to change the panel function arguments permanently,


one should define a panel function as a separate function outside the general call
function and then apply it to any function call.
> my.panel = function(x, y){panel.xyplot(x, y, pch = 29)}

Now by choosing the my.panel function, one would always use the type of the
plotting points pch = 29.
In a similar way, different attributes (e.g. cex, font, lty, lwd, etc.) could
be altered for a specific function either temporarily or permanently.

10.1.4 Conditional and Grouped Plots

lattice offers conditional and grouped plots to work with and display multivariate
data. In order to obtain a conditional plot, at least one variable should be defined as
conditioning.
One gets different visual representations of the dataset, depending on whether the
same variable is being used as conditioning or as grouping.
From the dataset iris, the variable Species is set to be conditioning, as shown
in the R code below, which corresponds to Fig. 10.1.
> xyplot(Sepal.Length + Sepal.Width ~
+          Petal.Length + Petal.Width | Species,
+        data = iris)

Figure 10.1 contains three panels, standing for three types of Species. Each panel
contains four combinations of iris characteristics.
Another alternative for displaying multivariate data is the groups argument.
This splits the data according to the grouping variable. For the sake of comparability,
Fig. 10.2 shows four panels, each one illustrating the combination of two variables
and types of Species denoted by different colours.


Fig. 10.2 Grouped plots. BCS_ConditionalGrouped

> xyplot(Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width,
+        data   = iris,
+        groups = Species)

The use of a conditioning or grouping variable requires including a key legend in the
graph. The argument auto.key draws the legend, and the attribute columns
defines the number of columns into which the legend is split.
According to this particular example, it is not very important how one employs
the Species variable, since both outputs are qualitatively equal.
There are datasets where it is preferable to produce grouped plots rather than con-
ditional plots. The following example of a density plot from the dataset chickwts
confirms this.
> densityplot(~ weight | feed,       # set conditional variable
+             data = chickwts,
+             plot.points = FALSE)   # mask points

The resulting output is shown in Fig. 10.3. Since we employed the conditioning
variable feed, which has six categories, Fig. 10.3 produces six panels with density
plots.
Alternatively, the variable feed can be used as a grouping variable. Figure 10.4
creates one single panel with six superposed kernel density lines and enables a direct
comparison between the different groups.


Fig. 10.3 Conditional density plots. BCS_ConditionalGroupedDensity

> densityplot(~ weight,
+             data = chickwts,
+             groups = feed,                  # set grouping variable
+             plot.points = FALSE,            # mask points
+             auto.key = list(columns = 3))   # legend in three columns

Moreover, the black and white colour scheme was applied to both Fig. 10.3 and
Fig. 10.4; in lattice, the colours are replaced by different types of symbols
when the following code is applied.
> lattice.options(default.theme =
+   # set the default lattice colour scheme to black/white
+   modifyList(standard.theme(color = FALSE),
+              # set strips background to transparent
+              list(strip.background = list(col = "transparent"))))

10.1.5 Concept of shingle

lattice enables the use of continuous (numeric) variables as conditioning, with


the shingle concept. A shingle is a data structure displaying the continuous


Fig. 10.4 Grouped and overlayed density plots. BCS_ConditionalGroupedDensity

variables in the form of factors. It consists of a numeric vector and possibly overlap-
ping intervals.
To convert a continuous variable into a shingle object means to split it into
(possibly overlapping) intervals (levels). In order to do this, one uses the shingle()
function, whereas the function equal.count() is used when splitting into equal
length intervals is required. The number argument defines the number of intervals,
whereas the overlap argument assigns the fraction of points to be shared by the
consecutive intervals. The endpoints of the intervals are chosen in such a way that
the counts of points in the intervals are as equal as possible. shingle returns the
list of intervals of the numeric variable.
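A small sketch of the shingle() function itself, with hand-picked, overlapping interval endpoints (the values are arbitrary illustration choices):

> # shingle() with explicit, overlapping intervals (endpoints are arbitrary)
> require(lattice)
> temp = environmental$temperature
> Temp = shingle(temp, intervals = rbind(c(55, 75), c(70, 85), c(80, 100)))
> levels(Temp)                    # the three overlapping intervals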
In the following R code, the continuous variables temperature and wind are
split into three and four equal non-overlapping intervals, respectively, and can be
treated as usual factor variables. The new factor variables Temperature and Wind
are considered as the conditioning variables.
> Temperature = equal.count(environmental$temperature,
+                           number  = 3,   # split into 3 equal intervals
+                           overlap = 0)   # no overlapping
> Wind = equal.count(environmental$wind,
+                    number  = 4,
+                    overlap = 0)
> xyplot(ozone ~ radiation | Temperature * Wind,
+        data     = environmental,
+        as.table = TRUE)                  # panels layout top to bottom

Figure 10.5 depicts the simultaneous use of these two conditioning variables.

[Fig. 10.5 panel layout: a 3 × 4 array indexed by Temperature and Wind; y-axis: average ozone concentration in ppb; x-axis: solar radiation in Langley (1 LANG = 41,868 Joule/m2)]

Fig. 10.5 Plot with two conditioning variables. BCS_TwoConditioningVariables

Temperature now contains three levels and Wind has four levels, so that a rectangular
array of 12 panels is created, depicting the ozone variable against the
radiation variable for each combination of the conditioning variables.
In the code, the argument par.strip.text controls the text on each strip,
with the main components cex, col, font, etc. By default, lattice
displays the panels from bottom to top and left to right. By defining the argument
as.table = TRUE the panels will be displayed from top to bottom.
Of course more than two conditioning variables are also possible, but the increas-
ing level of complexity of the graphical output should be kept in mind.


Fig. 10.6 Default time series plot. BCS_DefaultStackTimeSeries

10.1.6 Time Series Plots

The ability to draw multiple panels in one plot is particularly useful for time series
data. lattice enables cut-and-stack time series plots. The argument cut is spec-
ified by the number of intervals into which the time series dataset should be split, so
that changes over a time period can be studied more precisely.
The code for the simple time series plot in Fig. 10.6 is
> xyplot(Nile)

One can customise the plot by varying the arguments aspect, cut and strip,
where the last is responsible for the colour scheme of the strips.
> xyplot(Nile, aspect = "xy",
+        cut = list(number  = 3,      # split into three panels
+                   overlap = 0.1),   # 10 per cent overlap
+        strip = strip.custom(bg = "yellow",       # strips background
+                             fg = "lightblue"))   # strips foreground

Figure 10.7 plots three panels, according to the number of intervals. Such a combi-
nation of two plots is most valuable for the user.
An object of class ts could also be a multivariate series, so that multiple time
series can be displayed in parallel in the same graph. For instance, by setting the
superpose argument to be TRUE, all series will be overlaid in one panel. When
the screens argument is specified, the series will be plotted into a predefined panel.
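A sketch with the multivariate EuStockMarkets series; the panel assignment passed to screens is an arbitrary illustration.

> # Multivariate time series sketch with the EuStockMarkets data
> xyplot(EuStockMarkets, superpose = TRUE)        # all four series in one panel
> xyplot(EuStockMarkets,                          # two predefined panels
+        screens = c("DAX/SMI", "DAX/SMI", "CAC/FTSE", "CAC/FTSE"))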


Fig. 10.7 Cut-and-stack time series plot. BCS_DefaultStackTimeSeries

10.1.7 Three- and Four-Dimensional Plots

The underlying philosophy of the Trellis system is to avoid three-dimensional


displays and to use conditioning plots instead. Three-dimensional plots, created by
wireframe() can be extended by a fourth variable used as conditioning. However,
data interpretation appears to be more complicated, mostly because the plots are not
rotatable.
For this reason, an analogous 3D plot will be constructed instead, with the rgl
package (see Sect. 10.2). One can still create some superior three-dimensional plots
with the levelplot() function. This function also allows upgrading a three-
dimensional plot with another conditioning variable. The function levelplot()
requires the dependent variable to be numerical and the conditioning variable to be
either a factor or a shingle. The following code shows how to create the 4D plot.
> levelplot(yield ~ site * variety | year,
+           data = barley,
+           scales = list(alternating = TRUE),
+           shrink = c(0.3, 1),                 # scale rectangles
+           region = TRUE,
+           cuts = 20,                          # range of dep. variable
+           col.regions = topo.colors(100),     # colour gradient
+           par.settings = list(axis.text = list(cex = 0.5)),
+           par.strip.text = list(cex = 0.7),   # strips font size
+           between = list(x = 1),              # space between panels
+           aspect = "iso", colorkey = list(space = "top"))


Fig. 10.8 Four-dimensional plot. BCS_FourDimensional

The result of the listing is shown in Fig. 10.8, which presents the four-dimensional plot
of the lattice data set barley. The explanatory variable yield is illustrated
by means of the sizes and colours of the boxes. The higher the value of yield, the
larger and lighter are the rectangles.
The arguments of interest are cuts, which specify the number of levels (the
colour gradient) into which the range of a dependent variable is to be divided and
region, which is a logical variable that defines whether the regions between the
contour lines should be filled. Since region = TRUE, col.regions defines
the colour gradient of the dependent variable. The shrink argument scales the
rectangles proportionally to the dependent variable, and the between argument
specifies the space between the panels on x and/or y axis.
The settings for the layout and appearance of the lattice plots facilitate an
enhanced comprehension of the data. Multipanel conditioning is a central feature
delivered by lattice, which enables data visualisation on multiple panels simul-
taneously, displaying different subsets of the data. Although lattice provides this

kind of extended functionality, an immediate interactive control of the graphical out-


put is still missing. For this reason, in the following sections, we discuss the rgl and
rpanel packages with an implemented interactive element.

10.2 Package rgl

RGL is a library of functions that offers three-dimensional real-time visualisation


functionality with interactive viewpoint navigation in the R programming environ-
ment. The RGL library is written in C++ using OpenGL. Since 3D objects need to
be projected onto a 2D display, special navigation capabilities such as real-time rota-
tion and zooming in/out are used to create the illusion of three-dimensionality. The
further implementation in the RGL library of different features, including lighting,
alpha blending, texture mapping, and fog effects, enhance this illusion, see Adler
et al. (2003).
The rgl package includes both low-level rgl.* functions and a higher level
interface for 3D rendering and computational geometry r3d with *3d functions.
Most of the function calls exist also in a double set: both concepts from rgl.* and
from *3d interfaces. The two principal differences between these are as follows:
rgl.* calls set unspecified material properties to default values and *3d calls use
the current values as defaults; rgl.* permanently changes the material properties
with each call and *3d make temporary changes for the duration of the call.
The aim of this section is to give an overview of the rgl package structure, as
well as some practical examples of the 3D real-time visualisation engine of the RGL
library.

10.2.1 Getting Started with rgl

The rgl package can be subdivided into the following categories:

1. Device management functions include six functions, which control the RGL win-
dow device. These functions are used to open/close the device, to return the
number of the active devices, to activate the device and to shut down the rgl
device system or to re-initialise rgl.
2. Scene management functions enable stepwise removal of certain objects, such as
shapes, lights, bounding boxes and background, from the 3D scene.
3. Export functions are used to save snapshots or screenshots in order to export them
to other file formats.
4. Environment functions are set to alter the environment properties of the scene,
e.g. to modify the viewpoint, background, axis labelling, bounding box, or to add
a light source to the 3D scene.

5. The appearance function rgl.material(...) is responsible for the appear-


ance properties, e.g. colour, transparency, texture and the sizes of the different
object types.

We next demonstrate the shape functions in the rgl package combined with certain
environment and object properties.

10.2.2 Shape Functions

The shape functions are an important part of the RGL library since they enable
both the plotting of primitive shapes, such as points, lines, linestrips, triangles and
quads, as well as high-level shapes, such as spheres and different surfaces, see
Figs. 10.9, 10.10, 10.11, 10.12, 10.13 and 10.14.
RGL adds further shapes to the already opened device by default. To avoid this,
one can create a new device window with the calls rgl.open() or open3d().

Fig. 10.9 Points.


BCS_Shapes

Fig. 10.10 Lines.


BCS_Shapes

Fig. 10.11 Linestrips.


BCS_Shapes

Fig. 10.12 Triangles.


BCS_Shapes

Fig. 10.13 Quads.


BCS_Shapes

Fig. 10.14 Spheres.


BCS_Shapes

The shape functions in rgl are briefly described in the list below.
1. 3D points are drawn by the function rgl.points(x, y, z,...), see
Fig. 10.9.
2. 3D lines can be depicted with the function rgl.lines(x, y, z,...), see
Fig. 10.10. The nodes of the line are defined by the vectors x, y, z, each of length
two.
3. 3D linestrips are constructed with the function rgl.linestrips(x,y,
z,...). The nodes of the linestrips are, as in rgl.lines(x, y, z,...),
defined by the vectors x, y, z, each of length two. In the output, each next line
strip starts at the point where the previous one ends, see Fig. 10.11.
4. 3D triangles are created with the function rgl.triangles(x, y, z,...),
see Fig. 10.12. The vectors x, y and z, each of length three, specify the coordinates
of the triangle.
5. 3D quads can be drawn with the function rgl.quads(x, y, z,...), see
Fig. 10.13. The vectors x, y and z, each of length four, specify the coordinates of
the quad.
6. 3D spheres are not primitive, but they can be easily created with the function
rgl.spheres(x, y, z, r,...). This function plots spheres with centres
defined by x, y, z and radius r . In order to create multiple spheres, one can define
x, y, z, r as vectors of length n, see Fig. 10.14.
7. 3D surfaces can be drawn by means of the generic rgl.surface(x,...)
function. This is defined by a matrix specifying the height of the nodes and two
vectors defining the coordinates.
Each of the shape functions can be produced with higher level functions from the
r3d interface.
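As a small illustration (not tied to the BCS_Shapes quantlet shown in the figures), two of these shapes can be drawn through the r3d interface as follows; the coordinates are arbitrary.

> # A minimal sketch of two primitive shapes with the r3d interface
> require(rgl)
> open3d()
> points3d(rnorm(50), rnorm(50), rnorm(50), col = "blue")    # a 3D point cloud
> lines3d(c(0, 1), c(0, 1), c(0, 1), col = "red", lwd = 2)   # one 3D line segment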
Alternatively, 3D surfaces can be constructed with the persp3d(x,...),
surface3d() or terrain3d() functions. As an example of a 3D surface, the
hyperbolic paraboloid
\frac{x^2}{a^2} - \frac{z^2}{b^2} = y,
is produced by surface3d() and displayed in Fig. 10.15.

Fig. 10.15 Surface shape.


BCS_SurfaceShape

> require(rgl)
> x = z = -9:9
> f = function(x, z){(x^2 - z^2) / 10}
> y = outer(x, z, f)                        # square matrix, x rows, z columns
> open3d()
> surface3d(x, z, y,                        # plot 3D surface
+           back  = "lines",                # back side grid
+           col   = rainbow(1000),
+           alpha = 0.9)                    # transparency level
> bbox3d(back = "lines", front = "lines")   # 3D bounding box

In this example of a 3D surface (see Fig. 10.15), a side dependent rendering effect
was implemented. This option gives the possibility of drawing the ’front’ and ’back’
sides of an object differently. By default, the solid mode is applied, which can be
changed either to lines, points or cull (hidden). In Fig. 10.15, the front side is drawn
with a solid colour, whereas the back side appears to be a grid. This creates a better
illusion of 3D space. The bounding box is added to the scene with the function
bbox3d().
There are many options that can be used in order to make the 3D object look
more realistic. The lighting condition of the shape is described by the totality of light
objects. There are three types of lighting, i.e. the specular component determines the
light on the top of an object, ambient determines the lighting type of the surrounding
area and the diffuse component specifies the colour component, which scatters the
light in all directions equally. The light parameters specify the intensity of the light,
whereas theta and phi are polar coordinates defining the position of the light.
The following examples of 3D spheres (Figs. 10.16, 10.17, 10.18 and 10.19) depict
the effects of ambient and specular material on a sphere. Furthermore, the argument
smooth creates the effect of internal smoothing and determines the type of shading
applied to the spheres. When smooth is TRUE, Gouraud shading is used, otherwise

Fig. 10.16 Default light


source.
BCS_LightedPlots

Fig. 10.17 Without light.


BCS_LightedPlots

Fig. 10.18 One light source


added. BCS_LightedPlots

flat shading. In Fig. 10.17, the rgl.clear() function is used in order to customise
the lighting scene of the display by deleting the lighting from the scene.

Fig. 10.19 Two light


sources added.
BCS_LightedPlots

> rgl.spheres(rnorm(3), rnorm(3), rnorm(3),   # set coordinates
+             r = runif(5),                   # set radius
+             smooth = TRUE)                  # Gouraud shading
> rgl.clear(type = "lights")                  # remove the lighting
> # add the 1st light source
> rgl.light(theta = -90, phi = 50,            # position of light
+           ambient = "white",                # surrounding area lighting
+           diffuse = "#dddddd",              # diffusing lighting
+           specular = "white")               # lighting on the top
> # add the 2nd light source
> rgl.light(theta = 45, phi = 30, ambient = "#dddddd",
+           diffuse = "#dddddd", specular = "white")

10.2.3 Export and Animation Functions

Exporting results from the rgl package differs from exporting classical graphical
outputs. For this reason, we will explain some of the main commands in this section.
To save the screenshot to a file in PostScript or in other vector graphics formats,
the function rgl.postscript() is used. There are also other supported formats,
such as eps, tex, pdf, svg, pgf. The drawText argument is a logical,
defining whether to draw text or not.
rgl.postscript("filename.eps", fmt = "eps", drawText = FALSE)

Alternatively, it is also possible to export the rgl content into bitmap png format
with the function rgl.snapshot().
rgl.snapshot("filename.png", fmt = "png", top = TRUE)

The animation functions of the rgl package, such as play3d() or movie3d(),


are useful for demonstration purposes. movie3d() additionally records each single
frame to a png file. A movie in gif format can be produced by putting the created
png files into one document. Let us consider the example of a 3D surface. First,
define a 4 × 4 matrix M describing user actions to display the scene. Second, use

play3d(), where par3dinterp() returns a function which interpolates par3d


parameter values, suitable for use in animations.
> M = par3d("userMatrix")                  # 4x4 user actions matrix
> play3d(par3dinterp(userMatrix = list(M,
+          angle = pi,                     # rotation angle
+          x = 1, y = 1, z = 0)),          # rotate around x and y axes
+        duration = 5)                     # duration of the rotation

By applying this code to the 3D surface example, one obtains a five second demon-
stration of the plot, rotated around the x- and y-axes.
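A movie3d() sketch for the same scene might look as follows; the rotation axis, speed and duration are illustration values, and the gif conversion assumes ImageMagick is available, otherwise only the png frames are written.

> # movie3d() sketch: spin the current scene about the z-axis; the png frames
> # (and, if ImageMagick is available, an animated gif) go to a temporary folder
> movie3d(spin3d(axis = c(0, 0, 1), rpm = 10),
+         duration = 6,                 # six seconds of animation
+         dir = tempdir())              # where the frames are written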
Another alternative for manipulating the plot, rather than rotation and zoom,
is provided by the function select3d(). This enables the user to select three-
dimensional regions in a scene. This function can be used to pick out one part of the
data, not influencing the whole dataset.
> if (interactive()){                        # interactive navigation is allowed
+   x = rnorm(5)                             # generate pseudo-random normal vector
+   y = z = x
+   r = runif(5)
+   open3d()                                 # open new device
+   spheres3d(x, y, z, r, col = "red3")      # red spheres
+   k = select3d()                           # select the rectangle area
+   keep = k(x, y, z)                        # keep selected area unchanged
+   rgl.pop()                                # clear shapes
+   spheres3d(x[keep], y[keep], z[keep], r[keep],
+             col = "blue3")                 # redraw the selected area in blue
+   spheres3d(x[!keep], y[!keep], z[!keep], r[!keep],
+             col = "red3")                  # redraw the non-selected area in red
+ }

Fig. 10.20 Select a part of the scene. BCS_SceneSelection



rgl.pop() is used for the same purpose as rgl.clear(), namely to remove


the last added node on the scene. Figure 10.20 illustrates the selected spheres in blue,
whereas the remaining data stays red.
To select the area, one draws a rectangle that represents the projection of the region
onto the display. If the colour of the unselected area is not specified, its colour will
be set to default after selection. A function which tests whether points are located in
the selected region will be returned.

10.3 Package rpanel

The rpanel package employs different graphical user interface (GUI) controls to
enable an immediate communication with the graphical output and provide dynamic
graphics. Such an animation of graphs is possible by using single function calls,
such as sliders, buttons or others, to control the parameters. If the particular state
of a control button is altered, the response function call will be executed and the
associated graphical display will be changed correspondingly.
rpanel is built on the tcltk package created by Dalgaard (2001). The tcltk
package contains various options for interactive control offered by the Tcl/Tk sys-
tem, whereas rpanel includes only a limited number of useful tools that enable
the creation of control widgets through single function calls. rpanel offers the
possibility of redrawing the entire plot, and interactively changing the values of the
parameters set by the relevant controls, something which is not possible with the
object-oriented graphics created, for instance, in Java, see Bowman et al. (2007). In
order to be able to use the rpanel package, one should load the tcltk package
first.
rpanel displays graphs in a standard R graphics window and creates a separate
panel with control parameters. To avoid the necessity of operating with multiple
panels, one can use the tkrplot package of Tierney (2005) to integrate the plot
into the control panel, see Bowman et al. (2007).

10.3.1 Getting Started with rpanel

The rpanel package consists of control functions and application functions. The
control functions are mainly used to build simple GUI controls for R functions.
The most useful GUI controls are listed in Table 10.3.
It is worth mentioning that several controls can be used simultaneously, as shown
in the next example, where both rp.doublebutton() and rp.slider() are
applied to the same panel object. First, one defines the function which is called when
an item is chosen, then one fills it with the values of the observed variable. Next, one
draws the panel and places the proper function in it. The rp.control() function
appears to be the central control function implemented in rpanel, since it is called

Table 10.3 Control functions Function Action


rp.doublebutton() Adds a widget with “+” and “-”
buttons
rp.checkbox() Adds checkbox to the panel
rp.control() Creates a control panel window
rp.listbox() Adds a listbox to the panel
rp.radiogroup() Adds a set of radiobuttons to the
panel
rp.slider() Adds a slider to the panel
rp.tkrplot() Allows R graphics to be drawn in a panel

every time a new panel window is drawn, defining where the rpanel widgets can be
placed. Eventually, rp.slider() and rp.doublebutton() are used in order
to control a numeric variable by increasing or decreasing it with a slider or button
widget.
The following code demonstrates the usage of both functions on dataset trees:
> require(rpanel)
> r = diff(range(Height))                 # define the range of the variable
> density.draw = function(panel){         # draw density function
+   plot(density(panel$y, panel$sp))
+   panel
+ }
> # define panel window arguments
> density.panel = rp.control(title = "density estimation",
+                            y  = Height,   # data argument
+                            sp = r / 8)    # smoothing parameter
> # add a slider to the panel window
> rp.slider(density.panel, sp,
+           from = r / 40, to = r / 2,      # lower and upper limits
+           action = density.draw,
+           main = "Bandwidth")             # name of the widget
> # add a widget with "+" and "-" buttons
> rp.doublebutton(density.panel, sp,
+                 step = 0.03,
+                 log = TRUE,               # step is multiplicative
+                 range = c(r / 50, NA),    # lower and upper limits
+                 action = density.draw)

The first argument of rp.slider() identifies the panel object to which the slider
should be added. The second argument gives the name of the component of the
created panel object that is subsequently controlled by the slider. The from and to
arguments define the start and end points of the range of the slider. The action
argument gives the name of the function which will be called when the slider position
is changed. The last argument adds a label to the slider (see Fig. 10.21).
The rp.doublebutton() function is used to change the value of the partic-
ular panel component by small steps when a more accurate adjustment of parameters
is needed. Most of the arguments used by this function are the same as for the

Fig. 10.21 Slider and double button for the control of density estimate.
BCS_ControlDensityEstimate

rp.slider(). The range argument serves the same purpose as the from and
to arguments defining the limits for the variable.
Another feature enabled in rpanel is the possibility of interactively choosing
between several types of plots to be applied to the same data set. It is also fea-
sible to adjust different parameters within the chosen plot. This can be tested with
rp.listbox(). This function adds a listbox of alternative commands to the panel.
When an item is pressed, the corresponding graphics display will occur. The argu-
ments of the function are the same as with the previous control functions. One can
follow the setting of this function in the code below.
> data.plotfn = function(panel){             # define plot function
+   if (panel$plot.type == "histogram")      # choose histogram
+     hist(panel$y)                          # then plot histogram
+   else if (panel$plot.type == "boxplot")   # choose boxplot
+     boxplot(panel$y)                       # then plot boxplot
+   panel
+ }
> panel = rp.control(y = Height)             # new panel
> rp.listbox(panel, plot.type,               # list with 2 options
+            c("histogram", "boxplot"),
+            action = data.plotfn,
+            title = "Plot type")            # name of the widget

Fig. 10.22 Listbox control function with histogram and boxplot as alternative plots.
BCS_HistogramBoxplotOption

An alternative to rp.listbox() (see Fig. 10.22) is the function


rp.radiogroup(), which can also be applied. The behaviour of this function
is the same, but the panel has a set of radio buttons instead of a list view.
Another way to dynamically control the display output is offered by the function
rp.checkbox(). This function adds a checkbox of alternative arguments for con-
trolling the logical variables of the panel. When an item is selected, the corresponding
graphics feature will be displayed.
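A small rp.checkbox() sketch, assuming the Height variable used above is available and toggling a rug of raw observations under a histogram; the widget label and argument choices are illustrative.

> # rp.checkbox() sketch: toggle a rug of raw observations under a histogram
> hist.draw = function(panel){
+   hist(panel$y, main = "")
+   if (panel$show.rug) rug(panel$y)    # drawn only when the box is ticked
+   panel
+ }
> hist.panel = rp.control(y = Height, show.rug = FALSE)
> rp.checkbox(hist.panel, show.rug,
+             action = hist.draw,
+             labels = "Show rug")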
The function rp.tkrplot() allows placing the plot and widgets into one
panel, creating one common window. rp.tkrplot() and rp.tkrreplot()
call Tierney’s tkrplot and tkrreplot functions, respectively, to allow R graphics
to be displayed in a panel. This means that one should first define the plot of the func-
tion and then replot it inside the panel, as done below.
> if (interactive()){                   # interactive navigation is allowed
+   draw = function(panel){
+     plot(density(panel$y, panel$sp),
+          col = "red", main = "")
+     panel
+   }
+   # place density plot and widget within one panel
+   redraw = function(panel){
+     rp.tkrreplot(panel, density)
+     panel
+   }
+   rpplot = rp.control(title = "Demonstration of rp.tkrplot",
+                       y = Height, sp = r / 8)
+   rp.tkrplot(rpplot, density, draw)
+ }

The difference between this type of plot and the default plotting in rpanel can be
observed in Fig. 10.23.

Fig. 10.23 Density plot with


rp.tkrplot.
BCS_rp.tkrplot

Table 10.4 Application functions


Function Action
rp.ancova() Plots a response variable against a covariate
rp.logistic() Plots a binary response variable against a covariate
rp.plot3d() Plots a 3D scatterplot, using the rgl package
rp.regression() Plots a response variable against one or two covariates
rp.normal() Plots a histogram and fits the normal or other distributions to it

10.3.2 Application Functions in rpanel

The rpanel package also includes several useful built-in application functions.
These simplify the dynamic plotting of several processes, such as the analysis of
covariance, regression, plotting of 3D plots, fitting a normal distribution, etc. A list
of selected application functions is given in Table 10.4.
The rp.regression() function plots a response variable against one or two
covariates and automatically creates an rpanel with control widgets.
The arguments of the function are mostly relevant in the case of one covari-
ate. So the use of panel.plot makes sense for two-dimensional plots and acti-
vates the tkrplot function in order to merge the control and output panels in
one window. One should be aware that three-dimensional graphics can not be
placed inside the panel. The code demonstrating the two-dimensional regression
with rp.regression() is presented below.

Fig. 10.24 Regression with one covariate. BCS_UnivariateRegression

> if (interactive()){                       # interactive navigation is allowed
+   data(longley)
+   attach(longley)                         # components are temporarily visible
+   # univariate regression
+   rp.regression(GNP, Unemployed,
+                 line.showing = TRUE,      # regression line
+                 panel.plot   = FALSE)     # plot is outside the control panel
+ }

A regression line will appear in the plot if the argument line.showing is set to
TRUE. If the regression line is drawn, then one can interactively change its intercept
and slope, see Fig. 10.24.
If the function has two covariates, the rp.regression() plot is generated with
the help of the rgl package, through the function rp.plot3d(), see Fig. 10.25.
In fact, one advanced interactive display will be created, which extends even the
features of the rgl 3D interactive scatterplot. The created plot is rotatable and
a zoom function is included. Additionally, one can set the panel argument to be
TRUE in order to create a control panel allowing interactive control of the fitted linear
models with one or two covariates. Double buttons are also available for stepwise
control of the rotation degrees of theta and phi.
> if (interactive()){                       # interactive navigation is allowed
+   data(longley)
+   attach(longley)                         # components are temporarily visible
+   # multivariate regression
+   rp.regression(cbind(GNP, Armed.Forces), Unemployed,
+                 panel = TRUE)             # a panel is created
+ }

Fig. 10.25 Regression with two covariates. BCS_BivariateRegression

The rp.plot3d() function can also be used independently of
rp.regression(). It creates a three-dimensional scatterplot that does not
differ from its counterpart produced by the function plot3d() from the rgl
package.
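As an illustration only (this sketch is not one of the book's quantlets), rp.plot3d() can be called
directly on the longley variables used above; the rp.plot3d(x, y, z) calling convention is assumed here.

> if (interactive()){                    # interactive navigation is allowed
+   data(longley)
+   attach(longley)                      # components are temporarily visible
+   # stand-alone 3D scatterplot, analogous to plot3d() from rgl
+   rp.plot3d(GNP, Armed.Forces, Unemployed)
+ }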
The function rp.logistic() shares its arguments and panel controls with
rp.regression(), but fits a logistic regression for a binary response
variable.
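For illustration only (again not one of the book's quantlets), the sketch below constructs a
hypothetical binary response from the longley data; the panel.plot argument is assumed to be
shared with rp.regression(), as stated above.

> if (interactive()){                    # interactive navigation is allowed
+   data(longley)
+   attach(longley)                      # components are temporarily visible
+   # hypothetical binary response: unemployment above its median
+   high = as.numeric(Unemployed > median(Unemployed))
+   rp.logistic(GNP, high,
+               panel.plot = FALSE)      # plot is outside the control panel
+ }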
Another application function is rp.ancova(), which provides the analysis
of covariance, with different groups of data identified by colour and symbol. This
function also shares its arguments with the function rp.regression().
> if (interactive()){                    # interactive navigation is allowed
+   data(airquality)
+   attach(airquality)                   # components are temporarily visible
+   rp.ancova(Solar.R, Ozone, Month,
+             panel.plot = FALSE)        # plot is outside the control panel
+ }

The function rp.normal() plots a histogram of a data sample and allows a normal
density curve to be added to the display. Furthermore, the normal distribution
fitted with the sample mean and standard deviation can also be plotted. Double
buttons are built in as well and enable interactive control of the mean and
standard deviation.
> if (interactive()){                    # interactive navigation is allowed
+   y = Height                           # data argument
+   # plot histogram with density curve
+   rp.normal(y, panel.plot = TRUE)
+ }

Fig. 10.26 Normal density fit. BCS_BivariateRegression

Figure 10.26 shows a screenshot produced by the function rp.normal().


Bibliography

Adler, D., Nenadic, O., & Zucchini, W. (2003). RGL: A R-library for 3D visualization with OpenGL,
Technical report, University of Goettingen.
Ahrens, J. H. (1972). Computer methods for sampling from the exponential and normal distributions.
Communications of the ACM, 15(10), 873–882.
Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, poisson
and binomial distribution. Computing, 12, 223–246.
Ahrens, J. H., & Dieter, U. (1982a). Computer generation of Poisson deviates from modified normal
distributions. ACM Transactions on Mathematical Software, 8, 163–179.
Ahrens, J. H., & Dieter, U. (1982b). Generating gamma variates by a modified rejection technique.
Communications of the ACM, 25, 47–54.
Albert, J. (2009). Bayesian Computation with R, Use R! (2nd ed.). New York: Springer.
Annamalai, C. (2010). Package “radx”. https://github.com/quantumelixir/radx.
Ash, R. B. (2008). Basic Probability Theory, Dover Books on Mathematics (1st ed.). New York:
Dover Publications Inc.
Atkinson, A. C., & Pearce, M. (1976). The computer generation of beta, gamma and normal random
variables. Journal of the Royal Statistical Society, 139, 431–461.
Babbie, E. (2013). The Practice of Social Research. Boston: Cengage Learning.
Banks, J. (1998). Handbook of Simulation. Norcross: Engineering and Management Press.
Becker, R. A., Cleveland, W. S., & Shyu, M.-J. (1996). The visual design and control of trellis
display. Journal of Computational and Graphical Statistics, 5, 123–155.
Bolger, E. M., & Harkness, W. L. (1965). Characterizations of some distributions by conditional
moments. The Annals of Mathematical Statistics, 36, 703–705.
Bowman, A., Crawford, E., Alexander, G., & Bowman, R. W. (2007). rpanel: Simple interactive
controls for R functions using the tcltk package. Journal of Statistical Software, 17, 1–18.
Box, G. E. P., & Muller, M. E. (1958). A note on the generation of random normal deviates. Annals
of Mathematical Statistics, 29, 610–611.
Braun, W., & Murdoch, D. (2007). A First Course in Statistical Programming with R. Cambridge:
Cambridge University Press.
Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion.
Statistical Science, 16, 101–117.
Broyden, C. G. (1970). The convergence of a class of double-rank minimization algorithms. Journal
of the Institute of Mathematics and Its Applications, 6, 76–90.
Caillat, A.-L., Dutang, C., Larrieu, V., & NGuyen, T. (2008). Gumbel: package for Gumbel copula.
R package version 1.01.


Canuto, C., & Tabacco, A. (2010). Mathematical Analysis II. Universitext Series. Milan: Springer.
Cheng, R. C. H. (1977). The generation of gamma variables with non-integral shape parameter.
Journal of the Royal Statistical Society, 26(1), 71–75.
Cheng, R. C. H. (1978). Generating beta variates with nonintegral shape parameters. Communica-
tions of the ACM, 21, 317–322.
Clayton, D. G. (1978). A model for association in bivariate life tables and its application in epi-
demiological studies of familial tendency in chronic disease incidence. Biometrika, 65, 141–151.
Cleveland, W. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of
the American Statistical Association, 74, 829–836.
Cook, R. D., & Weisberg, S. (1982). Residuals and Influence in Regression. New York: Chapman
and Hall.
Cowpertwait, P. S., & Metcalfe, A. (2009). Introductory Time Series with R. New York: Springer.
Csorgo, S., & Faraway, J. (1996). The exact and asymptotic distributions of Cramér-von Mises
statistics. Journal of the Royal Statistical Society Series B, 58, 221–234.
Dalgaard, P. (2001). The r-tcl/tk interface. Proceedings of DSC, 1, 2.
Demarta, S., & McNeil, A. J. (2004). The t-copula and related copulas. International Statistical
Review, 73(1), 111–129.
Everitt, B. (2005). An R and S-PLUS Companion to Multivariate Analysis. London: Springer.
Everitt, B., & Hothorn, T. (2011). An Introduction to Applied Multivariate Analysis with R. New
York: Springer.
Everitt, B., Landau, S., Leese, M., & Stahl, D. (2009). Cluster Analysis. Chichester: Wiley.
Fang, K. & Zhang, Y. (1990). Generalized multivariate analysis, Science Press and Springer.
Fishman, G. (1976). Sampling from the gamma distribution on a computer. Communications of the
ACM, 19(7), 407–409.
Fletcher, R. (1970). A new approach to variable metric algorithms. Computer Journal, 13, 317–322.
Frank, M. J. (1979). On the simultaneous associativity of f (x, y) and x + y − f (x, y). Aequationes
Mathematicae, 19, 194–226.
Frees, E., & Valdez, E. (1998). Understanding relationships using copulas. North American Actu-
arial Journal, 2, 1–125.
Gaetan, C., & Guyon, X. (2009). Spatial Statistics and Modeling. New York: Springer.
Genest, C., & Rivest, L.-P. (1989). A characterization of Gumbel family of extreme value distribu-
tions. Statistics and Probability Letters, 8, 207–211.
Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Compu-
tational and Graphical Statistics, 1, 141–150.
Genz, A. (1993). Comparison of methods for the computation of multivariate normal probabilities.
Computing Science and Statistics, 25, 400–405.
Genz, A., & Azzalini, A. (2012). mnormt: The multivariate normal and t distributions. R package
version 1.4-5. http://CRAN.R-project.org/package=mnormt.
Genz, A., & Bretz, F. (2009). Computation of Multivariate Normal and t Probabilities. Lecture
Notes in Statistics. Heidelberg: Springer.
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., & Hothorn, T. (2012). mvtnorm:
Multivariate Normal and t Distributions. R package version 0.9-9993.
http://CRAN.R-project.org/package=mvtnorm.
Goldfarb, D. (1970). A family of variable metric updates derived by variational means. Mathematics
of Computation, 24, 23–26.
Gonzalez-Lopez, V. A. (2009). fgac: Generalized Archimedean Copula. R package version 0.6-1.
http://CRAN.R-project.org/package=fgac.
Greene, W. (2003). Econometric Analysis. Upper Saddle River: Pearson Education.
Greub, W. (1975). Linear Algebra. Graduate Texts in Mathematics. New York: Springer.
Gumbel, E. J. (1960). Distributions des valeurs extrêmes en plusieurs dimensions. Publications de
Institut de Statistique de Université de Paris, 9, 171–173.
Hahn, T. (2013). R2Cuba: Multidimensional numerical integration.
http://cran.r-project.org/web/packages/R2Cuba/R2Cuba.pdf.

Härdle, W. K., & Vogt, A. (2014). Ladislaus von Bortkiewicz: statistician, economist and a
European intellectual. International Statistical Review, 83(1), 17–35.
Härdle, W., Müller, M., Sperlich, S., & Werwatz, A. (2004). Nonparametric and Semiparametric
Models. Springer Series in Statistics. New York: Springer.
Härdle, W., & Simar, L. (2015). Applied Multivariate Statistical Analysis (4th ed.). New York:
Springer.
Hastie, T., Tibshirani, R., & Friedman, F. (2009). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. New York: Springer.
Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.
Journal of Research of the National Bureau of Standards, 49, 409–436.
Hofert, M., & Maechler, M. (2011). Nested Archimedean copulas meet R: The nacopula package.
Journal of Statistical Software, 39(9), 1–20.
Hoff, P. (2010). sbgcop: Semiparametric Bayesian Gaussian copula estimation and imputation. R
package version 0.975. http://CRAN.R-project.org/package=sbgcop.
Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of
Computational and Graphical Statistics, 5(3), 299–314.
Berntsen, J., Espelid, T. O., & Genz, A. (1991). An adaptive algorithm for the approximate calculation
of multiple integrals. ACM Transactions on Mathematical Software, 17, 437–451.
Jech, T. J. (2003). Set Theory, Springer Monographs in Mathematics (3rd millennium ed., revised
and expanded). Berlin: Springer-Verlag.
Joe, H. (1997). Multivariate Models and Dependence Concepts. London: Chapman and Hall.
Joe, H., & Xu, J. J. (1996). The estimation method of inference functions for margins for multivariate
models, Technical Report 166. Department of Statistics: University of British Columbia.
Johnson, M. E. (1987). Multivariate Statistical Simulation. New York: Wiley.
Johnson, P. (1972). A History of Set Theory. Prindle, Weber & Schmidt Complementary Series in
Mathematics. Boston: Prindle, Weber & Schmidt.
Kachitvichyanukul, V., & Schmeiser, B. W. (1988). Binomial random variate generation. Commu-
nications of the ACM, 31, 216–222.
Kendall, M. G., & Smith, B. B. (1938). Randomness and random sampling numbers. Journal of the
Royal Statistical Society, 101(1), 147–166.
Kiefer, J. (1953). Sequential minimax search for a maximum. Proceedings of the American Math-
ematical Society, 4(3), 502–506.
Knuth, D. E. (1969). The Art of Computer Programming (Vol. 2): Seminumerical Algorithms.
Reading: Addison-Wesley.
Kojadinovic, I., & Yan, J. (2010). Modeling multivariate distributions with continuous margins
using the copula r package. Journal of Statistical Software, 34(9), 1–20.
Kruskal, J. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrica, 29,
115–129.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of
the American Statistical Association, 47, 583–621.
Marsaglia, G. (1964). Generating a variable from the tail of the normal distribution. Technometrics,
6, 101–102.
Marsaglia, G. (1968). Random numbers fall mainly in the planes. Proceedings of the National
Academy of Sciences of the United States of America, 61(1), 25–28.
Marsaglia, G. (1995). Diehard Battery of Tests of Randomness, Florida State University.
Marsaglia, G., & Marsaglia, J. (2004). Evaluating the anderson-darling distribution. Journal of
Statistical Software, 9(2), 1–5.
Marshall, A. W., & Olkin, J. (1988). Families of multivariate distributions. Journal of the American
Statistical Association, 83, 834–841.
Martin, A. D., Quinn, K. M., & Park, J. H. (2011). Mcmcpack: Markov chain monte carlo in R.
Journal of Statistical Software, 42(9), 1–21.

Matsumoto, M., & Nishimura, T. (1998). Mersenne twister: A 623-dimensionally equidistributed
uniform pseudorandom number generator. ACM Transactions on Modeling and Computer
Simulation, 8, 3–30.
McNeil, A. J. (2008). Sampling nested Archimedean copulas. Journal of Statistical Computation
and Simulation, 78, 567–581.
Miwa, T., Hayter, A. J., & Kuriki, S. (2003). The evaluation of general non-centred orthant probabil-
ities. Journal of the Royal Statistical Society, 65, 223–234.
Moore, E. (1920). On the reciprocal of the general algebraic matrix. Bulletin of American Mathe-
matical Society, 26, 394–395.
Muenchen, R. A., & Hilbe, J. M. (2010). R for Stata Users (1st ed.). Statistics and Computing. New
York: Springer.
Müller, H. (1987). Weighted local regression and kernel methods for nonparametric curve fitting.
Journal of the American Statistical Association, 82, 231–238.
Nadaraya, E. (1964). On estimating regression. Theory of Probability and Its Applications, 9, 141–
142.
Nash, J. C. N., & Varadhan, R. (2011). Unifying optimization algorithms to aid software system
users: optimx for r. Journal of Statistical Software, 43(9), 1–14.
Neave, H. (1973). On using the Box-Muller transformation with multiplicative congruential pseudo-
random number generators. Applied Statistics, 22, 92–97.
Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. Computer Journal,
7, 308–313.
Nelsen, R. B. (2006). An Introduction to Copulas. New York: Springer.
Okhrin, O., Okhrin, Y., & Schmid, W. (2013). On the structure and estimation of hierarchical
Archimedean copulas. Journal of Econometrics, 173(2), 189–204.
Okhrin, O., & Ristig, A. (2012). HAC: Estimation, simulation and visualization of Hierarchical
Archimedean Copulae (HAC). R package version 0.2-5. http://CRAN.R-project.org/package=HAC.
Park, J., & Zatsiorsky, V. (2011). Multivariate statistical analysis of decathlon performance results
in olympic athletes (1988–2008). World Academy of Science, Engineering and Technology, 77.
Parzen, E. (1962). On the estimation of a probability density function and mode. The Annals of
Mathematical Statistics, 33, 1065–1076.
Penrose, R. (1955). A generalized inverse for matrices. Proceedings of the Cambridge Philosophical
Society (Vol. 51, pp. 406–413). Cambridge: Cambridge University Press.
Poisson, S.-D. (1837). Probabilité des jugements en matière criminelle et en matière civile,
précédées des règles générales du calcul des probabilités. Paris: Bachelier.
Press, W. (1992). Numerical Recipes in C: The Art of Scientific Computing. Cambridge: Cambridge
University Press.
Pyke, R. (1965). Spacings. Journal of the Royal Statistical Society, Series B (Methodological),
27(3), 395–449.
Quine, M. P., & Seneta, E. (1987). Bortkiewicz’s data and the law of small numbers. International
Statistical Review, 55, 173–181.
R Development Core Team (2012). R: A Language and Environment for Statistical Computing.
Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
http://www.R-project.org/.
Razali, N. M., & Wah, Y. B. (2011). Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov,
Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics, 2, 21–33.
Rencher, A. (2002). Methods of Multivariate Analysis. New York: Wiley.
Richardson, L. F. (1911). The approximate arithmetical solution by finite differences of physical
problems including differential equations, with an application to the stresses in a masonry dam.
Philosophical Transactions of the Royal Society A, 210, 307–357.
Riedwyl, H. (1997). Lineare Regression and Verwandtes. Basel: Birkhaeuser.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals
of Mathematical Statistics, 27, 832.

Samorodnitsky, G., & Taqqu, M. S. (1994). Stable Non-Gaussian Random Processes. New York:
Chapman & Hall.
Sarkar, D. (2010). Lattice: Multivariate Data Visualization with R. New York: Springer.
Scherer, M., & Mai, J.-K. (2012). Simulating Copulas: Stochastic Models, Sampling Algorithms,
and Applications. Series in Quantitative Finance. Singapore: World Scientific Pub Co Inc.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley Series in Proba-
bility and Statistics. New York: Wiley.
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathe-
matics of Computation, 24, 647–656.
Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples).
Biometrika, 52, 591–611.
Shepard, R. (1962). The analysis of proximities: multidimensional scaling with unknown distance
function. Psychometrica, 27, 125–139.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut
de Statistique de l'Université de Paris, 8, 229–231.
Smirnov, N. (1939). On the estimation of the discrepancy between empirical curves of distribution
for two independent samples. Bulletin Mathématique de l'Université de Moscou, 2, 2.
Stroud, A. H. (1971). Approximate Calculation of Multiple Integrals. New Jersey: Prentice Hall.
Theussl, S. (2013). Package “rglpk”. http://cran.r-project.org/web/packages/Rglpk/Rglpk.pdf.
Trimborn, S., Okhrin, O., Zhang, S., & Zhou, M. Q. (2015). gofCopula: Goodness-of-Fit Tests for
Copulae. R package version 0.2-5. http://CRAN.R-project.org/package=gofCopula.
van Dooren, P., & de Ridder, L. (1976). An adaptive algorithm for numerical integration over an
n-dimensional cube. Journal of Computational and Applied Mathematics, 2, 207–217.
Venables, W. N., & Ripley, B. D. (1999). Modern Applied Statistics with S-PLUS. New York:
Springer.
von Bortkewitsch, L. (1898). Das Gesetz der kleinen Zahlen, Leipzig.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York:
Springer.
Watson, G. (1964). Smooth regression analysis. Sankhya, Series A, 26, 359–372.
Weron, R. (2001). Levy-stable distributions revisited: Tail index >2 does not exclude the levy-stable
regime. International Journal of Modern Physics C, 12, 209–223.
Whelan, N. (2004). Sampling from Archimedean copulas. Quantitative Finance, 4, 339–352.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1, 80–83.
Wuertz, D., many others and see the SOURCE file (2009a). fCopulae: Rmetrics - Dependence Struc-
tures with Copulas. R package version 2110.78. http://CRAN.R-project.org/package=fCopulae.
Wuertz, D., many others and see the SOURCE file (2009b). fMultivar: Multivariate Market Analy-
sis. R package version 2100.76. http://CRAN.R-project.org/package=fMultivar.
Yan, J. (2007). Enjoy the joy of copulas: With a package copula. Journal of Statistical Software,
21(4), 1–21.
Index

A Column-major storage, 18
Absolute frequency, 130 Comparison relations, 8
Akaike Information Criterion (AIC), 200 Concentration ellipsoid, 179
α-trimmed Mean, 138 Concordant, 177
Apropos(), 7 Conditional sampling, 261
Archimedean copulae, 189 Confidence intervals, 146
Args(), 26 Contour ellipse, 179
Arithmetic mean, 138 Copula, 171, 183, 263
Array, 13, 14 copula density, 184
Assign, 10 copula estimation, 193
Attach, 22 copula families, 185
Auckland, 2 hierarchical archimedean copulae, 191
Correlation, 176
Covariance matrix, 174
B CRAN, 2
Bar Diagram, 130 Critical region, 150
Bar Plot, 131 Cross-platform software, 2
Basic functions, 8 Cumsum, cumprod, cummin, cummax, 17
Bayesian Information Criterion (BIC), 200 Cumulative distribution function (cdf), 109,
Bernoulli distribution, 94 171
Bernoulli experiment, 94
Best linear unbiased estimator (BLUE), 198
Binomial distribution, 94, 95 D
Box-plot, 144 Density function, 109, 171
Discriminant analysis, 238
Dispersion parameters, 140
C Distance, 230
C(), 13 Distribution, 171
Canonical maximum likelihood, 195 Cauchy distribution, 127
Cauchy distribution, 179 multinormal distribution, 178
Ceiling, 8 multivariate normal distribution, 178
Central Limit Theorem (CLT), 183 multivariate t-distribution, 178
Central limit theorem (CLT), 183
Chi-squared distribution, 115
Class(), 11 E
Clayton copula, 190 Elliptical copulae, 186
Cluster analysis, 229 Euclidian norm, 230

Example(), 6 package, 6
Excess kurtosis, 111 Histogram, 133
Expectation, 110, 173 Hypergeometric distribution, 101
Exponential distribution, 121
Ex_Stirling, 24
I
Indexing
F negative, 15
Factor analysis, 224 Inf, 12, 13
Factorial, 8 Inference for margins, 194
F-distribution, 119 Installing, 2
Find(), 7 Integer division, 8
Floor, 8 Interquartile range, 141
Frank copula, 189 Inverse transform method, 251
Fréchet–Hoeffdings, 185 Is.finite(), 13
Function Is.nan(), 13
as.character(), 12
as.data.frame(), 12
K
as.double(), 12
Kendall, 177
as.integer(), 12
Kurtosis, 111
as.list(), 12
as.matrix(), 12
cbind(), 18 L
diag(), 18 Law of Large Numbers, 138
dim(), 18 Length(), 13
dimnames(), 20 Leptokurtic distributions, 110
head, 31 Letters[], 14
k-means(), 234 Library(), 5
matrix(), 18 Limit theorems, 182
names(), 31 Linear congruential generator, 244
rbind(), 18 Loadhistory(), 7
str(), 31 Loadings matrix, 221
t(), 18 Load(.Rdata), 7
update.packages(), 3 Logical relations
Fundamental operations, 8 vectors, 15
Ls(), 7

G
Gamma distribution, 257 M
Generalised set, 82 Mac, 3
Generator of the copula, 189 Mahalanobis transformation, 182, 224
Gentleman, 2 Mallows’ Cp, 200
GNU General Public License, 2 Marginal cdf, 172
GNU Project, 2 Marginal probability, 172
Goodness of fit, 200, 265 Maximum likelihood factor analysis, 225
Gumbel copula, 190 Mean(), 17
Gumbel–Hougaard copula, 184 Median absolute deviation (MAD), 142, 143
Mode, 140
Modulo division, 8
H Moments, 173, 174
HAC package, 193 Month.abb[], 14
Help, 5 Multidimensional Scaling, 234
help(), 5 Multinomial distribution, 99
help.search(), 6 Multinormal, 171

MVTDist, see MVTDist S


S, 1
Sample, 87
N Sampling distributions, 182
NA, 12 Savehistory(), 7
NaN, 12, 13 Save.image(), 7
Normal distribution, 113 Scheme, 2
Scree plot, 222
Seed, 243
O Seq, 14
Objects(), 7 Seq(), 14
Order(), 16 Set, 17, 77
Ordinary Least Squares (OLS), 198 Simple copulae, 185
Sink(), 8
Skewness, 110
P Sort(), 16
Package Source(), 8
detach, 5 Source code, 2
installing, 4 Spearman, 177
library, 4 Stable distribution, 114, 123
loading, 4 Standard deviation, 110, 142
Package Manager, 4 Stepwise regression, 201
Pearson correlation coefficient, 177 Stirling, 11
Pie Chart, 131 Strong Law of Large Numbers, 132
Platykurtic distributions, 110 Student’s t-distribution, 117
Poisson distribution, 103 Sturges formula, 134
Precompiled, 3 Sum(), 17
Principal components, 220
Principal components analysis, 219
Print(), 12
T
Proximity, 230
t-distribution, 118
Pseudo log-likelihood function, 194
Tests of hypotheses, 149
Pseudorandom number generator, 244
Total range, 141
Pseudosample, 195
Trunc, 8
Two-sided tests, 150
Type I error, 149
Q
Type II error, 149
Q-Q plot, 204
Typeof(), 11
Quantiles, 139, 181
??, 6
Quit
q(), 8 U
Uniform distribution, 112
Unix, 3, 4
R Updating, 2, 3
R², 200
adjusted R², 200
Range(), 16 V
Rank(), 16 Variance, 110, 142
Relative frequency, 130
Rep(), 14
Replace, 16 W
Rev(), 17 Which(), 15
Rule, 8 Windows, 3
