Advanced R Solutions (Chapman & Hall/CRC The R Series). ISBN 1032007508, 9781032007502
Series Editors
John M. Chambers, Department of Statistics, Stanford University, California, USA
Torsten Hothorn, Division of Biostatistics, University of Zurich, Switzerland
Duncan Temple Lang, Department of Statistics, University of California, Davis, USA
Hadley Wickham, RStudio, Boston, Massachusetts, USA
Learn R: As a Language
Pedro J. Aphalo
R Markdown Cookbook
Yihui Xie, Christophe Dervieux, and Emily Riederer
Javascript for R
John Coene
Advanced R Solutions
Malte Grosser, Henning Bumann, and Hadley Wickham
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access
www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact
[email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
DOI: 10.1201/9781003175414
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Authors’ note: Because there were no exercises in chapters 1, 12, 16, 17 and 22 in Advanced R, Second
Edition (ISBN 9780815384571), these chapters are excluded from this book.
Malte: To Elena
Henning: To my family
Contents

Preface

I Foundations
2 Names and values
  2.2 Binding basics
  2.3 Copy-on-modify
  2.4 Object size
  2.5 Modify-in-place
3 Vectors
  3.2 Atomic vectors
  3.3 Attributes
  3.4 S3 atomic vectors
  3.5 Lists
  3.6 Data frames and tibbles
4 Subsetting
  4.2 Selecting multiple elements
  4.3 Selecting a single element
  4.5 Applications
5 Control flow
  5.2 Choices
  5.3 Loops
6 Functions
  6.2 Function fundamentals
  6.4 Lexical scoping
  6.5 Lazy evaluation
  6.6 ... (dot-dot-dot)
  6.7 Exiting a function
  6.8 Function forms
7 Environments
  7.2 Environment basics
  7.3 Recursing over environments
8 Conditions
  8.2 Signalling conditions
  8.4 Handling conditions
  8.5 Custom conditions
  8.6 Applications

II Functional programming
9 Functionals
  9.2 My first functional: map()
  9.4 Map variants
  9.6 Predicate functionals
  9.7 Base functionals

III Object-oriented programming
14 R6
  14.2 Classes and methods
  14.3 Controlling access
  14.4 Reference semantics
15 S4
  15.2 Basics
  15.3 Classes
  15.4 Generics and methods

IV Metaprogramming
18 Expressions
  18.2 Abstract syntax trees
  18.3 Expressions
  18.4 Parsing and grammar
  18.5 Walking AST with recursive functions
19 Quasiquotation
  19.2 Motivation
  19.3 Quoting
  19.4 Unquoting
  19.6 ... (dot-dot-dot)
  19.7 Case studies
20 Evaluation
  20.2 Evaluation basics
  20.3 Quosures
  20.4 Data masks
  20.5 Using tidy evaluation
  20.6 Base evaluation

V Techniques
23 Measuring performance
  23.2 Profiling
  23.3 Microbenchmarking

Bibliography
Preface
Acknowledgements
Many open source projects are the result of the work of a lot of people; so is this. We would like to explicitly mention and thank everybody who contributed solutions, raised questions, or helped to fix spelling and grammar to improve this work:
@3zhang, Anne (@ahoffrichter), Anh N Tran (@anhtr), Arash (@arashHaratian), Leon Kim (@BetweenTwoTests), Jun Cai (@caijun), @Charles926, Safouane Chergui (@chsafouane), Corrado Lanera (@CorradoLanera), @davidblitz, Zhuoer Dong (@dongzhuoer), @Elucidase, Fabian Scheipl (@fabian-s), @HannesOberreiter, @its-gazza, Jorge Aranda (@jorgearanda), @lotgon, @MajoroMask, Maya Gans (@MayaGans), Øystein Sørensen (@osorensen), Peter Hurford (@peterhurford), @philyoun, PJ (@pieterjanvc), Robert Krzyzanowski (@robertzk), Emily Robinson (@robinsones), Tanner Stauss (@tmstauss), @trannhatanh89, and Yihui Xie (@yihui).
Tobias Stalder (@toeb18 (https://twitter.com/toeb18)) designed the beautiful
cover, which visualizes the structure of Advanced R and its exercises.
Thanks to CRC Press for the interest in the project and our editor, Rob
Calver, and his assistant, Vaishali Singh, for their patience and support in
making this book a reality.
Thanks to our managers and companies for granting us some flexibility with
our work schedules and generally supporting the completion of this project.
Conventions
A brief overview of conventions we followed and decisions we made.
• Some chapters and sections in Advanced R do not contain exercises, and we skipped these in this book. This decision introduces some gaps in the numbering, but we feel that keeping the numbers in sync with those of Advanced R provides the most practical value.
• We strove to mostly follow the tidyverse style guide (https://style.tidyverse.org/). The {styler} package [Müller and Walthert, 2020] helped us check many of the rules automatically.
• Each chapter of this book was rendered in a separate R session via the {bookdown} package. We configured this process to initially:
  – set `%>%` <- magrittr::`%>%` to unlock the pipe operator without explicitly loading the {magrittr} package [Bache and Wickham, 2020] every time,
  – set a random seed (1014) to improve reproducibility (similar to Advanced R), and
  – define a few {ggplot2} and {knitr} options.
You can check out the exact code (https://github.com/Tazinho/Advanced-R-Solutions/blob/main/common.R) on GitHub.
• We chose to keep the code in this book as self-contained as possible.
  – The packages used are usually loaded at the beginning of each chapter.
  – We repeat all code from Advanced R that is necessary to work on an exercise but not explicitly part of the exercise. When some longer code passages (from Advanced R) are omitted, this is explicitly stated in the solution.
• The printed version of the book was rendered with R version 4.0.3 (2020-10-10) and the most recent available package versions as of December 2020. (The print version of Advanced R was rendered with R version 3.5.2.)
• Emoji images in the printed book come from the open-licensed Twitter
Emoji (https://github.com/twitter/twemoji).
• Benchmarks are computed when the book is rendered. While this improves
reproducibility, the exact results will depend on the system creating the
document.
Closing remarks
We are so happy to finish this exciting project that in fact neither of us really
had the time for. We probably wouldn’t have made it to the finish line if we
hadn’t worked on it together.
Collaboration is powerful and it’s fun to build and share. The various backgrounds represented in the R community generally make this exchange much more interesting and meaningful. Much of this success is possible because R is free software. At least in theory, everyone can contribute and no one can take away your freedom to do so.
The automated systems we build using these tools are not neutral and the
rapid adoption of data-driven processes in business and technology does clearly
affect our everyday lives and societies. It’s important that everyone has a fair
say in the discussions about these systems and participates in their design.
Against this background, we chose to donate half of our royalties from this
book to https://rladies.org/, an organization empowering minority genders
in the R community.
Thank you for your interest in this project and we hope the solutions will be
of value to you.
See you around!
Malte Grosser @malte_grosser (https://twitter.com/malte_grosser)
Henning Bumann @henningsway (https://twitter.com/henningsway)
Part I: Foundations

2 Names and values
Prerequisites
In this chapter we will use the {lobstr} package [Wickham, 2019a] to help
answer questions regarding the internal representation of R objects.
library(lobstr)
2.2 Binding basics

Q1: Explain the relationship between a, b, c, and d in the following code:

a <- 1:10
b <- a
c <- b
d <- 1:10
A: a, b, and c point to the same object (with the same address in memory).
This object has the value 1:10. d points to a different object with the same
value.
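We can make the sharing visible with lobstr::obj_addr(); the printed addresses themselves change from run to run, but the equalities do not:

```r
library(lobstr)

a <- 1:10
b <- a
c <- b
d <- 1:10

# a, b and c are three names for one object; d is a distinct object
obj_addr(a) == obj_addr(b)
#> [1] TRUE
obj_addr(b) == obj_addr(c)
#> [1] TRUE
obj_addr(a) == obj_addr(d)
#> [1] FALSE
```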
Q2: The following code accesses the mean function in multiple ways. Do they all point to the same underlying function object? Verify this with lobstr::obj_addr().
DOI: 10.1201/9781003175414-2
mean_functions <- list(
  mean,
  base::mean,
  get("mean"),
  evalq(mean),
  match.fun("mean")
)
A: Yes, they point to the same object. We confirm this by inspecting the
address of the underlying function object.
unique(obj_addrs(mean_functions))
#> [1] "0x5639a43823e0"
Q3: By default, base R data import functions, like read.csv(), will automatically convert non-syntactic names to syntactic ones. Why might this be problematic? What option allows you to suppress this behaviour?
A: Column names are often data, and the underlying make.names() transformation is non-invertible, so the default behaviour corrupts data. To avoid this, set check.names = FALSE.
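A small illustration, using a hypothetical two-column CSV whose header contains spaces:

```r
csv <- "x 1,x 2\n1,2"

# Default behaviour: names are silently rewritten
names(read.csv(text = csv))
#> [1] "x.1" "x.2"

# With check.names = FALSE the original names survive
names(read.csv(text = csv, check.names = FALSE))
#> [1] "x 1" "x 2"
```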
Q4: What rules does make.names() use to convert non-syntactic names into
syntactic ones?
A: A valid name must start with a letter or a dot (not followed by a number) and may otherwise contain letters, digits, dots, and underscores ("_"s have been allowed since R version 1.9.0).
Three main mechanisms ensure syntactically valid names (see ?make.names):
1. Names that do not start with a letter or a dot are prepended with an "X". The same holds for names that begin with a dot followed by a number.
2. All other invalid characters are replaced by a dot.
3. Names that match reserved R keywords (see ?reserved) get a dot appended to them.
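The three rules in action:

```r
# "1abc" and ".2way" get an "X"; the space becomes a dot; "if" gains a dot
make.names(c("1abc", ".2way", "a b", "if"))
#> [1] "X1abc"  "X.2way" "a.b"    "if."
```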
2.3 Copy-on-modify
Q1: Why is tracemem(1:10) not useful?
A: When 1:10 is evaluated, an object with an address in memory is created, but it is not bound to a name. Therefore, the object cannot be accessed or modified from R, and since no copies of it will ever be made, tracking it for copying is not useful.
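By contrast, tracing a bound object does report copies (tracemem() requires R to be compiled with memory profiling, which is the case for the CRAN binaries):

```r
x <- 1:10
tracemem(x)     # returns the current address of x, e.g. "<0x...>"

y <- x          # still no copy: y is just another binding to the same object
y[[1]] <- 100L  # tracemem now reports the copy made on behalf of y
untracemem(x)
```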
Q2: Explain why tracemem() shows two copies when you run this code. Hint:
carefully look at the difference between this code and the code shown earlier
in the section.
x <- c(1L, 2L, 3L)
tracemem(x)

x[[3]] <- 4
A: Initially the vector x has integer type. The replacement call assigns a double
to the third element of x, which triggers copy-on-modify.
x[[3]] <- 4
#> tracemem[0x55eec7b3af38 -> 0x55eec774cc18]:
x[[3]] <- 4L
Please be aware that running this code in RStudio will result in additional
copies because of the reference from the environment pane.
Q3: Sketch out the relationship between the following objects:
a <- 1:10
b <- list(a, a)
c <- list(b, a, 1:10)
ref(c)
#> █ [1:0x55ecc93cbdd8] <list>     # c
#> ├─█ [2:0x55efcb8246e8] <list>   # - b
#> │ ├─[3:0x55eac7df4e98] <int>    # -- a
#> │ └─[3:0x55eac7df4e98]          # -- a
#> ├─[3:0x55eac7df4e98]            # - a
#> └─[4:0x55edc7aa6968] <int>      # - 1:10
Q4: What happens when you run this code?

x <- list(1:10)
x[[2]] <- x
Draw a picture.
A: The initial reference tree of x shows that the name x binds to a list object.
This object contains a reference to the integer vector 1:10.
x <- list(1:10)
ref(x)
#> █ [1:0x55853b74ff40] <list>
#> └─[2:0x534t3abffad8] <int>
tracemem(x)
x[[2]] <- x
#> tracemem[0x55853b74ff40 -> 0x5d553bacdcd8]:
The list object previously bound to x is now referenced in the newly created
list object. It is no longer bound to a name. The integer vector is referenced
twice.
ref(x)
#> █ [1:0x5d553bacdcd8] <list>
#> ├─[2:0x534t3abffad8] <int>
#> └─█ [3:0x55853b74ff40] <list>
#> └─[2:0x534t3abffad8]
2.4 Object size
Q1: In the following example, why are object.size(y) and obj_size(y) so radically different? Consult the documentation of object.size().

y <- rep(list(runif(1e4)), 100)

object.size(y)
#> 8005648 bytes
obj_size(y)
#> 80,896 B

A: object.size() doesn't account for shared elements within lists, so it counts the 100 (identical) list elements separately. obj_size() recognises that they all reference the same underlying vector.

Q2: Take the following list. Why is its size somewhat misleading?

funs <- list(mean, sd, var)
obj_size(funs)
A: All three functions are built into R as part of the {base} and {stats} packages and hence always available. So, what does it mean to measure the size of something that's already included in R?
(There’s typically a more general question about what you want to know when
you ask for the size of something — do you want to know how much data you’d
need to send to communicate the object to someone else (e.g. serialise it), or
do you want to know how much memory you’d free if you deleted it?)
Let us look at how many other objects this applies to as well.
The following packages are usually loaded by default.
base_pkgs <- c(
"package:stats", "package:graphics", "package:grDevices",
"package:utils", "package:datasets", "package:methods",
"package:base"
)
To look up all functions from these packages we iterate over base_pkgs and
apply ls() and mget() within each iteration.
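The iteration itself is not shown above; a minimal sketch that matches the description, reusing base_pkgs from above (the book's exact code may differ):

```r
# For each attached base package, list all of its objects and fetch them
base_objs <- lapply(
  base_pkgs,
  function(pkg) mget(ls(pkg, all.names = TRUE), envir = as.environment(pkg))
)
names(base_objs) <- base_pkgs
```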
This gives us more than 2700 objects which are usually available by default.
sum(lengths(base_objs))
#> [1] 2709
Q3: Predict the output of the following code:

a <- runif(1e6)
obj_size(a)
b <- list(a, a)
obj_size(b)
obj_size(a, b)
b[[1]][[1]] <- 10
obj_size(b)
obj_size(a, b)
b[[2]][[1]] <- 10
obj_size(b)
obj_size(a, b)
A: In R (on the platforms the book was rendered on), every vector carries 48 B of overhead, which we can see by measuring 0-length vectors:

obj_size(list())
#> 48 B
obj_size(double())
#> 48 B
obj_size(character())
#> 48 B
Each additional double takes up another 8 bytes of memory:

obj_size(double(1))
#> 56 B
obj_size(double(2))
#> 64 B
It follows that a million doubles occupy 8 * 1e6 B plus the 48 B of overhead:

a <- runif(1e6)
obj_size(a)
#> 8,000,048 B
(If you look carefully at the amount of memory occupied by short vectors,
you will notice that the pattern is actually more complicated. This has to
do with how R allocates memory and is not that important. If you want
to know the full details, they’re discussed in the 1st edition of Advanced R:
http://adv-r.had.co.nz/memory.html#object-size).
For b <- list(a, a) both list elements contain references to the same memory
address.
b <- list(a, a)
ref(a, b)
#> [1:0x5639afc25ed0] <dbl>
#>
#> █ [2:0x5639a8f3ef08] <list>
#> ├─[1:0x5639afc25ed0]
#> └─[1:0x5639afc25ed0]
Therefore, no additional memory is required for the second list element. The
list itself requires 64 bytes, 48 bytes for an empty list and 8 bytes for each
element (obj_size(vector("list", 2))). This lets us predict 8,000,048 B + 64
B = 8,000,112 B.
obj_size(b)
#> 8,000,112 B
When we modify the first element of b, copy-on-modify kicks in: that element now points to a newly allocated vector, so b roughly doubles in size.

b[[1]][[1]] <- 10
obj_size(b)
#> 16,000,160 B
The second element of b still references the same address as a, so the combined
size of a and b is the same as b.
obj_size(a, b)
#> 16,000,160 B
ref(a, b)
#> [1:0x5639afc25ed0] <dbl>
#>
#> █ [2:0x5639adf46fa8] <list>
#> ├─[3:0x5639b2c737c0] <dbl>
#> └─[1:0x5639afc25ed0]
When we modify the second element of b, this element will also point to a new
memory address. This does not affect the size of the list.
b[[2]][[1]] <- 10
obj_size(b)
#> 16,000,160 B
ref(a, b)
#> [1:0x5639afc25ed0] <dbl>
#>
#> █ [2:0x5639ad94c2c8] <list>
#> ├─[3:0x5639b2c737c0] <dbl>
#> └─[4:0x5639b0deb050] <dbl>
However, the modification removed the last shared binding between a and b, so their combined size is now the sum of both:

obj_size(a, b)
#> 24,000,208 B
2.5 Modify-in-place
Q1: Explain why the following code doesn’t create a circular list.
x <- list()
x[[1]] <- x
A: In this situation copy-on-modify prevents the creation of a circular list: when x is assigned to its own first element, R first copies the existing list, so the element of the new list points to the old, separate list object rather than back to itself.

tracemem(x)
#> [1] "<0x55862f23ab80>"
x[[1]] <- x # Copy-on-modify triggers new copy
#> tracemem[0x55862f23ab80 -> 0x55862e8ce028]:
Q2: Wrap the two methods for subtracting medians into two functions, then use the {bench} package to carefully compare their speeds. How does performance change as the number of columns increases?
A: First, we define a function to create some random data.

create_random_df <- function(nrow, ncol) {
  random_matrix <- matrix(runif(nrow * ncol), nrow = nrow)
  as.data.frame(random_matrix)
}

create_random_df(2, 2)
#> V1 V2
#> 1 0.972 0.0116
#> 2 0.849 0.4339
Next, we wrap the two approaches to subtract numerical values (in our case medians) from each column of a data frame in their own functions. We name these functions depending on whether the approach operates on a data frame or a list. For a fairer comparison, the second function also contains the overhead code to convert between data frame and list objects.
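The two functions themselves are not reproduced above. A sketch consistent with the description (the book's exact code may differ slightly):

```r
# Operates on the data frame directly; each assignment can trigger copies
subtract_df <- function(x, medians) {
  for (i in seq_along(medians)) {
    x[[i]] <- x[[i]] - medians[[i]]
  }
  x
}

# Converts to a list first, modifies in place, and converts back
subtract_list <- function(x, medians) {
  x <- as.list(x)
  x <- subtract_df(x, medians)
  list2DF(x)
}
```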
This lets us profile the performance via benchmarks on data frames with differing numbers of columns. Therefore, we create a small helper that creates our random data frame and its medians before it benchmarks the two approaches using the {bench} package [Hester, 2020].
benchmark_medians <- function(ncol) {
  df <- create_random_df(nrow = 1e4, ncol = ncol)
  medians <- vapply(df, median, numeric(1), USE.NAMES = FALSE)

  bench::mark(
    "data frame" = subtract_df(df, medians),
    "list" = subtract_list(df, medians),
    time_unit = "ms"
  )
}
benchmark_medians(1)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
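The results object plotted below is not constructed above. One way to build it, assuming {bench}, {purrr}, and {dplyr} are available and the helpers from this section are defined (a sketch, not the book's exact code):

```r
# Run the benchmark for a range of column counts and row-bind the results,
# tagging each run with its number of columns
ncols <- c(1, 10, 50, 100, 250, 500, 750, 1000)

results <- purrr::map_dfr(
  ncols,
  function(n) dplyr::mutate(benchmark_medians(n), ncol = n)
)
```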
library(ggplot2)
ggplot(
results,
aes(ncol, median, col = attr(expression, "description"))
) +
geom_point(size = 2) +
geom_smooth() +
labs(
x = "Number of Columns",
y = "Execution Time (ms)",
colour = "Data Structure"
) +
theme(legend.position = "top")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
(Figure: execution time in milliseconds plotted against the number of columns, 0 to 1000, for the data frame and list approaches.)
When working directly with the data frame, the execution time grows quadratically with the number of columns in the input data. This is because (e.g.) the first column must be copied n times, the second column n-1 times, and so on. When working with a list, the execution time increases only linearly.
Obviously in the long run, linear growth creates shorter run-times, but there
is some cost to this strategy — we have to convert between data structures
with as.list() and list2DF(). Even though this is fast and probably doesn’t
hurt much, the improved approach doesn’t really pay off in this scenario until
we get to a data frame that is about 300 columns wide (with the exact value
depending on the characteristics of the system running the code).
Q3: What happens if you attempt to use tracemem() on an environment?
A: tracemem() cannot be used to mark and trace environments.
x <- new.env()
tracemem(x)
#> Error in tracemem(x): 'tracemem' is not useful for promise and
#> environment objects
The error occurs because “it is not useful to trace NULL, environments, promises, weak references, or external pointer objects, as these are not duplicated” (see ?tracemem). Environments are always modified in place.
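A quick demonstration of these reference semantics, where no copy is ever made:

```r
e <- new.env()
f <- e    # f and e are two bindings to the same environment

f$x <- 1
e$x       # the modification is visible through both bindings
#> [1] 1
```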
3 Vectors
3.2 Atomic vectors

Q1: How do you create raw and complex scalars? (See ?raw and ?complex.)

A: In R, scalars are represented as vectors of length one. A raw value can be created via as.raw() or charToRaw():

as.raw(42)
#> [1] 2a
charToRaw("A")
#> [1] 41
In the case of complex numbers, real and imaginary parts may be provided
directly to the complex() constructor.
You can create purely imaginary numbers (e.g.) 1i, but there is no way to
create complex numbers without + (e.g. 1i + 1).
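For example:

```r
# Explicit construction from real and imaginary parts
complex(real = 1, imaginary = 1)
#> [1] 1+1i

# Purely imaginary literal combined via addition
1i + 1
#> [1] 1+1i
```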
Q2: Test your knowledge of the vector coercion rules by predicting the output of the following uses of c():

c(1, FALSE)   # coerced to double:    1 0
c("a", 1)     # coerced to character: "a" "1"
c(TRUE, 1L)   # coerced to integer:   1 1
Q3: Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one" < 2 false?

A: These comparisons are carried out by operators that coerce their arguments to a common type. In the examples above, the comparisons are made on characters ("1" == "1"), doubles (-1 < 0), and characters ("one" < "2"), respectively.
DOI: 10.1201/9781003175414-3
3.3 Attributes
Q1: How is setNames() implemented? How is unname() implemented? Read
the source code.
A: setNames() is implemented as:

setNames <- function(object = nm, nm) {
  names(object) <- nm
  object
}
Because the data argument comes first, setNames() also works well with the
magrittr-pipe operator. When no first argument is given, the result is a named
vector (this is rather untypical as required arguments usually come first):

setNames( , c("a", "b", "c"))
#>   a   b   c
#> "a" "b" "c"

unname() removes existing names (or dimnames) by setting them to NULL:

unname <- function(obj, force = FALSE) {
  if (!is.null(names(obj)))
    names(obj) <- NULL
  if (!is.null(dimnames(obj)) && (force || !is.data.frame(obj)))
    dimnames(obj) <- NULL
  obj
}
Q2: What does dim() return when applied to a 1-dimensional vector? When might you use NROW() or NCOL()?

A: dim() (and therefore nrow() and ncol()) returns NULL when applied to a 1-dimensional vector. NROW() and NCOL() instead treat such a vector as a one-column matrix:

x <- 1:10

# Return NULL
dim(x)
#> NULL
nrow(x)
#> NULL
ncol(x)
#> NULL

# Pretend that x is a column vector
NROW(x)
#> [1] 10
NCOL(x)
#> [1] 1
Q3: How would you describe the following three objects? What makes them
different to 1:5?

x1 <- array(1:5, c(1, 1, 5))
x2 <- array(1:5, c(1, 5, 1))
x3 <- array(1:5, c(5, 1, 1))

A: These objects contain the same values as 1:5, but each additionally has a dim attribute: they are 3-dimensional arrays whose five elements lie along a different axis each time.

Q4: An early draft used this code to illustrate structure():

foo <- structure(1:5, comment = "my attribute")
But when you print that object you don’t see the comment attribute. Why?
Is the attribute missing, or is there something else special about it? (Hint: try
using help.)
A: The documentation states (see ?comment):
Contrary to other attributes, the comment is not printed (by print or
print.default).
Also, from ?attributes:
Note that some attributes (namely class, comment, dim, dimnames, names,
row.names and tsp) are treated specially and have restrictions on the values
which can be set.
We can retrieve comment attributes by calling them explicitly:
attributes(foo)
#> $comment
#> [1] "my attribute"
attr(foo, which = "comment")
#> [1] "my attribute"
3.4 S3 atomic vectors
Q1: What sort of object does table() return? What is its type? What attributes does it use? How does it change as you tabulate more variables?

A: table() returns a contingency table of its input variables. It is implemented as an integer vector with class table, using the dim attribute (so it behaves like an array) and dimnames (the levels of each input variable). As we tabulate more variables, the array gains one dimension per variable:

x <- table(mtcars[c("vs", "cyl", "am")])

typeof(x)
#> [1] "integer"
attributes(x)
#> $dim
#> [1] 2 3 2
#>
#> $dimnames
#> $dimnames$vs
#> [1] "0" "1"
#>
#> $dimnames$cyl
#> [1] "4" "6" "8"
#>
#> $dimnames$am
#> [1] "0" "1"
#>
#>
#> $class
#> [1] "table"
#> , , am = 1          # (only the last subtable of the printed output is shown)
#>
#>    cyl
#> vs  4 6 8
#>   0 1 3 2
#>   1 7 0 0
Q2: What happens to a factor when you modify its levels?

f1 <- factor(letters)
levels(f1) <- rev(levels(f1))
A: The underlying integer values stay the same, but the levels are changed,
making it look like the data has changed.
f1 <- factor(letters)
f1
#> [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
#> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
as.integer(f1)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
#> [22] 22 23 24 25 26
Q3: What does this code do? How do f2 and f3 differ from f1?
f2 <- rev(factor(letters))
f3 <- factor(letters, levels = rev(letters))
A: For f2 and f3 either the order of the factor elements or its levels are being
reversed. For f1 both transformations are occurring.
as.integer(f2)
#> [1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6
#> [22] 5 4 3 2 1
3.5 Lists
Q1: List all the ways that a list differs from an atomic vector.
A: To summarise:
• Atomic vectors are always homogeneous (all elements must be of the same type). Lists may be heterogeneous (the elements can be of different types), as described in the introduction of the vectors chapter (https://adv-r.hadley.nz/vectors-chap.html#introduction).
• Atomic vectors point to one address in memory, while lists contain a separate reference for each element. (This was described in the list sections of the vectors (https://adv-r.hadley.nz/vectors-chap.html#lists) and the names and values (https://adv-r.hadley.nz/names-values.html#list-references) chapters.)
lobstr::ref(1:2)
#> [1:0x55ff5ba8d310] <int>
lobstr::ref(list(1:2, 2))
#> █ [1:0x55ff5dd10ff8] <list>
#> ├─[2:0x55ff5c3cdcc8] <int>
#> └─[3:0x55ff5ec90ea0] <dbl>
• Subsetting out-of-bounds or with NA behaves differently: for lists it yields NULL elements, while for atomic vectors it yields NA.

# Subsetting lists
as.list(1:2)[3]
#> [[1]]
#> NULL
as.list(1:2)[NA]
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
Q2: Why do you need to use unlist() to convert a list to an atomic vector?
Why doesn’t as.vector() work?
A: A list is already a vector, though not an atomic one!
Note that as.vector() and is.vector() use different definitions of “vector”!
is.vector(as.vector(mtcars))
#> [1] FALSE
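The different definitions are easy to trip over: is.vector() returns TRUE only if the object carries no attributes apart from names.

```r
x <- 1:3
is.vector(x)
#> [1] TRUE

attr(x, "foo") <- "bar"  # any attribute other than names disqualifies it
is.vector(x)
#> [1] FALSE
```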
Q3: Compare and contrast c() and unlist() when combining a date and
date-time into a single vector.
A: Date and date-time objects are both built upon doubles. Dates store the number of days since the reference date 1970-01-01 (also known as “the Epoch”), while date-time objects (POSIXct) store the number of seconds since then.
date <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")

# Internal representations
unclass(date)
#> [1] 1
unclass(dttm_ct)
#> [1] 3600
#> attr(,"tzone")
#> [1] "UTC"
As the c() generic dispatches only on its first argument, combining date and date-time objects via c() could lead to surprising results in older R versions (pre R 4.0.0): c.Date() incorrectly treated the underlying double of dttm_ct (3600) as days instead of seconds, while c.POSIXct() applied to a date counted one day as one second only.
As of R 4.0.0 these issues have been resolved and both methods now convert
their input first into POSIXct and Date, respectively.
c(dttm_ct, date)
#> [1] "1970-01-01 02:00:00 CET" "1970-01-02 01:00:00 CET"
unclass(c(dttm_ct, date))
#> [1] 3600 86400
c(date, dttm_ct)
#> [1] "1970-01-02" "1970-01-01"
unclass(c(date, dttm_ct))
#> [1] 1 0
However, as c() strips the time zone (and other attributes) of POSIXct objects,
some caution is still recommended.
attributes(c(dttm_ct))
#> $class
#> [1] "POSIXct" "POSIXt"
A package that deals with these kinds of problems in more depth and provides a structural solution for them is the {vctrs} package (https://github.com/r-lib/vctrs) [Wickham et al., 2020b], which is also used throughout the tidyverse [Wickham et al., 2019].
Let’s look at unlist(), which operates on list input:

unlist(list(date, dttm_ct))
#> [1] 1 3600
We see again that dates and date-times are internally stored as doubles. Unfortunately, this is all we are left with once unlist() has stripped the attributes of the list.
To summarise: c() coerces types and strips time zones. Errors may have occurred in older R versions because of inappropriate method dispatch/immature methods. unlist() strips attributes.
3.6 Data frames and tibbles

Q1: Can you have a data frame with zero rows? What about zero columns?

A: Yes, you can create such data frames easily, either directly or via subsetting. Even both dimensions can be zero:

data.frame()
#> data frame with 0 columns and 0 rows
Create similar data frames via subsetting the respective dimension with either 0, NULL, FALSE or a valid 0-length atomic vector (logical(0), character(0), integer(0), double(0)). Negative integer sequences would also work. The following example uses a zero:
mtcars[0, ]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
mtcars[ , 0] # or mtcars[0]
#> data frame with 0 columns and 32 rows
mtcars[0, 0]
#> data frame with 0 columns and 0 rows
Q2: What happens if you attempt to set rownames that are not unique?
A: Matrices can have duplicated row names, so this does not cause problems.
Data frames, however, require unique rownames and you get different results depending on how you attempt to set them. If you set them directly or via row.names(), you get an error:

df <- data.frame(x = 1:3)
row.names(df) <- c("a", "b", "b")
#> Warning: non-unique value when setting 'row.names': 'b'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
Q3: If df is a data frame, what can you say about t(df), and t(t(df))?
Perform some experiments, making sure to try different column types.
A: Both of t(df) and t(t(df)) will return matrices:
df <- data.frame(x = 1:3, y = letters[1:3])

dim(df)
#> [1] 3 2
dim(t(df))
#> [1] 2 3
dim(t(t(df)))
#> [1] 3 2
Because the output is a matrix, every column is coerced to the same type.
(It is implemented within t.data.frame() via as.matrix() which is described
below).
df
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
t(df)
#> [,1] [,2] [,3]
#> x "1" "2" "3"
#> y "a" "b" "c"
Q4: What does as.matrix() do when applied to a data frame with columns
of different types? How does it differ from data.matrix()?
A: The type of the result of as.matrix depends on the types of the input
columns (see ?as.matrix):
The method for data frames will return a character matrix if there is only
atomic columns and any non-(numeric/logical/complex) column, applying
as.vector to factors and format to other non-character columns. Otherwise
the usual coercion hierarchy (logical < integer < double < complex) will
be used, e.g. all-logical data frames will be coerced to a logical matrix,
mixed logical-integer will give an integer matrix, etc.
On the other hand, data.matrix will always return a numeric matrix (see
?data.matrix()).
Return the matrix obtained by converting all the variables in a data frame
to numeric mode and then binding them together as the columns of a
matrix. Factors and ordered factors are replaced by their internal codes.
[…] Character columns are first converted to factors and then to integers.
We can illustrate and compare the mechanics of these functions using a concrete example. as.matrix() makes it possible to retrieve most of the original information from the data frame but leaves us with characters. To retrieve all information from data.matrix()’s output, we would need a lookup table for each column.

df_coltypes <- data.frame(
  a = c("a", "b"),
  b = c(TRUE, FALSE),
  c = c(1L, 0L),
  d = c(1.5, 2),
  e = factor(c("f1", "f2"))
)

as.matrix(df_coltypes)
#> a b c d e
#> [1,] "a" "TRUE" "1" "1.5" "f1"
#> [2,] "b" "FALSE" "0" "2.0" "f2"
data.matrix(df_coltypes)
#> a b c d e
#> [1,] 1 1 1 1.5 1
#> [2,] 2 0 0 2.0 2
4 Subsetting
4.2 Selecting multiple elements

Q1: Fix each of the following common data frame subsetting errors:

mtcars[mtcars$cyl = 4, ]
# use `==` (instead of `=`)
mtcars[-1:4, ]
# use `-(1:4)` (instead of `-1:4`)
mtcars[mtcars$cyl <= 5]
# `,` is missing
mtcars[mtcars$cyl == 4 | 6, ]
# use `mtcars$cyl == 6` (instead of `6`)
# or `%in% c(4, 6)` (instead of `== 4 | 6`)
Q2: Why does the following code yield five missing values? (Hint: why is it
different from x[NA_real_]?)
x <- 1:5
x[NA]
#> [1] NA NA NA NA NA
A: In contrast to NA_real_, NA has logical type, and logical vectors are recycled to the same length as the vector being subset, i.e. x[NA] is recycled to x[c(NA, NA, NA, NA, NA)].
Q3: What does upper.tri() return? How does subsetting a matrix with it
work? Do we need any additional subsetting rules to describe its behaviour?
DOI: 10.1201/9781003175414-4
A: upper.tri(x) returns a logical matrix of the same dimensions as x that is TRUE above the diagonal. Let's illustrate this with a concrete example:

x <- outer(1:5, 1:5, FUN = "*")
x
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1 2 3 4 5
#> [2,] 2 4 6 8 10
#> [3,] 3 6 9 12 15
#> [4,] 4 8 12 16 20
#> [5,] 5 10 15 20 25
upper.tri(x)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] FALSE TRUE TRUE TRUE TRUE
#> [2,] FALSE FALSE TRUE TRUE TRUE
#> [3,] FALSE FALSE FALSE TRUE TRUE
#> [4,] FALSE FALSE FALSE FALSE TRUE
#> [5,] FALSE FALSE FALSE FALSE FALSE
When subsetting with logical matrices, all elements that correspond to TRUE will be selected. Matrices extend vectors with a dimension attribute, so the vector forms of subsetting can be used (including logical subsetting). We should take care that the dimensions of the subsetting matrix match the object of interest; otherwise unintended selections due to vector recycling may occur. Please also note that this form of subsetting returns a vector instead of a matrix, as the subsetting alters the dimensions of the object.
x[upper.tri(x)]
#> [1] 2 3 6 4 8 12 5 10 15 20
Q4: Why does mtcars[1:20] return an error? How does it differ from the
similar mtcars[1:20, ]?
A: When subsetting a data frame with a single vector, it behaves the same
way as subsetting a list of columns. So, mtcars[1:20] would return a data
frame containing the first 20 columns of the dataset. However, as mtcars has
only 11 columns, the index will be out of bounds and an error is thrown.
mtcars[1:20, ] is subsetted with two vectors, so 2d subsetting kicks in, and
the first index refers to rows.
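A quick check of the claim (not part of the original answer):

```r
ncol(mtcars)
#> [1] 11
# List-style subsetting within bounds works fine:
ncol(mtcars[1:11])
#> [1] 11
```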
Q5: Implement your own function that extracts the diagonal entries from a
matrix (it should behave like diag(x) where x is a matrix).
A: The elements in the diagonal of a matrix have the same row and column
indices. This characteristic can be used to create a suitable numeric matrix
used for subsetting.
x <- matrix(1:25, nrow = 5, ncol = 5)
diag2 <- function(x) {
  n <- min(nrow(x), ncol(x))
  idx <- cbind(seq_len(n), seq_len(n))
  x[idx]
}
diag(x)
#> [1] 1 7 13 19 25
diag2(x)
#> [1] 1 7 13 19 25
4.3 Selecting a single element
Q1: Brainstorm as many ways as possible to extract the third value from the cyl variable in the mtcars dataset.
A: The third value of cyl can be selected in many ways:
mtcars[ , "cyl"][[3]]
#> [1] 4
mtcars[["cyl"]][[3]]
#> [1] 4
with(mtcars, cyl[[3]])
#> [1] 4
# Select simultaneously
mtcars[3, 2]
#> [1] 4
mtcars[[c(2, 3)]]
#> [1] 4
Q2: Given a linear model, e.g. mod <- lm(mpg ~ wt, data = mtcars), extract the
residual degrees of freedom. Extract the R squared from the model summary
(summary(mod)).
A: mod is of type list, which opens up several possibilities. We use $ or [[ to
extract a single element:
mod$df.residual
#> [1] 30
mod[["df.residual"]]
#> [1] 30
summary(mod)$r.squared
#> [1] 0.753
4.5 Applications
Q1: How would you randomly permute the columns of a data frame? (This is
an important technique in random forests.) Can you simultaneously permute
the rows and columns in one step?
A: This can be achieved by combining [ and sample():
# Permute columns
mtcars[sample(ncol(mtcars))]
# Permute columns and rows in one step
mtcars[sample(nrow(mtcars)), sample(ncol(mtcars)), drop = FALSE]
Q2: How would you select a random sample of m rows from a data frame?
What if the sample had to be contiguous (i.e. with an initial row, a final row,
and every row in between)?
A: Selecting m random rows from a data frame can be achieved through sub-
setting.
m <- 10
mtcars[sample(nrow(mtcars), m), ]
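For the contiguous case, we can sample a random start row and take the m - 1 following rows; a sketch along the lines of the book's approach:

```r
m <- 10
start <- sample(nrow(mtcars) - m + 1, 1)
end <- start + m - 1
mtcars[start:end, , drop = FALSE]
```

Here drop = FALSE guards the result's data frame shape even for m = 1.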
Q3: How could you put the columns in a data frame in alphabetical order?
A: We combine [ with order() or sort():
mtcars[order(names(mtcars))]
mtcars[sort(names(mtcars))]
5 Control flow
5.2 Choices
Q1: What type of vector does each of the following calls to ifelse() return?
ifelse(TRUE, 1, "no")
ifelse(FALSE, 1, "no")
ifelse(NA, 1, "no")
Read the documentation and write down the rules in your own words.
A: The arguments of ifelse() are named test, yes and no. In general, ifelse()
returns the entry for yes when test is TRUE, the entry for no when test is FALSE
and NA when test is NA. Therefore, the expressions above return vectors of type
double (1), character ("no") and logical (NA).
To be a little more precise, we will cite the part of the documentation on the
return value of ifelse():
A vector of the same length and attributes (including dimensions and
“class”) as test and data values from the values of yes or no. The mode of
the answer will be coerced from logical to accommodate first any values
taken from yes and then any values taken from no.
This is surprising because it uses the type of test. In practice this means that
test is first converted to logical and, if the result is neither TRUE nor FALSE,
simply as.logical(test) is returned.
ifelse(logical(), 1, "no")
#> logical(0)
ifelse(NaN, 1, "no")
#> [1] NA
ifelse(NA_character_, 1, "no")
#> [1] NA
ifelse("a", 1, "no")
#> [1] NA
ifelse("true", 1, "no")
#> [1] 1
Q2: Why does the following code work?
x <- 1:10
if (length(x)) "not empty" else "empty"
#> [1] "not empty"
x <- numeric()
if (length(x)) "not empty" else "empty"
#> [1] "empty"
A: if() expects a logical condition, but also accepts a numeric vector where
0 is treated as FALSE and all other numbers are treated as TRUE. Numerical
missing values (including NaN) lead to an error in the same way that a logical
missing, NA, does.
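A few quick illustrations of these coercion rules (not from the original answer):

```r
if (1) "yes"            # non-zero numbers count as TRUE
if (0) "yes" else "no"  # zero counts as FALSE
try(if (NaN) "yes")     # numeric missing values raise an error
```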
5.3 Loops
Q1: Why does this code succeed without errors or warnings?
x <- numeric()
out <- vector("list", length(x))
for (i in 1:length(x)) {
out[i] <- x[i] ^ 2
}
out
A: This loop is a delicate issue, and we have to consider a few points to explain
why it is evaluated without raising any errors or warnings.
The root of this code smell is the statement 1:length(x), which creates the
index of the for loop. As x has length 0, 1:length(x) counts down from 1 to
0. This issue is typically avoided via usage of seq_along(x) or similar helpers,
which would just generate integer(0) in this case.
As we use [<- and [ for indexing 0-length vectors at their first and zeroth
position, we need to be aware of their subsetting behaviour for out-of-bounds
and zero indices.
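These rules can be seen directly (a quick illustration, not from the original answer):

```r
x <- numeric()
1:length(x)  # the loop index counts *down* from 1 to 0
#> [1] 1 0
x[1]         # out-of-bounds read returns NA
#> [1] NA
x[0]         # zero index returns an empty vector
#> numeric(0)
out <- vector("list", 0)
out[1] <- x[1] ^ 2  # iteration 1: assigns NA at position 1
out[0] <- x[0] ^ 2  # iteration 2: zero index, a silent no-op
out
#> [[1]]
#> [1] NA
```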
Q2: When the following code is evaluated, what can you say about the vector being iterated?
xs <- c(1, 2, 3)
for (x in xs) {
xs <- c(xs, x * 2)
}
xs
#> [1] 1 2 3 2 4 6
A: In this loop x takes on the values of the initial xs (1, 2 and 3), indicating
that it is evaluated just once in the beginning of the loop, not after each
iteration. (Otherwise, we would run into an infinite loop.)
Q3: What does the following code tell you about when the index is updated?
for (i in 1:3) {
i <- i * 2
print(i)
}
#> [1] 2
#> [1] 4
#> [1] 6
A: In a for loop the index is reassigned at the beginning of each iteration, so modifying it inside the loop body has no effect on the subsequent iterations.
6 Functions
6.2 Function fundamentals
Q1: Given a name, like "mean", match.fun() lets you find a function. Given a function, can you find its name? Why doesn't that make sense in R?
A: While the function in the first line is not bound to a name, multiple names
(f1, f2 and f3) point to the second function. So, the main point is that the
relation between name and object is only clearly defined in one direction.
Besides that, there are obviously ways to search for function names. However,
to be sure to find the right one(s), you should not only compare the code
(body) but also the arguments (formals) and the creation environment. As
formals(), body() and environment() all return NULL for primitive functions,
the easiest way to check if two functions are exactly equal is just to use
identical().
Q2: It’s possible (although typically not useful) to call an anonymous function.
Which of the two approaches below is correct? Why?
function(x) 3()
#> function(x) 3()
(function(x) 3)()
#> [1] 3
A: The first approach does not call anything: it defines a new anonymous function whose body is 3(), and R simply prints that definition. The second approach wraps the anonymous function in parentheses, so the trailing () calls it and returns 3.
Q3: A good rule of thumb is that an anonymous function should fit on one
line and shouldn’t need to use {}. Review your code. Where could you have
used an anonymous function instead of a named function? Where should you
have used a named function instead of an anonymous function?
A: The use of anonymous functions allows concise and elegant code in certain
situations. However, they miss a descriptive name and when re-reading the
code, it can take a while to figure out what they do. That’s why it’s helpful
to give long and complex functions a descriptive name. It may be worthwhile
to take a look at your own projects or other people’s code to reflect on this
part of your coding style.
Q4: What function allows you to tell if an object is a function? What function
allows you to tell if a function is a primitive function?
A: Use is.function() to test if an object is a function. Consider using
is.primitive() to test specifically for primitive functions.
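For example:

```r
is.function(mean)   # TRUE
is.primitive(mean)  # FALSE: mean() is a closure written in R
is.primitive(sum)   # TRUE: sum() is implemented in C
```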
Q5: This code makes a list of all functions in the {base} package. Use it to
answer the following questions:
a. Which base function has the most arguments?
b. How many base functions have no arguments? What's special about those functions?
c. How could you adapt the code to find all primitive functions?
A:
a. To find the function with the most arguments, we first compute the
length of formals().
library(purrr)
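The setup code is elided above; a version consistent with the question's description collects the {base} functions and maps formals() over them:

```r
library(purrr)
funs <- Filter(is.function, mget(ls("package:base"), inherits = TRUE))
n_args <- funs %>%
  map(formals) %>%
  map_int(length)
```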
Then we sort n_args in decreasing order and look at its first entries.
n_args %>%
sort(decreasing = TRUE) %>%
head()
#> scan format.default source
#> 22 16 16
#> formatC library merge.data.frame
#> 15 13 13
sum(n_args == 0)
#> [1] 248
However, this overcounts because formals() returns NULL for primitive
functions, and length(NULL) is 0. To fix this, we can first remove
the primitive functions:
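A self-contained sketch of this step (mirroring the setup above):

```r
library(purrr)
funs <- Filter(is.function, mget(ls("package:base"), inherits = TRUE))
fun_no_prim <- Filter(Negate(is.primitive), funs)
n_args2 <- fun_no_prim %>% map(formals) %>% map_int(length)
```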
sum(n_args2 == 0)
#> [1] 47
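For part c, we can filter on is.primitive() instead (a sketch):

```r
funs_prim <- Filter(is.primitive, mget(ls("package:base"), inherits = TRUE))
length(funs_prim)
```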
6.4 Lexical scoping
Q1: What does the following code return? Why? Describe how each of the three c's is interpreted.
c <- 10
c(c = c)
A: This code returns a named numeric vector of length one — with one
element of the value 10 and the name "c". The first c represents the c()
function, the second c is interpreted as a (quoted) name and the third c as a
value.
Q2: What are the four principles that govern how R looks for values?
A: R's lexical scoping (https://adv-r.hadley.nz/functions.html#lexical-scoping)
rules are based on these four principles: name masking, functions versus
variables, a fresh start, and dynamic lookup.
Q3: What does the following function return? Make a prediction before running the code yourself.
f <- function(x) {
f <- function(x) {
f <- function() {
x ^ 2
}
f() + 1
}
f(x) * 2
}
f(10)
A: Within this nested function two more functions also named f are defined
and called. Because the functions are each executed in their own environment
R will look up and use the functions defined last in these environments. The
innermost f() is called last, though it is the first function to return a value.
Therefore, the order of the calculation passes “from the inside to the outside”
and the function returns ((10 ^ 2) + 1) * 2, i.e. 202.
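6.5 Lazy evaluation
Q1: What important property of && makes x_ok() work? The exercise supplies this definition, matching the outputs that follow:

```r
x_ok <- function(x) {
  !is.null(x) && length(x) == 1 && x > 0
}
```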
x_ok(NULL)
#> [1] FALSE
x_ok(1)
#> [1] TRUE
x_ok(1:3)
#> [1] FALSE
What is different with this code? Why is this behaviour undesirable here?
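The second variant replaces && with the vectorised &:

```r
x_ok <- function(x) {
  !is.null(x) & length(x) == 1 & x > 0
}
```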
x_ok(NULL)
#> logical(0)
x_ok(1)
#> [1] TRUE
x_ok(1:3)
#> [1] FALSE FALSE FALSE
A: In summary: && short-circuits which means that if the left-hand side is FALSE
it doesn’t evaluate the right-hand side (because it doesn’t matter). Similarly,
if the left-hand side of || is TRUE it doesn’t evaluate the right-hand side.
We expect x_ok() to validate its input via certain criteria: it must not be NULL,
have length 1 and be greater than 0. Meaningful outcomes for this assertion
will be TRUE, FALSE or NA. The desired behaviour is reached by combining the
assertions through && instead of &.
&& does not perform elementwise comparisons; instead it uses the first element
of each value only. It also uses lazy evaluation, in the sense that evaluation
“proceeds only until the result is determined” (from ?Logic). This means that
the RHS of && won’t be evaluated if the LHS already determines the outcome
of the comparison (e.g. evaluate to FALSE). This behaviour is also known as
“short-circuiting”. For some situations (x = 1) both operators will lead to the
same result. But this is not always the case. For x = NULL, the &&-operator
will stop after the !is.null statement and return the result. The following
conditions won’t even be evaluated! (If the other conditions are also evaluated
(by the use of &), the outcome would change. NULL > 0 returns logical(0),
which is not helpful in this case.)
We can also see the difference in behaviour, when we set x = 1:3. The &&-
operator returns the result from length(x) == 1, which is FALSE. Using & as
the logical operator leads to the (vectorised) x > 0 condition being evaluated
and also returned.
Q2: What does this function return? Why? Which principle does it illustrate?
f2 <- function(x = z) {
z <- 100
x
}
f2()
A: The function returns 100. The default argument (x = z) gets lazily eval-
uated within the function environment when x gets accessed. At this time z
has already been bound to the value 100. The illustrated principle here is lazy
evaluation.
Q3: What does this function return? Why? Which principle does it illustrate?
y <- 10
f1 <- function(x = {y <- 1; 2}, y = 0) {
c(x, y)
}
f1()
#> [1] 2 1
y
#> [1] 10
A: Evaluating x's default, {y <- 1; 2}, first assigns 1 to a local y inside the
function and then returns 2, so f1() returns c(2, 1). Because that assignment
happens in the function's environment, the global y keeps its value 10. This
again illustrates lazy evaluation: the default for y (0) is never used, as y has
already been assigned by the time it is accessed.
Q4: In hist(), the default value of xlim is range(breaks), the default value
for breaks is "Sturges", and
range("Sturges")
#> [1] "Sturges" "Sturges"
Explain how hist() works to get a correct xlim value.
A: The xlim argument defines the range of the histogram's x-axis. Although
breaks defaults to "Sturges", it is transformed inside hist() and its final value
depends highly on the input. But hist() ensures that breaks evaluates to a numeric
vector containing at least two unique elements before xlim is computed.
Q5: Explain why this function works. Why is it confusing?
show_time <- function(x = stop("Error!")) {
  stop <- function(...) Sys.time()
  print(x)
}
show_time()
show_time(x = stop("Error!"))
#> Error in print(x): Error!
A: Thanks to lazy evaluation, the default argument is evaluated inside the function, where stop() has been masked by a function returning Sys.time(), so show_time() prints the current time. A user-supplied argument, however, is evaluated in the calling environment, where the masking does not apply, so stop("Error!") raises an error. The confusing part is that the same expression behaves differently depending on whether it is the default or supplied explicitly.
Q6: How many arguments are required when calling library()?
str(formals(library))
#> Dotted pair list of 13
#> $ package : symbol
#> $ help : symbol
#> $ pos : num 2
#> $ lib.loc : NULL
#> $ character.only : logi FALSE
#> $ logical.return : logi FALSE
#> $ warn.conflicts : symbol
#> $ quietly : logi FALSE
#> $ verbose : language getOption("verbose")
#> $ mask.ok : symbol
#> $ exclude : symbol
#> $ include.only : symbol
#> $ attach.required: language missing(include.only)
A: Although str(formals(library)) lists 13 arguments, none of them is strictly required: the symbol-typed arguments have no default, but they are only accessed when needed, and calling library() without any arguments simply lists all available packages.
6.6 ... (dot-dot-dot)
Q1: Explain the following results:
sum(1, 2, 3)
#> [1] 6
mean(1, 2, 3)
#> [1] 1
A: Let’s inspect the arguments and their order for both functions. For sum()
these are ... and na.rm:
str(sum)
#> function (..., na.rm = FALSE)
For the ... argument sum() expects numeric, complex, or logical vector
input (see ?sum). Unfortunately, when ... is used, misspelled arguments (!) like
na.omit won't raise an error (in case of no further input checks). So instead,
na.omit is treated as a logical and becomes part of the ... argument. It will be
coerced to 1 and be part of the sum. All other arguments are left unchanged.
Therefore sum(1, 2, 3) returns 6 and sum(1, 2, 3, na.omit = TRUE) returns
7.
In contrast, the generic function mean() expects x, trim, na.rm and ... for its
default method.
str(mean.default)
#> function (x, trim = 0, na.rm = FALSE, ...)
As na.omit is not one of mean()'s named arguments (and not a candidate for
partial matching either), na.omit again becomes part of the ... argument. However,
in contrast to sum() the elements of ... are not “part” of the mean. The other
supplied arguments are matched by their order, i.e. x = 1, trim = 2 and na.rm
= 3. As x is of length 1 and not NA, the settings of trim and na.rm do not
affect the calculation of the mean. Both calls (mean(1, 2, 3) and mean(1, 2,
3, na.omit = TRUE)) return 1.
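We can verify these claims directly (a quick check, not part of the original answer):

```r
sum(1, 2, 3, na.omit = TRUE)   # TRUE is coerced to 1 and summed
#> [1] 7
mean(1, 2, 3, na.omit = TRUE)  # only x = 1 affects the result
#> [1] 1
```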
Q2: Explain how to find the documentation for the named arguments in the
following function call:
plot(1:10, col = "red", pch = 20, xlab = "x", col.lab = "blue")
A: First we type ?plot in the console and check the “Usage” section which
contains:
plot(x, y, ...)
The arguments we want to learn more about (col, pch, xlab, col.lab) are part
of the ... argument. There we can find information for the xlab argument and
a recommendation to visit ?par for the other arguments. Under ?par we type
“col” into the search bar, which leads us to the section “Color Specification”.
We also search for the pch argument, which leads to the recommendation to
check ?points. Finally, col.lab is also directly documented within ?par.
Q3: Why does plot(1:10, col = "red") only colour the points, not the axes
or labels? Read the source code of plot.default() to find out.
A: To learn about the internals of plot.default() we add browser() to the
first line of the code and interactively run plot(1:10, col = "red"). This way
we can see how the plot is built and learn where the axes are added.
This leads us to the function call
localTitle(main = main, sub = sub, xlab = xlab, ylab = ylab, ...)
The localTitle() function was defined in the first lines of plot.default() as:
localTitle <- function(..., col, bg, pch, cex, lty, lwd) title(...)
The call to localTitle() passes the col parameter as part of the ... argument
to title(). ?title tells us that the title() function specifies four parts of
the plot: the main title, the sub-title, and both axis labels.
Therefore, it would introduce ambiguity inside title() to use col directly.
Instead, one has the option to supply col via the ... argument, via col.lab
or as part of xlab in the form xlab = list(c("index"), col = "red") (similar
for ylab).
Q3: How does the chdir parameter of source() compare to with_dir()? Why
might you prefer one to the other?
A: The with_dir() helper in question looks like this:
with_dir <- function(dir, code) {
  old <- setwd(dir)
  on.exit(setwd(old))
  force(code)
}
with_dir() takes a path for a working directory (dir) as its first argument.
This is the directory where the provided code (code) should be executed.
Therefore, the current working directory is changed in with_dir() via setwd().
Then, on.exit() ensures that the modification of the working directory is reset
to the initial value when the function exits. By passing the path explicitly, the
user has full control over the directory to execute the code in.
In source() the code is passed via the file argument (a path to a file). The
chdir argument specifies if the working directory should be changed to the
directory containing the file. The default for chdir is FALSE, so you don’t have
to provide a value. However, as you can only provide TRUE or FALSE, you are
also less flexible in choosing the working directory for the code execution.
Q4: Write a function that opens a graphics device, runs the supplied code, and
closes the graphics device (always, regardless of whether or not the plotting
code works).
A: To control the graphics device we use pdf() and dev.off(). To ensure a
clean termination on.exit() is used.
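A sketch of such a function (the file name "test.pdf" is our choice):

```r
plot_pdf <- function(code) {
  pdf("test.pdf")
  on.exit(dev.off(), add = TRUE)
  code
}
```

Because dev.off() is registered via on.exit(), the device is closed even if the plotting code throws an error.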
capture.output2 <- function(code) {
  temp <- tempfile()
  on.exit(file.remove(temp), add = TRUE, after = TRUE)
  sink(temp)
  on.exit(sink(), add = TRUE, after = TRUE)
  force(code)
  readLines(temp)
}
capture.output2(cat("a", "b", "c", sep = "\n"))
#> [1] "a" "b" "c"
capture.output({1})
#> [1] "[1] 1"
capture.output2({1})
#> character(0)
6.8 Function forms
Q1: Rewrite the following code snippets into prefix form:
1 + 2 + 3
1 + (2 + 3)
A: Let’s rewrite the expressions to match the exact syntax from the code
above. Because prefix functions already define the execution order, we may
omit the parentheses in the second expression.
`+`(`+`(1, 2), 3)
`+`(1, `+`(2, 3))
Q4: Explain why the following code fails:
modify(get("x"), 1) <- 10
#> Error: target of assignment expands to non-language object
A: First, let’s define x and recall the definition of modify() from Advanced R:
x <- 1:3
`modify<-` <- function(x, position, value) {
  x[position] <- value
  x
}
R internally transforms the code, and the transformed code reproduces the
error above:
get("x") <- 2
#> Error in get("x") <- 2 :
#> target of assignment expands to non-language object
Q5: Write your own version of + that pastes its inputs together if they are
character vectors but behaves as usual otherwise. In other words, make this
code work:
1 + 2
#> [1] 3
"a" + "b"
#> [1] "ab"
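A definition that satisfies these tests (a sketch; the e2 = 0L default makes unary + work, and && avoids touching a missing argument):

```r
`+` <- function(e1, e2 = 0L) {
  if (is.character(e1) && is.character(e2)) {
    paste0(e1, e2)
  } else {
    base::`+`(e1, e2)
  }
}
```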
# Test
+ 1
#> [1] 1
1 + 2
#> [1] 3
"a" + "b"
#> [1] "ab"
Q6: Create a list of all the replacement functions found in the {base} package.
Which ones are primitive functions? (Hint use apropos())
A: The hint suggests looking for functions with a specific naming pattern:
replacement functions conventionally end in "<-". We can search for these
objects by supplying the regular expression "<-$" to apropos(). apropos()
also allows us to return the position on the search path (search()) for each of
its matches via setting where = TRUE. Finally, we can set mode = "function" to
narrow down our search to relevant objects only. This gives us the following
statement to begin with:
repls <- apropos("<-$", where = TRUE, mode = "function")
head(repls, 30)
#> 10 10 10
#> "[<-.factor" "[<-.numeric_version" "[<-.POSIXct"
#> 10 10 10
#> "[<-.POSIXlt" "@<-" "<-"
#> 10 10 10
#> "<<-" "$<-" "$<-.data.frame"
#> 8 10 10
#> "as<-" "attr<-" "attributes<-"
#> 8 10 10
#> "body<-" "body<-" "class<-"
#> 8 10 10
#> "coerce<-" "colnames<-" "comment<-"
#> 3 10 10
#> "contrasts<-" "diag<-" "dim<-"
To find out which of these functions are primitives, we first search for
these functions via mget() and then subset the result using Filter() and
is.primitive().
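A sketch of this step (repls_base below is rebuilt from the apropos() search, restricted to objects found in the base environment):

```r
repls <- apropos("<-$", mode = "function")
repls_base <- mget(repls, envir = baseenv(), ifnotfound = list(NULL))
repls_base_prim <- names(Filter(is.primitive, repls_base))
```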
repls_base_prim
#> [1] "[[<-" "[<-" "@<-"
#> [4] "<-" "<<-" "$<-"
#> [7] "attr<-" "attributes<-" "class<-"
#> [10] "dim<-" "dimnames<-" "environment<-"
#> [13] "length<-" "levels<-" "names<-"
#> [16] "oldClass<-" "storage.mode<-"
Q9: Create infix versions of the set functions intersect(), union(), and
setdiff(). You might call them %n%, %u%, and %/% to match conventions from
mathematics.
A: These infix operators could be defined in the following way. (%/% is chosen
instead of %\%, because \ serves as an escape character.)
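The definitions themselves are thin infix wrappers; x and y below are chosen to match the outputs that follow:

```r
`%n%` <- function(a, b) intersect(a, b)
`%u%` <- function(a, b) union(a, b)
`%/%` <- function(a, b) setdiff(a, b)
x <- c("a", "b", "d")
y <- c("a", "c", "d")
```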
x %u% y
#> [1] "a" "b" "d" "c"
x %n% y
#> [1] "a" "d"
x %/% y
#> [1] "b"
7 Environments
Prerequisites
Just like in Advanced R, we mainly use the {rlang} package [Henry and Wick-
ham, 2020b] to work with environments.
library(rlang)
e1 <- env()
e1$loop <- e1
e1 <- env()
e2 <- env()
e1$loop <- e2
e2$dedoop <- e1
lobstr::ref(e1)
#> █ [1:0x55829be85fc8] <env>
#> └─loop = █ [2:0x55829b601b78] <env>
#> └─dedoop = [1:0x55829be85fc8]
lobstr::ref(e2)
#> █ [1:0x55829b601b78] <env>
#> └─dedoop = █ [2:0x55829be85fc8] <env>
#> └─loop = [1:0x55829b601b78]
Q4: Explain why e[[1]] and e[c("a", "b")] don’t make sense when e is an
environment.
A: The first option doesn’t make sense, because elements of an environment
are not ordered. The second option would return two objects at the same time.
What data structure would they be contained inside?
Q5: Create a version of env_poke() that will only bind new names, never re-
bind old names. Some programming languages only do this, and are known as
single assignment languages (http://en.wikipedia.org/wiki/Assignment_(co
mputer_science)#Single_assignment).
A: As described in Advanced R rlang::env_poke() takes a name (as string)
and a value to assign (or reassign) a binding in an environment.
e3 <- new.env()
So, we want env_poke2() to test, if the supplied name is already present in the
given environment. This can be checked via env_has(). If this is the case, an
(informative) error is thrown.
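A sketch of env_poke2(), assuming {rlang} is attached as in the prerequisites:

```r
env_poke2 <- function(env, name, value) {
  if (env_has(env, name)) {
    abort(paste0("\"", name, "\" is already assigned to a value."))
  }
  env_poke(env, name, value)
  invisible(env)
}
```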
# Test
env_poke2(e3, "b", 100)
e3$b
#> [1] 100
env_poke2(e3, "b", 200)
#> Error: "b" is already assigned to a value.
Q6: What does this function do? How does it differ from <<- and why might
you prefer it?
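The function under discussion is rebind() from Advanced R:

```r
rebind <- function(name, value, env = caller_env()) {
  if (identical(env, empty_env())) {
    stop("Can't find `", name, "`", call. = FALSE)
  } else if (env_has(env, name)) {
    env_poke(env, name, value)
  } else {
    # Keep climbing until an existing binding is found
    rebind(name, value, env_parent(env))
  }
}
```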
A: The primary difference between rebind() and <<- is that rebind() will
only carry out an assignment when it finds an existing binding; unlike <<-
it will never create a new one in the global environment. This behaviour
of <<- is usually undesirable because global variables introduce non-obvious
dependencies between functions.
7.3 Recursing over environments
Our modified version of where() will always recurse until it reaches the empty
environment, regardless of whether it has already found the name. Along the
way, it will check each environment for the given name. Finally, it will return
a list of environments where the binding was found; if no binding was found,
the list will be empty.
Please also note how the list is initialised via the default argument, when
the function is called for the first time. This is a bit confusing, which is why
it’s common to wrap a recursive function inside another, more user friendly,
function.
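A recursive sketch matching this description (assuming {rlang} is attached):

```r
where2 <- function(name, env = caller_env(), results = list()) {
  if (identical(env, empty_env())) {
    # Base case: return everything collected on the way up
    results
  } else {
    # Check this environment, then keep climbing the parent chain
    if (env_has(env, name)) {
      results <- c(results, env)
    }
    where2(name, env_parent(env), results)
  }
}
```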
# Test
e1a <- env(empty_env(), a = 1, b = 2)
e1b <- env(e1a, b = 10, c = 11)
e1c <- env(e1b, a = 12, d = 13)
where2("a", e1c)
#> [[1]]
#> <environment: 0x55829f430ca0>
#>
#> [[2]]
#> <environment: 0x55829f327aa0>
Q2: Write a function called fget() that finds only function objects. It should
have two arguments, name and env, and should obey the regular scoping rules
for functions: if there’s an object with a matching name that’s not a function,
look in the parent. For an added challenge, also add an inherits argument
which controls whether the function recurses up the parents or only looks in
one environment.
A: We follow a similar approach to the previous exercise. This time we addi-
tionally check if the found object is a function and implement an argument to
turn off the recursion, if desired.
fget <- function(name, env = caller_env(), inherits = TRUE) {
  # Base case: a function bound to `name` in this environment
  if (env_has(env, name)) {
    obj <- env_get(env, name)
    if (is.function(obj)) {
      return(obj)
    }
  }
  if (identical(env, emptyenv()) || !inherits) {
    stop("Could not find a function called \"", name, "\"", call. = FALSE)
  }
  # Recursive case
  fget(name, env_parent(env))
}
# Test
mean <- 10
fget("mean", inherits = TRUE)
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x55829be34488>
#> <environment: namespace:base>
7.4 Special environments
Q1: How is search_envs() different from env_parents(global_env())?
search_envs()
#> [[1]] $ <env: global>
#> [[2]] $ <env: package:rlang>
#> [[3]] $ <env: package:stats>
#> [[4]] $ <env: package:graphics>
#> [[5]] $ <env: package:grDevices>
#> [[6]] $ <env: package:utils>
#> [[7]] $ <env: package:datasets>
#> [[8]] $ <env: package:methods>
#> [[9]] $ <env: Autoloads>
#> [[10]] $ <env: package:base>
env_parents(global_env())
#> [[1]] $ <env: package:rlang>
#> [[2]] $ <env: package:stats>
#> [[3]] $ <env: package:graphics>
#> [[4]] $ <env: package:grDevices>
#> [[5]] $ <env: package:utils>
#> [[6]] $ <env: package:datasets>
#> [[7]] $ <env: package:methods>
#> [[8]] $ <env: Autoloads>
#> [[9]] $ <env: package:base>
#> [[10]] $ <env: empty>
A: search_envs() returns the environments on the search path, a chain that starts with the global environment and ends with the base package environment (preceded by the Autoloads environment). env_parents(global_env()) lists the ancestors of the global environment instead; it therefore omits the global environment itself and ends with the empty environment.
Q2: Draw a diagram that shows the enclosing environments of this function:
f1 <- function(x1) {
f2 <- function(x2) {
f3 <- function(x3) {
x1 + x2 + x3
}
f3(3)
}
f2(2)
}
f1(1)
We can also inspect the binding of the environments, adding print statements
to the function definition. Please note that these print statements will be eval-
uated at execution time. Therefore, the execution of f1(1) will print different
results each time we run it.
f1 <- function(x1) {
f2 <- function(x2) {
f3 <- function(x3) {
x1 + x2 + x3
print("f3")
print(env_print())
}
f3(3)
print("f2")
print(env_print())
}
f2(2)
print("f1")
print(env_print())
}
f1(1)
#> [1] "f3"
#> <environment: 0x55829e3a0eb0>
#> parent: <environment: 0x55829e3a1070>
#> bindings:
#> * x3: <dbl>
#> <environment: 0x55829e3a0eb0>
#> [1] "f2"
#> <environment: 0x55829e3a1070>
#> parent: <environment: 0x55829e3a1230>
#> bindings:
#> * f3: <fn>
Q3: Write an enhanced version of str() that provides more information about
functions. Show where the function was found and what environment it was
defined in.
A: To solve this problem, we need to write a function that takes the name
of a function and looks for that function returning both the function and the
environment that it was found in.
fget2 <- function(name, env = caller_env()) {
  # Base case: a function bound to `name` in this environment
  if (env_has(env, name)) {
    obj <- env_get(env, name)
    if (is.function(obj)) {
      return(list(fun = obj, env = env))
    }
  }
  if (identical(env, emptyenv())) {
    stop("Could not find a function called \"", name, "\"",
      call. = FALSE
    )
  }
  # Recursive case
  fget2(name, env_parent(env))
}
fstr <- function(fun_name, env = caller_env()) {
  if (!is.character(fun_name) || length(fun_name) != 1) {
    stop("`fun_name` must be a string.", call. = FALSE)
  }
  fun_env <- fget2(fun_name, env)
  list(
    where = fun_env$env,
    enclosing = fn_env(fun_env$fun)
  )
}
# Test
fstr("mean")
#> $where
#> <environment: base>
#>
#> $enclosing
#> <environment: namespace:base>
Once you have learned about tidy evaluation, you could rewrite fstr() to use
enquo() so that you’d call it more like str(), i.e. fstr(sum).
8 Conditions
Prerequisites
Similar to the environments chapter, we also use functions from the {rlang}
package to work with conditions.
library(rlang)
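The first exercise asks for a strict wrapper around file.remove() that signals an error when the file does not exist; a sketch consistent with the test below:

```r
file_remove_strict <- function(path) {
  if (!file.exists(path)) {
    stop("Can't delete the file \"", path,
         "\" because it doesn't exist.", call. = FALSE)
  }
  file.remove(path)
}
```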
# Test
saveRDS(mtcars, "mtcars.rds")
file_remove_strict("mtcars.rds")
#> [1] TRUE
file_remove_strict("mtcars.rds")
#> Error: Can't delete the file "mtcars.rds" because it doesn't exist.
Q2: What does the appendLF argument to message() do? How is it related to
cat()?
A: The appendLF argument automatically appends a new line to the message.
Let’s illustrate this behaviour with a small example function:
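A definition consistent with the output shown below:

```r
multiline_msg <- function(appendLF = TRUE) {
  message("first", appendLF = appendLF)
  message("second", appendLF = appendLF)
  cat("third")
  cat("fourth")
}
```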
multiline_msg(appendLF = TRUE)
#> first
#> second
#> thirdfourth
multiline_msg(appendLF = FALSE)
#> firstsecondthirdfourth
Comparable behaviour regarding line breaks for cat() can be achieved via
setting its sep argument to "\n".
catch_cnd(stop("An error"))
catch_cnd(abort("An error"))
A: In contrast to stop(), which contains the call, abort() stores the whole
backtrace generated by rlang::trace_back(). This is a lot of extra data!
str(catch_cnd(stop("An error")))
#> List of 2
#> $ message: chr "An error"
#> $ call : language force(expr)
#> - attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
str(catch_cnd(abort("An error")))
#> List of 3
#> $ message: chr "An error"
#> $ trace :List of 3
#> ..$ calls :List of 8
#> .. ..$ : language utils::str(catch_cnd(abort("An error")))
#> .. ..$ : language rlang::catch_cnd(abort("An error"))
#> .. ..$ : language rlang::eval_bare(rlang::expr(tryCatch(!!!handle..
#> .. ..$ : language base::tryCatch(condition = function (x) x, { ...
#> .. ..$ : language base:::tryCatchList(expr, classes, parentenv, h..
#> .. ..$ : language base:::tryCatchOne(expr, names, parentenv, hand..
#> .. ..$ : language base:::doTryCatch(return(expr), name, parentenv..
#> .. ..$ : language base::force(expr)
#> ..$ parents: int [1:8] 0 0 2 2 4 5 6 2
#> ..$ indices: int [1:8] 26 27 28 29 30 31 32 33
#> ..- attr(*, "class")= chr "rlang_trace"
#> ..- attr(*, "version")= int 1
#> $ parent : NULL
#> - attr(*, "class")= chr [1:3] "rlang_error" "error" "condition"
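The next question asks you to predict what show_condition() returns for the calls below; its definition registers one handler per condition class:

```r
show_condition <- function(code) {
  tryCatch(
    error = function(cnd) "error",
    warning = function(cnd) "warning",
    message = function(cnd) "message",
    {
      code
      NULL
    }
  )
}
```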
show_condition(stop("!"))
show_condition(10)
show_condition(warning("?!"))
show_condition({
10
message("?")
warning("?!")
})
The last example is the most interesting and makes us aware of the exiting
qualities of tryCatch(); it will terminate the evaluation of the code as soon as
a registered handler is called.
show_condition({
10
message("?")
warning("?!")
})
#> [1] "message"
withCallingHandlers( # (1)
message = function(cnd) message("b"),
withCallingHandlers( # (2)
message = function(cnd) message("a"),
message("c")
)
)
#> b
#> a
#> b
#> c
First, message("c") is signalled and caught by the inner handler (2), which
calls message("a"). message("a") is in turn caught by the outer handler (1),
which calls message("b"). message("b") isn't caught by anything, so we see a b
on the console, followed by a. But why do we get another b before we see c?
Because the inner handler didn't muffle the message, message("c") bubbles up
to the outer calling handler (1), which emits b once more before c is finally
printed.
Q4: Read the source code for catch_cnd() and explain how it works. At the
time Advanced R was written, the source for catch_cnd() was a little simpler:
rlang::catch_cnd
#> function (expr, classes = "condition")
#> {
#> stopifnot(is_character(classes))
#> handlers <- rep_named(classes, list(identity))
#> eval_bare(rlang::expr(tryCatch(!!!handlers, {
#> force(expr)
#> return(NULL)
#> })))
#> }
#> <bytecode: 0x557856fce1c0>
#> <environment: namespace:rlang>
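Using a single condition handler, show_condition() can be rewritten as (a sketch):

```r
show_condition2 <- function(code) {
  tryCatch(
    condition = function(cnd) {
      if (inherits(cnd, "error"))   return("error")
      if (inherits(cnd, "warning")) return("warning")
      if (inherits(cnd, "message")) return("message")
    },
    {
      code
      NULL
    }
  )
}
```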
# Test
show_condition2(stop("!"))
#> [1] "error"
show_condition2(10)
#> NULL
show_condition2(warning("?!"))
#> [1] "warning"
show_condition2({
10
message("?")
warning("?!")
})
#> [1] "message"
tryCatch() executes the code and captures any condition raised. The function
supplied as the condition handler handles this condition; in this case it
dispatches on the class of the condition.
check_installed <- function(package) {
  if (!requireNamespace(package, quietly = FALSE)) {
    stop("package '", package, "' not installed.", call. = FALSE)
  }
  TRUE
}
check_installed("ggplot2")
#> [1] TRUE
check_installed("ggplot3")
#> Loading required namespace: ggplot3
#> Error: package 'ggplot3' not installed.
Q2: Inside a package you often need to stop with an error when something is
not right. Other packages that depend on your package might be tempted to
check these errors in their unit tests. How could you help these packages to
avoid relying on the error message which is part of the user interface rather
than the API and might change without notice?
A: Instead of returning an error it might be preferable to throw a customised
condition and place a standardised error message inside the metadata. Then
the downstream package could check for the class of the condition, rather than
inspecting the message.
8.6 Applications
Q1: Create suppressConditions() that works like suppressMessages() and sup-
pressWarnings() but suppresses everything. Think carefully about how you
should handle errors.
A: In general, we would like to catch errors, since they contain important in-
formation for debugging. To suppress the error message and hide the returned
error object from the console, we handle errors within a tryCatch() and return
the error object invisibly:
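A sketch of the error-catching wrapper (the name suppressErrors is our choice):

```r
suppressErrors <- function(expr) {
  tryCatch(
    error = function(cnd) invisible(cnd),
    expr
  )
}
```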
After we defined the error handling, we can just combine it with the other
handlers to create suppressConditions():
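A self-contained sketch that combines the error handling with the built-in suppressors:

```r
suppressConditions <- function(expr) {
  tryCatch(
    error = function(cnd) invisible(cnd),
    suppressWarnings(suppressMessages(expr))
  )
}
```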
To test the new function, we apply it to a set of conditions and inspect the
returned error object.
error_obj
#> <error/rlang_error>
#> error
#> Backtrace:
#> 1. global::suppressConditions(...)
#> 12. base::suppressMessages(expr)
#> 13. base::withCallingHandlers(...)
#> NULL
#> })
Q3: How would you modify the catch_cnds() definition if you wanted to recre-
ate the original intermingling of warnings and messages?
A: It looks like Hadley wrote a part of the chapter after the exercises, as the
catch_cnds() function defined in the chapter already solves this problem by
storing all messages and warnings in their original order within a list.
catch_cnds <- function(expr) {
  conds <- list()
  add_cond <- function(cnd) {
    conds <<- append(conds, list(cnd))
    cnd_muffle(cnd)
  }
  tryCatch(
    error = function(cnd) {
      conds <<- append(conds, list(cnd))
    },
    withCallingHandlers(
      message = add_cond,
      warning = add_cond,
      expr
    )
  )
  conds
}
# Test
catch_cnds({
inform("message a")
warn("warning b")
inform("message c")
})
#> [[1]]
#> <message: message a
#> >
#>
#> [[2]]
#> <warning: warning b>
#>
#> [[3]]
#> <message: message c
#> >
Q4: Why is catching interrupts dangerous? Run this code to find out.
bottles_of_beer <- function(i = 99) {
  message("There are ", i, " bottles of beer on the wall, ", i, " bottles of beer.")
  while (i > 0) {
    tryCatch(
      Sys.sleep(1),
      interrupt = function(err) {
        i <<- i - 1
        if (i > 0) {
          message("Take one down, pass it around, ", i,
                  " bottle", if (i > 1) "s", " of beer on the wall.")
        }
      }
    )
  }
  message(
    "No more bottles of beer on the wall, ",
    "no more bottles of beer."
  )
}
bottles_of_beer()
#> There are 99 bottles of beer on the wall, 99 bottles of beer.
#> Take one down, pass it around, 98 bottles of beer on the wall.
#> Take one down, pass it around, 97 bottles of beer on the wall.
#> Take one down, pass it around, 96 bottles of beer on the wall.
#> Take one down, pass it around, 95 bottles of beer on the wall.
#>
At this point you’ll probably recognise how hard it is to get the number of
bottles down from 99 to 0. There’s no way to break out of the function because
we’re capturing the interrupt that you’d usually use!
Part II
Functional programming
9 Functionals
Prerequisites
For the functional programming part of the book, we will mainly use functions
from the {purrr} package [Henry and Wickham, 2020a].
library(purrr)
9.2 My first functional: map()
Besides mixing positions and names, it is also possible to pass along an accessor function. This is basically an anonymous function that gets information about some aspect of the input data. You are free to define your own accessor functions.
If you need to access certain attributes, the helper attr_getter() is already predefined and will create the appropriate accessor function for you.
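For instance, a minimal sketch of such an accessor (our own toy example, extracting the levels attribute of a factor):

```r
library(purrr)

f <- factor(c("a", "b"), levels = c("a", "b", "c"))
# attr_getter() returns a function that extracts the named attribute
get_levels <- attr_getter("levels")
get_levels(f)
#> [1] "a" "b" "c"
```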
as_mapper(~ runif(2))
#> <lambda>
#> function (..., .x = ..1, .y = ..2, . = ..1)
#> runif(2)
#> attr(,"class")
#> [1] "rlang_lambda_function" "function"
as_mapper(runif(2))
#> function (x, ...)
#> pluck(x, 0.0807501375675201, 0.834333037259057, .default = NULL)
#> <environment: 0x55adc3b64390>
A: To solve this exercise we take advantage of calling the type-stable variants of map(), which give us more concise output, and use map_lgl() to select the columns of the data frame (later you’ll learn about keep(), which simplifies this pattern a little).
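As a sketch of the map_lgl() selection pattern mentioned above (using iris, which mixes numeric and factor columns):

```r
library(purrr)

# keep only the numeric columns, then compute their standard deviations
iris_num <- iris[map_lgl(iris, is.numeric)]
map_dbl(iris_num, sd)
```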
map_dbl(mtcars, sd)
#> mpg cyl disp hp drat wt qsec vs
#> 6.027 1.786 123.939 68.563 0.535 0.978 1.787 0.504
#> am gear carb
#> 0.499 0.738 1.615
Q4: The following code simulates the performance of a t-test for non-normal
data. Extract the p-value from each test, then visualise.
A: There are many ways to visualise this data. However, since there are only 100 data points, we choose a dot plot to visualise the distribution. (Unfortunately, {ggplot2}'s geom_dotplot() doesn’t compute proper counts, as it was created to visualise distribution densities instead of frequencies, so a histogram would be a suitable alternative.)
library(ggplot2)
df_trials %>%
ggplot(aes(x = p_value, fill = p_value < 0.05)) +
geom_dotplot(binwidth = .01) + # geom_histogram() as alternative
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "top"
)
Q5: The following code uses a map nested inside another map to apply a
function to every element of a nested list. Why does it fail, and what do you
need to do to make it work?
x <- list(
list(1, c(3, 9)),
list(c(3, 6), 7, c(4, 7, 6))
)
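One way to make the nested call work (a sketch; we wrap the inner map() in an anonymous function, and use a triple() helper as in the exercise):

```r
library(purrr)

x <- list(
  list(1, c(3, 9)),
  list(c(3, 6), 7, c(4, 7, 6))
)
triple <- function(z) z * 3

# the outer map() receives each sublist; the inner map()
# then applies triple() to every element of that sublist
map(x, ~ map(.x, triple))
```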
Q6: Use map() to fit linear models to the mtcars dataset using the formulas
stored in this list:
A: The data (mtcars) is constant for all these models and so we iterate over
the formulas provided. As the formula is the first argument of lm(), we don’t
need to specify it explicitly.
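The list of formulas is not reproduced above; assuming the formulas from the exercise in Advanced R, the iteration can be sketched as:

```r
library(purrr)

formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)

# mtcars stays constant, so we map over the formulas only
models <- map(formulas, lm, data = mtcars)
```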
Q7: Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in
the list below, then extract the 𝑅2 of the model fit (Hint: you can compute
the 𝑅2 with summary())
A: To accomplish this task, we take advantage of the “list in, list out” functionality of map(). This allows us to chain multiple transformations together. We start by fitting the models. We then calculate the summaries and extract the 𝑅2 values. For the last call we use map_dbl(), which provides convenient output.
bootstraps %>%
map(~ lm(mpg ~ disp, data = .x)) %>%
map(summary) %>%
map_dbl("r.squared")
#> [1] 0.588 0.822 0.745 0.746 0.784 0.749 0.613 0.792 0.653 0.726
head(modify(mtcars, 1))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> Mazda RX4 Wag 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> Datsun 710 21 6 160 110 3.9 2.62 16.5 0 1 4 4
Q2: Rewrite the following code to use iwalk() instead of walk2(). What are
the advantages and disadvantages?
A: iwalk() allows us to use a single variable, storing the output path in the
names.
mtcars %>%
split(mtcars$cyl) %>%
set_names(~ file.path(temp, paste0("cyl-", .x, ".csv"))) %>%
iwalk(~ write.csv(.x, .y))
Q3: Explain how the following code transforms a data frame using functions
stored in a list.
nm <- names(trans)
mtcars[nm] <- map2(trans, mtcars[nm], function(f, var) f(var))
A: In the first approach, the list of the two functions (trans) and the two appropriately selected data frame columns (mtcars[nm]) are supplied to map2(). map2() is given an anonymous function (function(f, var) f(var)) which applies each function to its variable as map2() iterates over their (matching) indices. On the left-hand side, the respective two columns of mtcars are replaced by their transformed versions.
The map() variant
does basically the same. However, it directly iterates over the names (nm) of
the transformations. Therefore, the data frame columns are selected during
the iteration.
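The map() variant described here can be sketched as follows (assuming the trans list from the exercise):

```r
library(purrr)

trans <- list(
  disp = function(x) x * 0.0163871,
  am = function(x) factor(x, labels = c("auto", "manual"))
)
nm <- names(trans)

# iterate over the names; columns are selected during the iteration
mtcars[nm] <- map(nm, ~ trans[[.x]](mtcars[[.x]]))
```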
Besides the iteration pattern, the approaches differ in the possibilities for
appropriate argument naming in the .f argument. In the map2() approach
we iterate over the elements of x and y. Therefore, it is possible to choose
appropriate placeholders like f and var. This makes the anonymous function
more expressive at the cost of making it longer. We think using the formula
interface in this way is preferable compared to the rather cryptic mtcars[nm]
<- map2(trans, mtcars[nm], ~ .x(.y)).
In the map() approach we map over the variable names. It is therefore not
possible to introduce placeholders for the function and variable names. The
formula syntax together with the .x pronoun is pretty compact. The object
names and the brackets clearly indicate the application of transformations to
specific columns of mtcars. In this case the iteration over the variable names
comes in handy, as it highlights the importance of matching between trans and
mtcars element names. Together with the replacement form on the left-hand
side, this line is relatively easy to inspect.
To summarise, in situations where map() and map2() provide solutions for an
iteration problem, several points may be considered before deciding for one or
the other approach.
Q4: What does write.csv() return, i.e. what happens if you use it with map2()
instead of walk2()?
A: write.csv() returns NULL. As we call the function for its side effect (creating
a CSV file), walk2() would be appropriate here. Otherwise, we receive a rather
uninformative list of NULLs.
A: The loop inside simple_reduce() always starts with the index 2, and seq()
can count both up and down:
seq(2, 0)
#> [1] 2 1 0
seq(2, 1)
#> [1] 2 1
Therefore, subsetting length-0 and length-1 vectors via [[ will lead to a subscript out of bounds error. To avoid this, we allow simple_reduce() to return before the for loop is started and include a default argument for 0-length vectors.
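A sketch of the amended simple_reduce(), consistent with the behaviour tested below:

```r
simple_reduce <- function(x, f, default) {
  # return early for empty and length-1 inputs
  if (length(x) == 0L) return(default)
  if (length(x) == 1L) return(x[[1L]])

  out <- x[[1]]
  for (i in seq(2, length(x))) {
    out <- f(out, x[[i]])
  }
  out
}
```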
simple_reduce(integer(0), `+`)
#> Error in simple_reduce(integer(0), `+`): argument "default" is
#> missing, with no default
simple_reduce(integer(0), `+`, default = 0L)
#> [1] 0
simple_reduce(1, `+`)
#> [1] 1
simple_reduce(1:3, `+`)
#> [1] 6
Q3: Implement the span() function from Haskell: given a list x and a predicate
function f, span(x, f) returns the location of the longest sequential run of
elements where the predicate is true. (Hint: you might find rle() helpful.)
A: Our span_r() function returns the indices of the (first occurring) longest
sequential run of elements where the predicate is true. If the predicate is never
true, the longest run has length 0, in which case we return integer(0).
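A sketch of such a span_r(), using rle() as hinted (the helper name and internals are our own):

```r
library(purrr)

span_r <- function(x, f) {
  idx <- unname(map_lgl(x, ~ f(.x)))
  rle <- rle(idx)

  # check if the predicate is never true
  if (!any(rle$values)) {
    return(integer(0))
  }

  # find the length of the longest TRUE run and its first occurrence
  longest <- max(rle$lengths[rle$values])
  longest_idx <- which(rle$values & rle$lengths == longest)[1]

  # add up the lengths of all runs before the longest run
  ind_before_longest <- sum(rle$lengths[seq_len(longest_idx - 1)])

  seq(ind_before_longest + 1, ind_before_longest + longest)
}

span_r(c(0, 0, 0, 0, 0), is.na)
#> integer(0)
span_r(c(NA, 0, NA, NA, NA), is.na)
#> [1] 3 4 5
```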
arg_max(-10:5, function(x) x ^ 2)
#> [1] -10
arg_min(-10:5, function(x) x ^ 2)
#> [1] 0
Q5: The function below scales a vector so it falls in the range [0, 1]. How
would you apply it to every column of a data frame? How would you apply it
to every numeric column in a data frame?
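A: The scaling function from the exercise, and one way to apply it column-wise (a sketch using {purrr}):

```r
library(purrr)

scale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# every column (assumes all columns are numeric, e.g. mtcars)
modify(mtcars, scale01)

# only the numeric columns of a mixed data frame
modify_if(iris, is.numeric, scale01)
```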
When we apply the head() function over the first margin of arr2 (i.e. the rows), the results are contained in the columns of the output, transposing the array compared to the original input.
And vice versa if we apply over the second margin (the columns):
The output of apply() is organised first by the margins being operated over,
then the results of the function. This can become quite confusing for higher
dimensional arrays.
Q2: What do eapply() and rapply() do? Does {purrr} have equivalents?
A: eapply() is a variant of lapply(), which iterates over the (named) elements
of an environment. In {purrr} there is no equivalent for eapply() as {purrr}
mainly provides functions that operate on vectors and functions, but not on
environments.
rapply() applies a function to all elements of a list recursively. It makes it possible to limit the application of the function to specified classes (default classes = "ANY"). One may also specify how elements of other classes should be treated: left unchanged (how = "replace") or replaced by a default value (deflt = NULL). The closest equivalent in {purrr} is modify_depth(), which allows you to modify elements at a specified depth in a nested list.
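A brief illustration of modify_depth() (our own toy example):

```r
library(purrr)

x <- list(list(1, 2), list(3, 4, 5))
# modify the elements two levels down in the nested list
modify_depth(x, 2, ~ .x * 10)
```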
Q3: Challenge: read about the fixed point algorithm (https://mitpress.mit
.edu/sites/default/files/sicp/full-text/book/book-Z-H-12.html#%25_idx
_1096). Complete the exercises using R.
A: A number 𝑥 is called a fixed point of a function 𝑓 if it satisfies the equation 𝑓(𝑥) = 𝑥. For some functions we may find a fixed point by beginning with an initial guess and applying 𝑓 repeatedly, until successive values differ by less than a predefined tolerance.
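This iteration can be sketched as follows (function and argument names are our own; cos() is a classic example, with fixed point near 0.739):

```r
fixed_point <- function(f, x_init, n_max = 10000, tol = 0.0001) {
  x <- x_init
  y <- f(x)
  n <- 0
  # apply f repeatedly until successive values are close enough
  while (abs(x - y) > tol && n < n_max) {
    x <- y
    y <- f(x)
    n <- n + 1
  }
  y
}

fixed_point(cos, x_init = 1)
```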
Prerequisites
For most of this chapter base R [R Core Team, 2020] is sufficient. Just a few
exercises require the {rlang} [Henry and Wickham, 2020b], {dplyr} [Wickham
et al., 2020a], {purrr} [Henry and Wickham, 2020a] and {ggplot2} [Wickham,
2016] packages.
library(rlang)
library(dplyr)
library(purrr)
library(ggplot2)
force
#> function (x)
#> x
#> <bytecode: 0x5609b8696740>
#> <environment: namespace:base>
Q2: Base R contains two function factories, approxfun() and ecdf(). Read
their documentation and experiment to figure out what the functions do and
what they return.
A: Let’s begin with approxfun() as it is used within ecdf() as well:
approxfun() takes a combination of data points (x and y values) as input and
returns a stepwise linear (or constant) interpolation function. To find out what
this means exactly, we first create a few random data points.
x <- runif(10)
y <- runif(10)
plot(x, y, lwd = 10)
(Figure: scatterplot of the ten random (x, y) points.)
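The interpolating functions used below, f_lin() and f_con(), can be created along these lines:

```r
x <- runif(10)  # as above
y <- runif(10)

# f_lin() interpolates linearly between the points;
# f_con() returns the y value of the next smallest x value
f_lin <- approxfun(x, y)
f_con <- approxfun(x, y, method = "constant")

# both reproduce the initial points exactly
f_lin(x[1:2])
f_con(x[1:2])
```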
When we apply these functions to new x values, these are mapped to the lines
connecting the initial y values (linear case) or to the same y value as for the
next smallest initial x value (constant case).
(Figure: the interpolated values of f_lin() and f_con() overlaid on the original points.)
f_lin(range(x))
#> [1] 0.402 0.175
f_con(range(x))
#> [1] 0.402 0.175
To change this behaviour, one can set rule = 2. This leads to the result that for
values outside of range(x) the boundary values of the function are returned.
f_lin(c(-Inf, Inf))
#> [1] 0.402 0.175
f_con(c(-Inf, Inf))
#> [1] 0.402 0.175
Alternatively, the yleft and yright arguments directly specify the values to be returned outside of range(x) (below, f_lin() was recreated with yleft = 5 only, and f_con() with yleft = 5 and yright = -5):
f_lin(c(-Inf, Inf))
#> [1] 5 NA
f_con(c(-Inf, Inf))
#> [1] 5 -5
Further, approxfun() provides the option to shift the y values for method =
"constant" between their left and right values. According to the documenta-
tion this indicates a compromise between left- and right-continuous steps.
Finally, the ties argument makes it possible to aggregate the y values when multiple ones were provided for the same x value. For example, in the following line we use mean() to aggregate these y values before they are used for the interpolation: approxfun(x = c(1, 1, 2), y = 1:3, ties = mean).
Next, we focus on ecdf(). “ecdf” is an acronym for empirical cumulative distribution function. For a numeric input vector, ecdf() initially creates the (x, y) pairs for the nodes of the distribution function and then passes these pairs to approxfun(), which gets called with specifically adapted settings (approxfun(vals, cumsum(tabulate(match(x, vals))) / n, method = "constant", yleft = 0, yright = 1, f = 0, ties = "ordered")).
x <- runif(10)
f_ecdf <- ecdf(x)
class(f_ecdf)
#> [1] "ecdf" "stepfun" "function"
(Figure: plot of f_ecdf, a step function increasing from 0 to 1 over the sample x.)
New values are then mapped on the y value of the next smallest x value from
within the initial input.
pick(1)(x)
# should be equivalent to
x[[1]]
lapply(mtcars, pick(5))
# should be equivalent to
lapply(mtcars, function(x) x[[5]])
A: In this exercise pick(i) acts as a function factory, which returns the required subsetting function. force(i) ensures that the index is evaluated when the factory is called:
pick <- function(i) {
  force(i)
  function(x) x[[i]]
}
x <- 1:3
identical(x[[1]], pick(1)(x))
#> [1] TRUE
identical(
lapply(mtcars, function(x) x[[5]]),
lapply(mtcars, pick(5))
)
#> [1] TRUE
Q4: Create a function that creates functions that compute the ith central
moment (http://en.wikipedia.org/wiki/Central_moment) of a numeric vector.
You can test it by running the following code:
m1 <- moment(1)
m2 <- moment(2)
x <- runif(100)
stopifnot(all.equal(m1(x), 0))
stopifnot(all.equal(m2(x), var(x) * 99 / 100))
A: The first moment is closely related to the mean and describes the average deviation from the mean, which is 0 (within numerical margin of error). The second moment describes the variance of the input data. If we want to compare it to var(), we need to undo Bessel’s correction (https://en.wikipedia.org/wiki/Bessel%27s_correction) by multiplying by (𝑁 − 1)/𝑁.
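A sketch of moment() consistent with the tests below:

```r
moment <- function(i) {
  force(i)
  function(x) sum((x - mean(x)) ^ i) / length(x)
}
```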
m1 <- moment(1)
m2 <- moment(2)
x <- runif(100)
all.equal(m1(x), 0) # removed stopifnot() for clarity
#> [1] TRUE
Q5: What happens if you don’t use a closure? Make predictions, then verify
with the code below.
i <- 0
new_counter2 <- function() {
i <<- i + 1
i
}
new_counter2()
#> [1] 1
i
#> [1] 1
new_counter2()
#> [1] 2
i
#> [1] 2
i <- 0
new_counter2()
#> [1] 1
i
#> [1] 1
Q6: What happens if you use <- instead of <<-? Make predictions, then verify
with the code below.
A: Without the super assignment <<-, the counter will always return 1. The counter always starts in a new execution environment within the same enclosing environment, which contains an unchanged value for i (in this case it remains 0).
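The counter referenced below, with the local assignment (<-) in place of <<-:

```r
new_counter_3 <- function() {
  i <- 0
  function() {
    # local assignment: each call runs in a fresh execution
    # environment, so i is always incremented from the unchanged 0
    i <- i + 1
    i
  }
}
```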
new_counter_3()
#> [1] 1
new_counter_3()
#> [1] 1
boot_model <- function(df, formula) {
  mod <- lm(formula, data = df)
  fitted <- unname(fitted(mod))
  resid <- unname(resid(mod))
  rm(mod)
  function() {
    fitted + sample(resid)
  }
}
Q2: Why might you formulate the Box-Cox transformation like this?
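A: The factory from the exercise (restated here as a sketch) creates transformation functions for a fixed dataset, which makes it convenient to explore several values of lambda, e.g. in the plots below:

```r
boxcox3 <- function(x) {
  function(lambda) {
    if (lambda == 0) {
      log(x)
    } else {
      (x ^ lambda - 1) / lambda
    }
  }
}

# fix the data once, then vary only lambda
boxcox_airpassengers <- boxcox3(AirPassengers)
```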
plot(boxcox_airpassengers(0))
plot(boxcox_airpassengers(1))
plot(boxcox_airpassengers(2))
plot(boxcox_airpassengers(3))
(Figure: the four Box-Cox transformed AirPassengers series, boxcox_airpassengers(0) to boxcox_airpassengers(3), plotted against Time, 1950–1960.)
Q3: Why don’t you need to worry that boot_permute() stores a copy of the
data inside the function that it generates?
A: boot_permute() is defined in Advanced R as:
boot_permute <- function(df, var) {
  n <- nrow(df)
  force(var)
  function() {
    col <- df[[var]]
    col[sample(n, replace = TRUE)]
  }
}
We don’t need to worry that it stores a copy of the data, because it actually
doesn’t store one; it’s just a name that points to the same underlying object
in memory.
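The object measured below can be created from the factory like this (a sketch; following Advanced R, we use the mpg column of mtcars):

```r
boot_mtcars1 <- boot_permute(mtcars, "mpg")
```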
lobstr::obj_size(mtcars)
#> 7,208 B
lobstr::obj_size(boot_mtcars1)
#> 20,232 B
lobstr::obj_sizes(mtcars, boot_mtcars1)
#> * 7,208 B
#> * 13,024 B
ll_poisson1 <- function(x) {
  n <- length(x)
  function(lambda) {
    log(lambda) * sum(x) - n * lambda - sum(lfactorial(x))
  }
}
ll_poisson2 <- function(x) {
  n <- length(x)
  sum_x <- sum(x)
  c <- sum(lfactorial(x))
  function(lambda) {
    log(lambda) * sum_x - n * lambda - c
  }
}
x1 <- c(41, 30, 31, 38, 29, 24, 30, 29, 31, 38)
bench::mark(
llp1 = optimise(ll_poisson1(x1), c(0, 100), maximum = TRUE),
llp2 = optimise(ll_poisson2(x1), c(0, 100), maximum = TRUE)
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 llp1 31.8µs 36.7µs 26540. 12.8KB 29.2
#> 2 llp2 16µs 18.2µs 52523. 0B 26.3
To compare both versions over a range of input lengths, the benchmark can be wrapped in a function:
bench_poisson <- function(x_length) {
  x <- rpois(x_length, 100L)
  bench::mark(
    llp1 = optimise(ll_poisson1(x), c(0, 100), maximum = TRUE),
    llp2 = optimise(ll_poisson2(x), c(0, 100), maximum = TRUE),
    time_unit = "ms"
  )
}
(Figure: median benchmark times of llp1 and llp2 for input lengths from 1e+01 to 1e+05.)
10.5 Function factories + functionals
Q1: Which of the following commands is equivalent to with(x, f(z))?
(a) x$f(x$z).
(b) f(x$z).
(c) x$f(z).
(d) f(z).
(e) It depends.
A: (e) “It depends” is the correct answer. Usually with() is used with a data
frame, so you’d usually expect (b), but if x is a list, it could be any of the
options.
f <- mean
z <- 1
x <- list(f = mean, z = 1)
Q2: Compare and contrast the effects of env_bind() vs. attach() for the following code.
attach(funs)
#> The following objects are masked from package:base:
#>
#> mean, sum
mean <- function(x) stop("Hi!")
detach(funs)
env_bind(globalenv(), !!!funs)
mean <- function(x) stop("Hi!")
env_unbind(globalenv(), names(funs))
A: attach() adds funs to the search path. Therefore, the provided functions are found before their respective versions from the {base} package. Further, they cannot accidentally be overwritten by similarly named functions in the global environment. One annoying downside of using attach() is that it is possible to attach the same object multiple times, making it necessary to call detach() equally often.
attach(funs)
#> The following objects are masked from package:base:
#>
#> mean, sum
attach(funs)
#> The following objects are masked from funs (pos = 3):
#>
#> mean, sum
#>
#> The following objects are masked from package:base:
#>
#> mean, sum
head(search())
#> [1] ".GlobalEnv" "funs" "funs"
#> [4] "package:ggplot2" "package:purrr" "package:dplyr"
detach(funs)
detach(funs)
env_bind(globalenv(), !!!funs)
head(search())
#> [1] ".GlobalEnv" "package:ggplot2" "package:purrr"
#> [4] "package:dplyr" "package:rlang" "package:stats"
11
Function operators
Prerequisites
In this third chapter on functional programming, we again make relatively frequent use of the {purrr} package.
library(purrr)
# Application
vrep(1:2, 3:4)
#> [[1]]
#> [1] 1 1 1
#>
#> [[2]]
#> [1] 2 2 2 2
possibly
#> function (.f, otherwise, quiet = TRUE)
#> {
#> .f <- as_mapper(.f)
#> force(otherwise)
#> function(...) {
#> tryCatch(.f(...), error = function(e) {
#> if (!quiet)
#> message("Error: ", e$message)
#> otherwise
#> })
#> }
#> }
safely
#> function (.f, otherwise = NULL, quiet = TRUE)
#> {
#> .f <- as_mapper(.f)
#> function(...) capture_error(.f(...), otherwise, quiet)
#> }
#> <bytecode: 0x56374811ce48>
#> <environment: namespace:purrr>
purrr:::capture_error
#> function (code, otherwise = NULL, quiet = TRUE)
#> {
#> tryCatch(list(result = code, error = NULL), error = function(e) {
#> if (!quiet)
#> message("Error: ", e$message)
#> list(result = otherwise, error = e)
#> }, interrupt = function(e) {
#> stop("Terminated by user", call. = FALSE)
#> })
#> }
changes <- c(
  if (length(added) > 0) paste0(" * '", added, "' was added"),
  if (length(removed) > 0) paste0(" * '", removed, "' was removed")
)
message(paste(changes, collapse = "\n"))
}
f(...)
}
}
file_create("delete_me")
#> * 'delete_me' was added
#> [1] TRUE
file_remove("delete_me")
Now, let’s check if our logger() works as intended and apply it to the mean()
function:
readLines(log_path)
#> [1] "created at: 2020-12-27 23:30:26"
#> [2] "called at: 2020-12-27 23:30:31"
#> [3] "called at: 2020-12-27 23:30:32"
delay_by <- function(f, amount) {
  force(f)
  force(amount)
  function(...) {
    Sys.sleep(amount)
    f(...)
  }
}
To ensure that the function created by delay_by() waits until a certain amount of time has passed since its last execution, we incorporate three little changes into our new delay_atleast() as indicated in the corresponding comments below.
delay_atleast <- function(amount, f) {
  force(f)
  force(amount)
  # (1) store the last time the function was run
  last_time <- NULL
  function(...) {
    # (2) wait only for the remainder of the delay
    if (!is.null(last_time)) {
      wait <- (last_time - Sys.time()) + amount
      if (wait > 0) {
        Sys.sleep(wait)
      }
    }
    # (3) update the time after the function has finished
    on.exit(last_time <<- Sys.time())
    f(...)
  }
}
Part III
Object-oriented programming
13
S3
Prerequisites
To interact with S3 objects, we will mainly use the {sloop} package [Wickham,
2019b].
library(sloop)
13.2 Basics
Q1: Describe the difference between t.test() and t.data.frame(). When is each function called?
A: Because of S3’s generic.class() naming scheme, both functions may initially look similar, but they are in fact unrelated.
• t.test() is a generic function that performs a t-test.
• t.data.frame() is a method that gets called by the generic t() to transpose
data frame input.
Due to R’s S3 dispatch rules, t.test() would also get called when t() is
applied to an object of class test.
Q2: Make a list of commonly used base R functions that contain . in their
name but are not S3 methods.
A: In recent years “snake_case” style has become increasingly common when naming functions and variables in R. But many functions in base R will continue to be “point.separated”, which is why some inconsistency in your R code most likely cannot be avoided [Bååth, 2012].
list.files()
download.file()
data.frame()
as.character()
Sys.Date()
all.equal()
do.call()
on.exit()
mean(some_days)
#> [1] "2017-02-06"
mean(unclass(some_days))
#> [1] 17203
After unclass() has removed the class attribute from some_days, the default method is chosen. mean.default(unclass(some_days)) then calculates the mean of the underlying double.
Q5: What class of object does the following code return? What base type is
it built on? What attributes does it use?
typeof(x)
#> [1] "closure"
attributes(x)
#> $class
#> [1] "ecdf" "stepfun" "function"
#>
#> $call
#> ecdf(rpois(100, 10))
Q6: What class of object does the following code return? What base type is
it built on? What attributes does it use?
A: This code returns a table object, which is built upon the integer type.
The attribute dimnames is used to name the elements of the integer vector.
typeof(x)
#> [1] "integer"
attributes(x)
#> $dim
#> [1] 10
#>
#> $dimnames
#> $dimnames[[1]]
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
#>
#>
#> $class
#> [1] "table"
13.3 Classes
Q1: Write a constructor for data.frame objects. What base type is a data
frame built on? What attributes does it use? What are the restrictions placed
on the individual elements? What about the names?
A: Data frames are built on named lists of vectors, which all have the same
length. Besides the class and the column names (names), the row.names are
their only further attribute. This must be a character vector with the same
length as the other vectors.
We need to provide the number of rows as an input to make it possible to
create data frames with 0 columns but multiple rows.
This leads to the following constructor:
new_data.frame <- function(x, n, row.names = NULL) {
  # Check if the underlying object is a list
  stopifnot(is.list(x))
  # Check all inputs have the same length
  stopifnot(all(lengths(x) == n))
  if (is.null(row.names)) {
    # Use special row names helper from base R
    row.names <- .set_row_names(n)
  } else {
    # Otherwise check that they're a character vector
    # with the correct length
    stopifnot(is.character(row.names), length(row.names) == n)
  }
  structure(
    x,
    class = "data.frame",
    row.names = row.names
  )
}
# Test
x <- list(a = 1, b = 2)
new_data.frame(x, n = 1)
#> a b
#> 1 1 2
new_data.frame(x, n = 1, row.names = "l1")
#> a b
#> l1 1 2
The factor() helper including the constructor (new_factor()) and its validator
(validate_factor()) were given in Advanced R. However, as the goal of this
question is to throw an early error within the helper, we only repeat the code
for the helper:
factor2 <- function(x = character(), levels = unique(x)) {
  new_levels <- match(x, levels)
  # Error if levels don't include all values
  missing <- unique(setdiff(x, levels))
  if (length(missing) > 0) {
    stop(
      "The following values do not occur in the levels of x: ",
      paste0("'", missing, "'", collapse = ", "), ".",
      call. = FALSE
    )
  }
  validate_factor(new_factor(new_levels, levels))
}
# Test
factor2(c("a", "b", "c"), levels = c("a", "b"))
#> Error: The following values do not occur in the levels of x: 'c'.
Q3: Carefully read the source code of factor(). What does it do that our
constructor does not?
A: The original implementation (base::factor()) allows more flexible input
for x. It coerces x to character or replaces it with character(0) (in case of
NULL). It also ensures that the levels are unique. This is achieved by setting
them via base::levels<-, which fails when duplicate values are supplied.
Q4: Factors have an optional “contrasts” attribute. Read the help for C(),
and briefly describe the purpose of the attribute. What type should it have?
Rewrite the new_factor() constructor to include this attribute.
A: When factor variables (representing nominal or ordinal information) are
used in statistical models, they are typically encoded as dummy variables and
by default each level is compared with the first factor level. However, many
different encodings (“contrasts”) are possible, see https://en.wikipedia.org
/wiki/Contrast_(statistics).
Within R’s formula interface you can wrap a factor in stats::C() and specify
the contrast of your choice. Alternatively, you can set the contrasts attribute
of your factor variable, which accepts matrix input. (See ?contr.helmert or
similar for details.)
The new_factor() constructor was given in Advanced R as:
new_factor <- function(x = integer(), levels = character()) {
  stopifnot(is.integer(x))
  stopifnot(is.character(levels))
  structure(
    x,
    levels = levels,
    class = "factor"
  )
}
Adding the contrasts attribute (a numeric matrix) as an optional argument leads to:
new_factor <- function(
  x = integer(),
  levels = character(),
  contrasts = NULL
) {
  stopifnot(is.integer(x))
  stopifnot(is.character(levels))
  if (!is.null(contrasts)) {
    stopifnot(is.matrix(contrasts) && is.numeric(contrasts))
  }
  structure(
    x,
    levels = levels,
    class = "factor",
    contrasts = contrasts
  )
}
Q5: Read the documentation for utils::as.roman(). How would you write a
constructor for this class? Does it need a validator? What might a helper do?
The documentation tells us that only values between 1 and 3899 are uniquely represented, which we then include in our validation function.
validate_roman <- function(x) {
  value <- unclass(x)
  if (any(value < 1 | value > 3899)) {
    stop("Roman numbers must fall between 1 and 3899.", call. = FALSE)
  }
  x
}
For convenience, we also allow the user to pass real values to a helper function:
roman <- function(x = integer()) {
  x <- as.integer(x)
  validate_roman(new_roman(x))
}
# Test
roman(c(1, 753, 2019))
#> [1] I DCCLIII MMXIX
roman(0)
#> Error: Roman numbers must fall between 1 and 3899.
t.test
#> function (x, ...)
#> UseMethod("t.test")
#> <bytecode: 0x5567887716f0>
#> <environment: namespace:stats>
# or simply call
ftype(t.test)
#> [1] "S3" "generic"
Interestingly, R also provides helpers, which list functions that look like methods, but in fact are not:
tools::nonS3methods("stats")
#> [1] "anova.lmlist" "expand.model.frame" "fitted.values"
#> [4] "influence.measures" "lag.plot" "t.test"
#> [7] "plot.spec.phase" "plot.spec.coherency"
When we create an object with class test, t() dispatches to the t.default()
method. This happens, because UseMethod() simply searches for functions
named paste0("generic", ".", c(class(x), "default")).
t(x)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 1 2 3 4 5 6 7 8 9 10
#> attr(,"class")
#> [1] "test"
Q2: What generics does the table class have methods for?
A: This is a simple application of sloop::s3_methods_class():
s3_methods_class("table")
#> # A tibble: 11 x 4
#> generic class visible source
#> <chr> <chr> <lgl> <chr>
#> 1 [ table TRUE base
#> 2 aperm table TRUE base
#> 3 as_tibble table FALSE registered S3method
Interestingly, the table class has a number of methods designed to help plotting with base graphics.
x <- rpois(100, 5)
plot(table(x))
(Figure: bar plot of table(x) for x <- rpois(100, 5), counts over the values 0 to 12.)
Q3: What generics does the ecdf class have methods for?
A: We use the same approach as above:
s3_methods_class("ecdf")
#> # A tibble: 4 x 4
#> generic class visible source
#> <chr> <chr> <lgl> <chr>
#> 1 plot ecdf TRUE stats
#> 2 print ecdf FALSE registered S3method
#> 3 quantile ecdf FALSE registered S3method
#> 4 summary ecdf FALSE registered S3method
The methods are primarily designed for display (plot(), print(), summary()),
but you can also extract quantiles with quantile().
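For example (our own toy data):

```r
x <- c(1, 2, 3, 10)
f <- ecdf(x)
# quantile() inverts the empirical distribution function
quantile(f, c(0.25, 0.5, 1))
```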
Q4: Which base generic has the greatest number of defined methods?
A: A little experimentation (and thinking about the most popular functions)
suggests that the print() generic has the most defined methods.
nrow(s3_methods_generic("print"))
#> [1] 260
nrow(s3_methods_generic("summary"))
#> [1] 38
nrow(s3_methods_generic("plot"))
#> [1] 34
Let’s verify this programmatically with the tools we have learned in this and
the previous chapters.
library(purrr)
Q5: Carefully read the documentation for UseMethod() and explain why the
following code returns the results that it does. What two usual rules of function
evaluation does UseMethod() violate?
g <- function(x) {
x <- 10
y <- 10
UseMethod("g")
}
g.default <- function(x) c(x = x, y = y)
x <- 1
y <- 1
g(x)
#> x y
#> 1 10
A: Let’s take this step by step. If you call g.default(x) directly you get c(1,
1) as you might expect.
The value bound to x comes from the argument, the value from y comes from
the global environment.
g.default(x)
#> x y
#> 1 1
g(x)
#> x y
#> 1 10
This is seemingly inconsistent: why does x come from the value defined inside
of g(), and y still come from the global environment? It’s because UseMethod()
calls g.default() in a special way so that variables defined inside the generic
are available to methods. The exceptions are arguments supplied to the function: they are passed on as is and cannot be affected by code inside the generic.
Q6: What are the arguments to [? Why is this a hard question to answer?
A: The subsetting operator [ is a primitive and a generic function, which can
be confirmed via ftype().
ftype(`[`)
#> [1] "primitive" "generic"
names(formals(`[.data.frame`))
#> [1] "x" "i" "j" "drop"
names(formals(`[.table`))
#> [1] "x" "i" "j" "..." "drop"
names(formals(`[.Date`))
#> [1] "x" "..." "drop"
names(formals(`[.AsIs`))
#> [1] "x" "i" "..."
To finally get a better overview, we have to put in a little more effort and also
use s3_methods_generic() again.
library(dplyr)
s3_methods_generic("[") %>%
filter(visible) %>%
mutate(
method = paste0("[.", class),
argnames = purrr::map(method, ~ names(formals(.x))),
args = purrr::map(method, ~ formals(.x)),
args = purrr::map2(
argnames, args,
~ paste(.x, .y, sep = " = ")
),
args = purrr::set_names(args, method)
) %>%
pull(args) %>%
head()
#> $`[.AsIs`
#> [1] "x = " "i = " "... = "
#>
#> $`[.data.frame`
#> [1] "x = "
#> [2] "i = "
#> [3] "j = "
#> [4] "drop = if (missing(i)) TRUE else length(cols) == 1"
#>
#> $`[.Date`
#> [1] "x = " "... = " "drop = TRUE"
#>
#> $`[.difftime`
#> [1] "x = " "... = " "drop = TRUE"
#>
#> $`[.Dlist`
#> [1] "x = " "i = " "... = "
#>
#> $`[.DLLInfoList`
#> [1] "x = " "... = "
typeof(mod)
#> [1] "list"
attributes(mod)
#> $names
#> [1] "coefficients" "residuals" "effects" "rank"
#> [5] "fitted.values" "assign" "qr" "df.residual"
#> [9] "xlevels" "call" "terms" "model"
#>
#> $class
#> [1] "lm"
As mod is built upon a list, we can simply use map(mod, typeof) to find out the
base types of its elements. (Additionally, we inspect ?lm, to learn more about
the individual attributes.)
map_chr(mod, typeof)
#> coefficients residuals effects rank
#> "double" "double" "double" "integer"
#> fitted.values assign qr df.residual
#> "double" "integer" "list" "integer"
#> xlevels call terms model
#> "list" "language" "language" "list"
stopifnot(
is.double(coefficients), is.double(residuals),
is.double(effects), is.integer(rank), is.double(fitted.values),
is.integer(assign), is.list(qr), is.integer(df.residual),
is.list(xlevels), is.language(call), is.language(terms),
is.list(model)
)
structure(
list(
coefficients = coefficients,
residuals = residuals,
effects = effects,
rank = rank,
fitted.values = fitted.values,
assign = assign,
qr = qr,
df.residual = df.residual,
xlevels = xlevels,
call = call,
terms = terms,
model = model
),
class = "lm"
)
}
13.6 Inheritance
Q1: How does [.Date support subclasses? How does it fail to support subclasses?
A: [.Date calls .Date with the result of calling [ on the parent class, along
with oldClass():
`[.Date`
#> function (x, ..., drop = TRUE)
#> {
#> .Date(NextMethod("["), oldClass(x))
#> }
#> <bytecode: 0x55678c510dd8>
#> <environment: namespace:base>
.Date is kind of like a constructor for date classes, although it doesn’t check
the input is the correct type:
.Date
#> function (xx, cl = "Date")
#> `class<-`(xx, cl)
#> <bytecode: 0x55678ad06508>
#> <environment: namespace:base>
out
}
So, [.Date ensures that the output has the same class as in the input. But
what about other attributes that a subclass might possess? They get lost:
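A small demonstration of this loss, using a hypothetical subclass with an extra attribute:

```r
x <- structure(1:4, test = "test", class = c("myDate", "Date"))
# the class survives, but the "test" attribute is dropped
attributes(x[1])
#> $class
#> [1] "myDate" "Date"
```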
Q2: R has two classes for representing date time data, POSIXct and POSIXlt,
which both inherit from POSIXt. Which generics have different behaviours for
the two classes? Which generics share the same behaviour?
A: To answer this question, we first have to get the respective generics for each class:
The generics in generics_t with a method for the superclass POSIXt potentially
share the same behaviour for both subclasses. However, if a generic has a
specific method for one of the subclasses, it has to be subtracted:
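The comparison can be sketched with {sloop} (variable names are our own):

```r
library(sloop)

generics_t  <- s3_methods_class("POSIXt")$generic
generics_ct <- s3_methods_class("POSIXct")$generic
generics_lt <- s3_methods_class("POSIXlt")$generic

# generics with class-specific methods behave differently
union(generics_ct, generics_lt)

# generics that rely on the shared POSIXt method
setdiff(generics_t, union(generics_ct, generics_lt))
```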
Q3: What do you expect this code to return? What does it actually return?
Why?
Let’s just double check the statement above and evaluate .Class explicitly
within the generic2.b() method.
x1 <- 1:5
class(x1)
#> [1] "integer"
s3_dispatch(length(x1))
#> * length.integer
#> length.numeric
#> length.default
#> => length (internal)
attr(x1, "class")
#> NULL
attr(x2, "class")
#> [1] "integer"
To see the relevant classes for the S3 dispatch, one can use sloop::s3_class():
s3_class(x1) # implicit
#> [1] "integer" "numeric"
s3_class(x2) # explicit
#> [1] "integer"
s3_methods_generic("Math")
#> # A tibble: 8 x 4
#> generic class visible source
#> <chr> <chr> <lgl> <chr>
#> 1 Math data.frame TRUE base
#> 2 Math Date TRUE base
#> 3 Math difftime TRUE base
#> 4 Math factor TRUE base
#> 5 Math POSIXt TRUE base
#> 6 Math quosure FALSE registered S3method
#> 7 Math vctrs_sclr FALSE registered S3method
#> 8 Math vctrs_vctr FALSE registered S3method
To explain the basic idea, we just overwrite the data frame method:
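A minimal version of such an override (a deliberately trivial method that shadows the base one) might be:

```r
Math.data.frame <- function(x, ...) "hello"
```

Because group generics dispatch on the class of their first argument, every member of the Math group now finds this method for data frames.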
Now all functions from the Math generic group will return "hello":
abs(mtcars)
#> [1] "hello"
exp(mtcars)
#> [1] "hello"
lgamma(mtcars)
#> [1] "hello"
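To produce the output below, the method can instead return the name of the dispatching generic via the special .Generic variable; a sketch:

```r
Math.data.frame <- function(x, ...) .Generic
```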
abs(mtcars)
#> [1] "abs"
exp(mtcars)
#> [1] "exp"
lgamma(mtcars)
#> [1] "lgamma"
rm(Math.data.frame)
Math.difftime
#> function (x, ...)
#> {
#> switch(.Generic, abs = , sign = , floor = , ceiling = , trunc = ,
#> round = , signif = {
#> units <- attr(x, "units")
#> .difftime(NextMethod(), units)
#> }, stop(gettextf("'%s' not defined for \"difftime\" objects",
#> .Generic), domain = NA))
#> }
#> <bytecode: 0x55678c808148>
#> <environment: namespace:base>
14
R6
Prerequisites
To solve the exercises in this chapter we will have to create R6 objects, which
are implemented in the {R6} package [Chang, 2020].
library(R6)
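A minimal sketch of such a BankAccount class, following the usual R6 idiom (field and method names match the tests below; methods return invisible(self) to allow chaining):

```r
library(R6)

BankAccount <- R6Class(
  classname = "BankAccount",
  public = list(
    balance = 0,
    deposit = function(amount) {
      self$balance <- self$balance + amount
      invisible(self)
    },
    withdraw = function(amount) {
      self$balance <- self$balance - amount
      invisible(self)
    }
  )
)

my_account <- BankAccount$new()
my_account$balance
#> [1] 0
```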
To test this class, we create one instance and leave it with a negative balance.
my_account$
deposit(5)$
withdraw(15)$
balance
#> [1] -10
Now, we create the first subclass that prevents us from going into overdraft
and throws an error in case we attempt to withdraw more than our current
balance.
my_strict_account$
deposit(5)$
withdraw(15)
#> Error: Your `withdraw` must be smaller than your `balance`.
my_strict_account$balance
#> [1] 5
Finally, we create another subclass that charges a constant fee of 1 for each
withdrawal and still allows the balance to become negative.
my_charging_account$
deposit(5)$
withdraw(15)$
withdraw(0)
my_charging_account$balance
#> [1] -12
Q2: Create an R6 class that represents a shuffled deck of cards. You should
be able to draw cards from the deck with $draw(n), and return all cards to
the deck and reshuffle with $reshuffle(). Use the following code to make a
vector of cards.
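A plausible version of that card vector (formatted to match the output below) is:

```r
suit <- c("SPADE", "HEARTS", "DIAMOND", "CLUB")
value <- c("A", 2:10, "J", "Q", "K")

cards <- paste(
  rep(value, times = length(suit)),
  rep(suit, each = length(value))
)
length(cards)
#> [1] 52
```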
To test this class, we create a deck (initialise an instance), draw all the cards,
then reshuffle, checking we get different cards each time.
my_deck$draw(52)
#> [1] "6 SPADE" "10 DIAMOND" "Q CLUB" "J SPADE" "Q HEARTS"
#> [6] "8 DIAMOND" "5 DIAMOND" "4 CLUB" "9 CLUB" "9 SPADE"
#> [11] "5 SPADE" "3 HEARTS" "J CLUB" "2 DIAMOND" "K SPADE"
#> [16] "2 HEARTS" "2 SPADE" "8 SPADE" "8 HEARTS" "6 HEARTS"
#> [21] "7 HEARTS" "6 CLUB" "K CLUB" "3 CLUB" "10 SPADE"
#> [26] "3 DIAMOND" "Q SPADE" "9 HEARTS" "J DIAMOND" "7 DIAMOND"
#> [31] "9 DIAMOND" "7 SPADE" "4 DIAMOND" "10 HEARTS" "2 CLUB"
#> [36] "4 SPADE" "4 HEARTS" "8 CLUB" "K HEARTS" "A SPADE"
#> [41] "A HEARTS" "5 HEARTS" "A DIAMOND" "5 CLUB" "7 CLUB"
#> [46] "Q DIAMOND" "A CLUB" "10 CLUB" "3 SPADE" "K DIAMOND"
#> [51] "J HEARTS" "6 DIAMOND"
my_deck$draw(10)
#> Error: Only 0 cards remaining.
my_deck$reshuffle()$draw(5)
#> [1] "6 DIAMOND" "2 CLUB" "Q DIAMOND" "9 CLUB" "J DIAMOND"
my_deck$reshuffle()$draw(5)
#> [1] "8 CLUB" "9 SPADE" "2 SPADE" "Q HEARTS" "6 SPADE"
Q3: Why can’t you model a bank account or a deck of cards with an S3 class?
A: Because S3 classes obey R’s usual semantics of copy-on-modify: every time
you deposit money into your bank account or draw a card from the deck,
you’d get a new copy of the object.
It is possible to combine S3 classes with an environment (which is how R6
works), but it is ill-advised to create an object that looks like a regular R
object but has reference semantics.
Q4: Create an R6 class that allows you to get and set the current time zone.
You can access the current time zone with Sys.timezone() and set it with
Sys.setenv(TZ = "newtimezone"). When setting the time zone, make sure the
new time zone is in the list provided by OlsonNames().
A: To create an R6 class that allows us to get and set the time zone, we
provide the respective functions as public methods to the R6 class.
(When setting, we return the old value invisibly because this makes it easy to
restore the previous value.)
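A minimal sketch of such a class (method names as described above; the OlsonNames() check guards against invalid time zones):

```r
library(R6)

Timezone <- R6Class(
  classname = "Timezone",
  public = list(
    get = function() Sys.timezone(),
    set = function(value) {
      stopifnot(value %in% OlsonNames())
      old <- self$get()
      Sys.setenv(TZ = value)
      invisible(old)   # return the previous time zone invisibly
    }
  )
)
```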
Now, let us create one instance of this class and test whether we can set and
get the time zone as intended.
tz <- Timezone$new()
tz$set(old)
tz$get()
#> [1] "Europe/Berlin"
Q5: Create an R6 class that manages the current working directory. It should
have $get() and $set() methods.
A: Take a look at the following implementation, which is quite minimalistic:
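A minimal sketch, thinly wrapping getwd() and setwd():

```r
library(R6)

WorkingDirectory <- R6Class(
  classname = "WorkingDirectory",
  public = list(
    get = function() getwd(),
    set = function(value) {
      setwd(value)
      invisible(self)
    }
  )
)
```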
Q6: Why can’t you model the time zone or current working directory with an
S3 class?
A: Because S3 classes are not suitable for modelling a state that changes over
time. S3 methods should (almost) always return the same result when called
with the same inputs.
Q7: What base type are R6 objects built on top of? What attributes do they
have?
A: R6 objects are built on top of environments. They have a class attribute,
which is a character vector containing the class name, the names of any
superclasses (if existent) and the string "R6" as the last element.
To test our new class, we create an instance and try to go into overdraft.
my_account_strict_2$deposit(5)
my_account_strict_2$withdraw(10)
#> Error: Your `withdraw` must be smaller than your `balance`.
Let’s create one instance of our new class and confirm that the password is
neither accessible nor visible, but still check-able.
my_pw$password
#> NULL
my_pw
#> <Password>: ********
my_pw$check("snuggles")
#> [1] FALSE
my_pw$check("snuffles")
#> [1] TRUE
Q3: Extend the Rando class with another active binding that allows you to
access the previous random value. Ensure that the active binding is the only
way to access the value.
A: To access the previous random value from an instance, we add a private
$last_random field to our class, and we modify $random() to write to this field,
whenever it is called. To access the $last_random field we provide $previous().
x <- Rando$new()
x$random
Q4: Can subclasses access private fields/methods from their parent? Perform
an experiment to find out.
A: To find out if private fields/methods can be accessed from subclasses, we
first create a class A with a private field and a private method. Afterwards,
we create an instance of a subclass B and call its test() method, which tries
to access the field and the method of its superclass A.
A <- R6Class(
classname = "A",
private = list(
field = "foo",
method = function() {
"bar"
}
)
)
B <- R6Class(
classname = "B",
inherit = A,
public = list(
test = function() {
cat("Field: ", super$field, "\n", sep = "")
cat("Method: ", super$method(), "\n", sep = "")
}
)
)
B$new()$test()
#> Field:
#> Method: bar
We conclude that subclasses can access private methods from their superclasses,
but not private fields.
14.4 Reference semantics
finalize = function() {
close(self$con)
},
append_line = function(x) {
cat(x, "\n", sep = "", file = self$con)
}
)
)
readLines(tmp_file)
#> character(0)
my_fw$append_line("First")
my_fw$append_line("Second")
readLines(tmp_file)
#> [1] "First" "Second"
15
S4
Prerequisites
We load the {methods} package [R Core Team, 2020] as it contains the S4
object-oriented programming system.
library(methods)
15.2 Basics
Q1: lubridate::period() returns an S4 class. What slots does it have? What
class is each slot? What accessors does it provide?
A: Objects of the S4 Period class have six slots named year, month, day, hour,
minute, and .Data (which contains the number of seconds). All slots are of
type double. Most slots can be retrieved by an identically named accessor
(e.g. lubridate::year() returns the year slot); use lubridate::second() to access
the .Data slot.
As a short example, we create a period of 1 second, 2 minutes, 3 hours, 4 days
and 5 weeks.
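Such a period can be constructed with lubridate::period(); a version consistent with the printed result is:

```r
example_12345 <- lubridate::period(
  c(1, 2, 3, 4, 5),
  c("second", "minute", "hour", "day", "week")
)
example_12345
#> [1] "39d 3H 2M 1S"
```

Note that the 4 days and 5 weeks are combined into 39 days when the period is printed.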
example_12345
#> [1] "39d 3H 2M 1S"
When we inspect example_12345, we see the slots and can infer that the seconds
are stored in the .Data slot.
str(example_12345)
#> Formal class 'Period' [package "lubridate"] with 6 slots
#> ..@ .Data : num 1
#> ..@ year : num 0
#> ..@ month : num 0
#> ..@ day : num 39
#> ..@ hour : num 3
#> ..@ minute: num 2
Q2: What other ways can you find help for a method? Read ?"?" and summarise
the details.
A: Besides adding ? in front of a function call (i.e. ?method()), we may find:
• general documentation for a generic via ?genericName
• general documentation for the methods of a generic via methods?genericName
• documentation for a specific method via ClassName?methodName.
15.3 Classes
Q1: Extend the Person class with fields to match utils::person(). Think
about what slots you will need, what class each slot should have, and what
you’ll need to check in your validity method.
A: The Person class from Advanced R contains the slots name and age. The
person class from the {utils} package contains the slots given (vector of given
names), family, role, email and comment (see ?utils::person).
All slots from utils::person() besides role must be of type character and
length 1. The entries in the role slot must match one of the following
abbreviations: “aut”, “com”, “cph”, “cre”, “ctb”, “ctr”, “dtc”, “fnd”, “rev”,
“ths”, “trl”. Therefore, role may differ in length from the other slots, and
we’ll add a corresponding constraint within the validator.
new("Person",
age = age,
given = given,
family = family,
role = role,
email = email,
comment = comment
)
}
length(object@email) != 1 ||
length(object@comment) != 1) {
invalids <- paste0("@name, @age, @given, @family, @email, ",
"@comment must be of length 1")
}
known_roles <- c(
NA_character_, "aut", "com", "cph", "cre", "ctb",
"ctr", "dtc", "fnd", "rev", "ths", "trl"
)
if (length(invalids)) return(invalids)
TRUE
})
#> Class "Person" [in ".GlobalEnv"]
#>
#> Slots:
#>
#> Name: age given family role email comment
#> Class: numeric character character character character character
Q2: What happens if you define a new S4 class that doesn’t have any slots?
(Hint: read about virtual classes in ?setClass.)
A: It depends on the other arguments. If we inherit from another class, we get
the same slots. But something interesting happens if we don’t inherit from an
existing class. We get a virtual class. A virtual class can’t be instantiated:
setClass("Human")
new("Human")
#> Error in new("Human"): trying to generate an object from a virtual
#> class ("Human")
Q3: Imagine you were going to reimplement factors, dates, and data frames
in S4. Sketch out the setClass() calls that you would use to define the classes.
Think about appropriate slots and prototype.
A: For all these classes we need one slot for the data and one slot per attribute.
Keep in mind that inheritance matters for ordered factors and dates. For data
frames, special checks, like equal lengths of the underlying list’s elements,
should be done within a validator.
For simplicity we don’t introduce an explicit subclass for ordered factors.
Instead, we introduce ordered as a slot.
setClass("Factor",
slots = c(
data = "integer",
levels = "character",
ordered = "logical"
),
prototype = list(
data = integer(),
levels = character(),
ordered = FALSE
)
)
The Date2 class stores its dates as integers, unlike base R, which uses
doubles. Dates don’t have any other attributes.
setClass("Date2",
slots = list(
data = "integer"
),
prototype = list(
data = integer()
)
)
Our DataFrame class consists of a list and a slot for row.names. Most of the
logic (e.g. checking that all elements of the list are a vector, and that they all
have the same length) would need to be part of a validator.
setClass("DataFrame",
slots = c(
data = "list",
row.names = "character"
),
prototype = list(
data = list(),
row.names = character(0)
)
)
15.4 Generics and methods
Q2: In the definition of the generic, why is it necessary to repeat the name of
the generic twice?
A: Within setGeneric() the name (1st argument) is needed as the name of
the generic. Additionally, the name is passed to standardGeneric() within the
generic’s body (the def parameter of setGeneric()) to explicitly trigger method
dispatch. This behaviour is similar to UseMethod() in S3.
Q3: Why does the show() method defined in section 15.4.3 (https://adv-r.hadley.nz/s4.html#show-method) use is(object)[[1]]? (Hint: try printing
the employee subclass.)
A: is(object) returns the character vector of classes that the object inherits
from. For subclasses like Employee, this vector also contains the superclass.
In order to always print the most specific class (the subclass), show() uses the
first element of is(object).
Q4: What happens if you define a method with different argument names to
the generic?
A: It depends. We first create the object hadley of class Person:
Now let’s see which arguments can be supplied to the show() generic.
formals("show")
#> $object
hadley
#> Hadley creates hard exercises
When we use a different name for the first argument of our method (e.g. x
instead of object), this argument will be matched to the correct object argument
and we receive a warning. Our method will still work, though:
hadley
#> Hadley creates hard exercises
If we add more arguments to our method than our generic can handle, we will
get an error.
Q3: Take the last example which shows multiple dispatch over two classes
that use multiple inheritance. What happens if you define a method for all
terminal classes? Why does method dispatch not save us much work here?
A: We will introduce ambiguity, since one class has distance 2 to all terminal
nodes and the other four have distance 1 to two terminal nodes each. To
resolve this ambiguity we have to define five more methods, one per class
combination.
15.6 S4 and S3
Q1: What would a full setOldClass() definition look like for an ordered factor
(i.e. add slots and prototype to the definition above)?
A: The purpose of setOldClass() lies in registering an S3 class as a “formally
defined class”, so that it can be used within the S4 object-oriented programming
system. When using it, we may provide the argument S4Class, which will
inherit the slots and their default values (prototype) to the registered class.
Let’s build an S4 OrderedFactor on top of the S3 factor in such a way.
We can now register the (S3) ordered class, while providing an “S4 template”.
We can also use the S4 class to create new objects directly.
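Both steps might look like the following sketch (assigning the generator to OrderedFactor so it can be used as a constructor, matching the str() output below):

```r
library(methods)

OrderedFactor <- setClass(
  "OrderedFactor",
  contains = "factor",            # inherit from the registered S3 class
  slots = c(
    levels = "character",
    ordered = "logical"
  ),
  prototype = structure(
    integer(),
    levels = character(),
    ordered = FALSE
  )
)

# Register the S3 "ordered" class, using the S4 class as template
setOldClass("ordered", S4Class = "OrderedFactor")
```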
x <- OrderedFactor(
c(1L, 2L, 2L),
levels = c("a", "b", "c"),
ordered = TRUE
)
str(x)
#> Formal class 'OrderedFactor' [package ".GlobalEnv"] with 4 slots
#> ..@ .Data : int [1:3] 1 2 2
#> ..@ levels : chr [1:3] "a" "b" "c"
#> ..@ ordered : logi TRUE
#> ..@ .S3Class: chr "factor"
Metaprogramming
18
Expressions
Prerequisites
To capture and compute on expressions, and to visualise them, we will load
the {rlang} [Henry and Wickham, 2020b] and the {lobstr} [Wickham, 2019a]
packages.
library(rlang)
library(lobstr)
A: Let the source (of the code chunks above) be with you and show you how
the ASTs (abstract syntax trees) were produced.
ast(f(g(h())))
#> █─f
#> └─█─g
#> └─█─h
ast(1 + 2 + 3)
#> █─`+`
#> ├─█─`+`
#> │ ├─1
#> │ └─2
#> └─3
ast((x + y) * z)
#> █─`*`
#> ├─█─`(`
#> │ └─█─`+`
#> │ ├─x
#> │ └─y
#> └─z
Q2: Draw the following trees by hand then check your answers with ast().
f(g(h(i(1, 2, 3))))
f(1, g(2, h(3, i())))
f(g(1, 2), h(3, i(4, 5)))
ast(f(g(h(i(1, 2, 3)))))
#> █─f
#> └─█─g
#> └─█─h
#> └─█─i
#> ├─1
#> ├─2
#> └─3
ast(f(1, g(2, h(3, i()))))
#> █─f
#> ├─1
#> └─█─g
#>   ├─2
#>   └─█─h
#>     ├─3
#>     └─█─i
ast(f(g(1, 2), h(3, i(4, 5))))
#> █─f
#> ├─█─g
#> │ ├─1
#> │ └─2
#> └─█─h
#>   ├─3
#>   └─█─i
#>     ├─4
#>     └─5
Q3: What’s happening with the ASTs below? (Hint: carefully read ?"^")
ast(`x` + `y`)
#> █─`+`
#> ├─x
#> └─y
ast(x ** y)
#> █─`^`
#> ├─x
#> └─y
ast(1 -> x)
#> █─`<-`
#> ├─x
#> └─1
A: ASTs start function calls with the name of the function. This is why the
call in the first expression is translated into its prefix form. In the second case,
** is translated by R’s parser into ^. In the last AST, the expression is flipped
when R parses it:
str(expr(x ** y))
#> language x^y
str(expr(a -> b))
#> language b <- a
Q4: What is special about the AST below? (Hint: re-read section 6.2.1
(https://adv-r.hadley.nz/functions.html#fun-components))
ast(function(x = 1, y = 2) {})
#> █─`function`
#> ├─█─x = 1
#> │ └─y = 2
#> ├─█─`{`
#> └─<inline srcref>
A: The last leaf of the AST is not explicitly specified in the expression. Instead,
the srcref attribute, which points to the function’s source code, is automatically
created by base R.
Q5: What does the call tree of an if statement with multiple else if conditions
look like? Why?
A: The AST of nested else if statements might look a bit confusing because
it contains multiple curly braces. However, we can see that in the else part
of the AST just another expression is being evaluated, which happens to be
an if statement and so forth.
ast(
if (FALSE) {
1
} else if (FALSE) {
2
} else if (TRUE) {
3
}
)
#> █─`if`
#> ├─FALSE
#> ├─█─`{`
#> │ └─1
#> └─█─`if`
#> ├─FALSE
#> ├─█─`{`
#> │ └─2
#> └─█─`if`
#> ├─TRUE
#> └─█─`{`
#> └─3
We can see the structure more clearly if we avoid the curly braces:
ast(
if (FALSE) 1
else if (FALSE) 2
else if (TRUE) 3
)
#> █─`if`
#> ├─FALSE
#> ├─1
#> └─█─`if`
#> ├─FALSE
#> ├─2
#> └─█─`if`
#> ├─TRUE
#> └─3
18.3 Expressions
Q1: Which two of the six types of atomic vector can’t appear in an expression?
Why? Similarly, why can’t you create an expression that contains an atomic
vector of length greater than one?
A: There is no way to create raws or complex atomics without using a
function call (this is only possible for imaginary scalars like i, 5i etc.). But
expressions that include a function call are calls, not atomic vectors. Therefore,
neither of these vector types can appear in an expression.
Similarly, it is not possible to create an expression that evaluates to an atomic
of length greater than one without using a function (e.g. c()).
Let’s make this observation concrete via an example:
# Atomic
is_atomic(expr(1))
#> [1] TRUE
Q2: What happens when you subset a call object to remove the first element,
e.g. expr(read.csv("foo.csv", header = TRUE))[-1]. Why?
A: When the first element of a call object is removed, the second element
moves to the first position, which is the function to call. Therefore, we get
"foo.csv"(header = TRUE).
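We can check this directly:

```r
library(rlang)

x <- expr(read.csv("foo.csv", header = TRUE))
x[-1]
#> "foo.csv"(header = TRUE)
```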
Q3: Describe the differences between the following call objects.
x <- 1:10
A: The call objects differ in their first two elements, which are in some cases
evaluated before the call is constructed. In the first one, both median() and x
are evaluated and inlined into the call. Therefore, we can see in the constructed
call that median is a generic and the x argument is 1:10.
The remaining calls cover the other combinations: in one, only x is evaluated,
and in another, only median().
Note that all these calls will generate the same result when evaluated. The key
difference is when the values bound to the x and median symbols are found.
Q4: rlang::call_standardise() doesn’t work so well for the following calls.
Why? What makes mean() special?
A: The reason for this unexpected behaviour is that mean() uses the ... argument,
so call_standardise() cannot standardise the corresponding arguments. Since mean()
uses S3 dispatch (i.e. UseMethod()) and the underlying mean.default() method
specifies some more arguments, call_standardise() can do much better with
a specific S3 method.
A: As stated in Advanced R:
The first element of a call is always the function that gets called.
Let’s see what happens when we run the code
So, giving the first element a name just adds metadata that R ignores.
Q6: Construct the expression if(x > 1) "a" else "b" using multiple calls to
call2(). How does the code structure reflect the structure of the AST?
A: Similar to the prefix version we get
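A sketch of the construction (following rlang’s call2() approach; expr_if is our name):

```r
library(rlang)

expr_if <- call2("if", call2(">", sym("x"), 1), "a", "b")
expr_if
#> if (x > 1) "a" else "b"
```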
When we read the AST from left to right, we get the same structure: the
function to evaluate, then an expression (which is itself a function call and is
evaluated first), and then two constants which may be evaluated next.
18.4 Parsing and grammar
Q1: R uses parentheses in two slightly different ways, as illustrated by these
two calls:
f((1))
`(`(1 + 1)
A: In the first example, the outer parenthesis belongs to f()’s call syntax and
does not appear in the AST, while the inner (1) is a call to the `(` function:
ast(f((1)))
#> █─f
#> └─█─`(`
#> └─1
In the second example, we can see that the outer ( is a function and the inner
( belongs to its syntax:
ast(`(`(1 + 1))
#> █─`(`
#> └─█─`+`
#> ├─1
#> └─1
For the sake of clarity, let’s also create a third example, where none of the (
is part of another function’s syntax:
ast(((1 + 1)))
#> █─`(`
#> └─█─`(`
#> └─█─`+`
#> ├─1
#> └─1
Q2: = can also be used in two ways. Construct a simple example that shows
both uses.
A: = is used both for assignment, and for naming arguments in function calls:
b = c(c = 1)
So, when we play with ast(), we can directly see that ast(b = c(c = 1)) is not
possible, since b would be matched to an argument of ast() itself. Wrapping
the expression in curly braces avoids this:
ast({b = c(c = 1)})
#> █─`{`
#> └─█─`=`
#> ├─b
#> └─█─c
#> └─c = 1
When we ignore the braces and compare the trees, we can see that the first =
is used for assignment and the second = is part of the syntax of function calls.
Q3: Does -2^2 yield 4 or -4? Why?
A: It yields -4, because ^ has a higher operator precedence than -, which we
can verify by looking at the AST (or looking it up under ?"Syntax"):
-2^2
#> [1] -4
ast(-2^2)
#> █─`-`
#> └─█─`^`
#> ├─2
#> └─2
Q4: What does !1 + !1 return? Why?
A: It returns FALSE. As the AST below shows, ! has a lower precedence than
+, so the expression is parsed as !(1 + (!1)): the inner !1 is FALSE, 1 + FALSE
is 1, and !1 is finally FALSE.
!1 + !1
#> [1] FALSE
ast(!1 + !1)
#> █─`!`
#> └─█─`+`
#> ├─1
#> └─█─`!`
#> └─1
Note that if ! had a higher precedence, the intermediate result would be
FALSE + FALSE, which would evaluate to 0.
Q5: Why does x1 <- x2 <- x3 <- 0 work? Describe the two reasons.
A: One reason is that <- is right-associative, i.e. evaluation takes place from
right to left:
The other reason is that <- invisibly returns the value on the right-hand side.
(x3 <- 0)
#> [1] 0
Q6: Compare the ASTs of x + y %+% z and x ^ y %+% z. What have you
learned about the precedence of custom infix functions?
A: Let’s take a look at the syntax trees:
ast(x + y %+% z)
#> █─`+`
#> ├─x
#> └─█─`%+%`
#> ├─y
#> └─z
Here y %+% z will be calculated first and the result will be added to x.
ast(x ^ y %+% z)
#> █─`%+%`
#> ├─█─`^`
#> │ ├─x
#> │ └─y
#> └─z
Here x ^ y will be calculated first, and the result will be used as the first
argument to %+%().
We can conclude that custom infix functions have precedence between addition
and exponentiation.
The exact precedence of infix functions can be looked up under ?"Syntax",
where we see that it lies directly below the sequence operator (:) and
above the multiplication and division operators (* and /).
Q7: What happens if you call parse_expr() with a string that generates multiple
expressions, e.g. parse_expr("x + 1; y + 1")?
A: In this case parse_expr() notices that more than one expression would
have to be generated and throws an error.
parse_expr("x + 1; y + 1")
#> Error: More than one expression parsed
Q8: What happens if you attempt to parse an invalid expression, e.g. "a +"
or "f())"?
A: Invalid expressions will lead to an error in the underlying parse() function.
parse_expr("a +")
#> Error in parse(text = elt): <text>:2:0: unexpected end of input
#> 1: a +
#> ^
parse_expr("f())")
#> Error in parse(text = elt): <text>:1:4: unexpected ')'
#> 1: f())
#> ^
Q9: deparse() produces vectors when the input is long. For example, the
following call produces a vector of length two:
deparse(expr)
In R 4.0.0 (https://cran.r-project.org/doc/manuals/r-release/NEWS.html)
pairwise.t.test() was updated to use the newly introduced deparse1(), which
serves as a wrapper around deparse().
deparse1() is a simple utility added in R 4.0.0 to ensure a string result
(character vector of length one), typically used in name construction, as
deparse1(substitute(.)).
d + d + d + d + d + d + d + d + d)
#> Pairwise comparisons using t tests with pooled SD
#>
#> data: 2 and d + d + d + d + d + d + d + d + d + d + d + d + d + d
#> + d + d + d
#>
#> <0 x 0 matrix>
#>
#> P value adjustment method: holm
# Recursive cases
pairlist = purrr::some(x, logical_abbr_rec),
call = find_T_call(x)
)
}
logical_abbr(T(1, 2, 3))
#> [1] FALSE
logical_abbr(T(T, T(3, 4)))
#> [1] TRUE
logical_abbr(T(T))
#> [1] TRUE
logical_abbr(T())
#> [1] FALSE
logical_abbr()
#> [1] FALSE
logical_abbr(c(T, T, T))
#> [1] TRUE
Q2: logical_abbr() works with expressions. It currently fails when you give
it a function. Why? How could you modify logical_abbr() to make it work?
What components of a function will you need to recurse over?
logical_abbr(!!f)
#> Error: Don't know how to handle type closure
ast(names(x) <- x)
#> █─`<-`
#> ├─█─names
#> │ └─x
#> └─x
So, we need to catch the case where the first two elements are both calls.
Furthermore, when the first call is identical to <-, we must return only the
second call to see which objects got new values assigned.
This is why we add the following block within another else statement in
find_assign_call():
Let us finish with the whole code, followed by some tests for our new function:
# Tests functionality
find_assign(x <- y)
#> [1] "x"
find_assign(names(x))
#> character(0)
find_assign(names(x) <- y)
#> [1] "names(x)"
find_assign(names(x(y)) <- y)
#> [1] "names(x(y))"
find_assign(names(x(y)) <- y <- z)
#> [1] "names(x(y))" "y"
# Recursive cases
pairlist = flat_map_chr(x, find_assign_rec),
call = find_assign_call(x)
)
}
find_assign(x <- y)
#> [1] "x <- y"
find_assign(names(x(y)) <- y <- z)
#> [1] "names(x(y)) <- y <- z" "names(x(y))"
#> [3] "x(y)" "y <- z"
find_assign(mean(sum(1:3)))
#> [1] "mean(sum(1:3))" "sum(1:3)" "1:3"
19
Quasiquotation
Prerequisites
To continue computing on the language, we keep using the {rlang} package
in this chapter.
library(rlang)
19.2 Motivation
Q1: For each function in the following base R code, identify which arguments
are quoted and which are evaluated.
library(MASS)
with(mtcars2, sum(vs))
sum(mtcars2$am)
rm(mtcars2)
A: For each argument we first follow the advice from Advanced R and execute
the argument outside of the respective function. Since MASS, cyl, vs and am
are not objects contained in the global environment, their execution raises an
“Object not found” error. This way we confirm that the respective function
arguments are quoted. For the other arguments, we may inspect the source
code (and the documentation) to check if any quoting mechanisms are applied
or the arguments are evaluated.
library() also accepts character vectors and doesn’t quote when
character.only is set to TRUE, so library(MASS, character.only = TRUE) would
raise an error.
When we inspect the source code of rm(), we notice that rm() catches its ...
argument as an unevaluated call (in this case a pairlist) via match.call(). This
call is then converted into a string for further evaluation.
Q2: For each function in the following tidyverse code, identify which arguments
are quoted and which are evaluated.
library(dplyr)
library(ggplot2)
A: From the previous exercise we’ve already learned that library() quotes its
first argument.
To find out what happens in summarise(), we inspect the source code. Tracing
down the S3 dispatch of summarise(), we see that the ... argument is
quoted in dplyr:::summarise_cols(), which is called in the underlying
summarise.data.frame() method.
dplyr::summarise
#> function (.data, ..., .groups = NULL)
#> {
#> UseMethod("summarise")
#> }
#> <bytecode: 0x562ef4376378>
#> <environment: namespace:dplyr>
dplyr:::summarise.data.frame
#> function (.data, ..., .groups = NULL)
#> {
#> cols <- summarise_cols(.data, ...)
#> out <- summarise_build(.data, cols)
#> if (identical(.groups, "rowwise")) {
#> out <- rowwise_df(out, character())
#> }
#> out
#> }
#> <bytecode: 0x562ef469a040>
#> <environment: namespace:dplyr>
dplyr:::summarise_cols
#> function (.data, ...)
#> {
#> mask <- DataMask$new(.data, caller_env())
#> dots <- enquos(...)
#> dots_names <- names(dots)
#> auto_named_dots <- names(enquos(..., .named = TRUE))
#> cols <- list()
#> sizes <- 1L
#> chunks <- vector("list", length(dots))
#> types <- vector("list", length(dots))
#>
#> ## function definition abbreviated for clarity ##
#> }
#> <bytecode: 0x55b540c07ca0>
#> <environment: namespace:dplyr>
In the following {ggplot2} expression, the cyl and mean symbols are quoted.
ggplot2::aes
#> function (x, y, ...)
#> {
#> exprs <- enquos(x = x, y = y, ..., .ignore_empty = "all")
#> aes <- new_aes(exprs, env = parent.frame())
#> rename_aes(aes)
#> }
#> <bytecode: 0x562ef4b5e498>
#> <environment: namespace:ggplot2>
19.3 Quoting
Q1: How is expr() implemented? Look at its source code.
A: expr() acts as a simple wrapper, which passes its argument to enexpr().
expr
#> function (expr)
#> {
#> enexpr(expr)
#> }
#> <bytecode: 0x562ef4cedaa0>
#> <environment: namespace:rlang>
Q2: Compare and contrast the following two functions. Can you predict the
output before running them?
f1 <- function(x, y) {
exprs(x = x, y = y)
}
f2 <- function(x, y) {
enexprs(x = x, y = y)
}
f1(a + b, c + d)
f2(a + b, c + d)
A: Both functions are able to capture multiple arguments and will return
a named list of expressions. f1() will return the arguments defined within
the body of f1(). This happens because exprs() captures the expressions as
specified by the developer during the definition of f1().
f1(a + b, c + d)
#> $x
#> x
#>
#> $y
#> y
f2() will return the arguments supplied to f2() as specified by the user when
the function is called.
f2(a + b, c + d)
#> $x
#> a + b
#>
#> $y
#> c + d
Q3: What happens if you try to use enexpr() with an expression (i.e. enexpr(x
+ y))? What happens if enexpr() is passed a missing argument?
A: In the first case an error is thrown:
Q4: How are exprs(a) and exprs(a = ) different? Think about both the input
and the output.
A: In exprs(a) the input a is interpreted as a symbol for an unnamed argument.
Consequently, the output shows an unnamed list with the first element
containing the symbol a.
Q5: What are other differences between exprs() and alist()? Read the doc-
umentation for the named arguments of exprs() to find out.
A: exprs() provides the additional arguments .named (= FALSE), .ignore_empty
(c("trailing", "none", "all")) and .unquote_names (TRUE). .named allows us
to ensure that all dots are named. .ignore_empty lets us specify how empty
arguments should be handled for dots only ("trailing") or for all arguments
("none" and "all"). Via .unquote_names we can specify whether := should be
treated like =. := can be useful, as it supports unquoting (!!) on the left-hand
side.
Q6: The documentation for substitute() says:
Substitution takes place by examining each component of the parse tree
as follows:
• If it is not a bound symbol in env, it is unchanged.
• If it is a promise object (i.e. a formal argument to a function) the
expression slot of the promise replaces the symbol.
foo(x + y * sin(0))
#> x + y * sin(0)
In case substitute() can find (parts of) the expression in env, it will substitute
literally. However, this does not happen when env is .GlobalEnv:
my_env$x <- 7
substitute(x, my_env)
#> [1] 7
x <- 7
substitute(x, .GlobalEnv)
#> x
19.4 Unquoting
Q1: Given the following components:
xy <- expr(x + y)
xz <- expr(x + z)
yz <- expr(y + z)
abc <- exprs(a, b, c)
(x + y) / (y + z) # (1)
-(x + z) ^ (y + z) # (2)
(x + y) + (y + z) - (x + y) # (3)
atan2(x + y, y + z) # (4)
sum(x + y, x + y, y + z) # (5)
sum(a, b, c) # (6)
mean(c(a, b, c), na.rm = TRUE) # (7)
foo(a = x + y, b = y + z) # (8)
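For instance, the first call can be constructed by unquoting the prepared components; the remaining items follow the same pattern:

```r
library(rlang)

xy <- expr(x + y)
yz <- expr(y + z)

expr(!!xy / !!yz) # (1)
#> (x + y)/(y + z)
```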
expr(-(!!xz)^(!!yz)) # (2)
#> -(x + z)^(y + z)
expr(sum(!!!abc)) # (6)
#> sum(a, b, c)
Q2: The following two calls print the same, but are actually different:
(a <- expr(mean(1:10)))
#> mean(1:10)
(b <- expr(mean(!!(1:10))))
#> mean(1:10)
identical(a, b)
#> [1] FALSE
lobstr::ast(mean(1:10))
#> █─mean
#> └─█─`:`
#> ├─1
#> └─10
lobstr::ast(mean(!!(1:10)))
#> █─mean
#> └─<inline integer>
19.6 ... (dot-dot-dot)
A: exec() takes a function (f), its arguments (...) and an environment (.env)
as input. This allows us to construct a call from f and ... and evaluate this call
in the supplied environment. As the ... argument is handled via list2(),
exec() supports tidy dots (quasiquotation), which means that arguments and
names (on the left-hand side of :=) can be unquoted via !! and !!!.
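A quick usage sketch:

```r
library(rlang)

extra_args <- list(na.rm = TRUE)
exec(mean, x = c(1:10, NA), !!!extra_args)
#> [1] 5.5
```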
Q2: Carefully read the source code for interaction(), expand.grid(), and
par(). Compare and contrast the techniques they use for switching between
dots and list behaviour.
A: All three functions capture the dots via args <- list(...).
interaction() computes factor interactions between the captured input fac-
tors by iterating over the args. When a list is provided this is detected via
length(args) == 1 && is.list(args[[1]]) and one level of the list is stripped
through args <- args[[1]]. The rest of the function's code doesn't differentiate
further between list and dots behaviour.
expand.grid() uses the same strategy and also assigns args <- args[[1]] in
case of length(args) == 1 && is.list(args[[1]]).
par() does the most pre-processing to ensure a valid structure of the args
argument. When no dots are provided (!length(args)) it creates a list of argu-
ments from an internal character vector (partly depending on its no.readonly
argument). Further, given that all elements of args are character vectors
(all(unlist(lapply(args, is.character)))), args is turned into a list via
as.list(unlist(args)) (this flattens nested lists). Similar to the other
functions, one level of args gets stripped via args <- args[[1L]] when args is
of length one and its first element is a list.
Q3: Explain the problem with this definition of set_attr()
object’s argument name doesn’t help in this case — as can be seen in the
example where the object is consequently treated as an unnamed attribute.
However, we may name the first argument .x, which seems clearer and less
likely to invoke errors. In this case 1:10 will get the (named) attribute x = 10
assigned:
set_attr(1:10, x = 10)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> attr(,"x")
#> [1] 10
19.7 Case studies
bc <- function(lambda) {
if (lambda == 0) {
function(x) log(x)
} else {
function(x) (x ^ lambda - 1) / lambda
}
}
bc2 <- function(lambda) {
if (!!lambda == 0) {
new_function(exprs(x = ), expr(log(x)))
} else {
new_function(exprs(x = ), expr((x ^ (!!lambda) - 1) / !!lambda))
}
}
bc2(0)
#> function (x)
#> log(x)
#> <environment: 0x562ef541ac40>
bc2(2)
#> function (x)
#> (x^2 - 1)/2
#> <environment: 0x562ef54720b8>
bc2(2)(2)
#> [1] 1.5
compose2 <- function(f, g) {
f <- enexpr(f)
g <- enexpr(g)
new_function(exprs(... = ), expr((!!f)((!!g)(...))))
}
compose(sin, cos)
#> function(...) f(g(...))
compose2(sin, cos)
#> function (...)
#> sin(cos(...))
#> <environment: 0x562ef4403098>
compose2(sin, cos)(pi)
#> [1] -0.841
20
Evaluation
Prerequisites
On our journey through R's metaprogramming, we continue to use the
functions from the {rlang} package.
library(rlang)
20.2 Evaluation basics
Q3: Fill in the function bodies below to re-implement get() using sym() and
eval(), and assign() using sym(), expr(), and eval(). Don’t worry about the
multiple ways of choosing an environment that get() and assign() support;
assume that the user supplies it explicitly.
# name is a string
get2 <- function(name, env) {}
assign2 <- function(name, value, env) {}
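Filled in, the bodies might look as follows (a sketch using {rlang}; the default value for env is our own assumption):

```r
library(rlang)

get2 <- function(name, env = caller_env()) {
  # Turn the string into a symbol and evaluate it in env
  name_sym <- sym(name)
  eval(name_sym, env)
}

assign2 <- function(name, value, env = caller_env()) {
  # Build the expression `name <- value` and evaluate it in env
  assign_expr <- expr(!!sym(name) <- !!value)
  eval(assign_expr, env)
}
```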
x <- 1
get2("x")
#> [1] 1
To build the correct expression for the value assignment, we unquote using !!.
assign2("x", 4)
x
#> [1] 4
Q4: Modify source2() so it returns the result of every expression, not just the
last one. Can you eliminate the for loop?
A: The code for source2() was given in Advanced R as:
invisible(res)
}
invisible(res)
}
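The listings above were truncated in extraction. Here is a sketch of a source2() that keeps every result and replaces the for loop with lapply(); the parse_exprs()-based structure follows Advanced R:

```r
library(rlang)

source2 <- function(path, env = caller_env()) {
  # Parse the whole file into a list of top-level expressions
  file_content <- paste(readLines(path, warn = FALSE), collapse = "\n")
  exprs <- parse_exprs(file_content)

  # Evaluate each expression and keep all results, not just the last
  res <- lapply(exprs, eval, envir = env)

  invisible(res)
}
```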
Let’s create a file and test source2(). Keep in mind that <- returns invisibly.
(source2(tmp_file))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 1
#>
#> [[3]]
#> [1] 2
#>
#> [[4]]
#> [1] 2
Q5: Explain how local() works in words. (Hint: you might want to print(call)
to help understand what substitute() is doing, and read the documentation
to remind yourself what environment new.env() will inherit from.)
A: Let’s follow the advice and add print(call) inside of local3():
local3({
x <- 10
x * 2
})
#> eval(quote({
#> x <- 10
#> x * 2
#> }), new.env())
#> [1] 20
Next, call will be evaluated in the caller environment (aka the parent frame).
Given that call contains another call to eval(), why does this matter? The
answer is subtle: this outer environment determines where the bindings for
eval, quote, and new.env are found, and it is also where new.env() gets called,
so the newly created environment's parent is the calling environment and
bindings from the caller remain visible during the evaluation.
eval(quote({
x <- 10
x * 2
}), new.env())
#> [1] 20
exists("x")
#> [1] TRUE
20.3 Quosures
Q1: Predict what evaluating each of the following quosures will return if
evaluated.
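The quosure definitions were lost in this excerpt; in Advanced R they are built as nested quosures, roughly like this (reproduced from memory, treat as a sketch):

```r
library(rlang)

q1 <- new_quosure(expr(x), env(x = 1))
q2 <- new_quosure(expr(x + !!q1), env(x = 10))
q3 <- new_quosure(expr(x + !!q2), env(x = 100))
```

Each quosure is evaluated in its own environment, so the three x symbols resolve to 1, 10, and 100 respectively.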
eval_tidy(q1)
#> [1] 1
eval_tidy(q2)
#> [1] 11
eval_tidy(q3)
#> [1] 111
Q2: Write an enenv() function that captures the environment associated with
an argument. (Hint: this should only require two function calls.)
A: A quosure captures both the expression and the environment. From a
quosure, we can access the environment with the help of get_env().
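The definition itself is missing from this excerpt; it needs exactly the two calls mentioned in the hint:

```r
library(rlang)

enenv <- function(arg) {
  # enquo() captures the argument as a quosure,
  # get_env() extracts the quosure's environment
  get_env(enquo(arg))
}
```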
# Test
enenv(x)
#> <environment: R_GlobalEnv>
20.4 Data masks
for (i in seq_along(dots)) {
name <- names(dots)[[i]]
dot <- dots[[i]]
.data
}
A for loop applies the processing steps to .data iteratively. This includes
updating .data and reusing the same variable names. This makes it possible
to apply transformations sequentially, so that subsequent transformations can
refer to columns that were just created.
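Restoring the parts lost above, the full transform2() from Advanced R looks roughly like this:

```r
library(rlang)

transform2 <- function(.data, ...) {
  dots <- enquos(...)

  for (i in seq_along(dots)) {
    name <- names(dots)[[i]]
    dot <- dots[[i]]

    # Evaluate each expression with .data as a data mask and
    # store the result back into .data under its name
    .data[[name]] <- eval_tidy(dot, .data)
  }

  .data
}

transform2(data.frame(x = 1:3), y = x * 2, z = y + 1)
```

Because .data is updated inside the loop, z = y + 1 can already see the freshly created column y.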
Q2: Here’s an alternative implementation of subset2():
Compare and contrast subset3() to subset2(). What are its advantages and
disadvantages?
A: Let’s take a closer look at subset2() first:
# subset2() evaluation
(rows_val <- eval_tidy(quo(x == 1), df))
#> [1] TRUE FALSE FALSE
df[rows_val, , drop = FALSE]
#> x
#> 1 1
With subset3() both of these steps occur in a single line (which is probably
closer to what one would produce by hand). This means that the subsetting
is also evaluated in the context of the data mask.
# subset3() evaluation
eval_tidy(expr(df[x == 1, , drop = FALSE]), df)
#> x
#> 1 1
This is shorter (but probably also less readable) because the evaluation and
the subsetting take place in the same expression. However, it may introduce
unwanted errors if the data mask contains an element named "data", as
objects from the data mask take precedence over arguments of the function.
# Unquoting data-argument
lm3c <- function(formula, data) {
formula <- enexpr(formula)
data_quo <- enexpr(data)
Q2: When model building, typically the response and data are relatively con-
stant while you rapidly experiment with different predictors. Write a small
wrapper that allows you to reduce duplication in the code below.
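The reference code is not included in this excerpt; a wrapper in the spirit of the exercise might look like this (the function and argument names are our own):

```r
library(rlang)

# Keep the response and data fixed; vary only the predictor expression
lm_explore <- function(pred, resp = mpg, data = mtcars) {
  pred <- enexpr(pred)
  resp <- enexpr(resp)
  data <- enexpr(data)

  formula <- new_formula(resp, pred, env = caller_env())
  eval(expr(lm(!!formula, data = !!data)), caller_env())
}

lm_explore(disp)
lm_explore(I(1 / disp) + cyl)
```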
the data to the function, but only a permutation of this data (resample_data)
will be used.
With this approach the evaluation needs to take place within the function’s
environment, because the resampled dataset (defined as a default argument)
will only be available in the function environment.
Overall, putting an essential part of the pre-processing outside of the
function's body is not common practice in R. Compared to the unquoting
implementation (resample_lm1() in Advanced R), this approach captures the
model call in a more meaningful way. It will also lead to a new resample
every time you update() the model.
21
Translating R code
Prerequisites
In this chapter we combine R's metaprogramming and functional programming
capabilities and therefore load both the {rlang} and the {purrr} packages.
library(rlang)
library(purrr)
21.2 HTML
Q1: The escaping rules for <script> tags are different because they contain
JavaScript, not HTML. Instead of escaping angle brackets or ampersands, you
need to escape </script> so that the tag isn’t closed too early. For example,
script("'</script>'") shouldn't generate this:
<script>'</script>'</script>
but rather this:
<script>'<\/script>'</script>
Adapt escape() to follow these rules when a new argument script is set
to TRUE.
A: We are asked to implement a special case of escaping for the <script>
tag. First we revisit the relevant functions provided in Advanced R and
confirm that our code reliably escapes tags like <p> and <b> but doesn't
escape correctly for the <script> tag. Then we modify the escape() and tag()
functions to redefine the <script> tag and confirm that all defined tags now
escape correctly.
Note that the <style> tag, which contains styling information in CSS, follows
the same escaping rules as the <script> tag. We therefore implement the
desired escaping for the <style> tag function also.
Let’s start by loading the relevant code from Advanced R first.
# Escaping
html <- function(x) structure(x, class = "advr_html")
html(x)
}
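The escape() functions themselves did not survive extraction (only the tail of escape.character() remains above); reproduced from memory from Advanced R, they look like this:

```r
html <- function(x) structure(x, class = "advr_html")

escape <- function(x) UseMethod("escape")

escape.character <- function(x) {
  # Replace the HTML special characters with their entities;
  # & must be handled first to avoid double-escaping
  x <- gsub("&", "&amp;", x)
  x <- gsub("<", "&lt;", x)
  x <- gsub(">", "&gt;", x)
  html(x)
}

# Already-escaped HTML is passed through unchanged
escape.advr_html <- function(x) x
```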
dots_partition <- function(...) {
dots <- list2(...)
if (is.null(names(dots))) {
is_named <- rep(FALSE, length(dots))
} else {
is_named <- names(dots) != ""
}
list(
named = dots[is_named],
unnamed = dots[!is_named]
)
}
# Tag functions
tag <- function(tag) {
new_function(
exprs(... = ),
expr({
dots <- dots_partition(...)
attribs <- html_attributes(dots$named)
children <- map_chr(dots$unnamed, escape)
html(paste0(
!!paste0("<", tag), attribs, ">",
paste(children, collapse = ""),
!!paste0("</", tag, ">")
))
}),
caller_env()
)
}
This code escapes the <p> and <b> tags correctly, but doesn’t achieve the
desired behaviour for the <script> tag yet:
p <- tag("p")
b <- tag("b")
script <- tag("script")
identical(
p("&","and <", b("& > will be escaped")) %>%
as.character(),
"<p>&amp;and &lt;<b>&amp; &gt; will be escaped</b></p>"
)
#> [1] TRUE
identical(
script("Don't escape &, <, > - escape </script> and </style>") %>%
as.character(),
paste("<script>Don't escape &, <, >",
"- escape <\\/script> and <\\/style></script>")
)
#> [1] FALSE
We implement the desired change and add the optional argument script to
the escape() and the tag() functions (default: script = FALSE). The argument
has to be added for all methods of the escape() generic.
escape.character <- function(x, script = FALSE) {
if (script) {
x <- gsub("</script>", "<\\/script>", x, fixed = TRUE)
x <- gsub("</style>", "<\\/style>", x, fixed = TRUE)
} else {
x <- gsub("&", "&amp;", x)
x <- gsub("<", "&lt;", x)
x <- gsub(">", "&gt;", x)
}
html(x)
}
tag <- function(tag, script = FALSE) {
new_function(
exprs(... = ),
expr({
dots <- dots_partition(...)
attribs <- html_attributes(dots$named)
children <- map_chr(dots$unnamed, escape, script = !!script)
html(paste0(
!!paste0("<", tag), attribs, ">",
paste(children, collapse = ""),
!!paste0("</", tag, ">")
))
}),
caller_env()
)
}
Finally, we create new <p>, <b> and <script> tag functions, which now pass
their escaping tests.
p <- tag("p")
b <- tag("b")
script <- tag("script", script = TRUE)
identical(
p("&","and <", b("& > will be escaped")) %>%
as.character(),
"<p>&amp;and &lt;<b>&amp; &gt; will be escaped</b></p>"
)
#> [1] TRUE
identical(
script("Don't escape &, <, > - escape </script> and </style>") %>%
as.character(),
paste("<script>Don't escape &, <, >",
"- escape <\\/script> and <\\/style></script>")
)
#> [1] TRUE
Q2: The use of ... for all functions has some big downsides. There’s no
input validation and there will be little information in the documentation or
autocomplete about how they are used in the function. Create a new function
that, when given a named list of tags and their attribute names (like below),
creates tag functions with named arguments.
list(
a = c("href"),
img = c("src", "width", "height")
)
new_function(
exprs(... = , !!!attr_args),
expr({
ellipsis::check_dots_unnamed()
html(paste0(
!!paste0("<", tag), attribs, ">",
paste(children, collapse = ""),
!!paste0("</", tag, ">")
))
})
)
}
To validate our new function factory, we modify the with_html() example from
Advanced R to work with our newly created a() and img() tag functions.
with_tags(
a(
img("Correct me if I am wrong", id = "second"),
href = "https://github.com/Tazinho/Advanced-R-Solutions/issues",
id = "first"
)
)
#> <HTML> <a id='first'
#> href='https://github.com/Tazinho/Advanced-R-Solutions/issues'><img
#> id='second'>Correct me if I am wrong</img></a>
Q3: Reason about the following code that calls with_html() referencing objects
from the environment. Will it work or fail? Why? Run the code to verify your
predictions.
tags <- c(
"a", "abbr", "address", "article", "aside", "audio",
"b", "bdi", "bdo", "blockquote", "body", "button", "canvas",
"caption", "cite", "code", "colgroup", "data", "datalist",
"dd", "del", "details", "dfn", "div", "dl", "dt", "em",
"eventsource", "fieldset", "figcaption", "figure", "footer",
"form", "h1", "h2", "h3", "h4", "h5", "h6", "head", "header",
"hgroup", "html", "i", "iframe", "ins", "kbd", "label",
"legend", "li", "mark", "map", "menu", "meter", "nav",
"noscript", "object", "ol", "optgroup", "option", "output",
"p", "pre", "progress", "q", "ruby", "rp", "rt", "s", "samp",
"script", "section", "select", "small", "span", "strong",
"style", "sub", "summary", "sup", "table", "tbody", "td",
"textarea", "tfoot", "th", "thead", "time", "title", "tr",
"u", "ul", "var", "video"
)
Now let us briefly recall that with_html() was introduced to evaluate tag
functions from within a list. Otherwise, defining tag functions like body(),
source(), summary() etc. within the global environment would collide with base
R functions of the same name. To prevent this, the DSL code wrapped in
with_html() is evaluated within the "context" of html_tags, which was provided
as a data mask to eval_tidy(). As ?rlang::as_data_mask mentions:
“Objects in the mask have precedence over objects in the environment”.
Therefore, p() refers to the tag function from html_tags within both examples
from the exercise. However, as address is not only a string within the global
environment, but also a tag function within html_tags (the <address> HTML
tag may be used to provide contact information on an HTML page), p()
operates on address() in the second example. This correctly leads to an error
as we haven’t implemented an escape.function() method.
Q4: Currently the HTML doesn’t look terribly pretty, and it’s hard to see the
structure. How could you adapt tag() to do indenting and formatting? (You
may need to do some research into block and inline tags.)
A: First, let us load all relevant functions from Advanced R:
caller_env()
)
}
tags <- c(
"a", "abbr", "address", "article", "aside", "audio", "b",
"bdi", "bdo", "blockquote", "body", "button", "canvas",
"caption", "cite", "code", "colgroup", "data", "datalist",
"dd", "del", "details", "dfn", "div", "dl", "dt", "em",
"eventsource", "fieldset", "figcaption", "figure", "footer",
"form", "h1", "h2", "h3", "h4", "h5", "h6", "head", "header",
"hgroup", "html", "i", "iframe", "ins", "kbd", "label", "legend",
"li", "mark", "map", "menu", "meter", "nav", "noscript", "object",
"ol", "optgroup", "option", "output", "p", "pre", "progress", "q",
"ruby", "rp", "rt", "s", "samp", "script", "section", "select",
"small", "span", "strong", "style", "sub", "summary", "sup",
"table", "tbody", "td", "textarea", "tfoot", "th", "thead",
"time", "title", "tr", "u", "ul", "var", "video"
)
void_tags <- c(
"area", "base", "br", "col", "command", "embed", "hr", "img",
"input", "keygen", "link", "meta", "param", "source",
"track", "wbr"
)
html_tags <- c(
tags %>% set_names() %>% map(tag),
void_tags %>% set_names() %>% map(void_tag)
)
with_html(
body(
h1("A heading", id = "first"),
p("Some text &", b("some bold text.")),
img(src = "myimg.png", width = 100, height = 100)
)
)
#> <HTML> <body><h1 id='first'>A heading</h1><p>Some text
#> &amp;<b>some bold text.</b></p><img src='myimg.png'
#> width='100' height='100' /></body>
The output consists of a single long line of code, which makes it difficult to
check the content of the HTML and its correctness.
What kind of formatting would we prefer instead? Google's HTML style guide
(https://google.github.io/styleguide/htmlcssguide.html#HTML_Formatting_Rules)
suggests indentation by 2 spaces and new lines for every block, list,
or table element. There are other recommendations, but we will keep things
simple and will be satisfied with the following output.
<body>
<h1 id='first'>A heading</h1>
<p>Some text &amp;<b>some bold text.</b></p>
<img src='myimg.png' width='100' height='100' />
</body>
In our desired output we can see that the content of the body() function
requires different formatting than the other tag functions. We will therefore
create a new format_code() function that allows for optional indentation and
line breaks.
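The definition of format_code() is not shown in this excerpt; a minimal sketch that produces the indentation above might be:

```r
format_code <- function(children, indent = FALSE) {
  if (indent) {
    # Put each child on its own line, indented by two spaces
    paste0("\n  ", paste(children, collapse = "\n  "), "\n")
  } else {
    paste(children, collapse = "")
  }
}
```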
We adjust the body function to include the format_code() helper. (This could
also be approached programmatically in the tag function factory.)
html(paste0(
"<body", attribs, ">",
format_code(children, indent = TRUE),
"</body>"
))
}
with_html(
body(
21.3 LaTeX
Q1: Add escaping. The special symbols that should be escaped by adding a
backslash in front of them are \, $, and %. Just as with HTML, you’ll need to
make sure you don’t end up double-escaping. So, you’ll need to create a small
S3 class and then use that in function operators. That will also allow you to
embed arbitrary LaTeX if needed.
A: Currently our to_math() function generates the following output:
to_math(`$`)
#> <LATEX> \mathrm{f}($) # instead of <LATEX> \$
to_math(a$b)
#> <LATEX> \mathrm{$}(a b) # instead of <LATEX> \mathrm{\$}(a b)
to_math(`\\`)
#> <LATEX> \mathrm{f}(\) # instead of <LATEX> \\
to_math(`%`)
#> <LATEX> \mathrm{f}(%) # instead of <LATEX> \%
To adjust this behaviour, we need an escape function with methods for the
character and advr_latex classes.
(Note that we must first repeat the underlying code from Advanced R. However,
since this would be a bit verbose and not very meaningful, we do not show this
step here.)
latex(x)
}
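The body of escape_latex() is truncated above; reconstructed as a sketch (ordering the backslash substitution first so the later substitutions are not escaped again), it looks like:

```r
latex <- function(x) structure(x, class = "advr_latex")

escape_latex <- function(x) UseMethod("escape_latex")

escape_latex.character <- function(x) {
  # Escape the backslash first, so the backslashes added below
  # for $ and % are not escaped a second time
  x <- gsub("\\", "\\\\", x, fixed = TRUE)
  x <- gsub("$", "\\$", x, fixed = TRUE)
  x <- gsub("%", "\\%", x, fixed = TRUE)
  latex(x)
}

# Already-escaped LaTeX passes through unchanged
escape_latex.advr_latex <- function(x) x
```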
# Known functions
f_env <- env_clone(f_env, call_env)
# Default symbols
names <- all_names(expr)
symbol_env <- as_environment(set_names(escape_latex(names), names),
parent = f_env)
# Known symbols
greek_env <- env_clone(greek_env, parent = symbol_env)
greek_env
}
to_math(`$`)
#> <LATEX> \$
to_math(a$b)
#> <LATEX> \mathrm{\$}(a b)
to_math(`\\`)
#> <LATEX> \\
to_math(`%`)
#> <LATEX> \%
Q2: Complete the DSL to support all the functions that plotmath supports.
A: You can see all supported functions in ?plotmath. There are a lot (!) so
here we choose to implement a representative sample:
to_math(x %+-% y)
to_math(x %*% y)
to_math(x %->% y)
to_math(bold(x))
to_math(x != y)
latex(out)
}
One specific property in this setting is that the environment where to_math()
evaluates the expression is not constant, but depends on what we already
know about the expression.
Next, we start building up latex_env(), which contains a chain of all the
necessary environments which to_math() checks to evaluate the expression in.
The first environment is the one for Greek letters.
greek <- c(
"alpha", "theta", "tau", "beta", "vartheta", "pi", "upsilon",
"gamma", "varpi", "phi", "delta", "kappa", "rho",
"varphi", "epsilon", "lambda", "varrho", "chi", "varepsilon",
"mu", "sigma", "psi", "zeta", "nu", "varsigma", "omega", "eta",
"xi", "Gamma", "Lambda", "Sigma", "Psi", "Delta", "Xi",
"Upsilon", "Omega", "Theta", "Pi", "Phi"
)
greek_list <- set_names(paste0("\\", greek), greek)
greek_env <- as_environment(greek_list)
We already know from Advanced R that e.g. to_math(pi) now correctly converts
to \pi. So, let's move on to the next one.
Here it'll become a bit more technical. Not every symbol is Greek (and not
every part of an expression is a symbol). To find out which symbols are present
within the expression, we use an approach from section 5 of the expressions
chapter (https://adv-r.hadley.nz/expressions.html#ast-funs), where Hadley
recursively walks the AST to distinguish between the different types of
expression elements.
Let’s briefly repeat the helpers defined in that section:
expr_type <- function(x) {
if (rlang::is_syntactic_literal(x)) {
"constant"
} else if (is.symbol(x)) {
"symbol"
} else if (is.call(x)) {
"call"
} else if (is.pairlist(x)) {
"pairlist"
} else {
typeof(x)
}
}
This lets us define all_names(), which returns the desired symbols, already
converted to characters.
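The recursive all_names() itself is omitted here; a self-contained sketch in base R (plus {rlang} for expr()) might be:

```r
library(rlang)

all_names <- function(x) {
  if (is.symbol(x)) {
    as.character(x)
  } else if (is.call(x)) {
    # Recurse into the arguments; the function position is handled
    # separately (by all_calls() later in the text)
    unique(c(character(), unlist(lapply(as.list(x)[-1], all_names))))
  } else {
    # Constants carry no symbols
    character()
  }
}

all_names(expr(x + pi + f(y)))
#> [1] "x"  "pi" "y"
```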
# Known symbols
env_clone(greek_env, parent = symbol_env)
}
In this way, to_math() will first convert all known Greek letters (found in
greek_env) and then any other symbols, which are left as is (in this
implementation).
We also have to add support for functions. This will give us the opportunity
to insert some specific support for plotmath functions.
To support a whole bunch of unary and binary functions within the function
environment (f_env), which will be added next to latex_env, Hadley defines
the following two helpers in Advanced R.
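The two helpers are not reproduced in this excerpt; they are simple function factories:

```r
# Wrap the single operand between a left and a right string
unary_op <- function(left, right) {
  function(e1) {
    paste0(left, e1, right)
  }
}

# Join the two operands with a separator
binary_op <- function(sep) {
  function(e1, e2) {
    paste0(e1, sep, e2)
  }
}
```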
# Grouping
`{` = unary_op("\\left{ ", " \\right}"),
`(` = unary_op("\\left( ", " \\right)"),
paste = paste,
# Labelling
hat = unary_op("\\hat{", "}"),
tilde = unary_op("\\tilde{", "}"),
# Plotmath
`%+-%` = binary_op(" \\pm "),
`%*%` = binary_op(" \\times "),
`%->%` = binary_op(" \\rightarrow "),
bold = unary_op("\\textbf{", "}"),
`!=` = binary_op(" \\neq ")
)
# Default symbols
names <- all_names(expr)
symbol_env <- as_environment(set_names(names), parent = f_env)
# Known symbols
greek_env <- env_clone(greek_env, parent = symbol_env)
greek_env
}
all_calls(expr(f(g + b, c, d(a))))
#> [1] "f" "+" "d"
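all_calls() itself is not shown; it mirrors all_names() but collects the function position instead. A self-contained sketch that reproduces the output above:

```r
library(rlang)

all_calls <- function(x) {
  if (is.call(x)) {
    fn_name <- as.character(x[[1]])
    # Collect this call's function name plus any names found in its arguments
    unique(c(fn_name, unlist(lapply(as.list(x)[-1], all_calls))))
  } else {
    character()
  }
}

all_calls(expr(f(g + b, c, d(a))))
#> [1] "f" "+" "d"
```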
# Known functions
f_env <- env_clone(f_env, call_env)
# Default symbols
names <- all_names(expr)
symbol_env <- as_environment(set_names(names), parent = f_env)
# Known symbols
greek_env <- env_clone(greek_env, parent = symbol_env)
greek_env
}
Finally, we rerun our tests and double-check the newly supported plotmath
operators.
# Unknown functions
to_math(f(g(x)))
#> <LATEX> \mathrm{f}(\mathrm{g}(x))
Part V
Techniques
23
Measuring performance
23.2 Profiling
Q1: Profile the following function with torture = TRUE. What is surprising?
Read the source code of rm() to figure out what’s going on.
profvis::profvis(f())
#> Error in parse_rprof(prof_output, expr_source): No parsing data
#> available. Maybe your function was too fast?
Setting torture = TRUE triggers garbage collection after every memory
allocation call, which may be useful for more exact memory profiling.
Surprisingly, profiling f() like this takes a very long time. What could be the
reason?
We follow the hint in the question and inspect the source code of rm():
rm() does a surprising amount of work to get the name of the object to delete
because it relies on non-standard evaluation.
We can make the job of rm() considerably simpler by using the list argument:
Unfortunately, this still takes too long, and we remain stuck in profiling.
Anecdotally, one of the authors once finished the profiling under an older R
version, but the output didn't seem very meaningful.
In conclusion, this question appears to be unanswerable for us, even for Hadley.
23.3 Microbenchmarking
Q1: Instead of using bench::mark(), you could use the built-in function
system.time(). But system.time() is much less precise, so you'll need to repeat
each operation many times with a loop, and then divide to find the average
time of each operation, as in the code below.
n <- 1e6
system.time(for (i in 1:n) sqrt(x)) / n
system.time(for (i in 1:n) x ^ 0.5) / n
n <- 1e6
x <- runif(100)
bench_df
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 sqrt(x) 272.99ns 322ns 1795203. 848B 30.5
#> 2 x^0.5 5.05µs 5.49µs 174914. 848B 6.12
We need to access the raw data so we can compare the results of both
benchmarking approaches.
We see that both approaches get the order of magnitude right. We assume
that the bench::mark() results may be a little more accurate because of its
high-precision timer. There may also be overhead introduced by the for loop
in the system.time() approach.
#> elapsed
#> 5.88e-07
t1_bench
#> [1] 6.75e-07
t2_systime["elapsed"]
#> elapsed
#> 5.36e-06
t2_bench
#> [1] 5.83e-06
Side Note: take a look at ?proc.time if you want to learn about the differences
between “user”, “system” and “elapsed” time.
Q2: Here are two other ways to compute the square root of a vector. Which
do you think will be fastest? Which will be slowest? Use microbenchmarking
to test your answers.
x ^ (1 / 2)
exp(log(x) / 2)
A: To compare these approaches, we’ll bench::mark() them and sort the result
by the median execution time.
x <- runif(100)
bm <- bench::mark(
sqrt(x),
x^0.5,
x^(1 / 2),
exp(log(x) / 2)
)
bm[order(bm$median), ]
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 sqrt(x) 312.11ns 394.07ns 2029264. 848B 203.
#> 2 exp(log(x)/2) 3.12µs 3.26µs 287368. 848B 0
#> 3 xˆ(1/2) 5.67µs 5.78µs 163139. 848B 0
#> 4 xˆ0.5 5.57µs 6.95µs 142455. 848B 0
As one might expect, the idiomatic primitive function sqrt() is the fastest.
The approach exp(log(x) / 2), which builds on two other primitive functions,
is also clearly faster than both variants based on the power operator, as the
benchmark above shows.
24
Improving performance
24.3 Checking for existing solutions
bench::mark(
"lm" = lm(
body_mass_g ~ bill_length_mm + species, data = penguins
) %>% coef(),
"biglm" = biglm::biglm(
body_mass_g ~ bill_length_mm + species, data = penguins
) %>% coef(),
"speedglm" = speedglm::speedlm(
body_mass_g ~ bill_length_mm + species, data = penguins
) %>% coef(),
"fastLm" = RcppEigen::fastLm(
body_mass_g ~ bill_length_mm + species, data = penguins
) %>% coef()
)
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
For larger datasets, the selection of the appropriate method is of greater
relevance:
bench::mark(
"lm" = lm(y ~ x1 + x2, data = td) %>% coef(),
"biglm" = biglm::biglm(y ~ x1 + x2, data = td) %>% coef(),
"speedglm" = speedglm::speedlm(y ~ x1 + x2, data = td) %>% coef(),
"fastLm" = RcppEigen::fastLm(y ~ x1 + x2, data = td) %>% coef()
)
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 lm 67.5ms 67.5ms 14.8 27MB 119.
#> 2 biglm 27.2ms 32.8ms 30.9 22.2MB 51.5
#> 3 speedglm 24.3ms 25ms 36.3 20.4MB 45.4
#> 4 fastLm 52.9ms 52.9ms 18.9 30.2MB 170.
For further speed improvements, you could install a linear algebra library
optimised for your system (see ?speedglm::speedlm).
The functions of class ‘speedlm’ may speed up the fitting of LMs to large
datasets. High performances can be obtained especially if R is linked
against an optimized BLAS, such as ATLAS.
Tip: In case your dataset is stored in a database, you might want to check
out the {modeldb} package (https://github.com/tidymodels/modeldb) [Kuhn,
2020] which executes the linear model code in the corresponding database
backend.
Q2: What package implements a version of match() that’s faster for repeated
lookups? How much faster is it?
A: A web search points us to the {fastmatch} package [Urbanek, 2017]. We
compare it to base::match() and observe an impressive performance gain.
bench::mark(
match = match(x, table),
fastmatch = fastmatch::fmatch(x, table)
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 match 15.5ms 16.2ms 60.1 1.46MB 2.07
#> 2 fastmatch 403.9µs 427.1µs 2259. 442.69KB 2.01
Q3: List four functions (not just those in base R) that convert a string into a
date time object. What are their strengths and weaknesses?
A: The usual base R way is to use the as.POSIXct() generic and create a date
time object of class POSIXct and type integer.
Under the hood, as.POSIXct() employs as.POSIXlt() for the character
conversion. This creates a date time object of class POSIXlt and type list.
The POSIXlt class has the advantage that it carries the individual time
components as attributes. This allows us to extract the time components via
typical list operators.
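(The creation of the date_lt object used below is missing from this excerpt; it would have been along these lines:)

```r
# A POSIXct date time, converted to its POSIXlt (list) representation
date_ct <- as.POSIXct("2020-01-01 12:30:25", tz = "UTC")
date_lt <- as.POSIXlt(date_ct)
```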
attributes(date_lt)
#> $names
#> [1] "sec" "min" "hour" "mday" "mon" "year" "wday"
#> [8] "yday" "isdst" "zone" "gmtoff"
#>
#> $class
#> [1] "POSIXlt" "POSIXt"
date_lt$sec
#> [1] 25
However, while lists may be practical, basic calculations are often faster and
require less memory for objects with underlying integer type.
bench::mark(
date_lt2 - date_lt2,
date_ct2 - date_ct2,
date_ct2 - date_lt2
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 date_lt2 - date_lt2 24.2ms 24.8ms 39.1 1.36MB 2.06
#> 2 date_ct2 - date_ct2 52.3µs 71.4µs 13382. 195.45KB 54.3
#> 3 date_ct2 - date_lt2 11.9ms 12.4ms 80.0 781.95KB 0
as.POSIXlt() in turn uses strptime() under the hood, which creates a similar
date time object.
bench::mark(
as.POSIXct = as.POSIXct("2020-01-01 12:30:25"),
as.POSIXct_format = as.POSIXct("2020-01-01 12:30:25",
format = "%Y-%m-%d %H:%M:%S"
),
strptime_format = strptime("2020-01-01 12:30:25",
format = "%Y-%m-%d %H:%M:%S"
)
)[1:3]
#> # A tibble: 3 x 3
#> expression min median
#> <bch:expr> <bch:tm> <bch:tm>
#> 1 as.POSIXct 41.9µs 45.67µs
A fourth way is to use the converter functions from the {lubridate} package
[Grolemund and Wickham, 2011], which contains wrapper functions (for the
POSIXct approach) with an intuitive syntax. (There is a slight decrease in
performance though.)
library(lubridate)
ymd_hms("2013-07-24 23:55:26")
#> [1] "2013-07-24 23:55:26 UTC"
bench::mark(
as.POSIXct = as.POSIXct("2013-07-24 23:55:26", tz = "UTC"),
ymd_hms = ymd_hms("2013-07-24 23:55:26")
)[1:3]
#> # A tibble: 2 x 3
#> expression min median
#> <bch:expr> <bch:tm> <bch:tm>
#> 1 as.POSIXct 40.6µs 49.33µs
#> 2 ymd_hms 2.38ms 2.77ms
For additional ways to convert characters into date time objects, have a look at
the {chron}, the {anytime} and the {fasttime} packages. The {chron} package
[James and Hornik, 2020] introduces new classes and stores times as fractions
of days in the underlying double type, while it doesn’t deal with time zones
and daylight savings. The {anytime} package [Eddelbuettel, 2020] aims to
convert “Anything to POSIXct or Date”. The {fasttime} package [Urbanek,
2016] contains only one function, fastPOSIXct().
Q4: Which packages provide the ability to compute a rolling mean?
A: A rolling mean is a useful statistic to smooth time-series, spatial and other
types of data. The size of the rolling window usually determines the amount
of smoothing and the number of missing values at the boundaries of the data.
The general functionality can be found in multiple packages, which vary in
speed and flexibility of the computations. Here is a benchmark for several
functions that all serve our purpose.
x <- 1:10
slider::slide_dbl(x, mean, .before = 1, .complete = TRUE)
#> [1] NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
bench::mark(
caTools = caTools::runmean(x, k = 2, endrule = "NA"),
data.table = data.table::frollmean(x, 2),
RcppRoll = RcppRoll::roll_mean(x, n = 2, fill = NA,
align = "right"),
slider = slider::slide_dbl(x, mean, .before = 1, .complete = TRUE),
TTR = TTR::SMA(x, 2),
zoo_apply = zoo::rollapply(x, 2, mean, fill = NA, align = "right"),
zoo_rollmean = zoo::rollmean(x, 2, fill = NA, align = "right")
)
#> # A tibble: 7 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 caTools 76.5µs 93.6µs 10516. 165.69KB 20.2
#> 2 data.table 45.4µs 54.4µs 18221. 1.59MB 21.4
#> 3 RcppRoll 40.3µs 47.6µs 20828. 58.74KB 21.4
#> 4 slider 75.7µs 89.1µs 11037. 0B 21.3
#> 5 TTR 387.1µs 408.2µs 2340. 1.98MB 8.20
#> 6 zoo_apply 440.2µs 452.6µs 2150. 581.62KB 19.1
#> 7 zoo_rollmean 388µs 398.5µs 2430. 6.42KB 19.2
You may also take a look at an extensive example in the first edition of
Advanced R (http://adv-r.had.co.nz/Functionals.html), which demonstrates
how a rolling mean function can be created.
Q5: What are the alternatives to optim()?
A: According to its description (see ?optim), optim() implements:
General-purpose optimization based on Nelder–Mead, quasi-Newton and
conjugate-gradient algorithms. It includes an option for box-constrained
optimization and simulated annealing.
optim() allows us to optimise a function (fn) over an interval with a specific
method (method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent")).
Many detailed examples are given in the documentation. In the simplest case,
we give optim() the starting value par = 0 to calculate the minimum of a
quadratic polynomial:
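(The actual call is missing from this excerpt. Judging from the printed output below — a $value of -2450 and NA counts, which method = "Brent" produces for one-dimensional problems — it was something along these lines; the exact polynomial is our guess:)

```r
quad_poly <- function(x) x^2 - 100 * x + 50  # minimum of -2450 at x = 50

optim(par = 0, fn = quad_poly, method = "Brent", lower = -100, upper = 100)
```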
#>
#> $value
#> [1] -2450
#>
#> $counts
#> function gradient
#> NA NA
#>
#> $convergence
#> [1] 0
#>
#> $message
#> NULL
Since this solves a one-dimensional optimisation task, we could also have used stats::optimize().
For more general alternatives, the appropriate choice highly depends on the type of optimisation you intend to do. The CRAN task view on optimisation and mathematical modelling (https://cran.r-project.org/web/views/Optimization.html) can serve as a useful reference. Here are a couple of examples:
• {optimx} [Nash and Varadhan, 2011, Nash, 2014] extends the optim() function with the same syntax but more method choices.
• {RcppNumerical} [Qiu et al., 2019] wraps several open source libraries for
numerical computing (written in C++) and integrates them with R via
{Rcpp}.
• {DEoptim} [Mullen et al., 2011] provides a global optimiser based on the
Differential Evolution algorithm.
24.4 Doing as little as possible
Q1: What's the difference between rowSums() and .rowSums()?
A: When we inspect the source code of the user-facing rowSums(), we see that
it is designed as a wrapper around .rowSums() with some input validation,
conversions and handling of complex numbers.
rowSums
#> function (x, na.rm = FALSE, dims = 1L)
#> {
#> if (is.data.frame(x))
#> x <- as.matrix(x)
#> if (!is.array(x) || length(dn <- dim(x)) < 2L)
#> stop("'x' must be an array of at least two dimensions")
#> if (dims < 1L || dims > length(dn) - 1L)
#> stop("invalid 'dims'")
#> p <- prod(dn[-(id <- seq_len(dims))])
#> dn <- dn[id]
#> z <- if (is.complex(x))
#> .Internal(rowSums(Re(x), prod(dn), p, na.rm)) + (0+1i) *
#> .Internal(rowSums(Im(x), prod(dn), p, na.rm))
#> else .Internal(rowSums(x, prod(dn), p, na.rm))
#> if (length(dn) > 1L) {
#> dim(z) <- dn
#> dimnames(z) <- dimnames(x)[id]
#> }
#> else names(z) <- dimnames(x)[[1L]]
#> z
#> }
#> <bytecode: 0x5627c327f338>
#> <environment: namespace:base>
.rowSums
#> function (x, m, n, na.rm = FALSE)
#> .Internal(rowSums(x, m, n, na.rm))
#> <bytecode: 0x5627c2a43090>
#> <environment: namespace:base>
m <- matrix(rnorm(1e6), nrow = 1000)
bench::mark(
rowSums(m),
.rowSums(m, 1000, 1000)
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:t> <bch:> <dbl> <bch:byt> <dbl>
#> 1 rowSums(m) 2.21ms 2.25ms 427. 7.86KB 0
#> 2 .rowSums(m, 1000, 1000) 2.15ms 2.29ms 430. 7.86KB 0
Q2: Make a faster version of chisq.test() that only computes the chi-square
test statistic when the input is two numeric vectors with no missing values.
You can try simplifying chisq.test() or coding it directly from the mathematical
definition (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test).
A: We aim to speed up our reimplementation of chisq.test() by doing less.
We check if our new implementation returns the same results and benchmark
it afterwards.
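The definition of chisq.test2() itself did not survive into this copy. To keep the underlying arithmetic in view, here is a standalone C++ sketch of the Pearson statistic for a two-column table of counts (function and variable names are ours, not the book's):

```cpp
#include <cstddef>
#include <vector>

// Pearson's chi-square statistic for a contingency table whose two
// columns are given as count vectors a and b (no missing values).
double chisq_stat(const std::vector<double>& a, const std::vector<double>& b) {
    double col_a = 0, col_b = 0;
    std::vector<double> row(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        row[i] = a[i] + b[i];            // row margins
        col_a += a[i];                   // column margins
        col_b += b[i];
    }
    double n = col_a + col_b;            // grand total
    double stat = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double ea = row[i] * col_a / n;  // expected counts under independence
        double eb = row[i] * col_b / n;
        stat += (a[i] - ea) * (a[i] - ea) / ea;
        stat += (b[i] - eb) * (b[i] - eb) / eb;
    }
    return stat;
}
```

For the vectors used below (a = 21:25, b = seq(21, 29, 2)) this computes the same X-squared statistic that chisq.test(m) reports; skipping the input checks, p-value machinery, and result object is where the speed-up comes from.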
a <- 21:25
b <- seq(21, 29, 2)
m <- cbind(a, b)
bench::mark(
chisq.test(m),
chisq.test2(a, b),
check = FALSE
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 chisq.test(m) 58.1µs 67.4µs 14036. 0B 4.13
#> 2 chisq.test2(a, b) 17.3µs 18.8µs 52045. 0B 5.21
Q3: Can you make a faster version of table() for the case of an input of
two integer vectors with no missing values? Can you use it to speed up your
chi-square test?
A: When analysing the source code of table() we aim to omit everything
unnecessary and extract the main building blocks. We observe that table() is
powered by tabulate(), a very fast counting function. This leaves us
with the challenge of making the pre-processing as performant as possible.
First, we calculate the dimensions and names of the output table. Then we
use fastmatch::fmatch() to map the elements of each vector to the rank of
their value among the sorted unique values (i.e. the smallest value is mapped
to 1L, the second smallest to 2L, etc.). Following the logic within table() we combine and
shift these values to create a mapping of the integer pairs in our data to the
index of the output table. After applying these lookups tabulate() counts the
values and returns an integer vector with counts for each position in the table.
As a last step, we reuse the code from table() to assign the correct dimension
and class.
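The body of table2() was also lost in extraction. The mapping-and-shifting scheme described above can be sketched in standalone C++ (names and details are ours; the book's version uses fastmatch::fmatch() and tabulate()):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Cross-tabulate two integer vectors: map each value to its rank among
// the sorted unique values, combine both ranks into a single linear
// index, then count occurrences per index (the tabulate() step).
std::vector<int> table2d(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> ua(a), ub(b);
    std::sort(ua.begin(), ua.end());
    ua.erase(std::unique(ua.begin(), ua.end()), ua.end());
    std::sort(ub.begin(), ub.end());
    ub.erase(std::unique(ub.begin(), ub.end()), ub.end());
    int nrow = static_cast<int>(ua.size());
    std::vector<int> counts(nrow * static_cast<int>(ub.size()), 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        // 0-based ranks via binary search in the sorted unique values
        int ra = std::lower_bound(ua.begin(), ua.end(), a[i]) - ua.begin();
        int rb = std::lower_bound(ub.begin(), ub.end(), b[i]) - ub.begin();
        counts[rb * nrow + ra] += 1;   // column-major, like R's table()
    }
    return counts;
}
```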
bench::mark(
table(a, b),
table2(a, b)
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 table(a, b) 1.03ms 1.23ms 815. 1.29MB 22.2
#> 2 table2(a, b) 394.66µs 450.27µs 2144. 694.56KB 19.8
24.5 Vectorise
Q1: The density functions, e.g. dnorm(), have a common interface. Which
arguments are vectorised over? What does rnorm(10, mean = 10:1) do?
A: These functions are vectorised over their numeric arguments, which includes
the first argument (x, q, p, n) as well as mean and sd.
rnorm(10, mean = 10:1) generates ten random numbers from different normal
distributions. These normal distributions differ in their means. The first has
mean 10, the second mean 9, the third mean 8 and so on.
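The recycling of mean across the draws can be illustrated outside R as well. A minimal standalone C++ sketch (our own names, with std::mt19937 standing in for R's RNG):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw n standard-normal values and shift each by a mean that is
// recycled element-wise, mirroring rnorm(n, mean = ...) in R.
std::vector<double> rnorm_recycled(int n, const std::vector<double>& mean,
                                   unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> unit(0.0, 1.0);  // sd fixed at 1
    std::vector<double> out(n);
    for (int i = 0; i < n; ++i) {
        // the i-th draw is centred on the i-th (recycled) mean
        out[i] = mean[i % mean.size()] + unit(gen);
    }
    return out;
}
```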
Q2: Compare the speed of apply(x, 1, sum) with rowSums(x) for varying sizes
of x.
A: We compare the two functions for square matrices of increasing size:
library(ggplot2)
rowsums %>%
summary() %>%
dplyr::mutate(Approach = as.character(expression)) %>%
ggplot(
aes(p, median, color = Approach, group = Approach)) +
geom_point() +
geom_line() +
labs(x = "Number of Rows and Columns",
y = "Median (s)") +
theme(legend.position = "top")
[Figure: median runtime in seconds versus the number of rows and columns (1000 to 5000), one line per approach.]
We can see that the difference in performance is negligible for small matrices
but becomes more and more relevant as the size of the data increases. apply()
is a very versatile tool, but it’s not “vectorised for performance” and not as
optimised as rowSums().
Q3: How can you use crossprod() to compute a weighted sum? How much
faster is it than the naive sum(x * w)?
A: We can hand the vectors to crossprod(), which converts them to row- and
column-vectors and then multiplies these. The result is the dot product, which
corresponds to a weighted sum.
x <- rnorm(10)
w <- rnorm(10)
all.equal(sum(x * w), crossprod(x, w)[[1]])
#> [1] TRUE
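The same identity, expressed in C++ terms: a weighted sum is exactly the dot product, which the standard library provides as std::inner_product (a standalone sketch, not the BLAS routine that crossprod() actually calls):

```cpp
#include <numeric>
#include <vector>

// A weighted sum of x with weights w is the dot product x . w.
double weighted_sum(const std::vector<double>& x, const std::vector<double>& w) {
    return std::inner_product(x.begin(), x.end(), w.begin(), 0.0);
}
```

crossprod() is faster than sum(x * w) largely because it avoids materialising the intermediate vector x * w, just as the accumulator above does.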
weightedsum <- bench::press(
  n = 1:10,
  {
x <- rnorm(n * 1e6)
bench::mark(
sum = sum(x * x),
crossprod = crossprod(x, x)[[1]]
)
}
)
#> Running with:
#> n
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
weightedsum %>%
summary() %>%
dplyr::mutate(Approach = as.character(expression)) %>%
ggplot(aes(n, median, color = Approach, group = Approach)) +
geom_point() +
geom_line() +
labs(x = "Vector length (millions)",
y = "Median (s)") +
theme(legend.position = "top")
[Figure: median runtime in seconds versus vector length in millions (1 to 10), one line per approach.]
25
Rewriting R code in C++
Q1: For each of the following functions, read the code and figure out the corresponding base R function (f1 corresponds to mean(), f2 to cumsum() and f3 to any()).
double f1(NumericVector x) {
  int n = x.size();
  double y = 0;
  for(int i = 0; i < n; ++i) {
    y += x[i] / n;
  }
  return y;
}
NumericVector f2(NumericVector x) {
int n = x.size();
NumericVector out(n);
out[0] = x[0];
for(int i = 1; i < n; ++i) {
out[i] = out[i - 1] + x[i];
}
return out;
}
bool f3(LogicalVector x) {
  int n = x.size();
  for(int i = 0; i < n; ++i) {
    if (x[i]) return true;
  }
  return false;
}
Q2: To practice your function writing skills, convert the following functions into C++. For now, assume the inputs have no missing values.
1. all().
2. cumprod(), cummin(), cummax().
3. diff(). Start by assuming lag 1, and then generalise for lag n.
4. range().
5. var(). Read about the approaches you can take on Wikipedia (http:
//en.wikipedia.org/wiki/Algorithms_for_calculating_variance).
Whenever implementing a numerical algorithm, it’s always good to
check what is already known about the problem.
1. all()
bool allC(LogicalVector x) {
  int n = x.size();
  for (int i = 0; i < n; ++i) {
    if (!x[i]) return false;
  }
  return true;
}
2. cumprod(), cummin(), cummax()
NumericVector cumprodC(NumericVector x) {
int n = x.size();
NumericVector out(n);
out[0] = x[0];
for (int i = 1; i < n; ++i) {
out[i] = out[i - 1] * x[i];
}
return out;
}
NumericVector cumminC(NumericVector x) {
int n = x.size();
NumericVector out(n);
out[0] = x[0];
for (int i = 1; i < n; ++i) {
out[i] = std::min(out[i - 1], x[i]);
}
return out;
}
NumericVector cummaxC(NumericVector x) {
int n = x.size();
NumericVector out(n);
out[0] = x[0];
for (int i = 1; i < n; ++i) {
out[i] = std::max(out[i - 1], x[i]);
}
return out;
}
3. diff() (Start by assuming lag 1, and then generalise for lag n.)
NumericVector diffC(NumericVector x) {
  int n = x.size();
  NumericVector out(n - 1);
  for (int i = 1; i < n; ++i) {
    out[i - 1] = x[i] - x[i - 1];
  }
  return out;
}
4. range()
NumericVector rangeC(NumericVector x) {
  double omin = x[0], omax = x[0];
  int n = x.size();
  NumericVector out(2);
  for (int i = 1; i < n; ++i) {
    omin = std::min(x[i], omin);
    omax = std::max(x[i], omax);
  }
  out[0] = omin;
  out[1] = omax;
  return out;
}
5. var()
double varC(NumericVector x) {
int n = x.size();
if (n < 2) {
return NA_REAL;
}
double mx = 0;
for (int i = 0; i < n; ++i) {
mx += x[i] / n;
}
double out = 0;
  for (int i = 0; i < n; ++i) {
    out += pow(x[i] - mx, 2);
  }
  return out / (n - 1);
}
25.4 Missing values
Q1: Rewrite any of the functions from the first exercise to deal with missing values. If na_rm is true, ignore the missing values. If na_rm is false, return a missing value if the input contains any missing values.
A: We rewrite minC() so that it handles missing values in both modes.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector minC(NumericVector x, bool na_rm = false) {
int n = x.size();
NumericVector out = NumericVector::create(R_PosInf);
if (na_rm) {
for (int i = 0; i < n; ++i) {
      if (NumericVector::is_na(x[i])) {
continue;
}
if (x[i] < out[0]) {
out[0] = x[i];
}
}
} else {
for (int i = 0; i < n; ++i) {
if (NumericVector::is_na(x[i])) {
out[0] = NA_REAL;
return out;
}
if (x[i] < out[0]) {
out[0] = x[i];
}
}
}
return out;
}
minC(c(2:4, NA))
#> [1] NA
minC(c(2:4, NA), na_rm = TRUE)
#> [1] 2
minC(c(NA, NA), na_rm = TRUE)
#> [1] Inf
We also extend anyC() so it can deal with missing values. Note that this
(again) introduces some code duplication. This could be avoided by moving the
check for missing values into the inner loop, at the expense of a slight decrease
in performance. Here we use LogicalVector as the return type. If we used
bool instead, the C++ NA_LOGICAL would be converted into R's logical TRUE.
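The tri-state result (TRUE, FALSE, or NA) is exactly why a plain bool cannot work. Outside Rcpp the same idea can be modelled with std::optional<bool>; this standalone sketch follows R's any() semantics, in which a TRUE dominates an NA (names are ours, not Rcpp API):

```cpp
#include <optional>
#include <vector>

// R-style any() over tri-state logicals; std::nullopt models NA.
// A TRUE settles the result immediately; otherwise the result is NA
// if any NA was seen (unless na_rm drops the NAs), else FALSE.
std::optional<bool> any_tristate(const std::vector<std::optional<bool>>& x,
                                 bool na_rm = false) {
    std::optional<bool> out = false;
    for (const std::optional<bool>& v : x) {
        if (!v.has_value()) {              // an NA element
            if (!na_rm) out = std::nullopt;
        } else if (*v) {
            return true;                   // one TRUE dominates everything
        }
    }
    return out;
}
```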
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
LogicalVector anyC(LogicalVector x, bool na_rm = false) {
int n = x.size();
LogicalVector out = LogicalVector::create(false);
if (na_rm == false) {
for (int i = 0; i < n; ++i) {
if (LogicalVector::is_na(x[i])) {
out[0] = NA_LOGICAL;
return out;
} else {
if (x[i]) {
out[0] = true;
}
}
}
}
if (na_rm) {
for (int i = 0; i < n; ++i) {
if (LogicalVector::is_na(x[i])) {
continue;
}
if (x[i]) {
out[0] = true;
return out;
}
}
}
return out;
}
Q2: Rewrite cumsum() and diff() so they can handle missing values. Note
that these functions have slightly more complicated behaviour.
A: Our NA-aware cumsumC() function will return a vector of the same length
as x. By default (na_rm = FALSE) all values following the first NA input value
will be set to NA, because they depend on the unknown missing value. In case
of na_rm = TRUE the NA values are treated like zeros.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector cumsumC(NumericVector x, bool na_rm = false) {
int n = x.size();
NumericVector out(n);
LogicalVector is_missing = is_na(x);
if (!na_rm) {
out[0] = x[0];
for (int i = 1; i < n; ++i) {
if (is_missing[i - 1]) {
out[i] = NA_REAL;
} else{
out[i] = out[i - 1] + x[i];
}
}
}
if (na_rm) {
if (is_missing[0]) {
out[0] = 0;
} else {
out[0] = x[0];
}
for (int i = 1; i < n; ++i) {
if (is_missing[i]) {
out[i] = out[i-1] + 0;
} else {
out[i] = out[i-1] + x[i];
}
}
}
return out;
}
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector diffC(NumericVector x, int lag = 1,
                    bool na_rm = false) {
  int n = x.size();
  if (lag >= n) stop("`lag` must be less than `length(x)`.");
  NumericVector out(n - lag);
  for (int i = lag; i < n; ++i) {
    if (NumericVector::is_na(x[i]) ||
        NumericVector::is_na(x[i - lag])) {
      if (!na_rm) {
        return rep(NumericVector::create(NA_REAL), n - lag);
      }
      out[i - lag] = NA_REAL;
      continue;
    }
    out[i - lag] = x[i] - x[i - lag];
  }
  return out;
}
25.5 Standard Template Library
Q1: median.default() using partial_sort.
#include <algorithm>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double medianC(NumericVector x) {
int n = x.size();
if (n % 2 == 0) {
std::partial_sort (x.begin(), x.begin() + n / 2 + 1, x.end());
return (x[n / 2 - 1] + x[n / 2]) / 2;
  } else {
    std::partial_sort(x.begin(), x.begin() + n / 2 + 1, x.end());
    return x[n / 2];
  }
}
Q2: %in% using unordered_set and the find() or count() methods.
#include <Rcpp.h>
#include <unordered_set>
using namespace Rcpp;
// [[Rcpp::export]]
LogicalVector inC(CharacterVector x, CharacterVector table) {
std::unordered_set<String> seen;
seen.insert(table.begin(), table.end());
int n = x.size();
LogicalVector out(n);
for (int i = 0; i < n; ++i) {
out[i] = seen.find(x[i]) != seen.end();
}
return out;
}
Q3: unique() using an unordered_set (challenge: do it in one line!).
#include <Rcpp.h>
#include <unordered_set>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector uniqueC(NumericVector x) {
  std::unordered_set<double> seen;
  int n = x.size();
  std::vector<double> out;
  for (int i = 0; i < n; ++i) {
    // insert() reports whether the value was newly added
    if (seen.insert(x[i]).second) out.push_back(x[i]);
  }
  return wrap(out);
}
// As a one-liner
// [[Rcpp::export]]
std::unordered_set<double> uniqueCC(NumericVector x) {
return std::unordered_set<double>(x.begin(), x.end());
}
Q4: min() using std::min(), or max() using std::max().
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double minC(NumericVector x) {
int n = x.size();
  double out = x[0];
  for (int i = 1; i < n; ++i) {
    out = std::min(out, x[i]);
  }
  return out;
}
Q5: which.min() using min_element, or which.max() using max_element.
#include <Rcpp.h>
#include <algorithm>
#include <iterator>
using namespace Rcpp;
// [[Rcpp::export]]
double which_minC(NumericVector x) {
int out = std::distance(
x.begin(), std::min_element(x.begin(), x.end())
);
return out + 1;
}
Q6: setdiff(), union(), and intersect() for integers using sorted ranges and
set_union, set_intersection and set_difference.
A: The structure of the three functions will be very similar.
We first sort both input vectors. Then we apply the respective set_union,
set_intersection or set_difference function. After that, the result will be
between the iterators tmp.begin() and out_end. To retrieve the result, we loop
once through the range between tmp.begin() and out_end in the last part of
each function.
The set operations in base R will discard duplicated values in the arguments.
We achieve a similar behaviour by introducing a deduplication step, which
omits values that match their predecessor. For the symmetric set functions
unionC and intersectC this step is implemented for the output vector. For
setdiffC the deduplication is applied to the first input vector.
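The predecessor-comparison loop used in the functions below is the same idea the standard library packages up as std::unique, which removes consecutive duplicates from a sorted range. For comparison, a minimal standalone sketch:

```cpp
#include <algorithm>
#include <vector>

// Deduplicate a vector: sort it, let std::unique move the unique
// elements to the front, then erase the leftover tail.
std::vector<int> dedup_sorted(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}
```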
#include <Rcpp.h>
#include <unordered_set>
#include <algorithm>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector unionC(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  int ny = y.size();
  IntegerVector tmp(nx + ny);
  std::sort(x.begin(), x.end());
  std::sort(y.begin(), y.end());
  IntegerVector::iterator out_end = std::set_union(
    x.begin(), x.end(), y.begin(), y.end(), tmp.begin()
  );
  int prev_value = 0;
  IntegerVector out;
for (IntegerVector::iterator it = tmp.begin();
it != out_end; ++it) {
if ((it != tmp.begin()) && (prev_value == *it)) continue;
out.push_back(*it);
prev_value = *it;
}
return out;
}
// [[Rcpp::export]]
IntegerVector intersectC(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  int ny = y.size();
  std::sort(x.begin(), x.end());
  std::sort(y.begin(), y.end());
  IntegerVector tmp(std::min(nx, ny));
  IntegerVector::iterator out_end = std::set_intersection(
    x.begin(), x.end(), y.begin(), y.end(), tmp.begin()
  );
  int prev_value = 0;
  IntegerVector out;
for (IntegerVector::iterator it = tmp.begin();
it != out_end; ++it) {
if ((it != tmp.begin()) && (prev_value == *it)) continue;
out.push_back(*it);
prev_value = *it;
}
return out;
}
// [[Rcpp::export]]
IntegerVector setdiffC(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  IntegerVector tmp(nx);
std::sort(x.begin(), x.end());
int prev_value = 0;
IntegerVector x_dedup;
for (IntegerVector::iterator it = x.begin();
it != x.end(); ++it) {
if ((it != x.begin()) && (prev_value == *it)) continue;
x_dedup.push_back(*it);
prev_value = *it;
}
  std::sort(y.begin(), y.end());
  IntegerVector::iterator out_end = std::set_difference(
    x_dedup.begin(), x_dedup.end(), y.begin(), y.end(), tmp.begin()
  );
  IntegerVector out;
for (IntegerVector::iterator it = tmp.begin();
it != out_end; ++it) {
out.push_back(*it);
}
return out;
}
x <- c(1L, 2L, 3L)
y <- c(2L, 3L, 5L)
union(x, y)
#> [1] 1 2 3 5
unionC(x, y)
#> [1] 1 2 3 5
intersect(x, y)
#> [1] 2 3
intersectC(x, y)
#> [1] 2 3
setdiff(x, y)
#> [1] 1
setdiffC(x, y)
#> [1] 1