Exploring Operations Research with R
Features
• Can serve as a primary textbook for a comprehensive course in R, with applications in OR
• Suitable for post-graduate students in OR and data science, with a focus on the computational perspective of OR
• The text will also be of interest to professional OR practitioners as part of their continuing professional development
• Linked to a GitHub repository including code, solutions, data sets, and other ancillary material.
Jim Duggan
University of Galway, Ireland
Designed cover image: ShutterStock Images
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
For Marie, Kate, and James.
Contents
Preface

I Base R

1 Getting Started with R
 1.1 Introduction
 1.2 Exploring R via the RStudio console
 1.3 Calling functions
 1.4 Installing packages
 1.5 Using statistical functions in R
  1.5.1 Linear regression
  1.5.2 Correlation
 1.6 Next steps

2 Vectors
 2.1 Introduction
 2.2 Atomic vectors
  2.2.1 Primary types of atomic vectors
  2.2.2 Creating larger vectors
  2.2.3 The rules of coercion
  2.2.4 Naming atomic vector elements
  2.2.5 Missing values - introducing NA
 2.3 Vectorization
  2.3.1 Operators
  2.3.2 ifelse() vectorization
 2.4 Lists
  2.4.1 Visualizing lists
  2.4.2 Creating larger lists
  2.4.3 Naming list elements
  2.4.4 Converting a list to an atomic vector
 2.5 Mini-case: Two-dice rolls with atomic vectors
 2.6 Summary of R functions from Chapter 2
 2.7 Exercises

3 Subsetting Vectors
 3.1 Introduction
 3.2 Subsetting atomic vectors
  3.2.1 Positive integers
  3.2.2 Negative integers
  3.2.3 Logical vectors
  3.2.4 Named elements
 3.3 Subsetting lists
 3.4 Iteration using loops, and the if statement
 3.5 Mini-case: Star Wars movies
 3.6 Summary of R functions from Chapter 3
 3.7 Exercises

9 Relational Data with dplyr and Tidying Data with tidyr
 9.1 Introduction
 9.2 Relational data
 9.3 Mutating joins
  9.3.1 inner_join(x,y)
  9.3.2 left_join(x,y)
  9.3.3 right_join(x,y)
  9.3.4 full_join(x,y)
 9.4 Filtering joins
  9.4.1 semi_join(x,y)
  9.4.2 anti_join(x,y)
 9.5 Tidy data
 9.6 Making data longer using pivot_longer()
 9.7 Making data wider using pivot_wider()
 9.8 Mini-case: Exploring correlations with wind energy generation
 9.9 Summary of R functions from Chapter 9
 9.10 Exercises

11 Shiny
 11.1 Introduction
 11.2 Reactive programming

Bibliography
Index
Preface
The central idea behind this book is that R is a valuable computational tool
that can be applied to the field of operations research. R provides excellent
features such as data representation, data manipulation, and data analysis.
These features can be integrated with operations research techniques (e.g.,
simulation, linear programming and data science) to support an information
workflow which can provide insights to decision makers, and so, to paraphrase
the words of Jay W. Forrester, help convert information into action.
R is an open source programming language, with comprehensive support
for mathematics and statistics. With the development of R’s tidyverse —
an integrated system of packages for data manipulation, exploration, and
visualization — the use of R has seen significant growth, across many domains.
The Comprehensive R Archive Network (CRAN) provides access to thousands
of special purpose R packages (for example ggplot2 for visualization), and
these can be integrated into an analyst’s workflow.
Book structure
The book comprises three parts, where each part contains thematically related chapters, beginning with Part I on base R. The recommended development environment for working through the material is RStudio, and its interface provides a number of key components:
• The R console, which allows you to create and explore variables, and observe
the immediate impact of passing an instruction to R.
• Files and plots, where we can see the full file system for your written code
(folders, sub-folders, files), and also view any generated plots.
• The global environment which shows what variables you have created, and
therefore which ones can be explored and used for data processing.
Here are a number of tips you may find useful when using RStudio:
Recommended reading
We recommend the following books, which we have found to be valuable
reference resources in writing this textbook.
• Advanced R by Hadley Wickham (Wickham, 2019), which presents a deep
dive into R, and covers many fascinating technical topics, including object-
oriented programming and metaprogramming.
• R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund (Wickham et al., 2023), aimed at data scientists, which shows how to import, tidy, transform, visualize, and model data with R.
Acknowledgments
I would like to thank students from the University of Galway’s M.Sc. in Com-
puter Science (Data Analytics) for their enthusiasm (and valuable feedback)
on the module Programming for Data Analytics, a module I first lectured in
September 2015, and which forms the basis of Parts I and II in this textbook.
Thanks to my colleagues in the School of Computer Science, and across the Uni-
versity of Galway, for providing a collegial environment for teaching, learning,
and research; and to my colleagues in the Insight Centre for Data Analytics,
the PANDEM-2 project, the SafePlan project, and the Irish Epidemiological
Modelling Advisory Group, for contributing to my technical, mathematical,
and public health knowledge.
Thanks to my colleagues in the communities of the System Dynamics Society
and the Operational Research Society, where I had many opportunities to host
workshops demonstrating how R can be used to support system dynamics
and operations research. Thanks to the CRC Press/Taylor & Francis Group
Editorial Team, in particular, Callum Fraser and Mansi Kabra, for providing
me with the opportunity to propose, and write this textbook; and to Michele
Dimont, Project Editor, for her support during the final production process.
Finally, a special thank you to my family for their encouragement, inspiration,
and support.
Jim Duggan
University of Galway
Galway, Ireland
February 2024
About the Author
Part I
Base R
1
Getting Started with R
1.1 Introduction
This chapter introduces R, and presents a number of examples of R in action.
The main idea is to demonstrate the value of using the console to rapidly
create variables, perform simple operations, and display the results. Functions
are introduced, and examples are shown of how they can accept data and
generate outputs. The mechanism for loading libraries from the Comprehensive
R Archive Network (CRAN) is presented, because an extensive set of libraries
(e.g., the tidyverse) will be used in Parts II and III of the book. Finally, the
chapter provides a short summary of two statistical approaches (and their
relevant R functions) that will be utilized in later chapters: linear regression
and correlation.
Chapter structure
• Section 1.2 provides a glimpse of how the console can be effectively used to
create, manipulate, and display variables.
• Section 1.3 shows how we can call R functions to generate results.
• Section 1.4 highlights how we can leverage the work of the wider R community
by installing packages to support special-purpose tasks.
• Section 1.5 introduces two statistical functions in R: lm() for linear regression,
and cor() for correlation.
1.2 Exploring R via the RStudio console

Variables can be created directly at the console. For example, here we use the function c() to combine three values into a vector v, and then display the contents by typing the variable name.

v <- c(10, 20, 30)
v
#> [1] 10 20 30
1.3 Calling functions

We can now use other R functions to process the variable v, for example, the functions sum(), mean(), and sqrt(). The results are shown below.
sum(v)
#> [1] 60
mean(v)
#> [1] 20
sqrt(v)
#> [1] 3.162 4.472 5.477
Once a package is installed, it can then be loaded for use with the library() function. In this case, we load aimsir17 and then show the information that is contained in the variable stations. The first six rows are displayed by calling the function head().
library(aimsir17)
head(stations)
#> # A tibble: 6 x 5
#> station county height latitude longitude
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 ATHENRY Galway 40 53.3 -8.79
#> 2 BALLYHAISE Cavan 78 54.1 -7.31
#> 3 BELMULLET Mayo 9 54.2 -10.0
#> 4 CASEMENT Dublin 91 53.3 -6.44
#> 5 CLAREMORRIS Mayo 68 53.7 -8.99
#> 6 CORK AIRPORT Cork 155 51.8 -8.49
1.5 Using statistical functions in R

1.5.1 Linear regression

Linear regression models the relationship between a dependent variable Y and an independent variable X through the equation Y = β0 + β1X + ε, where β0 is the intercept and β1 is the slope. We can explore this with the mtcars dataset, by modeling miles per gallon (mpg) as a function of engine displacement (disp).
FIGURE 1.1 A linear model between engine size and miles per gallon (source
datasets::mtcars)
The subsetting notation used here (the $ operator) is explained in more detail later in Chapters 3 and 5, and for the moment we just assume that the terms allow us to access the required data.
We execute these two lines of code, and show the 32 observations for each
variable. For example, we can see the first observation in X is 160.0 (which
represents the displacement of the first car), and the corresponding value in Y
is 21.0 (the miles per gallon of the first car).
X <- mtcars$disp
X
#> [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6
#> [11] 167.6 275.8 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1
#> [21] 120.1 318.0 304.0 350.0 400.0 79.0 120.3 95.1 351.0 145.0
#> [31] 301.0 121.0
Y <- mtcars$mpg
Y
#> [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4
#> [13] 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3
#> [25] 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
Next, we invoke the lm() function by passing in the regression term Y~X (a formula in R), and then we store the results in the variable mod. The function coefficients() will process the variable mod to extract the two fitted parameters (which are the β0 and β1 terms we referred to earlier).
mod <- lm(Y~X)
coefficients(mod)
#> (Intercept) X
#> 29.59985 -0.04122
1.5.2 Correlation
Often problems involve exploring several variables to see how they may be
interrelated. For example, is it more likely to get windier as the atmospheric
pressure falls? A correlation problem arises when an analyst may ask whether
there is any relationship between a pair of variables of interest. The correlation
coefficient r is widely used to calculate the strength of the linear relationship
between two variables, with −1 ≤ r ≤ 1 (Hoel and Jessen, 1971). If r equals
−1 or +1, then all the points on a scatterplot of the two variables will lie
on a straight line. The interpretation of r is a purely mathematical one and completely devoid of any cause and effect implications (Hoel and Jessen, 1971), or, to use the well-known quote, "correlation does not imply causation".
In R, the correlation coefficient (the default is Pearson’s method) is calculated
between two or more data streams using the function cor(). We can return to
the displacement and miles per gallon data used in the previous example, and
calculate the correlation coefficient between the variables. The value shows a
strong negative (mathematical) relationship between the two variables.
cor(Y,X)
#> [1] -0.8476
Overall, the correlation measure is valuable during the exploratory data analysis
phase, and examples of correlation calculations will be presented in Chapter
12.
2 Vectors

2.1 Introduction

Knowledge of vectors is fundamental in R. A vector is a one-dimensional data structure, and there are two types of vectors: atomic vectors, where all the data must be of the same type, and lists, which are more flexible, as each element's type can vary. Upon completing the chapter, you should understand:
• The difference between an atomic vector and a list, and be able to create
atomic vectors and lists using the c() and list() functions.
• The four main types of atomic vector, and how different vector elements can
be named.
• The rules of coercion for atomic vectors, and the importance of the function
is.na().
• The idea of vectorization, and how arithmetic and logical operators can be
applied to vectors.
• R functions that allow you to manipulate vectors.
• How to solve all five test exercises.
Chapter structure
• Section 2.2 introduces the four main types of atomic vectors, and describes:
how to create larger vectors; a mechanism known as coercion; how to name
vector elements; and, how to deal with missing values.
• Section 2.3 explains vectorization, which allows for the same operation to be
carried out on each vector element.
• Section 2.4 describes the list, shows how to create and name lists, and also
how to create larger lists.
• Section 2.5 presents a mini-case, which highlights how atomic vectors can be
used to simulate the throwing of two dice, and then gather frequency data
on each throw. These simulated results are then compared with the expected
probability values. The mini-case shows the utility of using vectors, and how
R functions can speed up the analysis process.
• Section 2.6 provides a summary of all the functions introduced in the chapter.
• Section 2.7 provides a number of short coding exercises to test your under-
standing of the material covered.
(Note that there are two other data types in R, one for complex numbers, the other for raw bytes.)

2.2 Atomic vectors

A number of functions can be used to explore atomic vectors and their types:

• typeof(), which returns a character string describing the type of a variable.
• str(), which compactly displays the internal structure of a variable, and also shows the type. This is a valuable function that you will make extensive use of in R, particularly when you are exploring data returned from functions.
• is.logical(), is.double(), is.integer(), and is.character(), which test a variable's type, and return the logical value TRUE if the type aligns with the function name.
The four main data types are:
logical, where values can be either TRUE or FALSE, and the abbreviations T
and F can also be used. For example, here we declare a logical vector with five
elements.
x_logi <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
typeof(x_logi)
#> [1] "logical"
str(x_logi)
#> logi [1:5] TRUE TRUE FALSE TRUE FALSE
is.logical(x_logi)
#> [1] TRUE
integer, which represents whole numbers (negative and positive), and must
be declared by appending the letter L to the number. The significance of L is
that it is an abbreviation for the word long, which is a type of integer.
x_int <- c(2L, 4L, 6L, 8L, 10L)
typeof(x_int)
#> [1] "integer"
str(x_int)
#> int [1:5] 2 4 6 8 10
is.integer(x_int)
#> [1] TRUE
double, which represents floating point numbers. Note that integer and double
vectors are also known as numeric vectors (Wickham, 2019).
# A double vector with five elements
x_dbl <- c(1.2, 3.4, 7.2, 11.1, 12.7)
typeof(x_dbl)
#> [1] "double"
str(x_dbl)
#> num [1:5] 1.2 3.4 7.2 11.1 12.7
is.double(x_dbl)
#> [1] TRUE
character, which represents textual data, where each element of the vector is a string. For example:

x_chr <- c("One", "Two", "Three", "Four", "Five")
typeof(x_chr)
#> [1] "character"
str(x_chr)
#> chr [1:5] "One" "Two" "Three" "Four" "Five"
is.character(x_chr)
#> [1] TRUE
A further feature of the c() function is that a number of atomic vectors can
be appended by including their variable names inside the function call. This
is shown below, where we combine the variables v1 and v2 to generate new
variables v3 and v4.
# Create vector 1
v1 <- c(1,2,3)
# Create vector 2
v2 <- c(4,5,6)
# Combine both vectors into a new, larger vector
v3 <- c(v1, v2)
v3
#> [1] 1 2 3 4 5 6
# The inputs can be combined in any order (composition of v4 assumed)
v4 <- c(v2, v1)
v4
#> [1] 4 5 6 1 2 3

2.2.2 Creating larger vectors

R provides a number of convenient mechanisms for generating larger vectors.

• The colon operator : generates a regular sequence, for example, the integers 1 to 10 (definition assumed, consistent with the type and length shown below).

x <- 1:10
typeof(x)
#> [1] "integer"
length(x)
#> [1] 10
• The sequence function seq() offers more control, such as setting the step size or the output length.

# Generate a sequence from 1 to 10 in steps of 2
x4 <- seq(from = 1, to = 10, by = 2)
x4
#> [1] 1 3 5 7 9
x5 <- seq(length.out=10)
x5
#> [1] 1 2 3 4 5 6 7 8 9 10
• The replication function rep() replicates values contained in the input vector,
a given number of times.
# use rep to initialize a vector to zeros
z1 <- rep(0,10)
z1
#> [1] 0 0 0 0 0 0 0 0 0 0
The rules of coercion determine what happens when different types are combined in a single atomic vector: the result is coerced to the most flexible type present, following the hierarchy logical → integer → double → character. For example, combining a double with a character (values chosen for illustration) yields a character vector.

ex4 <- c(1.5, "hello")
typeof(ex4)
#> [1] "character"
2.2.4 Naming atomic vector elements

The elements of an atomic vector can be named. For example, here we assign the names a to e to the elements of x_dbl.

names(x_dbl) <- c("a", "b", "c", "d", "e")
x_dbl
#> a b c d e
#> 1.2 3.4 7.2 11.1 12.7
summary(x_dbl)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.20 3.40 7.20 7.12 11.10 12.70
A character vector of the element names can be easily extracted using a special
R function called names(), and this function also has a number of interesting
features. First, we can see how the element names are extracted.
# Extract the names of the x_dbl vector
x_dbl_names <- names(x_dbl)
typeof(x_dbl_names)
#> [1] "character"
x_dbl_names
#> [1] "a" "b" "c" "d" "e"
What is interesting about the function names() is that it can act as an accessor function that returns the names of the vector elements, and it can also be used to set the names of a vector. We can show this on an unnamed vector as follows. Note that R will not object if you give two different elements the same name, although you should avoid this, as it will impact the subsetting process.
2.2.5 Missing values - introducing NA
When working with real-world data, it is common that there will be missing
values. For example, a thermometer might break down on any given day,
causing an hourly temperature measurement to be missed. In R, NA is a logical
constant of length one which contains a missing value indicator. Therefore,
any value of a vector could have the value NA, and we can demonstrate this as
follows.
# define a vector v
v <- 1:10
v
#> [1] 1 2 3 4 5 6 7 8 9 10
# Set the final element to be missing
v[10] <- NA
# The presence of NA impacts functions such as max()
max(v)
#> [1] NA
There are two follow-on points of interest. First, how can we check whether NA
is present in a vector? To do this, the function is.na() must be used, and this
function returns a logical vector showing whether an NA value is present at a
given location. When we explore subsetting in Chapter 3, we will see how to
further manipulate any missing values.
v
#> [1] 1 2 3 4 5 6 7 8 9 NA
# Test for the presence of missing values
is.na(v)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
The second point to note is to reexamine the function call max(v), which
returned the value NA. This is important, as it shows the presence of an NA
value causes difficulty for what should be straightforward calculation. This
obstacle can be overcome by adding an additional argument to the max()
function which is na.rm=TRUE, which requests that max() omits NA values as
part of the data processing. This option is also available for many R functions
that operate on vectors, for example, sum(), mean() and min().
v
#> [1] 1 2 3 4 5 6 7 8 9 NA
max(v, na.rm = TRUE)
#> [1] 9
2.3 Vectorization
Vectorization is a powerful R feature that enables a function to operate on all the elements of an atomic vector, and return the results in a new atomic vector of the same size. In these scenarios, vectorization removes the requirement to write loop structures to iterate over the entire vector, and so it leads to a simplified data analysis process.
2.3.1 Operators
Similar to other programming languages, R has arithmetic, relational, and logical operators. R's main arithmetic operators are + (addition), - (subtraction), * (multiplication), / (division), ^ (exponentiation), %% (modulus), and %/% (integer division), and these are applied element by element.

# Define two test vectors (values consistent with the outputs below)
v1 <- c(10, 20, 30)
v2 <- c(2, 4, 3)
# Vector subtraction
v1 - v2
#> [1] 8 16 27
# Vector multiplication
v1 * v2
#> [1] 20 80 90
# Vector division
v1 / v2
#> [1] 5 5 10
# Vector exponentiation
v1 ^ v2
#> [1] 100 160000 27000
Note, in cases where two vectors are of unequal length, R has a recycling
mechanism, where the shorter vector will be recycled in order to match the
longer vector.
# Define two unequal vectors
v3 <- c(12, 16, 20, 24)
v3
#> [1] 12 16 20 24
v4 <- c(2,4)
v4
#> [1] 2 4
v3 - v4
#> [1] 10 12 18 20
v3 / v4
#> [1] 6 4 10 6
v3 ^ v4
#> [1] 144 65536 400 331776
Relational operators allow for comparison between two values, and they always return a logical vector. There are six relational operators: < (less than), <= (less than or equal to), > (greater than), >= (greater than or equal to), == (equal to), and != (not equal to). Here, they are applied to a test vector v5 (illustrative values, chosen to be consistent with the outputs below).

v5 <- c(5, 2, 4, 3, 6, 7)
v5 < 4
#> [1] FALSE TRUE FALSE TRUE FALSE FALSE
v5 <= 4
#> [1] FALSE TRUE TRUE TRUE FALSE FALSE
v5 > 4
#> [1] TRUE FALSE FALSE FALSE TRUE TRUE
v5 >= 4
#> [1] TRUE FALSE TRUE FALSE TRUE TRUE
v5 == 4
#> [1] FALSE FALSE TRUE FALSE FALSE FALSE
v5 != 4
#> [1] TRUE TRUE FALSE TRUE TRUE TRUE
The output from relational operations can take advantage of R’s coercion rules.
The sum() function will coerce logical values to either ones or zeros. Therefore,
it is easy to find the number of values that match the relational expression.
For example, in the following stream of numbers, we can see how many vector
elements are greater than the mean.
# Setup a test vector, in this case, a sequence
v6 <- 1:10
v6
#> [1] 1 2 3 4 5 6 7 8 9 10
# Count how many elements exceed the mean (TRUE coerces to 1)
sum(v6 > mean(v6))
#> [1] 5
Logical operators perform operations such as AND and NOT; for atomic vectors, the element-wise logical operators are & (AND), | (OR), and ! (NOT).
# Use logical AND to see which values are in the range 10-14
v >= 10 & v <= 14
#> [1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
2.3.2 ifelse() vectorization

v
#> [1] 1 2 3 4 5 6 7 8 9 10
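The vectorized ifelse() function evaluates a logical condition on every element of a vector, and returns one of two values for each element. A short sketch using the vector above (the labels are chosen for illustration):

# Label each element relative to the mean of the vector
ifelse(v > mean(v), "high", "low")
#> [1] "low" "low" "low" "low" "low" "high" "high" "high" "high" "high"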
2.4 Lists
A list is a vector that can contain different types, including a list. It is a
flexible data structure and is often used to return data from a function. A
good example is the linear regression function in R, lm(), which returns a list
containing all information relating to the results of a linear regression task.
A list can be defined using the list() function, which is similar to the c()
function used to create atomic vectors. Here is an example of defining a list.
# Create a list
l1 <- list(1:2,c(TRUE, FALSE),list(3:4,5:6))
# Display the list.
l1
#> [[1]]
#> [1] 1 2
#>
#> [[2]]
#> [1] TRUE FALSE
#>
#> [[3]]
#> [[3]][[1]]
#> [1] 3 4
#>
#> [[3]][[2]]
#> [1] 5 6
# Show the list type
typeof(l1)
#> [1] "list"
# Summarize the list structure
str(l1)
#> List of 3
#> $ : int [1:2] 1 2
#> $ : logi [1:2] TRUE FALSE
#> $ :List of 2
#> ..$ : int [1:2] 3 4
#> ..$ : int [1:2] 5 6
# Confirm the number of elements
length(l1)
#> [1] 3
2.4.1 Visualizing lists

A list can be visualized as a diagram in which the elements are separated by vertical lines; for l1, the diagram shows the three
list elements. The first is a vector of two integer elements, the second is the
vector with two logical elements. The third element is interesting, and shows
the flexibility of the list structure in R. This third element is a list itself, of size
2, and so it is represented by a rounded rectangle. It has its own contents, the
first element being an integer vector of size 2, the second another integer vector
of size 2. This example highlights an important point: the list can contain
elements of different types, including another list, and is highly flexible.
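2.4.2 Creating larger lists

As with atomic vectors, a larger list can be created with the vector() function, by specifying the mode as "list"; a minimal sketch, consistent with the discussion below:

# Create a list with three (empty) elements
l2 <- vector(mode = "list", length = 3)
l2
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL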
In this case, the variable l2 is a list that is ready for use, and the NULL value displayed is a reserved word in R, returned by expressions and functions whose value is undefined.
2.4.3 Naming list elements

The elements of a list can be named using the names() function, similar to what was carried out for atomic vectors. Here, as a starting point, we take the original unnamed list from the initial example.
# Create a list
l2 <- list(1:2,
c(TRUE, FALSE),
list(3:4,5:6))
# Name the list elements using names()
names(l2) <- c("el1","el2","el3")
str(l2)
#> List of 3
#> $ el1: int [1:2] 1 2
#> $ el2: logi [1:2] TRUE FALSE
#> $ el3:List of 2
#> ..$ : int [1:2] 3 4
#> ..$ : int [1:2] 5 6
Note that in this example, the inner list elements for element three are not
named; to do this, we would have to access that element directly, and use the
names() function. Accessing individual list elements is discussed in Chapter 3.
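2.4.4 Converting a list to an atomic vector

A list can be flattened into a single atomic vector using the function unlist(). A sketch using the named list l2 from above (the element names in the result follow R's default naming scheme):

# Convert the list to an atomic vector
v_flat <- unlist(l2)
v_flat
#> el11 el12 el21 el22 el31 el32 el33 el34
#>    1    2    1    0    3    4    5    6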
The result is what you might expect from this process. The logical values are
coerced to integer values, all the list values are present, and their order reflects
the order in the original list.
2.5 Mini-case: Two-dice rolls with atomic vectors

Here are the steps for the solution. We use the R function sample() to generate the dice rolls, and before that, call the function set.seed() to ensure that the same stream of random numbers is generated each time. The default behavior of the sample() function is that each element has the same chance of being selected. Replacement is set to TRUE, which means that once a value is drawn, it can be selected again in the following sample, which makes sense for repeated throws of a die.
# set the seed to 100, for replicability
set.seed(100)
# Create a variable for the number of throws
N <- 10000
# generate a sample for dice 1
dice1 <- sample(1:6, N, replace = T)
# generate a sample for dice 2
dice2 <- sample(1:6, N, replace = T)
We can observe the first six values from each vector, using the R function head(). Note that the function tail() will display the final six values of a vector. These two functions are widely used in R, across a range of data structures.
# Information on dice1
head(dice1)
#> [1] 2 6 3 1 2 6
summary(dice1)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 2.00 3.00 3.49 5.00 6.00
# Information on dice2
head(dice2)
#> [1] 4 5 2 5 4 2
summary(dice1)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 2.00 3.00 3.49 5.00 6.00
Next, we use the vectorization capability of atomic vectors to sum both vectors, so that the first roll for dice1 is added to the corresponding first roll for dice2, and so on until the final rolls in each vector are summed.

# Sum the two vectors, element by element
dice_sum <- dice1 + dice2
summary(dice_sum)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2.00 5.00 7.00 7.01 9.00 12.00
Now that we have our final dataset (the simulation of 10,000 rolls for two dice), we can perform an interesting analysis on the data, using the inbuilt table() function in R. This function presents frequencies for each of the values, and is widely used to analyze data in R.
# Show the frequencies for the summed values
freq <- table(dice_sum)
freq
#> dice_sum
#> 2 3 4 5 6 7 8 9 10 11 12
#> 274 569 833 1070 1387 1687 1377 1165 807 534 297
Finally, the proportion of each sum can be calculated using R's vectorization mechanism, together with the function length(). These proportions are compared to the expected values from statistical theory, and the differences are shown, where the values are rounded to five decimal places using R's round() function.
# Show the frequency proportions for the summed values,
# using the vectorized division operator
sim_probs <- freq/length(dice_sum)
sim_probs
#> dice_sum
#> 2 3 4 5 6 7 8 9 10
#> 0.0274 0.0569 0.0833 0.1070 0.1387 0.1687 0.1377 0.1165 0.0807
#> 11 12
#> 0.0534 0.0297
# Define the exact probabilities for the sum of two dice throws
exact <- c(1/36, 2/36, 3/36, 4/36, 5/36, 6/36,
5/36, 4/36, 3/36, 2/36, 1/36)
# Calculate the differences between simulated and exact probabilities
round(sim_probs - exact, 5)
#> dice_sum
#> 2 3 4 5 6 7 8
#> -0.00038 0.00134 -0.00003 -0.00411 -0.00019 0.00203 -0.00119
#> 9 10 11 12
#> 0.00539 -0.00263 -0.00216 0.00192
2.6 Summary of R functions from Chapter 2

Function Description
c() Create an atomic vector.
head() Lists the first six values of a data structure.
is.logical() Checks whether a variable is of type logical.
is.integer() Checks whether a variable is of type integer.
is.double() Checks whether a variable is of type double.
is.character() Checks whether a variable is of type character.
is.na() A function to test for the presence of NA values.
ifelse() Vectorized function that operates on atomic vectors.
list() A function to construct a list.
length() Returns the length of an atomic vector or list.
mean() Calculates the mean for values in a vector.
names() Display or set the vector names.
paste0() Concatenates vectors after converting to a character.
str() Displays the internal structure of a variable.
set.seed() Initializes a pseudorandom number generator.
sample() Generates a random sample of values.
summary() Provides an informative summary of a variable.
tail() Lists the final six values of a data structure.
table() Builds a table of frequency data from a vector.
typeof() Displays the atomic vector type.
unlist() Converts a list to an atomic vector.
2.7 Exercises
The goal of these short exercises is to practice key ideas from the chapter. In
most cases, the answers are provided as part of the output; the challenge is to
find the right R code that will generate the results.
1. Predict what the types will be for the following variables, and then
verify your results in R.
3 Subsetting Vectors

3.1 Introduction

Subsetting operators allow you to process data stored in atomic vectors and lists, and R provides a range of flexible approaches that can be used to subset data. This chapter presents important subsetting operations, and knowledge of these is key in terms of understanding later chapters on processing lists in
of these is key in terms of understanding later chapters on processing lists in
R. Upon completing the chapter, you should understand:
• The four main ways to subset a vector, namely, by positive integer, the minus
sign, logical vectors, and vector element names.
• The common features, and differences, involved in subsetting lists, and
subsetting atomic vectors.
• The role of the [[ operator for processing lists, and how to distinguish this
from the [ operator.
• The use of the $ operator, and how it is essentially a convenient shortcut for
the [[ operator, and can also be used to add new elements to a list.
• How to use the for loop structure to iterate through a list, in order to process
list data. Note that when we introduce functionals in later chapters, our
reliance on the for loop will significantly reduce.
• How to use the if statement when processing lists, and how that statement
differs from the vectorized ifelse() function covered in Chapter 2.
Chapter structure
• Section 3.2 describes four ways to subset an atomic vector, and illustrates
this using a dataset of customer arrivals at a restaurant, based on random
numbers generated from a Poisson distribution.
• Section 3.3 shows how to subset lists, and illustrates an important difference
in subsetting operations, namely, how we can return a list, and how we can
extract the list contents.
• Section 3.4 introduces a looping statement, and shows how this can be used
to process elements within a list. In addition to this, the if statement is
summarized, as this can be used when processing data.
• Section 3.5 explores a mini-case for subsetting based on a list of Star Wars
movies.
• Section 3.6 provides a summary of all the functions introduced in the chapter.
• Section 3.7 provides a number of short coding exercises to test your under-
standing of the material covered.
3.2 Subsetting atomic vectors

The examples in this section use a named vector customers, which stores ten days of simulated customer arrivals at a restaurant, based on random numbers from a Poisson distribution.

customers
#> D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
#> 102 96 97 98 101 85 98 118 102 94
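For reference, a vector of this form can be generated along the following lines (the seed behind the values above is not shown, so a fresh run will produce different numbers):

# Ten days of arrivals from a Poisson distribution with a mean of 100
customers <- rpois(10, lambda = 100)
# Name the elements D1 to D10
names(customers) <- paste0("D", 1:10)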
For vectors in R, the operator [ is used to subset vectors, and there are a
number of ways in which this can be performed (Wickham, 2019).
# Use c() to get the customers from day 1 and the final day
customers[c(1,length(customers))]
#> D1 D10
#> 102 94
Typically, these two statements are combined into a single expression, so you will often see the following style of code in R.
# Subset the vector to only show values great than 100
customers[customers > 100]
#> D1 D5 D8 D9
#> 102 101 118 102
A convenient feature of subsetting with logical vectors is that the logical vector
size does not have to equal the size of the target vector. When the length
of the logical vector is less than the target vector, R will recycle the logical
vector by repeating the sequence of values until all the target values have been
subsetted. For example, this code can be used to extract every second element
from a vector.
# Subset every second element from the vector
customers[c(TRUE,FALSE)]
#> D1 D3 D5 D7 D9
#> 102 97 101 98 102
In order to return more than one value, we can extend the character vector
size by using the c() function.
# Extract the first and last elements
customers[c("D1","D10")]
#> D1 D10
#> 102 94
# Find the index locations of the values greater than 100
which(customers > 100)
#> D1 D5 D8 D9
#> 1 5 8 9

3.3 Subsetting lists
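The examples in this section use a list l1, assumed to be defined along these lines (the values of element b are inferred from later output):

# A list with three named elements, including a nested list
l1 <- list(a = "Hello",
           b = 1:5,
           c = list(d = c(TRUE, TRUE, FALSE),
                    e = "Hello World"))
# Subsetting with [ returns a new (smaller) list
str(l1[c("a", "c")])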
#> List of 2
#> $ a: chr "Hello"
#> $ c:List of 2
#> ..$ d: logi [1:3] TRUE TRUE FALSE
#> ..$ e: chr "Hello World"
Note that all examples with [ return a list, but in many cases this is not
sufficient for analysis, as we will need to access the data within the list (which
can be an atomic vector, and also could be another list). For example, finding
the value of element a or element b. To do this, we must use the [[ operator,
which extracts the contents of a list at a given location (i.e., element 1, 2, ..,
N), where N is the list length. The following short examples show how we can
access the data within a list.
# extract the contents of the first list element
l1[[1]]
#> [1] "Hello"
The list element names can also be used to subset the contents, for example:
# extract the contents of the first list element
l1[["a"]]
#> [1] "Hello"
There is a convenient alternative to the [[ operator, and this is the tag operator $, which can be used once a list element is named. For example, for our list l1, the terms l1[[1]], l1[["a"]] and l1$a return the same result. In the general case, my_list[["y"]] is equivalent to my_list$y. The $ operator is also convenient because it reappears later when we discuss another R data structure, the data frame, in Chapter 5. Here are examples of how to use $ based on the previous example. Notice that the exact same results are returned.
# extract the contents of the first list element
l1$a
#> [1] "Hello"
# extract the contents of the second list element
l1$b
#> [1] 1 2 3 4 5
We can now move on to explore a more realistic example, where the variable
products is a product list that stores product sales information.
# A small products database. Main list has two products
products <- list(A=list(product="A",
sales=12000,
quarterly=list(quarter=1:4,
sales=c(6000,3000,2000,1000))),
B=list(product="B",
sales=8000,
quarterly=list(quarter=1:4,
sales=c(2500,1500,2800,1200))))
str(products)
#> List of 2
#> $ A:List of 3
#> ..$ product : chr "A"
#> ..$ sales : num 12000
#> ..$ quarterly:List of 2
#> .. ..$ quarter: int [1:4] 1 2 3 4
#> .. ..$ sales : num [1:4] 6000 3000 2000 1000
#> $ B:List of 3
#> ..$ product : chr "B"
#> ..$ sales : num 8000
#> ..$ quarterly:List of 2
#> .. ..$ quarter: int [1:4] 1 2 3 4
#> .. ..$ sales : num [1:4] 2500 1500 2800 1200
The list structure is visualized in Figure 3.1, and from this we can make a
number of observations:
• At its core, the list is simply a vector of two named elements, and this is
shown with the red lines. We can verify this with R code to show (1) the
length of the vector and (2) the name of each element.
# Show the vector length (2 elements)
length(products)
#> [1] 2
# Show the names of each element
names(products)
#> [1] "A" "B"
• However, even though there are just two elements in the list, the list has
significant internal structure. Each element contains a list, highlighted in
green. This inner list contains information about an individual product, and
it comprises three elements:
– The product name (product), a character atomic vector.
– The annual sales amount (sales), a numeric atomic vector.
– The quarterly sales (quarterly), which is another list (colored blue).
This list contains two atomic vector elements: the quarter number
(quarter) with values 1 to 4, the corresponding sales amount (sales)
for each quarter.
Based on this list structure, we now identify eight different subset examples for
the list, which show how we can access the list data in a number of different
ways. Before exploring the code, we visualize the different subsetting outputs
in Figure 3.2.
1. Extract the first element of the list as a list, using [ in two ways,
one with a positive integer to retrieve the element, the other using
the element name “A”. Each of these subsetting instructions returns
a list structure, and the list size is equal to 1.
2. Extract the contents of the first list element. Use [[ in two ways,
one with a positive integer to retrieve the element, the other using
the element name “A”. Each of these subsetting commands returns a
list of size 3, where the three elements are: the product (a character
vector), the sales (numeric vector), and a list of two elements, which
contain two vectors, one for the quarter, the other for the sales in
each quarter. We also show the tag operator to extract the same
information, namely products$A.
3. Extract the product name from the first list element (which is a
list), and this requires the use of two [[ operators, the first of these
extracts the first element of products, and the second extracts the
first element of this list. Again, there are three ways to access this
information: the index location, the vector names, or the use of the
$ operator. The value returned is a character vector.
# Example (3) - get the product name for the first product
ex3.1 <- products[[1]][[1]]
ex3.2 <- products[["A"]][["product"]]
ex3.3 <- products$A$product
str(ex3.1)
#> chr "A"
# Example (4) - get the annual sales for the first product
ex4.1 <- products[[1]][[2]]
ex4.2 <- products[["A"]][["sales"]]
ex4.3 <- products$A$sales
str(ex4.1)
#> num 12000
5. The third element of the first element is also a list, and when we
extract this we see that the structure contains two atomic vectors.
6. For the list extracted in the previous step, we can then access the
inner list values, again by using the [[ or $ operators. Notice that
we now use three successive [[ calls, because we have three lists that
can be accessed. For this example, the numeric vector that contains
the quarters is returned.
# Example (8) - get the quarterly sales for the first two quarters
ex8.1 <- products[[1]][[3]][[2]][1:2]
ex8.2 <-products[["A"]][["quarterly"]][["sales"]][1:2]
ex8.3 <-products$A$quarterly$sales[1:2]
str(ex8.1)
#> num [1:2] 6000 3000
Before that, we will show how to iterate through a list from first principles
using a loop structure, along with the index position of the list.
3.4 Iteration using loops, and the if statement

The general form of the for loop is for(var in seq) expr, where:
• var is a name for a variable that will change its value for each loop iteration.
• seq is an expression that evaluates to a vector.
• expr is an expression, which can be either a simple expression, or a compound
expression of the form {expr1; expr2}, which is effectively a number of lines
of code with two curly braces.
A convenient method to iterate over a vector (a list or an atomic vector), is
to use the function seq_along() which returns the indices of a vector. For
example, consider the vector v below, which contains a simulation of ten dice
rolls.
set.seed(100)
v <- sample(1:6,10,replace = T)
v
#> [1] 2 6 3 1 2 6 4 6 6 4
sa <- seq_along(v)
sa
#> [1] 1 2 3 4 5 6 7 8 9 10
Notice that the vector sa returns the set of indices of v, starting at 1 and finishing at 10. This is helpful, because we can use this result to iterate through every element of v and perform a calculation. For example, let's find the
number of elements in v that equal six, and do this using a loop. (Of course
we could just type sum(v==6) and that would give us the same answer.) In the
loop shown below, the value of i takes on the current value in the sequence,
starting at 1 and finishing at 10.
n_six <- 0
for(i in seq_along(v)){
n_six <- n_six + as.integer(v[i] == 6)
}
n_six
#> [1] 4
The looping structure is valuable when we are dealing with lists, because we can use the [[ operator, along with the loop index, to extract values. So let's try the following example by revisiting the products list, which is a list of two elements. Our goal is to find the average sales for the two products, and for this we can use a loop.
# Initialize the total to be 0
sum_sales <- 0
# Iterate through the products list (2 elements) using seq_along()
for(i in seq_along(products)){
# Increment the number of sales by the current product sales value
sum_sales <- sum_sales+products[[i]]$sales
}
# Calculate the average
avr_sales <- sum_sales / length(products)
# Display the average
avr_sales
#> [1] 15000
When iterating through individual vector elements, you may need to execute a
statement based on a current vector value. To do this, the if statement can
be used, and there are two main forms:
• if(cond) expr, which evaluates expr if the condition cond is true
• if(cond) true.expr else false.expr, which evaluates true.expr if the condition is true, and otherwise evaluates false.expr
If the variable used in the conditional expression has a length greater than 1,
only the first element will be used.
Here are some examples, where we can iterate through a vector to find those
values that are greater than the mean. Note that the vectorized ifelse()
function would normally be used for this type of processing.
# create a test vector
v <- 1:10
# create a logical vector to store the result
lv <- vector(mode="logical",length(v))
# Loop through the vector, examining each element
for(i in seq_along(v)){
if(v[i] > mean(v))
lv[i] <- TRUE
else
lv[i] <- FALSE
}
v[lv]
#> [1] 6 7 8 9 10
# The same result, using the vectorized ifelse() function
lv1 <- ifelse(v > mean(v), TRUE, FALSE)
v[lv1]
#> [1] 6 7 8 9 10
In common with the for loop, curly braces can be used if there is more than one
statement to be executed. The following mini-case will also show an example
of iterating through a list using the for statement.
3.5 Mini-case: Star Wars movies

Given that the list sw_films is a list of lists, we can access the data by first
accessing the contents of each list element (which returns a list), and then
accessing the contents of that list, which then contains 14 different atomic
vectors. For example, to retrieve the movie name and film director of the first
and last movies, the following commands can be used.
# Get the first film name and movie director
sw_films[[1]][[1]]
#> [1] "A New Hope"
However, given that the inner list is defined with element names (which is convenient, and means that we don't have to know the position of each element), the following code will also retrieve the same data.
# Get the first film name and movie director
sw_films[[1]]$title
#> [1] "A New Hope"
sw_films[[1]]$director
#> [1] "George Lucas"
Now we will show how to filter the list sw_films in order to narrow our search,
for example, to find all movies directed by George Lucas. The strategy used is:
• A for-loop structure (along with seq_along()) is used to iterate over the entire list and mark each element as either a match (TRUE) or not a match (FALSE). This information is stored in an atomic vector. Note that while we use a for-loop structure here, in subsequent chapters we will use what are known as functionals to iterate over data structures; therefore, after this chapter, we will not see too many more for-loops being used.
• Before entering the loop, we create a logical vector variable (is_target) of
size 7 (the same size as the list), and this will store information on whether
a list item should be marked for further processing.
• For each list element we extract the director’s name, check if it matches the
target (“George Lucas”), and store this logical value in the corresponding
element of is_target.
• The vector is_target can then be used to filter the original sw_films list and retain all the movies directed by George Lucas, as shown in the sketch below.
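A sketch of this strategy (the loop body follows directly from the steps above):

# Create a logical vector, one entry per film
is_target <- vector(mode = "logical", length = length(sw_films))
# Mark each film depending on whether George Lucas directed it
for(i in seq_along(sw_films)){
  is_target[i] <- sw_films[[i]]$director == "George Lucas"
}
# Filter the original list to retain only the matching films
target_list <- sw_films[is_target]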
The results from the logical vector can be viewed, and they show that the first
four films in the list match. We also confirm the length of target_list.
is_target
#> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
length(target_list)
#> [1] 4
Given that we now have a new filtered list, we can proceed to extract informa-
tion from this list. In this case, we look to extract the movie titles into a new
data structure, in this case an atomic vector. The steps here are:
• Create an output data structure which is the length of the target list (in this case, four elements).
• Iterate, again using a for-loop, through the list and copy each title into the new vector.
This code is shown below. We make use of the function vector() to create the initial output vector, and this is good practice when you are using a loop to iterate through a list, as we will always know the length of the output, and specifying the output size up-front is more efficient.
# Create a movies vector to store the movie names
movies <- vector(mode="character",length = length(target_list))
# Iterate through the list to extract the movie title
for(i in seq_along(target_list)){
movies[i]<-target_list[[i]]$title
}
movies
#> [1] "A New Hope" "Attack of the Clones"
#> [3] "The Phantom Menace" "Revenge of the Sith"
One aspect of R you will learn to appreciate is that there are often many ways
to achieve the same outcome. For example, another way to access the movies
of George Lucas would be to rearrange the list of lists into a single list, where
each list element is an atomic vector of values (each of size 7). The process for
creating the new data structure is:
• Create a new list (sw_films1) of elements you wish to store (for example,
movie title, episode_id and director) from the original list. This new list
initially contains empty vectors.
• Loop through the sw_films list and append each movie's title, episode id, and director to the corresponding element of sw_films1.
# Create a new list to store the data in a different way
sw_films1 <- list(title=c(),
episode_id=c(),
director=c())
# Iterate through the list to append the title and director
for(i in seq_along(sw_films)){
sw_films1$title <- c(sw_films1$title,
sw_films[[i]]$title)
sw_films1$episode_id <- c(sw_films1$episode_id,
sw_films[[i]]$episode_id)
sw_films1$director <- c(sw_films1$director,
sw_films[[i]]$director)
}
sw_films1
#> $title
#> [1] "A New Hope" "Attack of the Clones"
#> [3] "The Phantom Menace" "Revenge of the Sith"
#> [5] "Return of the Jedi" "The Empire Strikes Back"
#> [7] "The Force Awakens"
#>
#> $episode_id
#> [1] 4 2 1 3 6 5 7
#>
#> $director
#> [1] "George Lucas" "George Lucas" "George Lucas"
#> [4] "George Lucas" "Richard Marquand" "Irvin Kershner"
#> [7] "J. J. Abrams"
Notice that we now have one list, and this list has three elements, each an
atomic vector of size 7. These vectors can also be viewed as parallel vectors,
where each vector is the same size, and the i-th elements of each vector are
related. For example, location 1 for each vector contains information on “A
New Hope”.
# Showing how parallel vectors work
sw_films1$title[1]
#> [1] "A New Hope"
sw_films1$episode_id[1]
#> [1] 4
sw_films1$director[1]
#> [1] "George Lucas"
This feature can be then exploited to filter related atomic vectors using logical
vector subsetting.
# Find all the movie titles by George Lucas
sw_films1$title[sw_films1$director=="George Lucas"]
#> [1] "A New Hope" "Attack of the Clones"
#> [3] "The Phantom Menace" "Revenge of the Sith"
In general, in the remaining book chapters, we will find easier ways to process
lists using functionals. The aim here is to show how you can process lists using
your own loop structures, along with the function seq_along().
3.6 Summary of R functions from Chapter 3

Function Description
as.list() Coerces the input argument into a list.
paste0() Converts and concatenates arguments to character strings.
rpois() Generates Poisson random numbers (mean lambda).
seq_along() Generates a sequence to iterate over vectors.
which() Provide the TRUE indices of a logical object.
3.7 Exercises
These exercises mostly focus on processing lists, given that they are a more
complex type of vector. For many of the exercises, a loop structure will be
required. As mentioned, the need for using loops will mostly disappear after
this chapter, as we will soon introduce a different means of iterating, which is
known as a functional.
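1. The first exercise is based on a list cars, created from the built-in mtcars data frame (the construction is assumed here, consistent with the description below):

# Convert the mtcars data frame into a list of eleven vectors
cars <- as.list(mtcars)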
Notice that this is a list of eleven elements, where each element is a feature of a specific car. For example, mpg represents the miles per gallon for each car, and disp stores the engine size, or displacement. The list contents are eleven numeric vectors, each of size 32, and these can be viewed as parallel arrays, where the data in the first location of each vector refers to car number one, and the data in the final location refers to the final car.
The aim of the exercise is to generate a list to contain the mean value for mpg
and disp. Use the following code to create the list structure.
result <- list(mean_mpg=vector(mode="numeric", length=1),
mean_disp=vector(mode="numeric", length=1))
str(result)
#> List of 2
#> $ mean_mpg : num 0
#> $ mean_disp: num 0
The following shows the calculated result. Note that the cars list should first be subsetted to only include the two elements that are required (i.e., mpg and disp).
str(result)
#> List of 2
#> $ mean_mpg : num 20.1
#> $ mean_disp: num 231
2. Based on the list sw_people, create a logical vector has_height to filter the list, and populate this vector using a loop structure. The new filtered list (sw_people1) should have 81 elements.
sum(has_height)
#> [1] 81
length(sw_people1)
#> [1] 81
3. Using a for loop over the filtered list sw_people1 from exercise 2,
create a list of people whose height is greater than or equal to 225
inches. The resulting vector should grow as matches are found, as
we do not know in advance how many people will be contained in
the result. Use the command characters <- c() to create the initial
empty result vector.
4. Using a for loop to iterate over the list sw_planets and display those
planets (in a character vector) whose diameter is greater than or
equal to the mean. Use a pre-processing step that will add a new list
element to each planet, called diameter_numeric. This pre-processing
step can also be used to calculate the mean diameter. Also, use the
pre-processing step to keep track of those planets whose diameter
is “unknown”, and use this information to create an updated list sw_planets1 that excludes all the unknown values.
5. Based on the list sw_species, and given that each species has a
classification, create the following tabular summary, again using
a loop to iterate through the list. Make use of the table() function
that was covered in Chapter 2 to present the summary.
4 Functions, Functionals, and the R Pipe

4.1 Introduction
Creating functions is essential to achieving a high productivity return from
your use of R, and therefore it’s worth spending time on the material in this
chapter, as it is foundational for the rest of the book. Functions are building
blocks in R, and are small units that can take an input, process it, and return
a result. We have already used R functions in previous chapters, for example sample(), and these functions all represent a unit of code we can call with arguments, and receive a result.
Upon completing the chapter, you should understand:
• How to write your own function, call it via arguments, and return values
based on the last evaluated expression.
• The different ways to pass arguments to functions.
• How to add robustness checking to your functions to minimize the chance of
processing errors.
• What an environment is, and the environment hierarchy within R.
• How a function contains a reference to its enclosing environment, and how this
mechanism allows a function to access variables in the global environment.
Chapter structure
• Section 4.2 shows how to create a function that accepts inputs, and returns
a result.
• Section 4.3 highlights the flexibility of R in the way in which arguments can
be passed to functions.
• Section 4.4 shows how checks can be made at the start of a function to ensure it is robust, and then exit if the input is not as expected.
• Section 4.5 provides an insight into the key idea of an environment in R, how R finds objects, and why functions can access variables in the global environment.
• Section 4.6 shows how functions are objects in their own right, and so can
be passed to functions. This can be used to iterate over data structures,
and replace the for loop, and this approach is illustrated using the function
lapply().
• Section 4.7 returns to the mini-case from Chapter 3, and solves the problems
using lapply() instead of loops.
• Section 4.8 introduces R’s pipe operator, which provides a valuable mechanism
to chain operations together in a data pipeline in order to improve code
clarity.
• Section 4.9 provides a summary of all the functions introduced in the chapter.
• Section 4.10 provides a number of short coding exercises to test your under-
standing of functions, functionals, environments, and the pipe operator.
4.2 Functions
Functions are a fundamental building block of R. A function can be defined
as a group of instructions that: takes input, uses the input to compute other
values, and returns a result (Matloff, 2011). It is recommended that users of
R should adopt the habit of creating simple functions which will make their
work more effective and also more trustworthy (Chambers, 2008). Functions
are declared using the function reserved word, and are objects, which means
they can also be passed as arguments to other functions. The general form of
a function in R is:
function_name <- function(arguments) expression
where:
• arguments provides the arguments (inputs) to a function, and are separated
by commas,
• expression is any legal R expression contained within the function body, and
is usually enclosed in curly braces (when there is more than one expression),
• the last evaluated expression is returned by the function, although the
function return() can be also used to return values.
Before writing the function, we can explore the logic needed to filter the input
vector. In this example, when the modulus function returns a remainder of 0,
then we know the number is divisible by 2. We can use this information to
create a logical vector that can then be used to filter the modulus result, as
shown in the following code.
# A test vector (values assumed for illustration)
x <- 1:7
# The logical vector where even values are TRUE
lv <- x %% 2 == 0
lv
#> [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
This logic can now be embedded within an R function which we will call
evens(), which takes in one argument (the original vector), and returns a
filtered version of the vector that only includes even numbers. We will take a
parsimonious approach to code writing and just limit the function to one line
of code. The last evaluated expression (which in this case is the first expression)
is returned.
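Based on this description, the function can be written with a one-line body (a sketch, consistent with the call below):

evens <- function(x){
  # Keep only the values divisible by 2 (the last evaluated expression)
  x[x %% 2 == 0]
}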
x1 <- 1:7
evens(x1)
#> [1] 2 4 6
In order to find the set of values that are unique, we can use the information
returned by duplicated(), as follows.
v[!duplicated(v)]
#> [1] 2 6 3 1 4
The challenge now is to embed this logic into a function. We will call the
function my_unique() that takes in a vector (one argument), and returns the
unique values from the vector. It is also useful to write the function into a
source file; let’s assume the file is called my_functions.R. This function could
just be written in one line of code, but we will break it down into a number of
separate steps just to clarify the process.
my_unique <- function(x){
# Use duplicated() to create a logical vector
dup_logi <- duplicated(x)
# Invert the logical vector so that unique values are set to TRUE
unique_logi <- !dup_logi
  # Subset x, retaining only the unique values
  x[unique_logi]
}
To load the function into R, call the source function (this is easy to do within
RStudio by clicking the “Source” button). The function is then loaded into
the workspace, and its existence can be confirmed by calling the ls() function,
which returns a vector of character strings giving the names of the objects in
the specified environment.
# The call to source loads the function into the global environment
source("my_functions.R")
Normally, when writing R code, the following shorter function would suffice for
my_unique().
my_unique <- function(x){
x[!duplicated(x)]
}
my_min <- function(v){
min(v)
}
You can write your own functions that can be passed into another function.
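The function my_summary() itself can be sketched as follows: it takes a vector and a function, and applies the function to the vector.
my_summary <- function(v,f){
  f(v)
}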
# Call my_summary() to get the minimum value
my_summary(1:10,my_min)
#> [1] 1
Furthermore, you could also write the logic of my_min as an anonymous function
(i.e., it is not assigned to a variable, and so does not appear in the global
environment), and examples are shown below. Right now this might seem
like an odd thing to do; however, it is a key idea used when we start to
explore functionals such as lapply(), and later purrr::map(), to iterate over
list structures, and apply an action to each list element.
# Call my_summary() using an anonymous function
my_summary(1:10,function(y)min(y))
#> [1] 1
# Call my_summary() using an anonymous function
my_summary(1:10,function(y)max(y))
#> [1] 10
A useful example of a default value is the na.rm argument of sum(), which defaults to FALSE. Assuming a vector with a missing value, such as v <- c(1,2,3,NA), the two calls below differ:
sum(v)
#> [1] NA
sum(v,na.rm=TRUE)
#> [1] 6
Each function in R is defined with a set of formal arguments that have a fixed
positional order, and often that is the way arguments are then passed into
functions (e.g., by position). However, arguments can also be passed in by
complete name or partial name, and arguments can have default values. We
can explore this via the following example for the function f, which has three
formal arguments: abc, bcd, and bce, and simply returns an atomic vector
showing the function inputs (argument one, argument two, and argument
three).
f <- function(abc,bcd,bce){
c(FirstArg=abc,SecondArg=bcd,ThirdArg=bce)
}
• By position, where values are matched to the formal arguments in order.
• By complete name, where the argument name is given in full in the call.
• By partial name, where argument names are matched, and where a unique
match is found, that argument will be selected. An observation here is that if
there is more than one match, the function call will fail. Furthermore, using
partial matching can lead to confusion for someone trying to understand the
code, so it's probably not a great idea to make use of this argument-passing
mechanism in your own code.
f(2,a=1,3)
#> FirstArg SecondArg ThirdArg
#> 1 2 3
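For the following calls, assume f() has been redefined with default argument values, consistent with the outputs shown:
f <- function(abc=1,bcd=2,bce=3){
  c(FirstArg=abc,SecondArg=bcd,ThirdArg=bce)
}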
Following this, the function can be called in four different ways: with no
arguments, or with one, two, or three arguments. In these examples, a mixture
of positional and complete name matching is used.
f()
#> FirstArg SecondArg ThirdArg
#> 1 2 3
f(bce=10)
#> FirstArg SecondArg ThirdArg
#> 1 2 10
f(30,40)
#> FirstArg SecondArg ThirdArg
#> 30 40 3
f(bce=20,abc=10,100)
#> FirstArg SecondArg ThirdArg
#> 10 100 20
Hadley Wickham (Wickham, 2019) provides valuable advice for passing argu-
ments to functions, for example: (1) to focus on positional mapping for the
first one or two arguments, (2) to avoid positional mapping for arguments that
are not used too often, and (3) that unnamed arguments should come before
named arguments.
A final argument worth exploring is the ... argument, which matches any
arguments not otherwise matched. These can be easily forwarded to other
functions, or explored by converting ... to a list in order to examine the
arguments passed. Here is an example of how it can be used to pass any number
of arguments to a function, and how the function can access these arguments.
test_dot1 <- function(...){
ar = list(...)
str(ar)
}
test_dot1(a=10,b=20:21)
#> List of 2
#> $ a: num 10
#> $ b: int [1:2] 20 21
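The ... argument can also be forwarded directly to another function; a brief sketch (the name test_dot2 is hypothetical):
test_dot2 <- function(...){
  # Forward all arguments to sum()
  sum(...)
}
test_dot2(1,2,3)
#> [1] 6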
4.4 Error checking for functions
We now add input checks to the evens() function. The function length() can
be used to detect an empty input vector. Next, we need to make sure the
vector is numeric, which can be tested with is.numeric(); for example, if a
user sent in a character vector, it would not be possible to use the %% operator.
Therefore, these two functions can be used to initially check the input values,
and “fail fast” if necessary. We add this checking logic to the function’s opening
lines.
evens <- function(v){
if(length(v)==0)
stop("Error> exiting evens(), input vector is empty")
else if(!is.numeric(v))
stop("Error> exiting evens(), input vector not numeric")
v[v%%2==0]
}
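A quick, hypothetical test of the new checks:
evens(1:10)
#> [1] 2 4 6 8 10
evens(letters)
#> Error in evens(letters) : Error> exiting evens(), input vector not numeric
4.5 Environments
When variables are assigned at the console, they are stored in the global environment. The assignments below are a sketch consistent with the output that follows; the value of z is confirmed later in the section, while x and y are hypothetical.
x <- 1:3
y <- 10
z <- c(3,8,18)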
As we can see, all of the variables defined are contained in the global envi-
ronment. Note that the R function globalenv() can also return a reference to
the global environment. This can be seen by calling the function ls() which
returns a vector of character strings, providing the names of the objects in the
specified environment.
ls(envir=globalenv())
#> [1] "x" "y" "z"
Consider the function f1(), which takes in two arguments a and b, sums these,
and then multiplies the answer by z. However, how does the function find the
value for z?
f1 <- function(a,b){
(a+b)*z
}
environment(f1)
#> <environment: R_GlobalEnv>
But the diagram also shows an interesting feature of R, in that the function
also contains a reference to its enclosing environment. This means that when
the function executes, it also has a pathway to search its enclosing environment
for variables and functions. For example, we can see that the function has the
equation (a+b)*z, where a and b are local variables, and so are already part
of the function. However, z is not part of the function, and therefore R will
search the enclosing environment for z, and if it finds it, will use that value in
the calculation.
This is shown below. Note that if z were not defined in the global environment,
this function would generate an error.
z
#> [1] 3 8 18
ans <- f1(2,3)
ans
#> [1] 15 40 90
We now call f2() and show the value of c. Note that within the function f2(),
the superassignment operator has changed the value in the global environment.
We will not use the superassignment operator in the textbook; however, it is
informative to know that this feature is available in R. The operator tends to
be used for ideas such as function factories (Wickham, 2019), where a function
can return functions and also maintain shared state variables (these are known
as closures, but are outside the scope of this book).
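The function f2() can be sketched as follows, consistent with the outputs below: it first reads c from the global environment, and then overwrites the global value using <<- (the initial assignment c <- 20 is also assumed).
c <- 20
f2 <- function(a,b){
  ans <- a + b + c # reads c from the global environment
  c <<- 100        # superassignment: overwrites the global c
  ans
}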
c
#> [1] 20
d <- f2(2,4)
d
#> [1] 26
c
#> [1] 100
What is interesting is that a separate environment is also added for each new
package loaded using library(), and the newest package’s environment is
added as the direct parent of the global environment. The full hierarchy can
be easily shown using the function search(). Notice that the first environment
shown is always the global environment, and the last environment shown is
the base environment, as search() does not show the empty environment,
although the empty environment can be shown through a call to the function
parent.env().
# Show the empty environment
parent.env(baseenv())
#> <environment: R_EmptyEnv>
Given this structure, we can now describe the process that R follows in order
to evaluate an expression. Consider the following:
ans <- max(x)
ans
#> [1] 3
In this case, R starts in the global environment, and locates the variable
x. It then searches for the function max(). As max() is not defined in the
global environment, R follows the chain of enclosing environments until it
finds max() in the base environment.
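However, the example then defines a competing max() in the global environment, presumably along these lines:
# A user-defined max() that shadows base::max() on the search path
max <- function(x) "Hello World"
max(x)
#> [1] "Hello World"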
Clearly, this is not the result we might have expected, and we could protest
that R has called the new max() instead of the obvious function in the base
environment. However, this is just the way the R evaluation process works
(we have to accept this!), as it starts in the global environment and finds
and executes the first matching function. Interestingly, R provides a safe
mechanism to call the exact function you want by prefixing the call to indicate
the environment name, for example, base::max(), which calls the max() function
defined in the base environment. We can see the overall process below. Note
that at any time we can remove a variable from an environment, by using the
function rm() and passing in the variable name as a character vector.
max(x)
#> [1] "Hello World"
base::max(x)
#> [1] 3
rm("max")
4.6 Functionals with lapply()
A functional is a function that takes another function as one of its inputs.
The functional lapply() takes a list and a function, applies the function to
each list element, and returns a list of the same length. It is instructive to
write your own loop-based version of lapply() to see how it works, but there
is no need to duplicate the code by creating your own version. The code below
uses lapply() to find the mean of each element of a list.
l_in <- list(1:4,11:14,21:24)
l_out <- lapply(l_in,mean)
str(l_out)
#> List of 3
#> $ : num 2.5
#> $ : num 12.5
#> $ : num 22.5
In later chapters, we will introduce the purrr package, which provides a set
of functions that can be used to iterate over data structures in a similar way
to lapply(), but with more functionality.
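4.7 Mini-case: Star Wars movies (revisited) using functionals
This mini-case revisits the sw_films list from Chapter 3. The first step filters the list to the films directed by George Lucas; a sketch of the filtering step, consistent with the results below, is:
# A logical vector, TRUE where the director is George Lucas
lv <- unlist(lapply(sw_films, function(x) x$director == "George Lucas"))
# Filter the list of films
target_list <- sw_films[lv]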
With the filtered list, you can then call lapply() to return the movie titles.
# Get the movie titles as a list
movies <- lapply(target_list,function(x)x$title)
movies <- unlist(movies)
movies
#> [1] "A New Hope" "Attack of the Clones"
#> [3] "The Phantom Menace" "Revenge of the Sith"
Using lapply(), we can also create a new list from the original list. Here we
create a list with three elements, where each element will store all of the data
for the title, episode number and movie director.
# Create a new list to store the data in a different way
sw_films1 <- list(title=c(), episode_id=c(), director=c())
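A sketch of the code that populates this list, using lapply() on sw_films to extract each field:
sw_films1$title <- unlist(lapply(sw_films, function(x) x$title))
sw_films1$episode_id <- unlist(lapply(sw_films, function(x) x$episode_id))
sw_films1$director <- unlist(lapply(sw_films, function(x) x$director))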
Given this new structure, we can then extract the movies for George Lucas
using the following subsetting operation. This works because the $ operator
extracts the contents of the list element title, and then we simply subset this
as an atomic vector.
# Find all the movie titles by George Lucas
sw_films1$title[sw_films1$director=="George Lucas"]
#> [1] "A New Hope" "Attack of the Clones"
#> [3] "The Phantom Menace" "Revenge of the Sith"
We can use lapply() to find all the planet diameters in the list sw_planets,
replacing the string “unknown” with NA, and then summarize the vector,
showing the overall statistics for this variable.
# Return all planet diameters as numeric, with NA where it's unknown
diameters <- unlist(
lapply(sw_planets,
function(x)
if (x$diameter != "unknown")
as.numeric(x$diameter)
else NA))
# Provide a summary of the data
summary(diameters)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0 7812 11015 12388 13422 118000 17
Finally, we can extract the passenger capacity for all starships contained in
the list sw_starships.
# Get the passenger capacity for all spaceships
passengers <- unlist(lapply(sw_starships,
function (x)
if (x$passengers != "unknown")
as.numeric(x$passengers)
else NA))
# Show the data
passengers
#> [1] 75 843342 6 0 0 0 38000 6 20
4.8 The R pipe operator
R's native pipe operator |> passes the value on its left-hand side (LHS) as
the first argument to the function on its right-hand side (RHS). More
operations can be added to the chain, and in that case, the output from the
first RHS then becomes the LHS for the next operation. For example, we could
also add an additional operation to the example, rounding the result to 3
decimal places by using the round() function.
# n1 is a numeric vector created earlier (not shown here)
n1 |> min() |> round(3)
#> [1] 0.097
As mentioned, lists are frequently used as a starting point, and we will use the
data frame mtcars (we will present data frames in Chapter 5) to perform the
following chain of data transformations:
• Take as input mtcars.
• Convert mtcars to a list, using the function as.list(). Note that a data frame
is technically a list.
• Process the list one element at a time, and get the average value of each
variable.
• Convert the list returned by lapply() to an atomic vector (using unlist()).
• Store the result in a variable.
The full instruction is shown below.
a1 <- mtcars |> # The input
as.list() |> # Convert to a list
lapply(function(x)mean(x)) |> # Get the mean of each element
unlist() # Convert to atomic vector
a1
#> mpg cyl disp hp drat wt qsec
#> 20.0906 6.1875 230.7219 146.6875 3.5966 3.2172 17.8487
#> vs am gear carb
#> 0.4375 0.4062 3.6875 2.8125
4.9 Summary of R functions from Chapter 4
Function Description
duplicated() Identifies duplicates in a vector.
where() Returns the environment for a variable (pryr library).
environment() Finds the environment for a function.
library() Loads and attaches add-on packages.
parent.env() Finds the parent environment for an input environment.
search() Returns a vector of environment names.
globalenv() Returns a reference to the global environment.
baseenv() Returns a reference to the base environment.
lapply(x,f) Applies f to each element of x and returns results in a list.
rm() Removes an object from an environment.
stop() Stops execution of the current expression.
unique() Returns a vector with duplicated elements removed.
4.10 Exercises
1. Write a function get_even1() that returns only the even numbers
from a vector. Make use of R’s modulus function %% as part of the
calculation. Try to implement the solution as one line of code. The
function should transform the input vector in the following way.
set.seed(200)
v <- sample(1:20,10)
v
#> [1] 6 18 15 8 7 12 19 5 10 2
v1 <- get_even1(v)
v1
#> [1] 6 18 8 12 10 2
2. Write a function get_even2() that extends get_even1() to handle
missing values. By default, NA values should be retained in the
output; the function should also accept an argument na.omit (default
FALSE) which, when TRUE, removes the missing values. The function
should transform the input vectors in the following way.
set.seed(200)
v <- sample(1:20,10)
i <- c(1,5,7)
v[i] <- NA
v
#> [1] NA 18 15 8 NA 12 NA 5 10 2
v1 <- get_even2(v)
v1
#> [1] NA 18 8 NA 12 NA 10 2
v2 <- get_even2(v,na.omit=TRUE)
v2
#> [1] 18 8 12 10 2
3. Consider the following function and the three calls shown below.
Predict which of the calls will succeed and what each will return,
and then verify your answers in R.
# The function
fn_test <- function(a, b, c){
a+b
}
# Call 1
fn_test(1,2)
# Call 2
fn_test(c=1,2)
# Call 3
fn_test(b=1,10)
4. Write a function env_test() that demonstrates how a function can
access a variable, such as a below, from the global environment.
a <- 100
env_test(1)
5
Matrices and Data Frames
5.1 Introduction
To date, we have used atomic vectors and lists to store information. While
these are foundational data structures in R, they do not provide support
for processing rectangular data, which is a common format in data science.
Rectangular data, as the name suggests, is the usual format used in spreadsheets
or databases, and it comprises rows of data for one or more variables. Typically,
in a rectangular dataset, every column represents a variable (where each
variable can have a different type), and each row contains an observation, for
example, a set of values that are related (medical data on a single patient, or
weather data at a specific point in time). More generally, we can have two
types of rectangular (two-dimensional) data: (1) data of the same type, typically
numeric, that is stored in a matrix, and (2) data of different types that is stored
in a data frame. This chapter introduces the matrix and the data frame, and
shows how we can process data in these structures, using ideas from subsetting
atomic vectors and lists.
Upon completing the chapter, you should understand:
• How to create a matrix in R, using a one-dimensional vector as input.
• How to subset a matrix, and extend it by adding rows and columns.
• How to create a data frame, and how to subset the data frame using matrix
notation.
• How to create a tibble, and understand how a tibble differs from a data
frame.
• How to use the R functions subset() and transform() to process data frames.
• How to use the function apply() to process matrices and data frames, on
either a row or a column basis.
• How to manipulate matrices and data frames using base R functions.
• How to solve all five test exercises.
Chapter structure
• Section 5.2 introduces the matrix, and shows how it can be created using
the function matrix().
• Section 5.3 presents the data frame, which is also a list.
• Section 5.4 shows how the functions subset() and transform() can be used
to process data frames.
• Section 5.5 summarizes the tibble, which is an updated version of the data
frame with a number of additional features.
• Section 5.6 shows how functionals can be applied to data frames and matrices,
in particular the function apply(), which can process either rows or columns
in a matrix and a data frame.
• Sections 5.7 and 5.8 present the chapter's two mini-cases. The first shows
how a matrix can be used to store information on a synthetic social network
structure, capturing connections between individuals; the second highlights
how a pipeline can be set up to process a data frame.
• Section 5.9 provides a summary of all the functions introduced in the chapter.
• Section 5.10 provides a number of short coding exercises to test your knowl-
edge.
5.2 Matrices
In R, a matrix is a two-dimensional structure, with rows and columns, that
contains elements of the same type. A simple way to understand a matrix is that it is an
atomic vector in two dimensions, and is created using the matrix() function,
with the following arguments:
• data, which are the initial values, contained in an atomic vector, supplied to
the matrix,
• nrow, the desired number of rows,
• ncol, the desired number of columns,
• byrow, a logical value (default is FALSE), that specifies whether the matrix
is filled with data by row or by column,
• dimnames, a list of length 2 giving row and column names, respectively.
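The examples that follow use a 3 × 3 matrix m1; a sketch of its creation, consistent with the output shown below, is:
# Create a named 3 x 3 matrix, filled by row
m1 <- matrix(c(7,1,9,
               6,2,4,
               3,5,8),
             nrow=3, byrow=TRUE,
             dimnames=list(c("R1","R2","R3"),c("C1","C2","C3")))
# Add a new row using rbind(), naming it R4
m1_r <- rbind(m1, R4=c(1,2,3))
m1_r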
#> C1 C2 C3
#> R1 7 1 9
#> R2 6 2 4
#> R3 3 5 8
#> R4 1 2 3
In a similar way, we can add a new column to the matrix using cbind(), and
also name this new column using the function colnames().
m1_c <- cbind(m1_r,c(10,20,30,40))
colnames(m1_c)[4] <- "C4"
m1_c
#> C1 C2 C3 C4
#> R1 7 1 9 10
#> R2 6 2 4 20
#> R3 3 5 8 30
#> R4 1 2 3 40
Ranges can also be supplied during subsetting; for example, the first two rows
and the first two columns.
m1[1:2,1:2]
#> C1 C2
#> R1 7 1
#> R2 6 2
Blank subsetting for one of the dimensions lets you retain all of the rows, or
all of the columns. Note that if only a single row or column is subsetted, R will
return a vector by default. To keep the matrix structure, the argument
drop=FALSE is added.
# Extract first row and all columns
m1[1,]
#> C1 C2 C3
#> 7 1 9
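For example, extracting the first column with and without drop=FALSE (a sketch, not shown in the original):
# Returns an atomic vector
m1[,1]
#> R1 R2 R3
#> 7 6 3
# Retains the matrix structure
m1[,1,drop=FALSE]
#> C1
#> R1 7
#> R2 6
#> R3 3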
Logical vectors can also be used to subset matrices; for example, to extract
every second row from the matrix, the following code will work.
m1
#> C1 C2 C3
#> R1 7 1 9
#> R2 6 2 4
#> R3 3 5 8
m1[c(T,F),]
#> C1 C2 C3
#> R1 7 1 9
#> R3 3 5 8
There are important functions available to process matrices, and the main
ones are now summarized.
• The function is.matrix() is used to test whether the object is a matrix or
not. This can be used as a pre-test, in order to ensure that the data is in the
expected format.
A <- matrix(1:4,nrow=2)
B <- matrix(1:4,nrow=2,byrow = T)
C <- list(c1=1:2, c2=3:4)
is.matrix(A)
#> [1] TRUE
is.matrix(B)
#> [1] TRUE
is.matrix(C)
#> [1] FALSE
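The arithmetic examples below assume that row and column names were first assigned to A and B, for example:
dimnames(A) <- list(c("A_R1","A_R2"),c("A_C1","A_C2"))
dimnames(B) <- list(c("B_R1","B_R2"),c("B_C1","B_C2"))
• Arithmetic operators such as * and + operate element-wise on matrices of equal dimensions.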
B
#> B_C1 B_C2
#> B_R1 1 2
#> B_R2 3 4
# Multiplication of A and B
A*B
#> A_C1 A_C2
#> A_R1 1 6
#> A_R2 6 16
# Addition of A and B
A+B
#> A_C1 A_C2
#> A_R1 2 5
#> A_R2 5 8
# Multiplying A by a constant
10*A
#> A_C1 A_C2
#> A_R1 10 30
#> A_R2 20 40
• The function dim() can be used to return the matrix dimensions, namely,
the number of rows and the number of columns.
dim(A)
#> [1] 2 2
• The function dimnames() returns the row and column names in a list.
dimnames(A)
#> [[1]]
#> [1] "A_R1" "A_R2"
#>
#> [[2]]
#> [1] "A_C1" "A_C2"
• The functions rownames() and colnames() return the row names and column
names as atomic vectors.
rownames(A)
#> [1] "A_R1" "A_R2"
colnames(A)
#> [1] "A_C1" "A_C2"
• The function diag() can be used in two ways. First, it can set all the diagonal
elements of a matrix. Second, it can be used to create the identity matrix
for a given dimension.
diag(B) <- -1
B
#> B_C1 B_C2
#> B_R1 -1 2
#> B_R2 3 -1
I <- diag(3)
I
#> [,1] [,2] [,3]
#> [1,] 1 0 0
#> [2,] 0 1 0
#> [3,] 0 0 1
• The function eigen() can be used for calculating the eigenvalues and eigen-
vectors of a matrix, which has important applications in many engineering
and computing problems. Here is an example of a 2 × 2 matrix, where both
eigenvalues and eigenvectors are shown.
M <- matrix(c(1,-1,-2,3),nrow=2)
M
#> [,1] [,2]
#> [1,] 1 -2
#> [2,] -1 3
eig <- eigen(M)
eig$values
#> [1] 3.7321 0.2679
eig$vectors
#> [,1] [,2]
#> [1,] 0.5907 -0.9391
#> [2,] -0.8069 -0.3437
• The functions colSums(), rowSums(), colMeans(), and rowMeans() calculate
the sums and means of matrix columns and rows.
colSums(A)
#> A_C1 A_C2
#> 3 7
colMeans(A)
#> A_C1 A_C2
#> 1.5 3.5
In summary, R provides good support for problems that require matrix
manipulation, with matrices defined using the matrix() function. Many of
the subsetting commands used for atomic vectors can also be used for matrices,
and that includes referencing elements by the row/column name. Matrices can
be extended easily, using functions such as cbind() and rbind().
Furthermore, functions such as diag(), eigen() and det() can be used to
support analysis. However, an important feature of matrices is that all the
values need to be the same type, and while this is suitable for the examples
shown, there are cases where rectangular data must store data of different types,
for example, the type of data contained in a typical dataset. For these situations,
the data frame is an ideal mechanism to store, and process, heterogeneous
data.
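5.3 Data frames
The examples in this section use a simple data frame d; a sketch of its creation, consistent with the output below, is:
d <- data.frame(Number=1:5,
                Letter=LETTERS[1:5],
                Flag=c(T,F,T,F,T))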
d
#> Number Letter Flag
#> 1 1 A TRUE
#> 2 2 B FALSE
#> 3 3 C TRUE
#> 4 4 D FALSE
#> 5 5 E TRUE
An important activity that is required with a data frame is to be able to: (1)
subset rows, (2) subset columns, and (3) add new columns. Because a data
frame is a list and also shares properties of a matrix, we can combine subsetting
mechanisms from both of these data structures to subset a data frame. Given
that a data frame is also a list, we can access a data frame column using the $
operator. A number of subsetting examples are now presented.
• To select the first two rows from the data frame.
d[1:2,]
#> Number Letter Flag
#> 1 1 A TRUE
#> 2 2 B FALSE
• To select all rows that have the column Flag set to TRUE.
d[d$Flag == T,]
#> Number Letter Flag
#> 1 1 A TRUE
#> 3 3 C TRUE
#> 5 5 E TRUE
• To select the first two rows and the last two columns.
d[1:2,c("Letter","Flag")]
#> Letter Flag
#> 1 A TRUE
#> 2 B FALSE
• To add a new column, we can simply add a new element as we would with a
list.
d1 <- d
d1$letter <- letters[1:5]
d1
#> Number Letter Flag letter
#> 1 1 A TRUE a
#> 2 2 B FALSE b
#> 3 3 C TRUE c
#> 4 4 D FALSE d
#> 5 5 E TRUE e
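The subset() function selects the rows of a data frame that meet a logical condition, with an optional select argument to choose columns; for example (a sketch, consistent with the results repeated below):
# Cars with mpg greater than 32, showing two columns
subset(mtcars, mpg>32, select=c("mpg","disp"))
#> mpg disp
#> Fiat 128 32.4 78.7
#> Toyota Corolla 33.9 71.1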
Of course, we could achieve the same outcome without using subset(), and
utilize the matrix-type subsetting for data frames, or by using the $ operator.
However, even though the same result is achieved, the subset() function makes
it easier for others to understand the code.
mtcars[mtcars[,"mpg"]>32,c("mpg","disp")]
#> mpg disp
#> Fiat 128 32.4 78.7
#> Toyota Corolla 33.9 71.1
mtcars[mtcars$mpg>32,c("mpg","disp")]
#> mpg disp
#> Fiat 128 32.4 78.7
#> Toyota Corolla 33.9 71.1
The transform() function applies the expression mpg*1.6 to each row in the data
frame, and the new column kpg can be seen in the output.
df1 <- subset(mtcars,mpg>32,select=c("mpg","disp"))
df1 <- transform(df1,kpg=mpg*1.6)
df1
#> mpg disp kpg
#> Fiat 128 32.4 78.7 51.84
#> Toyota Corolla 33.9 71.1 54.24
Overall, while the functions subset() and transform() are always available to
use, we will migrate to using new ways to subset and manipulate data frames
and tibbles in Part II of the book, when we introduce the dplyr package.
5.5 Tibbles
As we move through the book, we will rely more on tibbles than data frames.
Tibbles are a type of data frame; however, they alter some data frame behaviors
to make working with packages in the tidyverse easier. There are two main
differences when compared to the data.frame (Wickham and Grolemund, 2016):
• Printing, where, by default, tibbles only show the first ten rows, and limit
the visible columns to those that fit on the screen. The type is also displayed
for each column, which is a useful feature that provides more information on
each variable.
• Subsetting, where a tibble is always returned, and where partial matching is
not supported.
In order to use a tibble, the tibble package should be loaded.
library(tibble)
#>
#> Attaching package: 'tibble'
#> The following object is masked from 'package:igraph':
#>
#> as_data_frame
Similar to a data frame, a function can be used to create a tibble, and this
takes a set of atomic vectors. Notice that by default string values are not
converted to factors by this function, and this shows another difference with
data.frame.
d1 <- tibble(Number=1:5,
Letter=LETTERS[1:5],
Flag=c(T,F,T,F,T))
d1
#> # A tibble: 5 x 3
#> Number Letter Flag
#> <int> <chr> <lgl>
#> 1 1 A TRUE
#> 2 2 B FALSE
#> 3 3 C TRUE
#> 4 4 D FALSE
#> 5 5 E TRUE
We can now explore some of the differences between the data.frame and tibble
by comparing the two variables d and d1.
• First, we can observe their structure, using str().
# Show the data frame
str(d)
#> 'data.frame': 5 obs. of 3 variables:
#> $ Number: int 1 2 3 4 5
#> $ Letter: chr "A" "B" "C" "D" ...
#> $ Flag : logi TRUE FALSE TRUE FALSE TRUE
# Show the tibble
str(d1)
#> tibble [5 x 3] (S3: tbl_df/tbl/data.frame)
#> $ Number: int [1:5] 1 2 3 4 5
#> $ Letter: chr [1:5] "A" "B" "C" "D" ...
#> $ Flag : logi [1:5] TRUE FALSE TRUE FALSE TRUE
• Second, we can see the difference when subsetting one column from each
structure. Notice how the data frame output is simplified to an atomic vector.
In contrast, the tibble structure is retained when one column is subsetted.
# Subset the data frame
d[1:2,"Letter"]
#> [1] "A" "B"
# Subset the tibble
d1[1:2,"Letter"]
#> # A tibble: 2 x 1
#> Letter
#> <chr>
#> 1 A
#> 2 B
Overall, when we make use of the tidyverse, which is the focus of Part II,
we will mostly use tibbles as the main data structure, as their property of
always returning a tibble nicely supports the chaining of data processing
operations using R's pipe operator.
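5.6 Functionals on matrices and data frames
The first examples in this section use a matrix results of student grades. Its creation is not shown here; the values in the following sketch are chosen to be consistent with the outputs below, and the book's original data may differ.
results <- matrix(c(54,50,40,
                    60,87,70,
                    67,60,55,
                    84,70,84,
                    50,60,80),
                  nrow=5, byrow=TRUE,
                  dimnames=list(paste0("St-",1:5),paste0("Sub-",1:3)))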
We will now use the apply() function to perform two tasks. First, we find the
maximum grade for each subject. This involves iterating over the matrix on a
column-by-column basis, and processing the data to find the maximum. Note
that apply() simplifies the output so that, unlike lapply(), a numeric vector
is returned instead of a list.
max_gr_subject <- apply(results, # the matrix
2, # 2 for columns
function(x)max(x)) # the function to apply
max_gr_subject
#> Sub-1 Sub-2 Sub-3
#> 84 87 84
Next, we use apply() to find the maximum grade for each student. In this
case, we need to iterate over the matrix one row at a time, and therefore the
code looks like this.
max_gr_student <- apply(results, # the matrix
1, # 1 for rows
function(x)max(x)) # the function to apply
max_gr_student
#> St-1 St-2 St-3 St-4 St-5
#> 54 87 67 84 80
The apply() function can also be used on data frames, again, to iterate and
apply a function over either each row or column. For example, if we take a
subset of the data frame mtcars and insert some random NA values, we can
then count the missing values by either row or column.
set.seed(100)
my_mtcars <- mtcars[sample(1:6),c("mpg","cyl","disp")]
rows <- sample(1:nrow(my_mtcars),5)
rows
#> [1] 6 4 3 2 5
my_mtcars[rows[1],1] <- NA
my_mtcars[rows[2],2] <- NA
my_mtcars[rows[3],3] <- NA
my_mtcars[rows[4],1] <- NA
my_mtcars[rows[5],2] <- NA
my_mtcars
#> mpg cyl disp
#> Mazda RX4 Wag 21.0 6 160
#> Datsun 710 NA 4 108
#> Mazda RX4 21.0 6 NA
#> Valiant 18.1 NA 225
#> Hornet Sportabout 18.7 NA 360
#> Hornet 4 Drive NA 6 258
First, to count the number of missing values by row, the following code can be
used.
n_rm <- apply(my_mtcars,1,function(x)sum(is.na(x)))
n_rm
#> Mazda RX4 Wag Datsun 710 Mazda RX4
#> 0 1 1
#> Valiant Hornet Sportabout Hornet 4 Drive
#> 1 1 1
sum(n_rm)
#> [1] 5
Second, to count the number of missing values by column, the following code
can be used.
n_cm <- apply(my_mtcars,2,function(x)sum(is.na(x)))
n_cm
#> mpg cyl disp
#> 2 2 1
sum(n_cm)
#> [1] 5
Calling str(mtcars) shows that the mtcars variable is a data frame with 11
variables (columns) and 32 rows (observations). We can easily convert the
data frame to a list
using the function as.list(). When processing data frames with lapply(), the
most important thing to remember is that the data frame will be processed
column-by-column. This can be explored in the following example, which takes
the first three columns of mtcars and then calculates the average value for
each of these columns. Note we also make use of the function subset() and
the R pipe operator |> in order to create a small data processing workflow.
s1 <- mtcars |>
subset(select=c("mpg","cyl","disp")) |>
lapply(function(x)mean(x))
s1
#> $mpg
#> [1] 20.09
#>
#> $cyl
#> [1] 6.188
#>
#> $disp
#> [1] 230.7
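• First, a square adjacency matrix m records the social network, where m[i,j] set to 1 indicates that person i follows person j. A sketch of its creation, consistent with the outputs below, is:
m <- matrix(c(0,0,1,1,0,1,1,
              0,0,1,0,0,0,1,
              0,0,0,1,0,1,1,
              0,1,0,0,1,1,0,
              0,1,0,0,0,0,1,
              0,0,1,1,1,0,1,
              1,0,0,1,1,1,0),
            nrow=7, byrow=TRUE,
            dimnames=list(paste0("P-",1:7),paste0("P-",1:7)))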
• Second, based on the CRAN package igraph, we can visualize the adjacency
matrix as a network graph. Once the library is installed, sending the matrix
to the igraph function graph_from_adjacency_matrix() will return a list of
connections, which are then visualized using plot(). This network output
generated by igraph is shown in Figure 5.1.
# Include the igraph library
library(igraph)
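A sketch of the conversion and plotting calls just described:
# Build a directed graph from the adjacency matrix, then plot it
g <- graph_from_adjacency_matrix(m)
plot(g)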
• Third, we can sum all of the rows (which will show the number of people a
person follows), and sum all of the columns (showing the number of followers
a person has). This can be completed using the apply() function described
earlier. The name of the person with the maximum for each can also be
retrieved using the which() function.
row_sum <- apply(m,1,sum)
row_sum
#> P-1 P-2 P-3 P-4 P-5 P-6 P-7
#> 4 2 3 3 2 4 4
names(which(row_sum==max(row_sum)))
#> [1] "P-1" "P-6" "P-7"
• Finally, we can update the matrix to include the additional information using
the cbind() function.
m1 <- cbind(m,TotFollowing=row_sum)
m1
#> P-1 P-2 P-3 P-4 P-5 P-6 P-7 TotFollowing
#> P-1 0 0 1 1 0 1 1 4
#> P-2 0 0 1 0 0 0 1 2
#> P-3 0 0 0 1 0 1 1 3
#> P-4 0 1 0 0 1 1 0 3
#> P-5 0 1 0 0 0 0 1 2
#> P-6 0 0 1 1 1 0 1 4
#> P-7 1 0 0 1 1 1 0 4
The chapter's second mini-case builds a small pipeline over mtcars:
• The columns mpg and disp will be selected.
• A new column kpg will be added, converting mpg to kilometers per gallon.
• A new column dm_ratio will be added, which is the ratio of disp and mpg.
• The first six observations will then be shown.
The solution uses the R pipe operator, which chains the sequence of data
processing stages together. The functions subset() and transform() are used,
the first to select a subset of columns, the second to add new columns to the
data frame.
mtcars_1 <- mtcars |> # the original data frame
subset(select=c("mpg","disp")) |> # select 2 columns
transform(kpg=mpg*1.6, # Add first column
dm_ratio=disp/mpg) |> # Add second column
head() # Subset 1st 6 records
mtcars_1
#> mpg disp kpg dm_ratio
#> Mazda RX4 21.0 160 33.60 7.619
#> Mazda RX4 Wag 21.0 160 33.60 7.619
#> Datsun 710 22.8 108 36.48 4.737
#> Hornet 4 Drive 21.4 258 34.24 12.056
#> Hornet Sportabout 18.7 360 29.92 19.251
#> Valiant 18.1 225 28.96 12.431
While it is a simple example, it does show the utility of being able to take a
rectangular data structure (the data frame) as input, and then apply a chain
of operations to this in order to arrive at a desired output, which is then stored
in the variable mtcars_1.
5.9 Summary of R functions from Chapter 5
Function Description
as.data.frame() Converts a tibble to a data frame.
apply() Iterates over rectangular data, by row or by column.
cbind() Adds a new vector as a matrix column.
colnames() Sets (or views) the column names of a matrix.
colMeans() Calculates the mean of each column in a matrix.
colSums() Calculates the sum of each column in a matrix.
data.frame() Constructs a data frame.
diag() Sets a matrix diagonal, or generates an identity matrix.
dim() Returns (or sets) the matrix dimensions.
dimnames() Returns the row and column names of a matrix.
eigen() Calculates matrix eigenvalues and eigenvectors.
factor() Encodes a vector as a factor.
is.matrix() Checks to see if the object is a matrix.
matrix() Creates a matrix from the given set of arguments.
rbind() Adds a vector as a row to a matrix.
rownames() Sets (or views) the row names of a matrix.
t() Returns the matrix transpose.
rowMeans() Calculates the mean of each matrix row.
rowSums() Calculates the sum of each matrix row.
subset() Subsets data frames which meet specified conditions.
tibble() Constructs a tibble (tibble package).
as_tibble() Converts a data frame to a tibble (tibble package).
transform() Adds columns to a data frame.
5.10 Exercises
1. Use the following initial code to generate the matrix res.
set.seed(100)
N=10
CX101 <- rnorm(N,45,8)
CX102 <- rnorm(N,65,8)
CX103 <- rnorm(N,85,25)
CX104 <- rnorm(N,60,15)
CX105 <- rnorm(N,55,15)
res
#> CX101 CX102 CX103 CX104 CX105
#> Student-1 40.98 65.72 74.05 58.63 53.48
#> Student-2 46.05 65.77 104.10 86.36 76.05
#> Student-3 44.37 63.39 91.55 57.93 28.35
2. The matrix res (from the previous question) has values that are
out of the valid range for grades (i.e., greater than 100). To address
this, all out-of-range values should be replaced by NA. Use apply()
to generate the following modified matrix.
res_clean
#> CX101 CX102 CX103 CX104 CX105
#> Student-1 40.98 65.72 74.05 58.63 53.48
#> Student-2 46.05 65.77 NA 86.36 76.05
#> Student-3 44.37 63.39 91.55 57.93 28.35
#> Student-4 52.09 70.92 NA 58.33 64.34
#> Student-5 45.94 65.99 64.64 49.65 47.17
#> Student-6 47.55 64.77 74.04 56.67 74.83
#> Student-7 40.35 61.89 66.99 62.74 49.55
#> Student-8 50.72 69.09 90.77 66.26 74.79
#> Student-9 38.40 57.69 56.06 75.98 55.66
#> Student-10 42.12 83.48 91.18 74.55 26.82
3. Using apply(), replace each NA value in res_clean with the mean
of the non-missing values in that column, generating the matrix
res_update.
res_update
#> CX101 CX102 CX103 CX104 CX105
#> Student-1 40.98 65.72 74.05 58.63 53.48
#> Student-2 46.05 65.77 76.16 86.36 76.05
#> Student-3 44.37 63.39 91.55 57.93 28.35
#> Student-4 52.09 70.92 76.16 58.33 64.34
#> Student-5 45.94 65.99 64.64 49.65 47.17
#> Student-6 47.55 64.77 74.04 56.67 74.83
#> Student-7 40.35 61.89 66.99 62.74 49.55
#> Student-8 50.72 69.09 90.77 66.26 74.79
#> Student-9 38.40 57.69 56.06 75.98 55.66
#> Student-10 42.12 83.48 91.18 74.55 26.82
4. Use the subset() function to generate the following tibbles from the
tibble ggplot2::mpg. Use the R pipe operator (|>) where necessary.
5. Generate the following network using matrices and igraph. All nodes
are connected to each other (i.e., it is a fully connected network).
6
The S3 Object System in R
6.1 Introduction
While the material in this chapter is more challenging, it is definitely worth the
investment of your time, as it should provide you with additional knowledge
of how R works. It highlights the importance of the S3 object oriented system,
which is ubiquitous in R. An important idea underlying S3 is that we want
to make things as easy as possible for users, when they call functions and
manipulate data. S3 is based on the idea of coding generic functions (for
example, the base R function summary() is a generic function), and then writing
specific functions (often called methods) that are different implementations of
a generic function. The advantage of the S3 approach is that it shields a lot of
complexity from the user, and makes it easier to focus on the data processing
workflow.
Upon completing the chapter, you should understand:
• Why calls by different objects to the same generic function can yield different
results.
• The difference between a base object and an object oriented S3 object.
• The importance of attributes in R, and in particular, the attributes names,
dim, and class.
• How the class attribute is essential to the S3 object system in R, and how to
define a class for an object.
Chapter structure
• Section 6.2 provides an insight into how S3 works.
• Section 6.3 introduces the idea of attributes, which are metadata associated
with an object. A number of attributes are particularly important, and they
include names, dim and class. The class attribute is key to understanding
how the S3 system operates.
• Section 6.4 presents the generic function object oriented approach, which is
the design underlying R’s S3 system. It summarizes key ideas and describes
the process known as method dispatch.
• Section 6.5 shows how a developer can make use of existing generic functions,
for example print() and summary(), to define targeted methods for new
classes.
• Section 6.6 presents examples of how to write custom generic functions that
can be used to define methods for S3 classes.
• Section 6.7 shows how inheritance can be specified in S3, and how new classes
(e.g., a subclass) can inherit methods from other classes (e.g., a superclass).
• Section 6.8 shows how a simple queue system in S3 can be implemented.
• Section 6.9 provides a summary of the functions introduced in the chapter.
• Section 6.10 provides a number of short coding exercises to test your under-
standing of S3.
6.2 S3 in action
In order to get an insight into how S3 works, let’s consider R’s summary()
function. You may have noticed that if you call summary() with two different
types of data, you will get two appropriately different answers, for example:
• When summarizing results of a linear regression model, based on two variables
from mtcars, the following information is presented. The model can be used
to predict the miles per gallon (mpg) given the engine size (disp - in cubic
inches), and the model coefficients are shown (the slope=−0.04122 and the
intercept=29.59985). Note that at this point we are not so much interested
in the results of the linear model, rather we focus on what is output when
we call the function summary() with the variable mod.
mod <- lm(mpg~disp,data=mtcars)
summary(mod)
#>
#> Call:
#> lm(formula = mpg ~ disp, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.892 -2.202 -0.963 1.627 7.231
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 29.59985 1.22972 24.07 < 2e-16 ***
#> disp -0.04122 0.00471 -8.75 9.4e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.25 on 30 degrees of freedom
#> Multiple R-squared: 0.718, Adjusted R-squared: 0.709
#> F-statistic: 76.5 on 1 and 30 DF, p-value: 9.38e-10
• When summarizing the first six rows and two columns from the mtcars data
frame, summary() provides information on each column.
d <- subset(mtcars,select=c("mpg","disp")) |> head()
summary(d)
#> mpg disp
#> Min. :18.1 Min. :108
#> 1st Qu.:19.3 1st Qu.:160
#> Median :21.0 Median :192
#> Mean :20.5 Mean :212
#> 3rd Qu.:21.3 3rd Qu.:250
#> Max. :22.8 Max. :360
Think of what’s happening with these two examples. In both cases, the user
calls summary(). However, depending on the variable that is passed to summary(),
a different result is returned. How might this be done in R? To achieve this,
R uses a mechanism known as polymorphism, and this is a common feature
of object oriented programming languages. Polymorphism means that the
same function call leads to different operations for objects of different classes
(Matloff, 2011). Behind the scenes, it means that different functions (usually
called methods in S3) are written for each individual class. So for this example,
there are actually two different methods at work, one to summarize the linear
model results (summary.lm()), and the other to summarize columns in a data
frame (summary.data.frame()). We will now explore this mechanism in more
detail.
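6.3 Attributes
Attributes attach metadata to an object, and can be set using the function attr(). A sketch of the code producing the output below (the variable name x1 is assumed):
x1 <- 1:3
attr(x1,"test") <- "Hello World"
attributes(x1)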
#> $test
#> [1] "Hello World"
While there is flexibility to add attributes, generally, there are three special
attributes that are used in R, and knowledge of these is required.
• The first special attribute (which we already have used) is names, which
stores information on the element names in an atomic vector, list, or data
frame. Previously we used the function names() to set these values, but
because names is a special attribute, we can achieve the same result by using
the attr() function. Usually, using attr() is not the best way to set vector
names, and the function names() is used instead. But the impact of both
methods is the same.
x2 <- 1:5
attr(x2,"names") <- LETTERS[1:5]
attributes(x2)
#> $names
#> [1] "A" "B" "C" "D" "E"
x2
#> A B C D E
#> 1 2 3 4 5
• The second special attribute is dim, and this is an integer vector that is used
by R to convert a vector into a matrix or an array. Notice that setting this
attribute has a similar effect of creating a matrix. Normally this mechanism
is not used, and the matrix is created using the matrix() function.
# Create a vector of 4 elements
m1 <- 1:4
m1
#> [1] 1 2 3 4
# Set the "dim" attribute to 2x2 matrix
attr(m1,"dim") <- c(2,2)
m1
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 2 4
attributes(m1)
#> $dim
#> [1] 2 2
# Set the "dim" attribute to 1x4 matrix
attr(m1,"dim") <- c(1,4)
m1
#> [,1] [,2] [,3] [,4]
#> [1,] 1 2 3 4
attributes(m1)
#> $dim
#> [1] 1 4
# Remove the attribute and revert to the original vector
attr(m1,"dim") <- NULL
m1
#> [1] 1 2 3 4
attributes(m1)
#> NULL
# Check with the usual method
m2 <- matrix(1:4,nrow=2)
m2
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 2 4
attributes(m2)
#> $dim
#> [1] 2 2
• The third special attribute of base types is known as class, and this has a
key role to play in the S3 object oriented system. An S3 object is a base
type with a defined class attribute, and we can explore this by looking at the
class attribute associated with a data frame. The key point is that this new
attribute class is defined for a data frame, and when it is defined, R views
the object as an S3 object, and so can apply the mechanisms of S3 to this
object. The function class() can also be used to view the class attribute.
# Show the class attribute using attr
attr(mtcars,"class")
#> [1] "data.frame"
# Show the class attribute using class
class(mtcars)
#> [1] "data.frame"
#> $class
#> [1] "my_test"
The class attribute can also be set using the attr() function.
v <- list(a=1,b=2,c=3)
attr(v,"class") <- "my_test"
attributes(v)
#> $names
#> [1] "a" "b" "c"
#>
#> $class
#> [1] "my_test"
The function is.object() checks whether an object has a class attribute, and
therefore whether R views it as an S3 object. Applied to a data frame d (similar
to the one created in Chapter 5), it returns TRUE.
is.object(d)
#> [1] TRUE
Next we make just one minor change to the object d by removing the class
attribute. For this, we use the function unclass(), which deletes the class
attribute. This simple action totally changes the way R views the object. It
is now no longer an S3 object, and instead is just viewed as its base type,
which is a list. Notice that R does not stop you from doing this, and you will
never get a message saying, “do you realize you have just changed the class
of this object?” That’s not the S3 way, as it is very flexible and relies on the
programmer being aware of how S3 operates.
# Remove its class attribute
d <- unclass(d)
# When you show d's structure, it's no longer a data frame
str(d)
#> List of 3
#> $ Number: int [1:5] 1 2 3 4 5
#> $ Letter: chr [1:5] "A" "B" "C" "D" ...
#> $ Flag : logi [1:5] TRUE FALSE TRUE FALSE NA
#> - attr(*, "row.names")= int [1:5] 1 2 3 4 5
# The class attribute has been removed
attributes(d)
#> $names
#> [1] "Number" "Letter" "Flag"
#>
#> $row.names
#> [1] 1 2 3 4 5
# Call is.object() function
is.object(d)
#> [1] FALSE
Every generic function has the same structure: a single line of code that calls
UseMethod(). Printing summary at the console confirms this.
summary
#> function (object, ...)
#> UseMethod("summary")
#> <bytecode: 0x13190f5f0>
#> <environment: namespace:base>
However, behind the scenes, the R system does a lot of work to find a way to
implement this function, and here are the main steps.
• When the user calls a generic function, they pass an object, which usually
has an S3 class attribute. For example, summary(mod), which we explored earlier,
calls summary() (a generic function) with the object mod, which is a list that
is also an S3 object with the class attribute set to “lm”.
• R will then invoke method dispatch (Wickham, 2019), which aims to find
the correct method to call. Therefore, the only line of code needed in the
generic function is a call to UseMethod(generic_function_name), and, in this
case, R will find the class attribute (e.g., “lm”), and find the method (e.g.,
summary.lm()).
• Once R finds this method, R will call the method with the arguments, and
let the method perform the required processing.
• The object returned by the method will then be routed back to the original
call by the generic function.
We can see how this works by writing our own test (and not very useful!)
version of the method summary.lm() in our global workspace, so that it will
be called before R’s function summary.lm(). This new method returns the
string “Hello World”. Note that we use the “method” terminology because it
indicates that the function is associated with a generic function. In reality, the
mechanism of writing a method is the same as writing a function.
# Redefining summary.lm() to add in the global environment
summary.lm <- function(o)
{
"Hello world!"
}
# Show the output from the new version of summary.lm
summary(mod)
#> [1] "Hello world!"
# Remove the function from the global environment
rm(summary.lm)
And this captures the elegant idea of separating the interface (i.e., the generic
function) from the implementation (i.e., the method). It has a huge benefit for
users of R in that they don’t have to worry about finding the specific method,
they simply call the generic function with the S3 class and let the S3 system
in R do the rest!
We will look at building our own print method for a test class, where the
test class is set as an attribute on an atomic vector. The class is called
“my_new_class”, and therefore the method, associated with the generic func-
tion print is named print.my_new_class(). Inside this method, we use the
cat() function to print a simple message, “Hello World!”. Therefore, if the
S3 class for a variable is “my_new_class”, the method print.my_new_class()
will be selected by R using method dispatch. Note that this is not a useful
function, the goal is just to indicate how to write a new method that is linked
to the generic function print().
# define an atomic vector
v <- c(10.1, 31.2)
# Show v at the console
v
#> [1] 10.1 31.2
# Add a class attribute
class(v) <- "my_new_class"
While this example showed a simple atomic vector, it’s more likely that you
will be using S3 classes with a list data structure and then generate new
methods from some of R’s generic functions. However, in addition to using
available generic functions, R does provide the facility to write your own
generic functions.
First, we define the constructor. This defines the data (in a list), and it also
specifies the class (“account”).
# Define a constructor to create the object
new_account <- function(num,op_bal){
l <- list(number=num,balance=op_bal)
class(l) <- "account"
l
}
Next we define the two new generic functions. Note that each generic function
has just one line of code - UseMethod() - and we have added an extra argument
to the function, as we need to pass a value for the debit and credit amount
into each of these functions. First, we define the debit() generic function.
# Define a generic function to debit an account
debit <- function(o,amt){
UseMethod("debit")
}
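The credit() generic function follows the same pattern (a sketch, consistent with its use below):
# Define a generic function to credit an account
credit <- function(o,amt){
  UseMethod("credit")
}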
With the generic functions defined, we now define three new methods for the
class “account”. The first is the debit.account() method. When the debit()
generic function is called with an object whose class is “account”, R will
dispatch the call to debit.account().
# Create a debit method
debit.account <- function(o,amt){
o$balance <- o$balance - amt
o
}
The second is the credit.account() method. When the credit() generic func-
tion is called with an object whose class is “account”, R will dispatch the call
to credit.account().
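A sketch of credit.account(), mirroring the debit method:
# Create a credit method
credit.account <- function(o,amt){
  o$balance <- o$balance + amt
  o
}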
Finally, we add a method for printing the object. When the print() generic
function is called with an object whose class is “account”, R will dispatch the
call to print.account().
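A sketch of print.account(), consistent with the output shown in the tests below:
# Create a print method for the "account" class
print.account <- function(o, ...){
  cat("Acc# =",o$number," Balance =",o$balance,"\n")
}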
To test the code, we first create an object a1. Note that the type of this object
is a list and the class is “account”. Recall that when we type a1 at the console,
R will actually call the method print.account().
# Create an object
a1 <- new_account("1111",200)
typeof(a1)
#> [1] "list"
class(a1)
#> [1] "account"
a1
#> Acc# = 1111 Balance = 200
We then call the debit function to remove money from the account. Note that
the object a1 must be updated, namely, it receives an updated copy of the list
values.
# Debit the account
a1 <- debit(a1, 100)
a1
#> Acc# = 1111 Balance = 100
Finally, we credit the account through calling the generic function credit(),
and viewing the updated value of a1.
# Credit the account
a1 <- credit(a1, 300)
a1
#> Acc# = 1111 Balance = 400
In summary, this section has shown how we can create our own generic functions;
however, even when we create these, notice that we still use one of R’s available
generic functions print(). This tends to be the pattern of using S3; in many
cases we can add methods that use existing generic functions, and on occasion
we may decide to create our own generic functions.
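6.7 Inheritance with S3
Inheritance allows a subclass to reuse methods defined for its superclass. The example that follows builds on a superclass “student”; a sketch of its print method, consistent with the output shown below, is:
# A print method for the superclass "student"
print.student <- function(o, ...){
  cat("ID =",o$ID," Name =",o$Name," UGDegree=",o$UGDegree,"\n")
}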
Next, we define our subclass, where the class is “postgrad”. This class has one
extra variable for this type of student, which is their current postgrad degree.
We define the list object that contains the data for this new student, and this
data is stored in the variable s2.
s2 <- list(ID="S001",
Name="A. Smith",
UGDegree="Maths",
PGDegree="Operations Research")
We now must establish the relationship between the two classes, and this
is performed by using two string values in the class attribute, and S3 will
interpret this (from left to right) as the subclass to superclass relationship.
In this case, it defines that “postgrad” is a subclass of “student”, and both of
these are now S3 classes.
class(s2) <- c("postgrad","student")
class(s2)
#> [1] "postgrad" "student"
Given that the relationship is established, we can now see the benefit of this
structure. For example, all methods associated with the class “student” are
now available for the class “postgrad”. We can see this by calling the generic
function print(s2) and observing the result.
print(s2)
#> ID = S001 Name = A. Smith UGDegree= Maths
What has happened here is that the method print.student() has been called.
This is controlled by the method dispatch function in the generic function
print(). Because the class attribute on the s2 object is a vector with two
values, it takes the first value “postgrad” and searches for the function
print.postgrad().
Because that function does not yet exist, it then moves to the superclass
“student” and searches for the function print.student(). As that function does
exist, it then is called by the generic function. Therefore, the method dispatch
process has a key role to play. It will navigate the class hierarchy, starting
at the base class, and continuing through to the superclasses, until it finds a
matching function.
We can now write a method for the subclass “postgrad”, and this is called
print.postgrad().
print.postgrad <- function(o){
cat("This is a postgraduate student, PG course = ",o$PGDegree,"\n")
cat("Additional information on student...\n")
class(o) <- class(o)[-1]
print(o)
}
s2
#> This is a postgraduate student, PG course = Operations Research
#> Additional information on student...
#> ID = S001 Name = A. Smith UGDegree= Maths
The class of mtcars is simply the “data.frame” S3 class. Next we highlight the
class for ggplot2::mpg, and we can see the inheritance hierarchy, where the base
class is “tbl_df”, its superclass is “tbl”, and the next class in the hierarchy
is “data.frame”. This confirms that tibbles are extensions of data frames.
class(mtcars)
#> [1] "data.frame"
class(ggplot2::mpg)
#> [1] "tbl_df" "tbl" "data.frame"
The object q1 is a list with four elements, with the class attribute set to “queue”.
We can now add a new function and utilize the generic function capability of
S3 and implement our version of the function print().
print.queue <- function(q){
cat("Queue name = ",q$queue_name,"\n")
cat("Queue Description = ",q$queue_description,"\n")
cat("Waiting <",length(q$waiting),"> Products Waiting = ",
q$waiting,"\n")
cat("Processed <",length(q$processed),"> Products Processed = ",
q$processed,"\n")
}
q1
#> Queue name = Q1
#> Queue Description = Products queue
#> Waiting < 0 > Products Waiting =
#> Processed < 0 > Products Processed =
The new function print.queue() will now print this specific queue information
when it is called by the generic function print(). The generic function print()
has the following structure, with just one line of code that contains the call
UseMethod("print").
print
#> function (x, ...)
#> UseMethod("print")
#> <bytecode: 0x1370619a0>
#> <environment: namespace:base>
unclass(q1)
#> $queue_name
#> [1] "Q1"
#>
#> $queue_description
#> [1] "Products queue"
#>
#> $waiting
#> character(0)
#>
#> $processed
#> character(0)
In order to complete the example, and given that there are no generic functions
named add() and process(), we need to create two generic functions. Similar
to all generic functions in S3, they both contain one line of code, and the
argument passed to UseMethod is the generic function name. Note that the
function can take as many arguments as required, and these in turn will be
forwarded to the method by the S3 system.
add <- function(q,p){
UseMethod("add")
}
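The second generic function, process(), follows the same pattern:
process <- function(q){
  UseMethod("process")
}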
Finally, we can add the two specific methods for the “queue” class, and these
will be invoked by the generic functions in order to perform the work. The first
of these is the add.queue() method, where we simply append the new product
to the queue's waiting vector.
add.queue <- function (q, p){
q$waiting <- c(q$waiting, p)
q
}
Notice that we must return the updated list from this function, in order to
have the most recent copy of the list. The function is now available for use, and
can be tested by adding a product to the queue (“P-01”) and then printing
the updated object.
q1 <- add(q1,"P-01")
q1
#> Queue name = Q1
#> Queue Description = Products queue
#> Waiting < 1 > Products Waiting = P-01
#> Processed < 0 > Products Processed =
Our final task is to implement a version of the generic function process() that
will take the oldest item in the queue, remove it, and place it in the processed
vector. This code requires an important check: if the queue is empty, then
nothing should happen, and the queue object is returned, unchanged. If there
is an item on the queue, it is (1) added to the processed element, and (2)
removed from the queue by using the subsetting mechanism q$waiting[-1],
which removes the first element.
process.queue <- function(q){
if(length(q$waiting) == 0){
cat("Cannot process queue as it is empty!\n")
return(q)
}
q$processed <- c(q$processed, q$waiting[1])
q$waiting <- q$waiting[-1]
q
}
6.9 Summary of R functions from Chapter 6
Function Description
attr() Gets or sets object attributes.
attributes() Returns object attributes.
class() Returns an object's S3 class.
is.object() Checks to see if an object has a class attribute.
unclass() Removes the class attribute of an object.
methods() Lists all available methods for a generic function.
UseMethod() Invokes method dispatch for S3.
glm() Used to fit generalized linear models.
6.10 Exercises
1. Show how you can generate the following object v, which has a
number of attributes.
v
#> A B C D E
#> 1 3 5 7 9
#> attr(,"class")
#> [1] "test_class"
2. Write a new print method for an “lm” class (this will reside in
the global environment) that simply displays the model
coefficients. Use the R function coef() to display the coefficients as a
vector. Here is sample output from the new print function.
3. Define a new class, “my_lm”, that inherits from the class “lm”.
Write a new version of the summary function that displays a message
before showing the usual summary output from an “lm” object. The
following code should be implemented. In your method summary.my_lm(),
you cannot make a direct call to the method summary.lm(); this method
must be accessed via a generic function call. Think about how you might
achieve this by manipulating the class attribute.
7
Visualization with ggplot2
7.1 Introduction
A core part of any data analysis and modelling process is to visualize data and
explore relationships between variables. The exploratory data analysis process,
which we will introduce in Chapter 12, involves iteration between selecting
data, building models, and visualizing results. Within R’s tidyverse we are
fortunate to have access to a visualization library known as ggplot2. There
are three important benefits of using this library: (1) plots can be designed in
a layered manner, where additional plotting details can be added using the
+ operator; (2) a wide range of plots can be generated to support decision
analysis, including scatterplots, histograms, and time series charts; and (3)
once the analyst is familiar with the structure and syntax of ggplot2, charts
can be developed rapidly, and this supports an iterative process of decision
support.
Upon completing the chapter, you should understand the following:
• How to identify the three main elements of a plot: the data (stored in a
data frame or tibble), the aesthetic function aes() that maps data to visual
properties, and geometric objects (“geoms”) that are used to visualize the
data.
Chapter structure
• Section 7.2 introduces the two main datasets used in this chapter, both
contained in the package ggplot2, and are named mpg and diamonds.
• Section 7.3 shows the first example of plotting, which is the scatterplot, and
we explore the relationship between two variables in a fuel economy dataset.
• Section 7.4 builds on the scatterplot example to show how additional features
can be added to a plot in a convenient way, and thus layering a richer
information set on the plot.
• Section 7.5 introduces a powerful feature of ggplot2 which provides a mech-
anism to divide a plot into a number of sub-plots.
• Section 7.6 moves beyond the scatterplot to show ggplot2 functions that
transform data before plotting; examples here include bar charts and his-
tograms.
• Section 7.7 introduces the idea of a theme, which provides the facility to
modify the non-data elements of your plots.
• Section 7.8 shows how you can add three types of lines to your plots.
• Section 7.9 explores an additional dataset - aimsir17 - and focuses on how we
can explore and format time series data.
• Section 7.10 provides a summary of the functions introduced in the chapter.
• Section 7.11 presents coding exercises to test your understanding of ggplot2.
7.2 Two datasets from the package ggplot2
This chapter mainly uses two datasets included in ggplot2: the fuel economy
data in mpg, and the diamond pricing data in diamonds.
The tibble diamonds contains prices and other attributes of many diamonds
(N = 53,940). This tibble will be used to show how ggplot2 performs statistical
transformations on data before displaying it. This includes generating bar
charts for different categories of diamonds. Categories are often stored
as factors in R, where each value is one of a wider group; for example, the cut
of a diamond can be one of Fair, Good, Very Good, Premium, or Ideal.
For our first example, we focus on the dataset mpg, and the variables engine
displacement (in liters) represented by the variable displ and city miles per
gallon recorded through the variable cty. Before plotting, we can view a
summary of these two variables as follows. For convenience, we utilize R’s pipe
operator.
mpg |> subset(select=c("displ","cty")) |> summary()
#> displ cty
#> Min. :1.60 Min. : 9.0
#> 1st Qu.:2.40 1st Qu.:14.0
#> Median :3.30 Median :17.0
#> Mean :3.47 Mean :16.9
#> 3rd Qu.:4.60 3rd Qu.:19.0
#> Max. :7.00 Max. :35.0
To create the graph, the following ggplot2 functions are called. Note that we
can build a graph in layers, where each new layer is added to a graph using
the + operator.
• First, we call ggplot(data=mpg), which initializes a ggplot object, and this
call also allows us to specify the tibble that contains the data. By itself,
this call generates an empty graph; therefore, we need to provide
more information for the scatterplot.
• Next, we extend this call to include the x-axis and y-axis variables by
including an additional argument (mapping) and the function aes() which
describes how variables in data are mapped to the plot’s visual properties.
The completed command is ggplot(data=mpg,mapping=aes(x=displ,y=cty)).
We can execute this command, and a more informative graph appears, as it
will show the numerical range of each variable, which is based on the data in
mpg. However, when you try this command, you will not see any points on
the graph.
• Finally, we need to visualize the set of points on the graph, and we do this
by calling the relevant geometric object, which is one that is designed to draw
points, namely the function geom_point(). As ggplot2 is a layered system for
visualization, we can use the + operator to add on new elements to a graph,
and therefore the following command will generate the graph of interest,
shown in Figure 7.2.
ggplot(data=mpg, mapping=aes(x=displ,y=cty)) +
geom_point()
The scatterplot is then produced showing the points. This plot can be enhanced
using the idea of aesthetic mappings, which we will now explore.
7.4 Aesthetic mappings
The points on a scatterplot can also convey information derived
from other variables. For example, what if we also wanted to see which class of
car each point belonged to? There are seven classes of car in the dataset, and
we can see this by running the following code, where we can access a tibble’s
variable using the $ operator. Recall that this is possible because a tibble is
built on a list, as we already discussed in Chapter 5.
unique(mpg$class)
#> [1] "compact" "midsize" "suv" "2seater"
#> [5] "minivan" "pickup" "subcompact"
This information can then be used by the aes() function by setting the
argument color to the particular variable we would like to color the plot
by. Note the following code, and see how we have just added one additional
argument to the aes() function, and the updated output is shown in Figure
7.3.
ggplot(data=mpg,mapping=aes(x=displ,y=cty,color=class))+
geom_point()
With this single addition, we have added new information to the plot, where
each class of car now has a different point color. The call to ggplot() produces
a default setting for the legend, in this case adding it to the right side
of the plot. It also labels the legend with the same name as the variable used.
There are a number of additional arguments we can embed inside the aes()
function, and these include size and shape, which again will change the
appearance of the points, and also update the legend as appropriate. For
example, we can add information to the plot that indicates the relative sizes
of the number of cylinders for each observation, by setting the size argument.
As can be viewed in Figure 7.4, this provides information showing that cars
with larger displacement values also have a higher number of cylinders.
ggplot(data=mpg,mapping=aes(x=displ,y=cty,color=class,size=cyl))+
geom_point()
As you become familiar with the range of functions within ggplot2, you will
see that the default appearance of the plot can be customized and enhanced.
A function for adding further information to a plot is labs(), which takes a list
of name-value pairs that allows you to specify the following elements on your
plot:
• title provides an overall title text for the plot.
• subtitle adds a subtitle text.
• color allows you to specify the legend name for the color attribute.
• caption inserts text on the lower right-hand side of the plot.
• size, where you can name the size attribute.
• x to name the x-axis.
• y to name the y-axis.
• tag, the text for the tag label to be displayed on the top left of the plot.
We can see the impact of using labs() in the updated scatterplot, where we
select displ on the x-axis and hwy on the y-axis. Within the aes() function,
we color the points by the variable class and size the points by the variable
cyl. Note that we use a new approach (which is optional) where we store the
result of an initial call to ggplot() in a variable p1, and then layer the labelling
information onto this basic plot using the + operator. Typing p1 will then
display the plot, and the revised output is displayed in Figure 7.5.
p1 <- ggplot(data=mpg,aes(x=displ,y=hwy,size=cyl,color=class))+
geom_point()
p1 <- p1 +
labs(
title = "Exploring automobile relationships",
subtitle = "Displacement v Highway Miles Per Gallon",
color = "Class of Car",
size = "Cylinder Size",
caption = "Sample chart using the labs() function",
tag = "Plot #1",
x = "Displacement (Litres)",
y = "Highway Miles Per Gallon"
)
p1
Two further changes can be made to the appearance of a plot by altering the
scales on the x-axis and the y-axis, using the functions scale_x_continuous()
and scale_y_continuous(). Among the attributes that can be changed are:
• The plot limits, using the argument limits, where the lower and upper bounds
are specified in a vector.
• The tick points on each axis, using the argument breaks, where a vector
is used to specify the points at which the breaks are to be set. The function seq()
is convenient here, as we can specify the lower bound, the upper bound, and the
distance between each point.
These functions allow us to zoom in on a plot, and also increase the number of
ticks. There is an associated warning message, because data points outside of
the new ranges are ignored, and ggplot2 informs us of this. The output is
displayed in Figure 7.6.
p1 <- p1 +
scale_x_continuous(limits=c(4,5), breaks=seq(4,5,.1))+
scale_y_continuous(limits=c(10,20),breaks=seq(10,20,1))+
labs(caption = "This shows how to zoom in on graph regions")
p1
#> Warning: Removed 192 rows containing missing values
#> (`geom_point()`).
7.5 Subplots with facets
The scatterplots so far have shown all observations on a single panel, summarizing
the overall relationship between the two variables. However, what if we needed
to drill down on the plots and show, for example, the relationships for each class of car?
We have seen that there are seven car classes, and therefore, the challenge is
how can we create a separate plot for each class of car. Or, in the more general
case, sub-divide a plot into multiple plots based on another variable. The
function facet_wrap() will do this in ggplot2, and all it needs as an argument
is the variable for dividing the plots, which must be preceded by the tilde (~)
operator. Here is the sample code to do this based on our earlier example,
and the output is visualized in Figure 7.7. Note that the number of plots per
row can be controlled by using the arguments nrow and/or ncol in the call to
facet_wrap().
ggplot(data=mpg,aes(x=displ,y=cty))+
geom_point()+
facet_wrap(~class)
An extra variable can be added to the faceting process by using the related
function facet_grid(), which takes two arguments, separated by the ~ operator.
The first argument specifies which variable is to be mapped to each row, and
the second argument identifies the variable to be represented on the columns.
For example, we may want to generate 21 plots that show the type of drive
(drv) on the columns, and the class of car (class) on each row. The
following code will generate this plot, and the output is shown in Figure 7.8.
ggplot(data=mpg,mapping = aes(x=displ,y=cty))+
geom_point()+
facet_grid(class~drv)
This analysis enhances our appreciation of the mpg dataset, as it shows that
a number of the combinations do not contain any observations; for example:
front-wheel drive cars that are pickups, and rear-wheel drive compact cars.
This point is illustrative: the use of faceting and sub-plots can reveal interesting
patterns and relationships in the data, and their ease of use also makes them
an attractive feature when undertaking exploratory data analysis.
7.6 Statistical transformations
In Figure 7.1 we noted that the tibble diamonds has three categorical variables
(factors in R): cut (five types), color (seven types), and clarity (eight types).
We can show the quantity of diamonds for each type as follows.
summary(diamonds$cut)
#> Fair Good Very Good Premium Ideal
#> 1610 4906 12082 13791 21551
These values can be visualized by calling the function geom_bar(), and in this
case, ggplot2 will count each type before displaying the bar chart, which is
shown in Figure 7.9.
ggplot(data=diamonds,mapping=aes(x=cut))+
geom_bar()
Further plots can be generated. For example, to show the colors side-by-
side instead of using the default stacked format, we can set the argument
position="dodge" as part of the geom_bar() function call, and this provides a
comparison for the relative size of each diamond color within the different cuts.
Figure 7.11 displays the output from this graph.
ggplot(data=diamonds,mapping=aes(x=cut,fill=color))+
geom_bar(position="dodge")
A related option is position="fill", which scales each bar to the same height,
allowing the proportion of each color within a cut to be compared directly.
ggplot(data=diamonds,mapping=aes(x=cut,fill=color))+
geom_bar(position="fill")
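The tibble cl_res used in the next example holds the count of diamonds for
each clarity type; its construction is not shown in this extract, but a plausible
base R sketch (the column names follow the plotting code below) is:
# Count diamonds by clarity and store as a data frame
cl_res <- as.data.frame(table(diamonds$clarity))
colnames(cl_res) <- c("Clarity", "Count")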
In order to plot this summary data, we need to add a new argument to the
geom_bar() function, namely stat="identity". This will take the raw values
from the tibble, and therefore no aggregations are performed on the data. The
result is presented in Figure 7.13.
ggplot(data=cl_res,aes(x=Clarity,y=Count))+
geom_bar(stat="identity")
Another way to present count data is to display it using lines, and this is
useful if more than one set of count data needs to be visualized. For this, the
function geom_freqpoly() can be used to show the separate counts for each
type of clarity, and these are shown in Figure 7.15.
ggplot(data=diamonds,mapping=aes(x=table,color=clarity))+
geom_freqpoly()
The distribution of a continuous variable such as price can also be shown as
a kernel density estimate, using geom_density().
ggplot(data=diamonds,mapping=aes(x=price))+
geom_density()
The package GGally contains the function ggpairs(), which visualizes relationships
between pairs of variables, presents the density plot for each variable, and reports
the correlation coefficients between continuous variables. It provides
a useful perspective on the data, and for this example, we use the mpg dataset
and the three variables cty, hwy, and displ. We will make use of this plot for
exploratory data analysis in Chapter 12, when we explore the associations
between variables.
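The call itself is not shown in this extract; a plausible version is the following sketch.
library(GGally)
# Pairwise plots, densities, and correlations for three mpg variables
ggpairs(mpg[, c("cty", "hwy", "displ")])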
7.7 Themes
The ggplot2 theme system enables you to control non-data elements of your
plot, for example, control over elements such as fonts, ticks, legends, and
backgrounds. It comprises the following components (Wickham, 2016):
• Theme elements specify the non-data elements; for example, legend.title
controls the appearance of the legend title.
• Element functions, which describe the visual properties of an element. For
example, element_text() can be used to set the font size, color, and typeface
of text elements such as those used in legend.title. Other element functions
include element_rect() and element_line().
• The theme() function where you can change the default theme elements by
calling element functions, for example, theme(legend.position="top").
• Complete and ready-to-use themes, such as theme_bw() and theme_light(),
which are designed to provide a coherent set of values to work with your
plots. Themes are also available from within the package ggthemes, which
includes theme_economist(), based on the style of The Economist newspaper.
To explore an example of how this theme system can be used, we will revert
to an earlier plot, and make five arbitrary changes:
• The font type of the legend will be changed to bold face, using the argument
legend.title and the element function element_text().
• The legend will be moved to the top of the chart, based on setting the
argument legend.position to “top”.
• The text on the x-axis will be made larger with a new font, by setting
argument axis.text.x using the element function element_text().
• The plot background is altered through the use of the argument
panel.background and the element function element_rect().
• The appearance of the grid is changed through the argument panel.grid and
the element function element_line().
The modified plot is shown in Figure 7.19.
ggplot(data=mpg,aes(x=displ,y=cty,color=class))+
geom_point()+
theme(legend.title = element_text(face="bold"),
legend.position = "top",
axis.text.x = element_text(size=15,face="italic"),
panel.background = element_rect(fill="white"),
panel.grid = element_line(color = "blue",
linewidth = 0.3,
linetype = 3))
Note that there are many other options that can be used within the theme()
function; for full details, type ?theme in the R console. The default theme in
ggplot2 can also be invoked using theme_gray(), and there are other themes
that can be used, including:
• theme_bw(), a classic dark-on-light theme.
• theme_light(), a theme with light gray lines and axes in order to provide
more emphasis on the data.
• theme_dark(), referred to in the documentation as the dark cousin of
theme_light().
• theme_minimal(), a minimalistic theme with no background annotations.
• theme_classic(), a classic-looking theme, with no gridlines.
• theme_void(), a completely empty theme.
To compare these themes, we create a new tibble with just five data points,
and generate an object, plot, that stores the basic scatterplot.
d <- tibble(x=seq(1,3,by=0.5),y=2*x)
plot <- ggplot(data=d,mapping=aes(x=x,y=y))+
geom_point()
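The listing that produces the themed variants is not shown in this extract;
each panel can be generated by adding the relevant theme to the stored plot,
for example:
plot + theme_bw()       # classic dark-on-light theme
plot + theme_minimal()  # minimalistic theme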
Figure 7.20 shows six different plots, based on which of the six themes is invoked.
The choice of theme is up to the analyst; it may depend on the purpose
of the plot (it could be required for a publication, and therefore certain plot
attributes may have to be adhered to). A benefit of this approach is that
additional themes, for example those contained in the package ggthemes, can
also be used.
7.9 Mini-case: Visualizing the impact of Storm Ophelia
Storm Ophelia struck Ireland in October 2017, and we can use the hourly
weather data in the package aimsir17 to explore its impact. The tibble
observations contains the 2017 data.
observations
#> # A tibble: 219,000 x 12
#> station year month day hour date rain
#> <chr> <dbl> <dbl> <int> <int> <dttm> <dbl>
#> 1 ATHENRY 2017 1 1 0 2017-01-01 00:00:00 0
#> 2 ATHENRY 2017 1 1 1 2017-01-01 01:00:00 0
#> 3 ATHENRY 2017 1 1 2 2017-01-01 02:00:00 0
#> 4 ATHENRY 2017 1 1 3 2017-01-01 03:00:00 0.1
#> 5 ATHENRY 2017 1 1 4 2017-01-01 04:00:00 0.1
#> 6 ATHENRY 2017 1 1 5 2017-01-01 05:00:00 0
#> 7 ATHENRY 2017 1 1 6 2017-01-01 06:00:00 0
We first need to “drill down” and generate a reduced dataset for the three
days in question, and we will also select a number of weather stations from
different parts of Ireland. These are Belmullet (north west), Roches Point
(south), and Dublin Airport (east). Note that the station names in the tibble
are all upper-case. The following code uses the function subset() to filter the
dataset, and the & operator is used as we have more than one filtering condition.
We select columns that contain the date, temperature, mean sea level pressure,
and wind speed. In Chapter 8 we will start using the functions from dplyr to
filter data, and we will no longer need to use subset().
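A plausible version of this filtering step is sketched below; the exact days
chosen (October 15th to 17th) are an assumption, as the original listing is
not shown.
# Reduce observations to three stations and three days in October 2017
storm <- subset(observations,
                station %in% c("BELMULLET","ROCHES POINT","DUBLIN AIRPORT") &
                month == 10 & day %in% 15:17,
                select = c(station, date, temp, msl, wdsp))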
We can now use our knowledge of ggplot2 to explore these weather variables
and gain an insight into how the storm evolved over the 3 days in October
2017. This weather data is convenient, as we can use it to explore ggplot2 in a
practical way.
• Exploration one, where we plot, over time, the mean sea level pressure for
each weather station, and this is shown in Figure 7.22. With stormy weather
systems, you would expect to see a significant drop in mean sea level pressure.
This is evident from the plot, with the lowest values coinciding with the
peak of the storm. Here we also set the argument axis.text.x to show
how we can rotate the axis text by 90 degrees.
ggplot(data=storm,aes(x=date,y=msl,color=station))+
geom_point()+
geom_line() +
theme(legend.position = "top",
axis.text.x = element_text(angle = 90))+
scale_x_datetime(date_breaks = "8 hour",date_labels = "%H:%M %a")+
labs(title="Storm Ophelia",
subtitle = "Mean Sea Level Pressure",
x="Day and Time",
y="Mean Sea Level Pressure (hPa)",
color="Weather Station")
When plotting time series (where you have an R date object), it is worth ex-
ploring how to use the function scale_x_datetime(), as it provides an excellent
way to format dates. Here we use two arguments. The first is date_breaks
which specifies the x-axis points that will contain date labels (every 8 hours),
and the second is date_labels which allows you to configure what will be
printed (hour, minute, and day of week).
A range of values can be extracted and configured, and the table below
(Wickham, 2016) shows how the special character % can be used in combination
with a letter. For example, "%H:%M %a" will combine the hour (24-hour
clock), a colon, the minute (00–59), and a blank space followed by the abbreviated
day of the week. This allows you to present a user-friendly and readable x-axis,
and the important point is that these labels are fully configurable.
Code Meaning
%S Second (00–59)
%M Minute (00–59)
%l Hour, 12 hour clock (1–12)
%I Hour, 12 hour clock (01–12)
%p am/pm
%H Hour, 24-hour clock (00–23)
%a Day of week, abbreviated (Mon–Sun)
%A Day of week, full (Monday–Sunday)
%e Day of month (1–31)
%d Day of month (00–31)
%m Month, numeric (01–12)
%b Month, abbreviated (Jan–Dec)
%B Month, full (January–December)
%y Year, without century (00–99)
%Y Year, with century (0000–9999)
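These codes follow the conventions of base R’s strptime(); as a quick
illustration (not from the original text), the pattern used above renders as
follows in an English locale.
t1 <- as.POSIXct("2017-10-16 14:30:00", tz = "UTC")
format(t1, "%H:%M %a")
#> [1] "14:30 Mon"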
• Exploration two, where we plot, over time, the wind speed (knots) for each
weather station. This shows an interesting variation in the average hourly
wind speed patterns across the three stations during the day, with the
maximum mean wind speed recorded in Roches Point, which is just south
of Cork City. The data indicates that the southerly points in Ireland were
hardest hit by the storm. These wind speed values are visualized in Figure
7.23.
ggplot(data=storm,aes(x=date,y=wdsp,color=station))+
geom_point()+
geom_line()+
theme(legend.position = "top",
axis.text.x = element_text(angle = 90))+
scale_x_datetime(date_breaks = "8 hour",date_labels = "%H:%M %a")+
labs(title="Storm Ophelia",
subtitle = "Wind Speed",
x="Day and Time",
y="Wind Speed (Knots)",
color="Weather Station")
• Exploration three, where we plot wind speed against mean sea level pressure
for each station, and call geom_smooth() with the argument method="lm" to
add a linear trend line that highlights the relationship between the two variables.
ggplot(data=storm,aes(x=msl,y=wdsp,color=station))+
geom_point()+
geom_smooth(method="lm")+
theme(legend.position = "top")+
labs(title="Storm Ophelia",
subtitle="Atmospheric Pressure v Wind Speed with geom_smooth()",
x="Mean Sea Level Pressure (hPa)",
y="Wind Speed (Knots)",
color="Weather Station")
7.10 Summary of R functions from Chapter 7
Function Description
aes() Maps variables to visual properties of geoms.
element_line() Specify properties of a line, used with theme().
element_rect() Specify properties of borders and backgrounds.
element_text() Specify properties of text.
facet_wrap() Creates panels based on an input variable.
facet_grid() Creates a matrix of panels for two variables.
ggplot() Initializes a ggplot object.
geom_point() Used to create scatterplots.
geom_abline() Draws a line with a given slope and intercept.
geom_bar() Constructs a bar chart based on count data.
geom_boxplot() Displays summary of distribution.
geom_density() Computes and draws the kernel density estimate.
geom_freqpoly() Visualize counts of one or more variables.
geom_histogram() Visualize the distribution of a variable.
geom_hline() Draws a horizontal line.
geom_vline() Draws a vertical line.
geom_smooth() Visualization aid to assist in showing patterns.
ggpairs() Creates a matrix of plots (library GGally).
labs() Used to modify axis, legend and plot labels.
scale_x_continuous() Configure the x-axis scales.
scale_y_continuous() Configure the y-axis scales.
scale_x_datetime() Used to configure date objects on the x-axis.
scale_y_datetime() Configure date objects on the y-axis.
theme() Customize the non-data components of your plots.
theme_grey() Signature theme for ggplot2.
theme_bw() Dark-on-light theme.
theme_light() Theme with light gray lines and axes.
theme_dark() Similar to theme_light(), darker background.
theme_minimal() Minimalistic theme.
theme_classic() Theme with no background annotations.
theme_void() Completely empty theme.
7.11 Exercises
1. Generate the following plot from the mpg tibble in ggplot2. The
x-variable is displ and the y-variable is cty. Make use of the labs()
and theme() functions.
all the observations from the month of October (month 10), which
can be retrieved using the subset() function.
8
Data Transformation with dplyr
8.1 Introduction
Visualization is an important way to explore and gain insights from data.
However, a tibble may not contain the data that is required, and further
transformation of the tibble may be needed. In earlier chapters, functions
such as subset() and transform() were used to (1) select rows according to a
conditional expression, (2) select columns that are of interest, and (3) add new
columns that are functions of other columns. The package dplyr provides an
alternative set of functions to support data transformations, and also additional
functions that can be used to summarize data. The underlying structure of
dplyr is often termed “tibble-in tibble out” (Wickham and Grolemund, 2016),
where each function accepts a tibble and a number of arguments, and then
returns a tibble. This provides an elegant architecture that is ideal for the
use of tools such as the R and tidyverse pipe operators. Note that in our
presentation of code examples from the tidyverse, we have opted to use the
full reference for functions so as to be clear which package is being used (this
will be optional for you as you code your own solutions).
Upon completing the chapter, you should understand the following tidyverse
functions:
• The tidyverse pipe operator %>% (from the package magrittr), which allows
you to chain together a sequence of functions that transform data.
• The dplyr functions filter(), arrange(), select(), mutate(), group_by(),
and summarize(), all of which allow you to manipulate and transform tibbles
in a structured way.
• Additional dplyr functions pull() and case_when().
• A summary of R functions that allow you to use the tools of dplyr.
• How to solve all five test exercises.
Chapter structure
• Section 8.2 introduces the tidyverse pipe %>%, which is similar to the R native
pipe |>.
• Section 8.3 describes the function filter(), which allows you to subset the
rows of a tibble.
• Section 8.4 presents the function arrange() which is used to reorder a tibble
(ascending or descending) based on one or more variables.
• Section 8.5 summarizes the function select(), which allows you to select a
subset of columns from a tibble.
• Section 8.6 describes the function mutate(), which is used to add new columns
to a tibble.
• Section 8.7 presents the function summarize() which allows you to obtain
summaries of variable groupings, for example, summations, averages, and
standard deviations.
• Section 8.8 documents a number of other functions that integrate well with
dplyr.
• Section 8.9 presents a study of how we can use dplyr to compute summary
rainfall information from the package aimsir17.
• Section 8.10 provides a summary of the functions introduced in the chapter.
• Section 8.11 provides a number of short coding exercises to test your under-
standing of dplyr.
8.2 The tidyverse pipe operator %>%
Both pipe operators are broadly equivalent. To explore the benefits of using the
pipe operator, consider the following short example. First, we load the package magrittr.
library(magrittr)
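The sequential workflow being simplified is not shown in this extract; a
plausible version, using intermediate variables (the names v1 and v2 are
assumptions), is the following.
set.seed(100)
v1 <- rpois(12, 50)               # generate 12 Poisson random numbers
v2 <- sort(v1, decreasing = TRUE) # sort in decreasing order
top_6 <- head(v2)                 # keep the six largest values
top_6
#> [1] 56 55 53 52 50 50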
With the %>% operator, this sequential workflow can be simplified into fewer
steps, and note the benefit that the number of intermediate variables needed
is reduced. In order to keep the code as clear as possible, each function is
shown on a separate line.
set.seed(100)
top_6 <- rpois(12,50) %>%
sort(decreasing = TRUE) %>%
head()
top_6
#> [1] 56 55 53 52 50 50
While the pipe operator(s) are valuable and can lead to cleaner and easier-to-
understand code, it is important to note that it is best to use a small number
of steps (not more than ten is a useful guide), where the output from one step
is the input to the next. Also, the process is not designed for multiple inputs or
outputs. However, despite these constraints, it is a wonderful operator, and we
will make extensive use of it as we study the tidyverse.
8.3 Filtering rows with filter()
The function filter() allows you to subset the rows of a tibble
and perform operations on this reduced set of rows. This function conforms
to the architecture of dplyr, in that it accepts a tibble, and arguments, and
always returns a tibble. The arguments include:
• A data frame, or data frame extensions such as a tibble.
• A list of expressions that return a logical value and are defined in terms of
variables that are present in the input data frame. More than one expression
can be provided, and these will act as separate filtering conditions.
To explore filter, let’s consider the mpg data frame, and here we can filter all
vehicles that belong to the class “2seater”. We store the result in a new tibble
called mpg1, which is now displayed.
mpg1 <- dplyr::filter(mpg,class=="2seater")
mpg1
#> # A tibble: 5 x 11
#> manufac~1 model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 chevrolet corv~ 5.7 1999 8 manu~ r 16 26 p
#> 2 chevrolet corv~ 5.7 1999 8 auto~ r 15 23 p
#> 3 chevrolet corv~ 6.2 2008 8 manu~ r 16 26 p
#> 4 chevrolet corv~ 6.2 2008 8 auto~ r 15 25 p
#> 5 chevrolet corv~ 7 2008 8 manu~ r 15 24 p
#> # ... with 1 more variable: class <chr>, and abbreviated
#> # variable name 1: manufacturer
Notice that the result stored in mpg1 has retained all of the columns, and those
rows (5 in all) that match the condition. The condition explicitly references
the column value that must be true for all matching observations. We can add
more conditions for the subsetting task, for example, to include those values
of the “2seater” class, where the value for highway miles per gallon (hwy) is
greater than or equal to 25.
mpg2 <- dplyr::filter(mpg,class=="2seater",hwy >= 25)
mpg2
#> # A tibble: 3 x 11
#> manufac~1 model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 chevrolet corv~ 5.7 1999 8 manu~ r 16 26 p
#> 2 chevrolet corv~ 6.2 2008 8 manu~ r 16 26 p
#> 3 chevrolet corv~ 6.2 2008 8 auto~ r 15 25 p
#> # ... with 1 more variable: class <chr>, and abbreviated
#> # variable name 1: manufacturer
As can be seen with the new tibble mpg2, this reduces the result to just three
observations. But if you wanted to view all cars that were manufactured
by either “lincoln” or “mercury”, a slightly different approach is needed. This
is an “or” operation, where we want to return observations that match one
of a number of possible values, and here the %in% operator can be used.
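A plausible version of this filter (the original listing is not shown) is the
following sketch.
# Keep cars made by either lincoln or mercury
mpg3 <- dplyr::filter(mpg, manufacturer %in% c("lincoln", "mercury"))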
Note that in practice, the %in% operator returns a logical vector which is then
used as an argument to filter. This process can be made more explicit; for
example, consider the following code whereby we can extract the first record
from the tibble mpg using a logical vector (which could have also been created
using %in%).
lv <- c(T,rep(F,nrow(mpg)-1))
dplyr::filter(mpg,lv)
#> # A tibble: 1 x 11
#> manufac~1 model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 audi a4 1.8 1999 4 auto~ f 18 29 p
#> # ... with 1 more variable: class <chr>, and abbreviated
#> # variable name 1: manufacturer
A function that can be used to subset a data frame or tibble is slice(), which
allows you to index tibble rows by their (integer) locations.
# Show the first 3 rows
dplyr::slice(mpg,1:3)
#> # A tibble: 3 x 11
#> manufac~1 model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 audi a4 1.8 1999 4 auto~ f 18 29 p
#> 2 audi a4 1.8 1999 4 manu~ f 21 29 p
#> 3 audi a4 2 2008 4 manu~ f 20 31 p
#> # ... with 1 more variable: class <chr>, and abbreviated
#> # variable name 1: manufacturer
# Sample 3 rows
set.seed(100)
dplyr::slice(mpg,sample(1:nrow(mpg),size = 3))
#> # A tibble: 3 x 11
#> manufac~1 model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 toyota toyo~ 2.7 1999 4 auto~ 4 16 20 r
#> 2 honda civic 1.6 1999 4 manu~ f 25 32 r
#> 3 hyundai sona~ 2.4 2008 4 manu~ f 21 31 r
#> # ... with 1 more variable: class <chr>, and abbreviated
#> # variable name 1: manufacturer
8.4 Sorting rows with arrange()
We can re-order mpg by cty, from highest to lowest, by wrapping the
column name with the function desc().
dplyr::arrange(mpg,desc(cty)) %>% slice(1:3)
#> # A tibble: 3 x 11
#> manufac~1 model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 volkswag~ new ~ 1.9 1999 4 manu~ f 35 44 d
#> 2 volkswag~ jetta 1.9 1999 4 manu~ f 33 44 d
#> 3 volkswag~ new ~ 1.9 1999 4 auto~ f 29 41 d
#> # ... with 1 more variable: class <chr>, and abbreviated
#> # variable name 1: manufacturer
More than one column can be provided for re-ordering data. For example, we
can re-order mpg by class, and then by cty in descending order. Notice that it
will take the first class (five observations) and order those rows, before moving
on to the next class.
dplyr::arrange(mpg,class,desc(cty)) %>% slice(1:7)
#> # A tibble: 7 x 11
#> manufac~1 model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 chevrolet corv~ 5.7 1999 8 manu~ r 16 26 p
#> 2 chevrolet corv~ 6.2 2008 8 manu~ r 16 26 p
#> 3 chevrolet corv~ 5.7 1999 8 auto~ r 15 23 p
#> 4 chevrolet corv~ 6.2 2008 8 auto~ r 15 25 p
#> 5 chevrolet corv~ 7 2008 8 manu~ r 15 24 p
#> 6 volkswag~ jetta 1.9 1999 4 manu~ f 33 44 d
#> 7 toyota coro~ 1.8 2008 4 manu~ f 28 37 r
#> # ... with 1 more variable: class <chr>, and abbreviated
#> # variable name 1: manufacturer
8.5 Choosing columns with select()
The function select() also provides a number of operators that make it easier
to select variables, for example:
• :, for selecting a range of consecutive variables.
mpg %>% dplyr::select(manufacturer:year,cty)
#> # A tibble: 234 x 5
#> manufacturer model displ year cty
#> <chr> <chr> <dbl> <int> <int>
#> 1 audi a4 1.8 1999 18
#> 2 audi a4 1.8 1999 21
#> 3 audi a4 2 2008 20
#> 4 audi a4 2 2008 21
#> 5 audi a4 2.8 1999 16
#> 6 audi a4 2.8 1999 18
#> 7 audi a4 3.1 2008 18
#> 8 audi a4 quattro 1.8 1999 18
#> 9 audi a4 quattro 1.8 1999 16
#> 10 audi a4 quattro 2 2008 20
#> # ... with 224 more rows
• starts_with(), which takes a string and returns any column that starts with
this value.
mpg %>% dplyr::select(starts_with("m"))
#> # A tibble: 234 x 2
#> manufacturer model
#> <chr> <chr>
#> 1 audi a4
#> 2 audi a4
#> 3 audi a4
#> 4 audi a4
#> 5 audi a4
#> 6 audi a4
#> 7 audi a4
#> 8 audi a4 quattro
#> 9 audi a4 quattro
#> 10 audi a4 quattro
#> # ... with 224 more rows
• ends_with(), which takes a string and returns any column that ends with
this value.
mpg %>% dplyr::select(ends_with("l"))
#> # A tibble: 234 x 4
#> model displ cyl fl
#> <chr> <dbl> <int> <chr>
#> 1 a4 1.8 4 p
#> 2 a4 1.8 4 p
#> 3 a4 2 4 p
#> 4 a4 2 4 p
#> 5 a4 2.8 6 p
#> 6 a4 2.8 6 p
#> 7 a4 3.1 6 p
#> 8 a4 quattro 1.8 4 p
#> 9 a4 quattro 1.8 4 p
#> 10 a4 quattro 2 4 p
#> # ... with 224 more rows
• contains(), which takes a string and returns any column that contains the
value.
mpg %>% dplyr::select(contains("an"))
#> # A tibble: 234 x 2
#> manufacturer trans
#> <chr> <chr>
#> 1 audi auto(l5)
#> 2 audi manual(m5)
#> 3 audi manual(m6)
#> 4 audi auto(av)
#> 5 audi auto(l5)
#> 6 audi manual(m5)
#> 7 audi auto(av)
#> 8 audi manual(m5)
#> 9 audi auto(l5)
8.6 Adding new columns with mutate()
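The small tibble mpg_m used in the examples below is created in a listing not
shown in this extract; a minimal sketch, assuming it is a random sample of
six rows over six columns (the seed is hypothetical, so the sampled rows will
differ from the output shown), is:
set.seed(200)  # hypothetical seed; the original is not shown
mpg_m <- mpg %>%
  dplyr::select(manufacturer, model, displ, year, cty, class) %>%
  dplyr::slice_sample(n = 6)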
Now, we can explore the three types of changes that can be made to mpg with
mutate().
• A vector of the same length is added, and this vector is based on an existing
column. For example, we create a new column cty_kmh, which converts all
cty values to their counterpart in kilometers (using the constant 1.6). This
example shows an important feature of mutate(), in that existing variables
can be used, and new variables defined within the function call (for example,
the additional variable cty_2) can also be used in the process.
mpg_m %>% dplyr::mutate(cty_kmh=cty*1.6,
cty_2=cty_kmh/1.6)
#> # A tibble: 6 x 8
#> manufacturer model displ year cty class cty_kmh cty_2
#> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <dbl>
#> 1 volkswagen jetta 2 1999 21 compact 33.6 21
#> 2 volkswagen jetta 2.5 2008 21 compact 33.6 21
#> 3 chevrolet malibu 3.6 2008 17 midsize 27.2 17
#> 4 volkswagen gti 2 2008 21 compact 33.6 21
#> 5 audi a4 2 2008 21 compact 33.6 21
#> 6 toyota camry 3.5 2008 19 midsize 30.4 19
• Finally, we can remove a variable from a tibble by setting its value to NULL.
mpg_m %>% dplyr::mutate(class=NULL)
#> # A tibble: 6 x 5
#> manufacturer model displ year cty
#> <chr> <chr> <dbl> <int> <int>
#> 1 volkswagen jetta 2 1999 21
#> 2 volkswagen jetta 2.5 2008 21
#> 3 chevrolet malibu 3.6 2008 17
#> 4 volkswagen gti 2 2008 21
#> 5 audi a4 2 2008 21
#> 6 toyota camry 3.5 2008 19
Another function that can be used with mutate() is group_by() which takes
a tibble and converts it to a grouped tibble, based on the input variable(s).
Computations can then be performed on the grouped data. Here, we show how
the earlier mpg_m variable is grouped by class. Note that the data in the tibble
does not change; however, information on the group is added to the tibble,
and appears when the tibble is displayed.
# Group the tibble by class
mpg_mg <- mpg_m %>% dplyr::group_by(class)
mpg_mg
#> # A tibble: 6 x 6
#> # Groups: class [2]
#> manufacturer model displ year cty class
#> <chr> <chr> <dbl> <int> <int> <chr>
#> 1 volkswagen jetta 2 1999 21 compact
#> 2 volkswagen jetta 2.5 2008 21 compact
#> 3 chevrolet malibu 3.6 2008 17 midsize
#> 4 volkswagen gti 2 2008 21 compact
#> 5 audi a4 2 2008 21 compact
#> 6 toyota camry 3.5 2008 19 midsize
In this chapter, we will mostly use group_by() with the summarize() function,
but it can also be used with mutate(). Consider the example where we want to
calculate the maximum cty value for each class of car in the variable mpg_m. By
grouping the tibble (by class), the maximum of each group will be calculated.
mpg_mg %>% dplyr::mutate(MaxCtyByClass=max(cty))
#> # A tibble: 6 x 7
#> # Groups: class [2]
#> manufacturer model displ year cty class MaxCtyByClass
#> <chr> <chr> <dbl> <int> <int> <chr> <int>
#> 1 volkswagen jetta 2 1999 21 compact 21
#> 2 volkswagen jetta 2.5 2008 21 compact 21
#> 3 chevrolet malibu 3.6 2008 17 midsize 19
#> 4 volkswagen gti 2 2008 21 compact 21
#> 5 audi a4 2 2008 21 compact 21
#> 6 toyota camry 3.5 2008 19 midsize 19
Groupings can then be removed from a tibble with a call to the function
ungroup().
mpg_mg %>% dplyr::ungroup()
#> # A tibble: 6 x 6
#> manufacturer model displ year cty class
#> <chr> <chr> <dbl> <int> <int> <chr>
#> 1 volkswagen jetta 2 1999 21 compact
#> 2 volkswagen jetta 2.5 2008 21 compact
#> 3 chevrolet malibu 3.6 2008 17 midsize
#> 4 volkswagen gti 2 2008 21 compact
#> 5 audi a4 2 2008 21 compact
#> 6 toyota camry 3.5 2008 19 midsize
8.7 Summarizing observations with summarize()
The function summarize() creates a new tibble based on grouped summaries,
and it is typically used with aggregation functions such as the following.
Type Examples
Measures of location mean(), median()
Measures of spread sd(), IQR()
Measures of rank min(), max(), quantile()
Measures of position first(), nth(), last()
Counts n(), n_distinct()
Proportions e.g., sum(x>0)/n()
To add value to this process, we can group the tibble by whatever variable (or
combination of variables) we are interested in. For this example, we group by
class, and this will provide seven rows of output, one for each class. Note that
the function n() returns the number of observations found for each grouping.
mpg %>%
dplyr::group_by(class) %>%
dplyr::summarize(CtyAvr=mean(cty),
CtySD=sd(cty),
HwyAvr=mean(hwy),
HwySD=sd(hwy),
N=dplyr::n()) %>%
ungroup()
#> # A tibble: 7 x 6
#> class CtyAvr CtySD HwyAvr HwySD N
#> <chr> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 2seater 15.4 0.548 24.8 1.30 5
#> 2 compact 20.1 3.39 28.3 3.78 47
#> 3 midsize 18.8 1.95 27.3 2.14 41
#> 4 minivan 15.8 1.83 22.4 2.06 11
#> 5 pickup 13 2.05 16.9 2.27 33
#> 6 subcompact 20.4 4.60 28.1 5.38 35
#> 7 suv 13.5 2.42 18.1 2.98 62
Other functions can be used within summarize(); for example, for each class
we can find the model (and manufacturer) with the largest engine displacement,
using the functions nth() and which.max().
mpg %>%
group_by(class) %>%
summarize(MaxDispl=max(displ),
CarMax=dplyr::nth(model,which.max(displ)),
ManuMax=dplyr::nth(manufacturer,which.max(displ)))
#> # A tibble: 7 x 4
#> class MaxDispl CarMax ManuMax
#> <chr> <dbl> <chr> <chr>
#> 1 2seater 7 corvette chevrolet
#> 2 compact 3.3 camry solara toyota
#> 3 midsize 5.3 grand prix pontiac
#> 4 minivan 4 caravan 2wd dodge
#> 5 pickup 5.9 ram 1500 pickup 4wd dodge
#> 6 subcompact 5.4 mustang ford
#> 7 suv 6.5 k1500 tahoe 4wd chevrolet
8.8 Additional dplyr functions
The function case_when() provides a vectorized form of multiple if-else
statements; here is the output of a sample call.
x
#> [1] 70 98 7 7 200
A couple of points are worth noting. Once a LHS condition evaluates to TRUE, the
value on the RHS of that condition is assigned to the output, and no other
LHS conditions are evaluated. If no matches are found, the final condition
TRUE is matched, and the output is assigned that value.
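As a minimal illustration of these rules (this is not the original example),
consider the following.
y <- c(-5, 0, 12)
dplyr::case_when(
  y < 0  ~ "negative",   # first matching condition wins
  y == 0 ~ "zero",
  TRUE   ~ "positive"    # catch-all for any remaining values
)
#> [1] "negative" "zero" "positive"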
This function is often used with the mutate() function to create a new column
based on the values in an existing column. For example, we can add a column to
mpg based on whether the cty value is above or below the average value. Notice
that we can use the tibble’s column names within the body of case_when().
mpg %>%
dplyr::select(manufacturer:model,cty) %>%
dplyr::mutate(cty_status=case_when(
cty >= mean(cty) ~ "Above average",
cty < mean(cty) ~ "Below average",
TRUE ~ "Undefined"
)) %>%
dplyr::slice(1:5)
#> # A tibble: 5 x 4
#> manufacturer model cty cty_status
#> <chr> <chr> <int> <chr>
#> 1 audi a4 18 Above average
#> 2 audi a4 21 Above average
#> 3 audi a4 20 Above average
#> 4 audi a4 21 Above average
#> 5 audi a4 16 Below average
A feature of dplyr is that the functions we’ve explored always return a
tibble. However, there may be cases when we need to return just one column,
and while the $ operator can be used, the function pull() is a better choice,
especially when using pipes.
For example, if we wanted a vector that contained the unique values from the
column class in mpg, we could write this as follows.
mpg %>%
dplyr::pull(class) %>%
unique()
#> [1] "compact" "midsize" "suv" "2seater"
#> [5] "minivan" "pickup" "subcompact"
8.9 Mini-case: Summarizing total rainfall in 2017
In this mini-case, we use dplyr to address two tasks based on the 2017 weather data:
1. Calculate the total annual rainfall for each weather station, and
visualize using a bar chart.
2. Calculate the total monthly rainfall for two weather stations, “NEW-
PORT” and “DUBLIN AIRPORT”, and visualize both using a time
series graph.
For these tasks, we first need to load in the relevant packages, dplyr, ggplot2,
and aimsir17. The tibble observations contains the weather data that we
will summarize.
library(dplyr)
library(ggplot2)
library(aimsir17)
The first task involves summing the variable rain for each of the weather
stations. Therefore, it will involve grouping the entire tibble by the variable
station, and performing a sum of rain for each of these groups. We will store
this output in a tibble, and show the first ten rows.
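The original summarizing listing is not shown in this extract; a plausible
reconstruction, consistent with the verification test below, is:
# Total annual rainfall per weather station
annual_rain <- observations %>%
  dplyr::group_by(station) %>%
  dplyr::summarize(TotalRain = sum(rain, na.rm = TRUE)) %>%
  dplyr::ungroup()
annual_rain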
We can verify these aggregate values by performing a simple test based on the
first weather station “ATHENRY”.
test <- observations %>%
dplyr::filter(station=="ATHENRY") %>%
dplyr::pull(rain) %>%
sum()
test
#> [1] 1199
annual_rain$TotalRain[1]
#> [1] 1199
Next, we can visualize the data (shown in Figure 8.1) from the tibble
annual_rain to show the summaries for each station. Note that we add a horizontal
line to show the maximum, and also that the x-axis labels are shown at an
angle of 45 degrees. Note that we can use the functions xlab() and ylab() to
conveniently set the titles for both axes.
ggplot(annual_rain,aes(x=station,y=TotalRain))+
geom_bar(stat = "identity")+
theme(axis.text.x=element_text(angle=45,hjust=1))+
geom_hline(yintercept = max(annual_rain$TotalRain),color="red")+
xlab("Weather Station") + ylab("Total Annual Rainfall")
FIGURE 8.1 Total annual rainfall for 2017 across 25 weather stations
Our second task also involves aggregating data, but in this case we focus
on the monthly totals for two weather stations, “DUBLIN AIRPORT” and
“NEWPORT”. To set the scene, we can see first what the annual totals were
for these two weather stations, and this clearly shows much more rain falls in
Newport, located in Co. Mayo, and situated on Ireland’s Wild Atlantic Way.
dplyr::filter(annual_rain,
station %in% c("DUBLIN AIRPORT",
"NEWPORT"))
#> # A tibble: 2 x 2
#> station TotalRain
#> <chr> <dbl>
#> 1 DUBLIN AIRPORT 662.
#> 2 NEWPORT 1752.
In order to calculate the monthly values for both weather stations, we need
to return to the observations tibble, filter all data for both stations, group
by the variables station and month, and then apply the aggregation operation
(i.e., sum of the total hourly rainfall).
monthly_rain <- observations %>%
dplyr::filter(station %in% c("DUBLIN AIRPORT",
"NEWPORT")) %>%
dplyr::group_by(station, month) %>%
dplyr::summarize(TotalRain=sum(rain,na.rm = T))
monthly_rain
#> # A tibble: 24 x 3
#> # Groups: station [2]
#> station month TotalRain
#> <chr> <dbl> <dbl>
#> 1 DUBLIN AIRPORT 1 22.8
The total rainfall per month for each station, now stored in the
tibble monthly_rain, can be visualized using ggplot2, via the functions
ggplot(), geom_point(), and geom_line(). The function scale_x_continuous()
places limits on the x-axis values, and the plot is shown in Figure 8.2. Interestingly, it shows
that there was no month during 2017 in which the rainfall in Dublin Airport
(east coast) was greater than that recorded in Newport (west coast).
ggplot(monthly_rain,aes(x=month,y=TotalRain,color=station))+
geom_point()+geom_line()+
theme(legend.position = "bottom")+
scale_x_continuous(limits=c(1,12), breaks=seq(1,12))+
labs(x="Month",
y="Total Rain",
title = "Monthly rainfall summaries calculated using dplyr")
8.10 Summary of R functions from Chapter 8
Function Description
%>% The tidyverse pipe operator (library magrittr).
%in% Used to see if the left operand is in a vector.
filter() Subsets rows based on column values.
slice() Subsets rows based on their positions.
arrange() Sorts rows based on column(s).
select() Subsets columns based on names.
starts_with() Matches starting column names.
ends_with() Matches ending column names.
contains() Matches column names that contain the input.
num_range() Matches a numerical range.
matches() Matches a regular expression.
mutate() Creates new columns from existing variables.
group_by() Converts a tibble into a grouped tibble.
ungroup() Removes a tibble grouping.
summarize() Creates a new data frame, based on summaries.
case_when() Provides vectorization of multiple if_else() statements.
pull() Similar to $, and useful when deployed with pipes.
8.11 Exercises
1. Based on the mpg dataset from ggplot2, generate the following tibble
which filters all the cars with a cty value greater than the me-
dian. Ensure that your tibble contains the same columns, and with
set.seed(100) sample five records using sample_n(), and store the
result in the tibble ans.
ans
#> # A tibble: 5 x 7
#> manufacturer model displ year cty hwy class
#> <chr> <chr> <dbl> <int> <int> <int> <chr>
#> 1 nissan maxima 3 1999 18 26 midsize
#> 2 volkswagen gti 2 1999 19 26 compact
#> 3 subaru impreza awd 2.5 2008 20 27 compact
#> 4 chevrolet malibu 3.1 1999 18 26 midsize
#> 5 subaru forester awd 2.5 2008 19 25 suv
jan
#> # A tibble: 1,488 x 7
#> station month day hour date temp Weath~1
#> <chr> <dbl> <int> <int> <dttm> <dbl> <chr>
#> 1 DUBLIN AI~ 1 1 0 2017-01-01 00:00:00 5.3 No War~
#> 2 MACE HEAD 1 1 0 2017-01-01 00:00:00 5.6 No War~
#> 3 DUBLIN AI~ 1 1 1 2017-01-01 01:00:00 4.9 No War~
#> 4 MACE HEAD 1 1 1 2017-01-01 01:00:00 5.4 No War~
#> 5 DUBLIN AI~ 1 1 2 2017-01-01 02:00:00 5 No War~
#> 6 MACE HEAD 1 1 2 2017-01-01 02:00:00 4.7 No War~
#> 7 DUBLIN AI~ 1 1 3 2017-01-01 03:00:00 4.2 No War~
#> 8 MACE HEAD 1 1 3 2017-01-01 03:00:00 4.7 No War~
#> 9 DUBLIN AI~ 1 1 4 2017-01-01 04:00:00 3.6 Warnin~
#> 10 MACE HEAD 1 1 4 2017-01-01 04:00:00 4.5 No War~
#> # ... with 1,478 more rows, and abbreviated variable name
#> # 1: WeatherStatus
diam
#> # A tibble: 5 x 5
#> cut NumberDiamonds CaratMean PriceMax PriceMaxColor
#> <ord> <int> <dbl> <int> <ord>
#> 1 Ideal 21551 0.703 18806 G
#> 2 Premium 13791 0.892 18823 I
#> 3 Very Good 12082 0.806 18818 G
#> 4 Good 4906 0.849 18788 G
#> 5 Fair 1610 1.05 18574 G
4. For each class of car, create the tibble mpg1 that contains a new
column that stores the rank of city miles per gallon (cty), from
lowest to highest. Make use of the rank() function in R, and in this
function call, set the argument ties.method = "first".
mpg1
#> # A tibble: 234 x 7
#> manufacturer model displ year cty class RankCty
#> <chr> <chr> <dbl> <int> <int> <chr> <int>
#> 1 chevrolet corvette 5.7 1999 15 2seater 1
#> 2 chevrolet corvette 6.2 2008 15 2seater 2
#> 3 chevrolet corvette 7 2008 15 2seater 3
#> 4 chevrolet corvette 5.7 1999 16 2seater 4
#> 5 chevrolet corvette 6.2 2008 16 2seater 5
#> 6 audi a4 quattro 2.8 1999 15 compact 1
5. Find the stations with the highest (temp_high) and lowest (temp_low)
annual average temperature values. Use these variables to calcu-
late the average monthly temperature values for the two stations
(m_temps), and display the data in a plot.
temp_low
#> [1] "KNOCK AIRPORT"
temp_high
#> [1] "VALENTIA OBSERVATORY"
arrange(m_temps,month,station)
#> # A tibble: 24 x 3
#> station month AvrTemp
#> <chr> <dbl> <dbl>
#> 1 KNOCK AIRPORT 1 5.18
#> 2 VALENTIA OBSERVATORY 1 8.03
#> 3 KNOCK AIRPORT 2 5.03
#> 4 VALENTIA OBSERVATORY 2 8.26
#> 5 KNOCK AIRPORT 3 6.78
#> 6 VALENTIA OBSERVATORY 3 9.36
#> 7 KNOCK AIRPORT 4 7.85
#> 8 VALENTIA OBSERVATORY 4 9.60
#> 9 KNOCK AIRPORT 5 11.6
#> 10 VALENTIA OBSERVATORY 5 12.6
#> # ... with 14 more rows
9
Relational Data with dplyr and Tidying Data
with tidyr
It’s rare that a data analysis involves only a single table of data.
Typically you have many tables of data, and you must combine
them to answer the questions you are interested in. (Wickham and Grolemund, 2016)
9.1 Introduction
The five main dplyr functions introduced in Chapter 8 are powerful ways
to filter, transform, and summarize data stored in a tibble. However, when
exploring data, there may be more than one tibble of interest, and these tibbles
may share common information. For example, we may want to explore the
correlation between flight delays and weather events, and these information
sources may be contained in two different tibbles. In this chapter we show how
information from more than one tibble can be merged, using what are known
as mutating joins and filtering joins. Furthermore, we also explore a key idea
known as tidy data and show how functions from the package tidyr can be
used to streamline the data transformation pipeline.
Upon completing the chapter, you should understand the following:
• The main idea behind relational data, including concepts such as the primary
key, foreign key, and composite key.
• The difference between mutating joins and filtering joins, and how dplyr
provides functions for each type of join.
• The idea underlying tidy data, and how two functions from tidyr can be
used to transform data from untidy to tidy format, and back again.
• How to solve all five test exercises.
Chapter structure
• Section 9.2 describes the main ideas underlying the relational model.
• Section 9.3 introduces mutating joins, and how they can be achieved using
inner_join(), left_join(), right_join(), and full_join().
• Section 9.4 summarizes filtering joins, and how to implement them using the
functions semi_join() and anti_join().
• Section 9.5 presents the idea behind tidy data, and demonstrates the functions
pivot_longer() and pivot_wider().
• Section 9.8 presents a study of how we can join two tables to explore
relationships between wind speed and energy generated from wind.
• Section 9.9 provides a summary of the functions introduced in the chapter.
• Section 9.10 provides a number of short coding exercises to test your under-
standing of relational and tidy data.
9.2 Relational data
In the relational model, a primary key is a column (or combination of columns)
that uniquely identifies each observation in a table.
A primary key from one table can also be a column in another table, and if
this is the case, it is termed a foreign key in that table. For example, the tibble
observations in the package aimsir17 contains the column station, which is
not a primary key in this table, as it does not uniquely identify an observation.
Therefore station is a foreign key in the table observations.
We will now provide a simple example, which will be used to show how two
tables can be linked together via a foreign key. The idea here is that we can
represent two important ideas from a company’s information system, namely
products and orders, in two separate, but related, tables. An overview of the
structure is shown in Figure 9.1.
It shows two tables:
• Products, which has two variables, the ProductID, a primary key that
uniquely identifies a product, and Type, which provides a category for each
product.
• Orders, which contains three variables, the OrderID that uniquely identifies
an order, the Quantity, which records the number of products ordered for
each order, and the ProductID, which contains the actual product ordered.
As ProductID is a primary key of the table Products, it is categorized as a
foreign key in the table Orders.
The reason we have added this inconsistency is that it helps us highlight the
operation of joining functions used in dplyr. To prepare for the examples that
follow, we recreate these tables in R using the tibble() function. First, we
define the tibble products.
products <- tibble(ProductID=c("PR-1","PR-2","PR-3","PR-9"),
Type=c("Computer","Tablet","Phone","Headphones"))
products
#> # A tibble: 4 x 2
#> ProductID Type
#> <chr> <chr>
#> 1 PR-1 Computer
#> 2 PR-2 Tablet
#> 3 PR-3 Phone
#> 4 PR-9 Headphones
We can see that the primary key ProductID uniquely identifies each product.
dplyr::filter(products,ProductID=="PR-9")
#> # A tibble: 1 x 2
#> ProductID Type
#> <chr> <chr>
#> 1 PR-9 Headphones
Next we define the tibble orders, which also includes the foreign key column
for the product associated with each order.
orders <- tibble(OrderID=c("OR-1","OR-2","OR-3","OR-4","OR-5"),
Quantity=c(1,2,1,2,3),
ProductID=c("PR-1","PR-2","PR-3","PR-4","PR-1"))
orders
#> # A tibble: 5 x 3
#> OrderID Quantity ProductID
#> <chr> <dbl> <chr>
#> 1 OR-1 1 PR-1
#> 2 OR-2 2 PR-2
#> 3 OR-3 1 PR-3
#> 4 OR-4 2 PR-4
#> 5 OR-5 3 PR-1
We can now proceed to the core topics of this chapter, namely, exploring ways
to combine data from different tables. These types of functions are called joins,
and there are two main types, mutating joins and filtering joins.
9.3 Mutating joins
Mutating joins add new columns to one tibble, drawn from matching observations
in another.
9.3.1 inner_join(x,y)
The inner_join() function joins observations that appear in both tables, based
on a common key, which needs to be present in both tables. It takes the following
arguments, and returns an object of the same type as x.
• x and y, a pair of tibbles or data frames to be joined.
• by, which is a character vector of variables to join by. In cases where the key
column name is different, a named vector can be used, for example, by =
c("key_x" = "key_y").
• It adds the column Type to the tibble, containing information copied from
the products tibble.
• It does not include any information on order “OR-4” from the orders tibble,
and this is because “OR-4” is for product “PR-4”, a product which does not
appear in the products tibble.
• It also excludes information on product “PR-9”, because there are no recorded
orders for this product.
In summary, this mutating join is strict, as it only includes observations that
have a common key across both tables.
9.3.2 left_join(x,y)
A left join will keep all observations in the tibble x, even if there is no match
in tibble y. This is a widely used function, given that it maintains all the
observations in x. We can now show two examples based on the tibbles orders
and products.
l_j1 <- dplyr::left_join(orders,products,by="ProductID")
l_j1
#> # A tibble: 5 x 4
#> OrderID Quantity ProductID Type
#> <chr> <dbl> <chr> <chr>
#> 1 OR-1 1 PR-1 Computer
#> 2 OR-2 2 PR-2 Tablet
#> 3 OR-3 1 PR-3 Phone
#> 4 OR-4 2 PR-4 <NA>
#> 5 OR-5 3 PR-1 Computer
All of the orders are included by default, and if there is no matching key in
the paired tibble, a value of NA is returned. Here we can see that there is no
product with an identifier of PR-4 in the products table.
This can be further clarified if the argument keep=TRUE is passed into the join
function, as this will maintain both keys in the output. Interestingly, this
output clearly shows that the key on tibble y is missing, and note that each
table key is further annotated with .x for the first tibble, and .y for the second
tibble.
l_j2 <- dplyr::left_join(orders,products,by="ProductID",keep=TRUE)
l_j2
#> # A tibble: 5 x 5
#> OrderID Quantity ProductID.x ProductID.y Type
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 OR-1 1 PR-1 PR-1 Computer
#> 2 OR-2 2 PR-2 PR-2 Tablet
#> 3 OR-3 1 PR-3 PR-3 Phone
#> 4 OR-4 2 PR-4 <NA> <NA>
#> 5 OR-5 3 PR-1 PR-1 Computer
We can now show the use of left join where the first two arguments are reversed.
l_j3 <- dplyr::left_join(products,orders,by="ProductID")
l_j3
#> # A tibble: 5 x 4
#> ProductID Type OrderID Quantity
#> <chr> <chr> <chr> <dbl>
#> 1 PR-1 Computer OR-1 1
#> 2 PR-1 Computer OR-5 3
#> 3 PR-2 Tablet OR-2 2
#> 4 PR-3 Phone OR-3 1
#> 5 PR-9 Headphones <NA> NA
Notice that all the observations in x are maintained (the four products), and
matches in y are included. Because product PR-1 appears in two orders, two of
the rows contain PR-1. We can also see that PR-9 has no OrderID or Quantity,
as this product is not contained in the orders tibble. Again, we can also see
the tibble with the product keys from both tibbles x and y maintained.
l_j4 <- dplyr::left_join(products,orders,by="ProductID",keep=TRUE)
l_j4
#> # A tibble: 5 x 5
#> ProductID.x Type OrderID Quantity ProductID.y
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 PR-1 Computer OR-1 1 PR-1
#> 2 PR-1 Computer OR-5 3 PR-1
#> 3 PR-2 Tablet OR-2 2 PR-2
#> 4 PR-3 Phone OR-3 1 PR-3
#> 5 PR-9 Headphones <NA> NA <NA>
9.3.3 right_join(x,y)
A right join keeps all observations in the tibble y. In this example, we can see
that all the product information is shown (there are five products in all), but
that the order “OR-4” is missing, as that is for “PR-4”, which is not present
in the products table.
r_j1 <- dplyr::right_join(orders,products,by="ProductID")
r_j1
#> # A tibble: 5 x 4
#> OrderID Quantity ProductID Type
#> <chr> <dbl> <chr> <chr>
#> 1 OR-1 1 PR-1 Computer
#> 2 OR-2 2 PR-2 Tablet
#> 3 OR-3 1 PR-3 Phone
#> 4 OR-5 3 PR-1 Computer
#> 5 <NA> NA PR-9 Headphones
We can also perform the right join where the first tibble is products. Again,
with the right join all observations in the y tibble are returned, and therefore
the product “PR-9” is missing, as it is not linked to any order.
r_j2 <- dplyr::right_join(products,orders,by="ProductID")
r_j2
#> # A tibble: 5 x 4
#> ProductID Type OrderID Quantity
#> <chr> <chr> <chr> <dbl>
#> 1 PR-1 Computer OR-1 1
#> 2 PR-1 Computer OR-5 3
#> 3 PR-2 Tablet OR-2 2
#> 4 PR-3 Phone OR-3 1
#> 5 PR-4 <NA> OR-4 2
9.3.4 full_join(x,y)
A full join keeps all observations in both x and y. The same overall result is
obtained regardless of which tibble is the first one. For example, here is a full
join of products and orders.
f_j1 <- dplyr::full_join(products,orders,by="ProductID")
f_j1
#> # A tibble: 6 x 4
#> ProductID Type OrderID Quantity
#> <chr> <chr> <chr> <dbl>
#> 1 PR-1 Computer OR-1 1
#> 2 PR-1 Computer OR-5 3
#> 3 PR-2 Tablet OR-2 2
#> 4 PR-3 Phone OR-3 1
#> 5 PR-9 Headphones <NA> NA
#> 6 PR-4 <NA> OR-4 2
We can also see the result of starting the full join operation with orders.
f_j2 <- dplyr::full_join(orders,products,by="ProductID")
f_j2
#> # A tibble: 6 x 4
#> OrderID Quantity ProductID Type
#> <chr> <dbl> <chr> <chr>
#> 1 OR-1 1 PR-1 Computer
#> 2 OR-2 2 PR-2 Tablet
#> 3 OR-3 1 PR-3 Phone
#> 4 OR-4 2 PR-4 <NA>
#> 5 OR-5 3 PR-1 Computer
#> 6 <NA> NA PR-9 Headphones
9.4 Filtering joins
9.4.1 semi_join(x,y)
This function will keep all the observations in x that have a match in
y. In our manufacturing example, we can perform this join based on the
ProductID, starting with the orders table.
s_j1 <- dplyr::semi_join(orders,products,by="ProductID")
s_j1
#> # A tibble: 4 x 3
#> OrderID Quantity ProductID
#> <chr> <dbl> <chr>
#> 1 OR-1 1 PR-1
#> 2 OR-2 2 PR-2
#> 3 OR-3 1 PR-3
#> 4 OR-5 3 PR-1
Notice that this join only presents observations from the orders tibble, and this
is restricted to products that are present in the products tibble. The semi-join
can also be performed starting with the products table, and so it will only
show those products that are linked to an order present in the orders table.
s_j2 <- dplyr::semi_join(products,orders,by="ProductID")
s_j2
#> # A tibble: 3 x 2
#> ProductID Type
#> <chr> <chr>
#> 1 PR-1 Computer
#> 2 PR-2 Tablet
#> 3 PR-3 Phone
Here we observe that “PR-9” is missing, as it is not present in the orders table.
9.4.2 anti_join(x,y)
This filtering function will keep all the observations in x that do not have a
match in y. Again, this is a filtering join, and therefore only observations from
the first tibble are returned. The function can be applied to our example, and
yields the following results.
a_j1 <- dplyr::anti_join(orders,products,by="ProductID")
a_j1
#> # A tibble: 1 x 3
#> OrderID Quantity ProductID
#> <chr> <dbl> <chr>
#> 1 OR-4 2 PR-4
This result confirms that, in the orders tibble, the sole product that is not
represented in products is "PR-4". We can also explore the result for products
to see which product is not linked to an order.
a_j2 <- dplyr::anti_join(products,orders,by="ProductID")
a_j2
#> # A tibble: 1 x 2
#> ProductID Type
#> <chr> <chr>
#> 1 PR-9 Headphones
The result confirms that "PR-9" is the only product not linked to an order.
9.5 Tidy data
To explore the idea of tidy data, consider Figure 9.2, which shows two tables, each
of which stores the same information in different rectangular formats. The
tables contain synthetic information for the exam results of two students (“ID1”
and “ID2”), across three subjects (“CX101”,“CX102”, and “CX103”).
Even though the same information is present in both tables, we can observe
that the structure of the tables differs, and this has an impact on how we
can analyze the data using the tools of the tidyverse. We term the format on
the left untidy data, because each column is not a variable. Specifically, the
columns “CX101”, “CX102”, and “CX103” are subject names, and therefore
are what we call instances of the variable “Subject”. Also, in this case, each
row contains three observations of the same variable.
If we could find a way to transform the table on the left to that on the right,
we would now have a tidy data format. In this table, there are now three
variables: ID, Subject, and Grade, and notice that all of the grades are present,
but that each column is a variable, and each row is an observation.
That is not to say that the untidy format is not useful, as it displays the data
in a more readable format, for example, to show on a report or a presentation.
However, for the process of data analysis using the tidyverse, the format on
the right is much more effective, as it supports direct manipulation using the
tools provided by ggplot2 and dplyr.
We will now re-create the untidy data in R using atomic vectors and the
tibble() function. We include the library tidyr, as this contains functions to
transform the data to tidy format.
library(dplyr)
library(tidyr)
set.seed(100)
N = 2
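As a minimal sketch, the untidy tibble res can also be created directly with the tibble() function; the grade values below match the pivoted output shown later in this section.
# Reconstruction of the untidy tibble: one row per student,
# one column per subject.
res <- tibble(ID=c("ID1","ID2"),
              CX101=c(39L,67L),
              CX102=c(64L,53L),
              CX103=c(66L,65L))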
Given that the variable res represents the data in untidy format, we will now
explore how this can be transformed using the tidyr function pivot_longer(),
which takes the following arguments:
• data, the tibble on which to perform the pivot action.
• cols, the columns to pivot into longer format.
• names_to, a character vector specifying the new column to create, where the
values of this column will be the column names specified in the argument
cols.
• values_to, a character vector specifying the column name where the data will
be stored.
With these arguments, we can now convert the tibble to a longer format.
res_l <- tidyr::pivot_longer(res,
`CX101`:`CX103`,
names_to="Subject",
values_to="Grade")
res_l
#> # A tibble: 6 x 3
#> ID Subject Grade
#> <chr> <chr> <int>
#> 1 ID1 CX101 39
#> 2 ID1 CX102 64
#> 3 ID1 CX103 66
#> 4 ID2 CX101 67
#> 5 ID2 CX102 53
#> 6 ID2 CX103 65
The advantage of this transformation to tidy data can be observed when using
other dplyr functions such as summarize(). For example, it is now straight-
forward to process the new tibble to calculate the average, minimum, and
maximum grade for each student.
res_l %>%
dplyr::group_by(ID) %>%
dplyr::summarize(AvrGrade=mean(Grade),
MinGrade=min(Grade),
MaxGrade=max(Grade))
#> # A tibble: 2 x 4
#> ID AvrGrade MinGrade MaxGrade
#> <chr> <dbl> <int> <int>
#> 1 ID1 56.3 39 66
#> 2 ID2 61.7 53 67
The inverse operation, tidyr::pivot_wider(), can be used to transform res_l back into the original wider tibble we generated (res).
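A minimal sketch of this inverse operation is shown below.
# Widen res_l: subject names become columns, grades become their values.
res_w <- tidyr::pivot_wider(res_l,
                            names_from="Subject",
                            values_from="Grade")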
More complex operations can be performed with both
pivot_longer() and pivot_wider(), and it is worth checking out the options by
viewing the function documentation from tidyr, for example, ?pivot_longer.
9.8 Mini-case: Exploring correlations with wind energy generation
This mini-case explores the correlation between recorded wind speeds and the wind energy generated across the electricity grid. The overall approach is as follows.
• The two tibbles we are seeking to join share common variables related to
the observation times. For example, eirgrid17 and observations both have
columns for year, month, day, and hour.
• However, the electricity grid data contained in eirgrid17 further divides an
hour into four equal observations of 15 minutes each. In order to ensure that
this data is fully aligned with the weather observations, we use dplyr to get
the mean wind energy generated for every hour. This information is stored
in the tibble eirgrid17_h.
• Following this, we perform a mutating join between the tibbles eirgrid17_h
and observations and this generates a new tibble, weather_energy. This
tibble will contain the data needed for the correlation analysis.
We can now explore the code to perform this analysis. First, we include the
relevant libraries.
library(aimsir17)
library(dplyr)
library(ggplot2)
A quick check of the tibble eirgrid17 confirms the available data, and this
includes our main target variable IEWindGeneration.
dplyr::glimpse(eirgrid17)
#> Rows: 35,040
#> Columns: 15
#> $ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, ~
#> $ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ hour <int> 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, ~
#> $ minute <int> 0, 15, 30, 45, 0, 15, 30, 45, 0, 15,~
#> $ date <dttm> 2017-01-01 00:00:00, 2017-01-01 00:~
#> $ NIGeneration <dbl> 889.0, 922.2, 908.1, 918.8, 882.4, 8~
#> $ NIDemand <dbl> 775.9, 770.2, 761.2, 742.7, 749.2, 7~
#> $ NIWindAvailability <dbl> 175.1, 182.9, 169.8, 167.5, 174.1, 1~
#> $ NIWindGeneration <dbl> 198.2, 207.8, 193.1, 190.8, 195.8, 2~
#> $ IEGeneration <dbl> 3289, 3282, 3224, 3171, 3190, 3185, ~
#> $ IEDemand <dbl> 2921, 2884, 2806, 2719, 2683, 2650, ~
#> $ IEWindAvailability <dbl> 1064.8, 965.6, 915.4, 895.4, 1028.0,~
Next, we create the tibble eirgrid17_h, which contains the mean wind power
generated for each hour, where the mean is taken over the four 15-minute
observations within each hour. Note that we also ungroup the tibble following the call to
summarize().
eirgrid17_h <- eirgrid17 %>%
dplyr::group_by(year,month,day,hour) %>%
dplyr::summarize(IEWindGenerationH=
mean(IEWindGeneration,
na.rm=T)) %>%
dplyr::ungroup()
dplyr::slice(eirgrid17_h,1:4)
#> # A tibble: 4 x 5
#> year month day hour IEWindGenerationH
#> <dbl> <dbl> <int> <int> <dbl>
#> 1 2017 1 1 0 943.
#> 2 2017 1 1 1 1085.
#> 3 2017 1 1 2 1284.
#> 4 2017 1 1 3 1254.
It is often helpful to visualize elements of the dataset, and here we explore the
average hourly wind energy generated for New Year's Day in
2017. This is shown in Figure 9.5. We can see that the summarize function
has collapsed the four 15-minute values in each hour into a single mean observation.
ggplot(dplyr::filter(eirgrid17_h,month==1,day == 1),
aes(x=hour,y=IEWindGenerationH))+
geom_point()+
geom_line()
With the alignment now complete for four of the variables (year, month, day,
and hour), we use a mutating join to merge both tables. In this case, we want
to maintain all of the energy observations; therefore, a left join from eirgrid17_h
is used. The target tibble is a filtered form of observations, where we have
removed all records with missing wind speed values. For example, a number of
weather stations have no observations for wind speed, as can be seen from the
following code. (Here we introduce the dplyr function count(), which allows
you to quickly count the unique values in one or more variables).
observations %>%
dplyr::filter(is.na(wdsp)) %>%
dplyr::group_by(station) %>%
dplyr::count()
#> # A tibble: 5 x 2
The exclusion of missing values (stored in obs1) and the join operation (stored
in weather_energy) are now shown below. To simplify the presentation, only
those variables needed for the correlation analysis (wdsp and IEWindGeneration)
are retained.
obs1 <- observations %>%
dplyr::filter(!is.na(wdsp))
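The join step that creates weather_energy is sketched below; this is an assumption based on the surrounding description, using a left join on the four shared time variables, followed by selecting the variables retained in the printed output.
# Assumed join: keep all hourly energy observations, and attach the
# matching weather observations for each station.
weather_energy <- eirgrid17_h %>%
  dplyr::left_join(obs1,by=c("year","month","day","hour")) %>%
  dplyr::select(year,month,day,hour,IEWindGenerationH,station,wdsp)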
weather_energy
#> # A tibble: 201,430 x 7
#> year month day hour IEWindGenerationH station wdsp
#> <dbl> <dbl> <int> <int> <dbl> <chr> <dbl>
#> 1 2017 1 1 0 943. ATHENRY 8
#> 2 2017 1 1 0 943. BALLYHAISE 5
#> 3 2017 1 1 0 943. BELMULLET 13
#> 4 2017 1 1 0 943. CASEMENT 8
#> 5 2017 1 1 0 943. CLAREMORRIS 8
#> 6 2017 1 1 0 943. CORK AIRPORT 11
#> 7 2017 1 1 0 943. DUBLIN AIRPORT 12
#> 8 2017 1 1 0 943. DUNSANY 6
#> 9 2017 1 1 0 943. FINNER 12
#> 10 2017 1 1 0 943. GURTEEN 7
#> # ... with 201,420 more rows
With the new table created that will form the basis of our analysis, we can
visually explore the relationship, as can be seen in Figure 9.6. Because there
are many points in the full dataset, we simplify the visualization by sampling a
number of points from weather_energy, limiting the analysis to just one station
(Mace Head), and including a visualization of the linear model.
The function geom_jitter() will display overlapping values by
“shaking” the points so that an indication of the number of data points in the
general area of the graph is provided.
set.seed(100)
obs_sample <- weather_energy %>%
dplyr::filter(station %in% c("MACE HEAD")) %>%
dplyr::sample_n(300)
ggplot(obs_sample,aes(x=wdsp,y=IEWindGenerationH))+
geom_point()+
geom_jitter()+
geom_smooth(method="lm")
The fitted line shows a positive linear relationship between the
variables, where an increase in wind speed is associated with an increase in
wind energy generation. We can now analyze the strength of this correlation.
For this task, we can group the observations by station and then call the R
function cor() to calculate the correlation coefficient. Note that the default
method used is the Pearson method, and the calculated values are displayed
in descending order.
FIGURE 9.6 Relationship between wind speed and wind energy generated
corr_sum <- weather_energy %>%
dplyr::group_by(station) %>%
dplyr::summarize(Correlation=cor(wdsp,
IEWindGenerationH)) %>%
dplyr::arrange(desc(Correlation))
corr_sum
#> # A tibble: 23 x 2
#> station Correlation
#> <chr> <dbl>
#> 1 GURTEEN 0.806
#> 2 KNOCK AIRPORT 0.802
#> 3 MACE HEAD 0.800
#> 4 VALENTIA OBSERVATORY 0.778
#> 5 SHANNON AIRPORT 0.778
#> 6 MULLINGAR 0.771
#> 7 SherkinIsland 0.768
#> 8 CORK AIRPORT 0.768
#> 9 ROCHES POINT 0.767
#> 10 ATHENRY 0.764
#> # ... with 13 more rows
We can display the results in a bar chart shown in Figure 9.7. Note that
stat="identity" is used because the bar heights are taken directly from the data (the correlation values), rather than from counts. Interestingly, for
this sample, a strong association is shown between the two variables across all
the weather stations.
ggplot(corr_sum,aes(x=Correlation,y=station))+
geom_bar(stat="identity")
FIGURE 9.7 A comparison of the correlation values for each weather station
9.9 Summary of R functions from Chapter 9
Function Description
count() Counts the unique values in one or more variables.
inner_join(x,y) Maintains observations that appear in both x and y.
left_join(x,y) Keeps all observations in x, and joins with y.
right_join(x,y) Keeps all observations in y, and joins with x.
full_join(x,y) Keeps all observations in x and y.
semi_join(x,y) Keeps all rows from x that have a match in y.
anti_join(x,y) Keeps all rows from x that have no match in y.
pivot_longer() Tidy data operation to lengthen data.
pivot_wider() Inverse of pivot_longer().
9.10 Exercises
1. Based on the package nycflights13, which can be downloaded from
CRAN, generate the following tibble based on the first three records
from the tibble flights, and the airline name from airlines.
first_3a
#> # A tibble: 3 x 5
#> time_hour origin dest carrier name
#> <dttm> <chr> <chr> <chr> <chr>
#> 1 2013-01-01 05:00:00 EWR IAH UA United Air Lines Inc.
#> 2 2013-01-01 05:00:00 LGA IAH UA United Air Lines Inc.
#> 3 2013-01-01 05:00:00 JFK MIA AA American Airlines Inc.
first_3b
#> # A tibble: 3 x 5
#> time_hour origin dest name tzone
#> <dttm> <chr> <chr> <chr> <chr>
#> 1 2013-01-01 05:00:00 EWR IAH George Bush Intercontin~ Amer~
#> 2 2013-01-01 05:00:00 LGA IAH George Bush Intercontin~ Amer~
#> 3 2013-01-01 05:00:00 JFK MIA Miami Intl Amer~
dest_not_in_airports
#> [1] "BQN" "SJU" "STT" "PSE"
t4b_long
#> # A tibble: 6 x 3
#> country Year Population
#> <chr> <chr> <dbl>
#> 1 Afghanistan 1999 19987071
#> 2 Afghanistan 2000 20595360
#> 3 Brazil 1999 172006362
#> 4 Brazil 2000 174504898
#> 5 China 1999 1272915272
#> 6 China 2000 1280428583
sum_wide
#> # A tibble: 3 x 5
#> station Q1 Q2 Q3 Q4
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 DUBLIN AIRPORT 132. 140. 198. 192.
#> 2 MACE HEAD 266. 164. 308. 376.
#> 3 NEWPORT 462. 204. 513. 573.
10
Processing Data with purrr
10.1 Introduction
Earlier, in Section 4.6, we introduced functionals in R, and specifically, the
functional lapply() which was used to iterate over an atomic vector, list, or
data frame. The tidyverse package purrr provides a comprehensive set of
functions that can be used to iterate over data structures, and also integrate
with other elements of the tidyverse, for example, the package dplyr. This
chapter introduces functions from purrr, and upon completion, you should
understand:
• The function map(.x,.f), how it can be used to iterate over lists, and its
similarities and differences when compared to lapply().
• The syntax for function shortcuts that can be used in map().
• The specialized versions of map_*() which, instead of returning a list, will
return a specified data type.
• The utility of using dplyr::group_split() with map() to process groups of
data stored in tibbles.
• The advantage of using tidyr::nest() with map() to support a statistical
modelling workflow.
• The key functions that allow you to manipulate lists.
• How to solve all five test exercises.
Chapter structure
• Section 10.2 introduces map(), and shows how it can be used to process lists.
• Section 10.3 explores additional map functions that generate a range of
outputs, including atomic vectors and data frames.
• Section 10.4 shows how the functions map2() and pmap() can be used to
process two and more inputs.
• Section 10.5 shows how purrr can be integrated with functions from dplyr
and tidyr, and in doing so integrate tibbles with purrr to create a data
processing pipeline.
• Section 10.6 summarizes additional functions that can be used to process
lists.
• Section 10.7 summarizes a mini-case that highlights how purrr can be used
to support a statistical modelling workflow.
• Section 10.8 provides a summary of all the functions introduced in the
chapter.
• Section 10.9 provides a number of short coding exercises to test your under-
standing of purrr.
As a first example, we can use map() to return a list containing the
square of each element of an input atomic vector. Note that the length of the output list is
the same as the length of the input atomic vector.
library(purrr)
o1 <- purrr::map(c(1,2,3,2),function(x)x^2)
str(o1)
#> List of 4
#> $ : num 1
#> $ : num 4
#> $ : num 9
#> $ : num 4
The same output can be generated using purrr's formula shortcut, where the tilde operator defines an anonymous function.
o2 <- purrr::map(c(1,2,3,2),~.x^2)
str(o2)
#> List of 4
#> $ : num 1
#> $ : num 4
#> $ : num 9
#> $ : num 4
We can see how purrr deals with formulas by calling the utility function
purrr::as_mapper() with the formula as a parameter.
purrr::as_mapper(~.x^2)
#> <lambda>
#> function (..., .x = ..1, .y = ..2, . = ..1)
#> .x^2
#> attr(,"class")
#> [1] "rlang_lambda_function" "function"
As can be seen, this wrapper creates a full R function, and it captures the
function logic with the command .x^2. The first argument can be accessed
using .x or ., and a second parameter, which is used in functions such as
map2(), is accessed using .y. Further arguments can be accessed using
..1, ..2, ..3, and so on. Overall, you can use either a function or a
formula with map(), and for short functions the formula approach is
normally used.
• Third, the map functions can be used to select a list element by name, and
these are beneficial when working with lists that are deeply nested (Wickham,
2019). For example, in the sw_films list we explored earlier, passing in the
name of a list element will result in the extraction of the associated data, in
this case, the names of all the directors.
library(repurrrsive)
dirs <- sw_films %>% purrr::map("director") %>% unique()
str(dirs)
#> List of 4
#> $ : chr "George Lucas"
#> $ : chr "Richard Marquand"
#> $ : chr "Irvin Kershner"
#> $ : chr "J. J. Abrams"
purrr provides a set of additional functions that specify the result type. These
include:
• map_dbl(), which returns an atomic vector of type double.
• map_chr(), which returns an atomic vector of type character.
• map_lgl(), which returns an atomic vector of type logical.
• map_int(), which returns an atomic vector of type integer.
• map_df(), which returns a data frame or tibble.
10.3.1 map_dbl()
Here, we can process a number of columns from the data frame mtcars and
return the average of each column as an atomic vector. Note that because
a data frame is also a list, we can use it as an input to the map family of
functions.
library(dplyr)
library(purrr)
mtcars %>%
dplyr::select(mpg,cyl,disp) %>%
purrr::map_dbl(mean)
#> mpg cyl disp
#> 20.091 6.188 230.722
10.3.2 map_chr()
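A minimal illustration of map_chr(), reusing the mtcars columns from the previous example, formats each column mean as a character string.
mtcars %>%
  dplyr::select(mpg,cyl,disp) %>%
  purrr::map_chr(~as.character(round(mean(.x),2)))  # returns a character vector
#> mpg cyl disp
#> "20.09" "6.19" "230.72"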
10.3.3 map_lgl()
Here, we process a number of the columns in mpg to test whether each column
is numeric, using the anonymous function option.
library(ggplot2)
library(purrr)
library(dplyr)
mpg %>%
dplyr::select(manufacturer:cyl) %>%
purrr::map_lgl(function(x)is.numeric(x))
#> manufacturer model displ year cyl
#> FALSE FALSE TRUE TRUE TRUE
10.3.4 map_int()
In this example, we select a number of numeric columns from mpg, and then
use map_int() to count the number of observations that are greater than the
mean in each of the three columns. An atomic vector of integers is returned.
library(ggplot2)
library(dplyr)
library(purrr)
mpg %>%
dplyr::select(displ,cty,hwy) %>%
purrr::map_int(~sum(.x>mean(.x)))
#> displ cty hwy
#> 107 118 129
10.3.5 map_df()
The function map_df() creates a new data frame or tibble based on the input
list. A tibble is specified within the function, and as map_df() iterates through
the input list, rows will be added to the tibble with the specified values. In
this case, we generate a new tibble that extracts the Star Wars episode id,
movie title, director, and release date. Note that we have converted the release
date from a character vector to a date format.
library(repurrrsive)
library(purrr)
library(dplyr)
sw_films %>%
purrr::map_df(~tibble(ID=.x$episode_id,
Title=.x$title,
Director=.x$director,
ReleaseDate=as.Date(.x$release_date))) %>%
dplyr::arrange(ID)
#> # A tibble: 7 x 4
#> ID Title Director ReleaseDate
#> <int> <chr> <chr> <date>
10.4 Iterating over two inputs using map2() and pmap()
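The function map2(.x,.y,.f) iterates over two inputs in parallel, where .x and .y supply the first and second arguments of .f. A minimal illustration (not taken from the original examples) is shown below.
# Add corresponding elements of two vectors; map2_dbl() returns a double vector.
purrr::map2_dbl(c(1,2,3),c(10,20,30),~.x+.y)
#> [1] 11 22 33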
But what if we need three inputs to a processing task? For example, the
function rnorm() requires three arguments, the number of random numbers
to be generated (n), the mean (mean), and the standard deviation (sd). While
there is no function called map3(), we can use another purrr function known as
pmap(), which can take a list containing any number of arguments, and process
these elements within the function using the symbols ..1, ..2, ..3, and so on, which represent
the first, second, third, and subsequent arguments.
params <- list(means = c(10,20,30),
sds = c(2,4,7),
n = c(4,5,6))
purrr::pmap(params,
~rnorm(n = ..3,
mean = ..1,
sd = ..2)) %>%
str()
#> List of 3
#> $ : num [1:4] 12.68 10.27 5.06 11.31
#> $ : num [1:5] 20.1 18.5 21 14.9 20.2
#> $ : num [1:6] 28 25 34.4 35.6 36.7 ...
We can also see how pmap() can be used to generate a summary of values for
each row in a tibble.
Consider the variable grades, which shows synthetic data for six students, who
receive marks in three different subjects. Note that, similar to the map() family
of functions, pmap() also supports converting outputs to atomic vectors, so the
function pmap_chr() is used.
set.seed(100)
grades <- tibble(ID=paste0("S-",10:15),
Subject1=rnorm(6,70,10),
Subject2=rnorm(6,60,20),
Subject3=rnorm(6,50,15))
grades
#> # A tibble: 6 x 4
#> ID Subject1 Subject2 Subject3
#> <chr> <dbl> <dbl> <dbl>
#> 1 S-10 65.0 48.4 47.0
#> 2 S-11 71.3 74.3 61.1
#> 3 S-12 69.2 43.5 51.9
#> 4 S-13 78.9 52.8 49.6
#> 5 S-14 71.2 61.8 44.2
#> 6 S-15 73.2 61.9 57.7
We now want to add a new column, using the mutate() function, that summa-
rizes each student’s grades in terms of the maximum grade received. In this
case, we use the argument ..1 for the student ID, and the arguments ..2 to
..4 as input to the max() function.
grades1 <- grades %>%
dplyr::mutate(Summary=pmap_chr(grades,
~paste0("ID=",
..1,
" Max=",
round(max(..2,..3,..4),2))))
grades1
#> # A tibble: 6 x 5
#> ID Subject1 Subject2 Subject3 Summary
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 S-10 65.0 48.4 47.0 ID=S-10 Max=64.98
#> 2 S-11 71.3 74.3 61.1 ID=S-11 Max=74.29
#> 3 S-12 69.2 43.5 51.9 ID=S-12 Max=69.21
#> 4 S-13 78.9 52.8 49.6 ID=S-13 Max=78.87
#> 5 S-14 71.2 61.8 44.2 ID=S-14 Max=71.17
#> 6 S-15 73.2 61.9 57.7 ID=S-15 Max=73.19
A potential difficulty with this code is what might happen if the number of
subject results changed; for example, if we had just two subjects instead of
three.
To explore this scenario, we reduce the number of columns by setting Subject3
to NULL.
grades2 <- grades %>% dplyr::mutate(Subject3=NULL)
grades2
#> # A tibble: 6 x 3
#> ID Subject1 Subject2
#> <chr> <dbl> <dbl>
#> 1 S-10 65.0 48.4
#> 2 S-11 71.3 74.3
#> 3 S-12 69.2 43.5
#> 4 S-13 78.9 52.8
#> 5 S-14 71.2 61.8
#> 6 S-15 73.2 61.9
Next, we modify the shortcut function within pmap(). Notice that we only
explicitly reference the first argument ..1, and that the call list(...) can be
used to get the complete set of arguments.
We create a list of grades by excluding the first argument from the list (the
ID), and then flattening this to an atomic vector, before creating the desired
string output.
grades2 <- grades2 %>%
dplyr::mutate(Summary=pmap_chr(grades2,~{
args <- list(...)
grades <- unlist(args[-1])
paste0("ID=",..1," Max=",
round(max(grades),0))
}))
grades2
#> # A tibble: 6 x 4
#> ID Subject1 Subject2 Summary
#> <chr> <dbl> <dbl> <chr>
#> 1 S-10 65.0 48.4 ID=S-10 Max=65
#> 2 S-11 71.3 74.3 ID=S-11 Max=74
#> 3 S-12 69.2 43.5 ID=S-12 Max=69
#> 4 S-13 78.9 52.8 ID=S-13 Max=79
#> 5 S-14 71.2 61.8 ID=S-14 Max=71
#> 6 S-15 73.2 61.9 ID=S-15 Max=73
10.5 Integrating purrr with dplyr and tidyr to process tibbles
10.5.1 group_split()
This function, contained in the package dplyr, can be used to split a tibble
into a list of tibbles, based on groupings specified by group_by(). This list can
be processed using map().
To show how this works, we first define a sample tibble from mpg that will have
two different class values.
set.seed(100)
test <- mpg %>%
dplyr::select(manufacturer:displ,cty,class) %>%
dplyr::filter(class %in% c("compact","midsize")) %>%
dplyr::sample_n(5)
test
#> # A tibble: 5 x 5
#> manufacturer model displ cty class
#> <chr> <chr> <dbl> <int> <chr>
#> 1 volkswagen jetta 2 21 compact
#> 2 volkswagen jetta 2.5 21 compact
#> 3 chevrolet malibu 3.6 17 midsize
#> 4 volkswagen gti 2 21 compact
#> 5 audi a4 2 21 compact
Next, we can take this tibble, group it by class of car (“compact” and “midsize”),
and then call group_split().
test_s <- test %>%
dplyr::group_by(class) %>%
dplyr::group_split()
test_s
#> <list_of<
#> tbl_df<
#> manufacturer: character
#> model : character
#> displ : double
#> cty : integer
#> class : character
#> >
#> >[2]>
#> [[1]]
#> # A tibble: 4 x 5
#> manufacturer model displ cty class
#> <chr> <chr> <dbl> <int> <chr>
#> 1 volkswagen jetta 2 21 compact
#> 2 volkswagen jetta 2.5 21 compact
#> 3 volkswagen gti 2 21 compact
#> 4 audi a4 2 21 compact
#>
#> [[2]]
#> # A tibble: 1 x 5
#> manufacturer model displ cty class
#> <chr> <chr> <dbl> <int> <chr>
#> 1 chevrolet malibu 3.6 17 midsize
The result is a list of tibbles, stored in the variable test_s. Note that each
tibble contains all of the columns and rows for that particular class of vehicle.
We can now show how the output from group_split() integrates with map_int()
to show the number of rows in each new tibble.
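A minimal sketch of this step is shown below; the counts follow from the two split tibbles printed above.
# Count the rows in each tibble returned by group_split().
test_s %>% purrr::map_int(nrow)
#> [1] 4 1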
Our goal is to calculate the correlation coefficient between two variables: mean
sea level pressure and average wind speed. We simplify the dataset to daily
values, where we take (1) the maximum wind speed (wdsp) recorded and (2)
the average mean sea level pressure (msl). Our first task is to use dplyr to
generate a summary tibble, and we also exclude any cases that have missing
values, by using complete.cases() within filter(). Note that the function
complete.cases() returns a logical vector indicating which rows are complete.
The new tibble is stored in the variable d_data.
d_data <- observations %>%
dplyr::filter(complete.cases(observations)) %>%
dplyr::group_by(station,month,day) %>%
dplyr::summarize(MaxWdsp=max(wdsp,na.rm=TRUE),
DailyAverageMSL=mean(msl,na.rm=TRUE)) %>%
dplyr::ungroup()
d_data
#> # A tibble: 8,394 x 5
#> station month day MaxWdsp DailyAverageMSL
#> <chr> <dbl> <int> <dbl> <dbl>
#> 1 ATHENRY 1 1 12 1027.
#> 2 ATHENRY 1 2 8 1035.
#> 3 ATHENRY 1 3 6 1032.
#> 4 ATHENRY 1 4 4 1030.
#> 5 ATHENRY 1 5 9 1029.
#> 6 ATHENRY 1 6 9 1028.
#> 7 ATHENRY 1 7 6 1032.
#> 8 ATHENRY 1 8 9 1029.
#> 9 ATHENRY 1 9 16 1015.
#> 10 ATHENRY 1 10 13 1013.
#> # ... with 8,384 more rows
With this daily summary of data, we can now perform the correlation analysis.
A key aspect of this is the use of the function group_split() which creates a
list, where each list element contains a tibble for an individual station. Note
the use of the group_by() call before we split the tibble.
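The pipeline that produces cor7 is sketched below; the exact code is an assumption, reconstructed from the description of map_df() and first() in the next paragraph.
cor7 <- d_data %>%
  dplyr::group_by(station) %>%
  dplyr::group_split() %>%                 # a list with one tibble per station
  purrr::map_df(~tibble(Station=dplyr::first(.x$station),
                        CorrCoeff=cor(.x$MaxWdsp,.x$DailyAverageMSL))) %>%
  dplyr::arrange(CorrCoeff) %>%            # strongest negative correlations first
  dplyr::slice(1:7)                        # keep the top seven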
In this example, the function map_df() is used to format the output. A tibble
is created with two columns, one for the weather station, and the second for
the calculated correlation coefficient. The weather station is extracted using
the function first(), which takes the first value in the tibble, given that all
the entries for station within each group will be the same. The top seven are
shown, and this indicates a moderate level of negative correlation between the
variables for each weather station.
cor7
#> # A tibble: 7 x 2
#> Station CorrCoeff
#> <chr> <dbl>
#> 1 SherkinIsland -0.589
#> 2 VALENTIA OBSERVATORY -0.579
#> 3 ROCHES POINT -0.540
#> 4 MACE HEAD -0.539
#> 5 MOORE PARK -0.528
#> 6 SHANNON AIRPORT -0.524
#> 7 CORK AIRPORT -0.522
Note that in this particular example, we could also have used dplyr to calculate
the result, and we can see what this code looks like. This is a familiar theme
in R, where you may often find more than one way to perform the same task.
cor7_b <- d_data %>%
dplyr::group_by(station) %>%
dplyr::summarize(CorrCoeff=cor(MaxWdsp,DailyAverageMSL)) %>%
dplyr::arrange(CorrCoeff) %>%
dplyr::slice(1:7)
cor7_b
#> # A tibble: 7 x 2
#> station CorrCoeff
#> <chr> <dbl>
#> 1 SherkinIsland -0.589
#> 2 VALENTIA OBSERVATORY -0.579
In our next example, we will see another way of supporting a statistical
workflow with purrr, using a different approach, but one that has some useful
additional features.
10.5.2 nest()
The function nest(), which is part of the package tidyr, can be used to create
a list column within a tibble that contains a data frame. Nesting generates one
row for each defined group, which is identified using the function group_by().
The second column is named data, and is a list, and each list element contains
all of the tibble’s data for a particular group.
We can observe how nest() works by taking a look at the weather example from
the previous section.
data_n <- d_data %>%
dplyr::group_by(station) %>%
tidyr::nest()
Here, the tibble data_n contains two columns, with a row for each weather
station (the first six rows are shown here). All of the data for each weather
station is stored in the respective cell in the column data. For example, we
can explore the data for “ATHENRY” with the following code. Note that the
function first() is a wrapper around the list operator [[ that returns the first
value in a list (the corresponding function last() returns the last list value).
data_n %>%
dplyr::pull(data) %>%
dplyr::first()
Given that the new column created by nest() is a list, it can now be processed
using map(), and, interestingly, we can use the mutate operation to store the
results in a new column of the nested tibble. In particular, reverting to the
weather example, we can run a linear regression model on each station’s dataset,
and store the results in a new column. In the linear regression model, the
dependent variable is the maximum wind speed (MaxWdsp) and the independent
variable is the average atmospheric pressure (DailyAverageMSL).
data_n <- data_n %>%
dplyr::mutate(LM=map(data,
~lm(MaxWdsp~DailyAverageMSL,
data=.)))
data_n %>%
head()
#> # A tibble: 6 x 3
#> # Groups: station [6]
#> station data LM
#> <chr> <list> <list>
#> 1 ATHENRY <tibble [365 x 4]> <lm>
#> 2 BALLYHAISE <tibble [365 x 4]> <lm>
#> 3 BELMULLET <tibble [365 x 4]> <lm>
#> 4 CASEMENT <tibble [365 x 4]> <lm>
#> 5 CLAREMORRIS <tibble [365 x 4]> <lm>
#> 6 CORK AIRPORT <tibble [365 x 4]> <lm>
The tibble now contains the linear regression model result for each weather
station, and therefore we have a complete set of model results. Given that
the column LM is a list, we can extract any of the elements, for example, the
summary of the model results for the weather station “BELMULLET”.
data_n %>%
dplyr::filter(station=="BELMULLET") %>%
dplyr::pull(LM) %>%
dplyr::first() %>%
summary()
#>
#> Call:
#> lm(formula = MaxWdsp ~ DailyAverageMSL, data = .)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -14.021 -4.069 -0.516 3.958 17.962
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 242.786 26.365 9.21 <2e-16 ***
#> DailyAverageMSL -0.222 0.026 -8.53 4e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.8 on 363 degrees of freedom
#> Multiple R-squared: 0.167, Adjusted R-squared: 0.165
#> F-statistic: 72.8 on 1 and 363 DF, p-value: 4.03e-16
To continue our exploration of dplyr and purrr, we can now add a new column
RSq that contains the R2 value for each model.
data_n <- data_n %>%
dplyr::mutate(RSq=map_dbl(LM,~summary(.x)$r.squared)) %>%
dplyr::arrange(desc(RSq))
data_n <- data_n %>% head(n=3)
data_n
#> # A tibble: 3 x 4
#> # Groups: station [3]
#> station data LM RSq
#> <chr> <list> <list> <dbl>
#> 1 SherkinIsland <tibble [365 x 4]> <lm> 0.347
#> 2 VALENTIA OBSERVATORY <tibble [365 x 4]> <lm> 0.335
#> 3 ROCHES POINT <tibble [365 x 4]> <lm> 0.291
This shows the three stations with the highest r-squared values. While an
in-depth discussion of the significance of these is outside the scope of this
book, the key point is that we have generated statistical measures via a data
processing pipeline that is made possible by using the tools of purrr, dplyr,
and tidyr.
We can now do one final activity on the data by revisiting the original data for
these three stations, and then plotting this on a graph, shown in Figure 10.2.
This code makes use of the geom_smooth() function to show the linear models.
data1 <- d_data %>%
dplyr::filter(station %in% dplyr::pull(data_n,station))
ggplot(data1,aes(x=DailyAverageMSL,y=MaxWdsp,color=station))+
geom_point()+geom_smooth(method="lm")+geom_jitter()
10.6 Additional purrr functions
10.6.1 pluck()
library(ggplot2)
library(repurrrsive)
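A minimal illustration of pluck(), which extracts a list element by position and name, is shown below, assuming the sw_films list from repurrrsive.
# Extract element 1 of sw_films, and then its "title" component.
purrr::pluck(sw_films,1,"title")
#> [1] "A New Hope"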
10.6.2 walk()
The function walk(.x,.f) is similar to map, except that it returns the input .x
and calls the function .f to generate a side effect. The side effect, for example,
could be displaying information onto the screen, and no output value needs to
be returned. Here is a short example.
l <- list(el1=20,el2=30,el3=40)
o <- purrr::walk(l,~cat("Creating a side effect...\n"))
#> Creating a side effect...
#> Creating a side effect...
#> Creating a side effect...
str(o)
#> List of 3
#> $ el1: num 20
#> $ el2: num 30
#> $ el3: num 40
10.6.3 keep()
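A minimal illustration of keep(), reusing the list l from the walk() example above, retains only the elements whose value exceeds 25.
# keep() retains the elements for which the predicate returns TRUE.
purrr::keep(l,~.x>25)
#> $el2
#> [1] 30
#>
#> $el3
#> [1] 40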
10.6.4 invoke_map()
The function invoke_map() takes a list or vector of functions (here, the variable f) and a list of matching parameter lists (here, l), and calls each function with its corresponding parameters. As the warning below notes, it is deprecated in favor of combining map() and exec().
str(purrr::invoke_map(f,l))
#> Warning: `invoke_map()` was deprecated in purrr 1.0.0.
#> i Please use map() + exec() instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this
#> warning was generated.
#> List of 3
#> $ : int 1
#> $ : int 12
#> $ : int 10
geom_point()+geom_smooth(method="lm")+
facet_wrap(~class)
FIGURE 10.3 The relationship between displacement and city miles per
gallon
The data for each vehicle class can be extracted using the pull() function
(similar to $), followed by first(). We can view the first tibble, which is for
the class “compact”.
mpg1 %>%
dplyr::pull(data) %>%
dplyr::first()
#> # A tibble: 47 x 10
#> manufa~1 model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 audi a4 1.8 1999 4 auto~ f 18 29 p
#> 2 audi a4 1.8 1999 4 manu~ f 21 29 p
#> 3 audi a4 2 2008 4 manu~ f 20 31 p
#> 4 audi a4 2 2008 4 auto~ f 21 30 p
#> 5 audi a4 2.8 1999 6 auto~ f 16 26 p
#> 6 audi a4 2.8 1999 6 manu~ f 18 26 p
#> 7 audi a4 3.1 2008 6 auto~ f 18 27 p
#> 8 audi a4 q~ 1.8 1999 4 manu~ 4 18 26 p
#> 9 audi a4 q~ 1.8 1999 4 auto~ 4 16 25 p
#> 10 audi a4 q~ 2 2008 4 manu~ 4 20 28 p
#> # ... with 37 more rows, and abbreviated variable name
#> # 1: manufacturer
Using mutate() and map(), we add a new column to mpg1 that will store the
results of each linear regression model.
mpg1 <- mpg1 %>%
dplyr::mutate(LM=map(data,~lm(cty~displ,data=.x)))
mpg1
#> # A tibble: 7 x 3
#> # Groups: class [7]
#> class data LM
#> <chr> <list> <list>
#> 1 compact <tibble [47 x 10]> <lm>
#> 2 midsize <tibble [41 x 10]> <lm>
#> 3 suv <tibble [62 x 10]> <lm>
#> 4 2seater <tibble [5 x 10]> <lm>
The map function operates on the data column of mpg1, which is a list, and
therefore is a suitable input for map(). The shorter notation for a function
is used, and the .x value will represent the tibble for each class of car. The
independent variable displ is selected, along with the dependent variable cty,
and the function lm is called, which returns an “lm” object. This object is then
returned for every iteration through the list and is stored in a new column LM.
Therefore, the tibble mpg1 now has the full set of modelling results for each
class of car, and this can then be used for additional processing.
We can explore the linear models by operating on the tibble; for example, the
following command shows a summary of the model for class “suv”.
mpg1 %>%
dplyr::filter(class=="suv") %>% # Select row where class == suv"
dplyr::pull(LM) %>% # Get the column "LM"
dplyr::first() %>% # Extract the lm object
summary() # Call the summary function
#>
#> Call:
#> lm(formula = cty ~ displ, data = .x)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.087 -1.027 -0.087 1.096 3.967
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 21.060 0.893 23.6 < 2e-16 ***
#> displ -1.696 0.195 -8.7 3.2e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.62 on 60 degrees of freedom
#> Multiple R-squared: 0.558, Adjusted R-squared: 0.55
#> F-statistic: 75.7 on 1 and 60 DF, p-value: 3.17e-12
To conclude our look at a data modelling workflow, we can also store custom
plots in the output tibble, so that they can be extracted at a later stage. To
visualize the plots we will use the R package ggpubr, which contains a valuable
function called ggarrange() that allows you to plot a list of ggplot objects.
Before we complete this task, we reiterate the point that a ggplot can be stored
in a variable and is actually a list which belongs to the S3 class “gg”, which
inherits from the S3 class “ggplot”.
p <- ggplot(mpg,aes(x=displ,y=cty))+geom_point()
typeof(p)
#> [1] "list"
names(p)
#> [1] "data" "layers" "scales" "mapping"
#> [5] "theme" "coordinates" "facet" "plot_env"
#> [9] "labels"
class(p)
#> [1] "gg" "ggplot"
This means that we can store all the plots in a tibble column and create this
column using a combination of mutate() and map2(). Note that the column
Plots in mpg1 will contain a list of the seven ggplots.
library(randomcoloR) # Functions to generate different colors
set.seed(100) # For random color generation
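The code that builds the Plots column is sketched below; it combines mutate() and map2() over the data and LM columns, though the plot titles and the use of randomColor() are assumptions based on the surrounding description (the original plots also include the class name).
mpg1 <- mpg1 %>%
  dplyr::mutate(Plots=map2(data,LM,~{
    cf <- coef(.y)                              # intercept and slope of the lm object
    ggplot(.x,aes(x=displ,y=cty))+
      geom_point(color=randomColor())+          # a random color for each class
      geom_abline(intercept=cf[1],slope=cf[2])+ # the line of best fit
      ggtitle(paste0("N=",nrow(.x),
                     " Slope=",round(cf[2],2)))
  }))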
The new column on the mpg1 tibble can now be viewed, and this column
contains all the information needed to plot each graph.
mpg1
#> # A tibble: 7 x 5
#> # Groups: class [7]
#> class data LM RSquared Plots
#> <chr> <list> <list> <dbl> <list>
#> 1 suv <tibble [62 x 10]> <lm> 0.558 <gg>
#> 2 subcompact <tibble [35 x 10]> <lm> 0.527 <gg>
#> 3 pickup <tibble [33 x 10]> <lm> 0.525 <gg>
#> 4 compact <tibble [47 x 10]> <lm> 0.358 <gg>
#> 5 midsize <tibble [41 x 10]> <lm> 0.339 <gg>
#> 6 2seater <tibble [5 x 10]> <lm> 0.130 <gg>
#> 7 minivan <tibble [11 x 10]> <lm> 0.124 <gg>
To display the plots, we use the function ggarrange(). This can take a list
of plots as an input (plotlist), and the column Plots is passed through as
the data for this argument. Figure 10.4 displays the purpose-built plot that
encompasses: (1) the class of car, the number of observations, and the regression
coefficients, and (2) the line of best fit, which was created using the ggplot2
function geom_abline().
p1 <- ggarrange(plotlist=pull(mpg1,Plots))
p1
In summary, this example shows the utility of combining the tools of purrr,
dplyr, and tidyr in a workflow, specifically:
• From the original tibble mpg, a new nested tibble (mpg1) was created that
stored the data for each class in a single cell. The nest() function collapses
the tibble into one row per group identified.
• Combining mutate() and map(), a new column was created that contained
the result of a linear model where displ was the independent variable and cty
the dependent variable. This result was essentially the “lm” object returned
by the lm() function.
• The LM object was then used to create a new column showing the R2 value
for each model.
• Finally, the original two columns in mpg1 were used to generate special
purpose ggplot graphs for each regression task, with informative titles added
to each plot.
10.8 Summary of R functions from Chapter 10
Function Description
complete.cases() Identifies cases with no missing values.
map() Applies a function to each input, returns a list.
map_dbl() Applies a function to each input, returns a double.
map_chr() Applies a function to each input, returns a character.
map_lgl() Applies a function to each input, returns a logical.
map_int() Applies a function to each input, returns an integer.
map_df() A map operation that returns a tibble.
map2() A map operation that takes two input arguments.
pmap() A map operation that takes many input arguments.
group_split() Splits a data frame into a list of data frames.
nest() Creates a list column of tibbles, based on group_by().
pluck() Implements a form of the [[ operator for vectors.
walk() Returns the input, enables side effects.
keep() Keeps elements that are evaluated as TRUE.
ggarrange() Arranges multiple ggplots in the same page (ggpubr).
10.9 Exercises
1. Create the following tibble with the three columns shown, using
the functions keep() and map_df(), in order to provide a tabular
view of the list repurrrsive::sw_vehicles. Note that possible invalid
values of length in sw_vehicles include “unknown”, and any of these
should be removed prior to creating the data frame.
#> # A tibble: 6 x 3
#> Name Model Length
#> <chr> <chr> <dbl>
#> 1 C-9979 landing craft C-9979 landing craft 210
#> 2 SPHA Self-Propelled Heavy Artillery 140
#> 3 Clone turbo tank HAVw A6 Juggernaut 49.4
#> 4 Sand Crawler Digger Crawler 36.8
#> 5 Multi-Troop Transport Multi-Troop Transport 31
#> 6 Sail barge Modified Luxury Sail Barge 30
set.seed(1000)
params <- list(
list(n=3,mean=10,sd=2),
list(n=4,min=10,max=15),
list(n=5,min=0,max=1)
)
f <- c("rnorm","runif","runif")
3. Generate the following daily summaries of rainfall and mean sea level
pressure, for all the weather stations in aimsir17::observations, and
only consider observations with no missing values.
#> `summarize()` has grouped output by 'station', 'month'. You can
#> override using the `.groups` argument.
d_sum
#> # A tibble: 8,394 x 5
#> station month day TotalRain AvrMSL
#> <chr> <dbl> <int> <dbl> <dbl>
#> 1 ATHENRY 1 1 0.2 1027.
#> 2 ATHENRY 1 2 0 1035.
#> 3 ATHENRY 1 3 0 1032.
#> 4 ATHENRY 1 4 0 1030.
#> 5 ATHENRY 1 5 0.1 1029.
#> 6 ATHENRY 1 6 18 1028.
#> 7 ATHENRY 1 7 1.4 1032.
#> 8 ATHENRY 1 8 1.2 1029.
#> 9 ATHENRY 1 9 5.4 1015.
#> 10 ATHENRY 1 10 0.7 1013.
#> # ... with 8,384 more rows
Next, using the tibble d_sum as input, generate the top 6 correlations between
TotalRain and AvrMSL using group_split() and map_df(). Here are the results
you should find.
cors
#> # A tibble: 6 x 2
#> Station Corr
#> <chr> <dbl>
#> 1 MOORE PARK -0.496
#> 2 MULLINGAR -0.496
my_map_dbl1(1:3,function(x)x+10)
#> [1] 11 12 13
Here is the sample output from my_map_dbl2() which, as expected, is the same
as the output from my_map_dbl1().
my_map_dbl2(1:3,function(x)x+10)
#> [1] 11 12 13
e17
#> # A tibble: 365 x 5
#> month day Date Day AvrDemand
#> <dbl> <int> <dttm> <ord> <dbl>
#> 1 1 1 2017-01-01 00:00:00 Sun 2818.
#> 2 1 2 2017-01-02 00:00:00 Mon 3026.
#> 3 1 3 2017-01-03 00:00:00 Tue 3516.
#> 4 1 4 2017-01-04 00:00:00 Wed 3579.
#> 5 1 5 2017-01-05 00:00:00 Thu 3563.
#> 6 1 6 2017-01-06 00:00:00 Fri 3462.
#> 7 1 7 2017-01-07 00:00:00 Sat 3155.
#> 8 1 8 2017-01-08 00:00:00 Sun 2986.
11
Shiny
11.1 Introduction
The Shiny system in R provides an innovative framework where R code can be
used to create interactive web pages. These pages can take inputs from users,
and then process these to generate a range of dynamic outputs, including text
summaries, tables, and visualizations. The web page components are generated
from standard R outputs, such as tibbles and plots. The R package shiny
enables you to easily define a web page with inputs and outputs, and then
write a special server function that can react to changes in inputs, and render
these changes to web page outputs. While Shiny is a highly detailed topic in
itself, in this chapter the emphasis is to provide an initial overview through a
set of six examples which illustrate how Shiny works. Upon completion of this
chapter, you should understand:
• The main concept behind reactive programming, the parallels with
spreadsheets, and how Shiny generates interactive content.
• The three aspects of a Shiny program: the ui variable that defines the
interface using fluidPage(), the server() function that controls the behavior
of the page, and the call to shinyApp() which creates the Shiny app objects.
• How to use the input Shiny controls: numericInput(), selectInput(), and
checkboxInput().
• How to access input controls within the server() function, and how to
provide information for output sections of the web page, using the render*()
family of functions.
• How to send data to textOutput() using the renderText() function.
• How to send data to verbatimTextOutput() using the renderPrint() function.
• How to send tibble information to tableOutput() using the renderTable()
function.
• How to send graphical information to plotOutput() using the renderPlot()
function.
Chapter structure
• Section 11.2 introduces the idea of reactive programming, using the example
of a spreadsheet.
• Section 11.3 presents a simple Shiny example, which prints a welcome message
to a web page.
• Section 11.4 focuses on squaring an input number.
• Section 11.5 demonstrates how to return a tibble summary based on a selected
input value.
• Section 11.6 allows the user to compare summary results from two weather
stations from the package aimsir17.
• Section 11.7 shows how a scatterplot can be configured from a selection of
variables.
• Section 11.8 presents the idea of a reactive expression, and how these can be
used to improve code efficiency.
• Section 11.9 provides a summary of all the functions introduced in the
chapter.
• Section 11.10 provides short coding exercises to test your understanding of
Shiny.
The package shiny provides the functions that allow us to specify a basic web page, without having to
know any web programming or scripting languages, such as HTML, CSS, or
JavaScript.
library(shiny)
With Shiny, two objects are normally declared. The first defines the appearance of the
app, and is named ui; here we call the function fluidPage() to
create the web page, and provide a layout. A fluid page essentially contains
a sequence of rows, and this example has one row, which returns the string
"Hello, Shiny!". The function message() is used to display messages in the
developer's console, and it ensures that, as developers, we can see detailed
messages of what functions are being called.
ui <- fluidPage(
message("Starting the UI..."),
"Hello, Shiny!"
)
The second object required for a Shiny app is a user-defined function, usually
named server, which takes in three parameters, two of which we use in the
remaining examples.
• input, which provides direct access to an input control object defined in the
function fluidPage().
• output, which provides access to an output component that has been defined
in the function fluidPage().
Therefore, the server function makes things happen and allows you to write
code to react to user input and generate output. In essence, the server function
allows you to make an app interactive. Our first server is simple enough, and
no actions are taken to read inputs or send back outputs.
server <- function(input, output, session){
message("\nStarting the server...\n")
}
The final task is to call the function shinyApp() which takes in the user interface
and server objects, and creates the web page. It performs work behind the
scenes to make this happen. This is a benefit of Shiny, as you can focus on
defining inputs and outputs, and then implement actions in the server function
to ensure interactivity.
shinyApp(ui, server)
By clicking on the button "Run App" on the RStudio interface, the app will
be launched, and something similar to the output in Figure 11.2 should
appear.
FIGURE 11.2 The web page produced for the Hello Shiny code
11.4 Example two: Squaring an input number
For this example, we create our user interface object with a call to fluidPage(), and within
this function we call two Shiny functions to capture input and display output:
• numericInput(), which will take in a number from the user, where the default
value is 10. We add an identifier for the numeric input, which in this case is
"input_num". This will be used by the server code, so that the server can
then access the value entered by the user. The message displayed to the user
is also set (“Enter Number”).
• textOutput(), which will be used to render a reactive output variable within
the application page. This is given the identifier "msg", so that when we write the
server code we can re-direct output to this element of the web page by using
the value "msg".
So in effect, this code specifies two rows for the web page. The first row,
numericInput(), will allow the user to provide an input. The second row,
textOutput(), will render the server function's output variable as text within the
application page.
ui <- fluidPage(
numericInput("input_num",
"Enter Number",
10),
textOutput("msg")
)
Next, we implement our first server function. There are two aspects to consider:
• First, the argument output is used to access an output component in the web
page, and in this case, it is output$msg, which references the textOutput()
part of the web page that was defined in the function call to fluidPage().
• Second, for each output function defined in the user interface, there is a
corresponding function that needs to be defined in the server, and this
function will render the information to its correct format. For the function
textOutput(), the appropriate render function used is renderText(), and,
just as with any function, the evaluated expression is returned. In this case,
the function glue() is used to assemble the final output, which is in string
format.
server <- function(input, output, session){
output$msg<- renderText({
message(glue("Input is {input$input_num}"))
ans <- input$input_num^2
glue("The square of {input$input_num} is {ans}\n")
})
}
Therefore, stepping through the code within server() we can see that the vari-
able output$msg is assigned its value within the function call to renderText().
Inside this function, there is one main processing step, where the variable
input$input_num is squared, and the result stored in ans. This variable ans is
then used to configure the output string through the function glue(), where
curly braces are used to identify variables.
With the code completed, we define the two components of the app, and run
the code.
shinyApp(ui, server)
The resulting web page is shown in Figure 11.3, and it shows the numeric
input component (with a default value of 10), and then the output, which is
configured by the call to the function glue() in the server.
Interestingly, in the figure, we also present a reactive graph, which echoes the
spreadsheet example from earlier and shows the relationship between the input
and output. Essentially, any time an input changes, the output will then change
(via the server code). Shiny ensures that all the correct work is accomplished
behind the scenes to make this reactive programming process possible.
11.5 Example three: Exploring a weather station
Next, we create the list of six stations so that these can be added in the new
input object, which is a selection list. The data processing pipeline is shown
below, and the function pull() is used to convert a one-column tibble to a
vector. The unique values are returned, which would include all 25 stations,
and the function head() will then select the first six.
stations6 <- observations %>%
dplyr::select(station) %>%
dplyr::pull() %>%
unique() %>%
head()
stations6
#> [1] "ATHENRY" "BALLYHAISE" "BELMULLET" "CASEMENT"
#> [5] "CLAREMORRIS" "CORK AIRPORT"
The server function can now be written. In this case, as we are only targeting
one output, and this output is named “summary”, we allocate the result to
the variable output$summary. The appropriate render function for
verbatimTextOutput() is renderPrint(), and inside the render function call we add
the following R code:
• Using dplyr::filter(), we filter the observations for the station selected by
the user, and we access this value using the variable input$station.
• Six columns are selected, just so that they will fit on one line of output.
• The R function summary() is called, and its value returned as output.
server <- function(input, output, session){
  output$summary <- renderPrint({
    obs <- observations %>%
      dplyr::filter(station == input$station) %>%
      # Six columns are selected below; the exact choice is an assumption.
      dplyr::select(station,date,rain,temp,rhum,msl)
    summary(obs)
  })
}
We then call the shinyApp() function to specify the user interface and server
objects.
shinyApp(ui, server)
The app is then run, and the web page is shown in Figure 11.4. It shows the
title panel, which is followed by the selection input control. The default in the
selection list is the first item on the list of six. Finally, the output is shown,
which is generated inside the server() function. It produces output as would be
shown in the R console, and this shows the benefit of the verbatimTextOutput()
component for displaying R output.
We show the reactive graph, which captures the relationship between the input
variable component (“station”) and the output component (“summary”). This
diagram is informative, as it shows that our program will react to a change in
“station”, by changing the output “summary”.
11.6 Example four: Comparing two weather stations
For this example, the required libraries are loaded; tidyr is included, as we will need to widen
the data in order to present it in a user-friendly manner.
library(shiny)
library(aimsir17)
library(dplyr)
library(ggplot2)
library(glue)
library(tidyr)
Similar to the previous example, we prepare six weather stations for selection.
stations6 <- observations %>%
dplyr::select(station) %>%
dplyr::pull() %>%
unique() %>%
head()
The overall web design is specified, and this includes two selection inputs and
one table output, specified by the function tableOutput(). This will allow us
to display a tibble on our web page.
ui <- fluidPage(
message("\nStarting the UI..."),
titlePanel("Summarizing Monthly Rainfall"),
selectInput("station1",
label="Weather Station 1",
choices=stations6),
selectInput("station2",
label="Weather Station 2",
choices=stations6),
tableOutput("monthly")
)
For the server code, just one output is set, and the function renderTable()
is used to map back to the user interface, and will allow for the displaying
of matrices and data frames. There are a number of steps in generating the
required output.
A sketch of these steps is shown below; the pipeline that computes the monthly summary is an assumption, based on the app's title and the widened rainfall table it produces.
server <- function(input, output, session){
  output$monthly <- renderTable({
    s_rain <- observations %>%
      # Keep only the two stations chosen by the user (assumed step).
      dplyr::filter(station %in% c(input$station1,input$station2)) %>%
      dplyr::group_by(station,month) %>%
      dplyr::summarize(TotalRain=sum(rain,na.rm=TRUE)) %>%
      # Widen so that each month becomes a column, for readability.
      tidyr::pivot_wider(names_from=month,values_from=TotalRain)
    s_rain
  })
}
The app is then run, and an example of the output is shown in Figure 11.5. The
advantage of widening the data can be seen, as the table provides a readable
summary of the monthly rainfall data.
The reactive graph shows that any change in the two inputs (“station1” and
“station2”) will lead to a change in the output (“monthly”), where an updated
version of the monthly rainfall values will be generated.
11.7 Example five: Creating a scatterplot
In this example, the user can select two of the numeric variables in mpg, and plot these on a scatterplot. To start the example, the relevant
libraries are loaded.
library(shiny)
library(dplyr)
library(ggplot2)
library(glue)
Given that the tibble mpg is the target, we need to provide the user with a list
of valid variables that can be used for the scatterplot. Any variable that is a
numeric value can be included in this list (apart from the year), and we can
use the following expression to extract all numeric column names from mpg.
vars<- mpg %>%
dplyr::select(-year) %>%
dplyr::select(where(is.numeric)) %>%
colnames()
Next, we construct the user interface. Within the fluidPage() function, there
is an option to add a fluidRow(), which is a function that allows for the
configuration of a row.
In this case, we add three components to the first row, two selection lists
(“x_var” and “y_var”), and a checkbox component (“cb_smooth”) which is
used to specify whether a linear model visualization is required. There is one
output component specified by a call to the function plotOutput(), which can
then be a target of a renderPlot() call in the server function.
ui <- fluidPage(
titlePanel("Exploring variables in the dataset mpg"),
fluidRow(
column(2,selectInput("x_var",
label="X Variable",
choices=vars)),
column(2,selectInput("y_var",
label="Y Variable",
choices=vars)),
column(5,checkboxInput("cb_smooth",
"Show Linear Model",
value = TRUE))
),
plotOutput("plot")
)
The server function contains one function call, and this assigns the output of a
call to renderPlot() to the variable output$plot. Inside the function call, we
can observe the following sequence of code:
• The variables of interest (x_var and y_var) are extracted from two input
components (input$x_var and input$y_var).
• The variable p1 then stores the plot. Notice that we replace the usual call
to aes() with a call to aes_string(). This is a useful function in ggplot2
because we can pass in a string version of the variable and not the variable
itself, which provides flexibility about which variable we want to select for
the plot.
• If the logical value of the checkbox input$cb_smooth is TRUE, we will augment
the plot variable p1 with a call to geom_smooth().
• The variable p1 is returned, and this plot will then be displayed on the web
page through the output component plotOutput().
server <- function(input, output, session){
message("\nStarting the server...")
output$plot <- renderPlot({
x_var <- input$x_var
y_var <- input$y_var
message(glue("Smooth = {input$cb_smooth} \n"))
p1 <- ggplot(mpg,aes_string(x=x_var,y=y_var))+
geom_point()
if(input$cb_smooth == TRUE)
p1 <- p1 + geom_smooth(method="lm")
p1
})
}
We configure the app so that the user interface and server are connected.
shinyApp(ui, server)
The app is then run, and a sample of the output is shown in Figure 11.6. Notice
the difference the call to fluidRow() has made, as the three input controls are
all on one row of output, which results in a better usage of space. The output
plot shows displ plotted against hwy, and it also highlights the linear model.
Overall, an important feature of the Shiny plot is its flexibility, as the ggplot2
function aes_string() allows any of the variables to be plotted, and therefore
supports a wider range of user exploration.
FIGURE 11.6 The web page produced for the scatterplot example
The reactive graph shows that a change in any one of the three inputs (“x_var”,
“y_var”, or “cb_smooth”) will lead to the generation of a new plot, which
will be rendered to the web page. Therefore, even a simple action such as
clicking on the checkbox will trigger an update to the page. This is because the
code in the server that changes the variable output$plot directly references
the variable input$cb_smooth. Therefore, any change in input$cb_smooth will
trigger a response, or reaction, in output$plot.
11.8 Example six: Improving design by adding reactive expressions
In this example, we sample from two Poisson distributions and compare their
histograms, making use of reactive expressions to improve the app's design.
Next, we define the user interface. This consists of two rows. The first row
contains four variables that comprise the required inputs, namely the means
for the two Poisson distributions (λ1 and λ2 ), the number of samples to be
drawn from the distributions (N ), and the bin width of the histogram plot
(W ). The second row specifies the graphical output, and to create this output
we will make use of facet_wrap() from ggplot2.
ui <- fluidPage(
message("Starting the UI..."),
titlePanel("Adding reactive expressions"),
fluidRow(
column(3,
"Poisson One",
numericInput("lambda1",
label="Lambda1",
value=50,
min=1)),
column(3,
"Poisson Two",
numericInput("lambda2",
label="Lambda2",
value=100,
min=1)),
column(3,
"Total Samples",
numericInput("N",
label="N",
value=1000,
min=100)),
column(3,
"Binwidth",
numericInput("BW",
label="W",
value=3,
min=1))),
fluidRow(column(12,plotOutput("plot")))
)
The server function contains three main elements, and two of these are reactive
expressions which automatically cache their results, and only update when
their respective inputs change (Wickham, 2021).
• The first reactive expression is where we define the variable generate1 as the
output from a call to the function reactive(). This reactive expression uses
two Shiny inputs, the variables input$N and input$lambda1, and generates a
new tibble containing two variables, one for the distribution name (“Poisson
One”), and the other containing each random number that has been generated
via the function rpois().
• The second reactive expression is where we define the variable generate2 as
a reactive expression. This uses two Shiny inputs, the variables input$N and
input$lambda2, and, similar to generate1, creates a new tibble containing the
distribution name (“Poisson Two”) and the random number from rpois().
• The final part of the server code is where we define the variable output$plot.
This is where we can observe the impact of calling the two reactive expressions.
Here, the variable d1 contains the result from calling the reactive expression
generate1(), while the variable d2 contains the result returned by the reactive
expression generate2(). The results from d1 and d2 are combined into d3
using dplyr::bind_rows(), and the plot is returned showing both histograms,
generated using the function facet_wrap().
server <- function(input, output, session){
  message("\nStarting the server...")
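  # NOTE: the remainder of this listing is missing from the source;
  # the code below is a hedged reconstruction based on the bullet
  # descriptions above (it assumes dplyr and ggplot2 are loaded, as
  # in the earlier examples)
  generate1 <- reactive({
    tibble(Distribution="Poisson One",
           Value=rpois(input$N, input$lambda1))
  })
  generate2 <- reactive({
    tibble(Distribution="Poisson Two",
           Value=rpois(input$N, input$lambda2))
  })
  output$plot <- renderPlot({
    d1 <- generate1()
    d2 <- generate2()
    d3 <- dplyr::bind_rows(d1, d2)
    ggplot(d3, aes(x=Value)) +
      geom_histogram(binwidth=input$BW) +
      facet_wrap(~Distribution)
  })
}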
We configure the app so that the user interface and server are connected.
shinyApp(ui, server)
The web page inputs and outputs are shown in Figure 11.7. From a technical
perspective, it is worth re-emphasizing that the reactive expressions cache their
results, and only update these values when their inputs change. This means that
the server code in renderPlot() will only execute generate1() when the input
variables input$N or input$lambda1 change, and that generate2() will only
be reevaluated when either input$N or input$lambda2 is modified. Therefore,
if the bin width input$BW changes, the cached values from generate1() and
generate2() will be used, and no new data will be sampled.
The reactive graph structure highlights the role the reactive expressions play
behind the scenes of a Shiny web page. It shows that the plotting compo-
nent depends on inputs from generate1(), generate2(), and input$BW. The
two reactive expressions will cache their outputs, and return these if their
contributing inputs have not changed. If their inputs change, their outputs will
be re-calculated. When the code is run, the output messages will confirm this,
and will show which reactive expressions have been executed. The net benefit
is greater efficiency, as cached results are reused whenever the underlying
data has not changed.
This concludes the Shiny examples covered in the chapter, and the idea is to
provide the reader with the basics of how to create an interactive app. Notice
that in each example, only one output was considered, which is somewhat
unrealistic, given that an interactive web app will typically have more than
one output (e.g., plots, tables, and text). For the interested
reader, the textbook Mastering Shiny (Wickham, 2021) is recommended as a
comprehensive text on building apps in Shiny, in particular, with its focus on
mastering reactivity and best practice for implementation.
11.9 Summary of R functions from Chapter 11
Function Description
bind_rows() Binds any number of data frames by row (dplyr).
fluidPage() Creates fluid page layouts in Shiny.
glue() Formats a string (library glue).
shinyApp() Creates a Shiny app object from a UI/server pair.
message() Generates a simple diagnostic message.
plotOutput() Displays a renderPlot() result within a web page.
numericInput() Creates a Shiny input control for numeric values.
reactive() Creates a reactive expression.
renderTable() Creates a reactive table that can display tibbles.
renderText() Pastes output into a single string.
renderPlot() Renders a reactive plot for an output.
renderPrint() Prints the result of an expression.
selectInput() Creates a selection list for user input.
tableOutput() Creates a reactive table. Paired with renderTable().
textOutput() Renders output as text. Paired with renderText().
titlePanel() Creates a panel containing an application title.
verbatimTextOutput() Renders a reactive output variable as text.
11.10 Exercises
1. Draw a reactive graph from the following Shiny code.
library(shiny)
library(glue)
ui <- fluidPage(
numericInput("input_num1","Enter Number",10),
numericInput("input_num2","Enter Number",10),
textOutput("sum_msg"),
textOutput("prod_msg")
)
server <- function(input, output, session){
output$sum_msg <- renderText({
ans <- input$input_num1+input$input_num2
glue("{input$input_num1} and {input$input_num2} is {ans}\n")
})
output$prod_msg<- renderText({
ans <- input$input_num1*input$input_num2
glue("{input$input_num1} times {input$input_num2} is {ans}\n")
})
}
shinyApp(ui, server)
2. Draw a reactive graph from the following Shiny code. Note that the
code contains a reactive expression.
library(shiny)
library(glue)
ui <- fluidPage(
numericInput("input_num1","Enter Number",10),
numericInput("input_num2","Enter Number",10),
verbatimTextOutput("diff_msg"),
verbatimTextOutput("div_msg")
)
server <- function(input, output, session){
# The reactive expression msg() referenced below is missing from
# the source listing; the following is a plausible reconstruction
# that builds the difference message.
msg <- reactive({
ans <- input$input_num1 - input$input_num2
glue("{input$input_num1} minus {input$input_num2} is {ans}\n")
})
output$diff_msg <- renderText({
msg();
})
output$div_msg<- renderText({
ans <- input$input_num1/input$input_num2
glue("{input$input_num1} divided by {input$input_num2} is {ans}\n")
})
}
shinyApp(ui, server)
Part III

12
Exploratory Data Analysis
12.1 Introduction
Earlier in Part II, we presented tools from R’s tidyverse, including the packages
ggplot2 and dplyr, which facilitate efficient analysis of datasets. We now present
an overall approach that provides a valuable guiding framework for initial
analysis of a dataset. Exploratory data analysis (EDA) involves reviewing the
features and characteristics of a dataset with an “open mind”, and is frequently
used upon “first contact with the data” (EDA, 2008). A convenient way to
pursue EDA is to use questions as a means to guide your exploration, as this
process focuses your attention on specific aspects of the dataset (Wickham
et al., 2023). An attractive feature of EDA is that there are no constraints on
the type of question posed, and therefore it can be viewed as a creative process.
This chapter provides an overview of EDA, with a focus on five different
datasets. The reader is encouraged to create their own questions that could
drive an initial investigation of the data. Upon completing the chapter, you
should have an appreciation for:
• The main idea underlying EDA, which is to ask questions of your dataset in
an iterative way.
• How EDA can be applied to five CRAN datasets.
• How ggplot2 can be used to support EDA.
• How dplyr can be used to generate informative data summaries, and add
new variables to datasets using the mutate() function.
• How the ggpubr function stat_cor() can be used to explore correlations
between variables.
• The utility of the lubridate package, and how it can generate additional
time-related information from a timestamp variable.
• Related R functions that support exploratory data analysis in R.
Chapter structure
• Section 12.2 presents an overview of an iterative process for EDA.
• Section 12.3 explores whether plant dimensions can help to identify a species.
• Section 12.4 explores the nature of the relationship between temperature
and electricity demand.
• Section 12.5 investigates factors that may have impacted pupil–teacher ratios
in Boston from the 1970s.
• Section 12.6 seeks to identify any patterns that increased a passenger’s chance
of survival on board the Titanic.
• Section 12.7 investigates the possible influence of wind direction on winter
temperatures in Ireland.
• Section 12.8 provides a summary of the functions introduced in the chapter.
The following questions, among others, guide the chapter's exploration:
• For a Boston dataset from the 1970s, can we find possible relationships
between house value and pupil–teacher ratios across different suburbs?
• What passengers had the greatest chance of survival on the Titanic?
• During the winter season in Ireland, is there initial evidence to suggest that
wind direction has an impact on temperature?
library(tidyr)
# Convert the built-in iris data frame to a tibble (this step is
# implied by the text but not shown at this point)
iris_tb <- dplyr::as_tibble(iris)
iris_long <- tidyr::pivot_longer(iris_tb,
names_to = "Measurement",
values_to = "Value",
-Species)
head(iris_long)
#> # A tibble: 6 x 3
#> Species Measurement Value
#> <fct> <chr> <dbl>
#> 1 setosa Sepal.Length 5.1
#> 2 setosa Sepal.Width 3.5
#> 3 setosa Petal.Length 1.4
#> 4 setosa Petal.Width 0.2
#> 5 setosa Sepal.Length 4.9
#> 6 setosa Sepal.Width 3
Histograms are generated from the following R code, and the resulting plot is
visualized in Figure 12.2. Note the alpha argument controls the geom’s opacity,
and we fill the histogram bars based on the species.
p1 <- ggplot(iris_long,aes(x=Value,fill=Species))+
geom_histogram(color="white", alpha=0.7)+
facet_wrap(~Measurement,ncol=2)+
theme(legend.position = "top")
p1
#> `stat_bin()` using `bins = 30`. Pick better value with
#> `binwidth`.
The histograms provide insights into the data. For example, the petal variables
visualized on the top row show a clearer separation between the species.
Specifically, for both the petal length and width, the setosa histogram plot is
fully distinct from the other two. There also seems to be a difference between
the versicolor and virginica species.
Using visual inspection, the petal attributes provide a clearer differentiation
between the species. We can now extract summary values. Of course, these
summaries could also be obtained by simply calling summary(iris_tb);
however, the advantage here is that we create a tibble with this overall
information, where all the values can be easily compared. We combine the tools of
dplyr to generate the following summary, which is stored in the tibble res.
res <- iris_long %>%
dplyr::filter(Measurement %in% c("Petal.Length",
"Petal.Width")) %>%
dplyr::group_by(Species,Measurement) %>%
dplyr::summarize(Min=min(Value),
Q25=quantile(Value,0.25),
Median=median(Value),
Mean=mean(Value),
Q75=quantile(Value,0.75),
Max=max(Value)) %>%
dplyr::ungroup() %>%
dplyr::arrange(Measurement,Mean)
res
#> # A tibble: 6 x 8
#> Species Measurement Min Q25 Median Mean Q75 Max
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa Petal.Length 1 1.4 1.5 1.46 1.58 1.9
#> 2 versicolor Petal.Length 3 4 4.35 4.26 4.6 5.1
#> 3 virginica Petal.Length 4.5 5.1 5.55 5.55 5.88 6.9
#> 4 setosa Petal.Width 0.1 0.2 0.2 0.246 0.3 0.6
#> 5 versicolor Petal.Width 1 1.2 1.3 1.33 1.5 1.8
#> 6 virginica Petal.Width 1.4 1.8 2 2.03 2.3 2.5
The data provides further insights, as the setosa iris has the lowest mean
value, while virginica has the largest mean. The other values provide additional
information, and would suggest that both variables could be used to help
classify the iris species.
To confirm this, in an informal way, we can return to the original tibble and
generate a scatterplot using the petal length and width, and color the dots by
the species. The resulting plot is shown in Figure 12.3.
p2 <- ggplot(iris_tb,aes(x=Petal.Length,y=Petal.Width,color=Species)) +
geom_point()
p2
FIGURE 12.3 Exploring the association between petal length and petal
width
The output reveals an interesting pattern: the scatterplot of petal length
against petal width shows potential clustering of the different species.
These relationships can be explored more fully, for example, by modelling
the data using a decision tree classifier, which can generate rules that classify
inputs (e.g., petal length and width) into outputs (one of the three species of
iris) (Lantz, 2019). This modelling is outside the chapter’s scope, but would
fall within the exploratory data analysis process, and the interested reader is
encouraged to further explore the dataset using techniques such as decision
trees and unsupervised learning.
In this section, we explore electricity demand data for Victoria, Australia,
building an initial picture of (1) the associations between temperature and
electricity demand, and (2) patterns of electricity demand by time of the
year (e.g., quarters).
To load and prepare the data, we need to include a number of extra libraries,
as shown below.
library(ggplot2)
library(dplyr)
library(tsibble)
library(tsibbledata)
library(lubridate)
library(tidyr)
library(ggpubr)
The library tsibble specifies a new type of tibble that has been created to
support time series analysis (which is outside of this chapter’s scope). In this
example, the variable vic_elec is a tsibble, and we can see its S3 structure
by calling class(), where it is clearly sub-classed from both tibbles and a data
frame.
class(vic_elec)
#> [1] "tbl_ts" "tbl_df" "tbl" "data.frame"
However, for our purposes, because we are not performing any time series
analysis, we will recast the object to a tibble, and use the function
tsibble::as_tibble() to perform this task. We create a new variable aus_elec,
and we can observe its structure below.
aus_elec <- vic_elec %>%
tsibble::as_tibble()
aus_elec %>% dplyr::slice(1:3)
#> # A tibble: 3 x 5
#> Time Demand Temperature Date Holiday
#> <dttm> <dbl> <dbl> <date> <lgl>
#> 1 2012-01-01 00:00:00 4383. 21.4 2012-01-01 TRUE
#> 2 2012-01-01 00:30:00 4263. 21.0 2012-01-01 TRUE
#> 3 2012-01-01 01:00:00 4049. 20.7 2012-01-01 TRUE
With these features, we create two additional variables: (1) the quarter of the
year to capture possible seasonal fluctuations, where this value is based on the
month; and, (2) the day segment, where each day is divided into four quarters,
and this calculation is based on the hour of the day. Note the use of the dplyr
function case_when() to create the variables. Overall, as we are adding new
variables, we use the dplyr function mutate(), and then we show the first three
observations.
aus_elec <- aus_elec %>%
dplyr::mutate(WDay=wday(Time,label=TRUE),
Year=year(Time),
Month=as.integer(month(Time)),
Hour=hour(Time),
YearDay=yday(Time),
Quarter=case_when(
Month %in% 1:3 ~ "Q1",
Month %in% 4:6 ~ "Q2",
Month %in% 7:9 ~ "Q3",
Month %in% 10:12 ~ "Q4",
TRUE ~ "Undefined"
),
DaySegment=case_when(
Hour %in% 0:5 ~ "S1",
Hour %in% 6:11 ~ "S2",
Hour %in% 12:17 ~ "S3",
Hour %in% 18:23 ~ "S4",
TRUE ~ "Undefined"
))
dplyr::slice(aus_elec,1:3)
#> # A tibble: 3 x 12
#> Time Demand Temperature Date Holiday WDay
#> <dttm> <dbl> <dbl> <date> <lgl> <ord>
#> 1 2012-01-01 00:00:00 4383. 21.4 2012-01-01 TRUE Sun
#> 2 2012-01-01 00:30:00 4263. 21.0 2012-01-01 TRUE Sun
#> 3 2012-01-01 01:00:00 4049. 20.7 2012-01-01 TRUE Sun
#> # ... with 6 more variables: Year <dbl>, Month <int>,
#> # Hour <int>, YearDay <dbl>, Quarter <chr>, DaySegment <chr>
We can now create our first visualization of the dataset and use pivot_longer()
to generate a flexible data structure containing the Time, Quarter, Indicator,
and Value.
aus_long <- aus_elec %>%
dplyr::select(Time,Demand,Temperature,Quarter) %>%
tidyr::pivot_longer(names_to="Indicator",
values_to="Value",
-c(Time,Quarter))
dplyr::slice(aus_long,1:3)
#> # A tibble: 3 x 4
#> Time Quarter Indicator Value
#> <dttm> <chr> <chr> <dbl>
#> 1 2012-01-01 00:00:00 Q1 Demand 4383.
#> 2 2012-01-01 00:00:00 Q1 Temperature 21.4
#> 3 2012-01-01 00:30:00 Q1 Demand 4263.
p3 <- ggplot(aus_long,aes(x=Time,y=Value,color=Quarter))+
geom_point(size=0.2)+
facet_wrap(~Indicator,scales="free",ncol = 1)+
theme(legend.position = "top")
p3
FIGURE 12.4 Time series of temperature and electricity demand (by quarter)
With this data we create two plots, shown in Figure 12.4, one for temperature
(bottom plot), and the other for electricity demand (top plot), where the
points are colored by quarter. It is interesting to explore possible associations
between the temperature and demand, for the different quarters of the year.
For example, the quarter one values for temperature and
demand appear to move in the same direction. To further explore this, we can
calculate the correlation coefficient between temperature and demand for each
quarter of every year.
We revert to the dataset aus_elec, as this contains both the temperature and
electricity demand for each quarter. Using dplyr functions, we can calculate
the correlation coefficient for each year and quarter, and sort from highest
correlation value to lowest.
aus_cor <- aus_elec %>%
dplyr::group_by(Year, Quarter) %>%
dplyr::summarize(CorrCoeff=cor(Temperature, Demand)) %>%
dplyr::ungroup() %>%
dplyr::arrange(desc(CorrCoeff))
#> `summarize()` has grouped output by 'Year'. You can override
#> using the `.groups` argument.
dplyr::slice(aus_cor,1:4)
#> # A tibble: 4 x 3
#> Year Quarter CorrCoeff
#> <dbl> <chr> <dbl>
#> 1 2014 Q1 0.796
#> 2 2013 Q1 0.786
#> 3 2012 Q1 0.683
#> 4 2012 Q4 0.476
Interestingly, the highest values are from quarter one (January, February, and
March), which are also those months with high overall temperature values.
It may also be informative to layer the actual correlation coefficient data onto
a plot of Demand v Temperature. The function stat_cor(), which is part of
the package ggpubr, can be used for this, as it displays both the correlation
coefficients and the associated p-values. We show this output in Figure 12.5,
for the first and third quarters.
q1_out <- filter(aus_elec, Quarter %in% c("Q1","Q3"))
p4 <- ggplot(q1_out,aes(x=Temperature,y=Demand))+
geom_point(alpha=0.2,size=0.5)+
facet_grid(Year~Quarter)+
stat_cor(digits = 2)
p4
What is helpful about this visualization is that quarterly values (for “Q1”
and “Q3”) are contained in each column, and so they can be easily compared
across the 3 years. It shows high correlation values for the first quarter, when
temperatures are high. The third quarter has low correlation values, due to
the lower temperature levels between July and September.
Our next analysis is to explore each quarter for every year, and extract the
maximum electricity demand for the quarter. Interestingly, we will also extract
additional variables for the time of the maximum demand, including the
temperature, day of week, and day segment. This is achieved by combining
two different functions. First, the function which.max() will extract the row
index for the maximum observation for Demand. Second, this index value (stored
in MaxIndex) is an input argument to the function nth(), which returns the
appropriate value. We order the values from highest to lowest demand.
aus_summ <- aus_elec %>%
dplyr::group_by(Year,Quarter) %>%
dplyr::summarize(MaxD=max(Demand),
MaxIndex=which.max(Demand),
Time=nth(Time,MaxIndex),
Temp=nth(Temperature,MaxIndex),
Day=nth(WDay,MaxIndex),
DaySegment=nth(DaySegment,MaxIndex)) %>%
dplyr::ungroup() %>%
dplyr::arrange(desc(MaxD))
head(aus_summ)
#> # A tibble: 6 x 8
#> Year Quarter MaxD MaxIndex Time Temp Day
The output from this query can be cross-checked against a specific filtering
call on the tibble. For example, the highest recorded value is for quarter Q1
and year 2014, and it shows that the index for this record was row 755. We
can then extract this directly from the tibble using the following code, and
this result confirms that the observation is the same.
aus_elec %>%
dplyr::filter(Quarter=="Q1",Year==2014) %>%
dplyr::slice(755)
#> # A tibble: 1 x 12
#> Time Demand Temperature Date Holiday WDay
#> <dttm> <dbl> <dbl> <date> <lgl> <ord>
#> 1 2014-01-16 17:00:00 9345. 38.8 2014-01-16 FALSE Thu
#> # ... with 6 more variables: Year <dbl>, Month <int>,
#> # Hour <int>, YearDay <dbl>, Quarter <chr>, DaySegment <chr>
12.5 Exploring housing values in the Boston suburbs

For our initial analysis, we focus on seven variables from the dataset, and we
will grab this opportunity to use the dplyr function rename()
to create more descriptive titles for these variables. This is a common task
in data science, as we may want to alter column names to suit our purpose.
In this code, we also convert the data frame to a tibble, and set the Charles
River variable to a logical value. The resulting data is stored in the tibble bos.
# The Boston dataset is provided by the MASS package (assumed
# loaded here; note that MASS masks dplyr's select(), hence the
# dplyr:: prefixes below)
library(MASS)
bos <- Boston %>%
dplyr::as_tibble() %>%
dplyr::select(chas,rm,age,rad,ptratio,medv,nox) %>%
dplyr::rename(PTRatio=ptratio,
ByRiver=chas,
Rooms=rm,
Age=age,
Radial=rad,
Value=medv,
Nox=nox) %>%
dplyr::mutate(ByRiver=as.logical(ByRiver))
bos
#> # A tibble: 506 x 7
#> ByRiver Rooms Age Radial PTRatio Value Nox
#> <lgl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 FALSE 6.58 65.2 1 15.3 24 0.538
#> 2 FALSE 6.42 78.9 2 17.8 21.6 0.469
#> 3 FALSE 7.18 61.1 2 17.8 34.7 0.469
#> 4 FALSE 7.00 45.8 3 18.7 33.4 0.458
#> 5 FALSE 7.15 54.2 3 18.7 36.2 0.458
#> 6 FALSE 6.43 58.7 3 18.7 28.7 0.458
#> 7 FALSE 6.01 66.6 5 15.2 22.9 0.524
#> 8 FALSE 6.17 96.1 5 15.2 27.1 0.524
#> 9 FALSE 5.63 100 5 15.2 16.5 0.524
#> 10 FALSE 6.00 85.9 5 15.2 18.9 0.524
#> # ... with 496 more rows
Our initial question is to explore the pupil–teacher ratio and observe how other
variables might be correlated with this measure. We can visualize relationships
between continuous variables using the function ggpairs() from the GGally
package, and we use the following code to explore the data. Note that here
we exclude the categorical variable ByRiver from this analysis, and the
argument progress=FALSE suppresses the progress bar during the plotting
process.
library(GGally)
p3 <- ggpairs(dplyr::select(bos,-ByRiver),progress = FALSE)+
theme_light()+
theme(axis.text.x = element_text(size=6),
axis.text.y = element_text(size=6))
p3
This plot, shown in Figure 12.6, provides three types of statistical information.
• The diagonal visualizes the density plot for each variable; in this case, the
number of rooms seems the closest representation of a normal distribution.
• The lower triangular section of the plot (excluding the diagonal) presents
a scatterplot of the relevant row and column variable. These values can be
quickly viewed to see possible linear relationships. In this case, location row
4, column 1 on the plot shows a strong positive relationship between Value
and Rooms.
• The upper triangular section (again excluding the diagonal) shows the correla-
tion coefficient between the row and column variables, and so it complements
the related plot in the lower triangular section. Here we see, in location row
1, column 4, that the correlation coefficient between Value and Rooms is 0.695,
indicating that these variables are positively correlated.
Of course, the R function cor can also be directly used to summarize these
correlations.
cor(dplyr::select(bos,-ByRiver))
#> Rooms Age Radial PTRatio Value Nox
#> Rooms 1.0000 -0.2403 -0.2098 -0.3555 0.6954 -0.3022
#> Age -0.2403 1.0000 0.4560 0.2615 -0.3770 0.7315
#> Radial -0.2098 0.4560 1.0000 0.4647 -0.3816 0.6114
#> PTRatio -0.3555 0.2615 0.4647 1.0000 -0.5078 0.1889
#> Value 0.6954 -0.3770 -0.3816 -0.5078 1.0000 -0.4273
#> Nox -0.3022 0.7315 0.6114 0.1889 -0.4273 1.0000
From our initial question relating to the pupil–teacher ratio, we can observe
that the strongest correlation between this and the other variables is –0.508,
for the variable Value. This shows that the variables are negatively correlated,
hence an increase in the Value is associated with a decrease in PTRatio.
An interesting feature of the dataset is the inclusion of a categorical variable
ByRiver which indicates whether a suburb intersects with the Charles River.
An initial comparison of all the continuous variables can be made, where the
summary statistics can be presented for the ByRiver variable. Note that in the
original dataset this value was an integer (1 or 0), but in creating our dataset,
we converted this to a logical value (TRUE for bordering the river, FALSE for not
bordering the river). To prepare the data for this new visualization, we utilize
pivot_longer(), which reduces the number of columns to three.
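The pivot_longer() step itself is not shown at this point; a minimal sketch,
assuming all columns other than ByRiver are gathered into Indicator/Value
pairs, is as follows.
bos_long <- bos %>%
  tidyr::pivot_longer(names_to="Indicator",
                      values_to="Value",
                      -ByRiver)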
With this longer tibble, we generate a boxplot for each variable, and color
these plots with the variable ByRiver. A facet plot is created using the variable
Indicator, which allows a single plot per variable. The code for generating the
plot is shown below, and this is followed by the plot itself, which is visualized
in Figure 12.7.
p5 <- ggplot(bos_long,aes(x=ByRiver,y=Value,color=ByRiver))+
geom_boxplot()+
facet_wrap(~Indicator,scales="free")+
theme(legend.position = "top")
p5
12.6 Exploring passenger survival chances on board the Titanic

The titanic package provides the data for this section, and we begin by
examining the training dataset.
library(titanic)
dplyr::glimpse(titanic_train)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ~
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0~
#> $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3~
#> $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. J~
#> $ Sex <chr> "male", "female", "female", "female", "male~
#> $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 5~
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0~
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0~
#> $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282"~
#> $ Fare <dbl> 7.250, 71.283, 7.925, 53.100, 8.050, 8.458,~
#> $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "~
#> $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S"~
The idea behind our exploratory data analysis is to see which of these factors
were important for survival. To prepare our dataset, we simplify our focus to
five variables (a sketch of the preparation step is shown after the list):
• PassengerId, a unique number for each passenger.
• Survived, the survival indicator (1=survived, 0=perished).
• Pclass, the passenger class (1=first, 2=second, and 3=third).
• Sex, passenger sex (“male” or “female”).
• Age, age of passenger.
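The preparation code is not shown at this point; a minimal sketch, assuming
the variables are renamed and converted as the later outputs imply (Survived
as a logical, and Class and Sex as factors), is shown below.
titanic <- titanic_train %>%
  dplyr::as_tibble() %>%
  dplyr::select(PassengerId, Survived, Pclass, Sex, Age) %>%
  dplyr::rename(Class=Pclass) %>%
  dplyr::mutate(Survived=as.logical(Survived),
                Class=factor(Class),
                Sex=factor(Sex))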
This also highlights the utility of the summary() function. For continuous
variables, a six-value summary is generated, and for categorical variables, a
count of observations for each factor level is displayed.
Our question to explore relates to an analysis of the survival data, and as a
starting point, we can generate a high-level summary of the overall survival
rates and store this as the variable sum1. This includes the total number of
observations (N), the total who survived (TSurvived), the total who perished
(TPerished), and the proportion who survived and perished. Note we take
advantage of the fact that the summarize() function can use its calculated values
in new column calculations, for example, PropSurvived uses the previously
calculated value of N. This code also shows that summarize() can be used
on the complete dataset, although, as we have seen throughout the book, we
mostly deploy it with the function group_by().
sum1 <- titanic %>%
dplyr::summarize(N=n(),
TSurvived=sum(Survived),
TPerished=sum(Survived==FALSE),
PropSurvived=TSurvived/N,
PropPerished=TPerished/N)
sum1
#> # A tibble: 1 x 5
#> N TSurvived TPerished PropSurvived PropPerished
#> <int> <int> <int> <dbl> <dbl>
#> 1 891 342 549 0.384 0.616
The main result here is that just over 38% of the sample survived; however,
the survival rates of the different subgroups are not clear. The next task is to
drill down into the data and explore the same variables to see whether there
are differences across these subgroups.
A related query is to explore this data with respect to the passenger’s class.
Again, it’s a straightforward addition to the high-level query, where we add a
group_by() function call to the overall data transformation pipeline. Similar to
the previous command, the results are sorted by PropSurvived and the overall
results are stored in the tibble sum3.
sum3 <- titanic %>%
dplyr::group_by(Class) %>%
dplyr::summarize(N=n(),
TSurvived=sum(Survived),
TPerished=sum(Survived==FALSE),
PropSurvived=TSurvived/N,
PropPerished=TPerished/N) %>%
dplyr::arrange(desc(PropSurvived))
sum3
#> # A tibble: 3 x 6
#> Class N TSurvived TPerished PropSurvived PropPerished
#> <fct> <int> <int> <int> <dbl> <dbl>
#> 1 1 216 136 80 0.630 0.370
#> 2 2 184 87 97 0.473 0.527
#> 3 3 491 119 372 0.242 0.758
Here we observe a difference between the groups, with the highest survival
proportion (at ~63%) for those in first class, while the lowest value (~24%) is
for those who purchased third-class tickets.
Our next summary groups the data by the two variables to generate another
set of results, this time, for each sex and for each travel class. Again, the
flexibility of the group_by() function is evident, as we combine these two
variables, which will then generate six rows, one for each combination of
variables. The results are stored in the tibble sum4, and once more, we arrange
these in descending order of the variable PropSurvived.
sum4 <- titanic %>%
dplyr::group_by(Class,Sex) %>%
dplyr::summarize(N=n(),
TSurvived=sum(Survived),
TPerished=sum(Survived==FALSE),
PropSurvived=TSurvived/N,
PropPerished=TPerished/N) %>%
dplyr::arrange(desc(PropSurvived))
sum4
#> # A tibble: 6 x 7
#> # Groups: Class [3]
#> Class Sex N TSurvived TPerished PropSurvived PropPeris~1
#> <fct> <fct> <int> <int> <int> <dbl> <dbl>
#> 1 1 female 94 91 3 0.968 0.0319
#> 2 2 female 76 70 6 0.921 0.0789
#> 3 3 female 144 72 72 0.5 0.5
#> 4 1 male 122 45 77 0.369 0.631
#> 5 2 male 108 17 91 0.157 0.843
#> 6 3 male 347 47 300 0.135 0.865
#> # ... with abbreviated variable name 1: PropPerished
x="Survival Outcome",
y="Number of Passengers")+
scale_fill_manual(values=c("red","green"))
p6
FIGURE 12.8 Exploring the survival data from the Titanic dataset
This concludes our initial exploratory data analysis for the Titanic dataset. In
a similar way to the iris dataset, because the survival outcome is categorical, the
data can be modelled using a decision tree classifier, which can generate rules
that classify inputs (e.g., Sex, Age and Class) into outputs (either survived or
perished) (Lantz, 2019). The titanic library supports this, as it contains both
training and test datasets.
12.7 Exploring the effect of wind direction on winter temperatures

Given that there are 25 weather stations, to simplify our analysis we restrict
the number to just four: Malin Head (north), Dublin Airport (east), Roches
Point (south) and Mace Head (west).
The four character strings are stored in the vector st4.
st4 <- c("MALIN HEAD",
"DUBLIN AIRPORT",
"ROCHES POINT",
"MACE HEAD")
st4
#> [1] "MALIN HEAD" "DUBLIN AIRPORT" "ROCHES POINT"
#> [4] "MACE HEAD"
We then filter all observations from these stations from the dataset, and convert
the units of wind speed from knots to kilometers per hour. We do not include
the year column, as this is redundant, given that all observations are from the
calendar year 2017. The result is stored in the tibble eda0, and the first six
observations are displayed.
# Filter data, convert to factor, and update average hourly wind speed
# from knots to kmh
eda0 <- observations %>%
dplyr::filter(station %in% st4) %>%
dplyr::mutate(station=factor(station),
wdsp=wdsp*1.852) %>%
dplyr::select(-year)
head(eda0)
#> # A tibble: 6 x 11
#> station month day hour date rain temp rhum
#> <fct> <dbl> <int> <int> <dttm> <dbl> <dbl> <dbl>
#> 1 DUBLIN~ 1 1 0 2017-01-01 00:00:00 0.9 5.3 91
#> 2 DUBLIN~ 1 1 1 2017-01-01 01:00:00 0.2 4.9 95
#> 3 DUBLIN~ 1 1 2 2017-01-01 02:00:00 0.1 5 92
#> 4 DUBLIN~ 1 1 3 2017-01-01 03:00:00 0 4.2 90
#> 5 DUBLIN~ 1 1 4 2017-01-01 04:00:00 0 3.6 88
#> 6 DUBLIN~ 1 1 5 2017-01-01 05:00:00 0 2.8 89
#> # ... with 3 more variables: msl <dbl>, wdsp <dbl>, wddir <dbl>
Given that our initial question for exploration is to see whether there is an
observable pattern between the wind direction and temperature, we now need
to create a new feature to categorize the wind direction. To keep the analysis
simple, we partition the compass space into four equal quadrants: north (N),
east (E), south (S) and west (W), based on the logic shown below. For example,
a wind direction greater than 45 degrees and less than or equal to 135 degrees
is designated as east. We convert the wind direction to factors (N,E,S,W) in
order to maintain a consistent appearance when the results are plotted (using
a boxplot). The results are stored in the tibble eda.
eda <- eda0 %>%
dplyr::mutate(wind_dir = case_when(
wddir > 315 | wddir <= 45 ~ "N",
wddir > 45 & wddir <= 135 ~ "E",
wddir > 135 & wddir <= 225 ~ "S",
wddir > 225 & wddir <= 315 ~ "W",
TRUE ~ "Missing"),
wind_dir = ifelse(wind_dir=="Missing",
NA, wind_dir),
wind_dir= factor(wind_dir,
levels=c("N","E","S","W"))) %>%
dplyr::select(station:date,
wdsp,
wind_dir,
wddir,
everything())
dplyr::slice(eda,1:3)
#> # A tibble: 3 x 12
#> station month day hour date wdsp wind_~1
#> <fct> <dbl> <int> <int> <dttm> <dbl> <fct>
#> 1 DUBLIN AIR~ 1 1 0 2017-01-01 00:00:00 22.2 N
#> 2 DUBLIN AIR~ 1 1 1 2017-01-01 01:00:00 14.8 W
#> 3 DUBLIN AIR~ 1 1 2 2017-01-01 02:00:00 14.8 W
#> # ... with 5 more variables: wddir <dbl>, rain <dbl>,
#> # temp <dbl>, rhum <dbl>, msl <dbl>, and abbreviated variable
#> # name 1: wind_dir
The features required for our analysis are now present, and therefore we can
select the winter months and store the results in the tibble winter.
winter <- eda %>%
dplyr::filter(month %in% c(11,12,1))
dplyr::glimpse(winter)
#> Rows: 8,832
#> Columns: 12
#> $ station <fct> DUBLIN AIRPORT, DUBLIN AIRPORT, DUBLIN AIRPORT~
#> $ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ~
#> $ date <dttm> 2017-01-01 00:00:00, 2017-01-01 01:00:00, 201~
#> $ wdsp <dbl> 22.22, 14.82, 14.82, 22.22, 20.37, 22.22, 24.0~
#> $ wind_dir <fct> N, W, W, N, N, N, N, N, N, N, N, N, N, N, N, N~
#> $ wddir <dbl> 340, 310, 310, 330, 330, 330, 330, 330, 330, 3~
#> $ rain <dbl> 0.9, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0~
#> $ temp <dbl> 5.3, 4.9, 5.0, 4.2, 3.6, 2.8, 1.7, 1.6, 2.0, 2~
#> $ rhum <dbl> 91, 95, 92, 90, 88, 89, 91, 91, 89, 84, 84, 80~
#> $ msl <dbl> 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1021~
Our first analysis is to use dplyr to explore the median and the 25th and
75th percentiles of temperature
for each wind direction and weather station. The code to perform this is listed
below, and again it highlights the power of dplyr functions to gather summary
statistics from the raw observations. The results are interesting in that the
wind directions from the south and west are in the first seven observations,
perhaps indicating that there is a noticeable temperature difference arising
from the different wind directions.
w_sum <- winter %>%
dplyr::group_by(wind_dir,station) %>%
dplyr::summarize(Temp25Q=quantile(temp,0.25),
Temp75Q=quantile(temp,0.75),
TempMed=median(temp)) %>%
dplyr::arrange(desc(TempMed))
#> `summarize()` has grouped output by 'wind_dir'. You can override
#> using the `.groups` argument.
w_sum
#> # A tibble: 16 x 5
#> # Groups: wind_dir [4]
#> wind_dir station Temp25Q Temp75Q TempMed
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 S MACE HEAD 8.9 10.8 9.9
#> 2 S ROCHES POINT 8.75 10.9 9.9
#> 3 W MACE HEAD 7.3 10.1 8.6
#> 4 S DUBLIN AIRPORT 6.32 10.2 8.3
#> 5 S MALIN HEAD 5.5 9.6 8
#> 6 W MALIN HEAD 6.4 9.2 8
#> 7 W ROCHES POINT 5.3 10.6 7.9
#> 8 E ROCHES POINT 6.3 9.1 7.5
#> 9 N MALIN HEAD 5.35 7.6 6.6
#> 10 N MACE HEAD 5 7.3 6
#> 11 E MALIN HEAD 3.3 6.8 6
#> 12 E MACE HEAD 3.48 7.8 5.6
#> 13 W DUBLIN AIRPORT 2.5 8.4 5.5
#> 14 N ROCHES POINT 3.8 8 5.4
#> 15 E DUBLIN AIRPORT 2.4 7.3 5.35
#> 16 N DUBLIN AIRPORT 1.4 6.05 4
We can build on this summary and present the temperature data using a box
plot, where we color and facet the data by wind direction. The code for this is
now shown, and the plot stored in the variable p7.
p7 <- ggplot(winter,aes(x=wind_dir,y=temp,color=station))+
geom_boxplot()+
facet_wrap(~station,nrow = 1)+
labs(y="Temperature",
x="Wind Direction",
title="Winter temperatures at weather stations",
subtitle="Data summarized by wind direction")+
theme(legend.position = "none")
p7
This plot, shown in Figure 12.9, highlights the power of the boxplot for
comparing different results, as we can view the medians across all plots. A
number of interesting observations arise:
• The coldest median wind directions for the four stations are north (Dublin
Airport and Roches Point), and east (Mace Head and Malin Head).
• In all cases, the warmest median wind directions are from the south.
• The difference between the warmest and coldest, in all cases, is observable,
indicating that wind from the south seems associated with warmer weather
conditions during the winter months of 2017.
Further statistical analysis would be interesting, and, similar to the Boston
dataset, approaches such as ANOVA (Crawley, 2015) could be used to explore
whether these differences are statistically significant.
12.8 Summary of R functions from Chapter 12
Function Description
hour() Retrieves the hour from a date-time object (lubridate).
month() Retrieves the month from a date-time object (lubridate).
rename() Changes columns name (dplyr).
stat_cor() Adds correlation coefficients to a scatterplot (ggpubr).
wday() Retrieves the weekday from a date-time object (lubridate).
yday() Retrieves the day of year from a date-time object (lubridate).
year() Retrieves the year from a date-time object (lubridate).
13
Linear Programming
13.1 Introduction
Linear programming is a decision-making tool that has been ranked as among
the most important scientific advances of the mid-20th century, and it is
primarily focused on the general problem of allocating limited resources among
competing activities (Hillier and Lieberman, 2019). Its success as a method
arises from its flexibility in solving problems across many domains. It is also
a foundational technique for other operations research methods, for example,
integer programming and network flow problems. The technique uses a mathe-
matical (linear) model to describe the problem of interest, and this model is
then solved, using graphical (for simple cases), or computational techniques
for larger, more realistic problems. Our goal in this chapter is to provide an
introduction to using R to solve linear programming problems. Our initial focus
is an example that can be explored in two dimensions, and solved graphically.
We then show how this can be solved computationally using the R package
lpSolve. We conclude the chapter with an introduction to sensitivity analysis
for linear programming problems.
Chapter structure
• Section 13.2 provides a brief overview of the method, focusing on a standard
way of representing problems.
• Section 13.3 presents the motivating example for this chapter, which is a
linear programming model with just two decision variables (Taha, 1992).
• Section 13.4 demonstrates how R can be used to explore a subset of the
feasible solution space to locate, and visualize, potential solutions.
• Section 13.5 presents a standard graphical solution to the two variable linear
programming model.
• Section 13.6 shows how the two-variable problem can be solved using the R
package lpSolve.
• Section 13.7 shows how sensitivity analysis can be used to explore the effect
of parameter changes on the optimal solution.
• Section 13.8 provides a summary of the functions introduced in the chapter.
Z = c1x1 + c2x2 + · · · + cnxn

subject to the constraints:

a11x1 + a12x2 + · · · + a1nxn ≤ b1
a21x1 + a22x2 + · · · + a2nxn ≤ b2
· · ·
am1x1 + am2x2 + · · · + amnxn ≤ bm

and x1 ≥ 0, x2 ≥ 0, . . . , xn ≥ 0.
• The second demand restriction is that the maximum demand for interior
paint is 2 tons per day, and this can be expressed as x2 ≤ 2.
There are implicit constraints in that the production amounts for both paints
cannot be negative, and therefore this can be expressed as x1 ≥ 0, and x2 ≥ 0.
In summary, the full mathematical model for this optimization problem can
be defined as follows:
Maximize Z = 3x1 + 2x2, subject to:

x1 + 2x2 ≤ 6
2x1 + x2 ≤ 8
x2 − x1 ≤ 1
x2 ≤ 2
x1 ≥ 0
x2 ≥ 0
We will now focus on solving this problem, and the first step is to explore the
range of feasible solutions using R.
In order to evaluate the profit from a particular solution, the objective function
is needed. Here we code the function, which returns the linear combination
c1x1 + c2x2 (with c1 = 3 and c2 = 2).
# A function to evaluate the Reddy Mikks objective function
obj_func <- function(x1,x2){
3*x1 + 2*x2
}
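The feasibility test referred to in the next paragraph is not listed at this
point; a minimal sketch, directly encoding the six constraints of the model,
is as follows.
# Test whether a candidate solution satisfies all of the
# Reddy Mikks constraints
is_feasible <- function(x1,x2){
  x1 + 2*x2 <= 6 &&
  2*x1 + x2 <= 8 &&
  x2 - x1 <= 1 &&
  x2 <= 2 &&
  x1 >= 0 &&
  x2 >= 0
}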
With these two functions created, we can now progress to explore the parameter
space, and to see (1) which potential solutions are feasible, and (2) out of all
the feasible solutions, which one provides the best result. Note, we do not call
this the optimal result, as we are merely exploring the parameter space.
In this example, we create two vectors of possible values for x1 and x2 , where
we fix an arbitrary range of [0, 5] for each, and we use the parameter length.out
on the function seq() to generate two vectors of length 40, where each vector
element is equidistant from its neighbor.
For example, here we create the vector x1_range. Note that its length is 40,
starts at 0 and finishes at 5, and all its elements are equally spaced.
set.seed(100)
N = 40
x1_range <- seq(0,5,length.out=N)
We create a similar vector x2_range, and then we combine this with the vector
x1_range using the R function expand.grid(), which creates a data frame
containing all combinations of the two vectors (40² = 1,600 combinations).
This output is copied into the tibble exper, where each row is a combination
of parameter values.
x2_range <- seq(0,5,length.out=N)
comb_x1_x2 <- expand.grid(x1_range,x2_range)
exper <- tibble(x1=comb_x1_x2[,1],
x2=comb_x1_x2[,2])
str(exper)
#> tibble [1,600 x 2] (S3: tbl_df/tbl/data.frame)
#> $ x1: num [1:1600] 0 0.128 0.256 0.385 0.513 ...
#> $ x2: num [1:1600] 0 0 0 0 0 0 0 0 0 0 ...
Given that we now have our set of 1,600 combinations, we can add a new
variable to the tibble exper that contains information on whether the solution
is feasible or not. Because is_feasible() is not a vectorized function, we use
the dplyr function rowwise() to ensure that the function is_feasible() is
called for every row of data. The tibble’s summary shows that 391 of the 1,600
possible solutions are feasible.
exper <- exper %>%
dplyr::rowwise() %>%
dplyr::mutate(Feasible=is_feasible(x1,x2))
summary(exper)
#> x1 x2 Feasible
#> Min. :0.00 Min. :0.00 Mode :logical
#> 1st Qu.:1.25 1st Qu.:1.25 FALSE:1209
#> Median :2.50 Median :2.50 TRUE :391
#> Mean :2.50 Mean :2.50
#> 3rd Qu.:3.75 3rd Qu.:3.75
#> Max. :5.00 Max. :5.00
FIGURE 13.1 Exploring the solution space for the Reddy Mikks problem
Our final step is to focus on the feasible solutions, and for each of these to
calculate the profit via the objective function, and here we also show the top
six solutions.
exper_feas <- filter(exper,Feasible==TRUE) %>%
  # add the payoff (Z) for each feasible solution, via the
  # objective function (this step is implied by the plot below)
  dplyr::mutate(Z=obj_func(x1,x2))
We can extract the best solution from all of these, and then display it on a
graph (shown in Figure 13.2) with the complete list of feasible points (from
our original sample of 1,600 points).
max_point <- arrange(exper_feas,desc(Z)) %>%
slice(1)
p2 <- ggplot(exper_feas,aes(x=x1,y=x2,color=Z))+geom_point()+
scale_color_gradient(low = "blue",high = "red")+
geom_abline(mapping=aes(slope=-0.5,intercept=6/2))+
geom_abline(mapping=aes(slope=-2,intercept=8))+
geom_abline(mapping=aes(slope=1,intercept=1))+
geom_abline(mapping=aes(slope=0,intercept=2))+
geom_point(data = max_point,aes(x=x1,y=x2),size=3,color="green")+
scale_x_continuous(name="x1", limits=c(0,5)) +
scale_y_continuous(name="x2", limits=c(0,5))
p2
FIGURE 13.2 Showing the feasible space and the best solution
Note that the best solution (from our sample) is close to the intersection of
two of the lines on the top right of the feasible space. We will now see why this
is significant, as we explore a graphical solution to this optimization problem.
13.5 Graphical solution to the Reddy Mikks problem
FIGURE 13.3 Drawing the feasible space using the graphical method
In order to explore (graphically) which point in the feasible space is the optimal
point, an interesting process is followed.
First, a new idea is presented, which is known as the slope-intercept form of
the objective function, where the objective function is re-formulated as the
equation of a line of the form x2 = −(c1/c2)x1 + Z/c2.
Therefore, we can reformulate the Reddy Mikks objective function as:
x2 = −(3/2)x1 + Z/2. The slope value is important, as the optimal solution
must reside on a line with this slope, for a given value of Z. However, for the
particular value of Z, the line must also intersect with a point (or points) in
the feasible region.
We can now experiment with the following values of Z, and see what lines they
generate. Note that all the lines constructed will be parallel, as they share the
same slope of −3/2. The following table shows the options we have selected,
and note that these are informed by a prior knowledge of the optimal solution.
Selected Z Value   Intercept (Z/2)   Objective Function         Solutions
14.666             7.333             x2 = (−3/2)x1 + 7.333      None
12.666             6.333             x2 = (−3/2)x1 + 6.333      One
10.666             5.333             x2 = (−3/2)x1 + 5.333      Infinite number
8.666              4.333             x2 = (−3/2)x1 + 4.333      Infinite number
6.666              3.333             x2 = (−3/2)x1 + 3.333      Infinite number
4.666              2.333             x2 = (−3/2)x1 + 2.333      Infinite number
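A minimal sketch of how these parallel lines could be layered onto the earlier
feasible-region plot (p_obj is a hypothetical name; p2 is the plot object
created earlier) is:
# Overlay the six candidate objective-function lines, all sharing
# the slope -3/2
z_vals <- c(14.666, 12.666, 10.666, 8.666, 6.666, 4.666)
p_obj <- p2 +
  geom_abline(slope=-3/2, intercept=z_vals/2, linetype="dashed")
p_obj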
These six lines are visualized on the graph, shown in Figure 13.4, with the
optimal objective function line highlighted in green. This can be seen (by
observation) to intersect the feasible region at the point (3.333, 1.333), which
is in fact the optimal solution. It is optimal because it is the point at which
an objective-function line of this slope still intersects the feasible solution
space while yielding the maximum intercept on the x2 axis.
The graphical solution is a trial-and-error process: once the slope is fixed, it
involves finding the point in the feasible space that lies on a line of this slope
while maximizing the intercept value. Once the intercept value is known, the
maximum payoff (or profit) can be calculated, and in this case it is
6.333 × 2, which evaluates to 12.667.
However, as pointed out, the graphical process is not practical for more than
two decision variables. With more complex problems, there are two options.
First, a manual process known as the simplex method can be used; however this
process is outside of our scope in this chapter. Second, a computational solver
can be utilized, and in R, the lpSolve package can be used to generate an
optimal solution. We will now solve the Reddy Mikks example using lpSolve.
13.6 lpSolve: Generating optimal solutions in R

The function lp() from the lpSolve package returns an S3 object of class lp,
and its main elements are summarized below.

lp Element Explanation
direction Optimization direction, as specified in the call.
x.count The number of decision variables in the objective function.
objective The vector of objective function coefficients.
const.count The total number of constraints provided.
constraints The constraint matrix, as provided.
int.count The number of integer variables (not used here).
int.vec The vector of integer variables' indices (not used here).
objval The objective function value at the optimum.
solution The vector of optimal values for the decision variables.
num.bin.solns The number of solutions returned.
status Return status: 0 = success, 2 = no feasible solution.
We now construct the linear programming formulation for the Reddy Mikks
problem. First, we define the objective function in a vector. These are coefficient
values for c1 and c2 .
z <- c(3,2)
z
#> [1] 3 2
Next, we define the left-hand side of the four constraints. This is a 4 × 2 matrix,
where each row represents the multipliers for each constraint, and each column
represents a decision variable.
cons <- matrix(c(1,2,
2,1,
-1,1,
0,1),byrow = T,nrow=4)
cons
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 2 1
#> [3,] -1 1
#> [4,] 0 1
Finally, the right-hand side values for the constraints are specified, with four
values provided, one for each constraint.
rhs <- c(6,8,1,2)
rhs
#> [1] 6 8 1 2
With these values specified, we can now call the function lp() to generate an
optimal solution.
# The directions of the four constraints
eql <- c("<=", "<=", "<=", "<=")
opt <- lp("max",z,cons,eql,rhs)
# Show the S3 class
class(opt)
#> [1] "lp"
# Show the status
opt$status
#> [1] 0
# Show the objective function value
opt$objval
#> [1] 12.66667
# Show the optimal values of the decision variables
opt$solution
#> [1] 3.333333 1.333333
As can be observed from the solution, it is the same as that calculated using
the graphical method. It also confirms the utility of computational solutions
to linear programming problems, as they are an efficient method for finding an
optimal solution. However, one feature of the solution is that we assume that
the objective function is fixed. With this in mind, it would be beneficial to see
how the solution changes in response to changes in the objective function, and
this process is known as sensitivity analysis.
13.7 Sensitivity analysis

As a reminder of the problem setup, we redefine the constraint matrix.
cons <- matrix(c(1,2,
2,1,
-1,1,
0,1),byrow = T,nrow=4)
We now define the range for the coefficient multipliers for the decision variables.
These are arbitrary values, and we decide on four equidistant points between 1
and 10, and then create the full set of permutations, which yields 16 points.
c1_range <- seq(1,10,length.out=4)
c2_range <- seq(1,10,length.out=4)
comb_c1_c2 <- expand.grid(c1_range,c2_range)
The optimal value for each of these combinations is then found by calling
the function lp, where the second parameter, the objective function, is a
combination of the profit multipliers for the two products. We also record the
two solutions x1_opt and x2_opt, the payoff solution solution, the slope and
intercept of each optimal solution, and a string version of the optimal point.
Notice the value of dplyr in generating these solutions, as we can store the lp
object as a list, and thereby store all the results in the tibble.
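The listing that constructs of_exp is not shown at this point; a minimal
sketch, consistent with the description above and the column names in the
output below, is:
of_exp <- tibble(x1_x=comb_c1_c2[,1],
                 x2_y=comb_c1_c2[,2]) %>%
  dplyr::rowwise() %>%
  dplyr::mutate(lpSolve=list(lp("max",c(x1_x,x2_y),cons,eql,rhs)),
                x1_opt=lpSolve$solution[1],
                x2_opt=lpSolve$solution[2],
                solution=lpSolve$objval,
                slope=-x1_x/x2_y,
                intercept=solution/x2_y,
                opt_vals=glue("({round(x1_opt,2)}, {round(x2_opt,2)})"))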
The results can then be viewed, and here we arrange, from highest to lowest,
by the solution value (the payoff). We can observe that the highest payoff is
obtained from the coefficients (10, 10), with an optimal solution of (3.333, 1.333).
of_exp %>% dplyr::arrange(desc(solution)) %>% head()
#> # A tibble: 6 x 9
#> # Rowwise:
#> x1_x x2_y lpSolve x1_opt x2_opt solution slope intercept
#> <dbl> <dbl> <list> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 10 <lp> 3.33 1.33 46.7 -1 4.67
#> 2 10 7 <lp> 3.33 1.33 42.7 -1.43 6.10
#> 3 10 1 <lp> 4 0 40 -10 40
#> 4 10 4 <lp> 4 0 40 -2.5 10
#> 5 7 10 <lp> 3.33 1.33 36.7 -0.7 3.67
#> 6 7 7 <lp> 3.33 1.33 32.7 -1 4.67
#> # ... with 1 more variable: opt_vals <glue>
As a reminder, here is the linear programming problem, and in this case, the
objective function will remain unchanged.
cons <- matrix(c(1,2,
2,1,
-1,1,
0,1),byrow = T,nrow=4)
eql <- c("<=", "<=", "<=","<=")
rhs <- c(6,8,1,2)
z <- c(3,2)
First, we explore changes in the first constraint, and vary the right-hand side
value from its initial value of six, to an upper value of eight. Again, these
choices are arbitrary. The dplyr code to run the sensitivity analysis is shown
below, and in this case we vary the first element of the vector rhs. Note that
the slope of the objective function is fixed at −3/2 because the objective
function itself is fixed. The solutions are shown in the tibble cons1_exp, and
the right-hand side (RHS) values of 7 and 8 yield the same optimal payoff of
13.
c1_range <- seq(6,8,length.out=3)
cons1_exp <- tibble(c1_rhs=c1_range) %>%
dplyr::rowwise() %>%
dplyr::mutate(lpSolve=list(lp("max",z,cons,eql,
c(c1_rhs,rhs[2:4]))),
c1_rhs=c1_rhs,
x1_opt=lpSolve$solution[1],
x2_opt=lpSolve$solution[2],
solution=lpSolve$objval,
slope=-3/2,
intercept=solution/2,
opt_vals=glue("({round(x1_opt,2)},
{round(x2_opt,2)})"))
cons1_exp
#> # A tibble: 3 x 8
#> # Rowwise:
#>   c1_rhs lpSolve x1_opt x2_opt solution slope intercept opt_vals
#>    <dbl> <list>   <dbl>  <dbl>    <dbl> <dbl>     <dbl> <glue>
#> 1      6 <lp>      3.33   1.33     12.7  -1.5      6.33 (3.33, 1.33)
#> 2      7 <lp>      3      2        13    -1.5      6.5  (3, 2)
#> 3      8 <lp>      3      2        13    -1.5      6.5  (3, 2)
We visualize these different solutions in Figure 13.6, which illustrates how the
optimization process works, and specifically how changing the RHS can alter
the optimal solution.
In a similar way to the last example, we can visualize these different solutions
in Figure 13.7.
To summarize, this section has shown how the optimal solution can move
with changes in (1) the objective function coefficients and (2) the RHS values
for the constraints. Software such as lpSolve also provides functionality to
conduct forms of sensitivity analysis over more complex problem sets, and
these approaches are also covered in more detail in Taha (1992) and Hillier
and Lieberman (2019).
13.8 Summary of R functions from Chapter 13
Function Description
expand.grid() Creates a data frame from combinations of vectors.
rowwise() Processes a data frame one row at a time (dplyr).
lp() R interface to the linear programming solver (lpSolve).
14
Agent-Based Simulation
14.1 Introduction
Simulation is a valuable tool for modelling systems and an important method
within the field of operations research. In this chapter we will focus on agent-
based simulation, which involves the construction of a network of agents that
can interact over time. Agent-based simulation is often deployed to model
diffusion processes, which are a common feature of many social, economic, and
biological systems. In this chapter we build an agent-based model of product
adoption, which is driven by a word-of-mouth effect when potential adopters
interact with adopters. Upon completing the chapter, you should understand:
• The main elements of a graph, namely vertices and edges, and how a graph
can be used to model a network structure.
• Four different types of network: fully connected, random, small world, and
scale-free, and how to generate these networks using the igraph package.
• The agent design for the adopter marketing problem.
• The overall flow chart structure for the agent-based simulation.
• The R code for the model, including the network generation function and
the simulation function.
Chapter structure
• Section 14.2 provides an introduction to networks, and the R package igraph,
with examples of four different network structures.
• Section 14.3 introduces the product adoption example, and the overall agent
design. The adoption process is based on a diffusion mechanism originally
used for infectious disease modelling, known as the Reed-Frost equation.
• Section 14.4 documents the simulator design, showing the overall flow chart
and the key data structures used to record the simulation data.
• Section 14.5 presents the simulation code demonstrating both a single simu-
lation run, and multiple runs.
• Section 14.6 provides a summary of all the functions introduced in the
chapter.
To simplify our discussion of graphs, we will restrict our scope to four graph
types, each with 20 nodes, and show how these can be generated using igraph.
We execute the following code to set up our examples.
library(igraph)
N = 20
The four graphs are shown in Figure 14.1, and are now discussed, starting
with the fully connected topology.
• Fully connected. As the name suggests, a fully connected graph is one where
each node is connected to all other nodes in the network. This network
is visualized in part (1) of Figure 14.1, and shows how each node has the
maximum number of connections. The igraph function make_full_graph()
will create a fully connected graph. The results below confirm that each node
in this network of 20 does indeed have 19 connections.
set.seed(100)
n1 <- make_full_graph(N)
degree(n1)
#> [1] 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
mean_distance(n1)
#> [1] 1
• Scale-free. This model generates a network structure where new nodes are
more likely to be connected to existing nodes with a higher degree (Barabási
and Albert, 1999). This form of preferential attachment results in networks
that share properties with real-world systems; for example, the number of
connections per node can follow a power-law distribution, where a small
number of highly connected hub nodes emerge. The igraph function sample_pa()
will create a scale-free graph.
14.3 Agent design - The adopter marketing problem
At each time step, a random number is drawn for each potential adopter; if
this value is less than λ, then a transition will happen. The challenge is to
find a way to calculate λ, and to do this we use an equation from the field
of infectious disease modelling known as the Reed-Frost equation. With this
equation, the more adopters that are in an agent’s immediate network, the
higher the probability that the agent will adopt.
λ_t = 1 − (1 − p)^A
where A is the number of adopters in the agent's immediate network at time t.
To illustrate how this probability works, consider the example in Figure 14.2,
where we will calculate the probability of agent 7 adopting. Assume that the
probability of being persuaded (p) is fixed at 0.10, and we know that agent 7
has two adopters in their network. Therefore, λ takes on the value 1 − (1 − 0.1)2
which evaluates to 0.19.
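As a quick check, this calculation can be reproduced directly in R, using the
values from the agent 7 example.
p <- 0.10                # probability of being persuaded
A <- 2                   # adopters in agent 7's immediate network
lambda <- 1 - (1 - p)^A
lambda
#> [1] 0.19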
14.4 Simulator design and data structures
Based on this design, we now explore the simulation code for the agent-based
simulation.
# ... (the body of generate_random_network() is not shown in this excerpt)
agent_net
}
We can now observe how a network is created, and how we can use the returned
information to: (1) show a frequency table of the number of connections across
all agents (with a minimum of zero and a maximum of eleven), and (2) list the
specific connections for agents 1 and 1000, as previously represented in
Figure 14.4.
net <- generate_random_network(1000,seed=T)
table(degree(net$graph))
#>
#> 0 1 2 3 4 5 6 7 8 9 10 11
#> 15 77 132 215 203 150 90 65 33 11 5 4
net$network[["1"]]
#> # A tibble: 4 x 2
#> FromAgent ToAgent
#> <int> <int>
#> 1 1 206
#> 2 1 230
#> 3 1 257
#> 4 1 571
net$network[["1000"]]
#> # A tibble: 5 x 2
#> FromAgent ToAgent
#> <int> <int>
#> 1 1000 88
#> 2 1000 499
#> 3 1000 738
#> 4 1000 865
#> 5 1000 955
With the network structure completed, we can now move on to exploring the
R code that performs the agent-based simulation.
The simulation function run_sim() implements the algorithm specified in
Figure 14.3. The overall steps are:
• Create the tibble agents to store each agent’s state (with N rows), and
initialize this with default values, for example, the column pa_state is set to
TRUE for all potential adopters.
• Update the tibble agents to store the number of connections for each agent,
and set each initial adopter's a_state to TRUE.
• Create the transitions table (initially empty), and then add a transition
to record all state changes for the initial adopters. To do this, the function
dplyr::add_row() is used, as this is a convenient way to add a new row to
a tibble. Every time an agent’s state changes during the simulation, that
information is appended to the transitions tibble.
• Create the tibble trace_sim that records the tibble agents for each simulation
time. Note that the function dplyr::bind_rows() is used to “grow” this tibble
during the course of the simulation.
• Enter the time loop and process each agent with a pa_state equal to TRUE.
We only focus on those agents that can “flip” to the adopter state, as that
will be sufficient for the simulation. Inside the loop, at each time step, the
Reed-Frost equation is used to determine the probability of transitioning, and
a random number is generated to determine whether or not the transition
will occur.
• In processing the states, for convenience and efficiency we use matrix-style
subsetting to index and change an agent's attributes. For example, the
following line marks agent a so that it will be “flipped” at a later stage,
once the daily processing is completed.
for(a in pa_list){
  # Find all neighbors of agent a
  neighbors <- net$network[as.character(a)][[1]][,"ToAgent",
                                                 drop=T]
  # ... (the Reed-Frost probability and random draw are evaluated here) ...
  # Mark agent a to be “flipped” once the daily processing is completed
  agents[agents$agent_id==a,"change_to_a"] <- TRUE
}
# End of day
# Flip states for those that have changed from pa to a
targets <- agents[agents$change_to_a==TRUE,"agent_id",
                  drop=TRUE]
agents[agents$change_to_a==TRUE,"pa_state"] <- FALSE
agents[agents$change_to_a==TRUE,"a_state"] <- TRUE
agents[agents$change_to_a==TRUE,"change_to_a"] <- FALSE
# Record the state changes by appending to the transitions tibble
transitions <- transitions %>%
  dplyr::add_row(sim_time=as.integer(t),
                 agent_id=as.integer(targets),
                 state_pa_to_a=TRUE)
With the simulation function defined, we can now explore how to run a single
simulation, and analyze the results.
Next, we run the simulation with a call to run_sim(), and we can observe the
structure of the output, which is stored in res. There are 81,000 records:
one for each of the 1,000 agents at each of the 81 time steps, including the
initial states.
res <- run_sim(net = net,
end_time = 80)
dplyr::glimpse(res)
#> Rows: 81,000
#> Columns: 9
#> $ run_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ sim_time <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ agent_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
This transition data can then be aggregated by day, to provide the total
number of adoptions per day.
ar <- res %>%
dplyr::group_by(sim_time) %>%
dplyr::summarize(Adoptions=sum(state_pa_to_a,
na.rm = T)) %>%
dplyr::ungroup()
ar
#> # A tibble: 81 x 2
#> sim_time Adoptions
#> <int> <int>
#> 1 0 1
#> 2 1 1
#> 3 2 1
#> 4 3 1
#> 5 4 1
#> 6 5 1
#> 7 6 2
#> 8 7 2
#> 9 8 4
#> 10 9 3
#> # ... with 71 more rows
We can then plot the adoption process, shown in Figure 14.5, to observe the
number of transitions per day. This provides a summary of the “outbreak”
which shows how the adoption has spread through the network over time.
p1 <- ggplot(ar,aes(x=sim_time,y=Adoptions))+geom_point()+geom_line()
p1
FIGURE 14.5 Number of adoptions per day for a single run
The results data can also be used to visualize how the two states change
over time. The following code sums both states for each time step, and then
converts this to tidy data format. The result can be conveniently shown using
geom_area(), and Figure 14.6 shows the diffusion graph for this simulation,
and indicates that after 80 time units, most of the population are adopters
(although not all).
states <- res %>%
dplyr::group_by(sim_time) %>%
dplyr::summarize(PA=sum(pa_state),
A=sum(a_state)) %>%
tidyr::pivot_longer(PA:A,
names_to = "State",
values_to = "Number")
14.5 Simulation code 335
p2 <- ggplot(states,aes(x=sim_time,y=Number,color=State,fill=State))+
geom_area()
p2
FIGURE 14.6 Total number of potential adopters and adopters over time
For this particular simulation run, 17 agents did not adopt, and we can explore
the network properties of those agents who have not adopted. When we take
the agent state from the final time step (sim_time equals 80), it is interesting
to see that 15 of the 17 non-adopting agents have zero connections, and the
remaining two have just one connection. This confirms the expected behavior
as potential adopter agents with zero connections cannot adopt, and agents
with just one connection are less likely to adopt.
not_adopted <- res %>%
dplyr::filter(sim_time==80,pa_state==TRUE) %>%
dplyr::pull(num_connections) %>%
table()
not_adopted
#> .
#>  0  1
#> 15  2
Next, we run the set of simulations; in this example, 150 are run. The
simulations cycle through the trio of seed agents specified in inits, and the
function furrr::future_map2() is used, which provides the exact same behavior
as purrr::map2() except that it enables you to map in parallel across all
available cores. Note that the argument .options = furrr_options(seed=T)
needs to be added to ensure sound random number generation across the
parallel workers.
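Before running furrr functions, a resolution strategy for futures is
specified via the function plan() from the future package. A minimal sketch
is shown below; the multisession strategy is an assumption, chosen here for
illustration.
library(future)
plan(multisession)  # resolve futures in parallel background R sessions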
NSim <- 150
sims <- furrr::future_map2(1:NSim,
rep(inits,NSim/3),
~run_sim(run_id = .x,
net = net,
adopters = .y,
end_time = 80),
.options = furrr_options(seed = T)) %>%
dplyr::bind_rows()
sims
#> # A tibble: 12,150,000 x 9
#> run_id sim_t~1 agent~2 num_c~3 pa_st~4 a_state chang~5 chang~6
#> <int> <int> <int> <int> <lgl> <lgl> <lgl> <int>
#> 1 1 0 1 4 TRUE FALSE FALSE NA
#> 2 1 0 2 4 TRUE FALSE FALSE NA
#> 3 1 0 3 5 TRUE FALSE FALSE NA
#> 4 1 0 4 4 TRUE FALSE FALSE NA
#> 5 1 0 5 7 TRUE FALSE FALSE NA
#> 6 1 0 6 2 TRUE FALSE FALSE NA
#> 7 1 0 7 3 TRUE FALSE FALSE NA
#> 8 1 0 8 4 TRUE FALSE FALSE NA
#> 9 1 0 9 0 TRUE FALSE FALSE NA
#> 10 1 0 10 4 TRUE FALSE FALSE NA
#> # ... with 12,149,990 more rows, 1 more variable:
#> # state_pa_to_a <lgl>, and abbreviated variable names
#> # 1: sim_time, 2: agent_id, 3: num_connections, 4: pa_state,
#> # 5: change_to_a, 6: change_state_time
The returned list of tibbles is then collapsed into one large data frame using the
function dplyr::bind_rows(). We can also observe that there are 12,150,000
rows in the tibble, which is as expected (1000 Agents × 81 time steps × 150
simulations). This tibble is then processed to generate results.
First, we can generate summary data for the adoptions across the three different
sets of initial conditions, and for convenience, these are also labelled to clarify
the outputs. Patterns are evident when exploring the plots, shown in Figure
14.7. Specifically, when the highest connected agent is seeded, the adoption
process takes off faster and produces an earlier peak, whereas when the lowest
connected agent is seeded, the result is a later peak and much more variability
in the adoption trajectory.
ar <- sims %>%
dplyr::group_by(run_id,sim_time) %>%
dplyr::summarize(Adoptions=sum(state_pa_to_a,
na.rm = T)) %>%
  dplyr::mutate(run_desc=case_when(
    run_id %% 3 == 1 ~ "Lowest Connections",
    run_id %% 3 == 2 ~ "Highest Connections",
    # The third branch is not shown in the source; this label is a
    # placeholder for the remaining seeding condition in inits.
    run_id %% 3 == 0 ~ "Other Seeding"))
p1 <- ggplot(ar,aes(x=sim_time,y=Adoptions,
group=run_id,color=run_desc))+
geom_point()+geom_line()+facet_wrap(~run_desc,nrow = 3)
p1
The adoption data can also be summarized by run_desc and sim_time, computing
the mean together with the 5% and 95% quantiles of adoptions; the result
(computation not shown) is stored in the tibble quants.
quants
#> # A tibble: 243 x 5
#> run_desc sim_time Q05 Q95 Mean
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 Highest Connections 0 1 1 1
#> 2 Highest Connections 1 0 2 1
The quants tibble can be visualized, and the 90% interval conveniently high-
lighted using the function geom_ribbon(). The visualization is shown in Figure
14.8.
p4 <- ggplot(quants,aes(x=sim_time,y=Mean,
color=run_desc,group=run_desc))+
geom_ribbon(aes(ymin=Q05,
ymax=Q95,
fill=run_desc,
group=run_desc),
alpha=0.2)+
geom_line(size=2)+
theme(legend.position = "top")
p4
Function Description
future_map2() A parallel implementation of purrr::map2() (furrr).
plan() Used to specify how futures are resolved (future).
geom_ribbon() Display y interval set by ymin and ymax (ggplot2).
add_row() A function to add rows of data to a tibble (tibble).
bind_rows() Binds many data frames into one (dplyr).
make_full_graph() Creates a fully connected graph (igraph).
sample_gnm() Creates a random graph (igraph).
sample_smallworld() Creates a small-world graph (igraph).
sample_pa() Creates a scale-free graph (igraph).
get.edgelist() Returns a matrix of edges (igraph).
degree() Returns the number of vertex edges (igraph).
15
System Dynamics
15.1 Introduction
In Chapter 14 we presented agent-based simulation, which focused on modelling
the decisions of individuals interacting within a social network structure. We
now present a complementary simulation method known as system dynamics,
which is used to build models that can support policy analysis and decision
making. It has been successfully deployed across a range of application areas,
including health care, climate change, project management, and manufacturing.
The system dynamics approach is grounded in calculus (i.e., a model as a system
of ordinary differential equations) and takes the perspective of modelling a
system by focusing on its stocks, flows, and feedback. This chapter provides an
overview of system dynamics, and demonstrates how system dynamics models
can be implemented in R, using the deSolve package. For a comprehensive
perspective on the method, many valuable textbooks can be consulted, for
example (Sterman, 2000; Morecroft, 2015; Warren, 2008).
Upon completing the chapter, you should have an appreciation for:
• The main components of a system dynamics model: stocks, flows, and feed-
back.
• The classic limits to growth model, its stock and flow structure, and repre-
sentation as a system of ordinary differential equations (ODEs).
• How to configure and run a system dynamics model using the deSolve
package, and in particular, the function ode().
• The Susceptible-Infected-Recovered (SIR) stock and flow model, and its
underlying assumptions, including the transmission equation.
• An extension to the SIR model, the susceptible-infected-recovered-hospital
(SIRH) model, which considers the downstream effects of a novel pathogen
on a hospital system.
• The formulation of two policy countermeasures, and the use of sensitivity
analysis to explore their interaction.
• Related R functions that support building system dynamics models in R.
Chapter structure
• Section 15.2 provides an overview of system dynamics as a simulation method,
and how this method is based on solving a system of ordinary differential
equations (ODEs).
• Section 15.3 introduces the R package deSolve, which provides a set of
general solvers for ODEs.
• Section 15.4 presents the Susceptible-Infected-Recovered (SIR) model of
infectious disease transmission, and shows how this can be implemented
using the deSolve function ode().
• Section 15.5 extends the SIR model to include a hospital stream, so that
downstream effects such as hospital admissions can be modelled. It also adds
two countermeasures to the model: one medical (vaccination), and the other
non-pharmaceutical (mobility reduction).
• Section 15.6 shows how sensitivity analysis can be performed with two
parameters: one that influences population mobility, and the other that
impacts the speed of vaccination. Using the tools of the tidyverse the results
are presented that show the overall impact of countermeasures, based on
measuring the maximum peak of patients in hospital.
• Section 15.7 provides a summary of all the functions introduced in the
chapter.
Stocks represent accumulations within a system, and the following five
examples illustrate the idea:
• The balance of a bank account, which accumulates through interest payments
and deposits, and reduces with withdrawals.
• The number of employees in a firm, which accumulates when new people are
hired, and reduces when people leave.
• The amount of inventory in a warehouse, which accumulates when new
inventory arrives, and drains when items are sent to customers. We can
measure inventory using the term stock keeping unit (SKU).
• The number of people infectious with a virus, which accumulates when people
get infected and is reduced when people recover and are no longer infectious.
• The amount of water in a reservoir, which accumulates with rainfall, and
drains with evaporation, or when the valves are opened to release water.
An observation from these five examples is that they relate to many different
domains, for example, business, health, and the environment. In other words,
stocks exist in many different fields of study, and therefore we can model
different systems using the system dynamics method. The example of a reservoir
provides a metaphor for stock and flow systems, often expressed with the related
idea of a bathtub containing water (i.e., the stock), and that the level of water
can rise through water flowing in (via the tap), and fall when water leaves
(via the drain). Another point worth noting is that when building models in
system dynamics, we also record the units of a variable, for example, the units
for the stock employees are people. These units are important, as they must
balance with the units of a flow, which we now define.
The bathtub metaphor also encourages us to think about how stocks change,
and for a bathtub, the water level changes through inflows and outflows. For
example, if the drain remains fully sealed and there is no flow from the taps,
and there is no other way for the water to dissipate, then the water level
will remain unchanged. If, however, the drain is open, and no water flows
into the bathtub, then the water level will fall. The main idea here is that
the level of water can only change through either the tap or the drain (we
are assuming that there is no evaporation or people removing water through
another container).
The tap and the drain have their equivalents in a model: they are called flows.
A flow is defined as the movement of quantities between stocks (or across a
model boundary), and therefore flows represent activity (Ford, 2019). For each
of the five examples we just introduced, the flows that change the stock can
be identified, and are shown below. Note that the units of change for a flow
are the units of the relevant stock per time period.
The results are shown for the first 3 years, and the value for dt is 0.5. The
initial account balance is 100, and the interest rate is 0.1. We can work through
the solution from left to right and explore the first two rows:
• At time = 0.0, the stock value is 100.00. This is often termed the initial
condition of the stock, and for integration to work, initial conditions for all
the stocks must always be specified.
• The net flow for the duration [0, 0.5] is then calculated, and this is simply
the right-hand side of the net flow equation rB, and this evaluates to 10.0.
• The next calculation is for the stock at time 0.5. Here we use Euler's
equation, so that S_{0.5} = S_0 + 10 × 0.5, which evaluates to 105. If we
imagine this on a graph, we would see that we are adding the area of a
rectangle to the stock, where the height is 10.0 and the width is 0.5.
Therefore, when we simulate using system dynamics, we are invoking a nu-
merical integration algorithm to solve the system of stock and flow equations.
Other integration approaches can be used, for example, Runge-Kutta methods,
which are also available in the R package deSolve.
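To make this mechanism concrete, the following is a minimal sketch of Euler
integration for the account balance model dB/dt = rB, using the values from
the example above (an initial balance of 100, r = 0.1, and dt = 0.5). The
function name euler_sim is ours, purely for illustration.
euler_sim <- function(B0=100, r=0.1, dt=0.5, end_time=3){
  times <- seq(0, end_time, by=dt)
  B <- numeric(length(times))
  B[1] <- B0                        # initial condition for the stock
  for(i in seq_along(times)[-1]){
    net_flow <- r * B[i-1]          # right-hand side of dB/dt = rB
    B[i] <- B[i-1] + net_flow * dt  # add the “rectangle” to the stock
  }
  data.frame(time=times, B=B)
}
euler_sim()
#>   time        B
#> 1  0.0 100.0000
#> 2  0.5 105.0000
#> 3  1.0 110.2500
#> 4  1.5 115.7625
#> 5  2.0 121.5506
#> 6  2.5 127.6282
#> 7  3.0 134.0096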
15.2.3 Feedback
Before considering our first model, and its simulation using deSolve, we briefly
summarize an additional concept in system dynamics, which is feedback (Ster-
man, 2000; Richardson, 1991). Feedback occurs in a model when the effect of
a causal impact comes back to influence the original cause of that effect (Ford,
2019). For example, the net flow equation dB/dt = rB contains feedback,
because the net flow depends on the stock, and the stock, in turn, is calculated
from the net flow. We can describe these relationships as follows:
• An increase in the stock B leads to an increase in the flow dB/dt. This is an
example of positive polarity, where the variables move in the same direction.
These relationships can be shown on a stock and flow diagram. (If the cause
and effect relationship was negative, for instance if the variables moved in
opposite directions, then it would be termed negative polarity.)
• An increase in the flow dB/dt causes an increase in the stock B. That is
because a stock can only change through its flows (a rule of calculus).
These two relationships give rise to a feedback loop; in this case, it is a
positive feedback loop, because the stock B is amplified following an iteration
through the loop. Positive feedback loops, if left uncorrected, give rise to
exponential
growth behavior. In system dynamics, there is an additional type of loop known
as a negative feedback (or balancing) loop, where a variable would move in the
opposite direction following an iteration through a loop.
Therefore, the polarity of a loop can be either positive or negative, and it can
be found by taking the algebraic product of all signs around a loop (Ford,
2019). An advantage of stock and flow diagrams is that feedback loops can be
visualized, and the following limits to growth example shows both a positive
and negative feedback loop in the stock and flow system.
• The crowding effect C (2) is calculated by dividing the current population P
by the population limit K (4), where K is often referred to as the carrying
capacity.
• There are a number of constants in the model, and these are arbitrary choices.
They include the growth rate r (3), the carrying capacity K (4), and the
initial population value P_INIT (5).
• Overall, there are two feedback loops in the model. The first loop can be
traced from P to dP/dt and back to P, and this is a reinforcing loop that
drives growth in the system. The second loop also starts at P, then moves to
C and on to dP/dt, before connecting back to P. This loop, because of the
single negative polarity sign, is a balancing loop, and therefore counteracts
the system growth. This makes sense, as the carrying capacity K ultimately
limits the growth in P. The two core equations are reproduced after this list.
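For reference, the two model equations are reconstructed here from the ltg()
implementation shown later in this section (the numbers match the equation
annotations in the code):
dP/dt = rP(1 − C)   (1)
C = P/K             (2)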
We will now take this stock and flow model and implement it using deSolve.
15.3 deSolve
R’s deSolve package solves initial value problems written as ordinary differ-
ential equations (ODEs), differential algebraic equations (DAEs), and partial
differential equations (PDEs) (Soetaert et al., 2010). We will make use of the
function ode(), which solves a system of ordinary differential equations (i.e., a
system of stock and flow equations). It takes the following arguments.
ode(y, times, func, parms,
method = c("lsoda", "lsode", "lsodes", "lsodar", "vode", "daspk",
"euler", "rk4", "ode23", "ode45", "radau",
"bdf", "bdf_d", "adams", "impAdams",
"impAdams_d", "iteration"),...)
The full documentation for these arguments can be retrieved using the
command ?deSolve::ode.
We will now implement the limits to growth model using deSolve. First, we
include the necessary packages, including dplyr, tidyr, and ggplot2, to ensure
that we can store the results in a tibble and then display the graphs over time
with ggplot2. The library purrr is used to run sensitivity analysis, and ggpubr
allows for multiple plots to be combined into a single plot.
library(deSolve)
library(dplyr)
library(ggplot2)
library(tidyr)
library(purrr)
library(ggpubr)
A second solution is to combine the two vectors into one, convert the results
to a list, and then use the list operator to add the elements.
# v1 and v2 carry over from the first approach (not shown in this excerpt);
# for example, v1 <- c(a=10, b=20) and v2 <- c(c=30) would yield 60.
l1 <- as.list(c(v1,v2))
ans2 <- l1$a + l1$b + l1$c
ans2
#> [1] 60
A third approach is to pass the list into the with() function, and then use the
facility within this function to access the elements directly by their names.
The return statement is used to pass back the result of this calculation, which
is stored in the variable ans3.
ans3 <- with(as.list(c(v1,v2)),{
return (a+b+c)
})
ans3
#> [1] 60
This use of with() can now be seen in our implementation of the limits to
growth model. The function, named ltg, has the list of arguments that ode
will expect (time, stocks, and auxs in this case, where we have renamed the
third parameter to a term used more widely in system dynamics). The named
arguments from stocks and auxs are combined into a list, and passed to the
with() function, and the model equations for C and dP_dt are specified. The
derivative is returned as the first list element, and other variables which we
want to record in the final output are added as additional list elements.
ltg <- function(time, stocks, auxs){
with(as.list(c(stocks, auxs)),{
C <- P/K # Eq (2)
dP_dt <- r*P*(1-C) # Eq (1)
return (list(c(dP_dt),
r=r,
K=K,
C=C,
Flow=dP_dt))
})
}
Our next step is to configure the remaining vectors for the simulation. We
define the simtime vector to indicate the simulation’s start and finish time,
and the intervening steps we want displayed (in this case, we want results
for each step of 0.25). The vector stocks contains the list of stocks and their
initial values. Note that the stock name is P, which must be the same as the
variable we use within the function ltg. We specify the vector auxs, and this
contains the auxiliaries (or parameters) of the model. Again, the name choice
is important, as the variables r and K are referenced within the function ltg.
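The three vectors can be defined as follows; this configuration is a
reconstruction, consistent with the simulation output shown below (401 output
rows at steps of 0.25, with P starting at 100, r = 0.15, and K = 100,000).
simtime <- seq(0, 100, by=0.25)   # start, finish, and output interval
stocks  <- c(P=100)               # initial value of the stock P
auxs    <- c(r=0.15, K=100000)    # growth rate and carrying capacity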
We are now ready to run the simulation, by calling the function ode(). The
function takes five arguments:
• stocks, the stocks in the model, along with their initial values.
• simtime, the time sequence for the output.
• ltg, the function that contains the model to be simulated, and it calculates
the derivatives.
• auxs, the model auxiliaries.
• "euler", which selects the numerical integration method for the problem.
res <- ode(y=stocks,
times=simtime,
func = ltg,
parms=auxs,
method="euler") %>%
data.frame() %>%
dplyr::as_tibble()
res
#> # A tibble: 401 x 6
#> time P r K C Flow
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 100 0.15 100000 0.001 15.0
#> 2 0.25 104. 0.15 100000 0.00104 15.5
#> 3 0.5 108. 0.15 100000 0.00108 16.1
#> 4 0.75 112. 0.15 100000 0.00112 16.7
#> 5 1 116. 0.15 100000 0.00116 17.4
#> 6 1.25 120. 0.15 100000 0.00120 18.0
#> 7 1.5 125. 0.15 100000 0.00125 18.7
#> 8 1.75 129. 0.15 100000 0.00129 19.4
#> 9 2 134. 0.15 100000 0.00134 20.1
#> 10 2.25 139. 0.15 100000 0.00139 20.9
#> # ... with 391 more rows
As ode() returns a matrix, we transform this to a data frame, and then to a
tibble, and the first ten results are displayed. Note that the variable P is
increasing, as expected, and the crowding variable C, while initially small, is
also starting to increase. This will eventually reach 1, and at that point the
derivative will evaluate to zero, so all growth will cease.
Both the stock and the flow are plotted next, and they display the impact of
the crowding variable C on the flow, and hence the stock. These are visualized
in Figure 15.2, once the results have been pivoted into long format.
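A sketch of the pivoting step that produces res_long is shown here; the choice
of the columns P and Flow is an assumption, based on the plot aesthetics below.
res_long <- res %>%
  dplyr::select(time, P, Flow) %>%
  tidyr::pivot_longer(-time,
                      names_to = "Variable",
                      values_to = "Value")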
ggplot(res_long,aes(x=time,y=Value,color=Variable)) +
geom_line() + facet_wrap(~Variable,scales = "free")+
theme(legend.position = "top")
Following this first deSolve model, we will now explore two models of infectious
disease transmission, the SIR and the SIRH models.
15.4 Susceptible-infected-recovered model
The model divides the population into three stocks (often called compartments
in disease modelling), where each stock captures a different disease state. These
stocks are:
• Susceptible (6), where people residing in this stock have no prior immunity
to a pathogen and so may become infected. This stock has one outflow (9).
• Infected (7), where people have been infected and can also go on to infect
others. People do not reside in the infected stock indefinitely, and after a
duration, will exit and move to the recovered stock. This stock has one inflow
(9) and one outflow (10).
• Recovered (8), where people cannot infect others, nor can they be infected
themselves. This stock has one inflow (10).
The infection rate IR (9) is a key driver of model behavior, and it has an
intuitive structure, and comprises a number of terms.
• First, the effective contact rate β (13), which represents a contact that
is sufficient to lead to transmission if it occurs between an infectious and
susceptible person (Vynnycky and White, 2010). For example, if an infectious
person met 10 susceptible people in one day, and passed on a virus with a
10% chance, then the β value would be 1.0.
• Second, the number of effective contacts that infected people generate, which
is βI, for example, if there were 10 infected people, then there would be 10
effective contacts when β = 1.0.
• Third, we calculate the chance that these contacts will be made with a
susceptible person, which is S/N .
The recovery rate RR (10) ensures that the infected stock depletes, as in reality,
people who become infected are typically only infectious for a certain duration.
A common way to model a delay duration in a stock and flow model is to
invert the delay to form a fraction, and decrease the stock by this fraction for
each time step of the simulation. Therefore, γ = 0.25 (12) models an average
infectious delay of four days.
We now implement the SIR model equations in R, using deSolve. The function
containing the model equations is shown below. The aim of the function is to
calculate the flow variables (derivatives) for the three stocks.
sir <- function(time, stocks, auxs){
with(as.list(c(stocks, auxs)),{
N <- 10000 # Eq (11)
IR <- beta*I*S/N # Eq (9)
RR <- gamma*I # Eq (10)
dS_dt <- -IR # Eq (6)
dI_dt <- IR - RR # Eq (7)
dR_dt <- RR # Eq (8)
return (list(c(dS_dt,dI_dt,dR_dt),
Beta=beta,
Gamma=gamma,
Infections=IR,
Recovering=RR))
})
}
Before running the model, the time, stocks, and auxiliaries vectors are created.
simtime <- seq(0,50,by=0.25)
stocks <- c(S=9999,I=1,R=0) # Eq (14), Eq (15), and Eq (16)
auxs <- c(gamma=0.25,beta=1) # Eq (12) and Eq (13)
With these three vectors in place, the simulation can be run by calling the
deSolve function ode. This simulation runs based on the specified auxiliaries
and initial stock values, and varying any of these would yield a different result.
The call to ode() returns a matrix, which is then converted to a tibble. The
results are shown, where each row is the simulation output for a time step,
and each column is a model variable.
res <- ode(y=stocks,
times=simtime,
func = sir,
parms=auxs,
method="euler") %>%
data.frame() %>%
dplyr::as_tibble()
res
#> # A tibble: 201 x 8
#> time S I R Beta Gamma Infections Recovering
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 9999 1 0 1 0.25 1.00 0.25
#> 2 0.25 9999. 1.19 0.0625 1 0.25 1.19 0.297
#> 3 0.5 9998. 1.41 0.137 1 0.25 1.41 0.353
#> 4 0.75 9998. 1.67 0.225 1 0.25 1.67 0.419
#> 5 1 9998. 1.99 0.329 1 0.25 1.99 0.497
#> 6 1.25 9997. 2.36 0.454 1 0.25 2.36 0.590
#> 7 1.5 9997. 2.80 0.601 1 0.25 2.80 0.701
#> 8 1.75 9996. 3.33 0.777 1 0.25 3.33 0.832
#> 9 2 9995. 3.95 0.985 1 0.25 3.95 0.988
#> 10 2.25 9994. 4.69 1.23 1 0.25 4.69 1.17
#> # ... with 191 more rows
Two separate plots are created, p1 and p2, and are displayed in Figure 15.4.
These are then combined using the package ggpubr, which contains the function
ggarrange() that allows a number of plots to be combined. The plots draw on
pivoted versions of the results, sketched below.
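The tibbles flows_piv and stocks_piv can be reconstructed as follows; the
column names Flow and Stock are taken from the aesthetics in the plotting code.
flows_piv <- res %>%
  dplyr::select(time, Infections, Recovering) %>%
  tidyr::pivot_longer(-time, names_to = "Flow", values_to = "Value")

stocks_piv <- res %>%
  dplyr::select(time, S, I, R) %>%
  tidyr::pivot_longer(-time, names_to = "Stock", values_to = "Value")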
p1 <- ggplot(flows_piv,aes(x=time,y=Value,color=Flow))+geom_line()+
theme(legend.position = "top")+
labs(x="Day",y="Flows")
p2 <- ggplot(stocks_piv,aes(x=time,y=Value,color=Stock))+geom_line()+
theme(legend.position = "top")+
labs(x="Day",y="Stocks")
g1 <- ggarrange(p1,p2,nrow = 2)
g1
The first plot shows the model flows; for example, when the inflow exceeds
the outflow, we can see that the infected stock (second plot) rises. At all
times the three stocks sum to 10,000, as all flows simply move people between
the three stocks.
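This conservation property can be confirmed directly from the results; a quick
sanity check is shown below (both summary values equal 10,000, since
dS/dt + dI/dt + dR/dt = 0 at every time step).
res %>%
  dplyr::mutate(Total = S + I + R) %>%
  dplyr::summarize(MinTotal=min(Total),
                   MaxTotal=max(Total))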
15.5 Susceptible-infected-recovered-hospital model
For this model, an additional goal is to run many scenarios and explore the
results. This will facilitate policy exploration, and allow us to view the combined
impact of medical and non-pharmaceutical countermeasures.
# The opening lines of run_scenario() are not shown in the source; the
# signature is reconstructed below, and the earlier stock defaults (S, I,
# R, and H) are assumptions, based on the SIR configuration.
run_scenario <- function(stocks=c(S=9999, I=1, R=0, H=0, # assumed defaults
                                  M=1                    # Eq (38)
                                  ),
                         simtime=seq(0,50,by=0.25),
                         beta=1,      # Eq (27)
                         gamma=0.25,  # Eq (28)
                         alpha=0.5,   # Eq (23)
                         M_min=0.3,   # Eq (39)
                         d=0.1,       # Eq (32)
                         hf=0.1,      # Eq (30)
                         v=0.1        # Eq (29)
                         ){
  auxs <- c(beta=beta,
            gamma=gamma,
            alpha=alpha,
            M_min=M_min,
            d=d,
            hf=hf,
            v=v)
  # ... (the remainder of the function, which calls ode() with the SIRH
  #      model and returns the results, is not shown in this excerpt)
}
The advantage of creating the function run_scenario() is that we can now call
this for a range of parameter values. Here, we are going to sample two policy
variables:
• α, which models the speed of mobility restriction implementations. For
example, a higher value of α would mean that the population responds
quickly to the request for social mobility reductions. In our simulations
0 ≤ α ≤ 0.20.
• v, which models the speed of vaccination. A higher value of v means that
people transfer more quickly from S to R, and therefore the burden on the
hospital sector should be reduced. In our simulations 0 ≤ v ≤ 0.05, which
indicates that the minimum vaccination duration is 20 days (1/0.05).
When we reflect on the role of these two parameters, we would expect that a
combination of the highest values for both α and v should result in the best
outcome. Alternatively, the lowest values for α and v should generate the worst
outcome, as there will be no vaccination (v = 0), and no mobility restrictions
(α = 0).
We will sample the parameter space evenly with 50 equally spaced samples for
each parameter, using the length.out argument with the seq() function. The
function expand.grid() generates the range of combinations of these parameter
values. This will result in 2,500 simulations.
alpha_vals <- seq(0,.20,length.out=50)
vacc_vals <- seq(0,0.05,length.out=50)
sim_inputs <- expand.grid(alpha_vals,vacc_vals)
summary(sim_inputs)
#> Var1 Var2
#> Min. :0.000 Min. :0.0000
#> 1st Qu.:0.049 1st Qu.:0.0122
#> Median :0.100 Median :0.0250
#> Mean :0.100 Mean :0.0250
#> 3rd Qu.:0.151 3rd Qu.:0.0378
#> Max. :0.200 Max. :0.0500
The sensitivity analysis loop can then be formulated using the map2() function
from purrr, where the two inputs are the values for α and v. The map2()
function returns a list of simulation results, where each list element contains
the full simulation results for an individual run. The function bind_rows() then
combines all of the list elements into one large data frame, and the results
are stored in the tibble sim_res. Note that in contrast to the agent-based
simulations in Chapter 14, we chose not to use the parallel version of map2(),
because the overall runtime associated with the SIRH model is not as high as
the agent-based model.
run_id <- 1
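Building on the run_id counter above, a minimal sketch of the loop is shown
below, assuming that run_scenario() accepts alpha and v arguments and returns
a results tibble; the column names RunID, Alpha, and V follow the outputs
shown later.
sim_res <- purrr::map2(sim_inputs$Var1, sim_inputs$Var2,
                       function(al, vac){
                         res <- run_scenario(alpha=al, v=vac) %>%
                           dplyr::mutate(RunID=as.integer(run_id),
                                         Alpha=al,
                                         V=vac)
                         run_id <<- run_id + 1  # advance the run counter
                         res
                       }) %>%
  dplyr::bind_rows()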
In addition to the full set of results, the 95th and 5th quantiles can be
calculated when the results are grouped by the variable time.
time_h <- sim_res %>%
dplyr::group_by(time) %>%
dplyr::summarize(MeanH=mean(H),
Q95=quantile(H,0.95),
Q05=quantile(H,0.05))
time_h
#> # A tibble: 301 x 4
#> time MeanH Q95 Q05
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0 0 0
#> 2 0.25 0.00625 0.00625 0.00625
#> 3 0.5 0.0135 0.0135 0.0135
#> 4 0.75 0.0219 0.0220 0.0219
#> 5 1 0.0317 0.0319 0.0316
#> 6 1.25 0.0430 0.0434 0.0426
#> 7 1.5 0.0560 0.0569 0.0551
#> 8 1.75 0.0709 0.0726 0.0692
#> 9 2 0.0880 0.0911 0.0851
#> 10 2.25 0.107 0.113 0.103
#> # ... with 291 more rows
Three plots, shown in Figure 15.6, are generated and displayed together using
the function ggarrange(). The plots include p3, which shows the infected stock
for all runs, p4 that shows the hospital stock for all runs, and p5 which displays
the mean and quantiles for the hospital stock. Interestingly, from the plot, the
lag between infection and hospital peaks can be observed.
p3 <- ggplot(sim_res,aes(x=time,y=Infections,color=RunID,group=RunID))+
geom_line()+
scale_color_gradientn(colors=rainbow(14))+
theme(legend.position = "none")+
labs(title="Infections (flow)")+
theme(title = element_text(size=9))
p4 <- ggplot(sim_res,aes(x=time,y=H,color=RunID,group=RunID))+
geom_line()+
scale_color_gradientn(colors=rainbow(14))+
theme(legend.position = "none")+
labs(title="People in hospital (stock)")+
theme(title = element_text(size=9))
p5 <- ggplot(time_h,aes(x=time,y=MeanH))+geom_line()+
geom_ribbon(aes(x=time,ymin=Q05,ymax=Q95),
alpha=0.4,fill="steelblue2")+
labs(title="90% quantiles for people in hospital")+
theme(title = element_text(size=9))
g2 <- ggarrange(p3,p4,p5,nrow = 3)
g2
Another type of plot can be generated, which focuses on a scatterplot for the
parameters α and v and uses information relating to the maximum number of
people in the hospital for each simulation run. The dplyr code to generate
this data is shown below, where we can extract the two parameter values,
alongside the maximum size of the hospital stock. This hospital stock value
can be used as an indicator for the overall pressure on the health system.
max_h <- sim_res %>%
dplyr::group_by(RunID) %>%
dplyr::summarize(MH=max(H),
V=first(V),
Alpha=first(Alpha))
When we explore the values in ascending order of MH, we can see that the
maximum number in the hospital system is lowest when the parameter values
are at their maximum.
max_h %>% dplyr::arrange(MH) %>% dplyr::slice(1:3)
#> # A tibble: 3 x 4
#> RunID MH V Alpha
#> <int> <dbl> <dbl> <dbl>
#> 1 2500 1.68 0.05 0.2
#> 2 2450 1.72 0.0490 0.2
#> 3 2499 1.73 0.05 0.196
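The plot p6 referenced in the ggarrange() call below is described earlier as a
scatterplot over the two parameters; a minimal reconstruction, shading each
point by the maximum hospital occupancy MH, is:
p6 <- ggplot(max_h,aes(x=Alpha,y=V,color=MH))+
  geom_point()+
  labs(title="Scatterplot of maximum hospital occupancy")+
  theme(title = element_text(size=9))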
p7 <- ggplot(max_h,aes(x=Alpha,y=V,z=MH))+geom_contour_filled()+
theme(legend.position = "none")+
labs(title=paste0("Contour plot"))+
labs(subtitle=paste0("Yellow band range (450,500]"))+
theme(title = element_text(size=9))
g3 <- ggarrange(p6,p7,nrow = 1)
g3
FIGURE 15.7 Policy exploration of the SIRH model using sensitivity analysis
That concludes the overall policy exploration example. While the model is
simple, with a small number of stocks and just two policies, it provides an
insight into the type of analysis that can be performed using system dynamics
models and the deSolve package, in tandem with the tools of the tidyverse,
which include purrr, ggplot2, dplyr, and tidyr.
Function Description
ode() Solves a system of ODEs (deSolve).
with() Evaluates an expression in an environment constructed from data.